# Deep learning papers/notes 01: Performers, Lookahead, RAdam

This post is from a series of quick notes written primarily for personal use while reading random ML/SWE/CS papers. As such they might be incomprehensible and/or flat out wrong.

### Performers: Faster approximation of Transformers

- Rethinking Attention with Performers (Paper Explained)
- Problem with attention: `L`: number of tokens, `d`: dimensionality of `Q`, `K`, `V` -> `L^2` attention matrix
- Attention: `softmax(Queries * Keys^T) * Values` => `softmax(L,d * d,L) * L,d` => `softmax(L, L) * L,d`

- Solution -> factorize: `softmax(Q * K^T)` => `Q' * K'^T`, then compute `Q' * (K'^T * V)`
- -> more efficient computation: `L,d * (d,L * L,d)` -> linear in `L` (see sketch below)
- Factorization through the positive orthogonal random features approach
- Potentially not only softmax, can be more generic
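A minimal numpy sketch of the idea (not the paper's exact FAVOR+ mechanism: the random features here are plain i.i.d. Gaussians rather than orthogonalized ones, and all shapes are illustrative), contrasting the quadratic `softmax(QK^T)V` path with the factorized `Q'(K'^T V)` path:

```python
import numpy as np

L, d, m = 1024, 64, 256                    # tokens, head dim, number of random features
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))

# Quadratic path: materializes the full L x L attention matrix.
A = np.exp(Q @ K.T / np.sqrt(d))
A /= A.sum(axis=1, keepdims=True)
out_quadratic = A @ V                      # (L, d)

# Factorized path: softmax(Q K^T) ~= Q' K'^T with Q' = phi(Q), K' = phi(K),
# so the output is Q' @ (K'^T @ V) and the L x L matrix is never formed.
W = rng.standard_normal((d, m))            # random projections (not orthogonalized here)

def phi(X):                                # positive random feature map (simplified)
    return np.exp(X @ W - (X ** 2).sum(-1, keepdims=True) / 2) / np.sqrt(m)

Qp = phi(Q / d ** 0.25)                    # scaled so phi(q) . phi(k) ~ exp(q . k / sqrt(d))
Kp = phi(K / d ** 0.25)
num = Qp @ (Kp.T @ V)                      # (L, m) @ ((m, L) @ (L, d)) -> linear in L
den = Qp @ Kp.sum(axis=0)                  # row-wise softmax normalizer
out_linear = num / den[:, None]            # approximates out_quadratic
```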

### Lookahead: Smart wrapper over any optimizer

- Lookahead Optimizer: k steps forward, 1 step back | Michael Zhang
- Wraps around any optimizer `O`
- Creates weights checkpoint `c_i`, makes `n` steps with `O`
- Interpolates between current state and the `c_i` saved `n` steps of `O` ago -> `c_i+1`
- Intuitively: tries `n` steps with an arbitrary optimizer, then moves toward where those steps ended up (see sketch below)
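A minimal sketch of that outer loop (names like `inner_step` are illustrative, not from the paper's code): checkpoint the slow weights, let the wrapped optimizer take `n` fast steps, then interpolate the checkpoint toward the fast weights:

```python
import numpy as np

def lookahead(grad, w0, inner_step, n=5, alpha=0.5, outer_steps=200):
    slow = w0.copy()                           # c_i: slow-weight checkpoint
    for _ in range(outer_steps):
        fast = slow.copy()
        for _ in range(n):                     # n steps with the wrapped optimizer O
            fast = inner_step(fast, grad(fast))
        slow = slow + alpha * (fast - slow)    # c_{i+1}: interpolate toward the fast weights
    return slow

# Toy usage: wrap plain SGD on f(w) = ||w||^2 / 2, whose gradient is w.
sgd = lambda w, g, lr=0.1: w - lr * g
w_final = lookahead(grad=lambda w: w, w0=np.ones(3), inner_step=sgd)
```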

### Rectified ADAM:

- https://arxiv.org/pdf/1908.03265.pdf
- ADAM's adaptive learning rate has large variance early in training (hence the usual warmup heuristic)
- Solution: a low initial learning rate and negligible momentum for the first few batches (see sketch below)
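A tiny sketch of that warmup heuristic (a hand-rolled schedule, not the paper's rectification term; the step counts are illustrative): ramp the learning rate up linearly over the first few hundred batches so the early, high-variance updates stay small:

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=500):
    """Linearly ramp the learning rate from ~0 to base_lr over warmup_steps."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)

lrs = [warmup_lr(s) for s in range(1000)]      # ramps up to 1e-3, then stays flat
```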

### Image transformer: Attend to image patches

- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Paper Explained)
- Transformer on images -> attention matrix too big (`n^2` for `n` pixels)
- -> cut the image into 16x16 patches -> attend to patches (a way smaller number)
- Use a linear embedding for individual patches - it performs no worse than embedding with a CNN (see sketch below)
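A minimal numpy sketch of the patching plus linear embedding step (shapes and the embedding dimension are illustrative): a 224x224x3 image becomes 14*14 = 196 patch tokens, so attention runs over 196 positions instead of ~50k pixels:

```python
import numpy as np

def patchify(img, p=16):
    """Cut an (H, W, C) image into non-overlapping p x p patches, flattened per patch."""
    H, W, C = img.shape
    img = img.reshape(H // p, p, W // p, p, C)
    img = img.transpose(0, 2, 1, 3, 4)          # (H/p, W/p, p, p, C)
    return img.reshape(-1, p * p * C)           # (num_patches, p*p*C)

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
patches = patchify(img)                         # (196, 768)
E = rng.standard_normal((patches.shape[1], 512)) * 0.02   # learned linear embedding in practice
tokens = patches @ E                            # (196, 512) patch embeddings fed to the transformer
```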

Written by Petr Houška