Deep learning papers/notes 01: Performers, Lookahead, rADAM


This post is from a series of quick notes written primarily for personal use while reading random ML/SWE/CS papers. As such, they may be incomprehensible and/or flat-out wrong.

Performers: Faster approximation of Transformers

  • Rethinking Attention with Performers (Paper Explained)
  • Problem with attention: with L = number of tokens and d = dimensionality of Q, K, V, you get an L×L attention matrix, i.e. O(L^2) compute/memory
  • Attention: softmax(Q * K^T) * V => softmax(L,d * d,L) * L,d => softmax(L,L) * L,d
  • Solution -> factorize softmax(Q * K^T) ≈ Q' * K'^T, then compute Q' * (K'^T * V)
    • -> more efficient computation: L,r * (r,L * L,d), with r the number of random features -> linear in L
  • Factorization via the positive orthogonal random features approach (FAVOR+ in the paper)
    • Potentially not only softmax: the same random-feature trick can approximate more generic attention kernels
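The factorization above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's full FAVOR+ mechanism: the feature map exp(w·x - ||x||^2/2)/sqrt(m) is the positive random-feature estimator of the softmax kernel, but I draw the projections w as plain i.i.d. Gaussians, whereas the paper additionally orthogonalizes them to reduce variance. All names and parameter choices are mine.

```python
import numpy as np

def positive_random_features(x, w):
    # phi(x)_i = exp(w_i . x - ||x||^2 / 2) / sqrt(m): positive features whose
    # inner product estimates the (unnormalized) softmax kernel exp(x . y)
    m = w.shape[0]
    sq = np.sum(x ** 2, axis=-1, keepdims=True) / 2.0
    return np.exp(x @ w.T - sq) / np.sqrt(m)

def performer_attention(Q, K, V, m=4096, seed=0):
    # Linear-attention sketch: approximate softmax(Q K^T / sqrt(d)) V without
    # ever materializing the L x L attention matrix.
    L, d = Q.shape
    w = np.random.default_rng(seed).normal(size=(m, d))  # i.i.d. here, not orthogonal
    scale = d ** -0.25  # split the usual 1/sqrt(d) temperature between Q and K
    Qp = positive_random_features(Q * scale, w)  # (L, m)
    Kp = positive_random_features(K * scale, w)  # (L, m)
    num = Qp @ (Kp.T @ V)        # (L,m) @ ((m,L) @ (L,d)) -> linear in L
    den = Qp @ Kp.sum(axis=0)    # approximates the softmax row normalizer
    return num / den[:, None]
```

Note the bracketing: computing K'^T V first is exactly the reassociation that turns the O(L^2) cost into O(L·m·d).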

Lookahead: Smart wrapper over any optimizer
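The note above is only a title, so here is a minimal sketch of the mechanism as I understand it: keep "slow" weights, run any wrapped optimizer for k "fast" steps starting from them, then interpolate the slow weights a fraction alpha toward the result. The quadratic objective and step sizes below are my own illustrative choices.

```python
import numpy as np

def lookahead(inner_step, params, k=5, alpha=0.5, outer_steps=20):
    # Slow/fast weights: run the wrapped ("inner") optimizer k steps from the
    # slow weights, then move the slow weights a fraction alpha toward the
    # resulting fast weights.
    slow = np.asarray(params, dtype=float).copy()
    for _ in range(outer_steps):
        fast = slow.copy()
        for _ in range(k):
            fast = inner_step(fast)          # any base optimizer update
        slow = slow + alpha * (fast - slow)  # slow-weight interpolation
    return slow

# Example: wrap plain gradient descent on f(x) = ||x||^2 (gradient = 2x)
sgd_step = lambda x: x - 0.1 * (2 * x)
x_min = lookahead(sgd_step, [5.0, -3.0])
```

Because the wrapper only needs a step function, the inner optimizer really can be anything (SGD, Adam, etc.).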

Rectified ADAM:

Image transformer: Attend to image patches
