Deep learning notes 08: Fastformer - additive or static (self)attention?

#papers

This post if from a series of quick notes written primarily for personal usage while reading random ML/SWE/CS papers. As such they might be incomprehensible and/or flat out wrong.

Fastformer - Additive attention can(not) be all you need

Modeling pairwise interaction between all pairs of tokens is expensive
Fastformer promises to use “additive attention” that’s linear in complexity via tokens-global aggregation
Presented in terms of queries, keys, values but could be just in terms of n (in this case 3) columns: a, b, …, z
Computation goes sequentially, starts with computing the output of the second column, then third, …, last
- For each column, create per-token input values, e.g. a1..an, b1…bn; the same way q,k,v are produced in transformer
- For computing the per-token outputs of second column Bi, start with Ai = ai
- For each Ai value, produce αi weight via softmax after transformation with learned wa, αi= exp(wa*Ai)/∑exp(wa*Aj)
- Produce global A as weighted average of Ai, A = ∑ αi * Ai
- The output of column b is then pointwise multiplication, Bi = bi x A
- In case there’s column c, we aggregate Bi to a single B, pointwise multiply with ci to get Ci
Still essentially quadratic i=0..n: Bi = bi x A = bi x ∑ αi * Ai = ∑ bi x αi * Ai
- Given there’s no softmax -> global a can be computed first -> linear in computation
The aggregation weights αi are essentially self-attention with per-column/layer static learned query wa
- Also could be viewed as soft classification according to learned static separation boundary vector wa
No information sharing between tokens apart from pointwise multiplication between global aggregate of prev. column
- Not really a proper attention; sort-of static query self-attention in the aggregation step
- It is statically learned what sort of tokens each layer/column should globally attend to; not dynamic per each token
- Good for tasks with global information, e.g. topic classification
Seems to just be framed in terms of the words of attention mechanism
In practice fast and with relatively good results on certain NLP tasks

Written by Petr Houška on Oct 11, 2021