Deep learning notes 04: Perceivers and Branch specialization
This post is from a series of quick notes written primarily for personal use while reading random ML/SWE/CS papers. As such, they might be incomprehensible and/or flat out wrong.
Perceiver: General Perception with Iterative Attention (arxiv)
- CNN: Locality exploited by sliding window
- ViT: Divide picture into (local) patches, attend over them
- Traditional: self-attention computation and memory are quadratic, n^2, in the number of tokens (|K|*|Q|); see the sketch below
  - NLP: 1000s of tokens
  - CV: ideally » 50k pixels
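A minimal sketch of the quadratic blow-up, in plain PyTorch (sizes and variable names are mine): in self-attention, queries, keys and values all come from the same n tokens, so the score matrix alone is n×n.

```python
import torch

n, d = 1024, 64                        # n tokens, d-dim embeddings (made-up sizes)
x = torch.randn(n, d)                  # token embeddings

# Self-attention: Q, K, V are all projections of the same n tokens.
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv       # each (n, d)

scores = Q @ K.T / d ** 0.5            # (n, n) -> compute and memory grow as n^2
out = torch.softmax(scores, dim=-1) @ V
print(scores.shape)                    # torch.Size([1024, 1024])
```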
- Original NLP transformers: encoder (self-attention) & decoder (cross-attention)
  - Encoder on input: attends only to input tokens, same number of keys, values and queries
  - Decoder on output: attends to both input and output tokens; input only produces keys and values, not queries
  - -> different numbers of (keys, values) and (queries): computational requirements are not strictly quadratic (sketched below)
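The decoder's cross-attention already shows the loophole the Perceiver exploits: when queries come from one set and keys/values from another, the score matrix is rectangular. A hedged sketch with made-up sizes:

```python
import torch

d = 64
n_in, n_out = 50_000, 512              # many input tokens, few output tokens (made up)
inp = torch.randn(n_in, d)             # produces keys and values only
out_tok = torch.randn(n_out, d)        # produces the queries

Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
Q = out_tok @ Wq                       # (n_out, d)
K, V = inp @ Wk, inp @ Wv              # (n_in, d)

scores = Q @ K.T / d ** 0.5            # (n_out, n_in): linear in n_in, not n_in^2
mixed = torch.softmax(scores, dim=-1) @ V
```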
- Perceiver: split (queries) and (keys, values) for vision classification models
  - Stack of cross-attention (with input) and a few self-attentions (mix, compute), repeated multiple times
- Cross-attention: queries not based on input, but with arbitrarily smaller dimension N*D; N is ~1000
  - Dim of attention N*M: way smaller than M^2, only linear in the input -> allows video, not-patched images, sound, …
- Self-attention: queries as well as keys, values based on cross-attention's queries dim -> arbitrarily small
  - Dim of attention N^2: independent of input dimensions; uses output of prev. step for values, keys
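A rough sketch of one such block with my own sizes (using torch.nn.MultiheadAttention, not the paper's exact implementation): a small latent array queries the large input via cross-attention, then refines itself with self-attention whose cost no longer depends on the input size M.

```python
import torch
import torch.nn as nn

M, C = 224 * 224, 64        # M input elements (e.g. pixels), C input channels
N, D = 512, 64              # N latents, D latent channels, N << M

inputs  = torch.randn(1, M, C)   # (batch, M, C) flat byte array, no patching
latents = torch.randn(1, N, D)   # initial latent queries

cross_attn = nn.MultiheadAttention(embed_dim=D, num_heads=1,
                                   kdim=C, vdim=C, batch_first=True)
self_attn  = nn.MultiheadAttention(embed_dim=D, num_heads=1, batch_first=True)

# Cross-attention: queries = latents, keys/values = input -> N x M scores
latents, _ = cross_attn(query=latents, key=inputs, value=inputs)
# Latent self-attention: N x N scores, independent of the input size M
latents, _ = self_attn(latents, latents, latents)
print(latents.shape)             # torch.Size([1, 512, 64])
```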
- Multiple layers of this cross-attention + self-attention stack
- Each cross-attention uses the same input image for calculating keys, values (using different weights)
  - Weights for keys, values (from input), and queries can be shared across repeats -> essentially a recurrent neural network (sketch below)
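With shared weights, the repeated stack is just the same two modules applied over and over, which is why the paper interprets it as an RNN unrolled in depth. A compact sketch (same made-up sizes as above, single head for brevity):

```python
import torch
import torch.nn as nn

M, C, N, D = 50_176, 64, 512, 64
inputs  = torch.randn(1, M, C)
latents = torch.randn(1, N, D)

# One cross-attention and one self-attention module, reused on every repeat:
# shared parameters make the unrolled stack behave like a recurrent net.
cross_attn = nn.MultiheadAttention(D, 1, kdim=C, vdim=C, batch_first=True)
self_attn  = nn.MultiheadAttention(D, 1, batch_first=True)

for _ in range(8):                                    # 8 repeats, same weights
    latents, _ = cross_attn(latents, inputs, inputs)  # re-read the same input
    latents, _ = self_attn(latents, latents, latents)
```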
- Initial queries can be random or learned; queries in later layers are calculated based on earlier layer results
- Interpretation:
  - Queries: what we would want to know; represent channels
  - Keys: what each pixel represents/offers
- Fourier features for positional encoding, not learned
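A hedged sketch of Fourier-feature encodings for one coordinate axis (band count and max frequency are made up; the real model does this per spatial dimension and concatenates): sines and cosines of the scaled position at several fixed, not learned, frequencies.

```python
import math
import torch

def fourier_features(pos, num_bands=16, max_freq=32.0):
    """pos: (..., 1) coordinates scaled to [-1, 1]."""
    freqs = torch.linspace(1.0, max_freq / 2, num_bands)        # fixed frequencies
    angles = pos * freqs * math.pi                               # (..., num_bands)
    return torch.cat([pos, angles.sin(), angles.cos()], dim=-1)  # raw pos + sin + cos

xs = torch.linspace(-1, 1, 224).unsqueeze(-1)   # 224 pixel positions on one axis
print(fourier_features(xs).shape)               # torch.Size([224, 33]) = 1 + 2*16
```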
- Results comparable to ResNet without any image-specific assumptions (apart from the 2D positional encoding)
- ~50 layers, number of parameters comparable to ResNet
- Attention maps possibly static / dependent only on location instead of input content
Branch specialization: Similar features cluster together
- When CNNs are branched into multiple sets of channels that don’t allow cross-information sharing -> specialization
  - E.g. on the initial portion of AlexNet (split into two streams due to GPU memory limits)
    - Features are not organized randomly as within a normal layer, but grouped/clustered per branch
    - First group: black and white Gabor filters; second group: low-frequency color detectors
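Mechanically, such a branch split is just a grouped convolution: each group of output channels only sees its own group of input channels, so no information crosses between branches. A minimal sketch with hypothetical channel counts (not AlexNet's actual ones):

```python
import torch
import torch.nn as nn

# groups=2: the first 48 output channels see only the first 48 input channels,
# the second 48 see only the last 48 -> two branches with no cross-talk.
branched = nn.Conv2d(in_channels=96, out_channels=96,
                     kernel_size=3, padding=1, groups=2)

x = torch.randn(1, 96, 56, 56)
print(branched(x).shape)        # torch.Size([1, 96, 56, 56]), computed per branch
```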
- Possible explanation:
  - First part of branch is incentivized to form features relevant to second part
  - <=> Second part prefers features which the first half provides primitives for
- Inception (multiple parallel blocks) and residual networks (unrolled into parallel paths) also feature separate sets of channels
  - Residual networks sidestep the requirement of having a lot of parameters to mix values between branches
- Bigger convolutions (i.e. smaller branches) tend to be more specialized (e.g. 5x5 in Inception)
- In Inception, specialization happens even across multiple depths:
  - mixed3a_5x5: BW vs color, brightness, …
  - mixed3b_5x5: curve related, boundary detectors, fur/eye/face detectors, …
  - mixed4a_5x5: complex 3D shapes/geometry
  - -> Very unlikely to happen by chance, e.g. for mixed3a_5x5 ~ 1/10^8
- Specialization is consistent across architectures and tasks
  - The two groups appear on AlexNet no matter what you train it on (Places, …)
  - Specialized curvature group also common across architectures / datasets
- Hypothesis: Branching just surfaces structure that already exists
  - Test: weights between 1st and 2nd conv layers -> SVD -> top singular vectors: frequency and colors (sketch below)
    - I.e. the largest dimensions of variation in which neurons connect to which in the next layer
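A sketch of how that test could look, under my own assumptions about the details (which layer to take, collapsing spatial dims by absolute magnitude): build a (2nd-layer units) x (1st-layer channels) connectivity matrix from the conv weights and look at its top singular vectors.

```python
import torch
import torchvision

# An unbranched reference model; "DEFAULT" pulls pretrained ImageNet weights
# (newer torchvision API). Layer indices below are AlexNet-specific.
model = torchvision.models.alexnet(weights="DEFAULT")
W = model.features[3].weight             # 2nd conv layer: (out_ch, in_ch, kH, kW)

# How strongly each 2nd-layer unit connects to each 1st-layer channel
# (collapsing spatial dims via absolute magnitude is my assumption).
conn = W.detach().abs().sum(dim=(2, 3))  # (out_ch, in_ch)

U, S, Vh = torch.linalg.svd(conn)
print(S[:5])                             # how dominant the first few factors are
print(Vh[:2].shape)                      # top directions over 1st-layer channels
```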
- Idea: parallels to neuroscience and brain region specialization
Written by Petr Houška