Deep learning notes 04: Perceivers and Branch specialization

#papers

This post if from a series of quick notes written primarily for personal usage while reading random ML/SWE/CS papers. As such they might be incomprehensible and/or flat out wrong.

Perceiver: General Perception with Iterative Attention (arxiv)

CNN: Locality exploited by sliding window
ViT: Divide picture into (local) patches, attend over them
Traditional: self-attention computation and memory n^2 in the number of tokens (|K|*|Q|)
- NLP: 1000s tokens
- CV: ideally » 50k pixels
Originally NLP transformers: encoder (cross attention) & decoder (self attention)
- Encoder on input: attends only to input tokens, same number of keys, values and queries
- Decoder on output: attends to both input and output tokens, input only produces keys and values; not queries
- -> different amount of (keys, values) and (queries): computational requirements are not strictly quadratic
Perceiver: split (queries) and (keys, values) for vision classification models
- Stack of cross-attention (with input), and few self-attentions (mix, compute) repeated multiple times
- Cross-attention: queries not based on input, but with arbitrarily smaller dimension N*D N is ~1000
  - Dim of N*M: way smaller than M^2, only linearly on input -> allows video, not-patched images, sound, …
- Self-attention: queries as well as keys, values based on cross-att.’s queries dim -> arbitrarily small
  - Dim of N^2: independent on input dimensions; uses output of prev. step for values, keys
Multiple layers of this cross attentions, self-attentions stack
- Each cross-attention uses the same input image for calculating (using different weights) keys, values
- Weights for keys, values (from input), and queries can be shared across repeats -> essentially recurrent neural network
Initial queries can be random or learned; queries in later layers are calculated based on earlier layer results
Interpretation:
- Queries: what we would want to know; represent channels
- Keys: what each pixel represents/offers
Fourier features for positional encoding, not learned
Results comparable to ResNet without any picture assumptions (apart from 2d positional encoding)
- ~50 layers, number of parameters comparable to ResNet
Attention maps possibly static / dependent only on location instead of input content

Branch specialization: Similar features cluster together

When CNNs are branched into multiple sets of channels that don’t allow cross-information sharing -> specialization
- E.g. on initial portion of AlexNet (split into streams two due to GPU memory limits)
- Features are not organized randomly as within a normal layer, but grouped/clustered for each branch
- First group: Black and white Gabor filters, second group: low frequency color detectors
Possible explanation:
- First part of branch is incentivized to form features relevant to second part
- <=>
- Second part prefers features which the first half provides primitives for
Inception (multiple parallel blocks) and residual networks (unroll parallelly) also feature separate sets of channels
- Residual networks sidestep requirement to have a lot of parameters to mix values between branches
- Bigger convolution (i.e. smaller branches) tend to be more specialized (e.g. 5x5 for Inception)
- Inception happens even across multiple depths:
  - mixed3a_5x5: BW vs Color, brightness, …
  - mixed3b_5x5: curve related, boundary detectors, fur/eye/face detectors, …
  - mixed4a_5x5: complex 3D shapes/geometry
  - -> Very unlikely to happen by chance e.g. for mixed3a_5x5 ~ 1/10^8
Specialization is consistent across architectures and tasks
- The two groups on Alexnet no matter what you train it on (Places, …)
- Specialized curvature group also common across architectures / datasets
Hypothesis: Branching just surfaces structure that already exists
- Test: weights between 1st and 2nd conv layers -> SVD -> singular vectors: frequency and colors
- The largest dimension of variation in which neurons connect to which in the next layer
Idea: parallels to neuroscience and brain region specialization

Written by Petr Houška on Apr 14, 2021