SWE notes 02: SIFT - detect features regardless of scale without DNNs

#swe

SIFT: Scale invariant feature transform

Interest point:
- Rich content (brightness and color variation, …)
- Well defined representation for matching comparison with other points
- Well defined position in the image
- Scale, orientation, brightness, … invariant
What are good interest points?
- Not edges -> not descriptive / unique enough
- Corners only good for simpler images
- “Blobls” actually relatively good: location, orientation, size & possible to assign signature
Detecting blobs
- Detecting edges: first/second derivative of gaussian convolution (removes noise)
  - Extrema locations correspond to position of a blobs (edges on either side)
  - The larger the extrema the more prominent blob
- Changing sigma (for the gaussian): changing detection scale (Detecting Blobs SIFT Detector 6:20)
- Try multiple sigmas -> create stack of feature maps, each corresponding of trying to find blobs at different scale
Extracting interest points
- Get stack of feature maps per blob scale
- Compute differences of all two adjacent scale feature maps (smaller and bigger)
- Find extrema across all difference-featuremap featuremaps (3d max operator; 2d across space, 1d across scales)
- Filter one only high extrema (threshold)
- -> SIFT interest points
Scale invariance:
- We know the scale of interest points -> rescale them
Orientation invariance:
- For every pixel compute gradient (edge)
- Look just at orientation (magnitude is about lightning), create histogram
- Take principal (largest) orientation and use it to normalize location (rotate the patch through the orientation)
SIFT descriptor
- Create histogram per normalized (orientation, scaling) point of interest (usually divided into 4 subplots)
- Distance between histograms can be normalized correlation / L2 / …
Allows many applications: s.a. matching features from one picture to another picture (different scale/orientation, …)

Apr 30, 2022

Deep learning notes 11: All is Kernels - but not those in ring0

#papers

Deep Networks Are Kernel Machines

Deep learning success is attributed to automatic discovery of new data representation
Paper’s proposal:
- Deep learning training: stores training data
- Deep learning inference: compares with stored data -> outputs closest -> ~kernel machine
- -> Deep network weights ~ superposition of training examples
- -> Network architecture incorporates knowledge of target function into kernel (similarity function)
Kernel machine: y = g ( sum_i: ai * K(x, xi) + b)
- K: similarity kernel function
- xi: remembered training set
- ai: gold labels
- -> essentially global nearest neighbor classifier
- The biggest question is how to choose the Kernel function
Gradient descent NN is equivalent to path-kernel Kernel machine
- Two datapoints are similar if gradients of the (network) output w.r.t to weights are similar (similar - inner product)
- Gradient of the (network) output w.r.t to weights: for particular input xi, how to change the weights to e.g. increase output
- Classify a datapoint
  - Retrace along all versions of the network (after each GD update)
  - Classify by whichever training datapoints had similar effect on the network (sensitivity) over the course of training
  - Get output of initial model and accumulate differences made by each GD step -> output of final
- Proof datapoint x, its classification y, training datapoints xi -> yi, weights wj, loss L:
  - Change in output for x w.r.t to steps of GD:
  - dy/dt = sum_j..d dy/dwj * dwj/dt | GSD’ed NN outputs change over time
  - = sum_j..d dy/dwj * (-dL/dwj) | Each wight changes according to Loss derivative
  - = sum_j..d dy/dwj (-sum_i..m dL/dyi * dyi/dwj) | Loss derivative over all data is sum over each data
  - = -sum_i..m dL/dyi * sum_j..d dy/dwj * dyi*dwj| Chain rule
  - = -sum_i..m L'(yi, yi) K_{f, w(t)}(x, xi) | re-arrange
  - => y = y0 - Integral_t sum_i..m L'(yi, yi) K_f,w(t) (x, xi) dt | output of NN y=
  - …some normalization…
  - = sum_i..m ai * K(x, xi) + b | output of Kernel machine
- Notes: ai, b depends on x,
Another way of looking at DNNs, also connects to boosting
- General statement: all learning methods will incorporate data

Kernels!

Refresher on concepts:
- Kernel functions: symmetric, positive semi-definitive ~ similarity measure
- Kernel matrix (e.g. gram matrix): pairwise distances of all points in dataset
Kernel: get kernel matrix without doing explicit feature map expansion on each datapoint and then dot product
- E.g. using simple elementwise kernel function -> way cheaper (or even actually possible) to evaluate
E.g.: Polynomial expansion with polynomial dimensionality p and datapoint dimensionality d:
- Full expansion: each datapoint d expanded to d*p (could be worse, inf. for other kernels), then dot product on these
- Kernel method: immediately computes the same kernel matrix out of two raw d-dim inputs
Hilbert spaces: imagine vector spaces but with generalized base vectors (functions, polynomials, …)
- Vector space needs to be linear and have scalar inner product
- ~Space of functions where the point evaluation functional is linear and continuous
- When converging in the function space -> also converging in outputs
Reproducing kernel, F set of functions fi: E->C, F is Hilbert space; K: ExE->C is reproducing kernel when:
- K(., t) € H, Vt€E
- <fi, K(., t)> = fi(t), Vt€E, Vfi € H | the value of fi(t) is reproduced by inner product of fi with K(., t)
- Reproducing Kernel Functions | IntechOpen, 7.pdf (berkeley.edu), rkhs2.pdf (berkeley.edu), optimization - Connection between SVM and Representer theorem - Cross Validated (stackexchange.com), functional analysis - Connection between Representer Theorem and Mercer’s Theorem? - Mathematics Stack Exchange
- Two reproducing kernels for two points -> equivalence between inner product of their reproducing kernels and its single evaluation
  - E.g.: <k(.,x), k(.,x')> = k(x, x') | for <f, g> = sum_i,j ai * bj * k(xi, xj) | H = ( f(x) = sum_i..m ai * k(xi, x) )
- Every reproducing kernel induces unique reproducing-kernel Hilbert space, and vice versa
- Every reproducing kernel is positive definite, and every positive definite kernel function defines unique reproducing kernel Hilbert space
Kernel methods ~ soft nearest neighbor
Any matrix of inner products (style transfer, attention, …) can be exploited through kernel lens
Various ways of looking at things
- SVM: Learn global separation planes
- DNN: Learn sequence of processing
Ridge regression with kernels: RBF, …
- Does regression in transformed space instead of original efficiently
- Seems to work good because regressing in those spaces seems to be good/better than original spaces
Representor theorem:
- ~Optimal function f*: x->y can be represented as linear comb. on the kernel functions k(x1, .), k(x2, .) of the dataset
- f*(.) = sum_1..n ai * k(., xi)
  - To get optimal ai coefficient: Take kernel products of our data (kernel matrix) & linear regression
  - Across all training points: find combination of similarities with all others that produces lowest error to its output
- Even assuming all possible datapoints, solution still lives in linear combination of only points from dataset
- Kernel method allows us to do this effectively -> coefficient on datapoints (their kernels) instead of their feature(s| maps)
  - Datapoints are the model
Usages:
- Style transfer: gram matrix in style
- Super-resolution: similarity patches
You could view DNN as implicit kernel method, indirectly also learns the similarity matrix

Mar 22, 2022

Deep learning notes 10: Diffusion models - noise to nice in few steps

#papers

DDPM - Diffusion Models Beat GANs on Image Synthesis

Input + sampled little bit of noise; repeated multiple times (~1000s) -> pure noise
- x_0 = x; noise(x_t+1|x_t); process input image from data distr.: x, applied noise(…) multiple times -> image of noise
If we could invert this process -> generative model: random normal noise image -> original image
- Learn to undo one “little bit of noise” step at a time: distribution noise(x_t-1|x_t)
- Sample random noise image, undo noise 1000s times (each time get one step cleaner image) -> sample clean data distr.
- Reversal gives us process of normal noise to data distribution
Noising up: q: noise(x_t+1|x_t) well-defined process of adding noise from Normal distribution
- Each step depends only on output of previous step
- Added noise has diagonal covariance matrix, is centered at last sample but down-scaled
- Given large T and well behaved schedule, last step is nearly isotropic Gaussian distribution
- Produces vast amount of data pairs of x_t-1, x_t
Denoising: p: noise(x_t-1|x_t) requires entire data distribution -> approximated via neural network
- Reversal doesn’t predict single image but a whole distribution of images (that could’ve been previous step)
- The output distribution is assumed to be gaussian (mean, covariance)
- The gaussian distribution assumption is maintained for small noise-up steps
Combination of p and q is ~VAE (variational auto-encoder) -> just train it
- The true distribution can be easily computed out of the both known training pairs
- Loss forces the denoising network predicted distribution to be close do the true distribution
The predicted covariance can be either statically set (actually doesn’t work super-bad) or also predicted
- If fixed: can be fixed based on the forward noise-up step parameters
Combination of two loss functions are used, for stability also resampled (early noise-up steps are more impactful)
- Simple objective: L2 difference between true and predicted picture / noise
- Variational loss: proper KL divergence VAE loss, including variance, …
Class-label guided generation can improve performance
- Train class classifier not only for clean images, but also for noised images -> use them to steer the generation
- Analogous to ~GANs; ~shifts the predicted distribution of the step-denoised-images to where specified label is likelier
Idea: Have GANs that have multiple discriminators along the path from noise to final image ~ merge these two approaches

Autoregressive Diffusion Models

New type of auto-regressive models: variables can be decoded in arbitrary order
Autoregressive models: produces tokens (words/patches/pixels) one after another
- E.g. First word, based on priming with it a second word, based on first two a third, …
- Usually a fixed order, for sentences starts with a first one, …
- Repeat until the whole output has been generated
ARDMs: Don’t have to go first to last, order could be arbitrary
- Can also produce multiple tokens at once reducing number of steps for lower accuracy
At the beginning all tokens are initialized (?randomly/zero?)
- DNN (usually transformer) processes them -> per token output (e.g. distribution over categories)
- A portion of them are sampled and decoded (argmax for categorization) -> concrete outputs for few tokens
- Concrete outputs replace random inputs for the sampled tokens, DNN, new subset of tokens are decoded, …
- Repeat until all tokens are sampled & set
~Similar to BERT
- Trained with random word within sentence masking -> predicts distribution over words for masked tokens
- Training is similar to BERT, just with non-fixed blanked tokens ratio
During training: mask a portion of tokens, average losses for all at once
- Sampling one timestep of one ordering where we decode & loss all of the remaining/maked tokens
- Left to right allows taking only the next (one) tokens’s loss -> less noisy
Why can’t we sample all tokens at once?
- Tokens aren’t independent -> argmax on one token (collapsing its distribution) influences other tokens’ disr.
- Sampling multiple at once is faster (less steps necessary) but possibly less ideal outputs
Extensions:
- Tokens could be re-sampled
- Multiple pixels at a time can be sampled -> to get the order/token groups dynamic programming
Initially sample only more rough values (e.g. out of few colors), only later revisit & predict specific color

Feb 20, 2022

Deep learning notes 09: Grokking - overfit into generalization

#papers

Grokking: Generalization beyond Overfitting on small algorithmic datasets

Neural network suddenly generalizes way beyond the point of overfitting with proper regularization
- Train accuracy rises with training steps, eventually NN overfits on training data, training continues…
- After orders of magnitude more steps, the network suddenly generalizes and test accuracy shoots up as well
Similar to the idea of double descent, just with training steps instead of number of parameters
In DD - when trained to convergence the relationship between test accuracy and number of parameters
- With higher capacity models test accuracy first increases as the model is becoming capable
- Then it starts going down when the network is big enough to remember dataset -> overfitting
- Increasing the number of parameters further leads to increase of accuracy again, surpassing previous best
- Interpretation: at some point there are enough parameters to nicely match all datapoints but smoothy (with proper regularization)
This paper dataset: variables + binary operation (e.g. polynomial operations) without any noise
- Dataset is a table of all pairs of variables + the result of the operation
- The network predicts the result of the operation for specific variables (portion blanked for trained data)
- Dataset is very specific & without noise, on real world issues the phenomena is hard to induce / see
Multiple variables of the dataset
- Size of the dataset
- Complexity of the operation
- Train dataset ratio
Training accuracy shoots to 100 % soon (10^2 steps), test accuracy to 100 % also but later (10^5)
- The higher the train set ratio, the faster the snap on test accuracy happens
- The easier the operation is (e.g. there are symmetries) the easier it happens
- The bigger the dataset, the harder it is to induce the phenomena
Weight decay seems to be very important to make the phenomena appear faster
- ? Prefers simple solutions vs remembering whole dataset
- So many train steps that a good solution is eventually discovered & then preferred because ^^
- Maybe weight decay is good-ish but not the best regularization
Visualization of the weights (t-SNE) shows structure that could be interpreted via the operation

Nov 20, 2021

Deep learning notes 08: Fastformer - additive or static (self)attention?

#papers

Fastformer - Additive attention can(not) be all you need

Modeling pairwise interaction between all pairs of tokens is expensive
Fastformer promises to use “additive attention” that’s linear in complexity via tokens-global aggregation
Presented in terms of queries, keys, values but could be just in terms of n (in this case 3) columns: a, b, …, z
Computation goes sequentially, starts with computing the output of the second column, then third, …, last
- For each column, create per-token input values, e.g. a1..an, b1…bn; the same way q,k,v are produced in transformer
- For computing the per-token outputs of second column Bi, start with Ai = ai
- For each Ai value, produce αi weight via softmax after transformation with learned wa, αi= exp(wa*Ai)/∑exp(wa*Aj)
- Produce global A as weighted average of Ai, A = ∑ αi * Ai
- The output of column b is then pointwise multiplication, Bi = bi x A
- In case there’s column c, we aggregate Bi to a single B, pointwise multiply with ci to get Ci
Still essentially quadratic i=0..n: Bi = bi x A = bi x ∑ αi * Ai = ∑ bi x αi * Ai
- Given there’s no softmax -> global a can be computed first -> linear in computation
The aggregation weights αi are essentially self-attention with per-column/layer static learned query wa
- Also could be viewed as soft classification according to learned static separation boundary vector wa
No information sharing between tokens apart from pointwise multiplication between global aggregate of prev. column
- Not really a proper attention; sort-of static query self-attention in the aggregation step
- It is statically learned what sort of tokens each layer/column should globally attend to; not dynamic per each token
- Good for tasks with global information, e.g. topic classification
Seems to just be framed in terms of the words of attention mechanism
In practice fast and with relatively good results on certain NLP tasks

Oct 11, 2021

Deep learning notes 07: PonderNet - Learn when to stop thinking

#papers

PonderNet - Learn when to stop thinking

Recurrent(ly run) network that can decide when to stop computation
In standard NN the amount of computation grows with size of input/output and/or is static; not with problem complexity
End-to-end trained to compromise between: computation cost (# of steps), training prediction accuracy, and generalization
- Halting node: predicts probability of halting on conditional of not halting before
- The rest of the network can be any architecture (rnn, cnn, transformer, …, capsule network, …)
Input x; x processed to hidden state h_i; processed via s(...) function; (h_i+1, y_i, λ_i) = s(h_i)
- Each steps returns next hidden state (h_i+1), output (y_i), and probability of stopping (λ_i)
- Probability to stop at step n: p_n = λ_n * TT_1..n-1 (1- λ_i)
At inference λ is used probabilistically (i.e. the probability is sampled)
Training loop:
- Input x into encoder, get h_0, … unroll the network for n steps regardless of λ
- Consider all outputs at the same time; loss = p_1*L(y_1)+p_2*L(y_2)+...+p_n*L(y_n)
- -> Possibly unstable -> two goals: make y_i better or make p_i smaller
- Regularization for KL(p_i || geometricDistirbution(λp)) -> forces lambdas to be similar to hyperparameter
Contrast vs ACT:
- Considers the output a weighted average of outputs: loss = L(p_1 * y_1 + ... + p_n * y_n)
- Early results need to be compatible with later results
- Less dynamic; needs more steps in experiments, worse in extrapolation
- Pondernet correctly needs more steps for more complex problems

Aug 27, 2021

Deep learning notes 06: Part-whole hierarchies with GLOM

#papers

GLOM: How to represent part-whole hierarchies in a neural network

Idea paper, not an actual implementation
- How does fixed architecture parse various pictures into part/whole hierarchy that’s different for each input
  - E.g. car is made of cabin, motor, …, cabin out of windows, doors, …
Dynamic image parsing sort-of handled by capsule networks
- First layer capsules represent/recognize lowest level features; capsule for window, door, …; second layer cabin, …
- Door and window activates cabin capsule, …
- ~discretization (active/nonactive) over implicit feature hierarchy of CNNs
GLOM architecture: large number of same weights columns, one for each spatial location
Each column is stack of spatially local autoencoders
- Each vertically divided into multiple (~5) levels
- Each level represents patch of image at different resolution/abstraction level
  - Cat’s ear -> furr, part of ear, cat’s head, cat, … ; neck -> furr, part of neck, cat’s neck, cat
  - All locations of cat’s ear will have similar second level activation, all locations in image similar last level activation
- At each level the activation is embedding vector of that feature at that location ~ CNNs but differently implemented
Communication/inference is iterative
- Between levels (layers) of each column through explicit neural networks
- Between columns through attention mechanism
- Iterative approach/eventual consistency forces all locations of a feature to share
For layer l, location x, timestep t+1 embedding is:
- e_t+1,l,x = e_t,l,x + f_td(e_t,l+1,x) f_bu(e_t,l-1,x) + <acrossLoc::below>
  - Last timestep + through NN (f_td, f_bu) above and below level
  - NN functions (f_td, f_bu) weights shared for the same level of all locations
- Positional encoding added to each input (similar to transformers)
Through message passing all locations for certain feature (e.g. Cat’s ear locations) converge on ~appropriate level activation
- Multiple locations sharing n-th level activation -> an island
- Following islands from topmost level to the bottom gives us parse hierarchy
  - The higher the bigger features -> the bigger islands; topmost: one island represents the whole image: class
For across-location/cross columns information sharing: attention over the same level of all columns
- Attends not using keys/queries but similarity: Instead of sm(QK^T)V -> sm(XX^T)X
- -> attends within islands -> converges towards clustering -> similar vectors forced toward similar vectors
Issue: on lower level similar things can share information even if they are in different parts of parse tree higher
- Possible solution: Module attention based on closeness in higher levels of parse tree (higher levels of columns) as well
- ~the further the less influential: sm(SUM_k=0..L-l λ^k X_l+k * X_l+k^T)X
Iterative algorithm, eventual consensus -> embeddings also update -> doesn’t discover clusters but creates them
Designed decisions:
- Locations per patches (CNN) or even per pixel
- Bottom-up network could look at nearby location but spatial locality could also be done only by the attention mechanism
Training
- Denoising autoencoder: reconstructing corrupted image (missing certain regions)
- To encourage islands of new identity: regularizer based on contrast learning
  - Crops from same image should agree, from different images disagree -> needs to be done on scene level not lower
Represent coordinate transformations: it’s not necessary to have explicit part-whole coordinate transformations
- Better have it implicit in higher dimensional embedding than explicit in low level (can’t model uncertainty)
For video: don’t need to converge for each frame, can move in time within video during convergence
- I.e. Few iteration steps per each video frame, if changes not too rapid -> should still reach stable higher levels

Jul 13, 2021

Deep learning notes 05: Unsupervised vision with DINO

#papers

DINO: Emerging Properties in Self-Supervised Vision Transformers

Unsupervised learning regime for vision (self attention, 8x8 patches) transformers
- Intermediate representation clusters pictures of similar labels together (without seeing labels)
- Capable of object detection and masking (attention mask segments objects very well)
- Capable of classification (output KNN to known labeled examples)
- Copy detection, image retrieval, … -> good similarity measure
Attention masks for CLS token: the token that contains final representation (doesn’t have image patch on input, to not bias)
Self-supervised learning: self-distillation without labels
Negative samples learning:
- Take anchor patch and patch A from one image, and patch B from second image
- Give all three patches to the model, tell it which is anchor patch
- Ask whether A or B is from the same as anchor
Self-supervised without negative samples learning
- Use only one image, augment in multiple ways (BYOL) -> produce two versions for teacher and student
  - Global crops: > 50 % of the image
  - Local crops: < 50 % of the image
  - Rotations, color-jitters, …
- Pass each one version through teacher, one through student
  - Note: Actually pass both through both, loss is combination of cross-difference
- Loss is the difference between end image representations (CLS output)
  - Same image, only differently augmented -> should have similar representation
  - To mitigate collapse to single repre. -> different models for teacher and student
- Only train (backprop) student, build teacher as exponential average of students
Teacher only uses global cropping
- If student has local crop -> student learns that its patch should match the whole with more context
- -> forces the model to learn part-whole relationship & representing the whole image
Teacher maintains running average of all representations it sees -> subtracts it from its representation
- ~normalization, helps against collapse
Representation has softmax with temperature at the end
- Dimensionality of softmax is arbitrary: don’t have explicit labels (unsupervised) -> who knows how many
- Teacher has sharpening -> more peaked distribution -> forcer larger differences between diff. outputs
- Softmax is not common in unsupervised -> forces model to come up with “its own classes”
Versus supervised learning
- Supervised has way more noisy / overfitted attention mask -> hyper optimization on the task at hand
Why does it work?
- Augmentations: in computer vision they’re super important ~ that’s where the human prior is
  - What’s augmented away doesn’t matter
Dataset: there’s always an explicit object of interest -> how we take pictures brings prior

May 11, 2021

SWE notes 01: Types as weather

#swe

Types are like the Weather, Type Systems are like Weathermen - Matthias Felleisen

Types are language of prediction by the programmer what the program will do
Type systems check these prediction
Code is written for others to understand but also to be run on computers
All developers think types while their create code (more or less precise, but still)
- Capturing these thoughts in comments is problematic -> not checked -> become wrong
- Types are checked automatically
Only type inference added to untyped language fundamentally doesn’t work
- If things go wrong -> super-hard to have reasonable error messages
Instead add gradual typing system
- Allow adding types incrementally throughout codebase
- Idiomatic: just adding types, not changing code
- Strive for reasonable error messages
What should happen when part is types and there’s error in untyped land
- In typed racket: It tells user what happened on the typed/untyped boundary (through value proxies)
- Can also provide profiling info w.r.t to specific values
Contracts are good but they are very much not free during runtime
- Problem with higher order objects (lazy streams, first class functions, …) -> need to allocate
- Good for more complex checking hard to encode in types
Idea: JITs could exploit dependent types (just like compilers exploit static types)

May 11, 2021

Deep learning notes 04: Perceivers and Branch specialization

#papers

Perceiver: General Perception with Iterative Attention (arxiv)

CNN: Locality exploited by sliding window
ViT: Divide picture into (local) patches, attend over them
Traditional: self-attention computation and memory n^2 in the number of tokens (|K|*|Q|)
- NLP: 1000s tokens
- CV: ideally » 50k pixels
Originally NLP transformers: encoder (cross attention) & decoder (self attention)
- Encoder on input: attends only to input tokens, same number of keys, values and queries
- Decoder on output: attends to both input and output tokens, input only produces keys and values; not queries
- -> different amount of (keys, values) and (queries): computational requirements are not strictly quadratic
Perceiver: split (queries) and (keys, values) for vision classification models
- Stack of cross-attention (with input), and few self-attentions (mix, compute) repeated multiple times
- Cross-attention: queries not based on input, but with arbitrarily smaller dimension N*D N is ~1000
  - Dim of N*M: way smaller than M^2, only linearly on input -> allows video, not-patched images, sound, …
- Self-attention: queries as well as keys, values based on cross-att.’s queries dim -> arbitrarily small
  - Dim of N^2: independent on input dimensions; uses output of prev. step for values, keys
Multiple layers of this cross attentions, self-attentions stack
- Each cross-attention uses the same input image for calculating (using different weights) keys, values
- Weights for keys, values (from input), and queries can be shared across repeats -> essentially recurrent neural network
Initial queries can be random or learned; queries in later layers are calculated based on earlier layer results
Interpretation:
- Queries: what we would want to know; represent channels
- Keys: what each pixel represents/offers
Fourier features for positional encoding, not learned
Results comparable to ResNet without any picture assumptions (apart from 2d positional encoding)
- ~50 layers, number of parameters comparable to ResNet
Attention maps possibly static / dependent only on location instead of input content

Branch specialization: Similar features cluster together

When CNNs are branched into multiple sets of channels that don’t allow cross-information sharing -> specialization
- E.g. on initial portion of AlexNet (split into streams two due to GPU memory limits)
- Features are not organized randomly as within a normal layer, but grouped/clustered for each branch
- First group: Black and white Gabor filters, second group: low frequency color detectors
Possible explanation:
- First part of branch is incentivized to form features relevant to second part
- <=>
- Second part prefers features which the first half provides primitives for
Inception (multiple parallel blocks) and residual networks (unroll parallelly) also feature separate sets of channels
- Residual networks sidestep requirement to have a lot of parameters to mix values between branches
- Bigger convolution (i.e. smaller branches) tend to be more specialized (e.g. 5x5 for Inception)
- Inception happens even across multiple depths:
  - mixed3a_5x5: BW vs Color, brightness, …
  - mixed3b_5x5: curve related, boundary detectors, fur/eye/face detectors, …
  - mixed4a_5x5: complex 3D shapes/geometry
  - -> Very unlikely to happen by chance e.g. for mixed3a_5x5 ~ 1/10^8
Specialization is consistent across architectures and tasks
- The two groups on Alexnet no matter what you train it on (Places, …)
- Specialized curvature group also common across architectures / datasets
Hypothesis: Branching just surfaces structure that already exists
- Test: weights between 1st and 2nd conv layers -> SVD -> singular vectors: frequency and colors
- The largest dimension of variation in which neurons connect to which in the next layer
Idea: parallels to neuroscience and brain region specialization

Apr 14, 2021

Deep learning notes 03: Switch, Feedback, and generally transformers

#papers

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
1 trillion parameters (sparse) vs 175B parameters dense of GPT-3
Mixture of experts, each token routed to only single expert per layer
- Few experts per machine, allows parallel execution: indiv. Token to different experts -> machines
On large NLP test loss goes down with more parameters even with the same:
- Training dataset, number of steps, FLOPs per forward pass (due to sparsity)
- -> tradeoff distributable memory for training speed / performance (demonstrated on super-huge models)
Transformer:
- MHA: relates tokens within layer to each other
- Feed forward layer (for all tokens): relates layers to each other
Switching transformer: Multiple feed forward layers, each token uses one of them
- Routing matrix, dot product with token -> softmax -> routing weight (soft) -> hard clip “argmax”
- In forward pass -> only one FF “expert” is used per token
Previously thought impossible to use just one “argmaxed” expert due to instability
- Better initialization, adaptive precision (float16 on communication, float32 for within node computation), better dropout

Transformers dimensionality via Attention is all you need

Multiple dimensionalities:
- Model dimension (Dmodel): initial embedding dimension, input/output of both attention and feed forward network layers
- Value, key dimensions (Dk, Dv): dimensionality of individual keys (and thus queries) and values
- Hidden FFN size (Dff): dimensionality of first out of two (second has dimensionality of Dmodel) layers of FFN
- Heads (H): number of attention heads
- Layers (N): number of attention, feed forward network stacks
Network computation
- Dmodel token vector projected per attention head to Dk, Dk, and Dv vectors for keys, queries, and values, attention happens
- After attention finishes, concatenation of Dv outputs per attention head projected to Dmodel
- Dmodel sized vector is projected through a single hidden layer (Dff) and then reprojected to Dmodel
Dmodel: 512, Dk: 64, Dv: 64, H: 8, Dff: 2048, Bert is similar; more thorough description
- Not insignificant portion of parameters is in FFN portion (2048*512 * 2) that recombines indiv. heads info

Feedback Transformers: Addressing Some Limitations of Transformers with Feedback Memory

Feedback Transformers: Addressing Some Limitations of Transformers with Feedback Memory (Explained)
RNN: Data flows one step per time in shared hidden state
Transformer: Data flows each token to all tokens on every layer
- Each layer very limited ~linear parallel recombination
- Casual masking: Only let transformers see “previous” tokens (big in NLP)
Normal transformer: only feed forward, can recombine left-previous-layer tokens
Memory transformer: allow lateral (left same layer) and also feed-forward (left-upper) connections
- Disables parallel training, need to compute left tokens fully first (even upper layers) to compute current
- Representations of all layers of a token are combined (weighted sum) to a single per-token memory representation
- Tokens to the right attend to individual left tokens’ memory representations
In a way attention over multiple-layer RNN (not super-new idea, attention originated in RNNs in similar way)
It seems that connection from past-tokens top (highest layer) representation is responsible for most gains

Feb 17, 2021

Deep learning papers/notes 02: GPT3, data extraction, Dall-E

#papers

GPT-3

Language Models are Few-Shot Learners (Paper Explained)
Language model: model that generates continuation of text
~100 attention layers, ~100 heads, ~100 dimensionality, 3.2M batch size
Not bidirectional, goes from left to right (autoregressive)
Bert-approach:
- Pretrain on general data: generic language model
- Finetune (gradient updates) on specific task (e.g. Sentiment analysis)
GPT-approach:
- Zero shot: take pre-trained model, give it textual task description, prompt, expect output
- GPT does one/few shot: give it few pairs of description, prompt and output
- -> no gradient update on concete task
  - Just relies on absolutely huge training set that included these tasks somehow somewhere
Language model is just trained to finish a text that looks as “description, prompt, answer, prompt, ….”
- The output can be restricted to be out of a set of possible answers -> easier
Closed book system
- Good for trivia, not good for e.g. natural questions
Hypothesis:
- Large transformers are almost storing the training data
- Inference: sort of fuzzy KNN/interpolation of training data with language model
- Would be good to see what training examples were used for current output
  - ~index of what training samples influenced what weights
Not great performance on:
- Reading tasks (prompt contains text + question connected to it) that require reasoning
- Better for reading tasks where model selects more probable answer (out of 2): correlated
- -> suggest interpolation hypothesis
Very good language model, almost perfect grammar ~ fuzzy search
- No tasks that would try to make poor English out of good, scramble words, …
- A lot of presented tasks can be explained by being good English model

Extracting Training Data from Large Language Models

Extracting Training Data from Large Language Models (Paper Explained)
GPT2/3
Querying large black-box language model for data that appear only once/few times in training data
- It’s ok to remember good spelling, general info (e.g. correct zip codes, …), bad to remember specific datapoints
- Eidetic memorialization: if string is extractable and appears k-times in training data (possibly many times in k docs)
Not focused on targeted training data extraction but general “any rememebred data”
Intuition: easy to extract datapoints far from other datapoints
- Model can’t extract patterns w.r.t to them -> remembers the datapoints exactly
- For example GUIDs, random urls, random strings, …
- Does not mean all training data is extractible
Generate a lot of data, select highly likely outputs, deduplicate, manually check if on web few times
- Data generation improvements: tweaks to priming and temprature to generate more diverse outputs
- Selection improvements: train smaller model on similar (not same) datasets, take likely on targeted model but unlikely on new
  - Smaller new model is unlikely to remember the same few-shot datapoints
Note: Distillation models: not all datapoints loose their performance equally
- Assumption: Most affects rememebred single-training-datapoint examples
Memorization is context specific: heavily depends on prompt
Even if datapoint is only in one doc, it might need to be repeated multiple times in the doc to be remembered
- Number of repeats required is higher with smaller models
- Not clear relationship between documents, batches, …

OpenAI DALL·E: Creating Images from Text

OpenAI DALL·E: Creating Images from Text (Blog Post Explained)
Generating pictures out of textual description
Idea: GPT-3 generates image tile hieroglyphs tokens, VQ-VAE’s decoder uses them as latent repres. to create images
GPT-3 like language model:
- One stream of tokens: first textual description tokens, then autocompletes/generates image tile hieroglyphs tokens
- Image tile hieroglyphs from vocabulary of VQ-VAE latent space codebook
- Each tile token attends to only specific tile tokens (row, column, neighborhood) and all text tokens
VQ-VAE
- Encoder: per image tile projects to latent space, selects closest vector (hieroglyph) from codebook
  - Pretrained as normal VAE, decoder possibly fine-tuned together with GPT-3 part
- Decoder: creates image out of matrix of codebook latent vectors produced either by encoder (training) or GPT-3 (inference)
- Codebook also trained w. encoder ~ essentially tile embedding to latent space, decoder ~ reverse embedding
Blog mentions continuous relaxation of the codebook, no need for it to be explicit, not sure what it means
8192 Codebook vectors, trained; 32x32 tiles per image, image resolution 256x256
Outputs 512 images, re-reranked with another text :: image matching model

Feb 10, 2021

Hypersimensional computing: wide embedding meets Hopfield

#papers

2004.11204, Symbolic Representation and Learning With Hyperdimensional Computing
Main primitive - hypervector: ~10 000+ bits vectors (variations with integer/float vectors exist)
- Information related to each object/concept represented in distributed manner across whole vector (equally significant bits)
- Recombination of these vectors preserves information -> allows entities composition using well defined vector operations
- Super wide bit embedding + composition operators with explicit similarity-like classification/Hopfield-like asociativity
Benefits:
- Wide representation brings robustness against noise
- Only three operators: addition (normal), multiplication (xor), permutation (actual permutation)
- Very easily super-HW accelerated
Initial step: encoding to hypervectors - essentially embedding problem - very important for final accuracy
- Goal pre-trained DNNs (e.g. initial layers of CNNs)
- Random gram-based encoding;
  - Random level hypervectors are generated (e.g. per letter)
  - Rotationally permutated by their spatial location (e.g. position in n-gram)
  - Result for whole n-gram is binding (multiplication) all of these together, result for text is summation of n-grams
- Random record-based encoding; e.g. for encoding speech using per time-bucket:
  - Position hypervectors: generated orthogonal to each other
  - Quantized level hypervectors: correlated based on distance (next higher level only portion of random bits flipped)
  - Result for whole speech clip is summation of all (position, level) pairs bound together (multiplication)
Allows quantization, compression (cut & replace w. sum of short orth. vectors), adaptive tain. (~iterative update on missclass.)
Procedure (e.g. image classification):
- Training
  - Training images hashed (using pre-trained DNN) to binary representation, transformations are made
  - Aggregate per class to form hypervectors (e.g. consensus sum: most frequent bit at every pos. wins)
  - Class hypervectors are stored in associative memory (e.g. Hopfield-like, explicit list)
- Inference:
  - Hash input image, same transformations as during training
  - Query closest class hypervector (~KNN-like) and output most similar -> class with closest hypervector (distance metric)
  - Possible explicit small NN to transform outputs of similarity query (per class similarity) instead of argmax

Jan 30, 2021

Deep learning papers/notes 01: Performers, Lookahead, rADAM

#papers

Performers: Faster approximation of Transformers

Rethinking Attention with Performers (Paper Explained)
Problem with attention L: number of tokens, d: dimensionality of Q, K, V -> L^2 attention matrix
Attention: softmax(Queries * Keys^T) * Values => softmax(L,d * d,L) * L,d => softmax(L, L) * L,d
Solution -> factorize softmax(Q * K^T) => Q' * K'^T => Q' * (K' * V)
- -> more efficient computation: L,d * (d,L * L,d) -> linear in L
Factorization through positive Orthogonal Random features approach
- Potentially not only softmax, can be more generic

Lookahead: Smart wrapper over any optimizer

Lookahead Optimizer: k steps forward, 1 step back | Michael Zhang
Wraps around any optimizer O
- Creates weights checkpoint c_i, makes n steps with O
- Interpolates between current state and (n steps of O ago) saved c_i -> c_i+1
Intuitively: Tries n steps with arbitrary optimizer, then goes in the final location direction

Rectified ADAM:

https://arxiv.org/pdf/1908.03265.pdf
ADAM has large variance during warmup
Solution: low initial learning rate, negligible momentum for first few batches

Image transformer: Attend to image patches

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Paper Explained)
Transformer on images -> too big attention matrix (n^2 for n pixels)
- -> cut into 16x16 patches -> attend to patches (way smaller number)
Use linear embedding for individual patches - didn’t work worse than embedding with CNN

Jan 22, 2021

The Simple Essence of Automatic Differentiation

#papers

Relevant Lecture and slides.

Don’t look at backprop in terms of graphs, treat it as algebraic composition over functions
- Graphs can be good for visualization, but AD has nothing to do with graphs
- Problem with graphs -> hard to deal with, trees easier but can be exp. big
Derivative of a function f::(a->b) at a certain point is a linear map f'::(a-:b)
- Can’t compose derivative operator: D::(a->b)->(a->(a-:b))
  - When composed D (g o f) a = D g (f a) o D f a it also needs f a i.e. b
- Solution: D'::(a->b) -> (a->(b * (a-:b))), doesn’t produce just derivative but f(a) * f'::b * (a-:b)
  - Now for D' (g o f) we only need (D' g) o (D' f) -> easily composable
Compile the composition to various categories: graph repres., functions composition, derivative, …
- Category: identity :: a->a, composition (o) :: (b->c)->(a->b) -> (a->c)
- Cartesian category: notion of Cartesian m...

Dec 13, 2020

Fast weights: attend to the recent past

#papers

Paper in question: 1610.06258. Relevant lectures here and here.

Traditionally D/RNN have two types of memory:
- Standard weights W: long term memory of the whole dataset
- Hidden state h(t): maintained by active neurons / directly storing activ., limited, very immediate
  - Remembered things heavily influence currently processed stuff, limited capacity
New intermediate “fast weights” A(t): weights/synapses but faster learning, rapid decay of temp. information
- Higher capacity associative “network” that stores past hidden states (e.g. Hopfield network)
New (hidden) state is a combination of traditional new RNN state and its repeated modulation with fast weights
- Two steps: preliminary vector h_0(t+1)=f(Wh(t)+Cx(t)) and s:1..S steps of inner loop with h_S(t+1)==h(t+1)
- h_s+1(t+1)=f([Wh(t)+Cx(t)]+A(t)*h_s(t+1))
  - Multiple steps s:1..S allow the new state to settle; […] the same for all, preliminary vector without f(..)
- Fast weights A(t+1) updated with h(t+1) at the end of timestamp e.g. using Hopfield rule + heavy decay
- Backprop can go (even) through fast weights (doesn’t update them) -> network learns to work it them
Problem with minibatches: different fast weights for each sequence in a batch -> expl. attention over past hidden states
- Unclear details how that solves minibatch problem but less comp. intensive: k\*n vs n^2
- A(t)*h_s(t+1) ~= sum_i=1..t λ^(t-i) h(i)*<h(i),h_s(t+1)>
- Non-parametrized attention with strength ~ scalar product between the past state and current hidden state
Notes:
- Interpretation: Attracts each new state towards recent hidden states
- Benefit: frees hidden state to learn good representation for final classification
- Needs fast weights output layer normalization

Dec 9, 2020

Overleaf: how to maintain separate git repo with nice history

#misc

Overleaf is an awesome online editor for LaTeX. It comes with git integration where your project serves as a git repository. You can pull updates made online to your local repo and also create new commits locally and push them to overleaf, where they immediately appear. Unfortunately, there are two limitations.

Only one branch - master, that cannot be force pushed to.
Changes made through the online editor are automatically committed at relatively random intervals with static message Update on Overleaf..

A consequence of that is that changes made online result in series of commits with non-destriptive messages that are relatively arbitrarily portioned. While that’s perfectly fine from a usability standpoint, it just doesn’t work for the pedants among us who want to have nice¹ history². Luckily, there’s an easy way to maintain a separate repo with a perfectly nice history and keep it in sync with overleaf both ways.

You can have a separate local branch ol that tracks overleaf’s master, only make commits on that branch, and then cherry-pick all its new commits (both created locally and pushed to overleaf and pulled from overleaf) to your local master, where you can change its history as you see fit.

Have local repo with the same content ...

Nov 24, 2020

Scaffolding jekyll posts with little bit of bash

#swe

Jekyll blogs are quite awesome. Really simple to set-up, relatively straightforward to customize, and generally a pleasure to work with. Their only downside I’ve noticed is a slightly annoying new-post story. You need to create a file at a specific location, correctly lower-case and sanitize its name - that should correspond to the title, and fill the date - twice.

Luckily, we can spend hours perfecting automation of this menial task and save a negative amount of time even assuming we’ll continue blogging at a reasonable pace.

A quick google revealed at least two existing scaffolding projects for jekyll posts but one didn’t fit my needs - as it only works with normal posts, and another required jeoman and was generally larger in scope.

For those reasons, I decided to write my own little bash script. In its current form it is capable of scaffolding new posts and new items of my special TIL items (a normal Jekyll collection with specific attributes and format). It could, however, be very easily customized to handle any type of jekyll content. Apart from just creating new files, it automatically handles all the dates, sanitizing title and using it as the file-name, and supports custom templating in case your content is more specialized.

It’s built on top of two principles:

The first argument specifies command. The command determines what hand...

Jul 10, 2020

Asserts with custom messages in Burst Unity

#swe

While helping with little something that uses Unity I came across another interesting thing. If you try to use normal UnityEngine.Assertions.* in your Burst jobs you’ll find out they’re (as of Burst 1.3.3 and Unity 2020.1.0b9) being silently optimized away and are not checked, not even in debug Builds.

A quick google search for unity burst assert will yield you this amazing blog-post about making your own asserts that work well with Burst. I encourage everyone to read it. Unfortunately, there’s one small deficiency with the code suggested, it doesn’t allow custom assertion-failed messages.

No log message:

using System;
using UnityEngine;
using Unity.Collections;

static class BurstAssert
{
    // based on: https://jacksondunstan.com/articles/5292
    public static void IsTrue(bool condition)
    {
        #if UNITY_ASSERTIONS
        if (!condition)
        {
            throw new Exception("Assertion failed");
        }
        #endif
    }
}

What doesn’t work:

While it is possible to instantiate and throw an Exception with a custom message directly from a burst’ed method (see the "Assertion failed" message), you can’t pass a managed string to a method, even if it is onl...

Jul 9, 2020

Async in C#, .NET, and Unity: Allocation and state machine builders

#swe

While helping with little something that uses Unity I came across the rabbit hole async/await support is in Unity. Historically Unity used generators (known as coroutines in Unity’s world) to support async/multiple-frames-spanning computation. In 2017 they added initial support for async/await but without any meaningful libraries support and with potential performance pitfalls (hello GC, how are you?). To be fair, at that time async had performance implications even in mainland .NET (Core), mainly around allocations which - (un)fortunately aren’t anywhere as problematic for (mostly) throughput oriented .NET Core apps as they can be for near-real-time applications like Unity games.

Luckily for .NET, with the release of .NET Core 2.1 in 2018 a lot of those issues got solved and allocations were decreased substantially. But what was the change actually about? And how does it relate to Unity and/or 3rd party Unity focused async/await libraries such as UniTask or UnityAsync? Let’s find out.

I’ll assume some (relatively deep) knowledge about async/await. If you’re not sure you have it, be sure to check this awesome blog-post about the topic.

State machine rewrite:

When you write an async method, Roslyn will rewrite it to a method that does following. As this rewrite is done by the compiler it will happen regardless of your runtime, be it full framework, .NET Core 2.1, or Unity.

Compiler synthesizes an IAsyncStateMachine struct containing the original implementation of the method cut into a state machine (as its MoveNext(..) method) and locals lifted as fields.
Compiler generated IAsyncStateMachine struct is initialized (stateMachine) with:
- This pointer.
- Parameters.
- Newly initialized XYZMethodBuilder (methodBuilder) struct corresponding to the Task-like object that ...

Jul 7, 2020

Managed personal VPN between your devices through Azure

#misc

Let’s say you have a few Windows devices, for example, a powerful desktop and laptop. You mostly work on the desktop but when being remote on the laptop you want to be able to remote to the desktop and harvest it’s power / avoid having to sync data.

Unfortunately, your ISP NATs you into oblivion and/or doesn’t provide static IP so RDP/SSH’ing directly is out of the question. Self-hosted Visual Studio Codespace works for most things but don’t cut it for everything. At the same time, you neither want to rent a VM and manage your own VPN (too much work) nor want to use off the shelf 3rd party product s.a. hamachi/TeamViewer/Parsec because you have this irrational detest towards 3rd party remoting software (I know, I know…¹).

This brings you to a situation in which you’re looking for a managed VPN, ideally one payable with your free Azure credits. Luckily, it’s a situation that has a relatively easy solution.

Follow this Point-to-Site VPN tutorial.
- Basic SKU should be perfectly fine and will run you < ~30 USD/month.
Distribute client certificates created in 1) to your devices, download VPN configuration.
Connect on all desired devices to the VPN.
Find the IP address of each device within the VPN.
- Run ipconfig /all and find the assigned address within the correct Network interface on every device. Beware, the list might contain the true VPN network interface’s HyperV relay.
- Run arp -a, identify the correct network device, and guess the association of IP addresses and devices.
- Go to Azure Portal/<Your VPN_gateway>/User VPN Configuration/Allocated IP addresses and guess the association of IP addresses and devices.
Use the IP address to SSH/RDP to your desired device.

For RDP two more things might be relevant.

By default the VPN’s network interface doesn’t have DNS configured and so connecting via Computer names won’t work. You’ll either have to configure DNS or just use IP addresses directly.
I was unable to login into a Windows 10 2004 that was configured as passwordless despite both computers using Microsoft account and Windows hello authentication. I believe it should be possible to make it work, as I know it can work in enterprise setting but just wasn’t worth it for me.

I’m also fully aware that VSCode Codespace is essentially 3rd party remote software, somehow I’m fine with that ¯\(ツ)/¯. ↩

Jun 28, 2020

Gated linear networks

#papers

Paper in question: 1910.01526

Series of linear filters (weights) on input with non-linearity at the end
- Non-linearities are on each layer (neuron) but they cancel each other out
Set of weights per each neuron
- Specific weight vector selected via context func. from input (side information)
- Each neuron different set of weights, different context function
- Same side information for all neurons in all layers
- Weights adjusted during training, only the one weight vector for any specific input, online gradient descent
Context function:
- Usually set of half-space functions (similarity with side inf)
- Don’t change during training, need to be sampled correctly
- Similar data will (through context func.) force same weights for neurons -> sim. outputs
- Unsimilar data won’t use the same weights -> less forgetting
Each neuron is geometric mixture of outputs of previous layer (through weights)
- Weights initialized randomly, updated via training
Essentially a multilevel mixture of KNN and linear transformation with point non-lin.

Jun 22, 2020