Gated linear networks
This post is from a series of notes written for personal use while reading random ML/SWE/CS papers. The notes weren’t originally intended for the eyes of other people and therefore might be incomprehensible and/or flat-out wrong.
Paper in question: arXiv:1910.01526 (Gated Linear Networks)
- A series of linear filters (weight vectors) applied to the input, with a non-linearity only at the very end
- Non-linearities sit on each layer (neuron), but they cancel out: each neuron squashes its output with a sigmoid and the next layer undoes that with a logit, so the whole network stays linear in logit space
- Each neuron holds a set of weight vectors
- A specific weight vector is selected via a context function applied to the input (side information)
- Each neuron has its own set of weights and its own context function
- The same side information is fed to all neurons in all layers
- Weights are adjusted during training via online gradient descent; for any given input, only the single selected weight vector is updated
- Context function:
  - Usually a set of half-space functions (each tests which side of a hyperplane the side information falls on)
  - They don’t change during training, so they need to be sampled well up front
- Similar data will (through the context functions) select the same weights for the neurons -> similar outputs
- Dissimilar data won’t use the same weights -> less forgetting
- Each neuron is a geometric mixture of the outputs of the previous layer (through the selected weights)
- Weights are initialized randomly and updated during training
- Essentially a multi-level mixture of KNN-style gating and linear transformations with point-wise non-linearities; a minimal sketch of a single neuron follows below
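To make the mechanics concrete, here is a minimal NumPy sketch of a single gated neuron as I understand it from the paper: a fixed, randomly sampled half-space context function picks one weight vector out of a bank, the neuron geometrically mixes the previous layer’s probabilities (linear in logit space, sigmoid back out), and online gradient descent on the log loss updates only the selected vector. The names and constants (`GLNNeuron`, `num_halfspaces`, the clipping range, the learning rate) are my own illustration, not the paper’s reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    p = np.clip(p, 1e-6, 1 - 1e-6)  # keep away from 0/1 so the log stays finite
    return np.log(p / (1.0 - p))

class GLNNeuron:
    """One gated neuron: a bank of weight vectors, one of which is selected
    per input by a fixed half-space context function over the side information."""

    def __init__(self, fan_in, side_dim, num_halfspaces=4, lr=0.01, rng=None):
        rng = rng or np.random.default_rng()
        # Fixed, randomly sampled half-space context function (never trained).
        self.hyperplanes = rng.standard_normal((num_halfspaces, side_dim))
        self.biases = rng.standard_normal(num_halfspaces)
        # One weight vector per context cell: k half-spaces -> 2^k cells.
        self.weights = rng.uniform(-0.1, 0.1, (2 ** num_halfspaces, fan_in))
        self.lr = lr

    def context(self, side_info):
        # Bit pattern of the half-space tests -> index of the active weight vector.
        bits = (self.hyperplanes @ side_info > self.biases).astype(int)
        return int(bits @ (1 << np.arange(len(bits))))

    def predict(self, prev_probs, side_info):
        c = self.context(side_info)
        # Geometric mixing: linear in logit space, squashed back to a probability.
        return sigmoid(self.weights[c] @ logit(prev_probs)), c

    def update(self, prev_probs, side_info, target):
        # Online gradient descent on the log loss; only the selected vector moves.
        p, c = self.predict(prev_probs, side_info)
        self.weights[c] -= self.lr * (p - target) * logit(prev_probs)
        self.weights[c] = np.clip(self.weights[c], -5.0, 5.0)  # keep weights bounded
        return p

# Toy usage: one neuron learning online to sharpen two noisy base predictions.
rng = np.random.default_rng(0)
neuron = GLNNeuron(fan_in=2, side_dim=3, rng=rng)
for _ in range(1000):
    x = rng.standard_normal(3)              # raw input doubles as side information
    y = float(x[0] + x[1] > 0.0)            # binary target
    base = sigmoid(np.array([x[0], x[1]]))  # layer-0 "base" probabilities
    neuron.update(base, x, y)
```

As far as I can tell from the paper, a full network is just layers of such neurons, each with its own weight bank and context function but fed the same side information, and every neuron is trained locally against the same target rather than through backpropagation.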
Written by Petr Houška