Deep learning notes 11: All is Kernels - but not those in ring0
This post is from a series of quick notes written primarily for personal use while reading random ML/SWE/CS papers. As such, they might be incomprehensible and/or flat-out wrong.
Deep Networks Are Kernel Machines
- Deep learning's success is attributed to the automatic discovery of new data representations
- Paper’s proposal:
- Deep learning training: stores training data
- Deep learning inference: compares with stored data -> outputs closest -> ~kernel machine
- -> Deep network weights ~ superposition of training examples
- -> Network architecture incorporates knowledge of target function into kernel (similarity function)
- Kernel machine: `y = g(sum_i ai * K(x, xi) + b)` (a minimal sketch follows this list)
  - `K`: similarity kernel function
  - `xi`: remembered training set
  - `ai`: gold labels
  - -> essentially a global nearest-neighbor classifier
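A minimal sketch of the kernel-machine prediction formula above, assuming an RBF kernel and arbitrary toy data (all names and values here are illustrative only, not from the paper):

```python
import numpy as np

def rbf_kernel(x, xi, gamma=1.0):
    # K(x, xi): similarity decays with squared distance
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def kernel_machine(x, X_train, a, b, g=np.sign):
    # y = g(sum_i a_i * K(x, x_i) + b) -- the stored training set *is* the model
    s = sum(ai * rbf_kernel(x, xi) for ai, xi in zip(a, X_train))
    return g(s + b)

# toy "remembered" training set and +-1 gold labels used as coefficients
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
a = np.array([1.0, -1.0, 1.0])
print(kernel_machine(np.array([0.2, 0.1]), X_train, a, b=0.0))
```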
- The biggest question is how to choose the kernel function
- A gradient-descent-trained NN is equivalent to a kernel machine with the path kernel
- Two datapoints are similar if the gradients of the (network) output w.r.t. the weights are similar (similarity measured by inner product)
- Gradient of the (network) output w.r.t. the weights: for a particular input `xi`, how to change the weights to e.g. increase the output
- Classify a datapoint:
- Retrace along all versions of the network (after each GD update)
- Classify by whichever training datapoints had similar effect on the network (sensitivity) over the course of training
- Get the output of the initial model and accumulate the differences made by each GD step -> output of the final model
- Proof: datapoint `x`, its classification `y`, training datapoints `xi -> yi`, weights `wj`, loss `L`:
  - Change in output for `x` w.r.t. steps of GD: `dy/dt = sum_j..d dy/dwj * dwj/dt` | the GD'ed NN's output changes over time
  - `= sum_j..d dy/dwj * (-dL/dwj)` | each weight changes according to the loss derivative
  - `= sum_j..d dy/dwj * (-sum_i..m dL/dyi * dyi/dwj)` | the loss derivative over all data is a sum over each datapoint
  - `= -sum_i..m dL/dyi * sum_j..d dy/dwj * dyi/dwj` | chain rule
  - `= -sum_i..m L'(yi*, yi) * K_{f,w(t)}(x, xi)` | re-arrange; `L'` compares the gold label `yi*` with the current output `yi`, and `K_{f,w(t)}` is the tangent kernel of gradient inner products (see the sketch after this proof)
  - => `y = y0 - Integral_t sum_i..m L'(yi*, yi) * K_{f,w(t)}(x, xi) dt` | output of the NN
  - …some normalization… (the `ai` become loss derivatives averaged along the GD path weighted by the tangent kernel, `K` becomes the path kernel, `b = y0`)
  - `= sum_i..m ai * K(x, xi) + b` | output of a kernel machine
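A minimal sketch of the kernel appearing above, `K_{f,w}(x, xi) = <dy/dw(x), dyi/dw(xi)>`: per-example gradients of a tiny hand-written network's output w.r.t. its weights, compared by inner product (the architecture and numbers are illustrative assumptions, not the paper's setup):

```python
import numpy as np

# tiny net f(x) = v . tanh(W x); gradients of the scalar output w.r.t. all weights
def output_grad(W, v, x):
    h = np.tanh(W @ x)
    dv = h                                   # df/dv
    dW = np.outer(v * (1.0 - h ** 2), x)     # df/dW (chain rule through tanh)
    return np.concatenate([dW.ravel(), dv])  # flatten into one weight-gradient vector

def tangent_kernel(W, v, x, xi):
    # two datapoints are "similar" if changing the weights moves their outputs similarly
    return np.dot(output_grad(W, v, x), output_grad(W, v, xi))

rng = np.random.default_rng(0)
W, v = rng.normal(size=(4, 3)), rng.normal(size=4)
x, xi = rng.normal(size=3), rng.normal(size=3)
print(tangent_kernel(W, v, x, xi))
# The path kernel integrates/averages this quantity over all weight snapshots w(t)
# visited during gradient descent.
```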
- Notes:
  - `ai`, `b` depend on `x` (unlike in a classic kernel machine)
- Another way of looking at DNNs, also connects to boosting
- General statement: all learning methods will incorporate data
Kernels!
- Refresher on concepts:
- Kernel functions: symmetric, positive semi-definite ~ similarity measure
- Kernel matrix (e.g. Gram matrix): pairwise similarities (inner products) of all points in the dataset
- Kernel trick: get the kernel matrix without explicitly expanding each datapoint through the feature map and then taking dot products
- E.g. using a simple closed-form kernel function -> way cheaper (or even the only feasible way) to evaluate
- E.g.: polynomial expansion with polynomial degree `p` and datapoint dimensionality `d` (see the sketch after this list):
  - Full expansion: each `d`-dim datapoint is expanded to O(d^p) monomial features (could be worse, infinite-dimensional for other kernels), then dot product on these
  - Kernel method: immediately computes the same kernel matrix out of two raw `d`-dim inputs
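A minimal numeric check of that equivalence, assuming the homogeneous degree-2 polynomial kernel `(x·z)^2`, whose explicit feature map is all d^2 pairwise products (names and data are illustrative):

```python
import numpy as np

def phi(x):
    # explicit degree-2 feature map: all pairwise products x_i * x_j (d^2 features)
    return np.outer(x, x).ravel()

def poly2_kernel(x, z):
    # kernel trick: same similarity computed from the raw d-dim inputs, no expansion
    return np.dot(x, z) ** 2

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)

print(np.dot(phi(x), phi(z)))   # dot product in the expanded space
print(poly2_kernel(x, z))       # identical value, O(d) instead of O(d^2)
assert np.isclose(np.dot(phi(x), phi(z)), poly2_kernel(x, z))
```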
- Hilbert spaces: imagine vector spaces but with generalized base vectors (functions, polynomials, …)
- Needs to be a vector space with a scalar inner product (and complete w.r.t. the induced norm)
- ~Space of functions where the point evaluation functional is linear and continuous
- When converging in the function space -> also converging in outputs
- Reproducing kernel: `H` a set of functions `fi: E -> C` forming a Hilbert space; `K: E x E -> C` is a reproducing kernel when:
  - `K(., t) ∈ H`, ∀ t ∈ E
  - `<fi, K(., t)> = fi(t)`, ∀ t ∈ E, ∀ fi ∈ H | the value of `fi(t)` is reproduced by the inner product of `fi` with `K(., t)`
- Reproducing Kernel Functions | IntechOpen, 7.pdf (berkeley.edu), rkhs2.pdf (berkeley.edu), optimization - Connection between SVM and Representer theorem - Cross Validated (stackexchange.com), functional analysis - Connection between Representer Theorem and Mercer’s Theorem? - Mathematics Stack Exchange
- For two points, the inner product of their reproducing kernels equals a single kernel evaluation
- E.g.: `<k(., x), k(., x')> = k(x, x')` | for `<f, g> = sum_i,j ai * bj * k(xi, xj)` | `H = { f(x) = sum_i..m ai * k(xi, x) }` (numeric check below)
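A small numeric sketch of the reproducing property under those definitions, assuming an RBF kernel and arbitrary coefficients (everything here is an illustrative assumption):

```python
import numpy as np

def k(x, z, gamma=0.5):
    # a positive-definite kernel (RBF) on scalar inputs
    return np.exp(-gamma * (x - z) ** 2)

# f lives in H: f(.) = sum_i a_i * k(x_i, .)
xs = np.array([0.0, 1.0, 2.5])
a  = np.array([0.3, -1.2, 0.7])

def f(t):
    return sum(ai * k(xi, t) for ai, xi in zip(a, xs))

# g = k(., t) is itself in H, with a single coefficient b_1 = 1 placed at point t
t = 1.7
# <f, g> = sum_{i,j} a_i * b_j * k(x_i, t_j) collapses to sum_i a_i * k(x_i, t)
inner_product = sum(ai * 1.0 * k(xi, t) for ai, xi in zip(a, xs))

print(inner_product, f(t))          # equal: <f, k(., t)> = f(t)
assert np.isclose(inner_product, f(t))
```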
- Every reproducing kernel induces unique reproducing-kernel Hilbert space, and vice versa
- Every reproducing kernel is positive definite, and every positive definite kernel function defines unique reproducing kernel Hilbert space
- Kernel methods ~ soft nearest neighbor (sketch below)
- Any matrix of inner products (style transfer, attention, …) can be exploited through kernel lens
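A sketch of the "soft nearest neighbor" view (essentially Nadaraya–Watson kernel regression, which is also the shape of a single attention row; all names and data are illustrative):

```python
import numpy as np

def soft_nearest_neighbor(x, X_train, y_train, gamma=2.0):
    # similarity of the query to every stored point
    sims = np.exp(-gamma * np.sum((X_train - x) ** 2, axis=1))
    # normalize similarities into weights -> prediction is a weighted vote over stored labels
    w = sims / sims.sum()
    return w @ y_train

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0])
print(soft_nearest_neighbor(np.array([1.6]), X_train, y_train))
```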
- Various ways of looking at things
- SVM: Learn global separation planes
- DNN: Learn sequence of processing
- Ridge regression with kernels: RBF, …
- Does regression in a transformed space instead of the original one, efficiently
- Seems to work well because regressing in those transformed spaces is often easier/better than in the original space
- Representer theorem:
- ~The optimal function `f*: x -> y` can be represented as a linear combination of the kernel functions `k(x1, .)`, `k(x2, .)`, … of the dataset: `f*(.) = sum_i..n ai * k(., xi)`
- To get the optimal `ai` coefficients: take kernel products of our data (kernel matrix) & do linear regression (see the sketch after this list)
- Across all training points: find the combination of similarities with all others that produces the lowest error against its output
- Even assuming all possible datapoints, solution still lives in linear combination of only points from dataset
- Kernel methods allow us to do this effectively -> coefficients on datapoints (their kernels) instead of on their features / feature maps
- Datapoints are the model
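A minimal kernel ridge regression sketch of the above: solve for the representer coefficients `ai` from the kernel matrix, then predict via similarities to the training points (the kernel choice, `lam`, and data are illustrative assumptions):

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # pairwise RBF kernel matrix between two sets of points
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

lam = 1e-2
K = rbf(X, X)                                         # kernel (Gram) matrix of the dataset
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)  # representer coefficients a_i

X_test = np.array([[0.5], [2.0]])
y_pred = rbf(X_test, X) @ alpha                       # f(x) = sum_i a_i * k(x_i, x)
print(y_pred)                                         # the datapoints are the model
```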
- Usages:
- Style transfer: Gram matrix of features as the style representation (see the sketch after this list)
- Super-resolution: similarity between patches
- You could view a DNN as an implicit kernel method that indirectly also learns the similarity measure
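A tiny sketch of the style-transfer Gram matrix idea: channel-by-channel inner products of feature maps act as a kernel-style similarity summary (shapes and values are illustrative, not tied to any particular network):

```python
import numpy as np

def gram_matrix(features):
    # features: (channels, height, width) activation maps from some layer
    c, h, w = features.shape
    F = features.reshape(c, h * w)
    # G[i, j] = <F_i, F_j>: pairwise inner products between channel maps
    return F @ F.T / (h * w)

feats = np.random.default_rng(0).normal(size=(8, 16, 16))
G = gram_matrix(feats)
print(G.shape)  # (8, 8) -- the "style" summary compared between images
```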
Written by Petr Houška