# Deep learning notes 11: All is Kernels - but not those in ring0

This post is from a series of quick notes written primarily for personal use while reading random ML/SWE/CS papers. As such, they might be incomprehensible and/or flat-out wrong.

### Deep Networks Are Kernel Machines

- Deep learning success is attributed to automatic discovery of new data representations
- Paper’s proposal:
  - Deep learning training: stores the training data
  - Deep learning inference: compares with the stored data -> outputs closest -> ~kernel machine
  - -> Deep network weights ~ a superposition of training examples
  - -> Network architecture incorporates knowledge of the target function into the kernel (similarity function)

- Kernel machine: `y = g(sum_i: ai * K(x, xi) + b)`
  - `K`: similarity kernel function
  - `xi`: remembered training set
  - `ai`: gold labels
  - -> essentially a global nearest-neighbor classifier
- The biggest question is how to choose the kernel function
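A toy illustration of the kernel-machine form above (my own sketch, not from the paper): a hypothetical RBF kernel stands in for `K`, `sign` for `g`, and the two labels act as the `ai`.

```python
import numpy as np

def rbf_kernel(x, xi, gamma=1.0):
    """Similarity kernel K(x, xi); RBF is just one possible choice."""
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def kernel_machine(x, X_train, a, b, g=np.sign):
    """y = g(sum_i a_i * K(x, x_i) + b): compare x against remembered training data."""
    s = sum(a_i * rbf_kernel(x, x_i) for a_i, x_i in zip(a, X_train))
    return g(s + b)

# toy usage: two remembered points, labels +1 / -1 used as the coefficients a_i
X_train = np.array([[0.0, 0.0], [2.0, 2.0]])
a = np.array([1.0, -1.0])
print(kernel_machine(np.array([0.2, 0.1]), X_train, a, b=0.0))  # -> 1.0 (closer to the +1 point)
```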

- Gradient-descent-trained NN is equivalent to a kernel machine with the path kernel
  - Two datapoints are similar if the gradients of the (network) output w.r.t. the weights are similar (similar = inner product)
  - Gradient of the (network) output w.r.t. the weights: for a particular input `xi`, how to change the weights to e.g. increase the output
  - Classify a datapoint:
    - Retrace along all versions of the network (after each GD update)
    - Classify by whichever training datapoints had a similar effect on the network (sensitivity) over the course of training
    - Get the output of the initial model and accumulate the differences made by each GD step -> output of the final model
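A rough sketch of the gradient-similarity idea (not the paper's code): the "kernel" between two inputs is the inner product of the output gradients w.r.t. the weights. The toy MLP, its weight packing, and the finite-difference gradients are all made up for illustration; the paper's path kernel averages this quantity over the weight vectors visited during training.

```python
import numpy as np

def f(w, x):
    """Toy 'network': a 1-hidden-layer MLP with weights packed into a flat vector w."""
    W1, b1, W2 = w[:6].reshape(3, 2), w[6:9], w[9:12]
    h = np.tanh(x @ W1.T + b1)
    return h @ W2

def grad_f(w, x, eps=1e-5):
    """Finite-difference gradient of the network output w.r.t. the weights."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        dw = np.zeros_like(w); dw[j] = eps
        g[j] = (f(w + dw, x) - f(w - dw, x)) / (2 * eps)
    return g

def tangent_kernel(w, x, xi):
    """K(x, xi) = <grad_w f(x), grad_w f(xi)>: 'the same weights affect both outputs similarly'."""
    return grad_f(w, x) @ grad_f(w, xi)

rng = np.random.default_rng(0)
w = rng.normal(size=12)
x, x_near, x_far = np.array([1.0, 0.5]), np.array([1.1, 0.4]), np.array([-3.0, 2.0])
print(tangent_kernel(w, x, x_near), tangent_kernel(w, x, x_far))  # typically larger for the nearby point
```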

- Proof: query datapoint `x` with network output `y`, training datapoints `xi` with labels `yi*` and network outputs `yi`, weights `wj`, loss `L`:
  - Change in the output for `x` w.r.t. steps of GD: `dy/dt = sum_j..d dy/dwj * dwj/dt` | GD'ed NN's output changes over time
  - = `sum_j..d dy/dwj * (-dL/dwj)` | each weight changes according to the loss derivative
  - = `sum_j..d dy/dwj * (-sum_i..m dL/dyi * dyi/dwj)` | loss derivative over all data is a sum over each datapoint (chain rule)
  - = `-sum_i..m dL/dyi * sum_j..d dy/dwj * dyi/dwj` | re-arrange the sums
  - = `-sum_i..m L'(yi*, yi) * K_{f,w(t)}(x, xi)` | with the tangent kernel `K_{f,w(t)}(x, xi) = sum_j..d dy/dwj * dyi/dwj`
  - => `y = y0 - Integral_t sum_i..m L'(yi*, yi) * K_{f,w(t)}(x, xi) dt` | output of the NN after training
  - `y` = …some normalization… = `sum_i..m ai * K(x, xi) + b` | output of a kernel machine
- Notes: `ai`, `b` depend on `x` (so the coefficients are not fixed as in a standard kernel machine)

- Another way of looking at DNNs; also connects to boosting
- General statement: all learning methods incorporate the training data in some form
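A quick numeric check of the derivation, discretized to a single GD step and using a toy linear model with squared loss (my own setup, not the paper's): the change in the output on a query `x` after one weight update equals `-lr * sum_i L'(yi*, yi) * K(x, xi)`, with `K` the gradient inner product (exact here because the model is linear).

```python
import numpy as np

rng = np.random.default_rng(0)
X, y_star = rng.normal(size=(5, 3)), rng.normal(size=5)   # toy training set (xi -> yi*)
w, x, lr = rng.normal(size=3), rng.normal(size=3), 0.01   # weights, query point, step size

f = lambda w, v: w @ v                                    # linear model: grad_w f(v) = v
y_pred = X @ w
loss_grad_y = y_pred - y_star                             # L'(yi*, yi) for squared loss 0.5*(yi - yi*)^2

# one gradient-descent step on the weights
w_new = w - lr * X.T @ loss_grad_y

# change in the output on x, vs. the kernel-machine expression from the derivation
delta_direct = f(w_new, x) - f(w, x)
delta_kernel = -lr * np.sum(loss_grad_y * (X @ x))        # -lr * sum_i L'(yi*, yi) * <grad f(x), grad f(xi)>
print(np.isclose(delta_direct, delta_kernel))             # True (exact for a linear model)
```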

### Kernels!

- Refresher on concepts:
  - Kernel functions: symmetric, positive semi-definite ~ similarity measure
  - Kernel matrix (e.g. Gram matrix): pairwise kernel values (similarities) of all points in the dataset

- Kernel trick: get the kernel matrix without explicitly expanding each datapoint through the feature map and then taking dot products
  - E.g. using a simple closed-form kernel function -> way cheaper (or even the only feasible way) to evaluate

- E.g.: polynomial expansion with polynomial degree `p` and datapoint dimensionality `d` (sketch below):
  - Full expansion: each `d`-dim datapoint expanded to `~d^p` features (could be worse, even infinite for other kernels), then dot products on these
  - Kernel method: computes the same kernel value directly out of the two raw `d`-dim inputs
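A minimal sketch of the degree-2 case (illustrative, not from the source): the explicit feature map is all pairwise products of coordinates (`d^2` features), while the kernel computes the same inner product directly from the two raw `d`-dim inputs.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map: all pairwise products x_a * x_b (d -> d^2 features)."""
    return np.outer(x, x).ravel()

def poly2_kernel(x, x2):
    """Same inner product computed directly on the raw d-dim inputs: (x . x')^2."""
    return (x @ x2) ** 2

x, x2 = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])
print(phi(x) @ phi(x2), poly2_kernel(x, x2))  # identical values, very different cost as d and p grow
```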
- Hilbert spaces: imagine vector spaces but with generalized basis vectors (functions, polynomials, …)
  - Needs to be a linear (vector) space with a scalar inner product (and complete)
  - ~Space of functions where the point-evaluation functional is linear and continuous (the RKHS case)
  - Converging in the function space -> also converging in (pointwise) outputs

- Reproducing kernel: `H` a Hilbert space of functions `fi: E -> C`; `K: E x E -> C` is a reproducing kernel when:
  - `K(., t) ∈ H`, `∀ t ∈ E`
  - `<fi, K(., t)> = fi(t)`, `∀ t ∈ E, ∀ fi ∈ H` | the value `fi(t)` is reproduced by the inner product of `fi` with `K(., t)`
  - Links: Reproducing Kernel Functions | IntechOpen, 7.pdf (berkeley.edu), rkhs2.pdf (berkeley.edu), optimization - Connection between SVM and Representer theorem - Cross Validated (stackexchange.com), functional analysis - Connection between Representer Theorem and Mercer’s Theorem? - Mathematics Stack Exchange
- Two reproducing kernels for two points -> the inner product of their reproducing kernels equals a single kernel evaluation (numeric check after this list)
  - E.g.: `<k(., x), k(., x')> = k(x, x')` | for `<f, g> = sum_i,j ai * bj * k(xi, xj)` | `H = { f(x) = sum_i..m ai * k(xi, x) }`
- Every reproducing kernel induces a unique reproducing-kernel Hilbert space, and vice versa
- Every reproducing kernel is positive definite, and every positive-definite kernel function defines a unique reproducing-kernel Hilbert space
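A small numeric check of the reproducing property (my own sketch; RBF is an arbitrary choice of positive-definite kernel), using the inner product `<f, g> = sum_i,j ai * bj * k(xi, xj)` from above:

```python
import numpy as np

k = lambda u, v: np.exp(-np.sum((u - v) ** 2))       # any positive-definite kernel works; RBF chosen here

rng = np.random.default_rng(0)
Xs, a = rng.normal(size=(4, 2)), rng.normal(size=4)  # f = sum_i a_i * k(., x_i), a function in H
f = lambda t: sum(ai * k(xi, t) for ai, xi in zip(a, Xs))

t = rng.normal(size=2)
# k(., t) has a single coefficient 1 at the point t, so by the inner product above
# <f, k(., t)> = sum_i a_i * 1 * k(x_i, t) = f(t): the inner product "reproduces" the evaluation
inner = sum(ai * 1.0 * k(xi, t) for ai, xi in zip(a, Xs))
print(np.isclose(inner, f(t)))  # True
```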

- Kernel methods ~ soft nearest neighbor
  - Any matrix of inner products (style transfer, attention, …) can be examined through the kernel lens
  - Various ways of looking at things:
    - SVM: learn global separating planes
    - DNN: learn a sequence of processing steps

- Ridge regression with kernels: RBF, …
  - Efficiently does regression in the transformed space instead of the original one
  - Seems to work well because regressing in those spaces tends to be as good as or better than in the original space
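A minimal kernel ridge regression sketch on made-up data (assumed RBF kernel and regularization value): fit coefficients against the kernel matrix, then predict as a similarity-weighted combination of the training points.

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    """Kernel matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

lam = 0.1
alpha = np.linalg.solve(rbf(X, X) + lam * np.eye(len(X)), y)  # coefficients live on the datapoints

X_test = np.array([[0.0], [1.5]])
y_pred = rbf(X_test, X) @ alpha   # prediction = similarity-weighted combination of training points
print(y_pred)                     # roughly sin(0.0) ~ 0 and sin(1.5) ~ 1
```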

- Representer theorem:
  - ~The optimal function `f*: x -> y` can be represented as a linear combination of the kernel functions `k(x1, .)`, `k(x2, .)`, … of the dataset: `f*(.) = sum_1..n ai * k(., xi)`
    - To get the optimal `ai` coefficients: take kernel products of our data (kernel matrix) & linear regression
    - Across all training points: find the combination of similarities with all the others that produces the lowest error w.r.t. its output
  - Even allowing all possible datapoints, the solution still lives in a linear combination of only the points from the dataset (check below)
  - Kernel methods allow us to do this efficiently -> coefficients on datapoints (their kernels) instead of their features (feature maps)
  - Datapoints are the model
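A quick check of the "solution lives in the span of the training points" claim for ridge regression (toy feature map and data, my own sketch): the primal solution with explicit features and the dual solution written purely as coefficients on the datapoints give identical predictions.

```python
import numpy as np

phi = lambda X: np.hstack([X, X**2])   # some explicit feature map
k = lambda A, B: phi(A) @ phi(B).T     # the corresponding kernel

rng = np.random.default_rng(0)
X, y, lam = rng.normal(size=(30, 2)), rng.normal(size=30), 0.5
X_test = rng.normal(size=(5, 2))

# primal ridge: explicit weights in feature space
F = phi(X)
w = np.linalg.solve(F.T @ F + lam * np.eye(F.shape[1]), F.T @ y)

# dual ridge: only coefficients a_i on the training points (representer theorem form)
a = np.linalg.solve(k(X, X) + lam * np.eye(len(X)), y)

print(np.allclose(phi(X_test) @ w, k(X_test, X) @ a))  # True: same function, written on the datapoints
```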
- Usages:
  - Style transfer: the Gram matrix captures the style
  - Super-resolution: similarity of patches
- You could view a DNN as an implicit kernel method; it indirectly also learns the similarity matrix

Written by Petr Houška