Deep learning notes 05: Unsupervised vision with DINO
This post is from a series of quick notes written primarily for personal use while reading random ML/SWE/CS papers. As such, they might be incomprehensible and/or flat-out wrong.
DINO: Emerging Properties in Self-Supervised Vision Transformers
- Unsupervised learning regime for vision transformers (self-attention over image patches, e.g. 8x8)
- Intermediate representations cluster pictures with similar labels together (without ever seeing the labels)
- Capable of object detection and masking (attention mask segments objects very well)
- Capable of classification (k-NN from output representations to known labeled examples; sketched below)
- Copy detection, image retrieval, … -> good similarity measure
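A minimal sketch of what the k-NN evaluation could look like on frozen features. The tensor names (`train_feats`, `train_labels`, `test_feats`) are hypothetical and assumed to be L2-normalized CLS embeddings; the paper uses a similarity-weighted vote, this is a plain majority vote:

```python
import torch

def knn_classify(train_feats, train_labels, test_feats, k=20):
    # Cosine similarity reduces to a dot product on L2-normalized features.
    sims = test_feats @ train_feats.T          # (M, N)
    _, topk_idx = sims.topk(k, dim=1)          # k nearest labeled examples
    topk_labels = train_labels[topk_idx]       # (M, k)
    # Plain majority vote among the k neighbors.
    preds, _ = torch.mode(topk_labels, dim=1)
    return preds
```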
- Attention masks are visualized for the CLS token: the token that carries the final representation (it corresponds to no image patch on input, so as not to bias it)
- Self-supervised learning via self-distillation with no labels (hence DINO)
- Learning with negative samples:
- Take an anchor patch and patch A from one image, and patch B from a second image
- Give all three patches to the model, telling it which one is the anchor
- Ask whether A or B comes from the same image as the anchor (toy sketch after this list)
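A toy version of that negative-sample objective, included just to contrast with what follows. `encoder` is a hypothetical model and the batch is set up so that A is always the right answer:

```python
import torch
import torch.nn.functional as F

def negative_sample_loss(encoder, anchor, patch_a, patch_b):
    # Embed all three (batched) patches and L2-normalize.
    z_anchor = F.normalize(encoder(anchor), dim=-1)
    z_a = F.normalize(encoder(patch_a), dim=-1)   # from the anchor's image
    z_b = F.normalize(encoder(patch_b), dim=-1)   # from a different image
    # 2-way classification over cosine similarities; A (index 0) is correct.
    logits = torch.stack([(z_anchor * z_a).sum(-1),
                          (z_anchor * z_b).sum(-1)], dim=-1)
    target = torch.zeros(logits.shape[0], dtype=torch.long)
    return F.cross_entropy(logits, target)
```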
- Self-supervised learning without negative samples
- Use only one image, augment in multiple ways (BYOL) -> produce two versions for teacher and student
- Global crops: > 50 % of the image
- Local crops: < 50 % of the image
- Rotations, color-jitters, …
- Pass one version through the teacher, the other through the student
- Note: actually pass both through both; the loss is a combination of the cross-differences
- The loss is the difference between the final image representations (CLS outputs)
- Same image, only differently augmented -> should have similar representation
- To mitigate collapse to a single representation -> different models for teacher and student
- Only train (backprop) the student; build the teacher as an exponential moving average of the student's weights (see the sketch after this list)
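A sketch of the whole no-negatives training step under the assumptions above. `make_vit` is a hypothetical backbone constructor, `H` is the per-pair loss (sketched further below), and the crop scales/sizes, jitter values, crop counts, and momentum only roughly follow the paper's multi-crop recipe:

```python
import copy
import torch
from torchvision import transforms

# Global crops cover > 50 % of the image, local crops < 50 %.
global_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.4, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.2, 0.1),
    transforms.ToTensor(),
])
local_aug = transforms.Compose([
    transforms.RandomResizedCrop(96, scale=(0.05, 0.4)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.2, 0.1),
    transforms.ToTensor(),
])

student = make_vit()                      # hypothetical ViT constructor
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad = False               # the teacher is never backpropped

def training_step(image, optimizer, H, momentum=0.996):
    globals_ = [global_aug(image) for _ in range(2)]
    locals_ = [local_aug(image) for _ in range(6)]
    # The teacher sees only global crops; the student sees everything.
    with torch.no_grad():
        t_out = [teacher(g.unsqueeze(0)) for g in globals_]
    s_out = [student(v.unsqueeze(0)) for v in globals_ + locals_]
    # Cross-difference: every teacher view vs. every *other* student view
    # (globals come first in s_out, so i == j is the identical crop).
    loss = sum(H(t, s) for i, t in enumerate(t_out)
                       for j, s in enumerate(s_out) if i != j)
    loss = loss / (len(t_out) * (len(s_out) - 1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # The teacher is an exponential moving average of the student.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(momentum).add_(ps, alpha=1 - momentum)
```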
- Teacher only uses global cropping
- If the student gets a local crop -> the student learns that its patch should match the whole seen with more context
- -> forces the model to learn part-whole relationships & to represent the whole image
- The teacher maintains a running average of all representations it sees -> subtracts it from its own representation (sketched below)
- ~normalization, helps against collapse
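As a sketch, that running average could be a simple EMA over batch means of the teacher's outputs; the rate of 0.9 and the 65536-way output head are the paper's defaults as far as I recall:

```python
import torch

center = torch.zeros(65536)   # one entry per output-head dimension

@torch.no_grad()
def update_center(center, teacher_batch_out, rate=0.9):
    # EMA of the batch mean of teacher outputs; subtracted before the
    # teacher's softmax (see the loss sketch below).
    return rate * center + (1 - rate) * teacher_batch_out.mean(dim=0)
```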
- Representation has softmax with temperature at the end
- The dimensionality of the softmax is arbitrary: there are no explicit labels (unsupervised) -> who knows how many classes there are
- The teacher has sharpening -> more peaked distribution -> forces larger differences between different outputs
- A softmax is not common in unsupervised learning -> forces the model to come up with "its own classes" (loss sketched below)
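Putting the centering and sharpening together, a sketch of the per-pair loss `H` used in the training-step sketch above; the temperatures (teacher ~0.04, sharper than the student's ~0.1) are roughly the paper's values:

```python
import torch.nn.functional as F

def H(t_out, s_out, center, t_temp=0.04, s_temp=0.1):
    # Teacher: centered, then sharpened by a low temperature; no gradients.
    t = F.softmax((t_out - center) / t_temp, dim=-1).detach()
    # Student: ordinary tempered log-softmax.
    log_s = F.log_softmax(s_out / s_temp, dim=-1)
    # Cross-entropy between the two "pseudo-class" distributions.
    return -(t * log_s).sum(dim=-1).mean()
```

To plug this into the training step above, bind the running center first, e.g. `functools.partial(H, center=center)`.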
- Versus supervised learning
- Supervised training yields much noisier / more overfitted attention masks -> hyper-optimization on the task at hand
- Why does it work?
- Augmentations: in computer vision they’re super important ~ that’s where the human prior is
- What’s augmented away doesn’t matter
- Dataset: there's always an explicit object of interest -> the way we take pictures brings in a prior
Written by Petr Houška