Deep learning papers/notes 02: GPT3, data extraction, Dall-E


This post is from a series of quick notes written primarily for personal use while reading random ML/SWE/CS papers. As such, they might be incomprehensible and/or flat-out wrong.


Language Models are Few-Shot Learners

  • Language Models are Few-Shot Learners (Paper Explained)
  • Language model: model that generates continuation of text
  • ~100 attention layers, ~100 heads per layer, ~100 dimensions per head, 3.2M-token batch size
  • Not bidirectional, goes from left to right (autoregressive)
  • Bert-approach:
    • Pretrain on general data: generic language model
    • Finetune (gradient updates) on specific task (e.g. Sentiment analysis)
  • GPT-approach:
    • Zero shot: take pre-trained model, give it textual task description, prompt, expect output
    • One/few shot: give it the description plus one/a few (prompt, output) pairs, then the new prompt
    • -> no gradient update on the concrete task
      • Just relies on absolutely huge training set that included these tasks somehow somewhere
  • Language model is just trained to finish a text that looks like “description, prompt, answer, prompt, ….”
    • The output can be restricted to be out of a set of possible answers -> easier
  • Closed book system
    • Good for trivia, not good for e.g. Natural Questions
  • Hypothesis:
    • Large transformers are almost storing the training data
    • Inference: sort of fuzzy KNN/interpolation of training data with language model
    • Would be good to see what training examples were used for current output
      • ~index of what training samples influenced what weights
  • Not great performance on:
    • Reading tasks (prompt contains text + question connected to it) that require reasoning
    • Better for reading tasks where model selects more probable answer (out of 2): correlated
    • -> supports the interpolation hypothesis
  • Very good language model, almost perfect grammar ~ fuzzy search
    • No tasks that would try to make poor English out of good, scramble words, …
    • A lot of presented tasks can be explained by being good English model
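
The "description, prompt, answer, prompt" layout and the restricted answer set can be sketched roughly as below. This is only an illustration: the `Q:`/`A:` formatting, the function names, and `toy_log_prob` are all assumptions made up for these notes. In particular `toy_log_prob` is a hypothetical stand-in for the score a real language model would assign to a continuation.

```python
from typing import Callable, Sequence, Tuple


def build_few_shot_prompt(description: str,
                          examples: Sequence[Tuple[str, str]],
                          query: str) -> str:
    """Lay out the text as: description, (prompt, answer) pairs, open prompt."""
    lines = [description]
    for prompt, answer in examples:
        lines.append(f"Q: {prompt}")
        lines.append(f"A: {answer}")
    lines.append(f"Q: {query}")
    lines.append("A:")  # the model is expected to continue from here
    return "\n".join(lines)


def pick_answer(context: str,
                candidates: Sequence[str],
                log_prob: Callable[[str, str], float]) -> str:
    """Restricted output: keep the candidate scored as the likeliest continuation."""
    return max(candidates, key=lambda c: log_prob(context, c))


def toy_log_prob(context: str, continuation: str) -> float:
    """Hypothetical stand-in for a model score: counts how often the
    continuation already occurs in the context. Not a language model."""
    return float(context.count(continuation))


examples = [("great movie", "positive"),
            ("awful plot", "negative"),
            ("loved it", "positive")]
prompt = build_few_shot_prompt("Classify the sentiment of the review.",
                               examples, "boring")
answer = pick_answer(prompt, ["positive", "negative"], toy_log_prob)
print(prompt)
print("picked:", answer)
```

No weights change anywhere here: the task is specified only through the text of the prompt. The toy scorer just picks the label seen most often in the examples, so a real per-token log-probability from the model would have to be plugged in as `log_prob` to get meaningful answers.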

Extracting Training Data from Large Language Models

  • Extracting Training Data from Large Language Models (Paper Explained)
  • GPT2/3
  • Querying large black-box language model for data that appear only once/few times in training data
    • It’s ok to remember good spelling, general info (e.g. correct zip codes, …), bad to remember specific datapoints
    • k-eidetic memorization: string is extractable and appears in at most k docs of the training data (possibly many times within those docs)
  • Not focused on targeted training data extraction but on general “any remembered data”
  • Intuition: easy to extract datapoints far from other datapoints
    • Model can’t extract patterns w.r.t. them -> remembers the datapoints exactly
    • For example GUIDs, random urls, random strings, …
    • Does not mean all training data is extractible
  • Generate a lot of data, select highly likely outputs, deduplicate, manually check if on web few times
    • Data generation improvements: tweaks to priming and temperature to generate more diverse outputs
    • Selection improvements: train smaller model on similar (not same) datasets, keep outputs that are likely under the targeted model but unlikely under the new one
      • Smaller new model is unlikely to remember the same rarely seen datapoints
  • Note: Distillation models: not all datapoints lose their performance equally
    • Assumption: Most affects remembered single-training-datapoint examples
  • Memorization is context specific: heavily depends on prompt
  • Even if datapoint is only in one doc, it might need to be repeated multiple times in the doc to be remembered
    • Number of repeats required is higher with smaller models
    • Not clear relationship between documents, batches, …
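
The "deduplicate, then keep what is likely under the target model but unlikely under a smaller one" step can be sketched as a ranking by a ratio of log-perplexities. This is a rough sketch under assumptions: the exact score form, the function names, and every number and string in the toy tables are made up for illustration, and real perplexities would come from running both models on each sample.

```python
import math
from typing import Callable, Iterable, List, Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def rank_candidates(samples: Iterable[str],
                    target_ppl: Callable[[str], float],
                    reference_ppl: Callable[[str], float]) -> List[str]:
    """Deduplicate, then sort so that samples the target model finds easy
    but the reference model finds hard come first.
    Assumes every perplexity is > 1 so that both logs are positive."""
    unique = list(dict.fromkeys(s.strip() for s in samples))

    def score(text: str) -> float:
        return math.log(reference_ppl(text)) / math.log(target_ppl(text))

    return sorted(unique, key=score, reverse=True)


# Hypothetical perplexities, invented for this example.
TARGET = {"the cat sat on the mat": 12.0,
          "id 3f2a-9c41-77de": 3.0,
          "hello world": 8.0}
REFERENCE = {"the cat sat on the mat": 15.0,
             "id 3f2a-9c41-77de": 400.0,
             "hello world": 9.0}

samples = ["the cat sat on the mat", "id 3f2a-9c41-77de",
           "hello world", "id 3f2a-9c41-77de "]
ranked = rank_candidates(samples, TARGET.__getitem__, REFERENCE.__getitem__)
print(ranked)
```

Generic English scores about the same under both models, so its ratio stays near 1. The random-looking id is cheap for the target model only, which pushes it to the top of the list for the manual web check.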

OpenAI DALL·E: Creating Images from Text

  • OpenAI DALL·E: Creating Images from Text (Blog Post Explained)
  • Generating pictures out of textual description
  • Idea: a GPT-3-like model generates image tile “hieroglyph” tokens, VQ-VAE’s decoder uses them as latent representation to create images
  • GPT-3 like language model:
    • One stream of tokens: first textual description tokens, then autocompletes/generates image tile hieroglyph tokens
    • Image tile hieroglyphs come from the vocabulary given by the VQ-VAE latent space codebook
    • Each tile token attends to only specific tile tokens (row, column, neighborhood) and all text tokens
  • VQ-VAE
    • Encoder: per image tile projects to latent space, selects closest vector (hieroglyph) from codebook
      • Pretrained as normal VAE, decoder possibly fine-tuned together with GPT-3 part
    • Decoder: creates image out of matrix of codebook latent vectors produced either by encoder (training) or GPT-3 (inference)
    • Codebook also trained w. encoder ~ essentially tile embedding to latent space, decoder ~ reverse embedding
  • Blog mentions continuous relaxation of the codebook, no need for it to be explicit, not sure what it means
  • 8192 Codebook vectors, trained; 32x32 tiles per image, image resolution 256x256
  • Outputs 512 images, reranked with another text-image matching model (CLIP)
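
The codebook lookup ("select closest vector from codebook") and the token arithmetic from the notes can be sketched as below. This is a minimal sketch, not the actual DALL·E code: the tiny 2-D codebook, the 2x2 grid of latents, and all function names are assumptions chosen to keep the example readable. Only the three constants (8192, 32, 256) come from the notes.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(vector: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codebook vector with the smallest squared Euclidean distance."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def quantize(tile_latents: Sequence[Sequence[Vector]],
             codebook: Sequence[Vector]) -> List[List[int]]:
    """Encoder side: one discrete token ("hieroglyph") per image tile."""
    return [[nearest_code(v, codebook) for v in row] for row in tile_latents]


def lookup(tokens: Sequence[Sequence[int]],
           codebook: Sequence[Vector]) -> List[List[Vector]]:
    """Decoder input: tokens mapped back to a matrix of codebook vectors."""
    return [[codebook[t] for t in row] for row in tokens]


# Toy codebook with 3 entries and a 2x2 grid of tile latents (made up).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
latents = [[[0.1, -0.1], [0.9, 0.2]],
           [[0.2, 0.8], [0.6, 0.1]]]
tokens = quantize(latents, codebook)
decoder_input = lookup(tokens, codebook)

# Sizes from the notes: 8192 codebook vectors, 32x32 tiles, 256x256 image.
CODEBOOK_SIZE, GRID, IMAGE_SIDE = 8192, 32, 256
image_tokens = GRID * GRID                  # tokens per image in the stream
pixels_per_tile_side = IMAGE_SIDE // GRID   # each tile covers this many pixels per side
bits_per_token = math.log2(CODEBOOK_SIZE)   # information carried by one token
print(tokens, image_tokens, pixels_per_tile_side, bits_per_token)
```

At inference the encoder is skipped: the language model emits the grid of token indices directly and only `lookup` plus the decoder are needed.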