MYTHOS: Genesis

Mastering Your Tokenized Heroes with Optimized Sequences

Overview

In this exam, you will reimplement and retrain the transformer component of the Parti (Pathways Autoregressive Text-to-Image) generator (blog, paper). For simplicity, we focus solely on the transformer and remove text conditioning. You will train the model to predict the next image token from the previous tokens in an autoregressive manner.

Since the Parti paper is quite hard, you may instead read this one: Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation


Objectives

  1. Rewrite the Transformer:

    • Implement the transformer architecture from scratch. You are allowed to use torch.nn.functional.scaled_dot_product_attention from PyTorch.
    • Include features such as multi-head attention, feed-forward layers, and positional encodings.
    • There is no text conditioning, so this is a decoder-only transformer (a sketch follows this list).
  2. Train the Transformer:

    • Train your rewritten transformer to model the image generation process as an autoregressive task.
    • Use the image tokens from the provided pre-tokenized dataset for training.
  3. Evaluate the Transformer:

    • Measure the quality of the trained model by assessing the likelihood of token sequences on a validation set.
    • Optionally, generate images by sampling tokens from the trained model.
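
To make objective 1 concrete, here is a minimal sketch of such a decoder-only transformer in PyTorch. It is illustrative only: the hyperparameters (d_model, n_layers, n_heads) are arbitrary assumptions, and vocab_size=1025 / seq_len=67 simply follow the dataset description below.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DecoderBlock(nn.Module):
        def __init__(self, d_model=256, n_heads=8):
            super().__init__()
            self.n_heads = n_heads
            self.qkv = nn.Linear(d_model, 3 * d_model)
            self.proj = nn.Linear(d_model, d_model)
            self.ff = nn.Sequential(
                nn.Linear(d_model, 4 * d_model), nn.GELU(),
                nn.Linear(4 * d_model, d_model))
            self.ln1 = nn.LayerNorm(d_model)
            self.ln2 = nn.LayerNorm(d_model)

        def forward(self, x):
            b, t, d = x.shape
            q, k, v = self.qkv(self.ln1(x)).chunk(3, dim=-1)
            # Split into heads: (b, n_heads, t, head_dim).
            q, k, v = (z.view(b, t, self.n_heads, -1).transpose(1, 2) for z in (q, k, v))
            # Causal self-attention: each position attends only to earlier positions.
            att = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            x = x + self.proj(att.transpose(1, 2).reshape(b, t, d))
            return x + self.ff(self.ln2(x))

    class ImageGPT(nn.Module):
        def __init__(self, vocab_size=1025, seq_len=67, d_model=256,
                     n_layers=6, n_heads=8):
            super().__init__()
            self.tok_emb = nn.Embedding(vocab_size, d_model)
            self.pos_emb = nn.Embedding(seq_len, d_model)  # learned positional encoding
            self.blocks = nn.ModuleList(
                DecoderBlock(d_model, n_heads) for _ in range(n_layers))
            self.ln_f = nn.LayerNorm(d_model)
            self.head = nn.Linear(d_model, vocab_size)

        def forward(self, idx):
            # idx: (batch, seq) integer token codes.
            pos = torch.arange(idx.size(1), device=idx.device)
            x = self.tok_emb(idx) + self.pos_emb(pos)
            for block in self.blocks:
                x = block(x)
            return self.head(self.ln_f(x))  # (batch, seq, vocab) logits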

Dataset

You are provided with a pre-tokenized image dataset: each image has been tokenized into a sequence of discrete codes using a VQ-VAE.

LINK TO DATASET

Each line in the dataset is a tag followed by a sequence of 67 tokens representing an image. Each token is an integer code in [0, 1024].

Example:

000d1f2aa107dbed4cea301effa6e0d8.png 0 152 276 348 643 1005 99 180 816 265 348 557 721 754 429 847 146 560 693 453 224 234 180 123 729 555 759 701 447 467 400 117 435 152 570 90 644 623 135 670 485 408 364 403 393 318 702 471 344 609 301 926 707 609 629 717 309 49 366 49 791 110 791 366 366 576 7

You MUST NOT learn the tag information. The tag is only provided as a name for the image.
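
For reference, here is a minimal loading sketch that discards the tag so it cannot leak into training. The path "train.txt" is a placeholder for the actual dataset file.

    import torch
    from torch.utils.data import Dataset

    class ImageTokenDataset(Dataset):
        """Each line: '<tag> tok_1 ... tok_67'. The tag is dropped, never modeled."""
        def __init__(self, path):
            self.seqs = []
            with open(path) as f:
                for line in f:
                    parts = line.split()
                    if not parts:
                        continue
                    # parts[0] is the tag (image name); keep only the token codes.
                    self.seqs.append(torch.tensor([int(t) for t in parts[1:]],
                                                  dtype=torch.long))

        def __len__(self):
            return len(self.seqs)

        def __getitem__(self, i):
            return self.seqs[i]

    # Usage (placeholder path): train_set = ImageTokenDataset("train.txt")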


Task Requirements

1. Rewriting the Transformer

2. Training the Transformer

3. Evaluating the Transformer
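
As a concrete reference for requirement 2, here is a hedged training-loop sketch for the next-token objective. It assumes the ImageGPT model and ImageTokenDataset from the sketches above; the optimizer, learning rate, batch size, and epoch count are illustrative, not recommended values.

    import torch
    import torch.nn.functional as F
    from torch.utils.data import DataLoader

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = ImageGPT().to(device)              # sketch classes defined above
    loader = DataLoader(ImageTokenDataset("train.txt"),
                        batch_size=64, shuffle=True)
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

    for epoch in range(10):
        for seq in loader:                     # seq: (batch, 67) token codes
            seq = seq.to(device)
            logits = model(seq[:, :-1])        # predict token t from tokens < t
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),  # (batch*(seq-1), vocab)
                seq[:, 1:].reshape(-1))               # targets shifted by one
            opt.zero_grad()
            loss.backward()
            opt.step()
        print(f"epoch {epoch}: last-batch loss {loss.item():.3f}")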


Submission Requirements

You must submit your code and your predicted token sequences for the test-set prefixes.

Evaluation Criteria

Your submission will be evaluated based on performance: the quality of the trained model. You are provided with a test set containing only sequence prefixes; you must predict the end of each sequence. If your model is properly trained, the completed sequences will represent valid images. You are graded automatically on the likelihood of your predictions (i.e., on whether your generated images are likely to be real images, NOT on their aesthetics or visual quality).
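
To illustrate what likelihood-based evaluation can look like (the exact grading procedure may differ), here is a hedged sketch that measures average per-token negative log-likelihood on a validation set and samples completions for test prefixes, reusing the sketch model above.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def validation_nll(model, loader, device="cuda"):
        """Average per-token negative log-likelihood; lower is better."""
        model.eval()
        total, count = 0.0, 0
        for seq in loader:
            seq = seq.to(device)
            logits = model(seq[:, :-1])
            total += F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                     seq[:, 1:].reshape(-1),
                                     reduction="sum").item()
            count += seq[:, 1:].numel()
        return total / count

    @torch.no_grad()
    def complete(model, prefix, seq_len=67, temperature=1.0, device="cuda"):
        """Autoregressively extend (batch, prefix_len) prefixes to seq_len tokens."""
        model.eval()
        seq = prefix.to(device)
        while seq.size(1) < seq_len:
            logits = model(seq)[:, -1] / temperature  # logits for the next position
            probs = F.softmax(logits, dim=-1)
            seq = torch.cat([seq, torch.multinomial(probs, 1)], dim=1)
        return seq

Lower temperatures make completions greedier and usually more likely under the model, which matters here since grading is by likelihood rather than visual quality.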


Guidelines


Additional Notes


BONUS POINTS: Model interpretation

Get up to 5 bonus points by doing some research on the image encodings. Try to understand how the model learned to encode and decode images; one possible starting point is sketched below.
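
One hedged starting point, assuming your trained model exposes its token embedding table (tok_emb in the sketch above): check which codes the model treats as interchangeable by comparing their learned embeddings.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def nearest_codes(model, code, k=5):
        """Return the k codes whose embeddings are most cosine-similar to `code`."""
        emb = F.normalize(model.tok_emb.weight, dim=-1)  # (vocab, d_model), unit rows
        sims = emb @ emb[code]                           # cosine similarity to `code`
        sims[code] = -1.0                                # exclude the code itself
        return sims.topk(k).indices.tolist()

    # e.g. nearest_codes(model, 366): do codes that co-occur in images cluster together?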

Bonus points will be awarded based on the depth of your analysis and the quality of your findings.


Good luck, and may your transformers generate impressive results!