# MYTHOS: Genesis

## Mastering Your Tokenized Heroes with Optimized Sequences

### Overview

In this exam, you will work on **reimplementing and retraining the transformer component** of the PARTI (Pathways Autoregressive Text-to-Image) generator ([blog](https://sites.research.google/parti/), [paper](https://arxiv.org/abs/2206.10789)). For simplicity, we will focus solely on the transformer and remove text conditioning. You will train the model to predict the next image token based on previous tokens in an autoregressive manner.

Since PARTI is quite a dense paper, you may read this one instead: [Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation](https://arxiv.org/abs/2406.06525)

---

### Objectives

1. **Rewrite the Transformer:**
   - Implement the transformer architecture from scratch. You are allowed to use `scaled_dot_product_attention` from PyTorch.
   - Include features like multi-head attention, feed-forward layers, and positional encodings.
   - There is no text conditioning, so this is a decoder-only transformer.

2. **Train the Transformer:**
   - Train your rewritten transformer to model the image generation process as an autoregressive task.
   - Use the provided image tokens from a pre-tokenized dataset for training.

3. **Evaluate the Transformer:**
   - Measure the quality of the trained model by assessing the likelihood of token sequences on a validation set.
   - Optionally, generate images by sampling tokens from the trained model.

---

### Dataset

You are provided with **a pre-tokenized image dataset**. This dataset contains image sequences tokenized into discrete codes using a VQ-VAE.

[LINK TO DATASET](https://files.vermeille.fr/dataset.txt)

Each line in the dataset is a tag followed by a sequence of 67 tokens representing an image. Each token is an integer code in [0, 1024]. Example:

```
000d1f2aa107dbed4cea301effa6e0d8.png 0 152 276 348 643 1005 99 180 816 265 348 557 721 754 429 847 146 560 693 453 224 234 180 123 729 555 759 701 447 467 400 117 435 152 570 90 644 623 135 670 485 408 364 403 393 318 702 471 344 609 301 926 707 609 629 717 309 49 366 49 791 110 791 366 366 576 7
```

You MUST NOT learn the tag information. The tag is only provided as a name for the image.

---

### Task Requirements

#### 1. Rewriting the Transformer
- Implement the transformer model architecture.
- Again, unlike the referenced text-to-image papers, remove the text-conditioning pathway.

#### 2. Training the Transformer
- Use the provided training set to optimize the model.
- Implement teacher forcing to train the model in an autoregressive manner.

#### 3. Evaluating the Transformer
- You should calculate the **perplexity** of the model on a validation set.
- (Optional) Generate sample token sequences from the trained model and decode them back into images using the VQ-VAE.

The sketches below illustrate one possible shape for each of these steps; all class names, sizes, and file paths in them are placeholders, not a required interface.
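To make the data format concrete, here is a minimal sketch of a PyTorch dataset that parses the file described above: one image per line, a tag that is read only to be discarded, then 67 integer tokens. The class name and the choice to keep all sequences in memory are illustrative, not requirements.

```python
# A minimal sketch of loading the pre-tokenized dataset described above.
# Assumes one image per line: a tag (ignored) followed by 67 integer tokens.
import torch
from torch.utils.data import Dataset


class TokenizedImageDataset(Dataset):
    def __init__(self, path: str):
        self.sequences = []
        with open(path) as f:
            for line in f:
                parts = line.split()
                if not parts:
                    continue
                # parts[0] is the tag (image name); it must not be used for training.
                tokens = [int(t) for t in parts[1:]]
                self.sequences.append(torch.tensor(tokens, dtype=torch.long))

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return self.sequences[idx]
```

How you split `dataset.txt` into training and validation subsets is up to you; a simple random split by line is enough.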
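For the architecture itself, the sketch below shows one possible decoder-only layout: token embeddings plus learned positional embeddings, a stack of pre-norm blocks with multi-head causal self-attention (using PyTorch's `scaled_dot_product_attention`, which is explicitly allowed) and a feed-forward layer, and a linear head over the vocabulary. The dimensions, layer count, pre-norm layout, and the vocabulary size of 1025 (tokens in [0, 1024]) are assumptions you should adjust.

```python
# A minimal decoder-only transformer sketch in PyTorch. Only
# F.scaled_dot_product_attention is taken from the library; the rest of the
# architecture is written out explicitly. All sizes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (B, n_heads, T, head_dim) for multi-head attention.
        shape = (B, T, self.n_heads, C // self.n_heads)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        # is_causal=True applies the autoregressive (causal) mask.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, C)
        return self.proj(out)


class DecoderBlock(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = CausalSelfAttention(dim, n_heads)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        x = x + self.ff(self.norm2(x))
        return x


class ImageTokenTransformer(nn.Module):
    def __init__(self, vocab_size=1025, seq_len=67, dim=256, n_heads=8, n_layers=6):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(seq_len, dim)  # learned positional encoding
        self.blocks = nn.ModuleList(
            [DecoderBlock(dim, n_heads) for _ in range(n_layers)]
        )
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):  # tokens: (B, T) of int64
        T = tokens.shape[1]
        pos = torch.arange(T, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        for block in self.blocks:
            x = block(x)
        return self.head(self.norm(x))  # logits: (B, T, vocab_size)
```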
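Training and evaluation can then follow the usual next-token recipe: teacher forcing feeds the ground-truth prefix and predicts a shifted-by-one target, and validation perplexity is the exponential of the mean cross-entropy. A minimal sketch, assuming the two classes above; `train.txt` and `val.txt` stand for your own split of the provided `dataset.txt`, and all hyperparameters are placeholders.

```python
# A minimal training / evaluation sketch, assuming TokenizedImageDataset and
# ImageTokenTransformer from the previous sketches. File names and
# hyperparameters are placeholders.
import math
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ImageTokenTransformer().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
train_loader = DataLoader(TokenizedImageDataset("train.txt"), batch_size=64, shuffle=True)
val_loader = DataLoader(TokenizedImageDataset("val.txt"), batch_size=64)


def next_token_loss(model, seq):
    # Teacher forcing: the model sees tokens 1..T-1 as input and is trained to
    # predict tokens 2..T at the corresponding positions.
    inputs, targets = seq[:, :-1], seq[:, 1:]
    logits = model(inputs)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))


for epoch in range(10):
    model.train()
    for seq in train_loader:
        loss = next_token_loss(model, seq.to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Validation perplexity = exp(mean next-token cross-entropy).
    model.eval()
    with torch.no_grad():
        losses = [next_token_loss(model, seq.to(device)).item() for seq in val_loader]
    print(f"epoch {epoch}: val perplexity = {math.exp(sum(losses) / len(losses)):.2f}")
```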
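Finally, generating sample token sequences, whether continuing a given prefix or starting from a short seed, reduces to repeatedly sampling from the model's predicted distribution over the next token. A minimal sketch, assuming the model above and 67-token sequences:

```python
# A minimal sketch of completing a token prefix by autoregressive sampling,
# assuming the ImageTokenTransformer above. `prefix` is a 1-D LongTensor.
import torch


@torch.no_grad()
def complete(model, prefix, total_len=67, temperature=1.0):
    model.eval()
    seq = prefix.unsqueeze(0)                        # (1, prefix_len)
    while seq.shape[1] < total_len:
        logits = model(seq)[:, -1, :] / temperature  # logits for the next token
        probs = torch.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        seq = torch.cat([seq, next_tok], dim=1)
    return seq.squeeze(0)
```

Greedy decoding (argmax) or top-k sampling are common alternatives to plain multinomial sampling; which works best here is worth checking against validation perplexity.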
---

### Submission Requirements

You must submit your

---

### Evaluation Criteria

Your submission will be evaluated based on **Performance**: the quality of the trained model. You are provided with a test set to evaluate your model. The test set contains only sequence prefixes, and you need to predict the end of each sequence. If your model is properly trained, those completions will represent valid images. You are graded automatically on the likelihood of your predictions (i.e., on whether your generated images are likely to be real images, NOT on the images' aesthetics or quality).

---

### Guidelines

- You may use a standard deep learning framework (e.g., PyTorch) to build your model; you are not required to implement low-level operations like attention from scratch (`scaled_dot_product_attention` is allowed, as noted above).
- Pay attention to hyperparameters like learning rate, dropout, and attention heads, as they significantly affect training stability.
- Collaborating with others is not allowed. Your work must be your own.
- You may consult course materials and public resources but must cite any external references used.

---

### Additional Notes

- You are **not** required to implement the VQ-VAE tokenization or decoding process; this will be provided.

---

# BONUS POINTS: Model interpretation

Get up to 5 bonus points by doing some research on the image encodings. Try to understand how the model learned to encode and decode images:

- What do tokens represent?
- How do they relate to the original image?
- Are some tokens favored in some locations? Does it depend on the image?
- Do the tokens impact the image locally or globally? Does it depend on the token?
- Is there some kind of grammar? How do tokens relate to each other?
- Can you find some interesting patterns in the tokens?
- etc.

Bonus points will be awarded based on the depth of your analysis and the quality of your findings.

---

Good luck, and may your transformers generate impressive results!