Jigsaw Activity - Each team learns and teaches their assigned concepts
Fine-tuning is like learning a new card game. You don't re-learn what cards are or how to hold them - you just learn the new rules. The base knowledge transfers.
Attention is why transformers handle context so well. Each token can attend to every other token in the input, so the model captures relationships across the whole sentence.
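To make "every word attends to every other word" concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification, not how any library implements it: the 2-d embeddings are made up, and the queries, keys, and values are all the raw embeddings (real transformers apply learned projections first and use several heads).

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = vectors.

    Assumption for illustration: no learned projection matrices.
    """
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # Similarity of this token's query to every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        weights = softmax(scores)
        # Output is the attention-weighted average of all value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, vectors))
               for j in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

tokens = ["the", "cat", "sat"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up toy vectors
outputs, weights = self_attention(embeddings)
for tok, row in zip(tokens, weights):
    print(tok, [round(w, 2) for w in row])
```

Each printed row shows how much one token attends to every token in the sentence, and each row sums to 1.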
Q1: What's the main difference between pretraining and fine-tuning?
Q2: Why is attention important in transformers?
| Model | Parameters | Relative speed | SST-2 accuracy |
|---|---|---|---|
| DistilBERT | 66M | Fast | ~89% |
| BERT-base | 110M | Medium | ~92% |
| RoBERTa | 125M | Medium | ~94% |
| DeBERTa | 86-304M | Slower | ~95% |
| Situation | Recommended |
|---|---|
| Quick experiments | DistilBERT |
| Best accuracy | RoBERTa or DeBERTa |
| Limited GPU memory | DistilBERT, or a larger model fine-tuned with LoRA |
| Production (fast inference) | DistilBERT |
Model choice is a trade-off between accuracy and speed/resources. For the competition, start with DistilBERT for fast iteration, then try RoBERTa if you want higher accuracy.
Q1: Why might you choose DistilBERT over BERT-base?
Q2: Which model typically gets the highest accuracy on SST-2?
Tokenization is like spelling. Simple words are one piece ("cat"), but complex words get broken down ("un-happi-ness"). This way the model can handle ANY word, even ones it's never seen.
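The splitting described above can be sketched as a greedy longest-match lookup, in the style of WordPiece. This is an illustration under stated assumptions: `toy_vocab` is a hypothetical five-entry vocabulary, and the `##` prefix marks a piece that continues a word. Real tokenizers learn vocabularies of tens of thousands of pieces from data.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split (WordPiece-style sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so give up on the whole word.
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"cat", "un", "##happi", "##ness", "##s"}

print(wordpiece_tokenize("cat", toy_vocab))          # ['cat']
print(wordpiece_tokenize("unhappiness", toy_vocab))  # ['un', '##happi', '##ness']
print(wordpiece_tokenize("cats", toy_vocab))         # ['cat', '##s']
```

With this tiny vocabulary a word like "dog" falls back to `[UNK]`. Real vocabularies typically include individual characters as pieces, which is why the fallback is rarely needed in practice.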
The [CLS] token is crucial for classification. The model learns to pack the "summary" of the whole sentence into this one token, which then goes through the classification head.
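A rough sketch of what "goes through the classification head" means: take only the first position's hidden vector and apply a linear layer plus softmax. All numbers below are invented for illustration, `classify_from_cls` is a hypothetical helper, and real models use much larger vectors (some implementations also add an extra dense layer before the classifier).

```python
import math

def classify_from_cls(hidden_states, weights, bias):
    """Turn the [CLS] hidden vector into class probabilities."""
    cls_vector = hidden_states[0]  # [CLS] sits at the first position
    # One logit per class: dot product of a weight row with the [CLS] vector.
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, bias)]
    # Softmax converts logits to probabilities.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up final-layer outputs for the input "[CLS] great movie".
hidden_states = [
    [0.9, -0.2, 0.4],   # [CLS]
    [0.1, 0.3, -0.5],   # "great"
    [0.0, 0.7, 0.2],    # "movie"
]
head_weights = [[-1.0, 0.5, 0.0],   # row for class 0 (negative)
                [1.0, -0.5, 0.5]]   # row for class 1 (positive)
head_bias = [0.0, 0.0]

probs = classify_from_cls(hidden_states, head_weights, head_bias)
print([round(p, 3) for p in probs])
```

The head never reads the other positions directly. Their information reaches the prediction only through what attention has already mixed into the [CLS] vector.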
Q1: Why do modern tokenizers use subwords instead of whole words?
Q2: What is the [CLS] token used for in classification?
| Parameter | Range | Impact |
|---|---|---|
| learning_rate | 1e-5 to 5e-5 | HIGH |
| batch_size | 8, 16, 32 | MEDIUM |
| num_epochs | 2-5 | MEDIUM |
| warmup_ratio | 0.0-0.2 | LOW |
Learning rate is the MOST impactful hyperparameter. For fine-tuning, use small values (2e-5 = 0.00002) because we're making small adjustments to already-good weights.
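The parameters in the table interact: dataset size and batch_size determine the number of steps, and warmup_ratio decides how many of those steps are spent ramping up to the peak learning rate. Below is a minimal sketch of a linear warmup followed by linear decay. The function names and the example numbers (800 examples, batch size 16, 3 epochs) are assumptions chosen for illustration, not values from the competition.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    # The last batch may be smaller, so round up.
    return math.ceil(num_examples / batch_size)

def linear_warmup_decay(step, total_steps, peak_lr, warmup_ratio):
    """Learning rate at a given step: linear ramp up, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)

# Hypothetical run: 800 examples, batch size 16, 3 epochs.
total = steps_per_epoch(800, 16) * 3  # 50 steps per epoch, 150 in total
for step in (0, 7, 15, 75, 150):
    lr = linear_warmup_decay(step, total, peak_lr=2e-5, warmup_ratio=0.1)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

In this run the rate starts at 0, reaches the 2e-5 peak at step 15, and falls back to 0 at step 150. Starting small is one way to avoid unstable early updates to the pretrained weights.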
Q1: If you have 1000 training examples and batch size 10, how many steps per epoch?
Q2: Your loss is NaN after a few steps. What's likely wrong?
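LoRA with r=16 often achieves 95%+ of full fine-tuning accuracy while training only a small fraction of the parameters, typically well under 1% (the exact share depends on model size and which layers are adapted). Great for large models or limited hardware!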
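The parameter savings follow from simple arithmetic: instead of updating a full d_out x d_in weight matrix, LoRA trains two thin matrices of rank r. The sketch below counts parameters for a BERT-base-sized setup. The choice to adapt two attention matrices per layer and the rounded 110M total are assumptions for illustration, and the exact fraction changes with model size, rank, and which matrices are adapted.

```python
def lora_trainable_params(d_out, d_in, r):
    """Parameters in one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

hidden = 768          # BERT-base hidden size
rank = 16             # r from the text above

full_matrix = hidden * hidden
adapter = lora_trainable_params(hidden, hidden, rank)
print(full_matrix, adapter, f"{adapter / full_matrix:.2%}")  # 589824 24576 4.17%

# Assumed setup: 12 layers, adapters on 2 attention matrices per layer,
# measured against roughly 110M total parameters.
num_layers = 12
matrices_per_layer = 2
total_model_params = 110_000_000
total_adapter = num_layers * matrices_per_layer * adapter
fraction = total_adapter / total_model_params
print(total_adapter, f"{fraction:.2%}")  # 589824 0.54%
```

Per adapted matrix the saving is modest, but because most of the model's weights stay frozen and get no adapter at all, the trainable share of the whole model ends up far smaller.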
Overfitting is like memorizing answers for a specific test instead of understanding the material. You ace that exact test but fail any variation.
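In training logs, the "aced the exact test, failed the variation" pattern shows up as training loss that keeps falling while validation loss starts rising. A minimal early-stopping sketch can flag this. The loss values are invented for illustration, `best_epoch_with_early_stopping` is a hypothetical helper, and epochs are counted from 0.

```python
def best_epoch_with_early_stopping(val_losses, patience=2):
    """Return (best_epoch, stopped_early).

    Stops once validation loss has failed to improve for `patience` epochs.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return best_epoch, True
    return best_epoch, False

# Made-up losses: training keeps improving, validation turns around.
train_losses = [0.60, 0.40, 0.25, 0.15, 0.08]
val_losses = [0.55, 0.45, 0.42, 0.47, 0.55]

for epoch, (tr, va) in enumerate(zip(train_losses, val_losses)):
    print(f"epoch {epoch}: train={tr:.2f} val={va:.2f} gap={va - tr:.2f}")

result = best_epoch_with_early_stopping(val_losses, patience=2)
print(result)  # (2, True)
```

Here the validation loss is lowest at epoch 2, so that checkpoint is the one to keep. The widening gap after it is the signature of overfitting.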
Q1: How do you know if your model is overfitting?
Q2: You get an OOM (Out of Memory) error. What should you try first?