Team Learning: Critical Concepts

Jigsaw Activity - Each team learns and teaches their assigned concepts

How This Works

Divide into 5 teams - Each team is assigned 2 concept areas
Learn (10 min) - Study your team's concepts using this reference
Prepare (5 min) - Decide who explains what, practice your explanations
Teach (2 min per team) - Each team teaches their concepts to the class
Quiz - Use the quiz questions to verify understanding

Total Time: ~25 minutes

Team 1: Foundation

Fine-Tuning vs Pretraining
Transfer Learning
Transformer Architecture
Attention Mechanism

Team 2: Models

BERT Family (DistilBERT, RoBERTa)
Model Selection Trade-offs
Parameters & Size
When to Use What

Team 3: Data

Tokenization
Subwords & Vocabulary
Special Tokens [CLS] [SEP]
Attention Masks & Padding

Team 4: Training

Epochs, Batches, Steps
Learning Rate & Warmup
Loss & Gradients
Hyperparameter Tuning

Team 5: Advanced

LoRA & PEFT
Evaluation Metrics
Overfitting
Hardware (GPU/MPS)

Team 1: Foundation Concepts

Teach-Back Checklist

Pretraining
Fine-tuning
Transfer Learning
Attention

🎓 Pretraining vs Fine-Tuning

Pretraining: Training on massive unlabeled data to learn general language. Takes weeks on hundreds of GPUs. Done by big labs (Google, Meta, OpenAI).

Fine-tuning: Adapting a pretrained model to a specific task using smaller labeled data. Takes minutes to hours on a single GPU.

Transfer Learning: Knowledge from pretraining (grammar, word meanings) transfers to new tasks. We don't start from scratch!

Analogy for the Class

Fine-tuning is like learning a new card game. You don't re-learn what cards are or how to hold them - you just learn the new rules. The base knowledge transfers.

🧠 Transformer & Attention

Transformer: Neural network architecture built on attention. Processes all tokens in parallel (unlike older RNNs, which read one token at a time). Powers BERT, GPT, and most modern LLMs.

Attention: Mechanism letting each token "look at" every other token. Captures context: "bank" means different things near "river" vs "money".

Encoder: Reads input and creates representations. Used for understanding tasks (BERT). What we use for classification.

Decoder: Generates output token by token. Used for generation (GPT). Not used in our exercise.

Key Insight

Attention is why transformers understand context so well. Each word can attend to every other word, capturing relationships across the whole sentence.
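To make "each token looks at every other token" concrete, here is a minimal sketch of scaled dot-product attention in plain Python. The three 2-dimensional vectors are made-up numbers, and the same vectors are reused as queries, keys, and values; a real transformer learns separate projection matrices for each, so treat this as an illustration rather than how BERT is implemented.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token's query against every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # The output is a weighted average of all value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy vectors (hypothetical values): "bank" is placed close to "river".
tokens = ["river", "bank", "money"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = attention(vectors, vectors, vectors)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row shows how much one token attends to every token in the sentence. With these toy vectors, "bank" gives more weight to "river" than to "money".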

Quiz Questions (to ask the class)

Q1: What's the main difference between pretraining and fine-tuning?

Pretraining learns general language from massive unlabeled data (weeks, many GPUs). Fine-tuning adapts to a specific task with smaller labeled data (minutes/hours, one GPU).

Q2: Why is attention important in transformers?

Attention lets each token look at every other token, capturing context and relationships. "Bank" can mean different things - attention helps the model understand which meaning based on surrounding words.

Team 2: Model Selection

Teach-Back Checklist

DistilBERT
BERT
RoBERTa
Trade-offs

🤖 BERT Family Models

Model	Params	Speed	SST-2 Acc
DistilBERT	66M	Fast	~89%
BERT-base	110M	Medium	~92%
RoBERTa	125M	Medium	~94%
DeBERTa	86-304M	Slower	~95%

DistilBERT: "Distilled" BERT - 40% smaller, 60% faster, 97% of performance. Perfect for demos and experimentation.

RoBERTa: BERT trained longer with more data and better methodology. Often the best "base" model for accuracy.

⚖️ When to Use What

Situation	Recommended
Quick experiments	DistilBERT
Best accuracy	RoBERTa or DeBERTa
Limited GPU memory	DistilBERT or LoRA
Production (fast inference)	DistilBERT

Key Insight

Model choice is a trade-off between accuracy and speed/resources. For the competition, start with DistilBERT for fast iteration, then try RoBERTa if you want higher accuracy.
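A rough back-of-the-envelope sketch of why parameter count matters for memory. It assumes 32-bit floats (4 bytes per parameter) and, for training, the Adam optimizer (weights + gradients + two optimizer moments = about 16 bytes per parameter). Activations are not counted, so real usage is higher; the helper name is hypothetical.

```python
def weight_memory_mb(num_params, bytes_per_param=4):
    """Approximate memory in megabytes for a given parameter count."""
    return num_params * bytes_per_param / 1e6

# Parameter counts taken from the table above.
models = {"DistilBERT": 66e6, "BERT-base": 110e6, "RoBERTa": 125e6}

for name, params in models.items():
    inference_mb = weight_memory_mb(params)                      # weights only
    training_mb = weight_memory_mb(params, bytes_per_param=16)   # assumed fp32 + Adam
    print(f"{name}: ~{inference_mb:.0f} MB to load, ~{training_mb:.0f} MB to train")
```

The numbers are only a lower bound, but they show why a smaller model leaves more room for larger batches on the same GPU.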

Quiz Questions (to ask the class)

Q1: Why might you choose DistilBERT over BERT-base?

DistilBERT is 40% smaller and 60% faster while keeping 97% of BERT's performance. Better for quick experiments and when you have limited GPU memory or need fast inference.

Q2: Which model typically gets the highest accuracy on SST-2?

DeBERTa typically achieves the highest accuracy (~95%), but RoBERTa (~94%) is close and often more practical. The roughly 1-point difference may not justify the slower training time.

Team 3: Tokenization & Data

Teach-Back Checklist

Tokenization
Subwords
[CLS] & [SEP]
Attention Mask

🔤 Tokenization Basics

Token: A unit of text the model processes. Can be a word, subword, or character. Models work with token IDs (numbers), not text.

Tokenizer: Converts text to token IDs. Each model has its OWN tokenizer - always use the matching one!

Subword: Breaking rare words into known pieces. For example, "unhappiness" might become ["un", "##happi", "##ness"] (the exact split depends on the tokenizer's vocabulary). Handles any word!

Vocabulary: All tokens the model knows. Typically 30K-50K tokens. Fixed when the tokenizer is built, before pretraining starts.

Analogy for the Class

Tokenization is like spelling. Simple words are one piece ("cat"), but complex words get broken down ("un-happi-ness"). This way the model can handle ANY word, even ones it's never seen.
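The splitting idea can be sketched as a greedy "longest piece first" search, in the style of WordPiece. The vocabulary below is a tiny hypothetical one chosen so the examples from this page work; a real BERT tokenizer has roughly 30K entries and may split these words differently.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split, WordPiece style."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no known piece fits
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"cat", "un", "##happi", "##ness", "transform", "##er", "##ification"}

print(wordpiece_tokenize("cat", TOY_VOCAB))                   # ['cat']
print(wordpiece_tokenize("unhappiness", TOY_VOCAB))           # ['un', '##happi', '##ness']
print(wordpiece_tokenize("transformerification", TOY_VOCAB))  # ['transform', '##er', '##ification']
```

The "##" prefix means "this piece continues the previous one", which is how the pieces can be glued back into the original word.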

🏷️ Special Tokens & Masks

[CLS]: Special token at the START. Its final representation is used for classification. "CLS" = classification.

[SEP]: Special token marking END of sequence (or separating two sequences). "SEP" = separator.

Attention Mask: Binary mask: 1 = real token, 0 = padding. Tells model which tokens to ignore.

Padding: Adding dummy tokens so all sequences in a batch have same length. Required for batching.

Key Insight

The [CLS] token is crucial for classification. The model learns to pack the "summary" of the whole sentence into this one token, which then goes through the classification head.
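Here is a small sketch of how special tokens, padding, and the attention mask fit together for one batch. The special-token IDs follow the bert-base-uncased convention ([PAD]=0, [CLS]=101, [SEP]=102); the word IDs in the batch are made up, and `encode_batch` is a hypothetical helper, not a library function.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # bert-base-uncased convention; illustrative here

def encode_batch(sequences):
    """Wrap each sequence in [CLS] ... [SEP], then pad to the longest one."""
    wrapped = [[CLS_ID] + seq + [SEP_ID] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    input_ids, attention_mask = [], []
    for seq in wrapped:
        pad = max_len - len(seq)
        input_ids.append(seq + [PAD_ID] * pad)             # dummy tokens at the end
        attention_mask.append([1] * len(seq) + [0] * pad)  # 1 = real, 0 = padding
    return input_ids, attention_mask

# Two "sentences" of different lengths (made-up token IDs).
batch = [[2023, 3185, 2003, 2307], [6659]]
ids, mask = encode_batch(batch)
for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

Both rows come out the same length, and the mask marks exactly which positions the model should ignore.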

Quiz Questions (to ask the class)

Q1: Why do modern tokenizers use subwords instead of whole words?

Subwords handle ANY word, even rare or made-up ones. "Transformerification" can be split into known pieces such as ["transform", "##er", "##ification"] even if the full word was never seen. A word-level tokenizer would have to replace it with an "unknown" token and lose its meaning.

Q2: What is the [CLS] token used for in classification?

The [CLS] token's final hidden state is used as the "sentence representation" for classification. The model learns to pack the meaning of the whole sentence into this one token during training.

Team 4: Training & Hyperparameters

Teach-Back Checklist

Epoch/Batch/Step
Learning Rate
Loss & Gradients
Hyperparameters

🔄 Training Loop Basics

Epoch: One complete pass through ALL training data. 3 epochs = seeing every example 3 times.

Batch: Group of examples processed together. Batch size 16 = 16 examples per forward pass.

Step: One weight update. Steps per epoch = dataset size / batch size, rounded up when the last batch is only partly full (assuming one update per batch).

Loss: How wrong the model is. Lower = better. We use cross-entropy loss for classification.

Gradient: How the loss changes as each weight changes. Computed via backpropagation. Weights are moved in the opposite direction to reduce loss.
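The arithmetic behind steps and loss can be sketched in a few lines. This assumes one weight update per batch (no gradient accumulation) and shows cross-entropy for a single example; the function names are hypothetical.

```python
import math

def steps_per_epoch(dataset_size, batch_size):
    """Weight updates per epoch; the last batch may be smaller than the rest."""
    return math.ceil(dataset_size / batch_size)

def cross_entropy(probabilities, true_label):
    """Loss for one example: negative log of the probability given to the true class."""
    return -math.log(probabilities[true_label])

# 1000 examples, batch size 16, 3 epochs.
per_epoch = steps_per_epoch(1000, 16)
print(per_epoch)      # 63 (62 full batches plus one partial batch)
print(per_epoch * 3)  # 189 total steps

# The model predicts [P(negative), P(positive)] = [0.1, 0.9].
print(round(cross_entropy([0.1, 0.9], true_label=1), 3))  # 0.105 (confident and right)
print(round(cross_entropy([0.1, 0.9], true_label=0), 3))  # 2.303 (confident and wrong)
```

Notice that a confident wrong answer is punished much more heavily than a confident right answer is rewarded.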

🎛️ Hyperparameters

Parameter	Range	Impact
learning_rate	1e-5 to 5e-5	HIGH
batch_size	8, 16, 32	MEDIUM
num_epochs	2-5	MEDIUM
warmup_ratio	0.0-0.2	LOW

Learning Rate: Size of weight updates. Too high = unstable (loss explodes). Too low = slow learning.

Warmup: Gradually increase LR at start. Prevents early instability. 10% warmup is common.

Key Insight

Learning rate is usually the MOST impactful hyperparameter. For fine-tuning, use small values (2e-5 = 0.00002) because we're making small adjustments to already-good weights.
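One common shape for warmup is a linear ramp up to the peak learning rate followed by a linear decay to zero. The sketch below assumes that shape; `learning_rate_at` is a hypothetical helper, and training libraries offer their own schedulers with other shapes.

```python
def learning_rate_at(step, total_steps, peak_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up from 0 to peak_lr over the warmup steps.
        return peak_lr * step / warmup_steps
    # Ramp down from peak_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

total = 1000  # e.g. 1000 training steps with 10% warmup
for step in (0, 50, 100, 550, 1000):
    print(step, f"{learning_rate_at(step, total):.1e}")
```

The learning rate starts at zero, reaches the peak at step 100 (the end of warmup), and is back to zero on the last step.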

Quiz Questions (to ask the class)

Q1: If you have 1000 training examples and batch size 10, how many steps per epoch?

100 steps per epoch. Steps = dataset size / batch size = 1000 / 10 = 100. Each step processes one batch of 10 examples.

Q2: Your loss is NaN after a few steps. What's likely wrong?

Learning rate is most likely too high! The updates are so large that the weights blow up and overflow to infinity, and the loss then becomes NaN. Try reducing the learning rate by 10x (e.g., 2e-5 → 2e-6).
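A toy sketch of why a too-high learning rate blows up. It minimizes loss(w) = w**2 with plain gradient descent; the learning rates are chosen for this toy problem only and are far larger than anything used for fine-tuning.

```python
import math

def run_gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimize loss(w) = w**2, whose gradient is 2 * w."""
    w = start
    for _ in range(steps):
        gradient = 2 * w
        w = w - learning_rate * gradient  # step against the gradient
    return w

print(run_gradient_descent(0.1))  # shrinks toward 0, the minimum
print(run_gradient_descent(1.5))  # every step overshoots and doubles |w|

# Once values overflow to infinity, further arithmetic yields NaN.
print(math.isnan(float("inf") - float("inf")))  # True
```

With the small rate, w moves steadily toward the minimum. With the large rate, each update overshoots so far that w ends up further away than it started, and in a real network that runaway growth eventually shows up as NaN.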

Team 5: LoRA, Evaluation & Hardware

Teach-Back Checklist

LoRA basics
Accuracy vs F1
Overfitting
GPU/MPS

🔧 LoRA & PEFT

PEFT: Parameter-Efficient Fine-Tuning. Methods that update only a small fraction of a model's parameters.

LoRA: Low-Rank Adaptation. Freezes pretrained weights, only trains small adapter matrices.

Rank (r): Size of LoRA matrices. Higher = more capacity. Common: 8, 16, 32.

Frozen Weights: Original model weights that don't update. Only adapters train.

Key Insight

LoRA with r=16 often achieves 95%+ of full fine-tuning accuracy while training well under 1% of the parameters (the exact fraction depends on model size, rank, and which layers get adapters). Great for large models or limited hardware!
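To see where the savings come from, this sketch counts LoRA's trainable parameters. It assumes BERT-base-like shapes (12 layers, hidden size 768) with adapters on the query and value projections only, which is a common setup but an assumption here. The classification head, which is also trained in practice, is left out.

```python
def lora_parameters(d_in, d_out, rank):
    """LoRA learns B (d_out x rank) and A (rank x d_in) instead of a full d_out x d_in update."""
    return rank * (d_in + d_out)

# Assumed configuration (not taken from the exercise itself).
hidden, layers, adapted_per_layer, rank = 768, 12, 2, 16

trainable = layers * adapted_per_layer * lora_parameters(hidden, hidden, rank)
total = 110_000_000  # rough BERT-base size from the model table

print(trainable)                   # 589824
print(f"{trainable / total:.2%}")  # 0.54%
```

Halving the rank halves the adapter size, and larger base models make the fraction smaller still.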

📊 Evaluation & Hardware

Accuracy: % correct predictions. Simple but can mislead with imbalanced classes.

F1 Score: Harmonic mean of precision and recall. Better for imbalanced data.

Overfitting: Model memorizes training data but fails on new data. Val loss increases while train loss decreases.

MPS: Metal Performance Shaders, Apple's GPU acceleration for M1/M2/M3/M4 Macs.

OOM: Out of Memory error. Reduce batch size or use LoRA.
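Going back to accuracy versus F1: the sketch below shows how accuracy can mislead on imbalanced data. The labels are made up (90 negatives, 10 positives), and the "model" simply predicts negative every time.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives means precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up imbalanced dataset: 90 negatives, 10 positives.
y_true = [0] * 90 + [1] * 10
always_negative = [0] * 100

print(accuracy(y_true, always_negative))  # 0.9
print(f1_score(y_true, always_negative))  # 0.0
```

A model that never finds a single positive still scores 90% accuracy, while F1 drops to zero and exposes the problem.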

Analogy for the Class

Overfitting is like memorizing answers for a specific test instead of understanding the material. You ace that exact test but fail any variation.
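One simple way to act on this is to watch validation loss and keep the epoch where it was lowest. Below is a minimal early-stopping sketch; the loss values are invented for illustration, and `best_epoch` is a hypothetical helper.

```python
def best_epoch(val_losses, patience=2):
    """Index of the best epoch; stop after `patience` epochs in a row without improvement."""
    best_idx, best_loss, bad_epochs = 0, float("inf"), 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss, bad_epochs = i, loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # validation loss keeps rising: likely overfitting
    return best_idx

# Made-up curves: training loss keeps falling, validation loss turns upward.
train_losses = [0.60, 0.40, 0.25, 0.15, 0.08]
val_losses = [0.55, 0.42, 0.40, 0.45, 0.52]

print(best_epoch(val_losses))  # 2 (the third epoch, counting from 0)
```

After the third epoch, training loss keeps improving but validation loss gets worse, which is the overfitting signature described above.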

Quiz Questions (to ask the class)

Q1: How do you know if your model is overfitting?

Training loss keeps decreasing but validation loss starts INCREASING. The gap between training and validation accuracy grows. The model is memorizing training data rather than learning general patterns.

Q2: You get an OOM (Out of Memory) error. What should you try first?

Reduce batch size! Smaller batches use less GPU memory. If that's not enough, try LoRA (fewer trainable parameters) or a smaller model like DistilBERT.
