Team Learning: Critical Concepts

Jigsaw Activity - Each team learns and teaches their assigned concepts

How This Works

  1. Divide into 5 teams - Each team is assigned 2 concept areas
  2. Learn (10 min) - Study your team's concepts using this reference
  3. Prepare (5 min) - Decide who explains what, practice your explanations
  4. Teach (2 min per team) - Each team teaches their concepts to the class
  5. Quiz - Use the quiz questions to verify understanding
Total Time: ~25 minutes

Team 1: Foundation

  • Fine-Tuning vs Pretraining
  • Transfer Learning
  • Transformer Architecture
  • Attention Mechanism

Team 2: Models

  • BERT Family (DistilBERT, RoBERTa)
  • Model Selection Trade-offs
  • Parameters & Size
  • When to Use What

Team 3: Data

  • Tokenization
  • Subwords & Vocabulary
  • Special Tokens [CLS] [SEP]
  • Attention Masks & Padding

Team 4: Training

  • Epochs, Batches, Steps
  • Learning Rate & Warmup
  • Loss & Gradients
  • Hyperparameter Tuning

Team 5: Advanced

  • LoRA & PEFT
  • Evaluation Metrics
  • Overfitting
  • Hardware (GPU/MPS)

Team 1: Foundation Concepts

Teach-Back Checklist

🎓 Pretraining vs Fine-Tuning

  • Pretraining: Training on massive unlabeled data to learn general language. Takes weeks on hundreds of GPUs. Done by big labs (Google, Meta, OpenAI).
  • Fine-tuning: Adapting a pretrained model to a specific task using smaller labeled data. Takes minutes to hours on a single GPU.
  • Transfer Learning: Knowledge from pretraining (grammar, word meanings) transfers to new tasks. We don't start from scratch!

Analogy for the Class

Fine-tuning is like learning a new card game. You don't re-learn what cards are or how to hold them - you just learn the new rules. The base knowledge transfers.

🧠 Transformer & Attention

  • Transformer: Neural network architecture built on attention. Processes all tokens in parallel (unlike older RNNs). Powers BERT, GPT, and nearly all modern LLMs.
  • Attention: Mechanism letting each token "look at" every other token. Captures context: "bank" means different things near "river" vs "money".
  • Encoder: Reads input and creates representations. Used for understanding tasks (BERT). What we use for classification.
  • Decoder: Generates output token by token. Used for generation (GPT). Not used in our exercise.

Key Insight

Attention is why transformers understand context so well. Each word can attend to every other word, capturing relationships across the whole sentence.
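
To make "each word can attend to every other word" concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It assumes tiny hand-made 2-d vectors and uses the same vectors as queries, keys, and values; a real transformer first passes the embeddings through learned projection matrices, which this sketch leaves out.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this token's query with every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of all value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy vectors for three tokens (made-up numbers, not real embeddings).
tokens = ["river", "bank", "money"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = attention(vectors, vectors, vectors)
for tok, row in zip(tokens, weights):
    print(tok, [round(w, 2) for w in row])
```

Each printed row shows how strongly one token attends to all three tokens. With these toy vectors, "bank" puts more weight on "river" than on "money" because its vector was chosen to be closer to "river".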

Quiz Questions (to ask the class)

Q1: What's the main difference between pretraining and fine-tuning?

Pretraining learns general language from massive unlabeled data (weeks, many GPUs). Fine-tuning adapts to a specific task with smaller labeled data (minutes/hours, one GPU).

Q2: Why is attention important in transformers?

Attention lets each token look at every other token, capturing context and relationships. "Bank" can mean different things - attention helps the model understand which meaning based on surrounding words.

Team 2: Model Selection

Teach-Back Checklist

🤖 BERT Family Models

  Model        Params     Speed    SST-2 Acc
  DistilBERT   66M        Fast     ~89%
  BERT-base    110M       Medium   ~92%
  RoBERTa      125M       Medium   ~94%
  DeBERTa      86-304M    Slower   ~95%

  • DistilBERT: "Distilled" BERT - 40% smaller, 60% faster, 97% of performance. Perfect for demos and experimentation.
  • RoBERTa: BERT trained longer with more data and better methodology. Often the best "base" model for accuracy.
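
Parameter counts translate fairly directly into memory. The sketch below is a back-of-the-envelope estimate, assuming 32-bit floats (4 bytes per value) and, for training, an Adam-style optimizer that keeps gradients plus two extra buffers per weight. The function name is made up for this example, and activation memory (which grows with batch size and sequence length) is not included.

```python
def estimate_memory_mb(num_params, bytes_per_value=4, training=False):
    """Rough memory for model weights in megabytes (1 MB = 1e6 bytes here).

    With training=True it also counts gradients and two Adam moment
    buffers, i.e. 4 copies of the weights in total. Activations are NOT
    included and can dominate at large batch sizes.
    """
    copies = 4 if training else 1
    return num_params * bytes_per_value * copies / 1e6

# Parameter counts as quoted in the model table.
models = {"DistilBERT": 66_000_000, "BERT-base": 110_000_000, "RoBERTa": 125_000_000}
for name, n in models.items():
    print(f"{name}: ~{estimate_memory_mb(n):.0f} MB weights, "
          f"~{estimate_memory_mb(n, training=True):.0f} MB with gradients + Adam state")
```

Treat the numbers as a lower bound: they explain why a smaller model leaves more room for larger batches on the same GPU.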

⚖️ When to Use What

  Situation                     Recommended
  Quick experiments             DistilBERT
  Best accuracy                 RoBERTa or DeBERTa
  Limited GPU memory            DistilBERT or LoRA
  Production (fast inference)   DistilBERT

Key Insight

Model choice is a trade-off between accuracy and speed/resources. For the competition, start with DistilBERT for fast iteration, then try RoBERTa if you want higher accuracy.

Quiz Questions (to ask the class)

Q1: Why might you choose DistilBERT over BERT-base?

DistilBERT is 40% smaller and 60% faster while keeping 97% of BERT's performance. Better for quick experiments and when you have limited GPU memory or need fast inference.

Q2: Which model typically gets the highest accuracy on SST-2?

DeBERTa typically achieves the highest accuracy (~95%), but RoBERTa (~94%) is close and often more practical. The roughly one-point difference may not justify the slower training time.

Team 3: Tokenization & Data

Teach-Back Checklist

🔤 Tokenization Basics

  • Token: A unit of text the model processes. Can be a word, subword, or character. Models work with token IDs (numbers), not text.
  • Tokenizer: Converts text to token IDs. Each model has its OWN tokenizer - always use the matching one!
  • Subword: Breaking rare words into known pieces, e.g. "unhappiness" → ["un", "##happi", "##ness"] (illustrative; the exact split depends on the model's vocabulary). Lets the model handle words it has never seen.
  • Vocabulary: All tokens the model knows. Typically 30K-50K tokens. Built during pretraining.

Analogy for the Class

Tokenization is like spelling. Simple words are one piece ("cat"), but complex words get broken down ("un-happi-ness"). This way the model can handle ANY word, even ones it's never seen.
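
The "breaking down" step can be sketched as a greedy longest-match search, which is the general idea behind WordPiece-style tokenizers. This is a simplified illustration, assuming a tiny invented vocabulary; real tokenizers also lowercase, split punctuation, and use a vocabulary learned during pretraining.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, invented for this example (a real one has ~30K entries).
vocab = {"cat", "un", "happy", "##happi", "##ness", "##s"}
print(wordpiece("cat", vocab))          # ['cat']
print(wordpiece("unhappiness", vocab))  # ['un', '##happi', '##ness']
print(wordpiece("cats", vocab))         # ['cat', '##s']
print(wordpiece("xyz", vocab))          # ['[UNK]']
```

The last call shows the fallback when nothing matches. Real vocabularies usually include single characters, so the unknown token is rare in practice.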

🏷️ Special Tokens & Masks

  • [CLS]: Special token at the START. Its final representation is used for classification. "CLS" = classification.
  • [SEP]: Special token marking END of sequence (or separating two sequences). "SEP" = separator.
  • Attention Mask: Binary mask: 1 = real token, 0 = padding. Tells model which tokens to ignore.
  • Padding: Adding dummy tokens so all sequences in a batch have same length. Required for batching.
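
The four ideas above fit together in one small step that runs before every batch reaches the model. The sketch below assumes sentences that are already converted to token IDs; the special-token IDs are chosen to resemble a BERT-style vocabulary and the word IDs are made up, so in practice read them from your own tokenizer rather than hard-coding them.

```python
# Illustrative IDs (assumptions for this example, not guaranteed for your model).
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102

def encode_batch(sequences, pad_id=PAD_ID):
    """Wrap each sequence in [CLS] ... [SEP], then pad to the longest one."""
    wrapped = [[CLS_ID] + seq + [SEP_ID] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    input_ids, attention_mask = [], []
    for seq in wrapped:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Two already-tokenized sentences with made-up token IDs.
batch = [[2023, 3185, 2001, 2307], [6659]]
ids, mask = encode_batch(batch)
for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

Both rows end up the same length, and the mask row for the short sentence ends in zeros exactly where padding was added.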

Key Insight

The [CLS] token is crucial for classification. The model learns to pack the "summary" of the whole sentence into this one token, which then goes through the classification head.

Quiz Questions (to ask the class)

Q1: Why do modern tokenizers use subwords instead of whole words?

Subwords handle rare or made-up words. "Transformerification" could be tokenized as something like ["transform", "##er", "##ification"] even if the full word was never seen. A word-level tokenizer would have to map it to a single unknown token and lose its meaning.

Q2: What is the [CLS] token used for in classification?

The [CLS] token's final hidden state is used as the "sentence representation" for classification. The model learns to pack the meaning of the whole sentence into this one token during training.

Team 4: Training & Hyperparameters

Teach-Back Checklist

🔄 Training Loop Basics

  • Epoch: One complete pass through ALL training data. 3 epochs = seeing every example 3 times.
  • Batch: Group of examples processed together. Batch size 16 = 16 examples per forward pass.
  • Step: One weight update, made after one batch. Steps per epoch = dataset size / batch size, rounded up if the last batch is partial.
  • Loss: How wrong the model is. Lower = better. We use cross-entropy loss for classification.
  • Gradient: Direction in which the loss increases fastest. Weights are moved the opposite way to reduce loss. Computed via backpropagation.
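
Loss and gradient can be seen working together in a few lines. The sketch below is a toy, assuming a two-class problem where the "weights" are just the two output scores (logits) themselves; a real model has millions of weights and gets their gradients via backpropagation. The learning rate is exaggerated so one step is visible.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])

def grad_wrt_logits(logits, label):
    """Gradient of the loss w.r.t. each logit: softmax(logits) - one_hot(label)."""
    probs = softmax(logits)
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]

logits = [0.2, -0.1]  # made-up scores for [negative, positive]
label = 1             # the correct class is "positive"
lr = 0.5              # exaggerated for the demo; fine-tuning uses ~2e-5

before = cross_entropy(logits, label)
grads = grad_wrt_logits(logits, label)
# One step: move each value AGAINST its gradient.
logits = [z - lr * g for z, g in zip(logits, grads)]
after = cross_entropy(logits, label)
print(f"loss before: {before:.3f}, after one step: {after:.3f}")
```

The loss drops after the update because the step raises the score of the correct class and lowers the other one.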

🎛️ Hyperparameters

  Parameter       Range          Impact
  learning_rate   1e-5 to 5e-5   HIGH
  batch_size      8, 16, 32      MEDIUM
  num_epochs      2-5            MEDIUM
  warmup_ratio    0.0-0.2        LOW

  • Learning Rate: Size of weight updates. Too high = unstable (loss explodes). Too low = slow learning.
  • Warmup: Gradually increase LR at start. Prevents early instability. 10% warmup is common.
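
A warmup schedule is easy to write down as a function of the step number. The sketch below assumes linear warmup followed by linear decay to zero, which is one common choice; the decay part and the function name are assumptions for illustration, since the text above only describes the warmup.

```python
def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at a given step: linear warmup, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up from 0 to peak_lr over the warmup steps.
        return peak_lr * step / warmup_steps
    # After warmup, shrink linearly so the rate reaches 0 at the last step.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

total = 300  # e.g. 3 epochs of 100 steps
for step in (0, 15, 30, 165, 300):
    print(step, f"{lr_at_step(step, total):.2e}")
```

With 10% warmup over 300 steps, the rate climbs for the first 30 steps, peaks at 2e-5, and then falls steadily for the rest of training.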

Key Insight

Learning rate is the MOST impactful hyperparameter. For fine-tuning, use small values (2e-5 = 0.00002) because we're making small adjustments to already-good weights.

Quiz Questions (to ask the class)

Q1: If you have 1000 training examples and batch size 10, how many steps per epoch?

100 steps per epoch. Steps = dataset size / batch size = 1000 / 10 = 100. Each step processes one batch of 10 examples.
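
The same arithmetic as a small helper, for checking other dataset and batch sizes. It assumes the last partial batch is kept (so the division rounds up) and that every batch triggers one weight update; the function name is made up for this example.

```python
import math

def training_steps(num_examples, batch_size, num_epochs=1):
    """Return (steps per epoch, total steps), keeping the last partial batch."""
    per_epoch = math.ceil(num_examples / batch_size)
    return per_epoch, per_epoch * num_epochs

print(training_steps(1000, 10))     # (100, 100)
print(training_steps(1000, 16, 3))  # (63, 189): 62 full batches + one batch of 8
```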

Q2: Your loss is NaN after a few steps. What's likely wrong?

Learning rate is too high! The updates become so large that the weights or the loss overflow, and the result shows up as NaN. Try reducing the learning rate by 10x (e.g., 2e-5 → 2e-6).

Team 5: LoRA, Evaluation & Hardware

Teach-Back Checklist

🔧 LoRA & PEFT

  • PEFT: Parameter-Efficient Fine-Tuning. Methods that train only a small fraction of a model's parameters.
  • LoRA: Low-Rank Adaptation. Freezes pretrained weights, only trains small adapter matrices.
  • Rank (r): Inner dimension of the LoRA matrices. Higher = more capacity (and more trainable parameters). Common: 8, 16, 32.
  • Frozen Weights: Original model weights that don't update. Only adapters train.

Key Insight

LoRA with r=16 often achieves 95%+ of full fine-tuning accuracy while training only a small fraction of the parameters (typically well under 1%). Great for large models or limited hardware!
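
How small the trainable share is can be estimated by counting adapter parameters. The sketch below is a rough count under stated assumptions: a DistilBERT-like shape (6 layers, hidden size 768, about 66M parameters in total) with adapters on the query and value projections only. The exact fraction depends on model size, rank, and which matrices get adapters, and a classification head would add a few more trainable parameters.

```python
def lora_trainable_params(hidden_size, rank, num_layers, adapted_per_layer=2):
    """Count LoRA adapter parameters for square hidden x hidden weights.

    Each adapted matrix W gets two small matrices, A (rank x hidden) and
    B (hidden x rank), so the learned update B @ A has the same shape as W.
    """
    per_matrix = 2 * hidden_size * rank
    return per_matrix * adapted_per_layer * num_layers

# Assumed model shape for illustration (see the paragraph above).
total_params = 66_000_000
for r in (8, 16, 32):
    n = lora_trainable_params(768, r, num_layers=6)
    print(f"r={r}: {n:,} adapter params ({100 * n / total_params:.2f}% of the model)")
```

Doubling the rank doubles the adapter size, which is the capacity-versus-cost trade-off behind the choice of r.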

📊 Evaluation & Hardware

  • Accuracy: % correct predictions. Simple but can mislead with imbalanced classes.
  • F1 Score: Harmonic mean of precision and recall. Better for imbalanced data.
  • Overfitting: Model memorizes training data but fails on new data. Val loss increases while train loss decreases.
  • MPS: Metal Performance Shaders - Apple's GPU acceleration for M1/M2/M3/M4 Macs.
  • OOM: Out of Memory error. Reduce batch size or use LoRA.
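
The accuracy and F1 definitions above can be compared on a small imbalanced example. This is a minimal sketch in plain Python with made-up labels; libraries such as scikit-learn provide equivalent, more complete functions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up imbalanced labels: 9 negatives, 1 positive.
y_true = [0] * 9 + [1]
always_negative = [0] * 10  # a "model" that never predicts the positive class
print(accuracy(y_true, always_negative))  # 0.9
print(f1_score(y_true, always_negative))  # 0.0
```

A model that never finds the positive class still scores 90% accuracy here, while F1 drops to zero, which is why F1 is the safer metric on imbalanced data.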

Analogy for the Class

Overfitting is like memorizing answers for a specific test instead of understanding the material. You ace that exact test but fail any variation.
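
In practice, overfitting is spotted by watching validation loss per epoch. The sketch below assumes a simple "patience" rule (stop once validation loss has failed to improve for a set number of epochs in a row) and uses made-up loss values; the function name and the rule are illustrative, not the only way to do it.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the 1-based epoch with the best validation loss, once the loss
    has failed to improve for `patience` epochs in a row; None otherwise."""
    best_val, best_epoch, bad_epochs = float("inf"), None, 0
    for epoch, val in enumerate(val_losses, start=1):
        if val < best_val:
            best_val, best_epoch, bad_epochs = val, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch
    return None

# Made-up curves: training loss keeps falling, validation loss turns around.
train_losses = [0.60, 0.40, 0.28, 0.18, 0.10]
val_losses = [0.55, 0.42, 0.40, 0.45, 0.52]
print("best epoch:", early_stop_epoch(val_losses))  # best epoch: 3
```

Here the training loss keeps improving through epoch 5, but the checkpoint from epoch 3 is the one that generalizes best.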

Quiz Questions (to ask the class)

Q1: How do you know if your model is overfitting?

Training loss keeps decreasing but validation loss starts INCREASING. The gap between training and validation accuracy grows. The model is memorizing training data rather than learning general patterns.

Q2: You get an OOM (Out of Memory) error. What should you try first?

Reduce batch size! Smaller batches use less GPU memory. If that's not enough, try LoRA (fewer trainable parameters) or a smaller model like DistilBERT.
