Week 17 - Day 2 - ML Engineering

Advanced NLP Engineering

Transformers Fine-Tuning & Multilingual Evaluation

3 Hours
Hugging Face Transformers
LoRA / PEFT
Ship-Ready Evidence

The Reviewable Training Run

"A training run is a reviewable artifact. Your PR should make it obvious: what you changed, why you changed it, how you measured it, and what still fails."

🧠

Tonight's Mission

Learn to make fine-tuning decisions, run reproducible training, and produce multilingual evaluation evidence that reviewers can audit.

Interactive Learning Modules

Master transformer fine-tuning through hands-on exploration

Learn
🎯

Fine-Tuning Strategy Selector

Interactive decision tree to choose between full fine-tuning and LoRA/PEFT based on your constraints.

  • Compute and memory constraints
  • Dataset size considerations
  • Deployment requirements
  • Parameter efficiency analysis
Learn
⚙️

TrainingArguments Explorer

Interactive guide to Hugging Face TrainingArguments. Understand each parameter's impact.

  • Batch size and learning rate
  • Evaluation and save strategies
  • Checkpointing options
  • Generate config snippets
Practice
🌍

Multilingual Evaluation Planner

Design slice-based evaluation plans that expose hidden failures across languages and scripts.

  • Language and script slices
  • Code-switching test cases
  • Failure mode categories
  • Export evaluation plan
Learn
🔤

Tokenization Visualizer

See how different languages and scripts are tokenized. Understand why tokenization matters for multilingual NLP.

  • Compare tokenization across languages
  • Visualize subword splits
  • Identify tokenization stress tests
  • Understand coverage issues
Challenge
📋

PR Evidence Builder

Generate the complete "evidence bundle" for a reviewable training PR. Create all required artifacts.

  • Config and metrics tables
  • Failure analysis templates
  • Limitations statements
  • Reproducibility commands
Learn
🔧

LoRA Parameter Calculator

Calculate and compare trainable parameters for LoRA vs full fine-tuning. Understand the trade-offs.

  • Rank and alpha settings
  • Parameter count comparison
  • Memory estimation
  • Config generator
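The parameter comparison can be worked by hand. The sketch below assumes the standard LoRA formulation, in which each adapted weight matrix of shape d_out x d_in gains two low-rank factors with rank * (d_in + d_out) parameters in total. The model dimensions (12 layers, hidden size 768, adapters on two projection matrices per layer) are hypothetical values chosen for illustration, not figures from this session.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter on a (d_out x d_in) weight matrix.

    LoRA learns two factors, A (rank x d_in) and B (d_out x rank), so the
    adapter adds rank * (d_in + d_out) trainable parameters. The alpha setting
    only scales the update and does not change this count.
    """
    return rank * (d_in + d_out)


def dense_params(d_in: int, d_out: int) -> int:
    """Parameters in the dense weight matrix itself (bias ignored)."""
    return d_in * d_out


# Hypothetical example: 12 layers, hidden size 768, adapters on two square
# projection matrices per layer, rank 8.
hidden, layers, adapted_per_layer, rank = 768, 12, 2, 8

lora_total = layers * adapted_per_layer * lora_trainable_params(hidden, hidden, rank)
dense_total = layers * adapted_per_layer * dense_params(hidden, hidden)

print(f"LoRA trainable params:             {lora_total:,}")
print(f"Dense params in the same matrices: {dense_total:,}")
print(f"Ratio:                             {lora_total / dense_total:.2%}")
```

Doubling the rank doubles the adapter size, while the dense count stays fixed, which is why rank is the main lever on trainable-parameter count.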
Competition
🏆

SST-2 Competition

Put your skills to the test! Fine-tune a model on SST-2 and compete for the highest accuracy.

  • Starter notebook template
  • Competition rules & scoring
  • Leaderboard submission
  • Real training practice
Prep
📚

Concepts Reference

Team learning activity - 5 teams each learn and teach critical concepts before the demo.

  • Foundation & Transformers
  • Models & Tokenization
  • Training & Hyperparameters
  • LoRA & Evaluation

Suggested Learning Path

  • 🎯 Strategy: 15 min
  • ⚙️ Training Args: 15 min
  • 🌍 Eval Slices: 20 min
  • 🔤 Tokenization: 10 min
  • 📋 Evidence PR: 20 min
  • 🏆 Competition: 30 min

🏆 SST-2 Fine-Tuning Competition

Use the tools above to optimize your training config, then compete for the highest accuracy!

Competition Rules | Python Script | Jupyter Notebook

Key Concepts

Essential terms for tonight's session

  • Fine-tuning: updating pretrained weights on task-specific data to adapt the model's behavior
  • LoRA: low-rank adapters that reduce the number of trainable parameters while keeping the base weights frozen
  • PEFT: Parameter-Efficient Fine-Tuning, a family of methods that adapt a model by training only a small number of parameters
  • Trainer: Hugging Face API that manages the training loop, evaluation, checkpoint saving, and logging
  • TrainingArguments: structured config for batch size, learning rate, epochs, and evaluation/save strategy
  • Evaluation Slice: subset of the eval data selected by a condition (language, script, domain)
  • Code-switching: input that mixes multiple languages within the same text
  • Reproducibility: ability to rerun training and evaluation and get consistent results

Code You'll Work With

Hugging Face Trainer pattern

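from transformers import Trainer, TrainingArguments

# Assumes model, train_ds, eval_ds, and compute_metrics are defined earlier.
args = TrainingArguments(
    output_dir="runs/experiment",    # checkpoints and logs are written here
    evaluation_strategy="epoch",     # newer transformers releases name this eval_strategy
    save_strategy="epoch",           # checkpoint at the same cadence as evaluation
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    seed=42,                         # fixed seed for reproducibility
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    compute_metrics=compute_metrics,  # called on each evaluation pass
)
trainer.train()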
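
The Trainer call above passes a compute_metrics function that the snippet does not define. A minimal sketch for a classification task such as SST-2 might look like the following. It assumes the evaluation output unpacks into (logits, labels) with one row of class scores per example, and it is written in plain Python so it accepts nested lists as well as NumPy arrays; in practice a NumPy argmax would be the more common choice.

```python
def compute_metrics(eval_pred):
    """Accuracy for a classification head.

    Assumes eval_pred unpacks to (logits, labels), where logits holds one row
    of class scores per example and labels holds the integer class ids.
    """
    logits, labels = eval_pred
    # Index of the highest score in each row is the predicted class.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Toy check with made-up scores: two of three predictions match the labels.
toy_logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.4]]
toy_labels = [1, 0, 0]
print(compute_metrics((toy_logits, toy_labels)))
```

The returned dictionary keys are what show up in the evaluation logs, so a stable key name such as "accuracy" makes runs easier to compare.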

Tonight's Learning Objectives

By the end of this session, you'll be able to:

🎯

Choose Fine-Tuning Strategy

Select full fine-tune vs LoRA based on compute, data, and deployment constraints

🔄

Run Reproducible Training

Use Trainer API with proper checkpoints, eval, and artifact saving
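
One way to make a run easier to reproduce is to save a small manifest next to the checkpoints. The sketch below is an assumption about what such a manifest could contain, not a format prescribed by this session; the function names and the manifest fields are hypothetical.

```python
import hashlib
import json
import platform


def config_fingerprint(config: dict) -> str:
    """Short, order-independent hash of a training config."""
    # Sorting keys makes the hash independent of dict insertion order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def build_run_manifest(config: dict) -> dict:
    """Bundle the config with enough context to rerun and compare it."""
    return {
        "config": config,
        "config_fingerprint": config_fingerprint(config),
        "python_version": platform.python_version(),
    }


# Values mirror the TrainingArguments example in this page.
config = {
    "output_dir": "runs/experiment",
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,
    "learning_rate": 2e-5,
    "seed": 42,
}

print(json.dumps(build_run_manifest(config), indent=2))
```

Two runs with the same fingerprint used the same config values, so a metric difference between them points at data, code, or nondeterminism rather than hyperparameters.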

🌍

Evaluate Multilingually

Create slice-based evaluation with per-language metrics and error analysis
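
Per-language metrics come down to grouping predictions by slice before scoring. The following is a minimal sketch; the record keys ("slice", "label", "prediction") and the slice names are hypothetical, and the records are made up for illustration.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """Return {slice_name: {"n": count, "accuracy": fraction_correct}}.

    Each record is a dict with the hypothetical keys "slice", "label",
    and "prediction".
    """
    counts = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        name = record["slice"]
        counts[name] += 1
        hits[name] += int(record["label"] == record["prediction"])
    return {
        name: {"n": counts[name], "accuracy": hits[name] / counts[name]}
        for name in sorted(counts)
    }


# Toy records, made up for illustration only.
records = [
    {"slice": "en", "label": 1, "prediction": 1},
    {"slice": "en", "label": 0, "prediction": 0},
    {"slice": "hi-Deva", "label": 1, "prediction": 0},
    {"slice": "hi-Deva", "label": 0, "prediction": 0},
    {"slice": "en+hi code-switched", "label": 1, "prediction": 0},
]

for name, stats in accuracy_by_slice(records).items():
    print(f"{name}: n={stats['n']} accuracy={stats['accuracy']:.2f}")
```

Reporting the count alongside each accuracy matters: a slice with very few examples can swing widely, and a single aggregate number over these toy records would hide that one slice is failing outright.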

📋

Produce Evidence

Generate reviewable PRs with config, metrics, failures, and limitations
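
A simple completeness check can catch a PR that is missing part of its evidence. The sketch below assumes the bundle is a dictionary keyed by the four sections named above; the section keys and the function name are hypothetical.

```python
REQUIRED_SECTIONS = ("config", "metrics", "failures", "limitations")


def missing_evidence(bundle: dict) -> list:
    """List required sections that are absent or contain only whitespace."""
    missing = []
    for section in REQUIRED_SECTIONS:
        value = bundle.get(section)
        if value is None or not str(value).strip():
            missing.append(section)
    return missing


# Made-up draft bundle: metrics are filled in, but failures are blank and
# limitations were never written.
draft = {
    "config": "runs/experiment, seed=42, 3 epochs",
    "metrics": "per-slice accuracy table",
    "failures": "   ",
}

print(missing_evidence(draft))
```

A check like this could run before opening the PR so that reviewers never see a bundle without a failure analysis or a limitations statement.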

🔤

Understand Tokenization

Identify multilingual tokenization issues and create stress tests
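
A quick stress test is to compare how many tokens a tokenizer spends per character across scripts. The sketch below takes any callable that maps a string to a list of tokens. The byte-level tokenizer used here is a toy stand-in so the example runs without downloading a model; in practice the tokenize method of a Hugging Face tokenizer could be passed instead. The sample strings are illustrative.

```python
def tokens_per_char(tokenize, text: str) -> float:
    """Average number of tokens per Unicode code point in text."""
    if not text:
        return 0.0
    return len(tokenize(text)) / len(text)


def byte_tokenize(text: str) -> list:
    """Toy stand-in tokenizer: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


# Short greetings in three scripts (Latin, Cyrillic, Devanagari).
samples = {
    "en": "hello",
    "ru": "привет",
    "hi": "नमस्ते",
}

for lang, text in samples.items():
    ratio = tokens_per_char(byte_tokenize, text)
    print(f"{lang}: {ratio:.2f} tokens per character")
```

Characters are counted rather than whitespace-separated words because some scripts do not separate words with spaces. A slice whose ratio is much higher than the others uses more of the sequence length for the same content and is a good candidate for a dedicated evaluation slice.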

📝

Document Decisions

Write ADRs for training choices that reviewers can audit