W19-W20 Concepts Reference

Team Learning Activity - Jigsaw Format

📚 Jigsaw Learning Activity

Each team will become experts in their assigned topic area, then teach the concepts to other teams. Complete the checklist and quiz to verify understanding before teaching.

🎯 Team 1: Reinforcement Learning Basics

Master the fundamentals of RL and the CartPole environment.

Agent

The learner/decision-maker that interacts with the environment. Takes actions based on observations.

agent.predict(observation) -> action

Environment

The world the agent interacts with. Receives actions, returns observations and rewards.

env.step(action) -> obs, reward, done, info

Observation

Information the agent receives about the current state. In CartPole: [cart_pos, cart_vel, pole_angle, pole_vel], where pole_vel is the pole's angular velocity.

Action

What the agent does. In CartPole: 0 (push left) or 1 (push right). Discrete action space.

Reward

Feedback signal. CartPole gives +1 for each timestep the pole stays upright. Goal: maximize cumulative reward.

Episode

One complete interaction from start to terminal state. In CartPole, ends when the pole tilts too far, the cart leaves the track, or the max number of steps is reached.
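
The loop formed by these six concepts can be sketched in a few lines. This is a minimal illustration only: `ToyBalanceEnv` is a hypothetical stand-in, not the real CartPole physics, and the two policies are hand-written rather than learned. What matters is the shape of the loop: observe, act, receive a reward, repeat until the episode ends.

```python
import random


class ToyBalanceEnv:
    """Hypothetical stand-in for CartPole: keep a 1-D position near zero."""

    def __init__(self, limit=3, max_steps=50):
        self.limit = limit          # episode ends if |position| exceeds this
        self.max_steps = max_steps  # episode also ends after this many steps

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position  # initial observation

    def step(self, action):
        # Action 0 pushes left, action 1 pushes right (same coding as CartPole).
        self.position += 1 if action == 1 else -1
        self.steps += 1
        fell = abs(self.position) > self.limit
        done = fell or self.steps >= self.max_steps
        reward = 1.0  # +1 for every timestep taken
        return self.position, reward, done, {}


def run_episode(env, policy):
    """Run one episode and return the cumulative reward."""
    obs = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(obs)
        obs, reward, done, _info = env.step(action)
        total_reward += reward
    return total_reward


def random_policy(_obs):
    return random.choice([0, 1])


def centering_policy(obs):
    # Push back toward zero: right if we are left of centre, otherwise left.
    return 1 if obs < 0 else 0


env = ToyBalanceEnv()
print("random agent:", run_episode(env, random_policy))
print("centering agent:", run_episode(env, centering_policy))
```

The centering policy survives until the step limit, while the random policy usually ends early. That gap in cumulative reward is what training is meant to close.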

✅ Teach-Back Checklist

  • Explain the agent-environment loop
  • Describe CartPole's observation space
  • Explain the reward structure
  • Define what "solving" CartPole means
  • Draw the RL loop diagram

🔬 Quick Quiz

1. In CartPole, what happens when the agent takes action=1?

Push cart left
Push cart right
Do nothing

2. What reward does CartPole give per timestep?

+1 for staying upright
+10 for balancing perfectly
Varies by pole angle

🤖 Team 2: PPO Algorithm

Understand Proximal Policy Optimization and its key hyperparameters.

Policy

A function that maps observations to actions. PPO learns a neural network policy that improves over time.
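
To make "maps observations to actions" concrete, here is a minimal sketch of a stochastic policy. It is a deliberate simplification: a single linear layer followed by a softmax instead of the neural network PPO would train. The weights are hypothetical, chosen by hand, and the example assumes a positive pole angle means the pole leans right.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def action_probabilities(weights, observation):
    """One row of weights per action; each row scores the observation."""
    logits = [sum(w * o for w, o in zip(row, observation)) for row in weights]
    return softmax(logits)


def sample_action(weights, observation, rng=random):
    """Draw an action at random according to the policy's probabilities."""
    probs = action_probabilities(weights, observation)
    return rng.choices(range(len(probs)), weights=probs)[0]


# Hypothetical weights: favour pushing right when the pole leans right.
weights = [
    [0.0, 0.0, -5.0, -1.0],  # action 0: push left
    [0.0, 0.0, 5.0, 1.0],    # action 1: push right
]
observation = [0.0, 0.0, 0.1, 0.0]  # pole leaning slightly to the right
print(action_probabilities(weights, observation))
print(sample_action(weights, observation))
```

Training changes the weights so that actions leading to higher cumulative reward become more probable.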

Learning Rate

How big the weight updates are. Too high = unstable, too low = slow learning. Typical: 1e-4 to 1e-3.

learning_rate = 3e-4
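
The effect of the learning rate is easy to see on a toy problem. The sketch below runs plain gradient descent on f(w) = w**2. It is not PPO, and the three learning rates are picked to suit this toy function, so they are not comparable to the typical range quoted above.

```python
def minimise_quadratic(learning_rate, steps=20, start=1.0):
    """Gradient descent on f(w) = w**2, whose gradient is 2 * w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w


# The minimum is at w = 0: too low crawls, too high overshoots and diverges.
for lr in (0.001, 0.3, 1.1):
    print(f"lr={lr}: w after 20 steps = {minimise_quadratic(lr):.4g}")
```

With the smallest rate, w has barely moved after 20 steps. With the largest, each update overshoots the minimum by more than the last, so |w| grows instead of shrinking.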

Batch Size

Number of samples used per gradient update. Larger = more stable but slower. Must be <= the number of transitions collected per rollout (n_steps when training on a single environment).

N Steps

Rollout buffer size. How many steps to collect before updating. Larger = better gradient estimates, but less frequent updates.
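
The relationship between batch_size and n_steps can be checked with a small helper. This is a sketch under assumptions: `minibatches_per_epoch` and its `n_envs` parameter are hypothetical names, and it assumes each parallel environment contributes n_steps transitions to the buffer, as libraries such as Stable-Baselines3 do.

```python
def minibatches_per_epoch(n_steps, batch_size, n_envs=1):
    """Number of gradient updates in one pass over the rollout buffer."""
    buffer_size = n_steps * n_envs  # transitions collected per rollout
    if batch_size > buffer_size:
        raise ValueError(
            f"batch_size={batch_size} exceeds the {buffer_size} collected transitions"
        )
    # Ceiling division so a partial final minibatch is still counted.
    return -(-buffer_size // batch_size)


# Illustrative values: 2048 collected steps split into minibatches of 64.
print(minibatches_per_epoch(n_steps=2048, batch_size=64))
```

Choosing a batch_size that divides the buffer evenly avoids a smaller, noisier final minibatch.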

Gamma (Discount)

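How much to value future rewards. A reward k steps ahead is weighted by gamma^k, so with gamma = 0.99 rewards a few steps away count almost as much as immediate ones, while a reward 100 steps away counts about 37% as much.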
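
The discounting rule can be written out directly. The sketch below computes a discounted return for a hypothetical episode of 100 timesteps at +1 each, and shows how the weight on a distant reward changes with gamma.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    # Work backwards so each step is a single multiply-add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0] * 100  # hypothetical episode: 100 timesteps at +1 each
for gamma in (0.9, 0.99, 1.0):
    weight = gamma ** 100
    ret = discounted_return(rewards, gamma)
    print(f"gamma={gamma}: weight 100 steps ahead = {weight:.3f}, return = {ret:.2f}")
```

Lower gamma makes the agent short-sighted. At 0.9 a reward 100 steps ahead is effectively invisible, which is a poor fit for a task whose whole point is to survive a long time.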

Entropy Coefficient

Weight of an entropy bonus in the training objective. The bonus rewards the policy for keeping its action probabilities spread out, which encourages exploration. Higher = more exploration.
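
Entropy measures how spread out the policy's action probabilities are. The sketch below computes it for a two-action policy. The `entropy_bonus` helper and its `ent_coef` argument are illustrative names for how the coefficient scales that quantity, not a specific library's implementation.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_bonus(probs, ent_coef):
    """Term added to the objective being maximised; larger for spread-out policies."""
    return ent_coef * entropy(probs)


print(entropy([0.5, 0.5]))    # maximally uncertain over two actions
print(entropy([0.99, 0.01]))  # nearly deterministic
print(entropy_bonus([0.5, 0.5], ent_coef=0.01))
```

A policy that always picks the same action earns no bonus, so a nonzero coefficient pushes against collapsing onto one action too early.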

✅ Teach-Back Checklist

  • Explain what a policy is
  • Describe learning rate's effect
  • Explain batch_size vs n_steps relationship
  • Define gamma's role
  • Explain entropy coefficient

🔬 Quick Quiz

1. What happens if learning rate is too high?

Training is too slow
Training becomes unstable
Memory runs out

2. What must be true about batch_size and n_steps?

batch_size <= n_steps
batch_size = n_steps exactly
batch_size > n_steps

🔬 Team 3: Optuna HPO

Master hyperparameter optimization with Optuna.

Study

An Optuna study is a complete HPO session. It contains multiple trials and tracks the best result.

study = optuna.create_study(direction="maximize")

Trial

One run with a specific set of hyperparameters. Optuna runs many trials to find the best combination.

Objective Function

The function Optuna optimizes. Takes a trial, samples params, trains, returns a score to maximize/minimize.

Search Space

The range of values Optuna can sample from. Defined using suggest_* methods in the objective.

trial.suggest_float("lr", 1e-5, 1e-3, log=True)
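
To see what sampling on a log scale means for a range like 1e-5 to 1e-3, the sketch below compares it with uniform sampling using only the standard library. It illustrates the idea and is not Optuna's implementation; `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
linear = [rng.uniform(1e-5, 1e-3) for _ in range(10_000)]
log_scale = [sample_log_uniform(1e-5, 1e-3, rng) for _ in range(10_000)]

# Fraction of samples that land in the lowest decade, 1e-5 to 1e-4.
frac_linear = sum(x < 1e-4 for x in linear) / len(linear)
frac_log = sum(x < 1e-4 for x in log_scale) / len(log_scale)
print(f"linear: {frac_linear:.2f}, log scale: {frac_log:.2f}")
```

Uniform sampling spends most of its draws in the top decade. Log-scale sampling gives each order of magnitude an equal share, which suits parameters like learning rate whose effect changes by factors rather than by fixed amounts.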

Sampler

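Algorithm for choosing next hyperparameters. TPE (Tree-structured Parzen Estimator, the default) uses the results of past trials to focus sampling on promising regions, so it typically needs fewer trials than random search.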

Pruning

Early stopping of bad trials. If a trial looks unpromising, Optuna can stop it early to save time.
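
One common way to decide that a trial "looks unpromising" is a median rule: stop a trial whose intermediate score falls below the median of earlier trials at the same step. The sketch below shows that idea in plain Python. It is a simplified illustration, not Optuna's pruner, and `should_prune` with its arguments is hypothetical.

```python
from statistics import median


def should_prune(current_value, step, history, min_trials=3):
    """Prune when the trial is below the median of earlier trials at this step.

    `history` is a list of earlier trials, each a list of intermediate scores
    (higher is better), indexed by step.
    """
    peers = [scores[step] for scores in history if len(scores) > step]
    if len(peers) < min_trials:
        return False  # not enough evidence yet, so let the trial continue
    return current_value < median(peers)


# Hypothetical intermediate mean rewards from three finished trials.
history = [
    [50, 120, 300],
    [40, 90, 200],
    [60, 150, 450],
]
print(should_prune(30, step=1, history=history))   # compared against a median of 120
print(should_prune(140, step=1, history=history))
```

Pruning pays off when trials are expensive and early scores predict final ones. If learning curves often cross late in training, an aggressive rule can discard trials that would have won.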

✅ Teach-Back Checklist

  • Explain Study vs Trial
  • Write an objective function structure
  • Define a search space
  • Explain why TPE is better than random
  • Describe when to use pruning

🔬 Quick Quiz

1. What does trial.suggest_float(..., log=True) do?

Samples on log scale (better for learning rates)
Logs the value to a file
Uses logarithm of the value

🧪 Team 4: A/B Testing & Statistics

Understand rigorous experimentation and statistical significance.

A/B Test

Comparing two variants (baseline vs candidate) to determine which performs better with statistical confidence.

Baseline

The current/default approach. What we compare the candidate against. The "control" in the experiment.

Candidate

The new approach we're testing. The "treatment" that we hypothesize might be better.

Bootstrap CI

Confidence interval computed by resampling the observed results with replacement and recomputing the mean each time. Shows the range where the true mean likely falls (e.g., 95% CI).
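
A percentile bootstrap can be sketched with the standard library alone. The episode returns below are made-up numbers for illustration, and `bootstrap_ci` is a hypothetical helper rather than a function from the course code.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=5000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    alpha = (1.0 - confidence) / 2.0
    lower = means[round(alpha * n_resamples)]
    upper = means[round((1.0 - alpha) * n_resamples) - 1]
    return lower, upper


# Hypothetical episode returns from evaluating one agent.
returns = [412, 455, 430, 398, 470, 441, 425, 460, 405, 448]
low, high = bootstrap_ci(returns)
print(f"mean = {mean(returns):.1f}, 95% CI = [{low:.1f}, {high:.1f}]")
```

More evaluation episodes narrow the interval. With only a handful of values, the interval is wide and should be read as a rough guide.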

Statistical Significance

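When we're confident the difference isn't due to chance. If CIs don't overlap, the difference is significant. The reverse does not hold: overlapping CIs can still hide a real difference, so when they overlap, look at the CI of the difference itself.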
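
A more direct check than comparing two separate intervals is to bootstrap the difference in means and see whether its interval excludes zero. The sketch below does that and applies one possible ship/iterate/abandon rule. Both the data and the rule in `decide` are hypothetical; a team may well set its own thresholds.

```python
import random
from statistics import mean


def bootstrap_diff_ci(baseline, candidate, n_resamples=5000, confidence=0.95, seed=0):
    """Percentile bootstrap CI for mean(candidate) - mean(baseline)."""
    rng = random.Random(seed)
    diffs = sorted(
        mean(rng.choices(candidate, k=len(candidate)))
        - mean(rng.choices(baseline, k=len(baseline)))
        for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = diffs[round(alpha * n_resamples)]
    upper = diffs[round((1.0 - alpha) * n_resamples) - 1]
    return lower, upper


def decide(diff_low, diff_high):
    """Hypothetical decision rule based on the CI of the difference."""
    if diff_low > 0:
        return "ship"     # whole interval above zero: candidate is better
    if diff_high < 0:
        return "abandon"  # whole interval below zero: candidate is worse
    return "iterate"      # interval includes zero: inconclusive


# Hypothetical episode returns for each variant.
baseline = [412, 455, 430, 398, 470, 441, 425, 460, 405, 448]
candidate = [480, 495, 462, 510, 475, 488, 470, 501, 466, 492]
low, high = bootstrap_diff_ci(baseline, candidate)
print(f"difference CI = [{low:.1f}, {high:.1f}] -> {decide(low, high)}")
```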

Random Seeds

Values that fix the random initialization of a training run. Running with multiple seeds (3-5+) ensures results aren't due to luck.
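
The multi-seed habit can be sketched as follows. `noisy_training_run` is a hypothetical stand-in for a real training job: it returns a fixed "true" score plus seed-dependent noise, which is enough to show why a single run can mislead.

```python
import random
from statistics import mean, stdev


def noisy_training_run(seed):
    """Hypothetical stand-in for training: the score depends on the seed."""
    rng = random.Random(seed)
    return 450 + rng.gauss(0, 25)  # made-up true quality of 450, plus luck


seeds = [0, 1, 2, 3, 4]
scores = [noisy_training_run(s) for s in seeds]
print("per-seed scores:", [round(s, 1) for s in scores])
print(f"mean = {mean(scores):.1f}, std = {stdev(scores):.1f}")
```

Reporting the mean and spread across seeds, and reusing the same seeds for baseline and candidate, keeps the comparison fair and reproducible.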

✅ Teach-Back Checklist

  • Explain baseline vs candidate
  • Describe what a confidence interval means
  • Explain why we use multiple seeds
  • Interpret overlapping vs non-overlapping CIs
  • Define decision rules (ship/iterate/abandon)

🔬 Quick Quiz

1. If baseline CI is [400, 450] and candidate CI is [460, 500], what should you do?

Ship candidate (CIs don't overlap, candidate higher)
Iterate (need more data)
Abandon candidate

🚀 Team 5: OSS Workflow

Master Git branching, PRs, and code review best practices.

Feature Branch

A separate branch for your work. Never commit directly to main. Create branch, make changes, merge via PR.

git checkout -b feature/add-hpo

Pull Request (PR)

A request to merge your branch into main. Allows code review before changes are integrated.

Code Review

Team members examine your code and provide feedback. Can Approve, Request Changes, or Comment.

Commit

A snapshot of changes with a descriptive message. Keep commits small and focused.

git commit -m "Add learning rate to search space"

Merge

Integrating your branch into main after approval. Squash merge combines all commits into one.

Conflict

When two branches modify the same lines. Must be resolved manually before merging.

✅ Teach-Back Checklist

  • Explain why we use branches
  • Describe the PR workflow
  • List good commit message practices
  • Explain code review etiquette
  • Demo the branch -> PR -> merge flow

🔬 Quick Quiz

1. Where should you make changes?

Directly on main branch
On a feature branch
In a fork only