W19-W20 Concepts Reference

📚 Jigsaw Learning Activity

Each team becomes the expert on its assigned topic area, then teaches the concepts to the other teams. Complete the checklist and quiz to verify your understanding before teaching.

🎯 Team 1: Reinforcement Learning Basics

Master the fundamentals of RL and the CartPole environment.

Agent

The learner/decision-maker that interacts with the environment. Takes actions based on observations.

agent.predict(observation) -> action

Environment

The world the agent interacts with. Receives actions, returns observations and rewards.

env.step(action) -> obs, reward, done, info

Observation

Information the agent receives about the current state. In CartPole, a 4-element vector: [cart_pos, cart_vel, pole_angle, pole_angular_vel]

Action

What the agent does. In CartPole: 0 (push left) or 1 (push right). Discrete action space.

Reward

Feedback signal. CartPole gives +1 for each timestep the pole stays upright. Goal: maximize cumulative reward.

Episode

One complete interaction from start to terminal state. In CartPole, ends when the pole tilts past the angle limit, the cart leaves the track, or the max number of steps is reached.
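
The terms above fit together in one loop: the agent observes, acts, receives a reward, and repeats until the episode ends. Below is a minimal sketch of that loop. It assumes a toy stand-in environment (`ToyBalanceEnv`, `simple_policy`, and `run_episode` are hypothetical names, and the "physics" is a noisy number, not real CartPole dynamics) so that it runs with the standard library alone. It keeps the four-value `step` return shown above; note that current Gymnasium releases split `done` into separate `terminated` and `truncated` flags.

```python
import random


class ToyBalanceEnv:
    """Hypothetical stand-in with a CartPole-like reset/step interface.

    Not Gymnasium's CartPole: the state is a single noisy "angle".
    """

    def __init__(self, max_steps=50, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.angle = 0.0
        self.steps = 0

    def reset(self):
        self.angle = 0.0
        self.steps = 0
        return [self.angle]  # initial observation

    def step(self, action):
        # Action 0 pushes left (negative), action 1 pushes right (positive).
        push = 0.05 if action == 1 else -0.05
        self.angle += push + self.rng.uniform(-0.1, 0.1)
        self.steps += 1
        fell = abs(self.angle) > 0.5              # terminal state: "pole fell"
        timed_out = self.steps >= self.max_steps  # step limit reached
        reward = 1.0                              # +1 for every step taken
        return [self.angle], reward, fell or timed_out, {}


def simple_policy(observation):
    """Push against the lean: right if leaning left, otherwise left."""
    return 1 if observation[0] < 0 else 0


def run_episode(env, policy):
    """Run one episode and return the cumulative reward."""
    obs = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(obs)                        # agent chooses an action
        obs, reward, done, info = env.step(action)  # environment responds
        total_reward += reward                      # accumulate the return
    return total_reward


episode_return = run_episode(ToyBalanceEnv(), simple_policy)
print(episode_return)
```

Because the reward is +1 per step, the printed return equals the number of steps the episode lasted, which is why longer balancing means a higher score.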

✅ Teach-Back Checklist

Explain the agent-environment loop
Describe CartPole's observation space
Explain the reward structure
Define what "solving" CartPole means
Draw the RL loop diagram

🔬 Quick Quiz

1. In CartPole, what happens when the agent takes action=1?

Push cart left

Push cart right

Do nothing

2. What reward does CartPole give per timestep?

+1 for staying upright

+10 for balancing perfectly

Varies by pole angle

🤖 Team 2: PPO Algorithm

Understand Proximal Policy Optimization and its key hyperparameters.

Policy

A function that maps observations to actions. PPO learns a neural network policy that improves over time.

Learning Rate

How big the weight updates are. Too high = unstable, too low = slow learning. Typical: 1e-4 to 1e-3.

learning_rate = 3e-4

Batch Size

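Number of samples used per gradient update. Larger = more stable but slower. Must be <= n_steps when training on a single environment, since a minibatch cannot be larger than the rollout it is drawn from.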

N Steps

Rollout length. How many environment steps to collect before each policy update. Larger = better gradient estimates, but less frequent updates.
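
The relationship between batch_size and n_steps is easiest to see as arithmetic: each rollout is cut into minibatches of batch_size samples. The sketch below is illustrative only. The helper name `minibatches_per_epoch` is hypothetical, and the `n_envs` parameter (number of parallel environments) is an assumption that does not appear in the definitions above.

```python
def minibatches_per_epoch(n_steps, batch_size, n_envs=1):
    """Return how many full minibatches one pass over the rollout yields."""
    rollout_size = n_steps * n_envs  # transitions collected before each update
    if batch_size > rollout_size:
        raise ValueError("batch_size cannot exceed the rollout size")
    return rollout_size // batch_size


print(minibatches_per_epoch(n_steps=2048, batch_size=64))           # 32
print(minibatches_per_epoch(n_steps=128, batch_size=64, n_envs=4))  # 8
```

Choosing a batch_size that divides the rollout size evenly avoids a leftover partial minibatch.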

Gamma (Discount)

How much to value future rewards. A reward k steps ahead is weighted by gamma^k, so 0.99 means near-future rewards are nearly as important as immediate ones, while distant rewards still fade.
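
To make the effect of gamma concrete, here is a small sketch of the discounted return: the sum of rewards where each step further into the future is multiplied by one more factor of gamma. The function name `discounted_return` is illustrative, and the 100-step episode of +1 rewards is just an example input.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    # Work backwards so each earlier step adds its reward plus the
    # discounted value of everything that follows it.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0] * 100  # e.g. 100 timesteps of +1 reward
print(discounted_return(rewards, gamma=0.99))  # roughly 63.4
print(discounted_return(rewards, gamma=0.90))  # roughly 10.0
```

With gamma=0.90 the agent effectively looks only about ten steps ahead, which is why values close to 1 are preferred when long-term balance matters.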

Entropy Coefficient

Encourages exploration by rewarding the policy for keeping its action probabilities spread out instead of committing to one action too early. Higher = more exploration.
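
Entropy measures how spread out the policy's action probabilities are. PPO implementations typically add the entropy, scaled by the entropy coefficient, as a bonus to the training objective. The sketch below only computes that bonus for two hand-picked action distributions; the helper names and the coefficient value 0.01 are illustrative assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_bonus(probs, ent_coef):
    """Bonus added to the objective: larger when the policy is less certain."""
    return ent_coef * entropy(probs)


undecided = [0.5, 0.5]    # left and right equally likely
confident = [0.99, 0.01]  # almost always pushes left

print(entropy(undecided))  # about 0.693
print(entropy(confident))  # about 0.056
print(entropy_bonus(undecided, ent_coef=0.01))
```

A policy that is still undecided earns a larger bonus, so a higher coefficient pushes training toward trying both actions for longer.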

✅ Teach-Back Checklist

Explain what a policy is
Describe learning rate's effect
Explain batch_size vs n_steps relationship
Define gamma's role
Explain entropy coefficient

🔬 Quick Quiz

1. What happens if learning rate is too high?

Training is too slow

Training becomes unstable

Memory runs out

2. What must be true about batch_size and n_steps?

batch_size <= n_steps

batch_size = n_steps exactly

batch_size > n_steps

🔬 Team 3: Optuna HPO

Master hyperparameter optimization with Optuna.

Study

An Optuna study is a complete HPO session. It contains multiple trials and tracks the best result.

study = optuna.create_study(direction="maximize")

Trial

One run with a specific set of hyperparameters. Optuna runs many trials to find the best combination.

Objective Function

The function Optuna optimizes. Takes a trial, samples params, trains, returns a score to maximize/minimize.

Search Space

The range of values Optuna can sample from. Defined using suggest_* methods in the objective.

trial.suggest_float("lr", 1e-5, 1e-3, log=True)
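
What log=True means can be sketched without Optuna: the logarithm of the value is sampled uniformly, so every order of magnitude in the range is equally likely. The code below is a standard-library illustration of that idea, not Optuna's implementation, and `sample_log_uniform` is a hypothetical helper name.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(1e-5, 1e-3, rng) for _ in range(10_000)]

# Fraction of samples in the lower decade of the range, [1e-5, 1e-4).
share_below_1e4 = sum(s < 1e-4 for s in samples) / len(samples)
print(round(share_below_1e4, 2))
```

The printed share should be close to one half, because 1e-4 is the midpoint of the range on a log scale. Plain uniform sampling would put only about 9% of draws below 1e-4, leaving small learning rates barely explored.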

Sampler

Algorithm for choosing the next hyperparameters. TPE (the default) is usually more sample-efficient than random search because it uses the results of past trials to focus on promising regions.

Pruning

Early stopping of bad trials. If a trial looks unpromising, Optuna can stop it early to save time.
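
One common way to decide that a trial "looks unpromising" is a median rule: stop the trial if its intermediate score is worse than the median of what earlier trials scored at the same step. The sketch below is a simplified stand-in for that idea, not Optuna's pruner. The function name, the `history` structure, and the three-trial minimum are all assumptions made for illustration.

```python
import statistics


def should_prune(history, step, value, min_trials=3):
    """Return True if `value` at `step` is below the median of earlier trials.

    `history` maps a step number to the scores earlier trials reported at
    that step. Assumes a higher score is better.
    """
    earlier = history.get(step, [])
    if len(earlier) < min_trials:  # too little evidence: let the trial continue
        return False
    return value < statistics.median(earlier)


# Hypothetical scores that four earlier trials reported at step 10.
history = {10: [120.0, 95.0, 180.0, 150.0]}
print(should_prune(history, step=10, value=60.0))   # True
print(should_prune(history, step=10, value=160.0))  # False
```

Pruning pays off most when each trial is expensive and early scores are a reasonable predictor of final scores.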

✅ Teach-Back Checklist

Explain Study vs Trial
Write an objective function structure
Define a search space
Explain why TPE is better than random
Describe when to use pruning

🔬 Quick Quiz

1. What does trial.suggest_float(..., log=True) do?

Samples on log scale (better for learning rates)

Logs the value to a file

Uses logarithm of the value

🧪 Team 4: A/B Testing & Statistics

Understand rigorous experimentation and statistical significance.

A/B Test

Comparing two variants (baseline vs candidate) to determine which performs better with statistical confidence.

Baseline

The current/default approach. What we compare the candidate against. The "control" in the experiment.

Candidate

The new approach we're testing. The "treatment" that we hypothesize might be better.

Bootstrap CI

Confidence interval computed by resampling the observed results with replacement. Shows the range where the true mean likely falls (e.g., 95% CI).
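
A percentile bootstrap is one simple way to compute such an interval: resample the results with replacement many times, take the mean of each resample, and read off the middle 95% of those means. The sketch below assumes that method; the function name `bootstrap_ci` and the five return values are made up for illustration.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=5000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    # Each resample draws len(values) items with replacement.
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[int((1.0 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical mean episode returns from five seeds (made-up numbers).
returns = [412.0, 447.0, 431.0, 455.0, 420.0]
ci_low, ci_high = bootstrap_ci(returns)
print(ci_low, ci_high)
```

With only a handful of seeds the interval is wide, which is an honest reflection of how little data there is.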

Statistical Significance

When we're confident the difference isn't due to chance. If the CIs don't overlap, the difference is significant. Overlapping CIs are inconclusive, not proof that there is no difference.
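
The CI comparison can be turned into an explicit decision rule. The mapping below (ship, iterate, abandon) is an assumed rule of thumb for a metric where higher is better; the function name `decide` and the exact thresholds are illustrative, and a team may reasonably agree on different rules.

```python
def decide(baseline_ci, candidate_ci):
    """Assumed rule of thumb for a higher-is-better metric."""
    base_low, base_high = baseline_ci
    cand_low, cand_high = candidate_ci
    if cand_low > base_high:
        return "ship"      # candidate interval sits entirely above baseline
    if cand_high < base_low:
        return "abandon"   # candidate interval sits entirely below baseline
    return "iterate"       # intervals overlap: gather more seeds or refine


print(decide((400, 450), (460, 500)))  # ship
print(decide((400, 450), (430, 480)))  # iterate
print(decide((400, 450), (300, 380)))  # abandon
```

Writing the rule down before running the experiment keeps the decision from being bent to fit the results.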

Random Seeds

Different random initializations. Running with multiple seeds (3-5+) makes it much less likely that a result is due to luck.

✅ Teach-Back Checklist

Explain baseline vs candidate
Describe what a confidence interval means
Explain why we use multiple seeds
Interpret overlapping vs non-overlapping CIs
Define decision rules (ship/iterate/abandon)

🔬 Quick Quiz

1. If baseline CI is [400, 450] and candidate CI is [460, 500], what should you do?

Ship candidate (CIs don't overlap, candidate higher)

Iterate (need more data)

Abandon candidate

🚀 Team 5: OSS Workflow

Master Git branching, PRs, and code review best practices.

Feature Branch

A separate branch for your work. Never commit directly to main. Create branch, make changes, merge via PR.

git checkout -b feature/add-hpo

Pull Request (PR)

A request to merge your branch into main. Allows code review before changes are integrated.

Code Review

Team members examine your code and provide feedback. Can Approve, Request Changes, or Comment.

Commit

A snapshot of changes with a descriptive message. Keep commits small and focused.

git commit -m "Add learning rate to search space"

Merge

Integrating your branch into main after approval. A squash merge combines all of the branch's commits into one.

Conflict

Occurs when two branches modify the same lines in different ways. Must be resolved manually before merging.

✅ Teach-Back Checklist

Explain why we use branches
Describe the PR workflow
List good commit message practices
Explain code review etiquette
Demo the branch -> PR -> merge flow

🔬 Quick Quiz

1. Where should you make changes?

Directly on main branch

On a feature branch

In a fork only