Team Learning Activity - Jigsaw Format
Each team will become experts in their assigned topic area, then teach the concepts to other teams. Complete the checklist and quiz to verify understanding before teaching.
Master the fundamentals of RL and the CartPole environment.
The agent is the learner and decision-maker. It interacts with the environment and chooses actions based on its observations.
The environment is the world the agent interacts with. It receives actions and returns observations and rewards.
An observation is the information the agent receives about the current state. In CartPole it is the vector [cart_pos, cart_vel, pole_angle, pole_vel].
An action is what the agent does. CartPole has a discrete action space with two actions: 0 (push left) or 1 (push right).
The reward is the feedback signal. CartPole gives +1 for each timestep the pole stays upright, and the goal is to maximize cumulative reward.
An episode is one complete interaction from the initial state to a terminal state. In CartPole it ends when the pole falls or the maximum number of steps is reached.
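The six concepts above fit together in a single loop. The sketch below uses a hypothetical `ToyBalanceEnv` with made-up dynamics as a stand-in for CartPole. Only the observation layout, the two actions, and the +1 reward per timestep come from the text. The class, its update rule, and `lean_policy` are assumptions for illustration, not the real CartPole physics.

```python
import random


class ToyBalanceEnv:
    """Hypothetical stand-in for CartPole with made-up, simplified dynamics."""

    def __init__(self, max_steps=500, angle_limit=0.2, seed=0):
        self.max_steps = max_steps      # episode is cut off after this many steps
        self.angle_limit = angle_limit  # pole counts as fallen beyond this angle
        self.rng = random.Random(seed)
        self.state = [0.0, 0.0, 0.0, 0.0]
        self.steps = 0

    def reset(self):
        """Start a new episode and return the first observation."""
        self.steps = 0
        # Observation layout: [cart_pos, cart_vel, pole_angle, pole_vel]
        self.state = [self.rng.uniform(-0.05, 0.05) for _ in range(4)]
        return list(self.state)

    def step(self, action):
        """Apply an action and return (observation, reward, done)."""
        if action not in (0, 1):
            raise ValueError("action must be 0 (push left) or 1 (push right)")
        push = 1.0 if action == 1 else -1.0
        cart_pos, cart_vel, pole_angle, pole_vel = self.state
        # Made-up update rule: the pole keeps tipping unless the cart is
        # pushed toward the side the pole is leaning.
        cart_vel += 0.1 * push
        cart_pos += 0.02 * cart_vel
        pole_vel += 0.5 * pole_angle - 0.1 * push
        pole_angle += 0.02 * pole_vel
        self.state = [cart_pos, cart_vel, pole_angle, pole_vel]
        self.steps += 1
        fallen = abs(pole_angle) > self.angle_limit
        done = fallen or self.steps >= self.max_steps
        return list(self.state), 1.0, done  # +1 reward per timestep


def lean_policy(observation):
    """Toy agent: push toward the side the pole is leaning or moving to."""
    _, _, pole_angle, pole_vel = observation
    return 1 if pole_angle + pole_vel > 0 else 0


def run_episode(env, policy):
    """Run one episode and return the cumulative reward."""
    observation = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(observation)                   # agent picks an action
        observation, reward, done = env.step(action)   # environment responds
        total_reward += reward
    return total_reward


episode_return = run_episode(ToyBalanceEnv(max_steps=200), lean_policy)
print(f"episode return: {episode_return}")
```

Because the reward is +1 per timestep, the episode return equals the number of steps the episode lasted, which is why a longer-balancing agent scores higher.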
1. In CartPole, what happens when the agent takes action=1?
2. What reward does CartPole give per timestep?
Understand Proximal Policy Optimization and its key hyperparameters.
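A policy is a function that maps observations to actions. PPO learns a neural-network policy that improves over the course of training.
The learning rate controls how large each weight update is. Too high and training becomes unstable; too low and learning is slow. Typical values range from 1e-4 to 1e-3.
The batch size (batch_size) is the number of samples used per gradient update. Larger batches give more stable updates but slower training. It must be <= n_steps.
n_steps is the rollout buffer size: how many steps to collect before each update. Larger values give better gradient estimates.
The discount factor sets how much future rewards are valued. With 0.99, a reward one step ahead is worth 99% of an immediate one, so near-term future rewards are nearly as important as immediate ones.
The entropy coefficient encourages exploration by rewarding randomness in action selection. Higher values mean more exploration.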
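Two of the hyperparameters above can be made concrete with a few lines of arithmetic. The sketch below computes a discounted return and the entropy of a two-action policy. The function names and the example numbers are illustrative assumptions; this is not PPO itself, only the quantities that the discount factor and entropy coefficient act on.

```python
import math


def discounted_return(rewards, gamma=0.99):
    """Sum of rewards where a reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def policy_entropy(action_probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in action_probs if p > 0)


# CartPole-style rewards: +1 on every timestep of a 100-step episode.
rewards = [1.0] * 100
print("return with gamma=0.99:", discounted_return(rewards, gamma=0.99))
print("return with gamma=0.90:", discounted_return(rewards, gamma=0.90))

# A 50/50 policy is maximally random; a 99/1 policy is nearly deterministic.
print("entropy of [0.5, 0.5]:", policy_entropy([0.5, 0.5]))
print("entropy of [0.99, 0.01]:", policy_entropy([0.99, 0.01]))
```

With a constant +1 reward the discounted return can never exceed 1 / (1 - gamma), so a discount factor of 0.99 effectively looks about 100 steps ahead while 0.90 looks about 10 steps ahead.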
1. What happens if learning rate is too high?
2. What must be true about batch_size and n_steps?
Master hyperparameter optimization with Optuna.
An Optuna study is a complete hyperparameter optimization (HPO) session. It contains multiple trials and tracks the best result.
A trial is one run with a specific set of hyperparameters. Optuna runs many trials to find the best combination.
The objective function is what Optuna optimizes. It takes a trial, samples hyperparameters, trains a model, and returns a score to maximize or minimize.
The search space is the range of values Optuna can sample from. It is defined by the suggest_* calls inside the objective.
The sampler is the algorithm that chooses the next hyperparameters. TPE (the default) is smarter than random search because it learns from past trials.
Pruning is the early stopping of bad trials. If a trial looks unpromising, Optuna can stop it early to save time.
1. What does trial.suggest_float(..., log=True) do?
Understand rigorous experimentation and statistical significance.
An A/B comparison evaluates two variants (baseline vs. candidate) to determine which performs better with statistical confidence.
The baseline is the current or default approach that the candidate is compared against. It is the "control" in the experiment.
The candidate is the new approach being tested. It is the "treatment" that we hypothesize might be better.
A bootstrap confidence interval (CI) is computed by resampling the results with replacement. It shows the range where the true mean likely falls (e.g., a 95% CI).
A difference is statistically significant when we are confident it is not due to chance. If the two CIs do not overlap, the difference is significant; overlapping CIs are inconclusive rather than proof of no difference.
Seeds are different random initializations. Running with multiple seeds (3-5 or more) helps ensure results are not due to luck.
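The bootstrap CI and the overlap check described above can be sketched with the standard library alone. This is a minimal percentile bootstrap under stated assumptions: the function names are hypothetical, and the per-seed returns are made-up numbers for illustration only, not results from any experiment.

```python
import random
import statistics


def bootstrap_ci(samples, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.mean(rng.choices(samples, k=len(samples)))
        for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[int((1.0 - tail) * n_resamples) - 1]
    return lower, upper


def intervals_overlap(a, b):
    """True if closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Made-up mean returns, one per seed (5 seeds per variant), illustration only.
baseline_returns = [410.0, 432.0, 445.0, 421.0, 438.0]
candidate_returns = [465.0, 481.0, 472.0, 490.0, 468.0]

baseline_ci = bootstrap_ci(baseline_returns)
candidate_ci = bootstrap_ci(candidate_returns)
print("baseline CI:", baseline_ci)
print("candidate CI:", candidate_ci)
print("CIs overlap:", intervals_overlap(baseline_ci, candidate_ci))
```

With only a handful of seeds the interval is a rough estimate, so adding seeds is usually the first step when the two intervals overlap and the comparison is inconclusive.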
1. If baseline CI is [400, 450] and candidate CI is [460, 500], what should you do?
Master Git branching, PRs, and code review best practices.
A feature branch is a separate branch for your work. Never commit directly to main: create a branch, make your changes, and merge via a PR.
A pull request (PR) is a request to merge your branch into main. It allows code review before changes are integrated.
In code review, team members examine your code and provide feedback. Reviewers can Approve, Request Changes, or Comment.
A commit is a snapshot of changes with a descriptive message. Keep commits small and focused.
Merging integrates your branch into main after approval. A squash merge combines all of the branch's commits into one.
A merge conflict occurs when two branches modify the same lines. It must be resolved manually before merging.
1. Where should you make changes?