Understanding the algorithm behind your CartPole agent
Reinforcement learning (RL) is learning through trial and error. An agent interacts with an environment, receives rewards, and learns to maximize total reward over time.
State: what the agent observes. In CartPole: [cart_position, cart_velocity, pole_angle, pole_angular_velocity]
Action: what the agent does. In CartPole: push left (0) or push right (1)
Reward: the feedback signal. In CartPole: +1 for every timestep the pole stays upright
Policy: the agent's strategy. Given a state, which action should it take? Written as π(a|s), the probability of taking action a in state s
Episode: one complete run from start to termination. A CartPole-v1 episode ends when the pole tips too far, the cart leaves the track, or 500 steps are reached.
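Find a policy π that maximizes the expected cumulative discounted reward:
J(π) = E[ Σ_t γ^t · r_t ]
Here r_t is the reward at timestep t and γ (gamma) is the discount factor, a number between 0 and 1. It controls how much we care about future vs. immediate rewards: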
Low γ (closer to 0): more "short-sighted", prefers immediate rewards, faster learning but possibly a suboptimal policy
High γ (closer to 1): more "far-sighted", considers long-term consequences, slower learning but a better final policy
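To make the effect of γ concrete, here is a minimal sketch that computes the discounted return of one episode. The helper name `discounted_return` and the 100-step reward list are illustrative choices, not part of any library.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode (illustrative helper)."""
    total = 0.0
    # Walk backwards so each step is a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# CartPole-style rewards: +1 per timestep for a 100-step episode.
rewards = [1.0] * 100
for gamma in (0.5, 0.9, 0.99):
    print(gamma, round(discounted_return(rewards, gamma), 3))
```

With a constant reward of +1, the return approaches 1 / (1 − γ) as the episode gets longer, so a low γ makes rewards beyond the first few steps almost irrelevant.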
Instead of learning which states are valuable (value-based methods like Q-learning), policy gradient methods directly learn the policy.
If an action led to good reward → make it more likely
If an action led to bad reward → make it less likely
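The policy gradient is ∇θ J(θ) = E[ ∇θ log πθ(a|s) · A(s, a) ]. This formula says: adjust the policy parameters θ in the direction that increases the probability of actions with positive advantage and decreases the probability of actions with negative advantage.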
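The update rule can be sketched for the simplest possible case: a softmax policy whose parameters are the action logits themselves, with no dependence on the state. This is a simplifying assumption for illustration; `policy_gradient_step` and the learning rate are hypothetical names and values.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def policy_gradient_step(logits, action, advantage, lr=0.1):
    """One gradient-ascent step on advantage * log pi(action)."""
    probs = softmax(logits)
    # For a softmax policy: d log pi(action) / d logit_i = 1[i == action] - probs[i]
    return [
        logit + lr * advantage * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


logits = [0.0, 0.0]  # start with a 50/50 policy over push-left / push-right
up = policy_gradient_step(logits, action=1, advantage=+2.0)
down = policy_gradient_step(logits, action=1, advantage=-2.0)
print(softmax(up)[1], softmax(down)[1])
```

A positive advantage raises the probability of the sampled action above 0.5, and a negative advantage lowers it below 0.5.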
The advantage A(s, a) tells us: "Was this action better or worse than average?" It is the difference A(s, a) = Q(s, a) − V(s), where:
V(s): expected total reward starting from state s and following policy π
Q(s, a): expected total reward starting from state s, taking action a, then following π
Positive advantage: the action was better than expected
Negative advantage: the action was worse than expected
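As a small worked example, the advantage can be computed from action values using A(s, a) = Q(s, a) − V(s), where V(s) is the policy-weighted average of Q(s, a). The Q-values below are made up for illustration, and `advantages` is a hypothetical helper.

```python
def advantages(q_values, probs):
    """Return A(s, a) = Q(s, a) - V(s) for every action in one state."""
    # V(s) is the expectation of Q(s, a) under the policy's action probabilities.
    v = sum(p * q for p, q in zip(probs, q_values))
    return [q - v for q in q_values]


q_values = [10.0, 14.0]  # made-up Q(s, left), Q(s, right)
probs = [0.5, 0.5]       # policy picks each action half the time
adv = advantages(q_values, probs)
print(adv)
```

Here V(s) = 12, so pushing left has advantage −2 and pushing right has advantage +2. Weighted by the policy probabilities, the advantages always average to zero.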
Plain policy gradients have two well-known problems:
Unstable updates: big policy changes can destroy good behavior learned so far, and performance can collapse suddenly.
High variance: reward signals are noisy, so one lucky or unlucky episode can cause wild policy swings.
This is where PPO comes in: it is designed to mitigate both problems.
PPO uses an actor-critic setup with two neural networks (or one network with two heads):
Actor: decides what to do. Input: state s. Output: π(a|s)
Critic: evaluates how good a state is. Input: state s. Output: V(s)
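The actor/critic interface can be sketched as follows. To stay dependency-free, this sketch uses plain linear maps as stand-ins for neural networks; the class name `TinyActorCritic`, the weight initialization, and the sample state are all illustrative assumptions.

```python
import math
import random


class TinyActorCritic:
    """Linear actor and critic heads on the 4-dim CartPole state (illustrative only)."""

    def __init__(self, state_dim=4, n_actions=2, seed=0):
        rng = random.Random(seed)
        # Actor: one weight row per action, producing one logit per action.
        self.actor_w = [[rng.uniform(-0.1, 0.1) for _ in range(state_dim)]
                        for _ in range(n_actions)]
        # Critic: a single weight vector, producing a scalar value estimate.
        self.critic_w = [rng.uniform(-0.1, 0.1) for _ in range(state_dim)]

    def policy(self, state):
        """Actor head: state -> action probabilities pi(a|s)."""
        logits = [sum(w * x for w, x in zip(row, state)) for row in self.actor_w]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def value(self, state):
        """Critic head: state -> scalar estimate of V(s)."""
        return sum(w * x for w, x in zip(self.critic_w, state))


model = TinyActorCritic()
state = [0.01, -0.02, 0.03, 0.04]  # made-up CartPole observation
probs = model.policy(state)
action = random.choices([0, 1], weights=probs)[0]  # sample an action from pi(a|s)
print(probs, model.value(state), action)
```

In a real agent both heads would be neural networks, possibly sharing early layers, but the inputs and outputs have the same shape as here.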
PPO uses Generalized Advantage Estimation (GAE) to compute advantages. It balances bias vs. variance with a parameter λ (lambda):
A_t = Σ_{l≥0} (γλ)^l · δ_{t+l}
Where δ_t = r_t + γ·V(s_{t+1}) − V(s_t) is the TD error.
λ = 0: high bias, low variance. Uses only the immediate TD error. Good for: simple tasks, fast learning
λ = 1: low bias, high variance. Uses full Monte Carlo returns. Good for: complex tasks, accurate gradients
Common choice: λ = 0.95 balances both well for most tasks.
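GAE is usually computed with a single backward pass over a rollout. The sketch below assumes one rollout segment with no episode boundary inside it. The names `compute_gae` and `last_value` are illustrative; `last_value` is the critic's estimate for the state after the final step (use 0.0 if the episode terminated there).

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one rollout segment.

    rewards[t] is r_t, values[t] is V(s_t), last_value is V of the state
    reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Recursive form of the sum over (gamma * lam)^l * delta_{t+l}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]  # made-up critic estimates
print(compute_gae(rewards, values, last_value=0.0, gamma=1.0, lam=0.0))
print(compute_gae(rewards, values, last_value=0.0, gamma=1.0, lam=1.0))
```

With λ = 0 each advantage is just the one-step TD error. With λ = 1 it equals the Monte Carlo return minus V(s_t).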
PPO's key insight: limit how much the policy can change in a single update.
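It does this by "clipping" the objective function:
L_CLIP(θ) = E[ min( r_t(θ)·A_t, clip(r_t(θ), 1−ε, 1+ε)·A_t ) ]
Where r_t(θ) = πθ(a_t|s_t) / πθ_old(a_t|s_t) is the probability ratio between the new and the old policy, and ε is the clip range (commonly 0.2):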
If A > 0 (good action): the objective stops rewarding increases once the ratio exceeds (1+ε)
If A < 0 (bad action): the objective stops rewarding decreases once the ratio falls below (1-ε)
This removes the incentive for catastrophically large updates!
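The clipping rule can be checked on single samples. In this minimal sketch the ratio is the new policy's probability of the action divided by the old policy's probability; `clipped_surrogate` is a hypothetical helper name, and ε = 0.2 is a commonly used value rather than one taken from this tutorial.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped objective for a single (ratio, advantage) sample."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action, ratio already far above 1 + eps: no extra credit.
print(clipped_surrogate(1.5, +1.0))
# Bad action, ratio already far below 1 - eps: no extra credit.
print(clipped_surrogate(0.5, -1.0))
# Moves in the harmful direction are not clipped, so they are fully penalized.
print(clipped_surrogate(1.5, -1.0))
```

The clip is one-sided by design: it caps the benefit of moving further in the helpful direction, but leaves the penalty intact when the policy moves the wrong way.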
Stable: clipping prevents catastrophic updates that destroy learned behavior
Sample efficient: reuses collected data for multiple epochs of updates
Simple: no complex trust region constraints like TRPO
Robust: default hyperparameters work for many tasks
These are the knobs you'll tune in your experiments. Understanding what each does helps you debug and improve performance.
Learning rate
Too low: Slow learning, might not converge
Too high: Unstable, performance collapses
Start with: 3e-4
Rollout length (n_steps)
Too low: High variance, noisy updates
Too high: Slow iteration, stale gradients
Start with: 2048
Minibatch size (batch_size)
Too low: Noisy gradients, slow convergence
Too high: Fewer updates per rollout
Start with: 64
Discount factor γ (gamma)
Lower: Short-sighted, faster learning
Higher: Far-sighted, harder to learn
Start with: 0.99
GAE parameter λ (lambda)
Lower: More biased, lower variance
Higher: Less biased, higher variance
Start with: 0.95
Entropy coefficient (ent_coef)
Zero: No exploration bonus
Higher: More random exploration
Start with: 0.0 (CartPole is easy)
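Training is unstable or performance collapses: the learning rate is likely too high. Try reducing it by 10x (e.g., 3e-4 → 3e-5)
Learning is too slow: try increasing the learning rate, or reducing n_steps for faster iteration
The agent gets stuck at a mediocre score: increase ent_coef to encourage exploration (try 0.01)
Batch size problems: make sure batch_size ≤ n_steps (times the number of parallel environments, if you run more than one). This is a common mistake!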
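The starting values above can be collected in one place and sanity-checked before training. In this sketch the key names `learning_rate`, `gamma`, and `gae_lambda` are assumptions, chosen to resemble the keyword arguments of Stable-Baselines3's `PPO`; whether your setup uses that library is also an assumption. `check_batch_size` is a hypothetical helper.

```python
# Starting hyperparameters from this section (key names are illustrative).
ppo_config = {
    "learning_rate": 3e-4,
    "n_steps": 2048,
    "batch_size": 64,
    "gamma": 0.99,
    "gae_lambda": 0.95,
    "ent_coef": 0.0,
}


def check_batch_size(config, n_envs=1):
    """Raise if batch_size exceeds the rollout; return True if it divides it evenly."""
    rollout_size = config["n_steps"] * n_envs
    if config["batch_size"] > rollout_size:
        raise ValueError(
            f"batch_size={config['batch_size']} exceeds rollout size {rollout_size}"
        )
    # An even split avoids a truncated final minibatch in each epoch.
    return rollout_size % config["batch_size"] == 0


print(check_batch_size(ppo_config))
```

Running the check up front turns the batch-size mistake into an immediate, readable error instead of a confusing failure partway through training.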