Proximal Policy Optimization (PPO)

Understanding the algorithm behind your CartPole agent

The Reinforcement Learning Loop

RL is learning through trial and error. An agent interacts with an environment, receives rewards, and learns to maximize total reward over time.

The RL loop: Agent (policy π(a|s)) → action a_t → Environment (CartPole) → state and reward s_{t+1}, r_t → Learn (update π) → back to the Agent

Key RL Vocabulary

🎯

State (s)

What the agent observes. In CartPole: [cart_position, cart_velocity, pole_angle, pole_angular_velocity]

🎮

Action (a)

What the agent does. In CartPole: push left (0) or push right (1)

🏆

Reward (r)

Feedback signal. In CartPole: +1 for every timestep the pole stays upright

🧠

Policy (π)

The agent's strategy: given a state, which action to take? Written as π(a|s) = probability of action a given state s

📈

Episode

One complete run from start to termination. A CartPole episode ends when the pole tilts too far, the cart leaves the track, or 500 steps are reached (the CartPole-v1 limit).
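To make the vocabulary concrete, here is a minimal sketch of one episode of the agent-environment loop. ToyBalanceEnv is a hypothetical stand-in written for this example (it is not CartPole and not a real library class), and run_episode, random_policy and centering_policy are illustrative names as well.

```python
import random


class ToyBalanceEnv:
    """Hypothetical stand-in environment (not CartPole): the state is a single
    position that gets pushed and jostled; the episode ends when it leaves [-1, 1]."""

    def __init__(self, max_steps=500, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.position = 0.0
        self.steps = 0

    def reset(self):
        self.position = 0.0
        self.steps = 0
        return self.position

    def step(self, action):
        # Action 0 pushes left, action 1 pushes right; noise perturbs the state.
        push = -0.1 if action == 0 else 0.1
        self.position += push + self.rng.uniform(-0.15, 0.15)
        self.steps += 1
        terminated = abs(self.position) > 1.0      # failure, like the pole falling
        truncated = self.steps >= self.max_steps   # step limit reached
        reward = 1.0                               # +1 for every timestep survived
        return self.position, reward, terminated, truncated


def run_episode(env, policy):
    """Run one episode and return the total (undiscounted) reward."""
    state = env.reset()
    total_reward = 0.0
    while True:
        action = policy(state)                                   # agent acts
        state, reward, terminated, truncated = env.step(action)  # environment responds
        total_reward += reward
        if terminated or truncated:
            return total_reward


def random_policy(state):
    return random.choice([0, 1])


def centering_policy(state):
    # Push back toward the centre of the allowed range.
    return 1 if state < 0 else 0


env = ToyBalanceEnv(max_steps=500, seed=0)
print("random policy:   ", run_episode(env, random_policy))
print("centering policy:", run_episode(env, centering_policy))
```

The policy here is a fixed function; the "Learn" step of the loop, which changes the policy from experience, is what the rest of this page builds up to.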

The Goal of RL

Find a policy π that maximizes expected cumulative reward:

J(π) = E[r_0 + γ r_1 + γ² r_2 + γ³ r_3 + ...]

Where γ (gamma) is the discount factor. It controls how much we care about future vs. immediate rewards:

γ = 0.9 (Lower)

More "short-sighted"

Prefers immediate rewards

Faster learning, possibly suboptimal

γ = 0.99 (Higher)

More "far-sighted"

Considers long-term consequences

Slower learning, better final policy
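As a rough illustration of how γ changes the objective, the sketch below computes the discounted sum for a full-length episode of +1 rewards. The function name discounted_return is chosen for this example only.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite list of rewards."""
    total = 0.0
    # Work backwards so each step is one multiply-add: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0] * 500  # a full-length CartPole episode: +1 per timestep
for gamma in (0.9, 0.99):
    print(gamma, round(discounted_return(rewards, gamma), 2))
```

With a constant reward of +1 the sum levels off near 1 / (1 - γ): about 10 for γ = 0.9 and about 99 for γ = 0.99. In other words, rewards more than a few dozen steps away barely register at γ = 0.9, while γ = 0.99 still sees a few hundred steps ahead.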

Policy Gradient Methods

Instead of learning how valuable states and actions are and deriving behavior from those values (as value-based methods like Q-learning do), policy gradient methods learn the policy directly.

The Core Idea

If an action led to good reward → make it more likely
If an action led to bad reward → make it less likely

∇_θ J(θ) ≈ E[∇_θ log π_θ(a|s) · A(s,a)]

This formula says: adjust the policy parameters θ in the direction that increases the probability of actions with positive advantage.
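A small worked example may help. The sketch below assumes the simplest possible policy: a softmax over two logits that ignores the state, so the parameters θ are just the logits. The function names and the learning rate of 0.1 are illustrative choices, not part of any library.

```python
import math


def softmax(logits):
    """Turn logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, action, advantage, learning_rate=0.1):
    """One gradient-ascent step on log pi(action) * advantage.

    For a softmax policy, d log pi(a) / d logit_k = 1[k == a] - pi(k).
    """
    probs = softmax(logits)
    new_logits = []
    for k, (logit, p) in enumerate(zip(logits, probs)):
        grad_log_prob = (1.0 if k == action else 0.0) - p
        new_logits.append(logit + learning_rate * advantage * grad_log_prob)
    return new_logits


logits = [0.0, 0.0]  # both actions start equally likely
after_good = policy_gradient_step(logits, action=1, advantage=+2.0)
after_bad = policy_gradient_step(logits, action=1, advantage=-2.0)
print("start:             ", softmax(logits)[1])
print("positive advantage:", softmax(after_good)[1])
print("negative advantage:", softmax(after_bad)[1])
```

The probability of action 1 rises above 0.5 after the positive-advantage step and falls below 0.5 after the negative one, which is the "more likely / less likely" rule from the core idea.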

What is Advantage?

Advantage tells us: "Was this action better or worse than average?"

A(s, a) = Q(s, a) - V(s)
📊

V(s) - State Value

Expected total reward starting from state s, following policy π

📊

Q(s, a) - Action Value

Expected total reward starting from state s, taking action a, then following π

⚖️

A(s, a) - Advantage

Positive = action was better than expected
Negative = action was worse than expected
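The three quantities fit together as follows. This is a toy calculation with made-up Q-values for a single state; state_value and advantages are names chosen for the example.

```python
def state_value(action_values, action_probs):
    """V(s) = sum over actions of pi(a|s) * Q(s, a)."""
    return sum(p * q for p, q in zip(action_probs, action_values))


def advantages(action_values, action_probs):
    """A(s, a) = Q(s, a) - V(s) for every action."""
    v = state_value(action_values, action_probs)
    return [q - v for q in action_values]


q_values = [8.0, 12.0]  # made-up Q(s, push left), Q(s, push right)
policy = [0.5, 0.5]     # pi(a|s): both actions equally likely

print(state_value(q_values, policy))  # 10.0
print(advantages(q_values, policy))   # [-2.0, 2.0]
```

Because V(s) is the policy-weighted average of the Q-values, the advantages always average to zero under the policy: some actions are better than average only because others are worse.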

The Problem with Vanilla Policy Gradient

❌ Too Large Updates

Big policy changes can destroy good behavior learned so far. Performance can collapse suddenly.

❌ High Variance

Reward signals are noisy. One lucky/unlucky episode can cause wild policy swings.

This is where PPO comes in: it is designed to tame both problems.

Actor-Critic Architecture

PPO uses an actor-critic setup with two neural networks (or one network with two heads):

🎭 Actor (Policy)

Decides what to do

Input: State s
Output: π(a|s)

Advantage (computed from the critic's values) guides the actor

🎯 Critic (Value)

Evaluates how good a state is

Input: State s
Output: V(s)

How They Work Together

  1. Actor chooses actions based on current policy
  2. Critic estimates value of states visited
  3. Advantage is computed using critic's estimates
  4. Actor is updated to favor high-advantage actions
  5. Critic is updated to better predict values
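The two outputs can be sketched with a deliberately tiny model. The class below is a hypothetical illustration in which both heads are linear in the state; real implementations typically use neural networks, and the weights here are hand-picked rather than learned.

```python
import math


class LinearActorCritic:
    """Hypothetical minimal actor-critic: both heads are linear in the state."""

    def __init__(self, actor_weights, critic_weights):
        self.actor_weights = actor_weights    # one row of weights per action
        self.critic_weights = critic_weights  # one weight per state feature

    def forward(self, state):
        # Actor head: logits -> softmax -> pi(a|s)
        logits = [sum(w * x for w, x in zip(row, state)) for row in self.actor_weights]
        m = max(logits)
        exps = [math.exp(logit - m) for logit in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Critic head: a single number estimating V(s)
        value = sum(w * x for w, x in zip(self.critic_weights, state))
        return probs, value


model = LinearActorCritic(
    actor_weights=[[0.0, 0.0, -1.0, 0.0],   # logit for "push left" (action 0)
                   [0.0, 0.0, 1.0, 0.0]],   # logit for "push right" (action 1)
    critic_weights=[0.0, 0.0, 0.0, 0.0],    # untrained critic: predicts 0 everywhere
)
# State layout: [cart_position, cart_velocity, pole_angle, pole_angular_velocity]
state = [0.0, 0.0, 0.05, 0.0]
probs, value = model.forward(state)
print("pi(a|s):", probs)
print("V(s):   ", value)
```

Both heads read the same state; training would adjust the actor weights using advantages and the critic weights using observed returns, as in steps 4 and 5 above.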

GAE: Generalized Advantage Estimation

PPO uses GAE to compute advantages. It balances bias vs. variance with a parameter λ (lambda).

A_t^GAE = δ_t + (γλ) δ_{t+1} + (γλ)² δ_{t+2} + ...

Where δ_t = r_t + γ V(s_{t+1}) - V(s_t) is the TD error.

λ = 0 (Low)

High bias, low variance

Uses only immediate TD error

Good for: Simple tasks, fast learning

λ = 1 (High)

Low bias, high variance

Uses full Monte Carlo returns

Good for: Complex tasks, accurate gradients

Common choice: λ = 0.95 balances both well for most tasks.
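The sum is usually computed backwards through a trajectory using the recursion A_t = δ_t + γλ A_{t+1}. The sketch below assumes a single trajectory segment with no episode boundary inside it; the function signature is illustrative and not taken from any particular library.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, gae_lambda=0.95):
    """GAE advantages for one trajectory segment.

    values[t] is the critic's V(s_t). last_value is V of the state reached
    after the final step (use 0.0 if the episode terminated there).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * gae_lambda * running       # A_t = delta_t + gamma*lambda*A_{t+1}
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]  # made-up critic estimates
for lam in (0.0, 0.95, 1.0):
    print(lam, compute_gae(rewards, values, last_value=0.0, gamma=1.0, gae_lambda=lam))
```

With λ = 0 each advantage is just that step's TD error; with λ = 1 it becomes the full remaining return minus V(s_t). Values in between blend the two.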

The PPO Innovation: Clipping

PPO's key insight: limit how much the policy can change in a single update.

It does this by "clipping" the objective function:

L^CLIP(θ) = E[ min( r(θ) A, clip(r(θ), 1-ε, 1+ε) A ) ]

Where:

  • r(θ) = probability ratio = π_new(a|s) / π_old(a|s)
  • ε (epsilon) = clipping range, typically 0.2
  • A = advantage estimate

🔑

Why Clipping Works

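If A > 0 (good action): the objective stops improving once the ratio exceeds (1+ε), so there is no incentive to push the action's probability higher
If A < 0 (bad action): the objective stops improving once the ratio drops below (1-ε), so there is no incentive to push the action's probability lower
This removes the incentive for catastrophically large updates!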
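The effect is easy to check numerically for a single sample. The helper below is a sketch of the per-sample clipped term; clipped_surrogate is a name chosen for this example.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


for ratio in (0.5, 1.0, 1.5):
    for advantage in (+1.0, -1.0):
        print(f"ratio={ratio}, A={advantage:+}: {clipped_surrogate(ratio, advantage):+.2f}")
```

For A = +1 the value is capped at 1.2 however large the ratio gets, and for A = -1 it is held at -0.8 however small the ratio gets. In the opposite directions (ratio 0.5 with A = +1, or ratio 1.5 with A = -1) the min picks the unclipped term, so a change that made things worse is still fully penalized.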

PPO Training Loop

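# Pseudocode: collect_rollouts, compute_gae, minibatches, clip, minimum and mean
# stand for helpers you would supply (or get from your RL library).
for iteration in range(num_iterations):
    # 1. Collect trajectories using the current policy
    trajectories = collect_rollouts(policy, env, n_steps=2048)

    # 2. Compute advantages and return targets using GAE
    #    ("lambda" is a reserved word in Python, hence gae_lambda)
    trajectories.advantages, trajectories.returns = compute_gae(
        trajectories, critic, gamma=0.99, gae_lambda=0.95)

    # 3. Store the old policy's probabilities (held fixed during the updates below)
    trajectories.old_probs = policy.get_probs(trajectories.states, trajectories.actions)

    # 4. Multiple epochs of minibatch updates
    for epoch in range(10):  # PPO typically uses 3-10 epochs
        for batch in minibatches(trajectories, batch_size=64):
            # Probability ratio, using the old probabilities of the same samples
            new_probs = policy.get_probs(batch.states, batch.actions)
            ratio = new_probs / batch.old_probs

            # Clipped objective, averaged over the minibatch (negated: we minimize)
            clipped_ratio = clip(ratio, 1 - eps, 1 + eps)
            policy_loss = -mean(minimum(ratio * batch.advantages,
                                        clipped_ratio * batch.advantages))

            # Critic regresses toward the return targets
            value_loss = mean((critic(batch.states) - batch.returns) ** 2)

            # Update policy and critic (0.5 is an illustrative value-loss weight)
            optimizer.zero_grad()
            (policy_loss + 0.5 * value_loss).backward()
            optimizer.step()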

What Makes PPO Special

Stable Learning

Clipping prevents catastrophic updates that destroy learned behavior

Sample Efficient

Reuses each batch of collected data for multiple epochs of updates, so it needs fewer environment steps than vanilla policy gradient

Simple to Implement

No complex trust region constraints like TRPO

Works Well Out-of-Box

Default hyperparameters work for many tasks

PPO Hyperparameters Explained

These are the knobs you'll tune in your experiments. Understanding what each does helps you debug and improve performance.

Interactive Parameter Explorer

learning_rate 3e-4

Standard starting point. Lower = slower but more stable learning.

n_steps 2048

Steps collected before each update. More = stable gradients, slower iteration.

batch_size 64

Minibatch size for SGD. Must be ≤ n_steps, the amount of data collected per rollout. Larger = smoother gradients.

gamma (γ) 0.99

Discount factor. Higher = agent cares more about long-term rewards.

gae_lambda (λ) 0.95

GAE parameter. Higher = lower bias, higher variance in advantage estimates.

ent_coef 0.0

Entropy bonus. Higher = more exploration, prevents premature convergence.
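One way to see how n_steps and batch_size interact is to count the gradient steps taken on each rollout. The sketch below stores the defaults above in a plain dictionary; n_epochs=10 (borrowed from the training loop's range(10)) and n_envs=1 are assumptions, and updates_per_rollout is a helper written for this example.

```python
config = {
    "learning_rate": 3e-4,
    "n_steps": 2048,
    "batch_size": 64,
    "gamma": 0.99,
    "gae_lambda": 0.95,
    "ent_coef": 0.0,
}


def updates_per_rollout(n_steps, batch_size, n_epochs=10, n_envs=1):
    """Number of minibatch gradient steps taken on one rollout."""
    rollout_size = n_steps * n_envs
    if batch_size > rollout_size:
        raise ValueError("batch_size cannot exceed the rollout size")
    minibatches_per_epoch = rollout_size // batch_size
    return minibatches_per_epoch * n_epochs


print(updates_per_rollout(config["n_steps"], config["batch_size"]))  # 320
```

With these defaults each rollout of 2048 steps is split into 32 minibatches and reused for 10 epochs, giving 320 gradient steps before new data is collected. Raising batch_size to 512 would cut that to 40.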

Quick Reference Guide

learning_rate

Range: 1e-5 to 1e-2

Too low: Slow learning, might not converge

Too high: Unstable, performance collapses

Start with: 3e-4

n_steps

Range: 128 to 4096

Too low: High variance, noisy updates

Too high: Slow iteration, fewer policy updates per environment step

Start with: 2048

batch_size

Range: 32 to 512

Too low: Noisy gradients, slow convergence

Too high: Fewer updates per rollout

Start with: 64

gamma

Range: 0.9 to 0.999

Lower: Short-sighted, faster learning

Higher: Far-sighted, harder to learn

Start with: 0.99

gae_lambda

Range: 0.8 to 1.0

Lower: More biased, lower variance

Higher: Less biased, higher variance

Start with: 0.95

ent_coef

Range: 0.0 to 0.1

Zero: No exploration bonus

Higher: More random exploration

Start with: 0.0 (CartPole is easy)
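To see what the entropy bonus measures, the sketch below computes the entropy of a two-action policy and shows one common way of folding it into the loss. The function names are illustrative, and subtracting ent_coef times the entropy from the loss is an assumption about how the bonus is applied.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def loss_with_entropy_bonus(policy_loss, probs, ent_coef):
    # Subtracting ent_coef * entropy means that minimizing the loss
    # favors more random (higher-entropy) policies.
    return policy_loss - ent_coef * entropy(probs)


print(entropy([0.5, 0.5]))    # the maximum for two actions: ln 2
print(entropy([0.99, 0.01]))  # nearly deterministic: close to 0
print(loss_with_entropy_bonus(1.0, [0.5, 0.5], ent_coef=0.01))
```

A policy that has collapsed onto one action has entropy near zero, so with ent_coef > 0 the loss nudges it back toward trying both actions. With ent_coef = 0.0 the term vanishes.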

Debugging Tips

📉

Reward suddenly collapses?

The learning rate is likely too high. Try reducing it by 10x (e.g., 3e-4 → 3e-5)

📊

Learning is very slow?

Try increasing learning rate, or reducing n_steps for faster iteration

🎲

Agent gets stuck in one behavior?

Increase ent_coef to encourage exploration (try 0.01)

⚠️

Training crashes with error?

Check that batch_size ≤ n_steps: a minibatch cannot be larger than the data collected in one rollout. This is a common mistake!

Check Your Understanding

Test what you've learned about PPO!

1. What does the "clipping" in PPO prevent?

The agent from exploring new actions
The critic from updating
Large, destabilizing policy updates
The use of multiple epochs

2. In actor-critic, what does the critic estimate?

The best action to take
The value of a state V(s)
The learning rate
The entropy bonus

3. If your agent's performance suddenly collapses during training, what should you try first?

Decrease the learning rate
Increase n_steps
Set gamma to 1.0
Remove the critic

4. What does a positive advantage A(s,a) > 0 mean?

The action was random
The state value is high
The action was better than average for that state
The episode ended successfully

5. What is the purpose of the entropy coefficient (ent_coef)?

Speed up training
Reduce memory usage
Make the critic more accurate
Encourage exploration by rewarding randomness

6. Why must batch_size be ≤ n_steps?

To save memory
You can't create batches larger than the data collected
It's an arbitrary convention
To make the GPU faster
