PPO Explainer

The Reinforcement Learning Loop

Reinforcement learning (RL) is learning through trial and error. An agent interacts with an environment, receives rewards, and learns to maximize its total reward over time.

Agent (policy π(a|s)) → Action (a_t) → Environment (CartPole) → State + Reward (s_{t+1}, r_t) → Learn (update π)

Key RL Vocabulary

🎯

State (s)

What the agent observes. In CartPole: [cart_position, cart_velocity, pole_angle, pole_angular_velocity]

🎮

Action (a)

What the agent does. In CartPole: push left (0) or push right (1)

🏆

Reward (r)

Feedback signal. In CartPole: +1 for every timestep the pole stays upright

🧠

Policy (π)

The agent's strategy: given a state, which action to take? Written as π(a|s) = probability of action a given state s

📈

Episode

One complete run from start to termination. A CartPole episode ends when the pole falls, the cart leaves the track, or 500 steps are reached.
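The vocabulary above fits together in a single loop. The sketch below is only an illustration and is not CartPole itself: `ToyBalanceEnv`, `centering_policy`, and `run_episode` are hypothetical names invented for this example, and the environment is a one-dimensional stand-in that gives +1 per step until the position drifts out of bounds or a 500-step limit is reached.

```python
import random


class ToyBalanceEnv:
    """Hypothetical 1-D stand-in for CartPole (not the real environment)."""

    def __init__(self, limit=3, max_steps=500, seed=0):
        self.limit = limit
        self.max_steps = max_steps
        self.rng = random.Random(seed)

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position  # initial state s_0

    def step(self, action):
        # Action 0 pushes left, action 1 pushes right; drift adds noise.
        push = -1 if action == 0 else 1
        drift = self.rng.choice([-1, 0, 1])
        self.position += push + drift
        self.steps += 1
        done = abs(self.position) > self.limit or self.steps >= self.max_steps
        reward = 1.0  # +1 for every timestep survived
        return self.position, reward, done


def centering_policy(state):
    """Hand-written policy: push back toward the center."""
    return 0 if state > 0 else 1


def run_episode(env, policy):
    """Run one episode and return the total (undiscounted) reward."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                  # agent picks a_t from s_t
        state, reward, done = env.step(action)  # environment returns s_t+1, r_t
        total_reward += reward
    return total_reward


episode_return = run_episode(ToyBalanceEnv(), centering_policy)
print(episode_return)
```

A learning agent would replace the hand-written `centering_policy` with a parameterized π(a|s) and use the collected rewards to update it.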

The Goal of RL

Find a policy π that maximizes expected cumulative reward:

J(π) = E[r₀ + γr₁ + γ²r₂ + γ³r₃ + ...]

Where γ (gamma) is the discount factor. It controls how much we care about future vs. immediate rewards:

γ = 0.9 (Lower)

More "short-sighted"

Prefers immediate rewards

Faster learning, possibly suboptimal

γ = 0.99 (Higher)

More "far-sighted"

Considers long-term consequences

Slower learning, better final policy
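To make the effect of γ concrete, here is a small sketch that computes the discounted sum from the formula above for a fixed reward sequence. The function name `discounted_return` and the 100-step reward list are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode's rewards."""
    total = 0.0
    # Work backwards so each step is: total = r_t + gamma * (return from t+1).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0] * 100  # e.g. surviving 100 steps at +1 each
print(discounted_return(rewards, 0.9))
print(discounted_return(rewards, 0.99))
```

With γ = 0.9 the sum is close to 10, so rewards beyond the first few dozen steps barely count. With γ = 0.99 it is about 63, so later steps still carry noticeable weight.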

Policy Gradient Methods

Instead of learning how valuable states or actions are and deriving behavior from those values (value-based methods like Q-learning), policy gradient methods directly learn the policy.

The Core Idea

If an action led to good reward → make it more likely
If an action led to bad reward → make it less likely

∇J(θ) ≈ E[∇log π_θ(a|s) · A(s,a)]

This formula says: adjust the policy parameters θ in the direction that increases the probability of actions with positive advantage.
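As a rough illustration of that update, the sketch below applies one gradient step to a softmax policy over two actions. It assumes a state-independent policy parameterized directly by its logits, which is far simpler than a neural network; `softmax` and `policy_gradient_step` are hypothetical helper names.

```python
import math


def softmax(logits):
    """Convert logits to action probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, action, advantage, lr=0.1):
    """One ascent step on log pi(action) * advantage.

    For a softmax over logits, d log pi(a) / d logit_k = 1[k == a] - pi(k).
    """
    probs = softmax(logits)
    return [
        logit + lr * advantage * ((1.0 if k == action else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]


logits = [0.0, 0.0]  # two actions, equally likely at first
better = policy_gradient_step(logits, action=1, advantage=+2.0)
worse = policy_gradient_step(logits, action=1, advantage=-2.0)
print(softmax(better)[1], softmax(worse)[1])
```

A positive advantage pushes the probability of action 1 above 0.5, and a negative advantage pushes it below 0.5.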

What is Advantage?

Advantage tells us: "Was this action better or worse than average?"

A(s, a) = Q(s, a) - V(s)

📊

V(s) - State Value

Expected total reward starting from state s, following policy π

📊

Q(s, a) - Action Value

Expected total reward starting from state s, taking action a, then following π

⚖️

A(s, a) - Advantage

Positive = action was better than expected
Negative = action was worse than expected
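A tiny worked example may help here. The Q-values and policy probabilities below are made-up numbers for a single state, and `state_value` and `advantages` are hypothetical helper names; the sketch uses the identity V(s) = Σ π(a|s) · Q(s, a).

```python
def state_value(q_values, policy_probs):
    """V(s) = sum over actions of pi(a|s) * Q(s, a)."""
    return sum(policy_probs[a] * q_values[a] for a in q_values)


def advantages(q_values, policy_probs):
    """A(s, a) = Q(s, a) - V(s) for every action in one state."""
    v = state_value(q_values, policy_probs)
    return {a: q - v for a, q in q_values.items()}


# Made-up numbers for a single state.
q_values = {"left": 8.0, "right": 12.0}
policy_probs = {"left": 0.5, "right": 0.5}
print(advantages(q_values, policy_probs))  # {'left': -2.0, 'right': 2.0}
```

Here V(s) = 10, so pushing right is 2 better than average and pushing left is 2 worse.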

The Problem with Vanilla Policy Gradient

❌ Too Large Updates

Big policy changes can destroy good behavior learned so far. Performance can collapse suddenly.

❌ High Variance

Reward signals are noisy. One lucky/unlucky episode can cause wild policy swings.

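This is where PPO comes in. It limits the size of each policy update through clipping, and it reduces variance by using a critic and GAE to estimate advantages.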

Actor-Critic Architecture

PPO uses an actor-critic setup with two neural networks (or one network with two heads):

🎭 Actor (Policy)

Decides what to do

Input: State s
Output: π(a|s)

The advantage, computed from the critic's value estimates, guides the actor's updates.

🎯 Critic (Value)

Evaluates how good a state is

Input: State s
Output: V(s)

How They Work Together

Actor chooses actions based on current policy
Critic estimates value of states visited
Advantage is computed using critic's estimates
Actor is updated to favor high-advantage actions
Critic is updated to better predict values
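The five steps above can be sketched for a single transition. This is a simplified one-step, tabular actor-critic update rather than PPO itself: the lookup tables stand in for the neural networks, the advantage is the one-step TD error, and `actor_critic_update` is a hypothetical name.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_update(logits, values, s, a, r, s_next, done,
                        gamma=0.99, actor_lr=0.1, critic_lr=0.1):
    """Apply one tabular actor-critic update in place; return the advantage."""
    # The critic's estimates give a one-step advantage (the TD error).
    target = r + (0.0 if done else gamma * values[s_next])
    advantage = target - values[s]
    # Actor: raise the log-probability of `a` in proportion to the advantage.
    probs = softmax(logits[s])
    for k in range(len(logits[s])):
        grad_log_prob = (1.0 if k == a else 0.0) - probs[k]
        logits[s][k] += actor_lr * advantage * grad_log_prob
    # Critic: move V(s) toward the bootstrapped target.
    values[s] += critic_lr * advantage
    return advantage


logits = {"s0": [0.0, 0.0], "s1": [0.0, 0.0]}  # actor table
values = {"s0": 0.0, "s1": 0.0}                # critic table
adv = actor_critic_update(logits, values, s="s0", a=1, r=1.0,
                          s_next="s1", done=False)
print(adv, values["s0"], softmax(logits["s0"]))
```

After this single update the critic's estimate for `s0` has moved toward the observed reward, and the actor favors action 1 slightly more in `s0`.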

GAE: Generalized Advantage Estimation

PPO uses GAE to compute advantages. It balances bias vs. variance with a parameter λ (lambda).

A^GAE_t = δ_t + (γλ)δ_{t+1} + (γλ)²δ_{t+2} + ...

Where δ_t = r_t + γV(s_{t+1}) - V(s_t) is the TD error.

λ = 0 (Low)

High bias, low variance

Uses only immediate TD error

Good for: Simple tasks, fast learning

λ = 1 (High)

Low bias, high variance

Uses full Monte Carlo returns

Good for: Complex tasks, accurate gradients

Common choice: λ = 0.95 balances both well for most tasks.
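The GAE sum can be computed with a single backward pass over a trajectory. The sketch below is a minimal pure-Python version under some assumptions: it handles one trajectory segment with no episode boundaries inside it, and the parameter names (`last_value`, `lam`) are chosen for this example.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """GAE advantages for one trajectory segment.

    rewards[t] and values[t] = V(s_t) cover steps 0..T-1. last_value is
    V(s_T), the critic's estimate for the state after the final step
    (use 0.0 if the episode terminated there).
    """
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running  # A_t = delta_t + (gamma*lam) * A_t+1
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
print(compute_gae(rewards, values, last_value=0.0, gamma=0.5, lam=0.0))
print(compute_gae(rewards, values, last_value=0.0, gamma=0.5, lam=1.0))
```

With λ = 0 each advantage is just the TD error at that step. With λ = 1 each advantage equals the discounted Monte Carlo return minus V(s_t).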

The PPO Innovation: Clipping

PPO's key insight: limit how much the policy can change in a single update.

It does this by "clipping" the objective function:

L^CLIP = min(r(θ)A, clip(r(θ), 1-ε, 1+ε)A)

Where:

r(θ) = probability ratio = π_new(a|s) / π_old(a|s)
ε (epsilon) = clipping range, typically 0.2
A = advantage estimate

Visualizing the Clipping

Example settings: advantage A = +1.0 (good action), clip epsilon ε = 0.2

🔑

Why Clipping Works

If A > 0 (good action): the objective stops rewarding increases in the ratio beyond (1+ε)
If A < 0 (bad action): the objective stops rewarding decreases in the ratio below (1-ε)
This removes the incentive for catastrophically large updates!
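The behavior of the clipped objective is easy to check numerically for a single sample. The sketch below evaluates L^CLIP for scalar inputs; `clipped_objective` is a hypothetical helper name, and ε = 0.2 follows the typical value given above.

```python
def clipped_objective(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


print(clipped_objective(1.5, +1.0))  # good action, ratio already past 1+eps
print(clipped_objective(0.5, -1.0))  # bad action, ratio already below 1-eps
print(clipped_objective(1.5, -1.0))  # bad action made MORE likely
```

The first two cases are capped at about 1.2 and -0.8, so pushing the ratio further gains nothing. The third case is not clipped (-1.5): when the policy moves in the wrong direction, the full penalty still applies.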

PPO Training Loop

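import torch  # assumes policy.get_probs returns PyTorch tensors

eps = 0.2  # clipping range ε

for iteration in range(num_iterations):
    # 1. Collect trajectories using current policy
    trajectories = collect_rollouts(policy, env, n_steps=2048)

    # 2. Compute advantages using GAE ("lambda" is a reserved word in Python)
    advantages = compute_gae(trajectories, critic, gamma=0.99, gae_lambda=0.95)

    # 3. Store old policy probabilities as constants (no gradient through them)
    with torch.no_grad():
        old_probs = policy.get_probs(trajectories.states, trajectories.actions)

    # 4. Multiple epochs of minibatch updates
    for epoch in range(10):  # PPO typically uses 3-10 epochs
        # Each batch carries matching slices of states, actions, advantages, old_probs
        for batch in minibatches(trajectories, advantages, old_probs, batch_size=64):
            # Compute probability ratio
            new_probs = policy.get_probs(batch.states, batch.actions)
            ratio = new_probs / batch.old_probs

            # Clipped objective, averaged over the batch and negated for minimization
            clipped_ratio = torch.clamp(ratio, 1 - eps, 1 + eps)
            loss = -torch.min(ratio * batch.advantages,
                              clipped_ratio * batch.advantages).mean()

            # Update policy (a full implementation adds a value loss for the critic)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()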

What Makes PPO Special

Stable Learning

Clipping prevents catastrophic updates that destroy learned behavior

Sample Efficient

Reuses collected data for multiple epochs of updates, unlike vanilla policy gradient, which uses each batch once

Simple to Implement

No complex trust region constraints like TRPO

Works Well Out-of-Box

Default hyperparameters work for many tasks

PPO Hyperparameters Explained

These are the knobs you'll tune in your experiments. Understanding what each does helps you debug and improve performance.

Interactive Parameter Explorer

learning_rate 3e-4

Standard starting point. Lower = slower but more stable learning.

n_steps 2048

Steps collected before each update. More = stable gradients, slower iteration.

batch_size 64

Minibatch size for SGD. Must be ≤ n_steps. Larger = smoother gradients.

gamma (γ) 0.99

Discount factor. Higher = agent cares more about long-term rewards.

gae_lambda (λ) 0.95

GAE parameter. Higher = lower bias, higher variance in advantage estimates.

ent_coef 0.0

Entropy bonus. Higher = more exploration, prevents premature convergence.
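Several of these settings interact. As a rough sketch, the number of gradient steps taken on each rollout follows from `n_steps`, `batch_size`, and the number of epochs. The function name `updates_per_rollout` and the `n_epochs` and `n_envs` parameters are assumptions for this example; the 10 epochs match the training loop shown earlier.

```python
def updates_per_rollout(n_steps, batch_size, n_epochs, n_envs=1):
    """Number of minibatch gradient steps taken on one rollout."""
    rollout_size = n_steps * n_envs
    if batch_size > rollout_size:
        raise ValueError("batch_size must be <= n_steps * n_envs")
    return (rollout_size // batch_size) * n_epochs


print(updates_per_rollout(n_steps=2048, batch_size=64, n_epochs=10))  # 320
```

With the defaults listed here, each rollout of 2048 steps is split into 32 minibatches and passed over 10 times, for 320 gradient steps before new data is collected.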

Quick Reference Guide

learning_rate (range: 1e-5 to 1e-2)

Too low: Slow learning, might not converge

Too high: Unstable, performance collapses

Start with: 3e-4

n_steps (range: 128 to 4096)

Too low: High variance, noisy updates

Too high: Slow iteration, stale gradients

Start with: 2048

batch_size (range: 32 to 512)

Too low: Noisy gradients, slow convergence

Too high: Fewer updates per rollout

Start with: 64

gamma (range: 0.9 to 0.999)

Lower: Short-sighted, faster learning

Higher: Far-sighted, harder to learn

Start with: 0.99

gae_lambda (range: 0.8 to 1.0)

Lower: More biased, lower variance

Higher: Less biased, higher variance

Start with: 0.95

ent_coef (range: 0.0 to 0.1)

Zero: No exploration bonus

Higher: More random exploration

Start with: 0.0 (CartPole is easy)
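To see what the entropy bonus measures, the sketch below computes the entropy of two action distributions and shows one common way the bonus enters the loss. The helper names `entropy` and `loss_with_entropy_bonus` are hypothetical, and subtracting `ent_coef * entropy` from the loss is an assumption about how the bonus is applied, not something this guide specifies.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def loss_with_entropy_bonus(policy_loss, probs, ent_coef):
    """Subtracting ent_coef * entropy lowers the loss for more random policies."""
    return policy_loss - ent_coef * entropy(probs)


print(entropy([0.5, 0.5]))    # ln 2, about 0.693: maximally random over 2 actions
print(entropy([0.99, 0.01]))  # much lower: nearly deterministic
print(loss_with_entropy_bonus(1.0, [0.5, 0.5], ent_coef=0.01))
```

With `ent_coef = 0.0` the term vanishes, which is why it can be left off for an easy task.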

Debugging Tips

📉

Reward suddenly collapses?

The learning rate is probably too high. Try reducing it by 10x (e.g., 3e-4 → 3e-5)

📊

Learning is very slow?

Try increasing learning rate, or reducing n_steps for faster iteration

🎲

Agent gets stuck in one behavior?

Increase ent_coef to encourage exploration (try 0.01)

⚠️

Training crashes with error?

Make sure batch_size ≤ n_steps. This is a common mistake!

Check Your Understanding

Test what you've learned about PPO!

1. What does the "clipping" in PPO prevent?

The agent from exploring new actions

The critic from updating

Large, destabilizing policy updates

The use of multiple epochs

2. In actor-critic, what does the critic estimate?

The best action to take

The value of a state V(s)

The learning rate

The entropy bonus

3. If your agent's performance suddenly collapses during training, what should you try first?

Decrease the learning rate

Increase n_steps

Set gamma to 1.0

Remove the critic

4. What does a positive advantage A(s,a) > 0 mean?

The action was random

The state value is high

The action was better than average for that state

The episode ended successfully

5. What is the purpose of the entropy coefficient (ent_coef)?

Speed up training

Reduce memory usage

Make the critic more accurate

Encourage exploration by rewarding randomness

6. Why must batch_size be ≤ n_steps?

To save memory

You can't create batches larger than the data collected

It's an arbitrary convention

To make the GPU faster
