Reading Learning Graphs

Key Metrics to Track

During DQN training, we track several metrics. Understanding what each one means is crucial for diagnosing performance.

Episode Reward

Total reward accumulated in one episode. For CartPole, this equals the number of timesteps the pole stayed balanced (at most 500 in CartPole-v1; CartPole-v0 caps episodes at 200).

reward = steps_balanced

Average Reward (100 ep)

Rolling average over last 100 episodes. Smooths out noise to show the true learning trend. This is your primary success metric.

avg = mean(last_100_rewards)
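A minimal sketch of how this rolling average might be tracked. The class name `RewardTracker` and its methods are illustrative, not part of any library; the window is a parameter so the example can use a small value.

```python
from collections import deque


class RewardTracker:
    """Keeps the most recent episode rewards and reports their mean."""

    def __init__(self, window=100):
        # A bounded deque drops the oldest reward automatically.
        self.rewards = deque(maxlen=window)

    def add(self, episode_reward):
        self.rewards.append(episode_reward)

    def average(self):
        # Before the window fills, this averages whatever has been seen so far.
        if not self.rewards:
            return 0.0
        return sum(self.rewards) / len(self.rewards)


tracker = RewardTracker(window=3)
for r in [10, 20, 30, 40]:
    tracker.add(r)
print(tracker.average())  # mean of the last three rewards: 20, 30, 40
```

With `window=100` this gives the "Average Reward (100 ep)" curve; early values are averages over fewer than 100 episodes and should be read with that in mind.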

Loss

How wrong the network's Q-value predictions are, computed as the squared difference between the predicted Q-value and the target Q-value.

loss = (Q_pred - Q_target)²
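The formula above does not say where `Q_target` comes from. The sketch below assumes the usual one-step bootstrapped target, r + γ · max Q(s', a'), with no bootstrapping on terminal transitions. Function names are illustrative, and plain floats stand in for tensors.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step target: r + gamma * max_a' Q(s', a'); just r if the episode ended."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_loss(q_pred, q_target):
    """Squared difference between prediction and target for one transition."""
    return (q_pred - q_target) ** 2


target = td_target(reward=1.0, next_q_values=[2.0, 3.0], done=False, gamma=0.5)
loss = squared_td_loss(q_pred=2.0, q_target=target)
print(target, loss)
```

In practice the loss is averaged over a batch, and many DQN implementations swap the squared error for a Huber loss to limit the effect of large errors.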

Epsilon (ε)

Exploration rate. Starts high (random actions) and decays over time (exploit learned policy). Tracks the exploration-exploitation balance.

ε: 1.0 → 0.01 (decay)
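One common way to produce this schedule is multiplicative decay with a floor. The sketch below assumes a per-episode decay factor of 0.995; the function name and defaults are illustrative.

```python
def decayed_epsilon(episode, eps_start=1.0, eps_end=0.01, decay=0.995):
    """Multiply epsilon by `decay` once per episode, never going below eps_end."""
    return max(eps_end, eps_start * decay ** episode)


for ep in (0, 100, 500, 1000):
    print(ep, round(decayed_epsilon(ep), 4))
```

A decay factor closer to 1 keeps the agent exploring for longer; a smaller one hands control to the learned policy sooner.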

Interactive: Common Learning Patterns

Healthy Learning

What you see: Reward steadily increases with some variance, then stabilizes at a high level.

What it means: Your agent is learning! The variance is normal - it's exploring. The stabilization shows it found a good policy.

Model performance: Excellent. The agent should perform well in deployment.

Learning Curve Patterns

Healthy Convergence

Signs: Steady upward trend, decreasing variance, stable plateau

Performance: Agent reliably solves the task

Keep these settings!

High Variance / Unstable

Signs: Wild swings between high and low rewards, no clear trend

Performance: Inconsistent - sometimes good, often bad

Needs stabilization

Plateau (Stuck)

Signs: Initial improvement, then flat line below target

Performance: Mediocre - found local optimum

Needs exploration boost

Catastrophic Forgetting

Signs: Reaches high performance, then crashes and doesn't recover

Performance: Unreliable - "forgot" good policy

Critical issue

Not Learning

Signs: Flat line near random performance, no improvement

Performance: Random - agent learned nothing

Check implementation

Slow but Steady

Signs: Gradual improvement, needs more episodes

Performance: Will eventually work, just slow

Be patient or tune LR

Graph → Performance Correlation

What You See	What It Means	Expected Performance	Action
Reward climbing, low variance	Agent is learning effectively	High & consistent	Continue training
Reward climbing but high variance	Learning but unstable updates	Inconsistent	Enable Target Network
Loss decreasing, reward flat	Learning wrong Q-values	Poor	Check reward function
Loss exploding (NaN or huge)	Gradient explosion	Broken	Lower learning rate
Early spike then crash	Catastrophic forgetting	Unreliable	Enable Experience Replay
Flat near random baseline	Not learning at all	Random	Debug code, check gradients
Plateau below goal	Stuck in local optimum	Mediocre	Increase exploration (ε)
Oscillating wildly	Q-value overestimation	Erratic	Enable Double DQN

Understanding Loss vs Reward

Key Insight: Low loss does NOT always mean good performance! Loss measures prediction accuracy, not policy quality. An agent can perfectly predict bad Q-values.

What to Look For

Ideal: Both Improve

Loss decreases while reward increases. The network is learning accurate Q-values that lead to good actions.

Warning: Loss Low, Reward Low

Network confidently predicts wrong values. Often means rewards are sparse or incorrectly defined.

Loss Behavior Phases

Phase 1: High & Chaotic

Early training - network predicting randomly, targets changing rapidly. This is normal!

Phase 2: Decreasing

Network learning patterns, predictions improving. Reward should start climbing here.

Phase 3: Stable (Low)

Network predictions match targets consistently. If reward is high, you're done!

Troubleshooting Guide

Problem: Training is Unstable

Enable Target Network (separates prediction from targets)
Enable Experience Replay (breaks correlation)
Lower learning rate (smaller updates)
Increase batch size (smoother gradients)

Problem: Not Learning at All

Check that rewards are reaching the network
Verify gradients are non-zero
Increase learning rate (updates too small)
Check epsilon isn't stuck at 1.0

Problem: Learns Then Forgets

Enable Experience Replay (essential!)
Increase replay buffer size
Decrease learning rate
Update target network less frequently

Problem: Stuck at Plateau

Slower epsilon decay (explore more)
Increase minimum epsilon
Try different network architecture
Check if environment has solution

How DQN Switches Affect Learning Curves

Switch	OFF Behavior	ON Behavior	Graph Change
Experience Replay	High correlation, forgetting	IID samples, stable learning	Reduces spikes and crashes
Target Network	Moving target, oscillation	Stable targets, smooth learning	Reduces variance significantly
Double DQN	Q-value overestimation	Accurate Q-values	Higher, more stable plateau

Pro Tip: Enable switches one at a time and compare graphs. This helps you understand what each improvement actually does!
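To make the Double DQN row concrete, here is a sketch of how the two targets differ for a single transition. It assumes the standard formulations: plain DQN lets the target network both pick and score the next action, while Double DQN picks with the online network and scores with the target network. All names are illustrative.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def dqn_target(reward, next_q_target, done, gamma=0.99):
    """Standard DQN: the target network selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, done, gamma=0.99):
    """Double DQN: online network selects the action, target network evaluates it."""
    if done:
        return reward
    best_action = argmax(next_q_online)
    return reward + gamma * next_q_target[best_action]


next_q_online = [1.0, 2.0]   # online network prefers action 1
next_q_target = [5.0, 3.0]   # target network overrates action 0
standard = dqn_target(0.0, next_q_target, done=False, gamma=1.0)
double = double_dqn_target(0.0, next_q_online, next_q_target, done=False, gamma=1.0)
print(standard, double)
```

Because the two networks rarely overestimate the same action, the Double DQN target tends to be lower and less biased, which is what shows up as a steadier plateau in the reward curve.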

Before You Celebrate: Performance Checklist

Is Your Model Actually Good?

Average reward exceeds target (CartPole: 195+)

Performance is consistent (low variance in last 100 episodes)

Multiple runs produce similar results

Agent performs well with epsilon=0 (pure exploitation)

Loss has stabilized (not still decreasing rapidly)

No recent performance crashes

CartPole Success Criteria

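OpenAI Gym considers CartPole-v0 "solved" when the agent achieves an average reward of 195+ over 100 consecutive episodes, meaning the pole stays balanced for at least 195 of the 200 available timesteps on average. CartPole-v1, whose episodes run up to 500 timesteps, uses a threshold of 475 instead.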
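The criterion can be checked directly against a list of episode rewards. This is a sketch with an illustrative function name; the threshold and window are parameters so other targets can be used.

```python
def first_solved_episode(rewards, threshold=195.0, window=100):
    """Return the 1-based episode at which the mean of the trailing `window`
    episodes first reaches `threshold`, or None if it never does."""
    running = 0.0
    for i, r in enumerate(rewards):
        running += r
        if i >= window:
            running -= rewards[i - window]  # drop the reward that left the window
        if i + 1 >= window and running / window >= threshold:
            return i + 1
    return None


history = [0] * 50 + [200] * 100
print(first_solved_episode(history))
```

Note that the window must be full before the check counts, so the earliest possible answer is episode 100.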

Poor: < 50

Basically random. Pole falls almost immediately.

Learning: 50-150

Some understanding, but inconsistent.

Good: 150-195

Decent policy, but not solved yet.

Solved: 195+

Consistently balances the pole!

Diagnostic Workflow: Step-by-Step Debugging

When training isn't working, follow this systematic approach to identify and fix the problem.

1. Check: Is the reward signal reaching the agent?

How to test: Print rewards after each episode. Are they non-zero?

print(f"Episode {ep}: reward = {total_reward}")

If broken: Check environment setup, verify env.step() returns correct values

2. Check: Are gradients flowing?

How to test: Check gradient norms after backward pass

grad_norm = sum(p.grad.norm().item() for p in model.parameters() if p.grad is not None)

If zero: Loss not connected to parameters, or loss.backward() not called

If exploding (>1000): Learning rate too high, add gradient clipping

If vanishing (<1e-7): Dead ReLUs, saturated activations, or poor initialization (the learning rate scales the update, not the gradient itself)
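The thresholds in this step can be turned into a small check. The sketch below works on plain lists of floats so it runs without PyTorch; in a real training loop each inner list would be the flattened `p.grad` of one parameter, skipping parameters whose `grad` is `None`. Function names and return labels are illustrative.

```python
import math


def global_grad_norm(grads):
    """L2 norm over every gradient value; `grads` is a list of per-parameter lists."""
    return math.sqrt(sum(g * g for param in grads for g in param))


def diagnose_gradients(norm, explode_above=1000.0, vanish_below=1e-7):
    """Map a gradient norm to one of the cases described in this step."""
    if math.isnan(norm):
        return "nan"
    if norm == 0.0:
        return "zero"
    if norm > explode_above:
        return "exploding"
    if norm < vanish_below:
        return "vanishing"
    return "ok"


grads = [[3.0, 0.0], [4.0]]
norm = global_grad_norm(grads)
print(norm, diagnose_gradients(norm))
```

Logging this label every few hundred updates is usually enough; a single odd value matters less than a sustained run of them.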

3. Check: Are Q-values reasonable?

How to test: Print Q-values periodically

print(f"Q-values: {q_network(state)}")

If all same: Network not differentiating states

If huge (>1000): Q-value explosion, lower learning rate

If NaN: Numerical instability, check for division by zero

4. Check: Is exploration working?

How to test: Log action distribution over an episode

action_counts = {0: left_count, 1: right_count}

If always same action: Epsilon too low, or stuck in local optimum

If 50/50 throughout: Epsilon not decaying, or not exploiting
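A sketch of the action-distribution log, assuming the actions taken during one episode have been collected in a list (here called `episode_actions`, a hypothetical name).

```python
from collections import Counter


def action_fractions(actions):
    """Fraction of the episode spent on each action."""
    counts = Counter(actions)
    total = len(actions)
    return {action: count / total for action, count in counts.items()}


episode_actions = [0, 1, 1, 0, 1, 1, 1, 0]  # 0 = left, 1 = right in CartPole
fractions = action_fractions(episode_actions)
print(fractions)
```

Comparing these fractions early and late in training shows whether the policy is moving away from a coin flip as epsilon decays.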

5. Check: Is the replay buffer working?

How to test: Verify buffer size and sample diversity

print(f"Buffer size: {len(replay_buffer)}")

If empty: Experiences not being stored

If not growing: Check buffer.push() is being called

If always same samples: Random sampling broken
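The three failure cases above are easier to test against a known-good buffer. Below is a minimal sketch of one; the class layout is an assumption, keeping only the `push()` and `len()` names used in this step.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer with uniform random sampling."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # oldest transitions fall off

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampling without replacement: no transition repeats within a batch.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=1000)
for t in range(50):
    buffer.push(t, t % 2, 1.0, t + 1, False)

batch = buffer.sample(8)
distinct_states = len({transition[0] for transition in batch})
print(f"Buffer size: {len(buffer)}, distinct states in batch: {distinct_states}")
```

If a buffer under test reports far fewer distinct states per batch than the batch size, the sampling step is the first place to look.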

Golden Rule: Always change ONE thing at a time and observe the effect. Changing multiple things makes it impossible to know what helped (or hurt).

Optimization Approaches

Systematic methods to improve training performance beyond the basic DQN switches.

Learning Rate Strategies

Linear Decay

Gradually reduce LR as training progresses. High LR for fast initial learning, low LR for fine-tuning.

lr = lr_start * (1 - episode / total_episodes)

Recommended for beginners

Step Decay

Drop LR by factor at specific milestones. Good when you know roughly when learning plateaus.

lr = lr_start * 0.1 ** (episode // step_size)

Good for known training length

Warmup + Decay

Start with low LR, ramp up, then decay. Prevents early instability from random weights.

warmup: lr = lr_max * (step / warmup_steps)

Advanced technique

Cyclical LR

Oscillate between min and max LR. Can escape local optima by periodically increasing LR.

lr = lr_min + (lr_max - lr_min) * cycle_position

Experimental
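The four schedules can be sketched as plain functions. Two details are assumptions, since the formulas above leave them open: the decay after warmup is taken to be linear, and `cycle_position` is taken to be a triangular wave that rises from 0 to 1 and falls back once per cycle.

```python
def linear_decay(lr_start, episode, total_episodes):
    return lr_start * (1 - episode / total_episodes)


def step_decay(lr_start, episode, step_size, factor=0.1):
    # Integer division counts how many milestones have been passed.
    return lr_start * factor ** (episode // step_size)


def warmup_then_linear(lr_max, step, warmup_steps, total_steps):
    if step < warmup_steps:
        return lr_max * step / warmup_steps  # ramp up from 0 to lr_max
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return lr_max * max(0.0, remaining)      # then decay linearly to 0


def triangular_cycle(lr_min, lr_max, step, cycle_length):
    phase = (step % cycle_length) / cycle_length
    cycle_position = 1 - abs(2 * phase - 1)  # 0 -> 1 -> 0 over one cycle
    return lr_min + (lr_max - lr_min) * cycle_position


for ep in (0, 250, 500, 750):
    print(ep, linear_decay(1e-3, ep, 1000), step_decay(1e-3, ep, 300))
```

Whichever schedule is used, the computed value still has to be written into the optimizer each step or episode; how that is done depends on the framework.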

Exploration Strategies

Epsilon-Greedy (Standard)

Random action with probability ε, greedy otherwise. Simple and effective.

ε: 1.0 → 0.01 over N episodes

Default choice

Noisy Networks

Add learnable noise to network weights. Network learns WHEN to explore based on uncertainty.

W = μ + σ * ε, where ε ~ N(0,1)

No epsilon tuning needed

Boltzmann Exploration

Sample actions with probability proportional to exp(Q/τ) (a softmax over Q-values). Higher Q = more likely, but lower-valued actions are still tried.

P(a) = exp(Q(a)/τ) / Σ_a' exp(Q(a')/τ)

Temperature τ sensitive

UCB (Upper Confidence Bound)

Bonus for less-visited state-actions. Balances exploration of uncertain areas.

Q_ucb = Q + c * sqrt(ln(t) / N(a))

Requires visit counts
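The Boltzmann and UCB formulas above translate almost directly into code. This sketch uses plain lists of Q-values and visit counts; the function names, the default `c=2.0`, and the choice to give unvisited actions an infinite score are assumptions.

```python
import math
import random


def boltzmann_probs(q_values, tau=1.0):
    """Softmax over Q/tau; subtracting the max keeps exp() from overflowing."""
    highest = max(q_values)
    exps = [math.exp((q - highest) / tau) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_boltzmann(q_values, tau=1.0):
    probs = boltzmann_probs(q_values, tau)
    return random.choices(range(len(q_values)), weights=probs, k=1)[0]


def ucb_scores(q_values, counts, t, c=2.0):
    """Q + c * sqrt(ln t / N(a)); unvisited actions score +inf so they get tried."""
    return [
        q + c * math.sqrt(math.log(t) / n) if n > 0 else math.inf
        for q, n in zip(q_values, counts)
    ]


print(boltzmann_probs([1.0, 2.0], tau=1.0))
print(boltzmann_probs([1.0, 2.0], tau=0.1))   # low temperature: nearly greedy
print(ucb_scores([1.0, 1.0], counts=[1, 100], t=100))
```

The two temperature settings show why τ is sensitive: the same Q-values give a soft preference at τ = 1 and an almost deterministic choice at τ = 0.1.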

Network Architecture Tuning

Hidden Layer Size

Too small (32): Can't represent complex patterns
Sweet spot (64-256): Good for most tasks
Too large (1024+): Overfits, slow training
Rule of thumb: Start with 128, adjust based on results

Number of Layers

1 hidden: Simple tasks (CartPole)
2-3 hidden: Most RL tasks
4+ hidden: Complex visual tasks, use with caution
Deeper ≠ better: Diminishing returns in RL

Activation Functions

ReLU: Default choice, fast, works well
LeakyReLU: Prevents dead neurons
Tanh: Bounded output, good for some tasks
Avoid Sigmoid in hidden layers: Vanishing gradients in deep nets

Batch Size

Small (16-32): Noisy gradients, more exploration
Medium (64-128): Good balance
Large (256+): Stable but may need higher LR
Memory limited?: Use gradient accumulation

Advanced Techniques

Additional improvements beyond the standard DQN switches. These are potential features for future sessions.

Prioritized Experience Replay

W20D2 Topic

Problem it solves: Uniform sampling wastes time on easy transitions

How it works: Sample transitions proportional to TD error (surprise). Learn more from mistakes.

P(i) = |TD_error_i|^α / Σ|TD_error|^α

Impact: 2-3x faster learning, better sample efficiency
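The sampling formula can be sketched as follows. The small constant `eps` added to each priority is an assumption not shown in the formula above; it keeps transitions with zero TD error from never being sampled. The default `alpha=0.6` is likewise only an illustrative choice.

```python
import random


def priority_probs(td_errors, alpha=0.6, eps=1e-6):
    """P(i) = p_i^alpha / sum_k p_k^alpha, with p_i = |TD error_i| + eps."""
    priorities = [(abs(e) + eps) ** alpha for e in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def sample_indices(td_errors, batch_size, alpha=0.6):
    probs = priority_probs(td_errors, alpha)
    return random.choices(range(len(td_errors)), weights=probs, k=batch_size)


td_errors = [0.1, 0.1, 2.0, 0.1]
print(priority_probs(td_errors))
print(sample_indices(td_errors, batch_size=5))
```

Setting `alpha=0` recovers uniform sampling, so the parameter acts as a dial between ordinary replay and fully greedy prioritization. Full implementations also reweight the loss to correct for the non-uniform sampling.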

Dueling DQN Architecture

Architecture change

Problem it solves: A single Q-value output mixes how good a state is with how much each action matters, so the state value must be relearned for every action

How it works: Split network into Value stream V(s) and Advantage stream A(s,a)

Q(s,a) = V(s) + (A(s,a) - mean(A))

Impact: Better generalization, especially when actions have similar values
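The aggregation step is small enough to show with plain numbers. In this sketch `value` stands for the output of the V(s) stream and `advantages` for the output of the A(s,a) stream; the function name is illustrative.

```python
def dueling_q(value, advantages):
    """Q(s,a) = V(s) + (A(s,a) - mean(A)) for every action."""
    mean_advantage = sum(advantages) / len(advantages)
    return [value + (a - mean_advantage) for a in advantages]


q_values = dueling_q(value=10.0, advantages=[1.0, 3.0])
print(q_values)
```

Subtracting the mean forces the advantages to average to zero, so the mean of the resulting Q-values equals V(s) and the two streams cannot trade a constant back and forth.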

Noisy Networks

Replaces ε-greedy

Problem it solves: ε-greedy explores randomly, not intelligently

How it works: Add learnable noise parameters to weights. Network learns when to explore.

y = (μ_w + σ_w * ε) * x + μ_b + σ_b * ε

Impact: State-dependent exploration, no ε tuning needed

N-Step Returns

TD improvement

Problem it solves: 1-step TD has high bias, MC has high variance

How it works: Use N steps of actual rewards before bootstrapping

G_t^(n) = r_t + γ r_{t+1} + ... + γ^(n-1) r_{t+n-1} + γ^n V(s_{t+n})

Impact: Faster reward propagation, good balance of bias/variance
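A sketch of the N-step return, computed backwards from the bootstrap value. Here `rewards` holds the n observed rewards starting at time t, and `bootstrap_value` is the estimate of V(s_{t+n}), which in DQN would typically be the maximum target-network Q-value (or 0 if the episode ended). The function name is illustrative.

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1} + gamma^n * V(s_{t+n})."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g  # fold in one earlier reward per iteration
    return g


g = n_step_return(rewards=[1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
print(g)  # 1 + 0.5 + 0.25 + 0.125 * 10
```

With a single reward in the list this reduces to the ordinary one-step TD target.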

Weight Initialization (Pre-seeding)

Quick win

Problem it solves: Random initialization can lead to dead neurons or slow start

Options:

Xavier/Glorot: Good for tanh activations
He/Kaiming: Best for ReLU (PyTorch's nn.Linear default is a Kaiming-uniform variant)
Orthogonal: Preserves gradient norms
Pre-trained: Initialize from similar task

Impact: Faster initial learning, more stable training

Reward Normalization

Stability boost

Problem it solves: Varying reward scales make learning rate tuning hard

Options:

Clipping: rewards = clip(rewards, -1, 1)
Scaling: rewards = rewards / max_reward
Running normalization: Use running mean/std

Impact: Consistent learning dynamics across environments
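The clipping and running-normalization options can be sketched as below. The running statistics use Welford's online update so no reward history needs to be stored; the class name, the `eps` guard against division by zero, and the choice to return a standard deviation of 1.0 until two rewards have been seen are assumptions.

```python
import math


def clip_reward(reward, low=-1.0, high=1.0):
    return max(low, min(high, reward))


class RunningNormalizer:
    """Tracks a running mean and standard deviation of observed rewards."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def std(self):
        if self.count < 2:
            return 1.0
        return math.sqrt(self.m2 / self.count)

    def normalize(self, x):
        return (x - self.mean) / (self.std() + self.eps)


normalizer = RunningNormalizer()
for reward in [1.0, 2.0, 3.0, 4.0, 5.0]:
    normalizer.update(reward)
print(normalizer.mean, normalizer.std(), normalizer.normalize(5.0))
```

Clipping is the simplest choice but discards reward magnitude; running normalization keeps relative sizes at the cost of targets that shift as the statistics settle.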

Gradient Clipping

Stability essential

Problem it solves: Large TD errors cause exploding gradients

How it works: Cap gradient magnitude before applying updates

torch.nn.utils.clip_grad_norm_(params, max_norm=10)

Impact: Prevents catastrophic weight updates, more stable training
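To show what norm-based clipping does, here is a sketch on plain lists of floats. It illustrates the idea behind `clip_grad_norm_` rather than reproducing its exact implementation, and the function name is hypothetical. In a PyTorch loop the real call belongs after `loss.backward()` and before `optimizer.step()`.

```python
import math


def clip_by_global_norm(grads, max_norm=10.0):
    """Rescale all gradients together so their combined L2 norm is at most max_norm.

    `grads` is a list of per-parameter lists. Returns (clipped_grads, original_norm).
    """
    total_norm = math.sqrt(sum(g * g for param in grads for g in param))
    if total_norm <= max_norm:
        return grads, total_norm
    scale = max_norm / total_norm
    clipped = [[g * scale for g in param] for param in grads]
    return clipped, total_norm


grads = [[30.0], [40.0]]  # combined norm is 50
clipped, original_norm = clip_by_global_norm(grads, max_norm=10.0)
print(original_norm, clipped)
```

Because every value is scaled by the same factor, the direction of the update is preserved and only its length is capped.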

Model Checkpointing

Best practice

Problem it solves: Training can crash, or best model isn't final model

How it works: Save model weights when performance improves

if avg_reward > best_reward: best_reward = avg_reward; torch.save(model.state_dict(), checkpoint_path)

Impact: Never lose progress, deploy best model not last model

Hyperparameter Sensitivity Guide

How much each hyperparameter affects training, and safe ranges to try.

Hyperparameter	Sensitivity	Safe Range	Effect of Too Low	Effect of Too High
Learning Rate	Very High	1e-4 to 1e-3	Slow/no learning	Unstable, divergence
Discount (γ)	Medium	0.95 to 0.99	Myopic (short-sighted)	Slow convergence
Epsilon Decay	Medium	0.995 to 0.9999	Exploits too early	Explores too long
Batch Size	Low	32 to 256	Noisy updates	Memory issues, slow
Buffer Size	Low	10K to 1M	Correlation issues	Stale experiences
Target Update Freq	Medium	100 to 10000 steps	Unstable targets	Stale targets
Hidden Size	Low	64 to 512	Underfitting	Overfitting, slow

Tuning Priority: Learning rate → Epsilon decay → Target update frequency → Everything else. Get these three right first!