Reading Learning Graphs

How to interpret training curves and diagnose model performance

Key Metrics to Track

During DQN training, we track several metrics. Understanding what each one means is crucial for diagnosing performance.

Episode Reward

Total reward accumulated in one episode. For CartPole, this equals the number of timesteps the pole stayed balanced (capped at 500 in CartPole-v1, 200 in CartPole-v0).

reward = steps_balanced

Average Reward (100 ep)

Rolling average over last 100 episodes. Smooths out noise to show the true learning trend. This is your primary success metric.

avg = mean(last_100_rewards)
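
One way to keep this rolling average during training is a fixed-length deque, sketched below. The class name `RollingAverage` and the small demo window are illustrative assumptions, not part of the original training code.

```python
from collections import deque


class RollingAverage:
    """Mean of the most recent `window` episode rewards."""

    def __init__(self, window=100):
        # A deque with maxlen drops the oldest reward automatically.
        self.rewards = deque(maxlen=window)

    def update(self, episode_reward):
        self.rewards.append(episode_reward)
        # Early in training the average covers fewer than `window` episodes.
        return sum(self.rewards) / len(self.rewards)


# Demo with a window of 3 so the sliding behaviour is easy to follow.
tracker = RollingAverage(window=3)
averages = [tracker.update(r) for r in [10, 20, 30, 40]]
```

In a real run you would call `update` once per episode with `window=100` and plot the returned value.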

Loss

How wrong the network's Q-value predictions are. Calculated as the difference between predicted Q and target Q (squared).

loss = (Q_pred - Q_target)²
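
The formula above does not say where Q_target comes from. The sketch below assumes the standard one-step DQN target (reward plus discounted best next-state Q-value, with no bootstrapping on terminal steps). The function names are hypothetical.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step target: r + gamma * max_a' Q(s', a'), or just r at episode end."""
    if done:
        # Terminal transitions have no future value to bootstrap from.
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_loss(q_pred, q_target):
    """Squared difference between the predicted and target Q-value."""
    return (q_pred - q_target) ** 2


target = td_target(reward=1.0, next_q_values=[2.0, 3.0], done=False, gamma=0.9)
loss = squared_td_loss(q_pred=3.0, q_target=target)
```

In practice the loss is averaged over a batch of transitions; this single-transition version only shows the arithmetic.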

Epsilon (ε)

Exploration rate. Starts high (random actions) and decays over time (exploit learned policy). Tracks the exploration-exploitation balance.

ε: 1.0 → 0.01 (decay)
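
A minimal sketch of one common decay schedule, assuming a multiplicative per-episode decay with a floor. The decay factor 0.995 and the function name are assumptions for illustration; the original training code may decay per step or linearly instead.

```python
def decayed_epsilon(episode, eps_start=1.0, eps_min=0.01, decay=0.995):
    """Exponential decay from eps_start, never dropping below eps_min."""
    return max(eps_min, eps_start * decay ** episode)


# Epsilon at a few points in training.
schedule = [decayed_epsilon(ep) for ep in (0, 100, 500, 2000)]
```

Plotting this alongside the reward curve shows whether the agent stopped exploring before it found a good policy.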

Interactive: Common Learning Patterns

Healthy Learning

What you see: Reward steadily increases with some variance, then stabilizes at a high level.

What it means: Your agent is learning! The variance is normal - it's exploring. The stabilization shows it found a good policy.

Model performance: Excellent. The agent should perform well in deployment.

Learning Curve Patterns

Healthy Convergence

Signs: Steady upward trend, decreasing variance, stable plateau

Performance: Agent reliably solves the task

Keep these settings!

High Variance / Unstable

Signs: Wild swings between high and low rewards, no clear trend

Performance: Inconsistent - sometimes good, often bad

Needs stabilization

Plateau (Stuck)

Signs: Initial improvement, then flat line below target

Performance: Mediocre - found local optimum

Needs exploration boost

Catastrophic Forgetting

Signs: Reaches high performance, then crashes and doesn't recover

Performance: Unreliable - "forgot" good policy

Critical issue

Not Learning

Signs: Flat line near random performance, no improvement

Performance: Random - agent learned nothing

Check implementation

Slow but Steady

Signs: Gradual improvement, needs more episodes

Performance: Will eventually work, just slow

Be patient or tune LR

Graph → Performance Correlation

What You See | What It Means | Expected Performance | Action
Reward climbing, low variance | Agent is learning effectively | High & consistent | Continue training
Reward climbing but high variance | Learning but unstable updates | Inconsistent | Enable Target Network
Loss decreasing, reward flat | Learning wrong Q-values | Poor | Check reward function
Loss exploding (NaN or huge) | Gradient explosion | Broken | Lower learning rate
Early spike then crash | Catastrophic forgetting | Unreliable | Enable Experience Replay
Flat near random baseline | Not learning at all | Random | Debug code, check gradients
Plateau below goal | Stuck in local optimum | Mediocre | Increase exploration (ε)
Oscillating wildly | Q-value overestimation | Erratic | Enable Double DQN

Understanding Loss vs Reward

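Key Insight: Low loss does NOT always mean good performance! Loss measures how closely predictions match their targets, not policy quality. Because the targets are built from the network's own estimates, an agent can fit them almost perfectly while its Q-values still describe a bad policy.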

What to Look For

Ideal: Both Improve

Loss decreases while reward increases. The network is learning accurate Q-values that lead to good actions.

Warning: Loss Low, Reward Low

Network confidently predicts wrong values. Often means rewards are sparse or incorrectly defined.

Loss Behavior Phases

Phase 1: High & Chaotic

Early training - network predicting randomly, targets changing rapidly. This is normal!

Phase 2: Decreasing

Network learning patterns, predictions improving. Reward should start climbing here.

Phase 3: Stable (Low)

Network predictions match targets consistently. If reward is high, you're done!

Troubleshooting Guide

Problem: Training is Unstable

  • Enable Target Network (separates prediction from targets)
  • Enable Experience Replay (breaks correlation)
  • Lower learning rate (smaller updates)
  • Increase batch size (smoother gradients)

Problem: Not Learning at All

  • Check that rewards are reaching the network
  • Verify gradients are non-zero
  • Increase learning rate (updates too small)
  • Check epsilon isn't stuck at 1.0

Problem: Learns Then Forgets

  • Enable Experience Replay (essential!)
  • Increase replay buffer size
  • Decrease learning rate
  • Update target network less frequently

Problem: Stuck at Plateau

  • Slower epsilon decay (explore more)
  • Increase minimum epsilon
  • Try different network architecture
  • Check if environment has solution

How DQN Switches Affect Learning Curves

Switch | OFF Behavior | ON Behavior | Graph Change
Experience Replay | High correlation, forgetting | IID samples, stable learning | Reduces spikes and crashes
Target Network | Moving target, oscillation | Stable targets, smooth learning | Reduces variance significantly
Double DQN | Q-value overestimation | Accurate Q-values | Higher, more stable plateau
Pro Tip: Enable switches one at a time and compare graphs. This helps you understand what each improvement actually does!
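
To make the Double DQN row concrete, the sketch below contrasts the two target computations for a single non-terminal transition. It assumes two sets of next-state Q-values, one from the online network and one from the target network. The function names and numbers are illustrative.

```python
def dqn_target(reward, next_q_target, gamma=0.99):
    """Standard DQN: the target network both picks and scores the next action."""
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99):
    """Double DQN: the online network picks the action, the target network scores it."""
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


# The target network overrates action 0; the online network prefers action 1.
next_q_online = [1.0, 2.0]
next_q_target = [5.0, 1.5]

standard = dqn_target(1.0, next_q_target, gamma=0.9)
double = double_dqn_target(1.0, next_q_online, next_q_target, gamma=0.9)
```

Here the standard target chases the inflated value of 5.0, while the Double DQN target uses the more modest 1.5. This is the overestimation effect the table refers to.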

Before You Celebrate: Performance Checklist

Is Your Model Actually Good?

Average reward exceeds target (CartPole: 195+)
Performance is consistent (low variance in last 100 episodes)
Multiple runs produce similar results
Agent performs well with epsilon=0 (pure exploitation)
Loss has stabilized (not still decreasing rapidly)
No recent performance crashes

CartPole Success Criteria

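OpenAI Gym considers CartPole-v0 "solved" when the agent achieves an average reward of 195+ over 100 consecutive episodes (episodes are capped at 200 steps). This means the pole stays balanced for at least 195 timesteps on average. CartPole-v1, which allows up to 500 steps per episode, uses a threshold of 475 instead.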

Poor: < 50

Basically random. Pole falls almost immediately.

Learning: 50-150

Some understanding, but inconsistent.

Good: 150-195

Decent policy, but not solved yet.

Solved: 195+

Consistently balances the pole!
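
The bands above can be turned into a small end-of-run report, sketched here. The function name `evaluate_run` and the returned dictionary layout are assumptions, and the spread is reported without a pass/fail cutoff because the text does not define one.

```python
from statistics import mean, pstdev


def evaluate_run(rewards, window=100):
    """Summarise the last `window` episode rewards using the bands listed above."""
    recent = rewards[-window:]
    avg = mean(recent)
    if avg >= 195:
        band = "Solved"
    elif avg >= 150:
        band = "Good"
    elif avg >= 50:
        band = "Learning"
    else:
        band = "Poor"
    # Spread (population standard deviation) indicates how consistent the agent is.
    return {"average": avg, "spread": pstdev(recent), "band": band}


steady = evaluate_run([200] * 100)
weak = evaluate_run([20] * 100)
```

For the checklist, run this on evaluation episodes collected with epsilon=0 rather than on the noisy training episodes.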

Diagnostic Workflow: Step-by-Step Debugging

When training isn't working, follow this systematic approach to identify and fix the problem.

1. Check: Is the reward signal reaching the agent?

How to test: Print rewards after each episode. Are they non-zero?

print(f"Episode {ep}: reward = {total_reward}")

If broken: Check environment setup, verify env.step() returns correct values

2. Check: Are gradients flowing?

How to test: Check gradient norms after backward pass

grad_norm = sum(p.grad.norm() ** 2 for p in model.parameters() if p.grad is not None) ** 0.5  # global L2 norm; skips parameters with no gradient

If zero: Loss not connected to parameters, or loss.backward() not called

If exploding (>1000): Learning rate too high, add gradient clipping

If vanishing (<1e-7): Dead ReLUs or saturated activations (the learning rate scales the update, not the gradient itself)

3. Check: Are Q-values reasonable?

How to test: Print Q-values periodically

print(f"Q-values: {q_network(state)}")

If all same: Network not differentiating states

If huge (>1000): Q-value explosion, lower learning rate

If NaN: Numerical instability, check for division by zero

4. Check: Is exploration working?

How to test: Log action distribution over an episode

action_counts = {0: left_count, 1: right_count}

If always same action: Epsilon too low, or stuck in local optimum

If 50/50 throughout: Epsilon not decaying, or not exploiting
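
A minimal sketch of the action-distribution check, assuming you append each chosen action to a list during the episode. The list `episode_actions` below is made-up data for illustration.

```python
from collections import Counter


def action_distribution(actions):
    """Fraction of steps on which each action was chosen."""
    if not actions:
        return {}
    counts = Counter(actions)
    total = sum(counts.values())
    return {action: count / total for action, count in counts.items()}


# Hypothetical log from one short episode: 0 = left, 1 = right.
episode_actions = [0, 1, 1, 0, 1, 1, 1, 0]
dist = action_distribution(episode_actions)
```

Logging this every few dozen episodes makes it easy to spot a policy that has collapsed onto a single action.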

5. Check: Is the replay buffer working?

How to test: Verify buffer size and sample diversity

print(f"Buffer size: {len(replay_buffer)}")

If empty: Experiences not being stored

If not growing: Check buffer.push() is being called

If always same samples: Random sampling broken
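
For reference, a minimal replay buffer with the `push` and `len` behaviour these checks rely on might look like the sketch below. The class is an assumption about how the buffer is structured, not the course's actual implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity):
        # Oldest transitions are discarded once capacity is reached.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampling without replacement: no transition repeats within a batch.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=1000)
for step in range(50):
    buffer.push(step, 0, 1.0, step + 1, False)

batch = buffer.sample(8)
distinct_states = {transition[0] for transition in batch}
```

If `len(buffer)` stays at zero, or `distinct_states` is much smaller than the batch size, one of the failure cases listed above applies.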

Golden Rule: Always change ONE thing at a time and observe the effect. Changing multiple things makes it impossible to know what helped (or hurt).

Optimization Approaches

Systematic methods to improve training performance beyond the basic DQN switches.

Learning Rate Strategies

Linear Decay

Gradually reduce LR as training progresses. High LR for fast initial learning, low LR for fine-tuning.

lr = lr_start * (1 - episode / total_episodes)
Recommended for beginners

Step Decay

Drop LR by factor at specific milestones. Good when you know roughly when learning plateaus.

lr = lr_start * (0.1 ^ floor(episode / step_size))
Good for known training length

Warmup + Decay

Start with low LR, ramp up, then decay. Prevents early instability from random weights.

warmup: lr = lr_max * (step / warmup_steps)
Advanced technique

Cyclical LR

Oscillate between min and max LR. Can escape local optima by periodically increasing LR.

lr = lr_min + (lr_max - lr_min) * cycle_position
Experimental
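
The four schedules can be written as plain functions, as in the sketch below. Two details are assumptions because the text leaves them open: the decay after warmup is taken to be linear, and the cyclical schedule uses a triangular wave for cycle_position.

```python
import math


def linear_decay(episode, total_episodes, lr_start):
    return lr_start * (1 - episode / total_episodes)


def step_decay(episode, step_size, lr_start, factor=0.1):
    # Multiply by `factor` once every `step_size` episodes.
    return lr_start * factor ** math.floor(episode / step_size)


def warmup_then_linear(step, warmup_steps, total_steps, lr_max):
    if step < warmup_steps:
        return lr_max * step / warmup_steps
    # Assumed decay shape: linear from lr_max down to 0 at total_steps.
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return lr_max * max(0.0, remaining)


def triangular_cycle(step, cycle_length, lr_min, lr_max):
    # cycle_position rises 0 -> 1 over the first half of a cycle, then falls back.
    phase = (step % cycle_length) / cycle_length
    cycle_position = 1 - abs(2 * phase - 1)
    return lr_min + (lr_max - lr_min) * cycle_position
```

Each function returns the learning rate for a given point in training, so any of them can be called once per episode (or step) and the result assigned to the optimizer.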

Exploration Strategies

Epsilon-Greedy (Standard)

Random action with probability ε, greedy otherwise. Simple and effective.

ε: 1.0 → 0.01 over N episodes
Default choice

Noisy Networks

Add learnable noise to network weights. Network learns WHEN to explore based on uncertainty.

W = μ + σ * ε, where ε ~ N(0,1)
No epsilon tuning needed

Boltzmann Exploration

Sample actions with probability proportional to exp(Q/τ), i.e. a softmax over Q-values. Higher Q = more likely, but still explores.

P(a) = exp(Q(a)/τ) / Σ_a' exp(Q(a')/τ)
Temperature τ sensitive

UCB (Upper Confidence Bound)

Bonus for less-visited state-actions. Balances exploration of uncertain areas.

Q_ucb = Q + c * sqrt(ln(t) / N(a))
Requires visit counts
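
The Boltzmann and UCB formulas above translate directly into code. The sketch below is illustrative: the function names, the default c=2.0, and the choice to give unvisited actions an infinite score are assumptions.

```python
import math


def boltzmann_probs(q_values, temperature=1.0):
    """Softmax over Q-values; lower temperature means greedier choices."""
    # Subtracting the max before exponentiating avoids overflow
    # and leaves the probabilities unchanged.
    top = max(q_values)
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def ucb_scores(q_values, visit_counts, t, c=2.0):
    """Q plus an exploration bonus that shrinks as an action is tried more often."""
    # Unvisited actions get an infinite score so each is tried at least once.
    return [
        q + c * math.sqrt(math.log(t) / n) if n > 0 else math.inf
        for q, n in zip(q_values, visit_counts)
    ]


probs = boltzmann_probs([1.0, 2.0], temperature=1.0)
scores = ucb_scores([1.0, 1.0], visit_counts=[10, 1], t=11)
```

With equal Q-values, the rarely visited second action receives the higher UCB score, which is the intended exploration pressure.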

Network Architecture Tuning

Hidden Layer Size

  • Too small (32): Can't represent complex patterns
  • Sweet spot (64-256): Good for most tasks
  • Too large (1024+): Overfits, slow training
  • Rule of thumb: Start with 128, adjust based on results

Number of Layers

  • 1 hidden: Simple tasks (CartPole)
  • 2-3 hidden: Most RL tasks
  • 4+ hidden: Complex visual tasks, use with caution
  • Deeper ≠ better: Diminishing returns in RL

Activation Functions

  • ReLU: Default choice, fast, works well
  • LeakyReLU: Prevents dead neurons
  • Tanh: Bounded output, good for some tasks
  • Avoid Sigmoid in hidden layers: Vanishing gradients in deep nets

Batch Size

  • Small (16-32): Noisy gradients, more exploration
  • Medium (64-128): Good balance
  • Large (256+): Stable but may need higher LR
  • Memory limited?: Use gradient accumulation

Advanced Techniques

Additional improvements beyond the standard DQN switches. These are potential features for future sessions.

Prioritized Experience Replay

W20D2 Topic

Problem it solves: Uniform sampling wastes time on easy transitions

How it works: Sample transitions proportional to TD error (surprise). Learn more from mistakes.

P(i) = |TD_error_i|^α / Σ_k |TD_error_k|^α

Impact: Often faster learning (the 2-3x figure is a rough guide and varies by environment), better sample efficiency
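
A minimal sketch of the sampling probabilities, assuming a small constant `eps` is added to each priority so that zero-error transitions can still be drawn (an addition to the formula above). The default α = 0.6 is a commonly used value, not one taken from this document.

```python
def priority_probs(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probability of each transition from its absolute TD error."""
    # alpha = 0 gives uniform sampling; alpha = 1 is fully proportional.
    scaled = [(abs(err) + eps) ** alpha for err in td_errors]
    total = sum(scaled)
    return [s / total for s in scaled]


proportional = priority_probs([0.1, 1.0, 2.0], alpha=1.0, eps=0.0)
uniform = priority_probs([0.1, 5.0], alpha=0.0)
```

The resulting list can be passed as `weights` to `random.choices` to draw a batch of indices. A full implementation also needs importance-sampling weights to correct the bias this introduces.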

Dueling DQN Architecture

Architecture change

Problem it solves: A single Q output has to relearn how good a state is for every action, even when the choice of action barely matters

How it works: Split network into Value stream V(s) and Advantage stream A(s,a)

Q(s,a) = V(s) + (A(s,a) - mean(A))

Impact: Better generalization, especially when actions have similar values
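
The aggregation step can be checked with plain numbers. In the sketch below, `value` stands for V(s) and `advantages` for the A(s,a) outputs of the two streams; the function name is hypothetical.

```python
def dueling_q(value, advantages):
    """Combine the streams: Q(s,a) = V(s) + (A(s,a) - mean(A))."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + (a - mean_adv) for a in advantages]


q_values = dueling_q(value=10.0, advantages=[1.0, 3.0])
```

Subtracting the mean advantage pins the average Q-value to V(s), which keeps the two streams from trading off against each other arbitrarily.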

Noisy Networks

Replaces ε-greedy

Problem it solves: ε-greedy explores randomly, not intelligently

How it works: Add learnable noise parameters to weights. Network learns when to explore.

y = (μ_w + σ_w * ε_w) * x + (μ_b + σ_b * ε_b), with separate noise samples ε_w and ε_b

Impact: State-dependent exploration, no ε tuning needed

N-Step Returns

TD improvement

Problem it solves: 1-step TD has high bias, MC has high variance

How it works: Use N steps of actual rewards before bootstrapping

G_t^(n) = r_t + γ r_{t+1} + ... + γ^(n-1) r_{t+n-1} + γ^n * V(s_{t+n})

Impact: Faster reward propagation, good balance of bias/variance
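
The n-step return can be computed by folding the rewards backwards from the bootstrap value, as sketched below. `bootstrap_value` stands in for V(s_{t+n}), which in DQN is typically the maximum target-network Q-value at that state. The function name is an assumption.

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Discounted sum of n actual rewards plus a discounted bootstrap estimate."""
    # rewards holds r_t ... r_{t+n-1}, oldest first.
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# 1 + 0.5*1 + 0.25*1 + 0.125*10 = 3.0
g = n_step_return([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
```

If the episode ends inside the n-step window, pass only the rewards up to termination and a `bootstrap_value` of 0.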

Weight Initialization (Pre-seeding)

Quick win

Problem it solves: Random initialization can lead to dead neurons or slow start

Options:

  • Xavier/Glorot: Good for tanh activations
  • He/Kaiming: Designed for ReLU (PyTorch's nn.Linear uses a Kaiming-uniform variant by default)
  • Orthogonal: Preserves gradient norms
  • Pre-trained: Initialize from similar task

Impact: Faster initial learning, more stable training

Reward Normalization

Stability boost

Problem it solves: Varying reward scales make learning rate tuning hard

Options:

  • Clipping: rewards = clip(rewards, -1, 1)
  • Scaling: rewards = rewards / max_reward
  • Running normalization: Use running mean/std

Impact: Consistent learning dynamics across environments
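
Clipping and running normalization can be sketched as follows. The running statistics use Welford's online algorithm, which is one reasonable choice rather than the method prescribed by the text. The class and function names are hypothetical.

```python
import math


def clip_reward(reward, low=-1.0, high=1.0):
    return max(low, min(high, reward))


class RunningNormalizer:
    """Tracks mean and variance of rewards seen so far (Welford's algorithm)."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        if self.count < 2:
            # Not enough data for a meaningful standard deviation yet.
            return x - self.mean
        std = math.sqrt(self.m2 / self.count)
        return (x - self.mean) / (std + self.eps)


normalizer = RunningNormalizer()
for reward in [1.0, 2.0, 3.0]:
    normalizer.update(reward)
```

Note that CartPole's reward is already a constant +1 per step, so normalization matters more in environments whose reward scales vary.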

Gradient Clipping

Stability essential

Problem it solves: Large TD errors cause exploding gradients

How it works: Cap gradient magnitude before applying updates

torch.nn.utils.clip_grad_norm_(params, max_norm=10)

Impact: Prevents catastrophic weight updates, more stable training
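
To show what clipping by norm does, here is a pure-Python sketch of the idea: all gradients are rescaled together so that their combined L2 norm does not exceed `max_norm`. It is an illustration with nested lists, not PyTorch's implementation.

```python
import math


def global_norm(grads):
    """L2 norm over every gradient entry; grads is one flat list per parameter."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by the same factor if their combined norm is too large."""
    total = global_norm(grads)
    if total <= max_norm:
        return grads, total
    scale = max_norm / total
    return [[g * scale for g in tensor] for tensor in grads], total


grads = [[3.0, 0.0], [0.0, 4.0]]  # combined norm is 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
```

Because every entry is scaled by the same factor, the update keeps its direction and only its length changes. Logging `norm_before` each step doubles as the gradient-flow check from the diagnostic workflow.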

Model Checkpointing

Best practice

Problem it solves: Training can crash, or best model isn't final model

How it works: Save model weights when performance improves

if avg_reward > best_reward: best_reward = avg_reward; save(model)

Impact: Never lose progress, deploy best model not last model
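
A minimal sketch of best-model checkpointing, with the actual saving delegated to a callback so the logic can be tested without a model. `BestCheckpoint` and `save_fn` are hypothetical names.

```python
class BestCheckpoint:
    """Calls save_fn whenever the average reward beats the best seen so far."""

    def __init__(self, save_fn):
        self.save_fn = save_fn  # hypothetical callback that writes the weights
        self.best_reward = float("-inf")

    def maybe_save(self, avg_reward):
        if avg_reward > self.best_reward:
            # Record the new best first so later, worse results do not overwrite it.
            self.best_reward = avg_reward
            self.save_fn(avg_reward)
            return True
        return False


saved = []
checkpoint = BestCheckpoint(save_fn=saved.append)
results = [checkpoint.maybe_save(r) for r in [50.0, 120.0, 90.0, 195.0]]
```

With PyTorch, `save_fn` could wrap `torch.save(model.state_dict(), path)` for a path of your choosing. Comparing the rolling average rather than a single episode's reward avoids saving on lucky episodes.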

Hyperparameter Sensitivity Guide

How much each hyperparameter affects training, and safe ranges to try.

Hyperparameter | Sensitivity | Safe Range | Effect of Too Low | Effect of Too High
Learning Rate | Very High | 1e-4 to 1e-3 | Slow/no learning | Unstable, divergence
Discount (γ) | Medium | 0.95 to 0.99 | Myopic (short-sighted) | Slow convergence
Epsilon Decay | Medium | 0.995 to 0.9999 | Exploits too early | Explores too long
Batch Size | Low | 32 to 256 | Noisy updates | Memory issues, slow
Buffer Size | Low | 10K to 1M | Correlation issues | Stale experiences
Target Update Freq | Medium | 100 to 10000 steps | Unstable targets | Stale targets
Hidden Size | Low | 64 to 512 | Underfitting | Overfitting, slow
Tuning Priority: Learning rate → Epsilon decay → Target update frequency → Everything else. Get these three right first!