How to interpret training curves and diagnose model performance
During DQN training, we track several metrics. Understanding what each one means is crucial for diagnosing performance.
Episode reward: Total reward accumulated in one episode. For CartPole, this equals the number of timesteps the pole stayed balanced (capped at 500 in CartPole-v1).
Average reward: Rolling average over the last 100 episodes. Smooths out episode-to-episode noise to show the underlying learning trend. This is your primary success metric.
Loss: How wrong the network's Q-value predictions are. Calculated as the squared difference between the predicted Q-value and the target Q-value.
Epsilon (ε): Exploration rate. Starts high (mostly random actions) and decays over time (mostly exploiting the learned policy). Tracks the exploration-exploitation balance.
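As a rough illustration of how the four metrics above could be tracked, here is a minimal sketch in plain Python. The class name `MetricTracker`, the function `td_loss`, and the default decay values are assumptions made for this example, not part of any particular DQN implementation.

```python
from collections import deque


class MetricTracker:
    """Tracks episode reward, its rolling average, and epsilon (hypothetical helper)."""

    def __init__(self, window=100, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
        self.rewards = deque(maxlen=window)  # keeps only the last `window` episodes
        self.epsilon = eps_start
        self.eps_end = eps_end
        self.eps_decay = eps_decay

    def end_episode(self, episode_reward):
        """Record one finished episode and decay epsilon once."""
        self.rewards.append(episode_reward)
        self.epsilon = max(self.eps_end, self.epsilon * self.eps_decay)

    @property
    def average_reward(self):
        """Mean over the stored window (fewer episodes early in training)."""
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0


def td_loss(predicted_q, reward, next_q_max, gamma=0.99, done=False):
    """Squared difference between the predicted Q and the target Q for one transition."""
    target = reward if done else reward + gamma * next_q_max
    return (predicted_q - target) ** 2
```

Calling `end_episode` once per episode keeps the rolling average and epsilon in step with each other, which makes the two curves easy to plot on a shared x-axis.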
Common reward-curve patterns: what each one looks like and what it means for your model.
What you see: Reward steadily increases with some variance, then stabilizes at a high level.
What it means: Your agent is learning! The variance is normal - it's exploring. The stabilization shows it found a good policy.
Model performance: Excellent. The agent should perform well in deployment.
Signs: Steady upward trend, decreasing variance, stable plateau
Performance: Agent reliably solves the task
Keep these settings!

Signs: Wild swings between high and low rewards, no clear trend
Performance: Inconsistent - sometimes good, often bad
Needs stabilization

Signs: Initial improvement, then flat line below target
Performance: Mediocre - found local optimum
Needs exploration boost

Signs: Reaches high performance, then crashes and doesn't recover
Performance: Unreliable - "forgot" good policy
Critical issue

Signs: Flat line near random performance, no improvement
Performance: Random - agent learned nothing
Check implementation

Signs: Gradual improvement, needs more episodes
Performance: Will eventually work, just slow
Be patient or tune LR

| What You See | What It Means | Expected Performance | Action |
|---|---|---|---|
| Reward climbing, low variance | Agent is learning effectively | High & consistent | Continue training |
| Reward climbing but high variance | Learning but unstable updates | Inconsistent | Enable Target Network |
| Loss decreasing, reward flat | Learning wrong Q-values | Poor | Check reward function |
| Loss exploding (NaN or huge) | Gradient explosion | Broken | Lower learning rate |
| Early spike then crash | Catastrophic forgetting | Unreliable | Enable Experience Replay |
| Flat near random baseline | Not learning at all | Random | Debug code, check gradients |
| Plateau below goal | Stuck in local optimum | Mediocre | Increase exploration (ε) |
| Oscillating wildly | Q-value overestimation | Erratic | Enable Double DQN |
Loss decreases while reward increases. The network is learning accurate Q-values that lead to good actions.
Loss decreases but reward stays flat. The network confidently predicts wrong values. This often means rewards are sparse or incorrectly defined.
Early training - network predicting randomly, targets changing rapidly. This is normal!
Network learning patterns, predictions improving. Reward should start climbing here.
Network predictions match targets consistently. If reward is high, you're done!
| Switch | OFF Behavior | ON Behavior | Graph Change |
|---|---|---|---|
| Experience Replay | High correlation, forgetting | Decorrelated (closer to i.i.d.) samples, stable learning | Reduces spikes and crashes |
| Target Network | Moving target, oscillation | Stable targets, smooth learning | Reduces variance significantly |
| Double DQN | Q-value overestimation | Reduced overestimation bias | Higher, more stable plateau |
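To make the Double DQN row concrete, here is a small sketch of the two target computations for a single transition. The function names and the list-based Q-value inputs are assumptions for illustration; a real implementation would operate on batched tensors.

```python
def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """Standard DQN: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN: the online network selects the action, the target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]
```

Because the action is chosen by one network and scored by the other, a single overestimated entry in `next_q_target` is less likely to be picked up by the `max`, which is where the reduced overestimation comes from.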
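OpenAI Gym considers CartPole-v0 "solved" when the agent achieves an average reward of 195 or more over 100 consecutive episodes. This means the pole stays balanced for at least 195 of the 200 available timesteps on average. For CartPole-v1, where episodes run for up to 500 timesteps, the corresponding threshold is 475.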
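The "solved" criterion can be checked with a few lines of Python. This is a minimal sketch: `is_solved` is a hypothetical helper, and the default threshold is the 195 figure quoted above, so pass a different value if your environment variant uses another one.

```python
def is_solved(episode_rewards, threshold=195.0, window=100):
    """True once the mean of the last `window` episode rewards reaches `threshold`."""
    if len(episode_rewards) < window:
        return False  # not enough consecutive episodes yet
    recent = episode_rewards[-window:]
    return sum(recent) / window >= threshold
```

Calling this after every episode gives a natural stopping condition for the training loop.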
Basically random. Pole falls almost immediately.
Some understanding, but inconsistent.
Decent policy, but not solved yet.
Consistently balances the pole!
When training isn't working, follow this systematic approach to identify and fix the problem.
How to test: Print rewards after each episode. Are they non-zero?
If broken: Check environment setup, verify env.step() returns correct values
How to test: Check gradient norms after backward pass
If zero: Loss not connected to parameters, or loss.backward() not called
If exploding (>1000): Learning rate too high, add gradient clipping
If vanishing (<1e-7): Dead ReLUs, saturated activations, or poorly scaled inputs (the learning rate scales the update, not the gradient itself)
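The gradient checks above can be sketched in a framework-agnostic way. Here `grads` is assumed to be a list of flat lists of gradient values, one per layer; in PyTorch you would typically gather these from `p.grad` for each `p` in `model.parameters()` after `loss.backward()`. The function names are hypothetical, and the thresholds are the ones quoted in the text.

```python
import math


def global_grad_norm(grads):
    """L2 norm over every gradient value in every layer."""
    return math.sqrt(sum(g * g for layer in grads for g in layer))


def diagnose_gradients(grads, explode_at=1000.0, vanish_at=1e-7):
    """Classify the global gradient norm using the rule-of-thumb thresholds."""
    norm = global_grad_norm(grads)
    if math.isnan(norm) or math.isinf(norm):
        return "non-finite"
    if norm == 0.0:
        return "zero"
    if norm > explode_at:
        return "exploding"
    if norm < vanish_at:
        return "vanishing"
    return "ok"
```

Logging the returned label every few hundred updates is usually enough to spot when a run starts to go wrong.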
How to test: Print Q-values periodically
If all same: Network not differentiating states
If huge (>1000): Q-value explosion, lower learning rate
If NaN: Numerical instability, check for division by zero
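A periodic Q-value check along these lines might look like the sketch below. `diagnose_q_values` is a hypothetical helper, `q_batch` is assumed to be a list of per-state Q-value lists, and the tolerance used for "all same" is an assumption.

```python
import math


def diagnose_q_values(q_batch, huge_at=1000.0, same_tol=1e-6):
    """Flag NaN, very large, or undifferentiated Q-values in a batch."""
    flat = [q for row in q_batch for q in row]
    if not flat:
        raise ValueError("q_batch is empty")
    if any(math.isnan(q) for q in flat):
        return "nan"
    if max(abs(q) for q in flat) > huge_at:
        return "huge"
    if max(flat) - min(flat) < same_tol:
        return "all_same"  # network is not differentiating states or actions
    return "ok"
```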
How to test: Log action distribution over an episode
If always same action: Epsilon too low, or stuck in local optimum
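If 50/50 throughout regardless of state: Epsilon not decaying, or not exploiting (in CartPole a good policy also splits roughly evenly between left and right, so compare greedy actions against random ones rather than relying on totals alone)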
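Logging the action distribution takes only a few lines. In this sketch, `actions` is assumed to be the list of integer actions taken during one episode; the function names and the 95% cutoff for "always the same action" are assumptions.

```python
from collections import Counter


def action_distribution(actions, n_actions):
    """Fraction of steps on which each action was taken."""
    if not actions:
        raise ValueError("no actions logged")
    counts = Counter(actions)
    return [counts[a] / len(actions) for a in range(n_actions)]


def describe_actions(actions, n_actions, collapse_at=0.95):
    """Return 'collapsed' if one action dominates the episode, else 'mixed'."""
    dist = action_distribution(actions, n_actions)
    return "collapsed" if max(dist) >= collapse_at else "mixed"
```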
How to test: Verify buffer size and sample diversity
If empty: Experiences not being stored
If not growing: Check buffer.push() is being called
If always same samples: Random sampling broken
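For reference, here is a minimal replay buffer together with a sanity check covering the cases above. The `ReplayBuffer` class and `check_buffer` function are illustrative assumptions; only the `push` method name comes from the text.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer that drops the oldest transition when full."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement within a batch
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def check_buffer(buffer, batch_size=4, draws=20):
    """Report an empty buffer, a too-small buffer, or suspiciously repetitive samples."""
    if len(buffer) == 0:
        return "empty"
    if len(buffer) < batch_size:
        return "too_small"
    seen = set()
    for _ in range(draws):
        seen.update(repr(t) for t in buffer.sample(batch_size))
    # Repeated draws should reach more distinct transitions than fit in one batch
    if len(buffer) > batch_size and len(seen) <= batch_size:
        return "same_samples"
    return "ok"
```

Printing `len(buffer)` alongside the result of `check_buffer` every few episodes also confirms that the buffer is actually growing.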
Systematic methods to improve training performance beyond the basic DQN switches.
Gradually reduce LR as training progresses. High LR for fast initial learning, low LR for fine-tuning.
Drop LR by factor at specific milestones. Good when you know roughly when learning plateaus.
Start with low LR, ramp up, then decay. Prevents early instability from random weights.
Oscillate between min and max LR. Can escape local optima by periodically increasing LR.
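The four schedules above can each be written as a small function of the training step. This is a sketch under assumed defaults: the function names, base learning rates, milestones, and the triangular shape of the cyclic schedule are illustrative choices, not values prescribed by the text.

```python
def exponential_decay(step, base_lr=1e-3, decay_rate=0.999):
    """Gradually reduce the LR as training progresses."""
    return base_lr * decay_rate ** step


def step_decay(step, base_lr=1e-3, factor=0.1, milestones=(1000, 5000)):
    """Drop the LR by `factor` each time a milestone is passed."""
    passed = sum(1 for m in milestones if step >= m)
    return base_lr * factor ** passed


def warmup_then_decay(step, base_lr=1e-3, warmup_steps=500, decay_rate=0.999):
    """Ramp up linearly from a low LR, then decay exponentially."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr * decay_rate ** (step - warmup_steps)


def cyclic(step, min_lr=1e-4, max_lr=1e-3, cycle_steps=1000):
    """Triangular oscillation between min_lr and max_lr."""
    position = (step % cycle_steps) / cycle_steps  # 0.0 up to (not including) 1.0
    triangle = 1.0 - abs(2.0 * position - 1.0)     # rises 0 -> 1, then falls back to 0
    return min_lr + (max_lr - min_lr) * triangle
```

Plotting each function over the planned number of steps before training is a cheap way to confirm the schedule does what you expect.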
Random action with probability ε, greedy otherwise. Simple and effective.
Add learnable noise to network weights. Network learns WHEN to explore based on uncertainty.
Sample actions with probability proportional to exp(Q/τ) (a softmax over Q-values with temperature τ). Higher Q = more likely, but still explores.
Bonus for less-visited state-actions. Balances exploration of uncertain areas.
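Three of these strategies are simple enough to sketch directly; learnable weight noise needs a trainable network and is left out here. All function names, the temperature parameter, and the `1 / sqrt(n + 1)` form of the visit bonus are assumptions for illustration.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Random action with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax_probs(q_values, temperature=1.0):
    """Action probabilities proportional to exp(Q / temperature)."""
    highest = max(q_values)  # subtracted for numerical stability
    exps = [math.exp((q - highest) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def boltzmann(q_values, temperature=1.0, rng=random):
    """Sample an action from the softmax distribution."""
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs)[0]


def count_bonus_action(q_values, visit_counts, bonus_scale=1.0):
    """Greedy action after adding a bonus that shrinks with visit count."""
    scores = [q + bonus_scale / math.sqrt(n + 1)
              for q, n in zip(q_values, visit_counts)]
    return max(range(len(scores)), key=lambda a: scores[a])
```

Lower temperatures make `boltzmann` behave more greedily, while higher ones push it toward uniform random choice.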
Additional improvements beyond the standard DQN switches. These are potential features for future sessions.
Problem it solves: Uniform sampling wastes time on easy transitions
How it works: Sample transitions with probability proportional to the magnitude of their TD error (surprise). Learn more from mistakes.
Impact: Often faster learning (on the order of 2-3x in favorable cases), better sample efficiency
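The core sampling idea can be sketched as below. The exponent `alpha` and the small constant `eps` (which keeps zero-error transitions sampleable) are assumptions borrowed from common prioritized-replay setups, and the function names are hypothetical. A complete implementation would also correct for the sampling bias with importance weights and use a more efficient data structure.

```python
import random


def priorities_from_td_errors(td_errors, alpha=0.6, eps=1e-5):
    """Priority grows with the magnitude of the TD error."""
    return [(abs(e) + eps) ** alpha for e in td_errors]


def sample_prioritized(transitions, td_errors, batch_size, alpha=0.6, rng=random):
    """Return indices into `transitions`, drawn in proportion to priority."""
    weights = priorities_from_td_errors(td_errors, alpha)
    return rng.choices(range(len(transitions)), weights=weights, k=batch_size)
```

Setting `alpha=0` makes every priority equal, which recovers uniform sampling and gives a convenient baseline for comparison.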
Problem it solves: A single Q-value output must relearn how good a state is for every action, even when the choice of action barely matters
How it works: Split network into Value stream V(s) and Advantage stream A(s,a)
Impact: Better generalization, especially when actions have similar values
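The text does not say how the two streams are recombined. A common choice, assumed here, is to subtract the mean advantage so that V(s) and A(s,a) stay identifiable. The function name is hypothetical.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) using mean-advantage subtraction."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]
```

With this form, the average of the returned Q-values equals `state_value`, so the value stream carries the "how good is this state" signal on its own.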
Problem it solves: ε-greedy explores randomly, not intelligently
How it works: Add learnable noise parameters to weights. Network learns when to explore.
Impact: State-dependent exploration, no ε tuning needed
Problem it solves: 1-step TD has high bias, MC has high variance
How it works: Use N steps of actual rewards before bootstrapping
Impact: Faster reward propagation, good balance of bias/variance
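An N-step target can be computed by folding the observed rewards backward from the bootstrap value. In this sketch, `rewards` holds the N actual rewards and `bootstrap_value` is assumed to be the network's estimate (for example, the maximum Q-value) at the state reached after those N steps. The function name is hypothetical.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99, done=False):
    """r_0 + gamma*r_1 + ... + gamma^(N-1)*r_(N-1) + gamma^N * bootstrap_value."""
    target = 0.0 if done else bootstrap_value  # no bootstrap past a terminal state
    for r in reversed(rewards):
        target = r + gamma * target
    return target
```

With a single reward in `rewards`, this reduces to the usual 1-step TD target.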
Problem it solves: Random initialization can lead to dead neurons or slow start
Options:
Impact: Faster initial learning, more stable training
Problem it solves: Varying reward scales make learning rate tuning hard
Options:
Impact: Consistent learning dynamics across environments
Problem it solves: Large TD errors cause exploding gradients
How it works: Cap gradient magnitude before applying updates
Impact: Prevents catastrophic weight updates, more stable training
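One common way to cap gradient magnitude is clipping by global norm, sketched below on plain lists. The function name and the default `max_norm` are assumptions. In PyTorch, `torch.nn.utils.clip_grad_norm_` performs the equivalent operation on a model's parameters.

```python
import math


def clip_by_global_norm(grads, max_norm=10.0):
    """Rescale all gradients so their combined L2 norm is at most `max_norm`.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total = math.sqrt(sum(g * g for layer in grads for g in layer))
    if total <= max_norm:
        return [list(layer) for layer in grads], total
    scale = max_norm / total
    return [[g * scale for g in layer] for layer in grads], total
```

Rescaling every gradient by the same factor preserves the update direction, which is why this is often preferred over clipping each value independently.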
Problem it solves: Training can crash, or best model isn't final model
How it works: Save model weights when performance improves
Impact: Never lose progress, deploy best model not last model
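A best-model checkpoint can be as simple as the sketch below. `BestCheckpoint` is a hypothetical helper that pickles any serializable weights object; with a deep learning framework you would normally use its own save and load functions instead.

```python
import pickle


class BestCheckpoint:
    """Saves weights to disk only when the score beats the best seen so far."""

    def __init__(self, path):
        self.path = path
        self.best_score = float("-inf")

    def update(self, score, weights):
        """Return True if a new best was saved, False otherwise."""
        if score <= self.best_score:
            return False
        self.best_score = score
        with open(self.path, "wb") as f:
            pickle.dump({"score": score, "weights": weights}, f)
        return True

    def load(self):
        """Read back the best checkpoint written so far."""
        with open(self.path, "rb") as f:
            return pickle.load(f)
```

Using the 100-episode average reward as `score` avoids saving a model that only got lucky in a single episode.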
How much each hyperparameter affects training, and safe ranges to try.
| Hyperparameter | Sensitivity | Safe Range | Effect of Too Low | Effect of Too High |
|---|---|---|---|---|
| Learning Rate | Very High | 1e-4 to 1e-3 | Slow/no learning | Unstable, divergence |
| Discount (γ) | Medium | 0.95 to 0.99 | Myopic (short-sighted) | Slow convergence |
| Epsilon Decay | Medium | 0.995 to 0.9999 | Exploits too early | Explores too long |
| Batch Size | Low | 32 to 256 | Noisy updates | Memory issues, slow |
| Buffer Size | Low | 10K to 1M | Correlation issues | Stale experiences |
| Target Update Interval | Medium | Every 100 to 10,000 steps | Unstable targets | Stale targets |
| Hidden Size | Low | 64 to 512 | Underfitting | Overfitting, slow |
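The safe ranges in the table can double as a quick configuration check. The dictionary keys and the `out_of_range` function below are hypothetical names; the numeric bounds are copied from the table.

```python
# Safe ranges from the table above; key names are illustrative assumptions
SAFE_RANGES = {
    "learning_rate": (1e-4, 1e-3),
    "gamma": (0.95, 0.99),
    "epsilon_decay": (0.995, 0.9999),
    "batch_size": (32, 256),
    "buffer_size": (10_000, 1_000_000),
    "target_update_steps": (100, 10_000),
    "hidden_size": (64, 512),
}


def out_of_range(config):
    """Return {name: 'too low' | 'too high'} for values outside the safe range."""
    problems = {}
    for name, value in config.items():
        if name not in SAFE_RANGES:
            continue  # unknown settings are ignored rather than rejected
        low, high = SAFE_RANGES[name]
        if value < low:
            problems[name] = "too low"
        elif value > high:
            problems[name] = "too high"
    return problems
```

A value outside these ranges is not necessarily wrong, but it is a good first suspect when a run misbehaves in the way the corresponding table column describes.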