Q-Learning Hyperparameters

Understanding what each parameter does and how to tune it

What are Hyperparameters?

Hyperparameters are settings you choose before training starts. Unlike regular parameters (like Q-values), which the agent learns, hyperparameters control how the agent learns.

Think of it like learning to cook: the recipe ingredients are parameters, but the oven temperature and cooking time are hyperparameters.

α Learning Rate (Alpha)

How much to update Q-values after each experience

The learning rate controls how quickly the agent updates its beliefs (Q-values) based on new experiences. It determines the step size when moving toward the target.

Q(s,a) = Q(s,a) + α × (target - Q(s,a))

If the target is 10 and current Q-value is 0:
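
A minimal Python sketch of that single update for a few learning rates (the function name `q_update` and the chosen α values are illustrative, not taken from the starter code):

```python
def q_update(q_value: float, target: float, alpha: float) -> float:
    """Move q_value a fraction alpha of the way toward target."""
    return q_value + alpha * (target - q_value)


# Current Q-value 0, target 10, one update at several learning rates.
for alpha in (0.1, 0.5, 0.9):
    new_q = q_update(0.0, 10.0, alpha)
    print(f"alpha={alpha}: Q(s,a) = {new_q:.1f}")
```

With these inputs the updated values are 1.0, 5.0 and 9.0: the learning rate is simply the fraction of the gap to the target that is closed in one step.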

💡 Analogy: Learning to Shoot Baskets

Imagine learning to shoot baskets. Each shot gives you feedback.

High learning rate (α = 0.9): You completely change your technique after each shot. You might overcorrect and be inconsistent.

Low learning rate (α = 0.1): You make tiny adjustments after each shot. More stable, but takes longer to improve.

▲ Too High (> 0.5)
  • Learns fast initially
  • Unstable, erratic behavior
  • Forgets old experiences quickly
  • May never converge
▼ Too Low (< 0.05)
  • Very stable learning
  • Extremely slow progress
  • May get stuck in bad patterns
  • Needs many more episodes

Interactive Demo: See How Learning Rate Affects Updates

Slider range: α = 0.01 (slow) to 1.0 (fast). Example at α = 0.2: current Q(s,a) = 0 and target = 10, so after one update Q(s,a) = 2.0 (20% of the way to the target).

✅ Recommended for CartPole

Start with learning_rate = 0.1 to 0.3

CartPole is relatively simple, so moderate learning rates work well. If scores are unstable, try lowering it.

γ Discount Factor (Gamma)

How much to value future rewards vs immediate rewards

The discount factor determines how "far-sighted" or "short-sighted" the agent is. It multiplies future rewards, making them worth less than immediate rewards.

target = reward + γ × max over a' of Q(next_state, a')
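
As a rough sketch, the target can be computed like this in Python. The function name `td_target` and the `done` flag are assumptions for illustration: the formula above does not mention terminal states, but the usual convention is to use the reward alone when the episode has ended, since there is no next state to look ahead to.

```python
def td_target(reward: float, next_q_values: list[float], gamma: float, done: bool) -> float:
    """Return reward + gamma * best next-state Q-value (reward only if the episode ended)."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Hypothetical example: reward of 1 and two actions available in the next state.
print(f"{td_target(1.0, [2.0, 3.0], gamma=0.99, done=False):.2f}")
print(f"{td_target(1.0, [2.0, 3.0], gamma=0.99, done=True):.2f}")
```

The first call looks ahead to the better of the two next actions; the second ignores them because the episode is over.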

Total value of a sequence of rewards [1, 1, 1, 1, 1]:
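
The total can be worked out directly by weighting the reward at step t by γ^t. The sketch below does this for three values of γ; `discounted_return` is a hypothetical helper name and the γ values are chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    return sum(r * gamma**t for t, r in enumerate(rewards))


rewards = [1, 1, 1, 1, 1]
for gamma in (0.5, 0.9, 0.99):
    print(f"gamma={gamma}: total = {discounted_return(rewards, gamma):.4f}")
```

For γ = 0.5 the five rewards are worth about 1.94 in total, for γ = 0.9 about 4.10, and for γ = 0.99 about 4.90, which is close to the undiscounted sum of 5.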

💡 Analogy: Saving Money

Would you rather have $100 today or $100 next year?

High gamma (γ = 0.99): "$100 next year is almost as good as $100 today. I'll save for retirement."

Low gamma (γ = 0.5): "$100 next year is only worth $50 to me today. I want instant gratification!"

▲ High (0.99 - 0.999)
  • Plans far into the future
  • Willing to sacrifice now for later
  • Good for long episodes
  • Can be slower to learn
▼ Low (< 0.9)
  • Focuses on immediate rewards
  • Ignores long-term consequences
  • May make short-sighted choices
  • Faster but suboptimal

Reward Horizon: How Far Ahead Does the Agent "See"?

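Slider range: γ = 0.5 (short-sighted) to 0.999 (far-sighted). Example at γ = 0.99:
Value of reward at each future step:
Now: 1.00 → +1: 0.99 → +2: 0.98 → +3: 0.97 → +4: 0.96 → +5: 0.95
Effective horizon: ~100 steps (1 / (1 - γ), where the value has dropped to about 0.37). The value first drops below 0.5 after about 69 steps.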
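
To check how quickly discounted value shrinks for a given γ, the following sketch counts the steps until γ^n falls below a threshold and also computes the common 1 / (1 - γ) rule of thumb. Both helper names are hypothetical.

```python
def steps_until_below(gamma: float, threshold: float = 0.5) -> int:
    """Smallest number of steps n for which gamma**n is below threshold."""
    steps, value = 0, 1.0
    while value >= threshold:
        value *= gamma
        steps += 1
    return steps


def rule_of_thumb_horizon(gamma: float) -> float:
    """The 1 / (1 - gamma) horizon, where a reward is worth roughly 0.37."""
    return 1.0 / (1.0 - gamma)


for gamma in (0.9, 0.99):
    print(gamma, steps_until_below(gamma), round(rule_of_thumb_horizon(gamma)))
```

The two measures answer slightly different questions, so they give different step counts for the same γ.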

✅ Recommended for CartPole

Use discount_factor = 0.99

CartPole episodes can last up to 500 steps. High gamma helps the agent understand that keeping the pole balanced NOW leads to more rewards LATER.

ε Epsilon (Exploration Rate)

The explore vs exploit tradeoff

Epsilon controls how often the agent explores (tries random actions) vs exploits (uses what it already knows). This is one of the most important concepts in reinforcement learning!

if random() < ε: action = random
else: action = best_known
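
The same rule as a small runnable Python sketch. The function name `choose_action` and the optional `rng` argument are assumptions made for illustration; `q_values` is taken to be the list of Q-values for the current state, one per action.

```python
import random


def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: random action with probability epsilon, else the best known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # exploit: index of the largest Q-value (first one wins on ties)
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)  # seeded so runs are repeatable
q_values = [0.1, 0.9]   # hypothetical Q-values for CartPole's two actions
picks = [choose_action(q_values, epsilon=0.1, rng=rng) for _ in range(1000)]
print("share of best action:", picks.count(1) / len(picks))
```

With ε = 0.1 and two actions, the best action is expected roughly 95% of the time: 90% from exploiting plus half of the 10% random picks.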

💡 Analogy: Choosing a Restaurant

Explore (ε = 1.0): "I'll try a random restaurant every time. I might find something amazing... or terrible."

Exploit (ε = 0.0): "I'll always go to my favorite restaurant. Safe, but I'll never discover somewhere better."

Balanced (ε = 0.1): "I'll usually go to my favorite, but occasionally try somewhere new."

The Three Epsilon Parameters

1. Epsilon Start

How much to explore at the beginning. Usually 1.0 (100% random) because the agent knows nothing yet.

2. Epsilon End

Minimum exploration rate. Usually 0.01 (1% random) to occasionally try new things even after learning.

3. Epsilon Decay

How fast epsilon decreases. After each episode: epsilon = max(epsilon_end, epsilon × decay), so epsilon never falls below Epsilon End.
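
The three parameters combine into a schedule like the one sketched below. Clamping at the minimum with `max` is an assumption about how Epsilon End is applied, and both function names are hypothetical.

```python
def epsilon_schedule(start, end, decay, num_episodes):
    """Epsilon used in each episode: multiply by decay after every episode, never below end."""
    epsilon, schedule = start, []
    for _ in range(num_episodes):
        schedule.append(epsilon)
        epsilon = max(end, epsilon * decay)
    return schedule


def episodes_to_floor(start, end, decay):
    """Number of decay steps before epsilon first reaches the floor."""
    episodes, epsilon = 0, start
    while epsilon > end:
        epsilon *= decay
        episodes += 1
    return episodes


for decay in (0.95, 0.99, 0.999):
    print(decay, episodes_to_floor(1.0, 0.01, decay))
```

Running this for a candidate decay value before training is a quick way to see whether exploration will end too early or too late for the number of episodes you plan to run.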

[Interactive chart: Epsilon Decay Over Episodes, with a decay slider from 0.95 (fast decay) to 0.999 (slow decay) and a bar showing the Explore (random) vs Exploit (best known) split at episode 100]
▲ Decay Too Fast (0.95)
  • Stops exploring quickly
  • May miss good strategies
  • Gets stuck in local optima
  • ε ≈ 0.01 after ~90 episodes
▼ Decay Too Slow (0.999)
  • Explores for a long time
  • Slow to use what it learned
  • Wastes episodes on random actions
  • ε ≈ 0.37 after 1000 episodes

✅ Recommended for CartPole

epsilon_start = 1.0, epsilon_end = 0.01, epsilon_decay = 0.99

With decay=0.99, epsilon reaches ~0.01 after about 460 episodes. This gives time to explore early, then exploit later.

🔁 Number of Episodes

How many times to play the game during training

An episode is one complete game from start to finish. In CartPole, an episode ends when the pole falls (tilts more than about 12° from vertical), the cart moves off the track, or you reach 500 steps.

Episode 1-50
Mostly random exploration. Agent learns basic patterns. Scores: ~15-30
Episode 50-200
Starting to learn. Mix of exploration and exploitation. Scores: ~30-100
Episode 200-500
More exploitation. Refining strategy. Scores: ~50-200+
Episode 500+
Fine-tuning. May reach peak performance. Scores: ~100-500
▲ More Episodes (1000+)
  • More time to learn
  • Better final performance
  • Takes longer to run
  • Diminishing returns eventually
▼ Fewer Episodes (100-200)
  • Quick experiments
  • May not fully converge
  • Good for testing settings
  • Won't reach peak performance

✅ Recommended for CartPole

Use num_episodes = 500 for standard training

500 episodes is usually enough to see significant learning. Use 100-200 for quick tests, 1000+ for best results.

📦 State Discretization Bins

Converting continuous states to discrete buckets

Tabular Q-Learning needs discrete states (like squares on a chess board), but CartPole has continuous states (any decimal number). We solve this by dividing the range into "bins" (buckets).

Example: Pole Angle (-12° to +12°)

With 8 bins, each bin covers 3 degrees:

-12° to -9° | -9° to -6° | -6° to -3° | -3° to 0° | 0° to +3° | +3° to +6° | +6° to +9° | +9° to +12°

With 24 bins, each bin covers 1 degree:

Lower edge of each bin (degrees): -12, -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, +1, +2, +3, +4, +5, +6, +7, +8, +9, +10, +11

More bins = finer distinctions between angles, but more states to learn
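
Binning a continuous value can be sketched with the standard library alone. The helper names below are hypothetical, and the starter code may discretize differently; this version uses equal-width bins and places out-of-range values in the first or last bin.

```python
from bisect import bisect_right


def make_bin_edges(low: float, high: float, num_bins: int) -> list[float]:
    """Interior edges that split [low, high] into num_bins equal-width bins."""
    width = (high - low) / num_bins
    return [low + i * width for i in range(1, num_bins)]


def to_bin(value: float, edges: list[float]) -> int:
    """Index of the bin holding value; values outside the range land in an end bin."""
    return bisect_right(edges, value)


edges = make_bin_edges(-12.0, 12.0, 8)  # pole angle in degrees, 8 bins of 3 degrees
for angle in (-10.0, 1.5, 11.0, 20.0):
    print(f"{angle:+.1f} degrees -> bin {to_bin(angle, edges)}")
```

Each of CartPole's observations would get its own set of edges, and the tuple of bin indices then serves as the discrete state used to index the Q-table.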

💡 Analogy: Thermostat Settings

Few bins (coarse): Thermostat has only "Cold", "Medium", "Hot". Simple, but can't distinguish between 68°F and 72°F.

Many bins (fine): Thermostat shows exact temperature. More precise, but takes longer to learn the right setting for each degree.

Total Number of States

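CartPole has 4 observations (cart position, cart velocity, pole angle, pole angular velocity). Total possible states = bins³ × (2 × bins) = 2 × bins⁴, because the pole angle uses 2× bins for finer resolution and the other three observations use bins each.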

Bins Setting | Total States | Learning Speed | Precision
8 | ~8,192 | Fast | Coarse
12 | ~41,472 | Medium | Medium
16 | ~131,072 | Slow | Fine
24 | ~663,552 | Very Slow | Very Fine
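
The totals in the table can be reproduced with a few lines of Python, assuming three observations use `num_bins` bins and the pole angle uses twice as many. The function name and the `angle_multiplier` parameter are illustrative.

```python
def total_states(num_bins: int, angle_multiplier: int = 2) -> int:
    """Discrete states when three observations use num_bins and pole angle uses more."""
    return num_bins**3 * (angle_multiplier * num_bins)


for bins in (8, 12, 16, 24):
    print(f"{bins} bins -> {total_states(bins):,} states")
```

Doubling the bins setting multiplies the state count by 16, which is why the number of episodes needed grows so quickly with finer resolution.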
▲ More Bins (16-24)
  • Finer state distinctions
  • Potentially better final policy
  • Many more states to visit
  • Needs more episodes
▼ Fewer Bins (8)
  • Coarse state distinctions
  • Faster learning (fewer states)
  • May lose important detail
  • Good for quick experiments

✅ Recommended for CartPole

Use num_bins = 12 for balanced learning

12 bins provides good precision without exploding the state space. Use 8 for quick tests, 16-24 for best results with more episodes.

📋 Quick Reference

Parameter | Symbol | Range | Default | Effect
Learning Rate | α | 0.01 - 1.0 | 0.2 | How fast to update Q-values
Discount Factor | γ | 0.8 - 0.999 | 0.99 | How much to value future rewards
Epsilon Start | ε₀ | 0 - 1.0 | 1.0 | Initial exploration rate
Epsilon End | ε_min | 0 - 0.2 | 0.01 | Minimum exploration rate
Epsilon Decay | - | 0.95 - 0.999 | 0.99 | How fast exploration decreases
Episodes | - | 100 - 2000 | 500 | Training duration
Bins | - | 8 - 24 | 12 | State space resolution
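
One way to keep these settings together is a small config object. The sketch below uses the parameter names and defaults from this page; the class name `QLearningConfig` and the validation rules are assumptions and may not match the downloadable starter code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QLearningConfig:
    """Hyperparameters with the defaults from the quick reference table."""
    learning_rate: float = 0.2
    discount_factor: float = 0.99
    epsilon_start: float = 1.0
    epsilon_end: float = 0.01
    epsilon_decay: float = 0.99
    num_episodes: int = 500
    num_bins: int = 12

    def __post_init__(self) -> None:
        # Basic sanity checks (illustrative, looser than the suggested ranges).
        if not 0.0 < self.learning_rate <= 1.0:
            raise ValueError("learning_rate must be in (0, 1]")
        if not 0.0 <= self.discount_factor <= 1.0:
            raise ValueError("discount_factor must be in [0, 1]")
        if not 0.0 <= self.epsilon_end <= self.epsilon_start <= 1.0:
            raise ValueError("need 0 <= epsilon_end <= epsilon_start <= 1")
        if not 0.0 < self.epsilon_decay <= 1.0:
            raise ValueError("epsilon_decay must be in (0, 1]")
        if self.num_episodes < 1 or self.num_bins < 1:
            raise ValueError("num_episodes and num_bins must be positive")


default_config = QLearningConfig()
quick_test = QLearningConfig(num_episodes=150, num_bins=8)
print(default_config)
print(quick_test)
```

Making the dataclass frozen means a config cannot be changed by accident partway through a run; a new experiment gets a new object.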