Q-Learning Hyperparameters Explained

What are Hyperparameters?

Hyperparameters are settings you choose before training starts. Unlike regular parameters (like Q-values), which the agent learns, hyperparameters control how the agent learns.

Think of it like learning to cook: the recipe ingredients are parameters, but the oven temperature and cooking time are hyperparameters.


α Learning Rate (Alpha)

How much to update Q-values after each experience

The learning rate controls how quickly the agent updates its beliefs (Q-values) based on new experiences. It determines the step size when moving toward the target.

Q(s,a) = Q(s,a) + α × (target - Q(s,a))

If the target is 10 and current Q-value is 0:

α = 0.1: New Q = 0 + 0.1 × (10 - 0) = 1.0 (small step)
α = 0.5: New Q = 0 + 0.5 × (10 - 0) = 5.0 (medium step)
α = 1.0: New Q = 0 + 1.0 × (10 - 0) = 10.0 (old value replaced entirely by the target)
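
The same arithmetic can be checked with a few lines of Python. This is only a minimal sketch: the function name q_update is illustrative and is not taken from any starter code.

```python
def q_update(current_q: float, target: float, alpha: float) -> float:
    """Move current_q a fraction alpha of the way toward target."""
    return current_q + alpha * (target - current_q)


# Reproduce the three cases above: current Q = 0, target = 10.
for alpha in (0.1, 0.5, 1.0):
    new_q = q_update(0.0, 10.0, alpha)
    print(f"alpha = {alpha}: new Q = {new_q:.1f}")
```

Because the step is proportional to (target - Q), repeated updates toward the same target shrink the remaining gap by a factor of (1 - α) each time.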

💡 Analogy: Learning to Shoot Baskets

Imagine learning to shoot baskets. Each shot gives you feedback.

High learning rate (α = 0.9): You completely change your technique after each shot. You might overcorrect and be inconsistent.

Low learning rate (α = 0.1): You make tiny adjustments after each shot. More stable, but takes longer to improve.

▲ Too High (> 0.5)

Learns fast initially
Unstable, erratic behavior
Forgets old experiences quickly
May never converge

▼ Too Low (< 0.05)

Very stable learning
Extremely slow progress
May get stuck in bad patterns
Needs many more episodes

Example: How Learning Rate Affects Updates

Learning Rate (α): 0.2, on a scale from 0.01 (slow) to 1.0 (fast)

Current Q(s,a) = 0

Target = 10

After update: Q(s,a) = 2.0 (20% of the way to the target)

✅ Recommended for CartPole

Start with learning_rate = 0.1 to 0.3

CartPole is relatively simple, so moderate learning rates work well. If scores are unstable, try lowering it.

γ Discount Factor (Gamma)

How much to value future rewards vs immediate rewards

The discount factor determines how "far-sighted" or "short-sighted" the agent is. It multiplies future rewards, making them worth less than immediate rewards.

target = reward + γ × max(Q(next_state))

Total value of a sequence of rewards [1, 1, 1, 1, 1]:

γ = 0.5: 1 + 0.5 + 0.25 + 0.125 + 0.0625 = 1.94
γ = 0.9: 1 + 0.9 + 0.81 + 0.73 + 0.66 = 4.10
γ = 0.99: 1 + 0.99 + 0.98 + 0.97 + 0.96 = 4.90
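
A short Python sketch of both formulas in this section, for readers who want to try other values of γ. The function names are illustrative, and the done flag (treating a finished episode as having no future value) is a common convention assumed here, not something stated above.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward k steps ahead by gamma**k."""
    return sum(reward * gamma**k for k, reward in enumerate(rewards))


def td_target(reward, next_q_values, gamma, done=False):
    """target = reward + gamma * max(Q(next_state)).

    Assumption: when the episode has ended (done=True) there is no
    next state, so only the immediate reward counts.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


for gamma in (0.5, 0.9, 0.99):
    total = discounted_return([1, 1, 1, 1, 1], gamma)
    print(f"gamma = {gamma}: total value = {total:.2f}")
```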

💡 Analogy: Saving Money

Would you rather have $100 today or $100 next year?

High gamma (γ = 0.99): "$100 next year is almost as good as $100 today. I'll save for retirement."

Low gamma (γ = 0.5): "$100 next year is only worth $50 to me today. I want instant gratification!"

▲ High (0.99 - 0.999)

Plans far into the future
Willing to sacrifice now for later
Good for long episodes
Can be slower to learn

▼ Low (< 0.9)

Focuses on immediate rewards
Ignores long-term consequences
May make short-sighted choices
Faster but suboptimal

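Reward Horizon: How Far Ahead Does the Agent "See"?

Discount Factor (γ): 0.99, on a scale from 0.5 (short-sighted) to 0.999 (far-sighted)

Value of reward at each future step:

Now: 1.00 → +1: 0.99 → +2: 0.98 → +3: 0.97 → +4: 0.96 → +5: 0.95

Effective horizon: ~100 steps, using the rule of thumb 1 / (1 - γ). At that distance a reward is worth about 0.37 of its face value; its value drops below 0.5 after about 69 steps.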
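
There is more than one way to put a number on the horizon. The sketch below computes two of them: the rule of thumb 1 / (1 - γ), and the number of steps until a reward of 1 is worth less than a chosen threshold. Both function names are illustrative.

```python
def rule_of_thumb_horizon(gamma):
    """Common rule of thumb for the effective horizon: 1 / (1 - gamma)."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must be strictly between 0 and 1")
    return 1.0 / (1.0 - gamma)


def steps_until_below(gamma, threshold=0.5):
    """Smallest k for which gamma**k is below threshold."""
    if not 0.0 < gamma < 1.0 or not 0.0 < threshold <= 1.0:
        raise ValueError("need 0 < gamma < 1 and 0 < threshold <= 1")
    steps, value = 0, 1.0
    while value >= threshold:
        value *= gamma  # one more step into the future
        steps += 1
    return steps


for gamma in (0.5, 0.9, 0.99):
    print(gamma, round(rule_of_thumb_horizon(gamma)), steps_until_below(gamma))
```

For γ = 0.99 the two measures give roughly 100 and 69 steps, so it is worth stating which definition is meant when quoting a horizon.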

✅ Recommended for CartPole

Use discount_factor = 0.99

CartPole episodes can last up to 500 steps. High gamma helps the agent understand that keeping the pole balanced NOW leads to more rewards LATER.

ε Epsilon (Exploration Rate)

The explore vs exploit tradeoff

Epsilon controls how often the agent explores (tries random actions) vs exploits (uses what it already knows). This is one of the most important concepts in reinforcement learning!

if random() < ε: action = random
else: action = best_known
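
The pseudocode above can be made runnable as follows. This is a sketch under assumptions: q_values is taken to be a list with one Q-value per action for the current state, and the function name choose_action is illustrative.

```python
import random


def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        # Explore: every action is equally likely.
        return rng.randrange(len(q_values))
    # Exploit: index of the largest Q-value (first one wins on ties).
    return max(range(len(q_values)), key=lambda action: q_values[action])


q_values_for_state = [0.1, 0.9]  # hypothetical Q-values for two actions
print(choose_action(q_values_for_state, epsilon=0.0))  # always exploits
print(choose_action(q_values_for_state, epsilon=1.0))  # always explores
```

Passing a seeded random.Random instance as rng makes runs repeatable, which helps when comparing hyperparameter settings.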

💡 Analogy: Choosing a Restaurant

Explore (ε = 1.0): "I'll try a random restaurant every time. I might find something amazing... or terrible."

Exploit (ε = 0.0): "I'll always go to my favorite restaurant. Safe, but I'll never discover somewhere better."

Balanced (ε = 0.1): "I'll usually go to my favorite, but occasionally try somewhere new."

The Three Epsilon Parameters

1. Epsilon Start

How much to explore at the beginning. Usually 1.0 (100% random) because the agent knows nothing yet.

2. Epsilon End

Minimum exploration rate. Usually 0.01 (1% random) to occasionally try new things even after learning.

3. Epsilon Decay

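How fast epsilon decreases. After each episode: epsilon = max(epsilon_end, epsilon × decay), so epsilon shrinks by the same factor every episode but never drops below Epsilon End.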
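
The three parameters combine into a schedule like the one sketched below. The floor at epsilon_end and the function names are assumptions made for illustration; an actual implementation may apply the decay at a different point in the loop.

```python
def decay_epsilon(epsilon, decay, epsilon_end):
    """One decay step, floored at epsilon_end."""
    return max(epsilon_end, epsilon * decay)


def epsilon_schedule(num_episodes, epsilon_start=1.0, epsilon_end=0.01, decay=0.99):
    """Return the epsilon used in each episode, starting with epsilon_start."""
    schedule = []
    epsilon = epsilon_start
    for _ in range(num_episodes):
        schedule.append(epsilon)
        epsilon = decay_epsilon(epsilon, decay, epsilon_end)
    return schedule


schedule = epsilon_schedule(500)
for episode in (0, 50, 100, 200, 499):
    print(f"episode {episode}: epsilon = {schedule[episode]:.3f}")
```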

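Epsilon Decay Over Episodes

Decay Rate: 0.99, on a scale from 0.95 (fast decay) to 0.999 (slow decay)

Explore vs Exploit at Episode 100

Starting from ε = 1.0, ε = 0.99^100 ≈ 0.37, so about 37% of actions are random (explore) and about 63% use the best known action (exploit).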

▲ Decay Too Fast (0.95)

Stops exploring quickly
May miss good strategies
Gets stuck in local optima
ε ≈ 0.01 after ~90 episodes

▼ Decay Too Slow (0.999)

Explores for a long time
Slow to use what it learned
Wastes episodes on random actions
ε ≈ 0.37 after 1000 episodes

✅ Recommended for CartPole

epsilon_start = 1.0, epsilon_end = 0.01, epsilon_decay = 0.99

With decay = 0.99, epsilon reaches ~0.01 after about 460 episodes (0.99^458 ≈ 0.01). This gives time to explore early, then exploit later.
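
The episode counts quoted for each decay rate follow from solving epsilon_start × decay^n = epsilon_end for n. A small sketch, with an illustrative function name:

```python
import math


def episodes_to_reach(epsilon_end, decay, epsilon_start=1.0):
    """Smallest whole number n with epsilon_start * decay**n <= epsilon_end."""
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must be strictly between 0 and 1")
    if not 0.0 < epsilon_end <= epsilon_start:
        raise ValueError("need 0 < epsilon_end <= epsilon_start")
    return math.ceil(math.log(epsilon_end / epsilon_start) / math.log(decay))


for decay in (0.95, 0.99, 0.999):
    print(f"decay = {decay}: {episodes_to_reach(0.01, decay)} episodes to reach 0.01")
```

Comparing this number with the planned number of training episodes is a quick way to see whether the agent will spend most of training exploring or exploiting.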

🔁 Number of Episodes

How many times to play the game during training

An episode is one complete game from start to finish. In CartPole, an episode ends when the pole tilts too far, the cart leaves the track, or you reach 500 steps.

Episode 1-50

Mostly random exploration. Agent learns basic patterns. Scores: ~15-30

Episode 50-200

Starting to learn. Mix of exploration and exploitation. Scores: ~30-100

Episode 200-500

More exploitation. Refining strategy. Scores: ~50-200+

Episode 500+

Fine-tuning. May reach peak performance. Scores: ~100-500

▲ More Episodes (1000+)

More time to learn
Better final performance
Takes longer to run
Diminishing returns eventually

▼ Fewer Episodes (100-200)

Quick experiments
May not fully converge
Good for testing settings
Won't reach peak performance

✅ Recommended for CartPole

Use num_episodes = 500 for standard training

500 episodes is usually enough to see significant learning. Use 100-200 for quick tests, 1000+ for best results.
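
To show where num_episodes sits relative to the other settings, here is a minimal sketch of a tabular Q-learning training loop. It deliberately uses a made-up 5-cell corridor instead of CartPole so that it runs with the standard library alone; the environment, the 50-step cap, and the function name train_corridor are all assumptions for illustration.

```python
import random


def train_corridor(num_episodes=500, learning_rate=0.2, discount_factor=0.99,
                   epsilon_start=1.0, epsilon_end=0.01, epsilon_decay=0.99,
                   seed=0):
    """Tabular Q-learning on a toy 5-cell corridor (NOT CartPole).

    The agent starts in cell 0. Action 0 moves left, action 1 moves right.
    Reaching cell 4 gives reward 1 and ends the episode.
    """
    rng = random.Random(seed)
    num_states, num_actions, goal, max_steps = 5, 2, 4, 50
    q = [[0.0] * num_actions for _ in range(num_states)]
    epsilon = epsilon_start
    steps_per_episode = []

    for _ in range(num_episodes):
        state = 0
        steps = 0
        for steps in range(1, max_steps + 1):
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                action = rng.randrange(num_actions)
            else:
                action = max(range(num_actions), key=lambda a: q[state][a])

            # Toy dynamics: walls at both ends of the corridor.
            next_state = min(goal, state + 1) if action == 1 else max(0, state - 1)
            done = next_state == goal
            reward = 1.0 if done else 0.0

            # Q-learning update; a finished episode has no future value.
            future = 0.0 if done else max(q[next_state])
            target = reward + discount_factor * future
            q[state][action] += learning_rate * (target - q[state][action])

            state = next_state
            if done:
                break

        steps_per_episode.append(steps)
        epsilon = max(epsilon_end, epsilon * epsilon_decay)

    return q, steps_per_episode


q_table, steps = train_corridor()
print("average steps, first 50 episodes:", sum(steps[:50]) / 50)
print("average steps, last 50 episodes:", sum(steps[-50:]) / 50)
```

Printing a moving average like this for different values of num_episodes is a simple way to judge whether training has levelled off or was cut short.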

📦 State Discretization Bins

Converting continuous states to discrete buckets

Q-Learning needs discrete states (like squares on a chess board), but CartPole has continuous states (any decimal number). We solve this by dividing the range into "bins" (buckets).

Example: Pole Angle (-12° to +12°)

With 8 bins, each bin covers 3 degrees:

-12° to -9° | -9° to -6° | -6° to -3° | -3° to 0° | 0° to +3° | +3° to +6° | +6° to +9° | +9° to +12°

With 24 bins, each bin covers 1 degree:

-12 | -11 | -10 | -9 | -8 | -7 | -6 | -5 | -4 | -3 | -2 | -1 | ... | +10 | +11

More bins = finer distinctions between angles, but more states to learn
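
Mapping a continuous value to a bin index can be sketched as below. The function name to_bin and the choice to clamp out-of-range values into the first or last bin are assumptions for illustration; the angle range is the -12° to +12° example from above.

```python
import math


def to_bin(value, low, high, num_bins):
    """Map a continuous value to a bin index from 0 to num_bins - 1."""
    fraction = (value - low) / (high - low)   # 0.0 at low, 1.0 at high
    index = math.floor(fraction * num_bins)
    # Clamp so values at or beyond the edges still land in a valid bin.
    return min(num_bins - 1, max(0, index))


for angle in (-12.0, -10.0, 0.0, 2.9, 3.0, 12.0):
    print(f"{angle:+.1f} degrees -> bin {to_bin(angle, -12.0, 12.0, 8)} of 8")
```

A full CartPole state would apply the same mapping to each of the four observations and use the resulting tuple of indices as the key into the Q-table.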

💡 Analogy: Thermostat Settings

Few bins (coarse): Thermostat has only "Cold", "Medium", "Hot". Simple, but can't distinguish between 68°F and 72°F.

Many bins (fine): Thermostat shows exact temperature. More precise, but takes longer to learn the right setting for each degree.

Total Number of States

CartPole has 4 observations. Pole angle uses 2× bins for finer resolution, so total possible states = bins³ × (2 × bins) = 2 × bins⁴.

Bins Setting	Total States	Learning Speed	Precision
`8`	~8,192	Fast	Coarse
`12`	~41,472	Medium	Medium
`16`	~131,072	Slow	Fine
`24`	~663,552	Very Slow	Very Fine
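
The totals in the table match three observations with bins buckets each and one (pole angle) with 2 × bins buckets. A quick check, using an illustrative function name:

```python
def total_states(num_bins):
    """Three observations with num_bins buckets, pole angle with 2 * num_bins."""
    return num_bins**3 * (2 * num_bins)


for num_bins in (8, 12, 16, 24):
    print(f"bins = {num_bins}: {total_states(num_bins):,} states")
```

Doubling the bins setting multiplies the state count by 16, which is why the number of episodes needed grows so quickly with finer bins.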

▲ More Bins (16-24)

Finer state distinctions
Potentially better final policy
Many more states to visit
Needs more episodes

▼ Fewer Bins (8)

Coarse state distinctions
Faster learning (fewer states)
May lose important detail
Good for quick experiments

✅ Recommended for CartPole

Use num_bins = 12 for balanced learning

12 bins provides good precision without exploding the state space. Use 8 for quick tests, 16-24 for best results with more episodes.

📋 Quick Reference

Parameter	Symbol	Range	Default	Effect
Learning Rate	α	0.01 - 1.0	`0.2`	How fast to update Q-values
Discount Factor	γ	0.8 - 0.999	`0.99`	How much to value future rewards
Epsilon Start	ε₀	0 - 1.0	`1.0`	Initial exploration rate
Epsilon End	ε_min	0 - 0.2	`0.01`	Minimum exploration rate
Epsilon Decay	-	0.95 - 0.999	`0.99`	How fast exploration decreases
Episodes	-	100 - 2000	`500`	Training duration
Bins	-	8 - 24	`12`	State space resolution
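
One way to carry these settings around in code is a small config object. The sketch below uses the parameter names and defaults from this page; the class name QLearningConfig and the range checks (taken from the Range column) are assumptions for illustration, not part of any starter code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QLearningConfig:
    learning_rate: float = 0.2
    discount_factor: float = 0.99
    epsilon_start: float = 1.0
    epsilon_end: float = 0.01
    epsilon_decay: float = 0.99
    num_episodes: int = 500
    num_bins: int = 12

    def __post_init__(self):
        # Ranges copied from the Quick Reference table above.
        ranges = {
            "learning_rate": (0.01, 1.0),
            "discount_factor": (0.8, 0.999),
            "epsilon_start": (0.0, 1.0),
            "epsilon_end": (0.0, 0.2),
            "epsilon_decay": (0.95, 0.999),
            "num_episodes": (100, 2000),
            "num_bins": (8, 24),
        }
        for name, (low, high) in ranges.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside the range {low} - {high}")


default_config = QLearningConfig()
quick_test = QLearningConfig(num_episodes=100, num_bins=8)
print(default_config)
print(quick_test)
```

Making the config frozen keeps a run's settings from changing by accident halfway through training.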
