Understanding what each parameter does and how to tune it
Hyperparameters are settings you choose before training starts. Unlike regular parameters (like Q-values), which the agent learns, hyperparameters control how the agent learns.
Think of it like learning to cook: the recipe ingredients are parameters, but the oven temperature and cooking time are hyperparameters.
How much to update Q-values after each experience
The learning rate controls how quickly the agent updates its beliefs (Q-values) based on new experiences. It determines the step size when moving toward the target.
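If the target is 10 and the current Q-value is 0, a single update moves the Q-value a fraction α of the way to the target: to 1 with α = 0.1, and to 9 with α = 0.9.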
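The step-size idea can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard update Q ← Q + α × (target − Q); the function name `q_update` is made up for this example.

```python
def q_update(q_value, target, alpha):
    """Move q_value a fraction alpha of the way toward target."""
    return q_value + alpha * (target - q_value)


if __name__ == "__main__":
    # Target is 10, current Q-value is 0, as in the example above.
    for alpha in (0.1, 0.5, 0.9):
        print(f"alpha={alpha}: new Q-value = {q_update(0.0, 10.0, alpha):.1f}")
    # alpha=0.1: new Q-value = 1.0
    # alpha=0.5: new Q-value = 5.0
    # alpha=0.9: new Q-value = 9.0
```

A small α needs many updates to reach the target, while a large α gets there quickly but also follows every noisy target just as quickly.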
Imagine learning to shoot baskets. Each shot gives you feedback.
High learning rate (α = 0.9): You completely change your technique after each shot.
You might overcorrect and be inconsistent.
Low learning rate (α = 0.1): You make tiny adjustments after each shot.
More stable, but takes longer to improve.
Start with learning_rate = 0.1 to 0.3
CartPole is relatively simple, so moderate learning rates work well. If scores are unstable, try lowering it.
How much to value future rewards vs immediate rewards
The discount factor determines how "far-sighted" or "short-sighted" the agent is. It multiplies future rewards, making them worth less than immediate rewards.
Total value of a sequence of rewards [1, 1, 1, 1, 1]: about 4.90 with γ = 0.99, but only about 1.94 with γ = 0.5.
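To see how gamma shrinks later rewards, here is a small sketch that sums the reward sequence with each reward multiplied by γ raised to its time step. The helper name `discounted_return` is an assumption for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, where the reward at step t is weighted by gamma**t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


if __name__ == "__main__":
    rewards = [1, 1, 1, 1, 1]
    print(f"gamma=0.99: {discounted_return(rewards, 0.99):.4f}")  # 4.9010
    print(f"gamma=0.5:  {discounted_return(rewards, 0.5):.4f}")   # 1.9375
```

With γ = 0.99 the five rewards are worth almost their full sum of 5, while with γ = 0.5 most of the value comes from the first two steps.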
Would you rather have $100 today or $100 next year?
High gamma (γ = 0.99): "$100 next year is almost as good as $100 today.
I'll save for retirement."
Low gamma (γ = 0.5): "$100 next year is only worth $50 to me today.
I want instant gratification!"
Use discount_factor = 0.99
CartPole episodes can last up to 500 steps. High gamma helps the agent understand that keeping the pole balanced NOW leads to more rewards LATER.
The explore vs exploit tradeoff
Epsilon controls how often the agent explores (tries random actions) vs exploits (uses what it already knows). This is one of the most important concepts in reinforcement learning!
Explore (ε = 1.0): "I'll try a random restaurant every time.
I might find something amazing... or terrible."
Exploit (ε = 0.0): "I'll always go to my favorite restaurant.
Safe, but I'll never discover somewhere better."
Balanced (ε = 0.1): "I'll usually go to my favorite, but
occasionally try somewhere new."
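The explore/exploit choice is commonly implemented as epsilon-greedy action selection. The sketch below assumes the Q-values for the current state are available as a plain list; `choose_action` is a hypothetical name, not something taken from the original code.

```python
import random


def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: random action with probability epsilon, else the best known one."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest Q-value.
    return max(range(len(q_values)), key=lambda action: q_values[action])


if __name__ == "__main__":
    q_values = [0.2, 0.8]  # e.g. push left, push right
    print(choose_action(q_values, epsilon=0.0))  # always 1, the greedy action
    print(choose_action(q_values, epsilon=1.0))  # 0 or 1, chosen at random
```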
epsilon_start: How much to explore at the beginning. Usually 1.0 (100% random) because the agent knows nothing yet.
epsilon_end: Minimum exploration rate. Usually 0.01 (1% random) to occasionally try new things even after learning.
epsilon_decay: How fast epsilon decreases. After each episode:
epsilon = epsilon × decay
Use epsilon_start = 1.0, epsilon_end = 0.01, epsilon_decay = 0.99
With decay = 0.99, epsilon falls from 1.0 to about 0.01 after roughly 460 episodes (0.99^458 ≈ 0.01). This gives time to explore early, then exploit later.
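The decay schedule can be checked with a short loop. This sketch assumes epsilon is multiplied by the decay once per episode and is never allowed to drop below epsilon_end; the function name `episodes_until_floor` is made up for this example.

```python
def episodes_until_floor(epsilon_start=1.0, epsilon_end=0.01, epsilon_decay=0.99):
    """Count the episodes needed for epsilon to decay down to epsilon_end."""
    epsilon = epsilon_start
    episodes = 0
    while epsilon > epsilon_end:
        # Multiplicative decay, clamped so epsilon never goes below the floor.
        epsilon = max(epsilon_end, epsilon * epsilon_decay)
        episodes += 1
    return episodes


if __name__ == "__main__":
    for decay in (0.95, 0.99, 0.999):
        print(f"decay={decay}: floor reached after {episodes_until_floor(epsilon_decay=decay)} episodes")
```

Comparing the counts against num_episodes is a quick way to check that exploration does not end too early or run for the entire training budget.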
How many times to play the game during training
An episode is one complete game from start to finish. In CartPole, an episode ends when the pole falls or you reach 500 steps.
Use num_episodes = 500 for standard training
500 episodes is usually enough to see significant learning. Use 100-200 for quick tests, 1000+ for best results.
Converting continuous states to discrete buckets
Q-Learning needs discrete states (like squares on a chess board), but CartPole has continuous states (any decimal number). We solve this by dividing the range into "bins" (buckets).
With 8 bins, each bin covers 3 degrees.
With 24 bins, each bin covers 1 degree.
(Both figures assume an angle range of 24 degrees in total.)
More bins = finer distinctions between angles, but more states to learn.
Few bins (coarse): Thermostat has only "Cold", "Medium", "Hot".
Simple, but can't distinguish between 68°F and 72°F.
Many bins (fine): Thermostat shows exact temperature.
More precise, but takes longer to learn the right setting for each degree.
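A minimal sketch of the binning step, assuming the pole angle is measured in degrees over a range of −12 to +12 (24 degrees in total, which matches the 3-degree and 1-degree bin widths mentioned above). The helper `to_bin` and the chosen range are illustrative assumptions, not the original implementation.

```python
def to_bin(value, low, high, num_bins):
    """Map a continuous value in [low, high] to a bin index from 0 to num_bins - 1."""
    clipped = min(max(value, low), high)       # keep out-of-range values inside the range
    bin_width = (high - low) / num_bins
    index = int((clipped - low) / bin_width)
    return min(index, num_bins - 1)            # the top edge belongs to the last bin


if __name__ == "__main__":
    angle = 2.9  # degrees
    print(to_bin(angle, -12.0, 12.0, num_bins=8))   # 4  (bins are 3 degrees wide)
    print(to_bin(angle, -12.0, 12.0, num_bins=24))  # 14 (bins are 1 degree wide)
```

The resulting bin indices, one per observation, can then be used together as the key into the Q-table.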
CartPole has 4 observations. Three of them use the bins setting directly and the pole angle uses 2× bins for finer resolution, so total possible states = bins³ × (2 × bins) = 2 × bins⁴.
| Bins Setting | Total States | Learning Speed | Precision |
|---|---|---|---|
| 8 | ~8,192 | Fast | Coarse |
| 12 | ~41,472 | Medium | Medium |
| 16 | ~131,072 | Slow | Fine |
| 24 | ~663,552 | Very Slow | Very Fine |
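The state counts in the table can be reproduced with a one-line formula, assuming three observations use num_bins each and the pole angle uses twice as many. The function name `total_states` is hypothetical.

```python
def total_states(num_bins):
    """Three observations with num_bins each, pole angle with 2 * num_bins."""
    return num_bins ** 3 * (2 * num_bins)


if __name__ == "__main__":
    for num_bins in (8, 12, 16, 24):
        print(f"{num_bins:>2} bins -> {total_states(num_bins):,} states")
    #  8 bins -> 8,192 states
    # 12 bins -> 41,472 states
    # 16 bins -> 131,072 states
    # 24 bins -> 663,552 states
```

Because the count grows with the fourth power of the bins setting, doubling the bins multiplies the number of states by 16.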
Use num_bins = 12 for balanced learning
12 bins provides good precision without exploding the state space. Use 8 for quick tests, 16-24 for best results with more episodes.
| Parameter | Symbol | Range | Default | Effect |
|---|---|---|---|---|
| Learning Rate | α | 0.01 - 1.0 | 0.2 | How fast to update Q-values |
| Discount Factor | γ | 0.8 - 0.999 | 0.99 | How much to value future rewards |
| Epsilon Start | ε₀ | 0 - 1.0 | 1.0 | Initial exploration rate |
| Epsilon End | ε_min | 0 - 0.2 | 0.01 | Minimum exploration rate |
| Epsilon Decay | - | 0.95 - 0.999 | 0.99 | How fast exploration decreases |
| Episodes | - | 100 - 2000 | 500 | Training duration |
| Bins | - | 8 - 24 | 12 | State space resolution |
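One way to keep these settings together is a small configuration object. The sketch below uses the defaults and ranges from the summary table; the class name `QLearningConfig` and its `validate` method are assumptions for illustration and are not part of the original code.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class QLearningConfig:
    """Hyperparameters with the defaults listed in the summary table."""
    learning_rate: float = 0.2
    discount_factor: float = 0.99
    epsilon_start: float = 1.0
    epsilon_end: float = 0.01
    epsilon_decay: float = 0.99
    num_episodes: int = 500
    num_bins: int = 12

    def validate(self):
        """Raise ValueError if a value falls outside the table's suggested range."""
        ranges = {
            "learning_rate": (0.01, 1.0),
            "discount_factor": (0.8, 0.999),
            "epsilon_start": (0.0, 1.0),
            "epsilon_end": (0.0, 0.2),
            "epsilon_decay": (0.95, 0.999),
            "num_episodes": (100, 2000),
            "num_bins": (8, 24),
        }
        for name, (low, high) in ranges.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside {low} - {high}")
        return self


if __name__ == "__main__":
    default_config = QLearningConfig().validate()
    # A quick-test variant: fewer episodes and coarser bins.
    quick_config = replace(default_config, num_episodes=100, num_bins=8).validate()
    print(default_config)
    print(quick_config)
```

Freezing the dataclass means each experiment gets its own explicit copy of the settings, which makes runs easier to compare.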