📈 Learning Rate Strategies
Function to modify: get_learning_rate() in w19d2_starter.py
What is Learning Rate?
The learning rate (α) controls how much new information overrides old information during Q-value updates.
Q(s,a) ← Q(s,a) + α × [r + γ × max_a' Q(s',a') - Q(s,a)]
High Learning Rate (0.5-1.0)
Fast adaptation to new experiences. Risk: Unstable, may forget good policies.
Low Learning Rate (0.01-0.1)
Stable, gradual learning. Risk: Slow convergence, may get stuck.
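To make the effect of α concrete, here is a minimal sketch of one tabular update applied with a high and a low rate. The dictionary-based table, the state names, and the helper `q_update` are illustrative assumptions, not part of w19d2_starter.py.

```python
def q_update(q, state, action, reward, next_state, alpha, gamma=0.99):
    """Apply one Q-learning update in place and return the new Q(s, a)."""
    best_next = max(q[next_state])          # max over a' of Q(s', a')
    td_target = reward + gamma * best_next  # r + gamma * max Q(s', a')
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


# Hypothetical two-state table with two actions per state.
q_fast = {"s0": [0.0, 0.0], "s1": [10.0, 5.0]}
q_slow = {"s0": [0.0, 0.0], "s1": [10.0, 5.0]}

high = q_update(q_fast, "s0", 0, reward=1.0, next_state="s1", alpha=0.5, gamma=0.9)
low = q_update(q_slow, "s0", 0, reward=1.0, next_state="s1", alpha=0.05, gamma=0.9)
print(high, low)
```

The target is the same in both cases (1 + 0.9 × 10 = 10), but α = 0.5 moves the estimate halfway there in one step while α = 0.05 moves it only 5% of the way.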
Improvement Ideas
1. Time-Based Decay (Easy)
Start with a high learning rate and decay it over time:
def get_learning_rate(self, state=None):
    # Multiply the base rate by 0.999 once per Q-update
    return self.learning_rate * (0.999 ** self.total_updates)
- Allows fast early learning, then stabilizes
- Try different decay factors: 0.999, 0.9995, 0.9999
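The three decay factors look similar but behave very differently over thousands of updates. The sketch below, using a hypothetical helper `updates_until`, counts how many updates it takes for the multiplier `decay ** n` to fall to a given fraction of its starting value.

```python
import math


def updates_until(fraction, decay):
    """Smallest n such that decay ** n <= fraction."""
    return math.ceil(math.log(fraction) / math.log(decay))


for decay in (0.999, 0.9995, 0.9999):
    half = updates_until(0.5, decay)
    one_percent = updates_until(0.01, decay)
    print(f"decay={decay}: half after {half} updates, 1% after {one_percent} updates")
```

With 0.999 the rate halves after roughly 700 updates, while 0.9999 takes roughly ten times as long. Since an update happens on every step, compare these counts against the number of steps in a full training run.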
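2. Per-State Learning Rate (Medium)
Track visits to each state and decrease the learning rate for frequently visited states:
# Create the counter once, e.g. in __init__: self.visit_counts = {}
def get_learning_rate(self, state=None):
    # state must be hashable, e.g. the discretized state tuple
    if state not in self.visit_counts:
        self.visit_counts[state] = 0
    self.visit_counts[state] += 1
    return 1.0 / self.visit_counts[state]
- Satisfies the Robbins-Monro step-size conditions (Σα = ∞, Σα² < ∞) used in convergence proofs for tabular Q-learning; those proofs also assume every state-action pair keeps being visited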
- Well-studied states get smaller updates
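One way to see what a 1/n step size does: it turns the incremental update into a plain running average of the targets seen so far. The sketch below uses made-up target values and a hypothetical helper name. In Q-learning the targets themselves shift as the table changes, so this illustrates only the step-size schedule.

```python
def incremental_mean(samples):
    """Update an estimate with step size 1/n after the n-th sample."""
    estimate = 0.0
    for n, x in enumerate(samples, start=1):
        alpha = 1.0 / n
        estimate += alpha * (x - estimate)
    return estimate


targets = [4.0, 8.0, 6.0, 2.0]
print(incremental_mean(targets), sum(targets) / len(targets))
```

Both printed values agree, which is why later samples have less and less influence on a frequently visited state.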
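3. Scheduled Decay (Easy)
Step-based decay at specific intervals:
def get_learning_rate(self, state=None):
    # total_updates counts Q-updates (steps), so each phase spans 500 updates,
    # which is not necessarily one episode
    phase = self.total_updates // 500
    if phase < 100:
        return 0.5
    if phase < 300:
        return 0.2
    return 0.05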
Pitfall to Avoid
Don't decay the learning rate too quickly! If it reaches 0 too soon, the agent stops learning entirely.
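A simple guard against this pitfall is a lower bound on the rate. This is a minimal sketch: the function name and the `min_rate` value of 0.01 are assumptions chosen for illustration, not values from the starter code.

```python
def decayed_learning_rate(initial_rate, decay, total_updates, min_rate=0.01):
    """Exponential decay that never drops below min_rate."""
    return max(min_rate, initial_rate * decay ** total_updates)


for updates in (0, 1_000, 10_000, 100_000):
    print(updates, decayed_learning_rate(0.5, 0.999, updates))
```

Without the floor, the rate after 10,000 updates would already be far below 0.001; with it, the agent keeps making small corrections.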
🔍 Exploration Strategies
Function to modify: select_action() in w19d2_starter.py
The Exploration-Exploitation Dilemma
Should the agent try new things (explore) or stick with what works (exploit)?
Too Much Exploration
Agent keeps trying random actions even when it knows good ones. Score never improves.
Too Little Exploration
Agent gets stuck with suboptimal policy because it never discovers better actions.
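The epsilon references later in this guide suggest the starter uses epsilon-greedy selection. Under that assumption, the trade-off can be sketched as follows; `epsilon_greedy` and the Q-values are hypothetical, not the starter's implementation.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [0.2, 1.5]  # action 1 currently looks better
for epsilon in (1.0, 0.1, 0.0):
    picks = [epsilon_greedy(q_values, epsilon, rng) for _ in range(1000)]
    print(f"epsilon={epsilon}: greedy action chosen {picks.count(1) / 1000:.0%} of the time")
```

At epsilon = 1.0 the Q-values are ignored entirely, and at epsilon = 0.0 the agent never tries the other action, so any early mistake in the table is never corrected.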
Improvement Ideas
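1. Boltzmann/Softmax Exploration (Medium)
Instead of exploring uniformly at random, choose actions probabilistically based on Q-values:
def select_action(self, state, training=True):
    discrete_state = self.discretize(state)
    q_values = self.q_table[discrete_state]
    if not training:
        # Assumption: act greedily when evaluating
        return int(np.argmax(q_values))
    temperature = 1.0
    # Subtract the max before exponentiating to avoid overflow;
    # the resulting probabilities are unchanged
    exp_q = np.exp((q_values - np.max(q_values)) / temperature)
    probs = exp_q / np.sum(exp_q)
    return np.random.choice([0, 1], p=probs)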
- Better actions more likely to be chosen
- Still explores, but prefers promising actions
- Try temperature values: 0.5, 1.0, 2.0
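To get a feel for the temperature values, the sketch below computes the action probabilities for one made-up pair of Q-values. `softmax_probs` is a hypothetical standalone helper written with the standard library only.

```python
import math


def softmax_probs(q_values, temperature):
    """Boltzmann action probabilities, shifted by the max for numerical stability."""
    top = max(q_values)
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


q_values = [1.0, 2.0]
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_probs(q_values, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

For a Q-value gap of 1, the better action is chosen about 88% of the time at temperature 0.5, about 73% at 1.0, and about 62% at 2.0. The right temperature therefore depends on the scale of your Q-values.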
2. Linear Epsilon Decay (Easy)
Decay epsilon linearly instead of exponentially:
def decay_epsilon(self, episode):
    decay_steps = 400
    self.epsilon = max(
        self.epsilon_end,
        self.epsilon_start - (episode / decay_steps) * (self.epsilon_start - self.epsilon_end)
    )
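The same schedule as a standalone function makes it easy to check values before training. The defaults of 1.0 and 0.01 for the start and end values are assumptions for illustration; use whatever the starter defines.

```python
def linear_epsilon(episode, epsilon_start=1.0, epsilon_end=0.01, decay_steps=400):
    """Interpolate from epsilon_start to epsilon_end over decay_steps episodes."""
    fraction = min(episode / decay_steps, 1.0)
    return epsilon_start - fraction * (epsilon_start - epsilon_end)


for episode in (0, 100, 200, 400, 800):
    print(episode, round(linear_epsilon(episode), 4))
```

Epsilon reaches its final value exactly at `decay_steps` and stays there, so `decay_steps` should be chosen relative to the total number of training episodes.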
3. UCB-like Exploration (Hard)
Add an exploration bonus for less-visited state-action pairs:
# Create the counter once, e.g. in __init__: self.action_counts = {}
def select_action(self, state, training=True):
    discrete_state = self.discretize(state)
    q_values = self.q_table[discrete_state]
    if discrete_state not in self.action_counts:
        self.action_counts[discrete_state] = [0, 0]
    c = 2.0  # exploration strength
    total = sum(self.action_counts[discrete_state]) + 1
    ucb = []
    for a in [0, 1]:
        bonus = c * np.sqrt(np.log(total) / (self.action_counts[discrete_state][a] + 1))
        ucb.append(q_values[a] + bonus)
    action = int(np.argmax(ucb))
    # Record the choice; if the counts never change, total stays 1 and the bonus is always 0
    self.action_counts[discrete_state][action] += 1
    return action
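The bonus term can be examined on its own. The sketch below evaluates the same expression as the code above for made-up visit counts; `ucb_bonus` is a hypothetical helper name.

```python
import math


def ucb_bonus(total_visits, action_visits, c=2.0):
    """Exploration bonus: c * sqrt(ln(total) / (n_a + 1))."""
    return c * math.sqrt(math.log(total_visits) / (action_visits + 1))


# A state seen 100 times: action 0 tried 90 times, action 1 tried 10 times.
total = 90 + 10 + 1
print("often-tried action:", round(ucb_bonus(total, 90), 3))
print("rarely-tried action:", round(ucb_bonus(total, 10), 3))
```

The rarely tried action gets the larger bonus, so it is selected unless its Q-value is clearly worse. With `c = 2.0` the bonus can exceed typical early Q-value differences, so the value of `c` is worth tuning.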
📊 State Representation
Function to modify: create_bins() in w19d2_starter.py
Why Does Binning Matter?
Q-Learning needs discrete states, but CartPole has continuous observations. How we discretize affects learning:
Too Few Bins (4-6)
Can't distinguish between similar but important states. Poor precision.
Too Many Bins (30+)
States rarely visited twice. Takes forever to learn.
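The cost of extra bins is easy to quantify. As a rough sketch, assuming each variable is indexed with `np.digitize` (which maps n edges to n + 1 possible indices) and that there are two actions, the Q-table size grows as follows. `q_table_entries` is a hypothetical helper.

```python
import math


def q_table_entries(edges_per_variable, n_actions=2):
    """Number of Q-values when each variable with n edges yields n + 1 indices."""
    cells = math.prod(n + 1 for n in edges_per_variable)
    return cells * n_actions


for edges in ([6, 6, 6, 6], [12, 12, 24, 12], [30, 30, 30, 30]):
    print(edges, q_table_entries(edges))
```

Going from 6 to 30 edges per variable takes the table from a few thousand entries to well over a million, each of which has to be visited several times before its value is reliable.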
State Variables
| Variable | Range | Importance | Default Bins |
|---|---|---|---|
| Cart Position | -2.4 to 2.4 | Low - cart can be anywhere | 12 |
| Cart Velocity | -∞ to ∞ (clip to ±3) | Medium | 12 |
| Pole Angle | -0.21 to 0.21 rad | HIGH - most critical! | 24 |
| Pole Velocity | -∞ to ∞ (clip to ±3) | High - predicts future angle | 12 |
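A minimal sketch of how these ranges and bin counts could turn an observation into a table index. It assumes the bin counts are the number of `np.linspace` edges and that values are clipped before binning. The module-level `bins` dictionary and this `discretize` function are illustrative, not the starter's implementation.

```python
import numpy as np

# Hypothetical edges following the ranges and default bin counts in the table.
bins = {
    "cart_pos": np.linspace(-2.4, 2.4, 12),
    "cart_vel": np.linspace(-3, 3, 12),
    "pole_angle": np.linspace(-0.21, 0.21, 24),
    "pole_vel": np.linspace(-3, 3, 12),
}


def discretize(state):
    """Clip each observation to its bin range, then map it to a bin index."""
    indices = []
    for value, edges in zip(state, bins.values()):
        clipped = np.clip(value, edges[0], edges[-1])
        indices.append(int(np.digitize(clipped, edges)))
    return tuple(indices)


print(discretize([0.0, 0.5, 0.01, -7.5]))  # the last value is clipped to -3
```

Returning a tuple of plain integers keeps the result hashable, so it can index either a dictionary or a NumPy array.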
Improvement Ideas
1. More Bins for Critical Variables (Easy)
Give more precision to the most important variables:
def create_bins(self):
    return {
        "cart_pos": np.linspace(-2.4, 2.4, 8),
        "cart_vel": np.linspace(-3, 3, 12),
        "pole_angle": np.linspace(-0.21, 0.21, 48),
        "pole_vel": np.linspace(-3, 3, 24),
    }
2. Non-Uniform Bins (Medium)
Put more bins near zero, where precision matters most:
def create_bins(self):
    # endpoint=False keeps -0.05 and 0.05 from appearing twice,
    # which would create empty bins in np.digitize
    pole_angle_bins = np.concatenate([
        np.linspace(-0.21, -0.05, 8, endpoint=False),
        np.linspace(-0.05, 0.05, 16, endpoint=False),
        np.linspace(0.05, 0.21, 8),
    ])
    return {
        "cart_pos": np.linspace(-2.4, 2.4, 12),
        "cart_vel": np.linspace(-3, 3, 12),
        "pole_angle": pole_angle_bins,
        "pole_vel": np.linspace(-3, 3, 12),
    }
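To check what non-uniform spacing buys, the sketch below compares bin widths for a uniform grid and a non-uniform grid with the same number of edges (32). The non-uniform edges follow the three segments above, built with `endpoint=False` so that segment boundaries are not repeated; the exact numbers are for illustration only.

```python
import numpy as np

uniform = np.linspace(-0.21, 0.21, 32)
non_uniform = np.concatenate([
    np.linspace(-0.21, -0.05, 8, endpoint=False),
    np.linspace(-0.05, 0.05, 16, endpoint=False),
    np.linspace(0.05, 0.21, 8),
])

widths = np.diff(non_uniform)
print("uniform width:", np.diff(uniform).max())
print("non-uniform width near zero:", widths[8:24].max())
print("non-uniform width on the left side:", widths[:8].max())
```

Near zero the non-uniform bins are about half as wide as the uniform ones, at the cost of coarser bins toward the angle limits.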
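3. Ignore Cart Position (Easy)
Cart position matters less for balancing than the pole angle and pole velocity, so one option is to collapse it into a single bin:
def discretize(self, state):
    cart_pos, cart_vel, pole_angle, pole_vel = state
    return (
        0,  # every cart position maps to the same index
        np.digitize(cart_vel, self.bins["cart_vel"]),
        np.digitize(pole_angle, self.bins["pole_angle"]),
        np.digitize(pole_vel, self.bins["pole_vel"]),
    )
- Shrinks the state space by a factor equal to the number of cart-position bins
- May work well for balancing, but the agent can no longer tell when it is near the ±2.4 track limit, so it cannot learn edge-avoidance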
🎯 Reward Shaping
Function to modify: shape_reward() in w19d2_starter.py
What is Reward Shaping?
The environment gives +1 for every step. But we can add our own signals to guide learning faster:
shaped_reward = base_reward + F(s, s')
Good Shaping
Provides hints about progress toward the goal without changing optimal policy.
Bad Shaping
Changes what the agent optimizes for. May learn wrong behavior!
Improvement Ideas
1. Angle-Based Penalty (Easy)
Penalize being far from upright:
def shape_reward(self, base_reward, state, next_state, done):
    angle_penalty = abs(state[2]) * 2  # state[2] is the pole angle in radians
    return base_reward - angle_penalty
- Encourages keeping pole upright
- Try multipliers: 1, 2, 5, 10
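Before picking a multiplier, it helps to check how large the penalty can get relative to the +1 base reward. The sketch below uses the ±0.21 rad pole-angle limit from the State Variables table; the helper name is hypothetical.

```python
MAX_ANGLE = 0.21  # radians; pole-angle limit from the State Variables table


def angle_shaped_reward(base_reward, angle, multiplier):
    """Base reward minus a penalty proportional to the absolute pole angle."""
    return base_reward - abs(angle) * multiplier


for multiplier in (1, 2, 5, 10):
    worst = angle_shaped_reward(1.0, MAX_ANGLE, multiplier)
    break_even = 1.0 / multiplier  # angle at which the shaped reward reaches 0
    print(f"multiplier={multiplier}: reward at limit {worst:+.2f}, zero at {break_even:.2f} rad")
```

With multipliers 1 and 2 the shaped reward stays positive over the whole angle range. With 5 it drops to about zero near the limit, and with 10 it turns negative beyond 0.1 rad.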
2. Velocity Penalty (Easy)
Penalize fast movements to encourage smooth control:
def shape_reward(self, base_reward, state, next_state, done):
    # state[1] is cart velocity, state[3] is pole velocity
    vel_penalty = abs(state[1]) * 0.1 + abs(state[3]) * 0.1
    return base_reward - vel_penalty
3. Center Position Bonus (Easy)
Encourage staying near the center of the track by penalizing distance from it:
def shape_reward(self, base_reward, state, next_state, done):
    pos_penalty = abs(state[0]) * 0.5  # state[0] is the cart position
    return base_reward - pos_penalty
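4. Potential-Based Shaping (Hard)
Shaping of this form is mathematically guaranteed to preserve the optimal policy:
def potential(self, state):
    """Higher when pole is more upright."""
    return -abs(state[2])

def shape_reward(self, base_reward, state, next_state, done):
    # F(s, s') = γ × Φ(s') - Φ(s)
    F = self.discount_factor * self.potential(next_state) - self.potential(state)
    return base_reward + F
- Based on Ng, Harada, and Russell (1999), "Policy Invariance Under Reward Transformations"
- Provably doesn't change the optimal policy as long as F has exactly this form; in episodic tasks the proof also treats the potential of terminal states as 0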
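The intuition behind the guarantee is that the shaping terms telescope: their discounted sum along any trajectory depends only on the first and last states, not on the actions taken in between. The sketch below checks this numerically on a made-up trajectory; the function names and state values are illustrative.

```python
def potential(state):
    """Higher when the pole angle (state[2]) is closer to 0."""
    return -abs(state[2])


def discounted_shaping_sum(states, gamma):
    """Sum of gamma**t * F(s_t, s_{t+1}) along a trajectory."""
    total = 0.0
    for t in range(len(states) - 1):
        f = gamma * potential(states[t + 1]) - potential(states[t])
        total += gamma ** t * f
    return total


# Hypothetical trajectory of (cart_pos, cart_vel, pole_angle, pole_vel) states.
trajectory = [
    (0.0, 0.0, 0.02, 0.0),
    (0.0, 0.1, 0.05, 0.3),
    (0.0, 0.2, 0.01, -0.4),
    (0.1, 0.1, 0.08, 0.5),
]
gamma = 0.99
steps = len(trajectory) - 1
closed_form = gamma ** steps * potential(trajectory[-1]) - potential(trajectory[0])
print(discounted_shaping_sum(trajectory, gamma), closed_form)
```

Both numbers agree. The angle, velocity, and position penalties in ideas 1 to 3 do not have this property, which is why they can change what the agent optimizes for.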
Critical Warning
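Be careful with reward shaping! If your shaped rewards are mostly negative, Q-values become very negative and learning can fail: when every step costs reward, ending the episode early by dropping the pole can look better than balancing. Keep per-step rewards mostly positive, or reduce the scale of the penalties.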