
Q-Learning Improvement Research

Each team member picks one topic to research and implement

📈 Learning Rate Strategies

Function to modify: get_learning_rate() in w19d2_starter.py

What is Learning Rate?

The learning rate (α) controls how much new information overrides old information during Q-value updates.

Q(s,a) ← Q(s,a) + α × [r + γ × max_a' Q(s',a') - Q(s,a)]
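To make the update rule concrete, here is a minimal sketch of a single update applied with a high and a low learning rate. The function name q_update and all the numbers are made up for illustration; they are not taken from w19d2_starter.py.

```python
def q_update(q_sa, reward, max_next_q, alpha, gamma):
    """One tabular Q-learning update for a single (state, action) pair."""
    td_target = reward + gamma * max_next_q  # r + gamma * max over a' of Q(s', a')
    td_error = td_target - q_sa              # how far off the old estimate was
    return q_sa + alpha * td_error


# Same experience, two learning rates (illustrative numbers only).
old_q, reward, max_next_q, gamma = 10.0, 1.0, 20.0, 0.99
high = q_update(old_q, reward, max_next_q, alpha=0.5, gamma=gamma)
low = q_update(old_q, reward, max_next_q, alpha=0.05, gamma=gamma)
print(f"alpha=0.5  -> {high:.2f}")  # moves halfway toward the target of 20.8
print(f"alpha=0.05 -> {low:.2f}")   # barely moves from 10.0
```

With the same experience, the high rate jumps halfway to the new target while the low rate nudges the old estimate only slightly.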

High Learning Rate (0.5-1.0)

Fast adaptation to new experiences. Risk: Unstable, may forget good policies.

Low Learning Rate (0.01-0.1)

Stable, gradual learning. Risk: Slow convergence, may get stuck.

Improvement Ideas

1. Time-Based Decay (Easy)

Start with a high learning rate and decay it over time:

# In get_learning_rate():
def get_learning_rate(self, state=None):
    # Multiply the base rate by 0.999 once for every update made so far
    return self.learning_rate * (0.999 ** self.total_updates)
  • Allows fast early learning, then stabilizes
  • Try different decay factors: 0.999, 0.9995, 0.9999
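When comparing decay factors, it can help to estimate how many updates it takes before the learning rate becomes very small. The sketch below assumes a starting rate of 0.5 and a threshold of 0.01; both values and the helper name updates_until are illustrative choices, not part of the starter code.

```python
import math


def updates_until(initial_lr, decay, threshold):
    """Smallest n for which initial_lr * decay**n has dropped to threshold or below."""
    return math.ceil(math.log(threshold / initial_lr) / math.log(decay))


for decay in (0.999, 0.9995, 0.9999):
    n = updates_until(initial_lr=0.5, decay=decay, threshold=0.01)
    print(f"decay={decay}: learning rate falls below 0.01 after about {n} updates")
```

Under these assumptions the three factors reach the threshold after roughly 3,900, 7,800, and 39,000 updates, so the choice of decay factor should be matched to how many updates a training run actually performs.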

2. Per-State Learning Rate (Medium)

Track visits to each state and decrease the learning rate for frequently visited states:

# Add to __init__():
self.visit_counts = {}

# In get_learning_rate():
def get_learning_rate(self, state=None):
    # state must be hashable, e.g. the discretized state tuple
    if state not in self.visit_counts:
        self.visit_counts[state] = 0
    self.visit_counts[state] += 1
    return 1.0 / self.visit_counts[state]
  • A 1/n step size satisfies the Robbins-Monro conditions; convergence also requires that every state-action pair keeps being visited
  • Well-studied states get smaller updates
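One way to see why a 1/n learning rate behaves well: applied to a stream of update targets, it produces exactly their running average. The sketch below is a standalone illustration with made-up targets; running_estimate is a hypothetical helper, not a function in the starter file.

```python
def running_estimate(targets):
    """Apply estimate += (1/n) * (target - estimate) for each target in turn."""
    estimate = 0.0
    for n, target in enumerate(targets, start=1):
        alpha = 1.0 / n  # learning rate after n visits
        estimate += alpha * (target - estimate)
    return estimate


targets = [4.0, 8.0, 6.0, 10.0]  # made-up update targets for one state
print(running_estimate(targets))    # incremental result
print(sum(targets) / len(targets))  # plain average of the same targets
```

Both lines print the same value, which shows that every past target keeps equal weight. That is stable, but it also means the agent is slow to forget early, inaccurate targets.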

3. Scheduled Decay (Easy)

Step-based decay at specific intervals:

def get_learning_rate(self, state=None):
    # Approximation: counts every 500 updates as one episode
    episode = self.total_updates // 500
    if episode < 100:
        return 0.5
    if episode < 300:
        return 0.2
    return 0.05


Pitfall to Avoid

Don't decay the learning rate too quickly! If it reaches 0 too soon, the agent stops learning entirely.
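A simple guard against this pitfall is to put a floor under the decayed rate. The sketch below is one possible way to do it; the min_lr parameter and the function name are assumptions introduced for illustration and do not exist in the starter code.

```python
def floored_learning_rate(initial_lr, decay, total_updates, min_lr=0.01):
    """Exponential decay that never drops below min_lr (a hypothetical floor)."""
    return max(min_lr, initial_lr * decay ** total_updates)


for step in (0, 1_000, 10_000, 100_000):
    print(step, floored_learning_rate(0.5, 0.999, step))
```

Without the floor, a decay factor of 0.999 drives the rate to practically zero well before 100,000 updates; with it, the agent keeps making small corrections.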

🔍 Exploration Strategies

Function to modify: select_action() in w19d2_starter.py

The Exploration-Exploitation Dilemma

Should the agent try new things (explore) or stick with what works (exploit)?

Too Much Exploration

Agent keeps trying random actions even when it knows good ones. Score never improves.

Too Little Exploration

Agent gets stuck with suboptimal policy because it never discovers better actions.

Improvement Ideas

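1. Boltzmann/Softmax Exploration (Medium)

Instead of exploring uniformly at random, choose actions probabilistically based on their Q-values:

import numpy as np

def select_action(self, state, training=True):
    discrete_state = self.discretize(state)
    q_values = np.asarray(self.q_table[discrete_state], dtype=float)
    if not training:
        return int(np.argmax(q_values))  # Act greedily during evaluation
    temperature = 1.0  # Higher = more random
    # Subtract the max before exponentiating to avoid overflow
    exp_q = np.exp((q_values - np.max(q_values)) / temperature)
    probs = exp_q / np.sum(exp_q)
    return int(np.random.choice([0, 1], p=probs))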
  • Better actions more likely to be chosen
  • Still explores, but prefers promising actions
  • Try temperature values: 0.5, 1.0, 2.0
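To get a feel for the temperature parameter before touching the agent, the probabilities can be computed by hand. This is a pure-Python sketch with made-up Q-values; softmax_probs is a hypothetical helper name.

```python
import math


def softmax_probs(q_values, temperature):
    """Boltzmann action probabilities, shifted by max(q) for numerical stability."""
    top = max(q_values)
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


q_values = [1.0, 2.0]  # made-up Q-values for actions 0 and 1
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_probs(q_values, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures push more probability onto the better action; higher temperatures flatten the distribution toward uniform random choice.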

2. Linear Epsilon Decay (Easy)

Decay epsilon linearly instead of exponentially:

# Replace decay_epsilon() method:
def decay_epsilon(self, episode):
    decay_steps = 400  # Reach epsilon_end after this many episodes
    self.epsilon = max(
        self.epsilon_end,
        self.epsilon_start - (episode / decay_steps) * (self.epsilon_start - self.epsilon_end)
    )
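The difference between the two schedules is easiest to see side by side. The sketch below assumes epsilon starts at 1.0, ends at 0.01, and that the exponential version multiplies by 0.99 per episode; these values are illustrative guesses, so check them against the defaults in the starter file.

```python
def linear_epsilon(episode, start=1.0, end=0.01, decay_steps=400):
    """Straight line from start to end over decay_steps episodes, then flat."""
    fraction = min(episode / decay_steps, 1.0)
    return start + fraction * (end - start)


def exponential_epsilon(episode, start=1.0, end=0.01, decay=0.99):
    """Multiply by decay once per episode, never going below end."""
    return max(end, start * decay ** episode)


for episode in (0, 100, 200, 400):
    print(episode,
          round(linear_epsilon(episode), 3),
          round(exponential_epsilon(episode), 3))
```

With these settings the linear schedule explores much more in the first few hundred episodes, while the exponential one drops quickly and then flattens out.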

3. UCB-like Exploration (Hard)

Add an exploration bonus for less-visited state-action pairs:

# Add to __init__():
self.action_counts = {}

def select_action(self, state, training=True):
    discrete_state = self.discretize(state)
    q_values = self.q_table[discrete_state]
    if discrete_state not in self.action_counts:
        self.action_counts[discrete_state] = [0, 0]
    counts = self.action_counts[discrete_state]
    c = 2.0  # Exploration coefficient
    total = sum(counts) + 1
    ucb = []
    for a in [0, 1]:
        bonus = c * np.sqrt(np.log(total) / (counts[a] + 1))
        ucb.append(q_values[a] + bonus)
    action = int(np.argmax(ucb))
    counts[action] += 1  # Record the visit so the bonus shrinks over time
    return action
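The effect of the bonus can be checked in isolation. The sketch below uses the same bonus formula with made-up visit counts and Q-values; ucb_bonus is a hypothetical helper, not part of the starter file.

```python
import math


def ucb_bonus(total_visits, action_visits, c=2.0):
    """Exploration bonus: c * sqrt(ln(total) / (visits to this action + 1))."""
    return c * math.sqrt(math.log(total_visits) / (action_visits + 1))


# Made-up situation: action 0 tried 5 times, action 1 tried 95 times.
counts = [5, 95]
q_values = [10.0, 10.5]
total = sum(counts) + 1
scores = [q + ucb_bonus(total, n) for q, n in zip(q_values, counts)]
chosen = scores.index(max(scores))
print([round(s, 3) for s in scores], "-> action", chosen)
```

Here the rarely tried action wins even though its Q-value is slightly lower, because its bonus is about four times larger. As its count grows, the bonus shrinks and the Q-values take over.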


📊 State Representation

Function to modify: create_bins() in w19d2_starter.py

Why Does Binning Matter?

Q-Learning needs discrete states, but CartPole has continuous observations. How we discretize affects learning:

Too Few Bins (4-6)

Can't distinguish between similar but important states. Poor precision.

Too Many Bins (30+)

States rarely visited twice. Takes forever to learn.
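The cost of extra bins grows quickly because the bin counts multiply across the four state variables. The sketch below assumes the same number of bins for every variable and two actions (push left, push right); q_table_entries is a hypothetical helper.

```python
def q_table_entries(bins_per_variable, n_actions=2):
    """Q-values to learn when all 4 CartPole variables use the same bin count."""
    return bins_per_variable ** 4 * n_actions


for bins in (6, 12, 30):
    print(f"{bins} bins per variable -> {q_table_entries(bins):,} Q-values")
```

Going from 12 to 30 bins per variable increases the table from tens of thousands of entries to over a million, which is why very fine grids learn so slowly.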

State Variables

Variable | Range | Importance | Default Bins
Cart Position | -2.4 to 2.4 | Low - cart can be anywhere | 12
Cart Velocity | -∞ to ∞ (clip to ±3) | Medium | 12
Pole Angle | -0.21 to 0.21 rad | HIGH - most critical! | 24
Pole Velocity | -∞ to ∞ (clip to ±3) | High - predicts future angle | 12
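The sketch below shows how clipping and binning turn one continuous value into a bin index. It uses only the standard library so it runs anywhere: make_edges mimics evenly spaced edges in the style of np.linspace, and bisect_right returns the same index np.digitize would for increasing edges. Both helper names are hypothetical.

```python
from bisect import bisect_right


def make_edges(low, high, count):
    """count evenly spaced bin edges from low to high."""
    return [low + (high - low) * i / (count - 1) for i in range(count)]


def to_bin(value, edges):
    """Clip value to the edge range, then return its bin index."""
    clipped = min(max(value, edges[0]), edges[-1])
    return bisect_right(edges, clipped)


angle_edges = make_edges(-0.21, 0.21, 24)
velocity_edges = make_edges(-3.0, 3.0, 12)

print(to_bin(0.0, angle_edges))      # upright pole lands in a middle bin
print(to_bin(-7.5, velocity_edges))  # extreme velocity is clipped first
print(to_bin(-3.0, velocity_edges))  # same bin as the clipped value above
```

Clipping matters for the two velocity variables: without it, any unusually large reading would still fall into an outer bin, but the chosen ±3 range would no longer describe where the resolution is spent.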

Improvement Ideas

1. More Bins for Critical Variables (Easy)

Give more precision to the most important variables:

def create_bins(self):
    return {
        "cart_pos": np.linspace(-2.4, 2.4, 8),        # Fewer bins (less important)
        "cart_vel": np.linspace(-3, 3, 12),
        "pole_angle": np.linspace(-0.21, 0.21, 48),   # More bins (critical!)
        "pole_vel": np.linspace(-3, 3, 24),           # More bins (important)
    }

2. Non-Uniform Bins (Medium)

Put more bins near zero (where precision matters most):

def create_bins(self):
    # Non-uniform bins: more resolution near center.
    # np.unique drops the duplicate edges at -0.05 and 0.05 where segments meet.
    pole_angle_bins = np.unique(np.concatenate([
        np.linspace(-0.21, -0.05, 8),
        np.linspace(-0.05, 0.05, 16),  # Fine near zero
        np.linspace(0.05, 0.21, 8),
    ]))
    return {
        "cart_pos": np.linspace(-2.4, 2.4, 12),
        "cart_vel": np.linspace(-3, 3, 12),
        "pole_angle": pole_angle_bins,
        "pole_vel": np.linspace(-3, 3, 12),
    }

3. Ignore Cart Position (Easy)

Cart position is rated the least important variable in the table above, so one option is to drop it from the state entirely:

def discretize(self, state):
    cart_pos, cart_vel, pole_angle, pole_vel = state
    # Ignore cart position - use constant 0
    return (
        0,  # Always 0 for cart position
        np.digitize(cart_vel, self.bins["cart_vel"]),
        np.digitize(pole_angle, self.bins["pole_angle"]),
        np.digitize(pole_vel, self.bins["pole_vel"]),
    )
  • Reduces state space dramatically
  • May work well for balancing but not edge-avoidance


🎯 Reward Shaping

Function to modify: shape_reward() in w19d2_starter.py

What is Reward Shaping?

The environment gives +1 for every step. But we can add our own signals to guide learning faster:

shaped_reward = base_reward + F(s, s')

Good Shaping

Provides hints about progress toward the goal without changing optimal policy.

Bad Shaping

Changes what the agent optimizes for. May learn wrong behavior!

Improvement Ideas

1. Angle-Based Penalty (Easy)

Penalize being far from upright:

def shape_reward(self, base_reward, state, next_state, done):
    angle_penalty = abs(state[2]) * 2  # state[2] is pole angle
    return base_reward - angle_penalty
  • Encourages keeping pole upright
  • Try multipliers: 1, 2, 5, 10

2. Velocity Penalty (Easy)

Penalize fast movements (encourage smooth control):

def shape_reward(self, base_reward, state, next_state, done):
    # state[1] is cart velocity, state[3] is pole velocity
    vel_penalty = abs(state[1]) * 0.1 + abs(state[3]) * 0.1
    return base_reward - vel_penalty

3. Center Position Bonus (Easy)

Favor staying near the center of the track by penalizing distance from it:

def shape_reward(self, base_reward, state, next_state, done):
    pos_penalty = abs(state[0]) * 0.5  # state[0] is cart position
    return base_reward - pos_penalty

4. Potential-Based Shaping (Hard)

Mathematically guaranteed to preserve the optimal policy:

def potential(self, state):
    """Higher when pole is more upright."""
    return -abs(state[2])  # Negative absolute angle

def shape_reward(self, base_reward, state, next_state, done):
    # Shaping reward is the discounted difference in potentials
    F = self.discount_factor * self.potential(next_state) - self.potential(state)
    return base_reward + F
  • Based on research by Andrew Ng, Daishi Harada, and Stuart Russell (1999)
  • Provably doesn't change optimal policy
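The intuition behind the guarantee is that the discounted shaping terms telescope: over any trajectory they sum to a value that depends only on the first and last state, not on the path taken between them. The sketch below checks this numerically for two made-up sequences of pole angles; the function names and the discount factor of 0.99 are assumptions for illustration.

```python
def potential(angle):
    """Higher when the pole is more upright."""
    return -abs(angle)


def discounted_shaping_sum(angles, gamma):
    """Sum of gamma**t * F(s_t, s_t+1) along a trajectory of pole angles."""
    total = 0.0
    for t in range(len(angles) - 1):
        f = gamma * potential(angles[t + 1]) - potential(angles[t])
        total += gamma ** t * f
    return total


gamma = 0.99
path_a = [0.05, 0.10, 0.02, 0.01]   # made-up trajectory
path_b = [0.05, -0.15, 0.20, 0.01]  # same start and end, different middle
print(discounted_shaping_sum(path_a, gamma))
print(discounted_shaping_sum(path_b, gamma))
```

Both paths print the same total, so the shaping term cannot make one route between two states look better than another; only the base reward decides that.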

Critical Warning

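Be careful with reward shaping! If your shaped rewards are mostly negative, Q-values become very negative and learning can fail: when each extra step lowers the return, the agent can learn that ending the episode early is better than balancing. Always ensure rewards stay mostly positive, or reduce the penalty scale.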
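A quick sanity check before training is to compute the shaped reward at the worst pole angle for each candidate multiplier. The sketch below does this for the angle-based penalty, assuming a base reward of 1.0 and a worst-case angle of 0.21 rad; angle_shaped_reward is a hypothetical standalone version of the penalty.

```python
def angle_shaped_reward(base_reward, pole_angle, multiplier):
    """Base reward minus a penalty proportional to the absolute pole angle."""
    return base_reward - abs(pole_angle) * multiplier


for multiplier in (1, 2, 5, 10):
    worst = angle_shaped_reward(1.0, 0.21, multiplier)
    break_even = 1.0 / multiplier  # angle at which the shaped reward hits zero
    print(f"multiplier={multiplier}: worst-case reward {worst:.2f}, "
          f"negative beyond {break_even:.2f} rad")
```

Under these assumptions, multipliers of 1 and 2 keep the reward positive over the whole angle range, while 5 and 10 push it below zero for larger tilts.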
