
Q-Learning Improvement Research

Each team member picks one topic to research and implement

📈 Learning Rate Strategies

Function to modify: get_learning_rate() in w19d2_starter.py

What is Learning Rate?

The learning rate (α) controls how much new information overrides old information during Q-value updates.

Q(s,a) ← Q(s,a) + α × [r + γ × max_a' Q(s',a') - Q(s,a)]
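As a worked instance of the update rule, the sketch below applies a single update to one Q-value for three learning rates. The function name q_update and all numbers are illustrative only; they are not taken from w19d2_starter.py.

```python
def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """Apply one tabular Q-learning update to a single Q(s, a) value."""
    td_target = reward + gamma * max_q_next  # r + gamma * max over a' of Q(s', a')
    td_error = td_target - q_sa              # how far off the old estimate was
    return q_sa + alpha * td_error


# Illustrative numbers: old estimate 2.0, reward 1.0, best next-state value 3.0.
for alpha in (0.1, 0.5, 1.0):
    new_q = q_update(q_sa=2.0, reward=1.0, max_q_next=3.0, alpha=alpha, gamma=0.9)
    print(f"alpha={alpha}: Q moves from 2.00 to {new_q:.2f}")
```

With α = 1.0 the old estimate is replaced by the target entirely; with α = 0.1 it moves only a tenth of the way toward it.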

High Learning Rate (0.5-1.0)

Fast adaptation to new experiences. Risk: Unstable, may forget good policies.

Low Learning Rate (0.01-0.1)

Stable, gradual learning. Risk: Slow convergence, may get stuck.

Improvement Ideas

1. Time-Based Decay (Easy)

Start with high learning rate, decay over time:

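# Replace get_learning_rate():
def get_learning_rate(self, state=None):
    decay = 0.999    # Per-update decay factor
    min_rate = 0.01  # Assumed floor so the rate never decays to zero; tune as needed
    return max(min_rate, self.learning_rate * decay ** self.total_updates)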
                

Allows fast early learning, then stabilizes
Try different decay factors: 0.999, 0.9995, 0.9999
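To get a feel for the three decay factors before running any experiments, this small sketch computes how many updates it takes for the multiplier decay**n to fall to a given fraction of its starting value. The helper updates_until is hypothetical and independent of the starter code.

```python
import math


def updates_until(fraction, decay):
    """Smallest n for which decay**n has dropped to `fraction` or below."""
    return math.ceil(math.log(fraction) / math.log(decay))


for decay in (0.999, 0.9995, 0.9999):
    half = updates_until(0.5, decay)
    one_percent = updates_until(0.01, decay)
    print(f"decay={decay}: halves after {half} updates, 1% left after {one_percent} updates")
```

If total_updates grows by one per environment step, a factor of 0.999 halves the rate in well under a thousand steps, which may be only a few dozen early episodes.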

2. Per-State Learning Rate (Medium)

Track visits to each state, decrease learning rate for frequently visited states:

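# Add to __init__():
self.visit_counts = {}

# Replace get_learning_rate():
def get_learning_rate(self, state=None):
    if state is None:
        return self.learning_rate  # No state given: fall back to the base rate
    # state must be hashable (e.g. the discretized tuple, not the raw array)
    self.visit_counts[state] = self.visit_counts.get(state, 0) + 1
    return 1.0 / self.visit_counts[state]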
                

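A 1/n step size satisfies the Robbins-Monro conditions, one requirement for tabular Q-learning to converge (every state-action pair must also keep being visited)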
Well-studied states get smaller updates
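One way to see what a 1/n learning rate does: applied to a stream of noisy targets, the incremental update reproduces the plain sample mean. The sketch below checks this on synthetic data; the target values are made up and stand in for the noisy quantity a Q-value is tracking.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
samples = [random.gauss(5.0, 2.0) for _ in range(1000)]  # noisy targets around 5.0

estimate = 0.0
for n, target in enumerate(samples, start=1):
    alpha = 1.0 / n                          # same schedule as the per-state rule
    estimate += alpha * (target - estimate)  # incremental update

sample_mean = sum(samples) / len(samples)
print(f"incremental estimate: {estimate:.6f}")
print(f"sample mean:          {sample_mean:.6f}")
```

In Q-learning the targets themselves change as neighbouring Q-values improve, so weighting all past targets equally can make learning slow; that is a reason to compare this idea against the decay-based ones rather than assume it wins.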

3. Scheduled Decay (Easy)

Step-based decay at specific intervals:

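def get_learning_rate(self, state=None):
    # Counts blocks of 500 updates; this matches the episode number
    # only if every episode runs for 500 steps
    episode = self.total_updates // 500
    if episode < 100:
        return 0.5
    if episode < 300:
        return 0.2
    return 0.05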
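If you want to try several schedules, a table-driven version can be easier to edit than a chain of if statements. This is a sketch, not part of the starter: BOUNDARIES, RATES and scheduled_rate are hypothetical names, and the thresholds are the ones above expressed directly in updates (100 × 500 and 300 × 500).

```python
import bisect

BOUNDARIES = [50_000, 150_000]  # update counts at which the rate steps down
RATES = [0.5, 0.2, 0.05]        # one more rate than there are boundaries


def scheduled_rate(total_updates):
    """Look up the learning rate for the current number of updates."""
    return RATES[bisect.bisect_right(BOUNDARIES, total_updates)]


for updates in (0, 49_999, 50_000, 149_999, 150_000):
    print(f"{updates:>7} updates -> learning rate {scheduled_rate(updates)}")
```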
                

Resources

Sutton & Barto, Chapter 6 - Temporal-Difference Learning
Robbins-Monro Conditions - Mathematical convergence guarantees

Pitfall to Avoid

Don't decay the learning rate too quickly! If it reaches 0 too soon, the agent stops learning entirely.

🔍 Exploration Strategies

Function to modify: select_action() in w19d2_starter.py

The Exploration-Exploitation Dilemma

Should the agent try new things (explore) or stick with what works (exploit)?

Too Much Exploration

Agent keeps trying random actions even when it knows good ones. Score never improves.

Too Little Exploration

Agent gets stuck with suboptimal policy because it never discovers better actions.

Improvement Ideas

1. Boltzmann/Softmax Exploration (Medium)

Instead of random exploration, choose actions probabilistically based on Q-values:

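def select_action(self, state, training=True):
    discrete_state = self.discretize(state)
    q_values = np.asarray(self.q_table[discrete_state], dtype=float)
    if not training:
        return int(np.argmax(q_values))  # Act greedily during evaluation

    temperature = 1.0  # Higher = more random
    # Subtract the max before exponentiating to avoid overflow
    exp_q = np.exp((q_values - q_values.max()) / temperature)
    probs = exp_q / np.sum(exp_q)
    return int(np.random.choice(len(probs), p=probs))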
                

Better actions more likely to be chosen
Still explores, but prefers promising actions
Try temperature values: 0.5, 1.0, 2.0
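To see what the temperature does before wiring it into the agent, the sketch below prints the action probabilities for one fixed pair of Q-values. The helper softmax_probs and the Q-values are illustrative assumptions, not starter code.

```python
import numpy as np


def softmax_probs(q_values, temperature):
    """Boltzmann action probabilities, shifted by the max for numerical stability."""
    scaled = np.asarray(q_values, dtype=float) / temperature
    exp_q = np.exp(scaled - scaled.max())
    return exp_q / exp_q.sum()


q_values = [1.0, 2.0]  # illustrative Q-values for actions 0 and 1
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_probs(q_values, temperature)
    print(f"T={temperature}: P(action 0)={probs[0]:.3f}, P(action 1)={probs[1]:.3f}")
```

Lower temperatures push probability toward the higher-valued action; higher temperatures move the choice toward uniform. Because only differences between Q-values matter, a useful temperature depends on how far apart the Q-values are in your table.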

2. Linear Epsilon Decay (Easy)

Decay epsilon linearly instead of exponentially:

# Replace decay_epsilon() method:
def decay_epsilon(self, episode):
    decay_steps = 400  # Reach epsilon_end after this many episodes
    self.epsilon = max(
        self.epsilon_end,
        self.epsilon_start - (episode / decay_steps) * (self.epsilon_start - self.epsilon_end)
    )
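For comparison, this standalone sketch prints a linear and an exponential schedule side by side. The start and end values and the exponential factor 0.995 are assumptions chosen for illustration; check the starter for the values it actually uses.

```python
def linear_epsilon(episode, start=1.0, end=0.01, decay_steps=400):
    """Epsilon falls in a straight line from start to end over decay_steps episodes."""
    frac = min(episode / decay_steps, 1.0)
    return start - frac * (start - end)


def exponential_epsilon(episode, start=1.0, end=0.01, decay=0.995):
    """Epsilon is multiplied by decay each episode, but never goes below end."""
    return max(end, start * decay ** episode)


for episode in (0, 100, 200, 400, 800):
    lin = linear_epsilon(episode)
    exp = exponential_epsilon(episode)
    print(f"episode {episode:>3}: linear={lin:.3f}  exponential={exp:.3f}")
```

The linear schedule reaches its floor at a known episode, which makes the exploration budget easy to plan; the exponential one drops quickly at first and then tails off.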
                

3. UCB-like Exploration (Hard)

Add exploration bonus for less-visited state-action pairs:

# Add to __init__(): self.action_counts = {}

def select_action(self, state, training=True):
    discrete_state = self.discretize(state)
    q_values = self.q_table[discrete_state]
    if not training:
        return int(np.argmax(q_values))  # No exploration bonus during evaluation

    counts = self.action_counts.setdefault(discrete_state, [0, 0])

    c = 2.0  # Exploration coefficient
    total = sum(counts) + 1

    ucb = []
    for a in [0, 1]:
        bonus = c * np.sqrt(np.log(total) / (counts[a] + 1))
        ucb.append(q_values[a] + bonus)

    action = int(np.argmax(ucb))
    counts[action] += 1  # Record the visit so the bonus shrinks over time
    return action
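The bonus term can be inspected on its own. The sketch below uses the same formula and +1 offsets as the idea above, with made-up visit counts; ucb_bonus is a hypothetical helper.

```python
import math


def ucb_bonus(state_visits, action_visits, c=2.0):
    """Exploration bonus for one action, given visit counts for its state."""
    total = state_visits + 1
    return c * math.sqrt(math.log(total) / (action_visits + 1))


# Hypothetical counts: one state seen 100 times, action 0 taken 90 times, action 1 taken 10 times.
counts = [90, 10]
for action, n in enumerate(counts):
    print(f"action {action}: taken {n} times, bonus {ucb_bonus(sum(counts), n):.3f}")
```

The rarely tried action gets the larger bonus. Whether that is enough to change the choice depends on the size of the bonus relative to the gap between Q-values, so c is worth tuning.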
                

Resources

Multi-Armed Bandit Problem - Foundation of exploration strategies
Lilian Weng's Bandit Blog - Great overview of exploration methods

📊 State Representation

Function to modify: create_bins() in w19d2_starter.py

Why Does Binning Matter?

Tabular Q-learning needs discrete states, but CartPole has continuous observations. How we discretize affects learning:

Too Few Bins (4-6)

Can't distinguish between similar but important states. Poor precision.

Too Many Bins (30+)

States rarely visited twice. Takes forever to learn.

State Variables

Variable	Range	Importance	Default Bins
Cart Position	-2.4 to 2.4	Low - cart can be anywhere	12
Cart Velocity	-∞ to ∞ (clip to ±3)	Medium	12
Pole Angle	-0.21 to 0.21 rad	HIGH - most critical!	24
Pole Velocity	-∞ to ∞ (clip to ±3)	High - predicts future angle	12
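The trade-off between too few and too many bins is easier to judge with the table size in hand. This sketch multiplies out the default bin counts from the table; it assumes each variable contributes exactly its listed number of bins, and the real table in the starter may differ slightly depending on how bin edges are handled.

```python
import math

default_bins = {"cart_pos": 12, "cart_vel": 12, "pole_angle": 24, "pole_vel": 12}
n_actions = 2  # CartPole: push left or push right

n_states = math.prod(default_bins.values())
print(f"default bins: {n_states} states, {n_states * n_actions} Q-values")

# For comparison: 30 bins on every variable.
dense_states = 30 ** 4
print(f"30 bins each: {dense_states} states, {dense_states * n_actions} Q-values")
```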

Improvement Ideas

1. More Bins for Critical Variables (Easy)

Give more precision to the most important variables:

def create_bins(self):
    return {
        "cart_pos": np.linspace(-2.4, 2.4, 8),      # Fewer bins (less important)
        "cart_vel": np.linspace(-3, 3, 12),
        "pole_angle": np.linspace(-0.21, 0.21, 48),  # More bins (critical!)
        "pole_vel": np.linspace(-3, 3, 24),      # More bins (important)
    }
                

2. Non-Uniform Bins (Medium)

Put more bins near zero (where precision matters most):

def create_bins(self):
    # Non-uniform bins: more resolution near center
    # np.unique drops the duplicated edges at -0.05 and 0.05
    pole_angle_bins = np.unique(np.concatenate([
        np.linspace(-0.21, -0.05, 8),
        np.linspace(-0.05, 0.05, 16),  # Fine near zero
        np.linspace(0.05, 0.21, 8),
    ]))

    return {
        "cart_pos": np.linspace(-2.4, 2.4, 12),
        "cart_vel": np.linspace(-3, 3, 12),
        "pole_angle": pole_angle_bins,
        "pole_vel": np.linspace(-3, 3, 12),
    }
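To check that non-uniform edges really add resolution where it matters, the sketch below compares them with the same number of uniform edges on a few small pole angles. The test angles are made up, and duplicate edges are removed with np.unique so both arrays hold 30 edges.

```python
import numpy as np

uniform = np.linspace(-0.21, 0.21, 30)
non_uniform = np.unique(np.concatenate([
    np.linspace(-0.21, -0.05, 8),
    np.linspace(-0.05, 0.05, 16),  # fine near zero
    np.linspace(0.05, 0.21, 8),
]))

angles = np.array([0.0, 0.005, 0.012, 0.018])  # small tilts near upright
uniform_idx = np.digitize(angles, uniform)
non_uniform_idx = np.digitize(angles, non_uniform)

print("edges:", len(uniform), "uniform,", len(non_uniform), "non-uniform")
print("uniform bin indices:    ", uniform_idx)
print("non-uniform bin indices:", non_uniform_idx)
print("smallest uniform bin width:    ", np.diff(uniform).min())
print("smallest non-uniform bin width:", np.diff(non_uniform).min())
```

More distinct indices for the same angles means the agent can tell those states apart; the cost is coarser bins at large angles, where the episode is close to ending anyway.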
                

3. Ignore Cart Position (Easy)

Cart position is rated the least important variable for balancing (see the State Variables table), so one experiment is to drop it from the state:

def discretize(self, state):
    cart_pos, cart_vel, pole_angle, pole_vel = state
    # Ignore cart position - use constant 0
    return (
        0,  # Always 0 for cart position
        np.digitize(cart_vel, self.bins["cart_vel"]),
        np.digitize(pole_angle, self.bins["pole_angle"]),
        np.digitize(pole_vel, self.bins["pole_vel"]),
    )
                

Reduces state space dramatically
May work well for balancing but not edge-avoidance

Resources

CartPole Documentation - Official state descriptions
Sutton & Barto, Chapter 9 - Function approximation

🎯 Reward Shaping

Function to modify: shape_reward() in w19d2_starter.py

What is Reward Shaping?

The environment gives +1 for every step. But we can add our own signals to guide learning faster:

shaped_reward = base_reward + F(s, s')

Good Shaping

Provides hints about progress toward the goal without changing optimal policy.

Bad Shaping

Changes what the agent optimizes for. May learn wrong behavior!

Improvement Ideas

1. Angle-Based Penalty (Easy)

Penalize being far from upright:

def shape_reward(self, base_reward, state, next_state, done):
    angle_penalty = abs(state[2]) * 2  # state[2] is pole angle
    return base_reward - angle_penalty
                

Encourages keeping pole upright
Try multipliers: 1, 2, 5, 10

2. Velocity Penalty (Easy)

Penalize fast movements (encourage smooth control):

def shape_reward(self, base_reward, state, next_state, done):
    vel_penalty = abs(state[1]) * 0.1 + abs(state[3]) * 0.1
    return base_reward - vel_penalty
                

3. Center Position Bonus (Easy)

Favor staying near the center of the track by penalizing distance from it:

def shape_reward(self, base_reward, state, next_state, done):
    pos_penalty = abs(state[0]) * 0.5  # state[0] is cart position
    return base_reward - pos_penalty
                

4. Potential-Based Shaping (Hard)

Mathematically guaranteed to preserve optimal policy:

def potential(self, state):
    """Higher when pole is more upright."""
    return -abs(state[2])  # Negative absolute angle

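def shape_reward(self, base_reward, state, next_state, done):
    # Shaping reward is the discounted difference in potentials.
    # Terminal states get potential 0, which the invariance result needs in episodic tasks.
    next_potential = 0.0 if done else self.potential(next_state)
    F = self.discount_factor * next_potential - self.potential(state)
    return base_reward + F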
                

Based on Ng, Harada, and Russell (1999)
Provably leaves the optimal policy unchanged
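The reason potential-based shaping leaves the optimal policy alone is that the shaping terms telescope: over a whole episode, their discounted sum depends only on where the episode started, provided a terminal state is given potential 0. The sketch below checks this numerically on two made-up angle trajectories; GAMMA and the trajectories are assumptions for illustration.

```python
GAMMA = 0.99  # assumed discount factor for this sketch


def potential(angle):
    """Higher when the pole is more upright."""
    return -abs(angle)


def discounted_shaping_sum(angles, gamma=GAMMA, terminal_potential_zero=True):
    """Discounted sum of F = gamma * phi(s') - phi(s) along a trajectory of pole angles."""
    total = 0.0
    last = len(angles) - 2  # index of the final transition
    for t in range(len(angles) - 1):
        next_phi = potential(angles[t + 1])
        if terminal_potential_zero and t == last:
            next_phi = 0.0  # the last state is terminal
        total += gamma ** t * (gamma * next_phi - potential(angles[t]))
    return total


trajectory_a = [0.02, 0.05, 0.10, 0.22]    # hypothetical pole angles; last state is terminal
trajectory_b = [0.02, -0.01, 0.03, -0.25]  # same start, different path and ending

for name, traj in (("A", trajectory_a), ("B", trajectory_b)):
    with_zero = discounted_shaping_sum(traj)
    without_zero = discounted_shaping_sum(traj, terminal_potential_zero=False)
    print(f"trajectory {name}: terminal phi=0 -> {with_zero:.4f}, raw terminal phi -> {without_zero:.4f}")
```

With the terminal potential set to 0, both trajectories collect the same total shaping, so shaping cannot make one path look better than another. Without it, the total depends on the final angle.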

Critical Warning

Be careful with reward shaping! If your shaped rewards are mostly negative, Q-values become very negative and learning can fail. Always ensure rewards stay mostly positive or adjust the scale.
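A quick range check shows when a penalty starts pushing rewards below zero. This sketch evaluates the angle-based penalty at the largest pole angle from the State Variables table for each suggested multiplier; angle_shaped_reward is a hypothetical stand-in for shape_reward().

```python
MAX_ANGLE = 0.21  # rad, pole-angle bound from the State Variables table


def angle_shaped_reward(angle, multiplier, base_reward=1.0):
    """Base reward minus a penalty proportional to the absolute pole angle."""
    return base_reward - abs(angle) * multiplier


for multiplier in (1, 2, 5, 10):
    worst = angle_shaped_reward(MAX_ANGLE, multiplier)
    break_even = 1.0 / multiplier  # angle at which the shaped reward hits zero
    print(f"multiplier {multiplier:>2}: worst-case reward {worst:+.2f}, "
          f"reward turns negative beyond {break_even:.2f} rad")
```

Multipliers of 1 and 2 keep every in-range reward positive; at 5 and above the reward can go negative within the allowed angle range, so the scale deserves a second look.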

Resources

Ng et al. 1999 - Policy Invariance Under Reward Transformations
Reward Shaping Survey - Modern overview