The foundation of value-based reinforcement learning
What is a Q-Table?
Think of it like a cheat sheet: Imagine you're playing a video game and you write down "when I'm in room A and I go left, good things happen (score: +10)". A Q-table is exactly that - a giant lookup table that stores how good each action is in each situation.
A Q-table is a simple data structure that stores the expected future reward (called the Q-value) for every combination of:
📍 State (S)
Where you are / what you observe. In a grid world, this is your position. In CartPole, it's the pole angle and velocities.
🎮 Action (A)
What you can do. Move up/down/left/right, or push cart left/right. The choices available to the agent.
💰 Q-Value
How good is taking action A in state S? Higher = better. This number predicts total future reward.
# A Q-table is just a 2D array (or dictionary)
q_table = {}
# Q-value for state (2,3) taking action "right"
q_table[(2, 3)]["right"] = 7.5# To decide what to do: pick action with highest Q
best_action = max(q_table[state], key=q_table[state].get)
Interactive Demo: Watch Q-Learning in Action
Click on cells to see their Q-values. Run episodes to watch the agent learn!
Grid World Environment
0
Episodes
0
Total Reward
100%
Exploration (ε)
Q-Table Values
Click a cell to see its Q-values. Green outline = best action.
State
↑ Up
↓ Down
← Left
→ Right
Last Update:
Run an episode to see Q-learning updates here...
The Bellman Equation: How Q-Values Are Calculated
The magic of Q-learning comes from the Bellman equation - a recursive formula that defines what a Q-value should be:
The Bellman Optimality Equation
Q(s, a) = R + γ · maxa' Q(s', a')
Q(s, a)
Value of action a in state s
R
Immediate reward received
γ (gamma)
Discount factor (0.9-0.99)
max Q(s', a')
Best Q-value in next state
In plain English: "The value of doing action A here equals the reward I get NOW, plus the discounted value of the BEST thing I can do next." It's like asking: "If I go this way, what's my immediate payoff plus my best future prospects?"
Why the Discount Factor (γ)?
The discount factor (usually 0.95-0.99) makes future rewards worth slightly less than immediate rewards:
γ = 0.99: Future matters a lot (patient agent)
γ = 0.9: Balanced view of present and future
γ = 0: Only care about immediate reward (greedy)
The Q-Learning Update Rule
We don't know the true Q-values initially, so we learn them from experience. After each step, we update our estimate:
Q-Learning Update
Q(s, a) ← Q(s, a) + α · [R + γ · max Q(s', a') - Q(s, a)]
α (alpha)
Learning rate (0.01-0.1)
R + γ·max Q(s',a')
TD Target (what we observed)
Target - Q(s,a)
TD Error (surprise)
# After taking action and observing result:
old_value = q_table[state][action]
next_max = max(q_table[next_state].values())
# TD Target: what we think the value should be
td_target = reward + gamma * next_max
# TD Error: how surprised we are
td_error = td_target - old_value
# Update: move slightly toward target
q_table[state][action] = old_value + alpha * td_error
Learning Step by Step
1. Observe current state
Agent sees where it is: state = (2, 3)
2. Choose action (ε-greedy)
Usually pick best action, sometimes explore randomly
3. Take action, observe result
Move right → new state (2, 4), reward = -1
4. Update Q-value
Adjust Q(2,3,"right") based on what we learned
5. Repeat
Continue until episode ends, then start new episode
Exploration vs Exploitation (ε-Greedy)
The fundamental dilemma: should we do what we think is best, or try something new that might be better?
🎯 Exploitation
Choose the action with highest Q-value. Use what you've learned. Safe, but might miss better options.
🔍 Exploration
Choose a random action. Discover new possibilities. Risky, but necessary to find optimal policy.
ε-Greedy Strategy
With probability ε, explore (random action). Otherwise, exploit (best action).
ε = 0.3
🔍 Explore (30%)🎯 Exploit (70%)
Epsilon Decay
Start with high ε (explore a lot), gradually reduce it (exploit more as we learn):
epsilon = 1.0# Start fully random
epsilon_min = 0.01# Never go fully greedy
epsilon_decay = 0.995# Decay rate# After each episode:
epsilon = max(epsilon_min, epsilon * epsilon_decay)
Why Q-Tables Have Limits
Q-tables work great for small, discrete problems. But they struggle with:
Large State Spaces
A 10×10 grid = 100 states. An 84×84 Atari screen = 256^7056 states. Can't store a table that big!
Continuous States
CartPole has infinite possible angles. You'd need infinite rows in your table!
No Generalization
Learning that (2,3)→right is good doesn't help with (2,4). Each state learned separately.
The Solution? Replace the table with a neural network that can generalize! Instead of storing Q(s,a) for every state, train a network to predict Q-values for any state. That's exactly what DQN does - and why we're learning it in W20!