← Back to W20D1 Hub

Q-Table Explained

The foundation of value-based reinforcement learning

What is a Q-Table?

Think of it like a cheat sheet: Imagine you're playing a video game and you write down "when I'm in room A and I go left, good things happen (score: +10)". A Q-table is exactly that - a giant lookup table that stores how good each action is in each situation.

A Q-table is a simple data structure that stores the expected future reward (called the Q-value) for every combination of:

📍 State (S)

Where you are / what you observe. In a grid world, this is your position. In CartPole, it's the pole angle and velocities.

🎮 Action (A)

What you can do. Move up/down/left/right, or push cart left/right. The choices available to the agent.

💰 Q-Value

How good is taking action A in state S? Higher = better. This number predicts total future reward.

# A Q-table is just a 2D array (or dictionary) q_table = {} # Q-value for state (2,3) taking action "right" q_table[(2, 3)]["right"] = 7.5 # To decide what to do: pick action with highest Q best_action = max(q_table[state], key=q_table[state].get)

Interactive Demo: Watch Q-Learning in Action

Click on cells to see their Q-values. Run episodes to watch the agent learn!

Grid World Environment

0
Episodes
0
Total Reward
100%
Exploration (ε)

Q-Table Values

Click a cell to see its Q-values. Green outline = best action.

State ↑ Up ↓ Down ← Left → Right
Last Update:
Run an episode to see Q-learning updates here...

The Bellman Equation: How Q-Values Are Calculated

The magic of Q-learning comes from the Bellman equation - a recursive formula that defines what a Q-value should be:

The Bellman Optimality Equation

Q(s, a) = R + γ · maxa' Q(s', a')
Q(s, a)
Value of action a in state s
R
Immediate reward received
γ (gamma)
Discount factor (0.9-0.99)
max Q(s', a')
Best Q-value in next state
In plain English: "The value of doing action A here equals the reward I get NOW, plus the discounted value of the BEST thing I can do next." It's like asking: "If I go this way, what's my immediate payoff plus my best future prospects?"

Why the Discount Factor (γ)?

The discount factor (usually 0.95-0.99) makes future rewards worth slightly less than immediate rewards:

The Q-Learning Update Rule

We don't know the true Q-values initially, so we learn them from experience. After each step, we update our estimate:

Q-Learning Update

Q(s, a) ← Q(s, a) + α · [R + γ · max Q(s', a') - Q(s, a)]
α (alpha)
Learning rate (0.01-0.1)
R + γ·max Q(s',a')
TD Target (what we observed)
Target - Q(s,a)
TD Error (surprise)
# After taking action and observing result: old_value = q_table[state][action] next_max = max(q_table[next_state].values()) # TD Target: what we think the value should be td_target = reward + gamma * next_max # TD Error: how surprised we are td_error = td_target - old_value # Update: move slightly toward target q_table[state][action] = old_value + alpha * td_error

Learning Step by Step

1. Observe current state

Agent sees where it is: state = (2, 3)

2. Choose action (ε-greedy)

Usually pick best action, sometimes explore randomly

3. Take action, observe result

Move right → new state (2, 4), reward = -1

4. Update Q-value

Adjust Q(2,3,"right") based on what we learned

5. Repeat

Continue until episode ends, then start new episode

Exploration vs Exploitation (ε-Greedy)

The fundamental dilemma: should we do what we think is best, or try something new that might be better?

🎯 Exploitation

Choose the action with highest Q-value. Use what you've learned. Safe, but might miss better options.

🔍 Exploration

Choose a random action. Discover new possibilities. Risky, but necessary to find optimal policy.

ε-Greedy Strategy

With probability ε, explore (random action). Otherwise, exploit (best action).

ε = 0.3
🔍 Explore (30%) 🎯 Exploit (70%)

Epsilon Decay

Start with high ε (explore a lot), gradually reduce it (exploit more as we learn):

epsilon = 1.0 # Start fully random epsilon_min = 0.01 # Never go fully greedy epsilon_decay = 0.995 # Decay rate # After each episode: epsilon = max(epsilon_min, epsilon * epsilon_decay)

Why Q-Tables Have Limits

Q-tables work great for small, discrete problems. But they struggle with:

Large State Spaces

A 10×10 grid = 100 states. An 84×84 Atari screen = 256^7056 states. Can't store a table that big!

Continuous States

CartPole has infinite possible angles. You'd need infinite rows in your table!

No Generalization

Learning that (2,3)→right is good doesn't help with (2,4). Each state learned separately.

The Solution? Replace the table with a neural network that can generalize! Instead of storing Q(s,a) for every state, train a network to predict Q-values for any state. That's exactly what DQN does - and why we're learning it in W20!
Learn About DQN →