Q-Table Explained

What is a Q-Table?

Think of it like a cheat sheet: Imagine you're playing a video game and you write down "when I'm in room A and I go left, good things happen (score: +10)". A Q-table is exactly that - a giant lookup table that stores how good each action is in each situation.

A Q-table is a simple data structure that stores the expected future reward (called the Q-value) for every combination of:

📍 State (S)

Where you are / what you observe. In a grid world, this is your position. In CartPole, it's the pole angle and velocities.

🎮 Action (A)

What you can do. Move up/down/left/right, or push cart left/right. The choices available to the agent.

💰 Q-Value

How good is taking action A in state S? Higher = better. This number predicts total future reward.

# A Q-table is just a 2D array (or dictionary)
q_table = {}

# Q-value for state (2,3) taking action "right"
q_table[(2, 3)]["right"] = 7.5

# To decide what to do: pick action with highest Q
best_action = max(q_table[state], key=q_table[state].get)
            

Interactive Demo: Watch Q-Learning in Action

Click on cells to see their Q-values. Run episodes to watch the agent learn!

Grid World Environment

0

Episodes

0

Total Reward

100%

Exploration (ε)

Q-Table Values

Click a cell to see its Q-values. Green outline = best action.

State	↑ Up	↓ Down	← Left	→ Right

Last Update:

Run an episode to see Q-learning updates here...

The Bellman Equation: How Q-Values Are Calculated

The magic of Q-learning comes from the Bellman equation - a recursive formula that defines what a Q-value should be:

The Bellman Optimality Equation

Q(s, a) = R + γ · max_a' Q(s', a')

Q(s, a)

Value of action a in state s

R

Immediate reward received

γ (gamma)

Discount factor (0.9-0.99)

max Q(s', a')

Best Q-value in next state

In plain English: "The value of doing action A here equals the reward I get NOW, plus the discounted value of the BEST thing I can do next." It's like asking: "If I go this way, what's my immediate payoff plus my best future prospects?"

Why the Discount Factor (γ)?

The discount factor (usually 0.95-0.99) makes future rewards worth slightly less than immediate rewards:

γ = 0.99: Future matters a lot (patient agent)
γ = 0.9: Balanced view of present and future
γ = 0: Only care about immediate reward (greedy)

The Q-Learning Update Rule

We don't know the true Q-values initially, so we learn them from experience. After each step, we update our estimate:

Q-Learning Update

Q(s, a) ← Q(s, a) + α · [R + γ · max Q(s', a') - Q(s, a)]

α (alpha)

Learning rate (0.01-0.1)

R + γ·max Q(s',a')

TD Target (what we observed)

Target - Q(s,a)

TD Error (surprise)

# After taking action and observing result:
old_value = q_table[state][action]
next_max = max(q_table[next_state].values())

# TD Target: what we think the value should be
td_target = reward + gamma * next_max

# TD Error: how surprised we are
td_error = td_target - old_value

# Update: move slightly toward target
q_table[state][action] = old_value + alpha * td_error
            

Learning Step by Step

1. Observe current state

Agent sees where it is: state = (2, 3)

2. Choose action (ε-greedy)

Usually pick best action, sometimes explore randomly

3. Take action, observe result

Move right → new state (2, 4), reward = -1

4. Update Q-value

Adjust Q(2,3,"right") based on what we learned

5. Repeat

Continue until episode ends, then start new episode

Exploration vs Exploitation (ε-Greedy)

The fundamental dilemma: should we do what we think is best, or try something new that might be better?

🎯 Exploitation

Choose the action with highest Q-value. Use what you've learned. Safe, but might miss better options.

🔍 Exploration

Choose a random action. Discover new possibilities. Risky, but necessary to find optimal policy.

ε-Greedy Strategy

With probability ε, explore (random action). Otherwise, exploit (best action).

ε = 0.3

🔍 Explore (30%) 🎯 Exploit (70%)

Epsilon Decay

Start with high ε (explore a lot), gradually reduce it (exploit more as we learn):

epsilon = 1.0           # Start fully random
epsilon_min = 0.01       # Never go fully greedy
epsilon_decay = 0.995   # Decay rate

# After each episode:
epsilon = max(epsilon_min, epsilon * epsilon_decay)
            

Why Q-Tables Have Limits

Q-tables work great for small, discrete problems. But they struggle with:

Large State Spaces

A 10×10 grid = 100 states. An 84×84 Atari screen = 256^7056 states. Can't store a table that big!

Continuous States

CartPole has infinite possible angles. You'd need infinite rows in your table!

No Generalization

Learning that (2,3)→right is good doesn't help with (2,4). Each state learned separately.

The Solution? Replace the table with a neural network that can generalize! Instead of storing Q(s,a) for every state, train a network to predict Q-values for any state. That's exactly what DQN does - and why we're learning it in W20!

Learn About DQN →