
State, Action, Reward

The three pillars of Reinforcement Learning

The Three Core Concepts

Every reinforcement learning problem can be understood through three fundamental concepts:

📍

State (S)

"Where am I?"

The current situation or observation. Everything the agent knows about the world at this moment. The information used to make decisions.

🎮

Action (A)

"What can I do?"

The choices available to the agent. Each action moves the agent from one state to another. The agent's way of influencing the environment.

Reward (R)

"How did I do?"

Immediate feedback after taking an action. Positive rewards encourage behavior, negative rewards discourage it. The agent's learning signal.

The Agent-Environment Loop

Reinforcement learning is a continuous cycle of interaction between an agent (the learner) and an environment (the world):

Agent (the learner/decision-maker) → Action (A) → Environment (the world/game/task)

Environment (the world/game/task) → State (S), Reward (R) → Agent (the learner/decision-maker)

# The RL loop in code:
state = env.reset()                                   # Start: get initial state
done = False                                          # The episode has not ended yet
while not done:
    action = agent.choose_action(state)               # Agent decides
    next_state, reward, done = env.step(action)       # Environment responds
    agent.learn(state, action, reward, next_state)    # Agent updates
    state = next_state                                # Move to next state
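
To make the loop above runnable end to end, here is a minimal sketch. `CorridorEnv` and `RandomAgent` are hypothetical stand-ins invented for illustration (they are not part of any library); they only mirror the `reset`, `step`, `choose_action`, and `learn` calls used in the loop.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the state is just the cell index

    def step(self, action):
        # action 0 = step left, action 1 = step right (walls clamp the position)
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done


class RandomAgent:
    """Hypothetical agent that picks actions uniformly at random and never learns."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def choose_action(self, state):
        return self.rng.choice([0, 1])

    def learn(self, state, action, reward, next_state):
        pass  # placeholder: a real agent would update its estimates here


def run_episode(env, agent, max_steps=10_000):
    """Run the agent-environment loop once and return (total reward, steps taken)."""
    state = env.reset()
    done = False
    total_reward = 0.0
    steps = 0
    while not done and steps < max_steps:
        action = agent.choose_action(state)
        next_state, reward, done = env.step(action)
        agent.learn(state, action, reward, next_state)
        state = next_state
        total_reward += reward
        steps += 1
    return total_reward, steps


total_reward, steps = run_episode(CorridorEnv(), RandomAgent())
print(f"episode finished after {steps} steps with total reward {total_reward}")
```

The `max_steps` cap is a design choice of this sketch: a random agent can wander for a long time, so the loop is bounded rather than relying on `done` alone.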

Interactive Demo: CartPole

CartPole makes state, action, and reward concrete: the task is to keep a pole balanced by pushing the cart left or right.


📍 State (4 numbers)

Cart Position (x): How far left/right the cart is

Cart Velocity (v): How fast the cart is moving

Pole Angle (θ): How tilted the pole is (0° = upright)

Angular Velocity (ω): How fast the pole is rotating

🎮 Actions (2 choices)

0 = Push Left: Apply force to move cart left

1 = Push Right: Apply force to move cart right

⭐ Reward

+1: For every timestep the pole stays up

Episode ends: When pole falls (>15°) or cart goes off screen

Goal: Maximize total reward (keep pole balanced longest)
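
The state, action, and reward described above can be sketched as a single `step` function. This is an illustration, not the demo's own implementation: the physical constants and equations of motion are assumptions borrowed from the classic CartPole benchmark, and `POSITION_LIMIT` is an assumed stand-in for "off screen". Only the 15° angle limit, the two actions, and the +1 reward come from the description above.

```python
import math

# Assumed physical constants (values commonly used for the CartPole benchmark).
GRAVITY = 9.8
CART_MASS = 1.0
POLE_MASS = 0.1
POLE_HALF_LENGTH = 0.5
FORCE = 10.0
DT = 0.02  # seconds per timestep

ANGLE_LIMIT = math.radians(15)  # pole counts as fallen beyond 15 degrees
POSITION_LIMIT = 2.4            # assumed distance at which the cart is "off screen"


def step(state, action):
    """Advance one timestep; return (next_state, reward, done)."""
    x, v, theta, omega = state
    force = FORCE if action == 1 else -FORCE  # 1 = push right, 0 = push left
    total_mass = CART_MASS + POLE_MASS
    pole_mass_length = POLE_MASS * POLE_HALF_LENGTH
    sin_t, cos_t = math.sin(theta), math.cos(theta)

    temp = (force + pole_mass_length * omega**2 * sin_t) / total_mass
    alpha = (GRAVITY * sin_t - cos_t * temp) / (
        POLE_HALF_LENGTH * (4.0 / 3.0 - POLE_MASS * cos_t**2 / total_mass)
    )
    accel = temp - pole_mass_length * alpha * cos_t / total_mass

    # Euler integration of the four state variables.
    next_state = (x + DT * v, v + DT * accel, theta + DT * omega, omega + DT * alpha)
    done = abs(next_state[2]) > ANGLE_LIMIT or abs(next_state[0]) > POSITION_LIMIT
    reward = 0.0 if done else 1.0  # +1 for every timestep the pole stays up
    return next_state, reward, done


# A deliberately bad policy: always push right, so the pole tips over to the left.
state = (0.0, 0.0, 0.0, 0.0)
total_reward, done, steps = 0.0, False, 0
while not done and steps < 500:
    state, reward, done = step(state, 1)
    total_reward += reward
    steps += 1
print(f"pole fell after {steps} steps, total reward {total_reward}")
```

Because the reward is +1 per surviving timestep, the total reward is simply a count of how long the pole stayed up, which is why maximizing it means balancing for as long as possible.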

Understanding State

Think of state like a photograph: It captures everything about the current moment that matters for making a decision. A chess state is the board position. A driving state includes your speed, position, and what's around you.

Types of States

📊 Discrete States

Finite, countable number of states

Example: Grid position (1,1), (1,2), (2,1)...
Example: Chess board configurations
Example: Tic-tac-toe positions

📈 Continuous States

Infinite possible values (real numbers)

Example: CartPole: angle = 0.0523 radians
Example: Robot arm: joint angles
Example: Self-driving car: speed, position

🖼️ High-Dimensional States

Many features or raw pixels

Example: Atari: 84×84 pixel images
Example: Go board: 19×19 grid
Example: Robot camera feed
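
As a rough illustration of how the three kinds of state might look in code, the sketch below uses plain Python containers. The 4×4 grid size is a hypothetical choice, and a real Atari frame would normally be an array type rather than nested lists.

```python
# Discrete state: a grid cell identified by (row, column).
grid_state = (1, 2)
rows, cols = 4, 4                    # hypothetical 4x4 grid
num_grid_states = rows * cols        # finite and countable

# Continuous state: CartPole's four real-valued numbers.
# Order: position, velocity, angle (radians), angular velocity.
cartpole_state = [0.0, 0.0, 0.0523, 0.0]

# High-dimensional state: an 84x84 grayscale frame stored as nested lists.
frame = [[0 for _ in range(84)] for _ in range(84)]
num_pixels = len(frame) * len(frame[0])

print(num_grid_states, len(cartpole_state), num_pixels)
```

The grid has 16 possible states that can be enumerated, the CartPole state has only 4 numbers but each can take infinitely many values, and the frame has 7056 numbers per observation.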

The Markov Property

A state is Markov if it contains all the information needed to predict the future. Given the present state, the history of how the agent got there adds nothing more.

S₀ → S₁ → S₂ → S₃ (Now) → S₄ (Future)

Future S₄ depends only on present S₃, not on how we got here (S₀→S₁→S₂)
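
Written as a formula, the Markov property says that conditioning on the full history gives the same prediction as conditioning on the present alone. The notation below (time index t, next state s', action Aₜ) is an assumption chosen to match the S and A symbols used on this page:

```latex
\Pr(S_{t+1} = s' \mid S_t, A_t)
  = \Pr(S_{t+1} = s' \mid S_0, A_0, S_1, A_1, \dots, S_t, A_t)
```

With t = 3, this is exactly the statement that S₄ depends only on S₃ (and the action taken there), not on S₀, S₁, or S₂.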

Understanding Action

Actions are your controls: Like buttons on a game controller. In some games you have 2 buttons, in others you have 18. Some games need continuous input (steering wheel) instead of discrete buttons.

Types of Action Spaces

🔘 Discrete Actions

Choose from a fixed set of options

CartPole: Left (0) or Right (1)
Atari: 18 possible button combinations
Grid: Up, Down, Left, Right

🎚️ Continuous Actions

Any value in a range

Steering: -1.0 (full left) to +1.0 (full right)
Throttle: 0.0 (no throttle) to 1.0 (full throttle)
Robot: Joint torques in Nm

# Discrete action space
action = 0             # or 1, 2, 3...
env.step(action)

# Continuous action space
action = [0.5, 0.3]    # steering in [-1.0, 1.0], throttle in [0.0, 1.0]
env.step(action)
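
A minimal sketch of how the two kinds of action space might be sampled, using only the standard library. The helper names `sample_continuous` and `clip_action` are hypothetical; the ranges are the steering and throttle ranges listed above.

```python
import random

rng = random.Random(0)

# Discrete action space: pick one option from a fixed set.
DISCRETE_ACTIONS = [0, 1]  # CartPole: 0 = push left, 1 = push right
discrete_action = rng.choice(DISCRETE_ACTIONS)

# Continuous action space: each component has a (low, high) range.
CONTINUOUS_BOUNDS = [(-1.0, 1.0), (0.0, 1.0)]  # steering, throttle


def sample_continuous(bounds):
    """Draw one value uniformly from each component's range."""
    return [rng.uniform(low, high) for low, high in bounds]


def clip_action(action, bounds):
    """Force every component back into its allowed range."""
    return [min(max(value, low), high) for value, (low, high) in zip(action, bounds)]


continuous_action = sample_continuous(CONTINUOUS_BOUNDS)
clipped = clip_action([1.7, -0.3], CONTINUOUS_BOUNDS)  # out-of-range request
print(discrete_action, continuous_action, clipped)
```

Clipping matters for continuous actions because a learning agent can output values outside the valid range; a discrete agent can only ever pick one of the listed options.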

Understanding Reward

Reward is your score: Just like points in a video game, but given at every step. The agent's entire goal is to maximize the total reward over time. This simple signal drives all learning!

Types of Reward Structures

📍 Dense Rewards

Feedback at every step

CartPole: +1 for each timestep alive
Racing: +speed, -off track penalty

Easier to learn from!

🏁 Sparse Rewards

Feedback only at the end

Chess: +1 win, -1 lose, 0 draw
Maze: +1 at goal, 0 elsewhere

Harder - need exploration!

🎯 Shaped Rewards

Designed to guide learning

Distance: -distance to goal
Progress: +0.1 per checkpoint

Helps, but a poorly designed shaping term can teach the agent to exploit the reward instead of solving the task!
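
The three reward structures can be compared side by side with a small sketch. The corridor with goal cell 4 and the function names are hypothetical; the reward values follow the examples above (+1 per timestep alive, +1 at the goal and 0 elsewhere, and negative distance to the goal).

```python
GOAL = 4  # hypothetical goal cell in a 1-D corridor with cells 0..4


def dense_reward(pole_is_up):
    """CartPole-style dense reward: +1 on every timestep the pole stays up."""
    return 1.0 if pole_is_up else 0.0


def sparse_reward(position):
    """Maze-style sparse reward: +1 at the goal, 0 elsewhere."""
    return 1.0 if position == GOAL else 0.0


def shaped_reward(position):
    """Shaped reward: negative distance to the goal, so closer is better."""
    return -float(abs(GOAL - position))


path = [0, 1, 2, 3, 4]  # positions visited while walking straight to the goal
print([sparse_reward(p) for p in path])  # [0.0, 0.0, 0.0, 0.0, 1.0]
print([shaped_reward(p) for p in path])  # [-4.0, -3.0, -2.0, -1.0, 0.0]
print([dense_reward(True) for _ in path])  # [1.0, 1.0, 1.0, 1.0, 1.0]
```

With the sparse reward the agent sees no difference between cells until it stumbles on the goal, while the shaped reward changes at every step and points toward the goal.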

The Reward Hypothesis

The Central Claim of RL

"All goals can be described as maximizing cumulative reward"

— Sutton & Barto, Reinforcement Learning: An Introduction

This means: if you can express a goal as a reward function, RL can in principle be applied to the problem. The challenge is often designing the right reward!

Episodes and Trajectories

An episode is one complete run from start to end (game over). A trajectory is the sequence of states, actions, and rewards in that episode:

S₀ (Start) → A₀ → S₁, R₁ = +1 → A₁ → S₂, R₂ = +1 → A₂ → S₃, R₃ = +1 → A₃ → S₄, R₄ = +100 (Goal!)

Return (G): Total Discounted Reward

G = R₁ + γR₂ + γ²R₃ + γ³R₄ + ...

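Here γ is the discount factor, a number between 0 and 1 that controls how much future rewards count relative to immediate ones.

The agent's goal: maximize expected return G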
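
As a worked example, the return formula can be applied to the trajectory above, whose rewards are +1, +1, +1, +100. The function name `discounted_return` and the choice of γ = 0.9 are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute G = R1 + gamma*R2 + gamma^2*R3 + ... for one episode."""
    g = 0.0
    # Work backwards so each step is just: reward now + gamma * (return after).
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


rewards = [1, 1, 1, 100]  # R1..R4 from the example trajectory
undiscounted = discounted_return(rewards, 1.0)
discounted = discounted_return(rewards, 0.9)
print(f"gamma = 1.0: G = {undiscounted:.2f}")
print(f"gamma = 0.9: G = {discounted:.2f}")
```

With γ = 1 the return is the plain sum, 103. With γ = 0.9 the final +100 is scaled by 0.9³ = 0.729, so the return drops to about 75.61: rewards that arrive later count for less.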

Real-World Examples

Problem | State | Actions | Reward
CartPole | Position, velocity, angle, angular velocity | Push left, Push right | +1 per timestep balanced
Atari Games | 84×84 pixel screen image | 18 joystick/button combinations | Game score change
Chess | Board configuration | All legal moves | +1 win, -1 lose, 0 draw
Robot Walking | Joint positions, velocities, IMU | Motor torques (continuous) | +forward speed, -falling
Stock Trading | Prices, positions, indicators | Buy, sell, hold amounts | Profit/loss
Dialogue System | Conversation history | Response options | User satisfaction

Key Takeaways

State = Information

Everything the agent observes about its situation. Should be Markov (contain all relevant info).

Action = Control

How the agent influences the environment. Can be discrete choices or continuous values.

Reward = Signal

Feedback that drives learning. Agent learns to maximize cumulative reward over time.
