RL Fundamentals: State, Action, Reward

The Three Core Concepts

Every reinforcement learning problem can be understood through three fundamental concepts:

📍

State (S)

"Where am I?"

The current situation or observation. Everything the agent knows about the world at this moment. The information used to make decisions.

🎮

Action (A)

"What can I do?"

The choices available to the agent. Each action moves the agent from one state to another. The agent's way of influencing the environment.

⭐

Reward (R)

"How did I do?"

Immediate feedback after taking an action. Positive rewards encourage behavior, negative rewards discourage it. The agent's learning signal.

The Agent-Environment Loop

Reinforcement learning is a continuous cycle of interaction between an agent (the learner) and an environment (the world):

Agent (the learner/decision-maker) → Action (A) → Environment (the world/game/task) → State (S), Reward (R) → back to the Agent

# The RL loop in code (env and agent stand for your own objects):
state = env.reset()              # Start: get initial state
done = False                     # The episode has not ended yet

while not done:
    action = agent.choose_action(state)  # Agent decides
    next_state, reward, done = env.step(action)  # Environment responds
    agent.learn(state, action, reward, next_state)  # Agent updates
    state = next_state                # Move to next state
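
To run the loop end to end, the sketch below fills in env and agent with hypothetical stand-ins: a five-cell corridor (CorridorEnv) and an agent that picks actions at random (RandomAgent). Neither class comes from a library; they are assumptions that only mirror the reset/step/choose_action/learn interface used in the loop above.

```python
import random


class CorridorEnv:
    """Hypothetical 5-cell corridor: start in cell 0, goal in the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action 0 = move left, action 1 = move right; walls clamp the move
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done


class RandomAgent:
    """Hypothetical agent that ignores the state and never learns."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def choose_action(self, state):
        return self.rng.choice([0, 1])

    def learn(self, state, action, reward, next_state):
        pass  # a real agent would update its policy or value estimates here


def run_episode(env, agent, max_steps=10_000):
    """Run the agent-environment loop once and report what happened."""
    state = env.reset()
    done = False
    total_reward, steps = 0.0, 0
    while not done and steps < max_steps:
        action = agent.choose_action(state)
        next_state, reward, done = env.step(action)
        agent.learn(state, action, reward, next_state)
        state = next_state
        total_reward += reward
        steps += 1
    return total_reward, steps


total_reward, steps = run_episode(CorridorEnv(), RandomAgent())
print(f"episode finished after {steps} steps, total reward {total_reward}")
```

The max_steps cap is an added safety net so that an agent which never reaches the goal still terminates.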
                

Interactive Demo: CartPole

Experience state, action, and reward firsthand! Try to balance the pole by pushing the cart left or right.


📍 State (4 numbers)

Cart Position: How far left/right the cart is

Cart Velocity: How fast the cart is moving

Pole Angle: How tilted the pole is (0° = upright)

Angular Velocity: How fast the pole is rotating (tipping over or swinging back)

🎮 Actions (2 choices)

0 = Push Left: Apply force to move cart left

1 = Push Right: Apply force to move cart right

⭐ Reward

+1: For every timestep the pole stays up

Episode ends: When pole falls (>15°) or cart goes off screen

Goal: Maximize total reward (keep pole balanced longest)
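
As a rough illustration of how these pieces fit together, here is a minimal CartPole-style step function in plain Python. The physical constants follow the classic cart-pole setup and are assumptions, as is the 2.4 m "off screen" limit; only the 15° angle limit, the two actions, and the +1 reward come from the description above. It is a sketch, not the demo's actual implementation.

```python
import math

# Assumed physical constants (classic cart-pole values)
GRAVITY = 9.8            # m/s^2
CART_MASS = 1.0          # kg
POLE_MASS = 0.1          # kg
POLE_HALF_LENGTH = 0.5   # m
FORCE = 10.0             # N applied by each push
DT = 0.02                # seconds per timestep
ANGLE_LIMIT = math.radians(15)   # from the text: the pole has fallen past 15 degrees
X_LIMIT = 2.4                    # assumed "off screen" distance in metres


def step(state, action):
    """Advance one timestep and return (next_state, reward, done)."""
    x, v, theta, omega = state
    force = FORCE if action == 1 else -FORCE   # 1 = push right, 0 = push left
    total_mass = CART_MASS + POLE_MASS
    pole_ml = POLE_MASS * POLE_HALF_LENGTH

    sin_t, cos_t = math.sin(theta), math.cos(theta)
    temp = (force + pole_ml * omega ** 2 * sin_t) / total_mass
    alpha = (GRAVITY * sin_t - cos_t * temp) / (
        POLE_HALF_LENGTH * (4.0 / 3.0 - POLE_MASS * cos_t ** 2 / total_mass)
    )
    accel = temp - pole_ml * alpha * cos_t / total_mass

    # Simple Euler integration of the four state variables
    next_state = (x + DT * v, v + DT * accel, theta + DT * omega, omega + DT * alpha)
    done = abs(next_state[2]) > ANGLE_LIMIT or abs(next_state[0]) > X_LIMIT
    reward = 0.0 if done else 1.0   # +1 for each timestep the pole stays up
    return next_state, reward, done


def rollout(policy, max_steps=500):
    """Run one episode from rest and return the total reward."""
    state = (0.0, 0.0, 0.0, 0.0)
    total_reward = 0.0
    for _ in range(max_steps):
        state, reward, done = step(state, policy(state))
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(state):
    return 1


def follow_lean(state):
    # Push toward the side the pole is leaning
    return 1 if state[2] > 0 else 0


print("always right:", rollout(always_right))
print("follow lean: ", rollout(follow_lean))
```

Swapping in different policy functions is a quick way to see how the choice of action at each state changes the total reward.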

Understanding State

Think of state like a photograph: It captures everything about the current moment that matters for making a decision. A chess state is the board position. A driving state includes your speed, position, and what's around you.

Types of States

📊 Discrete States

Finite, countable number of states

Example: Grid position (1,1), (1,2), (2,1)...

Example: Chess board configurations

Example: Tic-tac-toe positions

📈 Continuous States

Infinite possible values (real numbers)

Example: CartPole: angle = 0.0523 radians

Example: Robot arm: joint angles

Example: Self-driving car: speed, position

🖼️ High-Dimensional States

Many features or raw pixels

Example: Atari: 84×84 pixel images

Example: Go board: 19×19 grid

Example: Robot camera feed
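
One way to picture the three kinds of state is by how they would be stored in code. The values below are made up for illustration, and count_features is a hypothetical helper, not part of any library.

```python
# Discrete state: a grid cell, one of finitely many positions
grid_state = (1, 2)

# Continuous state: CartPole's four real-valued readings
# (position, velocity, angle in radians, angular velocity)
cartpole_state = [0.0, 0.0, 0.0523, 0.0]

# High-dimensional state: an 84x84 grayscale frame, one intensity per pixel
frame_state = [[0 for _ in range(84)] for _ in range(84)]


def count_features(state):
    """Count the scalar values in a flat or nested list/tuple state."""
    if isinstance(state, (list, tuple)):
        return sum(count_features(item) for item in state)
    return 1


for name, state in [("grid", grid_state),
                    ("cartpole", cartpole_state),
                    ("frame", frame_state)]:
    print(name, count_features(state))
```

The jump from 4 numbers to several thousand is why image-based tasks usually need function approximation rather than a lookup table.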

The Markov Property

A state is Markov if it contains all the information needed to predict what happens next. Given the present state, the history of how the agent got there adds nothing.

S₀ → S₁ → S₂ → S₃ (Now) → S₄ (Future)

Future S₄ depends only on present S₃, not on how we got here (S₀→S₁→S₂)
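
A small sketch can make this concrete. It assumes a hypothetical object moving along a line at constant velocity: position alone is not a Markov state, because two different histories can end at the same position yet lead to different futures, while the pair (position, velocity) is.

```python
def next_position(position, velocity, dt=1.0):
    """Hypothetical 1-D dynamics: the object keeps its velocity."""
    return position + velocity * dt


# Two different histories that arrive at the same position
history_a = [0.0, 1.0, 2.0]   # moving right, +1 per step
history_b = [4.0, 3.0, 2.0]   # moving left, -1 per step

pos_a, pos_b = history_a[-1], history_b[-1]
vel_a = history_a[-1] - history_a[-2]
vel_b = history_b[-1] - history_b[-2]

# Position-only state: identical present, different futures (not Markov)
print(pos_a == pos_b)
print(next_position(pos_a, vel_a), next_position(pos_b, vel_b))


def predict(state):
    """With (position, velocity) as the state, the future needs no history."""
    position, velocity = state
    return (next_position(position, velocity), velocity)


print(predict((pos_a, vel_a)), predict((pos_b, vel_b)))
```

This is the reason the CartPole state includes velocities as well as positions: a single snapshot of position and angle would not say which way things are moving.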

Understanding Action

Actions are your controls: Like buttons on a game controller. In some games you have 2 buttons, in others you have 18. Some games need continuous input (steering wheel) instead of discrete buttons.

Types of Action Spaces

🔘 Discrete Actions

Choose from a fixed set of options

CartPole: Left (0) or Right (1)

Atari: Up to 18 joystick/button combinations

Grid: Up, Down, Left, Right

🎚️ Continuous Actions

Any value in a range

Steering: -1.0 (full left) to +1.0 (full right)

Throttle: 0.0 (no throttle) to 1.0 (full throttle)

Robot: Joint torques in Nm

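# Discrete action space: the action is an index into a fixed set
action = 0  # or 1, 2, 3...
env.step(action)

# Continuous action space: the action is a vector of real values
action = [0.5, 0.3]  # steering in [-1, 1], throttle in [0, 1]
env.step(action)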
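
The difference also shows up in how actions are sampled and validated. The helpers below are hypothetical (they are not from any RL library) and assume the steering and throttle ranges listed above.

```python
import random

rng = random.Random(0)


def sample_discrete(n_actions):
    """Pick one of n_actions options, labelled 0 .. n_actions - 1."""
    return rng.randrange(n_actions)


def sample_continuous(bounds):
    """Pick one real value per dimension; bounds is a list of (low, high) pairs."""
    return [rng.uniform(low, high) for low, high in bounds]


def clip_action(action, bounds):
    """Force an out-of-range continuous action back inside its bounds."""
    return [min(max(a, low), high) for a, (low, high) in zip(action, bounds)]


cartpole_action = sample_discrete(2)            # 0 = left, 1 = right
driving_bounds = [(-1.0, 1.0), (0.0, 1.0)]      # steering, throttle
driving_action = sample_continuous(driving_bounds)

print(cartpole_action, driving_action)
print(clip_action([0.5, -0.3], driving_bounds))  # a throttle below 0.0 is clipped to 0.0
```

A discrete agent can enumerate every option, while a continuous agent has to output numbers and keep them inside the valid range.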
            

Understanding Reward

Reward is your score: Just like points in a video game, but given at every step. The agent's entire goal is to maximize the total reward over time. This simple signal drives all learning!

Types of Reward Structures

📍 Dense Rewards

Feedback at every step

CartPole: +1 for each timestep alive

Racing: +speed, -off track penalty

Easier to learn from!

🏁 Sparse Rewards

Feedback only at the end

Chess: +1 win, -1 lose, 0 draw

Maze: +1 at goal, 0 elsewhere

Harder - need exploration!

🎯 Shaped Rewards

Designed to guide learning

Distance: -distance to goal

Progress: +0.1 per checkpoint

Helps, but the agent may exploit the shaping signal instead of solving the task!
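
To compare the three structures side by side, the sketch below scores the same short path with each kind of reward. It assumes a hypothetical one-dimensional corridor with the goal at cell 10; the function names and the -1 step cost used for the dense case are illustrative choices.

```python
GOAL = 10  # hypothetical goal cell in a 1-D corridor


def dense_reward(position):
    """Feedback at every step: a constant -1 step cost until the goal."""
    return 0.0 if position == GOAL else -1.0


def sparse_reward(position):
    """Feedback only at the end: +1 at the goal, 0 elsewhere."""
    return 1.0 if position == GOAL else 0.0


def shaped_reward(position):
    """Designed to guide learning: negative distance to the goal."""
    return -float(abs(GOAL - position))


path = [0, 1, 2, 3]  # positions visited so far, none of them the goal
for reward_fn in (dense_reward, sparse_reward, shaped_reward):
    print(reward_fn.__name__, [reward_fn(p) for p in path])
```

On this path the sparse reward is all zeros, so the agent gets no hint that it is heading the right way, whereas the shaped reward improves with every step toward the goal.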

The Reward Hypothesis

The Central Claim of RL

"All goals can be described as maximizing cumulative reward"

Paraphrased from Sutton & Barto, Reinforcement Learning: An Introduction

This means: if you can express a goal as a reward function, you can frame the problem as an RL problem. The challenge is often designing the right reward!

Episodes and Trajectories

An episode is one complete run from start to end (game over). A trajectory is the sequence of states, actions, and rewards in that episode:

S₀ (Start) → A₀ → S₁, R₁ = +1 → A₁ → S₂, R₂ = +1 → A₂ → S₃, R₃ = +1 → A₃ → S₄, R₄ = +100 (Goal!)

Return (G): Total Discounted Reward

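G = R₁ + γR₂ + γ²R₃ + γ³R₄ + ...

Here γ (gamma) is the discount factor, a number between 0 and 1. The smaller γ is, the less future rewards count compared with immediate ones.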

The agent's goal: maximize expected return G
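
As a worked example, the function below computes G for the four rewards in the trajectory above. The function name and the choice of discount values are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """G = R1 + gamma*R2 + gamma^2*R3 + ... for one finished episode."""
    g = 0.0
    # Work backwards so each step is just: reward + gamma * (return after it)
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


rewards = [1, 1, 1, 100]  # R1..R4 from the trajectory above

print(discounted_return(rewards, gamma=1.0))            # 103.0 (no discounting)
print(round(discounted_return(rewards, gamma=0.9), 2))  # 75.61
print(discounted_return(rewards, gamma=0.0))            # 1.0 (only R1 counts)
```

With γ = 0.9 the final +100 is worth 0.9³ × 100 = 72.9 from the start of the episode, so G = 1 + 0.9 + 0.81 + 72.9 = 75.61.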

Real-World Examples

Problem	State	Actions	Reward
CartPole	Position, velocity, angle, angular velocity	Push left, Push right	+1 per timestep balanced
Atari Games	84×84 pixel screen image	18 joystick/button combinations	Game score change
Chess	Board configuration	All legal moves	+1 win, -1 lose, 0 draw
Robot Walking	Joint positions, velocities, IMU	Motor torques (continuous)	+forward speed, -falling
Stock Trading	Prices, positions, indicators	Buy, sell, hold amounts	Profit/loss
Dialogue System	Conversation history	Response options	User satisfaction

Key Takeaways

State = Information

Everything the agent observes about its situation. Should be Markov (contain all relevant info).

Action = Control

How the agent influences the environment. Can be discrete choices or continuous values.

Reward = Signal

Feedback that drives learning. Agent learns to maximize cumulative reward over time.
