The three pillars of Reinforcement Learning
Every reinforcement learning problem can be understood through three fundamental concepts:
State: The current situation or observation. It is everything the agent knows about the world at this moment, and the information it uses to make decisions.
Action: The choices available to the agent. Taking an action can move the agent from one state to another; it is the agent's way of influencing the environment.
Reward: Immediate feedback after taking an action. Positive rewards encourage a behavior, negative rewards discourage it. This is the agent's learning signal.
Reinforcement learning is a continuous cycle of interaction between an agent (the learner) and an environment (the world):
Agent: The learner and decision-maker
Environment: The world, game, or task the agent interacts with
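The interaction cycle can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: `LineWorld`, `random_agent`, and `run_episode` are hypothetical names invented here, and the environment is a toy walk along five positions with the goal at the right end.

```python
import random


class LineWorld:
    """Hypothetical toy environment: positions 0..size-1, goal at the right end."""

    def __init__(self, size=5):
        self.size = size
        self.position = 0

    def reset(self):
        # Start every episode at the left end and return the initial state.
        self.position = 0
        return self.position

    def step(self, action):
        # Action 1 moves right, any other action moves left (clamped to the track).
        move = 1 if action == 1 else -1
        self.position = min(self.size - 1, max(0, self.position + move))
        done = self.position == self.size - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def random_agent(state, rng):
    # A placeholder agent that ignores the state and picks an action at random.
    return rng.choice([0, 1])


def run_episode(env, agent, rng, max_steps=1000):
    # The cycle: observe state -> choose action -> receive reward and next state.
    state = env.reset()
    total_reward = 0.0
    steps = 0
    for _ in range(max_steps):
        action = agent(state, rng)
        state, reward, done = env.step(action)
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


total, steps = run_episode(LineWorld(), random_agent, random.Random(0))
print(total, steps)
```

A learning agent would replace `random_agent` with something that uses the rewards it has seen to improve its choices; the loop itself stays the same.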
Example: CartPole. The agent tries to keep a pole balanced on a cart by pushing the cart left or right. Its state (four numbers), actions (two choices), and reward are listed below.
Cart Position: How far left/right the cart is
Cart Velocity: How fast the cart is moving
Pole Angle: How tilted the pole is (0° = upright)
Angular Velocity: How fast the pole's angle is changing
0 = Push Left: Apply force to move cart left
1 = Push Right: Apply force to move cart right
+1: For every timestep the pole stays up
Episode ends: When the pole tilts more than 15° from upright or the cart goes off screen
Goal: Maximize total reward (keep the pole balanced for as long as possible)
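To make the CartPole description concrete, here is a small sketch of a hand-written policy and the end-of-episode check. Several details are assumptions not stated in the text: the pole angle is taken to be in radians with positive values meaning "leaning right", and `POSITION_LIMIT` is a made-up stand-in for "off screen". The function names are hypothetical.

```python
import math

ANGLE_LIMIT_DEG = 15.0  # termination angle quoted in the text
POSITION_LIMIT = 2.4    # assumption: the text only says "off screen"


def choose_action(state):
    """Hand-written rule: push the cart toward the side the pole is leaning."""
    cart_position, cart_velocity, pole_angle, angular_velocity = state
    return 1 if pole_angle > 0 else 0  # 1 = push right, 0 = push left


def episode_over(state):
    """True when the pole has tilted too far or the cart has left the track."""
    cart_position, _, pole_angle, _ = state
    pole_fell = abs(math.degrees(pole_angle)) > ANGLE_LIMIT_DEG
    off_screen = abs(cart_position) > POSITION_LIMIT
    return pole_fell or off_screen


state = (0.0, 0.0, 0.05, 0.0)  # pole leaning slightly right
print(choose_action(state), episode_over(state))
```

This rule uses only one of the four state components. A learned policy would typically use all four, for example pushing earlier when the angular velocity shows the pole is tipping quickly.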
Discrete states: A finite, countable number of states
Continuous states: Infinitely many possible values (real numbers)
High-dimensional states: Many features or raw pixels
A state is Markov if it contains all the information needed to predict the future. Given the present state, the earlier history adds nothing.
The future state S₄ depends only on the present state S₃, not on how the agent got there (S₀→S₁→S₂).
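Written as a probability statement, the Markov property says that conditioning on the full history gives the same prediction as conditioning on the present state alone. The notation below follows the S₀, S₁, … indexing used above and omits actions for simplicity, which is a simplification made for this sketch.

```latex
\[
P(S_{t+1} \mid S_t) \;=\; P(S_{t+1} \mid S_0, S_1, \dots, S_t)
\]
% Instance from the text (t = 3):
\[
P(S_4 \mid S_3) \;=\; P(S_4 \mid S_0, S_1, S_2, S_3)
\]
```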
Discrete actions: Choose from a fixed set of options
Continuous actions: Choose any value in a range
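The difference between choosing from a fixed set and choosing any value in a range shows up directly in how an action is sampled. In this sketch the action names and the torque range are hypothetical values picked for illustration.

```python
import random

DISCRETE_ACTIONS = ["push_left", "push_right"]  # hypothetical fixed set of options
TORQUE_RANGE = (-1.0, 1.0)                      # hypothetical continuous range


def sample_discrete(rng):
    # Pick one option from the fixed set.
    return rng.choice(DISCRETE_ACTIONS)


def sample_continuous(rng):
    # Pick any real value between the lower and upper bound.
    low, high = TORQUE_RANGE
    return rng.uniform(low, high)


rng = random.Random(0)
print(sample_discrete(rng), sample_continuous(rng))
```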
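Dense rewards: Feedback at every step. Easier to learn from.
Sparse rewards: Feedback only at the end. Harder to learn from, because the agent must explore to find any reward at all.
Shaped rewards: Designed to guide learning. Helpful, but a poorly designed shaping term can lead the agent to exploit the reward instead of solving the task.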
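The three kinds of reward above (feedback at every step, feedback only at the end, and reward designed to guide learning) are commonly called dense, sparse, and shaped. The sketch below contrasts them on a hypothetical 1-D track with a goal at position 10; the goal, the function names, and the 0.1 bonus weight are all assumptions made for illustration.

```python
GOAL = 10  # hypothetical goal position on a 1-D track


def sparse_reward(position):
    """Feedback only when the goal is reached."""
    return 1.0 if position == GOAL else 0.0


def dense_reward(position):
    """Feedback at every step: the closer to the goal, the higher the reward."""
    return -abs(GOAL - position)


def shaped_reward(prev_position, position, bonus_weight=0.1):
    """Sparse reward plus a small designer-chosen bonus for progress toward the goal."""
    progress = abs(GOAL - prev_position) - abs(GOAL - position)
    return sparse_reward(position) + bonus_weight * progress


print(sparse_reward(4), dense_reward(4), shaped_reward(3, 4))
```

The shaping bonus is where issues tend to arise: if it is weighted too heavily or rewards the wrong quantity, an agent may maximize the bonus without ever reaching the goal.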
— Sutton & Barto, Reinforcement Learning: An Introduction
This means: if you can express the goal as a reward function, you can in principle frame the problem for RL. The challenge is often designing the right reward!
An episode is one complete run from start to end (game over). A trajectory is the sequence of states, actions, and rewards in that episode: S₀, A₀, R₁, S₁, A₁, R₂, …
The agent's goal: maximize the expected return G, the cumulative reward collected over the episode.
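As a rough sketch, the return of one episode can be computed by summing its rewards. The optional discount factor `gamma` is an assumption added here and is not mentioned in the text; with the default `gamma=1.0` the function is a plain sum. The name `compute_return` is hypothetical.

```python
def compute_return(rewards, gamma=1.0):
    """Return G = r1 + gamma*r2 + gamma^2*r3 + ... for one episode's rewards."""
    g = 0.0
    # Work backwards so each step adds its reward to the discounted future return.
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


episode_rewards = [1.0, 1.0, 1.0]  # e.g. three timesteps of CartPole
print(compute_return(episode_rewards))             # 3.0
print(compute_return(episode_rewards, gamma=0.5))  # 1.75
```

The "expected" in expected return refers to the average of this quantity over many episodes, since both the agent's choices and the environment can be random.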
| Problem | State | Actions | Reward |
|---|---|---|---|
| CartPole | Position, velocity, angle, angular velocity | Push left, Push right | +1 per timestep balanced |
| Atari Games | 84×84 pixel screen image | 18 joystick/button combinations | Game score change |
| Chess | Board configuration | All legal moves | +1 win, -1 lose, 0 draw |
| Robot Walking | Joint positions, velocities, IMU | Motor torques (continuous) | +forward speed, -falling |
| Stock Trading | Prices, positions, indicators | Buy, sell, hold amounts | Profit/loss |
| Dialogue System | Conversation history | Response options | User satisfaction |
State: Everything the agent observes about its situation. It should be Markov (contain all relevant information).
Action: How the agent influences the environment. Actions can be discrete choices or continuous values.
Reward: Feedback that drives learning. The agent learns to maximize cumulative reward over time.