CartPole Environment Explainer

Understanding the classic reinforcement learning benchmark

🎯 The Goal

Balance a pole on a moving cart for as long as possible by applying forces to move the cart left or right.

Solved when the average episode reward reaches 195 or more over 100 consecutive episodes (v0), or 475 or more for v1
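The "solved" check can be sketched as a rolling average over recent episode returns. This is only an illustration, not part of any library: the helper name `make_solved_tracker` is hypothetical, and the defaults use the 195-over-100-episodes figure (pass a different `threshold` for v1).

```python
from collections import deque


def make_solved_tracker(threshold=195.0, window=100):
    """Return a function that records episode returns and reports whether
    the mean of the last `window` returns has reached `threshold`.

    Hypothetical helper for illustration; not a Gymnasium API.
    """
    recent = deque(maxlen=window)  # keeps only the most recent `window` returns

    def record(episode_return):
        recent.append(episode_return)
        # Only judge once a full window of episodes has been seen.
        return len(recent) == window and sum(recent) / window >= threshold

    return record


# Usage: feed in one total reward per finished episode.
record = make_solved_tracker(threshold=195.0, window=100)
solved = False
for _ in range(100):
    solved = record(200.0)  # pretend every episode scored 200
```

Requiring a full window before declaring success avoids calling the task solved after a single lucky episode.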

👁 Observation Space (4D)

The agent receives 4 continuous values each timestep:

Index  Observation            Range
0      Cart Position          [-4.8, 4.8]
1      Cart Velocity          [-Inf, Inf]
2      Pole Angle             [-0.418, 0.418] rad
3      Pole Angular Velocity  [-Inf, Inf]

🎮 Action Space (Discrete)

The agent can take one of 2 discrete actions:

Action 0: Push Left
Apply force to move cart left
Action 1: Push Right
Apply force to move cart right
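To make the two actions concrete, here is a minimal sketch of how an action index could map to a signed force, plus a naive hand-written policy. The names `action_to_force` and `lean_policy` are hypothetical, the 10 N magnitude is an assumption taken from Gymnasium's default, and the policy assumes a positive pole angle means the pole leans right.

```python
FORCE_MAG = 10.0  # assumed force magnitude in newtons (Gymnasium's default)


def action_to_force(action):
    """Map a discrete action (0 = left, 1 = right) to a signed force."""
    if action not in (0, 1):
        raise ValueError(f"invalid action: {action!r}")
    return FORCE_MAG if action == 1 else -FORCE_MAG


def lean_policy(observation):
    """Naive baseline: push toward the side the pole is leaning.

    Assumes observation = (cart_pos, cart_vel, pole_angle, pole_angular_vel)
    and that a positive angle means the pole leans right.
    """
    _, _, pole_angle, _ = observation
    return 1 if pole_angle > 0 else 0
```

A rule like this ignores cart position and velocities, so it is useful mainly as a baseline to compare a learned policy against.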

🎁 Reward Structure

+1
Every timestep the pole remains upright

Key insight: The reward is constant (+1) for each step of survival. The agent learns to maximize cumulative reward by keeping the pole balanced longer.
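Because every step pays +1, the undiscounted return of an episode is simply its length. The sketch below computes the return for a list of rewards, with an optional discount factor `gamma` (an assumption added here for illustration; the text above does not specify one), and compares it with the geometric-series closed form.

```python
def discounted_return(rewards, gamma=1.0):
    """Sum of rewards, each discounted by gamma per step into the future."""
    g = 0.0
    # Work backwards so each step's return is r + gamma * (return after it).
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0] * 50  # the pole stayed up for 50 steps

undiscounted = discounted_return(rewards)            # equals the episode length
discounted = discounted_return(rewards, gamma=0.99)  # later steps count less

# With constant +1 rewards the discounted return is a geometric series:
# 1 + gamma + ... + gamma^(T-1) = (1 - gamma^T) / (1 - gamma)
closed_form = (1 - 0.99 ** 50) / (1 - 0.99)
```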

🛑 Episode Termination

  • 📈
    Pole Angle Exceeds ±12°
    Pole tilts more than 0.2095 radians from vertical
  • 🚧
    Cart Leaves Boundaries
    Cart position exceeds ±2.4 units from center
  • Maximum Steps Reached
    Episode truncated at 200 (v0) or 500 (v1) steps
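The three conditions above can be sketched as a single check. This is a simplified illustration rather than the environment's actual implementation: `episode_status` is a hypothetical name, and the observation layout is assumed to be (cart position, cart velocity, pole angle, pole angular velocity).

```python
import math

ANGLE_LIMIT_RAD = math.radians(12)  # 12 degrees, roughly 0.209 rad
POSITION_LIMIT = 2.4                # cart must stay within +/- 2.4 units


def episode_status(observation, step_count, max_steps=500):
    """Return (terminated, truncated) for the current state.

    terminated: the pole fell too far or the cart left the track.
    truncated:  the step limit was reached (200 for v0, 500 for v1).
    """
    cart_position, _, pole_angle, _ = observation
    terminated = (
        abs(cart_position) > POSITION_LIMIT
        or abs(pole_angle) > ANGLE_LIMIT_RAD
    )
    truncated = step_count >= max_steps
    return terminated, truncated
```

Keeping the two flags separate matters for learning: hitting the step limit is not a failure, so value-based methods typically still bootstrap from the final state when an episode is only truncated.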

💡 Understanding CartPole for RL

Why CartPole?
Simple enough to train quickly, complex enough to require learning. Perfect for testing RL algorithms.
Sparse vs Dense Rewards
CartPole has dense rewards (+1 each step). Compare this with sparse-reward tasks, where the agent is rewarded only at the end of an episode.
Continuous Observations
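The 4D observation space is continuous, so a plain lookup table cannot cover every state. Agents either discretize the observations or use function approximation (for example, neural networks) to generalize across states.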
Discrete Actions
Having only 2 possible actions makes this simpler than continuous control tasks. A good starting point for RL.
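As a rough illustration of function approximation over continuous observations with discrete actions, the sketch below scores each action with a linear function of the observation and picks the highest. The names `q_values` and `greedy_action` are hypothetical, and the weights are hand-picked for the example rather than learned.

```python
def q_values(observation, weights, biases):
    """Linear action-value estimates: Q(s, a) = w_a . s + b_a for each action a."""
    return [
        sum(w * x for w, x in zip(w_a, observation)) + b_a
        for w_a, b_a in zip(weights, biases)
    ]


def greedy_action(observation, weights, biases):
    """Pick the action with the largest estimated value (ties go to action 0)."""
    q = q_values(observation, weights, biases)
    return max(range(len(q)), key=q.__getitem__)


# One weight vector per action, over (cart_pos, cart_vel, pole_angle, pole_angular_vel).
# Hand-picked for illustration: favor pushing toward the side the pole is tipping.
weights = [
    [0.0, 0.0, -1.0, -0.5],  # action 0 (left)
    [0.0, 0.0, 1.0, 0.5],    # action 1 (right)
]
biases = [0.0, 0.0]

action = greedy_action((0.0, 0.0, 0.1, 0.0), weights, biases)
```

A neural network plays the same role as `q_values` here, only with a more flexible function and with parameters fitted from experience instead of chosen by hand.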

💻 Code Example

Basic CartPole interaction with Gymnasium:

import gymnasium as gym

# Create environment
env = gym.make("CartPole-v1", render_mode="human")

# Reset and get initial observation
observation, info = env.reset()
# observation = [cart_pos, cart_vel, pole_angle, pole_angular_vel]

for _ in range(1000):
    # Take random action (0=left, 1=right)
    action = env.action_space.sample()

    # Step environment
    observation, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        observation, info = env.reset()

env.close()