CartPole Environment Explainer

🎯 The Goal

Balance a pole on a moving cart for as long as possible by applying forces to move the cart left or right.

Solved when the average return over 100 consecutive episodes reaches 195 or more (CartPole-v0) or 475 or more (CartPole-v1)
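The "solved" check is a rolling average over the most recent episode returns. Here is a minimal sketch, assuming a hypothetical helper class named SolvedTracker (it is not part of Gymnasium); the threshold and window are parameters, with 195 and 100 used as defaults:

```python
from collections import deque


class SolvedTracker:
    """Hypothetical helper: tracks a rolling mean of episode returns."""

    def __init__(self, threshold=195.0, window=100):
        self.threshold = threshold
        self.window = window
        # deque with maxlen silently drops the oldest return when full
        self.returns = deque(maxlen=window)

    def add(self, episode_return):
        self.returns.append(episode_return)

    def mean(self):
        return sum(self.returns) / len(self.returns) if self.returns else 0.0

    def solved(self):
        # Require a full window so a few lucky early episodes do not count.
        return len(self.returns) == self.window and self.mean() >= self.threshold
```

In a training loop, one would call add() once per finished episode and stop when solved() returns True.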

👁 Observation Space (4D)

The agent receives 4 continuous values each timestep:

Index	Observation	Range
0	Cart Position	[-4.8, 4.8]
1	Cart Velocity	[-Inf, Inf]
2	Pole Angle	[-0.418, 0.418] rad
3	Pole Angular Velocity	[-Inf, Inf]
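Indexing the observation by position is easy to get wrong, so it can help to give the four entries names. The sketch below assumes a hypothetical CartPoleObs type and label() helper (neither comes from Gymnasium) and follows the index order in the table above:

```python
from typing import NamedTuple, Sequence


class CartPoleObs(NamedTuple):
    """Hypothetical wrapper giving names to the four observation entries."""
    cart_position: float          # index 0
    cart_velocity: float          # index 1
    pole_angle: float             # index 2, in radians
    pole_angular_velocity: float  # index 3


def label(observation: Sequence[float]) -> CartPoleObs:
    # Unpack by position, matching the table order (0..3).
    return CartPoleObs(*(float(v) for v in observation))


obs = label([0.01, -0.02, 0.03, 0.04])
print(obs.pole_angle)
```

Because a NamedTuple is still a tuple, obs[2] and obs.pole_angle refer to the same value.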

🎮 Action Space (Discrete)

The agent can take one of 2 discrete actions:

←

Action 0: Push Left

Apply force to move cart left

→

Action 1: Push Right

Apply force to move cart right
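To make the two actions concrete, here is a naive hand-written baseline that pushes the cart toward the side the pole is leaning. It is only an illustrative sketch, not a learned policy: the name lean_policy is hypothetical, and it assumes a positive pole angle means the pole leans to the right.

```python
LEFT, RIGHT = 0, 1  # action indices as listed above


def lean_policy(observation):
    """Push toward the side the pole is leaning (naive baseline)."""
    pole_angle = observation[2]
    # Assumption: a positive angle means the pole leans right,
    # so moving the cart right brings it back under the pole.
    return RIGHT if pole_angle > 0 else LEFT
```

A rule like this usually survives longer than random actions, which makes it a handy sanity check before training anything.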

🎁 Reward Structure

+1 for every timestep the pole remains upright

Key insight: The reward is constant (+1) for each step of survival. The agent learns to maximize cumulative reward by keeping the pole balanced longer.
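Since every step pays +1, the cumulative reward of an episode is simply its length. The sketch below computes that, plus a discounted variant; discounting is not discussed above and is included only as a commonly used assumption, with gamma as an illustrative parameter.

```python
def episode_return(rewards):
    """Undiscounted return: with +1 per step this equals the episode length."""
    return sum(rewards)


def discounted_return(rewards, gamma=0.99):
    """Return with discount factor gamma, accumulated from the last step backward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0] * 37  # an episode that lasted 37 steps
print(episode_return(rewards))
```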

🛑 Episode Termination

📈

Pole Angle Exceeds ±12°

Pole tilts more than 0.2095 radians from vertical

🚧

Cart Leaves Boundaries

Cart position exceeds ±2.4 units from center

✅

Maximum Steps Reached

Episode truncated at 200 (v0) or 500 (v1) steps
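The three conditions above can be sketched as a single check. This is an illustration of the rules as stated here, not Gymnasium's own implementation; episode_status and the constant names are hypothetical, and max_steps defaults to the v1 limit of 500.

```python
import math

ANGLE_LIMIT_RAD = 12 * math.pi / 180  # 12 degrees, about 0.2095 rad
POSITION_LIMIT = 2.4                  # units from center


def episode_status(cart_position, pole_angle, step_count, max_steps=500):
    """Return (terminated, truncated) for the current state."""
    # Failure: the pole fell too far or the cart left the track.
    terminated = (
        abs(cart_position) > POSITION_LIMIT
        or abs(pole_angle) > ANGLE_LIMIT_RAD
    )
    # Time limit: the episode is cut off even though nothing failed.
    truncated = step_count >= max_steps
    return terminated, truncated
```

Keeping the two flags separate matters for learning: a truncated episode did not fail, so the final state should not be treated as a failure state.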

💡 Understanding CartPole for RL

Why CartPole?

Simple enough to train quickly, complex enough to require learning. Perfect for testing RL algorithms.

Sparse vs Dense Rewards

CartPole has dense rewards: +1 on every step. In a sparse-reward task, by contrast, the agent receives a reward only at the end of an episode.

Continuous Observations

The 4D observation space is continuous, so a lookup table over states is impractical without discretization. Agents typically use function approximation (e.g. neural networks) to generalize across states.
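As a rough illustration of function approximation, the simplest case is linear: one weight vector per action, with the action value given by a dot product with the observation. A neural network replaces the dot product with a learned nonlinear function. The names and the hand-picked weights below are hypothetical, chosen only to show the mechanics.

```python
NUM_ACTIONS = 2


def q_values(weights, observation):
    """Linear action values: one dot product per action."""
    return [sum(w * x for w, x in zip(row, observation)) for row in weights]


def greedy_action(weights, observation):
    """Pick the action with the highest estimated value."""
    q = q_values(weights, observation)
    return max(range(NUM_ACTIONS), key=q.__getitem__)


# Hand-picked illustrative weights: action 1 is favored when the pole angle
# (index 2) is positive, action 0 when it is negative.
weights = [
    [0.0, 0.0, -1.0, 0.0],  # action 0
    [0.0, 0.0, 1.0, 0.0],   # action 1
]
print(greedy_action(weights, [0.0, 0.0, 0.1, 0.0]))
```

Because the weights are shared across all states, an update made for one observation also changes the estimates for nearby observations, which is the generalization a lookup table cannot provide.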

Discrete Actions

Having only 2 possible actions makes this simpler than continuous control tasks, and a good starting point for RL.

💻 Code Example

Basic CartPole interaction with Gymnasium:

import gymnasium as gym

# Create environment
env = gym.make("CartPole-v1", render_mode="human")

# Reset and get initial observation
observation, info = env.reset()
# observation = [cart_pos, cart_vel, pole_angle, pole_angular_vel]

for _ in range(1000):
    # Take random action (0=left, 1=right)
    action = env.action_space.sample()

    # Step environment
    observation, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        observation, info = env.reset()

env.close()