Understanding the classic reinforcement learning benchmark
Balance a pole on a moving cart for as long as possible by applying forces to move the cart left or right.
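CartPole-v0 counts as solved when the average reward reaches 195 or more over 100 consecutive episodes. CartPole-v1 raises the episode cap to 500 steps and the solved threshold to 475.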
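
The solved criterion is a rolling average over the most recent 100 episode returns, which can be tracked with a fixed-size buffer. The sketch below is one minimal way to do it; the class name `SolvedTracker` and its parameters are hypothetical, and the default threshold of 195 is only the v0 value quoted above.

```python
from collections import deque


class SolvedTracker:
    """Tracks recent episode returns and checks them against a threshold."""

    def __init__(self, threshold=195.0, window=100):
        self.threshold = threshold
        self.window = window
        # A bounded deque silently drops the oldest return once it is full.
        self.returns = deque(maxlen=window)

    def add(self, episode_return):
        self.returns.append(episode_return)

    def is_solved(self):
        # Require a full window so a few lucky early episodes do not count.
        if len(self.returns) < self.window:
            return False
        return sum(self.returns) / self.window >= self.threshold
```

Call `add` once at the end of every episode and `is_solved` whenever a stopping check is needed.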
The agent receives 4 continuous values each timestep:
| Index | Observation | Range |
|---|---|---|
| 0 | Cart Position | [-4.8, 4.8] |
| 1 | Cart Velocity | [-Inf, Inf] |
| 2 | Pole Angle | [-0.418, 0.418] rad |
| 3 | Pole Angular Velocity | [-Inf, Inf] |
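
To make the indices in the table easier to work with, the 4 values can be unpacked into named fields. This is a small sketch, not part of Gymnasium: `CartPoleObs` and `out_of_bounds` are hypothetical names, and the limits used (cart position beyond ±2.4, pole angle beyond ±12°) are assumed from the Gymnasium CartPole documentation, where the episode ends at half of the observation ranges listed above.

```python
import math
from typing import NamedTuple


class CartPoleObs(NamedTuple):
    cart_position: float          # index 0
    cart_velocity: float          # index 1
    pole_angle: float             # index 2, in radians
    pole_angular_velocity: float  # index 3


# Assumed termination limits (half of the observation ranges in the table).
X_LIMIT = 2.4
THETA_LIMIT = 12 * math.pi / 180  # 12 degrees in radians


def out_of_bounds(obs: CartPoleObs) -> bool:
    """Return True if the cart or the pole has left the assumed safe region."""
    return abs(obs.cart_position) > X_LIMIT or abs(obs.pole_angle) > THETA_LIMIT
```

An observation array from the environment can be wrapped with `CartPoleObs(*observation)`.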
The agent can take one of 2 discrete actions: 0 (push the cart left) or 1 (push the cart right).
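
Because there are only two actions, a policy is just a function from an observation to 0 or 1. The sketch below is a hand-written baseline, not a learned agent: it assumes the usual Gymnasium encoding (0 pushes left, 1 pushes right) and that a positive pole angle means the pole leans right. The name `lean_policy` is hypothetical.

```python
LEFT, RIGHT = 0, 1  # assumed action encoding for CartPole


def lean_policy(observation):
    """Push the cart toward the side the pole is leaning."""
    pole_angle = observation[2]  # index 2 of the observation is the pole angle
    return RIGHT if pole_angle > 0 else LEFT
```

A rule like this gives a simple reference point to compare a trained agent against.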
Key insight: The reward is constant (+1) for each step of survival. The agent learns to maximize cumulative reward by keeping the pole balanced longer.
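
With a constant +1 per step, the undiscounted return of an episode equals the number of steps survived. Many algorithms optimize a discounted return instead; the sketch below assumes a discount factor `gamma` (0.99 here is an illustrative choice, not from the environment) and shows that longer episodes still score higher.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t, accumulated from the last step backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


short_episode = [1.0] * 20    # pole fell after 20 steps
long_episode = [1.0] * 200    # pole stayed up for 200 steps

# With gamma=1.0 the return is simply the episode length.
undiscounted = discounted_return(long_episode, gamma=1.0)

# With gamma<1 the longer episode still earns the larger return.
longer_is_better = discounted_return(long_episode) > discounted_return(short_episode)
```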
Basic CartPole interaction with Gymnasium:

    import gymnasium as gym

    # Create environment
    env = gym.make("CartPole-v1", render_mode="human")

    # Reset and get initial observation
    observation, info = env.reset()
    # observation = [cart_pos, cart_vel, pole_angle, pole_vel]

    for _ in range(1000):
        # Take random action (0=left, 1=right)
        action = env.action_space.sample()

        # Step environment
        observation, reward, terminated, truncated, info = env.step(action)

        if terminated or truncated:
            observation, info = env.reset()

    env.close()
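
The same reset/step loop can be wrapped in a helper that runs a single episode and reports its total reward. This is a sketch under stated assumptions: `run_episode` is a hypothetical name, `env` is any object that follows the Gymnasium `reset`/`step` interface used above, and `policy` is any callable that maps an observation to an action.

```python
def run_episode(env, policy, seed=None, max_steps=10_000):
    """Run one episode with `policy` and return the total reward collected."""
    observation, info = env.reset(seed=seed)
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(observation)
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        # Stop on failure (terminated) or on the time limit (truncated).
        if terminated or truncated:
            break
    return total_reward
```

With Gymnasium installed, this could be called as `run_episode(gym.make("CartPole-v1"), policy)`. Since each step yields +1, the returned value is also the number of steps the pole stayed up.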