Section-by-section instructions for completing your CartPole agent
Create an AI agent that learns to balance a pole on a cart. You'll use AI assistance to generate code, then adapt it to fit this template structure.
The template has specific sections that must be filled in for your solution to work with the evaluator. Here's the workflow:
Your my_solution.py must have a class called MyAgent with a select_action(state) method. The evaluator looks for this exact structure!
Document your design decisions and AI assistance. This helps instructors understand your thought process and ensures academic integrity.

    """
    ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
    ┃                         W20D2: MY CARTPOLE SOLUTION                          ┃
    ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

    Student: [YOUR FULL NAME]
    Date: [TODAY'S DATE]
    Algorithm: [DQN / PPO / Genetic Algorithm]

    ┌──────────────────────────────────────────────────────────────────────────────┐
    │                               DESIGN DECISIONS                               │
    └──────────────────────────────────────────────────────────────────────────────┘

    1. Algorithm Choice: [Your choice]
       Why: [2-3 sentences explaining your reasoning]

    2. Network Architecture: [Describe layers, sizes, activations]
       Why: [Explain why this architecture suits the problem]

    3. Key Enhancements: [List features: replay buffer, target network, etc.]
       Why: [Explain how each enhancement helps learning]

    ┌──────────────────────────────────────────────────────────────────────────────┐
    │                              AI ASSISTANCE USED                              │
    └──────────────────────────────────────────────────────────────────────────────┘

    - [What prompts did you give the AI?]
    - [What did you modify from the AI's code?]
    - [What did you learn?]
    """
1. Algorithm Choice: DQN (Deep Q-Network)
Why:
DQN suits discrete action spaces like CartPole's left/right
actions. It learns a value function that estimates future reward
for each action, and its replay buffer reuses past experience,
which makes it sample-efficient.
2. Network Architecture:
- 2 hidden layers with 128 neurons each
- ReLU activation functions
- Output: 2 neurons (Q-values for left/right)
Why:
CartPole has only 4 input features, so a small network is
sufficient. ReLU is cheap to compute and helps avoid vanishing
gradients.
Import all the libraries your solution needs. The template already includes the basics, but you may need to add more.
    try:
        import gymnasium as gym
    except ImportError:
        import gym

    import numpy as np
    import torch
    import torch.nn as nn
    import json
    import os
    # Additional imports for DQN
    import torch.optim as optim
    import random
    from collections import deque  # For replay buffer

    # Additional imports for PPO
    import torch.optim as optim
    from torch.distributions import Categorical  # For action sampling

    # Additional imports for Genetic Algorithm
    import random
    import copy  # For deep copying networks
Never remove the gymnasium import or the try/except block - this ensures compatibility across different systems.
Define the neural network architecture that your agent will use to make decisions. This is the "brain" of your agent.
    # CartPole State (Input): 4 values
    state = [
        cart_position,          # Where the cart is (-4.8 to 4.8)
        cart_velocity,          # How fast it's moving
        pole_angle,             # Angle from vertical (-0.42 to 0.42 rad)
        pole_angular_velocity   # How fast the pole is falling
    ]

    # CartPole Actions (Output): 2 choices
    action = 0  # Push cart LEFT
    action = 1  # Push cart RIGHT
    class QNetwork(nn.Module):
        """
        Q-Network: Outputs Q-value for each action.
        Q(state, action) = expected future reward
        """
        def __init__(self, state_size=4, action_size=2, hidden_size=128):
            super().__init__()
            self.network = nn.Sequential(
                nn.Linear(state_size, hidden_size),   # 4 → 128
                nn.ReLU(),
                nn.Linear(hidden_size, hidden_size),  # 128 → 128
                nn.ReLU(),
                nn.Linear(hidden_size, action_size)   # 128 → 2
            )

        def forward(self, state):
            return self.network(state)  # Returns [Q(left), Q(right)]

    class ActorCritic(nn.Module):
        """
        Actor-Critic: Two heads - policy (actor) and value (critic).
        """
        def __init__(self, state_size=4, action_size=2, hidden_size=64):
            super().__init__()
            # Shared layers
            self.shared = nn.Sequential(
                nn.Linear(state_size, hidden_size),
                nn.Tanh(),
                nn.Linear(hidden_size, hidden_size),
                nn.Tanh()
            )
            # Actor head: outputs action probabilities
            self.actor = nn.Linear(hidden_size, action_size)
            # Critic head: outputs state value
            self.critic = nn.Linear(hidden_size, 1)

        def forward(self, state):
            shared = self.shared(state)
            action_probs = torch.softmax(self.actor(shared), dim=-1)
            value = self.critic(shared)
            return action_probs, value

    class PolicyNetwork(nn.Module):
        """
        Simple policy network for Genetic Algorithm.
        Outputs action probabilities directly.
        """
        def __init__(self, state_size=4, action_size=2, hidden_size=32):
            super().__init__()
            self.network = nn.Sequential(
                nn.Linear(state_size, hidden_size),
                nn.ReLU(),
                nn.Linear(hidden_size, action_size),
                nn.Softmax(dim=-1)
            )

        def forward(self, state):
            return self.network(state)  # Returns [P(left), P(right)]
This class MUST exist and MUST have a select_action(state) method. The evaluator imports this class directly!
The MyAgent class wraps your entire agent - initialization, model loading, and action selection. The evaluator creates an instance of this class to test your agent.
    class MyAgent:
        """
        Your CartPole agent.
        REQUIRED: select_action(state) method
        """

        def __init__(self):
            """Initialize your agent here."""
            # 1. Create your neural network
            self.network = YourNetwork()

            # 2. Load saved model if it exists (for evaluation)
            model_path = os.path.join(
                os.path.dirname(os.path.abspath(__file__)),
                "results", "model.pt"
            )
            if os.path.exists(model_path):
                self.network.load_state_dict(
                    torch.load(model_path, weights_only=True)
                )
                self.network.eval()  # Set to evaluation mode

        def select_action(self, state):
            """
            Choose an action given the current state.

            Args:
                state: numpy array of shape (4,)

            Returns:
                action: 0 (left) or 1 (right)
            """
            # Convert state to tensor and get action
            with torch.no_grad():
                state_tensor = torch.FloatTensor(state).unsqueeze(0)
                # ... your logic to select action ...
            return action  # Must be 0 or 1
    def select_action(self, state):
        """DQN: Pick action with highest Q-value."""
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            q_values = self.network(state_tensor)
            return q_values.argmax(dim=1).item()  # 0 or 1

    def select_action(self, state):
        """PPO: Pick the most probable action (greedy, for evaluation)."""
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            probs, _ = self.network(state_tensor)
            # For evaluation, take most probable action
            return probs.argmax(dim=1).item()  # 0 or 1

    def select_action(self, state):
        """Genetic: Pick action with highest probability."""
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            probs = self.network(state_tensor)
            return probs.argmax(dim=1).item()  # 0 or 1
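The three variants above all pick the greedy action, which suits evaluation. During DQN training, exploration is usually layered on top with an epsilon-greedy rule. The following is a minimal sketch, assuming the Q-values arrive as a plain list; the function name `epsilon_greedy` and its `rng` parameter are illustrative and not part of the template.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the best one.

    q_values: list of Q-values, one per action (hypothetical input format).
    rng: any object with random() and randrange(), e.g. random.Random(seed).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # exploit: index of the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
print(epsilon_greedy([0.2, 0.9], epsilon=0.0, rng=rng))  # greedy: always 1
print(epsilon_greedy([0.2, 0.9], epsilon=1.0, rng=rng))  # fully random: 0 or 1
```

One way to use this is to keep `select_action` greedy for the evaluator and call a separate exploring method from the training loop.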
The training function is where your agent learns. It runs episodes, collects experience, and updates the neural network weights.
    def train(episodes=500):
        """
        Train your agent.

        Returns:
            scores: list of episode scores
            agent: trained MyAgent instance
        """
        env = gym.make("CartPole-v1")
        agent = MyAgent()
        scores = []

        for episode in range(episodes):
            state, _ = env.reset()
            total_reward = 0
            done = False

            while not done:
                # 1. Select action
                action = agent.select_action(state)

                # 2. Take action in environment
                next_state, reward, terminated, truncated, _ = env.step(action)
                done = terminated or truncated

                # 3. YOUR LEARNING LOGIC HERE
                #    - Store experience
                #    - Update network

                state = next_state
                total_reward += reward

            scores.append(total_reward)

            # Print progress
            if (episode + 1) % 10 == 0:
                avg = np.mean(scores[-100:])
                print(f"Episode {episode + 1}: Avg = {avg:.1f}")

        env.close()
        return scores, agent
    # DQN Training: Store experience, sample batch, update Q-network

    # Inside the while loop:
    # Store experience in replay buffer
    agent.replay_buffer.push(state, action, reward, next_state, done)

    # Sample batch and train
    agent.train_step()

    # Update target network (soft update)
    agent.update_target_network()

    # After episode:
    agent.decay_epsilon()  # Reduce exploration over time
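The DQN outline above calls `agent.replay_buffer.push(...)` without showing the buffer itself. Below is a minimal sketch of one possible replay buffer plus the usual one-step target calculation. The class layout, the `sample` method and the `td_target` helper are assumptions made for illustration; the template does not prescribe them.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=10000):
        # deque with maxlen silently drops the oldest entry when full
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return five tuples: states, actions, rewards, next_states, dones."""
        batch = random.sample(list(self.memory), batch_size)
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.memory)


def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step DQN target: r if the episode ended, else r + gamma * max Q(s')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.push([0.0] * 4, step % 2, 1.0, [0.0] * 4, False)
print(len(buffer))  # 3, because the two oldest entries were dropped
```

A training step would typically wait until `len(buffer)` is at least the batch size before sampling. Some implementations store `terminated` rather than `done` so that time-limit truncation still bootstraps from the next state; whether that matters for your score is worth testing.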
    # PPO Training: Collect trajectory, compute advantages, update policy

    # Collect full episode trajectory
    states, actions, rewards, log_probs, values = [], [], [], [], []

    # Inside while loop:
    action, log_prob, value = agent.select_action_with_info(state)
    states.append(state)
    actions.append(action)
    rewards.append(reward)
    log_probs.append(log_prob)
    values.append(value)

    # After episode: compute advantages and update
    agent.update(states, actions, rewards, log_probs, values)
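The "compute advantages" step in the PPO outline can be sketched with plain lists. This version uses simple Monte Carlo returns minus the critic's value estimates; many PPO implementations use Generalized Advantage Estimation instead, so treat this as the simplest variant rather than the required one. The function names are illustrative.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return-to-go for every step: G_t = r_t + gamma * G_{t+1}."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(returns, values):
    """Advantage = how much better the outcome was than the critic predicted."""
    return [g - v for g, v in zip(returns, values)]


rewards = [1.0, 1.0, 1.0]
returns = discounted_returns(rewards, gamma=0.5)
print(returns)                                 # [1.75, 1.5, 1.0]
print(advantages(returns, [1.0, 1.0, 1.0]))    # [0.75, 0.5, 0.0]
```

Inside `agent.update(...)`, these advantages would weight the policy loss, and the returns would serve as regression targets for the critic.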
    # Genetic Training: Evaluate population, select best, mutate

    # Create population of agents
    population = [PolicyNetwork() for _ in range(population_size)]

    # For each generation:
    # 1. Evaluate fitness of each individual
    fitness_scores = [evaluate_agent(net, env) for net in population]

    # 2. Select top performers (elitism)
    elite = select_top(population, fitness_scores, elite_size)

    # 3. Create new generation through crossover + mutation
    population = breed_new_generation(elite)
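The genetic outline names `select_top` and `breed_new_generation` but leaves them undefined. Here is a minimal sketch of what they might look like, assuming each individual is a flat list of weights rather than a `PolicyNetwork`. The extra parameters (`population_size`, `rng`, `sigma`) and the `crossover` and `mutate` helpers are additions for illustration, not part of the template.

```python
import random


def select_top(population, fitness_scores, elite_size):
    """Keep the elite_size individuals with the highest fitness."""
    order = sorted(range(len(population)),
                   key=lambda i: fitness_scores[i], reverse=True)
    return [population[i] for i in order[:elite_size]]


def crossover(parent_a, parent_b, rng):
    """Uniform crossover: each weight comes from one parent at random."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]


def mutate(genome, rng, sigma=0.1):
    """Add small Gaussian noise to every weight."""
    return [w + rng.gauss(0.0, sigma) for w in genome]


def breed_new_generation(elite, population_size, rng, sigma=0.1):
    """Carry the elite over unchanged, then fill up with mutated children."""
    children = [list(genome) for genome in elite]
    while len(children) < population_size:
        parent_a, parent_b = rng.choice(elite), rng.choice(elite)
        children.append(mutate(crossover(parent_a, parent_b, rng), rng, sigma))
    return children


rng = random.Random(0)
population = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
elite = select_top(population, fitness_scores=[5, 9, 1], elite_size=2)
print(elite)  # [[1.0, 1.0], [0.0, 0.0]]
print(len(breed_new_generation(elite, population_size=6, rng=rng)))  # 6
```

With real networks, the same idea applies to the parameter tensors: copy the elite networks with `copy.deepcopy` and add noise to each parameter.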
The main block runs training when you execute the file and saves the trained model so the evaluator can load it.
    if __name__ == "__main__":
        # Bootstrap virtual environment (already in template)
        ensure_venv()

        # Run training
        scores, agent = train()

        # Create results directory
        results_dir = os.path.join(
            os.path.dirname(os.path.abspath(__file__)),
            "results"
        )
        os.makedirs(results_dir, exist_ok=True)

        # CRITICAL: Save model weights
        model_path = os.path.join(results_dir, "model.pt")
        torch.save(agent.network.state_dict(), model_path)
        print(f"Model saved to {model_path}")

        # Save training statistics
        with open(os.path.join(results_dir, "my_results.json"), "w") as f:
            json.dump({
                "algorithm": "YOUR_ALGORITHM",
                "final_avg": float(np.mean(scores[-100:])),
                "best_score": float(max(scores)),
                "episodes": len(scores)
            }, f, indent=2)

        # Optional: Watch agent play
        # demo(agent)
You MUST save your model to results/model.pt using the absolute path pattern shown above. The evaluator loads from this exact location!
    # Why we use this pattern:
    results_dir = os.path.join(
        os.path.dirname(os.path.abspath(__file__)),  # Directory containing my_solution.py
        "results"                                    # results subdirectory
    )

    # This ensures the path works regardless of:
    # - What directory you run from (cd somewhere && python path/to/my_solution.py)
    # - Operating system (Windows vs Mac vs Linux)
You learn by trying things and seeing what happens. Writing it down helps you remember what worked and what didn't!
## Try #1
**Time:** 2:30 PM
**What I changed:** Made the network bigger (64 → 128 neurons)
**What I expected:** It should learn better with more brain power
**What happened:** Score went from 180 to 290 - nice!
**What I learned:** Bigger network = smarter agent
## Try #2
**Time:** 2:45 PM
**What I changed:** Made learning slower (0.01 → 0.001)
**What I expected:** It won't learn as fast but might be smoother
**What happened:** Score jumped to 420!
**What I learned:** Going too fast made it miss the good answers
What score is your agent getting? Is it getting better?
What could you change to make it better?
Change ONE thing and run it again.
Did it work? Why do you think so?
Only change ONE thing between each test. If you change three things at once, you won't know which one made the difference!
Open project_journal.html in your browser. That's where you'll keep track of everything you try!
Your agent has a bunch of settings you can change. Some make it learn faster, some make it smarter, some break it completely! Here's what each one does.
| Setting | Values to Try | What It Does |
|---|---|---|
| Neurons per Layer | 32, 64, 128, 256 | More neurons = smarter but slower |
| Number of Layers | 1, 2, 3 | More layers = can learn harder stuff |
| Activation Type | ReLU, Tanh | Different ways neurons "fire" |
CartPole is pretty simple. Start with 2 layers of 64 neurons. Only make it bigger if it's not learning.
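To see what "bigger" costs, you can count the weights each layer size needs. The sketch below assumes fully connected layers with biases, 4 inputs, two hidden layers and 2 outputs, matching the networks shown earlier; `count_parameters` is a hypothetical helper, not something in the template.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


for hidden in (32, 64, 128, 256):
    total = count_parameters([4, hidden, hidden, 2])
    print(f"{hidden} neurons per layer: {total} parameters")
# 32 -> 1282, 64 -> 4610, 128 -> 17410, 256 -> 67586
```

Doubling the layer width roughly quadruples the parameter count, which is why larger networks train more slowly.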
| Setting | Values to Try | What It Does |
|---|---|---|
| Learning Rate | 0.0001, 0.0003, 0.001, 0.01 | Too fast = crazy, too slow = boring |
| Discount (gamma) | 0.95, 0.99, 0.999 | How much it cares about future rewards |
| Batch Size | 32, 64, 128 | How many memories to learn from at once |
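The discount row is easier to picture with numbers. A reward that arrives k steps in the future is weighted by gamma to the power k, and a common rule of thumb puts the "planning horizon" at about 1 / (1 - gamma) steps. The helper names below are illustrative.

```python
def reward_weight(gamma, steps_ahead):
    """Weight given to a reward that arrives steps_ahead steps in the future."""
    return gamma ** steps_ahead


def effective_horizon(gamma):
    """Rule-of-thumb number of future steps that still matter: 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


for gamma in (0.95, 0.99, 0.999):
    print(f"gamma={gamma}: horizon ~{effective_horizon(gamma):.0f} steps, "
          f"a reward 100 steps ahead counts {reward_weight(gamma, 100):.1%}")
```

With gamma at 0.95 the agent mostly cares about the next 20 or so steps, while 0.999 makes it weigh nearly the whole 500-step CartPole-v1 episode.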
If your agent isn't learning anything, the learning rate is usually why. Try 0.001 first. If it's crazy, go lower. If it's too slow, go higher.
| Setting | Values to Try | What It Does |
|---|---|---|
| Starting Epsilon | 1.0 | Starts 100% random |
| Epsilon Decay | 0.99, 0.995, 0.999 | How fast it stops being random |
| Minimum Epsilon | 0.01, 0.001 | Never go fully predictable |
    # With decay=0.995:
    Episode 100: 61% random   # Still trying lots of stuff
    Episode 200: 37% random   # About one-third random
    Episode 300: 22% random   # Mostly using what it learned
    Episode 500:  8% random   # Almost always smart choices
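You can reproduce this schedule for any decay value. The sketch below assumes epsilon is multiplied by the decay factor once per episode and clipped at the minimum; the function names are illustrative.

```python
import math


def epsilon_at(episode, start=1.0, decay=0.995, minimum=0.01):
    """Exploration rate after `episode` per-episode multiplicative decays."""
    return max(minimum, start * decay ** episode)


def episodes_until_minimum(start=1.0, decay=0.995, minimum=0.01):
    """First episode at which the decayed epsilon reaches the minimum."""
    return math.ceil(math.log(minimum / start) / math.log(decay))


for episode in (100, 200, 300, 500):
    print(f"Episode {episode}: {epsilon_at(episode):.0%} random")

print(episodes_until_minimum())  # 919
```

With decay=0.995 and a minimum of 0.01, the floor is only reached after about 919 episodes, so a 500-episode run never gets there. A faster decay such as 0.99 reaches it sooner.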
| Setting | Values to Try | What It Does |
|---|---|---|
| Memory Size | 5000, 10000, 50000 | How many past moves to remember |
| Target Update | every step, every 10 | How often to update the "teacher" network |
| Update Speed (τ) | 0.001, 0.005, 0.01 | Lower = more stable but slower |
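The update speed τ controls a soft update, where the target ("teacher") network moves a small fraction of the way toward the online network each time. A minimal sketch on plain lists of weights follows; with PyTorch the same formula would be applied to each pair of parameters. The function name `soft_update` is an assumption.

```python
def soft_update(target_weights, online_weights, tau=0.005):
    """Blend: new_target = (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_weights, online_weights)]


target = [0.0, 2.0]
online = [1.0, 2.0]
print(soft_update(target, online, tau=0.5))  # [0.5, 2.0]

# With a small tau the target trails the online network slowly:
for _ in range(100):
    target = soft_update(target, online, tau=0.01)
print(round(target[0], 3))  # about 0.634 after 100 updates
```

The remaining gap shrinks by a factor of (1 - τ) per update, so a smaller τ gives a steadier but slower-moving teacher.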
Try 32 neurons, then 128, then 256. Which one learns best? Which is fastest?
Try 0.01, 0.001, and 0.0001. What happens when it's too fast? Too slow?
What happens if you remove the target network? Does it still learn? Is it stable?
What if it stays 10% random forever? What if it stops being random really fast?
For each thing you try, write down: (1) What you changed, (2) What you thought would happen, (3) What actually happened, (4) What you learned. This is how real engineers work!
Go through this complete checklist to ensure your solution will work with the evaluator and meets all requirements.
    # Train your agent
    python my_solution.py

    # Evaluate (should show learned behavior, not random)
    python evaluate.py

    # Evaluate with graph
    python evaluate.py --graph

    # Watch agent play
    python evaluate.py --watch
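Before running the evaluator, a quick smoke test of the required interface can catch naming mistakes early. This is only a sketch of such a check, not the evaluator's actual logic, which is not shown here. `DummyAgent` stands in for your `MyAgent` so the example runs on its own, and the helper names are hypothetical.

```python
import inspect


def check_agent_class(agent_cls):
    """Return a list of interface problems (empty list means it looks fine)."""
    if not inspect.isclass(agent_cls):
        return ["agent is not a class"]
    method = getattr(agent_cls, "select_action", None)
    if not callable(method):
        return ["missing select_action method"]
    params = list(inspect.signature(method).parameters)
    if params != ["self", "state"]:
        return [f"select_action should take (self, state), got {params}"]
    return []


def returns_valid_action(agent, state=(0.0, 0.0, 0.0, 0.0)):
    """True if the agent answers a sample state with action 0 or 1."""
    return agent.select_action(state) in (0, 1)


class DummyAgent:
    """Stand-in for MyAgent: always pushes left."""

    def select_action(self, state):
        return 0


print(check_agent_class(DummyAgent))       # []
print(returns_valid_action(DummyAgent()))  # True
```

In your own file you would pass `MyAgent` instead of `DummyAgent`, and a NumPy array of shape (4,) as the sample state.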
| Grade | Requirement |
|---|---|
| A (Excellent) | 475+ average score |
| B (Good) | 400-474 average score |
| C (Needs Work) | 300-399 average score |
| D (Keep Trying) | 200-299 average score |
| F (Not Learning) | Below 200 average score |
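To check where a training run would land on this scale, the bands can be written as a small function. This sketch assumes fractional averages fall into the band below the next threshold (for example, 474.5 counts as a B); the evaluator's exact rounding is not specified here.

```python
def grade(average_score):
    """Map an average evaluation score to the letter grade in the table."""
    if average_score >= 475:
        return "A"
    if average_score >= 400:
        return "B"
    if average_score >= 300:
        return "C"
    if average_score >= 200:
        return "D"
    return "F"


for score in (500, 420, 310, 250, 90):
    print(score, grade(score))
```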