
my_solution.py Template Guide

Section-by-section instructions for completing your CartPole agent

0. Overview & File Structure

🎯 Your Mission

Create an AI agent that learns to balance a pole on a cart. You'll use AI assistance to generate code, then adapt it to fit this template structure.

How This Works

The template has specific sections that must be filled in for your solution to work with the evaluator. Here's the workflow:

1. Design
2. Prompt AI
3. Adapt Code
4. Train
5. Evaluate

File Structure

W20D2_Student/
├── index.html                ← Start here!
├── my_solution.py            ← Your code goes here!
├── evaluate.py               ← Tests your solution
├── requirements.txt          ← Dependencies
├── guides/
│   ├── design_lab.html       ← Design your solution
│   ├── template_guide.html   ← You are here
│   └── project_journal.html  ← Track your experiments
└── results/                  ← Created when you train
    ├── model.pt              ← Your saved model
    └── my_results.json       ← Training stats

⚠️ Critical Requirement

Your my_solution.py must have a class called MyAgent with a select_action(state) method. The evaluator looks for this exact structure!

1. Documentation Header

🎯 Purpose

Document your design decisions and AI assistance. This helps instructors understand your thought process and ensures academic integrity.

What to Fill In

"""
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                        W20D2: MY CARTPOLE SOLUTION                           ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

    Student: [YOUR FULL NAME]
    Date: [TODAY'S DATE]
    Algorithm: [DQN / PPO / Genetic Algorithm]

┌──────────────────────────────────────────────────────────────────────────────┐
│                            DESIGN DECISIONS                                  │
└──────────────────────────────────────────────────────────────────────────────┘

    1. Algorithm Choice: [Your choice]
       Why:
       [2-3 sentences explaining your reasoning]

    2. Network Architecture:
       [Describe layers, sizes, activations]

       Why:
       [Explain why this architecture suits the problem]

    3. Key Enhancements:
       [List features: replay buffer, target network, etc.]

       Why:
       [Explain how each enhancement helps learning]

┌──────────────────────────────────────────────────────────────────────────────┐
│                           AI ASSISTANCE USED                                 │
└──────────────────────────────────────────────────────────────────────────────┘

    - [What prompts did you give the AI?]
    - [What did you modify from the AI's code?]
    - [What did you learn?]

"""
            

📝 Good Example

    1. Algorithm Choice: DQN (Deep Q-Network)
       Why:
       DQN fits discrete action spaces like CartPole's left/right
       actions. It learns a value function that predicts future
       rewards, and its replay buffer reuses past experience,
       which makes it sample-efficient.

    2. Network Architecture:
       - 2 hidden layers with 128 neurons each
       - ReLU activation functions
       - Output: 2 neurons (Q-values for left/right)

       Why:
       CartPole has only 4 input features, so a small network is
       sufficient. ReLU is cheap to compute and helps avoid
       vanishing gradients.
                

✅ Compliance Checklist

  • Your name is filled in (not "[YOUR FULL NAME]")
  • Date is today's actual date
  • Algorithm name matches what you implemented
  • Design decisions explain WHY, not just WHAT
  • AI assistance section is honest and specific
2. Imports

🎯 Purpose

Import all the libraries your solution needs. The template already includes the basics, but you may need to add more.

Default Imports (Already Included)

try:
    import gymnasium as gym
except ImportError:
    import gym
import numpy as np
import torch
import torch.nn as nn
import json
import os
            

Common Additional Imports by Algorithm

# Additional imports for DQN
import torch.optim as optim
import random
from collections import deque  # For replay buffer
                
# Additional imports for PPO
import torch.optim as optim
from torch.distributions import Categorical  # For action sampling
                
# Additional imports for Genetic Algorithm
import random
import copy  # For deep copying networks
                

⚠️ Don't Remove

Never remove the gymnasium import or its try/except fallback. It lets the same file run whether the machine has gymnasium or only the older gym package installed.
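
The fallback matters because the two packages can differ in what `reset()` and `step()` return: Gymnasium (and gym 0.26 or later) return `(observation, info)` from `reset()` and five values from `step()`, while older gym releases return a bare observation and four values. The sketch below shows one way to normalize both shapes. `reset_compat` and `step_compat` are hypothetical helper names, not part of the template, and the two fake environments exist only to exercise them.

```python
def reset_compat(env):
    """Return only the observation, whichever API the env follows."""
    result = env.reset()
    # Newer API: (observation, info dict). Older API: observation only.
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    return result


def step_compat(env, action):
    """Return (next_state, reward, done) for 4- or 5-value step results."""
    result = env.step(action)
    if len(result) == 5:
        next_state, reward, terminated, truncated, _info = result
        return next_state, reward, terminated or truncated
    next_state, reward, done, _info = result
    return next_state, reward, done


class FakeNewStyleEnv:
    """Stand-in that mimics the five-value API (for illustration only)."""
    def reset(self):
        return [0.0, 0.0, 0.0, 0.0], {}

    def step(self, action):
        return [0.1, 0.0, 0.0, 0.0], 1.0, False, True, {}


class FakeOldStyleEnv:
    """Stand-in that mimics the four-value API (for illustration only)."""
    def reset(self):
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        return [0.1, 0.0, 0.0, 0.0], 1.0, False, {}


for env in (FakeNewStyleEnv(), FakeOldStyleEnv()):
    state = reset_compat(env)
    next_state, reward, done = step_compat(env, 0)
    print(type(env).__name__, state, reward, done)
```

If your installed package already matches the five-value API used in the template's training loop, you do not need these helpers.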

✅ Compliance Checklist

  • gymnasium/gym import is preserved
  • torch and torch.nn are imported
  • numpy is imported as np
  • All imports your code uses are included
3. Neural Network Class

🎯 Purpose

Define the neural network architecture that your agent will use to make decisions. This is the "brain" of your agent.

CartPole Input/Output

# CartPole State (Input): 4 values
state = [
    cart_position,      # Where the cart is (observation range -4.8 to 4.8)
    cart_velocity,      # How fast it's moving
    pole_angle,         # Angle from vertical (observation range -0.42 to 0.42 rad)
    pole_angular_velocity  # How fast the pole is rotating
]

# CartPole Actions (Output): 2 choices
action = 0  # Push cart LEFT
action = 1  # Push cart RIGHT
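
An episode ends well before the observation limits are reached. As a rough sketch of the CartPole-v1 rules (terminate when the cart passes 2.4 units from center or the pole tilts more than 12 degrees, truncate at 500 steps), the check can be written in plain Python. `episode_status` is a hypothetical helper for illustration; the real environment does this for you inside `env.step()`.

```python
import math

X_LIMIT = 2.4                      # cart position limit
THETA_LIMIT = 12 * math.pi / 180   # pole angle limit: 12 degrees, about 0.2095 rad
MAX_STEPS = 500                    # CartPole-v1 truncates an episode here


def episode_status(state, step_count):
    """Classify an episode as 'running', 'terminated', or 'truncated'."""
    cart_position, _cart_velocity, pole_angle, _pole_velocity = state
    if abs(cart_position) > X_LIMIT or abs(pole_angle) > THETA_LIMIT:
        return "terminated"        # the agent failed
    if step_count >= MAX_STEPS:
        return "truncated"         # the agent survived the full episode
    return "running"


print(episode_status([0.0, 0.0, 0.05, 0.0], step_count=10))
print(episode_status([0.0, 0.0, 0.30, 0.0], step_count=10))
print(episode_status([0.0, 0.0, 0.00, 0.0], step_count=500))
```

The distinction is worth keeping in mind when reading scores: a score of 500 means the episode was truncated, not that the pole fell.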
            

Network Examples by Algorithm

class QNetwork(nn.Module):
    """
    Q-Network: Outputs Q-value for each action.
    Q(state, action) = expected future reward
    """
    def __init__(self, state_size=4, action_size=2, hidden_size=128):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_size, hidden_size),   # 4 → 128
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),  # 128 → 128
            nn.ReLU(),
            nn.Linear(hidden_size, action_size)   # 128 → 2
        )

    def forward(self, state):
        return self.network(state)  # Returns [Q(left), Q(right)]
                
class ActorCritic(nn.Module):
    """
    Actor-Critic: Two heads - policy (actor) and value (critic).
    """
    def __init__(self, state_size=4, action_size=2, hidden_size=64):
        super().__init__()
        # Shared layers
        self.shared = nn.Sequential(
            nn.Linear(state_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh()
        )
        # Actor head: outputs action probabilities
        self.actor = nn.Linear(hidden_size, action_size)
        # Critic head: outputs state value
        self.critic = nn.Linear(hidden_size, 1)

    def forward(self, state):
        shared = self.shared(state)
        action_probs = torch.softmax(self.actor(shared), dim=-1)
        value = self.critic(shared)
        return action_probs, value
                
class PolicyNetwork(nn.Module):
    """
    Simple policy network for Genetic Algorithm.
    Outputs action probabilities directly.
    """
    def __init__(self, state_size=4, action_size=2, hidden_size=32):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, action_size),
            nn.Softmax(dim=-1)
        )

    def forward(self, state):
        return self.network(state)  # Returns [P(left), P(right)]
                

💡 Architecture Tips

  • CartPole is simple - 64-128 hidden units is usually enough
  • 2 hidden layers work well for most RL problems
  • ReLU is a standard default; Tanh is a common choice in PPO implementations
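
To get a feel for how small these networks are, you can count their trainable parameters by hand. The sketch below assumes a plain stack of fully connected layers with biases, like the QNetwork example; `mlp_param_count` is a hypothetical helper, not part of the template.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out   # weight matrix
        total += fan_out            # bias vector
    return total


# 4 inputs, two hidden layers of equal width, 2 outputs
for hidden in (32, 64, 128, 256):
    print(hidden, mlp_param_count([4, hidden, hidden, 2]))
```

The count grows roughly with the square of the hidden width, which is one reason to start small and widen the network only if learning stalls.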

✅ Compliance Checklist

  • Class inherits from nn.Module
  • Has __init__ method calling super().__init__()
  • Has forward method that returns output
  • Input size is 4 (state dimensions)
  • Output size is 2 (number of actions)
4. MyAgent Class (CRITICAL)

⚠️ Most Important Section

This class MUST exist and MUST have a select_action(state) method. The evaluator imports this class directly!

🎯 Purpose

The MyAgent class wraps your entire agent - initialization, model loading, and action selection. The evaluator creates an instance of this class to test your agent.

Required Structure

class MyAgent:
    """
    Your CartPole agent.

    REQUIRED: select_action(state) method
    """

    def __init__(self):
        """Initialize your agent here."""
        # 1. Create your neural network
        self.network = YourNetwork()

        # 2. Load saved model if it exists (for evaluation)
        model_path = os.path.join(
            os.path.dirname(os.path.abspath(__file__)),
            "results", "model.pt"
        )
        if os.path.exists(model_path):
            self.network.load_state_dict(
                torch.load(model_path, weights_only=True)
            )
            self.network.eval()  # Set to evaluation mode

    def select_action(self, state):
        """
        Choose an action given the current state.

        Args:
            state: numpy array of shape (4,)

        Returns:
            action: 0 (left) or 1 (right)
        """
        # Convert state to tensor and get action
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            # ... your logic to select action ...
        return action  # Must be 0 or 1
            

select_action by Algorithm

def select_action(self, state):
    """DQN: Pick action with highest Q-value."""
    with torch.no_grad():
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        q_values = self.network(state_tensor)
        return q_values.argmax(dim=1).item()  # 0 or 1
                
def select_action(self, state):
    """PPO: At evaluation time, pick the most probable action."""
    with torch.no_grad():
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        probs, _ = self.network(state_tensor)
        # For evaluation, take most probable action
        return probs.argmax(dim=1).item()  # 0 or 1
                
def select_action(self, state):
    """Genetic: Pick action with highest probability."""
    with torch.no_grad():
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        probs = self.network(state_tensor)
        return probs.argmax(dim=1).item()  # 0 or 1
                

⚠️ Common Mistakes

  • Wrong return type: Must return an integer (0 or 1), not a tensor
  • Returning tuple: Don't return (action, value) - just the action
  • Missing model load: Agent won't work in evaluate.py without loading
  • Wrong path: Use absolute path with __file__ for model loading
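
The first two mistakes are easy to catch before running the evaluator. The sketch below is one possible strict check; the evaluator's own checks are not shown in this guide, so treat `check_action` and `StubAgent` as hypothetical names used only for illustration.

```python
def check_action(action):
    """Raise if `action` is not a plain Python int equal to 0 or 1."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(action, bool) or not isinstance(action, int):
        raise TypeError(f"expected int, got {type(action).__name__}")
    if action not in (0, 1):
        raise ValueError(f"expected 0 or 1, got {action}")
    return action


class StubAgent:
    """Stand-in agent with fixed scores (no neural network involved)."""
    def select_action(self, state):
        q_values = [0.2, 0.7]
        # Index of the largest value, returned as a plain int.
        return max(range(len(q_values)), key=lambda i: q_values[i])


agent = StubAgent()
print(check_action(agent.select_action([0.0, 0.0, 0.0, 0.0])))
```

In your own agent, calling `.item()` on the result of `argmax` is what turns the tensor into a plain int that passes a check like this.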

✅ Compliance Checklist

  • Class is named exactly "MyAgent"
  • Has __init__ method that creates the network
  • __init__ loads saved model from results/model.pt
  • Has select_action(self, state) method
  • select_action returns integer 0 or 1
  • Uses torch.no_grad() during action selection
5. Training Function

🎯 Purpose

The training function is where your agent learns. It runs episodes, collects experience, and updates the neural network weights.

Required Structure

def train(episodes=500):
    """
    Train your agent.

    Returns:
        scores: list of episode scores
        agent: trained MyAgent instance
    """
    env = gym.make("CartPole-v1")
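    # Note: MyAgent() loads results/model.pt if it exists, so a rerun
    # continues from the saved weights; delete the file to start fresh.
    agent = MyAgent()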

    scores = []

    for episode in range(episodes):
        state, _ = env.reset()
        total_reward = 0
        done = False

        while not done:
            # 1. Select action
            action = agent.select_action(state)

            # 2. Take action in environment
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # 3. YOUR LEARNING LOGIC HERE
            # - Store experience
            # - Update network

            state = next_state
            total_reward += reward

        scores.append(total_reward)

        # Print progress
        if (episode + 1) % 10 == 0:
            avg = np.mean(scores[-100:])
            print(f"Episode {episode + 1}: Avg = {avg:.1f}")

    env.close()
    return scores, agent
            

Training Loop by Algorithm

# DQN Training: Store experience, sample batch, update Q-network

# Inside the while loop:
# Store experience in replay buffer
agent.replay_buffer.push(state, action, reward, next_state, done)

# Sample batch and train
agent.train_step()

# Update target network (soft update)
agent.update_target_network()

# After episode:
agent.decay_epsilon()  # Reduce exploration over time
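
The `agent.replay_buffer.push(...)` call above assumes your agent owns a replay buffer. A minimal sketch of one, using only the standard library, might look like this. The class layout and the default capacity are assumptions for illustration, not the template's required design.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=10000):
        # A deque with maxlen drops the oldest item once it is full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return five tuples: states, actions, rewards, next_states, dones."""
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.push([float(step)] * 4, step % 2, 1.0, [float(step + 1)] * 4, False)
print(len(buffer))             # capacity caps the size
states, actions, rewards, next_states, dones = buffer.sample(2)
print(len(states), len(actions))
```

A common pattern is to skip `train_step()` until `len(buffer)` is at least the batch size, since sampling more items than the buffer holds raises an error.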
                
# PPO Training: Collect trajectory, compute advantages, update policy

# Collect full episode trajectory
states, actions, rewards, log_probs, values = [], [], [], [], []

# Inside while loop:
action, log_prob, value = agent.select_action_with_info(state)
# ... call env.step(action) here to obtain reward and next_state ...
states.append(state)
actions.append(action)
rewards.append(reward)
log_probs.append(log_prob)
values.append(value)

# After episode: compute advantages and update
agent.update(states, actions, rewards, log_probs, values)
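
The "compute advantages" step inside `agent.update(...)` can take several forms. The sketch below uses the simplest one: discounted Monte Carlo returns minus the critic's value estimates. Many PPO implementations use Generalized Advantage Estimation instead, so treat this as an illustrative assumption; both function names are hypothetical.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(returns, values):
    """Advantage = what actually happened minus what the critic predicted."""
    return [g - v for g, v in zip(returns, values)]


episode_rewards = [1.0, 1.0, 1.0]
critic_values = [1.0, 1.0, 1.0]
returns = discounted_returns(episode_rewards, gamma=0.5)
print(returns)
print(advantages(returns, critic_values))
```

A positive advantage means the action led to a better outcome than the critic expected, so the policy update makes that action more likely.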
                
# Genetic Training: Evaluate population, select best, mutate

# Create population of agents
population = [PolicyNetwork() for _ in range(population_size)]

# For each generation:
# 1. Evaluate fitness of each individual
fitness_scores = [evaluate_agent(net, env) for net in population]

# 2. Select top performers (elitism)
elite = select_top(population, fitness_scores, elite_size)

# 3. Create new generation through crossover + mutation
population = breed_new_generation(elite)
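
The helpers `select_top` and `breed_new_generation` are not defined in the outline above. One possible sketch, using flat lists of floats as stand-in genomes instead of real networks, is shown below. The extra parameters (`population_size`, `sigma`, `rng`) are assumptions added for illustration; with PyTorch you would apply the same idea to each network's weights.

```python
import random


def select_top(population, fitness_scores, elite_size):
    """Return the elite_size individuals with the highest fitness."""
    ranked = sorted(range(len(population)),
                    key=lambda i: fitness_scores[i], reverse=True)
    return [population[i] for i in ranked[:elite_size]]


def breed_new_generation(elite, population_size, sigma=0.1, rng=random):
    """Keep the elite unchanged, then fill up with mutated children."""
    children = [list(parent) for parent in elite]           # elitism
    while len(children) < population_size:
        mom, dad = rng.choice(elite), rng.choice(elite)
        child = [rng.choice(pair) for pair in zip(mom, dad)]  # uniform crossover
        child = [w + rng.gauss(0.0, sigma) for w in child]    # Gaussian mutation
        children.append(child)
    return children


rng = random.Random(0)
population = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
fitness_scores = [10.0, 50.0, 30.0, 20.0]   # made-up scores for the demo
elite = select_top(population, fitness_scores, elite_size=2)
population = breed_new_generation(elite, population_size=6, rng=rng)
print(len(population))
```

Copying the elite into the next generation unchanged means the best score found so far can never get worse from one generation to the next.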
                

💡 Training Tips

  • Print progress every 10 episodes to monitor learning
  • Check for "solved" condition: avg score >= 475 over 100 episodes
  • CartPole-v1 max score is 500 (episode truncates at 500 steps)
  • If score isn't improving after 100 episodes, check your learning rate
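
The "solved" check from the tips can be sketched as a small function you call after each episode. The name `is_solved` is hypothetical; the window and threshold come from the tip above.

```python
def is_solved(scores, window=100, threshold=475.0):
    """True once the mean of the last `window` scores reaches `threshold`."""
    if len(scores) < window:
        return False          # not enough episodes to judge yet
    recent = scores[-window:]
    return sum(recent) / window >= threshold


print(is_solved([500.0] * 99))                   # too few episodes
print(is_solved([20.0] * 50 + [480.0] * 100))    # recent window is strong
```

Inside `train()` you could `break` out of the episode loop when this returns True, which saves time and avoids overtraining past a good policy.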

✅ Compliance Checklist

  • Function is named "train" with episodes parameter
  • Creates gym environment with "CartPole-v1"
  • Creates MyAgent instance
  • Returns (scores, agent) tuple
  • Prints progress during training
  • Closes environment at the end
6. Main Block & Saving Model

🎯 Purpose

The main block runs training when you execute the file and saves the trained model so the evaluator can load it.

Required Structure

if __name__ == "__main__":
    # Bootstrap virtual environment (already in template)
    ensure_venv()

    # Run training
    scores, agent = train()

    # Create results directory
    results_dir = os.path.join(
        os.path.dirname(os.path.abspath(__file__)),
        "results"
    )
    os.makedirs(results_dir, exist_ok=True)

    # CRITICAL: Save model weights
    model_path = os.path.join(results_dir, "model.pt")
    torch.save(agent.network.state_dict(), model_path)
    print(f"Model saved to {model_path}")

    # Save training statistics
    with open(os.path.join(results_dir, "my_results.json"), "w") as f:
        json.dump({
            "algorithm": "YOUR_ALGORITHM",
            "final_avg": float(np.mean(scores[-100:])),
            "best_score": float(max(scores)),
            "episodes": len(scores)
        }, f, indent=2)

    # Optional: Watch agent play
    # demo(agent)
            

⚠️ Critical: Model Saving

You MUST save your model to results/model.pt using the absolute path pattern shown above. The evaluator loads from this exact location!

Path Handling Explanation

# Why we use this pattern:
results_dir = os.path.join(
    os.path.dirname(os.path.abspath(__file__)),  # Directory containing my_solution.py
    "results"                                       # results subdirectory
)

# This ensures the path works regardless of:
# - What directory you run from (cd somewhere && python path/to/my_solution.py)
# - Operating system (Windows vs Mac vs Linux)
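
To see the difference in practice, the sketch below compares a relative "results" path with the script-anchored one before and after changing the working directory. `results_dir_for` is a hypothetical helper, and the script path is passed in as a string here so the example runs outside of my_solution.py.

```python
import os
import tempfile


def results_dir_for(script_path):
    """Return the results folder that sits next to the given script."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.join(script_dir, "results")


script = os.path.abspath("my_solution.py")   # stand-in for __file__
before = (os.path.abspath("results"), results_dir_for(script))

original_cwd = os.getcwd()
with tempfile.TemporaryDirectory() as elsewhere:
    os.chdir(elsewhere)                      # simulate running from another folder
    try:
        after = (os.path.abspath("results"), results_dir_for(script))
    finally:
        os.chdir(original_cwd)

print("relative path unchanged:", before[0] == after[0])
print("anchored path unchanged:", before[1] == after[1])
```

Only the anchored path stays the same, which is why the template builds every path from `__file__`.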
            

💡 What to Save

  • model.pt: Network weights (required for evaluation)
  • my_results.json: Training statistics (for your records)
  • Your network architecture must match between save and load!

✅ Compliance Checklist

  • Uses if __name__ == "__main__" guard
  • Calls ensure_venv() first
  • Calls train() and captures scores, agent
  • Creates results directory with os.makedirs
  • Saves model to results/model.pt
  • Uses absolute path with __file__
  • Saves training stats to my_results.json
7. Journaling Your Process

📓 Why Journal?

You learn by trying things and seeing what happens. Writing it down helps you remember what worked and what didn't!

What to Write Down

📝 Simple Experiment Log

## Try #1
**Time:** 2:30 PM
**What I changed:** Made the network bigger (64 → 128 neurons)
**What I expected:** It should learn better with more brain power
**What happened:** Score went from 180 to 290 - nice!
**What I learned:** Bigger network = smarter agent

## Try #2
**Time:** 2:45 PM
**What I changed:** Made learning slower (0.01 → 0.001)
**What I expected:** It won't learn as fast but might be smoother
**What happened:** Score jumped to 420!
**What I learned:** Going too fast made it miss the good answers
                

The Try-It-And-See Loop

1. Look

What score is your agent getting? Is it getting better?

2. Guess

What could you change to make it better?

3. Try It

Change ONE thing and run it again.

4. Write It Down

Did it work? Why do you think so?

⚠️ One Thing at a Time!

Only change ONE thing between each test. If you change three things at once, you won't know which one made the difference!

Questions to Answer in Your Journal

🤔 Think About These

  • Why did I pick this algorithm? What made it seem like a good choice?
  • What was the trickiest problem I ran into? How did I fix it?
  • Which setting made the biggest difference in my score?
  • If I could start over, what would I do differently?
  • What did I learn from reading the AI's code?
  • What parts of the AI code did I have to fix or change?

💡 Where to Write

Open project_journal.html in your browser. That's where you'll keep track of everything you try!

8. Experimentation Guide

🔬 Playing With the Settings

Your agent has a bunch of settings you can change. Some make it learn faster, some make it smarter, some break it completely! Here's what each one does.

Settings You Can Change

Brain Size (Network)

Setting             Values to try       Effect
Neurons per Layer   32, 64, 128, 256    More neurons = smarter but slower
Number of Layers    1, 2, 3             More layers = can learn harder stuff
Activation Type     ReLU, Tanh          Different ways neurons "fire"

💡 Start Small

CartPole is pretty simple. Start with 2 layers of 64 neurons. Only make it bigger if it's not learning.

How Fast It Learns

Setting            Values to try                Effect
Learning Rate      0.0001, 0.0003, 0.001, 0.01  Too fast = crazy, too slow = boring
Discount (gamma)   0.95, 0.99, 0.999            How much it cares about future rewards
Batch Size         32, 64, 128                  How many memories to learn from at once

⚠️ Learning Rate is #1

If your agent isn't learning anything, this is usually why. Try 0.001 first. If it's crazy, go lower. If it's too slow, go higher.
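
The discount setting is easier to pick once you see how far ahead each value lets the agent "look". A common rule of thumb is an effective horizon of about 1 / (1 - gamma) steps. The sketch below computes it for the values in the table; both function names are hypothetical.

```python
def effective_horizon(gamma):
    """Rule-of-thumb number of future steps that still matter."""
    return 1.0 / (1.0 - gamma)


def discount_weight(gamma, steps_ahead):
    """How much a reward `steps_ahead` steps in the future counts today."""
    return gamma ** steps_ahead


for gamma in (0.95, 0.99, 0.999):
    print(gamma,
          round(effective_horizon(gamma)),
          round(discount_weight(gamma, 100), 3))
```

CartPole episodes can last up to 500 steps, so a very low gamma makes the agent short-sighted, while a value close to 1 makes rewards far in the future count almost as much as immediate ones.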

Random vs Smart (DQN)

Setting            Values to try        Effect
Starting Epsilon   1.0                  Starts 100% random
Epsilon Decay      0.99, 0.995, 0.999   How fast it stops being random
Minimum Epsilon    0.01, 0.001          Never go fully predictable

📊 How Random Decreases Over Time

# With decay=0.995 applied once per episode, starting from epsilon=1.0:
Episode 100: 61% random  # Still trying lots of stuff
Episode 200: 37% random  # Starting to trust what it learned
Episode 300: 22% random  # Mostly using what it learned
Episode 500:  8% random  # Almost always smart choices
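
You can reproduce this schedule yourself and try other decay values. The sketch below assumes epsilon is multiplied by the decay once per episode and never drops below the minimum; `epsilon_at` and `epsilon_greedy` are hypothetical names for illustration.

```python
import random


def epsilon_at(episode, start=1.0, decay=0.995, minimum=0.01):
    """Epsilon after `episode` per-episode decays, floored at `minimum`."""
    return max(minimum, start * decay ** episode)


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, else the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


for episode in (100, 200, 300, 500):
    print(f"Episode {episode}: {epsilon_at(episode):.0%} random")

# During training, use the current epsilon; during evaluation, use 0.
print(epsilon_greedy([0.1, 0.9], epsilon=0.0))
```

Note that the `select_action` examples earlier always pick the best action, so a DQN agent needs something like `epsilon_greedy` during training only.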
                    

Memory Settings (DQN)

Setting            Values to try         Effect
Memory Size        5000, 10000, 50000    How many past moves to remember
Target Update      every step, every 10  How often to update the "teacher" network
Update Speed (τ)   0.001, 0.005, 0.01    Lower = more stable but slower
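
The update speed τ controls a "soft" update, where the teacher (target) network moves a small fraction of the way toward the learning network each time. The sketch below shows the arithmetic on plain lists of numbers; with PyTorch you would apply the same formula to each parameter tensor. `soft_update` is a hypothetical name.

```python
def soft_update(target_weights, online_weights, tau=0.005):
    """Blend: new_target = tau * online + (1 - tau) * old_target."""
    return [tau * w_online + (1.0 - tau) * w_target
            for w_target, w_online in zip(target_weights, online_weights)]


target = [0.0, 0.0]
online = [1.0, -2.0]

# After many small updates the target drifts close to the online weights.
for _ in range(1000):
    target = soft_update(target, online, tau=0.005)
print([round(w, 3) for w in target])
```

A smaller τ means the teacher changes more slowly, which keeps the learning targets steadier at the cost of reacting later to improvements.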

Things to Try

🧪 Try #1: Brain Size

Try 32 neurons, then 128, then 256. Which one learns best? Which is fastest?

🧪 Try #2: Learning Speed

Try 0.01, 0.001, and 0.0001. What happens when it's too fast? Too slow?

🧪 Try #3: No Target Network

What happens if you remove the target network? Does it still learn? Is it stable?

🧪 Try #4: Always Random?

What if it stays 10% random forever? What if it stops being random really fast?

📈 Write It All Down!

For each thing you try, write down: (1) What you changed, (2) What you thought would happen, (3) What actually happened, (4) What you learned. This is how real engineers work!

9. Final Compliance Checklist

🎯 Before Submitting

Go through this complete checklist to ensure your solution will work with the evaluator and meets all requirements.

📋 File Structure

  • my_solution.py exists in the W20D2_Student folder
  • results/model.pt exists after training
  • No syntax errors (file runs without crashing)

📋 Documentation Header

  • Your name is filled in
  • Date is correct
  • Algorithm name matches implementation
  • Design decisions explain reasoning
  • AI assistance is documented honestly

📋 MyAgent Class (CRITICAL)

  • Class is named exactly "MyAgent"
  • __init__ creates the neural network
  • __init__ loads model from results/model.pt
  • select_action(state) method exists
  • select_action returns integer (0 or 1)
  • select_action does NOT return tuple

📋 Training

  • train() function exists
  • Returns (scores, agent) tuple
  • Model saves to results/model.pt
  • Uses absolute path for saving

📋 Testing

  • Run: python my_solution.py (training works)
  • Run: python evaluate.py (evaluation works)
  • Score is reasonable (not just random ~20-30)

📓 Journal (You Need This!)

  • Wrote down at least 3 things you tried
  • For each one: what you expected + what happened
  • Answered the thinking questions
  • Wrote down what prompts you gave the AI
  • Explained what you changed in the AI's code

Quick Test Commands

# Train your agent
python my_solution.py

# Evaluate (should show learned behavior, not random)
python evaluate.py

# Evaluate with graph
python evaluate.py --graph

# Watch agent play
python evaluate.py --watch
            

🏆 Grading Scale Reminder

Grade              Average score
A (Excellent)      475+
B (Good)           400-474
C (Needs Work)     300-399
D (Keep Trying)    200-299
F (Not Learning)   Below 200
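
If you want to see where a run lands on this scale, a small lookup like the sketch below works. `letter_grade` is a hypothetical helper, and treating a fractional average such as 474.5 as falling in the lower band is an assumption, since the scale above lists whole-number ranges.

```python
def letter_grade(average_score):
    """Map an average evaluation score to the letter bands in the scale."""
    if average_score >= 475:
        return "A"
    if average_score >= 400:
        return "B"
    if average_score >= 300:
        return "C"
    if average_score >= 200:
        return "D"
    return "F"


for score in (500, 430, 310, 250, 25):
    print(score, letter_grade(score))
```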