
my_solution.py Template Guide

Section-by-section instructions for completing your CartPole agent

Overview & File Structure

🎯 Your Mission

Create an AI agent that learns to balance a pole on a cart. You'll use AI assistance to generate code, then adapt it to fit this template structure.

How This Works

The template has specific sections that must be filled in for your solution to work with the evaluator. Here's the workflow:

Design → Prompt AI → Adapt Code → Train → Evaluate

File Structure

W20D2_Student/
├── index.html                ← Start here!
├── my_solution.py            ← Your code goes here!
├── evaluate.py               ← Tests your solution
├── requirements.txt          ← Dependencies
├── guides/
│   ├── design_lab.html       ← Design your solution
│   ├── template_guide.html   ← You are here
│   └── project_journal.html  ← Track your experiments
└── results/                  ← Created when you train
    ├── model.pt              ← Your saved model
    └── my_results.json       ← Training stats

⚠️ Critical Requirement

Your my_solution.py must have a class called MyAgent with a select_action(state) method. The evaluator looks for this exact structure!
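
As a quick self-check before running the evaluator, the requirement above can be sketched as a small function. This is a minimal sketch and not the evaluator's actual logic; `check_agent_interface` is a hypothetical helper, and the `MyAgent` class shown is only a stand-in for illustration.

```python
import inspect


def check_agent_interface(agent_cls):
    """Return a list of problems; an empty list means the structure looks right."""
    problems = []
    if agent_cls.__name__ != "MyAgent":
        problems.append(f"class is named {agent_cls.__name__!r}, expected 'MyAgent'")
    method = getattr(agent_cls, "select_action", None)
    if not callable(method):
        problems.append("missing select_action method")
    else:
        # Expect two parameters: self and the state
        params = list(inspect.signature(method).parameters)
        if len(params) != 2:
            problems.append(f"select_action takes {params}, expected (self, state)")
    return problems


class MyAgent:  # stand-in for illustration only
    def select_action(self, state):
        return 0


print(check_agent_interface(MyAgent))
```

An empty list means the class name and method signature match what this guide describes. It does not prove the agent has learned anything.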

Documentation Header

🎯 Purpose

Document your design decisions and AI assistance. This helps instructors understand your thought process and ensures academic integrity.

What to Fill In

"""
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                        W20D2: MY CARTPOLE SOLUTION                           ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

    Student: [YOUR FULL NAME]
    Date: [TODAY'S DATE]
    Algorithm: [DQN / PPO / Genetic Algorithm]

┌──────────────────────────────────────────────────────────────────────────────┐
│                            DESIGN DECISIONS                                  │
└──────────────────────────────────────────────────────────────────────────────┘

    1. Algorithm Choice: [Your choice]
       Why:
       [2-3 sentences explaining your reasoning]

    2. Network Architecture:
       [Describe layers, sizes, activations]

       Why:
       [Explain why this architecture suits the problem]

    3. Key Enhancements:
       [List features: replay buffer, target network, etc.]

       Why:
       [Explain how each enhancement helps learning]

┌──────────────────────────────────────────────────────────────────────────────┐
│                           AI ASSISTANCE USED                                 │
└──────────────────────────────────────────────────────────────────────────────┘

    - [What prompts did you give the AI?]
    - [What did you modify from the AI's code?]
    - [What did you learn?]

"""

📝 Good Example

    1. Algorithm Choice: DQN (Deep Q-Network)
       Why:
       DQN suits discrete action spaces like CartPole's
       left/right actions. It learns a value function that predicts
       future rewards, and its replay buffer reuses past experience,
       which makes it sample-efficient.

    2. Network Architecture:
       - 2 hidden layers with 128 neurons each
       - ReLU activation functions
       - Output: 2 neurons (Q-values for left/right)

       Why:
       CartPole has only 4 input features, so a small network is
       sufficient. ReLU helps avoid vanishing gradients.

✅ Compliance Checklist

Your name is filled in (not "[YOUR FULL NAME]")
Date is today's actual date
Algorithm name matches what you implemented
Design decisions explain WHY, not just WHAT
AI assistance section is honest and specific

Imports

🎯 Purpose

Import all the libraries your solution needs. The template already includes the basics, but you may need to add more.

Default Imports (Already Included)

try:
    import gymnasium as gym
except ImportError:
    import gym
import numpy as np
import torch
import torch.nn as nn
import json
import os

Common Additional Imports by Algorithm

# Additional imports for DQN
import torch.optim as optim
import random
from collections import deque  # For replay buffer

# Additional imports for PPO
import torch.optim as optim
from torch.distributions import Categorical  # For action sampling

# Additional imports for Genetic Algorithm
import random
import copy  # For deep copying networks

⚠️ Don't Remove

Never remove the gymnasium import or its try/except block. It falls back to the older gym package when gymnasium is not installed, so the same file runs on both setups.

✅ Compliance Checklist

gymnasium/gym import is preserved
torch and torch.nn are imported
numpy is imported as np
All imports your code uses are included

Neural Network Class

🎯 Purpose

Define the neural network architecture that your agent will use to make decisions. This is the "brain" of your agent.

CartPole Input/Output

# CartPole State (Input): 4 values
state = [
    cart_position,      # Where the cart is (observed range -4.8 to 4.8; episode ends beyond ±2.4)
    cart_velocity,      # How fast it's moving
    pole_angle,         # Angle from vertical (observed range -0.42 to 0.42 rad; episode ends beyond ±0.21)
    pole_angular_velocity  # How fast the pole is rotating
]

# CartPole Actions (Output): 2 choices
action = 0  # Push cart LEFT
action = 1  # Push cart RIGHT
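
To make the state layout and action encoding concrete, here is a hand-written baseline that uses no learning at all. It is a minimal sketch under one stated assumption (a positive pole angle means the pole leans right); `heuristic_action` is a hypothetical name, and a trained agent should do much better than this rule.

```python
def heuristic_action(state):
    """Push the cart toward the side the pole is leaning."""
    cart_position, cart_velocity, pole_angle, pole_angular_velocity = state
    # Assumption: a positive pole_angle means the pole leans right,
    # so pushing right (action 1) moves the cart back under it.
    return 1 if pole_angle > 0 else 0


print(heuristic_action([0.0, 0.0, 0.05, 0.0]))   # pole leaning right
print(heuristic_action([0.0, 0.0, -0.05, 0.0]))  # pole leaning left
```

A rule like this is handy as a sanity check: if your trained agent scores below it, something in training is probably broken.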

Network Examples by Algorithm

class QNetwork(nn.Module):
    """
    Q-Network: Outputs Q-value for each action.
    Q(state, action) = expected future reward
    """
    def __init__(self, state_size=4, action_size=2, hidden_size=128):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_size, hidden_size),   # 4 → 128
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),  # 128 → 128
            nn.ReLU(),
            nn.Linear(hidden_size, action_size)   # 128 → 2
        )

    def forward(self, state):
        return self.network(state)  # Returns [Q(left), Q(right)]

class ActorCritic(nn.Module):
    """
    Actor-Critic: Two heads - policy (actor) and value (critic).
    """
    def __init__(self, state_size=4, action_size=2, hidden_size=64):
        super().__init__()
        # Shared layers
        self.shared = nn.Sequential(
            nn.Linear(state_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh()
        )
        # Actor head: outputs action logits (forward() turns them into probabilities)
        self.actor = nn.Linear(hidden_size, action_size)
        # Critic head: outputs state value
        self.critic = nn.Linear(hidden_size, 1)

    def forward(self, state):
        shared = self.shared(state)
        action_probs = torch.softmax(self.actor(shared), dim=-1)
        value = self.critic(shared)
        return action_probs, value

class PolicyNetwork(nn.Module):
    """
    Simple policy network for Genetic Algorithm.
    Outputs action probabilities directly.
    """
    def __init__(self, state_size=4, action_size=2, hidden_size=32):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, action_size),
            nn.Softmax(dim=-1)
        )

    def forward(self, state):
        return self.network(state)  # Returns [P(left), P(right)]

💡 Architecture Tips

CartPole is simple - 64-128 hidden units is usually enough
2 hidden layers work well for most RL problems
ReLU is a common default; Tanh is a popular choice for PPO
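
To see how small these networks really are, you can count their trainable parameters by hand. The sketch below assumes plain fully connected layers with biases, as in the QNetwork and PolicyNetwork examples above; `mlp_param_count` is a hypothetical helper, not part of the template.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases in a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total


print(mlp_param_count([4, 128, 128, 2]))  # QNetwork example: 17410
print(mlp_param_count([4, 32, 2]))        # PolicyNetwork example: 226
```

Doubling the hidden size roughly quadruples the size of the hidden-to-hidden layer, which is why bigger networks train more slowly.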

✅ Compliance Checklist

Class inherits from nn.Module
Has __init__ method calling super().__init__()
Has forward method that returns output
Input size is 4 (state dimensions)
Output size is 2 (number of actions)

MyAgent Class (CRITICAL)

⚠️ Most Important Section

This class MUST exist and MUST have a select_action(state) method. The evaluator imports this class directly!

🎯 Purpose

The MyAgent class wraps your entire agent - initialization, model loading, and action selection. The evaluator creates an instance of this class to test your agent.

Required Structure

class MyAgent:
    """
    Your CartPole agent.

    REQUIRED: select_action(state) method
    """

    def __init__(self):
        """Initialize your agent here."""
        # 1. Create your neural network
        self.network = YourNetwork()

        # 2. Load saved model if it exists (for evaluation)
        model_path = os.path.join(
            os.path.dirname(os.path.abspath(__file__)),
            "results", "model.pt"
        )
        if os.path.exists(model_path):
            self.network.load_state_dict(
                torch.load(model_path, weights_only=True)
            )
            self.network.eval()  # Set to evaluation mode

    def select_action(self, state):
        """
        Choose an action given the current state.

        Args:
            state: numpy array of shape (4,)

        Returns:
            action: 0 (left) or 1 (right)
        """
        # Convert state to tensor and get action
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            # ... your logic to select action ...
        return action  # Must be 0 or 1

select_action by Algorithm

def select_action(self, state):
    """DQN: Pick action with highest Q-value."""
    with torch.no_grad():
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        q_values = self.network(state_tensor)
        return q_values.argmax(dim=1).item()  # 0 or 1

def select_action(self, state):
    """PPO: Take the most probable action (greedy, for evaluation)."""
    with torch.no_grad():
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        probs, _ = self.network(state_tensor)
        # For evaluation, take most probable action
        return probs.argmax(dim=1).item()  # 0 or 1

def select_action(self, state):
    """Genetic: Pick action with highest probability."""
    with torch.no_grad():
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        probs = self.network(state_tensor)
        return probs.argmax(dim=1).item()  # 0 or 1

⚠️ Common Mistakes

Wrong return type: Must return an integer (0 or 1), not a tensor
Returning tuple: Don't return (action, value) - just the action
Missing model load: Agent won't work in evaluate.py without loading
Wrong path: Use absolute path with __file__ for model loading

✅ Compliance Checklist

Class is named exactly "MyAgent"
Has __init__ method that creates the network
__init__ loads saved model from results/model.pt
Has select_action(self, state) method
select_action returns integer 0 or 1
Uses torch.no_grad() during action selection

Training Function

🎯 Purpose

The training function is where your agent learns. It runs episodes, collects experience, and updates the neural network weights.

Required Structure

def train(episodes=500):
    """
    Train your agent.

    Returns:
        scores: list of episode scores
        agent: trained MyAgent instance
    """
    env = gym.make("CartPole-v1")
    agent = MyAgent()

    scores = []

    for episode in range(episodes):
        state, _ = env.reset()
        total_reward = 0
        done = False

        while not done:
            # 1. Select action
            action = agent.select_action(state)

            # 2. Take action in environment
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # 3. YOUR LEARNING LOGIC HERE
            # - Store experience
            # - Update network

            state = next_state
            total_reward += reward

        scores.append(total_reward)

        # Print progress
        if (episode + 1) % 10 == 0:
            avg = np.mean(scores[-100:])
            print(f"Episode {episode + 1}: Avg = {avg:.1f}")

    env.close()
    return scores, agent

Training Loop by Algorithm

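# DQN Training: Store experience, sample batch, update Q-network
# (replay_buffer, train_step, update_target_network, and decay_epsilon
# are attributes and methods you add to MyAgent yourself)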

# Inside the while loop:
# Store experience in replay buffer
agent.replay_buffer.push(state, action, reward, next_state, done)

# Sample batch and train
agent.train_step()

# Update target network (soft update)
agent.update_target_network()

# After episode:
agent.decay_epsilon()  # Reduce exploration over time
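
The `replay_buffer.push(...)` call above needs a buffer class behind it. One common way to build it is a fixed-size deque, sketched below. The class name `ReplayBuffer`, the `sample` method, and the default capacity are assumptions for illustration; your AI-generated code may structure this differently.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of past transitions (hypothetical helper)."""

    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)  # oldest entries drop off automatically

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return a random batch, regrouped into five tuples (one per field)."""
        batch = random.sample(self.memory, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)
```

A typical `train_step` would return early while `len(buffer) < batch_size`, so the agent only starts learning once it has enough memories to sample from.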

# PPO Training: Collect trajectory, compute advantages, update policy
# (select_action_with_info and update are methods you add to MyAgent yourself)

# Before each episode: start an empty trajectory
states, actions, rewards, log_probs, values = [], [], [], [], []

# Inside while loop:
action, log_prob, value = agent.select_action_with_info(state)
next_state, reward, terminated, truncated, _ = env.step(action)
states.append(state)
actions.append(action)
rewards.append(reward)
log_probs.append(log_prob)
values.append(value)

# After episode: compute advantages and update
agent.update(states, actions, rewards, log_probs, values)
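
The "compute advantages" step inside `agent.update` can be sketched in plain Python. This is the simplest version (discounted returns minus the critic's value estimates), not the generalized advantage estimation that many PPO implementations use; both function names are hypothetical.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def simple_advantages(returns, values):
    """Advantage = how much better the outcome was than the critic predicted."""
    return [g - v for g, v in zip(returns, values)]


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(returns)
print(simple_advantages(returns, [1.0, 1.0, 1.0]))
```

A positive advantage tells the policy to make that action more likely in that state. A negative one tells it to make the action less likely.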

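# Genetic Training: Evaluate population, select best, mutate
# (population_size, elite_size, evaluate_agent, select_top, and
# breed_new_generation are values and helpers you define yourself)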

# Create population of agents
population = [PolicyNetwork() for _ in range(population_size)]

# For each generation:
# 1. Evaluate fitness of each individual
fitness_scores = [evaluate_agent(net, env) for net in population]

# 2. Select top performers (elitism)
elite = select_top(population, fitness_scores, elite_size)

# 3. Create new generation through crossover + mutation
population = breed_new_generation(elite)
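
The selection and breeding helpers named above might look like the sketch below. To stay self-contained it makes two simplifying assumptions: each individual is a flat list of weights rather than a PolicyNetwork, and new individuals come from mutation only (no crossover). The extra `population_size` and `sigma` parameters and the `mutate` helper are hypothetical.

```python
import random


def select_top(population, fitness_scores, elite_size):
    """Keep the elite_size individuals with the highest fitness."""
    ranked = sorted(zip(fitness_scores, range(len(population))), reverse=True)
    return [population[i] for _, i in ranked[:elite_size]]


def mutate(weights, sigma=0.1):
    """Return a copy with Gaussian noise added to each weight."""
    return [w + random.gauss(0.0, sigma) for w in weights]


def breed_new_generation(elite, population_size, sigma=0.1):
    """Keep the elite unchanged, then fill the rest with mutated copies."""
    children = [mutate(random.choice(elite), sigma)
                for _ in range(population_size - len(elite))]
    return list(elite) + children
```

Keeping the elite unchanged means the best score never gets worse from one generation to the next, while the mutated copies explore nearby solutions.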

💡 Training Tips

Print progress every 10 episodes to monitor learning
Check for "solved" condition: avg score >= 475 over 100 episodes
CartPole-v1 max score is 500 (episode truncates at 500 steps)
If score isn't improving after 100 episodes, check your learning rate
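
The "solved" check from the tips above can be sketched as a small helper you call at the end of each episode. `is_solved` is a hypothetical name. The defaults follow the numbers in this guide (475 average over 100 episodes).

```python
def is_solved(scores, threshold=475.0, window=100):
    """True once the mean of the last `window` scores reaches the threshold."""
    if len(scores) < window:
        return False  # not enough episodes yet for a fair average
    recent = scores[-window:]
    return sum(recent) / window >= threshold


print(is_solved([500.0] * 99))   # too few episodes
print(is_solved([500.0] * 100))  # full window at the maximum score
```

Inside `train()`, you could `break` out of the episode loop when this returns True so you don't keep training an agent that has already solved the task.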

✅ Compliance Checklist

Function is named "train" with episodes parameter
Creates gym environment with "CartPole-v1"
Creates MyAgent instance
Returns (scores, agent) tuple
Prints progress during training
Closes environment at the end

Main Block & Saving Model

🎯 Purpose

The main block runs training when you execute the file and saves the trained model so the evaluator can load it.

Required Structure

if __name__ == "__main__":
    # Bootstrap virtual environment (already in template)
    ensure_venv()

    # Run training
    scores, agent = train()

    # Create results directory
    results_dir = os.path.join(
        os.path.dirname(os.path.abspath(__file__)),
        "results"
    )
    os.makedirs(results_dir, exist_ok=True)

    # CRITICAL: Save model weights
    model_path = os.path.join(results_dir, "model.pt")
    torch.save(agent.network.state_dict(), model_path)
    print(f"Model saved to {model_path}")

    # Save training statistics
    with open(os.path.join(results_dir, "my_results.json"), "w") as f:
        json.dump({
            "algorithm": "YOUR_ALGORITHM",
            "final_avg": float(np.mean(scores[-100:])),
            "best_score": float(max(scores)),
            "episodes": len(scores)
        }, f, indent=2)

    # Optional: Watch agent play
    # demo(agent)

⚠️ Critical: Model Saving

You MUST save your model to results/model.pt using the absolute path pattern shown above. The evaluator loads from this exact location!

Path Handling Explanation

# Why we use this pattern:
results_dir = os.path.join(
    os.path.dirname(os.path.abspath(__file__)),  # Directory containing my_solution.py
    "results"                                       # results subdirectory
)

# This ensures the path works regardless of:
# - What directory you run from (cd somewhere && python path/to/my_solution.py)
# - Operating system (Windows vs Mac vs Linux)
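
To see why the absolute-path pattern is robust, the sketch below wraps it in a function that takes the script path as an argument (in your real file that argument is `__file__`). `results_dir_for` is a hypothetical helper for illustration only.

```python
import os


def results_dir_for(script_path):
    """Build the results directory path next to the given script file."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.join(script_dir, "results")


# A relative and an absolute reference to the same file give the same answer
relative = results_dir_for("my_solution.py")
absolute = results_dir_for(os.path.join(os.getcwd(), "my_solution.py"))
print(relative == absolute)
```

Because `os.path.join` picks the right separator for the operating system, the same code works with backslashes on Windows and forward slashes on Mac and Linux.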

💡 What to Save

model.pt: Network weights (required for evaluation)
my_results.json: Training statistics (for your records)
Your network architecture must match between save and load!

✅ Compliance Checklist

Uses if __name__ == "__main__" guard
Calls ensure_venv() first
Calls train() and captures scores, agent
Creates results directory with os.makedirs
Saves model to results/model.pt
Uses absolute path with __file__
Saves training stats to my_results.json

Journaling Your Process

📓 Why Journal?

You learn by trying things and seeing what happens. Writing it down helps you remember what worked and what didn't!

What to Write Down

📝 Simple Experiment Log

## Try #1
**Time:** 2:30 PM
**What I changed:** Made the network bigger (64 → 128 neurons)
**What I expected:** It should learn better with more brain power
**What happened:** Score went from 180 to 290 - nice!
**What I learned:** Bigger network = smarter agent

## Try #2
**Time:** 2:45 PM
**What I changed:** Made learning slower (0.01 → 0.001)
**What I expected:** It won't learn as fast but might be smoother
**What happened:** Score jumped to 420!
**What I learned:** Going too fast made it miss the good answers

The Try-It-And-See Loop

1. Look

What score is your agent getting? Is it getting better?

2. Guess

What could you change to make it better?

3. Try It

Change ONE thing and run it again.

4. Write It Down

Did it work? Why do you think so?

⚠️ One Thing at a Time!

Only change ONE thing between each test. If you change three things at once, you won't know which one made the difference!

Questions to Answer in Your Journal

🤔 Think About These

Why did I pick this algorithm? What made it seem like a good choice?
What was the trickiest problem I ran into? How did I fix it?
Which setting made the biggest difference in my score?
If I could start over, what would I do differently?
What did I learn from reading the AI's code?
What parts of the AI code did I have to fix or change?

💡 Where to Write

Open project_journal.html in your browser. That's where you'll keep track of everything you try!

Experimentation Guide

🔬 Playing With the Settings

Your agent has a bunch of settings you can change. Some make it learn faster, some make it smarter, some break it completely! Here's what each one does.

Settings You Can Change

Brain Size (Network)

Neurons per Layer	`32, 64, 128, 256`	More neurons = smarter but slower
Number of Layers	`1, 2, 3`	More layers = can learn harder stuff
Activation Type	`ReLU, Tanh`	Different ways neurons "fire"

💡 Start Small

CartPole is pretty simple. Start with 2 layers of 64 neurons. Only make it bigger if it's not learning.

How Fast It Learns

Learning Rate	`0.0001, 0.0003, 0.001, 0.01`	Too high = unstable, too low = slow to learn
Discount (gamma)	`0.95, 0.99, 0.999`	How much it cares about future rewards
Batch Size	`32, 64, 128`	How many memories to learn from at once
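
One way to get a feel for gamma is the rule of thumb that rewards stop mattering much after about 1 / (1 - gamma) steps. The sketch below applies it to the values in the table. It is an approximation for building intuition, and both function names are hypothetical.

```python
def effective_horizon(gamma):
    """Rough number of future steps that still matter: 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


def discount_weight(gamma, steps_ahead):
    """Weight given to a reward that arrives steps_ahead steps in the future."""
    return gamma ** steps_ahead


for gamma in (0.95, 0.99, 0.999):
    print(gamma, round(effective_horizon(gamma)), round(discount_weight(gamma, 100), 3))
```

Since a CartPole-v1 episode can last up to 500 steps, a gamma of 0.95 makes the agent fairly short-sighted, while 0.99 lets it care about rewards a good way into the episode.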

⚠️ Learning Rate is #1

If your agent isn't learning anything, the learning rate is usually why. Try 0.001 first. If scores swing wildly, go lower. If scores barely improve, go higher.

Random vs Smart (DQN)

Starting Epsilon	`1.0`	Starts 100% random
Epsilon Decay	`0.99, 0.995, 0.999`	How fast it stops being random
Minimum Epsilon	`0.01, 0.001`	Never go fully predictable
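
During training, these epsilon settings are typically used in an epsilon-greedy rule: with probability epsilon pick a random action, otherwise pick the action with the highest Q-value. The sketch below works on a plain list of Q-values. `epsilon_greedy` is a hypothetical helper, and in your agent the Q-values would come from the network.

```python
import random


def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, else the best-looking one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


print(epsilon_greedy([0.1, 0.9], epsilon=0.0))  # always greedy when epsilon is 0
```

Keep this rule for training only. The `select_action` method that the evaluator calls should stay greedy so that evaluation scores are not dragged down by random moves.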

📊 How Random Decreases Over Time

# With decay=0.995 per episode, starting from epsilon=1.0:
Episode 100: 61% random  # Still trying lots of stuff
Episode 200: 37% random  # About one-third random
Episode 300: 22% random  # Mostly using what it learned
Episode 500:  8% random  # Almost always smart choices
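
You can reproduce a schedule like this for any decay value before committing to a long training run. The sketch assumes epsilon is multiplied by the decay factor once per episode and is clamped at a minimum; `epsilon_at` is a hypothetical helper.

```python
def epsilon_at(episode, start=1.0, decay=0.995, minimum=0.01):
    """Exploration rate after `episode` decay steps, never below `minimum`."""
    return max(minimum, start * decay ** episode)


for episode in (100, 200, 300, 500):
    print(f"Episode {episode}: {epsilon_at(episode):.1%} random")
```

Try `decay=0.99` and `decay=0.999` to see how much faster or slower the agent stops exploring.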

Memory Settings (DQN)

Memory Size	`5000, 10000, 50000`	How many past moves to remember
Target Update	`every step, every 10`	How often to update the "teacher" network
Update Speed (τ)	`0.001, 0.005, 0.01`	Lower = more stable but slower
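
The update speed τ controls a soft update: after each step the target ("teacher") network moves a small fraction τ of the way toward the online network. The sketch below shows the blend on plain lists of numbers. In a real agent the same formula is applied to every network parameter. `soft_update` is a hypothetical name.

```python
def soft_update(target_weights, online_weights, tau=0.005):
    """Blend: target <- tau * online + (1 - tau) * target, element by element."""
    return [tau * w_online + (1.0 - tau) * w_target
            for w_target, w_online in zip(target_weights, online_weights)]


target = [0.0, 0.0]
online = [1.0, -1.0]
for _ in range(100):
    target = soft_update(target, online)
print(target)  # has moved part of the way toward the online weights
```

With a small τ the target changes slowly, which keeps the learning targets steady. Setting τ to 1.0 copies the online weights outright.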

Things to Try

🧪 Try #1: Brain Size

Try 32 neurons, then 128, then 256. Which one learns best? Which is fastest?

🧪 Try #2: Learning Speed

Try 0.01, 0.001, and 0.0001. What happens when it's too fast? Too slow?

🧪 Try #3: No Target Network

What happens if you remove the target network? Does it still learn? Is it stable?

🧪 Try #4: Always Random?

What if it stays 10% random forever? What if it stops being random really fast?

📈 Write It All Down!

For each thing you try, write down: (1) What you changed, (2) What you thought would happen, (3) What actually happened, (4) What you learned. This is how real engineers work!

Final Compliance Checklist

🎯 Before Submitting

Go through this complete checklist to ensure your solution will work with the evaluator and meets all requirements.

📋 File Structure

my_solution.py exists in the W20D2_Student folder
results/model.pt exists after training
No syntax errors (file runs without crashing)
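
The file-structure items above can be checked automatically with a short script. This is a minimal sketch: `presubmit_check` is a hypothetical helper, and it only parses `my_solution.py` for syntax errors without running it, so it cannot tell you whether training actually works.

```python
import ast
import os


def presubmit_check(project_dir):
    """Return {check name: passed?} for the file-structure items above."""
    solution = os.path.join(project_dir, "my_solution.py")
    model = os.path.join(project_dir, "results", "model.pt")
    results = {
        "my_solution.py exists": os.path.isfile(solution),
        "results/model.pt exists": os.path.isfile(model),
        "no syntax errors": False,
    }
    if results["my_solution.py exists"]:
        try:
            with open(solution, encoding="utf-8") as f:
                ast.parse(f.read())  # parses only; does not run the file
            results["no syntax errors"] = True
        except SyntaxError:
            pass
    return results
```

Call it with the path to your W20D2_Student folder and print the result. Any False entry points at the checklist item to fix.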

📋 Documentation Header

Your name is filled in
Date is correct
Algorithm name matches implementation
Design decisions explain reasoning
AI assistance is documented honestly

📋 MyAgent Class (CRITICAL)

Class is named exactly "MyAgent"
__init__ creates the neural network
__init__ loads model from results/model.pt
select_action(state) method exists
select_action returns integer (0 or 1)
select_action does NOT return tuple

📋 Training

train() function exists
Returns (scores, agent) tuple
Model saves to results/model.pt
Uses absolute path for saving

📋 Testing

Run: python my_solution.py (training works)
Run: python evaluate.py (evaluation works)
Score is reasonable (not just random ~20-30)

📓 Journal (You Need This!)

Wrote down at least 3 things you tried
For each one: what you expected + what happened
Answered the thinking questions
Wrote down what prompts you gave the AI
Explained what you changed in the AI's code

Quick Test Commands

# Train your agent
python my_solution.py

# Evaluate (should show learned behavior, not random)
python evaluate.py

# Evaluate with graph
python evaluate.py --graph

# Watch agent play
python evaluate.py --watch

🏆 Grading Scale Reminder

A (Excellent)	475+ average score
B (Good)	400-474 average score
C (Needs Work)	300-399 average score
D (Keep Trying)	200-299 average score
F (Not Learning)	Below 200 average score
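
If you log many experiments, it can help to turn the grading scale into a function. The sketch below assumes a fractional average falls into the lower band (so 474.9 counts as a B); the table above does not say how such cases are handled. `letter_grade` and `GRADE_BANDS` are hypothetical names.

```python
# (minimum average score, letter grade), highest band first
GRADE_BANDS = [(475, "A"), (400, "B"), (300, "C"), (200, "D")]


def letter_grade(average_score):
    """Map an average evaluation score to a letter grade."""
    for minimum, grade in GRADE_BANDS:
        if average_score >= minimum:
            return grade
    return "F"


print(letter_grade(480.0))
print(letter_grade(250.0))
```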