AISE Class Hub · W31 · D1

Monitoring MVP Systems

Knowing whether your AI system is actually working

Week 31 · Day 1 · Capstone Prep

The big idea for tonight

Building a system isn't enough — you have to know when it's broken. ML systems fail silently through degraded predictions, data drift, broken pipelines, and shifting user behavior. Tonight we build a portable monitoring framework you can drop into your capstone the day topics are revealed.

1 Learning Objectives

Primary

  • Understand why monitoring is essential for AI products
  • Design a monitoring strategy for an MVP
  • Identify critical system metrics and failure signals

Supporting

  • Distinguish logs vs metrics vs alerts
  • Recognize data drift and performance degradation
  • Design dashboards for AI systems

Key terms

Monitoring · Observability · Metrics · Logs · Alerts · Data drift · Model performance degradation · System health metrics · User analytics

Out of scope

2 The Three Monitoring Layers

Every AI system you'll ship has three observability surfaces. Miss one and you'll be debugging in the dark.

📦 Application Metrics — request rate, latency, errors
🧠 Model Performance — accuracy, drift, prediction distribution
🖥️ Infrastructure — CPU, memory, GPU, disk, network

Why silent failures are dangerous

A traditional web service tells you when it breaks: 500s in the logs, error rates spike, pagers go off. ML systems can keep returning HTTP 200 indefinitely while quietly destroying business value through degraded predictions, data drift, broken pipelines, or shifting user behavior.

Rule of thumb: if a metric only tells you the service is up, it can't tell you whether the model is useful. Monitor both.

3 Four Systems, Four Monitoring Profiles

Each breakout team owns one of these in tonight's first lab. Notice how the critical metrics differ even though the "three layers" frame is the same.

RAG Policy Assistant

  • Retrieval recall@k
  • Citation correctness rate
  • Answer latency p95
  • Hallucination flag rate
  • Doc-set freshness

Fraud Detection

  • Precision / recall / FPR
  • Score distribution shift
  • Reviewer override rate
  • Time to decision
  • Feature null rate
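The first bullet can be computed from confusion counts once reviewer labels arrive. The following is a minimal sketch, assuming binary labels where 1 means fraud; the function name `fraud_rates` and the sample data are illustrative and not part of the course materials.

```python
def fraud_rates(y_true, y_pred):
    """Return precision, recall, and false positive rate for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate but flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud that slipped through
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate and passed
    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "fpr": fpr}


# Made-up labels for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
rates = fraud_rates(y_true, y_pred)
print(rates)
```

In production the true labels usually arrive late (chargebacks, reviewer decisions), so these numbers lag the live traffic. That lag is one reason the list also includes label-free signals such as score distribution shift.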

Recommendation Engine

  • CTR / engagement
  • Catalog coverage
  • Cold-start hit rate
  • Diversity / novelty
  • Latency under load

Document Classification

  • Accuracy by class
  • Confidence distribution
  • Out-of-distribution rate
  • Throughput (docs/sec)
  • Label drift over time
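Bullets such as "Score distribution shift" and "Confidence distribution" need a number before they can go on a dashboard. One common choice is the Population Stability Index (PSI); the hub does not prescribe a method, so treat this as an assumption. The sketch below assumes scores lie in [0, 1] and uses ten equal-width bins. The function name `psi` and the sample data are illustrative.

```python
import math


def psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index between two samples of scores in [0, 1]."""

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = min(int(v * bins), bins - 1)  # clamp v == 1.0 into the last bin
            counts[idx] += 1
        total = len(values)
        return [max(c / total, eps) for c in counts]  # eps avoids log(0) on empty bins

    b = fractions(baseline)
    c = fractions(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))


baseline = [i / 100 for i in range(100)]        # evenly spread reference scores
same = list(baseline)                           # identical distribution
shifted = [0.5 + i / 200 for i in range(100)]   # scores squeezed into [0.5, 1.0)

psi_same = psi(baseline, same)
psi_shifted = psi(baseline, shifted)
print(f"psi_same={psi_same:.3f} psi_shifted={psi_shifted:.3f}")
```

A frequently quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but the right threshold depends on the system and is worth calibrating against your own historical data.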

4 Logs vs Metrics vs Alerts

Logs

Discrete events with structured context. Read after the fact when debugging.

2026-04-26 19:14:02 INFO predict req_id=ab7 user=42 latency_ms=183 conf=0.91
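Because the line above uses key=value pairs, it can be parsed back into fields without a regular expression. This is a minimal sketch that assumes exactly that layout (date, time, level, a one-word event name, then key=value tokens); `parse_log_line` is a hypothetical helper.

```python
def parse_log_line(line):
    """Split 'DATE TIME LEVEL EVENT k=v k=v ...' into a dict of string fields."""
    date, clock, level, event, *pairs = line.split()
    record = {"timestamp": f"{date} {clock}", "level": level, "event": event}
    for pair in pairs:
        key, _, value = pair.partition("=")  # split on the first '=' only
        record[key] = value
    return record


line = "2026-04-26 19:14:02 INFO predict req_id=ab7 user=42 latency_ms=183 conf=0.91"
record = parse_log_line(line)
print(record["level"], record["latency_ms"], record["conf"])
```

Values come back as strings, so convert them (`int`, `float`) before aggregating. Logging in a consistent key=value shape is what makes the later aggregation step cheap.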

Metrics

Numeric measurements aggregated over time. Read on a dashboard.

predict_latency_ms_p95{model="rag-v3"} 412
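A p95 like the one above is an aggregate over many individual latencies. The sketch below uses the nearest-rank definition of a percentile, which is one of several in use; the `percentile` helper and the latency samples are illustrative.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


# Made-up latencies in milliseconds for one aggregation window.
latencies_ms = [120, 95, 183, 410, 150, 130, 99, 175, 160, 612,
                140, 110, 105, 125, 135, 145, 155, 165, 170, 180]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(f'predict_latency_ms_p95{{model="rag-v3"}} {p95}')
```

With these samples the median is 145 ms while the p95 is 410 ms, which is why tail percentiles are watched separately from averages: two slow requests barely move the median.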

Alerts

Rules that fire when metrics cross a threshold. Wake someone up.

if accuracy_24h < 0.85 for 30m → page on-call
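The rule above has two parts: a threshold (0.85) and a dwell time (30m). A minimal evaluator might look like the sketch below, which assumes one metric sample per minute; `DwellAlert` is a hypothetical class, not a real alerting library.

```python
class DwellAlert:
    """Fire only after the metric has stayed below the threshold for `dwell` consecutive samples."""

    def __init__(self, threshold, dwell):
        self.threshold = threshold
        self.dwell = dwell
        self.breaches = 0  # consecutive samples below the threshold so far

    def observe(self, value):
        """Record one sample and return True if the alert should fire."""
        if value < self.threshold:
            self.breaches += 1
        else:
            self.breaches = 0  # any recovery resets the dwell timer
        return self.breaches >= self.dwell


# Assumes one accuracy_24h sample per minute, so dwell=30 means 30 minutes.
alert = DwellAlert(threshold=0.85, dwell=30)
samples = [0.90] * 5 + [0.80] * 10 + [0.90] + [0.80] * 30
fired = [alert.observe(v) for v in samples]
first_page = fired.index(True) if True in fired else None
print(f"first page at sample index {first_page}")
```

The ten-minute dip in the middle never pages because the recovery resets the counter. Only the sustained 30-sample breach fires, which is the behavior the dwell time exists to provide.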

The pipeline

Application logs
        ↓
Metrics aggregation  (counts, percentiles, distributions)
        ↓
Alert rules          (thresholds + dwell time)
        ↓
Dashboards / pages   (humans get the signal)
            
One mistake students often make: alerting on raw logs. Logs are for debugging; metrics are for alerting. If your alert can't be expressed as "metric crosses threshold for N minutes", it'll page constantly and get muted.

5 3-Hour Session Schedule

0:00–0:10
Recap & Agenda: Week 30 review · build → operate shift · introduce monitoring concept
0:10–0:30
Core Teaching — Monitoring Architecture: Three layers · why silent failures matter for ML
0:30–0:55
Breakout 1 — Monitoring Analysis: Teams pick a system (RAG / fraud / recsys / docs); identify metrics, failure signals, alert thresholds → monitoring_framework.md
0:55–1:05
Share-out: Each team presents · instructor highlights system-type differences
1:05–1:15
Break (10m)
1:15–1:35
Core Teaching — Logs vs Metrics vs Alerts: Pipeline walk-through · real-world failures from poor monitoring
1:35–2:00
Lab 1 — Logging Implementation: Build monitoring/logging_demo.py · simulate predict requests, latency warnings, errors
2:00–2:25
Breakout 2 — Alert Design: Define accuracy / latency / error-rate alert conditions · update framework doc
2:25–2:35
Break (10m)
2:35–2:55
Lab 2 — Dashboard Plan: Sketch layout in monitoring_dashboard.md · system health · model metrics · user activity
2:55–3:00
Closing: How tonight's framework feeds the capstone operations plan

6 Hands-On Labs

Part A — Monitoring Framework

Create week31_ai_system_playbook/monitoring_framework.md. Include for your assigned system:

  • Critical metrics
  • Failure signals
  • Alert thresholds

Part B — Logging Script

Create monitoring/logging_demo.py. Use Python's stdlib logging — no external deps needed.

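import logging
import random
import time

# Timestamped, levelled output so each line can be grepped or parsed later.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("logging_demo")

SLOW_MS = 500        # latency above this logs a WARNING
FAILURE_RATE = 0.05  # fraction of requests that simulate a model failure

def predict(req_id):
    """Simulate one predict request and log what happened."""
    latency = random.uniform(50, 600)  # simulated latency in milliseconds
    logger.info("predict req_id=%s latency_ms=%.0f", req_id, latency)
    if latency > SLOW_MS:
        logger.warning("slow predict req_id=%s latency_ms=%.0f", req_id, latency)
    if random.random() < FAILURE_RATE:
        logger.error("model failure req_id=%s", req_id)

if __name__ == "__main__":
    # Simulate 20 requests, spaced 100 ms apart.
    for i in range(20):
        predict(f"r{i:03d}")
        time.sleep(0.1)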
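To turn log lines like these into a number that Breakout 2 can alert on, count records by level. The sketch below attaches a custom `logging.Handler` subclass to a logger; `LevelCounter` and the logger name `counter_demo` are hypothetical, and the loop is a deterministic stand-in for the random predict calls.

```python
import logging
from collections import Counter


class LevelCounter(logging.Handler):
    """Count emitted log records by level name."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def emit(self, record):
        self.counts[record.levelname] += 1


logger = logging.getLogger("counter_demo")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo quiet: do not forward to root handlers
counter = LevelCounter()
logger.addHandler(counter)

# Stand-in for 20 predict calls: one INFO per request, plus an ERROR on two of them.
for i in range(20):
    logger.info("predict req_id=r%03d", i)
    if i in (4, 11):
        logger.error("model failure req_id=r%03d", i)

error_rate = counter.counts["ERROR"] / counter.counts["INFO"]
print(f"error_rate={error_rate:.2f}")
```

Dividing by the INFO count works here only because the demo logs exactly one INFO line per request. A real service would more likely keep an explicit request counter.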

Part C — Dashboard Plan

Create week31_ai_system_playbook/monitoring_dashboard.md. Sketch a one-screen layout with three panels:

  • System health
  • Model metrics
  • User activity

For each panel: which metric, what time window, what threshold turns the panel red.
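Those three answers per panel (metric, window, red threshold) fit in a small data structure, which can help keep the sketch honest. The example below is one possible shape, not a required format. The panel titles follow the Lab 2 schedule entry, while the `Panel` class, metric names, windows, and thresholds are placeholders for illustration.

```python
from dataclasses import dataclass


@dataclass
class Panel:
    title: str
    metric: str
    window: str              # time window shown on the panel
    threshold: float         # value at which the panel changes color
    red_when: str = "above"  # "above" or "below" the threshold

    def status(self, value):
        """Return 'red' if the value breaches the threshold, else 'green'."""
        if self.red_when == "above":
            breached = value > self.threshold
        else:
            breached = value < self.threshold
        return "red" if breached else "green"


# Placeholder metrics and thresholds, not recommended values.
panels = [
    Panel("System health", "error_rate", "15m", 0.05),
    Panel("Model metrics", "accuracy_24h", "24h", 0.85, red_when="below"),
    Panel("User activity", "requests_per_min", "1h", 10, red_when="below"),
]
current = {"error_rate": 0.02, "accuracy_24h": 0.81, "requests_per_min": 140}
statuses = {p.title: p.status(current[p.metric]) for p in panels}
print(statuses)
```

Note that the direction matters: error rate turns red when it rises, while accuracy and traffic turn red when they fall. Writing the direction down per panel avoids an upside-down threshold on the dashboard.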

7 Anti-GenAI Requirement

Section title (required in every team's submission):

"Why these metrics matter for this AI system"

Teams must explain how the chosen metrics tie to system reliability and user impact for their assigned system specifically, rather than repeating generic monitoring talking points. A reviewer should be able to tell from the explanation alone, without reading the system name, whether the team is describing fraud detection or RAG.

8 Pre-class Reading

9 Closing Checklist

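Before you leave tonight, every team should have:

  • week31_ai_system_playbook/monitoring_framework.md, including the "Why these metrics matter for this AI system" section
  • monitoring/logging_demo.py
  • week31_ai_system_playbook/monitoring_dashboard.md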

These artifacts roll forward into your capstone operations plan once topics drop.