The big idea for tonight
Building a system isn't enough — you have to know when it's broken. ML systems fail silently through degraded predictions, data drift, broken pipelines, and shifting user behavior. Tonight we build a portable monitoring framework you can drop into your capstone the day topics are revealed.
1 Learning Objectives
Primary
- Understand why monitoring is essential for AI products
- Design a monitoring strategy for an MVP
- Identify critical system metrics and failure signals
Supporting
- Distinguish logs vs metrics vs alerts
- Recognize data drift and performance degradation
- Design dashboards for AI systems
Key terms
- Monitoring
- Observability
- Metrics
- Logs
- Alerts
- Data drift
- Model performance degradation
- System health metrics
- User analytics
Out of scope
- Enterprise observability platforms (Datadog, New Relic full stack)
- Full production monitoring infrastructure
- Advanced distributed tracing systems
- Complex real-time analytics pipelines
2 The Three Monitoring Layers
Every AI system you'll ship has three observability surfaces. Miss one and you'll be debugging in the dark.
📦 Application Metrics — request rate, latency, errors
↓
🧠 Model Performance — accuracy, drift, prediction distribution
↓
🖥️ Infrastructure — CPU, memory, GPU, disk, network
Why silent failures are dangerous
A traditional web service tells you when it breaks — 500s in the logs, error rates spike, pagers go off. ML systems can keep returning HTTP 200 forever while quietly destroying business value:
- Recommendation engine still returns items, but they're stale and CTR has tanked.
- Fraud model still scores transactions, but new fraud patterns slip through.
- RAG assistant still answers questions, but cites a doc that was deprecated last quarter.
- Document classifier still labels — at 60% accuracy instead of 90%.
Rule of thumb: if a metric only tells you the service is up, it can't tell you whether the model is useful. Monitor both.
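A minimal sketch of the rule of thumb, assuming you can join each request to a ground-truth label that arrives later. The `Outcome` record and the ten-request window are hypothetical and exist only for illustration. The point is that an uptime-style metric and a usefulness metric are computed from different fields and can disagree completely.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    status_code: int  # HTTP status returned to the caller
    correct: bool     # did the prediction match the (later) ground-truth label?


def availability(outcomes):
    """Fraction of requests that did not return a 5xx status."""
    return sum(o.status_code < 500 for o in outcomes) / len(outcomes)


def accuracy(outcomes):
    """Fraction of predictions that matched ground truth."""
    return sum(o.correct for o in outcomes) / len(outcomes)


# Hypothetical window: every request returns 200, but only 6 of 10 predictions are right.
window = [Outcome(status_code=200, correct=(i < 6)) for i in range(10)]

print(f"availability={availability(window):.2f} accuracy={accuracy(window):.2f}")
```

Here the service looks perfectly healthy while the model is at the 60% accuracy described in the classifier example above.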
3 Four Systems, Four Monitoring Profiles
Each breakout team owns one of these in tonight's first lab. Notice how the critical metrics differ even though the "three layers" frame is the same.
RAG Policy Assistant
- Retrieval recall@k
- Citation correctness rate
- Answer latency p95
- Hallucination flag rate
- Doc-set freshness
Fraud Detection
- Precision / recall / FPR
- Score distribution shift
- Reviewer override rate
- Time to decision
- Feature null rate
Recommendation Engine
- CTR / engagement
- Catalog coverage
- Cold-start hit rate
- Diversity / novelty
- Latency under load
Document Classification
- Accuracy by class
- Confidence distribution
- Out-of-distribution rate
- Throughput (docs/sec)
- Label drift over time
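Several of these profiles include a distribution-shift metric (score distribution shift, confidence distribution, label drift). One way to put a number on such a shift is the Population Stability Index (PSI), sketched below. This is an illustrative choice, not a method the session prescribes. The function assumes scores lie in [0, 1], and the bin count, the `eps` floor, and the sample data are all assumptions.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two samples of scores in [0, 1]."""

    def proportions(scores):
        counts = [0] * bins
        for s in scores:
            idx = min(int(s * bins), bins - 1)  # clamp s == 1.0 into the last bin
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm below is always defined.
        return [max(c / len(scores), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical data: a uniform baseline and a copy shifted toward high scores.
baseline = [i / 100 for i in range(100)]
shifted = [min(s + 0.3, 1.0) for s in baseline]

print(f"no shift: {psi(baseline, baseline):.3f}")
print(f"shifted:  {psi(baseline, shifted):.3f}")
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but any threshold should be checked against your own system's history.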
4 Logs vs Metrics vs Alerts
Logs
Discrete events with structured context. Read after the fact when debugging.
2026-04-26 19:14:02 INFO predict req_id=ab7 user=42 latency_ms=183 conf=0.91
Metrics
Numeric measurements aggregated over time. Read on a dashboard.
predict_latency_ms_p95{model="rag-v3"} 412
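A p95 value like the one above is an aggregate over many individual request latencies. The sketch below uses the nearest-rank definition of a percentile, which is one of several conventions; real metrics backends may interpolate or estimate from histograms instead. The sample latencies are made up.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies (ms) from one aggregation window.
latencies_ms = [120, 135, 150, 160, 183, 190, 210, 240, 412, 980]

print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
```

In this window the median is modest while the p95 is dominated by one slow request, which is why tail percentiles are usually tracked alongside the median.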
Alerts
Rules that fire when metrics cross a threshold. Wake someone up.
if accuracy_24h < 0.85 for 30m → page on-call
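The rule above combines a threshold with a dwell time ("for 30m"). A minimal sketch of that logic follows, assuming the metric is sampled periodically and timestamps are passed in as seconds. The `ThresholdAlert` class is hypothetical, not part of any alerting library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdAlert:
    threshold: float       # fire when the metric stays below this value...
    dwell_seconds: float   # ...for at least this long
    breach_started: Optional[float] = None

    def observe(self, value: float, now: float) -> bool:
        """Record one sample; return True once the breach has lasted the full dwell time."""
        if value >= self.threshold:
            self.breach_started = None  # recovered: reset the dwell timer
            return False
        if self.breach_started is None:
            self.breach_started = now   # first sample of a new breach
        return now - self.breach_started >= self.dwell_seconds


# accuracy_24h < 0.85 for 30m, sampled every 15 minutes (timestamps in seconds).
alert = ThresholdAlert(threshold=0.85, dwell_seconds=30 * 60)
for t, acc in [(0, 0.80), (900, 0.82), (1800, 0.81)]:
    print(t, acc, "PAGE" if alert.observe(acc, now=t) else "ok")
```

The dwell timer resets whenever the metric recovers, so a single bad sample never pages anyone.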
The pipeline
Application logs
↓
Metrics aggregation (counts, percentiles, distributions)
↓
Alert rules (thresholds + dwell time)
↓
Dashboards / pages (humans get the signal)
A common mistake is alerting on raw logs. Logs are for debugging; metrics are for alerting. If an alert can't be expressed as metric > threshold for N minutes, it tends to fire on every transient blip, page constantly, and end up muted.
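The first two pipeline stages can be sketched as a small parser that turns log lines into aggregate numbers. This assumes every line follows the layout of the sample log line in the Logs section (date, time, level, message with key=value pairs) and that each request emits exactly one INFO line. Apart from the first, the sample lines are invented for illustration.

```python
import re
from collections import Counter

KEY_VALUE = re.compile(r"(\w+)=(\S+)")


def parse_line(line):
    """Return (level, fields) for one log line, or None if it doesn't fit the layout."""
    parts = line.split(maxsplit=3)  # date, time, level, message
    if len(parts) < 3:
        return None
    message = parts[3] if len(parts) > 3 else ""
    return parts[2], dict(KEY_VALUE.findall(message))


def aggregate(lines):
    """Collapse raw log lines into a handful of metrics for one time window."""
    levels = Counter()
    latencies = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        level, fields = parsed
        levels[level] += 1
        if level == "INFO" and "latency_ms" in fields:
            latencies.append(float(fields["latency_ms"]))
    requests = levels["INFO"]  # assumption: one INFO line per request
    return {
        "requests": requests,
        "errors": levels["ERROR"],
        "slow": levels["WARNING"],
        "error_rate": levels["ERROR"] / requests if requests else 0.0,
        "max_latency_ms": max(latencies, default=0.0),
    }


sample_lines = [
    "2026-04-26 19:14:02 INFO predict req_id=ab7 user=42 latency_ms=183 conf=0.91",
    "2026-04-26 19:14:03 INFO predict req_id=ab8 user=17 latency_ms=540 conf=0.77",
    "2026-04-26 19:14:03 WARNING slow predict req_id=ab8 latency_ms=540",
    "2026-04-26 19:14:04 INFO predict req_id=ab9 user=58 latency_ms=95 conf=0.64",
    "2026-04-26 19:14:04 ERROR model failure req_id=ab9",
]

print(aggregate(sample_lines))
```

An alert rule would then read `error_rate` from this output rather than matching on individual ERROR lines.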
5 3-Hour Session Schedule
0:00–0:10
Recap & Agenda
Week 30 review · build → operate shift · introduce monitoring concept
0:10–0:30
Core Teaching: Monitoring Architecture
Three layers · why silent failures matter for ML
0:30–0:55
Breakout 1: Monitoring Analysis
Teams pick a system (RAG / fraud / recsys / docs); identify metrics, failure signals, alert thresholds → monitoring_framework.md
0:55–1:05
Share-out
Each team presents · instructor highlights system-type differences
1:15–1:35
Core Teaching: Logs vs Metrics vs Alerts
Pipeline walk-through · real-world failures from poor monitoring
1:35–2:00
Lab 1: Logging Implementation
Build monitoring/logging_demo.py · simulate predict requests, latency warnings, errors
2:00–2:25
Breakout 2: Alert Design
Define accuracy / latency / error-rate alert conditions · update framework doc
2:35–2:55
Lab 2: Dashboard Plan
Sketch layout in monitoring_dashboard.md · system health · model metrics · user activity
2:55–3:00
Closing
How tonight's framework feeds the capstone operations plan
6 Hands-On Labs
Part A — Monitoring Framework
Create week31_ai_system_playbook/monitoring_framework.md. Include for your assigned system:
- Key metrics across the three layers
- Failure indicators (what does broken look like?)
- Alert thresholds with dwell times
Part B — Logging Script
Create monitoring/logging_demo.py. Use Python's stdlib logging — no external deps needed.
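import logging
import random
import time

# Timestamped, level-tagged lines; basicConfig writes to stderr by default.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def predict(req_id):
    """Simulate one prediction request and log its outcome."""
    latency = random.uniform(50, 600)  # simulated latency in ms
    logging.info(f"predict req_id={req_id} latency_ms={latency:.0f}")
    if latency > 500:  # latency warning threshold
        logging.warning(f"slow predict req_id={req_id} latency_ms={latency:.0f}")
    if random.random() < 0.05:  # roughly 5% simulated failure rate
        logging.error(f"model failure req_id={req_id}")

# Simulate 20 requests, one every 100 ms.
for i in range(20):
    predict(f"r{i:03d}")
    time.sleep(0.1)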
Part C — Dashboard Plan
Create week31_ai_system_playbook/monitoring_dashboard.md. Sketch a one-screen layout with three panels:
- System health — request rate, error rate, latency p50/p95/p99
- Model metrics — accuracy / drift / confidence distribution over the last 24h
- User activity — DAU on the feature, engagement, override / feedback rate
For each panel: which metric, what time window, what threshold turns the panel red.
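One way to pin down "what threshold turns the panel red" is to write each panel's rule as data plus a tiny status function. This is a sketch only. The metric names, readings, and thresholds below are hypothetical placeholders to be replaced with values from your own framework doc.

```python
def panel_status(value, red_threshold, higher_is_worse=True):
    """Return "red" when value is on the wrong side of red_threshold, else "green"."""
    if higher_is_worse:
        return "red" if value > red_threshold else "green"
    return "red" if value < red_threshold else "green"


# Hypothetical readings: (current value, red threshold, higher_is_worse)
panels = {
    "error_rate": (0.02, 0.05, True),
    "latency_p95_ms": (640, 500, True),
    "accuracy_24h": (0.88, 0.85, False),
}

for name, (value, threshold, higher_is_worse) in panels.items():
    print(f"{name}: {panel_status(value, threshold, higher_is_worse)}")
```

Note the direction flag: error rate and latency go red when they rise, while accuracy goes red when it falls.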
7 Anti-GenAI Requirement
Section title (required in every team's submission):
"Why these metrics matter for this AI system"
Teams must explain how the chosen metrics tie to system reliability and user impact for their assigned system specifically — not generic monitoring talking points. A reviewer should be able to tell whether the team understands fraud detection vs. RAG without reading the system name.
9 Closing Checklist
Before you leave tonight, every team should have:
- week31_ai_system_playbook/monitoring_framework.md
- week31_ai_system_playbook/monitoring_dashboard.md
- monitoring/logging_demo.py running locally
- The "Why these metrics matter for this AI system" section filled in
These artifacts roll forward into your capstone operations plan once topics drop.