Monitoring MVP Systems

Knowing whether your AI system is actually working

Week 31 · Day 1 · Capstone Prep

The big idea for tonight

Building a system isn't enough — you have to know when it's broken. ML systems fail silently through degraded predictions, data drift, broken pipelines, and shifting user behavior. Tonight we build a portable monitoring framework you can drop into your capstone the day topics are revealed.

1 Learning Objectives

Primary

Understand why monitoring is essential for AI products
Design a monitoring strategy for an MVP
Identify critical system metrics and failure signals

Supporting

Distinguish logs vs metrics vs alerts
Recognize data drift and performance degradation
Design dashboards for AI systems

Key terms

Monitoring · Observability · Metrics · Logs · Alerts · Data drift · Model performance degradation · System health metrics · User analytics

Out of scope

Enterprise observability platforms (Datadog, New Relic full stack)
Full production monitoring infrastructure
Advanced distributed tracing systems
Complex real-time analytics pipelines

2 The Three Monitoring Layers

Every AI system you'll ship has three observability surfaces. Miss one and you'll be debugging in the dark.

📦 Application Metrics — request rate, latency, errors

↓

🧠 Model Performance — accuracy, drift, prediction distribution

↓

🖥️ Infrastructure — CPU, memory, GPU, disk, network

Why silent failures are dangerous

A traditional web service tells you when it breaks — 500s in the logs, error rates spike, pagers go off. ML systems can keep returning HTTP 200 forever while quietly destroying business value:

Recommendation engine still returns items, but they're stale and CTR has tanked.
Fraud model still scores transactions, but new fraud patterns slip through.
RAG assistant still answers questions, but cites a doc that was deprecated last quarter.
Document classifier still labels — at 60% accuracy instead of 90%.

Rule of thumb: if a metric only tells you the service is up, it can't tell you whether the model is useful. Monitor both.
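To make the rule of thumb concrete, here is a minimal sketch that computes an "is it up" number and an "is it useful" number from the same set of requests. The `Outcome` record and its field names are hypothetical, and it assumes ground-truth labels eventually arrive for at least some requests.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    """One served request. All field names are hypothetical."""
    http_status: int
    predicted: str
    actual: Optional[str] = None  # ground truth, if it ever arrives


def availability(outcomes):
    """Share of requests that returned a 2xx status; None for an empty window."""
    if not outcomes:
        return None
    ok = sum(1 for o in outcomes if 200 <= o.http_status < 300)
    return ok / len(outcomes)


def labelled_accuracy(outcomes):
    """Accuracy over requests that have a ground-truth label; None if there are none."""
    labelled = [o for o in outcomes if o.actual is not None]
    if not labelled:
        return None
    correct = sum(1 for o in labelled if o.predicted == o.actual)
    return correct / len(labelled)


# Ten requests, all HTTP 200, but only six predictions match the label.
outcomes = (
    [Outcome(200, "invoice", "invoice")] * 6
    + [Outcome(200, "invoice", "contract")] * 4
)
print(f"availability={availability(outcomes):.2f}")   # availability=1.00
print(f"accuracy={labelled_accuracy(outcomes):.2f}")  # accuracy=0.60
```

A dashboard showing only the first number would look healthy here; the second number is the one that exposes the silent failure.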

3 Four Systems, Four Monitoring Profiles

Each team takes one of these systems in tonight's first breakout. Notice how the critical metrics differ even though the "three layers" frame is the same.

RAG Policy Assistant

Retrieval recall@k
Citation correctness rate
Answer latency p95
Hallucination flag rate
Doc-set freshness

Fraud Detection

Precision / recall / FPR
Score distribution shift
Reviewer override rate
Time to decision
Feature null rate

Recommendation Engine

CTR / engagement
Catalog coverage
Cold-start hit rate
Diversity / novelty
Latency under load

Document Classification

Accuracy by class
Confidence distribution
Out-of-distribution rate
Throughput (docs/sec)
Label drift over time
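Several of these profiles include a distribution metric (score distribution shift, confidence distribution). The session does not prescribe a drift statistic, so the sketch below swaps in one common choice, the Population Stability Index (PSI), purely as an illustration. It assumes scores lie in [0, 1]; the bin count and the `eps` floor are arbitrary assumptions.

```python
import math


def psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index between two samples of scores in [0, 1]."""
    def shares(scores):
        counts = [0] * bins
        for s in scores:
            # Clamp so out-of-range scores and exactly 1.0 land in an edge bin.
            idx = min(max(int(s * bins), 0), bins - 1)
            counts[idx] += 1
        total = len(scores)
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / total, eps) for c in counts]

    expected = shares(baseline)
    actual = shares(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Hypothetical data: a uniform baseline vs. scores bunched in the top half.
baseline = [(i + 0.5) / 10 for i in range(10)]
shifted = [0.55, 0.65, 0.75, 0.85, 0.95] * 2

print(psi(baseline, baseline))        # 0.0, identical distributions
print(psi(baseline, shifted) > 0.25)  # True
```

A frequently quoted rule of thumb treats PSI above roughly 0.25 as a significant shift. Treat that as a starting point to tune per system, not a fixed standard.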

4 Logs vs Metrics vs Alerts

Logs

Discrete events with structured context. Read after the fact when debugging.

2026-04-26 19:14:02 INFO predict req_id=ab7 user=42 latency_ms=183 conf=0.91
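The `key=value` layout is what makes this line "structured": a few lines of code can turn it back into fields. A minimal sketch, assuming the layout is exactly `date time LEVEL event key=value ...` with a single-word event name (`parse_log_line` is a hypothetical helper):

```python
def parse_log_line(line):
    """Split a 'date time LEVEL event key=value ...' line into a dict."""
    date, clock, level, event, *pairs = line.split()
    record = {"ts": f"{date} {clock}", "level": level, "event": event}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if sep:  # skip tokens that are not key=value
            record[key] = value
    return record


line = "2026-04-26 19:14:02 INFO predict req_id=ab7 user=42 latency_ms=183 conf=0.91"
record = parse_log_line(line)
print(record["req_id"], int(record["latency_ms"]), float(record["conf"]))
# ab7 183 0.91
```

All values come back as strings, so the caller decides which fields to convert to numbers.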

Metrics

Numeric measurements aggregated over time. Read on a dashboard.

predict_latency_ms_p95{model="rag-v3"} 412
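A p95 value like this one is just an aggregate over a window of raw latencies. The sketch below uses the nearest-rank definition of a percentile (one of several in use) and formats the result in the same `name{label="value"} value` style. The latency values are made up for illustration, and `percentile` and `metric_line` are hypothetical helper names.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty window is undefined")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def metric_line(name, labels, value):
    """Format one sample in the name{label="value"} value style."""
    label_text = ",".join(f'{k}="{v}"' for k, v in labels.items())
    return f"{name}{{{label_text}}} {value}"


# Hypothetical latencies (ms) for one aggregation window.
window = [120, 135, 150, 160, 175, 183, 190, 210, 240, 260,
          280, 300, 310, 330, 350, 365, 380, 395, 412, 590]
p95 = percentile(window, 95)
print(metric_line("predict_latency_ms_p95", {"model": "rag-v3"}, p95))
# predict_latency_ms_p95{model="rag-v3"} 412
```

Note that the single 590 ms outlier does not move the p95, which is why percentiles are usually preferred over a maximum for latency panels.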

Alerts

Rules that fire when metrics cross a threshold. Wake someone up.

if accuracy_24h < 0.85 for 30m → page on-call
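The "for 30m" part of that rule is the dwell time: the metric has to stay in breach continuously before anyone is paged. A minimal sketch of that logic, assuming samples arrive with a timestamp in seconds (`BelowThresholdAlert` and the sample values are hypothetical):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BelowThresholdAlert:
    """Fires once the metric has stayed below `threshold` for `dwell_s` seconds."""
    threshold: float
    dwell_s: float
    breach_started: Optional[float] = None  # when the current breach began

    def observe(self, now_s, value):
        """Record one sample; return True if the alert should fire."""
        if value >= self.threshold:
            self.breach_started = None  # recovered, reset the dwell clock
            return False
        if self.breach_started is None:
            self.breach_started = now_s
        return now_s - self.breach_started >= self.dwell_s


alert = BelowThresholdAlert(threshold=0.85, dwell_s=30 * 60)
# One hypothetical accuracy_24h sample every 10 minutes.
samples = [(0, 0.90), (600, 0.84), (1200, 0.83), (1800, 0.86),
           (2400, 0.82), (3000, 0.81), (3600, 0.80), (4200, 0.79)]
fired = [t for t, acc in samples if alert.observe(t, acc)]
print(fired)  # [4200]
```

The first dip (600 s to 1200 s) recovers before 30 minutes pass, so it never pages. Only the sustained breach starting at 2400 s fires.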

The pipeline

Application logs
        ↓
Metrics aggregation  (counts, percentiles, distributions)
        ↓
Alert rules          (thresholds + dwell time)
        ↓
Dashboards / pages   (humans get the signal)
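The first two stages of this pipeline can be sketched in a few lines: raw log lines go in, a per-minute error rate comes out. This assumes each request logs one `INFO predict` line and each failure logs one `ERROR` line, as in tonight's logging lab. The function name, the sample lines, and the 5% threshold are illustrative assumptions.

```python
from collections import defaultdict


def error_rate_by_minute(lines):
    """Map 'YYYY-MM-DD HH:MM' to errors per predict request in that minute."""
    requests = defaultdict(int)
    errors = defaultdict(int)
    for line in lines:
        parts = line.split()
        if len(parts) < 4:
            continue  # not a 'date time LEVEL event ...' line
        date, clock, level, event = parts[:4]
        minute = f"{date} {clock[:5]}"  # drop the seconds
        if level == "INFO" and event == "predict":
            requests[minute] += 1
        elif level == "ERROR":
            errors[minute] += 1
    return {m: errors.get(m, 0) / n for m, n in requests.items()}


logs = [
    "2026-04-26 19:14:02 INFO predict req_id=r000 latency_ms=183",
    "2026-04-26 19:14:10 INFO predict req_id=r001 latency_ms=95",
    "2026-04-26 19:14:10 ERROR model failure req_id=r001",
    "2026-04-26 19:15:01 INFO predict req_id=r002 latency_ms=120",
    "2026-04-26 19:15:30 INFO predict req_id=r003 latency_ms=540",
    "2026-04-26 19:15:30 WARNING slow predict req_id=r003 latency_ms=540",
]
rates = error_rate_by_minute(logs)
breaching = [m for m, r in rates.items() if r > 0.05]
print(rates)      # {'2026-04-26 19:14': 0.5, '2026-04-26 19:15': 0.0}
print(breaching)  # ['2026-04-26 19:14']
```

The alert rule then looks only at `rates`, never at the individual lines, which keeps the alerting stage independent of log wording.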

A common mistake: alerting on raw logs. Logs are for debugging; metrics are for alerting. If an alert can't be expressed as "metric crosses a threshold (above or below) for N minutes", it will page constantly and get muted.

5 3-Hour Session Schedule

0:00–0:10

Recap & Agenda. Week 30 review · build → operate shift · introduce monitoring concept

0:10–0:30

Core Teaching: Monitoring Architecture. Three layers · why silent failures matter for ML

0:30–0:55

Breakout 1: Monitoring Analysis. Teams pick a system (RAG / fraud / recsys / docs); identify metrics, failure signals, alert thresholds → monitoring_framework.md

0:55–1:05

Share-out. Each team presents · instructor highlights system-type differences

1:05–1:15

Break (10m)

1:15–1:35

Core Teaching: Logs vs Metrics vs Alerts. Pipeline walk-through · real-world failures from poor monitoring

1:35–2:00

Lab 1: Logging Implementation. Build monitoring/logging_demo.py · simulate predict requests, latency warnings, errors

2:00–2:25

Breakout 2: Alert Design. Define accuracy / latency / error-rate alert conditions · update framework doc

2:25–2:35

Break (10m)

2:35–2:55

Lab 2: Dashboard Plan. Sketch layout in monitoring_dashboard.md · system health · model metrics · user activity

2:55–3:00

Closing. How tonight's framework feeds the capstone operations plan

6 Hands-On Labs

Part A — Monitoring Framework

Create week31_ai_system_playbook/monitoring_framework.md. Include for your assigned system:

Key metrics across the three layers
Failure indicators (what does broken look like?)
Alert thresholds with dwell times

Part B — Logging Script

Create monitoring/logging_demo.py. Use Python's stdlib logging — no external deps needed.

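import logging
import random
import time

# Timestamped, levelled output: one line per event.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("logging_demo")

SLOW_MS = 500      # latency above this also logs a WARNING
ERROR_RATE = 0.05  # simulated share of requests that fail


def predict(req_id):
    """Simulate one predict request and log its outcome."""
    latency = random.uniform(50, 600)  # fake latency in milliseconds
    # Lazy %-style args: the message is only formatted if the record is emitted.
    logger.info("predict req_id=%s latency_ms=%.0f", req_id, latency)
    if latency > SLOW_MS:
        logger.warning("slow predict req_id=%s latency_ms=%.0f", req_id, latency)
    if random.random() < ERROR_RATE:
        logger.error("model failure req_id=%s", req_id)


if __name__ == "__main__":
    for i in range(20):
        predict(f"r{i:03d}")
        time.sleep(0.1)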
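If your team wants logs that are easier to aggregate later, one optional extension is to emit each record as a JSON object. This is a sketch, not part of the required lab. It still uses only the stdlib; `JsonFormatter`, the logger name, and the `fields` key passed through `extra=` are hypothetical names chosen for this example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # `fields` is a hypothetical attribute name supplied via `extra=`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stderr or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

json_logger = logging.getLogger("logging_demo_json")
json_logger.setLevel(logging.INFO)
json_logger.addHandler(handler)
json_logger.propagate = False  # keep the root logger from printing a second copy

json_logger.info("predict", extra={"fields": {"req_id": "r000", "latency_ms": 183}})
print(stream.getvalue().strip())
```

Each line can then be read back with `json.loads`, so numeric fields such as `latency_ms` keep their type instead of being parsed out of a string.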

Part C — Dashboard Plan

Create week31_ai_system_playbook/monitoring_dashboard.md. Sketch a one-screen layout with three panels:

System health — request rate, error rate, latency p50/p95/p99
Model metrics — accuracy / drift / confidence distribution over the last 24h
User activity — DAU on the feature, engagement, override / feedback rate

For each panel: which metric, what time window, what threshold turns the panel red.
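One way to check that a panel is fully specified is to write the three answers down as data. The sketch below is only an aid for drafting monitoring_dashboard.md. The `Panel` class, the metric names, and every threshold value are illustrative assumptions rather than recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Panel:
    """One dashboard panel: which metric, what window, what turns it red."""
    name: str
    metric: str
    window: str
    red_threshold: float
    higher_is_worse: bool = True

    def status(self, value):
        """Return 'red' when the value is on the wrong side of the threshold."""
        if self.higher_is_worse:
            breached = value > self.red_threshold
        else:
            breached = value < self.red_threshold
        return "red" if breached else "green"


# Hypothetical panels and thresholds for a one-screen layout.
panels = [
    Panel("System health", "latency_ms_p95", "5m", red_threshold=500),
    Panel("Model metrics", "accuracy_24h", "24h", red_threshold=0.85,
          higher_is_worse=False),
    Panel("User activity", "override_rate", "24h", red_threshold=0.20),
]
latest = {"latency_ms_p95": 412, "accuracy_24h": 0.81, "override_rate": 0.07}
for p in panels:
    print(f"{p.name}: {p.metric} over {p.window} -> {p.status(latest[p.metric])}")
# System health: latency_ms_p95 over 5m -> green
# Model metrics: accuracy_24h over 24h -> red
# User activity: override_rate over 24h -> green
```

The `higher_is_worse` flag matters because latency and accuracy breach in opposite directions; a panel spec that omits the direction is incomplete.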

7 Anti-GenAI Requirement

Section title (required in every team's submission):

"Why these metrics matter for this AI system"

Teams must explain how the chosen metrics tie to system reliability and user impact for their assigned system specifically — not generic monitoring talking points. A reviewer should be able to tell whether the team understands fraud detection vs. RAG without reading the system name.

8 Pre-class Reading

Honeycomb — Observability vs Monitoring. Monitoring tracks predefined metrics; observability lets you ask new questions of system behavior. honeycomb.io/blog/observability-vs-monitoring
Neptune — Monitoring ML Systems. Tracking accuracy, drift, and input distribution shifts. neptune.ai/blog/ml-model-monitoring
Atlassian — Logging Best Practices. What to log, what not to log, why structured logs matter. atlassian.com/incident-management/logging

9 Closing Checklist

Before you leave tonight, every team should have:

week31_ai_system_playbook/monitoring_framework.md
week31_ai_system_playbook/monitoring_dashboard.md
monitoring/logging_demo.py running locally
The "Why these metrics matter for this AI system" section filled in

These artifacts roll forward into your capstone operations plan once topics drop.