Catch problems before your users do
W23D1 - Production ML Monitoring
Data is clean, curated, and static. You evaluate once and ship.
model.fit(X_train, y_train) # Done!
Data drifts, quality degrades, model decays — silently, continuously.
monitor(data, model, alerts) # Never done
Deploying a model is the beginning, not the end. Tonight you build the monitoring systems that keep models healthy in production — and practice the triage process that separates a 4-hour fix from weeks of silent damage.
You are an ML engineer at MetroPulse, a transit platform that recommends routes and ride options to millions of commuters daily. Your recommendation model was trained three months ago on historical rider data: device types, time-of-day patterns, and engagement signals.
This morning at 6 AM, an automated monitoring job flagged anomalies in overnight rider behavior data. Something shifted. Your job: run the monitoring pipeline, diagnose what changed, assess model impact, and decide whether to retrain, rollback, or wait.
The starter code simulates a realistic production scenario using MovieLens 100K data:
Every deployed model eventually breaks. The question isn't if, but when and how fast you detect it.
Without monitoring: the model degrades silently. Users see bad recommendations for weeks. Someone finally notices and triggers an emergency. Root cause investigation takes days because there's no historical data to compare against.
# Weeks of damage before anyone notices
Timeline: days → weeks → emergency
With monitoring: an alert fires with PSI = 0.31 on device_type. Triage identifies the shift. The team decides to retrain on fresh data. A 4-hour fix instead of weeks of damage.
# Alert → Diagnose → Fix in hours
Timeline: alert → 4hr fix → resolved
A robust monitoring system covers three complementary areas. Each catches different failure modes, and together they form a complete safety net:
Data Quality: Is the data valid? Schema checks, null rates, range validation, duplicate detection. Catches pipeline failures before they corrupt predictions.
Drift (PSI): Has the data changed? PSI compares current vs reference distributions feature-by-feature. Catches behavioral shifts and upstream changes.
Performance: Is the model still accurate? Overall AUC plus per-slice evaluation. Catches model decay that overall metrics can mask.
The pillars form a diagnostic chain. Think of them as layers of a funnel:
| Pillar | Checks Run | Threshold | What a Failure Means |
|---|---|---|---|
| Data Quality | Schema validation, missingness, range, duplicates | MISSING_THRESHOLD = 0.05 | Pipeline is feeding the model bad data |
| Drift (PSI) | PSI per feature: hour_bucket, device_type | PSI_THRESHOLD = 0.2 | Feature distributions shifted significantly |
| Performance | Overall AUC drop + per-slice (by device_type) | AUC_DROP_THRESHOLD = 0.05 | Model accuracy degraded beyond tolerance |
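To make the first pillar concrete, here is a minimal sketch of the four data-quality checks. It assumes records are plain dictionaries; the function name `run_quality_checks`, the `hour` column, and its (0, 23) range rule are illustrative choices, not the starter's actual code. Only `MISSING_THRESHOLD = 0.05` comes from the table above.

```python
MISSING_THRESHOLD = 0.05  # default from the threshold table


def run_quality_checks(rows, expected_columns, range_rules,
                       missing_threshold=MISSING_THRESHOLD):
    """Run four basic checks on a list of dict records.

    Returns a dict mapping check name -> True (pass) / False (fail).
    Hypothetical helper for illustration only.
    """
    if not rows:
        raise ValueError("need at least one row to check")

    # Schema: every expected column must be present in every record.
    schema_ok = all(set(expected_columns) <= set(row) for row in rows)

    # Missingness: no column's null rate may exceed the threshold.
    null_rates = {
        col: sum(row.get(col) is None for row in rows) / len(rows)
        for col in expected_columns
    }
    missing_ok = all(rate <= missing_threshold for rate in null_rates.values())

    # Range: non-null values must fall inside their (low, high) bounds.
    range_ok = all(
        low <= row[col] <= high
        for col, (low, high) in range_rules.items()
        for row in rows
        if row.get(col) is not None
    )

    # Duplicates: identical records suggest replayed events.
    keys = [tuple(row.get(col) for col in expected_columns) for row in rows]
    duplicates_ok = len(keys) == len(set(keys))

    return {"schema": schema_ok, "missingness": missing_ok,
            "range": range_ok, "duplicates": duplicates_ok}


# Made-up sample: the second record has an out-of-range hour and a null device.
rows = [
    {"user_id": 1, "hour": 8, "device_type": "mobile"},
    {"user_id": 2, "hour": 27, "device_type": None},
]
report = run_quality_checks(rows, ["user_id", "hour", "device_type"],
                            {"hour": (0, 23)})
print(report)
```

Returning one boolean per check keeps the result easy to fold into a triage decision later; a production version would likely also report which columns and rows failed.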
Population Stability Index (PSI) quantifies how much a feature's distribution has shifted between your reference (training) window and the current (production) window. It is a widely used metric for drift detection.
PSI = Σ (P_cur - P_ref) × ln(P_cur / P_ref), summed over all bins.
Each term in the sum measures one bin's contribution to the total shift. The difference (P_cur - P_ref) captures the absolute change in the bin's share, while ln(P_cur / P_ref) captures the relative change on a log scale. The two factors always have the same sign, so every bin contributes a non-negative amount, and bins where both the absolute difference and the ratio are large contribute the most. This makes PSI sensitive to large reallocations of probability mass, the kind of shift most likely to hurt a model.
Per-bin contribution: (P_cur - P_ref) × ln(P_cur / P_ref). Bins where current is much higher (or much lower) than reference contribute large positive values.
Worked example for mobile, shifting from 33% to 60%:
(0.60 - 0.33) × ln(0.60 / 0.33) = 0.27 × 0.598 = 0.161
That's already in YELLOW territory from just one category! Add the desktop and tablet shifts and you'll get a RED PSI value. Try the interactive visualization in drift-explainer.html to see each bin's contribution.
PSI is popular because it's symmetric (swapping reference and current gives the same value), interpretable (standard severity bands), and works for both numeric and categorical features once numeric values are binned. Alternatives include KL divergence (not symmetric), the KS test (continuous features only), and the chi-squared test (sensitive to sample size). PSI hits a practical sweet spot.
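The per-bin formula translates directly into a few lines of code. The sketch below computes PSI for a categorical feature using only the standard library. The function name `psi` and the `eps` floor for empty bins are assumptions for illustration, and the desktop and tablet shares are made-up values; only the mobile shift from 33% to 60% comes from the worked example.

```python
import math
from collections import Counter


def psi(reference, current, eps=1e-6):
    """Population Stability Index between two samples of a categorical feature.

    eps floors each proportion so a category missing from one window
    does not cause a division by zero or log(0). Illustrative helper.
    """
    ref_counts = Counter(reference)
    cur_counts = Counter(current)
    total = 0.0
    for category in set(ref_counts) | set(cur_counts):
        p_ref = max(ref_counts[category] / len(reference), eps)
        p_cur = max(cur_counts[category] / len(current), eps)
        # One bin's contribution: (P_cur - P_ref) * ln(P_cur / P_ref)
        total += (p_cur - p_ref) * math.log(p_cur / p_ref)
    return total


# Mobile 33% -> 60% as in the worked example; other shares are assumed.
reference = ["mobile"] * 33 + ["desktop"] * 47 + ["tablet"] * 20
current = ["mobile"] * 60 + ["desktop"] * 30 + ["tablet"] * 10

value = psi(reference, current)
print(f"PSI(device_type) = {value:.3f}")  # about 0.31 with these assumed shares
```

Swapping the two arguments returns the same number, which is the symmetry property described above.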
When monitoring fires an alert, you need a systematic response. The 4 D's framework gives you a repeatable process: Detect, Diagnose, Decide, Document.
The starter computes severity automatically based on the combination of quality, drift, and performance results:
Critical: Performance degraded AND RED drift. Immediate action: consider rollback.
High: Performance degraded OR RED drift. Urgent review within 24 hours.
Medium: YELLOW drift or quality failures. Investigate in next sprint.
Low: All checks pass. Continue routine monitoring.
Explore the full interactive decision tree in triage-playbook.html, which includes action cards, SLA guidance, and severity calculation.
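The severity rules can be sketched as a small function that evaluates them from most to least severe. The function name `compute_severity` and its boolean parameters are hypothetical, and the mapping of the four rules to the level names critical, high, medium, and low is assumed from the order in which they are listed; the starter's own implementation may differ.

```python
def compute_severity(perf_degraded, red_drift, yellow_drift, quality_failed):
    """Combine pillar results into one severity level.

    Rules are checked top-down, so the first match wins.
    Illustrative sketch, not the starter's actual code.
    """
    if perf_degraded and red_drift:
        return "critical"   # immediate action: consider rollback
    if perf_degraded or red_drift:
        return "high"       # urgent review within 24 hours
    if yellow_drift or quality_failed:
        return "medium"     # investigate in next sprint
    return "low"            # continue routine monitoring


# RED drift without a measured performance drop
print(compute_severity(perf_degraded=False, red_drift=True,
                       yellow_drift=False, quality_failed=False))
```

Ordering the checks from most to least severe means a run with several simultaneous problems is always reported at its worst level.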
The starter code runs 7 sections in sequence. Here's what each one does and what to look for:
Downloads MovieLens 100K (cached after first run), parses tab-separated ratings, constructs the feature table with user_id, item_id, hour_bucket, device_type, and a binary label (rating ≥ 4).
Mode: Automatic

Splits data 70/30 by timestamp (reference = older data, current = newer). Then injects synthetic drift into the current window: mobile jumps to ~60%, night hours concentrate to ~63%. This simulates what happens when production data shifts away from training data.
Mode: Automatic

Runs 4 checks on the current window: Schema (are expected columns present?), Missingness (any column exceed the null threshold?), Range (values within valid bounds?), Duplicates (replay events?). All should pass; our injected drift is a distribution shift, not a quality failure.
Mode: Interactive (adjust threshold, re-run, explain). MODIFY HERE: EXPECTED_COLUMNS, MISSING_THRESHOLD, range_rules

Computes PSI for each monitored feature (hour_bucket, device_type) by comparing reference and current distributions. With the injected drift, expect device_type to flag RED (PSI well above 0.2) and hour_bucket to flag RED (major concentration shift).
Mode: Interactive (adjust PSI threshold, re-run, explain). MODIFY HERE: features_to_check, PSI_THRESHOLD

Trains a LogisticRegression on the reference window, evaluates AUC on both a reference holdout and the current window. Also computes per-slice AUC by device_type. Look for: overall AUC drop and especially mobile slice degradation (the model wasn't trained on this device mix).
Mode: Interactive (adjust AUC threshold, re-run, explain). MODIFY HERE: feature_cols, categorical_cols, slice_col

Compiles all results into a structured triage packet: overall severity (critical/high/medium/low), recommended actions, top drifters. This is the artifact an on-call engineer would review at 6 AM to decide what to do.
Mode: Automatic (uses results from S4-S6)

Saves the triage packet as timestamped JSON in monitoring_results/. Generates a 3-panel dashboard (PSI bars, AUC comparison, quality checks) and a per-slice AUC chart. Use --no-plot to skip chart generation.
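Per-slice evaluation is the step most likely to surface damage that the overall AUC hides. The sketch below shows the idea with a dependency-free, pairwise AUC so it runs anywhere; the starter itself relies on scikit-learn. The helper names `auc` and `auc_by_slice` are hypothetical, and the labels and scores are made-up values chosen so that one slice is ranked well and the other badly.

```python
from collections import defaultdict


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Returns None if a class is missing.
    Quadratic in sample size, so suitable for small illustrations only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_slice(labels, scores, slices):
    """Group rows by slice value and compute AUC within each group."""
    groups = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, slices):
        groups[g][0].append(y)
        groups[g][1].append(s)
    return {g: auc(ys, ss) for g, (ys, ss) in groups.items()}


# Made-up data: desktop is ranked perfectly, mobile is ranked backwards.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.2, 0.4, 0.6, 0.3, 0.7]
slices = ["desktop"] * 4 + ["mobile"] * 4

overall = auc(labels, scores)
per_slice = auc_by_slice(labels, scores, slices)
print("overall:", overall)      # 0.75
print("per slice:", per_slice)  # desktop 1.0, mobile 0.0
```

An overall AUC of 0.75 looks tolerable on its own, while the mobile slice is ranked entirely backwards. This is why the pipeline reports both numbers.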
The starter runs in interactive mode by default. It pauses at key decision points, lets you adjust thresholds, re-run individual sections, and explore "what-if" scenarios — all without restarting the script.
Green nodes are interactive pause points. Gray nodes run automatically.
[c] Continue
[r] Re-run with a new threshold
[e] Explain (teaching content + link to HTML explainer)
The starter auto-creates a virtual environment on first run (installs pandas, numpy, scikit-learn, matplotlib), downloads MovieLens 100K (cached after first download), injects simulated drift, and runs all three monitoring pillars interactively.
Open w23d1_starter.py in your editor. Find the MODIFY HERE blocks (search for "MODIFY HERE"). Add comments justifying each threshold for MetroPulse. Optionally add new features to monitor, new range rules, or new slices to evaluate.
Complete monitoring_plan_template.md for MetroPulse. This is the artifact that captures your monitoring strategy: what to check, what thresholds to use, how to respond, and who to notify.
The starter has clearly marked MODIFY HERE blocks. Here's what each one does:
| Location | What to Change | Why |
|---|---|---|
| STUDENT_NAME | Your name | Identifies your triage packets |
| PSI_THRESHOLD | Default: 0.2. Add a justification comment. | Why is 0.2 right for MetroPulse? What business impact justifies this sensitivity? |
| MISSING_THRESHOLD | Default: 0.05. Add a justification comment. | Why 5%? What breaks if null rates exceed this for ride recommendations? |
| AUC_DROP_THRESHOLD | Default: 0.05. Add a justification comment. | How much accuracy loss is tolerable before rider experience suffers? |
| EXPECTED_COLUMNS | The schema check column list | What columns would MetroPulse need? (pickup_lat, pickup_lon, fare?) |
| range_rules | Valid value ranges per column | What are valid lat/lon bounds? Valid fare ranges? Valid hour values? |
| features_to_check | Which features to compute PSI for | Which features matter most for ride recommendations? |
| feature_cols / categorical_cols | Model input features | What should the model use? What should it exclude? |
| slice_col | Column to compute per-slice AUC on | Which user segments need separate monitoring? |
Thresholds aren't magic numbers — they're engineering trade-offs. Too strict and you drown in false alarms. Too lenient and you miss real problems. Here are experiments to build intuition:
Set PSI_THRESHOLD to 0.05. Almost every feature will flag, even on minor seasonal fluctuations. You'd alert on noise every day.
Set PSI_THRESHOLD to 0.50. Only catastrophic distribution shifts trigger. You'd miss the mobile surge entirely at moderate levels.
Set MISSING_THRESHOLD to 0.0. Any null at all flags a failure. Appropriate for safety-critical systems where imputation is unacceptable.
Set AUC_DROP_THRESHOLD to 0.01. Even a 0.01 drop in AUC triggers an alert. Ask: is this actionable or just random variance?
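The effect of a stricter or looser PSI threshold can be previewed by sweeping it over a fixed set of PSI values. In this sketch the helper `flagged_features` is hypothetical and assumes a feature flags when its PSI is strictly above the threshold. The 0.31 for device_type echoes the alert example earlier; the other two values are invented for illustration.

```python
def flagged_features(psi_by_feature, threshold):
    """Return the features whose PSI exceeds the threshold, sorted by name."""
    return sorted(name for name, value in psi_by_feature.items()
                  if value > threshold)


# Illustrative PSI values; only device_type = 0.31 appears in the text.
psi_by_feature = {"device_type": 0.31, "hour_bucket": 0.62, "day_of_week": 0.07}

for threshold in (0.05, 0.2, 0.5):
    print(f"threshold={threshold}: {flagged_features(psi_by_feature, threshold)}")
```

At 0.05 even the small day_of_week wobble flags, at 0.2 only the two real shifts do, and at 0.5 the device_type shift goes unnoticed.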
You can also set thresholds via command-line flags without interactive prompts:
Terms you'll encounter in the starter code and tonight's discussion:
By the end of tonight, you should have:
w23d1_starter.py with your name, threshold justification comments, and any additional features/slices you chose to monitor
Triage packet JSON and dashboard charts (in monitoring_results/), generated by running the pipeline with your chosen thresholds
monitoring_plan_template.md with your MetroPulse monitoring strategy: what to check, what thresholds, how to respond
Optional: run the pipeline several times with different thresholds and use --label tags to compare triage packets. How does severity change when you tighten PSI from 0.2 to 0.1? What about loosening AUC to 0.10? Each run saves a separate timestamped JSON you can compare.
Interactive PSI visualization with step-by-step calculator, drift slider, and per-bin contribution breakdown. See exactly how shifting mobile from 33% to 60% produces a RED PSI value.
Explore PSI →

ML pipeline diagram with monitoring overlay. Clickable checkpoints show where each check lives in the pipeline. Covers the 4 D's cycle: Detect, Diagnose, Decide, Document.
View Pipeline →

Interactive decision tree with clickable nodes, path highlighting, and action cards. Walk through the exact triage logic the starter code uses to compute severity.
Open Playbook →

Scaffold for your MetroPulse monitoring plan. Covers system overview, quality checks, drift thresholds, performance monitoring, triage playbook, and operational considerations.
Open Template →