Advanced Monitoring: Drift, Data Quality, Performance

Catch problems before your users do

W23D1 - Production ML Monitoring

The Big Idea

Training Time

Data is clean, curated, and static. You evaluate once and ship.

model.fit(X_train, y_train) # Done!

Production

Data drifts, quality degrades, model decays — silently, continuously.

monitor(data, model, alerts) # Never done

Deploying a model is the beginning, not the end. Tonight you build the monitoring systems that keep models healthy in production — and practice the triage process that separates a 4-hour fix from weeks of silent damage.

1 The MetroPulse Scenario

Your Role Tonight

You are an ML engineer at MetroPulse, a transit platform that recommends routes and ride options to millions of commuters daily. Your recommendation model was trained three months ago on historical rider data: device types, time-of-day patterns, and engagement signals.

This morning at 6 AM, an automated monitoring job flagged anomalies in overnight rider behavior data. Something shifted. Your job: run the monitoring pipeline, diagnose what changed, assess model impact, and decide whether to retrain, rollback, or wait.

What Actually Happened (the drift we'll inject)

The starter code simulates a realistic production scenario using MovieLens 100K data:

1. Mobile surge: A viral TikTok about MetroPulse drove mobile adoption. Device mix shifted from ~33% mobile to ~60% mobile overnight. Desktop and tablet usage dropped proportionally.
2. Night-hour concentration: The new mobile users are night owls. Traffic in hours 0-6 jumped from ~29% to ~63%, while daytime hours thinned out.
3. Downstream impact: The model was trained on a balanced device mix and uniform hour distribution. Now it's seeing data it wasn't built for. How badly does performance suffer? That's what you'll measure.

Why MovieLens?
We use MovieLens 100K (100,000 movie ratings from 943 users) as a stand-in for MetroPulse ride data. The structure is analogous: user IDs, item IDs, timestamps, and a binary engagement label. The monitoring techniques you learn here (PSI, schema checks, slice-level AUC) apply identically to any production ML system.
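To make the injected mobile surge concrete, here is a minimal sketch of how a device-mix shift could be simulated on a balanced sample. The helper names `inject_device_drift` and `device_share`, the category labels, and the 60/20/20 target mix are illustrative assumptions; the starter's own injection code is not shown in this guide and may work differently.

```python
import random
from collections import Counter

def inject_device_drift(rows, target_mix, seed=0):
    """Return copies of rows with device_type redrawn from target_mix (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    devices = list(target_mix)
    weights = [target_mix[d] for d in devices]
    drifted = []
    for row in rows:
        new_row = dict(row)  # copy so the reference rows stay untouched
        new_row["device_type"] = rng.choices(devices, weights=weights, k=1)[0]
        drifted.append(new_row)
    return drifted

def device_share(rows):
    """Share of rows per device_type."""
    counts = Counter(row["device_type"] for row in rows)
    return {device: count / len(rows) for device, count in counts.items()}

# Balanced reference-like sample: one third per device type.
DEVICES = ["mobile", "desktop", "tablet"]
reference = [{"user_id": i, "device_type": DEVICES[i % 3]} for i in range(3000)]

# Assumed target mix: mobile at 60%, the other two sharing the remainder.
current = inject_device_drift(reference, {"mobile": 0.6, "desktop": 0.2, "tablet": 0.2})

print("reference:", device_share(reference))
print("current:  ", device_share(current))
```

Because the draw is random, the current shares land near the target mix rather than exactly on it, which is also what the "~60%" in the scenario suggests.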

2 Why Models Degrade in Production

Every deployed model eventually breaks. The question isn't if, but when and how fast you detect it.

1. Data Drift: The statistical properties of input features change. User behavior evolves: a new device type emerges, seasonal patterns shift traffic hours, a marketing campaign changes the user mix. The data your model sees no longer matches what it trained on.
   Example: Mobile share jumps 33% → 60% after a viral campaign. Your model's device_type assumptions break.
2. Data Quality Issues: An upstream pipeline breaks. A column gets renamed. Null rates spike. A partner API starts returning zeros instead of nulls. Your model gets garbage in and confidently produces garbage out.
   Example: A schema migration drops the hour_bucket column. The model falls back to default values and recommends rush-hour routes at 3 AM.
3. Concept Drift: The relationship between features and labels changes. What users liked last month, they don't like now. The model's learned patterns become stale even if the data distributions look similar.
   Example: A transit strike makes bus routes unpopular. Same features, different labels.
4. Feedback Loops: Your model's predictions influence user behavior, which changes the training data, which changes the model. Self-reinforcing cycles amplify errors over time.
   Example: The model stops recommending a route → fewer riders take it → data shows low engagement → model becomes even less likely to recommend it.

Without Monitoring

Model degrades silently. Users see bad recommendations for weeks. Someone finally notices, triggers an emergency. Root cause investigation takes days because there's no historical data to compare against.

# Weeks of damage before anyone notices
Timeline: days → weeks → emergency

With Monitoring

Alert fires: PSI = 0.31 on device_type. Triage identifies the shift. Team decides to retrain on fresh data. 4-hour fix instead of weeks of damage.

# Alert → Diagnose → Fix in hours
Timeline: alert → 4hr fix → resolved

3 Three Monitoring Pillars

A robust monitoring system covers three complementary areas. Each catches different failure modes, and together they form a complete safety net:

1

Data Quality

Is the data valid? Schema checks, null rates, range validation, duplicate detection. Catches pipeline failures before they corrupt predictions.

2

Distribution Drift (PSI)

Has the data changed? PSI compares current vs reference distributions feature-by-feature. Catches behavioral shifts and upstream changes.

3

Performance Regression

Is the model still accurate? Overall AUC plus per-slice evaluation. Catches model decay that overall metrics can mask.

How They Connect

The pillars form a diagnostic chain. Think of them as layers of a funnel:

Diagnostic Chain

Data Quality answers: "Can I trust this data at all?" If schema, nulls, or ranges fail, nothing downstream is reliable. Fix the pipeline first.

Drift Detection answers: "Has the world changed?" Even if data is clean, distributions may have shifted. PSI quantifies how much.

Performance Regression answers: "Does the drift actually hurt the model?" Not all drift degrades predictions. Maybe device_type shifted but the model doesn't rely on it heavily. This pillar connects distribution changes to business impact.

What Each Pillar Checks (in the starter code)

Pillar | Checks Run | Threshold | What a Failure Means
Data Quality | Schema validation, missingness, range, duplicates | MISSING_THRESHOLD = 0.05 | Pipeline is feeding the model bad data
Drift (PSI) | PSI per feature: hour_bucket, device_type | PSI_THRESHOLD = 0.2 | Feature distributions shifted significantly
Performance | Overall AUC drop + per-slice (by device_type) | AUC_DROP_THRESHOLD = 0.05 | Model accuracy degraded beyond tolerance
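The four data quality checks in the first row can be sketched in plain Python. This is an illustration under stated assumptions, not the starter's implementation: the function name `check_quality`, the list-of-dicts row format, and the example `range_rules` bounds (hours 0 to 23, labels 0 to 1) are all hypothetical. Only the names EXPECTED_COLUMNS, range_rules, and the 0.05 default come from this guide.

```python
def check_quality(rows, expected_columns, range_rules, missing_threshold=0.05):
    """Run four illustrative checks on a list of dict rows; True means the check passed."""
    n = max(len(rows), 1)  # avoid dividing by zero on an empty window
    present = set().union(*(row.keys() for row in rows))

    # Schema: every expected column appears somewhere in the window.
    schema_ok = set(expected_columns) <= present

    # Missingness: no expected column exceeds the allowed null rate.
    null_rates = {
        col: sum(1 for row in rows if row.get(col) is None) / n
        for col in expected_columns
    }
    missing_ok = all(rate <= missing_threshold for rate in null_rates.values())

    # Range: non-null values fall inside the inclusive (low, high) bounds.
    range_ok = all(
        low <= row[col] <= high
        for col, (low, high) in range_rules.items()
        for row in rows
        if row.get(col) is not None
    )

    # Duplicates: no two rows are identical across the expected columns.
    keys = [tuple(row.get(col) for col in expected_columns) for row in rows]
    duplicates_ok = len(keys) == len(set(keys))

    return {"schema": schema_ok, "missingness": missing_ok,
            "range": range_ok, "duplicates": duplicates_ok}

EXPECTED_COLUMNS = ["user_id", "item_id", "hour_bucket", "device_type", "label"]
range_rules = {"hour_bucket": (0, 23), "label": (0, 1)}  # assumed bounds

clean = [
    {"user_id": 1, "item_id": 10, "hour_bucket": 2, "device_type": "mobile", "label": 1},
    {"user_id": 2, "item_id": 11, "hour_bucket": 14, "device_type": "desktop", "label": 0},
]
print(check_quality(clean, EXPECTED_COLUMNS, range_rules))
```

Setting `missing_threshold=0.0` in this sketch makes any single null fail the missingness check, which matches the "zero tolerance" experiment in the tuning guide.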

4 PSI: Your Drift Detector

Population Stability Index (PSI) quantifies how much a feature's distribution has shifted between your reference (training) window and the current (production) window. It is one of the most widely used drift metrics, with roots in credit-risk model monitoring.

Population Stability Index
PSI = Σ (P_cur - P_ref) × ln(P_cur / P_ref), summed over all bins, where P_cur and P_ref are a bin's share of the current and reference windows

Why This Formula Works

Each term in the sum measures one bin's contribution to the total shift. The (P_cur - P_ref) factor captures the absolute change in that bin's share, while ln(P_cur / P_ref) captures the relative change on a log scale. The two factors always share the same sign, so every term is non-negative: a bin that grows and a bin that shrinks both add to PSI. Bins where both the absolute difference and the ratio are large contribute the most, which makes PSI sensitive to the kind of concentrated shifts that break models.

PSI < 0.1
No significant drift
0.1 - 0.2
Moderate drift — investigate
PSI > 0.2
Significant drift — action required

How It Works, Step by Step

1. Bin both distributions into the same buckets. For numeric features (like hour_bucket), use equal-width bins across the combined range. For categorical features (like device_type), each category is its own bin.
2. Calculate proportions in each bin for both reference and current. Add a small epsilon (1e-4) to avoid division by zero.
3. Compute per-bin PSI: For each bin, calculate (P_cur - P_ref) × ln(P_cur / P_ref). Bins whose share changed sharply in either direction contribute large positive values.
4. Sum all bin contributions. The total is your PSI value. Compare against the severity bands above.

Intuition Check
If mobile was 33% in reference and 60% in current, that bin alone contributes:
(0.60 - 0.33) × ln(0.60 / 0.33) = 0.27 × 0.598 = 0.161
That's already in YELLOW territory from just one category! Add the desktop and tablet shifts and you'll get a RED PSI value. Try the interactive visualization in drift-explainer.html to see each bin's contribution.
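The four steps above can be sketched for a categorical feature in a few lines of Python. This is an illustration, not the starter's code: the function names `psi` and `band` are hypothetical, the "GREEN" label for the lowest band is an assumption (the guide only names YELLOW and RED), and numeric features would need the equal-width binning from step 1 before being passed in.

```python
import math
from collections import Counter

EPSILON = 1e-4  # small constant from step 2, keeps ln() and division defined

def proportions(values, bins):
    """Share of values in each bin, with EPSILON added to every bin."""
    counts = Counter(values)
    return [counts.get(b, 0) / len(values) + EPSILON for b in bins]

def psi(reference, current):
    """Return (total PSI, per-bin contributions) for two lists of categories."""
    bins = sorted(set(reference) | set(current))      # step 1: shared bins
    p_ref = proportions(reference, bins)              # step 2: proportions
    p_cur = proportions(current, bins)
    contributions = {                                 # step 3: per-bin terms
        b: (c - r) * math.log(c / r)
        for b, r, c in zip(bins, p_ref, p_cur)
    }
    return sum(contributions.values()), contributions  # step 4: sum

def band(value):
    """Map a PSI value to the severity bands listed above."""
    if value < 0.1:
        return "GREEN"
    if value <= 0.2:
        return "YELLOW"
    return "RED"

# Device mix from the intuition check: mobile 33% -> 60%, the rest shrink.
reference = ["mobile"] * 33 + ["desktop"] * 34 + ["tablet"] * 33
current = ["mobile"] * 60 + ["desktop"] * 20 + ["tablet"] * 20

total, per_bin = psi(reference, current)
for name, value in per_bin.items():
    print(f"{name:8s} contributes {value:.3f}")
print(f"total PSI = {total:.3f} -> {band(total)}")
```

Note that the shrinking desktop and tablet bins also contribute positive amounts, which is why the total lands above the mobile bin's contribution alone.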

PSI vs Other Drift Metrics

PSI is popular because it's symmetric (swapping reference and current gives exactly the same value), interpretable (standard severity bands), and works for both numeric features (after binning) and categorical features. Alternatives include KL divergence (not symmetric), the KS test (designed for continuous features), and the chi-squared test (sensitive to sample size). PSI hits a practical sweet spot.
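The symmetry claim is easy to check numerically. The sketch below computes PSI and KL divergence directly from two proportion vectors; the function names are illustrative and the epsilon smoothing is omitted because no bin is empty here. It also shows that PSI equals the sum of the two KL directions, which is why swapping the windows cannot change it.

```python
import math

def psi_from_proportions(p, q):
    """PSI between two proportion vectors over the same bins."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    """KL(p || q): depends on which vector is treated as the baseline."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

reference = [1 / 3, 1 / 3, 1 / 3]  # balanced device mix
current = [0.6, 0.2, 0.2]          # mobile-heavy device mix

psi_forward = psi_from_proportions(current, reference)
psi_backward = psi_from_proportions(reference, current)
kl_forward = kl_divergence(current, reference)
kl_backward = kl_divergence(reference, current)

print("PSI both ways:", psi_forward, psi_backward)
print("KL both ways: ", kl_forward, kl_backward)
print("KL sum:       ", kl_forward + kl_backward)
```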

5 The Triage Framework

When monitoring fires an alert, you need a systematic response. The 4 D's framework gives you a repeatable process: Detect, Diagnose, Decide, Document.

# The Triage Decision Tree
Alert Triggered
|
+-- Data quality failing?
|   +-- YES --> Fix data pipeline FIRST (Critical)
|   +-- NO ---> Continue to drift check
|
+-- Drift detected (PSI > threshold)?
|   +-- YES + Performance OK ------> Monitor closely, no action yet
|   +-- YES + Performance degraded -> Retrain model (High priority)
|   +-- NO ---> Continue to performance check
|
+-- Performance degraded (no drift)?
|   +-- YES --> Investigate concept drift or label noise
|
+-- All checks pass?
    +-- YES --> Log and close. Routine all-clear.
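The decision tree translates almost line for line into a function. This is a sketch of the branching order only; the name `triage_action` and the three boolean inputs are assumptions, and the starter may structure its logic differently.

```python
def triage_action(quality_failed, drift_detected, performance_degraded):
    """Walk the triage decision tree top to bottom and return the first matching action."""
    if quality_failed:
        # Nothing downstream is trustworthy until the data is fixed.
        return "Fix data pipeline first"
    if drift_detected and performance_degraded:
        return "Retrain model"
    if drift_detected:
        return "Monitor closely, no action yet"
    if performance_degraded:
        # Performance dropped without input drift.
        return "Investigate concept drift or label noise"
    return "Log and close"

# The MetroPulse scenario: clean data, drifted features, degraded AUC.
print(triage_action(quality_failed=False, drift_detected=True, performance_degraded=True))
```

The order of the checks matters: testing quality first means a broken pipeline is never misread as drift.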

Severity Levels

The starter computes severity automatically based on the combination of quality, drift, and performance results:

Critical

Performance degraded AND RED drift. Immediate action: consider rollback.

High

Performance degraded OR RED drift. Urgent review within 24 hours.

Medium

YELLOW drift or quality failures. Investigate in next sprint.

Low

All checks pass. Continue routine monitoring.
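One way to read the four levels above as code is shown below. It follows the severity list literally, so a quality failure on its own maps to "medium" here even though the decision tree treats fixing the pipeline as the first priority. The function name, its arguments, and the 0.1 lower bound for YELLOW (taken from the PSI bands) are assumptions; the starter's actual rule may differ.

```python
def compute_severity(performance_degraded, max_psi, quality_failed,
                     psi_threshold=0.2, yellow_floor=0.1):
    """Map combined monitoring results to a severity level (illustrative sketch)."""
    red_drift = max_psi > psi_threshold
    yellow_drift = yellow_floor <= max_psi <= psi_threshold

    if performance_degraded and red_drift:
        return "critical"  # consider rollback
    if performance_degraded or red_drift:
        return "high"      # urgent review within 24 hours
    if yellow_drift or quality_failed:
        return "medium"    # investigate in next sprint
    return "low"           # continue routine monitoring

# RED drift on device_type (PSI = 0.31) plus an AUC drop beyond tolerance.
print(compute_severity(performance_degraded=True, max_psi=0.31, quality_failed=False))
```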

Key Insight: Drift Without Performance Loss Is Normal
Not all drift is bad. If device_type shifts but the model barely uses that feature, PSI will fire but AUC won't drop. The triage framework distinguishes between "drift that matters" and "drift that's noise." This is why you need both drift detection and performance monitoring: neither alone tells the full story.

Explore the full interactive decision tree in triage-playbook.html, which includes action cards, SLA guidance, and severity calculation.

6 Pipeline Walkthrough: What Each Section Does

The starter code runs 7 sections in sequence. Here's what each one does and what to look for:

S2

Data Loading

Downloads MovieLens 100K (cached after first run), parses tab-separated ratings, constructs the feature table with user_id, item_id, hour_bucket, device_type, and a binary label (rating ≥ 4).

Automatic
S3

Reference/Current Split + Drift Injection

Splits data 70/30 by timestamp (reference = older data, current = newer). Then injects synthetic drift into the current window: mobile jumps to ~60%, night hours concentrate to ~63%. This simulates what happens when production data shifts away from training data.

Automatic
S4

Data Quality Checks

Runs 4 checks on the current window: Schema (are expected columns present?), Missingness (any column exceed the null threshold?), Range (values within valid bounds?), Duplicates (replay events?). All should pass — our injected drift is a distribution shift, not a quality failure.

Interactive: adjust threshold, re-run, explain
MODIFY HERE: EXPECTED_COLUMNS, MISSING_THRESHOLD, range_rules
S5

Drift Detection with PSI

Computes PSI for each monitored feature (hour_bucket, device_type) by comparing reference and current distributions. With the injected drift, expect device_type to flag RED (PSI well above 0.2) and hour_bucket to flag RED (major concentration shift).

Interactive: adjust PSI threshold, re-run, explain
MODIFY HERE: features_to_check, PSI_THRESHOLD
S6

Performance Regression

Trains a LogisticRegression on the reference window, evaluates AUC on both a reference holdout and the current window. Also computes per-slice AUC by device_type. Look for: overall AUC drop and especially mobile slice degradation (the model wasn't trained on this device mix).

Interactive: adjust AUC threshold, re-run, explain
MODIFY HERE: feature_cols, categorical_cols, slice_col
S7

Triage Packet Generation

Compiles all results into a structured triage packet: overall severity (critical/high/medium/low), recommended actions, top drifters. This is the artifact an on-call engineer would review at 6 AM to decide what to do.

Automatic (uses results from S4-S6)
S8

Save Results + Visualization

Saves the triage packet as timestamped JSON in monitoring_results/. Generates a 3-panel dashboard (PSI bars, AUC comparison, quality checks) and a per-slice AUC chart. Use --no-plot to skip chart generation.

Automatic
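The per-slice evaluation in S6 is the step most worth seeing in miniature. The sketch below computes AUC by hand as the fraction of positive/negative pairs that the model ranks correctly, rather than calling scikit-learn, so that it runs with no dependencies. The row format, the `score` column, and the toy data are assumptions made for illustration; the starter trains a LogisticRegression and its scoring code is not shown here.

```python
from collections import defaultdict

def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when a slice contains only one class
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_by_slice(rows, slice_col="device_type"):
    """Group rows by slice_col and compute AUC inside each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[slice_col]].append(row)
    return {
        name: auc([r["label"] for r in group], [r["score"] for r in group])
        for name, group in groups.items()
    }

# Toy predictions: desktop is ranked perfectly, mobile has one inversion.
rows = [
    {"device_type": "desktop", "label": 1, "score": 0.9},
    {"device_type": "desktop", "label": 0, "score": 0.2},
    {"device_type": "desktop", "label": 1, "score": 0.8},
    {"device_type": "desktop", "label": 0, "score": 0.1},
    {"device_type": "mobile", "label": 1, "score": 0.4},
    {"device_type": "mobile", "label": 0, "score": 0.6},
    {"device_type": "mobile", "label": 1, "score": 0.7},
    {"device_type": "mobile", "label": 0, "score": 0.3},
]

overall = auc([r["label"] for r in rows], [r["score"] for r in rows])
print("overall:", overall)
print("by slice:", auc_by_slice(rows))
```

In this toy data the overall number looks healthier than the mobile slice, which is the pattern the per-slice chart is meant to expose. The pairwise loop is quadratic, so it suits small examples only.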

7 Interactive Mode: How It Works

The starter runs in interactive mode by default. It pauses at key decision points, lets you adjust thresholds, re-run individual sections, and explore "what-if" scenarios — all without restarting the script.

$ python w23d1_starter.py
Interactive mode (default) — pauses + prompts
$ python w23d1_starter.py --auto
Auto mode — no pauses, ~0.7s, identical output

What Happens at Each Pause

Config → S2 Load → S3 Split → Predict → S4 Loop → S5 Loop → S6 Loop → S7 Triage → Explore → Reflect → S8 Save

Interactive pause points: Predict, S4 Loop, S5 Loop, S6 Loop, Explore, Reflect. The S2, S3, S7, and S8 steps run automatically.

Pause Types

1. Prediction Moment (after S3): Before any checks run, write down what you think will happen. Which features will drift? Will quality pass? Will AUC drop > 5%? Having predictions sharpens your interpretation of results.
2. Threshold Adjustment (before S4, S5, S6): Before each analysis section runs, you're offered a chance to change the threshold. Type a new value or press Enter to keep the default. This lets you explore sensitivity without restarting.
3. Re-run Loop (after S4, S5, S6): After each section's results display, choose:
   [c] Continue   [r] Re-run with a new threshold   [e] Explain (teaching content + link to HTML explainer)
4. Exploration Mode (after S7): Free-form sandbox. Jump to any section, re-run with different thresholds, regenerate the triage packet, or re-run everything. Exit when satisfied.
5. Final Reflection (before save): Three open-ended questions you answer in the terminal. Think about causality, user segments, and your recommendation.

Tip: You Can Always Press Enter
Every interactive prompt accepts a blank Enter to keep the default or continue. If you just want to see the full pipeline quickly, press Enter through everything. The pauses are there to make you slow down and think, not to block you.
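The "blank Enter keeps the default" behavior can be sketched as a small helper. Everything here is hypothetical: the name `prompt_threshold`, the prompt wording, and the choice to fall back to the default on unparseable input are assumptions, not a description of the starter's prompts. The `read` parameter exists only so the helper can be exercised without a live terminal.

```python
def prompt_threshold(name, default, read=input):
    """Ask for a new threshold; a blank answer keeps the default (hypothetical helper)."""
    raw = read(f"{name} [{default}]: ").strip()
    if not raw:
        return default  # user just pressed Enter
    try:
        return float(raw)
    except ValueError:
        print(f"Could not parse {raw!r}; keeping {default}")
        return default

# Simulated answers stand in for keyboard input in this sketch.
print(prompt_threshold("PSI_THRESHOLD", 0.2, read=lambda prompt: ""))
print(prompt_threshold("PSI_THRESHOLD", 0.2, read=lambda prompt: "0.05"))
```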

8 Tonight's Lab: Step-by-Step

Getting Started

$ python w23d1_starter.py

The starter auto-creates a virtual environment on first run (installs pandas, numpy, scikit-learn, matplotlib), downloads MovieLens 100K (cached after first download), injects simulated drift, and runs all three monitoring pillars interactively.

Your Mission

1. Run the baseline. Execute the starter in interactive mode. At the Prediction Moment, write down what you expect. Watch the results unfold. Does the data quality pass? Which features does PSI flag? How bad is the AUC drop?
2. Explore thresholds interactively. Use the re-run loops and Exploration Mode to experiment. Try PSI threshold at 0.05 (catches everything) and 0.50 (catches nothing). Try MISSING_THRESHOLD at 0.0 and 0.5. Observe how the number of flagged features and the overall severity change.
3. Edit the code. Open w23d1_starter.py in your editor. Find the MODIFY HERE blocks (search for "MODIFY HERE"). Add comments justifying each threshold for MetroPulse. Optionally add new features to monitor, new range rules, or new slices to evaluate.
4. Answer the reflection questions. The final reflection asks three questions about causality, user segments, and your recommendation. Think carefully: these are the kind of questions you'd face in a real on-call triage.
5. Complete your monitoring plan. Fill out monitoring_plan_template.md for MetroPulse. This is the artifact that captures your monitoring strategy: what to check, what thresholds to use, how to respond, and who to notify.

What to Edit in the Code

The starter has clearly marked MODIFY HERE blocks. Here's what each one does:

Location | What to Change | Why
STUDENT_NAME | Your name | Identifies your triage packets
PSI_THRESHOLD | Default: 0.2. Add a justification comment. | Why is 0.2 right for MetroPulse? What business impact justifies this sensitivity?
MISSING_THRESHOLD | Default: 0.05. Add a justification comment. | Why 5%? What breaks if null rates exceed this for ride recommendations?
AUC_DROP_THRESHOLD | Default: 0.05. Add a justification comment. | How much accuracy loss is tolerable before rider experience suffers?
EXPECTED_COLUMNS | The schema check column list | What columns would MetroPulse need? (pickup_lat, pickup_lon, fare?)
range_rules | Valid value ranges per column | What are valid lat/lon bounds? Valid fare ranges? Valid hour values?
features_to_check | Which features to compute PSI for | Which features matter most for ride recommendations?
feature_cols / categorical_cols | Model input features | What should the model use? What should it exclude?
slice_col | Column to compute per-slice AUC on | Which user segments need separate monitoring?

9 Threshold Tuning Guide

Thresholds aren't magic numbers — they're engineering trade-offs. Too strict and you drown in false alarms. Too lenient and you miss real problems. Here are experiments to build intuition:

Experiments to Try

PSI: Catch Everything

Set threshold to 0.05. Almost every feature will flag, even on minor seasonal fluctuations. You'd alert on noise every day.

At the PSI prompt, type: 0.05

PSI: Only Extreme Shifts

Set threshold to 0.50. Only catastrophic distribution shifts trigger an alert. A more moderate version of the mobile surge would go completely undetected.

At the PSI prompt, type: 0.50

Missing: Zero Tolerance

Set threshold to 0.0. Any null at all flags a failure. Appropriate for safety-critical systems where imputation is unacceptable.

At the MISSING prompt, type: 0.0

AUC: Strict Detection

Set threshold to 0.01. Even a 1-point AUC drop triggers an alert. Ask: is this actionable or just random variance?

At the AUC prompt, type: 0.01
The Threshold Trade-off
Every threshold balances two costs: the cost of a false alarm (wasted engineer time investigating noise) vs the cost of a missed detection (users suffer bad predictions). For MetroPulse, ask yourself: which is worse, waking up the on-call engineer for nothing, or riders getting bad route recommendations for a week?
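A quick way to build intuition for the trade-off is to sweep a threshold over a fixed set of PSI values and watch the alert list grow or shrink. In the sketch below, the 0.31 for device_type echoes the alert example earlier in this guide; the other two values and the `item_id` feature are made up for illustration and are not output from the starter.

```python
def flagged_features(psi_by_feature, threshold):
    """Names of features whose PSI exceeds the threshold, in alphabetical order."""
    return sorted(name for name, value in psi_by_feature.items() if value > threshold)

# Illustrative PSI values only.
psi_by_feature = {"device_type": 0.31, "hour_bucket": 0.45, "item_id": 0.07}

for threshold in (0.05, 0.2, 0.5):
    flagged = flagged_features(psi_by_feature, threshold)
    print(f"threshold {threshold}: {len(flagged)} flagged {flagged}")
```

With these numbers the strictest setting flags the low-drift feature too, while the loosest setting stays silent even on features that clearly shifted.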

CLI Override (for scripting)

You can also set thresholds via command-line flags without interactive prompts:

$ python w23d1_starter.py --threshold-psi 0.15 --threshold-auc-drop 0.03
$ python w23d1_starter.py --auto --no-plot --label "strict-run"
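For readers curious how flags like these are typically wired up, here is a sketch using Python's standard argparse module. The flag names are the ones shown above; the defaults, help text, and the `build_parser` function are assumptions, and the starter's real parser may differ.

```python
import argparse

def build_parser():
    """Build a parser for the flags shown above (illustrative, not the starter's code)."""
    parser = argparse.ArgumentParser(description="Monitoring run options (sketch)")
    parser.add_argument("--threshold-psi", type=float, default=0.2,
                        help="PSI value above which a feature is flagged")
    parser.add_argument("--threshold-auc-drop", type=float, default=0.05,
                        help="AUC drop that counts as a regression")
    parser.add_argument("--auto", action="store_true",
                        help="run without interactive pauses")
    parser.add_argument("--no-plot", action="store_true",
                        help="skip chart generation")
    parser.add_argument("--label", default="",
                        help="tag attached to this run's output")
    return parser

# Parse an explicit argument list instead of sys.argv so the sketch is repeatable.
args = build_parser().parse_args(["--threshold-psi", "0.15", "--threshold-auc-drop", "0.03"])
print(args.threshold_psi, args.threshold_auc_drop, args.auto, args.no_plot)
```

argparse converts the dashes in flag names to underscores, so `--threshold-auc-drop` is read back as `args.threshold_auc_drop`.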

10 Key Vocabulary

Terms you'll encounter in the starter code and tonight's discussion:

Reference Window
The "known good" data your model was trained on. Used as baseline for comparison. In the starter: first 70% of data by timestamp.
Current Window
The fresh production data being evaluated. May have drifted. In the starter: last 30% with injected drift.
PSI (Population Stability Index)
Metric quantifying how much a distribution has shifted. <0.1 stable, 0.1-0.2 moderate, >0.2 significant.
AUC (Area Under ROC Curve)
Model quality metric. 1.0 = perfect, 0.5 = random. A drop from reference to current AUC signals performance regression.
Slice / Cohort
A subset of data defined by a feature value (e.g., "mobile users"). Per-slice metrics catch problems that overall metrics hide.
Triage Packet
Structured summary of all monitoring results: quality, drift, performance, severity, recommended actions. The artifact an on-call engineer reviews.
Severity
Overall risk level: critical (rollback), high (urgent review), medium (next sprint), low (all clear). Computed from combined results.
Data Drift vs Concept Drift
Data drift: input features change (P(X) shifts). Concept drift: the relationship between features and labels changes (P(Y|X) shifts). Different root causes, different responses.
Canary Evaluation
Testing model performance on a small slice of live traffic before full rollout. Like a canary in a coal mine — if it fails, pull back before full deployment.
Shadow Deployment
Running a new model alongside the production model without serving its predictions. Compare outputs to evaluate before switching over.

11 Common Mistakes to Avoid

1. Monitoring only overall metrics. A 2% global AUC drop might hide a 15% drop for mobile users. Always check per-slice performance. The segment that's growing fastest is often the one the model handles worst.
2. Assuming drift = bad. Not all drift degrades the model. Maybe hour_bucket shifted dramatically but the model barely uses that feature. Always check if drift correlates with performance impact before retraining.
3. Setting thresholds once and forgetting them. The right threshold depends on your current business context. During a product launch, you might tighten thresholds. During a known seasonal shift, you might loosen them. Monitoring parameters need monitoring too.
4. Skipping data quality checks. It's tempting to jump straight to "cool" metrics like PSI and AUC. But if the data itself is corrupt (missing columns, null spikes, out-of-range values), every downstream metric is unreliable. Quality checks are the foundation.
5. Retraining as the default response. Retraining is expensive and introduces risk. Sometimes the right response is "monitor closely" or "adjust the feature pipeline" or "wait for the seasonal shift to pass." The triage framework helps you choose.

12 What You'll Submit

By the end of tonight, you should have:

Bonus: Multiple Runs
Use Exploration Mode or run the script multiple times with different --label tags to compare triage packets. How does severity change when you tighten PSI from 0.2 to 0.1? What about loosening AUC to 0.10? Each run saves a separate timestamped JSON you can compare.
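Comparing two saved packets comes down to loading both JSON files and diffing the fields you care about. The sketch below writes two made-up packets to a temporary directory and compares them. The file naming scheme, the packet keys (`severity`, `psi_threshold`), the severity values, and both helper names are assumptions for illustration; check a real file in monitoring_results/ for the actual layout.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def save_packet(packet, out_dir, label=""):
    """Write a packet to a timestamped JSON file and return its path (hypothetical helper)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    suffix = f"_{label}" if label else ""
    path = out_dir / f"triage_{stamp}{suffix}.json"
    path.write_text(json.dumps(packet, indent=2))
    return path

def compare_packets(path_a, path_b, keys=("severity",)):
    """Return {key: (value_a, value_b)} for every listed key whose values differ."""
    a = json.loads(Path(path_a).read_text())
    b = json.loads(Path(path_b).read_text())
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

# Made-up packets standing in for two runs with different PSI thresholds.
with tempfile.TemporaryDirectory() as tmp:
    strict = save_packet({"severity": "critical", "psi_threshold": 0.1}, tmp, "strict-run")
    default = save_packet({"severity": "high", "psi_threshold": 0.2}, tmp, "default-run")
    diff = compare_packets(strict, default, keys=("severity", "psi_threshold"))

print(diff)
```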

13 Resources

Drift Explainer

Interactive PSI visualization with step-by-step calculator, drift slider, and per-bin contribution breakdown. See exactly how shifting mobile from 33% to 60% produces a RED PSI value.

Monitoring Pipeline

ML pipeline diagram with monitoring overlay. Clickable checkpoints show where each check lives in the pipeline. Covers the 4 D's cycle: Detect, Diagnose, Decide, Document.

Triage Playbook

Interactive decision tree with clickable nodes, path highlighting, and action cards. Walk through the exact triage logic the starter code uses to compute severity.

Monitoring Plan Template

Scaffold for your MetroPulse monitoring plan. Covers system overview, quality checks, drift thresholds, performance monitoring, triage playbook, and operational considerations.

Session Agenda

6:30 - 6:50
Why Models Degrade — Real-world failures, the cost of not monitoring, the MetroPulse scenario
6:50 - 7:20
Three Pillars — Data quality, drift detection (PSI), performance regression. How they connect.
7:20 - 7:50
PSI Deep Dive + Triage — The PSI formula, severity bands, the 4 D's framework, and when to retrain vs rollback vs wait
7:50 - 8:30
Build Block — Run the monitoring pipeline interactively. Explore thresholds. Edit MODIFY HERE blocks. Use Exploration Mode to run what-if scenarios.
8:30 - 9:00
Triage Practice — Answer reflection questions. Fill out your monitoring plan. Compare triage packets across different threshold settings.
9:00 - 9:30
Review + Discussion — Share findings. Debate: is the drift causal? Which segment matters most? What would you recommend?