How to Use This Guide
This page covers three parts of production ML monitoring: the 4 D's incident response loop, a pipeline overlay of monitoring checkpoints, and a reference table of checks.
The 4 D's Loop
Understand the four-phase incident response cycle used by ML teams: Detect, Diagnose, Decide, and Document, with detailed guidance for each phase.
Pipeline Overlay
Visualize where monitoring checkpoints sit across a standard ML pipeline, and learn what each check does along with a sample alert.
Reference Table
A comprehensive inventory of 26 monitoring checks, organized by pipeline stage, type, frequency, and severity level.
The Monitoring Loop: The 4 D's
Each phase below lists what happens at that step of the incident response cycle.
Detect: Signals to Watch
- Data quality metrics: null rates, schema violations, volume drops
- Drift metrics: PSI, KL divergence, Jensen-Shannon distance on features and predictions
- Performance KPIs: accuracy, precision, recall, AUC trending over time windows
- System health: latency p50/p95/p99, throughput, error rates
- Statistical process control charts for real-time anomaly flagging
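As a minimal sketch of the control-chart idea in the last bullet, the snippet below flags values that fall outside mean ± k standard deviations of a baseline window. The function names, the choice of k = 3, and the null-rate figures are illustrative assumptions, not values taken from this guide.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Return (lower, upper) Shewhart-style limits: mean +/- k * sample stdev."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return mu - k * sigma, mu + k * sigma


def flag_out_of_control(values, baseline, k=3.0):
    """Return the indices of values that fall outside the control limits."""
    lower, upper = control_limits(baseline, k)
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical daily null rates (fractions) for a single feature.
baseline_null_rates = [0.010, 0.012, 0.011, 0.009, 0.010, 0.011, 0.012, 0.010]
todays_null_rates = [0.011, 0.010, 0.045]

flagged = flag_out_of_control(todays_null_rates, baseline_null_rates)
print(flagged)  # index 2 (a 4.5% null rate) is far above the upper limit
```

The same pattern applies to volume, latency, or error-rate series; only the baseline window and the value of k would change.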
Diagnose: Root Cause Analysis Steps
- Which feature(s) drifted? Compare current vs. reference distributions per feature
- Was there an upstream data change? Check data source versioning and schema logs
- Is it seasonal or cyclical? Compare against same period in prior cycles
- Did a code or config change deploy recently? Check CI/CD deployment history
- Is the issue in a specific segment? Slice metrics by cohort, geography, device
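The first question above (which features drifted?) can be sketched as a per-feature comparison of current and reference samples. This example uses the two-sample Kolmogorov-Smirnov statistic, computed by hand with the standard library; the feature names and values are hypothetical, and a production setup might prefer a tested statistics library.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    max_gap = 0.0
    # The largest gap between two step functions occurs at an observed data point.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


def rank_features_by_drift(reference, current):
    """Return (feature, ks_statistic) pairs, most drifted first."""
    scores = {name: ks_statistic(reference[name], current[name]) for name in reference}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical reference (training) and current (production) samples.
reference_data = {"age": [20, 30, 40, 50, 60], "income": [10, 20, 30, 40, 50]}
current_data = {"age": [21, 31, 41, 51, 61], "income": [60, 70, 80, 90, 100]}

ranked = rank_features_by_drift(reference_data, current_data)
print(ranked)  # "income" ranks first because its two samples do not overlap
```

Running the same ranking on a single cohort (for example one geography or device type) is one way to approach the segment question in the last bullet.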
Decide: Action Options
- Retrain: Trigger retraining with updated data if drift is confirmed and labels are available
- Rollback: Revert to previous model version if new deployment caused regression
- Alert & Escalate: Notify the on-call team if impact exceeds thresholds
- Wait & Watch: Monitor closely if the signal is noisy or within acceptable bounds
- Adjust Thresholds: Recalibrate alert boundaries if false positive rate is too high
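The options above can be read as a small rule set. The sketch below encodes one possible mapping from observed conditions to actions; the `Signal` fields, the action labels, and the 0.5 false-positive cutoff are assumptions made for illustration, and a real policy would be agreed by the team that owns the model.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """Hypothetical summary of what the Diagnose phase found."""
    drift_confirmed: bool
    labels_available: bool
    recent_deploy_regressed: bool
    impact_exceeds_threshold: bool
    false_positive_rate: float  # fraction of recent alerts judged spurious


def choose_actions(signal, max_false_positive_rate=0.5):
    """Map a diagnosed signal to one or more of the action options."""
    actions = []
    if signal.recent_deploy_regressed:
        actions.append("rollback")
    if signal.drift_confirmed and signal.labels_available:
        actions.append("retrain")
    if signal.impact_exceeds_threshold:
        actions.append("alert_and_escalate")
    if signal.false_positive_rate > max_false_positive_rate:
        actions.append("adjust_thresholds")
    if not actions:
        # Nothing actionable yet: keep monitoring closely.
        actions.append("wait_and_watch")
    return actions


drifted = Signal(True, True, False, True, 0.1)
quiet = Signal(False, False, False, False, 0.0)
print(choose_actions(drifted))  # ['retrain', 'alert_and_escalate']
print(choose_actions(quiet))    # ['wait_and_watch']
```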
Document: What to Record
- Incident timeline: When detected, who was notified, when resolved
- Impact assessment: Number of users affected, revenue impact, SLA breaches
- Resolution steps: Exact actions taken and their outcomes
- Prevention plan: New checks, threshold changes, or process updates to avoid recurrence
- Runbook updates: Add to operational playbooks for future on-call engineers
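One way to keep these records consistent is a structured incident object that serializes to JSON. The sketch below is a hypothetical schema whose field names mirror the list above; all example values are invented for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime


@dataclass
class IncidentRecord:
    """Hypothetical incident record covering the items listed above."""
    title: str
    detected_at: datetime
    resolved_at: datetime
    notified: list
    users_affected: int
    sla_breached: bool
    resolution_steps: list = field(default_factory=list)
    prevention_plan: list = field(default_factory=list)
    runbook_updates: list = field(default_factory=list)

    def time_to_resolve(self):
        """Elapsed time between detection and resolution."""
        return self.resolved_at - self.detected_at

    def to_json(self):
        """Serialize the record; datetimes are written as strings."""
        return json.dumps(asdict(self), default=str, indent=2)


# Invented example values.
record = IncidentRecord(
    title="Drift in income feature",
    detected_at=datetime(2024, 1, 1, 9, 0),
    resolved_at=datetime(2024, 1, 1, 11, 30),
    notified=["on-call ML engineer"],
    users_affected=1200,
    sla_breached=False,
    resolution_steps=["Rolled back to previous model version"],
    prevention_plan=["Add a drift check on the income feature"],
)
print(record.time_to_resolve())  # 2:30:00
```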
Pipeline Diagram with Monitoring Overlay
Each monitoring checkpoint (colored badge) on the diagram describes what it checks and shows a sample alert.
Check Placement Reference
A comprehensive inventory of monitoring checks across the pipeline.
| Check Name | Check Type | Pipeline Stage | Frequency | Severity |
|---|---|---|---|---|
Key Concepts Reference
Essential monitoring terminology and definitions at a glance.
Data Drift
A change in the statistical distribution of input features between the training dataset and production data. Data drift can cause model performance to degrade silently if not detected. Common measures include PSI, KL divergence, and the Kolmogorov-Smirnov test.
Concept Drift
A change in the relationship between input features and the target variable. Unlike data drift, concept drift means the underlying patterns the model learned are no longer valid. Addressing it typically requires retraining on fresh labeled data.
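Because concept drift changes the feature-to-target relationship, it tends to surface as falling accuracy once ground-truth labels arrive, even when input distributions look stable. The sketch below compares accuracy over consecutive windows against a reference value; the window size, the 0.1 tolerance, and the sample data are assumptions for illustration.

```python
def windowed_accuracy(predictions, labels, window):
    """Accuracy for each consecutive, non-overlapping window of the given size."""
    scores = []
    for start in range(0, len(labels) - window + 1, window):
        preds = predictions[start:start + window]
        truth = labels[start:start + window]
        scores.append(sum(p == t for p, t in zip(preds, truth)) / window)
    return scores


def degraded_windows(scores, reference_accuracy, max_drop=0.1):
    """Indices of windows whose accuracy fell more than max_drop below the reference."""
    return [i for i, s in enumerate(scores) if reference_accuracy - s > max_drop]


# Hypothetical predictions and late-arriving labels, in time order.
predictions = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
labels = [1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1]

scores = windowed_accuracy(predictions, labels, window=4)
print(scores)                                            # [1.0, 0.75, 0.25]
print(degraded_windows(scores, reference_accuracy=0.9))  # [1, 2]
```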
Population Stability Index (PSI)
A metric that quantifies how much a distribution has shifted from a reference baseline. As a common rule of thumb, PSI below 0.1 indicates no significant change, 0.1 to 0.2 suggests moderate drift, and above 0.2 indicates significant drift requiring investigation. Exact cutoffs vary by team and by binning scheme.
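For reference, PSI is usually computed over binned proportions as the sum of (current − reference) × ln(current / reference) across bins. The sketch below is a minimal standard-library version under stated assumptions: the bin edges are hypothetical, out-of-range values are clamped into the edge bins, and empty bins are floored at a small epsilon to avoid division by zero. The interpretation helper uses the thresholds quoted in the definition.

```python
import math
from bisect import bisect_right


def bin_proportions(values, bin_edges, eps=1e-6):
    """Share of values per bin; empty bins are floored at eps."""
    n_bins = len(bin_edges) - 1
    counts = [0] * n_bins
    for v in values:
        idx = bisect_right(bin_edges, v) - 1
        idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values to edge bins
        counts[idx] += 1
    return [max(c / len(values), eps) for c in counts]


def psi(reference, current, bin_edges):
    """Population Stability Index between a reference and a current sample."""
    ref_p = bin_proportions(reference, bin_edges)
    cur_p = bin_proportions(current, bin_edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def interpret_psi(value):
    """Apply the rule-of-thumb thresholds from the definition."""
    if value < 0.1:
        return "no significant change"
    if value <= 0.2:
        return "moderate drift"
    return "significant drift"


edges = [0, 25, 50, 75, 100]          # hypothetical bin edges
reference_sample = list(range(100))   # evenly spread over the four bins
shifted_sample = list(range(40, 140)) # same shape, shifted upward

print(interpret_psi(psi(reference_sample, reference_sample, edges)))
print(interpret_psi(psi(reference_sample, shifted_sample, edges)))
```

The epsilon floor makes PSI large whenever a bin empties out completely, so the bin edges and the epsilon value both influence the result and are worth fixing once per feature.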
Shadow Deployment
Running a new model version in parallel with the production model, sending it real traffic but not using its predictions for decisions. This allows a safe comparison of performance metrics before a full switch and is especially valuable for high-stakes ML systems.
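A minimal sketch of the shadow pattern follows: the production model's output is always what the caller receives, while the shadow model's output is only logged for later comparison. The two threshold "models", the log structure, and the function names are hypothetical stand-ins.

```python
def serve_with_shadow(features, production_model, shadow_model, shadow_log):
    """Return the production decision; record the shadow prediction on the side."""
    decision = production_model(features)
    try:
        shadow_prediction = shadow_model(features)
        shadow_log.append(
            {"features": features, "production": decision, "shadow": shadow_prediction}
        )
    except Exception as exc:
        # A shadow failure must never affect the live response.
        shadow_log.append(
            {"features": features, "production": decision, "shadow_error": repr(exc)}
        )
    return decision


def agreement_rate(shadow_log):
    """Fraction of logged requests where both models agreed (None if nothing to compare)."""
    compared = [entry for entry in shadow_log if "shadow" in entry]
    if not compared:
        return None
    return sum(e["production"] == e["shadow"] for e in compared) / len(compared)


# Hypothetical models: simple thresholds on a transaction amount.
def production_model(features):
    return features["amount"] > 100


def shadow_model(features):
    return features["amount"] > 80


log = []
decisions = [
    serve_with_shadow({"amount": a}, production_model, shadow_model, log)
    for a in [50, 90, 120, 200]
]
print(decisions)            # [False, False, True, True]
print(agreement_rate(log))  # 0.75
```

In a real service the shadow call would typically run asynchronously so that it adds no latency to the live request; it is kept synchronous here for brevity.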
Canary Release
Gradually rolling out a new model version to an increasing percentage of traffic while closely monitoring key metrics. If degradation is detected, traffic is shifted back to the stable version, often automatically. This reduces the blast radius of a bad deployment.
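The rollout logic can be sketched as a loop over traffic stages with a rollback guard. The stage percentages, the error-rate tolerance, and the `observe_error_rate` callback are assumptions for illustration; a real rollout would read metrics from the serving platform and hold at each stage for a soak period.

```python
def run_canary(stages, observe_error_rate, baseline_error_rate, tolerance=0.01):
    """Step through traffic percentages; roll back to 0% if the canary degrades."""
    history = []
    for percent in stages:
        canary_error_rate = observe_error_rate(percent)
        history.append((percent, canary_error_rate))
        if canary_error_rate > baseline_error_rate + tolerance:
            return {"status": "rolled_back", "canary_traffic_percent": 0, "history": history}
    return {"status": "promoted", "canary_traffic_percent": stages[-1], "history": history}


# Hypothetical observed error rates at each traffic stage.
observed = {5: 0.020, 25: 0.021, 50: 0.045, 100: 0.020}

result = run_canary(
    stages=[5, 25, 50, 100],
    observe_error_rate=lambda percent: observed[percent],
    baseline_error_rate=0.02,
)
print(result["status"])   # rolled_back
print(result["history"])  # stops at the 50% stage, where the error rate jumped
```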
Feature Store Monitoring
Tracking the health and freshness of features served from a centralized feature store. This includes monitoring feature computation latency, staleness windows, schema consistency, and usage patterns across the models that consume each feature.
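As one example of a staleness check, the sketch below compares each feature's last update time against an allowed window. The feature names, timestamps, and windows are hypothetical, and the current time is passed in explicitly so the check is reproducible.

```python
from datetime import datetime, timedelta


def stale_features(last_updated, max_staleness, now):
    """Names of features whose last update is older than their allowed window."""
    return sorted(
        name
        for name, updated_at in last_updated.items()
        if now - updated_at > max_staleness[name]
    )


# Hypothetical feature metadata.
now = datetime(2024, 1, 1, 12, 0)
last_updated = {
    "user_7d_spend": datetime(2024, 1, 1, 6, 0),      # 6 hours old
    "session_count_1h": datetime(2024, 1, 1, 11, 30), # 30 minutes old
}
max_staleness = {
    "user_7d_spend": timedelta(hours=4),
    "session_count_1h": timedelta(hours=1),
}

stale = stale_features(last_updated, max_staleness, now)
print(stale)  # ['user_7d_spend']
```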