Decision Tree Flowchart
Alert Triggered

  • Data quality failing?
      YES → Fix data pipeline
      NO → Drift detected (PSI > threshold)?
          YES → Performance degraded (AUC drop)?
              YES → Retrain model
              NO → Investigate feature drift
          NO → Performance degraded (AUC drop)?
              YES → Monitor closely
              NO → All clear, log and close
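
The flowchart can be sketched as a small function. This is a minimal sketch, assuming the branch order read from the flattened chart (data quality first, then drift, then performance); the function name `triage` and its boolean parameters are hypothetical and not part of the original page.

```python
def triage(data_quality_failing: bool,
           drift_detected: bool,
           performance_degraded: bool) -> str:
    """Walk the decision tree from 'Alert Triggered' to a recommended action."""
    # A broken pipeline invalidates every downstream signal, so check it first.
    if data_quality_failing:
        return "Fix data pipeline"
    # Drift branch: PSI above threshold.
    if drift_detected:
        if performance_degraded:
            return "Retrain model"
        return "Investigate feature drift"
    # No-drift branch: only the AUC check remains.
    if performance_degraded:
        return "Monitor closely"
    return "All clear: log and close"


if __name__ == "__main__":
    print(triage(False, True, True))
```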
Severity Calculator

Inputs

  • PSI value: 0.0 to 0.5
  • Missing rate: 0% to 20%
  • AUC drop: 0% to 15%

Example Reading (PSI 0.00, missing rate 0.0%, AUC drop 0.0%)

  • Severity: LOW
  • Recommended response: within 48 hours
  • Log the alert. Review during next scheduled check-in. No immediate action needed.
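
The calculator's scoring rule is not shown on this page, so the following is only a hypothetical sketch of how the three inputs might map to a severity level. The alert thresholds (PSI 0.2, missing rate 5%) are taken from the alert template further down, and the AUC cutoffs (5%, 10%) from the retrain and rollback playbooks. How breaches combine into a level is an assumption, as are all names.

```python
# Thresholds quoted elsewhere in this document.
PSI_ALERT = 0.2            # alert template threshold
MISSING_ALERT = 0.05       # alert template threshold (5%)
AUC_DROP_RETRAIN = 0.05    # retrain playbook trigger (5%)
AUC_DROP_ROLLBACK = 0.10   # rollback playbook criterion (10%)


def severity(psi: float, missing_rate: float, auc_drop: float) -> str:
    """Map three monitoring inputs to a severity label.

    The combination rule (count of breached thresholds) is an assumption,
    not the calculator's actual logic.
    """
    if auc_drop > AUC_DROP_ROLLBACK:
        return "CRITICAL"
    breaches = sum([
        psi > PSI_ALERT,
        missing_rate > MISSING_ALERT,
        auc_drop > AUC_DROP_RETRAIN,
    ])
    if breaches >= 2:
        return "HIGH"
    if breaches == 1:
        return "MEDIUM"
    return "LOW"


if __name__ == "__main__":
    print(severity(0.0, 0.0, 0.0))
```

With all inputs at zero the sketch returns LOW, which matches the calculator's default reading.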
Action Playbooks
Retrain Model
When drift and performance degradation are confirmed

When to Trigger

  • PSI exceeds 0.25 for two or more key features
  • AUC has dropped more than 5% from baseline
  • Drift and performance degradation are correlated (same time window)
  • Data quality checks pass (pipeline is not broken)
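
As a rough illustration, the trigger conditions can be expressed as a single check. This sketch assumes that all four conditions must hold together (the list does not say so explicitly) and that the AUC drop is measured in percent relative to baseline. The `RetrainSignals` class and its field names are hypothetical.

```python
from dataclasses import dataclass

PSI_RETRAIN_THRESHOLD = 0.25
MIN_DRIFTED_KEY_FEATURES = 2
AUC_DROP_THRESHOLD_PCT = 5.0


@dataclass
class RetrainSignals:
    psi_by_key_feature: dict[str, float]  # PSI per key feature
    auc_drop_pct: float                   # percent drop from baseline
    same_time_window: bool                # drift and AUC drop are correlated
    data_quality_passing: bool            # pipeline is not broken


def should_retrain(s: RetrainSignals) -> bool:
    """Return True only when every trigger condition is met (assumed AND)."""
    drifted = [name for name, psi in s.psi_by_key_feature.items()
               if psi > PSI_RETRAIN_THRESHOLD]
    return (len(drifted) >= MIN_DRIFTED_KEY_FEATURES
            and s.auc_drop_pct > AUC_DROP_THRESHOLD_PCT
            and s.same_time_window
            and s.data_quality_passing)


if __name__ == "__main__":
    signals = RetrainSignals({"age": 0.31, "income": 0.27, "tenure": 0.04},
                             auc_drop_pct=6.2,
                             same_time_window=True,
                             data_quality_passing=True)
    print(should_retrain(signals))
```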

Steps to Execute

  • Pull the latest validated training data (past 90 days rolling window)
  • Run feature engineering pipeline with current feature set
  • Retrain using the same hyperparameter configuration as production model
  • Evaluate on holdout set: ensure AUC is within 1% of original baseline
  • Run a shadow test (candidate scores live traffic alongside production without serving users) for 24-48 hours before full deployment
  • Deploy via blue-green deployment to minimize downtime

Rollback Plan

  • Keep the previous model artifact tagged as "last-known-good" in the registry
  • If retrained model shows worse metrics in shadow test, abort deployment
  • Rollback is a one-command operation: point serving endpoint to previous version
  • Notify stakeholders immediately if rollback is triggered
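
The holdout check and the shadow-test abort rule can be combined into one deployment gate. This is a sketch under two assumptions: "within 1% of baseline" is read as a relative lower bound (a higher AUC is acceptable), and "worse metrics" is read as a lower shadow-test AUC than the production model. The function and parameter names are hypothetical.

```python
def deployment_decision(baseline_auc: float,
                        holdout_auc: float,
                        shadow_candidate_auc: float,
                        shadow_production_auc: float,
                        tolerance: float = 0.01) -> str:
    """Decide whether a retrained model may proceed to blue-green deployment."""
    # Holdout gate: candidate must be no more than 1% (relative) below baseline.
    if holdout_auc < baseline_auc * (1 - tolerance):
        return "abort: holdout AUC more than 1% below baseline"
    # Shadow gate: candidate must not underperform the live model.
    if shadow_candidate_auc < shadow_production_auc:
        return "abort: candidate worse than production in shadow test"
    return "deploy"


if __name__ == "__main__":
    print(deployment_decision(0.80, 0.795, 0.79, 0.78))
```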

Communication Template

Slack

  :rotating_light: *Model Retrain Initiated*
  *Model:* [model_name] (v[current_version])
  *Trigger:* PSI = [value], AUC drop = [value]%
  *Action:* Retraining with data from [start_date] to [end_date]
  *Timeline:* Shadow test for 24-48h, deploy by [target_date]
  *Owner:* @[engineer_name]
  *Status:* :hourglass_flowing_sand: In progress

  Will update this thread with results. cc @ml-team

Roll Back to Previous Model
Emergency response when current model is actively harming metrics

Criteria for Rollback

  • AUC has dropped more than 10% from baseline
  • Business KPIs (conversion, revenue) are visibly impacted
  • A newly deployed model is performing worse than its predecessor
  • Data corruption is suspected and cannot be resolved quickly

Rollback Process

  • Identify the last-known-good model version in the model registry
  • Run a quick sanity check on the previous model using a sample of current data
  • Update the serving endpoint to point to the previous model version
  • Verify predictions are being served correctly (spot-check 10-20 requests)
  • Update monitoring dashboards to reflect the rolled-back version

Post-Rollback Verification

  • Monitor metrics for 2-4 hours post-rollback to confirm stabilization
  • Compare prediction distributions before/after rollback
  • Document the root cause and timeline in the incident log
  • Schedule a post-mortem within 48 hours

Communication Template

Email

  Subject: [URGENT] Model Rollback - [model_name]

  Team,

  We have initiated a rollback of [model_name] from v[new] to v[previous].

  Reason: [brief description of degradation observed]
  Impact: [estimated business impact]
  Timeline: Rollback completed at [timestamp]

  Immediate next steps:
  - Monitoring stabilization for the next 4 hours
  - Root cause investigation begins immediately
  - Post-mortem scheduled for [date/time]

  Please hold any deployments to this endpoint until further notice.

  [Your name] | ML Platform Team

Alert Stakeholders
Communicate issues to the right people at the right time

Who to Notify

  • Low severity: ML team Slack channel only
  • Medium severity: ML team lead + data engineering lead
  • High severity: Above + product manager + engineering manager
  • Critical severity: All above + VP of Engineering + on-call incident commander
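
The notification tiers lend themselves to a lookup. The sketch below assumes "Above" and "All above" mean the named people from the medium tier upward; the `TIERS` table and the `recipients` function are hypothetical names.

```python
# Each tier adds recipients on top of the tiers before it (from medium up).
TIERS = [
    ("medium", ["ML team lead", "data engineering lead"]),
    ("high", ["product manager", "engineering manager"]),
    ("critical", ["VP of Engineering", "on-call incident commander"]),
]


def recipients(severity: str) -> list[str]:
    """Return who to notify for a given severity level."""
    severity = severity.lower()
    if severity == "low":
        return ["ML team Slack channel"]
    notified: list[str] = []
    for tier, additions in TIERS:
        notified.extend(additions)
        if tier == severity:
            return notified
    raise ValueError(f"unknown severity: {severity!r}")


if __name__ == "__main__":
    print(recipients("high"))
```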

Communication Guidelines

  • Lead with impact, not technical details
  • Include what you know, what you do not know, and what you are doing
  • Provide an estimated time to resolution or next update
  • Use the appropriate channel: Slack for low/medium, email/PagerDuty for high/critical
  • Update stakeholders at regular intervals (every 30 min for critical, hourly for high)

Escalation Timeline

  • T+0 min: Initial alert sent to ML team channel
  • T+15 min: If not acknowledged, page the on-call ML engineer
  • T+30 min: Escalate to team lead if no progress
  • T+60 min: Escalate to engineering manager for critical issues
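
The timeline above can be sketched as a function that lists which escalation steps are due at a given point in an incident. The function name and its parameters are hypothetical; the minute marks and conditions come from the list.

```python
def escalation_actions(minutes_elapsed: int,
                       acknowledged: bool,
                       progress_made: bool,
                       severity: str) -> list[str]:
    """List the escalation steps that are due at T+minutes_elapsed."""
    actions = ["Initial alert sent to ML team channel"]  # T+0
    if minutes_elapsed >= 15 and not acknowledged:
        actions.append("Page the on-call ML engineer")
    if minutes_elapsed >= 30 and not progress_made:
        actions.append("Escalate to team lead")
    if minutes_elapsed >= 60 and severity.lower() == "critical":
        actions.append("Escalate to engineering manager")
    return actions


if __name__ == "__main__":
    for step in escalation_actions(45, acknowledged=True,
                                   progress_made=False, severity="high"):
        print(step)
```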

Communication Template

Slack

  :warning: *ML Model Alert - [Severity Level]*
  *Model:* [model_name]
  *Detected:* [timestamp]
  *Issue:* [one-line summary]
  *Current metrics:*
  - PSI: [value] (threshold: 0.2)
  - Missing rate: [value]% (threshold: 5%)
  - AUC: [value] (baseline: [value])
  *Impact:* [known or estimated user/business impact]
  *Status:* Investigating
  *Next update:* [time]
  *Owner:* @[engineer_name]

  Thread for live updates below.

Increase Monitoring Frequency
Tighten observation windows when anomalies are suspected

When to Increase Monitoring

  • Metrics are trending toward thresholds but have not yet crossed
  • A new model version was recently deployed (first 7 days)
  • External factors may affect input data (holidays, campaigns, market events)
  • A partial data quality issue was fixed but needs verification

How to Adjust

  • Change monitoring job schedule from daily to every 4 hours (or hourly for critical)
  • Lower alert thresholds temporarily (e.g., PSI from 0.2 to 0.15)
  • Enable detailed feature-level drift reports (top 20 features)
  • Add prediction distribution histograms to the dashboard
  • Set up a dedicated Slack alert channel for the elevated period
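
A feature-level drift report can be sketched with the standard PSI formula, the sum over bins of (actual - expected) * ln(actual / expected), ranked by score. This is a minimal standard-library sketch: the equal-width binning over the baseline range, the `eps` floor for empty bins, and the names `psi` and `drift_report` are assumptions, not the monitoring system's actual implementation.

```python
import math


def psi(expected: list[float], actual: list[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index using equal-width bins over the baseline range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def drift_report(baseline: dict[str, list[float]],
                 current: dict[str, list[float]],
                 top_n: int = 20) -> list[tuple[str, float]]:
    """Rank features present in both datasets by PSI, highest first."""
    scores = {name: psi(baseline[name], current[name])
              for name in baseline if name in current}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


if __name__ == "__main__":
    base = {"a": [float(x) for x in range(100)],
            "b": [float(x) for x in range(100)]}
    cur = {"a": [float(x) for x in range(100)],
           "b": [float(x + 50) for x in range(100)]}
    for name, score in drift_report(base, cur):
        print(name, round(score, 3))
```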

Duration and Exit Criteria

  • Maintain elevated monitoring for 5-7 days minimum
  • Exit when metrics are stable for 3 consecutive check-ins
  • Document the monitoring change and rationale in the ops log
  • Revert to standard schedule via the monitoring config, not silently
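
The exit criteria can be checked mechanically. The sketch below assumes both conditions must hold: at least 5 days of elevated monitoring and 3 consecutive stable check-ins at the end of the history. The function name and parameters are hypothetical.

```python
def can_exit_elevated_monitoring(days_elevated: int,
                                 checkins_stable: list[bool],
                                 min_days: int = 5,
                                 required_stable: int = 3) -> bool:
    """Return True when elevated monitoring may revert to the standard schedule.

    checkins_stable holds one flag per check-in, oldest first.
    """
    if days_elevated < min_days:
        return False
    tail = checkins_stable[-required_stable:]
    # Require a full window of consecutive stable readings.
    return len(tail) == required_stable and all(tail)


if __name__ == "__main__":
    print(can_exit_elevated_monitoring(6, [False, True, True, True]))
```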

Communication Template

Slack

  :mag: *Monitoring Frequency Increased*
  *Model:* [model_name]
  *Reason:* [brief description]
  *New schedule:* Every [N] hours (was: daily)
  *Threshold adjustments:* PSI alert at [value] (was [value])
  *Duration:* [start_date] through [end_date]
  *Exit criteria:* 3 consecutive stable readings

  No action needed from the team unless alerts fire.
  Owner: @[engineer_name]

Wait and Watch
When the signal is ambiguous and premature action could cause harm

When This Is the Right Response

  • A single metric blipped once but has not sustained above threshold
  • The alert coincides with known low-traffic periods (weekends, holidays)
  • Drift is detected in low-importance features only
  • Performance metrics remain within acceptable bounds
  • The anomaly could be caused by a known upstream change that is expected

What "Watch" Means Concretely

  • Acknowledge the alert in the monitoring system (do not let it go stale)
  • Set a calendar reminder to re-check in 24 hours
  • Note the current metric values as a comparison baseline
  • Review the next 2-3 monitoring reports before closing

Escalation Triggers

  • The same alert fires again within 48 hours
  • A second metric enters the yellow zone
  • Any metric enters the red zone
  • Stakeholders report user-facing issues that may be related
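
These triggers can be sketched as a single check that ends the wait-and-watch period. The yellow and red zones are not defined on this page, so the sketch takes counts of metrics in each zone as inputs. All names are hypothetical.

```python
from typing import Optional


def should_escalate(hours_until_alert_recurred: Optional[float],
                    metrics_in_yellow: int,
                    metrics_in_red: int,
                    user_facing_reports: bool) -> bool:
    """Return True if any escalation trigger has fired.

    hours_until_alert_recurred is None when the alert has not fired again.
    """
    recurred_within_48h = (hours_until_alert_recurred is not None
                           and hours_until_alert_recurred <= 48)
    return (recurred_within_48h
            or metrics_in_yellow >= 2   # a second metric entered the yellow zone
            or metrics_in_red >= 1      # any metric entered the red zone
            or user_facing_reports)


if __name__ == "__main__":
    print(should_escalate(None, metrics_in_yellow=1,
                          metrics_in_red=0, user_facing_reports=False))
```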

Communication Template

Slack

  :eyes: *Alert Acknowledged - Watching*
  *Model:* [model_name]
  *Alert:* [metric] = [value] (threshold: [value])
  *Assessment:* Likely transient. No performance impact observed.
  *Action:* Monitoring for next 48 hours.
  *Escalation if:* Alert recurs or additional metrics degrade.
  *Next check:* [date/time]
  *Owner:* @[engineer_name]

  Will update or escalate as needed.