ML Triage Decision Tree Playbook

↻

Retrain Model

When drift and performance degradation are confirmed

▼

When to Trigger

PSI exceeds 0.25 for two or more key features
AUC has dropped more than 5% from baseline
Drift and performance degradation are correlated (same time window)
Data quality checks pass (pipeline is not broken)

Steps to Execute

Pull the latest validated training data (past 90 days rolling window)
Run feature engineering pipeline with current feature set
Retrain using the same hyperparameter configuration as production model
Evaluate on holdout set: ensure AUC is within 1% of original baseline
Run A/B shadow test for 24-48 hours before full deployment
Deploy via blue-green deployment to minimize downtime

Rollback Plan

Keep the previous model artifact tagged as "last-known-good" in the registry
If retrained model shows worse metrics in shadow test, abort deployment
Rollback is a one-command operation: point serving endpoint to previous version
Notify stakeholders immediately if rollback is triggered

Communication Template

Slack :rotating_light: *Model Retrain Initiated* *Model:* [model_name] (v[current_version]) *Trigger:* PSI = [value], AUC drop = [value]% *Action:* Retraining with data from [start_date] to [end_date] *Timeline:* Shadow test for 24-48h, deploy by [target_date] *Owner:* @[engineer_name] *Status:* :hourglass_flowing_sand: In progress Will update this thread with results. cc @ml-team

⏪

Roll Back to Previous Model

Emergency response when current model is actively harming metrics

▼

Criteria for Rollback

AUC has dropped more than 10% from baseline
Business KPIs (conversion, revenue) are visibly impacted
A newly deployed model is performing worse than its predecessor
Data corruption is suspected and cannot be resolved quickly

Rollback Process

Identify the last-known-good model version in the model registry
Run a quick sanity check on the previous model with current data sample
Update the serving endpoint to point to the previous model version
Verify predictions are being served correctly (spot-check 10-20 requests)
Update monitoring dashboards to reflect the rolled-back version

Post-Rollback Verification

Monitor metrics for 2-4 hours post-rollback to confirm stabilization
Compare prediction distributions before/after rollback
Document the root cause and timeline in the incident log
Schedule a post-mortem within 48 hours

Communication Template

Email Subject: [URGENT] Model Rollback - [model_name] Team, We have initiated a rollback of [model_name] from v[new] to v[previous]. Reason: [brief description of degradation observed] Impact: [estimated business impact] Timeline: Rollback completed at [timestamp] Immediate next steps: - Monitoring stabilization for the next 4 hours - Root cause investigation begins immediately - Post-mortem scheduled for [date/time] Please hold any deployments to this endpoint until further notice. [Your name] | ML Platform Team

🔔

Alert Stakeholders

Communicate issues to the right people at the right time

▼

Who to Notify

Low severity: ML team Slack channel only
Medium severity: ML team lead + data engineering lead
High severity: Above + product manager + engineering manager
Critical severity: All above + VP of Engineering + on-call incident commander

Communication Guidelines

Lead with impact, not technical details
Include what you know, what you do not know, and what you are doing
Provide an estimated time to resolution or next update
Use the appropriate channel: Slack for medium, email/PagerDuty for high/critical
Update stakeholders at regular intervals (every 30 min for critical, hourly for high)

Escalation Timeline

T+0 min: Initial alert sent to ML team channel
T+15 min: If not acknowledged, page the on-call ML engineer
T+30 min: Escalate to team lead if no progress
T+60 min: Escalate to engineering manager for critical issues

Communication Template

Slack :warning: *ML Model Alert - [Severity Level]* *Model:* [model_name] *Detected:* [timestamp] *Issue:* [one-line summary] *Current metrics:* - PSI: [value] (threshold: 0.2) - Missing rate: [value]% (threshold: 5%) - AUC: [value] (baseline: [value]) *Impact:* [known or estimated user/business impact] *Status:* Investigating *Next update:* [time] *Owner:* @[engineer_name] Thread for live updates below.

🔍

Increase Monitoring Frequency

Tighten observation windows when anomalies are suspected

▼

When to Increase Monitoring

Metrics are trending toward thresholds but have not yet crossed
A new model version was recently deployed (first 7 days)
External factors may affect input data (holidays, campaigns, market events)
A partial data quality issue was fixed but needs verification

How to Adjust

Change monitoring job schedule from daily to every 4 hours (or hourly for critical)
Lower alert thresholds temporarily (e.g., PSI from 0.2 to 0.15)
Enable detailed feature-level drift reports (top 20 features)
Add prediction distribution histograms to the dashboard
Set up a dedicated Slack alert channel for the elevated period

Duration and Exit Criteria

Maintain elevated monitoring for 5-7 days minimum
Exit when metrics are stable for 3 consecutive check-ins
Document the monitoring change and rationale in the ops log
Revert to standard schedule via the monitoring config, not silently

Communication Template

Slack :mag: *Monitoring Frequency Increased* *Model:* [model_name] *Reason:* [brief description] *New schedule:* Every [N] hours (was: daily) *Threshold adjustments:* PSI alert at [value] (was [value]) *Duration:* [start_date] through [end_date] *Exit criteria:* 3 consecutive stable readings No action needed from the team unless alerts fire. Owner: @[engineer_name]

⏳

Wait and Watch

When the signal is ambiguous and premature action could cause harm

▼

When This Is the Right Response

A single metric blipped once but has not sustained above threshold
The alert coincides with known low-traffic periods (weekends, holidays)
Drift is detected in low-importance features only
Performance metrics remain within acceptable bounds
The anomaly could be caused by a known upstream change that is expected

What "Watch" Means Concretely

Acknowledge the alert in the monitoring system (do not let it go stale)
Set a calendar reminder to re-check in 24 hours
Note the current metric values as a comparison baseline
Review the next 2-3 monitoring reports before closing

Escalation Triggers

The same alert fires again within 48 hours
A second metric enters the yellow zone
Any metric enters the red zone
Stakeholders report user-facing issues that may be related

Communication Template

Slack :eyes: *Alert Acknowledged - Watching* *Model:* [model_name] *Alert:* [metric] = [value] (threshold: [value]) *Assessment:* Likely transient. No performance impact observed. *Action:* Monitoring for next 48 hours. *Escalation if:* Alert recurs or additional metrics degrade. *Next check:* [date/time] *Owner:* @[engineer_name] Will update or escalate as needed.