When to Trigger
- PSI exceeds 0.25 for two or more key features
- AUC has dropped more than 5% from baseline
- Drift and performance degradation are correlated (same time window)
- Data quality checks pass (pipeline is not broken)
Steps to Execute
- Pull the latest validated training data (past 90 days rolling window)
- Run feature engineering pipeline with current feature set
- Retrain using the same hyperparameter configuration as production model
- Evaluate on holdout set: ensure AUC is within 1% of original baseline
- Run A/B shadow test for 24-48 hours before full deployment
- Deploy via blue-green deployment to minimize downtime
Rollback Plan
- Keep the previous model artifact tagged as "last-known-good" in the registry
- If retrained model shows worse metrics in shadow test, abort deployment
- Rollback is a one-command operation: point serving endpoint to previous version
- Notify stakeholders immediately if rollback is triggered
Communication Template
Slack
:rotating_light: *Model Retrain Initiated*
*Model:* [model_name] (v[current_version])
*Trigger:* PSI = [value], AUC drop = [value]%
*Action:* Retraining with data from [start_date] to [end_date]
*Timeline:* Shadow test for 24-48h, deploy by [target_date]
*Owner:* @[engineer_name]
*Status:* :hourglass_flowing_sand: In progress
Will update this thread with results. cc @ml-team