An interactive guide to understanding how PSI detects distribution drift in production ML systems. Drag the slider to watch the math unfold.
Reference (training) vs. Current (production) feature distributions. Move the slider to simulate drift.
Watch each bin's contribution update live as drift changes.
Divide the feature range into 10 equal-width bins spanning the data.
Compute the proportion of observations in each bin for both distributions.
For each bin, calculate the PSI contribution, (C − R) × ln(C / R), where R and C are the bin's reference and current proportions, then sum across all bins.
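Written out in full, the calculation in the last step looks like this. The symbols are chosen here to match the table's ln(C/R) column: R_i and C_i are the reference and current proportions in bin i, and B is the number of bins.

```latex
\mathrm{PSI} = \sum_{i=1}^{B} \left(C_i - R_i\right)\,\ln\!\left(\frac{C_i}{R_i}\right)
```

As a hypothetical worked case, a bin with R_i = 0.10 and C_i = 0.15 contributes (0.15 − 0.10) × ln(1.5) ≈ 0.05 × 0.4055 ≈ 0.0203. Every term is non-negative, because the difference and the log ratio always share the same sign.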
| Bin | Range | Ref % | Curr % | Diff | ln(C/R) | PSI Contribution |
|---|---|---|---|---|---|---|
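The three steps and the table columns can be sketched in Python roughly as follows. This is a minimal illustration using NumPy, not the code behind the visualization. The function name `psi_table`, the row keys, and the sample data are all hypothetical, and the sketch assumes "spanning the data" means the combined range of both samples.

```python
import numpy as np


def psi_table(reference, current, n_bins=10):
    """Per-bin PSI breakdown with equal-width bins (illustrative sketch)."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)

    # Step 1: equal-width edges spanning both samples (an assumption).
    lo = min(reference.min(), current.min())
    hi = max(reference.max(), current.max())
    edges = np.linspace(lo, hi, n_bins + 1)

    # Step 2: proportion of observations per bin for each distribution.
    ref_p = np.histogram(reference, bins=edges)[0] / reference.size
    cur_p = np.histogram(current, bins=edges)[0] / current.size
    if np.any(ref_p == 0) or np.any(cur_p == 0):
        # Empty bins need separate handling; this sketch refuses them.
        raise ValueError("empty bin: ln(C/R) is undefined without smoothing")

    # Step 3: per-bin contribution (C - R) * ln(C / R), then the sum.
    diff = cur_p - ref_p
    log_ratio = np.log(cur_p / ref_p)
    contrib = diff * log_ratio

    rows = [
        {
            "bin": i + 1,
            "range": (edges[i], edges[i + 1]),
            "ref": ref_p[i],
            "curr": cur_p[i],
            "diff": diff[i],
            "ln_c_over_r": log_ratio[i],
            "contrib": contrib[i],
        }
        for i in range(n_bins)
    ]
    return rows, float(contrib.sum())


# Hypothetical data: a uniform reference and a production sample skewed low.
reference = np.linspace(0.0, 10.0, 1001)
current = 10.0 * np.linspace(0.0, 1.0, 1001) ** 2

rows, psi = psi_table(reference, current)
for r in rows:
    print(f"{r['bin']:>2}  [{r['range'][0]:.1f}, {r['range'][1]:.1f}]  "
          f"{r['ref']:.3f}  {r['curr']:.3f}  {r['diff']:+.3f}  "
          f"{r['ln_c_over_r']:+.3f}  {r['contrib']:.4f}")
print(f"PSI = {psi:.4f}")
```

Raising on empty bins keeps the sketch honest about its limits. In practice a smoothing rule would replace that check.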
Industry-standard thresholds for acting on detected drift.
PSI < 0.1: No significant population shift. The model's input distribution is consistent with training data. Continue standard monitoring.
0.1 ≤ PSI < 0.25: Moderate drift detected. Examine which features shifted and whether model performance metrics have degraded. May need retraining soon.
PSI ≥ 0.25: Significant population change. The model is likely seeing data it was not trained on. Retrain, recalibrate, or investigate root cause immediately.
These thresholds are guidelines, not absolute rules. High-stakes domains (e.g. healthcare, finance) may use stricter thresholds like 0.05 and 0.1.
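A small helper can turn a PSI value into one of the three actions. This is a sketch under stated assumptions: the function name and labels are hypothetical, the default cut-offs of 0.1 and 0.25 are the conventionally cited values, and a value exactly on a boundary falls into the higher band. Because the cut-offs are parameters, a stricter pair such as 0.05 and 0.1 can be swapped in.

```python
def classify_psi(psi, moderate=0.1, significant=0.25):
    """Map a PSI value to a drift band (illustrative thresholds)."""
    if psi < 0:
        raise ValueError("PSI cannot be negative")
    if not moderate < significant:
        raise ValueError("moderate threshold must be below significant")
    if psi < moderate:
        return "stable"
    if psi < significant:
        return "moderate drift"
    return "significant drift"


print(classify_psi(0.04))                                    # default cut-offs
print(classify_psi(0.07, moderate=0.05, significant=0.10))   # stricter cut-offs
```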
Essential background for understanding PSI and data drift in production ML.
The Population Stability Index (PSI) is a symmetric metric that quantifies how much a variable's distribution has shifted between two time periods. Originally developed in credit scoring, it is now a standard tool for monitoring ML features and predictions in production.
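The symmetry can be checked directly. Let R_i and C_i be the reference and current proportions in bin i (notation assumed here). Expanding the per-bin product shows that PSI is the sum of the two Kullback-Leibler divergences, so swapping the two distributions leaves the value unchanged:

```latex
\begin{aligned}
\mathrm{PSI}(R, C)
  &= \sum_i \left(C_i - R_i\right)\ln\frac{C_i}{R_i} \\
  &= \sum_i C_i \ln\frac{C_i}{R_i} + \sum_i R_i \ln\frac{R_i}{C_i} \\
  &= D_{\mathrm{KL}}(C \,\|\, R) + D_{\mathrm{KL}}(R \,\|\, C)
\end{aligned}
```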
Unlike raw accuracy, PSI is an input-side check: it needs no labels, so it can flag a distribution shift before any drop in model performance becomes measurable. This makes it well suited to early-warning systems in MLOps pipelines, where ground-truth labels may arrive with a significant delay.
Divide the feature range into N bins of equal width. Simple and interpretable, but sparse tails may produce empty bins. This is the method used in the visualization above.
Define bin edges so each bin contains roughly the same number of reference observations. Better tail coverage, commonly used in practice with N = 10 (deciles).
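Quantile binning could be sketched as below. The function name is hypothetical and the sketch makes two assumptions. Cut points come from the reference sample only. The two outer bins are open-ended, so production values outside the reference range still land in a bin.

```python
import numpy as np


def quantile_bin_proportions(reference, current, n_bins=10):
    """Bin both samples using quantile cut points from the reference."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)

    # Interior cut points at the 1/n, 2/n, ... quantiles of the reference.
    # Heavily tied data can yield duplicate cut points and hence empty bins.
    cuts = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])

    # searchsorted maps each value to a bin index in 0 .. n_bins - 1.
    ref_idx = np.searchsorted(cuts, reference, side="right")
    cur_idx = np.searchsorted(cuts, current, side="right")

    ref_p = np.bincount(ref_idx, minlength=n_bins) / reference.size
    cur_p = np.bincount(cur_idx, minlength=n_bins) / current.size
    return cuts, ref_p, cur_p


reference = np.arange(1000.0)
current = reference + 150.0  # hypothetical shifted production sample
cuts, ref_p, cur_p = quantile_bin_proportions(reference, current)
print(np.round(ref_p, 3))
print(np.round(cur_p, 3))
```

With deciles, every reference proportion is close to 0.1 by construction, so all of the drift signal shows up in the current proportions.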
If a bin has zero observations in either distribution, the ln(C/R) term becomes undefined (division by zero or log of zero). Common fixes:
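One widely used fix is to floor each proportion at a small epsilon before taking the log. The sketch below does that and then renormalises so each distribution still sums to 1. The function name and the `eps=1e-4` default are assumptions rather than values from this guide. Merging sparse bins with their neighbours is another option and is not shown here.

```python
import numpy as np


def smoothed_psi(ref_counts, cur_counts, eps=1e-4):
    """PSI from per-bin counts, flooring empty bins at eps (a sketch)."""
    ref_counts = np.asarray(ref_counts, dtype=float)
    cur_counts = np.asarray(cur_counts, dtype=float)

    ref_p = ref_counts / ref_counts.sum()
    cur_p = cur_counts / cur_counts.sum()

    # Floor zero (or near-zero) proportions, then renormalise.
    ref_p = np.maximum(ref_p, eps)
    cur_p = np.maximum(cur_p, eps)
    ref_p = ref_p / ref_p.sum()
    cur_p = cur_p / cur_p.sum()

    return float(np.sum((cur_p - ref_p) * np.log(cur_p / ref_p)))


# The third bin is empty in the reference but populated in production.
print(smoothed_psi([50, 50, 0], [40, 40, 20]))
```

The result depends on the choice of epsilon, because a smaller floor makes an empty bin contribute more. The value is worth fixing once and keeping constant across monitoring runs.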
PSI can be noisy with small samples. Random fluctuations may trigger false alarms. Best practices:
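One way to see how much of a PSI value is sampling noise is to estimate a no-drift baseline. The sketch resamples the reference at the production sample size, computes PSI against the full reference, and takes a high percentile of the results. It is illustrative and not a prescribed procedure. All names, the 200 resamples, and the 95% level are assumptions.

```python
import numpy as np


def _proportions(values, edges, eps=1e-4):
    """Floored, renormalised bin proportions for fixed bin edges."""
    idx = np.searchsorted(edges[1:-1], values, side="right")
    p = np.bincount(idx, minlength=len(edges) - 1) / len(values)
    p = np.maximum(p, eps)
    return p / p.sum()


def _psi(ref_p, cur_p):
    return float(np.sum((cur_p - ref_p) * np.log(cur_p / ref_p)))


def psi_noise_floor(reference, sample_size, n_bins=10,
                    n_resamples=200, level=0.95, seed=0):
    """Percentile of PSI between the reference and no-drift resamples."""
    reference = np.asarray(reference, dtype=float)
    rng = np.random.default_rng(seed)
    edges = np.linspace(reference.min(), reference.max(), n_bins + 1)
    ref_p = _proportions(reference, edges)

    # Each resample has no drift by construction, so any PSI is noise.
    null_psi = [
        _psi(ref_p, _proportions(rng.choice(reference, size=sample_size), edges))
        for _ in range(n_resamples)
    ]
    return float(np.quantile(null_psi, level))


reference = np.random.default_rng(42).normal(size=20_000)
for n in (100, 1_000, 5_000):
    print(n, round(psi_noise_floor(reference, n), 4))
```

A production PSI that sits below the baseline for its sample size is hard to distinguish from random fluctuation.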
Too few bins smooth out real drift; too many create noise. The industry standard is 10 bins, but for high-cardinality features, 15–20 can capture finer shifts. Always validate with visual inspection.
For categorical variables, each unique category acts as its own "bin." New categories in production (unseen during training) are strong drift signals: their reference proportion is zero, so once the zero-bin fix described above is applied, PSI registers them as a large spike.
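A categorical version might look like the sketch below, which uses only the standard library. The function name and the `eps` floor for unseen categories are assumptions. The floored proportions are not renormalised here, which distorts the result only on the order of `eps`.

```python
import math
from collections import Counter


def categorical_psi(reference, current, eps=1e-4):
    """PSI over categories; each distinct value is its own bin (a sketch)."""
    ref_counts = Counter(reference)
    cur_counts = Counter(current)
    n_ref, n_cur = len(reference), len(current)

    contributions = {}
    for cat in sorted(set(ref_counts) | set(cur_counts), key=str):
        # Counter returns 0 for a missing category; floor it at eps.
        r = max(ref_counts[cat] / n_ref, eps)
        c = max(cur_counts[cat] / n_cur, eps)
        contributions[cat] = (c - r) * math.log(c / r)
    return sum(contributions.values()), contributions


# Hypothetical data: category "c" never appeared in training.
reference = ["a"] * 50 + ["b"] * 50
current = ["a"] * 40 + ["b"] * 40 + ["c"] * 20

total, parts = categorical_psi(reference, current)
for cat, value in parts.items():
    print(cat, round(value, 4))
print("PSI", round(total, 4))
```

Returning the per-category contributions alongside the total makes it easy to see which category drives a spike.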
In a mature MLOps setup, PSI is computed on a schedule (hourly, daily, or per batch) for every input feature and the model's output score distribution:
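The per-run check could be sketched as a function that a scheduler calls once per batch. Everything here is hypothetical: the function names, the feature names, the `warn` and `alert` cut-offs, and the choice of decile bins with an epsilon floor. The scheduling itself (cron, an orchestrator, or a batch hook) is out of scope for the sketch.

```python
import numpy as np


def psi(reference, current, n_bins=10, eps=1e-4):
    """PSI with quantile bins from the reference and an eps floor."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    cuts = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])

    def proportions(values):
        idx = np.searchsorted(cuts, values, side="right")
        p = np.bincount(idx, minlength=n_bins) / values.size
        p = np.maximum(p, eps)
        return p / p.sum()

    ref_p, cur_p = proportions(reference), proportions(current)
    return float(np.sum((cur_p - ref_p) * np.log(cur_p / ref_p)))


def monitor_batch(reference, current, warn=0.1, alert=0.25):
    """Compute PSI for every monitored column and attach a status."""
    report = {}
    for name, ref_values in reference.items():
        value = psi(ref_values, current[name])
        if value >= alert:
            status = "alert"
        elif value >= warn:
            status = "warn"
        else:
            status = "ok"
        report[name] = (round(value, 4), status)
    return report


# Hypothetical batch: one drifted input feature and a stable output score.
rng = np.random.default_rng(0)
reference = {"age": rng.normal(40, 10, 5000),
             "model_score": rng.uniform(0, 1, 5000)}
current = {"age": rng.normal(50, 10, 5000),
           "model_score": rng.uniform(0, 1, 5000)}

for name, (value, status) in monitor_batch(reference, current).items():
    print(f"{name:<12} PSI={value:<8} {status}")
```

Treating the model's output score as one more column keeps input and prediction drift in a single report.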