# Monitoring Plan: [Your System Name]

> **Author**: [Your Name]
> **Date**: [Date]
> **System**: MetroPulse Recommendation Engine
> **Model**: [Model type, e.g., LogisticRegression / Neural Collaborative Filtering]

---

## 1. System Overview

**What does this system do?**
[Brief description of the recommendation system, its inputs, outputs, and users]

**Key stakeholders:**
| Role | Name/Team | Notification Preference |
|------|-----------|------------------------|
| Model Owner | | Slack + Email |
| Data Engineering | | Slack |
| Product Manager | | Email (daily digest) |
| On-Call Engineer | | PagerDuty |

**Service-Level Objectives (SLOs):**
- Prediction latency: p99 < [___] ms
- Model accuracy (AUC): > [___]
- Availability: [___]%

---

## 2. Data Quality Checks

| Check | What It Validates | Threshold | Frequency | Severity | Why This Threshold? |
|-------|-------------------|-----------|-----------|----------|---------------------|
| Schema validation | Expected columns present, correct types | Exact match | Every batch | Critical | |
| Missingness | % null values per feature | < [___]% | Every batch | High | |
| Range validation | Values within expected bounds | Per-feature | Every batch | High | |
| Volume check | # records in time window | > [___] records/hr | Hourly | Medium | |
| Duplicate check | % duplicate rows | < [___]% | Every batch | Medium | |
| Freshness | Data arrival time | < [___] min late | Every batch | High | |

**Notes:**
- [Explain any check-specific logic or edge cases]
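
The schema, missingness, and duplicate checks in the table above can be prototyped in a few lines. The following is a minimal sketch, not a production validator: the column names and types in `EXPECTED_SCHEMA`, the function names, and the sample batch are all assumptions made for illustration, and a batch is assumed to be a list of dictionaries.

```python
# Hypothetical schema for illustration: column name -> expected Python type.
EXPECTED_SCHEMA = {
    "user_id": str,
    "item_id": str,
    "device_type": str,
    "hour_bucket": int,
}


def check_schema(rows, schema):
    """Return True if every row has exactly the expected columns and types.

    None is accepted here so that missing values are reported by the
    missingness check rather than failing the schema check.
    """
    for row in rows:
        if set(row) != set(schema):
            return False
        for column, expected_type in schema.items():
            value = row[column]
            if value is not None and not isinstance(value, expected_type):
                return False
    return True


def missingness_pct(rows, column):
    """Percentage of rows whose value for `column` is None or absent."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return 100.0 * missing / len(rows)


def duplicate_pct(rows):
    """Percentage of rows that exactly repeat an earlier row in the batch."""
    if not rows:
        return 0.0
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return 100.0 * duplicates / len(rows)


# Made-up batch: one null device_type and one repeated row.
batch = [
    {"user_id": "u1", "item_id": "i1", "device_type": "mobile", "hour_bucket": 9},
    {"user_id": "u2", "item_id": "i2", "device_type": None, "hour_bucket": 14},
    {"user_id": "u1", "item_id": "i1", "device_type": "mobile", "hour_bucket": 9},
    {"user_id": "u3", "item_id": "i3", "device_type": "desktop", "hour_bucket": 21},
]

report = {
    "schema_ok": check_schema(batch, EXPECTED_SCHEMA),
    "device_type_missing_pct": missingness_pct(batch, "device_type"),
    "duplicate_pct": duplicate_pct(batch),
}
print(report)
```

Each value in `report` can then be compared against the threshold you entered in the table, with the severity column deciding where a failure is routed.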

---

## 3. Drift Detection

| Feature | Method | Threshold | Window Size | Frequency | Why This Threshold? |
|---------|--------|-----------|-------------|-----------|---------------------|
| hour_bucket | PSI | [___] | 7 days | Daily | |
| device_type | PSI | [___] | 7 days | Daily | |
| user_id distribution | PSI | [___] | 7 days | Daily | |
| item_id distribution | PSI | [___] | 7 days | Daily | |
| predicted label distribution | PSI | [___] | 7 days | Daily | |

**Reference window policy:**
- [ ] Fixed (set at training time)
- [ ] Sliding (last N days)
- [ ] Updated on retrain

**Drift response escalation:**
| PSI Range | Severity | Action |
|-----------|----------|--------|
| < 0.1 | Low | Log only |
| 0.1 to 0.2 (inclusive) | Medium | Investigate within 24h |
| > 0.2 | High | Alert on-call, investigate immediately |
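
For reference, PSI over a categorical feature such as `device_type` can be computed as the sum over categories of `(actual - expected) * ln(actual / expected)`, where each term uses the category's share in the current and reference windows. The sketch below is one possible implementation under stated assumptions: the function names are hypothetical, the `eps` floor for empty categories is a common workaround rather than part of the PSI definition, and a PSI of exactly 0.2 is assumed to fall in the Medium band.

```python
import math
from collections import Counter


def psi(reference, current, eps=1e-6):
    """Population Stability Index between two samples of a categorical feature.

    `eps` floors each proportion so that a category absent from one window
    does not cause a division by zero or log(0). This is an assumption of
    this sketch; other smoothing schemes are possible.
    """
    if not reference or not current:
        raise ValueError("both windows must contain at least one observation")
    ref_counts = Counter(reference)
    cur_counts = Counter(current)
    total = 0.0
    for category in set(ref_counts) | set(cur_counts):
        expected = max(ref_counts[category] / len(reference), eps)
        actual = max(cur_counts[category] / len(current), eps)
        total += (actual - expected) * math.log(actual / expected)
    return total


def psi_severity(value):
    """Map a PSI value onto the escalation bands in the table above."""
    if value < 0.1:
        return "Low"
    if value <= 0.2:  # assumption: the boundary value 0.2 counts as Medium
        return "Medium"
    return "High"


# Made-up windows: the mobile share falls from 70% to 50%.
reference_window = ["mobile"] * 70 + ["desktop"] * 30
current_window = ["mobile"] * 50 + ["desktop"] * 50

value = psi(reference_window, current_window)
print(round(value, 4), psi_severity(value))
```

Numeric features would need to be binned before being passed to a function like this, and high-cardinality fields such as `user_id` and `item_id` would likely need to be bucketed (for example by activity tier or popularity decile) for the index to be meaningful.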

---

## 4. Performance Monitoring

| Metric | Baseline Value | Alert Threshold | Frequency | Slice Dimensions |
|--------|---------------|-----------------|-----------|------------------|
| AUC (overall) | [___] | Drop > [___] | Daily | — |
| AUC (mobile) | [___] | Drop > [___] | Daily | device_type |
| AUC (desktop) | [___] | Drop > [___] | Daily | device_type |
| AUC (new users) | [___] | Drop > [___] | Daily | user tenure |
| Precision@K | [___] | Drop > [___] | Daily | — |
| Prediction distribution | [___] mean | Shift > [___] | Daily | — |

**Canary evaluation:**
- Holdout set size: [___]%
- Refresh frequency: [___]
- Staleness policy: [How long before the holdout set is considered stale?]
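
Sliced AUC monitoring, as laid out in the table above, can be sketched as follows. This is an illustrative example only: the record fields `label` and `score`, the function names, the baseline values, and the sample data are hypothetical, and the drop is treated as an absolute difference from baseline, which is an assumption because the template leaves that choice to you. The pairwise AUC here is quadratic in slice size, so a library implementation would be preferable at production volumes.

```python
from collections import defaultdict


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. Returns None when a slice lacks either class,
    because AUC is undefined in that case.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        return None
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def sliced_auc(records, slice_key):
    """Compute AUC separately for each value of `slice_key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[slice_key]].append(record)
    return {
        name: auc([r["label"] for r in rows], [r["score"] for r in rows])
        for name, rows in groups.items()
    }


def auc_alerts(current, baseline, max_drop):
    """Names of slices whose AUC fell by more than `max_drop` (absolute)."""
    return [
        name
        for name, value in current.items()
        if value is not None
        and name in baseline
        and baseline[name] - value > max_drop
    ]


# Made-up scored records with ground-truth labels.
records = [
    {"device_type": "mobile", "label": 1, "score": 0.9},
    {"device_type": "mobile", "label": 0, "score": 0.2},
    {"device_type": "mobile", "label": 1, "score": 0.4},
    {"device_type": "mobile", "label": 0, "score": 0.6},
    {"device_type": "desktop", "label": 1, "score": 0.8},
    {"device_type": "desktop", "label": 0, "score": 0.3},
]
baseline = {"mobile": 0.85, "desktop": 0.95}  # hypothetical baseline values

current = sliced_auc(records, "device_type")
alerts = auc_alerts(current, baseline, max_drop=0.05)
print(current, alerts)
```

Monitoring slices separately matters because a regression confined to one segment, such as new users, can be hidden by a stable overall AUC.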

---

## 5. Triage Playbook

| Scenario | Likely Cause | First Response | Escalation | SLA |
|----------|-------------|----------------|------------|-----|
| Schema check fails | Upstream pipeline change | Page data engineering | Model owner | 1 hour |
| Missingness spike | ETL failure, source outage | Check source system status | Data engineering | 2 hours |
| PSI > 0.2 on device_type | Product change, seasonal | Compare with product releases | Model owner | 4 hours |
| PSI > 0.2 on hour_bucket | User behavior shift | Check for holidays/events | Product manager | 24 hours |
| AUC drops 5-10% | Concept drift, data issue | Check drift metrics first | Model owner + PM | 4 hours |
| AUC drops > 10% | Major distribution shift | Roll back to previous model | On-call + Model owner | 1 hour |
| Latency p99 > SLO | Infrastructure, model size | Check serving infrastructure | Platform engineering | 30 min |
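
If the playbook is encoded for automated routing, the two AUC rows need a clear precedence so that a large drop triggers only the more severe response. One way to do that, sketched here with a hypothetical `Playbook` record and `triage_auc_drop` helper, is to list rules from most to least severe and return the first match. The percentages and SLAs are taken from the table above; everything else is an assumption of this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Playbook:
    scenario: str
    first_response: str
    escalation: str
    sla_minutes: int


# Ordered from most to least severe so the first match wins.
AUC_PLAYBOOKS = [
    (10.0, Playbook("AUC drops > 10%", "Roll back to previous model",
                    "On-call + Model owner", 60)),
    (5.0, Playbook("AUC drops > 5%", "Check drift metrics first",
                   "Model owner + PM", 240)),
]


def triage_auc_drop(drop_pct: float) -> Optional[Playbook]:
    """Return the most severe playbook whose threshold the drop exceeds."""
    for threshold, playbook in AUC_PLAYBOOKS:
        if drop_pct > threshold:
            return playbook
    return None  # below every threshold: no triage action required


for drop in (3.0, 7.0, 12.0):
    match = triage_auc_drop(drop)
    print(drop, match.first_response if match else "no action")
```

Whether `drop_pct` is measured relative to baseline or in absolute AUC points should be stated alongside the thresholds, because the two readings can differ substantially.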

---

## 6. Operational Considerations

**Retraining policy:**
- Trigger: [ ] Scheduled (every [___]) | [ ] Drift-triggered | [ ] Performance-triggered
- Validation: [How do you validate the retrained model before deployment?]
- Rollback: [How do you roll back if the new model is worse?]

**Alerting channels:**
- Critical: [___]
- High: [___]
- Medium: [___]
- Low: [___]

**Dashboard requirements:**
- [ ] Real-time data quality dashboard
- [ ] Daily drift report
- [ ] Weekly performance summary
- [ ] Monthly model health review

**Known limitations:**
- [What can't your monitoring catch?]
- [What edge cases exist?]
- [What assumptions does your monitoring make?]

---

## Approval

| Role | Name | Date | Signature |
|------|------|------|-----------|
| Model Owner | | | |
| Data Engineering Lead | | | |
| Product Manager | | | |
