Experiment Designer

Design rigorous A/B comparisons with statistical confidence


🧪 Why A/B Experiments?

A/B testing lets you rigorously compare two approaches (baseline vs candidate) to determine which performs better. Using bootstrap confidence intervals, we can quantify our uncertainty and make data-driven decisions.

A: Baseline vs. B: Candidate

📊 Experiment Configuration

📈 Understanding Bootstrap Confidence Intervals

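A bootstrap confidence interval (CI) estimates the range in which the true mean likely falls. It is built by resampling the observed results with replacement many times, recording the mean of each resample, and taking percentiles of those means. A 95% CI comes from a procedure that captures the true mean in about 95% of repeated experiments.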

Baseline: 420 [380, 460]
Candidate: 475 [450, 500]
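An interval like the ones above could be computed roughly as follows. This is a minimal percentile-bootstrap sketch using only the Python standard library; the function name `bootstrap_ci`, its parameters, and the sample scores are illustrative assumptions, not the tool's actual implementation.

```python
import random
import statistics


def bootstrap_ci(samples, n_resamples=10_000, confidence=0.95, seed=0):
    """Return (mean, ci_low, ci_high) using a percentile bootstrap of the mean.

    Hypothetical helper: the name, defaults, and seed handling are assumptions.
    """
    rng = random.Random(seed)  # seeded so the result is reproducible
    n = len(samples)

    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )

    # Read the interval off the lower and upper tail percentiles.
    alpha = 1.0 - confidence
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return statistics.fmean(samples), means[lo_idx], means[hi_idx]


# Made-up per-seed scores for one arm of the experiment.
scores = [400, 410, 430, 440, 420, 415, 425, 405]
mean, low, high = bootstrap_ci(scores)
print(f"{mean:.0f} [{low:.0f}, {high:.0f}]")
```

With only a handful of seeds the interval tends to be wide and the percentile method can be unreliable, so more runs per arm generally give a more trustworthy interval.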

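If the intervals don't overlap, the difference is statistically significant. In the example above the intervals overlap slightly (the candidate's lower bound of 450 is below the baseline's upper bound of 460), so this check alone does not settle the comparison.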

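Interpreting Results: If the candidate's CI lies entirely above the baseline's CI, we have strong evidence that the candidate is better (assuming higher values are better). Overlapping CIs are inconclusive rather than proof of no difference: a real difference can exist even when the two intervals overlap, so check a CI on the difference itself or collect more runs.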
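When the two intervals overlap, one common follow-up is to bootstrap the difference in means directly and check whether that interval excludes zero. The sketch below assumes independent runs per arm and that higher scores are better; `bootstrap_diff_ci` and the sample data are hypothetical names and values chosen for illustration.

```python
import random
import statistics


def bootstrap_diff_ci(baseline, candidate, n_resamples=10_000,
                      confidence=0.95, seed=0):
    """Percentile bootstrap CI for mean(candidate) - mean(baseline).

    Hypothetical helper; assumes the two arms were run independently.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each arm separately, with replacement.
        b = rng.choices(baseline, k=len(baseline))
        c = rng.choices(candidate, k=len(candidate))
        diffs.append(statistics.fmean(c) - statistics.fmean(b))
    diffs.sort()

    alpha = 1.0 - confidence
    low = diffs[int((alpha / 2) * n_resamples)]
    high = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return low, high


# Made-up per-seed scores for each arm.
baseline_scores = [400, 410, 430, 440, 420, 415, 425, 405]
candidate_scores = [470, 480, 465, 490, 475, 485, 460, 478]
low, high = bootstrap_diff_ci(baseline_scores, candidate_scores)
print(f"difference CI: [{low:.1f}, {high:.1f}]")
```

If the whole interval sits above zero, the data support the candidate being better; if it straddles zero, the comparison is still undecided.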

✅ Decision Rules

Define criteria for interpreting results:

🚀 Ship Candidate

The candidate CI's lower bound exceeds the baseline CI's upper bound by at least the minimum improvement threshold.

🔄 Iterate

The CIs overlap, or the improvement is below the threshold. Run more seeds or adjust hyperparameters.

🚫 Abandon

The candidate performs significantly worse than the baseline. Return to the baseline or try a different approach.
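The three rules above can be written as a small function. This is a sketch under stated assumptions: higher values are better, "significantly worse" is read as the candidate's CI lying entirely below the baseline's CI, and the names `decide` and `min_improvement` are hypothetical.

```python
def decide(baseline_ci, candidate_ci, min_improvement=0.0):
    """Map two (low, high) confidence intervals to 'ship', 'iterate', or 'abandon'.

    Assumes higher metric values are better; the name and signature are
    illustrative, not the tool's API.
    """
    b_low, b_high = baseline_ci
    c_low, c_high = candidate_ci

    # Ship: candidate's lower bound clears the baseline's upper bound
    # by at least the required margin.
    gap = c_low - b_high
    if gap > 0 and gap >= min_improvement:
        return "ship"

    # Abandon: candidate's whole interval sits below the baseline's.
    if c_high < b_low:
        return "abandon"

    # Iterate: intervals overlap, or the improvement is below the threshold.
    return "iterate"


print(decide((380, 460), (450, 500)))                     # iterate
print(decide((380, 460), (470, 500), min_improvement=5))  # ship
print(decide((380, 460), (300, 370)))                     # abandon
```

Note that the example intervals shown earlier, [380, 460] and [450, 500], overlap by 10, so this rule set returns "iterate" for them.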

💭 Hypothesis

📄 experiment_brief.md
