Experiment Designer (A/B Testing)

🧪 Why A/B Experiments?

A/B testing lets you rigorously compare two approaches (baseline vs candidate) to determine which performs better. Using bootstrap confidence intervals, we can quantify our uncertainty and make data-driven decisions.

A Baseline

Name

Description

B Candidate

Name

Description

📊 Experiment Configuration

Experiment Name

Primary Metric

Number of Seeds (per variant)

Bootstrap Samples

Seed Values (comma-separated)

📈 Understanding Bootstrap Confidence Intervals

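A bootstrap confidence interval (CI) estimates the range in which the true mean likely falls. It is built by resampling the observed per-seed results with replacement many times and recomputing the mean each time; in the common percentile method, a 95% CI spans the middle 95% of those resampled means.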

Baseline: 420 [380, 460]

Candidate: 475 [450, 500]

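If the two intervals don't overlap, the difference is statistically significant. The converse does not hold: overlapping intervals do not by themselves show that there is no difference.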

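Interpreting Results: If the candidate's CI lies entirely above the baseline's CI, we have strong evidence that the candidate is better (assuming higher is better for the chosen metric). Overlapping CIs mean the result is inconclusive: the difference may be real, but the data do not yet establish it.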
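The interval computation described above can be sketched in a few lines of Python. This is a minimal illustration of the percentile bootstrap, not necessarily the method this tool uses internally; the function name `bootstrap_ci`, its parameters, and the sample scores are all hypothetical.

```python
import random
from statistics import fmean


def bootstrap_ci(scores, n_boot=10_000, confidence=0.95, rng_seed=0):
    """Percentile bootstrap CI for the mean of per-seed scores.

    Returns (point_estimate, lower_bound, upper_bound).
    Assumption: one scalar score per seed; names are illustrative only.
    """
    if not scores:
        raise ValueError("scores must contain at least one value")
    rng = random.Random(rng_seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    boot_means = sorted(fmean(rng.choices(scores, k=n)) for _ in range(n_boot))
    alpha = 1.0 - confidence
    # Indices of the alpha/2 and 1 - alpha/2 percentiles, clamped to the list.
    lo_idx = max(0, int((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot) - 1)
    return fmean(scores), boot_means[lo_idx], boot_means[hi_idx]


if __name__ == "__main__":
    # Hypothetical per-seed results for one variant (not real data).
    baseline_scores = [410.0, 395.0, 440.0, 425.0, 430.0]
    point, lower, upper = bootstrap_ci(baseline_scores)
    print(f"Baseline: {point:.0f} [{lower:.0f}, {upper:.0f}]")
```

With only a handful of seeds the interval is coarse, because each resample can only recombine the few observed values. That is one reason the "Iterate" outcome suggests running more seeds.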

✅ Decision Rules

Define criteria for interpreting results:

Minimum Improvement (%)

Confidence Level (%)

🚀 Ship Candidate

The candidate's CI lower bound exceeds the baseline's CI upper bound by at least the minimum improvement threshold.

🔄 Iterate

The CIs overlap, or the improvement is below the threshold. Run more seeds or adjust hyperparameters.

🚫 Abandon

The candidate performs significantly worse than the baseline. Return to the baseline or try a different approach.
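The three decision rules can be expressed as a small function. This is a sketch under stated assumptions rather than the tool's actual logic: the name `decide` is hypothetical, higher metric values are assumed to be better, the metric is assumed to be positive, the minimum improvement is read as a percentage of the baseline's upper bound, and "significantly worse" is read as the candidate's CI lying entirely below the baseline's.

```python
def decide(baseline_ci, candidate_ci, min_improvement_pct=5.0):
    """Map two (lower, upper) confidence intervals to a decision.

    Assumptions (not confirmed by the tool): higher is better, the metric
    is positive, and the threshold is relative to the baseline upper bound.
    """
    b_lo, b_hi = baseline_ci
    c_lo, c_hi = candidate_ci

    # Abandon: the candidate's whole interval sits below the baseline's.
    if c_hi < b_lo:
        return "abandon"

    # Ship: the candidate's lower bound clears the baseline's upper bound
    # by at least the minimum improvement percentage.
    required = b_hi * (1.0 + min_improvement_pct / 100.0)
    if c_lo >= required:
        return "ship"

    # Otherwise the intervals overlap or the gain is too small.
    return "iterate"


if __name__ == "__main__":
    # The example intervals from this page: baseline [380, 460], candidate [450, 500].
    print(decide((380, 460), (450, 500)))
```

Under these assumptions the page's own example intervals overlap slightly (450 is below 460), so this sketch would classify them as "iterate" rather than "ship".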

💭 Hypothesis

What do you expect to happen? (be specific)

📄 experiment_brief.md
