False positives in email A/B tests: why half of winning tests don't actually win
An A/B test flags a winner with p<0.05. The lift looks real; the team ships it. But the test was one of many, the effect doesn't replicate, and the 'win' quietly underperforms over the next quarter. This is the false-positive trap, and it's ubiquitous in email testing. The statistics guarantee that at p=0.05 you'll see roughly one fake winner per twenty tests — and most programs don't keep track of how many tests they ran. Here's how to prevent the trap.
Justin Williames
Founder, Orbit · 10+ years in lifecycle marketing
Why false positives are guaranteed at scale
The significance threshold p=0.05 means "the observed result would occur by chance less than 5% of the time if there were no real effect". Translated to practice: roughly 1 in 20 tests will show a "significant" result from pure noise, even when the variants are identical.
Run 20 subject line tests in a year and you should expect roughly 1 false winner even if none of the variants has any real effect. Run 40 tests, expect 2. Most programs track neither their test count nor their distribution of p-values, so the false positives go unnoticed.
A program that runs 50 A/B tests per year at p=0.05 should statistically expect 2–3 false-positive winners. A program that ships every winner will have 2–3 imaginary improvements in its "learnings" and doesn't know which ones.
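That arithmetic is easy to verify with a quick simulation. The sketch below (plain Python, illustrative numbers, no real campaign data) runs 50 A/B tests in which both variants are identical and counts how many come back "significant" at p<0.05:

```python
import math
import random

def two_prop_z_pvalue(c_a, n_a, c_b, n_b):
    """Two-sided p-value for a two-proportion z-test (normal approximation)."""
    p_pool = (c_a + c_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (c_a / n_a - c_b / n_b) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(42)
TESTS, N, RATE = 50, 10_000, 0.20  # 50 tests/year; identical variants, no real effect
false_wins = 0
for _ in range(TESTS):
    clicks_a = sum(random.random() < RATE for _ in range(N))
    clicks_b = sum(random.random() < RATE for _ in range(N))
    if two_prop_z_pvalue(clicks_a, N, clicks_b, N) < 0.05:
        false_wins += 1  # a "winner" produced by pure noise
print(false_wins)  # expect ~2-3 on average across seeds
```

Change the seed and re-run a few times: the count of fake winners bounces around, but the long-run average sits at TESTS × 0.05.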
The three most common sources of false positives
1. Stopping early. Running a test, peeking at the data, stopping when it hits significance. Every peek inflates the false-positive rate. A test designed for n=20,000 but stopped at n=8,000 when significance was reached has a false-positive rate much higher than 5% — often 15–25% depending on how many peeks occurred.
2. Multiple comparisons without correction. Testing subject + preheader + CTA + send time in the same test without adjusting the significance threshold. At p=0.05 per comparison across 4 independent metrics, overall false-positive rate becomes ~18%. Bonferroni correction (divide α by number of comparisons) is the crude fix; it over-corrects but is better than no correction.
3. Re-running until significant. "This test didn't quite hit significance — let's run it again with more users." If you keep extending until significance, you'll get it eventually — from noise, not from real effect. Pre-commit to sample size and stick to it; the sample size guide covers how to choose it up front.
How to detect false positives
1. Effect size sanity check. If your subject line test shows a 25% lift in open rate, be suspicious. Real subject line effects rarely exceed 10–15%. A huge effect is more often noise than breakthrough. Small but replicable effects are more trustworthy than large one-shot effects.
2. Test for replication. A real winner should hold up in the next send. Re-run the winning variant against a fresh control on the next campaign; if the effect disappears, it was noise. Programs that replicate significant tests have clearer learnings than programs that only run each test once.
3. Holdout validation. Treat the whole lifecycle program as a pool of "improvements"; run a small (5–10%) holdout population that receives none of your winning treatments. If the sent group outperforms the holdout meaningfully, your winners are probably real. If the gap is small, many of your wins are noise.
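The replication check (point 2) can be made mechanical. A minimal sketch — the function name `replicated` and the counts below are made up for illustration — that only accepts a winner if it is significant in the same direction on both the original send and the rerun:

```python
import math

def two_prop_z_pvalue(c_a, n_a, c_b, n_b):
    """Two-sided p-value for a two-proportion z-test (normal approximation)."""
    p_pool = (c_a + c_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (c_a / n_a - c_b / n_b) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def replicated(orig, rerun, alpha=0.05):
    """True only if both sends show a significant lift in the same direction.
    orig / rerun: (clicks_variant, n_variant, clicks_control, n_control)."""
    def lift_and_p(t):
        c_v, n_v, c_c, n_c = t
        return (c_v / n_v - c_c / n_c), two_prop_z_pvalue(c_v, n_v, c_c, n_c)
    (l1, p1), (l2, p2) = lift_and_p(orig), lift_and_p(rerun)
    return p1 < alpha and p2 < alpha and l1 * l2 > 0

# Original "winner": 4.4% vs 4.0% click rate, 20k per arm; rerun: 4.06% vs 3.99%
print(replicated((880, 20000, 800, 20000), (812, 20000, 798, 20000)))  # False
```

The original send clears p<0.05 on its own; the rerun shows the lift vanishing, so the combined check rejects it — exactly the behaviour you want from a replication gate.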
Preventive practices
1. Pre-register tests. Before running a test, document in writing: hypothesis, primary metric, sample size, stop criterion, what "winning" means. This prevents the "motivated reasoning" where post-hoc you highlight whichever metric happens to win. A simple shared document or a test-registry tool works.
2. Use tighter significance thresholds. p=0.01 instead of p=0.05 cuts false-positive rate by ~5×. The trade-off: you need larger samples (roughly 1.5× larger). For programs that run lots of tests, the tighter threshold pays for itself in avoided false winners.
3. Primary metric only. Pick one metric before the test (usually click rate or revenue per recipient). Don't report wins on secondary metrics. If the primary metric doesn't win, the test is a null result — period. Reporting "primary didn't win but secondary did" is how false positives enter learnings.
4. Use sequential testing correctly. If you must peek at running tests, use platforms with built-in sequential analysis (Optimizely's sequential stats, VWO's SmartStats). These adjust p-values for the peeks. Don't roll your own corrections.
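The "roughly 1.5× larger samples" trade-off in point 2 falls straight out of the standard power formula. A sketch using the normal approximation — the 4% baseline click rate and 10% relative MDE are example inputs, not recommendations:

```python
import math
from statistics import NormalDist

def n_per_arm(p_base, mde_rel, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-proportion test (normal approximation)."""
    p1, p2 = p_base, p_base * (1 + mde_rel)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)

loose = n_per_arm(0.04, 0.10, alpha=0.05)  # 4% click rate, 10% relative MDE
tight = n_per_arm(0.04, 0.10, alpha=0.01)
print(loose, tight, round(tight / loose, 2))  # ratio ~1.49: about 50% more sample
```

The ratio is driven entirely by the critical values ((2.58 + 0.84)² / (1.96 + 0.84)² ≈ 1.49), so it holds at roughly 1.5× regardless of your baseline rate or MDE.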
What to do when you suspect a false positive
If a test shows a surprisingly large effect, or multiple tests in a row all show wins on the same variant type:
1. Run the test again with a fresh audience split. See if it replicates.
2. Check whether the effect size is realistic for the type of change. A subject line change showing 20% lift is suspicious; 4% lift is believable.
3. If it replicates at a similar effect size, it's probably real. If it disappears, the original was likely noise.
4. Update your learnings accordingly. False positives are especially costly when they become "known best practice" in the program — roll those back aggressively.
A disciplined testing process includes pre-registration, sequential-testing practice, and replication as default steps. The goal isn't to run fewer tests — it's to extract more real signal from the tests you run.
Frequently asked questions
- What's a 'significant' result actually saying?
- At p=0.05: 'if there were no real effect, we'd see data this extreme less than 5% of the time'. It's not 'we're 95% sure the effect is real'. This distinction matters — the threshold caps the per-test false-positive rate at 5%, but it's not a guarantee the test found something.
- How do I tell if an effect is too large to be real?
- Benchmark against known effect sizes for the type of change. Subject line tests: 2–8% realistic. Content tests: 3–10%. Send time tests: 1–5%. A test showing effects well above these ranges is usually noise or a methodology issue (small sample, Apple MPP inflation, peeking).
- Is p=0.05 the right threshold for email tests?
- It's the convention, but not optimal for programs running many tests. p=0.01 reduces false-positive rate by ~5× at the cost of needing ~50% larger samples. For volume programs (20+ tests per year), p=0.01 produces cleaner learnings. For lower-volume programs, p=0.05 is fine as long as you replicate winners before treating them as established.
- Should I ever stop a test early?
- Only if the platform has sequential-analysis-corrected stats. 'Peeking and stopping when significant' inflates false-positive rate substantially. If you want the option to stop early, either use a sequential-testing platform, or pre-commit to specific interim analysis points with Bonferroni-style correction.
- What do I do with tests that don't reach significance?
- Treat them as informative null results, not failures. A test that returns 'no significant difference' tells you the effect is smaller than the MDE the test was sized to detect. That's real information. Run retrospective power calculations; publish null results internally so the team knows what the data actually said.
- How does Apple MPP affect false-positive rates?
- It doesn't directly inflate false positives, but it creates very noisy open rate measurement that makes open-rate tests harder to interpret. Effects you might have detected are harder to see; spurious effects from machine-open variation can look like real wins. For open-rate tests, use click-through rate as the primary metric to dodge the MPP noise.
Related guides
Sample size: the calculation everyone gets wrong in email A/B tests
Most email A/B tests are powered to detect effects far larger than they can actually produce. Here's the sample size calculation that tells you whether your test will find what you're looking for — before you run it.
Send-time optimisation: what it really moves, and what it doesn't
Every ESP now markets a send-time optimisation feature. They all show flattering internal case studies. The honest version: STO moves open rate 3–8%, not revenue, and only works for certain program types. Here's when it's worth turning on.
Incrementality testing: the measurement that tells you if a program actually works
Last-click attribution makes lifecycle programs look bigger than they are. Incrementality tests strip out the effect of users who would have converted anyway and reveal the real lift. Here's how to design one that produces a defensible number.
Segment-based testing: when your average lift is hiding opposing effects
A winning A/B test with 4% lift overall might be a 20% win in one segment and a 10% loss in another. Segment-based analysis reveals the real story — and lets you ship winners to the segments that benefit while avoiding users who would be hurt.
Price-testing through email: what's testable, what isn't
Email is often the first place teams try to price-test, and it's often where the wrong lesson gets learned. This guide covers what can genuinely be tested in email, what can't, and the measurement traps that make most email price tests unreliable.
A/B testing in email: sample size, novelty, and what to report
Most email A/B tests produce winners that don't reproduce. This guide covers the three reasons — under-powered samples, the novelty effect, and weak readout discipline — and how to design tests that actually drive decisions.