10 min read
A/B testing in email: sample size, novelty, and what to report
A winner at 95% confidence doesn't mean a real lift. A losing variant in one test doesn't mean a broken idea. Most email A/B tests produce results that look decisive but don't survive replication — and the gap between statistical significance and actual operational significance is where most lifecycle experimentation effort gets wasted. This guide covers how to design tests that produce decisions, not noise.
Justin Williames
Founder, Orbit · 10+ years in lifecycle marketing
Sample size is the variable most teams under-think
Every A/B test has a minimum detectable effect (MDE): the smallest difference between variants that the test can reliably distinguish from noise at a chosen confidence level. The MDE is a function of sample size, baseline conversion rate, and the confidence and power thresholds you choose.
400
Conversions per variant needed to detect a 30% lift at 95% confidence, 80% power.
3,800
Conversions per variant needed to detect a 10% lift at the same thresholds.
400,000
Conversions per variant needed to detect a 1% lift. Most programs will never see this sample.
Practical numbers. At 95% confidence and 80% power, with a 20% baseline conversion rate, a test can reliably detect: a 30% relative lift with roughly 400 conversions per variant. A 15% lift needs about 1,700. A 10% lift needs about 3,800. A 5% lift needs about 15,000. A 1% lift needs about 400,000.
Most lifecycle programs run tests with 500–2,000 conversions per variant and claim to detect 5% lifts. Mathematically, they can't. What's happening is either a genuine large lift (detectable), random noise misread as a lift (false positive), or a real small lift that wasn't reliably detected (false negative). Without a pre-registered MDE, you can't tell which.
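If you want to sanity-check thresholds like these yourself, the sketch below uses the standard two-proportion normal approximation (two-sided test, 95% confidence, 80% power). It won't reproduce the figures above exactly, since calculators differ in their assumptions (one-sided vs two-sided tests, pooled vs unpooled variance), but it captures the relationship that matters: required sample scales with one over the square of the lift you want to detect. The `recipients_per_variant` helper is illustrative, not any particular tool's formula.

```python
# Sketch: approximate per-variant sample for a two-proportion test.
# Standard normal approximation; assumes a two-sided test. Results will
# differ from any specific calculator that uses different assumptions.
from statistics import NormalDist

def recipients_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate recipients per variant needed to detect `relative_lift`
    over a `baseline` conversion rate at the given alpha and power."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for 95% confidence
    z_power = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
                 + z_power * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p2 - p1) ** 2

# Required sample grows with 1 / lift^2: halve the lift you want to detect
# and you need roughly four times the sample.
for lift in (0.30, 0.15, 0.10, 0.05, 0.01):
    n = recipients_per_variant(baseline=0.20, relative_lift=lift)
    print(f"{lift:.0%} lift: ~{n:,.0f} recipients (~{n * 0.20:,.0f} conversions) per variant")
```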
The Orbit Significance Calculator shows the MDE for your current sample size inline — you'll see whether the test is actually powered to detect a meaningful lift before you run it.
The novelty effect — and why weekly metrics matter
New variants often outperform control for the first 3–7 days purely because they're new. Users notice the difference, engage with it, and drive early metrics up. Then the novelty fades and the variant returns to baseline.
Watching only cumulative metrics obscures this entirely. A variant that's 8% ahead on day 3 and 1% behind on day 10 can still show up as "+3% cumulative" in the end-of-test readout — a misleading number that reads as a marginal win but is actually the signature of fading novelty.
Defence: run every A/B test for at least two full cycles of your natural sending rhythm. For daily sends, that's two weeks. For weekly, a month. And report weekly incremental lift, not just cumulative. If the variant wins week one but loses week two, it's a novelty effect, not a winner.
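A minimal sketch of that readout structure, using made-up daily conversion counts chosen to show the pattern (a strong week one, a flat-to-negative week two, a modestly positive cumulative number):

```python
# Illustrative only: the daily conversion counts are invented to demonstrate
# the novelty-effect signature, not data from a real test.
control = [100, 102,  98, 101,  99, 100, 100,  99, 101, 100, 100,  98, 102, 100]
variant = [110, 109, 107, 105, 104, 102, 100,  99,  98,  99,  98,  99,  98, 100]

def lift(v, c):
    """Relative lift of variant conversions over control conversions."""
    return (sum(v) - sum(c)) / sum(c)

week1 = lift(variant[:7], control[:7])        # the novelty window
week2 = lift(variant[7:14], control[7:14])    # after the novelty fades
cumulative = lift(variant, control)

print(f"Week 1 incremental lift: {week1:+.1%}")      # strongly positive
print(f"Week 2 incremental lift: {week2:+.1%}")      # flat or negative
print(f"Cumulative lift:         {cumulative:+.1%}")  # misleadingly positive

# A positive cumulative read with a negative week 2 is the novelty-effect
# signature: don't ship the variant on the cumulative number alone.
```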
The Orbit Experiment Design skill handles all of this pre-test planning — sample size calculation, duration setting, and the readout structure that separates real winners from novelty.
Multi-variant tests multiply false positives
Testing 1 variant vs control at 95% confidence: 5% false-positive rate. Testing 9 variants vs control at 95% per comparison: the chance that at least one variant hits significance purely by chance is roughly 37%, not 5%.
The straightforward fix is Bonferroni correction: divide your chosen alpha by the number of comparisons. With 9 variants, that's an alpha of 0.05 / 9 ≈ 0.006, so each individual comparison has to clear roughly 99.4% confidence to keep the overall false-positive rate at 5%.
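The arithmetic is short enough to show directly. The sketch below assumes each variant-vs-control comparison is an independent test at the same alpha:

```python
# Family-wise false-positive rate and the Bonferroni-adjusted threshold.
# Assumes independent comparisons, each run at the same alpha.
alpha = 0.05
variants = 9

# Chance that at least one of the 9 comparisons "wins" by chance alone
family_wise_error = 1 - (1 - alpha) ** variants
print(f"False-positive rate across {variants} comparisons: {family_wise_error:.0%}")  # ~37%

# Bonferroni: split the 5% error budget across the comparisons
adjusted_alpha = alpha / variants
print(f"Per-comparison alpha: {adjusted_alpha:.4f} "
      f"(i.e. require ~{1 - adjusted_alpha:.1%} confidence per comparison)")
```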
A more practical approach for lifecycle programs: run fewer variants per test (2–4 tops), and run more tests in sequence. Sequential small tests beat parallel large ones for most lifecycle work — you learn faster and your MDE is lower because the per-variant sample is bigger. Pricing and discount tests have their own failure modes that the general rules don't catch; the price-testing guide covers the specifics.
What to report beyond 'we won'
Report three numbers together for every test result: observed relative lift (%), confidence level (%), and absolute conversion volume. "Variant B had a 12% relative lift in open rate, 97% confidence, adding 340 extra opens across 50K recipients."
Each number changes the decision. A 12% lift at 60% confidence is noise. A 1% lift at 99% confidence across 5M recipients might be a massive win. A 30% lift at 95% confidence that's only 40 extra conversions in absolute terms is probably not worth the operational change. The three numbers together force a complete read.
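A rough sketch of how those three numbers come out of the raw counts. The confidence figure here is derived from a two-sided two-proportion z-test; the recipient and conversion counts are hypothetical:

```python
# Three-number readout: relative lift, confidence, absolute volume.
# Confidence comes from a two-sided two-proportion z-test (normal
# approximation); the counts below are hypothetical.
from statistics import NormalDist

def readout(conv_a, n_a, conv_b, n_b):
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    lift = (rate_b - rate_a) / rate_a
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided
    confidence = 1 - p_value
    extra = conv_b - round(conv_a * n_b / n_a)     # conversions above expectation
    return lift, confidence, extra

lift, confidence, extra = readout(conv_a=960, n_a=12_000, conv_b=1_056, n_b=12_000)
print(f"Relative lift: {lift:+.1%} | confidence: {confidence:.0%} | "
      f"extra conversions: {extra:+,}")
```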
Also report the losing variants. Every losing variant contains information. A variant that loses by a large margin tells you something about that direction — don't pretend it didn't happen because it wasn't your winner.
What to actually test (and what to stop testing)
Tests that reliably produce learnings: subject line structure variants, CTA copy and placement, hero-image vs text-first layouts, send-time optimisation, sender-name variations, personalisation depth (shallow vs deep). These have high inherent variance and clear hypotheses.
Tests that usually produce noise: small colour changes, minor layout tweaks, single-word copy edits, tests on audiences under 10K per variant. Not because these things don't matter — they might — but because you can't reliably detect the size of effect they produce, so running the test is performance art.
A good rule: if you're not prepared to act on a 5% lift, don't run a test that can only detect a 20% lift. The test won't answer a question you'd act on.
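One way to apply that rule before launch is to invert the sample-size calculation: given the audience you actually have, estimate the smallest lift the test can reliably detect. The sketch below uses the same normal approximation as the earlier sample-size sketch; `detectable_lift` is an illustrative helper, not a specific tool's formula.

```python
# Sketch: smallest relative lift detectable with a given per-variant sample.
# Same normal approximation as before; treat the output as a rough floor.
from statistics import NormalDist

def detectable_lift(baseline, recipients_per_variant, alpha=0.05, power=0.80):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Approximate the minimum detectable effect as an absolute difference,
    # then express it relative to the baseline conversion rate.
    absolute = (z_alpha + z_power) * (
        2 * baseline * (1 - baseline) / recipients_per_variant
    ) ** 0.5
    return absolute / baseline

# With 5,000 recipients per variant at a 20% baseline, only lifts of roughly
# this size or larger are reliably detectable (~11%):
print(f"{detectable_lift(0.20, 5_000):.0%}")
```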
Frequently asked questions
- What sample size do I need for an email A/B test?
- Depends on the lift you want to detect. At 95% confidence and 80% power with a 20% baseline conversion rate: 30% lift needs ~400 conversions per variant, 15% lift needs ~1,700, 10% lift needs ~3,800, 5% lift needs ~15,000, and 1% lift needs ~400,000. Calculate before running, not after.
- Is 95% confidence enough?
- For low-stakes decisions (subject line tests, minor copy changes), 90% is fine. For high-stakes decisions that lock in strategy for a year (channel strategy, onboarding overhauls), require 99% confidence and validate with a follow-up test. 95% is the conventional threshold — not a law.
- How long should I run an email A/B test?
- At least two full cycles of your natural sending rhythm. For daily sends, two weeks. For weekly sends, a month. This neutralises the novelty effect and gives you weekly incremental metrics that reveal whether the winner is genuine or fading.
- What's the novelty effect?
- The tendency for new variants to outperform control for 3–7 days purely because users notice the change. Engagement spikes, then fades back to baseline as novelty wears off. Watching cumulative metrics obscures this; reporting weekly incremental metrics reveals it.
- Can I test multiple variants at once?
- Yes, but each additional variant inflates the false-positive rate. With 9 variants at 95% confidence each, the chance at least one hits significance by chance is ~37%. Either apply Bonferroni correction (require 99.4% per comparison with 9 variants) or run fewer variants in sequence.
- Should I report p-values or confidence levels?
- Confidence (e.g. 97%) reads more intuitively for most stakeholders. A p-value (e.g. 0.03) is more precise and preferred in analytical contexts. Many readouts include both. Whichever notation you use, also show the absolute volume — a 97%-confidence 12% lift adding 340 conversions is a very different result from one adding 34.
Related guides
Personalisation that doesn't feel creepy
There's a line between personalisation that earns a user's trust and personalisation that breaks it. This guide is about where the line actually is, how lifecycle programs cross it without noticing, and the specific patterns that keep you on the right side.
Price-testing through email: what's testable, what isn't
Email is often the first place teams try to price-test, and it's often where the wrong lesson gets learned. This guide covers what can genuinely be tested in email, what can't, and the measurement traps that make most email price tests unreliable.