Updated · 8 min read
Sample size: the calculation everyone gets wrong in email A/B tests
The most common email A/B test design: 50/50 split, ship both variants, declare the winner at whatever p-value is reported. The problem: the test was probably underpowered to detect a realistic effect size, so the result is either a false positive (noise that looked like signal) or a false null (real effect hidden in noise). Sample size calculation fixes this before you send. It takes 5 minutes. Here's the 5-minute version.
Justin Williames
Founder, Orbit · 10+ years in lifecycle marketing
What sample size actually controls
Sample size determines the minimum detectable effect (MDE) of your test. A small sample can only detect large effects; a large sample can detect small effects. If your real effect is 3% but your test is sized to detect 10%, your test can't see your real effect — it'll return "no significant difference" even though there is one.
Running an underpowered test isn't a neutral act. You've spent the send, burnt the audience on the test, and learned nothing. Worse, you might conclude there's no effect when there is.
Four inputs determine sample size:
1. Baseline conversion rate. Whatever the control arm is expected to convert at (open rate, click rate, purchase rate).
2. Minimum detectable effect. The smallest lift you care about finding. Usually expressed as a relative percentage (e.g., 5% relative lift).
3. Significance level (α). Conventional: 0.05. The false-positive rate you'll accept.
4. Statistical power (1-β). Conventional: 0.80. The probability of detecting the effect if it's real.
The formula and the shortcut
For comparing two proportions (which is what most email tests are), the full formula involves z-scores and standard errors. For quick estimates, the shortcut formula works; the 16 bakes in α = 0.05 and power = 0.80, since 2 × (1.96 + 0.84)² ≈ 15.7:
n per variant ≈ 16 × p(1-p) / (MDE × p)²
Where p is the baseline rate, and MDE is the minimum detectable effect as a relative fraction (0.05 for a 5% relative lift).
Concrete example: a welcome email with a 25% open rate. You want to detect a 5% relative lift (25% → 26.25%). n = 16 × 0.25 × 0.75 / (0.05 × 0.25)² = 19,200 per variant. You need 38,400 users total to run this test.
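The shortcut is easy to script as a sanity check. A minimal sketch (the function name is mine, not from any library):

```python
import math

def sample_size_per_variant(baseline, relative_mde):
    """Shortcut sample size per variant for a two-proportion test,
    with alpha = 0.05 and power = 0.80 baked into the constant 16."""
    absolute_mde = relative_mde * baseline  # 5% relative lift on 25% = 1.25 points
    return math.ceil(16 * baseline * (1 - baseline) / absolute_mde ** 2)

# Welcome email example: 25% open rate, 5% relative MDE
n = sample_size_per_variant(0.25, 0.05)
print(n, "per variant,", 2 * n, "total")  # 19200 per variant, 38400 total
```

The `ceil` matters: always round required sample up, never down.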
What to do when your list is too small
Accept bigger MDEs. If your list supports 15% detection, test bigger swings — full subject line rewrites, not single-word tests. Small changes won't be detectable; don't test them.
Pool similar sends. Instead of one test on one send, run the same variant treatment across five sends and combine the data. 5 × 10,000-user sends = 50,000-user effective sample.
Use pre/post comparisons where appropriate. Not technically an A/B test but can surface directional lift from large changes (e.g., template rebuild) where an A/B split isn't feasible. Compare 4 weeks before to 4 weeks after, controlling for volume and seasonality.
Test at the program level, not the send level. Instead of subject-line-at-a-time, test entire flow changes (3-email sequence A vs 3-email sequence B) with users randomised to the arm for the duration. This compounds the effect size across multiple touchpoints.
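The payoff of pooling can be quantified by inverting the shortcut formula to solve for the MDE a given sample can detect. A sketch, assuming the pooled sends share one baseline rate and a consistent treatment effect (the strongest assumption in the pooling approach):

```python
import math

def detectable_relative_mde(baseline, n_per_variant):
    """Invert the shortcut: the smallest relative lift detectable at
    alpha = 0.05, power = 0.80 with n_per_variant users per arm."""
    return 4 * math.sqrt((1 - baseline) / (n_per_variant * baseline))

# 25% baseline: one 10,000-user send vs five pooled 10,000-user sends
single = detectable_relative_mde(0.25, 5_000)    # 5,000 per arm
pooled = detectable_relative_mde(0.25, 25_000)   # 25,000 per arm
print(f"{single:.1%} vs {pooled:.1%}")  # roughly 9.8% vs 4.4% relative lift
```

Pooling five sends moves the program from "can only see big swings" to "can see realistic subject-line effects".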
The variables most programs mis-specify
Baseline rate too optimistic. Using an industry benchmark (30% open rate) when your actual rate is 18%. The smaller baseline means larger required sample. Use your own rate, not a benchmark.
MDE too optimistic. Setting MDE at 10% because that's what the last blog post said. Your real effects are probably 1–5% on subject line tests, 3–8% on content tests. Size for a realistic MDE, not the effect you hope for.
Power too low. 0.80 is the conventional floor. Going below it (to 0.70 or lower) increases false negatives — tests that say "no effect" when there was one. Don't reduce power to save sample; either accept the bigger sample or accept that you can't detect small effects.
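To see what an optimistic baseline costs, re-run the shortcut formula with the benchmark rate and your actual rate (the 30%/18% figures from above, used illustratively):

```python
import math

def sample_size_per_variant(baseline, relative_mde):
    """Shortcut sample size: alpha = 0.05, power = 0.80 baked into the 16."""
    return math.ceil(16 * baseline * (1 - baseline) / (relative_mde * baseline) ** 2)

# Same 5% relative MDE, two baselines
print(sample_size_per_variant(0.30, 0.05))  # 14934 with the benchmark rate
print(sample_size_per_variant(0.18, 0.05))  # 29156 with the actual rate
```

Halving the baseline roughly doubles the required sample; a plan built on the benchmark rate is underpowered before the first send goes out.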
The A/B testing playbook covers the broader test-design questions; this guide covers the sample-size math that underpins it.
The post-test calculation
After a test, do a retrospective power calculation: given the actual baseline rate and sample size, what was the minimum effect this test could have detected? If the test returned "no significant difference" and the retrospective MDE was 10%, you can only conclude "no effect of 10% or larger" — an effect of 4% could easily still be there.
This is the key honesty discipline. "No significant difference" is not "no effect". It's "no effect large enough for this test to detect". Treat null results accordingly — as absence of evidence, not evidence of absence.
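The retrospective check is the same shortcut formula solved for MDE instead of n. A sketch with hypothetical numbers (18% baseline, 4,000 users per arm):

```python
import math

def retrospective_mde(baseline, n_per_variant):
    """Smallest relative lift this completed test could have detected
    (alpha = 0.05, power = 0.80), using actuals rather than plans."""
    return 4 * math.sqrt((1 - baseline) / (n_per_variant * baseline))

mde = retrospective_mde(0.18, 4_000)
print(f"{mde:.1%}")  # ~13.5%: a null here only rules out lifts of ~13.5% or more
```

Report this number next to every null result; it turns "no effect" into the honest "no effect of 13.5% or larger".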
Orbit includes pre-test sample size calculation and post-test retrospective MDE as standard outputs. Tests without either tend to become theatre — they produce confident conclusions from inconclusive data.
Frequently asked questions
- What's the simplest way to calculate sample size?
- Use the shortcut formula: n per variant ≈ 16 × p(1-p) / (MDE × p)². Or use an online calculator (Optimizely, Evan Miller, and AB Tasty all have free ones). Plug in your baseline rate, MDE, α=0.05, power=0.80. It tells you users per variant, total = 2× that.
- What's a realistic MDE for email A/B tests?
- 3–8% relative lift for content or subject line tests on the open/click metric. 1–5% for purchase conversion. Anything claimed as 'detected 20% lift from a subject line A/B' is probably either a very specific case, a very underpowered test that caught noise, or a larger program-level change.
- How do I handle multiple metrics in the same test?
- Either pre-register one primary metric (e.g., click-through rate) and treat others as secondary/exploratory, or use a multiple-comparisons correction (Bonferroni — divide α by number of tests). Most email programs get away with primary-metric-only because conversion funnel metrics are correlated; multiple-comparisons correction becomes important only when testing unrelated metrics simultaneously.
- Does sample size calculation change for sequential tests?
- Yes. Running a test and 'stopping when significant' inflates the false-positive rate substantially. Either set a pre-committed sample size and check only at the end, or use sequential analysis methods (e.g., SPRT or group-sequential designs with alpha-spending) that adjust for repeated checks. Most email testing platforms let you check early; most testers ignore the inflation. Either commit to the end-of-test check or use a platform with sequential correction built in.
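The inflation from peeking is easy to demonstrate with a toy A/A simulation: both arms share the same true rate, so any 'significant' result is a false positive. The parameters (2,000 users per arm, 10 peeks, 25% rate, 400 simulated tests) are illustrative, and the naive z-test here is not a production sequential procedure:

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided alpha = 0.05

def simulate_aa(n_per_arm, peeks, rate, rng):
    """One A/A test; returns (significant at any peek, significant at end)."""
    a = b = 0
    step = n_per_arm // peeks
    hit_any = hit_end = False
    for i in range(1, peeks + 1):
        a += sum(rng.random() < rate for _ in range(step))
        b += sum(rng.random() < rate for _ in range(step))
        n = i * step
        pooled = (a + b) / (2 * n)
        se = (2 * pooled * (1 - pooled) / n) ** 0.5
        z = abs(a - b) / (n * se) if se > 0 else 0.0
        if z > Z_CRIT:
            hit_any = True
            if i == peeks:
                hit_end = True
    return hit_any, hit_end

rng = random.Random(7)
sims = 400
any_hits = end_hits = 0
for _ in range(sims):
    hit_any, hit_end = simulate_aa(2_000, 10, 0.25, rng)
    any_hits += hit_any
    end_hits += hit_end
print(any_hits / sims, end_hits / sims)  # peeking FPR well above 0.05; end-only near 0.05
```

With ten peeks, the stop-when-significant rate typically lands in the 15–25% range while the end-only check stays near the nominal 5%.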
- What if my list grows during the test?
- Lock the eligible audience at test-start. Users who subscribe during the test aren't in the test — they're in the next one. This keeps assignment clean and prevents 'the test arm got more engaged users by chance of timing' effects.
- How do I explain sample size to a non-technical stakeholder?
- 'This test needs X users per variant to detect a Y% lift. If we want to find smaller effects, we need more users. If we accept that we can only detect bigger effects, we can run with fewer users. Either is fine; the only wrong answer is running an underpowered test and reporting the result as if it were conclusive.' Stakeholders usually get it after one example.
Related guides
Send-time optimisation: what it really moves, and what it doesn't
Every ESP now markets a send-time optimisation feature. They all show flattering internal case studies. The honest version: STO moves open rate 3–8%, not revenue, and only works for certain program types. Here's when it's worth turning on.
False positives in email A/B tests: why half of winning tests don't actually win
Run enough A/B tests and some will show 'significant' lift from random noise. Programs that ship every significant winner end up with a collection of imaginary improvements. Here's how to tell real lift from noise and avoid the false-positive trap.
Incrementality testing: the measurement that tells you if a program actually works
Last-click attribution makes lifecycle programs look bigger than they are. Incrementality tests strip out the effect of users who would have converted anyway and reveal the real lift. Here's how to design one that produces a defensible number.
Segment-based testing: when your average lift is hiding opposing effects
A winning A/B test with 4% lift overall might be a 20% win in one segment and a 10% loss in another. Segment-based analysis reveals the real story — and lets you ship winners to the segments that benefit while avoiding users who would be hurt.
A/B testing in email: sample size, novelty, and what to report
Most email A/B tests produce winners that don't reproduce. This guide covers the three reasons — under-powered samples, the novelty effect, and weak readout discipline — and how to design tests that actually drive decisions.
Price-testing through email: what's testable, what isn't
Email is often the first place teams try to price-test, and it's often where the wrong lesson gets learned. This guide covers what can genuinely be tested in email, what can't, and the measurement traps that make most email price tests unreliable.