Measurement & testing
A/B testing
Also known as: split testing · bucket testing
A randomised experiment where two (or more) variants are shown to different user groups and compared on a primary metric — requires sufficient sample size, a pre-declared hypothesis, and statistical significance testing before declaring a winner.
A/B testing is how a lifecycle program attributes causation to a change. A variant (B) is compared against a control (A) by randomly assigning users to each and measuring a primary metric — open rate, click rate, conversion rate. For the result to be trustworthy: the sample size must meet pre-computed statistical power requirements (typically 80% power at 95% confidence), the hypothesis must be declared before the test runs (not post hoc), and a winner is declared only after the primary metric reaches significance (not multiple metrics, not mid-test peeks). Most lifecycle email A/B tests fail at least one of these: they are undersampled (tested on 500 users when 15,000 are needed), they make multiple comparisons (declaring a winner on whichever metric happened to move), or they peek early. The A/B test sample-size calculator gives the required volume per arm for any combination of baseline rate and MDE.
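The per-arm volume the calculator reports comes from the standard two-proportion power formula. A minimal sketch in Python (the function name and defaults are illustrative, not the calculator's actual implementation):

```python
from math import sqrt, ceil
from statistics import NormalDist

def sample_size_per_arm(baseline: float, mde: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Required users per arm for a two-sided, two-proportion z-test.

    baseline: control conversion rate, e.g. 0.20 for a 20% open rate
    mde: minimum detectable effect in absolute terms, e.g. 0.02 for +2 points
    """
    p1, p2 = baseline, baseline + mde
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96 at 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # ≈ 0.84 at 80% power
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / mde ** 2)

# 20% baseline open rate, detecting a +2-point absolute lift:
print(sample_size_per_arm(0.20, 0.02))  # → 6510 users per arm
```

Note how quickly the requirement grows as the MDE shrinks: halving the detectable effect roughly quadruples the required sample, which is why small lists routinely produce underpowered tests.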
Try the tool
Read next
A/B testing in email: sample size, novelty, and what to report
Most email A/B tests produce winners that don't reproduce. Three reasons keep showing up: under-powered samples, the novelty effect, and weak readout discipline. This guide is about designing tests that actually drive decisions instead of theatre.
False positives in email A/B tests: why half of winning tests don't actually win
Run enough A/B tests and some will show 'significant' lift from pure noise. Programs that ship every significant winner end up with a collection of imaginary improvements they can't tell apart from real ones. Here's how to spot the fakes and avoid the trap.
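The noise-driven "winners" described above can be made concrete with an A/A simulation — both arms identical, so every significant result is a false positive by construction. A hypothetical sketch (function name and parameters are illustrative):

```python
import random
from math import sqrt

def aa_false_positive_rate(n_tests: int = 1000, n_users: int = 2000,
                           base_rate: float = 0.20, seed: int = 7) -> float:
    """Run A/A tests (identical arms) and count how often a two-proportion
    z-test declares 'significance' at p < 0.05. Expected: about 5%."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        # Both arms draw from the same true rate: any 'lift' is pure noise.
        a = sum(rng.random() < base_rate for _ in range(n_users))
        b = sum(rng.random() < base_rate for _ in range(n_users))
        p1, p2 = a / n_users, b / n_users
        pooled = (a + b) / (2 * n_users)
        se = sqrt(2 * pooled * (1 - pooled) / n_users)
        if se > 0 and abs(p1 - p2) / se > 1.96:  # |z| > 1.96 ⇔ p < 0.05
            hits += 1
    return hits / n_tests

print(aa_false_positive_rate())
```

Roughly 1 in 20 of these no-difference tests comes back "significant" — so a program that runs many tests and ships every winner is guaranteed to ship some imaginary improvements.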