Measurement & testing
Statistical significance
Also known as: p-value · significance testing
The probability of observing a difference between two A/B test variants at least as large as the one measured, assuming the variants actually perform identically — conventionally, a result is 'significant' when the p-value is below 0.05.
Statistical significance is the framework for deciding whether an A/B test result is real or noise. A p-value quantifies the probability of seeing a difference as large as the observed one under the null hypothesis (no real difference between variants). Note that a low p-value is not "95% confidence the difference is real" — it only says the observed gap would be unlikely if the variants were identical. Conventional thresholds: p < 0.05 is standard for most marketing decisions; p < 0.01 is used when the downside of a false positive is high. The catch: significance requires sufficient sample size — a test with 100 users per arm is so under-powered that a real 5% lift will almost never reach significance. Pre-compute the required sample size from the baseline rate and the minimum detectable effect BEFORE running. Also: declare significance on the primary metric only. Running significance tests against 10 secondary metrics at p < 0.05 gives roughly a 40% chance of at least one false positive by chance alone (the multiple-comparisons problem).
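Both checks above are quick arithmetic. The sketch below is a minimal, illustrative example — the function name and the inputs (a 10% baseline conversion rate, a 5% relative lift) are assumptions, not values from this page — using the standard two-proportion z-test sample-size formula and the complement rule for the multiple-comparisons risk:

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)              # rate if the lift is real
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Detecting a 5% relative lift on a 10% baseline needs tens of thousands of
# users per arm — nowhere near the 100-user test mentioned above.
n = sample_size_per_arm(baseline=0.10, relative_mde=0.05)
print(n)

# Multiple comparisons: chance of at least one false positive across
# 10 independent significance tests at alpha = 0.05.
p_any_false_positive = 1 - (1 - 0.05) ** 10
print(round(p_any_false_positive, 3))  # roughly 0.4
```

Running a calculation like this before launch tells you whether the test is worth starting at all; if the required sample exceeds your traffic, widen the minimum detectable effect or run longer rather than peeking early.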
Read next
A/B testing in email: sample size, novelty, and what to report
Most email A/B tests produce winners that don't reproduce. Three causes keep showing up: under-powered samples, the novelty effect, and weak readout discipline. This guide covers designing tests that actually drive decisions instead of producing theatre.
False positives in email A/B tests: why half of winning tests don't actually win
Run enough A/B tests and some will show 'significant' lift from pure noise. Programs that ship every significant winner end up with a collection of imaginary improvements they can't tell apart from real ones. Here's how to spot the fakes and avoid the trap.