Guides
Experimentation
Measurement is where most lifecycle programs fool themselves. Running tests without sample-size math. Declaring winners from noise. Confusing last-click revenue with incremental revenue. These guides cover the discipline that separates real learning from confirmation theatre.
A lifecycle team that runs 20 A/B tests a year at p = 0.05 should expect, on average, one false-positive winner from pure noise. Most teams don't track how many tests they've run, so the false winners become 'learnings', propagate through the playbook, and quietly underperform. The gap between the claimed lifts and the aggregate program improvement is the tax of undisciplined experimentation.
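The arithmetic is simple enough to sanity-check in a few lines (this sketch assumes every tested change is truly null, i.e. pure noise):

```python
# Expected number of false-positive "winners" from 20 tests at alpha = 0.05,
# assuming none of the tested changes has a real effect.
alpha, n_tests = 0.05, 20

expected_false_winners = n_tests * alpha        # 1.0 winner from pure noise
p_at_least_one = 1 - (1 - alpha) ** n_tests     # ~0.64 chance of at least one

print(f"Expected false winners: {expected_false_winners:.1f}")
print(f"P(at least one false winner): {p_at_least_one:.2f}")
```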
The guides in this category cover the full testing stack. Sample size calculation — the 5-minute math that tells you whether a test can detect the effect you're looking for before you run it. The holdout group pattern — randomly suppressing a small population from a program so you can see its real incremental lift, not just its last-click attributed revenue. A/B testing structure — one primary metric, pre-registered, sized for a realistic effect, read at the end, not during.
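For the sample-size piece, here is a minimal sketch using statsmodels; the 3% baseline rate and 10% relative lift are illustrative placeholders, not recommendations:

```python
# Sample size for a two-arm email test on a conversion rate.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.03                    # control-arm conversion rate (illustrative)
target = baseline * 1.10           # the smallest lift worth detecting
effect_size = abs(proportion_effectsize(baseline, target))

n_per_arm = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,                    # significance threshold
    power=0.8,                     # chance of detecting the lift if it's real
    alternative="two-sided",
)
print(f"Recipients needed per arm: {n_per_arm:,.0f}")
```

If the number it prints is larger than the audience you can actually send to, the test cannot answer the question you're asking, and that is worth knowing before the send.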
Then the measurement stack. Cohort retention analysis — the one chart that tells you if retention is actually improving, stratified by cohort week or signup channel. Attribution models and which one to use for which question (first-touch for acquisition, last-click for transactional, multi-touch for anything in between, holdout for the honest incrementality answer). Send-time optimisation and the gap between vendor-claimed and measured lift. False-positive prevention and how to spot a 'winning' test that will not replicate.
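As a rough illustration of the cohort chart, the retention matrix is a few lines of pandas. The `events` frame and its column names are stand-ins for your own activity log, one row per user action:

```python
# Cohort retention matrix: distinct active users per cohort week, per week
# since signup, normalised by week-0 cohort size. Add a signup_channel column
# and include it in the groupby to stratify by channel.
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2, 3],
    "event_date": pd.to_datetime([
        "2024-01-02", "2024-01-10", "2024-01-03",
        "2024-01-17", "2024-02-01", "2024-01-20",
    ]),
})

first_seen = events.groupby("user_id")["event_date"].transform("min")
events["cohort_week"] = first_seen.dt.to_period("W")
events["weeks_since_signup"] = (events["event_date"] - first_seen).dt.days // 7

active = (events.groupby(["cohort_week", "weeks_since_signup"])["user_id"]
                .nunique()
                .unstack(fill_value=0))
retention = active.div(active[0], axis=0)   # column 0 = cohort size at week 0
print(retention.round(2))
```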
Read these before you run the next test. Running an underpowered test isn't neutral — it spends the audience and produces conclusions that range from useless to actively wrong.
Most email A/B tests produce winners that don't reproduce. This guide covers the three reasons — underpowered samples, the novelty effect, and weak readout discipline — and how to design tests that actually drive decisions.
10 min read
Email is often the first place teams try to price-test, and frequently the place where the wrong lesson gets learned. This guide covers what can genuinely be tested in email, what can't, and the measurement traps that make most email price tests unreliable.
9 min read
Without a holdout, lifecycle ROI is attribution-model guesswork. With one, you get a defensible number you can put in front of finance. Here's how to size, run, and read a holdout group — and the three mistakes that invalidate the result.
9 min read
Attribution debates are half epistemology, half politics. Last-touch is wrong but defensible; multi-touch is more accurate but less defensible; incrementality is most correct but slowest. Here's which model to use for which question, and which one is the table-stakes default for each.
10 min read
A cohort retention curve is the single most useful analytical artifact in lifecycle marketing. It isolates the effect of program changes from compounding base effects, and it's the one view that survives every other metric's limitations. Here's how to build one and read it.
9 min read
Most email A/B tests are powered to detect effects far larger than they can actually produce. Here's the sample size calculation that tells you whether your test will find what you're looking for — before you run it.
8 min read
Every ESP now markets a send-time optimisation feature. They all show flattering internal case studies. The honest version: STO moves open rate 3–8%, not revenue, and only works for certain program types. Here's when it's worth turning on.
7 min read
Run enough A/B tests and some will show 'significant' lift from random noise. Programs that ship every significant winner end up with a collection of imaginary improvements. Here's how to tell real lift from noise and avoid the false-positive trap.
8 min read
Last-click attribution makes lifecycle programs look bigger than they are. Incrementality tests strip out the effect of users who would have converted anyway and reveal the real lift. Here's how to design one that produces a defensible number.
9 min read
A winning A/B test with 4% lift overall might be a 20% win in one segment and a 10% loss in another. Segment-based analysis reveals the real story — and lets you ship winners to the segments that benefit while avoiding users who would be hurt.
8 min read