Incrementality testing: the measurement that tells you if a program actually works
Your win-back flow reports $2M in attributed revenue. Your CFO asks, 'Would those users have come back anyway?' If you can't answer with evidence, the number's credibility collapses. Incrementality testing is how you answer — a dedicated measurement that compares 'with program' to 'without program' on matched populations. It's the most defensible revenue number lifecycle can produce. Here's the design.
Justin Williames
Founder, Orbit · 10+ years in lifecycle marketing
Why attribution isn't enough
Last-click attribution credits lifecycle emails with revenue from users who clicked and converted. The implicit assumption: without the email, those users wouldn't have converted.
The assumption is frequently wrong. A win-back email to a dormant user might catch their eye — but many of those users were coming back anyway. Attribution credits the email; the reality is the email accelerated or re-captured intent that was already present.
The gap between attributed revenue and incremental revenue is usually 2–4×. A program with $2M in attributed revenue typically produces $500K–$1M in true incremental lift; the rest is re-credited organic behaviour.
Incrementality testing measures the real number. It doesn't replace attribution (you still need per-send reporting); it calibrates it.
The fundamental test design
The core pattern: hold out a random subset of users from receiving the program, then compare their downstream behaviour to the treated group.
1. Define the program. The specific sequence, trigger, or treatment being tested. E.g., "the 3-message win-back flow triggered at 60 days of dormancy".
2. Define the eligible population. The users who would normally qualify for this program. E.g., "users dormant for exactly 60 days who have not unsubscribed".
3. Random assignment to treatment vs control. Usually 80/20 or 90/10 treatment-to-control. The control group does not receive the program; those users otherwise continue their normal experience.
4. Define the outcome metric and measurement window. E.g., "purchase rate within 30 days of trigger".
5. Measure. Compare purchase rate in treatment vs control; the difference is the incremental lift. Statistical tests (proportion tests, t-tests) tell you whether the difference is significant. A worked sketch of steps 3–5 follows below.
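A minimal sketch of that assignment-and-measurement loop in Python, assuming pandas and statsmodels are available and using illustrative column names (user_id, purchased) that won't match your schema exactly:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

def assign_groups(eligible: pd.DataFrame, control_share: float = 0.10,
                  seed: int = 42) -> pd.DataFrame:
    """Step 3: randomly assign eligible users to treatment vs control (90/10)."""
    rng = np.random.default_rng(seed)
    out = eligible.copy()
    out["group"] = np.where(rng.random(len(out)) < control_share,
                            "control", "treatment")
    return out

def conversion_read(results: pd.DataFrame):
    """Step 5: compare purchase rate across groups with a two-proportion z-test.

    `results` needs a boolean `purchased` column: did the user buy within the
    measurement window defined in step 4?
    """
    counts = results.groupby("group")["purchased"].agg(["sum", "count"])
    treat, ctrl = counts.loc["treatment"], counts.loc["control"]
    _, p_value = proportions_ztest(count=[treat["sum"], ctrl["sum"]],
                                   nobs=[treat["count"], ctrl["count"]])
    lift = treat["sum"] / treat["count"] - ctrl["sum"] / ctrl["count"]
    return lift, p_value
```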
Sizing the holdout
The control group needs to be large enough to detect the incremental lift with statistical significance. This usually means 5–20% of eligible users, depending on expected effect size.
Expected lift 5%+ (strong programs): 5% holdout usually sufficient.
Expected lift 2–5% (typical programs): 10–15% holdout recommended.
Expected lift <2% (marginal programs): 20%+ holdout, and consider whether the program is worth running at all.
The sample size guide covers the math; use the same formulas here. Remember: the control group is the bottleneck. The treatment group gets the usual volume; the control group's sample size determines the test's statistical power.
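For a rough version of that calculation, a power analysis on the conversion-rate difference gives the minimum control size. The baseline rate and expected lift below are illustrative assumptions, not benchmarks:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.04                      # assumed control conversion rate
expected_relative_lift = 0.05        # program expected to add 5% on top
treated = baseline * (1 + expected_relative_lift)

effect = proportion_effectsize(treated, baseline)   # Cohen's h
# ratio = treatment size / control size; 9.0 corresponds to a 90/10 split
control_n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                         power=0.8, ratio=9.0)
print(f"Control group needs roughly {control_n:,.0f} users")
```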
Test duration
Run the test long enough that the measured effect represents genuine lift, not short-term shift.
Short effects (purchase within 7 days): 2–4 weeks of assignment, then analyse.
Medium effects (purchase within 30 days): 4–8 weeks.
Long effects (retention, subscription renewal): 3–6 months.
What to measure and how
Primary metric: revenue per user. Total revenue in the measurement window divided by the number of users in each group. Compare treatment vs control. This is the headline number.
Secondary: conversion rate. Percent of users who made any purchase. Useful for understanding whether the lift is driven by more users buying vs higher-value purchases per user.
Context: program-level engagement. Of users in the treatment group, how many opened, clicked, and converted. This is attribution data — collect it but don't let it confuse the incrementality read.
The incremental lift formula: (revenue/user in treatment - revenue/user in control) / revenue/user in control. A 5% incremental lift means the program genuinely added 5% to revenue per user beyond what would have happened otherwise.
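A sketch of the headline read under the same assumed schema as above: revenue per user in each group, the incremental lift, and a Welch t-test for significance (per-user revenue is skewed, so with smaller samples a bootstrap is the safer choice):

```python
from scipy import stats

def incrementality_read(results):
    """Revenue per user in treatment vs control, plus incremental lift."""
    treat = results.loc[results["group"] == "treatment", "revenue"]
    ctrl = results.loc[results["group"] == "control", "revenue"]
    rpu_treat, rpu_ctrl = treat.mean(), ctrl.mean()
    lift = (rpu_treat - rpu_ctrl) / rpu_ctrl
    _, p_value = stats.ttest_ind(treat, ctrl, equal_var=False)  # Welch's t-test
    return {"rpu_treatment": rpu_treat, "rpu_control": rpu_ctrl,
            "incremental_lift": lift, "p_value": p_value}
```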
The pitfalls
Control contamination. Users in the control group somehow receiving the program through another channel. Check: are your triggers and suppressions working correctly? Do other programs overlap with the one you're testing?
External events. A product launch, sale, or news event during the test affects both groups but in different amounts if they're already behaviourally different. Balance groups on pre-test characteristics where possible.
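One hedge against that pitfall is a pre-test balance check: compare a pre-period metric across the two groups before reading the post-period result. A sketch, again with an assumed column name:

```python
from scipy import stats

def check_balance(assigned, pre_metric: str = "pre_period_purchases"):
    """Compare a pre-assignment metric across groups; should look like noise."""
    treat = assigned.loc[assigned["group"] == "treatment", pre_metric]
    ctrl = assigned.loc[assigned["group"] == "control", pre_metric]
    _, p_value = stats.ttest_ind(treat, ctrl, equal_var=False)
    # A very small p-value suggests the split is unbalanced or contaminated,
    # and the post-period comparison shouldn't be trusted as-is.
    return p_value
```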
Measurement window too short. Measuring purchase rate at 7 days for a program that affects long-term retention misses the effect. Match the window to the program's mechanism.
Treatment effects beyond the targeted metric. A win-back flow might not lift purchase rate in 30 days but might reduce unsubscribe rate. Pre-register what you'll measure; don't post-hoc pick the winning metric.
Treat incrementality test design as a quarterly discipline — at minimum, each lifecycle program should get an annual incrementality measurement to calibrate its attribution numbers.
Frequently asked questions
- How do I convince stakeholders to allow a holdout?
- Frame it as risk management. A program with unmeasured effect has unknown value — stakeholders can't tell if a 10% budget cut would hurt anything. A program with measured incrementality has a defensible number that justifies its budget and survives sceptical questioning. The 10–15% 'lost' sends to holdout are the measurement cost; the incrementality number they produce is often worth far more in budget defensibility.
- What's the difference between A/B testing and incrementality testing?
- A/B testing compares two variants of a program (subject A vs subject B). Incrementality testing compares 'with program' to 'without program'. A/B is relative — it tells you which variant is better. Incrementality is absolute — it tells you if the program works at all. Both are useful; they answer different questions.
- How often should I run incrementality tests?
- Quarterly for major programs (win-back, welcome, cart abandonment). Annually for supporting programs (newsletters, occasional promotions). The effort is front-loaded (design, infrastructure); ongoing tests are cheaper once the measurement plumbing exists.
- What if the incremental lift is smaller than I expected?
- Useful information. Either the program is less impactful than attribution suggested (common — reduce the program's budget or reinvent it), or the test was underpowered or contaminated (check the methodology). Small or null incrementality on programs that 'everybody knows work' is the single most common finding of incrementality testing, and often the most valuable.
- Can I run incrementality tests on transactional emails?
- Rarely appropriate — you shouldn't hold back transactional mail (password resets, order confirmations) for measurement. Transactional email's value is functional; it isn't meaningfully measured via incrementality. Where incrementality testing does apply: transactional-adjacent emails (shipping updates, post-purchase surveys) where there's a genuine choice about whether to send.
- What's a 'good' incrementality number for a lifecycle program?
- Varies by program type. Welcome flow: 5–15% incremental lift typical. Winback: 3–10%. Cart abandonment: 8–20%. Browse abandonment: 5–15%. Newsletter: often 1–3% directly, with additional indirect effects. Your program's incrementality vs these benchmarks tells you where it stands, but absolute numbers vary by audience and category.
Related guides
Browse all →
False positives in email A/B tests: why half of winning tests don't actually win
Run enough A/B tests and some will show 'significant' lift from random noise. Programs that ship every significant winner end up with a collection of imaginary improvements. Here's how to tell real lift from noise and avoid the false-positive trap.
Segment-based testing: when your average lift is hiding opposing effects
A winning A/B test with 4% lift overall might be a 20% win in one segment and a 10% loss in another. Segment-based analysis reveals the real story — and lets you ship winners to the segments that benefit while avoiding users who would be hurt.
Sample size: the calculation everyone gets wrong in email A/B tests
Most email A/B tests are powered to detect effects far larger than they can actually produce. Here's the sample size calculation that tells you whether your test will find what you're looking for — before you run it.
Send-time optimisation: what it really moves, and what it doesn't
Every ESP now markets a send-time optimisation feature. They all show flattering internal case studies. The honest version: STO moves open rate 3–8%, not revenue, and only works for certain program types. Here's when it's worth turning on.
Price-testing through email: what's testable, what isn't
Email is often the first place teams try to price-test, and it's often where the wrong lesson gets learned. This guide covers what can genuinely be tested in email, what can't, and the measurement traps that make most email price tests unreliable.
Holdout group design: the incrementality tool most lifecycle programs skip
Without a holdout, lifecycle ROI is attribution-model guesswork. With one, you get a defensible number you can put in front of finance. Here's how to size, run, and read a holdout group — and the three mistakes that invalidate the result.