Incrementality testing: the measurement that tells you if a program actually works
Your win-back flow reports $2M in attributed revenue. Your CFO asks whether those users would have come back regardless. If you can't answer with evidence, the $2M evaporates on contact. Incrementality testing is how you answer — a dedicated measurement comparing 'with program' against 'without program' on matched populations. The most defensible revenue number lifecycle can produce. This is the design.
By Justin Williames
Founder, Orbit · 10+ years in lifecycle marketing
Why attribution isn't enough
Last-click credits lifecycle emails with revenue from users who clicked and converted. The implicit claim: without the email, those users wouldn't have converted. That claim is wrong often enough to matter.
A win-back email to a dormant user might catch an eye — but a meaningful share of those users were coming back anyway. Attribution credits the email; the reality is the email accelerated or re-captured intent that was already in the pipe. The program still did something. Just not as much as the dashboard says.
The gap between attributed revenue and incremental revenue is usually 2–4x. A program with $2M in attributed revenue typically produces $500K–$1M of true incremental lift. The rest is re-crediting of behaviour that would have happened on its own.
Incrementality testing measures the real number. It doesn't replace attribution — you still need per-send reporting for operations — it calibrates it. Treat attribution as a daily pulse and incrementality as the annual or quarterly calibration. Different instruments for different questions.
The fundamental test design
The core pattern: hold out a random subset of users from receiving the program, then compare their downstream behaviour against the treated group. Five steps, in order:
1. Define the program. The specific sequence, trigger, or treatment being tested. Example: "the 3-message win-back flow triggered at 60 days of dormancy". Not "winback" as a concept — the exact artefact going live.
2. Define the eligible population. The users who would normally qualify. Example: "users dormant exactly 60 days and not unsubscribed". Write it as if describing a SQL filter.
3. Random assignment to treatment vs control. Usually 80/20 or 90/10 treatment-to-control. The control group does not receive the program. Everything else about their experience is unchanged.
4. Define the outcome metric and measurement window. Example: "purchase rate within 30 days of trigger". Pre-register the metric. Pre-register the window.
5. Measure. Treatment minus control. The difference is the incremental lift. Proportion tests or t-tests tell you if it's significant.
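A minimal sketch of steps 3–5 in Python, assuming statsmodels is available — the user IDs, group sizes, and conversion counts are illustrative placeholders, not real data:

```python
# Steps 3-5 sketched: random assignment, then a two-proportion z-test on
# the pre-registered outcome (purchase within 30 days of trigger).
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(seed=2024)

def assign(eligible_user_ids, control_share=0.10):
    """Randomly split the eligible population 90/10 treatment-to-control."""
    in_control = rng.random(len(eligible_user_ids)) < control_share
    treatment = [u for u, c in zip(eligible_user_ids, in_control) if not c]
    control = [u for u, c in zip(eligible_user_ids, in_control) if c]
    return treatment, control

# Hypothetical counts after the 30-day measurement window.
purchasers = np.array([4_320, 430])      # treatment, control
group_size = np.array([90_000, 10_000])  # treatment, control

rates = purchasers / group_size
lift = rates[0] - rates[1]               # treatment minus control
z_stat, p_value = proportions_ztest(purchasers, group_size)

print(f"treatment {rates[0]:.2%}, control {rates[1]:.2%}, "
      f"lift {lift:+.2%} points, p = {p_value:.4f}")
```

The `assign` helper is hypothetical; in practice the split usually lives in the ESP or CDP, but the logic is the same — a random flag set at eligibility, never reassigned, with the control group suppressed from the flow and nothing else.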
Worth naming the distinction for anyone confused about why this isn't just an A/B test: an A/B test compares subject line A to subject line B — it tells you which variant wins. An incrementality test compares "with program" to "without program" — it tells you whether the program works at all. Both are useful. They answer different questions.
Sizing the holdout
The control group needs to be large enough to detect the incremental lift with statistical significance. Usually that's 5–20% of eligible users, scaled to the expected effect.
Expected lift 5% or more (strong programs): 5% holdout is usually enough.
Expected lift 2–5% (typical): 10–15% holdout.
Expected lift under 2% (marginal): 20%+ holdout, and while you're there consider whether the program is earning its place at all.
The sample size guide covers the maths; same formulas apply. One thing worth holding in mind: the control group is the bottleneck. Treatment gets its usual volume; the control group's size determines the test's statistical power.
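As a rough sketch of that maths — assuming statsmodels and an illustrative 4% baseline purchase rate — the required control size falls out of a standard two-proportion power calculation:

```python
# How many control users are needed to detect a given relative lift
# at 80% power, with a 90/10 treatment-to-control split.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.04                      # assumed purchase rate without the program

for rel_lift in (0.05, 0.03, 0.02):  # expected relative lift in the treated group
    effect = proportion_effectsize(baseline * (1 + rel_lift), baseline)
    n_control = NormalIndPower().solve_power(
        effect_size=effect, alpha=0.05, power=0.80,
        ratio=9.0)                   # treatment group is 9x the control group
    print(f"{rel_lift:.0%} expected lift -> ~{n_control:,.0f} control users")
```

Smaller expected lifts blow up the required control size quickly, which is why the marginal-program tier above asks for 20%+ holdouts.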
Selling the holdout to stakeholders is a separate exercise. Frame it as risk management, not measurement cost. A program with an unmeasured effect has unknown value — nobody can tell if trimming 10% of the budget hurts anything. A program with a measured incrementality number has a defensible figure that survives skeptical questioning. The 10–15% "lost" sends are the measurement cost. The number they produce is often worth vastly more in budget defensibility.
Test duration, and what you're measuring
Run the test long enough that the measured effect represents genuine lift, not a short-term shift:
Short effects (purchase within 7 days): 2–4 weeks of assignment, then analyse.
Medium effects (purchase within 30 days): 4–8 weeks.
Long effects (retention, subscription renewal): 3–6 months.
On the numbers themselves. The headline is revenue per user: total revenue in the window divided by users in each group, treatment versus control. That's the number the CFO wants. Secondary is conversion rate — the percentage making any purchase — which tells you whether the lift is driven by more buyers or bigger baskets. Collect program-level engagement (opens, clicks, attributed conversions) for context, but don't let attribution confuse the incrementality read.
The formula: (revenue/user in treatment − revenue/user in control) / revenue/user in control. A 5% incremental lift means the program genuinely added 5% to per-user revenue over what would have happened without it.
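A sketch of that read, assuming scipy and hypothetical per-user revenue arrays pulled for each group over the measurement window:

```python
# Revenue per user in treatment vs control, relative lift, and a Welch
# t-test on the per-user revenue arrays (which include the zero-spenders).
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)
# Placeholder data: most users spend nothing; buyers spend ~$60.
treatment_rev = rng.choice([0.0, 60.0], size=90_000, p=[0.952, 0.048])
control_rev = rng.choice([0.0, 60.0], size=10_000, p=[0.957, 0.043])

rpu_treatment = treatment_rev.mean()
rpu_control = control_rev.mean()
rel_lift = (rpu_treatment - rpu_control) / rpu_control

t_stat, p_value = stats.ttest_ind(treatment_rev, control_rev, equal_var=False)
print(f"RPU treatment ${rpu_treatment:.2f}, control ${rpu_control:.2f}, "
      f"incremental lift {rel_lift:+.1%}, p = {p_value:.4f}")
```

Real per-user revenue is heavily zero-inflated and skewed, so on live data a bootstrap of the difference in means is a common, more robust alternative to the t-test.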
The pitfalls, and what a good number looks like
Control contamination. Users in the control group somehow still receiving the program through another channel. Check triggers and suppressions. Check whether other programs overlap with the one under test.
External events. A product launch, sale, or news event during the test affects both groups — but unevenly if the groups are already behaviourally different. Balance on pre-test characteristics where possible.
Measurement window too short. Measuring 7-day purchase rate on a program that drives long-term retention misses the effect entirely. Match the window to the mechanism.
Post-hoc metric shopping. A win-back flow might not lift 30-day purchase rate but might cut unsubscribe rate. Pre-register what you'll measure. Don't pick the winning metric after the fact — that's how you convince yourself of things that aren't true.
Rough benchmarks for a "good" incrementality number, by program type, so you have something to compare against. Welcome flow: 5–15%. Winback: 3–10%. Cart abandonment: 8–20%. Browse abandonment: 5–15%. Newsletters: 1–3% directly plus indirect effects. Your program versus those bands tells you where it stands, but absolute numbers drift with audience and category, so treat the bands as orientation, not target.
And one more. Don't run incrementality tests on transactional mail. You're not going to hold back password resets for measurement, and the value of an order confirmation isn't in its incremental revenue anyway. Transactional value is functional. Incrementality testing lives in the lifecycle programs where there's a genuine choice about whether to send.
If the measured lift comes back smaller than you expected, that's information. Either the program is less impactful than attribution claimed (common; trim budget or rebuild), or the test was underpowered or contaminated (check methodology). Small or null incrementality on programs "everyone knows work" is the single most common finding in this entire discipline — and, unsurprisingly, the most valuable one.
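For the underpowered case, a quick diagnostic is the minimum detectable lift given the holdout you actually ran — a sketch below, again assuming statsmodels and illustrative numbers:

```python
# Given the control size you actually ran, what's the smallest relative
# lift the test had 80% power to detect? If that number is larger than the
# lift you expected, the test was underpowered, not the program worthless.
import numpy as np
from statsmodels.stats.power import NormalIndPower

baseline, n_control, n_treatment = 0.04, 10_000, 90_000  # assumed inputs

h = NormalIndPower().solve_power(
    effect_size=None, nobs1=n_control, alpha=0.05, power=0.80,
    ratio=n_treatment / n_control)
# Invert Cohen's h back into a purchase rate, then into a relative lift.
detectable_rate = np.sin(np.arcsin(np.sqrt(baseline)) + h / 2) ** 2
mde = detectable_rate / baseline - 1
print(f"minimum detectable relative lift ≈ {mde:.0%}")
```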
Run it as a quarterly practice where the stakes justify it. Every major lifecycle program deserves at least an annual incrementality read to calibrate the attribution number it's been quietly generating all year.
Frequently asked questions
- What is incrementality testing?
- A measurement approach where a random fraction of eligible users are held out of a lifecycle program entirely. The difference in outcome metric (revenue per user, retention, LTV) between exposed and holdout users is the incremental impact — what the program actually caused, versus what would have happened anyway. Without incrementality testing, attributed revenue mixes program contribution with organic behaviour, typically overstating program impact by 2–4x.
- How is incrementality testing different from A/B testing?
- A/B tests within the exposed population (variant A vs variant B, both receive the program). Incrementality tests across exposure (program vs no program). A/B answers "is this variant better?"; incrementality answers "is this program creating value vs the baseline?". Every lifecycle program should have both layered: incrementality to prove the program is worth running, A/B to tune it internally.
- How long should an incrementality test run?
- Long enough for the outcome metric to accumulate across the expected decision cycle. For a winback program measuring 90-day revenue per user: run for 90 days at minimum. For a welcome program measuring day-30 activation: run for 30 days. Cutting short because "it looks significant early" is the peeking mistake that invalidates results. Pre-compute the measurement window based on the metric's observation cycle and commit to it.
- What metric should incrementality testing measure?
- The business outcome the program claims to move — usually revenue per user, retention rate, or LTV. Measuring engagement (opens, clicks) isn't incrementality; exposed users trivially engage more because they're the ones receiving mail, but that engagement may not translate to revenue. Always measure the downstream business metric, not the program's direct output. "Our winback drives 3% more opens" is trivial. "Our winback drives $X more revenue per dormant user" is actionable.
- When should I run incrementality testing?
- Once per major program at launch (prove it works vs nothing), then annually to verify the program is still generating value (programs degrade as audiences saturate or competitive dynamics shift). Running incrementality continuously on every program is too expensive — it permanently holds back revenue. Running it never means you're flying blind on whether the programs you fund actually create value. Annual cadence, on the programs representing the largest share of CRM spend, is the operator standard.
Related guides
Sample size: the calculation everyone gets wrong in email A/B tests
Most email A/B tests are powered to detect effects far larger than the test could actually produce. The result: false positives and false nulls, with confident conclusions in both directions. Sample size calculation fixes this before you send. Here's the 5-minute version.
False positives in email A/B tests: why half of winning tests don't actually win
Run enough A/B tests and some will show 'significant' lift from pure noise. Programs that ship every significant winner end up with a collection of imaginary improvements they can't tell apart from real ones. Here's how to spot the fakes and avoid the trap.
Segment-based testing: when your average lift is hiding opposing effects
A winning A/B test with 4% aggregate lift might be a 20% win in one segment and a 10% loss in another. The aggregate is an average of opposing effects. Segment analysis catches it — and lets you ship the win to the segments that benefit while not shipping the loss to the ones that don't.
Send-time optimisation: what it really moves, and what it doesn't
Every ESP markets an STO feature and every vendor deck shows lift. The honest version: STO moves open rate 3–8%, rarely revenue, and only for certain program types. Here's when it's worth turning on.
Price-testing through email: what's testable, what isn't
Email is the fastest place to try a new price, and the easiest place to learn the wrong lesson. What you can test cleanly, what you can't, and the measurement traps that quietly turn price tests into expensive false positives.
Holdout group design: the incrementality tool most lifecycle programs skip
Without a holdout, lifecycle ROI is attribution-model guesswork with a spreadsheet. With one, you get a defensible number you can actually put in front of finance. Here's how to size, run, and read a holdout — and the three mistakes that quietly invalidate the result.