Holdout group design: the incrementality tool most lifecycle programs skip
A holdout group is a random slice of your audience that receives no lifecycle messaging for a measurement period. Compare their revenue to the messaged group and you get incremental lift: the revenue your program actually produced versus the revenue that was coming anyway. It's the most defensible measurement in lifecycle. It's also the most skipped. Here's how to run one that holds up when finance starts pulling threads.
By Justin Williames
Founder, Orbit · 10+ years in lifecycle marketing
Why attribution can't replace a holdout
Attribution tells you which touchpoint got credit. Incrementality tells you whether the touchpoint produced revenue. Two different questions. Two different answers.
Attribution models — first-touch, last-touch, multi-touch — allocate credit across touchpoints that happened before a conversion. They do not answer the causal question: would the conversion have happened without any of those touchpoints? For lifecycle, the answer is often yes for a meaningful share of revenue. Users who would have returned, renewed, or bought anyway show up in attribution reports as lifecycle wins. Nobody's fault, exactly. Nobody's telling you either.
A holdout strips the confusion out at the root. Random assignment means the holdout cohort is, on average, identical to the messaged cohort except for message exposure. The revenue delta is incremental by construction. No attribution model required. No debate with finance about which model is "right".
Sizing the holdout
5–10% of the eligible audience is the operator default, and for good reasons. 10% reaches statistical power faster. 5% loses less revenue if the program turns out to be a winner. For a program with tight monthly revenue targets, start at 5%. For a program where you're genuinely testing whether to keep running it, 10% gets you to the answer sooner.
The holdout has to be large enough to detect the effect size you care about. Ballpark: to detect a 10% incremental lift at 95% confidence over a quarter, you typically need around 5,000 users in the holdout — the exact number varies with baseline conversion rate. Below that the test is underpowered and you'll land on a number that isn't reliably different from zero, which is worse than running no test at all because people will still quote it.
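The ballpark is checkable with a standard two-proportion power calculation. Here's a minimal sketch in Python using statsmodels, with a 15% baseline conversion rate assumed purely for illustration; swap in your own baseline and split:

```python
# Minimal holdout power calculation, assuming a two-proportion z-test.
# All inputs are illustrative; replace them with your program's numbers.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.15              # assumed quarterly conversion rate in the holdout
relative_lift = 0.10              # the 10% incremental lift you want to detect
treated_rate = baseline_rate * (1 + relative_lift)

effect_size = proportion_effectsize(treated_rate, baseline_rate)

# nobs1 is the holdout; ratio is treated users per holdout user,
# so a 5% holdout against a 95% messaged group gives ratio = 19.
n_holdout = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,                   # 95% confidence
    power=0.8,                    # 80% chance of detecting a real lift
    ratio=19,
    alternative="two-sided",
)
print(f"Holdout needs roughly {n_holdout:,.0f} users")  # ~4,900 here
```

Drop the assumed baseline to 4% and the required holdout climbs past 20,000.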
The Orbit Experiment Design skill handles the power calculation against your specific baseline and expected lift. Skip it and you're running a test that can't answer the question it was built to ask.
Assignment rules that keep the result clean
Three rules for assignment matter more than most teams realise, because all three can break silently and you won't find out until someone asks a hard question in a QBR.
Assignment must be stable. A user assigned to the holdout today stays in the holdout for the full measurement period. Users flickering in and out contaminate the read. Use a persistent random integer — Braze's Random Bucket Number is built for exactly this — not a recalculated random value that shuffles on every audience refresh. (There's a sketch of this after the three rules.)
Assignment must be random."Users who haven't engaged in the last 30 days" is not a random cut. It's a selection bias with a severity rating. The holdout has to be statistically equivalent to the treatment group on every dimension except message exposure. RBN or equivalent. No shortcuts.
Assignment must be global. Exempting specific programs from the holdout — "we won't hold out onboarding because it feels cruel" — compromises the measurement. Either hold out or don't. Global holdout means every marketing send respects the same exclusion. Transactional sends are exempt, obviously.
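If you're not on Braze, a persistent hash of a stable user ID reproduces the first two rules. A minimal sketch; the bucket count and cutoff mirror the RBN convention, but the hashing scheme is an assumption for illustration, not how Braze assigns its numbers:

```python
# Deterministic, persistent holdout assignment via hashing. The same
# user_id always lands in the same bucket, so assignment is stable
# across audience refreshes (rule 1) and uniformly distributed (rule 2).
# Bucket range mirrors Braze's Random Bucket Number (0-9999); the
# hashing approach itself is an illustrative assumption.
import hashlib

BUCKETS = 10_000
HOLDOUT_CUTOFF = 500          # buckets 0-499: a 5% holdout

def bucket(user_id: str) -> int:
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % BUCKETS

def in_holdout(user_id: str) -> bool:
    return bucket(user_id) < HOLDOUT_CUTOFF

# Rule 3, global: every marketing send checks the same predicate,
# so a user is either held out everywhere or nowhere.
print(in_holdout("user-123"))  # same answer every time it's asked
```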
The three mistakes that invalidate the result
Mistake 1 — holdout leakage. Users in the holdout occasionally get mail because a broadcast ignored the flag. 2% leakage is enough to invalidate the measurement — you're no longer measuring messaged versus unmessaged, you're measuring heavily-messaged versus lightly-messaged, and the delta shrinks accordingly. Audit broadcasts monthly for holdout compliance. Every month. No exceptions.
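The monthly audit itself is mechanical once you can export a month of send logs and the holdout roster. A sketch with hypothetical file and column names; substitute whatever your export actually produces:

```python
# Monthly holdout-leakage audit. File and column names are hypothetical
# placeholders for your own export schema.
import csv

def load_ids(path: str, column: str) -> set[str]:
    with open(path, newline="") as f:
        return {row[column] for row in csv.DictReader(f)}

holdout = load_ids("holdout_roster.csv", "user_id")
messaged = load_ids("sends_last_month.csv", "user_id")

leaked = holdout & messaged
leakage_rate = len(leaked) / len(holdout)

print(f"{len(leaked)} holdout users were messaged ({leakage_rate:.2%} leakage)")
# Any nonzero rate means some broadcast ignored the flag; find it
# before the delta you report quietly shrinks.
```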
Mistake 2 — seasonal confounds. A holdout that runs only in November — i.e., Black Friday — will show enormous incrementality because of the volume spike. The number won't generalise to the rest of the year, but it'll get quoted as if it does. Run holdouts across full quarters or full years so you average seasonal effects out.
Mistake 3 — reading before statistical power. A two-week holdout result is almost never significant. Leadership asks for an update, the team produces a number anyway, and that number gets cited as the incrementality forever. The fix is simple and politically hard: don't publish interim reads. Publish once at the end of the measurement period, with the full analysis attached.
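When the period does end, the readout is a standard significance test on the primary metric. A minimal sketch using a two-proportion z-test on conversion; the counts are illustrative and assume a 95/5 split:

```python
# End-of-period readout via a two-proportion z-test. Counts are
# illustrative placeholders, assuming a 95/5 messaged/holdout split.
from statsmodels.stats.proportion import proportions_ztest

conversions = [8_430, 380]      # converters: [messaged, holdout]
group_sizes = [95_000, 5_000]

z_stat, p_value = proportions_ztest(conversions, group_sizes)
lift = (conversions[0] / group_sizes[0]) / (conversions[1] / group_sizes[1]) - 1

print(f"Relative lift: {lift:.1%}, p = {p_value:.4f}")
# Publish only when the pre-registered period has ended and p < 0.05.
# An interim peek at p = 0.20 is not an early answer; it's noise.
```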
Using the number (and what happens when it's zero)
A holdout produces a single most-important number: incremental revenue per user. Multiply by audience size for total program contribution. Divide by program cost for ROI. This is the number that goes in front of finance and replaces the attribution-model figures that always, eventually, get questioned.
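The arithmetic from readout to finance-ready figure is three lines. A worked sketch; every number here is an assumed placeholder:

```python
# From holdout readout to the finance numbers. All figures are assumed.
rev_per_user_messaged = 12.40   # $ per messaged user over the quarter
rev_per_user_holdout  = 11.10   # $ per holdout user over the quarter
audience_size         = 95_000  # messaged users
program_cost          = 60_000  # $ per quarter: tooling, sends, headcount share

incremental_per_user = rev_per_user_messaged - rev_per_user_holdout
program_contribution = incremental_per_user * audience_size
roi = program_contribution / program_cost

print(f"Incremental revenue per user: ${incremental_per_user:.2f}")
print(f"Total program contribution:   ${program_contribution:,.0f}")
print(f"ROI:                          {roi:.1f}x")
```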
The retention economics guide covers how to frame the incrementality number in a CFO conversation. A defensible quarterly holdout study is usually more persuasive than six quarters of attribution spreadsheets, because it answers the causal question rather than the correlational one. Run one annually at minimum. Programs that run holdouts annually have budget conversations that go differently from programs that don't.
On the mechanics question that comes up constantly — in Braze, use Random Bucket Number filters. A fixed slice (RBN < 500 for a 5% holdout) is stable, random, and respected across every program in the instance. The same attribute that underpins IP warm-up is what you want here. It was designed for this job.
What if the holdout shows zero incremental lift? Don't panic. Don't bury it either. Zero lift is honest information, and it's worth investigating before it turns into a narrative. Is the program targeting users who were going to convert anyway? Is the offer too weak to change behaviour? Is the timing wrong? Zero is rarely the final answer on a well-designed program, but when it is, the program needs rethinking — not another quarter of the same cadence with a redesigned header image.
Frequently asked questions
- What is a holdout group in email marketing?
- A randomly selected share of the eligible audience explicitly excluded from a lifecycle program — they never receive the communications. Comparing their outcomes to exposed users reveals the program's true causal impact. Typical holdout size: 5–10% of eligible users for mature programs, 15–20% for programs still being validated. Holdouts are rotated quarterly so individual users don't sit in holdout forever.
- How big should a holdout group be?
- Big enough for statistical power on the primary metric, small enough not to waste revenue. For most lifecycle programs, a 10% holdout gives adequate power within 4–8 weeks for primary metrics like revenue per user or retention rate. Smaller holdouts (5%) need longer measurement windows. Larger holdouts (20%+) measure faster but forgo more revenue during the holdout window. Pick based on volume: high-traffic programs can afford 5%; low-volume programs need 15–20%.
- How is a holdout group different from an A/B test control?
- Both randomly assign users. The difference: an A/B control receives the current treatment (not nothing); a holdout receives nothing. A/B measures "variant vs current state"; holdout measures "program vs no program." A/B tells you if a change is better than today; holdout tells you if the program itself is creating value versus the organic baseline. Both are essential at different decision points — A/B for tuning programs, holdout for justifying a program's existence.
- Should holdouts be permanent or rotated?
- Rotated, usually quarterly. Permanent holdouts create two problems: the held-out cohort diverges from the exposed cohort over time in ways that contaminate comparison, and the operator accumulates guilt about the "forever-missing-out" users. Quarterly rotation gives each user a fair chance at exposure and keeps the comparison clean by ensuring holdout and exposed cohorts stay demographically similar over time.
- How do I communicate holdout results to leadership?
- Report the incremental metric directly: "Our winback program drives $X incremental revenue per dormant user per quarter versus the no-send baseline." That single number is the business case. Avoid reporting "lift vs control" (which can be inflated by differences that existed before the program started) or cumulative revenue from program-exposed users (which mixes program contribution with organic behaviour). Incrementality vs holdout is the honest number.
This guide is backed by an Orbit skill
Related guides
Price-testing through email: what's testable, what isn't
Email is the fastest place to try a new price, and the easiest place to learn the wrong lesson. What you can test cleanly, what you can't, and the measurement traps that quietly turn price tests into expensive false positives.
A/B testing in email: sample size, novelty, and what to report
Most email A/B tests produce winners that don't reproduce. Three reasons keep showing up: under-powered samples, the novelty effect, and weak readout discipline. This guide is about designing tests that actually drive decisions instead of theatre.
Attribution models for lifecycle: which one to defend in which room
Attribution debates are half epistemology, half politics. Last-touch is wrong but defensible. Multi-touch is more accurate but less defensible. Incrementality is the only one that answers the causal question — and it's the slowest. Here's which model to use for which question, and why.
Retention economics: proving lifecycle ROI to finance
Lifecycle programs get deprioritised when they can't defend their impact in dollars. The four models that keep the budget — LTV, payback, cohort retention, incrementality — and the four-slide pattern that wins a CFO room.
Lifecycle marketing for flat products
The standard lifecycle playbook assumes weekly engagement and neat stage progression. Most real products aren't shaped like that. This is how to design lifecycle for products used once a year, once a quarter, or whenever the user happens to need you — where the textbook quietly makes things worse.
Incrementality testing: the measurement that tells you if a program actually works
Last-click attribution makes lifecycle look bigger than it is. Incrementality testing strips out users who would have converted anyway and surfaces the real number. This is how to design a test that produces a figure you can defend in front of a CFO.
Use this in Claude
Run this methodology inside your Claude sessions.
Orbit turns every guide on this site into an executable Claude skill — 54 lifecycle methodologies, 55 MCP tools, native Braze integration. Pay what it's worth.