Price-testing through email: what's testable, what isn't
By Justin Williames
Founder, Orbit · 10+ years in lifecycle marketing
Updated · 9 min read
Someone has an idea for a new price, discount, or offer structure. Email is the quickest way to put it in front of users. The test ships. A winner is declared. Three months later the winning offer has been folded into the base program and nobody can reproduce the lift. This pattern is common enough it should be the default expectation. Email price tests are uniquely prone to false positives — and uniquely able to damage the program when the wrong lesson gets locked in.
What email can actually test
Email is a credible test environment for a narrow set of pricing questions. Whether offer X (20% off, one product) converts better than offer Y at the same message moment. Whether free-shipping framing beats dollar-off framing at the same effective discount. Whether a deadline, a free-trial extension, or a price-lock message outperforms the control version of the same campaign.
The common thread: each test is a single message moment, a narrow audience, and a conversion window that closes quickly. Inside those limits, email tests can be rigorous. Power them correctly. Read them cleanly. Ship the winner without risking the rest of the program.
What email cannot test: whether the underlying price should change. Whether the product is worth more or less than its current tag. Whether a subscription tier is correctly positioned. Those are pricing questions, not messaging questions. They need audiences, durations, and measurement machinery that an email test cannot give you. Treat email as one data point on those decisions, never the final answer.
Why most email price tests produce false positives
Three mechanical problems push email price tests toward false positives far more often than other email tests.
Novelty effect, amplified. Price-related copy is unusual in a lifecycle program. Most emails don't lead with a number, so when a variant does, engagement spikes on novelty alone for a few days. On a standard two-week test, that novelty window can carry enough of the measurement period that the variant looks like a winner even after it fades.
Audience selection bias. Price tests in email tend to reach only the opened-the-email cohort, which is dramatically more engaged than the full audience. A discount that converts well for an engaged cohort will over-predict performance when rolled out to the full base. Measure conversion per sent, not per opened, and confirm the test population matches the rollout population. Both steps are essential. Both are frequently skipped. The sketch below shows how wide the per-sent vs per-opened gap can be.
Cannibalisation. The variant's lift is often real within the test window but evaporates at program level, because the converting users would have converted anyway at the control offer a week later. A 20% conversion lift that consists entirely of pulled-forward conversions is a timing change, not a revenue gain. Short-window tests rarely catch this.
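To make the audience-selection point above concrete, here is a minimal sketch of the two denominators side by side. Every number in it is hypothetical, chosen only to show how a per-opened readout over-predicts a full-base rollout:

```python
# Hypothetical numbers, for illustration only.
sent = 50_000        # emails delivered per variant
opened = 12_500      # 25% open rate
converted = 625      # conversions attributed to the variant

per_opened = converted / opened   # 5.00%: the number most dashboards report
per_sent = converted / sent       # 1.25%: the number that predicts a full-base rollout

print(f"conversion per opened: {per_opened:.2%}")
print(f"conversion per sent:   {per_sent:.2%}")
```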
The Orbit Experiment Design skill builds cannibalisation checks into the readout — holdout comparisons, payback-window modelling — so lifts that wash out at program level are flagged instead of shipped. The underlying experimentation discipline sits in the A/B testing guide — sample size, power, novelty.
The measurement most teams get wrong
The question that matters is not "did this variant convert better?" It's "did this variant produce more revenue than the control, net of the discount, measured over a period that captures what happened to the users who converted?" Three pieces, all of them usually missing:
Net of the discount. A 20% discount that lifts conversion by 15% usually loses money at a unit level. Almost every conversion-lift number reads differently once you subtract the margin handed over to produce it.
Over the right period. A 7-day conversion window on a price test misses retention and repurchase effects. Users who converted on a steep discount often retain worse than users who paid full price — they bought the price, not the product. Measure a 30 to 90-day window, or accept you're optimising for an intermediate metric rather than revenue. Most teams read the 7-day number and miss the 90-day reversal.
Against the right counterfactual. A proper price-test measurement needs a holdout — a random slice of the eligible audience that gets no offer at all. Without one, the test answers "variant vs control offer", which is a weaker question than "variant vs no offer". Plenty of discount campaigns beat their control variant while losing to the holdout, which is the moment you realise the whole thing was self-cannibalisation with extra steps. The sketch below puts all three pieces on one readout.
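One way to wire the three pieces together is a single per-arm number: revenue over the longer window, minus the discount handed over, spread across everyone eligible rather than just converters. A minimal sketch; the arm names, counts, and revenue figures are all invented for illustration:

```python
def net_revenue_per_user(users, revenue_90d, discount_cost):
    """90-day revenue minus the discount handed over, spread across every
    eligible user, not just the ones who converted."""
    return (revenue_90d - discount_cost) / users

# Per-arm figures: eligible users, conversions, 90-day revenue, discount cost.
arms = {
    "variant": (20_000, 460, 31_000.0, 6_200.0),   # 20% off
    "control": (20_000, 400, 28_500.0, 2_900.0),   # 10% off
    "holdout": ( 5_000,  88,  6_900.0,     0.0),   # no offer at all
}

for name, (users, conversions, revenue, discount) in arms.items():
    print(f"{name:8s} conversion {conversions / users:.2%}  "
          f"net revenue per user ${net_revenue_per_user(users, revenue, discount):.2f}")
```

On numbers like these the variant wins the conversion-rate comparison and still loses on net revenue per eligible user, with the holdout ahead of both, which is exactly the pattern the third piece is there to catch.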
Copy-level tests are safer than offer-level tests
A specific subset of price-adjacent testing is actually well-suited to email: copy-level framing around a fixed underlying offer. "Save $20" vs "20% off" at the same effective discount. "Limited time — ends Sunday" vs no deadline. "Your exclusive offer" vs generic framing. These are legitimate email tests because the underlying economics are identical — only the framing moves. Which means most of the traps above don't apply. Use the significance calculator as normal and ship the winner.
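For conversion-rate comparisons like these, the check behind most significance calculators is typically a two-proportion z-test. A minimal sketch, with hypothetical counts for a "Save $20" vs "20% off" split:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test on conversion counts."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, p_value

# Hypothetical counts: "Save $20" (a) vs "20% off" (b) at the same effective discount.
lift, p = two_proportion_z(conv_a=310, n_a=24_000, conv_b=365, n_b=24_000)
print(f"absolute lift {lift:+.3%}, p-value {p:.3f}")
```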
Offer-level tests — changing the discount, the product mix, or the price tiers — can still run through email, but treat the result as first evidence, not verdict. Pair it with holdout data, a 30-plus-day retention window, and explicit margin accounting before anything gets declared a winner.
When not to run the test at all
Two common failure modes where the honest answer is: don't run it through email.
Sample size is too low for the effect you care about. If your audience per variant is 5,000 and the effect that matters is a 3% lift, the test mathematically cannot answer the question. You'll get a number. It will not be signal. Any decision based on it is a coin flip wearing statistical vocabulary.
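A quick way to pressure-test that before anything sends is to compute the per-variant sample the test would actually need. A minimal sketch using the standard normal-approximation formula for two proportions; the 2% baseline conversion rate is a hypothetical stand-in for your own:

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate per-variant sample size for a two-proportion test
    (normal approximation, two-sided alpha, 80% power by default)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Hypothetical: 2% baseline conversion, and the lift that matters is 3% relative.
print(n_per_variant(baseline=0.02, relative_lift=0.03))
# Hundreds of thousands per variant, not 5,000.
```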
The winning condition would damage the program. An aggressive discount variant that wins in email trains your audience to wait for discounts. That training cost shows up months later as suppressed full-price conversion, and it's hard to attribute back to the test that caused it. Run these only if you're ready to either ship the winner everywhere or accept the training effect. If neither is acceptable, don't run the test.
The Retention Economics skill models the downstream cost of discount-trained behaviour so it enters the decision explicitly instead of ambushing you in six months.
Related guides
A/B testing in email: sample size, novelty, and what to report
Most email A/B tests produce winners that don't reproduce. Three reasons keep showing up: under-powered samples, the novelty effect, and weak readout discipline. This guide is about designing tests that actually drive decisions instead of theatre.
Holdout group design: the incrementality tool most lifecycle programs skip
Without a holdout, lifecycle ROI is attribution-model guesswork with a spreadsheet. With one, you get a defensible number you can actually put in front of finance. Here's how to size, run, and read a holdout — and the three mistakes that quietly invalidate the result.
Attribution models for lifecycle: which one to defend in which room
Attribution debates are half epistemology, half politics. Last-touch is wrong but defensible. Multi-touch is more accurate but less defensible. Incrementality is the only one that answers the causal question — and it's the slowest. Here's which model to use for which question, and why.
Sample size: the calculation everyone gets wrong in email A/B tests
Most email A/B tests are powered to detect effects far larger than the test could actually produce. The result: false positives and false nulls, with confident conclusions in both directions. Sample size calculation fixes this before you send. Here's the 5-minute version.
False positives in email A/B tests: why half of winning tests don't actually win
Run enough A/B tests and some will show 'significant' lift from pure noise. Programs that ship every significant winner end up with a collection of imaginary improvements they can't tell apart from real ones. Here's how to spot the fakes and avoid the trap.
Subject line anatomy: the four parts every line that performs shares
Most subject-line advice is decoration tips — emoji, length, numbers. The lines that actually get opened share a structural pattern. Four parts in a specific order, the three distortions that ruin it, and the four rules that keep A/B tests honest.