Updated · 8 min read
Measuring AI personalisation lift honestly
AI personalisation has the worst measurement hygiene in lifecycle marketing. The defaults are open rates (broken by Apple MPP), self-reported attribution (the model claims credit for users who would have converted anyway), and vendor case studies (selection-biased to the customers who saw lift). This guide is the inverse: the measurement framework that survives an honest audit and tells you whether to keep paying for the feature.
By Justin Williames
Founder, Orbit · 10+ years in lifecycle marketing
Why default measurement of AI personalisation is broken
If your AI personalisation lift report can't survive the question "against what control group, over what period, on what metric?" — it isn't a lift report. It's vendor marketing with your logo on it.
Three failure modes account for almost every flattering AI personalisation report that doesn't hold up under inspection:
Apple MPP open inflation. Apple Mail Privacy Protection pre-fetches images on its servers, registering an open whether or not a human looked at the email. AI personalisation features that report "open rate lift" are largely measuring this artefact. The Apple MPP guide covers the mechanics. Practical implication: opens are not a valid metric for AI personalisation lift.
Selection bias in vendor case studies. The customers who appear in the vendor deck are the ones who saw lift. The customers who didn't are not in the deck. The base rate of "programs where this feature produced no lift" is invisible to the buyer. Practical implication: vendor benchmarks are upper bounds, not expected values.
Confounded comparisons. The AI version of a program is compared to a previous version sent at a different time, to a different audience, with a different offer. Any of these confounds produces "lift" that has nothing to do with the AI. Practical implication: lift requires a proper randomised holdout or it isn't lift, it's seasonality.
The four-rule measurement framework
Rule 1 — Hold out a random sample. 10–20% of the eligible audience receives the non-AI version of the program. The split is random, persistent (the same users stay in the holdout for the duration of the test), and large enough for statistical power. The holdout group guide covers the design.
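A minimal sketch of what a persistent random split can look like in practice: hash a stable user ID with a test-specific salt, so assignment is random across users but never changes for a given user. The function name, salt, and 15% holdout share below are illustrative assumptions, not any particular platform's API.

```python
import hashlib

HOLDOUT_PCT = 15                          # illustrative; Rule 1 suggests 10-20%
TEST_SALT = "ai-personalisation-test-v1"  # hypothetical test identifier

def assign_arm(user_id: str) -> str:
    """Deterministically assign a user to 'holdout' or 'ai' for this test.

    Hashing (salt + user_id) keeps the split random across users but
    persistent for any given user, so the holdout stays stable for the
    full duration of the test.
    """
    digest = hashlib.sha256(f"{TEST_SALT}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "holdout" if bucket < HOLDOUT_PCT else "ai"

# The assignment never changes between sends:
assert assign_arm("user-123") == assign_arm("user-123")
```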
Rule 2 — Measure on downstream outcomes, not upstream signals. Conversion, revenue, retention. Not opens, not clicks. Opens are corrupted by MPP; clicks are reasonable secondary metrics but the primary should always be the business outcome the program exists to drive. If AI personalisation moves clicks but not conversion, that's a finding worth surfacing — not a lift to celebrate.
Rule 3 — Run the test long enough. Most AI personalisation effects are smaller than the natural variance of weekly metrics. 30 days is the floor; 60 days is the realistic minimum for stable readouts on conversion. Programs that declare lift after 7 days are reading noise. The sample size guide covers the calculation.
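A back-of-envelope version of the calculation the sample size guide covers, using the standard two-proportion formula. The 4% baseline conversion rate and 5% relative lift in the example are illustrative assumptions; the point is that realistic AI personalisation effects need large samples, which is why short tests read noise.

```python
from math import ceil
from scipy.stats import norm

def required_n_per_arm(p_control: float, rel_lift: float,
                       alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate recipients per arm needed to detect a relative lift in a
    conversion rate with a two-sided two-proportion z-test."""
    p_treat = p_control * (1 + rel_lift)
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_treat - p_control) ** 2)

# Illustrative: 4% baseline conversion, detecting a 5% relative lift
print(required_n_per_arm(0.04, 0.05))   # on the order of 150,000 per arm
```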
Rule 4 — Pre-register the readout. Define the metric, the success threshold, the test duration, and the decision criteria before the test starts. Pre-registration prevents the post-hoc "maybe if we look at clicks instead of conversion" reframe that turns null results into apparent lift. Treat AI personalisation tests with the same discipline as a real experiment.
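Pre-registration does not need special tooling. One lightweight option is to write the readout plan down in version control before launch; a minimal sketch follows, with illustrative field names and thresholds.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreRegistration:
    """Readout plan committed before the test starts (illustrative fields)."""
    hypothesis: str
    primary_metric: str
    success_threshold: str
    guardrails: tuple
    holdout_pct: int
    duration_days: int
    decision_rule: str

PLAN = PreRegistration(
    hypothesis="AI recommendations lift purchase conversion in the weekly digest",
    primary_metric="purchase conversion within 7 days of send",
    success_threshold=">= 3% relative lift, 95% CI excluding zero",
    guardrails=("unsubscribe rate", "spam complaint rate"),
    holdout_pct=15,
    duration_days=60,
    decision_rule="keep if threshold met and guardrails flat; otherwise kill or iterate",
)
```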
What lift actually looks like in clean data
Real-world AI personalisation lift, against a proper holdout, on a downstream metric — typical ranges from independent measurement and post-MPP studies:
Predictive churn save flows: 5–15% reduction in churn rate within the targeted cohort, when the save flow itself is well-designed. The model identifies the cohort; the save flow does the work. A bad save flow with a perfect model produces no lift.
Product / content recommendations: 3–8% lift in click-to-conversion in the targeted message slot. Larger when the catalog is large (1,000+ items) and the user has rich engagement history; smaller or zero when either of those is missing.
Send-time optimisation: 3–8% open lift, 1–4% click lift, typically no significant revenue lift against a proper randomised holdout. The STO guide covers the honest version in detail.
AI subject line generation: Variable. The lift depends on what you're comparing to. Compared to a human writing one subject line, generating five and A/B testing usually wins by 2–5% on click-through. Compared to a team that already runs disciplined subject line testing, the model adds little.
Vendor case studies typically show 2–5x these numbers. The gap is methodology. Plan around the realistic range, treat the vendor numbers as best case, and let the holdout tell you which it is.
Common measurement mistakes
Comparing post-launch period to pre-launch period. "Conversion is up 12% since we turned on AI recommendations." This is a temporal comparison, not a causal one. Conversion is up because of seasonality, a marketing campaign, a product launch, a competitor going down, weather. The AI may have contributed; you cannot tell from the comparison.
Comparing AI segment vs non-AI segment. "Users in the AI cohort convert at 8%; users not in the AI cohort convert at 4%." Selection bias — the AI cohort is the high-engagement users who entered the flow; the non-AI cohort is everyone else. The AI didn't cause the lift; the segmentation did.
Reading individual metrics in isolation. Opens up, clicks flat, conversion down, unsubscribes up. The framing "opens are up 12%" is technically true and entirely wrong as a summary. Always read the metric stack together; AI personalisation often moves upstream metrics in directions that don't translate downstream.
Stopping tests when the result is favourable. Test runs 14 days, lift is 8%, "ship it." The 8% might be 3% by day 30 and 1% by day 60 as novelty effects fade. Pre-register the test duration and honour it. The Experiment Design skill covers the discipline of finishing tests rather than stopping them when the chart looks good.
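One way to see why finishing matters: simulate A/A tests (no real lift in either arm) and compare a "stop at the first significant day" rule with reading only the pre-registered endpoint. The conversion rate, volumes, and duration below are illustrative assumptions; the pattern, that daily peeking declares far more false winners, is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
BASE_RATE = 0.04     # illustrative conversion rate; no true lift in either arm
DAILY_N = 5_000      # recipients per arm per day
DAYS = 30
SIMS = 2_000

def z_stat(conv_a, conv_b, n):
    """Two-proportion z-statistic for equal-sized arms of n recipients each."""
    p_pool = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
    return (conv_b - conv_a) / n / se

peeking_wins = endpoint_wins = 0
for _ in range(SIMS):
    conv_a = conv_b = n = 0
    flagged = False
    for _day in range(DAYS):
        conv_a += rng.binomial(DAILY_N, BASE_RATE)
        conv_b += rng.binomial(DAILY_N, BASE_RATE)
        n += DAILY_N
        if abs(z_stat(conv_a, conv_b, n)) > 1.96:
            flagged = True   # "ship it" the first day the chart looks significant
    peeking_wins += flagged
    endpoint_wins += abs(z_stat(conv_a, conv_b, n)) > 1.96

print(f"False winners with daily peeking:   {peeking_wins / SIMS:.0%}")
print(f"False winners at the day-{DAYS} readout: {endpoint_wins / SIMS:.0%}")
```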
What to put in the AI personalisation readout
The readout deck for any AI personalisation test should include, at minimum:
The hypothesis. What was being tested, on what audience, against what metric, with what expected magnitude.
The holdout design. Sample size, randomisation method, test duration. If anything was non-standard (cluster randomisation, post-stratification), explain why.
The result against the primary metric. Lift, confidence interval, significance. Honest version, not the most flattering frame. (A minimal sketch of this calculation follows the list.)
Secondary metrics and guardrails. Did unsubscribes increase? Did spam complaints rise? Did the model produce any embarrassing outputs that required intervention? An AI personalisation feature that lifts the primary metric but raises unsubscribe rate is not a winner; it's a fast path to deliverability problems.
The recommendation. Keep, kill, iterate. With reasoning. Not "the model worked" — "the model produced X% lift on the primary metric, no movement on guardrails, recommend expanding to programs Y and Z under the same measurement protocol."
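For the primary-metric line above, a minimal sketch of the lift, confidence interval, and significance computation against the holdout, using a standard two-proportion comparison. The conversion counts in the example call are illustrative placeholders.

```python
import numpy as np
from scipy.stats import norm

def lift_readout(conv_holdout, n_holdout, conv_ai, n_ai, alpha=0.05):
    """Relative lift of the AI arm over the holdout on a conversion metric,
    with a normal-approximation CI on the absolute difference and a
    two-sided p-value."""
    p_h, p_a = conv_holdout / n_holdout, conv_ai / n_ai
    diff = p_a - p_h
    se_diff = np.sqrt(p_h * (1 - p_h) / n_holdout + p_a * (1 - p_a) / n_ai)
    z = norm.ppf(1 - alpha / 2)
    p_pool = (conv_holdout + conv_ai) / (n_holdout + n_ai)
    se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_holdout + 1 / n_ai))
    return {
        "relative_lift": diff / p_h,
        "abs_diff_ci": (diff - z * se_diff, diff + z * se_diff),
        "p_value": 2 * (1 - norm.cdf(abs(diff) / se_pool)),
    }

# Illustrative counts from a 60-day readout with a 15% holdout
print(lift_readout(conv_holdout=1_840, n_holdout=46_000,
                   conv_ai=11_100, n_ai=260_000))
```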
This kind of readout is durable. Six months later, the question "why are we still paying for this feature" gets answered with a document, not a vibe. AI personalisation features without this documentation tend to accumulate as load-bearing infrastructure nobody can justify and nobody wants to be the one to turn off. The audit at year-end is unpleasant for everyone.
The teams that get the most from AI personalisation aren't the most enthusiastic about the technology. They're the most disciplined about measuring it. The two often look the same in a deck and very different in a P&L.
Frequently asked questions
- What metric should I use to measure AI personalisation lift?
- The downstream business metric the program exists for: conversion, revenue per recipient, retention, expansion. Avoid opens (corrupted by Apple MPP) and treat clicks as secondary. If the AI moves upstream metrics but not downstream ones, that's a finding — the model is generating activity, not value. Programs measured on opens will keep buying AI features that don't earn their place.
- How big does the holdout need to be?
- Large enough to detect the realistic lift size with statistical power. For typical AI personalisation lift (3–10% on conversion), a holdout of 10–20% of the eligible audience usually works for programs above 50K monthly recipients. Below 50K, the holdout has to be larger relative to the audience to maintain power, which means slower readouts. The sample size calculator linked from the guide covers the math.
- How long should an AI personalisation test run?
- 30 days minimum for upstream metrics; 60+ days for conversion or revenue. Shorter tests pick up novelty effects (users engaging because the program changed, regardless of AI quality) that fade. The discipline is to declare the duration before the test starts and honour it, even if the early data looks great.
- Are vendor case studies useful at all?
- Yes — as upper bounds on what's possible, and as a guide to which use cases the model has been tuned for. Useless as expected values for your specific program. The customers in the case study are not representative; they're the ones who saw lift. Plan around 1/3 to 1/2 of the case-study lift as a realistic range and let your holdout tell you the actual figure.
- What if my CFO wants the lift number?
- Give the holdout-validated lift on the primary metric, with the confidence interval, and explicitly note what's not included (no Apple MPP-affected metrics, no vendor self-reporting). A CFO trusts a smaller, defensible number more than a larger, fragile one — especially when the next year's budget conversation requires defending the renewal of the AI personalisation contract. The honest readout protects the program.
Related guides
Sample size: the calculation everyone gets wrong in email A/B tests
Most email A/B tests are powered to detect effects far larger than the test could actually produce. The result: false positives and false nulls, with confident conclusions in both directions. Sample size calculation fixes this before you send. Here's the 5-minute version.
Holdout group design: the incrementality tool most lifecycle programs skip
Without a holdout, lifecycle ROI is attribution-model guesswork with a spreadsheet. With one, you get a defensible number you can actually put in front of finance. Here's how to size, run, and read a holdout — and the three mistakes that quietly invalidate the result.
Send-time optimisation: what it really moves, and what it doesn't
Every ESP markets an STO feature and every vendor deck shows lift. The honest version: STO moves open rate 3–8%, rarely revenue, and only for certain program types. Here's when it's worth turning on.
A/B testing in email: sample size, novelty, and what to report
Most email A/B tests produce winners that don't reproduce. Three reasons keep showing up: under-powered samples, the novelty effect, and weak readout discipline. This guide is about designing tests that actually drive decisions instead of theatre.
Price-testing through email: what's testable, what isn't
Email is the fastest place to try a new price, and the easiest place to learn the wrong lesson. What you can test cleanly, what you can't, and the measurement traps that quietly turn price tests into expensive false positives.
False positives in email A/B tests: why half of winning tests don't actually win
Run enough A/B tests and some will show 'significant' lift from pure noise. Programs that ship every significant winner end up with a collection of imaginary improvements they can't tell apart from real ones. Here's how to spot the fakes and avoid the trap.
Use this in Claude
Run this methodology inside your Claude sessions.
Orbit turns every guide on this site into an executable Claude skill — 62 lifecycle methodologies, 84 MCP tools, native Braze integration. Pay what it's worth.