Segment-based testing: when your average lift is hiding opposing effects
You ran an A/B test. Variant B won by 4%, significant at p=0.03. Ship it. But in the post-hoc segment analysis you notice that variant B was a 20% win among new users and a 10% loss among long-tenured users. The 4% overall was a weighted average of opposing effects. Segment-based testing catches this — and ships smarter. Here's the framework.
Justin Williames
Founder, Orbit · 10+ years in lifecycle marketing
Why averages hide opposing effects
An A/B test reports one aggregate result: variant B lifted open rate 4%. The aggregate hides variation across audience segments. Users differ — new vs returning, high-value vs low-value, engaged vs dormant — and a single change often affects different segments differently.
Example: a "limited time" subject line. New users might respond well to urgency; they don't know your cadence yet, the urgency feels real. Long-tenured users might respond worse; they've seen "limited time" many times, the urgency feels manipulative.
Example: a longer vs shorter email. High-intent users might engage more with the longer version (they want detail); browsers might engage less (they want to scan). The aggregate lift is net; the segment effects are opposite.
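To make the arithmetic concrete, here's a minimal sketch with made-up numbers (not from any real test) showing how a large win in one segment and a loss in another can net out to a modest aggregate lift — the only number a single-readout test would report:

```python
# Hypothetical segment-level results: a big win in one segment and a loss in
# another net out to a modest aggregate lift.
segments = [
    # (segment, users, control open rate, variant open rate)
    ("new (<30 days)",       30_000, 0.20, 0.240),   # +20% relative lift
    ("tenured (>12 months)", 20_000, 0.25, 0.225),   # -10% relative loss
]

total   = sum(n for _, n, _, _ in segments)
control = sum(n * c for _, n, c, _ in segments) / total
variant = sum(n * v for _, n, _, v in segments) / total

for name, _, c, v in segments:
    print(f"{name:<22} {(v - c) / c:+.0%}")
print(f"aggregate open rate: {control:.3f} -> {variant:.3f} "
      f"({(variant - control) / control:+.1%})")
```

With these figures the aggregate shows roughly a +6% lift, even though one of the two segments clearly lost.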
The "winner" of an A/B test is usually a winner for some users and a loser for others. Treating it as a universal best-practice ships the loss along with the win. Segment analysis lets you keep the win and skip the loss.
The segments that usually matter
Run the main test across all users. After the test reaches significance, slice the result by the cuts below (a short code sketch for assigning these labels follows the list):
Tenure. New users (<30 days) vs established (>90 days) vs tenured (>12 months). Different audiences behave very differently. This is the most common and most informative split.
Engagement level. Highly-engaged (opened 50%+ of last 10 emails) vs low-engaged. Engaged users respond to different signals than occasional ones.
Purchase history. First-time vs repeat buyers. Different offers and tones work for different phases.
Device / client. Mobile-dominant users vs desktop. Often reveals that "winning" variants only win on one of the two.
Acquisition channel. Paid social vs organic vs referral. Channel often correlates with user profile and response patterns.
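As a rough illustration, here's how the tenure and engagement labels above might be assigned in code. The thresholds and function names are assumptions, not a prescribed schema — swap in your own definitions:

```python
from datetime import date
from typing import Optional

def tenure_segment(signup_date: date, today: Optional[date] = None) -> str:
    """Bucket a user by account age. Thresholds are illustrative; use your own."""
    today = today or date.today()
    days = (today - signup_date).days
    if days < 30:
        return "new"
    if days < 365:
        return "established"
    return "tenured"

def engagement_segment(opens_in_last_10: int) -> str:
    """'Engaged' here means opened at least 5 of the last 10 emails."""
    return "engaged" if opens_in_last_10 >= 5 else "low-engaged"

# Example: a user who signed up ~6 months ago and opened 7 of the last 10 emails
print(tenure_segment(date(2024, 1, 15), today=date(2024, 7, 15)),
      engagement_segment(7))
```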
The analysis steps
Once the test has reached its pre-committed sample size:
1. Check the aggregate result (winner, lift, significance). Standard step.
2. For each meaningful segment, check the result restricted to that segment: what is the lift, is it significant, and is the effect size similar to the aggregate or different? (A sketch of this per-segment check follows the list.)
3. Identify any segments where the effect size is meaningfully different from the aggregate. A segment with +20% lift when the aggregate is +5% is interesting. A segment with -10% when the aggregate is +5% is extremely interesting.
4. For cross-cutting opposing effects, consider: is the aggregate winner really the right answer for the whole list, or should different segments get different treatments?
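Here's a minimal sketch of the per-segment check from step 2, assuming you can pull per-segment send and open counts for each arm. The counts are invented, and it uses the two-proportion z-test from statsmodels:

```python
from statsmodels.stats.proportion import proportions_ztest

# Invented per-segment counts: (control opens, control sent, variant opens, variant sent)
results = {
    "new":     (1_200, 6_000, 1_470, 6_100),
    "tenured": (2_400, 8_000, 2_180, 7_900),
}

for segment, (c_open, c_sent, v_open, v_sent) in results.items():
    # Relative lift of the variant's open rate over the control's
    lift = (v_open / v_sent) / (c_open / c_sent) - 1
    stat, p = proportions_ztest([v_open, c_open], [v_sent, c_sent])
    print(f"{segment:<8} lift {lift:+.1%}  p = {p:.3f}")
```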
Acting on segment findings
When the segment effect agrees with the aggregate: the winner is a universal winner. Ship it to all segments.
When segments differ meaningfully: ship the winner to the segments that benefit; keep the control (or test further) for segments that lost. This is called a "targeted ship" — different treatments for different audiences based on what actually works for each.
When the segment with the biggest loss is also the most valuable: consider whether the aggregate "winner" is actually the wrong direction for the program. A 4% aggregate lift that comes with a 10% loss among high-LTV users might be net-negative on revenue. Re-run the analysis weighted by user value.
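A back-of-envelope version of that value-weighted re-run might look like the sketch below. The figures are invented, and it assumes revenue scales roughly with each segment's observed conversion lift:

```python
# Invented figures: a win in a low-value segment plus a loss in a high-value
# segment can be net-negative on revenue despite a positive aggregate lift.
segments = [
    # (segment, users, monthly revenue per user, relative conversion lift)
    ("new",      30_000,  2.0, +0.20),
    ("high-LTV", 10_000, 25.0, -0.10),
]

delta = sum(n * rev * lift for _, n, rev, lift in segments)
base  = sum(n * rev for _, n, rev, _ in segments)
print(f"net revenue impact of shipping the 'winner' everywhere: {delta / base:+.1%}")
```

With these numbers the "winner" is roughly 4% revenue-negative, driven entirely by the high-LTV loss.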
The VIP lifecycle guide covers why high-value segments often need different treatment than average-segment-optimised programs.
Common patterns
Some segment-level patterns repeat across many programs:
New users love urgency; tenured users are tired of it. Urgency tactics that lift new-user conversion often reduce tenured-user engagement. Targeted ship: urgency for new, restraint for tenured.
Engaged users tolerate more frequency; dormant users respond worse. More emails lift revenue from the engaged segment; they accelerate unsubscribes from the dormant segment. Targeted ship: higher cadence for engaged, lower cadence (or pause) for dormant.
Mobile users want shorter; desktop users tolerate longer. Long-form emails test well on desktop, worse on mobile. Targeted ships are rarely split by device; instead, design mobile-first (see the mobile design guide) and make sure desktop also works.
Repeat buyers respond to personalisation; first-time buyers respond to social proof. A repeat buyer wants to feel known; a first-time buyer wants reassurance that others bought successfully.
Treat segment analysis as a standard post-test step, not an optional extra. Every significant test should have at least one segment-level slice reviewed before shipping — the cases where segments behave differently than the aggregate are some of the highest-value findings a program produces.
Frequently asked questions
- How many segments should I slice by?
- Post-test, 3–5 meaningful segment cuts (tenure, engagement, device, acquisition source). Pre-test, don't over-segment: the test needs to be powered for the aggregate, and segment cuts are post-hoc analysis. If a specific segment hypothesis is important, run a segment-specific test pre-registered to have adequate power.
- Should I always slice by segment?
- For any test that reaches aggregate significance, yes — the segment slice takes minutes and often finds the hidden story. For null-result tests it's less urgent: a test that shows no aggregate effect usually shows no segment effects either, with the occasional exception where a real effect in one segment is hidden by an offsetting effect in another.
- What's the difference between segment-based testing and personalisation?
- Overlapping. Segment-based testing is an analysis method (slice results by segment). Personalisation is a treatment method (ship different content to different segments). Segment-based analysis tells you where personalisation would be valuable; personalisation is how you act on the insight. Both work together.
- How do I avoid false positives in segment analysis?
- Three defences: (1) use tighter significance thresholds for segment-level findings (p=0.01, or a Bonferroni-corrected threshold; see the sketch after these FAQs); (2) require the segment effect to be meaningfully larger than the aggregate, not just significant but also material; (3) validate unexpected segment findings with a follow-up test specifically designed for that segment.
- Can I segment-test on a small list?
- With care. Segment slicing divides an already-powered test into smaller pieces, which reduces power for each segment. On small lists, segment analysis is largely descriptive (directional) rather than statistically conclusive. Use it to generate hypotheses for future tests; don't make ship/no-ship decisions on underpowered segment slices.
- Should I ship a winner even if one segment lost?
- Depends on the segment. If the losing segment is small, low-value, and the loss is modest, shipping the aggregate winner is fine. If the losing segment is large or high-value, or the loss is substantial, consider a targeted ship: winner to the benefitting segments, control for the losing one. The worst option is to ship the 'winner' universally and ignore the segment-level cost.
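Tying together the false-positive defences above, here's a small sketch (with invented p-values) of applying a Bonferroni correction across several post-hoc segment slices of one test, using statsmodels:

```python
from statsmodels.stats.multitest import multipletests

# Invented raw p-values from five post-hoc segment slices of the same test.
segment_p = {
    "new": 0.004, "established": 0.21, "tenured": 0.012,
    "engaged": 0.48, "mobile": 0.03,
}

reject, p_adj, _, _ = multipletests(list(segment_p.values()),
                                    alpha=0.05, method="bonferroni")
for (seg, raw), keep, adj in zip(segment_p.items(), reject, p_adj):
    print(f"{seg:<12} raw p = {raw:.3f}  adjusted p = {adj:.3f}  significant: {keep}")
```

After correction, only the new-user slice survives — the other "significant" slices are the kind of finding that deserves a follow-up test before anyone acts on it.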
Related guides
False positives in email A/B tests: why half of winning tests don't actually win
Run enough A/B tests and some will show 'significant' lift from random noise. Programs that ship every significant winner end up with a collection of imaginary improvements. Here's how to tell real lift from noise and avoid the false-positive trap.
Incrementality testing: the measurement that tells you if a program actually works
Last-click attribution makes lifecycle programs look bigger than they are. Incrementality tests strip out the effect of users who would have converted anyway and reveal the real lift. Here's how to design one that produces a defensible number.
Sample size: the calculation everyone gets wrong in email A/B tests
Most email A/B tests are powered to detect effects far larger than they can actually produce. Here's the sample size calculation that tells you whether your test will find what you're looking for — before you run it.
Send-time optimisation: what it really moves, and what it doesn't
Every ESP now markets a send-time optimisation feature. They all show flattering internal case studies. The honest version: STO moves open rate 3–8%, not revenue, and only works for certain program types. Here's when it's worth turning on.
Price-testing through email: what's testable, what isn't
Email is often the first place teams try to price-test, and it's often where the wrong lesson gets learned. This guide covers what can genuinely be tested in email, what can't, and the measurement traps that make most email price tests unreliable.
A/B testing in email: sample size, novelty, and what to report
Most email A/B tests produce winners that don't reproduce. This guide covers the three reasons — under-powered samples, the novelty effect, and weak readout discipline — and how to design tests that actually drive decisions.