The Incrementality Formulas Reference
Every incrementality formula in one place: iROAS, lift percentage, confidence intervals, MDE, sample size, with worked examples and the math that decides whether a test is real.

Incrementality asks one question: what would have happened without this spend? That is the counterfactual: the version of the world where the campaign never ran. Every formula in this reference exists to estimate that gap. Most are two-proportion tests in marketing clothing. They matter because the parties grading the homework (the ad platforms) have a structural incentive to overstate impact, and because the alternative to a real counterfactual is a flattering story about credit assignment on observed data. The full taxonomy of measurement approaches lives in MMM vs MTA vs incrementality; this page is the math underneath the incrementality column.
QRY's stance: platform attribution is bookkeeping by the entity being graded. Meta Conversion Lift, Google Brand Lift, TikTok Conversion Lift are vendor-run, often under-powered, and frequently built on contaminated baselines. The honest reads come from designs the platforms cannot author or bias: geo holdouts, matched markets, marketing mix modeling (MMM) reconciliation, post-purchase surveys, and warehouse-joined outcomes. The math below is what makes those designs decidable rather than rhetorical.
Five formulas carry almost every paid-media incrementality readout: lift percentage, incremental return on ad spend (iROAS), confidence intervals, minimum detectable effect (MDE), and sample size. Each one answers a different question, and the four after lift percentage exist to keep the first one honest. Use this page as a lookup. The worked examples share one running illustration so the numbers compose: a Meta geo-lift across ten designated market areas (DMAs, the regional markets Nielsen uses) versus ten matched controls.
Lift percentage is the headline effect size, and the per-capita version is the only one that survives review
Lift percentage is how much more the treated group sold than the control predicts, expressed as a fraction of the control's own demand rate. The version that survives review uses per-capita revenue (revenue divided by population, so dollars per person) rather than raw totals, so groups with different population sizes are genuinely comparable.
Lift % = (Rev_t / pop_t - Rev_c / pop_c) / (Rev_c / pop_c)
Where Rev_t is total revenue in treatment geos over the test window, Rev_c is total revenue in control geos, and pop_t and pop_c are the matched populations. The numerator is the per-capita gap; the denominator anchors that gap as a percentage of the control's own demand rate.
Why per-capita matters: if treatment DMAs hold 22 million people and control DMAs hold 18 million, the raw revenue gap will be larger in treatment even with zero lift, simply because more people live there. Normalizing to revenue per person makes the contrast about buying behavior, not about population size.
Worked example. A DTC brand spends $40k per week on Meta and runs an 8-week geo-lift. Ten treatment DMAs hold roughly 20 million people, ten matched controls the same. At the end of the window, control DMAs have generated $4.0M in revenue ($0.20 per person) and treatment DMAs have generated $4.32M ($0.216 per person). Plug in:
Lift % = (0.216 - 0.20) / 0.20 = 0.08 = 8%
Eight percent is a healthy paid-media lift. The honest range for most established paid channels at scale is low single digits to high single digits; lifts in the double digits are unusual and worth re-checking the design before celebrating. A 25% lift on an 8-week Meta geo-lift is more often a sign of spillover contamination, a mismatched pair, or a confounder during the window than a genuinely transformative channel. Low single digits is real and common; large lifts deserve a second look before they move the budget.
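The per-capita lift calculation can be sketched in a few lines of Python. This is a minimal illustration, not a production readout: the function names are hypothetical, the first call uses the running example's figures, and the second call is an invented zero-lift case built from the 22 million versus 18 million populations mentioned above, with both groups assumed to spend $0.20 per person.

```python
def per_capita_lift(rev_t, pop_t, rev_c, pop_c):
    """Relative lift of treatment over control, measured on revenue per person."""
    if pop_t <= 0 or pop_c <= 0:
        raise ValueError("populations must be positive")
    per_capita_t = rev_t / pop_t
    per_capita_c = rev_c / pop_c
    if per_capita_c == 0:
        raise ValueError("control per-capita revenue must be non-zero")
    return (per_capita_t - per_capita_c) / per_capita_c


def raw_revenue_gap(rev_t, rev_c):
    """Naive comparison on revenue totals, ignoring population size."""
    return (rev_t - rev_c) / rev_c


# Running example from the text: 20M people per arm, $4.32M vs $4.0M.
example_lift = per_capita_lift(4_320_000, 20_000_000, 4_000_000, 20_000_000)

# Hypothetical zero-lift case: 22M vs 18M people, both at $0.20 per person.
zero_lift = per_capita_lift(4_400_000, 22_000_000, 3_600_000, 18_000_000)
naive_gap = raw_revenue_gap(4_400_000, 3_600_000)

print(f"running example lift: {example_lift:.1%}")
print(f"zero-lift case, per-capita: {zero_lift:.1%}")
print(f"zero-lift case, raw totals: {naive_gap:.1%}")
```

In the zero-lift case the per-capita version reads 0% while the raw-total comparison shows a gap of about 22%, which is the population artifact the per-capita normalization removes.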
iROAS tells you whether incremental revenue covers the spend, not just whether the effect exists
Lift percentage tells you the size of the effect. Incremental ROAS (iROAS), which is the ROAS calculated only on revenue the campaign actually caused, tells you whether that effect pays for itself at the spend level you tested. Both numbers belong on the readout. A 30% lift at a 0.3 iROAS is a channel that produced real incremental sales and still lost money on the spend. An 8% lift at a 2.0 iROAS is a channel doing modest but genuinely profitable work.
iROAS = incremental revenue / treatment spend
Incremental revenue is the dollar version of the per-capita gap from the lift formula, scaled across the treatment population:
incremental revenue = (Rev_t / pop_t - Rev_c / pop_c) * pop_t
Worked example, continued. Using the same Meta geo-lift: the per-capita gap is $0.216 - $0.20 = $0.016. The treatment population is 20 million. Incremental revenue is $0.016 * 20,000,000 = $320,000. Treatment spend over the 8-week window was $40k * 8 = $320,000. So:
iROAS = $320,000 / $320,000 = 1.0
A 1.0 iROAS means the channel returned a dollar of incremental revenue per dollar of spend. Meta's own dashboard for the same window probably reports something in the 2.5 to 3.5 range, because platform attribution is counting conversions that would have happened without the spend. The gap between platform ROAS and iROAS is the size of the bookkeeping problem on that channel.
For budget calls, the version that actually matters is contribution-margin iROAS: replace revenue with contribution profit (revenue minus cost of goods sold, payment fees, shipping, and any other variable costs that move with a sale). A 1.0 revenue iROAS on a product with a 40% contribution margin is a 0.4 contribution iROAS: the channel recovers 40 cents of contribution profit for every dollar of spend and loses the other 60. Reporting revenue iROAS without contribution iROAS is how teams confidently scale unprofitable channels.
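The revenue and contribution versions of iROAS can be computed side by side. The sketch below reuses the running example's figures and the 40% contribution margin from the paragraph above; the function and variable names are illustrative assumptions, and a real readout would take contribution profit from the warehouse, not from a flat margin.

```python
def incremental_revenue(rev_t, pop_t, rev_c, pop_c):
    """Per-capita revenue gap scaled across the treatment population."""
    return (rev_t / pop_t - rev_c / pop_c) * pop_t


def iroas(incremental, spend):
    """Incremental return per dollar of treatment spend."""
    if spend <= 0:
        raise ValueError("spend must be positive")
    return incremental / spend


# Running example: $4.32M treatment vs $4.0M control, 20M people per arm.
inc_rev = incremental_revenue(4_320_000, 20_000_000, 4_000_000, 20_000_000)
spend = 40_000 * 8  # $40k per week for 8 weeks

revenue_iroas = iroas(inc_rev, spend)

# Assumed flat 40% contribution margin on every incremental dollar of revenue.
contribution_margin = 0.40
contribution_iroas = iroas(inc_rev * contribution_margin, spend)

print(f"incremental revenue: ${inc_rev:,.0f}")
print(f"revenue iROAS: {revenue_iroas:.2f}")
print(f"contribution iROAS: {contribution_iroas:.2f}")
```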
If the confidence interval crosses zero, the test cannot distinguish the channel from no effect
A confidence interval is the range of values consistent with the data at a chosen confidence level (95% is the default in paid-media testing). The point estimate is what the test came back with; the interval is the range of true effects that could plausibly have produced that estimate given the noise in the measurement. If that range crosses zero, you do not have a winner; you have a test that could not distinguish the channel from no effect at the precision it ran at.
The decisive check on any geo-lift readout is simple: does the interval cross zero? Reporting a positive lift while burying an interval that crosses zero is the most common form of dishonest measurement in this field. Insisting on the interval is the cheapest discipline available.
The interval on a two-group comparison takes the standard form:
CI = point estimate +/- z * SE
Where z is the critical value (1.96 for a 95% two-sided interval) and SE is the standard error of the lift estimate. In practice the SE is computed from the variance of per-capita revenue across geos in each group; geo-lift tools like GeoLift or CausalImpact estimate the counterfactual with a synthetic-control or Bayesian structural time-series model and report the interval directly.
Intervals tighten with more geos and longer windows because both reduce the standard error. Roughly, doubling the number of geo-weeks in each arm cuts the interval width by a factor of the square root of two. That is why a 12-geo, 4-week test almost never reads out at low single-digit lifts: the design's interval is wider than the effect it is trying to measure, and the result returns null even when a real effect exists. Width of the interval is a design choice; pre-compute it during planning and you will never be surprised by an inconclusive readout.
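The zero-crossing check and the square-root-of-two narrowing can both be seen in a small sketch. Several things here are assumptions rather than anything a geo-lift tool does: the standard error of the difference is taken as sigma * sqrt(2 / n) for n independent geo-weeks per arm, the control baseline is treated as a fixed number, and the inputs are illustrative (a $0.025 weekly per-capita baseline, a $0.005 standard deviation, and an observed weekly gap of $0.002, which is an 8% lift). The 40 geo-week design is hypothetical.

```python
import math

Z_95 = 1.96  # critical value for a 95% two-sided interval


def se_of_difference(sigma, n_per_arm):
    """SE of the gap between two arm means, assuming independent geo-weeks."""
    return sigma * math.sqrt(2.0 / n_per_arm)


def lift_interval(diff, se, baseline, z=Z_95):
    """(low, high) bounds on relative lift; the baseline is treated as fixed."""
    point = diff / baseline
    half_width = z * se / baseline
    return point - half_width, point + half_width


def crosses_zero(interval):
    low, high = interval
    return low <= 0.0 <= high


# Illustrative weekly per-capita inputs (assumptions, not test results).
baseline, diff, sigma = 0.025, 0.002, 0.005

short_design = lift_interval(diff, se_of_difference(sigma, 40), baseline)
doubled_design = lift_interval(diff, se_of_difference(sigma, 80), baseline)

for label, ci in (("40 geo-weeks", short_design), ("80 geo-weeks", doubled_design)):
    verdict = "crosses zero" if crosses_zero(ci) else "excludes zero"
    print(f"{label}: {ci[0]:+.1%} to {ci[1]:+.1%} ({verdict})")
```

Under these assumptions the same 8% point estimate is inconclusive at 40 geo-weeks per arm and excludes zero at 80, and the interval width shrinks by a factor of the square root of two between the two designs.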
MDE is a property of the design, not the channel, and it must be computed before any spend moves
The minimum detectable effect (MDE) is the smallest lift the test design can reliably detect at a chosen significance and power level. A test designed to detect a 10% lift cannot diagnose a real 3% effect; the result will come back null and the team will wrongly conclude the channel does not work. MDE is not a limitation of the channel; it is a limitation of the design, and it is fully within the team's control before launch.
The decision rule is the only one that matters in planning: if the smallest lift the business cares about is below the MDE, change the design before launch. More geos, longer runtime, or larger treatment spend; pick one or combine them, but do not run the test as planned. A test that cannot detect the lift the business cares about is wasted spend dressed as measurement.
The practical MDE formula for a paid-media geo-lift, in the form most useful for planning:
MDE = (z_alpha + z_beta) * SE / baseline
Where z_alpha is the critical value for the chosen significance level (1.96 for a 95% two-sided test), z_beta is the critical value for the chosen power level (0.84 for 80% power, 1.28 for 90% power), SE is the standard error of the per-capita revenue difference between groups, and baseline is the control's per-capita revenue. MDE is expressed as a percentage lift; if it comes out to 0.05, the test can detect a 5% lift and nothing smaller.
Worked example. A DTC brand wants to detect a 5% lift on Meta at 80% power with a 95% two-sided test. Baseline per-capita weekly revenue across candidate DMAs is $0.025, with a standard deviation across DMAs of $0.005 (a coefficient of variation of 0.20, which is typical for established categories). For a test with 10 treatment DMAs and 10 controls over 8 weeks, each arm has 80 geo-weeks, and the standard error of the per-capita revenue difference works out to roughly $0.00079. That is the across-DMA standard deviation times the square root of 2/80; the 2 is there because the difference between two arm means carries the sampling variance of both arms. Plug in:
MDE = (1.96 + 0.84) * 0.00079 / 0.025 ~= 0.088
An 8.8% MDE. The design as drawn cannot reliably detect the 5% lift the business cares about; the smallest lift it can detect is closer to 9%. The decision is forced: add geos, extend the window, or accept that the test will only reliably detect larger effects. Doubling the geo-week count (20 + 20 DMAs at 8 weeks, or 10 + 10 at 16 weeks) brings MDE down by roughly a factor of the square root of two, to about 6.2%, which is closer to but still above the 5% target. Tripling it gets the design into the 5% range. This is the kind of math that belongs in the planning doc, not the post-mortem.
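The planning calculation is easy to script so that candidate designs can be compared before launch. This sketch assumes the standard error of the difference is sigma * sqrt(2 / n) for n independent geo-weeks per arm, which reproduces the $0.00079 figure used above; the function name and the list of candidate designs are illustrative.

```python
import math


def mde(sigma, n_per_arm, baseline, z_alpha=1.96, z_beta=0.84):
    """Smallest relative lift detectable at the given significance and power."""
    se = sigma * math.sqrt(2.0 / n_per_arm)  # SE of the gap between arm means
    return (z_alpha + z_beta) * se / baseline


sigma = 0.005     # across-DMA std dev of weekly per-capita revenue
baseline = 0.025  # control weekly per-capita revenue

# Geo-weeks per arm: the design as drawn, then doubled, then tripled.
candidates = {"10 DMAs x 8 weeks": 80, "doubled": 160, "tripled": 240}
results = {label: mde(sigma, n, baseline) for label, n in candidates.items()}

for label, value in results.items():
    print(f"{label}: MDE {value:.1%}")
```

With the unrounded standard error the three designs come out near 8.9%, 6.3%, and 5.1%; the small differences from the figures in the text come from rounding the standard error to $0.00079 before plugging in.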
Sample size and MDE are the same equation rearranged; one tells you what your design can do, the other tells you what design you need
MDE asks: given this design, what is the smallest effect we can detect? Sample size flips the question: given the effect we need to detect, how big does the design have to be? The two formulas are algebraically the same equation rearranged.
n = 2 * ((z_alpha + z_beta)^2 * sigma^2) / delta^2
Where n is the number of units per group, z_alpha and z_beta are the significance and power critical values, sigma is the standard deviation of the outcome metric across units, and delta is the absolute effect size you want to detect (the lift percentage times the baseline). The factor of 2 is there because the comparison is a difference between two arm means, each carrying its own sampling variance. It is the same factor that sits inside the standard error of the difference, SE = sigma * sqrt(2 / n), which is what yields the $0.00079 in the MDE worked example (0.005 * sqrt(2 / 80)).
The geo-lift translation: the unit is the geo-week (or geo-month for longer-cycle categories), and sigma is computed from the variance of per-capita revenue across geo-weeks in the pre-test period. Sample size becomes a budget conversation between geos and weeks. Twenty geos for four weeks gives the same geo-week count as ten geos for eight weeks, but the two designs have different exposure to confounders and different operational costs.
Worked example. Same brand, same goal: detect a 5% lift at 80% power. Baseline per-capita weekly revenue is $0.025, so the absolute effect size delta is $0.025 * 0.05 = $0.00125. Across-DMA standard deviation of per-capita weekly revenue is $0.005. Plug in:
n = 2 * ((1.96 + 0.84)^2 * 0.005^2) / 0.00125^2 ~= 251 geo-weeks per arm
251 geo-weeks per arm. That translates to roughly 32 DMAs for 8 weeks per arm, or 10 DMAs for about 26 weeks, or any equivalent product. The running illustration in this reference (10 + 10 DMAs for 8 weeks) lands at 80 geo-weeks per arm, about a third of that requirement, and so is intentionally sized to detect lifts in the 8-9% range, not 5%. That is what the MDE calculation surfaced above. The two formulas agree because they are the same equation; one tells you what your current design can do, the other tells you what design you need.
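The rearrangement can be checked numerically by solving for the design and then feeding the answer back into the MDE formula. This sketch assumes the two-arm form of the equation, in which the standard error of the difference is sigma * sqrt(2 / n) (the convention that produces the $0.00079 standard error in the MDE example); a one-sample form without the factor of 2 would return half as many geo-weeks. Function names are illustrative, and geo-weeks are treated as independent.

```python
import math


def geo_weeks_per_arm(sigma, baseline, target_lift, z_alpha=1.96, z_beta=0.84):
    """Geo-weeks needed in each arm to detect target_lift (two-arm comparison)."""
    delta = baseline * target_lift  # absolute effect in dollars per person
    n = 2.0 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)


def mde(sigma, n_per_arm, baseline, z_alpha=1.96, z_beta=0.84):
    """Inverse of geo_weeks_per_arm: detectable lift for a given design."""
    se = sigma * math.sqrt(2.0 / n_per_arm)
    return (z_alpha + z_beta) * se / baseline


required = geo_weeks_per_arm(sigma=0.005, baseline=0.025, target_lift=0.05)

# Two ways to spend the same geo-week budget.
dmas_for_8_weeks = math.ceil(required / 8)
weeks_for_10_dmas = math.ceil(required / 10)

# Round trip: the required design should detect the target lift, just barely.
round_trip = mde(0.005, required, 0.025)

print(f"geo-weeks per arm: {required}")
print(f"DMAs per arm at 8 weeks: {dmas_for_8_weeks}")
print(f"weeks at 10 DMAs per arm: {weeks_for_10_dmas}")
print(f"MDE at that size: {round_trip:.2%}")
```

Under these assumptions a 5% target needs about 251 geo-weeks per arm, roughly three times the 80 geo-weeks in the running illustration, which matches the MDE section's observation that tripling the design reaches the 5% range.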
Two assumptions to flag. First, this formulation assumes independence across geo-weeks, which is an approximation; in practice there is mild autocorrelation within a geo across time, and most geo-lift tools adjust for it via robust standard errors. The correction usually increases the required sample size by 10-20%. Second, sigma here is the across-unit standard deviation of the outcome metric in the pre-test period; using a different baseline (a noisier one, a shorter one, a seasonal one that does not match the test window) will produce a sample-size estimate that does not survive launch.
The math is a tool, not the answer. The discipline is choosing the right effect to detect, then designing the test to detect it before any spend moves. A geo-lift with a 9% MDE that returns a null result is not evidence the channel does not work; it is evidence the design could not have seen the effect even if it was there. The formulas above exist so that conversation happens in the planning doc, not in the post-mortem after the budget has already been moved on a misread.
For the design-and-pitfalls treatment that sits around these formulas (matched markets, spend floors, spillover, contamination, the operational discipline that separates a finding from a coin flip), see geo-lift testing explained. The two pages are intended to be read together: this one is the math, the other one is the practice.

Founder & CEO
Samir Balwani is the founder and CEO of QRY, a full-funnel paid media agency he started in 2017. He has 15+ years of advertising experience and previously led brand strategy and digital innovation at American Express. He writes on paid media strategy, measurement, and how agencies should operate.


