Geo-Lift Testing Explained: How to Measure What Platforms Can't
Geo-lift testing is the most honest read on whether paid media actually grew the business. What it answers, how a test works, when to use it, and how to design one that holds up.

Platforms grade their own homework. Meta says Meta worked, Google says Google worked, retail media says retail media worked, and the team adds the numbers up and is asked to trust the total. Geo-lift answers a different question. Not “who got credit,” but “what would have happened without the spend.” That is the counterfactual: the version of the world where the campaign never ran.
Geo-lift is a controlled experiment. You run the campaign in some markets, hold it back in others, and the gap between the two groups is the lift. The comparison is built into the test design: you measure what would have happened without the spend rather than estimating it after the fact. That is the only reason the answer is worth more than the platform’s own grade.
The counterfactual is built into the design, not modeled after the fact
The counterfactual question is the only one that matters for paid media measurement: what would the business have done without this spend? Every other measurement output is a story about credit assignment after the fact. Geo-lift constructs the counterfactual directly. Treatment markets see the campaign at full weight, matched controls see nothing from that channel, and the gap is causal because spend is the only thing that systematically differs between the two groups.
Platform attribution and multi-touch attribution (MTA) do not answer this question. They assign credit on observed paths: they see the people who converted, walk back through the touches, and divide credit across them. They cannot see the counterfactual because they only see the exposed side. The output is correlational, not causal, no matter how many touches the model considers.
Marketing mix modeling (MMM) gets closer. It fits an aggregate statistical model to weekly spend and revenue and estimates each channel’s contribution from the variance in the data. Useful for long-run allocation, but it models the counterfactual rather than constructing it. The result is an estimate, not a measurement.
The full taxonomy of these three approaches lives in MMM vs MTA vs incrementality.
Treatment geos spend, control geos go dark, the gap is the lift
The mechanic is simple. Pick a set of treatment geos, usually designated market areas (DMAs, the regional TV markets Nielsen uses) and sometimes broader regions or postal areas depending on the channel. Pick a matched set of control geos. Run the campaign at full weight in treatment, pull the spend entirely from controls. Six to eight weeks is the standard window. Longer for high-consideration categories where the purchase cycle stretches past the exposure window.
At the end of the test, two numbers do the work: the lift percentage and the incremental return on ad spend (iROAS), which is the ROAS calculated only on the revenue the campaign actually caused.
Lift % = (Rev_t / pop_t - Rev_c / pop_c) / (Rev_c / pop_c)
iROAS = (incremental revenue) / (treatment spend)
Revenue is calculated per capita (revenue divided by population, so dollars per person) to make the two groups comparable when their populations differ. Incremental revenue is the per-person gap scaled up across the treatment population, (Rev_t / pop_t - Rev_c / pop_c) × pop_t, which is the same as treatment revenue minus what treatment would have earned at the control’s per-person rate.
Example. A DTC brand spending $40k a week on Meta runs the test in 10 treatment DMAs and 10 matched controls for 8 weeks. Treatment DMAs hold about 20 million people in total, control DMAs the same. Treatment spend over the window is $320k. At the end of the 8 weeks the control DMAs have generated $4.0M in revenue, which works out to $0.20 per person. Treatment DMAs have generated $4.32M, or $0.216 per person.
Plug those into the formula:
- Lift % = (0.216 − 0.20) / 0.20 = 8%
- Incremental revenue across the treatment population = ($0.216 − $0.20) × 20M people = $320k
- iROAS = $320k / $320k = 1.0
That is the honest read on the channel. Meta’s own dashboard for the same window probably reports a number two to three times higher because it is counting conversions that would have happened anyway.
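The arithmetic in the example can be reproduced in a few lines. This is a minimal sketch using the article's own numbers; the function name `geo_lift_readout` and its argument names are illustrative, not part of any tool.

```python
def geo_lift_readout(rev_t, pop_t, rev_c, pop_c, spend_t):
    """Return (lift, incremental_revenue, iroas) from end-of-test totals."""
    rate_t = rev_t / pop_t  # treatment revenue per person
    rate_c = rev_c / pop_c  # control revenue per person
    lift = (rate_t - rate_c) / rate_c
    # Treatment revenue minus what treatment would have earned at the control rate.
    incremental_revenue = rev_t - rate_c * pop_t
    iroas = incremental_revenue / spend_t
    return lift, incremental_revenue, iroas


# Figures from the worked example: 20M people per side, $320k treatment spend.
lift, inc_rev, iroas = geo_lift_readout(
    rev_t=4_320_000, pop_t=20_000_000,
    rev_c=4_000_000, pop_c=20_000_000,
    spend_t=320_000,
)
print(f"lift={lift:.1%}  incremental=${inc_rev:,.0f}  iROAS={iroas:.2f}")
```

Working per capita inside the function means the same code still holds when the treatment and control populations are not equal.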
A real test also reports a confidence interval on both numbers; that is the range of values consistent with the data, and the practical rule is simple: if that range includes zero, you cannot tell the channel apart from no effect at all. Whether the interval crosses zero is the difference between a finding and a coin flip. The underlying math is in the incrementality formulas reference.
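One common way to get an interval on the lift is a bootstrap over matched pairs: resample the pairs with replacement, recompute the lift each time, and read off the percentiles. The sketch below is one such approach, not necessarily the method in the formulas reference. The per-pair rates are invented for illustration (they average to the $0.216 and $0.20 from the example), every pair is weighted equally, and with only ten pairs the interval is rough.

```python
import random


def bootstrap_lift_ci(pairs, n_boot=2000, alpha=0.05, seed=0):
    """pairs: list of (treatment_rate, control_rate) per matched pair,
    each a revenue-per-person figure. Returns (point, lower, upper)."""
    rng = random.Random(seed)  # fixed seed so the readout is reproducible

    def pooled_lift(sample):
        t = sum(p[0] for p in sample)
        c = sum(p[1] for p in sample)
        return (t - c) / c

    point = pooled_lift(pairs)
    # Resample whole pairs so each treatment geo stays with its matched control.
    draws = sorted(
        pooled_lift([rng.choice(pairs) for _ in pairs]) for _ in range(n_boot)
    )
    lower = draws[int((alpha / 2) * n_boot)]
    upper = draws[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lower, upper


# Hypothetical per-capita revenue for 10 matched pairs (treatment, control).
pairs = [
    (0.221, 0.204), (0.209, 0.197), (0.230, 0.211), (0.214, 0.199),
    (0.218, 0.203), (0.207, 0.195), (0.225, 0.206), (0.212, 0.198),
    (0.219, 0.201), (0.205, 0.186),
]
point, lower, upper = bootstrap_lift_ci(pairs)
print(f"lift={point:.1%}, 95% interval=({lower:.1%}, {upper:.1%})")
```

If `lower` comes back at or below zero, the data cannot separate the channel from no effect at the precision the test ran at.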
Platform numbers on major channels are systematically optimistic
Three cases make geo-lift the right tool.
Validating a major channel where the platform’s own attribution carries too much weight in the budget conversation. Connected TV (CTV), retail media, the top one or two line items inside Meta or Google. Anything where a wrong read costs six or seven figures a year. Platform numbers on these channels are systematically optimistic; geo-lift is the cheapest way to find out by how much.
Calibrating MMM. An MMM tells a coherent story about every channel, but the story is only as good as the assumptions and the modeling choices behind it. A geo-lift result on a single channel is the anchor that turns MMM contribution from a narrative into a measurement; calibration here just means using the test result to lock that one channel in place so the rest of the model has to fit around it. Re-fit the MMM with the geo-lift as a constraint and the rest of the model gets more honest with it.
Settling cross-channel allocation arguments. When finance and marketing disagree about whether a channel deserves its current budget, the right answer is rarely louder reporting. It is a geo-lift on the contested channel.
The wrong-tool cases are equally clear.
Single-DMA brands or brands without enough matched market structure cannot run a credible geo-lift. The design needs pairs that actually pair.
Very low-spend channels are out too. Every test has a minimum detectable effect (MDE): the smallest lift the design can pick out of the noise. If the lift the channel can plausibly produce sits below the MDE, the test would need an unrealistic number of geos or an unrealistic runtime to read out.
Fast creative iteration is the third miss. When you need a read in two weeks, conversion lift and ghost ads run faster and tighter for that job. Both are platform-run experiments (the platform holds out a control group inside its own audience and then grades the result itself), which is why they are quicker but less independent than geo-lift.
Test quality is determined before launch, not during analysis
Matched markets. Pair on population, baseline demand level, seasonality pattern, and category share. A pair that looks matched on population but diverges on category share will produce noise that swamps the lift you are trying to read. Reject mismatched pairs early. Do not paper over a bad pair with statistical adjustment after the fact; the test was already weaker than the math suggests by then. The single best predictor of a clean readout is honest matching before launch.
Power and runway. Enough geos plus enough weeks in market to detect the lift you actually care about, not the lift you wish existed. A test sized to read a 20 percent effect cannot diagnose a real 5 percent effect; it will return a null and the team will conclude the channel does not work. Decide the smallest business-meaningful lift first, then size the test backward from there. The formulas live in the reference.
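Sizing backward from the smallest meaningful lift can be sketched with the textbook normal approximation for comparing two group means. This is a simplified illustration, not the formula set in the reference: it treats each geo as one observation, ignores the gains from matching and pre-period adjustment, and the baseline and standard deviation below are assumed values.

```python
from math import ceil, sqrt
from statistics import NormalDist


def relative_mde(baseline_mean, geo_sd, geos_per_arm, alpha=0.05, power=0.80):
    """Smallest relative lift detectable with the given number of geos per arm."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    se = geo_sd * sqrt(2 / geos_per_arm)  # std. error of the difference in means
    return (z_alpha + z_power) * se / baseline_mean


def geos_needed(baseline_mean, geo_sd, target_lift, alpha=0.05, power=0.80):
    """Geos per arm required to detect target_lift (relative) at the given power."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(2 * (z * geo_sd / (target_lift * baseline_mean)) ** 2)


# Assumed inputs: $0.20 per person baseline, $0.02 geo-to-geo standard deviation.
mde_10 = relative_mde(baseline_mean=0.20, geo_sd=0.02, geos_per_arm=10)
n_for_5pct = geos_needed(baseline_mean=0.20, geo_sd=0.02, target_lift=0.05)
print(f"MDE with 10 geos per arm: {mde_10:.1%}")
print(f"Geos per arm needed for a 5% lift: {n_for_5pct}")
```

Under these assumed inputs, ten geos per arm cannot reliably read a single-digit lift, which is the point of deciding the target lift first and sizing the design from it.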
Spend floor. Real money has to be pulled from controls. If the channel is “off” in controls but a national programmatic line is still running there, the contrast is gone before the test starts. This is where most tests quietly fail. The discipline is operational, not statistical: spend gets pulled, all of it, in every control geo, for the full window.
Test window. Six to eight weeks is the default and works for most DTC categories. High-consideration purchases (furniture, financial products, anything with a multi-week shopping cycle) need a longer window so the post-exposure conversions fall inside the measurement period. Anything under six weeks tends to compress signal without saving real time.
Pre-registration. Write the test design down before launch and share it with everyone who will read the result: the hypothesis, the primary metric, the MDE you are sizing for, and the stopping rule. Without that written record, every mid-test glance becomes a temptation to call the test early. With one, the result is a finding rather than a story. The math for sample size and MDE is in the incrementality formulas reference.
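The written record can be as light as a small immutable object checked into the same place as the analysis. The sketch below is one possible shape; the class name, field names, example values, and the simple "read out only after the end date" stopping rule are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: the plan cannot be edited once it is created
class GeoLiftPlan:
    hypothesis: str
    primary_metric: str
    mde: float   # smallest relative lift the test is sized to detect
    start: date
    end: date    # stopping rule in this sketch: no readout before the window closes


def can_read_out(plan: GeoLiftPlan, today: date) -> bool:
    """True only once the pre-registered window has fully elapsed."""
    return today > plan.end


# Hypothetical eight-week plan.
plan = GeoLiftPlan(
    hypothesis="Meta prospecting drives incremental revenue",
    primary_metric="revenue per capita",
    mde=0.05,
    start=date(2025, 1, 6),
    end=date(2025, 3, 2),
)
print(can_read_out(plan, date(2025, 1, 20)))  # a week-two glance
```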
Most geo-lift failures are invisible until the post-mortem
Spillover. National channels bleeding from treatment into control. National TV, programmatic with national targeting, organic spillover from PR or earned coverage. If the campaign is reaching control audiences through any uncontrolled path, the contrast narrows and the lift estimate falls. Either audit every national line item before launch or accept that the readout is biased toward zero.
Contamination. Someone forgets to switch the channel off in a control DMA. A retargeting list still serves control geos because the geo-fence on the campaign was set at the account level, not the ad set level. A programmatic vendor “helpfully” expands targeting during the test. Every contamination event pulls the result toward null and is usually invisible until the post-mortem.
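Contamination is easier to catch during the test than in the post-mortem if someone checks the spend log on a schedule. A minimal sketch, assuming a daily export with one row per geo and channel; the row schema (`date`, `geo`, `channel`, `spend`) and the market names are hypothetical.

```python
def find_contamination(spend_rows, control_geos, channel):
    """Return every row where the tested channel spent money in a control geo."""
    control = set(control_geos)
    return [
        row for row in spend_rows
        if row["geo"] in control
        and row["channel"] == channel
        and row["spend"] > 0
    ]


# Hypothetical spend export: Denver is a control geo, Austin is treatment.
spend_log = [
    {"date": "2025-01-06", "geo": "Denver", "channel": "meta", "spend": 0.0},
    {"date": "2025-01-07", "geo": "Denver", "channel": "meta", "spend": 112.40},
    {"date": "2025-01-07", "geo": "Austin", "channel": "meta", "spend": 5200.0},
    {"date": "2025-01-07", "geo": "Denver", "channel": "search", "spend": 300.0},
]
leaks = find_contamination(spend_log, control_geos=["Denver"], channel="meta")
for row in leaks:
    print(f"{row['date']}: {row['channel']} spent ${row['spend']:.2f} in {row['geo']}")
```

This only catches spend that is reported at the geo level. National line items that reach control markets without a geo breakdown still need a manual audit.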
Peeking and stopping early. Mid-test reads are noisy by construction. A test that looks like a winner at week two often regresses to the mean by week six. Stopping rules belong in the pre-registration, not in a Friday afternoon decision driven by a screenshot of the dashboard. Run the test for its full window.
Underpowered tests. A test sized to detect a 20 percent lift cannot diagnose a real 5 percent lift. The result will look like a null even when there is a real effect underneath, and the team will draw the wrong conclusion. Power calculations are not optional; they are the difference between a measurement and a coin flip.
Markets matched on paper, not in practice. Pairs that match on population but diverge on category share, weather sensitivity, regional retail distribution, or recent promo history. The basic check is parallel trends: do the two groups move together in the weeks before the test starts? If they did not move together before the test, they will not move together during it, and the lift estimate inherits the gap.
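The parallel-trends check can be run before launch on pre-period data. The sketch below looks at two things: whether the weekly series are correlated, and whether the treatment-to-control ratio stays stable. The thresholds (`min_corr`, `max_ratio_drift`) and the weekly figures are illustrative assumptions, not established cutoffs.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def pre_period_check(treat, ctrl, min_corr=0.9, max_ratio_drift=0.05):
    """treat, ctrl: weekly per-capita revenue before the test starts."""
    ratios = [t / c for t, c in zip(treat, ctrl)]
    mean_ratio = sum(ratios) / len(ratios)
    # Largest weekly deviation of the treatment/control ratio from its average.
    drift = max(abs(r / mean_ratio - 1) for r in ratios)
    corr = pearson(treat, ctrl)
    return {
        "corr": corr,
        "ratio_drift": drift,
        "ok": corr >= min_corr and drift <= max_ratio_drift,
    }


# Hypothetical pre-period weeks for one candidate pair.
treat_pre = [0.024, 0.026, 0.025, 0.029, 0.027, 0.025, 0.028, 0.026]
ctrl_pre = [0.023, 0.025, 0.024, 0.028, 0.026, 0.024, 0.027, 0.025]
print(pre_period_check(treat_pre, ctrl_pre))
```

A pair that fails this check before launch is a candidate for rejection, not for statistical adjustment after the fact.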
Confounders during the window. A product launch in a treatment market. A regional promo. Severe weather. A competitor pulse that overlaps one side of the pair and not the other. Anything that changes one side of the contrast during the test contaminates the result. Log every event and flag the ones that landed unevenly.
If the confidence interval crosses zero, you do not have a winner
Two headline numbers carry the readout: lift percentage and iROAS. Contribution profit is the business lens behind them; a lift number that does not translate into incremental contribution dollars is not a result anyone outside the measurement team will act on.
Confidence intervals are the test that separates findings from wishful thinking. The interval is the range of lift or iROAS values consistent with the data. If that range crosses zero, you do not have a winner; you have a test that could not distinguish the channel from no effect at the precision it ran at. Insisting on the interval is the cheapest discipline in the field.
There are three possible outcomes and each has a different next move.
Positive and significant: scale the channel toward the spend level the test ran at, calibrate the MMM against the result, and settle the allocation argument with the data.
Null: a real finding, not a failure. The channel did not produce a detectable lift at the spend level tested. Cut, reallocate, or recalibrate the campaign rather than the measurement. The test did its job.
Negative: more useful than a null. The channel is destroying value at the tested spend level. Stop the spend, investigate why platform attribution was painting a different picture, and apply the lesson to the next channel that looks too good in the dashboard.
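The three outcomes map directly onto where the confidence interval sits relative to zero. A minimal sketch; the function name and labels are illustrative.

```python
def classify_readout(ci_low, ci_high):
    """Classify a lift (or iROAS) interval into the three outcomes."""
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")
    if ci_low > 0:
        return "positive"  # whole interval above zero
    if ci_high < 0:
        return "negative"  # whole interval below zero
    return "null"          # interval includes zero


print(classify_readout(0.03, 0.13))   # interval clear of zero on the upside
print(classify_readout(-0.02, 0.09))  # interval crosses zero
print(classify_readout(-0.08, -0.01)) # interval clear of zero on the downside
```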
Cadence matters. Re-run when audience, creative, or spend level changes materially. A 12-month-old geo-lift is not the truth about today’s channel; it is a frozen estimate against a moving target. Treat the cadence as part of the measurement, not an extra.
Can a geo-lift run on just one platform?
Yes, and it is often the right move when the platform’s own conversion lift product cannot be trusted. Geo-lift does not require cooperation from the platform being measured. The treatment side runs the channel at full weight, the control side does not, and the gap is the lift. That independence from the platform is the design’s point. Run a geo-lift on the channel where the budget risk is largest, regardless of whether the impressions are served by Meta, Google, retail media, or CTV; the math does not care.
What if your brand does not have many matched markets?
Use synthetic controls. A synthetic control is a weighted blend of several other geos engineered to approximate one treatment market’s pre-test behavior. Done well this works and is now standard practice in the geo-lift tooling ecosystem. Done badly it hides a power problem behind statistical complexity and a clean-looking output.
The audit step is the same one that decides whether a natural matched pair is valid: confirm the synthetic control tracks the treatment closely across the pre-test window. If it does not move with the treatment before the test starts, it will not move with it during the test, and the lift estimate inherits the gap.
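To make the idea concrete, the sketch below fits synthetic-control weights by brute force: it tries every combination of non-negative weights that sum to one (on a 5-point grid) and keeps the blend that tracks the treatment market best in the pre-period. Real tooling uses a constrained optimizer rather than a grid search; the donor names and series here are invented, and the returned pre-period RMSE is the number to audit.

```python
from itertools import product


def fit_synthetic_control(treated, donors, step=0.05):
    """treated: pre-period series for the treatment market.
    donors: dict of name -> pre-period series of the same length.
    Returns (weights, pre_period_rmse)."""
    names = list(donors)
    steps = round(1 / step)
    best_sse, best_weights = None, None
    # Enumerate grid weights for all but the last donor; the last takes the remainder.
    for combo in product(range(steps + 1), repeat=len(names) - 1):
        if sum(combo) > steps:
            continue
        units = combo + (steps - sum(combo),)
        w = [u / steps for u in units]
        synth = [
            sum(wi * donors[name][i] for wi, name in zip(w, names))
            for i in range(len(treated))
        ]
        sse = sum((s - t) ** 2 for s, t in zip(synth, treated))
        if best_sse is None or sse < best_sse:
            best_sse, best_weights = sse, dict(zip(names, w))
    rmse = (best_sse / len(treated)) ** 0.5
    return best_weights, rmse


# Hypothetical pre-period series (indexed revenue per week).
donors = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 2.0, 2.0, 2.0],
    "C": [4.0, 3.0, 2.0, 1.0],
}
treated = [1.4, 2.0, 2.6, 3.2]
weights, rmse = fit_synthetic_control(treated, donors)
print(weights, f"pre-period RMSE={rmse:.4f}")
```

A large pre-period RMSE relative to the lift you hope to detect is the signal that the synthetic control is hiding a power problem rather than solving one.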
Geo-lift is the QRY default read on major channels because it is the only design that builds the counterfactual into the test rather than modeling it afterward. Every other tool tells a story about credit assignment on observed data. Geo-lift tells you what would have happened without the spend, which is the only question that matters for budget decisions.
The Peak Design CTV geo-holdout case study is the published proof point: a test that changed the spend decision and the MMM that referenced it.
If your largest paid channel has not been measured against a real counterfactual in the last year, the number you are spending against is a guess. Run a geo-lift test that actually answers the question.

Founder & CEO
Samir Balwani is the founder and CEO of QRY, a full-funnel paid media agency he started in 2017. He has 15+ years of advertising experience and previously led brand strategy and digital innovation at American Express. He writes on paid media strategy, measurement, and how agencies should operate.


