
Incrementality Test Calculator

Before you run a holdout test, you need to know whether your traffic can detect the lift you care about. Too small a sample or too short a window and the test returns noise that looks like a result. This free incrementality test calculator returns the sample size per group and the estimated duration from three inputs, and it requires no email.

The math is a standard two-proportion sample size calculation, fixed at two-sided 95% confidence and 80% power, the conventions used for most marketing experiments. Ninety-five percent confidence means a roughly 5% chance of calling a difference real when there is none. Eighty percent power means an 80% chance of detecting a true lift of the size you specify. Tighten either and the required sample grows.

Enter your baseline conversion rate, the minimum lift worth detecting, and your daily users per group. The calculator solves for how many users each group needs and how many days that takes at your traffic, so you can decide whether a user-level holdout is feasible or whether a geo test is the better design.

How it works

Convert the baseline rate to a proportion p1, and set the post-lift rate p2 to p1 × (1 + minimum detectable lift), with the lift also expressed as a proportion. Let pbar be the average of p1 and p2. With z-values of 1.96 for 95% confidence and 0.8416 for 80% power, the sample size per group is the square of [1.96 × √(2 × pbar × (1 − pbar)) + 0.8416 × √(p1 × (1 − p1) + p2 × (1 − p2))], divided by the square of (p2 − p1).
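The same formula in notation may be easier to check term by term. The symbols n, z with subscripts, and "lift" are labels chosen here for illustration. They do not come from the calculator itself.

```latex
\[
n = \frac{\left( z_{\alpha/2}\sqrt{2\,\bar{p}\,(1-\bar{p})}
      + z_{\beta}\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^{2}}{(p_2 - p_1)^{2}},
\qquad
p_2 = p_1\,(1 + \text{lift}),
\quad
\bar{p} = \frac{p_1 + p_2}{2}
\]
\[
z_{\alpha/2} = 1.96 \;\; (\text{95\% confidence}),
\qquad
z_{\beta} = 0.8416 \;\; (\text{80\% power})
\]
```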

Duration is the sample size per group divided by your daily users per group, rounded up to whole days. A smaller baseline rate or a smaller detectable lift raises the required sample, and lower traffic stretches the duration for any given sample. If the lift you enter pushes the post-lift rate to 100% or above, the test is not well defined and the calculator returns nothing.
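For readers who want to reproduce the numbers offline, here is a minimal Python sketch of the calculation described above. It is not the calculator's actual source. The function names are hypothetical, rates and lifts are passed as proportions (0.02 for 2%), and two details are assumptions: the sample size is rounded up to a whole user, and a non-positive lift returns None.

```python
import math

Z_ALPHA = 1.96    # z-value for 95% confidence
Z_BETA = 0.8416   # z-value for 80% power


def sample_size_per_group(baseline_rate, relative_lift):
    """Users needed in each group, or None when the test is not well defined."""
    p1 = baseline_rate
    p2 = p1 * (1 + relative_lift)
    # Assumption: reject a non-positive lift as well as a post-lift rate >= 100%.
    if not (0 < p1 < 1) or relative_lift <= 0 or p2 >= 1:
        return None
    p_bar = (p1 + p2) / 2
    root_sum = (
        Z_ALPHA * math.sqrt(2 * p_bar * (1 - p_bar))
        + Z_BETA * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    )
    # Assumption: round up to the next whole user.
    return math.ceil(root_sum ** 2 / (p2 - p1) ** 2)


def duration_days(users_per_group, daily_users_per_group):
    """Days needed to fill one group, rounded up to whole days."""
    return math.ceil(users_per_group / daily_users_per_group)
```

Call sample_size_per_group with your baseline and lift, then pass the result and your daily users per group to duration_days.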

Note that this sizes a conversion-rate test. Revenue-based lift, the usual readout in geo designs, has higher variance than a conversion rate, so treat this output as a floor and plan for a larger sample or a longer window when the metric you will judge is revenue.

Worked example

Suppose your baseline conversion rate is 2%, you want to detect a 10% relative lift, and you have 5,000 daily users per group. Then p1 is 0.02 and p2 is 0.022. Running the two-proportion formula gives roughly 80,683 users per group, about 161,366 across both groups.

At 5,000 daily users per group, 80,683 divided by 5,000 rounds up to 17 days. That is a feasible user-level holdout. If your traffic were a tenth of that, the same test would need about 162 days, a clear signal to switch to a geo test or accept a larger detectable lift.
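The arithmetic behind those figures, step by step. Intermediate values are rounded to five decimals for display, so the last digits may differ slightly from an unrounded calculation.

```latex
\begin{align*}
\bar{p} &= (0.02 + 0.022)/2 = 0.021 \\
1.96\sqrt{2(0.021)(0.979)} &\approx 0.39744 \\
0.8416\sqrt{(0.02)(0.98) + (0.022)(0.978)} &\approx 0.17065 \\
n &\approx \frac{(0.39744 + 0.17065)^{2}}{(0.022 - 0.02)^{2}}
   \approx \frac{0.32273}{0.000004}
   \approx 80{,}682 \;\Rightarrow\; 80{,}683 \text{ after rounding up} \\
\text{days} &= \lceil 80{,}683 / 5{,}000 \rceil = \lceil 16.14 \rceil = 17
\end{align*}
```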

How long should an incrementality test run?

Long enough to hit the sample size and to cover at least two full weekly cycles, whichever is longer. Even when the math allows a 6-day test, run a minimum of 14 days so weekday and weekend behavior both land in the result. On the other end, a test that needs more than 8 weeks is usually impractical: seasonality and promotions contaminate the read, and the business rarely tolerates a frozen holdout for that long.
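The floor and ceiling above can be written as a small planning rule. This is a sketch under the thresholds stated in the text (14 days minimum, about 8 weeks maximum). The function name and the return shape are hypothetical.

```python
MIN_DAYS = 14      # floor from the text: cover weekday and weekend behavior
MAX_DAYS = 8 * 7   # beyond about 8 weeks the read is usually contaminated


def plan_duration(days_for_sample):
    """Return (planned_days, practical) given the days the sample size requires."""
    planned = max(days_for_sample, MIN_DAYS)
    return planned, planned <= MAX_DAYS
```

A test the math sizes at 6 days is planned at 14. A test sized at 162 days is flagged as impractical for a user-level holdout.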

Most well-powered DTC holdouts at a 2 to 4% baseline rate and a 10 to 20% detectable lift settle in the two-to-four-week range. The detectable lift is the biggest lever. Asking to detect a 5% lift instead of a 10% lift roughly quadruples the sample, so be honest about the smallest lift that would actually change a budget decision. There is no value in powering a test to detect a difference too small to act on.
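The quadrupling follows from the shape of the formula. For a fixed baseline the denominator is the square of p1 times the lift, while the variance terms in the numerator barely move, so the sample scales with roughly one over the lift squared. The constant C below is shorthand introduced here for those numerator terms.

```latex
\[
n \;\approx\; \frac{C}{(p_1 \cdot \text{lift})^{2}}
\quad\Longrightarrow\quad
\frac{n(\text{lift}/2)}{n(\text{lift})} \;\approx\; 4
\]
```

At a 2% baseline the ratio between a 5% and a 10% lift works out to about 3.9 instead of exactly 4, because the numerator shrinks slightly as the lift gets smaller.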

When user-level traffic cannot support the test, move to a geo design: split matched markets into test and control, which lets you measure lift on total regional revenue rather than per-user conversions. Watch the classic pitfalls. Peeking and stopping early when the result looks good inflates false positives. Seasonal events distort short windows. Holdouts under a few percent of audience are too small to read. Size the test against QRY's monthly paid media benchmarks for typical baseline rates and lift ranges in your vertical.

See QRY's monthly paid media benchmarks to compare your numbers against the portfolio.

Frequently asked questions

How long should an incrementality test run?

Long enough to reach the required sample and to cover at least two full weekly cycles. Run a minimum of 14 days even when the math allows fewer, so weekday and weekend behavior are both represented. Avoid tests longer than about 8 weeks, where seasonality and promotions contaminate the result. Most DTC holdouts land in the two-to-four-week range.

How do you calculate the sample size for an incrementality test?

Use a two-proportion sample size at 95% confidence and 80% power. Convert the baseline rate to p1, set p2 to p1 times one plus the detectable lift, and apply the standard formula with z values of 1.96 and 0.8416. For a 2% baseline and a 10% lift, each group needs about 80,683 users. Divide by daily users per group to get the duration.

How big should a holdout group be?

Big enough to detect your target lift with confidence, which the sample size formula determines, and matched to the exposed group on the variables that drive conversion. A holdout below a few percent of the audience is usually too small to read reliably. When the two groups differ in size, scale the holdout's conversions up to the exposed group's audience size before comparing results.
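One common way to do that scaling is to multiply holdout conversions by the ratio of group sizes and compare the result with what the exposed group actually did. The sketch below assumes that reading. The function name is hypothetical, and the numbers in the usage note are purely illustrative.

```python
def incremental_lift(exposed_users, exposed_conversions,
                     holdout_users, holdout_conversions):
    """Scale holdout conversions to the exposed group's size, then compare.

    Returns (incremental_conversions, relative_lift).
    Assumes holdout_users and holdout_conversions are both greater than zero.
    """
    scale = exposed_users / holdout_users
    # Conversions the exposed group would be expected to produce without ads.
    expected_without_ads = holdout_conversions * scale
    incremental = exposed_conversions - expected_without_ads
    return incremental, incremental / expected_without_ads
```

With an illustrative 90,000 exposed users converting 1,980 times and a 10,000-user holdout converting 200 times, the holdout scales to 1,800 expected conversions, leaving 180 incremental conversions, a 10% relative lift.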

What is a minimum detectable effect?

The minimum detectable effect, or MDE, is the smallest lift the test is designed to detect with the chosen confidence and power. A 10% MDE at 80% power means the test has an 80% chance of catching a true 10% relative improvement. Smaller MDEs require dramatically larger samples: detecting a 5% lift takes roughly four times the users of a 10% lift, so set it to the smallest lift that would change a decision.

What is a good lift in an incrementality test?

A good lift is one large enough to clear your break-even and your detectable threshold. Prospecting and upper-funnel campaigns typically show the strongest incremental lift because they reach people who were not already converting. Retargeting often shows little or no lift because it harvests existing intent. Judge the lift against your margin, not against an inflated platform ROAS.

What if the test duration is impractical?

If a user-level holdout needs months at your traffic, switch to a geo test that splits matched markets into test and control, or raise the minimum detectable lift so the required sample drops. Both trade some precision for a feasible window. Never shorten the test by peeking and stopping early, which inflates false positives.
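To see how far the detectable lift must rise, you can run the formula in reverse: fix the window and traffic, then search for the smallest lift whose sample fits. The sketch below uses bisection for that. It is an illustration, not a feature of the calculator. The function names are hypothetical, and it assumes the required sample shrinks as the lift grows, which holds for typical inputs.

```python
import math

Z_ALPHA = 1.96    # z-value for 95% confidence
Z_BETA = 0.8416   # z-value for 80% power


def users_needed(p1, lift):
    """Two-proportion sample size per group (unrounded) for a relative lift."""
    p2 = p1 * (1 + lift)
    p_bar = (p1 + p2) / 2
    root_sum = (Z_ALPHA * math.sqrt(2 * p_bar * (1 - p_bar))
                + Z_BETA * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return (root_sum / (p2 - p1)) ** 2


def smallest_detectable_lift(p1, daily_users_per_group, max_days, tol=1e-6):
    """Bisect for the smallest relative lift whose sample fits in max_days."""
    budget = daily_users_per_group * max_days
    # Upper bound: the lift that would push the post-lift rate to 100%.
    lo, hi = 0.0, (1 - p1) / p1
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if users_needed(p1, mid) <= budget:
            hi = mid   # fits in the window: try a smaller lift
        else:
            lo = mid   # needs too many users: the lift must be larger
    # If even the upper bound does not fit, the result sits at that bound.
    return hi
```

For a 2% baseline with 500 daily users per group and an 8-week cap, smallest_detectable_lift(0.02, 500, 56) comes out a little over 17%, so a 10% lift is out of reach for a user-level holdout at that traffic.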
