The Creative Testing Statistics Reference
Significance, power, MDE, sequential testing: the statistics of creative testing as it actually works in paid media, not as it works in a software A/B test.

Most paid-media "tests" are ad flights wearing a lab coat. They borrow language from product A/B testing (significance, lift, winner) without matching its conditions: stable environments, higher base rates, clean units of exposure, and adequate sample. In paid media, the unit is usually a user or impression; conversions are rare; auctions, audiences, and placements shift hourly; and most tests are underpowered by an order of magnitude. Result: most reported winners, losers, and "fatigue" are noise or regression to the mean.
The fix is not a better dashboard. It is test design that earns its conclusions before launch. If you cannot write down the hypothesis, the variable, and the minimum detectable effect (MDE, the smallest effect the test can reliably detect at chosen significance and power) before the campaign starts, you are not testing. You are running ads and calling the result measurement.
Most paid-media tests are ad flights wearing a lab coat
Product A/B tests operate on page views with higher base conversion rates and relatively stable environments. Paid-media tests run on impressions or users, with rare outcomes and constantly shifting auctions, audiences, and placements. This volatility plus low base rates means typical creative tests are badly underpowered and highly sensitive to noise. A software product team running a signup-flow test might see thousands of conversions per day; a paid-media team testing two video creatives might see dozens of purchases per week across a split audience.
The gap matters because statistical power (the probability of detecting an effect of a given size if it exists) scales with sample size and event rate. Squeeze either, and the test's ability to distinguish signal from noise degrades quickly. A test that lacks power does not give you a clean null result; it gives you a coin flip dressed in confidence intervals.
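The relationship between power, sample size, and event rate can be made concrete with a short calculation. The following is a minimal sketch, assuming a two-sided two-proportion z-test with a normal approximation; the function name and the example baseline, lift, and sample sizes are illustrative assumptions, not figures from any real campaign.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_base, rel_lift, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    p_base    -- baseline conversion rate of the control creative
    rel_lift  -- relative lift of the variant (0.10 means +10 percent)
    n_per_arm -- users per arm (hypothetical input)
    """
    p_var = p_base * (1 + rel_lift)
    # Standard error of the difference between the two observed rates
    se = sqrt(p_base * (1 - p_base) / n_per_arm + p_var * (1 - p_var) / n_per_arm)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z_effect = abs(p_var - p_base) / se
    # Probability the test statistic lands beyond either critical value
    return NormalDist().cdf(z_effect - z_crit) + NormalDist().cdf(-z_effect - z_crit)


# Illustrative: 1 percent baseline, +10 percent relative lift
small_test = power_two_proportions(0.01, 0.10, 10_000)
large_test = power_two_proportions(0.01, 0.10, 200_000)
print(f"10k users per arm:  power = {small_test:.2f}")
print(f"200k users per arm: power = {large_test:.2f}")
```

Under these assumptions, the 10,000-user test has power on the order of 10 percent, which is the "coin flip dressed in confidence intervals" situation; the same lift needs a far larger audience before power clears the usual 80 percent floor.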
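Significance is the threshold most practitioners watch: the p-value is the probability of seeing a result at least this extreme if the variants truly did not differ, and the significance level (commonly 5 percent) is the cutoff it has to beat. But significance without power is misleading. A result from an underpowered test can clear the significance bar and still tell you little, because the underpowered results that do reach significance tend to overstate the true effect, and a larger share of them are false positives. Both numbers belong in the pre-launch design, not the post-hoc analysis.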
What rigorous creative testing actually requires
Pre-registered design. Before launch, write down the variable and direction (what you are changing and how you expect it to move), the primary metric (CTR, CVR, purchase rate), the required sample size, and the stopping rule. If it is not written before the campaign starts, it is not a test design; it is retrospective rationalization. The act of writing it forces decisions that casual tests never surface: what counts as a win, how long the test runs regardless of early results, and who has authority to stop it.
Power and MDE discipline. Compute the minimum detectable effect (MDE) before launch, not after. If the smallest effect you care about is below your MDE, change the plan: run longer, spend more, or test a coarser variable. Calling a result from an underpowered test a win is not optimization; it is superstition. Power of 80 percent is a common floor: it means a 1-in-5 chance of missing a real effect of exactly MDE size, even when the test is otherwise well-designed. Set power lower and that miss rate climbs.
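The pre-launch MDE check can be sketched in a few lines. This is a rough illustration, assuming a two-sided two-proportion z-test with the standard normal-approximation sample-size formula; `required_n_per_arm` and `relative_mde` are hypothetical helper names, and the baseline and audience figures are made up for the example.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_arm(p_base, rel_mde, alpha=0.05, power=0.80):
    """Users per arm needed to detect a relative lift of rel_mde."""
    p_var = p_base * (1 + rel_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p_base * (1 - p_base) + p_var * (1 - p_var)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p_var - p_base) ** 2)


def relative_mde(p_base, n_per_arm, alpha=0.05, power=0.80):
    """Smallest relative lift detectable with n_per_arm users, found by bisection."""
    lo = 1e-6
    hi = min(10.0, 0.999 / p_base - 1)  # keep the variant rate below 1
    for _ in range(100):
        mid = (lo + hi) / 2
        if required_n_per_arm(p_base, mid, alpha, power) > n_per_arm:
            lo = mid  # this lift is too small to detect; look at larger lifts
        else:
            hi = mid
    return hi


# Illustrative: 1 percent purchase rate, 20,000 users per arm available
print(required_n_per_arm(0.01, 0.10))            # users per arm for a +10% lift
print(f"{relative_mde(0.01, 20_000):.0%}")       # smallest detectable relative lift
```

If the second number is larger than the lift the business would act on, the plan needs to change before launch, not after the readout.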
Correct unit and allocation. Treat the user as the primary unit when measuring creative impact. Impressions within a user are correlated, so counting each impression as independent inflates your apparent sample size and understates variance. Use platform split-test tools (Meta, Google) or equivalent to enforce clean allocation and prevent audience contamination between variants.
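To see how much impression-level counting can overstate sample size, one common approximation is the Kish design effect, 1 + (m - 1) × ICC, where m is impressions per user and ICC is the within-user correlation of outcomes. The sketch below assumes equal impressions per user, and the ICC value is a made-up illustration rather than a measured figure.

```python
def effective_sample_size(n_users, impressions_per_user, icc):
    """Effective number of independent observations under clustering.

    icc -- intra-user correlation of the outcome (0 = impressions fully
           independent, 1 = every impression from a user behaves identically).
           Hypothetical input; estimate it from your own data.
    """
    design_effect = 1 + (impressions_per_user - 1) * icc
    total_impressions = n_users * impressions_per_user
    return total_impressions / design_effect


# Illustrative: 50,000 users seeing 6 impressions each, assumed ICC of 0.3
naive_n = 50_000 * 6
effective_n = effective_sample_size(50_000, 6, 0.3)
print(naive_n, round(effective_n))
```

With these assumed inputs, 300,000 impressions carry about as much information as 120,000 independent observations, so a significance test run on raw impression counts would be substantially overconfident.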
Aligned optimization and evaluation. Optimize and judge on the same outcome (purchases on purchases, not clicks on purchases) with consistent attribution windows. Misalignment between the campaign's optimization signal and your evaluation metric is one of the most common sources of false conclusions in paid-media testing. A campaign optimized toward link clicks that you evaluate on purchase CVR is not a test; it is a misconfigured experiment.
Sequential methods if you peek. If you will monitor results in-flight (and you will), use a sequential testing design instead of pretending you are running a fixed-horizon test. The main options are mSPRT (the mixture sequential probability ratio test, which allows continuous monitoring without inflating false positives), confidence sequences, and alpha-spending (group-sequential) designs; they are alternatives with different trade-offs, not interchangeable. Peeking (looking at test results before the planned end date) and stopping at the first significant reading inflates the false positive rate (the chance of declaring a winner when there is no real effect) of a nominal 5 percent test to roughly 14 percent with five looks, and to 20 to 30 percent with daily looks over a multi-week flight. mSPRT and confidence sequences keep inference valid under any number of peeks; alpha-spending keeps it valid across the looks governed by a pre-specified spending function.
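The inflation from peeking is easy to reproduce in simulation. The sketch below is a simplified model, assuming equally spaced looks, two identical variants (so every declared winner is a false positive), and a standardized z-statistic built from independent normal increments; it does not implement mSPRT or any other sequential method, it only illustrates the problem they solve.

```python
import random
from math import sqrt
from statistics import NormalDist


def peeking_false_positive_rate(n_looks, n_sims=20_000, alpha=0.05, seed=0):
    """Share of A/A tests that cross the nominal threshold at any look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_sims):
        running_sum = 0.0
        for look in range(1, n_looks + 1):
            # Each interval between looks adds one independent standardized
            # increment under the null (no true difference between variants)
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / sqrt(look)
            if abs(z) > z_crit:
                false_positives += 1  # stopped early and declared a "winner"
                break
    return false_positives / n_sims


for looks in (1, 5, 20):
    rate = peeking_false_positive_rate(looks)
    print(f"{looks:>2} looks: false positive rate = {rate:.3f}")
```

A single look at the planned end date holds the rate near the nominal 5 percent; every additional look at the same unadjusted threshold pushes it higher.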
Variable-level learning. Test structured variables (headline pattern, claim type, opening frame, CTA verb) rather than whole ads that differ on six dimensions at once. Maintain a creative taxonomy so learnings accumulate; each test should add a tagged entry to a library of what reliably moves your metrics, not just surface a one-off winner.
Anti-patterns we avoid
Each of the following produces confident-sounding conclusions from noise. That is worse than no conclusion at all, because a noise-based conclusion gets budgeted against.
Declaring winners from underpowered tests. A one-week test on a sub-1 percent outcome metric cannot distinguish a real creative effect from auction noise. The winner this week is often the loser next week. The correct response to a short, underpowered test is not to declare the leader the winner; it is to extend the runtime or acknowledge that the test produced no finding.
Stopping when the dashboard shows significance. A p-value below 0.05 on a Tuesday morning is not a stopping rule. Stopping a fixed-horizon test the first moment it crosses the significance threshold multiplies the risk of a Type I error (declaring a winner when there is no real effect), and with enough looks a false winner becomes close to certain. The stopping rule belongs in the pre-registration, not in a Friday-afternoon decision driven by a screenshot.
Running many parallel tests without multiple-comparisons control. Run twenty parallel creative tests at a 5 percent significance threshold and, even if none of the creatives actually differ, you should expect one false positive by chance alone, with roughly a 64 percent chance of at least one (assuming the tests are independent). The more tests you run simultaneously, the more your family-wide false positive rate balloons unless you apply a correction (Bonferroni, Benjamini-Hochberg, or a pre-specified hierarchy of primary and secondary tests).
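Both the inflation and one of the corrections can be sketched briefly. The example below assumes independent tests for the family-wise calculation and implements the standard Benjamini-Hochberg step-up procedure; the p-values are invented for illustration.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent null tests."""
    return 1 - (1 - alpha) ** n_tests


def benjamini_hochberg(p_values, fdr=0.05):
    """Indices of tests declared significant at the given false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value clears its threshold
        if p_values[idx] <= fdr * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


print(f"{family_wise_error_rate(20):.2f}")   # twenty uncorrected tests

# Hypothetical p-values from four parallel creative tests
p_values = [0.20, 0.001, 0.04, 0.008]
print(benjamini_hochberg(p_values))          # indices that survive correction
print(0.05 / len(p_values))                  # Bonferroni per-test threshold
```

Note that the test with p = 0.04 would have been called a winner at the uncorrected 5 percent threshold but does not survive either correction in this example.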
Calling performance reversion "creative fatigue" without ruling out auction effects. When a creative's CVR or ROAS drops over a four-week window, the default diagnosis is fatigue. But the same pattern appears when auction competition rises, audience composition shifts, or the platform's delivery algorithm re-allocates spend toward a different segment. Diagnosing fatigue without ruling these out is a misattribution, closer to a Type I error (declaring an effect that is not there) than to a Type II error (failing to detect a real one). The practical cost is the same either way: you swap the creative and solve nothing.
If you can't write down the hypothesis, the variable, and the MDE before launch, you're not running a test; you're running ads.
— QRY POV
Are Bayesian tests safe to peek at without inflating false positives?
No. Bayesian tests with flat or weakly-informative priors still suffer false-positive inflation under optional stopping. Frequentist testing frames the question as: if the variants truly did not differ, what is the probability of seeing a result at least this extreme (the p-value)? Bayesian testing asks instead: given this data and my prior beliefs, what is the probability that variant B is better? Both framings are valid tools; neither is a license to peek without a stopping rule.
Empirical work shows Bayesian optional stopping behaves similarly to frequentist peeking when priors are weak. The real Bayesian advantage is interpretability ("there is a 91 percent probability that B outperforms A" is easier to act on than "p equals 0.04") and the ability to incorporate strong priors from prior tests. Neither advantage removes the need for a pre-registered stopping rule. If you want to peek validly, use a sequential method designed for it.
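For reference, a "probability that B outperforms A" statement is typically produced along these lines. This is a minimal sketch, assuming a Beta-Binomial model with a flat Beta(1, 1) prior and Monte Carlo sampling from the two posteriors; the function name and the conversion counts are hypothetical.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, n_draws=20_000, seed=0):
    """Posterior probability that variant B's conversion rate exceeds A's."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_draws):
        # Flat Beta(1, 1) prior; posterior is
        # Beta(1 + conversions, 1 + non-conversions) for each variant
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / n_draws


# Hypothetical counts: 100 vs 120 purchases from 10,000 users per arm
print(f"P(B > A) = {prob_b_beats_a(100, 10_000, 120, 10_000):.2f}")
```

The output is easier to communicate than a p-value, but with a flat prior like this one it is still a number that drifts across thresholds by chance if you recompute it daily and stop the first time it looks decisive.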
How QRY helps
We design and run creative experiments that are powered for realistic lifts at your actual baselines, use the right unit, attribution, and optimization target, and produce reusable variable-level insights instead of one-off winners. A test designed by QRY has a written hypothesis, a pre-registered primary metric, an MDE calibrated to the actual business threshold, and a stopping rule that survives in-flight monitoring. If your current testing program does not include all four, you are leaving both performance and learning on the table.
The statistical foundations for sizing and interpreting these tests connect to the broader incrementality measurement stack. The incrementality formulas that underpin geo-lift and holdout tests use the same power and MDE logic described here; the geo-lift testing guide and the incrementality formulas reference cover the channel-level measurement layer. Creative testing is the unit-level equivalent: the same discipline, applied one step closer to the ad.

Founder & CEO
Samir Balwani is the founder and CEO of QRY, a full-funnel paid media agency he started in 2017. He has 15+ years of advertising experience and previously led brand strategy and digital innovation at American Express. He writes on paid media strategy, measurement, and how agencies should operate.


