Geo holdout testing: a practitioner's guide to your first run
Geo holdout tests are the closest thing paid media has to a controlled experiment. Here is the operational detail most guides skip: how to design a valid test, pick your markets, set your confidence threshold and read the results without fooling yourself.
The idea behind a geo holdout test is disarmingly simple: turn off your ads in some markets, leave them on in others, and measure the revenue difference. If the held-out markets perform the same as the live ones, your ads are doing nothing. If they underperform, you have found your incrementality.
In practice the simplicity stops there. Market selection, test duration, business seasonality, organic baseline noise and statistical power all interact in ways that make it easy to run a test and learn nothing useful. This is the piece I wish I had read before my first one.
Why geo tests beat platform attribution
Platform attribution is a model. It is built on incomplete data, filtered through platform interests, and it will always find a way to credit the platform. A geo holdout test is an observation of what actually happened in the real world when a variable changed. These are not equivalent.
The key property of a geo experiment is that it is causal, not correlational. When you randomise your test and control markets properly, you are not inferring incrementality from a model. You are measuring it. That distinction matters more as tracking degrades and data-driven attribution becomes less, not more, reliable.
Geo holdout tests have become more valuable as third-party cookie deprecation, iOS privacy changes and consent rate variation erode the user-level data that attribution models depend on. An experiment that works at the market level is insulated from these trends.
Designing a valid experiment
Before you start, you need to answer four questions.
What is the unit of observation?
For most businesses, the right unit is a geographic region: a city, a designated market area (DMA, in the US), a county or a set of postcode areas. The unit should be large enough to produce stable baseline revenue data, and self-contained enough that activity in one unit does not spill into another. Adjacent cities are messier than distant ones.
How many units do you need?
The honest answer is: more than you probably have. As a rule of thumb, at least twenty regions in each arm (test and control) gives reasonable statistical power for detecting a ten percent revenue effect, though the exact number depends on how noisy your regional revenue is. Fewer units mean you need a larger effect to clear the noise floor, so you either accept wide confidence intervals or miss real effects.
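To get a feel for how the number of regions trades off against the smallest effect you can detect, here is a rough normal-approximation sketch. It is not the source of the twenty-region guideline. The function name, the `noise_sd` input (the standard deviation of per-region relative revenue change, which you would estimate from your own history) and the default thresholds are all assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(regions_per_arm, noise_sd, alpha=0.10, power=0.80):
    """Approximate smallest true effect a two-arm comparison can reliably detect.

    noise_sd is the standard deviation of the per-region relative revenue
    change (0.12 means 12%). Assumption: regions are independent and the
    noise is roughly normal, which real markets only approximate.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)
    standard_error = noise_sd * sqrt(2 / regions_per_arm)
    return (z_alpha + z_power) * standard_error


# Illustrative noise level only; replace with a figure from your own data.
for n in (5, 10, 20, 40):
    mde = minimum_detectable_effect(n, noise_sd=0.12)
    print(f"{n:>2} regions per arm: detectable effect of about {mde:.1%}")
```

The useful pattern is the square-root relationship: under these assumptions, halving the detectable effect takes four times as many regions.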
How do you match test and control markets?
Match on historical revenue volume, revenue trend, organic search mix, seasonality profile and any known local market differences. The more similar your test and control arms look before the test starts, the more confidently you can attribute differences to the ad change.
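One simple way to combine matching with randomisation is to pair regions that look alike and then flip a coin within each pair. The sketch below matches on historical revenue only, which is a deliberate simplification of the criteria above; the function name, region names and figures are hypothetical.

```python
import random


def assign_matched_pairs(baseline_revenue, seed=0):
    """Pair regions with similar baseline revenue, then randomise within pairs.

    baseline_revenue: dict of region name -> historical revenue.
    Returns (holdout, control) lists. With an odd number of regions, the
    region with the lowest baseline is left out of the test.
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    ranked = sorted(baseline_revenue, key=baseline_revenue.get, reverse=True)
    holdout, control = [], []
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)  # coin flip decides which of the pair is held out
        holdout.append(pair[0])
        control.append(pair[1])
    return holdout, control


# Hypothetical regions and revenue figures, for illustration only.
regions = {"Leeds": 120_000, "Bristol": 115_000, "Cardiff": 64_000,
           "Derby": 61_000, "Hull": 30_000, "York": 28_000}
holdout, control = assign_matched_pairs(regions)
print("Hold out:", holdout)
print("Control: ", control)
```

In a real design you would pair on a distance across several features (trend, organic mix, seasonality) instead of a single ranking, but the within-pair coin flip stays the same.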
"A badly designed test is worse than no test. It gives you a number you feel confident about that points you in the wrong direction."
Priya Mehta, Brainlabs internal measurement guide
How long should you run it?
Long enough for the metric you care about to stabilise after the change, typically two to four weeks minimum, and ideally through at least one full purchase cycle. Shorter than two weeks and you are probably seeing short-term demand deferral, not true incrementality. Longer than eight weeks and you risk confounding from seasonality shifts.
Picking the right metric
Revenue is usually the right primary metric. Conversions can work if your average order value (AOV) is stable. Do not use clicks or impressions: these are activity metrics, not outcome metrics. If you are testing upper-funnel spend, consider using search query volume for branded terms in the held-out regions as a proxy for awareness lift.
| Metric | Practical for geo tests? | Notes |
|---|---|---|
| Revenue | Yes, primary | Best signal; requires solid source-of-truth data |
| Transactions | Yes | Useful if AOV is stable; otherwise prefer revenue |
| Branded search volume | Yes, for awareness | Google Trends by region; proxy only |
| Conversions (platform) | Partial | Platform-reported; use as check not primary |
| Clicks / impressions | No | Activity, not outcome |
| ROAS (platform) | No | Models attribution; circular for incrementality |
Reading the results honestly
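Calculate the lift by comparing the ratio of test arm revenue to control arm revenue during the test against the same ratio in the pre-test baseline period. Because the test arm is the one with ads switched off, a ratio that falls below its baseline is your evidence of incrementality. This pre-post, test-vs-control design controls for seasonality and for any underlying trend common to both arms.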
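As a minimal sketch of that calculation, assuming the test arm is the held-out arm and that you have total revenue per arm for the baseline and test periods (the function name and all figures are hypothetical):

```python
def incremental_share(test_pre, control_pre, test_during, control_during):
    """Fraction by which the held-out (test) arm fell short of expectation.

    Each argument is total revenue for one arm in one period. A positive
    result means the held-out regions earned less than the control arm
    predicted, i.e. the paused ads were contributing revenue.
    """
    baseline_ratio = test_pre / control_pre
    test_period_ratio = test_during / control_during
    return 1 - test_period_ratio / baseline_ratio


# Hypothetical figures, for illustration only.
share = incremental_share(
    test_pre=500_000, control_pre=520_000,
    test_during=450_000, control_during=540_000,
)
print(f"Held-out arm shortfall versus expectation: {share:.1%}")
```

Dividing by the baseline ratio is what removes any pre-existing size gap between the arms; dividing by the control arm removes whatever happened to both arms at once.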
Report a confidence interval, not a point estimate. If your ninety percent confidence interval runs from five percent lift to forty-two percent lift, be honest about what that means: you have evidence of positive incrementality but not enough precision to optimise channel spend against it. That is a legitimate finding. It tells you where to invest next.
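One way to put an interval around the estimate is a percentile bootstrap that resamples regions within each arm. This is a sketch under assumptions, not the only valid method: it treats regions as independent, and with only a handful of regions per arm the interval is itself rough. All names and figures are hypothetical.

```python
import random


def share_from_regions(test, control):
    """Held-out arm shortfall from lists of (pre, during) revenue per region."""
    test_pre = sum(pre for pre, _ in test)
    test_during = sum(during for _, during in test)
    control_pre = sum(pre for pre, _ in control)
    control_during = sum(during for _, during in control)
    return 1 - (test_during / control_during) / (test_pre / control_pre)


def bootstrap_interval(test, control, level=0.90, draws=2000, seed=0):
    """Percentile bootstrap interval, resampling regions within each arm."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    estimates = []
    for _ in range(draws):
        test_sample = rng.choices(test, k=len(test))  # sample with replacement
        control_sample = rng.choices(control, k=len(control))
        estimates.append(share_from_regions(test_sample, control_sample))
    estimates.sort()
    tail = (1 - level) / 2
    lower = estimates[round(tail * draws)]
    upper = estimates[round((1 - tail) * draws) - 1]
    return lower, upper


# Hypothetical (pre, during) revenue per region, for illustration only.
test_regions = [(52_000, 44_000), (48_000, 43_500), (61_000, 50_000),
                (39_000, 35_000), (45_000, 37_000), (57_000, 51_000)]
control_regions = [(50_000, 51_000), (47_000, 49_500), (63_000, 62_000),
                   (41_000, 43_000), (44_000, 45_500), (55_000, 57_000)]

point = share_from_regions(test_regions, control_regions)
lower, upper = bootstrap_interval(test_regions, control_regions)
print(f"Point estimate: {point:.1%}, 90% interval: {lower:.1%} to {upper:.1%}")
```

If you assigned regions in matched pairs, resampling whole pairs is usually the more faithful variant, because it preserves the matching.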
If the result is not statistically significant, resist the urge to interpret the direction. A non-significant result means the test could not detect an effect of the size you are looking for, not that there is no effect. The right response is a bigger or better-designed follow-up test.
The operationalisation problem
The part that catches teams out is not the statistics. It is operationalising the hold-out cleanly. You need to turn off spend in specific regions without accidentally pulling budget from adjacent regions through Smart Bidding reallocation. Geo exclusions, separate campaigns by region and careful budget monitoring are all required.
Check that your hold-out is actually held out by verifying that impressions have dropped to zero in the test regions. Do this on day one and again on day three. Campaigns that were not expected to serve often find a way to.
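The day-one and day-three check is easy to script once you have impressions by region in hand. The sketch below assumes you have already exported a geographic report into a plain dictionary; that data shape, the function name and the figures are hypothetical, and no platform API is called.

```python
def leaking_regions(impressions_by_region, held_out):
    """Return held-out regions that still recorded impressions.

    impressions_by_region: dict of region -> impressions for the day checked.
    Regions missing from the report are treated as zero impressions.
    """
    return {
        region: impressions_by_region.get(region, 0)
        for region in held_out
        if impressions_by_region.get(region, 0) > 0
    }


# Hypothetical day-one report, for illustration only.
day_one = {"Leeds": 0, "Cardiff": 37, "Bristol": 18_400, "Derby": 9_950}
leaks = leaking_regions(day_one, held_out=["Leeds", "Cardiff", "York"])
for region, impressions in leaks.items():
    print(f"{region} should be dark but served {impressions} impressions")
```

Even a small non-zero count is worth chasing down, since it usually points to a campaign or targeting setting that was missed.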
What to do with the result
A geo test result is most valuable when it is used to calibrate your attribution model, not replace it. If your attribution model credits Google Search with a forty percent share of revenue but your geo test suggests the true incrementality is twenty-five percent, you now have a calibration factor. Apply it systematically, run the test again in six months, and over time your budget decisions get better grounded in evidence.
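The arithmetic behind that calibration factor is a single division. A small sketch using the forty and twenty-five percent figures from the example above; the function name and the attributed revenue figure are hypothetical.

```python
def calibration_factor(attributed_share, incremental_share):
    """Ratio that scales attributed revenue to the incrementality the test measured."""
    return incremental_share / attributed_share


factor = calibration_factor(attributed_share=0.40, incremental_share=0.25)

attributed_revenue = 80_000  # hypothetical platform-attributed revenue
calibrated_revenue = attributed_revenue * factor
print(f"Calibration factor: {factor:.3f}")
print(f"Calibrated revenue: {calibrated_revenue:,.0f}")
```

Here the factor works out to 0.625, so every unit of revenue the attribution model credits to the channel is treated as roughly five-eighths of a unit until the next test updates it.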
Start with your largest channel. One well-executed experiment on the channel you spend the most on will produce more useful information than five loosely designed tests spread across your account.
