In this post we’ll walk through the concept of incremental lift (or simply, Lift), where Lift comes from, how it works, and some of the nuances that come along with our real-world constraints as we deviate from an ideal marketing scenario.
Introduction
If you could wave a magic wand and collect perfect information to optimize your advertising campaigns, what would that perfect information look like? Specifically, what would be the best possible point of information a customer could give you upon taking an action like signing up, joining a mailing list, or making a purchase?
In an ideal world this customer would tell you something like:
I saw ad X in channel Y, which completely convinced me to take action Z!
Not only would they provide this detailed level of feedback for every action they take, they would really mean it! It would be an accurate reflection of what incentivized them to take action and could serve as an ideal input that an advertiser could use to help serve more relevant ads (similar to ad X) to the customer in the right time and place (in channel Y) to continue to provide them with some sort of value (take action Z).
Close your eyes and imagine it. Wouldn’t that be nice? Doesn’t it make you feel all warm and fuzzy inside? How are you reading this if your eyes are closed?
Ok you can open your eyes now. I’m going to crush your dreams by telling you something that you already know – this ideal scenario is just not the reality we live in. Sort of.
We can’t get information from our customers that is this detailed, accurate, and relevant to our optimization goals. However, what we can do is approximate the effect of a given advertisement using performance data that is available. Specifically, we can measure the effect of a campaign on a group of individuals relative to some control group and quantify that effect with a deeply nuanced (and frequently misunderstood) performance metric known as Lift.
Lift
What is Lift?
Incremental lift (or simply, Lift) is an estimate of the increase in likelihood that a customer converted given that they were exposed to an advertising campaign. This estimated increase in likelihood is derived from comparing the performance of people that weren’t exposed to an ad (the control or holdout set) to people who were exposed to the ad (the experimental or mailed set).
Let’s say we’re running a prospecting campaign and we assume that all of the client’s potential customers are equally likely to convert. The starting set would consist of all households in the United States who are not pre-existing customers. We could then break this set into two subsets:
- A set that we will mail called the mailed set
- The performance of this set will represent the effect of being mailed.
- A set that we will not mail called the holdout set
- The performance of this set will represent the baseline performance of prospective customers at the company
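As a sketch of this setup, here's how one might randomly split a prospect universe into mailed and holdout sets. The function name and the 10% holdout fraction are illustrative assumptions, not Postie's actual implementation; under the equal-likelihood assumption, a simple random split yields two comparable groups.

```python
import random

def split_universe(households, holdout_frac=0.1, seed=42):
    """Randomly assign each household to the mailed or holdout set.

    Assumes every household is equally likely to convert, so a
    simple random split produces comparable groups.
    """
    rng = random.Random(seed)
    shuffled = list(households)
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_frac)
    return shuffled[n_holdout:], shuffled[:n_holdout]  # mailed, holdout

mailed, holdout = split_universe(range(100_000))
print(len(mailed), len(holdout))  # prints "90000 10000"
```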
Below is a representation of everyone in our mailable universe; click below to establish our experimental setup (with mailed and holdout groups).
Then we can analyze the performance of both groups over time to identify if there was any performance change in the mailed set relative to the holdout set. We can observe that behavior over time in the graphic below:
That’s it! Now that we have a strong understanding of the experimental setup behind lift, we can proceed to interpreting its results.
Quantifying Lift
Lift can be explicitly calculated at any point throughout this attribution window as
$$\text{Lift} = \left(\frac{mailed_{conv}}{mailed_{total}} \times \frac{holdout_{total}}{holdout_{conv}}\right) - 1 = \left(\frac{mailed_{cvr}}{holdout_{cvr}}\right) - 1 = \frac{mailed_{cvr} - holdout_{cvr}}{holdout_{cvr}}.$$
This equation gives us the percent improvement of the mailed set over the baseline performance established by the holdout set. Note that one could also calculate lift as the overall performance of the mailed set divided by the holdout set (\(\frac{mailed_{cvr}}{holdout_{cvr}}\)), but we prefer to use the equation stated above as it sets lift to 0% when there is little to no difference between the mailed and holdout sets (as opposed to 100%).
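The formula above translates directly into code. Here's a minimal sketch (the function name and the conversion counts are hypothetical, chosen so the rates come out to round numbers):

```python
def lift(mailed_conv, mailed_total, holdout_conv, holdout_total):
    """Percent improvement of the mailed set's conversion rate over
    the holdout baseline, as a fraction (1.0 == 100% lift)."""
    mailed_cvr = mailed_conv / mailed_total
    holdout_cvr = holdout_conv / holdout_total
    return mailed_cvr / holdout_cvr - 1

# A 2% mailed CVR against a 1% holdout CVR yields 100% lift.
print(f"{lift(2_000, 100_000, 100, 10_000):.0%}")  # prints "100%"
```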
When we apply this calculation to our prospecting campaign, we can get a richer understanding of what’s going on in our test:
Here we can see that the mailed distribution pulls ahead of the holdout distribution once the campaign is fully delivered. Then, the holdout distribution plays catch up to the rate of the mailed distribution until the end of the attribution window.
The distance between these two distributions at any point in our attribution window (or more accurately, the percent difference between the conversion rate of the mailed set and holdout set at any point in our attribution window) is how we define lift.
Interpreting Lift
A 10% lift means that the campaign is estimated to have increased conversions by 10% at that point in time.
Therefore, a couple quick rules of thumb to quantify lift include:
- 100% lift means that the campaign is estimated to have doubled conversions at that point in time
- 200% lift means that the campaign is estimated to have tripled conversions at that point in time
- 300% lift means that the campaign is estimated to have quadrupled conversions at that point in time
The two other edge cases to cover include:
- When Lift = 0, we identify little to no difference between the performance of the mailed and holdout sets at this time.
- When Lift < 0, the holdout set outperformed the mailed set at this time. This value is typically clipped to 0.
Negative lift would imply that you sent people an ad so offensive that it actually *deterred* customers from purchasing your product. In reality, we know this scenario is far less likely to occur in a highly regulated, monitored, and brand-safe channel like Direct Mail. Therefore, negative lift is clipped to 0, and the differences between the sets can usually be chalked up to a particularly strong-performing holdout set.
Lift | Effect | Implication |
---|---|---|
100% | Doubled Conversion Rate | 2x More Likely to Convert when Mailed. |
200% | Tripled Conversion Rate | 3x More Likely to Convert when Mailed. |
300% | Quadrupled Conversion Rate | 4x More Likely to Convert when Mailed. |
0% | No Difference | Little difference between mailed and holdout sets. |
< 0% | Holdout Outperformed Mailed | Stronger than expected holdout set; reported as 0%. |
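These rules of thumb boil down to one relationship: the conversion multiplier is simply lift + 1, with negative values clipped to 0. A quick sketch (function names are illustrative):

```python
def conversion_multiplier(lift_value):
    """100% lift doubles conversions, 200% triples them, and so on:
    the multiplier is simply lift + 1."""
    return lift_value + 1.0

def reported_lift(raw_lift):
    """Negative lift is clipped to 0, per the convention above."""
    return max(0.0, raw_lift)

for pct in (1.0, 2.0, 3.0):
    print(f"{pct:.0%} lift -> {conversion_multiplier(pct):.0f}x conversions")
print(reported_lift(-0.25))  # prints "0.0"
```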
Impact
You might have noticed that I use the “at this time” addendum at the end of each interpretation of lift. This is because time plays a very important role in interpreting and quantifying lift.
Often, clients will look at the single number representing lift at the end of the attribution window and treat it as an accurate reflection of the overall trend within their marketing campaign. However, time is another key variable to consider. When we plot lift as a function of time, we can see this important component of the story playing out:
We define the lift at a set interval near the start of a campaign (in this case, 14 days) as its impact. Analyzing the impact gives us a powerful indicator of the effect that direct mail had within an individual test cell shortly after delivery. In this case, all of these people just so happened to convert right after a direct mail ad was placed in-home. Therefore, not only are we seeing higher overall conversion rates at the end of the attribution window, but we're seeing them front-loaded right after exposure to the advertisement. Conversely, if the ad were having little to no effect, we would expect conversions to roll in at roughly equal rates in both sets. Overall, this additional data point provides a strong statistical argument that people are responding well to the advertisement, and likely converting at higher rates because of the direct mail ad.
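To make the time component concrete, here's a sketch that computes cumulative lift for each day of the attribution window and reads off the impact at day 14. The daily conversion counts are made up to show a front-loaded pattern, and the function names are assumptions, not Postie's API:

```python
def lift_curve(mailed_daily, holdout_daily, mailed_total, holdout_total):
    """Cumulative lift for each day of the attribution window."""
    curve, m_cum, h_cum = [], 0, 0
    for m, h in zip(mailed_daily, holdout_daily):
        m_cum += m
        h_cum += h
        m_cvr = m_cum / mailed_total
        h_cvr = h_cum / holdout_total
        curve.append(m_cvr / h_cvr - 1 if h_cvr > 0 else 0.0)
    return curve

def impact(curve, day=14):
    """Impact: the lift at a fixed interval (here 14 days) into the window."""
    return curve[day - 1]

# Hypothetical daily conversion counts for two equally sized (10,000) groups:
# the mailed set's conversions are front-loaded into the first 14 days.
mailed_daily = [20] * 14 + [5] * 16
holdout_daily = [5] * 30

curve = lift_curve(mailed_daily, holdout_daily, 10_000, 10_000)
print(f"impact (day 14): {impact(curve):.0%}")   # prints "impact (day 14): 300%"
print(f"final lift (day 30): {curve[-1]:.0%}")   # prints "final lift (day 30): 140%"
```

Note how in this toy example the impact (300%) exceeds the end-of-window lift (140%): the holdout "plays catch up" after the early, front-loaded burst of mailed conversions.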
Lift at the Channel Level
Finally, the most important point to keep in mind is that we can run more than one advertising campaign!
When we analyze lift at the individual campaign level, we’re implicitly asking whether one individual piece of mail was able to help drive a conversion over some period of time. More specifically, we’re asking whether the incremental benefit of sending one piece of mail to a household added additional performance in the wake of all other advertising channels (digital, email, TV, etc.). Oftentimes, these other channels are running campaigns with multiple touch points (multiple digital ads, emails, etc.) that occur throughout the timeframe of our single campaign. Therefore, analyzing the direct mail channel at the campaign level compares the performance of a single piece of mail against the performance of all other advertisements in all other channels (not exactly an apples-to-apples comparison).
This is why we like to draw attention to the long-term performance of the channel as a whole, rather than optimizing based on the results of any one campaign (or any single touch point). As we send more direct mail campaigns, we earn more certainty around the performance of the direct mail channel as a whole rather than the performance of any individual piece of mail.
Caveats
While there are many nuances to interpreting Lift, I will list a few key points to consider below.
The Myth of the Incremental Converter
In our ideal scenario, the converter announces themselves to the advertiser and assigns themselves to the channel that caused their conversion. In practice, this incremental converter does not exist. We can’t be certain of all the reasons a customer took an action; even if we surveyed individuals to ask them why, or used a promo code to tap into a sales funnel, both would only approximate the reasons behind customers’ actions. All we can glean from the data is that individuals who were mailed were more likely to convert than those who were not. This is still a strong argument, and the evidence points in the right direction, but it is not the sole data point through which a channel should be measured.
For instance, a strong impact with a very low Lift would imply that while the holdout set had a similar number of conversions, the direct mail channel was helping you tap into a different type of converter. Specifically, it would indicate that the direct mail channel is capturing individuals who respond better to direct mail advertising than any other channel. Just because you were able to get a similar number of converters in both groups at the end of an attribution window does not mean you were able to get the same type of converter in both groups (since we can’t explicitly identify the incremental converters).
If we were to look at the Lift value alone at the end of the attribution window, we may incorrectly conclude that it was an unsuccessful campaign. Therefore, we need to keep in mind that the incremental converter does not exist when interpreting lift metrics.
The Goldilocks Holdout Set
A key question we need to ask ourselves when setting up the test for incrementality is: where should we pull our holdout set?
Let’s re-visit one of the core assumptions we made in the initial setup of our experiment:
Let’s say we’re running a prospecting campaign and we assume that all of the client’s potential customers are equally likely to convert.
Here I’ve explicitly stated that we’re assuming all of the client’s potential customers are equally likely to convert. However, when we use Postie’s machine learning framework for prospecting, we already know that this isn’t the case. Our machine learning framework provides us a list of individuals in the continental United States ranked from most likely to least likely to convert; the individuals at the top of this list are much more likely to convert than those at the bottom. This gives us an additional factor to consider when setting up our experiment. Specifically, we need to decide from where in our distribution of households we should pull our holdout set.
Straw Man Holdout Sets
One approach we could take would be to pull holdout samples from the bottom half of this ranked list. The effect of this would be
- Higher reported lift, but a
- Lower certainty in isolating the effect of the advertisement itself.
- i.e., high scores in the mailed set are now competing with low scores in the holdout set, leaning into the strength of the machine learning model at the expense of being able to understand the actual effect of the advertisement on a customer.
This is referred to as a Straw Man holdout set because it selects a control group from a weaker subset of the population for our baseline metrics. As a result, reported lift metrics will be higher.
Steel Man Holdout Sets
Another approach we could take is to pull the holdout set from the top half of this ranked list. The effect of this would be
- Lower reported lift, but a
- Higher certainty in isolating the effect of the advertisement itself.
- i.e., high scores in the mailed set are now competing with high scores in the holdout set, helping us account for the strength of the model in selecting potential converters.
This is referred to as a Steel Man holdout set because it selects a control group from a stronger subset of the population for our baseline metrics. As a result, reported lift metrics will be lower.
Goldilocks Holdout Sets
Some holdout sets will be too weak, over-inflating lift projections and making it difficult to see the estimated effect of advertising. Others will be too strong, reducing the reported performance of lift and under-valuing the contribution of the channel. Therefore the advertiser must find a goldilocks zone somewhere within this spectrum that balances business needs and tolerance for uncertainty.
At Postie, we lean into conservative Steel Man holdout sets, sampling our holdouts from the top 10% (or higher) of individuals within our ranked list. The effect of this is generally much lower reported lift with much higher confidence that the effect derives from the advertisement itself rather than from random chance.
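Here's a sketch of how a Steel Man holdout might be drawn from a ranked list, assuming a top-10% cutoff as described above. The function name, sample sizes, and sampling approach are illustrative, not Postie's implementation:

```python
import random

def steel_man_holdout(ranked_households, holdout_size, top_frac=0.10, seed=7):
    """Draw the holdout from the top `top_frac` of a list ranked from most
    to least likely to convert; the rest of that top slice gets mailed."""
    rng = random.Random(seed)
    top = ranked_households[: int(len(ranked_households) * top_frac)]
    holdout = set(rng.sample(top, holdout_size))
    mailed = [h for h in top if h not in holdout]
    return mailed, sorted(holdout)

ranked = list(range(100_000))  # index 0 = most likely to convert
mailed, holdout = steel_man_holdout(ranked, holdout_size=2_000)
print(len(mailed), len(holdout))  # prints "8000 2000"
```

Because both groups come from the same top decile, any performance gap between them is easier to attribute to the mail itself rather than to the model's ranking.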
Overall, the key piece of information needed to compare lift between channels is where the holdout set comes from. In other words:
Show me the holdout set!
Opportunity Costs Associated with Lift
To continue to generate lift metrics, we need to continuously remove prospects from the top 10% of our ranked list and preserve them for our holdout group. Once we place these top prospects in our holdout group we can no longer mail them (as per the experimental setup). This can present an immense opportunity cost to our advertising campaigns, as we’re missing out on mailing some of the absolute top prospects in our mailing universe.
For a high-stakes example of the opportunity cost that calculating lift presents, consider a patient with a life-threatening papercut infection (an unfortunate but unavoidable occupational hazard in this line of work). A new drug has been developed called healitup which has demonstrated trial after trial that it can cure these specific papercut infections. After months of deliberate testing, the scientists can get a pretty solid handle on whether healitup can be considered an effective treatment across a variety of different patients over an extended period of time. At a certain point the FDA is going to rule healitup ready for widespread use so that it can be administered by doctors to cure papercut infections worldwide. After that point, the scientists could continue to test that drug over and over and over again, however that would require the establishment of additional control groups which would doom many individuals with papercut infections in the name of… exceptionally rigorous testing? At a certain point the trend becomes clear enough and given that the treatment is life-saving you don’t exactly want to be the one left in the control group.
Similarly (though thankfully less life-threatening), the top prospects in your control groups would be missing out on offers for products and services that they are prime candidates for. By continuing to optimize for lift, we need to constantly refresh these holdout sets and leave money on the table in the name of… exceptionally rigorous testing? Therefore, it often makes sense to pivot away from the measurement of lift once the benefit of the direct mail channel has been proven so that all top prospects can be mailed (and all potential conversions turned into real conversions).
Determining Holdout Set Sizes
One of the best ways to mitigate the high opportunity costs associated with holdout sets is to reduce their overall size.
Theoretically if we were to consider the set of all households in the continental United States then the size of the holdout set (H) could be as large as the total number of households in our addressable universe (U) minus the people we plan to mail (M), or
$$ H = U - M.$$
This may work for our initial campaign, but it breaks down as soon as we need to send another campaign (again, keep in mind that we can’t mail people in the holdout set). Additionally, to create our Steel Man holdout sets we’re still constrained to pulling from the top 10% of our model’s ranked list, and therefore can’t pull from all households in the United States. This is also before considering any additional suppressions we may need to apply due to geographic or compliance-related factors. Therefore, the holdout set must be pulled from a much smaller cohort of our top prospects.
As we reduce the number of places where we can collect our holdouts, we start to creep further and further into taking top ranked prospects out of our campaigns. It’s essential to keep in mind that the value of creating the holdout set is not to establish the performance of all non-mailed households, but instead to represent an estimate of a conversion rate for non-mailed households within a specific attribution window. Therefore, we don’t necessarily need the holdout set to be that large, and can often use much smaller holdout sets in our experimental setups. Below we can see an example where we would reach the same conclusion of 100% lift whether we used a holdout set of 90,000 or 10,000.
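To illustrate with hypothetical counts: if both holdouts convert at the same underlying 1% rate against a 2% rate in the mailed set, the 90,000-household and 10,000-household holdouts yield the same 100% lift.

```python
def lift(mailed_conv, mailed_total, holdout_conv, holdout_total):
    """Lift as defined earlier: ratio of conversion rates, minus 1."""
    return (mailed_conv / mailed_total) / (holdout_conv / holdout_total) - 1

# Both holdout sets share the same underlying 1% conversion rate,
# compared against a 2% conversion rate in the mailed set.
big   = lift(2_000, 100_000, 900, 90_000)  # 90,000-household holdout
small = lift(2_000, 100_000, 100, 10_000)  # 10,000-household holdout
print(f"{big:.0%} vs {small:.0%}")  # prints "100% vs 100%"
```

The smaller holdout estimates the same baseline rate with somewhat more noise, which is exactly the uncertainty trade-off described below.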
While reducing the size of the holdout set will create a slight increase in uncertainty, it frees up far more households for additional advertising.
The Compounding Effect of Lift
As mentioned in the above section about channel-level lift, we prefer to measure lift at the channel level. This is because, given a long enough timescale, the lift of any single piece from an individual campaign should approach 0. Over time, the effect of that one piece from that one campaign wears off. Both the mailed and holdout sets will likely be exposed to the same media in all other advertising channels, and the key differentiating factor in our test (the single piece of mail from that specific campaign) will play less of a role in differentiating the two groups as time goes on. Furthermore, there aren’t an unlimited number of prospects to choose from, so once all viable prospects have been advertised to, the mailed and holdout sets will need to be re-shuffled and released back into the roll call for subsequent campaigns. So does single-piece lift approaching 0 over a long timescale mean that there is no overall benefit derived from the direct mail channel? Not at all!
We can think about the compounding effect of lift in terms of muscle-building. Let’s say there is a bodybuilder who wants to put on muscle. Every day the bodybuilder burns 3,000 calories. On the first day of the bodybuilder’s diet plan they eat 3,300 calories, which means they ate an excess of 300 calories that day. This is a great start! However there isn’t much of a change in the bodybuilder’s physique after just one day of clean eating. Instead of giving up our bodybuilder remains consistent, and at the end of a 12-week diet plan they manage to put on 10 pounds of muscle (look out Arnold!).
Now let’s say that the bodybuilder stops training and eats exactly enough to maintain their new weight. After a long timeframe (let’s say a year), can we say that those specific 300 calories from the first day of their training are the ones responsible for the muscle that the bodybuilder currently has? Probably not. Given a long enough timeframe the effect of those specific 300 calories should approach 0. This is because one individual meal makes less and less of an effect on overall growth as time goes on. Even still, the new muscle remains because of a sustained, compounded effort to generate an excess of calories over an extended period of time.
Just like the excess calories compound to form muscle over time, the effect of direct mail advertising compounds to generate additional conversions over time. After a long enough timeframe the effect of any one advertisement may fade, however the results of the overall channel may still prove strong. This compounding effect of lift is another key reason why we encourage clients to analyze performance in aggregate and not at the individual piece level. Stick to the training plan and your direct mail channel can grow to be as strong as the Postie above.
Conclusion
In this post we provided a comprehensive overview of the concept of incremental lift (or simply, Lift), its experimental setup, and how it is calculated. We also discussed the temporal nature of lift and introduced the concept of impact (lift after a fixed time window of 14 days). Then, we talked about the compounding effect of lift and how it shapes performance analysis at the channel level. Finally, we discussed several caveats of the lift metric: the myth of the incremental converter, where to source holdout sets, the opportunity costs they carry, and how to determine their size.
Whew. That was a rollercoaster. While it’s not a perfect metric, hopefully with our more nuanced understanding of Lift our dreams of perfect attribution can now be slightly un-crushed, or at the very least the crushed-ness has been mildly reduced (mildly).
If you want to learn more about how practical applications of measuring performance can optimize direct mail for you and your brand, I’d love to chat.
Or if you’re simply impressed enough to see what Postie can do to optimize your direct mail campaigns, connect with Postie Today!