Significance is a core concept in statistics. Statistical significance (sometimes referred to as “stat sig”) tells a scientist whether there is a high probability that two or more samples have different average values. In simpler terms, it’s the engine behind A/B testing that helps us learn from and optimize our campaigns. As a data scientist at Postie, my team and I use this concept extensively, so in this post I’d like to help you build your understanding of it in an interactive, visual, and intuitive way. You’ll walk away with an intuitive understanding of how the underlying principles of significance are used in the advertising industry.

## Why Do We Need It?

Let’s say we have two creatives we want to test, Creative A and Creative B. The results of this test are shown below:

Creative | Reach | Conversions | CVR |
---|---|---|---|
Creative A | 50,000 | 1,500 | 3.000% |
Creative B | 50,000 | 1,499 | 2.998% |

Given these results, we want to know if Creative A is more likely to drive a higher number of conversions than Creative B. Specifically, do we have strong reason to believe that, if we ran the same test again, Creative A would generate more conversions than Creative B?

According to the results of this last test, Creative A was the “winner.” However, the results were **very** close. In fact, it might be too close to call, and you may be more comfortable saying the two have similar or even identical performance. Therefore, you decide to test Creative A and Creative B again (in equal quantities) and get the following results:

Creative | Reach | Conversions | CVR |
---|---|---|---|
Creative A | 50,000 | 1,530 | 3.060% |
Creative B | 50,000 | 1,490 | 2.980% |

The results still aren’t super clear. We know that as Creative A’s conversions increase and Creative B’s conversions decrease, we become more confident there is a difference. But at what point do you decide Creative A is good enough to stop testing Creative B? Be brave – this is where the blog becomes interactive. Make an estimate, as many times as you want, in the field below to get different results.

This is exactly the kind of problem that statistical significance can help us solve! In order to better understand (and see) where to declare a winner, we’ll first need to take a closer look at each creative’s distribution.

## Distributions

If we were to visualize your estimate without an understanding of distributions, it would look something like the following chart. Please note, here and in all future charts, the teal line represents Creative A and the blue line represents Creative B.

Here’s where we get interactive again. Try moving the sliders and playing around with the inputs.

**Notice how the slider on the y-axis (the slider that moves up and down) doesn’t change anything about the visual?** When we hold the conversion rate constant, we find one big issue with this representation: the same conversion rate looks identical whether it was achieved with 10k pieces or 85k pieces. In other words, as long as we get the same conversion rate, our understanding of the problem is exactly the same no matter how much we send.

I think we can all agree that we’ll learn more from a campaign where we send more creatives. We can use distributions to help us quantify the information we gain when we increase our sample sizes.

Distributions help us to better understand the uncertainty around a sample given its size. There are a variety of distributions that can be used to model all kinds of phenomena. The beta distribution is excellent at modeling proportions and is therefore perfect for the kind of problem we want to solve here. If you’re interested in learning more about beta distributions and how they work, you can read more about them in this National Institute of Standards and Technology handbook. For now, the only thing you need to know is that a beta distribution requires just two inputs: the number of successes (α) and the number of failures (β) from a given trial. That’s it! That means in our case:

Creative | Reach | Conversions | CVR | α | β |
---|---|---|---|---|---|
Creative A | 50,000 | 1,530 | 3.060% | 1,530 | 48,470 |
Creative B | 50,000 | 1,490 | 2.980% | 1,490 | 48,510 |
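As a quick sanity check on the table above, we can compute each beta distribution’s center and spread directly from its α and β. This is a minimal sketch using the standard Beta mean and variance formulas; the `beta_mean_std` helper name is my own. Note how the same CVR at a tenth of the reach produces a much wider (less certain) distribution:

```python
def beta_mean_std(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    n = alpha + beta
    mean = alpha / n
    var = alpha * beta / (n ** 2 * (n + 1))
    return mean, var ** 0.5

# Parameters straight from the table: alpha = conversions, beta = reach - conversions.
mean_a, std_a = beta_mean_std(1_530, 48_470)  # Creative A
mean_b, std_b = beta_mean_std(1_490, 48_510)  # Creative B
print(f"A: mean={mean_a:.3%}, std={std_a:.3%}")  # mean matches the 3.060% CVR
print(f"B: mean={mean_b:.3%}, std={std_b:.3%}")

# Same 3.06% CVR achieved with only 5k pieces sent: the spread roughly triples.
_, std_small = beta_mean_std(153, 4_847)
```

The means reproduce the CVR column exactly; the standard deviations are what the flat chart above was hiding.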

Each of our creatives can now be visualized with its own beta distribution. The newly minted y-axis includes a measure of confidence around our overall conversion rate. As we send more, our confidence narrows around this conversion rate. As we send less, our confidence drops and spreads out over a wider range of potential conversion rates. In the visual below, feel free to alter the results by moving the sliders to see how adjusting the Reach and CVR parameters changes our understanding of the results:

Fun, right?

## Battling Creatives

Now that we know how to use beta distributions to better understand our results, we can start to answer our original question:

**At which point would you draw the line and say Creative A is better than Creative B?**

At the start of this post, we started the estimate out with Creative A being better than Creative B when it had **1,530** conversions. Now that you can see the distributions of Creative A and Creative B, it’s time to get interactive again. Try adjusting Creative A’s number of conversions to determine where this difference would be stat sig. We’ll hold the reach parameter constant to stay within the reach budget of our next send. Try to get as close as you can to the lowest possible conversion difference that would still be considered stat sig:

Ok, enough guessing! There is a quick simulation we can run to determine the percent chance that Creative A is a better creative than Creative B. Remember, we are more confident that the CVR lands in some areas than in others. For instance, we are most confident that the CVR would land at the peak of the distribution (indicated by the vertical line in the middle of the distribution) and much less confident at the tails. What we’ll do is select a sample point from our distribution. The point we select will be based on our confidence around that point, with higher-confidence areas being more likely to be selected. We’ll say that the “winning” creative is the one whose sample has the higher value. Time to get interactive again, but this time, try taking a few samples by actually clicking on either of the distributions below:
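Clicking on a distribution is just drawing a random sample from it. Here’s a sketch of one such “click” using Python’s standard-library `random.betavariate`; the seed and variable names are my own:

```python
import random

random.seed(7)  # fixed seed so the draw is reproducible

# One draw from each creative's beta distribution (alpha = conversions,
# beta = reach - conversions), i.e. one plausible "true" CVR for each.
sample_a = random.betavariate(1_530, 48_470)  # Creative A
sample_b = random.betavariate(1_490, 48_510)  # Creative B

winner = "A" if sample_a > sample_b else "B"
print(f"A drew {sample_a:.4%}, B drew {sample_b:.4%}: Creative {winner} wins this round")
```

Each run of this snippet is one round of the game; points near a distribution’s peak are drawn far more often than points out in its tails.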

Now that we have an understanding of how we can sample points from our distribution, we can play a probability game to determine the percent chance that a value from Creative A will be greater than a value from Creative B. Just like in the last section, we will sample one value at a time from each distribution and call the higher value the “winner.” Then we’ll calculate the win rate of each distribution to determine which has the higher overall chance of being the winner. We can sample each distribution 100 times to speed up the process and collect our results in the table below. At the end, we will have a number that represents the probability that Creative A would perform better than Creative B. This time, when you click on the graph you’ll see the simulation run 100 times. Try it out below:

Creative | Reach | Conversions | Samples | Wins | Win Rate |
---|---|---|---|---|---|
Creative A | 50,000 | 1,530 | 0 | 0 | 0% |
Creative B | 50,000 | 1,490 | 0 | 0 | 0% |
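The 100-draw game scales easily in code. Here’s a minimal sketch of the same Monte Carlo win-rate calculation (the `win_rate` helper and seed are my own); with enough samples the win rate settles to a stable probability:

```python
import random

random.seed(42)

def win_rate(alpha_a, beta_a, alpha_b, beta_b, n_samples=100_000):
    """Fraction of paired draws where Creative A's sampled CVR beats Creative B's."""
    wins = sum(
        random.betavariate(alpha_a, beta_a) > random.betavariate(alpha_b, beta_b)
        for _ in range(n_samples)
    )
    return wins / n_samples

rate = win_rate(1_530, 48_470, 1_490, 48_510)
print(f"Creative A beats Creative B in about {rate:.0%} of simulated trials")
```

With 1,530 vs. 1,490 conversions over 50k pieces each, Creative A wins most rounds, but nowhere near often enough to clear a 95% threshold.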

What you’ve just done is computationally simulate the results of what’s called a one-tailed t-test. Nice! It’s referred to as a one-tailed test because we’re only focused on whether one tail of our distribution overlaps with the other (i.e., whether Creative A’s performance is *greater than* Creative B’s).

## Declaring Statistical Significance / Confidence of Results

Those of you who have been paying close attention may still be wondering whether the results you achieved above were actually statistically significant. To determine that, you would need to decide on a win rate that would declare Creative A’s performance “better” *before* running the experiment. If that threshold is exceeded, we declare the results statistically significant. Of course, this raises the question of what makes the optimal threshold. Is it best practice to declare Creative A the winner when it consistently wins 80%, 90%, or 95% of the time in the simulation above? Why not when it wins 83.6% or 92.6736% of the time?

A common misconception is that the optimal threshold is set in stone. While we can’t change this value after our experiment has been run (it would be too easy to bias the results in our favor), the initial threshold is somewhat subjective. The standard value in research tends to be 95%, but even this is relatively arbitrary and should be changed based on the experimental setup. Occasionally our clients will want to know whether performance was stat sig, and if we’re forced to select a win rate we’ll generally declare 80% a *strong enough* signal in the highly variable industry of advertising. However, even this method of threshold selection can be improved upon.

For example, let’s say we run the simulation above and the client decides that significance will be achieved when Creative A has a win rate of 95%. After running the simulation one million times we find that there is a 94.99999% chance that Creative A performs better than Creative B. If you wanted to stay within the confines of the t-test you would need to declare the results inconclusive. However, if this declaration makes you uncomfortable because the results were very close, then you have the same intuition as a Postie Data Scientist!

Unlike trials in academic research that are trying to publish results that declare “we found something!” or “our results were inconclusive!” to secure another round of funding, we have thousands if not hundreds of thousands of opportunities to run more trials and tests. Therefore, it makes more sense to leverage this nuanced understanding of performance to fine-tune our approach over time. In this case we might not determine a threshold and declare a winner, but instead start to send more of Creative A and less of Creative B. We would continue to send Creative A as long as it continues to perform well; however, Creative B would always have a chance to be re-introduced if Creative A starts to lose performance or Creative B starts to show signs of improvement.

Additionally, the questions that arise around a t-test are endless, and in many cases the test produces more questions than answers. What if the win rate for Creative A is very high, but we only sent 10k pieces of mail for each creative instead of 50k? Is the result less stat sig? What if Creative A and Creative B are sent in different quantities? Can we still call that stat sig?
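The simulation approach handles that last question directly: unequal send quantities just mean different α and β values, and the wider distribution of the smaller sample automatically tempers the win rate. A sketch, assuming the CVRs stay the same at the smaller reach (the `win_rate` helper and seed are my own):

```python
import random

random.seed(0)

def win_rate(alpha_a, beta_a, alpha_b, beta_b, n_samples=100_000):
    """Fraction of paired beta draws where Creative A beats Creative B."""
    wins = sum(
        random.betavariate(alpha_a, beta_a) > random.betavariate(alpha_b, beta_b)
        for _ in range(n_samples)
    )
    return wins / n_samples

# Creative A at the full 50k reach (3.06% CVR) vs. Creative B at 50k (2.98% CVR)...
rate_large = win_rate(1_530, 48_470, 1_490, 48_510)
# ...and the same CVRs, but with Creative A sent only 10k pieces (306 conversions).
rate_small = win_rate(306, 9_694, 1_490, 48_510)
print(f"A at 50k reach: {rate_large:.0%} win rate; A at 10k reach: {rate_small:.0%}")
```

The smaller send produces a lower win rate even though the conversion rates are identical: less mail simply means less evidence.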

It may be more intuitive (and effective) to present these results in terms of their confidence intervals rather than just presenting their win rates or declaring them statistically significant. A wider confidence interval would indicate that the results had a lower sample size and therefore indicate lower certainty in the results. A narrow confidence interval indicates a large sample size and therefore higher certainty in the results. As we run more trials this confidence interval would narrow and we could get a stronger intuition about the overall results of the test. The exact details of this approach would be best covered in another blog post, but for the sake of your curiosity, the results of the trial you ran above could be communicated in the following way:

Creative A improved CVR over Creative B by 0 ± 0 (80% CI), which seems much more intuitive. Results for different confidence intervals are also displayed below:

Confidence | Expected Value | Confidence Interval | Low Estimate | High Estimate | Statistically Significant |
---|---|---|---|---|---|
80% | 0 | 0 | 0 | 0 | |
90% | 0 | 0 | 0 | 0 | |
95% | 0 | 0 | 0 | 0 | |

The high and low estimates are calculated by adding and subtracting the confidence interval from the expected value. If the range between the low estimate and high estimate contains 0, then the results were not statistically significant at the given confidence level. This doesn’t mean we didn’t find anything; it just means the number of trials we ran did not provide strong enough evidence to say the performance of the two creatives was different.
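One way to produce those interval estimates is straight from the same simulation: sort the sampled CVR lifts and read off percentiles. A sketch using our earlier test numbers (the `diff_interval` helper and seed are my own, and this is a percentile interval from the simulation rather than a textbook t-interval):

```python
import random

random.seed(1)

def diff_interval(confidence, n_samples=100_000):
    """Percentile interval for Creative A's CVR lift over Creative B."""
    diffs = sorted(
        random.betavariate(1_530, 48_470) - random.betavariate(1_490, 48_510)
        for _ in range(n_samples)
    )
    tail = (1 - confidence) / 2
    low = diffs[int(tail * n_samples)]
    high = diffs[int((1 - tail) * n_samples) - 1]
    return low, high

for conf in (0.80, 0.90, 0.95):
    low, high = diff_interval(conf)
    verdict = "stat sig" if (low > 0 or high < 0) else "not stat sig"
    print(f"{conf:.0%} CI for the lift: [{low:+.3%}, {high:+.3%}] -> {verdict}")
```

If an interval straddles zero, the result at that confidence level is inconclusive, which is exactly the rule stated above; running more trials would shrink the intervals.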

## Conclusion

I hope this post provided a little bit of clarity about what’s going on under the hood of statistical significance testing. We’ve deliberately avoided talking about null and alternative hypotheses, p-values, t-tests, and the details of confidence intervals for the sake of building intuition in a more visual and approachable way. Specifically, in this post we covered:

- What a distribution is and why it’s important
- How to sample from distributions
- How sampling from these distributions helps us determine stat sig performance

If you want to learn more about how practical applications of these techniques can optimize direct mail for you and your brand, I’d love to chat.

Or if you’re simply impressed enough to see what Postie can do to optimize your direct mail campaigns, connect with Postie Today!