Peeking problem in A/B testing: Frequentist approach

By Abdelhak Chahid

8 min read

In our current state of digitization, data-based decision-making has become an integral aspect of online businesses. Evidence-based practice, the paradigm used to make those informed decisions, relies on the use of data and solid statistical methods to solve real problems.

Peeking problem in A/B testing: Frequentist approach
Authors

A/B testing is one of the most commonly used statistical methods and the most of successful applications of statistical inferences in the internet age, particularly in online settings for data-based decisions.

What Is A/B Testing?

A/B testing, also known as split testing or hypothesis testing, is a statistical technique to compare two or more groups to determine whether the difference in a metric between the groups can be attributed to variable changes rather than pure chance. In A/B testing, A refers to the original testing variable. i.e. original webpage (Null hypothesis). Whereas B refers to ‘variation’ or a new version of the original testing variable (Alternative hypothesis).

Most of the A/B tests are conducted using the frequentist statistical test namely the t-test or z-test. Frequentist school is the most used approach when it comes to statistical inference, this is commonly taught at the high school and university levels.

The experimenters using this approach are interested in the p-value1 resulting from the statistical test to reject the null hypothesis H0 when the p-value is less than a prescribed significance level2 α (often 0.05).

Let us clarify this further through an example: Assume a marketeer may want to determine the effect of an ad color on the Click-through rate CTR3. One can hypothesize that a green background of an ad will increase the CTR. To provide data-based evidence for such a claim, by following the frequentist approach, a web developer might be consulted to create two web pages, which should be the same except for the change. Then these web versions will be shown simultaneously and randomly to two different segments of visitors to determine which version does increase CTR. This experiment will be kept running for a while until the prespecified sample size is achieved, Subsequently, a statistical test can be computed to compare CTR between two groups of visitors. If the test is statistically significant, one will have enough evidence to accept the new web version and remove the original. Otherwise, one may conclude that there was not enough evidence for the new version. By statistical significance we mean the p-value is smaller than a prespecified threshold, usually 0.05. When this is the case we reject the Null hypothesis (original web version) and accept the Alternative hypothesis (new web version). At this stage, we can say with a high degree of certainty that the observed difference between the web versions is statistically significant and is not attributed to chance.

Introducing the peeking problem

The rapid rise of online service platforms nowadays that provide A/B testing implementation, has led to an increase in continuous monitoring of the test results. In this setting. Marketers without statistical knowledge might be attempted to run the experiment and control the statistical results directly live while the experiment is running. A misleading error is emerging when continuously monitoring the A/B test, namely the p-value peeking.

What is p-value peeking?

While the experiment is running, many mistakes can be made. One of these mistakes is tricky and called “p-value peeking”. This mistake is a consequence of continuous monitoring of the A/B test results and taking action before the A/B test is over. Even those who were taught how to perform A/B tests and report the statistical results are still making this misleading mistake. As a consequence of this type of behavior, users have been drawing inferences that are not supported by their data across many industries.

Example of peeking problem

In an article published by Johari et. al, it has been shown that continuous monitoring in conjunction with early stopping can lead to highly inflated type I error4 rates. By type I error or the false positive we mean the number of false results that were deemed as significant whereas, in reality, these are not. In simple English: an innocent person is convicted.

From a practical point of view early stopping of a running experiment is more beneficial in terms of costs, so there is value to detecting true effects as quickly as possible or giving up if it appears that no effect will be detected soon so that they may test something else.

Unfortunately, when this type of behavior is shown within a frequentist approach without following the steps we discussed early, then the entire statistical results reliability is undermined.

Simulation

Enough theory, for now, Let us illustrate what we have described with a simulation case in Python. We will describe two scenarios. Here we want to show the effect of continuous monitoring on picking the wrong web version that was declared to be of statistical significance, whereas, in reality, this is not true.

Scenario I: valid A/B test

Suppose we learned from an early experiment that CTR was 0.25, or we expect to achieve such a CTR. Further, we may want a minimum detectable effect (MDE)5 to be 0.1, with a power6 of 0.80 and a significance level of 0.05. One important question will emerge here: how many users do we need to incorporate into each group to detect such a difference?

To calculate the number of visitors that are needed to detect such a difference we can use the Power calculation built-in function in Python.

Sample size calculation

According to the input we provide, we will need at least N=1571 observations in each stream (rounded up).

Let us generate two streams of Bernoulli data: two streams of ones and zeros describing whether we observed a click or an abandonment in each stream. What we want to answer is: is there a statistically significant difference between the measured average number of CTRs of the two streams?

Once we get at least 1571 observations in each stream (figure 2), we can run a statistical test, and determine the p-value to evaluate if there is evidence for version A or B version.

Sample size calculation

After the experiment has included the minimum required sample size, we can safely now evaluate the results and extract a valid statistical conclusion.

Scenario II: continuous monitoring

Suppose, we simulate the same study, but now we let the effect be 0, i.e. there is no difference between version A and B. In addition, we want to see what will happen if we continuously look at the results and stop this experiment as early as the p-value is deemed to be significant. What is wrong with this behavior?

Sample size calculation

In figure 3, we see that the p-values oscillate near the alpha level at the start, and in some cases, it does cross that red region (significance region). As the sample size increase, we see a runaway of this level. Because the p-value had already crossed the significance level several times. Anyone who is continuously monitoring this experiment might have incorrectly called it a winner early on, hence stopping the experiment.

How to avoid peeking problem

If you run A/B test experiments within a frequentist paradigm: the best way to avoid peeking at the p-value is to follow A/B testing steps: one must decide in advance how many observations one needs for the test, run the experiment and wait until the experiment achieved the minimum required users, compute the statistical test, evaluate the results and make a decision. Of note, Peeking at the p-value is not wrong as long as you can restrain yourself from stopping the experiment before it terminates.

Conclusion

Online A/B testing analytical platforms are powerful and convenient, However, these tools should be used with a lot of caution. Any time they are used with a manual or automatic stopping rule, the resulting significance tests are simply invalid as long as the experiment did not reach the required sample size.

In this article, we discussed why continuous monitoring violates the existing paradigm within the frequentist approach and leads to invalid inference. Furthermore, we have demonstrated empirically the pitfalls of continuous monitoring.

The online platform on the internet has changed the paradigm of A/B testing and has led to the following challenge: is it possible to develop a method and deploy it into a platform where the user will be continuously monitoring experiments whereas the false positive and experiment runtime will still be kept under control?

In the next articles, we will touch upon some suggested methods that allow users to control the false positive rate and improve the user’s ability to tradeoff between detection of real effects and the length of the experiment, while continuously monitoring the experiment.

GitHub repository for the Python code: https://github.com/achahid/A_B_testingPeekingProblem

Footnotes

  1. P-value is the probability of the observed test results being random and not due to actual evidence. Therefore, the smallest the p-value is the lower the probability that the finding is random.

  2. Significance level also known as alpha is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference.

  3. Clickthrough rate CTR is the number of clicks that your ad receives divided by the number of times your ad is shown: clicks ÷ impressions = CTR. For example, if you had 5 clicks and 100 impressions, then your CTR would be 5%.

  4. Type I Error or False Positive occurs when a researcher incorrectly rejects a true null hypothesis. Hence the reported findings are significant when in fact they have occurred by chance.

  5. Minimum Detectable Effect (MDE) is the effect size that if truly exists can be detected with a given probability with a statistical test

  6. The statistical power is the probability of a statistical test detecting an effect if there is a true effect present. With a power of 80%, we are 20% of the time missing to detect the true effect.


Upcoming events

  • Coven of Wisdom - Herentals - Winter `24 edition

    Worstelen jij en je team met automated testing en performance? Kom naar onze meetup waar ervaren sprekers hun inzichten en ervaringen delen over het bouwen van robuuste en efficiënte applicaties. Schrijf je in voor een avond vol kennis, heerlijk eten en een mix van creativiteit en technologie! 🚀 18:00 – 🚪 Deuren open 18:15 – 🍕 Food & drinks 19:00 – 📢 Talk 1 20:00 – 🍹 Kleine pauze 20:15 – 📢 Talk 2 21:00 – 🙋‍♀️ Drinks 22:00 – 🍻 Tot de volgende keer? Tijdens deze meetup gaan we dieper in op automated testing en performance. Onze sprekers delen heel wat praktische inzichten en ervaringen. Ze vertellen je hoe je effectieve geautomatiseerde tests kunt schrijven en onderhouden, en hoe je de prestaties van je applicatie kunt optimaliseren. Houd onze updates in de gaten voor meer informatie over de sprekers en hun specifieke onderwerpen. Over iO Wij zijn iO: een groeiend team van experts die end-to-end-diensten aanbieden voor communicatie en digitale transformatie. We denken groot en werken lokaal. Aan strategie, creatie, content, marketing en technologie. In nauwe samenwerking met onze klanten om hun merken te versterken, hun digitale systemen te verbeteren en hun toekomstbestendige groei veilig te stellen. We helpen klanten niet alleen hun zakelijke doelen te bereiken. Samen verkennen en benutten we de eindeloze mogelijkheden die markten in constante verandering bieden. De springplank voor die visie is talent. Onze campus is onze broedplaats voor innovatie, die een omgeving creëert die talent de ruimte en stimulans geeft die het nodig heeft om te ontkiemen, te ontwikkelen en te floreren. Want werken aan de infinite opportunities van morgen, dat doen we vandaag.

    | Coven of Wisdom Herentals

    Go to page for Coven of Wisdom - Herentals - Winter `24 edition
  • Mastering Event-Driven Design

    PLEASE RSVP SO THAT WE KNOW HOW MUCH FOOD WE WILL NEED Are you and your team struggling with event-driven microservices? Join us for a meetup with Mehmet Akif Tütüncü, a senior software engineer, who has given multiple great talks so far and Allard Buijze founder of CTO and founder of AxonIQ, who built the fundaments of the Axon Framework. RSVP for an evening of learning, delicious food, and the fusion of creativity and tech! 🚀 18:00 – 🚪 Doors open to the public 18:15 – 🍕 Let’s eat 19:00 – 📢 Getting Your Axe On Event Sourcing with Axon Framework 20:00 – 🍹 Small break 20:15 – 📢 Event-Driven Microservices - Beyond the Fairy Tale 21:00 – 🙋‍♀️ drinks 22:00 – 🍻 See you next time? Details: Getting Your Axe On - Event Sourcing with Axon Framework In this presentation, we will explore the basics of event-driven architecture using Axon Framework. We'll start by explaining key concepts such as Event Sourcing and Command Query Responsibility Segregation (CQRS), and how they can improve the scalability and maintainability of modern applications. You will learn what Axon Framework is, how it simplifies implementing these patterns, and see hands-on examples of setting up a project with Axon Framework and Spring Boot. Whether you are new to these concepts or looking to understand them more, this session will provide practical insights and tools to help you build resilient and efficient applications. Event-Driven Microservices - Beyond the Fairy Tale Our applications need to be faster, better, bigger, smarter, and more enjoyable to meet our demanding end-users needs. In recent years, the way we build, run, and operate our software has changed significantly. We use scalable platforms to deploy and manage our applications. Instead of big monolithic deployment applications, we now deploy small, functionally consistent components as microservices. Problem. Solved. Right? Unfortunately, for most of us, microservices, and especially their event-driven variants, do not deliver on the beautiful, fairy-tale-like promises that surround them.In this session, Allard will share a different take on microservices. We will see that not much has changed in how we build software, which is why so many “microservices projects” fail nowadays. What lessons can we learn from concepts like DDD, CQRS, and Event Sourcing to help manage the complexity of our systems? He will also show how message-driven communication allows us to focus on finding the boundaries of functionally cohesive components, which we can evolve into microservices should the need arise.

    | Coven of Wisdom - Utrecht

    Go to page for Mastering Event-Driven Design
  • The Leadership Meetup

    PLEASE RSVP SO THAT WE KNOW HOW MUCH FOOD WE WILL NEED What distinguishes a software developer from a software team lead? As a team leader, you are responsible for people, their performance, and motivation. Your output is the output of your team. Whether you are a front-end or back-end developer, or any other discipline that wants to grow into the role of a tech lead, RSVP for an evening of learning, delicious food, and the fusion of leadership and tech! 🚀 18:00 – 🚪 Doors open to the public 18:15 – 🍕 Let’s eat 19:00 – 📢 First round of Talks 19:45 – 🍹 Small break 20:00 – 📢 Second round of Talks 20:45 – 🙋‍♀️ drinks 21:00 – 🍻 See you next time? First Round of Talks: Pixel Perfect and Perfectly Insane: About That Time My Brain Just Switched Off Remy Parzinski, Design System Lead at Logius Learn from Remy how you can care for yourself because we all need to. Second Round of Talks: Becoming a LeadDev at your client; How to Fail at Large (or How to Do Slightly Better) Arno Koehler Engineering Manager @ iO What are the things that will help you become a lead engineer? Building Team Culture (Tales of trust and positivity) Michel Blankenstein Engineering Manager @ iO & Head of Technology @ Zorggenoot How do you create a culture at your company or team? RSVP now to secure your spot, and let's explore the fascinating world of design systems together!

    | Coven of Wisdom - Amsterdam

    Go to page for The Leadership Meetup

Share