Making Sense of A/B Testing: Understand Better with Hard Questions | by Aliaksandr Kazlou | Jul, 2023

Question 5: You’ve conducted an A/B test that produces a significant result, with a p-value of 0.04. However, your boss remains unconvinced and asks for a second test. This subsequent test doesn’t yield a significant result, presenting a p-value of 0.25. Does this mean that the original effect wasn’t real, and the initial result was a false positive?

There’s always a risk in interpreting p-values as a binary, lexicographic decision rule. Let’s remind ourselves what a p-value actually is: a measure of surprise. It is random, it is continuous, and it is only one piece of evidence.

Imagine the first experiment (p=0.04) was run on 1,000 users and the second one (p=0.25) on 10,000 users. Apart from the noticeable difference in quality (the larger test yields a far more precise estimate), the second A/B test, as we discussed in Questions 3 and 4, probably had a much smaller estimated effect size, one that might no longer be practically significant.

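Let’s reverse this scenario: the first one (p=0.04) was run on 10,000 users and the second one (p=0.25) on 1,000 users. Here we are much more confident that the effect ‘exists’, because the significant result comes from the larger, more precise test, while the small follow-up had little power to detect the effect in the first place.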
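To make the two scenarios above concrete, here is a minimal sketch that backs out the effect size implied by a p-value and a sample size. It rests on assumptions that are not in the original question: a two-sided z-test, a 50/50 split between variants, and an effect expressed in units of the metric’s standard deviation. The function name `implied_standardized_effect` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def implied_standardized_effect(p_value: float, total_users: int) -> float:
    """Absolute effect, in standard-deviation units, implied by a two-sided p-value.

    Assumes a z-test on the difference in means and an equal split of users.
    """
    # z-score whose two-sided tail probability equals the p-value
    z = NormalDist().inv_cdf(1 - p_value / 2)
    n_per_arm = total_users / 2
    # Standard error of the difference in means when the metric has unit variance
    standard_error = sqrt(2 / n_per_arm)
    return z * standard_error


scenarios = {
    "first scenario, test 1 (p=0.04, n=1,000)": (0.04, 1_000),
    "first scenario, test 2 (p=0.25, n=10,000)": (0.25, 10_000),
    "reversed, test 1 (p=0.04, n=10,000)": (0.04, 10_000),
    "reversed, test 2 (p=0.25, n=1,000)": (0.25, 1_000),
}

for label, (p, n) in scenarios.items():
    print(f"{label}: implied effect = {implied_standardized_effect(p, n):.3f} SD")
```

Under these assumptions, the first scenario pairs a large, noisy estimate with a much smaller but far more precise one, whereas in the reversed scenario the two estimates are of the same order of magnitude and the small test is simply too noisy to confirm or contradict the large one.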

Now, imagine both A/B tests were identical. In this situation, you’ve observed two fairly similar, somewhat surprising results, neither of which is very consistent with the null hypothesis. The fact that they fall on opposite sides of 0.05 is not terribly important. What’s important is that observing two small p-values consecutively when the null is true is unlikely.

One question we might consider is whether this difference is statistically significant itself. Categorising p-values in a binary way skews our intuition, making us believe there’s a vast, even ontological, difference between p-values on different sides of the cutoff. However, the p-value is a fairly continuous function, and it might be possible that two A/B tests, despite different p-values, present very similar evidence against the null [2].
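One rough way to check whether the difference between the two results is itself significant is to compare their z-scores directly. The sketch below is an illustration, not the author’s calculation, and it assumes things the question does not state: both p-values are two-sided, both effects point in the same direction, and the two tests are identical in design so their standard errors are equal. The function name `p_value_of_difference` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def p_value_of_difference(p1: float, p2: float) -> float:
    """Two-sided p-value for the difference between two test results.

    Assumes two-sided p-values, same-sign effects, and equal standard errors.
    """
    normal = NormalDist()
    # Convert each two-sided p-value back to the magnitude of its z-score
    z1 = normal.inv_cdf(1 - p1 / 2)
    z2 = normal.inv_cdf(1 - p2 / 2)
    # The difference of two independent unit-variance z-scores has variance 2
    z_diff = (z1 - z2) / sqrt(2)
    return 2 * (1 - normal.cdf(abs(z_diff)))


print(f"p-value for the difference: {p_value_of_difference(0.04, 0.25):.2f}")
```

Under these assumptions the difference between p=0.04 and p=0.25 has a p-value of roughly 0.5, so the two tests do not contradict each other in any meaningful sense.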

Another way to look at this is to combine the evidence. Assuming the null hypothesis is true for both tests and the tests are independent, the combined p-value is about 0.056, according to Fisher’s method, which is close to the conventional cutoff even though one of the two results was far from it. There are other methods to combine p-values, but the general logic remains the same: a sharp null isn’t a realistic hypothesis in most settings. Therefore, enough ‘surprising’ outcomes, even if none of them is statistically significant individually, might be sufficient to reject the null.
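For reference, Fisher’s method can be sketched in a few lines. The statistic is -2 times the sum of the log p-values, which follows a chi-squared distribution with 2k degrees of freedom under the null when the k tests are independent. Because the degrees of freedom are always even, the tail probability has a closed form and no statistics library is needed. The function name `fisher_combined_p` is chosen here for illustration.

```python
from math import exp, factorial, log


def fisher_combined_p(p_values: list[float]) -> float:
    """Combine independent p-values with Fisher's method.

    Uses the closed-form chi-squared survival function for even degrees
    of freedom (2k, where k is the number of p-values).
    """
    k = len(p_values)
    statistic = -2 * sum(log(p) for p in p_values)
    half = statistic / 2
    # P(chi2 with 2k df > statistic) = exp(-half) * sum_{i<k} half**i / i!
    return exp(-half) * sum(half**i / factorial(i) for i in range(k))


print(f"combined p-value: {fisher_combined_p([0.04, 0.25]):.3f}")
```

For the two tests in this question the result is approximately 0.056. Note that this treats the two tests as independent and ignores their sample sizes, so it is a quick summary of the evidence rather than a replacement for pooling the underlying data.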