Don’t let the A/B tests shape your product

Miguel Carruego
8 min read · Apr 18, 2021

Some time ago, I took part in a recruiting process for a Product Management position at a prestigious technology company. I don’t need to tell you that I was obviously rejected, but what really threw me off was the reason they gave me: “you don’t know enough about A/B testing”.

Don’t get me wrong, I really appreciate feedback and I always see it as a learning opportunity. But this made no sense to me. How is proficiency in A/B testing an indicator of a good Product Manager? And just how highly do companies regard A/B testing?

Let me be very clear: statistics is the ultimate prediction tool, and all those machine-learning algorithms (a form of statistics on steroids) shaping our lives confirm it. Scientific experimentation is the only truly empirical method to validate hypotheses, and there is no questioning that.

But (there is always a but), when we put A/B testing, or any other form of split testing in digital products, under the microscope, the methodology shows some evident flaws.

Let’s review some of the typical pitfalls and misuses of statistics in Product Management.

Survivorship bias

When the Nazis annexed Austria in 1938, the Jewish mathematician Abraham Wald was living in Vienna and working at the Austrian Institute for Economic Research. Thanks to an invitation from the Cowles Commission for Research in Economics, then based in Colorado, he was able to emigrate to the US and save his life. After just a few months in Colorado he moved to Columbia University in New York, where during World War II he joined the Statistical Research Group (SRG). One of the problems the SRG worked on was examining the distribution of damage to aircraft returning from missions, in order to advise the military on how to minimize bomber losses to enemy fire.

So here’s the question. You don’t want your planes to get shot down by enemy fighters, so you armor them. But armor makes the plane heavier, and heavier planes are less maneuverable and use more fuel. Armoring the planes too much is a problem; armoring the planes too little is a problem. Somewhere in between there’s an optimum. (…)
The military came to the SRG with some data they thought might be useful. When American planes came back from engagements over Europe, they were covered in bullet holes. But the damage wasn’t uniformly distributed across the aircraft. There were more bullet holes in the fuselage, not so many in the engines. (…)
The armor, said Wald, doesn’t go where the bullet holes are. It goes where the bullet holes aren’t: on the engines. (…)
The reason planes were coming back with fewer hits to the engine is that planes that got hit in the engine weren’t coming back.

From ‘How Not to Be Wrong: The Hidden Maths of Everyday Life’ by Jordan Ellenberg

The distribution of bullet holes in planes. From Wikipedia

This is what is known as survivorship bias, a type of logical error that consists of concentrating on the things that made it past some selection process and overlooking those that did not, typically because of their lack of visibility. This can lead to some false conclusions in several different ways. It is a form of selection bias.

Now think about your split tests. How do you know that the most important piece of information comes from comparing conversions, and not from the people who didn’t make it through the funnel? The truth is, you probably don’t know, simply because you can’t observe it.

The main problem lies in the fact that most split-testing analysis is, in reality, funnel analysis, and by design funnel analysis overlooks everything that doesn’t fall into the funnel.
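
To make the blind spot concrete, here is a minimal sketch in Python (with pandas); the assignment table, event log, column names and numbers are all invented for illustration. It shows how a conversion comparison conditions on the users who entered the funnel and silently drops everyone else:

```python
import pandas as pd

# Hypothetical assignment table: every user who was bucketed into the test.
assignments = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "variant": ["A", "A", "A", "B", "B", "B"],
})

# Hypothetical event log: only users who actually showed up appear here.
events = pd.DataFrame({
    "user_id": [1, 2, 4, 4, 5],
    "step":    ["landing", "landing", "landing", "checkout", "landing"],
})

funnel = assignments.merge(events, on="user_id", how="left")

# What the split test usually reports: conversion among funnel entrants.
entered = funnel[funnel["step"] == "landing"].groupby("variant")["user_id"].nunique()
converted = funnel[funnel["step"] == "checkout"].groupby("variant")["user_id"].nunique()
print("conversion:", (converted.reindex(entered.index, fill_value=0) / entered).to_dict())

# What it never reports: users assigned to a variant who produced no events
# at all (bounced, crashed, gave up before the first step we track).
missing = funnel[funnel["step"].isna()].groupby("variant")["user_id"].nunique()
print("never entered the funnel:", missing.to_dict())
```

If those silent drop-outs are not spread evenly across variants, the comparison is made only among the “survivors”: exactly Wald’s bullet-hole problem.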

The P-value trap

When asking “does X affect Y?”, it is common to vary X, measure the resulting variation in Y, and compute a p-value: the probability of observing a difference at least this large if X actually had no effect on Y. If this p-value is less than some predetermined statistical significance threshold α (typically .05), the result is considered “significant”.
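
As a minimal sketch of those mechanics, here is what the computation could look like in Python with statsmodels, using a two-proportion z-test; the conversion counts are made up:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: 420/10,000 conversions for A, 480/10,000 for B.
conversions = [420, 480]
visitors = [10_000, 10_000]

z_stat, p_value = proportions_ztest(conversions, visitors)
alpha = 0.05

print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}")
print("'statistically significant'" if p_value < alpha else "not significant")
# All the p-value says is how surprising a gap at least this large would be
# if A and B truly converted at the same rate. Nothing more, nothing less.
```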

And here is where language and confusion come into play.

Assuming that “p < .05” is a synonym for “true” and “p > .05” a synonym for “false” is just plain wrong. There is no such thing as a scientific truth behind statistical data, there is only inference, and understanding inference and significance is hard. “Statistically significant” means just that, and nothing more. Any further conclusion drawn from it usually involves an arbitrary leap in the thought process.

This typically results in misunderstandings of the p-value or, in other cases, in its intentional misuse to manipulate people’s opinions.

From xkcd.

While the Socratic method and the logical principle of reductio ad absurdum are the best ways to refute dogma, they cannot be used as an indication of truth. By definition, nobody owns the truth. So they are fine for fighting your religious uncle at Christmas dinner, but not for running a product.

This is why I get the chills when I hear companies brag about being ‘data-driven’. And because ‘driven’ can mean different things to different people, I prefer the term ‘data-informed’ as a more precise way to explain how I make decisions.

If an A/B test with a p-value of .02 showed you that adding a dancing Jesus on your online store increases sales by 10%, would you do it? Would you add the dancing Jesus? There is probably no right answer to this question. The only honest answer is: it depends.

From ‘The Simpsons’.

There are different approaches to getting out of this logical pickle, such as reporting confidence intervals, but I won’t go deep into them here since I’m not that proficient in statistics.
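
For the curious, here is a rough sketch of what reporting a confidence interval could look like: a plain normal-approximation interval for the difference in conversion rates, with invented counts.

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical counts: 420/10,000 conversions for A, 480/10,000 for B.
conv_a, n_a = 420, 10_000
conv_b, n_b = 480, 10_000

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a
se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

z = norm.ppf(0.975)  # 95% two-sided interval
low, high = diff - z * se, diff + z * se
print(f"B - A = {diff:.2%}, 95% CI [{low:.2%}, {high:.2%}]")
# An interval that barely excludes zero tells a much humbler story than
# "p < .05, ship it".
```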

In the words of Jordan Ellenberg, always remember this:

P-value is the detective, not the judge.

Efficacy is not efficiency

Now let’s assume you are testing two different versions of an onboarding flow. You run an A/B test and version B ends up with a 25% increase in conversion rate (conversion meaning people buying your product). It feels like an easy decision to implement version B, right? If more people are buying your product, it has to mean that the flow explains your value proposition better, right?

Well, as you imagined, no: it doesn’t mean anything to the user. It only means something for your company. A higher conversion rate does not imply a better user experience.

A better solution to a problem has to be both more effective and more efficient. In digital products, this translates into “more users do it” and “it’s easier for each of them”. The problem with split testing is that it usually measures only efficacy, not efficiency or ease of use for the user.
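
As a sketch of what a more complete read-out could look like, here is a hypothetical example that puts efficiency signals (time spent, form errors) next to the usual conversion rate. The session table, column names and numbers are invented; your own instrumentation will differ.

```python
import pandas as pd

# Hypothetical session data for the two onboarding flows.
sessions = pd.DataFrame({
    "variant":       ["A"] * 4 + ["B"] * 4,
    "converted":     [1, 0, 1, 0, 1, 1, 1, 0],
    "seconds_spent": [95, 40, 110, 35, 240, 310, 280, 60],
    "form_errors":   [0, 1, 1, 0, 3, 4, 2, 1],
})

summary = sessions.groupby("variant").agg(
    conversion_rate=("converted", "mean"),       # efficacy: more users get through
    median_seconds=("seconds_spent", "median"),  # efficiency: how long it takes them
    median_errors=("form_errors", "median"),     # effort: how many corrections
)
print(summary)
# In this toy data, B "wins" on conversion, yet its users spend far longer and
# retry more often: a trade-off a conversion-only split test never surfaces.
```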

Something we tend to overlook is what it actually takes for the user to complete a form or go through a funnel. We just request information and impose requirements, rarely thinking about the implications. Do you know whether the user really has that information at hand when you ask for it? Is he making the best decision, or is he just being pushed to move forward?

A phrase I stole some time ago and often use is “mind the gap”. Product Managers and Designers should always think about what happens in between the interactions with the product, and about how we can help the user there.

Correlation does not imply causation . . . even in regression

Even though regression is a very powerful statistical tool, people sometimes forget the #1 mantra in statistics while performing regression analysis: correlation does not imply causation.

As you build a model that has significant variables and a high R-squared, it’s easy to forget that you might only be revealing correlation. Causation is an entirely different matter. Typically, to establish causation, you need to perform a designed experiment with randomization. If you’re using regression to analyze data that weren’t collected in such an experiment, you can’t be certain about causation.

In some cases, correlation can be just fine; you don’t always need variables that have causal relationships with the dependent variable. But assuming causation when you don’t have empirical proof is a recipe for catastrophe.

The famous Nicolas Cage correlation. From Spurious Correlations.
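
A quick simulation in the same spirit makes the point: the two series below are generated independently and share nothing but a time trend, yet regression rewards them with a “significant” coefficient and a high R-squared (Python with statsmodels; all data simulated).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
years = np.arange(2000, 2020)

# Two simulated series that have nothing to do with each other,
# except that both drift upward over time.
films_per_year = 2 + 0.3 * (years - 2000) + rng.normal(0, 0.5, years.size)
drownings_per_year = 90 + 4.0 * (years - 2000) + rng.normal(0, 5.0, years.size)

model = sm.OLS(drownings_per_year, sm.add_constant(films_per_year)).fit()
print(f"R-squared = {model.rsquared:.2f}, coefficient p-value = {model.pvalues[1]:.2g}")
# A tidy, highly "significant" fit, and not a shred of causation: the only
# thing the two series share is the passage of time.
```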

The guessing game

Something else we tend to forget is that our experiments do not exist in a vacuum. Given the conditions we have to deal with, there is no such thing as a controlled environment for our experiments.

This lack of full visibility into the conditions, and even into the subjects of the experiment, forces us to fill in the blanks with either scientific proof or plain guessing.

Ultimately, we don’t know if on the other side the user did precisely what we expected or if it was his kid on the phone watching Peppa Pig on YouTube who saw an email notification and started tapping buttons.

A/B tests don’t explain the why behind the numbers. On top of that, we only counter-check with qualitative research when we see negative results; when we observe positive results, we just move forward without asking too many questions.


Even if not knowing the why is sometimes not a big deal, it generally means a partial or complete lack of predictability. If I don’t know the why, I don’t know if this will happen again, I don’t know how to repeat it, I don’t know if it will stay like this forever, I don’t know if it was just seasonality. I just… don’t know.

Conclusion

Split tests are still the best choice when a hypothesis needs to be confirmed in the real world. Linear regression, confidence intervals, and all the Bayesian tricks help solidify the analysis of the experiments, but still, none of these represents a scientific truth.

Relying on A/B testing to make roadmap or strategic decisions is definitely a risky choice. In your decision stack, these experiments should sit at the very bottom, always responding to a specific strategy.

Our version of Martin Eriksson’s decision stack.

This is why at ProntoPro we do a lot of split testing, but only when necessary, and we validate as much as possible with user research. The truth is out there, not behind your spreadsheets.
