“Why do I need to learn about statistics in order to run an A/B test?” you may wonder, especially since the testing engine already supplies the data you need to judge the statistical significance of the test.
As a matter of fact, you have plenty of reasons to learn statistics.
If you’re conducting A/B tests, you need to understand some basics about statistics to validate your tests and their results.
Nobody wants to spend time, money and effort on something that turns out to be useless in the end. To use A/B testing efficiently and effectively, you must understand what it is and the statistics behind it.
Statistical hypothesis testing sits at the core of A/B testing. Sounds exciting, huh?
No worries, no one will ask you to grind through statistics and make calculations by hand. Nowadays, it is all done automatically for you. But you should know the key concepts and how to use them in order to interpret your test results correctly.
Let’s begin by taking a look at some of the foundations.
To better understand A/B stats, we need to step back a bit to the very beginning.
A/B testing refers to experiments where two or more variations of the same webpage are compared against each other by displaying them to live visitors to determine which one performs better for a given goal. A/B testing is not limited to web pages: you can A/B test your emails, popups, sign up forms, apps and more. Nowadays, most MarTech software comes with an A/B testing function built in.
Executing an A/B test becomes a simple process when you know exactly what you are testing and why.
We discussed in detail our 12-step CRO process that can guide you when starting an A/B testing program:
Like any type of scientific testing, A/B testing is basically statistical hypothesis testing, or, in other words, statistical inference. It is an analytical method for making decisions that estimates population parameters based on sample statistics.
The population refers to all the visitors coming to your website (or a specific group of pages), while the sample refers to the visitors who participated in the test.
Let’s say you decide to implement a change on your product pages based on the results of an A/B test that was run on a “sample” of the visitors to your website. Only a percentage of the visitors saw the challenger, not all of them. However, with A/B testing, you assume that if the challenger (i.e. the variation) increased conversions for a group of visitors on product pages, it will have the same result for all the visitors of your product pages (we will delve into how valid that assumption is later).
To recap, the A/B testing process can be simplified as follows:
When conducting a test, you are making an assumption about a population parameter and its numerical value. This is your hypothesis (it corresponds to Step 9 of the conversion optimization system).
In a simplified example, your hypothesis could look like this:
By adding reviews on the product pages, you will increase social proof, trust and confidence in the product, thus increasing the number of micro conversions on the page and resulting in an overall increase in conversion rates.
This is your hypothesis in “normal words.” But what would it look like in statistics?
In statistics, your hypothesis breaks down into a null hypothesis and an alternative hypothesis.
The null hypothesis states the default position to be tested or the situation as it is (assumed to be) now, i.e. the status quo.
The alternative hypothesis challenges the status quo (the null hypothesis) and is basically a hypothesis that the researcher (you) believes to be true. The alternative hypothesis is what you might hope that your A/B test will prove to be true.
Let’s look at an example:
The conversion rate on the product pages of Acme.Inc is equal to 8%. One of the problems revealed during the heuristic evaluation was that there were simply no product reviews on the product pages. They believe that adding reviews would help visitors make a decision, thus increasing the flow to the cart page and conversions.
The null hypothesis here would be: with no reviews, the conversion rate is equal to 8% (the status quo).
The alternative hypothesis here would be: adding reviews will cause the conversion rate to be more than 8%.
Now, the researcher, namely you, will have to collect enough evidence to reject the null hypothesis in favor of the alternative hypothesis.
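To make the Acme example a bit more concrete, here is a minimal Python sketch of how the null hypothesis (rate equal to 8%) could be checked against the alternative (rate above 8%). The visitor and conversion counts are made up for illustration, the function name is our own, and the calculation uses a simple normal approximation; your testing engine may well compute this differently.

```python
import math

def one_sided_p_value(conversions, visitors, baseline_rate):
    """P-value for H0: rate == baseline_rate vs. H1: rate > baseline_rate.

    Uses the normal approximation to the binomial distribution.
    """
    observed_rate = conversions / visitors
    standard_error = math.sqrt(baseline_rate * (1 - baseline_rate) / visitors)
    z = (observed_rate - baseline_rate) / standard_error
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical data: 1,000 visitors saw the page with reviews, 120 converted.
p_value = one_sided_p_value(conversions=120, visitors=1000, baseline_rate=0.08)
reject_null = p_value < 0.05  # 0.05 is an assumed significance level
print(f"p-value: {p_value:.6f}, reject null hypothesis: {reject_null}")
```

A small p-value means that data like this would be very unlikely if the true conversion rate were still 8%, which is the evidence you need to reject the null hypothesis.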
Hypothesis testing (A/B testing) is a decision-making method. You can make the right decision or you can make a mistake.
In hypothesis testing, there are three possible outcomes of the test: no error, a type I error, or a type II error.
With no error, everything is clear (your test results are OK), but what about the two types of error?
Type I error (beware! this is a really serious error) occurs when you incorrectly reject the null hypothesis and conclude that there is a difference between the original page and the variation when there really isn’t. In other words, you obtain false positive test results. As the name indicates, a false positive is when you think one of your test challengers is a winner while in reality it is not.
Type I errors are among the most common errors we see when conducting reviews of A/B testing programs. They typically happen when tests are concluded too early, without collecting enough data to ensure a high level of confidence in the test results.
Type II error occurs when you fail to reject the null hypothesis even though it is false, this time obtaining false negative test results. It happens when we conclude a test with the assumption that none of the variations beat the original page while in reality one of them actually did.
Type I and type II errors cannot happen at the same time: a type I error is only possible when the null hypothesis is true, and a type II error is only possible when it is false.
Keep in mind that statistical errors are unavoidable.
However, the better you know how to quantify them, the more accurate your results will be.
When conducting hypothesis testing, you cannot “100%” prove anything, but you can get statistically significant results.
A/B testing derives its power from random sampling.
When we conduct an A/B test (or a multivariate test), we distribute visitors randomly among the different variations. We use the results for each variation to judge how that variation would behave if it were the only design visitors see.
Let’s go back to our example. You conducted an A/B test and got the following results:
While you are running a test, only a portion of the visitors see your original page design with no reviews. The conversion rate for that portion of the visitors is 8%. Another portion of the visitors sees the Variation 1 design. The conversion rate for that group is 12%.
If we call the test off and declare that Variation 1 is the winner, the question becomes: will the conversion rate for Variation 1 hold when all visitors are directed to it and to no other variation?
Obviously, the data in the table is not enough to make a decision.
As the test is running, we record the sample distribution for each variation. As we observe the results, we need to determine whether the difference between two sample distributions is due to random chance, or if there is actual basis for the difference.
When we decide that two distributions differ in a statistically significant manner, we must make sure that the difference reflects a real effect and not mere chance.
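One common way to check whether a difference like 8% vs. 12% could be down to chance is a two-proportion z-test. The sketch below is only an illustration: the visitor counts are invented (the example above only gives the rates), the function name is ours, and real testing tools may use a different method.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """One-sided z-test: is the conversion rate of B higher than that of A?"""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variations convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability
    return z, p_value

# Same 8% vs. 12% rates, two hypothetical sample sizes.
z_small, p_small = two_proportion_z_test(8, 100, 12, 100)
z_large, p_large = two_proportion_z_test(80, 1000, 120, 1000)
print(f"100 visitors per variation:   z = {z_small:.2f}, p = {p_small:.4f}")
print(f"1,000 visitors per variation: z = {z_large:.2f}, p = {p_large:.4f}")
```

The same pair of conversion rates can be indistinguishable from chance with a small sample and clearly significant with a large one, which is why the rates alone are not enough to call a winner.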
How do we determine that our test results are statistically significant and valid?
In different A/B testing software packages, you may see a column called:
It usually shows a percentage between 0 and 100% and indicates how statistically significant the results are.
What does it all mean?
Level of significance, or α, is the probability of wrongly concluding that the variation produces an increase in conversions when it does not (a type I error). Thus, the confidence level is 100% × (1 − α) (we made this note for those who may have a question about it).
In other words, the confidence level is 100% minus the level of significance (1%, 5% or 10%), which makes it equal to 99%, 95% or 90%, respectively.
This is the number you usually see in your testing engine.
If you see a confidence level of 95%, does it mean that the test results are 95% accurate? Does it mean that there is a 95% probability that the test is accurate? Not really.
There are two ways to think of confidence level:
Since we are dealing with confidence levels for a statistical sample, you are better off thinking that the higher the confidence level, the more confident you can be in your results.
What affects the confidence level of your test?
Let’s see how it happens.
In some A/B testing software, you see the conversion percentage as a range, or interval.
It could also look like this:
Image Source: VWO
Why are these ranges, or intervals, needed?
This range is called the confidence interval. Its width indicates the level of certainty of the results.
When we put together the confidence interval and the confidence level, we get the conversion rate as a spread of percentages.
The single conversion rate percentage you calculate for a variation is a point estimate that is taken from a random sample of the population. When we conduct an A/B test, we are attempting to approximate the mean conversion rate for the population.
The point estimate alone doesn’t tell you how accurately it describes all your website visitors. The confidence interval provides a range of values for the conversion rate (a range that is likely to contain the actual conversion rate of the population).
The interval provides you with more accurate information on all the visitors of your website (the population), because it incorporates the sampling error (don’t mix it up with the type I and type II errors above). It tells you how close the actual conversion rate is likely to be to the point estimate.
In the example from the VWO interface, you can see that the confidence interval is shown as ± around the point estimate. This ± number reflects the margin of error. It defines the relationship between the population parameter and the sample statistic (how well the results you got during the test would hold for all your website visitors).
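As a rough illustration of where the ± number comes from, here is a short Python sketch that computes a confidence interval for a conversion rate with the standard normal-approximation formula. The counts are hypothetical and the function name is ours; VWO and other tools may calculate their intervals differently.

```python
import math
from statistics import NormalDist

def conversion_confidence_interval(conversions, visitors, confidence_level=0.95):
    """Return (lower, upper, margin_of_error) for a conversion rate."""
    rate = conversions / visitors  # the point estimate
    # z-score that leaves (1 - confidence_level) / 2 in each tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    margin = z * math.sqrt(rate * (1 - rate) / visitors)
    return rate - margin, rate + margin, margin

# Hypothetical variation: 120 conversions out of 1,000 visitors.
lower, upper, margin = conversion_confidence_interval(120, 1000)
print(f"12.0% ± {margin:.1%} (from {lower:.1%} to {upper:.1%})")
```

The point estimate sits in the middle of the interval, and the margin of error is the distance from it to either end.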
What margin of error is good?
The lower the margin of error, the better. It means that the result you get for the A/B test (a sample of your website visitors) is close enough to the result you would get for all your website visitors.
We would say that a margin of error of less than 5% is good.
The margin of error is affected by the sample size. Below you can see how it changes depending on the sample size.
Image source: wikimedia
The bigger your sample size, the lower your margin of error.
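To see this relationship in numbers, the sketch below computes the margin of error at a 95% confidence level for a few sample sizes. The 12% conversion rate and the sample sizes are assumptions chosen for illustration, and the formula is the usual normal approximation.

```python
import math

def margin_of_error(rate, visitors, z=1.96):
    """Margin of error for a conversion rate (z = 1.96 for roughly 95% confidence)."""
    return z * math.sqrt(rate * (1 - rate) / visitors)

# Hypothetical 12% conversion rate measured on samples of different sizes.
margins = {n: margin_of_error(0.12, n) for n in (100, 1_000, 10_000)}
for visitors, margin in margins.items():
    print(f"{visitors:>6} visitors: ± {margin:.2%}")
```

Because the sample size sits under a square root, you need four times as many visitors to cut the margin of error in half.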
Sample size refers to the number of participants in the test, who are usually taken from the large population visiting the website.
The main purpose behind choosing a sample size and sticking to it is to ensure valid statistical results and to avoid statistical errors. You’ve already seen that a larger sample size reduces the margin of error.
When you conduct a split test in a testing engine, your data may peak, most likely over short intervals of time. At some point (even shortly after launch), you may even get a significant result (a confidence level above 90%).
It may be tempting to stop the test as soon as it reaches the required confidence level. But that result is often just data peaking. This is how a type I error happens (rejecting the null hypothesis, or, in other words, thinking that the test result is positive when it is not).
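If you want to convince yourself of this, a small simulation helps. The Python sketch below runs many A/A tests (both variations have the same true conversion rate, so every “winner” is a false positive) and compares two rules: checking significance only once at the end, versus stopping at the first significant result among several interim looks. All the numbers (traffic, number of looks, conversion rate, seed) are assumptions for illustration, and the z-test is a simplified stand-in for what a testing engine does.

```python
import math
import random

def is_significant(conv_a, conv_b, n, alpha=0.05):
    """One-sided two-proportion z-test with n visitors per variation."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled == 0 or pooled == 1:
        return False  # no variation in the data yet
    standard_error = math.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (conv_b - conv_a) / n / standard_error
    return 0.5 * math.erfc(z / math.sqrt(2)) < alpha

def simulate_aa_tests(runs=400, visitors=1000, looks=10, rate=0.08, seed=1):
    """Return (false positive rate when stopping early, rate at fixed sample size)."""
    rng = random.Random(seed)
    step = visitors // looks
    early_wins = 0
    final_wins = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        significant_at_any_look = False
        significant_at_end = False
        for n in range(1, visitors + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n % step == 0:  # an interim look at the data
                significant_at_end = is_significant(conv_a, conv_b, n)
                significant_at_any_look = significant_at_any_look or significant_at_end
        early_wins += significant_at_any_look
        final_wins += significant_at_end
    return early_wins / runs, final_wins / runs

stop_early_rate, fixed_size_rate = simulate_aa_tests()
print(f"False positives when stopping at the first significant look: {stop_early_rate:.1%}")
print(f"False positives when waiting for the full sample size:       {fixed_size_rate:.1%}")
```

Since the last look is the same in both rules, stopping early can never produce fewer false positives than waiting, and in practice it usually produces noticeably more.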
That’s why before you start your test, you should find out the following:
Estimate your sample size based on this information. To estimate the sample size, you can use any of the sample size calculators.
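For the curious, this is roughly what such calculators do under the hood. The sketch below uses a common textbook formula for comparing two conversion rates; the inputs (baseline rate, expected rate, significance level and statistical power) and the function name are our assumptions, and individual calculators may use slightly different formulas.

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline_rate, expected_rate, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # guards against type I errors
    z_power = NormalDist().inv_cdf(power)          # guards against type II errors
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    effect = expected_rate - baseline_rate
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)

# Hypothetical goal: detect a lift from 8% to 12%, or a smaller one from 8% to 9%.
n_big_lift = sample_size_per_variation(0.08, 0.12)
n_small_lift = sample_size_per_variation(0.08, 0.09)
print(f"8% -> 12%: {n_big_lift} visitors per variation")
print(f"8% ->  9%: {n_small_lift} visitors per variation")
```

The smaller the lift you want to detect, the more visitors you need, so it pays to decide on the minimum improvement you care about before launching the test.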
Confidence level and confidence interval, which we discussed above, belong to the frequentist approach to A/B testing.
However, some testing engines (such as VWO or Google Experiments) use Bayesian probabilities to evaluate A/B test results.
Frequentist and Bayesian reasoning are two different approaches to analyzing statistical data and making decisions based on it.
They have a different view on a number of statistical issues:
Is one reasoning better than the other?
There is a heated debate about it.
However, whichever A/B testing tool you use, you should be aware of what reasoning it relies on so that you can interpret the results correctly.
And neither approach can protect you from A/B testing mistakes.
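To give a feel for what a Bayesian result looks like, here is a minimal Python sketch that estimates the probability that the variation beats the original. It assumes a uniform Beta(1, 1) prior and uses random draws from the resulting Beta distributions; the counts, the function name and the prior are all our assumptions, and this is not the actual model used by VWO or Google Experiments.

```python
import random

def probability_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=7):
    """Monte Carlo estimate of P(rate_b > rate_a) under uniform Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each rate: Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical data: original 80/1,000 conversions, variation 120/1,000.
chance_to_beat_original = probability_b_beats_a(80, 1000, 120, 1000)
print(f"Probability that the variation beats the original: {chance_to_beat_original:.1%}")
```

Note that this number answers a different question than a frequentist confidence level, which is one reason why you need to know which approach your tool uses before reading its reports.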
To sum it up for you, when you get some A/B testing results, you should check the following:
Why should you check all these things? An A/B test is an experiment. For an experiment to be considered successful from a scientific point of view, it should meet certain criteria.
You should also always remember that:
As Hayan Huang, from the University of California, Berkeley, points out:
Statistics derives its power from random sampling. The argument is that random sampling will average out the differences between two populations and the differences between the populations seen post “treatment” could be easily traceable as a result of the treatment only. Obviously, life isn’t as simple. There is little chance that one will pick random samples that result in significantly same populations. Even if they are the same populations, we can’t be sure whether the results that we are seeing are just one time (or rare) events or actually significant (regularly occurring) events.
When running A/B tests, remember that they are, in essence, statistical hypothesis tests. So, you should stick to statistical principles to get valid results.
Also, keep in mind that running an A/B test provides you with insights into how altering your website design or messaging influences the conversion rate. The post-test analysis will give you direction for implementing the changes on your website.
My name is Ayat Shukairy, and I’m a co-founder and CCO at Invesp. Here’s a little more about me: At the very beginning of my career, I worked on countless high-profile e-commerce projects, helping diverse organizations optimize website copy. I realized that, although the copy was great and was generating more foot traffic, many of the sites performed poorly because of usability and design issues.