A/B Testing Statistics Made Simple
“Why do I need to learn about statistics in order to run an A/B test?” You may be inclined to wonder, especially since the testing engine already supplies you with data to judge the statistical significance of the test, correct?
As a matter of fact, you have plenty of reasons to learn statistics.
If you’re conducting A/B tests, you need to understand some basics about statistics to validate your tests and their results.
Nobody wants to spend time, money and effort on something that will turn out useless at the end. To use A/B testing efficiently and effectively, you must understand what it is and all the statistics that surround it.
Statistical hypothesis testing sits at the core of A/B testing. Sounds exciting, huh?
No worries, no one will ask you to grind through statistics and make calculations by hand. Nowadays, it is all done automatically for you. But you should know the key concepts and how to use them in order to interpret your test results correctly and judge whether they are significant.
Let’s begin by taking a look at some of the foundations.
How to Run an A/B Test?
To better understand A/B stats, we need to scale back a bit to the very beginning.
A/B testing refers to experiments in which two or more variations of the same webpage are compared against each other by displaying them to live visitors to determine which one performs better for a given goal. A/B testing is not limited to web pages: you can A/B test your emails, popups, sign-up forms, apps and more. Nowadays, most MarTech software comes with an A/B testing function built in.
Executing an A/B test becomes a simple process when you know exactly what you are testing and why.
We discussed in detail our 12-step CRO process that can guide you when starting an A/B testing program:
- Conduct heuristic analysis
- Conduct qualitative analysis including heatmaps, polls, surveys, and user testing.
- Conduct quantitative analysis by looking at your website analytics to determine which pages are leaking visitors
- Conduct competitive analysis
- Gather all data to determine problem areas on the site
- Analyze the problems through the Conversion Framework
- Prioritize the problems on the website
- Create a conversion roadmap
- Create a test hypothesis
- Create new designs
- Conduct A/B Testing
- Conduct Post-Test Analysis
Editor Note: You can learn more about the essentials of multivariate and A/B testing by downloading this free guide.
What Should You Know About A/B Testing?
Like any type of scientific testing, A/B testing is basically statistical hypothesis testing, or, in other words, statistical inference. It is an analytical method for making decisions that estimates population parameters based on sample statistics.
The population refers to all the visitors coming to your website (or specific group of pages), while the sample refers to the number of visitors that participated in the test.
Let’s say you decide to implement a change on your product pages based on the results of an A/B test that ran on a “sample” of the visitors to your website. Only a percentage of the visitors saw the challenger, not all of them. However, with A/B testing, you assume that if the challenger (i.e. the variation) increased conversions for a group of visitors on product pages, it will have the same effect for all the visitors of your product pages (we will delve into how far that assumption holds later).
To recap, the A/B testing process can be simplified as follows:
- You start the A/B testing process by making a claim (hypothesis).
- You launch your test to gather statistical evidence to accept or reject a claim (hypothesis) about your website visitors.
- The final data shows you whether your hypothesis was correct, incorrect or inconclusive.
What is an A/B Test Hypothesis?
When conducting a test, you make an assumption about the numerical value of a population parameter. This is your hypothesis (it corresponds to step 9, “Create a test hypothesis,” of the CRO process above).
In a simplified example, your hypothesis could look like this:
By adding reviews on the product pages, you will increase social proof and trust and confidence in the product, thus increase the number of micro conversions on the page resulting in an overall increase in conversion rates.
This is your hypothesis in “normal words.” But what would it look like in statistics?
In statistics your hypothesis breaks down into:
- Null hypothesis
- Alternative hypothesis
The null hypothesis states the default position to be tested or the situation as it is (assumed to be) now, i.e. the status quo.
The alternative hypothesis challenges the status quo (the null hypothesis) and is basically a hypothesis that the researcher (you) believes to be true. The alternative hypothesis is what you might hope that your A/B test will prove to be true.
Let’s look at an example:
The conversion rate on the product pages of Acme Inc. is 8%. One of the problems revealed during the heuristic evaluation was that there were no product reviews on the product pages. The team believes that adding reviews would help visitors make a decision, thus increasing flow to the cart page and conversions.
The null hypothesis here would be: product pages without reviews generate a conversion rate equal to 8% (the status quo).
The alternative hypothesis here would be: adding reviews will cause the conversion rate to be more than 8%.
Now the researcher, namely you, has to collect enough evidence to reject the null hypothesis in favor of the alternative hypothesis.
A/B Testing Errors
Hypothesis testing (A/B testing) is a decision-making method. You can make the right decision or you can make a mistake.
In hypothesis testing there are three possible outcomes of the test:
- No error
- Type I error
- Type II error
With no error, everything is clear (your test results are OK), but what about the other two outcomes?
Type I error (beware, this is a really serious error) occurs when you incorrectly reject the null hypothesis and conclude that there is a difference between the original page and the variation when there really isn’t one. In other words, you obtain a false positive result. As the name indicates, a false positive is when you think one of your test challengers is a winner while in reality it is not.
Type I errors are perhaps among the most common errors we see when reviewing A/B testing programs. They typically happen when tests are concluded too early, before enough data has been collected to ensure a high level of confidence in the test results.
Type II error occurs when you fail to reject the null hypothesis even though it is false, obtaining a false negative result this time. It happens when you conclude a test assuming that none of the variations beat the original page while in reality one of them did.
Type I and type II errors cannot happen at the same time:
- Type I error happens only when the null hypothesis is true
- Type II error happens only when the null hypothesis is false
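The relationship between the two errors can be written down as a small decision table. The sketch below is only an illustration; the function name `classify_outcome` and its arguments are made up for this example and do not come from any testing tool.

```python
def classify_outcome(null_is_true: bool, rejected_null: bool) -> str:
    """Map the (unknown) truth and your decision to one of the three outcomes."""
    if null_is_true and rejected_null:
        # You declared a winner although there is no real difference.
        return "Type I error (false positive)"
    if not null_is_true and not rejected_null:
        # You saw no winner although the variation really is different.
        return "Type II error (false negative)"
    return "no error"


# Walk through all four combinations of truth and decision.
for null_is_true in (True, False):
    for rejected_null in (True, False):
        print(null_is_true, rejected_null, "->",
              classify_outcome(null_is_true, rejected_null))
```

Because each error is tied to a different state of the null hypothesis, a single test can never produce both.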
Keep in mind that statistical errors are unavoidable.
However, the better you know how to quantify them, the more accurate your results will be.
When conducting hypothesis testing, you cannot “100%” prove anything, but you can get statistically significant results.
What Should You Know to Avoid Statistical Errors?
A/B testing derives its power from random sampling.
When we conduct an A/B test (or a multivariate test), we distribute visitors randomly among the different variations. We use the results for each variation to judge how that variation would behave if it were the only design visitors saw.
Let’s go back to our example. You conducted an A/B test and got the following results:
- Original page conversion rate – 8%
- Variation 1 conversion rate – 12%
While you are running a test, only a portion of the visitors see your original page design with no reviews. The conversion rate for that portion of the visitors is 8%. Another portion of the visitors sees the Variation 1 design. The conversion rate for that group is 12%.
If we call the test off and declare that Variation 1 is the winner, the question becomes: will the conversion rate for Variation 1 hold when all visitors are directed to it and to no other variation?
Obviously, the two numbers above are not enough to make a decision.
As the test is running, we record the sample distribution for each variation. As we observe the results, we need to determine whether the difference between the two sample distributions is due to random chance or whether there is an actual basis for the difference.
Before we decide that two distributions differ in a statistically significant manner, we must make sure that the difference reflects a real effect and not mere chance.
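One common way to check whether such a difference could be due to chance is a two-proportion z-test. The sketch below is a minimal illustration that assumes this particular test and made-up visitor counts; your testing engine may use a different method. The function name `one_sided_p_value` is hypothetical.

```python
import math


def one_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Return (z, p) for the alternative 'variation B converts better than A'.

    conv_* are conversion counts, n_* are visitor counts per variation.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null hypothesis both pages share one conversion rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Probability of a z-score at least this large if the null were true.
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p


# Same 8% vs 12% conversion rates, two different (assumed) sample sizes.
print(one_sided_p_value(80, 1000, 120, 1000))  # 1,000 visitors per variation
print(one_sided_p_value(8, 100, 12, 100))      # 100 visitors per variation
```

With 1,000 visitors per variation the difference is very unlikely to be chance, while with 100 visitors per variation the same 8% versus 12% split could easily be noise.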
How to Determine That Test Results Are Statistically Significant and Valid?
In different A/B testing software packages, you may see a column called:
- Statistical significance
It usually shows a percentage between 0 and 100% and indicates how statistically significant the results are.
What does it all mean?
Level of significance, or α, is the probability of wrongly concluding that the variation produces an increase in conversions when it does not. Thus, the confidence level is 100% × (1 − α) (we made this note for those who may have a question about it).
In other words, the confidence level is 100% minus the level of significance (10%, 5% or 1%), which makes it equal to 90%, 95% or 99%.
This is the number you usually see in your testing engine.
If you see a confidence level of 95%, does it mean that the test results are 95% accurate? Does it mean that there is a 95% probability that the test is accurate? Not really.
There are two ways to think of confidence level:
- It means that if you repeated this test over and over again, the confidence interval you calculated would contain the true conversion rate in about 95% of cases.
- It means that if there were really no difference between the original page and the challenger, a test would still report a difference this large by chance about 5% of the time.
Since we are dealing with confidence levels for a statistical sample, a practical rule of thumb is: the higher the confidence level, the more confident you can be in your results.
What affects the confidence level of your test?
- Test sample size: the number of visitors participating in the test.
- Variability of results: the extent to which test data points vary from the average, mean or each other.
Let’s see how it happens.
In some A/B testing software, you see the conversion percentage as a range, or interval.
Image Source: Optimizely
It could also look like this:
Image Source: VWO
Why are these ranges, or intervals, needed?
This range is called the confidence interval; you can think of it as the “width” that goes with the confidence level. It indicates the level of certainty of the results.
When we put the confidence interval and the confidence level together, we get the conversion rate as a spread of percentages.
The single conversion rate percentage you calculate for a variation is a point estimate that is taken from a random sample of the population. When we conduct an A/B test, we are attempting to approximate the mean conversion rate for the population.
The point estimate alone doesn’t tell you very accurately how all your website visitors behave. The confidence interval provides a range of values for the conversion rate (a range that is likely to contain the actual conversion rate of the population).
The interval gives you more accurate information about all the visitors of your website (the population), because it incorporates the sampling error (don’t confuse it with the Type I and Type II errors above). It tells you how close the true value is likely to be to the point estimate.
In the example from the VWO interface, you can see that the confidence interval is shown as ± around the point estimate. This ± number is the margin of error. It describes the relationship between the population parameter and the sample statistic (how well the results you got during the test would carry over to all your website visitors).
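To make the point estimate, margin of error and interval concrete, here is a rough sketch using the normal-approximation formula for a proportion. This formula is an assumption for illustration; testing engines may compute their intervals differently. The function name and the visitor counts are made up.

```python
import math
from statistics import NormalDist


def conversion_interval(conversions, visitors, confidence=0.95):
    """Return (point_estimate, margin_of_error, (low, high)) for a conversion rate."""
    rate = conversions / visitors  # point estimate from the sample
    # z-value that leaves (1 - confidence) split across both tails.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(rate * (1 - rate) / visitors)
    return rate, margin, (rate - margin, rate + margin)


rate, margin, (low, high) = conversion_interval(120, 1000)
print(f"{rate:.1%} ± {margin:.1%} (from {low:.1%} to {high:.1%})")
```

For 120 conversions out of 1,000 visitors, the 12% point estimate comes with a margin of error of roughly two percentage points.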
What margin of error is good?
The lower the margin of error, the better. It means that the result you get for the A/B test (a sample of your website visitors) is close enough to the result you would get for all your website visitors.
We would say that a margin of error below 5% is good.
The margin of error is affected by the sample size. Below you can see how it changes depending on the sample size.
Image source: wikimedia
The bigger your sample size, the lower your margin of error.
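A quick sketch of that relationship, assuming a 95% confidence level (z ≈ 1.96) and the worst-case rate of 50%, which gives the widest margin for a given sample size. The function name `margin_of_error` is hypothetical.

```python
import math


def margin_of_error(sample_size, rate=0.5, z=1.96):
    """Normal-approximation margin of error for a proportion."""
    return z * math.sqrt(rate * (1 - rate) / sample_size)


for n in (100, 1_000, 10_000):
    print(f"n={n:>6}: ±{margin_of_error(n):.1%}")
```

The margin shrinks with the square root of the sample size, so you need about four times as many visitors to cut it in half.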
Sample size refers to the number of participants in the test, who are usually taken from the large population visiting the website.
The main purpose of choosing a sample size and sticking to it is to ensure valid statistical results and to avoid statistical errors. You have already seen that a larger sample size reduces the margin of error.
When you conduct a split test in a testing engine, your results may spike, most likely over short intervals of time. At some point (even shortly after the launch), you may even get a significant result (confidence level above 90%).
It may be tempting to stop the test as soon as it reaches the required confidence level. But that result may be just a temporary spike in the data. This is actually how a Type I error happens (rejecting the null hypothesis or, in other words, thinking that the test result is positive while it is not).
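The effect of stopping at the first significant reading can be illustrated with a small simulation. The sketch below runs made-up A/A tests, where both pages have the same true 8% conversion rate, so every "significant" result is a false positive. It assumes a simple two-proportion z-test at 95% confidence and 20 evenly spaced looks at the data; all names and numbers are illustrative only.

```python
import math
import random


def z_score(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-score with a pooled standard error."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if std_err == 0 else (conv_b / n_b - conv_a / n_a) / std_err


def simulate_aa_tests(n_tests=500, visitors=2000, looks=20,
                      rate=0.08, z_crit=1.96, seed=1):
    """Return (false positive rate with peeking, false positive rate at the end)."""
    rng = random.Random(seed)
    step = visitors // looks
    flagged_while_peeking = 0
    flagged_at_end = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        ever_significant = False
        for look in range(1, looks + 1):
            # Both pages convert at the same true rate (no real difference).
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            n = look * step
            if abs(z_score(conv_a, n, conv_b, n)) > z_crit:
                ever_significant = True  # a peeker would have stopped here
        flagged_while_peeking += ever_significant
        flagged_at_end += abs(z_score(conv_a, n, conv_b, n)) > z_crit
    return flagged_while_peeking / n_tests, flagged_at_end / n_tests


peeking, fixed = simulate_aa_tests()
print(f"stop at first significant look: {peeking:.0%} false positives")
print(f"evaluate only at planned size:  {fixed:.0%} false positives")
```

Checking only once, at the planned sample size, keeps false positives near the chosen significance level, while stopping at the first significant look flags far more tests that have no real difference.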
That’s why before you start your test, you should find out the following:
- Baseline conversion rate (that is, the conversion rate you have now)
- Desired increase (by how much you expect the new design to beat the original)
- Confidence level: 90%, 95% or 99%
Estimate your sample size based on this information. To estimate the sample size, you can use any of the sample size calculators.
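If you are curious what such a calculator does, the sketch below uses a standard normal-approximation formula for comparing two proportions. Two things here are assumptions that the inputs above do not specify: the desired increase is treated as a relative lift, and the calculation targets 80% statistical power (the chance of detecting the lift if it is real). Individual calculators may use different formulas and defaults, so their numbers can differ.

```python
import math
from statistics import NormalDist


def sample_size_per_variation(baseline, relative_lift,
                              confidence=0.95, power=0.80):
    """Approximate visitors needed per variation for a two-sided test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)  # expected rate of the challenger
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)  # power is an assumed extra input
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


# 8% baseline; compare a hoped-for 50% lift with a more modest 10% lift.
print(sample_size_per_variation(0.08, 0.50))
print(sample_size_per_variation(0.08, 0.10))
```

The smaller the lift you want to detect, the more visitors each variation needs.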
Frequentist vs. Bayesian Approach to A/B Testing
The confidence level and confidence interval that we discussed above belong to the frequentist approach to A/B testing.
However, some of the testing engines (VWO or Google Experiments) use Bayesian probabilities to evaluate A/B test results.
Frequentist and Bayesian reasoning are two different approaches to analyzing statistical data and making decisions based on it.
They have a different view on a number of statistical issues:
- Probability. Frequentist probability is the relative frequency with which some event occurs over many repetitions (remember, earlier on we described the 95% confidence level in terms of repeating the experiment over and over again). Bayesian probability is a measure of the strength of your belief regarding the true situation. In this way, Bayesian probability is much closer to the everyday notion of probability.
- Reasoning. Frequentist reasoning uses deduction: if the population looks like this, my sample might look like this. Bayesian reasoning uses induction: if my sample looks like this, the true situation might be like this.
- Data. Frequentists believe that the population has fixed parameters and that studies are repeatable. They treat experiment data as self-contained and don’t use data from previous experiments in the analysis. Bayesians believe that sample data is fixed, while population parameters are random and can be described through probability. To analyze the experiment data, they start from prior probabilities (pre-existing beliefs).
Is one reasoning better than the other?
There is a heated debate about it.
However, when you use one or another A/B testing tool you should be aware of what reasoning the tool uses so that you can interpret the results correctly.
- Frequentist A/B testing shows you (as confidence level) the percentage of all possible samples that can be expected to include the result you got (challenger beating control).
- Bayesian A/B testing gives you an actual probability of challenger beating control.
And neither of these approaches can keep you safe from A/B testing mistakes.
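As an illustration of the Bayesian figure, the sketch below estimates the probability that the challenger beats the control by drawing from Beta posterior distributions. It assumes a uniform Beta(1, 1) prior and made-up visitor counts; it is not the algorithm of any particular testing engine, and the function name is hypothetical.

```python
import random


def probability_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each rate: Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# 8% vs 12% observed on an assumed 1,000 visitors per variation.
print(probability_b_beats_a(80, 1000, 120, 1000))
```

The returned number can be read directly as "the probability that the challenger is better, given the data and the prior", which is why many people find Bayesian output easier to interpret.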
How Should You Treat the Data You Get Through A/B Testing?
To sum it up for you, when you get some A/B testing results, you should check the following:
- Whether the sample size per variation is large enough. Results obtained from a small sample are not reliable.
- Whether the number of conversions is high enough. It should be at least 100, and preferably around 200-300. These numbers are approximate, because the number of conversions varies with the size of your site. For large websites, you should typically not even look at the data before each variation has 1,000 conversions.
- How long the test ran. Remember that variance in results is one of the factors that influence the confidence interval. Imagine you had the test running for four days: for three days conversions are low, and then on the fourth day conversions peak because of a promotion. This results in greater data variance and thus less accurate results. That’s why you should make sure that the test runs for at least one full week. You should also account for seasonality (holidays) and for marketing efforts (PPC, SEO, sales, promotions).
- Whether the confidence level is high enough. The fact that the test reached a 95% confidence level is not enough to stop the test. Before you stop it, check whether the test reached the calculated sample size and how long it ran. Only if the test satisfies the sample size and duration conditions and has reached 95% confidence can you stop it.
- What the margin of error is (if your testing engine provides this information). The smaller the margin of error, the more accurate the result.
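The checklist above can be turned into a simple guard that lists what is still missing before a test may be stopped. This is only a sketch: the class, function and field names are hypothetical, and the thresholds are the rough ones from this article (100 conversions, one full week, 95% confidence), which you would adjust for your own site.

```python
from dataclasses import dataclass


@dataclass
class ExperimentSnapshot:
    visitors_per_variation: int
    conversions_per_variation: int
    days_running: int
    confidence_level: float  # as a fraction, e.g. 0.95


def unmet_stop_conditions(snapshot, planned_sample_size,
                          min_conversions=100, min_days=7,
                          min_confidence=0.95):
    """Return the checklist items that are not yet satisfied."""
    unmet = []
    if snapshot.visitors_per_variation < planned_sample_size:
        unmet.append("sample size")
    if snapshot.conversions_per_variation < min_conversions:
        unmet.append("conversions")
    if snapshot.days_running < min_days:
        unmet.append("duration")
    if snapshot.confidence_level < min_confidence:
        unmet.append("confidence level")
    return unmet


# A test that already shows 96% confidence after four days is not done yet.
early = ExperimentSnapshot(400, 45, 4, 0.96)
print(unmet_stop_conditions(early, planned_sample_size=900))
```

An empty list means all conditions are met; anything else tells you why a tempting early result should not end the test.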
Why should you check all these things? An A/B test is an experiment. For an experiment to be considered successful from a scientific point of view, it has to meet certain criteria.
You should also always remember that:
- Randomness is a part of your test and there are a number of statistical values that affect it.
- A/B testing is a decision-making method, but cannot give you a 100% accurate prediction of your visitors’ behavior.
As Hayan Huang, from the University of California, Berkeley, points out:
Statistics derives its power from random sampling. The argument is that random sampling will average out the differences between two populations and the differences between the populations seen post “treatment” could be easily traceable as a result of the treatment only. Obviously, life isn’t as simple. There is little chance that one will pick random samples that result in significantly same populations. Even if they are the same populations, we can’t be sure whether the results that we are seeing are just one time (or rare) events or actually significant (regularly occurring) events.
Over to You
When running A/B tests, remember that they are, in essence, statistical hypothesis testing. So, you should stick to statistics principles to get valid results.
Also, keep in mind that running an A/B test provides you with insights into how altering your website design or messaging influences the conversion rate. The post-test analysis will give you direction for implementing the changes on your website.
My name is Ayat Shukairy, and I’m a co-founder and CCO at Invesp. Here’s a little more about me: At the very beginning of my career, I worked on countless high-profile e-commerce projects, helping diverse organizations optimize website copy. I realized that although the copy was great and was generating more foot traffic, many of the sites performed poorly because of usability and design issues.