How to Analyze A/B Test Results and Statistical Significance in A/B Testing
When we talk about Conversion Rate Optimization, it’s nearly impossible not to mention A/B testing (or split testing). In fact, many companies think that A/B testing and CRO are synonymous. But that’s not true. A/B testing is one part of the greater umbrella that is CRO, but that’s a topic for another day.
From SaaS to e-commerce to lead generation websites, many companies now understand how their target audience responds to certain changes on their websites, thanks to A/B testing.
Most of the website elements you see on popular sites such as Google, eBay and Amazon were evaluated for effectiveness using A/B testing. When it comes to the positioning of website elements, the design strategy that worked for one company may not necessarily work for yours. That’s why you should run an A/B test.
Many people think that A/B testing is all about selecting items to test, setting the goal of the test, paying close attention to changes in user behavior, checking for conversions, checking for the significance level, and determining the winner.
But is it that simple? (We wish).
A/B test results can be complex to analyze. Even after creating a strong testing hypothesis, it only takes one simple mistake during the analysis process to derail your whole effort and lead you to conclusions that can cost you valuable leads and conversions.
But since you are already here, we will walk you through the process of analyzing results for an A/B test. All the tips we give in this article can be applied to any A/B testing tool, but we recommend that you try out the tool made by marketers, for marketers: FigPii.
Defining A/B testing
A/B testing, also known as split testing, is the process of comparing two different versions of a web page or email so as to determine which version generates more conversions.
According to the State of A/B Testing report we conducted, 71% of online companies run two or more A/B tests every month. For many CRO agencies, A/B testing is a decision-making tool that helps reveal the elements that have the highest impact on the overall conversion rate on a site. Simply put, split testing gives empirical validation to your design decisions.
The potential benefits of A/B testing are:
- Improved content
- Higher page engagement
- Higher conversion rates
- Reduced bounce rates on pages
Before coming up with any variations to compare in a split test, a lot of user research has to be done. In fact, at Invesp, 70% of the time we spend working on a Conversion Rate Optimization project revolves around qualitative and quantitative research. The research is done so that we understand the issues and problems, learn what users want, how they want it, etc. Basically, you want to learn as much as you can about your users.
For instance, let’s say that during your research you noticed that most of your users are women in the 24-34 age group. Based on these research findings, you’d want to test copy that appeals to and resonates with that audience.
As this example shows, A/B testing is informed by research.
Analyzing A/B test results
Once you’re satisfied that your test has gathered enough data, reached the required statistical significance level and run long enough, it’s time to begin the analysis process. The variation(s) you were testing will either win or lose. Either outcome is a learning opportunity that helps you understand your audience better.
With that said, let’s take a look at how you should analyze your A/B test results:
Winning variations – what’s next?
Congratulations, your test won!
So, what’s next? Removing the old design and asking your developers to permanently implement the winning variation?
No, not yet. Before you do that, you have to ensure that your results are correct. There is a great need to know about the factors that contributed to the win. Remember, A/B testing is not all about running tests and hoping for wins, it’s also about learning.
One winning variation
Most Optimizers fail to understand the importance of validating results and are instead obsessed with reaching statistical significance and moving on to implementation. The exercise ends up being about hitting a significance level rather than learning. So, before you ask your developers to implement the winning variation on the whole site, first determine whether the test results are valid.
For instance, let’s say you were testing the control against three variations (V1, V2, and V3), and V2 won. The next thing you should do is to re-run the test; this time around you should only test the control vs. the winning variation (V2, in this case). If the initial results were correct, V2 will win again and you will be able to draw some learnings that you can propagate across the site.
The other thing you should consider doing once you have a winning variation is to allocate 100% of the traffic to it. This means pausing the experiment, duplicating it and resetting the traffic allocation.
Multiple winning variations
Sometimes, depending on how lucky you are, a single test can have multiple winners. V1, V2, and V3 can all show an uplift in terms of your test goals. As good as that might sound, it can be confusing, in the sense that you might not know which variation to go with.
In such cases, it’s easy to swiftly implement the highest winning variation and ignore other variations that also had an uplift. But is it a good idea?
This is very subjective. But I’d recommend that you segment your results just to see how your most valuable customers respond to the changes. Your test data can be segmented in different ways such as:
- Traffic source
- Visitor type (new vs. returning)
- Browser type
- Device type (it’s recommended that you test mobile and desktop devices separately so as to see which one performs better than the other)
Analyzing the behavior of your visitors under these segments can help you reveal new insights about their perspective.
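As a rough illustration of segmenting test data, the sketch below groups visits by variation and by one segment (device type here) and computes a conversion rate for each group. The record layout and the names `variation`, `device` and `converted` are assumptions made for the example; your testing tool will export data in its own format.

```python
from collections import defaultdict


def conversion_rate_by_segment(visits, segment_key):
    """Return {(variation, segment): conversion rate} for a list of visit records.

    Each visit is assumed (hypothetically) to be a dict with a "variation"
    name, a boolean "converted" flag and the chosen segment field.
    """
    totals = defaultdict(lambda: {"visitors": 0, "conversions": 0})
    for visit in visits:
        key = (visit["variation"], visit[segment_key])
        totals[key]["visitors"] += 1
        if visit["converted"]:
            totals[key]["conversions"] += 1
    return {
        key: counts["conversions"] / counts["visitors"]
        for key, counts in totals.items()
    }


# Made-up records, only to show the shape of the input.
sample_visits = [
    {"variation": "control", "device": "mobile", "converted": False},
    {"variation": "control", "device": "mobile", "converted": True},
    {"variation": "V1", "device": "mobile", "converted": True},
    {"variation": "V1", "device": "desktop", "converted": False},
]

for segment, rate in sorted(conversion_rate_by_segment(sample_visits, "device").items()):
    print(segment, f"{rate:.0%}")
```

Keep in mind that each segment contains fewer visitors than the test as a whole, so differences inside a segment need the same validity checks as the overall result.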
Depending on the design differences between the winning variations, you may sometimes need to combine design elements from several of them into a single design to find the best experience.
For instance, let’s say yours is a lead generation website and you’re testing multiple variations of trust signals. In Variation 1, you use the word ‘guarantee’ in your main headline and notice a 35% uplift in conversions. In Variation 2, you include customer testimonials below the fold of the page you’re testing and you see a 31% uplift in conversions. In Variation 3, you add membership trust signals and trust by association and you get a 38% uplift in submissions.
As you can see from the above example, V1, V2, and V3 are all winners. They all had an uplift in conversions. So you can combine the winning ideas into a single design to implement on the site. Keep in mind that the individual uplifts will not necessarily add up, so it is worth validating the combined design with a follow-up test against the control.
Losing variations – what’s next?
Yes, it’s frustrating sometimes, and some Optimizers can’t handle a losing variation, so they tend to ignore losing tests. But that’s the wrong way of going about it. Embrace losing variations.
An A/B test is said to have failed when the variation(s) running against the control fail to beat the control design in terms of the primary goal and the other goals set in the test. A good example of this is when the control/original version gets more conversions than the variation(s).
This can happen even if you follow all the A/B testing best practices and run the test correctly.
But depending on how you look at things, there is always a good side to everything. In the context of A/B testing, losing variations are actually not bad. They present a goldmine of information that you can use to home in on expectations that your website is not meeting, focus your testing, and make improvements that support long-term success.
In simpler terms, losing variations are just as actionable as winning variations. When your test loses, you should:
- Evaluate the solutions you had in your variations.
- Go through your hypothesis.
- Revalidate your research data.
Here is what I mean by this:
Reevaluating the solutions in the variations
The reality is that the element you’ve most likely got wrong is the solution you presented. Solutions can be a bit subjective: based on the problem you uncovered, you removed, replaced or redesigned an element or a flow on the site. But there could be multiple variables to the change: the location, the copy, the look and feel of it, the UX of it, etc.
The vast majority of tests run at Invesp are evaluated from a solutions standpoint first and foremost. The reason is that the problem uncovered and the research conducted are typically thorough, and the hypothesis is strongly driven by data. The solution is the part that is more prone to human assumptions.
Remember: a single hypothesis can have multiple solutions. Very often the logic behind a solution seemed sound during design discussions, but in reality, it did not sit well with the site visitor. Going back to the drawing board and thinking about discarded solutions might be a good approach to turning a losing test into a winner.
For instance, let’s say a hypothesis has four possible solutions:
- Change the placement of the form from below to above the fold
- Use videos instead of text
- Multi-step form instead of a single form
- Use short copy instead of a long one
Because they want to learn which web element had the most impact on increasing conversions, optimizers do not usually test all the possible solutions in a single test. In this case, the first test may be aimed at testing solutions 1 and 2. If the test bears no positive results, the once discarded solutions 3 and 4 are then tested.
Go through your hypothesis
When the A/B test results are exactly the opposite of what you expected, there is a high chance that your hypothesis is wrong. But before we get into that, what’s a hypothesis?
The dictionary definition of a hypothesis is:
A tentative assumption made in order to draw out and test its logical or empirical consequences.
In Conversion Rate Optimization, a hypothesis is a prediction you create prior to running a split test. A good hypothesis reveals what is to be changed and how the changes will increase the conversion rate. Through A/B testing, a hypothesis can be proved or disproved.
If you run a split test and your variation(s) fail to beat the original, this can be a confirmation that your hypothesis or prediction is wrong. Revisiting the hypothesis is usually the second line of defense, after you’ve changed solutions but still see no uplift.
You may have uncovered the right data during your research, but your prediction after reading the data may not be correct. Sometimes the data uncovered could have multiple predictions as to why visitors behaved in a certain way.
For example, after analyzing session replay videos or heat maps you may notice that visitors are not clicking your CTA buttons. Based on this analysis, your hypothesis can state that increasing the size of the CTA button will make it more visible and this will increase the click rate. However, this can be a wrong prediction because the reason why people are not clicking the CTA button can be due to the placement of the button and not the size.
On the other hand, tests sometimes fail not because you had a wrong hypothesis, but because you didn’t base your variations on the hypothesis. Hoping to increase conversions by testing random ideas is a waste of time, money and web traffic. You need to do proper qualitative and quantitative research, come up with a proper hypothesis and run a test based on that hypothesis.
Revalidate your research data
In every CRO project, Optimizers use two types of data: qualitative and quantitative.
Before an A/B test is launched, both types of data have to be validated. This validation process is a bit tricky, but it’s not impossible to understand. Your qualitative data is validated using a quantitative research technique, or vice versa.
For example, let’s say Google Analytics (a quantitative research tool) shows you that there is a high bounce rate on page XYZ. You would then also watch session replay videos of the same page to understand what may be causing visitors to leave.
The data revalidation process can be undertaken using two approaches: qualitative-first or quantitative-first.
Qualitative-first approach: you first get an understanding of how your users engage with your site, and you later prove or disprove your findings with quantitative data. If your session replays indicate that users are hesitant to click on your CTA button, you can validate that by checking how many people click on the button.
Quantitative-first approach: this approach works in the opposite direction. Most Optimizers prefer it because it answers the ‘what’ questions first. Once they have those answers at their fingertips, they then seek to understand the ‘why’ by analyzing the qualitative data obtained from user tests, heat maps, polls, etc.
The point here is that when your A/B test fails, you have to revalidate your research data. It may be a case of weak or inconclusive data. If you took the quantitative-first approach, use the qualitative-first approach this time around. However, it is much better to undertake both approaches so that you obtain different viewpoints, which will help you see whether you really uncovered the problem on the site.
Interpreting A/B test results
When interpreting the results of your A/B test, there is a validity checklist you should tick so as to avoid false positives or statistical errors. These factors include:
- Sample Size
- Significance level
- Test duration
- Number of conversions
- External and internal factors
- Test result segments (visitor type, traffic source, and device)
- Micro-conversion data
It makes no sense to draw conclusions from any A/B test results without first making sure they are valid.
So, here’s a detailed discussion around each factor you should consider when analyzing A/B testing results.
A/B Test Sample size
Whether you are running the A/B test on a low or high traffic site, your sample size should be big enough for the experiment to reach statistical significance. The bigger the sample size, the smaller the margin of error.
To calculate the sample size for your test, you will need to specify the significance level, the power and the minimum difference between the conversion rates that you would like to detect. If you think the formula is too complicated, there are online sample size calculators that are easy to use.
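For readers who want to see what such a calculator does, here is a minimal sketch using one common approximation for comparing two conversion rates (a two-sided test on two proportions). The function name, the default values (5% significance level, 80% power) and the example numbers are assumptions chosen for illustration; online calculators may use slightly different formulas and return slightly different numbers.

```python
import math
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate number of visitors needed in EACH variation.

    baseline_rate: current conversion rate of the control, e.g. 0.05 for 5%
    relative_lift: smallest relative improvement worth detecting, e.g. 0.20 for +20%
    alpha:         significance level (0.05 corresponds to 95% significance)
    power:         probability of detecting the lift if it really exists
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


# Hypothetical example: 5% baseline conversion rate, aiming to detect a 20% relative lift.
print(sample_size_per_variation(0.05, 0.20))
```

Notice how quickly the requirement grows as the lift you want to detect gets smaller: halving the lift roughly quadruples the number of visitors you need.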
If you do not calculate the sample size of your test, you run the risk of stopping your test too early before it collects enough data. In this regard, Khalid wrote an article and had this to say about sample size:
Any experiment that involves later statistical inference requires a sample size calculation done BEFORE such an experiment starts. A/B testing is no exception.
If you have already started running the test and have the A/B test results at hand, you can still check whether the sample size was big enough to make your results valid.
If the test is stopped before each variation reaches the stipulated number of visitors, the results are unreliable and the risk of a false positive is much higher. Your test should reach the required sample size per variation for the results to be valid.
Statistical significance in A/B Testing
The statistical significance level (also called confidence, significance of the results, or chance of beating the original, depending on the tool) shows how unlikely it is that your result came about by random chance.
As a Digital Marketer, you’d want to be certain about the results, so statistical significance indicates that the differences observed between a variation and the control are unlikely to be due to chance alone.
The industry standard for statistical significance is 95% (or 90% in some cases). This is the target number you should have in mind when running an A/B test.
A 95% statistical significance level means that if there were truly no difference between the variation and the control, a difference as large as the one you observed would show up by chance in only about 5% of tests. It does not mean that there is a 95% probability that the variation is better, or that 95% of repeated tests would match the initial result.
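One common way to check whether the difference between a control and a variation is statistically significant is a two-proportion z-test. The sketch below is an illustration only: the function name and the visitor and conversion counts are made up, and your A/B testing tool may use a different statistical method (for example, a Bayesian one) behind its "significance" number.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates.

    conv_a, n_a: conversions and visitors for the control
    conv_b, n_b: conversions and visitors for the variation
    Returns (z statistic, p-value).
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the assumption that there is no real difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: control 200/5000, variation 260/5000.
z, p_value = two_proportion_z_test(200, 5000, 260, 5000)
alpha = 0.05  # corresponds to the 95% significance target
print(f"z = {z:.2f}, p-value = {p_value:.4f}, significant: {p_value < alpha}")
```

A p-value below 0.05 corresponds to clearing the 95% target; a p-value below 0.10 corresponds to the 90% target.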
A/B Test duration
You ran a test, and it appears to be yielding results. At what point can you decide to end it?
Well, the answer depends on various factors, but a test should neither end too soon nor run for too long before you draw conclusions from it.
One common recommendation is that a test should run for a full business cycle, or at least 7 days. This also depends on the product or service on sale, because certain products and services sell more around paydays and are generally slow throughout the rest of the month.
Every website has a business cycle, which is the time it typically takes for customers to make a purchase. In practice, this means that some websites see relatively few conversions over the weekend and a peak on weekdays.
The results of a test you run on Saturday and Sunday are bound to be different from the results you get from running it on Monday and Tuesday. To get valid test data, you should run your test throughout the business cycle so as to include all possible fluctuations.
However, seven days is a minimum requirement. The actual duration of the test depends on your site traffic. The lower the traffic, the longer you will have to run the test.
To calculate the test duration time, you can use one of the calculators available online.
For example, one such calculator shows that you have to run the test for 18 days if your site has 5000 average daily visitors and 3 variations are being tested.
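The arithmetic behind a duration estimate can be sketched as follows: divide the total number of visitors the test needs by the daily traffic entering the test, and never go below the seven-day minimum. The function name is hypothetical, the required sample size of 25,000 per variation is a made-up figure, and treating the control as one of the variations is an assumption; online calculators take more inputs and may give different answers.

```python
import math


def estimate_test_duration_days(sample_size_per_variation, num_variations,
                                daily_visitors, min_days=7):
    """Rough number of days needed for every variation to reach its sample size.

    Assumes traffic is split evenly and that num_variations includes the control.
    """
    total_visitors_needed = sample_size_per_variation * num_variations
    days = math.ceil(total_visitors_needed / daily_visitors)
    # Never recommend less than a full week, so weekday/weekend swings are covered.
    return max(days, min_days)


# Hypothetical inputs: 25,000 visitors per variation, 3 variations, 5,000 visitors a day.
print(estimate_test_duration_days(25_000, 3, 5_000))
```

If the result is not a whole number of weeks, rounding up to full weeks keeps every day of the week equally represented in the data.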
Number of conversions
It’s often said that the number of conversions a website gets per day depends on the amount of traffic the site gets. High traffic sites usually get more conversions, and vice versa.
Generally speaking, when you run a test on a high traffic site, you do not have to worry about the number of conversions. You should just focus on reaching the required sample size.
But when it comes to low traffic sites, to get more accurate results you should keep in mind two factors:
- Sample size per variation
- The number of conversions.
Your test should reach the required sample size and have at least 200-300 conversions per variation (this is the bare minimum). It is even better if it reaches more than 300 conversions per variation.
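The two conditions above can be turned into a simple pre-analysis check. The sketch below is only an illustration: the function name, the data layout and the numbers are hypothetical, and the conversion threshold is passed in explicitly so that you can apply whichever minimum you have decided on.

```python
def find_validity_problems(variations, required_sample_size, min_conversions):
    """Return a list of human-readable problems; an empty list means both checks pass.

    variations: dict mapping a variation name to a dict with
                "visitors" and "conversions" counts (assumed layout).
    """
    problems = []
    for name, stats in variations.items():
        if stats["visitors"] < required_sample_size:
            problems.append(
                f"{name}: {stats['visitors']} visitors, needs {required_sample_size}"
            )
        if stats["conversions"] < min_conversions:
            problems.append(
                f"{name}: {stats['conversions']} conversions, needs {min_conversions}"
            )
    return problems


# Made-up results for a control and one variation.
results = {
    "control": {"visitors": 9000, "conversions": 320},
    "V1": {"visitors": 9100, "conversions": 280},
}

for problem in find_validity_problems(results, required_sample_size=8500, min_conversions=300):
    print(problem)
```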
So, now we have checked our test results and made sure that they are valid and don’t contain any statistical errors. Let’s move on to deeper analysis.
Analyze external and internal factors
Each and every website you see is impacted by several external and internal factors. These factors include:
- Seasonality or holiday period: for some eCommerce sites, traffic and sales are not stable throughout the year; they tend to peak on Black Friday and Cyber Monday. This could influence your test results.
- Marketing promotions and campaigns: if you run a marketing campaign on the same site where you are running an A/B test, your test results are more likely to be affected.
All these things increase the variance of test data. The higher the variance, the less accurate the test results.
If you run a test during Thanksgiving or any other holiday, before drawing conclusions you should also try to launch it one more time at a different period so as to verify the results.
Analyze micro-conversion data
When analyzing A/B test results, everyone seems to always track the site’s macro conversion data –this can either be a sale, lead generated or a subscription. But analyzing micro-conversions offers another layer of insights.
Just like macro-conversions, micro-conversions can differ from business to business. Micro-conversions depend on the website type (SaaS, e-commerce, lead gen, etc.) and the page you are testing.
Here is an example of micro-conversion goals you may need to analyze for an e-commerce site.
| Test page | Micro-conversion |
| --- | --- |
| Homepage | Top navigation clicks, banner clicks |
| Category page | Product page visit, add to cart event |
| Product page | Add to cart event |
| Cart page | Proceed to checkout |
True, micro-conversions do not necessarily increase your conversion rate, but they help you move prospects down the conversion funnel. It’s not rocket science: the more visitors you persuade, the more purchases you get. In some cases, understanding the micro-conversions helps explain why a test performed the way it did.
Over to you…
When running an A/B test, it’s not always about finding the variation with more conversions; sometimes it’s about learning how user behavior changes. You should constantly be testing so as to understand your visitors, their behaviors and the web elements that influence those behaviors.
Simba Dube is the Growth Marketing Manager at Invesp. He is passionate about marketing strategy, digital marketing, content marketing, and customer experience optimization.