How to Analyze A/B Test Results and Statistical Significance in A/B Testing

Simbar Dube

Simba Dube is the Growth Marketing Manager at Invesp. He is passionate about marketing strategy, digital marketing, content marketing, and customer experience optimization.

When we talk about Conversion Rate Optimization, it’s nearly impossible not to mention A/B testing (or split testing). In fact, many companies treat A/B testing and CRO as if they were completely synonymous. But that’s not true. A/B testing is part of the greater umbrella that is CRO – but that’s a topic for another day.

From SaaS to e-commerce to lead generation websites, many companies now understand how their target audience responds to certain changes on their websites, thanks to A/B testing.

Most of the website elements you see on popular sites such as Google, eBay, and Amazon were evaluated for effectiveness using A/B testing. When it comes to the positioning of website elements, the design strategy that worked for one company may not necessarily work for yours. That’s why you should run an A/B test.

Many people think that A/B testing is all about selecting items to test, setting the goal of the test, paying close attention to changes in user behavior, checking for conversions, checking for the significance level, and determining the winner. 

But is it that simple? (We wish).  

A/B test results can be complex to analyze. Even after creating a strong testing hypothesis, it only takes one simple mistake during the analysis process to derail your efforts and lead you to conclusions that can cost you valuable leads and conversions.

But since you are already here, we will walk you through the process of analyzing results for an A/B test. All the tips in this article can be applied to any A/B testing tool – but we recommend that you try out the tool made by marketers, for marketers: FigPii.

Defining A/B testing

A/B testing, also known as split testing, is the process of comparing two different versions of a web page or email so as to determine which version generates more conversions. 

A/B Testing and CRO

According to our State of AB testing report, 71% of online companies run two or more A/B tests every month. For many CRO Agencies, A/B testing is a decision-making tool that helps reveal the elements that have the highest impact on the overall conversion rate on a site. Simply put, split testing gives empirical validation to your design decisions. 

The potential benefits of using A/B testing are:

  • Improved content 
  • Higher page engagement
  • Higher conversion rates 
  • Reduced bounce rates on pages

Before coming up with any variations to compare in a split test, much user research must be done. In fact, at Invesp, about 70% of the time we spend on a Conversion Rate Optimization project revolves around qualitative and quantitative research. The research is done so that we understand the issues and problems and learn what users want and how they want it. Basically, you want to learn as much as you can about your users.

For instance, let’s say that during your research, you noticed that most of your users are women who are in the 24-34 age group. So, based on these research findings, you’d want to test a copy that would appeal to and resonate with that type of audience.   

Looking at the above example, it’s clear that A/B testing should be informed by research.

Analyzing A/B test results

Once you’re satisfied that your test has gathered enough data, reached the required statistical significance level, and run long enough, it’s time to begin the analysis process. The variation(s) you were testing will either win or lose, or the results will be inconclusive. Regardless of the outcome, you should focus on the learnings, as you will need those to inform your next tests. One thing you should know is that, in some instances, losing tests will give you more insights than winning ones.

With that said, let’s take a look at how you should analyze your A/B test results: 

Winning variations – what’s next? 

Congratulations, your test won! 


So, what’s next? Removing the old design and asking your developers to permanently implement the winning variation? 

No, not yet! 

Before you do that, you have to ensure that your results are correct. This means you must investigate and know the factors contributing to the win. Remember, A/B testing is not all about running tests and hoping for wins. It’s also about learning. 

One winning variation

Many optimizers fail to understand the importance of validating results and are instead obsessed with reaching statistical significance and implementing the winner. That turns the exercise into chasing significance rather than learning. So, before you ask your developer to implement the winning variation across the whole site, first determine whether the test results are valid.

For instance, let’s say you were testing the control against three variations (V1, V2, and V3), and V2 won. The next thing you should do is to re-run the test; this time around, you should only test the control vs. the winning variation (V2, in this case). If the initial results are correct, V2 will win again, and you will be able to draw some learnings that you can propagate across the site. 

The other thing you should consider doing after having a winning variation is to allocate 100% of the traffic to the winning variation. This means pausing the experiment, duplicating it, and resetting the traffic allocation.

Multiple winning variations

Sometimes, depending on how good your hypothesis was, a single test can have multiple winners. V1, V2, and V3 can all outperform your original and show an uplift (in terms of your test goals). As good as that might sound, it can be confusing – in the sense that you might not know which variation to go with.

[Screenshot: a FigPii A/B test with more than one winning variation]

Looking at the above screenshot, it’s easy in such cases to go with variation four because it has the highest uplift. But is ignoring the other winning variations (variations 1 and 3) a good idea?

This is very subjective – some CROs will choose to ignore them, while others would recommend further investigation.

But I’d recommend that you segment your results just to see how your most valuable customers respond to the changes. Your test data can be segmented in different ways, such as: 

  • Traffic source
  • Visitor type (new vs. returning)
  • Browser type
  • Device type (it’s recommended that you test mobile and desktop devices separately so you can see how each performs)

Analyzing your visitors’ behavior under these segments can help you reveal new insights about their perspectives. 
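
If your testing tool lets you export raw, visitor-level results, a quick way to slice them by segment is with pandas. Here is a minimal sketch; the file name and column names (variation, device_type, converted) are hypothetical placeholders for whatever your export actually contains.

```python
import pandas as pd

# Hypothetical export: one row per visitor, with the variation they saw,
# the device they used, and whether they converted (0/1).
df = pd.read_csv("ab_test_visitors.csv")

# Visitors, conversions, and conversion rate per variation and device segment
segments = (
    df.groupby(["variation", "device_type"])
      .agg(visitors=("converted", "count"), conversions=("converted", "sum"))
)
segments["conversion_rate"] = segments["conversions"] / segments["visitors"]
print(segments.sort_values("conversion_rate", ascending=False))
```

The same groupby works for traffic source or new-vs-returning visitors; just swap the segment column.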

Depending on the design differences in those winning variations, sometimes you may need to mix the design elements of those multiple winning variations into a single design so as to find the best experience. 

For instance, let’s say yours is a lead generation website, and you’re trying to test multiple variations of trust signals. In Variation 1, you use the word ‘guarantee’ in your main headline and notice a 35% uplift in conversions. In Variation 2, you include customer testimonials below the fold of the page you’re testing and see a 31% increase in conversions. In Variation 3, you add membership trust signals and trust by association, and you get a 38% uplift in submissions. 

As you can see from the above example, V1, V2, and V3 are all winners. They all produced an uplift in conversions. So, the point here is that you can combine all the winning ideas into a single design that you will implement on the site.

Losing variations – what’s next?

Yes, it’s frustrating sometimes, and some optimizers can’t handle a losing variation, so they tend to ignore losing tests. But that’s the wrong way of going about it – embrace losing variations. You can actually get valuable insights from losing tests.

An A/B test is said to have failed when the variation(s) running against the control fail to beat the control design in terms of the primary goal and the other goals set in the test. A good example of this is when the control/original version gets more conversions than the variation(s).

This can happen even if you follow all the A/B testing best practices and correctly run the test. 

But depending on how you look at things, there is always a good side to everything. In the context of A/B testing, losing variations are not bad – they present a goldmine of information you can use to home in on expectations your website is not meeting, focus your testing, and make improvements that drive long-term success.

In simpler terms, losing variations are just as actionable as winning variations. When your test loses, you should: 

  • Evaluate the solutions you had in your variations.
  • Go through your hypothesis. 
  • Revalidate your research data. 

Here is what I mean by this: 

Reevaluating the solutions in the variations 

The reality is that the element you most likely got wrong is the solution you presented. Solutions can be a bit subjective: based on the problem you uncovered, you may have removed, replaced, or redesigned an element or a flow on the site. But there could be multiple variables to that change: the location, the copy, the look and feel, the UX, and so on.

The vast majority of tests run at Invesp are evaluated from a solutions standpoint first and foremost. The reason is that the problem uncovered and the research conducted are typically thorough, and the hypothesis is strongly data-driven. The solution is the part that is more prone to human assumptions.

Remember: a single hypothesis can have multiple solutions. Very often, the logic behind a solution seemed sound during design discussions, but in reality it did not resonate with site visitors. Going back to the drawing board and thinking about discarded solutions might be a good approach to turning a losing test into a winner.

For instance, let’s say a hypothesis has four possible solutions: 

  1. Change the placement of the form from below to above the fold
  2. Use videos instead of text
  3. Multi-step form instead of a single form
  4. Use a short copy instead of a long one

Because they want to learn which web element had the most impact on increasing conversions, optimizers do not usually test all the possible solutions in a single test. In this case, the first test may be aimed at testing solutions 1 and 2. If the test bears no positive results, the once-discarded solutions 3 and 4 are then tested. 

Go through your hypothesis. 

When the A/B test results are exactly the opposite of what you expected, there is a high chance that your hypothesis is wrong. But before we get into that, what’s a hypothesis? 

The dictionary definition of a hypothesis is:  

A tentative assumption made in order to draw out and test its logical or empirical consequences.

In Conversion Rate Optimization, a hypothesis is a prediction you create prior to running a split test. A good hypothesis reveals what is to be changed and how the changes will increase the conversion rate. Through A/B testing, a hypothesis can be proved or disproved. 

If you run a split test and your variation(s) fail to beat the original, this can be a confirmation that your hypothesis or prediction is wrong. This is usually the second line of investigation, after you’ve changed solutions and still see no uplift.

You may have uncovered the right data during your research, but your prediction after reading the data may not be correct. Sometimes the data uncovered could have multiple predictions as to why visitors behaved in a certain way. 

For example, after analyzing session replay videos or heat maps, you may notice that visitors are not clicking your CTA buttons. Based on this analysis, your hypothesis might state that increasing the size of the CTA button will make it more visible and thereby increase the click rate. However, this prediction can be wrong: people may not be clicking the CTA button because of its placement, or because the copy is not compelling enough.

On the other hand, some tests fail not because you had a wrong hypothesis but because you didn’t base your variations on the hypothesis. Hoping to increase conversions by testing random ideas is a waste of time, money, and web traffic. You need to do proper qualitative and quantitative research, come up with a proper hypothesis, and run a test based on that hypothesis.

Revalidate your research data.

In every CRO project, Optimizers use two types of data: qualitative and quantitative. 

Before an A/B test is launched, both types of data should be validated. This validation process is a bit tricky, but it’s not impossible to understand. Your qualitative data is validated using a quantitative research technique, or vice versa.

To illustrate, let’s say Google Analytics (a quantitative research tool) shows you that there is a high bounce rate on page XYZ. You will then also have to watch session replay videos of the same page so as to understand what could be causing visitors to leave.

The data revalidation process can be undertaken in two approaches: qualitative-first or quantitative-first. 

Qualitative-first approach: this approach entails first getting an understanding of how your users engage with your site, and later proving or disproving your findings with quantitative data. If your session replays indicate that users are hesitant to click on your CTA button, you can validate that by measuring how many people actually click on the button.

Quantitative-first approach: obviously, the quantitative-first approach is in stark contrast with the qualitative one, and most optimizers prefer it because it answers the ‘what’ questions first. Once they have those answers at their fingertips, they then seek to understand the ‘why’ by analyzing qualitative data obtained through user tests, heat maps, polls, and so on.

The point here is that when your A/B test fails, you have to revalidate your research data. If you took the quantitative-first approach and the data turned out to be weak or inconclusive, try the qualitative-first approach this time around. Better still, undertake both approaches so that you obtain different viewpoints; this will help you see whether you really uncovered the right problem on the site.

Interpreting A/B test results

When interpreting the results of your A/B test, there is a validity checklist you should tick to avoid false positives or statistical errors. These factors include: 

  • Sample size
  • Significance level
  • Test duration
  • Number of conversions
  • External and internal factors
  • Segmented test results (visitor type, traffic source, and device)
  • Micro-conversion data

It makes no sense to draw conclusions from any A/B test results without making sure they are valid.

So, here’s a detailed discussion around each factor you should consider when analyzing A/B testing results.  

A/B Test Sample size

Whether you are running the A/B test on a low- or high-traffic site, your sample size should be big enough to ensure that the experiment can reach statistical significance. The bigger the sample size, the smaller the margin of error.

To calculate the sample size for your test, you need to specify the significance level, the statistical power, and the minimum difference between conversion rates you would like to be able to detect. If the formula feels too complicated, there are online sample size calculators that are easy to use.
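
If you’d rather see what those calculators are doing under the hood, here is a sketch of the standard normal-approximation formula for comparing two proportions. The baseline conversion rate, minimum detectable effect, significance level, and power below are example inputs, not recommendations.

```python
from math import ceil
from scipy.stats import norm

def sample_size_per_variation(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed per variation for a two-sided, two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)   # rate you hope to detect
    z_alpha = norm.ppf(1 - alpha / 2)         # e.g. 1.96 for 95% significance
    z_beta = norm.ppf(power)                  # e.g. 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(((z_alpha + z_beta) ** 2) * variance / (p1 - p2) ** 2)

# Example: 3% baseline conversion rate, hoping to detect a 15% relative lift
print(sample_size_per_variation(0.03, 0.15))  # roughly 24,000 visitors per variation
```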

If you do not calculate the sample size of your test, you run the risk of stopping your test too early before it collects enough data. In this regard, Khalid wrote an article and had this to say about sample size: 

Any experiment that involves later statistical inference requires a sample size calculation done BEFORE such an experiment starts. A/B testing is no exception. 

Let’s say you have already started running the test, and you have the A/B test results at hand. You can still check whether the sample size was big enough to make your results valid.

If the test is stopped before each variation reaches the stipulated number of visitors, the results are at high risk of being a false positive. Your test should reach the required sample size per variation for the results to be valid.

Statistical Significance in A/B Testing

The statistical significance level (also called confidence, significance of the results, or chance of beating the original) indicates how likely it is that the difference you observed is real rather than the product of random chance.

As a Digital Marketer, you’d want to be certain about the results, so the statistical significance indicates that the differences observed between a variation and control aren’t due to chance. 

The industry standard of statistical significance should be 95% (or 90% in some cases). This is the target number you should have in mind when running an A/B test. 

Reaching 95% statistical significance means there is only a 5% probability that a difference as large as the one you observed would appear if there were actually no difference between the variations – in other words, the result is very unlikely to be a fluke.
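
To make that concrete, here is a minimal two-proportion z-test you can run on a finished test. The visitor and conversion counts are made-up numbers; a p-value below 0.05 corresponds to the 95% significance threshold mentioned above.

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)              # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))                  # two-sided p-value
    return z, p_value

# Illustrative numbers: control 310/10,000 vs. variation 380/10,000
z, p = two_proportion_z_test(310, 10_000, 380, 10_000)
print(f"z = {z:.2f}, p-value = {p:.4f}")
print("Significant at 95%" if p < 0.05 else "Not significant at 95%")
```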

A/B Test duration 

You ran a test, and it appears to be yielding results; at what point can you decide to end it? 

Well, the answer depends on various factors, but a test shouldn’t end too soon, nor run indefinitely, before you draw conclusions from it.

I asked one of our CRO managers – Hatice Kaya – about the duration of an A/B test.

She suggested that a test should run for a full business cycle, or at least seven days. But she also added that this depends on the product or service on sale, because certain products and services sell more around paydays and sell less throughout the rest of the month.

Every website has a business cycle – the time it typically takes for customers to make a purchase. Basically, this means some websites see relatively few conversions over the weekend and a peak on weekdays.

The results of the test you run on Saturday and Sunday are bound to be different from the results you get from running on Monday and Tuesday. To get valid test data, you should run your test throughout the business cycle so as to include all possible fluctuations. 

However, seven days is the bare minimum. The real duration of the test depends on your site traffic: the lower the traffic, the longer you will have to run the test.

To calculate the test duration time, you can use one of the calculators available online.

For example, a test duration calculator will tell you to run the test for 18 days if your site has 5,000 average daily visitors and three variations are being tested.
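
The arithmetic behind that kind of calculator is straightforward: take the sample size each arm needs, multiply by the number of arms in the test (control included), and divide by the average daily traffic entering the test. The per-arm sample size below is an assumed figure chosen so the example reproduces the 18-day result; plug in whatever your own calculator gives you.

```python
from math import ceil

def test_duration_days(sample_per_arm, num_arms, daily_visitors):
    """Days needed for every arm (control + variations) to reach its sample size."""
    return ceil(sample_per_arm * num_arms / daily_visitors)

# Assumed ~22,500 visitors per arm, control plus three variations, 5,000 visitors/day
print(test_duration_days(22_500, 4, 5_000))  # -> 18 days
```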

Number of conversions

It’s often said that the number of conversions a website gets a day depends on the amount of traffic that the site gets. High-traffic sites usually get more conversions and vice versa. 

Generally speaking, when you run a test on high-traffic sites, you do not have to worry about the number of conversions; you should just focus on reaching the required sample size for that traffic. 

But when it comes to low-traffic sites, to get more accurate results, you should keep in mind two factors:

  • Sample size per variation
  • The number of conversions.

Your test should reach the required sample size and have at least 200–300 conversions per variation (this is the bare minimum). It is even better if it reaches more than 300 conversions per variation.
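
Encoded as a quick sanity check, those two rules of thumb look something like the sketch below; the traffic and conversion numbers are purely illustrative.

```python
def ready_to_analyze(visitors, conversions, required_sample, min_conversions=300):
    """True once a variation has both enough traffic and enough conversions."""
    return visitors >= required_sample and conversions >= min_conversions

# Illustrative per-variation counts and an assumed required sample size
results = {"control": (23_100, 610), "variation 1": (22_950, 655)}
for arm, (visitors, conversions) in results.items():
    ok = ready_to_analyze(visitors, conversions, required_sample=22_500)
    print(f"{arm}: {'ready to analyze' if ok else 'keep collecting data'}")
```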

So, now we have checked our test results and made sure that they are valid and don’t contain any statistical errors. Let’s move on to a deeper analysis.

Analyze external and internal factors. 

Each and every website you see is impacted by several external and internal factors. These factors include:

  • Seasonality or holiday periods: for some e-commerce sites, traffic and sales are not stable throughout the year; they tend to peak around Black Friday and Cyber Monday. This can influence your test results.
  • Marketing promotions and campaigns: if you run a marketing campaign on the same site on which you are running an A/B test, your test results are likely to be affected.

All these things increase the variance of your test data, and the higher the variance, the less accurate the test results are.

If you run a test during Thanksgiving or any other holiday, you should launch it one more time during a different period to verify the results before drawing conclusions.

Analyze micro-conversion data

When analyzing A/B test results, everyone seems to track the site’s macro-conversion data – a sale, a generated lead, or a subscription. But analyzing micro-conversions offers another layer of insights.

Just like macro-conversions, micro-conversions can differ from business to business. They depend on the website type – SaaS, e-commerce, lead gen, etc. – and the page you are testing.

Here is an example of micro-conversion goals you may need to analyze for an e-commerce site. 

  • Homepage – top navigation clicks, banner clicks
  • Category page – product page visits, add-to-cart events
  • Product page – add-to-cart events
  • Cart page – proceed to checkout

Yes, micro-conversions do not necessarily increase your conversion rate by themselves, but they certainly help you move prospects down the conversion funnel. It’s not rocket science: the more visitors you move down the funnel, the more purchases you get. In some cases, understanding the micro-conversions helps explain why a test performed the way it did.

What to do when your A/B test doesn’t win

Not all your A/B tests will be winning tests. This is the truth and something you should be prepared for as a conversion specialist.

Instead of throwing your losing test away and hoping you win with the next one, you can turn this into a learning opportunity.

I had a chat with Anwar Aly, a conversion specialist here at Invesp, and he had this to say:

If the win-to-loss rate of your A/B tests is normal, businesses need to learn from lost tests with the mindset that losing is part of the nature of A/B testing – and that, in some cases, a loss is more valuable than a win when good learnings come out of the post-test analysis.

If the loss rate is high or constant, they need to take a step back and evaluate the overall testing approach – maybe start from scratch with a new audit and review. Qualitative data can also be a great support in validating test hypotheses and increasing test confidence.

In this section, I walk you through a checklist that helps you evaluate losing tests and what you can do differently.

1. Review your hypothesis:

A poorly thought-out hypothesis will result in poor A/B tests and poor results. A hallmark of a weak hypothesis is the lack of insights driving it.

What this means is that the company testing or the CRO agency often guesses what to test; it’s not a product of conversion research.

To create a better insight-driven hypothesis, you should use this format:

We noticed in [type of conversion research] that [problem name] on [page or element]. Improving this by [improvement detail] will likely result in [positive impact on metrics].

So that you can see what I mean, here is a real example:

We noticed in [the session recording videos] that [there was a high drop off] on [the product page]. Improving this by [increasing the prominence of the free shipping and returns] will likely result in [a decrease in exits and an increase in sales].

2. Were your variations different enough?

You’ll be surprised at how similar many variations are to the control.

What happened? Maybe a sentence was changed, or the color of the call to action button, but nothing major.

In this instance, getting a winning test is almost impossible because the variations don’t look different.

Check out this video to see the different categories of A/B tests we run – it will give you a different perspective.

3. Review click maps and heatmaps for the pages tested.

It’s normal to go through heatmaps and session recordings to see how site visitors and users engage with a page pre-test.

Post-test? Not so common.

This is often the missing link in understanding why a test failed.

When you conduct post-test heatmap analysis and session recording of pages tested, you get to see if users engaged with or noticed the element you were testing.

Visitor click maps show you, as heatmaps, what your visitors are clicking on and how far they scroll down your pages. Even more important are visitor session recordings, where you can watch visitors’ exact mouse movements and journeys through your website.

Top mistakes that make your A/B test results invalid 

When it comes to A/B testing, the first question in many companies’ minds is the “what” – how the design of their variations looks – with not enough worry about the “how”: the execution of their experiments.

Variation design is important, but you need solid hypotheses supported by strong evidence backing up that design. 

However, if you believe your work is finished once you have come up with variations for an experiment and pressed the launch button, you’re wrong.

Here are some of the top mistakes businesses make when conducting A/B tests.

1. Too many variations

More variations don’t equal more insight from your test.

Having too many variations slows down your tests and, more importantly, can impact the integrity of your data in two ways.

First, the more variations you test against each other, the more traffic you will need, and the longer you’ll have to run your test to get results that you can trust. This is simple math.

But the issue with running a longer test is that you are more likely to be exposed to cookie deletion. If you run an A/B test for more than 3–4 weeks, the risk of sample pollution increases: in that time, people will have deleted their cookies and may enter a different variation than the one they were originally in.

This messes up your results, making them unreliable.

2. Changing experiment settings in the middle of the test.

When you launch an experiment, you need to commit to it fully. Do not change the experiment settings, the test goals, the design of the variation, or of the Control mid-experiment. And don’t change traffic allocations to variations.

Changing the traffic split between variations during an experiment will impact the integrity of your results because of a problem known as Simpson’s Paradox. This statistical paradox appears when a trend that shows up in several separate groups of data disappears or reverses once those groups are combined.

Changing the traffic allocation mid-test will also skew your results because it alters the sampling of your returning visitors.

Changes made to the traffic allocation only affect new users. Once visitors are bucketed into a variation, they will continue to see that variation for as long as the experiment is running.

So, let’s say you start a test by allocating 80% of your traffic to the Control and 20% to the variation. Then, after a few days you change it to a 50/50 split. All new users will be allocated accordingly from then on.

However, all the users who entered the experiment prior to the change will be bucketed into the same variation they entered previously.
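
Here is a small, made-up illustration of how that skew can play out: the variation beats the control in both periods, yet looks worse when the periods are pooled, simply because the mid-test reallocation sent most of its traffic through the low-converting weekend. All numbers below are invented for the example.

```python
# (visitors, conversions) per arm in each period of a test whose traffic
# split was changed mid-flight.
periods = {
    "weekdays (80/20 split)": {"control": (8000, 800), "variation": (2000, 220)},
    "weekend (20/80 split)":  {"control": (2000, 40),  "variation": (8000, 176)},
}

totals = {"control": [0, 0], "variation": [0, 0]}
for period, arms in periods.items():
    for arm, (visitors, conversions) in arms.items():
        rate = conversions / visitors
        print(f"{period:<24}{arm:>9}: {rate:6.2%}")
        totals[arm][0] += visitors
        totals[arm][1] += conversions

print("\nPooled over both periods:")
for arm, (visitors, conversions) in totals.items():
    print(f"{arm:>9}: {conversions / visitors:6.2%}")

# Per period the variation wins (11.00% vs 10.00%, 2.20% vs 2.00%),
# yet pooled it appears to lose (3.96% vs 8.40%).
```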

Over to you…

When running an A/B test, it’s not always about finding the variation that got more conversions; sometimes it’s about learning how user behavior changes. You should always be testing so as to understand your visitors, their behaviors, and the web elements that influence changes in their behavior.

Additional resources

1. Should you run the same test across all devices?

2. What to do when your A/B test keeps on losing.

3. What they don’t tell you about A/B testing velocity.

4. Top 6 A/B testing questions answered.

5. Why and how you should document experimentation insights.

6. Everything You Need to Know About User Testing
