14 A/B Testing Sample Size Issues and Mistakes That Can Ruin Your Test

Khalid Saleh

Khalid Saleh is CEO and co-founder of Invesp. He is the co-author of Amazon.com bestselling book: "Conversion Optimization: The Art and Science of Converting Visitors into Customers." Khalid is an in-demand speaker who has presented at such industry events as SMX, SES, PubCon, Emetrics, ACCM and DMA, among others.
Reading Time: 20 minutes

You must cover all the bases to get reliable results from your A/B tests.

A/B testing mimics scientific experimentation, and like any experiment, it will not give you 100% certainty in 100% of the tests you run. Only a few marketers are aware of the method’s limitations and know how to run tests that produce valid results while minimizing the risk of false positives and false negatives.

That’s why Martin Goodson, now the Chief Scientist at Evolution AI, wrote a paper called “Most Winning A/B Test Results Are Illusory.” In it, he explained that poorly run A/B tests are likely to produce false wins.

If you focus only on statistical significance, you are wrong.

Here is an excellent example showing that you should not rely on statistically significant tests all the time.

Image source: Unbounce

Despite the test reporting a +588.61% improvement at a 99% confidence level, the sample size and conversion numbers undermine the result. In reality, this kind of improvement means nothing.

A sample size of 50-100 people does not provide consistent results. In the example above, the number of conversions increased from 1 to 8, with 68 visitors for the control and 79 for variation 1.
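
One way to see why: put a confidence interval around the difference between the two conversion rates. The sketch below uses the standard normal approximation; the function is illustrative, not taken from any testing tool.

```javascript
// Normal-approximation confidence interval for the difference between two
// conversion rates, using the numbers from the example above.
function diffConfidenceInterval(convA, visitorsA, convB, visitorsB, z = 1.96) {
  const pA = convA / visitorsA;
  const pB = convB / visitorsB;
  const diff = pB - pA;
  // Standard error of the difference between two independent proportions
  const se = Math.sqrt((pA * (1 - pA)) / visitorsA + (pB * (1 - pB)) / visitorsB);
  return { diff, low: diff - z * se, high: diff + z * se };
}

const ci = diffConfidenceInterval(1, 68, 8, 79); // control: 1/68, variation: 8/79
console.log(ci.low, ci.high);
```

The interval is huge relative to the rates involved, and with only one conversion in the control, the normal approximation itself is unreliable: the reported +588% lift could be almost anything.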

Even if you get the test-creation process right, such as conducting qualitative usability tests, digging into analytics to uncover conversion problems, and forming a sound testing hypothesis, your A/B test can still fail because of the basic prep work that ensures the accuracy of collected data.

Proper sampling lies at the core of getting valid A/B test results.

In this article, we will discuss 14 sampling issues that could skew or completely ruin your A/B test results.

What Is Sample Pollution and Should You Worry About It?

“Sample pollution” refers to factors that invalidate your A/B test data by influencing the samples or data used while conducting your test.

Although you might have limited or no control over these factors, you must be aware of them and carefully monitor them to understand their impact on your data. Sample pollution might cause you to count the same visitor as two or three different new visitors. Your results might show you the behavior of, let’s say, 100 visitors, while in fact you might have had only 60 real unique visitors.

Before we get into the different types of sample pollution, let me state that every test will have some sort of sample pollution in it. What is important is that you are aware of the level of that pollution and whether it reaches the point where you have to pull the plug on the test.

1. Biased Sample

Random sampling is the essence of any A/B test. It means that every visitor to your website has the same probability of being chosen to see a variation of your A/B test.

Biased sampling is the opposite of random sampling, and it will definitely skew your test results.

In simple terms, biased sampling means you select a sample from the population (all visitors to your website) in such a way that some members of the population are less likely to be included. Thus, some of your website visitors are less likely than others to be chosen to see a variation of your A/B test.

Why does it happen?

You can imagine that when you test on a website you have different factors influencing your conversion rate:

  1. Business cycle
  2. Promotions
  3. Newsletter schedule
  4. Ad campaigns
  5. Pay day of customers

So, when you test for only a couple of days during a promotion, or only on weekdays, you get skewed data: people who see the promotion, or who prefer to shop on weekdays, are more likely to be included in the test sample than people who don’t see the promotion or who shop on the weekend.

The same happens when you run a test for only half of a business cycle.

If your sample data includes exceptional periods like holidays and seasonal campaigns, you are on the wrong track. Your test data will be inconsistent.

How to Ensure a Random Sample

To make your sampling as random as possible, run your A/B tests on every day of the week and for one or two full business cycles. That way, your test sample will accurately represent your website visitors.
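
For illustration, here is roughly how testing tools approximate random assignment: deterministic bucketing by visitor ID gives every visitor the same chance of landing in either bucket and keeps a returning visitor in the same one. The hash and names below are a simplified assumption, not any vendor’s implementation.

```javascript
// Deterministically assign a visitor to a variation by hashing the visitor id.
function assignVariation(visitorId, variations = ['control', 'variation-1']) {
  let hash = 0;
  for (const ch of visitorId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit string hash
  }
  return variations[hash % variations.length];
}

console.log(assignVariation('visitor-123')); // same id always gets the same bucket
```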

2. Too Small Test Sample Size

A common A/B testing mistake is to stop a test as soon as it shows the desired confidence level. In this case, most likely, the sample size of your test is too small and the results you get are not valid.

Why is it so?

Your test may turn out statistically significant at some point, but that is not a reason to stop it as soon as it reaches the desired confidence level.

Most A/B testing tools use frequentist statistics to process the results. Frequentist experiments have a pre-defined sample size, which depends on the characteristics of the population (all your website visitors).

That’s why, before you start testing, you should calculate the minimal sample size for each of your tests. Then assess the test results only after your test reaches that minimal sample size.

Have a look at the example below. This is a test we kept running even after achieving a high confidence level:

Why did we keep running the test?

With fewer than 4,000 visitors per variation, this test hadn’t reached the required minimal sample size. Using the original conversion rate, we calculated a minimum of 8,105 visitors per variation:
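
For reference, a number like 8,105 comes out of the standard two-proportion sample size formula. The article does not state the baseline rate or the minimum detectable lift used, so the inputs below are illustrative assumptions (95% two-sided significance, 80% power):

```javascript
// Minimal sample size per variation (two-proportion formula).
// zAlpha = 1.96 (95% two-sided significance), zBeta = 0.84 (80% power).
// The baseline rate and minimum detectable relative lift are assumed inputs.
function minSampleSizePerVariation(baselineRate, relativeLift, zAlpha = 1.96, zBeta = 0.84) {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeLift);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (p2 - p1) ** 2);
}

// e.g. a 3% baseline and a 20% relative lift you want to detect:
console.log(minSampleSizePerVariation(0.03, 0.2));
```

Note how quickly the number grows as the lift you want to detect shrinks: small effects need very large samples.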

How to Avoid Small Sample Size

To make sure you have the right sample size when you run an A/B test, you should:

  1. Calculate the minimal sample size before you launch the test;
  2. Assess the test results only after your test reaches the minimal sample size.

3. Length Pollution

Determining the required time for an A/B test is a tough task because you must consider sample size, statistical significance, and conversion numbers altogether.

Length pollution occurs when you stop your A/B test too soon.

If your website has high traffic and a high conversion rate, you may reach the minimal sample size and statistical significance in just a couple of days. It might be very tempting to stop the test then, but you should not do that.

If you run your test for only a couple of days, it will not capture the whole variety of your visitors and you will end up with a biased sample.

How to Limit Length Pollution

When conducting an A/B test, make sure that you do the following:

  1. Pre-calculate how much time is required to achieve statistically significant data. Do not run your tests for less than the pre-calculated time frame.
  2. Run each test until each variation has collected a minimum number of conversions. For highly trafficked websites, the minimum might be 1,000 conversions. For most websites, the minimum should be anywhere from 200 to 500 conversions.
  3. Run each test for a minimum of two weeks. That ensures data is collected for every day of the week at least twice. If your test duration is too short, your sample will not capture the natural fluctuations of your business cycle, even if the sample size number is adequate. There are instances where we might run a test for one week, but these are few and far between.
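
The pre-calculation in step 1 is simple arithmetic: the total required sample divided by daily traffic, with the two-week floor from step 3. In the sketch below, the daily traffic figure is an assumed example, while 8,105 is the per-variation minimum from the earlier test.

```javascript
// Required test duration: enough days to fill every variation's sample,
// but never less than two full weeks (so each weekday is covered twice).
function testDurationDays(samplePerVariation, variations, dailyVisitors) {
  const days = Math.ceil((samplePerVariation * variations) / dailyVisitors);
  return Math.max(days, 14);
}

// e.g. 8,105 visitors per variation, 2 variations, ~900 visitors per day:
console.log(testDurationDays(8105, 2, 900)); // 19 days
```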

4. Data Pollution Due to External Factors

We were running an A/B test for a client back in 2010. Things were going well and several of the variations were reporting an increase in conversions.

And then, all of a sudden, conversion rates dropped across all variations. The test data tanked.

Our first assumption was that there were technical issues with the website, but a thorough investigation revealed nothing. We then discovered that a main competitor had launched a major site-wide sale offering close to a 30% discount. That competitive move impacted our test results.

We had to stop the test and wait until the competitor’s sale was over.

Every time you run an A/B test, there is a chance that some external factors will impact your data. We do not have the luxury of running A/B tests in a clean lab environment.

While length pollution comes from running an A/B test for too short a period, running your tests for too long allows external factors to impact your results.

You must pay particular attention to this issue if you have a low-traffic website, or a high-traffic website with an extremely low conversion rate (less than 0.5%). If the minimal sample size you calculated for your test is very large, you will have to run your test longer.

The longer you run the test, the higher the probability that external factors will pollute the results. If you run your test for a couple of months, any changes in the market, competitive offers, and promotions will likely have an impact on it.

How to Minimize Data Pollution Due to External Factors

External factor pollution is not easy to deal with! You will have to:

  • Limit the time you run an A/B test to less than four weeks. The longer you run an A/B test, the more you allow external factors to impact your data.
  • Monitor your competitors to ensure they are not running any special promotions that might impact your testing data.
  • Monitor the overall market to ensure external factors are not affecting your data. We have seen market events impact test data, for instance when major credit card hacks were revealed.

5. Data Pollution Due to Internal Factors

Similar to external factors, there are internal organizational factors that can impact the accuracy of your A/B testing data.

The following are some of the main categories of internal factors:

  1. Changes in promotions: If your website runs promotions on a regular basis, these promotions can impact your data. This could be due to changes in the promotions creatives or running specific offers that can skew the testing data.
  2. Technical issues: technical problems or downtime on your website can cause data pollution.

How to Minimize Data Pollution Due to Internal Factors

We recommend the following:

  1. Limit the time you run an A/B test to less than four weeks. This is the same recommendation we made for limiting pollution due to external factors. Running tests for lengthy periods of time means that both external and internal factors can pollute your data.
  2. Stop (or limit) any website promotions while running an A/B test. I know this is difficult for many companies, so if you need to run promotions while testing, try to run them on portions of the website that are not included in your A/B test.
  3. Monitor any technical issues on your website. Downtime on your website or major sections of your website can impact your data. You will have to deal with each situation as it comes to determine what type of impact it has on your data.

6. Data Pollution Due to Test Implementation

In our experience, this type of data pollution is the most common, especially when conducting A/B tests on mobile devices.

You launch an A/B test with three challengers against an original. Your testing software is collecting data, and you think everything is going well.

You then discover that one of the variations has a bug and is breaking for iPhone users, who represent 36% of your visitors. Since that variation receives a quarter of your traffic, roughly 9% of your visitors were getting a broken experience. Your data is polluted.

These types of bugs are typically introduced while coding the test variations. Since most tests are coded using JavaScript and not served from the server side, there is a good chance that one of your variations might contain bugs.

How to Minimize Data Pollution Due to Test Implementation

The best way to deal with this type of data pollution is to discover it before it happens. Your testing program must:

  1. Develop extensive quality assurance (QA) scripts to ensure that none of the variations introduced any new bugs to the software.
  2. Validate that each of the QA scripts runs on the different platforms used to visit your website.
  3. Launch and validate any test in a QA or UAT environment that closely resembles your production site. Only after data is validated in a QA environment, then it is pushed to production servers.
  4. When you discover that a bug within your variations slipped through all the testing (this will happen!), determine what impact the data pollution had on your test. In most cases, we stop the test, fix the bug, and flush the test data.

7. Data Pollution Due to Testing Software: The Flicker Effect

The flicker effect is a common problem when using client-side A/B testing platforms. It is caused by conducting an A/B test that uses JavaScript to manipulate the webpage (either its layout or its elements).

As the name implies, the effect happens when the original page flickers on the screen first and then a variation is loaded.

Technically, any test that uses JavaScript to manipulate the DOM elements of the original page in order to display a variation will suffer from the flicker effect. This happens because the testing engine must do the following:

  1. Determine whether the page is included in the test.
  2. Load the original page with all its elements.
  3. Finally apply the DOM manipulation JavaScript to change the page from the original to one of the variations.

Image source: Optimizely

How fast the original page flickers is impacted by several factors including:

  1. How fast the testing software determines whether a page is included in an A/B test or not.
  2. How fast the original webpage loads.
  3. How fast the testing software loads the manipulation script.
  4. How efficient the DOM manipulation JavaScript is.

Does the Flicker Effect Cause Data Pollution?

Before getting into what you can do to limit the flicker effect, we should seriously consider its impact on the results of an A/B test.

To start with, having the flicker effect is not ideal. We strive to minimize it as much as possible.

Having said that, there are many times when you have to balance the cost of minimizing the flicker effect against the possible data pollution it causes.

If the flicker effect will have minimal impact on data integrity, is it worth spending three additional days in development to limit it?

You will have to make that judgement call for your own website.

But perhaps a more important question is how do you determine whether the flicker effect might have an impact on your data integrity?

We have two lines of thinking here:

  1. The flicker effect causes data pollution in experiments launched on websites where most visitors are 50 years or older.
  2. If we are not sure about the impact of the flicker effect, we might relaunch the same test implemented as a split URL test. Comparing how the variations perform when coded in two different ways can reveal insights about our visitors.

What You Can Do to Minimize the Flicker Effect

  • Consider blocking or not displaying any of the tested variations until all DOM modifying JavaScript is executed. By doing this, visitors will see a blank page first (as opposed to the original page) and then the variation is displayed. Please note that doing this will put variations at a disadvantage since they will take longer to load compared to the control.
  • Alternatively, you can implement your test as a split URL test. Instead of applying modified JavaScript to the original page, you create separate pages, where each page (a separate URL) is a test variation. Implementing your A/B test as a split URL test is the guaranteed way to remove the flicker effect completely; however, it requires additional development from your team.
  • Follow front-end development best practices to speed up the loading time of the different variations in an A/B test. Use CSS instead of JavaScript wherever possible, and consider using raw JavaScript instead of jQuery to modify the DOM and create the variation.
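
The first option above, hiding the page until the variation script finishes, can be sketched as follows. This is a hypothetical illustration rather than any vendor’s snippet; the event name, the style id, and the one-second safety timeout are all assumptions.

```javascript
// Hide the page until the variation is applied, then reveal it.
// Passing the document in makes the sketch testable outside a browser.
function installAntiFlicker(doc, timeoutMs = 1000) {
  const style = doc.createElement('style');
  style.id = 'anti-flicker';
  style.textContent = 'body { opacity: 0 !important; }';
  doc.head.appendChild(style);

  let revealed = false;
  const reveal = () => {
    if (revealed) return;
    revealed = true;
    style.remove(); // page becomes visible again
  };
  // Reveal when the testing script dispatches its "done" event...
  doc.addEventListener('variation-applied', reveal);
  // ...or when the safety timeout fires, whichever comes first.
  setTimeout(reveal, timeoutMs);
  return reveal;
}
// In the browser: installAntiFlicker(document);
```

The trade-off noted above still applies: visitors see a blank page instead of the original while the variation loads, which puts variations at a slight disadvantage.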

8. Data Pollution Due to Testing Software: Variation Load Time

This type of data pollution is close in principle to the flicker effect. In this case, the testing software takes a little longer to serve test variations than it takes to serve the original page.

This delay in serving variation pages is common in client-side A/B testing engines, which rely on JavaScript to serve the different variations, and less common in server-side A/B testing platforms. As a matter of fact, from a purely technical perspective, server-side technology should always serve variations faster than client-side technology. However, the gain in speed is balanced by lengthier and more costly development cycles.

It might be worthwhile to examine how long different platforms take to serve the variations within a test.

ConversionXL posted an article analyzing the performance of different A/B testing platforms.

Image source: ConversionXL

The load times in the study do not mention how long the original page takes to load. In our experience, most A/B testing platforms add anywhere from 500 milliseconds to 1.5 seconds to a page’s load time.

Does the load time of a test variation impact the test integrity?

This will depend on how long it takes the A/B testing platform to load a variation compared to the original page. A good rule of thumb is to keep the load time of any page on your website to less than 4 seconds. If the original page loads in 3 seconds and your variations take 5 seconds or more, then you will have a data integrity issue.

What Can You Do to Minimize the Variation Load Time?

  1. Start by measuring exactly how long it takes for the A/B testing software to load a variation.
  2. Consider using server-side A/B testing technology; remember that it will speed up serving a variation but will add development time.
  3. Consider running your tests as split URL tests.
  4. Make sure your JavaScript is optimized and follows best practices to ensure speed.
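
The 4-second rule of thumb above can be turned into a quick check. The 4-second page budget comes from the text; the 1-second acceptable gap between original and variation is my assumption.

```javascript
// Flag a variation as a data-integrity risk when it blows the page budget
// or lags too far behind the original page.
function loadTimeRisk(originalMs, variationMs, budgetMs = 4000, maxLagMs = 1000) {
  return variationMs > budgetMs || variationMs - originalMs > maxLagMs;
}

console.log(loadTimeRisk(3000, 5000)); // true: the 3 s vs 5 s example above
console.log(loadTimeRisk(2000, 2400)); // false: both fast, small gap
```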

9. Data Pollution Due to Seasonality

I posed the question to my co-founder Ayat Shukairy: could seasonality cause data pollution?

“Seasonality will impact the integrity of your data if you are not careful. Visitors are highly motivated in particular seasons. That means they are willing to overlook some problems on your website. A particular variation within an A/B test might have some problems, but visitors will overlook them during a high season.”

We have seen this happen while working with many ecommerce companies over the years. There are two ways that seasonality can impact your data:

  1. Motivated visitors ignore many issues in a particular design during a high season (this is the point Ayat mentioned).
  2. Running tests that span both high and low seasons will impact data integrity. Imagine you started a test on May 7th (just before Mother’s Day) and stopped it on June 18th (Father’s Day). Say you were testing homepage banners advertising discounted items and top-rated products for Mother’s Day and Father’s Day. How do you think your data will look?

What Can You Do to Minimize Seasonality Impact?

  1. You should continue to do A/B testing during high season; in fact, we recommend it. You just need to be careful with what you test.
  2. Make sure that tests do not span both high and low season periods.
  3. High seasons where there is high demand are good to test incentives and messaging that relates to the particular season. Avoid testing major website layouts and navigation during that period.

10. Visitor Pollution

Humans can be funny. You introduce a new and better user interface, yet returning visitors do not like it. They are used to the old (and complicated) way your website worked.

This is typically referred to as momentum behavior.

It is not that we are lazy; rather, the benefit of learning a new way to use a website does not outweigh the cost of unlearning the old way.

As a result, you will find that in many cases an original webpage design performs better with returning website visitors. New visitors, on the other hand, give all variations in a test an equal opportunity.

What Can You Do to Minimize Visitor Pollution?

As a good rule of thumb, launch A/B tests for new visitors and exclude returning visitors from the test. This of course assumes that your website receives enough new visitors to conclude the test within four weeks.

11. Cookies Based Pollution

Cookies are small bits of information that a website stores in a user’s browser, and they are a main cause of sample pollution.

A/B testing tools use cookies to remember which variation to show a visitor when he comes to a website.

If the visitor returns to a website where he has seen a test variation before, he will see the same variation again. But if he deletes his cookies before returning, he may see the original or a different variation.

Image Source: AMC

What does it mean for you?

You cannot control whether users clear their cookies.

A comScore report from 2014 suggests that:

  • 28% of users delete first-party cookies per month
  • 37% delete third-party cookies per month
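
Back-of-the-envelope, those deletion rates inflate visitor counts noticeably. The sketch below deliberately uses a simplistic linear model, purely to illustrate the direction of the effect.

```javascript
// If a fraction of real visitors clears cookies each month and returns,
// the testing tool re-counts each of them as a brand-new visitor.
function measuredVisitors(realVisitors, monthlyDeletionRate, months) {
  return Math.round(realVisitors * (1 + monthlyDeletionRate * months));
}

// 10,000 real visitors and 28% first-party cookie deletion over one month:
console.log(measuredVisitors(10000, 0.28, 1)); // 12800 counted "uniques"
```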

How to Minimize Cookie Pollution

That’s why we recommend running a test for no longer than one month: the longer you run it, the higher the probability that your website visitors will delete their cookies and pollute your sample.

12. Cross-Device Pollution

Today many online visitors enter websites using different kinds of devices.

Users switch from smartphones and tablets to desktops and laptops, and on to other connected devices such as smartwatches, game consoles, and smart TVs. So when you conduct an A/B test, this can be problematic when analyzing mobile- and desktop-focused data.

Google’s New Multi-Screen World report reveals that 98% of online customers move between devices on the same day to accomplish a task. When it comes to online shopping, 67% of buyers move between devices sequentially before completing a purchase. The report also shows that visitors use multiple devices simultaneously, for both related and unrelated activities.

According to Google’s report, 65% of shopping experiences start on a smartphone. 61% of these shoppers continue on a laptop, and 4% switch over to a tablet.

Image source: Google Report

Cross-device use therefore plays a crucial role in sample pollution.

When a visitor enters a tested page (variation) on one device and later opens the same page on another device, there is a very good chance he will be directed to a different variation of the same test. This visitor may also be counted as two or more different new visitors by your testing software.

Imagine a user starts his customer journey on his phone and lands on the category page of an ecommerce site. He then decides to move to his laptop, where the bigger screen and faster connection make it easier to navigate the category page and see the products. Once on his laptop, he sees a different variation of the same category page, because the shop is running an A/B test.

Why Is Cross-Device Pollution Problematic?

From the visitor’s perspective, the change in page layout creates anxiety and FUD (fear, uncertainty, and doubt), especially if the changes between the variations are drastic. From a statistical analysis perspective, the same visitor is counted by the testing software as two different visitors, and it becomes hard to determine which variation was responsible for converting him. The results of the test become less reliable.

How to Limit Cross Device Pollution

There are a few steps you can take to limit cross-device pollution.

Launch your A/B tests on a specific device type

The easiest way to avoid cross-device pollution is to launch your A/B tests for a specific device type (desktop, mobile, etc.) and to avoid cross-device testing. This has been our approach for a few years; however, we do get some pushback from clients every now and then.

Some clients serve a responsive version of their website to mobile devices. They would prefer not to go down the path of custom development for desktop vs. mobile.

How do we deal with this?

It is typically easy to demonstrate to the client that one particular test variation works well for desktop and a completely different variation works better for mobile users.

Track registered visitor path on multiple devices

If your platform tracks registered users (using a user ID), you can follow your visitors’ paths across multiple devices, either through your own software platform or through Google Universal Analytics. The only catch, of course, is that visitors need to be logged in to your website throughout the journey.
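
In code, the stitching amounts to grouping sessions by user ID so that one person on two devices counts once. A minimal sketch with illustrative field names:

```javascript
// Group sessions by userId so that one logged-in visitor on several devices
// is treated as a single journey instead of several "new" visitors.
function stitchJourneys(sessions) {
  const byUser = new Map();
  for (const s of sessions) {
    if (!byUser.has(s.userId)) byUser.set(s.userId, []);
    byUser.get(s.userId).push(s);
  }
  return byUser;
}

const journeys = stitchJourneys([
  { userId: 'u1', device: 'phone', variation: 'B' },
  { userId: 'u1', device: 'laptop', variation: 'A' }, // mixed exposure!
  { userId: 'u2', device: 'desktop', variation: 'A' },
]);
console.log(journeys.size); // 2 real visitors behind 3 sessions
```

The same grouping also exposes mixed exposure, where one visitor saw two different variations, which is exactly the pollution described above.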

The Cross Device report is the lifesaver that Google Universal Analytics gives us here. It allows you to see the conversion process from beginning to end.

For instance, you might want to look at the segment of users who search on a mobile device and purchase on a desktop the same day. Or you might track a segment that starts on a smartphone, landing on your website after clicking an ad, investigates the product on a tablet a day later, and finally buys the product ten days later on a desktop.

The Cross Device report connects the data from different devices and activities across sessions, giving you insights into the touch points, sessions, and interactions of visitors to your website.

Conduct polling to quantify usage of your website on different device types

Keith Hagen, director of conversion services at Inflow, suggests conducting polls to quantify the usage of your site on each type of device.

Pre-Purchase Device Poll

– Question to ask visitors: “Have you visited us on another device or browser recently?”

– Purpose of this poll: find out how many visitors return to your site on a desktop, after visiting your mobile site.

– After the poll: track the visitor behavior, by integrating the data with your Google Analytics.

Post-Purchase Device Poll

– Question to ask visitors: “Did today’s purchase involve more than one device or browser?”

– Purpose of this poll: find out the percentage of the cross-device users who complete purchases on your site.

Compare Polls

– Compare the percentage of visitors who reach your desktop site after using a mobile device to the percentage of the visitors who eventually make purchases following cross-device experiences.

Then, find out the success rate for cross-device visitors in terms of conversions.

These polls will help you to gain insights about your visitors’ experiences on multiple devices. When conducting A/B tests, you can use this data to better analyze your test results.

Of the three methods we discussed for limiting device pollution, running device-specific A/B tests is the only meaningful way to limit cross-device pollution. Tracking visitors across multiple devices using GA or your own platform helps quantify how much device pollution you are dealing with. My least favorite method is polling. Don’t get me wrong, I am a big fan of asking visitors what works for them on a website and what doesn’t. I am just not sure that asking visitors about the devices they used is a good way to collect data.

13. Cross-Browser Pollution

Does the browser you use matter when you are reading a blog post? Probably not, because you will get more or less the same reading experience. However, different browsers can provide different browsing experiences on more complex websites.

Google Chrome has the biggest share on desktop:

Image source: StatCounter

For mobile, Chrome is again ahead:

Image source: StatCounter

Some online visitors use different browsers at different times.

Cross-browser sample pollution is caused by visitors using multiple browsers to come back to the website. Visitors arrive at the same tested page and get mixed experiences.

Although most internet users have a preferred browser, people sometimes need to use a different browser for various reasons.

If you are conducting an A/B test on one of your pages, and a visitor arrives at it first through Chrome and then, the next day, through Firefox, your test sample gets polluted. Your testing software will most likely count this same visitor as two distinct new visitors. It will also be hard to attribute the right conversion rate to your variations, because this visitor might see different variations in the different browsers.

Top Reasons Why People Switch Browsers from Time to Time

  • Managing separate identities/accounts
  • Useful capabilities: browsers have different user interfaces (UI), and users’ preferences change depending on their purpose
  • Different tools and extensions for specific tasks
  • Security issue
  • Speed
  • Usability

How to Limit Cross-Browser Pollution?

I tend to think of cross-browser sample pollution as a fact of life and recommend not worrying too much about it! Every test will have some level of cross-browser pollution in it. Accept it and move on.

If you are really concerned about sample pollution due to cross-browser behavior, you can follow the same steps as for limiting cross-device pollution:

  • Run tests separately for each browser. Although this was the first recommendation for cross-device pollution, I am very hesitant to recommend it for cross-browser pollution. You must ensure that each browser gets enough visitors so you are able to conclude tests and reach statistical significance. You must also ask yourself what you will do if a particular variation in an A/B test works better for a specific browser. How will you handle that? In most cases, the cost of running different segments for different browser types is too high.
  • Use polls to determine visitors’ paths to conversion on different browsers.
  • Use Google Analytics if your visitors are logged in with their IDs.

14. Funnel Pollution

I have seen instances where optimization teams try to short-circuit a website’s design and navigation by directing campaign traffic to particular pages that have an A/B test running on them. The test concludes with a winning variation.

The problem is that when that variation is pushed as the default version for the site, conversion rates tank.

The initial suspicion in cases like these is that the team identified a false positive, or that there was a novelty effect.

But digging deeper reveals another type of problem. In its effort to speed up the testing program, the team pushed visitors to the web page in a manner that did not resemble the natural flow of the website. As a result, the data was skewed.

What Can You Do to Minimize Funnel Pollution?

Make sure that traffic to tested pages follows the same path during the A/B test as it will after the test ends. Remember that simplifying the sales funnel, for instance by linking directly from a campaign email to an order page, biases the natural behavior of users and creates more pollution.

Over to you

Sampling issues make A/B testing a bit more complicated than it first appears. Without correct sampling, your A/B test results become less reliable, or even invalid.

Did you run into any of the sampling pollutions mentioned in this article? Anything we should add to the list?

