Validity Threats to Your AB Test and How to Minimize Them

Ayat Shukairy

My name is Ayat Shukairy, and I’m a co-founder and CCO at Invesp. Here’s a little more about me: at the very beginning of my career, I worked on countless high-profile e-commerce projects, helping diverse organizations optimize website copy. I realized that although the copy was great and was generating more traffic, many of the sites performed poorly because of usability and design issues.
Reading Time: 15 minutes

Disclaimer: This section is a TL;DR of the main article and it’s for you if you’re not interested in reading the whole article. On the other hand, if you want to read the full blog, just scroll down and you’ll see the introduction.

  • There are hundreds of case studies and examples of A/B testing. While A/B testing is important, it’s just a small fraction of the overall CRO process.
  • AB testing isn’t foolproof and, like anything in statistics, results can be inaccurate. But the more you know about what makes a test valid, and about basic statistical concepts, the less likely you are to be misled by errors.
  • Validity threats are risks associated with certain uncontrollable or ‘little-known’ factors that can render A/B test results inaccurate. They are broadly categorized as type 1 and type 2 errors.
  • A null hypothesis is an assumption stating that there is no relationship between two datasets. Hypothesis testing is done to determine whether that assumption holds.
  • In statistics, a type 1 error is said to occur when a true null hypothesis is rejected, which is also called a ‘false positive’. Results might indicate that Variation B is better than Variation A because B is giving you more conversions, but a type 1 error might be causing this conclusion.
  • In statistics, a type 2 error, or false negative, occurs when a false null hypothesis is retained or accepted. In other words, a test is declared inconclusive when in reality it is conclusive.
  • Flicker effect: this occurs when original content flashes for a brief time before the variation gets loaded onto the visitors’ screens. This leads to visitors getting confused about content, and can result in conversions dropping.
  • History effect: this happens when an extraneous variable is introduced while a test is running, and leads to a skewing of results. It happens because an AB test is unlike a lab test and does not run in isolation. Therefore, AB tests are prone to be affected by external variables and real-world factors.
  • Instrumentation effect: these are errors related to your testing tool and code implementations. They happen when the tool you’re using is faulty or when the wrong code has been implemented.
  • Selection effect: this bias or error occurs because the sample is not representative of your entire audience. One of the reasons why selection error happens is because of sample bias. Marketers conducting experiments get attached to the hypothesis that they have constructed. Everyone wants their hypothesis to win. So, it is easy to select a certain sample for testing and eliminate factors or variants that might result in their hypothesis being incorrect.
  • Novelty effect: errors or changes in test results that come from introducing something unusual or new that the visitor is not used to. The novelty effect happens because something new is put in front of visitors.
  • Statistical regression: this happens when you end a test too early, before the data has had time to even out over the full test period. Most people end the test when a 90% significance level is reached, without reaching the required sample size. You cannot be sure of the AB test results just by reaching 90% significance; you must also reach the required sample size.
  • Simpson’s paradox: this happens because of changing the traffic splits for variants while the test is going on. It occurs when a trend that was being observed in separate sets of data disappears when those groups are combined.

 

Here’s A Longer And More Detailed Version Of The Article.


There are hundreds of case studies and examples on AB testing, explaining what makes it so useful for conversion optimization. While AB testing is important, it is just one component of the overall CRO process. What’s critical to keep in mind is not to dive into AB testing right away, but to first learn how to interpret, analyze, and understand test results.

AB testing isn’t foolproof and, like anything in statistics, results can simply be wrong. But the more you know about what makes a test valid, and about basic statistical concepts, the less likely you are to be misled by errors. This is where validity threats become an important topic of discussion. If left unidentified or unaccounted for, they can lead you to make the wrong decision.

What are validity threats, you ask? In simple words, validity threats are risks associated with certain uncontrollable or ‘little-known’ factors that can make your results, and therefore your AB test conclusions, inaccurate. Broadly speaking, validity threats can be categorized as type 1 and type 2 errors. But before we define these errors, let’s understand what a null hypothesis is.

Null hypothesis: An assumption stating that there is no relationship between two datasets. Hypothesis testing is done to either prove or disprove that assumption.


Type 1 errors: In statistics, a type 1 error is said to occur when a true null hypothesis is rejected, which is also called a ‘false positive’. Results might indicate that Variation B is better than Variation A because B is giving you more conversions, but a type 1 error might be causing this conclusion. Such errors occur when a test is declared conclusive although it is actually inconclusive. In every test, there is some probability of false positives or incorrect conclusions.

Hypothetical case study explaining type 1 errors:

You have a SaaS product and you believe that changing the ‘free trial’ CTA from fixed to floating will get you more free trial subscriptions. Variation A has the fixed CTA and Variation B the floating one.

You launch the test and check results within 48 hours. Results indicate that Variation B is giving 2% higher conversions with 90% confidence. You declare Variation B the winner. A week passes and you see that conversions are starting to trend downward. What went wrong? You probably checked the results too early, or you did not use a strict enough confidence level.
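To see why checking results too early inflates false positives, here is a rough simulation of the ‘peeking’ problem; it is only a sketch, and the daily traffic, baseline conversion rate, and number of simulated tests are made-up assumptions rather than figures from the case study. Both variations are identical, yet stopping as soon as the p-value dips below 0.10 (“90% confidence”) declares a winner far more often than 10% of the time.

```python
# Sketch: how checking ("peeking" at) an A/B test daily inflates type 1 errors.
# Both variations are identical, so every declared winner is a false positive.
# Traffic, conversion rate, and run counts are illustrative assumptions.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(7)
DAYS, DAILY_VISITORS, BASE_RATE, N_SIMULATIONS = 14, 500, 0.05, 200
false_positives = 0

for _ in range(N_SIMULATIONS):
    conv_a = conv_b = visits_a = visits_b = 0
    for _ in range(DAYS):
        visits_a += DAILY_VISITORS
        visits_b += DAILY_VISITORS
        conv_a += rng.binomial(DAILY_VISITORS, BASE_RATE)
        conv_b += rng.binomial(DAILY_VISITORS, BASE_RATE)
        table = [[conv_a, visits_a - conv_a], [conv_b, visits_b - conv_b]]
        _, p_value, _, _ = chi2_contingency(table)
        if p_value < 0.10:  # "90% confidence" reached -> stop and declare a winner
            false_positives += 1
            break

print(f"False positives: {false_positives / N_SIMULATIONS:.0%} of simulated A/A runs")
```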

Type 2 errors: In statistics, a type 2 error, or false negative, occurs when a false null hypothesis is retained or accepted. In other words, a test is declared inconclusive when in reality it is conclusive. As opposed to a type 1 error, a type 2 error occurs when the evidence suggests that Variation A is performing better than, or the same as, Variation B, even though B is actually better.

Hypothetical case study explaining type 2 errors:

For example, your hypothesis is that introducing the option ‘Pay by PayPal’ is likely to improve purchases.

Version A (Control): Does not have PayPal payment option

Version B: Includes the option to ‘Pay by Paypal’ on checkout

Test results show that Version A wins and that the ‘Pay by PayPal’ option did not have any effect on final conversions. In reality, however, this might have happened because your sample size was too small. If you had increased the sample size, you might have avoided this type 2 error, or false negative.
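A quick way to see how an undersized sample produces this kind of false negative is to compute statistical power, the probability of detecting a lift that really exists. The sketch below uses statsmodels; the 5% baseline, the 6% true rate, and the sample sizes are illustrative assumptions, not numbers from the PayPal example.

```python
# Sketch: statistical power (chance of detecting a real lift) at different
# sample sizes, assuming the true conversion rate rises from 5% to 6%.
# Low power means a high risk of a type 2 error (false negative).
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect = proportion_effectsize(0.06, 0.05)  # effect size for 6% vs 5% conversion
analysis = NormalIndPower()

for n_per_variant in (500, 2000, 8000):
    power = analysis.power(effect_size=effect, nobs1=n_per_variant,
                           alpha=0.05, ratio=1.0, alternative="two-sided")
    print(f"{n_per_variant:>5} visitors per variant -> power ~ {power:.0%}")
# With small samples, most real lifts go undetected -- that is the type 2 error.
```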

Types of Validity Threats and How to Minimize Them

Having understood the two common categorizations of validity threats, let’s look in detail at the common threats to the validity of AB tests and how to minimize them.

1. Flicker Effect


What is it: The flicker effect occurs when the original content flashes briefly before the variation loads on the visitor’s screen. This confuses visitors about the content and can cause conversions to drop. Any disturbance on the website, including slow loading, can put visitors off. If visitors see two different versions of the content while the page loads, they are likely to grow suspicious and leave.

Why it happens: Flickering might happen because of slow website loading speed, the testing code being added incorrectly to the page being tested, the code loading asynchronously, or too many scripts being loaded before the test script. Other factors beyond these can also cause the flicker effect.

Example: Suppose you have applied the testing code at the bottom of the test variation rather than in the header. This can cause flickering because the browser will execute the code only at the end, not as soon as the visitor lands on the variation. As a result, the visitor will first see the original content on their screen.

How to minimize it: Optimizing your website speed will help reduce flicker. You should also be careful while implementing the code. Another thing to ensure is that the testing script is removed from the tag manager or that it is loaded synchronously.

2. History Effect

What is it: History effect happens when an extraneous variable is introduced while a test is running, and leads to skewing of results.

Why it happens: It happens because an AB test is unlike a lab test and does not run in isolation. Therefore, AB tests are prone to be affected by external variables and real-world factors.

Example: You are running a marketing campaign that uses a landing page from your website, and you are also running an AB test on that landing page. You might see a spike in traffic on the landing page due to the marketing campaign, which might lead to increased sign-ups as well. This could lead you to conclude that the original landing page is better, while in reality Variation B might have won had you not been running the marketing campaign at the time. In this case, the marketing campaign is the extraneous variable that has skewed your AB test results.

How to minimize it: The best way to tackle the history effect is to take into account any external factors that can skew results. Apart from this, it is important to let everyone in the organization know when an AB test is being run. Knowing that a test is in progress ensures that no one on the team introduces external factors or variables to the pages being tested during the test. Using Google Analytics alongside your AB testing tool will also help you track any changes in traffic that happened not because of the test but because of an external variable. This will save you from deploying the wrong variation and incurring losses.

Case Study: This post on validity threats by Marketing Experiments talks about a case study where they wanted to determine which ad headline would fetch the highest click-through rate for their subscription-based website. During the test, an external ‘real world’ event occurred that led to a significant and transient change in the traffic coming to the website, which skewed the results.

3. Instrumentation Effect

What is it: These errors are related to your testing tool and code implementations.

Why it happens: When the tool you are using is faulty, or when the code has been deployed incorrectly or is not compatible with certain browsers. Note that deploying incorrect code is a different problem from a faulty testing engine; each needs to be diagnosed differently.

Example: One of the four variations you are testing is not running properly on Chrome. This means the chances of recording conversions on that variation are slim. If the code were compatible with Chrome, that variation would have given a different result.

How to minimize it: A/A testing is one way to determine whether your tool is faulty, has not been deployed correctly, or is inefficient. If that is the case, your A/A test will declare a winner even between two identical variations. The problem with A/A testing, however, is that it is time-consuming. It is best to perform an A/A test only if your website gets plenty of traffic.
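Here is a minimal sketch of the statistical check behind an A/A test: a standard chi-square test comparing two identical variations. The visit and conversion counts are made up for illustration. A single ‘significant’ A/A result can still happen by chance roughly 5% of the time, but if your tool flags a winner between identical pages much more often than that, suspect the setup.

```python
# Sketch of an A/A sanity check: both "variations" serve the same page, so a
# consistently significant difference points to a tooling or tracking issue.
# Visit and conversion counts below are made-up illustrations.
from scipy.stats import chi2_contingency

visits_a, conversions_a = 10_000, 520
visits_b, conversions_b = 10_000, 505

table = [[conversions_a, visits_a - conversions_a],
         [conversions_b, visits_b - conversions_b]]
_, p_value, _, _ = chi2_contingency(table)

print(f"A/A p-value: {p_value:.3f}")
if p_value < 0.05:
    print("Significant difference between identical variations -> investigate the setup")
else:
    print("No significant difference -> nothing suspicious in this A/A run")
```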

Another way to tackle the instrumentation effect is to double-check that your experiment has been set up the right way: there is no code error, no mismatch between the code and browser compatibility, and your data is being passed correctly into the CRM. Being vigilant and watchful for errors is the best way to minimize the instrumentation effect.

Raphael Paulin-Daigle recommends:

“Before launching ANY tests, you should always do rigorous Quality Assurance (QA) checks such as performing cross-browser and cross-device testing on your new variations, and trying out your variations under multiple different user scenarios.”

You must also ensure that the testing engine is compatible with Google Analytics so that you can see the testing data in GA and use it as a source of comparison.

If all of your measures to minimize instrumentation errors fail, it may indicate that the testing engine itself is faulty.

Case Study: You can read this case study which talks about how Copyhackers ran a split test and found out there were major loopholes in their testing engine.

4. Selection Effect

What is it: The bias or error that occurs because the sample is not representative of your entire audience.

Why it happens: There are many reasons why one could end up picking a sample that does not completely or accurately represent the entire audience. One of them is sample bias. Marketers conducting experiments get attached to the hypothesis they have constructed. Everyone wants their hypothesis to win, so it is easy to select a certain sample for testing and eliminate factors or variants that might prove the hypothesis incorrect. While calculations can tell you the appropriate sample size, they are not helpful in deciding who should make up the sample. The idea is to keep the sample completely representative of your entire audience and free of bias.

Example: You are running ads on your website for premium hotels and your hypothesis is that the number of bookings will increase by running this campaign. However, conversions on Variation B, which has the ad campaign running, go down instead of up. Perhaps your main traffic consists of middle-income visitors and you haven’t considered this group in your sample. This is likely to skew your test results.

How to minimize it: Regularly study your analytics reports and keep digging deeper into the sources of your traffic. Make sure your sample is truly representative and free of sample pollution. Not taking into account the different types of traffic that visit and interact with your website, and that make up your sample, will cause regression to the mean. Include both new and returning traffic in your sample, and consider both weekday and weekend traffic.

Optimize Smart summarizes the point:

“Each traffic source brings its own type of visitors, and you can’t assume that paid traffic from a few ads and one channel mirrors the behaviors, context, mindset, and needs of the totality of your usual traffic.”

Case Study: “In that case launching a winning variation may not result in any real uplift in sales/conversion rate. The launch of winning variation may, in fact, lower your conversion rate. When you’re analyzing the test results, make sure to segment by sources in order to see the real data that lies behind averages.” via SplitBase

5. Novelty Effect


What is it: Error or changes in test results that are an outcome of introducing something unusual or new that the visitor is not used to.

Why it happens: The novelty effect happens because something new is put in front of visitors. It occurs because of the innate human tendency to prefer something new over something old, such as alterations to a landing page that visitors are not used to seeing on your website.

Example: Let’s go back to an online example. You introduce slider images for your apparel section in Variation B. In Variation A, you have one image for the apparel. Your hypothesis is that the version with slider images will fetch you more conversions. Even if your hypothesis wins, it might simply be that the new change is attention-grabbing, the novelty effect kicks in, and conversions see only a temporary spike.

How to minimize it: Your existing audience might behave differently simply because they have been exposed to something new. Conversions, in this case, are likely to spike not because one version is better than the other, but because the audience is seeing something different. The best way to eliminate this bias is to drive new traffic to your website while introducing something new and AB testing it.
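A complementary check, sketched below with made-up numbers: segment the results by new versus returning visitors. Returning visitors are the ones who notice that something changed, so if the lift for the new variation shows up mainly among them, the novelty effect is a likely suspect.

```python
# Sketch: compare variation B's lift among new vs. returning visitors.
# If B only wins with returning visitors (who notice the change), the
# novelty effect is a plausible explanation. All counts are illustrative.
segments = {
    # segment: (visits_a, conversions_a, visits_b, conversions_b)
    "new visitors":       (4000, 200, 4000, 204),
    "returning visitors": (4000, 200, 4000, 260),
}

for name, (va, ca, vb, cb) in segments.items():
    rate_a, rate_b = ca / va, cb / vb
    print(f"{name:>18}: A {rate_a:.1%} vs B {rate_b:.1%} (lift {rate_b - rate_a:+.1%})")
```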

Case Study: Let’s look at the FC Dallas stadium case study by Clive Thompson:

“In 2005, FC Dallas — the city’s pro soccer team — moved into a new, state-of-the-art $80 million stadium. Over the next two years, games drew 66% more fans, with an average of about 15,145 attending each game. Over the next few years, though, as the novelty of the stadium diminished, some of those new fans began drifting away, and average attendance slid to 12,440.”

6. Statistical Regression

What is it: Regression toward the mean, also called statistical regression, happens when you end a test too early, before the data has had time to even out over the full test period.

According to Ted Vrountas:

“If you’re making business decisions based on your A/B tests just because they reached statistical significance, stop now. You need to reach statistical significance before you can make any inferences based on your results, but that’s not all you need. You also have to run a valid test.”

Why it happens: Most people end the test when a 90% significance level is reached, without reaching the required sample size. You cannot be sure of the AB test results just by reaching 90% significance; you must also reach the required sample size. Otherwise, the results might just be noise. If, for example, the required sample size for your test is 50 and you end the test at a sample size of 20 because you’ve reached 90% significance, your test results are skewed. There are a number of AB testing sample size calculators that can help you find the required sample size for your AB test.
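For the curious, here is roughly what those sample size calculators compute, sketched with the standard two-proportion formula; the 5% baseline rate, the 6% target rate, the significance level, and the power below are illustrative assumptions.

```python
# Sketch of the two-proportion sample size formula behind most A/B-test
# calculators. All input values below are illustrative assumptions.
from math import sqrt
from scipy.stats import norm

def required_sample_size(p1, p2, alpha=0.05, power=0.80):
    """Minimum visitors per variant to reliably detect a change from p1 to p2."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for two-sided significance
    z_beta = norm.ppf(power)           # critical value for the desired power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return numerator / (p1 - p2) ** 2

# Detecting a lift from 5% to 6% conversion needs roughly 8,000+ visitors per
# variant -- stopping at "90% significance" with a tiny sample is not enough.
print(round(required_sample_size(0.05, 0.06)))
```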

Example: You create a new landing page for your SaaS product and all of your first 15 visitors convert on the new variation, a 100% conversion rate. Does that mean the new landing page is far better than the old one? No. The first 15 visitors do not make up the full sample size. Your sample size calculator tells you that your minimum sample size is 50; you cannot declare the variation a winner having reached a sample size of only 15.

How to minimize it: Do not stop your AB test when you reach statistical significance. You need to collect as much data as possible, which in turn will lead to higher accuracy in your test results. Reaching the required sample size is the key to eliminating statistical regression errors.

You must also calculate the sample size prior to running the test so that you can make sure you reach that number of visitors before drawing conclusions.

7. Simpson’s Paradox


What is it: Errors that happen because of changing the traffic splits for variants while the test is going on.

Why it happens: It occurs when a trend that was being observed in separate sets of data disappears when those groups are combined. This happens because weighted averages are taken into account when calculating test results. Simpson’s paradox can also occur when the traffic distribution for a variation is altered manually. For example, you see that Variation B is the winning variant, so you shift more traffic to it. Another reason is that members of the population leave or join while a test is running.

Example: To understand the point on weighted averages and Simpson’s paradox, let’s compare conversions and test results for control and variation, in the University of California example that Josh Baker explains:

“In 1973, the University of California at Berkeley was sued for showing bias in admissions for women to their graduate school. Men had a much better chance to be admitted than women according to the statistics given. “

Combining the subgroups, as below, made it look as though men were more likely to be admitted:

          Applicants   % admitted
Men       8,442        44%
Women     4,321        35%

“But according to individual department numbers, it showed that there was a small but statistically significant bias that favored the women in actually having a higher chance of being admitted.”

Department   Men: Applicants   Men: % admitted   Women: Applicants   Women: % admitted
A            825               62%               108                 82%
B            560               63%               25                  68%
C            325               37%               593                 34%
D            417               33%               375                 35%
E            191               28%               393                 24%
F            272               6%                341                 7%

The above chart is a good example of Simpson’s paradox and shows how the women’s results outperform the men’s when the data is broken down by department, even though the combined totals suggest the opposite.

How to minimize it: Rather than relying on what the total numbers tell you, make sure before you start testing that the groups are as similar as possible.

Gordon S. Linoff explains:

“Simpson’s Paradox arises when we are taking weighted averages of evidence from different groups. Different weightings can produce very different, even counter-intuitive results. The results become much less paradoxical when we see the actual counts rather than just the percentages.”

Georgi Georgiev recommends:

“We should treat each source/page couple as a separate test variation and perform some additional testing until we reach the desired statistically significant result for each pair (currently we do not have significant results pair-wise).”

Let’s look at another example.

Example: To understand the point on weighted averages and Simpson’s paradox, let’s compare conversions and test results for control and variation, in the following hypothetical example.

                    Visits for A   Visits for B   Conversions for A   Conversions for B   Conv. rate for A   Conv. rate for B
Aggregate           7,000          7,000          350                 460                 5%                 6.5%
Traffic source 1    5,000          1,000          200                 10                  4%                 1%
Traffic source 2    1,000          2,500          65                  150                 6.5%               6%
Traffic source 3    1,000          3,500          85                  300                 8.5%               8.5%

The above chart is a perfect illustration of Simpson’s paradox: Variation A appears to perform better than Variation B when the data is divided by traffic source, even though Variation B looks better in aggregate.
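To make the arithmetic explicit, the sketch below recomputes the rates from the hypothetical table above: Variation B wins on the aggregate numbers, yet Variation A leads in sources 1 and 2 and is effectively tied in source 3. Only the code itself is new; the figures are the ones in the table.

```python
# Recomputing the hypothetical table above: B looks better in aggregate,
# yet A leads in traffic sources 1 and 2 and is roughly tied in source 3.
visits_a = {"source 1": 5000, "source 2": 1000, "source 3": 1000}
visits_b = {"source 1": 1000, "source 2": 2500, "source 3": 3500}
conv_a   = {"source 1": 200,  "source 2": 65,   "source 3": 85}
conv_b   = {"source 1": 10,   "source 2": 150,  "source 3": 300}

for source in visits_a:
    rate_a = conv_a[source] / visits_a[source]
    rate_b = conv_b[source] / visits_b[source]
    print(f"{source}: A {rate_a:.1%} vs B {rate_b:.1%}")

# The aggregate reverses the picture because B's traffic is weighted toward the
# high-converting source 3, while most of A's traffic comes from source 1.
aggregate_a = sum(conv_a.values()) / sum(visits_a.values())  # 350 / 7000
aggregate_b = sum(conv_b.values()) / sum(visits_b.values())  # 460 / 7000
print(f"Aggregate: A {aggregate_a:.1%} vs B {aggregate_b:.1%}")
```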

How to minimize it: Rather than going by what the aggregates tell you, dig a little deeper into the segment-wise performance of your variations. Maybe you would like to retain Variation B for traffic source 3, since it performs about as well as Variation A for that source, and deploy Variation A for traffic source 2. Such insights can improve your decision making and help you draw better conclusions from AB testing.

Conclusion

AB tests aren’t free of bias, and there are a number of factors that can skew the results you obtain from AB testing. However, if you are aware of the validity threats and the type 1 and type 2 errors we have discussed in this post, you can stay vigilant, account for the scope of error, and interpret test results wisely. We’d love to know if you have ever run an AB test and encountered a validity threat. Share your experience and learnings in the comments below. Feedback is welcome!

