{"id":11543,"date":"2018-06-25T14:34:03","date_gmt":"2018-06-25T19:34:03","guid":{"rendered":"https:\/\/www.invespcro.com\/blog\/?p=11543"},"modified":"2018-06-25T14:34:03","modified_gmt":"2018-06-25T19:34:03","slug":"validity-threats-to-your-ab-test-and-how-to-minimize-them","status":"publish","type":"post","link":"https:\/\/www.invespcro.com\/blog\/validity-threats-to-your-ab-test-and-how-to-minimize-them\/","title":{"rendered":"Validity Threats to Your AB Test and How to Minimize Them"},"content":{"rendered":"<span class=\"span-reading-time rt-reading-time\" style=\"display: block;\"><span class=\"rt-label rt-prefix\">Reading Time: <\/span> <span class=\"rt-time\"> 15<\/span> <span class=\"rt-label rt-postfix\">minutes<\/span><\/span><p><i><span style=\"font-weight: 400;\">Disclaimer: This section is a TL;DR of the main article and it\u2019s for you if you\u2019re not interested in reading the whole article. On the other hand, if you want to read the full blog, just scroll down and you\u2019ll see the introduction.<\/span><\/i><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">There are hundreds of case studies and examples of A\/B testing. While A\/B testing is important, it\u2019s just a small fraction of the overall CRO process.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">AB testing isn\u2019t foolproof and like anything in statistics, results can be inaccurate. 
But the more you know about what makes a test valid, and about basic statistical concepts, the less likely you are to run into errors.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Validity threats are risks associated with certain uncontrollable or \u2018little-known\u2019 factors that can make A\/B test results inaccurate. They\u2019re broadly categorized as type 1 and type 2 errors.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">A null hypothesis is an assumption stating that there is no relationship between two datasets. Hypothesis testing checks whether the data gives you enough evidence to reject that assumption.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">In statistics, a type 1 error occurs when a true null hypothesis is rejected, which is also called a \u2018false positive\u2019. Results might indicate that Variation B is better than Variation A because B is giving you more conversions, but a type 1 error might be behind this conclusion.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">In statistics, a type 2 error, or false negative, occurs when a false null hypothesis is retained. In other words, the test is declared inconclusive when in reality there is a real difference between the variations.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\"><strong>Flicker effect:<\/strong> this occurs when original content flashes for a brief time before the variation gets loaded onto the visitors\u2019 screens.
This leads to visitors getting confused about content, and can result in conversions dropping.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\"><strong>History effect:<\/strong> this happens when an extraneous variable is introduced while a test is running, and leads to a skewing of results. It happens because an AB test is unlike a lab test and does not run in isolation. Therefore, AB tests are prone to be affected by external variables and real-world factors.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\"><strong>Instrumentation effect:<\/strong> these are errors related to your testing tool and code implementations. It happens when the tool you\u2019re using is faulty or the wrong code has been implemented.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\"><strong>Selection effect:<\/strong> this bias or error occurs because the sample is not representative of your entire audience. One of the reasons why selection error happens is sample bias. Marketers conducting experiments get attached to the hypothesis that they have constructed. Everyone wants their hypothesis to win. So, it is easy to select a certain sample for testing and eliminate factors or variants that might result in their hypothesis being incorrect.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\"><strong>Novelty effect:<\/strong> errors or changes in test results that are an outcome of introducing something unusual or new that the visitor is not used to. The novelty effect happens because something new is shown to visitors.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\"><strong>Statistical regression:<\/strong> this happens when you end a test too early. Extreme early results tend to even out as more data comes in.
Most people end the test when a 90% significance level is reached, without reaching the required <a href=\"https:\/\/www.invespcro.com\/blog\/calculating-sample-size-for-an-ab-test\/\">sample size<\/a>. You cannot be sure of the AB test results only by reaching 90% significance. You must reach the required sample size as well.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\"><strong>Simpson&#8217;s paradox:<\/strong> this can happen when the traffic splits for variants are changed while the test is going on. It occurs when a trend that was observed in different sets of data disappears or reverses when those groups are combined.<\/span><\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<p><strong>Here\u2019s A Longer And More Detailed Version Of The Article.<\/strong><\/p>\n<hr \/>\n<p><span style=\"font-weight: 400;\">There are hundreds of case studies and examples on <a href=\"https:\/\/www.invespcro.com\/ab-testing\/\">AB testing<\/a>, explaining what makes it highly useful for <a href=\"https:\/\/www.invespcro.com\/cro\/\">conversion optimization<\/a>. While AB testing is important, it is only one component of the <a href=\"https:\/\/www.invespcro.com\/cro\/process\/\">overall CRO process<\/a>. What\u2019s critical to keep in mind is not to dive into <a href=\"https:\/\/www.invespcro.com\/ab-testing\/\">AB testing right away<\/a>, before you know how to interpret, analyze, and understand test results. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">AB testing isn\u2019t foolproof and, like anything in statistics, results can simply be wrong. But the more you know about what makes a test valid, and about basic statistical concepts, the less likely you are to run into errors. This is where validity threats become an important topic of discussion. If left unidentified or unaccounted for, they can lead you to make the wrong decision.
<\/span><\/p>\n<p><span style=\"font-weight: 400;\">What are validity threats, you ask? In simple words, validity threats are risks associated with certain uncontrollable or \u2018little-known\u2019 factors that can make AB test results inaccurate. Broadly speaking, validity threats can be categorized as type 1 and type 2 errors. But before we define these errors, let&#8217;s understand what a null hypothesis is. <\/span><!--more--><\/p>\n<p><b>Null hypothesis: <\/b><span style=\"font-weight: 400;\">An assumption stating that there is no relationship between two datasets. Hypothesis testing checks whether the data gives you enough evidence to reject that assumption.<\/span><img decoding=\"async\" class=\"aligncenter wp-image-11551 size-full\" src=\"https:\/\/www.invespcro.com\/blog\/images\/blog-images\/giphy-8.gif\" alt=\"\" width=\"100%\" height=\"auto\" \/><\/p>\n<h4>Gif source: <a href=\"https:\/\/giphy.com\/gifs\/starwars-d2W7eZX5z62ziqdi\">Giphy<\/a><\/h4>\n<p><b>Type 1 errors: <\/b><span style=\"font-weight: 400;\">In statistics, a type 1 error is said to occur when a true null hypothesis is rejected, which is also called a \u2018false positive\u2019. Results might indicate that Variation B is better than Variation A because B is giving you more conversions, but a type 1 error might be behind this conclusion. Such errors are said to occur when a test is declared conclusive <\/span><i><span style=\"font-weight: 400;\">although it is inconclusive.<\/span><\/i><span style=\"font-weight: 400;\"> In every test, there is some probability of false positives or incorrect conclusions. <\/span><\/p>\n<p><b>Hypothetical case study explaining type 1 errors:<\/b><\/p>\n<p><span style=\"font-weight: 400;\">You have a SaaS product and you believe that changing the CTA \u2018free trial\u2019 from fixed to floating will get you more free trial subscriptions.
Variation A will have a fixed CTA and Variation B a floating free trial CTA. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">You launch the test and check results within 48 hours. Results show that Variation B is giving 2% higher conversions with 90% confidence. You declare Variation B as the winner. A week passes by and you see that conversions are starting to show a downward trend. What went wrong? Probably you checked the results <\/span><b>too early <\/b><span style=\"font-weight: 400;\">or maybe you did not set the right confidence level. <\/span><\/p>\n<p><b>Type 2 errors: <\/b><span style=\"font-weight: 400;\">In statistics, a type 2 error, or false negative, occurs when a false null hypothesis is retained. In other words, the test is declared inconclusive when in reality there is a real difference. As opposed to a type 1 error, a type 2 error occurs when the evidence suggests that Variation A is performing better than or just like Variation B, even though Variation B is actually better.<\/span><\/p>\n<p><b>Hypothetical case study explaining type 2 errors:<\/b><\/p>\n<p><span style=\"font-weight: 400;\">Your hypothesis is that introducing the option \u2018Pay by PayPal\u2019 is likely to improve purchases. <\/span><\/p>\n<p><b>Version A (Control)<\/b><span style=\"font-weight: 400;\">: Does not have the PayPal payment option <\/span><\/p>\n<p><b>Version B<\/b><span style=\"font-weight: 400;\">: Includes the option to \u2018Pay by PayPal\u2019 on checkout<\/span><\/p>\n<p><b>Test results<\/b><span style=\"font-weight: 400;\"> show that Version A wins and the option to \u2018Pay by PayPal\u2019 did not have any effect on final conversions. However, in reality, this might have happened because your sample size was too small. If you had increased the sample size, you might have avoided this type 2 error or false negative.
<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Types of Validity Threats and How to Minimize Them<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Having understood the two common categorizations of validity threats, let\u2019s study in detail the common threats to the validity of <a href=\"https:\/\/www.invespcro.com\/ab-testing\/vs-multivariate-testing\/\">AB tests<\/a>. Also, let\u2019s get to know how to minimize them. <\/span><\/p>\n<h3><strong>1. Flicker Effect<\/strong><\/h3>\n<div class=\"blog_img\"><img decoding=\"async\" class=\"aligncenter wp-image-11541 size-full\" src=\"https:\/\/www.invespcro.com\/blog\/images\/blog-images\/Effet_Flicker-1.gif\" alt=\"\" width=\"100%\" height=\"auto\" \/><\/div>\n<p><b>Gif source: <\/b><a href=\"http:\/\/blog.kameleoon.com\/en\/ab-testing-flicker-effect\/\"><b>Kameleoon<\/b><\/a><\/p>\n<p><b>What is it: <\/b><span style=\"font-weight: 400;\">Flicker occurs when original content flashes for a brief time before the variation gets loaded onto the visitors\u2019 screens. This leads to visitors getting confused about content, and can result in conversions dropping. Any disturbance on the website and slow loading can put visitors off. If visitors see two different versions of the content while the page loads, it is likely to make them suspicious and leave.<\/span><\/p>\n<p><b>Why it happens: <\/b><span style=\"font-weight: 400;\">Flickering might happen because of slow website loading speed, the code being added incorrectly to the webpage being tested, the code being loaded asynchronously, or too many scripts being loaded before the test script. There could be other reasons as well that lead to the flicker effect. <\/span><\/p>\n<p><b>Example: <\/b><span style=\"font-weight: 400;\">Suppose you have added the testing code to the bottom of the page being tested rather than to the header.
This can cause flickering as the browser will execute the code only at the end and not as soon as the visitor lands on the variation. As a result, the visitor will first see the original content on their screen.<\/span><\/p>\n<p><b>How to minimize it: <\/b><span style=\"font-weight: 400;\">Optimizing your website speed will help reduce flicker. You should also be careful while implementing the code. Another thing to check is that the testing script is not loaded through the tag manager, or that it is set to load synchronously. <\/span><\/p>\n<h3><strong>2. History Effect<\/strong><\/h3>\n<h3><img fetchpriority=\"high\" decoding=\"async\" class=\"aligncenter wp-image-11542 size-full\" src=\"https:\/\/www.invespcro.com\/blog\/images\/blog-images\/d1.jpg\" alt=\"\" width=\"680\" height=\"453\" \/><\/h3>\n<p><b>What is it:<\/b><span style=\"font-weight: 400;\"> The history effect happens when an extraneous variable is introduced while a test is running, and leads to skewing of results. <\/span><\/p>\n<p><b>Why it happens:<\/b><span style=\"font-weight: 400;\"> It happens because an <\/span><span style=\"font-weight: 400;\">AB<\/span><span style=\"font-weight: 400;\"> test is unlike a lab test and does not run in isolation. Therefore, AB tests are prone to be affected by external variables and real-world factors. <\/span><\/p>\n<p><b>Example:<\/b><span style=\"font-weight: 400;\"> You are running a marketing campaign that uses a landing page from your website, and you are running an AB test on that <a href=\"https:\/\/www.invespcro.com\/blog\/landing-page-sins-mistakes\/\">landing page<\/a>. You might see a spike in the traffic on your landing page due to the marketing campaign that you are running on it. This might lead to increased sign-ups on that landing page as well. Now, it might lead you to conclude that the original landing page is better, while in reality Variation B might have won had you not been running the marketing campaign at the time.
In this case, \u2018the marketing campaign\u2019 is the extraneous variable that has caused your AB test to give skewed results. <\/span><\/p>\n<p><b>How to minimize it: <\/b><span style=\"font-weight: 400;\">The best way to tackle the history effect is to take into account any external factors that can skew results. Apart from this, it is important to let everyone in the organization know when an AB test is being run. Knowing that an AB test is being conducted helps ensure that no one on the team introduces any external factors\/variables to the pages being tested while the test is running. Making use of Google Analytics alongside your <a href=\"https:\/\/www.invespcro.com\/blog\/comparing-multivariate-ab-testing-tools\/\">AB testing tool<\/a> will also help you track any changes in traffic that have happened not because of the test but because of an external variable. This will save you from deploying the wrong variation and incurring losses. <\/span><\/p>\n<p><span style=\"font-weight: 400;\"><strong>Case Study:<\/strong> This post on validity threats, by Marketing Experiments, talks about a <\/span><a href=\"https:\/\/marketingexperiments.com\/a-b-testing\/optimization-validity-threats\"><span style=\"font-weight: 400;\">case study <\/span><\/a><span style=\"font-weight: 400;\">where they wanted to <\/span><span style=\"font-weight: 400;\">determine which ad headline would fetch the highest click-through rate for their subscription-based website. During the test, an external \u2018real world\u2019 event occurred that led to a significant and transient change in the traffic coming to the website, and this skewed the results.<\/span><\/p>\n<h3><strong>3.
Instrumentation Effect<\/strong><\/h3>\n<div class=\"blog_img\"><img decoding=\"async\" class=\"aligncenter wp-image-11547 size-full\" src=\"https:\/\/www.invespcro.com\/blog\/images\/blog-images\/browsers-1273344_1280.png\" alt=\"\" width=\"680\" height=\"134\" \/><\/div>\n<p><b>What is it: <\/b><span style=\"font-weight: 400;\">These errors are related to your testing tool and code implementations.<\/span><\/p>\n<p><b>Why it happens: <\/b><span style=\"font-weight: 400;\">It happens when the tool you are using is faulty or when you have deployed incorrect code. It can also happen if the code is deployed incorrectly or is not compatible with certain browsers. Deploying incorrect code is not the same as having a faulty test engine, though: each is a separate problem and each needs to be diagnosed differently.<\/span><\/p>\n<p><b>Example: <\/b><span style=\"font-weight: 400;\">One out of the 4 variations you are testing is not running properly on Chrome. This means the chances of recording conversions on that variation are slim. If the code had been compatible with Chrome, that variation would have given a different result. <\/span><\/p>\n<p><b>How to minimize it: <\/b><span style=\"font-weight: 400;\"><a href=\"https:\/\/www.invespcro.com\/blog\/aa-tests\/\">A\/A testing<\/a> is one way to determine whether your tool is faulty or hasn\u2019t been deployed correctly. If this is the case, your <a href=\"https:\/\/www.invespcro.com\/blog\/aa-tests\/\">AA test<\/a> is likely to <\/span><span style=\"font-weight: 400;\">declare a winner even between two identical variations. The problem, however, with AA testing is that it is time-consuming. It is best to perform an AA test only if your website gets loads of traffic.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Another way to tackle the instrumentation effect is to double check that your experiment has been set up the right way.
Check that there is no code error, no mismatch between code and browser compatibility, and that your data is being passed correctly to the CRM. Being vigilant and watchful for errors is the best way to minimize the instrumentation effect. <\/span><\/p>\n<p><a href=\"http:\/\/splitbase.com\/ab-testing-threats\/\"><span style=\"font-weight: 400;\">Raphael Paulin-Daigle recommends:<\/span><\/a><\/p>\n<blockquote><p><i><span style=\"font-weight: 400;\">\u201cBefore launching ANY tests, you should always do rigorous Quality Assurance (QA) checks such as performing cross-browser and cross-device testing on your new variations, and trying out your variations under multiple different user scenarios.\u201d<\/span><\/i><\/p><\/blockquote>\n<p><span style=\"font-weight: 400;\">You must also ensure that the testing engine is compatible with Google Analytics so that you can see the testing data in GA and use it as a source of comparison. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">If all of your measures to minimize instrumentation errors fail, that may indicate that the testing engine itself is faulty.<\/span><\/p>\n<p><b>Case Study:<\/b><span style=\"font-weight: 400;\"> You can read this case study, which talks about how Copyhackers ran a split test and found out there were <\/span><a href=\"https:\/\/copyhackers.com\/2012\/11\/home-page-split-test-reveals-major-shortcoming-of-popular-testing-tools\/\"><span style=\"font-weight: 400;\">major loopholes in their testing engine<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<h3><strong>4.
Selection Effect<\/strong><\/h3>\n<h3><img decoding=\"async\" class=\"aligncenter wp-image-11548 size-full\" src=\"https:\/\/www.invespcro.com\/blog\/images\/blog-images\/photo-manipulation-1825450_1280.jpg\" alt=\"\" width=\"680\" height=\"380\" \/><\/h3>\n<p><b>What is it: <\/b><span style=\"font-weight: 400;\">The bias or error that occurs because the sample is not representative of your entire audience.<\/span><\/p>\n<p><b>Why it happens: <\/b><span style=\"font-weight: 400;\">There are many reasons why one could end up picking a sample that does not completely or accurately represent the entire audience. One of the reasons why selection error happens is sample bias. Marketers conducting experiments get attached to the hypothesis that they have constructed. Everyone wants their hypothesis to win. So, it is easy to select a certain sample for testing and eliminate factors or variants that might result in their hypothesis being incorrect. While calculations can tell you the appropriate sample size, they are not helpful in deciding who should make up the sample. The idea is to keep the sample completely representative of your entire audience, free of bias.<\/span><\/p>\n<p><b>Example: <\/b><span style=\"font-weight: 400;\">You are running ads on your website for premium hotels and your hypothesis is that the number of bookings will increase by running this campaign. However, the conversions on Variation B, which has the ad campaign running, go down instead of going up. Maybe your main traffic comprises a middle-income group and you haven\u2019t considered this group in your sample. This is likely to skew your test results.<\/span><\/p>\n<p><b>How to minimize it: <\/b><span style=\"font-weight: 400;\">Regularly study your analytics reports and keep digging deeper into the source of traffic.
Make sure that your sample is truly representative and is free of <\/span><a href=\"https:\/\/conversionxl.com\/blog\/sample-pollution\/\"><span style=\"font-weight: 400;\">sample pollution<\/span><\/a><span style=\"font-weight: 400;\">. <\/span><span style=\"font-weight: 400;\">Not taking into account the different types of traffic that visit and interact with your website and make up your sample can skew your results. Include both new and returning traffic in your sample, and cover both weekday and weekend traffic. <\/span><\/p>\n<p><a href=\"https:\/\/www.optimizesmart.com\/understanding-ab-testing-statistics-to-get-real-lift-in-conversions\/\"><span style=\"font-weight: 400;\">Optimize Smart summarizes the point:<\/span><\/a><\/p>\n<blockquote><p><i><span style=\"font-weight: 400;\">\u201cEach traffic source brings its own type of visitors, and you can\u2019t assume that paid traffic from a few ads and one channel mirrors the behaviors, context, mindset, and needs of the totality of your usual traffic.\u201d<\/span><\/i><\/p><\/blockquote>\n<p><strong>Case Study:\u00a0<\/strong><i>\u201cIn that case launching a winning variation may not result in any real uplift in sales\/conversion rate. The launch of winning variation may, in fact, lower your conversion rate.<\/i><i> When you\u2019re analyzing the test results, make sure to segment by sources in order to see the real data that lies behind averages.<\/i><i>\u201d via <a href=\"http:\/\/splitbase.com\/tag\/selection-effect\/\">SplitBase<\/a><\/i><\/p>\n<h3><strong>5.
Novelty Effect<\/strong><\/h3>\n<div class=\"blog_img\"><img decoding=\"async\" class=\"aligncenter wp-image-11549 size-full\" src=\"https:\/\/www.invespcro.com\/blog\/images\/blog-images\/giphy-7.gif\" alt=\"\" width=\"100%\" height=\"auto\" \/><\/div>\n<h4>Gif source: <a href=\"https:\/\/giphy.com\/gifs\/dance-CDzdJSkC4iyLC\">Giphy<\/a><\/h4>\n<p><b>What is it: <\/b><span style=\"font-weight: 400;\">Errors or changes in test results that are an outcome of introducing something unusual or new that the visitor is not used to. <\/span><\/p>\n<p><b>Why it happens: <\/b><span style=\"font-weight: 400;\">The novelty effect happens because something new is being shown to visitors. It stems from the innate human tendency to prefer something new over something old, such as alterations made to a landing page that visitors are not used to seeing on your website. <\/span><\/p>\n<p><b>Example: <\/b><span style=\"font-weight: 400;\">You introduce slider images for your apparel section in Variation B. In Variation A, you have one image for the apparel. Your hypothesis is that the version with slider images will fetch you more conversions. Even though your hypothesis wins, it might be that the new change is simply attention-grabbing and the novelty effect has kicked in, so conversions have seen only a temporary spike.<\/span><\/p>\n<p><b>How to minimize it: <\/b><span style=\"font-weight: 400;\">Your existing audience might behave differently just because they have been exposed to something new. Conversions, in this case, are likely to spike not because one version is better than the other, but because there is something different that the audience is getting to see.
The best way to reduce this bias is to drive new traffic to your website when you introduce something new and AB test it.<\/span><\/p>\n<p><b>Case Study:<\/b><b> Let\u2019s look at the <\/b><a href=\"https:\/\/medium.com\/message\/the-novelty-effect-cf606715ae62\"><b>FC Dallas stadium case study by Clive Thompson<\/b><\/a><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-11552 size-full\" src=\"https:\/\/www.invespcro.com\/blog\/images\/blog-images\/FCDallas_SoutEastField-noFans-03.jpg\" alt=\"\" width=\"680\" height=\"178\" \/><\/p>\n<p><em>\u201cIn 2005, FC Dallas \u2014 the city\u2019s pro soccer team \u2014 moved into a new, state-of-the-art $80 million stadium. Over the next two years, games drew 66% more fans, with an average of about 15,145 attending each game. Over the next few years, though, as the novelty of the stadium diminished, some of those new fans began drifting away, and average attendance slid to 12,440.\u201d<\/em><\/p>\n<h3><strong>6. Statistical Regression<\/strong><\/h3>\n<h3><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-11554\" src=\"https:\/\/www.invespcro.com\/blog\/images\/blog-images\/assortment-bright-candy-1043519.jpg\" alt=\"\" width=\"680\" height=\"254\" \/><\/h3>\n<p><b>What is it: <\/b><span style=\"font-weight: 400;\">Regression towards the mean, also called statistical regression, shows up when you end a test too early. Extreme early results tend to even out as more data is collected over time. <\/span><\/p>\n<p><a href=\"https:\/\/instapage.com\/blog\/validating-ab-tests\"><span style=\"font-weight: 400;\">According to <\/span><span style=\"font-weight: 400;\">Ted Vrountas:<\/span><\/a><\/p>\n<blockquote><p><i><span style=\"font-weight: 400;\">\u201cIf you\u2019re making business decisions based on your A\/B tests just because they reached statistical significance, stop now.
You need to reach statistical significance before you can make any inferences based on your results, but that\u2019s not all you need. You also have to run a valid test.\u201d<\/span><\/i><\/p><\/blockquote>\n<p><b>Why it happens:<\/b><span style=\"font-weight: 400;\"> Most people end the test when a 90% significance level is reached, without reaching the required sample size. You cannot be sure of the AB test results only by reaching 90% significance. You must reach the required sample size as well. Otherwise, the results might actually just be imaginary. If, for example, the required sample size for your test is 50 and you end the test at a sample size of 20 because you\u2019ve reached 90% significance, your test results cannot be trusted. There are a number of <\/span><a href=\"https:\/\/abtestguide.com\/abtestsize\/\"><span style=\"font-weight: 400;\">AB testing sample size calculators<\/span><\/a><span style=\"font-weight: 400;\"> that can help you find out the required sample size for your AB test. <\/span><\/p>\n<p><b>Example: <\/b><span style=\"font-weight: 400;\">You create a new landing page for your SaaS product and all of your first 15 visitors convert on the new variation. This means that there is a 100% conversion rate. Does that mean that the new landing page is far better than the older one? No. The first 15 visitors do not make up the full sample size. Your sample size calculator tells you that your minimum sample size is 50. You cannot declare the variation a winner after reaching a sample size of only 15.<\/span><\/p>\n<p><b>How to minimize it: <\/b><span style=\"font-weight: 400;\">Do not stop your AB test as soon as you reach statistical significance. Collecting more data leads to higher accuracy in your test results.
Reaching the required sample size is the key to avoiding statistical regression errors.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">You must also calculate the sample size prior to running the test so that you can make sure you reach it before drawing a conclusion.<\/span><\/p>\n<h3><strong>7. Simpson\u2019s Paradox<\/strong><img decoding=\"async\" class=\"aligncenter wp-image-11555 size-full\" src=\"https:\/\/www.invespcro.com\/blog\/images\/blog-images\/giphy-9.gif\" alt=\"\" width=\"100%\" height=\"auto\" \/><\/h3>\n<p>Gif source: <a href=\"https:\/\/giphy.com\/gifs\/g1ft3d-glitch-cartoon-the-simpsons-3qGw96Jowb8sM\">Giphy<\/a><\/p>\n<p><b>What is it:<\/b><span style=\"font-weight: 400;\"> Errors that can happen when the traffic splits for variants are changed while the test is going on.<\/span><\/p>\n<p><b>Why it happens: <\/b><span style=\"font-weight: 400;\">It occurs when a trend that was observed in different sets of data disappears or reverses when those groups are combined. This happens because the combined test results are weighted averages of the groups. Simpson\u2019s paradox can also occur when the traffic distribution for a variation is altered manually. For example, you see that Variation B is a winning variant, so you send more traffic to that variant.
Another reason that it happens is that members of the population leave or join while a test is going on.<\/span><\/p>\n<p><strong>Example: <\/strong>To understand the point on weighted averages and Simpson\u2019s paradox, consider <a href=\"http:\/\/blog.joshbaker.com\/2010\/09\/26\/simpsons-paradox-and-marketing-testing\/\"><span style=\"font-weight: 400;\">the University of California <\/span>example that Josh Baker explains:<\/a><\/p>\n<p><em><span style=\"font-weight: 400;\">\u201cIn 1973, the University of California at Berkeley was sued for showing bias in admissions for women to their graduate school. Men had a much better chance to be admitted than women according to the statistics given.\u201d<\/span><\/em><\/p>\n<p><span style=\"font-weight: 400;\">The combined numbers for all departments are shown below, with men more likely to be admitted:<\/span><\/p>\n<table style=\"height: 100px;\" width=\"712\">\n<tbody>\n<tr>\n<td><\/td>\n<td><b>Applicants<\/b><\/td>\n<td><b>% admitted<\/b><\/td>\n<\/tr>\n<tr>\n<td><b>Men<\/b><\/td>\n<td><span style=\"font-weight: 400;\">8442<\/span><\/td>\n<td><b>44%<\/b><\/td>\n<\/tr>\n<tr>\n<td><b>Women<\/b><\/td>\n<td><span style=\"font-weight: 400;\">4321<\/span><\/td>\n<td><span style=\"font-weight: 400;\">35%<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><span style=\"font-weight: 400;\">&#8220;But according to individual department numbers, it showed that there was a small but statistically significant bias that favored the women in actually having a higher chance of being admitted.&#8221;<\/span><\/p>\n<table style=\"height: 246px;\" width=\"714\">\n<tbody>\n<tr>\n<td><span style=\"font-weight: 400;\">Department<\/span><\/td>\n<td colspan=\"3\"><span style=\"font-weight: 400;\">Men<\/span><\/td>\n<td colspan=\"3\"><span style=\"font-weight: 400;\">Women<\/span><\/td>\n<\/tr>\n<tr>\n<td><\/td>\n<td colspan=\"3\"><b>Applicants
\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0% admitted<\/b><\/td>\n<td colspan=\"3\"><b>Applicants \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0% admitted<\/b><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">A<\/span><\/td>\n<td colspan=\"2\"><span style=\"font-weight: 400;\">825<\/span><\/td>\n<td><span style=\"font-weight: 400;\">62%<\/span><\/td>\n<td colspan=\"2\"><span style=\"font-weight: 400;\">108<\/span><\/td>\n<td><b>82%<\/b><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">B<\/span><\/td>\n<td colspan=\"2\"><span style=\"font-weight: 400;\">560<\/span><\/td>\n<td><span style=\"font-weight: 400;\">63%<\/span><\/td>\n<td colspan=\"2\"><span style=\"font-weight: 400;\">25<\/span><\/td>\n<td><b>68%<\/b><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">C<\/span><\/td>\n<td colspan=\"2\"><span style=\"font-weight: 400;\">325<\/span><\/td>\n<td><b>37%<\/b><\/td>\n<td colspan=\"2\"><span style=\"font-weight: 400;\">593<\/span><\/td>\n<td><span style=\"font-weight: 400;\">34%<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">D<\/span><\/td>\n<td colspan=\"2\"><span style=\"font-weight: 400;\">417<\/span><\/td>\n<td><span style=\"font-weight: 400;\">33%<\/span><\/td>\n<td colspan=\"2\"><span style=\"font-weight: 400;\">375<\/span><\/td>\n<td><b>35%<\/b><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">E<\/span><\/td>\n<td colspan=\"2\"><span style=\"font-weight: 400;\">191<\/span><\/td>\n<td><b>28%<\/b><\/td>\n<td colspan=\"2\"><span style=\"font-weight: 400;\">393<\/span><\/td>\n<td><span style=\"font-weight: 400;\">24%<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">F<\/span><\/td>\n<td colspan=\"2\"><span style=\"font-weight: 400;\">272<\/span><\/td>\n<td><span style=\"font-weight: 400;\">6%<\/span><\/td>\n<td colspan=\"2\"><span style=\"font-weight: 400;\">341<\/span><\/td>\n<td><b>7%<\/b><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><span style=\"font-weight: 
400;\">The above table is a good example of Simpson\u2019s paradox: women have the higher admission rate in four of the six departments (A, B, D, and F), even though the combined figures favor men.<\/span><\/p>\n<p><b>How to minimize it: <\/b><span style=\"font-weight: 400;\">Rather than relying on what the totals tell you, make sure before the test starts that the groups being compared are as similar as possible.<\/span><\/p>\n<p><a href=\"http:\/\/blog.data-miners.com\/2010\/02\/simpsons-paradox-and-marketing.html\">Gordon S<span style=\"font-weight: 400;\">. <\/span>Linoff explains:<\/a><\/p>\n<blockquote><p><i><span style=\"font-weight: 400;\">\u201cSimpson&#8217;s Paradox arises when we are taking weighted averages of evidence from different groups. Different weightings can produce very different, even counter-intuitive results. The results become much less paradoxical when we see the actual counts rather than just the percentages.\u201d<\/span><\/i><\/p><\/blockquote>\n<p><a href=\"http:\/\/blog.analytics-toolkit.com\/2014\/segmenting-data-web-analytics-simpsons-paradox\/\"><span style=\"font-weight: 400;\">Georgi Georgiev recommends:<\/span><\/a><\/p>\n<blockquote><p><i><span style=\"font-weight: 400;\">\u201cWe should treat each source\/page couple as a separate test variation and perform some additional testing until we reach the desired statistically significant result for each pair (currently we do not have significant results pair-wise).\u201d<\/span><\/i><\/p><\/blockquote>\n<p><span style=\"font-weight: 400;\">Let\u2019s look at another example.<\/span><\/p>\n<p><b>Example: <\/b>To understand the point on weighted averages and Simpson\u2019s paradox, let\u2019s compare conversions and test results for a control (A) and a variation (B) in the following hypothetical example.<\/p>\n<table>\n<tbody>\n<tr>\n<td><\/td>\n<td><b>Page Visits for A<\/b><\/td>\n<td><b>Page Visits for B<\/b><\/td>\n<td><b>Conversions for A<\/b><\/td>\n<td><b>Conversions 
for B<\/b><\/td>\n<td><b>Conversion Rate for A<\/b><\/td>\n<td><b>Conversion Rate for B<\/b><\/td>\n<\/tr>\n<tr>\n<td><b>Aggregate data<\/b><\/td>\n<td><b>7000<\/b><\/td>\n<td><b>7000<\/b><\/td>\n<td><b>350<\/b><\/td>\n<td><b>460<\/b><\/td>\n<td><b>5%<\/b><\/td>\n<td><b>6.5%<\/b><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<table>\n<tbody>\n<tr>\n<td><\/td>\n<td><b>Page Visits for A<\/b><\/td>\n<td><b>Page Visits for B<\/b><\/td>\n<td><b>Conversions for A<\/b><\/td>\n<td><b>Conversions for B<\/b><\/td>\n<td><b>Conversion Rate for A<\/b><\/td>\n<td><b>Conversion Rate for B<\/b><\/td>\n<\/tr>\n<tr>\n<td><b>Aggregate<\/b><\/td>\n<td><b>7000<\/b><\/td>\n<td><b>7000<\/b><\/td>\n<td><b>350<\/b><\/td>\n<td><b>460<\/b><\/td>\n<td><b>5%<\/b><\/td>\n<td><b>6.5%<\/b><\/td>\n<\/tr>\n<tr>\n<td><b>Traffic source 1<\/b><\/td>\n<td><b>5000<\/b><\/td>\n<td><b>1000<\/b><\/td>\n<td><b>200<\/b><\/td>\n<td><b>10<\/b><\/td>\n<td><b>4%<\/b><\/td>\n<td><b>1%<\/b><\/td>\n<\/tr>\n<tr>\n<td><b>Traffic source 2<\/b><\/td>\n<td><b>1000<\/b><\/td>\n<td><b>2500<\/b><\/td>\n<td><b>65<\/b><\/td>\n<td><b>150<\/b><\/td>\n<td><b>6.5%<\/b><\/td>\n<td><b>6%<\/b><\/td>\n<\/tr>\n<tr>\n<td><b>Traffic source 3<\/b><\/td>\n<td><b>1000<\/b><\/td>\n<td><b>3500<\/b><\/td>\n<td><b>85<\/b><\/td>\n<td><b>300<\/b><\/td>\n<td><b>8.5%<\/b><\/td>\n<td><b>8.5%<\/b><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><span style=\"font-weight: 400;\">The above table illustrates Simpson\u2019s paradox: Variation B wins on the aggregate numbers, yet Variation A converts better for traffic sources 1 and 2 and performs about the same for traffic source 3. <\/span><\/p>\n<p><b>How to minimize it: <\/b><span style=\"font-weight: 400;\">Rather than relying on what the aggregates tell you, dig deeper into the segment-wise performance of your variations. You might retain Variation B for traffic source 3, where it performs as well as Variation A. 
Deploying Variation A for traffic source 2 might be a good idea. Such insights can improve your decision making and help you draw better conclusions from AB testing. <\/span><\/p>\n<h3><strong>Conclusion<\/strong><\/h3>\n<p><span style=\"font-weight: 400;\">AB tests aren\u2019t free of bias, and a number of factors can skew the results you obtain from them. However, if you are aware of the validity threats discussed in this post, and of the type 1 and type 2 errors they can lead to, you can stay vigilant, account for the margin of error, and interpret test results wisely. We\u2019d love to know if you have ever run an AB test and encountered a validity threat. Share your experience and learnings in the comments below. Feedback is welcome!<\/span><\/p>\n<div class=\"blog_img\"><a href=\"https:\/\/offer.invespcro.com\/ab-test-calculator\/\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-11553\" src=\"https:\/\/www.invespcro.com\/blog\/images\/blog-images\/Calc-lead-design_1-1.png\" alt=\"\" width=\"1389\" height=\"405\" \/><\/a><\/div>\n","protected":false},"excerpt":{"rendered":"<p><span class=\"span-reading-time rt-reading-time\" style=\"display: block;\"><span class=\"rt-label rt-prefix\">Reading Time: <\/span> <span class=\"rt-time\"> 15<\/span> <span class=\"rt-label rt-postfix\">minutes<\/span><\/span>Disclaimer: This section is a TL;DR of the main article and it\u2019s for you if you\u2019re not interested in reading the whole article. On the other hand, if you want to read the full blog, just scroll down and you\u2019ll see the introduction. There are hundreds of case studies and examples of A\/B testing. While 
While [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":11545,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[116,36],"tags":[103,568,87,569,570,245,571,572,109,573,574,575,565,566,576],"class_list":["post-11543","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ab-testing","category-cro","tag-ab-testing","tag-flicker-effect","tag-general","tag-history-effect","tag-instrumentation-effect","tag-intermediate","tag-novelty-effect","tag-null-hypothesis","tag-resource","tag-selection-effect","tag-simpsons-paradox","tag-statistical-regression","tag-type-i-error","tag-type-ii-error","tag-validity-threats"],"_links":{"self":[{"href":"https:\/\/www.invespcro.com\/blog\/wp-json\/wp\/v2\/posts\/11543","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.invespcro.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.invespcro.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.invespcro.com\/blog\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/www.invespcro.com\/blog\/wp-json\/wp\/v2\/comments?post=11543"}],"version-history":[{"count":0,"href":"https:\/\/www.invespcro.com\/blog\/wp-json\/wp\/v2\/posts\/11543\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.invespcro.com\/blog\/wp-json\/wp\/v2\/media\/11545"}],"wp:attachment":[{"href":"https:\/\/www.invespcro.com\/blog\/wp-json\/wp\/v2\/media?parent=11543"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.invespcro.com\/blog\/wp-json\/wp\/v2\/categories?post=11543"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.invespcro.com\/blog\/wp-json\/wp\/v2\/tags?post=11543"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}