• 6 Uncomfortable Thoughts On AB Testing

    Editor Note: We highly recommend that you implement the different ideas in this blog post through AB testing. Use the guide to conduct AB testing and figure out which of these ideas in the article works for your website visitors and which don’t.  Download Invesp’s “The Essentials of Multivariate & AB Testing” now to start your testing program on the right foot.

    How well do you know A/B testing? As an optimizer or digital marketer, you probably know it well. But one of the things I like about split testing is that even when you think you know it inside out, it still surprises you.

    Each A/B testing case study comes with its own set of differences and lessons.   

    Asking questions about A/B testing isn’t just a great way to learn more about split testing; it’s also a great way to avoid costly mistakes. To do that, though, you need to know the answers to the uncomfortable questions around A/B testing.

    This article brings to you some of the important split testing questions, but if you have your own way of answering these questions based on your own experiences, feel free to share with us your experience and learnings in the comments section below.

    What is the success rate of the AB testing program?

    A typical CRO cycle is something like this:

    1. The company executive hears about conversion optimization and believes that it can generate good revenue improvements for the business. 
    2. She hires an agency to help with CRO or hires a couple of internal resources to focus on improving conversions. 
    3. Agency/staff work tirelessly to analyze the site, identify problem areas, and come up with testing ideas. 
    4. Developers implement the A/B test ideas.
    5. Back and forth cycles of quality assurance and bug fixes.
    6. Test results come in and everyone is excited. 
    7. The test fails to generate any increases in conversions!
    8. Go back to the drawing board and hope that the next experiment will generate an increase in conversions. 
    9. Go back to step 3 and repeat!

    What is the success rate of AB tests? The industry average success rate of AB testing is around 12% to 15% (Optimizely reports that 25% of all AB tests run on their platform are successful – seems a little different than what other testing platforms report). This is why George Mavrommatis says: 

    If your success rate is higher than 25% it means that you are either a magician or your website is really crap and you should follow another method of product development that will allow you to reach your goals faster.

    All this work and you only get a 12% success rate? Even worse, having a successful test does not mean that the revenue generated is high!

    How does the CRO industry mitigate this issue?

    1. Focus on the learning aspect of CRO: if we cannot produce results, then we focus on the fact that we are learning about the site visitors. But the brutal truth is that the client doesn’t care if you are learning or not —they pay to get an increase in conversions. If you focus on the learnings and not the positive results, chances are you will be shown the exit door.
    2. Increase the speed of testing: if the success rate is at 12%, then instead of launching 10 AB experiments, we launch 100 of them and hope for the best. But in most cases, increasing the testing velocity causes the success rate to drop. There is a high chance of missing out on valuable insights when you increase the testing velocity. In every A/B test, you should track all relevant goals, structure your tests and make sure that the hypothesis is addressed, which is challenging to do if you have 100 tests to keep tabs on.

    What is the revenue impact of a winning AB test?

    When you tell clients that the test has a 25% uplift in conversions, what comes to their mind is: “Yes, so we are going to have a 25% increase in revenue.” But does a lift in conversions directly translate to a lift in revenue? 

    The answer to that is a resounding NO. 

    Here is why: 

    The page conversion rate is not the site conversion rate  

    Every page on a site has a different impact on the overall website conversion rate. If you are an e-commerce site, then the impact that your homepage has on your overall conversion rate is different from the impact of your product and category pages. 

    This also means that an uplift in an A/B test does not necessarily translate to an equal uplift in revenue. The ratio of A/B test conversions to sales is not always one-to-one.

    For instance, let’s say you do an analysis of a site and you see that the product page has some issues. You then design a new product page that addresses those issues and you run an A/B test —original product page design against the new design — using 20% of your traffic. The test then shows a 30% uplift. 

    Since the experiment was done on the product page only, this means that the test only optimized a small portion (20%) of the traffic —and that is the people who visited the page while the test was running. The other 80% of the traffic didn’t visit the product page when the test was running, so they are not represented by the 30% uplift.

    This is the reason why a 30% uplift in the product page is not a 30% increase in the overall site conversion rate.

    To estimate the revenue impact of the page you optimized, you need to know how much that page contributes to the site conversion rate. Khalid wrote an article about the impact of a page on the site conversion rate, and suggested these two approaches: 

    • Look at the percentage of pageviews a particular page has as compared to the total pageviews for the site. 
    • Use the number of visitors to determine the page value. 
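
As a rough illustration of why a page-level lift shrinks at the site level, here is a minimal sketch. It assumes you can estimate, from your analytics, the share of the site's conversions that come from visitors who passed through the tested page. The function name `sitewide_lift` and the 20% share are hypothetical and only mirror the example above; this is not the method described in Khalid's article.

```python
def sitewide_lift(page_lift: float, exposed_share: float) -> float:
    """Approximate the relative lift in total site conversions.

    page_lift: relative lift measured in the test (0.30 means 30%).
    exposed_share: assumed share of the site's conversions that come from
        visitors who saw the tested page (0.20 means 20%).
    """
    if not 0.0 <= exposed_share <= 1.0:
        raise ValueError("exposed_share must be between 0 and 1")
    # Only the exposed share of conversions is lifted; the rest is unchanged.
    return page_lift * exposed_share


estimate = sitewide_lift(page_lift=0.30, exposed_share=0.20)
print(f"Estimated sitewide lift: {estimate:.1%}")
```

Under these assumptions, a 30% lift on the tested page works out to a single-digit lift for the site as a whole, which is the number that matters for revenue.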

    Statistical significance doesn’t equal validity

    As soon as the data reaches a 95% significance level, most people are tempted to end the test. 

    That’s not a good idea. 

    Even if your test indicates that you have reached a 99% confidence level, that’s not a stop sign. Data always fluctuates, and it eventually goes back to the mean. This is the reason why you should always do a couple of things:

    1. Pre-calculate how long the AB experiment needs to run. 
    2. If the pre-determined time to run the experiment is less than two weeks, then run your experiment for at least two weeks anyway.

    The idea of giving your test enough time to run is so that it reaches the required sample size, and this is imperative if you are aiming to eliminate the errors that come with regression toward the mean.
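
For readers who want to see what pre-calculating a sample size can look like, here is a minimal sketch using the standard normal-approximation formula for comparing two conversion rates. The function name, the 3% baseline rate, the 20% expected lift and the 80% power are all assumptions chosen for illustration; your testing tool may use a different formula.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variation(baseline_rate: float, relative_lift: float,
                              alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed in each variation for a two-sided test of two rates."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 95% significance by default
    z_beta = NormalDist().inv_cdf(power)           # 80% power by default
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical inputs: 3% baseline conversion rate, hoping for a 20% relative lift.
needed = sample_size_per_variation(baseline_rate=0.03, relative_lift=0.20)
print(f"Visitors needed per variation: {needed}")
```

Dividing the result by the traffic each variation receives per week gives a rough test duration, which you can then round up to full weeks.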

    Let’s say you are running a test with four variations (control vs V1/V2/V3). This is what your test can look like over 4 weeks: 

    1st Week: V2 is leading all other variations and is winning big. 

    2nd Week: V2 is still in the lead. 

    3rd Week: V2 is winning but the significance level is dropping. 

    4th Week: regression toward the mean, V2 is no longer winning. The uplift has disappeared. 

    Even if there is an uplift, 95% statistical significance has been reached, the expected sample size has been reached and you have a clear winner, you should still run another valid test that pits the winner against the control.

    Khalid wrote a lengthy article on calculating the required sample size for an AB experiment that I recommend going over and understanding. 

    The vanishing improvements: How do you avoid false positives? 

    A/B testing is an amazing way to improve your conversions, but it is wise to check for the validity of your results before you implement anything. If not well structured, a split test can do more harm than good. Smart marketers know that in every A/B test they conduct, there is a probability of attaining wrong or misleading results. 

    For instance, let’s say you run a test and the results indicate that Variation B has more conversions than Variation A, but when you implement the winner (Variation B), you do not see any real lift in sales. 

    Wait, where did that lift disappear to? 

    This occurrence is what we refer to as a false positive. In this article, Ayat describes false positives in the context of statistics as follows: 

    In statistics, type 1 error is said to occur when a true null hypothesis is rejected, which is also called ‘false positive’ occurrence. Results might indicate that Variation B is better than Variation A as B is giving you more conversions, but there might be a type 1 error causing this conclusion. Such errors are said to occur when a test is declared as conclusive although it is inconclusive. In every test, there is some amount of probability of false positives or incorrect conclusions.

    Now that you know what a false positive is, how then do you avoid it? For starters, here’s what you need to do: 

    Avoid having many variations

    Upworthy tests about 25 headlines per article, and their results are paying dividends. So does it mean that you have to follow their model too?

    No. Do not be fooled by case studies into thinking that what worked for other companies (even those in the same field as yours) will work for you. All companies are unique. Use case studies to get ideas on how or what to test, nothing more, nothing less.

    Now back to the subject matter: If you rely on frequentist statistics, then the more variations (comparisons) you run, the higher the chance of getting a false positive. At a 95% confidence level, every comparison carries a 5% chance of declaring a winner that isn’t real.

    So, from a mathematical perspective, if your test has about 10 different variations, then your chance of getting at least one false positive is roughly 40% (1 - 0.95^10, assuming independent comparisons). That’s quite a big percentage.

    Here is a graph that shows how the chance of a false positive increases as the number of variations increases. 
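
The same relationship can be reproduced with a few lines of arithmetic. This sketch assumes every comparison is independent and tested at the same 5% level. That is a simplification, because variations that share one control are not fully independent, so treat the numbers as approximate. The function name is hypothetical.

```python
def family_wise_error_rate(comparisons: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent comparisons."""
    if comparisons < 1:
        raise ValueError("comparisons must be at least 1")
    # Probability that every comparison avoids a false positive is (1 - alpha)^k.
    return 1 - (1 - alpha) ** comparisons


for k in (1, 3, 5, 10, 14):
    rate = family_wise_error_rate(k)
    print(f"{k:>2} comparisons: {rate:.0%} chance of at least one false positive")
```

The chance starts at 5% for a single comparison and climbs steadily with every variation you add, which is the argument for keeping the number of variations small.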

    Besides, having more variations running against the control means that you will also have to test different combinations against each other to find the top performer. This can become confusing, and the chances of finding false positives increase.

    Don’t stop the experiments prematurely.

    An A/B test has to run for a certain period of time so that it achieves reliable statistical significance. Most programs aim for 95% significance, but there is nothing magical about the number; it simply became an industry standard many years ago.

    Sometimes there is a possibility of achieving the 95% confidence rating in a short period of time, and when this happens, it can be tempting to end the test. 

    But, that’s not a good idea. 

    Smart marketers know that you should neither stop a test too soon nor run it for too long; you have to strike a balance between the two. Even if the test reaches 95% confidence early, you should let it continue running to avoid settling for an imaginary lift.

    Before you stop an A/B test, here are the four conditions you should meet: 

    1. Always calculate the needed sample size ahead of time, and make sure you have at least that many people in your experiment.
    2. Make sure your sample is representative: run the test for full weeks at a time, for at least 2 business cycles.
    3. There is no or minimal overlap in the confidence intervals of the variations.
    4. Only look at statistical significance (95% or higher) once the previous conditions have been met.
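
The ordering in this checklist can be sketched in code: significance is only consulted after the sample size and duration conditions pass. This is a simplified illustration, not a complete stopping rule. It uses a pooled two-proportion z-test, treats one business cycle as one week (an assumption), and leaves out the interval-overlap condition. All names and numbers are hypothetical.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist


@dataclass
class Arm:
    visitors: int
    conversions: int


def two_sided_p_value(control: Arm, variant: Arm) -> float:
    """Pooled two-proportion z-test for a difference in conversion rate."""
    p_c = control.conversions / control.visitors
    p_v = variant.conversions / variant.visitors
    pooled = (control.conversions + variant.conversions) / (control.visitors + variant.visitors)
    se = sqrt(pooled * (1 - pooled) * (1 / control.visitors + 1 / variant.visitors))
    if se == 0:
        return 1.0  # no conversions (or all conversions): nothing to compare
    z = (p_v - p_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def stopping_status(control: Arm, variant: Arm, required_per_arm: int,
                    days_run: int, min_days: int = 14, alpha: float = 0.05) -> str:
    enough_sample = min(control.visitors, variant.visitors) >= required_per_arm
    full_weeks = days_run >= min_days and days_run % 7 == 0
    if not (enough_sample and full_weeks):
        return "keep running"  # significance is deliberately not checked yet
    p_value = two_sided_p_value(control, variant)
    return "significant" if p_value < alpha else "not significant"


control = Arm(visitors=14_000, conversions=420)
variant = Arm(visitors=14_000, conversions=518)
print(stopping_status(control, variant, required_per_arm=13_911, days_run=10))
print(stopping_status(control, variant, required_per_arm=13_911, days_run=14))
```

With these made-up numbers, the first call says to keep running because the test has not completed two full weeks, even though the difference would already look significant.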

    There’s a high chance that your test is a complete fluke if you stop it as soon as it reaches 95% significance. It’s much more like tossing a coin. Rule out seasonality and allow your test to run for a couple of weeks. 

    Validate your AB experiment results by re-running your tests!

    The assertion that numbers do not lie is as true as a unicorn. There are instances where your A/B test tool can report that you have a 40% lift, but when you implement the winning variation, the conversion rate is pretty much the same. There is no 40% increase in conversion rate whatsoever. 

    How is that possible? 

    Chances are that the lift was imaginary; there was no real lift. Test results can be influenced by various factors, which means that you shouldn’t rely on a single result without validating it.

    There are many instances where we take the winner from an AB experiment and run it in a head-to-head match against the control to ensure that we have valid results. Yes, this slows down the testing program so we do not do it for all the experiments but we believe that it minimizes the chances of having false positives. 

    What can poison my A/B test data? 

    Having an acceptable confidence level, test duration and a decent sample size doesn’t equal validity. Your A/B test can still be skewed by what is known as validity threats.  Ayat wrote an article on how to identify and minimize validity threats, and she gave this good definition of validity threats:  

    Risks associated with certain uncontrollable or ‘little-known’ factors that can lead to inaccuracy in results and render inaccurate AB test outputs.

    There are quite a number of factors that can pose a threat to the validity of your A/B test results. These threats can be grouped as follows: 

    • Flicker effect 
    • History effect
    • Instrumentation effect 
    • Selection effect 

    Since we have written in great detail about each threat in one of our previous blogs, I won’t dwell much on the details; I will just show you how they can harm your experiment data.

    Flicker effect: This is when the original variation (Variation A) flashes on the visitor’s screens before variation B gets loaded. 

    So how does this screw your test results, you ask? 

    Users who visit a page you are testing and see the control design flash first might get suspicious of the site and decide to walk away. 

    Come to think of it, let’s say you intend to buy a certain product on an e-commerce site, and then you see a sudden change in design on the page, won’t you suspect that maybe someone has forged their page and they are trying to defraud you? 

    There is very little you can do if you are running your tests using front end development (which is the case for most AB experiments). With front end experiments, you can minimize the flicker effect but you cannot eliminate it completely. One way around the flicker effect is by using server-side testing which is more complex.

    History effect: this type of validity threat occurs when some external factors influence your test data. This could be seasonal change, running a marketing campaign on the same page (control) where you have set an A/B test, maybe negative social media comments that can bias people against your site, etc.  

    Let’s say you launch a marketing campaign while you are running a test. Chances are this will result in an unusual spike in your site traffic, and the visitors who were lured to your site by your marketing campaign differ from your usual visitors. They might have different needs or browsing behaviors. Because this traffic is only temporary, your experiment results could shift completely, and a variation that would have lost with your regular traffic may end up winning.

    When running an A/B test, here is how you can sidestep the external factors:  

    • Pay attention to any external factors that might impact your data. 
    • Inform everyone in your organization that you are running a test.
    • Make use of Google Analytics to track changes in traffic.  

    Instrumentation effect: This is when flawed data is caused by testing tools or code implementations. It is probably the most common issue that is responsible for skewing most of the test results.

    When the tool you are using is faulty or the code is incorrect, some of the metrics you will be hoping to measure will not be correctly recorded.

    Let’s say you are testing three variations (A/B/C) and the code for variation C is not correctly set. This means that some of the metrics (e.g. “product page views” data) may not be sent to your tool for that variation, and you know what this means, right? Variation C will most likely lose, but if it wasn’t for the incorrect code, variation C might have won.

    Selection effect: this happens when we wrongly assume that the sample we are using on the test represents the total traffic. Sample bias is one of the common reasons that may lead to this validity threat. 

    For instance, let’s say you send promotional traffic from your Facebook ads to a page where you are running a test, and your test results show an increase in the number of conversions. You then implement the winning variation on your site, assuming that the promotional traffic represents your total traffic. That assumption is wrong.

    How much does an A/B test program cost?

    Of course, the cost of an A/B test program varies from one company to another. But the price tag of a typical program that consists of a marketing team, development team, and software is close to $500,000. 

    That is a high cost for many companies. 

    However, when it comes to A/B testing programs it’s not always about the cost, but about the ROI. 

    I mean, if your company generates $20 million in revenue, and the split test program is able to generate a 10% increase in sales, is that worth the investment? As you can see, the more online revenue a company generates, the higher the ROI. 
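
The arithmetic behind that question is easy to sketch. The figures below reuse the $20 million revenue, 10% lift and $500,000 program cost mentioned above. The function name and the `margin` parameter are assumptions added for illustration, because incremental revenue is not the same as incremental profit.

```python
def program_roi(annual_revenue: float, relative_lift: float,
                program_cost: float, margin: float = 1.0) -> float:
    """Return on investment of a testing program as a ratio (3.0 means 300%).

    margin: assumed share of incremental revenue kept as profit; the default
        of 1.0 treats all incremental revenue as gain, as the text does.
    """
    incremental_gain = annual_revenue * relative_lift * margin
    return (incremental_gain - program_cost) / program_cost


large_site = program_roi(annual_revenue=20_000_000, relative_lift=0.10, program_cost=500_000)
small_site = program_roi(annual_revenue=2_000_000, relative_lift=0.10, program_cost=500_000)
print(f"$20M site ROI: {large_site:.0%}")
print(f"$2M site ROI: {small_site:.0%}")
```

The same program cost and the same 10% lift pay off handsomely for the larger site and lose money for the smaller one, which is why the revenue base matters more than the price tag.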

    Can I run an AB test on a low traffic site?

    Can you run an A/B test on a site with too little traffic?

    This question has been a bone of contention among optimizers over the years. But to answer it, you first need to know how much traffic you need to run a proper A/B test.  

    Well, according to A/B Tasty, you need to reach a minimum of 5,000 unique visitors per variation and 100 conversions on each objective per variation. Meaning, anything less than that can’t make a kosher split test. 

    I asked Khalid about this and here is what he said:

    In the past, we relied on the 100 conversions per variation as a back-of-the-napkin calculation. But we quickly found out that sites that generate 200 conversions do not see real value from running a conversion optimization program. We now require a minimum of 500 to 700 conversions per month. Companies that are generating more than 2,000 conversions per month are the ones that see the most impact of CRO.

    AB experiments are validated by statistical significance, a mathematical way of judging whether the results of the experiment are reliable. For instance, if an A/B test reports 50% significance, a difference that large would show up about half the time even if the variations performed identically, so the result tells you very little. 

    In most cases, optimizers run tests until they reach 95% statistical significance. 

    So, if your site has too little traffic, it will take time (we are talking about months) to obtain that 95% significance. This is the main reason why optimizers do not like to run a test on a low traffic site: no one has the patience to wait for 5 months (or even more) for something that can be achieved in two weeks under normal circumstances. 
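
To make the waiting time concrete, here is a minimal sketch that turns a required sample size into a number of weeks. The required sample of 14,000 visitors per variation and both traffic figures are hypothetical; plug in your own numbers.

```python
from math import ceil


def weeks_to_finish(required_per_variation: int, variations: int,
                    weekly_visitors: int) -> int:
    """Full weeks needed when traffic is split evenly across all variations."""
    if weekly_visitors <= 0:
        raise ValueError("weekly_visitors must be positive")
    total_needed = required_per_variation * variations
    return ceil(total_needed / weekly_visitors)


# Hypothetical test: control plus one variation, 14,000 visitors needed in each.
print(weeks_to_finish(14_000, variations=2, weekly_visitors=14_000))  # busy site
print(weeks_to_finish(14_000, variations=2, weekly_visitors=1_300))   # low traffic site
```

Under these assumptions the busy site finishes in 2 weeks, while the low traffic site needs 22 weeks, roughly five months, for the very same test.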

    With that said, the question still remains: is it possible to run a test on a low traffic site? 

    Yes, it is. Websites with little traffic can still do A/B testing to enhance their conversion rates. You just need a tactical testing plan, and here is how you can build one: 

    Focus on micro-conversions instead of macro-conversions

    With low-traffic sites, you can run a split test with the aim of tracking micro-conversions (small steps that users take towards attaining the main goal of the site), instead of macro conversions (the ultimate goal of a website). 

    The difference between the two is that macro focuses on the big picture while micro focuses on the small picture.

    For instance, in an e-commerce site, the macro conversion is likely to be making a purchase, whereas the micro-conversion can be adding an item to a cart. Here’s a list of some of the micro-conversions examples you might want to track: 

    • Product page views 
    • Subscribing 
    • Downloading ebooks 
    • Viewing the pricing page for a SaaS company
    • Watching a demo video for a SaaS company
    • Sharing content on social media 

    Focusing on micro-conversions will help you gain a complete understanding of the broader conversions. Understanding the path used by your visitors on their way to convert makes it easy for you to know where to optimize your site. That way you can base your test results on the version that is more likely to generate more conversions.

    Implement radical experiments instead of incremental tests 

    Testing a singular website element at a time is time-consuming on its own, and it can take forever to attain conclusive results when done on a site that has low traffic. 

    This is why you should forget about the incremental test and just go for a radical test at once. This way you will be able to reach your results quickly and make an informed decision.

    Although you can swiftly get actionable results from a radical test, it is more difficult to learn from it. I mean, you won’t be able to tell which element helped you increase the conversions and which elements were less effective. Was it adding the trust signals? Was it the new value proposition that resonated the most with the users? Or something else?

    But to think that there’s no way to overcome this drawback is wrong. You can come up with two new different themes of the same page and test them both, rather than testing multiple singular elements. 

    Test Something that can influence the customer’s decision 

    The best way of running this test is to start by conducting extensive user research (usability test, customer interviews, polls) so that you get to understand your customers’ drivers, and concerns. Knowing their concerns and the factors they consider before completing the ultimate goal on your site can go a long way in helping you figure out what to test and what not to waste time on. 

    Knowing what matters to your customers doesn’t only help you figure out what to test, but it can also help you obtain statistically significant results in a short space of time. 

    Forget Multivariate Testing (MVT)

    One of the requirements for running a multivariate test is high traffic. Although most of our clients get tens of millions of visitors a month, we have not conducted multivariate testing in years. Remember that the more variants you test, the longer it will take for you to attain significant results. With low traffic, it’s best to forget about MVT!

    Conclusion 

    Today, the idea of A/B testing is no longer novel. Different industries see more value in running a split test before making a decision of some sort. But as common as an A/B test is, too many marketers may struggle to give correct answers to the above questions. 

    Anyway, do you agree with the above answers? What other uncomfortable A/B testing questions would you like to see answered? Let us know in the comments section. 

Simba Dube

Simba Dube is the Growth Marketing Manager at Invesp. He is passionate about marketing strategy, digital marketing, content marketing, and customer experience optimization.

One thought on “6 Uncomfortable Thoughts On AB Testing”

  1. An issue with A/B testing is that many people looking at the results don’t understand statistics and can misinterpret what should be implemented.
