7 Things I Hate About AB Testing
Test this, test that…and always be testing.
As I listened to one speaker after the next, I wondered why a client would hire them. If you are a CRO agency, and the best advice you can give conference attendees is to test this and test that, then I am not sure why companies would hire you in the first place.
All you need to do is start AB testing, find that one fantastic test that will take you from absolute anonymity to an overnight web success…that is what brings most people to AB testing land.
They arrive filled with dreams, only to face the grueling reality of how difficult it is to achieve any meaningful result. According to Stephen Walsh:
“There is one thing to keep in mind: testing every random aspect of your website can often be counter-productive. You can blow time and money on software, workers, and consultants, testing things that won’t increase your website revenue enough to justify the tests in the first place.”
So, I am here to break it down for you…I am here to give you the top 7 things that I hate about AB testing. Before I jump into this, I want to point out that we run a massive number of AB tests every year, and we are close to running the 12,000th AB test in our company’s history. The road is filled with the good, the bad and the ugly.
1. Settling bar bets
You will no longer have to rely on your opinions; you rely on the wisdom/desires/actions of your visitors to determine which design is better.
This is how AB testing was sold to many companies.
And why not – isn’t AB testing the antidote to the HIPPO?
This produced a side effect that the industry struggles with and has not been able to resolve: conversion optimization is reduced to testing. An industry that insists “you should always be testing” gives the impression that testing is the holy grail of increasing conversions, when in fact it is the easiest part of the process.
After all of that, your test fails. It does not produce statistically significant results.
Your colleague, on the other hand, is not a big believer in this whole CRO process. While sipping on a cup of Java, he says, “Why don’t we test this random new feature?” He has no data or customer research to support testing this feature. He runs the test, and it generates solid, statistically significant results, the kind of results managers talk about in meetings.
This phenomenon hurts the argument for the process. The fact that many CRO companies produce success rates of less than 15% on their projects undermines all the emphasis on process.
So, naturally, and as a second line of defense, CROs move on to talk about the learnings: what you will learn about your customers.
But let’s face it. Learnings are valuable, and I love them. However, unless you are a large company with a research and development budget, most companies will kill programs that do not generate a minimum of 3 to 5x return on investment within a few months.
The point is, if you think of AB testing as a way to settle bets between different teams, you missed the point of AB testing and CRO altogether. I am willing to bet that you are better off walking away from conversion optimization altogether.
2. When the only tool you have is a hammer – AB testing is not for every organization
Not every situation requires AB testing. Not every website can handle AB testing. And not every company is ready for AB testing.
But if all you sell is AB testing, then you are stuck. You either sign clients who you should not take because they are not ready for CRO or you are turning away business, and your sales suffer.
Let’s break this down:
Not every website is ready for testing. CRO companies have different requirements for when they take on a client and whether the client can conduct AB testing or not. If you are looking to run a full AB testing program, then you need an absolute minimum of 500 conversions per month (per device type). A CRO agency is a lot more comfortable when a client crosses 2,000 conversions per month. 80% of our clients generate over 3,000 conversions per month.
So, what do you do with a website that generates a lower number of conversions? This topic deserves a separate blog post on its own. There are several considerations:
- If the website has growth potential or is on a growth track, you can start with expert reviews, ensuring that the site is free from both bugs and usability issues.
- You can focus on micro conversion testing as opposed to macro conversion testing
- You can move from AB testing to usability testing
Not every company is ready for testing. This is harder to detect, and it is more difficult to deal with. Take the example of one of our early clients. One of the first tests involved creating a new design for the category page. The AB test generated a 32% uplift in macro conversions – orders placed with the website. We re-ran the test to make sure we had not identified a false positive. The results were consistent. We had a winner. But the client would not have it. He hated the design. He hated everything about it.
Fast forward 12 years, and the client has a slightly modified version of the page. This is typically where this story ends. Except for one new amendment. The same client just filled out a contact form on our website asking for CRO services. From the comment he left, I am almost sure he forgot that we did any work for him at some point (maybe our service was not memorable back in 2006?). Here is what he said:
“Our sales have been dropping 12% year to year in the last three years. We are bleeding money. Is there a way you can help us?”
Not every situation requires AB testing. Besides the cases where a platform (website, app, etc.) receives a low number of monthly visitors and conversions, there are a few instances where A/B testing does not make sense. A best practice is to run an AB test long enough to cover one or two business cycles. What if the buying cycle for a business spans a few months or even a few years? AB testing is not a good choice in that scenario.
My recommendation is to look at other tools to add to your conversion optimization arsenal. At a minimum, usability testing might be a better option in some of these cases. You can look at data modeling and business intelligence as another option if your organization has the budget to do so.
3. Most marketers do not understand statistics, nor are they willing to invest the time to understand it
Statistics are the heart of AB testing. After you formulate your hypothesis, you conduct a statistical test.
To conduct a test, you have to do some computational work, which is nowadays done by statistical software. The software computes a so-called test statistic: a value calculated from a mathematical formula based on the observed visitors and conversion rates for the control and the new design.
In fact, the goal of AB testing as a whole (the statistical testing procedure) is to draw conclusions about the entire target population based on just a sample drawn from it. Based on the results of statistical tests we can, therefore, generalize what we observe in a small sample to the bigger set of site visitors, users and/or subscribers.
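To make the idea of a test statistic concrete, here is a minimal sketch of one common choice, the pooled two-proportion z statistic. The function name and the visitor and conversion counts are hypothetical, and this is only an illustration; your testing software may well use a different formula.

```python
import math


def two_proportion_z(visitors_a, conversions_a, visitors_b, conversions_b):
    """Pooled two-proportion z statistic for control (a) vs. new design (b)."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate under the null hypothesis that both designs convert equally.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    standard_error = math.sqrt(
        pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b)
    )
    return (rate_b - rate_a) / standard_error


# Hypothetical numbers, for illustration only.
z = two_proportion_z(
    visitors_a=10_000, conversions_a=500,
    visitors_b=10_000, conversions_b=560,
)
print(f"z = {z:.2f}")
```

The larger the absolute value of z, the less plausible it is that the observed difference between the two designs is due to chance alone.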
The truth is that many marketers struggle with understanding statistical concepts. And most statisticians do not understand what marketers mean when they talk about AB testing. I know this first hand because we have both teams at Invesp. For some time, it felt like I was talking to two different creatures from two separate universes when talking to both teams about the same topic. According to Alex Birkett:
“Statistics provide inference on your results, and they help you make practical business decisions. Lack of understanding of statistics can lead to errors and unreliable outcomes.”
AB testing software tries to simplify a ton of statistics into a single significance number, but doing so limits the insights and analysis marketers can draw from their results.
“Understanding how statistical significance is calculated can help you determine how to best test results from your own experiments. Many tools use a 95% confidence rate, but for your experiments, it might make sense to use a lower confidence rate if you don’t need the test to be as stringent. Understanding the underlying calculations also helps you explain why your results might be significant to people who aren’t already familiar with statistics.”
Testing software reports on the performance of different variations within a test. What would be more powerful is looking at how visitors who viewed a particular variation interact with the website beyond the tested area. For that, you need to integrate your testing software with your analytics software, then conduct more in-depth segmentation and analysis on the results.
4. False positive or a false negative: the bad and the ugly
Here is how statisticians look at the false positives and false negatives:
Here is how marketers look at the false positives and false negatives:
When creating a test, you start with a null hypothesis: the conversion rate for the control is equal to the conversion rate for the new design.
The goal of AB testing is to disprove the null hypothesis. Not rejecting the null hypothesis does not mean that you accept it. You never accept the null hypothesis! It only means that you did not have enough evidence to reject it.
If your significance threshold is too low, there is a good chance you will accept a design that does not improve, or even reduces, your conversion rates (Type I errors, or false positives). The best way to avoid these errors is to raise your confidence threshold to a minimum of 95%.
If your test is underpowered, there is a good chance you will reject a design that would increase your conversion rates (Type II errors, or false negatives). One way to increase power is to increase the sample size, which in the case of A/B testing means running more traffic through the test.
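The link between confidence, power and traffic can be sketched with the standard normal-approximation formula for comparing two proportions. This is a rough planning estimate under stated assumptions (two-sided test, equal traffic split); the function name, the 5% baseline rate and the 10% relative uplift are hypothetical.

```python
import math
from statistics import NormalDist


def visitors_per_variation(baseline_rate, relative_uplift,
                           confidence=0.95, power=0.80):
    """Approximate visitors needed per variation (two-sided, equal split)."""
    new_rate = baseline_rate * (1 + relative_uplift)
    alpha = 1 - confidence
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # controls Type I errors
    z_beta = NormalDist().inv_cdf(power)           # controls Type II errors
    variance = baseline_rate * (1 - baseline_rate) + new_rate * (1 - new_rate)
    effect = new_rate - baseline_rate
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical scenario: 5% baseline conversion rate, hoping to detect a 10% relative uplift.
needed = visitors_per_variation(baseline_rate=0.05, relative_uplift=0.10)
print(f"visitors per variation: {needed:,}")
```

With these assumed inputs the estimate comes out at roughly 31,000 visitors per variation. Raising either the confidence or the power pushes the number up, which is why low-traffic sites struggle to run well-powered tests.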
Type I and II errors are mutually exclusive: a Type I error can only happen when there is no real difference between the designs, and a Type II error only when there is one.
It is important to remember that AB testing is about calculating probabilities: the probability that the results you collected when running the test (sample population) will match the results when you implement the winning design for all of your visitors.
You are never 100% certain. Which is worse, Type I or Type II errors? Both are bad. Type I errors mean that you deploy designs that cause a reduction in your conversion rates. Type II errors mean that your program fails to show a good ROI even though there was one to be had.
That does not mean that you should avoid AB testing altogether. It means that you need to be more careful when conducting your testing program.
5. One-tail vs. two-tail – sometimes it does matter
You create a new design for a critical landing page on your website. Your statistical null hypothesis is that the conversion rate of the new page will be equal to that of the existing page. There are two different ways to think about the alternative hypothesis:
Approach 1: conversion rate for control is not equal to conversion rate for new design
Approach 2: conversion rate for the new design is higher than conversion rate for control OR conversion rate for control is higher than conversion rate for new design
In the first approach, you are not making any assumption about whether the new design is good or bad for business. This is called a two-sided test.
In the second approach, you make an explicit assumption about which of the two, the control or the new design, converts better. This is a one-sided test.
Popular AB testing software such as Optimizely or VWO relied at some point on one-tailed statistical analysis to determine a winner in a test. That has changed in the last couple of years.
Why does this matter?
A couple of points to keep in mind:
- Choosing one-tailed vs. two-tailed will impact how many visitors you will need to run through your test. A two-tailed test looks for a difference in both directions, so it needs more data to reach the same confidence level. So, if you have 10,000 visitors and that is enough to reach significance with a one-tailed analysis, running the same data through a two-tailed analysis may not produce a significant result.
- One-tail AB tests result in more type I errors.
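As a small illustration of how the same data can pass one analysis and fail the other, the sketch below computes both p-values from a single z statistic using the normal approximation. The z value of 1.80 is hypothetical, and the function name is only for illustration.

```python
from statistics import NormalDist


def p_values(z):
    """Return (one_tailed, two_tailed) p-values for a z statistic."""
    # One-tailed: chance of a z at least this large if the new design is no better.
    one_tailed = 1 - NormalDist().cdf(z)
    # Two-tailed: chance of a difference at least this large in either direction.
    two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return one_tailed, two_tailed


one_tailed, two_tailed = p_values(1.80)  # hypothetical z statistic
print(f"one-tailed p = {one_tailed:.3f}, two-tailed p = {two_tailed:.3f}")
```

For a positive z, the two-tailed p-value is exactly double the one-tailed one. With this assumed z, the one-tailed p-value falls below 0.05 while the two-tailed p-value does not, so the same test would be declared a winner under one analysis and inconclusive under the other at a 95% threshold.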
The document “Hypothesis testing, type I and type II errors” points out that:
“A one-tailed hypothesis has the statistical advantage of permitting a smaller sample size as compared to that permissible by a two-tailed hypothesis. Unfortunately, one-tailed hypotheses are not always appropriate; in fact, some investigators believe that they should never be used.”
As a matter of best practice, it also notes:
“Whatever strategy is used, it should be stated in advance; otherwise, it would lack statistical rigor. Data dredging after it has been collected and post hoc deciding to change over to one-tailed hypothesis testing to reduce the sample size and P value are indicative of lack of scientific integrity.”
6. Visual editors are horrible and near useless
“With VWO’s visual campaign builder, setting up a campaign (such as A/B testing and targeting, etc.) is child’s play. The intuitive interface makes the process incredibly smooth and quick. Simply load your website and start creating campaigns.” VWO
“Create with clicks, not code – Brainstorm ideas and create them as experiments without relying on developer help using Optimizely’s industry-leading visual editor.” Optimizely
I am not picking on Optimizely or VWO…we use both platforms with our clients. The only reason I chose them is that they are the two most popular ones we see.
If you have done any AB testing for more than a couple of months, you know that the visual editor is a near useless feature. According to Stephen Walsh:
“As with typefaces, testing hundreds of different versions of your text-based copy, each with only a small change from its predecessor, can be a fruitless waste of time and money. So, while you should continually edit and experiment with your copy, remember to look at the bigger picture. Don’t get hung up on every other word.”
Visual editors are great when you are trying to edit text on a page, hide an element, or change a style. Anything more than that, you will have to rely on a front-end developer to implement your changes to the page. I would go further and argue that visual editors are partially responsible for the notion that a good AB test could be changing a headline or a color of a button.
7. Can you really calculate the ROI on this thing?
If you do CRO long enough, you will see this scenario play out several times:
- You run a test on a web page
- Your test shows an uplift in conversions with statistical significance
- Client assumes that revenue uplift will match that of the conversion uplift
Then comes the moment when you try to explain to the client why a 10% increase in conversion rates on a test will not translate into a 10% increase in revenue.
I have seen too many clients turn from statistical novices to gurus when this happens. Then comes the next stage in the process: the client starts questioning AB testing altogether, its validity and its results. Mind you, if you do enough digging around the web, you can find videos and blog posts arguing that you should not conduct any AB testing and that it is a complete waste of time. It is the web, after all; you can find articles on how useless anything is if you want to.
And because many struggle to relate AB testing results to actual business revenue, you end up with a fun post like this one from Luke Wroblewski:
Perhaps the most fun part of the post is that Luke did not say a lot, yet you can see a large number of comments and likes on his post. Side note – I would highly recommend watching Luke’s presentation at Conversions@Google 2017.
Here is the complexity of this:
AB testing and conversion optimization are sold as a way to move marketing from a fuzzy activity with no exact measurable impact into a scientific approach that produces quantifiable results.
How many times have you heard a speaker say this at a conference: “This test generated a 20% increase in conversions. Imagine what a 20% increase in revenue will do to your business”?
Did you see what the speaker did? He took the results of one test (you do not know how statistically valid they are) and implied that these results mean the business revenue increased by the same amount.
I plan on writing about calculating ROI on your AB testing program in a future post, but let me say that there are a few facts around this:
- It is challenging to do ROI calculations – but not impossible
- Assuming that you have a statistically significant winner in a test, there are several reasons why the uplift in your test will not translate one to one into revenue increases
- This is further complicated by traffic fluctuations which you might not have control over.
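One way to see why a conversion uplift need not map one to one onto revenue is the identity revenue = visitors × conversion rate × average order value. The sketch below uses entirely hypothetical figures; the traffic dip reflects the fluctuations mentioned above, while the drop in average order value is an illustrative assumption and not a measured result.

```python
def monthly_revenue(visitors, conversion_rate, average_order_value):
    """Revenue as visitors x conversion rate x average order value."""
    return visitors * conversion_rate * average_order_value


# All figures below are hypothetical.
before = monthly_revenue(visitors=100_000, conversion_rate=0.020,
                         average_order_value=80.0)
# Conversion rate rises 10% (0.020 -> 0.022), but traffic dips 5%
# and average order value slips 4% (80.0 -> 76.8).
after = monthly_revenue(visitors=95_000, conversion_rate=0.022,
                        average_order_value=76.8)

revenue_change = (after - before) / before
print(f"conversion uplift: 10.0%, revenue change: {revenue_change:.1%}")
```

In this made-up scenario, a genuine 10% conversion uplift yields well under a 1% change in revenue, because the other two factors in the product moved at the same time.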
“Here’s a rule of thumb: If you can’t prove the testing you’re doing has generated ROI in some way, you’re wasting your time. That goes for everything whether it’s A/B Testing, UX Design, Email Testing, etc.
And when I say prove, I don’t mean point to a graph that shows conversion rates increasing over time. I mean quantify exactly how much your program has impacted the business. No guesswork, no “we expect to see”, no machine learning models & predictions, I mean real value, now.
It’s true that quantifying optimization in this way is hard, or at least much harder than not doing it. But if you’re confident that the impact you have been making is real and honest there’s nothing to be worried about.”
Khalid Saleh is CEO and co-founder of Invesp. He is the co-author of the Amazon.com bestselling book “Conversion Optimization: The Art and Science of Converting Visitors into Customers.”
Khalid is an in-demand speaker who has presented at such industry events as SMX, SES, PubCon, Emetrics, ACCM and DMA, among others.