The CRO industry has evolved tremendously over the last twelve years.
I still recall conversations we used to have with CMOs and VPs of marketing at large online retailers, trying to introduce them to conversion optimization.
We would first try to determine whether they even knew what we were talking about (it was that new): “Have you considered CRO?”
The first response for the most part was, “CRO? You mean SEO. Yes, we are working on increasing traffic to our website.”
“No, I am not talking about SEO, I am talking about CRO…conversion rate optimization. Increasing your website conversion rate.”
It was a complete education and introduction to this brand new system.
Fast forward to 2017, and the scene has changed dramatically. Most of these top companies and retailers now understand the importance of focusing on conversions.
But CRO in the early 2000s was very different from CRO now, for the simple reason that the overall level of sophistication has changed so much. Additionally, in an effort to persuade marketers to focus on CRO, the industry focused on the wrong aspects of the discipline.
In the early 2000s, hearing that there was a 200% increase in conversions wasn’t unheard of, because even basic usability was often missing online. So of course, when you fix something that is broken, a 200% increase is likely.
But we still hear these numbers, and with the sophistication we have now, it’s almost impossible to achieve those results.
Allowing dubious CRO case studies to be used again and again without speaking out against them is a mistake that many CROs are now paying for.
This is my attempt to set the record straight about such case studies.
The reason I am doing this is that I recently read one of these inaccurate case studies on a reputable blog.
Moz is one of my favorite places to read about SEO and SEM. Something caught my eye on Twitter when I read the title “The Anatomy of a $97 Million Page: A CRO Case Study.” I usually ignore case studies that make such claims. But the fact that it was published by a very trusted source made me stop and click on the link.
Let me start by saying that the only reason I chose to discuss this topic is that I believe that if the team at Moz missed the points below, then we as a CRO industry are doing a VERY BAD job educating the community about it. Moz continues to be a must-read platform for every member of our team.
Let’s start by digging a little deeper into the article and looking at the case studies featured in it.
This is a screenshot of the first case study mentioned in the article:
Any beginner in CRO can point out that this is not a valid case study. The sample size is too small (around 400 sessions) and the timeline is too short (3 days).
The second case study was not any better:
around 1000 sessions and 4 days of running time.
The third case study in the article focused on the fact that the winning variation had a 99% statistical significance. It did not provide any further details about the factors that would allow us to judge the quality of the experiment.
There is nothing unique about these case studies. The web is full of claims that a particular test increased conversions by crazy numbers without showing how that was achieved.
So let’s set the record straight:
What makes a good CRO case study?
A good CRO case study is one where the experiment concluded after satisfying established statistical requirements to provide sound insights. It does not matter whether there was an uplift as a result of the experiment. Satisfying established statistical requirements ensures that your findings are correct and thus all the conclusions you are building are valid.
There are 4 factors that we use to determine a winner for an A/B experiment:
- Length of the time an experiment ran for
- Number of conversions each variation recorded
- Statistical significance
- Statistical power
The goal of an A/B test is NOT to find a winning variation as quickly as possible. The goal of an A/B test is to determine whether the solution you provided solves a particular problem that you have uncovered.
1. Length of the time an experiment ran for
If you have a website with a large number of visitors and conversions, you can achieve statistical significance very quickly. As a matter of fact, some of our clients who receive millions of visitors every day can declare a winner for an A/B experiment within a few hours of launching a test.
But that usually means nothing.
I cannot count the hundreds (if not thousands) of experiments we launched where the testing software declared a winner and then, a few hours later, the data changed and there was a different winner. As a matter of fact, the data from at least the first 3 days of testing is still stabilizing and not yet accurate.
Ignoring the data during those early days is usually not good practice, however, because the instability or anomalies in the data might be due to poor test implementation (bugs introduced because of the test). Instead, we recommend monitoring experiment results very closely after you launch a test, while applying two rules:
- Monitor and analyze experiment results to uncover any bugs introduced in the A/B testing process
- Never call a winner during the early days of testing.
Besides data anomalies, website visitors behave differently during different hours of the day and days of the week. The goal of your experiment is to find out which of its variations performs the best across several days. To avoid inconsistent testing results due to time, we recommend running experiments for a minimum of one week (preferably two weeks).
When discussing sample and result pollution, I also pointed out that experiment results could be polluted due to running a test too long. An A/B test is an experiment. The longer you run it, the higher the chance external factors could impact your data.
We saw this firsthand a few years ago, when results for a particular test dropped off significantly for a few days. Our first thought was that there was an issue in the testing software or that some sort of bug had been introduced to the live server. But that was not the case. A competitor was running a major advertising campaign and, as a result, conversions dropped off across all variations, including the original. As a matter of best practice, we do not run experiments for more than 4 weeks, in an effort to minimize the possibility of data pollution.
2. Number of sessions/conversions each variation recorded
Think about this. You will continue subscribing to and using A/B testing software if you continue to find winners for your A/B tests. If most of your A/B tests fail, then you will quit A/B testing and move on to the next fad in digital marketing.
I think that is one of the reasons some A/B testing software will call a winner when it is too early to do so. Let’s take a look at some made-up test data:
If you plug in the data into the testing tools from VWO (https://vwo.com/ab-split-test-significance-calculator/)
Of course, if you have done any type of A/B testing, you know that the sample size is way too small and that it is too early to call.
Now, go back to the same statistical significance calculator from VWO above and plug in the following data:
Notice the error that VWO gives:
The implication is that 10 sessions is not enough while 15 sessions is enough.
Statistical significance means nothing if your experiment did not collect enough conversions. Again, collecting enough conversions ensures that you are not dealing with data anomalies.
As a matter of good practice, you should continue to run your experiment until the original and the challenger each collect a minimum of 100 conversions. If your website gets enough traffic, then we highly recommend running your experiment until the original and the challenger each collect a minimum of 500 conversions.
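The conversion minimum and the one-to-four-week window interact, so it can help to estimate up front whether a test can satisfy both. The following is a minimal planning sketch, not a feature of any testing tool. The function names and the example traffic figures are hypothetical, and the estimate assumes traffic and conversion rate stay roughly constant over the test.

```python
import math


def days_to_reach_conversions(daily_sessions_per_variation, conversion_rate,
                              min_conversions=100):
    """Estimate the whole days one variation needs to log min_conversions."""
    if daily_sessions_per_variation <= 0 or not 0 < conversion_rate <= 1:
        raise ValueError("need positive traffic and a conversion rate in (0, 1]")
    expected_daily_conversions = daily_sessions_per_variation * conversion_rate
    return math.ceil(min_conversions / expected_daily_conversions)


def planned_duration(daily_sessions_per_variation, conversion_rate,
                     min_conversions=100, min_days=7, max_days=28):
    """Return (days_to_run, fits_window).

    days_to_run is never shorter than min_days (one week by default).
    fits_window is False when the conversion minimum cannot be reached
    within max_days (four weeks by default).
    """
    needed = days_to_reach_conversions(
        daily_sessions_per_variation, conversion_rate, min_conversions)
    days_to_run = max(needed, min_days)
    return days_to_run, days_to_run <= max_days


if __name__ == "__main__":
    # Hypothetical site: 300 sessions per variation per day at a 3% rate.
    print(planned_duration(300, 0.03))
    # Hypothetical low-traffic site: 30 sessions per day at a 2% rate.
    print(planned_duration(30, 0.02))
```

If the second value comes back False, the page does not get enough traffic to reach the conversion minimum inside four weeks, which is a signal to test a higher-traffic page or a bolder change rather than to lower the bar.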
3. Statistical significance
Statistical significance is an easy but commonly misunderstood A/B testing statistical concept.
For the sake of simplicity, let’s assume that you run an A/B test with one variation against the original. As your test collects data, with more visitors and conversions getting captured, you will start noticing some difference between the original and the variation. Statistical significance is a way to measure whether that difference is due to mere coincidence or not.
In layman’s terms, statistical significance is a way to measure whether the observed changes between the original and the challenger were random and happened by pure chance, or whether they truly represent how your website visitors react to the A/B test.
Of course, the higher the confidence, the better off you are. So, it is typical for us to look for a 95% or 99% confidence level when running an A/B test.
If you see a confidence level of 95%, does it mean that the test results are 95% accurate? Does it mean that there is a 95% probability that the test is accurate? Not really.
Most CRO specialists offer two ways to think of statistical significance:
- It means that if you repeat this test over and over again the results will match the initial test in 95% of cases.
- It means that you are confident that 5% of your test samples will choose the original page over the challenger.
What does statistical significance NOT tell you?
Statistical significance is a judgment on data in its current state. It does not tell you with 100% certainty that the data will not shift and that a particular winner in an A/B test will continue to be a winner.
It has been my experience that when someone working in the CRO field presents data with statistical significance, that person is trying to evoke some sort of emotional response.
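To make the idea of statistical significance concrete, here is a minimal sketch of one common way to compute it: a two-sided, pooled two-proportion z-test. This is an assumption about method. Your testing tool may use a different approach, and the function name and the session and conversion counts below are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, p_value) comparing original A with challenger B.

    conv_* are conversion counts and n_* are session counts.
    p_value is two-sided: the probability of seeing a gap at least this
    large if both variations truly converted at the same rate.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("no variance: every session or no session converted")
    z = (rate_b - rate_a) / std_err
    # 2 * (1 - Phi(|z|)) written with the complementary error function
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical data: 200 of 10,000 sessions versus 250 of 10,000.
    z, p = two_proportion_z_test(200, 10_000, 250, 10_000)
    print(f"z = {z:.2f}, p-value = {p:.4f}")
```

A p-value below 0.05 is what is usually meant by reaching the 95% level. Note that nothing in the calculation knows how long the test ran or how stable the data is, which is why significance alone is not enough to call a winner.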
4. Statistical power
There are two types of errors you can run into when running an AB test:
- Type 1 error: false positive
- Type 2 error: false negative
Type 1 errors happen when you think one of your variations produced a significant uplift in conversions while in reality it was a false positive. So, if you deploy that winning variation, you will not see any actual uplift.
Type 2 errors happen when you reject a variation because it did not produce a statistically significant uplift, while in reality it does perform better. The rate of type 2 errors is typically represented by β (beta). Statistical power is equal to 1−β. In other words, statistical power is the probability that you will be able to detect a real difference between variations in an A/B test.
Statistical power is impacted by several factors including the sample size, the difference in conversion rates we might expect to see as a result of the test and the statistical significance.
When you run an A/B test, the test could be either under-powered or over-powered. By definition, an under-powered test does not detect possible winners in an A/B test, so you have an increase in type 2 errors. An over-powered test sets too strict a condition for uncovering an A/B test winner, which might require a longer time to conclude a test.
So, how do you determine an adequate statistical power level for an A/B test?
There are many factors that could impact the answer to that question. Getting to a higher-powered test requires more tested users, and that might mean you need to run the test for longer periods of time. If time is not an issue for you, then we like to get to 80% to 90% statistical power for a test.
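The link between power, sample size, expected difference and significance can be sketched with the standard normal-approximation formula for comparing two proportions. This is a rough planning aid under stated assumptions (two-sided test, equal traffic split), not the method of any particular testing tool, and the function name and example rates are hypothetical.

```python
import math
from statistics import NormalDist


def sessions_per_variation(baseline_rate, expected_rate,
                           alpha=0.05, power=0.80):
    """Approximate sessions needed in EACH variation.

    alpha is the accepted type 1 error rate (0.05 for 95% significance).
    power is 1 - beta, the chance of detecting the difference if it is real.
    """
    if not (0 < baseline_rate < 1 and 0 < expected_rate < 1):
        raise ValueError("rates must be strictly between 0 and 1")
    if baseline_rate == expected_rate:
        raise ValueError("rates must differ to have something to detect")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    mean_rate = (baseline_rate + expected_rate) / 2
    spread_null = math.sqrt(2 * mean_rate * (1 - mean_rate))
    spread_alt = math.sqrt(baseline_rate * (1 - baseline_rate)
                           + expected_rate * (1 - expected_rate))
    numerator = (z_alpha * spread_null + z_beta * spread_alt) ** 2
    return math.ceil(numerator / (expected_rate - baseline_rate) ** 2)


if __name__ == "__main__":
    # Hypothetical goal: lift a 2% conversion rate to 2.5%.
    for target_power in (0.80, 0.90):
        n = sessions_per_variation(0.02, 0.025, power=target_power)
        print(f"power {target_power:.0%}: {n} sessions per variation")
```

With these example rates, the estimate at 80% power comes out well over ten thousand sessions per variation, and moving to 90% power raises it further. That is the trade-off between power and test duration described above.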
A solid CRO case study should meet at least the following 4 conditions:
- Clearly state the length of time the experiment ran for
- Clearly state the statistical significance for the test
- Clearly state the statistical power for the test
- Include a large number of sessions and conversions
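The conditions above can be turned into a simple checklist. The sketch below uses the thresholds recommended in this article (one to four weeks, at least 100 conversions per variation, 95% significance, 80% power). The class, field and function names are hypothetical, and the example numbers are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CaseStudy:
    days_run: int
    conversions_original: int
    conversions_challenger: int
    significance: Optional[float] = None  # e.g. 0.95, None if not stated
    power: Optional[float] = None         # e.g. 0.80, None if not stated


def review_case_study(study: CaseStudy) -> List[str]:
    """Return the list of problems found; an empty list means none."""
    issues = []
    if study.days_run < 7:
        issues.append("ran for less than one week")
    elif study.days_run > 28:
        issues.append("ran for more than four weeks")
    if min(study.conversions_original, study.conversions_challenger) < 100:
        issues.append("fewer than 100 conversions in at least one variation")
    if study.significance is None:
        issues.append("statistical significance not stated")
    elif study.significance < 0.95:
        issues.append("statistical significance below 95%")
    if study.power is None:
        issues.append("statistical power not stated")
    elif study.power < 0.80:
        issues.append("statistical power below 80%")
    return issues


if __name__ == "__main__":
    # Made-up example: a 3-day test that reports only its significance.
    weak = CaseStudy(days_run=3, conversions_original=12,
                     conversions_challenger=20, significance=0.99)
    for issue in review_case_study(weak):
        print("-", issue)
```

A case study that reports 99% significance can still fail three of the four checks, which is the pattern in the examples discussed earlier.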
Do you use additional factors to determine whether a CRO case study is valid or not?