The Concept and Math Behind the "Two Proportions Z-Test"
Updated: Jan 2, 2018
When analyzing the results of an A/B test, we usually compare the conversion rate, add-to-cart rate, sign-up rate, etc. (some sort of proportion) between the two treatments. Statistics provides multiple methods to compare the difference in two proportions from two independent samples. Of those, one of the most common methods used in e-commerce is the “Two Proportions Z-test” (others include the Bayesian test, the Chi-Square test, etc.). The good thing about statistical testing is that with large enough sample sizes, which is generally the case in e-commerce, most of the different methods will lead to the same outcome (or be very close).
In this post, we will primarily focus on explaining the “Two Proportions Z-test.” We will start by laying out some of the basic terminology and concepts, and then finalize with the calculations done in a “Two Proportions Z-Test.” The key concept in this specific test is the Z-Score.
To understand what a Z-Score stands for, let’s go through an example. Suppose you have a website with 100 products with different conversion rates and an overall average of 5%.
Let’s say we want to come up with a score for each product based on how its conversion rate compares to the rest of the group. Take the example of Product A, which has a 10% conversion rate, considerably better than the others. But what is a good way to score it so that we incorporate both its difference from the overall average and the deviations in the data?
This is where the Z-Score comes into play. We find the difference between Product A’s conversion rate (X) and the whole population average (𝜇), and see how big that difference is relative to the standard deviation of conversion rates around the average (𝜎). In formula form, Z = (X − 𝜇) / 𝜎.
Note that, for the same difference from the group mean, the score is higher when conversion rates vary less, and lower when they fluctuate substantially. Think of this as your craziness score. If you are crazy enough, you would stand out more in a population with little variety in sanity levels, yet you won’t stand out as much in a population where sanity levels are widely spread.
Great, now that we have covered what a Z-Score is, let’s do a quick calculation for the Z-Score of Product A. We will assume that the standard deviation is given and is 2.5%.
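With the numbers above, the arithmetic is Z = (10% − 5%) / 2.5% = 2.0. The same calculation can be sketched in a few lines of Python. The function name `z_score` is only an illustrative choice, and the inputs are the figures from this example.

```python
def z_score(x, mean, std_dev):
    """Return how many standard deviations x lies from the mean."""
    return (x - mean) / std_dev

# Product A: 10% conversion rate, 5% overall average, 2.5% standard deviation
z = z_score(x=0.10, mean=0.05, std_dev=0.025)
print(f"Z-Score of Product A: {z:.1f}")
```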
Now let’s put things into a visual so we understand what a Z-Score of 2.0 means. We will be plotting the distribution of conversion rates. Note that we are assuming the conversion rates are normally distributed. To interpret a Z-Score as a percentile, as we do here, the data needs to be at least approximately normally distributed. In e-commerce, since we usually deal with large enough samples, most KPIs such as conversion rate, add-to-cart rate, etc. appear to be at least approximately normally distributed.
Since the population is normally distributed, most of the conversion rates are around the average. About 68% of the conversion rates lie within 1 standard deviation of the average. So where does Product A live? Product A has a Z-Score of +2.0, which means it is 2 standard deviations above the population mean, giving it a higher conversion rate than approximately 97.7% of all the products (the accompanying visualization shows the products that Product A has surpassed in conversion rate).
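Under the normality assumption, these shares follow from the cumulative distribution function of the standard normal distribution. The snippet below is a minimal sketch using only Python's standard library; `normal_cdf` is a helper name introduced here for illustration.

```python
import math

def normal_cdf(z):
    """Share of a standard normal distribution that lies below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Share of values within 1 standard deviation of the mean
within_one_sd = normal_cdf(1.0) - normal_cdf(-1.0)
# Share of values below a Z-Score of +2.0 (Product A)
below_product_a = normal_cdf(2.0)

print(f"Within 1 SD of the mean: {within_one_sd:.1%}")
print(f"Below Z = +2.0: {below_product_a:.1%}")
```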
Brilliant! Now we know how to interpret a Z-Score. We have shown that the conversion rate of Product A is significantly greater than the group average.
Now, we will apply the same concepts to comparing 2 independent samples. The basic idea will be to compare the difference between the 2 groups against their expected difference. Let’s walk through this with another example. Suppose you randomly select 100 groups of customers with 10,000 customers in each group from your website. You measure the conversion rate for each group and create a distribution plot.
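To get a feel for this thought experiment, it can be simulated. The sketch below assumes every customer converts independently with the same underlying rate of 5%. That rate, the fixed seed, and the variable names are assumptions made purely for illustration, not data from a real website.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

TRUE_RATE = 0.05     # assumed underlying conversion rate (hypothetical)
GROUPS = 100         # number of randomly selected groups
GROUP_SIZE = 10_000  # customers per group

group_rates = []
for _ in range(GROUPS):
    # Each customer converts with probability TRUE_RATE
    conversions = sum(random.random() < TRUE_RATE for _ in range(GROUP_SIZE))
    group_rates.append(conversions / GROUP_SIZE)

print(f"Mean of group conversion rates: {statistics.mean(group_rates):.4f}")
print(f"Std. dev. of group conversion rates: {statistics.stdev(group_rates):.4f}")
```

Plotting a histogram of `group_rates` would give the kind of distribution plot described here.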
As expected, you see that most of the groups cluster around a certain conversion rate. Some groups deviate more because they ended up with a less balanced mix of low and high intent customers. So, our basic idea will be to compare the difference between the 2 groups against the expected difference, which we will take to be zero. Strictly speaking, the expected difference here is a hypothesized difference, which means it is the value we are testing against. In other words, we are hypothesizing that the difference will be zero, and calculating whether the actual observed difference is significantly different from that. We apply the same Z-Score concept that we used in the product conversion rate example. Yet, we will swap 2 things in the Z-Score formula:
First, instead of comparing one product’s conversion rate to the whole group’s average, here we will compare the difference in conversion rates between 2 groups to the hypothesized difference of zero.
Second, since we are now comparing differences in proportions and not proportions, we will swap the Standard Deviation with a term called the “Standard Error of Difference for Proportions,” which basically says: “If I were to pick 2 samples, how much of an error can I expect for the difference between their proportions?” So our task will be to calculate the difference in proportions and see how many SEs it is away from the hypothesized difference of zero. This is very similar to what we did in the product conversion example, where we checked how many SDs the product conversion rate is away from the population mean.
If the difference in sample proportions is beyond a significant distance from the hypothesized difference of zero, then we can say that the sample populations have significantly different proportions. See below the Z-Score we used for the product conversion example and the Z-Score we will use for the difference of proportions side by side.
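Written out, the two scores look as follows. The subscripts and the symbols for the two sample proportions (written here as p̂₁ and p̂₂) are notation assumed for this sketch. SE stands for the Standard Error of Difference for Proportions, and the subtracted 0 is the hypothesized difference.

```latex
Z_{\text{product}} = \frac{X - \mu}{\sigma}
\qquad\qquad
Z_{\text{difference}} = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{SE}
```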
The Standard Error is calculated as:
$$SE = \sqrt{p\,(1 - p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
where p (pooled proportion) is
$$p = \frac{Y_1 + Y_2}{n_1 + n_2}$$
and where Y is the number of conversions (successes) and n is the sample size of each group. If you seek to further understand where the SE formula comes from and why we use the pooled proportion, please read my blog on "Standard Deviation and Standard Error" (specifically, the section named "Standard Error of Difference used in the Two Proportions Z-Test"). For the sake of simplicity, we will not dive into the details of those in this post.
As you have seen, the basis of the “Two Proportions Z-test” is simple. We basically calculate whether the difference in the sample proportions is beyond a significant distance of error from the hypothesized difference of zero, and if so, we conclude that the sample populations have significantly different proportions.
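The whole procedure can be sketched as a small Python function. It assumes the usual pooled form of the Standard Error, sqrt(p(1 − p)(1/n_A + 1/n_B)), with p being all conversions divided by all visitors. The function name, the parameter names, and the counts in the usage lines are hypothetical and chosen only for illustration.

```python
import math

def two_proportion_z_score(conversions_a, visitors_a, conversions_b, visitors_b):
    """Z-Score for the difference in proportions (B minus A), using the pooled proportion."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled proportion: all conversions over all visitors
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    # Standard Error of the difference under the hypothesis of zero difference
    standard_error = math.sqrt(
        pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b)
    )
    # Observed difference minus the hypothesized difference of zero, in SE units
    return (rate_b - rate_a - 0.0) / standard_error

# Hypothetical counts, made up purely for illustration
z = two_proportion_z_score(conversions_a=500, visitors_a=10_000,
                           conversions_b=570, visitors_b=10_000)
print(f"Z-Score: {z:.2f}")
```

A positive score means B converted better than A in the sample; how large it needs to be depends on the significance level you choose.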
Let’s finish with an example:
We run an experiment with two treatments: A (control) and B (new variation). At the end of the experiment, the results look like the below. Can you identify whether the change in the conversion rate was significant or not?
The calculations showed us that we have a Z-Score of 2.83, which is a very high score. Using any Z-Score to p-value chart, we can see that this corresponds to a one-tailed p-value of about 0.002, which is 99.8% statistical significance. Therefore, we conclude that B has a statistically significantly higher conversion rate than A.
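Instead of a chart, the p-value can also be computed directly from the Z-Score. The sketch below uses only Python's standard library. `one_tailed_p_value` is a helper name introduced here, and reading the 0.002 figure as a one-tailed value is an interpretation, since a two-tailed test would double it.

```python
import math

def one_tailed_p_value(z):
    """Probability of a Z-Score at least this large if the true difference is zero."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

z = 2.83  # Z-Score from the example above
p_one_tailed = one_tailed_p_value(z)
p_two_tailed = 2 * one_tailed_p_value(abs(z))

print(f"One-tailed p-value: {p_one_tailed:.4f}")
print(f"Two-tailed p-value: {p_two_tailed:.4f}")
```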