After a recent release I did some crude calculations to see how the new feature performed. We have a business intelligence team to do these analyses, but I was eager and wanted early answers. The initial numbers looked promising, but were they statistically significant or just the result of dumb luck? This eventually led me to do some interesting power analysis using G*Power 3, which I'll now share with you.

Lest I divulge any company secrets I will use the following fictitious example. A recent study says men with guitars are more attractive to women. Our customers are predominantly women so we want to see if a redesigned registration page, one with a handsome man playing a guitar, would increase signups compared to our current registration page featuring an adorable puppy. Our A/B test scenario looks like this,

After running our A/B test we will be in one of four possible states,

What we want is low *alpha* and high *power*, where power is 1 - *beta*. In other words, we want to minimize the probability of a false positive (*alpha*) and of a false negative (*beta*). Using G*Power 3 our numbers look like this,

For an alpha of 0.05, a power of 0.8, and an effect size of 0.3, we need a total sample size of 88.
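G*Power isn't the only way to get this number; if you live in Python, statsmodels can do the same a priori calculation. A minimal sketch, assuming a chi-square test with one degree of freedom (the default two bins):

```python
import math

from statsmodels.stats.power import GofChisquarePower

# Solve for the sample size needed to detect effect size w = 0.3
# with alpha = 0.05 and power = 0.8 (chi-square, df = 1).
n = GofChisquarePower().solve_power(effect_size=0.3, alpha=0.05, power=0.8)
print(math.ceil(n))  # rounds up to 88, matching G*Power
```

Leaving `nobs` unset tells `solve_power` that sample size is the unknown to solve for.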

Effect size measures "practical" significance: how large the difference is, not merely whether one exists. Our registration rate may be 40% whereas our purchase rate may only be 4%. Using this handy online calculator (Cramér's V), the effect sizes for an increase to 45% and to 4.5% are 0.0269 and 0.0116, respectively. How does this impact our power analysis?
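Plugging those two effect sizes into the same a priori calculation shows the impact. A sketch with statsmodels, reusing the w values from the calculator above (results may differ from G*Power's by a rounding hair):

```python
import math

from statsmodels.stats.power import GofChisquarePower

analysis = GofChisquarePower()
sample_sizes = {}
for w in (0.0269, 0.0116):
    # Solve for sample size at alpha = 0.05, power = 0.8, df = 1.
    n = analysis.solve_power(effect_size=w, alpha=0.05, power=0.8)
    sample_sizes[w] = math.ceil(n)
    print(f"w = {w}: total sample size = {sample_sizes[w]}")
```

The loop prints roughly 10847 and 58330, the same totals G*Power reports below.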

```
Analysis:   A priori: Compute required sample size
Input:      Effect size w = 0.0269
            α err prob = 0.05
            Power (1-β err prob) = 0.8
            Df = 1
Output:     Noncentrality parameter λ = 7.8489977
            Critical χ² = 3.8414588
            Total sample size = 10847
            Actual power = 0.8000069
```

```
Analysis:   A priori: Compute required sample size
Input:      Effect size w = 0.0116
            α err prob = 0.05
            Power (1-β err prob) = 0.8
            Df = 1
Output:     Noncentrality parameter λ = 7.8488848
            Critical χ² = 3.8414588
            Total sample size = 58330
            Actual power = 0.8000012
```

We go from a sample size of about 10k to about 58k, which makes intuitive sense: determining whether a 0.5-point bump in purchase rate is due to the handsome guitar man or to dumb luck takes far more samples than the more "obvious" 5-point bump in registration conversion.
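The jump is no accident: for a chi-square test the required sample size is inversely proportional to the square of the effect size, so the two totals above should differ by exactly the factor (0.0269/0.0116)². A quick sanity check:

```python
# Required n scales as 1 / w^2, so the ratio of the two G*Power
# sample sizes should equal the squared ratio of the effect sizes.
ratio_n = 58330 / 10847           # ratio of required sample sizes
ratio_w = (0.0269 / 0.0116) ** 2  # squared ratio of effect sizes
print(ratio_n, ratio_w)           # both ≈ 5.38
```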

So now the picture is complete. We estimate our effect size, and along with our desired alpha and power we can determine the required sample size. We then run our A/B test until we reach that sample size. Once the results are in, we do another calculation to see whether the guitar man makes a difference in our registrations. My bet is still on the adorable puppy.
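That final calculation is a chi-square test of independence on the observed counts. A sketch with SciPy, using made-up results (the counts below are hypothetical, not real data):

```python
from scipy.stats import chi2_contingency

# Hypothetical A/B results: [signed up, did not sign up] per page.
observed = [
    [450, 550],  # guitar man page
    [400, 600],  # adorable puppy page
]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
if p < 0.05:
    print("The difference is statistically significant.")
else:
    print("Could be dumb luck; keep the puppy.")
```

With these particular counts the p-value lands below 0.05, so the guitar man would win; with real data, of course, the test decides.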