This document explains the relationship between p-values and confidence intervals. It then moves on to the specific case of a binomial law: assuming we want to determine whether or not two binomial laws are significantly different, how many observations do we need to get the p-value under a given threshold?

The term p-value is very popular in the world of search engines. I usually prefer a confidence interval at 95%; I think it is easier to understand. Moreover, because p-values are real numbers, we could be tempted to compare them, and that is usually wrong. On the other hand, it is more difficult to compare confidence intervals, especially if they are related to completely different variables: their nature prevents you from doing that. However, p-values and confidence intervals are similar: they tell you whether or not a metric difference is significant.

Usually, it starts from a set of independent and identically distributed random variables $X_1, \dots, X_n$. We then estimate the average $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ and we ask the question: is it null? In other terms, we want to know whether the average is significantly different from zero. If the random variables follow a law which has a standard deviation $\sigma$, we can use the central limit theorem, which tells us:

$$\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \rightarrow \mathcal{N}(0, 1)$$

Not all distributions have a standard deviation. For example, if $X$ follows a Cauchy law, the standard deviation does not exist. This remark also concerns every distribution known as a heavy-tailed distribution.
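The Cauchy remark can be checked numerically. The variance integral of the Cauchy density $\frac{1}{\pi(1+x^2)}$ truncated to $[-T, T]$ has the closed form $\frac{2}{\pi}(T - \arctan T)$, which grows without bound as $T$ grows, so the variance (and the standard deviation) does not exist. A minimal sketch:

```python
from math import atan, pi

def truncated_second_moment(t: float) -> float:
    # integral of x^2 / (pi (1 + x^2)) over [-t, t],
    # in closed form: (2/pi) * (t - atan(t))
    return 2.0 / pi * (t - atan(t))

# the truncated second moment keeps growing instead of converging
for t in (10.0, 100.0, 1000.0):
    print(t, truncated_second_moment(t))
```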

If $X \sim \mathcal{N}(0,1)$, then we have $\mathbb{P}(-1.96 \leqslant X \leqslant 1.96) \approx 0.95$. That is why we can say:

$$\mathbb{P}\left(-1.96 \leqslant \sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \leqslant 1.96\right) \approx 0.95$$

And the confidence interval at 95% would be:

$$\left[\bar{X}_n - 1.96\,\frac{\sigma}{\sqrt{n}},\; \bar{X}_n + 1.96\,\frac{\sigma}{\sqrt{n}}\right]$$

When $\sigma$ is unknown and replaced by its empirical estimator $\hat{\sigma}$, it becomes:

$$\left[\bar{X}_n - 1.96\,\frac{\hat{\sigma}}{\sqrt{n}},\; \bar{X}_n + 1.96\,\frac{\hat{\sigma}}{\sqrt{n}}\right]$$

We usually want to check whether the mean is equal to a specific value $\theta_0$ using a statistical test:

$$H_0: \mu = \theta_0 \quad \text{against} \quad H_1: \mu \neq \theta_0$$

We validate $H_0$ if:

$$\theta_0 \in \left[\bar{X}_n - 1.96\,\frac{\sigma}{\sqrt{n}},\; \bar{X}_n + 1.96\,\frac{\sigma}{\sqrt{n}}\right]$$

With confidence intervals, you first choose a confidence level and then you get an interval. You then check whether your value is inside or outside the interval: inside, the difference is not significant; outside, it is.
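A minimal sketch of this decision rule: build the 95% confidence interval around the empirical mean and check whether a reference value (here 0) lies inside it. The samples below are invented for illustration.

```python
from math import sqrt

def confidence_interval_95(sample):
    """95% confidence interval for the mean, using the empirical std."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / n
    half = 1.96 * sqrt(var) / sqrt(n)
    return mean - half, mean + half

def is_significant(sample, value=0.0):
    """True if `value` falls outside the 95% interval."""
    low, high = confidence_interval_95(sample)
    return not (low <= value <= high)

# a sample centered near 1 is significantly different from 0,
# a sample centered near 0 is not
print(is_significant([0.9, 1.1, 1.0, 0.95, 1.05]))
print(is_significant([-0.1, 0.1, 0.05, -0.05, 0.0]))
```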

With a p-value, we consider the problem the other way around: given a value $x$, what is the probability that the difference is significant? Let's consider $X$ following a normal law $\mathcal{N}(0,1)$. We are looking for:

$$\mathbb{P}(|X| \geqslant x)$$

This probability is the p-value.

$$p(x) = \mathbb{P}(|X| \geqslant x) = \int_{|t| \geqslant x} \frac{1}{\sqrt{2\pi}} e^{-\frac{t^2}{2}}\, dt \qquad (1)$$
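For a standard normal variable, this integral has a closed form through the complementary error function: $\mathbb{P}(|X| \geqslant x) = \operatorname{erfc}(x / \sqrt{2})$. A one-line sketch using the standard library:

```python
from math import erfc, sqrt

def p_value(x: float) -> float:
    """Two-sided p-value for a N(0,1) statistic: P(|X| >= x)."""
    return erfc(x / sqrt(2))

# the familiar 5% threshold corresponds to x close to 1.96
print(p_value(1.96))
```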

At this point, we should not forget that we used a theorem which holds when $n \rightarrow \infty$, which means everything we said is true only when $n$ is large enough.

Usually, we do not want to test if one average is null but if the difference between two averages is null. We consider two random samples having the same size $n$, described by $(X_1, \dots, X_n)$ and $(Y_1, \dots, Y_n)$. All variables are independent. The $(X_i)$ are distributed according to the same law, and we assume the same for the $(Y_i)$. We expect the following difference to be null:

$$\bar{X}_n - \bar{Y}_n = \frac{1}{n} \sum_{i=1}^{n} X_i - \frac{1}{n} \sum_{i=1}^{n} Y_i = \frac{1}{n} \sum_{i=1}^{n} (X_i - Y_i) \qquad (2)$$

Considering expression (2), we can apply the central limit theorem to the variable $Z_i = X_i - Y_i$; we get (with $\mu = \mu_X - \mu_Y$ and $\sigma^2 = \sigma_X^2 + \sigma_Y^2$):

$$\sqrt{n}\,\frac{\bar{Z}_n - \mu}{\sigma} \rightarrow \mathcal{N}(0,1)$$

If both samples do not have the same number of observations, $n$ for $(X_i)$ and $m$ for $(Y_i)$, this expression becomes:

$$\frac{(\bar{X}_n - \bar{Y}_m) - (\mu_X - \mu_Y)}{\sqrt{\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}}} \rightarrow \mathcal{N}(0,1)$$
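A hedged sketch of the corresponding two-sample statistic, with unequal sample sizes and the empirical variances standing in for the unknown ones (the function name is illustrative):

```python
from math import sqrt

def z_statistic(xs, ys):
    """Standardized difference of means for samples of sizes n and m."""
    n, m = len(xs), len(ys)
    mx, my = sum(xs) / n, sum(ys) / m
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / m
    return (mx - my) / sqrt(vx / n + vy / m)

# |z| >= 1.96 means the difference is significant at the 5% level
xs = [1.0, 1.1, 0.9, 1.05, 0.95] * 10   # 50 observations around 1
ys = [0.0, 0.1, -0.1, 0.05, -0.05] * 8  # 40 observations around 0
print(z_statistic(xs, ys))
```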

A binomial variable $X \sim \mathcal{B}(p)$ is defined by:

$$\mathbb{P}(X = 1) = p, \qquad \mathbb{P}(X = 0) = 1 - p$$

Let’s consider two series of observations $(X_i)$ and $(Y_i)$. We assume $X_i \sim \mathcal{B}(p)$ and $Y_i \sim \mathcal{B}(p + d)$, and we want to determine how many observations we need to get a p-value below 5%. We know that $\mathbb{E}(X) = p$ and $\mathbb{V}(X) = p(1-p)$. The next table shows the values: the first column contains the values of $p$, the first row contains the values of $d$. We also assume we have the same number of random observations for each variable. The statistical test can be defined as follows:

$$H_0: p_X = p_Y \quad \text{against} \quad H_1: p_X \neq p_Y$$

If $H_0$ is true, then:

$$\sqrt{n}\,\frac{\bar{X}_n - \bar{Y}_n}{\sqrt{2p(1-p)}} \rightarrow \mathcal{N}(0,1)$$

Given a binomial law with parameter $p$ and a difference $d$, this table gives the number of observations needed on both sides to get a significant difference, assuming $p$ is the expected percentage.
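The table entries can be recomputed under the assumptions above: equal size $n$ on both sides, variance $2p(1-p)$ under $H_0$, and significance when the statistic exceeds 1.96. Solving $\sqrt{n}\, d / \sqrt{2p(1-p)} \geqslant 1.96$ for $n$ gives the following sketch:

```python
from math import ceil

def required_n(p: float, d: float) -> int:
    """Observations needed per side to detect a difference d at the 5% level."""
    return ceil(1.96 ** 2 * 2 * p * (1 - p) / d ** 2)

# e.g. detecting a 1-point difference around p = 10% takes thousands of draws
print(required_n(0.1, 0.01))
print(required_n(0.5, 0.1))
```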

Expression (1) gives a way to estimate the p-value. Computing the integral is not always possible, but there is a way to do it using the Monte Carlo method. Let's assume $X \sim \mathcal{N}(0,1)$. We denote $f$ as the density function of $X$. We also consider an interval $[a, b]$ with $0 \leqslant a \leqslant b$. Then we have $\mathbb{P}(a \leqslant X \leqslant b) = \int_a^b f(t)\,dt$ and:

$$(b - a)\, f(b) \leqslant \int_a^b f(t)\,dt \leqslant (b - a)\, f(a)$$

This is true because $f$ is decreasing for $x \geqslant 0$. The p-value estimator for $x$ using the Monte Carlo method is:

$$\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}_{|X_i| \geqslant x}$$

Assuming every $X_i$ follows a normal law $\mathcal{N}(0,1)$.
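A minimal sketch of this estimator, drawing standard normal variables with the standard library and counting how often $|X_i|$ exceeds $x$:

```python
import random

def p_value_mc(x: float, n_draws: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(|X| >= x) for X ~ N(0, 1)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_draws) if abs(rng.gauss(0.0, 1.0)) >= x)
    return hits / n_draws

# should land close to 0.05 for x = 1.96
print(p_value_mc(1.96))
```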

The problem of multiple comparisons happens when dealing with many metrics measuring a change. That's always the case when two versions of the same website are compared in an A/B test. The metrics are correlated, but it is unlikely that all metric differences will be significant, or that none will be. The Holm–Bonferroni method proposes a way to define a hypothesis on top of the existing ones.
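The Holm–Bonferroni procedure can be sketched as follows: sort the $m$ p-values, compare the $k$-th smallest to $\alpha / (m - k)$, and stop rejecting at the first failure. The function name and interface are illustrative:

```python
def holm_bonferroni(pvalues, alpha=0.05):
    """Returns a list of booleans: True where the hypothesis is rejected."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    rejected = [False] * m
    for k, i in enumerate(order):
        if pvalues[i] <= alpha / (m - k):
            rejected[i] = True
        else:
            break  # every larger p-value is kept as well
    return rejected

# four metrics from the same A/B test: only the two smallest p-values
# survive the step-down thresholds 0.0125, 0.0167, 0.025, 0.05
print(holm_bonferroni([0.01, 0.04, 0.03, 0.005]))
```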

Here we assume there are two mixed populations defined by a random variable $X$. Let $X$ be a mixture of two binomial laws with parameters $p$ and $q$:

$$X \sim \pi\, \mathcal{B}(p) + (1 - \pi)\, \mathcal{B}(q)$$

It is for example the case for a series of draws coming from two different coins.

The likelihood of a random sample $(X_1, \dots, X_n)$, where the classes $Z_i$ we do not observe indicate which coin produced each draw, is:

$$L(\theta) = \prod_{i=1}^{n} \left[ \pi\, p^{X_i} (1-p)^{1-X_i} + (1-\pi)\, q^{X_i} (1-q)^{1-X_i} \right]$$

The parameters are $\theta = (\pi, p, q)$. We use the Expectation-Maximization (EM) algorithm to determine them. We define at iteration $t$ the posterior probability that observation $i$ comes from the first coin:

$$w_i^{(t)} = \frac{\pi^{(t)} \left(p^{(t)}\right)^{X_i} \left(1-p^{(t)}\right)^{1-X_i}}{\pi^{(t)} \left(p^{(t)}\right)^{X_i} \left(1-p^{(t)}\right)^{1-X_i} + \left(1-\pi^{(t)}\right) \left(q^{(t)}\right)^{X_i} \left(1-q^{(t)}\right)^{1-X_i}}$$

We then update the parameters:

$$\pi^{(t+1)} = \frac{1}{n}\sum_{i=1}^{n} w_i^{(t)}, \qquad p^{(t+1)} = \frac{\sum_i w_i^{(t)} X_i}{\sum_i w_i^{(t)}}, \qquad q^{(t+1)} = \frac{\sum_i \left(1 - w_i^{(t)}\right) X_i}{\sum_i \left(1 - w_i^{(t)}\right)}$$
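One possible implementation of these updates, with simulated coin draws. Note that with a single flip per draw the pair $(p, q)$ is only weakly identifiable, so the check below verifies a property the M-step guarantees exactly: after an update, $\pi p + (1-\pi) q$ equals the sample mean.

```python
import random

def em_step(xs, pi, p, q):
    """One EM iteration for a mixture of two Bernoulli (coin) laws."""
    # E step: posterior probability w_i that draw i comes from the first coin
    ws = []
    for x in xs:
        a = pi * (p if x == 1 else 1 - p)
        b = (1 - pi) * (q if x == 1 else 1 - q)
        ws.append(a / (a + b))
    # M step: closed-form updates of the three parameters
    n, sw = len(xs), sum(ws)
    pi_new = sw / n
    p_new = sum(w * x for w, x in zip(ws, xs)) / sw
    q_new = sum((1 - w) * x for w, x in zip(ws, xs)) / (n - sw)
    return pi_new, p_new, q_new

# simulated draws: each one picks a coin (bias 0.2 or 0.8) and flips it once
rng = random.Random(0)
xs = [1 if rng.random() < (0.2 if rng.random() < 0.5 else 0.8) else 0
      for _ in range(1000)]

pi, p, q = 0.4, 0.3, 0.7  # arbitrary starting point
for _ in range(50):
    pi, p, q = em_step(xs, pi, p, q)
print(pi, p, q)
```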