Saturday, 11 July 2015

Hypothesis testing on normally distributed data in R

Hypothesis testing is a useful statistical tool for drawing conclusions about a population from a sample. Say, for instance, that you want to know whether the mean of a certain parameter differs significantly from a given value at a chosen significance level. You would probably set up your test like this:

$$H_0: \mu = \mu_0 \qquad \text{vs.} \qquad H_1: \mu \neq \mu_0$$

Note that the two hypotheses above are mutually exclusive.

Now, what can you do to check which one is more plausible?
First, we should check what kind of distribution our data follows. Roughly speaking, if the number of observations is greater than 30, we can plot a histogram to get a visual grasp of the distribution and then run a few simple functions to assess its skewness and kurtosis. If these are close to those of a Normal distribution (skewness 0, kurtosis 3), it is reasonable, though not certain, to assume that the data comes from a Normal distribution. If you are not sure, the other option is to run some normality tests or gather more data. Read how to make a rough check to see if your data is normally distributed.
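
The rough check described above could look like the following minimal sketch. The sample `obs` is simulated purely for illustration, and the helper functions `skewness` and `kurtosis` are hypothetical names defined here with the plain moment estimators (they are not taken from a package).

```r
# Hypothetical sample: 100 simulated draws, for illustration only
set.seed(123)
obs <- rnorm(100, mean = 5, sd = 2)

# Visual check: histogram on the density scale with a fitted Normal curve
m <- mean(obs)
s <- sd(obs)
hist(obs, breaks = 20, freq = FALSE, main = "Histogram of the sample")
curve(dnorm(x, mean = m, sd = s), add = TRUE, col = "red")

# Moment estimators of skewness and kurtosis (helper names are assumptions)
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}
kurtosis <- function(v) {
  d <- v - mean(v)
  mean(d^4) / mean(d^2)^2
}

print(skewness(obs))  # a Normal distribution has skewness 0
print(kurtosis(obs))  # a Normal distribution has kurtosis 3

# One possible formal normality test from base R
print(shapiro.test(obs))
```

A large p-value from `shapiro.test` does not prove normality; it only means the test found no evidence against it.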

Now that we are confident that our data is Normally distributed, and hopefully the samples are independent and identically distributed, we can estimate the population mean with the sample mean. As far as the variance is concerned:
- If the population variance is known, we can use the Normal distribution in R.
- If the population variance is unknown, we should use a t-distribution with n-1 degrees of freedom (n is the number of observations). As n tends to infinity the t-distribution tends to a Normal, so if n is sufficiently large (say > 60) you could approximate the t-distribution with a Normal one.
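
To see how quickly the t-distribution approaches the Normal, one can compare two-sided critical values for a few sample sizes. This is only a sketch; the value of `alpha` and the sample sizes in `n` are arbitrary choices for illustration.

```r
alpha <- 0.05  # illustrative choice

# Two-sided critical value when the population variance is known (Normal)
z_crit <- qnorm(1 - alpha / 2)

# Two-sided critical values when the variance is estimated (t with n - 1 df)
n <- c(10, 30, 60, 1000)
t_crit <- qt(1 - alpha / 2, df = n - 1)

# The t critical values shrink towards the Normal one as n grows
print(data.frame(n = n, t_crit = t_crit, z_crit = z_crit))
```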

Now, by fixing a certain alpha (the significance level), we decide to reject the null hypothesis (H0) if:

$$\left| \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \right| > z_{1 - \alpha/2}$$

Where $\bar{x}$ is the sample mean, $s$ the standard deviation, $n$ the number of observations and $z$ the percentile of the Normal (or Student) distribution.

Again, roughly speaking, the idea behind the test is that we should reject the null hypothesis if the data shows that H0 is highly unlikely to be true. For the right-tail test, we should reject H0 if the statistic above (without the absolute value operator) falls to the right of the red line, in the cyan shaded region:

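[Figure: density of the test statistic, with the critical value marked by a red line and the right-tail rejection region shaded in cyan]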
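
A figure of this kind could be drawn roughly as follows. This is a sketch under assumptions, not the script behind the figure: the degrees of freedom `dof`, the value of `alpha` and the plotting range are all illustrative choices.

```r
alpha <- 0.05
dof <- 20                          # hypothetical degrees of freedom (n - 1)
t_crit <- qt(1 - alpha, df = dof)  # right-tail critical value

# Density of the t-distribution under H0
curve(dt(x, df = dof), from = -4, to = 4,
      xlab = "t", ylab = "density", main = "Right-tail test")

# Shade the rejection region to the right of the critical value
xs <- seq(t_crit, 4, length.out = 200)
polygon(c(t_crit, xs, 4), c(0, dt(xs, df = dof), 0),
        col = "cyan", border = NA)

# Mark the critical value
abline(v = t_crit, col = "red", lwd = 2)
```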

Aside from the two-sided test, you can also test whether the population mean is higher or lower than a certain value using the same process. Below is an implementation of the tests in R, assuming that the variance is unknown (and therefore using the t distribution):
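
As a minimal sketch of the three tests, the snippet below uses a small made-up sample `x` and a hypothetical reference value `mu0`; neither comes from a real data set, and the variable names are illustrative.

```r
# Hypothetical data and reference value, for illustration only
x <- c(4.8, 5.1, 5.0, 4.9, 5.2, 5.3)
mu0 <- 5
alpha <- 0.05

n <- length(x)

# Test statistic: standardized distance of the sample mean from mu0
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Two-sided test, H1: mu != mu0
reject_two_sided <- abs(t_stat) > qt(1 - alpha / 2, df = n - 1)

# Right-tail test, H1: mu > mu0
reject_right <- t_stat > qt(1 - alpha, df = n - 1)

# Left-tail test, H1: mu < mu0
reject_left <- t_stat < qt(alpha, df = n - 1)

print(t_stat)
print(c(two_sided = reject_two_sided,
        right = reject_right,
        left = reject_left))

# Cross-check the two-sided case with the built-in function
print(t.test(x, mu = mu0, alternative = "two.sided", conf.level = 1 - alpha))
```

The built-in `t.test` also accepts `alternative = "greater"` and `alternative = "less"` for the one-sided versions.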

In this case we can say that, at the 5% significance level, we do not reject the null hypothesis in any of the three cases.

In case you'd like to run a z-test because you know the population variance, by clicking here you can download the script which uses the Normal distribution. Essentially, you only have to replace the functions related to the Student distribution in the script above with those related to the Normal. Finally, here is the code used to make the plot above.
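
For the z-test, the swap from Student to Normal functions might look like the sketch below. The sample `x`, the reference value `mu0` and the "known" population standard deviation `sigma` are all made up for illustration.

```r
# Hypothetical data; sigma is an assumed known population standard deviation
x <- c(4.8, 5.1, 5.0, 4.9, 5.2, 5.3)
mu0 <- 5
sigma <- 0.2
alpha <- 0.05

n <- length(x)

# Test statistic: uses sigma instead of the sample standard deviation
z_stat <- (mean(x) - mu0) / (sigma / sqrt(n))

# Same decision rules as the t-test, with qnorm in place of qt
reject_two_sided <- abs(z_stat) > qnorm(1 - alpha / 2)
reject_right <- z_stat > qnorm(1 - alpha)
reject_left <- z_stat < qnorm(alpha)

# Two-sided p-value, with pnorm in place of pt
p_two_sided <- 2 * pnorm(-abs(z_stat))

print(z_stat)
print(c(two_sided = reject_two_sided,
        right = reject_right,
        left = reject_left))
print(p_two_sided)
```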
