In the previous article I wrote about hypothesis testing with normally distributed data; in this article I'm going to post some quick checks you can run to decide whether it is fairly safe to assume your data is normal. I'd like to highlight that you can never be 100% sure your data follows a particular distribution. However, in order to say something about the phenomenon you are trying to understand, it is often practical to approximate its distribution with the best fit according to the measurements (the samples). At the very least, this is a starting point.

First, I should point out that with a small sample (fewer than about 10 observations) it is really hard to draw any conclusions, but if the sample is large enough (as a broad and very generous rule of thumb, more than 30 observations) then we can look at some indices and plots.

__Skewness:__ First of all, if your data is normally distributed, then it should be approximately symmetric. To assess this characteristic you can use either a boxplot or the skewness index; I suggest using both. Suppose you have two datasets as below:
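The article doesn't include the code that generated the two datasets, so here is a sketch of how two such datasets might be simulated; the names `data_norm` and `data_beta` match those used further down, but the seed, sample sizes, and parameters are my assumptions, not the originals.

```r
# Hypothetical simulation of the two datasets discussed below: one normal
# (approximately symmetric) and one beta (right-skewed). The sizes and
# parameters are guesses, not the author's originals.
set.seed(123)                               # fixed seed for reproducibility
data_norm <- rnorm(200, mean = 10, sd = 2)  # symmetric, bell-shaped sample
data_beta <- rbeta(1000, 2, 8)              # right-skewed sample in (0, 1)
```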

If we plot the boxplots we obtain:
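A minimal sketch of the boxplot calls, assuming the hypothetical simulated datasets from above (regenerated here so the snippet runs on its own):

```r
set.seed(123)                               # same hypothetical data as above
data_norm <- rnorm(200, mean = 10, sd = 2)
data_beta <- rbeta(1000, 2, 8)

par(mfrow = c(1, 2))                        # two plots side by side
boxplot(data_norm, main = "data_norm", col = "lightblue")
boxplot(data_beta, main = "data_beta", col = "lightblue")
```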

Remembering that the thick black line is the median and that the coloured box contains the central 50% of the data, we can already see that the distribution on the left is approximately symmetric, while the one on the right has a long right tail (right skew).

Then, by running the skewness function, we can check the numbers to confirm our guess:
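A skewness function is not part of base R; packages such as moments and e1071 provide one. A minimal base-R sketch, again using the hypothetical simulated data:

```r
set.seed(123)
data_norm <- rnorm(200, mean = 10, sd = 2)
data_beta <- rbeta(1000, 2, 8)

# Sample skewness: third central moment scaled by the cubed standard
# deviation; it is zero for a perfectly symmetric sample.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skewness(data_norm)  # should be close to 0 for symmetric data
skewness(data_beta)  # should be clearly positive (right skew)
```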

Since the skewness of the first dataset is close to zero, it seems reasonable to treat it as symmetric, as a normal distribution should be. Note that the simulated beta dataset has a positive skewness, clearly above 0. A final check on the histograms can help a bit more:
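The histograms can be drawn as follows (data regenerated so the snippet is self-contained):

```r
set.seed(123)
data_norm <- rnorm(200, mean = 10, sd = 2)
data_beta <- rbeta(1000, 2, 8)

par(mfrow = c(1, 2))
hist(data_norm, main = "data_norm", xlab = "")
hist(data_beta, main = "data_beta", xlab = "")
```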

Ok, the histogram may cast some doubt on whether the first dataset is normal, but it certainly helps us rule out normality for the second dataset.

Finally, we compute the kurtosis index, which gives us some information on the shape of the bell curve:
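Like skewness, kurtosis is provided by packages such as moments and e1071; a base-R sketch on the hypothetical data:

```r
set.seed(123)
data_norm <- rnorm(200, mean = 10, sd = 2)
data_beta <- rbeta(1000, 2, 8)

# Sample kurtosis: fourth central moment scaled by the squared variance.
# It is about 3 for a normal sample; note that some packages instead
# report the *excess* kurtosis, i.e. this value minus 3.
kurtosis <- function(x) mean((x - mean(x))^4) / sd(x)^4

kurtosis(data_norm)  # should land near 3
kurtosis(data_beta)
```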

We are now in a fairly safe position to assume that the first dataset is normally distributed, since its kurtosis is close to 3 (the kurtosis of a normal distribution), while the second seems to have a "peak", which is indeed what the histogram tells us. Having so many samples in the second dataset makes us pretty confident in ruling out normality. That said, we could still be wrong: if, for instance, the instruments with which we made the measurements were biased, we could have ended up with biased data.

Edit: While writing this article I forgot a very important plot: the qqplot! The function qqnorm lets you compare the quantiles of your data with those of the normal distribution, giving you another piece of information that could point you in the right direction.

By calling the function qqnorm(data_beta) and qqnorm(data_norm) we obtain the following plots
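A sketch of those calls on the hypothetical data, with qqline added as an optional reference line (the original plots may not include it):

```r
set.seed(123)
data_norm <- rnorm(200, mean = 10, sd = 2)
data_beta <- rbeta(1000, 2, 8)

par(mfrow = c(1, 2))
qqnorm(data_beta, main = "qqnorm(data_beta)")
qqline(data_beta)   # reference line through the first and third quartiles
qqnorm(data_norm, main = "qqnorm(data_norm)")
qqline(data_norm)
```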

Now, ideally, if your data is approximately normally distributed, you would expect the sample quantiles to match the theoretical quantiles, so you should get roughly a 45-degree line. Can you guess which of the two plots is the one generated by qqnorm(data_beta)? ;)

You've left out how to run normality tests.

R has a number of different normality tests, including the Anderson-Darling test, available through add-on packages. There's a package called nortest which contains ad.test and other functions, whose p-values can be used to assess normality.
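As a sketch of what such a test looks like: base R ships shapiro.test (the Shapiro-Wilk test), which is called the same way as nortest's ad.test and likewise returns an object with a p.value component. The data is the hypothetical simulation used throughout.

```r
set.seed(123)
data_norm <- rnorm(200, mean = 10, sd = 2)
data_beta <- rbeta(1000, 2, 8)

# A small p-value is evidence against the null hypothesis that the
# data come from a normal distribution.
shapiro.test(data_norm)$p.value  # typically large for truly normal data
shapiro.test(data_beta)$p.value  # should be tiny: normality rejected
```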

You're right; my aim was to introduce the reader to a rough check of normality through the plots and indices above (such checks appear to be common practice in undergraduate statistics courses). I'm going to write, hopefully soon, about more elaborate tests such as Shapiro-Wilk and the one you mentioned, once I've gathered a solid understanding of how they work and how to use them in R.

Thanks for reading and for suggesting the package.