Tuesday, 29 March 2016

How to fit a copula model in R [heavily revised]. Part 2: fitting the copula

Here I am again with part 2. If you would like to read part 1 of this short tutorial on copulas, please click here.

In this second post I am going to select a copula model, fit it to a test dataset, evaluate the fitting and generate random observations from the fitted multivariate distribution. Furthermore I am going to show how to measure correlation using Spearman's Rho and Kendall's Tau. In order to run the code in this post you need the following packages: copula and VineCopula.

The dataset

For this example I am going to use a test dataset. You can download it from this link. This test dataset contains two variables, x and y, characterized by a strong left tail correlation.

By visually inspecting the plot of x versus y you can easily notice that low values of x tends to be highly correlated with low values of y.

Rplot

The x and y distributions


First of all, let's study the marginals separately. Coming up with an estimate of the marginal distributions for both x and y should not be that hard. In order to do this, a quick check to the histogram of both variables could provide some insights.

Rplot01Rplot02

A good glimpse at the chart of each variable is important in order to get a grasp of what distribution could be a better fit. It looks like a Gamma distribution could do both for x and y. This is just a quick guess, obviously it would be necessary to dig deeper and inspect the data further before making such a claim, however this is not the focus of this post. Moving on pretty fast and assuming the distribution assumption is correct, the only thing left is to estimate the parameters. Let’s estimate the parameters, sample some observations from the fitted distributions and make a visual comparison.

Rplot03Rplot04

You can see in green the observed data and in blue the simulated data for both x and y. That looks like a good match.

Measures of association Kendall tau, Spearman's rho


Now it's time to look at the joint behaviour. For instance, we could measure the association between x and y. When dealing with copulas, two common measures of association are Kendall's Tau and Spearman's Rho. These two are usually better suited for measuring association than linear correlation measures when working with copulas. I am going to use Kendall's Tau

Take note of the output of this code because later, when you will have fit the copula model, you will be able to check that the same correlation structure has been preserved (or captured, if you wish) by the copula.

Selecting the copula using VineCopula package


Since we are dealing with a bivariate case, we can use a scatterplot to get an idea of what the joynt relationship of x and y looks like and try to understand the joint behaviour of the two variables. As you know, the joint behaviour is captured by the copula, therefore by taking a look at it we can try to guess what copula might fit.

Of course a guess is not the optimal way of selecting a model, furthermore the "visual guessing method" is not applicable in case we have more than 3 variables, so we are going to use a useful function provided by the Vinecopula package.

The VineCopula package provides a convenient way of choosing a copula through the BiCopSelect() function. The VineCopula library performs copula selection using BIC and AIC.

Note that BiCopSelect() accepts pseudo observations as parameters. That is, the fitting process is performed on the pseudo observations, ie the observations in the unit square $[0,1]^2$. The pobs() function takes care of the conversion from real observations to pseudo observations. Note that pobs() returns a matrix (not a dataframe!).

It appears that a Clayton copula might be a good choice for our problem. The unique parameter of the Clayton, theta, is estimated to be 1.48.

Fitting process with a given copula


The BiCopSelect function, estimates the copula parameters too. However, in case you already know what copula to use, you could fit it using the fitCopula() function.

It looks like the fitting process was successful and the parameters of the fitted copula are the same estimated with the BiCopSelect() function. That's a good double check, by the way. Note that Kendall's tau is the same as the one calculated from the observed x and y.

Goodness of fit test


Once the copula has been fitted, we can test to check whether the fit is good or not. In order to perform this GOF test, we can use the gofCopula() function as below. Note that this test may be a little slow to run.
For comparison, I decided to run it twice, first with a normal copula and then with a Clayton.

Building the bivariate distribution using the copula

Now that we have successfully chosen and fitted the copula to the data and selected the appropriate marginals, we can try to model the joint relationship using the basic tools I showed in part 1 of this short tutorial.

Rplot05

Rplot06

Rplot07

Now we can sample some observations from the estimated joint distribution and check how they compare to the one observed.

Sample observations from the brand new bivariate distribution

Simply use the tools I explained in part 1 to sample from the new built distribution.

Rplot08

Note that Kendall's tau is still the same for the simulated data. The information about the correlation structure has been captured by the copula regardless of the marginals. Of course this is just a taste of how powerful plain vanilla copulas, such as Archimedean copulas, can be.

At the top of my mind I cannot think of other “easy to use” tools for modelling "weird" relationships such as the one between x and y in my test dataset.

I hope this two parts post was useful and interesting. Please feel free to leave a comment if you have any question or spot any mistake.

16 comments:

  1. This is my first time i visit here. I found so many entertaining stuff in your blog. Keep up the good work.
    Plastic Flow Meter

    ReplyDelete
  2. Replies
    1. Hi,
      I'm trying to apply it for my own data (e.g. discharge vrs. latent heat flux, ....). Using "Vinecopula" it was selected "234 = rotated Tawn type 2 copula (270 degrees)" . But, for fitting with the selected copula, it gives error...!
      It would be great, if you have some time to have a look at my data. This is my email: "mohsen.soltani@kit.edu". please email me, and then I will send you the data + related explanation.
      Thanks.

      Delete
    2. Hi, at the moment I'm working on two R-related projects, I'm afraid I do not really have any spare time. However, if you need to hire an R developer to build a model for you, I should be free after march 2017. Please use the contact form in the "contact" page. Rates will vary upon the project required. If that is not the case, or if you need just a quick help, I suggest you to ask questions on Crossvalidated at https://stats.stackexchange.com/ where there are expert and polite people ready to answer your questions in depth.

      Delete
  3. Mic you have done nice work here, congrats!
    I have a question regarding copulas,
    When we search for a type of correlation between two variables and we have almost fitted a BiCopula like you, it is necessary to continue with building of distribution and after search for " cor(mydata, method = "kendall") " ? Or after fitting of copula it is right to search for correlation?

    ReplyDelete
    Replies
    1. Hi, thanks for your comment! I guess it depends on what you'd like to do, the idea behind the correlation checks in the post is to show that the structure of the data is preserved through the copula.

      Delete
  4. Thank you very much.
    I was looking for a beginner level data and example and I found that!!
    Thanks again!

    ReplyDelete
  5. Hi Mic! Nice post! I still have a question though..I'm also currently working on a project where I have to calculate VaR of different market indices for different periods after fitting several copulas. However, copulas have proven to be quite a challenge for an average master student. My code returns the same VaR for all the periods..so clearly I'm missing/overlooking something..can you tell me where (in your explanation) I should apply VaR?
    Thanks!

    ReplyDelete
    Replies
    1. Hi Maarten! Thanks!Indeed they are quite challenging! I'm a bit rusty when it comes to VaR, however, I believe that having the pdf, the cdf and being able to simulate observations should be enough to start. Having said this, you need to make some assumption about the distribution of the returns/indeces/phoenomena that you are trying to model, choose the copula, fit it, then get the multivariate distribution that in my code I called "my_dist". Once you're there, you can get the density using dMvdc, the cdf using pMvdc and simulate observations using rmvdc (for instance for a Monte Carlo VaR).

      Delete
  6. Hi Mic! It was really useful for me. Thanks for that!
    I've started working with R 2 weeks ago and I am consider if there is Mardia copula also in some package?

    Thanks!
    Ola

    ReplyDelete
    Replies
    1. Hi Ola! Glad to know you found it useful! As far as I know the best copula packages in R should be "copula" and "VineCopula". Both packages feature Archimedean copulas. In the documentation, however, there seems to be no mention about Mardia's. Maybe you can ask the developers if they plan to add that particular copula.

      Delete
    2. Hi Mic, thank you for your quick response. That's great that I have possibility to speak with you! Yes, you're right, I read documentation for "copula" and "VineCopula" many times and there is no mention about Mardia. I've tried to put formula for Mardia copula and then built bivariate distirbution with specified marginals in the fallowing way:
      > df<-read.table("R_popr.txt", header = TRUE)
      > xy<-df[,c("x","y")]
      > m<-pobs(xy$x)
      > mm<-pobs(xy$y)
      fun<-function(u,v) 0.1*u*v+0.4*max(0,u+v-1)+0.5*min(u,v)
      mardia<-asCOP(m,mm,fun)
      mycdf<-mvdc(mardia, margins = c("exp","exp"), paramMargins = list(list(rate=2),list(rate=2))))

      but I got error:
      Error in validObject(.Object) :
      invalid class “mvdc” object: invalid object for slot "copula" in class "mvdc": got class "numeric", should be or extend class "parCopula"
      As we see, the two datasets have a similar structure. This is evidence that the proposed model is appropriate.


      Could you please advise on above?

      Also, I tried to compare empirical copula with the theoretical one, but I encouner the problem. It is easy to use cdfcomp when argument ft is a list of "fitdist" objects - so for example in case of fitting copula without specified marginals, but it is no possible to compare distributions of object mvdc using cdfcomp. Do you know how the comparison of pdf's can be done for mvdc object? The same problem is with gofCopula.

      Sorry that I have so many questions to you :)

      Thank you,
      Ola

      Delete
    3. Hi Ola, it seems that the problem is that in your example you're passing to the copula::mvdc function an object that is not of class "parCopula". The "mardia" argument is not a copula from the copula package but (probably) a numeric vector. More generally, since the "mardia" object seems to have been generated using the function copBasic::asCOP, I think that the packages copBasic and copula may not be compatible with each other. This means that if you build the Mardia copula with the copBasic package functions, it probably won't be accepted as an argument in most of the functions in the copula (and probably vinecopula too) package.

      My suggestion is to clearly define what your goal and steps are, then look for a suitable R package and stick to it. If you have specific questions on your project, you can try to ask on CrossValidated (https://stats.stackexchange.com/). People there are usually very competent in their subjects and nice, although I must say there's not that much on copulas there either.

      As a last suggestion, if you are doing this for work or study, I suggest you to get in contact with some good university professor/PhD that might be willing to help with the project (Masters student in stats might be interested in writing a thesis on an R implementation of the mardia copula and related functions/tests/applications).

      I wish you all the best.
      Mic

      Delete