Friday, 30 January 2015

How to estimate probability density function from sample data with Python

Suppose you have a sample of your data, maybe even a large sample, and you want to draw some conclusions based on its probability density function. Well, assuming the data is normally distributed, a basic thing to do is to estimate mean and standard deviation, since to fit a normal distribution those two are the only parameters you need.

However, sometimes you might not be that happy with the fitting results. That might be because your sample does not looks exactly bell shaped and you are wondering what would happened if the simulation you ran had taken this fact into account.

For instance, take stock returns, we know they are not normally distributed furthermore there is the “fat tails” problem to take into account. At least it would be interesting estimate a probability density function and then compare it to the parametric pdf you used before.

Here below is a simple example of how I estimated the pdf of a random variable using gaussian_kde from scipy

and here is the plot that we get


What we can see is that the estimated pdf is “more dense” around the mean and has some more density on the tails.

Let’s use this data to simulate a sample

For sure in this randomly generated sample there are some extreme values and they look close to the actual sample.


Hope this was interesting and useful.


This article is for educational purpose only. The author is not responsible for any consequence or loss due to inappropriate use. The article may well contain mistakes and errors. The numbers used might not be accurate. You should never use this article for purposes different from the educational one.

Thursday, 29 January 2015

Image scaling

This is a script that has been resting on my desktop for months. Basically I would like to take my knowledge of image recognition with machine learning a step further and this is a simple step towards that direction.

The basic idea behind the script is to convert the image in a 0 and 1 format which is easier to process with machine learning algorithms. After I wrote the script, I found out there is a function in sklearn that does something similar, anyway I thought I could share the piece of code I wrote, perhaps it could be useful to someone.

This the image I started with


By using the following script we can convert the image in a more manageable format

This is the result we obtain, A and B respectively


Only numbers between 0 and 1 are used in the picture A above while the picture below uses only 0s and 1.1

For sure we loose some information, however this is just a raw script, perhaps I can do better in the future. However, this is a rather complicated image (not to mention the fact that it’s the only sample I have got) the script should yield better results with simpler images and shapes.

Portfolio VaR with Python

After some posts on correlation (How to build a correlation matrix in Python) and variance (How to build a variance-covariance matrix in Python) today I’m posting an example of application: portfolio VaR. Please before you continue reading the article, make sure to read and understand the disclaimer at the bottom of the page. Thank you.

Portfolio VaR is the natural extension of the VaR risk indicator to a portfolio of stocks. The implementation is a little bit harder than the one or the two stock version since it involves calculations with matrices. If you are interested to get a first grasp on VaR you can check my first implementation in R here.

In this script I try to estimate parametric VaR of a portfolio of stocks using historical data.


A brief outline of the process I followed:

1)First I downloaded data from Quandl (they are a great source of free data by the way), then I reshaped the data for each stock into a .csv file to make it resemble the following pattern:

ticker,average return, standard deviation,initial price, R1,R2,R3,…,Rn

where Ri stands for Rth return and initial price is the most recent price.
For my analysis I used simple returns calculation, however it should be more appropriate to use logarithmic returns if possible. Next time I’ll probably be using log returns.

2) The script contains 9 functions:

2 functions are the “builder functions” (fillDicts and runSimulation) and they are used to load the data into the program

2 functions (correlationMatrix and varCovarMatrix) are auxiliary functions since they are used in the VaR function to estimate VaR.

A plot function has been added to plot (if needed) the simulation of future returns by the runSimulation function. The model used to simulate future prices is fairly simple and can be found here.

A function (portfolioExpectedReturn) to calculate portfolio expected returns based on historical data

And finally, two functions (simple_optimise_return and optimise_risk_return) to optimise the portfolio for high returns and the risk/return ratio, respectively.

3)Some preliminary notes and some assumptions.
The model is based on parametric VaR, therefore it is assuming that returns are normally distributed like a smooth Bell curve. Furthermore, the VaR is calculated for an holding period of 1 day.

For the sake of simplicity I calculated 99% VaR for each example and ran the optimisation functions for a portfolio of 10 stocks. The optimisation functions operate as follow:

simple_optimise_return yields a portfolio of n stocks with the highest average daily return in the sample while optimise_risk_return yields the 10 stocks with higher average return to standard deviation ratio. The idea behind the second optimisation is that you may want to get the highest return possible for unit of risk that you are bearing. Note that I am assuming that the total invested amount is equally distributed among the 10 stocks (10%  of the total amount is invested in each stock).

While keeping in mind that the past usually does not predict with great accuracy the future, this script let me grasp an insight of what could have (possibly) been a good portfolio of stocks in the last years.

A sidenote: since this is just for fun, some data is actually missing and some stocks have longer time series since they were on the market long before than others.

Another sidenote: a great improvement to this script could be the calculation of the advantages (in terms of less risk) of diversification and optimisation for correlation (maximum and minimum correlation for instance)

Now I run the program

And after some computations here is the result that we get:

The results look interesting to me, especially since the VaR of the last portfolio is really low if compared with the first portfolio I used to test the VaR function and the non-weighted return seems pretty good too.

This article is for educational purpose only. The author is not responsible for any consequence or loss due to inappropriate use. The article may well contain mistakes and errors. The numbers used might not be accurate. You should never use this article for purposes different from the educational one.

Tuesday, 27 January 2015

How to build a variance-covariance matrix in Python

Recently I wrote a script to calculate the VaR of a portfolio of stocks given historical prices and returns and, in order to do that, I had to study the basics of a variance-covariance matrix.

First of all, what is it?
The formal definition should sound something like this: a variance-covariance matrix is a matrix whose element in the i,j position is the covariance between the ith and jth elements of a vector of random variables. Informally, we may say that a variance-covariance matrix is the matrix of the covariances and since the covariance of a random variable with itself is its variance, the main diagonal of the matrix is filled with the variances of the random variables (hence the fancy name).


What is it useful for?
When calculating VaR, say for a single stock, you need to gather the standard deviation of the returns of your stock. However, when calculating the VaR of a portfolio, things get pretty messy pretty quick, since you cannot simply add or subtract variances. In a more intuitive way, we may say that the variance-covariance matrix generalizes the notion of variance to multiple dimensions.

So how can we build it in Python?
Here is a simple template of how I built mine. I used only two stocks, but in the script I talked about earlier I used 500 stocks, you can easily imagine what a mess it can be if you miss some numbers.

Before showing the code, let’s take a quick look at relationships between variance, standard deviation and covariance:

Standard deviation is the square root of the variance


Covariance can be obtained given correlation (check how to build a correlation matrix) and standard deviations.


Now we can look at the script:

And here is the output:

Hope this was interesting.

Monday, 26 January 2015

How to build a correlation matrix in Python

Suppose you have an array of data, stocks returns for instance, and you are wondering whether there is some degree of correlation. If you are using Python then it is easy to plug in your data and do some simple calculations.

Here is a simple example of a correlation matrix using Python. This matrix shows the correlation between 10 variables (10x10 matrix): as you can see, the correlation of a variable with itself is 1 as expected.


By entering more variables into the calculation the matrix becomes more difficult to interpret although it remains quite nice to visualize. Here below is an example of a 50x50 matrix.


The code I used is really simple, you can find it here below. Basically I built a random array with 10/50 data points and then used the numpy function “corrcoef” which handles all the troubles of building a matrix in an elegant way.

Thursday, 8 January 2015

Some plots with matplotlib

I recently updated matplotlib and could finally use the style attribute from the updated module. I thought it might be a good time to plot some data and try it out.

In order to get a grasp on how some emerging markets have performed so far, I thought I could plot some data on the real GDP (GDP adjusted for inflation) and perhaps compare it to a big part of the EU’s economy (Germany).

The data I used is from Quandl a great source of data you should check it out




Note: dotted lines indicates that the data is prospective (in this case obtained through linear regression).

Hope this was interesting