The Beginner Programmer

Saturday, 11 July 2015

How to make a rough check to see if your data is normally distributed

Now then, in the previous article I wrote about hypothesis testing with data that is normally distributed, in this article I’m going to post some quick test you can do to check if it is fairly safe to assume your data is normal. I would like to highlight the fact that you can never be 100% sure that your data follows a particular distribution, however, in order to be able to say something about the phenomenon you are trying to understand, it is often practical to approximate its distribution with the best fit according to the measurements (the samples). At the very least this is a starting point.

First, I should point out that if you have a small sample (< 10 samples) then it is really hard to draw some conclusions, but if the sample is large enough (as a broad and very generous rule of thumb > 30) then we can consider some indices and plots.

Skewness:
First of all, if your data is normally distributed, then it should be approximately symmetric. In order to asses this characteristic you can either use a boxplot or the skewness index g4. However I suggest to use both. Suppose you have two datasets as below:

If we plot the boxplots we obtain:

Remembering that the black thick line is the median and that the coloured part of the boxplot contains 50% of the data, we can already see that the distribution on the left is approximately symmetric, while the one on the right has a right tail (right skew).
Then, by running the skewness function we check the numbers in order to double check our guess

Since the skewness of the first dataset is close to zero it seems reasonable to assume the first dataset is normally distributed. Note that the beta simulated dataset has a positive skewness significantly higher than 0. A final check at the histograms can help a bit more

Ok, the histogram may cast some doubt on whether the first dataset is normal, however it certainly helps us ruling out normality for the second dataset

Finally, we run the kurtosis test, which gives us some information on the shape of the bell curve

We are now in a pretty safe position to assume that the first dataset is normally distributed, since the kurtosis is pretty close to 3, and we can see that the second seems to have a “peak” which is indeed what the histogram tells us. Having so many samples in the second dataset makes us pretty confident in ruling out normality, however it could also not be the case, for instance, for some reason, say the instruments with which we made the measurements were biased, we could have ended up with biased data.

Edit: Last day, while I was writing this article I forgot a very important plot: the qqplot! The function qqnorm lets you compare the quantiles of your data with those of the normal distribution, giving you another bit of information that could point you in the right direction.
By calling the function qqnorm(data_beta) and qqnorm(data_norm) we obtain the following plots

Now, ideally, what you would expect if your data is approximately normally distributed, is that the quantiles of the sample match the theoretical samples, therefore you should get a 45 degree line. Can you guess which one of the two plots is the one generated by qqnorm(data_beta)? ;)

Hypothesis testing on normally distributed data in R

Hypothesis testing is a useful statistical tool that can be used to draw a conclusion about the population from a sample. Say for instance that you are interested in knowing if the average value of a certain parameter differs significantly from a given value within a well defined confidence level: you would probably set up your test like this:

Note that the two hypothesis above are mutually exclusive.

Now what can you do to check which one is more probable?
First, we should check what kind of distribution our data follows. Roughly speaking, if the number of samples is > 30 we can plot a histogram to get a visual grasp of the distribution and then run a few simple function, to assess the skewness and kurtosis of the distribution. If these parameters are close to those of a Normal distribution, then we could assume that the data comes from a Normal distribution. This assumption is not 100% sure but it is reasonable if the above parameters are close to those of a Normal distribution. If you are not sure, the other option is to run some normality tests or gather more data. Read how to make a rough check to see if your data is normally distributed.

Now that we are confident that our data is Normally distributed and hopefully the samples are independent and identically distributed, we can estimates of the sample mean and as far as the variance is concerned:
- If the population variance is known, we can use the Normal distribution in R
- If the population variance is unknown, then we should use a t-distribution with n-1 degrees of freedom (n is the number of observations). As n tends to infinity t-distribution tends to a Normal, therefore if n is sufficiently large (say > 60) you could approximate the t-distribution with a Normal one.

Now, by fixing a certain alpha (confidence level), we decide to reject the null Hypothesis (H0) if:

Where is the sample mean, the standard deviation and z the percentile of the Normal (or Student) distribution.

Again, roughly speaking, the idea behind the test is that we should reject the null Hypothesis if the data shows that it is highly unlikely H0 to be true. For the right tail test, we should reject H0 if the statistics above (without the absolute value operator) is to the right of the red line, in the cyan shaded region:

Aside from this test you can also make tests to check whether the population mean is higher or lower than a certain value using the same process. Here below is the implementation of the test in R assuming that the variance is unknown (and therefore using the t distribution):

In this case we can say that at 5% confidence level we do not reject the null Hypothesis in all three cases.

In case you’d like to run a z-test because you know the population variance, by clicking here you can download the script which is using the normal distribution. Essentially you only have to replace the functions related to the student distribution in the script above with those related to the Normal. Finally, here is the code used to make the plot above.

Wednesday, 20 May 2015

Biot-Savart law: magnetic field of a straight wire

Magnetism and magnetism related phenomena are fascinating almost for everyone, in fact, I remember being a intrigued by the interaction between magnets since I was a little kid. On the other side, Python is great for plotting vector fields, although now that I am using also Matlab I found out most functions in numpy and matlab are similar to those used in Matlab (I don’t know who got inspired by who though).

Here is a small sketch of a vector field. Remember the Biot-Savart law for a straight infinitely long wire, which (assuming the current is flowing upwards) looks something like this:

Below is a 3d representation of the vector field and the Python code used to generate it. The tricky part (if I’m allowed to say so) is that you need to convert the value of the vector field from cylindrical coordinates into cartesian coordinates in order to plot it. Perhaps there’s a way to plot a vector field using also spherical and cylindrical coordinates however I still do not know how to do that.

Hope this was interesting! :)

Sunday, 10 May 2015

Spectrogram representation of the B note from a guitar

A spectrogram is a visual representation of the spectrum of frequencies as they change with respect of another variable (in our case time). Matlab offers a great function to plot the frequencies against time, therefore I decided to try it on the recordings I used in the previous post. Here is the result I got

Now, some comments. This representation of the frequencies tells us that the stronger components of our signal are below 1.5 KHz (1500 Hz) which is what we expected, however it looks like there are no other significant frequencies above 7 KHz and this is weird since in our FT of the signal we found higher frequencies (although with lower magnitude). Notice how the the higher the magnitude in the FT, the longer the frequency appear in the spectrogram.

Now, by using a simple tool I generated another spectrogram which seems to contain also the higher frequencies the one in Matlab did not have.

This spectrogram is better in tune with our FT of the signal, since it shows that our signal has frequencies up to 15 KHz as the FT showed

The program I used to generate the second spectrogram is Spek, it is free, and fast, however it has very few options. Below are some useful links

http://spek.cc/

http://en.wikipedia.org/wiki/Spectrogram

Applying Fourier to a real signal: guitar musical note B

I recorded a sound from the second string of my electric guitar (musical note B) and made a .wave file out of it. By using Matlab I tried to separate the sound of the note from the background noise and made some experiments.

At first I wanted to use Python for doing this, since I’m more knowledgeable about that than Matlab, however it looks like the most common ways to open a .wave file (scipy.audiolab and similar) are not available for Python 3 yet (either that or I could not find anything). Matlab provides simple functions for handling .wave files and great graphing tools but I could not find something to convert bins into frequencies as the function in numpy does, therefore I’m not sure the frequencies in the graph are correct.

Anyway, you can download the recording at this link, and here is the time domain signal representation:

By taking the FFT of the signal and filtering out the frequencies we don’t need, we filter out the noise and/or sounds we don’t need. My filtering rule lets through all the frequencies above 200 and below 1600 Hz. Of course this is a very naive way of filtering, however that’s a start.

You can play around and block out one, two, three peaks a time and hear the difference between the original recording and the edited one. Now, ideally B of the guitar should have a frequency of about 249 Hz (more or less), however it looks like the frequencies between 650 and 1020 Hz in our frequency domain representation are the main components of the note. Perhaps that’s because I messed up the bins. Anyway I’m glad that by applying a so simple filtering rule I’ve been able to remove the pitch sound at the beginning and smoothing out the musical note. Finally I also found out that Matlab has great publishing tools that enable you to print out an HTML file of your code, the results and graphs all together. Here is the code I used:

Calculate Fourier transform of a gaussian using Matlab

While wondering around in the Matlab documentation I found out there is a simple way to calculate the Fourier transform of any function using Matlab. Furthermore by using the function pretty() you can print out the result in a more friendly manner.

Let’s calculate for example the Fourier transform of the function u(x) = e^(-x^2). This calculation could also be done on paper: by applying the definition I wrote in the earlier article one finds out that:

Now, by using some properties of the transform (you can check them at http://en.wikipedia.org/wiki/Fourier_transform), we can play around and plot some functions. Here is the original function and its FT:

Let’s take the FT of e^(-(3x)^2)

It’s worth noting that when the function gets thinner, its FT gets more spread and viceversa. This phenomena is useful to understand the uncertainty principle.

This you tube video explains the link between this mathematical tools and physics.
https://www.youtube.com/watch?v=V7UNvDN_EZo

If you’d like to know more about the FT, check the wikipedia page:
http://en.wikipedia.org/wiki/Fourier_transform

Here is the Matlab code I used

Saturday, 9 May 2015

Filtering an ideal signal with FFT

Hi everyone! Remember that in the last article I wrote that you can use the FFT to clean a signal from background noise? Well here is an example of signal filtering.

We start by generating a signal and then add some random noise using the random number generator in numpy. Here is the noisy signal:

Now, our signal is made up of two main frequencies: 20 and 30 Hz while the rest is mainly background noise. You can check this in the FFT of the signal:

Suppose we’d like to isolate the 20 Hz bit, that means we’d have to set to 0 each value in the Spectrum which is greater or lower than 20 Hz. Let’s block out everything outside 18 and 22 Hz and then apply the inverse FTT. Here is the signal we found (blue) compared to the exact signal we wanted to find (red). At the bottom of the graph there is the error between the two signals, notice that it is not that high, (at worst we missed 0.08, however on average we missed less).

Here is the python code I used to make this.

Have fun!

Friday, 8 May 2015

Fourier Transform: first trials

The Fourier transform is surely one of the most amazing pieces of maths applied to the sciences. For starters it is defined as follows:

The definition is fine provided that you can take the integral of u(x) over R. (Technically you should say that u(x) belongs to L1(R). But aside from these technicalities, how awesome is that definition? Calculating the FT of a function with pen and paper isn’t that straight forward at all, except in some cases. It involves mainly using complex analysis techniques. If you are familiar with calculating “standard integrals” it is rare that you can apply the same techniques here.

But how can we apply this to something? Well it turns out that if you have a signal u(x) in the time domain, its Fourier transform is the signal representation in the frequency domain. Essentially the FT enables you to decompose your time domain signal into all the frequencies that make up that signal. A first simple application is deleting noise from a signal. By checking the signal spectrum (its FT) you can choose which frequencies you need to delete from the signal in order to remove the noise. Then you just apply the inverse FT and get back your time domain signal cleaned up (more or less). Now obviously the math behind this is hairy, however this should not scare you at all, and if you are interested in these subjects I suggest you to check out some resources to get a better grasp of how all of this works.

In order to compute the FT of a signal with Python we need to use the ftt function built in into Scipy. FTT stands for Fast Fourier Transform and it is a brilliant algorithm to speed up the otherwise very long calculation of the FT of a real signal.

The signal I’ve made is a linear combination of the sine and cosine function, pretty straightforward here is its representation in the time domain:

A closer look to our signal:

Now by taking the FFT we get a list of complex numbers. Then we take the absolute value of those numbers and plot them to represent the spectrum:

Eventually, we just need to fix the lower axis and switch from bins to Hz. Since the plot is symmetrical we can take out the right half without losing information. As you can see below, we have three spikes in the frequency domain. Those three peaks corresponds to the three frequencies we set earlier in the signal (85,50,75 Hz).

Hope you enjoyed this, soon I’ll make a post about applying this to a real signal, such as the recording of guitar notes. Below is the code I used to plot the data above. Have fun!

Wednesday, 22 April 2015

Fourier series and square wave approximation

Fourier series is one of the most intriguing series I have met so far in mathematics. Roughly speaking it is a way to represent a periodic function using combinations of sines and cosines. There are many other fascinating topics such as the Laplace and Fourier transforms but I am new to complex analysis and techniques so I’ll go step by step!

There are many reasons why I find Fourier series fascinating but primarily, I like the fact of approximating functions using the sines, cosines and complex exponentials because of the bond between these functions.

Now, what are the ingredients we need to approximate a function with a Fourier series?

1) A periodic bounded function, with period T
2) The function should be integrable over the period

If the above conditions are met, then we can write the function as follows

Note that this is an infinite sum, the more terms you add, better the approximation. As far as the coefficients are concerned, you can calculate them using the formulas below.

For this simple test, I chose the square wave function, which could be interpreted as an on and off signal. The aim of this simple test was to check how good is the approximation. Using only 10 armonics (n=10) this is the result I got:

I would say it is pretty accurate, Python took only a few seconds to calculate it. Furthermore you can use more complex periodic functions and found similar amazing results. Here are some useful resources I used:

http://en.wikipedia.org/wiki/Fourier_series
http://en.wikipedia.org/wiki/Square_wave

If you have time, perhaps you could try plot the sawtooth wave

In the square wave Wikipedia page there are other kind of wave functions, perhaps in the future I’ll try them out too. Below is the Python code I used to plot the square wave.

Enjoy!

Wednesday, 18 March 2015

An implementation of the Tic Tac Toe game in Java

Lately I’ve been really busy, and I’ve not had that much time to spend on programming. I’ve had this implementation of the Tic Tac Toe game in my “to do list” for a very long time but sadly for one reason or another it just kept being delayed. In some sense it is easier to code than the text editor I made here, however it has not been so in practice. In particular, I can’t shake the feeling that the code of this project is redundant and sloppy. I’d be happy to be proved wrong but I guess there’s no point in arguing over such a small project.

This project, aside from getting more familiar with Java syntax and its libraries, helped me in becoming more fluent in building a GUI with Swing. There are some technical details which I’d like to discuss perhaps with some professional Java coder in case I get the chance.to talk to one (indeed networking is important).

The main Idea was to develop a Tic Tac Toe implementation and a battleship game as well, however, due to the lack of spare time I put aside the battleship game project and perhaps I will work on it in the future, although the basics of the underlying code are pretty similar to those used here (so similar that it might actually be a repetition).

Enough talk, let’s cut to the chase! Here are some screenshots, and below you can find the download link for the Java application. As usual, I’ve put into the .zip folder the source code so that you can check it out, work on it, whatever you like

How to play:
Basically you click on one of the three columns (if you click on any button on the selected column that’s also fine) and your mark (“x” for the player) is dropped into the lowest box available. Then the computer makes its move (randomly). Whoever reaches first 3 in a row.wins. The idea of the lowest box is due to the fact that I had another game in mind while developing this, it turned out as a “new” kind of Tic Tac Toe and I thought it was nice to play with. Of course by working on the source code I uploaded along with the runnable jar one could easily remove this and get back to the classic game.

Dropbox download link.

Hope this was interesting.

Saturday, 14 March 2015

Taylor series with Python and Sympy

Here I am again using my beloved Python and doing maths stuff.

Today I’d like to post a short piece of code I made after a review of Taylor series I did. This script lets you input (almost) any function, provided that it can be represented using Sympy and output the Taylor series of that function up to the nth term centred at x0.

Sympy is a great module for basic symbolic mathematics, it works fine and it is really simple to use even if you are new to Python.

Here is the output of the plot function for the function sin(x) approximating up to the 9th term:

Note that in the console output the series is written backwards, however I think it could be fixed.

Here is the code I used

This code can be customized to return Taylor expansions for any function you’d like to use (of course provided it can be represented using Sympy).

Finally, the code used to generate the plot

Monday, 9 March 2015

My first serious Java project

Finally here I am with my first serious Java project: a text editor!

Yeah I know, perhaps the word “serious” is a bit of an overkill for a simple project such as a text editor. However for a Java newbie, as I consider myself to be, this was actually a middle level project. The hard part was using Swing to implement the user interface, although I find Swing to be easier to use than Tkinter (Python) once that you’ve got the basics.

So, what’s this text editor about? Basically it is a notepad (like the one that comes natively with Windows but with less options, and of course it is written in Java. Here is a screenshot:

Now, let’s talk about how to use this.

The file menu:

The file menu has 4 options: open, show current working directory, set current working directory and exit.
The open command lets you search for a .txt file to display in the user-interface.
The show and change current working directory commands are used to show and change, respectively, the directory you are working in.

You can open a file with the commands “open” and “load .txt file” or work on a new one simply by typing into the white box. When you have edited a file in the text editor, you can save it by typing a name into the bottom text field and then clicking the bottom right button. If you do not enter a name, the file will be saved with a default one. Furthermore, once you click on “save .txt data” the program asks you if you want to save the file in the current working directory. If you need to choose another location click “no” and then choose the directory to save the file in. Enter the same name of the file you are working on, to overwrite it.

The edit menu:

The edit menu basically lets you change font size (by 3 pt a click).

You can download the program (the runnable .jar file and the source code) at this address below:

https://www.dropbox.com/s/e4jhp1w9n1w6pp8/java%20text%20editor.zip?dl=0

Hope this was interesting.

Sunday, 8 March 2015

Two simple Java programs

So far I’m still a newbie Java programmer and I am mostly familiar with the basics (the syntax and basic concepts) therefore I thought I could practice my little knowledge by writing two simple programs. In particular, this task helped me to get more familiar with user input, random number generation and arrays. I took the idea from a website, however I lost the bookmark therefore I can not post it.

The first program is quite simple. It asks you to guess a number within 0 and 99. Many improvements could be done to this simple program, perhaps in the future I’ll work on it.

Here it is:

The second program is a little bit more complex, it asks for an input sentence from the user and then it outputs a sentence of the same length using random part of the entered sentence.

Hope this was interesting.

Friday, 6 March 2015

Numerical integration with Python

In this short article I am going to post a simple Python script for numerical integration.

Numerical integration is a part of a family of algorithms for calculating the numerical value of a definite integral. The two simplest method for performing numerical integration are rectangle and trapezoidal rule. If you’d like to know more about these two methods you can check the wikipedia page which has been pretty helpful to me.

Now, the code below works fine if the functions you are using is not decreasing and positive, however:
-If your function is decreasing, the plot need to be adjusted to show the correct graph
-If the function is negative at some point, the code might not work, since it has not been designed to handle it. Perhaps in the future I’ll fix that.

Therefore, bottom line, square root of x is fine, logarithms (after 1),x^2,x^3(from 0) etc…

Here is an approximation with 50 points and the rectangle rule:

and the code output on the console:

Finally, below is the code I used to generate the output above:

Hope this is useful.

Tag di Technorati: maths,python,calculus,algebra,programming,integral,integrals

Volume of solids of revolution: the cone

Have you ever wondered where do all those formulas to calculate the volume of solids like a cone, a cylinder, a sphere ecc… come from? In fact they come from a simple formula and from a clever basic idea. Imagine you have a function f

Intuitively you could approximate the volume by dividing your interval into a number n of small intervals of the same size and sum up the volume of the cylinders of radius f(x) and height , then

if we use an infinite number of steps n then our sum becomes an integral:

Now that we have the intuition, we can go on to the code. I coded an example in R and one in Matlab. The one in Matlab is shorter since it deals with the integration on its own using symbolic computation while in R I decided to code a simple approximation algorithm based on the first formula I posted above, although you could easily calculate the indefinite integral of the function foo and then use it to calculate the exact volume.

Here is the Matlab example, quick, fast and neat:

Note that f(10)=5, thus the radius of our cone is 5. Furthermore, for the sake of simplicity I’ve assumed that the tip is in 0,0. This code outputs:

Here is the R code. Of course you could have made it shorter and more neat but I like coding in R so I made something more out of it. The R program below approximates the volume using the first approach described above.

Finally, you can get a 3D cone as the one at the top of this page just by using the RGL package in R and the demo scripts.

Tag di Technorati: geometry,calculus,R,matlab,programming,volume

Pages

Saturday, 11 July 2015

Wednesday, 20 May 2015

Sunday, 10 May 2015

Saturday, 9 May 2015

Friday, 8 May 2015

Wednesday, 22 April 2015

Wednesday, 18 March 2015

Saturday, 14 March 2015

Monday, 9 March 2015

Sunday, 8 March 2015

Friday, 6 March 2015