The day before yesterday I came across Machine Learning: WOW… I got stuck at my PC for an hour marvelling at the real applications of this great subject.
I was getting really excited! Then, after an inspiring video on YouTube, I decided it was time to act. My fingers were desperate to type some “smart” code, so I decided to write a program that could recognize the language in which a given text is written.
I do not know if this actually counts as a very primitive kind of Machine Learning program (I somehow doubt it), so I apologize to all those who know more about the subject, but let me dream for now.
Remember the article on letter frequency distribution across different languages? Back then I knew it would be useful again (although I did not know for what)! If you would like to check it out or refresh your memory, here it is.
Name of the program: Match text to language
This simple program implements an algorithm that recognizes the language a given text was written in.
The underlying hypotheses of this model are the following:
1. Each language has a characteristic character distribution which differs from the others. The character distributions are generated from randomly chosen Wikipedia pages in each language.
2. Shorter sentences are more likely to contain common words than uncommon ones.
The first step in building a program able to perform such a task was to build a character distribution for each of the supported languages, using the code from the frequency article. Next, given a string (a sentence), the program should be able to guess the language by comparing the character distribution of the sentence with the stored distributions of the languages.
This approach seems to work fine for sentences longer than 400 characters. However, if the sentence is shorter than that, a mismatch might occur. To avoid this, I devised a naive fix based on the second hypothesis: the shorter the sentence, the more likely its words are among the most common. Therefore, for each language, a list of the 50 most common words is loaded and used to double-check the first guess (which is based on character frequency only) whenever the sentence is shorter than a given number of characters (420 in the code below).
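To make the comparison concrete: the program picks the language whose stored character distribution is closest to the one observed in the sentence, using the sum of absolute differences as the distance. Here is a minimal sketch of the idea, with made-up three-character distributions (not the real Wikipedia-based ones):

#Sum of absolute differences between two relative-frequency lists
def distance(observed, reference):
    return sum(abs(o - r) for o, r in zip(observed, reference))

#Made-up example distributions over just three characters
observed = [0.10, 0.02, 0.03]
references = {"english": [0.08, 0.02, 0.03],
              "german":  [0.05, 0.04, 0.02]}
best = min(references, key=lambda lang: distance(observed, references[lang]))
print(best)   #prints the closest language, here "english"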
Note that this version of the program assumes that each language distribution has already been generated and stored in .txt format; the program simply loads it from a folder. You can find and download the distributions here.
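For reference, loadDistribDict below only reads the second line of each file, so I assume each languageDist.txt file (e.g. englishDist.txt) looks something like this, with the characters on the first line and the space-separated relative frequencies on the second (the numbers here are made up):

a b c
0.0817 0.0149 0.0278

With that in mind, here is the full program: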
#Characters used to build a distribution
alphabet = ["a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z",",",";","-"]
#Languages supported
languages = ["english","italian","french","german"]
#A useful dictionary
distribDict = dict()
#The following function takes a string and a list of characters,
#it counts how often each character appears and then
#it outputs a flat list alternating character and frequency
def frequencies(string,letters):
    list_frequencies = []
    for letter in letters:
        freq = 0
        for i in string:
            if i == letter:
                freq += 1
        list_frequencies.append(letter)
        list_frequencies.append(freq)
    return list_frequencies
#This function returns a list containing 2 lists with letters
#and frequencies
def fix_lists_letter(list_1):
    list_letters = []
    list_letters.append(list_1[0])
    list_freq = []
    for i in range(1,len(list_1)):
        if i % 2 == 0:
            list_letters.append(list_1[i])
        else:
            list_freq.append(list_1[i])
    if len(list_letters) != len(list_freq):
        return "Some error occurred"
    else:
        final_list = [list_letters,list_freq]
        return final_list
#This function returns the relative frequencies
def get_rel_freq(list_1):
    total = sum(list_1)
    list_to_ret = []
    for i in list_1:
        list_to_ret.append(i/total)
    return list_to_ret
#This function returns the distribution of the characters
#in a given text by putting together the functions above
def returnDistribution(strings,alphaBet):
    firstC = frequencies(strings,alphaBet)
    finalC = fix_lists_letter(firstC)
    letters = finalC[0]
    frequenc = get_rel_freq(finalC[1])
    distribution = [letters,frequenc]
    nChar = sum(finalC[1])
    #Note: spaces " " are NOT considered as characters
    print("Number of characters used:", nChar, sep=" ")
    return distribution
#This function loads each distribution into the dictionary distribDict
def loadDistribDict():
    try:
        for lang in languages:
            fileToRead = open("C:\\Users\\desktop\\lproject\\"+lang+"Dist.txt","r")
            data = fileToRead.read()
            #The frequencies are stored on the second line of the file,
            #separated by spaces
            dist = data.split("\n")[1].split(" ")
            distList = []
            for number in dist:
                if number == '':
                    number = 0
                distList.append(float(number))
            distribDict[lang] = distList
            fileToRead.close()
            print("Loaded",lang,"character frequency distribution!",sep=" ")
    except Exception as e:
        print(e)
#String to test
stringToCheck = "Hallo diese ist eine schoene Satze auf deutsch"
commonEnglishWords = [" is "," the "," of "," and "," to "," that "," for "," it "," as "," with "," be "," by "," this "," are "," or "," his "," from "," at "," which "," but "," they "," you "," we "," she "," there "," have "," had "," has "," yes "]
commonGermanWords = [" ein "," das "," ist "," der "," ich "," nicht "," es "," und "," Sie "," wir "," zu "," er "," sie "," mir "," ja "," wie "," den "," auf "," mich "," dass "," hier "," wenn "," sind "," eine "," von "," dich "," dir "," noch "," bin "," uns "," kann "," dem "]
commonItalianWords = [" di "," che "," il "," per "," gli "," una "," sono "," ho "," lo "," ha "," le "," ti "," con "," cosa "," come "," ci "," questo "," hai "," sei "," del "," bene "," era "," mio "," solo "," tutto "," della "," mia "," fatto "]
commonFrenchWords = [" avoir "," est "," je "," pas "," et "," aller "," les "," en "," faire "," tout "," que "," pour "," une "," mes "," vouloir "," pouvoir "," nous "," dans "," savoir "," bien "," mon "," au "," avec "," moi "," quoi "," devoir "," oui "," comme "," ils "]
commonWordsDict = {"english":commonEnglishWords,"german":commonGermanWords,"italian":commonItalianWords,"french":commonFrenchWords}
def checkLang(string):
    distToCheck = returnDistribution(string,alphabet)
    distToCheckFreq = distToCheck[1]
    diffDict = dict()
    #For each language we calculate the difference between the
    #observed distribution and the stored one.
    for lang in languages:
        diffList = []
        for i in range(len(alphabet)):
            diff = abs(distToCheckFreq[i]-distribDict[lang][i])
            diffList.append(diff)
        diffDict[lang] = sum(diffList)
    #Check: print the total difference for each language
    for lang in languages:
        print(lang,diffDict[lang])
    langFound = min(diffDict, key=diffDict.get)
    #If the sample sentence is shorter than 420 characters we may
    #have some recognition issues, which are dealt with below
    langChecked = ""
    correct = False
    if len(string) < 420:
        for langKey in commonWordsDict.keys():
            for word in commonWordsDict[langKey]:
                if word in string:
                    langChecked = langKey
                    correct = True
                    break
            if correct:
                break
        if correct:
            print("Lang found: ",langFound)
            print("Lang checked: ",langChecked)
            langFound = langChecked
    #The language found is returned here
    print("\n")
    return langFound
loadDistribDict()
print("\n")
print("Language found by the program: ",checkLang(stringToCheck))
So far the program seems to work on texts of different lengths. Here below are some results:
In the first two examples I used longer sample sentences.
In this last example the sentence was really short, just 37 characters, something like: “Diese ist eine schoene Satze auf Deutsch”. In this case it was hard to draw a distribution that could match the German one. In fact, the program guessed French, which was really far from the right answer indeed. The double-check algorithm then kicked in with the right answer (Lang checked).
Hope this was interesting.