Monday, 28 July 2014

Letter frequency with Python

Let’s say that your favourite subject is languages and comparisons between different languages, or that you enjoy as a hobby decrypting simple codes. Well then, with Python you have found the right tool to use! Occhiolino

Letter frequency, however, is a topic studied in cryptanalysis and has been studied in information theory to save up the size of information to be sent and prevent the loss of data. In fact if the most frequent letter in a language is, say “e”, then it is convenient to use the “least expensive” (in terms of amount of information) way to send that piece of information by reducing the number of bytes sent. For instance, if you were to send binary code, you could use the number 0 to represent “e”.
This is a basic underlying idea in many famous codes. If you would like to get a short introduction to this topic, check this video.

Another example, which uses techniques based on a similar concept is data-compression. Check this great video for a general introduction to data-compression.

Some encryption techniques, such as Caesar cipher and other basic ciphers, can be easily decrypted by spotting the frequency of occurrence of each character and then “guessing” what it should represent by comparing its frequency to the frequency of letters in the language the original message was written in. In fact, this decryption technique can be used for each encryption method which does not uses different symbols to represent the different occurrence of the same character. What do I mean by this? Well, imagine that you need to encrypt this: “bbbb”, now, if you decide to use a Caesar cipher and say, using a shift of 23, your encrypted message will look something like this “yyyy”. Each additional “b” will be converted into a “y” no matter what. This is a soft spot of all those encryption techniques which follows similar schemes.

By using Python, you can easily build a program to run through a long string of text and then calculate the relative frequency of occurrence of each character. Below is the code I used to build this simple program:


Once I built the code, I ran it a couple of times on some wikipedia pages written in English, French, Italian and German, below you can find the results of this process. I should mention that my code missed a lot of characters like è,é,à,ò,ù and the german umlaut. However you can easily add these by simply adding them to the alphabet list. On the y axes is represented the relative frequency of occurrence (in percentage).


Rel_freq_english_18000_chr


Rel_freq_french_20000_chr


Rel_freq_italian


Rel_freq_german_21000_chr


And the same graphs sorted.


Rel_freq_english_18000_chr_sorted


Rel_freq_french_20000_chr_sorted


Rel_freq_italian_sorted_18000 chr


Rel_freq_german_21000_chr_sorted


I do not know if there is a given distribution for each language, I doubt this, however we can clearly see that some letters are much more frequent than others. The letter “e” seems to be pretty common in all the four languages.


Hope this was interesting.

No comments:

Post a Comment