Let’s say that languages, and comparisons between them, are your favourite subject, or that you enjoy decrypting simple codes as a hobby. In that case, Python is the right tool for you!
Letter frequency is a topic studied in cryptanalysis, and it has also been studied in information theory as a way to reduce the size of the information to be sent and to prevent the loss of data. If the most frequent letter in a language is, say, “e”, then it is convenient to send it in the “least expensive” way (in terms of amount of information), so as to reduce the number of bits sent. For instance, if you were sending binary code, you could use the single digit 0 to represent “e”.
This is a basic underlying idea in many famous codes. If you would like to get a short introduction to this topic, check this video.
Another example that uses techniques based on a similar concept is data compression. Check this great video for a general introduction to data compression.
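To make the “short codes for frequent letters” idea concrete, here is a minimal sketch of Huffman coding, one well-known compression scheme built on it. It is my own illustration, not something taken from the videos above; the function name `huffman_code` and the sample text are made up for the example.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Return a dict mapping each character of `text` to a string of bits.

    Hypothetical helper for illustration: frequent characters get short codes.
    """
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single distinct symbol still needs one bit.
        return {next(iter(counts)): "0"}
    # Heap entries are (weight, tie_breaker, {char: code_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(weight, i, {char: ""}) for i, (char, weight) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest groups, prefixing their codes with 0 and 1.
        w1, _, codes1 = heapq.heappop(heap)
        w2, _, codes2 = heapq.heappop(heap)
        merged = {c: "0" + code for c, code in codes1.items()}
        merged.update({c: "1" + code for c, code in codes2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

sample = "eeeeeeeetttaao"  # made-up text in which "e" dominates
codes = huffman_code(sample)
encoded = "".join(codes[c] for c in sample)
for char, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
    print(repr(char), code)
print(len(encoded), "bits, against", 8 * len(sample), "bits at one byte per character")
```

In this sample “e” ends up with a one-bit code, while the rarer letters get longer ones, which is exactly the trade-off described above.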
Some encryption techniques, such as the Caesar cipher and other basic ciphers, can easily be broken by measuring how often each character occurs and then “guessing” what it represents, by comparing its frequency with the frequency of letters in the language the original message was written in. In fact, this technique can be used against any encryption method that does not use different symbols to represent different occurrences of the same character. What do I mean by this? Imagine that you need to encrypt “bbbb”. If you use a Caesar cipher with a shift of 23, your encrypted message will be “yyyy”: each additional “b” is converted into a “y”, no matter what. This is a weak spot of all encryption techniques that follow similar schemes.
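Here is a small sketch of that weakness. It assumes the message is in lower-case English, is long enough for “e” to be its most frequent letter, and contains at least one letter; the function names `caesar` and `guess_shift` and the sample sentence are made up for the example.

```python
from collections import Counter
import string

def caesar(text, shift):
    """Shift every lower-case letter by `shift` positions; leave other characters alone."""
    result = []
    for ch in text:
        if ch in string.ascii_lowercase:
            result.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            result.append(ch)
    return "".join(result)

def guess_shift(ciphertext, most_common_plain="e"):
    """Guess the shift by assuming the most frequent cipher letter stands for "e"."""
    counts = Counter(ch for ch in ciphertext if ch in string.ascii_lowercase)
    most_common_cipher = counts.most_common(1)[0][0]
    return (ord(most_common_cipher) - ord(most_common_plain)) % 26

print(caesar("bbbb", 23))  # every "b" becomes the same letter

# Made-up plaintext in which "e" is clearly the most frequent letter
plaintext = "meet me here where we were seen never reveal the secret"
ciphertext = caesar(plaintext, 23)
shift = guess_shift(ciphertext)
print(ciphertext)
print("guessed shift:", shift)
print(caesar(ciphertext, -shift))
```

On short or unusual messages the most frequent letter may not be “e”, so in practice you would try the few most likely shifts and read the results.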
By using Python, you can easily build a program to run through a long string of text and then calculate the relative frequency of occurrence of each character. Below is the code I used to build this simple program:
# The following code takes a string of text as input and outputs a bar plot of the
# frequencies of occurrence of the letters in the string.
import pylab as pl
import numpy as np

string1 = """ Example string """
alphabet = ["a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z",".",",",";","-","_","+"]

# The following function takes a string and a list of characters, counts how often each character
# appears, and returns a flat list of the form [character, count, character, count, ...].
# Characters that never appear are left out.
def frequencies(string, letters):
    list_frequencies = []
    # Lower-case the text first, since the alphabet only contains lower-case letters
    string = string.lower()
    for letter in letters:
        freq = 0
        for i in string:
            if i == letter:
                freq += 1
        if freq != 0:
            list_frequencies.append(letter)
            list_frequencies.append(freq)
    return list_frequencies

print(frequencies(string1, alphabet))

# This function splits the flat list into 2 lists: one with the letters and one with the frequencies
def fix_lists_letter(list_1):
    list_letters = []
    list_freq = []
    for i in range(len(list_1)):
        if i % 2 == 0:
            list_letters.append(list_1[i])
        else:
            list_freq.append(list_1[i])
    if len(list_letters) != len(list_freq):
        raise ValueError("Some error occurred")
    return [list_letters, list_freq]

first_count = frequencies(string1, alphabet)
final = fix_lists_letter(first_count)
letter_s = final[0]
freq = final[1]
print("Number of characters counted:", sum(freq))

# Enable the following to sort (in descending order)
"""
# The following function sorts the letters and frequencies in descending order.
# Note that it empties the two input lists while it works.
def sort_all(c):
    letters = c[0]
    freq = c[1]
    final_letter = []
    final_freq = []
    for i in range(0, len(letters)):
        maximum = max(freq)
        ind = freq.index(maximum)
        final_freq.append(freq[ind])
        final_letter.append(letters[ind])
        letters.remove(letters[ind])
        freq.remove(freq[ind])
    to_return = [final_letter, final_freq]
    return to_return

the_very_final = sort_all(final)
letter_s = the_very_final[0]
freq = the_very_final[1]"""

# Relative frequencies: each count divided by the total number of counted characters
def get_rel_freq(list_1):
    total = sum(list_1)
    list_to_ret = []
    for i in list_1:
        list_to_ret.append(i / total)
    return list_to_ret

freq = get_rel_freq(freq)

fig = pl.figure()
ax = pl.subplot(111)
width = 0.8
# Centre each bar on its integer position so that the tick labels line up with the bars
ax.bar(range(len(letter_s)), freq, width=width, align="center")
ax.set_xticks(np.arange(len(letter_s)))
ax.set_xticklabels(letter_s, rotation=45)
pl.show()
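As a side note, the counting, sorting and relative-frequency steps above can be written more compactly with `collections.Counter` from the standard library. This is only a sketch of an alternative, not the code I used for the plots; the function name `relative_frequencies` is made up for the example.

```python
from collections import Counter

def relative_frequencies(text, letters):
    """Return (character, relative frequency) pairs, most frequent first.

    Hypothetical helper: only characters listed in `letters` are counted,
    and the text is lower-cased before counting.
    """
    allowed = set(letters)
    counts = Counter(ch for ch in text.lower() if ch in allowed)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(ch, n / total) for ch, n in counts.most_common()]

for ch, rel in relative_frequencies("An example string, nothing more.", "abcdefghijklmnopqrstuvwxyz.,"):
    print(ch, round(rel, 3))
```

The two halves of the result can be passed to the bar plot in place of `letter_s` and `freq`.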
Once I had written the code, I ran it on a few Wikipedia pages written in English, French, Italian and German; below you can find the results. I should mention that my code missed a lot of characters, such as è, é, à, ò, ù and the German umlauts. However, you can easily include these by adding them to the alphabet list. The y axis shows the relative frequency of occurrence.
And the same graphs sorted.
I do not know whether each language has a characteristic distribution, and I doubt it; however, we can clearly see that some letters are much more frequent than others. The letter “e” seems to be pretty common in all four languages.
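For the accented characters my alphabet list missed, one alternative to extending the list by hand is to count every character that Python considers a letter. Below is a minimal sketch of that idea; the function name `unicode_letter_counts` is made up, and the NFC normalisation step is my assumption about how to treat text that stores accents as separate combining marks.

```python
import unicodedata
from collections import Counter

def unicode_letter_counts(text):
    """Count every Unicode letter in `text`, ignoring case.

    Hypothetical helper: NFC normalisation merges a base letter and a
    combining accent into one character, so both spellings of "è" match.
    """
    normalized = unicodedata.normalize("NFC", text).lower()
    return Counter(ch for ch in normalized if ch.isalpha())

sample = "Perché è così? Für Elise."  # made-up sample mixing Italian and German
for ch, n in unicode_letter_counts(sample).most_common(5):
    print(ch, n)
```

Unlike the alphabet list, this version skips punctuation such as “.” and “,”, so the totals are not directly comparable with the plots above.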
Hope this was interesting.