The day before yesterday I came across Machine Learning: WOW… I got stuck at my PC for an hour marvelling at the real applications of this great subject.
I was getting really excited! Then, after an inspiring video on YouTube, I decided it was time to act. My fingers were desperate to type some “smart” code, so I decided to write a program that could recognize the language in which a given text is written.
I do not know if this actually counts as a very primitive kind of Machine Learning program (I somehow doubt it), so I apologize to all those who know more about the subject, but let me dream for now.
Remember the article on letter frequency distribution across different languages? Back then I knew it would be useful again (although I did not know for what)! If you would like to check it out or refresh your memory, here it is.
Name of the program: Match text to language
This simple program implements an algorithm that recognizes the language a given text was written in.
The underlying hypotheses of this model are the following:
1. Each language has a characteristic character distribution which differs from the others. The character distributions are generated from randomly chosen Wikipedia pages in each language.
2. Shorter sentences are more likely to contain common words than uncommon ones.
The first step in building a program able to perform such a task was to build a character distribution for each of the supported languages, using the code from the frequency article. Next, given a string (a sentence), the program should be able to guess the language by comparing the character distribution of the sentence with the stored distributions of the languages.
This approach seems to work fine for sentences longer than 400 characters. However, if the sentence is shorter than that, a mismatch might occur. To avoid this, I devised a naive fix based on the second hypothesis: the shorter the sentence, the more likely its words are among the most common. Therefore, for each language, a list of the 50 most common words is loaded and used to double-check the first guess (which is based on character frequency only) whenever the sentence is shorter than a given number of characters (420 in the code below).
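To make the comparison concrete: the program picks the language whose stored character distribution is closest to the one observed in the sentence, using the sum of absolute differences as the distance. Here is a minimal sketch of the idea, with made-up three-character distributions (not the real Wikipedia-based ones):

#Sum of absolute differences between two relative-frequency lists
def distance(observed, reference):
    return sum(abs(o - r) for o, r in zip(observed, reference))

#Made-up example distributions over just three characters
observed = [0.10, 0.02, 0.03]
references = {"english": [0.08, 0.02, 0.03],
              "german":  [0.05, 0.04, 0.02]}
best = min(references, key=lambda lang: distance(observed, references[lang]))
print(best)   #prints the closest language, here "english"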
Note that this version of the program assumes that each language distribution has already been generated and stored in .txt format; the program simply loads it from a folder. You can find and download the distributions here.
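For reference, loadDistribDict below only reads the second line of each file, so I assume each languageDist.txt file (e.g. englishDist.txt) looks something like this, with the characters on the first line and the space-separated relative frequencies on the second (the numbers here are made up):

a b c
0.0817 0.0149 0.0278

With that in mind, here is the full program: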
#Characters used to build a distribution
alphabet = ["a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z",",",";","-"]
#Languages supported
languages = ["english","italian","french","german"]
#A useful dictionary
distribDict = dict()
#The following function takes a string and a list of characters,
#it counts how often each character appears and then
#it outputs a flat list alternating character and frequency
def frequencies(string,letters):
    list_frequencies = []
    for letter in letters:
        freq = 0
        for i in string:
            if i == letter:
                freq += 1
        list_frequencies.append(letter)
        list_frequencies.append(freq)
    return list_frequencies
#This function returns a list containing 2 lists with letters
#and frequencies
def fix_lists_letter(list_1):
    list_letters = []
    list_letters.append(list_1[0])
    list_freq = []
    for i in range(1,len(list_1)):
        if i % 2 == 0:
            list_letters.append(list_1[i])
        else:
            list_freq.append(list_1[i])
    if len(list_letters) != len(list_freq):
        return "Some error occurred"
    else:
        final_list = [list_letters,list_freq]
        return final_list
#This function returns the relative frequencies
def get_rel_freq(list_1):
    total = sum(list_1)
    list_to_ret = []
    for i in list_1:
        list_to_ret.append(i/total)
    return list_to_ret
#This function returns the distribution of the characters
#in a given text by putting together the functions above
def returnDistribution(strings,alphaBet):
    firstC = frequencies(strings,alphaBet)
    finalC = fix_lists_letter(firstC)
    letters = finalC[0]
    frequenc = get_rel_freq(finalC[1])
    distribution = [letters,frequenc]
    nChar = sum(finalC[1])
    #Note: spaces " " are NOT considered as characters
    print("Number of characters used:", nChar, sep=" ")
    return distribution
#This function loads each distribution into the dictionary distribDict
def loadDistribDict():
    try:
        for lang in languages:
            fileToRead = open("C:\\Users\\desktop\\lproject\\"+lang+"Dist.txt","r")
            data = fileToRead.read()
            #The frequencies are stored on the second line of the file,
            #separated by spaces
            dist = data.split("\n")[1].split(" ")
            distList = []
            for number in dist:
                if number == '':
                    number = 0
                distList.append(float(number))
            distribDict[lang] = distList
            fileToRead.close()
            print("Loaded",lang,"character frequency distribution!",sep=" ")
    except Exception as e:
        print(e)
#String to test
stringToCheck = "Hallo diese ist eine schoene Satze auf deutsch"
commonEnglishWords = [" is "," the "," of "," and "," to "," that "," for "," it "," as "," with "," be "," by "," this "," are "," or "," his "," from "," at "," which "," but "," they "," you "," we "," she "," there "," have "," had "," has "," yes "]
commonGermanWords = [" ein "," das "," ist "," der "," ich "," nicht "," es "," und "," Sie "," wir "," zu "," er "," sie "," mir "," ja "," wie "," den "," auf "," mich "," dass "," hier "," wenn "," sind "," eine "," von "," dich "," dir "," noch "," bin "," uns "," kann "," dem "]
commonItalianWords = [" di "," che "," il "," per "," gli "," una "," sono "," ho "," lo "," ha "," le "," ti "," con "," cosa "," come "," ci "," questo "," hai "," sei "," del "," bene "," era "," mio "," solo "," tutto "," della "," mia "," fatto "]
commonFrenchWords = [" avoir "," est "," je "," pas "," et "," aller "," les "," en "," faire "," tout "," que "," pour "," une "," mes "," vouloir "," pouvoir "," nous "," dans "," savoir "," bien "," mon "," au "," avec "," moi "," quoi "," devoir "," oui "," comme "," ils "]
commonWordsDict = {"english":commonEnglishWords,"german":commonGermanWords,"italian":commonItalianWords,"french":commonFrenchWords}
def checkLang(string):
    distToCheck = returnDistribution(string,alphabet)
    distToCheckFreq = distToCheck[1]
    diffDict = dict()
    #For each language we calculate the difference between the
    #observed distribution and the stored one.
    for lang in languages:
        diffList = []
        for i in range(len(alphabet)):
            diff = abs(distToCheckFreq[i]-distribDict[lang][i])
            diffList.append(diff)
        diffDict[lang] = sum(diffList)
    #Check: print the total difference for each language
    for lang in languages:
        print(lang,diffDict[lang])
    langFound = min(diffDict, key=diffDict.get)
    #If the sample sentence is shorter than 420 characters we may
    #have some recognition issues, which are dealt with below
    langChecked = ""
    correct = False
    if len(string) < 420:
        for langKey in commonWordsDict.keys():
            for word in commonWordsDict[langKey]:
                if word in string:
                    langChecked = langKey
                    correct = True
                    break
            if correct:
                break
        if correct:
            print("Lang found: ",langFound)
            print("Lang checked: ",langChecked)
            langFound = langChecked
    #The language found is returned here
    print("\n")
    return langFound
loadDistribDict()
print("\n")
print("Language found by the program: ",checkLang(stringToCheck))
So far the program seems to work on texts of different lengths. Here below are some results:
In the first two examples I used longer sample sentences.
In this last example the sentence was really short, just 37 characters, something like: “Diese ist eine schoene Satze auf Deutsch”. In this case it was hard to draw a distribution that could match the German one. In fact, the program guessed French, which was really far from the right answer indeed. The double-check algorithm then kicked in with the right answer (Lang checked).
Hope this was interesting.