
edit distance, levenshtein, python


2013-11-25 Edit distance and weights

I have often had to implement an edit distance and, most of the time, I chose uniform weights: every modification costs 1. That's the easy solution. However, I was looking for a quick way to give a different weight to every comparison between two characters.
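
For reference, here is a minimal sketch of that uniform version (the classic Levenshtein dynamic programming where every insertion, deletion or substitution costs 1); the function edit_distance is my own illustration, not part of the script that follows.

def edit_distance(a, b):
    # classic dynamic programming with uniform weights:
    # every insertion, deletion or substitution costs 1
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1): d[i][0] = i
    for j in range(len(b) + 1): d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(a)][len(b)]

print(edit_distance("kitten", "sitting"))  # 3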

So I went to the Project Gutenberg site and downloaded a few texts. I then did the following: group the words by length, compare every pair of words of the same length, and, whenever two words differ by exactly one letter, count the pair of characters which makes the difference.

I assumed it would give me something reasonable.

#coding:latin-1
# downloaded from: http://www.gutenberg.org/
#with open("jeunes_filles_en_fleur.txt", "r", encoding="utf8") as f: text = f.read().lower()
#with open("hugo_les_miserables.txt", "r", encoding="utf8") as f: text = f.read().lower()
with open("dorian_gray.txt", "r", encoding="utf8") as f: text = f.read().lower()
import re
lettre = re.compile("[a-zéèêàôûùâîç]+")   # lowercase letters, including French accented ones
mots = lettre.findall(text)
print("nb char", len(text), "nb words", len(mots))

# we group words by length
long = {}
for m in mots:
    l = len(m)
    if l not in long: long[l] = {}
    long[l][m] = long[l].get(m, 0) + 1
    
# Hamming distance between two words of the same length
def diff(a, b):
    m = [0 if x == y else 1 for x, y in zip(a, b)]
    return sum(m)

# returns (1, x, y) where (x, y) is the differing pair of characters
# (meaningful when the two words differ by exactly one letter)
def diffl(a, b):
    m = [(0 if x == y else 1, x, y) for x, y in zip(a, b)]
    m.sort()
    return m[-1]
    
# for every pair of words of the same length which differ by exactly
# one character, we count the pair of characters which makes the difference
subst = {}

for l in sorted(long):
    print("length", l, "nb words", len(long[l]))
    for a in long[l]:
        for b in long[l]:
            d = diff(a, b)
            if d == 1:
                df = list(diffl(a, b)[1:])
                df.sort()
                x, y = df
                if (x, y) not in subst: subst[x, y] = {}
                subst[x, y][l] = subst[x, y].get(l, 0) + 0.5 * (long[l][a] + long[l][b])
         
# we sum the counts, weighting each one by the word length
cost = {}
total = 0
for pair, hist in subst.items():
    cost[pair] = sum([k * v for k, v in hist.items()])
    total += cost[pair]
    
# we normalize
for k in cost:
    cost[k] /= total
    
# we display, largest weights first
result = [(v, k) for k, v in cost.items()]
result.sort(reverse=True)
for c, pair in result:
    print("%1.6f %s" % (c, pair))

Below, you will find the first pairs for an English text:

0.044317 ('s', 't')
0.028266 ('a', 'i')
0.026223 ('n', 't')
0.022779 ('a', 'e')
0.022020 ('t', 'w')
0.020432 ('h', 'o')
0.019302 ('n', 's')
0.017563 ('f', 'n')
0.016893 ('i', 'o')
0.016174 ('h', 'w')
0.016131 ('i', 'u')
0.015305 ('d', 't')
0.014790 ('h', 'i')
0.014502 ('d', 'n')
0.013989 ('i', 'n')
0.013774 ('m', 's')
0.013522 ('d', 's')

And for a French text:

0.030172 ('s', 't')
0.023355 ('a', 'e')
0.022077 ('e', 'u')
0.018951 ('e', 'i')
0.018902 ('i', 'n')
0.018756 ('n', 's')
0.018534 ('e', 'o')
0.017561 ('d', 'l')
0.017274 ('d', 's')
0.015066 ('l', 's')
0.014506 ('e', 'é')
0.013556 ('a', 'o')
0.012882 ('n', 't')
0.012515 ('l', 'm')
0.012362 ('c', 'd')
0.012040 ('d', 'n')
0.011987 ('d', 'm')

The same method could easily be extended to insertions and deletions. We can also argue about how to turn the probability p we just built into a weight: 1-p or 1/(1-p). Since we only kept pairs of words that differ by a single letter, taken from texts with no spelling mistakes at all, each substitution we counted turns one valid word into another valid word, so making it is a significant change. It should therefore get a weight greater than one, one being the default weight in the uniform case.
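
As an illustration, here is one possible way to plug those probabilities into a weighted edit distance, using the 1/(1-p) option. This is only a sketch: the helper names substitution_cost and weighted_edit_distance are mine, and it assumes the dictionary cost computed by the script above.

# a possible way to reuse the dictionary cost computed above (sketch):
# frequent confusions get a substitution weight 1/(1-p) > 1,
# every other operation keeps the default weight 1
def substitution_cost(x, y, cost):
    if x == y: return 0.0
    p = cost.get((min(x, y), max(x, y)), 0.0)
    return 1.0 / (1.0 - p)

def weighted_edit_distance(a, b, cost):
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1): d[i][0] = float(i)
    for j in range(len(b) + 1): d[0][j] = float(j)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1.0,    # deletion
                          d[i][j - 1] + 1.0,    # insertion
                          d[i - 1][j - 1] + substitution_cost(a[i - 1], b[j - 1], cost))
    return d[len(a)][len(b)]

print(weighted_edit_distance("night", "might", cost))

With the numbers above, substituting 's' for 't' in an English word would then cost about 1/(1-0.044) ≈ 1.05 instead of the default 1.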

I could also have computed a distance between the images of the two characters in each pair, but that is more complex. It would, however, be closer to an intuitive definition of a mistake (such as confusing a letter with and without its accent).



Xavier Dupré