XD blog

blog page

levenstein


2013-12-02 Distance d'édition et programmation dynamique

La distance d'édition (ou distance de Levenstein) est un algorithme très connu qui sert à comparer deux mots, deux chaînes de caractères et, plus généralement, deux séquences. La distance est définie comme le nombre d'opérations minimum qui permet de passer du premier mot au second sachant qu'il n'existe que trois opérations possibles :

Par exemple, dans le cas des deux mots : idstzance et distances, cela donne :

Dans l'exemple qui précède, ces trois opérations ont un coût identique. Mais il pourrait tout-à-fait dépendre du caractère inséré ou supprimé ou de la comparaison entre les deux caractères. La distance est alors la somme de ces coûts.

Trouver le nombre minimal d'opérations est un problème classique qu'on peut résoudre à l'aide de la programmation dynamique. Cela veut dire que le problème vérifie la propriété suivante : Toute solution optimale s'appuie elle-même sur des sous-problèmes résolus localement de façon optimale (Wikipedia). Cela veut souvent dire qu'il existe une façon de trouver la solution du problème par récurrence. Ou alors, si on trouve une façon de couper le problème en deux, la solution optimale sera la combinaison des deux solutions optimales sur chacune des deux parties.


more...

2013-11-25 Edit distance and weights

I often had to implement an edit distance and I most of time chose to use uniform weights : every modification costs 1. That's the easy solution. However, I was looking for a quick way to give different weights to every comparison between twwo characters.

So I went to the site Gutenberg and downloaded a few texts. I then did the following:

I assumed it would give me something reasonable.


more...

2013-01-30 Graph matching and alignment

Pipelines became a quite popular way to represent a workflow. Many tools such as RapidMiner, Orange or Weka use graphs to represent the sequence of processes data follow. Most of the time, between two experiments, we just copy paste a previous one, we add or delete a few nodes. After a while, it becomes uneasy to find out what modifications were made. I was wondering how it would be possible to automate such a task. How to find what modifications were introduced in a graph ?

Graph 1 Graph 2 Two graphs we would like to merge into a single one. Vertices 00 and 11 represents the only root and the only leave.


more...

2013-01-25 How to normalize an edit distance ?

An edit distance is usually used to compute a distance between two words. The basic version gives the same weight to every mistake or operation which means every comparison, every insertion or deletion between two words weight the same. The sum of all these errors gives a integer score and most of time, it is better to keep that score within an interval such as [0,1].

Most of the time, we divide the edit distance by the length of the longest word involved in that distance. This score is not a distance anymore in a sense it does not verify the triangular inequality. There are many option and the following document studies some aspects of normalizations. Based on that, the maximum length is not the best option, the minimum length is not very good either. A third option studied in that document seems to be the best, especially if this distance is used by a machine learned model as a feature.


Xavier Dupré