{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 1A.algo - distance de Jaccard (dictionnaires) - correction\n", "\n", "De la distance de [Jaccard](https://fr.wikipedia.org/wiki/Indice_et_distance_de_Jaccard) \u00e0 la distance de [Levenshtein](https://fr.wikipedia.org/wiki/Distance_de_Levenshtein)."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["Plan\n", "
run previous cell, wait for 2 seconds
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercice 1 : Constuire l'ensemble des lettres supprim\u00e9es et ajout\u00e9es\n", "\n", "Deux mots n'ont pas forc\u00e9ment la m\u00eame longueur, il est donc difficile de les comparer lettre \u00e0 lettre. De m\u00eame, on ne doit pas tenir compte de l'ordre des lettres dans chaque mot. Cette contrainte peut para\u00eetre plus difficile \u00e0 mettre en oeuvre. Pourtant, si on construit un r\u00e9sultat interm\u00e9diaire : le d\u00e9compte de chaque lettre dans un mot."]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/plain": ["({'e': 2, 'l': 1, 'r': 1, 't': 2}, {'e': 2, 'r': 1, 't': 1})"]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["def compte_lettre(mot):\n", " d = {}\n", " for c in mot:\n", " d[c] = d.get(c,0) + 1\n", " return d\n", "\n", "compte_lettre(\"lettre\"), compte_lettre(\"etre\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Une lettre est ajout\u00e9e ou supprim\u00e9e entre deux mots si son d\u00e9compe est diff\u00e9rent dans les deux dictionnaires. On s'en sert pour contruire la liste des lettres supprim\u00e9es et ajout\u00e9es. Cela revient \u00e0 construire la diff\u00e9rence entre deux dictionnaires."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/plain": ["({'l': -1, 't': -1}, {'s': 1})"]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["mot1 = \"lettre\"\n", "mot2 = \"etres\"\n", "\n", "d1 = compte_lettre(mot1)\n", "d2 = compte_lettre(mot2)\n", "\n", "suppression = {}\n", "for l in d1:\n", " c1 = d1[l]\n", " c2 = d2.get(l, 0) # la lettre l n'appartient pas forc\u00e9ment au second mot\n", " if c2 != c1:\n", " suppression[l] = c2 - c1\n", " \n", "ajout = {}\n", "for l in d2:\n", " if l not in d1:\n", " c1 = 0\n", " c2 = d2[l]\n", " if c2 != c1:\n", " ajout[l] = c2 - c1 \n", " else:\n", " # on a d\u00e9j\u00e0 compt\u00e9 les lettres pr\u00e9sentes dans les deux mots\n", " # lors de la premi\u00e8re boucle\n", " pass\n", " \n", "suppression, ajout "]}, {"cell_type": "code", "execution_count": 4, "metadata": {"collapsed": true}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercice 2 : \u00e9crire une fonction qui calcule la distance de Jaccard\n", "\n", "On copie et on colle le code pr\u00e9c\u00e9dent pour cr\u00e9er la distance."]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/plain": ["(2, 3)"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["def jaccard(mot1, mot2): \n", " d1 = compte_lettre(mot1)\n", " d2 = compte_lettre(mot2)\n", "\n", " suppression = {}\n", " for l in d1:\n", " c1 = d1[l]\n", " c2 = d2.get(l, 0) # la lettre l n'appartient pas forc\u00e9ment au second mot\n", " if c2 != c1:\n", " suppression[l] = c2 - c1\n", "\n", " ajout = {}\n", " for l in d2:\n", " if l not in d1:\n", " c1 = 0\n", " c2 = d2[l]\n", " if c2 != c1:\n", " ajout[l] = c2 - c1 \n", " else:\n", " # on a d\u00e9j\u00e0 compt\u00e9 les lettres pr\u00e9sentes dans les deux mots\n", " # lors de la premi\u00e8re boucle\n", " pass\n", " \n", " dist = sum(abs(x) for x in suppression.values()) + sum(abs(x) for x in ajout.values())\n", " return dist\n", "\n", "jaccard(\"lettre\", \"etre\"), jaccard(\"lettre\", \"etres\")"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/plain": ["0"]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["jaccard(\"marie\", \"aimer\")"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/plain": ["3"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["jaccard(\"h\u00f4pital\", \"hospital\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercice 3 : calculer la matrice des distances\n", "\n", "Tout d'abord, on d\u00e9coupe un texte en mot. On remplace les tirets en espaces."]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/plain": ["['\u00e0',\n", " 'combien',\n", " 'sont',\n", " 'ces',\n", " 'six',\n", " 'saucissons',\n", " 'ci,',\n", " 'ces',\n", " 'six',\n", " 'saucisson',\n", " 'ci',\n", " 'sont',\n", " '\u00e0',\n", " 'six',\n", " 'sous']"]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["texte = \"\u00e0 combien sont ces six saucissons-ci, ces six saucisson-ci sont \u00e0 six sous\".replace(\"-\", \" \").split()\n", "texte"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### liste de listes"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"data": {"text/plain": ["[[0, 8, 5, 4, 4, 11, 4, 4, 4, 10, 3, 5, 0, 4, 5],\n", " [8, 0, 7, 6, 8, 9, 6, 6, 8, 8, 5, 7, 8, 8, 9],\n", " [5, 7, 0, 5, 5, 8, 7, 5, 5, 7, 6, 0, 5, 5, 4],\n", " [4, 6, 5, 0, 4, 9, 4, 0, 4, 8, 3, 5, 4, 4, 5],\n", " [4, 8, 5, 4, 0, 9, 4, 4, 0, 8, 3, 5, 4, 0, 5],\n", " [11, 9, 8, 9, 9, 0, 9, 9, 9, 1, 8, 8, 11, 9, 6],\n", " [4, 6, 7, 4, 4, 9, 0, 4, 4, 8, 1, 7, 4, 4, 7],\n", " [4, 6, 5, 0, 4, 9, 4, 0, 4, 8, 3, 5, 4, 4, 5],\n", " [4, 8, 5, 4, 0, 9, 4, 4, 0, 8, 3, 5, 4, 0, 5],\n", " [10, 8, 7, 8, 8, 1, 8, 8, 8, 0, 7, 7, 10, 8, 5],\n", " [3, 5, 6, 3, 3, 8, 1, 3, 3, 7, 0, 6, 3, 3, 6],\n", " [5, 7, 0, 5, 5, 8, 7, 5, 5, 7, 6, 0, 5, 5, 4],\n", " [0, 8, 5, 4, 4, 11, 4, 4, 4, 10, 3, 5, 0, 4, 5],\n", " [4, 8, 5, 4, 0, 9, 4, 4, 0, 8, 3, 5, 4, 0, 5],\n", " [5, 9, 4, 5, 5, 6, 7, 5, 5, 5, 6, 4, 5, 5, 0]]"]}, "execution_count": 10, "metadata": {}, "output_type": "execute_result"}], "source": ["distl = [ [ jaccard(texte[i], texte[j]) for i in range(len(texte))] for j in range(len(texte))]\n", "distl"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### dictionnaire"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"data": {"text/plain": ["{('ces', 'ces'): 0,\n", " ('ces', 'ci'): 3,\n", " ('ces', 'ci,'): 4,\n", " ('ces', 'combien'): 6,\n", " ('ces', 'saucisson'): 8,\n", " ('ces', 'saucissons'): 9,\n", " ('ces', 'six'): 4,\n", " ('ces', 'sont'): 5,\n", " ('ces', 'sous'): 5,\n", " ('ces', '\u00e0'): 4,\n", " ('ci', 'ces'): 3,\n", " ('ci', 'ci'): 0,\n", " ('ci', 'ci,'): 1,\n", " ('ci', 'combien'): 5,\n", " ('ci', 'saucisson'): 7,\n", " ('ci', 'saucissons'): 8,\n", " ('ci', 'six'): 3,\n", " ('ci', 'sont'): 6,\n", " ('ci', 'sous'): 6,\n", " ('ci', '\u00e0'): 3,\n", " ('ci,', 'ces'): 4,\n", " ('ci,', 'ci'): 1,\n", " ('ci,', 'ci,'): 0,\n", " ('ci,', 'combien'): 6,\n", " ('ci,', 'saucisson'): 8,\n", " ('ci,', 'saucissons'): 9,\n", " ('ci,', 'six'): 4,\n", " ('ci,', 'sont'): 7,\n", " ('ci,', 'sous'): 7,\n", " ('ci,', '\u00e0'): 4,\n", " ('combien', 'ces'): 6,\n", " ('combien', 'ci'): 5,\n", " ('combien', 'ci,'): 6,\n", " ('combien', 'combien'): 0,\n", " ('combien', 'saucisson'): 8,\n", " ('combien', 'saucissons'): 9,\n", " ('combien', 'six'): 8,\n", " ('combien', 'sont'): 7,\n", " ('combien', 'sous'): 9,\n", " ('combien', '\u00e0'): 8,\n", " ('saucisson', 'ces'): 8,\n", " ('saucisson', 'ci'): 7,\n", " ('saucisson', 'ci,'): 8,\n", " ('saucisson', 'combien'): 8,\n", " ('saucisson', 'saucisson'): 0,\n", " ('saucisson', 'saucissons'): 1,\n", " ('saucisson', 'six'): 8,\n", " ('saucisson', 'sont'): 7,\n", " ('saucisson', 'sous'): 5,\n", " ('saucisson', '\u00e0'): 10,\n", " ('saucissons', 'ces'): 9,\n", " ('saucissons', 'ci'): 8,\n", " ('saucissons', 'ci,'): 9,\n", " ('saucissons', 'combien'): 9,\n", " ('saucissons', 'saucisson'): 1,\n", " ('saucissons', 'saucissons'): 0,\n", " ('saucissons', 'six'): 9,\n", " ('saucissons', 'sont'): 8,\n", " ('saucissons', 'sous'): 6,\n", " ('saucissons', '\u00e0'): 11,\n", " ('six', 'ces'): 4,\n", " ('six', 'ci'): 3,\n", " ('six', 'ci,'): 4,\n", " ('six', 'combien'): 8,\n", " ('six', 'saucisson'): 8,\n", " ('six', 'saucissons'): 9,\n", " ('six', 'six'): 0,\n", " ('six', 'sont'): 5,\n", " ('six', 'sous'): 5,\n", " ('six', '\u00e0'): 4,\n", " ('sont', 'ces'): 5,\n", " ('sont', 'ci'): 6,\n", " ('sont', 'ci,'): 7,\n", " ('sont', 'combien'): 7,\n", " ('sont', 'saucisson'): 7,\n", " ('sont', 'saucissons'): 8,\n", " ('sont', 'six'): 5,\n", " ('sont', 'sont'): 0,\n", " ('sont', 'sous'): 4,\n", " ('sont', '\u00e0'): 5,\n", " ('sous', 'ces'): 5,\n", " ('sous', 'ci'): 6,\n", " ('sous', 'ci,'): 7,\n", " ('sous', 'combien'): 9,\n", " ('sous', 'saucisson'): 5,\n", " ('sous', 'saucissons'): 6,\n", " ('sous', 'six'): 5,\n", " ('sous', 'sont'): 4,\n", " ('sous', 'sous'): 0,\n", " ('sous', '\u00e0'): 5,\n", " ('\u00e0', 'ces'): 4,\n", " ('\u00e0', 'ci'): 3,\n", " ('\u00e0', 'ci,'): 4,\n", " ('\u00e0', 'combien'): 8,\n", " ('\u00e0', 'saucisson'): 10,\n", " ('\u00e0', 'saucissons'): 11,\n", " ('\u00e0', 'six'): 4,\n", " ('\u00e0', 'sont'): 5,\n", " ('\u00e0', 'sous'): 5,\n", " ('\u00e0', '\u00e0'): 0}"]}, "execution_count": 11, "metadata": {}, "output_type": "execute_result"}], "source": ["distd = { (w,y): jaccard(w,y) for w in texte for y in texte}\n", "distd"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On compare le nombre de coefficients :"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"text/plain": ["225"]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}], "source": ["sum(len(_) for _ in distl)"]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/plain": ["100"]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["len(distd)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Le dictionnaire para\u00eet une meilleure option en terme de nombre de coefficients. Mais en terme d'espace m\u00e9moire occup\u00e9, le dictionnaire est largement plus cons\u00e9quent."]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"data": {"text/plain": ["(192, 6240)"]}, "execution_count": 14, "metadata": {}, "output_type": "execute_result"}], "source": ["import sys\n", "sys.getsizeof(distl), sys.getsizeof(distd)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Pour ce petit exemple, la matrice para\u00eet une bonne option mais lorsque les mots sont r\u00e9p\u00e9t\u00e9s un grand nombre de fois :"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"text/plain": ["(360000, 100)"]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["texte_grand = texte * 40\n", "\n", "distl = [ [ jaccard(texte_grand[i], texte_grand[j]) for i in range(len(texte_grand))] for j in range(len(texte_grand))]\n", "distd = { (w,y): jaccard(w,y) for w in texte_grand for y in texte_grand}\n", "sum(len(_) for _ in distl), len(distd)"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/plain": ["(5496, 6240)"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["sys.getsizeof(distl), sys.getsizeof(distd)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Le dictionnaire permet \u00e9galement de v\u00e9rifier ais\u00e9ment qu'un calcul a d\u00e9j\u00e0 \u00e9t\u00e9 effectu\u00e9 afin de gagner du temps et de ne pas le refaire."]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"data": {"text/plain": ["100"]}, "execution_count": 17, "metadata": {}, "output_type": "execute_result"}], "source": ["distd = {}\n", "for w in texte_grand:\n", " for y in texte_grand:\n", " if (w,y) not in distd:\n", " distd[w,y] = jaccard(w,y) \n", " distd[y,w] = distd[w,y]\n", "len(distd) "]}, {"cell_type": "code", "execution_count": 17, "metadata": {"collapsed": true}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1"}}, "nbformat": 4, "nbformat_minor": 2}