{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 2A.algo - Algorithmes de streaming : g\u00e9n\u00e9ralit\u00e9s\n", "\n", "Les streams (flux) de donn\u00e9es sont aujourd'hui pr\u00e9sents dans de nombreux domaines (r\u00e9seaux sociaux, e-commerce, logs de connexion Internet, etc.). L'analyse rapide et pertinente de ces flux est motiv\u00e9e par l'immensit\u00e9 des donn\u00e9es qui ne peuvent souvent pas \u00eatre stock\u00e9s (du moins facilement) et dont le traitement serait trop lourd (penser au calcul de l'\u00e2ge moyen des 1,86 milliards utilisateurs de Facebook pour s'en convaincre). Ce notebook s'int\u00e9resse particuli\u00e8rement \u00e0 l'algorithme [BJKST](http://info.prelert.com/blog/hashing-and-approximate-distinct-value-counts)."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
run previous cell, wait for 2 seconds
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "code", "execution_count": 2, "metadata": {"collapsed": true}, "outputs": [], "source": ["%matplotlib inline"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Introduction\n", "\n", "Plus formellement consid\u00e9rons un univers $U$ de taille $n$ (un nombre tr\u00e8s grand) qui ne peut \u00eatre stock\u00e9 en m\u00e9moire et une s\u00e9quence $S = (s_1, s_2, \\ldots, s_m, \\ldots)$ d'\u00e9l\u00e9ments de $U$. Un algorithme de streaming $\\mathcal{A}$ prend le stream $S$ en entr\u00e9e et renvoie une fonction $f(S)$ (souvent \u00e0 valeurs r\u00e9elles).\n", "\n", "Notons que l'algorithme $\\mathcal{A}$ est souvent contraint d'acc\u00e9der s\u00e9quentiellement aux \u00e9l\u00e9ments de $S$ et / ou ne peut les parcourir qu'un nombre fini de fois.\n", "\n", "Un bon algorithme de streaming doit satisfaire plusieurs contraintes:\n", "\n", "- Il doit \u00eatre un bon estimateur de la vraie valeur que prendrait $f^*(U)$ sur l'univers (plus de d\u00e9tails dans un instant)\n", "- Il doit pouvoir s'actualiser rapidement (en temps lin\u00e9aire ou moins) \u00e0 mesure que le flux $S$ \u00e9volue\n", "- Il doit \u00eatre peu gourmand en m\u00e9moire\n", "\n", "Etant donn\u00e9 une pr\u00e9cision $\\epsilon > 0$ et une tol\u00e9rance $\\delta > 0$, l'algorithme $\\mathcal{A}$ do\u00eet satisfaire:\n", "\n", "$\\mathbb{P}(\\lvert \\frac{f^*(U) - f(S)}{f(S)} \\rvert \\leq \\epsilon) \\geq 1 - \\delta.$\n", "\n", "Quelques exemples fr\u00e9quents d'algorithmes de streaming:\n", "\n", "- Estimation de la valeur moyenne, de la m\u00e9diane\n", "- Estimation du nombre d'\u00e9l\u00e9ments distincts\n", "- Estimation de la fr\u00e9quences des \u00e9lements\n", "- Estimation de la densit\u00e9 de probabilit\u00e9"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Estimer le nombre d'\u00e9l\u00e9ments distincts: l'algorithme BJKST\n", "\n", "L'algorithme [BJKST](https://pdfs.semanticscholar.org/e349/7952347101a3535434bc35d378224cf87bcc.pdf) permet d'estimer le nombre d'\u00e9l\u00e9ments distincts d'un stream $S$. Son fonctionnement est assez simple et repose sur la notion d'universal hashing que nous pr\u00e9sentons ci-bas."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Universal hashing\n", "\n", "L'id\u00e9e derri\u00e8re les fonctions de hachage est de faire correspondre des \u00e9lements d'un ensemble dont la taille est variable (ou bien n'est pas connue) vers un ensemble de taille fixe. Le principe de l'universal hashing est de s\u00e9lectionner al\u00e9atoirement une fonction $h$ dans une famille de fonctions de hachage $\\mathcal{H}$ et de garantir un faible probabilit\u00e9 du nombre de collisions de hachage. \n", "\n", "Formellement si l'on note $[n]$ l'ensemble $\\{1, \\ldots, n\\}$, une famille de fonctions $\\mathcal{H}$ est dite universelle si toute fonction $h: U \\mapsto [n]$ choisie uniform\u00e9ment dans la famille $\\mathcal{H}$ v\u00e9rifie $\\mathbb{P}(h(x) = h(y)) \\leq \\frac{1}{n}$ pour tout couple $x,y \\in U$ distincts.\n", "\n", "Nous consid\u00e9rons ici la famille $\\mathcal{H} = \\{h_{a,b}\\}$ o\u00f9 $h_{a,b}(x) = ((ax + b) \\space{} \\mathrm{mod} \\space{} p) \\space{} \\mathrm{mod} \\space{} n$, $x$ est un entier, $a \\in \\{1, \\ldots, p - 1\\}$, $b \\in \\{0, \\ldots, p - 1 \\}$ et $p$ un nombre premier $\\geq n$. On peut sans trop de difficult\u00e9 se convaincre que $h_{a,b}(x)$ est uniform\u00e9ment distribu\u00e9 sur $[n]$ et que cette famille est universelle (voir [Universal hashing](https://en.wikipedia.org/wiki/Universal_hashing) pour plus de d\u00e9tails)."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Collisions\n", "\n", "V\u00e9rifions num\u00e9riquement le nombre de collisions. Consid\u00e9rons un univers $U$ large prenons par exemple $U = [n]$ avec $n$ grand."]}, {"cell_type": "code", "execution_count": 3, "metadata": {"collapsed": true}, "outputs": [], "source": ["n = 10**4\n", "U = range(n)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Choisissons $p$ un nombre premier arbitrairement grand et une petite valeur de hashage."]}, {"cell_type": "code", "execution_count": 4, "metadata": {"collapsed": true}, "outputs": [], "source": ["p = 4294967291\n", "m = 10"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Choisissons une fonction $h$ uniform\u00e9ment dans la famille $\\mathcal{H}$"]}, {"cell_type": "code", "execution_count": 5, "metadata": {"collapsed": true}, "outputs": [], "source": ["import random\n", "a = random.randint(1, p)\n", "b = random.randint(0, p)\n", "def h(x):\n", " return ((a*x + b) % p) % m"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Tirons des couples $(x,y)$ distincts dans $U$"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Nombre de couples distincts = 500\n"]}], "source": ["couples = set()\n", "for i in range(500):\n", " x, y = random.sample(U, 2)\n", " couples.add((x, y))\n", "print('Nombre de couples distincts = {}'.format(len(couples)))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Pour chaque couple, calculons les valeurs de hashage et comptons le nombre de collisions."]}, {"cell_type": "code", "execution_count": 7, "metadata": {"collapsed": true}, "outputs": [], "source": ["c = 0\n", "for x, y, in couples:\n", " if (h(x) == h(y)):\n", " c += 1"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Le nombre de collisions rapport\u00e9 au nombre de couples $(x,y)$ distincts nous donne une estimation de la probabilit\u00e9 de collision."]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Probabilit\u00e9 de collision = 7.40%\n"]}], "source": ["p_collisions = c / len(couples)\n", "print('Probabilit\u00e9 de collision = {:.2f}%'.format(p_collisions * 100.0))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Cette valeur est proche de la valeur th\u00e9orique $\\frac{1}{m}$. Effectuons plusieurs tirages pour confirmer ce r\u00e9sultat."]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Probabilit\u00e9 de collision moyenne = 9.92%\n"]}], "source": ["import numpy\n", "collisions = []\n", "# on reitere 100 fois\n", "for _ in range(100):\n", " a = random.randint(1, p)\n", " b = random.randint(0, p)\n", " \n", " def h(x):\n", " return ((a*x + b) % p) % m\n", " \n", " couples = set()\n", " for i in range(500):\n", " x, y = random.sample(U, 2)\n", " couples.add((x, y))\n", "\n", " c = 0\n", " for x, y, in couples:\n", " if (h(x) == h(y)):\n", " c += 1\n", " collisions.append(c / len(couples))\n", "p_collision = numpy.mean(collisions)\n", "print('Probabilit\u00e9 de collision moyenne = {:.2f}%'.format(p_collision * 100.0))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Cette probabilit\u00e9 moyenne est proche de la valeur th\u00e9orique. R\u00e9it\u00e9rons pour d'autres valeurs $m$."]}, {"cell_type": "code", "execution_count": 10, "metadata": {"collapsed": true}, "outputs": [], "source": ["sizes = [10, 25, 50, 100, 250, 500, 750, 1000]\n", "p_collision = []\n", "p = 4294967291\n", "\n", "for m in sizes: \n", " collisions = []\n", " # on reitere 100 fois\n", " for _ in range(100):\n", " a = random.randint(1, p)\n", " b = random.randint(0, p)\n", "\n", " def h(x):\n", " return ((a*x + b) % p) % m\n", "\n", " couples = set()\n", " for i in range(500):\n", " x, y = random.sample(U, 2)\n", " couples.add((x, y))\n", "\n", " c = 0\n", " for x, y, in couples:\n", " if (h(x) == h(y)):\n", " c += 1\n", " collisions.append(c / len(couples))\n", " p_collision.append(numpy.mean(collisions))"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"text/plain": [""]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}, {"data": {"image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAEWCAYAAACKSkfIAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XucXWV97/HPd++55g4hhISAARPFBFBoBG9tKSgXL8Rj\n8SXUKghHtKcebaVH4bRHEdsqHo9oK/VWUMQqIFqbUmqiorVaBYKlYBIiIdwCJExCIDeSycz+nT/W\nM8maPXvP7EkmM5m1v+/Xa16zLs9a61lr7f171nqetdejiMDMzJpDaawzYGZmo8dB38ysiTjom5k1\nEQd9M7Mm4qBvZtZEHPTNzJqIg/4QJL1d0rIRWtfXJP3lSKxrf0l6RNJr0/CVkr6Rho+WtE1SeYjl\nR+y4jAVJr5b0YNrXN4/idr8o6f+MwnZOk7RuH5cdtc9p/nhU5zn/GR2B7YSkefuw3EWSfjYSeaha\n74jt23AVLuing/l8+jKvTx/gSQ0uOzd9OFr6pkXEP0TEmQcuxweXiHgsIiZFRO8Q6cb7cbkK+Hza\n1+8diA3UChgR8d6I+PiB2N7BYLjBrOjH42BUuKCfvCkiJgEvA04Crhjj/NjB5wXAirHOhNloK2rQ\nByAi1gNLyYI/AJLeIOk/JW2R9LikK3OL/DT9fzbdKbyy+mpN0qsk3S3pufT/VfW2L+kkSb+StFXS\nzUBH1fw3SrpX0rOS/kPSibl5H5b0RFp2taQz6myjU9L/k/RoytPPJHWmeedKWpHW/xNJLxnqmFXf\n7aT9X5vy8bCkt+emN3Rc0rY/LunnaT3LJB2W5nVI+oakTSmfd0uaWSdvsyV9R1JXysv7c/OulHSL\npK+nbayQtKjOeh4CjgX+OZ3n9rTuJZKekbRG0rsbXbekoyR9N+Vrk6TPp2P9ReCVaRvPprT9qk4k\nvTtt75m0/dm5eSHpvcqqoZ6VdK0k1dmnzrTuzZJWAi9v9NgNRtIhkm5Ly21Ow3PqpL0RODp3XD+U\npn9b2V33c5J+KmlhbpmGqpIklSRdLumhdIxvkXToIOn/l6SnJD0p6eKqee2SPi3pMUkblFUxdQ6x\n/U+n/X9Y0jm56e+StCp9LtZKek9u3mHpeD2bzu+/S8rH3JdJui8dl5sl9YsPB0xEFOoPeAR4bRqe\nA9wPfC43/zTgBLIC70RgA/DmNG8uEEBLLv1FwM/S8KHAZuAdQAtwQRqfXiMfbcCjwJ8CrcB5wG7g\nL9P8k4CngVOBMnBhyns78GLgcWB2Ll8vrLO/1wI/AY5M63lVWseLgO3A69L2PwSsAdpqHKcrgW9U\nHwNgIrAFeHGaNwtYONzjkvL3UMpTZxr/ZJr3HuCfgQkp/78FTKmxnyXgHuAj6dgeC6wFzsrtw07g\n9Wk9nwB+2cjnJI3/FPg7soL5ZUAXcPpQ607j/wVck45XB/Ca6mOU287Xcp+B04GNwMnpnP0t8NNc\n2gBuA6aRBdMu4Ow6+/NJ4N/TuTgK+DWwrpFjV2Nd+TxOB34/nZ/JwLeB7zV6XNO0i9Oy7cBngXvr\nbOu0vjzX+Ix+APgl2Xe6HfgS8K06eTib7Ht9fDon30zHcl6afw2wJB2ryWSfv0/UWddFZN/bd6dz\n/UfAk4DS/DcALwQE/C6wAzg5zfsEWcHfmv5+O7fcI8BdwOyUj1XAe0clRo7GRkbzLx3MbcDWdKJ/\nBEwbJP1ngWvS8FwGD/rvAO6qWv4XwEU11vs7+Q9HmvYfuQ/4F4CPVy2zOn1w5pEVCK8FWgfJewl4\nHnhpjXn/B7ilKu0TwGk1vlBXUj/oP0v2pe+s8WVo6LiQBfm/yM37H8D30/DF6bicOMR5PRV4rGra\nFcBXc/vww9y8BcDzQ3xO+vb/KKAXmJyb/wnga0OtG3glWTBuqbGNPccoN+1ruc/AdcCncvMmkQWY\nuWk8SAVIGr8FuLzO/qwlVyAAl7I36A967Gqsa08ea8x7GbC5keNaZ/60tF9TaxyP06gf9FcBZ+Tm\nzUrHqtZxv550UZHGX5S2OY8sOG8ndxGVzuHDdfJ7EbAmNz4hreuIOum/B3wgDV8F/BOpsKlxnP4w\nN/4p4IuDfQdG6q+o1TtvjojJZB+i44DD+mZIOlXSj9Pt6nPAe/PzhzCb7Oo971Gyq+xaaZ+IdEZz\nafu8ALgs3fo9m27/jyK7ul8D/AlZsHla0k352/6cw8iuLB8aKq8RUSG7e6iV15oiYjvwNrJj9JSk\nf5F03FDbSqqPy/rc8A6yAAdwI1kV3E3pVvxTklprbOMFwOyq4/W/gXxVUPU2OpRrlB/EbOCZiNg6\njPz3rfso4NGI6GlgO7W2mz9H24BNQ2y33kMJs8nOb5/qz9pQx64mSRMkfUlZ9eEWsjuiaRri6a7c\n8mVJn0zVMlvIgh00/p3L78M/5vK/iqygrrUPgx2LGWSB+57cur6fptez5xxExI40OAlA0jmSfpmq\nb54luxvs27f/S3Z3vSxV/Vxeb70Mfm5HVFGDPgAR8W9kVxKfzk3+Jtmt3VERMZXs9quvnnSoV44+\nSfbhyzua7Aq62lPAkVV1sEfnhh8H/ioipuX+JkTEt1LevxkRr0nbC+DqGtvYSFbt8MKh8prycVSd\nvNYVEUsj4nVkV1YPAF8ZaltJveNSvf7dEfGxiFhAVjX1RuCdNZI+TnY1lj9ekyPi9cPZnzqeBA6V\nNHm4+U/5OrpO4TKsz5OkiWTVKcM6R8lTZOe3T/VnbV+P3WVk1Y2nRsQUsjtY2PudqVa9z38ALCa7\na51Kdic52PL1PA6cU7UPHRFR77tX71hsJLs7Xphbz9TIHvwYFkntwHfI4svMiJgG3E7at4jYGhGX\nRcSxwLnAB1WnbW40FTroJ58FXifppWl8MtlV3U5Jp5B9KPt0ARWyOs9abgdeJOkPJLVIehvZrf5t\nNdL+AugB3i+pVdJbgFNy878CvDfdeUjSRGWNzJMlvVjS6elDtZPsQ1qp3kC6er8e+ExqqCsra3xu\nJ6sKeIOkM9KV82XALrKqlIZImilpcQpGu8iqzQbkY5jHpXobvyfphHTluIXslr3WNu4Ctipr4O5M\n+3q8pJfXSDssEfE42XH5hLKG5ROBS4BvNLD4XWRB5pPpHHZIenWatwGYI6mtzrLfAt4l6WXpnP01\ncGdEPLIPu3ELcIWyhtc5wP+syuO+HrvJZJ+/Z5U1nH50iPQb6P/9mUz22dlEdoX91w3uT7UvAn8l\n6QUAkmZIWlwn7S3ARZIWSJqQz3P6znwFuEbS4WldR0o6ax/y1EbWvtAF9Chr4N3zGLOyBzXmpQuu\n58juTGp9tkdV4YN+RHQBXydrxIKsPvkqSVvTtFtyaXcAfwX8PN36vaJqXZvIrkQvI/sQfwh4Y0Rs\nrLHdbuAtZHWCz5BVk3w3N385WePQ58kaPdektJB9kD5JdlWyHjic+o+d/hlZY/XdaTtXA6WIWA38\nIVnj4EbgTWSPsnbXO1Y1lIAPkl2RPkPW3vBHNfa14eNSwxHArWQBfxXwb2RVPtXb6E3beBnwcNqn\nvye7ehwJF5BdhT4J/CPw0Yj44VALpXy9iay++DFgHdm5BriD7LHQ9ZJqfUZ+SNb28h2yguOFwPn7\nmP+PkVVjPAwsI3cM9/PYfZas8X0jWUPq94dI/wngL9L358/IvnuPkt29rEzr2BefI7tDX5a+u78k\na6sYICL+NeX7DrLv1R1VST6cpv8yVTn9kOxuZlhSdeD7yWLIZrILyCW5JPPTureRXQT+XUT8eLjb\nGWl9LclmZtYECn+lb2Zmeznom5k1EQd9M7Mm4qBvZtZEGvnhyqg67LDDYu7cuWOdDTOzceWee+7Z\nGBGD/cgMOAiD/ty5c1m+fPlYZ8PMbFyRVP2r+JpcvWNm1kQc9M3MmkhDQV/S2cre6b6mxkuDkPQ7\nyt4b3yPpvKp5Fyp7H/iDki4cqYybmdnwDRn00ztRrgXOIXufygWSFlQle4zsFQLfrFq2710dp5K9\nd+ajkg7Z/2ybmdm+aORK/xSy90mvTe9tuYnsrXl7RMQjEXEfA18mdBbwg4h4JiI2Az8g6+DAzMzG\nQCNB/0j6v5t6HY2/k72hZSVdKmm5pOVdXV0NrtrMzIbroGjIjYgvR8SiiFg0Y8aQj5mamdk+aiTo\nP0H/Dgnm0HgnD/uz7LA89dzzfGbZatZ2bTsQqzczK4RGgv7dwHxJx6TOIM6n/zujB7MUODN17HAI\nWQcDS/ctq4PbuLWbv7ljDQ91bT8QqzczK4Qhg37q+/N9ZMF6FVln2yskXSXpXABJL5e0Dngr8CVJ\nK9KyzwAfJys47gauStNGXGdbtivP7+49EKs3MyuEhl7DEBG3k3WJl5/2kdzw3WRVN7WWvZ6sS78D\nqqM166d5p4O+mVldB0VD7kjodNA3MxtScYJ+Wxb0n+920Dczq6cwQb+jJQV9X+mbmdVVmKBfKom2\nlpKDvpnZIAoT9CGr19/p6h0zs7oKF/R9pW9mVl+xgn5bmed3V7/zzczM+hQq6He0lv3IppnZIAoV\n9DtbSw76ZmaDKFTQ72gt+zl9M7NBFCrouyHXzGxwhQr6HW0O+mZmgylU0Pdz+mZmgytc0PeVvplZ\nfcUK+m1ldvo5fTOzugoV9DvSlX5EjHVWzMwOSgUL+tnu7Orx1b6ZWS2FCvp9Han4WX0zs9qKGfTd\nmGtmVlOxgn6bg76Z2WAKFfQ7XL1jZjaoQgX9vuqdXT0O+mZmtRQr6O/pHN1P75iZ1VKooO/O0c3M\nBleooN/Zlu2Og76ZWW2FCvp9Dbl+6ZqZWW2FCvp+Tt/MbHDFCvp+Tt/MbFCFCvp9DbnuJ9fMrLZC\nBf1SSbS3lHylb2ZWR6GCPqR36rsh18yspoaCvqSzJa2WtEbS5TXmt0u6Oc2/U9LcNL1V0g2S7pe0\nStIVI5v9gTpa3HuWmVk9QwZ9SWXgWuAcYAFwgaQFVckuATZHxDzgGuDqNP2tQHtEnAD8FvCevgLh\nQOlsK/O8e88yM6upkSv9U4A1EbE2IrqBm4DFVWkWAzek4VuBMyQJCGCipBagE+gGtoxIzuvoaC37\nhWtmZnU0EvSPBB7Pja9L02qmiYge4DlgOlkBsB14CngM+HREPFO9AUmXSlouaXlXV9ewdyKvs7Xk\np3fMzOo40A25pwC9wGzgGOAyScdWJ4qIL0fEoohYNGPGjP3aYNY5uoO+mVktjQT9J4CjcuNz0rSa\naVJVzlRgE/AHwPcjYndEPA38HFi0v5keTGerG3LNzOppJOjfDcyXdIykNuB8YElVmiXAhWn4POCO\niAiyKp3TASRNBF4BPDASGa+nw0HfzKyuIYN+qqN/H7AUWAXcEhErJF0l6dyU7DpguqQ1wAeBvsc6\nrwUmSVpBVnh8NSLuG+mdyOto9XP6Zmb1tDSSKCJuB26vmvaR3PBOssczq5fbVmv6geTqHTOz+gr5\ni1wHfTOz2goX9Dtay+zcXSFrUjAzs7zCBf29naP7V7lmZtUKGPRTl4luzDUzG6B4Qd8dqZiZ1VW4\noN/hLhPNzOoqbtB39Y6Z2QCFC/p9Dbl+/46Z2UDFC/qu0zczq6t4QX/Plb4f2TQzq1a4oO+GXDOz\n+goX9Puqd/zSNTOzgQoX9Dta0o+zfKVvZjZA4YK+G3LNzOorXNDvaPFz+mZm9RQu6JdKor3FnaOb\nmdVSuKAP7hzdzKyeYgZ9955lZlZTgYO+f5xlZlatkEG/o7XshlwzsxoKGvTdkGtmVkshg747Rzcz\nq62YQd/VO2ZmNRUy6He0ltnZ46BvZlatkEG/s7XsF66ZmdVQzKDvOn0zs5qKGfT94ywzs5oKGfTb\nW8vs3F2hUomxzoqZ2UGlkEG/r8vEXT3+Va6ZWV5Bg747UjEzq6WYQb+vy0QHfTOzfgoZ9N05uplZ\nbQ0FfUlnS1otaY2ky2vMb5d0c5p/p6S5uXknSvqFpBWS7pfUMXLZr62vTt+/yjUz62/IoC+pDFwL\nnAMsAC6QtKAq2SXA5oiYB1wDXJ2WbQG+Abw3IhYCpwG7Ryz3dbh6x8ystkau9E8B1kTE2ojoBm4C\nFlelWQzckIZvBc6QJOBM4L6I+C+AiNgUEQc8Ert6x8ystkaC/pHA47nxdWlazTQR0QM8B0wHXgSE\npKWSfiXpQ7U2IOlSScslLe/q6hruPgzg6h0zs9oOdENuC/Aa4O3p/3+TdEZ1ooj4ckQsiohFM2bM\n2O+N+krfzKy2RoL+E8BRufE5aVrNNKkefyqwieyu4KcRsTEidgC3Ayfvb6aH0lenv8tdJpqZ9dNI\n0L8bmC/pGEltwPnAkqo0S4AL0/B5wB0REcBS4ARJE1Jh8LvAypHJen2dvtI3M6upZagEEdEj6X1k\nAbwMXB8RKyRdBSyPiCXAdcCNktYAz5AVDETEZkmfISs4Arg9Iv7lAO3LHg76Zma1DRn0ASLidrKq\nmfy0j+SGdwJvrbPsN8ge2xw17S3pNQxuyDUz66eQv8gtleTO0c3Maihk0IfsCR5X75iZ9VfYoO/O\n0c3MBip00N/p9+mbmfVT2KDf4St9M7MBChv0O9vKbsg1M6tS3KDvhlwzswEKG/RdvWNmNlCBg76f\n0zczq1bYoO/qHTOzgYob9N2Qa2Y2QHGDvq/0zcwGKGzQ72gts3N3hUolxjorZmYHjcIG/T0dqfhX\nuWZmexQ36Pud+mZmAxQ26He0pnfqO+ibme1R4KCfrvT9Ay0zsz0KG/T7qnf82KaZ2V7FDfptDvpm\nZtWKG/TdkGtmNkBhg77r9M3MBips0O+r3vGVvpnZXoUN+h1uyDUzG6CwQb/T1TtmZgMUP+jv9msY\nzMz6FDbot7dku+bqHTOzvQob9EslufcsM7MqhQ364Hfqm5lVK37Qd0OumdkehQ76HW2+0jczyyt2\n0G9xP7lmZnkNBX1JZ0taLWmNpMtrzG+XdHOaf6ekuVXzj5a0TdKfjUy2G9PpK30zs36GDPqSysC1\nwDnAAuACSQuqkl0CbI6IecA1wNVV8z8D/Ov+Z3d4OlM/uWZmlmnkSv8UYE1ErI2IbuAmYHFVmsXA\nDWn4VuAMSQKQ9GbgYWDFyGS5cR1uyDUz66eRoH8k8HhufF2aVjNNRPQAzwHTJU0CPgx8bLANSLpU\n0nJJy7u6uhrN+5A621ynb2aWd6Abcq8EromIbYMliogvR8SiiFg0Y8aMEdt4Z2uJHb7SNzPbo6WB\nNE8AR+XG56RptdKsk9QCTAU2AacC50n6FDANqEjaGRGf3++cN+CIKR08vXUnz3f37nnVsplZM2vk\nSv9uYL6kYyS1AecDS6rSLAEuTMPnAXdE5rcjYm5EzAU+C/z1aAV8gAWzp1IJeGD9ltHapJnZQW3I\noJ/q6N8HLAVWAbdExApJV0k6NyW7jqwOfw3wQWDAY51jYeHsKQCseNJB38wMGqveISJuB26vmvaR\n3PBO4K1DrOPKfcjffplzSCdTO1sd9M3MkkL/IlcSC2ZNYeWTz411VszMDgqFDvqQVfE8sH4rPb3+\nkZaZWfGD/pFT2NVT4aGu7WOdFTOzMVf8oD97KgArXMVjZlb8oH/sYRNpbym5MdfMjCYI+i3lEsfN\nmuIrfTMzmiDoQ9aYu/LJLUTEWGfFzGxMNU3Q37Kzh3Wbnx/rrJiZjakmCfpuzDUzgyYJ+scdMZly\nSW7MNbOm1xRBv6O1zAtnTHTQN7Om1xRBH7IqHlfvmFmza6KgP4UNW3axcduusc6KmdmYaZqgv8Cv\nWTYza56gv3CWn+AxM2uaoD91QitzDun0lb6ZNbWmCfqw95e5ZmbNqsmC/lQe3ridbbt6xjorZmZj\nosmCftaYu+opX+2bWXNqsqCfGnOfcGOumTWnpgr6M6e0M31imxtzzaxpNVXQl8SC2VMc9M2saTVV\n0IesiufBp7fS3eOO0s2s+TRh0J/C7t7gNxu2jnVWzMxGXVMGfcDP65tZU2q6oD93+kQmtpX9OgYz\na0pNF/RLJfGSWW7MNbPm1HRBH7IqnlVPbaFScUfpZtZcmjToT2V7dy+PbNo+1lkxMxtVTRn0/W59\nM2tWTRn0XzRzMq1ld5RuZs2nKYN+W0uJ+YdP9hM8ZtZ0Ggr6ks6WtFrSGkmX15jfLunmNP9OSXPT\n9NdJukfS/en/6SOb/X3X9279CDfmmlnzGDLoSyoD1wLnAAuACyQtqEp2CbA5IuYB1wBXp+kbgTdF\nxAnAhcCNI5Xx/bVw9hQ2be9mwxZ3lG5mzaORK/1TgDURsTYiuoGbgMVVaRYDN6ThW4EzJCki/jMi\nnkzTVwCdktpHIuP7a+GR7jPXzJpPI0H/SODx3Pi6NK1mmojoAZ4Dplel+X3gVxEx4NJa0qWSlkta\n3tXV1Wje98tLZk1Bgnsff3ZUtmdmdjAYlYZcSQvJqnzeU2t+RHw5IhZFxKIZM2aMRpaY1N7Cq144\nnS/9dC13rt00Kts0MxtrjQT9J4CjcuNz0rSaaSS1AFOBTWl8DvCPwDsj4qH9zfBI+tsLTmbOIZ38\n968v54H1fnzTzIqvkaB/NzBf0jGS2oDzgSVVaZaQNdQCnAfcEREhaRrwL8DlEfHzkcr0SDl0Yhtf\nv/gUJrSVufD6u1i3ecdYZ8nM7IAaMuinOvr3AUuBVcAtEbFC0lWSzk3JrgOmS1oDfBDoe6zzfcA8\n4COS7k1/h4/4XuyHOYdM4IaLT2FHdy/vvP4untnePdZZMjM7YHSwPae+aNGiWL58+ahv9861m3jH\n9XexYNYUvvnuU5nQ1jLqeTAz21eS7omIRUOla8pf5NZy6rHT+ZvzT+K+dc/yx//wK3b3ujtFMyse\nB/2cs48/go+/+Xh+vLqLy79zv3+ta2aF4zqMKm8/9QU8vWUXn/vRgxw+pZ0Pn33cWGfJzGzEOOjX\n8CevnU/Xtl184ScPMWNSOxe/5pixzpKZ2Yhw0K9BEh9ffDybtu3iqttWctjkds596eyxzpaZ2X5z\nnX4d5ZL43PknccrcQ7nslnv52YMbxzpLZmb7zUF/EB2tZb5y4SKOPWwS77lxOb9+wi9nM7PxzUF/\nCFM7W7nh4lOYNqGNi756F4+6X10zG8cc9BtwxNQObrj4FHoqwTuuu4uurX4Hv5mNTw76DZp3+CSu\nv+jlPL11J+/62l1s29Uz1lkyMxs2B/1hOPnoQ/i7t5/Mqqe28t4b76G7x7/aNbPxxUF/mE4/biaf\nfMsJ/GzNRt5z43L+Y81GevzKBjMbJ/yc/j5466Kj2LKzh099/wF+vLqLaRNaOf24wzlr4RH8zvwZ\ndLaVxzqLZmY1+S2b+2FHdw8//U0Xy1Zs4IerNrBlZw8drSV+Z/4Mzlp4BGe85HCmTWgb62yaWRNo\n9C2bvtLfDxPaWjj7+FmcffwsdvdWuOvhZ1i6Yj3LVmxg2coNlEvi1GMO5cwFMzlz4RHMntY51lk2\nsybnK/0DICK4b91zLFu5nqUrNrDm6W0AnHDkVM5amBUA8w+fhKQxzqmZFUWjV/oO+qPgoa5tLFux\ngaUr1nPv488CcMxhE/fcAZx01DRKJRcAZrbvHPQPUhu27GTZyg0sW7GeXzy0iZ5KMGNyO69bMJMz\nF8zkVS88jLYWP1RlZsPjoD8OPPf8bn6y+mmWrljPT1Z3saO7l8ntLfzecYdz5sKZnPbiw5nU7mYX\nMxuag/44s3N3Lz9fs3HPk0CbtnfTVi7x6nnT05NAM5kxuX2ss2lmByk/vTPOdLSWOeMlMznjJTPp\nrQTLH3mGZSuzdoAfr74f6X4WveAQzlxwBGctPIKjp08Y6yyb2TjkK/2DXESw6qmt2aOgKzew6qkt\nABx3xGTOXHgEZy6YycLZU/wkkFmTc/VOQT22aQfLVma/Bbj70WeIgDmHdHLmgiM4c+FMXj73UMp+\nEsis6TjoN4GN23bxo1UbWLpiAz9bs5HungqHTmzjjPRKiNfMP4yOVr8SwqwZOOg3mW27evi31V0s\nW7meO1Y9zdZdPUxoK/O7L5rBmQtncvqLZzJ1QutYZ9PMDhA35DaZSe0tvOHEWbzhxFl091T45dpN\nLF2xnh+s3MC//no9LSXximOnc9bCmfz2/BkcMrGNiW1lWsr+TYBZM/GVfsFVKsG9657N3ge0Yj1r\nN/bv7rGtpcTEtjIT21uY2NbChPZy9j9Nm9BWZlJ7CxPaWpjYXu7/v2+5PePZ8q0uSMxGnat3bICI\n4KGubdzz6Ga27eplx64etnX3sGNXL9vz/7t72b6rp9+0nbsb7zOgrVzaU3jUKigmtLekgqRcs6DJ\nCqAs3cS2bDn/StlscK7esQEkMe/wycw7fPKwl+2tBDtSgbBtV76A6GH7rt49/7PCorf/9O6sgNm0\nbQc7cvOe393b8PZby2JC297CIl8gTGwv7y0oahQ0ewqYqoKmrVzyo67WdBz0rSHlkpjc0crkjlZm\njtA6eyvB87tTQbFr7x3Gju69dx7bdvX0KzjyBcr2XT1s3vF8vwJmR3fjBUlLSTXvKvYWGnvndbSW\naCmJkkS5pGy4JMppvN/fENNKfctLtJSzeYNOq15nmme2Lxz0bcyUS2JSquoZKZVKsGP33gKiuhDZ\nU22Vn7crdwfT3cOTz+4eUNAcbCSGLlxyBUj1tFJVIVJrWrlcu7BpqdpmSXsLp5bctNZyNt5SLtFW\nLtFSFq3lUppeorWlRGtJtLZkBWo2L0vXl76lNHBZ353tn4a+bZLOBj4HlIG/j4hPVs1vB74O/Baw\nCXhbRDyS5l0BXAL0Au+PiKUjlnuzKqUDVJB091borQQ9laBSCXoj6K1U/Q0xrZKW742gtzf73zet\nEkFPjWn5ddXcdm5de7aTW1f1tL515rfZU6mwq2fv+vLp+qZVKtBTqdBbgd5KJa0zm1apsGdbo6Fc\nygqU1lRw7C0wsgJmb8GytyDZOy+b1lIq0daSFSr5QmZPwZNbZ1YQ5ZdVKrD6F0a1CrSWfvNKB8UP\nJ4f8ZkgqA9cCrwPWAXdLWhIRK3PJLgE2R8Q8SecDVwNvk7QAOB9YCMwGfijpRRFx8F06mdVRKomO\nkn/kNpSI/gXB7kolK1R6K3T3VvYUMN092f/dvRV292aFTDacxivZ/929FXp6c8OVoLunQk9ab986\n+y+XXzbc3SA8AAAGf0lEQVT7v6O7J7ds9FvnnnRpmwe64JLICqvqAioVDGccdzh//oYFBzQPjVwO\nnQKsiYi1WaZ1E7AYyAf9xcCVafhW4PPK7sEWAzdFxC7gYUlr0vp+MTLZN7ODhSTKgnIqIDsZfwVl\npRK5wioVLJUKu3v2Tu9XQPVW2F0JdqfCqDtN21sopQKm0r+A2lOg7Vk2S3/E1APfpWojQf9I4PHc\n+Drg1HppIqJH0nPA9DT9l1XLHlm9AUmXApcCHH300Y3m3cxsRJVKor1UpsjdWBwUDz9HxJcjYlFE\nLJoxY8ZYZ8fMrLAaCfpPAEflxuekaTXTSGoBppI16DayrJmZjZJGgv7dwHxJx0hqI2uYXVKVZglw\nYRo+D7gjsp/6LgHOl9Qu6RhgPnDXyGTdzMyGa8iaq1RH/z5gKdkjm9dHxApJVwHLI2IJcB1wY2qo\nfYasYCClu4Ws0bcH+GM/uWNmNnb87h0zswJo9N07B0VDrpmZjQ4HfTOzJuKgb2bWRA66On1JXcCj\nw1jkMGDjAcrOwawZ97sZ9xmac7+bcZ9h//b7BREx5A+dDrqgP1ySljfSeFE0zbjfzbjP0Jz73Yz7\nDKOz367eMTNrIg76ZmZNpAhB/8tjnYEx0oz73Yz7DM253824zzAK+z3u6/TNzKxxRbjSNzOzBjno\nm5k1kXEd9CWdLWm1pDWSLh/r/IwUSUdJ+rGklZJWSPpAmn6opB9IejD9PyRNl6S/ScfhPkknj+0e\n7B9JZUn/Kem2NH6MpDvT/t2c3vZKenvrzWn6nZLmjmW+95WkaZJulfSApFWSXtkM51rSn6bP968l\nfUtSR9HOtaTrJT0t6de5acM+t5IuTOkflHRhrW01atwG/VzfvecAC4ALUp+8RdADXBYRC4BXAH+c\n9u1y4EcRMR/4URqH7BjMT3+XAl8Y/SyPqA8Aq3LjVwPXRMQ8YDNZn8yQ65sZuCalG48+B3w/Io4D\nXkq274U+15KOBN4PLIqI48ne4NvXv3aRzvXXgLOrpg3r3Eo6FPgoWY+FpwAf7Sso9klEjMs/4JXA\n0tz4FcAVY52vA7Sv/0TWMf1qYFaaNgtYnYa/BFyQS78n3Xj7I+to50fA6cBtgMh+odhSfd7JXvf9\nyjTcktJprPdhmPs7FXi4Ot9FP9fs7WL10HTubgPOKuK5BuYCv97XcwtcAHwpN71fuuH+jdsrfWr3\n3Tug/93xLt3GngTcCcyMiKfSrPXAzDRcpGPxWeBDQCWNTweejYieNJ7ft359MwN9fTOPJ8cAXcBX\nU5XW30uaSMHPdUQ8AXwaeAx4iuzc3UOxz3Wf4Z7bET3n4znoF56kScB3gD+JiC35eZEV+YV63lbS\nG4GnI+Kesc7LKGoBTga+EBEnAdvZe7sPFPZcHwIsJiv0ZgMTGVgNUnhjcW7Hc9AvdP+7klrJAv4/\nRMR30+QNkmal+bOAp9P0ohyLVwPnSnoEuImsiudzwLTU9zL037d6fTOPJ+uAdRFxZxq/lawQKPq5\nfi3wcER0RcRu4Ltk57/I57rPcM/tiJ7z8Rz0G+m7d1ySJLIuKFdFxGdys/J9EV9IVtffN/2dqfX/\nFcBzudvHcSMiroiIORExl+x83hERbwd+TNb3Mgzc71p9M48bEbEeeFzSi9OkM8i6Fy30uSar1nmF\npAnp896334U91znDPbdLgTMlHZLukM5M0/bNWDdy7GcDyeuB3wAPAX8+1vkZwf16Ddkt333Avenv\n9WR1mD8CHgR+CBya0ovsSaaHgPvJnogY8/3Yz2NwGnBbGj4WuAtYA3wbaE/TO9L4mjT/2LHO9z7u\n68uA5el8fw84pBnONfAx4AHg18CNQHvRzjXwLbI2i91kd3WX7Mu5BS5O+74GeNf+5MmvYTAzayLj\nuXrHzMyGyUHfzKyJOOibmTURB30zsybioG9m1kQc9M3MmoiDvplZE3HQNxuEpG9L+rykn0l6VNJr\nJN0o6TeSrhvr/JkNl4O+2eBOANZGxGuAG8hej/Eh4HjgLZLaxzJzZsPVMnQSs+YkqQOYRva6Z4Dn\ngesivetG0g6ge4yyZ7ZPfKVvVt9C4FcR0fdu/5eS9WuApDnAk+H3mNg446BvVt8JwH/lxk8keyka\nZAXAfQOWMDvIOeib1XcC2RtO+6p6OiNic5qXLwDMxg2/ZdPMrIn4St/MrIk46JuZNREHfTOzJuKg\nb2bWRBz0zcyaiIO+mVkTcdA3M2si/x/ESd6GwXsQegAAAABJRU5ErkJggg==\n", "text/plain": [""]}, "metadata": {}, "output_type": "display_data"}], "source": ["import matplotlib.pyplot as plt\n", "fix, ax = plt.subplots()\n", "plt.plot(sizes, p_collision)\n", "plt.xlabel(r'$m$')\n", "ax.set_title('Ratio des collisions en fonction de la taille de hash')"]}, {"cell_type": "markdown", "metadata": {}, "source": ["La probabilit\u00e9 de collision estim\u00e9e est bien inversement proportionnelle \u00e0 la valeur de hashage $m$."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Algorithme BJKST\n", "\n", "Nous consid\u00e9rons un univers $U$ de taille $N$ et comportant $n$ \u00e9lements distincts. Pour un stream $S = (a_1, a_2, \\ldots)$ de taille $s$ essayons d'estimer $n$ \u00e0 l'aide de l'algorithme [BJKST](http://info.prelert.com/blog/hashing-and-approximate-distinct-value-counts)."]}, {"cell_type": "code", "execution_count": 12, "metadata": {"collapsed": true}, "outputs": [], "source": ["n = 10**3\n", "N = 10**4\n", "# nous tirons N entiers de 64bits (type i8) dont n sont distincts\n", "universe = numpy.random.randint(0, n, N, dtype='i8')\n", "s = 500\n", "stream = universe[-s:]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["L'id\u00e9e derri\u00e8re l'algorithme [BJKST](http://info.prelert.com/blog/hashing-and-approximate-distinct-value-counts) est de parcourir les \u00e9lements du stream et de remplir un ensemble $B$ par \u00e9chantillonage. La probabilit\u00e9 d'\u00e9chantillonage initiale est $1$ et lorsque $B$ devient trop grand (au del\u00e0 d'un certain seuil $B_{max}$) on enl\u00e8ve des \u00e9lements et on diminue la probabilit\u00e9 d'\u00e9chantillonage. A la fin le nombre d'\u00e9l\u00e9ments dans $B$ et la probabilit\u00e9 d'\u00e9chantillonage finale permettent d'estimer le nombre d'\u00e9l\u00e9ments distincts dans $U$.\n", "\n", "Pour $\\epsilon >0$ nous prenons $B_{max} = 1/ \\epsilon^2$ :"]}, {"cell_type": "code", "execution_count": 13, "metadata": {"collapsed": true}, "outputs": [], "source": ["# definissons un ensemble B\n", "B = set()\n", "epsilon = 0.1\n", "B_max = 1 / epsilon**2"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Choisissons $p$ un nombre premier arbitrairement grand:"]}, {"cell_type": "code", "execution_count": 14, "metadata": {"collapsed": true}, "outputs": [], "source": ["p = 4294967291"]}, {"cell_type": "markdown", "metadata": {}, "source": ["et tirons al\u00e9atoirement deux fonctions $h_{a_1, b_1}$ et $h_{a_2, b_2}$ distinctes:"]}, {"cell_type": "code", "execution_count": 15, "metadata": {"collapsed": true}, "outputs": [], "source": ["import random\n", "# deux couples (a_1, b_1) (a_1, b_2) distincts\n", "a1, a2 = random.sample(range(1, p), 2)\n", "b1, b2 = random.sample(range(0, p), 2)\n", "\n", "def h1(x):\n", " return ((a1*x + b1) % p) % s\n", "\n", "def h2(x):\n", " return ((a2*x + b2) % p) % s"]}, {"cell_type": "markdown", "metadata": {"collapsed": true}, "source": ["Initialisons un entier $c$ \u00e0 zero. Pour chaque $x$ dans le stream nous calculons d'abord sa valeur de hachage $y = h_{a_1, b_1}(x)$ :"]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["x = 207, y = 230\n"]}], "source": ["c = 0\n", "# Prenons le premier \u00e9l\u00e9ment du stream (\u00e0 titre d'exemple)\n", "x = stream[0]\n", "y = h1(x)\n", "print('x = {}, y = {}'.format(x, y))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["La probabilit\u00e9 d'\u00e9chantillonage est bas\u00e9e sur le nombre de z\u00e9ros \u00e0 droite dans la d\u00e9composition binaire de $y$. Pour calculer ce nombre diverses m\u00e9thodes existent (voir [Count the consecutive zero bits (trailing) on the right with modulus division and lookup ](http://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightModLookup) pour plus de d\u00e9tails). Pour $s= 2^1$ et $s = 2^{10}$ la d\u00e9composition binaire comporte $1$ et $10$ z\u00e9ros \u00e0 droite respectivement:"]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Decomposition binaire de 2**1 = 0b10, nombre de zeros a droite = 1\n", "Decomposition binaire de 2**10 = 0b10000000000, nombre de zeros a droite = 10\n"]}], "source": ["mod_37bit_position = (32, 0, 1, 26, 2, 23, 27, 0, 3, 16, 24, 30, 28, 11, 0, 13, 4,\n", " 7, 17, 0, 25, 22, 31, 15, 29, 10, 12, 6, 0, 21, 14, 9, 5,\n", " 20, 8, 19, 18)\n", "\n", "# Un seul z\u00e9ro \u00e0 droite\n", "s = 2**1\n", "zeros = mod_37bit_position[(-s & s) % 37]\n", "print('Decomposition binaire de 2**1 = {}, nombre de zeros a droite = {}'.format(bin(s), zeros))\n", "\n", "# Dix z\u00e9ros \u00e0 droite\n", "s = 2**10\n", "zeros = mod_37bit_position[(-s & s) % 37]\n", "print('Decomposition binaire de 2**10 = {}, nombre de zeros a droite = {}'.format(bin(s), zeros))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Notons $k$ le nombre de z\u00e9ros \u00e0 droite dans la d\u00e9composition binaire de $y$."]}, {"cell_type": "code", "execution_count": 18, "metadata": {"scrolled": true}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Decomposition binaire de y = 0b11100110, nombre de zeros a droite = 1\n"]}], "source": ["k = mod_37bit_position[(-y & y) % 37]\n", "print('Decomposition binaire de y = {}, nombre de zeros a droite = {}'.format(bin(y), k))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Puis nous comparons $k \\geq c$. Si c'est le cas nous calculons une nouvelle valeur de hashage de $x$, $z = h_{a_2, b_2}(x)$ et rajoutons le couple $(z, k)$ \u00e0 l'ensemble $B$."]}, {"cell_type": "code", "execution_count": 19, "metadata": {"collapsed": true}, "outputs": [], "source": ["if (k >= c):\n", " z = h2(x)\n", " B.add((z, k))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["A l'initialisation la condition $k \\geq c$ est toujours v\u00e9rifi\u00e9e (ce qui correspond \u00e0 une probabilit\u00e9 d'\u00e9chantillonage \u00e9gale \u00e0 $1$) :"]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"data": {"text/plain": ["{(1002, 1)}"]}, "execution_count": 21, "metadata": {}, "output_type": "execute_result"}], "source": ["B"]}, {"cell_type": "markdown", "metadata": {}, "source": ["L'ensemble $B$ se remplit jusqu'\u00e0 atteindre la taille $B_{max}$. Lorsque cette taille est atteinte on incr\u00e9mente $c$ et on enl\u00e8ve \u00e0 $B$ tous les couples $(z, k)$ o\u00f9 $k < c$."]}, {"cell_type": "code", "execution_count": 21, "metadata": {"collapsed": true}, "outputs": [], "source": ["while (len(B) >= B_max):\n", " c += 1\n", " # on prend ici une copie de B\n", " for z, k in B.copy():\n", " if (k < c):\n", " B.remove((z, k)) "]}, {"cell_type": "markdown", "metadata": {}, "source": ["Parcourons le stream et regardons \u00e0 quoi ressemble l'ensemble $B$:"]}, {"cell_type": "code", "execution_count": 22, "metadata": {"collapsed": true}, "outputs": [], "source": ["for x in stream:\n", " y = h1(x)\n", " k = mod_37bit_position[(-y & y) % 37]\n", " if (k >= c):\n", " z = h2(x)\n", " B.add((z, k))\n", " while (len(B) >= B_max):\n", " c += 1\n", " for z, k in B.copy():\n", " if (k < c):\n", " B.remove((z, k)) "]}, {"cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Taille de B = 55, c = 3\n"]}], "source": ["print('Taille de B = {}, c = {}'.format(len(B), c))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Une estimateur de la taille de l'univers est alors $2^c \\mathrm{card}(B)$ :"]}, {"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Estimation de la taille de U = 440\n"]}], "source": ["print('Estimation de la taille de U = {}'.format(2**c * len(B)))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Pour s'en convaincre, remarquons qu'en moyenne le cardinal de $B$ est \u00e9gal au nombre de $y_i$ pour lequel le nombre de z\u00e9ros \u00e0 droite dans la d\u00e9composition binaire est plus grand que $c$. Ceci correspond au nombre de $y_i$ qui sont divisibles par $2^c$ :\n", "\n", "$$\\mathbb{E}[\\mathrm{card}(B)] = \\mathbb{E}\\big[\\sum_{x_i} I(y_i = 0 \\space{} \\mathrm{mod} \\space{} 2^c)\\big] = \\sum_{x_i} \\mathbb{P}\\big( y_i = 0 \\space{} \\mathrm{mod} \\space{} 2^c\\big)$$\n", "\n", "C'est ici qu'intervient la notion de famille universelle car cette derniere \u00e9galit\u00e9 n'est valide que si le nombre de collision entre $y$ et $z$ est faible lors du hachage de $x$ par $h_{a_1, b_1}$ et $h_{a_2, b_2}$. En effet, si le nombre de collisions \u00e9tait trop grand la taille de $B$ sous-estimerait le nombre d'\u00e9l\u00e9ments distincts.\n", "\n", "La probabilit\u00e9 $\\mathbb{P}\\big( y_i = 0 \\space{} \\mathrm{mod} \\space{} 2^c\\big)$ \u00e9tant \u00e9gale \u00e0 $1 / 2^c$ si $y_i$ est distribu\u00e9 uniform\u00e9ment (l'\u00e9crire pour s'en convaincre) nous obtenons :\n", "\n", "$$ \\mathbb{E}[\\mathrm{card}(B)] = \\frac{n}{2^c}.$$"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### R\u00e9sultats num\u00e9riques\n", "\n", "Regroupons le code dans une fonction"]}, {"cell_type": "code", "execution_count": 25, "metadata": {"collapsed": true}, "outputs": [], "source": ["def BJKST(stream, epsilon):\n", " s = len(stream)\n", " a1, a2 = random.sample(range(1, p), 2)\n", " b1, b2 = random.sample(range(0, p), 2)\n", " def h1(x):\n", " return ((a1*x + b1) % p) % s\n", " def h2(x):\n", " return ((a2*x + b2) % p) % s\n", " c = 0\n", " B = set()\n", " B_max = 1.0 / epsilon**2\n", " for x in stream:\n", " y = h1(x)\n", " k = mod_37bit_position[(-y & y) % 37]\n", " if (k >= c):\n", " z = h2(x)\n", " B.add((z, k))\n", " while (len(B) >= B_max):\n", " c += 1\n", " for z, k in B.copy():\n", " if (k < c):\n", " B.remove((z, k)) \n", " return 2**c*len(B)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["En pratique une estimation fiable du nombre d'\u00e9l\u00e9ments distincts requiert de g\u00e9n\u00e9rer plusieurs calculs ind\u00e9pendants de l'algorithme et de prendre la m\u00e9diane. Regardons comment la qualit\u00e9 de l'estimation \u00e9volue en fonction de la pr\u00e9cision $\\epsilon$ exig\u00e9e et de la taille $s$ du stream."]}, {"cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [{"ename": "NameError", "evalue": "name 'np' is not defined", "output_type": "error", "traceback": ["\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 6\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0ms\u001b[0m \u001b[1;32min\u001b[0m \u001b[0msizes\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 7\u001b[0m \u001b[0mstream\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0muniverse\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;33m-\u001b[0m\u001b[0ms\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 8\u001b[1;33m \u001b[0mvalues\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mmedian\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mBJKST\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mstream\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0meps\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0m_\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mrange\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;36m100\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 9\u001b[0m \u001b[0mestimates\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0meps\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mvalues\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mNameError\u001b[0m: name 'np' is not defined"]}], "source": ["epsilons = [0.5, 0.2, 0.1]\n", "sizes = [100, 250, 500, 1000, 2500, 5000]\n", "estimates = {}\n", "for eps in epsilons:\n", " values = []\n", " for s in sizes:\n", " stream = universe[-s:]\n", " values.append(numpy.median([BJKST(stream, eps) for _ in range(100)]))\n", " estimates[eps] = values"]}, {"cell_type": "code", "execution_count": 27, "metadata": {"collapsed": true}, "outputs": [], "source": ["for eps in estimates:\n", " plt.plot(sizes, estimates[eps], label = '$\\epsilon$ = {:.1f}'.format(eps))\n", "plt.axhline(y=n, color='r', linestyle='--', label='Vraie valeur de $n$')\n", "plt.title('Estimation de $n$')\n", "plt.xlabel('Taille du stream')\n", "plt.legend()"]}, {"cell_type": "code", "execution_count": 28, "metadata": {"collapsed": true}, "outputs": [], "source": ["epsilon = 0.1\n", "for i in range(len(sizes)):\n", " print('Erreur relative = {0:.2f}%, s = {1}'.format(abs(estimates[epsilon][i]/ n - 1.0)*100.0, sizes[i]))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Nous observons que l'estimation converge vers la vraie valeur $n$ \u00e0 mesure que la taille du stream augmente. Pour une pr\u00e9cision $\\epsilon = 10\\%$ et une taille de stream \u00e9gale \u00e0 $5000$ l'erreur est de $0,8\\%$."]}, {"cell_type": "markdown", "metadata": {}, "source": ["La fiabilit\u00e9 de l'estimation est d\u00e9croissante avec le niveau de pr\u00e9cision exig\u00e9, l'algorithme donne une bonne estimation de la vraie valeur (ligne horizontale rouge) pour $\\epsilon \\approx 0.3$."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Temps de calcul en fonction de la taille du stream\n", "\n", "Regardons \u00e0 pr\u00e9sent comment le temps de calcul \u00e9volue en fonction de la taille du stream. Rappelons qu'un bon algorithme de streaming doit \u00e9voluer de mani\u00e8re au pire lin\u00e9aire en fonction de la taille d'espace $s$."]}, {"cell_type": "code", "execution_count": 29, "metadata": {"collapsed": true}, "outputs": [], "source": ["import time\n", "epsilon = 0.1\n", "size_bound = 15\n", "sizes = [100, 250, 500, 1000, 2500, 5000]\n", "m = 100\n", "times = []\n", "for s in sizes:\n", " start = time.time()\n", " stream = universe[-s:]\n", " BJKST(stream, epsilon)\n", " times.append(time.time() - start)\n", "times = numpy.array(times)"]}, {"cell_type": "code", "execution_count": 30, "metadata": {"collapsed": true}, "outputs": [], "source": ["fix, ax = plt.subplots()\n", "plt.plot(sizes, times*1000)\n", "plt.title('Temps de calcul (en ms)')\n", "plt.xlabel('Taille du stream')"]}, {"cell_type": "markdown", "metadata": {}, "source": ["La complexit\u00e9 semble \u00eatre lin\u00e9aire ce qui est satisfaisant. Notons qu'aucun effort d'optimisation de performance (\u00e0 part l'usage d'un set) n'a \u00e9t\u00e9 fait \u00e0 ce stade."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Un peu plus sur la pr\u00e9cision de l'estimateur"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Lorsque la pr\u00e9cision $\\epsilon$ est proche de $0$ l'estimation est moins bonne que pour une pr\u00e9cision plus large. Pourquoi ?"]}, {"cell_type": "code", "execution_count": 31, "metadata": {"collapsed": true}, "outputs": [], "source": ["import random\n", "import numpy\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "n = 1000\n", "stream = numpy.arange(1000)\n", "p = 4294967291\n", "\n", "mod_37bit_position = (32, 0, 1, 26, 2, 23, 27, 0, 3, 16, 24, 30, 28, 11, 0, 13, 4,\n", " 7, 17, 0, 25, 22, 31, 15, 29, 10, 12, 6, 0, 21, 14, 9, 5,\n", " 20, 8, 19, 18)"]}, {"cell_type": "code", "execution_count": 32, "metadata": {"collapsed": true}, "outputs": [], "source": ["def BJKST(stream, B_max, h1, h2):\n", " c = 0\n", " B = set()\n", " R = []\n", " removed = 0\n", " for x in stream:\n", " y = h1(x)\n", " k = mod_37bit_position[(-y & y) % 37]\n", " if (k >= c):\n", " z = h2(x)\n", " B.add((z, k))\n", " while (len(B) >= B_max):\n", " c += 1\n", " for z, k in B.copy():\n", " if (k < c):\n", " B.remove((z, k))\n", " removed += 1\n", " R.append([removed, len(B), c])\n", " return numpy.array(R)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### $h_1$ et $h_2$ \u00e9gales \u00e0 l'identit\u00e9\n", "\n", "Si nous prenons $h_1$ et $h_2$ \u00e9gales \u00e0 l'identit\u00e9 l'ensemble $B$ se remplit lin\u00e9airement: pour $c=1$ on enl\u00e8ve tous les nombres avec $k = 0$ (tous les nombres impairs donc la moiti\u00e9) puis pour $c=2$ on enl\u00e8ve tous les nombres o\u00f9 $k = 0, 1$ c'est \u00e0 dire encore la moiti\u00e9 des nombres rajout\u00e9s et ainsi de suite.. \n", "\n", "A la fin l'estimation est parfaite (cf graphe en fonction de l'avancement dans le stream)"]}, {"cell_type": "code", "execution_count": 33, "metadata": {"collapsed": true}, "outputs": [], "source": ["B_max = 200\n", "def h1(x):\n", " return x\n", "def h2(x):\n", " return x\n", "R = BJKST(stream, B_max, h1, h2)\n", "estimate = 2**R[-1,2]*R[-1,1]\n", "print('Estimated = {}, true = {}, c= {}'.format(estimate, n, R[-1,2]))\n", "D = numpy.concatenate((numpy.array([1]), numpy.diff(R[:,2])))\n", "changes = stream[numpy.nonzero(D)]\n", "\n", "fix, ax = plt.subplots()\n", "ax.plot(stream, R[:,0], color='red', label='total #removed')\n", "ax.plot(stream, R[:,1], color='blue', label='len B')\n", "for c in changes:\n", " ax.annotate('c = {}'.format(R[c, 2]), xy=(c, R[c, 1]), xytext=(c + 65, R[c,1] - 30), arrowprops=dict(arrowstyle='->'))\n", "ax.legend(loc=(1.1, 0.9))\n", "plt.xlabel('$x$')\n", "plt.ylabel('size')"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### cas o\u00f9 la taille du hash est petite\n", "\n", "Si on prend un hash petit il faut regarder plus de nombres pour remplir et l'incr\u00e9ment de $c$ se fait \"plus tard\" dans le stream, d'o\u00f9 la mauvaise estimation."]}, {"cell_type": "code", "execution_count": 34, "metadata": {"collapsed": true}, "outputs": [], "source": ["B_max = 200\n", "s = B_max // 4\n", "a1, a2 = random.sample(range(1, p), 2)\n", "b1, b2 = random.sample(range(0, p), 2)\n", "def h1(x):\n", " return ((a1*x + b1) % p) % n\n", "def h2(x):\n", " return ((a2*x + b2) % p) % s\n", "R = BJKST(stream, B_max, h1, h2)\n", "estimate = 2**R[-1,2]*R[-1,1]\n", "print('Estimated = {}, true = {}, c= {}'.format(estimate, n, R[-1,2]))\n", "D = numpy.concatenate((numpy.array([1]),numpy.diff(R[:,2])))\n", "changes = stream[numpy.nonzero(D)]\n", "fix, ax = plt.subplots()\n", "ax.plot(stream, R[:,0], color='red', label='total #removed')\n", "ax.plot(stream, R[:,1], color='blue', label='len B')\n", "for c in changes:\n", " ax.annotate('c = {}'.format(R[c, 2]), xy=(c, R[c, 1]), xytext=(c + 65, R[c,1] - 30), arrowprops=dict(arrowstyle='->'))\n", "ax.legend(loc=(1.1, 0.9))\n", "plt.xlabel('$x$')\n", "plt.ylabel('size')"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### cas o\u00f9 la taille de hash est plus grande\n", "\n", "Si on prend un une valeur de hash plus grande on se rapproche du cas $h_1 = h_2 = id$ et l'estimation est meilleure :"]}, {"cell_type": "code", "execution_count": 35, "metadata": {"collapsed": true}, "outputs": [], "source": ["B_max = 200\n", "s = B_max\n", "a1, a2 = random.sample(range(1, p), 2)\n", "b1, b2 = random.sample(range(0, p), 2)\n", "def h1(x):\n", " return ((a1*x + b1) % p) % n\n", "def h2(x):\n", " return ((a2*x + b2) % p) % s\n", "R = BJKST(stream, B_max, h1, h2)\n", "estimate = 2**R[-1,2]*R[-1,1]\n", "print('Estimated = {}, true = {}, c= {}'.format(estimate, n, R[-1,2]))\n", "D =numpy.concatenate((numpy.array([1]), numpy.diff(R[:,2])))\n", "changes = stream[numpy.nonzero(D)]\n", "fix, ax = plt.subplots()\n", "ax.plot(stream, R[:,0], color='red', label='total #removed')\n", "ax.plot(stream, R[:,1], color='blue', label='len B')\n", "for c in changes:\n", " ax.annotate('c = {}'.format(R[c, 2]), xy=(c, R[c, 1]), xytext=(c + 65, R[c,1] - 30), arrowprops=dict(arrowstyle='->'))\n", "ax.legend(loc=(1.1, 0.9))\n", "plt.xlabel('$x$')\n", "plt.ylabel('size')"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### la taille de hash d\u00e9pend de la pr\u00e9cision $\\epsilon$\n", "\n", "A mon avis, il faut donc que la taille du hash pour $h_2$ d\u00e9pende de la pr\u00e9cision (dans [Data Stream Algorithms](http://www.cs.dartmouth.edu/~ac/Teach/CS85-Fall09/Notes/lecnotes.pdf) ils pr\u00e9conisent une taille en $\\mathrm{log}(n) / \\epsilon^2$).\n", "\n", "Si on prend cette taille pour $h_2$ on voit que l'estimation est meilleure pour une pr\u00e9cision petite."]}, {"cell_type": "code", "execution_count": 36, "metadata": {"collapsed": true}, "outputs": [], "source": ["def BJKST(stream, epsilon):\n", " a1, a2 = random.sample(range(1, p), 2)\n", " b1, b2 = random.sample(range(0, p), 2)\n", " #taille de la valeur de hashage d\u00e9pend de la precision\n", " b = int(numpy.log(n) / epsilon**2)\n", " def h1(x):\n", " return ((a1*x + b1) % p) % n\n", " # on applique la taille b sur la seconde fonction de hash\n", " def h2(x):\n", " return ((a2*x + b2) % p) % b\n", " c = 0\n", " B = set()\n", " B_max = 1.0 / epsilon**2\n", " for x in stream:\n", " y = h1(x)\n", " k = mod_37bit_position[(-y & y) % 37]\n", " if (k >= c):\n", " z = h2(x)\n", " B.add((z, k))\n", " while (len(B) >= B_max):\n", " c += 1\n", " for z, k in B.copy():\n", " if (k < c):\n", " B.remove((z, k)) \n", " return 2**c*len(B)"]}, {"cell_type": "code", "execution_count": 37, "metadata": {"collapsed": true}, "outputs": [], "source": ["m = 100\n", "epsilons = numpy.array([0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01])\n", "medians = numpy.array([numpy.median([BJKST(stream, eps) for _ in range(m)]) for eps in epsilons])"]}, {"cell_type": "code", "execution_count": 38, "metadata": {"collapsed": true}, "outputs": [], "source": ["import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "#plt.plot(1.0 / epsilons, medians)\n", "plt.plot(epsilons, medians)\n", "plt.axhline(y=n, color='r', linestyle='--')\n", "plt.title(r'Mediane en fonction de $1 / \\epsilon$, m = {}'.format(m))\n", "plt.xlabel(r'$\\epsilon$')\n", "plt.ylabel('Mediane')"]}, {"cell_type": "code", "execution_count": 39, "metadata": {"collapsed": true}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1"}}, "nbformat": 4, "nbformat_minor": 2}