{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 1A - Enonc\u00e9 26 octobre 2022\n", "\n", "Correction de l'examen du 26 octobre 2022."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
run previous cell, wait for 2 seconds
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercice 1 : calcul de distances\n", "\n", "On souhaite calculer pour chaque rue de France le m\u00e9decin le plus proche (environ 100.000 m\u00e9decins g\u00e9n\u00e9ralistes). On suppose qu'on dispose des coordonn\u00e9es des rues $X$ et de celle des m\u00e9decins $Y$.\n", "\n", "### Q1 : tirer deux matrices al\u00e9atoires pour X et Y, de tailles diff\u00e9rentes"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/plain": ["(array([[0.07629544, 0.9432004 ],\n", " [0.52035232, 0.47051663],\n", " [0.89432851, 0.07652123],\n", " [0.01653439, 0.99436816],\n", " [0.1499065 , 0.10758634],\n", " [0.29317605, 0.76729661],\n", " [0.97237465, 0.25573414],\n", " [0.60399214, 0.72086801],\n", " [0.76951744, 0.75725811],\n", " [0.6663647 , 0.92848284]]),\n", " array([[0.07308657, 0.11122628],\n", " [0.92539686, 0.48536534],\n", " [0.43820312, 0.26151317],\n", " [0.74683167, 0.40646889],\n", " [0.92033983, 0.88312987]]))"]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["import numpy\n", "\n", "def tirage_alea(n_rues, n_med):\n", " return numpy.random.rand(n_rues, 2), numpy.random.rand(n_med, 2)\n", "\n", "X, Y = tirage_alea(10, 5)\n", "X, Y"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q2 : calculer tous les distances entre X et Y\n", "\n", "On ne peut pas utiliser la fonction `cdist` (pairwise distances) mais rien n'emp\u00eache de l'essayer."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[0.83198031, 0.96466894, 0.77179962, 0.85889438, 0.8461793 ],\n", " [0.57370393, 0.40531662, 0.22456834, 0.23536145, 0.57466486],\n", " [0.82197491, 0.41002286, 0.49221173, 0.36141496, 0.80702793],\n", " [0.88495069, 1.04168846, 0.84550633, 0.93752849, 0.91062519],\n", " [0.07690611, 0.86261363, 0.32681556, 0.66757055, 1.0931767 ],\n", " [0.69200264, 0.69223434, 0.52616512, 0.57965512, 0.63777092],\n", " [0.91082466, 0.23438729, 0.53420279, 0.27127587, 0.62954985],\n", " [0.80840816, 0.39845012, 0.48835731, 0.34532585, 0.3555345 ],\n", " [0.94993319, 0.31340719, 0.59626523, 0.351522 , 0.19644616],\n", " [1.00989466, 0.51327456, 0.70491577, 0.52817942, 0.25799275]])"]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["from scipy.spatial.distance import cdist\n", "\n", "cdist(X, Y)"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[0.83198031, 0.96466894, 0.77179962, 0.85889438, 0.8461793 ],\n", " [0.57370393, 0.40531662, 0.22456834, 0.23536145, 0.57466486],\n", " [0.82197491, 0.41002286, 0.49221173, 0.36141496, 0.80702793],\n", " [0.88495069, 1.04168846, 0.84550633, 0.93752849, 0.91062519],\n", " [0.07690611, 0.86261363, 0.32681556, 0.66757055, 1.0931767 ],\n", " [0.69200264, 0.69223434, 0.52616512, 0.57965512, 0.63777092],\n", " [0.91082466, 0.23438729, 0.53420279, 0.27127587, 0.62954985],\n", " [0.80840816, 0.39845012, 0.48835731, 0.34532585, 0.3555345 ],\n", " [0.94993319, 0.31340719, 0.59626523, 0.351522 , 0.19644616],\n", " [1.00989466, 0.51327456, 0.70491577, 0.52817942, 0.25799275]])"]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["def pairwise_distance(X, Y):\n", " res = numpy.empty((X.shape[0], Y.shape[0]))\n", " for i in range(X.shape[0]):\n", " for j in range(Y.shape[0]):\n", " d = (X[i, :] - Y[j, :]) ** 2\n", " res[i, j] = d.sum() ** 0.5\n", " return res\n", "\n", "pairwise_distance(X, Y)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q3 : \u00e9crire un test unitaire qui v\u00e9rifie le r\u00e9sultat lorsque X est la matrice identit\u00e9 2x2"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": ["from numpy.testing import assert_allclose\n", "\n", "def test_distance():\n", " x = numpy.identity(2)\n", " r = pairwise_distance(x, x)\n", " assert_allclose(r, numpy.array([[0, 2**0.5], [2**0.5, 0]]))\n", " # on peut \u00e9crire \u00e9galement\n", " assert r.tolist() == [[0, 2**0.5], [2**0.5, 0]]\n", "\n", " \n", "test_distance()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q4 : quel est le co\u00fbt de l'algorithme en fonction des dimensions de X et Y ?"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Le co\u00fbt est $O(nm)$ o\u00f9 $n$ est le nombre de lignes de $X$ et $m$ le nombre de lignes de $Y$."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q5 : Ecrire une fonction qui retourne l'indice du m\u00e9decin le plus proche pour chaque rue"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([0, 0, 0, 0, 1], dtype=int64)"]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["def plus_proche_medecin(X, Y):\n", " dist = pairwise_distance(X, Y)\n", " return numpy.argmin(dist, axis=1)\n", "\n", "plus_proche_medecin(X[:5], Y[:2])"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q6 : on cr\u00e9e une grille 10x10 pour quadriller l'espace. Ecrire une fonction qui calcule les coordonn\u00e9es de la grille pour un point donn\u00e9."]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": ["def grille(p, p_min, p_max, n):\n", " c = numpy.floor((p - p_min) / (p_max - p_min) * n) / n\n", " return c\n", "\n", "def test_grille():\n", " r = grille(numpy.array([[0.41, 0.71], [0.39, 0.99]]),\n", " numpy.array([0, 0]),\n", " numpy.array([1, 1]), 10)\n", " assert r.tolist() == [[0.4, 0.7], [0.3, 0.9]]\n", " \n", "test_grille()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q7 : \u00e9crire une fonction qui cr\u00e9\u00e9 un dictionaire { case_grille : indices }."]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": ["def map_grille_indices(x, n=10):\n", " xi = grille(x, x.min(axis=0), x.max(axis=0), n)\n", " res = {}\n", " for i in range(xi.shape[0]):\n", " key = tuple(xi[i, :])\n", " if key in res:\n", " res[key].append(i)\n", " else:\n", " res[key] = [i]\n", " return res\n", "\n", "def test_map():\n", " x = numpy.arange(8).reshape((-1, 2))\n", " d = map_grille_indices(x, 1)\n", " assert d == {(0.0, 0.0): [0, 1, 2], (1.0, 1.0): [3]}\n", " d = map_grille_indices(x, 2)\n", " assert d == {(0.0, 0.0): [0, 1], (0.5, 0.5): [2], (1.0, 1.0): [3]}\n", " d = map_grille_indices(x - 5, 2)\n", " assert d == {(0.0, 0.0): [0, 1], (0.5, 0.5): [2], (1.0, 1.0): [3]}\n", " \n", "test_map()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q8 : modifier la fonction pairwise_distance pour ne calculer que les distances des points dans la m\u00eame case"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": ["def pairwise_distance_grille(X, Y, n=10):\n", " gx = map_grille_indices(X, n)\n", " gy = map_grille_indices(Y, n)\n", " res = numpy.empty((X.shape[0], Y.shape[0]))\n", " res[:, :] = 1e6\n", " for key, indices in gx.items():\n", " if key not in gy:\n", " continue\n", " for i in indices:\n", " for j in gy[key]:\n", " d = (X[i, :] - Y[j, :]) ** 2\n", " res[i, j] = d.sum() ** 0.5\n", " return res\n", "\n", "def test_distance_grille():\n", " x = numpy.random.rand(4, 2)\n", " y = numpy.random.rand(2, 2)\n", " r1 = pairwise_distance(x, y)\n", " r2 = pairwise_distance_grille(x, y, n=2)\n", " assert r1.shape == r2.shape\n", " ind = r2.ravel() != 1e6\n", " assert r1.ravel()[ind].tolist() == r2.ravel()[ind].tolist()\n", "\n", " \n", "test_distance_grille()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q9 : Mesurer le temps pour deux matrices 1000x2, 100x2 ?"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["415 ms \u00b1 23.9 ms per loop (mean \u00b1 std. dev. of 7 runs, 1 loop each)\n"]}], "source": ["X, Y = tirage_alea(1000, 100)\n", "\n", "%timeit pairwise_distance(X, Y)"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["6.41 ms \u00b1 470 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 100 loops each)\n"]}], "source": ["%timeit pairwise_distance_grille(X, Y)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On peut \u00e9galement l'\u00e9crire comme ceci :"]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/plain": ["[0.8018622000236064,\n", " 0.8201483000302687,\n", " 0.8459657999919727,\n", " 0.8035789999994449,\n", " 0.8155568999936804]"]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["import timeit\n", "\n", "timeit.repeat(\"pairwise_distance(X, Y)\", globals=globals(), repeat=5, number=2)"]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"data": {"text/plain": ["[0.026652099972125143,\n", " 0.01595849997829646,\n", " 0.01231670001288876,\n", " 0.012164799962192774,\n", " 0.011980499955825508]"]}, "execution_count": 14, "metadata": {}, "output_type": "execute_result"}], "source": ["timeit.repeat(\"pairwise_distance_grille(X, Y)\", globals=globals(), repeat=5, number=2)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["La seconde fonction est plus rapide car on ne calcule que les distances dans paires de points dans une m\u00eame case. Si on suppose que chaque case contient le m\u00eame nombre de points, $P$, il y a $G^2=100$ cases, le calcul est quasiment $G^2=100$ fois plus rapide. Il n'est pas tout-\u00e0-fait 100 fois plus rapide car il faut parcourir calculer la case de chaque point et les indexer. Mais cette op\u00e9ration est en $O(N+M)$ o\u00f9 $N$ est le nombre de rues et $M$ le nombre de m\u00e9decins.\n", "\n", "Ce raisonnement fonctionne si chaque case contient approximativement le m\u00eame nombre de points ce qui le cas ici car les points ont \u00e9t\u00e9 tir\u00e9s selon une loi uniforme. Mais imaginons qu'il y ait des points aberrants et que le minimum et maximum des coordonn\u00e9es calcul\u00e9s pour construire la grille soient tr\u00e8s grands ; dans ce cas, les points seront tous situ\u00e9s dans la m\u00eame case \u00e0 part les points extr\u00eames. L'optimisation propos\u00e9e ici ne fonctionnera pas \u00e0 moins de changer la fa\u00e7on de construire la grille en raisonnant par quantile par exemple."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q10 : le r\u00e9sultat est utilis\u00e9 pour retourner le m\u00e9decin le plus proche ? Avec la seconde fonction, o\u00f9 les r\u00e9sultats sont-ils faux et que faire pour corriger ?"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([85, 57, 86, 18, 50, 45, 61, 50, 90, 6], dtype=int64)"]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["def plus_proche_medecin(X, Y):\n", " dist = pairwise_distance(X, Y)\n", " return numpy.argmin(dist, axis=1)\n", "\n", "plus_proche_medecin(X, Y)[:10]"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([79, 57, 86, 0, 50, 45, 0, 50, 0, 6], dtype=int64)"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["def plus_proche_medecin_grille(X, Y):\n", " dist = pairwise_distance_grille(X, Y)\n", " return numpy.argmin(dist, axis=1)\n", "\n", "plus_proche_medecin_grille(X, Y)[:10]"]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.443"]}, "execution_count": 17, "metadata": {}, "output_type": "execute_result"}], "source": ["m = plus_proche_medecin(X, Y) \n", "mg = plus_proche_medecin_grille(X, Y)\n", "sum(m == mg) / m.shape[0]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Les deux fonctions co\u00efncident dans 50% des cas seulement. C'est attendu car on ne cherche les m\u00e9decins les plus proches dans la m\u00eame case que celle de la rue. Pr\u00e8s de la fronti\u00e8re d'une case, il est fort possible que le m\u00e9decin le plus proche soit de l'autre c\u00f4t\u00e9. Pour corriger cela, il faudrait regarder les m\u00e9decins les plus proches dans les cases voisines \u00e9galement, soit 9 cases au lieu de 1. L'algorithme n'irait plus 100 fois plus vite mais seulement 10 fois plus vite. Cela reste n\u00e9anmoins un gain significatif. Il serait sans doute possible de faire encore mieux en regardant que les cases voisines utiles, 4 plut\u00f4t que 9 mais ce n'est pas la seule optimisation possible.\n", "\n", "L'id\u00e9e de la grille est de limiter le nombre de calculs inutiles. Il ne sert \u00e0 rien de calculer la distance d'une rue et d'un m\u00e9decin tr\u00e8s \u00e9loign\u00e9s. Il existe d'autres techniques comme celles pr\u00e9sent\u00e9s dans [k-nearest neighbors algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) ou [Closest pair of points problem](https://en.wikipedia.org/wiki/Closest_pair_of_points_problem)."]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5"}}, "nbformat": 4, "nbformat_minor": 2}