{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 1A.data - Visualisation des donn\u00e9es\n", "\n", "Les tableaux et les graphes sont deux outils incontournables des statisticiens. Petite revue des graphes."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": ["%matplotlib inline"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Cette instruction fait appara\u00eetre les graphes dans le notebook. Si ce n'est pas le cas, il faut la r\u00e9ex\u00e9cuer. Les deux lignes suivantes permettent de v\u00e9rifier o\u00f9 matplotlib a pr\u00e9vu d'afficher ses r\u00e9sultats. Pour un notebook, cela doit \u00eatre ``'nbAgg'`` ou ``'module://ipykernel.pylab.backend_inline'``."]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/plain": ["'module://matplotlib_inline.backend_inline'"]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["import matplotlib\n", "matplotlib.get_backend()"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/plain": ["'module://matplotlib_inline.backend_inline'"]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["import matplotlib.pyplot as plt\n", "import matplotlib\n", "matplotlib.get_backend()"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Matplotlib, pandas"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Loading the data\n", "\n", "We use the data published on the INSEE website: [Naissance, d\u00e9c\u00e8s, mariages 2012](http://www.insee.fr/fr/themes/detail.asp?ref_id=fd-etatcivil2012&page=fichiers_detail/etatcivil2012/doc/documentation.htm). The goal is to load the list of marriages recorded in 2012 and to plot the number of marriages as a function of the age gap between the spouses."]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/plain": ["((246123, 16),\n", " Index(['ANAISH', 'DEPNAISH', 'INDNATH', 'ETAMATH', 'ANAISF', 'DEPNAISF',\n", " 'INDNATF', 'ETAMATF', 'AMAR', 'MMAR', 'JSEMAINE', 'DEPMAR', 'DEPDOM',\n", " 'TUDOM', 'TUCOM', 'NBENFCOM'],\n", " dtype='object'))"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["from urllib.error import URLError\n", "import pyensae.datasource\n", "from pyensae.datasource import dBase2df, DownloadDataException\n", "files = [\"etatcivil2012_nais2012_dbase.zip\",\n", "         \"etatcivil2012_dec2012_dbase.zip\",\n", "         \"etatcivil2012_mar2012_dbase.zip\" ]\n", "\n", "# only the last file (marriages) is downloaded\n", "try:\n", "    pyensae.datasource.download_data(files[-1], \n", "                        website='http://telechargement.insee.fr/fichiersdetail/etatcivil2012/dbase/')\n", "except (DownloadDataException, URLError, TimeoutError):\n", "    # backup plan\n", "    pyensae.datasource.download_data(files[-1], website=\"xd\")\n", "\n", "df = dBase2df(\"mar2012.dbf\")\n", "df.shape, df.columns"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/html": ["
"], "text/plain": [" ANAISH DEPNAISH INDNATH ETAMATH ANAISF DEPNAISF INDNATF ETAMATF AMAR MMAR \\\n", "0 1982 75 1 1 1984 99 2 1 2012 01 \n", "1 1956 69 2 4 1969 99 2 4 2012 01 \n", "2 1982 99 2 1 1992 99 1 1 2012 01 \n", "3 1985 99 2 1 1987 84 1 1 2012 01 \n", "4 1968 99 2 1 1963 99 2 1 2012 01 \n", "\n", " JSEMAINE DEPMAR DEPDOM TUDOM TUCOM NBENFCOM \n", "0 1 29 99 9 N \n", "1 3 75 99 9 N \n", "2 5 34 99 9 N \n", "3 4 13 99 9 N \n", "4 6 26 99 9 N "]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["df.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On r\u00e9cup\u00e8re de la m\u00eame mani\u00e8re la signification des variables :"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/plain": ["((16, 4), Index(['VARIABLE', 'LIBELLE', 'TYPE', 'LONGUEUR'], dtype='object'))"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["from pyensae.datasource import dBase2df\n", "vardf = dBase2df(\"varlist_mariages.dbf\")\n", "vardf.shape, vardf.columns"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/html": ["
"], "text/plain": [" VARIABLE LIBELLE TYPE \\\n", "0 AMAR Ann\u00e9e du mariage CHAR \n", "1 ANAISF Ann\u00e9e de naissance de l'\u00e9pouse CHAR \n", "2 ANAISH Ann\u00e9e de naissance de l'\u00e9poux CHAR \n", "3 DEPDOM\u00a0 D\u00e9partement de domicile apr\u00e8s le mariage CHAR \n", "4 DEPMAR D\u00e9partement de mariage CHAR \n", "5 DEPNAISF D\u00e9partement de naissance de l'\u00e9pouse CHAR \n", "6 DEPNAISH D\u00e9partement de naissance de l'\u00e9poux CHAR \n", "7 ETAMATF \u00c9tat matrimonial ant\u00e9rieur de l'\u00e9pouse CHAR \n", "8 ETAMATH \u00c9tat matrimonial ant\u00e9rieur de l'\u00e9poux CHAR \n", "9 INDNATF Indicateur de nationalit\u00e9 de l'\u00e9pouse CHAR \n", "10 INDNATH Indicateur de nationalit\u00e9 de l'\u00e9poux CHAR \n", "11 JSEMAINE Jour du mariage dans la semaine CHAR \n", "12 MMAR Mois du mariage CHAR \n", "13 NBENFCOM Enfants en commun avant le mariage CHAR \n", "14 TUCOM Tranche de commune du lieu de domicile des \u00e9poux CHAR \n", "15 TUDOM Tranche d'unit\u00e9 urbaine du lieu de domicile de... CHAR \n", "\n", " LONGUEUR \n", "0 4 \n", "1 4 \n", "2 4 \n", "3 3 \n", "4 3 \n", "5 3 \n", "6 3 \n", "7 1 \n", "8 1 \n", "9 1 \n", "10 1 \n", "11 1 \n", "12 2 \n", "13 1 \n", "14 1 \n", "15 1 "]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["vardf"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Exercice 1 : \u00e9cart entre les mari\u00e9s\n", "\n", "1. En ajoutant une colonne et en utilisant l'op\u00e9ration [group by](http://pandas.pydata.org/pandas-docs/stable/groupby.html), on veut obtenir la distribution du nombre de mariages en fonction de l'\u00e9cart entre les mari\u00e9s. Au besoin, on changera le type d'une colone ou deux.\n", "2. On veut tracer un nuage de points avec en abscisse l'\u00e2ge du mari, en ordonn\u00e9e, l'\u00e2ge de la femme. 
You may need to look at the documentation of the [plot](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html) method."]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": ["# df[\"colonne\"] = df[\"colonne\"].astype(int)  # to change the type of a column\n", "# df[\"difference\"] = ..."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Exercise 2: plotting the distribution with pandas\n", "\n", "``pandas`` offers a range of standard plots that are easy to produce. We want to display the distribution as a histogram. Pick the most suitable plot from the [Visualization](http://pandas.pydata.org/pandas-docs/stable/visualization.html) page."]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": ["# df.plot(...)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### matplotlib\n", "\n", "[matplotlib](http://matplotlib.org/) is the module [pandas](http://pandas.pydata.org/) relies on for plotting. The [plot](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html) method therefore returns an [Axes](http://matplotlib.org/api/axes_api.html#module-matplotlib.axes) object that can be modified afterwards with [these methods](http://matplotlib.org/api/pyplot_summary.html). You can add a title with [set_title](http://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes.set_title) or a grid with [grid](http://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes.grid). You can also draw [two curves on the same plot](http://stackoverflow.com/questions/19941685/how-to-show-a-bar-and-line-graph-on-the-same-plot), or [change the font size](http://stackoverflow.com/questions/12444716/how-do-i-set-figure-title-and-axes-labels-font-size-in-matplotlib). 
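
As a minimal sketch of the idea that the Axes returned by plot can be adjusted afterwards, the snippet below uses a small made-up table. The DataFrame ``toy`` and its values are hypothetical and only stand in for the marriage data.

```python
import pandas as pd

# Hypothetical toy data standing in for the marriage counts
toy = pd.DataFrame({'DEPMAR': ['75', '13', '69'], 'nb': [300, 200, 100]})

# DataFrame.plot returns a matplotlib Axes object
ax = toy.plot(x='DEPMAR', y='nb', kind='bar', figsize=(6, 3))

# The Axes can then be modified: title, grid, ...
ax.set_title('toy example', fontsize=14)
ax.grid(True)
```

Keeping a reference to ``ax`` is what makes the later adjustments possible. The same object can also be passed to another plotting call through its ``ax`` parameter to overlay a second series.
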
The code below plots the number of marriages per department."]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": ["
"]}, "metadata": {"needs_background": "light"}, "output_type": "display_data"}], "source": ["df[\"nb\"] = 1\n", "dep = df[[\"DEPMAR\",\"nb\"]].groupby(\"DEPMAR\", as_index=False).sum().sort_values(\"nb\",ascending=False)\n", "ax = dep.plot(kind = \"bar\", figsize=(14,6))\n", "ax.set_xlabel(\"d\u00e9partements\", fontsize=16)\n", "ax.set_title(\"nombre de mariages par d\u00e9partements\", fontsize=16)\n", "ax.legend().set_visible(False) # on supprime la l\u00e9gende\n", "\n", "# on change la taille de police de certains labels\n", "for i,tick in enumerate(ax.xaxis.get_major_ticks()):\n", " if i > 10 :\n", " tick.label.set_fontsize(8)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Quand on ne sait pas, le plus simple est d'utiliser un moteur de recherche avec un requ\u00eate du type : ``matplotlib + requ\u00eate``. Pour cr\u00e9er un graphique, le plus courant est de choisir le graphique le plus ressemblant d'une [gallerie de graphes](http://matplotlib.org/gallery.html) puis de l'adapter \u00e0 vos donn\u00e9es."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Exercice 3 : distribution des mariages par jour\n", " \n", "On veut obtenir un graphe qui contient l'histogramme de la distribution du nombre de mariages par jour de la semaine et d'ajouter une seconde courbe correspond avec un second axe \u00e0 la r\u00e9partition cumul\u00e9e."]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["## R\u00e9seaux, graphes\n", "\n", "### networkx\n", "\n", "Le module [networkx](https://networkx.github.io/) permet de repr\u00e9senter un r\u00e9seau ou un graphe de petite taille (< 500 noeuds). Un graphe est d\u00e9fini par un ensemble de noeuds (ou *vertex* en anglais) reli\u00e9s par des arcs (ou *edge* en anglais). 
The [gallery](http://networkx.github.io/documentation/latest/gallery.html) gives an idea of what the module can do."]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": ["
"]}, "metadata": {}, "output_type": "display_data"}], "source": ["import random\n", "import networkx as nx\n", "G=nx.Graph()\n", "for i in range(15) :\n", " G.add_edge ( random.randint(0,5), random.randint(0,5) )\n", "\n", "import matplotlib.pyplot as plt\n", "f, ax = plt.subplots(figsize=(8,4))\n", "nx.draw(G, ax = ax)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Graphviz\n", "\n", "[Graphviz](http://www.graphviz.org/) est un outil d\u00e9velopp\u00e9 depuis plusieurs ann\u00e9es d\u00e9j\u00e0 qui permet de r\u00e9pr\u00e9senter des graphes plus cons\u00e9quents (> 500 noeuds). Il propose un choix plus riche de graphes : [gallerie](http://www.graphviz.org/Gallery.php). Il est utilisable via le module [graphviz](https://pypi.python.org/pypi/graphviz). Son installation requiert l'installation de l'outil [Graphviz](http://www.graphviz.org/) qui n'est pas inclus. La diff\u00e9rence entre les deux modules tient dans l'algorithme utilis\u00e9 pour assigner des coordonn\u00e9es \u00e0 chaque noeud du graphe de fa\u00e7on \u00e0 ce que ses arcs se croisent le moins possibles. Au del\u00e0 d'une certaine taille, le dessin de graphe n'est plus lisible et n\u00e9cessite quelques tat\u00f4nnements. 
This may involve clustering the graph (see the [Louvain method](http://perso.uclouvain.be/vincent.blondel/research/louvain.html)) in order to give nearby nodes the same color, or even to merge them."]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": [""]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["import os\n", "import random\n", "from graphviz import Digraph\n", "from IPython.display import Image\n", "from pyquickhelper.helpgen import find_graphviz_dot\n", "\n", "# make sure the folder containing the dot executable is on the PATH\n", "dot_dir = os.path.dirname(find_graphviz_dot())\n", "if dot_dir not in os.environ[\"PATH\"]:\n", "    os.environ[\"PATH\"] = os.environ[\"PATH\"] + os.pathsep + dot_dir\n", "\n", "dot = Digraph(comment='random graph', format=\"png\")\n", "for i in range(15):\n", "    dot.edge(str(random.randint(0, 5)), str(random.randint(0, 5)))\n", "\n", "img = dot.render('t_random_graph.gv')\n", "Image(img)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Exercise 4: drawing a graph with networkx\n", " \n", "We build a random graph whose 20 edges are obtained by drawing a pair of integers between 1 and 10, 20 times. Each edge must have a random width. Look at the functions [spring_layout](https://networkx.github.io/documentation/latest/reference/generated/networkx.drawing.layout.spring_layout.html?highlight=spring_layout#networkx.drawing.layout.spring_layout), [draw_networkx_nodes](https://networkx.github.io/documentation/latest/reference/generated/networkx.drawing.nx_pylab.draw_networkx_nodes.html?highlight=draw_networkx_nodes#networkx.drawing.nx_pylab.draw_networkx_nodes), [draw_networkx_edges](https://networkx.github.io/documentation/latest/reference/generated/networkx.drawing.nx_pylab.draw_networkx_edges.html?highlight=draw_networkx_edges#networkx.drawing.nx_pylab.draw_networkx_edges). 
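
As a hedged starting point, the sketch below shows how these three functions fit together on a fixed toy graph with hand-picked widths. The path graph, the seed and the width values are assumptions chosen for illustration; the random graph and random widths of the exercise are left to the reader.

```python
import matplotlib.pyplot as plt
import networkx as nx

# Hypothetical toy graph: 0 - 1 - 2 - 3
G = nx.path_graph(4)

# spring_layout returns one (x, y) position per node
pos = nx.spring_layout(G, seed=0)

# one width per edge, in the order of G.edges()
widths = [1, 3, 5]

fig, ax = plt.subplots(figsize=(4, 3))
nx.draw_networkx_nodes(G, pos, ax=ax)
nx.draw_networkx_edges(G, pos, width=widths, ax=ax)
ax.set_axis_off()
```

Drawing nodes and edges in two separate calls is what allows each edge to receive its own width.
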
The [gallery](https://networkx.github.io/documentation/latest/gallery.html) can help too."]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5"}}, "nbformat": 4, "nbformat_minor": 2}