{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 1A.data - DataFrame et Matrice\n", "\n", "Les [DataFrame](http://en.wikipedia.org/wiki/Data_frame) se sont impos\u00e9s pour manipuler les donn\u00e9es avec le module [pandas](http://pandas.pydata.org/). Le module va de la manipulation des donn\u00e9es jusqu'au calcul d'une r\u00e9gresion lin\u00e9aire."]}, {"cell_type": "markdown", "metadata": {}, "source": ["Avec cette fa\u00e7on de repr\u00e9senter les donn\u00e9es, associ\u00e9e \u00e0 des un ensemble de m\u00e9thodes couramment utilis\u00e9es, ce qu'on faisait en une ou deux boucles se fait maintenant en une seule fonction. Cette s\u00e9ance contient beaucoup d'exemples et peu d'exercices. Il est conseill\u00e9 de supprimer toutes les sorties et de les ex\u00e9cuter une \u00e0 une."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["L'introduction ne contient pas d'\u00e9l\u00e9ments n\u00e9cessaires \u00e0 la r\u00e9alisation du TD."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Trouver chaussure \u00e0 ses stats\n", "\n", "La programmation est omni-pr\u00e9sente lorsqu'on manipule des donn\u00e9es. On leur applique des traitements parfois standards, souvent adapt\u00e9s pour la circonstance. On souhaite toujours programmer le moins possible mais aussi ne pas avoir \u00e0 r\u00e9apprendre un langage \u00e0 chaque fois qu'on doit manipuler les donn\u00e9es.\n", "\n", "Le logiciel [MATLAB](http://www.mathworks.fr/products/matlab/) a propos\u00e9 voici 30 ans un premier environnement de travail facilitant le calcul matriciel et ce standard s'est impos\u00e9 depuis. Comme *MATLAB* est un logiciel payant, des \u00e9quivalents open source et gratuits ont \u00e9t\u00e9 d\u00e9velopp\u00e9s. Ils proposent tous le calcul matriciel, la possibilit\u00e9 de visualiser, un environnement de d\u00e9veloppement. Ils diff\u00e9rent pas des performances diff\u00e9rentes et des \u00e9ventails d'extensions diff\u00e9rentes. \n", "\n", "* [R](http://www.r-project.org/) : la r\u00e9f\u00e9rence pour les statisticiens, il est utilis\u00e9 par tous les chercheurs dans ce domaine.\n", "* [SciLab](http://www.scilab.org/fr) : d\u00e9velopp\u00e9 par l'[INRIA](http://www.inria.fr/).\n", "* [Octave](http://www.gnu.org/software/octave/) : clone open source de *MATLAB*, il n'inclut pas autant de librairies mais il est gratuit.\n", "* [Julia](http://julialang.org/) : c'est le plus jeune, il est plus rapide mais ses librairies sont moins nombreuses.\n", "\n", "Ils sont tous performants en qui concerne le calcul num\u00e9rique, ils le sont beaucoup moins lorsqu'il s'agit de faire des traitements qui ne sont pas num\u00e9riques (traiter du texte par exemple) car ils n'ont pas \u00e9t\u00e9 pr\u00e9vus pour cela \u00e0 la base (\u00e0 l'exception de Julia peut \u00eatre qui est plus jeune [Python v. Clojure v. Julia](http://matthewrocklin.com/blog/work/2014/01/13/Text-Benchmarks)). Le langage Python est devenu depuis 2012 une alternative int\u00e9ressante pour ces raisons (voir \u00e9galement [Why Python?](http://www.xavierdupre.fr/blog/2014-07-11_nojs.html)) :\n", "\n", "* Il propose les m\u00eame fonctionnalit\u00e9s de base (calcul matriciel, graphiques, environnement).\n", "* Python est plus pratique pour tout ce qui n'est pas num\u00e9rique (fichiers, web, server web, SQL, ...).\n", "* La plupart des librairies connues et \u00e9crites en C++ ont \u00e9t\u00e9 port\u00e9e sous Python.\n", "* Il est plus facile de changer un composant important en Python (numpy par exemple) si le nouveau est plus efficace.\n", "\n", "Un inconv\u00e9nient peut-\u00eatre est qu'il faut installer plusieurs extensions avant de pouvoir commencer \u00e0 travailler (voir [Installation de Python](http://www.xavierdupre.fr/app/ensae_teaching_cs/helpsphinx/data2a.html#usage-regulier)) :\n", "\n", "* [numpy](http://www.numpy.org/) : calcul matriciel\n", "* [pandas](http://pandas.pydata.org/) : DataFrame\n", "* [jupyter](http://jupyter.org/) : notebooks (comme celui-ci)\n", "* [matplotlib](http://matplotlib.org/) : graphiques\n", "* [scikit-learn](http://scikit-learn.org/stable/) : machine learning, statistique descriptive\n", "* [statsmodels](http://statsmodels.sourceforge.net/) : statistiques descriptives\n", "\n", "Optionnels :\n", "\n", "* [Spyder](https://code.google.com/p/spyderlib/) : environnement type R, MATLAB, ...\n", "* [scipy](http://www.scipy.org/) : autres traitements num\u00e9riques (voir [NumPy vs. SciPy vs. other packages](http://www.scipy.org/scipylib/faq.html#what-is-the-difference-between-numpy-and-scipy))\n", "* [dask](http://dask.pydata.org/en/latest/) : dataframe distribu\u00e9 et capables de g\u00e9rer des gros volumes de donn\u00e9es (> 5Go)\n", "\n", "\n", "Les environnements Python \u00e9voluent tr\u00e8s vite, les modules mentionn\u00e9s ici sont tous maintenus mais il eut en surgir de nouveau tr\u00e8s rapidement. Quelques environnements \u00e0 suivre :\n", "\n", "* [Python Tools for Visual Studio](https://microsoft.github.io/PTVS/) : environnement de d\u00e9veloppement pour Visual Studio\n", "* [PyCharm](http://www.jetbrains.com/pycharm/) : n'inclut pas les graphiques mais est assez agr\u00e9able pour programmer\n", "* [IEP](http://www.iep-project.org/index.html) : \u00e9crit en Python\n", "* [PyDev](http://pydev.org/) : extension pour [Eclipse](http://www.eclipse.org/)\n", "* [WingIDE](https://wingware.com/)\n", "\n", "Si vous ne voulez pas programmer, il existe des alternatives. C'est assez performant sur de petits jeux de donn\u00e9es mais cela devient plus complexe d\u00e8s qu'on veut programmer car le code doit tenir compte des sp\u00e9cificit\u00e9s de l'outil. \n", "\n", "* [Orange](http://orange.biolab.si/) : \u00e9crit en Python\n", "* [Weka](http://www.cs.waikato.ac.nz/ml/weka/) : \u00e9crit en Java (le pionnier)\n", "* [dataiku](https://www.dataiku.com/) : startup fran\u00e7aise\n", "* [RapidMiner](http://rapidminer.com/) : version gratuite et payante\n", "* [AzureML](https://studio.azureml.net/) : solution Microsoft de workflow de donn\u00e9es\n", "\n", "C'est parfois plus pratique pour commencer mais mal commode si on veut automatiser un traitrment pour r\u00e9p\u00e9ter la m\u00eame t\u00e2che de fa\u00e7on r\u00e9guli\u00e8re. Pour les travaux pratiques \u00e0 l'ENSAE, j'ai choisi les [notebooks](http://jupyter.org/notebook.html) : c'est une page blanche o\u00f9 on peut m\u00e9langer texte, \u00e9quations, graphiques, code et ex\u00e9cution de code.\n", "\n", "**Taille de DataFrame**\n", "\n", "Les DataFrame en Python sont assez rapides lorsqu'il y a moins de 10 millions d'observations et que le fichier texte qui d\u00e9crit les donn\u00e9es n'est pas plus gros que 10 Mo. Au del\u00e0, il faut soit \u00eatre patient, soit \u00eatre astucieux comme ici : [DataFrame et SQL](http://www.xavierdupre.fr/blog/2014-07-19_nojs.html), [Data Wrangling with Pandas](http://nbviewer.ipython.org/urls/gist.github.com/fonnesbeck/5850413/raw/3a9406c73365480bc58d5e75bc80f7962243ba17/2.+Data+Wrangling+with+Pandas.ipynb).\n", "\n", "**Valeurs manquantes**\n", "\n", "Lorsqu'on r\u00e9cup\u00e8re des donn\u00e9es, il peut arriver qu'une valeur soit manquante. \n", "\n", "* [Missing Data](http://pandas.pydata.org/pandas-docs/dev/10min.html?highlight=working#missing-data)\n", "* [Working with missing data](http://pandas.pydata.org/pandas-docs/dev/missing_data.html)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## DataFrame (pandas)\n", "\n", "**Quelques liens :** [An Introduction to Pandas](http://synesthesiam.com/posts/an-introduction-to-pandas.html)\n", "\n", "Un [Data Frame](http://en.wikipedia.org/wiki/Data_frame) est un objet qui est pr\u00e9sent dans la plupart des logiciels de traitements de donn\u00e9es, c'est une **matrice**, chaque colonne est de m\u00eame type (nombre, dates, texte), elle peut contenir des valeurs manquantes. On peut consid\u00e9rer chaque colonne comme les variables d'une table ([pandas.Dataframe](http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html) - cette page contient toutes les m\u00e9thodes de la classe)."]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " date | \n", " devise | \n", " prix | \n", "
\n", " \n", " \n", " \n", " 0 | \n", " 2014-06-22 | \n", " euros | \n", " 220 | \n", "
\n", " \n", " 1 | \n", " 2014-06-23 | \n", " euros | \n", " 221 | \n", "
\n", " \n", "
\n", "
2 rows \u00d7 3 columns
\n", "
"], "text/plain": [" date devise prix\n", "0 2014-06-22 euros 220\n", "1 2014-06-23 euros 221\n", "\n", "[2 rows x 3 columns]"]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["import pandas\n", "l = [ { \"date\":\"2014-06-22\", \"prix\":220.0, \"devise\":\"euros\" }, \n", " { \"date\":\"2014-06-23\", \"prix\":221.0, \"devise\":\"euros\" },]\n", "df = pandas.DataFrame(l)\n", "df"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Avec une valeur manquante :"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " date | \n", " devise | \n", " prix | \n", "
\n", " \n", " \n", " \n", " 0 | \n", " 2014-06-22 | \n", " euros | \n", " 220 | \n", "
\n", " \n", " 1 | \n", " 2014-06-23 | \n", " euros | \n", " NaN | \n", "
\n", " \n", "
\n", "
2 rows \u00d7 3 columns
\n", "
"], "text/plain": [" date devise prix\n", "0 2014-06-22 euros 220\n", "1 2014-06-23 euros NaN\n", "\n", "[2 rows x 3 columns]"]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["l = [ { \"date\":\"2014-06-22\", \"prix\":220.0, \"devise\":\"euros\" }, \n", " { \"date\":\"2014-06-23\", \"devise\":\"euros\" },]\n", "df = pandas.DataFrame(l)\n", "df"]}, {"cell_type": "markdown", "metadata": {}, "source": ["[NaN](http://docs.scipy.org/doc/numpy/reference/generated/numpy.isnan.html#numpy.isnan) est une convention pour une valeur manquante. On extrait la variable ``prix`` :"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/plain": ["0 220\n", "1 NaN\n", "Name: prix, dtype: float64"]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["df.prix"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Ou :"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/plain": ["0 220\n", "1 NaN\n", "Name: prix, dtype: float64"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["df[\"prix\"]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Pour extraire plusieurs colonnes :"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " date | \n", " prix | \n", "
\n", " \n", " \n", " \n", " 0 | \n", " 2014-06-22 | \n", " 220 | \n", "
\n", " \n", " 1 | \n", " 2014-06-23 | \n", " NaN | \n", "
\n", " \n", "
\n", "
2 rows \u00d7 2 columns
\n", "
"], "text/plain": [" date prix\n", "0 2014-06-22 220\n", "1 2014-06-23 NaN\n", "\n", "[2 rows x 2 columns]"]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["df [[\"date\",\"prix\"]]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Pour prendre la transpos\u00e9e (voir aussi [DataFrame.transpose](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Panel.transpose.html)) :"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " 0 | \n", " 1 | \n", "
\n", " \n", " \n", " \n", " date | \n", " 2014-06-22 | \n", " 2014-06-23 | \n", "
\n", " \n", " devise | \n", " euros | \n", " euros | \n", "
\n", " \n", " prix | \n", " 220 | \n", " NaN | \n", "
\n", " \n", "
\n", "
3 rows \u00d7 2 columns
\n", "
"], "text/plain": [" 0 1\n", "date 2014-06-22 2014-06-23\n", "devise euros euros\n", "prix 220 NaN\n", "\n", "[3 rows x 2 columns]"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["df.T"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Lecture et \u00e9criture de DataFrame\n", "\n", "Aujourd'hui, on n'a plus besoin de r\u00e9\u00e9crire soi-m\u00eame une fonction de lecture ou d'\u00e9criture de donn\u00e9es pr\u00e9sent\u00e9es sous forme de tables. Il existe des fonctions plus g\u00e9n\u00e9riques qui g\u00e8re un grand nombre de cas.\n", "Cette section pr\u00e9sente bri\u00e8vement les fonctions qui permettent de lire/\u00e9crire un DataFrame aux formats texte/Excel. On reprend l'exemple de section pr\u00e9c\u00e9dente. L'instruction ``encoding=utf-8`` n'est pas obligatoire mais conseill\u00e9e lorsque les donn\u00e9es contiennent des accents (voir [read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html))."]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["date\tdevise\tprix\n", "2014-06-22\teuros\t220.0\n", "2014-06-23\teuros\t221.0\n", "\n"]}], "source": ["import pandas\n", "l = [ { \"date\":\"2014-06-22\", \"prix\":220.0, \"devise\":\"euros\" }, \n", " { \"date\":\"2014-06-23\", \"prix\":221.0, \"devise\":\"euros\" },]\n", "df = pandas.DataFrame(l)\n", "\n", "# \u00e9criture au format texte\n", "df.to_csv(\"exemple.txt\",sep=\"\\t\",encoding=\"utf-8\", index=False)\n", "\n", "# on regarde ce qui a \u00e9t\u00e9 enregistr\u00e9\n", "with open(\"exemple.txt\", \"r\", encoding=\"utf-8\") as f : text = f.read()\n", "print(text)\n", "\n", "# on enregistre au format Excel\n", "df.to_excel(\"exemple.xlsx\", index=False)\n", "\n", "# on ouvre Excel sur ce fichier (sous Windows)\n", "from pyquickhelper.loghelper import run_cmd\n", "from pyquickhelper.loghelper.run_cmd import skip_run_cmd\n", "out,err = run_cmd(\"exemple.xlsx\", wait = False)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On peut r\u00e9cup\u00e9rer des donn\u00e9es directement depuis Internet ou une cha\u00eene de caract\u00e8res et afficher le d\u00e9but ([head](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html)) ou la fin ([tail](http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.tail.html)). Le code qui suit est ce qu'on \u00e9crirait d'habitude :"]}, {"cell_type": "code", "execution_count": 9, "metadata": {"collapsed": true}, "outputs": [], "source": ["if False:\n", " import pandas, urllib.request\n", " furl = urllib.request.urlopen(\"http://www.xavierdupre.fr/enseignement/complements/marathon.txt\")\n", " df = pandas.read_csv(furl, sep=\"\\t\", names=[\"ville\", \"annee\", \"temps\",\"secondes\"])\n", " df.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Et pout \u00e9viter les erreurs de connexion internet, les donn\u00e9es font partie int\u00e9grante du module :"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " ville | \n", " annee | \n", " temps | \n", " secondes | \n", "
\n", " \n", " \n", " \n", " 0 | \n", " PARIS | \n", " 2011 | \n", " 02:06:29 | \n", " 7589 | \n", "
\n", " \n", " 1 | \n", " PARIS | \n", " 2010 | \n", " 02:06:41 | \n", " 7601 | \n", "
\n", " \n", " 2 | \n", " PARIS | \n", " 2009 | \n", " 02:05:47 | \n", " 7547 | \n", "
\n", " \n", " 3 | \n", " PARIS | \n", " 2008 | \n", " 02:06:40 | \n", " 7600 | \n", "
\n", " \n", " 4 | \n", " PARIS | \n", " 2007 | \n", " 02:07:17 | \n", " 7637 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" ville annee temps secondes\n", "0 PARIS 2011 02:06:29 7589\n", "1 PARIS 2010 02:06:41 7601\n", "2 PARIS 2009 02:05:47 7547\n", "3 PARIS 2008 02:06:40 7600\n", "4 PARIS 2007 02:07:17 7637"]}, "execution_count": 11, "metadata": {}, "output_type": "execute_result"}], "source": ["from ensae_teaching_cs.data import marathon\n", "import pandas\n", "df = pandas.read_csv(marathon(filename=True), \n", " sep=\"\\t\", names=[\"ville\", \"annee\", \"temps\",\"secondes\"])\n", "df.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["La fonction [describe](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) permet d'en savoir un peu plus sur les colonnes num\u00e9riques de cette table."]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " annee | \n", " secondes | \n", "
\n", " \n", " \n", " \n", " count | \n", " 359.000000 | \n", " 359.000000 | \n", "
\n", " \n", " mean | \n", " 1989.754875 | \n", " 7933.660167 | \n", "
\n", " \n", " std | \n", " 14.028545 | \n", " 385.289830 | \n", "
\n", " \n", " min | \n", " 1947.000000 | \n", " 7382.000000 | \n", "
\n", " \n", " 25% | \n", " 1981.000000 | \n", " 7698.000000 | \n", "
\n", " \n", " 50% | \n", " 1991.000000 | \n", " 7820.000000 | \n", "
\n", " \n", " 75% | \n", " 2001.000000 | \n", " 8046.500000 | \n", "
\n", " \n", " max | \n", " 2011.000000 | \n", " 10028.000000 | \n", "
\n", " \n", "
\n", "
8 rows \u00d7 2 columns
\n", "
"], "text/plain": [" annee secondes\n", "count 359.000000 359.000000\n", "mean 1989.754875 7933.660167\n", "std 14.028545 385.289830\n", "min 1947.000000 7382.000000\n", "25% 1981.000000 7698.000000\n", "50% 1991.000000 7820.000000\n", "75% 2001.000000 8046.500000\n", "max 2011.000000 10028.000000\n", "\n", "[8 rows x 2 columns]"]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}], "source": ["df.describe()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### DataFrame et Index\n", "\n", "On d\u00e9signe g\u00e9n\u00e9ralement une colonne ou *variable* par son nom. Les lignes peuvent \u00eatre d\u00e9sign\u00e9es par un entier."]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " date | \n", " devise | \n", " prix | \n", "
\n", " \n", " \n", " \n", " 0 | \n", " 2014-06-22 | \n", " euros | \n", " 220 | \n", "
\n", " \n", " 1 | \n", " 2014-06-23 | \n", " euros | \n", " 221 | \n", "
\n", " \n", "
\n", "
2 rows \u00d7 3 columns
\n", "
"], "text/plain": [" date devise prix\n", "0 2014-06-22 euros 220\n", "1 2014-06-23 euros 221\n", "\n", "[2 rows x 3 columns]"]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["import pandas\n", "l = [ { \"date\":\"2014-06-22\", \"prix\":220.0, \"devise\":\"euros\" }, \n", " { \"date\":\"2014-06-23\", \"prix\":221.0, \"devise\":\"euros\" },]\n", "df = pandas.DataFrame(l)\n", "df"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On extrait une ligne ([loc](http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.loc.html)) :"]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"data": {"text/plain": ["date 2014-06-23\n", "devise euros\n", "prix 221\n", "Name: 1, dtype: object"]}, "execution_count": 14, "metadata": {}, "output_type": "execute_result"}], "source": ["df.iloc[1]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Mais il est possible d'utiliser une colonne ou plusieurs colonnes comme index ([set_index](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html)) :"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " devise | \n", " prix | \n", "
\n", " \n", " date | \n", " | \n", " | \n", "
\n", " \n", " \n", " \n", " 2014-06-22 | \n", " euros | \n", " 220 | \n", "
\n", " \n", " 2014-06-23 | \n", " euros | \n", " 221 | \n", "
\n", " \n", "
\n", "
2 rows \u00d7 2 columns
\n", "
"], "text/plain": [" devise prix\n", "date \n", "2014-06-22 euros 220\n", "2014-06-23 euros 221\n", "\n", "[2 rows x 2 columns]"]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["dfi = df.set_index(\"date\")\n", "dfi"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On peut maintenant d\u00e9signer une ligne par une date :"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/plain": ["devise euros\n", "prix 221\n", "Name: 2014-06-23, dtype: object"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["dfi.loc[\"2014-06-23\"]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Il est possible d'utiliser plusieurs colonnes comme index :"]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"data": {"text/plain": ["arrondissement 18\n", "Name: (dupr\u00e9, xavier), dtype: int64"]}, "execution_count": 17, "metadata": {}, "output_type": "execute_result"}], "source": ["df = pandas.DataFrame([ {\"pr\u00e9nom\":\"xavier\", \"nom\":\"dupr\u00e9\", \"arrondissement\":18}, \n", " {\"pr\u00e9nom\":\"cl\u00e9mence\", \"nom\":\"dupr\u00e9\", \"arrondissement\":15 } ])\n", "dfi = df.set_index([\"nom\",\"pr\u00e9nom\"])\n", "dfi.loc[\"dupr\u00e9\",\"xavier\"]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Si on veut changer l'index ou le supprimer ([reset_index](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html)) :"]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " | \n", " pr\u00e9nom | \n", "
\n", " \n", " nom | \n", " arrondissement | \n", " | \n", "
\n", " \n", " \n", " \n", " dupr\u00e9 | \n", " 18 | \n", " xavier | \n", "
\n", " \n", " 15 | \n", " cl\u00e9mence | \n", "
\n", " \n", "
\n", "
2 rows \u00d7 1 columns
\n", "
"], "text/plain": [" pr\u00e9nom\n", "nom arrondissement \n", "dupr\u00e9 18 xavier\n", " 15 cl\u00e9mence\n", "\n", "[2 rows x 1 columns]"]}, "execution_count": 18, "metadata": {}, "output_type": "execute_result"}], "source": ["dfi.reset_index(drop=False, inplace=True) \n", " # le mot-cl\u00e9 drop pour garder ou non les colonnes servant d'index\n", " # inplace signifie qu'on modifie l'instance et non qu'une copie est modifi\u00e9e\n", " # donc on peut aussi \u00e9crire dfi2 = dfi.reset_index(drop=False) \n", "dfi.set_index([\"nom\", \"arrondissement\"],inplace=True)\n", "dfi"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Les index sont particuli\u00e8rement utiles lorsqu'il s'agit de fusionner deux tables. Pour des petites tables, la plupart du temps, il est plus facile de s'en passer."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Notation avec le symbole ``:``\n", "\n", "Le symbole ``:`` d\u00e9signe une plage de valeurs."]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " ville | \n", " annee | \n", " temps | \n", " secondes | \n", "
\n", " \n", " \n", " \n", " 0 | \n", " PARIS | \n", " 2011 | \n", " 02:06:29 | \n", " 7589 | \n", "
\n", " \n", " 1 | \n", " PARIS | \n", " 2010 | \n", " 02:06:41 | \n", " 7601 | \n", "
\n", " \n", " 2 | \n", " PARIS | \n", " 2009 | \n", " 02:05:47 | \n", " 7547 | \n", "
\n", " \n", " 3 | \n", " PARIS | \n", " 2008 | \n", " 02:06:40 | \n", " 7600 | \n", "
\n", " \n", " 4 | \n", " PARIS | \n", " 2007 | \n", " 02:07:17 | \n", " 7637 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" ville annee temps secondes\n", "0 PARIS 2011 02:06:29 7589\n", "1 PARIS 2010 02:06:41 7601\n", "2 PARIS 2009 02:05:47 7547\n", "3 PARIS 2008 02:06:40 7600\n", "4 PARIS 2007 02:07:17 7637"]}, "execution_count": 19, "metadata": {}, "output_type": "execute_result"}], "source": ["from ensae_teaching_cs.data import marathon\n", "import pandas\n", "df = pandas.read_csv(marathon(filename=True), \n", " sep=\"\\t\", names=[\"ville\", \"annee\", \"temps\",\"secondes\"])\n", "df.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On peut s\u00e9lectionner un sous-ensemble de lignes :"]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " ville | \n", " annee | \n", " temps | \n", " secondes | \n", "
\n", " \n", " \n", " \n", " 3 | \n", " PARIS | \n", " 2008 | \n", " 02:06:40 | \n", " 7600 | \n", "
\n", " \n", " 4 | \n", " PARIS | \n", " 2007 | \n", " 02:07:17 | \n", " 7637 | \n", "
\n", " \n", " 5 | \n", " PARIS | \n", " 2006 | \n", " 02:08:03 | \n", " 7683 | \n", "
\n", " \n", "
\n", "
3 rows \u00d7 4 columns
\n", "
"], "text/plain": [" ville annee temps secondes\n", "3 PARIS 2008 02:06:40 7600\n", "4 PARIS 2007 02:07:17 7637\n", "5 PARIS 2006 02:08:03 7683\n", "\n", "[3 rows x 4 columns]"]}, "execution_count": 20, "metadata": {}, "output_type": "execute_result"}], "source": ["df[3:6]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On extrait la m\u00eame plage mais avec deux colonnes seulement :"]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " annee | \n", " temps | \n", "
\n", " \n", " \n", " \n", " 3 | \n", " 2008 | \n", " 02:06:40 | \n", "
\n", " \n", " 4 | \n", " 2007 | \n", " 02:07:17 | \n", "
\n", " \n", " 5 | \n", " 2006 | \n", " 02:08:03 | \n", "
\n", " \n", " 6 | \n", " 2005 | \n", " 02:08:02 | \n", "
\n", " \n", "
\n", "
4 rows \u00d7 2 columns
\n", "
"], "text/plain": [" annee temps\n", "3 2008 02:06:40\n", "4 2007 02:07:17\n", "5 2006 02:08:03\n", "6 2005 02:08:02\n", "\n", "[4 rows x 2 columns]"]}, "execution_count": 21, "metadata": {}, "output_type": "execute_result"}], "source": ["df.loc[3:6,[\"annee\",\"temps\"]]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Le m\u00eame code pour lequel on renomme les colonnes extraites :"]}, {"cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " year | \n", " time | \n", "
\n", " \n", " \n", " \n", " 3 | \n", " 2008 | \n", " 02:06:40 | \n", "
\n", " \n", " 4 | \n", " 2007 | \n", " 02:07:17 | \n", "
\n", " \n", " 5 | \n", " 2006 | \n", " 02:08:03 | \n", "
\n", " \n", " 6 | \n", " 2005 | \n", " 02:08:02 | \n", "
\n", " \n", "
\n", "
4 rows \u00d7 2 columns
\n", "
"], "text/plain": [" year time\n", "3 2008 02:06:40\n", "4 2007 02:07:17\n", "5 2006 02:08:03\n", "6 2005 02:08:02\n", "\n", "[4 rows x 2 columns]"]}, "execution_count": 22, "metadata": {}, "output_type": "execute_result"}], "source": ["sub = df.loc[3:6,[\"annee\",\"temps\"]]\n", "sub.columns = [\"year\",\"time\"]\n", "sub"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercice 1 : cr\u00e9er un fichier Excel\n", " \n", "On souhaite r\u00e9cup\u00e9rer les donn\u00e9es [donnees_enquete_2003_television.txt](http://www.xavierdupre.fr/enseignement/complements/donnees_enquete_2003_television.txt) (source : [INSEE](http://www.insee.fr/fr/themes/detail.asp?ref_id=fd-hdv03&page=fichiers_detail/HDV03/telechargement.htm)).\n", "\n", "* ``POIDSLOG`` : Pond\u00e9ration individuelle relative\n", "* ``POIDSF`` : Variable de pond\u00e9ration individuelle\n", "* ``cLT1FREQ`` : Nombre d'heures en moyenne pass\u00e9es \u00e0 regarder la t\u00e9l\u00e9vision \n", "* ``cLT2FREQ`` : Unit\u00e9 de temps utilis\u00e9e pour compter le nombre d'heures pass\u00e9es \u00e0 regarder la t\u00e9l\u00e9vision, cette unit\u00e9 est repr\u00e9sent\u00e9e par les quatre valeurs suivantes\n", " * 0 : non concern\u00e9\n", " * 1 : jour\n", " * 2 : semaine\n", " * 3 : mois \n", " \n", "Ensuite, on veut :\n", "\n", "1. Supprimer les colonnes vides\n", "2. Obtenir les valeurs distinctes pour la colonne ``cLT2FREQ``\n", "3. Modifier la matrice pour enlever les lignes pour lesquelles l'unit\u00e9 de temps (cLT2FREQ) n'est pas renseign\u00e9e ou \u00e9gale \u00e0 z\u00e9ro.\n", "4. Sauver le r\u00e9sultat au format Excel.\n", "\n", "Vous aurez peut-\u00eatre besoin des fonctions suivantes :\n", "\n", "* [numpy.isnan](http://docs.scipy.org/doc/numpy/reference/generated/numpy.isnan.html)\n", "* [DataFrame.apply](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html)\n", "* [DataFrame.fillna](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html) ou \n", "[DataFrame.isnull](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isnull.html)\n", "* [DataFrame.copy](http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.copy.html)"]}, {"cell_type": "code", "execution_count": 22, "metadata": {"collapsed": true}, "outputs": [], "source": ["import pandas, io\n", "# ..."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Manipuler un DataFrame : filtrer, union, sort, group by, join, pivot\n", "\n", "Si la structure *DataFrame* s'est impos\u00e9e, c'est parce qu'on effectue toujours les m\u00eames op\u00e9rations. Chaque fonction cache une boucle ou deux dont le co\u00fbt est pr\u00e9cis\u00e9 en fin de ligne :\n", "\n", "- **filter** : on s\u00e9lectionne un sous-ensemble de lignes qui v\u00e9rifie une condition $\\rightarrow O(n)$\n", "- **union** : concat\u00e9nation de deux jeux de donn\u00e9es $\\rightarrow O(n_1 + n_2)$\n", "- **sort** : tri $\\rightarrow O(n \\ln n)$\n", "- **group by** : grouper des lignes qui partagent une valeur commune $\\rightarrow O(n)$\n", "- **join** : fusionner deux jeux de donn\u00e9es en associant les lignes qui partagent une valeur commune $\\rightarrow \\in [O(n_1 + n_2), O(n_1 n_2)]$\n", "- **pivot** : utiliser des valeurs pr\u00e9sentes dans colonne comme noms de colonnes $\\rightarrow O(n)$\n", "\n", "Les 5 premi\u00e8res op\u00e9rations sont issues de la logique de manipulation des donn\u00e9es avec le langage [SQL](http://fr.wikipedia.org/wiki/Structured_Query_Language) (ou le logiciel [SAS](http://www.sas.com)). La derni\u00e8re correspond \u00e0 un [tableau crois\u00e9 dynamique](http://fr.wikipedia.org/wiki/Tableau_crois%C3%A9_dynamique). Pour illustrer ces op\u00e9rations, on prendre le DataFrame suivant :"]}, {"cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Index(['ville', 'annee', 'temps', 'secondes'], dtype='object')\n", "villes {'FUKUOKA', 'STOCKOLM', 'PARIS', 'CHICAGO', 'AMSTERDAM', 'BOSTON', 'BERLIN', 'LONDON', 'NEW YORK'}\n", "annee [1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956] ...\n"]}], "source": ["from ensae_teaching_cs.data import marathon\n", "import pandas\n", "df = pandas.read_csv(marathon(), sep=\"\\t\", names=[\"ville\", \"annee\", \"temps\",\"secondes\"])\n", "print(df.columns)\n", "print(\"villes\",set(df.ville))\n", "print(\"annee\",list(set(df.annee))[:10],\"...\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### filter\n", "\n", "Filter consiste \u00e0 s\u00e9lectionner un sous-ensemble de lignes du dataframe. Pour filter sur plusieurs conditions, il faut utiliser les op\u00e9rateurs logique & (et), | (ou), ~ (non) (voir [Mapping Operators to Functions](https://docs.python.org/3.4/library/operator.html#mapping-operators-to-functions)).\n", "\n", "* [filter](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.filter.html), [mask](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mask.html),[where](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html)\n", "* [pandas: filter rows of DataFrame with operator chaining](http://stackoverflow.com/questions/11869910/pandas-filter-rows-of-dataframe-with-operator-chaining)\n", "* [Indexing and Selecting Data](http://pandas.pydata.org/pandas-docs/stable/indexing.html)"]}, {"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " ville | \n", " annee | \n", " temps | \n", " secondes | \n", "
\n", " \n", " \n", " \n", " 112 | \n", " FUKUOKA | \n", " 1971 | \n", " 02:12:51 | \n", " 7971 | \n", "
\n", " \n", " 204 | \n", " NEW YORK | \n", " 1971 | \n", " 02:22:54 | \n", " 8574 | \n", "
\n", " \n", " 285 | \n", " BOSTON | \n", " 1971 | \n", " 02:18:45 | \n", " 8325 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" ville annee temps secondes\n", "112 FUKUOKA 1971 02:12:51 7971\n", "204 NEW YORK 1971 02:22:54 8574\n", "285 BOSTON 1971 02:18:45 8325"]}, "execution_count": 25, "metadata": {}, "output_type": "execute_result"}], "source": ["subset = df [ df.annee == 1971 ]\n", "subset.head()"]}, {"cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " ville | \n", " annee | \n", " temps | \n", " secondes | \n", "
\n", " \n", " \n", " \n", " 285 | \n", " BOSTON | \n", " 1971 | \n", " 02:18:45 | \n", " 8325 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" ville annee temps secondes\n", "285 BOSTON 1971 02:18:45 8325"]}, "execution_count": 26, "metadata": {}, "output_type": "execute_result"}], "source": ["subset = df [ (df.annee == 1971) & (df.ville == \"BOSTON\") ]\n", "subset.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### union\n", "\n", "union = concat\u00e9nation de deux DataFrame (qui n'ont pas n\u00e9cessaire les m\u00eames colonnes). On peut concat\u00e9ner les lignes ou les colonnes.\n", "\n", "* [concat](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html)\n", "* [Merge, join, and concatenate](http://pandas.pydata.org/pandas-docs/stable/merging.html)"]}, {"cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [{"data": {"text/plain": ["((360, 4), (720, 4))"]}, "execution_count": 27, "metadata": {}, "output_type": "execute_result"}], "source": ["concat_ligne = pandas.concat((df,df))\n", "df.shape,concat_ligne.shape"]}, {"cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [{"data": {"text/plain": ["((360, 4), (360, 8))"]}, "execution_count": 28, "metadata": {}, "output_type": "execute_result"}], "source": ["concat_col = pandas.concat((df,df), axis=1)\n", "df.shape,concat_col.shape"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### sort\n", "\n", "Sort = trier\n", "\n", "* [sort](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort.html)"]}, {"cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " ville | \n", " annee | \n", " temps | \n", " secondes | \n", "
\n", " \n", " \n", " \n", " 35 | \n", " BERLIN | \n", " 2011 | \n", " 02:03:38 | \n", " 7418 | \n", "
\n", " \n", " 326 | \n", " BOSTON | \n", " 2011 | \n", " 02:03:02 | \n", " 7382 | \n", "
\n", " \n", " 203 | \n", " LONDON | \n", " 2011 | \n", " 02:04:40 | \n", " 7480 | \n", "
\n", " \n", " 0 | \n", " PARIS | \n", " 2011 | \n", " 02:06:29 | \n", " 7589 | \n", "
\n", " \n", " 277 | \n", " STOCKOLM | \n", " 2011 | \n", " 02:14:07 | \n", " 8047 | \n", "
\n", " \n", "
\n", "
5 rows \u00d7 4 columns
\n", "
"], "text/plain": [" ville annee temps secondes\n", "35 BERLIN 2011 02:03:38 7418\n", "326 BOSTON 2011 02:03:02 7382\n", "203 LONDON 2011 02:04:40 7480\n", "0 PARIS 2011 02:06:29 7589\n", "277 STOCKOLM 2011 02:14:07 8047\n", "\n", "[5 rows x 4 columns]"]}, "execution_count": 29, "metadata": {}, "output_type": "execute_result"}], "source": ["tri = df.sort_values( [\"annee\", \"ville\"], ascending=[0,1])\n", "tri.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### group by\n", "\n", "Cette op\u00e9ration consiste \u00e0 grouper les lignes qui partagent une caract\u00e9ristique commune (une ou ou plusieurs valeurs par exemple). Sur chaque groupe, on peut calculer une somme, une moyenne...\n", "\n", "* [groupby](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html)\n", "* [sum](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html), [cumsum](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.cumsum.html), [mean](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html), [count](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html)\n", "* [SQL GROUP BY](http://sql.sh/cours/group-by)\n", "* [Group By: split-apply-combine](http://pandas.pydata.org/pandas-docs/dev/groupby.html)\n", "* [group by customis\u00e9](http://stackoverflow.com/questions/15322632/python-pandas-df-groupby-agg-column-reference-in-agg)\n"]}, {"cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [{"data": {"text/plain": [""]}, "execution_count": 30, "metadata": {}, "output_type": "execute_result"}], "source": ["gr = df.groupby(\"annee\")\n", "gr"]}, {"cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " ville | \n", " annee | \n", " temps | \n", " secondes | \n", "
\n", " \n", " annee | \n", " | \n", " | \n", " | \n", " | \n", "
\n", " \n", " \n", " \n", " 2011 | \n", " 5 | \n", " 5 | \n", " 5 | \n", " 5 | \n", "
\n", " \n", " 2010 | \n", " 9 | \n", " 9 | \n", " 9 | \n", " 9 | \n", "
\n", " \n", " 2009 | \n", " 9 | \n", " 9 | \n", " 9 | \n", " 9 | \n", "
\n", " \n", " 2008 | \n", " 9 | \n", " 9 | \n", " 9 | \n", " 9 | \n", "
\n", " \n", " 2007 | \n", " 9 | \n", " 9 | \n", " 9 | \n", " 9 | \n", "
\n", " \n", "
\n", "
5 rows \u00d7 4 columns
\n", "
"], "text/plain": [" ville annee temps secondes\n", "annee \n", "2011 5 5 5 5\n", "2010 9 9 9 9\n", "2009 9 9 9 9\n", "2008 9 9 9 9\n", "2007 9 9 9 9\n", "\n", "[5 rows x 4 columns]"]}, "execution_count": 31, "metadata": {}, "output_type": "execute_result"}], "source": ["nb = gr.count()\n", "nb.sort_index(ascending=False).head()"]}, {"cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " secondes | \n", "
\n", " \n", " annee | \n", " | \n", "
\n", " \n", " \n", " \n", " 2011 | \n", " 37916 | \n", "
\n", " \n", " 2010 | \n", " 68673 | \n", "
\n", " \n", "
\n", "
2 rows \u00d7 1 columns
\n", "
"], "text/plain": [" secondes\n", "annee \n", "2011 37916\n", "2010 68673\n", "\n", "[2 rows x 1 columns]"]}, "execution_count": 32, "metadata": {}, "output_type": "execute_result"}], "source": ["nb = gr.sum()\n", "nb.sort_index(ascending=False).head(n=2)"]}, {"cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " secondes | \n", "
\n", " \n", " annee | \n", " | \n", "
\n", " \n", " \n", " \n", " 2011 | \n", " 7583.200000 | \n", "
\n", " \n", " 2010 | \n", " 7630.333333 | \n", "
\n", " \n", " 2009 | \n", " 7652.555556 | \n", "
\n", " \n", "
\n", "
3 rows \u00d7 1 columns
\n", "
"], "text/plain": [" secondes\n", "annee \n", "2011 7583.200000\n", "2010 7630.333333\n", "2009 7652.555556\n", "\n", "[3 rows x 1 columns]"]}, "execution_count": 33, "metadata": {}, "output_type": "execute_result"}], "source": ["nb = gr.mean()\n", "nb.sort_index(ascending=False).head(n=3)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Si les nom des colonnes utilis\u00e9es lors de l'op\u00e9ration ne sont pas mentionn\u00e9s, implicitement, c'est l'index qui sera choisi. On peut aussi aggr\u00e9ger les informations avec une fonction personnalis\u00e9e."]}, {"cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " annee | \n", " secondes | \n", "
\n", " \n", " \n", " \n", " 62 | \n", " 2009 | \n", " 8134 | \n", "
\n", " \n", " 63 | \n", " 2010 | \n", " 7968 | \n", "
\n", " \n", " 64 | \n", " 2011 | \n", " 8047 | \n", "
\n", " \n", "
\n", "
3 rows \u00d7 2 columns
\n", "
"], "text/plain": [" annee secondes\n", "62 2009 8134\n", "63 2010 7968\n", "64 2011 8047\n", "\n", "[3 rows x 2 columns]"]}, "execution_count": 34, "metadata": {}, "output_type": "execute_result"}], "source": ["def max_entier(x):\n", " return int(max(x))\n", "nb = df[[\"annee\",\"secondes\"]].groupby(\"annee\").agg(max_entier).reset_index()\n", "nb.tail(n=3)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Ou encore consid\u00e9rer des aggr\u00e9gations diff\u00e9rentes pour chaque colonne :"]}, {"cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " secondes | \n", " ville | \n", "
\n", " \n", " annee | \n", " | \n", " | \n", "
\n", " \n", " \n", " \n", " 2009 | \n", " 8134 | \n", " 9 | \n", "
\n", " \n", " 2010 | \n", " 7968 | \n", " 9 | \n", "
\n", " \n", " 2011 | \n", " 8047 | \n", " 5 | \n", "
\n", " \n", "
\n", "
3 rows \u00d7 2 columns
\n", "
"], "text/plain": [" secondes ville\n", "annee \n", "2009 8134 9\n", "2010 7968 9\n", "2011 8047 5\n", "\n", "[3 rows x 2 columns]"]}, "execution_count": 35, "metadata": {}, "output_type": "execute_result"}], "source": ["nb = df[[\"annee\",\"ville\",\"secondes\"]].groupby(\"annee\").agg({ \"ville\":len, \"secondes\":max_entier})\n", "nb.tail(n=3)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### join (_merge_ ou fusion)\n", "\n", "Fusionner deux tables consiste \u00e0 apparier les lignes de la premi\u00e8re table avec celle de la seconde si certaines colonnes de ces lignes partagent les m\u00eames valeurs. On distingue quatre cas :\n", "\n", "* ``INNER JOIN`` - **inner** : on garde tous les appariements r\u00e9ussis \n", "* ``LEFT OUTER JOIN`` - **left** : on garde tous les appariements r\u00e9ussis et les lignes non appari\u00e9es de la table de gauche\n", "* ``RIGHT OUTER JOIN`` - **right** : on garde tous les appariements r\u00e9ussis et les lignes non appari\u00e9es de la table de droite\n", "* ``FULL OUTER JOIN`` - **outer** : on garde tous les appariements r\u00e9ussis et les lignes non appari\u00e9es des deux tables\n", "\n", "Exemples et documentation :\n", "* [merging, joining](http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging)\n", "* [join](http://pandas.pydata.org/pandas-docs/stable/pandas.DataFrame.join.html)\n", "* [merge](http://pandas.pydata.org/pandas-docs/stable/pandas.merge.html) ou [DataFrame.merge](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html)\n", "* [jointures SQL](http://sql.sh/cours/jointures) - illustrations avec graphiques en patates\n", "\n", "Si les noms des colonnes utilis\u00e9es lors de la fusion ne sont pas mentionn\u00e9s, implicitement, c'est l'index qui sera choisi. Pour les grandes tables (> 100.000 lignes), il est fortement recommand\u00e9s d'ajouter un index s'il n'existe pas avant de fusionner.\n", "\n", "A quoi correspondent les quatre cas suivants :"]}, {"cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [{"data": {"image/png": "iVBORw0KGgoAAAANSUhEUgAAAnAAAAByCAIAAABcJnvWAAAAAXNSR0IArs4c6QAAAARnQU1BAACx\njwv8YQUAAAAJcEhZcwAADsMAAA7DAcdvqGQAAE28SURBVHhe7Z0HWBTHF8DHLorYO3bFhr13Y436\nt5fYGzbsisbE2A3GntgrIqKACoKKSuy9xxZ77wEEBREREO//Zvfd3e5e4crucXfu73ufH+7OTd2Z\nN/UNUcjIyMjIyMiYjaxQZWRkZGRkREBWqDIyMjIyMiIgK1QZGRkZGRkRkBWqjIyMjIwRREVFvXz5\n8rUOXrx48eXLF3T6nSErVBkZGdG4d++et7f3Dh34+PicPn0ancrYDvfv31cV6/79+/PkyUP00rZt\n2+DgYNa9l5dXfHw8emTvWLtC7dWr1wjd9OzZE93J2BphYWEDBw7EgtRgyJAh69atQ6cyVgyMVPr2\n7cuW2rhx48qVK4dtqg5y5MgxceJE1v2AAQNOnjyJHslYH8uXL4eaaEix6gf06+jRo6G4jx07hl7b\nKVanUEuVKuXi4lK+fPkKFSrAH1gguqlYsSK4BPdQ5NWqVUNfZKyP27dvFyxYEEoKqFy5cvbs2bEI\ndePq6sq6L1q0KHR10SMZKwC6O87OzpUqVSpUqBCWlkmAfoWPoWTJkqNGjUKvZawAGINWqVIFC0k8\nHB0docWGEsdg7I40VqgJCQmRkZH37t3D/BaPN2/ehIeHJyUlYUgyFuft27efP3+GbikWiUhAPYdi\nff36NQYjYymgTsG/WAyS8erVK6i5iYmJbKAyFuPjx4+G1laHHMQpP8mRT7vkLEAyZESXetm2bduH\nDx8weLsgzRTq9u3bQ0NDmzRpglkrDe3atdu3b9/u3bsxVBnpiY2NhXpy7NgxLAPJ8Pf39/Pze/Lk\nCQYsIxmBgYFjxozBfNdLuXLl+vbt20MHvXv3rl+/PjrVS9u2bXfu3InBy0jMqVOnNm7ciFmvlYKl\nSINupE5HKq7NyJq75KCC7NMhfyuIxw5SrRW6b9SDZMqC/mjj4MGDjx49wqjYOGmgUA8dOjRx4kTM\ny9Tw9fVdpRvQyuguNcaPHy+v1liAUaNGtWjRAjNdP5WbELflZMAC7TJkCWnWD13qpXLlykOHDsXg\nZcQm1dpatGhRb29vtj4uXbr0/v37+EsdxMTELF68mHW/efPmxo0bo0facHNzk5fSJeXly5fu7u6Z\nM2fGHBfQ2o0MXkz6zCYLz5L9CrInCWVHNNkWrk8CYtWODyjI5O2k3zxa3yto705VrFjRPmqxpRWq\ns7Oz/sUzqJC3b9++devWzZs34Q/8mW7AGQDu79y5c+XKFfRFGzlz5ixSpAj+TEZs1q5dW6FCBcxr\nrRRxIX9dI54nqMw/QjY8JkEJZNcn7RL4mWx9S+YdRvfLLpP6ndEfbcDAaPTo0RgVGZEAZamrtg4b\nNuzJkyc3btx49eoVujaJDx8+XLt27cGDB6Bf0WsNSpUq9fjxY/yBjHi4uLgULlwYc1lAkXJk6SWy\nLYLWRKiPfu+FKtMoCYihngR9IesfkhXXMQgNypYtO2LECIycbWIhhZqSklKyZEnMNj4ODg6Qj+hO\nPAoUKJA1a1YMg0/VqlW/fv2K7mTMIz4+/u7du5izmmTKQtz+pF1UqEu744n/B1ozWdn+TljrBOIb\nqXYMP9wVRz3Zp6BaNn169F8Df3//uLg4jJyMSeiqrRkzZsyVK1eZMmXQnQR0794dVHi2bNkwSD7g\nIDk5mXUpYw79+/fHPBWQLh1ZfJ7s/Ya11TdCWCvNFKj1/jFUSUObMHAByaS9ifb29saI2hqSK9Tg\n4OCxY8diPvHp06dP8+bNY2Ji0KnYvH79ukWLFn379sXw+EybNi00NBSdyhgPFNzBgwcdHR0xQ7kU\nLEVqdyCNepLjCqoIoZ9LRaN2GS2MP1AhjypI919IjTakTE0Mkc+RI0fktVXTgKEnZiKfTp06nThx\nAh1JzNu3b9u1a6e1Q9y0aVN0JGMSISEh48aNw9zkArWp71xyREH1qDi1VY8wFRnGrNA+NOimtRZv\n27bt/PnzGGnbQVqFunnzZswePmPGjPHy8kJH0rNmzZqhQ4di2HzkwxgmU6dOHcxEAYMWkkXMigsM\nKIUVSTyBHjT0ozc8Jv1/J465MWgOJUqUkK6vZpesWLFi8ODBmH0coLZCRUZHFmT//v2LFi3CSHBo\n3769h4cHOpIxhi1btmAmcqnclPw0k+xV0DolqGUWkJCvZNNT0n8+ccqH8eFw8eJFjLqNIJVCvXbt\nWvny5TFXOFStWvXevXvoyIKkpKQ8ePAAGlmMBwcXFxe72WNmGQYNGqT9iLDbn1SVBicTv9T2LIgl\n29/RXQ/Lr5LZBzEOHIoXL167dm2MtIxetPZ9q1Spkia1lcutW7e0TjJNnDgRXcgYhnZtuuwy2fBI\n2r5vqqKqxXP/xlhxsK3lc0kUquaiWnpm0QtfpzUQk3Tp0jHxUvPff//haxm9aDmpljEzadyLhCro\n1j4/8zYvmCY7omnQMGDtO4+kz4CxUgJ9OIy6jDa+fPni5eWFmcUBX1sNBQoUwJgpGTdunLykaggw\nlsAsU5GO2YUAamznx9R3M1hGoBYHJZB5R5n4CcGUWD0iR/Thw4enT5/GPFAC2svaRu5Hj2optlOn\nTj179gxdyGiwf//+8ePHY2apKFSaHFaQPYni718wViACu+PJCQWp3R7jpqRt27byerlWNG2qODo6\ntmjR4uPHj+jCmmjSpEnx4sUxokoCAgLwtYw27t+/jznFZVYoPS3qGymsRGku7A6JfvMwnkpy5MgR\nHR2NSbJixFSoMMgrUqQIZoCS5cuXh4WFoQtrIiQkxNPTE2OppGTJkvLCm1a8vb0xj1TkKUJ6zyJb\nXpHtUcJakZYSQeevevxKXOpiPJXAgAYTI8Nw5swZzBoO586dw9dWyZMnT6Btxbgq8ff3x9cyfKA0\nMY+4zNhHe8DCimNNEqogA37H2CopUaLE+/fvMWHWipgKVfPI2tq1a/GdtbJw4UKMqxInJyd8J6NE\n65Qg2fiEThlZyXwRTyJoe7H1DclbFKOqRN7MouLff//FTOHw4MEDfG3FQJcXo8tBHqdqcufOHcwd\nFUOXkgUnSWCCRpWxPtkdTw/INeyBMWeAvhSmzVoRTaFiijls2LAB31k3WncS4jsZXdoUVCndXq9R\nDaxHdkTTs27Zc9GjdRzGjx+PCfuOefToEWaHkjTZx2sO//zzD0ZdiWyqkMvjx48xX1QMXsJsGLTu\nasuVgBh67rxuJ4y/EkyhVSJC5KKiojJlyoRpZbBFI1KdO/MM8eTKlUue+wU2bdqEOaJi1gErXX3R\nIhFUp669R7LyDstOnjwZk/ddoqlNrX8mSSuaOyF27dqF775vNIuYDPyDnmQTVhBbkMOgU/+HqWCA\ncarVmtQ3V6GGh4cL1k379euH72yNbt26YRoYSpcuHRsbi+++S7Zt24Z5oeK3vda++qIpuz6RJRcE\nOnXChAmYyO8MzUU16DPhOxtEU6fK41Qt91IMWWqr2pRKBDNO7YhpYShZsqR1biA1V6EKNggMHDgQ\nX9gmAp2aO3dufPH9oWUX0sIzaXxezWQBnQrjVD5Tp07FpH43aO5CWrlyJb6zWS5evIiJUWJJozFW\niPAyg37zSUiKsEbYlmx/R8/F8XXqqVOnMMHWhFkKFVOmZMCAAfjClunYkVdsAL74nggKCsLEq6B7\nGT4LP3QbEqiQO6IxLUqmT5+OCf4+EHSS1q9fjy9snAsXLmCSGL7n7dy+vr6YCywDF5LgJGFdsEUB\nnbr3m+BEXEREBCbbajBdWzx9+hSTxTBo0CB8Yfv873+8KfuXL1/ii++GyZMnY+JZZoXa6tiUK6BT\nYZyaRW14vXPnzpjg7wCBoZzVq1fjC7vg6NGj3DvIvs/t3Fu3bsX0s4D6OWS7M72aEkFOwiiOt8fQ\n2kx2m6hQBV1C+xibcunVqxemjZAMGTLcuHEDX3wHuLm5YcqBDBnJ9GDbWzfVJbs+0cs0cquvrGrV\nqhUm267x8fHBBDOA7sEXdkSzZs0weQzfoW1CTLmKKf60EymoAjYtu+PJT7MwdQwVKlTAxDPcvHlz\nypQpM7VhmW0TpijUM2fOCEz3PXz4EN/ZC5cuXcK0MWTKlOn69ev4zq4ZOXIkphlJR/f0Cj5rm5ZQ\nBSlTCxPHUKdOHUy8/VKmTBlMLcOtW7fwhR3x5s0bTJ4SfMEQFRVVpUqVptqoVq3anTt30J3NAgoD\nk82y4AQJspd+MFdCvpJe0zGNjBm+Q4cOLV++vG7dui1atChaVHj0nEv9+vWh11W9enXpvn9TFKpg\nYmHNmjX4wr5YsGABppAhJCQEHqouUnVycsIXGqgMsqekpLB/2BClSpXCZLBAl9AaTTeYIb4R9DQe\nf9Mvm3a2vDQ3uXAJCgoCN9++fWN+YRu4u7tj7Bnse7oFE8nQpk0bXRde6YIdx9jcfcnCjUhzD1O7\nuIIv304kgrHaPQdTagaslW9xy9pohSrY12AHuwT1ILBNGBYWhn8ZDPR8Dx8+HB8fjz5aMZ8/f65X\nrx7GG3DMTba8sHbrDaYJdBEOKKjpRCXFixcfM2YM/scwWrZsefXq1SNHjmD2WSs7duzAGDNs2bIF\nX9gpERERmFTzuHv3rq3U3ISEhB9//BHjDUDN/es62WFVBkHFk8DPdB67mfZbrk3D39//5s2bmJvm\nYbRCxSgwODg44FP7RfsF2kbSpUuX+fPnW/lu/sDAQIwuy+DF9rN0qik7P5JVWmzvmcCKFSumTZv2\n+vVrzEdrIjExsW3bthhR5vTev//+i+/slJ9++glTKwadOnWaNWsWem2t7NmzB6PLMmihfdZc0KNh\nCtJnLmncE1OqlZo/ku4/k86TeNLrN5KrIDrQweXLlzFDzcA4hTp69GgMnMHut+potUpoDo0bN759\n+zb6bk3ExsbyDHSUqUVvSbSzyV5WfCPplO8hBclXDBMrBi4uLg0bNsTctBqSkpIwfgzt2rXDF/bI\nvHnzmjRpgknVimszMvsAmbabJzNDScPu6EAH1rxnOC4ujldzC5W2z5q7J4lM8SMVGmIyteJ5gnjs\nIJue0f4EDGS5EvKVHqOf6k9m7ieNtXe5cufOXbduXcxWUzFOoQou6Man9ohgG7NWwBk0WFzY6Xh8\nrZuoqCgmECsCBlgYOZb6XeknKPimbV4i6OVQUDNrtMFk6oBdSMNCVQJPFi9ejC50UK9ePSY7rQWM\nFkPBggXxqd2h/U4VLpuekK1v6fL5zo/UQixX2AtBt7ykPS1t19Sr2L17N4ZnTfz3338YP8DBkexN\nSZs7iaUTek/qFzL7EKZRD0cU+q539YvmFPcrOt7VAeasSRjx4379+mGAjKlbtomxP6Kjhcf/BSxZ\nsgSdpkaHDh1g1KJ5pR3g4OBgVcszGC2W+l3sbWcvKzAw7fUbplEHmB2p8eHDhwoVKvzwww/4Mz6T\nJk1Cd2nKq1evMEIM+NTu0L6JzLkC/sHycwDdXqfryl54Do0sCDS4oQp62LHmj6RIOfwth8DAQGvb\nD4wxY6ndnn7kgtTZtLB2Q2FYKSB7LlrEISncnRAkrzM5aEjDxRZ3FLUS/MtuUqoa/pxDgwYNMH+N\nxNBq9uTJE1dXVwzNxu1/6uHRo0eaty3SwqvYCP9mNvEatWB28+bNGTNm4I85QH7OmzcPHaUpAQEB\nGCeWo3anTaE3Cl2EPtp2BvLvhzLWdtKiRYu4HU0VHh4ePj4+6CiN4PbkevbsiU/tiBs3bixduhRT\nqCKzA/nfOOL1nF7Wy+WEMV81tNReL0n7MSSnloW3a9euYQzSmp07d2KcWGg/OK3v+RdRdsWRpZeI\ng0aD3MWDHo4H3QkOFp7Fh0D23HRK36h9lGzL0Hs2aSRcl23RogXmsjEYqlCDg4MxHOY0j9Ua+zcH\nSFexYhrramM30RNdW9+QYhXxiUlmJM+cObNs2TL8PYdatWqhi7SDN5Pfdx79TAWfnU3LniS6suLK\nO/VPccxNqyWMTmbuwycMmCnGcOzYsUqVKuHvOaThxdcbNmzgHhZ//PgxvrAXoqOjtfR9obbOO8xM\ne36gYzUuHcYadw8oDGJArf5xmkznb/lhzGLcu3cP45GmlC5dGuME9JlLdtpRzQVVt+6BFm06Zj2d\nRfCPoW58I4n3a9KIswTe3t2UawACP9Op4Ppd0BMlVapUwYw2GIOaj3fv3mEIDL1798YXdkSePHkw\neSp+HEm7e5DR0OXZ+41U4c3vmTzjLbBrCKTtZhahlcEll+xqDQba0DnaDjutvk0X1XZ9oqcLoDpx\nMK1nyp5hRS847N27l3VgYYYOHYoxIGTx4sX41I7AtHEZs4nuRlENUKBFnrYLXwHOlZim1sgBHNQF\n8HPGXvSEQ5rvhBCePV183n5qLjv9ron7OqENVKi8/ebiW5bJ2+lvuW4MEQgR1DN8M448RWCs1ReD\nFOquXervMm/evPjUXvj69auWgWk7d7oRFHJZleN8M5LmHEAEnVquHG+FpkmTJvjOskC3oF27dhgJ\nYMRKOp5TJdnWBZrCOZp7GdKRZZeZFTXlla5Qi5ao96BBr//jx4+YQUYSGxtbsWLFLFmyoF8Mlr+k\nc+PGjRg2IRkzZty/fz++sAsE/XtKiSpk5Gra61UVPYhfNC1WjulmenjRqEGqSuALOa4QGL0DICYY\np7Rg7NixGA8ANI19GMFn5bBGhwmGNycU2ofg0FBzLyGfGmDiHBtdSo8ixxQCnQoDnsTERMz01DBI\noaLHDMuXL8endgFUiRo1amDaWGq0pmswmovb0BXqwhvMoRcm8e3bN+6aNNC2bVt8Z0F4dmTyFiXz\nj9pPJxcaQahaAjqOp9O8UJRcl1CLVv9LSlRFN4SMHDkSM8gkDh0SavF9+/bhO4sAwWHAhAwePBif\n2gvQwGHaWFq7Mds7tbWhwclkEOfkW+dJQqVrlIQyhyD5nD9/HqNlWV68eFGtGmc3Te/ZdLpFEGEb\nFejTC6bZWw6hJ1AFzlQCpVy+AboEancgPv/p3ICWqkBrAF3tYpXRN4aAgADM99QwWqHiI3vhxo0b\nmDCWCg1pbmo9Fg3P96agMwb0wlTi4+PLli2LfjFY3qjswYOccwKNetjyLcQaAjWQs5WMMmaDevVF\nIFDi/dVWsebMmYMZZCqax64sZqIoJiamZMmSGGoaddSk49ixY1mzZsW0AW2GM7VVR+u5O5505nSC\nYeQx/4hZXcZ9CtJjGvrGMGrUKIyZZeFd/l+6Br1GCTSBILaiiE8E8UukLQNIwHuqqwQOxJWgL+RX\nvjZtP4YxSqxbQUKBLrlAMqrvGqKLr+YYivL/QLxekNI10TdCSpQogfmeGqlrhbp166KvDPjUXsBU\nsUCvJDCBHnsS5C8r0OWBjObQsmVL9MUMsmfPjt4x4FOLIDQmXqudlNa0o2jugXBn0aWSCHqIdgDP\nbKSW1ReuQO+++6/okiEsLAyzyVROnjyJfjGYOeo1HK7tPTs7e3r9+nVMGEuroVSbqqbuNQXqLIxc\nK3P2o032pbsiBM6MkAgaYk+1cXYgTWbUedaR6v7PrJG3HvGJcAoIv7iwLSEZaUATT5OdHyTUqdA+\nzDvKJInD1IDUi+yQgmTirLN4vdTZjBsoISm8aWQGzHq9pO6IO4qCzi8+lYJvn+7dvnnlypVLt+5H\nJeAz6fjwgacdKacV+ionSEAsXX5TojKCbyboHUOWLFk+ffqELyTm/fv3GCqQtyhdJRKkVzSJqrfz\nJpkZRlc01zyifV6hA1EFxqB9ZmO6WLpOpWNTgTOBwIi2jnq/mChGegXXSEyZMgVfSMneveodNDly\n5MCn0vB6lfLenoLD4vCZhGBYKoYuS32eE1rGShzzSb1mMCNUM75AaNwn8u7CAyy8QenFixcYMEvn\nSXQsLoinKOIT3eLvB/OUZ3qzk0oeTxRk81uhM7EkIIbeiM7Ffa1BFmaCk4TTUQc06vv2GLIrnmaU\nQPx0qN7DCsF95oYsmaeiUC9evJgzZ070jxBpzebdXFEkExuO47DNkt+VxjMED7QcnPod2tCB+k3d\nWhUoUECUE2lca6uAm5sbvpAYDw8PDBKA/t0+yRTqVkX8CuUlNgW6tjqfQrZK1smFLlFgAobFki49\n+SVI+0wvV4KTiYt6MkaUrexXr14V3Epk8nYnw8GQGBYuXIhPpWEhx3zChHPS3tCyYsUKDIll1FqD\nhmWg/9zX4U9Y1t0Xzo5CUxuYRBtlgeja2wKfyuTt6BvDzz//jLG0CG/fvsWAgVwFJTw4vjFB8e80\nUAD58udNl4HOqdZf+aLuiVjiLUH9hZoLec5llMZGM13iG0EHqVwECnVrTIPVvg4DZtJjgVzpOb3A\nkn8yhHyje/657ln5W0FqqBtnaPCxAHSTikLt06cPekbIoEGDJL3V6NKivnkxKFK8u+erL/hcCqAX\n7+DggIEBrYcxu1tT67dCkXu9IHU74q8IGT16NPpoHl27dkUfmX3UophpThUMj2XmflP2mhsim9+t\n+hQ9Pz+GAwwK+0w2azgTS6AcOQVEWXTOoL2dMHBZeJZeqK5ElMPWz58/R+8YTLbAYjgYEgM+koin\nf0JTmyu3Y+Ei9BxzgcEH8bk0sClCxmwg+wye5IShBve2vo2PeQrVJ6bWym01fmxKqrejqx4qqfJD\nwaErc0O7vP2dlklOaPpnHUAPGTCWFgHaYQwVcMwjVVd4c/iScEUgnfjMNGLO2vl96UH8Jj1WuD9S\nEC8JBqlQc+t3ZpKkxPCOAihU6P1wOc357ZbwDpfiz89rrHGmlZKrkAtxbdb2EiTqjfonrECbsPwq\nuiMkffr0qV5wksp3MGLECPSMEG9vb3wqBdHXRjWhQ+GePRiLFXna7X0m4czn77//TkNhaTWEjk31\nT/aqBFxyLLBMmzYNfeSwugvJ4JCbR65c2bJknnYWHWiFewHTnj178KmUYGAsBpnsMl58won3l1uh\n9OrjYkULVXGlO2nzdfELiFeQTdJMHAk23K+4QXbz9/TqkTAFyYiTJEBCgnDh4f3Fda0qpM+WA0uV\nxSl7luI/jr+ld8YTfWTAR5KBwTDgI2k4NZFuU6/ZaVXwtpH54K8s/7uIb8RHeIcMFJbhi/Ggb7iA\ndlQtsHmH1zj6MWLv1MrqA3FChoTEE19tOhWqDAdLnnzDIFkyZZVKoW76EP7xOD19n6MiWRF9zncI\n/Jm3/A8ldiXSei36Sqqg5i7/h84ICtzokYBYamBSReUm6hOAm8N/exR3z7O1um5rULnez/3uaeso\ngLc91fZKU50+1FflUlJSOnZUd/ZXrVqFLyTg+dEFzD7lPo8V0VPL0L+6ekl1qfqpU6doACqGLjVi\nx/nOj2TkKvwhIX379kVP1bwZrJ4j16BMhyOv0J2A337jmZmVelVGuIQs0aaGzZEBUe+PDKAjmApD\nwk6H/VoyPcmYrvqw+9KsxIQqSAnOYaQs2WjX1cCuEkhwMilXB38LfdzTpzGzlNzbOZl3OwSfqlN1\n9pi4E7958uTBpxLAvW0wZ86c+FQKEk73LJGdkGxDj6UoIgLrFskAIQ7zleSjTUpKql69Opsoytww\n446IwDfwN7+xVs0HeofXO/7x08Ffy9LoZy5cqQYpUp4aiXWu5FC0RIH0rGuHX7UuHEJzv/ou6wIo\nXLjw58+fMcYSg0GyGGVS0XDZGp59X9LFtdQIUeMuSzrcVjz/eL4jzZBMkzz31zufKPKsL9TckpxT\nQFBzQWEbXnNBoDg8j+PPASjEvSn4anP4zw/i7ni2hl5T6WoTR79XkL0K+g0cUnTevmtofUbPZszp\n6Hkn9573wtUoaPAnbeNaINA/sNSnUDds2IB+EFK0aNGrV6/iC/FJ2f8bXdHMO4ke1wv+mTkYWmGW\nROYNoZdB/WcpVV3fCSetAj1TzmK1xh0UESOo+c8MZZt2HzN+wtgxlAmTp3UqzzonRd0266p23Flo\nY43KGgsMrTAkoGpLEw+8pypeMX+dCPypVBZocArtTlkY5t+1FD3zUGvmv5VOfRS5TtI52zMknzOb\nJsrM/cYd8YaWlz/swMxS8jBoGtPy5v7f6J/Zoh07bvyoQd2LK3fse+ioItzLfCDnpatK3C2EkZGR\n+FQCPvw9pxikuoj7C/q//xb9UBRCLFHX4yHzVlzmzuWc/ixclm4MNGoPJxQrNItcOAq17vGPHw+w\nCrX6X98UJDCZdi73KQoGvZ45snZhpn+SffzZ5mc/CT9XGCKvvU9tSiix2JFfDI9FoluhvKJ2Pn6w\nsiMt1ibLHzY8/cn9/tc7E2nlKtlxSsZtcXSQKviJyYI1l2Nd57e9RhtnAIU6/wj+HChRWatCLV5h\n2GAYiW5Rdo98v4Zd9m1K20KHMZ7H+95J0jJIBWVfTz0X3b17dz1m8vQpVF9fX/SDkCFDhuBTSXjt\nQXc3VF17LRr+8+TgbNaM0J/P2bciw/it5Pdjxk0sgMA4plpL/LmWTfOsQs0+UDDCfnF4VEPm/Jzz\ngDufvuFDPsePc3pYEs/X8Y7ArrzJbH3USKmZ4h1e62TCPzs8SqQn2VxnbPuiWPvyTXCP4hBgjcbD\nul5TEG9RB6khKbzzDFBGG58Yd0pHo+XFzFJCFSrtpFcVDF1vbB1Vlk56khITTuAjPlAD+/btS10w\njBgxAl+IDbdYpbQ3G7FqBO0BV5iu3JS3j7kpuVgdr1vav22TefjwIc+QWYcxJNhIFWKoQq20IBJG\nosqFNJ+P0y5c+aMNPdFbpcmaxR+1LVIEfaG3eSvJnz//pUuXMN5SguGxGJsbBsrWz9M3zq5Bv+ry\nP/3DDNC9YwftPMzsQms8+Vxk+oAo0WZ9d38mHemqEFK1BbPObeT5OmjG5wkUqnLWTZdCpQtSsRfv\n7G2YF4o/q/uC4wO0KlRoGxed5VoVfv5cp2YyVKFqm9sUjS8H3SGIDI3G3X3P/D/mlnvLQvAk7wzx\nLzAXXqZG72fgZ1+qYpBCdfhp1Tlq3ZXD670jmTKpuS1S56kg+l5J8+bN8akE8BTqonOSKNStUQP/\nvr5hEJ3Ld1381HHPe+If/+44swLtXJOsYqZ0xKqTfh/ILP5lltDyajXQoUcMVagVDybjEyUv5rZx\ngReFKi1+g0+ErFqlXikAgoOD8YWouLjQaLDcvXsXn4pN0pPDXWk42f58hk8UilvMpE0Bt5Ui76c7\nduwY9ZildHX6oRo1EwhCi5V/+aVqUkqXQqVNbcz802cntKS6vErfXZtitClUiAm045zrdXfs2IHx\nlhIMjAWaI0GszJctEW2vRF6Y08OJkCz1lh6FEDe+Jb7v0m97tbw2DbPt7FDi+160QWpQAnedknSa\noF7+NFygBZjN2SYGn4qqlDkKtUSlkW5P4HkU/ZB2fCC7FOtW9iuUjmTIlr+ST2wmfx0t0iEFcVLv\nq4yL07ljQp9C3b17N3pASLdu3fCpBKyj58SydplzUKlnUs7M7cUsRPYUfbcr9VVFyFej+0Eg8AVz\nzjwdPQrfGxedCvXqXMa4nfPAmzpGqAD3bCg0jvhUAsqXV85BA3+cFl+hQmXzjx20eUM3mhsue+KY\n9mjju1DFm+l0oJ510h+hlU9+Fm3WNyCGZ2DFtSnd4yBwk6potLyYWUp0KtT3pztVp6ukFSbqu4aI\nu9YgxW3ViYmJqsFcpkyZnj1TqztxuRXgngvCqDKTu2/w1ERmLaqW2z+izjRfvqw++U2vrjRlA04E\n/Twyc+wrbXqGlnR4CrXKkngF2fGRLtDuiq95NO6nacPr0lJN1/s4o1F4fioF2pBaamvYnp6eGG/J\n4J5jpEeAjJ1gS1Wg5vq+/+3vk7+0puPTefcVZD3byYjIuzsyfD1zVjtz17esluX+0DSBSgeNcMvB\n1FuWAZ4mGuOFshjGOVvVuCeehFQqVKi7hcv06BL6hHieJgvO1PC5/NprAB1gZc5fafq1BW+/6dzY\nAQretTnrKwsWhgY6XyQnJ/fq1Yv9cYYMGQw3Zmg8x2hnN3/1NRc41fPq8rLMGZq+3lBwYpI+PW0R\nEe9XpijUnXFk+F/oA3OaCL1GcMp30JZ/8QHDu7M72xSm7pvPDRUoWi43b96kjhjy5Mlz/bok53HB\n21y5aJNIKV6ZLgWZkA/6ZWtEtt3vEoKGQEuVu/3qBS++ES9Gd25V3JtPF2ayVB3Q98RHsj1CnEEq\njCznc2ysFKto0toS0/JyekswPMIsY1BN+Z7BByyfgn5tB51cyMrtL/GRVjp06MD4Svnpp59EP4Sm\nqrCAlDfHRf35P6pnBu3iXQmXco01plF56bn/8JEYcA8akMGLTGxqqUmWS+gJkK8YLrDxFGrhzkt8\nyYBFZPBiMnh52J9udGolXzUyJOyP/3TvoYMPz30dPeusRM98oCjwusILJegKbw3PF/xecXYuvRku\n94TJMDTflUgne4KSMh5SHL6wswfTMo86lQiKSvhbE8Q/hvwSSH1kqdjQ6E0tKtkdzzO23Ha4QKHq\n3OVbb2R+CNQ7Ul//HjpVHLAwNND5Arq3+FNoJ4oXx6cS8GIF7d+5NJvN707fdWc0aqUuqyLwiQgs\nX76cSRBDm+GMjQyTbKYcUZCsapOBfANSrEJNX7xW6z79+/ftA/QdOOSnGnR3Q55288I+CmcLecAg\no3fv3oyvlP79++MLUeEeLyaj1tApF0ECzRevqEMvH/9BdVN270vQw43CCaItnxQRK5gFx5K7rrxN\n5x8tgkJlO7mc7SFk7t8m9txBoS46h54QUrZsWcwyBuWmJKfmvYcOYIu2X//uHepR9VK6q8+RVFrS\n27dvcy1NvnypV/0az7Bhw9BrQjZs2IBPxebrCx969WuB3peFm+sSFzG7BGrOFszZmAWTGiUmWzCA\nj2Eu5xa/klWFClV9/JhP7jIVt74nfrH6NqX/rSB5mM4yw5s3umb9xUFyheoTSXzfXpjB7LktUZvU\n7U7qd1VKj1wNWldzpnvwigw5PONliggHUkGhzuBcSFyutunbrExWqITkbDC8ve+L5mfidOpU/uEr\nLAwNdL6A2o4/hbwrUgSfis/9ya6gODP28IvFByqOjIOgM5drvY/ZRygKPOtIM/fT3qUg1wyU/QqS\nRb0jlz+lzipU7RTrOuM1OtOJl5cXupbMACy35SVDlhh3CMFA2RyneLmazj9WHV4glNmkztrX3qeo\nde7zwYF0QaLpjIPER4yVGHZRjYsJS+OsQMvLuT+1Zs2amGUMVKHqankJ6eidekvKXcIXveW1jEJ9\nsLI+DSBT9kJFnJ2LQoJYijo7F8+XgxmolZoqojko6qGKVO1H6hKTFSp0jTM6VJt2YUmE7pPT0NQ6\nqvfMS7q5GpBcoW6K+vDl8Vj19QrayZy9wPg7CrJFwxiCsQJd2Nmc3Q+lq0mnUOHrLFqub88T78iy\na2TFdbLs1tSL5/Z2YI5pEuLafkzRwAR6bEZrF3+vLSjUhNOLSiinHnVQwG2DaBvnmjfnzINPC6Rl\nKcg1A8UAhZo9n3PZcuXKIOVcyqv6sDV36rUCxTWxNnbsWHwqKpIr1K3hOQ8qrkyhU7v6yNoOOksi\nWHgAhQqDVC7mtLz6FSodoWYsUrZCubKqoi1TyAn7vpnbrNZ/1itHDvVeQVEsMXGxiEKNm8Df1aeV\nrnv0LGsYB/rIIp1CpcVaYxUEF4Kdv7o3FTf3LGAPNWfKkqfC+iiHQG3mHUCCk0gF9fVhx48fx6hL\ng+QKdUvit9OjqKX5rDnSFShNCpTkS+mChQtkptlFyOSrufd9MNeMKDTCIzj79Rp1Ry1ogqSmUJlN\nSSOGPlYQn3f05BWIzwcSrLixnjVJ7bR01xXnQ3HaUwT5vPEx44zi4OCA5cEnbRVq8t9zOnNsgmnH\nobXHQ43hq2m0aaPej0c8/KRUqNmH+AgOLcT+1bp5Iea240yNNuEzbZw/f75AgQLUHSEwpP7vPzFX\npFgkV6gboxIUr0apLEnqptamaOJttq18UKhBfPu90ilUOgardgEfIDHXQoa7sltFco7fpsNyBwPX\nfMfcuXPxqUhYQqHenk3TmS1v065DRw5zG8rFbZj7gG4VizKfeHNxrKpxZ2tI459Mb2oNUqj8YzPw\nWfoq/E8s+oGp6F2GrGx15ZvO+cBjBg1fREFahbolosO/8WuYxZOGQ1cWPqqgh0+4skcR8vIfj1p0\neNCw2S9DHnHOdJogtCscSQNTcdLUmgtigEIVnkMF2RClULwZwPyikfumbAGftHebIJ8Xn2dcUUqX\nLo3lwSdNFWrEhZH16VSLy4A/9wRs9+HjGxC0fgTbcWjqe5eeTzWTQ4cOqccHrs3J+kem78RJXaFq\n2eWrUCTuGFeDuWTIMUCvXcUBA9jypUhxj6bkCnVL0rGA/tlhxFawbM7h66ltqREr1TJ8Tf0JM7rV\novo2Q+W/hr0x2zQoVJsmHNN0bstNT5FBClXz2IxC8SWkM/NFVO489yk+0g51pAQfiYT0CjVlTxc6\nFVig5jAdB9q+bBvVlAm/UpjOs+9GwHilZOdH0yusCQoVZGPkP4q4gMa0C1G6Wf+iO6F11qFQQdNw\nwNhLg7QK1Sv2S2QgM92b73efS41Oa+zD945wPf1p21Bq19ehdN1sW0zaI6YS0Kbbo2hoKkzuCoOY\nplC3RLQ4n+LNDLVcei/O7KdjlGX9CvXZ4fnMMeEsy3VtzojZwh6W7LSOt2PWNH79lXPhZfdfTDnq\npBITFari0Z5fWZsVw/UaEudu15TiZJu0CnVzpGdEuH8zaqSvatOJra8piP9HujVaJX6f3e5G3Jz+\nI3SmsmTKP+a+2SsxUAm5wKdvTstrmkJVpHi3pC2vQ+3ep/WujTIeI/hIJCRXqJ9ONitJu6S1x+7X\ntbXu2c7RzlQ5pesZJMJCarp0dPM04v3a0gp1K/TVFEd+pVvoCtXrlXeb7gktu1Go2xT7lzUGjx1K\n1K64I0b7uUzvz0///oUJPmvnWaFlD38xfdY3zRWqTzjZFHNL8Zw90NZ6oo/jrngbHaEm+49mVoPL\nTtNtATNlYWsmBhUn690baxBz5sxh/GLoPJkEGmwzXVNMVagfzq6ozUyE2rNC3Rq3NXRznRK0ieq0\n5Ib7vUQtA1CfLxNWTKnFbPYlky447onR/hEbKFAJeS3vm7RQqIoLk2l67Fih3t/mVpAmP9ciPQZX\n4s63q0J36BSqMvyq2ZU2LRUqaNPNcdeTr03NSbcuN+n+c82jCp2awz4U6pb3B++GTShJN/E27bio\n310dU0ebwg98VRxg7vIoWLtfNai8vqYefrO4Qi1Z2X3YCwXZEct27vPuj8/r//b4bMZcBSkzfe/D\nwgdidK6hWrNCTYk435lpT+vMF5og53Jl00B2v4fHJXNNmlmBQv22Z1pzJjlO+qd8bVih+kQU2Ru1\n8NduRQjJkLHYD9AGaT1vsCU64PaVX1rSlRjX2v2bnYI+o60r1FOdstBWuUrXefY55Zv84JeWzH6k\nfMMe4SPtPFzIXOidznn+4dQvZNaPZRVq1eWJChKQQIISSWBiwb+/Djp0a7pb00y0xLPWmna2hqY5\nX5XYh0L1iR6x93i5bv1InZ61VtzpdOmTzlrpFb9sw2+kWnvSbPofV6My+plq8syyChUKOU/hZo3/\n8CNuK+jak9uqHkvXb+hGjaEC5Tr94rw7mnakuH6qxMoValRAF8bvUksv6ttlnvT0wI/MBp30nXbi\nI1OxrELV3JSkCJ7QqhS7T6f4PHykAxtWqFujxh0979uFnorJX/mX3dBCad3E6x1e5WTyk1XsHrH8\nowKeVQozY7ugpRWqcFPSt5cnZnWg5+Ch5e325018qgPGGYKPREJahZr0/nyY/6rVK7ce0d9hUCjC\nL29Yv3b1Jv+zD83d+mBZhZqrdr/RpHFv0rQPadKnUPsB49tWYK0r5W3oMfjKZ7RMolXsZsrXJ5rs\ngi7FF5rV+vu42z9Raw+74yAPTZ9espRCnfsk7pFnKz3nUIvVn7T+3ueSobpbIWsfoca++ffu3XtP\n/kvQv8H+W1Lk6yf37929/5y182s6llKo4cOZHgD/2ExZFxdmJoVSx/sR1xCEFmx6hJpj+6s8ay+T\nP06nX/eiWAj0+HR8oN7viu24R61OL7pcaMeb7AFm2PW1lEJ9EPgz0/Lyjs2ULVfWOS+atcvcalU4\nutUJ65IFH4nEkCH0xkqWzZs341NbBhPDAqVjTrH+zjWkVUmlUJueilX8/TPHsLUGefv+81XhuDtK\nX4fPggqVexWgJKYHLSmgUAUH3rgXgxsrukwPbg3PtCfRf1Y9PDshIH29f+7fJevDC+19r68PAQp1\nmdoKZrFixbA8+Ogs++fPn+NPdf/Ytpg+nXMViZmbksJ4lpJiY7nHeqKG6zwu4pCjUvfd9+PRoW6k\nVqjcXcRk+Eqz+haa4hNBe53w/W2PTG3Q+Y7seI9WVc1cQ+UCfpre8saQBSfRH0IqVqyIWcbwMHAK\n594THlmyZ3NqvQ7d6QV/wICPRIK77e7333/Hp7ZMyZIc+wJLL5quPAJiyURv9Aeo2RYvTqBGReLa\nLFvazDU7yVGI5OKIU25SrNOOB1+LHk+BIU4qX/IhCYtVEwyGRQrj+JaUnR/p3W0qCpc20bADfBuz\nQtEToEwNnjXvraARv1DjDDAWEkgIfAnRVK/rL2L4hKapjdvrurlEZ9nHxMRAa8L+OEeOHHfu3MEX\nNsuBAwfU1y9XbUE2GH9DECu0t3uMpKdDFepT1aqC6/EOLmjTqTu1S8ehd8/uXaftMNTIp9THZrZv\n354xo9I2TINuZMsr2lUUJNOGJCiRNKQ3ISMjVppo8RUEqk0HaqKLZcGCBZhlDJ8ehs1079SjFxYq\n8lOPTkPnnTPYPA56zYCPxKNSJWoWkOXmzVQmn20CTAyLafOB0FYGxKAPLKc4/oBO3f6R7KE2vIQS\nnEi2Mp1C/b09w5paEcGQWCS6vs1iIjQ9WMt0hSq8vo2ZhDBEDOnNw/fAAUtCA31VOjg4GH9NSKdO\nnfCpLSOOYQfoErqoTRiePHkSfReJS5cuFSpEb68D6tSp80Ya06Cq3hJlwSmRV2IsLNAbEMWwA215\n9d02Yz7cdYeZM2fiU/Gw1H2olgMTw2JyscIwiIvqPlRRBJrazMzxcob4+NSnoMwEQ2Ix2VCflYhY\ntnxBoeq6YFwUMV+hWuw+VIshjunBVO5DNZe//lJfZSOR6UHAEvehWkw0V2JEankxs8SDa3qQf6eC\nOHCL9dEj/btxbQNMDMtBq1So8LHlohsRWSTqAXPBkFhOiJoWywso1N/2YlqASo15U7WGCyhUz+Po\nCeBcXmSFelpUhapxSZlNwlOoM0w1ji+xQrWAcXyAp1D//MfeFKppxvGlV6hFpDSODxQtqrafLLVR\nWcuAiWHxizZlmQaKNTgJfWARUaFCfKDuOLFHqimvXumzPSkKvM3P6TPSwZMgVjYkkHucbYCUNm6m\n7OoIiCUeO9AHoOoPZu2SEci+bySzehcqgCWhgb4mY/PmzfhrZu4xIkLEi9TShlq1WFuGDO3cTby+\nDTqktZhTzQxBQUHouxgkJSX169cPvZbs+jZAZS6Y0v0XkW07WFjoAvB/mBaW34/RHqvAWaoCLa9h\nEzumce/ePfUqvgTXtwE8c2ASdAgsT9WqzLX8LEXKmTIfCJ2kyeqxAd0/AfpV4MZkgVa7irqb7uLi\nwt+iKAnJyckYHpAjr20rVBAo0/6/Y3KAblONLmWoubvi8Ocs3GVy84Xe0Kfuqn75ovN2E31VDrpa\nNWrUQD8I+euvv/CFzQJ9dkwMiwkXjEMzPWMf/YgZGjRoIG4/g3vBeO7cua9evYovxCY0lLMjDgj5\nJkypLUkE7Rs17YtpAUy7YBz6xf8biz4Q4unpiZklEtwLxnv06AHNIr4QFQyAAR/ZMtevX8fEAJWa\nmHL79H7+MBe6X9uZjeWiCAx2m6gvMN63bx/GW0revn2L4QG5Cpp+TayVyO7P3J2ApFAZsvQSvQpG\n4EyPgELd/Ql/ziLiJAR0yMZuUu1CBZ4+1XkUO5Uq5+7ujn4QsmnTJnxqy2BiWKDNNVah7o4n3X/G\nnxMyffp09FckoqOj0Wumt4tPpQGDYYFGR5BS25KAWDL7EKYFcG1qykoMv+VNSdF/RNpo3NxYi6GU\nwMBAfCo2GAADPrJlwsL484FDlxo9mwJtaxb1ITey4ZFoChWa2olb0VuGjRs3Yryl5MWLFxgeS8fx\ntF0SxM2GBBrhLa9ICeaOGxaoy0aVMihUqO9cRFSoBxWkUU/0lpA5c+ZgMWgjlSrXv39/9IaQxYsX\n41NbhjfVCRjb4Q36QvrNw98SMm7cOPRXJNBfhsaNG+NTacBgWMTttlte/D+QmQcwLSz/G4dnDQ0X\n/vFicTcNrVmzBv1lEHelgAsGwJA9e3Z8ast4enpieoBWQ2jnyfCVmu2R1H1m5lI5FpPPywmF2RPe\ngzUTT5FuC6EmMBTGUIG6/6OWJYTRsymBsU39rpgcFp//jCimQwqSjtowQ7a+MW6Aq0egpzJ4CXrL\n4Ovri2WgjVQUqpeXV/r06ojeunULX9gymBiWBSeNWGwDreP1gtTtyP40c+bM/v7+6KkYnD59mvWZ\nBZ9KRr166sM/BErZ1ldi9qaQnurbRkn11mTTUyPqpN97esA8Ax7PLV++vJ6VEmNJTk7mni2eNGkS\nvpCApCT1BhxprrWwNKA8smXjaMRfAo3YThiqIOXVt3/T9RqxBnPwwSxVm84BpNvxoMm2bdswVKBU\nNbLmDhpIsV05wW+ZR6019DQ5FMSis6qaS1n3QJzcgAYfdHNj9dWQWbNmPX/+PJaBNlJvtbkHFt+9\nM9fatTUwaNAgTA8A36Lhg1TokHLM6DRo0AB9FAkHB4M2konFp0+cVYc8RWx+JQbq1YJTJK967wBt\nPQ238HBQQSozVt0ZTp06hdkkBq9fv0Z/GXr37o0vJIA7H5ghQwbphsKWpH79+pgkoFJjOjr0NWCQ\nCn3lWQeIE7UsjYzdyAxwNVyaIIGfSXXOuXZCLl26hNGVHsEGF9Jntm3vKwSB+A9cgMlhMfBEUHAy\nr89UrzMd3RryeaQq0KSsvIXeEpIzZ84bN/RctERJvdXmnpzLmzcvPrVlBANBQ++jhhKC+pmN3nnJ\nIq5FxlmzZqG/DBZoB58+fYqBsXTxsO2VGBAYjpRwxeQADjloqdE9wBouBQIjnlHr8FcMO3eaexkD\nF6g46C8zscE3/iw+3G+pS5cu+NSW+fDhA6aHJfALbewEhagpoPN6cAyO1oem9q1B30Oqsv0dOckb\nUb1/b66xcWMZM2YMhg24rxNz63KaCJQLtD9c2gwz6OQxjIjqdsafAFMDTDeUJpDjCuJI7yJkyZUr\nF2a9bgwaBqF/DPjIxpk7l3MpQWs3utiWajUDB0Ff8CcM6JdItG/fHv0lZNeuXfhUYgIDAzFIoF4n\nm1+JgYEL1C4uq28b1EuAku09G39CyK+//ooZJBLoLwM+khI/Pz8MjJCuXbviUxsnf37OQBPYGUfX\nRwXlyBPmVHGXKege+HGEOGcToSkAn7OqT0ABGEsL4uHhgWGz2Lp5FhDIVe7WQqDDmFR2QgQmkL7q\nmksx2QSeQA4oSPZc6CcD5rteDHPE4fLly/jUlgkODs6aFW8IobQdmfqZ1IBY8vNOdM+AfplNcnJy\njx490FPmtMzFixfxncRs3crZo1imJll7z7a3JoEEJ5Na7TBFLAvPED+9y+TQPm55xd0TAX1/zCCz\nefLkCXrKUKdOHXwhJefPn4evCIOETsXq1fjCxsH0sBStkMrwBVrV+ZwbZjJkJKPWGLH4qkf2K0i1\nVugtQ2JiIkbRspQqVQpjAPw0k3YyBFG1LdkRTZZc5M4CkuqtaLOsa4cRW3MbcHYztXcX58DCrk9k\nxQ3uFsVq1aphpuvFIK0wYcIE9JUQR0dHfGrj8LbkALugw6t3A8sxXn2ePXs2emQ23It9gGHDhuEL\n6YGgeQfnhy4zemestQnUsUDeRALdjq9/mRy6Sn+cQseEwEhIxOO/Xbqw9/4iUk/2quBe5Qaf+ocP\nH/CFLcO9hYmOHuYe1m1SJ4LatKveGh0DhcoasVtCj0BDAY1+HrXFq9atWwuux7AYQUFBGAkWelRM\njLXDNBTI3mWXeTshytejt19oXROFPtOis+gMcMxDb701fJOpLoF+ydJLXL3esWNHzPHUMEihRkZG\noscM9nEz1Lt37zA9LKVr0BlzXUvZQV/oMQwOX79+RY/MhnuSJ3PmzBawtMKlb1+OPYSsjrRfZuvb\nBaEl7TUDU8QyZqPO2WwocRi1KC11ANDrx6wxm1OnTqlv9ZF4c6+AmJiYwoULY8CE2IGZM5bBgwdj\nkoB06ciyK/REv6BMQaBfBeNXLqtvM7NQGi6NEmjBvZ6TjMqrjWFE1L49xiyNwHiw1Ghjij0Ta5NQ\nBW2NuRSvrK1xjqDjSE7NJXmdDVpz1S/Qemx5STKqbzsAtm3bhtmdGobOW3Lt+trHzTMsmCSWUtVo\nV0hTnUBB7o4nFdQbyUJDQ/H3ZpMrl9HT9KLDu3Vy1kFxViDSUiLoEZpuavsblLGb8bZhgUBxe/O2\n4GKmmM358+r7/YHhw4fjC0tRt25dDJsBn9o4f/75J6ZHhdZlcmhYOWqPHlI0v6kNiKWjHz4eHh4Y\nszSCNzBwcKRritCZEMTcxiSC6lQn/rXSxV1p40zntJVqFWruVo7FKOCQ7hGRgQJFDMLHqH6wodXs\n7NmzOXOqh8Br1qzBFzZO8eLFMUksFRvTiV+BTt2TSLpPQweE5MmT5/r16/h7M0hISChXrhx6ylCj\nRg18Z1nGjx+PMWAxzbK8VQkMOsdvweSoGL9ZywUUfAMrTZs2xUwxjytXrqCPStLE0BiGzSCdGUsL\ns3jxYkySChinQsmqqq3fezptyD3pD4rQzM0BMBjinKBgmTp1KsYp7YiLiytWjHPtfdUfbN7qGQj0\nCQ4oSD5nTBSLazO6i4WugjNaU2AayaUuCTHjehlo9uHnGkVsbIfJiH4rd1WmUaNGUlw+ZXkSExN5\ni4hApcb0zl7VaAbq4ap/6eBVyZQpU/DH5lGtmtpPoFWrVvgiLcBIsPSbb9A5IiuX4GTutc8IjFPD\nFOrNkFA5x6lvgAAwO8zj8OHD6J2SPXv24DvLMm6cep3CwcEBn9o+S5bwjNeQLA50RQbqKbt+BoPR\niuojxaR2e3p3gjljF2jcl/EW1QARd66ZCXxdGCcgb1Gy9KJo50bSUHZEE+83PHuELK2HkiMKOhid\nvB2fsMBPTO4zhXyl01TtR5PceBE1i7u7O2axwRjRgkRHRxcsqL72LzIyEl/YOElJSc7O/K4QlGKb\nEbTMoNuyO54sVK97lylTRpSeBDcnAeig4Is0YudOzgbmbE7MkWrzJk+sQfYkkZn7MVFIOnpQdckF\n2mOAwoWxeJUf8A0h8+bNw+wwA+EpZ0L8/PzwncXhmoYGli9fji9snwUL+EYAAFB40Is6pqAnETmm\nzImHr4kKBr6Q4CRq8KRwWUFTO2HCBIyHFQCjAu6hO5I5K/nzmmi299JQoHsEei4fZ/zNAmOeCg1J\nhkz4X6DrFCa9xjdZ0LyfVJDaHUgRF/RKyfjx4zF/jcG4Lnn16tUxNAZ8ahc4OTlhqlR0GENHMD68\nafrmzZvjD0ylc2fOGWQGyxyl0I+g5aVfGF2Xsn2dGpigoVMZ1j+ghcsxxAocPXoUs8NU0CMO0hnB\nNxDu7gfAMtbbLcOiRYswVXpoO1z72nkqwlhxCYghNflHsBgmTpyIMbAahGdSZx00KdXWJ6Am9ytI\ndt7cgBZm7DPiyBD0k6BkA2LpWl4f/hlWJaNHj8acNRKjlSIGyJAhQwZ8ahdwL2fWBTo1ngsXLqxY\nsQJ94WDg8SYLwDO3DfT3JEEJwm/RFgUGGfwDxFoxx5gDjEorV66MHnHYvn07ukg7zp49y91pbPm9\nUZIyZ86cJk04s7sCMmYmk3yMUC3Q1AZ+pp0waMR/momecKhbt67JTa3UCHXq/CM2fwSOFd8IOuW+\n7gHvyiAuTfqQ2Qdp/xjaK3+9Gyp3faKtwZq71PjD9GChpUMGV1dXzFCTMFpDcK+gypQp0/Hjx/GF\n7RMbGzt37tx8+dSX7wtYsmQJOjWSI0eOoBccKlWqNHPmTHRhBTx58oS3u8GlHrVEY+t2HlgJTqZr\npV34LQ6HwoULQ/IxI4ykW7du6AuHCRMmbNmyBV2kNZcvX+YaEO3evTu+sBcEh33V5CtG9eKqW1RB\ngpoUfBVc2R1PL4fwekHajiCdJ5MevHvaWX744QcMz1rBiKqYvJ2OwwQptVEBXfj7MdLrN54RfC7t\n3Ok6OqhJ0L4hX7VImIL0nkVvLy/PsQvNx/wJSFOGXFOmcKx5EaLf+r7NERERgQnToGXLlkZtoe7c\nuXOjRo1atWolsHoPZM2a9ePHj+jOaoAuBffwIl03OmL7OwZZgQ5sSArpod6tzSVXrlxNmzZ98OAB\nZkRqxMTEVKlSBX4CnwR6wcG01RdJEdjt69mzJ76wC1JSUqDTIDTVoqJwGVKuLmk9jJ5lDNUmxxR0\n8r9sLXreUQcNGzbEwKyY7dv5+3RqtxfHloWVCIw+LyuIg97pX8fc9BZ6UJma4toM3WgjW7ZsV65c\n+fTpE2alqZiiUIXmOQh59OgRvrMXNI89qEiXLh38q2drEtRtcMC99k4TdGp9CPYek9yFaY9P8GXb\nqPgyC2PQV+XbjRNQt25dzAttdO3Kv7VRAxN2BloGjB+Dk5MTPrUjeBtztMLUXG3oeo5gALZAQEAA\nRpql33zGfLHtb4bY/o7W3Boc01eptbGGwDbmr1+/xuwzGxO/FZ4NWELWrl2LL+yF5cuXY9pEBTrR\n//77L4ZhrZQvXx6jC2TJRg382sEufFYCYugFh5p78c2mTJky1rC5TD/cveUVKlQQ0dpXmsPt6Dg7\nO3fo0AH/Yypt27atXr26Ld4AzbUUS+kzW8vxa9sS0KYwPK3NK9PTp09DYitWrNiqVassWXiGjVKl\nbNmyjRs3HjFiBJtjImJ652vHjh0YOwZ72j34xx9/YKoYVq1axbuO0Xg6duw4a9asDRs2YABWT+vW\nnJ6gYx7y51U6ZSr4ym1O6Mm2V6RYJUwXM48HhYv/MZWlS5dOnjz51atXmHdWTHBwMEaawfp7AAYi\nWENl7S38/vvv8+bNa9CAc1OmAVSqVMnT09MaLDaYzMmT6jubkUELbdnaQwRd3ubfeFG6dOnnz59j\ngplbs6ZNmwYl3qhRI3ShjYkTJ86ePVvS2mrWbAZGU4l9mE9auHAhpkcJPAwPDw8LCztx4gQ+Mpgb\nN24cOHDA/Kl5C+Pv748JYHHIQfdrsKfmbVSgkxuqoDPYHNjFTijWc+fO9e7dG58aRosWLS5cuHDw\n4EE2x2yC+Ph4wUJjiRIl8J3N0qlTJ0wMQ86cOd+9e4fvmC0RUEanTp1atmwZutAGNMdnzpw5dOgQ\nt5m2XZ49e4YJU9H/d/r9CyqFDUgE3cZR60dMBYOTk5OuDSiRkZFQ3Ee1sX//fnQkJWYpVMHVVMD6\n9evxnW2iebItOjoa3zEkJCSwN0tky5YNXWjAnoQBZ1++fGF+ZJMMHz4c06Mi6Esqu9KtVtjTbPzt\nDJprpewlXNCwogttsBePp9V1XaJQsWJFTAyDuPfkW5iOHTtiMpTgC21AfYRaqYlN11NdvHjxAnNE\nxYAF9NCIDXWLoRO895vwNkbrXtI2N3K3bgmNH65atQrf2Rqenp6YBoYMGTLYR3fVZEaPHo15oWL9\nQ9u7c5G1t8K9EIqQ2rVrYyK/SwRGpEHFpqSk4DvbQXBgKXfu3PhChuHu3bvsphs1gxfRO2Jt4nx5\n4Gd65qfO/zDmDNZfxCJo+3PnzmFylXh5eeE720FgcRu06bVr1/DddwzvtizAMQ/56x9bWo+h66av\nBcchrP9AoQUQrDa1aNFi1qxZ+M4W6N69O0adoVSpUvZhXVxcLl68iBnEZXqwtdt8gBam/3zS5CeM\nMAMUsfVf6yvO8PnRo0eYaCUtW7Y8efIkvrZ6NO+vePbsGb777hk5ciRmCkv+4qRMTeamJKu/JSo4\nmZ75ycsz1CxrU5Zv374J7ncDBg0ahK+tmHnz5kEhYowZcuXKJWtTXTx+/BizicuMfdSQ9fZ3wiqT\n5hIQQ40na1gEdHJysokiFm0++sGDB5h0Dvfv38fX1org0koWu7H7Lxbci4aQfMWorgqINfcCQokE\nquU+BanZVnD0sGbNmpgkGQYXF6FN8IEDB+I7q0TLfag2dU40TXj69CnmlIDAz3T3vpWo1R3RdJfG\n7EMYNz6YEqtHzIhqrqcC8NA6VyKfPHmiOVmdLl2673zdVBfCcSpQrCLxPEHHqdY2VN31icw7TOoL\nLTDI2lQrrq6uwluBCVm9ejW+thq0zl7apYUKiRAYD0AWnaMXRQSl6QwwaPTgJLL0Epl1AGOlJHfu\n3OXKlcME2AIia/6wsLDffvsNM0NJ/vz5re2U3uPHjx0dHTF+SjJkyGA3lzBLwYQJE9q0aYOZpaJR\nT3r9JL2aRqOeWF5gxBymIH3mYNw4NGvWDJMho0F4eLjmzRDLli2zkkNBN2/eXLlyJUZLSfbs2T08\nPCIiItCRjAFs27YNs49LpcbU8uK+NKrFIV/Jxiek92ySIy/GRwkMb65cuYJRtxEkGUqvX78es0RJ\nmTJlGjdujK/TmqZNm2p2yQMCAmyu8CxPSkpKnTp1MMtUVG1J+s6lKx90R36azABH0E7uYebOYQ2z\ngtCfO3z48OfPnzENMtp49+6dprkZ6HS2bNkyNjYWHaUFWmsrAH1idCFjDAcOHJg4cSJmIpeqLajt\neIvVYt8Iqr9PKEjdjqQk39wpg7+//+XLlzHStoNUc9MbNmzAjOGQOXPmtN310KVLF4gDxoYDFB66\nkDGAUqVKYcYJWHSOroIExFhuVcY3kh6N3fWJBkrRYpQVIy2TGomJiQ8fPsRc45ApUyYocXRkQbp2\n7aq1tu7evdsuT45aEi27IlT8cRprMWhW0XdIQMsA3gbE0gOm/ebx7oHnYA33HpqGtM3NsGHDMIf4\neHl53bt3Dx1Jz+3bt7XuZQAmT56MjmSMJGvWrJiJApZeIhseGXEJpckCPVyft1SL5+Pt41UhL4eb\nwI0bN3jGnDm4ubndvHkT3UnGnTt3/vrrLwySj6urq6+vL7qTMZvKlSs7O2uvO+TPa/SSUahlgQm0\novm951U9YwU0KHgCbcK6+2Tu31R0YIWXtxuF5P33MWPGaL3iCli7dq3U24ChdVi3bh2Gx6dDhw5T\npkxBdzLGA+pqwoQJ3Mur1VRuSu+h3KegB1egqyuoXWbKjmi6heGQgloobdYXQ+QDJWtDp7askJUr\nV+oaxKxevXrhwoWid4hjY2M9PT23bNmCwWgwbtw4dCojHq9evRo/frzWmQBKy8HUvlKv3+iwdT9U\n5yQqhtRo0KCsY5ADCjLRh84nD15Mb1nWQfv27T08PDBaNoslJsRiYmIOHjyouQkIKFu2bLdu3QYM\nGIBOxQO87dGjh9bVF8D8i2RlWM6ePbtp0ybMVgF1OtI7CH8JoifeoDbSMWuEUjRqoHZRuWdWXMAT\n0KMrrpOKjUkDLdd6swQEBGDkZMwjJCSkSZMmmK18ypQp07Nnz3bt2i1duhRdm8T+/fvbtGnTvXt3\nnbeZMhe/BAcH4w9kJODcuXPe3t6Y3VrJX4IudtZqR2v01J1Yo3UJvJ3sS12Ce5D6XUhGHQqb4dSp\nU35+fvHx8RgbW8ZyK0yfP3++e/cuZqEG+fPnz507N/yBrk0FfAB/BNcpCwBnycnJrHsZUUhISBg4\ncCDmr4D0Gemtv5mykka96BZcUKsgoB39P9B5JF0Cb3d+pDbSWPfQye03l17WD15lFt7WrmLMmDHy\n5iNxYa0WY/7qIF++fFDpHBwcDOkZs7cFg3sgb968+q+0fPv2bXR0tFxbLQPUnaFDh2LW6yF9Bmox\nLXtOnQJv0xl0U6m/v39cXBwGbxdYTqGqWLt2bYUKFTBH9XLv3r3bt2/f0k14eDg6TY3y5cvb7kK3\nDVGuXLnChXk3umincDny1zV6jFWXLD5P9ywYQM6cOYsUKYLBy0jGzZs3nZ2da9WqhfkuGa6urqVK\nlZJra9pSsGDBmjVrYpGIh5OTU9WqVdnrQ+ySNFCoLG5ubpMmTSpbtizmtDRA5Zw4cSJ7S5eMZXjx\n4sWgQYN++eUXLAPJcHd3h69IXiu1MMOHD4eaKzCvbz6Ojo5Tp07t168fBiNjBaxYsWL06NFaTp8b\nz3dSW9NMobLAGBRG/Zs3b8ZcF4+AgAA/P7+nT59iSDIWZ8eOHaGhoc2aNcMiEQkYAYeFhXl7e2Mw\nMmnB/fv3oebq2UNkOND3CgwMPHv2LHotY2XExcVt27ZtFwPU6BYtWmDJ6QWKNSQkhP3V91Nb01ih\nqnj9+nVycnKJEiWwNEylatWqSUlJ4Bv6K5PWfPnyJTIyEtpfLCEzgE7S+/fv09bOgIwAqGsRDFAu\nXl5eWFR6GTduXHx8PPurN2/eoEcyNgLU6Ldv37LFp4vvtlitRaFqpXTp0i4uLuV1U65cOdlAqy1y\n586dggULYinqoGjRorLBDRkZGRvCqhWqjIyMjIyMrSArVBkZGRkZGRGQFaqMjIyMjIwIyApVRkZG\nRkZGBGSFKiMjIyMjIwKyQpWRkZGRkREBWaHKyMjIyMiYjULxfzuexGQKk6jLAAAAAElFTkSuQmCC\n", "text/plain": [""]}, "execution_count": 36, "metadata": {}, "output_type": "execute_result"}], "source": ["from IPython.display import Image\n", "Image(\"patates.png\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On souhaite ajouter une colonne pays aux marathons se d\u00e9roulant dans les villes suivanes."]}, {"cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " C | \n", " V | \n", "
\n", " \n", " \n", " \n", " 0 | \n", " USA | \n", " BOSTON | \n", "
\n", " \n", " 1 | \n", " USA | \n", " NEW YORK | \n", "
\n", " \n", " 2 | \n", " Germany | \n", " BERLIN | \n", "
\n", " \n", " 3 | \n", " UK | \n", " LONDON | \n", "
\n", " \n", " 4 | \n", " France | \n", " PARIS | \n", "
\n", " \n", "
\n", "
5 rows \u00d7 2 columns
\n", "
"], "text/plain": [" C V\n", "0 USA BOSTON\n", "1 USA NEW YORK\n", "2 Germany BERLIN\n", "3 UK LONDON\n", "4 France PARIS\n", "\n", "[5 rows x 2 columns]"]}, "execution_count": 37, "metadata": {}, "output_type": "execute_result"}], "source": ["values = [ {\"V\":'BOSTON', \"C\":\"USA\"}, \n", " {\"V\":'NEW YORK', \"C\":\"USA\"}, \n", " {\"V\":'BERLIN', \"C\":\"Germany\"}, \n", " {\"V\":'LONDON', \"C\":\"UK\"}, \n", " {\"V\":'PARIS', \"C\":\"France\"}]\n", "pays = pandas.DataFrame(values)\n", "pays"]}, {"cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " ville | \n", " annee | \n", " temps | \n", " secondes | \n", " C | \n", " V | \n", "
\n", " \n", " \n", " \n", " 0 | \n", " PARIS | \n", " 2011 | \n", " 02:06:29 | \n", " 7589 | \n", " France | \n", " PARIS | \n", "
\n", " \n", " 1 | \n", " PARIS | \n", " 2010 | \n", " 02:06:41 | \n", " 7601 | \n", " France | \n", " PARIS | \n", "
\n", " \n", " 193 | \n", " BOSTON | \n", " 2010 | \n", " 02:05:52 | \n", " 7552 | \n", " USA | \n", " BOSTON | \n", "
\n", " \n", " 194 | \n", " BOSTON | \n", " 2011 | \n", " 02:03:02 | \n", " 7382 | \n", " USA | \n", " BOSTON | \n", "
\n", " \n", "
\n", "
4 rows \u00d7 6 columns
\n", "
"], "text/plain": [" ville annee temps secondes C V\n", "0 PARIS 2011 02:06:29 7589 France PARIS\n", "1 PARIS 2010 02:06:41 7601 France PARIS\n", "193 BOSTON 2010 02:05:52 7552 USA BOSTON\n", "194 BOSTON 2011 02:03:02 7382 USA BOSTON\n", "\n", "[4 rows x 6 columns]"]}, "execution_count": 38, "metadata": {}, "output_type": "execute_result"}], "source": ["dfavecpays = df.merge(pays, left_on=\"ville\", right_on=\"V\")\n", "pandas.concat([dfavecpays.head(n=2),dfavecpays.tail(n=2)])"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### pivot (tableau crois\u00e9 dynamique)\n", "\n", "Cette op\u00e9ration consiste \u00e0 cr\u00e9er une seconde table en utilisant utiliser les valeurs d'une colonne comme nom de colonnes. \n", "\n", "| A | B | C |\n", "| --- | --- | --- |\n", "| A1 | B1 | C1 |\n", "| A1 | B2 | C2 |\n", "| A2 | B1 | C3 |\n", "| A2 | B2 | C4 |\n", "| A2 | B3 | C5 |\n", "\n", "L'op\u00e9ration ``pivot(A,B,C)`` donnera :\n", "\n", "| A | B1 | B2 | B3 |\n", "| --- | --- | --- | --- |\n", "| A1 | C1 | C2 | |\n", "| A2 | C3 | C4 | C5 |\n", "\n", "* [pivot](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot.html)\n", "* [Reshaping and Pivot Tables](http://pandas.pydata.org/pandas-docs/stable/reshaping.html)\n", "* [Tableau crois\u00e9 dynamique - wikip\u00e9dia](http://fr.wikipedia.org/wiki/Tableau_crois%C3%A9_dynamique)\n", "\n", "On applique cela aux marathons o\u00f9 on veut avoir les villes comme noms de colonnes et une ann\u00e9e par lignes."]}, {"cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " ville | \n", " AMSTERDAM | \n", " BERLIN | \n", " BOSTON | \n", " CHICAGO | \n", " FUKUOKA | \n", " LONDON | \n", " NEW YORK | \n", " PARIS | \n", " STOCKOLM | \n", "
\n", " \n", " annee | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", "
\n", " \n", " \n", " \n", " 1967 | \n", " NaN | \n", " NaN | \n", " 02:15:45 | \n", " NaN | \n", " 02:09:37 | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", "
\n", " \n", " 1968 | \n", " NaN | \n", " NaN | \n", " 02:22:17 | \n", " NaN | \n", " 02:10:48 | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", "
\n", " \n", " 1969 | \n", " NaN | \n", " NaN | \n", " 02:13:49 | \n", " NaN | \n", " 02:11:13 | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", "
\n", " \n", " 1987 | \n", " 02:12:40 | \n", " 02:11:11 | \n", " 02:11:50 | \n", " NaN | \n", " 02:08:18 | \n", " 02:09:50 | \n", " 02:11:01 | \n", " 02:11:09 | \n", " 02:13:52 | \n", "
\n", " \n", " 1988 | \n", " 02:12:38 | \n", " 02:11:45 | \n", " 02:08:43 | \n", " 02:08:57 | \n", " 02:11:04 | \n", " 02:10:20 | \n", " 02:08:20 | \n", " 02:13:53 | \n", " 02:14:26 | \n", "
\n", " \n", " 1989 | \n", " 02:13:52 | \n", " 02:10:11 | \n", " 02:09:06 | \n", " 02:11:25 | \n", " 02:12:54 | \n", " 02:09:03 | \n", " 02:08:01 | \n", " 02:13:03 | \n", " 02:13:34 | \n", "
\n", " \n", " 2009 | \n", " 02:06:18 | \n", " 02:06:08 | \n", " 02:08:42 | \n", " 02:05:41 | \n", " 02:05:18 | \n", " 02:05:10 | \n", " 02:09:15 | \n", " 02:05:47 | \n", " 02:15:34 | \n", "
\n", " \n", " 2010 | \n", " 02:05:44 | \n", " 02:05:08 | \n", " 02:05:52 | \n", " 02:06:23 | \n", " 02:08:24 | \n", " 02:05:19 | \n", " 02:08:14 | \n", " 02:06:41 | \n", " 02:12:48 | \n", "
\n", " \n", " 2011 | \n", " NaN | \n", " 02:03:38 | \n", " 02:03:02 | \n", " NaN | \n", " NaN | \n", " 02:04:40 | \n", " NaN | \n", " 02:06:29 | \n", " 02:14:07 | \n", "
\n", " \n", "
\n", "
9 rows \u00d7 9 columns
\n", "
"], "text/plain": ["ville AMSTERDAM BERLIN BOSTON CHICAGO FUKUOKA LONDON NEW YORK \\\n", "annee \n", "1967 NaN NaN 02:15:45 NaN 02:09:37 NaN NaN \n", "1968 NaN NaN 02:22:17 NaN 02:10:48 NaN NaN \n", "1969 NaN NaN 02:13:49 NaN 02:11:13 NaN NaN \n", "1987 02:12:40 02:11:11 02:11:50 NaN 02:08:18 02:09:50 02:11:01 \n", "1988 02:12:38 02:11:45 02:08:43 02:08:57 02:11:04 02:10:20 02:08:20 \n", "1989 02:13:52 02:10:11 02:09:06 02:11:25 02:12:54 02:09:03 02:08:01 \n", "2009 02:06:18 02:06:08 02:08:42 02:05:41 02:05:18 02:05:10 02:09:15 \n", "2010 02:05:44 02:05:08 02:05:52 02:06:23 02:08:24 02:05:19 02:08:14 \n", "2011 NaN 02:03:38 02:03:02 NaN NaN 02:04:40 NaN \n", "\n", "ville PARIS STOCKOLM \n", "annee \n", "1967 NaN NaN \n", "1968 NaN NaN \n", "1969 NaN NaN \n", "1987 02:11:09 02:13:52 \n", "1988 02:13:53 02:14:26 \n", "1989 02:13:03 02:13:34 \n", "2009 02:05:47 02:15:34 \n", "2010 02:06:41 02:12:48 \n", "2011 02:06:29 02:14:07 \n", "\n", "[9 rows x 9 columns]"]}, "execution_count": 39, "metadata": {}, "output_type": "execute_result"}], "source": ["piv = df.pivot(\"annee\",\"ville\",\"temps\")\n", "pandas.concat([piv[20:23],piv[40:43],piv.tail(n=3)])"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Il existe une m\u00e9thode qui effectue l'op\u00e9ration inverse : [Dataframe.stack](http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.stack.html)."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercice 2 : moyennes par groupes\n", " \n", "Toujours avec le m\u00eame jeu de donn\u00e9es ([marathon.txt](http://www.xavierdupre.fr/enseignement/complements/marathon.txt)), on veut ajouter une ligne \u00e0 la fin du tableau crois\u00e9 dynamique contenant la moyenne en secondes des temps des marathons pour chaque ville."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Dates\n", "\n", "Les dates sont souvent compliqu\u00e9es \u00e0 g\u00e9rer car on n'utilise pas le m\u00eames format dans tous les pays. Pour faire simple, je recommande deux options :\n", "\n", "* Soit convertir les dates/heures au format cha\u00eenes de caract\u00e8res ``AAAA-MM-JJ hh:mm:ss:ms`` qui permet de trier les dates par ordre croissant.\n", "* Soit convertir les dates/heures au format [datetime](https://docs.python.org/3/library/datetime.html) (date) ou [timedelta](https://docs.python.org/3/library/datetime.html#timedelta-objects) (dur\u00e9e) (voir [Quelques notions sur les dates](http://www.xavierdupre.fr/blog/notebooks/example%20pyensae.html#date), [format de date/heure](https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior)).\n", "\n", "Par exemple, voici le code qui a permis de g\u00e9n\u00e9rer la colonne seconde de la table marathon :"]}, {"cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " ville | \n", " annee | \n", " temps | \n", " secondes | \n", "
\n", " \n", " \n", " \n", " 0 | \n", " PARIS | \n", " 2011 | \n", " 02:06:29 | \n", " 7589.0 | \n", "
\n", " \n", " 1 | \n", " PARIS | \n", " 2010 | \n", " 02:06:41 | \n", " 7601.0 | \n", "
\n", " \n", " 2 | \n", " PARIS | \n", " 2009 | \n", " 02:05:47 | \n", " 7547.0 | \n", "
\n", " \n", " 3 | \n", " PARIS | \n", " 2008 | \n", " 02:06:40 | \n", " 7600.0 | \n", "
\n", " \n", " 4 | \n", " PARIS | \n", " 2007 | \n", " 02:07:17 | \n", " 7637.0 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" ville annee temps secondes\n", "0 PARIS 2011 02:06:29 7589.0\n", "1 PARIS 2010 02:06:41 7601.0\n", "2 PARIS 2009 02:05:47 7547.0\n", "3 PARIS 2008 02:06:40 7600.0\n", "4 PARIS 2007 02:07:17 7637.0"]}, "execution_count": 40, "metadata": {}, "output_type": "execute_result"}], "source": ["from datetime import datetime, time\n", "from ensae_teaching_cs.data import marathon\n", "import pandas\n", "df = pandas.read_csv(marathon(), sep=\"\\t\", names=[\"ville\", \"annee\", \"temps\",\"secondes\"])\n", "df = df [[\"ville\", \"annee\", \"temps\"]] # on enl\u00e8ve la colonne secondes pour la recr\u00e9er\n", "df[\"secondes\"] = df.apply( lambda r : (datetime.strptime(r.temps,\"%H:%M:%S\") - \\\n", " datetime(1900,1,1)).total_seconds(), axis=1)\n", "df.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Matrix, Array (numpy)\n", "\n", "Le module le plus populaire sous Python est [numpy](http://www.numpy.org/). Il propose deux containers [Matrix](http://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html) et [Array](http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html) qui facilitent le calcul matriciel. Ce module est \u00e9crit en C++, Fortran. Il sera plus rapide que tout code \u00e9crit en Python. De nombreuses modules Python s'appuient sur numpy : [SciPy](http://www.scipy.org/), [pandas](http://pandas.pydata.org/), [scikit-learn](http://scikit-learn.org/stable/), [matplotlib](http://matplotlib.org/), ... Il y a deux diff\u00e9rences entre un ``DataFrame`` et un tableau ``numpy`` :\n", "\n", "* Il n'y a pas d'index sur les lignes autre que l'index entier de la ligne.\n", "* Tous les types doivent \u00eatre identiques (tous entier, tous r\u00e9els, tous str). Il n'y a pas de m\u00e9lange possible. C'est \u00e0 cette condition que les calculs sont aussi rapides.\n", "\n"]}, {"cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["int \n", " [[1 2]\n", " [3 4]]\n", "float \n", " [[ 1. 2. ]\n", " [ 3. 4.1]]\n", "str \n", " [['1' '2']\n", " ['3' '4']]\n"]}], "source": ["import numpy\n", "print(\"int\",\"\\n\",numpy.matrix([[1, 2], [3, 4,]]))\n", "print(\"float\",\"\\n\",numpy.matrix([[1, 2], [3, 4.1]]))\n", "print(\"str\",\"\\n\",numpy.matrix([[1, 2], [3, '4']]))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Il y a deux types d'objets, ``array`` et ``matrix``. Le type ``matrix`` se comporte comme on peut l'attendre d'une matrice. Le type ``array`` est plus g\u00e9n\u00e9rique et autorise plus de deux dimensions. Les op\u00e9rateurs qui s'y appliquent ne comportent pas comme ceux d'une matrice, en particulier la multiplication qui se fait terme \u00e0 terme pour un tableau."]}, {"cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["multiplication de matrices\n", " [[ 1. 0.]\n", " [ 0. 1.]]\n", "multiplication de tableaux (terme \u00e0 terme)\n", " [[ 0. 1.]\n", " [ 1. 0.]]\n"]}], "source": ["m1 = numpy.matrix( [[0.0,1.0],[1.0,0.0]])\n", "print(\"multiplication de matrices\\n\",m1 * m1)\n", "m2 = numpy.array([[0.0,1.0],[1.0,0.0]])\n", "print(\"multiplication de tableaux (terme \u00e0 terme)\\n\",m2 * m2)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Un tableau en plusieurs dimensions :"]}, {"cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["(2, 2, 2)\n"]}, {"data": {"text/plain": ["array([[[ 0., 1.],\n", " [ 1., 0.]],\n", "\n", " [[ 0., 1.],\n", " [ 1., 0.]]])"]}, "execution_count": 43, "metadata": {}, "output_type": "execute_result"}], "source": ["cube = numpy.array( [ [[0.0,1.0],[1.0,0.0]],\n", " [[0.0,1.0],[1.0,0.0]] ] )\n", "print(cube.shape)\n", "cube"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Quelques liens pour apprendre \u00e0 manipuler ces objets :\n", " \n", "- [op\u00e9rations avec numpy.matrix](http://www.xavierdupre.fr/app/ensae_teaching_cs/helpsphinx/all_example_science.html#operations-avec-numpy-matrix) \n", "- [Numpy - multidimensional data arrays](http://nbviewer.jupyter.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-2-Numpy.ipynb)\n", "- [NUmpy Tutorial](http://wiki.scipy.org/Tentative_NumPy_Tutorial)\n", "- [classe numpy.matrix](http://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html)\n", "- [classe numpy.array](http://docs.scipy.org/doc/numpy/reference/arrays.html)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### matrices nulle, identit\u00e9, al\u00e9atoire\n", "\n", "On utilise beaucoup les fonctions suivantes pour cr\u00e9er une matrice ou un tableau particulier."]}, {"cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[ 0., 0., 0., 0.],\n", " [ 0., 0., 0., 0.],\n", " [ 0., 0., 0., 0.]])"]}, "execution_count": 44, "metadata": {}, "output_type": "execute_result"}], "source": ["# la matrice nulle\n", "numpy.zeros( (3,4) )"]}, {"cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[ 1., 1., 1., 1.],\n", " [ 1., 1., 1., 1.],\n", " [ 1., 1., 1., 1.]])"]}, "execution_count": 45, "metadata": {}, "output_type": "execute_result"}], "source": ["# la matrice de 1\n", "numpy.ones( (3,4) )"]}, {"cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[ 1., 0., 0.],\n", " [ 0., 1., 0.],\n", " [ 0., 0., 1.]])"]}, "execution_count": 46, "metadata": {}, "output_type": "execute_result"}], "source": ["# la matrice identit\u00e9\n", "numpy.identity( 3 )"]}, {"cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[ 0.56295296, 0.77545561, 0.56041393, 0.90371888],\n", " [ 0.09984123, 0.59781939, 0.09845057, 0.30856921],\n", " [ 0.37161512, 0.5630934 , 0.6359542 , 0.13298039]])"]}, "execution_count": 47, "metadata": {}, "output_type": "execute_result"}], "source": ["# la matrice al\u00e9atoire\n", "numpy.random.random( (3,4))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Pour d'autres fonctionnalit\u00e9s al\u00e9atoires : [numpy.random](http://docs.scipy.org/doc/numpy/reference/routines.random.html)."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Quelques fonctions fr\u00e9quemment utilis\u00e9es\n", "\n", "* [column_stack](http://docs.scipy.org/doc/numpy/reference/generated/numpy.column_stack.html) : pour assembler des colonnes les unes \u00e0 c\u00f4t\u00e9 des autres\n", "* [vstack](http://docs.scipy.org/doc/numpy/reference/generated/numpy.vstack.html) : pour assembler des lignes les unes \u00e0 la suite des autres"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### de DataFrame \u00e0 numpy\n", "\n", "Le plus simple est sans doute d'utiliser ``pandas`` pour lire un fichier texte et d'utiliser la propri\u00e9t\u00e9 [values](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.values.html) pour convertir tout ou partie du ``DataFrame`` en ``numpy.matrix``."]}, {"cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": [" \n"]}, {"data": {"text/plain": ["array([[2011, 7589],\n", " [2010, 7601]], dtype=int64)"]}, "execution_count": 48, "metadata": {}, "output_type": "execute_result"}], "source": ["from pandas import read_csv\n", "import numpy\n", "from datetime import datetime, time\n", "from ensae_teaching_cs.data import marathon\n", "df = read_csv(marathon(filename=True), \n", " sep=\"\\t\", names=[\"ville\", \"annee\", \"temps\",\"secondes\"])\n", "arr = df[[\"annee\",\"secondes\"]].values # retourne un array (et non un matrix)\n", "mat = numpy.matrix(arr)\n", "print(type(arr),type(mat))\n", "arr[:2,:]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["La conversion r\u00e9ciproque est aussi simple mais il faut pr\u00e9ciser les noms des colonnes qui ne sont pas m\u00e9moris\u00e9es dans l'objet ``numpy.array`` :"]}, {"cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " annee | \n", " secondes | \n", "
\n", " \n", " \n", " \n", " 0 | \n", " 2011 | \n", " 7589 | \n", "
\n", " \n", " 1 | \n", " 2010 | \n", " 7601 | \n", "
\n", " \n", "
\n", "
2 rows \u00d7 2 columns
\n", "
"], "text/plain": [" annee secondes\n", "0 2011 7589\n", "1 2010 7601\n", "\n", "[2 rows x 2 columns]"]}, "execution_count": 49, "metadata": {}, "output_type": "execute_result"}], "source": ["import pandas\n", "df2 = pandas.DataFrame(arr, columns=[\"annee\", \"secondes\"])\n", "df2.head(n=2)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercice 3 : r\u00e9gression lin\u00e9aire\n", "\n", "On souhaite impl\u00e9menter une [r\u00e9gression](http://fr.wikipedia.org/wiki/R%C3%A9gression_lin%C3%A9aire_multiple) qui se traduit par le probl\u00e8me suivant : $Y=XA+\\epsilon$. La solution est donn\u00e9e par la formule matricielle : $A^*=(X'X)^{-1}X'Y$. On pr\u00e9pare les donn\u00e9es suivantes."]}, {"cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "
\n", " \n", " \n", " | \n", " ville | \n", " annee | \n", " temps | \n", " secondes | \n", " estPARIS | \n", " estBERLIN | \n", "
\n", " \n", " \n", " \n", " 0 | \n", " PARIS | \n", " 2011 | \n", " 02:06:29 | \n", " 7589 | \n", " 1 | \n", " 0 | \n", "
\n", " \n", " 1 | \n", " PARIS | \n", " 2010 | \n", " 02:06:41 | \n", " 7601 | \n", " 1 | \n", " 0 | \n", "
\n", " \n", " 2 | \n", " PARIS | \n", " 2009 | \n", " 02:05:47 | \n", " 7547 | \n", " 1 | \n", " 0 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" ville annee temps secondes estPARIS estBERLIN\n", "0 PARIS 2011 02:06:29 7589 1 0\n", "1 PARIS 2010 02:06:41 7601 1 0\n", "2 PARIS 2009 02:05:47 7547 1 0"]}, "execution_count": 50, "metadata": {}, "output_type": "execute_result"}], "source": ["from pandas import read_csv\n", "from datetime import datetime, time\n", "from ensae_teaching_cs.data import marathon\n", "df = read_csv(marathon(filename=True), \n", " sep=\"\\t\", names=[\"ville\", \"annee\", \"temps\",\"secondes\"])\n", "df = df [ (df[\"ville\"] == \"BERLIN\") | (df[\"ville\"] == \"PARIS\") ] \n", "for v in [\"PARIS\",\"BERLIN\"]:\n", " df[\"est\" + v] = df.apply( lambda r : 1 if r[\"ville\"] == v else 0, axis=1)\n", "df.head(n = 3)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On veut construire le mod\u00e8le : $secondes = a_0 \\; annee + a_1 \\; stPARIS + a_2 \\; estBERLIN$. En appliquant la formule ci-dessus, d\u00e9terminer les coefficients $a_0,a_1,a_2$."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Annexes"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Cr\u00e9er un fichier Excel avec plusieurs feuilles\n", "\n", "La page [Allow ExcelWriter() to add sheets to existing workbook](https://github.com/pydata/pandas/issues/3441) donne plusieurs exemples d'\u00e9criture."]}, {"cell_type": "code", "execution_count": 50, "metadata": {"collapsed": true}, "outputs": [], "source": ["import pandas\n", "writer = pandas.ExcelWriter('tou_example.xlsx')\n", "df.to_excel(writer, 'Data 0')\n", "df.to_excel(writer, 'Data 1')\n", "writer.save()"]}, {"cell_type": "code", "execution_count": 51, "metadata": {"collapsed": true}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0"}}, "nbformat": 4, "nbformat_minor": 2}