{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 2A.eco - Mise en pratique des s\u00e9ances 1 et 2 - Utilisation de pandas et visualisation - correction\n", "\n", "Correction d'un exercice sur la manipulation des donn\u00e9es."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/plain": ["['Players_WC2014.xlsx', 'velib_t1.txt', 'velib_t2.txt', 'villes.txt']"]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["from pyensae.datasource import download_data\n", "files = download_data(\"td2a_eco_exercices_de_manipulation_de_donnees.zip\",\n", " url=\"https://github.com/sdpython/ensae_teaching_cs/raw/master/_doc/notebooks/td2a_eco/data/\")\n", "files"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercice 1 - manipulation des bases\n", "\n", "Dur\u00e9e : 10 minutes\n", "\n", "1. Importer la base de donn\u00e9es relatives aux joueurs de la Coupe du Monde 2014\n", "2. D\u00e9terminer le nombre de joueurs dans chaque \u00e9quipe et cr\u00e9er un dictionnaire { \u00e9quipe : Nombre de joueurs}\n", "3. D\u00e9terminer quels sont les 3 joueurs qui ont couvert le plus de distance. Y a t il un biais de s\u00e9lection ?\n", "4. Parmis les joueurs qui sont dans le premier d\u00e9cile des joueurs plus rapides, qui a pass\u00e9 le plus clair de son temps \u00e0 courrir sans la balle ?"]}, {"cell_type": "markdown", "metadata": {"collapsed": true}, "source": ["# Import du fichier"]}, {"cell_type": "code", "execution_count": 3, "metadata": {"collapsed": true}, "outputs": [], "source": ["import pandas as pd\n", "data_players = pd.read_excel(\"Players_WC2014.xlsx\", engine='openpyxl')"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " | \n", " Clearance Rate | \n", " Colors | \n", " Distance Covered In Possession | \n", " Distance Covered Not In Possession | \n", " Distance Covered | \n", " Location | \n", " Name | \n", " Team | \n", " Top Speed | \n", " % Passes Completed | \n", " Assists | \n", " Attempted Clearances | \n", " Attempts In The Area | \n", " Attempts Off-Target | \n", " Attempts On Target | \n", " Attempts On-Target From Inside The Area | \n", " Attempts On-Target From Outside The Area | \n", " Attempts Outside The Area | \n", "
\n", " \n", " \n", " \n", " 0 | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " Algeria | \n", " Cedric SI MOHAMMED | \n", " Algeria | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", "
\n", " \n", " 1 | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " Algeria | \n", " Liassine CADAMURO | \n", " Algeria | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", "
\n", " \n", " 2 | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " Algeria | \n", " Mohamed ZEMMAMOUCHE | \n", " Algeria | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", "
\n", " \n", " 3 | \n", " NaN | \n", " NaN | \n", " 2.1km | \n", " 4.2km | \n", " 8.7km | \n", " Algeria | \n", " Riyad MAHREZ | \n", " Algeria | \n", " 23.9km/h | \n", " 0.55 | \n", " NaN | \n", " NaN | \n", " 1.0 | \n", " 1.0 | \n", " 0.0 | \n", " 0.0 | \n", " 0.0 | \n", " 0.0 | \n", "
\n", " \n", " 4 | \n", " - | \n", " NaN | \n", " 5.0km | \n", " 10.0km | \n", " 20.6km | \n", " Algeria | \n", " Saphir TAIDER | \n", " Algeria | \n", " 25.7km/h | \n", " 0.74 | \n", " NaN | \n", " 0.0 | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" Clearance Rate Colors Distance Covered In Possession \\\n", "0 NaN NaN NaN \n", "1 NaN NaN NaN \n", "2 NaN NaN NaN \n", "3 NaN NaN 2.1km \n", "4 - NaN 5.0km \n", "\n", " Distance Covered Not In Possession Distance Covered Location \\\n", "0 NaN NaN Algeria \n", "1 NaN NaN Algeria \n", "2 NaN NaN Algeria \n", "3 4.2km 8.7km Algeria \n", "4 10.0km 20.6km Algeria \n", "\n", " Name Team Top Speed % Passes Completed Assists \\\n", "0 Cedric SI MOHAMMED Algeria NaN NaN NaN \n", "1 Liassine CADAMURO Algeria NaN NaN NaN \n", "2 Mohamed ZEMMAMOUCHE Algeria NaN NaN NaN \n", "3 Riyad MAHREZ Algeria 23.9km/h 0.55 NaN \n", "4 Saphir TAIDER Algeria 25.7km/h 0.74 NaN \n", "\n", " Attempted Clearances Attempts In The Area Attempts Off-Target \\\n", "0 NaN NaN NaN \n", "1 NaN NaN NaN \n", "2 NaN NaN NaN \n", "3 NaN 1.0 1.0 \n", "4 0.0 NaN NaN \n", "\n", " Attempts On Target Attempts On-Target From Inside The Area \\\n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 0.0 0.0 \n", "4 NaN NaN \n", "\n", " Attempts On-Target From Outside The Area Attempts Outside The Area \n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 0.0 0.0 \n", "4 NaN NaN "]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["data_players.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Nombre de joueurs par \u00e9quipe"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/plain": ["{'Algeria': 23,\n", " 'Argentina': 23,\n", " 'Australia': 23,\n", " 'Belgium': 23,\n", " 'Bosnia': 23,\n", " 'Brazil': 23,\n", " 'Cameroon': 23,\n", " 'Chile': 23,\n", " 'Colombia': 23,\n", " 'Costa Rica': 23,\n", " 'Croatia': 23,\n", " \"C\u00f4te d'Ivoire\": 23,\n", " 'Ecuador': 22,\n", " 'England': 22,\n", " 'France': 22,\n", " 'Germany': 22,\n", " 'Ghana': 23,\n", " 'Greece': 23,\n", " 'Honduras': 23,\n", " 'Iran': 23,\n", " 'Italy': 23,\n", " 'Japan': 23,\n", " 'Korea Republic': 23,\n", " 'Mexico': 23,\n", " 'Netherlands': 23,\n", " 'Nigeria': 23,\n", " 'Portugal': 23,\n", " 'Russia': 23,\n", " 'Spain': 23,\n", " 'Switzerland': 23,\n", " 'USA': 23,\n", " 'Uruguay': 23}"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["data_players.groupby(['Team']).size().to_dict()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Les joueurs ayant couvert le plus de distance"]}, {"cell_type": "code", "execution_count": 6, "metadata": {"collapsed": true}, "outputs": [], "source": ["## quels joueurs ont couvert le plus de distance ? \n", "data_players['Distance Covered'] = data_players['Distance Covered'].str.replace('km','')"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " | \n", " Clearance Rate | \n", " Colors | \n", " Distance Covered In Possession | \n", " Distance Covered Not In Possession | \n", " Distance Covered | \n", " Location | \n", " Name | \n", " Team | \n", " Top Speed | \n", " % Passes Completed | \n", " Assists | \n", " Attempted Clearances | \n", " Attempts In The Area | \n", " Attempts Off-Target | \n", " Attempts On Target | \n", " Attempts On-Target From Inside The Area | \n", " Attempts On-Target From Outside The Area | \n", " Attempts Outside The Area | \n", "
\n", " \n", " \n", " \n", " 362 | \n", " 1 | \n", " Top 4 | \n", " 34.5km | \n", " 26.4km | \n", " 84.0 | \n", " Germany | \n", " Thomas MUELLER | \n", " Germany | \n", " 30.5km/h | \n", " 0.69 | \n", " 3.0 | \n", " 3.0 | \n", " 12.0 | \n", " 6.0 | \n", " 10.0 | \n", " 7.0 | \n", " 3.0 | \n", " 4.0 | \n", "
\n", " \n", " 353 | \n", " 0.5 | \n", " Top 4 | \n", " 34.3km | \n", " 28.0km | \n", " 82.6 | \n", " Germany | \n", " Toni KROOS | \n", " Germany | \n", " 29.8km/h | \n", " 0.85 | \n", " 4.0 | \n", " 2.0 | \n", " 2.0 | \n", " 5.0 | \n", " 11.0 | \n", " 2.0 | \n", " 9.0 | \n", " 14.0 | \n", "
\n", " \n", " 42 | \n", " 1 | \n", " Top 4 | \n", " 29.5km | \n", " 30.7km | \n", " 81.2 | \n", " Argentina | \n", " Javier MASCHERANO | \n", " Argentina | \n", " 30.3km/h | \n", " 0.86 | \n", " NaN | \n", " 8.0 | \n", " 0.0 | \n", " 0.0 | \n", " 4.0 | \n", " 0.0 | \n", " 4.0 | \n", " 4.0 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" Clearance Rate Colors Distance Covered In Possession \\\n", "362 1 Top 4 34.5km \n", "353 0.5 Top 4 34.3km \n", "42 1 Top 4 29.5km \n", "\n", " Distance Covered Not In Possession Distance Covered Location \\\n", "362 26.4km 84.0 Germany \n", "353 28.0km 82.6 Germany \n", "42 30.7km 81.2 Argentina \n", "\n", " Name Team Top Speed % Passes Completed Assists \\\n", "362 Thomas MUELLER Germany 30.5km/h 0.69 3.0 \n", "353 Toni KROOS Germany 29.8km/h 0.85 4.0 \n", "42 Javier MASCHERANO Argentina 30.3km/h 0.86 NaN \n", "\n", " Attempted Clearances Attempts In The Area Attempts Off-Target \\\n", "362 3.0 12.0 6.0 \n", "353 2.0 2.0 5.0 \n", "42 8.0 0.0 0.0 \n", "\n", " Attempts On Target Attempts On-Target From Inside The Area \\\n", "362 10.0 7.0 \n", "353 11.0 2.0 \n", "42 4.0 0.0 \n", "\n", " Attempts On-Target From Outside The Area Attempts Outside The Area \n", "362 3.0 4.0 \n", "353 9.0 14.0 \n", "42 4.0 4.0 "]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["data_players['Distance Covered'] = pd.to_numeric(data_players['Distance Covered'])\n", "data_players.sort_values(['Distance Covered'], ascending = 0).head(n=3)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On voit un clair effet de s\u00e9lection sur cette variable : ce sont les joueurs dont les \u00e9quipes ont \u00e9t\u00e9 le plus loin dans la comp\u00e9tition qui couvert le plus de distance."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Qui a \u00e9t\u00e9 le plus efficace ?\n", "\n", "On a besoin de rendre la variable Top Speed num\u00e9rique, et de cr\u00e9er une nouvelle variable avec le poucentage de possession de balle"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " | \n", " Clearance Rate | \n", " Colors | \n", " Distance Covered In Possession | \n", " Distance Covered Not In Possession | \n", " Distance Covered | \n", " Location | \n", " Name | \n", " Team | \n", " Top Speed | \n", " % Passes Completed | \n", " Assists | \n", " Attempted Clearances | \n", " Attempts In The Area | \n", " Attempts Off-Target | \n", " Attempts On Target | \n", " Attempts On-Target From Inside The Area | \n", " Attempts On-Target From Outside The Area | \n", " Attempts Outside The Area | \n", "
\n", " \n", " \n", " \n", " 222 | \n", " 0.89 | \n", " Quarter Finals | \n", " 15.4km | \n", " 22.7km | \n", " 56.2 | \n", " Costa Rica | \n", " Junior DIAZ | \n", " Costa Rica | \n", " 33.8 | \n", " 0.65 | \n", " NaN | \n", " 9.0 | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", " NaN | \n", "
\n", " \n", " 247 | \n", " 0.75 | \n", " Group Stage | \n", " 10.5km | \n", " 9.3km | \n", " 28.8 | \n", " C\u00f4te d'Ivoire | \n", " Serge AURIER | \n", " C\u00f4te d'Ivoire | \n", " 33.5 | \n", " 0.73 | \n", " NaN | \n", " 4.0 | \n", " 2.0 | \n", " 0.0 | \n", " 3.0 | \n", " 2.0 | \n", " 1.0 | \n", " 1.0 | \n", "
\n", " \n", " 32 | \n", " 0.5 | \n", " Top 4 | \n", " 25.1km | \n", " 19.2km | \n", " 63.6 | \n", " Argentina | \n", " Gonzalo HIGUAIN | \n", " Argentina | \n", " 33.1 | \n", " 0.75 | \n", " 1.0 | \n", " 4.0 | \n", " 11.0 | \n", " 7.0 | \n", " 6.0 | \n", " 6.0 | \n", " 0.0 | \n", " 2.0 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" Clearance Rate Colors Distance Covered In Possession \\\n", "222 0.89 Quarter Finals 15.4km \n", "247 0.75 Group Stage 10.5km \n", "32 0.5 Top 4 25.1km \n", "\n", " Distance Covered Not In Possession Distance Covered Location \\\n", "222 22.7km 56.2 Costa Rica \n", "247 9.3km 28.8 C\u00f4te d'Ivoire \n", "32 19.2km 63.6 Argentina \n", "\n", " Name Team Top Speed % Passes Completed Assists \\\n", "222 Junior DIAZ Costa Rica 33.8 0.65 NaN \n", "247 Serge AURIER C\u00f4te d'Ivoire 33.5 0.73 NaN \n", "32 Gonzalo HIGUAIN Argentina 33.1 0.75 1.0 \n", "\n", " Attempted Clearances Attempts In The Area Attempts Off-Target \\\n", "222 9.0 NaN NaN \n", "247 4.0 2.0 0.0 \n", "32 4.0 11.0 7.0 \n", "\n", " Attempts On Target Attempts On-Target From Inside The Area \\\n", "222 NaN NaN \n", "247 3.0 2.0 \n", "32 6.0 6.0 \n", "\n", " Attempts On-Target From Outside The Area Attempts Outside The Area \n", "222 NaN NaN \n", "247 1.0 1.0 \n", "32 0.0 2.0 "]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["## Qui a \u00e9t\u00e9 le plus rapide ?\n", "\n", "data_players['Top Speed'] = data_players['Top Speed'].str.replace('km/h','')\n", "data_players['Top Speed'] = pd.to_numeric(data_players['Top Speed'])\n", "data_players.sort_values(['Top Speed'], ascending = 0).head(n=3)"]}, {"cell_type": "code", "execution_count": 9, "metadata": {"collapsed": true}, "outputs": [], "source": ["## Parmis ceux qui sont dans le d\u00e9cile des plus rapides, qui a pass\u00e9 le plus clair de son temps \u00e0 courrir sans la balle ?\n", "data_players['Distance Covered In Possession'] = data_players['Distance Covered In Possession'].str.replace('km','')\n", "data_players['Distance Covered In Possession'] = pd.to_numeric(data_players['Distance Covered In Possession'])\n", "data_players['Share of Possession'] = data_players['Distance Covered In Possession']/data_players['Distance Covered']"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
\n", " \n", " \n", " | \n", " Clearance Rate | \n", " Colors | \n", " Distance Covered In Possession | \n", " Distance Covered Not In Possession | \n", " Distance Covered | \n", " Location | \n", " Name | \n", " Team | \n", " Top Speed | \n", " % Passes Completed | \n", " Assists | \n", " Attempted Clearances | \n", " Attempts In The Area | \n", " Attempts Off-Target | \n", " Attempts On Target | \n", " Attempts On-Target From Inside The Area | \n", " Attempts On-Target From Outside The Area | \n", " Attempts Outside The Area | \n", " Share of Possession | \n", "
\n", " \n", " \n", " \n", " 363 | \n", " 1 | \n", " Group Stage | \n", " 6.4 | \n", " 4.2km | \n", " 14.6 | \n", " Germany | \n", " Shkodran MUSTAFI | \n", " Germany | \n", " 32.9 | \n", " 0.80 | \n", " NaN | \n", " 1.0 | \n", " 1.0 | \n", " 0.0 | \n", " 1.0 | \n", " 1.0 | \n", " 0.0 | \n", " 0.0 | \n", " 0.438356 | \n", "
\n", " \n", " 102 | \n", " - | \n", " NaN | \n", " 4.7 | \n", " 3.0km | \n", " 10.8 | \n", " Bosnia | \n", " Avdija VRSAJEVIC | \n", " Bosnia | \n", " 32.1 | \n", " 0.75 | \n", " 0.0 | \n", " 0.0 | \n", " 2.0 | \n", " 1.0 | \n", " 1.0 | \n", " 1.0 | \n", " 0.0 | \n", " 0.0 | \n", " 0.435185 | \n", "
\n", " \n", " 472 | \n", " - | \n", " Group Stage | \n", " 9.0 | \n", " 6.0km | \n", " 20.8 | \n", " Italy | \n", " Mario BALOTELLI | \n", " Italy | \n", " 33.0 | \n", " 0.65 | \n", " 0.0 | \n", " 0.0 | \n", " 3.0 | \n", " 3.0 | \n", " 4.0 | \n", " 2.0 | \n", " 2.0 | \n", " 4.0 | \n", " 0.432692 | \n", "
\n", " \n", " 351 | \n", " - | \n", " Top 4 | \n", " 31.6 | \n", " 23.6km | \n", " 73.2 | \n", " Germany | \n", " Mesut OEZIL | \n", " Germany | \n", " 32.0 | \n", " 0.79 | \n", " 1.0 | \n", " 0.0 | \n", " 8.0 | \n", " 3.0 | \n", " 7.0 | \n", " 6.0 | \n", " 1.0 | \n", " 2.0 | \n", " 0.431694 | \n", "
\n", " \n", " 646 | \n", " NaN | \n", " Group Stage | \n", " 13.7 | \n", " 10.4km | \n", " 31.8 | \n", " Spain | \n", " Jordi ALBA | \n", " Spain | \n", " 31.6 | \n", " 0.82 | \n", " NaN | \n", " NaN | \n", " 2.0 | \n", " 2.0 | \n", " 1.0 | \n", " 1.0 | \n", " 0.0 | \n", " 1.0 | \n", " 0.430818 | \n", "
\n", " \n", "
\n", "
"], "text/plain": [" Clearance Rate Colors Distance Covered In Possession \\\n", "363 1 Group Stage 6.4 \n", "102 - NaN 4.7 \n", "472 - Group Stage 9.0 \n", "351 - Top 4 31.6 \n", "646 NaN Group Stage 13.7 \n", "\n", " Distance Covered Not In Possession Distance Covered Location \\\n", "363 4.2km 14.6 Germany \n", "102 3.0km 10.8 Bosnia \n", "472 6.0km 20.8 Italy \n", "351 23.6km 73.2 Germany \n", "646 10.4km 31.8 Spain \n", "\n", " Name Team Top Speed % Passes Completed Assists \\\n", "363 Shkodran MUSTAFI Germany 32.9 0.80 NaN \n", "102 Avdija VRSAJEVIC Bosnia 32.1 0.75 0.0 \n", "472 Mario BALOTELLI Italy 33.0 0.65 0.0 \n", "351 Mesut OEZIL Germany 32.0 0.79 1.0 \n", "646 Jordi ALBA Spain 31.6 0.82 NaN \n", "\n", " Attempted Clearances Attempts In The Area Attempts Off-Target \\\n", "363 1.0 1.0 0.0 \n", "102 0.0 2.0 1.0 \n", "472 0.0 3.0 3.0 \n", "351 0.0 8.0 3.0 \n", "646 NaN 2.0 2.0 \n", "\n", " Attempts On Target Attempts On-Target From Inside The Area \\\n", "363 1.0 1.0 \n", "102 1.0 1.0 \n", "472 4.0 2.0 \n", "351 7.0 6.0 \n", "646 1.0 1.0 \n", "\n", " Attempts On-Target From Outside The Area Attempts Outside The Area \\\n", "363 0.0 0.0 \n", "102 0.0 0.0 \n", "472 2.0 4.0 \n", "351 1.0 2.0 \n", "646 0.0 1.0 \n", "\n", " Share of Possession \n", "363 0.438356 \n", "102 0.435185 \n", "472 0.432692 \n", "351 0.431694 \n", "646 0.430818 "]}, "execution_count": 11, "metadata": {}, "output_type": "execute_result"}], "source": ["data_players[data_players['Top Speed'] > data_players['Top Speed'].\n", " quantile(.90)].sort_values(['Share of Possession'], ascending = 0).head()"]}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1"}}, "nbformat": 4, "nbformat_minor": 2}