2A.eco - Mise en pratique des séances 1 et 2 - Utilisation de pandas et visualisation - correction

Links: notebook, html, PDF, python, slides, GitHub

Correction d’un exercice sur la manipulation des données.

from jyquickhelper import add_notebook_menu
add_notebook_menu()
from pyensae.datasource import download_data
files = download_data("td2a_eco_exercices_de_manipulation_de_donnees.zip",
                      url="https://github.com/sdpython/ensae_teaching_cs/raw/master/_doc/notebooks/td2a_eco/data/")
files
['Players_WC2014.xlsx', 'velib_t1.txt', 'velib_t2.txt', 'villes.txt']

Exercice 1 - manipulation des bases

Durée : 10 minutes

  1. Importer la base de données relatives aux joueurs de la Coupe du Monde 2014

  2. Déterminer le nombre de joueurs dans chaque équipe et créer un dictionnaire { équipe : Nombre de joueurs}

  3. Déterminer quels sont les 3 joueurs qui ont couvert le plus de distance. Y a t il un biais de sélection ?

  4. Parmis les joueurs qui sont dans le premier décile des joueurs plus rapides, qui a passé le plus clair de son temps à courrir sans la balle ?

Import du fichier

import pandas as pd
data_players = pd.read_excel("Players_WC2014.xlsx")
data_players.head()
Clearance Rate Colors Distance Covered In Possession Distance Covered Not In Possession Distance Covered Location Name Team Top Speed % Passes Completed Assists Attempted Clearances Attempts In The Area Attempts Off-Target Attempts On Target Attempts On-Target From Inside The Area Attempts On-Target From Outside The Area Attempts Outside The Area
0 NaN NaN NaN NaN NaN Algeria Cedric SI MOHAMMED Algeria NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN Algeria Liassine CADAMURO Algeria NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN Algeria Mohamed ZEMMAMOUCHE Algeria NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN 2.1km 4.2km 8.7km Algeria Riyad MAHREZ Algeria 23.9km/h 0.55 NaN NaN 1.0 1.0 0.0 0.0 0.0 0.0
4 - NaN 5.0km 10.0km 20.6km Algeria Saphir TAIDER Algeria 25.7km/h 0.74 NaN 0.0 NaN NaN NaN NaN NaN NaN

Nombre de joueurs par équipe

data_players.groupby(['Team']).size().to_dict()
{'Algeria': 23,
 'Argentina': 23,
 'Australia': 23,
 'Belgium': 23,
 'Bosnia': 23,
 'Brazil': 23,
 'Cameroon': 23,
 'Chile': 23,
 'Colombia': 23,
 'Costa Rica': 23,
 'Croatia': 23,
 "Côte d'Ivoire": 23,
 'Ecuador': 22,
 'England': 22,
 'France': 22,
 'Germany': 22,
 'Ghana': 23,
 'Greece': 23,
 'Honduras': 23,
 'Iran': 23,
 'Italy': 23,
 'Japan': 23,
 'Korea Republic': 23,
 'Mexico': 23,
 'Netherlands': 23,
 'Nigeria': 23,
 'Portugal': 23,
 'Russia': 23,
 'Spain': 23,
 'Switzerland': 23,
 'USA': 23,
 'Uruguay': 23}

Les joueurs ayant couvert le plus de distance

## quels joueurs ont couvert le plus de distance ?
data_players['Distance Covered'] = data_players['Distance Covered'].str.replace('km','')
data_players['Distance Covered'] = pd.to_numeric(data_players['Distance Covered'])
data_players.sort_values(['Distance Covered'], ascending = 0).head(n=3)
Clearance Rate Colors Distance Covered In Possession Distance Covered Not In Possession Distance Covered Location Name Team Top Speed % Passes Completed Assists Attempted Clearances Attempts In The Area Attempts Off-Target Attempts On Target Attempts On-Target From Inside The Area Attempts On-Target From Outside The Area Attempts Outside The Area
362 1 Top 4 34.5km 26.4km 84.0 Germany Thomas MUELLER Germany 30.5km/h 0.69 3.0 3.0 12.0 6.0 10.0 7.0 3.0 4.0
353 0.5 Top 4 34.3km 28.0km 82.6 Germany Toni KROOS Germany 29.8km/h 0.85 4.0 2.0 2.0 5.0 11.0 2.0 9.0 14.0
42 1 Top 4 29.5km 30.7km 81.2 Argentina Javier MASCHERANO Argentina 30.3km/h 0.86 NaN 8.0 0.0 0.0 4.0 0.0 4.0 4.0

On voit un clair effet de sélection sur cette variable : ce sont les joueurs dont les équipes ont été le plus loin dans la compétition qui couvert le plus de distance.

Qui a été le plus efficace ?

On a besoin de rendre la variable Top Speed numérique, et de créer une nouvelle variable avec le poucentage de possession de balle

## Qui a été le plus rapide ?

data_players['Top Speed'] = data_players['Top Speed'].str.replace('km/h','')
data_players['Top Speed'] = pd.to_numeric(data_players['Top Speed'])
data_players.sort_values(['Top Speed'], ascending = 0).head(n=3)
Clearance Rate Colors Distance Covered In Possession Distance Covered Not In Possession Distance Covered Location Name Team Top Speed % Passes Completed Assists Attempted Clearances Attempts In The Area Attempts Off-Target Attempts On Target Attempts On-Target From Inside The Area Attempts On-Target From Outside The Area Attempts Outside The Area
222 0.89 Quarter Finals 15.4km 22.7km 56.2 Costa Rica Junior DIAZ Costa Rica 33.8 0.65 NaN 9.0 NaN NaN NaN NaN NaN NaN
247 0.75 Group Stage 10.5km 9.3km 28.8 Côte d'Ivoire Serge AURIER Côte d'Ivoire 33.5 0.73 NaN 4.0 2.0 0.0 3.0 2.0 1.0 1.0
32 0.5 Top 4 25.1km 19.2km 63.6 Argentina Gonzalo HIGUAIN Argentina 33.1 0.75 1.0 4.0 11.0 7.0 6.0 6.0 0.0 2.0
## Parmis ceux qui sont dans le décile des plus rapides, qui a passé le plus clair de son temps à courrir sans la balle ?
data_players['Distance Covered In Possession'] = data_players['Distance Covered In Possession'].str.replace('km','')
data_players['Distance Covered In Possession'] = pd.to_numeric(data_players['Distance Covered In Possession'])
data_players['Share of Possession'] = data_players['Distance Covered In Possession']/data_players['Distance Covered']
data_players[data_players['Top Speed'] > data_players['Top Speed'].
             quantile(.90)].sort_values(['Share of Possession'], ascending = 0).head()
Clearance Rate Colors Distance Covered In Possession Distance Covered Not In Possession Distance Covered Location Name Team Top Speed % Passes Completed Assists Attempted Clearances Attempts In The Area Attempts Off-Target Attempts On Target Attempts On-Target From Inside The Area Attempts On-Target From Outside The Area Attempts Outside The Area Share of Possession
363 1 Group Stage 6.4 4.2km 14.6 Germany Shkodran MUSTAFI Germany 32.9 0.80 NaN 1.0 1.0 0.0 1.0 1.0 0.0 0.0 0.438356
102 - NaN 4.7 3.0km 10.8 Bosnia Avdija VRSAJEVIC Bosnia 32.1 0.75 0.0 0.0 2.0 1.0 1.0 1.0 0.0 0.0 0.435185
472 - Group Stage 9.0 6.0km 20.8 Italy Mario BALOTELLI Italy 33.0 0.65 0.0 0.0 3.0 3.0 4.0 2.0 2.0 4.0 0.432692
351 - Top 4 31.6 23.6km 73.2 Germany Mesut OEZIL Germany 32.0 0.79 1.0 0.0 8.0 3.0 7.0 6.0 1.0 2.0 0.431694
646 NaN Group Stage 13.7 10.4km 31.8 Spain Jordi ALBA Spain 31.6 0.82 NaN NaN 2.0 2.0 1.0 1.0 0.0 1.0 0.430818