2A.eco - Mise en pratique des séances 1 et 2 - Utilisation de pandas et visualisation - correction#
Links: notebook
, html, python
, slides, GitHub
Correction d’un exercice sur la manipulation des données.
from jyquickhelper import add_notebook_menu
add_notebook_menu()
from pyensae.datasource import download_data
files = download_data("td2a_eco_exercices_de_manipulation_de_donnees.zip",
url="https://github.com/sdpython/ensae_teaching_cs/raw/master/_doc/notebooks/td2a_eco/data/")
files
['Players_WC2014.xlsx', 'velib_t1.txt', 'velib_t2.txt', 'villes.txt']
Exercice 1 - manipulation des bases#
Durée : 10 minutes
Importer la base de données relatives aux joueurs de la Coupe du Monde 2014
Déterminer le nombre de joueurs dans chaque équipe et créer un dictionnaire { équipe : Nombre de joueurs}
Déterminer quels sont les 3 joueurs qui ont couvert le plus de distance. Y a t il un biais de sélection ?
Parmis les joueurs qui sont dans le premier décile des joueurs plus rapides, qui a passé le plus clair de son temps à courrir sans la balle ?
Import du fichier#
import pandas as pd
data_players = pd.read_excel("Players_WC2014.xlsx", engine='openpyxl')
data_players.head()
Clearance Rate | Colors | Distance Covered In Possession | Distance Covered Not In Possession | Distance Covered | Location | Name | Team | Top Speed | % Passes Completed | Assists | Attempted Clearances | Attempts In The Area | Attempts Off-Target | Attempts On Target | Attempts On-Target From Inside The Area | Attempts On-Target From Outside The Area | Attempts Outside The Area | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | NaN | NaN | Algeria | Cedric SI MOHAMMED | Algeria | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | NaN | NaN | NaN | NaN | NaN | Algeria | Liassine CADAMURO | Algeria | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | NaN | NaN | NaN | NaN | NaN | Algeria | Mohamed ZEMMAMOUCHE | Algeria | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | NaN | NaN | 2.1km | 4.2km | 8.7km | Algeria | Riyad MAHREZ | Algeria | 23.9km/h | 0.55 | NaN | NaN | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | - | NaN | 5.0km | 10.0km | 20.6km | Algeria | Saphir TAIDER | Algeria | 25.7km/h | 0.74 | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
Nombre de joueurs par équipe#
data_players.groupby(['Team']).size().to_dict()
{'Algeria': 23,
'Argentina': 23,
'Australia': 23,
'Belgium': 23,
'Bosnia': 23,
'Brazil': 23,
'Cameroon': 23,
'Chile': 23,
'Colombia': 23,
'Costa Rica': 23,
'Croatia': 23,
"Côte d'Ivoire": 23,
'Ecuador': 22,
'England': 22,
'France': 22,
'Germany': 22,
'Ghana': 23,
'Greece': 23,
'Honduras': 23,
'Iran': 23,
'Italy': 23,
'Japan': 23,
'Korea Republic': 23,
'Mexico': 23,
'Netherlands': 23,
'Nigeria': 23,
'Portugal': 23,
'Russia': 23,
'Spain': 23,
'Switzerland': 23,
'USA': 23,
'Uruguay': 23}
Les joueurs ayant couvert le plus de distance#
## quels joueurs ont couvert le plus de distance ?
data_players['Distance Covered'] = data_players['Distance Covered'].str.replace('km','')
data_players['Distance Covered'] = pd.to_numeric(data_players['Distance Covered'])
data_players.sort_values(['Distance Covered'], ascending = 0).head(n=3)
Clearance Rate | Colors | Distance Covered In Possession | Distance Covered Not In Possession | Distance Covered | Location | Name | Team | Top Speed | % Passes Completed | Assists | Attempted Clearances | Attempts In The Area | Attempts Off-Target | Attempts On Target | Attempts On-Target From Inside The Area | Attempts On-Target From Outside The Area | Attempts Outside The Area | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
362 | 1 | Top 4 | 34.5km | 26.4km | 84.0 | Germany | Thomas MUELLER | Germany | 30.5km/h | 0.69 | 3.0 | 3.0 | 12.0 | 6.0 | 10.0 | 7.0 | 3.0 | 4.0 |
353 | 0.5 | Top 4 | 34.3km | 28.0km | 82.6 | Germany | Toni KROOS | Germany | 29.8km/h | 0.85 | 4.0 | 2.0 | 2.0 | 5.0 | 11.0 | 2.0 | 9.0 | 14.0 |
42 | 1 | Top 4 | 29.5km | 30.7km | 81.2 | Argentina | Javier MASCHERANO | Argentina | 30.3km/h | 0.86 | NaN | 8.0 | 0.0 | 0.0 | 4.0 | 0.0 | 4.0 | 4.0 |
On voit un clair effet de sélection sur cette variable : ce sont les joueurs dont les équipes ont été le plus loin dans la compétition qui couvert le plus de distance.
Qui a été le plus efficace ?#
On a besoin de rendre la variable Top Speed numérique, et de créer une nouvelle variable avec le poucentage de possession de balle
## Qui a été le plus rapide ?
data_players['Top Speed'] = data_players['Top Speed'].str.replace('km/h','')
data_players['Top Speed'] = pd.to_numeric(data_players['Top Speed'])
data_players.sort_values(['Top Speed'], ascending = 0).head(n=3)
Clearance Rate | Colors | Distance Covered In Possession | Distance Covered Not In Possession | Distance Covered | Location | Name | Team | Top Speed | % Passes Completed | Assists | Attempted Clearances | Attempts In The Area | Attempts Off-Target | Attempts On Target | Attempts On-Target From Inside The Area | Attempts On-Target From Outside The Area | Attempts Outside The Area | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
222 | 0.89 | Quarter Finals | 15.4km | 22.7km | 56.2 | Costa Rica | Junior DIAZ | Costa Rica | 33.8 | 0.65 | NaN | 9.0 | NaN | NaN | NaN | NaN | NaN | NaN |
247 | 0.75 | Group Stage | 10.5km | 9.3km | 28.8 | Côte d'Ivoire | Serge AURIER | Côte d'Ivoire | 33.5 | 0.73 | NaN | 4.0 | 2.0 | 0.0 | 3.0 | 2.0 | 1.0 | 1.0 |
32 | 0.5 | Top 4 | 25.1km | 19.2km | 63.6 | Argentina | Gonzalo HIGUAIN | Argentina | 33.1 | 0.75 | 1.0 | 4.0 | 11.0 | 7.0 | 6.0 | 6.0 | 0.0 | 2.0 |
## Parmis ceux qui sont dans le décile des plus rapides, qui a passé le plus clair de son temps à courrir sans la balle ?
data_players['Distance Covered In Possession'] = data_players['Distance Covered In Possession'].str.replace('km','')
data_players['Distance Covered In Possession'] = pd.to_numeric(data_players['Distance Covered In Possession'])
data_players['Share of Possession'] = data_players['Distance Covered In Possession']/data_players['Distance Covered']
data_players[data_players['Top Speed'] > data_players['Top Speed'].
quantile(.90)].sort_values(['Share of Possession'], ascending = 0).head()
Clearance Rate | Colors | Distance Covered In Possession | Distance Covered Not In Possession | Distance Covered | Location | Name | Team | Top Speed | % Passes Completed | Assists | Attempted Clearances | Attempts In The Area | Attempts Off-Target | Attempts On Target | Attempts On-Target From Inside The Area | Attempts On-Target From Outside The Area | Attempts Outside The Area | Share of Possession | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
363 | 1 | Group Stage | 6.4 | 4.2km | 14.6 | Germany | Shkodran MUSTAFI | Germany | 32.9 | 0.80 | NaN | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.438356 |
102 | - | NaN | 4.7 | 3.0km | 10.8 | Bosnia | Avdija VRSAJEVIC | Bosnia | 32.1 | 0.75 | 0.0 | 0.0 | 2.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.435185 |
472 | - | Group Stage | 9.0 | 6.0km | 20.8 | Italy | Mario BALOTELLI | Italy | 33.0 | 0.65 | 0.0 | 0.0 | 3.0 | 3.0 | 4.0 | 2.0 | 2.0 | 4.0 | 0.432692 |
351 | - | Top 4 | 31.6 | 23.6km | 73.2 | Germany | Mesut OEZIL | Germany | 32.0 | 0.79 | 1.0 | 0.0 | 8.0 | 3.0 | 7.0 | 6.0 | 1.0 | 2.0 | 0.431694 |
646 | NaN | Group Stage | 13.7 | 10.4km | 31.8 | Spain | Jordi ALBA | Spain | 31.6 | 0.82 | NaN | NaN | 2.0 | 2.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.430818 |