Correction d'un exercice sur la manipulation des données.
from jyquickhelper import add_notebook_menu
add_notebook_menu()
from pyensae.datasource import download_data
files = download_data("td2a_eco_exercices_de_manipulation_de_donnees.zip",
url="https://github.com/sdpython/ensae_teaching_cs/raw/master/_doc/notebooks/td2a_eco/data/")
files
['Players_WC2014.xlsx', 'velib_t1.txt', 'velib_t2.txt', 'villes.txt']
Durée : 10 minutes
import pandas as pd
data_players = pd.read_excel("Players_WC2014.xlsx", engine='openpyxl')
data_players.head()
Clearance Rate | Colors | Distance Covered In Possession | Distance Covered Not In Possession | Distance Covered | Location | Name | Team | Top Speed | % Passes Completed | Assists | Attempted Clearances | Attempts In The Area | Attempts Off-Target | Attempts On Target | Attempts On-Target From Inside The Area | Attempts On-Target From Outside The Area | Attempts Outside The Area | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | NaN | NaN | Algeria | Cedric SI MOHAMMED | Algeria | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | NaN | NaN | NaN | NaN | NaN | Algeria | Liassine CADAMURO | Algeria | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | NaN | NaN | NaN | NaN | NaN | Algeria | Mohamed ZEMMAMOUCHE | Algeria | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | NaN | NaN | 2.1km | 4.2km | 8.7km | Algeria | Riyad MAHREZ | Algeria | 23.9km/h | 0.55 | NaN | NaN | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | - | NaN | 5.0km | 10.0km | 20.6km | Algeria | Saphir TAIDER | Algeria | 25.7km/h | 0.74 | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
data_players.groupby(['Team']).size().to_dict()
{'Algeria': 23, 'Argentina': 23, 'Australia': 23, 'Belgium': 23, 'Bosnia': 23, 'Brazil': 23, 'Cameroon': 23, 'Chile': 23, 'Colombia': 23, 'Costa Rica': 23, 'Croatia': 23, "Côte d'Ivoire": 23, 'Ecuador': 22, 'England': 22, 'France': 22, 'Germany': 22, 'Ghana': 23, 'Greece': 23, 'Honduras': 23, 'Iran': 23, 'Italy': 23, 'Japan': 23, 'Korea Republic': 23, 'Mexico': 23, 'Netherlands': 23, 'Nigeria': 23, 'Portugal': 23, 'Russia': 23, 'Spain': 23, 'Switzerland': 23, 'USA': 23, 'Uruguay': 23}
## quels joueurs ont couvert le plus de distance ?
data_players['Distance Covered'] = data_players['Distance Covered'].str.replace('km','')
data_players['Distance Covered'] = pd.to_numeric(data_players['Distance Covered'])
data_players.sort_values(['Distance Covered'], ascending = 0).head(n=3)
Clearance Rate | Colors | Distance Covered In Possession | Distance Covered Not In Possession | Distance Covered | Location | Name | Team | Top Speed | % Passes Completed | Assists | Attempted Clearances | Attempts In The Area | Attempts Off-Target | Attempts On Target | Attempts On-Target From Inside The Area | Attempts On-Target From Outside The Area | Attempts Outside The Area | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
362 | 1 | Top 4 | 34.5km | 26.4km | 84.0 | Germany | Thomas MUELLER | Germany | 30.5km/h | 0.69 | 3.0 | 3.0 | 12.0 | 6.0 | 10.0 | 7.0 | 3.0 | 4.0 |
353 | 0.5 | Top 4 | 34.3km | 28.0km | 82.6 | Germany | Toni KROOS | Germany | 29.8km/h | 0.85 | 4.0 | 2.0 | 2.0 | 5.0 | 11.0 | 2.0 | 9.0 | 14.0 |
42 | 1 | Top 4 | 29.5km | 30.7km | 81.2 | Argentina | Javier MASCHERANO | Argentina | 30.3km/h | 0.86 | NaN | 8.0 | 0.0 | 0.0 | 4.0 | 0.0 | 4.0 | 4.0 |
On voit un clair effet de sélection sur cette variable : ce sont les joueurs dont les équipes ont été le plus loin dans la compétition qui couvert le plus de distance.
On a besoin de rendre la variable Top Speed numérique, et de créer une nouvelle variable avec le poucentage de possession de balle
## Qui a été le plus rapide ?
data_players['Top Speed'] = data_players['Top Speed'].str.replace('km/h','')
data_players['Top Speed'] = pd.to_numeric(data_players['Top Speed'])
data_players.sort_values(['Top Speed'], ascending = 0).head(n=3)
Clearance Rate | Colors | Distance Covered In Possession | Distance Covered Not In Possession | Distance Covered | Location | Name | Team | Top Speed | % Passes Completed | Assists | Attempted Clearances | Attempts In The Area | Attempts Off-Target | Attempts On Target | Attempts On-Target From Inside The Area | Attempts On-Target From Outside The Area | Attempts Outside The Area | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
222 | 0.89 | Quarter Finals | 15.4km | 22.7km | 56.2 | Costa Rica | Junior DIAZ | Costa Rica | 33.8 | 0.65 | NaN | 9.0 | NaN | NaN | NaN | NaN | NaN | NaN |
247 | 0.75 | Group Stage | 10.5km | 9.3km | 28.8 | Côte d'Ivoire | Serge AURIER | Côte d'Ivoire | 33.5 | 0.73 | NaN | 4.0 | 2.0 | 0.0 | 3.0 | 2.0 | 1.0 | 1.0 |
32 | 0.5 | Top 4 | 25.1km | 19.2km | 63.6 | Argentina | Gonzalo HIGUAIN | Argentina | 33.1 | 0.75 | 1.0 | 4.0 | 11.0 | 7.0 | 6.0 | 6.0 | 0.0 | 2.0 |
## Parmis ceux qui sont dans le décile des plus rapides, qui a passé le plus clair de son temps à courrir sans la balle ?
data_players['Distance Covered In Possession'] = data_players['Distance Covered In Possession'].str.replace('km','')
data_players['Distance Covered In Possession'] = pd.to_numeric(data_players['Distance Covered In Possession'])
data_players['Share of Possession'] = data_players['Distance Covered In Possession']/data_players['Distance Covered']
data_players[data_players['Top Speed'] > data_players['Top Speed'].
quantile(.90)].sort_values(['Share of Possession'], ascending = 0).head()
Clearance Rate | Colors | Distance Covered In Possession | Distance Covered Not In Possession | Distance Covered | Location | Name | Team | Top Speed | % Passes Completed | Assists | Attempted Clearances | Attempts In The Area | Attempts Off-Target | Attempts On Target | Attempts On-Target From Inside The Area | Attempts On-Target From Outside The Area | Attempts Outside The Area | Share of Possession | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
363 | 1 | Group Stage | 6.4 | 4.2km | 14.6 | Germany | Shkodran MUSTAFI | Germany | 32.9 | 0.80 | NaN | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.438356 |
102 | - | NaN | 4.7 | 3.0km | 10.8 | Bosnia | Avdija VRSAJEVIC | Bosnia | 32.1 | 0.75 | 0.0 | 0.0 | 2.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.435185 |
472 | - | Group Stage | 9.0 | 6.0km | 20.8 | Italy | Mario BALOTELLI | Italy | 33.0 | 0.65 | 0.0 | 0.0 | 3.0 | 3.0 | 4.0 | 2.0 | 2.0 | 4.0 | 0.432692 |
351 | - | Top 4 | 31.6 | 23.6km | 73.2 | Germany | Mesut OEZIL | Germany | 32.0 | 0.79 | 1.0 | 0.0 | 8.0 | 3.0 | 7.0 | 6.0 | 1.0 | 2.0 | 0.431694 |
646 | NaN | Group Stage | 13.7 | 10.4km | 31.8 | Spain | Jordi ALBA | Spain | 31.6 | 0.82 | NaN | NaN | 2.0 | 2.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.430818 |