DataFrames Pandas#
Links: notebook
, html, PDF
, python
, slides, GitHub
Un Data Frame est un objet qui est présent dans la plupart des logiciels de traitements de données, c’est une matrice à 2 dimensions, chaque colonne a un type et toutes les cellules de cette colonne sont de ce type (nombre, dates, texte). Une cellule peut contenir une valeur manquante. On peut considérer chaque colonne comme les variables d’une table (pandas.Dataframe - cette page contient toutes les méthodes de la classe).
from jyquickhelper import add_notebook_menu
add_notebook_menu()
Quelques liens : An Introduction to Pandas
Tous les exemples utilisent un jeu de données de l’ONU contenant par pays (“country”), par année (“year”) et par secteur(“code” ou “sub_item”), la valeur ajoutée monétaire du secteur dans ce pays cette année là (“VA1” ou “VA2”), la monnaie (“currency”), et la masse salariale du secteur cette année là (“WAGE1”). Les données uilisées pour l’exemple (accessible sur github) peuvent être remplacées par n’importe quelle table disponible sur le site : data.un.org.
Lecture et écriture sur disque#
import pandas
df = pandas.read_csv("UN_Data.csv", sep=",")
df.head()
country | sub_item | year | currency | VA1 | code | VA2 | WAGE1 | |
---|---|---|---|---|---|---|---|---|
0 | Argentina | Agriculture, hunting, forestry fishing | 1993 | Argentine peso | 1.214900e+10 | AB | 1.214900e+10 | 2.123000e+09 |
1 | Argentina | Mining and quarrying | 1993 | Argentine peso | 3.525000e+09 | C | 3.525000e+09 | 8.007000e+08 |
2 | Argentina | Manufacturing | 1993 | Argentine peso | 3.890700e+10 | D | 3.890700e+10 | 1.766600e+10 |
3 | Argentina | Electricity, gas and water supply | 1993 | Argentine peso | 4.461000e+09 | E | 4.461000e+09 | 2.213000e+09 |
4 | Argentina | Construction | 1993 | Argentine peso | 1.339300e+10 | F | 1.339300e+10 | 4.355000e+09 |
df.to_excel("exemple.xlsx", index=False)
Manipulation basique#
df["VA1"]
0 1.214900e+10
1 3.525000e+09
2 3.890700e+10
3 4.461000e+09
4 1.339300e+10
5 3.929400e+10
6 1.613400e+10
7 4.320200e+10
8 2.166090e+11
9 1.308500e+10
10 3.818000e+09
11 4.159600e+10
12 4.730000e+09
13 1.431100e+10
14 4.279800e+10
15 1.825100e+10
16 4.859900e+10
17 2.358460e+11
18 1.380850e+10
19 4.838400e+09
20 4.450210e+10
21 5.111000e+09
22 1.341400e+10
23 4.119850e+10
24 1.905990e+10
25 5.133940e+10
26 2.423343e+11
27 1.527000e+10
28 5.888900e+09
29 4.772340e+10
...
4209 5.057137e+13
4210 2.880633e+12
4211 8.321318e+12
4212 8.204542e+12
4213 9.316080e+11
4214 3.400098e+12
4215 9.794983e+12
4216 5.716383e+12
4217 9.124555e+12
4218 6.033646e+13
4219 3.258018e+12
4220 1.591774e+13
4221 1.062152e+13
4222 1.137058e+12
4223 3.842038e+12
4224 1.168852e+13
4225 6.920770e+12
4226 1.100281e+13
4227 7.974029e+13
4228 3.872766e+12
4229 1.321210e+13
4230 1.095207e+13
4231 1.355214e+12
4232 5.032111e+12
4233 1.347735e+13
4234 8.063945e+12
4235 1.334013e+13
4236 8.804319e+13
4237 4.549229e+12
4238 1.074291e+14
Name: VA1, Length: 4239, dtype: float64
df[1:3]
country | sub_item | year | currency | VA1 | code | VA2 | WAGE1 | |
---|---|---|---|---|---|---|---|---|
1 | Argentina | Mining and quarrying | 1993 | Argentine peso | 3.525000e+09 | C | 3.525000e+09 | 8.007000e+08 |
2 | Argentina | Manufacturing | 1993 | Argentine peso | 3.890700e+10 | D | 3.890700e+10 | 1.766600e+10 |
La première ligne a pour indice 0 :
df[0:3]
country | sub_item | year | currency | VA1 | code | VA2 | WAGE1 | |
---|---|---|---|---|---|---|---|---|
0 | Argentina | Agriculture, hunting, forestry fishing | 1993 | Argentine peso | 1.214900e+10 | AB | 1.214900e+10 | 2.123000e+09 |
1 | Argentina | Mining and quarrying | 1993 | Argentine peso | 3.525000e+09 | C | 3.525000e+09 | 8.007000e+08 |
2 | Argentina | Manufacturing | 1993 | Argentine peso | 3.890700e+10 | D | 3.890700e+10 | 1.766600e+10 |
df[["country","year"]]
country | year | |
---|---|---|
0 | Argentina | 1993 |
1 | Argentina | 1993 |
2 | Argentina | 1993 |
3 | Argentina | 1993 |
4 | Argentina | 1993 |
5 | Argentina | 1993 |
6 | Argentina | 1993 |
7 | Argentina | 1993 |
8 | Argentina | 1993 |
9 | Argentina | 1994 |
10 | Argentina | 1994 |
11 | Argentina | 1994 |
12 | Argentina | 1994 |
13 | Argentina | 1994 |
14 | Argentina | 1994 |
15 | Argentina | 1994 |
16 | Argentina | 1994 |
17 | Argentina | 1994 |
18 | Argentina | 1995 |
19 | Argentina | 1995 |
20 | Argentina | 1995 |
21 | Argentina | 1995 |
22 | Argentina | 1995 |
23 | Argentina | 1995 |
24 | Argentina | 1995 |
25 | Argentina | 1995 |
26 | Argentina | 1995 |
27 | Argentina | 1996 |
28 | Argentina | 1996 |
29 | Argentina | 1996 |
... | ... | ... |
4209 | Venezuela | 1998 |
4210 | Venezuela | 1999 |
4211 | Venezuela | 1999 |
4212 | Venezuela | 1999 |
4213 | Venezuela | 1999 |
4214 | Venezuela | 1999 |
4215 | Venezuela | 1999 |
4216 | Venezuela | 1999 |
4217 | Venezuela | 1999 |
4218 | Venezuela | 1999 |
4219 | Venezuela | 2000 |
4220 | Venezuela | 2000 |
4221 | Venezuela | 2000 |
4222 | Venezuela | 2000 |
4223 | Venezuela | 2000 |
4224 | Venezuela | 2000 |
4225 | Venezuela | 2000 |
4226 | Venezuela | 2000 |
4227 | Venezuela | 2000 |
4228 | Venezuela | 2001 |
4229 | Venezuela | 2001 |
4230 | Venezuela | 2001 |
4231 | Venezuela | 2001 |
4232 | Venezuela | 2001 |
4233 | Venezuela | 2001 |
4234 | Venezuela | 2001 |
4235 | Venezuela | 2001 |
4236 | Venezuela | 2001 |
4237 | Venezuela | 2002 |
4238 | Venezuela | 2002 |
4239 rows × 2 columns
Documentation describe :
df.describe()
year | VA1 | VA2 | WAGE1 | |
---|---|---|---|---|
count | 4239.000000 | 4.239000e+03 | 4.233000e+03 | 4.239000e+03 |
mean | 1989.912715 | 1.802346e+12 | 1.801822e+12 | 6.371376e+11 |
std | 11.140271 | 1.086676e+13 | 1.087452e+13 | 3.957470e+12 |
min | 1966.000000 | 2.000000e+00 | 0.000000e+00 | 1.000000e+00 |
25% | 1981.000000 | 8.538500e+09 | 8.513000e+09 | 2.773500e+09 |
50% | 1991.000000 | 4.565900e+10 | 4.565900e+10 | 1.555100e+10 |
75% | 1999.000000 | 2.597610e+11 | 2.601340e+11 | 9.198150e+10 |
max | 2010.000000 | 2.711389e+14 | 2.711389e+14 | 9.220360e+13 |
df.loc[0]
country Argentina
sub_item Agriculture, hunting, forestry fishing
year 1993
currency Argentine peso
VA1 1.2149e+10
code AB
VA2 1.2149e+10
WAGE1 2.123e+09
Name: 0, dtype: object
dfy = df.set_index("year")
dfy.loc[1993]
country | sub_item | currency | VA1 | code | VA2 | WAGE1 | |
---|---|---|---|---|---|---|---|
year | |||||||
1993 | Argentina | Agriculture, hunting, forestry fishing | Argentine peso | 1.214900e+10 | AB | 1.214900e+10 | 2.123000e+09 |
1993 | Argentina | Mining and quarrying | Argentine peso | 3.525000e+09 | C | 3.525000e+09 | 8.007000e+08 |
1993 | Argentina | Manufacturing | Argentine peso | 3.890700e+10 | D | 3.890700e+10 | 1.766600e+10 |
1993 | Argentina | Electricity, gas and water supply | Argentine peso | 4.461000e+09 | E | 4.461000e+09 | 2.213000e+09 |
1993 | Argentina | Construction | Argentine peso | 1.339300e+10 | F | 1.339300e+10 | 4.355000e+09 |
1993 | Argentina | Wholesale retail trade, repair of motor vehicl... | Argentine peso | 3.929400e+10 | GH | 3.929400e+10 | 1.092000e+10 |
1993 | Argentina | Transport, storage and communications | Argentine peso | 1.613400e+10 | I | 1.613400e+10 | 6.213000e+09 |
1993 | Argentina | Financial intermediation real estate, renting ... | Argentine peso | 4.320200e+10 | JK | 4.320200e+10 | 8.039000e+09 |
1993 | Argentina | Total Economy | Argentine peso | 2.166090e+11 | TOT | 2.117210e+11 | 8.955300e+10 |
1993 | Bolivia | Total Economy | boliviano | 2.255600e+10 | TOT | 2.255600e+10 | 8.821000e+09 |
1993 | Brazil | Agriculture, hunting, forestry fishing | real | 9.560000e+08 | AB | 9.560000e+08 | 1.960000e+08 |
1993 | Brazil | Mining and quarrying | real | 1.480000e+08 | C | 1.480000e+08 | 3.600000e+07 |
1993 | Brazil | Manufacturing | real | 3.672000e+09 | D | 3.672000e+09 | 1.006000e+09 |
1993 | Brazil | Electricity, gas and water supply | real | 3.930000e+08 | E | 3.930000e+08 | 2.480000e+08 |
1993 | Brazil | Construction | real | 1.044000e+09 | F | 1.044000e+09 | 2.050000e+08 |
1993 | Brazil | Wholesale retail trade, repair of motor vehicl... | real | 1.172000e+09 | GH | 1.172000e+09 | 4.960000e+08 |
1993 | Brazil | Transport, storage and communications | real | 6.820000e+08 | I | 6.820000e+08 | 3.250000e+08 |
1993 | Brazil | Financial intermediation real estate, renting ... | real | 5.559000e+09 | JK | 5.559000e+09 | 1.460000e+09 |
1993 | Brazil | Total Economy | real | 1.655200e+10 | TOT | 1.655200e+10 | 6.363000e+09 |
1993 | Chile | Total Economy | Chilean peso | 1.660178e+13 | TOT | 1.660178e+13 | 6.582086e+12 |
1993 | Colombia | Agriculture, hunting, forestry fishing | Colombian peso | 7.823940e+12 | AB | 7.823937e+12 | 1.544580e+12 |
1993 | Colombia | Mining and quarrying | Colombian peso | 2.237211e+12 | C | 2.237211e+12 | 6.405010e+11 |
1993 | Colombia | Manufacturing | Colombian peso | 8.275730e+12 | D | 8.275730e+12 | 3.349123e+12 |
1993 | Colombia | Electricity, gas and water supply | Colombian peso | 1.607909e+12 | E | 1.607909e+12 | 4.512400e+11 |
1993 | Colombia | Construction | Colombian peso | 3.648933e+12 | F | 3.648933e+12 | 1.326030e+12 |
1993 | Colombia | Wholesale retail trade, repair of motor vehicl... | Colombian peso | 6.521734e+12 | GH | 6.521734e+12 | 2.652296e+12 |
1993 | Colombia | Transport, storage and communications | Colombian peso | 3.750715e+12 | I | 3.750715e+12 | 1.697697e+12 |
1993 | Colombia | Financial intermediation real estate, renting ... | Colombian peso | 8.494946e+12 | JK | 8.494946e+12 | 1.571843e+12 |
1993 | Colombia | Total Economy | Colombian peso | 5.059828e+13 | TOT | 5.059828e+13 | 1.882861e+13 |
1993 | Denmark | Agriculture, hunting, forestry fishing | Danish krone | 2.584700e+10 | AB | 2.584700e+10 | 6.645000e+09 |
... | ... | ... | ... | ... | ... | ... | ... |
1993 | Spain | Financial intermediation real estate, renting ... | peseta | 1.106390e+13 | JK | 1.106390e+13 | 3.062300e+12 |
1993 | Spain | Total Economy | peseta | 6.162560e+13 | TOT | 6.162560e+13 | 3.006070e+13 |
1993 | Sweden | Agriculture, hunting, forestry fishing | Swedish krona | 2.749800e+10 | AB | 2.749800e+10 | 8.543000e+09 |
1993 | Sweden | Mining and quarrying | Swedish krona | 3.467000e+09 | C | 3.467000e+09 | 2.414000e+09 |
1993 | Sweden | Manufacturing | Swedish krona | 2.601340e+11 | D | 2.601340e+11 | 1.779260e+11 |
1993 | Sweden | Electricity, gas and water supply | Swedish krona | 4.278100e+10 | E | 4.278100e+10 | 8.195000e+09 |
1993 | Sweden | Construction | Swedish krona | 7.949000e+10 | F | 7.949000e+10 | 6.192000e+10 |
1993 | Sweden | Wholesale retail trade, repair of motor vehicl... | Swedish krona | 1.396150e+11 | GH | 1.396150e+11 | 1.110160e+11 |
1993 | Sweden | Transport, storage and communications | Swedish krona | 8.499200e+10 | I | 8.499200e+10 | 5.889300e+10 |
1993 | Sweden | Financial intermediation real estate, renting ... | Swedish krona | 3.169450e+11 | JK | 3.169450e+11 | 9.304900e+10 |
1993 | Sweden | Total Economy | Swedish krona | 1.328816e+12 | TOT | 1.328816e+12 | 8.516030e+11 |
1993 | Thailand | Agriculture, hunting, forestry fishing | baht | 2.739010e+11 | AB | 2.739010e+11 | 3.091700e+10 |
1993 | Thailand | Mining and quarrying | baht | 3.778900e+10 | C | 3.778900e+10 | 5.041000e+09 |
1993 | Thailand | Manufacturing | baht | 7.990670e+11 | D | 7.990670e+11 | 2.956290e+11 |
1993 | Thailand | Electricity, gas and water supply | baht | 7.315900e+10 | E | 7.315900e+10 | 1.835000e+10 |
1993 | Thailand | Construction | baht | 2.109040e+11 | F | 2.109040e+11 | 8.201700e+10 |
1993 | Thailand | Wholesale retail trade, repair of motor vehicl... | baht | 3.711790e+11 | GH | 3.711790e+11 | 5.910000e+10 |
1993 | Thailand | Transport, storage and communications | baht | 2.366030e+11 | I | 2.366030e+11 | 5.369300e+10 |
1993 | Thailand | Financial intermediation real estate, renting ... | baht | 2.778150e+11 | JK | 2.778150e+11 | 6.928200e+10 |
1993 | Thailand | Total Economy | baht | 2.784561e+12 | TOT | 2.784561e+12 | 8.666570e+11 |
1993 | Turkey | Total Economy | New Turkish lira | 1.976292e+09 | TOT | 1.976292e+09 | 6.119037e+08 |
1993 | Venezuela | Agriculture, hunting, forestry fishing | bolivar | 2.916830e+11 | AB | 2.916830e+11 | 7.117200e+10 |
1993 | Venezuela | Mining and quarrying | bolivar | 8.411670e+11 | C | 8.411670e+11 | 9.630300e+10 |
1993 | Venezuela | Manufacturing | bolivar | 9.605900e+11 | D | 9.605900e+11 | 2.825770e+11 |
1993 | Venezuela | Electricity, gas and water supply | bolivar | 1.420870e+11 | E | 1.420870e+11 | 3.532400e+10 |
1993 | Venezuela | Construction | bolivar | 3.322990e+11 | F | 3.322990e+11 | 1.072110e+11 |
1993 | Venezuela | Wholesale retail trade, repair of motor vehicl... | bolivar | 1.045332e+12 | GH | 1.045332e+12 | 4.531180e+11 |
1993 | Venezuela | Transport, storage and communications | bolivar | 4.044280e+11 | I | 4.044280e+11 | 1.226970e+11 |
1993 | Venezuela | Financial intermediation real estate, renting ... | bolivar | 6.991300e+11 | JK | 6.991300e+11 | 1.585720e+11 |
1993 | Venezuela | Total Economy | bolivar | 5.449065e+12 | TOT | 5.449065e+12 | 1.863825e+12 |
120 rows × 7 columns
dfycc = df.set_index(["year", "country", "code"])
dfycc.head()
sub_item | currency | VA1 | VA2 | WAGE1 | |||
---|---|---|---|---|---|---|---|
year | country | code | |||||
1993 | Argentina | AB | Agriculture, hunting, forestry fishing | Argentine peso | 1.214900e+10 | 1.214900e+10 | 2.123000e+09 |
C | Mining and quarrying | Argentine peso | 3.525000e+09 | 3.525000e+09 | 8.007000e+08 | ||
D | Manufacturing | Argentine peso | 3.890700e+10 | 3.890700e+10 | 1.766600e+10 | ||
E | Electricity, gas and water supply | Argentine peso | 4.461000e+09 | 4.461000e+09 | 2.213000e+09 | ||
F | Construction | Argentine peso | 1.339300e+10 | 1.339300e+10 | 4.355000e+09 |
Documentation : sortlevel
dfycc.sort_index(inplace=True)
dfycc.head()
sub_item | currency | VA1 | VA2 | WAGE1 | |||
---|---|---|---|---|---|---|---|
year | country | code | |||||
1966 | Denmark | AB | Agriculture, hunting, forestry fishing | Danish krone | 5.694000e+09 | 5.694000e+09 | 1.191000e+09 |
C | Mining and quarrying | Danish krone | 2.530000e+08 | 2.530000e+08 | 8.000000e+07 | ||
D | Manufacturing | Danish krone | 1.543800e+10 | 1.543800e+10 | 1.095700e+10 | ||
E | Electricity, gas and water supply | Danish krone | 1.320000e+09 | 1.320000e+09 | 3.430000e+08 | ||
F | Construction | Danish krone | 6.928000e+09 | 6.928000e+09 | 4.857000e+09 |
dfycc.loc[1993, "Brazil", "TOT"]
sub_item Total Economy
currency real
VA1 1.6552e+10
VA2 1.6552e+10
WAGE1 6.363e+09
Name: (1993, Brazil, TOT), dtype: object
dfycc.sort_index(level=2, inplace=True)
dfycc.head()
sub_item | currency | VA1 | VA2 | WAGE1 | |||
---|---|---|---|---|---|---|---|
year | country | code | |||||
1966 | Denmark | AB | Agriculture, hunting, forestry fishing | Danish krone | 5.694000e+09 | 5.694000e+09 | 1.191000e+09 |
1967 | Denmark | AB | Agriculture, hunting, forestry fishing | Danish krone | 5.419000e+09 | 5.419000e+09 | 1.213000e+09 |
1968 | Denmark | AB | Agriculture, hunting, forestry fishing | Danish krone | 5.686000e+09 | 5.686000e+09 | 1.221000e+09 |
1969 | Denmark | AB | Agriculture, hunting, forestry fishing | Danish krone | 6.707000e+09 | 6.707000e+09 | 1.245000e+09 |
1970 | Bolivia | AB | Agriculture, hunting, forestry fishing | boliviano | 2.240000e+03 | 2.000000e+03 | 4.030000e+02 |
Documentation : reset_index
df.reset_index(drop=False, inplace=True)
# le mot-clé drop pour garder ou non les colonnes servant d'index
# inplace signifie qu'on modifie l'instance et non qu'une copie est modifiée
# donc on peut aussi écrire dfi2 = dfi.reset_index(drop=False)
df.columns
Index(['index', 'country', 'sub_item', 'year', 'currency', 'VA1', 'code',
'VA2', 'WAGE1'],
dtype='object')
df.index
RangeIndex(start=0, stop=4239, step=1)
df.loc[1993]
index 1993
country Mexico
sub_item Mining and quarrying
year 2002
currency Mexican new peso
VA1 7.72065e+10
code C
VA2 7.72065e+10
WAGE1 1.84915e+10
Name: 1993, dtype: object
Manipulation avancée#
df.dtypes
index int64
country object
sub_item object
year int64
currency object
VA1 float64
code object
VA2 float64
WAGE1 float64
dtype: object
filter#
filter : on sélectionne un sous-ensemble de lignes qui vérifie une condition
Filter consiste à sélectionner un sous-ensemble de lignes du dataframe. Pour filter sur plusieurs conditions, il faut utiliser les opérateurs logique & (et), | (ou), ~ (non)
subset = df [ (df.year == 1993) & (df.code == "AB") ]
subset.head()
df.filter(items=["country", "year", "VA1"])
country | year | VA1 | |
---|---|---|---|
0 | Argentina | 1993 | 1.214900e+10 |
1 | Argentina | 1993 | 3.525000e+09 |
2 | Argentina | 1993 | 3.890700e+10 |
3 | Argentina | 1993 | 4.461000e+09 |
4 | Argentina | 1993 | 1.339300e+10 |
5 | Argentina | 1993 | 3.929400e+10 |
6 | Argentina | 1993 | 1.613400e+10 |
7 | Argentina | 1993 | 4.320200e+10 |
8 | Argentina | 1993 | 2.166090e+11 |
9 | Argentina | 1994 | 1.308500e+10 |
10 | Argentina | 1994 | 3.818000e+09 |
11 | Argentina | 1994 | 4.159600e+10 |
12 | Argentina | 1994 | 4.730000e+09 |
13 | Argentina | 1994 | 1.431100e+10 |
14 | Argentina | 1994 | 4.279800e+10 |
15 | Argentina | 1994 | 1.825100e+10 |
16 | Argentina | 1994 | 4.859900e+10 |
17 | Argentina | 1994 | 2.358460e+11 |
18 | Argentina | 1995 | 1.380850e+10 |
19 | Argentina | 1995 | 4.838400e+09 |
20 | Argentina | 1995 | 4.450210e+10 |
21 | Argentina | 1995 | 5.111000e+09 |
22 | Argentina | 1995 | 1.341400e+10 |
23 | Argentina | 1995 | 4.119850e+10 |
24 | Argentina | 1995 | 1.905990e+10 |
25 | Argentina | 1995 | 5.133940e+10 |
26 | Argentina | 1995 | 2.423343e+11 |
27 | Argentina | 1996 | 1.527000e+10 |
28 | Argentina | 1996 | 5.888900e+09 |
29 | Argentina | 1996 | 4.772340e+10 |
... | ... | ... | ... |
4209 | Venezuela | 1998 | 5.057137e+13 |
4210 | Venezuela | 1999 | 2.880633e+12 |
4211 | Venezuela | 1999 | 8.321318e+12 |
4212 | Venezuela | 1999 | 8.204542e+12 |
4213 | Venezuela | 1999 | 9.316080e+11 |
4214 | Venezuela | 1999 | 3.400098e+12 |
4215 | Venezuela | 1999 | 9.794983e+12 |
4216 | Venezuela | 1999 | 5.716383e+12 |
4217 | Venezuela | 1999 | 9.124555e+12 |
4218 | Venezuela | 1999 | 6.033646e+13 |
4219 | Venezuela | 2000 | 3.258018e+12 |
4220 | Venezuela | 2000 | 1.591774e+13 |
4221 | Venezuela | 2000 | 1.062152e+13 |
4222 | Venezuela | 2000 | 1.137058e+12 |
4223 | Venezuela | 2000 | 3.842038e+12 |
4224 | Venezuela | 2000 | 1.168852e+13 |
4225 | Venezuela | 2000 | 6.920770e+12 |
4226 | Venezuela | 2000 | 1.100281e+13 |
4227 | Venezuela | 2000 | 7.974029e+13 |
4228 | Venezuela | 2001 | 3.872766e+12 |
4229 | Venezuela | 2001 | 1.321210e+13 |
4230 | Venezuela | 2001 | 1.095207e+13 |
4231 | Venezuela | 2001 | 1.355214e+12 |
4232 | Venezuela | 2001 | 5.032111e+12 |
4233 | Venezuela | 2001 | 1.347735e+13 |
4234 | Venezuela | 2001 | 8.063945e+12 |
4235 | Venezuela | 2001 | 1.334013e+13 |
4236 | Venezuela | 2001 | 8.804319e+13 |
4237 | Venezuela | 2002 | 4.549229e+12 |
4238 | Venezuela | 2002 | 1.074291e+14 |
4239 rows × 3 columns
union : concaténation de deux Data Frames#
union = concaténation de deux DataFrame (qui n’ont pas nécessaire les mêmes colonnes). On peut concaténer les lignes ou les colonnes
concat_ligne = pandas.concat((df,df))
concat_ligne[ (concat_ligne.year == 1993) & (concat_ligne.code == "AB") & (concat_ligne.country == "Argentina")]
index | country | sub_item | year | currency | VA1 | code | VA2 | WAGE1 | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | Argentina | Agriculture, hunting, forestry fishing | 1993 | Argentine peso | 1.214900e+10 | AB | 1.214900e+10 | 2.123000e+09 |
0 | 0 | Argentina | Agriculture, hunting, forestry fishing | 1993 | Argentine peso | 1.214900e+10 | AB | 1.214900e+10 | 2.123000e+09 |
sort : tri des lignes#
tri = df.sort_values( by=["year", "country"], ascending=[1,0])
tri.tail(10)
index | country | sub_item | year | currency | VA1 | code | VA2 | WAGE1 | |
---|---|---|---|---|---|---|---|---|---|
559 | 559 | Chile | Mining and quarrying | 2009 | Chilean peso | 1.404654e+13 | C | 1.404654e+13 | 1.588307e+12 |
560 | 560 | Chile | Manufacturing | 2009 | Chilean peso | 1.126610e+13 | D | 1.126610e+13 | 3.666442e+12 |
561 | 561 | Chile | Electricity, gas and water supply | 2009 | Chilean peso | 3.633492e+12 | E | 3.633492e+12 | 3.030204e+11 |
562 | 562 | Chile | Construction | 2009 | Chilean peso | 6.804767e+12 | F | 6.804767e+12 | 4.335883e+12 |
563 | 563 | Chile | Wholesale retail trade, repair of motor vehicl... | 2009 | Chilean peso | 8.163060e+12 | GH | 8.163060e+12 | 5.358225e+12 |
564 | 564 | Chile | Transport, storage and communications | 2009 | Chilean peso | 6.600354e+12 | I | 6.600354e+12 | 2.894256e+12 |
565 | 565 | Chile | Financial intermediation real estate, renting ... | 2009 | Chilean peso | 1.819692e+13 | JK | 1.819692e+13 | 5.991782e+12 |
566 | 566 | Chile | Total Economy | 2009 | Chilean peso | 8.650220e+13 | TOT | 8.650220e+13 | 3.658516e+13 |
269 | 269 | Bolivia | Total Economy | 2009 | boliviano | 1.021160e+11 | TOT | 1.021160e+11 | 3.381017e+10 |
270 | 270 | Bolivia | Total Economy | 2010 | boliviano | 1.159344e+11 | TOT | 1.159344e+11 | 3.647705e+10 |
group by : grouper des lignes qui partagent une valeur commune#
Cette opération consiste à grouper les lignes qui partagent une caractéristique commune (une ou ou plusieurs valeurs par exemple). Sur chaque groupe, on peut calculer une somme, une moyenne…
df[["country", "code", "year"]].cumsum(0).head()
country | code | year | |
---|---|---|---|
0 | Argentina | AB | 1993 |
1 | ArgentinaArgentina | ABC | 3986 |
2 | ArgentinaArgentinaArgentina | ABCD | 5979 |
3 | ArgentinaArgentinaArgentinaArgentina | ABCDE | 7972 |
4 | ArgentinaArgentinaArgentinaArgentinaArgentina | ABCDEF | 9965 |
pivot : utiliser des valeurs présentes dans colonne comme noms de colonnes#
pivot (tableau croisé dynamique)
Cette opération consiste à créer une seconde table en utilisant utiliser les valeurs d’une colonne comme nom de colonnes.
df.columns
Index(['index', 'country', 'sub_item', 'year', 'currency', 'VA1', 'code',
'VA2', 'WAGE1'],
dtype='object')
df.head()
index | country | sub_item | year | currency | VA1 | code | VA2 | WAGE1 | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | Argentina | Agriculture, hunting, forestry fishing | 1993 | Argentine peso | 1.214900e+10 | AB | 1.214900e+10 | 2.123000e+09 |
1 | 1 | Argentina | Mining and quarrying | 1993 | Argentine peso | 3.525000e+09 | C | 3.525000e+09 | 8.007000e+08 |
2 | 2 | Argentina | Manufacturing | 1993 | Argentine peso | 3.890700e+10 | D | 3.890700e+10 | 1.766600e+10 |
3 | 3 | Argentina | Electricity, gas and water supply | 1993 | Argentine peso | 4.461000e+09 | E | 4.461000e+09 | 2.213000e+09 |
4 | 4 | Argentina | Construction | 1993 | Argentine peso | 1.339300e+10 | F | 1.339300e+10 | 4.355000e+09 |
dfcopy = df.copy()
dfcopy["index"] = df.apply(lambda x: "{0}-{1}".format(x["country"], x["year"]), axis=1)
gr = dfcopy[["index", "code", "VA1"]].groupby(["index", "code"]).sum().reset_index()
gr.head()
index | code | VA1 | |
---|---|---|---|
0 | Argentina-1993 | AB | 1.214900e+10 |
1 | Argentina-1993 | C | 3.525000e+09 |
2 | Argentina-1993 | D | 3.890700e+10 |
3 | Argentina-1993 | E | 4.461000e+09 |
4 | Argentina-1993 | F | 1.339300e+10 |
piv = gr.pivot(index="index", columns="code", values="VA1")
piv.head()
code | AB | C | D | E | F | GH | I | JK | TOT |
---|---|---|---|---|---|---|---|---|---|
index | |||||||||
Argentina-1993 | 1.214900e+10 | 3.525000e+09 | 3.890700e+10 | 4.461000e+09 | 1.339300e+10 | 3.929400e+10 | 1.613400e+10 | 4.320200e+10 | 2.166090e+11 |
Argentina-1994 | 1.308500e+10 | 3.818000e+09 | 4.159600e+10 | 4.730000e+09 | 1.431100e+10 | 4.279800e+10 | 1.825100e+10 | 4.859900e+10 | 2.358460e+11 |
Argentina-1995 | 1.380850e+10 | 4.838400e+09 | 4.450210e+10 | 5.111000e+09 | 1.341400e+10 | 4.119850e+10 | 1.905990e+10 | 5.133940e+10 | 2.423343e+11 |
Argentina-1996 | 1.527000e+10 | 5.888900e+09 | 4.772340e+10 | 5.232400e+09 | 1.352680e+10 | 4.454100e+10 | 2.050140e+10 | 5.237490e+10 | 2.546081e+11 |
Argentina-1997 | 1.529300e+10 | 5.632500e+09 | 5.338210e+10 | 5.501700e+09 | 1.508030e+10 | 4.912050e+10 | 2.295190e+10 | 5.468300e+10 | 2.730922e+11 |
piv.tail()
code | AB | C | D | E | F | GH | I | JK | TOT |
---|---|---|---|---|---|---|---|---|---|
index | |||||||||
Venezuela-1998 | 2.461132e+12 | 5.412143e+12 | 7.463681e+12 | 8.678680e+11 | 3.443028e+12 | 8.775614e+12 | 4.893346e+12 | 7.898823e+12 | 5.057137e+13 |
Venezuela-1999 | 2.880633e+12 | 8.321318e+12 | 8.204542e+12 | 9.316080e+11 | 3.400098e+12 | 9.794983e+12 | 5.716383e+12 | 9.124555e+12 | 6.033646e+13 |
Venezuela-2000 | 3.258018e+12 | 1.591774e+13 | 1.062152e+13 | 1.137058e+12 | 3.842038e+12 | 1.168852e+13 | 6.920770e+12 | 1.100281e+13 | 7.974029e+13 |
Venezuela-2001 | 3.872766e+12 | 1.321210e+13 | 1.095207e+13 | 1.355214e+12 | 5.032111e+12 | 1.347735e+13 | 8.063945e+12 | 1.334013e+13 | 8.804319e+13 |
Venezuela-2002 | 4.549229e+12 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.074291e+14 |
join : fusionner deux Data Frames en associant les lignes qui partagent une valeur commune#
Fusionner deux tables consiste à apparier les lignes de la première table avec celle de la seconde si certaines colonnes de ces lignes partagent les mêmes valeurs. On distingue quatre cas :
INNER JOIN - inner : on garde tous les appariements réussis
LEFT OUTER JOIN - left : on garde tous les appariements réussis et les lignes non appariées de la table de gauche
RIGHT OUTER JOIN - right : on garde tous les appariements réussis et les lignes non appariées de la table de droite
FULL OUTER JOIN - outer : on garde tous les appariements réussis et les lignes non appariées des deux tables
Exemples et documentation : * merging, joining * join * merge ou DataFrame.merge * jointures SQL - illustrations avec graphiques en patates
Exercice: moyennes par groupes#
Calculer par exemple pour chaque pays, la moyenne des salaires au cours des années.