.. _td2acorrectionsession3Arst: ================================================================ 2A.ml - Statistiques descriptives avec scikit-learn - correction ================================================================ .. only:: html **Links:** :download:`notebook `, :downloadlink:`html `, :download:`python `, :downloadlink:`slides `, :githublink:`GitHub|_doc/notebooks/td2a_ml/td2a_correction_session_3A.ipynb|*` ACP, CAH, régression linéaire, correction. .. code:: ipython3 %matplotlib inline .. code:: ipython3 import matplotlib.pyplot as plt .. code:: ipython3 from jyquickhelper import add_notebook_menu add_notebook_menu() .. contents:: :local: .. code:: ipython3 # Répare une incompatibilité entre scipy 1.0 et statsmodels 0.8. from pymyinstall.fix import fix_scipy10_for_statsmodels08 fix_scipy10_for_statsmodels08() Prérequis de l’énoncé ~~~~~~~~~~~~~~~~~~~~~ .. code:: ipython3 import pyensae.datasource f = pyensae.datasource.download_data("dads2011_gf_salaries11_dbase.zip", website="https://www.insee.fr/fr/statistiques/fichier/2011542/") import pandas try: from dbfread import DBF use_dbfread = True except ImportError as e : use_dbfread = False if use_dbfread: import os from pyensae.sql.database_exception import ExceptionSQL from pyensae.datasource import dBase2sqllite print("convert dbase into sqllite") try: dBase2sqllite("salaries2011.db3", "varlist_salaries11.dbf", overwrite_table="varlist") dBase2sqllite("salaries2011.db3", "varmod_salaries11.dbf", overwrite_table="varmod") dBase2sqllite("salaries2011.db3", 'salaries11.dbf', overwrite_table="salaries", fLOG = print) except ExceptionSQL: print("La base de données est déjà renseignée.") else : print("use of zipped version") import pyensae.datasource db3 = pyensae.datasource.download_data("salaries2011.zip") # pour aller plus vite, données à télécharger au # http://www.xavierdupre.fr/enseignement/complements/salaries2011.zip .. parsed-literal:: convert dbase into sqllite unable to execute a SQL request (1)(file salaries2011.db3) create table varlist (VARIABLE TEXT, LIBELLE TEXT, TYPE TEXT, LONGUEUR REAL) La base de données est déjà renseignée. Exercice 1 : CAH (classification ascendante hiérarchique) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Le point commun de ces méthodes est qu’elles ne sont pas supervisées. L’objectif est de réduire la complexité des données. Réduire le nombre de dimensions pour l’ACP ou segmenter les observations pour les k-means et la CAH. .. code:: ipython3 import pandas, numpy, matplotlib.pyplot as plt from sklearn.decomposition import PCA import pyensae.datasource pyensae.datasource.download_data("eleve_region.txt") df = pandas.read_csv("eleve_region.txt", sep="\t", encoding="utf8", index_col=0) print(df.shape) df.head(n=5) .. parsed-literal:: (27, 21) .. raw:: html

	1993	1994	1995	1996	1997	1998	1999	2000	2001	2002	...	2004	2005	2006	2007	2008	2009	2010	2011	2012	2013
académie
Aix-Marseille	241357	242298	242096	242295	243660	244608	245536	247288	249331	250871	...	250622	248208	245755	243832	242309	240664	240432	241336	239051	240115
Amiens	198281	196871	195709	194055	192893	191862	189636	185977	183357	180973	...	175610	172110	168718	165295	163116	162548	163270	164422	165275	166345
Besançon	116373	115600	114282	113312	112076	110261	108106	105463	103336	102264	...	100117	98611	97038	95779	95074	94501	94599	94745	94351	94613
Bordeaux	253551	252644	249658	247708	247499	245757	244992	243047	243592	245198	...	244805	244343	242602	242933	243146	244336	246806	250626	252085	255761
Caen	145435	144369	141883	140658	139585	137704	135613	133255	131206	129271	...	125552	123889	122550	121002	119857	119426	119184	119764	119010	119238

5 rows × 21 columns

.. code:: ipython3 for c in df.columns: if c != "1993": df[c] /= df ["1993"] df["1993"] /= df["1993"] df.head() .. raw:: html

	1993	1994	1995	1996	1997	1998	1999	2000	2001	2002	...	2004	2005	2006	2007	2008	2009	2010	2011	2012	2013
académie
Aix-Marseille	1.0	1.003899	1.003062	1.003886	1.009542	1.013470	1.017315	1.024574	1.033038	1.039419	...	1.038387	1.028385	1.018222	1.010255	1.003944	0.997129	0.996168	0.999913	0.990446	0.994854
Amiens	1.0	0.992889	0.987029	0.978687	0.972826	0.967627	0.956400	0.937947	0.924733	0.912710	...	0.885662	0.868011	0.850904	0.833640	0.822651	0.819786	0.823427	0.829237	0.833539	0.838936
Besançon	1.0	0.993358	0.982032	0.973697	0.963076	0.947479	0.928961	0.906250	0.887972	0.878761	...	0.860311	0.847370	0.833853	0.823035	0.816976	0.812053	0.812895	0.814149	0.810764	0.813015
Bordeaux	1.0	0.996423	0.984646	0.976955	0.976131	0.969261	0.966243	0.958572	0.960722	0.967056	...	0.965506	0.963684	0.956817	0.958123	0.958963	0.963656	0.973398	0.988464	0.994218	1.008716
Caen	1.0	0.992670	0.975577	0.967154	0.959776	0.946842	0.932465	0.916251	0.902162	0.888858	...	0.863286	0.851851	0.842644	0.832001	0.824128	0.821164	0.819500	0.823488	0.818304	0.819871

5 rows × 21 columns

.. code:: ipython3 from sklearn.cluster import AgglomerativeClustering ward = AgglomerativeClustering(linkage='ward', compute_full_tree=True).fit(df) ward .. parsed-literal:: AgglomerativeClustering(compute_full_tree=True) .. code:: ipython3 from scipy.cluster.hierarchy import dendrogram import matplotlib.pyplot as plt dendro = [ ] for a,b in ward.children_: dendro.append([a, b, float(len(dendro)+1), len(dendro)+1]) # le dernier coefficient devrait contenir le nombre de feuilles dépendant de ce noeud # et non le dernier indice # de même, le niveau (3ème colonne) ne devrait pas être le nombre de noeud # mais la distance de Ward fig = plt.figure( figsize=(8,8) ) ax = fig.add_subplot(1,1,1) r = dendrogram(dendro, color_threshold=1, labels=list(df.index), show_leaf_counts=True, ax=ax, orientation="left") .. image:: td2a_correction_session_3A_11_0.png Je reprends également le graphique montrant la matrice de corrélations qu’on peut également obtenir avec `seaborn : clustermap `__. .. code:: ipython3 from scipy.spatial.distance import pdist, squareform data_dist = pdist(df) fig = plt.figure(figsize=(8,8)) # x ywidth height ax1 = fig.add_axes([0.05,0.1,0.2,0.6]) Z1 = dendrogram(dendro, orientation='right',labels=list(df.index)) ax1.set_xticks([]) # Compute and plot second dendrogram. ax2 = fig.add_axes([0.3,0.71,0.6,0.2]) Z2 = dendrogram(dendro) ax2.set_xticks([]) ax2.set_yticks([]) # Compute and plot the heatmap axmatrix = fig.add_axes([0.3,0.1,0.6,0.6]) idx1 = Z1['leaves'] idx2 = Z2['leaves'] D = squareform(data_dist) D = D[idx1,:] D = D[:,idx2] im = axmatrix.matshow(D, aspect='auto', origin='lower', cmap=plt.cm.YlGnBu) axmatrix.set_xticks([]) axmatrix.set_yticks([]) # Plot colorbar. axcolor = fig.add_axes([0.91,0.1,0.02,0.6]) plt.colorbar(im, cax=axcolor) plt.title("Matrice de corrélation et dendogramme."); .. image:: td2a_correction_session_3A_13_0.png Exercice 2 : régression linéaire ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Ce sont trois méthodes supervisées : on s’en sert pour expliquer prédire le lien entre deux variables :math:`X` et :math:`Y` (ou ensemble de variables) ou prédire :math:`Y` en fonction de :math:`X`. On suppose que les données ont déjà été téléchargées (voir l’énoncé de cet exercice). On continue avec l’extraction de l’âge, du sexe et du salaire comme indiqué dans l’énoncé. .. code:: ipython3 import sqlite3, pandas con = sqlite3.connect("salaries2011.db3") df = pandas.io.sql.read_sql("select * from varmod", con) con.close() values = df[ df.VARIABLE == "TRNNETO"].copy() def process_intervalle(s): # [14 000 ; 16 000[ euros acc = "0123456789;+" s0 = "".join(c for c in s if c in acc) spl = s0.split(';') if len(spl) != 2: raise ValueError("Unable to process '{0}'".format(s0)) try: a = float(spl[0]) except Exception as e: raise ValueError("Cannot interpret '{0}' - {1}".format(s, spl)) b = float(spl[1]) if "+" not in spl[1] else None if b is None: return a return (a+b) / 2.0 values["montant"] = values.apply(lambda r : process_intervalle(r["MODLIBELLE"]), axis=1) values.head() .. raw:: html

	VARIABLE	MODALITE	MODLIBELLE	montant
8957	TRNNETO	00	[0 ; 200[ euros	100.0
8958	TRNNETO	01	[200 ; 500[ euros	350.0
8959	TRNNETO	02	[500 ; 1 000[ euros	750.0
8960	TRNNETO	03	[1 000 ; 1 500[ euros	1250.0
8961	TRNNETO	04	[1 500 ; 2 000[ euros	1750.0

.. code:: ipython3 import sqlite3, pandas con = sqlite3.connect("salaries2011.db3") data = pandas.io.sql.read_sql("select TRNNETO,AGE,SEXE from salaries", con) con.close() salaires = data.merge (values, left_on="TRNNETO", right_on="MODALITE" ) salaires["M"] = salaires.apply(lambda r: 1 if r["SEXE"] == "1" else 0, axis=1) salaires["F"] = salaires.apply(lambda r: 1 if r["SEXE"] == "2" else 0, axis=1) data = salaires[["AGE","M","F","montant"]] data = data [data.M + data.F > 0] data.head() .. raw:: html

	AGE	M	F	montant
0	49.0	1	0	750.0
1	27.0	1	0	750.0
2	22.0	1	0	750.0
3	26.0	1	0	750.0
4	29.0	0	1	750.0

On supprime les valeurs manquantes : .. code:: ipython3 nonull = data.dropna().copy() nonull.shape .. parsed-literal:: (2240691, 4) version scikit-learn ^^^^^^^^^^^^^^^^^^^^ La `régression linéraire `__ suit : .. code:: ipython3 nonull[["AGE","M"]].dropna().shape .. parsed-literal:: (2240691, 2) .. code:: ipython3 from sklearn import linear_model clf = linear_model.LinearRegression() clf.fit (nonull[["AGE","M"]].values, nonull.montant.values) clf.coef_, clf.intercept_, "R^2=", clf.score( nonull[["AGE","M"]], nonull.montant) .. parsed-literal:: C:\Python395_x64\lib\site-packages\sklearn\base.py:443: UserWarning: X has feature names, but LinearRegression was fitted without feature names warnings.warn( .. parsed-literal:: (array([ 310.98873096, 4710.02901965]), 4927.219006697269, 'R^2=', 0.13957345814666222) On prend un échantillon aléatoire : .. code:: ipython3 import random val = nonull.copy() val["rnd"] = val.apply(lambda r: random.randint(0, 1000), axis=1) ech = val[val["rnd"] == 1] ech.shape .. parsed-literal:: (2278, 5) On sépare homme et femmes : .. code:: ipython3 homme = ech[ech.M == 1] femme = ech[ech.M == 0] predh = clf.predict(homme[["AGE","M"]]) predf = clf.predict(femme[["AGE","M"]]) import matplotlib.pyplot as plt plt.figure(figsize=(16,6)) plt.plot(homme.AGE, homme.montant, "r.") plt.plot(femme.AGE + 0.2, femme.montant, "b.") plt.plot(homme.AGE, predh, "ro-", label="hommes") plt.plot(femme.AGE, predf, "bo-", label="femmes") plt.legend() plt.title("Montant moyen par âge et genre"); .. parsed-literal:: C:\Python395_x64\lib\site-packages\sklearn\base.py:443: UserWarning: X has feature names, but LinearRegression was fitted without feature names warnings.warn( C:\Python395_x64\lib\site-packages\sklearn\base.py:443: UserWarning: X has feature names, but LinearRegression was fitted without feature names warnings.warn( .. image:: td2a_correction_session_3A_25_1.png version statsmodels ^^^^^^^^^^^^^^^^^^^ .. code:: ipython3 import statsmodels.api as sm nonull["one"] = 1.0 # on ajoute la constante model = sm.OLS(nonull.montant, nonull [["AGE","M", "one"]]) results = model.fit() print("coefficients",results.params) results.summary() .. parsed-literal:: C:\Python395_x64\lib\site-packages\statsmodels\tsa\base\tsa_model.py:7: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead. from pandas import (to_datetime, Int64Index, DatetimeIndex, Period, C:\Python395_x64\lib\site-packages\statsmodels\tsa\base\tsa_model.py:7: FutureWarning: pandas.Float64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead. from pandas import (to_datetime, Int64Index, DatetimeIndex, Period, .. parsed-literal:: coefficients AGE 310.988731 M 4710.029020 one 4927.219007 dtype: float64 .. raw:: html

OLS Regression Results
Dep. Variable:	montant	R-squared:	0.140
Model:	OLS	Adj. R-squared:	0.140
Method:	Least Squares	F-statistic:	1.817e+05
Date:	Sat, 12 Feb 2022	Prob (F-statistic):	0.00
Time:	18:57:08	Log-Likelihood:	-2.4023e+07
No. Observations:	2240691	AIC:	4.805e+07
Df Residuals:	2240688	BIC:	4.805e+07
Df Model:	2
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
AGE	310.9887	0.600	518.682	0.000	309.814	312.164
M	4710.0290	14.658	321.323	0.000	4681.299	4738.759
one	4927.2190	26.032	189.272	0.000	4876.196	4978.242

Omnibus:	132298.944	Durbin-Watson:	0.183
Prob(Omnibus):	0.000	Jarque-Bera (JB):	158649.997
Skew:	0.614	Prob(JB):	0.00
Kurtosis:	3.440	Cond. No.	150.

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. On reproduit le même dessin : .. code:: ipython3 import random val = nonull.copy() val["rnd"] = val.apply(lambda r: random.randint(0,1000), axis=1) ech = val[val["rnd"] == 1] .. code:: ipython3 homme = ech [ ech.M == 1] femme = ech [ ech.M == 0] predh = results.predict(homme[["AGE","M","one"]]) predf = results.predict(femme[["AGE","M","one"]]) import matplotlib.pyplot as plt def graph(homme, femme, predh, predf): fig, ax = plt.subplots(1, 1, figsize=(16, 6)) ax.plot(homme.AGE, homme.montant, "r.") ax.plot(femme.AGE + 0.2, femme.montant, "b.") ax.plot(homme.AGE, predh, "ro-", label="hommes") ax.plot(femme.AGE, predf, "bo-", label="femmes") ax.legend() ax.set_title("Montant moyen par âge et genre"); return ax graph(homme, femme, predh, predf); .. image:: td2a_correction_session_3A_30_0.png On ajoute l’intervalle de confiance sur un échantillon : .. code:: ipython3 from statsmodels.sandbox.regression.predstd import wls_prediction_std prstd, iv_l, iv_u = wls_prediction_std(results) val = nonull.copy() val["rnd"] = val.apply(lambda r: random.randint(0, 1000), axis=1) val["pred"] = prstd val["up"] = iv_u val["down"] = iv_l ech = val[val["rnd"] == 1] ech.head() .. raw:: html

	AGE	M	F	montant	one	rnd	pred	up	down
86	34.0	0	1	750.0	1.0	1	10964.546920	36990.964539	-5989.292820
355	48.0	0	1	750.0	1.0	1	10964.547492	41344.807893	-1635.451707
1046	68.0	0	1	750.0	1.0	1	10964.559457	47564.605962	4584.299462
2008	21.0	0	1	750.0	1.0	1	10964.552144	32948.121273	-10032.156559
3761	18.0	1	0	750.0	1.0	1	10964.553468	36725.186695	-6255.096328

Puis on l’ajoute au graphe précédent : .. code:: ipython3 homme = ech[ech.M == 1] femme = ech[ech.M == 0] predh = results.predict(homme[["AGE","M","one"]]) predf = results.predict(femme[["AGE","M","one"]]) ax = graph(homme, femme, predh, predf) ax.plot(homme.AGE, homme.up, 'r-') ax.plot(homme.AGE, homme.down, 'r-') ax.plot(femme.AGE, femme.up, 'b-') ax.plot(femme.AGE, femme.down, 'b-') ax.set_title("Montant moyen par âge et genre avec écart-type"); .. image:: td2a_correction_session_3A_34_0.png