Session 26/6/2017 - machine learning


Exploration of the three machine learning problems presented in the article Machine Learning - session 6.

from jyquickhelper import add_notebook_menu
add_notebook_menu()

Problem 1: comparing a random forest with a linear model

This is a regression problem. We want to compare a random forest with a linear model.

Data

import pandas
df = pandas.read_csv("data/housing.data", delim_whitespace=True, header=None)
df.head()
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0 18.7 396.90 5.33 36.2
cols = "CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV".split()
df.columns = cols
df.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0 18.7 396.90 5.33 36.2
X = df.drop("MEDV", axis=1)
y = df["MEDV"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

Random Forest

from sklearn.ensemble import RandomForestRegressor
clr = RandomForestRegressor()
clr.fit(X, y)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
importances = clr.feature_importances_
importances
array([ 0.03571467,  0.00130566,  0.00606471,  0.00075703,  0.02778104,
        0.42221682,  0.01557762,  0.05941969,  0.00273247,  0.01327318,
        0.01474776,  0.01033679,  0.39007255])

We follow the example Feature importances with forests of trees.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize=(12,4))
indices = np.argsort(importances)[::-1]
std = np.std([tree.feature_importances_ for tree in clr.estimators_],
             axis=0)
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
        color="r", yerr=std[indices], align="center")
xlabels = list(df.columns[:-1])
xlabels = [xlabels[i] for i in indices]
plt.xticks(range(X.shape[1]), xlabels)
plt.xlim([-1, X.shape[1]])
plt.show()
(figure: feature importances of the random forest, with per-tree standard deviation)
from sklearn.metrics import r2_score
r2_score(y_train, clr.predict(X_train))
0.98035593861463521
r2_score(y_test, clr.predict(X_test))
0.97561991735577791
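
The scores above are optimistic because this forest was fitted on the whole dataset and not only on the training split. A minimal sketch, assuming the same X_train / X_test split, that fits on the training data only:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
# Fit on the training split only so that the test R2 is an out-of-sample measure.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
r2_score(y_test, rf.predict(X_test))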

Linear model

import statsmodels.api as sm
c:\Python36_x64\lib\site-packages\statsmodels\compat\pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools
model = sm.OLS(y_train, X_train)
results = model.fit()
results.params
CRIM      -0.035221
ZN         0.062151
INDUS     -0.083259
CHAS       3.527664
NOX       -4.892070
RM         5.554238
AGE       -0.006185
DIS       -1.342893
RAD        0.125101
TAX       -0.006814
PTRATIO   -0.163962
B          0.017828
LSTAT     -0.453238
dtype: float64
results.summary()
OLS Regression Results
Dep. Variable: MEDV R-squared: 0.956
Model: OLS Adj. R-squared: 0.954
Method: Least Squares F-statistic: 544.4
Date: Mon, 28 Aug 2017 Prob (F-statistic): 3.86e-212
Time: 11:57:18 Log-Likelihood: -1033.7
No. Observations: 339 AIC: 2093.
Df Residuals: 326 BIC: 2143.
Df Model: 13
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
CRIM -0.0352 0.057 -0.615 0.539 -0.148 0.077
ZN 0.0622 0.019 3.241 0.001 0.024 0.100
INDUS -0.0833 0.085 -0.976 0.330 -0.251 0.085
CHAS 3.5277 1.075 3.282 0.001 1.413 5.642
NOX -4.8921 4.106 -1.192 0.234 -12.969 3.185
RM 5.5542 0.403 13.784 0.000 4.762 6.347
AGE -0.0062 0.018 -0.351 0.726 -0.041 0.028
DIS -1.3429 0.267 -5.028 0.000 -1.868 -0.818
RAD 0.1251 0.087 1.431 0.153 -0.047 0.297
TAX -0.0068 0.005 -1.306 0.192 -0.017 0.003
PTRATIO -0.1640 0.146 -1.124 0.262 -0.451 0.123
B 0.0178 0.004 5.035 0.000 0.011 0.025
LSTAT -0.4532 0.065 -6.935 0.000 -0.582 -0.325
Omnibus: 136.429 Durbin-Watson: 2.020
Prob(Omnibus): 0.000 Jarque-Bera (JB): 826.705
Skew: 1.559 Prob(JB): 3.04e-180
Kurtosis: 9.986 Cond. No. 8.22e+03
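
To compare with the random forest on the same split, a minimal sketch computing the out-of-sample R² of this OLS fit (assuming results is the model fitted above on X_train, y_train):

from sklearn.metrics import r2_score
# Out-of-sample R2 of the OLS model on the test split.
r2_score(y_test, results.predict(X_test))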
model = sm.OLS(y,X.drop("LSTAT", axis=1))
results = model.fit()
results.summary()
OLS Regression Results
Dep. Variable: MEDV R-squared: 0.954
Model: OLS Adj. R-squared: 0.953
Method: Least Squares F-statistic: 846.6
Date: Mon, 28 Aug 2017 Prob (F-statistic): 2.38e-320
Time: 11:57:18 Log-Likelihood: -1556.1
No. Observations: 506 AIC: 3136.
Df Residuals: 494 BIC: 3187.
Df Model: 12
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
CRIM -0.1439 0.036 -3.990 0.000 -0.215 -0.073
ZN 0.0413 0.015 2.696 0.007 0.011 0.071
INDUS -0.0370 0.068 -0.540 0.589 -0.172 0.098
CHAS 3.2525 0.961 3.384 0.001 1.364 5.141
NOX -10.8653 3.422 -3.175 0.002 -17.590 -4.141
RM 7.1436 0.289 24.734 0.000 6.576 7.711
AGE -0.0449 0.014 -3.235 0.001 -0.072 -0.018
DIS -1.2292 0.206 -5.980 0.000 -1.633 -0.825
RAD 0.2008 0.071 2.829 0.005 0.061 0.340
TAX -0.0100 0.004 -2.391 0.017 -0.018 -0.002
PTRATIO -0.6575 0.112 -5.881 0.000 -0.877 -0.438
B 0.0165 0.003 5.779 0.000 0.011 0.022
Omnibus: 277.013 Durbin-Watson: 0.927
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3084.310
Skew: 2.148 Prob(JB): 0.00
Kurtosis: 14.307 Cond. No. 8.13e+03

TPOT

TPOT is an automated machine learning module: it searches over pipelines and hyperparameters using genetic programming.

from tpot import TPOTRegressor
tpot = TPOTRegressor(generations=2, population_size=50, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')
c:\Python36_x64\lib\site-packages\sklearn\cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
Generation 1 - Current best internal CV score: 12.929444055404636
Generation 2 - Current best internal CV score: 12.929444055404636
Best pipeline: GradientBoostingRegressor(input_matrix, GradientBoostingRegressor__alpha=DEFAULT, GradientBoostingRegressor__learning_rate=0.1, GradientBoostingRegressor__loss=huber, GradientBoostingRegressor__max_depth=DEFAULT, GradientBoostingRegressor__max_features=0.65, GradientBoostingRegressor__min_samples_leaf=1, GradientBoostingRegressor__min_samples_split=8, GradientBoostingRegressor__n_estimators=100, GradientBoostingRegressor__subsample=0.6)
9.34632700796

The module optimizes the hyperparameters, sometimes a little too much judging by the weaker performance obtained on the test set.

r2_score(y_train, tpot.predict(X_train))
0.97107575656926248
r2_score(y_test, tpot.predict(X_test))
0.88100225324209969
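
To get a sense of how much the selected pipeline overfits, a minimal sketch, assuming a plain scikit-learn GradientBoostingRegressor close to the pipeline found by TPOT, evaluated by cross-validation on the training set only:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
# 5-fold cross-validated R2 on the training split.
gbr = GradientBoostingRegressor(loss="huber", learning_rate=0.1, n_estimators=100)
scores = cross_val_score(gbr, X_train, y_train, cv=5, scoring="r2")
scores.mean(), scores.std()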

Feature importance for a single observation

We reuse the first random forest and apply the treeinterpreter module.

clr = RandomForestRegressor()
clr.fit(X, y)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
from treeinterpreter import treeinterpreter as ti
prediction, bias, contributions = ti.predict(clr, X_test)
for i in range(min(2, X_train.shape[0])):
    print("Instance", i)
    print("Bias (trainset mean)", bias[i])
    print("Feature contributions:")
    for c, feature in sorted(zip(contributions[i], df.columns),
                             key=lambda x: -abs(x[0])):
        print(feature, round(c, 2))
    print( "-"*20)
Instance 0
Bias (trainset mean) 22.8580039526
Feature contributions:
LSTAT -4.92
CRIM 2.39
RM -2.05
AGE 1.61
NOX 0.57
TAX 0.53
PTRATIO 0.34
INDUS 0.19
B -0.18
CHAS 0.05
DIS 0.04
ZN 0.0
RAD 0.0
--------------------
Instance 1
Bias (trainset mean) 22.8580039526
Feature contributions:
RM -5.76
LSTAT 3.07
PTRATIO 0.75
CRIM -0.67
TAX -0.29
DIS -0.26
B 0.1
RAD -0.06
AGE 0.06
NOX 0.05
INDUS 0.04
ZN 0.0
CHAS 0.0
--------------------
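
As a sanity check, treeinterpreter decomposes every prediction into the bias term plus the sum of the feature contributions; a minimal sketch verifying this on the arrays returned above:

import numpy as np
# prediction == bias + sum of the per-feature contributions
np.allclose(prediction.ravel(), bias.ravel() + contributions.sum(axis=1).ravel())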

Problem 2: time series

We take a series from Google Trends, in our case the query tennis live, and compare a linear approach with a non-linear one.

Linear approach

import pandas
df = pandas.read_csv("data/multiTimeline.csv", skiprows=1)
df.columns= ["Semaine", "compte"]
df["SemaineDt"] = pandas.to_datetime(df.Semaine)
df=df.set_index("SemaineDt")
df["compte"] = df["compte"].astype(float)
df.head()
Semaine compte
SemaineDt
2012-07-01 2012-07-01 70.0
2012-07-08 2012-07-08 49.0
2012-07-15 2012-07-15 18.0
2012-07-22 2012-07-22 22.0
2012-07-29 2012-07-29 88.0
%matplotlib inline
df.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1dbe7f0a0b8>
(figure: weekly counts of the query tennis live)
from statsmodels.tsa.arima_model import ARIMA
arma_mod = ARIMA(df["compte"].as_matrix(), order=(6 ,1, 1))
res = arma_mod.fit()
res.params
array([ 0.00418578,  0.59035778, -0.32540702,  0.23286813, -0.03300842,
        0.06434314, -0.0720402 , -0.99999983])
res.summary()
ARIMA Model Results
Dep. Variable: D.y No. Observations: 259
Model: ARIMA(6, 1, 1) Log Likelihood -1055.581
Method: css-mle S.D. of innovations 14.116
Date: Mon, 28 Aug 2017 AIC 2129.161
Time: 11:58:06 BIC 2161.173
Sample: 1 HQIC 2142.032
coef std err z P>|z| [0.025 0.975]
const 0.0042 0.021 0.196 0.845 -0.038 0.046
ar.L1.D.y 0.5904 0.063 9.431 0.000 0.468 0.713
ar.L2.D.y -0.3254 0.072 -4.507 0.000 -0.467 -0.184
ar.L3.D.y 0.2329 0.075 3.097 0.002 0.085 0.380
ar.L4.D.y -0.0330 0.076 -0.433 0.665 -0.182 0.116
ar.L5.D.y 0.0643 0.076 0.842 0.400 -0.085 0.214
ar.L6.D.y -0.0720 0.066 -1.096 0.274 -0.201 0.057
ma.L1.D.y -1.0000 0.010 -96.075 0.000 -1.020 -0.980
Roots
            Real       Imaginary     Modulus    Frequency
AR.1      -1.2011      -1.2144j       1.7080      -0.3741
AR.2      -1.2011      +1.2144j       1.7080       0.3741
AR.3       0.1840      -1.4018j       1.4138      -0.2292
AR.4       0.1840      +1.4018j       1.4138       0.2292
AR.5       1.4636      -0.4882j       1.5429      -0.0512
AR.6       1.4636      +0.4882j       1.5429       0.0512
MA.1       1.0000      +0.0000j       1.0000       0.0000
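
A minimal sketch, assuming res is the fit above, of producing out-of-sample forecasts; with this version of statsmodels, forecast returns the forecasts, their standard errors and confidence intervals:

# Forecast the next 10 weeks of the series.
forecast, stderr, conf_int = res.forecast(steps=10)
forecast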

Non-linear method

We build the matrix of lagged series. This approach steps outside the linear framework and makes it possible to add other variables.

from statsmodels.tsa.tsatools import lagmat
lag = 8
X = lagmat(df["compte"], lag)
lagged = df.copy()
for c in range(1,lag+1):
    lagged["lag%d" % c] = X[:, c-1]
pandas.concat([lagged.head(), lagged.tail()])
Semaine compte lag1 lag2 lag3 lag4 lag5 lag6 lag7 lag8
SemaineDt
2012-07-01 2012-07-01 70.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2012-07-08 2012-07-08 49.0 70.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2012-07-15 2012-07-15 18.0 49.0 70.0 0.0 0.0 0.0 0.0 0.0 0.0
2012-07-22 2012-07-22 22.0 18.0 49.0 70.0 0.0 0.0 0.0 0.0 0.0
2012-07-29 2012-07-29 88.0 22.0 18.0 49.0 70.0 0.0 0.0 0.0 0.0
2017-05-21 2017-05-21 23.0 40.0 35.0 21.0 29.0 27.0 14.0 23.0 40.0
2017-05-28 2017-05-28 44.0 23.0 40.0 35.0 21.0 29.0 27.0 14.0 23.0
2017-06-04 2017-06-04 55.0 44.0 23.0 40.0 35.0 21.0 29.0 27.0 14.0
2017-06-11 2017-06-11 28.0 55.0 44.0 23.0 40.0 35.0 21.0 29.0 27.0
2017-06-18 2017-06-18 28.0 28.0 55.0 44.0 23.0 40.0 35.0 21.0 29.0
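
The same lagged columns can be built with pandas alone; a minimal sketch using shift, where fillna(0.0) mimics the zero padding of lagmat above:

# Equivalent construction of the lag columns with pandas.shift.
lagged_alt = df[["compte"]].copy()
for c in range(1, lag + 1):
    lagged_alt["lag%d" % c] = lagged_alt["compte"].shift(c)
lagged_alt = lagged_alt.fillna(0.0)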
xc = ["lag%d" % i for i in range(1,lag+1)]
split = 0.66
isplit = int(len(lagged) * split)
xt = lagged[10:][xc]
yt = lagged[10:]["compte"]
X_train, y_train, X_test, y_test = xt[:isplit], yt[:isplit], xt[isplit:], yt[isplit:]
from sklearn.ensemble import RandomForestRegressor
clr = RandomForestRegressor()
clr.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
from sklearn.metrics import r2_score
r2 = r2_score(y_test.as_matrix(), clr.predict(X_test))
r2
0.39863343986737942
plt.scatter(y_test.as_matrix(), clr.predict(X_test))
<matplotlib.collections.PathCollection at 0x1dbe9ce11d0>
(figure: predictions versus actual values on the test set)
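
For the linear/non-linear comparison announced at the beginning of this problem, a minimal sketch fitting an ordinary linear regression on the same lag features and the same split:

from sklearn.linear_model import LinearRegression
# Linear baseline on the lagged features, evaluated on the same test period.
lin = LinearRegression()
lin.fit(X_train, y_train)
r2_score(y_test, lin.predict(X_test))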

Text

We want to compare an LDA with word2vec and k-means, using the data available at ensae_teaching_cs/src/ensae_teaching_cs/data/data_web/.

from ensae_teaching_cs.data import twitter_zip
df = twitter_zip(as_df=True)
df.head(n=2).T
0 1
index 776066992054861825 776067660979245056
nb_user_mentions 0 0
nb_extended_entities 0 0
nb_hashtags 1 1
geo NaN NaN
text_hashtags , SiJétaisPrésident , SiJétaisPrésident
annee 2016 2016
delimit_mention NaN NaN
lang fr fr
id_str 7.76067e+17 7.76068e+17
text_mention NaN NaN
retweet_count 4 5
favorite_count 3 8
type_extended_entities [] []
text #SiJétaisPrésident se serait la fin du monde..... #SiJétaisPrésident je donnerai plus de vacance...
nb_user_photos 0 0
nb_urls 0 0
nb_symbols 0 0
created_at Wed Sep 14 14:36:04 +0000 2016 Wed Sep 14 14:38:43 +0000 2016
delimit_hash , 0, 18 , 0, 18

From words to coordinates - tf-idf

keep = df.text.dropna().index
dfnonan = df.iloc[keep, :]
dfnonan.shape
(5087, 20)
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=1000)
tfidf = tfidf_vectorizer.fit_transform(dfnonan["text"])
tfidf[:2, :]
<2x1000 sparse matrix of type '<class 'numpy.float64'>'
    with 21 stored elements in Compressed Sparse Row format>
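
To see which words carry the weight for a given tweet, a minimal sketch mapping the first row of the tf-idf matrix back to the vocabulary:

import numpy as np
# Words with the largest tf-idf weight in the first tweet.
names = tfidf_vectorizer.get_feature_names()
row = tfidf[0].toarray().ravel()
top = np.argsort(row)[::-1][:5]
[(names[i], round(row[i], 3)) for i in top]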

LDA

from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=10, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(tfidf)
LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=50.0,
             max_doc_update_iter=100, max_iter=5, mean_change_tol=0.001,
             n_components=10, n_jobs=1, n_topics=None, perp_tol=0.1,
             random_state=0, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)
tf_feature_names = tfidf_vectorizer.get_feature_names()
tf_feature_names[100:103]
['avoir', 'bac', 'bah']
lda.components_.shape
(10, 1000)

We obtain ten vectors, one for each of the ten clusters (topics). Each dimension indicates whether the corresponding word belongs to the cluster or not.

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
print_top_words(lda, tf_feature_names, 10)
Topic #0:
gratuit mcdo supprimerai école soir kebab macdo kfc domicile cc
Topic #1:
macron co https de la est le il et hollande
Topic #2:
sijetaispresident je les de la et le des en pour
Topic #3:
notaires eu organiserais mets carte nouveaux journées installation cache créer
Topic #4:
sijetaispresident interdirais les je ballerines la serait serais bah de
Topic #5:
ministre de sijetaispresident la je premier mort et nommerais président
Topic #6:
cours le supprimerais jour sijetaispresident lundi samedi semaine je vendredi
Topic #7:
port interdirait démissionnerais promesses heure rendrai ballerine mes changement christineboutin
Topic #8:
seraient sijetaispresident gratuits aux les nos putain éducation nationale bonne
Topic #9:
bordel seront légaliserai putes gratuites pizza mot virerais vitesse dutreil
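
To assign a tweet to its dominant topic, a minimal sketch using lda.transform, which returns the topic distribution of each document:

# Topic distribution of the first two tweets and their dominant topic.
topic_dist = lda.transform(tfidf[:2])
topic_dist.argmax(axis=1)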

Clustering

from sklearn.cluster import KMeans
km = KMeans(n_clusters=10)
km.fit(tfidf)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=10, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
km.cluster_centers_.shape
(10, 1000)
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.cluster_centers_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
print_top_words(km, tf_feature_names, 10)
Topic #0:
je sijetaispresident un pas en que serais ferais une ne
Topic #1:
les sijetaispresident je et tous de pour seraient ballerines en
Topic #2:
il macron de aurait qu est co https un pas
Topic #3:
ministre premier nommerais de sijetaispresident je la serait mickey nommerai
Topic #4:
macron est de hollande la le et pas qui sarko
Topic #5:
des sijetaispresident je de et les en la pour le
Topic #6:
de sijetaispresident la je les et plus en au mois
Topic #7:
la serait sijetaispresident merde dans je et france vous de
Topic #8:
le sijetaispresident je monde et pour tout de serait les
Topic #9:
https co macron de la le les via et sijetaispresident

Ah, the stop words…
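
A minimal sketch, assuming a small hand-made list of French stop words, of removing them in the vectorizer before running the LDA and the k-means again:

# Hypothetical short French stop word list; a fuller list would work better.
french_stopwords = ["le", "la", "les", "de", "des", "et", "je", "en",
                    "un", "une", "pour", "pas", "que", "qui", "dans"]
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=1000,
                                   stop_words=french_stopwords)
tfidf = tfidf_vectorizer.fit_transform(dfnonan["text"])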