# Session 26/6/2017 - machine learning

A walk through the three machine learning problems presented in the article Machine Learning - session 6.

```
from jyquickhelper import add_notebook_menu
```

## Problem 1: comparing a random forest and a linear model

This is a regression problem. We want to compare a random forest with a linear model.
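Before turning to the real data, the comparison can be sketched on synthetic data (a minimal illustration, not the dataset used below): on a non-linear target, a random forest typically scores higher than a linear model.

```python
# Hedged sketch on synthetic data: random forest vs linear model
# on a target with a non-linear component.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X_toy = rng.rand(300, 5)
y_toy = np.sin(3 * X_toy[:, 0]) + X_toy[:, 1] + 0.1 * rng.randn(300)

results = {}
for model in (LinearRegression(),
              RandomForestRegressor(n_estimators=50, random_state=0)):
    # mean cross-validated R2 for each model
    results[type(model).__name__] = cross_val_score(
        model, X_toy, y_toy, cv=5, scoring="r2").mean()
# the forest should score higher on this non-linear target
```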

### Data

```
import pandas
# the Boston housing data is loaded into a DataFrame df (loading code not shown in the export)
```
```
         0     1     2  3      4      5     6       7  8      9    10      11    12    13
0  0.00632  18.0  2.31  0  0.538  6.575  65.2  4.0900  1  296.0  15.3  396.90  4.98  24.0
1  0.02731   0.0  7.07  0  0.469  6.421  78.9  4.9671  2  242.0  17.8  396.90  9.14  21.6
2  0.02729   0.0  7.07  0  0.469  7.185  61.1  4.9671  2  242.0  17.8  392.83  4.03  34.7
3  0.03237   0.0  2.18  0  0.458  6.998  45.8  6.0622  3  222.0  18.7  394.63  2.94  33.4
4  0.06905   0.0  2.18  0  0.458  7.147  54.2  6.0622  3  222.0  18.7  396.90  5.33  36.2
```
```
cols = "CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV".split()
df.columns = cols
```
```
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  MEDV
0  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296.0     15.3  396.90   4.98  24.0
1  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242.0     17.8  396.90   9.14  21.6
2  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242.0     17.8  392.83   4.03  34.7
3  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222.0     18.7  394.63   2.94  33.4
4  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222.0     18.7  396.90   5.33  36.2
```
```
X = df.drop("MEDV", axis=1)
y = df["MEDV"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
```

### Random Forest

```
from sklearn.ensemble import RandomForestRegressor
clr = RandomForestRegressor()
# note: the forest is fit on the full dataset, so the test score below is optimistic
clr.fit(X, y)
```
```
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
```
```
importances = clr.feature_importances_
importances
```

```
array([ 0.03571467,  0.00130566,  0.00606471,  0.00075703,  0.02778104,
        0.42221682,  0.01557762,  0.05941969,  0.00273247,  0.01327318,
        0.01474776,  0.01033679,  0.39007255])
```

We follow the example Feature importances with forests of trees.

```
%matplotlib inline
```
```
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize=(12, 4))
indices = np.argsort(importances)[::-1]
std = np.std([tree.feature_importances_ for tree in clr.estimators_],
             axis=0)
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
        color="r", yerr=std[indices], align="center")
xlabels = list(df.columns[:-1])
xlabels = [xlabels[i] for i in indices]
plt.xticks(range(X.shape[1]), xlabels)
plt.xlim([-1, X.shape[1]])
plt.show()
```
```
from sklearn.metrics import r2_score
r2_score(y_train, clr.predict(X_train))
```

```
0.98035593861463521
```

```
r2_score(y_test, clr.predict(X_test))
```

```
0.97561991735577791
```

### Linear model

```
import statsmodels.api as sm
```

```
c:\Python36_x64\lib\site-packages\statsmodels\compat\pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools
```
```
model = sm.OLS(y_train, X_train)
```

```
results = model.fit()
```

```
results.params
```
```
CRIM      -0.035221
ZN         0.062151
INDUS     -0.083259
CHAS       3.527664
NOX       -4.892070
RM         5.554238
AGE       -0.006185
DIS       -1.342893
TAX       -0.006814
PTRATIO   -0.163962
B          0.017828
LSTAT     -0.453238
dtype: float64
```
```
results.summary()
```
```
                            OLS Regression Results
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.956
Model:                            OLS   Adj. R-squared:                  0.954
Method:                 Least Squares   F-statistic:                     544.4
Date:                Mon, 28 Aug 2017   Prob (F-statistic):          3.86e-212
Time:                        11:57:18   Log-Likelihood:                -1033.7
No. Observations:                 339   AIC:                             2093.
Df Residuals:                     326   BIC:                             2143.
Df Model:                          13
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
CRIM          -0.0352      0.057     -0.615      0.539      -0.148       0.077
ZN             0.0622      0.019      3.241      0.001       0.024       0.100
INDUS         -0.0833      0.085     -0.976      0.330      -0.251       0.085
CHAS           3.5277      1.075      3.282      0.001       1.413       5.642
NOX           -4.8921      4.106     -1.192      0.234     -12.969       3.185
RM             5.5542      0.403     13.784      0.000       4.762       6.347
AGE           -0.0062      0.018     -0.351      0.726      -0.041       0.028
DIS           -1.3429      0.267     -5.028      0.000      -1.868      -0.818
RAD            0.1251      0.087      1.431      0.153      -0.047       0.297
TAX           -0.0068      0.005     -1.306      0.192      -0.017       0.003
PTRATIO       -0.1640      0.146     -1.124      0.262      -0.451       0.123
B              0.0178      0.004      5.035      0.000       0.011       0.025
LSTAT         -0.4532      0.065     -6.935      0.000      -0.582      -0.325
==============================================================================
Omnibus:       136.429   Durbin-Watson:          2.020
Prob(Omnibus):   0.000   Jarque-Bera (JB):     826.705
Skew:            1.559   Prob(JB):           3.04e-180
Kurtosis:        9.986   Cond. No.               8220
==============================================================================
```
```
model = sm.OLS(y, X.drop("LSTAT", axis=1))
results = model.fit()
results.summary()
```
```
                            OLS Regression Results
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.954
Model:                            OLS   Adj. R-squared:                  0.953
Method:                 Least Squares   F-statistic:                     846.6
Date:                Mon, 28 Aug 2017   Prob (F-statistic):          2.38e-320
Time:                        11:57:18   Log-Likelihood:                -1556.1
No. Observations:                 506   AIC:                             3136.
Df Residuals:                     494   BIC:                             3187.
Df Model:                          12
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
CRIM          -0.1439      0.036     -3.990      0.000      -0.215      -0.073
ZN             0.0413      0.015      2.696      0.007       0.011       0.071
INDUS         -0.0370      0.068     -0.540      0.589      -0.172       0.098
CHAS           3.2525      0.961      3.384      0.001       1.364       5.141
NOX          -10.8653      3.422     -3.175      0.002     -17.590      -4.141
RM             7.1436      0.289     24.734      0.000       6.576       7.711
AGE           -0.0449      0.014     -3.235      0.001      -0.072      -0.018
DIS           -1.2292      0.206     -5.980      0.000      -1.633      -0.825
RAD            0.2008      0.071      2.829      0.005       0.061       0.340
TAX           -0.0100      0.004     -2.391      0.017      -0.018      -0.002
PTRATIO       -0.6575      0.112     -5.881      0.000      -0.877      -0.438
B              0.0165      0.003      5.779      0.000       0.011       0.022
==============================================================================
Omnibus:       277.013   Durbin-Watson:          0.927
Prob(Omnibus):   0.000   Jarque-Bera (JB):     3084.31
Skew:            2.148   Prob(JB):                0.00
Kurtosis:       14.307   Cond. No.                8130
==============================================================================
```

### TPOT

TPOT is an automated machine learning module: it searches over pipelines and hyperparameters.

```
from tpot import TPOTRegressor
tpot = TPOTRegressor(generations=2, population_size=50, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')
```
```
c:\Python36_x64\lib\site-packages\sklearn\cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
```
```
Generation 1 - Current best internal CV score: 12.929444055404636
```

```
Generation 2 - Current best internal CV score: 12.929444055404636
```

```
Best pipeline: GradientBoostingRegressor(input_matrix, GradientBoostingRegressor__alpha=DEFAULT, GradientBoostingRegressor__learning_rate=0.1, GradientBoostingRegressor__loss=huber, GradientBoostingRegressor__max_depth=DEFAULT, GradientBoostingRegressor__max_features=0.65, GradientBoostingRegressor__min_samples_leaf=1, GradientBoostingRegressor__min_samples_split=8, GradientBoostingRegressor__n_estimators=100, GradientBoostingRegressor__subsample=0.6)
9.34632700796
```

The module optimizes hyperparameters, sometimes a little too aggressively, judging by the weaker performance on the test set.

```
r2_score(y_train, tpot.predict(X_train))
```

```
0.97107575656926248
```

```
r2_score(y_test, tpot.predict(X_test))
```

```
0.88100225324209969
```

### Feature importance for a single observation

We take the first random forest again and use the treeinterpreter module, which decomposes each prediction into a bias term (the training-set mean) plus one contribution per feature.

```
clr = RandomForestRegressor()
clr.fit(X, y)
```

```
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
```
```
from treeinterpreter import treeinterpreter as ti
prediction, bias, contributions = ti.predict(clr, X_test)
```
```
for i in range(min(2, X_train.shape[0])):
    print("Instance", i)
    print("Bias (trainset mean)", bias[i])
    print("Feature contributions:")
    for c, feature in sorted(zip(contributions[i], df.columns),
                             key=lambda x: -abs(x[0])):
        print(feature, round(c, 2))
    print("-" * 20)
```
```
Instance 0
Bias (trainset mean) 22.8580039526
Feature contributions:
LSTAT -4.92
CRIM 2.39
RM -2.05
AGE 1.61
NOX 0.57
TAX 0.53
PTRATIO 0.34
INDUS 0.19
B -0.18
CHAS 0.05
DIS 0.04
ZN 0.0
--------------------
Instance 1
Bias (trainset mean) 22.8580039526
Feature contributions:
RM -5.76
LSTAT 3.07
PTRATIO 0.75
CRIM -0.67
TAX -0.29
DIS -0.26
B 0.1
AGE 0.06
NOX 0.05
INDUS 0.04
ZN 0.0
CHAS 0.0
--------------------
```
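treeinterpreter decomposes every prediction as prediction = bias + sum(contributions). The idea can be checked on a single decision tree without the package (a self-contained sketch of the decomposition principle, not treeinterpreter's actual code): walk the decision path and credit each change of the node mean to the feature used at that split.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_toy = rng.rand(200, 3)
y_toy = 10 * X_toy[:, 0] + X_toy[:, 1]

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_toy, y_toy)
t = tree.tree_

def decompose(x):
    """Walk the decision path and credit each change in node mean
    to the feature used at that split."""
    node = 0
    bias = t.value[0, 0, 0]              # mean of y at the root = training mean
    contrib = np.zeros(x.shape[0])
    while t.children_left[node] != -1:   # -1 marks a leaf
        f = t.feature[node]
        nxt = (t.children_left[node] if x[f] <= t.threshold[node]
               else t.children_right[node])
        contrib[f] += t.value[nxt, 0, 0] - t.value[node, 0, 0]
        node = nxt
    return bias, contrib

b, contrib = decompose(X_toy[0])
pred = tree.predict(X_toy[:1])[0]
# the decomposition telescopes to the leaf value: bias + sum(contrib) == pred
```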

## Problem 2: time series

We take a series from Google Trends; in our case, the query tennis live. We compare a linear approach with a non-linear one.

### Linear approach

```
import pandas
# the Google Trends export is loaded into a DataFrame df (loading code not shown in the export)
```

```
df.columns = ["Semaine", "compte"]
```

```
df["SemaineDt"] = pandas.to_datetime(df.Semaine)
```

```
df = df.set_index("SemaineDt")
```

```
df["compte"] = df["compte"].astype(float)
```

```
df.head()
```
```
               Semaine  compte
SemaineDt
2012-07-01  2012-07-01    70.0
2012-07-08  2012-07-08    49.0
2012-07-15  2012-07-15    18.0
2012-07-22  2012-07-22    22.0
2012-07-29  2012-07-29    88.0
```
```
%matplotlib inline
```

```
df.plot()
```

```
<matplotlib.axes._subplots.AxesSubplot at 0x1dbe7f0a0b8>
```
```
from statsmodels.tsa.arima_model import ARIMA
arma_mod = ARIMA(df["compte"].values, order=(6, 1, 1))
res = arma_mod.fit()
res.params
```

```
array([ 0.00418578,  0.59035778, -0.32540702,  0.23286813, -0.03300842,
        0.06434314, -0.0720402 , -0.99999983])
```
```
res.summary()
```

```
                             ARIMA Model Results
==============================================================================
Dep. Variable:                    D.y   No. Observations:                  259
Model:                 ARIMA(6, 1, 1)   Log Likelihood               -1055.581
Method:                       css-mle   S.D. of innovations             14.116
Date:                Mon, 28 Aug 2017   AIC                           2129.161
Time:                        11:58:06   BIC                           2161.173
Sample:                             1   HQIC                          2142.032
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0042      0.021      0.196      0.845      -0.038       0.046
ar.L1.D.y      0.5904      0.063      9.431      0.000       0.468       0.713
ar.L2.D.y     -0.3254      0.072     -4.507      0.000      -0.467      -0.184
ar.L3.D.y      0.2329      0.075      3.097      0.002       0.085       0.380
ar.L4.D.y     -0.0330      0.076     -0.433      0.665      -0.182       0.116
ar.L5.D.y      0.0643      0.076      0.842      0.400      -0.085       0.214
ar.L6.D.y     -0.0720      0.066     -1.096      0.274      -0.201       0.057
ma.L1.D.y     -1.0000      0.010    -96.075      0.000      -1.020      -0.980
                                    Roots
==============================================================================
            Real          Imaginary           Modulus         Frequency
------------------------------------------------------------------------------
AR.1     -1.2011           -1.2144j            1.7080           -0.3741
AR.2     -1.2011           +1.2144j            1.7080            0.3741
AR.3      0.1840           -1.4018j            1.4138           -0.2292
AR.4      0.1840           +1.4018j            1.4138            0.2292
AR.5      1.4636           -0.4882j            1.5429           -0.0512
AR.6      1.4636           +0.4882j            1.5429            0.0512
MA.1      1.0000           +0.0000j            1.0000            0.0000
==============================================================================
```

### Non-linear method

We build the matrix of lagged series. This approach lets us leave the linear setting and add other variables.

```
from statsmodels.tsa.tsatools import lagmat
lag = 8
X = lagmat(df["compte"], lag)
lagged = df.copy()
for c in range(1, lag + 1):
    lagged["lag%d" % c] = X[:, c-1]
```
```
               Semaine  compte  lag1  lag2  lag3  lag4  lag5  lag6  lag7  lag8
SemaineDt
2012-07-01  2012-07-01    70.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
2012-07-08  2012-07-08    49.0  70.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
2012-07-15  2012-07-15    18.0  49.0  70.0   0.0   0.0   0.0   0.0   0.0   0.0
2012-07-22  2012-07-22    22.0  18.0  49.0  70.0   0.0   0.0   0.0   0.0   0.0
2012-07-29  2012-07-29    88.0  22.0  18.0  49.0  70.0   0.0   0.0   0.0   0.0
2017-05-21  2017-05-21    23.0  40.0  35.0  21.0  29.0  27.0  14.0  23.0  40.0
2017-05-28  2017-05-28    44.0  23.0  40.0  35.0  21.0  29.0  27.0  14.0  23.0
2017-06-04  2017-06-04    55.0  44.0  23.0  40.0  35.0  21.0  29.0  27.0  14.0
2017-06-11  2017-06-11    28.0  55.0  44.0  23.0  40.0  35.0  21.0  29.0  27.0
2017-06-18  2017-06-18    28.0  28.0  55.0  44.0  23.0  40.0  35.0  21.0  29.0
```
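The same lag matrix can be rebuilt with a plain `pandas.Series.shift` (a sketch of what `lagmat` computes here, with values before the start of the series filled with 0):

```python
import pandas

# first five values of the series, as in the table above
serie = pandas.Series([70.0, 49.0, 18.0, 22.0, 88.0])
lag_demo = 3
lagged_demo = pandas.DataFrame({"compte": serie})
for c in range(1, lag_demo + 1):
    # lag c: the value observed c weeks earlier, 0 before the series starts
    lagged_demo["lag%d" % c] = serie.shift(c, fill_value=0.0)
# row 1 of lag1 is 70.0, the value of row 0, as in the table above
```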
```
xc = ["lag%d" % i for i in range(1, lag+1)]
split = 0.66
isplit = int(len(lagged) * split)
xt = lagged[10:][xc]
yt = lagged[10:]["compte"]
# chronological split: no shuffling for a time series
X_train, y_train, X_test, y_test = xt[:isplit], yt[:isplit], xt[isplit:], yt[isplit:]
```
```
from sklearn.ensemble import RandomForestRegressor
clr = RandomForestRegressor()
clr.fit(X_train, y_train)
```

```
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
```
```
from sklearn.metrics import r2_score
r2 = r2_score(y_test.values, clr.predict(X_test))
r2
```

```
0.39863343986737942
```
```
plt.scatter(y_test.values, clr.predict(X_test))
```

```
<matplotlib.collections.PathCollection at 0x1dbe9ce11d0>
```

## Problem 3: text

We want to compare LDA with word2vec and k-means, using the data in ensae_teaching_cs/src/ensae_teaching_cs/data/data_web/.

```
from ensae_teaching_cs.data import twitter_zip
# the tweets are loaded into a DataFrame df (loading code not shown in the export)
```
```
                                                                   0                                                  1
index                                             776066992054861825                                 776067660979245056
nb_user_mentions                                                   0                                                  0
nb_extended_entities                                               0                                                  0
nb_hashtags                                                        1                                                  1
geo                                                              NaN                                                NaN
text_hashtags                                    , SiJétaisPrésident                                , SiJétaisPrésident
annee                                                           2016                                               2016
delimit_mention                                                  NaN                                                NaN
lang                                                              fr                                                 fr
id_str                                                   7.76067e+17                                        7.76068e+17
text_mention                                                     NaN                                                NaN
retweet_count                                                      4                                                  5
favorite_count                                                     3                                                  8
type_extended_entities                                            []                                                 []
text                    #SiJétaisPrésident se serait la fin du monde.....  #SiJétaisPrésident je donnerai plus de vacance...
nb_user_photos                                                     0                                                  0
nb_urls                                                            0                                                  0
nb_symbols                                                         0                                                  0
created_at                            Wed Sep 14 14:36:04 +0000 2016                     Wed Sep 14 14:38:43 +0000 2016
delimit_hash                                                 , 0, 18                                            , 0, 18
```

### From words to coordinates - tf-idf

```
keep = df.text.dropna().index
```

```
dfnonan = df.iloc[keep, :]
dfnonan.shape
```

```
(5087, 20)
```
```
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=1000)
tfidf = tfidf_vectorizer.fit_transform(dfnonan["text"])
```

```
tfidf[:2, :]
```

```
<2x1000 sparse matrix of type '<class 'numpy.float64'>'
        with 21 stored elements in Compressed Sparse Row format>
```
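The non-zero tf-idf entries of a row can be mapped back to words through the vectorizer's vocabulary (a toy sketch on made-up documents, not the tweet data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat on the cat"]
vec = TfidfVectorizer()
m = vec.fit_transform(docs)
# invert the word -> column mapping
inv = {idx: word for word, idx in vec.vocabulary_.items()}
row = m[0]                                # first document, one CSR row
weights = {inv[j]: row[0, j] for j in row.indices}
# weights maps each word of the first document to its tf-idf value
```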

### LDA

```
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=10, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
```
```
lda.fit(tfidf)
```

```
LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=50.0,
             max_doc_update_iter=100, max_iter=5, mean_change_tol=0.001,
             n_components=10, n_jobs=1, n_topics=None, perp_tol=0.1,
             random_state=0, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)
```
```
tf_feature_names = tfidf_vectorizer.get_feature_names()
tf_feature_names[100:103]
```

```
['avoir', 'bac', 'bah']
```

```
lda.components_.shape
```

```
(10, 1000)
```

We obtain ten vectors, one per topic. Each dimension measures how strongly the corresponding word is associated with that topic.

```
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
        print()
```
```
print_top_words(lda, tf_feature_names, 10)
```

```
Topic #0:
gratuit mcdo supprimerai école soir kebab macdo kfc domicile cc
Topic #1:
macron co https de la est le il et hollande
Topic #2:
sijetaispresident je les de la et le des en pour
Topic #3:
notaires eu organiserais mets carte nouveaux journées installation cache créer
Topic #4:
sijetaispresident interdirais les je ballerines la serait serais bah de
Topic #5:
ministre de sijetaispresident la je premier mort et nommerais président
Topic #6:
cours le supprimerais jour sijetaispresident lundi samedi semaine je vendredi
Topic #7:
port interdirait démissionnerais promesses heure rendrai ballerine mes changement christineboutin
Topic #8:
seraient sijetaispresident gratuits aux les nos putain éducation nationale bonne
Topic #9:
bordel seront légaliserai putes gratuites pizza mot virerais vitesse dutreil
```
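Besides the topic-word matrix `components_`, the fitted model can also project documents onto topics with `transform`, which returns one topic distribution per document (a self-contained toy sketch, not the tweet data):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# toy document-term count matrix: 4 documents, 3 words
counts = np.array([[3, 0, 1],
                   [2, 1, 0],
                   [0, 4, 2],
                   [0, 3, 3]])
lda_demo = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
doc_topics = lda_demo.transform(counts)
# each row of doc_topics is a probability distribution over the 2 topics
```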

### Clustering

```
from sklearn.cluster import KMeans
km = KMeans(n_clusters=10)
km.fit(tfidf)
```

```
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=10, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
```

```
km.cluster_centers_.shape
```

```
(10, 1000)
```
```
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.cluster_centers_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
        print()
```
```
print_top_words(km, tf_feature_names, 10)
```

```
Topic #0:
je sijetaispresident un pas en que serais ferais une ne
Topic #1:
les sijetaispresident je et tous de pour seraient ballerines en
Topic #2:
il macron de aurait qu est co https un pas
Topic #3:
ministre premier nommerais de sijetaispresident je la serait mickey nommerai
Topic #4:
macron est de hollande la le et pas qui sarko
Topic #5:
des sijetaispresident je de et les en la pour le
Topic #6:
de sijetaispresident la je les et plus en au mois
Topic #7:
la serait sijetaispresident merde dans je et france vous de
Topic #8:
le sijetaispresident je monde et pour tout de serait les
Topic #9:
https co macron de la le les via et sijetaispresident
```

Ah, the stop words…
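The clusters are dominated by stop words. One fix is to pass a stop-word list to `TfidfVectorizer`; a sketch with a short, hand-written French list (illustrative only; a real list, e.g. from a dedicated package, would be much longer):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# tiny illustrative French stop-word list (not exhaustive)
french_stop = ["je", "le", "la", "les", "de", "des", "et", "en", "un", "une",
               "pour", "que", "qui", "pas", "ne", "est", "il", "https", "co"]
vec = TfidfVectorizer(max_df=0.95, stop_words=french_stop)
tfidf_demo = vec.fit_transform(["je serais le président",
                                "les vacances seraient gratuites"])
# the vocabulary no longer contains the stop words
```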