# Session 26/6/2017 - machine learning#

An exploration of the three machine learning problems presented in the article *Machine Learning - session 6*.

```
from jyquickhelper import add_notebook_menu
# build the clickable table of contents of the notebook
add_notebook_menu()
```

## Problem 1: comparing a random forest with a linear model#

This is a regression problem: we compare a random forest with a linear model.

### Data#

```
import pandas
# df is assumed to be loaded beforehand with the Boston housing data
# (14 unnamed columns); its first rows are shown below
```
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0 18.7 396.90 5.33 36.2
```
cols = "CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV".split()
df.columns = cols
```
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0 18.7 396.90 5.33 36.2
```
X = df.drop("MEDV", axis=1)
y = df["MEDV"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
```

### Random Forest#

```
from sklearn.ensemble import RandomForestRegressor
clr = RandomForestRegressor()
# note: the forest is fit on the full dataset, so the test score
# computed below is optimistic (the test rows were seen during training)
clr.fit(X, y)
```
```
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)
```
```
importances = clr.feature_importances_
importances
```
```
array([0.04377988, 0.00098713, 0.00669368, 0.00092023, 0.02074126,
       0.4076033 , 0.01283279, 0.06458251, 0.0036521 , 0.01366027,
       0.01576497, 0.01209152, 0.39669034])
```
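To make this array easier to read, the importances can be paired with the column names; a small sketch (the plot below shows the same ranking graphically):

```
# rank the features by decreasing importance
sorted(zip(X.columns, importances), key=lambda t: -t[1])[:3]
```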

This follows the scikit-learn example *Feature importances with forests of trees*.

```
%matplotlib inline
```
```
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(12, 4))
indices = np.argsort(importances)[::-1]
# spread of each importance across the trees of the forest
std = np.std([tree.feature_importances_ for tree in clr.estimators_],
             axis=0)
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
        color="r", yerr=std[indices], align="center")
xlabels = list(df.columns[:-1])
xlabels = [xlabels[i] for i in indices]
plt.xticks(range(X.shape[1]), xlabels)
plt.xlim([-1, X.shape[1]])
plt.show()
```
```
from sklearn.metrics import r2_score
r2_score(y_train, clr.predict(X_train))
```
```
0.979236135448983
```
```
r2_score(y_test, clr.predict(X_test))
```
```
0.9843720628811461
```
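The test R² is higher than the train R² only because the forest was fit on the full dataset, so the test rows were seen during training. A minimal sketch of the leakage-free comparison, refitting on the training set only:

```
# refit on the training rows only so that X_test is really unseen data
clr_train = RandomForestRegressor()
clr_train.fit(X_train, y_train)
print(r2_score(y_train, clr_train.predict(X_train)))  # in-sample score
print(r2_score(y_test, clr_train.predict(X_test)))    # honest test score
```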

### Linear model#

```
import statsmodels.api as sm
```
```
# note: sm.OLS does not add an intercept, so this model has none
model = sm.OLS(y_train, X_train)
```
```
results = model.fit()
```
```
results.params
```
```
CRIM      -0.109867
ZN         0.045855
INDUS     -0.054801
CHAS       3.758792
NOX       -5.285538
RM         6.267853
AGE       -0.022554
DIS       -1.169496
TAX       -0.010410
PTRATIO   -0.447963
B          0.016474
LSTAT     -0.259210
dtype: float64
```
```
results.summary()
```
```
                            OLS Regression Results
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.958
Model:                            OLS   Adj. R-squared:                  0.957
Method:                 Least Squares   F-statistic:                     575.3
Date:                Fri, 07 Jun 2019   Prob (F-statistic):          7.10e-216
Time:                        16:38:39   Log-Likelihood:                -1025.5
No. Observations:                 339   AIC:                             2077.
Df Residuals:                     326   BIC:                             2127.
Df Model:                          13
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
CRIM         -0.1099      0.039     -2.845      0.005      -0.186      -0.034
ZN            0.0459      0.018      2.543      0.011       0.010       0.081
INDUS        -0.0548      0.079     -0.697      0.486      -0.209       0.100
CHAS          3.7588      1.116      3.370      0.001       1.564       5.953
NOX          -5.2855      4.398     -1.202      0.230     -13.938       3.367
RM            6.2679      0.405     15.461      0.000       5.470       7.065
AGE          -0.0226      0.017     -1.325      0.186      -0.056       0.011
DIS          -1.1695      0.249     -4.692      0.000      -1.660      -0.679
RAD           0.2218      0.083      2.672      0.008       0.058       0.385
TAX          -0.0104      0.005     -2.155      0.032      -0.020      -0.001
PTRATIO      -0.4480      0.137     -3.261      0.001      -0.718      -0.178
B             0.0165      0.003      4.767      0.000       0.010       0.023
LSTAT        -0.2592      0.065     -3.959      0.000      -0.388      -0.130
==============================================================================
Omnibus:                      170.276   Durbin-Watson:                   1.981
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               1726.4
Skew:                           1.839   Prob(JB):                         0.00
Kurtosis:                      13.426   Cond. No.                     8.91e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.91e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
```
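Note that sm.OLS adds no intercept by default, so the regression above goes through the origin. A small sketch with an explicit constant, using sm.add_constant:

```
# add_constant prepends a column of ones, which gives the model an intercept
results_c = sm.OLS(y_train, sm.add_constant(X_train)).fit()
results_c.params["const"]  # the fitted intercept
```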
```
# refit on the full dataset, without the LSTAT column
model = sm.OLS(y, X.drop("LSTAT", axis=1))
results = model.fit()
results.summary()
```
```
                            OLS Regression Results
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.954
Model:                            OLS   Adj. R-squared:                  0.953
Method:                 Least Squares   F-statistic:                     846.6
Date:                Fri, 07 Jun 2019   Prob (F-statistic):          2.38e-320
Time:                        16:38:39   Log-Likelihood:                -1556.1
No. Observations:                 506   AIC:                             3136.
Df Residuals:                     494   BIC:                             3187.
Df Model:                          12
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
CRIM         -0.1439      0.036     -3.990      0.000      -0.215      -0.073
ZN            0.0413      0.015      2.696      0.007       0.011       0.071
INDUS        -0.0370      0.068     -0.540      0.589      -0.172       0.098
CHAS          3.2525      0.961      3.384      0.001       1.364       5.141
NOX         -10.8653      3.422     -3.175      0.002     -17.590      -4.141
RM            7.1436      0.289     24.734      0.000       6.576       7.711
AGE          -0.0449      0.014     -3.235      0.001      -0.072      -0.018
DIS          -1.2292      0.206     -5.980      0.000      -1.633      -0.825
RAD           0.2008      0.071      2.829      0.005       0.061       0.340
TAX          -0.0100      0.004     -2.391      0.017      -0.018      -0.002
PTRATIO      -0.6575      0.112     -5.881      0.000      -0.877      -0.438
B             0.0165      0.003      5.779      0.000       0.011       0.022
==============================================================================
Omnibus:                      277.013   Durbin-Watson:                   0.927
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              3084.31
Skew:                           2.148   Prob(JB):                         0.00
Kurtosis:                      14.307   Cond. No.                     8.13e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.13e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
```
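Both summaries warn about a large condition number, a symptom of near-collinear columns. A quick sketch to check it directly with numpy:

```
import numpy as np

# 2-norm condition number of the design matrix: a value in the thousands
# confirms the multicollinearity warning of the summary
print(np.linalg.cond(X.drop("LSTAT", axis=1).values))
```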

### TPOT#

TPOT is an automated machine learning module: it searches for a good pipeline and its hyperparameters with genetic programming.

```
try:
    from tpot import TPOTRegressor
except ImportError:
    # workaround for scikit-learn 0.22: Imputer was removed from
    # sklearn.preprocessing, so alias SimpleImputer before importing tpot
    import sklearn.preprocessing
    from sklearn.impute import SimpleImputer
    sklearn.preprocessing.Imputer = SimpleImputer
    from tpot import TPOTRegressor

tpot = TPOTRegressor(generations=2, population_size=50, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')
```
```
Generation 1 - Current best internal CV score: -10.19955600172555
Generation 2 - Current best internal CV score: -10.19955600172555

Best pipeline: XGBRegressor(StandardScaler(input_matrix), learning_rate=0.1, max_depth=9, min_child_weight=2, n_estimators=100, nthread=1, subsample=0.8)
-9.01100125963658
```

The module optimizes the hyperparameters, sometimes a little too much judging by the weaker performance obtained on the test set.

```
r2_score(y_train, tpot.predict(X_train))
```
```
0.9994364901579464
```
```
r2_score(y_test, tpot.predict(X_test))
```
```
0.8978805496355459
```

### Feature importance for a single observation#

We take the first random forest again and use the treeinterpreter module.

```
clr = RandomForestRegressor()
clr.fit(X, y)
```
```
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)
```
```
from treeinterpreter import treeinterpreter as ti
prediction, bias, contributions = ti.predict(clr, X_test)
```
```
for i in range(min(2, X_train.shape[0])):
    print("Instance", i)
    print("Bias (trainset mean)", bias[i])
    print("Feature contributions:")
    for c, feature in sorted(zip(contributions[i], df.columns),
                             key=lambda x: -abs(x[0])):
        print(feature, round(c, 2))
    print("-" * 20)
```
```
Instance 0
Bias (trainset mean) 22.527195652173905
Feature contributions:
RM 7.53
LSTAT 1.58
TAX -0.64
PTRATIO 0.6
DIS -0.4
INDUS 0.19
AGE 0.13
ZN 0.12
CRIM 0.07
NOX 0.06
B -0.01
CHAS -0.01
--------------------
Instance 1
Bias (trainset mean) 22.527195652173905
Feature contributions:
LSTAT -5.04
RM -1.64
B -0.9
NOX -0.89
CRIM -0.81
DIS 0.64
AGE 0.17
PTRATIO -0.07
TAX -0.04
CHAS -0.02
INDUS -0.0
ZN 0.0
--------------------
```
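treeinterpreter decomposes every prediction exactly as prediction = bias + sum of the contributions; a quick sketch to verify this identity on the test set:

```
import numpy as np

# each prediction is the train-set mean (bias) plus the sum of the
# per-feature contributions
assert np.allclose(np.squeeze(prediction),
                   np.squeeze(bias + contributions.sum(axis=1)))
```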

## Problem 2: time series#

We take a series from Google Trends, in our case the query *tennis live*, and compare a linear approach with a non-linear one.

### Linear approach#

```
import pandas
# the Google Trends series for the query "tennis live" is assumed
# to be loaded into df beforehand (two columns: week, count)
```
```
df.columns = ["Semaine", "compte"]
```
```
df["SemaineDt"] = pandas.to_datetime(df.Semaine)
```
```
df = df.set_index("SemaineDt")
```
```
df["compte"] = df["compte"].astype(float)
```
```
df.head()
```
Semaine compte
SemaineDt
2012-07-01 2012-07-01 70.0
2012-07-08 2012-07-08 49.0
2012-07-15 2012-07-15 18.0
2012-07-22 2012-07-22 22.0
2012-07-29 2012-07-29 88.0
```
%matplotlib inline
```
```
df.plot()
```
```
<matplotlib.axes._subplots.AxesSubplot at 0x19c15163358>
```
```
# legacy ARIMA API (removed in recent statsmodels versions, which
# use statsmodels.tsa.arima.model.ARIMA instead)
from statsmodels.tsa.arima_model import ARIMA
arma_mod = ARIMA(df["compte"].values, order=(6, 1, 1))
res = arma_mod.fit()
res.params
```
```
array([ 0.00418581,  0.59035757, -0.32540695,  0.23286807, -0.03300838,
        0.06434307, -0.07204017, -0.99999983])
```
```
res.summary()
```
```
                             ARIMA Model Results
==============================================================================
Dep. Variable:                    D.y   No. Observations:                  259
Model:                 ARIMA(6, 1, 1)   Log Likelihood               -1055.581
Method:                       css-mle   S.D. of innovations             14.116
Date:                Fri, 07 Jun 2019   AIC                           2129.161
Time:                        16:45:54   BIC                           2161.173
Sample:                             1   HQIC                          2142.032
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0042      0.021      0.196      0.845      -0.038       0.046
ar.L1.D.y      0.5904      0.063      9.431      0.000       0.468       0.713
ar.L2.D.y     -0.3254      0.072     -4.507      0.000      -0.467      -0.184
ar.L3.D.y      0.2329      0.075      3.097      0.002       0.085       0.380
ar.L4.D.y     -0.0330      0.076     -0.433      0.665      -0.182       0.116
ar.L5.D.y      0.0643      0.076      0.842      0.400      -0.085       0.214
ar.L6.D.y     -0.0720      0.066     -1.096      0.274      -0.201       0.057
ma.L1.D.y     -1.0000      0.010    -96.075      0.000      -1.020      -0.980
                                    Roots
==============================================================================
                  Real          Imaginary           Modulus         Frequency
------------------------------------------------------------------------------
AR.1           -1.2011           -1.2144j            1.7080           -0.3741
AR.2           -1.2011           +1.2144j            1.7080            0.3741
AR.3            0.1840           -1.4018j            1.4138           -0.2292
AR.4            0.1840           +1.4018j            1.4138            0.2292
AR.5            1.4636           -0.4882j            1.5429           -0.0512
AR.6            1.4636           +0.4882j            1.5429            0.0512
MA.1            1.0000           +0.0000j            1.0000            0.0000
==============================================================================
```
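With this legacy API, out-of-sample forecasts come with standard errors and confidence intervals; a small sketch (assuming the old arima_model interface used above):

```
# point forecasts, standard errors and 95% confidence intervals
# for the next 8 weeks
forecast, stderr, conf_int = res.forecast(steps=8)
forecast
```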

### Non-linear method#

We build the matrix of the lagged series. This approach makes it possible to leave the linear framework and to add other variables (see the sketch after the table below).

```
from statsmodels.tsa.tsatools import lagmat
lag = 8
X = lagmat(df["compte"], lag)
lagged = df.copy()
for c in range(1, lag + 1):
    lagged["lag%d" % c] = X[:, c - 1]
```
Semaine compte lag1 lag2 lag3 lag4 lag5 lag6 lag7 lag8
SemaineDt
2012-07-01 2012-07-01 70.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2012-07-08 2012-07-08 49.0 70.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2012-07-15 2012-07-15 18.0 49.0 70.0 0.0 0.0 0.0 0.0 0.0 0.0
2012-07-22 2012-07-22 22.0 18.0 49.0 70.0 0.0 0.0 0.0 0.0 0.0
2012-07-29 2012-07-29 88.0 22.0 18.0 49.0 70.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ...
2017-05-21 2017-05-21 23.0 40.0 35.0 21.0 29.0 27.0 14.0 23.0 40.0
2017-05-28 2017-05-28 44.0 23.0 40.0 35.0 21.0 29.0 27.0 14.0 23.0
2017-06-04 2017-06-04 55.0 44.0 23.0 40.0 35.0 21.0 29.0 27.0 14.0
2017-06-11 2017-06-11 28.0 55.0 44.0 23.0 40.0 35.0 21.0 29.0 27.0
2017-06-18 2017-06-18 28.0 28.0 55.0 44.0 23.0 40.0 35.0 21.0 29.0
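As noted above, nothing prevents adding other variables next to the lags; a sketch with hypothetical calendar features derived from the index (not in the original notebook):

```
# calendar features added next to the lag columns
lagged["month"] = lagged.index.month
lagged["weekofyear"] = lagged.index.dayofyear // 7
```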
```
xc = ["lag%d" % i for i in range(1, lag + 1)]
split = 0.66
isplit = int(len(lagged) * split)
# drop the first rows whose lags are incomplete, then split
# chronologically: the test set is the most recent part of the series
xt = lagged[10:][xc]
yt = lagged[10:]["compte"]
X_train, y_train, X_test, y_test = xt[:isplit], yt[:isplit], xt[isplit:], yt[isplit:]
```
```
from sklearn.ensemble import RandomForestRegressor
clr = RandomForestRegressor()
clr.fit(X_train, y_train)
```
```
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)
```
```
from sklearn.metrics import r2_score
r2 = r2_score(y_test.values, clr.predict(X_test))
r2
```
```
0.4878325848907805
```
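An R² around 0.5 is hard to judge in isolation; a quick sketch of a persistence baseline, predicting each week by the previous week's value, gives a point of comparison:

```
# naive baseline: predict the count by its value one week earlier
r2_score(y_test.values, X_test["lag1"])
```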
```
plt.scatter(y_test.values, clr.predict(X_test));
```
```
<matplotlib.collections.PathCollection at 0x19c26b9a2e8>
```

## Text#

We want to compare an LDA with word2vec and k-means, using the data located in ensae_teaching_cs/src/ensae_teaching_cs/data/data_web/.

```
from ensae_teaching_cs.data import twitter_zip
# df is assumed to be built from twitter_zip (the tweets dataset);
# its first two rows are shown below, transposed
```
0 1
index 776066992054861825 776067660979245056
nb_user_mentions 0 0
nb_extended_entities 0 0
nb_hashtags 1 1
geo NaN NaN
text_hashtags , SiJétaisPrésident , SiJétaisPrésident
annee 2016 2016
delimit_mention NaN NaN
lang fr fr
id_str 7.76067e+17 7.76068e+17
text_mention NaN NaN
retweet_count 4 5
favorite_count 3 8
type_extended_entities [] []
text #SiJétaisPrésident se serait la fin du monde..... #SiJétaisPrésident je donnerai plus de vacance...
nb_user_photos 0 0
nb_urls 0 0
nb_symbols 0 0
created_at Wed Sep 14 14:36:04 +0000 2016 Wed Sep 14 14:38:43 +0000 2016
delimit_hash , 0, 18 , 0, 18

### From words to coordinates - tf-idf#

```
# keep only the tweets with a non-null text
keep = df.text.dropna().index
```
```
dfnonan = df.iloc[keep, :]
dfnonan.shape
```
```
(5087, 20)
```
```
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=1000)
tfidf = tfidf_vectorizer.fit_transform(dfnonan["text"])
```
```
tfidf[:2, :]
```
```
<2x1000 sparse matrix of type '<class 'numpy.float64'>'
    with 21 stored elements in Compressed Sparse Row format>
```
```
tfidf[:2, :].todense()
```
```
matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])
```

### LDA#

```
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=10, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
```
```
lda.fit(tfidf)
```
```
LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='online', learning_offset=50.0,
                          max_doc_update_iter=100, max_iter=5,
                          mean_change_tol=0.001, n_components=10, n_jobs=None,
                          perp_tol=0.1, random_state=0, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)
```
```
tf_feature_names = tfidf_vectorizer.get_feature_names()
tf_feature_names[100:103]
```
```
['avoir', 'bac', 'bah']
```
```
lda.components_.shape
```
```
(10, 1000)
```

We obtain ten vectors, one for each of the ten clusters. Each dimension measures how strongly the corresponding word is associated with the cluster.

```
def print_top_words(model, feature_names, n_top_words):
    # show the n_top_words words with the largest weight in each topic
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
        print()
```
```
print_top_words(lda, tf_feature_names, 10)
```
```
Topic #0:
gratuit mcdo supprimerai école soir kebab macdo kfc domicile cc
Topic #1:
macron co https de la est le il et hollande
Topic #2:
sijetaispresident je les de la et le des en pour
Topic #3:
notaires eu organiserais mets carte nouveaux journées installation cache créer
Topic #4:
sijetaispresident interdirais les je ballerines la serait serais bah de
Topic #5:
ministre de sijetaispresident la je premier mort et nommerais président
Topic #6:
cours le supprimerais jour sijetaispresident lundi samedi semaine je vendredi
Topic #7:
port interdirait démissionnerais promesses heure rendrai ballerine mes changement christineboutin
Topic #8:
seraient sijetaispresident gratuits aux les nos putain éducation nationale bonne
Topic #9:
bordel seront légaliserai putes gratuites pizza mot virerais vitesse dutreil
```
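The topic mixture of individual documents is obtained with lda.transform; a small sketch on the first two tweets:

```
# distribution over the 10 topics for the first two tweets;
# each row sums to 1
lda.transform(tfidf[:2])
```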

### Clustering#

```
from sklearn.cluster import KMeans
km = KMeans(n_clusters=10)
km.fit(tfidf)
```
```
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=10, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
```
```
km.cluster_centers_.shape
```
```
(10, 1000)
```
```
def print_top_words(model, feature_names, n_top_words):
    # same display as above, but on the centroids of the k-means clusters
    for topic_idx, topic in enumerate(model.cluster_centers_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
        print()
```
```
print_top_words(km, tf_feature_names, 10)
```
```
Topic #0:
ferais je sijetaispresident les en sorte que de et un
Topic #1:
le sijetaispresident je monde et pour de tout des les
Topic #2:
de sijetaispresident la en et des une un plus aurait
Topic #3:
https co macron de la le les via et sijetaispresident
Topic #4:
les sijetaispresident je de et tous des seraient pour ballerines
Topic #5:
serait sijetaispresident la le merde de ça et on dans
Topic #6:
je sijetaispresident la des me et ferai pas en au
Topic #7:
macron est il de pas la hollande le gauche que
Topic #8:
ministre premier nommerais sijetaispresident de je la mickey serait nommerai
Topic #9:
serais je sijetaispresident un pas ne président que la de
```

Ah, the stop words… The clusters are dominated by French function words.
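A sketch of one possible fix, passing a French stop-word list to the vectorizer; using nltk's list is an assumption, any French stop-word list would do:

```
from nltk.corpus import stopwords  # requires nltk.download("stopwords")
from sklearn.feature_extraction.text import TfidfVectorizer

# remove frequent French function words before computing tf-idf
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=1000,
                                   stop_words=stopwords.words("french"))
tfidf = tfidf_vectorizer.fit_transform(dfnonan["text"])
```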