2A.ml - Reducing a random forest - correction


The Lasso model selects variables; a random forest produces its prediction as the average of regression trees. What if we combined the two?

%matplotlib inline

Datasets

Since we always need data, we take the Boston dataset.

from sklearn.datasets import load_boston
data = load_boston()
X, y = data.data, data.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
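
Note that load_boston was deprecated and then removed in scikit-learn 1.2. A minimal fallback sketch, assuming the "boston" dataset is reachable on OpenML (any small regression dataset with numeric features would do for the rest of the notebook):

try:
    from sklearn.datasets import load_boston
    data = load_boston()
    X, y = data.data, data.target
except ImportError:
    # load_boston was removed in scikit-learn 1.2; fetch the data from OpenML instead.
    from sklearn.datasets import fetch_openml
    bunch = fetch_openml(name="boston", version=1, as_frame=False)
    X, y = bunch.data, bunch.target.astype(float)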

A random forest

from sklearn.ensemble import RandomForestRegressor as model_class
clr = model_class()
clr.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

The number of trees is…

len(clr.estimators_)
100
from sklearn.metrics import r2_score
r2_score(y_test, clr.predict(X_test))
0.8816716443059919

Random Forest = average of the predictions

Let's do it again, computing the average ourselves.

import numpy
dest = numpy.zeros((X_test.shape[0], len(clr.estimators_)))
estimators = numpy.array(clr.estimators_).ravel()
for i, est in enumerate(estimators):
    pred = est.predict(X_test)
    dest[:, i] = pred

average = numpy.mean(dest, axis=1)
r2_score(y_test, average)
0.8816716443059919

Unsurprisingly, it gives the same result.
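
To double-check, one can compare the manual average with the forest's own prediction:

# Expected to return True (up to floating-point rounding): the forest
# prediction is the mean of the individual tree predictions.
numpy.allclose(average, clr.predict(X_test))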

Weighting the trees with a linear regression

The random forest is a way to create new features, exactly 100 of them, which we then use to fit a linear regression.

from sklearn.linear_model import LinearRegression


def new_features(forest, X):
    dest = numpy.zeros((X.shape[0], len(forest.estimators_)))
    estimators = numpy.array(forest.estimators_).ravel()
    for i, est in enumerate(estimators):
        pred = est.predict(X)
        dest[:, i] = pred
    return dest


X_train_2 = new_features(clr, X_train)
lr = LinearRegression()
lr.fit(X_train_2, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
X_test_2 = new_features(clr, X_test)
r2_score(y_test, lr.predict(X_test_2))
0.8759826041035556

A bit worse here, sometimes a bit better: the risk of overfitting is somewhat higher with this many features, since the training set contains only 379 observations (look at X_train.shape to check).
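
One quick way to see this is to compare train and test scores for the two representations; a small check:

# The gap between train and test score is a rough indicator of overfitting.
print("shapes:", X_train.shape, X_train_2.shape)
print("r2 train:", r2_score(y_train, lr.predict(X_train_2)))
print("r2 test :", r2_score(y_test, lr.predict(X_test_2)))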

lr.coef_
array([-0.01122904,  0.0044548 ,  0.04920393,  0.02824337, -0.04625929,
       -0.03198554,  0.04100134, -0.0110173 ,  0.02175922,  0.02270664,
       -0.07129658, -0.04025288,  0.01199426,  0.03317564,  0.0011403 ,
        0.03344571,  0.01945555, -0.03182884,  0.07348532,  0.00333813,
        0.02294959, -0.00464431,  0.0225264 ,  0.0119253 ,  0.09014915,
       -0.0376745 ,  0.04447262,  0.02850649,  0.00736921, -0.01369467,
        0.02986174,  0.00575564,  0.05044376,  0.02081299,  0.01798322,
       -0.00192326,  0.09159215,  0.08490833,  0.00953901,  0.05039408,
       -0.00231599, -0.03193621,  0.04187978, -0.01702496,  0.02467238,
        0.0180003 ,  0.08144923, -0.00251786, -0.01782545, -0.01027325,
        0.01990357, -0.03748182, -0.04099434,  0.00057383, -0.03013624,
        0.11380534,  0.06436785,  0.04228636,  0.02423566, -0.0560923 ,
       -0.03855099,  0.041692  ,  0.03862377,  0.08781796, -0.05300599,
       -0.00840021,  0.02812588, -0.01234117,  0.03544364,  0.0168987 ,
       -0.02765353,  0.02515268,  0.04157685, -0.01604241,  0.0098268 ,
       -0.06842855,  0.05983471, -0.01461408, -0.00256612,  0.03797782,
       -0.01348758,  0.0063176 , -0.0115086 ,  0.05499093,  0.02628663,
       -0.02784253, -0.03583171, -0.0050989 , -0.02116866,  0.00982458,
       -0.02887861,  0.01661494, -0.02185889, -0.01376049,  0.00091703,
        0.06485045, -0.00799936,  0.01988687, -0.00827135,  0.03381613])
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1, figsize=(12, 4))
ax.bar(numpy.arange(0, len(lr.coef_)), lr.coef_)
ax.set_title("Coefficients pour chaque arbre calculés avec une régression linéaire");

The score obtained with a linear regression on the original variables is clearly lower.

lr_raw = LinearRegression()
lr_raw.fit(X_train, y_train)
r2_score(y_test, lr_raw.predict(X_test))
0.6978615081037233

Tree selection

The idea is to use a variable-selection algorithm such as Lasso to reduce the random forest without losing performance. The code is almost the same.

from sklearn.linear_model import Lasso

lrs = Lasso(max_iter=10000)
lrs.fit(X_train_2, y_train)
lrs.coef_
array([0.        , 0.        , 0.02245791, 0.01210641, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.03045497,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.01150988, 0.00435434, 0.        , 0.03146978, 0.        ,
       0.        , 0.        , 0.03183664, 0.        , 0.12385294,
       0.        , 0.03312474, 0.        , 0.        , 0.        ,
       0.02413847, 0.        , 0.        , 0.        , 0.00498728,
       0.        , 0.06410314, 0.04747783, 0.        , 0.00633977,
       0.        , 0.        , 0.03690248, 0.        , 0.        ,
       0.        , 0.0564866 , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.05602663, 0.08840581, 0.01980571, 0.        , 0.        ,
       0.        , 0.02873135, 0.03756932, 0.05435875, 0.        ,
       0.        , 0.01786024, 0.        , 0.0597982 , 0.02132273,
       0.        , 0.        , 0.00315636, 0.        , 0.        ,
       0.        , 0.00707516, 0.        , 0.        , 0.01152698,
       0.        , 0.        , 0.        , 0.        , 0.01209226,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.00326681,
       0.0275338 , 0.        , 0.0314566 , 0.        , 0.00132836])

Quite a few zeros, hence quite a few unused trees.
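
Counting the non-zero coefficients gives the number of trees the Lasso actually keeps:

# Number of trees with a non-zero weight.
int((lrs.coef_ != 0).sum())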

r2_score(y_test, lrs.predict(X_test_2))
0.8843604495090776

Not much loss… That makes it tempting to try several values of alpha.

from tqdm import tqdm

# alpha = 0, 0.01, ..., 0.99, then 1, 1.1, ..., 10.9 (the first value, 0, triggers warnings)
alphas = [0.01 * i for i in range(100)] + [1 + 0.1 * i for i in range(100)]
obs = []
for alpha in tqdm(alphas):
    lrs = Lasso(max_iter=20000, alpha=alpha)
    lrs.fit(X_train_2, y_train)
    obs.append(dict(
        alpha=alpha,
        # "null" actually stores the number of *non-zero* coefficients
        null=len(lrs.coef_[lrs.coef_ != 0]),
        r2=r2_score(y_test, lrs.predict(X_test_2))
    ))
  0%|          | 0/200 [00:00<?, ?it/s]
UserWarning: With alpha=0, this algorithm does not converge well. You are advised to use the LinearRegression estimator
UserWarning: Coordinate descent with no regularization may lead to unexpected results and is discouraged.
ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 125.80123242096884, tolerance: 3.3705663218997364
ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 14.242863637667796, tolerance: 3.3705663218997364
100%|██████████| 200/200 [00:11<00:00, 45.65it/s]
from pandas import DataFrame
df = DataFrame(obs)
df.tail()
     alpha  null        r2
195   10.5    21  0.877512
196   10.6    21  0.877345
197   10.7    21  0.877175
198   10.8    21  0.877003
199   10.9    21  0.876827
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
df[["alpha", "null"]].set_index("alpha").plot(ax=ax[0], logx=True)
ax[0].set_title("Nombre de coefficients non nulls")
df[["alpha", "r2"]].set_index("alpha").plot(ax=ax[1], logx=True)
ax[1].set_title("r2");
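
The sweep above picks alpha by looking at the test score, which is optimistic. A sketch of the same search with cross-validation on the training set, using LassoCV (5 folds chosen arbitrarily):

from sklearn.linear_model import LassoCV

# Select alpha by cross-validation on the training features only.
lcv = LassoCV(alphas=alphas[1:], cv=5, max_iter=20000)  # skip alpha=0
lcv.fit(X_train_2, y_train)
lcv.alpha_, r2_score(y_test, lcv.predict(X_test_2))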

In this case, removing trees improves the performance: as mentioned above, it reduces overfitting. The number of trees can be reduced by two thirds with this model.
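
To actually build the reduced forest, keep only the trees with a non-zero weight and predict with the weighted sum learned by the Lasso; a minimal sketch, where predict_pruned is just a name chosen for the example:

def predict_pruned(forest, lasso, X):
    # Weighted sum of the selected trees plus the Lasso intercept;
    # trees with a zero coefficient are skipped entirely.
    kept = [(w, est) for w, est in zip(lasso.coef_, forest.estimators_) if w != 0]
    pred = numpy.full(X.shape[0], lasso.intercept_)
    for w, est in kept:
        pred += w * est.predict(X)
    return pred, len(kept)


pred, n_kept = predict_pruned(clr, lrs, X_test)
n_kept, r2_score(y_test, pred)

Up to rounding, this matches lrs.predict(X_test_2) while only evaluating the selected trees.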