.. _td2atreeselectioncorrectionrst:

====================================================
2A.ml - Reducing a random forest - solution
====================================================

The Lasso model can select variables, while a random forest produces a
prediction as the average of regression trees. What if we combined the
two?

.. code:: ipython3

    from jyquickhelper import add_notebook_menu
    add_notebook_menu()

.. contents::
    :local:

.. code:: ipython3

    %matplotlib inline

Datasets
--------

Since we always need data, we use the `Diabetes `__ dataset.

.. code:: ipython3

    from sklearn.datasets import load_diabetes
    data = load_diabetes()
    X, y = data.data, data.target

.. code:: ipython3

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y)

A random forest
----------------

.. code:: ipython3

    from sklearn.ensemble import RandomForestRegressor as model_class

    clr = model_class()
    clr.fit(X_train, y_train)
.. parsed-literal::

    RandomForestRegressor()
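Each element of ``clr.estimators_`` is a fitted ``DecisionTreeRegressor``:
the forest is literally a list of trees. A quick, illustrative look:

.. code:: ipython3

    # illustrative: the forest is a plain list of fitted decision trees
    type(clr.estimators_), type(clr.estimators_[0])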
The number of trees is…

.. code:: ipython3

    len(clr.estimators_)

.. parsed-literal::

    100

.. code:: ipython3

    from sklearn.metrics import r2_score

    r2_score(y_test, clr.predict(X_test))

.. parsed-literal::

    0.3625404922781166

Random forest = average of the predictions
--------------------------------------------

Let's do it again, computing the average ourselves.

.. code:: ipython3

    import numpy

    # one column per tree: store each tree's predictions
    dest = numpy.zeros((X_test.shape[0], len(clr.estimators_)))
    estimators = numpy.array(clr.estimators_).ravel()
    for i, est in enumerate(estimators):
        pred = est.predict(X_test)
        dest[:, i] = pred

    # the forest's prediction is the mean over the trees
    average = numpy.mean(dest, axis=1)
    r2_score(y_test, average)

.. parsed-literal::

    0.3625404922781166

As expected, it is exactly the same.

Weighting the trees with a linear regression
----------------------------------------------

The random forest is a way of creating new features, exactly 100 of
them, which we use to fit a linear regression.

.. code:: ipython3

    from sklearn.linear_model import LinearRegression


    def new_features(forest, X):
        # each tree of the forest contributes one feature: its prediction
        dest = numpy.zeros((X.shape[0], len(forest.estimators_)))
        estimators = numpy.array(forest.estimators_).ravel()
        for i, est in enumerate(estimators):
            pred = est.predict(X)
            dest[:, i] = pred
        return dest


    X_train_2 = new_features(clr, X_train)
    lr = LinearRegression()
    lr.fit(X_train_2, y_train)
.. parsed-literal::

    LinearRegression()
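Before scoring this model, a quick illustrative check: ``X_train_2`` has
one column per tree, and averaging its columns with uniform weights
gives back the forest's own prediction.

.. code:: ipython3

    # illustrative check: uniform weights over the tree columns
    # reproduce the forest's prediction
    print(X_train_2.shape)
    print(numpy.allclose(X_train_2.mean(axis=1), clr.predict(X_train)))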
.. code:: ipython3

    X_test_2 = new_features(clr, X_test)
    r2_score(y_test, lr.predict(X_test_2))

.. parsed-literal::

    0.30414556638121215

Sometimes a little worse, sometimes a little better: the risk of
overfitting is a bit higher with this many features because the
training set only contains 379 observations (check ``X_train.shape`` to
verify).

.. code:: ipython3

    lr.coef_

.. parsed-literal::

    array([ 0.0129567 , -0.03467343, -0.02574902,  0.01872549,  0.00128276,
           -0.01449147,  0.00977528, -0.02397026,  0.01066261,  0.02121925,
            0.03544455,  0.02735311,  0.01859875, -0.03189411, -0.0245749 ,
           -0.01879966,  0.01521987,  0.00292998,  0.04250576,  0.01424533,
           -0.00561623,  0.00635399,  0.04712406,  0.02518721,  0.01713507,
            0.01741708, -0.02072389,  0.05748854,  0.00424951,  0.02872275,
           -0.01016485,  0.04368062,  0.07377962,  0.06540726, -0.00123185,
            0.02227104,  0.0289425 ,  0.00914512,  0.03645644,  0.01838009,
            0.00046509,  0.04145444,  0.0202303 ,  0.00984027,  0.0149448 ,
           -0.01129977,  0.00428108,  0.02601842,  0.00421449, -0.01172942,
            0.02631074,  0.04180424,  0.02909078, -0.01922766, -0.00953341,
           -0.0036882 , -0.02411783,  0.06700977, -0.01447105,  0.02094102,
            0.00227497,  0.04181756, -0.02474879,  0.0465355 ,  0.05504502,
           -0.05645067, -0.02066304,  0.04349629, -0.01549704,  0.02805018,
            0.01344701,  0.03489881,  0.04401519,  0.04756385, -0.02936105,
           -0.0305603 , -0.02101141,  0.02751049, -0.00875684, -0.01583926,
            0.00033533,  0.02769942,  0.0358323 , -0.04180737, -0.02759142,
           -0.01231979,  0.02881228, -0.00406825,  0.00497993,  0.01094388,
           -0.01672934,  0.05414844, -0.01725494,  0.04816335,  0.04487341,
            0.0269151 ,  0.00945554,  0.02318397,  0.04105411,  0.05314256])

.. code:: ipython3

    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(1, 1, figsize=(12, 4))
    ax.bar(numpy.arange(0, len(lr.coef_)), lr.coef_)
    ax.set_title("Coefficients pour chaque arbre calculés avec une régression linéaire");

.. image:: td2a_tree_selection_correction_19_0.png

For comparison, a linear regression fitted on the original features
gets a noticeably better score here.

.. code:: ipython3

    lr_raw = LinearRegression()
    lr_raw.fit(X_train, y_train)
    r2_score(y_test, lr_raw.predict(X_test))

.. parsed-literal::

    0.5103612609676136

Tree selection
---------------

The idea is to use a variable-selection algorithm such as `Lasso `__ to
shrink the random forest without losing performance. It is almost the
same code.

.. code:: ipython3

    from sklearn.linear_model import Lasso

    lrs = Lasso(max_iter=10000)
    lrs.fit(X_train_2, y_train)
    lrs.coef_

.. parsed-literal::

    array([ 0.01256934, -0.03342528, -0.02400605,  0.01825851,  0.0005323 ,
           -0.01374509,  0.01004616, -0.02284903,  0.01105419,  0.02047233,
            0.03476362,  0.02755575,  0.01751674, -0.03051477, -0.02321124,
           -0.01783216,  0.01429992,  0.00214398,  0.04066576,  0.0134879 ,
           -0.00377705,  0.00506043,  0.04614375,  0.02482044,  0.01560689,
            0.01706262, -0.02035898,  0.05747191,  0.00418486,  0.02766988,
           -0.00899098,  0.04325266,  0.07327657,  0.06515135, -0.00034774,
            0.02210777,  0.0280344 ,  0.00852669,  0.0358763 ,  0.01779845,
            0.        ,  0.03970822,  0.01935286,  0.00908017,  0.01417323,
           -0.01066044,  0.00293442,  0.02483663,  0.00332255, -0.01043329,
            0.02666477,  0.04097776,  0.02851599, -0.01795373, -0.00830115,
           -0.00293032, -0.02188798,  0.06679156, -0.01364001,  0.02028321,
            0.00160792,  0.04114419, -0.02342478,  0.04638246,  0.0547764 ,
           -0.05501755, -0.01856303,  0.04157578, -0.01403205,  0.02718244,
            0.01215738,  0.03503149,  0.04403975,  0.04640854, -0.02884553,
           -0.02929629, -0.01946676,  0.02679733, -0.00779812, -0.01418256,
            0.        ,  0.02734732,  0.03608281, -0.04111661, -0.02654714,
           -0.01106999,  0.02664032, -0.00291639,  0.00541073,  0.01187597,
           -0.01621428,  0.05386765, -0.01531834,  0.04807872,  0.04398675,
            0.02611443,  0.00944403,  0.02219076,  0.04080548,  0.05276076])

Quite a few zero coefficients, which means quite a few trees are never
used.
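We can count them (a one-line illustrative check):

.. code:: ipython3

    # illustrative: number of trees dropped / kept by the Lasso
    int((lrs.coef_ == 0).sum()), int((lrs.coef_ != 0).sum())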
.. code:: ipython3

    r2_score(y_test, lrs.predict(X_test_2))

.. parsed-literal::

    0.3055529526371402

Not much of a loss… which makes it tempting to try several values of
``alpha``.

.. code:: ipython3

    from tqdm import tqdm

    alphas = [0.01 * i for i in range(100)] + [1 + 0.1 * i for i in range(100)]

    obs = []
    for i in tqdm(range(0, len(alphas))):
        alpha = alphas[i]
        lrs = Lasso(max_iter=20000, alpha=alpha)
        lrs.fit(X_train_2, y_train)
        obs.append(dict(
            alpha=alpha,
            null=len(lrs.coef_[lrs.coef_ != 0]),
            r2=r2_score(y_test, lrs.predict(X_test_2))
        ))
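The measurements collected in ``obs`` are gathered into a dataframe
(assumed here: a plain ``pandas.DataFrame``, whose last rows are shown
below):

.. code:: ipython3

    # assumed: obs is simply wrapped into a dataframe
    from pandas import DataFrame
    df = DataFrame(obs)
    df.tail()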
.. parsed-literal::

          alpha  null        r2
    195    10.5    83  0.318660
    196    10.6    83  0.318771
    197    10.7    83  0.318879
    198    10.8    83  0.318982
    199    10.9    82  0.319073
.. code:: ipython3

    fig, ax = plt.subplots(1, 2, figsize=(12, 4))
    df[["alpha", "null"]].set_index("alpha").plot(ax=ax[0], logx=True)
    ax[0].set_title("Nombre de coefficients non nulls")
    df[["alpha", "r2"]].set_index("alpha").plot(ax=ax[1], logx=True)
    ax[1].set_title("r2");

.. image:: td2a_tree_selection_correction_29_0.png

In this case, removing trees improves the performance: as discussed
above, it reduces the overfitting. The number of trees can be reduced
by two thirds with this model.
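One possible way to exploit this in practice (a minimal sketch, with
``alpha`` picked arbitrarily among the values tried above): keep only
the trees whose Lasso coefficient is non-zero and predict with the
weighted sum of their outputs plus the intercept. This reproduces
``lrs.predict`` (up to floating point) while evaluating only the
selected trees.

.. code:: ipython3

    # minimal sketch of a "reduced forest": keep the trees selected by
    # the Lasso and combine them with the learned weights
    lrs = Lasso(max_iter=20000, alpha=0.05)   # alpha chosen arbitrarily
    lrs.fit(X_train_2, y_train)

    kept = numpy.where(lrs.coef_ != 0)[0]
    trees = [clr.estimators_[i] for i in kept]
    weights = lrs.coef_[kept]

    def predict_reduced(X):
        # weighted sum of the selected trees' predictions + intercept
        preds = numpy.column_stack([t.predict(X) for t in trees])
        return preds @ weights + lrs.intercept_

    len(trees), r2_score(y_test, predict_reduced(X_test))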