.. _td2apipelinetreeselectioncorrectionrst:

===========================================================
2A.ml - Pipeline for reducing a random forest - correction
===========================================================

.. only:: html

    **Links:** :download:`notebook `, :downloadlink:`html `, :download:`python `, :downloadlink:`slides `, :githublink:`GitHub|_doc/notebooks/td2a_ml/td2a_pipeline_tree_selection_correction.ipynb|*`

The Lasso model can select variables; a random forest produces its
prediction as the average of regression trees. This topic was covered in
the notebook `Reduction d'une forêt aléatoire `__. Here we automate the
whole process.

.. code:: ipython3

    from jyquickhelper import add_notebook_menu
    add_notebook_menu()

.. contents::
    :local:

.. code:: ipython3

    %matplotlib inline

Datasets
--------

Since we always need data, we use the `Diabetes `__ dataset.

.. code:: ipython3

    from sklearn.datasets import load_diabetes
    data = load_diabetes()
    X, y = data.data, data.target

.. code:: ipython3

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y)

Random forest followed by Lasso
-------------------------------

The method consists in training a random forest, then fitting a Lasso
regression on the predictions of its individual trees.

.. code:: ipython3

    import numpy
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import Lasso

    # Train a random forest.
    clr = RandomForestRegressor()
    clr.fit(X_train, y_train)

    # Collect the prediction of every tree.
    X_train_2 = numpy.zeros((X_train.shape[0], len(clr.estimators_)))
    estimators = numpy.array(clr.estimators_).ravel()
    for i, est in enumerate(estimators):
        pred = est.predict(X_train)
        X_train_2[:, i] = pred

    # Fit a Lasso regression on the tree predictions.
    lrs = Lasso(max_iter=10000)
    lrs.fit(X_train_2, y_train)
    lrs.coef_

.. parsed-literal::

    array([0.00516931, 0.        , 0.        , 0.        , 0.        ,
           0.        , 0.        , 0.        , 0.05150952, 0.        ,
           0.0114454 , 0.00778913, 0.        , 0.04239907, 0.01882099,
           0.02956967, 0.        , 0.04699227, 0.        , 0.04588009,
           0.00476672, 0.05276899, 0.        , 0.        , 0.00719994,
           0.        , 0.02817731, 0.        , 0.        , 0.03606261,
           0.00228349, 0.01204062, 0.02018557, 0.        , 0.        ,
           0.03759611, 0.04608785, 0.        , 0.00316996, 0.        ,
           0.        , 0.        , 0.01678394, 0.        , 0.        ,
           0.        , 0.00801926, 0.07006079, 0.03263025, 0.        ,
           0.00770145, 0.        , 0.00351302, 0.        , 0.        ,
           0.        , 0.        , 0.        , 0.00183299, 0.        ,
           0.        , 0.        , 0.        , 0.        , 0.        ,
           0.        , 0.02545205, 0.05789703, 0.        , 0.        ,
           0.        , 0.0065516 , 0.        , 0.        , 0.        ,
           0.07234827, 0.        , 0.03547108, 0.        , 0.        ,
           0.03080198, 0.00930293, 0.04231454, 0.        , 0.01124574,
           0.        , 0.        , 0.        , 0.        , 0.        ,
           0.00108674, 0.02485889, 0.01839299, 0.        , 0.        ,
           0.03118312, 0.        , 0.        , 0.        , 0.        ])

We managed to reproduce the whole process. It is not always easy to
remember every step, which is why it is simpler to bundle everything
into a `pipeline `__.

First pipeline
--------------

The idea is to end up with something like the following.

.. code:: ipython3

    from sklearn.pipeline import Pipeline
    try:
        pipe = Pipeline(steps=[
            ('rf', RandomForestRegressor()),
            ("une fonction qui n'existe pas encore", fct),
            ("lasso", Lasso()),
        ])
    except Exception as e:
        print(e)

.. parsed-literal::

    name 'fct' is not defined

A pipeline may only contain predictive models (classifiers, regressors)
or transformers (a scaler, for instance). The function that extracts the
tree predictions must therefore be wrapped in a *transformer*.
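For reference, the contract a non-final pipeline step has to satisfy is
small: a ``fit`` method that returns ``self`` and a ``transform``
method. A minimal sketch of that contract (the class below is invented
for illustration and does nothing useful):

.. code:: ipython3

    class IdentityTransformer:
        "Smallest possible transformer: learns nothing, changes nothing."

        def fit(self, X, y=None):
            # Nothing to estimate, but fit must return self
            # so that a Pipeline can chain the steps.
            return self

        def transform(self, X):
            # A real transformer would return a modified X.
            return X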
Wrapping a function is precisely the role of a
`FunctionTransformer `__.

.. code:: ipython3

    from sklearn.preprocessing import FunctionTransformer

    def random_forest_tree_prediction(rf, X):
        # One column of predictions per tree of the forest.
        preds = numpy.zeros((X.shape[0], len(rf.estimators_)))
        estimators = numpy.array(rf.estimators_).ravel()
        for i, est in enumerate(estimators):
            pred = est.predict(X)
            preds[:, i] = pred
        return preds

    random_forest_tree_prediction(clr, X)

.. parsed-literal::

    array([[23.8, 31.2, 32. , ..., 32. , 29.9, 28. ],
           [21.6, 22.9, 21.6, ..., 21.6, 22. , 24.4],
           [33.8, 37.2, 34.7, ..., 34.9, 34.7, 34.7],
           ...,
           [23.9, 23.9, 31.5, ..., 29.9, 23.9, 23.9],
           [23.9, 22. , 22. , ..., 29.9, 23.9, 22. ],
           [11.9, 11.9, 11.9, ..., 11.9, 20.6, 11.9]])

.. code:: ipython3

    fct = FunctionTransformer(lambda X, rf=clr:
                              random_forest_tree_prediction(rf, X))
    fct.transform(X_train)

.. parsed-literal::

    array([[ 8.5,  8.5,  5.6, ...,  5. ,  5. ,  8.5],
           [14.8, 15. , 13.4, ..., 19.5, 21.7, 16.6],
           [13.1, 13.1, 15.1, ..., 13.1, 11.3, 12.6],
           ...,
           [21.4, 22. , 21.4, ..., 20. , 22.6, 21.4],
           [25.1, 29.9, 25.1, ..., 25.1, 25.1, 25.1],
           [28.4, 28.4, 28.4, ..., 28.4, 28.4, 28.4]])

Everything works. We just need to insert it into the pipeline.

.. code:: ipython3

    try:
        pipe = Pipeline(steps=[
            ('rf', RandomForestRegressor()),
            ("tree_pred", fct),
            ("lasso", Lasso()),
        ])
    except Exception as e:
        print(e)

.. parsed-literal::

    All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                          max_depth=None, max_features='auto', max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=100,
                          n_jobs=None, oob_score=False, random_state=None,
                          verbose=0, warm_start=False)' (type ) doesn't

It still fails: a pipeline requires every step except the last one to be
a *transformer*, that is, to implement the ``transform`` method, which
the random forest does not. There is a second problem as well: the
function only works if it receives a random forest as an argument, and
we passed the one trained earlier, not the one that would be trained
inside the pipeline.

.. code:: ipython3

    hasattr(clr, 'transform')

.. parsed-literal::

    False

.. code:: ipython3

    from jyquickhelper import RenderJsDot

.. code:: ipython3

    RenderJsDot("""digraph {
        A [label="RandomForestRegressor pipeline"];
        A2 [label="RandomForestRegressor - pretrained"];
        B [label="FunctionTransformer"];
        C [label="Lasso"];
        A -> B [label="X"];
        B -> C [label="X2"];
        A2 -> B [label="rf"];
    }""")
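To make the second problem concrete, here is a small sketch (the
variable ``other_rf`` is invented for the example): the
``FunctionTransformer`` built above keeps using ``clr``, the forest
trained earlier, whatever forest the pipeline trains.

.. code:: ipython3

    # A fresh forest, like the one the pipeline would train internally.
    other_rf = RandomForestRegressor(n_estimators=10)
    other_rf.fit(X_train, y_train)

    # fct still calls the forest captured in the lambda (clr, 100 trees),
    # not other_rf (10 trees): the output shapes differ.
    print(fct.transform(X_train).shape)
    print(random_forest_tree_prediction(other_rf, X_train).shape)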
Since this does not work, we move on to a second idea.

Second pipeline
---------------

We disguise the random forest as a transformer.

.. code:: ipython3

    class RandomForestRegressorAsTransformer:

        def __init__(self, **kwargs):
            self.rf = RandomForestRegressor(**kwargs)

        def fit(self, X, y):
            self.rf.fit(X, y)
            return self

        def transform(self, X):
            # One column of predictions per tree of the forest.
            preds = numpy.zeros((X.shape[0], len(self.rf.estimators_)))
            estimators = numpy.array(self.rf.estimators_).ravel()
            for i, est in enumerate(estimators):
                pred = est.predict(X)
                preds[:, i] = pred
            return preds

    trrf = RandomForestRegressorAsTransformer()
    trrf.fit(X_train, y_train)
    trrf.transform(X_train)

.. parsed-literal::

    array([[ 8.5,  8.5, 10.4, ...,  8.5,  8.5,  5.6],
           [13.4, 15. , 21.7, ..., 21.7, 21.7, 20. ],
           [ 8.3, 13.1, 15.1, ..., 13.6, 13.1, 17.1],
           ...,
           [21.4, 21.2, 21.4, ..., 21.4, 21.4, 21.4],
           [23.8, 25.1, 25.1, ..., 24.3, 25.1, 22. ],
           [28.4, 28.4, 28.4, ..., 28.1, 28.4, 28.4]])

All good. Let's rebuild the pipeline.

.. code:: ipython3

    pipe = Pipeline(steps=[('trrf', RandomForestRegressorAsTransformer()),
                           ("lasso", Lasso())])
    pipe.fit(X_train, y_train)

.. parsed-literal::

    C:\xavierdupre\__home_\github_fork\scikit-learn\sklearn\linear_model\coordinate_descent.py:475: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 15.605865570498736, tolerance: 3.566623625329815
      positive)

.. parsed-literal::

    Pipeline(memory=None,
             steps=[('trrf',
                     <__main__.RandomForestRegressorAsTransformer object at 0x000002D7FBD6B0F0>),
                    ('lasso',
                     Lasso(alpha=1.0, copy_X=True, fit_intercept=True,
                           max_iter=1000, normalize=False, positive=False,
                           precompute=False, random_state=None,
                           selection='cyclic', tol=0.0001,
                           warm_start=False))],
             verbose=False)

We retrieve the coefficients.

.. code:: ipython3

    pipe.steps[1][1].coef_

.. parsed-literal::

    array([0.00000000e+00, 8.18725785e-03, 2.57107281e-02, 2.64260468e-02,
           0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
           0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
           0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
           0.00000000e+00, 2.91496034e-02, 0.00000000e+00, 2.59224355e-03,
           0.00000000e+00, 0.00000000e+00, 1.11199737e-02, 2.25351658e-02,
           0.00000000e+00, 1.89481812e-02, 1.02779896e-01, 0.00000000e+00,
           6.25993012e-03, 2.88645052e-02, 2.26525053e-02, 0.00000000e+00,
           1.58723695e-02, 2.17116677e-02, 5.73111769e-02, 4.07723945e-02,
           3.07676159e-02, 0.00000000e+00, 0.00000000e+00, 2.96368833e-02,
           6.31627239e-03, 3.05513736e-04, 0.00000000e+00, 0.00000000e+00,
           0.00000000e+00, 2.61832331e-02, 0.00000000e+00, 0.00000000e+00,
           1.95009449e-02, 3.88476951e-02, 1.12862592e-02, 1.97136005e-02,
           0.00000000e+00, 5.67052346e-02, 9.39029327e-03, 0.00000000e+00,
           0.00000000e+00, 0.00000000e+00, 6.86248078e-03, 0.00000000e+00,
           0.00000000e+00, 0.00000000e+00, 5.22709050e-02, 1.56786096e-02,
           0.00000000e+00, 1.06189159e-02, 0.00000000e+00, 0.00000000e+00,
           0.00000000e+00, 7.99152616e-02, 0.00000000e+00, 1.05299329e-02,
           0.00000000e+00, 0.00000000e+00, 4.70392340e-02, 0.00000000e+00,
           0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
           1.04728058e-03, 3.60665273e-02, 0.00000000e+00, 0.00000000e+00,
           1.21597852e-02, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
           0.00000000e+00, 4.73818504e-02, 1.70113005e-02, 0.00000000e+00,
           0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
           0.00000000e+00, 0.00000000e+00, 2.29949689e-06, 0.00000000e+00])

What is this for: GridSearchCV
------------------------------

Since the whole processing now lives in a single pipeline, which
*scikit-learn* treats like any other model, we can search for the best
hyper-parameters of the model: the initial number of trees, the *alpha*
parameter, the depth of the trees…

.. code:: ipython3

    from sklearn.model_selection import GridSearchCV

    param_grid = {'trrf__n_estimators': [30, 50, 80, 100],
                  'lasso__alpha': [0.5, 1.0, 1.5]}

    try:
        grid = GridSearchCV(pipe, cv=5, verbose=1, param_grid=param_grid)
        grid.fit(X_train, y_train)
    except Exception as e:
        print(e)

.. parsed-literal::

    Fitting 5 folds for each of 12 candidates, totalling 60 fits
    'RandomForestRegressorAsTransformer' object has no attribute 'set_params'

.. parsed-literal::

    [Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

The ``RandomForestRegressorAsTransformer`` class needs a *set_params*
method… No problem.

.. code:: ipython3

    class RandomForestRegressorAsTransformer:

        def __init__(self, **kwargs):
            self.rf = RandomForestRegressor(**kwargs)

        def fit(self, X, y):
            self.rf.fit(X, y)
            return self

        def transform(self, X):
            preds = numpy.zeros((X.shape[0], len(self.rf.estimators_)))
            estimators = numpy.array(self.rf.estimators_).ravel()
            for i, est in enumerate(estimators):
                pred = est.predict(X)
                preds[:, i] = pred
            return preds

        def set_params(self, **params):
            # Forward the hyper-parameters to the underlying forest.
            self.rf.set_params(**params)
            return self
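As an aside, instead of adding ``set_params`` (and, sooner or later,
``get_params``) by hand, one could inherit from *scikit-learn*'s
``BaseEstimator`` and ``TransformerMixin``, which derive both methods
from the constructor signature. A minimal sketch of that variant, with
the forest's hyper-parameters reduced to two for brevity (the class name
is invented here):

.. code:: ipython3

    from sklearn.base import BaseEstimator, TransformerMixin

    class RandomForestAsTransformer(BaseEstimator, TransformerMixin):

        # Hyper-parameters must be explicit constructor arguments stored
        # under the same name: that is what get_params/set_params inspect.
        def __init__(self, n_estimators=100, max_depth=None):
            self.n_estimators = n_estimators
            self.max_depth = max_depth

        def fit(self, X, y):
            # The trailing underscore marks an attribute created by fit.
            self.rf_ = RandomForestRegressor(n_estimators=self.n_estimators,
                                             max_depth=self.max_depth)
            self.rf_.fit(X, y)
            return self

        def transform(self, X):
            preds = numpy.zeros((X.shape[0], len(self.rf_.estimators_)))
            for i, est in enumerate(self.rf_.estimators_):
                preds[:, i] = est.predict(X)
            return preds

The notebook keeps the hand-written class and simply retries the grid
search.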
.. code:: ipython3

    import warnings
    from sklearn.exceptions import ConvergenceWarning

    pipe = Pipeline(steps=[('trrf', RandomForestRegressorAsTransformer()),
                           ("lasso", Lasso())])
    param_grid = {'trrf__n_estimators': [50, 100],
                  'lasso__alpha': [0.5, 1.0, 1.5]}
    grid = GridSearchCV(pipe, cv=5, verbose=2, param_grid=param_grid)

    with warnings.catch_warnings(record=False) as w:
        # Ignore the convergence warnings, there are many of them.
        warnings.simplefilter("ignore", ConvergenceWarning)
        grid.fit(X_train, y_train)
.. parsed-literal::

    [Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

.. parsed-literal::

    Fitting 5 folds for each of 6 candidates, totalling 30 fits
    [CV] lasso__alpha=0.5, trrf__n_estimators=50 .........................
    [CV] .......... lasso__alpha=0.5, trrf__n_estimators=50, total=   0.3s
    [CV] lasso__alpha=0.5, trrf__n_estimators=50 .........................

.. parsed-literal::

    [Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s

.. parsed-literal::

    [CV] .......... lasso__alpha=0.5, trrf__n_estimators=50, total=   0.3s
    [CV] lasso__alpha=0.5, trrf__n_estimators=50 .........................
    [CV] .......... lasso__alpha=0.5, trrf__n_estimators=50, total=   0.3s
    [CV] lasso__alpha=0.5, trrf__n_estimators=50 .........................
    [CV] .......... lasso__alpha=0.5, trrf__n_estimators=50, total=   0.2s
    [CV] lasso__alpha=0.5, trrf__n_estimators=50 .........................
    [CV] .......... lasso__alpha=0.5, trrf__n_estimators=50, total=   0.2s
    [CV] lasso__alpha=0.5, trrf__n_estimators=100 ........................
    [CV] ......... lasso__alpha=0.5, trrf__n_estimators=100, total=   0.4s
    [CV] lasso__alpha=0.5, trrf__n_estimators=100 ........................
    [CV] ......... lasso__alpha=0.5, trrf__n_estimators=100, total=   0.4s
    [CV] lasso__alpha=0.5, trrf__n_estimators=100 ........................
    [CV] ......... lasso__alpha=0.5, trrf__n_estimators=100, total=   0.4s
    [CV] lasso__alpha=0.5, trrf__n_estimators=100 ........................
    [CV] ......... lasso__alpha=0.5, trrf__n_estimators=100, total=   0.5s
    [CV] lasso__alpha=0.5, trrf__n_estimators=100 ........................
    [CV] ......... lasso__alpha=0.5, trrf__n_estimators=100, total=   0.4s
    [CV] lasso__alpha=1.0, trrf__n_estimators=50 .........................
    [CV] .......... lasso__alpha=1.0, trrf__n_estimators=50, total=   0.2s
    [CV] lasso__alpha=1.0, trrf__n_estimators=50 .........................
    [CV] .......... lasso__alpha=1.0, trrf__n_estimators=50, total=   0.2s
    [CV] lasso__alpha=1.0, trrf__n_estimators=50 .........................
    [CV] .......... lasso__alpha=1.0, trrf__n_estimators=50, total=   0.3s
    [CV] lasso__alpha=1.0, trrf__n_estimators=50 .........................
    [CV] .......... lasso__alpha=1.0, trrf__n_estimators=50, total=   0.2s
    [CV] lasso__alpha=1.0, trrf__n_estimators=50 .........................
    [CV] .......... lasso__alpha=1.0, trrf__n_estimators=50, total=   0.2s
    [CV] lasso__alpha=1.0, trrf__n_estimators=100 ........................
    [CV] ......... lasso__alpha=1.0, trrf__n_estimators=100, total=   0.5s
    [CV] lasso__alpha=1.0, trrf__n_estimators=100 ........................
    [CV] ......... lasso__alpha=1.0, trrf__n_estimators=100, total=   0.4s
    [CV] lasso__alpha=1.0, trrf__n_estimators=100 ........................
    [CV] ......... lasso__alpha=1.0, trrf__n_estimators=100, total=   0.5s
    [CV] lasso__alpha=1.0, trrf__n_estimators=100 ........................
    [CV] ......... lasso__alpha=1.0, trrf__n_estimators=100, total=   0.5s
    [CV] lasso__alpha=1.0, trrf__n_estimators=100 ........................
    [CV] ......... lasso__alpha=1.0, trrf__n_estimators=100, total=   0.4s
    [CV] lasso__alpha=1.5, trrf__n_estimators=50 .........................
    [CV] .......... lasso__alpha=1.5, trrf__n_estimators=50, total=   0.2s
    [CV] lasso__alpha=1.5, trrf__n_estimators=50 .........................
    [CV] .......... lasso__alpha=1.5, trrf__n_estimators=50, total=   0.2s
    [CV] lasso__alpha=1.5, trrf__n_estimators=50 .........................
    [CV] .......... lasso__alpha=1.5, trrf__n_estimators=50, total=   0.2s
    [CV] lasso__alpha=1.5, trrf__n_estimators=50 .........................
    [CV] .......... lasso__alpha=1.5, trrf__n_estimators=50, total=   0.2s
    [CV] lasso__alpha=1.5, trrf__n_estimators=50 .........................
    [CV] .......... lasso__alpha=1.5, trrf__n_estimators=50, total=   0.2s
    [CV] lasso__alpha=1.5, trrf__n_estimators=100 ........................
    [CV] ......... lasso__alpha=1.5, trrf__n_estimators=100, total=   0.4s
    [CV] lasso__alpha=1.5, trrf__n_estimators=100 ........................
    [CV] ......... lasso__alpha=1.5, trrf__n_estimators=100, total=   0.4s
    [CV] lasso__alpha=1.5, trrf__n_estimators=100 ........................
    [CV] ......... lasso__alpha=1.5, trrf__n_estimators=100, total=   0.5s
    [CV] lasso__alpha=1.5, trrf__n_estimators=100 ........................
    [CV] ......... lasso__alpha=1.5, trrf__n_estimators=100, total=   0.7s
    [CV] lasso__alpha=1.5, trrf__n_estimators=100 ........................
    [CV] ......... lasso__alpha=1.5, trrf__n_estimators=100, total=   0.5s

.. parsed-literal::

    [Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:   10.6s finished

.. code:: ipython3

    grid.best_params_

.. parsed-literal::

    {'lasso__alpha': 0.5, 'trrf__n_estimators': 50}

.. code:: ipython3

    grid.best_estimator_.steps[1][1].coef_

.. parsed-literal::

    array([ 0.01661755,  0.09608553, -0.        ,  0.04337892,  0.00256722,
            0.04875441,  0.0022436 ,  0.00757652,  0.        ,  0.        ,
            0.        ,  0.04868714,  0.00878259,  0.        ,  0.01989812,
            0.0123234 ,  0.        ,  0.06432313,  0.00565488,  0.        ,
            0.00119269,  0.        ,  0.00611262,  0.        ,  0.        ,
            0.        ,  0.01786513,  0.        ,  0.026654  ,  0.        ,
            0.        ,  0.        ,  0.09583967,  0.00722895,  0.05395944,
            0.063898  ,  0.0586511 ,  0.        ,  0.        ,  0.1290402 ,
            0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
            0.00750308,  0.        ,  0.08633491,  0.03593556,  0.05771344])

.. code:: ipython3

    grid.best_score_

.. parsed-literal::

    0.8321710116268228

Let's try on the test set.

.. code:: ipython3

    grid.score(X_test, y_test)

.. parsed-literal::

    0.8772908724536076

And how many non-zero coefficients are there exactly?

.. code:: ipython3

    coef = grid.best_estimator_.steps[1][1].coef_
    coef.shape, sum(coef != 0)

.. parsed-literal::

    ((50,), 27)

We cannot check the other candidates because the grid search does not
keep the other fitted models. One problem remains, more or less
important depending on how the model is used: there are 23 zero
coefficients, which means the model computes the predictions of 23
regression trees that are never used, since the regression assigns them
a zero coefficient. We could therefore almost halve the prediction time.
That is not so hard to do, but it is a story for another time.
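As a sketch of what that pruning could look like (illustrative only, it
reuses the fitted ``grid`` from above): the pipeline's prediction is the
Lasso intercept plus the coefficient-weighted sum of the tree
predictions, so the trees with a zero coefficient can simply be dropped.

.. code:: ipython3

    # Keep only the trees whose Lasso coefficient is non-zero.
    best = grid.best_estimator_
    trrf, lasso = best.steps[0][1], best.steps[1][1]
    kept = [(est, c) for est, c in zip(trrf.rf.estimators_, lasso.coef_)
            if c != 0]

    def pruned_predict(X):
        # Weighted sum of the surviving trees plus the intercept;
        # equal, up to floating-point error, to best.predict(X).
        return lasso.intercept_ + sum(c * est.predict(X) for est, c in kept)

    len(kept), len(trrf.rf.estimators_)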