{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 2A.ml - Pipeline to reduce a random forest - exercises\n", "\n", "The Lasso model can select variables; a random forest produces its prediction as the average of regression trees. This topic was covered in the notebook [Reduction d'une for\u00eat al\u00e9atoire](http://www.xavierdupre.fr/app/ensae_teaching_cs/helpsphinx/notebooks/td2a_tree_selection_correction.html). The goal here is to automate the process."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["run previous cell, wait for 2 seconds\n", ""], "text/plain": [""]}, "execution_count": 1, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": ["%matplotlib inline"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Datasets\n", "\n", "Since we always need data, let's use the [Diabetes](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html) dataset."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": ["from sklearn.datasets import load_diabetes\n", "data = load_diabetes()\n", "X, y = data.data, data.target"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": ["from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Random forest followed by Lasso\n", "\n", "The method consists in training a random forest, then fitting a regression on the predictions of each of its trees."]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([ 0.01058919,  0.05879275, -0.00490468,  0.0422317 ,  0.02061981,\n", "        0.05832323,  0.04902792, -0.02386671, -0.00783027, -0.02905091,\n", "       -0.05936758, -0.03081102, -0.00874234, -0.01032493, -0.00215755,\n", "        0.02104254, -0.06726193,  0.00863015, -0.00657562,  0.01915455,\n", "        0.1103515 ,  0.03127041,  0.0059957 ,  0.01318572, -0.02425179,\n", "        0.02444136, -0.01270415,  0.00860503, -0.01053657, -0.0044742 ,\n", "       -0.01316523,  0.01369104, -0.00739582, -0.02240202, -0.0049985 ,\n", "        0.08646501,  0.00866649, -0.00228254,  0.02181667,  0.01934537,\n", "       -0.00796704, -0.00372213,  0.02581304, -0.01812068,  0.04921884,\n", "        0.04735237, -0.01544872,  0.00383606,  0.03220245,  0.04162666,\n", "        0.00815848, 
0.04327313,  0.03816147, -0.00254619,  0.        ,\n", "       -0.03287036, -0.04364327,  0.00691009, -0.00819448,  0.00571863,\n", "       -0.0085195 ,  0.03282482, -0.041993  ,  0.04787454,  0.01832266,\n", "        0.03145652,  0.013905  ,  0.00592087,  0.01296335,  0.01339059,\n", "        0.01104395, -0.0004973 ,  0.05065905,  0.01915292,  0.        ,\n", "        0.00598882,  0.        ,  0.03658216, -0.01576201,  0.00131738,\n", "        0.07700475,  0.03661206,  0.0100858 ,  0.0201148 ,  0.08337645,\n", "        0.01867529,  0.00236212, -0.00237683,  0.06146853,  0.05481785,\n", "        0.0629231 , -0.00304007, -0.03835209,  0.00739201,  0.00431521,\n", "        0.01388169,  0.02238382,  0.01769634,  0.01612737,  0.01166434])"]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["import numpy\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.linear_model import Lasso\n", "\n", "# Train a random forest\n", "clr = RandomForestRegressor()\n", "clr.fit(X_train, y_train)\n", "\n", "# Collect the prediction of each tree\n", "X_train_2 = numpy.zeros((X_train.shape[0], len(clr.estimators_)))\n", "estimators = numpy.array(clr.estimators_).ravel()\n", "for i, est in enumerate(estimators):\n", "    pred = est.predict(X_train)\n", "    X_train_2[:, i] = pred\n", "\n", "# Fit a Lasso regression on the tree predictions\n", "lrs = Lasso(max_iter=10000)\n", "lrs.fit(X_train_2, y_train)\n", "lrs.coef_"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We managed to reproduce the whole process. It is not always easy to remember every step, which is why it is simpler to bundle everything into a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 1: a first pipeline\n", "\n", "Perhaps you will find right away a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) that works. The difficult part is the one producing the vector of outputs of every regression tree. The first lead I explored was a [FunctionTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html)."]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 2: a second pipeline\n", "\n", "The first idea does not really work... We therefore decide to disguise the random forest as a transformer."]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": ["class RandomForestRegressorAsTransformer:\n", "\n", "    def __init__(self, **kwargs):\n", "        self.rf = RandomForestRegressor(**kwargs)\n", "\n", "    def fit(self, X, y):\n", "        # ...\n", "        return self\n", "\n", "    def transform(self, X):\n", "        # ...\n", "        # return the predictions of each tree\n", "        pass\n", "\n", "# All this just to be able to write what follows...\n", "trrf = RandomForestRegressorAsTransformer()\n", "trrf.fit(X_train, y_train)\n", "trrf.transform(X_train)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["It remains to write the pipeline corresponding to the training sequence described earlier in this notebook."]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/html": ["
Pipeline(steps=[('name', 'passthrough')])
"], "text/plain": ["Pipeline(steps=[('name', 'passthrough')])"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.pipeline import Pipeline\n", "\n", "pipe = Pipeline(steps=[\n", "    ('name', 'passthrough'),\n", "    # ...\n", "])\n", "\n", "pipe.fit(X_train, y_train)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 3: GridSearchCV\n", "\n", "Since all the processing steps now live in a single pipeline that *scikit-learn* treats as a model like any other, we can search for the best hyperparameters of the model, such as the initial number of trees, the *alpha* parameter, the depth of the trees... All of this with the [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) class."]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["You should run into a message saying that the class ``RandomForestRegressorAsTransformer`` needs a *set_params* method... A hint: ``def set_params(self, **params): self.rf.set_params(**params)``; returning ``self`` keeps it consistent with *scikit-learn*'s own ``set_params``."]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 4: number of non-zero coefficients\n", "\n", "It only remains to find the number of non-zero coefficients of the best model, that is, the number of trees the model keeps."]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5"}}, "nbformat": 4, "nbformat_minor": 2}