.. _sklearnapirst:

==================================
scikit-learn API and custom models
==================================


.. only:: html

    **Links:** :download:`notebook <sklearn_api.ipynb>`, :downloadlink:`html <sklearn_api2html.html>`, :download:`PDF <sklearn_api.pdf>`, :download:`python <sklearn_api.py>`, :downloadlink:`slides <sklearn_api.slides.html>`, :githublink:`GitHub|_doc/notebooks/2019/ensae_api/sklearn_api.ipynb|*`


*scikit-learn* has become the go-to module for machine learning. This
is partly due to its clean API, which lets anyone implement their own
models while letting *scikit-learn* handle them as if they were its
own.

.. code:: ipython3

    from jyquickhelper import add_notebook_menu
    add_notebook_menu(last_level=3)


.. contents::
    :local:


This presentation walks through the *scikit-learn* API, covers
deployment to production with
`pickle <https://docs.python.org/3/library/pickle.html>`__, and shows an
example implementation of a custom model applied to tree selection in a
random forest.

.. code:: ipython3

    import matplotlib.pyplot as plt
    from jupytalk.pres_helper import show_images

Design et API
-------------

One might think that two implementations of the same algorithm are
equivalent as long as they produce the same results. Here are two
chairs: which one does your instinct draw you to?

.. code:: ipython3

    show_images("zigzag.jpg", "chaise.jpg", figsize=(14, 4), title2="Le Corbusier");


.. image:: sklearn_api_5_0.png


Four or five libraries made Python successful
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


-  `numpy <https://numpy.org/>`__: matrix computation - existed before
   Python (`matlab <https://en.wikipedia.org/wiki/MATLAB>`__, R, …)
-  `pandas <https://pandas.pydata.org/>`__: data manipulation -
   existed before Python (`R <https://www.r-project.org/>`__, …)
-  `matplotlib <https://matplotlib.org/>`__: plotting - existed before
   Python (`matlab <https://en.wikipedia.org/wiki/MATLAB>`__,
   `R <https://www.r-project.org/>`__, …)
-  `scikit-learn <https://scikit-learn.org/stable/>`__: machine
   learning - **innovation: design**
-  `jupyter <https://jupyter.org/>`__: notebooks - **innovation:
   interactive mix of code, text, images**

.. code:: ipython3

    show_images("trends.png", title1="Google Trends Python / Matlab");


.. image:: sklearn_api_7_0.png


Machine learning in a nutshell
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


-  A machine learning model = the result of an optimization
-  This optimization depends on parameters (dimension, gradient step,
   …)
-  Optimization = training
-  The trained model is then used to make predictions.
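
The four points above can be illustrated with a minimal sketch, assuming
a linear model trained by gradient descent on synthetic, noise-free
data. The names ``step`` and ``n_iter`` and the data are illustrative
choices, not taken from the original slides.

.. code:: ipython3

    import numpy

    rng = numpy.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    w_true = numpy.array([1.5, -2.0])
    y = X @ w_true                      # noise-free targets

    # Training parameters, chosen before the optimization
    step, n_iter = 0.1, 500

    # Training = minimizing the mean squared error by gradient descent
    w = numpy.zeros(2)
    for _ in range(n_iter):
        gradient = 2 * X.T @ (X @ w - y) / X.shape[0]
        w -= step * gradient

    # Prediction with the trained coefficients
    y_pred = X @ w

With a reasonable step, ``w`` should end up close to ``w_true``; a step
that is too large would make the iterations diverge.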

What coders imagine
~~~~~~~~~~~~~~~~~~~

Designs that are often beautiful but single-use.

.. code:: ipython3

    show_images("coop.jpg", "coop2.jpg", title1="Coop Himmelb(l)au", title2="Rooftop", figsize=(16,8));


.. image:: sklearn_api_10_0.png


Incompatible views
~~~~~~~~~~~~~~~~~~


-  Researchers like innovation and look for new models.
-  Data scientists assemble existing models.
-  Fitting a model comes at the very end.

**What is short and repeated is easy to remember.**

scikit-learn vocabulary
~~~~~~~~~~~~~~~~~~~~~~~


-  **Predictor**: a machine learning model that is trained (``fit``)
   and that predicts (``predict``)
-  **Transformer**: a data preprocessing step that precedes a
   predictor, is trained (``fit``), and transforms the data
   (``transform``)

Using classes: predictor
~~~~~~~~~~~~~~~~~~~~~~~~

::

   class Predictor:
       def __init__(self, **kwargs):
           # kwargs are the training parameters
           self.kwargs = kwargs
       def fit(self, X, y):
           # training
           return self
       def predict(self, X):
           # prediction
           ...
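
As a concrete instance of this skeleton, here is a minimal sketch of a
predictor that always returns the mean of the training targets. The
class ``MeanPredictor`` and its attribute ``mean_`` are hypothetical
names introduced for illustration only.

.. code:: ipython3

    import numpy


    class MeanPredictor:
        """Hypothetical predictor: always predicts the training mean."""

        def __init__(self, **kwargs):
            # Training parameters (none is needed here)
            self.kwargs = kwargs

        def fit(self, X, y):
            # Training: the only thing learned is the mean of the targets
            self.mean_ = float(numpy.mean(y))
            return self

        def predict(self, X):
            # Prediction: one value per row of X
            return numpy.full(len(X), self.mean_)


    X = numpy.array([[0.0], [1.0], [2.0]])
    y = numpy.array([1.0, 2.0, 6.0])
    model = MeanPredictor().fit(X, y)
    pred = model.predict(X)

Because ``fit`` returns ``self``, training and prediction can be chained
on a single line.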

Using classes: transformer
~~~~~~~~~~~~~~~~~~~~~~~~~~

::

   class Transformer:
       def __init__(self, **kwargs):
           # kwargs are the training parameters
           self.kwargs = kwargs
       def fit(self, X, y=None):
           # training
           return self
       def transform(self, X):
           # transformation of the data
           ...
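
The same skeleton can be filled in for a transformer. The sketch below
assumes a hypothetical ``CenteringTransformer`` that learns the column
means and subtracts them; it is an illustration, not a class from
*scikit-learn*.

.. code:: ipython3

    import numpy


    class CenteringTransformer:
        """Hypothetical transformer: subtracts the column means."""

        def __init__(self, **kwargs):
            self.kwargs = kwargs

        def fit(self, X, y=None):
            # Training: learn one mean per column, y is ignored
            self.mean_ = numpy.asarray(X).mean(axis=0)
            return self

        def transform(self, X):
            # Transformation: apply the means learned during fit
            return numpy.asarray(X) - self.mean_


    X = numpy.array([[1.0, 10.0], [3.0, 30.0]])
    Xc = CenteringTransformer().fit(X).transform(X)

The means are learned once in ``fit`` and reused in ``transform``, so
new data is centered with the statistics of the training data.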

pipeline (think of a sandwich)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Normalization + PCA + Logistic Regression**

+--------------------+------------+---------------------+----------------------+----------------------+
| Class              | Step 1     | Step 2              | Step 3               | Step 4               |
+====================+============+=====================+======================+======================+
| Normalizer         | ``fit(X)`` | ``X2=transform(X)`` | ``X2=transform(X)``  | ``X2=transform(X)``  |
+--------------------+------------+---------------------+----------------------+----------------------+
| PCA                | .          | ``fit(X2)``         | ``X3=transform(X2)`` | ``X3=transform(X2)`` |
+--------------------+------------+---------------------+----------------------+----------------------+
| LogisticRegression | .          | .                   | ``fit(X3,y)``        | ``X4=predict(X3)``   |
+--------------------+------------+---------------------+----------------------+----------------------+
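
The steps of this table can be written by hand. The sketch below,
which assumes the iris dataset for convenience, chains the three models
manually and then compares the result with the equivalent ``Pipeline``;
the variable ``agreement`` is an illustrative name.

.. code:: ipython3

    import numpy
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import Normalizer

    X, y = load_iris(return_X_y=True)

    # Steps 1 to 3: each model is trained on the output of the previous one
    norm = Normalizer().fit(X)
    X2 = norm.transform(X)
    pca = PCA().fit(X2)
    X3 = pca.transform(X2)
    lr = LogisticRegression().fit(X3, y)

    # Step 4: prediction chains the two transforms, then predicts
    manual = lr.predict(pca.transform(norm.transform(X)))

    # The same chain written as a Pipeline
    pipe = Pipeline([("norm", Normalizer()), ("pca", PCA()),
                     ("lr", LogisticRegression())]).fit(X, y)

    # Fraction of rows where both approaches give the same class
    agreement = float(numpy.mean(manual == pipe.predict(X)))

Both approaches should give the same predictions, up to floating-point
effects; the pipeline simply removes the bookkeeping of intermediate
matrices.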

In Python
~~~~~~~~~

.. code:: ipython3

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import Normalizer
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    
    pipe = Pipeline([
        ('norm', Normalizer()),
        ('pca', PCA()),
        ('lr', LogisticRegression())
    ])

.. code:: ipython3

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    data = load_iris()
    X, y = data.data, data.target
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    pipe.fit(X_train, y_train)


.. parsed-literal::

    Pipeline(steps=[('norm', Normalizer()), ('pca', PCA()),
                    ('lr', LogisticRegression())])


.. code:: ipython3

    prediction = pipe.predict(X_test)
    prediction[:5]


.. parsed-literal::
    array([0, 2, 0, 2, 2])


.. code:: ipython3

    pipe.score(X_test, y_test)


.. parsed-literal::
    0.6578947368421053


Refinement
----------

.. code:: ipython3

    show_images("church-of-light-1024x614.jpg", title1="Tadao Ando", figsize=(10, 6));


.. image:: sklearn_api_22_0.png


A common design for regressors and classifiers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


-  **Regressors** are the simplest: they model a function
   :math:`f(X \in \mathbb{R}^d) \rightarrow \mathbb{R}`.
-  **Classifiers** model a function
   :math:`f(X \in \mathbb{R}^d) \rightarrow \mathbb{N}`

**But**

Classifiers are tied to the notion of **distance** to the decision
boundary, a distance that is then mapped to a **probability**, although
not always.

.. code:: ipython3

    show_images('logreg.png');


.. image:: sklearn_api_24_0.png
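
The link between distance and probability can be checked on a small
example. The sketch below assumes a binary problem built from two
classes of the iris dataset; for a binary logistic regression the
probability is the sigmoid of the score returned by
``decision_function``. The second part illustrates "not always" with
``LinearSVC``, which exposes a score but no probability.

.. code:: ipython3

    import numpy
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC

    X, y = load_iris(return_X_y=True)
    X, y = X[y > 0], y[y > 0]             # keep two classes only

    clf = LogisticRegression(max_iter=1000).fit(X, y)
    distance = clf.decision_function(X)   # signed score, proportional to the distance
    proba = clf.predict_proba(X)[:, 1]    # probability of the second class

    # Binary logistic regression: probability = sigmoid(score)
    sigmoid = 1.0 / (1.0 + numpy.exp(-distance))
    matches = bool(numpy.allclose(proba, sigmoid))

    # LinearSVC has decision_function but no predict_proba
    has_proba = hasattr(LinearSVC(), "predict_proba")

The score is the raw output of the linear model; turning it into a
probability is a separate modelling choice that some classifiers do not
make.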


What a classifier needs
~~~~~~~~~~~~~~~~~~~~~~~

::

   class Classifier:
       def __init__(self, **kwargs):
           # kwargs are the training parameters
           self.kwargs = kwargs
       def fit(self, X, y):
           # training
           return self
       def decision_function(self, X):
           # distances
           ...
       def predict_proba(self, X):
           # distances --> probabilities
           ...
       def predict(self, X):
           # classes
           ...

What a regressor needs, by imitation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

   class Regressor:
       def __init__(self, **kwargs):
           # kwargs are the training parameters
           self.kwargs = kwargs
       def fit(self, X, y):
           # training
           return self
       def decision_function(self, X):
           # one or several regressions
           ...
       def predict(self, X):
           # averages
           ...

Parameters and training results
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


-  Any attribute ending with ``_`` is a result of training.
-  Conversely, anything that does not end with ``_`` is known before
   training.

.. code:: ipython3

    show_images("lasso.png");


.. image:: sklearn_api_28_0.png
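
The naming convention can be observed directly on an estimator. The
sketch below assumes a ``Lasso`` trained on the diabetes dataset; the
value ``alpha=0.5`` is an arbitrary choice for illustration.

.. code:: ipython3

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Lasso

    X, y = load_diabetes(return_X_y=True)
    model = Lasso(alpha=0.5)

    # Known before training: constructor parameters, no trailing underscore
    alpha_before = model.alpha
    fitted_before = hasattr(model, "coef_")

    model.fit(X, y)

    # Known after training: attributes ending with an underscore
    fitted_after = hasattr(model, "coef_")
    n_coef = model.coef_.shape[0]

The attribute ``coef_`` only exists once ``fit`` has been called, which
is also a simple way to tell whether a model has been trained.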


Standard problems - a common mold
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: ipython3

    show_images('sklearn_base.png');


.. image:: sklearn_api_30_0.png


Analyze or predict
~~~~~~~~~~~~~~~~~~

Some models cannot predict on new data, they can only analyze the data
they were trained on. This is the case of
`SpectralClustering <https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html>`__.

::

   class NoPredictionButAnalysis:
       def __init__(self, **kwargs):
           # kwargs are the training parameters
           self.kwargs = kwargs
       def fit_predict(self, X, y=None):
           # training and prediction on the same data,
           # returns one label per row of X
           ...
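
A minimal sketch of this behaviour, assuming synthetic data generated
with ``make_blobs``: ``SpectralClustering`` assigns a cluster to every
training point through ``fit_predict`` but has no ``predict`` method for
new points.

.. code:: ipython3

    from sklearn.cluster import SpectralClustering
    from sklearn.datasets import make_blobs

    # Two well-separated groups of points (illustrative data)
    X, _ = make_blobs(n_samples=60, centers=2, random_state=0)

    model = SpectralClustering(n_clusters=2, random_state=0)
    labels = model.fit_predict(X)      # one cluster label per training point

    # No way to label unseen points afterwards
    can_predict = hasattr(model, "predict")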

Limits of the concept
~~~~~~~~~~~~~~~~~~~~~

What if we want to reuse the outputs of a predictor to do something
else?

`VotingClassifier <https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html>`__

To be continued… in the last part.

Design is design; code is tinkering.

pickle
------

A model is:


-  a class, a pipeline, a list of processing steps defined **before**
   training
-  coefficients obtained **after** training

How can the result be saved? ->
`pickle <https://docs.python.org/3/library/pickle.html>`__

The case of dataframes
~~~~~~~~~~~~~~~~~~~~~~

.. code:: ipython3

    from pandas import DataFrame, read_csv
    df = DataFrame(X)
    df['label'] = y
    df.head()


.. raw:: html

    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>0</th>
          <th>1</th>
          <th>2</th>
          <th>3</th>
          <th>label</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>5.1</td>
          <td>3.5</td>
          <td>1.4</td>
          <td>0.2</td>
          <td>0</td>
        </tr>
        <tr>
          <th>1</th>
          <td>4.9</td>
          <td>3.0</td>
          <td>1.4</td>
          <td>0.2</td>
          <td>0</td>
        </tr>
        <tr>
          <th>2</th>
          <td>4.7</td>
          <td>3.2</td>
          <td>1.3</td>
          <td>0.2</td>
          <td>0</td>
        </tr>
        <tr>
          <th>3</th>
          <td>4.6</td>
          <td>3.1</td>
          <td>1.5</td>
          <td>0.2</td>
          <td>0</td>
        </tr>
        <tr>
          <th>4</th>
          <td>5.0</td>
          <td>3.6</td>
          <td>1.4</td>
          <td>0.2</td>
          <td>0</td>
        </tr>
      </tbody>
    </table>
    </div>


.. code:: ipython3

    df.to_csv("data_iris.csv")

.. code:: ipython3

    %timeit read_csv("data_iris.csv")


.. parsed-literal::
    1.77 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


.. code:: ipython3

    import pickle

.. code:: ipython3

    with open("data_iris.pickle", "wb") as f:
        pickle.dump(df, f)

.. code:: ipython3

    def load_from_pickle(name):
        with open(name, "rb") as f:
            return pickle.load(f)
    
    %timeit load_from_pickle("data_iris.pickle")


.. parsed-literal::
    264 µs ± 18.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


pickle is faster
~~~~~~~~~~~~~~~~


-  `read_csv <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html>`__:
   converts a text file into a dataframe -> intermediate format
   `csv <https://fr.wikipedia.org/wiki/Comma-separated_values>`__
-  `pickle <https://docs.python.org/3/library/pickle.html>`__: stores
   the data the way it is laid out in memory -> no conversion

.. code:: ipython3

    from jyquickhelper import RenderJsDot
    RenderJsDot('''digraph{ rankdir="LR";
        B [label="mémoire"]; C [label="csv"]; C2 [label="csv"];
        D [label="disque"]; B -> C [label="to_csv", color="red"];
        C -> D ; D -> C2 ;
        C2 -> B [label="read_csv", color="red"];
        B -> D [label="pickle.dump", color="blue"];
        D -> B [label="pickle.load", color="blue"];
    }''')


.. raw:: html

    <div id="Mdf8cbb0aacca4f8e88b3b175074c9e38-cont"><div id="Mdf8cbb0aacca4f8e88b3b175074c9e38" style="width:100%;height:100%;"></div></div>
    <script>

    require(['http://www.xavierdupre.fr/js/vizjs/viz.js'], function() { var svgGraph = Viz("digraph{ rankdir=\"LR\";\n    B [label=\"mémoire\"]; C [label=\"csv\"]; C2 [label=\"csv\"];\n    D [label=\"disque\"]; B -> C [label=\"to_csv\", color=\"red\"];\n    C -> D ; D -> C2 ;\n    C2 -> B [label=\"read_csv\", color=\"red\"];\n    B -> D [label=\"pickle.dump\", color=\"blue\"];\n    D -> B [label=\"pickle.load\", color=\"blue\"];\n}");
    document.getElementById('Mdf8cbb0aacca4f8e88b3b175074c9e38').innerHTML = svgGraph; });

    </script>


scikit-learn, pickle
~~~~~~~~~~~~~~~~~~~~

pickle (or joblib, which builds on it) is the usual way to save
*scikit-learn* models.

.. code:: ipython3

    with open("pipe.pickle", "wb") as f:
        pickle.dump(pipe, f)

.. code:: ipython3

    with open("pipe.pickle", "rb") as f:
        pipe2 = pickle.load(f)

.. code:: ipython3

    from numpy.testing import assert_almost_equal
    assert_almost_equal(pipe.predict(X_test), pipe2.predict(X_test))

Problems with pickle
~~~~~~~~~~~~~~~~~~~~


-  The in-memory state depends heavily on the installed libraries.
-  Changing the version of scikit-learn -> the in-memory state is
   different.
-  **Analogy**: pickle only keeps the coefficients as they sit in
   memory; they are, in a sense, encoded.
-  They can only be decoded with the same code.
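
One way to limit the damage is to record the library version next to the
pickled model and to check it before loading. The sketch below is only a
suggested convention: the dictionary ``meta`` and its key
``sklearn_version`` are hypothetical names, and the model is kept in
memory with ``pickle.dumps`` instead of being written to a file.

.. code:: ipython3

    import pickle
    import sklearn
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000).fit(X, y)

    # Saved side by side: the model and the version that produced it
    meta = {"sklearn_version": sklearn.__version__}
    payload = pickle.dumps(model)

    # Before loading, compare the stored version with the installed one
    same_version = meta["sklearn_version"] == sklearn.__version__
    if not same_version:
        raise RuntimeError(
            "model pickled with scikit-learn %s" % meta["sklearn_version"])

    restored = pickle.loads(payload)
    pred = restored.predict(X[:5])

The check does not make the pickle portable, it only turns a silent
mismatch into an explicit error.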

Separating the columns
~~~~~~~~~~~~~~~~~~~~~~

All columns go through the same processing.

.. code:: ipython3

    pipe = Pipeline([
        ('norm', Normalizer()),
        ('pca', PCA()),
        ('lr', LogisticRegression())
    ])

But that is not necessarily what we want.

.. code:: ipython3

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import MinMaxScaler
    
    pipe2 = Pipeline([
        ('multi', ColumnTransformer([
            ('c01', Normalizer(), [0, 1]),
            ('c23', MinMaxScaler(), [2, 3]),
        ])),
        ('pca', PCA()),
        ('lr', LogisticRegression())
    ])
    
    pipe2.fit(X_train, y_train);

.. code:: ipython3

    from mlinsights.plotting import pipeline2dot
    RenderJsDot(pipeline2dot(pipe2, X_train))


.. raw:: html

    <div id="M37a4b81a4e2746d3ab8842e993744541-cont"><div id="M37a4b81a4e2746d3ab8842e993744541" style="width:100%;height:100%;"></div></div>
    <script>

    require(['http://www.xavierdupre.fr/js/vizjs/viz.js'], function() { var svgGraph = Viz("digraph{\n  orientation=portrait;\n  nodesep=0.05;\n  ranksep=0.25;\n  sch0[label=\"<f0> X0|<f1> X1|<f2> X2|<f3> X3\",shape=record,fontsize=8];\n\n  node1[label=\"union\",shape=box,style=\"filled,rounded\",color=cyan,fontsize=12];\n  sch0:f0 -> node1;\n  sch0:f1 -> node1;\n  sch0:f2 -> node1;\n  sch0:f3 -> node1;\n  sch1[label=\"<f0> -v-0\",shape=record,fontsize=8];\n  node1 -> sch1:f0;\n\n  node2[label=\"Normalizer\",shape=box,style=\"filled,rounded\",color=cyan,fontsize=12];\n  sch1:f0 -> node2;\n  sch2[label=\"<f0> -v-1\",shape=record,fontsize=8];\n  node2 -> sch2:f0;\n\n  node3[label=\"union\",shape=box,style=\"filled,rounded\",color=cyan,fontsize=12];\n  sch0:f0 -> node3;\n  sch0:f1 -> node3;\n  sch0:f2 -> node3;\n  sch0:f3 -> node3;\n  sch3[label=\"<f0> -v-2\",shape=record,fontsize=8];\n  node3 -> sch3:f0;\n\n  node4[label=\"MinMaxScaler\",shape=box,style=\"filled,rounded\",color=cyan,fontsize=12];\n  sch3:f0 -> node4;\n  sch4[label=\"<f0> -v-3\",shape=record,fontsize=8];\n  node4 -> sch4:f0;\n\n  node5[label=\"union\",shape=box,style=\"filled,rounded\",color=cyan,fontsize=12];\n  sch2:f0 -> node5;\n  sch4:f0 -> node5;\n  sch5[label=\"<f0> -v-4\",shape=record,fontsize=8];\n  node5 -> sch5:f0;\n\n  node6[label=\"PCA\",shape=box,style=\"filled,rounded\",color=cyan,fontsize=12];\n  sch5:f0 -> node6;\n  sch6[label=\"<f0> -v-4\",shape=record,fontsize=8];\n  node6 -> sch6:f0;\n\n  node7[label=\"LogisticRegression\",shape=box,style=\"filled,rounded\",color=yellow,fontsize=12];\n  sch6:f0 -> node7;\n  sch7[label=\"<f0> PredictedLabel|<f1> Probabilities\",shape=record,fontsize=8];\n  node7 -> sch7:f0;\n  node7 -> sch7:f1;\n}");
    document.getElementById('M37a4b81a4e2746d3ab8842e993744541').innerHTML = svgGraph; });

    </script>


Concepts applied to a new regressor
-----------------------------------

We build a forest of trees, then reduce the number of trees with a
`Lasso <https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html>`__
regression.

.. code:: ipython3

    from sklearn.datasets import load_diabetes
    from sklearn.model_selection import train_test_split
    
    data = load_diabetes()
    X, y = data.data, data.target
    X_train, X_test, y_train, y_test = train_test_split(X, y)

Sketch of the algorithm
~~~~~~~~~~~~~~~~~~~~~~~

.. code:: ipython3

    import numpy
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import Lasso
    
    # Train a random forest
    clr = RandomForestRegressor()
    clr.fit(X_train, y_train)
    
    # Collect the prediction of every tree, one column per tree
    X_train_2 = numpy.zeros((X_train.shape[0], len(clr.estimators_)))
    estimators = numpy.array(clr.estimators_).ravel()
    for i, est in enumerate(estimators):
        pred = est.predict(X_train)
        X_train_2[:, i] = pred
    
    # Train a Lasso regression on the predictions of the trees
    lrs = Lasso(max_iter=10000)
    lrs.fit(X_train_2, y_train)
    lrs.coef_


.. parsed-literal::
    array([-0.00469714,  0.0221791 , -0.03849948,  0.00314431,  0.04879728,
           -0.00045039, -0.0054841 , -0.01130761, -0.01956316,  0.05802847,
           -0.00031975, -0.05406833, -0.04773371,  0.06614678,  0.00892759,
           -0.06309655,  0.03340401, -0.04168602,  0.02377001, -0.03671289,
            0.02627701,  0.00022712, -0.01083544, -0.04179967,  0.03231883,
           -0.02245547,  0.00971713,  0.01600841,  0.01458184, -0.03772706,
            0.02509486, -0.01068935, -0.04092312,  0.0541524 ,  0.00537527,
           -0.03710114,  0.017908  ,  0.02937607,  0.04451909,  0.0013495 ,
           -0.02321562, -0.04876043, -0.01734136,  0.03884741,  0.03373548,
            0.00811501,  0.0169834 , -0.02234235,  0.05643999,  0.00889717,
           -0.02046968,  0.00973609,  0.07077278,  0.01506631,  0.09280915,
            0.01589242, -0.02673953,  0.02240294, -0.00475286,  0.01830085,
            0.02026113,  0.03854988,  0.03195279,  0.0394844 ,  0.02784215,
            0.02402331,  0.06021017,  0.01825254,  0.01992086,  0.0188973 ,
            0.01556557,  0.04059752,  0.04422221,  0.00365708,  0.00389476,
           -0.00737055,  0.05960936, -0.04092342,  0.05995745,  0.06623417,
            0.02395334,  0.01308198,  0.08500338, -0.01354122,  0.0357201 ,
            0.01747697,  0.04941955,  0.05530153,  0.01663532,  0.04105603,
            0.02831484,  0.00386307, -0.00450148,  0.03319402, -0.01291577,
           -0.01517642, -0.0147378 ,  0.05063852, -0.00490926,  0.00825488])
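
With the default penalty, few or none of the coefficients above are
exactly zero, so few trees would be dropped. The sketch below, which
assumes a smaller forest of 20 trees and arbitrary values of ``alpha``,
counts how many trees keep a non-zero weight as the penalty grows. The
names ``tree_pred`` and ``kept`` are illustrative.

.. code:: ipython3

    import numpy
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import Lasso

    X, y = load_diabetes(return_X_y=True)
    forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

    # One column per tree: the prediction of that tree
    tree_pred = numpy.column_stack([t.predict(X) for t in forest.estimators_])

    # Number of trees with a non-zero weight for several penalties
    kept = {}
    for alpha in [1.0, 100.0, 10000.0]:
        lasso = Lasso(alpha=alpha, max_iter=10000).fit(tree_pred, y)
        kept[alpha] = int(numpy.count_nonzero(lasso.coef_))

The penalty ``alpha`` is therefore the knob that trades the size of the
forest against its accuracy.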


What we want
~~~~~~~~~~~~

::

   class LassoRandomForestRegressor:
       def fit(self, X, y):
           # train a random forest
           # select the trees to keep with a Lasso
           # remove the trees associated with a zero weight
           return self
       def predict(self, X):
           # return a weighted average of the predictions
           return ...

Implementation
~~~~~~~~~~~~~~

`lasso_random_forest_regressor.py <https://github.com/sdpython/ensae_teaching_cs/blob/master/src/ensae_teaching_cs/ml/lasso_random_forest_regressor.py>`__

.. code:: ipython3

    import numpy
    from sklearn.base import BaseEstimator, RegressorMixin, clone
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import Lasso
    
    
    class LassoRandomForestRegressor(BaseEstimator, RegressorMixin):
        """Random forest whose trees are selected and weighted by a Lasso."""
        
        def __init__(self, rf_estimator=None, lasso_estimator=None):
            BaseEstimator.__init__(self)
            RegressorMixin.__init__(self)
            # Default models when none is given
            if rf_estimator is None:
                rf_estimator = RandomForestRegressor()
            if lasso_estimator is None:
                lasso_estimator = Lasso()
            self.rf_estimator = rf_estimator
            self.lasso_estimator = lasso_estimator
    
        def fit(self, X, y, sample_weight=None):
            # Train a copy of the forest, the parameter itself stays untouched
            self.rf_estimator_ = clone(self.rf_estimator)
            self.rf_estimator_.fit(X, y, sample_weight)
    
            # One column per tree: the prediction of that tree
            estims = self.rf_estimator_.estimators_
            estimators = numpy.array(estims).ravel()
            X2 = numpy.zeros((X.shape[0], len(estimators)))
            for i, est in enumerate(estimators):
                pred = est.predict(X)
                X2[:, i] = pred
    
            # Train the Lasso on the predictions of the trees
            self.lasso_estimator_ = clone(self.lasso_estimator)
            self.lasso_estimator_.fit(X2, y)
    
            # Keep only the trees with a non-zero weight
            not_null = self.lasso_estimator_.coef_ != 0
            self.intercept_ = self.lasso_estimator_.intercept_
            self.estimators_ = estimators[not_null]
            self.coef_ = self.lasso_estimator_.coef_[not_null]
            return self
    
        def predict(self, X):
            # Weighted sum of the predictions of the remaining trees
            prediction = None
            for i, est in enumerate(self.estimators_):
                pred = est.predict(X)
                if prediction is None:
                    prediction = pred * self.coef_[i]
                else:
                    prediction += pred * self.coef_[i]
            return prediction + self.intercept_

.. code:: ipython3

    ls = LassoRandomForestRegressor()
    ls.fit(X_train, y_train)


.. parsed-literal::
    C:\xavierdupre\__home_\github_fork\scikit-learn\sklearn\linear_model\_coordinate_descent.py:634: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.825e+04, tolerance: 1.935e+02
      model = cd_fast.enet_coordinate_descent(


.. parsed-literal::

    LassoRandomForestRegressor(lasso_estimator=Lasso(),
                               rf_estimator=RandomForestRegressor())


Results
~~~~~~~

Score of the random forest alone.

.. code:: ipython3

    clr.score(X_test, y_test)


.. parsed-literal::
    0.5704306565411461


Score of the reduced random forest.

.. code:: ipython3

    ls.score(X_test, y_test)


.. parsed-literal::
    0.46294352058906363


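The number of trees kept, next to the size of the original forest. In
this run the reduction is marginal: one tree out of 100 is removed.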

.. code:: ipython3

    len(ls.estimators_), len(clr.estimators_)


.. parsed-literal::
    (99, 100)
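
One way to picture the reduction, assuming it amounts to fitting a Lasso
on the per-tree predictions and keeping the trees with a nonzero weight.
The sketch below is not the ``LassoRandomForestRegressor`` class itself:
the synthetic data, the names ``Z`` and ``kept_by_alpha`` and the values
of ``alpha`` are illustrative choices. It counts the trees that survive
as the penalty grows.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

# Synthetic regression problem (assumption: any regression data would do).
X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)

# A small forest of shallow trees keeps the example fast.
forest = RandomForestRegressor(n_estimators=30, max_depth=4, random_state=0)
forest.fit(X, y)

# One column per tree: Z[i, j] is the prediction of tree j for sample i.
Z = np.column_stack([tree.predict(X) for tree in forest.estimators_])

# Fit a Lasso on the tree predictions and count the nonzero weights.
kept_by_alpha = {}
for alpha in (0.01, 1.0, 100.0, 1e6):
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(Z, y)
    kept_by_alpha[alpha] = int(np.count_nonzero(lasso.coef_))
    print(f"alpha={alpha:g}: {kept_by_alpha[alpha]} trees kept out of {Z.shape[1]}")
```

When ``alpha`` is small relative to the scale of the target, the penalty
removes few trees, which may explain a reduction as small as the one
observed above.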


AIC criterion
~~~~~~~~~~~~~

The number of trees can also be selected with an
`AIC <https://fr.wikipedia.org/wiki/Crit%C3%A8re_d%27information_d%27Akaike>`__
criterion, using the model
`LassoLarsIC <https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoLarsIC.html#sklearn.linear_model.LassoLarsIC>`__.

.. code:: ipython3

    from sklearn.linear_model import LassoLarsIC
    ls_aic = LassoRandomForestRegressor(lasso_estimator=LassoLarsIC())
    ls_aic.fit(X_train, y_train)


.. raw:: html

    <pre>LassoRandomForestRegressor(lasso_estimator=LassoLarsIC(),
                               rf_estimator=RandomForestRegressor())</pre>


.. code:: ipython3

    ls_aic.score(X_test, y_test)


.. parsed-literal::
    0.4833526611115916


.. code:: ipython3

    len(ls_aic.estimators_)


.. parsed-literal::
    48
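
To see what the information criterion does on its own, here is a minimal
sketch that applies ``LassoLarsIC`` directly to a matrix of per-tree
predictions. It assumes that this is the kind of input the selection
step receives; the data, the forest size and the names ``Z_train`` and
``kept`` are hypothetical. The fitted attributes ``alphas_``,
``criterion_`` and ``alpha_`` expose the regularization path, the
criterion value along it and the value finally retained.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoLarsIC
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the notebook's dataset (assumption).
X, y = make_regression(n_samples=600, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=30, max_depth=4, random_state=0)
forest.fit(X_train, y_train)

# One column per tree, one row per training sample.
Z_train = np.column_stack([t.predict(X_train) for t in forest.estimators_])

kept = {}
for criterion in ("aic", "bic"):
    selector = LassoLarsIC(criterion=criterion).fit(Z_train, y_train)
    # Trees with a nonzero coefficient are the ones retained.
    kept[criterion] = int(np.count_nonzero(selector.coef_))
    # alpha_ is the point of the path where the criterion is lowest.
    best = int(np.argmin(selector.criterion_))
    print(criterion, kept[criterion], selector.alpha_, selector.alphas_[best])
```

No penalty has to be tuned by hand here: the criterion picks the point
on the path, and therefore the number of trees.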


Pickling
~~~~~~~~

As long as the conventions of the *scikit-learn* API are followed,
serialization with ``pickle`` works without any extra code.

.. code:: ipython3

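    import pickle
    from io import BytesIO

    # Serialize the fitted model into an in-memory buffer.
    by = BytesIO()
    pickle.dump(ls, by)
    # Reload it from a fresh buffer, as another process would.
    by2 = BytesIO(by.getvalue())
    mod2 = pickle.load(by2)
    # The restored model should give the same predictions.
    p1 = ls.predict(X_test)
    p2 = mod2.predict(X_test)
    p1[:5], p2[:5]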


.. parsed-literal::
    (array([277.10103941, 221.38733112,  63.87889654, 205.27390858,
             80.4188308 ]),
     array([277.10103941, 221.38733112,  63.87889654, 205.27390858,
             80.4188308 ]))
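
The same round trip can go through a file, which is closer to what a
deployment would do. The sketch below is only an illustration: it uses a
stock ``RandomForestRegressor`` in place of the custom model, synthetic
data, and a hypothetical file name ``model.pkl`` in a temporary folder.
It checks that the predictions are identical before and after.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Stand-in model and data (assumption: the custom model behaves the same way).
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "model.pkl"  # hypothetical file name
    # Write the fitted model to disk.
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # Read it back, as a separate serving process would.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# Compare the full prediction vectors, not only the first values.
same = np.array_equal(model.predict(X), restored.predict(X))
print(same)
```

Loading a pickled custom model requires its class to be importable in
the process that reads the file.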


Conclusion
----------

The API works like Lego bricks: everything fits together as long as the
original dimensions are respected.

.. code:: ipython3

    show_images('lego.png', 'lego-architecture-studio-8804.jpg', figsize=(16,6));


.. image:: sklearn_api_74_0.png


.. code:: ipython3

    show_images('vue-interieure-cite-de-musique-christian-de.jpg', 'PaulPoiret-7.jpg', figsize=(16,6));


.. image:: sklearn_api_75_0.png


.. code:: ipython3

    show_images('lycee_chanzy_maquette.jpg', figsize=(16,10));


.. image:: sklearn_api_76_0.png