.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "gyexamples/plot_logistic_decision.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_gyexamples_plot_logistic_decision.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_gyexamples_plot_logistic_decision.py:


.. _l-example-logistic-decision:

Arbre d'indécision
==================

La construction d'un arbre de décision appliqué à une
classification binaire suppose qu'on puisse
déterminer un seuil qui sépare les deux classes ou tout
du moins qui aboutisse à deux sous-ensemble dans lesquels
une classe est majoritaire. Mais certains cas, c'est une
chose compliquée.


.. contents::
    :local:

Un cas simple et un cas compliqué
+++++++++++++++++++++++++++++++++

Il faut choisir un seuil sur l'axe des abscisses qui
permette de classer le jeu de données.

.. GENERATED FROM PYTHON SOURCE LINES 24-61

.. code-block:: default

    import numpy
    import matplotlib.pyplot as plt
    from pandas import DataFrame


    def random_set_1d(n, kind):
        x = numpy.random.rand(n) * 3 - 1
        if kind:
            y = numpy.empty(x.shape, dtype=numpy.int32)
            y[x < 0] = 0
            y[(x >= 0) & (x <= 1)] = 1
            y[x > 1] = 0
        else:
            y = numpy.empty(x.shape, dtype=numpy.int32)
            y[x < 0] = 0
            y[x >= 0] = 1
        x2 = numpy.random.rand(n)
        return numpy.vstack([x, x2]).T, y


    def plot_ds(X, y, ax=None, title=None):
        if ax is None:
            ax = plt.gca()
        ax.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor='k', lw=0.5)
        if title is not None:
            ax.set_title(title)
        return ax


    X1, y1 = random_set_1d(1000, False)
    X2, y2 = random_set_1d(1000, True)

    fig, ax = plt.subplots(1, 2, figsize=(12, 6), sharey=True)
    plot_ds(X1, y1, ax=ax[0], title="easy")
    plot_ds(X2, y2, ax=ax[1], title="difficult")


.. image-sg:: /gyexamples/images/sphx_glr_plot_logistic_decision_001.png
   :alt: easy, difficult
   :srcset: /gyexamples/images/sphx_glr_plot_logistic_decision_001.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    <Axes: title={'center': 'difficult'}>


.. GENERATED FROM PYTHON SOURCE LINES 62-73

Seuil de décision
-----------------

Les arbres de décision utilisent comme critère
le critère de `Gini <https://fr.wikipedia.org/wiki/
Arbre_de_d%C3%A9cision_(apprentissage)#Cas_des_arbres_de_classification>`_
ou l'`entropie <https://fr.wikipedia.org/wiki/Entropie_de_Shannon>`_.
L'apprentissage d'une régression logistique
s'appuie sur la :ref:`log-vraisemblance <l-lr-log-likelihood>`
du jeu de données. On regarde l'évolution de ces critères
en fonction des différents seuils possibles.

.. GENERATED FROM PYTHON SOURCE LINES 73-120

.. code-block:: default


    def plog2(p):
        if p == 0:
            return 0
        return p * numpy.log(p) / numpy.log(2)


    def logistic(x):
        return 1. / (1. + numpy.exp(-x))


    def likelihood(x, y, theta=1., th=0.):
        lr = logistic((x - th) * theta)
        return y * lr + (1. - y) * (1 - lr)


    def criteria(X, y):
        res = numpy.empty((X.shape[0], 8))
        res[:, 0] = X[:, 0]
        res[:, 1] = y
        order = numpy.argsort(res[:, 0])
        res = res[order, :].copy()
        x = res[:, 0].copy()
        y = res[:, 1].copy()

        for i in range(1, res.shape[0] - 1):
            # gini
            p1 = numpy.sum(y[:i]) / i
            p2 = numpy.sum(y[i:]) / (y.shape[0] - i)
            res[i, 2] = p1
            res[i, 3] = p2
            res[i, 4] = 1 - p1**2 - (1 - p1)**2 + 1 - p2**2 - (1 - p2)**2
            res[i, 5] = - plog2(p1) - plog2(1 - p1) - plog2(p2) - plog2(1 - p2)
            th = x[i]
            res[i, 6] = logistic(th * 10.)
            res[i, 7] = numpy.sum(likelihood(x, y, 10., th)) / res.shape[0]
        return DataFrame(res[1:-1], columns=[
            'X', 'y', 'p1', 'p2', 'Gini', 'Gain', 'lr', 'LL-10'])


    X1, y1 = random_set_1d(1000, False)
    X2, y2 = random_set_1d(1000, True)

    df = criteria(X1, y1)
    print(df.head())


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

              X    y   p1        p2      Gini      Gain        lr     LL-10
    0 -0.997911  0.0  0.0  0.698699  0.421038  0.882876  0.000046  0.718692
    1 -0.996744  0.0  0.0  0.699399  0.420480  0.882025  0.000047  0.718865
    2 -0.993137  0.0  0.0  0.700100  0.419920  0.881168  0.000049  0.719406
    3 -0.993011  0.0  0.0  0.700803  0.419356  0.880307  0.000049  0.719425
    4 -0.979667  0.0  0.0  0.701508  0.418789  0.879440  0.000056  0.721511


.. GENERATED FROM PYTHON SOURCE LINES 121-122

Et visuellement...

.. GENERATED FROM PYTHON SOURCE LINES 122-142

.. code-block:: default


    def plot_ds(X, y, ax=None, title=None):
        if ax is None:
            ax = plt.gca()
        ax.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor='k', lw=0.5)
        if title is not None:
            ax.set_title(title)
        return ax


    df1 = criteria(X1, y1)
    df2 = criteria(X2, y2)

    fig, ax = plt.subplots(1, 2, figsize=(12, 6), sharey=True)
    plot_ds(X1, y1, ax=ax[0], title="easy")
    plot_ds(X2, y2, ax=ax[1], title="difficult")
    df1.plot(x='X', y=['Gini', 'Gain', 'LL-10', 'p1', 'p2'], ax=ax[0], lw=5.)
    df2.plot(x='X', y=['Gini', 'Gain', 'LL-10', 'p1', 'p2'], ax=ax[1], lw=5.)


.. image-sg:: /gyexamples/images/sphx_glr_plot_logistic_decision_002.png
   :alt: easy, difficult
   :srcset: /gyexamples/images/sphx_glr_plot_logistic_decision_002.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    <Axes: title={'center': 'difficult'}, xlabel='X'>


.. GENERATED FROM PYTHON SOURCE LINES 143-153

Le premier exemple est le cas simple et tous les
indicateurs trouvent bien la fontière entre les deux classes
comme un extremum sur l'intervalle considéré.
Le second cas est linéairement non séparable.
Aucun des indicateurs ne semble trouver une des
deux frontières. La log-vraisemblance montre deux
maxima. L'un est bien situé sur une frontière, le second
est situé à une extrémité de l'intervalle, ce qui revient
à construire un classifier qui retourné une réponse
constante. C'est donc inutile.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  3.845 seconds)


.. _sphx_glr_download_gyexamples_plot_logistic_decision.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_logistic_decision.py <plot_logistic_decision.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_logistic_decision.ipynb <plot_logistic_decision.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_