2A.ml - Trees, hyperparameters, overfitting


Overfitting occurs when predictions on new data are noticeably worse than those obtained on the training set. Random forests are less prone to overfitting than the decision trees they are built from. A few illustrations follow.

from jyquickhelper import add_notebook_menu
add_notebook_menu()
%matplotlib inline

Generated data

We generate a cloud of points y_i = \sin(x_i) + \epsilon_i.

import numpy, numpy.random

def generate_data(n):
    # n points evenly spread over [0, 6], y = sin(x) + Gaussian noise
    X = numpy.arange(n) / n * 6
    noise = numpy.random.normal(size=(n, 1)) / 2
    si = numpy.sin(X).reshape((n, 1))
    Y = noise + si
    X = X.reshape((n, 1))
    data = numpy.hstack((X, Y))
    return data, X, Y
import pandas
n = 100
data, X, Y = generate_data(n)
df = pandas.DataFrame(data, columns=["X", "Y"])
df.plot(x="X", y="Y", kind="scatter", figsize=(10,4));
../_images/ml_a_tree_overfitting_4_0.png

Different decision trees

We look at the influence of a few hyperparameters on the model produced by training.

max_depth

A tree represents a step function. The depth bounds the number of leaves, i.e. the number of distinct predicted values: at most 2^{max\_depth}.
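A minimal sketch of this bound, assuming the same kind of sin + noise toy data as above and scikit-learn >= 0.21 for `get_n_leaves`:

```python
import numpy
from sklearn.tree import DecisionTreeRegressor

# Same kind of toy data as above: y = sin(x) + Gaussian noise on [0, 6].
rng = numpy.random.RandomState(0)
X = (numpy.arange(100) / 100 * 6).reshape((100, 1))
Y = numpy.sin(X).ravel() + rng.normal(size=100) / 2

for max_depth in [1, 2, 3, 6]:
    clr = DecisionTreeRegressor(max_depth=max_depth).fit(X, Y)
    # A binary tree of depth d has at most 2**d leaves,
    # hence at most 2**d distinct predicted values.
    assert clr.get_n_leaves() <= 2 ** max_depth
```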

from sklearn.tree import DecisionTreeRegressor
ax = df.plot(x="X", y="Y", kind="scatter", figsize=(10,6), label="données", title="DecisionTree")
Xi = (numpy.arange(n*10)/n*6/10).reshape((n*10, 1))
for max_depth in [1, 2, 3, 6, 10]:
    clr = DecisionTreeRegressor(max_depth=max_depth)
    clr.fit(X, Y)
    pred = clr.predict(Xi)
    ex = pandas.DataFrame(Xi, columns=["Xi"])
    ex["pred"] = pred
    ex.sort_values("Xi").plot(x="Xi", y="pred", kind="line", label="max_depth=%d" % max_depth, ax=ax)
../_images/ml_a_tree_overfitting_7_0.png

min_samples_split=10

Each leaf of a tree predicts a value computed from a set of observations. A node is split only if it holds at least min_samples_split observations. This mechanism limits overfitting by making each leaf more representative.
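A small check of this mechanism, as a sketch on the same kind of toy data: every internal node of the fitted tree must hold at least min_samples_split observations.

```python
import numpy
from sklearn.tree import DecisionTreeRegressor

rng = numpy.random.RandomState(0)
X = (numpy.arange(100) / 100 * 6).reshape((100, 1))
Y = numpy.sin(X).ravel() + rng.normal(size=100) / 2

clr = DecisionTreeRegressor(max_depth=10, min_samples_split=10).fit(X, Y)
tree = clr.tree_
internal = tree.children_left != -1  # leaves have children_left == -1
# Every node that was actually split held at least min_samples_split samples.
assert (tree.n_node_samples[internal] >= 10).all()
```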

ax = df.plot(x="X", y="Y", kind="scatter", figsize=(10,6), label="données", title="DecisionTree, min_samples_split=10")
Xi = (numpy.arange(n*10)/n*6/10).reshape((n*10, 1))
for max_depth in [1, 2, 3, 6, 10]:
    clr = DecisionTreeRegressor(max_depth=max_depth, min_samples_split=10)
    clr.fit(X, Y)
    pred = clr.predict(Xi)
    ex = pandas.DataFrame(Xi, columns=["Xi"])
    ex["pred"] = pred
    ex.sort_values("Xi").plot(x="Xi", y="pred", kind="line", label="max_depth=%d" % max_depth, ax=ax)
../_images/ml_a_tree_overfitting_9_0.png

Random Forest

We study the same two parameters for a random forest, the difference being that this model combines the outputs produced by an ensemble of decision trees.
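In scikit-learn this combination is in fact a uniform average; a quick sketch on the same kind of toy data verifies that the forest prediction equals the mean of the individual tree predictions exposed in estimators_.

```python
import numpy
from sklearn.ensemble import RandomForestRegressor

rng = numpy.random.RandomState(0)
X = (numpy.arange(100) / 100 * 6).reshape((100, 1))
Y = numpy.sin(X).ravel() + rng.normal(size=100) / 2

clr = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=0)
clr.fit(X, Y)
per_tree = numpy.vstack([t.predict(X) for t in clr.estimators_])
# The forest prediction is the plain mean of the tree predictions.
assert numpy.allclose(per_tree.mean(axis=0), clr.predict(X))
```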

max_depth

from sklearn.ensemble import RandomForestRegressor
ax = df.plot(x="X", y="Y", kind="scatter", figsize=(10,6), label="données", title="RandomForest")
Xi = (numpy.arange(n*10)/n*6/10).reshape((n*10, 1))
for max_depth in [1, 2, 3, 6, 10]:
    clr = RandomForestRegressor(max_depth=max_depth)
    clr.fit(X, Y.ravel())
    pred = clr.predict(Xi)
    ex = pandas.DataFrame(Xi, columns=["Xi"])
    ex["pred"] = pred
    ex.sort_values("Xi").plot(x="Xi", y="pred", kind="line", label="max_depth=%d" % max_depth, ax=ax)
../_images/ml_a_tree_overfitting_12_1.png

n_estimators

n_estimators is the number of iterations; it is exactly the number of decision trees in the forest in the case of a regression or a binary classification.
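A one-line check of this, sketched on the same kind of toy data: after training, the forest exposes exactly n_estimators trees.

```python
import numpy
from sklearn.ensemble import RandomForestRegressor

rng = numpy.random.RandomState(0)
X = (numpy.arange(100) / 100 * 6).reshape((100, 1))
Y = numpy.sin(X).ravel() + rng.normal(size=100) / 2

clr = RandomForestRegressor(n_estimators=7, max_depth=3, random_state=0).fit(X, Y)
# For a regression, estimators_ holds exactly n_estimators fitted trees.
assert len(clr.estimators_) == 7
```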

from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt
f, axarr = plt.subplots(2, sharex=True)
df.plot(x="X", y="Y", kind="scatter", figsize=(10,6), label="données", title="RandomForest md=2", ax=axarr[0])
df.plot(x="X", y="Y", kind="scatter", figsize=(10,6), label="données", title="RandomForest md=4", ax=axarr[1])
Xi = (numpy.arange(n*10)/n*6/10).reshape((n*10, 1))
for i, max_depth in enumerate([2, 4]):
    for n_estimators in [1, 2, 10]:
        clr = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth)
        clr.fit(X, Y.ravel())
        pred = clr.predict(Xi)
        ex = pandas.DataFrame(Xi, columns=["Xi"])
        ex["pred"] = pred
        ex.sort_values("Xi").plot(x="Xi", y="pred", kind="line",
                                  label="n_estimators=%d, max_depth=%d" % (n_estimators, max_depth), ax=axarr[i])
../_images/ml_a_tree_overfitting_14_0.png

min_samples_split=10

ax = df.plot(x="X", y="Y", kind="scatter", figsize=(10,6), label="données", title="RandomForest")
Xi = (numpy.arange(n*10)/n*6/10).reshape((n*10, 1))
for max_depth in [1, 2, 3, 6, 10]:
    clr = RandomForestRegressor(max_depth=max_depth, min_samples_split=10)
    clr.fit(X, Y.ravel())
    pred = clr.predict(Xi)
    ex = pandas.DataFrame(Xi, columns=["Xi"])
    ex["pred"] = pred
    ex.sort_values("Xi").plot(x="Xi", y="pred", kind="line", label="max_depth=%d" % max_depth, ax=ax)
../_images/ml_a_tree_overfitting_16_1.png

Training set and test set

It is one of the basic principles of machine learning: never evaluate a model on its training data. With enough leaves, a decision tree learns the value to predict for every observation; the model learns the noise. Where the information ends and the noise begins, the cursor is not always easy to place.

Decision Tree

data, X, Y = generate_data(1000)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score
min_samples_splits = [2, 5, 10]
rows = []
for max_depth in range(1, 15):
    d = dict(max_depth=max_depth)
    for min_samples_split in min_samples_splits:
        clr = DecisionTreeRegressor(max_depth=max_depth, min_samples_split=min_samples_split)
        clr.fit(X_train, y_train)
        pred = clr.predict(X_test)
        score = r2_score(y_test, pred)
        d["min_samples_split=%d" % min_samples_split] = score
    rows.append(d)
pandas.DataFrame(rows).plot(x="max_depth", y=["min_samples_split=%d" % _ for _ in min_samples_splits]);
../_images/ml_a_tree_overfitting_21_0.png

The peak on the test set shows that past a certain point, performance decreases: at that precise moment, the model starts learning the noise of the training set. It overfits. We also notice that the model overfits less when min_samples_split=10.
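The gap can also be measured directly. The sketch below regenerates similar data with an assumed random_state for reproducibility and prints train and test R2 side by side: the train score keeps climbing with max_depth while the test score stalls.

```python
import numpy
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

rng = numpy.random.RandomState(0)
X = (numpy.arange(1000) / 1000 * 6).reshape((1000, 1))
Y = numpy.sin(X).ravel() + rng.normal(size=1000) / 2
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.33, random_state=42)

scores = {}
for max_depth in [2, 6, 14]:
    clr = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    clr.fit(X_train, y_train)
    scores[max_depth] = (r2_score(y_train, clr.predict(X_train)),
                         r2_score(y_test, clr.predict(X_test)))
    print(max_depth, scores[max_depth])

# The deep tree memorizes the training noise: train R2 far above test R2.
assert scores[14][0] > scores[14][1]
```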

Random Forest

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
rows = []
for n_estimators in range(1, 11):
    for max_depth in range(1, 11):
        for min_samples_split in [2, 5, 10]:
            clr = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth,
                                        min_samples_split=min_samples_split)
            clr.fit(X_train, y_train.ravel())
            pred = clr.predict(X_test)
            score = r2_score(y_test, pred)
            d = dict(max_depth=max_depth, n_estimators=n_estimators,
                     min_samples_split=min_samples_split, score=score)
            rows.append(d)
pl = pandas.DataFrame(rows)
pl.head()
   max_depth  min_samples_split  n_estimators     score
0          1                  2             1  0.510104
1          1                  5             1  0.493350
2          1                 10             1  0.470168
3          2                  2             1  0.543408
4          2                  5             1  0.584557
ax = pl[(pl.min_samples_split==10) & (pl.n_estimators==2)].plot(x="max_depth", y="score", label="n=2")
for i in (4,6,8,10):
    pl[(pl.min_samples_split==10) & (pl.n_estimators==i)].plot(x="max_depth", y="score", label="n=%d"%i, ax=ax)
../_images/ml_a_tree_overfitting_25_0.png
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection='3d')
for mss, c in [(2, "b"), (10, "r")]:
    piv = pl[pl.min_samples_split == mss].pivot(
        index="n_estimators", columns="max_depth", values="score")
    pivX = piv.copy()
    pivY = piv.copy()
    for col in piv.columns:
        pivX.loc[:, col] = piv.index
    for row in piv.index:
        pivY.loc[row, :] = piv.columns
    ax.plot_wireframe(pivX.values, pivY.values, piv.values, color=c)
ax.set_xlabel("n_estimators")
ax.set_ylabel("max_depth");
../_images/ml_a_tree_overfitting_26_1.png

Neural networks

On this particular problem, gradient-based methods perform worse. They also look less stable: the error curve is much more jagged than the one obtained with random forests. This kind of optimization is more sensitive to local extrema.
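Part of the jitter comes from random initialization and from stopping at max_iter=400 before convergence (hence the stream of ConvergenceWarning below). A sketch, with assumed values for max_iter and random_state, that stabilizes a single run:

```python
import numpy
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = numpy.random.RandomState(0)
X = (numpy.arange(1000) / 1000 * 6).reshape((1000, 1))
Y = numpy.sin(X).ravel() + rng.normal(size=1000) / 2
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.33, random_state=42)

# Fixing the seed and giving the optimizer enough iterations makes the
# run reproducible and usually lets it converge.
clr = MLPRegressor(hidden_layer_sizes=(50,), activation="relu",
                   max_iter=2000, random_state=0)
clr.fit(X_train, y_train)
score = r2_score(y_test, clr.predict(X_test))
print(score)
```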

from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score
rows = []
for nb in range(20, 300, 5):
    clr = MLPRegressor(hidden_layer_sizes=(nb,), activation="relu", max_iter=400)
    clr.fit(X_train, y_train.ravel())
    pred = clr.predict(X_test)
    score = r2_score(y_test, pred)
    if score > 0:
        d = dict(nb=nb, score=score)
        rows.append(d)
pandas.DataFrame(rows).plot(x="nb", y=["score"]);
../_images/ml_a_tree_overfitting_28_1.png

Neural networks, alpha=0

from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score
rows = []
for nb in range(20, 300, 5):
    clr = MLPRegressor(hidden_layer_sizes=(nb,), activation="relu", alpha=0, tol=1e-6, max_iter=400)
    clr.fit(X_train, y_train.ravel())
    pred = clr.predict(X_test)
    score = r2_score(y_test, pred)
    if score > 0:
        d = dict(nb=nb, score=score)
        rows.append(d)
pandas.DataFrame(rows).plot(x="nb", y=["score"]);
../_images/ml_a_tree_overfitting_30_1.png

Confidence intervals

We use the forest-confidence-interval module. It relies on the jackknife to estimate confidence intervals: it recomputes the forest output several times, each time leaving out one tree from the evaluation. The theory relies on a resampling of the training set which the article considers equivalent to the one scikit-learn performs to build each tree.

The idea comes from the article Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife. One or two bugs probably remain: the algorithm sometimes produces missing values that should not occur, and the confidence intervals can vary a lot from one training run to the next.
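This is not the jackknife estimator from the article; as a cruder illustration of per-point uncertainty, one can simply use the spread of the individual tree predictions around the forest mean (a sketch on regenerated toy data, with an assumed random_state):

```python
import numpy
from sklearn.ensemble import RandomForestRegressor

rng = numpy.random.RandomState(0)
X = (numpy.arange(1000) / 1000 * 6).reshape((1000, 1))
Y = numpy.sin(X).ravel() + rng.normal(size=1000) / 2

clr = RandomForestRegressor(n_estimators=40, max_depth=6, random_state=0)
clr.fit(X, Y)

Xs = numpy.arange(0, 6, 0.1).reshape((-1, 1))
per_tree = numpy.vstack([t.predict(Xs) for t in clr.estimators_])
mean = per_tree.mean(axis=0)   # the forest prediction
std = per_tree.std(axis=0)     # crude per-point uncertainty band
assert mean.shape == std.shape == (Xs.shape[0],)
```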

clr = RandomForestRegressor(min_samples_split=2)
clr.fit(X_train, y_train.ravel())
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
import forestci
pred = clr.predict(X_test)
mpg_inbag = forestci.calc_inbag(X_train.shape[0], clr)
mpg_V_IJ_unbiased = forestci.random_forest_error(clr, X_train=X_train, X_test=X_test, inbag=mpg_inbag)

plt.errorbar(y_test, pred, yerr=numpy.sqrt(mpg_V_IJ_unbiased), fmt='o')
plt.plot([-2, 2], [-2, 2], '--')
plt.xlabel('truth')
plt.ylabel('prediction')
plt.title("min_samples_split=2");
../_images/ml_a_tree_overfitting_37_1.png
clr = RandomForestRegressor(n_estimators=40, max_depth=6)
clr.fit(X_train, y_train.ravel())
import forestci
pred = clr.predict(X_test)
mpg_inbag = forestci.calc_inbag(X_train.shape[0], clr)
mpg_V_IJ_unbiased = forestci.random_forest_error(clr, X_train=X_train, X_test=X_test, inbag=mpg_inbag)

plt.errorbar(y_test, pred, yerr=numpy.sqrt(mpg_V_IJ_unbiased), fmt='o')
plt.plot([-2, 2], [-2, 2], '--')
plt.xlabel('truth')
plt.ylabel('prediction');
../_images/ml_a_tree_overfitting_38_1.png
X_plt = numpy.arange(start=0, stop=6, step=0.1)
X_plt = X_plt.reshape((len(X_plt), 1))
pred = clr.predict(X_plt)
mpg_inbag = forestci.calc_inbag(X_train.shape[0], clr)
mpg_V_IJ_unbiased = forestci.random_forest_error(clr, X_train=X_train, X_test=X_plt, inbag=mpg_inbag)

df = pandas.DataFrame(numpy.hstack((X, Y)), columns=["X", "Y"])
ax = df.plot(x="X", y="Y", kind="scatter")
ax.errorbar(X_plt, pred, yerr=numpy.sqrt(mpg_V_IJ_unbiased), fmt='o', color="r")

plt.xlabel('X')
plt.ylabel('Y');
../_images/ml_a_tree_overfitting_39_0.png
Xs = numpy.arange(start=0, stop=6, step=0.001)
Xs = Xs.reshape((len(Xs), 1))
ps = clr.predict(Xs)
ps = ps.reshape((len(ps), 1))
ci = forestci.random_forest_error(clr, X_train=X_train, X_test=Xs, inbag=mpg_inbag)
ci = numpy.sqrt(ci).reshape((len(ci), 1))

df = pandas.DataFrame(numpy.hstack((Xs, ps, ps + ci, ps-ci)), columns=["X", "Y", "Y+", "Y-"])
ax = df.plot(x="X", y=["Y", "Y+", "Y-"], kind="line", figsize=(14,4))
plt.xlabel('X')
plt.ylabel('Y');
../_images/ml_a_tree_overfitting_40_0.png

XGBoost

clr = RandomForestRegressor(n_estimators=10, max_depth=2)
clr.fit(X_train, y_train.ravel())

from xgboost import XGBRegressor
clrx = XGBRegressor(n_estimators=10, max_depth=2)
clrx.fit(X_train, y_train.ravel())
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=2, min_child_weight=1, missing=None, n_estimators=10,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
Xs = numpy.arange(start=0, stop=6, step=0.001)
Xs = Xs.reshape((len(Xs), 1))
ps = clr.predict(Xs)
ps = ps.reshape((len(ps), 1))
psx = clrx.predict(Xs)
psx = psx.reshape((len(psx), 1))

df = pandas.DataFrame(numpy.hstack((Xs, ps, psx)), columns=["X", "Y sk", "Y xg"])
ax = df.plot(x="X", y=["Y sk", "Y xg"], kind="line", lw=2)
ax.plot(X, Y, 'g.', ms=1)
plt.xlabel('X')
plt.ylabel('Y');
../_images/ml_a_tree_overfitting_43_0.png