2018-10-09 Ensemble, Gradient, Boosting…


The notebook explores a few particularities of learning algorithms to explain certain numerical results. The AdaBoost algorithm overweights the examples on which a model makes errors.
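The reweighting idea fits in a few lines. Below is a minimal sketch of the SAMME update step, written for illustration only: the function adaboost_reweight and its signature are made up for this example and belong neither to this notebook nor to scikit-learn.

import numpy

def adaboost_reweight(w, y_true, y_pred, n_classes=2):
    # weighted error rate of the current tree
    err = numpy.sum(w * (y_true != y_pred)) / numpy.sum(w)
    # weight of the tree in the final vote (non-uniform across trees)
    alpha = numpy.log((1 - err) / err) + numpy.log(n_classes - 1)
    # misclassified examples are overweighted before the next tree is trained
    w = w * numpy.exp(alpha * (y_true != y_pred))
    return w / w.sum(), alpha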

from jyquickhelper import add_notebook_menu
add_notebook_menu()
%matplotlib inline

Skewed train/test split

When one class is underrepresented, it is difficult to anticipate the results of a machine learning model.

import numpy, numpy.random
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import confusion_matrix

N = 1000

res = []

for n in [1, 2, 5, 10, 20, 50, 80, 90, 100, 110]:
    print("n=", n)
    for k in range(10):

        # synthetic data: the higher n, the more X[:, 1] dominates the score
        # and the rarer class 0 becomes (imbalanced problem)
        X = numpy.zeros((N, 2))
        X[:, 0] = numpy.random.randint(0, 2, (N,))
        X[:, 1] = numpy.random.randint(0, n+1, (N,))
        Y = X[:, 0] + X[:, 1] + numpy.random.normal(size=(N,)) / 2
        Y[Y < 1.5] = 0
        Y[Y >= 1.5] = 1

        X_train, X_test, y_train, y_test = train_test_split(X, Y)

        stat = dict(N=N, n=n, ratio_train=y_train.sum()/y_train.shape[0],
                    k=k, ratio_test=y_test.sum()/y_test.shape[0])

        for model in [LogisticRegression(solver="liblinear"),
                      MLPClassifier(max_iter=500),
                      RandomForestClassifier(n_estimators=10),
                      AdaBoostClassifier(DecisionTreeClassifier(), n_estimators=10)]:
            obs = stat.copy()
            obs["model"] = model.__class__.__name__
            if obs["model"] == "AdaBoostClassifier":
                obs["model"] = "AdaB-" + model.base_estimator.__class__.__name__
            try:
                model.fit(X_train, y_train)
            except ValueError as e:
                # training fails when only one class is present in the train set
                obs["erreur"] = str(e)
                res.append(obs)
                continue
            sc = model.score(X_test, y_test)
            obs["accuracy"] = sc
            conf = confusion_matrix(y_test, model.predict(X_test))
            try:
                # these ratios are NaN when a class is absent from the test set
                # (hence the RuntimeWarning printed below)
                obs["Error-0|1"] = conf[0, 1] / conf[0, :].sum()
                obs["Error-1|0"] = conf[1, 0] / conf[1, :].sum()
            except Exception:
                pass
            res.append(obs)
n= 1
n= 2
n= 5
n= 10
n= 20
n= 50
n= 80
n= 90
n= 100
c:\python370_x64\lib\site-packages\ipykernel_launcher.py:47: RuntimeWarning: invalid value encountered in longlong_scalars
c:\python370_x64\lib\site-packages\ipykernel_launcher.py:47: RuntimeWarning: invalid value encountered in longlong_scalars
n= 110
from pandas import DataFrame
df = DataFrame(res)
df = df.sort_values(["n", "model", "k"]).reset_index(drop=True)
df["diff_ratio"] = (df["ratio_test"] - df["ratio_train"]).abs()
df.head(n=5)
Error-0|1 Error-1|0 N accuracy k model n ratio_test ratio_train diff_ratio
0 0.061798 0.208333 1000 0.896 0 AdaB-DecisionTreeClassifier 1 0.288 0.237333 0.050667
1 0.040230 0.302632 1000 0.880 1 AdaB-DecisionTreeClassifier 1 0.304 0.284000 0.020000
2 0.081395 0.256410 1000 0.864 2 AdaB-DecisionTreeClassifier 1 0.312 0.293333 0.018667
3 0.062147 0.342466 1000 0.856 3 AdaB-DecisionTreeClassifier 1 0.292 0.342667 0.050667
4 0.049180 0.298507 1000 0.884 4 AdaB-DecisionTreeClassifier 1 0.268 0.290667 0.022667
df.tail(n=5)
Error-0|1 Error-1|0 N accuracy k model n ratio_test ratio_train diff_ratio
395 0.00 0.0 1000 1.000 5 RandomForestClassifier 110 0.996 0.977333 0.018667
396 0.00 0.0 1000 1.000 6 RandomForestClassifier 110 0.972 0.990667 0.018667
397 0.00 0.0 1000 1.000 7 RandomForestClassifier 110 0.980 0.985333 0.005333
398 0.00 0.0 1000 1.000 8 RandomForestClassifier 110 0.988 0.980000 0.008000
399 0.25 0.0 1000 0.996 9 RandomForestClassifier 110 0.984 0.988000 0.004000

The train/test split is far from satisfactory when one class is underrepresented.

df[df.n==100][["n", "ratio_test", "ratio_train"]].head(n=10)
n ratio_test ratio_train
320 100 1.000 0.986667
321 100 0.980 0.984000
322 100 0.984 0.984000
323 100 0.972 0.980000
324 100 0.984 0.981333
325 100 0.976 0.984000
326 100 0.984 0.990667
327 100 0.984 0.988000
328 100 0.980 0.989333
329 100 0.992 0.982667
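One way to reduce this divergence, not used in this notebook, is a stratified split: train_test_split accepts a stratify parameter which preserves the class proportions on both sides, as long as every class has at least two examples.

# hedged example reusing the X, Y arrays built in the loop above
X_train, X_test, y_train, y_test = train_test_split(X, Y, stratify=Y)
print("ratio_train:", y_train.sum() / y_train.shape[0])
print("ratio_test:", y_test.sum() / y_test.shape[0])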
#df.to_excel("data.xlsx")
columns = ["n", "N", "model"]
agg = df.groupby(columns, as_index=False).mean().sort_values(["n", "model"]).reset_index(drop=True)
agg.tail()
n N model Error-0|1 Error-1|0 accuracy k ratio_test ratio_train diff_ratio
35 100 1000 RandomForestClassifier 0.050000 0.002434 0.9968 4.5 0.9836 0.985067 0.006533
36 110 1000 AdaB-DecisionTreeClassifier 0.050000 0.002031 0.9972 4.5 0.9852 0.984667 0.007733
37 110 1000 LogisticRegression 0.261905 0.001216 0.9956 4.5 0.9852 0.984667 0.007733
38 110 1000 MLPClassifier 0.241905 0.001216 0.9948 4.5 0.9852 0.984667 0.007733
39 110 1000 RandomForestClassifier 0.083333 0.001626 0.9972 4.5 0.9852 0.984667 0.007733
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 2, figsize=(10,4))
agg.plot(x="n", y="diff_ratio", ax=ax[0])
agg.plot(x="n", y="ratio_train", ax=ax[1])
agg.plot(x="n", y="ratio_test", ax=ax[1])
ax[0].set_title("Maximum difference between\nratio of first class on train and test")
ax[1].set_title("Ratio of first class on train and test")
ax[0].legend();
[Figure: difference between class ratios on train and test (left), class ratio on train and test (right), as a function of n]

A trick to avoid duplicate index values before performing a pivot.

agg2 = agg.copy()
agg2["ratio_test2"] = agg2["ratio_test"] + agg2["n"] / 100000
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 3, figsize=(14,4))
agg2.pivot(index="ratio_test2", columns="model", values="accuracy").plot(ax=ax[0])
agg2.pivot(index="ratio_test2", columns="model", values="Error-0|1").plot(ax=ax[1])
agg2.pivot(index="ratio_test2", columns="model", values="Error-1|0").plot(ax=ax[2])
ax[0].plot([0.5, 1.0], [0.5, 1.0], '--', label="constant")
ax[0].set_title("Accuracy")
ax[1].set_title("Error-0|1")
ax[2].set_title("Error-1|0")
ax[0].legend();
[Figure: accuracy, Error-0|1 and Error-1|0 per model as a function of ratio_test2]
agg2.pivot(index="ratio_test2", columns="model", values="Error-0|1")
model AdaB-DecisionTreeClassifier LogisticRegression MLPClassifier RandomForestClassifier
ratio_test2
0.28801 0.052915 0.052915 0.052915 0.052915
0.49002 0.099708 0.099708 0.099708 0.099708
0.75445 0.096037 0.096037 0.096037 0.096037
0.84490 0.083750 0.083750 0.083750 0.083750
0.92100 0.132553 0.132553 0.165886 0.170648
0.97010 0.103611 0.193611 0.173611 0.103611
0.98400 0.230000 0.336667 0.336667 0.280000
0.98460 0.050000 0.270370 0.150000 0.050000
0.98630 0.050000 0.261905 0.241905 0.083333
0.98770 0.100000 0.383333 0.133333 0.116667

The AdaBoost model builds 10 trees, just like the random forest, except that the weight associated with each tree differs from one tree to the next and is not uniform.
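This can be checked on a fitted model: AdaBoostClassifier exposes one weight per tree through estimator_weights_. The snippet below is an illustration on synthetic data, not part of the original notebook.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=200, random_state=0)
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=10)
ada.fit(X_demo, y_demo)
# one weight per tree, usually not uniform, unlike a random forest
# which averages its trees with equal weights
print(ada.estimator_weights_)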

Continual learning

Train a random forest, then add a tree, then another one, while keeping the result of the previous training steps.

from sklearn.datasets import load_boston
boston = load_boston()
X, y = boston.data, boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y)
from sklearn.ensemble import RandomForestRegressor

model = None
res = []
for i in range(0, 20):
    if model is None:
        # warm_start=True tells fit to keep the trees already trained
        model = RandomForestRegressor(n_estimators=1, warm_start=True)
    else:
        # one more tree: the next call to fit only trains the new one
        model.set_params(n_estimators=model.n_estimators + 1)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    res.append(dict(n_estimators=model.n_estimators, score=score))
df = DataFrame(res)
df.head()
n_estimators score
0 1 0.584034
1 2 0.760666
2 3 0.824477
3 4 0.811059
4 5 0.818055
ax = df.plot(x="n_estimators", y="score")
ax.set_title("Continual learning\nperformance measured at each iteration");
[Figure: test score as a function of n_estimators]