.. _2018-10-09ensemblegradientboostingrst:

========================================
2018-10-09 Ensemble, Gradient, Boosting…
========================================

.. only:: html

    **Links:** :download:`notebook <2018-10-09_ensemble_gradient_boosting.ipynb>`,
    :downloadlink:`html <2018-10-09_ensemble_gradient_boosting2html.html>`,
    :download:`python <2018-10-09_ensemble_gradient_boosting.py>`,
    :downloadlink:`slides <2018-10-09_ensemble_gradient_boosting.slides.html>`,
    :githublink:`GitHub|_doc/notebooks/notebook_eleves/2018-2019/2018-10-09_ensemble_gradient_boosting.ipynb|*`

This notebook explores a few particularities of learning algorithms in order to
explain some numerical results. The AdaBoost algorithm overweights the examples
on which a model makes mistakes.

.. code:: ipython3

    from jyquickhelper import add_notebook_menu
    add_notebook_menu()

.. contents::
    :local:

.. code:: ipython3

    %matplotlib inline

Skewed split train test
-----------------------

When one class is underrepresented, it becomes difficult to predict the results
a machine learning model will produce.

.. code:: ipython3

    import numpy, numpy.random
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
    from sklearn.metrics import confusion_matrix

    N = 1000
    res = []
    for n in [1, 2, 5, 10, 20, 50, 80, 90, 100, 110]:
        print("n=", n)
        for k in range(10):
            X = numpy.zeros((N, 2))
            X[:, 0] = numpy.random.randint(0, 2, (N,))
            X[:, 1] = numpy.random.randint(0, n + 1, (N,))
            Y = X[:, 0] + X[:, 1] + numpy.random.normal(size=(N,)) / 2
            Y[Y < 1.5] = 0
            Y[Y >= 1.5] = 1
            X_train, X_test, y_train, y_test = train_test_split(X, Y)
            stat = dict(N=N, n=n, ratio_train=y_train.sum() / y_train.shape[0],
                        k=k, ratio_test=y_test.sum() / y_test.shape[0])
            for model in [LogisticRegression(solver="liblinear"),
                          MLPClassifier(max_iter=500),
                          RandomForestClassifier(n_estimators=10),
                          AdaBoostClassifier(DecisionTreeClassifier(), n_estimators=10)]:
                obs = stat.copy()
                obs["model"] = model.__class__.__name__
                if obs["model"] == "AdaBoostClassifier":
                    obs["model"] = "AdaB-" + model.base_estimator.__class__.__name__
                try:
                    model.fit(X_train, y_train)
                except ValueError as e:
                    obs["erreur"] = str(e)
                    res.append(obs)
                    continue
                sc = model.score(X_test, y_test)
                obs["accuracy"] = sc
                conf = confusion_matrix(y_test, model.predict(X_test))
                try:
                    obs["Error-0|1"] = conf[0, 1] / conf[0, :].sum()
                    obs["Error-1|0"] = conf[1, 0] / conf[1, :].sum()
                except Exception:
                    pass
                res.append(obs)

.. parsed-literal::

    n= 1
    n= 2
    n= 5
    n= 10
    n= 20
    n= 50
    n= 80
    n= 90
    n= 100
    n= 110

.. code:: ipython3

    from pandas import DataFrame
    df = DataFrame(res)
    df = df.sort_values(['n', 'model', 'k']).reset_index(drop=True)
    df["diff_ratio"] = (df["ratio_test"] - df["ratio_train"]).abs()
    df.head(n=5)
.. parsed-literal::

           N  n  ratio_train  k  ratio_test                        model  accuracy  Error-0|1  Error-1|0  diff_ratio
    0   1000  1     0.273333  0       0.300  AdaB-DecisionTreeClassifier     0.860   0.062857   0.320000    0.026667
    1   1000  1     0.274667  1       0.328  AdaB-DecisionTreeClassifier     0.916   0.029762   0.195122    0.053333
    2   1000  1     0.304000  2       0.284  AdaB-DecisionTreeClassifier     0.860   0.072626   0.309859    0.020000
    3   1000  1     0.285333  3       0.268  AdaB-DecisionTreeClassifier     0.896   0.027322   0.313433    0.017333
    4   1000  1     0.297333  4       0.256  AdaB-DecisionTreeClassifier     0.888   0.053763   0.281250    0.041333
.. code:: ipython3

    df.tail(n=5)
.. parsed-literal::

             N    n  ratio_train  k  ratio_test                   model  accuracy  Error-0|1  Error-1|0  diff_ratio
    395   1000  110     0.982667  5       0.996  RandomForestClassifier     0.996        0.0   0.004016    0.013333
    396   1000  110     0.990667  6       0.980  RandomForestClassifier     0.996        0.2   0.000000    0.010667
    397   1000  110     0.985333  7       0.988  RandomForestClassifier     1.000        0.0   0.000000    0.002667
    398   1000  110     0.985333  8       0.992  RandomForestClassifier     1.000        0.0   0.000000    0.006667
    399   1000  110     0.985333  9       0.992  RandomForestClassifier     0.996        0.5   0.000000    0.006667
The train/test split is far from satisfactory when one class is
underrepresented: the class ratio can differ noticeably between the train and
test sets (a stratified split, sketched after the table below, avoids this).

.. code:: ipython3

    df[df.n==100][["n", "ratio_test", "ratio_train"]].head(n=10)
.. parsed-literal::

           n  ratio_test  ratio_train
    320  100       0.980     0.992000
    321  100       0.984     0.980000
    322  100       0.988     0.984000
    323  100       0.988     0.986667
    324  100       0.976     0.986667
    325  100       0.984     0.985333
    326  100       0.984     0.981333
    327  100       0.988     0.982667
    328  100       0.984     0.989333
    329  100       0.992     0.989333
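A minimal sketch, not part of the original experiment, showing how
``train_test_split`` with its ``stratify`` argument keeps the class ratio
(nearly) identical in both sets on the same kind of imbalanced data:

.. code:: python

    # Sketch: a stratified split preserves the class ratio of Y in both sets.
    import numpy
    from sklearn.model_selection import train_test_split

    N, n = 1000, 100
    X = numpy.zeros((N, 2))
    X[:, 0] = numpy.random.randint(0, 2, (N,))
    X[:, 1] = numpy.random.randint(0, n + 1, (N,))
    Y = X[:, 0] + X[:, 1] + numpy.random.normal(size=(N,)) / 2
    Y = (Y >= 1.5).astype(float)

    # stratify=Y draws the split so that train and test keep the same ratio
    X_train, X_test, y_train, y_test = train_test_split(X, Y, stratify=Y)
    print(y_train.mean(), y_test.mean())  # the two ratios are now almost equal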
.. code:: ipython3

    # df.to_excel("data.xlsx")

.. code:: ipython3

    columns = ["n", "N", "model"]
    agg = df.groupby(columns, as_index=False).mean().sort_values(["n", "model"]).reset_index(drop=True)
    agg.tail()
.. parsed-literal::

          n     N                        model  ratio_train    k  ratio_test  accuracy  Error-0|1  Error-1|0  diff_ratio
    35  100  1000       RandomForestClassifier     0.985733  4.5      0.9848    0.9956   0.185000   0.001216    0.004933
    36  110  1000  AdaB-DecisionTreeClassifier     0.986533  4.5      0.9900    0.9972   0.130000   0.000810    0.007200
    37  110  1000           LogisticRegression     0.986533  4.5      0.9900    0.9960   0.346667   0.000402    0.007200
    38  110  1000                MLPClassifier     0.986533  4.5      0.9900    0.9956   0.346667   0.000810    0.007200
    39  110  1000       RandomForestClassifier     0.986533  4.5      0.9900    0.9980   0.090000   0.000810    0.007200
.. code:: ipython3

    import matplotlib.pyplot as plt
    fig, ax = plt.subplots(1, 2, figsize=(10,4))
    agg.plot(x="n", y="diff_ratio", ax=ax[0])
    agg.plot(x="n", y="ratio_train", ax=ax[1])
    agg.plot(x="n", y="ratio_test", ax=ax[1])
    ax[0].set_title("Maximum difference between\nratio of first class on train and test")
    ax[1].set_title("Ratio of first class on train and test")
    ax[0].legend();

.. image:: 2018-10-09_ensemble_gradient_boosting_11_0.png

A small trick to avoid duplicate index values before performing a pivot:
``ratio_test`` gets a tiny offset that depends on ``n``, so the pivoted index
contains no duplicates.

.. code:: ipython3

    agg2 = agg.copy()
    agg2["ratio_test2"] = agg2["ratio_test"] + agg2["n"] / 100000

.. code:: ipython3

    import matplotlib.pyplot as plt
    fig, ax = plt.subplots(1, 3, figsize=(14,4))
    agg2.pivot("ratio_test2", "model", "accuracy").plot(ax=ax[0])
    agg2.pivot("ratio_test2", "model", "Error-0|1").plot(ax=ax[1])
    agg2.pivot("ratio_test2", "model", "Error-1|0").plot(ax=ax[2])
    ax[0].plot([0.5, 1.0], [0.5, 1.0], '--', label="constant")
    ax[0].set_title("Accuracy")
    ax[1].set_title("Error-0|1")
    ax[2].set_title("Error-1|0")
    ax[0].legend();

.. image:: 2018-10-09_ensemble_gradient_boosting_14_0.png

.. code:: ipython3

    agg2.pivot("ratio_test2", "model", "Error-0|1")
.. parsed-literal::

    model        AdaB-DecisionTreeClassifier  LogisticRegression  MLPClassifier  RandomForestClassifier
    ratio_test2
    0.29721                         0.052249            0.052249       0.052249                0.052249
    0.50682                         0.110686            0.110686       0.110686                0.110686
    0.75525                         0.119578            0.119578       0.119578                0.119578
    0.86690                         0.099333            0.099333       0.099333                0.099333
    0.92900                         0.088095            0.113095       0.113095                0.088095
    0.96970                         0.106349            0.253968       0.220635                0.163492
    0.98120                         0.125000            0.310000       0.200000                0.175000
    0.98490                         0.110000            0.155000       0.155000                0.170000
    0.98580                         0.185000            0.335000       0.268333                0.185000
    0.99110                         0.130000            0.346667       0.346667                0.090000

The AdaBoost model builds 10 trees, just like the random forest, except that
the weight associated with each tree is different rather than uniform.
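A quick way to look at these per-tree weights (a minimal sketch, not part of
the original experiment): a fitted ``AdaBoostClassifier`` exposes them through
its ``estimator_weights_`` attribute, whereas a random forest averages its
trees with equal weight. The discrete ``algorithm="SAMME"`` variant is used
here so that the weights are actually non-uniform; its availability as an
explicit parameter depends on the scikit-learn version.

.. code:: python

    # Sketch: the weight of each tree in a fitted AdaBoost model is not uniform.
    import numpy
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import AdaBoostClassifier

    X = numpy.random.randn(200, 2)
    # noisy labels so that no single stump classifies everything correctly
    y = (X[:, 0] + X[:, 1] + numpy.random.normal(scale=0.5, size=200) > 0).astype(int)

    ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=10, algorithm="SAMME")
    ada.fit(X, y)
    # One weight per tree; they differ from one boosting round to the next,
    # while a random forest simply averages its trees.
    print(ada.estimator_weights_)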
Continuous learning
-------------------

Train a random forest, then add one tree, then another, while keeping the
result of the previous fits. This is what the ``warm_start`` parameter of
``RandomForestRegressor`` makes possible.

.. code:: ipython3

    from sklearn.datasets import load_diabetes
    data = load_diabetes()
    X, y = data.data, data.target

.. code:: ipython3

    X_train, X_test, y_train, y_test = train_test_split(X, y)

.. code:: ipython3

    from sklearn.ensemble import RandomForestRegressor

    model = None
    res = []
    for i in range(0, 20):
        if model is None:
            model = RandomForestRegressor(n_estimators=1, warm_start=True)
        else:
            model.set_params(**dict(n_estimators=model.n_estimators+1))
        model.fit(X_train, y_train)
        score = model.score(X_test, y_test)
        res.append(dict(n_estimators=model.n_estimators, score=score))

.. code:: ipython3

    df = DataFrame(res)
    df.head()
.. parsed-literal::

       n_estimators     score
    0             1  0.128906
    1             2  0.323854
    2             3  0.352876
    3             4  0.389476
    4             5  0.429992
.. code:: ipython3

    ax = df.plot(x="n_estimators", y="score")
    ax.set_title("Apprentissage continu\nmesure de la performance à chaque itération");

.. image:: 2018-10-09_ensemble_gradient_boosting_22_0.png
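For comparison, a minimal sketch (not part of the original notebook, reusing
``X_train``, ``X_test``, ``y_train`` and ``y_test`` from above) of the same
experiment without ``warm_start``: each forest is refitted from scratch with
``k`` trees, which yields a comparable score curve but rebuilds every tree at
each step instead of only fitting the new one.

.. code:: python

    # Sketch: same score curve without warm_start, refitting the forest each time.
    from pandas import DataFrame
    from sklearn.ensemble import RandomForestRegressor

    res_scratch = []
    for k in range(1, 21):
        model_k = RandomForestRegressor(n_estimators=k)  # fitted from scratch
        model_k.fit(X_train, y_train)
        res_scratch.append(dict(n_estimators=k, score=model_k.score(X_test, y_test)))

    DataFrame(res_scratch).plot(x="n_estimators", y="score");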