{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Polynomial regression and pipeline\n", "\n", "This notebook compares several polynomial regression models."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": ["%matplotlib inline"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": ["from papierstat.datasets import load_wines_dataset\n", "data = load_wines_dataset()\n", "# Drop the target and the color column to build the feature matrix.\n", "X = data.drop(['quality', 'color'], axis=1)\n", "y = data['quality']"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": ["from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We normalize the data. In this particular case it matters all the more because, without normalization, the polynomial features take very large values, and numerical libraries do not handle widely differing orders of magnitude well."]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": ["from sklearn.preprocessing import Normalizer\n", "norm = Normalizer()\n", "X_train_norm = norm.fit_transform(X_train)\n", "X_test_norm = norm.transform(X_test)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The [PolynomialFeatures](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) transformation creates new features by multiplying the variables with one another. For degree two and three features $a, b, c$, we obtain the new features: $1, a, b, c, a^2, ab, ac, b^2, bc, c^2$."]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["1 0.189007413643138 0.17548948727814861 0.005909326000001158\n", "2 0.3090044704138045 0.3016856760353912 0.027130041999996024\n", "3 0.4065060987061494 -0.057880204420430736 0.22084438099999915\n", "4 0.5874526458338967 -3659.6472584680923 2.230189553999999\n"]}], "source": ["from time import perf_counter\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.preprocessing import PolynomialFeatures\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.metrics import r2_score\n", "\n", "r2ts = []\n", "r2es = []\n", "degs = []\n", "tts = []\n", "models = []\n", "\n", "# Fit one pipeline per degree; record fit time and R2 on train and test.\n", "for d in range(1, 5):\n", "    begin = perf_counter()\n", "    pipe = make_pipeline(PolynomialFeatures(degree=d),\n", "                         LinearRegression())\n", "    pipe.fit(X_train_norm, y_train)\n", "    duree = perf_counter() - begin\n", "    r2t = r2_score(y_train, pipe.predict(X_train_norm))\n", "    r2e = r2_score(y_test, pipe.predict(X_test_norm))\n", "    degs.append(d)\n", "    r2ts.append(r2t)\n", "    r2es.append(r2e)\n", "    tts.append(duree)\n", "    models.append(pipe)\n", "    print(d, r2t, r2e, duree)"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/html": ["
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>temps</th><th>r2_train</th><th>r2_test</th></tr>\n", "    <tr><th>degr\u00e9</th><th></th><th></th><th></th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>1</th><td>0.005909</td><td>0.189007</td><td>0.175489</td></tr>\n", "    <tr><th>2</th><td>0.027130</td><td>0.309004</td><td>0.301686</td></tr>\n", "    <tr><th>3</th><td>0.220844</td><td>0.406506</td><td>-0.057880</td></tr>\n", "    <tr><th>4</th><td>2.230190</td><td>0.587453</td><td>-3659.647258</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>"], "text/plain": ["          temps  r2_train       r2_test\n", "degr\u00e9                                  \n", "1      0.005909  0.189007      0.175489\n", "2      0.027130  0.309004      0.301686\n", "3      0.220844  0.406506     -0.057880\n", "4      2.230190  0.587453  -3659.647258"]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["import pandas\n", "df = pandas.DataFrame(dict(temps=tts, r2_train=r2ts, r2_test=r2es, degr\u00e9=degs))\n", "df.set_index('degr\u00e9')"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The degree-2 polynomial appears to be the best model. The computation time is multiplied by roughly 5 to 10 at each degree, which follows the growth in the number of features. We can nevertheless see that adding cross features helps on this dataset. But from degree 3 onward, the regression produces very poor results on the test set while the scores keep improving on the training set. Let us look at this in more detail."]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": ["
"]}, "metadata": {}, "output_type": "display_data"}], "source": ["import matplotlib.pyplot as plt\n", "fig, ax = plt.subplots(1, 2, figsize=(12, 4))\n", "\n", "n = 15\n", "# drop=True keeps only the target values, not the original row index.\n", "ax[0].plot(y_train[:n].reset_index(drop=True), 'o')\n", "ax[1].plot(y_test[:n].reset_index(drop=True), 'o')\n", "ax[0].set_title('Predictions on a few values\\ntraining set')\n", "ax[1].set_title('Predictions on a few values\\ntest set')\n", "for x in ax:\n", "    x.set_ylim([3, 9])\n", "    x.get_xaxis().set_visible(False)\n", "\n", "# Overlay the predictions of each fitted pipeline, one curve per degree.\n", "for model in models:\n", "    d = model.get_params()['polynomialfeatures__degree']\n", "    tr = model.predict(X_train_norm[:n])\n", "    te = model.predict(X_test_norm[:n])\n", "    ax[0].plot(tr, label=\"d=%d\" % d)\n", "    ax[1].plot(te, label=\"d=%d\" % d)\n", "ax[0].legend()\n", "ax[1].legend();"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The degree-4 model looks accurate on the training set but goes completely astray on the test set, as if surprised by the values it encounters there. The model is said to [overfit](https://en.wikipedia.org/wiki/Overfitting) ([sur-apprentissage](https://fr.wikipedia.org/wiki/Surapprentissage) in French). The degree-2 polynomial works better than plain linear regression. One may wonder which cross variables affect performance. We use the [statsmodels](http://www.statsmodels.org/stable/index.html) package to find out."]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": ["poly = PolynomialFeatures(degree=2)\n", "poly_feat_train = poly.fit_transform(X_train_norm)\n", "# Reuse the transformer fitted on the training data instead of refitting on the test set.\n", "poly_feat_test = poly.transform(X_test_norm)"]}, {"cell_type": "code", "execution_count": 9, "metadata": {"scrolled": false}, "outputs": [{"data": {"text/html": ["