{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 2A.ml - Reducing a random forest - solution\n", "\n", "The Lasso model selects variables, while a random forest produces its prediction as the average of regression trees. What if we combined the two?"]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": ["%matplotlib inline"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Datasets\n", "\n", "Since we always need data, we use the [Diabetes](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html) dataset."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": ["from sklearn.datasets import load_diabetes\n", "data = load_diabetes()\n", "X, y = data.data, data.target"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": ["from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## A random forest"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/html": ["
RandomForestRegressor()
"], "text/plain": ["RandomForestRegressor()"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.ensemble import RandomForestRegressor as model_class\n", "clr = model_class()\n", "clr.fit(X_train, y_train)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The number of trees is..."]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/plain": ["100"]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["len(clr.estimators_)"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.3625404922781166"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.metrics import r2_score\n", "r2_score(y_test, clr.predict(X_test))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Random Forest = average of the predictions\n", "\n", "We start again, computing the average ourselves."]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.3625404922781166"]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["import numpy\n", "dest = numpy.zeros((X_test.shape[0], len(clr.estimators_)))\n", "estimators = numpy.array(clr.estimators_).ravel()\n", "for i, est in enumerate(estimators):\n", "    pred = est.predict(X_test)\n", "    dest[:, i] = pred\n", "\n", "average = numpy.mean(dest, axis=1)\n", "r2_score(y_test, average)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["As expected, the result is the same."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Weighting the trees with a linear regression\n", "\n", "The random forest is a way to create new features, exactly 100 of them, which are then used to fit a linear regression."]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, 
"outputs": [{"data": {"text/html": ["
LinearRegression()
"], "text/plain": ["LinearRegression()"]}, "execution_count": 10, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.linear_model import LinearRegression\n", "\n", "\n", "def new_features(forest, X):\n", "    dest = numpy.zeros((X.shape[0], len(forest.estimators_)))\n", "    estimators = numpy.array(forest.estimators_).ravel()\n", "    for i, est in enumerate(estimators):\n", "        pred = est.predict(X)\n", "        dest[:, i] = pred\n", "    return dest\n", "\n", "\n", "X_train_2 = new_features(clr, X_train)\n", "lr = LinearRegression()\n", "lr.fit(X_train_2, y_train)"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.30414556638121215"]}, "execution_count": 11, "metadata": {}, "output_type": "execute_result"}], "source": ["X_test_2 = new_features(clr, X_test)\n", "r2_score(y_test, lr.predict(X_test_2))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["A bit worse or a bit better depending on the run. The risk of overfitting is somewhat higher with this many features because the training set only contains 331 observations (look at ``X_train.shape`` to check)."]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([ 0.0129567 , -0.03467343, -0.02574902, 0.01872549, 0.00128276,\n", " -0.01449147, 0.00977528, -0.02397026, 0.01066261, 0.02121925,\n", " 0.03544455, 0.02735311, 0.01859875, -0.03189411, -0.0245749 ,\n", " -0.01879966, 0.01521987, 0.00292998, 0.04250576, 0.01424533,\n", " -0.00561623, 0.00635399, 0.04712406, 0.02518721, 0.01713507,\n", " 0.01741708, -0.02072389, 0.05748854, 0.00424951, 0.02872275,\n", " -0.01016485, 0.04368062, 0.07377962, 0.06540726, -0.00123185,\n", " 0.02227104, 0.0289425 , 0.00914512, 0.03645644, 0.01838009,\n", " 0.00046509, 0.04145444, 0.0202303 , 0.00984027, 0.0149448 ,\n", " -0.01129977, 0.00428108, 0.02601842, 0.00421449, -0.01172942,\n", " 0.02631074, 0.04180424, 0.02909078, -0.01922766, -0.00953341,\n", " 
-0.0036882 , -0.02411783, 0.06700977, -0.01447105, 0.02094102,\n", " 0.00227497, 0.04181756, -0.02474879, 0.0465355 , 0.05504502,\n", " -0.05645067, -0.02066304, 0.04349629, -0.01549704, 0.02805018,\n", " 0.01344701, 0.03489881, 0.04401519, 0.04756385, -0.02936105,\n", " -0.0305603 , -0.02101141, 0.02751049, -0.00875684, -0.01583926,\n", " 0.00033533, 0.02769942, 0.0358323 , -0.04180737, -0.02759142,\n", " -0.01231979, 0.02881228, -0.00406825, 0.00497993, 0.01094388,\n", " -0.01672934, 0.05414844, -0.01725494, 0.04816335, 0.04487341,\n", " 0.0269151 , 0.00945554, 0.02318397, 0.04105411, 0.05314256])"]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}], "source": ["lr.coef_"]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"image/png": "iVBORw0KGgoAAAANSUhEUgAAAtEAAAEICAYAAACZEKh9AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAAsTAAALEwEAmpwYAAAkiUlEQVR4nO3debgkdX3v8fdHBhDQsE6QbQAFYyAmaiagT0zCFUTcMtwEAsaYSQJBb+JVExPFJUhAI+Yal1xRwwUS3EBDFidKggiSxS0M7uDCiAMzBGRgELcgot/7R9WRnqb7zKnpPkuf8349z3lOLb+u+tba3/7Vr6pSVUiSJEmauQfNdwCSJEnSpDGJliRJkjoyiZYkSZI6MomWJEmSOjKJliRJkjoyiZYkSZI6MomW+iT5iSSfSfKtJC9IslOSf0pyd5K/TfLsJB+awXRenuT8uYh50iQ5KsnG+Y5j3JIclKSSLJvvWHolOTPJu8YwnUpyyIDh5yZ5zajTX8qSnJjkiiQPnu9YpszWOSzJbyX5j57+byd5eMdp7NWep1fOsPyKdj7bdY1XGmZBneilLpL8OvCHwKOAbwGfAV5TVf8x3edm4CXAR6rqMe18ngPsDexZVfe1Zd69tYlU1Z+NGAft/A8CvgZs3zN/aUFIchpwb1W9Yr5jmVRJHgucChxfVffMdzxTxnUOm8F8HtKlfJLtgYuA36uqtTOcx81Ap/lIW2MSrYmU5A+B04HnAZcD9wLHAauAUZPoA4FL+vq/YgI7vSTbVdUP5juO+bIttc9Jlk36flVV5813DJOod9tX1aeBp4xreotdVX0fePq4ppckQKrqh+OappYGm3No4iTZFTgL+P2q+vuq+k5Vfb+q/qmq/rgts2OSNyX5r/bvTUl27JnGM9pLgd9I8rEkP90Ovwr4H8Bb2kt/FwNnACe1/acMuBR5eHsZdnOSryd5eTt8i0voSR7fzusbST6b5KiecVcnOTvJR9tmJB9Kslc7+t/a/99oY3hCkkOS/GvbxOSOJO8dsq6mmhec1q6HW5P8Uc/4oeupfznbYT+6nJ/kb5K8LcllSb7Trrf++e+R5K/bad+V5B/7xr84ye1tXL/dM/zpST6d5JtJNiQ5s+9zz0lyU5I7k7wiyfokx/TE9eqesls0HUmyb5K/S7IpydeSvGDQuttaHD3r9pQkNwNX9
Xz0d4as7zOTXJrkXUm+CfxWkl2TXNCWvSXJqzPkknOS7dJcYv9qu59cm+SAdtyb2xi/2Q7/hWmW64k9++KGJL/VDr86yak95R6wD/SM2zHJ65Pc3O73b0+yUzturyQfaKe/Ocm/Jxn4fTMs7nY7/XeSPXrKPrbd37dv+38nyRfbfevyJAf2lB14XA6Y/7TL3G7j5yW5oV2ec5OkZ/zQGPrmM3B/2coyHJvky2mO87emOeZP7Ynzo0nemORO4Mxt3SZJXtrue99q53d0O7z/HPbLSa5rp3F1kp/sGbc+yR8l+Vwb73szw6YpeeB55dwkH2zj+WSSR/SUfVTPdv1ykl/rGTeT43VZz3Z/TZKPAt8FHj7dtKWBqso//ybqj6bG+T5g2TRlzgI+Afw4sBz4GHB2O+6xwO3AkcB2wGpgPbBjO/5q4NSeaZ0JvKun/7eA/2i7HwrcCrwYeHDbf2T/54D9gDuBp9H8eH1y27+8Z55fBR4J7NT2n9OOOwio3uUFLgZe0U7rwcATh6yHqc9eDOwCPBrYBBwzg/X0o+XsmV4Bh7TdfwPcDfz8VBwD5v9B4L3A7sD2wC+1w49qt+FZ7fCn0XyR7d4z/tHtdH8a+DrNpW6Aw4BvA78I7Ai8oZ3WMT1xvbonhqOAjW33g4BraX4Y7QA8HLgReMqQ9TddHFPr9h3tut1pBuv7TOD7wPHtNHcC/gH4q7b8jwP/CTx3SDx/DHwe+AkgwM/QNDMC+A1gT5orjC8GbpvaJmy5Lx5I0/zpWe263xN4zJB9f4t9oG/7vxFYA+xBs9//E/Dadtxrgbe3098e+AWamr5ByzRd3FcBv9tT9v8Ab2+7VwHrgJ9sP/tK4GNbOy4HzH8my/wBYDdgRbs9j9taDNMci737y3TLsBfwTeBX2nEvpNl3Tu2J8z7gf7fjd9qWbUKzL20A9u2J8xED9ptHAt+hOXdtT9PsbR2wQzt+Pc2+u287/y8CzxuyLgat497zyp3AEe1yvRu4pB23Sxvrb7fjHgvcARzW4Xhd1rPdbwYOb6e163TT9s+/QX/zHoB//nX9A54N3LaVMl8FntbT/xRgfdv9NtpEsWf8l7k/wbuamSfRzwI+PSSG3i+glwLv7Bt/ObC6Z56v7Bn3e8C/tN1bnPzbYe8AzgP238p6mPrso3qG/TlwwQzW0xZfdO2w/i+7d0wz732AH9Imxn3jjgL+u2+ZbgceP2RabwLe2HafQful2vbvQtOcZyZJ9JHAzX3Tfhnw1zPc93rjmFq3D++wvs8E/q1n3N7A94CdeoY9i6ZN/qD5fxlYNcNY7wJ+ZsC++DLgH4Z85mpmkETTJF/foU222nFPAL7Wdp8FvH9qX+ny1xf3qcBVbXdokpxfbPv/GTil53MPovkhdiDTHJfbuMxP7Ol/H3D61mIYMJ9B+8t0y/CbwMd7xk0tf28SfXPf+M7bpN2etwPH0Nx3Mewc9ifA+/pivQU4qu1fD/xG337/9iHrfOB+Vfcfv+f3jHsa8KW2+yTg3/um9VfAq4bM50088HjtTaLP6inbadr++VdVNufQRLoT2CvTt0HdF7ipp/+mdhg0X1Avbi9JfiPJN4ADesZ3cQBNIro1BwIn9s3ziTSJ5pTberq/y/Q3wbyE5kvzP9vLq7+zlflv6OnuXRfTraeZ2DDNuAOAzVV115Dxd9aWbTh/tMxJjkzykTRNLu6mafs+1bxl3975VtV3aPaJmTgQ2LdvO7ycJpl9gK3EMWXQOhi2vvvHHUhTq3drTzx/RVMjPcjQ/a29lP7F9lL6N2hq1vpjnXYaHSwHdgau7Yn7X9rh0NQYrwM+lOTGJKcPm9BW4v474AlJ9qG58vBD4N/bcQcCb+6Z/2aaY2K/MS1jr2HH5nQxDNO//Yd9vn8/L6D/iTa909qmbVJV64AX0STMtye5JMmgc8AW54pq2g9v6FvWLuew6Uy3vo/sO36fDTwMZny89urfFkOnLQ1iEq1J9HGa2rvjpynzX
zQnxSkr2mHQnDhfU1W79fztXFUXb0MsG2iaBMyk3Dv75rlLVZ0zg8/WAwZU3VZVv1tV+wLPBd6aAY8e63FAT3fvuphuPX2H5ksZgCSDvkweEFuPDcAeSXabpsww76G5LH1AVe1Kcxl6qh3qrfQsT5KdaZoDTNkibrb8EtxAUzPXux0eWlVP24Y4pgxaB8PWd3/5DTT78l498fxYVR0+JJ4NwCP6B6ZpR/wS4Ndoav53o2lq0x/r0Gm0plt3ve6guZJweE/cu1b7lIWq+lZVvbiqHg78MvCHU+1su8Td/gD7EE0t4a/TXIGYWn8baJq99G7LnarqY8z8uOyyzINMF8Mw/dt/2OdvBfafKpgkvf0DprXN26Sq3lNVT6Q5FxTwugFxb3GuaOM5gKY2eq5sAP61b309pKr+Vzt+Jsdrr/5tMd20pQcwidbEqaq7aS7pn5vk+CQ7J9k+yVOT/Hlb7GLglUmWp7lB7wxg6gaZ/wc8r621SJJd2htSHroN4XwA2CfJi9Lc1PPQJEcOKPcu4JlJnpLm5rAHp7nhrf9LcZBNNDVwP0oK0jxTduqzd9F8GUx3Z/mftOvpcJo2f1M3Ik63nj4LHJ7kMWluEDpzBrH+SFXdSnO5+q1Jdm+30S/O8OMPpanFvifJETQJ1JRLgWekuTluB5rL1L3nss8AT0tzU+PDaGrZpvwn8K00N1Lt1G6Ln0ryc9sQx3SGre8ttOvoQ8BfJPmxJA9K8ogkvzRkuucDZyc5tN13fzrJnm2c99HsK8uSnAH82JBpvBs4JsmvJVmWZM8kj2nHfQb4lTb2Q4BThsT9Q5rj6I1JfhwgyX5JntJ2PyPNza+hSYp/wOD9cyZxv4emacMJbfeUtwMva9cxaW7QPLEdN9PjcsbLPMR0MYz6+Q8Cj27PccuA32eaBH9bt0ma5+I/Kc0NxffQJOKDttX7gKcnOTrNjZ0vpvkBON0PhnH7APDINDcWb9/+/Vzuv8FxW4/XmUxbegCTaE2kqvoLmmdEv5LmC3gD8HzgH9sirwbWAp+juRHrU+0wqnmu6O8Cb6FJQNfRtNHblji+RXOjzTNpLkHewICnVFTVBpqbiF7eE+8fM4NjsKq+C7wG+Giay4yPB34O+GSSb9PUvLywqm6cZjL/SrOcVwKvr6qpl8VMt56+QpOgfrhdrm15dOBzaG6G+hJNu8sXzfBzvwecleRbNIn9+6ZGVNV1NAnFe2hq6+5iy8vc76T5AbCeJkF9b89nfwA8A3gMzbO376BJTHftGsdWDFvfg/wmzU2O17fLcilbNvPp9YY2hg/R3HR2Ac0NZZfTXLr/Cs0l93sY0tSmmuflPo0mCdpMk0T+TDv6jTTty79O8xze6Z6H/tJ2GT+R5kkjH6a5SQ3g0Lb/2zRXjt5aVR8ZMI2ZxL2mnd5tVfXZnuX4B5oa00va+X8BeGo7bkbH5TYs8xami2HUz1fVHcCJNG2L76S5oXYtTeI6zLZskx2Bc2iOhdtomhK9bECsX6a5CfT/tmWfCTyzqu6d6fKOqt2uxwIn09SM30az/qaevLStx+tMpi09QO6/MiZpsckSeVFLkvU0N1x9eL5jkWZDmsfRbQSePeQHiaQ5Zk20JEkLUNv8a7e2qcXLadr3fmKew5LUMomWJGlhegLNU0ammk8cX1X/Pb8hSZoyluYcSY4D3kzz4orz+5840P6KfgfwszRtu06qqvXtzQnnA4+jebj5O6rqtSMHJEmSJM2ikWui07ye9lyamyEOA56V5LC+YqcAd1XVITQ3cUw9PudEmrfEPZomwX5u24ZTkiRJWrCme1nFTB0BrJt6MkCSS2ieQnB9T5lV3P94rEuBt7SP2Slgl/bxPTvR3CH9za3NcK+99qqDDjpoDKFLkiRJg1177bV3VNXyQePGkUTvx5aPJNpI82rdgWWq6r40bxLakyahXkXzmKqdgT+oqs2DZpLkNOA0gBUrVrB27doxhC5JkiQNluSmYePm+8bCI2ge+
L4vcDDNq5gHvmWqqs6rqpVVtXL58oE/CCRJkqQ5MY4k+ha2fMXt/jzwNaA/KtM23diV5gbDXwf+paq+X1W3Ax8FVo4hJkmSJGnWjCOJvgY4NMnB7St4T6Z5w1SvNcDqtvsE4KpqHgtyM/AkgCS7AI+nebOZJEmStGCNnES3b0F7Ps3rW78IvK+qrktyVpJfbotdAOyZZB3Nq5pPb4efCzwkyXU0yfhfV9XnRo1JkiRJmk0T+drvlStXljcWSpIkaTYlubaqBjY1nu8bCyVJkqSJYxItSZIkdWQSLUmSJHU0jpetSFpiDjr9gw8Ytv6cp89DJJIkzQ9roiVJkqSOTKIlSZKkjkyiJUmSpI5MoiVJkqSOTKIlSZKkjkyiJUmSpI5MoiVJkqSOTKIlSZKkjkyiJUmSpI5MoiVJkqSOfO23JGnB8dXykhY6a6IlSZKkjqyJljQ21h5KkpYKa6IlSZKkjkyiJUmSpI5MoiVJkqSOTKIlSZKkjsaSRCc5LsmXk6xLcvqA8TsmeW87/pNJDuoZ99NJPp7kuiSfT/LgccQkSZIkzZaRk+gk2wHnAk8FDgOeleSwvmKnAHdV1SHAG4HXtZ9dBrwLeF5VHQ4cBXx/1JgkSZKk2TSOmugjgHVVdWNV3QtcAqzqK7MKuKjtvhQ4OkmAY4HPVdVnAarqzqr6wRhikiRJkmbNOJLo/YANPf0b22EDy1TVfcDdwJ7AI4FKcnmSTyV5ybCZJDktydokazdt2jSGsCVJkqRtM983Fi4Dngg8u/3/P5McPahgVZ1XVSurauXy5cvnMkZJkiRpC+NIom8BDujp378dNrBM2w56V+BOmlrrf6uqO6rqu8BlwOPGEJMkSZI0a8aRRF8DHJrk4CQ7ACcDa/rKrAFWt90nAFdVVQGXA49OsnObXP8ScP0YYpIkSZJmzbJRJ1BV9yV5Pk1CvB1wYVVdl+QsYG1VrQEuAN6ZZB2wmSbRpqruSvIGmkS8gMuq6oOjxiQtFQed/sDDZf05T5+HSCRJWlpGTqIBquoymqYYvcPO6Om+BzhxyGffRfOYO0mSJGkizPeNhZIkSdLEMYmWJEmSOhpLcw5Jkpaq/nsTvC9BWhpMoiVJkhYJbzifOzbnkCRJkjqyJlqSJEkLxqTUplsTLUmSJHVkEi1JkiR1ZBItSZIkdWQSLUmSJHVkEi1JkiR1ZBItSZIkdeQj7iRJkpaoSXmc3EJkEi1Jkh7A5Eqankm0NCK/aCRJWnpMoiVJ2gp/LEvq542FkiRJUkfWREuSJGkLXn3ZOpNoSZKWMJMladuYREsayi9Xae54vEmTxTbRkiRJUkdjqYlOchzwZmA74PyqOqdv/I7AO4CfBe4ETqqq9T3jVwDXA2dW1evHEZM0xdodSZI0biMn0Um2A84FngxsBK5Jsqaqru8pdgpwV1UdkuRk4HXAST3j3wD886ixSNKoluKPLpe5sdiXWdJ4jaMm+ghgXVXdCJDkEmAVTc3ylFXAmW33pcBbkqSqKsnxwNeA74whFkmSJE2ASf8xO4420fsBG3r6N7bDBpapqvuAu4E9kzwEeCnwp2OIQ5IkSZoT831j4ZnAG6vq21srmOS0JGuTrN20adPsRyZJkiQNMY7mHLcAB/T0798OG1RmY5JlwK40NxgeCZyQ5M+B3YAfJrmnqt7SP5OqOg84D2DlypU1hrglSZKkbTKOJPoa4NAkB9MkyycDv95XZg2wGvg4cAJwVVUV8AtTBZKcCXx7UAItSZIkLSQjJ9FVdV+S5wOX0zzi7sKqui7JWcDaqloDXAC8M8k6YDNNoi1JkiRNpLE8J7qqLgMu6xt2Rk/3PcCJW5nGmeOIRZIkSZptvvZb0qIw6Y9KkiRNFpNoTcvERJK02PjdpnEwiZYkaQkwcZTGa76fEy1JkiRNHJNoSZIkqSOTaEmSJKkjk2hJkiSpI5NoSZIkqSOTaEmSJKkjk2hJkiSpI5NoSZIkqSNftiJpQfLFEPPPbSBtO
4+fxc8kWlpCPKlLkjQeJtGSJE2gxfCjeLaXYTGsIy1cJtHaJp6YJElamvpzgKX6/W8SLUlLhD9+JWl8TKIlaYEz+ZWkhcckWpolJj6SJC1ePidakiRJ6sgkWpIkSerI5hySJEmaNYu1eaNJtKR55aOSJEmTaCzNOZIcl+TLSdYlOX3A+B2TvLcd/8kkB7XDn5zk2iSfb/8/aRzxSJIkSbNp5JroJNsB5wJPBjYC1yRZU1XX9xQ7Bbirqg5JcjLwOuAk4A7gmVX1X0l+Crgc2G/UmNTdYr3UImlp8Bwmaa6Noyb6CGBdVd1YVfcClwCr+sqsAi5quy8Fjk6Sqvp0Vf1XO/w6YKckO44hJkmSJGnWjKNN9H7Ahp7+jcCRw8pU1X1J7gb2pKmJnvKrwKeq6nuDZpLkNOA0gBUrVowhbEmaOWs6JUm9FsQj7pIcTtPE47nDylTVeVW1sqpWLl++fO6CkyRJkvqMoyb6FuCAnv7922GDymxMsgzYFbgTIMn+wD8Av1lVXx1DPAveYq7RWszLpsnkPilJmg3jqIm+Bjg0ycFJdgBOBtb0lVkDrG67TwCuqqpKshvwQeD0qvroGGKRJEmSZt3INdFtG+fn0zxZYzvgwqq6LslZwNqqWgNcALwzyTpgM02iDfB84BDgjCRntMOOrarbR41LWsiGPRvZWlNJkibDWF62UlWXAZf1DTujp/se4MQBn3s18OpxxCBJkiTNlQVxY6EkSZI0SXzttyaOTR4kSdJ8M4mWpFkyrO27JGnymURrybJGW5IkbSuTaEmSFhErCKS5YRLdgScmSZIkgU/nkCRJkjoziZYkSZI6sjnHEmOTFEmSJoff2wuXSfSE8+CS5pfHoCQtTSbRkiaKSaskLUxL7fxsm2hJkiSpI5NoSZIkqSOTaEmSJKkjk2hJkiSpI28slGZoqd0wIUmShrMmWpIkSerImmhJkqR55tXOyWMSLS1CnownU9ft5naWpPljcw5JkiSpI2uitWhYKydJkubKWJLoJMcBbwa2A86vqnP6xu8IvAP4WeBO4KSqWt+OexlwCvAD4AVVdfk4YpIkyR/XkmbLyEl0ku2Ac4EnAxuBa5Ksqarre4qdAtxVVYckORl4HXBSksOAk4HDgX2BDyd5ZFX9YNS4JlX/Cd+TvSRJ0sIzjjbRRwDrqurGqroXuARY1VdmFXBR230pcHSStMMvqarvVdXXgHXt9CRJkqQFK1U12gSSE4DjqurUtv85wJFV9fyeMl9oy2xs+78KHAmcCXyiqt7VDr8A+OequnTAfE4DTgNYsWLFz950000jxT1Owy4XjvMy4qTftT9dPMNq3+drGRbauoPZj2kultmrLPeb7XUxn/vLTI/nbT3OJ+lYmJRl7mouzudz8b06rvl2PZ7HtQyTvh9NiiTXVtXKQeMm5ukcVXVeVa2sqpXLly+f73AkSZK0hI3jxsJbgAN6+vdvhw0qszHJMmBXmhsMZ/JZSYuANR6SpMVkHDXR1wCHJjk4yQ40Nwqu6SuzBljddp8AXFVNO5I1wMlJdkxyMHAo8J9jiEmSJEmaNSPXRFfVfUmeD1xO84i7C6vquiRnAWurag1wAfDOJOuAzTSJNm259wHXA/cBv7+Un8whSZKkyTCW50RX1WXAZX3Dzujpvgc4cchnXwO8ZhxxSJIkSXPBNxZKkmadbeKlhsfC4jExT+eQJEmSFgqTaEmSJKkjm3NIfbzUJkmStsaaaEmSJKkjk2hJkiSpI5NoSZIkqSOTaEmSJKkjbyyUJEkLijd4axKYREuSxsbkR9JSYRItycRHkqSOTKIlaY75o0WSJp9JtCQtMibpkjT7TKIlSZK2kT9aly4fcSdJkiR1ZBItSZIkdWRzjgnh5SJJ8lwoaeGwJlqSJEnqyJpoSZIWMGvf7+e60EJiTbQkSZLUkTXRkiRJE6Zrrby1+OM3Uk10kj2SXJHkhvb/7kPKrW7L3JBkdTts5yQfTPKlJNclOWeUWCRJk
qS5MmpzjtOBK6vqUODKtn8LSfYAXgUcCRwBvKon2X59VT0KeCzw80meOmI8kiRJ0qwbtTnHKuCotvsi4GrgpX1lngJcUVWbAZJcARxXVRcDHwGoqnuTfArYf8R4tIh46UlamDw2JWn0mui9q+rWtvs2YO8BZfYDNvT0b2yH/UiS3YBn0tRmS5IkSQvaVmuik3wYeNiAUa/o7amqSlJdA0iyDLgY+MuqunGacqcBpwGsWLGi62wkSZKksdlqEl1Vxwwbl+TrSfapqluT7APcPqDYLdzf5AOaJhtX9/SfB9xQVW/aShzntWVZuXJl52RdmmRePpcWj6V4PC/FZdbiN2pzjjXA6rZ7NfD+AWUuB45Nsnt7Q+Gx7TCSvBrYFXjRiHFIkiRJc2bUJPoc4MlJbgCOaftJsjLJ+QDtDYVnA9e0f2dV1eYk+9M0CTkM+FSSzyQ5dcR4JEmSpFk30tM5qupO4OgBw9cCp/b0Xwhc2FdmI5BR5i9JkiTNB1/7LUmSJHXka78XKW/ikDQJPFdJmlQm0Zp3folKWig8H0maKZPoWeTJWJIkaXGyTbQkSZLUkUm0JEmS1JHNOSRJkqZh80wNYk20JEmS1JFJtCRJktSRzTkkSdKM2bRBalgTLUmSJHVkTbQkSRoLa6m1lFgTLUmSJHVkTbTmhLUTkiRpMbEmWpIkSerIJFqSJEnqyCRakiRJ6sg20WNge19JkqSlxZpoSZIkqSOTaEmSJKkjk2hJkiSpI5NoSZIkqaORkugkeyS5IskN7f/dh5Rb3Za5IcnqAePXJPnCKLFIkiRJc2XUmujTgSur6lDgyrZ/C0n2AF4FHAkcAbyqN9lO8ivAt0eMQ5IkSZozoybRq4CL2u6LgOMHlHkKcEVVba6qu4ArgOMAkjwE+EPg1SPGIUmSJM2ZUZPovavq1rb7NmDvAWX2Azb09G9shwGcDfwF8N2tzSjJaUnWJlm7adOmEUKWJEmSRrPVl60k+TDwsAGjXtHbU1WVpGY64ySPAR5RVX+Q5KCtla+q84DzAFauXDnj+UiSJEnjttUkuqqOGTYuydeT7FNVtybZB7h9QLFbgKN6+vcHrgaeAKxMsr6N48eTXF1VRyFJkiQtYKO+9nsNsBo4p/3//gFlLgf+rOdmwmOBl1XVZuBtAG1N9AdMoCVJ0nxZf87T5zsETZBR20SfAzw5yQ3AMW0/SVYmOR+gTZbPBq5p/85qh0mSJEkTaaSa6Kq6Ezh6wPC1wKk9/RcCF04znfXAT40SiyRJkjRXfGOhJEmS1JFJtCRJktSRSbQkSZLUkUm0JEmS1JFJtCRJktSRSbQkSZLUkUm0JEmS1JFJtCRJktSRSbQkSZLUkUm0JEmS1JFJtCRJktSRSbQkSZLUkUm0JEmS1JFJtCRJktSRSbQkSZLUkUm0JEmS1JFJtCRJktSRSbQkSZLUkUm0JEmS1NGy+Q5AkqTFaP05T5/vECTNImuiJUmSpI5GSqKT7JHkiiQ3tP93H1JudVvmhiSre4bvkOS8JF9J8qUkvzpKPJIkSdJcGLUm+nTgyqo6FLiy7d9Ckj2AVwFHAkcAr+pJtl8B3F5VjwQOA/51xHgkSZKkWTdqEr0KuKjtvgg4fkCZpwBXVNXmqroLuAI4rh33O8BrAarqh1V1x4jxSJIkSbNu1CR676q6te2+Ddh7QJn9gA09/RuB/ZLs1vafneRTSf42yaDPA5DktCRrk6zdtGnTiGFLkiRJ226rSXSSDyf5woC/Vb3lqqqA6jDvZcD+wMeq6nHAx4HXDytcVedV1cqqWrl8+fIOs5EkSZLGa6uPuKuqY4aNS/L1JPtU1a1J9gFuH1DsFuConv79gauBO4HvAn/fDv9b4JSZhS1JkiTNn1Gbc6wBpp62sRp4/4AylwPHJtm9vaHwWODytub6n7g/wT4auH7EeCRJkqRZN2oSfQ7w5CQ3AMe0/SRZmeR8gKraDJwNXNP+ndUOA3gpcGaSzwHPAV48YjySJ
EnSrEtTITxZkmwCbprnMPYCfJrI4ud2XhrczkuD23lpcDsvDXO1nQ+sqoE3401kEr0QJFlbVSvnOw7NLrfz0uB2XhrczkuD23lpWAjb2dd+S5IkSR2ZREuSJEkdmURvu/PmOwDNCbfz0uB2XhrczkuD23lpmPftbJtoSZIkqSNroiVJkqSOTKIlSZKkjkyiO0pyXJIvJ1mX5PT5jkfjkeSAJB9Jcn2S65K8sB2+R5IrktzQ/t99vmPV6JJsl+TTST7Q9h+c5JPtcf3eJDvMd4waTZLdklya5EtJvpjkCR7Pi0+SP2jP2V9IcnGSB3s8T74kFya5PckXeoYNPH7T+Mt2e38uyePmKk6T6A6SbAecCzwVOAx4VpLD5jcqjcl9wIur6jDg8cDvt9v2dODKqjoUuLLt1+R7IfDFnv7XAW+sqkOAu4BT5iUqjdObgX+pqkcBP0OzvT2eF5Ek+wEvAFZW1U8B2wEn4/G8GPwNcFzfsGHH71OBQ9u/04C3zVGMJtEdHQGsq6obq+pe4BJg1TzHpDGoqlur6lNt97dovnD3o9m+F7XFLgKOn5cANTZJ9geeDpzf9gd4EnBpW8TtPOGS7Ar8InABQFXdW1XfwON5MVoG7JRkGbAzcCsezxOvqv4N2Nw3eNjxuwp4RzU+AeyWZJ+5iNMkupv9gA09/RvbYVpEkhwEPBb4JLB3Vd3ajroN2Hu+4tLYvAl4CfDDtn9P4BtVdV/b73E9+Q4GNgF/3TbbOT/JLng8LypVdQvweuBmmuT5buBaPJ4Xq2HH77zlZibRUo8kDwH+DnhRVX2zd1w1z4P0mZATLMkzgNur6tr5jkWzahnwOOBtVfVY4Dv0Nd3weJ58bZvYVTQ/mvYFduGBTQC0CC2U49ckuptbgAN6+vdvh2kRSLI9TQL97qr6+3bw16cuC7X/b5+v+DQWPw/8cpL1NM2xnkTTdna39nIweFwvBhuBjVX1ybb/Upqk2uN5cTkG+FpVbaqq7wN/T3OMezwvTsOO33nLzUyiu7kGOLS983cHmhsY1sxzTBqDtl3sBcAXq+oNPaPWAKvb7tXA++c6No1PVb2sqvavqoNojt+rqurZwEeAE9pibucJV1W3ARuS/EQ76GjgejyeF5ubgccn2bk9h09tZ4/nxWnY8bsG+M32KR2PB+7uafYxq3xjYUdJnkbTpnI74MKqes38RqRxSPJE4N+Bz3N/W9mX07SLfh+wArgJ+LWq6r/ZQRMoyVHAH1XVM5I8nKZmeg/g08BvVNX35jE8jSjJY2huHt0BuBH4bZqKI4/nRSTJnwIn0Txh6dPAqTTtYT2eJ1iSi4GjgL2ArwOvAv6RAcdv+wPqLTRNeb4L/HZVrZ2TOE2iJUmSpG5sziFJkiR1ZBItSZIkdWQSLUmSJHVkEi1JkiR1ZBItSZIkdWQSLUmSJHVkEi1JkiR19P8BdUsrYobT1wEAAAAASUVORK5CYII=\n", "text/plain": ["
"]}, "metadata": {"needs_background": "light"}, "output_type": "display_data"}], "source": ["import matplotlib.pyplot as plt\n", "fig, ax = plt.subplots(1, 1, figsize=(12, 4))\n", "ax.bar(numpy.arange(0, len(lr.coef_)), lr.coef_)\n", "ax.set_title(\"Coefficients pour chaque arbre calcul\u00e9s avec une r\u00e9gression lin\u00e9aire\");"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Le score est avec une r\u00e9gression lin\u00e9aire sur les variables initiales est nettement moins \u00e9lev\u00e9."]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.5103612609676136"]}, "execution_count": 14, "metadata": {}, "output_type": "execute_result"}], "source": ["lr_raw = LinearRegression()\n", "lr_raw.fit(X_train, y_train)\n", "r2_score(y_test, lr_raw.predict(X_test))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## S\u00e9lection d'arbres\n", "\n", "L'id\u00e9e est d'utiliser un algorithme de s\u00e9lection de variables type [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) pour r\u00e9duire la for\u00eat al\u00e9atoire sans perdre en performance. C'est presque le m\u00eame code."]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([ 0.01256934, -0.03342528, -0.02400605, 0.01825851, 0.0005323 ,\n", " -0.01374509, 0.01004616, -0.02284903, 0.01105419, 0.02047233,\n", " 0.03476362, 0.02755575, 0.01751674, -0.03051477, -0.02321124,\n", " -0.01783216, 0.01429992, 0.00214398, 0.04066576, 0.0134879 ,\n", " -0.00377705, 0.00506043, 0.04614375, 0.02482044, 0.01560689,\n", " 0.01706262, -0.02035898, 0.05747191, 0.00418486, 0.02766988,\n", " -0.00899098, 0.04325266, 0.07327657, 0.06515135, -0.00034774,\n", " 0.02210777, 0.0280344 , 0.00852669, 0.0358763 , 0.01779845,\n", " 0. 
, 0.03970822, 0.01935286, 0.00908017, 0.01417323,\n", " -0.01066044, 0.00293442, 0.02483663, 0.00332255, -0.01043329,\n", " 0.02666477, 0.04097776, 0.02851599, -0.01795373, -0.00830115,\n", " -0.00293032, -0.02188798, 0.06679156, -0.01364001, 0.02028321,\n", " 0.00160792, 0.04114419, -0.02342478, 0.04638246, 0.0547764 ,\n", " -0.05501755, -0.01856303, 0.04157578, -0.01403205, 0.02718244,\n", " 0.01215738, 0.03503149, 0.04403975, 0.04640854, -0.02884553,\n", " -0.02929629, -0.01946676, 0.02679733, -0.00779812, -0.01418256,\n", " 0. , 0.02734732, 0.03608281, -0.04111661, -0.02654714,\n", " -0.01106999, 0.02664032, -0.00291639, 0.00541073, 0.01187597,\n", " -0.01621428, 0.05386765, -0.01531834, 0.04807872, 0.04398675,\n", " 0.02611443, 0.00944403, 0.02219076, 0.04080548, 0.05276076])"]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.linear_model import Lasso\n", "\n", "lrs = Lasso(max_iter=10000)\n", "lrs.fit(X_train_2, y_train)\n", "lrs.coef_"]}, {"cell_type": "markdown", "metadata": {}, "source": ["A few coefficients are exactly zero, so the corresponding trees are not used."]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.3055529526371402"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["r2_score(y_test, lrs.predict(X_test_2))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Not much loss... which makes one want to try several values of `alpha`."]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": [" 0%| | 0/200 [00:00\n", "
     alpha  null        r2
195   10.5    83  0.318660
196   10.6    83  0.318771
197   10.7    83  0.318879
198   10.8    83  0.318982
199   10.9    82  0.319073
\n", ""], "text/plain": [" alpha null r2\n", "195 10.5 83 0.318660\n", "196 10.6 83 0.318771\n", "197 10.7 83 0.318879\n", "198 10.8 83 0.318982\n", "199 10.9 82 0.319073"]}, "execution_count": 18, "metadata": {}, "output_type": "execute_result"}], "source": ["from pandas import DataFrame\n", "df = DataFrame(obs)\n", "df.tail()"]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": ["
"]}, "metadata": {"needs_background": "light"}, "output_type": "display_data"}], "source": ["fig, ax = plt.subplots(1, 2, figsize=(12, 4))\n", "df[[\"alpha\", \"null\"]].set_index(\"alpha\").plot(ax=ax[0], logx=True)\n", "ax[0].set_title(\"Number of non-zero coefficients\")\n", "df[[\"alpha\", \"r2\"]].set_index(\"alpha\").plot(ax=ax[1], logx=True)\n", "ax[1].set_title(\"r2\");"]}, {"cell_type": "markdown", "metadata": {}, "source": ["In this case, removing trees improves performance: as mentioned above, it reduces overfitting. At the largest value tested (`alpha = 10.9`), 82 of the 100 trees keep a non-zero coefficient, so about one tree in five can be dropped with this model."]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5"}}, "nbformat": 4, "nbformat_minor": 2}