{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 2A.ml - Interpretability and variable correlations\n", "\n", "The more coefficients a machine learning model contains, the harder its decisions are to interpret. How can we work around this obstacle and understand what the model has learned? This notebook introduces the notion of [feature importance](http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html)."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": ["# Fixes an incompatibility between scipy 1.0 and statsmodels 0.8.\n", "from pymyinstall.fix import fix_scipy10_for_statsmodels08\n", "fix_scipy10_for_statsmodels08()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Linear models\n", "\n", "Linear models are the easiest models to interpret. At equal performance, always prefer the simplest model. [scikit-learn](http://scikit-learn.org/) does not provide the standard tools for analysing linear models (significance tests on the coefficients, eigenvalues). Use [statsmodels](http://statsmodels.sourceforge.net/) to get this information."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["C:\\Python395_x64\\lib\\site-packages\\statsmodels\\tsa\\base\\tsa_model.py:7: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.\n", " from pandas import (to_datetime, Int64Index, DatetimeIndex, Period,\n", "C:\\Python395_x64\\lib\\site-packages\\statsmodels\\tsa\\base\\tsa_model.py:7: FutureWarning: pandas.Float64Index is deprecated and will be removed from pandas in a future version. 
Use pandas.Index with the appropriate dtype instead.\n", " from pandas import (to_datetime, Int64Index, DatetimeIndex, Period,\n"]}], "source": ["import numpy\n", "import statsmodels.api as smapi\n", "nsample = 100\n", "x = numpy.linspace(0, 10, 100)\n", "X = numpy.column_stack((x, x**2 - x))\n", "beta = numpy.array([1, 0.1, 10])\n", "e = numpy.random.normal(size=nsample)\n", "X = smapi.add_constant(X)\n", "y = X @ beta + e"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/plain": ["\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: y R-squared: 1.000\n", "Model: OLS Adj. R-squared: 1.000\n", "Method: Least Squares F-statistic: 3.222e+06\n", "Date: Sat, 12 Feb 2022 Prob (F-statistic): 1.30e-234\n", "Time: 18:53:30 Log-Likelihood: -147.79\n", "No. Observations: 100 AIC: 301.6\n", "Df Residuals: 97 BIC: 309.4\n", "Df Model: 2 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "const 1.2570 0.317 3.969 0.000 0.628 1.886\n", "x1 0.0134 0.133 0.101 0.920 -0.250 0.277\n", "x2 10.0052 0.014 706.336 0.000 9.977 10.033\n", "==============================================================================\n", "Omnibus: 4.968 Durbin-Watson: 1.920\n", "Prob(Omnibus): 0.083 Jarque-Bera (JB): 2.455\n", "Skew: -0.037 Prob(JB): 0.293\n", "Kurtosis: 2.236 Cond. No. 
125.\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\""]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["model = smapi.OLS(y, X)\n", "results = model.fit()\n", "results.summary()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Trees"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Reading list\n", "\n", "* [treeinterpreter](https://github.com/andosa/treeinterpreter)\n", "* [Making Tree Ensembles Interpretable](https://arxiv.org/pdf/1606.05390v1.pdf): the paper proposes simplifying a random forest by approximating its output with a weighted sum of simpler trees.\n", "* [Understanding variable importances in forests of randomized trees](http://papers.nips.cc/paper/4928-understanding-variable-importances-in-forests-of-randomized-trees.pdf): this paper explains more formally how scikit-learn computes the ``feature_importances_`` values for each tree and each forest of trees (see also [Random Forests, Leo Breiman and Adele Cutler](http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### The treeinterpreter module"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": ["from sklearn.datasets import load_iris\n", "iris = load_iris()\n", "X = iris.data\n", "Y = iris.target"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": ["from sklearn.tree import DecisionTreeClassifier\n", "clf2 = DecisionTreeClassifier(max_depth=3)\n", "clf2.fit(X, Y)\n", "Yp2 = clf2.predict(X)"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": ["from sklearn.tree import export_graphviz\n", "export_graphviz(clf2, out_file=\"arbre.dot\")"]}, {"cell_type": 
"code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/plain": ["0"]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["import os\n", "cwd = os.getcwd()\n", "from pyquickhelper.helpgen import find_graphviz_dot\n", "dot = find_graphviz_dot()\n", "os.system (\"\\\"{1}\\\" -Tpng {0}\\\\arbre.dot -o {0}\\\\arbre.png\".format(cwd, dot))"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": [""]}, "execution_count": 10, "metadata": {}, "output_type": "execute_result"}], "source": ["from IPython.display import Image\n", "Image(\"arbre.png\")"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": ["from treeinterpreter import treeinterpreter\n", "pred, bias, contrib = treeinterpreter.predict(clf2, X[106:107,:])"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[4.9, 2.5, 4.5, 1.7]])"]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}], "source": ["X[106:107,:]"]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[0. , 0.97916667, 0.02083333]])"]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["pred"]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[0.33333333, 0.33333333, 0.33333333]])"]}, "execution_count": 14, "metadata": {}, "output_type": "execute_result"}], "source": ["bias"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([[[ 0. , 0. , 0. ],\n", " [ 0. , 0. , 0. ],\n", " [ 0. 
, 0.07175926, -0.07175926],\n", " [-0.33333333, 0.57407407, -0.24074074]]])"]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["contrib"]}, {"cell_type": "markdown", "metadata": {}, "source": ["``pred`` is identical to what scikit-learn's ``predict_proba`` method returns. ``bias`` is the proportion of each class in the training set (the class distribution at the root of the tree). ``contrib`` holds the contribution of each variable to each class. Let $X=(x_1, ..., x_n)$ denote an observation. The prediction decomposes as the bias plus the sum of the contributions:\n", "\n", "$$P(X \\in class(i)) = bias(i) + \\sum_{k=1}^{n} contrib(x_k,i)$$\n", "\n", "For the observation above and class 1: $0.3333 + 0 + 0 + 0.0718 + 0.5741 = 0.9792$.\n", "\n", "The [code](https://github.com/andosa/treeinterpreter/blob/master/treeinterpreter/treeinterpreter.py) is fairly easy to read and shows how the function $contrib$ is computed."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Exercise 1: describe the contrib function\n", "\n", "Reading [Understanding variable importances\n", "in forests of randomized trees](http://papers.nips.cc/paper/4928-understanding-variable-importances-in-forests-of-randomized-trees.pdf) should help."]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([0. , 0. , 0.05393633, 0.94606367])"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["clf2.feature_importances_"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Exercise 2: implement the algorithm\n", "\n", "Described in [Making Tree Ensembles Interpretable](https://arxiv.org/pdf/1606.05390v1.pdf)"]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["## Interpretation and correlation"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Linear models\n", "\n", "Linear models do not cope well with correlated variables. In the following example, the variables $X_2, X_3$ are identical. 
The regression cannot recover the coefficients of the original model (2 and 8): only their sum is identifiable, and the fit splits it evenly between the two identical columns (about 5 each)."]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": ["import numpy\n", "import statsmodels.api as smapi\n", "nsample = 100\n", "x = numpy.linspace(0, 10, 100)\n", "X = numpy.column_stack((x, (x-5)**2, (x-5)**2)) # the same variable is added twice\n", "beta = numpy.array([1, 0.1, 2, 8])\n", "e = numpy.random.normal(size=nsample)\n", "X = smapi.add_constant(X)\n", "y = X @ beta + e"]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"name": "stderr", "output_type": "stream", "text": ["C:\\Python395_x64\\lib\\site-packages\\numpy\\lib\\function_base.py:2691: RuntimeWarning: invalid value encountered in true_divide\n", " c /= stddev[:, None]\n", "C:\\Python395_x64\\lib\\site-packages\\numpy\\lib\\function_base.py:2692: RuntimeWarning: invalid value encountered in true_divide\n", " c /= stddev[None, :]\n"]}, {"data": {"text/plain": [" 0 1 2 3\n", "0 NaN NaN NaN NaN\n", "1 NaN 1.000000e+00 8.513703e-17 8.513703e-17\n", "2 NaN 8.513703e-17 1.000000e+00 1.000000e+00\n", "3 NaN 8.513703e-17 1.000000e+00 1.000000e+00"]}, "execution_count": 19, "metadata": {}, "output_type": "execute_result"}], "source": ["import pandas\n", "pandas.DataFrame(numpy.corrcoef(X.T))"]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"data": {"text/plain": ["\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: y R-squared: 1.000\n", "Model: OLS Adj. R-squared: 1.000\n", "Method: Least Squares F-statistic: 3.806e+05\n", "Date: Sat, 12 Feb 2022 Prob (F-statistic): 1.27e-189\n", "Time: 18:53:59 Log-Likelihood: -126.59\n", "No. Observations: 100 AIC: 259.2\n", "Df Residuals: 97 BIC: 267.0\n", "Df Model: 2 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "const 0.5470 0.199 2.756 0.007 0.153 0.941\n", "x1 0.1396 0.030 4.671 0.000 0.080 0.199\n", "x2 4.9989 0.006 872.449 0.000 4.988 5.010\n", "x3 4.9989 0.006 872.449 0.000 4.988 5.010\n", "==============================================================================\n", "Omnibus: 2.677 Durbin-Watson: 2.150\n", "Prob(Omnibus): 0.262 Jarque-Bera (JB): 1.871\n", "Skew: 0.133 Prob(JB): 0.392\n", "Kurtosis: 2.385 Cond. No. 1.41e+16\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "[2] The smallest eigenvalue is 1.39e-28. 
This might indicate that there are\n", "strong multicollinearity problems or that the design matrix is singular.\n", "\"\"\""]}, "execution_count": 20, "metadata": {}, "output_type": "execute_result"}], "source": ["model = smapi.OLS(y, X)\n", "results = model.fit()\n", "results.summary()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Trees\n", "\n", "Decision trees do not cope with correlated variables any better."]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": ["from sklearn.datasets import load_iris\n", "iris = load_iris()\n", "X = iris.data[:,:2]\n", "Y = iris.target"]}, {"cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [{"data": {"text/plain": ["DecisionTreeClassifier(max_depth=3)"]}, "execution_count": 22, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.tree import DecisionTreeClassifier\n", "clf1 = DecisionTreeClassifier(max_depth=3)\n", "clf1.fit(X, Y)"]}, {"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([0.76759205, 0.23240795])"]}, "execution_count": 23, "metadata": {}, "output_type": "execute_result"}], "source": ["clf1.feature_importances_"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We duplicate the variable $X_1$."]}, {"cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": ["import numpy\n", "X2 = numpy.hstack([X, numpy.ones((X.shape[0], 1))])\n", "X2[:,2] = X2[:,0]"]}, {"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [{"data": {"text/plain": ["DecisionTreeClassifier(max_depth=3)"]}, "execution_count": 25, "metadata": {}, "output_type": "execute_result"}], "source": ["clf2 = DecisionTreeClassifier(max_depth=3)\n", "clf2.fit(X2, Y)"]}, {"cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([0.14454858, 0.23240795, 0.62304347])"]}, "execution_count": 26, 
"metadata": {}, "output_type": "execute_result"}], "source": ["clf2.feature_importances_"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The importance of variable 1 is now spread over two variables: $0.145 + 0.623 = 0.768$, the importance it had on its own."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Exercise 3: correlated variables in a decision tree\n", "\n", "A decision tree is a set of threshold functions: if $X_i > s_i$, follow one branch, otherwise follow the other. Decision trees are not sensitive to the scale of the variables. If two variables are perfectly correlated, $cor(X_1, X_2)= 1$, the tree suffers from the same problems as a linear model. In the linear case, changing the scale to $(X_1, \\ln X_2)$ is enough to avoid the problem.\n", "\n", "* Why does this transformation change nothing for a decision tree?\n", "* Which correlation should be computed to detect variables that are identical from a decision tree's point of view?"]}, {"cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5"}}, "nbformat": 4, "nbformat_minor": 2}