{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Handling categories\n", "\n", "This notebook presents several options for handling categorical features stored as integers or as text."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": ["%matplotlib inline"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We build a very simple dataset with two categorical columns, one stored as integers and one stored as text."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>cat_int</th><th>cat_text</th></tr></thead>\n", "<tbody>\n", "
<tr><th>0</th><td>10.0</td><td>catA</td></tr>\n", "
<tr><th>1</th><td>20.0</td><td>catB</td></tr>\n", "
<tr><th>2</th><td>10.0</td><td>catA</td></tr>\n", "
<tr><th>3</th><td>39.0</td><td>catDD</td></tr>\n", "
<tr><th>4</th><td>10.0</td><td>catB</td></tr>\n", "
<tr><th>5</th><td>10.0</td><td>NaN</td></tr>\n", "
<tr><th>6</th><td>NaN</td><td>catB</td></tr>\n", "
</tbody>\n", "</table>\n", "
"], "text/plain": [" cat_int cat_text\n", "0 10.0 catA\n", "1 20.0 catB\n", "2 10.0 catA\n", "3 39.0 catDD\n", "4 10.0 catB\n", "5 10.0 NaN\n", "6 NaN catB"]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["import pandas\n", "import numpy\n", "df = pandas.DataFrame(dict(cat_int=[10, 20, 10, 39, 10, 10, numpy.nan],\n", "                           cat_text=['catA', 'catB', 'catA', 'catDD', 'catB', numpy.nan, 'catB']))\n", "df"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Transforming a category\n", "\n", "The first operations convert a category stored as integers or as text into integer codes. Missing values are not always handled the same way: on this data, `LabelEncoder` gives NaN its own code in the numeric column but raises an error on the text column."]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/plain": ["array([0, 1, 0, 2, 0, 0, 3], dtype=int64)"]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.preprocessing import LabelEncoder\n", "# Numeric column: NaN receives its own integer code\n", "LabelEncoder().fit_transform(df['cat_int'])"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["'<' not supported between instances of 'float' and 'str'\n"]}, {"data": {"text/plain": ["array([0, 1, 0, 2, 1, 1], dtype=int64)"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["# Text column: str and NaN (a float) cannot be sorted together, so the first call fails\n", "try:\n", "    LabelEncoder().fit_transform(df['cat_text'])\n", "except Exception as e:\n", "    print(e)\n", "# Dropping missing values first makes the encoding work\n", "LabelEncoder().fit_transform(df['cat_text'].dropna())"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The mapping between each category and its code can be retrieved from the fitted encoder: the position of a category in `classes_` is its code."]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/plain": ["array(['catA', 'catB', 'catDD'], dtype=object)"]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["le = 
LabelEncoder()\n", "le.fit(df['cat_text'].dropna())\n", "le.classes_"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The second operation turns an integer category into several binary columns, one per distinct value (one-hot encoding)."]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Input contains NaN, infinity or a value too large for dtype('float64').\n"]}, {"data": {"text/plain": ["matrix([[1., 0., 0.],\n", " [0., 1., 0.],\n", " [1., 0., 0.],\n", " [0., 0., 1.],\n", " [1., 0., 0.],\n", " [1., 0., 0.]])"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.preprocessing import OneHotEncoder\n", "# The first call fails on NaN; the second drops missing values before encoding\n", "try:\n", "    OneHotEncoder().fit_transform(df[['cat_int']]).todense()\n", "except Exception as e:\n", "    print(e)\n", "OneHotEncoder().fit_transform(df[['cat_int']].dropna()).todense()"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Unknown label type: ( cat_int\n", "0 10.0\n", "1 20.0\n", "2 10.0\n", "3 39.0\n", "4 10.0\n", "5 10.0\n", "6 NaN,)\n"]}, {"data": {"text/plain": ["array([[1, 0, 0],\n", " [0, 1, 0],\n", " [1, 0, 0],\n", " [0, 0, 1],\n", " [1, 0, 0],\n", " [1, 0, 0]])"]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.preprocessing import LabelBinarizer\n", "try:\n", "    LabelBinarizer().fit_transform(df[['cat_int']])\n", "except Exception as e:\n", "    print(e)\n", "LabelBinarizer().fit_transform(df[['cat_int']].dropna())"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["'<' not supported between instances of 'float' and 'str'\n"]}, {"data": {"text/plain": ["array([[1, 0, 0],\n", " [0, 1, 0],\n", " [1, 0, 0],\n", " [0, 0, 1],\n", " [0, 1, 0],\n", " [0, 1, 0]])"]}, "execution_count": 10, "metadata": {}, 
"output_type": "execute_result"}], "source": ["from sklearn.preprocessing import LabelBinarizer\n", "try:\n", "    LabelBinarizer().fit_transform(df[['cat_text']])\n", "except Exception as e:\n", "    print(e)\n", "LabelBinarizer().fit_transform(df[['cat_text']].dropna())"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Other options exist whose implementations work somewhat differently: `DictVectorizer` and `FeatureHasher` take a list of dictionaries, one per row, instead of a column."]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"data": {"text/plain": ["matrix([[10., 0., 1., 0., 0.],\n", " [20., 0., 0., 1., 0.],\n", " [10., 0., 1., 0., 0.],\n", " [39., 0., 0., 0., 1.],\n", " [10., 0., 0., 1., 0.],\n", " [10., nan, 0., 0., 0.],\n", " [nan, 0., 0., 1., 0.]])"]}, "execution_count": 11, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.feature_extraction import DictVectorizer\n", "DictVectorizer().fit_transform(df.to_dict('records')).todense()"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"text/plain": ["matrix([[ 0., 0., 0., 1., 10.],\n", " [ 1., 0., 0., 0., 20.],\n", " [ 0., 0., 0., 1., 10.],\n", " [ 0., -1., 0., 0., 39.],\n", " [ 1., 0., 0., 0., 10.],\n", " [ 0., 0., 0., 0., nan],\n", " [ 1., 0., 0., 0., nan]])"]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.feature_extraction import FeatureHasher\n", "FeatureHasher(n_features=5).fit_transform(df.to_dict('records')).todense()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Gradient-based and ensemble methods\n", "\n", "We build a simple dataset for a linear regression, $Y = -10 X - 7$ plus a small uniform noise, then fit a linear regression $Y \\sim \\alpha X_2 + \\epsilon$ where $X_2$ is obtained by applying a random permutation to the values of *X* (column `X1` in the code). 
The linear link between the feature and the target is, in a sense, broken."]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>X1</th><th>X2</th><th>Y</th></tr></thead>\n", "<tbody>\n", "
<tr><th>0</th><td>4</td><td>7</td><td>-46.420964</td></tr>\n", "
<tr><th>1</th><td>6</td><td>9</td><td>-66.321194</td></tr>\n", "
<tr><th>2</th><td>3</td><td>5</td><td>-36.001053</td></tr>\n", "
<tr><th>3</th><td>0</td><td>2</td><td>-6.802070</td></tr>\n", "
<tr><th>4</th><td>8</td><td>3</td><td>-86.044988</td></tr>\n", "
</tbody>\n", "</table>\n", "
"], "text/plain": [" X1 X2 Y\n", "0 4 7 -46.420964\n", "1 6 9 -66.321194\n", "2 3 5 -36.001053\n", "3 0 2 -6.802070\n", "4 8 3 -86.044988"]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["# Random permutation of the ten category values 0..9\n", "perm = numpy.random.permutation(list(range(10)))\n", "n = 1000\n", "X1 = numpy.random.randint(0, 10, (n, 1))\n", "# X2 relabels each value of X1 through the permutation\n", "X2 = numpy.array([perm[i] for i in X1])\n", "# Uniform noise in [0, 1)\n", "eps = numpy.random.random((n, 1))\n", "Y = X1 * (-10) - 7 + eps\n", "data = pandas.DataFrame(dict(X1=X1.ravel(), X2=X2.ravel(), Y=Y.ravel()))\n", "data.head()"]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": ["from sklearn.model_selection import train_test_split\n", "data_train, data_test = train_test_split(data)\n", "data_train = data_train.copy()\n", "data_test = data_test.copy()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We encode the category:"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>X1</th><th>X2</th><th>Y</th><th>X3</th></tr></thead>\n", "<tbody>\n", "
<tr><th>943</th><td>1</td><td>8</td><td>-16.020453</td><td>8</td></tr>\n", "
<tr><th>686</th><td>2</td><td>0</td><td>-26.664813</td><td>0</td></tr>\n", "
<tr><th>312</th><td>9</td><td>6</td><td>-96.833801</td><td>6</td></tr>\n", "
<tr><th>861</th><td>1</td><td>8</td><td>-16.090882</td><td>8</td></tr>\n", "
<tr><th>784</th><td>4</td><td>7</td><td>-46.693131</td><td>7</td></tr>\n", "
</tbody>\n", "</table>\n", "
"], "text/plain": [" X1 X2 Y X3\n", "943 1 8 -16.020453 8\n", "686 2 0 -26.664813 0\n", "312 9 6 -96.833801 6\n", "861 1 8 -16.090882 8\n", "784 4 7 -46.693131 7"]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["le = LabelEncoder().fit(data_train['X2'])\n", "data_train['X3'] = le.transform(data_train['X2'])\n", "data_test['X3'] = le.transform(data_test['X2'])\n", "data_train.head()"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", "\n", "
<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>X1</th><th>X2</th><th>Y</th><th>X3</th></tr></thead>\n", "<tbody>\n", "
<tr><th>X1</th><td>1.000000</td><td>0.145122</td><td>-0.999946</td><td>0.145122</td></tr>\n", "
<tr><th>X2</th><td>0.145122</td><td>1.000000</td><td>-0.144419</td><td>1.000000</td></tr>\n", "
<tr><th>Y</th><td>-0.999946</td><td>-0.144419</td><td>1.000000</td><td>-0.144419</td></tr>\n", "
<tr><th>X3</th><td>0.145122</td><td>1.000000</td><td>-0.144419</td><td>1.000000</td></tr>\n", "
</tbody>\n", "</table>\n", "
"], "text/plain": [" X1 X2 Y X3\n", "X1 1.000000 0.145122 -0.999946 0.145122\n", "X2 0.145122 1.000000 -0.144419 1.000000\n", "Y -0.999946 -0.144419 1.000000 -0.144419\n", "X3 0.145122 1.000000 -0.144419 1.000000"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["data_train.corr()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We fit a linear regression of $Y$ on $X_3$, the encoded category:"]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"data": {"text/plain": ["LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)"]}, "execution_count": 17, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.linear_model import LinearRegression\n", "clr = LinearRegression()\n", "clr.fit(data_train[['X3']], data_train['Y'])"]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.009306055721419959"]}, "execution_count": 18, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.metrics import r2_score\n", "r2_score(data_test['Y'], clr.predict(data_test[['X3']]))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["In other words, with an $R^2$ close to 0 on the test set, it has learned nothing. 
We fit a decision tree:"]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"data": {"text/plain": ["DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,\n", " max_leaf_nodes=None, min_impurity_decrease=0.0,\n", " min_impurity_split=None, min_samples_leaf=1,\n", " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", " presort=False, random_state=None, splitter='best')"]}, "execution_count": 19, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.tree import DecisionTreeRegressor\n", "clr = DecisionTreeRegressor()\n", "clr.fit(data_train[['X3']], data_train['Y'])"]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.9998898230391275"]}, "execution_count": 20, "metadata": {}, "output_type": "execute_result"}], "source": ["r2_score(data_test['Y'], clr.predict(data_test[['X3']]))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The decision tree captured the permutation whereas the linear regression failed. Linear regression is not estimated with a gradient-based method, but it is subject to the same constraint: it works best when the target *Y* is as close as possible to a monotonic function of *X*. 
With one column per category level, the result is entirely different."]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"data": {"text/plain": ["matrix([[0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],\n", " [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],\n", " [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],\n", " [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.]])"]}, "execution_count": 21, "metadata": {}, "output_type": "execute_result"}], "source": ["# Fit the one-hot encoder on the training set; it is reused on the test set below\n", "one = OneHotEncoder().fit(data_train[['X2']])\n", "feat = one.transform(data_train[['X2']])\n", "feat[:5].todense()"]}, {"cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [{"data": {"text/plain": ["LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)"]}, "execution_count": 22, "metadata": {}, "output_type": "execute_result"}], "source": ["clr = LinearRegression()\n", "clr.fit(feat, data_train['Y'])"]}, {"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [{"data": {"text/plain": ["0.9998898230391275"]}, "execution_count": 23, "metadata": {}, "output_type": "execute_result"}], "source": ["r2_score(data_test['Y'], clr.predict(one.transform(data_test[['X2']])))"]}, {"cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4"}}, "nbformat": 4, "nbformat_minor": 2}