.. _artificielcategoryrst:

===================
Handling categories
===================

.. only:: html

    **Links:** :download:`notebook `, :downloadlink:`html `, :download:`PDF `, :download:`python `, :downloadlink:`slides `, :githublink:`GitHub|_doc/notebooks/lectures/artificiel_category.ipynb|*`

This notebook presents several options for handling categories stored as integers or as text.

.. code:: ipython3

    from jyquickhelper import add_notebook_menu
    add_notebook_menu()

.. contents::
    :local:

.. code:: ipython3

    %matplotlib inline

We build a very simple dataset with two categories, one stored as integers and one stored as text. Each column contains one missing value.

.. code:: ipython3

    import pandas
    import numpy
    df = pandas.DataFrame(dict(cat_int=[10, 20, 10, 39, 10, 10, numpy.nan],
                               cat_text=['catA', 'catB', 'catA', 'catDD', 'catB', numpy.nan, 'catB']))
    df

.. raw:: html

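    <table border="1" class="dataframe">
      <thead>
        <tr><th></th><th>cat_int</th><th>cat_text</th></tr>
      </thead>
      <tbody>
        <tr><th>0</th><td>10.0</td><td>catA</td></tr>
        <tr><th>1</th><td>20.0</td><td>catB</td></tr>
        <tr><th>2</th><td>10.0</td><td>catA</td></tr>
        <tr><th>3</th><td>39.0</td><td>catDD</td></tr>
        <tr><th>4</th><td>10.0</td><td>catB</td></tr>
        <tr><th>5</th><td>10.0</td><td>NaN</td></tr>
        <tr><th>6</th><td>NaN</td><td>catB</td></tr>
      </tbody>
    </table>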
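
The frame above has one missing value in each column. As a point of comparison with the encoders used in this notebook, here is a minimal sketch of how pandas itself can map categories to integer codes. It relies on ``pandas.factorize`` and on the ``category`` dtype, neither of which appears in the original notebook. Both reserve the code -1 for missing values instead of raising an error.

```python
import numpy
import pandas

df = pandas.DataFrame(dict(
    cat_int=[10, 20, 10, 39, 10, 10, numpy.nan],
    cat_text=['catA', 'catB', 'catA', 'catDD', 'catB', numpy.nan, 'catB']))

# factorize numbers the categories in order of first appearance
# and assigns -1 to missing values.
codes_text, uniques_text = pandas.factorize(df['cat_text'])

# The category dtype sorts the distinct values first, then assigns codes;
# missing values also receive -1.
codes_int = df['cat_int'].astype('category').cat.codes

print(codes_text, list(uniques_text))
print(codes_int.tolist())
```

Keeping -1 as an explicit "missing" code is one possible design choice; whether a model should see it as a category of its own depends on the data.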

Category transformations
------------------------

The first kind of operation converts a category, stored as integers or as text, into integer codes. Missing values are not always handled the same way: with the integer column, ``NaN`` becomes a code of its own, whereas with the text column the encoder fails because it cannot sort a mix of ``float`` and ``str``.

.. code:: ipython3

    from sklearn.preprocessing import LabelEncoder
    LabelEncoder().fit_transform(df['cat_int'])

.. parsed-literal::

    array([0, 1, 0, 2, 0, 0, 3], dtype=int64)

.. code:: ipython3

    try:
        LabelEncoder().fit_transform(df['cat_text'])
    except Exception as e:
        print(e)
    LabelEncoder().fit_transform(df['cat_text'].dropna())

.. parsed-literal::

    '<' not supported between instances of 'float' and 'str'

.. parsed-literal::

    array([0, 1, 0, 2, 1, 1], dtype=int64)

The mapping between each category and its code can be retrieved from the fitted encoder.

.. code:: ipython3

    le = LabelEncoder()
    le.fit(df['cat_text'].dropna())
    le.classes_

.. parsed-literal::

    array(['catA', 'catB', 'catDD'], dtype=object)

The second kind of operation turns an integer category into several binary columns, one per distinct value.

.. code:: ipython3

    from sklearn.preprocessing import OneHotEncoder
    try:
        OneHotEncoder().fit_transform(df[['cat_int']]).todense()
    except Exception as e:
        print(e)
    OneHotEncoder().fit_transform(df[['cat_int']].dropna()).todense()

.. parsed-literal::

    Input contains NaN, infinity or a value too large for dtype('float64').

.. parsed-literal::

    matrix([[1., 0., 0.],
            [0., 1., 0.],
            [1., 0., 0.],
            [0., 0., 1.],
            [1., 0., 0.],
            [1., 0., 0.]])

.. code:: ipython3

    from sklearn.preprocessing import LabelBinarizer
    try:
        LabelBinarizer().fit_transform(df[['cat_int']])
    except Exception as e:
        print(e)
    LabelBinarizer().fit_transform(df[['cat_int']].dropna())

.. parsed-literal::

    Unknown label type: (   cat_int
    0     10.0
    1     20.0
    2     10.0
    3     39.0
    4     10.0
    5     10.0
    6      NaN,)

.. parsed-literal::

    array([[1, 0, 0],
           [0, 1, 0],
           [1, 0, 0],
           [0, 0, 1],
           [1, 0, 0],
           [1, 0, 0]])

.. code:: ipython3

    from sklearn.preprocessing import LabelBinarizer
    try:
        LabelBinarizer().fit_transform(df[['cat_text']])
    except Exception as e:
        print(e)
    LabelBinarizer().fit_transform(df[['cat_text']].dropna())

.. parsed-literal::

    '<' not supported between instances of 'float' and 'str'

.. parsed-literal::

    array([[1, 0, 0],
           [0, 1, 0],
           [1, 0, 0],
           [0, 0, 1],
           [0, 1, 0],
           [0, 1, 0]])

Other options exist whose implementations work somewhat differently. ``DictVectorizer`` leaves numeric values unchanged (including ``NaN``) and creates one binary column per string value. ``FeatureHasher`` hashes feature names into a fixed number of columns, so distinct features can collide and entries can be negative.

.. code:: ipython3

    from sklearn.feature_extraction import DictVectorizer
    DictVectorizer().fit_transform(df.to_dict('records')).todense()

.. parsed-literal::

    matrix([[10.,  0.,  1.,  0.,  0.],
            [20.,  0.,  0.,  1.,  0.],
            [10.,  0.,  1.,  0.,  0.],
            [39.,  0.,  0.,  0.,  1.],
            [10.,  0.,  0.,  1.,  0.],
            [10., nan,  0.,  0.,  0.],
            [nan,  0.,  0.,  1.,  0.]])

.. code:: ipython3

    from sklearn.feature_extraction import FeatureHasher
    FeatureHasher(n_features=5).fit_transform(df.to_dict('records')).todense()

.. parsed-literal::

    matrix([[ 0.,  0.,  0.,  1., 10.],
            [ 1.,  0.,  0.,  0., 20.],
            [ 0.,  0.,  0.,  1., 10.],
            [ 0., -1.,  0.,  0., 39.],
            [ 1.,  0.,  0.,  0., 10.],
            [ 0.,  0.,  0.,  0., nan],
            [ 1.,  0.,  0.,  0., nan]])

Gradient-based and ensemble methods
-----------------------------------

We build a simple dataset for a linear regression, :math:`Y = -10 X_1 - 7 + \epsilon`, then fit a linear regression :math:`Y \sim \alpha X_2 + \epsilon` where :math:`X_2` is obtained by applying a random permutation of the values 0 to 9 to :math:`X_1`. The linear link between the feature and the target is, in a sense, broken.

.. code:: ipython3

    perm = numpy.random.permutation(list(range(10)))
    n = 1000
    X1 = numpy.random.randint(0, 10, (n,1))
    X2 = numpy.array([perm[i] for i in X1])
    eps = numpy.random.random((n, 1))
    Y = X1 * (-10) - 7 + eps
    data = pandas.DataFrame(dict(X1=X1.ravel(), X2=X2.ravel(), Y=Y.ravel()))
    data.head()

.. raw:: html

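    <table border="1" class="dataframe">
      <thead>
        <tr><th></th><th>X1</th><th>X2</th><th>Y</th></tr>
      </thead>
      <tbody>
        <tr><th>0</th><td>4</td><td>7</td><td>-46.420964</td></tr>
        <tr><th>1</th><td>6</td><td>9</td><td>-66.321194</td></tr>
        <tr><th>2</th><td>3</td><td>5</td><td>-36.001053</td></tr>
        <tr><th>3</th><td>0</td><td>2</td><td>-6.802070</td></tr>
        <tr><th>4</th><td>8</td><td>3</td><td>-86.044988</td></tr>
      </tbody>
    </table>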
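
Permuting the values does not destroy any information, it only destroys the ordering. A minimal sketch, assuming a fixed seed (the notebook does not set one), shows that the original column can be recovered exactly from the permuted one with the inverse permutation:

```python
import numpy

rng = numpy.random.RandomState(0)   # fixed seed, an assumption for reproducibility
perm = rng.permutation(10)
X1 = rng.randint(0, 10, 1000)
X2 = perm[X1]                       # same construction as above, vectorized

# argsort of a permutation gives its inverse: inverse[perm[i]] == i.
inverse = numpy.argsort(perm)
recovered = inverse[X2]
print((recovered == X1).all())
```

A model that can treat each value of the permuted column separately therefore has everything it needs to predict the target; a model that relies on the ordering of the values does not.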

.. code:: ipython3

    from sklearn.model_selection import train_test_split
    data_train, data_test = train_test_split(data)
    data_train = data_train.copy()
    data_test = data_test.copy()

The category is then encoded. Since ``X2`` already takes the integer values 0 to 9, the encoded column ``X3`` is identical to ``X2`` here:

.. code:: ipython3

    le = LabelEncoder().fit(data_train['X2'])
    data_train['X3'] = le.transform(data_train['X2'])
    data_test['X3'] = le.transform(data_test['X2'])
    data_train.head()

.. raw:: html

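    <table border="1" class="dataframe">
      <thead>
        <tr><th></th><th>X1</th><th>X2</th><th>Y</th><th>X3</th></tr>
      </thead>
      <tbody>
        <tr><th>943</th><td>1</td><td>8</td><td>-16.020453</td><td>8</td></tr>
        <tr><th>686</th><td>2</td><td>0</td><td>-26.664813</td><td>0</td></tr>
        <tr><th>312</th><td>9</td><td>6</td><td>-96.833801</td><td>6</td></tr>
        <tr><th>861</th><td>1</td><td>8</td><td>-16.090882</td><td>8</td></tr>
        <tr><th>784</th><td>4</td><td>7</td><td>-46.693131</td><td>7</td></tr>
      </tbody>
    </table>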
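
Fitting the encoder on the training set only is safe here because, with ten values and a thousand rows, every value is almost certain to appear in both splits. A minimal sketch on hypothetical toy labels shows what happens otherwise: ``LabelEncoder.transform`` raises a ``ValueError`` on an unseen label, whereas ``OrdinalEncoder`` (assuming scikit-learn 0.24 or later) can map unseen labels to a sentinel value.

```python
import numpy
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

train = numpy.array(['a', 'b', 'c'])   # hypothetical training labels
test = numpy.array(['a', 'd'])         # 'd' never appears in train

le = LabelEncoder().fit(train)
try:
    le.transform(test)
    unseen_raises = False
except ValueError as e:
    unseen_raises = True
    print(e)

# OrdinalEncoder expects a 2D input, one column per feature.
oe = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
oe.fit(train.reshape(-1, 1))
encoded = oe.transform(test.reshape(-1, 1)).ravel()
print(encoded)
```

The sentinel -1 is an arbitrary choice for this illustration; any value outside the range of fitted codes would do.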

.. code:: ipython3

    data_train.corr()

.. raw:: html

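    <table border="1" class="dataframe">
      <thead>
        <tr><th></th><th>X1</th><th>X2</th><th>Y</th><th>X3</th></tr>
      </thead>
      <tbody>
        <tr><th>X1</th><td>1.000000</td><td>0.145122</td><td>-0.999946</td><td>0.145122</td></tr>
        <tr><th>X2</th><td>0.145122</td><td>1.000000</td><td>-0.144419</td><td>1.000000</td></tr>
        <tr><th>Y</th><td>-0.999946</td><td>-0.144419</td><td>1.000000</td><td>-0.144419</td></tr>
        <tr><th>X3</th><td>0.145122</td><td>1.000000</td><td>-0.144419</td><td>1.000000</td></tr>
      </tbody>
    </table>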
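
The Pearson coefficient only measures linear association, so the weak correlation between ``X2`` and ``Y`` does not mean that ``X2`` carries little information. A minimal sketch, rebuilding similar data with a fixed seed (an assumption, the notebook does not set one): replacing each value of ``X2`` by the mean of ``Y`` for that value, the idea behind target (mean) encoding, yields a feature that is almost perfectly correlated with ``Y``. The column name ``Y_mean_by_X2`` is introduced for this illustration only.

```python
import numpy
import pandas

rng = numpy.random.RandomState(0)      # fixed seed, an assumption
perm = rng.permutation(10)
X1 = rng.randint(0, 10, 1000)
X2 = perm[X1]                          # X2 is a relabelling of X1
Y = X1 * (-10) - 7 + rng.random_sample(1000)
data = pandas.DataFrame(dict(X1=X1, X2=X2, Y=Y))

# Mean of Y within each value of X2, broadcast back to the rows.
data['Y_mean_by_X2'] = data.groupby('X2')['Y'].transform('mean')

corr = data.corr()
print(corr.loc['X2', 'Y'], corr.loc['Y_mean_by_X2', 'Y'])
```

In a real setting the means would be computed on the training set only, otherwise the target leaks into the feature.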

We fit a linear regression of :math:`Y` on :math:`X_3`, the encoded category:

.. code:: ipython3

    from sklearn.linear_model import LinearRegression
    clr = LinearRegression()
    clr.fit(data_train[['X3']], data_train['Y'])

.. parsed-literal::

    LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

.. code:: ipython3

    from sklearn.metrics import r2_score
    r2_score(data_test['Y'], clr.predict(data_test[['X3']]))

.. parsed-literal::

    0.009306055721419959

In other words, the regression learned almost nothing. We now fit a decision tree:

.. code:: ipython3

    from sklearn.tree import DecisionTreeRegressor
    clr = DecisionTreeRegressor()
    clr.fit(data_train[['X3']], data_train['Y'])

.. parsed-literal::

    DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
               max_leaf_nodes=None, min_impurity_decrease=0.0,
               min_impurity_split=None, min_samples_leaf=1,
               min_samples_split=2, min_weight_fraction_leaf=0.0,
               presort=False, random_state=None, splitter='best')

.. code:: ipython3

    r2_score(data_test['Y'], clr.predict(data_test[['X3']]))

.. parsed-literal::

    0.9998898230391275

The decision tree captured the permutation whereas the linear regression failed. Linear regression is not estimated with a gradient-based method here, but it shares the same constraint: the target *Y* should be as close as possible to a monotonic function of *X*. With one column per category value, the outcome is entirely different.

.. code:: ipython3

    one = OneHotEncoder().fit(data_train[['X2']])
    feat = one.transform(data_train[['X2']])
    feat[:5].todense()

.. parsed-literal::

    matrix([[0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
            [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
            [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
            [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
            [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.]])

.. code:: ipython3

    clr = LinearRegression()
    clr.fit(feat, data_train['Y'])

.. parsed-literal::

    LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

.. code:: ipython3

    r2_score(data_test['Y'], clr.predict(one.transform(data_test[['X2']])))

.. parsed-literal::

    0.9998898230391275

The score matches that of the decision tree: with one column per value, the linear regression ends up predicting the mean of *Y* within each category, which is also what the tree does.
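
The encoder and the regression can also be chained so that the same transformation is applied at training and at prediction time. The following is a minimal sketch, not the notebook's own code: the fixed seed, the use of ``make_pipeline`` and the option ``handle_unknown='ignore'`` are assumptions added for the illustration.

```python
import numpy
import pandas
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = numpy.random.RandomState(0)      # fixed seed, an assumption
perm = rng.permutation(10)
X1 = rng.randint(0, 10, 1000)
data = pandas.DataFrame(dict(X2=perm[X1],
                             Y=X1 * (-10) - 7 + rng.random_sample(1000)))
data_train, data_test = train_test_split(data, random_state=0)

# The pipeline one-hot encodes X2, then fits the regression on the
# binary columns; unseen values would be encoded as all zeros.
model = make_pipeline(OneHotEncoder(handle_unknown='ignore'),
                      LinearRegression())
model.fit(data_train[['X2']], data_train['Y'])
score = r2_score(data_test['Y'], model.predict(data_test[['X2']]))
print(score)
```

Wrapping both steps in one object avoids calling ``transform`` by hand on the test set and makes the model easier to cross-validate.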