Improved handling of categories #

This notebook presents encodings other than those implemented in scikit-learn.

from jyquickhelper import add_notebook_menu
add_notebook_menu()

%matplotlib inline

We build a very simple dataset with two categorical columns, one holding integers, the other text.

import pandas
import numpy
df = pandas.DataFrame(dict(cat_int=[10, 20, 10, 39, 10, 10, numpy.nan],
                          cat_text=['catA', 'catB', 'catA', 'catDD', 'catB', numpy.nan, 'catB']))
df

	cat_int	cat_text
0	10.0	catA
1	20.0	catB
2	10.0	catA
3	39.0	catDD
4	10.0	catB
5	10.0	NaN
6	NaN	catB

A slightly different API #

The Category Encoders module implements other options with a slightly different API: the columns to encode can be specified explicitly.

from category_encoders import OneHotEncoder
OneHotEncoder(cols=['cat_text']).fit_transform(df)

	cat_text_1	cat_text_2	cat_text_3	cat_text_4	cat_int
0	1	0	0	0	10.0
1	0	1	0	0	20.0
2	1	0	0	0	10.0
3	0	0	1	0	39.0
4	0	1	0	0	10.0
5	0	0	0	1	10.0
6	0	1	0	0	NaN
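
For comparison, a similar per-column one-hot encoding can be sketched with pandas alone. This is an illustration outside the original notebook: `pandas.get_dummies` and its `columns` and `dummy_na` arguments are not used by the author, and the resulting column names differ from those produced by Category Encoders.

```python
import numpy
import pandas

df = pandas.DataFrame(dict(
    cat_int=[10, 20, 10, 39, 10, 10, numpy.nan],
    cat_text=['catA', 'catB', 'catA', 'catDD', 'catB', numpy.nan, 'catB']))

# Encode only 'cat_text'; dummy_na=True adds an indicator column for missing values.
# 'cat_int' is left untouched, as with OneHotEncoder(cols=['cat_text']).
dummies = pandas.get_dummies(df, columns=['cat_text'], dummy_na=True)
print(dummies)
```

Unlike an encoder object, `get_dummies` keeps no fitted state, so it cannot guarantee the same columns on new data; that is one reason to prefer an encoder with `fit` and `transform`.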

Other options #

import category_encoders
encoders = []
for k, enc in category_encoders.__dict__.items():
    if 'Encoder' in k:
        encoders.append(enc)
encoders

[category_encoders.backward_difference.BackwardDifferenceEncoder,
 category_encoders.binary.BinaryEncoder,
 category_encoders.hashing.HashingEncoder,
 category_encoders.helmert.HelmertEncoder,
 category_encoders.one_hot.OneHotEncoder,
 category_encoders.ordinal.OrdinalEncoder,
 category_encoders.sum_coding.SumEncoder,
 category_encoders.polynomial.PolynomialEncoder,
 category_encoders.basen.BaseNEncoder,
 category_encoders.leave_one_out.LeaveOneOutEncoder,
 category_encoders.target_encoder.TargetEncoder,
 category_encoders.woe.WOEEncoder]

dfi = df[['cat_text']].copy()
dfi["copy"] = dfi['cat_text']
for encoder in encoders:
    if 'Leave' in encoder.__name__ or \
       'Target' in encoder.__name__ or \
       'WOE' in encoder.__name__:
        continue
    enc = encoder(cols=['cat_text'])
    try:
        out = enc.fit_transform(dfi)
    except Exception as e:
        print("Issue with '{0}' due to {1}".format(encoder.__name__, e))
        continue
    print('-----', encoder.__name__)
    print(out)
    print('-----')

----- BackwardDifferenceEncoder
   intercept  cat_text_0  cat_text_1  cat_text_2   copy
        1       -0.75        -0.5       -0.25   catA
        1        0.25        -0.5       -0.25   catB
        1       -0.75        -0.5       -0.25   catA
        1        0.25         0.5       -0.25  catDD
        1        0.25        -0.5       -0.25   catB
        1        0.25         0.5        0.75    NaN
        1        0.25        -0.5       -0.25   catB
-----
----- BinaryEncoder
   cat_text_0  cat_text_1  cat_text_2   copy
         0           0           1   catA
         0           1           0   catB
         0           0           1   catA
         0           1           1  catDD
         0           1           0   catB
         1           0           0    NaN
         0           1           0   catB
-----
----- HashingEncoder
   col_0  col_1  col_2  col_3  col_4  col_5  col_6  col_7   copy
    0      0      0      0      0      0      1      0   catA
    0      0      0      0      0      0      0      1   catB
    0      0      0      0      0      0      1      0   catA
    1      0      0      0      0      0      0      0  catDD
    0      0      0      0      0      0      0      1   catB
    1      0      0      0      0      0      0      0    NaN
    0      0      0      0      0      0      0      1   catB
-----
----- HelmertEncoder
   intercept  cat_text_0  cat_text_1  cat_text_2   copy
        1        -1.0        -1.0        -1.0   catA
        1         1.0        -1.0        -1.0   catB
        1        -1.0        -1.0        -1.0   catA
        1         0.0         2.0        -1.0  catDD
        1         1.0        -1.0        -1.0   catB
        1         0.0         0.0         3.0    NaN
        1         1.0        -1.0        -1.0   catB
-----
----- OneHotEncoder
   cat_text_1  cat_text_2  cat_text_3  cat_text_4  cat_text_-1   copy
         1           0           0           0            0   catA
         0           1           0           0            0   catB
         1           0           0           0            0   catA
         0           0           1           0            0  catDD
         0           1           0           0            0   catB
         0           0           0           1            0    NaN
         0           1           0           0            0   catB
-----
----- OrdinalEncoder
   cat_text   copy
       1   catA
       2   catB
       1   catA
       3  catDD
       2   catB
       4    NaN
       2   catB
-----
----- SumEncoder
   intercept  cat_text_0  cat_text_1  cat_text_2   copy
        1         1.0         0.0         0.0   catA
        1         0.0         1.0         0.0   catB
        1         1.0         0.0         0.0   catA
        1         0.0         0.0         1.0  catDD
        1         0.0         1.0         0.0   catB
        1        -1.0        -1.0        -1.0    NaN
        1         0.0         1.0         0.0   catB
-----
----- PolynomialEncoder
   intercept  cat_text_0  cat_text_1  cat_text_2   copy
        1   -0.670820         0.5   -0.223607   catA
        1   -0.223607        -0.5    0.670820   catB
        1   -0.670820         0.5   -0.223607   catA
        1    0.223607        -0.5   -0.670820  catDD
        1   -0.223607        -0.5    0.670820   catB
        1    0.670820         0.5    0.223607    NaN
        1   -0.223607        -0.5    0.670820   catB
-----
----- BaseNEncoder
   cat_text_0  cat_text_1  cat_text_2   copy
         0           0           1   catA
         0           1           0   catB
         0           0           1   catA
         0           1           1  catDD
         0           1           0   catB
         1           0           0    NaN
         0           1           0   catB
-----

Using the target #

Some encodings are fitted against the target to predict in supervised learning. The encoders below either predict the target from the category or tune the encoding of the category with respect to the target. In particular, the LeaveOneOut encoder maps each level to the mean of the values observed in another column (the target) over the rows that share this level; on the data it is fitted on, the current row is left out of that mean.

dfy = df.sort_values('cat_text').reset_index(drop=True).copy()
dfy['cat_text_copy'] = dfy['cat_text']
dfy['y'] = dfy.index * dfy.index + 10
dfy['y_copy'] = dfy.y
dfy

	cat_int	cat_text	cat_text_copy	y	y_copy
0	10.0	catA	catA	10	10
1	10.0	catA	catA	11	11
2	20.0	catB	catB	14	14
3	10.0	catB	catB	19	19
4	NaN	catB	catB	26	26
5	39.0	catDD	catDD	35	35
6	10.0	NaN	NaN	46	46

categories = dfy.drop('y', axis=1)
label = dfy.y
binary_label = label == 10  # dummy one

for encoder in encoders:
    enc = encoder(cols=['cat_text'])
    try:
        out = enc.fit_transform(categories)
    except Exception as e:
        out = pandas.DataFrame()
    try:
        outy = enc.fit_transform(categories, label)
    except ValueError as e:
        if "must be binary" not in str(e):
            continue
        outy = enc.fit_transform(categories, binary_label)
    if not out.equals(outy):
        print('-----', encoder.__name__)
        print(outy)
        print('-----')

----- LeaveOneOutEncoder
   cat_int  cat_text cat_text_copy  y_copy
   10.0      11.0          catA      10
   10.0      10.0          catA      11
   20.0      22.5          catB      14
   10.0      20.0          catB      19
    NaN      16.5          catB      26
   39.0      23.0         catDD      35
   10.0      23.0           NaN      46
-----
----- TargetEncoder
   cat_int   cat_text cat_text_copy  y_copy
   10.0  13.861768          catA      10
   10.0  13.861768          catA      11
   20.0  20.064010          catB      14
   10.0  20.064010          catB      19
    NaN  20.064010          catB      26
   39.0  23.000000         catDD      35
   10.0  23.000000           NaN      46
-----
----- WOEEncoder
   cat_int  cat_text cat_text_copy  y_copy
   10.0  0.980829          catA      10
   10.0  0.980829          catA      11
   20.0 -0.405465          catB      14
   10.0 -0.405465          catB      19
    NaN -0.405465          catB      26
   39.0  0.000000         catDD      35
   10.0  0.000000           NaN      46
-----
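
The LeaveOneOutEncoder values printed above for catA and catB can be reproduced by hand. The following is a minimal sketch with pandas, not the library's implementation; the frame `toy` and the column `loo` are names chosen for this illustration, and only the six rows with a known category are copied from the table.

```python
import pandas

# Rows with a known category, copied from the sorted table dfy.
toy = pandas.DataFrame({
    "cat_text": ["catA", "catA", "catB", "catB", "catB", "catDD"],
    "y": [10, 11, 14, 19, 26, 35],
})

grouped = toy.groupby("cat_text")["y"]
total = grouped.transform("sum")    # sum of y over the rows of the same level
count = grouped.transform("count")  # number of rows of the same level

# Leave-one-out mean: remove the current row's target from the group sum.
# A level seen only once has no other row, so the division 0/0 gives NaN here.
toy["loo"] = (total - toy["y"]) / (count - 1)
print(toy)
```

In the printed output, catDD and the missing category receive 23.0 instead, which is the mean of the seven target values; the sketch leaves that fallback out.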

About LeaveOneOut #

This encoder produces a single column. For a linear regression, the value is the mean of the target $y$ over all the rows associated with the category. We reuse an earlier example.

perm = numpy.random.permutation(list(range(10)))
n = 1000
X1 = numpy.random.randint(0, 10, (n,1))
X2 = numpy.array([perm[i] for i in X1])
eps = numpy.random.random((n, 1))
Y = X1 * (-10) - 7 + eps
data = pandas.DataFrame(dict(X1=X1.ravel(), X2=X2.ravel(), Y=Y.ravel()))
data.head()

	X1	X2	Y
0	7	3	-76.045854
1	3	4	-36.468473
2	2	7	-26.752507
3	0	2	-6.294129
4	4	0	-46.812676

from sklearn.model_selection import train_test_split
data_train, data_test = train_test_split(data)

data_train = data_train.reset_index(drop=True)

from category_encoders import LeaveOneOutEncoder
le = LeaveOneOutEncoder(cols=['X2'])
X = data_train.drop('Y', axis=1)
le.fit(X, data_train.Y)
data_train2 = le.transform(X)
data_train2.head()

	X1	X2
0	9	-96.513580
1	9	-96.513580
2	8	-86.548834
3	7	-76.475048
4	9	-96.513580

data_train2 = data_train2.reset_index(drop=True)

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(data_train2[["X2"]], data_train['Y'])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

data_test = data_test.reset_index(drop=True)

from sklearn.metrics import r2_score
data_test2 = le.transform(data_test.drop("Y", axis=1))
r2_score(data_test['Y'], model.predict(data_test2[['X2']]))

0.9999033020848935

The $R^2$ coefficient is close to 1: the regression is almost perfect. LeaveOneOutEncoder uses the target to maximize $R^2$ when the prediction model is a linear regression. A sketch of the proof can be found in this exercise: ENSAE TD noté 2016.
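
The claim can be checked numerically on training data. The sketch below is an illustration with numpy only, not the proof from the exercise: it encodes a category by the plain per-level mean of the target (without leaving any row out) and compares the $R^2$ of a linear regression on that single column with the $R^2$ of a regression on the full one-hot encoding. The helper `r2_linear` and the variable names are assumptions made for this example.

```python
import numpy

rng = numpy.random.default_rng(0)
n = 1000
cat = rng.integers(0, 10, n)           # category labels 0..9
y = -10.0 * cat - 7.0 + rng.random(n)  # same kind of target as in the notebook

# Mean encoding: replace each level by the mean of y over its rows.
means = numpy.array([y[cat == k].mean() for k in range(10)])
z = means[cat]

def r2_linear(features, target):
    """R^2 of an ordinary least-squares fit with intercept, on the fitting data."""
    A = numpy.column_stack([numpy.ones(len(target)), features])
    coef, *_ = numpy.linalg.lstsq(A, target, rcond=None)
    residual = target - A @ coef
    return 1.0 - residual @ residual / ((target - target.mean()) ** 2).sum()

# One-hot encoding, with one level dropped to avoid collinearity with the intercept.
one_hot = (cat[:, None] == numpy.arange(10)[None, :]).astype(float)
r2_mean = r2_linear(z, y)
r2_one_hot = r2_linear(one_hot[:, 1:], y)
print(r2_mean, r2_one_hot)
```

Both scores agree up to rounding: the one-hot regression predicts the per-level mean, and a single column that already holds this mean lets the linear model reach the same fit with slope 1 and intercept 0.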
