2A.ml - 2016 - Compétition - Importation des données

Links: notebook, html, PDF, python, slides, GitHub

Une compétition était proposée dans le cadre du cours Python pour un Data Scientist à l’ENSAE. Ce notebook facilite la prise en main des données et étudie les données avec des méthodes de statistiques descriptives.

%matplotlib inline
from jyquickhelper import add_notebook_menu
add_notebook_menu()
from pyensae.datasource import download_data
download_data("ensae_competition_2016.zip",
              url="https://github.com/sdpython/ensae_teaching_cs/raw/master/_doc/competitions/2016_ENSAE_2A/")
['ensae_competition_test_X.txt', 'ensae_competition_train.txt']

Données

import pandas as p
df = p.read_csv('./ensae_competition_train.txt', header=[0,1], sep="\t", index_col=0)
df.head()
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 ... X15 X16 X17 X18 X19 X20 X21 X22 X23 Y
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
0 180000 1 2 1 47 0 0 0 0 0 ... 99694 65977 67415 3700 3700 4100 2360 2500 2618 0
1 110000 2 2 1 35 0 0 0 0 0 ... 4869 4966 5070 1053 1073 1081 178 184 185 1
2 70000 2 2 2 22 0 0 0 0 0 ... 69927 50579 49483 2501 3001 2608 1777 1792 1793 1
3 200000 2 1 2 27 -2 -2 -2 -2 -2 ... 1665 3370 -36 5610 15616 1673 3385 0 95456 0
4 370000 2 1 1 39 0 0 0 0 0 ... 48216 47675 48074 2157 2000 1668 2000 3000 1000 0

5 rows × 24 columns

len(df)
22500

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables: - X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit. - X2: Gender (1 = male; 2 = female). - X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others). - X4: Marital status (1 = married; 2 = single; 3 = others). - X5: Age (year). - X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; …;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; …; 8 = payment delay for eight months; 9 = payment delay for nine months and above. - X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; …; X17 = amount of bill statement in April, 2005. - X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; …;X23 = amount paid in April, 2005.

#Format des variables : multiindex
print(df[('Y','default payment next month')].value_counts())
print(df['Y']['default payment next month'].value_counts())
0    17520
1     4980
Name: (Y, default payment next month), dtype: int64
0    17520
1     4980
Name: default payment next month, dtype: int64
df.columns = df.columns.droplevel()
df.rename(columns={'default payment next month':'Y'},inplace=True)
df.head()
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 Y
0 180000 1 2 1 47 0 0 0 0 0 ... 99694 65977 67415 3700 3700 4100 2360 2500 2618 0
1 110000 2 2 1 35 0 0 0 0 0 ... 4869 4966 5070 1053 1073 1081 178 184 185 1
2 70000 2 2 2 22 0 0 0 0 0 ... 69927 50579 49483 2501 3001 2608 1777 1792 1793 1
3 200000 2 1 2 27 -2 -2 -2 -2 -2 ... 1665 3370 -36 5610 15616 1673 3385 0 95456 0
4 370000 2 1 1 39 0 0 0 0 0 ... 48216 47675 48074 2157 2000 1668 2000 3000 1000 0

5 rows × 24 columns

Distinction des variables catégorielles

df.columns
Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2',
       'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'Y'],
      dtype='object', name='ID')
cat_var =  list(df.columns[1:4]) + list(df.columns[5:11]) + list(df.columns[-1:])

num_var = [v for v in df.columns if v not in cat_var]

print("Categorical variables: ", cat_var)
print("Numerical variables: "  , num_var)

for v in cat_var[:-1]:
    df[v]=p.DataFrame(df[v]).applymap(str)
Categorical variables:  ['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'Y']
Numerical variables:  ['LIMIT_BAL', 'AGE', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
num_var_bill = [ v for v in num_var if 'BILL' in v or v == 'LIMIT_BAL']
num_var_pay  = [ v for v in num_var if 'PAY' in v ]
print("Numerical variables BILL: ", num_var_bill)
print("Numerical variables PAY: " , num_var_pay)
Numerical variables BILL:  ['LIMIT_BAL', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']
Numerical variables PAY:  ['PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

Variable d’intérêt

default=df[df['Y']==1]
solvent=df[df['Y']==0]
df['Y'].value_counts()
0    17520
1     4980
Name: Y, dtype: int64

Statistiques Descriptives

Variables quantitatives

desc=df[num_var].describe()
std=p.DataFrame(df[num_var].std()).transpose()
std.rename(index={0:'std'},inplace=True)
res=p.concat([desc,std])
res
ID LIMIT_BAL AGE BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
count 22500.000000 22500.000000 22500.000000 22500.000000 2.250000e+04 22500.000000 22500.000000 22500.000000 22500.000000 2.250000e+04 22500.000000 22500.000000 22500.000000 22500.000000
mean 167031.985778 35.522622 51377.476533 49254.855378 4.719780e+04 43390.913156 40370.602000 38961.910489 5573.344400 5.981831e+03 5159.924000 4783.836400 4761.467022 5319.834311
std 129410.412959 9.240719 73518.931398 70946.251776 6.954210e+04 64346.612034 60746.789532 59597.874685 15467.825328 2.282096e+04 16085.885024 15626.493539 15349.550859 18196.730565
min 10000.000000 21.000000 -165580.000000 -67526.000000 -1.572640e+05 -170000.000000 -81334.000000 -339603.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000
25% 50000.000000 28.000000 3565.000000 3069.250000 2.740000e+03 2350.000000 1763.000000 1253.750000 1000.000000 8.750000e+02 396.000000 300.000000 250.000000 154.000000
50% 140000.000000 34.000000 22578.500000 21365.500000 2.020250e+04 19095.500000 18104.500000 17125.000000 2100.000000 2.009000e+03 1800.000000 1500.000000 1500.000000 1500.000000
75% 240000.000000 41.000000 67254.750000 64252.750000 6.019275e+04 54475.000000 50090.250000 49063.250000 5006.000000 5.000000e+03 4500.000000 4000.000000 4013.250000 4005.000000
max 800000.000000 79.000000 746814.000000 743970.000000 1.664089e+06 628699.000000 823540.000000 699944.000000 505000.000000 1.684259e+06 508229.000000 528897.000000 426529.000000 527143.000000
std 129410.412959 9.240719 73518.931398 70946.251776 6.954210e+04 64346.612034 60746.789532 59597.874685 15467.825328 2.282096e+04 16085.885024 15626.493539 15349.550859 18196.730565
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

fig = plt.figure(figsize=(20,10))

ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)

sns.set(style="ticks")

ax1.set_ylim([15000,25000])
ax2.set_ylim([1000,2500])

ax1.set_title('BILL variables',fontsize=14)
ax2.set_title('PAY variables',fontsize=14)

sns.tsplot(df[num_var_bill[1:]].values, ci=[68, 95], estimator=np.median, ax=ax1)
sns.tsplot(df[num_var_pay].values, ci=[68, 95], estimator=np.median, ax=ax2)
<matplotlib.axes._subplots.AxesSubplot at 0x2cd7b82d550>
../_images/td2a_eco_competition_statsdesc_19_1.png
desc=default[num_var].describe()
std=p.DataFrame(default[num_var].std()).transpose()
std.rename(index={0:'std'},inplace=True)
p.concat([desc,std])
ID LIMIT_BAL AGE BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
count 4980.000000 4980.000000 4980.000000 4980.000000 4980.000000 4980.000000 4980.000000 4980.000000 4980.000000 4980.000000 4980.000000 4980.000000 4980.00000 4980.000000
mean 128369.012048 35.898594 48802.007430 47508.621486 45354.507631 42269.960040 39794.054418 38482.410843 3235.062450 3389.897992 3449.471888 3089.322490 3048.47751 3508.093173
std 114585.300619 9.785178 73208.224739 70957.968104 68145.557400 64232.730366 61745.867002 59804.195524 7333.004363 11169.855676 13776.576790 10998.148448 10471.88080 13996.188125
min 10000.000000 21.000000 -6676.000000 -17710.000000 -61506.000000 -65167.000000 -53007.000000 -339603.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000
25% 50000.000000 28.000000 3351.750000 3099.750000 2574.250000 2302.500000 1475.250000 1168.250000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000
50% 90000.000000 34.000000 20738.500000 20715.000000 20105.000000 19344.500000 18661.000000 18254.000000 1642.500000 1573.000000 1242.500000 1000.000000 1000.00000 1000.000000
75% 190000.000000 42.000000 59812.250000 58135.250000 54888.750000 50483.250000 48421.250000 47740.750000 3435.250000 3321.000000 3000.000000 2902.500000 3000.00000 2935.500000
max 720000.000000 75.000000 613860.000000 581775.000000 578971.000000 548020.000000 547880.000000 514975.000000 175100.000000 344467.000000 508229.000000 432130.000000 332000.00000 345293.000000
std 114585.300619 9.785178 73208.224739 70957.968104 68145.557400 64232.730366 61745.867002 59804.195524 7333.004363 11169.855676 13776.576790 10998.148448 10471.88080 13996.188125
desc=solvent[num_var].describe()
std=p.DataFrame(solvent[num_var].std()).transpose()
std.rename(index={0:'std'},inplace=True)
p.concat([desc,std])
ID LIMIT_BAL AGE BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
count 17520.000000 17520.000000 17520.000000 17520.000000 1.752000e+04 17520.000000 17520.000000 17520.000000 17520.000000 1.752000e+04 17520.000000 17520.000000 17520.000000 17520.000000
mean 178021.803653 35.415753 52109.544806 49751.216381 4.772175e+04 43709.540240 40534.483676 39098.206621 6237.993037 6.718579e+03 5646.114155 5265.496176 5248.378425 5834.815525
std 131264.434306 9.077451 73592.642301 70937.100523 6.992702e+04 64377.214009 60460.571694 59540.104394 17028.973430 2.511828e+04 16652.378496 16678.582519 16442.208028 19193.074508
min 10000.000000 21.000000 -165580.000000 -67526.000000 -1.572640e+05 -170000.000000 -81334.000000 -150953.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000
25% 70000.000000 28.000000 3610.750000 3059.750000 2.774000e+03 2368.500000 1830.500000 1261.000000 1170.000000 1.010750e+03 600.000000 390.000000 359.000000 323.000000
50% 150000.000000 34.000000 23100.500000 21571.500000 2.021500e+04 18983.000000 17919.500000 16677.500000 2440.000000 2.244500e+03 2000.000000 1732.500000 1738.500000 1738.000000
75% 250000.000000 41.000000 69197.750000 66165.000000 6.214250e+04 55812.750000 51010.750000 49667.500000 5600.250000 5.363000e+03 5000.000000 4580.000000 4600.000000 4600.000000
max 800000.000000 79.000000 746814.000000 743970.000000 1.664089e+06 628699.000000 823540.000000 699944.000000 505000.000000 1.684259e+06 417588.000000 528897.000000 426529.000000 527143.000000
std 131264.434306 9.077451 73592.642301 70937.100523 6.992702e+04 64377.214009 60460.571694 59540.104394 17028.973430 2.511828e+04 16652.378496 16678.582519 16442.208028 19193.074508

Biais de sélection sur les clients suceptibles de faire défaut ? Ou effet de taille (catégorie moins nombreuse) => moins de chances d’avoir des crédits élevés.

sample=solvent[num_var].sample(len(default))

desc=sample.describe()
std=p.DataFrame(sample.std()).transpose()
std.rename(index={0:'std'},inplace=True)
p.concat([desc,std])
ID LIMIT_BAL AGE BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
count 4980.000000 4980.000000 4980.000000 4980.000000 4980.000000 4980.000000 4980.000000 4980.000000 4980.000000 4.980000e+03 4980.000000 4980.000000 4980.000000 4980.000000
mean 180180.722892 35.434337 51721.304819 49316.536345 47455.014659 43264.014458 40511.312851 39532.004819 6033.253213 6.644837e+03 5620.497189 5507.051807 5591.140562 5865.546586
std 132531.728419 9.018197 73357.363416 70623.742032 68843.660610 64018.906476 60086.277673 59487.205748 17400.853507 2.541040e+04 17278.872802 17665.644643 18650.355908 20659.420780
min 10000.000000 21.000000 -11545.000000 -6256.000000 -10922.000000 -170000.000000 -36156.000000 -57060.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000
25% 70000.000000 28.000000 3495.000000 2873.000000 2529.250000 2273.750000 1940.500000 1379.000000 1123.000000 1.000000e+03 611.500000 389.750000 360.000000 316.000000
50% 150000.000000 34.000000 23355.500000 21726.500000 20330.000000 18916.000000 18070.500000 17073.500000 2401.500000 2.206500e+03 2000.000000 1800.000000 1826.500000 1782.000000
75% 260000.000000 41.000000 69135.750000 65527.750000 61580.750000 57274.750000 53338.500000 51001.000000 5500.000000 5.193250e+03 5000.000000 4500.000000 4514.750000 4500.000000
max 760000.000000 75.000000 746814.000000 624475.000000 632041.000000 628699.000000 823540.000000 527711.000000 493358.000000 1.227082e+06 417588.000000 497000.000000 426529.000000 527143.000000
std 132531.728419 9.018197 73357.363416 70623.742032 68843.660610 64018.906476 60086.277673 59487.205748 17400.853507 2.541040e+04 17278.872802 17665.644643 18650.355908 20659.420780

Il semble y avoir un biais de sélection en amont.

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

fig, axes = plt.subplots(3,2, sharex=True)
fig = plt.figure(figsize=(20,10))

sns.set(style="ticks")

df_dict={0: 'All - ', 1 : 'Default - ' , 2 : 'Solvent (sample) - '}
var_dict={0: 'BILL variables', 1 : 'PAY variables'}

for i in range(3):
    for j in range(2):
        axes[i,j].set_title(df_dict[i] + var_dict[j],fontsize=8)

sns.tsplot(df[num_var_bill[1:]].values, ci=[68, 95], estimator=np.median, ax=axes[0,0])
sns.tsplot(df[num_var_pay].values, ci=[68, 95], estimator=np.median, ax=axes[0,1])

sns.tsplot(default[num_var_bill[1:]].values, ci=[68, 95], estimator=np.median, ax=axes[1,0])
sns.tsplot(default[num_var_pay].values, ci=[68, 95], estimator=np.median, ax=axes[1,1])

sns.tsplot(sample[num_var_bill[1:]].values, ci=[68, 95], estimator=np.median, ax=axes[2,0])
sns.tsplot(sample[num_var_pay].values, ci=[68, 95], estimator=np.median, ax=axes[2,1])
<matplotlib.axes._subplots.AxesSubplot at 0x2cd7ba355f8>
../_images/td2a_eco_competition_statsdesc_25_1.png
<matplotlib.figure.Figure at 0x2cd7ba85710>
df_resample = p.concat([df[df['Y']==1], df[df['Y']==0].sample(len(df[df['Y']==1]))])

for v in cat_var[:-1]:
    print("Variable {}\n".format(v), p.crosstab(df_resample['Y'],df_resample[v]))
Variable SEX
 SEX     1     2
Y
0    1937  3043
1    2160  2820
Variable EDUCATION
 EDUCATION     1     2    3   4   5  6
Y
0          1819  2299  776  22  56  8
1          1498  2516  937   7  17  5
Variable MARRIAGE
 MARRIAGE   0     1     2   3
Y
0         12  2214  2699  55
1          5  2443  2464  68
Variable PAY_0
 PAY_0   -1   -2     0    1     2    3   4   5  6  7  8
Y
0      992  515  2726  535   180   21   5   4  0  1  1
1      712  279  1447  901  1371  200  39  12  4  7  8
Variable PAY_2
 PAY_2    -1   -2     0  1     2    3   4   5  6  7  8
Y
0      1080  664  2811  6   386   19  11   0  1  1  1
1       728  497  1903  5  1616  161  43  11  8  8  0
Variable PAY_3
 PAY_3    -1   -2     0  1     2    3   4  5  6   7  8
Y
0      1055  710  2793  0   397   15   6  2  1   1  0
1       696  555  2087  1  1469  109  31  9  8  14  1
Variable PAY_4
 PAY_4    -1   -2     0  1     2   3   4   5  6   7  8
Y
0      1015  736  2884  0   322  14   3   4  1   1  0
1       680  607  2280  1  1249  81  34  12  0  35  1
Variable PAY_5
 PAY_5   -1   -2     0     2   3   4  5  6   7  8
Y
0      960  786  2938   273  11   7  4  0   1  0
1      658  663  2407  1090  80  36  9  1  35  1
Variable PAY_6
 PAY_6    -1   -2     0     2   3   4  5   6   7  8
Y
0      1005  828  2841   280  18   4  2   1   1  0
1       716  735  2308  1068  85  23  6  10  28  1

Variables numériques (sur un sample d’individus)

%matplotlib inline
import matplotlib.pyplot as plt

fig, axes = plt.subplots(5,3, sharex=True)

for i in range(5):
    for j in range(3):
        if i*3+j < len(num_var) :
            axes[i,j].set_title(num_var[i*3+j],fontsize=8)
            sns.violinplot(x="Y", y=num_var[i*3+j], \
                        data=df_resample, palette=["#2F6CE6","#E62F2F"],  ax = axes[i,j])
            #sns.stripplot(x="Y", y=num_var[i*3+j], \
            #            data=df_resample, palette=["#2F6CE6","#E62F2F"], jitter=True,  ax = axes[i,j])
            axes[i,j].set_ylabel('')
            yticks=axes[i,j].get_yticks()
            axes[i,j].set_yticks(yticks[::2])

plt.figure(figsize=(20,15))
<matplotlib.figure.Figure at 0x2cd7b97e0b8>
../_images/td2a_eco_competition_statsdesc_29_1.png
<matplotlib.figure.Figure at 0x2cd7b97e0b8>
import seaborn as sns

sns.set(style="ticks")
sns.pairplot(df_resample[num_var_bill + ['Y']],hue='Y',palette=["#2F6CE6","#E62F2F"])
<seaborn.axisgrid.PairGrid at 0x2cd7c3bf278>
../_images/td2a_eco_competition_statsdesc_30_1.png
import seaborn as sns

sns.set(style="ticks")
sns.pairplot(df_resample[num_var_pay + ['Y']],hue='Y',palette=["#2F6CE6","#E62F2F"])
<seaborn.axisgrid.PairGrid at 0x2cd02420eb8>
../_images/td2a_eco_competition_statsdesc_31_1.png

Idées pouvant être explorées : - le default doit pouvoir se repérer dans la décélération des remboursements. - il y a aussi peut etre un effet “biais du survivant” : les users qui sont repérés comme susceptibles d’entraîner un défaut => montant de crédits accordés plus petits. - multicolinéarité de nombreuses variables => faire une ACP au préalable