Analyse de survie#

Links: notebook, html, PDF, python, slides, GitHub

from jyquickhelper import add_notebook_menu
add_notebook_menu()

Quelques données#

On récupère les données disponibles sur open.data.gouv.fr Données hospitalières relatives à l’épidémie de COVID-19. Ces données ne permettent pas de construire la courbe de Kaplan-Meier. On sait combien de personnes rentrent et sortent chaque jour mais on ne sait pas quand une personne qui sort un 1er avril est entrée.

import numpy.random as rnd

import pandas
df = pandas.read_csv("https://www.data.gouv.fr/en/datasets/r/63352e38-d353-4b54-bfd1-f1b3ee1cabd7", sep=";")
gr = df[["jour", "rad", "dc"]].groupby(["jour"]).sum()
diff = gr.diff().reset_index(drop=False)
diff.head()
jour rad dc
0 2020-03-18 NaN NaN
1 2020-03-19 695.0 207.0
2 2020-03-20 806.0 248.0
3 2020-03-21 452.0 151.0
4 2020-03-22 608.0 210.0
def donnees_artificielles(hosp, mu=14, nu=21):
    dt = pandas.to_datetime(hosp['jour'])
    res = []
    for i in range(hosp.shape[0]):
        date = dt[i].dayofyear
        h = hosp.iloc[i, 1]
        delay = rnd.exponential(mu, int(h))
        for j in range(delay.shape[0]):
            res.append([date - int(delay[j]), date, 1])
        h = hosp.iloc[i, 2]
        delay = rnd.exponential(nu, int(h))
        for j in range(delay.shape[0]):
            res.append([date - int(delay[j]), date , 0])
    return pandas.DataFrame(res, columns=["entree", "sortie", "issue"])


data = donnees_artificielles(diff[1:].reset_index(drop=True)).sort_values('entree')
data.head()
entree sortie issue
518488 -200 19 0
541408 -192 27 0
476735 -187 2 0
587013 -185 42 0
476057 -180 1 0

Chaque ligne est une personne, entree est le jour d’entrée à l’hôpital, sortie celui de la sortie, issue, 0 pour décès, 1 pour en vie.

data.describe()
entree sortie issue
count 624130.000000 624130.000000 624130.000000
mean 169.704510 184.532815 0.806729
std 125.420957 124.343186 0.394864
min -200.000000 1.000000 0.000000
25% 53.000000 84.000000 1.000000
50% 133.000000 144.000000 1.000000
75% 301.000000 315.000000 1.000000
max 366.000000 366.000000 1.000000

Il y a environ 80% de survie dans ces données.

import numpy
duree = data.sortie - data.entree
deces = (data.issue == 0).astype(numpy.int32)
import numpy
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
fig, ax = plt.subplots(1, 1, figsize=(10, 4))
kmf = KaplanMeierFitter()
kmf.fit(duree, deces)
kmf.plot(ax=ax)
ax.legend();
../_images/survival_9_0.png

Régression de Cox#

On reprend les données artificiellement générées et on ajoute une variable identique à la durée plus un bruit mais quasi nul

import pandas
data_simple = pandas.DataFrame({'duree': duree, 'deces': deces,
                                'X1': duree * 0.57 * deces + numpy.random.randn(duree.shape[0]),
                                'X2': duree * (-0.57) * deces + numpy.random.randn(duree.shape[0])})
data_simple.head()
duree deces X1 X2
518488 219 1 125.653230 -125.666662
541408 219 1 126.006024 -125.327549
476735 189 1 107.920779 -108.358230
587013 227 1 129.788930 -130.045019
476057 181 1 103.642440 -103.793008
from sklearn.model_selection import train_test_split
data_train, data_test = train_test_split(data_simple, test_size=0.8)
from lifelines.fitters.coxph_fitter import CoxPHFitter
cox = CoxPHFitter()
cox.fit(data_train[['duree', 'deces', 'X1']], duration_col="duree", event_col="deces",
        show_progress=True)
Iteration 1: norm_delta = 0.13943, step_size = 0.9000, log_lik = -250658.36250, newton_decrement = 889.93933, seconds_since_start = 0.0
Iteration 2: norm_delta = 0.00660, step_size = 0.9000, log_lik = -249862.37270, newton_decrement = 2.81312, seconds_since_start = 0.0
Iteration 3: norm_delta = 0.00073, step_size = 0.9000, log_lik = -249859.57376, newton_decrement = 0.03357, seconds_since_start = 0.1
Iteration 4: norm_delta = 0.00000, step_size = 1.0000, log_lik = -249859.54017, newton_decrement = 0.00000, seconds_since_start = 0.1
Convergence success after 4 iterations.
<lifelines.CoxPHFitter: fitted with 124826 total observations, 100754 right-censored observations>
cox.print_summary()
model lifelines.CoxPHFitter
duration col 'duree'
event col 'deces'
baseline estimation breslow
number of observations 124826
number of events observed 24072
partial log-likelihood -249859.54
time fit was run 2021-02-24 23:48:57 UTC
coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95% z p -log2(p)
X1 0.02 1.02 0.00 0.02 0.02 1.02 1.02 42.23 <0.005 inf

Concordance 0.69
Partial AIC 499721.08
log-likelihood ratio test 1597.64 on 1 df
-log2(p) of ll-ratio test inf
cox2 = CoxPHFitter()
cox2.fit(data_train[['duree', 'deces', 'X2']], duration_col="duree", event_col="deces",
        show_progress=True)
cox2.print_summary()
Iteration 1: norm_delta = 0.13946, step_size = 0.9000, log_lik = -250658.36250, newton_decrement = 888.92089, seconds_since_start = 0.0
Iteration 2: norm_delta = 0.00667, step_size = 0.9000, log_lik = -249863.61089, newton_decrement = 2.86434, seconds_since_start = 0.0
Iteration 3: norm_delta = 0.00074, step_size = 0.9000, log_lik = -249860.76079, newton_decrement = 0.03426, seconds_since_start = 0.1
Iteration 4: norm_delta = 0.00000, step_size = 1.0000, log_lik = -249860.72650, newton_decrement = 0.00000, seconds_since_start = 0.1
Convergence success after 4 iterations.
model lifelines.CoxPHFitter
duration col 'duree'
event col 'deces'
baseline estimation breslow
number of observations 124826
number of events observed 24072
partial log-likelihood -249860.73
time fit was run 2021-02-24 23:48:59 UTC
coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95% z p -log2(p)
X2 -0.02 0.98 0.00 -0.02 -0.02 0.98 0.98 -42.21 <0.005 inf

Concordance 0.69
Partial AIC 499723.45
log-likelihood ratio test 1595.27 on 1 df
-log2(p) of ll-ratio test inf
cox.predict_cumulative_hazard(data_test[:5])
621725 110352 139986 72623 248121
0.0 0.008806 0.008673 0.008543 0.008717 0.008661
1.0 0.018218 0.017942 0.017673 0.018035 0.017918
2.0 0.026735 0.026330 0.025936 0.026466 0.026295
3.0 0.036002 0.035457 0.034926 0.035640 0.035409
4.0 0.045543 0.044854 0.044182 0.045085 0.044793
... ... ... ... ... ...
184.0 2.055736 2.024623 1.994277 2.035070 2.021872
189.0 2.092002 2.060340 2.029459 2.070971 2.057541
197.0 2.138687 2.106318 2.074747 2.117186 2.103457
201.0 2.205513 2.172133 2.139576 2.183341 2.169182
217.0 2.330629 2.295356 2.260952 2.307199 2.292237

165 rows × 5 columns

cox.predict_survival_function(data_test[:5])
621725 110352 139986 72623 248121
0.0 0.991233 0.991365 0.991494 0.991321 0.991377
1.0 0.981947 0.982218 0.982482 0.982127 0.982242
2.0 0.973619 0.974013 0.974398 0.973881 0.974048
3.0 0.964638 0.965164 0.965677 0.964988 0.965211
4.0 0.955478 0.956137 0.956780 0.955916 0.956196
... ... ... ... ... ...
184.0 0.127999 0.132044 0.136112 0.130671 0.132407
189.0 0.123440 0.127411 0.131407 0.126063 0.127768
197.0 0.117809 0.121685 0.125588 0.120370 0.122034
201.0 0.110194 0.113934 0.117705 0.112664 0.114271
217.0 0.097235 0.100726 0.104251 0.099540 0.101040

165 rows × 5 columns