2A.i - Serialization - correction

Serialization of objects, dataframes in particular. Speed measurements.

In [1]:
from jyquickhelper import add_notebook_menu
add_notebook_menu()

Exercise 1: serializing a large dataframe

Step 1: build a large dataframe of random numbers

In [2]:
import random
values = [[random.random() for _ in range(20)] for _ in range(100000)]
col = ["col%d" % i for i in range(20)]
In [3]:
import pandas
df = pandas.DataFrame(values, columns=col)

Step 2: save the dataframe in two formats, text and serialized (binary)

In [4]:
df.to_csv("df_text.txt", sep="\t")
In [5]:
df.to_pickle("df_text.bin")
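Before timing anything, it is worth comparing what the two formats cost on disk: the text format prints every float as characters, while the binary format stores 8 bytes per float. A minimal sketch on a smaller dataframe (file names here are illustrative, not the ones used above):

```python
import os
import random
import pandas

# Smaller dataframe than in the exercise, same structure.
values = [[random.random() for _ in range(20)] for _ in range(1000)]
df = pandas.DataFrame(values, columns=["col%d" % i for i in range(20)])

df.to_csv("df_demo.txt", sep="\t")   # text: ~18 characters per float
df.to_pickle("df_demo.bin")          # binary: 8 bytes per float plus overhead

size_txt = os.path.getsize("df_demo.txt")
size_bin = os.path.getsize("df_demo.bin")
print("text: %d bytes, binary: %d bytes" % (size_txt, size_bin))
```

The text file is noticeably larger, which already hints at the loading-time gap measured below.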

Step 3: measure the loading time

In [6]:
%timeit pandas.read_csv("df_text.txt", sep="\t")
1.14 s ± 46.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [7]:
%timeit pandas.read_pickle("df_text.bin")
26.9 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
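Pickle wins by a factor of roughly 40 here because reading the binary file is mostly a memory copy, whereas read_csv must parse every float from text. The measurement can be reproduced without %timeit using time.perf_counter; a sketch on a smaller dataframe (sizes and file names are illustrative):

```python
import random
import time
import pandas

values = [[random.random() for _ in range(20)] for _ in range(10000)]
df = pandas.DataFrame(values, columns=["col%d" % i for i in range(20)])
df.to_csv("df_demo.txt", sep="\t")
df.to_pickle("df_demo.bin")

def best_time(fct, n=5):
    # Best wall-clock time over n runs, roughly what %timeit reports.
    times = []
    for _ in range(n):
        begin = time.perf_counter()
        fct()
        times.append(time.perf_counter() - begin)
    return min(times)

t_csv = best_time(lambda: pandas.read_csv("df_demo.txt", sep="\t"))
t_bin = best_time(lambda: pandas.read_pickle("df_demo.bin"))
print("csv: %.4f s, pickle: %.4f s" % (t_csv, t_bin))
```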

Exercise 2: json

A first attempt.

In [8]:
obj = dict(a=[50, "r"], gg=(5, 't'))

import jsonpickle
frozen = jsonpickle.encode(obj)
frozen
Out[8]:
'{"a": [50, "r"], "gg": {"py/tuple": [5, "t"]}}'

This module is equivalent to the json module on the standard Python types (lists, dictionaries, numbers, ...), but the json module does not work on dataframes.

In [9]:
frozen = jsonpickle.encode(df)
In [10]:
len(frozen), type(frozen), frozen[:55]
Out[10]:
(22586357, str, '{"py/object": "pandas.core.frame.DataFrame", "py/state"')

The to_json method would also give a satisfactory result, but it cannot be applied to a machine learning model produced by scikit-learn.

In [11]:
def to_json(obj, filename):
    frozen = jsonpickle.encode(obj)
    with open(filename, "w", encoding="utf-8") as f:
        f.write(frozen)
        
def read_json(filename):
    with open(filename, "r", encoding="utf-8") as f:
        enc = f.read()
    return jsonpickle.decode(enc)
In [12]:
to_json(df, "df_text.json")
In [13]:
try:
    df = read_json("df_text.json")
except Exception as e:
    print(e)
maximum recursion depth exceeded while calling a Python object

Clearly, this does not work on DataFrames. One would have to draw inspiration from the numpyson module.
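For a plain dataframe, pandas' own to_json / read_json (mentioned above) does handle the round trip. A minimal sketch, assuming the "split" orient; the column names are illustrative:

```python
import io
import pandas

df = pandas.DataFrame({"col0": [0.1, 0.2], "col1": [1.5, 2.5]})

# "split" keeps columns, index and data as separate JSON fields.
js = df.to_json(orient="split")
df2 = pandas.read_json(io.StringIO(js), orient="split")

print(df2.equals(df))
```

This only covers dataframes, though; it says nothing about an arbitrary scikit-learn model, which is the subject of the next section.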

json + scikit-learn

Issue 147 explains the purpose of the next two lines.

In [14]:
import jsonpickle.ext.numpy as jsonpickle_numpy
jsonpickle_numpy.register_handlers()
In [15]:
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
y = iris.target
In [16]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X, y)
Out[16]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [17]:
clf.predict_proba([[0.1, 0.2]])
Out[17]:
array([[ 0.49942162,  0.45148936,  0.04908902]])
In [18]:
to_json(clf, "logreg.json")
In [19]:
clf2 = read_json("logreg.json")
In [20]:
clf2.predict_proba([[0.1, 0.2]])
Out[20]:
array([[ 0.49942162,  0.45148936,  0.04908902]])
In [21]:
with open("logreg.json", "r") as f:
    content = f.read()
content
Out[21]:
'{"py/object": "sklearn.linear_model.logistic.LogisticRegression", "py/state": {"C": 1.0, "_sklearn_version": "0.19.1", "class_weight": null, "classes_": {"py/object": "numpy.ndarray", "dtype": "int32", "values": [0, 1, 2]}, "coef_": {"py/object": "numpy.ndarray", "base": {"py/object": "numpy.ndarray", "dtype": "float64", "values": [[-2.4957928882125406, 4.010113006761804, 0.8171393204472739], [0.49709450754556295, -1.6338022222456163, 1.225435620375353], [1.1592140429099165, -1.7773656810121667, -2.2251611854055735]], "order": "F"}, "strides": [8, 24], "shape": [3, 2], "dtype": "float64", "values": [[-2.4957928882125406, 4.010113006761804], [0.49709450754556295, -1.6338022222456163], [1.1592140429099165, -1.7773656810121667]]}, "dual": false, "fit_intercept": true, "intercept_": {"py/object": "numpy.ndarray", "dtype": "float64", "values": [0.8171393204472739, 1.225435620375353, -2.2251611854055735]}, "intercept_scaling": 1, "max_iter": 100, "multi_class": "ovr", "n_iter_": {"py/object": "numpy.ndarray", "dtype": "int32", "values": [8]}, "n_jobs": 1, "penalty": "l2", "random_state": null, "solver": "liblinear", "tol": 0.0001, "verbose": 0, "warm_start": false}}'
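For completeness: outside of JSON, the standard pickle module remains the usual way to serialize a scikit-learn model, using the same mechanism as to_pickle for dataframes, at the cost of a binary, Python-only format. A minimal sketch:

```python
import pickle
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
clf = LogisticRegression().fit(iris.data[:, :2], iris.target)

blob = pickle.dumps(clf)      # binary, not human-readable, Python-only
clf2 = pickle.loads(blob)

# The restored model predicts exactly the same classes.
same = (clf2.predict(iris.data[:, :2]) == clf.predict(iris.data[:, :2])).all()
print(same)
```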