.. _td2acorrectionsession2Erst:

=================================
2A.i - Serialization - correction
=================================

.. only:: html

    **Links:** :download:`notebook `, :downloadlink:`html `, :download:`python `, :downloadlink:`slides `, :githublink:`GitHub|_doc/notebooks/td2a/td2a_correction_session_2E.ipynb|*`

Serialization of objects, in particular dataframes. Speed measurements.

.. code:: ipython3

    from jyquickhelper import add_notebook_menu
    add_notebook_menu()

.. contents::
    :local:

Exercise 1: serializing a large dataframe
------------------------------------------

**Step 1:** build a large dataframe filled with random numbers.

.. code:: ipython3

    import random
    values = [[random.random() for i in range(0, 20)] for _ in range(0, 100000)]
    col = ["col%d" % i for i in range(0, 20)]

.. code:: ipython3

    import pandas
    df = pandas.DataFrame(values, columns=col)

**Step 2:** save this dataframe in two formats, text and serialized (binary).

.. code:: ipython3

    df.to_csv("df_text.txt", sep="\t")

.. code:: ipython3

    df.to_pickle("df_text.bin")

**Step 3:** measure the loading time.

.. code:: ipython3

    %timeit pandas.read_csv("df_text.txt", sep="\t")

.. parsed-literal::

    499 ms ± 8.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

.. code:: ipython3

    %timeit pandas.read_pickle("df_text.bin")

.. parsed-literal::

    10.1 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Exercise 2: json
----------------

A first attempt.

.. code:: ipython3

    obj = dict(a=[50, "r"], gg=(5, 't'))

    import jsonpickle
    frozen = jsonpickle.encode(obj)
    frozen

.. parsed-literal::

    '{"a": [50, "r"], "gg": {"py/tuple": [5, "t"]}}'

This module is equivalent to the standard ``json`` module for the built-in
Python types (lists, dictionaries, numbers, …), but the ``json`` module does
not work on dataframes.

.. code:: ipython3

    frozen = jsonpickle.encode(df)

.. code:: ipython3

    len(frozen), type(frozen), frozen[:55]

.. parsed-literal::

    (22025124, str, '{"py/object": "pandas.core.frame.DataFrame", "py/state"')

The dataframe method ``to_json`` would also give a satisfactory result, but it
cannot be applied to a machine learning model produced by scikit-learn.

.. code:: ipython3

    def to_json(obj, filename):
        frozen = jsonpickle.encode(obj)
        with open(filename, "w", encoding="utf-8") as f:
            f.write(frozen)

    def read_json(filename):
        with open(filename, "r", encoding="utf-8") as f:
            enc = f.read()
        return jsonpickle.decode(enc)

.. code:: ipython3

    to_json(df, "df_text.json")

.. code:: ipython3

    try:
        df = read_json("df_text.json")
    except Exception as e:
        print(e)

.. parsed-literal::

    all inputs must be Index

Clearly, this does not work on DataFrames. One would have to take inspiration
from the ``numpyson`` module.

json + scikit-learn
-------------------

Reading issue 147 of ``jsonpickle`` explains why the next two lines are needed.

.. code:: ipython3

    import jsonpickle.ext.numpy as jsonpickle_numpy
    jsonpickle_numpy.register_handlers()

.. code:: ipython3

    from sklearn import datasets
    iris = datasets.load_iris()
    X = iris.data[:, :2]  # we only take the first two features
    y = iris.target

.. code:: ipython3

    from sklearn.linear_model import LogisticRegression
    clf = LogisticRegression()
    clf.fit(X, y)

.. parsed-literal::

    LogisticRegression()

.. code:: ipython3

    clf.predict_proba([[0.1, 0.2]])

.. parsed-literal::

    array([[9.98521017e-01, 1.47896452e-03, 1.84545577e-08]])

.. code:: ipython3

    to_json(clf, "logreg.json")
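As a side note (not in the original notebook), the most common way to persist a
scikit-learn model remains binary serialization with the standard ``pickle``
module. A minimal sketch, assuming ``clf`` is the model trained above and using
an arbitrary file name ``logreg.pkl``:

.. code:: ipython3

    import pickle

    # serialize the trained model to a binary file
    with open("logreg.pkl", "wb") as f:
        pickle.dump(clf, f)

    # load it back and use it exactly like the original object
    with open("logreg.pkl", "rb") as f:
        clf_pkl = pickle.load(f)

    clf_pkl.predict_proba([[0.1, 0.2]])

The json path is less straightforward: the next cell tries to read the model
back from ``logreg.json``.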
.. code:: ipython3

    try:
        clf2 = read_json("logreg.json")
    except AttributeError as e:
        # For some unknown reason, probably a bug, the code does not work.
        print(e)

.. parsed-literal::

    'list' object has no attribute 'flags'

So we try another way. If the previous code does not work and the following one
does, it is a bug in ``jsonpickle``.

.. code:: ipython3

    class EncapsulateLogisticRegression:
        def __init__(self, obj):
            self.obj = obj

        def __getstate__(self):
            return {k: v for k, v in sorted(self.obj.__getstate__().items())}

        def __setstate__(self, data):
            self.obj = LogisticRegression()
            self.obj.__setstate__(data)

    enc = EncapsulateLogisticRegression(clf)
    to_json(enc, "logreg.json")

.. code:: ipython3

    enc2 = read_json("logreg.json")
    clf2 = enc2.obj

.. code:: ipython3

    clf2.predict_proba([[0.1, 0.2]])

.. parsed-literal::

    array([[9.98521017e-01, 1.47896452e-03, 1.84545577e-08]])

.. code:: ipython3

    with open("logreg.json", "r") as f:
        content = f.read()
    content

.. parsed-literal::

    '{"py/object": "__main__.EncapsulateLogisticRegression", "py/state": {"C": 1.0, "_sklearn_version": "1.0.dev0", "class_weight": null, "classes_": {"py/object": "numpy.ndarray", "dtype": "int32", "values": [0, 1, 2]}, "coef_": {"py/object": "numpy.ndarray", "base": {"py/object": "numpy.ndarray", "dtype": "float64", "values": [[[-2.7089024902680983, 2.3240237755859914, 7.913221292541044], [0.6127325890163979, -1.5705880338943812, 1.8450471421510946], [2.0961699012517387, -0.7534357416910977, -9.758268434691205]]]}, "strides": [24, 8], "shape": [3, 2], "dtype": "float64", "values": [[-2.7089024902680983, 2.3240237755859914], [0.6127325890163979, -1.5705880338943812], [2.0961699012517387, -0.7534357416910977]]}, "dual": false, "fit_intercept": true, "intercept_": {"py/object": "numpy.ndarray", "base": {"py/id": 4}, "offset": 16, "strides": [24], "shape": [3], "dtype": "float64", "values": [7.913221292541044, 1.8450471421510946, -9.758268434691205]}, "intercept_scaling": 1, "l1_ratio": null, "max_iter": 100, "multi_class": "auto", "n_features_in_": 2, "n_iter_": {"py/object": "numpy.ndarray", "base": {"py/object": "numpy.ndarray", "dtype": "int32", "values": [[50]]}, "shape": [1], "dtype": "int32", "values": [50]}, "n_jobs": null, "penalty": "l2", "random_state": null, "solver": "lbfgs", "tol": 0.0001, "verbose": 0, "warm_start": false}}'
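As a quick check (not part of the original notebook), one can verify that the
deserialized model behaves like the original by comparing their predicted
probabilities. A minimal sketch, assuming ``clf``, ``clf2`` and ``X`` are the
objects defined above:

.. code:: ipython3

    import numpy

    # both models should return numerically identical probabilities
    p1 = clf.predict_proba(X)
    p2 = clf2.predict_proba(X)
    numpy.allclose(p1, p2)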