{"cells": [{"cell_type": "markdown", "id": "c98f9a0e", "metadata": {}, "source": ["# Tech - S\u00e9rialisation\n", "\n", "La s\u00e9rialisation r\u00e9pond \u00e0 un probl\u00e8me simple : comment \u00e9changer des donn\u00e9es complexes autres que des tableaux ?\n", "\n", "Si l'\u00e9nonc\u00e9 est simple, la solution ne l'est pas toujours. Il est assez facile d'\u00e9changer des donn\u00e9es qui se pr\u00e9sentent sous la forme d'un tableau, d'un texte, d'un nombre mais comment \u00e9changer un assemblage de donn\u00e9es h\u00e9t\u00e9rog\u00e8nes ? La **s\u00e9rialisation** d\u00e9signe un m\u00e9anisme qui permet de permet de repr\u00e9senter un assemblage de donn\u00e9es en un seul tableau de caract\u00e8res. La **d\u00e9s\u00e9rialisation** d\u00e9signe le m\u00e9canisme inverse qui consiste \u00e0 reconstruire les donn\u00e9es initiales \u00e0 partir de ce tableau de caract\u00e8res."]}, {"cell_type": "code", "execution_count": 1, "id": "5ad631e0", "metadata": {}, "outputs": [{"data": {"text/html": ["
run previous cell, wait for 2 seconds
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "id": "7c47ec7b", "metadata": {}, "source": ["## Notion de stream ou flux\n", "\n", "Un stream en informatique d\u00e9finit une fa\u00e7on de parcourir une s\u00e9quence d'octets. Un fichier est un stream : on \u00e9crit les octets ou caract\u00e8res les uns apr\u00e8s les autres, chaque nouveau caract\u00e8re est ajout\u00e9 \u00e0 la fin. Lors de la lecture, on proc\u00e8de de m\u00eame en lisant les caract\u00e8res du d\u00e9but \u00e0 la fin. Dans un stream, on ne revient jamais en arri\u00e8re, on lit toujours le caract\u00e8re suivant.\n", "\n", "Les streams sont optimis\u00e9s pour ce type de lecture et d'\u00e9criture, ils sont tr\u00e8s lents lorsqu'il s'agit d'aller lire ou \u00e9crire des caract\u00e8res de fa\u00e7on non s\u00e9quentielle.\n", "\n", "Pour faire des calculs math\u00e9matiques, il faut pouvoir acc\u00e9der \u00e0 tout moment \u00e0 n'importe quel \u00e9l\u00e9ment de la matrice. L'utilisation d'un *stream* est contre-indiqu\u00e9e. En revanche, ils sont tr\u00e8s adapt\u00e9s \u00e0 la lecture et l'\u00e9criture de fichiers. Ils sont \u00e9galement utilis\u00e9s pour communiquer des donn\u00e9es, lorsqu'un ordinateur envoie des donn\u00e9es \u00e0 un autre ordinateur."]}, {"cell_type": "code", "execution_count": 2, "id": "09eea6d4", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["pi=3.141592653589793;\n"]}], "source": ["import math\n", "from io import StringIO\n", "\n", "st = StringIO()\n", "st.write(\"pi=\")\n", "st.write(str(math.pi))\n", "st.write(\";\")\n", "value = st.getvalue()\n", "print(value)"]}, {"cell_type": "code", "execution_count": 3, "id": "a861a7a2", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["p\n", "i\n", "=\n", "3\n", ".\n", "1\n", "4\n", "1\n", "5\n", "9\n", "2\n", "6\n", "5\n", "3\n", "5\n", "8\n", "9\n", "7\n", "9\n", "3\n", ";\n"]}], "source": ["st = StringIO(\"pi=3.141592653589793;\")\n", "while text := st.read(1):\n", " print(text)"]}, {"cell_type": "code", "execution_count": 4, "id": "b902df76", "metadata": {}, "outputs": [{"data": {"text/plain": ["('petit;essai;de;comparaison;petit;essai;de;comparaison;petit;essai;de;comparaison;petit;essai;de;comp',\n", " 'petit;essai;de;comparaison;petit;essai;de;comparaison;petit;essai;de;comparaison;petit;essai;de;comp')"]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["def f1(text):\n", " st = StringIO()\n", " for t in text:\n", " st.write(t)\n", " st.write(\";\")\n", " value = st.getvalue()\n", " return value\n", "\n", "def f2(text):\n", " s = \"\"\n", " for t in text:\n", " s += t + \";\"\n", " return s\n", "\n", "data = [\"petit\", \"essai\", \"de\", \"comparaison\"] * 300\n", "f1(data)[:100], f2(data)[:100]"]}, {"cell_type": "code", "execution_count": 5, "id": "609b8b2c", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["207 \u00b5s \u00b1 25.6 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 1,000 loops each)\n"]}], "source": ["%timeit f1(data)"]}, {"cell_type": "code", "execution_count": 6, "id": "41d2d93d", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["365 \u00b5s \u00b1 50.1 \u00b5s per loop (mean \u00b1 std. dev. of 7 runs, 1,000 loops each)\n"]}], "source": ["%timeit f2(data)"]}, {"cell_type": "markdown", "id": "899d2cbc", "metadata": {}, "source": ["### Q1: quelle est la fonction la plus rapide ?\n", "\n", "Il vaut mieux faire varier la longueur de la liste `data` avant de r\u00e9pondre."]}, {"cell_type": "markdown", "id": "43e4a4e7", "metadata": {}, "source": ["## Format JSON\n", "\n", "Le format [JSON](https://fr.wikipedia.org/wiki/JavaScript_Object_Notation) est le format le plus r\u00e9pandu sur Internet. C'est un assemblage r\u00e9cursif de listes et de dictionnaires. Chaque conteneur peut contenir des listes, des dictionnaires, des nombres, des cha\u00eenes de caract\u00e8res.\n", "\n", "Il est possible de t\u00e9l\u00e9charger tout Wikip\u00e9dia au format JSON : [Wikidata:T\u00e9l\u00e9chargement de la base de donn\u00e9es](https://www.wikidata.org/wiki/Wikidata:Database_download/fr)."]}, {"cell_type": "code", "execution_count": 7, "id": "cbe2b5d7", "metadata": {}, "outputs": [{"data": {"text/plain": ["{'nom': 'magoo',\n", " 'naissance': 1949,\n", " 'creator': ['Millard Kaufman', 'John Hubley'],\n", " 'cartoons': [{'title': 'Les Aventures c\u00e9l\u00e8bres de Monsieur Magoo',\n", " 'dur\u00e9e': 5},\n", " {'title': 'Quoi de neuf Mr. Magoo ?', 'dur\u00e9e': 10}]}"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["data = {\n", " \"nom\": \"magoo\", \n", " \"naissance\": 1949, \n", " \"creator\": [\"Millard Kaufman\", \"John Hubley\"],\n", " \"cartoons\": [\n", " {\"title\": \"Les Aventures c\u00e9l\u00e8bres de Monsieur Magoo\", \"dur\u00e9e\": 5},\n", " {\"title\": \"Quoi de neuf Mr. Magoo ?\", \"dur\u00e9e\": 10}\n", " ]\n", "}\n", "data"]}, {"cell_type": "markdown", "id": "7ad48320", "metadata": {}, "source": ["Le langage python propose une librairie standard [json](https://docs.python.org/3/library/json.html) pour manipuler les informations. Et comme c'est d'un usage fr\u00e9quent, il existe d'autres options plus rapides [ujson](https://github.com/ultrajson/ultrajson), [simplejson](https://github.com/simplejson/simplejson), [ijson](https://github.com/ICRAR/ijson), ...\n", "\n", "La page [Index of /wikidatawiki/entities/](https://dumps.wikimedia.org/wikidatawiki/entities/) contient des fichiers json issues de wikipedia. Le fichier [latest-lexemes-sample.json](https://raw.githubusercontent.com/sdpython/ensae_teaching_cs/master/_doc/notebooks/td1a_home/latest-lexemes-sample.json) contient les premi\u00e8res lignes de ``latest-lexemes.json.bz2``."]}, {"cell_type": "markdown", "id": "4b853485", "metadata": {}, "source": ["### Q1. lire du json\n", "\n", "T\u00e9l\u00e9charger et lire ce [fichier](https://raw.githubusercontent.com/sdpython/ensae_teaching_cs/master/_doc/notebooks/td1a_home/latest-lexemes-sample.json) avec la libraire [json](https://docs.python.org/3/library/json.html)."]}, {"cell_type": "code", "execution_count": 8, "id": "79c5f0d5", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["[\n", "{\"type\":\"lexeme\",\"id\":\"L4\",\"lemmas\":{\"en\":{\"language\":\"en\",\"value\":\"windsurf\"}},\"lexicalCategory\":\"Q24905\",\"language\":\"Q1860\",\"claims\":{\"P5238\":[{\"m...\n"]}], "source": ["from urllib.request import urlopen\n", "url = \"https://raw.githubusercontent.com/sdpython/ensae_teaching_cs/master/_doc/notebooks/td1a_home/latest-lexemes-sample.json\"\n", "with urlopen(url) as f:\n", " text = f.read().decode(\"utf-8\")\n", "print(text[:150] + \"...\")"]}, {"cell_type": "code", "execution_count": 9, "id": "049ffdf0", "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "id": "37d8ddf9", "metadata": {}, "source": ["### Q2: \u00e9crire du json\n", "\n", "Modifier les donn\u00e9es et les \u00e9crire de nouveau sur disque."]}, {"cell_type": "code", "execution_count": 10, "id": "6c7f2930", "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "id": "16556249", "metadata": {}, "source": ["### Q3: gros json\n", "\n", "Le dump de la version anglaise de wikipedia fait plus de 100 Go (en version compress\u00e9e). Il tient sur disque mais pas en m\u00e9moire. Comment faire pour le lire malgr\u00e9 tout ? Quelques lignes pour vous donn\u00e9es des id\u00e9es... Les plus courageux utiliseront la librairie [ijson](https://github.com/ICRAR/ijson) ou [orjson](https://github.com/ijl/orjson)."]}, {"cell_type": "code", "execution_count": 11, "id": "79829473", "metadata": {"scrolled": false}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["[\n", "\n", "{\"type\":\"lexeme\",\"id\":\"L4\",\"lemmas\":{\"en\":{\"language\":\"en\",\"value\":\"windsurf\"}},\"lexicalCategory\":\"Q24905\",\"language\":\"Q1860\",\"claims\":{\"P5238\":[{\"mainsnak\":{\"snaktype\":\"value\",\"property\":\"P5238\",\"datavalue\":{\"value\":{\"entity-type\":\"lexeme\",\"numeric-id\":3324,\"id\":\"L3324\"},\"type\":\"wikibase-entityid\"},\"datatype\":\"wikibase-lexeme\"},\"type\":\"statement\",\"qualifiers\":{\"P1545\":[{\"snaktype\":\"value\",\"property\":\"P1545\",\"hash\":\"2a1ced1dca90648ea7e306acbadd74fc81a10722\",\"datavalue\":{\"value\":\"1\",\"type\":\"string\"},\"datatype\":\"string\"}]},\"qualifiers-order\":[\"P1545\"],\"id\":\"L4$faad30b0-421c-803a-c1fd-b9a99a0eb35d\",\"rank\":\"normal\"},{\"mainsnak\":{\"snaktype\":\"value\",\"property\":\"P5238\",\"datavalue\":{\"value\":{\"entity-type\":\"lexeme\",\"numeric-id\":18537,\"id\":\"L18537\"},\"type\":\"wikibase-entityid\"},\"datatype\":\"wikibase-lexeme\"},\"type\":\"statement\",\"qualifiers\":{\"P1545\":[{\"snaktype\":\"value\",\"property\":\"P1545\",\"hash\":\"7241753c62a310cf84895620ea82250dcea65835\",\"datavalue\":{\"value\":\"2\",\"type\":\"string\"},\"datatype\":\"string\"}]},\"qualifiers-order\":[\"P1545\"],\"id\":\"L4$d15285a1-4880-7a9b-bb1f-85403e1a785a\",\"rank\":\"normal\"}],\"P5187\":[{\"mainsnak\":{\"snaktype\":\"value\",\"property\":\"P5187\",\"datavalue\":{\"value\":{\"text\":\"windsurf\",\"language\":\"en\"},\"type\":\"monolingualtext\"},\"datatype\":\"monolingualtext\"},\"type\":\"statement\",\"id\":\"L4$d4a63d17-43ea-749d-5860-21b90feb83f7\",\"rank\":\"normal\"}]},\"forms\":[{\"id\":\"L4-F1\",\"representations\":{\"en\":{\"language\":\"en\",\"value\":\"windsurfing\"}},\"grammaticalFeatures\":[\"Q10345583\"],\"claims\":[]},{\"id\":\"L4-F3\",\"representations\":{\"en\":{\"language\":\"en\",\"value\":\"windsurfs\"}},\"grammaticalFeatures\":[\"Q110786\",\"Q3910936\",\"Q51929074\"],\"claims\":[]},{\"id\":\"L4-F4\",\"representations\":{\"en\":{\"language\":\"en\",\"value\":\"windsurfed\"}},\"grammaticalFeatures\":[\"Q1392475\"],\"claims\":[]},{\"id\":\"L4-F5\",\"representations\":{\"en\":{\"language\":\"en\",\"value\":\"windsurfed\"}},\"grammaticalFeatures\":[\"Q1230649\"],\"claims\":[]},{\"id\":\"L4-F6\",\"representations\":{\"en\":{\"language\":\"en\",\"value\":\"windsurf\"}},\"grammaticalFeatures\":[\"Q3910936\"],\"claims\":[]}],\"senses\":[{\"id\":\"L4-S1\",\"glosses\":{\"fr\":{\"language\":\"fr\",\"value\":\"faire de la planche \\u00e0 voile\"},\"ms\":{\"language\":\"ms\",\"value\":\"meluncur angin\"},\"zh\":{\"language\":\"zh\",\"value\":\"\\u6ed1\\u6d6a\\u98a8\\u5e06\"},\"zh-hant\":{\"language\":\"zh-hant\",\"value\":\"\\u6ed1\\u6d6a\\u98a8\\u5e06\"},\"zh-tw\":{\"language\":\"zh-tw\",\"value\":\"\\u6ed1\\u6d6a\\u98a8\\u5e06\"},\"nan\":{\"language\":\"nan\",\"value\":\"h\\u00e1i-\\u00edng hong-ph\\u00e2ng\"},\"th\":{\"language\":\"th\",\"value\":\"\\u0e40\\u0e25\\u0e48\\u0e19\\u0e27\\u0e34\\u0e19\\u0e14\\u0e4c\\u0e40\\u0e0b\\u0e34\\u0e23\\u0e4c\\u0e1f\"},\"tg\":{\"language\":\"tg\",\"value\":\"\\u0441\\u0451\\u0440\\u0444\\u0438\\u043d\\u0433\\u0431\\u043e\\u0437\\u0438\\u0438 \\u0448\\u0430\\u043c\\u043e\\u043b\\u04e3\"},\"fi\":{\"language\":\"fi\",\"value\":\"purjelautailla\"}},\"claims\":{\"P5137\":[{\"mainsnak\":{\"snaktype\":\"value\",\"property\":\"P5137\",\"datavalue\":{\"value\":{\"entity-type\":\"item\",\"numeric-id\":191051,\"id\":\"Q191051\"},\"type\":\"wikibase-entityid\"},\"datatype\":\"wikibase-item\"},\"type\":\"statement\",\"id\":\"L4-S1$13e5f498-4deb-ea41-4d60-02c852b88b4c\",\"rank\":\"normal\"}],\"P5972\":[{\"mainsnak\":{\"snaktype\":\"value\",\"property\":\"P5972\",\"datavalue\":{\"value\":{\"entity-type\":\"sense\",\"id\":\"L144039-S1\"},\"type\":\"wikibase-entityid\"},\"datatype\":\"wikibase-sense\"},\"type\":\"statement\",\"id\":\"L4-S1$7218013F-B84B-40FA-B57B-BC1BA2239BB8\",\"rank\":\"normal\"}]}}],\"pageid\":54387040,\"ns\":146,\"title\":\"Lexeme:L4\",\"lastrevid\":1710596079,\"modified\":\"2022-08-22T19:28:34Z\"},\n", "\n", "{\"type\":\"lexeme\",\"id\":\"L314\",\"lemmas\":{\"ca\":{\"language\":\"ca\",\"value\":\"pi\"}},\"lexicalCategory\":\"Q1084\",\"language\":\"Q7026\",\"claims\":{\"P5185\":[{\"mainsnak\":{\"snaktype\":\"value\",\"property\":\"P5185\",\"datavalue\":{\"value\":{\"entity-type\":\"item\",\"numeric-id\":1775415,\"id\":\"Q1775415\"},\"type\":\"wikibase-entityid\"},\"datatype\":\"wikibase-item\"},\"type\":\"statement\",\"id\":\"L314$45650151-4ed8-025d-2442-e36ef22e6a2a\",\"rank\":\"normal\"}]},\"forms\":[{\"id\":\"L314-F1\",\"representations\":{\"ca\":{\"language\":\"ca\",\"value\":\"pis\"}},\"grammaticalFeatures\":[\"Q146786\"],\"claims\":[]},{\"id\":\"L314-F2\",\"representations\":{\"ca\":{\"language\":\"ca\",\"value\":\"pi\"}},\"grammaticalFeatures\":[\"Q110786\"],\"claims\":[]}],\"senses\":{},\"pageid\":54387050,\"ns\":146,\"title\":\"Lexeme:L314\",\"lastrevid\":684359491,\"modified\":\"2018-05-24T07:28:21Z\"},\n", "\n"]}], "source": ["with open(\"dummy.json\", \"w\", encoding=\"utf-8\") as f:\n", " f.write(text)\n", " \n", "with open(\"dummy.json\", \"r\", encoding=\"utf-8\") as f:\n", " for i, line in enumerate(f):\n", " if i > 2:\n", " break\n", " print(line)"]}, {"cell_type": "markdown", "id": "90c3e951", "metadata": {}, "source": ["## XML\n", "\n", "Le [XML](https://fr.wikipedia.org/wiki/Extensible_Markup_Language) \u00e9tait utilis\u00e9 avant le format json. Il permet de faire la m\u00eame chose, s\u00e9rialiser, mais est plus verbeux. Il a \u00e9t\u00e9 abandonn\u00e9 car le r\u00e9sultat est plus long qu'avec le format json."]}, {"cell_type": "code", "execution_count": 12, "id": "c525b915", "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["\n", " 5\n", " Les Aventures c\u00e9l\u00e8bres de Monsieur Magoo\n", "\n", "\n", " 10\n", " Quoi de neuf Mr. Magoo ?\n", "\n", "Millard Kaufman\n", "John Hubley\n", "1949\n", "magoo\n"]}], "source": ["from dict2xml import dict2xml\n", "print(dict2xml(data))"]}, {"cell_type": "markdown", "id": "9d7a4387", "metadata": {}, "source": ["## pickle\n", "\n", "Le format [JSON](https://fr.wikipedia.org/wiki/JavaScript_Object_Notation) a un inconv\u00e9nient majeur : il impose la conversion des donn\u00e9es au format texte, en particulier les nombres. Chaque nombre doit \u00eatre converti en cha\u00eenes de caract\u00e8res et r\u00e9ciproquement. Pourquoi ne pas garder la repr\u00e9sentation binaire des nombres tels qu'ils sont utilis\u00e9s en m\u00e9moire ?\n", "\n", "C'est l'objectif du module [pickle](https://docs.python.org/3/library/pickle.html). Comme il n'y pas de conversion au format texte et qu'il s'agit de recopier la m\u00e9moire sur disque en un seule, cette s\u00e9rialisation s'applique \u00e0 tout objet python. Elle n'est pas restreinte aux dictionnaires et aux listes."]}, {"cell_type": "markdown", "id": "8d5ef71f", "metadata": {}, "source": ["### Q1: comparer le temps de s\u00e9rialisation entre pickle et json\n", "\n", "On pourra utiliser les donn\u00e9es json r\u00e9cup\u00e9r\u00e9es ci-dessus."]}, {"cell_type": "code", "execution_count": 13, "id": "7529f4a7", "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "id": "772a9dca", "metadata": {}, "source": ["### Q2: comparer le temps de d\u00e9s\u00e9rialisation entre pickle et json\n", "\n", "M\u00eame exercice en sens inverse."]}, {"cell_type": "code", "execution_count": 14, "id": "af5c2a8a", "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "id": "e5aee8e5", "metadata": {}, "source": ["### Peut-on tout s\u00e9rialiser ?\n", "\n", "La plupart des objets contenant des donn\u00e9es peuvent \u00eatre s\u00e9rialis\u00e9es, les listes, ldes dictionnaires, les matrices (numpy), les dataframes)... Il n'est pas possible de s\u00e9rialiser les fonctions \u00e0 moins d'utiliser des librairies comme [cloudpickle](https://github.com/cloudpipe/cloudpickle) ou [dill](https://github.com/uqfoundation/dill).\n", "\n", "La s\u00e9rialisation fonctionne de fa\u00e7on implicite avec toutes les classes python \u00e0 l'exception de celles d\u00e9finies en C++. Pour celles-ci, il faudra coder explicitement la s\u00e9rialisation et la d\u00e9s\u00e9rialisation. Pour cela il faut red\u00e9finir les m\u00e9thodes [getstate et_setstate](http://www.xavierdupre.fr/app/teachpyx/helpsphinx/notebooks/serialisation_examples.html?highlight=__getstate__#reduire-la-taille).\n", "\n", "Il reste une contrainte majeure \u00e0 cette s\u00e9rialisation, elle d\u00e9pend de la version du langage et de chaque extension. S\u00e9rialisation avec python 3.7 et d\u00e9s\u00e9rialisation avec python 3.10 a peu de chance de fonctionner."]}, {"cell_type": "code", "execution_count": 15, "id": "a99089fe", "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5"}}, "nbformat": 4, "nbformat_minor": 5}