{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Tech - JSON - XML\n", "\n", "Transmettre l'information d'une machine \u00e0 une autre, d'un logiciel \u00e0 un autre, d'une base de donn\u00e9es \u00e0 une autre est un probl\u00e8me r\u00e9current. Le format le plus simple pour des donn\u00e9es est le format [csv](https://fr.wikipedia.org/wiki/Comma-separated_values). Ca marche bien pour les tables mais cela ne permet de transmettre ais\u00e9ment des [donn\u00e9es non structur\u00e9es](https://fr.wikipedia.org/wiki/Informations_non_structur%C3%A9es)."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
run previous cell, wait for 2 seconds
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Enonc\u00e9"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)fleur
1456.73.05.22.3virginica
1466.32.55.01.9virginica
1476.53.05.22.0virginica
1486.23.45.42.3virginica
1495.93.05.11.8virginica
\n", "
"], "text/plain": [" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n", "145 6.7 3.0 5.2 2.3 \n", "146 6.3 2.5 5.0 1.9 \n", "147 6.5 3.0 5.2 2.0 \n", "148 6.2 3.4 5.4 2.3 \n", "149 5.9 3.0 5.1 1.8 \n", "\n", " fleur \n", "145 virginica \n", "146 virginica \n", "147 virginica \n", "148 virginica \n", "149 virginica "]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.datasets import load_iris as load_data\n", "from pandas import DataFrame\n", "data = load_data()\n", "df = DataFrame(data.data, columns=data.feature_names)\n", "df['fleur'] = [data.target_names[t] for t in data.target]\n", "df.tail()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q1 : \u00e9criture des donn\u00e9es au format CSV\n", "\n", "Le plus simple..."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/plain": ["'sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),fleur\\r\\n5.1,3.5,1.4,0.2,setosa\\r\\n4.9,3.0,1.4,0.2,setosa\\r\\n4.7,3.2,1.3,0.2,setosa\\r\\n4.6,3.1,1.5,0.2,setosa\\r\\n5.0,3.6,1.4,0.2,setosa\\r\\n5.4,3.9,1.7,0.4,setosa\\r\\n4.6,3.4,1.4,0.3,setosa\\r\\n5.0,3.4,1.5,0.2,setosa\\r\\n4.4,2.9,1.4,0.2,setosa\\r\\n4.9,3.1'"]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["from io import StringIO\n", "buffer = StringIO()\n", "df.to_csv(buffer, index=False)\n", "text = buffer.getvalue()\n", "text[:300]"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q2 : \u00e9criture des donn\u00e9es au format JSON"]}, {"cell_type": "code", "execution_count": 5, "metadata": {"scrolled": false}, "outputs": [{"data": {"text/plain": ["'[{\"sepal length (cm)\":5.1,\"sepal width (cm)\":3.5,\"petal length (cm)\":1.4,\"petal width (cm)\":0.2,\"fleur\":\"setosa\"},{\"sepal length (cm)\":4.9,\"sepal width (cm)\":3.0,\"petal length (cm)\":1.4,\"petal width (cm)\":0.2,\"fleur\":\"setosa\"},{\"sepal length (cm)\":4.7,\"sepal width (cm)\":3.2,\"petal length (cm)\":1.3,\"petal width (cm)\":0.2,\"fleur\":\"setosa\"},{\"sepal length (cm)\":4.6,\"sepal width (cm)\":3.1,\"petal lengt'"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["r = df.to_json(orient='records')\n", "r[:400]"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q3 : relire les donn\u00e9es avec le module [json](https://docs.python.org/3/library/json.html)"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q4 : essayez avec les format XML (ou HTML), SQL, SAS, Excel..."]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q5 : donn\u00e9es non structur\u00e9es\n", "\n", "On ajoute les endroits o\u00f9 ces fleurs sont pr\u00e9sentes. On voudrait que toutes les informations soient pr\u00e9sentes dans le m\u00eame fichier. Comment fait-on ?"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": ["locations = {'virginica': ['Florida', 'Georgia'],\n", " 'setosa': ['Maine', 'Alaska', 'Quebec'],\n", " 'versicolor': ['Quebec', 'Georgia', 'Ireland', 'Main']}"]}, {"cell_type": "markdown", "metadata": {}, "source": ["La question sous-jacente est : vaut-il mieux avoir deux fichiers plats, l'un pour les donn\u00e9es d\u00e9crivant les fleurs, l'autre pour les localisations ou un seul fusionnant les deux informations ?"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q6 : le texte, \u00e7a prend trop de place, zippons\n", "\n", "Avec le module [zipfile](https://docs.python.org/3/library/zipfile.html)."]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q7 : que vous inspire [protobuf](https://developers.google.com/protocol-buffers) ?\n", "\n", "Ou [flatbuffers](https://google.github.io/flatbuffers/), [MessagePack](https://msgpack.org/)"]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["## R\u00e9ponses"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q1 : \u00e9criture des donn\u00e9es au format CSV"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"text/plain": ["'sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),fleur\\r\\n5.1,3.5,1.4,0.2,setosa\\r\\n4.9,3.0,1.4,0.2,setosa\\r\\n4.7,3.2,1.3,0.2,setosa\\r\\n4.6,3.1,1.5,0.2,setosa\\r\\n5.0,3.6,1.4,0.2,setosa\\r\\n5.4,3.9,1.7,0.4,setosa\\r\\n4.6,3.4,1.4,0.3,setosa\\r\\n5.0,3.4,1.5,0.2,setosa\\r\\n4.4,2.9,1.4,0.2,setosa\\r\\n4.9,3.1'"]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["from io import StringIO\n", "buffer = StringIO()\n", "df.to_csv(buffer, index=False)\n", "text = buffer.getvalue()\n", "text[:300]"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": ["df.to_csv(\"fleurs.csv\", index=False)"]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"data": {"text/plain": ["['.ipynb_checkpoints',\n", " '2020_covid.ipynb',\n", " '2020_edit.ipynb',\n", " '2020_json_xml.ipynb',\n", " '2020_numpy.ipynb',\n", " '2020_pandas.ipynb',\n", " '2020_profile.ipynb',\n", " '2020_regex.ipynb',\n", " '2020_suffix.ipynb',\n", " '2020_surface.ipynb',\n", " '2020_topk.ipynb',\n", " '2020_tsp.ipynb',\n", " 'data.csv',\n", " 'fleurs.csv']"]}, "execution_count": 17, "metadata": {}, "output_type": "execute_result"}], "source": ["import os\n", "os.listdir(\".\")"]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": ["import pandas\n", "df2 = pandas.read_csv(\"fleurs.csv\")"]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)fleur
05.13.51.40.2setosa
14.93.01.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
45.03.61.40.2setosa
\n", "
"], "text/plain": [" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n", "0 5.1 3.5 1.4 0.2 \n", "1 4.9 3.0 1.4 0.2 \n", "2 4.7 3.2 1.3 0.2 \n", "3 4.6 3.1 1.5 0.2 \n", "4 5.0 3.6 1.4 0.2 \n", "\n", " fleur \n", "0 setosa \n", "1 setosa \n", "2 setosa \n", "3 setosa \n", "4 setosa "]}, "execution_count": 19, "metadata": {}, "output_type": "execute_result"}], "source": ["df2.head()"]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)fleur
05.13.51.40.2setosa
14.93.01.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
45.03.61.40.2setosa
\n", "
"], "text/plain": [" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n", "0 5.1 3.5 1.4 0.2 \n", "1 4.9 3.0 1.4 0.2 \n", "2 4.7 3.2 1.3 0.2 \n", "3 4.6 3.1 1.5 0.2 \n", "4 5.0 3.6 1.4 0.2 \n", "\n", " fleur \n", "0 setosa \n", "1 setosa \n", "2 setosa \n", "3 setosa \n", "4 setosa "]}, "execution_count": 20, "metadata": {}, "output_type": "execute_result"}], "source": ["virtuel = StringIO(text)\n", "df3 = pandas.read_csv(virtuel)\n", "df3.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q2 : \u00e9criture des donn\u00e9es au format JSON"]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"data": {"text/plain": ["'[{\"sepal length (cm)\":5.1,\"sepal width (cm)\":3.5,\"petal length (cm)\":1.4,\"petal width (cm)\":0.2,\"fleur\":\"setosa\"},{\"sepal length (cm)\":4.9,\"sepal width (cm)\":3.0,\"petal length (cm)\":1.4,\"petal width (cm)\":0.2,\"fleur\":\"setosa\"},{\"sepal length (cm)\":4.7,\"sepal width (cm)\":3.2,\"petal length (cm)\":1.3,\"petal width (cm)\":0.2,\"fleur\":\"setosa\"},{\"sepal length (cm)\":4.6,\"sepal width (cm)\":3.1,\"petal lengt'"]}, "execution_count": 21, "metadata": {}, "output_type": "execute_result"}], "source": ["json_text = df.to_json(orient='records')\n", "json_text[:400]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q3 : relire les donn\u00e9es avec le module [json](https://docs.python.org/3/library/json.html)"]}, {"cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": ["import json"]}, {"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": ["res = json.loads(json_text)"]}, {"cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["0 {'sepal length (cm)': 5.1, 'sepal width (cm)': 3.5, 'petal length (cm)': 1.4, 'petal width (cm)': 0.2, 'fleur': 'setosa'}\n", "1 {'sepal length (cm)': 4.9, 'sepal width (cm)': 3.0, 'petal length (cm)': 1.4, 'petal width (cm)': 0.2, 'fleur': 'setosa'}\n", "2 {'sepal length (cm)': 4.7, 'sepal width (cm)': 3.2, 'petal length (cm)': 1.3, 'petal width (cm)': 0.2, 'fleur': 'setosa'}\n", "3 {'sepal length (cm)': 4.6, 'sepal width (cm)': 3.1, 'petal length (cm)': 1.5, 'petal width (cm)': 0.2, 'fleur': 'setosa'}\n", "4 {'sepal length (cm)': 5.0, 'sepal width (cm)': 3.6, 'petal length (cm)': 1.4, 'petal width (cm)': 0.2, 'fleur': 'setosa'}\n", "5 {'sepal length (cm)': 5.4, 'sepal width (cm)': 3.9, 'petal length (cm)': 1.7, 'petal width (cm)': 0.4, 'fleur': 'setosa'}\n"]}], "source": ["for i, r in enumerate(res):\n", " print(i, type(r), r)\n", " if i >= 5:\n", " break"]}, {"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [{"data": {"text/plain": ["3.1"]}, "execution_count": 25, "metadata": {}, "output_type": "execute_result"}], "source": ["res[3]['sepal width (cm)']"]}, {"cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [{"data": {"text/plain": ["[{'sepal length (cm)': 5.1,\n", " 'sepal width (cm)': 3.5,\n", " 'petal length (cm)': 1.4,\n", " 'petal width (cm)': 0.2,\n", " 'fleur': 'setosa'},\n", " {'sepal length (cm)': 4.9,\n", " 'sepal width (cm)': 3.0,\n", " 'petal length (cm)': 1.4,\n", " 'petal width (cm)': 0.2,\n", " 'fleur': 'setosa'},\n", " {'sepal length (cm)': 4.7,\n", " 'sepal width (cm)': 3.2,\n", " 'petal length (cm)': 1.3,\n", " 'petal width (cm)': 0.2,\n", " 'fleur': 'setosa'}]"]}, "execution_count": 26, "metadata": {}, "output_type": "execute_result"}], "source": ["virtuel = StringIO(json_text)\n", "res2 = json.load(virtuel)\n", "res2[:3]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q4 : essayez avec les format XML (ou HTML), SQL, SAS, Excel..."]}, {"cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": ["html_text = df.to_html(index=False)"]}, {"cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n"]}], "source": ["print(html_text[:500])"]}, {"cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)fleur
5.13.51.40.2setosa
4.93.01.40.2setosa
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)fleur
1456.73.05.22.3virginica
1466.32.55.01.9virginica
1476.53.05.22.0virginica
1486.23.45.42.3virginica
1495.93.05.11.8virginica
\n", ""], "text/plain": [" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n", "145 6.7 3.0 5.2 2.3 \n", "146 6.3 2.5 5.0 1.9 \n", "147 6.5 3.0 5.2 2.0 \n", "148 6.2 3.4 5.4 2.3 \n", "149 5.9 3.0 5.1 1.8 \n", "\n", " fleur \n", "145 virginica \n", "146 virginica \n", "147 virginica \n", "148 virginica \n", "149 virginica "]}, "execution_count": 29, "metadata": {}, "output_type": "execute_result"}], "source": ["df_html = pandas.read_html(html_text)\n", "df_html[0].tail()"]}, {"cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [{"data": {"text/plain": ["2"]}, "execution_count": 30, "metadata": {}, "output_type": "execute_result"}], "source": ["df_html = pandas.read_html(html_text + html_text)\n", "len(df_html)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q5 : donn\u00e9es non structur\u00e9es"]}, {"cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)fleur
05.13.51.40.2setosa
14.93.01.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
45.03.61.40.2setosa
\n", "
"], "text/plain": [" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n", "0 5.1 3.5 1.4 0.2 \n", "1 4.9 3.0 1.4 0.2 \n", "2 4.7 3.2 1.3 0.2 \n", "3 4.6 3.1 1.5 0.2 \n", "4 5.0 3.6 1.4 0.2 \n", "\n", " fleur \n", "0 setosa \n", "1 setosa \n", "2 setosa \n", "3 setosa \n", "4 setosa "]}, "execution_count": 31, "metadata": {}, "output_type": "execute_result"}], "source": ["df.head()"]}, {"cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": ["locations = {'virginica': ['Florida', 'Georgia'],\n", " 'setosa': ['Maine', 'Alaska', 'Quebec'],\n", " 'versicolor': ['Quebec', 'Georgia', 'Ireland', 'Main']}"]}, {"cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [{"data": {"text/plain": ["[{'fleur': 'virginica', 'location': 'Florida'},\n", " {'fleur': 'virginica', 'location': 'Georgia'},\n", " {'fleur': 'setosa', 'location': 'Maine'},\n", " {'fleur': 'setosa', 'location': 'Alaska'},\n", " {'fleur': 'setosa', 'location': 'Quebec'},\n", " {'fleur': 'versicolor', 'location': 'Quebec'},\n", " {'fleur': 'versicolor', 'location': 'Georgia'},\n", " {'fleur': 'versicolor', 'location': 'Ireland'},\n", " {'fleur': 'versicolor', 'location': 'Main'}]"]}, "execution_count": 33, "metadata": {}, "output_type": "execute_result"}], "source": ["obs = []\n", "for fleur, loc in locations.items():\n", " for l in loc:\n", " obs.append({\"fleur\": fleur, \"location\": l})\n", "obs"]}, {"cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fleurlocation
0virginicaFlorida
1virginicaGeorgia
2setosaMaine
3setosaAlaska
4setosaQuebec
5versicolorQuebec
6versicolorGeorgia
7versicolorIreland
8versicolorMain
\n", "
"], "text/plain": [" fleur location\n", "0 virginica Florida\n", "1 virginica Georgia\n", "2 setosa Maine\n", "3 setosa Alaska\n", "4 setosa Quebec\n", "5 versicolor Quebec\n", "6 versicolor Georgia\n", "7 versicolor Ireland\n", "8 versicolor Main"]}, "execution_count": 34, "metadata": {}, "output_type": "execute_result"}], "source": ["df_locations = pandas.DataFrame(obs)\n", "df_locations"]}, {"cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)fleurlocation
05.13.51.40.2setosaMaine
15.13.51.40.2setosaAlaska
25.13.51.40.2setosaQuebec
34.93.01.40.2setosaMaine
44.93.01.40.2setosaAlaska
54.93.01.40.2setosaQuebec
64.73.21.30.2setosaMaine
74.73.21.30.2setosaAlaska
84.73.21.30.2setosaQuebec
94.63.11.50.2setosaMaine
\n", "
"], "text/plain": [" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n", "0 5.1 3.5 1.4 0.2 \n", "1 5.1 3.5 1.4 0.2 \n", "2 5.1 3.5 1.4 0.2 \n", "3 4.9 3.0 1.4 0.2 \n", "4 4.9 3.0 1.4 0.2 \n", "5 4.9 3.0 1.4 0.2 \n", "6 4.7 3.2 1.3 0.2 \n", "7 4.7 3.2 1.3 0.2 \n", "8 4.7 3.2 1.3 0.2 \n", "9 4.6 3.1 1.5 0.2 \n", "\n", " fleur location \n", "0 setosa Maine \n", "1 setosa Alaska \n", "2 setosa Quebec \n", "3 setosa Maine \n", "4 setosa Alaska \n", "5 setosa Quebec \n", "6 setosa Maine \n", "7 setosa Alaska \n", "8 setosa Quebec \n", "9 setosa Maine "]}, "execution_count": 35, "metadata": {}, "output_type": "execute_result"}], "source": ["merged = df.merge(df_locations, left_on=\"fleur\", right_on=\"fleur\")\n", "merged.head(10)"]}, {"cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [{"data": {"text/plain": ["(450, 6)"]}, "execution_count": 36, "metadata": {}, "output_type": "execute_result"}], "source": ["merged.shape"]}, {"cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [{"data": {"text/plain": ["{'virginica': ['Florida', 'Georgia'],\n", " 'setosa': ['Maine', 'Alaska', 'Quebec'],\n", " 'versicolor': ['Quebec', 'Georgia', 'Ireland', 'Main']}"]}, "execution_count": 37, "metadata": {}, "output_type": "execute_result"}], "source": ["locations"]}, {"cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [{"data": {"text/plain": ["[{'fleur': 'virginica', 'location': ['Florida', 'Georgia']},\n", " {'fleur': 'setosa', 'location': ['Maine', 'Alaska', 'Quebec']},\n", " {'fleur': 'versicolor', 'location': ['Quebec', 'Georgia', 'Ireland', 'Main']}]"]}, "execution_count": 38, "metadata": {}, "output_type": "execute_result"}], "source": ["obs2 = []\n", "for fleur, loc in locations.items():\n", " obs2.append({\"fleur\": fleur, \"location\": loc})\n", "obs2"]}, {"cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fleurlocation
0virginica[Florida, Georgia]
1setosa[Maine, Alaska, Quebec]
2versicolor[Quebec, Georgia, Ireland, Main]
\n", "
"], "text/plain": [" fleur location\n", "0 virginica [Florida, Georgia]\n", "1 setosa [Maine, Alaska, Quebec]\n", "2 versicolor [Quebec, Georgia, Ireland, Main]"]}, "execution_count": 39, "metadata": {}, "output_type": "execute_result"}], "source": ["df_locations2 = pandas.DataFrame(obs2)\n", "df_locations2"]}, {"cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)fleurlocation
05.13.51.40.2setosa[Maine, Alaska, Quebec]
14.93.01.40.2setosa[Maine, Alaska, Quebec]
24.73.21.30.2setosa[Maine, Alaska, Quebec]
34.63.11.50.2setosa[Maine, Alaska, Quebec]
45.03.61.40.2setosa[Maine, Alaska, Quebec]
55.43.91.70.4setosa[Maine, Alaska, Quebec]
64.63.41.40.3setosa[Maine, Alaska, Quebec]
75.03.41.50.2setosa[Maine, Alaska, Quebec]
84.42.91.40.2setosa[Maine, Alaska, Quebec]
94.93.11.50.1setosa[Maine, Alaska, Quebec]
\n", "
"], "text/plain": [" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n", "0 5.1 3.5 1.4 0.2 \n", "1 4.9 3.0 1.4 0.2 \n", "2 4.7 3.2 1.3 0.2 \n", "3 4.6 3.1 1.5 0.2 \n", "4 5.0 3.6 1.4 0.2 \n", "5 5.4 3.9 1.7 0.4 \n", "6 4.6 3.4 1.4 0.3 \n", "7 5.0 3.4 1.5 0.2 \n", "8 4.4 2.9 1.4 0.2 \n", "9 4.9 3.1 1.5 0.1 \n", "\n", " fleur location \n", "0 setosa [Maine, Alaska, Quebec] \n", "1 setosa [Maine, Alaska, Quebec] \n", "2 setosa [Maine, Alaska, Quebec] \n", "3 setosa [Maine, Alaska, Quebec] \n", "4 setosa [Maine, Alaska, Quebec] \n", "5 setosa [Maine, Alaska, Quebec] \n", "6 setosa [Maine, Alaska, Quebec] \n", "7 setosa [Maine, Alaska, Quebec] \n", "8 setosa [Maine, Alaska, Quebec] \n", "9 setosa [Maine, Alaska, Quebec] "]}, "execution_count": 40, "metadata": {}, "output_type": "execute_result"}], "source": ["merged = df.merge(df_locations2, left_on=\"fleur\", right_on=\"fleur\")\n", "merged.head(10)"]}, {"cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [{"data": {"text/plain": ["'[{\"sepal length (cm)\":5.1,\"sepal width (cm)\":3.5,\"petal length (cm)\":1.4,\"petal width (cm)\":0.2,\"fleur\":\"setosa\",\"location\":[\"Maine\",\"Alaska\",\"Quebec\"]},{\"sepal length (cm)\":4.9,\"sepal width (cm)\":3.0'"]}, "execution_count": 41, "metadata": {}, "output_type": "execute_result"}], "source": ["json_text = merged.to_json(orient='records')\n", "json_text[:200]"]}, {"cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": ["df.to_excel(\"data.xlsx\", index=False)"]}, {"cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)fleur
1456.73.05.22.3virginica
1466.32.55.01.9virginica
1476.53.05.22.0virginica
1486.23.45.42.3virginica
1495.93.05.11.8virginica
\n", "
"], "text/plain": [" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n", "145 6.7 3.0 5.2 2.3 \n", "146 6.3 2.5 5.0 1.9 \n", "147 6.5 3.0 5.2 2.0 \n", "148 6.2 3.4 5.4 2.3 \n", "149 5.9 3.0 5.1 1.8 \n", "\n", " fleur \n", "145 virginica \n", "146 virginica \n", "147 virginica \n", "148 virginica \n", "149 virginica "]}, "execution_count": 43, "metadata": {}, "output_type": "execute_result"}], "source": ["dfe = pandas.read_excel(\"data.xlsx\", engine='openpyxl')\n", "dfe.tail()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q6 : le texte, \u00e7a prend trop de place, zippons"]}, {"cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": ["from zipfile import ZipFile\n", "\n", "with ZipFile('data.zip', 'w') as myzip:\n", " myzip.write('data.xlsx')\n", " myzip.write(\"2020_json_xml.ipynb\")"]}, {"cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [{"data": {"text/plain": ["['data.zip']"]}, "execution_count": 45, "metadata": {}, "output_type": "execute_result"}], "source": ["import glob\n", "glob.glob(\"*.zip\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q7 : que vous inspire [protobuf](https://developers.google.com/protocol-buffers) ?\n", "\n", "Protobuf est un format de s\u00e9rialisation au m\u00eame titre que [pickle](https://docs.python.org/3/library/pickle.html). La [s\u00e9rialisation](https://fr.wikipedia.org/wiki/S%C3%A9rialisation) d\u00e9signe un proc\u00e9d\u00e9 qui permet d'enregister tout un tas d'information sous la forme d'ensemble d'octets contig\u00fc. En gros, on enregistre un ensemble de variables en m\u00eame quel qu'il soit, dans un unique fichier contig\u00fc. [json](https://fr.wikipedia.org/wiki/JavaScript_Object_Notation) est un format utilis\u00e9 par les outils de s\u00e9rialisation. Tout est en texte et la plupart du temps, les nombres r\u00e9elles prennent moins de place quand ils sont stock\u00e9s comme ils sont sock\u00e9s en m\u00e9moire sur 8 octets. C'est le premier probl\u00e8me r\u00e9solu par protobuf. Ensuite, le format *json* est tr\u00e8s g\u00e9n\u00e9rique mais il sugg\u00e8re de stocker le nom des colonnes \u00e0 chaque ligne, c'est une information qui est dupliqu\u00e9 \u00e0 chaque ligne... *protobuf* propose une fa\u00e7on de ne pas les sotcker du tout. Je passe les d\u00e9tails. Ils reviendront quand le probl\u00e8me de communiquer efficacement des donn\u00e9es se posera."]}, {"cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2"}}, "nbformat": 4, "nbformat_minor": 2}