{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Tech - JSON - XML\n", "\n", "Transmettre l'information d'une machine \u00e0 une autre, d'un logiciel \u00e0 un autre, d'une base de donn\u00e9es \u00e0 une autre est un probl\u00e8me r\u00e9current. Le format le plus simple pour des donn\u00e9es est le format [csv](https://fr.wikipedia.org/wiki/Comma-separated_values). Ca marche bien pour les tables mais cela ne permet de transmettre ais\u00e9ment des [donn\u00e9es non structur\u00e9es](https://fr.wikipedia.org/wiki/Informations_non_structur%C3%A9es)."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["<div id=\"my_id_menu_nb\">run previous cell, wait for 2 seconds</div>\n", "<script>\n", "function repeat_indent_string(n){\n", "    var a = \"\" ;\n", "    for ( ; n > 0 ; --n)\n", "        a += \"    \";\n", "    return a;\n", "}\n", "// look up into all sections and builds an automated menu //\n", "var update_menu_string = function(begin, lfirst, llast, sformat, send, keep_item, begin_format, end_format) {\n", "    var anchors = document.getElementsByClassName(\"section\");\n", "    if (anchors.length == 0) {\n", "        anchors = document.getElementsByClassName(\"text_cell_render rendered_html\");\n", "    }\n", "    var i,t;\n", "    var text_menu = begin;\n", "    var text_memo = \"<pre>\\nlength:\" + anchors.length + \"\\n\";\n", "    var ind = \"\";\n", "    var memo_level = 1;\n", "    var href;\n", "    var tags = [];\n", "    var main_item = 0;\n", "    var format_open = 0;\n", "    for (i = 0; i <= llast; i++)\n", "        tags.push(\"h\" + i);\n", "\n", "    for (i = 0; i < anchors.length; i++) {\n", "        text_memo += \"**\" + anchors[i].id + \"--\\n\";\n", "\n", "        var child = null;\n", "        for(t = 0; t < tags.length; t++) {\n", "            var r = anchors[i].getElementsByTagName(tags[t]);\n", "            if (r.length > 0) {\n", "child = r[0];\n", "break;\n", "            }\n", "        }\n", "        if (child == null) {\n", "            text_memo += \"null\\n\";\n", "            continue;\n", "        }\n", "        if (anchors[i].hasAttribute(\"id\")) {\n", "            // when converted in RST\n", "            href = anchors[i].id;\n", "            text_memo += \"#1-\" + href;\n", "            // passer \u00e0 child suivant (le chercher)\n", "        }\n", "        else if (child.hasAttribute(\"id\")) {\n", "            // in a notebook\n", "            href = child.id;\n", "            text_memo += \"#2-\" + href;\n", "        }\n", "        else {\n", "            text_memo += \"#3-\" + \"*\" + \"\\n\";\n", "            continue;\n", "        }\n", "        var title = child.textContent;\n", "        var level = parseInt(child.tagName.substring(1,2));\n", "\n", "        text_memo += \"--\" + level + \"?\" + lfirst + \"--\" + title + \"\\n\";\n", "\n", "        if ((level < lfirst) || (level > llast)) {\n", "            continue ;\n", "        }\n", "        if (title.endsWith('\u00b6')) {\n", "            title = title.substring(0,title.length-1).replace(\"<\", \"&lt;\")\n", "         .replace(\">\", \"&gt;\").replace(\"&\", \"&amp;\");\n", "        }\n", "        if (title.length == 0) {\n", "            continue;\n", "        }\n", "\n", "        while (level < memo_level) {\n", "            text_menu += end_format + \"</ul>\\n\";\n", "            format_open -= 1;\n", "            memo_level -= 1;\n", "        }\n", "        if (level == lfirst) {\n", "            main_item += 1;\n", "        }\n", "        if (keep_item != -1 && main_item != keep_item + 1) {\n", "            // alert(main_item + \" - \" + level + \" - \" + keep_item);\n", "            continue;\n", "        }\n", "        while (level > memo_level) {\n", "            text_menu += \"<ul>\\n\";\n", "            memo_level += 1;\n", "        }\n", "        text_menu += repeat_indent_string(level-2);\n", "        text_menu += begin_format + sformat.replace(\"__HREF__\", href).replace(\"__TITLE__\", title);\n", "        format_open += 1;\n", "    }\n", "    while (1 < memo_level) {\n", "        text_menu += end_format + \"</ul>\\n\";\n", "        memo_level -= 1;\n", "        format_open -= 1;\n", "    }\n", "    text_menu += send;\n", "    //text_menu += \"\\n\" + text_memo;\n", "\n", "    while (format_open > 0) {\n", "        text_menu += end_format;\n", "        format_open -= 1;\n", "    }\n", "    return text_menu;\n", "};\n", "var update_menu = function() {\n", "    var sbegin = \"\";\n", "    var sformat = '<a href=\"#__HREF__\">__TITLE__</a>';\n", "    var send = \"\";\n", "    var begin_format = '<li>';\n", "    var end_format = '</li>';\n", "    var keep_item = -1;\n", "    var text_menu = update_menu_string(sbegin, 2, 4, sformat, send, keep_item,\n", "       begin_format, end_format);\n", "    var menu = document.getElementById(\"my_id_menu_nb\");\n", "    menu.innerHTML=text_menu;\n", "};\n", "window.setTimeout(update_menu,2000);\n", "            </script>"], "text/plain": ["<IPython.core.display.HTML object>"]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Enonc\u00e9"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/html": ["<div>\n", "<style scoped>\n", "    .dataframe tbody tr th:only-of-type {\n", "        vertical-align: middle;\n", "    }\n", "\n", "    .dataframe tbody tr th {\n", "        vertical-align: top;\n", "    }\n", "\n", "    .dataframe thead th {\n", "        text-align: right;\n", "    }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>sepal length (cm)</th>\n", "      <th>sepal width (cm)</th>\n", "      <th>petal length (cm)</th>\n", "      <th>petal width (cm)</th>\n", "      <th>fleur</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>145</th>\n", "      <td>6.7</td>\n", "      <td>3.0</td>\n", "      <td>5.2</td>\n", "      <td>2.3</td>\n", "      <td>virginica</td>\n", "    </tr>\n", "    <tr>\n", "      <th>146</th>\n", "      <td>6.3</td>\n", "      <td>2.5</td>\n", "      <td>5.0</td>\n", "      <td>1.9</td>\n", "      <td>virginica</td>\n", "    </tr>\n", "    <tr>\n", "      <th>147</th>\n", "      <td>6.5</td>\n", "      <td>3.0</td>\n", "      <td>5.2</td>\n", "      <td>2.0</td>\n", "      <td>virginica</td>\n", "    </tr>\n", "    <tr>\n", "      <th>148</th>\n", "      <td>6.2</td>\n", "      <td>3.4</td>\n", "      <td>5.4</td>\n", "      <td>2.3</td>\n", "      <td>virginica</td>\n", "    </tr>\n", "    <tr>\n", "      <th>149</th>\n", "      <td>5.9</td>\n", "      <td>3.0</td>\n", "      <td>5.1</td>\n", "      <td>1.8</td>\n", "      <td>virginica</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>"], "text/plain": ["     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \\\n", "145                6.7               3.0                5.2               2.3   \n", "146                6.3               2.5                5.0               1.9   \n", "147                6.5               3.0                5.2               2.0   \n", "148                6.2               3.4                5.4               2.3   \n", "149                5.9               3.0                5.1               1.8   \n", "\n", "         fleur  \n", "145  virginica  \n", "146  virginica  \n", "147  virginica  \n", "148  virginica  \n", "149  virginica  "]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["from sklearn.datasets import load_iris as load_data\n", "from pandas import DataFrame\n", "data = load_data()\n", "df = DataFrame(data.data, columns=data.feature_names)\n", "df['fleur'] = [data.target_names[t] for t in data.target]\n", "df.tail()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q1 : \u00e9criture des donn\u00e9es au format CSV\n", "\n", "Le plus simple..."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/plain": ["'sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),fleur\\r\\n5.1,3.5,1.4,0.2,setosa\\r\\n4.9,3.0,1.4,0.2,setosa\\r\\n4.7,3.2,1.3,0.2,setosa\\r\\n4.6,3.1,1.5,0.2,setosa\\r\\n5.0,3.6,1.4,0.2,setosa\\r\\n5.4,3.9,1.7,0.4,setosa\\r\\n4.6,3.4,1.4,0.3,setosa\\r\\n5.0,3.4,1.5,0.2,setosa\\r\\n4.4,2.9,1.4,0.2,setosa\\r\\n4.9,3.1'"]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["from io import StringIO\n", "buffer = StringIO()\n", "df.to_csv(buffer, index=False)\n", "text = buffer.getvalue()\n", "text[:300]"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q2 : \u00e9criture des donn\u00e9es au format JSON"]}, {"cell_type": "code", "execution_count": 5, "metadata": {"scrolled": false}, "outputs": [{"data": {"text/plain": ["'[{\"sepal length (cm)\":5.1,\"sepal width (cm)\":3.5,\"petal length (cm)\":1.4,\"petal width (cm)\":0.2,\"fleur\":\"setosa\"},{\"sepal length (cm)\":4.9,\"sepal width (cm)\":3.0,\"petal length (cm)\":1.4,\"petal width (cm)\":0.2,\"fleur\":\"setosa\"},{\"sepal length (cm)\":4.7,\"sepal width (cm)\":3.2,\"petal length (cm)\":1.3,\"petal width (cm)\":0.2,\"fleur\":\"setosa\"},{\"sepal length (cm)\":4.6,\"sepal width (cm)\":3.1,\"petal lengt'"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["r = df.to_json(orient='records')\n", "r[:400]"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q3 : relire les donn\u00e9es avec le module [json](https://docs.python.org/3/library/json.html)"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q4 : essayez avec les format XML (ou HTML), SQL, SAS, Excel..."]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q5 : donn\u00e9es non structur\u00e9es\n", "\n", "On ajoute les endroits o\u00f9 ces fleurs sont pr\u00e9sentes. On voudrait que toutes les informations soient pr\u00e9sentes dans le m\u00eame fichier. Comment fait-on ?"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": ["locations = {'virginica': ['Florida', 'Georgia'],\n", "             'setosa': ['Maine', 'Alaska', 'Quebec'],\n", "             'versicolor': ['Quebec', 'Georgia', 'Ireland', 'Main']}"]}, {"cell_type": "markdown", "metadata": {}, "source": ["La question sous-jacente est : vaut-il mieux avoir deux fichiers plats, l'un pour les donn\u00e9es d\u00e9crivant les fleurs, l'autre pour les localisations ou un seul fusionnant les deux informations ?"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q6 : le texte, \u00e7a prend trop de place, zippons\n", "\n", "Avec le module [zipfile](https://docs.python.org/3/library/zipfile.html)."]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q7 : que vous inspire [protobuf](https://developers.google.com/protocol-buffers) ?\n", "\n", "Ou [flatbuffers](https://google.github.io/flatbuffers/), [MessagePack](https://msgpack.org/)"]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": []}, {"cell_type": "markdown", "metadata": {}, "source": ["## R\u00e9ponses"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q1 : \u00e9criture des donn\u00e9es au format CSV"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"text/plain": ["'sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),fleur\\r\\n5.1,3.5,1.4,0.2,setosa\\r\\n4.9,3.0,1.4,0.2,setosa\\r\\n4.7,3.2,1.3,0.2,setosa\\r\\n4.6,3.1,1.5,0.2,setosa\\r\\n5.0,3.6,1.4,0.2,setosa\\r\\n5.4,3.9,1.7,0.4,setosa\\r\\n4.6,3.4,1.4,0.3,setosa\\r\\n5.0,3.4,1.5,0.2,setosa\\r\\n4.4,2.9,1.4,0.2,setosa\\r\\n4.9,3.1'"]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["from io import StringIO\n", "buffer = StringIO()\n", "df.to_csv(buffer, index=False)\n", "text = buffer.getvalue()\n", "text[:300]"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": ["df.to_csv(\"fleurs.csv\", index=False)"]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"data": {"text/plain": ["['.ipynb_checkpoints',\n", " '2020_covid.ipynb',\n", " '2020_edit.ipynb',\n", " '2020_json_xml.ipynb',\n", " '2020_numpy.ipynb',\n", " '2020_pandas.ipynb',\n", " '2020_profile.ipynb',\n", " '2020_regex.ipynb',\n", " '2020_suffix.ipynb',\n", " '2020_surface.ipynb',\n", " '2020_topk.ipynb',\n", " '2020_tsp.ipynb',\n", " 'data.csv',\n", " 'fleurs.csv']"]}, "execution_count": 17, "metadata": {}, "output_type": "execute_result"}], "source": ["import os\n", "os.listdir(\".\")"]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": ["import pandas\n", "df2 = pandas.read_csv(\"fleurs.csv\")"]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"data": {"text/html": ["<div>\n", "<style scoped>\n", "    .dataframe tbody tr th:only-of-type {\n", "        vertical-align: middle;\n", "    }\n", "\n", "    .dataframe tbody tr th {\n", "        vertical-align: top;\n", "    }\n", "\n", "    .dataframe thead th {\n", "        text-align: right;\n", "    }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>sepal length (cm)</th>\n", "      <th>sepal width (cm)</th>\n", "      <th>petal length (cm)</th>\n", "      <th>petal width (cm)</th>\n", "      <th>fleur</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>5.1</td>\n", "      <td>3.5</td>\n", "      <td>1.4</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>4.9</td>\n", "      <td>3.0</td>\n", "      <td>1.4</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>4.7</td>\n", "      <td>3.2</td>\n", "      <td>1.3</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "    </tr>\n", "    <tr>\n", "      <th>3</th>\n", "      <td>4.6</td>\n", "      <td>3.1</td>\n", "      <td>1.5</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "    </tr>\n", "    <tr>\n", "      <th>4</th>\n", "      <td>5.0</td>\n", "      <td>3.6</td>\n", "      <td>1.4</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>"], "text/plain": ["   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \\\n", "0                5.1               3.5                1.4               0.2   \n", "1                4.9               3.0                1.4               0.2   \n", "2                4.7               3.2                1.3               0.2   \n", "3                4.6               3.1                1.5               0.2   \n", "4                5.0               3.6                1.4               0.2   \n", "\n", "    fleur  \n", "0  setosa  \n", "1  setosa  \n", "2  setosa  \n", "3  setosa  \n", "4  setosa  "]}, "execution_count": 19, "metadata": {}, "output_type": "execute_result"}], "source": ["df2.head()"]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"data": {"text/html": ["<div>\n", "<style scoped>\n", "    .dataframe tbody tr th:only-of-type {\n", "        vertical-align: middle;\n", "    }\n", "\n", "    .dataframe tbody tr th {\n", "        vertical-align: top;\n", "    }\n", "\n", "    .dataframe thead th {\n", "        text-align: right;\n", "    }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>sepal length (cm)</th>\n", "      <th>sepal width (cm)</th>\n", "      <th>petal length (cm)</th>\n", "      <th>petal width (cm)</th>\n", "      <th>fleur</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>5.1</td>\n", "      <td>3.5</td>\n", "      <td>1.4</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>4.9</td>\n", "      <td>3.0</td>\n", "      <td>1.4</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>4.7</td>\n", "      <td>3.2</td>\n", "      <td>1.3</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "    </tr>\n", "    <tr>\n", "      <th>3</th>\n", "      <td>4.6</td>\n", "      <td>3.1</td>\n", "      <td>1.5</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "    </tr>\n", "    <tr>\n", "      <th>4</th>\n", "      <td>5.0</td>\n", "      <td>3.6</td>\n", "      <td>1.4</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>"], "text/plain": ["   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \\\n", "0                5.1               3.5                1.4               0.2   \n", "1                4.9               3.0                1.4               0.2   \n", "2                4.7               3.2                1.3               0.2   \n", "3                4.6               3.1                1.5               0.2   \n", "4                5.0               3.6                1.4               0.2   \n", "\n", "    fleur  \n", "0  setosa  \n", "1  setosa  \n", "2  setosa  \n", "3  setosa  \n", "4  setosa  "]}, "execution_count": 20, "metadata": {}, "output_type": "execute_result"}], "source": ["virtuel = StringIO(text)\n", "df3 = pandas.read_csv(virtuel)\n", "df3.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q2 : \u00e9criture des donn\u00e9es au format JSON"]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"data": {"text/plain": ["'[{\"sepal length (cm)\":5.1,\"sepal width (cm)\":3.5,\"petal length (cm)\":1.4,\"petal width (cm)\":0.2,\"fleur\":\"setosa\"},{\"sepal length (cm)\":4.9,\"sepal width (cm)\":3.0,\"petal length (cm)\":1.4,\"petal width (cm)\":0.2,\"fleur\":\"setosa\"},{\"sepal length (cm)\":4.7,\"sepal width (cm)\":3.2,\"petal length (cm)\":1.3,\"petal width (cm)\":0.2,\"fleur\":\"setosa\"},{\"sepal length (cm)\":4.6,\"sepal width (cm)\":3.1,\"petal lengt'"]}, "execution_count": 21, "metadata": {}, "output_type": "execute_result"}], "source": ["json_text = df.to_json(orient='records')\n", "json_text[:400]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q3 : relire les donn\u00e9es avec le module [json](https://docs.python.org/3/library/json.html)"]}, {"cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": ["import json"]}, {"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": ["res = json.loads(json_text)"]}, {"cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["0 <class 'dict'> {'sepal length (cm)': 5.1, 'sepal width (cm)': 3.5, 'petal length (cm)': 1.4, 'petal width (cm)': 0.2, 'fleur': 'setosa'}\n", "1 <class 'dict'> {'sepal length (cm)': 4.9, 'sepal width (cm)': 3.0, 'petal length (cm)': 1.4, 'petal width (cm)': 0.2, 'fleur': 'setosa'}\n", "2 <class 'dict'> {'sepal length (cm)': 4.7, 'sepal width (cm)': 3.2, 'petal length (cm)': 1.3, 'petal width (cm)': 0.2, 'fleur': 'setosa'}\n", "3 <class 'dict'> {'sepal length (cm)': 4.6, 'sepal width (cm)': 3.1, 'petal length (cm)': 1.5, 'petal width (cm)': 0.2, 'fleur': 'setosa'}\n", "4 <class 'dict'> {'sepal length (cm)': 5.0, 'sepal width (cm)': 3.6, 'petal length (cm)': 1.4, 'petal width (cm)': 0.2, 'fleur': 'setosa'}\n", "5 <class 'dict'> {'sepal length (cm)': 5.4, 'sepal width (cm)': 3.9, 'petal length (cm)': 1.7, 'petal width (cm)': 0.4, 'fleur': 'setosa'}\n"]}], "source": ["for i, r in enumerate(res):\n", "    print(i, type(r), r)\n", "    if i >= 5:\n", "        break"]}, {"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [{"data": {"text/plain": ["3.1"]}, "execution_count": 25, "metadata": {}, "output_type": "execute_result"}], "source": ["res[3]['sepal width (cm)']"]}, {"cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [{"data": {"text/plain": ["[{'sepal length (cm)': 5.1,\n", "  'sepal width (cm)': 3.5,\n", "  'petal length (cm)': 1.4,\n", "  'petal width (cm)': 0.2,\n", "  'fleur': 'setosa'},\n", " {'sepal length (cm)': 4.9,\n", "  'sepal width (cm)': 3.0,\n", "  'petal length (cm)': 1.4,\n", "  'petal width (cm)': 0.2,\n", "  'fleur': 'setosa'},\n", " {'sepal length (cm)': 4.7,\n", "  'sepal width (cm)': 3.2,\n", "  'petal length (cm)': 1.3,\n", "  'petal width (cm)': 0.2,\n", "  'fleur': 'setosa'}]"]}, "execution_count": 26, "metadata": {}, "output_type": "execute_result"}], "source": ["virtuel = StringIO(json_text)\n", "res2 = json.load(virtuel)\n", "res2[:3]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q4 : essayez avec les format XML (ou HTML), SQL, SAS, Excel..."]}, {"cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": ["html_text = df.to_html(index=False)"]}, {"cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th>sepal length (cm)</th>\n", "      <th>sepal width (cm)</th>\n", "      <th>petal length (cm)</th>\n", "      <th>petal width (cm)</th>\n", "      <th>fleur</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <td>5.1</td>\n", "      <td>3.5</td>\n", "      <td>1.4</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "    </tr>\n", "    <tr>\n", "      <td>4.9</td>\n", "      <td>3.0</td>\n", "      <td>1.4</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "    </tr>\n", " \n"]}], "source": ["print(html_text[:500])"]}, {"cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [{"data": {"text/html": ["<div>\n", "<style scoped>\n", "    .dataframe tbody tr th:only-of-type {\n", "        vertical-align: middle;\n", "    }\n", "\n", "    .dataframe tbody tr th {\n", "        vertical-align: top;\n", "    }\n", "\n", "    .dataframe thead th {\n", "        text-align: right;\n", "    }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>sepal length (cm)</th>\n", "      <th>sepal width (cm)</th>\n", "      <th>petal length (cm)</th>\n", "      <th>petal width (cm)</th>\n", "      <th>fleur</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>145</th>\n", "      <td>6.7</td>\n", "      <td>3.0</td>\n", "      <td>5.2</td>\n", "      <td>2.3</td>\n", "      <td>virginica</td>\n", "    </tr>\n", "    <tr>\n", "      <th>146</th>\n", "      <td>6.3</td>\n", "      <td>2.5</td>\n", "      <td>5.0</td>\n", "      <td>1.9</td>\n", "      <td>virginica</td>\n", "    </tr>\n", "    <tr>\n", "      <th>147</th>\n", "      <td>6.5</td>\n", "      <td>3.0</td>\n", "      <td>5.2</td>\n", "      <td>2.0</td>\n", "      <td>virginica</td>\n", "    </tr>\n", "    <tr>\n", "      <th>148</th>\n", "      <td>6.2</td>\n", "      <td>3.4</td>\n", "      <td>5.4</td>\n", "      <td>2.3</td>\n", "      <td>virginica</td>\n", "    </tr>\n", "    <tr>\n", "      <th>149</th>\n", "      <td>5.9</td>\n", "      <td>3.0</td>\n", "      <td>5.1</td>\n", "      <td>1.8</td>\n", "      <td>virginica</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>"], "text/plain": ["     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \\\n", "145                6.7               3.0                5.2               2.3   \n", "146                6.3               2.5                5.0               1.9   \n", "147                6.5               3.0                5.2               2.0   \n", "148                6.2               3.4                5.4               2.3   \n", "149                5.9               3.0                5.1               1.8   \n", "\n", "         fleur  \n", "145  virginica  \n", "146  virginica  \n", "147  virginica  \n", "148  virginica  \n", "149  virginica  "]}, "execution_count": 29, "metadata": {}, "output_type": "execute_result"}], "source": ["df_html = pandas.read_html(html_text)\n", "df_html[0].tail()"]}, {"cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [{"data": {"text/plain": ["2"]}, "execution_count": 30, "metadata": {}, "output_type": "execute_result"}], "source": ["df_html = pandas.read_html(html_text + html_text)\n", "len(df_html)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q5 : donn\u00e9es non structur\u00e9es"]}, {"cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [{"data": {"text/html": ["<div>\n", "<style scoped>\n", "    .dataframe tbody tr th:only-of-type {\n", "        vertical-align: middle;\n", "    }\n", "\n", "    .dataframe tbody tr th {\n", "        vertical-align: top;\n", "    }\n", "\n", "    .dataframe thead th {\n", "        text-align: right;\n", "    }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>sepal length (cm)</th>\n", "      <th>sepal width (cm)</th>\n", "      <th>petal length (cm)</th>\n", "      <th>petal width (cm)</th>\n", "      <th>fleur</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>5.1</td>\n", "      <td>3.5</td>\n", "      <td>1.4</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>4.9</td>\n", "      <td>3.0</td>\n", "      <td>1.4</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>4.7</td>\n", "      <td>3.2</td>\n", "      <td>1.3</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "    </tr>\n", "    <tr>\n", "      <th>3</th>\n", "      <td>4.6</td>\n", "      <td>3.1</td>\n", "      <td>1.5</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "    </tr>\n", "    <tr>\n", "      <th>4</th>\n", "      <td>5.0</td>\n", "      <td>3.6</td>\n", "      <td>1.4</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>"], "text/plain": ["   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \\\n", "0                5.1               3.5                1.4               0.2   \n", "1                4.9               3.0                1.4               0.2   \n", "2                4.7               3.2                1.3               0.2   \n", "3                4.6               3.1                1.5               0.2   \n", "4                5.0               3.6                1.4               0.2   \n", "\n", "    fleur  \n", "0  setosa  \n", "1  setosa  \n", "2  setosa  \n", "3  setosa  \n", "4  setosa  "]}, "execution_count": 31, "metadata": {}, "output_type": "execute_result"}], "source": ["df.head()"]}, {"cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": ["locations = {'virginica': ['Florida', 'Georgia'],\n", "             'setosa': ['Maine', 'Alaska', 'Quebec'],\n", "             'versicolor': ['Quebec', 'Georgia', 'Ireland', 'Main']}"]}, {"cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [{"data": {"text/plain": ["[{'fleur': 'virginica', 'location': 'Florida'},\n", " {'fleur': 'virginica', 'location': 'Georgia'},\n", " {'fleur': 'setosa', 'location': 'Maine'},\n", " {'fleur': 'setosa', 'location': 'Alaska'},\n", " {'fleur': 'setosa', 'location': 'Quebec'},\n", " {'fleur': 'versicolor', 'location': 'Quebec'},\n", " {'fleur': 'versicolor', 'location': 'Georgia'},\n", " {'fleur': 'versicolor', 'location': 'Ireland'},\n", " {'fleur': 'versicolor', 'location': 'Main'}]"]}, "execution_count": 33, "metadata": {}, "output_type": "execute_result"}], "source": ["obs = []\n", "for fleur, loc in locations.items():\n", "    for l in loc:\n", "        obs.append({\"fleur\": fleur, \"location\": l})\n", "obs"]}, {"cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [{"data": {"text/html": ["<div>\n", "<style scoped>\n", "    .dataframe tbody tr th:only-of-type {\n", "        vertical-align: middle;\n", "    }\n", "\n", "    .dataframe tbody tr th {\n", "        vertical-align: top;\n", "    }\n", "\n", "    .dataframe thead th {\n", "        text-align: right;\n", "    }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>fleur</th>\n", "      <th>location</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>virginica</td>\n", "      <td>Florida</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>virginica</td>\n", "      <td>Georgia</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>setosa</td>\n", "      <td>Maine</td>\n", "    </tr>\n", "    <tr>\n", "      <th>3</th>\n", "      <td>setosa</td>\n", "      <td>Alaska</td>\n", "    </tr>\n", "    <tr>\n", "      <th>4</th>\n", "      <td>setosa</td>\n", "      <td>Quebec</td>\n", "    </tr>\n", "    <tr>\n", "      <th>5</th>\n", "      <td>versicolor</td>\n", "      <td>Quebec</td>\n", "    </tr>\n", "    <tr>\n", "      <th>6</th>\n", "      <td>versicolor</td>\n", "      <td>Georgia</td>\n", "    </tr>\n", "    <tr>\n", "      <th>7</th>\n", "      <td>versicolor</td>\n", "      <td>Ireland</td>\n", "    </tr>\n", "    <tr>\n", "      <th>8</th>\n", "      <td>versicolor</td>\n", "      <td>Main</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>"], "text/plain": ["        fleur location\n", "0   virginica  Florida\n", "1   virginica  Georgia\n", "2      setosa    Maine\n", "3      setosa   Alaska\n", "4      setosa   Quebec\n", "5  versicolor   Quebec\n", "6  versicolor  Georgia\n", "7  versicolor  Ireland\n", "8  versicolor     Main"]}, "execution_count": 34, "metadata": {}, "output_type": "execute_result"}], "source": ["df_locations = pandas.DataFrame(obs)\n", "df_locations"]}, {"cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [{"data": {"text/html": ["<div>\n", "<style scoped>\n", "    .dataframe tbody tr th:only-of-type {\n", "        vertical-align: middle;\n", "    }\n", "\n", "    .dataframe tbody tr th {\n", "        vertical-align: top;\n", "    }\n", "\n", "    .dataframe thead th {\n", "        text-align: right;\n", "    }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>sepal length (cm)</th>\n", "      <th>sepal width (cm)</th>\n", "      <th>petal length (cm)</th>\n", "      <th>petal width (cm)</th>\n", "      <th>fleur</th>\n", "      <th>location</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>5.1</td>\n", "      <td>3.5</td>\n", "      <td>1.4</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "      <td>Maine</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>5.1</td>\n", "      <td>3.5</td>\n", "      <td>1.4</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "      <td>Alaska</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>5.1</td>\n", "      <td>3.5</td>\n", "      <td>1.4</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "      <td>Quebec</td>\n", "    </tr>\n", "    <tr>\n", "      <th>3</th>\n", "      <td>4.9</td>\n", "      <td>3.0</td>\n", "      <td>1.4</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "      <td>Maine</td>\n", "    </tr>\n", "    <tr>\n", "      <th>4</th>\n", "      <td>4.9</td>\n", "      <td>3.0</td>\n", "      <td>1.4</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "      <td>Alaska</td>\n", "    </tr>\n", "    <tr>\n", "      <th>5</th>\n", "      <td>4.9</td>\n", "      <td>3.0</td>\n", "      <td>1.4</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "      <td>Quebec</td>\n", "    </tr>\n", "    <tr>\n", "      <th>6</th>\n", "      <td>4.7</td>\n", "      <td>3.2</td>\n", "      <td>1.3</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "      <td>Maine</td>\n", "    </tr>\n", "    <tr>\n", "      <th>7</th>\n", "      <td>4.7</td>\n", "      <td>3.2</td>\n", "      <td>1.3</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "      <td>Alaska</td>\n", "    </tr>\n", "    <tr>\n", "      <th>8</th>\n", "      <td>4.7</td>\n", "      <td>3.2</td>\n", "      <td>1.3</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "      <td>Quebec</td>\n", "    </tr>\n", "    <tr>\n", "      <th>9</th>\n", "      <td>4.6</td>\n", "      <td>3.1</td>\n", "      <td>1.5</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "      <td>Maine</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>"], "text/plain": ["   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \\\n", "0                5.1               3.5                1.4               0.2   \n", "1                5.1               3.5                1.4               0.2   \n", "2                5.1               3.5                1.4               0.2   \n", "3                4.9               3.0                1.4               0.2   \n", "4                4.9               3.0                1.4               0.2   \n", "5                4.9               3.0                1.4               0.2   \n", "6                4.7               3.2                1.3               0.2   \n", "7                4.7               3.2                1.3               0.2   \n", "8                4.7               3.2                1.3               0.2   \n", "9                4.6               3.1                1.5               0.2   \n", "\n", "    fleur location  \n", "0  setosa    Maine  \n", "1  setosa   Alaska  \n", "2  setosa   Quebec  \n", "3  setosa    Maine  \n", "4  setosa   Alaska  \n", "5  setosa   Quebec  \n", "6  setosa    Maine  \n", "7  setosa   Alaska  \n", "8  setosa   Quebec  \n", "9  setosa    Maine  "]}, "execution_count": 35, "metadata": {}, "output_type": "execute_result"}], "source": ["merged = df.merge(df_locations, left_on=\"fleur\", right_on=\"fleur\")\n", "merged.head(10)"]}, {"cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [{"data": {"text/plain": ["(450, 6)"]}, "execution_count": 36, "metadata": {}, "output_type": "execute_result"}], "source": ["merged.shape"]}, {"cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [{"data": {"text/plain": ["{'virginica': ['Florida', 'Georgia'],\n", " 'setosa': ['Maine', 'Alaska', 'Quebec'],\n", " 'versicolor': ['Quebec', 'Georgia', 'Ireland', 'Main']}"]}, "execution_count": 37, "metadata": {}, "output_type": "execute_result"}], "source": ["locations"]}, {"cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [{"data": {"text/plain": ["[{'fleur': 'virginica', 'location': ['Florida', 'Georgia']},\n", " {'fleur': 'setosa', 'location': ['Maine', 'Alaska', 'Quebec']},\n", " {'fleur': 'versicolor', 'location': ['Quebec', 'Georgia', 'Ireland', 'Main']}]"]}, "execution_count": 38, "metadata": {}, "output_type": "execute_result"}], "source": ["obs2 = []\n", "for fleur, loc in locations.items():\n", "    obs2.append({\"fleur\": fleur, \"location\": loc})\n", "obs2"]}, {"cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [{"data": {"text/html": ["<div>\n", "<style scoped>\n", "    .dataframe tbody tr th:only-of-type {\n", "        vertical-align: middle;\n", "    }\n", "\n", "    .dataframe tbody tr th {\n", "        vertical-align: top;\n", "    }\n", "\n", "    .dataframe thead th {\n", "        text-align: right;\n", "    }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>fleur</th>\n", "      <th>location</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>virginica</td>\n", "      <td>[Florida, Georgia]</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>setosa</td>\n", "      <td>[Maine, Alaska, Quebec]</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>versicolor</td>\n", "      <td>[Quebec, Georgia, Ireland, Main]</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>"], "text/plain": ["        fleur                          location\n", "0   virginica                [Florida, Georgia]\n", "1      setosa           [Maine, Alaska, Quebec]\n", "2  versicolor  [Quebec, Georgia, Ireland, Main]"]}, "execution_count": 39, "metadata": {}, "output_type": "execute_result"}], "source": ["df_locations2 = pandas.DataFrame(obs2)\n", "df_locations2"]}, {"cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [{"data": {"text/html": ["<div>\n", "<style scoped>\n", "    .dataframe tbody tr th:only-of-type {\n", "        vertical-align: middle;\n", "    }\n", "\n", "    .dataframe tbody tr th {\n", "        vertical-align: top;\n", "    }\n", "\n", "    .dataframe thead th {\n", "        text-align: right;\n", "    }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>sepal length (cm)</th>\n", "      <th>sepal width (cm)</th>\n", "      <th>petal length (cm)</th>\n", "      <th>petal width (cm)</th>\n", "      <th>fleur</th>\n", "      <th>location</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>5.1</td>\n", "      <td>3.5</td>\n", "      <td>1.4</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "      <td>[Maine, Alaska, Quebec]</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>4.9</td>\n", "      <td>3.0</td>\n", "      <td>1.4</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "      <td>[Maine, Alaska, Quebec]</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>4.7</td>\n", "      <td>3.2</td>\n", "      <td>1.3</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "      <td>[Maine, Alaska, Quebec]</td>\n", "    </tr>\n", "    <tr>\n", "      <th>3</th>\n", "      <td>4.6</td>\n", "      <td>3.1</td>\n", "      <td>1.5</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "      <td>[Maine, Alaska, Quebec]</td>\n", "    </tr>\n", "    <tr>\n", "      <th>4</th>\n", "      <td>5.0</td>\n", "      <td>3.6</td>\n", "      <td>1.4</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "      <td>[Maine, Alaska, Quebec]</td>\n", "    </tr>\n", "    <tr>\n", "      <th>5</th>\n", "      <td>5.4</td>\n", "      <td>3.9</td>\n", "      <td>1.7</td>\n", "      <td>0.4</td>\n", "      <td>setosa</td>\n", "      <td>[Maine, Alaska, Quebec]</td>\n", "    </tr>\n", "    <tr>\n", "      <th>6</th>\n", "      <td>4.6</td>\n", "      <td>3.4</td>\n", "      <td>1.4</td>\n", "      <td>0.3</td>\n", "      <td>setosa</td>\n", "      <td>[Maine, Alaska, Quebec]</td>\n", "    </tr>\n", "    <tr>\n", "      <th>7</th>\n", "      <td>5.0</td>\n", "      <td>3.4</td>\n", "      <td>1.5</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "      <td>[Maine, Alaska, Quebec]</td>\n", "    </tr>\n", "    <tr>\n", "      <th>8</th>\n", "      <td>4.4</td>\n", "      <td>2.9</td>\n", "      <td>1.4</td>\n", "      <td>0.2</td>\n", "      <td>setosa</td>\n", "      <td>[Maine, Alaska, Quebec]</td>\n", "    </tr>\n", "    <tr>\n", "      <th>9</th>\n", "      <td>4.9</td>\n", "      <td>3.1</td>\n", "      <td>1.5</td>\n", "      <td>0.1</td>\n", "      <td>setosa</td>\n", "      <td>[Maine, Alaska, Quebec]</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>"], "text/plain": ["   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \\\n", "0                5.1               3.5                1.4               0.2   \n", "1                4.9               3.0                1.4               0.2   \n", "2                4.7               3.2                1.3               0.2   \n", "3                4.6               3.1                1.5               0.2   \n", "4                5.0               3.6                1.4               0.2   \n", "5                5.4               3.9                1.7               0.4   \n", "6                4.6               3.4                1.4               0.3   \n", "7                5.0               3.4                1.5               0.2   \n", "8                4.4               2.9                1.4               0.2   \n", "9                4.9               3.1                1.5               0.1   \n", "\n", "    fleur                 location  \n", "0  setosa  [Maine, Alaska, Quebec]  \n", "1  setosa  [Maine, Alaska, Quebec]  \n", "2  setosa  [Maine, Alaska, Quebec]  \n", "3  setosa  [Maine, Alaska, Quebec]  \n", "4  setosa  [Maine, Alaska, Quebec]  \n", "5  setosa  [Maine, Alaska, Quebec]  \n", "6  setosa  [Maine, Alaska, Quebec]  \n", "7  setosa  [Maine, Alaska, Quebec]  \n", "8  setosa  [Maine, Alaska, Quebec]  \n", "9  setosa  [Maine, Alaska, Quebec]  "]}, "execution_count": 40, "metadata": {}, "output_type": "execute_result"}], "source": ["merged = df.merge(df_locations2, left_on=\"fleur\", right_on=\"fleur\")\n", "merged.head(10)"]}, {"cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [{"data": {"text/plain": ["'[{\"sepal length (cm)\":5.1,\"sepal width (cm)\":3.5,\"petal length (cm)\":1.4,\"petal width (cm)\":0.2,\"fleur\":\"setosa\",\"location\":[\"Maine\",\"Alaska\",\"Quebec\"]},{\"sepal length (cm)\":4.9,\"sepal width (cm)\":3.0'"]}, "execution_count": 41, "metadata": {}, "output_type": "execute_result"}], "source": ["json_text = merged.to_json(orient='records')\n", "json_text[:200]"]}, {"cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": ["df.to_excel(\"data.xlsx\", index=False)"]}, {"cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [{"data": {"text/html": ["<div>\n", "<style scoped>\n", "    .dataframe tbody tr th:only-of-type {\n", "        vertical-align: middle;\n", "    }\n", "\n", "    .dataframe tbody tr th {\n", "        vertical-align: top;\n", "    }\n", "\n", "    .dataframe thead th {\n", "        text-align: right;\n", "    }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>sepal length (cm)</th>\n", "      <th>sepal width (cm)</th>\n", "      <th>petal length (cm)</th>\n", "      <th>petal width (cm)</th>\n", "      <th>fleur</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>145</th>\n", "      <td>6.7</td>\n", "      <td>3.0</td>\n", "      <td>5.2</td>\n", "      <td>2.3</td>\n", "      <td>virginica</td>\n", "    </tr>\n", "    <tr>\n", "      <th>146</th>\n", "      <td>6.3</td>\n", "      <td>2.5</td>\n", "      <td>5.0</td>\n", "      <td>1.9</td>\n", "      <td>virginica</td>\n", "    </tr>\n", "    <tr>\n", "      <th>147</th>\n", "      <td>6.5</td>\n", "      <td>3.0</td>\n", "      <td>5.2</td>\n", "      <td>2.0</td>\n", "      <td>virginica</td>\n", "    </tr>\n", "    <tr>\n", "      <th>148</th>\n", "      <td>6.2</td>\n", "      <td>3.4</td>\n", "      <td>5.4</td>\n", "      <td>2.3</td>\n", "      <td>virginica</td>\n", "    </tr>\n", "    <tr>\n", "      <th>149</th>\n", "      <td>5.9</td>\n", "      <td>3.0</td>\n", "      <td>5.1</td>\n", "      <td>1.8</td>\n", "      <td>virginica</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>"], "text/plain": ["     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \\\n", "145                6.7               3.0                5.2               2.3   \n", "146                6.3               2.5                5.0               1.9   \n", "147                6.5               3.0                5.2               2.0   \n", "148                6.2               3.4                5.4               2.3   \n", "149                5.9               3.0                5.1               1.8   \n", "\n", "         fleur  \n", "145  virginica  \n", "146  virginica  \n", "147  virginica  \n", "148  virginica  \n", "149  virginica  "]}, "execution_count": 43, "metadata": {}, "output_type": "execute_result"}], "source": ["dfe = pandas.read_excel(\"data.xlsx\", engine='openpyxl')\n", "dfe.tail()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q6 : le texte, \u00e7a prend trop de place, zippons"]}, {"cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": ["from zipfile import ZipFile\n", "\n", "with ZipFile('data.zip', 'w') as myzip:\n", "    myzip.write('data.xlsx')\n", "    myzip.write(\"2020_json_xml.ipynb\")"]}, {"cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [{"data": {"text/plain": ["['data.zip']"]}, "execution_count": 45, "metadata": {}, "output_type": "execute_result"}], "source": ["import glob\n", "glob.glob(\"*.zip\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Q7 : que vous inspire [protobuf](https://developers.google.com/protocol-buffers) ?\n", "\n", "Protobuf est un format de s\u00e9rialisation au m\u00eame titre que [pickle](https://docs.python.org/3/library/pickle.html). La [s\u00e9rialisation](https://fr.wikipedia.org/wiki/S%C3%A9rialisation) d\u00e9signe un proc\u00e9d\u00e9 qui permet d'enregister tout un tas d'information sous la forme d'ensemble d'octets contig\u00fc. En gros, on enregistre un ensemble de variables en m\u00eame quel qu'il soit, dans un unique fichier contig\u00fc. [json](https://fr.wikipedia.org/wiki/JavaScript_Object_Notation) est un format utilis\u00e9 par les outils de s\u00e9rialisation. Tout est en texte et la plupart du temps, les nombres r\u00e9elles prennent moins de place quand ils sont stock\u00e9s comme ils sont sock\u00e9s en m\u00e9moire sur 8 octets. C'est le premier probl\u00e8me r\u00e9solu par protobuf. Ensuite, le format *json* est tr\u00e8s g\u00e9n\u00e9rique mais il sugg\u00e8re de stocker le nom des colonnes \u00e0 chaque ligne, c'est une information qui est dupliqu\u00e9 \u00e0 chaque ligne... *protobuf* propose une fa\u00e7on de ne pas les sotcker du tout. Je passe les d\u00e9tails. Ils reviendront quand le probl\u00e8me de communiquer efficacement des donn\u00e9es se posera."]}, {"cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2"}}, "nbformat": 4, "nbformat_minor": 2}