{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 2A.i - Huge datasets, hierarchical datasets\n", "\n", "The example [Building a huge numpy array using pytables](http://stackoverflow.com/questions/8642626/building-a-huge-numpy-array-using-pytables) shows how to create a large matrix that does not fit in memory. Some modules make it possible to run computations on data stored on disk as if it were in memory."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": ["%matplotlib inline"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## h5py"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The [h5py](http://www.h5py.org/) module stores many datasets in a single file and names them the way files are named on a disk. The following example creates a single file containing two arrays:"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["\n", "random f0_100 (1000,)\n", "random f0_1000 (10000,)\n"]}], "source": ["import h5py\n", "import random\n", "hf = h5py.File('example.hdf5','w')\n", "arr = [ random.randint(0,100) for h in range(0,1000) ]\n", "hf[\"random/f0_100\"] = arr\n", "arr = [ random.randint(0,1000) for h in range(0,10000) ]\n", "hf[\"random/f0_1000\"] = arr\n", "hf.close()\n", "\n", "hf = h5py.File('example.hdf5','r')\n", "print(hf)\n", "for k in hf :\n", "    for k2 in hf[k] :\n", "        obj = hf[\"{0}/{1}\".format(k,k2)]\n", "        # obj.shape reads the shape without loading the data (Dataset.value was removed in h5py 3.0)\n", "        print(k, k2, obj, obj.shape)\n", "hf.close()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The advantage is that part of a dataset can be read without loading the whole dataset into memory:"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["[361 155 961 162 560]\n"]}], "source": ["hf = h5py.File('example.hdf5','r')\n", "print(hf[\"random/f0_1000\"][20:25])\n", "hf.close()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## pytables"]}, {"cell_type": "markdown", "metadata": {}, "source": ["[pytables](http://pytables.github.io/) can be thought of as a kind of database, in the spirit of [sqlite](https://www.sqlite.org/) and [sqlite3](https://docs.python.org/3/library/sqlite3.html), that is used like a dataframe."]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": ["try:\n", "    from tables import IsDescription, StringCol, Int64Col, Float32Col, Float64Col\n", "except ImportError as e:\n", "    # This sometimes fails on Windows: DLL load failed: The specified procedure could not be found.\n", "    import sys\n", "    raise ImportError(\"Cannot import tables.\\n\" + \"\\n\".join(sys.path)) from e\n", "\n", "class Particle(IsDescription):\n", "    name = StringCol(16) # 16-character String\n", "    idnumber = Int64Col() # Signed 64-bit integer\n", "    pressure = Float32Col() # float (single-precision)\n", "    energy = Float64Col() # double (double-precision)"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": ["from tables import open_file\n", "\n", "h5file = open_file(\"particule2.h5\", mode = \"w\", title = \"Test file\")"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/plain": ["File(filename=particule2.h5, title='Test file', mode='w', root_uep='/', filters=Filters(complevel=0, shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None))\n", "/ (RootGroup) 'Test file'\n", "/detector (Group) 'Detector information'\n", "/detector/readout (Table(0,)) 'Readout example'\n", " description := {\n", " \"energy\": Float64Col(shape=(), dflt=0.0, pos=0),\n", " \"idnumber\": Int64Col(shape=(), dflt=0, pos=1),\n", " \"name\": StringCol(itemsize=16, shape=(), dflt=b'', pos=2),\n", " \"pressure\": Float32Col(shape=(), dflt=0.0, pos=3)}\n", " byteorder := 'little'\n", " chunkshape := (1820,)"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["group = h5file.create_group(\"/\", 'detector', 'Detector information')\n", "table = h5file.create_table(group, 'readout', 
Particle, \"Readout example\")\n", "h5file"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": ["particle = table.row"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": ["for i in range(10):\n", "    particle['name'] = 'Particle: %6d' % (i)\n", "    particle['pressure'] = float(i*i)\n", "    particle['energy'] = float(particle['pressure'] ** 4)\n", "    particle['idnumber'] = i * (2 ** 34)\n", "    # Insert a new particle record\n", "    particle.append()"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": ["table.flush()"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"text/plain": ["[25.0, 36.0, 49.0]"]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}], "source": ["table_read = h5file.root.detector.readout\n", "pressure = [x['pressure'] for x in table_read.iterrows() if 20 <= x['pressure'] < 50]\n", "pressure"]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/plain": ["[b'Particle: 5', b'Particle: 6', b'Particle: 7']"]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["names = [ x['name'] for x in table.where(\"\"\"(20 <= pressure) & (pressure < 50)\"\"\") ]\n", "names"]}, {"cell_type": "markdown", "metadata": {}, "source": ["These lines are taken from the [tutorial](http://www.pytables.org/usersguide/tutorials.html). The module also supports creating arrays, again stored on disk."]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": ["h5file.close()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## blosc\n", "\n", "[blosc](http://www.blosc.org/) compresses numerical arrays. This frees memory while the arrays are not in use. It is optimized to spend as little time as possible on compression and decompression."]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"text/plain": ["(10000000,)"]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["import blosc\n", "import numpy as np\n", "\n", "a = np.linspace(0, 100, 10000000)\n", "a.shape"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/plain": ["(bytes, 6969171)"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["packed = blosc.pack_array(a)\n", "type(packed), len(packed)"]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"data": {"text/plain": ["numpy.ndarray"]}, "execution_count": 17, "metadata": {}, "output_type": "execute_result"}], "source": ["array = blosc.unpack_array(packed)\n", "type(array)"]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["166 ms \u00b1 4.96 ms per loop (mean \u00b1 std. dev. of 7 runs, 10 loops each)\n"]}], "source": ["%timeit blosc.pack_array(a)"]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["103 ms \u00b1 6.08 ms per loop (mean \u00b1 std. dev. 
of 7 runs, 10 loops each)\n"]}], "source": ["%timeit blosc.unpack_array(packed)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Performance as a function of array size."]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["1 0.019986204999895563 6.755499998689629e-05\n", "2 0.00012523500004135713 3.7926000004517846e-05\n", "3 0.0001631599998290767 4.819800005861907e-05\n", "4 0.0005262239999410667 0.00014814799988016603\n", "5 0.002960992999987866 0.001279211000110081\n", "6 0.018268078000119203 0.011275079999904847\n", "7 0.1547826650000843 0.09316439400004128\n", "8 1.544297945999915 0.9456180869999571\n"]}], "source": ["import time\n", "x = []\n", "t_comp = []\n", "t_dec = []\n", "size = 10\n", "for i in range(1, 9):\n", "    a = np.linspace(0, 100, size)\n", "    t1 = time.perf_counter()\n", "    packed = blosc.pack_array(a)\n", "    t2 = time.perf_counter()\n", "    blosc.unpack_array(packed)\n", "    t3 = time.perf_counter()\n", "    x.append(len(a))\n", "    t_comp.append(t2-t1)\n", "    t_dec.append(t3-t2)\n", "    print(i, t2-t1, t3-t2)\n", "    size *= 10"]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": ["
"]}, "metadata": {}, "output_type": "display_data"}], "source": ["import matplotlib.pyplot as plt\n", "fig, ax = plt.subplots(1, 1)\n", "ax.plot(x, t_comp, label=\"compression\")\n", "ax.plot(x, t_dec, label=\"d\u00e9compression\")\n", "ax.set_xlabel(\"taille\")\n", "ax.set_ylabel(\"time(ms)\")\n", "ax.set_xscale(\"log\", nonposx='clip')\n", "ax.set_yscale(\"log\", nonposy='clip')\n", "ax.legend();"]}, {"cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0"}}, "nbformat": 4, "nbformat_minor": 2}