{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Wikipedia statistics - exercise\n", "\n", "Parallelizing the download of data files from Wikipedia."]}, {"cell_type": "code", "execution_count": 1, "metadata": {"collapsed": false}, "outputs": [{"data": {"text/html": ["
run previous cell, wait for 2 seconds
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercise 1: parallelizing the download\n", "\n", "The download can be parallelized in several ways:\n", "\n", "* with [threads](https://en.wikipedia.org/wiki/Thread_(computing)) (library [threading](https://docs.python.org/3/library/threading.html)): fast but sometimes tricky synchronization, memory shared between threads\n", "* with [processes](https://fr.wikipedia.org/wiki/Processus_(informatique)) (libraries [multiprocessing](https://docs.python.org/3.5/library/multiprocessing.html), [joblib](https://pythonhosted.org/joblib/), [jupyter](http://www.xavierdupre.fr/app/ensae_teaching_cs/helpsphinx/notebooks/td2a_cenonce_session_2D.html)): slow synchronization, no shared memory\n", "* with a [cluster](https://fr.wikipedia.org/wiki/Grappe_de_serveurs) ([jupyter](http://www.xavierdupre.fr/app/ensae_teaching_cs/helpsphinx/notebooks/td2a_cenonce_session_2D.html)): slow synchronization, no shared memory, large-scale parallelism\n", "\n", "The [ParallelProcessing](https://wiki.python.org/moin/ParallelProcessing) page lists modules that implement this, but it is not very up to date. 
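
The thread option above can be sketched with the standard `concurrent.futures` module (a minimal illustration, not the notebook's actual downloader: `fetch` is a hypothetical stand-in for the real download function):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(hour):
    # hypothetical stand-in for the download of one hourly file
    return hour

# three worker threads, mirroring the three queues used further down
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fetch, range(24)))
```

`pool.map` preserves the order of its inputs, so `results` comes back in order even though the calls run concurrently.
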
Check whether the proposed modules are still maintained."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Approach with threads"]}, {"cell_type": "code", "execution_count": 2, "metadata": {"collapsed": false, "scrolled": false}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["thread 0 download 2016-08-28 05:27:45.899868 len(qu) 55\n", "thread 0 download 2016-08-28 08:27:45.899868 len(qu) 54\n", "thread 0 download 2016-08-28 11:27:45.899868 len(qu) 53\n", "thread 0 download 2016-08-28 14:27:45.899868 len(qu) 52\n", "thread 0 download 2016-08-28 17:27:45.899868 len(qu) 51\n", "thread 0 download 2016-08-28 20:27:45.899868 len(qu) 50\n", "thread 0 download 2016-08-28 23:27:45.899868 len(qu) 49\n", "thread 0 download 2016-08-29 02:27:45.899868 len(qu) 48\n", "thread 0 download 2016-08-29 05:27:45.899868 len(qu) 47\n", "thread 1 download 2016-08-28 06:27:45.899868 len(qu) 55\n", "thread 0 download 2016-08-29 08:27:45.899868 len(qu) 46\n", "thread 1 download 2016-08-28 09:27:45.899868 len(qu) 54\n", "thread 1 download 2016-08-28 12:27:45.899868 len(qu) 53\n", "thread 0 download 2016-08-29 11:27:45.899868 len(qu) 45\n", "thread 1 download 2016-08-28 15:27:45.899868 len(qu) 52\n", "thread 1 download 2016-08-28 18:27:45.899868 len(qu) 51\n", "thread 1 download 2016-08-28 21:27:45.899868 len(qu) 50\n", "thread 1 download 2016-08-29 00:27:45.899868 len(qu) 49\n", "thread 1 download 2016-08-29 03:27:45.899868 len(qu) 48\n", "thread 1 download 2016-08-29 06:27:45.899868 len(qu) 47\n", "thread 1 download 2016-08-29 09:27:45.899868 len(qu) 46\n", "thread 1 download 2016-08-29 12:27:45.899868 len(qu) 45\n", "thread 0 download 2016-08-29 14:27:45.899868 len(qu) 44\n", "thread 2 download 2016-08-28 07:27:45.899868 len(qu) 55\n", "thread 2 download 2016-08-28 10:27:45.899868 len(qu) 54\n", "thread 1 download 2016-08-29 15:27:45.899868 len(qu) 44\n", "thread 1 download 2016-08-29 18:27:45.899868 len(qu) 43\n", "thread 2 download 
2016-08-28 13:27:45.899868 len(qu) 53\n", "thread 1 download 2016-08-29 21:27:45.899868 len(qu) 42\n", "thread 2 download 2016-08-28 16:27:45.899868 len(qu) 52\n", "thread 1 download 2016-08-30 00:27:45.899868 len(qu) 41\n", "thread 1 download 2016-08-30 03:27:45.899868 len(qu) 40\n", "thread 2 download 2016-08-28 19:27:45.899868 len(qu) 51\n", "attendre file 0 [44, 40, 51]\n", "thread 0 download 2016-08-29 17:27:45.899868 len(qu) 43\n", "thread 1 download 2016-08-30 06:27:45.899868 len(qu) 39\n", "thread 0 download 2016-08-29 20:27:45.899868 len(qu) 42\n", "thread 0 download 2016-08-29 23:27:45.899868 len(qu) 41\n", "thread 0 download 2016-08-30 02:27:45.899868 len(qu) 40\n", "thread 0 download 2016-08-30 05:27:45.899868 len(qu) 39\n", "thread 0 download 2016-08-30 08:27:45.899868 len(qu) 38\n", "thread 1 download 2016-08-30 09:27:45.899868 len(qu) 38\n", "thread 0 download 2016-08-30 11:27:45.899868 len(qu) 37\n", "thread 0 download 2016-08-30 14:27:45.899868 len(qu) 36\n", "thread 1 download 2016-08-30 12:27:45.899868 len(qu) 37\n", "thread 0 download 2016-08-30 17:27:45.899868 len(qu) 35\n", "thread 1 download 2016-08-30 15:27:45.899868 len(qu) 36\n", "thread 0 download 2016-08-30 20:27:45.899868 len(qu) 34\n", "thread 1 download 2016-08-30 18:27:45.899868 len(qu) 35\n", "thread 0 download 2016-08-30 23:27:45.899868 len(qu) 33\n", "thread 1 download 2016-08-30 21:27:45.899868 len(qu) 34\n", "thread 0 download 2016-08-31 02:27:45.899868 len(qu) 32\n", "thread 1 download 2016-08-31 00:27:45.899868 len(qu) 33\n", "thread 0 download 2016-08-31 05:27:45.899868 len(qu) 31\n", "thread 1 download 2016-08-31 03:27:45.899868 len(qu) 32\n", "thread 0 download 2016-08-31 08:27:45.899868 len(qu) 30\n", "thread 1 download 2016-08-31 06:27:45.899868 len(qu) 31\n", "thread 0 download 2016-08-31 11:27:45.899868 len(qu) 29\n", "thread 1 download 2016-08-31 09:27:45.899868 len(qu) 30\n", "thread 0 download 2016-08-31 14:27:45.899868 len(qu) 28\n", "thread 1 download 2016-08-31 
12:27:45.899868 len(qu) 29\n", "thread 0 download 2016-08-31 17:27:45.899868 len(qu) 27\n", "thread 1 download 2016-08-31 15:27:45.899868 len(qu) 28\n", "thread 1 download 2016-08-31 18:27:45.899868 len(qu) 27\n", "thread 0 download 2016-08-31 20:27:45.899868 len(qu) 26\n", "thread 2 download 2016-08-28 22:27:45.899868 len(qu) 50\n", "thread 1 download 2016-08-31 21:27:45.899868 len(qu) 26\n", "thread 2 download 2016-08-29 01:27:45.899868 len(qu) 49\n", "thread 0 download 2016-08-31 23:27:45.899868 len(qu) 25\n", "thread 1 download 2016-09-01 00:27:45.899868 len(qu) 25\n", "thread 0 download 2016-09-01 02:27:45.899868 len(qu) 24\n", "thread 2 download 2016-08-29 04:27:45.899868 len(qu) 48\n", "thread 0 download 2016-09-01 05:27:45.899868 len(qu) 23\n", "thread 2 download 2016-08-29 07:27:45.899868 len(qu) 47\n", "thread 0 download 2016-09-01 08:27:45.899868 len(qu) 22\n", "thread 2 download 2016-08-29 10:27:45.899868 len(qu) 46\n", "thread 0 download 2016-09-01 11:27:45.899868 len(qu) 21\n", "thread 0 download 2016-09-01 14:27:45.899868 len(qu) 20\n", "thread 1 download 2016-09-01 03:27:45.899868 len(qu) 24\n", "thread 1 download 2016-09-01 06:27:45.899868 len(qu) 23\n", "thread 1 download 2016-09-01 09:27:45.899868 len(qu) 22\n", "thread 2 download 2016-08-29 13:27:45.899868 len(qu) 45\n", "thread 2 download 2016-08-29 16:27:45.899868 len(qu) 44\n", "thread 2 download 2016-08-29 19:27:45.899868 len(qu) 43\n", "thread 2 download 2016-08-29 22:27:45.899868 len(qu) 42\n", "thread 2 download 2016-08-30 01:27:45.899868 len(qu) 41\n", "thread 2 download 2016-08-30 04:27:45.899868 len(qu) 40\n", "thread 2 download 2016-08-30 07:27:45.899868 len(qu) 39\n", "thread 2 download 2016-08-30 10:27:45.899868 len(qu) 38\n", "thread 2 download 2016-08-30 13:27:45.899868 len(qu) 37\n", "thread 2 download 2016-08-30 16:27:45.899868 len(qu) 36\n", "thread 2 download 2016-08-30 19:27:45.899868 len(qu) 35\n", "thread 2 download 2016-08-30 22:27:45.899868 len(qu) 34\n", "thread 2 
download 2016-08-31 01:27:45.899868 len(qu) 33\n", "thread 2 download 2016-08-31 04:27:45.899868 len(qu) 32\n", "thread 2 download 2016-08-31 07:27:45.899868 len(qu) 31\n", "thread 2 download 2016-08-31 10:27:45.899868 len(qu) 30\n", "thread 1 download 2016-09-01 12:27:45.899868 len(qu) 21\n", "thread 2 download 2016-08-31 13:27:45.899868 len(qu) 29\n", "thread 0 download 2016-09-01 17:27:45.899868 len(qu) 19\n", "thread 1 download 2016-09-01 15:27:45.899868 len(qu) 20\n", "thread 0 download 2016-09-01 20:27:45.899868 len(qu) 18\n", "thread 2 download 2016-08-31 16:27:45.899868 len(qu) 28\n", "thread 0 download 2016-09-01 23:27:45.899868 len(qu) 17\n", "thread 1 download 2016-09-01 18:27:45.899868 len(qu) 19\n", "thread 0 download 2016-09-02 02:27:45.899868 len(qu) 16\n", "thread 1 download 2016-09-01 21:27:45.899868 len(qu) 18\n", "thread 1 download 2016-09-02 00:27:45.899868 len(qu) 17\n", "thread 1 download 2016-09-02 03:27:45.899868 len(qu) 16\n", "thread 2 download 2016-08-31 19:27:45.899868 len(qu) 27\n", "thread 0 download 2016-09-02 05:27:45.899868 len(qu) 15\n", "thread 0 download 2016-09-02 08:27:45.899868 len(qu) 14\n", "thread 0 download 2016-09-02 11:27:45.899868 len(qu) 13\n", "thread 0 download 2016-09-02 14:27:45.899868 len(qu) 12\n", "thread 2 download 2016-08-31 22:27:45.899868 len(qu) 26\n", "thread 0 download 2016-09-02 17:27:45.899868 len(qu) 11\n", "thread 2 download 2016-09-01 01:27:45.899868 len(qu) 25\n", "thread 0 download 2016-09-02 20:27:45.899868 len(qu) 10\n", "thread 2 download 2016-09-01 04:27:45.899868 len(qu) 24\n", "thread 2 download 2016-09-01 07:27:45.899868 len(qu) 23\n", "thread 1 download 2016-09-02 06:27:45.899868 len(qu) 15\n", "thread 1 download 2016-09-02 09:27:45.899868 len(qu) 14\n", "thread 1 download 2016-09-02 12:27:45.899868 len(qu) 13\n", "thread 0 download 2016-09-02 23:27:45.899868 len(qu) 9\n", "thread 0 download 2016-09-03 02:27:45.899868 len(qu) 8\n", "thread 2 download 2016-09-01 10:27:45.899868 len(qu) 
22\n", "thread 1 download 2016-09-02 15:27:45.899868 len(qu) 12\n", "thread 2 download 2016-09-01 13:27:45.899868 len(qu) 21\n", "thread 0 download 2016-09-03 05:27:45.899868 len(qu) 7\n", "thread 2 download 2016-09-01 16:27:45.899868 len(qu) 20\n", "thread 1 download 2016-09-02 18:27:45.899868 len(qu) 11\n", "thread 2 download 2016-09-01 19:27:45.899868 len(qu) 19\n", "thread 0 download 2016-09-03 08:27:45.899868 len(qu) 6\n", "thread 1 download 2016-09-02 21:27:45.899868 len(qu) 10\n", "thread 2 download 2016-09-01 22:27:45.899868 len(qu) 18\n", "thread 0 download 2016-09-03 11:27:45.899868 len(qu) 5\n", "thread 1 download 2016-09-03 00:27:45.899868 len(qu) 9\n", "thread 2 download 2016-09-02 01:27:45.899868 len(qu) 17\n", "thread 1 download 2016-09-03 03:27:45.899868 len(qu) 8\n", "thread 2 download 2016-09-02 04:27:45.899868 len(qu) 16\n", "thread 1 download 2016-09-03 06:27:45.899868 len(qu) 7\n", "thread 1 download 2016-09-03 09:27:45.899868 len(qu) 6\n", "thread 2 download 2016-09-02 07:27:45.899868 len(qu) 15\n", "thread 2 download 2016-09-02 10:27:45.899868 len(qu) 14\n", "thread 1 download 2016-09-03 12:27:45.899868 len(qu) 5\n", "thread 2 download 2016-09-02 13:27:45.899868 len(qu) 13\n", "thread 2 download 2016-09-02 16:27:45.899868 len(qu) 12\n", "thread 2 download 2016-09-02 19:27:45.899868 len(qu) 11\n", "thread 2 download 2016-09-02 22:27:45.899868 len(qu) 10\n", "thread 2 download 2016-09-03 01:27:45.899868 len(qu) 9\n", "thread 2 download 2016-09-03 04:27:45.899868 len(qu) 8\n", "thread 2 download 2016-09-03 07:27:45.899868 len(qu) 7\n", "thread 2 download 2016-09-03 10:27:45.899868 len(qu) 6\n", "skipping dt 2016-09-04 05:27:45.899868 rerun to get it [Errno 28] No space left on device\n", "thread 2 download 2016-09-03 13:27:45.899868 len(qu) 5\n", "skipping dt 2016-09-04 05:27:45.899868 rerun to get it [Errno 28] No space left on device\n", "thread 0 download 2016-09-03 14:27:45.899868 len(qu) 4\n", "skipping dt 2016-09-04 05:27:45.899868 rerun 
to get it [Errno 28] No space left on device\n", "thread 1 download 2016-09-03 15:27:45.899868 len(qu) 4\n", "skipping dt 2016-09-04 05:27:45.899868 rerun to get it unable to retrieve content, url=https://dumps.wikimedia.org/other/pageviews/2016/2016-09/pageviews-20160903-130000.gz, exc=[Errno 28] No space left on device\n", "thread 2 download 2016-09-03 16:27:45.899868 len(qu) 4\n", "skipping dt 2016-09-04 05:27:45.899868 rerun to get it unable to retrieve content, url=https://dumps.wikimedia.org/other/pageviews/2016/2016-09/pageviews-20160903-140000.gz, exc=[Errno 28] No space left on device\n", "thread 0 download 2016-09-03 17:27:45.899868 len(qu) 3\n", "skipping dt 2016-09-04 05:27:45.899868 rerun to get it unable to retrieve content, url=https://dumps.wikimedia.org/other/pageviews/2016/2016-09/pageviews-20160903-150000.gz, exc=[Errno 28] No space left on device\n", "thread 1 download 2016-09-03 18:27:45.899868 len(qu) 3\n", "skipping dt 2016-09-04 05:27:45.899868 rerun to get it unable to retrieve content, url=https://dumps.wikimedia.org/other/pageviews/2016/2016-09/pageviews-20160903-160000.gz, exc=[Errno 28] No space left on device\n", "thread 2 download 2016-09-03 19:27:45.899868 len(qu) 3\n", "skipping dt 2016-09-04 05:27:45.899868 rerun to get it unable to retrieve content, url=https://dumps.wikimedia.org/other/pageviews/2016/2016-09/pageviews-20160903-170000.gz, exc=[Errno 28] No space left on device\n", "thread 0 download 2016-09-03 20:27:45.899868 len(qu) 2\n", "skipping dt 2016-09-04 05:27:45.899868 rerun to get it unable to retrieve content, url=https://dumps.wikimedia.org/other/pageviews/2016/2016-09/pageviews-20160903-180000.gz, exc=[Errno 28] No space left on device\n", "thread 1 download 2016-09-03 21:27:45.899868 len(qu) 2\n", "skipping dt 2016-09-04 05:27:45.899868 rerun to get it unable to retrieve content, url=https://dumps.wikimedia.org/other/pageviews/2016/2016-09/pageviews-20160903-190000.gz, exc=[Errno 28] No space left on device\n", 
"thread 2 download 2016-09-03 22:27:45.899868 len(qu) 2\n", "skipping dt 2016-09-04 05:27:45.899868 rerun to get it unable to retrieve content, url=https://dumps.wikimedia.org/other/pageviews/2016/2016-09/pageviews-20160903-200000.gz, exc=[Errno 28] No space left on device\n", "thread 0 download 2016-09-03 23:27:45.899868 len(qu) 1\n", "skipping dt 2016-09-04 05:27:45.899868 rerun to get it unable to retrieve content, url=https://dumps.wikimedia.org/other/pageviews/2016/2016-09/pageviews-20160903-210000.gz, exc=[Errno 28] No space left on device\n", "thread 1 download 2016-09-04 00:27:45.899868 len(qu) 1\n", "skipping dt 2016-09-04 05:27:45.899868 rerun to get it unable to retrieve content, url=https://dumps.wikimedia.org/other/pageviews/2016/2016-09/pageviews-20160903-220000.gz, exc=[Errno 28] No space left on device\n", "thread 2 download 2016-09-04 01:27:45.899868 len(qu) 1\n", "skipping dt 2016-09-04 05:27:45.899868 rerun to get it unable to retrieve content, url=https://dumps.wikimedia.org/other/pageviews/2016/2016-09/pageviews-20160903-230000.gz, exc=[Errno 28] No space left on device\n", "thread 0 download 2016-09-04 02:27:45.899868 len(qu) 0\n", "skipping dt 2016-09-04 05:27:45.899868 rerun to get it unable to retrieve content, url=https://dumps.wikimedia.org/other/pageviews/2016/2016-09/pageviews-20160904-000000.gz, exc=[Errno 28] No space left on device\n", "thread 1 download 2016-09-04 03:27:45.899868 len(qu) 0\n", "attendre file 1 [0, 0, 1]\n", "attendre file 2 [0, 0, 1]\n", "skipping dt 2016-09-04 05:27:45.899868 rerun to get it unable to retrieve content, url=https://dumps.wikimedia.org/other/pageviews/2016/2016-09/pageviews-20160904-010000.gz, exc=[Errno 28] No space left on device\n", "thread 2 download 2016-09-04 04:27:45.899868 len(qu) 0\n", "skipping dt 2016-09-04 05:27:45.899868 rerun to get it unable to retrieve content, url=https://dumps.wikimedia.org/other/pageviews/2016/2016-09/pageviews-20160904-020000.gz, exc=[Errno 28] No space left on 
device\n", "done thread 0\n", "skipping dt 2016-09-04 05:27:45.899868 rerun to get it unable to retrieve content, url=https://dumps.wikimedia.org/other/pageviews/2016/2016-09/pageviews-20160904-030000.gz, exc=[Errno 28] No space left on device\n", "done thread 1\n", "skipping dt 2016-09-04 05:27:45.899868 rerun to get it unable to retrieve content, url=https://dumps.wikimedia.org/other/pageviews/2016/2016-09/pageviews-20160904-040000.gz, exc=[Errno 28] No space left on device\n", "done thread 2\n"]}], "source": ["import threading, time, os\n", "from datetime import datetime, timedelta\n", "from mlstatpy.data.wikipedia import download_pageviews\n", "folder = \"d:\\\\wikipv\"\n", "if not os.path.exists(folder):\n", "    os.mkdir(folder)\n", "\n", "class DownloadThread(threading.Thread):\n", "    \"\"\"Thread definition: downloads one file after another\n", "    until its queue is empty.\"\"\"\n", "    def __init__(self, qu, name, folder):\n", "        threading.Thread.__init__(self)\n", "        self.qu = qu\n", "        self.name = name\n", "        self.folder = folder\n", "\n", "    def run(self):\n", "        while not self.qu.empty():\n", "            date = self.qu.get(False)\n", "            if date is None:\n", "                break\n", "            print(self.name, \"download\", date, \"len(qu)\", self.qu.qsize())\n", "            try:\n", "                download_pageviews(date, folder=self.folder)\n", "            except Exception as e:\n", "                print(\"skipping dt\", date, \"rerun to get it\", e)\n", "            # task_done must be called once per item taken from the queue,\n", "            # otherwise q.join() never returns.\n", "            self.qu.task_done()\n", "\n", "# create the queues and the associated threads\n", "import queue\n", "queues = [queue.Queue() for i in range(0, 3)]\n", "m = [DownloadThread(q, \"thread %d\" % i, folder) for i, q in enumerate(queues)]\n", "\n", "# fill the queues, one date per hour over one week\n", "dt = datetime.now() - timedelta(15)\n", "hour = timedelta(hours=1)\n", "for h in range(0, 24*7):\n", "    queues[h%3].put(dt)\n", "    dt += hour\n", "\n", "# start the threads\n", "for t in m:\n", "    t.start()\n", "\n", "# wait until the queues are empty\n", "for i, q in enumerate(queues):\n", "    print(\"attendre file\", i, [q.qsize() for q in queues])\n", "    q.join()\n", "\n", "    # One cannot replace q.join() with a busy wait such as:\n", "    #     while not q.empty():\n", "    #         time.sleep(1)\n", "    # The queue is already empty once q.get() has returned the last item,\n", "    # before the last download is finished: the program would stop\n", "    # and interrupt the running threads."]}, {"cell_type": "markdown", "metadata": {"collapsed": true}, "source": ["### Parallelization with processes\n", "\n", "It is not always easy to understand what happens when the error occurs in another process. If the *backend* is changed to ``\"threading\"``, the error becomes visible. See [Parallel](https://pythonhosted.org/joblib/generated/joblib.Parallel.html?highlight=parallel). The code does not always work when ``n_jobs > 1`` on Windows with the default backend (processes). 
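
The process option can also be sketched with the standard `multiprocessing` module (a minimal illustration, not the notebook's actual downloader: `work` is a hypothetical stand-in for the download function):

```python
from multiprocessing import Pool

def work(i):
    # hypothetical stand-in for the download of one file;
    # must be picklable, hence defined at module top level
    return i * i

if __name__ == '__main__':
    # three worker processes; arguments and results travel via pickling
    with Pool(processes=3) as pool:
        squares = pool.map(work, range(6))
```

Unlike threads, each worker is a separate Python process, so the function and its arguments must be picklable, and nothing is shared implicitly.
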
Read [Embarrassingly Parallel For Loops](https://pythonhosted.org/joblib/parallel.html#embarrassingly-parallel-for-loops)."]}, {"cell_type": "code", "execution_count": 3, "metadata": {"collapsed": false, "scrolled": false}, "outputs": [], "source": ["from joblib import Parallel, delayed\n", "from datetime import datetime, timedelta\n", "import os\n", "folder = \"d:\\\\wikipv\"\n", "if not os.path.exists(folder):\n", "    os.mkdir(folder)\n", "\n", "# build the list of dates to download\n", "dt = datetime.now() - timedelta(14)\n", "hour = timedelta(hours=1)\n", "dates = [dt + hour*i for i in range(0, 24)]\n", "\n", "def downloadp2(dt, folder):\n", "    from mlstatpy.data.wikipedia import download_pageviews\n", "    download_pageviews(dt, folder=folder)\n", "\n", "# This call does not work from a notebook when the backend is \"multiprocessing\".\n", "# In that case, run it as a standalone script.\n", "if __name__ == \"__main__\":\n", "    Parallel(n_jobs=3, verbose=5)(delayed(downloadp2)(dt, folder) for dt in dates)"]}, {"cell_type": "markdown", "metadata": {"collapsed": true}, "source": ["## Filtering to keep only the lines starting with fr"]}, {"cell_type": "code", "execution_count": 4, "metadata": {"collapsed": false, "scrolled": false}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["wikipv\\pageviews-20160827-210000wikipv\\pageviews-20160827-220000\n", "wikipv\\pageviews-20160827-230000\n", "\n", "wikipv\\pageviews-20160828-000000\n", "wikipv\\pageviews-20160828-010000\n", "wikipv\\pageviews-20160828-020000\n", "wikipv\\pageviews-20160828-030000\n", "wikipv\\pageviews-20160828-040000\n", "wikipv\\pageviews-20160828-050000\n", "wikipv\\pageviews-20160828-060000\n", "wikipv\\pageviews-20160828-070000\n", "wikipv\\pageviews-20160828-080000\n", "wikipv\\pageviews-20160828-090000\n", "wikipv\\pageviews-20160828-100000\n", "wikipv\\pageviews-20160828-110000\n"]}, {"name": "stderr", "output_type": "stream", "text": ["[Parallel(n_jobs=3)]: Done 12 tasks | elapsed: 
53.4s\n"]}, {"name": "stdout", "output_type": "stream", "text": ["wikipv\\pageviews-20160828-120000\n", "wikipv\\pageviews-20160828-130000\n", "wikipv\\pageviews-20160828-140000\n", "wikipv\\pageviews-20160828-150000\n", "wikipv\\pageviews-20160828-160000\n", "wikipv\\pageviews-20160828-170000\n", "wikipv\\pageviews-20160828-180000\n", "wikipv\\pageviews-20160828-190000\n", "wikipv\\pageviews-20160828-200000\n", "wikipv\\pageviews-20160828-210000\n", "wikipv\\pageviews-20160828-220000\n", "wikipv\\pageviews-20160828-230000\n", "wikipv\\pageviews-20160829-000000\n", "wikipv\\pageviews-20160829-010000\n", "wikipv\\pageviews-20160829-020000\n", "wikipv\\pageviews-20160829-030000\n", "wikipv\\pageviews-20160829-040000\n", "wikipv\\pageviews-20160829-050000\n", "wikipv\\pageviews-20160829-060000\n", "wikipv\\pageviews-20160829-070000\n", "wikipv\\pageviews-20160829-080000\n"]}], "source": ["def filtre(path, country):\n", "    import os\n", "    print(path)\n", "    output = path + \".\" + country\n", "    if not os.path.exists(output):\n", "        with open(path, \"r\", encoding=\"utf-8\") as f:\n", "            with open(output, \"w\", encoding=\"utf-8\") as g:\n", "                for line in f:\n", "                    if line.startswith(country):\n", "                        g.write(line)\n", "\n", "import os\n", "from joblib import Parallel, delayed\n", "folder = \"wikipv\"\n", "files = os.listdir(folder)\n", "files = [os.path.join(folder, _) for _ in files if _.startswith(\"pageviews\") and _.endswith(\"0000\")]\n", "\n", "Parallel(n_jobs=3, verbose=5, backend=\"threading\")(delayed(filtre)(name, \"fr\") for name in files)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Insert the file into a SQL database"]}, {"cell_type": "code", "execution_count": 5, "metadata": {"collapsed": true}, "outputs": [], "source": ["import pandas\n", "# A sketch, assuming one filtered file exists locally; pageviews lines are\n", "# space-separated: project, page, view count, size.\n", "df = pandas.read_csv(\"wikipv/pageviews-20160828-000000.fr\", sep=\" \", header=None,\n", "                     names=[\"project\", \"page\", \"views\", \"size\"])"]}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", 
"version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2"}}, "nbformat": 4, "nbformat_minor": 2}