{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 2A.i - Functional programming\n", "\n", "Iterators, generators, functional programming: everything needed to avoid loading the whole dataset into memory and to start computing as early as possible."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["Plan\n", "
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Data: [twitter_for_network_100000.db.zip](https://drive.google.com/open?id=0B6jkqYitZ0uTQ3k1NDZmLUJBZVk)"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/plain": ["['twitter_for_network_100000.db']"]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["import pyensae.datasource\n", "pyensae.datasource.download_data(\"twitter_for_network_100000.db.zip\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["# Functional programming"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Pure functions, tests and modularity"]}, {"cell_type": "markdown", "metadata": {}, "source": ["As its name suggests, functional programming is built around the notion of function, and more precisely of pure function. \n", "A pure function is a function:\n", "\n", " - whose result depends only on its inputs\n", " - that has no side effects"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["[1, 2, 3, 4]\n", "[1, 2, 3, 4]\n", "[1, 2, 3, 4]\n", "[4, 3, 2, 1]\n", "The slowest run took 23.73 times longer than the fastest. 
This could mean that an intermediate result is being cached \n", "100 loops, best of 3: 3.96 ms per loop\n", "100 loops, best of 3: 4.81 ms per loop\n"]}], "source": ["def sorted_1(l):\n", "    # impure: sorts the caller's list in place\n", "    l.sort()\n", "    return l\n", "\n", "a = [4,3,2,1]\n", "print(sorted_1(a))\n", "print(a)\n", "\n", "# sorted() is pure: it returns a new list and leaves a untouched\n", "a = [4,3,2,1]\n", "print(sorted(a))\n", "print(a)\n", "\n", "import random\n", "l = list(range(100000))\n", "random.shuffle( l )\n", "\n", "%timeit l.sort()\n", "%timeit sorted(l)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Functional programming contrasts with object-oriented programming. Objects focus on representation, functions on the action, its input and its result. Some languages are functional by design, such as [Lisp](https://en.wikipedia.org/wiki/Lisp_(programming_language)). Functional programming brings considerable benefits on at least two points that are essential in software development: \n", "\n", "- tests\n", "- modularity\n", " \n", "A concrete example: web services in Python. 
These are defined as functions, which makes it easy to make them compatible with different web servers: each server is given not the web service itself but a composition of a wrapper with it.\n", "\n", "- [Apache](https://httpd.apache.org/) $\\Rightarrow wrapper \\; Apache \\circ webservice$\n", "- [IIS](https://www.iis.net/) $\\Rightarrow wrapper \\; IIS \\circ webservice$\n", "- [CGI](http://httpd.apache.org/docs/current/fr/howto/cgi.html) $\\Rightarrow wrapper \\; CGI \\circ webservice$\n", " \n", "Composition is a very powerful way to change the behaviour of an object, because it does not affect the object itself."]}, {"cell_type": "code", "execution_count": 4, "metadata": {"collapsed": true}, "outputs": [], "source": ["import os, psutil, gc, sys\n", "if not sys.platform.startswith(\"win\"):\n", "    import resource\n", "\n", "def memory_usage_psutil():\n", "    gc.collect()\n", "    process = psutil.Process(os.getpid())\n", "    mem = process.memory_info()[0] / float(2 ** 20)  # resident set size in MB\n", "\n", "    print( \"Memory used : %i MB\" % mem )\n", "    if not sys.platform.startswith(\"win\"):\n", "        print( \"Max memory usage : %i MB\" % (resource.getrusage(resource.RUSAGE_SELF).ru_maxrss//1024) )"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Memory used : 109 MB\n"]}], "source": ["memory_usage_psutil()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Functions for handling large data: laziness"]}, {"cell_type": "markdown", "metadata": {}, "source": ["When handling large data, the crucial point is that we do not want to store intermediate values, because they could take up too much memory.\n", "For example, computing the average number of followers in the database does not require keeping all users in memory.\n", "\n", 
"Most of the functions in [cytoolz](https://github.com/pytoolz/cytoolz) that return sequences are said to be \"lazy\", which means that they actually run only when needed.\n", "This avoids using memory to store an intermediate result.\n", "\n", "For example, the cell below runs very quickly and uses almost no memory: ``a`` represents the squares of the numbers from 0 to 1000000, but they are not computed immediately."]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["The slowest run took 8.45 times longer than the fastest. This could mean that an intermediate result is being cached \n", "100000 loops, best of 3: 1.77 \u00b5s per loop\n", "<class 'generator'>\n"]}], "source": ["a = (it**2 for it in range(1000001))\n", "%timeit a = (it**2 for it in range(1000001))\n", "print( type(a) )"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Here we compute the sum of these numbers; the squares are actually computed only when ``sum`` is called. 
As a consequence, this operation is much slower than it would be if the numbers had already been computed."]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["1 loops, best of 3: 888 ms per loop\n"]}, {"data": {"text/plain": ["333333833333500000"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["%timeit sum( (it**2 for it in range(1000001)) )\n", "sum( a )"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Memory consumption has barely changed."]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Memory used : 109 MB\n"]}], "source": ["memory_usage_psutil()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Below, we have merely replaced the parentheses ``()`` with square brackets ``[]``, but that is enough to request that the values actually be computed and stored in a list.\n", "This takes longer and uses memory, but summing the result is then much faster."]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["1 loops, best of 3: 973 ms per loop\n", "<class 'list'>\n"]}], "source": ["b = [it**2 for it in range(1000001)]\n", "%timeit b = [it**2 for it in range(1000001)]\n", "print(type(b))"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["333333833333500000\n", "10 loops, best of 3: 72.4 ms per loop\n"]}], "source": ["print( sum(b) )\n", "%timeit sum(b)"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Memory used : 149 MB\n"]}], "source": ["memory_usage_psutil()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Beware that ``a`` is an 
[iterator](http://anandology.com/python-practice-book/iterators.html), which keeps track of its position. In other words, it can be used only once."]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/plain": ["0"]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["sum(a)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["If it needs to be reused, either store the values or wrap its creation in a function."]}, {"cell_type": "code", "execution_count": 13, "metadata": {"collapsed": true}, "outputs": [], "source": ["def f():\n", "    return (it**2 for it in range(1000001))"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["333333833333500000\n", "1 loops, best of 3: 1.01 s per loop\n"]}], "source": ["print( sum(f()) )\n", "%timeit sum(f())"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Example: cytoolz / Twitter data\n", "\n", "Links to the data: \n", "\n", "* [twitter_for_network_100000.db.zip](http://www.xavierdupre.fr/enseignement/complements/twitter_for_network_100000.db.zip)\n", "* [twitter_for_network_full.db.zip](http://www.xavierdupre.fr/enseignement/complements/twitter_for_network_full.db.zip)"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/plain": ["['twitter_for_network_100000.db']"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["import pyensae.datasource\n", "pyensae.datasource.download_data(\"twitter_for_network_100000.db.zip\")"]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Memory used : 149 MB\n"]}], "source": ["memory_usage_psutil()"]}, {"cell_type": "code", "execution_count": 17, "metadata": {"collapsed": true}, "outputs": [], "source": ["import cytoolz as ct # import groupby, 
valmap, compose\n", "import cytoolz.curried as ctc ## pipe, map, filter, get\n", "import sqlite3\n", "import pprint\n", "try:\n", "    import ujson as json\n", "except ImportError:\n", "    print(\"ujson not available\")\n", "    import json\n", "\n", "tw_users_limit = 1000000\n", "conn_sqlite = sqlite3.connect(\"twitter_for_network_100000.db\")\n", "cursor_sqlite = conn_sqlite.cursor()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Note: examples found online usually use json.loads. ujson is simply a faster implementation; it is not required."]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["The slowest run took 4.26 times longer than the fastest. This could mean that an intermediate result is being cached \n", "10000 loops, best of 3: 16.9 \u00b5s per loop\n", "10000 loops, best of 3: 28.6 \u00b5s per loop\n"]}], "source": ["import ujson as ujson_test\n", "import json as json_test\n", "\n", "cursor_sqlite.execute(\"SELECT content FROM tw_users LIMIT 1\")\n", "tw_user_json = cursor_sqlite.fetchone()[0]\n", "\n", "%timeit ujson_test.loads( tw_user_json )\n", "%timeit json_test.loads( tw_user_json )"]}, {"cell_type": "code", "execution_count": 19, "metadata": {"collapsed": true}, "outputs": [], "source": ["tw_users_limit = 1000000"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Below, the list of user profiles from the tw_users table is loaded into memory.\n", "It is advisable to test your functions on extracts of your data that fit in memory. 
After that, however, loading them into memory should be avoided."]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"data": {"text/plain": ["100071"]}, "execution_count": 21, "metadata": {}, "output_type": "execute_result"}], "source": ["## With storing in memory\n", "cursor_sqlite.execute(\"SELECT id, content FROM tw_users LIMIT %s\" % tw_users_limit)\n", "tw_users_as_json = list( ctc.map( json.loads, ctc.pluck( 1, cursor_sqlite ) ) )\n", "len(tw_users_as_json)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The example above and the one below both rely on two of the most classic functions:\n", "\n", " - ctc.pluck => takes a sequence as input and returns the sequence of the selected item taken from each element\n", " - ctc.map => applies a function to each element of the sequence"]}, {"cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [{"data": {"text/plain": ["108086205"]}, "execution_count": 22, "metadata": {}, "output_type": "execute_result"}], "source": ["## Without storing in memory\n", "cursor_sqlite.execute(\"SELECT id, content FROM tw_users LIMIT %s\" % tw_users_limit)\n", "tw_users_as_json = ctc.pluck(\"followers_count\", # Same as with 1, but using a key\n", "    ctc.map(json.loads, # map applies json.loads to every object\n", "        ctc.pluck(1, # The cursor returns each row as a tuple of columns\n", "            # pluck(1, _) is equivalent to (it[1] for it in _)\n", "            cursor_sqlite) ) )\n", "sum(tw_users_as_json)"]}, {"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["1 loops, best of 3: 5.99 \u00b5s per loop\n"]}], "source": ["cursor_sqlite.execute(\"SELECT id, content FROM tw_users LIMIT %s\" % tw_users_limit)\n", "%timeit -n 1 tw_users_as_json = ctc.pluck(\"followers_count\", ctc.map(json.loads, ctc.pluck(1, cursor_sqlite) ) ) "]}, {"cell_type": 
"code", "execution_count": 23, "metadata": {}, "outputs": [{"data": {"text/plain": ["108086205"]}, "execution_count": 24, "metadata": {}, "output_type": "execute_result"}], "source": ["## Without storing in memory\n", "def get_tw_users_as_json():\n", "    cursor_sqlite.execute(\"SELECT content FROM tw_users LIMIT %s\" % tw_users_limit)\n", "    return ctc.pluck(\"followers_count\", ctc.map(json.loads, ctc.pluck(0, cursor_sqlite) ) )\n", "sum(get_tw_users_as_json())"]}, {"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [{"data": {"text/plain": ["108086205"]}, "execution_count": 25, "metadata": {}, "output_type": "execute_result"}], "source": ["sum(get_tw_users_as_json())"]}, {"cell_type": "markdown", "metadata": {}, "source": ["A few examples:\n", "\n", " - count_all_followers_cyt() sums the followers \n", " - count_all_followers_cyt_by_location() sums them per distinct location (we will see later that this field, being raw text, would deserve specific processing)"]}, {"cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["1 loops, best of 3: 2.52 s per loop\n", "1 loops, best of 3: 2.75 s per loop\n"]}], "source": ["tw_users_limit = 1000000\n", "import ujson\n", "\n", "def get_users_cyt():\n", "    cursor_sqlite.execute(\"SELECT content FROM tw_users LIMIT %s\" % tw_users_limit)\n", "    return ct.map(ujson.loads, ct.pluck( 0, cursor_sqlite ) )\n", "\n", "def count_all_followers_cyt():\n", "    return sum( ct.pluck(\"followers_count\", get_users_cyt() ) )\n", "\n", "def count_all_followers_cyt_by_location():\n", "    return ct.reduceby( \"location\", lambda x, item: x + item[\"followers_count\"], get_users_cyt(), 0 )\n", "\n", "%timeit count_all_followers_cyt()\n", "%timeit count_all_followers_cyt_by_location()"]}, {"cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Memory 
used : 156 MB\n"]}], "source": ["memory_usage_psutil()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Their equivalent in standard code.\n", "Note that the functional version is not significantly faster."]}, {"cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["1 loops, best of 3: 2.66 s per loop\n", "1 loops, best of 3: 2.72 s per loop\n"]}], "source": ["from collections import defaultdict\n", "\n", "def count_all_followers():\n", "    cursor_sqlite.execute(\"SELECT content FROM tw_users LIMIT %s\" % tw_users_limit)\n", "    nb_totals_followers_id = 0\n", "    for it_json in cursor_sqlite:\n", "        nb_totals_followers_id += json.loads(it_json[0])[ \"followers_count\" ]\n", "    return nb_totals_followers_id\n", "\n", "def count_all_followers_by_location():\n", "    cursor_sqlite.execute(\"SELECT content FROM tw_users LIMIT %s\" % tw_users_limit)\n", "    res = defaultdict(int)\n", "    for it_json in cursor_sqlite:\n", "        it_json = json.loads(it_json[0])\n", "        res[it_json[\"location\"]] += it_json[ \"followers_count\" ]\n", "    return res\n", "\n", "%timeit count_all_followers()\n", "%timeit count_all_followers_by_location()"]}, {"cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["1000 loops, best of 3: 11.3 \u00b5s per loop\n", "100000 loops, best of 3: 16.6 \u00b5s per loop\n"]}], "source": ["cursor_sqlite.execute(\"SELECT content FROM tw_users LIMIT 10000\")\n", "%timeit -n1000 first_content = cursor_sqlite.fetchone()[0]\n", "cursor_sqlite.execute(\"SELECT content FROM tw_users LIMIT 10000\")\n", "first_content = cursor_sqlite.fetchone()[0]\n", "%timeit json.loads( first_content )"]}, {"cell_type": "markdown", "metadata": {"collapsed": true}, "source": ["## Cytoolz functions"]}, {"cell_type": "markdown", "metadata": {}, "source": ["[cytoolz](https://github.com/pytoolz/cytoolz) is a faster implementation 
of the [toolz](https://github.com/pytoolz/toolz/) library, so refer to the toolz documentation.\n", "\n", "http://toolz.readthedocs.org/en/latest/api.html\n", "\n", "Note that there are two modules, [cytoolz](https://github.com/pytoolz/cytoolz) and [cytoolz.curried](https://github.com/eriknw/cytoolz/blob/master/cytoolz/curried.py). They contain the same functions, but those of the second support \"currying\", i.e. partial evaluation (see below). This may add a small overhead."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### The basics"]}, {"cell_type": "markdown", "metadata": {}, "source": ["[cytoolz.curried.pluck](http://toolz.readthedocs.org/en/latest/api.html?highlight=pluck#toolz.itertoolz.pluck) => selects one item from each element of a sequence, given a key or an index \n", "[cytoolz.curried.map](http://toolz.readthedocs.org/en/latest/curry.html) => applies a function to every element of a sequence"]}, {"cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["108086205\n"]}], "source": ["import cytoolz as ct\n", "import cytoolz.curried as ctc \n", "\n", "cursor_sqlite.execute(\"SELECT id, content FROM tw_users LIMIT %s\" % tw_users_limit)\n", "a = ctc.pluck( 1, cursor_sqlite )\n", "b = ctc.map( json.loads, a )\n", "c = ctc.pluck(\"followers_count\", b)\n", "print( sum(c) )\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Note that all the functions of the cytoolz.curried module support partial evaluation, i.e. 
building a function of one argument from a function of two arguments (or more generally a function of *n-1* arguments from a function of *n*)"]}, {"cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["108086205\n"]}], "source": ["cursor_sqlite.execute(\"SELECT id, content FROM tw_users LIMIT %s\" % tw_users_limit)\n", "\n", "pl_1 = ctc.pluck(1) ## ctc.pluck takes 2 arguments, so pl_1 is a function of one argument\n", "m_loads = ctc.map(json.loads)\n", "pl_fc = ctc.pluck(\"followers_count\")\n", "\n", "a = pl_1( cursor_sqlite )\n", "b = m_loads(a)\n", "c = pl_fc(b)\n", "print( sum(c) )"]}, {"cell_type": "code", "execution_count": 31, "metadata": {"collapsed": true}, "outputs": [], "source": ["tw_users_limit = 10000"]}, {"cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [{"data": {"text/plain": ["4284281"]}, "execution_count": 33, "metadata": {}, "output_type": "execute_result"}], "source": ["cursor_sqlite.execute(\"SELECT id, content FROM tw_users LIMIT %s\" % tw_users_limit)\n", "sum( pl_fc( m_loads( pl_1 ( cursor_sqlite ) ) ) ) "]}, {"cell_type": "markdown", "metadata": {}, "source": ["[cytoolz.compose](http://toolz.readthedocs.org/en/latest/api.html?highlight=compose#toolz.functoolz.compose) builds a function by chaining functions. 
\n", "The result of each function is passed as the argument to the next one, so every function must take a single argument, hence the usefulness of partial evaluation.\n", "As in mathematics, the functions are evaluated from right to left \n", "\n", "``count_nb_followers( cursor_sqlite )`` is therefore equivalent to ``sum( pl_fc( get_json_seq( cursor_sqlite ) ) )``"]}, {"cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [{"data": {"text/plain": ["4284281"]}, "execution_count": 34, "metadata": {}, "output_type": "execute_result"}], "source": ["get_json_seq = ct.compose( m_loads, pl_1 )\n", "count_nb_followers = ct.compose( sum, pl_fc, get_json_seq )\n", "cursor_sqlite.execute(\"SELECT id, content FROM tw_users LIMIT %s\" % tw_users_limit)\n", "count_nb_followers( cursor_sqlite )"]}, {"cell_type": "markdown", "metadata": {}, "source": ["[cytoolz.pipe](http://toolz.readthedocs.org/en/latest/api.html?highlight=pipe#toolz.functoolz.pipe) behaves similarly, with one important difference: the functions are listed in the reverse order, after the data they are applied to (which makes it more readable, in my humble opinion)"]}, {"cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [{"data": {"text/plain": ["4284281"]}, "execution_count": 35, "metadata": {}, "output_type": "execute_result"}], "source": ["ct.pipe(\n", "    cursor_sqlite.execute(\"SELECT id, content FROM tw_users LIMIT %s\" % tw_users_limit),\n", "    pl_1,\n", "    m_loads,\n", "    pl_fc,\n", "    sum )\n"]}, {"cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["2951686\n", "1332595\n"]}], "source": ["cursor_sqlite.execute(\"SELECT id, content FROM tw_users LIMIT %s\" % tw_users_limit)\n", "print( count_nb_followers( ct.take_nth(2, cursor_sqlite ) ) ) # take_nth: take one element out of every n\n", "cursor_sqlite.execute(\"SELECT id, content FROM tw_users 
LIMIT %s\" % tw_users_limit)\n", "print( count_nb_followers( ct.take_nth(2, ct.drop(1, cursor_sqlite ) ) ) ) # drop: removes the first n elements"]}, {"cell_type": "markdown", "metadata": {}, "source": ["**cytoolz.take_nth** => takes one element out of every n \n", "**cytoolz.drop** => drops the first n elements"]}, {"cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [{"data": {"text/plain": ["10000"]}, "execution_count": 37, "metadata": {}, "output_type": "execute_result"}], "source": ["tw_users_limit"]}, {"cell_type": "markdown", "metadata": {}, "source": ["There are many functions, and a number of them overlap. \n", "For example, **countby** takes a function and a sequence and counts how many times each result of the function occurs over the elements of the sequence. This is equivalent to applying the function to every element of the sequence and then computing the frequencies of the results (done here with **pluck** and **frequencies**)"]}, {"cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["The slowest run took 7732.75 times longer than the fastest. This could mean that an intermediate result is being cached \n", "1 loops, best of 3: 46.6 \u00b5s per loop\n", "The slowest run took 46923.46 times longer than the fastest. This could mean that an intermediate result is being cached \n", "1 loops, best of 3: 5.56 \u00b5s per loop\n", "The slowest run took 36110.35 times longer than the fastest. 
This could mean that an intermediate result is being cached \n", "1 loops, best of 3: 8.55 \u00b5s per loop\n", "{1: 1603,\n", " 2: 107,\n", " 3: 32,\n", " 4: 23,\n", " 5: 16,\n", " 6: 7,\n", " 7: 4,\n", " 8: 1,\n", " 10: 4,\n", " 11: 1,\n", " 12: 1,\n", " 14: 2,\n", " 15: 1,\n", " 16: 1,\n", " 19: 1,\n", " 20: 1,\n", " 25: 1,\n", " 26: 1,\n", " 33: 1,\n", " 39: 1,\n", " 43: 1,\n", " 46: 1,\n", " 47: 1,\n", " 170: 1,\n", " 288: 1,\n", " 6959: 1}\n", "{'': 6959,\n", " 'Abidjan': 10,\n", " 'Bordeaux': 14,\n", " 'Bruxelles': 12,\n", " 'FRANCE': 26,\n", " 'France': 170,\n", " 'Lille': 33,\n", " 'London': 10,\n", " 'Lyon': 25,\n", " 'Marseille': 19,\n", " 'Montpellier': 11,\n", " 'Nantes': 14,\n", " 'Nice': 10,\n", " 'PARIS': 16,\n", " 'Paris': 288,\n", " 'Paris ': 15,\n", " 'Paris, France': 39,\n", " 'Paris, Ile-de-France': 46,\n", " 'Toulouse': 20,\n", " 'Tunisie': 10,\n", " 'france': 43,\n", " 'paris': 47}\n"]}], "source": ["import collections\n", "from operator import ge, le\n", "\n", "# Caution: the cursor is exhausted after the first %timeit run, so the reported best-of-3 times reflect an already consumed cursor\n", "cursor_sqlite.execute(\"SELECT id, content FROM tw_users LIMIT %s\" % tw_users_limit)\n", "%timeit -n1 ct.countby(ctc.get(\"location\"), get_json_seq( cursor_sqlite ) )\n", "\n", "cursor_sqlite.execute(\"SELECT id, content FROM tw_users LIMIT %s\" % tw_users_limit)\n", "%timeit -n1 ct.frequencies(ct.pluck(\"location\", get_json_seq( cursor_sqlite ) ) )\n", "\n", "def count_location_frequency(c):\n", "    counter = collections.Counter()\n", "    for it_json in c:\n", "        it_json = json.loads( it_json[1] )\n", "        counter[ it_json[\"location\"] ] += 1\n", "    return counter\n", "\n", "cursor_sqlite.execute(\"SELECT id, content FROM tw_users LIMIT %s\" % tw_users_limit)\n", "%timeit -n1 count_location_frequency(cursor_sqlite)\n", "\n", "get_freq_by_loc = ct.compose( ct.frequencies, ctc.pluck(\"location\"), get_json_seq )\n", "\n", "cursor_sqlite.execute(\"SELECT id, content FROM tw_users LIMIT %s\" % tw_users_limit)\n", "pprint.pprint( ct.frequencies( get_freq_by_loc(cursor_sqlite).values() ) )\n", 
"\n", "cursor_sqlite.execute(\"SELECT id, content FROM tw_users LIMIT %s\" % tw_users_limit)\n", "pprint.pprint( ct.valfilter( ct.curry(le,10), get_freq_by_loc(cursor_sqlite) ) )\n"]}, {"cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Memory used : 118 MB\n"]}], "source": ["memory_usage_psutil()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["In principle, it is preferable to choose the chain of functions that best separates the operations. Here **countby** does both at once (applying the function and counting the occurrences)."]}, {"cell_type": "markdown", "metadata": {}, "source": ["The last three functions we will look at are **reduce**, **reduceby** and **groupby**.\n", "Be careful with groupby: it builds a dictionary of lists of the input elements, so it forces all the data to be loaded into memory.\n", "\n", "**groupby** takes a key and a sequence, and groups the objects for which this key has the same value.\n", "It returns a dictionary whose keys are the values taken by the key (for instance the different values of \"location\" among the users) and whose values are the lists of objects having that value for the key."]}, {"cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [{"data": {"text/plain": ["{'chat': [{'age': 15, 'animal': 'chat', 'npm': 'Roudy'},\n", " {'age': 10, 'animal': 'chat', 'npm': 'Teemo'},\n", " {'age': 25, 'animal': 'chat', 'npm': 'Garfied'}],\n", " 'chien': [{'age': 5, 'animal': 'chien', 'npm': 'Medor'},\n", " {'age': 3, 'animal': 'chien', 'npm': 'Fluffy'},\n", " {'age': 2, 'animal': 'chien', 'npm': 'Max'}]}"]}, "execution_count": 40, "metadata": {}, "output_type": "execute_result"}], "source": ["liste_animaux = [ \n", " { \"animal\":\"chat\" , 
\"age\":15,\"npm\":\"Roudy\"}, \n", " { \"animal\":\"chien\" , \"age\": 5,\"npm\":\"Medor\"},\n", " { \"animal\":\"chien\" , \"age\": 3,\"npm\":\"Fluffy\"}, \n", " { \"animal\":\"chien\" , \"age\": 2,\"npm\":\"Max\"},\n", " { \"animal\":\"chat\" , \"age\":10,\"npm\":\"Teemo\"}, \n", " { \"animal\":\"chat\" , \"age\":25,\"npm\":\"Garfied\"} \n", "]\n", "\n", "ct.groupby( \"animal\", liste_animaux )"]}, {"cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["'' : 69975\n", "'Communaut\u00e9 Valencienne, Espagne' : 1\n", "'Coral Springs, Fl' : 1\n", "'Abbiategrasso' : 1\n", "'Getafe - Bordeaux - Vigo' : 1\n", "'Piscop, France' : 1\n", "'Roma - Oslo' : 1\n", "'tarbes' : 2\n", "'Samoa' : 1\n", "'E\u2661' : 1\n", "'Epernay sous gevrey' : 1\n", "'Porto-Vecchio, Corse' : 2\n", "'Plan\u00e8te Terre\u00ae' : 1\n", "'donzere' : 1\n", "'Laval, take it or leave it' : 1\n", "'Albi - Bordeaux ' : 1\n", "'Vannes ' : 1\n", "'Paris - Strasbourg' : 2\n", "'Itabashi-ku, Tokyo' : 1\n", "'Paris, Bilbao, Dieppe' : 1\n", "'hello' : 1\n", "'Francfort-sur-le-Main, Hesse' : 1\n", "'Issy Les Moulineaux' : 1\n", "'montgenost' : 1\n", "'France/Toulouse' : 1\n", "'UAE DUBAI' : 1\n", "'Paris ~ Somewhere' : 1\n", "'Vezin le Coquet' : 1\n", "'\u062a\u0627\u0632\u0629 \u0627\u0644\u0645\u063a\u0631\u0628' : 1\n", "'Paris 16' : 1\n", "'Senegal' : 29\n", "'Paris XV\u00e8me' : 1\n", "'tunis' : 29\n", "'st cyprien plage' : 1\n", "'Dhaka,Bangladesh' : 1\n", "'Sa\u00f4ne-et-Loire (71)' : 1\n", "'panama' : 1\n", "'63720 chappes' : 1\n", "'Poueyferr\u00e9' : 1\n", "'Yaound\u00e9-Cameroon' : 1\n", "'Maroc,Mekn\u00e8s' : 1\n", "'Bucarest' : 2\n", "'france strasbourg' : 1\n", "'dans un monde loin du v\u00f4tre' : 1\n", "'nantes/paris' : 1\n", "'Elassona, Greece' : 1\n", "'San Francisco Bay Area' : 3\n", "'Vannes, Bretagne' : 1\n", "'AIX / MARSEILLE / PACA' : 1\n", "'Castellon,Espa\u00f1a' : 1\n", "'Roque perez' : 1\n"]}], "source": 
["tw_users_limit = 100000\n", "cursor_sqlite.execute(\"SELECT id, content FROM tw_users LIMIT %s\" % tw_users_limit)\n", "for i, (k, v) in enumerate( ct.valmap( ct.count, ct.groupby( \"location\", get_json_seq( cursor_sqlite ) ) ).items() ):\n", "    print(repr(k) + \" : \" + repr(v))\n", "    if i == 50:\n", "        break"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Note that the usual operators (+, \\*, etc.) are available as functions in the operator module"]}, {"cell_type": "markdown", "metadata": {}, "source": ["**reduce** applies a function to the first two elements of a sequence (or to an initial value and the first element), then applies it again to the running result and the next element, and so on."]}, {"cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["15\n", "120\n"]}], "source": ["from operator import add, mul\n", "print( ct.reduce( add, [1,2,3,4,5] ) ) ## computes add(1,2), then add(_, 3), add(_, 4), etc.\n", "print( ct.reduce( mul, [1,2,3,4,5] ) )"]}, {"cell_type": "markdown", "metadata": {}, "source": ["As a consequence, if your result is not of the same type as your elements, the syntax above will not work; an initial value must then be supplied.\n", "\n", "In that case the reduction function is applied as follows:\n", "\n", "1. f(initial\\_value, first\\_element)\n", "1. f(previous\\_result, second\\_element)\n", "1. 
f(previous\\_result, third\\_element)"]}, {"cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [{"data": {"text/plain": ["4284281"]}, "execution_count": 43, "metadata": {}, "output_type": "execute_result"}], "source": ["tw_users_limit = 10000\n", "cursor_sqlite.execute(\"SELECT id, content FROM tw_users LIMIT %s\" % tw_users_limit)\n", "ct.reduce((lambda total,elt: total + elt[\"followers_count\"]), # reduction function\n", "    get_json_seq( cursor_sqlite ), # sequence to reduce\n", "    0 # initial value\n", "    )\n", " "]}, {"cell_type": "markdown", "metadata": {}, "source": ["reduceby does the same thing, with an additional grouping by a criterion.\n", "The code below computes the total number of followers per location, and keeps only the totals of at least 10000."]}, {"cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["The slowest run took 52894.00 times longer than the fastest. 
This could mean that an intermediate result is being cached \n", "1 loops, best of 3: 5.56 \u00b5s per loop\n"]}, {"data": {"text/plain": ["{'': 476234,\n", " 'Barcelona (Spain)': 13773,\n", " 'Beijing China': 19296,\n", " 'Conscience': 34254,\n", " 'France': 243077,\n", " 'Futuroscope 86': 10888,\n", " 'Islamic Republic of Iran': 72745,\n", " 'Lens,LOSC,VA,USBCO,Reims,ESTAC': 16054,\n", " 'Lib\u00e9rateur enracin\u00e9': 19987,\n", " 'London, UK': 37646,\n", " 'Longueuil, Qu\u00e9bec': 43522,\n", " 'Melun City': 18401,\n", " 'Paris': 1506591,\n", " 'Paris /France': 251205,\n", " 'Paris, France': 45841,\n", " 'Plein Sud': 17565,\n", " 'Poitiers, Vienne (86)': 102472,\n", " 'ROUEN (76)FRANCE': 10757,\n", " 'Rosslyn, Va.': 278088,\n", " 'St-Raymond on the Beach': 44569,\n", " 'Tunisia': 16602,\n", " 'Worldwide': 11040,\n", " 'http://www.13or-du-hiphop.fr': 22143,\n", " 'paris ': 366667}"]}, "execution_count": 44, "metadata": {}, "output_type": "execute_result"}], "source": ["from operator import le\n", "\n", "tw_users_limit = 10000\n", "cursor_sqlite.execute(\"SELECT id, content FROM tw_users LIMIT %s\" % tw_users_limit)\n", "%timeit -n1 ct.reduceby( \"location\", lambda x,y: x + y[\"followers_count\"], get_json_seq( cursor_sqlite ), 0 )\n", "cursor_sqlite.execute(\"SELECT id, content FROM tw_users LIMIT %s\" % tw_users_limit)\n", "ct.valfilter(ct.curry(le,10000), ## keeps only the entries whose value is at least 10000, since le(10000, v) means 10000 <= v\n", "    ct.reduceby( \"location\", \n", "        lambda x,y: x + y[\"followers_count\"], \n", "        get_json_seq( cursor_sqlite ), \n", "        0 ))"]}, {"cell_type": "code", "execution_count": 44, "metadata": {"collapsed": true}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", 
"pygments_lexer": "ipython3", "version": "3.6.1"}}, "nbformat": 4, "nbformat_minor": 2}