{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# PIG and Parameters on Azure - exercise\n", "\n", "Manipulating JSON data with Map/Reduce using [PIG](https://pig.apache.org/) on [HDInsight](https://azure.microsoft.com/en-us/services/hdinsight/)."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["Plan\n", "
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Parameters\n", "\n", "Websites produce data continuously. The same script is frequently reused to process the data of one day, of the next day, of the day after... Every day, we want to retrieve the previous day's traffic. The only thing that changes is the date of the data to process. Rather than copying a whole script to change a date that sometimes appears in several places, it is better to write a script in which the date appears as a variable.\n", "\n", "This notebook illustrates the approach by building a histogram. The script's parameter is the width of the histogram bars (the [bin](http://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width) width)."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Connecting to the cluster\n", "\n", "We use the [Cloudera](http://www.xavierdupre.fr/app/ensae_teaching_cs/helpsphinx/notebooks/td3a_cenonce_session6.html#p2) cluster. This script must be run so that the notebook knows the ``params`` variable exists."]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/html": ["
"], "text/plain": [" name last_modified \\\n", "0 xavierdupre/random/random.sample.txt Thu, 27 Nov 2014 23:21:26 GMT \n", "\n", " content_type content_length blob_type \n", "0 application/octet-stream 202619 BlockBlob "]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["%blob_ls /$PSEUDO/random"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## PIG et param\u00e8tres"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On indique un param\u00e8tre par le symbole : ``$bins``. La valeur du param\u00e8tre est pass\u00e9 sous forme de cha\u00eene de caract\u00e8res au script et remplac\u00e9e telle quelle dans le script. Il en va de m\u00eame des constantes d\u00e9clar\u00e9es gr\u00e2ce au mot-cl\u00e9 [%declare](https://pig.apache.org/docs/r0.11.1/cont.html#Examples-N1060D).\n", "\n", "La sortie du script inclut le param\u00e8tre : cela permet de retrouver comment ces donn\u00e9es ont \u00e9t\u00e9 g\u00e9n\u00e9r\u00e9es."]}, {"cell_type": "code", "execution_count": 7, "metadata": {"collapsed": true}, "outputs": [], "source": ["%%PIG histogram.pig\n", "\n", "values = LOAD '$CONTAINER/$PSEUDO/random/random.sample.txt' USING PigStorage('\\t') AS (x:double);\n", "\n", "values_h = FOREACH values GENERATE x, ((int)(x / $bins)) * $bins AS h ;\n", "\n", "hist_group = GROUP values_h BY h ;\n", "\n", "hist = FOREACH hist_group GENERATE group, COUNT(values_h) AS nb ;\n", "\n", "STORE hist INTO '$CONTAINER/$PSEUDO/random/histo_$bins.txt' USING PigStorage('\\t') ;"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Pour supprimer les pr\u00e9c\u00e9dents r\u00e9sultats :"]}, {"cell_type": "code", "execution_count": 8, "metadata": {"collapsed": true}, "outputs": [], "source": ["if client.exists(bs, client.account_name, \"$PSEUDO/random/histo_0.1.txt\"):\n", " r = client.delete_folder (bs, client.account_name, \"$PSEUDO/random/histo_0.1.txt\")\n", " print(r) "]}, {"cell_type": "markdown", "metadata": {}, "source": ["On 
ex\u00e9cute le job. Comme la commande magique supportant les param\u00e8tres n'existe pas encore, il faut utiliser la variable ``client`` et sa m\u00e9thode [pig_submit](http://www.xavierdupre.fr/app/pyensae/helpsphinx/pyensae/remote/ssh_remote_connection.html?highlight=pig_submit#pyensae.remote.ssh_remote_connection.ASSHClient.pig_submit) qui fait la m\u00eame chose. Elle upload le script puis le soumet."]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"data": {"text/plain": ["{'id': 'job_1416874839254_0101'}"]}, "execution_count": 10, "metadata": {}, "output_type": "execute_result"}], "source": ["jid = client.pig_submit(bs, client.account_name, \"histogram.pig\", params = dict(bins=\"0.1\"), stop_on_failure=True )"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"data": {"text/plain": ["('job_1416874839254_0101', '100% complete', True)"]}, "execution_count": 11, "metadata": {}, "output_type": "execute_result"}], "source": ["st = %hd_job_status jid[\"id\"]\n", "st[\"id\"],st[\"percentComplete\"],st[\"status\"][\"jobComplete\"]"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", "Job DAG:\n", "job_1416874839254_0102\n", "\n", "\n", "2014-11-27 23:29:00,446 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - No FileSystem for scheme: wasb. Not creating success file\n", "2014-11-27 23:29:00,446 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at headnodehost/100.74.20.101:9010\n", "2014-11-27 23:29:00,525 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server\n", "2014-11-27 23:29:03,900 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!\n", "\n", "
"], "text/plain": [""]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}], "source": ["%hd_tail_stderr jid[\"id\"]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On v\u00e9rifie que tout s'est bien pass\u00e9. La taille devrait \u00eatre \u00e9quivalent \u00e0 l'entr\u00e9e."]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/html": ["