{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Map/Reduce avec PIG sur Azure - correction"]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["Plan\n", "
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Donn\u00e9es\n", "\n", "On consid\u00e8re le jeu de donn\u00e9es suivant : [Localization Data for Person Activity Data Set](https://archive.ics.uci.edu/ml/datasets/Localization+Data+for+Person+Activity) qu'on r\u00e9cup\u00e8re comme indiqu\u00e9 dans le notebook de l'\u00e9nonc\u00e9."]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["from pyquickhelper.ipythonhelper import open_html_form\n", "params={\"blob_storage\":\"\", \"password1\":\"\", \"hadoop_server\":\"\", \"password2\":\"\", \"username\":\"xavierdupre\"}\n", "open_html_form(params=params,title=\"server + hadoop + credentials\", key_save=\"blobhp\")"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": ["blobstorage = blobhp[\"blob_storage\"]\n", "blobpassword = blobhp[\"password1\"]\n", "hadoop_server = blobhp[\"hadoop_server\"]\n", "hadoop_password = blobhp[\"password2\"]\n", "username = blobhp[\"username\"]"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/plain": ["(,\n", " )"]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["import pyensae\n", "%load_ext pyensae\n", "%load_ext pyenbc\n", "%hd_open"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercice 1 : GROUP BY"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
activitynb
0falling2973
1lying54480
2lying down6168
3on all fours5210
4sitting27244
\n", "
"], "text/plain": [" activity nb\n", "0 falling 2973\n", "1 lying 54480\n", "2 lying down 6168\n", "3 on all fours 5210\n", "4 sitting 27244"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["import pandas, sqlite3\n", "con = sqlite3.connect(\"ConfLongDemo_JSI.db3\")\n", "df = pandas.read_sql(\"\"\"SELECT activity, count(*) as nb FROM person GROUP BY activity\"\"\", con)\n", "con.close()\n", "df.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On v\u00e9rifie que le fichier qu'on veut traiter est bien l\u00e0 :"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namelast_modifiedcontent_typecontent_lengthblob_type
0testensae/ConfLongDemo_JSI.small.txtThu, 29 Oct 2015 00:23:00 GMTapplication/octet-stream132727BlockBlob
\n", "
"], "text/plain": [" name last_modified \\\n", "0 testensae/ConfLongDemo_JSI.small.txt Thu, 29 Oct 2015 00:23:00 GMT \n", "\n", " content_type content_length blob_type \n", "0 application/octet-stream 132727 BlockBlob "]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["%blob_ls /testensae/ConfLongDemo_JSI.small.txt"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Il faut maintenant le faire avec PIG."]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": ["%%PIG_azure solution_groupby.pig\n", "\n", "myinput = LOAD '$CONTAINER/testensae/ConfLongDemo_JSI.small.txt' \n", " using PigStorage(',') \n", " AS (index:long, sequence, tag, timestamp:long, dateformat, x:double,y:double, z:double, activity) ;\n", "\n", "gr = GROUP myinput BY activity ;\n", "avgact = FOREACH gr GENERATE group, COUNT(myinput) ; \n", "\n", "STORE avgact INTO '$CONTAINER/$PSEUDO/testensae/ConfLongDemo_JSI.small.group.2015.txt' USING PigStorage() ;"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On soumet le job :"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/plain": ["{'id': 'job_1445989166328_0009'}"]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["jid = %hd_pig_submit solution_groupby.pig\n", "jid"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On v\u00e9rifie le status du job :"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"data": {"text/plain": ["('job_1445989166328_0009', '100% complete', None, False, 'RUNNING')"]}, "execution_count": 10, "metadata": {}, "output_type": "execute_result"}], "source": ["st = %hd_job_status jid[\"id\"]\n", "st[\"id\"],st[\"percentComplete\"],st[\"completed\"],st[\"status\"][\"jobComplete\"],st[\"status\"][\"state\"]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On regarde si la compilation s'est bien pass\u00e9e :"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", "Job DAG:\n", "job_1445989166328_0010\n", "\n", "\n", "2015-10-29 00:55:14,395 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://headnodehost:8188/ws/v1/timeline/\n", "2015-10-29 00:55:14,395 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at headnodehost/100.89.100.164:9010\n", "2015-10-29 00:55:14,395 [main] INFO  org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at headnodehost/100.89.100.164:10200\n", "2015-10-29 00:55:14,473 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server\n", "2015-10-29 00:55:14,676 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://headnodehost:8188/ws/v1/timeline/\n", "2015-10-29 00:55:14,676 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at headnodehost/100.89.100.164:9010\n", "2015-10-29 00:55:14,676 [main] INFO  org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at headnodehost/100.89.100.164:10200\n", "2015-10-29 00:55:14,754 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server\n", "2015-10-29 00:55:14,957 [main] INFO  org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://headnodehost:8188/ws/v1/timeline/\n", "2015-10-29 00:55:14,957 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at headnodehost/100.89.100.164:9010\n", "2015-10-29 00:55:14,957 [main] INFO  org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at headnodehost/100.89.100.164:10200\n", "2015-10-29 00:55:15,020 [main] INFO  org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server\n", "2015-10-29 00:55:15,082 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!\n", "2015-10-29 00:55:15,113 [main] INFO  org.apache.pig.Main - Pig script completed in 49 seconds and 706 milliseconds (49706 ms)\n", "\n", "

"], "text/plain": [""]}, "execution_count": 11, "metadata": {}, "output_type": "execute_result"}], "source": ["%hd_tail_stderr jid[\"id\"]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On regarde le contenu du r\u00e9pertoire sur le blob storage :"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"text/plain": ["['xavierdupre/testensae',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.group.2015.txt',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.group.2015.txt/_SUCCESS',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.group.2015.txt/part-r-00000',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.group.join.txt',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.group.join.txt/_SUCCESS',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.group.join.txt/part-r-00000',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.group.txt',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.group.txt/_SUCCESS',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.group.txt/part-r-00000',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.keep_walking.txt',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.keep_walking.txt/_SUCCESS',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.keep_walking.txt/part-m-00000',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.walking2015.txt',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.walking2015.txt/_SUCCESS',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.walking2015.txt/part-m-00000',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.walking_2015.txt',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.walking_2015.txt/_SUCCESS',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.walking_2015.txt/part-m-00000']"]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}], "source": ["df=%blob_ls /$PSEUDO/testensae\n", "list(df[\"name\"])"]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/plain": ["'results.group.2015.txt'"]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["import os\n", "if os.path.exists(\"results.group.2015.xt\") : os.remove(\"results.group.2015.txt\")\n", "%blob_downmerge /$PSEUDO/testensae/ConfLongDemo_JSI.small.group.2015.txt results.group.2015.txt"]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
directorylast_modifiednamesize
0False2015-10-29 01:56:11.025867.\\results.group.2015.txt89
1False2015-10-29 01:46:45.425028.\\results.txt21.65 Kb
2False2015-10-29 01:46:46.705466.\\results_allfiles.txt21.65 Kb
\n", "
"], "text/plain": [" directory last_modified name size\n", "0 False 2015-10-29 01:56:11.025867 .\\results.group.2015.txt 89\n", "1 False 2015-10-29 01:46:45.425028 .\\results.txt 21.65 Kb\n", "2 False 2015-10-29 01:46:46.705466 .\\results_allfiles.txt 21.65 Kb"]}, "execution_count": 14, "metadata": {}, "output_type": "execute_result"}], "source": ["%lsr res.*[.]txt"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "lying\t267\n", "falling\t30\n", "sitting\t435\n", "walking\t170\n", "sitting down\t56\n", "standing up from sitting\t42\n", "\n", "
"], "text/plain": [""]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["%head results.group.2015.txt"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercice 2 : JOIN"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
indexsequencetagtimestampdateformatxyzactivitynb
00A01010-000-024-03363379022605128032927.05.2009 14:03:25:1274.0629311.8924340.507425walking32710
11A01020-000-033-11163379022605182091327.05.2009 14:03:25:1834.2919541.7811401.344495walking32710
22A01020-000-032-22163379022605209120527.05.2009 14:03:25:2104.3591011.8264560.968821walking32710
33A01010-000-024-03363379022605236149827.05.2009 14:03:25:2374.0878351.8799990.466983walking32710
44A01010-000-030-09663379022605263179227.05.2009 14:03:25:2634.3244622.0724600.488065walking32710
\n", "
"], "text/plain": [" index sequence tag timestamp \\\n", "0 0 A01 010-000-024-033 633790226051280329 \n", "1 1 A01 020-000-033-111 633790226051820913 \n", "2 2 A01 020-000-032-221 633790226052091205 \n", "3 3 A01 010-000-024-033 633790226052361498 \n", "4 4 A01 010-000-030-096 633790226052631792 \n", "\n", " dateformat x y z activity nb \n", "0 27.05.2009 14:03:25:127 4.062931 1.892434 0.507425 walking 32710 \n", "1 27.05.2009 14:03:25:183 4.291954 1.781140 1.344495 walking 32710 \n", "2 27.05.2009 14:03:25:210 4.359101 1.826456 0.968821 walking 32710 \n", "3 27.05.2009 14:03:25:237 4.087835 1.879999 0.466983 walking 32710 \n", "4 27.05.2009 14:03:25:263 4.324462 2.072460 0.488065 walking 32710 "]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["con = sqlite3.connect(\"ConfLongDemo_JSI.db3\")\n", "df = pandas.read_sql(\"\"\"SELECT person.*, A.nb FROM person INNER JOIN (\n", " SELECT activity, count(*) as nb FROM person GROUP BY activity) AS A\n", " ON person.activity == A.activity\"\"\", con)\n", "con.close()\n", "df.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Idem, maintenant il faut le faire avec PIG."]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": ["%%PIG_azure solution_groupby_join.pig\n", "\n", "myinput = LOAD '$CONTAINER/testensae/ConfLongDemo_JSI.small.txt' \n", " using PigStorage(',') \n", " AS (index:long, sequence, tag, timestamp:long, dateformat, x:double,y:double, z:double, activity) ;\n", "\n", "gr = GROUP myinput BY activity ;\n", "avgact = FOREACH gr GENERATE group, COUNT(myinput) ; \n", "\n", "joined = JOIN myinput BY activity, avgact BY group ;\n", "\n", "STORE joined INTO '$CONTAINER/$PSEUDO/testensae/ConfLongDemo_JSI.small.group.join.2015.txt' USING PigStorage() ;"]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [{"data": {"text/plain": ["{'id': 'job_1445989166328_0011'}"]}, "execution_count": 18, "metadata": {}, "output_type": "execute_result"}], "source": ["jid = %hd_pig_submit solution_groupby_join.pig\n", "jid"]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"data": {"text/plain": ["('job_1445989166328_0011',\n", " '100% complete',\n", " 'done',\n", " True,\n", " 'SUCCEEDED',\n", " 'wasb://hdblobstorage@hdblobstorage.blob.core.windows.net/xavierdupre/scripts/pig/solution_groupby_join.pig')"]}, "execution_count": 19, "metadata": {}, "output_type": "execute_result"}], "source": ["st = %hd_job_status jid[\"id\"]\n", "st[\"id\"],st[\"percentComplete\"],st[\"completed\"],st[\"status\"][\"jobComplete\"],st[\"status\"][\"state\"], st[\"userargs\"][\"file\"]"]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namelast_modifiedcontent_typecontent_lengthblob_type
0xavierdupre/testensaeTue, 25 Nov 2014 00:50:34 GMTapplication/octet-stream0BlockBlob
1xavierdupre/testensae/ConfLongDemo_JSI.small.g...Thu, 29 Oct 2015 00:55:09 GMT0BlockBlob
2xavierdupre/testensae/ConfLongDemo_JSI.small.g...Thu, 29 Oct 2015 00:55:09 GMTapplication/octet-stream0BlockBlob
3xavierdupre/testensae/ConfLongDemo_JSI.small.g...Thu, 29 Oct 2015 00:55:08 GMTapplication/octet-stream89BlockBlob
4xavierdupre/testensae/ConfLongDemo_JSI.small.g...Thu, 29 Oct 2015 00:58:43 GMT0BlockBlob
5xavierdupre/testensae/ConfLongDemo_JSI.small.g...Thu, 29 Oct 2015 00:58:43 GMTapplication/octet-stream0BlockBlob
6xavierdupre/testensae/ConfLongDemo_JSI.small.g...Thu, 29 Oct 2015 00:58:42 GMTapplication/octet-stream144059BlockBlob
7xavierdupre/testensae/ConfLongDemo_JSI.small.g...Tue, 25 Nov 2014 01:16:11 GMT0BlockBlob
8xavierdupre/testensae/ConfLongDemo_JSI.small.g...Tue, 25 Nov 2014 01:16:11 GMTapplication/octet-stream0BlockBlob
9xavierdupre/testensae/ConfLongDemo_JSI.small.g...Tue, 25 Nov 2014 01:16:10 GMTapplication/octet-stream144059BlockBlob
10xavierdupre/testensae/ConfLongDemo_JSI.small.g...Tue, 25 Nov 2014 01:12:49 GMT0BlockBlob
11xavierdupre/testensae/ConfLongDemo_JSI.small.g...Tue, 25 Nov 2014 01:12:49 GMTapplication/octet-stream0BlockBlob
12xavierdupre/testensae/ConfLongDemo_JSI.small.g...Tue, 25 Nov 2014 01:12:49 GMTapplication/octet-stream89BlockBlob
13xavierdupre/testensae/ConfLongDemo_JSI.small.k...Tue, 25 Nov 2014 00:50:45 GMT0BlockBlob
14xavierdupre/testensae/ConfLongDemo_JSI.small.k...Tue, 25 Nov 2014 00:50:46 GMTapplication/octet-stream0BlockBlob
15xavierdupre/testensae/ConfLongDemo_JSI.small.k...Tue, 25 Nov 2014 00:50:45 GMTapplication/octet-stream22166BlockBlob
16xavierdupre/testensae/ConfLongDemo_JSI.small.w...Thu, 29 Oct 2015 00:28:30 GMT0BlockBlob
17xavierdupre/testensae/ConfLongDemo_JSI.small.w...Thu, 29 Oct 2015 00:28:30 GMTapplication/octet-stream0BlockBlob
18xavierdupre/testensae/ConfLongDemo_JSI.small.w...Thu, 29 Oct 2015 00:28:30 GMTapplication/octet-stream22166BlockBlob
19xavierdupre/testensae/ConfLongDemo_JSI.small.w...Thu, 29 Oct 2015 00:46:05 GMT0BlockBlob
20xavierdupre/testensae/ConfLongDemo_JSI.small.w...Thu, 29 Oct 2015 00:46:05 GMTapplication/octet-stream0BlockBlob
21xavierdupre/testensae/ConfLongDemo_JSI.small.w...Thu, 29 Oct 2015 00:46:04 GMTapplication/octet-stream22166BlockBlob
\n", "
"], "text/plain": [" name \\\n", "0 xavierdupre/testensae \n", "1 xavierdupre/testensae/ConfLongDemo_JSI.small.g... \n", "2 xavierdupre/testensae/ConfLongDemo_JSI.small.g... \n", "3 xavierdupre/testensae/ConfLongDemo_JSI.small.g... \n", "4 xavierdupre/testensae/ConfLongDemo_JSI.small.g... \n", "5 xavierdupre/testensae/ConfLongDemo_JSI.small.g... \n", "6 xavierdupre/testensae/ConfLongDemo_JSI.small.g... \n", "7 xavierdupre/testensae/ConfLongDemo_JSI.small.g... \n", "8 xavierdupre/testensae/ConfLongDemo_JSI.small.g... \n", "9 xavierdupre/testensae/ConfLongDemo_JSI.small.g... \n", "10 xavierdupre/testensae/ConfLongDemo_JSI.small.g... \n", "11 xavierdupre/testensae/ConfLongDemo_JSI.small.g... \n", "12 xavierdupre/testensae/ConfLongDemo_JSI.small.g... \n", "13 xavierdupre/testensae/ConfLongDemo_JSI.small.k... \n", "14 xavierdupre/testensae/ConfLongDemo_JSI.small.k... \n", "15 xavierdupre/testensae/ConfLongDemo_JSI.small.k... \n", "16 xavierdupre/testensae/ConfLongDemo_JSI.small.w... \n", "17 xavierdupre/testensae/ConfLongDemo_JSI.small.w... \n", "18 xavierdupre/testensae/ConfLongDemo_JSI.small.w... \n", "19 xavierdupre/testensae/ConfLongDemo_JSI.small.w... \n", "20 xavierdupre/testensae/ConfLongDemo_JSI.small.w... \n", "21 xavierdupre/testensae/ConfLongDemo_JSI.small.w... \n", "\n", " last_modified content_type content_length \\\n", "0 Tue, 25 Nov 2014 00:50:34 GMT application/octet-stream 0 \n", "1 Thu, 29 Oct 2015 00:55:09 GMT 0 \n", "2 Thu, 29 Oct 2015 00:55:09 GMT application/octet-stream 0 \n", "3 Thu, 29 Oct 2015 00:55:08 GMT application/octet-stream 89 \n", "4 Thu, 29 Oct 2015 00:58:43 GMT 0 \n", "5 Thu, 29 Oct 2015 00:58:43 GMT application/octet-stream 0 \n", "6 Thu, 29 Oct 2015 00:58:42 GMT application/octet-stream 144059 \n", "7 Tue, 25 Nov 2014 01:16:11 GMT 0 \n", "8 Tue, 25 Nov 2014 01:16:11 GMT application/octet-stream 0 \n", "9 Tue, 25 Nov 2014 01:16:10 GMT application/octet-stream 144059 \n", "10 Tue, 25 Nov 2014 01:12:49 GMT 0 \n", "11 Tue, 25 Nov 2014 01:12:49 GMT application/octet-stream 0 \n", "12 Tue, 25 Nov 2014 01:12:49 GMT application/octet-stream 89 \n", "13 Tue, 25 Nov 2014 00:50:45 GMT 0 \n", "14 Tue, 25 Nov 2014 00:50:46 GMT application/octet-stream 0 \n", "15 Tue, 25 Nov 2014 00:50:45 GMT application/octet-stream 22166 \n", "16 Thu, 29 Oct 2015 00:28:30 GMT 0 \n", "17 Thu, 29 Oct 2015 00:28:30 GMT application/octet-stream 0 \n", "18 Thu, 29 Oct 2015 00:28:30 GMT application/octet-stream 22166 \n", "19 Thu, 29 Oct 2015 00:46:05 GMT 0 \n", "20 Thu, 29 Oct 2015 00:46:05 GMT application/octet-stream 0 \n", "21 Thu, 29 Oct 2015 00:46:04 GMT application/octet-stream 22166 \n", "\n", " blob_type \n", "0 BlockBlob \n", "1 BlockBlob \n", "2 BlockBlob \n", "3 BlockBlob \n", "4 BlockBlob \n", "5 BlockBlob \n", "6 BlockBlob \n", "7 BlockBlob \n", "8 BlockBlob \n", "9 BlockBlob \n", "10 BlockBlob \n", "11 BlockBlob \n", "12 BlockBlob \n", "13 BlockBlob \n", "14 BlockBlob \n", "15 BlockBlob \n", "16 BlockBlob \n", "17 BlockBlob \n", "18 BlockBlob \n", "19 BlockBlob \n", "20 BlockBlob \n", "21 BlockBlob "]}, "execution_count": 20, "metadata": {}, "output_type": "execute_result"}], "source": ["df=%blob_ls /$PSEUDO/testensae\n", "df"]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"data": {"text/plain": ["{'xavierdupre/testensae',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.group.2015.txt',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.group.2015.txt/_SUCCESS',\n", " 
'xavierdupre/testensae/ConfLongDemo_JSI.small.group.2015.txt/part-r-00000',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.group.join.2015.txt',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.group.join.2015.txt/_SUCCESS',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.group.join.2015.txt/part-r-00000',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.group.join.txt',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.group.join.txt/_SUCCESS',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.group.join.txt/part-r-00000',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.group.txt',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.group.txt/_SUCCESS',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.group.txt/part-r-00000',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.keep_walking.txt',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.keep_walking.txt/_SUCCESS',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.keep_walking.txt/part-m-00000',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.walking2015.txt',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.walking2015.txt/_SUCCESS',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.walking2015.txt/part-m-00000',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.walking_2015.txt',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.walking_2015.txt/_SUCCESS',\n", " 'xavierdupre/testensae/ConfLongDemo_JSI.small.walking_2015.txt/part-m-00000'}"]}, "execution_count": 21, "metadata": {}, "output_type": "execute_result"}], "source": ["set(df.name)"]}, {"cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [{"data": {"text/plain": ["'results.join.2015.txt'"]}, "execution_count": 22, "metadata": {}, "output_type": "execute_result"}], "source": ["if os.path.exists(\"results.join.2015.txt\") : os.remove(\"results.join.2015.txt\")\n", "%blob_downmerge /$PSEUDO/testensae/ConfLongDemo_JSI.small.group.join.2015.txt results.join.2015.txt"]}, {"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "999\tA01\t010-000-024-033\t633790226379871138\t27.05.2009 14:03:57:987\t3.198556661605835\t1.1257659196853638\t0.3567752242088318\tlying\tlying\t267\n", "998\tA01\t020-000-032-221\t633790226379600847\t27.05.2009 14:03:57:960\t4.3730292320251465\t1.3821170330047607\t0.38861045241355896\tlying\tlying\t267\n", "997\tA01\t020-000-033-111\t633790226379330550\t27.05.2009 14:03:57:933\t4.7574005126953125\t1.285519003868103\t-0.08946932852268219\tlying\tlying\t267\n", "996\tA01\t010-000-030-096\t633790226379060251\t27.05.2009 14:03:57:907\t3.182415008544922\t1.1020996570587158\t0.29104289412498474\tlying\tlying\t267\n", "995\tA01\t010-000-024-033\t633790226378789954\t27.05.2009 14:03:57:880\t3.0784008502960205\t1.0197675228118896\t0.6061218976974487\tlying\tlying\t267\n", "994\tA01\t020-000-032-221\t633790226378519655\t27.05.2009 14:03:57:853\t4.36382532119751\t1.4307395219802856\t0.3206148743629456\tlying\tlying\t267\n", "993\tA01\t010-000-024-033\t633790226377708776\t27.05.2009 14:03:57:770\t3.0621800422668457\t1.0790562629699707\t0.6795752048492432\tlying\tlying\t267\n", "992\tA01\t020-000-032-221\t633790226377438480\t27.05.2009 14:03:57:743\t4.371500492095946\t1.4781558513641355\t0.5384233593940735\tlying\tlying\t267\n", "991\tA01\t020-000-033-111\t633790226377168187\t27.05.2009 14:03:57:717\t4.918898105621338\t1.1530661582946775\t0.19635945558547974\tlying\tlying\t267\n", "990\tA01\t010-000-030-096\t633790226376897895\t27.05.2009 14:03:57:690\t3.208510637283325\t1.1156394481658936\t0.3381773829460144\tlying\tlying\t267\n", "\n", "
"], "text/plain": [""]}, "execution_count": 23, "metadata": {}, "output_type": "execute_result"}], "source": ["%head results.join.2015.txt"]}, {"cell_type": "markdown", "metadata": {}, "source": ["

## Prolongements

[PIG](http://pig.apache.org/) n'est pas la seule façon d'exécuter des jobs Map/Reduce. [Hive](https://hive.apache.org/) est un langage dont la syntaxe est très proche de celle du SQL. L'article [Comparing Pig Latin and SQL for Constructing Data Processing Pipelines](https://developer.yahoo.com/blogs/hadoop/comparing-pig-latin-sql-constructing-data-processing-pipelines-444.html) explicite les différences entre les deux approches.

**langage haut niveau**

Ce qu'il faut retenir, c'est que PIG est un langage de haut niveau. Le programme est compilé en une séquence d'opérations Map/Reduce transparente pour l'utilisateur. Le temps de développement est très réduit par rapport au même programme écrit en Java. Le compilateur construit un plan d'exécution ([quelques exemples ici](http://chimera.labs.oreilly.com/books/1234000001811/ch07.html#explain)) et infère le nombre de machines requises pour distribuer le job. Cela suffit pour la plupart des besoins.

**petits jeux de données**

Certains jobs peuvent durer des heures : il est conseillé de les essayer sur de petits jeux de données avant de les faire tourner sur les vraies données. Il est toujours frustrant de s'apercevoir qu'un job a planté au bout de deux heures parce qu'une chaîne de caractères est vide et que ce cas n'avait pas été prévu.

Avec ces petits jeux de données, il est conseillé de tester d'abord le job sur la passerelle ([exécution locale](http://archive.cloudera.com/cdh/3/pig/tutorial.html#Running+the+Pig+Scripts+in+Local+Mode)) avant de le lancer sur le cluster. Avec pyensae, il faut ajouter l'option ``-local`` à la commande [hd_pig_submit](http://www.xavierdupre.fr/app/pyensae/helpsphinx/pyensae/remote/magic_azure.html?highlight=hd_pig_submit#pyensae.remote.magic_azure.MagicAzure.hd_pig_submit) ; une esquisse est donnée plus bas.

**concaténer les fichiers divisés**

Un programme PIG ne produit pas un fichier mais plusieurs fichiers dans un répertoire. La commande [getmerge](http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-common/FileSystemShell.html) télécharge ces fichiers sur la passerelle et les fusionne en un seul (même idée que la magic ``%blob_downmerge`` utilisée plus haut ; voir aussi l'esquisse plus bas).

**ordre des lignes**

Les jobs sont distribués : même en ne faisant rien (LOAD + STORE), il n'est pas garanti que l'ordre des lignes soit préservé, et la probabilité que ce soit le cas est quasi nulle (une esquisse plus bas montre comment rétablir un ordre).
"]}, {"cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4"}}, "nbformat": 4, "nbformat_minor": 2}