{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# File Manipulation with Azure Blob Storage\n", "\n", "We try a few file manipulation between a local computer and a blob storage on Azure. It requires [azure-sdk-for-python](https://github.com/Azure/azure-sdk-for-python) and [pyenbc](http://www.xavierdupre.fr/app/pyenbc/helpsphinx/index.html). We first create a dummy file."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": ["import pandas, random\n", "mat = [ {\"x\":random.random(), \"y\":random.random()} for i in range(0,1000)]\n", "df = pandas.DataFrame(mat)\n", "df.to_csv(\"randomxy.txt\", sep=\"\\t\", encoding=\"utf8\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We need credentials and to avoid having them in clear in the notebook, we use a HTML form:"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/html": ["
credentials\n", "
blob_storage \n", "
password \n", "
\n", ""], "text/plain": [""]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["import pyquickhelper.ipythonhelper as ipy\n", "params={\"blob_storage\":\"hdblobstorage\", \"password\":\"\"}\n", "ipy.open_html_form(params=params,title=\"credentials\",key_save=\"blobservice\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We stored the values in two variables in the workspace:"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": ["blobstorage = blobservice[\"blob_storage\"]\n", "blobpassword = blobservice[\"password\"]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We need pyensae >= 1.2:"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["The pyensae extension is already loaded. To reload it, use:\n", " %reload_ext pyensae\n", "The pyenbc extension is already loaded. To reload it, use:\n", " %reload_ext pyenbc\n"]}, {"data": {"text/plain": ["'1.2'"]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["import pyensae\n", "import pyenbc\n", "%load_ext pyensae\n", "%load_ext pyenbc\n", "pyensae.__version__, pyenbc.__version__"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["usage: blob_open [-h] [-b BLOBSTORAGE] [-p BLOBPASSWORD]\n", "\n", "open a connection to an Azure blob storage, by default, the magic command\n", "takes blobstorage and blobpassword local variables as default values\n", "\n", "optional arguments:\n", " -h, --help show this help message and exit\n", " -b BLOBSTORAGE, --blobstorage BLOBSTORAGE\n", " blob storage name\n", " -p BLOBPASSWORD, --blobpassword BLOBPASSWORD\n", " blob password\n", "usage: blob_open [-h] [-b BLOBSTORAGE] [-p BLOBPASSWORD]\n", "\n"]}], "source": ["%blob_open --help"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We open a connection to the blob storage:"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/plain": ["(,\n", " )"]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["cl, bs = %blob_open\n", "cl, bs"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We extract the available containers:"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"data": {"text/plain": ["['clusterensaeazure1',\n", " 'clusterensaeazure2',\n", " 'clusterensaeazure2-1',\n", " 'hdblobstorage',\n", " 'petittest',\n", " 'sparkclus',\n", " 'sparkclus2',\n", " 'testhadoopensae']"]}, "execution_count": 8, "metadata": {}, "output_type": "execute_result"}], "source": ["l = %blob_containers\n", "l"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We get the content of one container:"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namelast_modifiedcontent_typecontent_lengthblob_type
4995velib_several_days/paris.2014-11-14_15-54-58.6...Fri, 28 Nov 2014 10:34:15 GMTapplication/octet-stream524941BlockBlob
4996velib_several_days/paris.2014-11-14_15-55-57.8...Fri, 28 Nov 2014 10:34:16 GMTapplication/octet-stream524944BlockBlob
4997velib_several_days/paris.2014-11-14_15-56-58.5...Fri, 28 Nov 2014 10:34:17 GMTapplication/octet-stream522499BlockBlob
4998velib_several_days/paris.2014-11-14_15-57-57.8...Fri, 28 Nov 2014 10:34:17 GMTapplication/octet-stream524958BlockBlob
4999velib_several_days/paris.2014-11-14_15-58-58.5...Fri, 28 Nov 2014 10:34:18 GMTapplication/octet-stream523757BlockBlob
\n", "
"], "text/plain": [" name \\\n", "4995 velib_several_days/paris.2014-11-14_15-54-58.6... \n", "4996 velib_several_days/paris.2014-11-14_15-55-57.8... \n", "4997 velib_several_days/paris.2014-11-14_15-56-58.5... \n", "4998 velib_several_days/paris.2014-11-14_15-57-57.8... \n", "4999 velib_several_days/paris.2014-11-14_15-58-58.5... \n", "\n", " last_modified content_type content_length \\\n", "4995 Fri, 28 Nov 2014 10:34:15 GMT application/octet-stream 524941 \n", "4996 Fri, 28 Nov 2014 10:34:16 GMT application/octet-stream 524944 \n", "4997 Fri, 28 Nov 2014 10:34:17 GMT application/octet-stream 522499 \n", "4998 Fri, 28 Nov 2014 10:34:17 GMT application/octet-stream 524958 \n", "4999 Fri, 28 Nov 2014 10:34:18 GMT application/octet-stream 523757 \n", "\n", " blob_type \n", "4995 BlockBlob \n", "4996 BlockBlob \n", "4997 BlockBlob \n", "4998 BlockBlob \n", "4999 BlockBlob "]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["df = %blob_ls hdblobstorage\n", "df.tail(n=5)"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"data": {"text/plain": ["'wasb://hdblobstorage@hdblobstorage.blob.core.windows.net/'"]}, "execution_count": 10, "metadata": {}, "output_type": "execute_result"}], "source": ["%hd_wasb_prefix"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"data": {"text/plain": ["'wasb://hdblobstorage@hdblobstorage.blob.core.windows.net/velib_several_days'"]}, "execution_count": 11, "metadata": {}, "output_type": "execute_result"}], "source": ["cl.wasb_to_file(\"hdblobstorage\", \"velib_several_days\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We upload the file we created in the first cell:"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"text/plain": ["'testpyenbc/randomxy.txt'"]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}], "source": ["%blob_up randomxy.txt clusterensaeazure1/testpyenbc/randomxy.txt"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We check the file is over there:"]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namelast_modifiedcontent_typecontent_lengthblob_type
0testpyenbc/randomxy.txtSat, 26 Sep 2015 22:05:12 GMTapplication/octet-stream43483BlockBlob
1testpyenbc/randomxy2.txtSat, 26 Sep 2015 21:50:55 GMTapplication/octet-stream43456BlockBlob
\n", "
"], "text/plain": [" name last_modified \\\n", "0 testpyenbc/randomxy.txt Sat, 26 Sep 2015 22:05:12 GMT \n", "1 testpyenbc/randomxy2.txt Sat, 26 Sep 2015 21:50:55 GMT \n", "\n", " content_type content_length blob_type \n", "0 application/octet-stream 43483 BlockBlob \n", "1 application/octet-stream 43456 BlockBlob "]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["%blob_ls clusterensaeazure1/testpyenbc"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We try an extended version:"]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
blob_typecontent_encodingcontent_languagecontent_lengthcontent_md5content_typecopy_completion_timecopy_idcopy_progresscopy_sourcecopy_statuscopy_status_descriptionetaglast_modifiedlease_durationlease_statelease_statusnameurlxms_blob_sequence_number
0BlockBlob43483application/octet-stream0x8D2C6BE8D4DEB43Sat, 26 Sep 2015 22:05:12 GMTavailableunlockedtestpyenbc/randomxy.txt0
1BlockBlob43456application/octet-stream0x8D2C6BC8E2C38FBSat, 26 Sep 2015 21:50:55 GMTavailableunlockedtestpyenbc/randomxy2.txt0
\n", "
"], "text/plain": [" blob_type content_encoding content_language content_length content_md5 \\\n", "0 BlockBlob 43483 \n", "1 BlockBlob 43456 \n", "\n", " content_type copy_completion_time copy_id copy_progress \\\n", "0 application/octet-stream \n", "1 application/octet-stream \n", "\n", " copy_source copy_status copy_status_description etag \\\n", "0 0x8D2C6BE8D4DEB43 \n", "1 0x8D2C6BC8E2C38FB \n", "\n", " last_modified lease_duration lease_state lease_status \\\n", "0 Sat, 26 Sep 2015 22:05:12 GMT available unlocked \n", "1 Sat, 26 Sep 2015 21:50:55 GMT available unlocked \n", "\n", " name url xms_blob_sequence_number \n", "0 testpyenbc/randomxy.txt 0 \n", "1 testpyenbc/randomxy2.txt 0 "]}, "execution_count": 14, "metadata": {}, "output_type": "execute_result"}], "source": ["%blob_lsl clusterensaeazure1/testpyenbc"]}, {"cell_type": "markdown", "metadata": {}, "source": ["If you need information not accessible through a magic command, you can use the variable ``bs`` (type [azure.storage.blobservice.BlobService](http://www.xavierdupre.fr/app/azure-sdk-for-python/helpsphinx/storage/blobservice.html#module-azure.storage.blobservice)):"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["size= 43483 id= 00000000\n"]}], "source": ["l=bs.get_block_list(\"clusterensaeazure1\", \"testpyenbc/randomxy.txt\")\n", "for _ in l.committed_blocks:\n", " print(\"size=\",_.size, \"id=\",_.id)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We download this again to the local computer:"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/plain": ["'randomxx_copy.txt'"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["%blob_down clusterensaeazure1/testpyenbc/randomxy.txt randomxx_copy.txt --overwrite"]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
directorylast_modifiednamesize
0False2015-09-26 23:50:56.776239.\\randomall.txt84.88 Kb
1False2015-09-27 00:05:14.546891.\\randomxx_copy.txt42.46 Kb
2False2015-09-27 00:04:55.847278.\\randomxy.txt42.46 Kb
\n", "
"], "text/plain": [" directory last_modified name size\n", "0 False 2015-09-26 23:50:56.776239 .\\randomall.txt 84.88 Kb\n", "1 False 2015-09-27 00:05:14.546891 .\\randomxx_copy.txt 42.46 Kb\n", "2 False 2015-09-27 00:04:55.847278 .\\randomxy.txt 42.46 Kb"]}, "execution_count": 17, "metadata": {}, "output_type": "execute_result"}], "source": ["%lsr r.*[.]txt"]}, {"cell_type": "markdown", "metadata": {}, "source": ["PIG scripts usually produce more than one output and it is convenient to merge them while downloading them. To test that, we upload a second time our file with a different names:"]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [{"data": {"text/plain": ["'testpyenbc/randomxy2.txt'"]}, "execution_count": 18, "metadata": {}, "output_type": "execute_result"}], "source": ["%blob_up randomxy.txt clusterensaeazure1/testpyenbc/randomxy2.txt"]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namelast_modifiedcontent_typecontent_lengthblob_type
0testpyenbc/randomxy.txtSat, 26 Sep 2015 22:05:12 GMTapplication/octet-stream43483BlockBlob
1testpyenbc/randomxy2.txtSat, 26 Sep 2015 22:05:18 GMTapplication/octet-stream43483BlockBlob
\n", "
"], "text/plain": [" name last_modified \\\n", "0 testpyenbc/randomxy.txt Sat, 26 Sep 2015 22:05:12 GMT \n", "1 testpyenbc/randomxy2.txt Sat, 26 Sep 2015 22:05:18 GMT \n", "\n", " content_type content_length blob_type \n", "0 application/octet-stream 43483 BlockBlob \n", "1 application/octet-stream 43483 BlockBlob "]}, "execution_count": 19, "metadata": {}, "output_type": "execute_result"}], "source": ["%blob_ls clusterensaeazure1/testpyenbc"]}, {"cell_type": "markdown", "metadata": {}, "source": ["And we merge them:"]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"data": {"text/plain": ["'randomall.txt'"]}, "execution_count": 20, "metadata": {}, "output_type": "execute_result"}], "source": ["%blob_downmerge clusterensaeazure1/testpyenbc randomall.txt --overwrite"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We check the size of file ``randomall.txt`` is twice bigger:"]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
directorylast_modifiednamesize
0False2015-09-27 00:05:32.134221.\\randomall.txt84.93 Kb
1False2015-09-27 00:05:14.546891.\\randomxx_copy.txt42.46 Kb
2False2015-09-27 00:04:55.847278.\\randomxy.txt42.46 Kb
\n", "
"], "text/plain": [" directory last_modified name size\n", "0 False 2015-09-27 00:05:32.134221 .\\randomall.txt 84.93 Kb\n", "1 False 2015-09-27 00:05:14.546891 .\\randomxx_copy.txt 42.46 Kb\n", "2 False 2015-09-27 00:04:55.847278 .\\randomxy.txt 42.46 Kb"]}, "execution_count": 21, "metadata": {}, "output_type": "execute_result"}], "source": ["%lsr r.*[.]txt"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We finally remove the files from the blob storage:"]}, {"cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [{"data": {"text/plain": ["True"]}, "execution_count": 22, "metadata": {}, "output_type": "execute_result"}], "source": ["%blob_delete clusterensaeazure1/testpyenbc/randomxy.txt\n", "%blob_delete clusterensaeazure1/testpyenbc/randomxy2.txt"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We check it disappeared:"]}, {"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameurl
\n", "
"], "text/plain": ["Empty DataFrame\n", "Columns: [name, url]\n", "Index: []"]}, "execution_count": 23, "metadata": {}, "output_type": "execute_result"}], "source": ["%blob_ls clusterensaeazure1/testpyenbc/"]}, {"cell_type": "markdown", "metadata": {}, "source": ["And we close the connection:"]}, {"cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [{"data": {"text/plain": ["True"]}, "execution_count": 24, "metadata": {}, "output_type": "execute_result"}], "source": ["%blob_close"]}, {"cell_type": "markdown", "metadata": {}, "source": ["**END**"]}, {"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1"}}, "nbformat": 4, "nbformat_minor": 2}