{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# File Manipulation with Azure Blob Storage\n", "\n", "This notebook tries a few file manipulations between a local computer and a blob storage account on Azure. It requires [azure-sdk-for-python](https://github.com/Azure/azure-sdk-for-python) and [pyenbc](http://www.xavierdupre.fr/app/pyenbc/helpsphinx/index.html). We first create a dummy file."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": ["import random\n", "import pandas\n", "# 1000 rows of uniform random (x, y) pairs\n", "mat = [{\"x\": random.random(), \"y\": random.random()} for _ in range(1000)]\n", "df = pandas.DataFrame(mat)\n", "# save as a tab-separated, UTF-8 encoded text file\n", "df.to_csv(\"randomxy.txt\", sep=\"\\t\", encoding=\"utf8\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We need credentials. To avoid storing them in clear text in the notebook, we use an HTML form:"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/html": ["
"], "text/plain": [" blob_type content_encoding content_language content_length content_md5 \\\n", "0 BlockBlob 43483 \n", "1 BlockBlob 43456 \n", "\n", " content_type copy_completion_time copy_id copy_progress \\\n", "0 application/octet-stream \n", "1 application/octet-stream \n", "\n", " copy_source copy_status copy_status_description etag \\\n", "0 0x8D2C6BE8D4DEB43 \n", "1 0x8D2C6BC8E2C38FB \n", "\n", " last_modified lease_duration lease_state lease_status \\\n", "0 Sat, 26 Sep 2015 22:05:12 GMT available unlocked \n", "1 Sat, 26 Sep 2015 21:50:55 GMT available unlocked \n", "\n", " name url xms_blob_sequence_number \n", "0 testpyenbc/randomxy.txt 0 \n", "1 testpyenbc/randomxy2.txt 0 "]}, "execution_count": 14, "metadata": {}, "output_type": "execute_result"}], "source": ["%blob_lsl clusterensaeazure1/testpyenbc"]}, {"cell_type": "markdown", "metadata": {}, "source": ["If you need information not accessible through a magic command, you can use the variable ``bs`` (type [azure.storage.blobservice.BlobService](http://www.xavierdupre.fr/app/azure-sdk-for-python/helpsphinx/storage/blobservice.html#module-azure.storage.blobservice)):"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["size= 43483 id= 00000000\n"]}], "source": ["l=bs.get_block_list(\"clusterensaeazure1\", \"testpyenbc/randomxy.txt\")\n", "for _ in l.committed_blocks:\n", " print(\"size=\",_.size, \"id=\",_.id)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["We download this again to the local computer:"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/plain": ["'randomxx_copy.txt'"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["%blob_down clusterensaeazure1/testpyenbc/randomxy.txt randomxx_copy.txt --overwrite"]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"data": {"text/html": ["
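<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>directory</th>\n", "      <th>last_modified</th>\n", "      <th>name</th>\n", "      <th>size</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>False</td>\n", "      <td>2015-09-26 23:50:56.776239</td>\n", "      <td>.\\randomall.txt</td>\n", "      <td>84.88 Kb</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>False</td>\n", "      <td>2015-09-27 00:05:14.546891</td>\n", "      <td>.\\randomxx_copy.txt</td>\n", "      <td>42.46 Kb</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>False</td>\n", "      <td>2015-09-27 00:04:55.847278</td>\n", "      <td>.\\randomxy.txt</td>\n", "      <td>42.46 Kb</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>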
"], "text/plain": [" directory last_modified name size\n", "0 False 2015-09-26 23:50:56.776239 .\\randomall.txt 84.88 Kb\n", "1 False 2015-09-27 00:05:14.546891 .\\randomxx_copy.txt 42.46 Kb\n", "2 False 2015-09-27 00:04:55.847278 .\\randomxy.txt 42.46 Kb"]}, "execution_count": 17, "metadata": {}, "output_type": "execute_result"}], "source": ["%lsr r.*[.]txt"]}, {"cell_type": "markdown", "metadata": {}, "source": ["PIG scripts usually produce more than one output and it is convenient to merge them while downloading them. To test that, we upload a second time our file with a different names:"]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [{"data": {"text/plain": ["'testpyenbc/randomxy2.txt'"]}, "execution_count": 18, "metadata": {}, "output_type": "execute_result"}], "source": ["%blob_up randomxy.txt clusterensaeazure1/testpyenbc/randomxy2.txt"]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"data": {"text/html": ["