.. _azureblobstoragerst:

=========================================
File Manipulation with Azure Blob Storage
=========================================

.. only:: html

    **Links:** :githublink:`GitHub|_doc/notebooks/azure_blob_storage.ipynb|*`

This notebook tries a few file manipulations between a local computer and a
blob storage on Azure. It requires azure-sdk-for-python and pyenbc.

We first create a dummy file.

.. code:: ipython3

    import random

    import pandas

    # 1000 rows of random (x, y) pairs
    mat = [{"x": random.random(), "y": random.random()} for _ in range(1000)]
    df = pandas.DataFrame(mat)
    # tab-separated, UTF-8 encoded
    df.to_csv("randomxy.txt", sep="\t", encoding="utf8")

We need credentials. To avoid storing them in clear text in the notebook, we
use an HTML form:

.. code:: ipython3

    import pyquickhelper.ipythonhelper as ipy

    params = {"blob_storage": "hdblobstorage", "password": ""}
    ipy.open_html_form(params=params, title="credentials", key_save="blobservice")

The form, titled ``credentials``, shows two fields: ``blob_storage`` and
``password``.
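Outside a notebook, an HTML form is not available. The following is a minimal
sketch of one alternative, assuming the credentials are exported as
environment variables. The variable names ``BLOB_STORAGE`` and
``BLOB_PASSWORD`` and the helper ``load_credentials`` are hypothetical; they
are not part of pyquickhelper or pyenbc.

.. code:: python

    import os


    def load_credentials(env=None):
        """Return a dictionary with the same keys as the form:
        ``blob_storage`` and ``password``.

        ``env`` is any mapping; it defaults to ``os.environ``.
        The variable names are assumptions made for this sketch.
        """
        env = os.environ if env is None else env
        names = ("BLOB_STORAGE", "BLOB_PASSWORD")
        missing = [name for name in names if not env.get(name)]
        if missing:
            raise KeyError("missing environment variables: " + ", ".join(missing))
        return {
            "blob_storage": env["BLOB_STORAGE"],
            "password": env["BLOB_PASSWORD"],
        }

Passing a plain dictionary instead of ``os.environ`` makes the helper easy to
test without touching the real environment.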
The form stores the values in the workspace variable ``blobservice``. We copy
them into two variables:

.. code:: ipython3

    blobstorage = blobservice["blob_storage"]
    blobpassword = blobservice["password"]

We need pyensae >= 1.2:

.. code:: ipython3

    import pyensae
    import pyenbc
    %load_ext pyensae
    %load_ext pyenbc
    pyensae.__version__, pyenbc.__version__

.. parsed-literal::

    The pyensae extension is already loaded. To reload it, use:
      %reload_ext pyensae
    The pyenbc extension is already loaded. To reload it, use:
      %reload_ext pyenbc

.. parsed-literal::

    '1.2'

.. code:: ipython3

    %blob_open --help

.. parsed-literal::

    usage: blob_open [-h] [-b BLOBSTORAGE] [-p BLOBPASSWORD]

    open a connection to an Azure blob storage, by default, the magic command
    takes blobstorage and blobpassword local variables as default values

    optional arguments:
      -h, --help            show this help message and exit
      -b BLOBSTORAGE, --blobstorage BLOBSTORAGE
                            blob storage name
      -p BLOBPASSWORD, --blobpassword BLOBPASSWORD
                            blob password
    usage: blob_open [-h] [-b BLOBSTORAGE] [-p BLOBPASSWORD]

We open a connection to the blob storage:

.. code:: ipython3

    cl, bs = %blob_open
    cl, bs

.. parsed-literal::

    (, )

We list the available containers:

.. code:: ipython3

    l = %blob_containers
    l

.. parsed-literal::

    ['clusterensaeazure1',
     'clusterensaeazure2',
     'clusterensaeazure2-1',
     'hdblobstorage',
     'petittest',
     'sparkclus',
     'sparkclus2',
     'testhadoopensae']

We get the content of one container:

.. code:: ipython3

    df = %blob_ls hdblobstorage
    df.tail(n=5)

.. list-table::
    :header-rows: 1

    * -
      - name
      - last_modified
      - content_type
      - content_length
      - blob_type
    * - 4995
      - velib_several_days/paris.2014-11-14_15-54-58.6...
      - Fri, 28 Nov 2014 10:34:15 GMT
      - application/octet-stream
      - 524941
      - BlockBlob
    * - 4996
      - velib_several_days/paris.2014-11-14_15-55-57.8...
      - Fri, 28 Nov 2014 10:34:16 GMT
      - application/octet-stream
      - 524944
      - BlockBlob
    * - 4997
      - velib_several_days/paris.2014-11-14_15-56-58.5...
      - Fri, 28 Nov 2014 10:34:17 GMT
      - application/octet-stream
      - 522499
      - BlockBlob
    * - 4998
      - velib_several_days/paris.2014-11-14_15-57-57.8...
      - Fri, 28 Nov 2014 10:34:17 GMT
      - application/octet-stream
      - 524958
      - BlockBlob
    * - 4999
      - velib_several_days/paris.2014-11-14_15-58-58.5...
      - Fri, 28 Nov 2014 10:34:18 GMT
      - application/octet-stream
      - 523757
      - BlockBlob
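A listing like this one is easier to exploit once it is summarized. The sketch
below uses only the standard library and assumes the rows are available as
dictionaries, for instance through ``df.to_dict("records")``. The helper
``summarize_listing`` is hypothetical and not part of pyenbc.

.. code:: python

    from email.utils import parsedate_to_datetime


    def summarize_listing(rows):
        """Return (number of blobs, total size in bytes, latest modification).

        ``rows`` must be a non-empty list of dictionaries with the keys
        ``content_length`` and ``last_modified`` (an HTTP-style date).
        """
        total = sum(int(row["content_length"]) for row in rows)
        latest = max(parsedate_to_datetime(row["last_modified"]) for row in rows)
        return len(rows), total, latest


    # The five rows displayed in the listing, reduced to the two columns used here.
    rows = [
        {"last_modified": "Fri, 28 Nov 2014 10:34:15 GMT", "content_length": 524941},
        {"last_modified": "Fri, 28 Nov 2014 10:34:16 GMT", "content_length": 524944},
        {"last_modified": "Fri, 28 Nov 2014 10:34:17 GMT", "content_length": 522499},
        {"last_modified": "Fri, 28 Nov 2014 10:34:17 GMT", "content_length": 524958},
        {"last_modified": "Fri, 28 Nov 2014 10:34:18 GMT", "content_length": 523757},
    ]
    count, total, latest = summarize_listing(rows)
    print(count, total, latest)

Parsing the dates before comparing them matters: the strings start with the
day name, so comparing them as text would not sort them chronologically.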
The magic command ``%hd_wasb_prefix`` returns the ``wasb://`` prefix of the
storage, and ``wasb_to_file`` builds the full path of a blob from a container
and a name:

.. code:: ipython3

    %hd_wasb_prefix

.. parsed-literal::

    'wasb://hdblobstorage@hdblobstorage.blob.core.windows.net/'

.. code:: ipython3

    cl.wasb_to_file("hdblobstorage", "velib_several_days")

.. parsed-literal::

    'wasb://hdblobstorage@hdblobstorage.blob.core.windows.net/velib_several_days'

We upload the file we created in the first cell:

.. code:: ipython3

    %blob_up randomxy.txt clusterensaeazure1/testpyenbc/randomxy.txt

.. parsed-literal::

    'testpyenbc/randomxy.txt'

We check that the file is there:

.. code:: ipython3

    %blob_ls clusterensaeazure1/testpyenbc

.. list-table::
    :header-rows: 1

    * -
      - name
      - last_modified
      - content_type
      - content_length
      - blob_type
    * - 0
      - testpyenbc/randomxy.txt
      - Sat, 26 Sep 2015 22:05:12 GMT
      - application/octet-stream
      - 43483
      - BlockBlob
    * - 1
      - testpyenbc/randomxy2.txt
      - Sat, 26 Sep 2015 21:50:55 GMT
      - application/octet-stream
      - 43456
      - BlockBlob
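The outputs suggest two conventions: a remote path given to a magic command
reads ``container/blob name``, and a ``wasb://`` path reads
``wasb://<container>@<account>.blob.core.windows.net/<blob name>``. The sketch
below illustrates both. The helpers ``split_remote_path`` and ``wasb_url`` are
hypothetical and written for this example; they are not pyenbc's
implementation.

.. code:: python

    def split_remote_path(remote_path):
        """Split ``container/blob name`` at the first slash."""
        container, _, blob_name = remote_path.partition("/")
        if not container or not blob_name:
            raise ValueError("expected 'container/blob name', got %r" % remote_path)
        return container, blob_name


    def wasb_url(account, container, blob_name=""):
        """Build a wasb:// path for a blob stored in ``container`` of ``account``."""
        return "wasb://{0}@{1}.blob.core.windows.net/{2}".format(
            container, account, blob_name)


    print(split_remote_path("clusterensaeazure1/testpyenbc/randomxy.txt"))
    print(wasb_url("hdblobstorage", "hdblobstorage", "velib_several_days"))

Splitting at the first slash only keeps ``testpyenbc/randomxy.txt`` as a single
blob name, which matches the value returned by ``%blob_up``.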
We try an extended version:

.. code:: ipython3

    %blob_lsl clusterensaeazure1/testpyenbc

The result has one row per blob. It is shown here transposed, one line per
column; blank cells are empty in the output.

.. list-table::
    :header-rows: 1

    * - column
      - 0
      - 1
    * - blob_type
      - BlockBlob
      - BlockBlob
    * - content_encoding
      -
      -
    * - content_language
      -
      -
    * - content_length
      - 43483
      - 43456
    * - content_md5
      -
      -
    * - content_type
      - application/octet-stream
      - application/octet-stream
    * - copy_completion_time
      -
      -
    * - copy_id
      -
      -
    * - copy_progress
      -
      -
    * - copy_source
      -
      -
    * - copy_status
      -
      -
    * - copy_status_description
      -
      -
    * - etag
      - 0x8D2C6BE8D4DEB43
      - 0x8D2C6BC8E2C38FB
    * - last_modified
      - Sat, 26 Sep 2015 22:05:12 GMT
      - Sat, 26 Sep 2015 21:50:55 GMT
    * - lease_duration
      -
      -
    * - lease_state
      - available
      - available
    * - lease_status
      - unlocked
      - unlocked
    * - name
      - testpyenbc/randomxy.txt
      - testpyenbc/randomxy2.txt
    * - url
      -
      -
    * - xms_blob_sequence_number
      - 0
      - 0
If you need information that is not accessible through a magic command, you
can use the variable ``bs`` (type ``azure.storage.blobservice.BlobService``):

.. code:: ipython3

    block_list = bs.get_block_list("clusterensaeazure1", "testpyenbc/randomxy.txt")
    # each committed block exposes its size and its identifier
    for block in block_list.committed_blocks:
        print("size=", block.size, "id=", block.id)

.. parsed-literal::

    size= 43483 id= 00000000

We download the file back to the local computer:

.. code:: ipython3

    %blob_down clusterensaeazure1/testpyenbc/randomxy.txt randomxx_copy.txt --overwrite

.. parsed-literal::

    'randomxx_copy.txt'

.. code:: ipython3

    %lsr r.*[.]txt

.. list-table::
    :header-rows: 1

    * -
      - directory
      - last_modified
      - name
      - size
    * - 0
      - False
      - 2015-09-26 23:50:56.776239
      - ``.\randomall.txt``
      - 84.88 Kb
    * - 1
      - False
      - 2015-09-27 00:05:14.546891
      - ``.\randomxx_copy.txt``
      - 42.46 Kb
    * - 2
      - False
      - 2015-09-27 00:04:55.847278
      - ``.\randomxy.txt``
      - 42.46 Kb
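Equal sizes do not prove that the downloaded copy is identical to the
original. A minimal sketch of a stricter check compares a hash of both files.
The helpers ``file_digest`` and ``same_content`` are hypothetical and use only
the standard library.

.. code:: python

    import hashlib


    def file_digest(path, chunk_size=65536):
        """Return the SHA-256 hex digest of a file, read in chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            # iter() with a sentinel stops when read() returns b"" (end of file)
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()


    def same_content(path_a, path_b):
        """Tell whether two files hold exactly the same bytes."""
        return file_digest(path_a) == file_digest(path_b)

In this notebook, ``same_content("randomxy.txt", "randomxx_copy.txt")`` would
be expected to return ``True`` after the round trip, assuming the upload and
the download leave the bytes untouched.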
PIG scripts usually produce more than one output file, and it is convenient to
merge them while downloading. To test that, we upload our file a second time
under a different name:

.. code:: ipython3

    %blob_up randomxy.txt clusterensaeazure1/testpyenbc/randomxy2.txt

.. parsed-literal::

    'testpyenbc/randomxy2.txt'

.. code:: ipython3

    %blob_ls clusterensaeazure1/testpyenbc

.. list-table::
    :header-rows: 1

    * -
      - name
      - last_modified
      - content_type
      - content_length
      - blob_type
    * - 0
      - testpyenbc/randomxy.txt
      - Sat, 26 Sep 2015 22:05:12 GMT
      - application/octet-stream
      - 43483
      - BlockBlob
    * - 1
      - testpyenbc/randomxy2.txt
      - Sat, 26 Sep 2015 22:05:18 GMT
      - application/octet-stream
      - 43483
      - BlockBlob
And we merge them:

.. code:: ipython3

    %blob_downmerge clusterensaeazure1/testpyenbc randomall.txt --overwrite

.. parsed-literal::

    'randomall.txt'

We check that ``randomall.txt`` is twice the size of a single file:

.. code:: ipython3

    %lsr r.*[.]txt

.. list-table::
    :header-rows: 1

    * -
      - directory
      - last_modified
      - name
      - size
    * - 0
      - False
      - 2015-09-27 00:05:32.134221
      - ``.\randomall.txt``
      - 84.93 Kb
    * - 1
      - False
      - 2015-09-27 00:05:14.546891
      - ``.\randomxx_copy.txt``
      - 42.46 Kb
    * - 2
      - False
      - 2015-09-27 00:04:55.847278
      - ``.\randomxy.txt``
      - 42.46 Kb
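The figures can be checked by hand. Each blob holds 43483 bytes, so the merged
file should hold 2 x 43483 = 86966 bytes, that is 86966 / 1024, or about
84.93 Kb. The sketch below reproduces this arithmetic and shows what merging
means locally: a plain concatenation. It assumes ``%lsr`` divides by 1024,
which matches the sizes displayed. The helpers ``merge_files`` and ``size_kb``
are hypothetical, not pyenbc's implementation.

.. code:: python

    import shutil


    def merge_files(sources, destination):
        """Concatenate the files in ``sources``, in order, into ``destination``."""
        with open(destination, "wb") as out:
            for source in sources:
                with open(source, "rb") as part:
                    shutil.copyfileobj(part, out)
        return destination


    def size_kb(n_bytes):
        """Convert bytes to kilobytes (1 Kb = 1024 bytes), rounded to 2 decimals."""
        return round(n_bytes / 1024, 2)


    single = 43483                 # content_length of each uploaded blob
    print(size_kb(single))         # size of one file
    print(size_kb(2 * single))     # expected size of the merged file

A concatenation also repeats the header line of every CSV file, which should
be kept in mind before loading the merged file.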
Finally, we remove the files from the blob storage:

.. code:: ipython3

    %blob_delete clusterensaeazure1/testpyenbc/randomxy.txt
    %blob_delete clusterensaeazure1/testpyenbc/randomxy2.txt

.. parsed-literal::

    True

We check that they disappeared:

.. code:: ipython3

    %blob_ls clusterensaeazure1/testpyenbc/

The listing is empty: only the column headers ``name`` and ``url`` remain.

And we close the connection:

.. code:: ipython3

    %blob_close

.. parsed-literal::

    True

**END**