module remote.azure_connection

Inheritance diagram of pyenbc.remote.azure_connection

Short summary

module pyenbc.remote.azure_connection

A class to help connect to a remote machine and send command lines.

source on GitHub

Classes

class

truncated documentation

AzureClient

A simple class to access and communicate with Azure. It requires modules:

Static Methods

staticmethod

truncated documentation

mask_string

returns an empty string or as many * characters as the length of the string

Methods

method

truncated documentation

__init__

constructor

__str__

usual

_interpret_path

replace variables such as $PSEUDO, $USERNAME

copy_blob

copy a blob

delete_blob

delete a blob

delete_folder

delete a folder and its content

df_head

Downloads the beginning of a stream and displays it as a DataFrame.

download

Downloads data from a blob storage to a file. No more than 64Mb can be downloaded at a time, so it needs to be split …

download_data

Downloads data from a blob storage and returns bytes. No more than 64Mb can be downloaded at a time, so it needs …

download_merge

Downloads all files from a folder in a blob storage to a single local file. Files will be merged. No more …

exists

test the existence of a path on the blob storage

get_status

return the status of the webHCatUrl server

get_version

returns the WebHCat version

hive_submit

Submits a :epkg:`HIVE` job, the function uploads it to the cluster as well as the dependencies. The code …

job_kill

kills a job, see Delete Job — DELETE queue/:jobid

job_queue

returns the list of jobs. It uses the API Job Information — GET queue/:jobid. …

job_status

return the status of a job, see List Versions — GET version

ls

return the content of a blob storage

open_blob_service

open a blob service

pig_submit

Submits a :epkg:`PIG` job, the function uploads it to the cluster as well as the dependencies. The code …

standard_outputs

returns the standard output and error for a specific job id

upload

Uploads data from a file to a blob storage. No more than 64Mb can be uploaded at a time, so it needs to be split …

upload_data

Uploads data (bytes) to a blob storage. No more than 64Mb can be uploaded at a time, so it needs to be split into …

url_blob

returns a url for a blob file name

url_webHCatUrl

returns a url to the cluster

wait_job

wait until a job has run or failed

wasb_prefix

when using a LOAD instruction in a PIG script, blob file names must be referenced using the wasb syntax. …

wasb_to_file

return something like wasb://demo@myblobstorage.blob...

Documentation

A class to help connect to a remote machine and send command lines.

source on GitHub

class pyenbc.remote.azure_connection.AzureClient(blob_name, blob_key, hadoop_name=None, hadoop_key=None, hadoop_user_name='admin', pseudo='any', fLOG=None)

Bases: object

A simple class to access and communicate with Azure. It requires modules:

Main functionalities related to blob:

  • list_containers, create_container, list_blobs, put_blob, put_block_blob_from_bytes

  • put_block_blob_from_text, put_page_blob_from_file, get_blob

See How to use Blob storage from Python.

Get the list of containers and files from a blob storage?

The functionalities of a BlobService are described in blockblobservice.py.

from pyenbc.remote.azure_connection import AzureClient
cl = AzureClient("<blob_storage_service>",
                "<primary_key>")
bs = cl.open_blob_service()
res = cl.ls(bs)
for r in res:
    print(r["name"])

Upload, download, to a blob storage

The following example uploads and downloads a file on a Blob Storage.

from pyenbc.remote.azure_connection import AzureClient
cl = AzureClient("<blob_storage_service>",
                "<primary_key>")
bs = cl.open_blob_service()
cl.upload(bs, "<container>", "myremotefolder/remotename.txt",
                            "local_filename.txt")

res = cl.ls(bs,"<container>")
for r in res:
    if "local_filename" in r["name"]:
        print(r)

cl.download(bs, "<container>", "myremotefolder/remotename.txt",
                            "another_local_filename.txt")

Many functions use the WebHCat API. The error codes can be found here: Error Codes and Responses.

Changed in version 1.1: PSEUDO, CONTAINER, SCRIPT will be passed to the script as parameters.

source on GitHub

constructor

Parameters:
  • blob_name – blob storage name

  • blob_key – account key for the blob storage

  • hadoop_name – hadoop server name (can be None if HDInsight is not used)

  • hadoop_key – hadoop key (can be None if HDInsight is not used)

  • pseudo – sometimes the same identification is used to connect to HDInsight; the pseudo is meant to avoid collisions

  • fLOG – logging function

source on GitHub

__init__(blob_name, blob_key, hadoop_name=None, hadoop_key=None, hadoop_user_name='admin', pseudo='any', fLOG=None)

constructor

Parameters:
  • blob_name – blob storage name

  • blob_key – account key for the blob storage

  • hadoop_name – hadoop server name (can be None if HDInsight is not used)

  • hadoop_key – hadoop key (can be None if HDInsight is not used)

  • pseudo – sometimes the same identification is used to connect to HDInsight; the pseudo is meant to avoid collisions

  • fLOG – logging function

source on GitHub

__str__()

usual

source on GitHub

_blob_properties = ['copy_completion_time', 'content_encoding', 'content_language', 'blob_type', 'copy_status_description', 'copy_id', 'content_md5', 'lease_duration', 'copy_source', 'content_type', 'content_length', 'lease_state', 'copy_progress', 'copy_status', 'xms_blob_sequence_number', 'lease_status', 'etag', 'last_modified']
_chunk_size = 4194304
_interpret_path(blob_path)

replace variables such as $PSEUDO, $USERNAME

Parameters:

blob_path – path

Returns:

modified path

New in version 1.1.

source on GitHub

copy_blob(blob_service, container, blob_name, source)

copy a blob

Parameters:
  • blob_service – returned by open_blob_service

  • container – container name

  • blob_name – destination

  • source – source
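
A minimal sketch (container and blob names below are placeholders), duplicating a blob under a new name:

from pyenbc.remote.azure_connection import AzureClient
cl = AzureClient("<blob_storage_service>", "<primary_key>")
bs = cl.open_blob_service()
# copies myfolder/source.txt to myfolder/copy.txt inside the same container
cl.copy_blob(bs, "<container>", "myfolder/copy.txt", "myfolder/source.txt")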

source on GitHub

delete_blob(blob_service, container_name, blob_name)

delete a blob

Parameters:
  • blob_service – returned by open_blob_service

  • container_name – container name

  • blob_name – blob name (remote file name)

source on GitHub

delete_folder(blob_service, container_name, blob_folder)

delete a folder and its content

Parameters:
  • blob_service – returned by open_blob_service

  • container_name – container name

  • blob_folder – blob folder (remote folder name)

New in version 1.1.
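
A minimal sketch (names below are placeholders), removing a remote folder and everything it contains:

from pyenbc.remote.azure_connection import AzureClient
cl = AzureClient("<blob_storage_service>", "<primary_key>")
bs = cl.open_blob_service()
# deletes every blob stored under myremotefolder
cl.delete_folder(bs, "<container>", "myremotefolder")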

source on GitHub

df_head(blob_service, container_name, blob_name, stop_at=1048576, encoding='utf-8', as_df=True, merge=False, **options)

Downloads the beginning of a stream and displays it as a DataFrame.

Parameters:
  • blob_service – returned by open_blob_service

  • container_name – container name

  • blob_name – blob name (or list of blob names) (remote file name)

  • stop_at – stop at a given size (None to avoid stopping)

  • encoding – encoding

  • as_df – result as a dataframe or a string

  • merge – if True, blob_name is a folder and the method download_merge is called

  • options – see read_csv

Returns:

a DataFrame, or a string if as_df is False

New in version 1.1.
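
A minimal sketch (names below are placeholders), previewing the beginning of a CSV blob; the extra options are passed to read_csv:

from pyenbc.remote.azure_connection import AzureClient
cl = AzureClient("<blob_storage_service>", "<primary_key>")
bs = cl.open_blob_service()
# downloads at most one megabyte and parses it with read_csv(sep=",")
df = cl.df_head(bs, "<container>", "myremotefolder/data.csv",
                stop_at=2**20, sep=",")
print(df.head())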

source on GitHub

download(blob_service, container_name, blob_name, file_path=None, append=False, chunk_size=None, stop_at=None)

Downloads data from a blob storage to a file. No more than 64Mb can be downloaded at a time, so it needs to be split into pieces.

Parameters:
  • blob_service – returned by open_blob_service

  • container_name – container name

  • blob_name – blob name (or list of blob names) (remote file name)

  • file_path – local file path

  • append – if True, append the content to an existing file

  • chunk_size – download by chunk

  • stop_at – stop at a given size (None to avoid stopping)

Returns:

local file or bytes if file_path is None

The code comes from Utilisation du service de stockage d’objets blob à partir de Python.

Changed in version 1.1: Parameters append, chunk_size were added. If file_path is None (default value now), the function returns bytes.
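
A minimal sketch (names below are placeholders): when file_path is left to None, the content comes back as bytes:

from pyenbc.remote.azure_connection import AzureClient
cl = AzureClient("<blob_storage_service>", "<primary_key>")
bs = cl.open_blob_service()
# no file_path --> the blob content is returned as bytes
content = cl.download(bs, "<container>", "myremotefolder/remotename.txt")
print(len(content))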

source on GitHub

download_data(blob_service, container_name, blob_name, chunk_size=None, stop_at=None)

Downloads data from a blob storage and returns bytes. No more than 64Mb can be downloaded at a time, so it needs to be split into pieces.

Parameters:
  • blob_service – returned by open_blob_service

  • container_name – container name

  • blob_name – blob name (or list of blob names) (remote file name)

  • chunk_size – download by chunk

  • stop_at – stop at a given size (None to avoid stopping)

Returns:

bytes

New in version 1.1.

source on GitHub

download_merge(blob_service, container_name, blob_folder, file_path=None, chunk_size=None, stop_at=None)

Downloads all files from a folder in a blob storage to a single local file. Files will be merged. No more than 64Mb can be downloaded at a time, so it needs to be split into pieces.

Parameters:
  • blob_service – returned by open_blob_service

  • container_name – container name

  • blob_folder – blob folder (remote folder name)

  • file_path – local file path

  • chunk_size – download by chunk

  • stop_at – stop at a given size (None to avoid stopping)

Returns:

local file

Changed in version 1.1: Parameters append, chunk_size were added. If file_path is None (default value now), the function returns bytes.
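
A minimal sketch (names below are placeholders), merging the part files a job left in a blob folder into one local file:

from pyenbc.remote.azure_connection import AzureClient
cl = AzureClient("<blob_storage_service>", "<primary_key>")
bs = cl.open_blob_service()
# every blob under myjob/output.csv is appended to the same local file
cl.download_merge(bs, "<container>", "myjob/output.csv", "output_local.csv")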

source on GitHub

exists(blob_service, container_name, path)

test the existence of a path on the blob storage

Parameters:
  • blob_service – blob service, returned by open_blob_service

  • container_name – None for all, its name otherwise

  • path – path in the container

Returns:

boolean

New in version 1.1.
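
A minimal sketch (names below are placeholders):

from pyenbc.remote.azure_connection import AzureClient
cl = AzureClient("<blob_storage_service>", "<primary_key>")
bs = cl.open_blob_service()
# True if the blob (or folder) exists in the container
if not cl.exists(bs, "<container>", "myremotefolder/remotename.txt"):
    print("missing blob")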

source on GitHub

get_status()

return the status of the webHCatUrl server

Returns:

json

source on GitHub

get_version()

returns the WebHCat version

Returns:

json

source on GitHub

hive_submit(blob_service, container_name, hive_file, dependencies=None, status_dir=None, stop_on_failure=True, params=None)

Submits a :epkg:`HIVE` job, the function uploads it to the cluster as well as the dependencies.

The code comes from How to use HDInsight from Linux and start a Pig + Jython job in HDInsight thru WebHCat. The API is described at Pig Job — POST pig.

Parameters:
  • blob_service – returned by open_blob_service

  • container_name – name of a container

  • hive_file – path to the job in the blob storage

  • dependencies – dependencies

  • status_dir – folder used by Hadoop to store job’s progress, it should contain your alias if you want to avoid collision with others’ jobs

  • stop_on_failure – if True, stop as soon as the job fails instead of waiting as long as possible

  • params – parameters to pass to the script

Returns:

json

New in version 1.1.
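
A minimal sketch mirroring the pig_submit example below (credentials and file names are placeholders):

from pyenbc.remote.azure_connection import AzureClient
cl = AzureClient("<blob_storage_service>", "<primary_key>",
                 "<hadoop_server>", "<hadoop_password>", "<username>")
bs = cl.open_blob_service()
# uploads the script and its dependencies, then submits the job through WebHCat
js = cl.hive_submit(bs, "<container>", "myscript.hive",
                    status_dir="status/<username>/myscript")
print(js)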

source on GitHub

job_kill(jobid)

kills a job

see Delete Job — DELETE queue/:jobid for the outcome

Parameters:

jobid – jobid

Returns:

json

source on GitHub

job_queue(showall=False, fields=None)

returns the list of jobs

It uses the API Job Information — GET queue/:jobid.

Parameters:
  • showall – if True, show all jobs (not only yours)

  • fields – additional fields to add to the request

Returns:

list of jobs

List job queue

Most of the time, a job remains stuck in the job queue because it is full. Here is some code to check whether that is the case on an Azure cluster. It should be executed from a notebook.

Connection

blobstorage = "..."
blobpassword = "..."
hadoop_server = "..."
hadoop_password = "..."
username = "..."

%load_ext pyenbc
client, bs = %hd_open

Job queue

res = client.job_queue()
res.reverse()   # last submitted jobs first

Displays the first 20 jobs:

from datetime import datetime

for i, r in enumerate(res[:20]):
    st = client.job_status(r["id"])
    start = datetime.fromtimestamp(float(st["status"]["startTime"]) / 1000)
    print(i, r, st["status"]["state"], start, st["status"]["jobName"])
    print(st["userargs"].get("file", None), st["profile"].get("jobName", None))

It gives:

0 {'detail': None, 'id': 'job_1451961118663_3126'} PREP 2016-01-26 21:57:28.756000 TempletonControllerJob
wasb://..../scripts/pig/titi.pig TempletonControllerJob
1 {'detail': None, 'id': 'job_1451961118663_3125'} PREP 2016-01-26 21:57:28.517999 TempletonControllerJob
wasb://..../scripts/pig/pre_processing.pig TempletonControllerJob
2 {'detail': None, 'id': 'job_1451961118663_3124'} PREP 2016-01-26 21:50:32.742000 TempletonControllerJob
wasb://..../scripts/pig/titi.pig TempletonControllerJob
3 {'detail': None, 'id': 'job_1451961118663_3123'} RUNNING 2016-01-26 21:46:57.219000 TempletonControllerJob
wasb://..../scripts/pig/alg1.pig TempletonControllerJob
4 {'detail': None, 'id': 'job_1451961118663_3122'} SUCCEEDED 2016-01-26 21:40:34.687999 PigLatin:pre_processing.pig
None PigLatin:pre_processing.pig
5 {'detail': None, 'id': 'job_1451961118663_3121'} RUNNING 2016-01-26 21:41:29.657000 TempletonControllerJob
wasb://..../scripts/pig/Algo_LDA2.pig TempletonControllerJob
6 {'detail': None, 'id': 'job_1451961118663_3120'} SUCCEEDED 2016-01-26 21:40:06.859999 TempletonControllerJob
wasb://..../scripts/pig/alg1.pig TempletonControllerJob

To kill a job:

client.job_kill("id")

source on GitHub

job_status(jobid)

return the status of a job

see List Versions — GET version for the outcome

Parameters:

jobid – jobid

Returns:

json

You can extract the startTime by doing:

from datetime import datetime
st = client.job_status(<job_id>)
datetime.fromtimestamp(float(st["status"]["startTime"])/1000)

source on GitHub

ls(blob_service, container_name=None, path=None, add_metadata=False, as_df=True)

return the content of a blob storage

Parameters:
  • blob_service – blob service, returned by open_blob_service

  • container_name – None for all, its name otherwise

  • path – path in the container

  • add_metadata – add the metadata to the blob

  • as_df – if True, returns a DataFrame

Returns:

list of dictionaries

Changed in version 1.1: Parameters add_metadata and as_df were added and the function now returns the property last_modified.

source on GitHub

static mask_string(s)

returns an empty string or as many * characters as the length of the string

source on GitHub

open_blob_service()

open a blob service

source on GitHub

pig_submit(blob_service, container_name, pig_file, dependencies=None, status_dir=None, stop_on_failure=True, params=None)

Submits a :epkg:`PIG` job, the function uploads it to the cluster as well as the dependencies.

The code comes from How to use HDInsight from Linux and start a Pig + Jython job in HDInsight thru WebHCat. The API is described at Pig Job — POST pig.

Parameters:
  • blob_service – returned by open_blob_service

  • container_name – name of a container

  • pig_file – path to the job in the blob storage

  • dependencies – dependencies

  • status_dir – folder used by Hadoop to store job’s progress, it should contain your alias if you want to avoid collision with others’ jobs

  • stop_on_failure – if True, stop as soon as the job fails instead of waiting as long as possible

  • params – parameters to pass to the script

Returns:

json

Submit a PIG job

The PIG script must include a LOAD instruction. This instruction uses file names defined with the wasb syntax.

If you place the string $CONTAINER before a stream name, it should be replaced by the corresponding wasb syntax associated to the container name defined by container_name. The function will then load your script, modify it and save another one by adding the extension .wasb.pig. Other constants you could use:

  • $PSEUDO

  • $CONTAINER

  • $SCRIPTSPIG

However, this replacement is not done by this class; your code could look like this:

blobstorage = "****"
blobpassword = "*********************"
hadoop_name = "*********"
hadoop_password = "********"
username = "********"
cl = AzureClient(blobstorage,
                blobpassword,
                hadoop_name,
                hadoop_password,
                username)
script = '''
    myinput = LOAD '$CONTAINER/input.csv'
            using PigStorage(',')
            AS (index:long, sequence, tag, timestamp:long, dateformat, x:double,y:double, z:double, activity) ;
    filt = FILTER myinput BY activity == 'walking' ;
    STORE filt INTO '$PSEUDO/output.csv' USING PigStorage() ;
    '''

with open("script_walking.pig","w") as f :
    f.write(script)

bs = cl.open_blob_service()
js = cl.pig_submit(bs, blobstorage, "testensae/script_walking.pig")
print(js)

js = cl.job_status('job_1414863995725_0013')

New in version 1.1: parameter stop_on_failure

source on GitHub

standard_outputs(job_id, blob_service, container, folder)

returns the standard output and error for a specific job id

Parameters:
  • job_id – job_id or status

  • blob_service – returned by open_blob_service

  • container – name of a container

  • folder – folder where to download them

Returns:

out, err
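
A minimal sketch (names below are placeholders), fetching the logs of a finished job into a local folder:

from pyenbc.remote.azure_connection import AzureClient
cl = AzureClient("<blob_storage_service>", "<primary_key>",
                 "<hadoop_server>", "<hadoop_password>", "<username>")
bs = cl.open_blob_service()
# downloads stdout and stderr produced by the job into the local folder "logs"
out, err = cl.standard_outputs("<job_id>", bs, "<container>", "logs")
print(err)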

source on GitHub

upload(blob_service, container_name, blob_name, file_path)

Uploads data from a file to a blob storage. No more than 64Mb can be uploaded at a time, so it needs to be split into pieces.

Parameters:
  • blob_service – returned by open_blob_service

  • container_name – container name

  • blob_name – blob name (remote file name)

  • file_path – local file path

Returns:

list of uploaded blob names

The code comes from Utilisation du service de stockage d’objets blob à partir de Python.

source on GitHub

upload_data(blob_service, container_name, blob_name, data)

Uploads data (bytes) to a blob storage. No more than 64Mb can be uploaded at a time, so it needs to be split into pieces.

Parameters:
  • blob_service – returned by open_blob_service

  • container_name – container name

  • blob_name – blob name (remote file name)

  • data – bytes

Returns:

list of uploaded blob names

The code comes from Utilisation du service de stockage d’objets blob à partir de Python.

New in version 1.1.
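
A minimal sketch (names below are placeholders), writing in-memory bytes straight to a blob:

from pyenbc.remote.azure_connection import AzureClient
cl = AzureClient("<blob_storage_service>", "<primary_key>")
bs = cl.open_blob_service()
data = "index,value\n1,3.5\n".encode("utf-8")
cl.upload_data(bs, "<container>", "myremotefolder/small.csv", data)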

source on GitHub

url_blob(blob_service, container, blob_name)

returns a url for a blob file name

Parameters:
  • blob_service – returned by open_blob_service

  • container – container

  • blob_name – blob_name

Returns:

url

source on GitHub

url_webHCatUrl(cmd)

returns a url to the cluster

Parameters:

cmd – something like pig, status

Returns:

url

source on GitHub

wait_job(job_id, delay=5, fLOG=<function noLOG>)

wait until a job has run or failed

Parameters:
  • job_id – job_id

  • delay – check every N seconds

Returns:

status

New in version 1.1.
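
A minimal sketch (names are placeholders, and it assumes the json returned by pig_submit exposes the job id under "id"):

from pyenbc.remote.azure_connection import AzureClient
cl = AzureClient("<blob_storage_service>", "<primary_key>",
                 "<hadoop_server>", "<hadoop_password>", "<username>")
bs = cl.open_blob_service()
js = cl.pig_submit(bs, "<container>", "myscript.pig")
# polls job_status every 10 seconds until the job succeeds or fails
status = cl.wait_job(js["id"], delay=10)
print(status["status"]["state"])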

source on GitHub

wasb_prefix(container_name)

when using a LOAD instruction in a PIG script, blob file names must be referenced using the wasb syntax. This method returns the prefix to add.

Returns:

wasb prefix

source on GitHub

wasb_to_file(container_name, blob_file)

return something like wasb://demo@myblobstorage.blob...

Parameters:
  • container_name – name of a container

  • blob_file – path to a file

Returns:

a url to the blob file (a pig script for example)
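
A minimal sketch (names below are placeholders), building the wasb path a PIG LOAD instruction expects:

from pyenbc.remote.azure_connection import AzureClient
cl = AzureClient("<blob_storage_service>", "<primary_key>")
# returns something like wasb://<container>@<blob_storage_service>.blob...
path = cl.wasb_to_file("<container>", "scripts/pig/myscript.pig")
print(path)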

source on GitHub