module `datainc.data_helper`

Short summary

module ensae_projects.datainc.data_helper

Simple functions to process text files.

Functions

function	truncated documentation
`change_encoding`	Changes the encoding of a text file and removes quotes. By default process is `process_line()`.
`change_encoding_improve`	Changes the encoding of a text file, removes quotes. By default process is `process_line()` but the function …
`clean_column_name_sql_dump`	Removes quotes in a line which looks like: `0; "a"; 'j"'; "r;"`
`convert_dates`	Converts a string into a date.
`enumerate_text_lines`	Enumerates all lines from a text file and does some cleaning (see the list of parameters).

Documentation

ensae_projects.datainc.data_helper.change_encoding(infile, outfile, enc1, enc2='utf-8', process=None, fLOG=<function noLOG>)

Changes the encoding of a text file and removes quotes. By default process is process_line().

Parameters:

infile – input file
outfile – output file
enc1 – encoding of the input file
enc2 – encoding of the output file
process – function which processes a line, see below
fLOG – logging function

Returns:

number of processed lines

Signature of the process function:

def process(line_number, line):
    # ...
    return line

See clean_column_name_sql_dump for an example.
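As an illustration of the callback shape described above, here is a minimal sketch. The helper `reencode` is a hypothetical stand-in written with the standard library so that the example runs without `ensae_projects`; it is not the library's implementation. The quote-stripping rule and the 0-based line numbering are assumptions.

```python
import os
import tempfile


def strip_double_quotes(line_number, line):
    """Callback with the documented shape: (line_number, line) -> line."""
    # line_number is unused here; a callback could use it to treat
    # the header line differently from the data lines.
    return line.replace('"', "")


def reencode(infile, outfile, enc1, enc2="utf-8", process=None):
    """Hypothetical stand-in: read with enc1, write with enc2, count lines."""
    count = 0
    with open(infile, "r", encoding=enc1, newline="") as fin, \
            open(outfile, "w", encoding=enc2, newline="") as fout:
        for line_number, line in enumerate(fin):  # 0-based (assumption)
            fout.write(process(line_number, line) if process else line)
            count = line_number + 1
    return count


def demo():
    """Convert a small latin-1 file to utf-8; return (line count, new text)."""
    with tempfile.TemporaryDirectory() as folder:
        src = os.path.join(folder, "in.txt")
        dst = os.path.join(folder, "out.txt")
        with open(src, "w", encoding="latin-1", newline="") as f:
            f.write('"café";1\n"thé";2\n')
        nb = reencode(src, dst, "latin-1", process=strip_double_quotes)
        with open(dst, "r", encoding="utf-8", newline="") as f:
            return nb, f.read()
```

With the real function, the same callback would presumably be passed as `process=strip_double_quotes` to `change_encoding`, using the signature documented above.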

ensae_projects.datainc.data_helper.change_encoding_improve(infile, outfile, enc1, enc2='utf-8', process=None, fLOG=<function noLOG>)

Changes the encoding of a text file and removes quotes. By default process is process_line(), but here the process function also has access to the distribution of the number of columns in the previous lines.

Parameters:

infile – input file
outfile – output file
enc1 – encoding of the input file
enc2 – encoding of the output file
process – function which processes a line, see below
fLOG – logging function

Returns:

number of processed lines

Signature of the process function:

def process(line_number, line, histo_nb_columns):
    # ...
    return line, number_of_columns
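The three-argument callback can use the column distribution to repair irregular lines. The sketch below pads short lines up to the most frequent column count seen so far. It assumes `histo_nb_columns` is a mapping from a column count to the number of lines with that count, and that the caller (here the hypothetical driver `run`) updates it; neither point is confirmed by the documentation above.

```python
from collections import Counter


def pad_to_usual_width(line_number, line, histo_nb_columns, sep=";"):
    """Callback shape: (line_number, line, histo) -> (line, number_of_columns)."""
    body = line.rstrip("\n")
    nb = body.count(sep) + 1
    if histo_nb_columns:
        # Most frequent column count among the previous lines.
        usual = max(histo_nb_columns, key=histo_nb_columns.get)
        if nb < usual:
            # Append empty columns so the line reaches the usual width.
            body += sep * (usual - nb)
            nb = usual
    return body + "\n", nb


def run(lines):
    """Hypothetical driver: applies the callback and maintains the histogram."""
    histo = Counter()
    out = []
    for i, line in enumerate(lines):
        new_line, nb = pad_to_usual_width(i, line, histo)
        histo[nb] += 1
        out.append(new_line)
    return out, dict(histo)
```

The naive `count(sep)` miscounts when a separator appears inside a quoted field, so a real callback would need quote-aware splitting.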

ensae_projects.datainc.data_helper.clean_column_name_sql_dump(i, line, hist, sep=';')

Removes quotes in a line which looks like:

0; "a"; 'j"'; "r;"

Parameters:

i – line number (unused)
line – line to process
hist – distribution of the number of columns
sep – column separator

Returns:

text line, number of columns
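To make the task concrete, the following sketch removes the quotes from the sample line shown above. The function `remove_quotes` and its helper are hypothetical and do not reproduce the library's code. Joining the cleaned fields with a tab is a choice made here so that a field such as `r;` stays unambiguous; the actual output format of `clean_column_name_sql_dump` is not stated in this page.

```python
QUOTES = "\"'"


def split_outside_quotes(line, sep=";"):
    """Split on sep, ignoring separators found inside a quoted field."""
    fields, current, quote = [], [], None
    for ch in line:
        if quote is None and ch in QUOTES:
            quote = ch                      # opening quote
            current.append(ch)
        elif quote is not None and ch == quote:
            quote = None                    # matching closing quote
            current.append(ch)
        elif quote is None and ch == sep:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)              # regular or quoted character
    fields.append("".join(current))
    return fields


def remove_quotes(line, sep=";"):
    """Return (tab-joined line without surrounding quotes, number of columns)."""
    cleaned = []
    for field in split_outside_quotes(line, sep):
        field = field.strip()
        if len(field) >= 2 and field[0] == field[-1] and field[0] in QUOTES:
            field = field[1:-1]
        cleaned.append(field)
    return "\t".join(cleaned), len(cleaned)


LINE = '0; "a"; \'j"\'; "r;"'
```

Calling `remove_quotes(LINE)` yields four columns: `0`, `a`, `j"` and `r;`. Unbalanced quotes are not handled in this sketch.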

ensae_projects.datainc.data_helper.convert_dates(sd, option=None, exc=False)

Converts a string into a date.

Parameters:

sd – string
option – see below
exc – raise an exception

Returns:

string

Supported values of option:

'F': dates must contain / and format is DD/MM/YY
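The `'F'` format can be parsed with the standard library, as sketched below. The function `parse_f_date` is hypothetical. Returning the input unchanged when parsing fails and `exc` is false is an assumption about what the `exc` flag means, and the sketch returns a `datetime.date`, whereas the documentation above lists the return type as a string.

```python
import datetime


def parse_f_date(sd, exc=False):
    """Parse a DD/MM/YY string; on failure return sd unless exc is True."""
    try:
        if "/" not in sd:
            raise ValueError(f"no '/' in {sd!r}")
        # %y maps two-digit years 69-99 to 1969-1999 and 00-68 to 2000-2068.
        return datetime.datetime.strptime(sd, "%d/%m/%y").date()
    except ValueError:
        if exc:
            raise
        return sd
```

For instance, `parse_f_date("25/12/99")` gives `datetime.date(1999, 12, 25)`, while `parse_f_date("2020-01-01")` returns the string untouched.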

ensae_projects.datainc.data_helper.enumerate_text_lines(filename, sep='\t', encoding='utf-8', quotes_as_str=False, header=True, clean_column_name=None, convert_float=False, option=None, skip=0, take=-1, fLOG=<function noLOG>)

Enumerates all lines from a text file and does some cleaning (see the list of parameters).

Parameters:

filename – filename
sep – column separator
header – first row is header
encoding – encoding
quotes_as_str – surrounded by quotes
clean_column_name – function to clean column name
convert_float – convert number into float wherever possible
option – options for cleaning dates, see below
skip – number of rows to skip
take – number of rows to consider (-1 for all)
fLOG – logging function

Returns:

iterator over dictionaries (one per row)

Options for cleaning dates:

'F': dates must contain / and format is DD/MM/YY
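The overall behaviour (one dictionary per row, optional float conversion, `skip` and `take`) can be sketched with the standard library. The generator `iter_rows` is hypothetical and reads from an open text stream rather than a filename. It assumes that `skip` counts data rows after the header and that columns are keyed by position when there is no header; the library may differ on both points.

```python
import csv
import io
import itertools


def to_float(value):
    """Return value as a float when possible, otherwise leave it unchanged."""
    try:
        return float(value)
    except ValueError:
        return value


def iter_rows(stream, sep="\t", header=True, convert_float=False,
              skip=0, take=-1):
    """Yield one dictionary per row of a delimited text stream."""
    reader = csv.reader(stream, delimiter=sep)
    names = next(reader, None) if header else None
    stop = None if take < 0 else skip + take   # take=-1 means all rows
    for row in itertools.islice(reader, skip, stop):
        keys = names if names is not None else range(len(row))
        values = [to_float(v) for v in row] if convert_float else row
        yield dict(zip(keys, values))


DATA = "name\tvalue\na\t1.5\nb\tx\nc\t3\n"
```

For example, `list(iter_rows(io.StringIO(DATA), convert_float=True, skip=1, take=1))` keeps only the second data row, and the value `x` stays a string because it cannot be converted.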
