module datainc.data_helper
Short summary
module ensae_projects.datainc.data_helper
Simple functions to process text files.
Functions
| function | truncated documentation |
|---|---|
| change_encoding | Changes the encoding of a text file and removes quotes. By default process is |
| change_encoding_improve | Changes the encoding of a text file, removes quotes. By default process is |
| clean_column_name_sql_dump | Removes quotes in a line which looks like: |
| convert_dates | Converts a string into a date. |
| enumerate_text_lines | Enumerates all lines from a text file and does some cleaning (see the list of parameters). |
Documentation
Simple functions to process text files.
- ensae_projects.datainc.data_helper.change_encoding(infile, outfile, enc1, enc2='utf-8', process=None, fLOG=<function noLOG>)
Changes the encoding of a text file and removes quotes. By default, process is process_line().
- Parameters:
infile – input file
outfile – output file
enc1 – encoding of the input file
enc2 – encoding of the output file
process – function which processes a line, see below
fLOG – logging function
- Returns:
number of processed lines
Signature of the function process:
    def process(line_number, line):
        # ...
        return line
See clean_column_name_sql_dump for an example.
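The callback contract described above can be illustrated without the library. The following is a minimal sketch, not the implementation of change_encoding: `reencode` and `drop_double_quotes` are hypothetical names, and the identity default only stands in for process_line(), whose behaviour is not described on this page.

```python
import os
import tempfile


def reencode(infile, outfile, enc1, enc2="utf-8", process=None):
    """Hypothetical stand-in: rewrite infile into outfile with a new encoding."""
    if process is None:
        # Assumption: an identity callback replaces the real default.
        def process(line_number, line):
            return line
    count = 0
    # newline="" keeps the original line endings untouched on both sides.
    with open(infile, "r", encoding=enc1, newline="") as fin, \
            open(outfile, "w", encoding=enc2, newline="") as fout:
        for line_number, line in enumerate(fin):
            fout.write(process(line_number, line))
            count += 1
    return count


def drop_double_quotes(line_number, line):
    """Hypothetical callback following the process(line_number, line) signature."""
    return line.replace('"', "")


with tempfile.TemporaryDirectory() as folder:
    src = os.path.join(folder, "in.txt")
    dst = os.path.join(folder, "out.txt")
    with open(src, "w", encoding="latin-1") as f:
        f.write('"café";1\n"thé";2\n')
    nb = reencode(src, dst, "latin-1", process=drop_double_quotes)
    with open(dst, "r", encoding="utf-8") as f:
        result = f.read()
```

The callback receives each line together with its number and returns the line to write, so any per-line cleaning can be plugged in without touching the encoding logic.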
- ensae_projects.datainc.data_helper.change_encoding_improve(infile, outfile, enc1, enc2='utf-8', process=None, fLOG=<function noLOG>)
Changes the encoding of a text file and removes quotes. By default, process is process_line(), but the function has access to the distribution of the number of columns in the previous lines.
- Parameters:
infile – input file
outfile – output file
enc1 – encoding of the input file
enc2 – encoding of the output file
process – function which processes a line, see below
fLOG – logging function
- Returns:
number of processed lines
Signature of the function process:
    def process(line_number, line, histo_nb_columns):
        # ...
        return line, number_of_columns
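To show how a callback might use the distribution of column counts, here is a small self-contained sketch. It assumes the histogram is a mapping from a number of columns to the number of lines seen with that count, which this page does not specify; `pad_short_lines` and `run` are hypothetical names, and `run` is only a simplified driver, not the library's loop.

```python
from collections import Counter


def pad_short_lines(line_number, line, histo_nb_columns, sep=";"):
    """Hypothetical callback: pad a short line up to the most frequent width."""
    body = line.rstrip("\r\n")
    ending = line[len(body):]
    nb = body.count(sep) + 1
    if histo_nb_columns:
        # Most frequent number of columns among the previous lines.
        expected = max(histo_nb_columns, key=histo_nb_columns.get)
        if nb < expected:
            body += sep * (expected - nb)
            nb = expected
    return body + ending, nb


def run(lines, process):
    """Simplified driver (assumption): feeds the histogram back to the callback."""
    histo = Counter()
    out = []
    for i, line in enumerate(lines):
        new_line, nb = process(i, line, histo)
        histo[nb] += 1
        out.append(new_line)
    return out, histo


cleaned, histo = run(["a;b;c\n", "d;e;f\n", "g;h\n"], pad_short_lines)
```

The third line has only two columns, so the callback appends one empty column to match the width observed on the earlier lines.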
- ensae_projects.datainc.data_helper.clean_column_name_sql_dump(i, line, hist, sep=';')
Removes quotes in a line which looks like:
0; "a"; 'j"'; "r;"
- Parameters:
i – line number (unused)
line – line to process
hist – distribution of the number of columns
sep – column separator
- Returns:
text line, number of columns
- ensae_projects.datainc.data_helper.convert_dates(sd, option=None, exc=False)
Converts a string into a date.
- Parameters:
sd – string
option – see below
exc – raise an exception
- Returns:
string
'F': dates must contain / and the format is DD/MM/YY
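The 'F' format can be illustrated with the standard library alone. This is a sketch under assumptions, not the implementation of convert_dates: `parse_french_date` is a hypothetical name, returning an ISO-formatted string is an assumption (the page only says a string is returned), and so is returning the input unchanged when exc is False.

```python
from datetime import datetime


def parse_french_date(sd, exc=False):
    """Hypothetical helper: convert DD/MM/YY into an ISO date string."""
    try:
        if "/" not in sd:
            raise ValueError(f"no '/' in {sd!r}")
        # %y reads a two-digit year; 69-99 map to 19xx, 00-68 to 20xx.
        parsed = datetime.strptime(sd.strip(), "%d/%m/%y")
        return parsed.date().isoformat()
    except ValueError:
        if exc:
            raise
        # Assumption: without exc, the input is returned unchanged.
        return sd


converted = parse_french_date("05/03/14")
untouched = parse_french_date("not a date")
```

With exc=True the same call raises ValueError on an unparseable value instead of passing it through.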
- ensae_projects.datainc.data_helper.enumerate_text_lines(filename, sep='\t', encoding='utf-8', quotes_as_str=False, header=True, clean_column_name=None, convert_float=False, option=None, skip=0, take=-1, fLOG=<function noLOG>)
Enumerates all lines from a text file and does some cleaning (see the list of parameters).
- Parameters:
filename – filename
sep – column separator
header – first row is header
encoding – encoding
quotes_as_str – surrounded by quotes
clean_column_name – function to clean column name
convert_float – convert number into float wherever possible
option – options for cleaning dates, see below
skip – number of rows to skip
take – number of rows to consider (-1 for all)
fLOG – logging function
- Returns:
iterator on dictionary
Options for cleaning dates:
'F': dates must contain / and the format is DD/MM/YY
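The idea of an iterator on dictionaries with skip, take and convert_float can be sketched as follows. This is not the implementation of enumerate_text_lines: `iter_rows` and `to_float` are hypothetical names, the sketch reads from an open stream rather than a filename, it ignores quotes and date options, and the generated column names `c0`, `c1`, ... for headerless input are an assumption.

```python
import io


def to_float(value):
    """Return value as a float when possible, otherwise unchanged."""
    try:
        return float(value)
    except ValueError:
        return value


def iter_rows(stream, sep="\t", header=True, convert_float=False,
              skip=0, take=-1):
    """Hypothetical generator: yields one dictionary per data row."""
    names = None
    seen = 0
    yielded = 0
    for line in stream:
        values = line.rstrip("\r\n").split(sep)
        if names is None:
            if header:
                names = values
                continue
            # Assumption: invented column names when there is no header.
            names = [f"c{i}" for i in range(len(values))]
        seen += 1
        if seen <= skip:
            continue
        if take >= 0 and yielded >= take:
            break
        if convert_float:
            values = [to_float(v) for v in values]
        yield dict(zip(names, values))
        yielded += 1


text = "name\tprice\na\t1.5\nb\tx\nc\t3\n"
all_rows = list(iter_rows(io.StringIO(text), convert_float=True))
window = list(iter_rows(io.StringIO(text), skip=1, take=1))
```

Because the function is a generator, rows are produced one at a time and a large file never has to be loaded in memory at once.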