module datainc.data_helper

Short summary

module ensae_projects.datainc.data_helper

Simple functions to process text files.

source on GitHub

Functions

function – truncated documentation

  • change_encoding – Changes the encoding of a text file and removes quotes. By default process is process_line().

  • change_encoding_improve – Changes the encoding of a text file, removes quotes. By default process is process_line() but the function …

  • clean_column_name_sql_dump – Removes quotes in a line which looks like:

  • convert_dates – Converts a string into a date.

  • enumerate_text_lines – Enumerates all lines from a text file and does some cleaning (see the list of parameters).

Documentation


ensae_projects.datainc.data_helper.change_encoding(infile, outfile, enc1, enc2='utf-8', process=None, fLOG=<function noLOG>)

Changes the encoding of a text file and removes quotes. By default, process is process_line().

Parameters:
  • infile – input file

  • outfile – output file

  • enc1 – encoding of the input file

  • enc2 – encoding of the output file

  • process – function which processes a line, see below

  • fLOG – logging function

Returns:

number of processed lines

Expected signature of the function process:

def process(line_number, line):
    # ...
    return line

See clean_column_name_sql_dump for an example.
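
As a minimal sketch of the callback contract above, the following stand-alone code defines a function with the signature process(line_number, line) and a simplified re-encoding loop that calls it. The names strip_quotes, reencode and demo are hypothetical and are not part of ensae_projects; the zero-based line numbering is also an assumption, and the library's own loop may differ.

```python
import os
import tempfile


def strip_quotes(line_number, line):
    # Hypothetical callback with the same signature as `process` above.
    # It drops double quotes and leaves everything else untouched.
    return line.replace('"', "")


def reencode(infile, outfile, enc1, enc2="utf-8", process=strip_quotes):
    # Simplified stand-in for the behaviour described above (an assumption,
    # not the library's code): read with enc1, write with enc2, apply
    # `process` to every line and return the number of processed lines.
    count = 0
    with open(infile, "r", encoding=enc1, newline="") as fin, \
            open(outfile, "w", encoding=enc2, newline="") as fout:
        for i, line in enumerate(fin):
            fout.write(process(i, line))
            count += 1
    return count


def demo():
    # Writes a small latin-1 file, converts it to utf-8, reads it back.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "in.txt")
        dst = os.path.join(tmp, "out.txt")
        with open(src, "w", encoding="latin-1", newline="") as f:
            f.write('"caf\u00e9";1\n"th\u00e9";2\n')
        n = reencode(src, dst, "latin-1")
        with open(dst, "r", encoding="utf-8", newline="") as f:
            return n, f.read()
```

Calling demo() returns the number of processed lines together with the converted text, which makes it easy to check that the accented characters survive the change of encoding.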


ensae_projects.datainc.data_helper.change_encoding_improve(infile, outfile, enc1, enc2='utf-8', process=None, fLOG=<function noLOG>)

Changes the encoding of a text file and removes quotes. By default, process is process_line(), but here the function process also has access to the distribution of the number of columns in the previous lines.

Parameters:
  • infile – input file

  • outfile – output file

  • enc1 – encoding of the input file

  • enc2 – encoding of the output file

  • process – function which processes a line, see below

  • fLOG – logging function

Returns:

number of processed lines

Expected signature of the function process:

def process(line_number, line, histo_nb_columns):
    # ...
    return line, number_of_columns
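
The sketch below illustrates one possible use of the column distribution: a callback that pads lines which are shorter than the most frequent column count seen so far. It assumes that histo_nb_columns behaves like a dictionary mapping a column count to the number of previous lines with that count; the source does not specify its type. The names pad_short_lines and apply_process are hypothetical, and the small driver only imitates how the library might maintain the histogram.

```python
from collections import Counter


def pad_short_lines(line_number, line, histo_nb_columns, sep=";"):
    # Hypothetical callback matching process(line_number, line, histo_nb_columns).
    # Assumption: histo_nb_columns maps column count -> number of lines seen.
    body = line.rstrip("\r\n").replace('"', "")
    nb_columns = body.count(sep) + 1
    if histo_nb_columns:
        # Most frequent column count among the previous lines.
        expected = max(histo_nb_columns, key=histo_nb_columns.get)
        if nb_columns < expected:
            # Add empty trailing columns so the line matches the others.
            body += sep * (expected - nb_columns)
            nb_columns = expected
    return body + "\n", nb_columns


def apply_process(lines, process=pad_short_lines):
    # Simplified driver (an assumption, not the library's code): it calls
    # the callback on each line and updates the histogram afterwards.
    histo = Counter()
    out = []
    for i, line in enumerate(lines):
        cleaned, nb_columns = process(i, line, histo)
        histo[nb_columns] += 1
        out.append(cleaned)
    return out, dict(histo)
```

Padding is only one choice; a callback could equally log or reject irregular lines, since it returns both the cleaned line and the column count it found.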


ensae_projects.datainc.data_helper.clean_column_name_sql_dump(i, line, hist, sep=';')

Removes quotes in a line which looks like:

0; "a"; 'j"'; "r;"

Parameters:
  • i – line number (unused)

  • line – line to process

  • hist – distribution of the number of columns

  • sep – column separator

Returns:

text line, number of columns


ensae_projects.datainc.data_helper.convert_dates(sd, option=None, exc=False)

Converts a string into a date.

Parameters:
  • sd – string

  • option – see below

  • exc – if True, raise an exception

Returns:

string

Options:

  • 'F': dates must contain / and format is DD/MM/YY
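
To make the 'F' format concrete, here is a small stand-alone sketch that parses a DD/MM/YY string with the standard datetime module. The helper parse_ddmmyy is hypothetical and is not the library's implementation. Returning a datetime.date on success, and returning the input unchanged on failure when exc is False, are assumptions made for this illustration.

```python
import datetime


def parse_ddmmyy(sd, exc=False):
    # Hypothetical helper illustrating option 'F': the string must
    # contain '/' and follow the format DD/MM/YY.
    if "/" not in sd:
        if exc:
            raise ValueError("no '/' in %r" % sd)
        return sd  # assumption: leave the value untouched
    try:
        return datetime.datetime.strptime(sd, "%d/%m/%y").date()
    except ValueError:
        if exc:
            raise
        return sd  # assumption: leave the value untouched
```

With this sketch, parse_ddmmyy("25/12/19") gives datetime.date(2019, 12, 25), while a string that does not match the format is returned as-is unless exc=True.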


ensae_projects.datainc.data_helper.enumerate_text_lines(filename, sep='\t', encoding='utf-8', quotes_as_str=False, header=True, clean_column_name=None, convert_float=False, option=None, skip=0, take=-1, fLOG=<function noLOG>)

Enumerates all lines from a text file and does some cleaning (see the list of parameters).

Parameters:
  • filename – filename

  • sep – column separator

  • header – first row is header

  • encoding – encoding

  • quotes_as_str – surrounded by quotes

  • clean_column_name – function to clean column name

  • convert_float – convert number into float wherever possible

  • option – options for cleaning dates, see below

  • skip – number of rows to skip

  • take – number of rows to consider (-1 for all)

  • fLOG – logging function

Returns:

iterator over dictionaries

Options for cleaning dates:

  • 'F': dates must contain / and format is DD/MM/YY
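
The general shape of such an iterator can be sketched with the standard csv module. The function iter_rows below is a hypothetical stand-in, not the library's code: it yields one dictionary per row, keyed by the header, and supports only sep, encoding, skip and take. It assumes that skip and take count data rows and exclude the header, which the source does not state.

```python
import csv
import itertools
import os
import tempfile


def iter_rows(filename, sep="\t", encoding="utf-8", skip=0, take=-1):
    # Rough stand-in (an assumption, not the library's code): the first
    # row is the header, every following row becomes a dictionary.
    with open(filename, "r", encoding=encoding, newline="") as f:
        reader = csv.DictReader(f, delimiter=sep)
        # take=-1 means "all remaining rows".
        stop = None if take < 0 else skip + take
        yield from itertools.islice(reader, skip, stop)


def demo():
    # Writes a small tab-separated file and reads one row back.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.tsv")
        with open(path, "w", encoding="utf-8", newline="") as f:
            f.write("name\tvalue\na\t1\nb\t2\nc\t3\n")
        return list(iter_rows(path, skip=1, take=1))
```

Because the function is a generator, rows are read lazily, so skip and take can limit the work done on large files. Values stay strings here; converting numbers or dates would be a separate step.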
