module sql.file_text_binary_columns

Inheritance diagram of pyensae.sql.file_text_binary_columns

Short summary

module pyensae.sql.file_text_binary_columns

Contains a class which iterates over the rows of a text file structured as a table.

source on GitHub

Classes

  • TextFileColumns – This class opens a text file as if it were a binary file. It can deal with null characters. The file is interpreted …

Static Methods

  • _store – Stores a list of dictionaries into a file (adds a header).

  • fusion – Merges several files with the same columns (the column order may differ).

Methods

  • __init__

  • __iter__

  • __str__ – Returns the header.

  • close – Closes the file and removes all information related to the format; the next time it is opened, the format will be checked …

  • get_columns

  • open – Opens the file and finds out whether there is a header, what the columns are and what their types are.

  • sort – Sorts a text file, even a big one; one or several columns give the order.

Documentation

Contains a class which iterates over the rows of a text file structured as a table.


class pyensae.sql.file_text_binary_columns.TextFileColumns(filename, errors=None, fLOG=<function noLOG>, force_header=False, changes=None, force_noheader=False, regex=None, filter=None, fields=None, keep_text_when_bad_type=False, break_at=-1, strip_space=True, force_sep=None, nb_line_guess=100, mistake=3, encoding='utf-8', strict_separator=False)[source]

Bases: pyensae.sql.file_text_binary.TextFile

This class opens a text file as if it were a binary file. It can deal with null characters. The file is interpreted as a TSV file or, more generally, as a file containing columns. The separator is detected automatically. The column names are expected on the first line, but this is not mandatory. The class walks through the file with an iterator, and every line is automatically converted into a dictionary { column : value }. If the class was able to guess the type of each column, the values are converted automatically.

from pyensae.sql.file_text_binary_columns import TextFileColumns

# filename is the path to a file; the separator is unknown,
# so the class determines it automatically,
# as well as the columns and their types
f = TextFileColumns(filename)
f.open()
for d in f:
    print(d)       # d is a dictionary { column: value }
f.close()
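The automatic separator detection mentioned above can be illustrated with a small standard-library sketch. It is not the code used by pyensae: the helper names `guess_separator` and `iter_rows` and the list of candidate separators are assumptions made for illustration. The idea shown here is to keep the first candidate that appears the same non-zero number of times on every sampled line, then to yield one dictionary per row.

```python
import io

CANDIDATES = ["\t", ";", ",", "|"]  # assumed candidate separators


def guess_separator(lines):
    """Return the candidate whose count is identical and non-zero on every line."""
    for sep in CANDIDATES:
        counts = {line.count(sep) for line in lines}
        if len(counts) == 1 and counts != {0}:
            return sep
    raise ValueError("no consistent separator found")


def iter_rows(stream, nb_line_guess=100):
    """Yield one dictionary {column: value} per line; the first line is the header."""
    lines = [line.rstrip("\n") for line in stream if line.strip()]
    sep = guess_separator(lines[:nb_line_guess])
    header = lines[0].split(sep)
    for line in lines[1:]:
        yield dict(zip(header, line.split(sep)))


sample = io.StringIO("name\tage\nalice\t30\nbob\t25\n")
rows = list(iter_rows(sample))
```

In this sketch the values stay strings; type conversion is a separate step.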

Attributes and their meaning:

  • _force_header – there is a header even if it is not detected

  • _force_noheader – there is no header even if one is detected

  • _changes – replaces the column names

  • _regexfix – imposes a regular expression to interpret a line instead of the automatically built one

  • _filter_dict – a function which takes a dictionary and returns a boolean telling whether the line must be considered

  • _fields – names of the columns (if there is no header)

Spaces and non-ASCII characters cannot be used in a column name: the name must be usable as a named group in a regular expression.
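The constraint on column names can be seen with Python's own `re` module. The following sketch only illustrates how named groups map a line to a dictionary and why a name containing a space is rejected; the pattern and the helper `is_valid_group_name` are hypothetical and not taken from pyensae.

```python
import re

# Hypothetical pattern for a two-column, tab-separated line;
# each column name is a named group.
pattern = re.compile(r"(?P<query>[^\t]*)\t(?P<count>[^\t]*)")

match = pattern.match("weather paris\t42")
row = match.groupdict()  # one entry per named group


def is_valid_group_name(name):
    """Return True if `name` compiles as a named group."""
    try:
        re.compile("(?P<%s>.*)" % name)
        return True
    except re.error:
        return False


valid = is_valid_group_name("query_count")
invalid = is_valid_group_name("query count")  # contains a space
```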


Parameters
  • filename – filename

  • errors – see str (errors = …)

  • fLOG – LOG function, see fLOG

  • force_header – treats the first line as the column header, whether or not it looks like one

  • changes – renames columns; gives the correspondence, for example { "query": "query___" }; it can be a list if there is no header and you want to name the columns

  • force_noheader – there is no header at all

  • regex – specifies a different regular expression (only if changes is a list); if it is a dictionary, the class replaces the default expression of a field with the one associated with that field in regex

  • filter – None if there is no filter, otherwise a function which takes a dictionary and returns a boolean telling whether the line must be considered

  • fields – when there is no header, these fields name the columns

  • keep_text_when_bad_type – keeps the value when the type conversion does not work

  • break_at – if != -1, stops when this limit is reached

  • strip_space – removes spaces around columns if True

  • force_sep – if not None, imposes a column separator

  • nb_line_guess – number of lines used to guess types

  • mistake – no more than mistake failed conversions to numbers are allowed

  • encoding – encoding

  • strict_separator – strict number of columns; it assumes the separator never appears in the content of any column
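How nb_line_guess, mistake, keep_text_when_bad_type and filter might interact can be sketched with plain Python. This is an illustration under assumptions, not pyensae's implementation: the helpers `guess_type` and `convert` are hypothetical, and returning None when a conversion fails and the text is not kept is a choice made for this sketch only.

```python
def guess_type(values, mistake=3):
    """Return int, float or str for a column, tolerating up to `mistake` failures."""
    for candidate in (int, float):
        failures = 0
        for value in values:
            try:
                candidate(value)
            except ValueError:
                failures += 1
        if failures <= mistake:
            return candidate
    return str


def convert(value, kind, keep_text_when_bad_type=False):
    """Convert `value`; on failure keep the text or return None (assumption)."""
    try:
        return kind(value)
    except ValueError:
        return value if keep_text_when_bad_type else None


# Sampled values of one column (in practice, the first nb_line_guess lines).
ages = ["30", "25", "n/a", "41"]
kind = guess_type(ages, mistake=1)
converted = [convert(v, kind, keep_text_when_bad_type=True) for v in ages]

# A filter takes a row dictionary and returns a boolean.
rows = [{"age": a} for a in converted]
kept = [r for r in rows if isinstance(r["age"], int) and r["age"] >= 30]
```

With mistake=1 the single bad value does not prevent the column from being read as integers; with mistake=0 the same column would fall back to text.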


__init__(filename, errors=None, fLOG=<function noLOG>, force_header=False, changes=None, force_noheader=False, regex=None, filter=None, fields=None, keep_text_when_bad_type=False, break_at=-1, strip_space=True, force_sep=None, nb_line_guess=100, mistake=3, encoding='utf-8', strict_separator=False)[source]

__iter__()[source]
Returns

an iterator over the rows; each row is a dictionary { column_name: value }


__str__()[source]

Returns the header.


static _store(output, l, encoding='utf-8')[source]

Stores a list of dictionaries into a file (adds a header).

Parameters
  • output – filename

  • l – list of dictionaries { key: value }

  • encoding – encoding

Warning

format is utf-8
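Writing a list of dictionaries with a header line can be sketched with the standard `csv` module. This is only an illustration of the idea: the function name `store_rows`, the tab separator and the alphabetical column order are assumptions, not the behaviour of `_store`.

```python
import csv
import io


def store_rows(stream, rows, sep="\t"):
    """Write a header line followed by one line per dictionary."""
    # Collect every key so that rows with missing keys still fit the header.
    columns = sorted({key for row in rows for key in row})
    writer = csv.DictWriter(stream, fieldnames=columns, delimiter=sep,
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
store_rows(buffer, [{"name": "alice", "age": 30}, {"name": "bob", "age": 25}])
text = buffer.getvalue()
```

To write to a real file instead, the stream could be opened with `open(output, "w", encoding="utf-8", newline="")`.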


close()[source]

Closes the file and removes all information related to the format; the next time it is opened, the format will be checked again.


static fusion(key, files, output, force_header=False, encoding='utf-8', fLOG=<function noLOG>)[source]

Merges several files with the same columns (the column order may differ).

Parameters
  • key – columns to be compared

  • files – list of files

  • output – output file

  • force_header – impose the first line as a header

  • encoding – encoding

  • fLOG – logging function

Warning

All files are assumed to be sorted by the columns in key.
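The requirement that every input is already sorted by key suggests a merge of sorted sequences. The sketch below shows that general technique with `heapq.merge` on in-memory rows; it is one plausible reading, not a description of what fusion actually does, and `merge_sorted_rows` is a hypothetical helper.

```python
import heapq
from operator import itemgetter


def merge_sorted_rows(key, *sources):
    """Merge iterables of row dictionaries that are each sorted by `key`."""
    return list(heapq.merge(*sources, key=itemgetter(*key)))


first = [{"id": 1, "v": "a"}, {"id": 4, "v": "d"}]
second = [{"v": "b", "id": 2}, {"v": "c", "id": 3}]  # same columns, other order
merged = merge_sorted_rows(["id"], first, second)
merged_ids = [row["id"] for row in merged]
```

Because rows are dictionaries, the order of the columns in each source does not matter; only the key columns are compared.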


get_columns()[source]
Returns

the columns


open()[source]

Opens the file and finds out whether there is a header, what the columns are and what their types are. Any information about the detected format is logged.


sort(output, key, maxmemory=268435456, folder=None, fLOG=<function noLOG>)[source]

Sorts a text file, even a big one; one or several columns give the order.

Parameters
  • output – output file result

  • key – the lines are sorted according to these columns

  • maxmemory – the file is split into smaller files which contain no more than maxmemory lines

  • folder – the function needs to create temporary files, this folder will contain them before they get removed

  • fLOG – logging function

Returns

Warning

The file is assumed not to be open.
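Splitting a big file into sorted temporary files and merging them is the classic external merge sort. The sketch below illustrates that technique with the standard library only; it is not pyensae's code, and `external_sort` and its `max_lines` argument are names chosen for this example. It returns a list for brevity, whereas a real implementation would stream the merged lines to the output file.

```python
import heapq
import os
import tempfile


def external_sort(lines, key, max_lines, folder=None):
    """Sort `lines` using sorted temporary chunks of at most `max_lines` lines."""
    paths = []
    chunk = []

    def flush():
        # Write the current chunk, sorted, to its own temporary file.
        if not chunk:
            return
        handle, path = tempfile.mkstemp(dir=folder, text=True)
        with os.fdopen(handle, "w", encoding="utf-8") as f:
            f.writelines(line + "\n" for line in sorted(chunk, key=key))
        paths.append(path)
        chunk.clear()

    for line in lines:
        chunk.append(line)
        if len(chunk) >= max_lines:
            flush()
    flush()

    files = [open(path, "r", encoding="utf-8") for path in paths]
    try:
        # Merge the sorted chunks; only one line per chunk is compared at a time.
        streams = [(line.rstrip("\n") for line in f) for f in files]
        return list(heapq.merge(*streams, key=key))
    finally:
        for f in files:
            f.close()
        for path in paths:
            os.remove(path)  # temporary files are removed afterwards


data = ["3\tc", "1\ta", "2\tb", "5\te", "4\td"]
result = external_sort(data, key=lambda line: int(line.split("\t")[0]), max_lines=2)
```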
