module filehelper.content_helper

Short summary

module pyensae.filehelper.content_helper

Various functions to process text


Functions

function – truncated documentation

  • enumerate_grep – Extracts lines matching a regular expression.

  • file_encoding – Returns the encoding of a file. The function relies on chardet. …

  • file_head – Extracts the first nbline lines of a file (assuming it is a text file).

  • file_tail – Extracts the last nbline lines of a file (assuming it is a text file).

  • replace_comma_by_point – Replaces every comma with a point in a file (the file is modified in place).

Documentation


pyensae.filehelper.content_helper.enumerate_grep(filename, regex, encoding='utf8', errors=None)

Extracts lines matching a regular expression.

Parameters:
  • filename – filename

  • regex – regular expression

  • encoding – encoding

  • errors – see open

Returns:

iterator over the matching lines

New in version 1.1.
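The behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the pyensae source: the name `grep_lines` is hypothetical, and the sketch assumes that a line matches when `re.search` finds the pattern anywhere in it.

```python
import re


def grep_lines(filename, regex, encoding="utf8", errors=None):
    """Yield the lines of a text file that match a regular expression.

    Hypothetical stand-in for enumerate_grep; matching uses re.search,
    which is an assumption about the original behaviour.
    """
    pattern = re.compile(regex)
    # errors is passed straight to open(), as the parameter list suggests.
    with open(filename, "r", encoding=encoding, errors=errors) as f:
        for line in f:
            if pattern.search(line):
                # Lines are yielded lazily and keep their line ending.
                yield line
```

Because the function is a generator, the file is read one line at a time, so large files can be filtered without loading them fully into memory.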


pyensae.filehelper.content_helper.file_encoding(filename_or_bytes, limit=1048576)

Returns the encoding of a file. The function relies on chardet.

Parameters:
  • filename_or_bytes – filename or bytes

  • limit – if filename_or_bytes is a file, the function only loads the first limit bytes (or all if limit is -1)

Returns:

dictionary

Example of results:

{'encoding': 'EUC-JP', 'confidence': 0.99}
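The role of limit can be sketched as follows. The helper names `read_prefix` and `guess_encoding` are hypothetical and the code is not the pyensae implementation; the only third-party call used is `chardet.detect`, which takes bytes and returns a dictionary like the one shown above. The sketch assumes chardet is installed when `guess_encoding` is called.

```python
def read_prefix(filename_or_bytes, limit=1048576):
    """Return the bytes to analyse (hypothetical helper).

    Bytes are returned unchanged; for a filename, only the first
    `limit` bytes are read, or the whole file if limit is -1.
    """
    if isinstance(filename_or_bytes, bytes):
        return filename_or_bytes
    with open(filename_or_bytes, "rb") as f:
        return f.read() if limit == -1 else f.read(limit)


def guess_encoding(filename_or_bytes, limit=1048576):
    """Detect the encoding with chardet (assumed to be installed)."""
    import chardet  # third-party package, imported lazily

    return chardet.detect(read_prefix(filename_or_bytes, limit))
```

Reading only a prefix keeps detection fast on large files, at the cost of a less reliable guess when the distinctive characters appear late in the file.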


pyensae.filehelper.content_helper.file_head(filename: str, nbline=10, encoding='utf8', errors='strict')

Extracts the first nbline lines of a file (assuming it is a text file).

Parameters:
  • filename – filename

  • nbline – number of lines

  • encoding – encoding

  • errors – see open

Returns:

list of lines
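A minimal sketch of the same idea, assuming the returned lines keep their line endings (the original may differ on that point). The name `head_lines` is hypothetical.

```python
from itertools import islice


def head_lines(filename, nbline=10, encoding="utf8", errors="strict"):
    """Return the first nbline lines of a text file (hypothetical helper)."""
    with open(filename, "r", encoding=encoding, errors=errors) as f:
        # islice stops after nbline lines, so the rest of the file is not read.
        return list(islice(f, nbline))
```

If the file holds fewer than nbline lines, the list simply contains every line of the file.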


pyensae.filehelper.content_helper.file_tail(filename: str, nbline=10, encoding='utf8', threshold=16384, errors='strict')

Extracts the last nbline lines of a file (assuming it is a text file).

Parameters:
  • filename – filename

  • nbline – number of lines

  • encoding – encoding

  • threshold – if the file size (in bytes) is above this value, the beginning of the file is not read

  • errors – see open

Returns:

list of lines

In the source code, the line marked A is fragile: if the file is encoded in UTF-8, the cursor may land on a byte in the middle of a multi-byte character, and reading the next line then fails. For that reason the function tries again after moving the cursor forward by one byte (see the line marked B).

The first returned line may be incomplete.
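The mid-character problem can be illustrated with a short sketch. It uses a different technique from the one described above: the end of the file is read in binary mode and decoded afterwards, dropping up to three leading bytes until decoding succeeds (a UTF-8 character has at most three continuation bytes). The name `tail_lines` is hypothetical and this is not the pyensae implementation.

```python
import os


def tail_lines(filename, nbline=10, encoding="utf8", threshold=16384):
    """Return the last nbline lines of a text file (hypothetical helper)."""
    if nbline <= 0:
        return []
    size = os.path.getsize(filename)
    with open(filename, "rb") as f:
        # Skip the beginning of the file when it is larger than threshold.
        f.seek(max(size - threshold, 0))
        data = f.read()
    text = None
    for shift in range(4):
        try:
            # The cursor may have landed inside a multi-byte character;
            # dropping one more leading byte at each attempt realigns it.
            text = data[shift:].decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    if text is None:
        raise ValueError("unable to decode the end of the file")
    # splitlines keeps the line endings; the first line may be incomplete.
    return text.splitlines(keepends=True)[-nbline:]
```

When the file is larger than threshold, the first element of the result can be the tail end of a line that started before the cursor, which is why the first returned line may be incomplete.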


pyensae.filehelper.content_helper.replace_comma_by_point(file)

Replaces every comma with a point in a file (the file is modified in place).

Parameters:
  • file – file to process
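An in-place replacement of this kind can be sketched as below. The name `commas_to_points` and the encoding parameter are assumptions added for illustration; the documented function takes only file.

```python
def commas_to_points(path, encoding="utf8"):
    """Rewrite a text file with every comma replaced by a point.

    Hypothetical helper; the encoding parameter is an assumption and
    is not part of the documented signature.
    """
    with open(path, "r", encoding=encoding) as f:
        text = f.read()
    # The file is overwritten, so the original content is lost.
    with open(path, "w", encoding=encoding) as f:
        f.write(text.replace(",", "."))
```

This is typically useful for numeric files that use a decimal comma, for example turning `1,5;2,25` into `1.5;2.25`. Note that every comma is replaced, including commas used as field separators.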
