module helpers.pypdf_helper

Short summary

module ensae_teaching_cs.helpers.pypdf_helper

Global functions to manipulate PDF files

source on GitHub

Functions

function          truncated documentation
pdf_read_content  Extracts the text from a PDF file.

Documentation

ensae_teaching_cs.helpers.pypdf_helper.pdf_read_content(filename)

Extracts the text from a PDF file.

Parameters

filename – (str) name of the PDF file

Returns

content (str), the extracted text

The underlying pyPdf module was modified to introduce spaces. The method is not very robust because it does not take the size of characters into account. However, the method PageObject.extractText could be modified to handle this, either by introducing statistics or through a better knowledge of the PDF format.

The best way to introduce spaces and ends of lines is to study the distribution of distances between consecutive characters, assuming we would find three modes:

  • one for characters on the same line and from the same word,

  • one for characters on the same line but separated by a space,

  • one for characters on two different lines.

This method has not been implemented yet.
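The idea of separating distances into three modes can be illustrated with a minimal sketch. It is not the module's implementation (the method is not implemented there). It assumes that each character comes with hypothetical (x, y) coordinates in reading order, and it replaces a real mode estimation with a crude rule: the two thresholds are placed in the middle of the two widest gaps between the sorted distances. The names split_into_modes and rebuild_text are made up for this example.

```python
import math


def split_into_modes(distances, n_modes=3):
    """Return the thresholds that split the distances into n_modes groups.

    The thresholds sit in the middle of the (n_modes - 1) widest gaps
    between consecutive sorted distinct values. This is a crude stand-in
    for a proper estimation of the modes of the distribution.
    """
    values = sorted(set(distances))
    if len(values) < n_modes:
        raise ValueError("not enough distinct distances to find the modes")
    # (gap width, index of the left value), widest gaps first
    gaps = sorted(
        ((values[i + 1] - values[i], i) for i in range(len(values) - 1)),
        reverse=True,
    )[: n_modes - 1]
    return sorted((values[i] + values[i + 1]) / 2 for _, i in gaps)


def rebuild_text(chars):
    """Insert spaces and ends of lines into a sequence of characters.

    chars is a list of (character, x, y) tuples in reading order
    (hypothetical input format, not what pyPdf returns).
    """
    if not chars:
        return ""
    # distance between each character and the next one
    distances = [
        math.hypot(x2 - x1, y2 - y1)
        for (_, x1, y1), (_, x2, y2) in zip(chars, chars[1:])
    ]
    space_cut, line_cut = split_into_modes(distances)
    out = [chars[0][0]]
    for (char, _, _), dist in zip(chars[1:], distances):
        if dist > line_cut:
            out.append("\n")  # third mode: two different lines
        elif dist > space_cut:
            out.append(" ")   # second mode: same line, separated by a space
        out.append(char)      # first mode: same word, nothing inserted
    return "".join(out)


# made-up coordinates: "ab cd" on one line, "ef" on the next one
sample = [("a", 0, 0), ("b", 5, 0), ("c", 15, 0), ("d", 20, 0),
          ("e", 0, -12), ("f", 5, -12)]
text = rebuild_text(sample)
```

Like the current method, this sketch ignores the size of characters, so it would likely fail on documents that mix font sizes.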

If a line ends with '-', it is assumed that a word was split. The '-' is replaced by '---'.

This function only works with sdpython/pyPdf. The module was modified to work better with spaces. Every line ending with '---' is a split word.
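Since every line ending with '---' marks a split word, the extracted text can be post-processed to glue such words back together. The following is a minimal sketch under that assumption. join_split_words is a hypothetical helper, not a function of the module, and it cannot tell a split word from a genuine hyphenated compound that happened to end a line.

```python
def join_split_words(text, marker="---"):
    """Glue back the words that were split at the end of a line.

    A line ending with the marker is merged with the following line
    and the marker is dropped.
    """
    out = []
    glue = False  # True when the previous line ended with the marker
    for line in text.split("\n"):
        if glue and out:
            # continue the split word on the previous line
            out[-1] += line.lstrip()
        else:
            out.append(line)
        glue = out[-1].endswith(marker)
        if glue:
            # drop the marker, the next line completes the word
            out[-1] = out[-1][: -len(marker)]
    return "\n".join(out)


# made-up extracted text
joined = join_split_words("the func---\ntion reads\na file")
```

The merged line becomes longer than the original one, which is usually acceptable when the text is meant to be searched or indexed rather than laid out again.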
