module helpers.pypdf_helper

Short summary

module ensae_teaching_cs.helpers.pypdf_helper

Global functions to manipulate PDF files


Functions

function truncated documentation
pdf_read_content Extracts the text from a PDF file.

Documentation


ensae_teaching_cs.helpers.pypdf_helper.pdf_read_content(filename)

Extracts the text from a PDF file.

Parameters: filename – (str) filename
Returns: content (string)

The module was modified to introduce spaces. The method is not very robust because it does not take the size of characters into account. However, the method PageObject.extractText could be modified to deal with this, either by introducing statistics or through a better knowledge of the PDF format.

The best way to introduce spaces and line breaks is to study the distribution of distances between consecutive characters, assuming we would find a few modes:

  • one for characters on the same line and from the same word,
  • one for characters on the same line but separated by a space,
  • one for characters on two different lines.

This method has not been implemented yet.
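
The idea above could be sketched as follows. This is only an illustration, not the module's code: the input format (a list of `(char, x, y, width)` tuples in reading order) and the names `split_threshold` and `rebuild_text` are assumptions made for the example. It also simplifies the three modes: line breaks are detected by a change of baseline, and the two remaining modes are separated at the largest jump between sorted gap values.

```python
def split_threshold(values):
    """Return the midpoint of the largest jump between sorted distinct values.

    Returns None when there are fewer than two distinct values,
    i.e. when only one mode is visible.
    """
    ordered = sorted(set(values))
    if len(ordered) < 2:
        return None
    # (size of the jump, midpoint of the jump) for each pair of neighbours
    jumps = [(b - a, (a + b) / 2) for a, b in zip(ordered, ordered[1:])]
    return max(jumps)[1]


def rebuild_text(chars):
    """Rebuild text from positioned characters.

    chars is a hypothetical list of (char, x, y, width) tuples
    given in reading order.
    """
    if not chars:
        return ""
    pairs = list(zip(chars, chars[1:]))
    # horizontal gaps between consecutive characters on the same baseline
    gaps = [x2 - (x1 + w1)
            for (_, x1, y1, w1), (_, x2, y2, _) in pairs if y1 == y2]
    threshold = split_threshold(gaps)
    out = [chars[0][0]]
    for (_, x1, y1, w1), (c2, x2, y2, _) in pairs:
        if y1 != y2:
            # different baseline: the character starts a new line
            out.append("\n")
        elif threshold is not None and x2 - (x1 + w1) > threshold:
            # gap belongs to the "wide" mode: insert a space
            out.append(" ")
        out.append(c2)
    return "".join(out)


chars = [("a", 0, 0, 5), ("b", 5, 0, 5), ("c", 13, 0, 5), ("d", 18, 0, 5),
         ("e", 0, 10, 5), ("f", 5, 10, 5)]
print(repr(rebuild_text(chars)))
```

Because the gap is measured from the right edge of the previous character, the sketch takes character width into account. It would still insert false spaces on a line whose only gap variation comes from kerning, so a real implementation would need a more careful estimate of the modes.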

If a line ends with '-', it is assumed that a word was split across two lines. The hyphen is replaced by '---'.

This function only works with sdpython/pyPdf. The module was modified to work better with spaces. Every line ending with '---' is a split word.
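
As an illustration of how the '---' marker might be used afterwards, the sketch below glues each line ending with '---' to the line that follows it. The helper name `join_split_words` is hypothetical and is not part of the module. It assumes the content uses "\n" as the line separator.

```python
def join_split_words(text, marker="---"):
    """Merge every line ending with `marker` with the next line.

    Hypothetical post-processing helper, not part of the module.
    """
    out = []
    carry = ""  # beginning of a split word waiting for its second half
    for line in text.split("\n"):
        line = carry + line
        if line.endswith(marker):
            # drop the marker and keep the text for the next line
            carry = line[:-len(marker)]
        else:
            out.append(line)
            carry = ""
    if carry:
        # the last line ended with the marker: keep what was read
        out.append(carry)
    return "\n".join(out)


print(join_split_words("an exam---\nple of text\nnext line"))
```

With the content returned by pdf_read_content, this would be called as `join_split_words(content)`.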
