module texthelper.edit_text_diff

Short summary

module pyquickhelper.texthelper.edit_text_diff

Improves text comparison.

source on GitHub

Functions

function              truncated documentation
diff2html             Produces an HTML report with differences between rows1 and rows2.
edit_distance_string  Computes the edit distance between strings s1 and s2.
edit_distance_text    Computes an edit distance between lines of a text.

Documentation

Improves text comparison.

pyquickhelper.texthelper.edit_text_diff.diff2html(rows1, rows2, equals, aligned, two_columns=False)

Produces an HTML report with differences between rows1 and rows2.

Parameters:
  • rows1 – first set of rows

  • rows2 – second set of rows

  • equals – third output of edit_distance_text

  • aligned – fourth output of edit_distance_text

  • two_columns – displays the differences on two columns

Returns:

HTML text
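
To make the idea of an HTML difference report concrete, here is a minimal sketch that does not use pyquickhelper. The function name render_diff_sketch and the shape of its aligned argument (a list of (i, j) index pairs, with None marking a line missing on one side) are assumptions made for this illustration; they are not the equals and aligned inputs that diff2html expects.

```python
import html


def render_diff_sketch(rows1, rows2, aligned):
    """Render aligned lines as a two-column HTML table.

    Hypothetical helper: `aligned` is a list of (i, j) pairs where i indexes
    rows1 and j indexes rows2; None means the line exists on one side only.
    """
    out = ["<table>"]
    for i, j in aligned:
        left = "" if i is None else html.escape(rows1[i])
        right = "" if j is None else html.escape(rows2[j])
        if i is None:
            color = "#cfc"  # line only in rows2
        elif j is None:
            color = "#fcc"  # line only in rows1
        elif rows1[i] != rows2[j]:
            color = "#ffc"  # matched but modified line
        else:
            color = "#fff"  # identical lines
        out.append('<tr style="background:%s"><td>%s</td><td>%s</td></tr>'
                   % (color, left, right))
    out.append("</table>")
    return "\n".join(out)


report = render_diff_sketch(["a < b"], ["a < b", "c"], [(0, 0), (None, 1)])
```

Escaping each line with html.escape keeps characters such as < from breaking the generated markup.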

pyquickhelper.texthelper.edit_text_diff.edit_distance_string(s1, s2, cmp_cost=1.0)

Computes the edit distance between strings s1 and s2.

Parameters:
  • s1 – first string

  • s2 – second string

  • cmp_cost – cost of a bad comparison (default is 1.0)

Returns:

dist, list of tuples of aligned characters

Another version is implemented in module cpyquickhelper. It uses C++, which makes it around 25 times faster than the Python implementation.
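
For readers unfamiliar with the technique, the following is a minimal sketch of an edit distance that also returns the aligned characters. It is a textbook dynamic-programming version written for illustration, not the code of edit_distance_string; the name edit_distance_sketch, the unit cost for insertions and deletions, and the use of None for a missing character are assumptions.

```python
def edit_distance_sketch(s1, s2, cmp_cost=1.0):
    """Return (distance, aligned) for strings s1 and s2.

    Assumptions of this sketch: insertions and deletions cost 1.0,
    replacing a character costs cmp_cost, and aligned is a list of
    (c1, c2) pairs where None marks an inserted or deleted character.
    """
    n1, n2 = len(s1), len(s2)
    # dist[i][j] is the cost of turning s1[:i] into s2[:j].
    dist = [[0.0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        dist[i][0] = float(i)
    for j in range(1, n2 + 1):
        dist[0][j] = float(j)
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            sub = 0.0 if s1[i - 1] == s2[j - 1] else cmp_cost
            dist[i][j] = min(dist[i - 1][j] + 1.0,      # deletion
                             dist[i][j - 1] + 1.0,      # insertion
                             dist[i - 1][j - 1] + sub)  # match or replace

    # Walk back from the last cell to recover one optimal alignment.
    aligned = []
    i, j = n1, n2
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0.0 if s1[i - 1] == s2[j - 1] else cmp_cost
            if dist[i][j] == dist[i - 1][j - 1] + sub:
                aligned.append((s1[i - 1], s2[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1.0:
            aligned.append((s1[i - 1], None))
            i -= 1
        else:
            aligned.append((None, s2[j - 1]))
            j -= 1
    aligned.reverse()
    return dist[n1][n2], aligned


d, pairs = edit_distance_sketch("kitten", "sitting")
```

The double loop over both strings is what makes a pure Python version slow on long inputs, which is consistent with the speed-up reported for the C++ version.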

pyquickhelper.texthelper.edit_text_diff.edit_distance_text(rows1, rows2, strategy='full', verbose=False, return_matrices=False, **thresholds)

Computes an edit distance between lines of a text.

Parameters:
  • rows1 – first set of rows

  • rows2 – second set of rows

  • strategy – strategy to match lines (see below)

  • verbose – if True, show progress with tqdm

  • return_matrices – return distances and predecessor matrices as well

  • thresholds – see below

Returns:

distance, list of tuples of aligned lines, distance and alignment for each pair of aligned lines, and finally an array with the aligned line numbers for both texts

Strategies:
  • ‘full’: computes all edit distances between all lines

Thresholds:
  • ‘threshold’: two lines can match if the edit distance is not too big, a low threshold means no match (default is 0.5)

  • ‘insert_len’: variable cost of insertion (default is 1.)

  • ‘insert_cst’: fixed cost of insertion (default is 1.)

  • ‘weight_cmp’: weight for comparison cost (default is 2.)

  • ‘cmp_cost’: cost of a bad comparison, default is 2 * insert_len

Note

The pure Python implementation is quite slow. Function edit_distance_string is also implemented in C++ in module cpyquickhelper. If that module is installed and recent enough, this function uses that version, which is about 25 times faster.
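
The sketch below shows one plausible way a line-level edit distance with such thresholds could work. It is not the implementation of edit_distance_text. The cost formulas are assumptions made for this example: inserting a line costs insert_cst + insert_len * len(line), two lines may match only if their character distance divided by the longer length is at most threshold, and a match costs weight_cmp times the character distance. The names char_distance and line_distance_sketch are hypothetical.

```python
def char_distance(a, b):
    """Plain Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def line_distance_sketch(rows1, rows2, threshold=0.5, insert_len=1.0,
                         insert_cst=1.0, weight_cmp=2.0):
    """Return (distance, aligned) between two lists of lines.

    aligned is a list of (i, j) line indices; None marks a line that
    exists in only one of the two texts. Cost formulas are assumptions.
    """
    def insert_cost(row):
        return insert_cst + insert_len * len(row)

    n1, n2 = len(rows1), len(rows2)
    dist = [[float("inf")] * (n2 + 1) for _ in range(n1 + 1)]
    pred = [[None] * (n2 + 1) for _ in range(n1 + 1)]
    dist[0][0] = 0.0
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            candidates = []
            if i > 0:  # line of rows1 left unmatched
                candidates.append(
                    (dist[i - 1][j] + insert_cost(rows1[i - 1]), (i - 1, j)))
            if j > 0:  # line of rows2 left unmatched
                candidates.append(
                    (dist[i][j - 1] + insert_cost(rows2[j - 1]), (i, j - 1)))
            if i > 0 and j > 0:
                r1, r2 = rows1[i - 1], rows2[j - 1]
                d = char_distance(r1, r2)
                # Lines that differ too much are not allowed to match.
                if d / max(len(r1), len(r2), 1) <= threshold:
                    candidates.append(
                        (dist[i - 1][j - 1] + weight_cmp * d, (i - 1, j - 1)))
            if candidates:
                dist[i][j], pred[i][j] = min(candidates)

    # Follow the predecessors back to (0, 0) to rebuild the alignment.
    aligned = []
    pos = (n1, n2)
    while pos != (0, 0):
        i, j = pos
        pi, pj = pred[i][j]
        aligned.append((i - 1 if pi < i else None, j - 1 if pj < j else None))
        pos = (pi, pj)
    aligned.reverse()
    return dist[n1][n2], aligned


old = ["def f(x):", "    return x + 1"]
new = ["def f(x):", "    # add one", "    return x + 1"]
total, pairs = line_distance_sketch(old, new)
```

Every pair of lines is compared here, which mirrors the ‘full’ strategy described above and explains why a faster character-level distance matters for long texts.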
