module texthelper.edit_text_diff
Short summary
module pyquickhelper.texthelper.edit_text_diff
Improves text comparison.
Functions
function | truncated documentation
---|---
diff2html | Produces an HTML report with differences between rows1 and rows2.
edit_distance_string | Computes the edit distance between strings s1 and s2.
edit_distance_text | Computes an edit distance between lines of a text.
Documentation
Improves text comparison.
- pyquickhelper.texthelper.edit_text_diff.diff2html(rows1, rows2, equals, aligned, two_columns=False)
Produces an HTML report with differences between rows1 and rows2.
- Parameters:
rows1 – first set of rows
rows2 – second set of rows
equals – third output of edit_distance_text
aligned – fourth output of edit_distance_text
two_columns – displays the differences on two columns
- Returns:
HTML text
- pyquickhelper.texthelper.edit_text_diff.edit_distance_string(s1, s2, cmp_cost=1.0)
Computes the edit distance between strings s1 and s2.
- Parameters:
s1 – first string
s2 – second string
- Returns:
dist, list of tuples of aligned characters
Another version is implemented in module cpyquickhelper. It is written in C++, which makes it around 25 times faster than the Python implementation.
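To illustrate what a character-level edit distance with an alignment looks like, here is a minimal pure-Python sketch. It is not the library's code: the function name `edit_distance_sketch`, the unit cost for insertions and deletions, the reading of `cmp_cost` as the cost of substituting two different characters, and the choice to return the alignment as pairs of indices are all assumptions made for this example.

```python
def edit_distance_sketch(s1, s2, cmp_cost=1.0):
    """Levenshtein-style distance with a backtracked alignment.

    Assumption: insertions and deletions cost 1.0 and substituting two
    different characters costs ``cmp_cost``; the library may weigh
    these operations differently.
    """
    n, m = len(s1), len(s2)
    # dist[i][j] is the cost of turning s1[:i] into s2[:j].
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = float(i)
    for j in range(1, m + 1):
        dist[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if s1[i - 1] == s2[j - 1] else cmp_cost
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,      # delete s1[i-1]
                dist[i][j - 1] + 1.0,      # insert s2[j-1]
                dist[i - 1][j - 1] + sub,  # match or substitute
            )

    # Walk back from the bottom-right corner to recover one alignment.
    # Each pair holds (index in s1, index in s2); None marks a gap.
    aligned = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0.0 if s1[i - 1] == s2[j - 1] else cmp_cost
            if dist[i][j] == dist[i - 1][j - 1] + sub:
                aligned.append((i - 1, j - 1))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1.0:
            aligned.append((i - 1, None))
            i -= 1
        else:
            aligned.append((None, j - 1))
            j -= 1
    aligned.reverse()
    return dist[n][m], aligned


d, pairs = edit_distance_sketch("kitten", "sitting")
print(d)      # distance between the two strings
print(pairs)  # one optimal alignment as index pairs
```

The double loop is quadratic in the string lengths, which is the kind of workload where a compiled implementation pays off.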
- pyquickhelper.texthelper.edit_text_diff.edit_distance_text(rows1, rows2, strategy='full', verbose=False, return_matrices=False, **thresholds)
Computes an edit distance between lines of a text.
- Parameters:
rows1 – first set of rows
rows2 – second set of rows
strategy – strategy to match lines (see below)
verbose – if True, show progress with tqdm
return_matrices – return distances and predecessor matrices as well
thresholds – see below
- Returns:
distance, list of tuples of aligned lines, distance and alignment for each pair of aligned lines, and finally an array with the aligned line numbers for both texts
Strategies:
- ‘full’: computes all edit distances between all lines
Thresholds:
- ‘threshold’: two lines can match if the edit distance is not too big; a low threshold means no match (default is 0.5)
- ‘insert_len’: variable cost of insertion (default is 1.)
- ‘insert_cst’: fixed cost of insertion (default is 1.)
- ‘weight_cmp’: weight for comparison cost (default is 2.)
- ‘cmp_cost’: cost of a bad comparison (default is 2 * insert_len)
Note
The full Python implementation is quite slow. Function edit_distance_string is also implemented in C++ in module cpyquickhelper. If this module is installed and recent enough, this function uses that version, which is about 25 times faster.
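The idea of an edit distance between lines, with a threshold deciding which lines may be paired, can be sketched in a few lines of pure Python. This is an illustration only, not the library's algorithm. The names `char_distance` and `line_distance_sketch` are hypothetical, and so are the two cost rules: inserting or deleting a line is assumed to cost `insert_cst + insert_len * len(line)`, and two lines are assumed to be matchable only when their character distance divided by the longer length is at most `threshold`. The `weight_cmp` and `cmp_cost` parameters are left out because the text above does not say how they enter the computation.

```python
def char_distance(a, b):
    """Plain Levenshtein distance between two strings (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def line_distance_sketch(rows1, rows2, threshold=0.5,
                         insert_len=1.0, insert_cst=1.0):
    """Edit distance between two lists of lines.

    Assumptions (not taken from the library): inserting or deleting a
    line costs ``insert_cst + insert_len * len(line)``, and two lines
    may be paired only when their character distance divided by the
    longer length is at most ``threshold``.
    """
    def ins(line):
        return insert_cst + insert_len * len(line)

    n, m = len(rows1), len(rows2)
    inf = float("inf")
    # dist[i][j] is the cost of turning rows1[:i] into rows2[:j].
    dist = [[inf] * (m + 1) for _ in range(n + 1)]
    dist[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:  # drop a line of rows1
                dist[i][j] = min(dist[i][j],
                                 dist[i - 1][j] + ins(rows1[i - 1]))
            if j > 0:  # add a line of rows2
                dist[i][j] = min(dist[i][j],
                                 dist[i][j - 1] + ins(rows2[j - 1]))
            if i > 0 and j > 0:  # pair two lines if they are close enough
                d = char_distance(rows1[i - 1], rows2[j - 1])
                longest = max(len(rows1[i - 1]), len(rows2[j - 1]), 1)
                if d / longest <= threshold:
                    dist[i][j] = min(dist[i][j], dist[i - 1][j - 1] + d)
    return dist[n][m]


old = ["abc", "def"]
new = ["abc", "dxf"]
print(line_distance_sketch(old, new))                 # lines are paired
print(line_distance_sketch(old, new, threshold=0.0))  # only exact matches
```

Under these assumptions a threshold of 0 only allows identical lines to be paired, so any modified line is counted as one deletion plus one insertion. Each cell of the line-level table calls a character-level distance, which is why the full strategy is slow in pure Python.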