Example with re2

wrapclib exposes the regular expression library re2 through the Python wrapper pyre2.

from jyquickhelper import add_notebook_menu
add_notebook_menu()
from wrapclib import re2

Example with HTML

import re
s = "<h1>mot</h1>"
print(re.compile("(<.*>)").match(s).groups())
('<h1>mot</h1>',)
s = "<h1>mot</h1>"
print(re2.compile("(<.*>)").match(s).groups())
('<h1>mot</h1>',)
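
Both engines return the whole string because `.*` is greedy: it extends to the last `>`. The following is a minimal sketch, written with the standard `re` module only, of how a lazy quantifier changes the result. It assumes that re2 accepts the same `*?` syntax, and it has not been run against wrapclib.

```python
import re

s = "<h1>mot</h1>"

# Greedy: ".*" extends as far as possible, up to the last ">".
greedy = re.compile("(<.*>)").match(s).groups()

# Lazy: ".*?" stops at the first ">" that lets the pattern match.
lazy = re.compile("(<.*?>)").match(s).groups()

print(greedy)  # ('<h1>mot</h1>',)
print(lazy)    # ('<h1>',)
```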

Group, Span

s = """date 0 : 14/9/2000
date 1 : 20/04/1971     date 2 : 14/09/1913     date 3 : 2/3/1978
date 4 : 1/7/1986     date 5 : 7/3/47     date 6 : 15/10/1914
date 7 : 08/03/1941     date 8 : 8/1/1980     date 9 : 30/6/1976"""

expression = re2.compile(
    # raw string so that \d reaches the regex engine unchanged
    r"([0-3]?[0-9]/[0-1]?[0-9]/([0-2][0-9])?[0-9][0-9])[^\d]")
expression.search(s).group(1, 2)
('14/9/2000', '20')
c = expression.search(s).span(1)
s[c[0]:c[1]]
'14/9/2000'
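
Group 2 (the century) is optional, so it does not always take part in a match. The sketch below uses the standard `re` module to show what `group` and `span` return in that case. The shortened input `text` is an assumption made for illustration, and the re2 wrapper is only presumed to behave the same way.

```python
import re

text = "date 5 : 7/3/47     date 6 : 15/10/1914"
pattern = re.compile(
    r"([0-3]?[0-9]/[0-1]?[0-9]/([0-2][0-9])?[0-9][0-9])[^\d]")

m = pattern.search(text)

both = m.group(1, 2)       # the century group did not participate
span_date = m.span(1)      # start and end offsets of the full date
span_century = m.span(2)   # (-1, -1) marks a group that did not match
extracted = text[m.start(1):m.end(1)]

print(both)          # ('7/3/47', None)
print(span_date)     # (9, 15)
print(span_century)  # (-1, -1)
print(extracted)     # 7/3/47
```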

Names

date = "05/22/2010"
exp = "(?P<jj>[0-9]{1,2})/(?P<mm>[0-9]{1,2})/(?P<aa>((19)|(20))[0-9]{2})"
com = re2.compile(exp)
print(com.search(date).groupdict())
{'aa': '2010', 'jj': '05', 'mm': '22'}
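
`groupdict` only returns the named groups, while the unnamed groups `((19)|(20))` are still numbered and appear in `groups()`. The variant below is a sketch with the standard `re` module, not the notebook's own code: it replaces them with a non-capturing group `(?:19|20)` and converts the named fields to integers.

```python
import re

date = "05/22/2010"
# (?:19|20) groups the alternatives without creating a numbered group.
exp = r"(?P<jj>[0-9]{1,2})/(?P<mm>[0-9]{1,2})/(?P<aa>(?:19|20)[0-9]{2})"
m = re.compile(exp).search(date)

fields = m.groupdict()
numbers = {name: int(value) for name, value in fields.items()}
all_groups = m.groups()

print(fields)      # {'jj': '05', 'mm': '22', 'aa': '2010'}
print(numbers)     # {'jj': 5, 'mm': 22, 'aa': 2010}
print(all_groups)  # ('05', '22', '2010')
```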

findall

re2 does not implement findall natively, so the function was added to the wrapper.

s = """date 0 : 14/9/2000
date 1 : 20/04/1971     date 2 : 14/09/1913     date 3 : 2/3/1978
date 4 : 1/7/1986     date 5 : 7/3/47     date 6 : 15/10/1914
date 7 : 08/03/1941     date 8 : 8/1/1980     date 9 : 30/6/1976"""

expression = re2.compile(
    # raw string so that \d reaches the regex engine unchanged
    r"([0-3]?[0-9]/[0-1]?[0-9]/([0-2][0-9])?[0-9][0-9])[^\d]")

re2.findall(expression, s)
[('14/9/2000', '20'),
 ('20/04/1971', '19'),
 ('14/09/1913', '19'),
 ('2/3/1978', '19'),
 ('1/7/1986', '19'),
 ('7/3/47', None),
 ('15/10/1914', '19'),
 ('08/03/1941', '19'),
 ('8/1/1980', '19')]
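
The result above contains nine dates: date 9 is missing because the pattern requires one non-digit character after the date, and `30/6/1976` ends the string. One possible fix, sketched here with the standard `re` module, is to accept either a non-digit or the end of the string. The `finditer` loop is an assumption chosen to reproduce the `None` shown above for a missing century; it is not presented as wrapclib's implementation.

```python
import re

s = """date 0 : 14/9/2000
date 1 : 20/04/1971     date 2 : 14/09/1913     date 3 : 2/3/1978
date 4 : 1/7/1986     date 5 : 7/3/47     date 6 : 15/10/1914
date 7 : 08/03/1941     date 8 : 8/1/1980     date 9 : 30/6/1976"""

# (?:[^\d]|$) accepts a trailing non-digit or the end of the string.
pattern = re.compile(
    r"([0-3]?[0-9]/[0-1]?[0-9]/([0-2][0-9])?[0-9][0-9])(?:[^\d]|$)")

# group(1, 2) yields None for a group that did not take part in the match.
dates = [m.group(1, 2) for m in pattern.finditer(s)]

print(len(dates))  # 10
print(dates[5])    # ('7/3/47', None)
print(dates[-1])   # ('30/6/1976', '19')
```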

Benchmark

s = """date 0 : 14/9/2000
date 1 : 20/04/1971     date 2 : 14/09/1913     date 3 : 2/3/1978
date 4 : 1/7/1986     date 5 : 7/3/47     date 6 : 15/10/1914
date 7 : 08/03/1941     date 8 : 8/1/1980     date 9 : 30/6/1976"""

expression = re.compile(
    # raw string so that \d reaches the regex engine unchanged
    r"([0-3]?[0-9]/[0-1]?[0-9]/([0-2][0-9])?[0-9][0-9])[^\d]")

%timeit expression.findall(s)
10.5 µs ± 296 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit re2.findall(expression, s)
18.4 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

This is expected, since the findall method is implemented in Python rather than in C.
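
`%timeit` is an IPython magic and is not available in a plain script. The sketch below shows one way to run a comparable measurement with the standard `timeit` module. It only uses the standard `re` module, and `findall_python` is a hypothetical helper that stands in for a findall written in Python; it is not wrapclib's code. Timings depend on the machine, so no expected figures are given.

```python
import re
import timeit

s = """date 0 : 14/9/2000
date 1 : 20/04/1971     date 2 : 14/09/1913     date 3 : 2/3/1978
date 4 : 1/7/1986     date 5 : 7/3/47     date 6 : 15/10/1914
date 7 : 08/03/1941     date 8 : 8/1/1980     date 9 : 30/6/1976"""

pattern = re.compile(
    r"([0-3]?[0-9]/[0-1]?[0-9]/([0-2][0-9])?[0-9][0-9])[^\d]")


def findall_python(compiled, text):
    """Hypothetical findall built in Python on top of finditer."""
    return [m.group(1, 2) for m in compiled.finditer(text)]


def best_time(func, number=1000, repeat=5):
    """Return the best average time per call, in seconds."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


t_native = best_time(lambda: pattern.findall(s))
t_python = best_time(lambda: findall_python(pattern, s))

print(f"native findall: {t_native * 1e6:.1f} µs per call")
print(f"python findall: {t_python * 1e6:.1f} µs per call")
```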