Feature or enhancement
Proposal:
re._compiler._optimize_charset builds a 256-byte (or wider) charmap for character classes. On main it is filled with a loop:
for i in range(av[0], av[1] + 1):
charmap[i] = 1
Replacing the loop with a single bytearray slice is faster.
Benchmark results on pyperformance regex_compile
main (86.2 ± 1.4 ms) -> PR (78.5 ± 1.6 ms): 1.10x faster
Microbenchmarks:
compile-charsets-small: 2.80 ± 0.04 ms -> 1.22 ± 0.02 ms: 2.29x faster
compile-charsets-big: 1.46 ± 0.03 ms -> 1.36 ± 0.03 ms: 1.08x faster
Geometric mean: 1.57x faster
The "small" case oversamples wide ranges ([\x00-\xff], BMP range) which benefit most. The "big" case has mostly small ASCII ranges where the slice-fill construction (b'\x01' * n) competes with the loop savings.
microbench script
"""Microbench for the bytearray slice-fill optimization in
re._compiler._optimize_charset (RANGE handler)."""
import pyperf
import re
SMALL_PATTERNS = [
r'[a-zA-Z0-9_]+',
r'[^\s\d]+',
r'[a-z][A-Z][0-9]',
r'[\x00-\xff]', # full byte range
r'[Ā-]+', # BMP range, hits growth path
r'[a-zA-Z0-9!@#$%^&*()_+\-=\[\]{};:\'",.<>/?\\|`~]+',
r'[0-9a-fA-F]{2,8}',
r'[ -~]+', # all printable ASCII
]
BIG_PATTERN = (
r'^([a-zA-Z][a-zA-Z0-9_\-]*)\s*'
r'([\x20-\x7e]*)\s*'
r'([-ÿĀ-ɏ]+)?\s*'
r'([0-9]{1,4}[\-/.][0-9]{1,2}[\-/.][0-9]{1,4})?\s*'
r'([^\s,;:]+)$'
)
def compile_small():
for _ in range(8):
for p in SMALL_PATTERNS:
re.purge()
re.compile(p)
def compile_big():
for _ in range(10):
re.purge()
re.compile(BIG_PATTERN)
runner = pyperf.Runner()
runner.bench_func('compile-charsets-small', compile_small)
runner.bench_func('compile-charsets-big', compile_big)
Has this already been discussed elsewhere?
No response given
Links to previous discussion of this feature:
No response
Linked PRs
Feature or enhancement
Proposal:
re._compiler._optimize_charsetbuilds a 256-byte (or wider)charmapfor character classes. On main it is filled with a loop:Replacing the loop with a single
bytearrayslice is faster.Benchmark results on pyperformance
regex_compileMicrobenchmarks:
The "small" case oversamples wide ranges (
[\x00-\xff], BMP range) which benefit most. The "big" case has mostly small ASCII ranges where the slice-fill construction (b'\x01' * n) competes with the loop savings.microbench script
Has this already been discussed elsewhere?
No response given
Links to previous discussion of this feature:
No response
Linked PRs