Skip to content

Improve performance of re.compile() for patterns with character ranges #149427

@eendebakpt

Description

@eendebakpt

Feature or enhancement

Proposal:

re._compiler._optimize_charset builds a 256-byte (or wider) charmap for character classes. On main it is filled with a loop:

for i in range(av[0], av[1] + 1):
    charmap[i] = 1

Replacing the loop with a single bytearray slice is faster.

Benchmark results on pyperformance regex_compile

main (86.2 ± 1.4 ms)  ->  PR (78.5 ± 1.6 ms):  1.10x faster

Microbenchmarks:

compile-charsets-small:  2.80 ± 0.04 ms  ->  1.22 ± 0.02 ms:  2.29x faster
compile-charsets-big:    1.46 ± 0.03 ms  ->  1.36 ± 0.03 ms:  1.08x faster
Geometric mean:                                              1.57x faster

The "small" case oversamples wide ranges ([\x00-\xff], BMP range) which benefit most. The "big" case has mostly small ASCII ranges where the slice-fill construction (b'\x01' * n) competes with the loop savings.

microbench script
"""Microbench for the bytearray slice-fill optimization in
re._compiler._optimize_charset (RANGE handler)."""
import pyperf
import re

SMALL_PATTERNS = [
    r'[a-zA-Z0-9_]+',
    r'[^\s\d]+',
    r'[a-z][A-Z][0-9]',
    r'[\x00-\xff]',                  # full byte range
    r'[Ā-῿]+',                       # BMP range, hits growth path
    r'[a-zA-Z0-9!@#$%^&*()_+\-=\[\]{};:\'",.<>/?\\|`~]+',
    r'[0-9a-fA-F]{2,8}',
    r'[ -~]+',                       # all printable ASCII
]

BIG_PATTERN = (
    r'^([a-zA-Z][a-zA-Z0-9_\-]*)\s*'
    r'([\x20-\x7e]*)\s*'
    r'([-ÿĀ-ɏ]+)?\s*'
    r'([0-9]{1,4}[\-/.][0-9]{1,2}[\-/.][0-9]{1,4})?\s*'
    r'([^\s,;:]+)$'
)

def compile_small():
    for _ in range(8):
        for p in SMALL_PATTERNS:
            re.purge()
            re.compile(p)

def compile_big():
    for _ in range(10):
        re.purge()
        re.compile(BIG_PATTERN)

runner = pyperf.Runner()
runner.bench_func('compile-charsets-small', compile_small)
runner.bench_func('compile-charsets-big', compile_big)

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

No response

Linked PRs

Metadata

Metadata

Assignees

No one assigned

    Labels

    performancePerformance or resource usagestdlibStandard Library Python modules in the Lib/ directorytopic-regextype-featureA feature request or enhancement

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions