Reuse the precompiled regex for collection variable selectors#3573
Reuse the precompiled regex for collection variable selectors#3573chweidling wants to merge 5 commits into
Conversation
resolveRegularExpression() in the collection backends compiled (and JIT-compiled) a fresh Utils::Regex from the pattern string on every call. For a regex variable selector such as TX:/regex/ that is evaluated per transaction, this recompiled the same pattern on every request - even though the calling VariableRegex already holds it compiled once at configuration time in its m_r member. Add a Collection::resolveRegularExpression(Utils::Regex *) overload that accepts the pre-compiled regex. The base class keeps the previous behaviour by default (it compiles from r->pattern and delegates), so backends that do not override it are unaffected; InMemoryPerProcess overrides it to scan the collection with the supplied regex directly. Tx_DictElementRegexp now passes its already-compiled &m_r instead of the pattern string. Behaviour is unchanged: m_r is constructed with the same arguments (Utils::Regex(pattern, /*ignoreCase=*/true)) the backend used, so the identical regex is applied - it is just compiled once instead of per transaction. A regression test covering TX:/regex/ selection is added.
|
The failing Filed as #3574. All other CI jobs (Linux x32/x64 gcc+clang, all configs, macOS) pass. |
|
Hi @chweidling, could you pick these changes or update your branch? Thanks! |
There was a problem hiding this comment.
Pull request overview
This PR optimizes TX:/regex/ collection variable selection by reusing the regex compiled at configuration time (in VariableRegex) instead of recompiling (and JIT-compiling) the same pattern on every transaction. It introduces a new Collection::resolveRegularExpression(Utils::Regex*) overload with a default fallback to the existing string-based method, and adds a regression test covering chained-rule behavior and non-matching key exclusion.
Changes:
- Add a
Collection::resolveRegularExpression(Utils::Regex*)overload (default implementation delegates to the string overload). - Update
TX:/regex/selection to pass the precompiledm_r, and implement a backend override inInMemoryPerProcessto use it directly. - Add regression coverage for precompiled
TX:/regex/selection (match and non-match cases).
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
headers/modsecurity/collection/collection.h |
Adds new public virtual overload accepting a compiled regex. |
src/collection/collection.cc |
Implements the default fallback behavior for the new overload. |
src/collection/backend/in_memory-per_process.h |
Declares backend override for compiled-regex resolution. |
src/collection/backend/in_memory-per_process.cc |
Implements compiled-regex resolution and delegates string overload to it. |
src/variables/tx.h |
Switches TX regex selector to pass the precompiled regex instead of the pattern string. |
src/Makefile.am |
Adds the new collection.cc to the build sources. |
test/test-cases/regression/variable-tx-regex-precompiled.json |
Adds regression tests validating the TX regex selector behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /* | ||
| * Resolve using a regular expression that was already compiled (e.g. the | ||
| * one held by a VariableRegex). This avoids recompiling - and JIT'ing - the | ||
| * same pattern on every transaction. The default implementation delegates | ||
| * to the string overload using the pattern text, so backends that do not | ||
| * override it keep their previous behaviour. | ||
| */ | ||
| virtual void resolveRegularExpression(Utils::Regex *r, | ||
| std::vector<const VariableValue *> *l, | ||
| variables::KeyExclusions &ke); |
There was a problem hiding this comment.
Good catch — this is accurate. Adding a virtual to Collection grows the
vtable, so a Collection backend built out-of-tree against an older
collection.h would be binary-incompatible with a newer libmodsecurity.
A few points on how I'd weigh it:
- I kept it virtual on purpose.
m_collections.m_tx_collectionis typed as
Collection*, so dispatching through the base class is the natural fit. The
alternatives suggested here - a separatedynamic_cast-able interface, or a
non-virtual helper keyed on the concrete type - add indirection at the call
site for what is otherwise a drop-in. The method is also non-pure (it ships
a default that delegates to the existing string overload), so neither in-tree
nor out-of-tree backends are required to implement anything. - libmodsecurity is already SONAME-versioned via
libtool -version-info
(MSC_VERSION_INFO = current:revision:age, derived from the release version),
so the mechanism to flag an ABI change is in place. The remaining question is
a release-policy one rather than a code one.
@airween - this is your call: are you comfortable shipping this as an
ABI-affecting change with the corresponding version-info bump, or would you
prefer I keep the base vtable untouched and confine the optimization to the
concrete InMemoryPerProcess path instead? I can also move the new declaration
to the end of the class to minimize in-tree vtable churn if that helps. Happy
to take whichever direction you prefer.
There was a problem hiding this comment.
Recently I've fixed something where the situation was the same like this. Then I solved it with the suggested method and created a helper.
Actually, I've been working on V4, where the collections will be replaced, because there are several issues. Those changes completely brakes the current API (this is why we need to increase the version).
If you deal with it, you can wait with this PR for V4. Otherwise, it would be nice to avoid to break the ABI.
Note: there will be one more release from v3.0 at least, because we have some security issues, so if you solve this, then it can be a part of the current code. I hope I can do that at the end of this month at last.
(Expected release of V4 is in autumn, or later...)
There was a problem hiding this comment.
Reworked to avoid the ABI break. The introduced virtual methode in collection.h is removed, and the optimization is in InMemoryPerProcess only using a dynamic_cast.
Move the precompiled-regex fast path off the public Collection base class (its vtable is unchanged) and onto the concrete `InMemoryPerProcess` backend. `Tx_DictElementRegexp` reaches it via a guarded `dynamic_cast`, falling back to the string-based resolution for any other backend. Drops the previously added Collection `virtual` and `src/collection/collection.cc`.
|



what
InMemoryPerProcess::resolveRegularExpression(const Utils::Regex *)overload that takes an already-compiled regex.Tx_DictElementRegexp(theTX:/regex/variable selector) now reuses its pre-compiledm_r: itdynamic_casts the TX collection toInMemoryPerProcessand calls the fast path if possible.TX:/regex/selection (positive + negative cases).why
resolveRegularExpression(const std::string&, ...)built a freshUtils::Regex— i.e. apcre2_compile()andpcre2_jit_compile()— on every call.TX:/regex/that is evaluated per transaction (e.g. CRS 921180), this recompiled the same pattern on every request, even though the callingVariableRegexalready holds it compiled once at configuration time inm_r. That compiled regex was simply being ignored.m_ris constructed with the same arguments —Utils::Regex(pattern, /*ignoreCase=*/true)— that the backend used, so the identical regex is applied; it is just compiled once instead of per transaction.v3/master, gcc-O2, system PCRE2 with JIT):references