A heuristic-based crate to calculate the specificity of a regular expression pattern against a specific string.
This problem has been discussed for a long time (e.g., the question "Determine regular expression's specificity" from 2010), yet there is still no universally accepted or "standard" solution.
I didn't mean to dig up old posts.
Specificity measures how "precise" a match is. For example, the pattern abc is more specific to the string "abc" than the pattern a.c or .*.
The calculation follows these principles:
- Positional Weighting: Earlier matches contribute more to the total specificity than later ones.
- Certainty: Literals (exact characters) score higher than character classes or wildcards.
- Information Density: Narrower character classes (e.g., [a-z]) score higher than broader ones (e.g., .).
- Branching Penalty: Patterns with many alternatives (alternations) are penalized because they are less specific.
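To make these principles concrete, here is a minimal toy sketch. It is not this crate's implementation: the `Token` enum, the 256-symbol alphabet, the halving positional weight, and the division by the number of alternatives are all assumptions chosen only to illustrate how the four rules could combine into one score.

```rust
/// Hypothetical token kinds for illustration; the crate itself works on a regex HIR.
#[derive(Clone, Copy)]
enum Token {
    /// An exact character such as `a`.
    Literal,
    /// A character class that accepts this many distinct characters, e.g. `[a-z]` is 26.
    Class(u32),
    /// The wildcard `.`.
    Any,
}

/// Assumed alphabet size used to normalise class widths (an illustration-only choice).
const ALPHABET: f64 = 256.0;

/// Certainty and information density: 1.0 for a literal, 0.0 for a wildcard,
/// and something in between for a class, shrinking as the class gets wider.
fn certainty(token: Token) -> f64 {
    match token {
        Token::Literal => 1.0,
        Token::Class(n) => {
            let width = n.max(1) as f64;
            (1.0 - width.ln() / ALPHABET.ln()).clamp(0.0, 1.0)
        }
        Token::Any => 0.0,
    }
}

/// Toy score for one pattern, given its tokens in match order and the
/// number of top-level alternatives it contains.
fn toy_specificity(tokens: &[Token], alternatives: u32) -> f64 {
    let mut weight = 1.0; // positional weighting: earlier tokens count more
    let mut total = 0.0;
    for &token in tokens {
        total += weight * certainty(token);
        weight *= 0.5; // assumed decay factor
    }
    // Branching penalty: more alternatives lower the score.
    total / alternatives.max(1) as f64
}
```

Under these assumptions, a pattern of three literals scores above the same pattern with its middle token replaced by a 26-character class, which in turn scores above the variant with a wildcard in the middle. The actual numbers produced by `get` will differ.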
The get function assumes that the string provided is already a full match for the pattern.
If the pattern does not match the string, the resulting specificity will be mathematically inconsistent and meaningless for comparison purposes.
let string = "abc";
let high = get(string, "abc").unwrap();
let low = get(string, ".*").unwrap();
assert!(high > low);

Since this crate uses a greedy heuristic based on the HIR (High-level Intermediate Representation), certain patterns may yield the same specificity even if they look different.
A common example is when a wildcard .* "swallows" the entire string before other parts of the pattern can be evaluated.
let string = "cat";
let pattern1 = r".*";
let pattern2 = r".*a.*";
assert_eq!(
get(string, pattern1).unwrap(),
get(string, pattern2).unwrap()
);

If you need to distinguish between patterns with identical specificity, we recommend using the pattern length as a secondary tie-breaker:
- Mathematical: A longer pattern is often less specific because it requires more redundant components to describe the same set.
- Intuitive: You may prefer the pattern with more literals.
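The two views above point in opposite directions (shorter first versus longer first), so the choice can be made explicit. The following is a sketch only: `TieBreak` and `compare_candidates` are hypothetical names, and the specificity is assumed to be an `f64`, which may not match the type `get` actually returns.

```rust
use std::cmp::Ordering;

/// Which pattern to prefer when two specificities are equal (hypothetical helper).
#[derive(Clone, Copy)]
enum TieBreak {
    /// The "mathematical" view: prefer the shorter pattern.
    ShorterFirst,
    /// The "intuitive" view: prefer the longer pattern, which tends to hold more literals.
    LongerFirst,
}

/// Orders two `(pattern, specificity)` pairs from most to least preferred.
/// The `f64` score stands in for the value returned by `get` (an assumption).
fn compare_candidates(
    (pattern_a, score_a): (&str, f64),
    (pattern_b, score_b): (&str, f64),
    tie_break: TieBreak,
) -> Ordering {
    // Higher specificity sorts first, so compare b against a.
    score_b.total_cmp(&score_a).then_with(|| {
        // Only reached when both scores are equal.
        let by_len = pattern_a.len().cmp(&pattern_b.len());
        match tie_break {
            TieBreak::ShorterFirst => by_len,
            TieBreak::LongerFirst => by_len.reverse(),
        }
    })
}
```

A comparator like this can be passed to `sort_by` on a list of candidates. Note that `str::len` counts bytes rather than characters, so patterns containing non-ASCII text may compare differently than their visible length suggests. The core of the idea is just the equality check followed by a length comparison: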
if result_a == result_b {
    return pattern_a.len().cmp(&pattern_b.len());
}

This project is licensed under the MIT License © 2025 557.