fix(grep): route per regex dialect to preserve POSIX BRE semantics#2138
fix(grep): route per regex dialect to preserve POSIX BRE semantics#2138F0rty-Tw0 wants to merge 3 commits into
Conversation
Plain `grep` defaults to Basic Regular Expressions; `rtk grep` was
always passing the user's pattern to ripgrep, whose regex dialect
flips the meaning of `()`, `+`, `?`, `{`, `}` (BRE literals -> rg
metacharacters) and uses `|` instead of `\|` for alternation. This
silently broadened or shrank match sets -- e.g. `delete()` matched
every `delete` because `()` was treated as an empty group.
Detect grep's regex-mode flags (`-F`, `-E`, `-G`, `-P`) from
`extra_args` and route accordingly:
* BRE (default or `-G`): delegate to real `grep` for exact POSIX
semantics. Slower than rg, but eliminates the regex-translation
surface area. On Windows without `grep` on PATH, fall back to rg
with a best-effort BRE->rg translation and emit a one-line stderr
notice (once per process) so the degradation is never silent.
* `-E` (ERE), `-P` (PCRE), `-F` (fixed string): rg primary
(fast path); fall back to grep if rg is unavailable, surfacing
rg's error so debugging is possible. `-P` adds `--pcre2` for
exact PCRE.
Speed/correctness trade is intentional: BRE prioritises correctness
(grep's exact rules) at a small speed cost. Users who care about
rg's speed can opt in with `-E`/`-P`/`-F`.
Strips grep-only mode flags (`-E`, `-G`, `-P`, `-F`) and `-r` from
extra_args before forwarding to rg, since rg either has different
meanings for those (rg `-r` is `--replace`) or doesn't recognise
them. grep gets the user's flags verbatim.
has_grep_binary uses PATH resolution (no subprocess spawn) so the
BRE fast-path stays well under the project's <10ms startup target.
Closes the BRE-translation portion of rtk-ai#1436.
Signed-off-by: Artiom Tofan <arto@queue-it.com>
The per-dialect routing landed in b9fcc59 only fires inside grep_cmd::run, but Clap rejects unknown short flags before the first positional. Invocations like `rtk grep -E 'foo|bar' src/` hit `run_fallback`, which spawns the raw command without the recursion flag or rtk filtering -- so the new ERE/PCRE/Fixed paths were unreachable in the natural call shape and `delete()` would have remained the only correctness fix that worked end to end. Promote `-E` / `-G` / `-P` / `-F` (and their long forms) to first class Clap flags on the Grep subcommand. They're mutually exclusive via `conflicts_with_all`, so the user can't mix dialects. The dispatch re-injects the chosen flag at the front of `extra_args` so `detect_mode` and `build_grep_command` still see it and route consistently with the trailing-position form. Trailing-position usage (`rtk grep 'pat' src/ -E`) keeps working: `extra_args` has `trailing_var_arg + allow_hyphen_values`, so a late `-E` is captured there and `detect_mode` picks it up. Smoke-tested all four dialects in both leading and trailing positions, including the issue rtk-ai#1436 BRE regression for `delete()`, the ERE alternation case from the PR description, PCRE lookbehind, and the `-E -P` conflict. Signed-off-by: Artiom Tofan <arto@queue-it.com>
Promoting only the four dialect flags (a235b25) left every other short flag placed before the pattern (-i, -w, -A N, ...) tripping Clap's positional parser, so `rtk grep -i foo src/` fell through to run_fallback and ran the raw command with no filtering and 0% savings, even though the field doc advertises -i as a valid extra arg. reorder_grep_args runs before Cli::try_parse_from and hoists the positional operands (pattern, optional path) ahead of any flags so the flags land in extra_args (trailing_var_arg). It preserves the value of value-consuming flags (-A 3 keeps 3 as the value, not the pattern), stops at --, and bails to unchanged argv when the pattern comes from -e/-f. Scoped to the grep subcommand, so `rtk proxy grep` and `git grep` are untouched. Review hardening (same rtk-ai#1436 review): - detect_mode: GNU last-flag-wins + bundled-cluster detection (-iE -> ERE), stops at -- - build_rg_command: strip dialect chars from forwarded clusters so -iE no longer reaches rg (where -E is misread as --encoding and aborts) - build_grep_command: honor -t TYPE via --include (previously dropped on the grep paths) - bre_to_rg: backslash-parity so \( is not confused with \(; leading * is literal
There was a problem hiding this comment.
Thanks for your PR.
I didn't go through all of it but it seems way too complicated and trying to solve non-existent problems (at least from a rtk's point of view).
Can you come back with a second simpler proposal, taking into account what I have already commented on? It seems to be a good addition, just far too complex for now.
| /// `BRE` is the POSIX default when no dialect flag is given, matching plain | ||
| /// `grep` behavior. | ||
| #[derive(Debug, Clone, Copy, PartialEq, Eq)] | ||
| enum GrepMode { |
There was a problem hiding this comment.
suggestion:
| enum GrepMode { | |
| enum RegexSyntax { |
| /// Perl-Compatible Regular Expression (`-P`/`--perl-regexp`). | ||
| Pcre, | ||
| /// Fixed string, no regex (`-F`/`--fixed-strings`). | ||
| Fixed, |
There was a problem hiding this comment.
suggestion, either allow None here or return Option<RegexSyntax> in the detection function (preferred)
| // grouping natively. Pass through, fast. | ||
| // * PCRE (-P): rg --pcre2. Exact PCRE semantics, fast. | ||
| // * Fixed (-F): rg -F. No regex parsing at all, fast. | ||
| let mode = detect_mode(extra_args); |
There was a problem hiding this comment.
I think rewrite should be involved so a rg xxx rewrite ends with rtk grep -P command so we won't default to BRE in that case
| let is_cluster = bytes.len() >= 2 | ||
| && bytes[0] == b'-' | ||
| && bytes[1] != b'-' | ||
| && bytes[1..].iter().all(|b| b.is_ascii_alphabetic()); |
There was a problem hiding this comment.
ripgrep has -0 which is outside of is_ascii_alphabetic
There was a problem hiding this comment.
Perhaps treating other if starting with - and not followed by another - and stop if you encounter a = would work?
| fn bre_to_rg(pattern: &str) -> String { | ||
| // Backslash-aware single-pass tokenizer. We must distinguish: |
There was a problem hiding this comment.
This seems overly complicated, here is a simplified version that pass all your tests:
fn bre_to_pcre(bre: &str) -> String {
let mut result = String::with_capacity(bre.len() + 8);
let mut chars = bre.chars().peekable();
let mut at_start = true;
while let Some(ch) = chars.next() {
match ch {
'\\' if let Some(&next) = chars.peek() => match next {
// Escaped metacharacter in BRE -> unescape for PCRE
'+' | '?' | '|' | '(' | ')' | '{' | '}' => {
result.push(next);
chars.next();
}
// Escaped backslash -> keep as-is
'\\' => {
result.push_str(r"\\");
chars.next();
}
// Other escapes -> keep as-is
_ => result.push(ch),
},
// Leading * is literal in BRE, escape it for PCRE
'*' if at_start => {
result.push_str(r"\*");
}
// Unescaped metacharacters in BRE are literal, escape for PCRE
'+' | '?' | '|' | '(' | ')' | '{' | '}' => {
result.push('\\');
result.push(ch);
}
_ => {
result.push(ch);
}
}
at_start = false;
}
result
}| let is_cluster = bytes.len() >= 2 | ||
| && bytes[0] == b'-' | ||
| && bytes[1] != b'-' | ||
| && bytes[1..].iter().all(|b| b.is_ascii_alphabetic()); |
There was a problem hiding this comment.
Same remark, could be factorized
| /// grep has no `--type` flag, so the type filter must be expressed as a glob. | ||
| /// Known common types are mapped explicitly; anything else falls back to | ||
| /// `*.<type>` (so `rtk grep -t foo` searches `*.foo`). | ||
| fn grep_include_glob(file_type: &str) -> String { |
There was a problem hiding this comment.
That seems too complicated again, the rtk user is calling either grep or rg, if is calling it with the right args.
If calling rg, it should be because it knows rg is installed or the user has told it to, in either cases, if it isn't available it doesn't seem bad to fail.
That rtk internally uses rg over grep if available isn't a bad thing and can be handled more simply. The other way around, i.e. trying to overcome system/user bad configuration, isn't what rtk is aiming for
Summary
rtk grepwas always passing the user's pattern to ripgrep, even when the user invoked it expectinggrep's POSIX BRE rules. rg's regex dialect treats(),+,?,{,}as metacharacters (BRE treats them literally) and uses|instead of\|for alternation. Sodelete()quietly matched everydelete, because rg read()as an empty group.This PR detects grep's regex-mode flags (
-F,-E,-G,-P) and routes per dialect:-G): delegate to realgrepfor exact POSIX semantics. On Windows withoutgrepon PATH, fall back to rg with a best-effort BRE->rg translation and a one-shot stderr notice so the swap isn't silent.-E/-P/-F: rg primary (faster), with a grep fallback if rg is unavailable.-Padds--pcre2.Grep-only flags (
-E,-G,-P,-F) and-rare stripped before forwarding to rg, since rg either doesn't recognise them or overloads the letter (rg-ris--replace). grep gets the user's flags verbatim.has_grep_binaryuses PATH resolution only (no subprocess spawn), so the BRE fast path stays under the project's <10ms startup target.Closes the BRE-translation portion of #1436. The parser portion landed earlier as
be226d4.Follow-up (a235b25): recognize dialect flags before the pattern
Smoke testing surfaced a second, independent bug: the per-dialect routing in
b9fcc59is only reachable once Clap accepts the invocation, and Clap rejects unknown short flags placed before the first positional. Sortk grep -E 'foo|bar' src/fell through torun_fallback, which spawns the raw command without-ror rtk filtering. The ERE/PCRE/Fixed routes never fired in the natural call shape.Promoted
-E/-G/-P/-F(and their long forms) to first-class Clap flags on theGrepsubcommand, mutually exclusive viaconflicts_with_all. The dispatch re-injects the chosen flag at the front ofextra_argssodetect_modeandbuild_grep_commandstill see it and route consistently with the trailing-position form (rtk grep 'pat' src/ -Estill works viatrailing_var_arg).Follow-up: generalize the pre-pattern flag fix to all flags
Promoting the four dialect flags only fixed those four. Any other short flag placed before the pattern (
-i,-w,-A N, ...) still tripped Clap's positional parser, sortk grep -i foo src/fell through torun_fallbackand ran raw with no filtering and 0% token savings, even though the field doc advertises-ias a valid extra arg.reorder_grep_argsruns beforeCli::try_parse_fromand hoists the positional operands (pattern, optional path) ahead of any flags, so the flags land inextra_args(trailing_var_arg). It preserves the values of value-consuming flags (-A 3keeps3as the value, not the pattern), stops at--, and bails to the unchanged argv when the pattern comes from-e/-f(unsupported, since the pattern is positional). The reorder is scoped to thegrepsubcommand, sortk proxy grepandgit grepare untouched.Related hardening from the same review:
detect_modenow matches GNU grep's last-flag-wins and detects dialect chars inside bundled short clusters (-iE-> ERE); it stops scanning at--.build_rg_commandstrips dialect chars out of forwarded bundled clusters, so-iEno longer reaches rg, where-Ewould be misread as--encodingand abort with "missing value".build_grep_commandnow honors-t TYPEvia--include, which the grep paths previously dropped.bre_to_rgtracks backslash parity so\\((literal backslash + paren) is no longer confused with\((BRE group); leading*is treated as a literal.Test plan
cargo fmt --allcargo clippy --all-targets, no warningscargo test --all: 1978 passed, 6 ignoredrtk grep 'delete()' src/returns only literaldelete()(regression for the empty-group bug)rtk grep -E 'foo|bar' src/still routes through rg-E 'pat' src/) and trailing ('pat' src/ -E) positions for-E,-G,-P,-Frtk grep -i foo src/and-w/-A 1before the pattern are now filtered (previously raw passthrough)-P '(?<=fn )\w+_to_rg'(PCRE lookbehind) matches, confirming--pcre2is being passedrtk grep -E -P 'foo' src/surfaces a conflict error