Skip to content

fix(grep): route per regex dialect to preserve POSIX BRE semantics#2138

Open
F0rty-Tw0 wants to merge 3 commits into
rtk-ai:developfrom
F0rty-Tw0:fix/1436/grep-bre-translation
Open

fix(grep): route per regex dialect to preserve POSIX BRE semantics#2138
F0rty-Tw0 wants to merge 3 commits into
rtk-ai:developfrom
F0rty-Tw0:fix/1436/grep-bre-translation

Conversation

@F0rty-Tw0
Copy link
Copy Markdown
Contributor

@F0rty-Tw0 F0rty-Tw0 commented May 28, 2026

Summary

rtk grep was always passing the user's pattern to ripgrep, even when the user invoked it expecting grep's POSIX BRE rules. rg's regex dialect treats (), +, ?, {, } as metacharacters (BRE treats them literally) and uses | instead of \| for alternation. So delete() quietly matched every delete, because rg read () as an empty group.

This PR detects grep's regex-mode flags (-F, -E, -G, -P) and routes per dialect:

  • BRE (default or -G): delegate to real grep for exact POSIX semantics. On Windows without grep on PATH, fall back to rg with a best-effort BRE->rg translation and a one-shot stderr notice so the swap isn't silent.
  • -E / -P / -F: rg primary (faster), with a grep fallback if rg is unavailable. -P adds --pcre2.

Grep-only flags (-E, -G, -P, -F) and -r are stripped before forwarding to rg, since rg either doesn't recognise them or overloads the letter (rg -r is --replace). grep gets the user's flags verbatim.

has_grep_binary uses PATH resolution only (no subprocess spawn), so the BRE fast path stays under the project's <10ms startup target.

Closes the BRE-translation portion of #1436. The parser portion landed earlier as be226d4.

Follow-up (a235b25): recognize dialect flags before the pattern

Smoke testing surfaced a second, independent bug: the per-dialect routing in b9fcc59 is only reachable once Clap accepts the invocation, and Clap rejects unknown short flags placed before the first positional. So rtk grep -E 'foo|bar' src/ fell through to run_fallback, which spawns the raw command without -r or rtk filtering. The ERE/PCRE/Fixed routes never fired in the natural call shape.

Promoted -E / -G / -P / -F (and their long forms) to first-class Clap flags on the Grep subcommand, mutually exclusive via conflicts_with_all. The dispatch re-injects the chosen flag at the front of extra_args so detect_mode and build_grep_command still see it and route consistently with the trailing-position form (rtk grep 'pat' src/ -E still works via trailing_var_arg).

Follow-up: generalize the pre-pattern flag fix to all flags

Promoting the four dialect flags only fixed those four. Any other short flag placed before the pattern (-i, -w, -A N, ...) still tripped Clap's positional parser, so rtk grep -i foo src/ fell through to run_fallback and ran raw with no filtering and 0% token savings, even though the field doc advertises -i as a valid extra arg.

reorder_grep_args runs before Cli::try_parse_from and hoists the positional operands (pattern, optional path) ahead of any flags, so the flags land in extra_args (trailing_var_arg). It preserves the values of value-consuming flags (-A 3 keeps 3 as the value, not the pattern), stops at --, and bails to the unchanged argv when the pattern comes from -e/-f (unsupported, since the pattern is positional). The reorder is scoped to the grep subcommand, so rtk proxy grep and git grep are untouched.

Related hardening from the same review:

  • detect_mode now matches GNU grep's last-flag-wins and detects dialect chars inside bundled short clusters (-iE -> ERE); it stops scanning at --.
  • build_rg_command strips dialect chars out of forwarded bundled clusters, so -iE no longer reaches rg, where -E would be misread as --encoding and abort with "missing value".
  • build_grep_command now honors -t TYPE via --include, which the grep paths previously dropped.
  • bre_to_rg tracks backslash parity so \\( (literal backslash + paren) is no longer confused with \( (BRE group); leading * is treated as a literal.

Test plan

  • cargo fmt --all
  • cargo clippy --all-targets, no warnings
  • cargo test --all: 1978 passed, 6 ignored
  • Manual smoke: rtk grep 'delete()' src/ returns only literal delete() (regression for the empty-group bug)
  • Manual smoke: rtk grep -E 'foo|bar' src/ still routes through rg
  • Manual smoke: dialect flags in both leading (-E 'pat' src/) and trailing ('pat' src/ -E) positions for -E, -G, -P, -F
  • Manual smoke: rtk grep -i foo src/ and -w/-A 1 before the pattern are now filtered (previously raw passthrough)
  • Manual smoke: -P '(?<=fn )\w+_to_rg' (PCRE lookbehind) matches, confirming --pcre2 is being passed
  • Manual smoke: rtk grep -E -P 'foo' src/ surfaces a conflict error

F0rty-Tw0 added 3 commits May 28, 2026 16:20
Plain `grep` defaults to Basic Regular Expressions; `rtk grep` was
always passing the user's pattern to ripgrep, whose regex dialect
flips the meaning of `()`, `+`, `?`, `{`, `}` (BRE literals -> rg
metacharacters) and uses `|` instead of `\|` for alternation. This
silently broadened or shrank match sets -- e.g. `delete()` matched
every `delete` because `()` was treated as an empty group.

Detect grep's regex-mode flags (`-F`, `-E`, `-G`, `-P`) from
`extra_args` and route accordingly:

* BRE (default or `-G`): delegate to real `grep` for exact POSIX
  semantics. Slower than rg, but eliminates the regex-translation
  surface area. On Windows without `grep` on PATH, fall back to rg
  with a best-effort BRE->rg translation and emit a one-line stderr
  notice (once per process) so the degradation is never silent.
* `-E` (ERE), `-P` (PCRE), `-F` (fixed string): rg primary
  (fast path); fall back to grep if rg is unavailable, surfacing
  rg's error so debugging is possible. `-P` adds `--pcre2` for
  exact PCRE.

Speed/correctness trade is intentional: BRE prioritises correctness
(grep's exact rules) at a small speed cost. Users who care about
rg's speed can opt in with `-E`/`-P`/`-F`.

Strips grep-only mode flags (`-E`, `-G`, `-P`, `-F`) and `-r` from
extra_args before forwarding to rg, since rg either has different
meanings for those (rg `-r` is `--replace`) or doesn't recognise
them. grep gets the user's flags verbatim.

has_grep_binary uses PATH resolution (no subprocess spawn) so the
BRE fast-path stays well under the project's <10ms startup target.

Closes the BRE-translation portion of rtk-ai#1436.

Signed-off-by: Artiom Tofan <arto@queue-it.com>
The per-dialect routing landed in b9fcc59 only fires inside
grep_cmd::run, but Clap rejects unknown short flags before the
first positional. Invocations like `rtk grep -E 'foo|bar' src/`
hit `run_fallback`, which spawns the raw command without the
recursion flag or rtk filtering -- so the new ERE/PCRE/Fixed
paths were unreachable in the natural call shape and `delete()`
would have remained the only correctness fix that worked end to
end.

Promote `-E` / `-G` / `-P` / `-F` (and their long forms) to first
class Clap flags on the Grep subcommand. They're mutually
exclusive via `conflicts_with_all`, so the user can't mix
dialects. The dispatch re-injects the chosen flag at the front of
`extra_args` so `detect_mode` and `build_grep_command` still see
it and route consistently with the trailing-position form.

Trailing-position usage (`rtk grep 'pat' src/ -E`) keeps working:
`extra_args` has `trailing_var_arg + allow_hyphen_values`, so a
late `-E` is captured there and `detect_mode` picks it up.

Smoke-tested all four dialects in both leading and trailing
positions, including the issue rtk-ai#1436 BRE regression for
`delete()`, the ERE alternation case from the PR description,
PCRE lookbehind, and the `-E -P` conflict.

Signed-off-by: Artiom Tofan <arto@queue-it.com>
Promoting only the four dialect flags (a235b25) left every other short
flag placed before the pattern (-i, -w, -A N, ...) tripping Clap's
positional parser, so `rtk grep -i foo src/` fell through to
run_fallback and ran the raw command with no filtering and 0% savings,
even though the field doc advertises -i as a valid extra arg.

reorder_grep_args runs before Cli::try_parse_from and hoists the
positional operands (pattern, optional path) ahead of any flags so the
flags land in extra_args (trailing_var_arg). It preserves the value of
value-consuming flags (-A 3 keeps 3 as the value, not the pattern),
stops at --, and bails to unchanged argv when the pattern comes from
-e/-f. Scoped to the grep subcommand, so `rtk proxy grep` and
`git grep` are untouched.

Review hardening (same rtk-ai#1436 review):
- detect_mode: GNU last-flag-wins + bundled-cluster detection (-iE -> ERE), stops at --
- build_rg_command: strip dialect chars from forwarded clusters so -iE no longer reaches rg (where -E is misread as --encoding and aborts)
- build_grep_command: honor -t TYPE via --include (previously dropped on the grep paths)
- bre_to_rg: backslash-parity so \( is not confused with \(; leading * is literal
Copy link
Copy Markdown
Collaborator

@KuSh KuSh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your PR.

I didn't go through all of it but it seems way too complicated and trying to solve non-existent problems (at least from a rtk's point of view).

Can you come back with a second simpler proposal, taking into account what I have already commented on? It seems to be a good addition, just far too complex for now.

/// `BRE` is the POSIX default when no dialect flag is given, matching plain
/// `grep` behavior.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GrepMode {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion:

Suggested change
enum GrepMode {
enum RegexSyntax {

/// Perl-Compatible Regular Expression (`-P`/`--perl-regexp`).
Pcre,
/// Fixed string, no regex (`-F`/`--fixed-strings`).
Fixed,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion, either allow None here or return Option<RegexSyntax> in the detection function (preferred)

// grouping natively. Pass through, fast.
// * PCRE (-P): rg --pcre2. Exact PCRE semantics, fast.
// * Fixed (-F): rg -F. No regex parsing at all, fast.
let mode = detect_mode(extra_args);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think rewrite should be involved so a rg xxx rewrite ends with rtk grep -P command so we won't default to BRE in that case

let is_cluster = bytes.len() >= 2
&& bytes[0] == b'-'
&& bytes[1] != b'-'
&& bytes[1..].iter().all(|b| b.is_ascii_alphabetic());
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ripgrep has -0 which is outside of is_ascii_alphabetic

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps treating other if starting with - and not followed by another - and stop if you encounter a = would work?

Comment on lines +94 to +95
fn bre_to_rg(pattern: &str) -> String {
// Backslash-aware single-pass tokenizer. We must distinguish:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems overly complicated, here is a simplified version that pass all your tests:

    fn bre_to_pcre(bre: &str) -> String {
        let mut result = String::with_capacity(bre.len() + 8);
        let mut chars = bre.chars().peekable();
        let mut at_start = true;

        while let Some(ch) = chars.next() {
            match ch {
                '\\' if let Some(&next) = chars.peek() => match next {
                    // Escaped metacharacter in BRE -> unescape for PCRE
                    '+' | '?' | '|' | '(' | ')' | '{' | '}' => {
                        result.push(next);
                        chars.next();
                    }
                    // Escaped backslash -> keep as-is
                    '\\' => {
                        result.push_str(r"\\");
                        chars.next();
                    }
                    // Other escapes -> keep as-is
                    _ => result.push(ch),
                },

                // Leading * is literal in BRE, escape it for PCRE
                '*' if at_start => {
                    result.push_str(r"\*");
                }

                // Unescaped metacharacters in BRE are literal, escape for PCRE
                '+' | '?' | '|' | '(' | ')' | '{' | '}' => {
                    result.push('\\');
                    result.push(ch);
                }

                _ => {
                    result.push(ch);
                }
            }
            at_start = false;
        }

        result
    }

let is_cluster = bytes.len() >= 2
&& bytes[0] == b'-'
&& bytes[1] != b'-'
&& bytes[1..].iter().all(|b| b.is_ascii_alphabetic());
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same remark, could be factorized

/// grep has no `--type` flag, so the type filter must be expressed as a glob.
/// Known common types are mapped explicitly; anything else falls back to
/// `*.<type>` (so `rtk grep -t foo` searches `*.foo`).
fn grep_include_glob(file_type: &str) -> String {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems too complicated again, the rtk user is calling either grep or rg, if is calling it with the right args.

If calling rg, it should be because it knows rg is installed or the user has told it to, in either cases, if it isn't available it doesn't seem bad to fail.

That rtk internally uses rg over grep if available isn't a bad thing and can be handled more simply. The other way around, i.e. trying to overcome system/user bad configuration, isn't what rtk is aiming for

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants