Skip to content

Commit 8ced37b

Browse files
fix(assail): char-boundary-aware advance in Isabelle cartouche skip (regression from #49) (#65)
## Summary `skip_cartouche_end` advanced its byte index by 1 in the no-open / no-close branch, which puts `j` inside a multi-byte UTF-8 sequence when the cartouche body contains non-ASCII (`¬`, `∀`, `⟹`, `🎉`, etc.). The next iteration's `haystack[j..].starts_with(open)` then panics: ``` thread 'main' panicked at src/assail/analyzer.rs:5648:20: start byte index 89 is not a char boundary; it is inside '¬' (bytes 88..90) ``` ## Discovery `./target/release/panic-attack assail /home/hyperpolymath/developer/repos/echidna` from the 2026-05-26 estate reconnaissance crashed on a real `.thy` file inside echidna: ``` text \<open> Double negation elimination: ¬¬A is equivalent to A. ... \<close> ``` Subsetting echidna to `src/` worked around it; this is the proper fix. ## Fix In the no-open / no-close branch, advance by `len_utf8()` of the current char rather than 1 byte. `chars().next()` is guaranteed non-empty there because `j < haystack.len()`. ## Tests Two new regression tests in `assail::analyzer::tests`: - `isabelle_cartouche_with_non_ascii_does_not_panic` — sampled body from the actual crash (`¬¬A`, `∀x`, `⟹`). - `isabelle_cartouche_emoji_grapheme_clusters` — 4-byte UTF-8 (🎉) exercises the full multi-byte range, including the path that 2/3-byte chars wouldn't catch. ## Test plan - [x] `cargo test --lib assail::analyzer::tests::isabelle` — **10/10 pass** (8 existing + 2 new). - [x] `cargo build --release` clean. - [x] Reproducer: `./target/release/panic-attack assail /home/hyperpolymath/developer/repos/echidna` now completes cleanly with **44 weak points** (was: panic). - [ ] CI on this PR. Regression from #49 (Isabelle prose-stripper added in 2026-05-26). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent c7e7fb0 commit 8ced37b

1 file changed

Lines changed: 40 additions & 1 deletion

File tree

src/assail/analyzer.rs

Lines changed: 40 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -5641,6 +5641,11 @@ fn skip_antiquotation_end(haystack: &str, start: usize) -> usize {
56415641

56425642
/// Given a position just past the first `\<open>` (depth = 1), scan forward and
56435643
/// return the byte index just past the matching `\<close>`. Nesting-aware.
5644+
///
5645+
/// The byte-advance branch steps by `char.len_utf8()` rather than 1 so the
5646+
/// next iteration's `haystack[j..]` slice lands on a UTF-8 char boundary —
5647+
/// Isabelle prose routinely contains non-ASCII (e.g. `¬`, `∀`, `⟹`) inside
5648+
/// `text \<open>...\<close>` blocks.
56445649
fn skip_cartouche_end(haystack: &str, after_open: usize, open: &str, close: &str) -> usize {
56455650
let mut depth: i32 = 1;
56465651
let mut j = after_open;
@@ -5652,7 +5657,13 @@ fn skip_cartouche_end(haystack: &str, after_open: usize, open: &str, close: &str
56525657
depth -= 1;
56535658
j += close.len();
56545659
} else {
5655-
j += 1;
5660+
// Step by char width to keep `j` on a UTF-8 boundary. `chars()`
5661+
// is guaranteed non-empty here because `j < haystack.len()`.
5662+
j += haystack[j..]
5663+
.chars()
5664+
.next()
5665+
.map(|c| c.len_utf8())
5666+
.unwrap_or(1);
56565667
}
56575668
}
56585669
j
@@ -5924,6 +5935,34 @@ mod tests {
59245935
);
59255936
}
59265937

5938+
#[test]
5939+
fn isabelle_cartouche_with_non_ascii_does_not_panic() {
5940+
// Regression: prior to the char-boundary fix in skip_cartouche_end,
5941+
// any non-ASCII byte inside a `\<open>...\<close>` block panicked
5942+
// with "byte index N is not a char boundary" because the helper
5943+
// advanced by 1 byte at a time through multi-byte UTF-8 sequences.
5944+
// This content is shortened from a real echidna .thy file that
5945+
// triggered the panic on a full-tree assail scan.
5946+
let src = "text \\<open>\n Double negation: ¬¬A ≡ A.\n ∀x. ¬(P x) ⟹ True\n\\<close>\nlemma foo: True by simp";
5947+
let stripped = strip_isabelle_prose(src);
5948+
assert!(
5949+
!stripped.contains("¬¬"),
5950+
"cartouche body containing non-ASCII not stripped: {stripped:?}"
5951+
);
5952+
assert!(stripped.contains("lemma foo"));
5953+
}
5954+
5955+
#[test]
5956+
fn isabelle_cartouche_emoji_grapheme_clusters() {
5957+
// 4-byte UTF-8 sequences (U+1F389 🎉) inside a cartouche also
5958+
// exercise the char-boundary path. A surrogate-pair-style multi-
5959+
// byte char being skipped by 1-byte advances is the same bug.
5960+
let src = "section \\<open>release notes 🎉 — celebrate!\\<close>\nlemma g: True by simp";
5961+
let stripped = strip_isabelle_prose(src);
5962+
assert!(!stripped.contains("🎉"), "got: {stripped:?}");
5963+
assert!(stripped.contains("lemma g"));
5964+
}
5965+
59275966
#[test]
59285967
fn isabelle_nested_cartouches() {
59295968
// \<open> can nest. A cartouche containing another cartouche

0 commit comments

Comments
 (0)