Skip to content

Commit d550f31

Browse files
hyperpolymathclaude
andcommitted
fix(assail): char-boundary-aware advance in Isabelle cartouche skip (regression from #49)
`skip_cartouche_end` advanced its byte index by 1 in the no-open/no-close branch, which puts `j` inside a multi-byte UTF-8 sequence when the cartouche body contains non-ASCII (`¬`, `∀`, `⟹`, `🎉`, etc.). The next iteration's `haystack[j..].starts_with(open)` then panics with thread 'main' panicked at src/assail/analyzer.rs:5648:20: start byte index 89 is not a char boundary; it is inside '¬' (...) discovered on `echidna` full-tree scans during the 2026-05-26 estate reconnaissance. Subsetting to `src/` worked around it; the proper fix is to advance by `len_utf8()` of the current char. Two regression tests: one with a sampled echidna text-cartouche body (`¬¬A`, `∀x`, `⟹`), one with a 4-byte UTF-8 emoji to exercise the full multi-byte range. Reproducer (full-tree assail on echidna) now completes cleanly with 44 weak points. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d449286 commit d550f31

1 file changed

Lines changed: 40 additions & 1 deletion

File tree

src/assail/analyzer.rs

Lines changed: 40 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -5641,6 +5641,11 @@ fn skip_antiquotation_end(haystack: &str, start: usize) -> usize {
56415641

56425642
/// Given a position just past the first `\<open>` (depth = 1), scan forward and
56435643
/// return the byte index just past the matching `\<close>`. Nesting-aware.
5644+
///
5645+
/// The byte-advance branch steps by `char.len_utf8()` rather than 1 so the
5646+
/// next iteration's `haystack[j..]` slice lands on a UTF-8 char boundary —
5647+
/// Isabelle prose routinely contains non-ASCII (e.g. `¬`, `∀`, `⟹`) inside
5648+
/// `text \<open>...\<close>` blocks.
56445649
fn skip_cartouche_end(haystack: &str, after_open: usize, open: &str, close: &str) -> usize {
56455650
let mut depth: i32 = 1;
56465651
let mut j = after_open;
@@ -5652,7 +5657,13 @@ fn skip_cartouche_end(haystack: &str, after_open: usize, open: &str, close: &str
56525657
depth -= 1;
56535658
j += close.len();
56545659
} else {
5655-
j += 1;
5660+
// Step by char width to keep `j` on a UTF-8 boundary. `chars()`
5661+
// is guaranteed non-empty here because `j < haystack.len()`.
5662+
j += haystack[j..]
5663+
.chars()
5664+
.next()
5665+
.map(|c| c.len_utf8())
5666+
.unwrap_or(1);
56565667
}
56575668
}
56585669
j
@@ -5924,6 +5935,34 @@ mod tests {
59245935
);
59255936
}
59265937

5938+
#[test]
5939+
fn isabelle_cartouche_with_non_ascii_does_not_panic() {
5940+
// Regression: prior to the char-boundary fix in skip_cartouche_end,
5941+
// any non-ASCII byte inside a `\<open>...\<close>` block panicked
5942+
// with "byte index N is not a char boundary" because the helper
5943+
// advanced by 1 byte at a time through multi-byte UTF-8 sequences.
5944+
// This content is shortened from a real echidna .thy file that
5945+
// triggered the panic on a full-tree assail scan.
5946+
let src = "text \\<open>\n Double negation: ¬¬A ≡ A.\n ∀x. ¬(P x) ⟹ True\n\\<close>\nlemma foo: True by simp";
5947+
let stripped = strip_isabelle_prose(src);
5948+
assert!(
5949+
!stripped.contains("¬¬"),
5950+
"cartouche body containing non-ASCII not stripped: {stripped:?}"
5951+
);
5952+
assert!(stripped.contains("lemma foo"));
5953+
}
5954+
5955+
#[test]
5956+
fn isabelle_cartouche_emoji_grapheme_clusters() {
5957+
// 4-byte UTF-8 sequences (U+1F389 🎉) inside a cartouche also
5958+
// exercise the char-boundary path. A surrogate-pair-style multi-
5959+
// byte char being skipped by 1-byte advances is the same bug.
5960+
let src = "section \\<open>release notes 🎉 — celebrate!\\<close>\nlemma g: True by simp";
5961+
let stripped = strip_isabelle_prose(src);
5962+
assert!(!stripped.contains("🎉"), "got: {stripped:?}");
5963+
assert!(stripped.contains("lemma g"));
5964+
}
5965+
59275966
#[test]
59285967
fn isabelle_nested_cartouches() {
59295968
// \<open> can nest. A cartouche containing another cartouche

0 commit comments

Comments
 (0)