
perf(store): speed up redaction scanning#155

Merged
0xjunha merged 2 commits into main from perf-redaction
May 20, 2026

Conversation

@0xjunha
Owner

@0xjunha 0xjunha commented May 20, 2026

Summary

This PR speeds up Darc's store redaction path while preserving the existing redacted index output.

The main change is a redaction prefilter and targeted scanner rewrite in darc-store. Instead of sending every text field through every expensive detector, Darc now first checks for cheap literal indicators of redactable content. When there is no candidate signal, it skips the heavier redaction pipeline. Hot regex-based paths for generic secret assignments and base64-like blobs were also replaced with targeted scanners that preserve the previous matching behavior.
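A minimal sketch of the prefilter idea, not the code from this PR. The indicator list, `has_candidate`, `run_detectors`, and `redact` are hypothetical names, and a plain case-insensitive window search stands in for the aho-corasick automaton, which would find all literals in a single pass. Returning `Cow` also illustrates how the no-op case can avoid allocating a replacement string.

```rust
use std::borrow::Cow;

/// Hypothetical literal indicators; the real list in darc-store is not shown here.
const INDICATORS: &[&str] = &["password", "secret", "token", "api_key", "-----BEGIN"];

/// Cheap, allocation-free check for any indicator (ASCII case-insensitive).
/// Assumption: this stands in for an Aho-Corasick multi-pattern search.
fn has_candidate(text: &str) -> bool {
    let hay = text.as_bytes();
    INDICATORS.iter().any(|needle| {
        let needle = needle.as_bytes();
        hay.windows(needle.len())
            .any(|window| window.eq_ignore_ascii_case(needle))
    })
}

/// Placeholder for the expensive detector pipeline (purely illustrative):
/// masks everything after the first '='.
fn run_detectors(text: &str) -> String {
    match text.find('=') {
        Some(i) => format!("{}=[REDACTED]", &text[..i]),
        None => text.to_string(),
    }
}

/// Skips the heavy pipeline when no indicator is present and borrows the
/// input unchanged, so the common no-op case does not allocate.
fn redact(text: &str) -> Cow<'_, str> {
    if !has_candidate(text) {
        return Cow::Borrowed(text);
    }
    Cow::Owned(run_detectors(text))
}
```

With these assumptions, `redact("plain log line")` borrows its input, while `redact("PASSWORD=hunter2")` goes through the detector stand-in. The prefilter must never reject text that a detector would have matched, so the indicator list has to cover every detector's literals.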

A benchmark helper script was added to make cold-refresh and index-rebuild timing easier to reproduce, including optional redaction snapshot comparison.

Changes

  • Add an aho-corasick-based prefilter for text redaction candidates.
  • Avoid allocating replacement strings for no-op redaction cases.
  • Replace the broad generic secret assignment regex with a manual scanner.
  • Replace base64-like blob regex scanning with a targeted scanner.
  • Keep the old generic assignment regex under cfg(test) as an oracle for behavior checks.
  • Add tests comparing scanner behavior against the prior regex behavior.
  • Add scripts/bench-cold-refresh.sh for repeatable timing and snapshot checks.
  • Update CHANGELOG.md under Unreleased.
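To illustrate what a targeted scanner for base64-like blobs might look like, here is a hedged sketch. The alphabet, the `MIN_BLOB_LEN` threshold, and the function names are assumptions, not values taken from darc-store. Under those assumptions it behaves roughly like a regex of the form `[A-Za-z0-9+/=_-]{32,}`, reporting maximal runs.

```rust
use std::ops::Range;

/// Hypothetical minimum run length; the real threshold is not stated in this PR.
const MIN_BLOB_LEN: usize = 32;

/// Assumed alphabet: standard and URL-safe base64 characters plus padding.
fn is_b64_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || matches!(b, b'+' | b'/' | b'=' | b'-' | b'_')
}

/// Returns byte ranges of maximal base64-like runs of at least MIN_BLOB_LEN bytes.
/// A single linear pass over the bytes, with no backtracking.
fn find_blobs(text: &str) -> Vec<Range<usize>> {
    let bytes = text.as_bytes();
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        if !is_b64_byte(bytes[i]) {
            i += 1;
            continue;
        }
        // Extend the run as far as the alphabet allows.
        let start = i;
        while i < bytes.len() && is_b64_byte(bytes[i]) {
            i += 1;
        }
        if i - start >= MIN_BLOB_LEN {
            out.push(start..i);
        }
    }
    out
}
```

Because every byte in the assumed alphabet is ASCII, the returned ranges always fall on UTF-8 character boundaries and can be used to slice the original string safely.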

Correctness

The intended behavior is no regression in redaction output.

Validation performed:

  • Compared old and new redacted SQLite outputs across:
    • sessions
    • turns
    • turn_search
    • tool_calls
    • file_accesses
    • turn_evidence
  • Result: 0 diffs.
  • Ran the new benchmark helper with snapshot comparison.
  • Result: snapshot_match=yes.
  • Added test coverage that keeps the old regex behavior available as a test oracle.
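The oracle approach can be pictured as a differential test: run the fast scanner and a slower reference over the same inputs and report the first disagreement. The sketch below is illustrative only. `value_fast`, `value_oracle`, and the simplified `key=value` rule are hypothetical, and a split-based reference stands in for the old regex that the PR keeps under `cfg(test)`.

```rust
/// Hypothetical fast path: byte scanner returning the value after the first '='.
fn value_fast(line: &str) -> Option<&str> {
    let bytes = line.as_bytes();
    let eq = bytes.iter().position(|&b| b == b'=')?;
    // Skip spaces after '='.
    let mut start = eq + 1;
    while start < bytes.len() && bytes[start] == b' ' {
        start += 1;
    }
    // The value runs until the next ASCII whitespace byte or end of input.
    let mut end = start;
    while end < bytes.len() && !bytes[end].is_ascii_whitespace() {
        end += 1;
    }
    if end > start { Some(&line[start..end]) } else { None }
}

/// Slow reference with the same intended semantics (stand-in for the regex oracle).
fn value_oracle(line: &str) -> Option<&str> {
    let (_, rest) = line.split_once('=')?;
    rest.trim_start_matches(' ')
        .split(|c: char| c.is_ascii_whitespace())
        .next()
        .filter(|v| !v.is_empty())
}

/// Returns the first input on which the two implementations disagree, if any.
fn first_disagreement<'a>(inputs: &[&'a str]) -> Option<&'a str> {
    inputs
        .iter()
        .copied()
        .find(|s| value_fast(s) != value_oracle(s))
}
```

A test would then assert that `first_disagreement` returns `None` over a corpus of representative and adversarial inputs. The check is only as strong as that corpus, which is presumably why the PR also compares full redacted snapshots.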

Performance

Observed local results:

  • Fresh temp-root cold refresh:
    • before: 29.34s
    • after: 9.59s
  • Frozen archive index rebuild:
    • before: 28.82s
    • after: 7.85s
  • Old regex-oracle rebuild:
    • 30.37s

The speedup comes mostly from avoiding expensive regex work on text that has no plausible redaction candidate.
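For reference, the before/after timings above work out to roughly the following speedup factors (simple ratios of the reported local numbers, rounded to two decimals):

```latex
\[
\text{cold refresh: } \frac{29.34\,\mathrm{s}}{9.59\,\mathrm{s}} \approx 3.06\times
\qquad
\text{index rebuild: } \frac{28.82\,\mathrm{s}}{7.85\,\mathrm{s}} \approx 3.67\times
\]
```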

Risk

The main risk is redaction false negatives: content the previous detectors would have redacted but the new prefilter or scanners miss. This PR mitigates that by preserving the old detector semantics through oracle tests and by comparing full redacted index snapshots before and after the change.

The new scanners are intentionally scoped to match the previous behavior rather than expanding the redaction policy in this PR.

@0xjunha 0xjunha merged commit f27724f into main May 20, 2026
12 checks passed
@0xjunha 0xjunha deleted the perf-redaction branch May 20, 2026 15:37
