Commit bb55eaa

feat: Hyperscan-based secret redaction with SOM spans and overlap merge
- redact.py: use python-hyperscan with HS_FLAG_SOM_LEFTMOST for correct spans
- redact_patterns.py: curated patterns (OpenAI, AWS, GitHub, Stripe, etc.)
- tests/test_redact.py: unit tests for redaction (use fake test values for keys)
- docs/secrets.md: document implementation and pre-commit options

Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent 8280715 commit bb55eaa

7 files changed

Lines changed: 333 additions & 23 deletions

README.md

Lines changed: 4 additions & 0 deletions
@@ -124,3 +124,7 @@ uv run convx explore --output-path /path/to/your/repo
 uv run convx hooks install
 uv run convx hooks uninstall
 ```
+
+## Secrets
+
+Exports are redacted by default (API keys, tokens, passwords → `[REDACTED]`). Be mindful of secrets in your history repo. See [docs/secrets.md](docs/secrets.md) for details and pre-commit scanner options (Gitleaks, TruffleHog, detect-secrets, semgrep).

docs/secrets.md

Lines changed: 74 additions & 5 deletions
@@ -1,6 +1,6 @@
 # Secret redaction

-convx redacts API keys, tokens, passwords, and similar secrets from exported files so they dont end up in your history repo.
+convx redacts API keys, tokens, passwords, and similar secrets from exported files so they don't end up in your history repo.

 > ⚠️ You are responsible for ensuring no secrets are committed; we provide no warranty or liability.

@@ -14,11 +14,80 @@ convx redacts API keys, tokens, passwords, and similar secrets from exported fil
 convx backup --output-path /path/to/repo --no-redact
 ```

-- **What gets redacted:** Patterns for credentials and secrets (e.g. OpenAI-style keys, `api_key=...`, `password=...`, AWS/GCP/Stripe/GitHub tokens). High-entropy but non-secret text (URLs, long IDs) is left as-is.
-- **What you see:** Matched secrets are replaced with markers like `[REDACTED:api_key]` or `[REDACTED:password]` in the written files.
+- **What gets redacted:** API keys (OpenAI, AWS, Stripe, GitHub, etc.), tokens, passwords, private keys, and similar credentials. Matched secrets are replaced with `[REDACTED]`.
+- **What you see:** All redacted values appear as `[REDACTED]` in the written files.

 ## How it works technically

-- **Library:** [plumbrc](https://pypi.org/project/plumbrc/): pattern-based redaction (700+ built-in patterns). If plumbrc isn’t available at runtime, redaction is skipped and content is written unchanged.
+- **Library:** [Hyperscan](https://github.com/intel/hyperscan) via [python-hyperscan](https://github.com/darvid/python-hyperscan): multi-pattern matching with `HS_FLAG_UTF8` and `HS_FLAG_SOM_LEFTMOST` so we get correct start/end spans. Text is scanned as UTF-8 bytes; match spans are merged when overlapping, then replaced from end to start so indices stay valid. All matches become `[REDACTED]`.
 - **When:** Redaction runs **after** rendering and **before** writing. Both Markdown and JSON outputs are passed through `redact_secrets()` in the engine; the index (`.convx/index.json`) is not redacted.
-- **Where in code:** `convx_ai.redact.redact_secrets(text, redact=True)` wraps `Plumbr(quiet=True).redact(text)`. The engine calls it for every session markdown and JSON blob (including child-session markdown) when `redact=True`; the CLI passes `redact=not no_redact` from `sync` and `backup`.
+- **Where in code:** `convx_ai.redact.redact_secrets(text, redact=True)`. The engine calls it for every session markdown and JSON blob when `redact=True`; the CLI passes `redact=not no_redact` from `sync` and `backup`.
+- **Patterns:** Curated regex set in `convx_ai.redact_patterns` (OpenAI, AWS, GitHub, Stripe, Slack, SendGrid, Discord, Google, Twilio, Telegram, Square, private keys, etc.). No entropy-based detection (avoids over-redacting URLs and long IDs in transcripts). Patterns must be Hyperscan-compatible (no backrefs, no lookahead/lookbehind).
+
+## Pattern sources
+
+Default patterns are derived from:
+
+- [mazen160/secrets-patterns-db](https://github.com/mazen160/secrets-patterns-db): high-confidence and rules-stable datasets (1,600+ patterns, ReDoS-tested)
+- [gitleaks/gitleaks](https://github.com/gitleaks/gitleaks): `config/gitleaks.toml` built-in rules
+- [Yelp/detect-secrets](https://github.com/Yelp/detect-secrets): plugin regex patterns
+
+To add or change patterns, edit `src/convx_ai/redact_patterns.py`. Each entry is `(bytes_regex, unique_id)`; regex syntax is Hyperscan’s (PCRE-like, no backrefs/lookahead/lookbehind).
+
+## Pre-commit secret scanners (defense in depth)
+
+For additional protection, add a pre-commit hook to block secrets before they enter your repo. Popular options:
+
+| Tool | Language | Engine | Notes |
+|-----|----------|--------|-------|
+| **Gitleaks** | Go | RE2 | Single-pass scan, huge maintained pattern DB. Most popular pre-commit tool. |
+| **TruffleHog** | Go | RE2 | Similar to Gitleaks, strong verification and pattern coverage. |
+| **detect-secrets** | Python | regex | Yelp's tool, what most Python shops use in CI. Baseline-based. |
+| **semgrep** | OCaml | custom | AST-aware, more than regex. Heavily used in enterprise CI. |
+
+### Example configs
+
+**Gitleaks** (recommended):
+
+```yaml
+# .pre-commit-config.yaml
+repos:
+  - repo: https://github.com/gitleaks/gitleaks
+    rev: v8.18.0
+    hooks:
+      - id: gitleaks
+```
+
+**TruffleHog**:
+
+```yaml
+repos:
+  - repo: https://github.com/trufflesecurity/trufflehog
+    rev: v3.78.0
+    hooks:
+      - id: trufflehog
+```
+
+**detect-secrets** (Python shops):
+
+```yaml
+repos:
+  - repo: https://github.com/Yelp/detect-secrets
+    rev: v1.5.0
+    hooks:
+      - id: detect-secrets
+        args: ['--baseline', '.secrets.baseline']
+```
+
+Create a baseline with `detect-secrets scan > .secrets.baseline` before first use.
+
+**semgrep** (enterprise / AST-aware):
+
+```yaml
+repos:
+  - repo: https://github.com/semgrep/pre-commit
+    rev: v1.151.0
+    hooks:
+      - id: semgrep
+        args: ['--config', 'p/secrets', '--error']
+```
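
The span handling described under "How it works technically" (collect match spans, merge overlaps, replace from end to start) can be sketched without Hyperscan. This is a minimal illustration, not the project's implementation: it uses Python's standard `re` module as a stand-in matcher, and the names `DEMO_PATTERNS`, `find_spans`, `merge_spans`, and `redact_demo` are hypothetical.

```python
import re

REDACTED = "[REDACTED]"

# Two illustrative patterns that both match an "sk-proj-..." key,
# so the same secret is reported twice (assumed for this demo).
DEMO_PATTERNS = [
    re.compile(r"sk-proj-[a-zA-Z0-9_-]{20,}"),
    re.compile(r"sk-[a-zA-Z0-9_-]{20,}"),
]


def find_spans(text: str) -> list[tuple[int, int]]:
    """Collect (start, end) spans from every pattern; overlaps are expected."""
    return [m.span() for p in DEMO_PATTERNS for m in p.finditer(text)]


def merge_spans(spans: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge overlapping or touching spans into disjoint, sorted spans."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def redact_demo(text: str) -> str:
    """Replace merged spans right to left so earlier offsets stay valid."""
    for start, end in reversed(merge_spans(find_spans(text))):
        text = text[:start] + REDACTED + text[end:]
    return text


key = "sk-proj-" + "a" * 20
print(redact_demo(f"key1={key} key2={key}"))  # key1=[REDACTED] key2=[REDACTED]
```

Without the merge step, the two patterns would each report the same key and the text would be rewritten twice at stale offsets. Replacing right to left means no offset needs to be recomputed after a substitution changes the string length.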

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ classifiers = [
 ]

 dependencies = [
-    "plumbrc>=1.0.0",
+    "hyperscan>=0.2.0",
     "tantivy>=0.22",
     "textual>=8.0",
     "typer>=0.12.0",

src/convx_ai/redact.py

Lines changed: 48 additions & 9 deletions
@@ -1,16 +1,55 @@
 from __future__ import annotations

-try:
-    from plumbrc import Plumbr
-    _plumbr_available = True
-except Exception:
-    Plumbr = None
-    _plumbr_available = False
+import hyperscan
+
+from convx_ai.redact_patterns import PATTERNS
+
+REDACTED = "[REDACTED]"
+
+_db: hyperscan.Database | None = None
+
+
+def _get_db() -> hyperscan.Database:
+    global _db
+    if _db is None:
+        exprs, ids = zip(*PATTERNS)
+        _db = hyperscan.Database()
+        _db.compile(
+            expressions=list(exprs),
+            ids=list(ids),
+            elements=len(PATTERNS),
+            flags=[hyperscan.HS_FLAG_UTF8 | hyperscan.HS_FLAG_SOM_LEFTMOST] * len(PATTERNS),
+        )
+    return _db
+
+
+def _merge_overlaps(spans: list[tuple[int, int]]) -> list[tuple[int, int]]:
+    if not spans:
+        return []
+    sorted_spans = sorted(spans)
+    merged = [sorted_spans[0]]
+    for start, end in sorted_spans[1:]:
+        if start <= merged[-1][1]:
+            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
+        else:
+            merged.append((start, end))
+    return merged


 def redact_secrets(text: str, *, redact: bool = True) -> str:
     if not redact:
         return text
-    if not _plumbr_available:
-        return text
-    return Plumbr(quiet=True).redact(text)
+    data = text.encode("utf-8")
+    matches: list[tuple[int, int]] = []
+
+    def on_match(id: int, from_: int, to: int, flags: int, context: list) -> None:
+        context.append((from_, to))
+
+    _get_db().scan(data, match_event_handler=on_match, context=matches)
+    spans = _merge_overlaps(matches)
+
+    for start, end in sorted(spans, key=lambda s: -s[1]):
+        char_start = len(data[:start].decode("utf-8"))
+        char_end = len(data[:end].decode("utf-8"))
+        text = text[:char_start] + REDACTED + text[char_end:]
+    return text
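
`redact_secrets` scans UTF-8 bytes, so the offsets it gets back are byte positions, while Python string slicing counts code points. The sketch below shows why the conversion step matters once the text contains non-ASCII characters. The helper name `byte_to_char_offset` and the sample text are hypothetical and only mirror the `len(data[:n].decode("utf-8"))` idiom used in the diff.

```python
def byte_to_char_offset(data: bytes, byte_offset: int) -> int:
    """Count the code points in the UTF-8 prefix ending at byte_offset.

    Assumes byte_offset falls on a character boundary; if it does not,
    decode() raises UnicodeDecodeError.
    """
    return len(data[:byte_offset].decode("utf-8"))


text = "clé: AKIAIOSFODNN7EXAMPLE"
data = text.encode("utf-8")

# "é" occupies two bytes in UTF-8, so byte and character offsets diverge after it.
byte_start = data.index(b"AKIA")
char_start = byte_to_char_offset(data, byte_start)

print(byte_start, char_start)  # 6 5
print(text[char_start:])       # AKIAIOSFODNN7EXAMPLE
```

Slicing `text` with the raw byte offset would start one character late here and leave the leading `A` of the key in the output.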

src/convx_ai/redact_patterns.py

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+"""
+Secret detection patterns for redaction.
+
+Hyperscan-compatible: no backrefs, no lookahead/lookbehind.
+Patterns match the secret value only (no surrounding context).
+
+Sources:
+- mazen160/secrets-patterns-db (high-confidence, rules-stable)
+- gitleaks/gitleaks config
+- Yelp/detect-secrets plugins
+"""
+
+# (pattern, id): ids must be unique
+PATTERNS: list[tuple[bytes, int]] = [
+    # OpenAI (detect-secrets, custom)
+    (br"sk-proj-[a-zA-Z0-9_-]{20,}", 0),
+    (br"sk-[a-zA-Z0-9_-]{20,}", 1),
+    (br"sk-[A-Za-z0-9-_]*[A-Za-z0-9]{20}T3BlbkFJ[A-Za-z0-9]{20}", 2),
+    # AWS (secrets-patterns-db, gitleaks)
+    (br"AKIA[0-9A-Z]{16}", 3),
+    (br"ASIA[0-9A-Z]{16}", 4),
+    (br"da2-[a-z0-9]{26}", 5),
+    # GitHub (detect-secrets)
+    (br"(?:ghp|gho|ghu|ghs|ghr)_[A-Za-z0-9_]{36}", 6),
+    # Stripe (secrets-patterns-db)
+    (br"sk_live_[0-9a-zA-Z]{24}", 7),
+    (br"rk_live_[0-9a-zA-Z]{24}", 8),
+    # Slack (detect-secrets, flexible format)
+    (br"xox(?:a|b|p|o|s|r)-(?:\d+-)+[a-zA-Z0-9]+", 9),
+    (br"https://hooks\.slack\.com/services/T[a-zA-Z0-9_]+/B[a-zA-Z0-9_]+/[a-zA-Z0-9_]+", 10),
+    # SendGrid (detect-secrets)
+    (br"SG\.[a-zA-Z0-9_-]{22}\.[a-zA-Z0-9_-]{43}", 11),
+    # Discord (detect-secrets)
+    (br"[MNO][a-zA-Z0-9_-]{23,25}\.[a-zA-Z0-9_-]{6}\.[a-zA-Z0-9_-]{27}", 12),
+    # Google (secrets-patterns-db)
+    (br"AIza[0-9A-Za-z_-]{35}", 13),
+    (br"ya29\.[0-9A-Za-z_-]+", 14),
+    # Twilio (secrets-patterns-db)
+    (br"SK[0-9a-fA-F]{32}", 15),
+    # Telegram (secrets-patterns-db)
+    (br"[0-9]+:AA[0-9A-Za-z_-]{33}", 16),
+    # Mailgun, Mailchimp (secrets-patterns-db)
+    (br"key-[0-9a-zA-Z]{32}", 17),
+    (br"[0-9a-f]{32}-us[0-9]{1,2}", 18),
+    # Square (secrets-patterns-db)
+    (br"sq0atp-[0-9A-Za-z_-]{22}", 19),
+    (br"sq0csp-[0-9A-Za-z_-]{43}", 20),
+    # Private keys (secrets-patterns-db, detect-secrets)
+    (br"-----BEGIN RSA PRIVATE KEY-----", 21),
+    (br"-----BEGIN DSA PRIVATE KEY-----", 22),
+    (br"-----BEGIN EC PRIVATE KEY-----", 23),
+    (br"-----BEGIN OPENSSH PRIVATE KEY-----", 24),
+    (br"-----BEGIN PGP PRIVATE KEY BLOCK-----", 25),
+    (br"-----BEGIN PRIVATE KEY-----", 26),
+]
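
The module docstring states two constraints on this table: ids must be unique, and patterns must avoid backreferences and lookaround. A rough pre-check for both might look like the sketch below. The function `check_patterns` and the `sample` entries are hypothetical and not part of this commit. The lookaround/backreference test is a byte-level heuristic (it would also flag an escaped backslash followed by a digit), so compiling with Hyperscan remains the authoritative check.

```python
import re

# Heuristic for constructs the docstring rules out:
# lookahead/lookbehind "(?=", "(?!", "(?<=", "(?<!" and backreferences "\1".."\9".
_UNSUPPORTED = re.compile(rb"\(\?<?[=!]|\\[1-9]")


def check_patterns(patterns: list[tuple[bytes, int]]) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: list[str] = []
    seen: set[int] = set()
    for regex, pattern_id in patterns:
        if not isinstance(regex, bytes):
            problems.append(f"id {pattern_id}: pattern must be bytes")
            continue
        if pattern_id in seen:
            problems.append(f"id {pattern_id}: duplicate id")
        seen.add(pattern_id)
        if _UNSUPPORTED.search(regex):
            problems.append(f"id {pattern_id}: lookaround or backreference")
    return problems


sample = [
    (rb"AKIA[0-9A-Z]{16}", 3),
    (rb"ASIA[0-9A-Z]{16}", 3),                # duplicate id on purpose
    (rb"(?<![A-Z0-9])key-[0-9a-z]{32}", 17),  # lookbehind on purpose
]
for problem in check_patterns(sample):
    print(problem)
```

A check like this could run as a unit test over `PATTERNS` so that a bad entry is caught before the database is compiled at first use.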

tests/test_redact.py

Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
+from __future__ import annotations
+
+import pytest
+
+from convx_ai.redact import REDACTED, redact_secrets
+
+
+def test_redact_disabled() -> None:
+    assert redact_secrets("sk-proj-abc123xyz", redact=False) == "sk-proj-abc123xyz"
+
+
+def test_openai_project_key() -> None:
+    text = "Use API key sk-proj-redact-test-abc123xyz for the demo."
+    assert REDACTED in redact_secrets(text)
+    assert "sk-proj-redact-test-abc123xyz" not in redact_secrets(text)
+
+
+def test_openai_legacy_key() -> None:
+    text = "sk-abcdefghijklmnopqrstuvwxyz123456"
+    assert redact_secrets(text) == REDACTED
+
+
+def test_aws_key() -> None:
+    text = "AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE"
+    out = redact_secrets(text)
+    assert REDACTED in out
+    assert "AKIAIOSFODNN7EXAMPLE" not in out
+
+
+def test_github_token() -> None:
+    text = "ghp_" + "x" * 36
+    assert redact_secrets(text) == REDACTED
+
+
+def test_stripe_key() -> None:
+    text = "sk_live_" + "0" * 24
+    assert redact_secrets(text) == REDACTED
+
+
+def test_slack_token() -> None:
+    text = "xoxb-1234-5678-abcdefghijkl"
+    assert redact_secrets(text) == REDACTED
+
+
+def test_slack_webhook() -> None:
+    text = "https://hooks.slack.com/services/T123/B456/abc123"
+    assert redact_secrets(text) == REDACTED
+
+
+def test_sendgrid_key() -> None:
+    text = "SG." + "a" * 22 + "." + "b" * 43
+    assert redact_secrets(text) == REDACTED
+
+
+def test_private_key_rsa() -> None:
+    text = "-----BEGIN RSA PRIVATE KEY-----"
+    assert redact_secrets(text) == REDACTED
+
+
+def test_private_key_ec() -> None:
+    text = "-----BEGIN EC PRIVATE KEY-----"
+    assert redact_secrets(text) == REDACTED
+
+
+def test_private_key_openssh() -> None:
+    text = "-----BEGIN OPENSSH PRIVATE KEY-----"
+    assert redact_secrets(text) == REDACTED
+
+
+def test_multiple_secrets() -> None:
+    secret1 = "sk-proj-" + "a" * 20
+    secret2 = "sk-proj-" + "b" * 20
+    text = f"key1={secret1} key2={secret2}"
+    out = redact_secrets(text)
+    assert secret1 not in out and secret2 not in out
+    assert out.count(REDACTED) == 2
+
+
+def test_no_false_positive_urls() -> None:
+    text = "https://example.com/path?param=value"
+    assert redact_secrets(text) == text
+
+
+def test_no_false_positive_short_sk() -> None:
+    text = "sk-abc"
+    assert redact_secrets(text) == text
+
+
+def test_no_false_positive_ghp_prefix() -> None:
+    text = "ghp_short"
+    assert redact_secrets(text) == text
+
+
+def test_google_api_key() -> None:
+    text = "AIza" + "a" * 35
+    assert redact_secrets(text) == REDACTED
+
+
+def test_twilio_key() -> None:
+    text = "SK" + "a" * 32
+    assert redact_secrets(text) == REDACTED
+
+
+def test_telegram_bot_token() -> None:
+    text = "12345:AA" + "a" * 33
+    assert redact_secrets(text) == REDACTED
