feat: hybrid multi-modal PII detection with NER, evaluation harness, and v0.3.0 release
Add detect-then-replace architecture: regex patterns, field heuristics, and optional
transformer-based NER run in parallel, producing PIIDetection spans that are merged
(union strategy, highest confidence wins overlaps) before tokenization. The system
learns over time — fields with repeated PII detections get auto-promoted, and user
feedback (false positives/negatives) adjusts confidence.
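
The merge step described above (union of all spans, with the highest-confidence span winning any overlap) can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `PIIDetection` field names and the `merge_detections` helper are assumptions, and the greedy resolution order is one plausible reading of "highest confidence wins".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PIIDetection:
    # Field names are assumptions for illustration; the real model may differ.
    start: int          # inclusive character offset
    end: int            # exclusive character offset
    entity_type: str
    confidence: float
    source: str         # which strategy produced the span


def merge_detections(detections):
    """Union of all spans; where spans overlap, keep the most confident one."""
    kept = []
    # Visit spans from most to least confident so winners are placed first.
    for det in sorted(detections, key=lambda d: (-d.confidence, d.start)):
        # Keep the span only if it does not overlap anything already kept.
        if all(det.end <= k.start or det.start >= k.end for k in kept):
            kept.append(det)
    # Return in document order, ready for left-to-right tokenization.
    return sorted(kept, key=lambda d: d.start)


detections = [
    PIIDetection(0, 5, "NAME", 0.6, "ner"),
    PIIDetection(3, 10, "EMAIL", 0.9, "regex"),
    PIIDetection(12, 15, "PHONE", 0.5, "regex"),
]
merged = merge_detections(detections)
```

In this example the NAME span overlaps the more confident EMAIL span and is dropped, while the non-overlapping PHONE span survives.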
New components:
- pii_detection: PIIDetection model, DetectionStrategy protocol, RegexDetectionStrategy,
FieldHeuristicStrategy, NERDetectionStrategy, CompositeDetectionPipeline
- pii_evaluation: span-level IoU matching, per-entity precision/recall/F1, micro/macro averaging
- event_handler: external event ingestion for WOS integration
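
As a rough sketch of what span-level IoU matching with per-entity precision/recall/F1 and micro/macro averaging might look like: the function names, the tuple layout `(start, end, entity_type)`, and the 0.5 IoU threshold are all assumptions for illustration, not the actual `pii_evaluation` API.

```python
from collections import defaultdict


def span_iou(a, b):
    """Intersection over union of two half-open character spans (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def evaluate(gold, pred, iou_threshold=0.5):
    """gold/pred are lists of (start, end, entity_type) tuples."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    matched = set()
    for p in pred:
        # Find the best unmatched gold span of the same entity type.
        best, best_iou = None, 0.0
        for i, g in enumerate(gold):
            if i in matched or g[2] != p[2]:
                continue
            iou = span_iou(p[:2], g[:2])
            if iou > best_iou:
                best, best_iou = i, iou
        if best is not None and best_iou >= iou_threshold:
            matched.add(best)
            counts[p[2]]["tp"] += 1
        else:
            counts[p[2]]["fp"] += 1
    # Gold spans that no prediction matched are false negatives.
    for i, g in enumerate(gold):
        if i not in matched:
            counts[g[2]]["fn"] += 1

    def prf(c):
        p = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        r = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    per_entity = {t: prf(c) for t, c in counts.items()}
    # Micro: pool counts across entity types. Macro: average per-type F1.
    total = {k: sum(c[k] for c in counts.values()) for k in ("tp", "fp", "fn")}
    micro = prf(total)
    macro_f1 = (
        sum(f for _, _, f in per_entity.values()) / len(per_entity)
        if per_entity else 0.0
    )
    return per_entity, micro, macro_f1


gold = [(0, 10, "EMAIL"), (20, 30, "NAME")]
pred = [(0, 10, "EMAIL"), (22, 40, "NAME")]
per_entity, micro, macro_f1 = evaluate(gold, pred)
```

Here the EMAIL prediction matches exactly, while the NAME prediction overlaps its gold span with an IoU of 0.4 and so counts as both a false positive and a false negative under the assumed threshold.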
New CLI commands:
- pii-ingest: download and ingest ai4privacy/pii-masking-200k evaluation dataset
- pii-evaluate: run detection strategies against labeled data, print F1 table
Config additions: detection_mode (regex_only|hybrid|ner_only), ner_model, ner_device,
ner_confidence_threshold, ner_max_text_length in PIIConfig.
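
For orientation, the new options could be modeled along these lines. Only the field names and the three `detection_mode` values come from this commit message; the class name `PIIConfigSketch`, every default value, and the validation rules are assumptions, and the real `PIIConfig` may differ.

```python
from dataclasses import dataclass
from typing import Optional

DETECTION_MODES = ("regex_only", "hybrid", "ner_only")


@dataclass
class PIIConfigSketch:
    # All defaults below are illustrative assumptions, not the project's values.
    detection_mode: str = "regex_only"
    ner_model: Optional[str] = None
    ner_device: str = "cpu"
    ner_confidence_threshold: float = 0.5
    ner_max_text_length: int = 512

    def __post_init__(self):
        if self.detection_mode not in DETECTION_MODES:
            raise ValueError(f"detection_mode must be one of {DETECTION_MODES}")
        if not 0.0 <= self.ner_confidence_threshold <= 1.0:
            raise ValueError("ner_confidence_threshold must be in [0, 1]")
        if self.ner_max_text_length <= 0:
            raise ValueError("ner_max_text_length must be positive")


# "hypothetical-model" is a placeholder, not a real model identifier.
cfg = PIIConfigSketch(detection_mode="hybrid", ner_model="hypothetical-model")
```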
Optional ML dependencies: pip install apprentice-ai[ml] adds transformers, torch, datasets.
NER strategy lazy-loads and gracefully disables when deps are missing.
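
One way such lazy loading with graceful degradation could work is sketched below. `LazyNERStrategy` is a hypothetical stand-in, not the project's `NERDetectionStrategy`; the only third-party call assumed is the Hugging Face `transformers.pipeline("token-classification", ...)` factory, and it is reached only if the package is installed.

```python
import importlib
import importlib.util


class LazyNERStrategy:
    """Hypothetical stand-in for NERDetectionStrategy (illustration only)."""

    def __init__(self, model_name, module_name="transformers"):
        self.model_name = model_name
        self.module_name = module_name
        self._pipeline = None
        self._checked = False
        self.available = False

    def _ensure_loaded(self):
        # Attempt the expensive import and model load once, on first use.
        if self._checked:
            return
        self._checked = True
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(self.module_name) is None:
            return
        try:
            module = importlib.import_module(self.module_name)
            self._pipeline = module.pipeline(
                "token-classification",
                model=self.model_name,
                aggregation_strategy="simple",
            )
            self.available = True
        except Exception:
            # Any load failure disables the strategy instead of crashing.
            self._pipeline = None

    def detect(self, text):
        self._ensure_loaded()
        if not self.available:
            return []  # Disabled: contribute no spans.
        return [
            (e["start"], e["end"], e["entity_group"], float(e["score"]))
            for e in self._pipeline(text)
        ]


# Simulate missing ML dependencies with a module name that cannot exist.
strategy = LazyNERStrategy("hypothetical-model", module_name="no_such_ml_module_xyz")
spans = strategy.detect("Contact Jane Doe at jane@example.com")
```

Nothing is imported at construction time, so callers that never invoke `detect` pay no import cost, and a missing dependency simply yields an empty span list.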
Updated docs: README with PII protection section, GitHub Pages with privacy feature cards,
example config with PII options, version bump to 0.3.0.
2486 tests pass, 19 skipped, 0 failures.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
 2. **Reinforcement** — Both models process each request. The evaluator scores local vs. remote output. A rolling window tracks correlation. When sustained correlation exceeds the threshold, the system promotes to Phase 3.
 3. **Steady State** — The local model handles most traffic. An adaptive sampler periodically sends requests to both models to verify quality hasn't degraded. If it has, the system automatically regresses to Phase 2.
 
+## PII Protection
+
+Apprentice includes a built-in PII detection and tokenization middleware that scrubs sensitive data before it reaches models, training stores, or audit logs. The system uses a hybrid multi-modal approach that combines fast regex patterns with optional NER model inference.
docs/index.html (27 additions, 5 deletions)
@@ -623,6 +623,21 @@ <h3>Adaptive Sampling</h3>
 <h3>Multi-Provider</h3>
 <p>Anthropic, OpenAI, or any API as the remote teacher. Ollama, vLLM, or llama.cpp as the local student. Mix and match per task.</p>
 </div>
+<div class="feature-card">
+<span class="feature-icon">🔒</span>
+<h3>PII Protection</h3>
+<p>Hybrid multi-modal PII detection: fast regex, field heuristics, and optional NER model inference. Scrubs sensitive data before it reaches models or logs. Learns over time.</p>
+</div>
+<div class="feature-card">
+<span class="feature-icon">🧠</span>
+<h3>NER Integration</h3>
+<p>Optional transformer-based named entity recognition catches person names, addresses, and organizations that regex can't. Lazy-loaded — zero overhead when disabled.</p>
+</div>
+<div class="feature-card">
+<span class="feature-icon">📝</span>
+<h3>Feedback Loop</h3>
+<p>Human and AI feedback drives continuous improvement. False positive/negative reports adjust detection confidence. The system gets smarter with every correction.</p>
pyproject.toml (6 additions, 1 deletion)
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "apprentice-ai"
-version = "0.2.0"
+version = "0.3.0"
 description = "Adaptive model distillation framework — progressively replace expensive API calls with fine-tuned local models through coaching, evaluation, and phased rollout"