## Summary

Evaluate replacing `distilbert-base-cased` (66M params) with a stronger base model such as `microsoft/deberta-v3-base` or `roberta-base` to improve NER performance.
## Problem

The config currently hardcodes `distilbert-base-cased` as the base encoder. While DistilBERT is fast and lightweight, it consistently underperforms on NER tasks compared to more capable encoders. With 24 PII entity types and a BIO labeling scheme producing 33+ classes, the model needs strong contextual representations to disambiguate fine-grained entity boundaries — especially for rare types like IBANs, security tokens, and IP addresses.
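For context, a full BIO scheme over N entity types yields up to 2N + 1 labels (a B- and an I- tag per type, plus a single O tag). A minimal sketch, using a hypothetical subset of the PII types:

```python
# Illustrative subset only; the real label set covers 24 PII entity types.
entity_types = ["EMAIL", "IBAN", "IP_ADDRESS", "CREDIT_CARD"]

# BIO: one B- (begin) and one I- (inside) tag per type, plus a single O tag.
labels = ["O"] + [f"{p}-{t}" for t in entity_types for p in ("B", "I")]
print(labels)
# ['O', 'B-EMAIL', 'I-EMAIL', 'B-IBAN', 'I-IBAN', 'B-IP_ADDRESS', ...]
```

Every added label is another row in the classifier head's softmax, which is why a weaker encoder struggles as the label space grows.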
## Candidate models

| Model | Params | NER strength | Inference cost | Notes |
|---|---|---|---|---|
| `distilbert-base-cased` (current) | 66M | Baseline | Fastest | Good for latency-constrained serving |
| `roberta-base` | 125M | +2–3 F1 typical | ~2× DistilBERT | Strong general-purpose encoder |
| `microsoft/deberta-v3-base` | 86M | +3–5 F1 typical | ~1.5× DistilBERT | Disentangled attention excels at span boundaries; best in class for NER at this size |
| `microsoft/deberta-v3-small` | 44M | +1–2 F1 typical | ~0.8× DistilBERT | Smaller than DistilBERT but often matches or beats it |
DeBERTa-v3 is particularly well suited for NER because its disentangled attention mechanism separately models content and position, which helps the model reason about entity boundaries more precisely.
## Proposed approach

- **Benchmark first.** Train the current pipeline with `deberta-v3-base` and `roberta-base` alongside the DistilBERT baseline on the same data split. Compare entity-level F1 (per-class and macro) and inference latency.
- **If latency is a hard constraint**, consider `deberta-v3-small` (44M params, smaller than DistilBERT) or knowledge distillation: train the stronger model, then distill it back into DistilBERT using the larger model's soft labels.
- **No structural change to `config.py` is needed.** The model name is already a first-class config option via `model_name`, and the `AutoModel.from_pretrained` usage in `model.py` means swapping the encoder is a config-only change for most models.
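If the encoder really is loaded via `AutoModel.from_pretrained`, the swap looks roughly like the following. The `TokenClassifier` class here is an illustrative stand-in for whatever `model.py` actually defines, not the repo's real code:

```python
from torch import nn
from transformers import AutoModel

class TokenClassifier(nn.Module):
    """Sketch of a generic encoder plus a linear token-classification head."""

    def __init__(self, model_name: str, num_labels: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        # All three candidate encoders expose hidden_size=768, so the head
        # shape is unchanged regardless of which model_name is configured.
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.last_hidden_state)  # (batch, seq_len, num_labels)

# Swapping encoders is then a one-line config change, e.g.:
# model = TokenClassifier("microsoft/deberta-v3-base", num_labels=33)
```

The head shapes only stay put because the candidates share a hidden size; a candidate with a different `hidden_size` would still work here, since the head reads the dimension from the encoder config rather than hardcoding 768.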
## Integration considerations

- **Tokenizer differences:** RoBERTa (byte-level BPE) and DeBERTa-v3 (SentencePiece) use different tokenizers than DistilBERT (WordPiece). The tokenization pipeline in `tokenization.py` uses `AutoTokenizer`, so this should work out of the box, but label alignment in `_find_privacy_mask_positions` should be verified.
- **ONNX export:** The quantization pipeline in `quantize.py` exports to ONNX. DeBERTa-v3 has known ONNX export quirks (custom ops for disentangled attention), while RoBERTa exports cleanly. Validate the export before committing to a model.
- **Hidden size:** Both RoBERTa-base and DeBERTa-v3-base use `hidden_size=768`, the same as DistilBERT, so the classification heads need no changes.
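The tokenizer concern above can be smoke-tested before training by checking that word-level BIO labels survive subword splitting under each tokenizer family. A hedged sketch — `align_labels` is a hypothetical helper illustrating the alignment logic, not the repo's `_find_privacy_mask_positions`:

```python
from transformers import AutoTokenizer

def align_labels(tokenizer, words, word_labels):
    """Map word-level BIO labels onto subword tokens via word_ids()."""
    enc = tokenizer(words, is_split_into_words=True)
    aligned, prev = [], None
    for wid in enc.word_ids():
        if wid is None:
            aligned.append("O")  # special tokens; real code may use -100 instead
        elif wid != prev:
            aligned.append(word_labels[wid])  # first subword keeps the word label
        else:
            lab = word_labels[wid]
            # Continuation subwords of a B- word become I- of the same type.
            aligned.append("I-" + lab[2:] if lab.startswith("B-") else lab)
        prev = wid
    return aligned

# Sanity check across tokenizer families (WordPiece vs byte-level BPE):
for name in ("distilbert-base-cased", "roberta-base"):
    tok = AutoTokenizer.from_pretrained(name)
    out = align_labels(tok, ["Contact", "alice@example.com", "now"],
                       ["O", "B-EMAIL", "O"])
    assert out.count("B-EMAIL") == 1 and "I-EMAIL" in out
```

Running the same assertion against each candidate tokenizer gives a cheap go/no-go signal on the alignment code before any retraining.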