feat(model): evaluate upgrading from DistilBERT to a stronger base encoder #258

@hanneshapke

Description

Summary

Evaluate replacing distilbert-base-cased (66M params) with a stronger base model such as microsoft/deberta-v3-base or roberta-base to improve NER performance.

Problem

The config currently hardcodes distilbert-base-cased as the base encoder. While DistilBERT is fast and lightweight, it consistently underperforms on NER tasks compared to more capable encoders. With 24 PII entity types and a BIO labeling scheme producing 33+ classes, the model needs strong contextual representations to disambiguate fine-grained entity boundaries — especially for rare types like IBANs, security tokens, and IP addresses.

Candidate models

| Model | Params | NER strength | Inference cost | Notes |
|---|---|---|---|---|
| distilbert-base-cased (current) | 66M | Baseline | Fastest | Good for latency-constrained serving |
| roberta-base | 125M | +2–3 F1 typical | ~2× DistilBERT | Strong general-purpose encoder |
| microsoft/deberta-v3-base | 86M | +3–5 F1 typical | ~1.5× DistilBERT | Disentangled attention excels at span boundaries; best in class for NER at this size |
| microsoft/deberta-v3-small | 44M | +1–2 F1 typical | ~0.8× DistilBERT | Smaller than DistilBERT but often matches or beats it |

DeBERTa-v3 is particularly well suited for NER because its disentangled attention mechanism separately models content and position, which helps the model reason about entity boundaries more precisely.

Proposed approach

  1. Benchmark first. Train the current pipeline with deberta-v3-base and roberta-base alongside the DistilBERT baseline on the same data split. Compare entity-level F1 (per-class and macro) and inference latency.
  2. If latency is a hard constraint, consider deberta-v3-small (44M params, smaller than DistilBERT) or a knowledge distillation approach: train on the stronger model, then distill back into DistilBERT using the larger model's soft labels.
  3. Update the config. The model name is already a first-class option (model_name in config.py), and model.py loads the encoder via AutoModel.from_pretrained, so swapping encoders is a config-only change for most models; no structural change to config.py is needed.
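The steps above amount to a config-only swap. A minimal sketch, assuming config.py exposes a dataclass-style config (the class name here is hypothetical; only the model_name field is confirmed by this issue):

```python
from dataclasses import dataclass

# Hypothetical mirror of the relevant config.py fields; the real module
# may differ, but model_name already exists there per the issue text.
@dataclass
class ModelConfig:
    model_name: str = "distilbert-base-cased"
    max_length: int = 512


def build_encoder(cfg: ModelConfig):
    # model.py reportedly does the equivalent of this, which is why
    # changing cfg.model_name is the only edit needed to swap encoders.
    from transformers import AutoModel  # requires `pip install transformers`
    return AutoModel.from_pretrained(cfg.model_name)


# Config-only swap: nothing else in the pipeline changes.
cfg = ModelConfig(model_name="microsoft/deberta-v3-base")
```

Note that DeBERTa-v3's tokenizer typically requires the sentencepiece package to be installed alongside transformers.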

Integration considerations

  • Tokenizer differences: RoBERTa uses byte-level BPE and DeBERTa-v3 uses SentencePiece, whereas DistilBERT uses WordPiece. The tokenization pipeline in tokenization.py uses AutoTokenizer, so this should work out of the box, but label alignment in _find_privacy_mask_positions should be verified, since subword boundaries will shift between tokenizers.
  • ONNX export: The quantization pipeline in quantize.py exports to ONNX. DeBERTa-v3 has known ONNX export quirks (custom ops for disentangled attention). RoBERTa exports cleanly. This should be validated before committing to a model.
  • Hidden size: Both RoBERTa-base and DeBERTa-v3-base use hidden_size=768, same as DistilBERT, so the classification heads don't need changes.
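The label-alignment concern above can be exercised with a small standalone helper. This is an illustrative sketch, not the actual _find_privacy_mask_positions, and it assumes the pipeline uses a fast tokenizer's word_ids() mapping:

```python
def align_labels_to_subwords(word_ids, word_labels, ignore_index=-100):
    """Align word-level BIO labels to subword tokens.

    word_ids: the output of tokenizer(...).word_ids() -- None for special
    tokens, otherwise the index of the source word each subword came from.
    Only the first subword of each word keeps its label; continuation
    subwords and special tokens get ignore_index so the loss skips them.
    """
    aligned = []
    prev = None
    for wid in word_ids:
        if wid is None:
            aligned.append(ignore_index)      # [CLS], [SEP], padding
        elif wid != prev:
            aligned.append(word_labels[wid])  # first subword of the word
        else:
            aligned.append(ignore_index)      # continuation subword
        prev = wid
    return aligned


# Hypothetical example: three words tagged [O, O, B-IBAN] (ids 0, 0, 5),
# split by some tokenizer into 5 subwords plus two special tokens:
word_ids = [None, 0, 1, 1, 2, 2, None]
aligned = align_labels_to_subwords(word_ids, [0, 0, 5])
# -> [-100, 0, 0, -100, 5, -100, -100]
```

Running the same sentence through the DistilBERT, RoBERTa, and DeBERTa-v3 tokenizers and comparing the aligned label sequences is a cheap way to confirm the alignment logic survives the tokenizer change.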
