feat(model): evaluate upgrading from DistilBERT to a stronger base encoder #258

@hanneshapke

Description

Summary

Evaluate replacing distilbert-base-cased (66M params) with a stronger base model such as microsoft/deberta-v3-base or roberta-base to improve NER performance.

Problem

The config currently hardcodes distilbert-base-cased as the base encoder. While DistilBERT is fast and lightweight, it consistently underperforms on NER tasks compared to more capable encoders. With 24 PII entity types and a BIO labeling scheme producing 33+ classes, the model needs strong contextual representations to disambiguate fine-grained entity boundaries — especially for rare types like IBANs, security tokens, and IP addresses.

Candidate models

| Model | Params | NER strength | Inference cost | Notes |
|---|---|---|---|---|
| distilbert-base-cased (current) | 66M | Baseline | Fastest | Good for latency-constrained serving |
| roberta-base | 125M | +2–3 F1 typical | ~2× DistilBERT | Strong general-purpose encoder |
| microsoft/deberta-v3-base | 86M | +3–5 F1 typical | ~1.5× DistilBERT | Disentangled attention excels at span boundaries; best in class for NER at this size |
| microsoft/deberta-v3-small | 44M | +1–2 F1 typical | ~0.8× DistilBERT | Smaller than DistilBERT but often matches or beats it |

DeBERTa-v3 is particularly well suited for NER because its disentangled attention mechanism separately models content and position, which helps the model reason about entity boundaries more precisely.

Proposed approach

  1. Benchmark first. Train the current pipeline with deberta-v3-base and roberta-base alongside the DistilBERT baseline on the same data split. Compare entity-level F1 (per-class and macro) and inference latency.
  2. If latency is a hard constraint, consider deberta-v3-small (44M params, smaller than DistilBERT) or a knowledge distillation approach: train on the stronger model, then distill back into DistilBERT using the larger model's soft labels.
  3. Update the config. The model name is already a first-class option (model_name in config.py), and model.py loads the encoder via AutoModel.from_pretrained, so swapping encoders is a config-only change for most models; no structural change to config.py is needed.
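The steps above amount to a config-only swap. A minimal sketch, assuming config.py exposes a dataclass-style config (the class name here is hypothetical; only the model_name field is confirmed by this issue):

```python
from dataclasses import dataclass

# Hypothetical mirror of the relevant config.py fields; the real module
# may differ, but model_name already exists there per the issue text.
@dataclass
class ModelConfig:
    model_name: str = "distilbert-base-cased"
    max_length: int = 512


def build_encoder(cfg: ModelConfig):
    # model.py reportedly does the equivalent of this, which is why
    # changing cfg.model_name is the only edit needed to swap encoders.
    from transformers import AutoModel  # requires `pip install transformers`
    return AutoModel.from_pretrained(cfg.model_name)


# Config-only swap: nothing else in the pipeline changes.
cfg = ModelConfig(model_name="microsoft/deberta-v3-base")
```

Note that DeBERTa-v3's tokenizer typically requires the sentencepiece package to be installed alongside transformers.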

Integration considerations

  • Tokenizer differences: RoBERTa uses byte-level BPE and DeBERTa-v3 uses SentencePiece, whereas DistilBERT uses WordPiece. The tokenization pipeline in tokenization.py uses AutoTokenizer, so this should work out of the box, but label alignment in _find_privacy_mask_positions should be verified, since subword boundaries will shift between tokenizers.
  • ONNX export: The quantization pipeline in quantize.py exports to ONNX. DeBERTa-v3 has known ONNX export quirks (custom ops for disentangled attention). RoBERTa exports cleanly. This should be validated before committing to a model.
  • Hidden size: Both RoBERTa-base and DeBERTa-v3-base use hidden_size=768, same as DistilBERT, so the classification heads don't need changes.
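The label-alignment concern above can be exercised with a small standalone helper. This is an illustrative sketch, not the actual _find_privacy_mask_positions, and it assumes the pipeline uses a fast tokenizer's word_ids() mapping:

```python
def align_labels_to_subwords(word_ids, word_labels, ignore_index=-100):
    """Align word-level BIO labels to subword tokens.

    word_ids: the output of tokenizer(...).word_ids() -- None for special
    tokens, otherwise the index of the source word each subword came from.
    Only the first subword of each word keeps its label; continuation
    subwords and special tokens get ignore_index so the loss skips them.
    """
    aligned = []
    prev = None
    for wid in word_ids:
        if wid is None:
            aligned.append(ignore_index)      # [CLS], [SEP], padding
        elif wid != prev:
            aligned.append(word_labels[wid])  # first subword of the word
        else:
            aligned.append(ignore_index)      # continuation subword
        prev = wid
    return aligned


# Hypothetical example: three words tagged [O, O, B-IBAN] (ids 0, 0, 5),
# split by some tokenizer into 5 subwords plus two special tokens:
word_ids = [None, 0, 1, 1, 2, 2, None]
aligned = align_labels_to_subwords(word_ids, [0, 0, 5])
# -> [-100, 0, 0, -100, 5, -100, -100]
```

Running the same sentence through the DistilBERT, RoBERTa, and DeBERTa-v3 tokenizers and comparing the aligned label sequences is a cheap way to confirm the alignment logic survives the tokenizer change.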
