feat(eval): switch to seqeval for entity-level F1 metrics #261

@hanneshapke

Description
Summary

Replace the current token-level F1 evaluation with seqeval entity-level (span-level) F1, which is the standard metric for NER and gives an accurate picture of model quality.

Problem

The current compute_metrics in trainer.py uses sklearn's f1_score at the token level. This means each token is evaluated independently — if the model gets 3 out of 4 tokens in a phone number correct, it scores 75% on those tokens. But the phone number itself is completely wrong: a partially detected 555-987-6543 still leaks through the proxy.

Token-level metrics systematically inflate performance:

  • A model that predicts B-PHONENUMBER I-PHONENUMBER I-PHONENUMBER O for a 4-token phone number scores 75% token F1 but 0% entity F1 — the entity was not correctly extracted
  • O tokens dominate most sequences (often 80%+), so a model that mostly predicts O can score high token F1 while missing most entities
  • Partial matches on multi-token entities (addresses, IBANs, full names) look good token-wise but are useless for PII redaction

For a privacy tool, entity-level F1 is the only metric that reflects real-world performance: either the entire PII span is correctly identified and redacted, or it isn't.

Proposed changes

Install seqeval

Add seqeval to the training dependencies (pip install seqeval).

Update compute_metrics in trainer.py

Replace the sklearn token-level evaluation with seqeval's span-level evaluation:

import numpy as np

from seqeval.metrics import classification_report, f1_score
from seqeval.scheme import IOB2

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # predictions shape: (batch, seq_len, num_labels)
    predictions = np.argmax(predictions, axis=2)

    # Convert integer IDs back to BIO tag strings, skipping padding (-100)
    id_to_label = {v: k for k, v in label_mapping.items()}
    true_labels = []
    pred_labels = []

    for pred_seq, label_seq in zip(predictions, labels):
        true_seq = []
        pred_seq_tags = []
        for p, l in zip(pred_seq, label_seq):
            if l == -100:
                continue  # skip padding / special tokens
            true_seq.append(id_to_label[l])
            pred_seq_tags.append(id_to_label[p])
        true_labels.append(true_seq)
        pred_labels.append(pred_seq_tags)

    # Entity-level metrics
    return {
        "f1": f1_score(true_labels, pred_labels, mode="strict", scheme=IOB2),
        "report": classification_report(true_labels, pred_labels, scheme=IOB2),
    }

What seqeval evaluates

seqeval counts an entity as correct only if:

  1. The entity type matches (e.g., PHONENUMBER)
  2. The span boundaries match exactly (same start and end tokens)

This means:

  • B-PHONENUMBER I-PHONENUMBER I-PHONENUMBER predicted as B-PHONENUMBER I-PHONENUMBER O → false negative (span boundary wrong)
  • B-EMAIL predicted as B-URL → false negative for EMAIL, false positive for URL
  • B-FIRSTNAME I-FIRSTNAME predicted correctly → true positive

Per-class reporting

seqeval.metrics.classification_report provides precision, recall, and F1 broken down by entity type. This is critical for identifying which PII types the model struggles with:

              precision    recall  f1-score   support

     ADDRESS       0.82      0.78      0.80       156
       EMAIL       0.97      0.98      0.97       203
   FIRSTNAME       0.91      0.93      0.92       412
        IBAN       0.85      0.79      0.82        47
 PHONENUMBER       0.88      0.84      0.86       189
         SSN       0.90      0.87      0.88        63
     SURNAME       0.89      0.91      0.90       398
         ...

   micro avg       0.90      0.88      0.89      2341
   macro avg       0.87      0.84      0.85      2341

Logging

Log both the overall F1 and the per-class report at the end of each evaluation epoch so regressions on specific entity types are visible during training.

Notes

  • If a CRF layer is added later (feat(model): add CRF layer on top of token classifier for valid BIO sequence decoding #256), seqeval becomes even more important — the CRF's value is in improving entity-level metrics specifically, and token-level F1 would understate its contribution
  • seqeval expects IOB2 format (which is what the codebase uses: B- prefix for begin, I- for inside, O for outside)
  • The mode="strict" flag ensures exact span matching with no partial credit
