Summary

Replace the current token-level F1 evaluation with `seqeval` entity-level (span-level) F1, which is the standard metric for NER and gives an accurate picture of model quality.
Problem
The current `compute_metrics` in `trainer.py` uses sklearn's `f1_score` at the token level. This means each token is evaluated independently: if the model gets 3 out of 4 tokens in a phone number correct, it scores 75% on those tokens. But the phone number itself is completely wrong: a partially detected `555-987-6543` still leaks through the proxy. Token-level metrics systematically inflate performance:
- A model that predicts `B-PHONENUMBER I-PHONENUMBER I-PHONENUMBER O` for a 4-token phone number scores 75% token F1 but 0% entity F1: the entity was not correctly extracted
- `O` tokens dominate most sequences (often 80%+), so a model that mostly predicts `O` can score high token F1 while missing most entities
- Partial matches on multi-token entities (addresses, IBANs, full names) look good token-wise but are useless for PII redaction
For a privacy tool, entity-level F1 is the only metric that reflects real-world performance: either the entire PII span is correctly identified and redacted, or it isn't.
Proposed changes
Install seqeval
Add `seqeval` to the training dependencies (`pip install seqeval`).
Update `compute_metrics` in `trainer.py`
Replace the sklearn token-level evaluation with seqeval's span-level evaluation:
Per-class reporting

`seqeval.metrics.classification_report` provides precision, recall, and F1 broken down by entity type. This is critical for identifying which PII types the model struggles with:
Logging

Log both the overall F1 and the per-class report at the end of each evaluation epoch so regressions on specific entity types are visible during training.
What seqeval evaluates
seqeval counts an entity as correct only if:
- The span boundaries match exactly
- The entity type matches exactly (e.g. `PHONENUMBER`)

This means:
- `B-PHONENUMBER I-PHONENUMBER I-PHONENUMBER` predicted as `B-PHONENUMBER I-PHONENUMBER O` → false negative (span boundary wrong)
- `B-EMAIL` predicted as `B-URL` → false negative for EMAIL, false positive for URL
- `B-FIRSTNAME I-FIRSTNAME` predicted correctly → true positive
Notes
- seqeval expects the `IOB2` format (which is what the codebase uses: `B-` prefix for begin, `I-` for inside, `O` for outside)
- The `mode="strict"` flag ensures exact span matching with no partial credit