feat(eval): switch to seqeval for entity-level F1 metrics #261

@hanneshapke

Description
Summary

Replace the current token-level F1 evaluation with seqeval entity-level (span-level) F1, which is the standard metric for NER and gives an accurate picture of model quality.

Problem

The current compute_metrics in trainer.py uses sklearn's f1_score at the token level. This means each token is evaluated independently — if the model gets 3 out of 4 tokens in a phone number correct, it scores 75% on those tokens. But the phone number itself is completely wrong: a partially detected 555-987-6543 still leaks through the proxy.

Token-level metrics systematically inflate performance:

  • A model that predicts B-PHONENUMBER I-PHONENUMBER I-PHONENUMBER O for a 4-token phone number scores 75% token F1 but 0% entity F1 — the entity was not correctly extracted
  • O tokens dominate most sequences (often 80%+), so a model that mostly predicts O can score high token F1 while missing most entities
  • Partial matches on multi-token entities (addresses, IBANs, full names) look good token-wise but are useless for PII redaction

For a privacy tool, entity-level F1 is the only metric that reflects real-world performance: either the entire PII span is correctly identified and redacted, or it isn't.

Proposed changes

Install seqeval

Add seqeval to the training dependencies (pip install seqeval).

Update compute_metrics in trainer.py

Replace the sklearn token-level evaluation with seqeval's span-level evaluation:

import numpy as np

from seqeval.metrics import classification_report, f1_score
from seqeval.scheme import IOB2

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # predictions shape: (batch, seq_len, num_labels)
    predictions = np.argmax(predictions, axis=2)

    # Convert integer IDs back to BIO tag strings, skipping padding (-100)
    id_to_label = {v: k for k, v in label_mapping.items()}
    true_labels = []
    pred_labels = []

    for pred_seq, label_seq in zip(predictions, labels):
        true_seq = []
        pred_seq_tags = []
        for p, l in zip(pred_seq, label_seq):
            if l == -100:
                continue  # skip padding / special tokens
            true_seq.append(id_to_label[l])
            pred_seq_tags.append(id_to_label[p])
        true_labels.append(true_seq)
        pred_labels.append(pred_seq_tags)

    # Entity-level metrics
    return {
        "f1": f1_score(true_labels, pred_labels, mode="strict", scheme=IOB2),
        "report": classification_report(true_labels, pred_labels, scheme=IOB2),
    }

What seqeval evaluates

seqeval counts an entity as correct only if:

  1. The entity type matches (e.g., PHONENUMBER)
  2. The span boundaries match exactly (same start and end tokens)

This means:

  • B-PHONENUMBER I-PHONENUMBER I-PHONENUMBER predicted as B-PHONENUMBER I-PHONENUMBER O → false negative (span boundary wrong)
  • B-EMAIL predicted as B-URL → false negative for EMAIL, false positive for URL
  • B-FIRSTNAME I-FIRSTNAME predicted correctly → true positive

Per-class reporting

seqeval.metrics.classification_report provides precision, recall, and F1 broken down by entity type. This is critical for identifying which PII types the model struggles with:

              precision    recall  f1-score   support

     ADDRESS       0.82      0.78      0.80       156
       EMAIL       0.97      0.98      0.97       203
   FIRSTNAME       0.91      0.93      0.92       412
        IBAN       0.85      0.79      0.82        47
 PHONENUMBER       0.88      0.84      0.86       189
         SSN       0.90      0.87      0.88        63
     SURNAME       0.89      0.91      0.90       398
         ...

   micro avg       0.90      0.88      0.89      2341
   macro avg       0.87      0.84      0.85      2341

Logging

Log both the overall F1 and the per-class report at the end of each evaluation epoch so regressions on specific entity types are visible during training.

Notes

  • If a CRF layer is added later (feat(model): add CRF layer on top of token classifier for valid BIO sequence decoding #256), seqeval becomes even more important — the CRF's value is in improving entity-level metrics specifically, and token-level F1 would understate its contribution
  • seqeval expects IOB2 format (which is what the codebase uses: B- prefix for begin, I- for inside, O for outside)
  • The mode="strict" flag ensures exact span matching with no partial credit
