fix(tokenization): replace fragile substring matching with character offsets from annotations #264

@hanneshapke

Description

Summary

The _find_privacy_mask_positions method in tokenization.py uses text.find(value) to re-discover entity positions by substring matching. This is fragile and produces incorrect labels when PII values appear as substrings of other words or occur multiple times in the text. The original character offsets from Label Studio annotations are available but are discarded during preprocessing — they should be preserved and used instead.

Problem

Current implementation (tokenization.py:22-50)

def _find_privacy_mask_positions(self, text, privacy_mask):
    privacy_mask_with_positions = []
    for item in privacy_mask:
        value = item["value"]
        label = item["label"]
        start = 0
        while True:
            pos = text.find(value, start)
            if pos == -1:
                break
            privacy_mask_with_positions.append({
                "value": value, "label": label,
                "start": pos, "end": pos + len(value),
            })
            start = pos + 1
    return sorted(privacy_mask_with_positions, key=lambda x: x["start"], reverse=True)

This has several failure modes:

1. Partial/spurious matches

  • First name "Alex" matches inside "Alexandria" → the city gets partially labeled as FIRSTNAME
  • Surname "Park" matches inside "Parking" → non-PII word gets labeled
  • Street name "Main" matches inside "Maintain" → false annotation

2. Duplicate value over-matching

  • The while True loop finds all occurrences of a value. If "John" appears twice in the text but only one instance is annotated PII, both get labeled — injecting noise into training data.

3. Overlapping entity collisions

  • If a first name "Dan" and a username "DanTheMan" both exist in the text, text.find("Dan") matches inside the username, creating overlapping/conflicting annotations.
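The failure modes above can be reproduced with a few lines. This sketch mimics the `while True` / `text.find` loop from the current implementation; the sample text and values are illustrative, not from the dataset:

```python
def find_all(text, value):
    """Mimic the buggy loop: collect every occurrence of value via str.find."""
    positions, start = [], 0
    while True:
        pos = text.find(value, start)
        if pos == -1:
            break
        positions.append((pos, pos + len(value)))
        start = pos + 1
    return positions

text = "Alex parked near the Parking lot in Alexandria."
print(find_all(text, "Alex"))  # [(0, 4), (36, 40)] — second hit is inside "Alexandria"
print(find_all(text, "Park"))  # [(21, 25)] — "Parking" wrongly matched
```

Only the first `"Alex"` span is real PII; the spurious hits would be labeled FIRSTNAME and fed to the model as training signal.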

Root cause: offsets are discarded

In preprocessing.py, the convert_labelstudio_to_training_format function extracts the original Label Studio character offsets but then throws them away:

# Line 135-144: offsets ARE extracted
entities[entity_id] = {
    "text": value.get("text", ""),
    "label": labels[0] if labels else None,
    "start": value.get("start"),      # ← available
    "end": value.get("end"),          # ← available
}

# Line 183-188: but only value and label are kept
privacy_mask.append({
    "value": main_entity["text"],     # ← kept
    "label": main_entity["label"],    # ← kept
    # start and end are DISCARDED
})

This forces _find_privacy_mask_positions to re-discover positions via substring search, which is where all the bugs come from.

Proposed fix

1. Preserve character offsets through the pipeline

In preprocessing.py, include start and end in the privacy_mask entries:

privacy_mask.append({
    "value": main_entity["text"],
    "label": main_entity["label"],
    "start": main_entity["start"],
    "end": main_entity["end"],
})

2. Use offsets directly instead of substring search

In tokenization.py, rewrite _find_privacy_mask_positions so that it passes the annotation offsets through whenever they are present, falling back to substring matching (with word-boundary awareness) for data sources that don't provide offsets:

import re  # module-level import in tokenization.py, not inside the loop

def _find_privacy_mask_positions(self, text, privacy_mask):
    privacy_mask_with_positions = []
    for item in privacy_mask:
        if "start" in item and "end" in item:
            # Use annotation offsets directly — no search needed
            privacy_mask_with_positions.append({
                "value": item["value"],
                "label": item["label"],
                "start": item["start"],
                "end": item["end"],
            })
        else:
            # Fallback: word-boundary-aware search for external datasets
            pattern = re.escape(item["value"])
            for match in re.finditer(rf'\b{pattern}\b', text):
                privacy_mask_with_positions.append({
                    "value": item["value"],
                    "label": item["label"],
                    "start": match.start(),
                    "end": match.end(),
                })

    return sorted(
        privacy_mask_with_positions, key=lambda x: x["start"], reverse=True
    )
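With offsets preserved, each annotation resolves to exactly one span even when one value is a prefix of another, which is the overlap case from failure mode 3. A quick illustrative check (hypothetical text and mask):

```python
text = "Dan's username is DanTheMan."
privacy_mask = [
    {"value": "Dan", "label": "FIRSTNAME", "start": 0, "end": 3},
    {"value": "DanTheMan", "label": "USERNAME", "start": 18, "end": 27},
]

# Each offset pair slices out exactly the annotated span...
for item in privacy_mask:
    assert text[item["start"]:item["end"]] == item["value"]

# ...whereas substring search would also hit "Dan" inside "DanTheMan":
assert text.find("Dan", 1) == 18
```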

3. Add a validation check

After building positions, verify that text[start:end] == value to catch any offset drift from text normalization:

for entry in privacy_mask_with_positions:
    actual = text[entry["start"]:entry["end"]]
    if actual != entry["value"]:
        logger.warning(
            f"Offset mismatch: expected '{entry['value']}' "
            f"but found '{actual}' at [{entry['start']}:{entry['end']}]"
        )
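This check matters because any normalization applied between annotation and tokenization shifts offsets silently. A hypothetical example with whitespace collapsing shows the kind of drift it would flag:

```python
raw = "Hi,  Alex!"                     # text as annotated (double space)
normalized = " ".join(raw.split())     # "Hi, Alex!" — one character shorter
entry = {"value": "Alex", "start": 5, "end": 9}

assert raw[entry["start"]:entry["end"]] == entry["value"]         # offsets valid on raw text
assert normalized[entry["start"]:entry["end"]] != entry["value"]  # drift the check catches
```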

Impact

This is a silent data quality bug — the model trains on incorrectly labeled tokens without any warning. Fixing it improves training signal quality across the entire dataset, which compounds with every other improvement (class weights, better metrics, CRF, etc.).
