fix(tokenization): replace fragile substring matching with character offsets from annotations #264

@hanneshapke

Description

Summary

The _find_privacy_mask_positions method in tokenization.py uses text.find(value) to re-discover entity positions by substring matching. This is fragile and produces incorrect labels when PII values appear as substrings of other words or occur multiple times in the text. The original character offsets from Label Studio annotations are available but are discarded during preprocessing — they should be preserved and used instead.

Problem

Current implementation (tokenization.py:22-50)

def _find_privacy_mask_positions(self, text, privacy_mask):
    privacy_mask_with_positions = []
    for item in privacy_mask:
        value = item["value"]
        label = item["label"]
        start = 0
        while True:
            pos = text.find(value, start)
            if pos == -1:
                break
            privacy_mask_with_positions.append({
                "value": value, "label": label,
                "start": pos, "end": pos + len(value),
            })
            start = pos + 1
    return sorted(privacy_mask_with_positions, key=lambda x: x["start"], reverse=True)

This has several failure modes:

1. Partial/spurious matches

  • First name "Alex" matches inside "Alexandria" → the city gets partially labeled as FIRSTNAME
  • Surname "Park" matches inside "Parking" → non-PII word gets labeled
  • Street name "Main" matches inside "Maintain" → false annotation

2. Duplicate value over-matching

  • The while True loop finds all occurrences of a value. If "John" appears twice in the text but only one instance is annotated PII, both get labeled — injecting noise into training data.

3. Overlapping entity collisions

  • If a first name "Dan" and a username "DanTheMan" both exist in the text, text.find("Dan") matches inside the username, creating overlapping/conflicting annotations.
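The failure modes above can be reproduced with a few lines. This sketch mimics the `while True` / `text.find` loop from the current implementation; the sample text and values are illustrative, not from the dataset:

```python
def find_all(text, value):
    """Mimic the buggy loop: collect every occurrence of value via str.find."""
    positions, start = [], 0
    while True:
        pos = text.find(value, start)
        if pos == -1:
            break
        positions.append((pos, pos + len(value)))
        start = pos + 1
    return positions

text = "Alex parked near the Parking lot in Alexandria."
print(find_all(text, "Alex"))  # [(0, 4), (36, 40)] — second hit is inside "Alexandria"
print(find_all(text, "Park"))  # [(21, 25)] — "Parking" wrongly matched
```

Only the first `"Alex"` span is real PII; the spurious hits would be labeled FIRSTNAME and fed to the model as training signal.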

Root cause: offsets are discarded

In preprocessing.py, the convert_labelstudio_to_training_format function extracts the original Label Studio character offsets but then throws them away:

# Line 135-144: offsets ARE extracted
entities[entity_id] = {
    "text": value.get("text", ""),
    "label": labels[0] if labels else None,
    "start": value.get("start"),      # ← available
    "end": value.get("end"),          # ← available
}

# Line 183-188: but only value and label are kept
privacy_mask.append({
    "value": main_entity["text"],     # ← kept
    "label": main_entity["label"],    # ← kept
    # start and end are DISCARDED
})

This forces _find_privacy_mask_positions to re-discover positions via substring search, which is where all the bugs come from.

Proposed fix

1. Preserve character offsets through the pipeline

In preprocessing.py, include start and end in the privacy_mask entries:

privacy_mask.append({
    "value": main_entity["text"],
    "label": main_entity["label"],
    "start": main_entity["start"],
    "end": main_entity["end"],
})

2. Use offsets directly instead of substring search

In tokenization.py, rewrite _find_privacy_mask_positions so that it passes the annotation offsets through whenever they are present, falling back to substring matching (with word-boundary awareness) for data sources that don't provide offsets:

import re  # module-level import in tokenization.py, not inside the loop

def _find_privacy_mask_positions(self, text, privacy_mask):
    privacy_mask_with_positions = []
    for item in privacy_mask:
        if "start" in item and "end" in item:
            # Use annotation offsets directly — no search needed
            privacy_mask_with_positions.append({
                "value": item["value"],
                "label": item["label"],
                "start": item["start"],
                "end": item["end"],
            })
        else:
            # Fallback: word-boundary-aware search for external datasets
            pattern = re.escape(item["value"])
            for match in re.finditer(rf'\b{pattern}\b', text):
                privacy_mask_with_positions.append({
                    "value": item["value"],
                    "label": item["label"],
                    "start": match.start(),
                    "end": match.end(),
                })

    return sorted(
        privacy_mask_with_positions, key=lambda x: x["start"], reverse=True
    )
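With offsets preserved, each annotation resolves to exactly one span even when one value is a prefix of another, which is the overlap case from failure mode 3. A quick illustrative check (hypothetical text and mask):

```python
text = "Dan's username is DanTheMan."
privacy_mask = [
    {"value": "Dan", "label": "FIRSTNAME", "start": 0, "end": 3},
    {"value": "DanTheMan", "label": "USERNAME", "start": 18, "end": 27},
]

# Each offset pair slices out exactly the annotated span...
for item in privacy_mask:
    assert text[item["start"]:item["end"]] == item["value"]

# ...whereas substring search would also hit "Dan" inside "DanTheMan":
assert text.find("Dan", 1) == 18
```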

3. Add a validation check

After building positions, verify that text[start:end] == value to catch any offset drift from text normalization:

for entry in privacy_mask_with_positions:
    actual = text[entry["start"]:entry["end"]]
    if actual != entry["value"]:
        logger.warning(
            f"Offset mismatch: expected '{entry['value']}' "
            f"but found '{actual}' at [{entry['start']}:{entry['end']}]"
        )
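This check matters because any normalization applied between annotation and tokenization shifts offsets silently. A hypothetical example with whitespace collapsing shows the kind of drift it would flag:

```python
raw = "Hi,  Alex!"                     # text as annotated (double space)
normalized = " ".join(raw.split())     # "Hi, Alex!" — one character shorter
entry = {"value": "Alex", "start": 5, "end": 9}

assert raw[entry["start"]:entry["end"]] == entry["value"]         # offsets valid on raw text
assert normalized[entry["start"]:entry["end"]] != entry["value"]  # drift the check catches
```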

Impact

This is a silent data quality bug — the model trains on incorrectly labeled tokens without any warning. Fixing it improves training signal quality across the entire dataset, which compounds with every other improvement (class weights, better metrics, CRF, etc.).
