## Summary

The `_find_privacy_mask_positions` method in `tokenization.py` uses `text.find(value)` to re-discover entity positions by substring matching. This is fragile and produces incorrect labels when PII values appear as substrings of other words or occur multiple times in the text. The original character offsets from Label Studio annotations are available but are discarded during preprocessing — they should be preserved and used instead.
## Problem

### Current implementation (`tokenization.py:22-50`)
```python
def _find_privacy_mask_positions(self, text, privacy_mask):
    privacy_mask_with_positions = []
    for item in privacy_mask:
        value = item["value"]
        label = item["label"]
        start = 0
        while True:
            pos = text.find(value, start)
            if pos == -1:
                break
            privacy_mask_with_positions.append({
                "value": value, "label": label,
                "start": pos, "end": pos + len(value),
            })
            start = pos + 1
    return sorted(privacy_mask_with_positions, key=lambda x: x["start"], reverse=True)
```
This has several failure modes:

**1. Partial/spurious matches**

- First name "Alex" matches inside "Alexandria" → the city gets partially labeled as `FIRSTNAME`
- Surname "Park" matches inside "Parking" → a non-PII word gets labeled
- Street name "Main" matches inside "Maintain" → a false annotation

**2. Duplicate value over-matching**

- The `while True` loop finds every occurrence of a value. If "John" appears twice in the text but only one instance is annotated as PII, both get labeled — injecting noise into the training data.

**3. Overlapping entity collisions**

- If a first name "Dan" and a username "DanTheMan" both exist in the text, `text.find("Dan")` matches inside the username, creating overlapping/conflicting annotations.
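A minimal standalone sketch of the first failure mode, using a hypothetical sentence (not from the real dataset) and the same `text.find` loop as the current implementation:

```python
# Hypothetical example reproducing the partial-match bug: substring search
# finds the PII value "Alex" both as a standalone name and inside a city name.
text = "Alex moved to Alexandria and parked on Main Street."
value = "Alex"

positions = []
start = 0
while True:
    pos = text.find(value, start)
    if pos == -1:
        break
    positions.append((pos, pos + len(value)))  # (start, end) of each hit
    start = pos + 1

# Two hits: the real name at offset 0, and a spurious one inside "Alexandria".
print(positions)  # → [(0, 4), (14, 18)]
```

The second hit would label the first four characters of "Alexandria" as a first name, which is exactly the kind of noise described above.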
### Root cause: offsets are discarded

In `preprocessing.py`, the `convert_labelstudio_to_training_format` function extracts the original Label Studio character offsets but then throws them away:
```python
# Lines 135-144: offsets ARE extracted
entities[entity_id] = {
    "text": value.get("text", ""),
    "label": labels[0] if labels else None,
    "start": value.get("start"),  # ← available
    "end": value.get("end"),      # ← available
}

# Lines 183-188: but only value and label are kept
privacy_mask.append({
    "value": main_entity["text"],   # ← kept
    "label": main_entity["label"],  # ← kept
    # start and end are DISCARDED
})
```
This forces `_find_privacy_mask_positions` to re-discover positions via substring search, which is where all the bugs come from.
## Proposed fix

### 1. Preserve character offsets through the pipeline

In `preprocessing.py`, include `start` and `end` in the `privacy_mask` entries:
```python
privacy_mask.append({
    "value": main_entity["text"],
    "label": main_entity["label"],
    "start": main_entity["start"],
    "end": main_entity["end"],
})
```
### 2. Use offsets directly instead of substring search

In `tokenization.py`, rewrite `_find_privacy_mask_positions` to simply pass through the offsets when they are available, falling back to word-boundary-aware substring matching for data sources that don't provide offsets:
```python
import re  # module-level import instead of importing inside the loop


def _find_privacy_mask_positions(self, text, privacy_mask):
    privacy_mask_with_positions = []
    for item in privacy_mask:
        if "start" in item and "end" in item:
            # Use annotation offsets directly — no search needed
            privacy_mask_with_positions.append({
                "value": item["value"],
                "label": item["label"],
                "start": item["start"],
                "end": item["end"],
            })
        else:
            # Fallback: word-boundary-aware search for external datasets
            pattern = re.escape(item["value"])
            for match in re.finditer(rf"\b{pattern}\b", text):
                privacy_mask_with_positions.append({
                    "value": item["value"],
                    "label": item["label"],
                    "start": match.start(),
                    "end": match.end(),
                })
    return sorted(
        privacy_mask_with_positions, key=lambda x: x["start"], reverse=True
    )
```
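To see why the `\b` word boundaries in the fallback matter, here is a small self-contained check against the same hypothetical "Alex"/"Alexandria" example from the failure modes above:

```python
import re

# Hypothetical text: the value "Alex" also appears as a prefix of "Alexandria".
text = "Alex moved to Alexandria."

# Same pattern construction as the fallback branch.
pattern = re.escape("Alex")
matches = [(m.start(), m.end()) for m in re.finditer(rf"\b{pattern}\b", text)]

# Only the standalone name matches; no \b after the "Alex" in "Alexandria",
# since it is followed by another word character.
print(matches)  # → [(0, 4)]
```

Note the fallback only fixes the partial-match and substring-collision cases; if a duplicated value is annotated once but appears twice as a standalone word, the regex still labels both occurrences. Only preserved offsets resolve that case.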
### 3. Add a validation check

After building positions, verify that `text[start:end] == value` to catch any offset drift from text normalization:
```python
for entry in privacy_mask_with_positions:
    actual = text[entry["start"]:entry["end"]]
    if actual != entry["value"]:
        logger.warning(
            f"Offset mismatch: expected '{entry['value']}' "
            f"but found '{actual}' at [{entry['start']}:{entry['end']}]"
        )
```
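As a sketch of the kind of drift this check catches, here is a hypothetical case where offsets were computed against the raw annotated text but the pipeline later strips leading whitespace:

```python
import logging

logging.basicConfig()
logger = logging.getLogger("offset_check")

original = " John lives here."    # annotated text; offsets refer to this string
normalized = original.lstrip()    # normalization shifts every offset left by 1
entry = {"value": "John", "start": 1, "end": 5}  # correct for `original` only

actual = normalized[entry["start"]:entry["end"]]
if actual != entry["value"]:
    logger.warning(
        "Offset mismatch: expected %r but found %r at [%d:%d]",
        entry["value"], actual, entry["start"], entry["end"],
    )
```

Here `actual` comes out as `"ohn "`, so the mismatch is logged instead of silently producing a shifted label.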
## Impact
This is a silent data quality bug — the model trains on incorrectly labeled tokens without any warning. Fixing it improves training signal quality across the entire dataset, which compounds with every other improvement (class weights, better metrics, CRF, etc.).