feat(anonymize): support rule-based non-Western name detection across target locales#180
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a non-Western name detector (locale-name-detector.ts) to identify names, honorifics, and relational connectors across various Asian and Middle Eastern locales, integrating it into the main anonymization pipeline alongside a comprehensive test suite. The review feedback identifies a correctness bug where relational connectors like s/o can falsely match substrings of longer words (e.g., s/office), an deduplicateSpans that can be optimized to REGEX detection source instead of NAME_CORPUS.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| for (let idx = 0; idx < words.length; idx++) { | ||
| const word = words[idx]!; | ||
| const lower = word.text.toLowerCase(); | ||
| if ( | ||
| (lower === "s" || lower === "d" || lower === "w" || lower === "r") && | ||
| fullText[word.end] === "/" && | ||
| fullText[word.end + 1]?.toLowerCase() === "o" | ||
| ) { | ||
| tokens.push({ | ||
| text: fullText.slice(word.start, word.end + 2), | ||
| type: TOKEN_TYPE.ARABIC_CONNECTOR, // reuse connector behavior so they chain seamlessly | ||
| start: word.start, | ||
| end: word.end + 2, | ||
| }); | ||
| const nextWord = words[idx + 1]; | ||
| if ( | ||
| nextWord && | ||
| nextWord.start === word.end + 1 && | ||
| nextWord.text.toLowerCase() === "o" | ||
| ) { | ||
| idx++; | ||
| } | ||
| } else { | ||
| tokens.push(classifyToken(word)); | ||
| } | ||
| } |
There was a problem hiding this comment.
There is a correctness bug when parsing relational connectors like s/o, d/o, w/o, or r/o. If the character following the / is 'o' but is part of a longer word (e.g., "s/office" or "s/out"), the current implementation will still match "s/o" and push it as an ARABIC_CONNECTOR token. This results in overlapping tokens and incorrect anonymization (e.g., "Rahul s/o" is anonymized, leaving "ffice" or "ut" unredacted).
To fix this, verify that the next word is exactly "o" (case-insensitive) and starts exactly at word.end + 1 before pushing the connector token.
for (let idx = 0; idx < words.length; idx++) {
const word = words[idx]!;
const lower = word.text.toLowerCase();
const nextWord = words[idx + 1];
if (
(lower === "s" || lower === "d" || lower === "w" || lower === "r") &&
fullText[word.end] === "/" &&
nextWord &&
nextWord.start === word.end + 1 &&
nextWord.text.toLowerCase() === "o"
) {
tokens.push({
text: fullText.slice(word.start, word.end + 2),
type: TOKEN_TYPE.ARABIC_CONNECTOR,
start: word.start,
end: word.end + 2,
});
idx++;
} else {
tokens.push(classifyToken(word));
}
}| const deduplicateSpans = (entities: Entity[]): Entity[] => { | ||
| const sorted = [...entities].sort( | ||
| (a, b) => a.start - b.start || b.end - a.end, | ||
| ); | ||
| const result: Entity[] = []; | ||
|
|
||
| for (const entity of sorted) { | ||
| const overlaps = result.some( | ||
| (accepted) => entity.start < accepted.end && entity.end > accepted.start, | ||
| ); | ||
| if (!overlaps) { | ||
| result.push(entity); | ||
| } | ||
| } | ||
|
|
||
| return result; | ||
| }; |
There was a problem hiding this comment.
The deduplicateSpans function currently has result.some(...) call inside the loop. Since the input is already sorted by start ascending and end descending, any new entity can only overlap with the most recently accepted entity. We can optimize this to result array.
const deduplicateSpans = (entities: Entity[]): Entity[] => {
const sorted = [...entities].sort(
(a, b) => a.start - b.start || b.end - a.end,
);
const result: Entity[] = [];
for (const entity of sorted) {
const last = result.at(-1);
if (!last || entity.start >= last.end) {
result.push(entity);
}
}
return result;
};| }); | ||
| }; | ||
|
|
||
| const hanCount = (fullText.match(/\p{Script=Han}/gu) || []).length; |
There was a problem hiding this comment.
Using fullText.match(/\p{Script=Han}/gu) allocates a new array containing all matched Han characters. For very large documents, this can cause significant memory overhead and garbage collection pressure. Using matchAll avoids allocating the intermediate array of matches.
| const hanCount = (fullText.match(/\p{Script=Han}/gu) || []).length; | |
| let hanCount = 0; | |
| for (const _ of fullText.matchAll(/\p{Script=Han}/gu)) { | |
| hanCount++; | |
| } |
| entities.push({ | ||
| start, | ||
| end, | ||
| label: "person", | ||
| text, | ||
| score, | ||
| source: DETECTION_SOURCES.REGEX, |
There was a problem hiding this comment.
The detectNonWesternNames detector is a rule-based/dictionary-based name detector, but it currently uses DETECTION_SOURCES.REGEX as its source. This causes priority inversion in the pipeline because regex-sourced entities have a higher priority than dictionary-sourced entities, which can prevent proper deduplication or filtering. It should use the appropriate name corpus or dictionary source enum value (e.g., DETECTION_SOURCES.NAME_CORPUS or DETECTION_SOURCES.DICTIONARY).
| entities.push({ | |
| start, | |
| end, | |
| label: "person", | |
| text, | |
| score, | |
| source: DETECTION_SOURCES.REGEX, | |
| entities.push({ | |
| start, | |
| end, | |
| label: "person", | |
| text, | |
| score, | |
| source: DETECTION_SOURCES.NAME_CORPUS, | |
| }); |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a new “locale/non-Western” person-name detector and integrates it into the anonymize pipeline so the name-corpus step also finds common non-Western name patterns (incl. CJK and various Latin transliterations).
Changes:
- Integrates
detectNonWesternNamesintorunPipeline’s name-corpus phase. - Adds new detector implementation and exports it from the shared index.
- Introduces non-Western honorific configuration and comprehensive detector tests.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/anonymize/src/pipeline.ts | Merges Western name-corpus results with the new non-Western detector results. |
| packages/anonymize/src/index-shared.ts | Exports detectNonWesternNames for external consumers. |
| packages/anonymize/src/detectors/locale-name-detector.ts | Implements non-Western name detection via tokenization + heuristics + large token sets. |
| packages/anonymize/src/config/titles.ts | Adds NONWESTERN_HONORIFICS and tweaks an honorific comment. |
| packages/anonymize/src/test/nonWesternNames.test.ts | Adds pipeline + unit coverage for many locales and boundary/FP behaviors. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const hanCount = (fullText.match(/\p{Script=Han}/gu) || []).length; | ||
| const isLatinMajority = | ||
| fullText.length < 100 || hanCount / fullText.length < 0.15; |
| const deduplicateSpans = (entities: Entity[]): Entity[] => { | ||
| const sorted = [...entities].sort( | ||
| (a, b) => a.start - b.start || b.end - a.end, | ||
| ); | ||
| const result: Entity[] = []; | ||
|
|
||
| for (const entity of sorted) { | ||
| const overlaps = result.some( | ||
| (accepted) => entity.start < accepted.end && entity.end > accepted.start, | ||
| ); | ||
| if (!overlaps) { | ||
| result.push(entity); | ||
| } | ||
| } | ||
|
|
||
| return result; | ||
| }; |
| const segmenter = new Intl.Segmenter(undefined, { | ||
| granularity: "word", | ||
| }); |
| "Majumdar", | ||
| "Ray", | ||
| "Poddar", | ||
| "Ray", | ||
| "Halder", |
| // Only match name tokens when the word is capitalized in the source text. | ||
| // This prevents lowercase English words like "to", "so", "an" from being | ||
| // title-cased and falsely matched as Korean/Thai/Vietnamese name tokens. | ||
| if (UPPER_START_RE.test(text) && NONWESTERN_NAME_TOKENS.has(titleCased)) { | ||
| return { text, type: TOKEN_TYPE.NONWESTERN_NAME, start, end }; | ||
| } |
| const westernNames = detectNameCorpus(fullText, ctx); | ||
| const nonWesternNames = detectNonWesternNames(fullText, ctx); | ||
| rawNameCorpusEntities = [...westernNames, ...nonWesternNames]; |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d899f37e07
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ).length; | ||
|
|
||
| let score: number; | ||
| if (hasTitle && (hasNonWesternToken || capitalizedCount > 0)) score = 0.95; |
There was a problem hiding this comment.
Require name evidence before title chains
In legal text that uses court honorifics, this condition redacts non-person phrases as people: The Hon'ble Court issued an order. is tokenized as capitalized + title + capitalized and emits The Hon'ble Court with score 0.95 even though no non-Western name token is present. This is common in Indian legal documents and will corrupt output whenever enableNameCorpus is on, so title chains should require a name token or exclude court/institution terms.
Useful? React with 👍 / 👎.
| CJK_NAME_RE.lastIndex = 0; | ||
| let match: RegExpExecArray | null; | ||
| while ((match = CJK_NAME_RE.exec(fullText)) !== null) { | ||
| addCandidate(match.index, match.index + match[0].length, 0.95); |
There was a problem hiding this comment.
Do not label every short Han phrase as a person
For Latin-majority documents, every isolated 2-4 Han-character sequence is emitted as a high-confidence person without any context or stoplist, so ordinary jurisdictions, cities, and language labels such as 香港, 北京, 上海, or 中文 are redacted as people. Mixed English/Chinese legal documents often contain these terms, so this needs additional name evidence or exclusions before creating a person entity.
Useful? React with 👍 / 👎.
| else if ( | ||
| hasNonWesternToken && | ||
| (capitalizedCount > 0 || abbreviationCount > 0) | ||
| ) | ||
| score = 0.9; |
There was a problem hiding this comment.
Avoid absorbing capitalized document nouns into names
When a known non-Western token appears next to capitalized legal/document words, this branch scores the whole chain as a person, so text like Please review the Sato Agreement. emits Sato Agreement and The Nguyen Contract was signed. emits The Nguyen Contract. That over-redacts non-person words and changes document meaning whenever the detector sees a name-like token before a capitalized noun.
Useful? React with 👍 / 👎.
| else if ( | ||
| hasNonWesternToken && | ||
| chain.length === 1 && | ||
| !isSentenceStart(fullText, token.start) | ||
| ) | ||
| score = 0.5; |
There was a problem hiding this comment.
Suppress standalone ambiguous locale tokens
This emits any single non-Western token that is not at sentence start with score 0.5, which is above the default test threshold of 0.3; because the token list includes common jurisdiction/adjective words, This agreement is governed by Thai law. redacts Thai as a person. Standalone tokens need stronger context or a stoplist for locale/legal terms to avoid corrupting ordinary legal clauses.
Useful? React with 👍 / 👎.
| else if (hasJaSuffix && (capitalizedCount > 0 || hasNonWesternToken)) | ||
| score = 0.9; |
There was a problem hiding this comment.
Require a hyphen before Japanese honorific suffixes
Because san is classified as a Japanese suffix even when it is a standalone word, any preceding/following capitalized tokens can become a person span; for example, The San Diego office. emits The San Diego with score 0.9. This will over-redact common place names in mixed legal/business documents, so the -san case should require the hyphen or otherwise avoid treating standalone San as suffix evidence.
Useful? React with 👍 / 👎.
|
Hey @jan-kubica, just pushed the fix for #179 (non-Western name detection). Would appreciate a review when you get a chance. |
| const NONWESTERN_TITLES = new Set([ | ||
| ...Object.values(NONWESTERN_HONORIFICS) | ||
| .flat() | ||
| .map((t) => t.replace(/[.-]+$/u, "")), // strip trailing dots/dashes |
| const strippedTitle = text.endsWith(".") ? text.slice(0, -1) : text; | ||
| if (NONWESTERN_TITLES.has(strippedTitle)) { | ||
| return { text, type: TOKEN_TYPE.TITLE, start, end }; | ||
| } |
| const NONWESTERN_NAME_TOKENS = new Set([ | ||
| "Aah", | ||
| "Aarti", | ||
| "Abdel", | ||
| "Abdul", | ||
| "Abdullah", |
| const westernNames = detectNameCorpus(fullText, ctx); | ||
| const nonWesternNames = detectNonWesternNames(fullText, ctx); | ||
| rawNameCorpusEntities = [...westernNames, ...nonWesternNames]; |
| entities.push({ | ||
| start, | ||
| end, | ||
| label: "person", | ||
| text, | ||
| score, | ||
| source: DETECTION_SOURCES.REGEX, | ||
| }); |
|
|
||
| // ── Tests ────────────────────────────────────────────────────────── | ||
|
|
||
| describe("Non-Western Name Detection", () => { |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d39a62704f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if ( | ||
| gap.includes("\n") || | ||
| PERSON_CHAIN_BREAK_RE.test(gap) || | ||
| breaksOnPeriod | ||
| ) |
There was a problem hiding this comment.
Break person chains at appositive commas
When a valid name is followed by a comma and an organization/location descriptor, this loop keeps extending through the comma into the next capitalized token. Because addCandidate later drops spans containing organization words, Rahul Sharma, University of Delhi, testified. produces no Rahul Sharma entity at all. Commas that introduce appositives should stop the chain, or the detector needs to keep the accepted prefix before rejecting the longer organization-tainted span.
Useful? React with 👍 / 👎.
| const MAX_CHAIN = 5; | ||
| const chain: ClassifiedToken[] = [token]; | ||
| let j = i + 1; | ||
|
|
||
| while (j < tokens.length && chain.length < MAX_CHAIN) { |
There was a problem hiding this comment.
Avoid truncating Arabic patronymic chains
The five-token cap truncates common Arabic/Gulf names that include both a title and connectors; for example Sheikh Mohammed bin Rashid Al Maktoum is emitted as Sheikh Mohammed bin Rashid Al, leaving Maktoum unredacted and ending the entity on a dangling connector. This affects the target locales added by this detector, so the chain limit should allow connector-heavy names or avoid finalizing spans that end with a connector.
Useful? React with 👍 / 👎.
| if (config.enableNameCorpus && !config.enableDenyList) { | ||
| await initNameCorpus(ctx, config.dictionaries, config.nameCorpusLanguages); | ||
| checkAbort(signal); | ||
| rawNameCorpusEntities = detectNameCorpus(fullText, ctx); | ||
| const westernNames = detectNameCorpus(fullText, ctx); | ||
| const nonWesternNames = detectNonWesternNames(fullText, ctx); |
There was a problem hiding this comment.
Run non-Western names in deny-list mode
This guard means the new rule-based detector never runs when enableDenyList is also true, even though enableNameCorpus is documented and tested as keeping name matches available in deny-list mode. In the common configuration with both enabled, names that only this detector knows about, such as Rahul Sharma without a matching injected dictionary entry, are skipped entirely.
Useful? React with 👍 / 👎.
| const CJK_NAME_RE = | ||
| /(?<!\p{Script=Han})\p{Script=Han}{2,4}(?!\p{Script=Han})/gu; |
There was a problem hiding this comment.
Detect CJK names embedded in Han text
The CJK regex only matches a 2-4 Han-character sequence when it is not adjacent to any other Han character, so names written naturally inside Chinese/Japanese prose are missed; for example 由張小明簽署 produces no 張小明 entity because the name touches surrounding Han characters. This makes the CJK branch work only for space-delimited mixed-language snippets, not ordinary CJK legal text.
Useful? React with 👍 / 👎.
| const breaksOnPeriod = | ||
| gap.includes(".") && | ||
| !isInitialContinuationGap(prev.text, gap) && | ||
| prev.type !== TOKEN_TYPE.TITLE; |
There was a problem hiding this comment.
Stop full-word titles at sentence periods
Because every title token bypasses the period break, full-word honorifics are chained across sentence boundaries; The presiding officer was Justice. Kumar filed the appeal. emits Justice. Kumar as a person even though the period ended the title sentence. Only abbreviation-style titles such as Dr. or Mr. should suppress breaksOnPeriod; full-word non-Western titles need to stop at . to avoid redacting text from the next sentence.
Useful? React with 👍 / 👎.
jan-kubica
left a comment
There was a problem hiding this comment.
This PR identified and addresses real and important gap. Thank you for it. I would like to get the following changes in the PR in other not to duplicate existing approaches. LMK if I am missing something please
- Extend
names.tsinstead of forking it. You have the same skeleton+some novel ideas (suffixes, arabic connectors) but in a parallel implementation. This introduces maintenance burden, and extra compute in hot path we can avoid if you land this as chain rules in the existing names.ts - move name tokens to per-locale dictionary data, loaded via
initNameCorpus. the convention is to split the logic vs the data
583bb5c to
7f459d7
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 25dc116194
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (dictionaries?.nonWesternNames) { | ||
| for (const [, names] of Object.entries(dictionaries.nonWesternNames)) { |
There was a problem hiding this comment.
Honor scoped languages for injected non-Western names
When callers pass nameCorpusLanguages to restrict which injected name dictionaries are active, this branch still merges every dictionaries.nonWesternNames locale, unlike the first-name and surname branches above. In a scoped run such as initNameCorpus(ctx, { nonWesternNames: { ja: ["Qwertaro"] } }, ["en"]), detectNameCorpus("Please contact Qwertaro Smith.") still emits a person match, so integrations that scope names to the document language can get unexpected cross-locale redactions.
Useful? React with 👍 / 👎.
| if (next.type !== TOKEN_TYPE.OTHER) { | ||
| chain.push(next); | ||
| j++; | ||
| } else { |
There was a problem hiding this comment.
Preserve hyphenated Korean given names
For hyphenated Korean romanizations, Intl.Segmenter splits the given name around the hyphen and the lowercase continuation is classified as OTHER, so this loop finalizes only the prefix. For example, Kim Min-jun is emitted as Kim Min, leaving jun unredacted; the chain needs to absorb or normalize hyphenated non-Western name continuations before stopping on OTHER.
Useful? React with 👍 / 👎.
| if (corpus.nonWesternNames.has(token)) return true; | ||
| return corpus.nonWesternNames.has(titleCaseWithApostrophe(token)); |
There was a problem hiding this comment.
Match Vietnamese names with their diacritics
The bundled Vietnamese corpus is unaccented, but this lookup only checks the raw and title-cased token, so standard Vietnamese spellings with tone marks are missed or truncated. For example, Trần Thị Mai signed only emits Mai, leaving the surname and middle name visible; the lookup needs a diacritic-folded form, or accented variants in the corpus, for these locale names.
Useful? React with 👍 / 👎.
| // 1. Title tokens (Western + non-Western honorifics) | ||
| if (corpus.titleTokens.has(stripped)) { | ||
| return { text, type: TOKEN_TYPE.TITLE, start, end }; | ||
| // Mark abbreviation-style titles so the chain | ||
| // assembler knows their trailing dot is not a | ||
| // sentence boundary (Dr., Mr., Smt. etc.). | ||
| return { | ||
| text, | ||
| type: TOKEN_TYPE.TITLE, | ||
| start, | ||
| end, | ||
| ...(corpus.titleAbbreviations.has(stripped) | ||
| ? { titleAbbreviation: true } | ||
| : {}), | ||
| }; | ||
| } | ||
|
|
||
| // 2. Japanese honorific suffixes (san, sama, sensei) | ||
| if (JA_SUFFIXES.has(lower)) { | ||
| return { text, type: TOKEN_TYPE.JA_SUFFIX, start, end }; | ||
| } | ||
|
|
||
| // 3. Arabic patronymic connectors (bin, bint, ibn, al, el) | ||
| if (ARABIC_CONNECTORS.has(lower)) { | ||
| return { text, type: TOKEN_TYPE.ARABIC_CONNECTOR, start, end }; | ||
| } |
| /** Organization keyword filter. */ | ||
| const ORG_WORDS = | ||
| /\b(?:Group|Company|LLC|LLP|LP|Inc|Ltd|Corp|Corporation|Holdings|Partners|Association|University|Bank|Fund|Trust|Agency|Government|Ministry|Office|Department|Council|Board|Committee|Commission|Services|Solutions|Technologies|Systems|Analytics|Software)\b/i; | ||
|
|
||
| const isOrganization = (text: string): boolean => ORG_WORDS.test(text); |
| /** Deduplicate overlapping entity spans (first wins). */ | ||
| const deduplicateSpans = (entities: Entity[]): Entity[] => { | ||
| const sorted = [...entities].sort( | ||
| (a, b) => a.start - b.start || b.end - a.end, | ||
| ); | ||
| const result: Entity[] = []; | ||
| for (const entity of sorted) { | ||
| const last = result.at(-1); | ||
| if (!last || entity.start >= last.end) { | ||
| result.push(entity); | ||
| } | ||
| } | ||
| return result; | ||
| }; |
| if (dictionaries?.nonWesternNames) { | ||
| for (const [, names] of Object.entries(dictionaries.nonWesternNames)) { | ||
| for (const name of names) { | ||
| nonWesternNames.push(name); | ||
| } | ||
| } | ||
| } |
| const nwLocaleKeys = [ | ||
| "in", | ||
| "ar", | ||
| "ja-latn", | ||
| "ko", | ||
| "zh-latn", | ||
| "th", | ||
| "vi", | ||
| "fil", | ||
| "id", | ||
| ] as const; | ||
| const [nwNameMods, nwExcludedMod] = await Promise.all([ | ||
| Promise.all( | ||
| nwLocaleKeys.map( | ||
| (locale) => | ||
| import(`../data/names-nw-${locale}.json`) as Promise<{ | ||
| default: { names: string[] }; | ||
| }>, | ||
| ), | ||
| ), | ||
| import("../data/names-nw-excluded-allcaps.json") as Promise<{ | ||
| default: { words: string[] }; | ||
| }>, | ||
| ]); |
| export const NONWESTERN_HONORIFICS = { | ||
| in: [ | ||
| "Sri", | ||
| "Shri", | ||
| "Smt", | ||
| "Smt.", | ||
| "Kumari", | ||
| "Pandit", | ||
| "Pt.", | ||
| "Adv.", | ||
| "Adv", | ||
| "Justice", | ||
| "Hon'ble", | ||
| ], | ||
| ar: [ | ||
| "Al-", | ||
| "El-", | ||
| "Sheikh", | ||
| "Shaikh", | ||
| "Sheikha", | ||
| "Ustaz", | ||
| "Ustaza", | ||
| "Abu", | ||
| "Umm", | ||
| ], | ||
| "zh-Latn": [ | ||
| "Encik", | ||
| "Puan", | ||
| "Datuk", | ||
| "Dato'", | ||
| "Dato", | ||
| "Tan Sri", | ||
| "Tun", | ||
| ] as const, | ||
| "ja-Latn": ["-san", "-sama", "-sensei", "Mr.", "Ms."] as const, |
| export const getNameCorpusNonWesternNames = ( | ||
| ctx: PipelineContext = defaultContext, | ||
| ): readonly string[] => ctx.nameCorpus?.nonWesternNamesList ?? []; |
|
Hey @jan-kubica, thanks for the review. Both points addressed:
Also fixed a few things the AI review tools flagged as real issues: full-word titles like "Justice" no longer chain across sentence boundaries (only abbreviation titles like "Dr." suppress period breaks), and bare "HE"/"HH" no longer false-positive as Gulf honorific titles. |
jan-kubica
left a comment
There was a problem hiding this comment.
Thank you for the prompt changes. One more thing, I have noticed you are mocking text-search for every test run, presumably because it did not natively work on Windows. I have now addressed the underlying gap and anonymize 1.4.9 (and text search 1.0.6) now ship win32-x64-msvc native bindings for all three engines.
As such, bun install && bun test should now work natively on Windows.
After rebasing and refreshing the lockfile, the mock should be unnecessary >> could you please rebase, reinstall and remove the mock?. Alternatively, if I am missing anything and thre is a reason for the mock, please lmk
25dc116 to
fee0b0c
Compare
fee0b0c to
6b57e6b
Compare
|
Hey @jan-kubica Thanks for addressing the Windows native support in I have completed the requested changes:
All 750 tests are now running and passing natively on Windows with the mock removed. I have force-pushed the squashed commit to update the PR. Let me know if you need anything else! |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6b57e6b037
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for (const form of Object.values(NONWESTERN_HONORIFICS).flat()) { | ||
| titles.push(form.replace(/[.-]+$/u, "").toLowerCase()); |
There was a problem hiding this comment.
Do not inject stripped honorific fragments as titles
When enableDenyList and enableNameCorpus are both on, appendNameCorpusEntries adds every title token as a standalone person pattern; this loop now turns non-Western forms like Al-, El-, H., Ba, Bu, and Co into bare tokens (al, el, h, ba, bu, co). In Spanish/Arabic/English legal text this will redact common words or abbreviations such as al and Co as people, so prefix/post-nominal fragments should not be merged into the global title corpus as standalone patterns.
Useful? React with 👍 / 👎.
| const lower = word.text.toLowerCase(); | ||
| const nextWord = words[idx + 1]; | ||
| if ( | ||
| (lower === "s" || lower === "d" || lower === "w" || lower === "r") && |
There was a problem hiding this comment.
Do not merge r/o residence markers into names
Including r/o in the same connector path as s/o, d/o, and w/o makes Indian party descriptions like Rahul Kumar r/o Delhi become one person span (Rahul Kumar r/o Delhi). In this context r/o means resident of and introduces an address/location, not a patronymic or spouse connector, so the name chain should break before it rather than consume the residence field.
Useful? React with 👍 / 👎.
|
@1337Xcode thank you for your prompt replies and this PR. Appreciate it. I am adding final minor adjustments (e.g., why ship field that does not currently exist and would already land as deprecated...?) directly to this branch and merging in a moment. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e33f5cecdb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (isLatinMajority) { | ||
| CJK_NAME_RE.lastIndex = 0; | ||
| let match: RegExpExecArray | null; | ||
| while ((match = CJK_NAME_RE.exec(fullText)) !== null) { |
There was a problem hiding this comment.
Honor language scoping before CJK pre-pass
When callers scope nameCorpusLanguages to non-CJK locales, this pre-pass still runs because it does not consult the scoped languages at all. For example, after initNameCorpus(ctx, undefined, ["cs"]), detectNameCorpus("Signed by 張小明.", ctx) still emits 張小明 as a person even though every non-Western locale corpus was scoped out; this can cause unexpected cross-locale redactions in integrations that use language scoping. Fresh evidence beyond the earlier scoped-dictionary issue: the hard-coded CJK branch bypasses the corpus/language scope entirely.
Useful? React with 👍 / 👎.
| const CJK_NAME_RE = | ||
| /(?<!\p{Script=Han})\p{Script=Han}{2,4}(?!\p{Script=Han})/gu; |
There was a problem hiding this comment.
Include Hangul in the native-name regex
The native-name pre-pass can never match Korean names written in Hangul because this regex only accepts Script=Han, even though the surname table below includes Hangul surnames such as 김 and 박. In practice detectNameCorpus("Signed by 김민준.") and detectNameCorpus("Signed by 박지훈.") return no person entity, so Korean text in the target locale is left unredacted unless it is romanized.
Useful? React with 👍 / 👎.
| if (corpus.nonWesternNames.has(token)) return true; | ||
| return corpus.nonWesternNames.has(titleCaseWithApostrophe(token)); |
There was a problem hiding this comment.
Match multi-word locale corpus entries
Some bundled non-Western entries are phrases, for example Thai surnames like Na Nakhon, but the detector later looks up one Intl.Segmenter word at a time. Those phrase entries can never match, so The witness Somchai Na Nakhon signed. emits only Somchai at low confidence and leaves the family name visible; phrase entries need a phrase pass or token-level decomposition.
Useful? React with 👍 / 👎.
| !isNonWesternNameToken(text, corpus) && | ||
| !JA_SUFFIXES.has(lower) && | ||
| !ARABIC_CONNECTORS.has(lower) && | ||
| !(ALL_UPPER_RE.test(text) && !corpus.excludedAllCaps.has(text)) |
There was a problem hiding this comment.
Whitelist short all-caps post-nominals
This exception treats any two-letter all-caps token not in the exclusion list as name-like, not just known post-nominals such as JP, KC, or QC. Common legal/business acronyms that are absent from the exclusion JSON then get absorbed into person spans, e.g. Sato IP rights, Kim IT policy, or Wong HR reviewed it all become high-confidence person entities; this should be limited to an explicit post-nominal whitelist.
Useful? React with 👍 / 👎.
| if (!chain.some(hasPersonNameSource)) { | ||
| continue; |
There was a problem hiding this comment.
Preserve curated Names deny-list matches
This guard drops any person chain that does not include the synthetic first-name or surname source, but curated dictionaries in the Names category are added with the normal deny-list source. With enableDenyList and enableNameCorpus both on, a bundled or injected Names entry that is not also in the first/surname corpus, for example a full-name entry from names/global, reaches nameHits and is then skipped here, so valid deny-list person entries silently stop redacting.
Useful? React with 👍 / 👎.
Proposed change
Adds a rule-based non-Western name detector (
detectNonWesternNames) for person name anonymization across 10 locales: India (in), UAE/Saudi Arabia (ar), Singapore/Malaysia (zh-Latn), Hong Kong (zh-Hant), Japan (ja-Latn), South Korea (ko), Thailand (th), Vietnam (vi), Philippines (fil), and Indonesia (id).Resolves #179.
NONWESTERN_HONORIFICSintitles.ts.detectNonWesternNamesinlocale-name-detector.tsusingIntl.Segmenterfor word segmentation, with CJK Han range detection, suffix chaining, and all-caps filtering.s/o,d/o,w/o,r/o) and post-nominal designations (S.H.,JP,SC).types.tsor the core pipeline registry.Files affected
packages/anonymize/src/config/titles.tspackages/anonymize/src/detectors/locale-name-detector.tspackages/anonymize/src/index-shared.tspackages/anonymize/src/pipeline.tspackages/anonymize/src/__test__/nonWesternNames.test.tsChecklist