Skip to content

feat(anonymize): support rule-based non-Western name detection across target locales#180

Merged
jan-kubica merged 7 commits into
stella:mainfrom
1337Xcode:feat/non-western-names
Jun 12, 2026
Merged

feat(anonymize): support rule-based non-Western name detection across target locales#180
jan-kubica merged 7 commits into
stella:mainfrom
1337Xcode:feat/non-western-names

Conversation

@1337Xcode

@1337Xcode 1337Xcode commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Proposed change

Adds a rule-based non-Western name detector (detectNonWesternNames) for person name anonymization across 10 locales: India (in), UAE/Saudi Arabia (ar), Singapore/Malaysia (zh-Latn), Hong Kong (zh-Hant), Japan (ja-Latn), South Korea (ko), Thailand (th), Vietnam (vi), Philippines (fil), and Indonesia (id).

Resolves #179.

  1. Added locale-specific honorific and token config under NONWESTERN_HONORIFICS in titles.ts.
  2. Implemented detectNonWesternNames in locale-name-detector.ts using Intl.Segmenter for word segmentation, with CJK Han range detection, suffix chaining, and all-caps filtering.
  3. Added parsing for relational connectors (s/o, d/o, w/o, r/o) and post-nominal designations (S.H., JP, SC).
  4. No changes to types.ts or the core pipeline registry.
  5. All 696 existing tests pass unaffected. 73 new tests added covering all target locales, boundaries, deduplication, and false-positive cases.

Files affected

  • packages/anonymize/src/config/titles.ts
  • packages/anonymize/src/detectors/locale-name-detector.ts
  • packages/anonymize/src/index-shared.ts
  • packages/anonymize/src/pipeline.ts
  • packages/anonymize/src/__test__/nonWesternNames.test.ts

Checklist

@1337Xcode 1337Xcode marked this pull request as ready for review June 9, 2026 01:56
Copilot AI review requested due to automatic review settings June 9, 2026 01:56

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a non-Western name detector (locale-name-detector.ts) to identify names, honorifics, and relational connectors across various Asian and Middle Eastern locales, integrating it into the main anonymization pipeline alongside a comprehensive test suite. The review feedback identifies a correctness bug where relational connectors like s/o can falsely match substrings of longer words (e.g., s/office), an $O(N^2)$ performance bottleneck in deduplicateSpans that can be optimized to $O(N)$, high memory overhead from Han character matching, and a priority inversion issue caused by using the REGEX detection source instead of NAME_CORPUS.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +2426 to +2451
for (let idx = 0; idx < words.length; idx++) {
const word = words[idx]!;
const lower = word.text.toLowerCase();
if (
(lower === "s" || lower === "d" || lower === "w" || lower === "r") &&
fullText[word.end] === "/" &&
fullText[word.end + 1]?.toLowerCase() === "o"
) {
tokens.push({
text: fullText.slice(word.start, word.end + 2),
type: TOKEN_TYPE.ARABIC_CONNECTOR, // reuse connector behavior so they chain seamlessly
start: word.start,
end: word.end + 2,
});
const nextWord = words[idx + 1];
if (
nextWord &&
nextWord.start === word.end + 1 &&
nextWord.text.toLowerCase() === "o"
) {
idx++;
}
} else {
tokens.push(classifyToken(word));
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There is a correctness bug when parsing relational connectors like s/o, d/o, w/o, or r/o. If the character following the / is 'o' but is part of a longer word (e.g., "s/office" or "s/out"), the current implementation will still match "s/o" and push it as an ARABIC_CONNECTOR token. This results in overlapping tokens and incorrect anonymization (e.g., "Rahul s/o" is anonymized, leaving "ffice" or "ut" unredacted).

To fix this, verify that the next word is exactly "o" (case-insensitive) and starts exactly at word.end + 1 before pushing the connector token.

  for (let idx = 0; idx < words.length; idx++) {
    const word = words[idx]!;
    const lower = word.text.toLowerCase();
    const nextWord = words[idx + 1];
    if (
      (lower === "s" || lower === "d" || lower === "w" || lower === "r") &&
      fullText[word.end] === "/" &&
      nextWord &&
      nextWord.start === word.end + 1 &&
      nextWord.text.toLowerCase() === "o"
    ) {
      tokens.push({
        text: fullText.slice(word.start, word.end + 2),
        type: TOKEN_TYPE.ARABIC_CONNECTOR,
        start: word.start,
        end: word.end + 2,
      });
      idx++;
    } else {
      tokens.push(classifyToken(word));
    }
  }

Comment on lines +2373 to +2389
const deduplicateSpans = (entities: Entity[]): Entity[] => {
const sorted = [...entities].sort(
(a, b) => a.start - b.start || b.end - a.end,
);
const result: Entity[] = [];

for (const entity of sorted) {
const overlaps = result.some(
(accepted) => entity.start < accepted.end && entity.end > accepted.start,
);
if (!overlaps) {
result.push(entity);
}
}

return result;
};

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The deduplicateSpans function currently has $O(N^2)$ complexity because of the result.some(...) call inside the loop. Since the input is already sorted by start ascending and end descending, any new entity can only overlap with the most recently accepted entity. We can optimize this to $O(N)$ by only comparing with the last accepted entity in the result array.

const deduplicateSpans = (entities: Entity[]): Entity[] => {
  const sorted = [...entities].sort(
    (a, b) => a.start - b.start || b.end - a.end,
  );
  const result: Entity[] = [];

  for (const entity of sorted) {
    const last = result.at(-1);
    if (!last || entity.start >= last.end) {
      result.push(entity);
    }
  }

  return result;
};

});
};

const hanCount = (fullText.match(/\p{Script=Han}/gu) || []).length;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using fullText.match(/\p{Script=Han}/gu) allocates a new array containing all matched Han characters. For very large documents, this can cause significant memory overhead and garbage collection pressure. Using matchAll avoids allocating the intermediate array of matches.

Suggested change
const hanCount = (fullText.match(/\p{Script=Han}/gu) || []).length;
let hanCount = 0;
for (const _ of fullText.matchAll(/\p{Script=Han}/gu)) {
hanCount++;
}

Comment on lines +2402 to +2408
entities.push({
start,
end,
label: "person",
text,
score,
source: DETECTION_SOURCES.REGEX,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The detectNonWesternNames detector is a rule-based/dictionary-based name detector, but it currently uses DETECTION_SOURCES.REGEX as its source. This causes priority inversion in the pipeline because regex-sourced entities have a higher priority than dictionary-sourced entities, which can prevent proper deduplication or filtering. It should use the appropriate name corpus or dictionary source enum value (e.g., DETECTION_SOURCES.NAME_CORPUS or DETECTION_SOURCES.DICTIONARY).

Suggested change
entities.push({
start,
end,
label: "person",
text,
score,
source: DETECTION_SOURCES.REGEX,
entities.push({
start,
end,
label: "person",
text,
score,
source: DETECTION_SOURCES.NAME_CORPUS,
});

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a new “locale/non-Western” person-name detector and integrates it into the anonymize pipeline so the name-corpus step also finds common non-Western name patterns (incl. CJK and various Latin transliterations).

Changes:

  • Integrates detectNonWesternNames into runPipeline’s name-corpus phase.
  • Adds new detector implementation and exports it from the shared index.
  • Introduces non-Western honorific configuration and comprehensive detector tests.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
packages/anonymize/src/pipeline.ts Merges Western name-corpus results with the new non-Western detector results.
packages/anonymize/src/index-shared.ts Exports detectNonWesternNames for external consumers.
packages/anonymize/src/detectors/locale-name-detector.ts Implements non-Western name detection via tokenization + heuristics + large token sets.
packages/anonymize/src/config/titles.ts Adds NONWESTERN_HONORIFICS and tweaks an honorific comment.
packages/anonymize/src/test/nonWesternNames.test.ts Adds pipeline + unit coverage for many locales and boundary/FP behaviors.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +2412 to +2414
const hanCount = (fullText.match(/\p{Script=Han}/gu) || []).length;
const isLatinMajority =
fullText.length < 100 || hanCount / fullText.length < 0.15;
Comment on lines +2373 to +2389
const deduplicateSpans = (entities: Entity[]): Entity[] => {
const sorted = [...entities].sort(
(a, b) => a.start - b.start || b.end - a.end,
);
const result: Entity[] = [];

for (const entity of sorted) {
const overlaps = result.some(
(accepted) => entity.start < accepted.end && entity.end > accepted.start,
);
if (!overlaps) {
result.push(entity);
}
}

return result;
};
Comment on lines +2279 to +2281
const segmenter = new Intl.Segmenter(undefined, {
granularity: "word",
});
Comment on lines +352 to +356
"Majumdar",
"Ray",
"Poddar",
"Ray",
"Halder",
Comment on lines +2341 to +2346
// Only match name tokens when the word is capitalized in the source text.
// This prevents lowercase English words like "to", "so", "an" from being
// title-cased and falsely matched as Korean/Thai/Vietnamese name tokens.
if (UPPER_START_RE.test(text) && NONWESTERN_NAME_TOKENS.has(titleCased)) {
return { text, type: TOKEN_TYPE.NONWESTERN_NAME, start, end };
}
Comment thread packages/anonymize/src/pipeline.ts Outdated
Comment on lines +1254 to +1256
const westernNames = detectNameCorpus(fullText, ctx);
const nonWesternNames = detectNonWesternNames(fullText, ctx);
rawNameCorpusEntities = [...westernNames, ...nonWesternNames];

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d899f37e07

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

).length;

let score: number;
if (hasTitle && (hasNonWesternToken || capitalizedCount > 0)) score = 0.95;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Require name evidence before title chains

In legal text that uses court honorifics, this condition redacts non-person phrases as people: The Hon'ble Court issued an order. is tokenized as capitalized + title + capitalized and emits The Hon'ble Court with score 0.95 even though no non-Western name token is present. This is common in Indian legal documents and will corrupt output whenever enableNameCorpus is on, so title chains should require a name token or exclude court/institution terms.

Useful? React with 👍 / 👎.

CJK_NAME_RE.lastIndex = 0;
let match: RegExpExecArray | null;
while ((match = CJK_NAME_RE.exec(fullText)) !== null) {
addCandidate(match.index, match.index + match[0].length, 0.95);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Do not label every short Han phrase as a person

For Latin-majority documents, every isolated 2-4 Han-character sequence is emitted as a high-confidence person without any context or stoplist, so ordinary jurisdictions, cities, and language labels such as 香港, 北京, 上海, or 中文 are redacted as people. Mixed English/Chinese legal documents often contain these terms, so this needs additional name evidence or exclusions before creating a person entity.

Useful? React with 👍 / 👎.

Comment on lines +2527 to +2531
else if (
hasNonWesternToken &&
(capitalizedCount > 0 || abbreviationCount > 0)
)
score = 0.9;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid absorbing capitalized document nouns into names

When a known non-Western token appears next to capitalized legal/document words, this branch scores the whole chain as a person, so text like Please review the Sato Agreement. emits Sato Agreement and The Nguyen Contract was signed. emits The Nguyen Contract. That over-redacts non-person words and changes document meaning whenever the detector sees a name-like token before a capitalized noun.

Useful? React with 👍 / 👎.

Comment on lines +2532 to +2537
else if (
hasNonWesternToken &&
chain.length === 1 &&
!isSentenceStart(fullText, token.start)
)
score = 0.5;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Suppress standalone ambiguous locale tokens

This emits any single non-Western token that is not at sentence start with score 0.5, which is above the default test threshold of 0.3; because the token list includes common jurisdiction/adjective words, This agreement is governed by Thai law. redacts Thai as a person. Standalone tokens need stronger context or a stoplist for locale/legal terms to avoid corrupting ordinary legal clauses.

Useful? React with 👍 / 👎.

Comment on lines +2523 to +2524
else if (hasJaSuffix && (capitalizedCount > 0 || hasNonWesternToken))
score = 0.9;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Require a hyphen before Japanese honorific suffixes

Because san is classified as a Japanese suffix even when it is a standalone word, any preceding/following capitalized tokens can become a person span; for example, The San Diego office. emits The San Diego with score 0.9. This will over-redact common place names in mixed legal/business documents, so the -san case should require the hyphen or otherwise avoid treating standalone San as suffix evidence.

Useful? React with 👍 / 👎.

@1337Xcode

Copy link
Copy Markdown
Contributor Author

Hey @jan-kubica, just pushed the fix for #179 (non-Western name detection). Would appreciate a review when you get a chance.

@1337Xcode 1337Xcode requested a review from Copilot June 9, 2026 02:05

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Comment on lines +113 to +116
const NONWESTERN_TITLES = new Set([
...Object.values(NONWESTERN_HONORIFICS)
.flat()
.map((t) => t.replace(/[.-]+$/u, "")), // strip trailing dots/dashes
Comment on lines +2113 to +2116
const strippedTitle = text.endsWith(".") ? text.slice(0, -1) : text;
if (NONWESTERN_TITLES.has(strippedTitle)) {
return { text, type: TOKEN_TYPE.TITLE, start, end };
}
Comment on lines +129 to +134
const NONWESTERN_NAME_TOKENS = new Set([
"Aah",
"Aarti",
"Abdel",
"Abdul",
"Abdullah",
Comment thread packages/anonymize/src/pipeline.ts Outdated
Comment on lines +1254 to +1256
const westernNames = detectNameCorpus(fullText, ctx);
const nonWesternNames = detectNonWesternNames(fullText, ctx);
rawNameCorpusEntities = [...westernNames, ...nonWesternNames];
Comment on lines +2189 to +2196
entities.push({
start,
end,
label: "person",
text,
score,
source: DETECTION_SOURCES.REGEX,
});

// ── Tests ──────────────────────────────────────────────────────────

describe("Non-Western Name Detection", () => {

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d39a62704f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +2272 to +2276
if (
gap.includes("\n") ||
PERSON_CHAIN_BREAK_RE.test(gap) ||
breaksOnPeriod
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Break person chains at appositive commas

When a valid name is followed by a comma and an organization/location descriptor, this loop keeps extending through the comma into the next capitalized token. Because addCandidate later drops spans containing organization words, Rahul Sharma, University of Delhi, testified. produces no Rahul Sharma entity at all. Commas that introduce appositives should stop the chain, or the detector needs to keep the accepted prefix before rejecting the longer organization-tainted span.

Useful? React with 👍 / 👎.

Comment on lines +2258 to +2262
const MAX_CHAIN = 5;
const chain: ClassifiedToken[] = [token];
let j = i + 1;

while (j < tokens.length && chain.length < MAX_CHAIN) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid truncating Arabic patronymic chains

The five-token cap truncates common Arabic/Gulf names that include both a title and connectors; for example Sheikh Mohammed bin Rashid Al Maktoum is emitted as Sheikh Mohammed bin Rashid Al, leaving Maktoum unredacted and ending the entity on a dangling connector. This affects the target locales added by this detector, so the chain limit should allow connector-heavy names or avoid finalizing spans that end with a connector.

Useful? React with 👍 / 👎.

Comment thread packages/anonymize/src/pipeline.ts Outdated
Comment on lines +1251 to +1255
if (config.enableNameCorpus && !config.enableDenyList) {
await initNameCorpus(ctx, config.dictionaries, config.nameCorpusLanguages);
checkAbort(signal);
rawNameCorpusEntities = detectNameCorpus(fullText, ctx);
const westernNames = detectNameCorpus(fullText, ctx);
const nonWesternNames = detectNonWesternNames(fullText, ctx);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Run non-Western names in deny-list mode

This guard means the new rule-based detector never runs when enableDenyList is also true, even though enableNameCorpus is documented and tested as keeping name matches available in deny-list mode. In the common configuration with both enabled, names that only this detector knows about, such as Rahul Sharma without a matching injected dictionary entry, are skipped entirely.

Useful? React with 👍 / 👎.

Comment on lines +2016 to +2017
const CJK_NAME_RE =
/(?<!\p{Script=Han})\p{Script=Han}{2,4}(?!\p{Script=Han})/gu;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Detect CJK names embedded in Han text

The CJK regex only matches a 2-4 Han-character sequence when it is not adjacent to any other Han character, so names written naturally inside Chinese/Japanese prose are missed; for example 由張小明簽署 produces no 張小明 entity because the name touches surrounding Han characters. This makes the CJK branch work only for space-delimited mixed-language snippets, not ordinary CJK legal text.

Useful? React with 👍 / 👎.

Comment on lines +2268 to +2271
const breaksOnPeriod =
gap.includes(".") &&
!isInitialContinuationGap(prev.text, gap) &&
prev.type !== TOKEN_TYPE.TITLE;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Stop full-word titles at sentence periods

Because every title token bypasses the period break, full-word honorifics are chained across sentence boundaries; The presiding officer was Justice. Kumar filed the appeal. emits Justice. Kumar as a person even though the period ended the title sentence. Only abbreviation-style titles such as Dr. or Mr. should suppress breaksOnPeriod; full-word non-Western titles need to stop at . to avoid redacting text from the next sentence.

Useful? React with 👍 / 👎.

@jan-kubica jan-kubica self-requested a review June 10, 2026 20:38

@jan-kubica jan-kubica left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR identified and addresses real and important gap. Thank you for it. I would like to get the following changes in the PR in other not to duplicate existing approaches. LMK if I am missing something please

  • Extend names.ts instead of forking it. You have the same skeleton+some novel ideas (suffixes, arabic connectors) but in a parallel implementation. This introduces maintenance burden, and extra compute in hot path we can avoid if you land this as chain rules in the existing names.ts
  • move name tokens to per-locale dictionary data, loaded via initNameCorpus. the convention is to split the logic vs the data

@1337Xcode 1337Xcode force-pushed the feat/non-western-names branch from 583bb5c to 7f459d7 Compare June 11, 2026 04:57
@1337Xcode 1337Xcode requested review from Copilot and jan-kubica June 11, 2026 05:08

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 25dc116194

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +212 to +213
if (dictionaries?.nonWesternNames) {
for (const [, names] of Object.entries(dictionaries.nonWesternNames)) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor scoped languages for injected non-Western names

When callers pass nameCorpusLanguages to restrict which injected name dictionaries are active, this branch still merges every dictionaries.nonWesternNames locale, unlike the first-name and surname branches above. In a scoped run such as initNameCorpus(ctx, { nonWesternNames: { ja: ["Qwertaro"] } }, ["en"]), detectNameCorpus("Please contact Qwertaro Smith.") still emits a person match, so integrations that scope names to the document language can get unexpected cross-locale redactions.

Useful? React with 👍 / 👎.

Comment on lines +918 to 921
if (next.type !== TOKEN_TYPE.OTHER) {
chain.push(next);
j++;
} else {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve hyphenated Korean given names

For hyphenated Korean romanizations, Intl.Segmenter splits the given name around the hyphen and the lowercase continuation is classified as OTHER, so this loop finalizes only the prefix. For example, Kim Min-jun is emitted as Kim Min, leaving jun unredacted; the chain needs to absorb or normalize hyphenated non-Western name continuations before stopping on OTHER.

Useful? React with 👍 / 👎.

Comment on lines +382 to +383
if (corpus.nonWesternNames.has(token)) return true;
return corpus.nonWesternNames.has(titleCaseWithApostrophe(token));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match Vietnamese names with their diacritics

The bundled Vietnamese corpus is unaccented, but this lookup only checks the raw and title-cased token, so standard Vietnamese spellings with tone marks are missed or truncated. For example, Trần Thị Mai signed only emits Mai, leaving the surname and middle name visible; the lookup needs a diacritic-folded form, or accented variants in the corpus, for these locale names.

Useful? React with 👍 / 👎.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 7 comments.

Comment on lines +525 to +549
// 1. Title tokens (Western + non-Western honorifics)
if (corpus.titleTokens.has(stripped)) {
return { text, type: TOKEN_TYPE.TITLE, start, end };
// Mark abbreviation-style titles so the chain
// assembler knows their trailing dot is not a
// sentence boundary (Dr., Mr., Smt. etc.).
return {
text,
type: TOKEN_TYPE.TITLE,
start,
end,
...(corpus.titleAbbreviations.has(stripped)
? { titleAbbreviation: true }
: {}),
};
}

// 2. Japanese honorific suffixes (san, sama, sensei)
if (JA_SUFFIXES.has(lower)) {
return { text, type: TOKEN_TYPE.JA_SUFFIX, start, end };
}

// 3. Arabic patronymic connectors (bin, bint, ibn, al, el)
if (ARABIC_CONNECTORS.has(lower)) {
return { text, type: TOKEN_TYPE.ARABIC_CONNECTOR, start, end };
}
Comment on lines +404 to +408
/** Organization keyword filter. */
const ORG_WORDS =
/\b(?:Group|Company|LLC|LLP|LP|Inc|Ltd|Corp|Corporation|Holdings|Partners|Association|University|Bank|Fund|Trust|Agency|Government|Ministry|Office|Department|Council|Board|Committee|Commission|Services|Solutions|Technologies|Systems|Analytics|Software)\b/i;

const isOrganization = (text: string): boolean => ORG_WORDS.test(text);
Comment on lines +410 to +423
/** Deduplicate overlapping entity spans (first wins). */
const deduplicateSpans = (entities: Entity[]): Entity[] => {
const sorted = [...entities].sort(
(a, b) => a.start - b.start || b.end - a.end,
);
const result: Entity[] = [];
for (const entity of sorted) {
const last = result.at(-1);
if (!last || entity.start >= last.end) {
result.push(entity);
}
}
return result;
};
Comment on lines +212 to +218
if (dictionaries?.nonWesternNames) {
for (const [, names] of Object.entries(dictionaries.nonWesternNames)) {
for (const name of names) {
nonWesternNames.push(name);
}
}
}
Comment on lines +181 to +204
const nwLocaleKeys = [
"in",
"ar",
"ja-latn",
"ko",
"zh-latn",
"th",
"vi",
"fil",
"id",
] as const;
const [nwNameMods, nwExcludedMod] = await Promise.all([
Promise.all(
nwLocaleKeys.map(
(locale) =>
import(`../data/names-nw-${locale}.json`) as Promise<{
default: { names: string[] };
}>,
),
),
import("../data/names-nw-excluded-allcaps.json") as Promise<{
default: { words: string[] };
}>,
]);
Comment on lines +234 to +268
export const NONWESTERN_HONORIFICS = {
in: [
"Sri",
"Shri",
"Smt",
"Smt.",
"Kumari",
"Pandit",
"Pt.",
"Adv.",
"Adv",
"Justice",
"Hon'ble",
],
ar: [
"Al-",
"El-",
"Sheikh",
"Shaikh",
"Sheikha",
"Ustaz",
"Ustaza",
"Abu",
"Umm",
],
"zh-Latn": [
"Encik",
"Puan",
"Datuk",
"Dato'",
"Dato",
"Tan Sri",
"Tun",
] as const,
"ja-Latn": ["-san", "-sama", "-sensei", "Mr.", "Ms."] as const,
Comment on lines +36 to +38
export const getNameCorpusNonWesternNames = (
ctx: PipelineContext = defaultContext,
): readonly string[] => ctx.nameCorpus?.nonWesternNamesList ?? [];
@1337Xcode

Copy link
Copy Markdown
Contributor Author

Hey @jan-kubica, thanks for the review. Both points addressed:

  • Extended names.ts instead of forking. The parallel locale-name-detector.ts is now a thin deprecated wrapper that delegates to detectNameCorpus. All non-Western logic (Arabic connectors, Japanese suffixes, non-Western scoring path) lives as chain rules in names.ts. The pipeline calls detectNameCorpus once, no duplicate compute.
  • Moved name tokens to per-locale JSON data. Added names-nw-{in,ar,ja-latn,ko,zh-latn,th,vi,fil,id}.json loaded via initNameCorpus into corpus.nonWesternNames. Adding names = editing JSON, no code changes.

Also fixed a few things the AI review tools flagged as real issues: full-word titles like "Justice" no longer chain across sentence boundaries (only abbreviation titles like "Dr." suppress period breaks), and bare "HE"/"HH" no longer false-positive as Gulf honorific titles.

@jan-kubica jan-kubica left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the prompt changes. One more thing, I have noticed you are mocking text-search for every test run, presumably because it did not natively work on Windows. I have now addressed the underlying gap and anonymize 1.4.9 (and text search 1.0.6) now ship win32-x64-msvc native bindings for all three engines.

As such, bun install && bun test should now work natively on Windows.

After rebasing and refreshing the lockfile, the mock should be unnecessary >> could you please rebase, reinstall and remove the mock?. Alternatively, if I am missing anything and thre is a reason for the mock, please lmk

@1337Xcode 1337Xcode force-pushed the feat/non-western-names branch from 25dc116 to fee0b0c Compare June 11, 2026 21:44
@1337Xcode 1337Xcode force-pushed the feat/non-western-names branch from fee0b0c to 6b57e6b Compare June 11, 2026 21:45
@1337Xcode 1337Xcode requested a review from jan-kubica June 11, 2026 21:47
@1337Xcode

Copy link
Copy Markdown
Contributor Author

Hey @jan-kubica

Thanks for addressing the Windows native support in v1.4.9!

I have completed the requested changes:

  1. Rebased the branch onto the latest main.
  2. Refreshed the lockfile with bun install to pull in the win32-x64-msvc native bindings (@stll/text-search v1.0.6).
  3. Removed the WASM mock entirely from packages/anonymize/src/__test__/setup.ts.

All 750 tests are now running and passing natively on Windows with the mock removed. I have force-pushed the squashed commit to update the PR. Let me know if you need anything else!

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6b57e6b037

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +148 to +149
for (const form of Object.values(NONWESTERN_HONORIFICS).flat()) {
titles.push(form.replace(/[.-]+$/u, "").toLowerCase());

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Do not inject stripped honorific fragments as titles

When enableDenyList and enableNameCorpus are both on, appendNameCorpusEntries adds every title token as a standalone person pattern; this loop now turns non-Western forms like Al-, El-, H., Ba, Bu, and Co into bare tokens (al, el, h, ba, bu, co). In Spanish/Arabic/English legal text this will redact common words or abbreviations such as al and Co as people, so prefix/post-nominal fragments should not be merged into the global title corpus as standalone patterns.

Useful? React with 👍 / 👎.

const lower = word.text.toLowerCase();
const nextWord = words[idx + 1];
if (
(lower === "s" || lower === "d" || lower === "w" || lower === "r") &&

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Do not merge r/o residence markers into names

Including r/o in the same connector path as s/o, d/o, and w/o makes Indian party descriptions like Rahul Kumar r/o Delhi become one person span (Rahul Kumar r/o Delhi). In this context r/o means resident of and introduces an address/location, not a patronymic or spouse connector, so the name chain should break before it rather than consume the residence field.

Useful? React with 👍 / 👎.

@jan-kubica

Copy link
Copy Markdown
Contributor

@1337Xcode thank you for your prompt replies and this PR. Appreciate it. I am adding final minor adjustments (e.g., why ship field that does not currently exist and would already land as deprecated...?) directly to this branch and merging in a moment.

@jan-kubica jan-kubica merged commit e2a382e into stella:main Jun 12, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e33f5cecdb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1050 to +1053
if (isLatinMajority) {
CJK_NAME_RE.lastIndex = 0;
let match: RegExpExecArray | null;
while ((match = CJK_NAME_RE.exec(fullText)) !== null) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor language scoping before CJK pre-pass

When callers scope nameCorpusLanguages to non-CJK locales, this pre-pass still runs because it does not consult the scoped languages at all. For example, after initNameCorpus(ctx, undefined, ["cs"]), detectNameCorpus("Signed by 張小明.", ctx) still emits 張小明 as a person even though every non-Western locale corpus was scoped out; this can cause unexpected cross-locale redactions in integrations that use language scoping. Fresh evidence beyond the earlier scoped-dictionary issue: the hard-coded CJK branch bypasses the corpus/language scope entirely.

Useful? React with 👍 / 👎.

Comment on lines +559 to +560
const CJK_NAME_RE =
/(?<!\p{Script=Han})\p{Script=Han}{2,4}(?!\p{Script=Han})/gu;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Include Hangul in the native-name regex

The native-name pre-pass can never match Korean names written in Hangul because this regex only accepts Script=Han, even though the surname table below includes Hangul surnames such as 김 and 박. In practice detectNameCorpus("Signed by 김민준.") and detectNameCorpus("Signed by 박지훈.") return no person entity, so Korean text in the target locale is left unredacted unless it is romanized.

Useful? React with 👍 / 👎.

Comment on lines +543 to +544
if (corpus.nonWesternNames.has(token)) return true;
return corpus.nonWesternNames.has(titleCaseWithApostrophe(token));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match multi-word locale corpus entries

Some bundled non-Western entries are phrases, for example Thai surnames like Na Nakhon, but the detector later looks up one Intl.Segmenter word at a time. Those phrase entries can never match, so The witness Somchai Na Nakhon signed. emits only Somchai at low confidence and leaves the family name visible; phrase entries need a phrase pass or token-level decomposition.

Useful? React with 👍 / 👎.

!isNonWesternNameToken(text, corpus) &&
!JA_SUFFIXES.has(lower) &&
!ARABIC_CONNECTORS.has(lower) &&
!(ALL_UPPER_RE.test(text) && !corpus.excludedAllCaps.has(text))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Whitelist short all-caps post-nominals

This exception treats any two-letter all-caps token not in the exclusion list as name-like, not just known post-nominals such as JP, KC, or QC. Common legal/business acronyms that are absent from the exclusion JSON then get absorbed into person spans, e.g. Sato IP rights, Kim IT policy, or Wong HR reviewed it all become high-confidence person entities; this should be limited to an explicit post-nominal whitelist.

Useful? React with 👍 / 👎.

Comment on lines +1218 to +1219
if (!chain.some(hasPersonNameSource)) {
continue;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve curated Names deny-list matches

This guard drops any person chain that does not include the synthetic first-name or surname source, but curated dictionaries in the Names category are added with the normal deny-list source. With enableDenyList and enableNameCorpus both on, a bundled or injected Names entry that is not also in the first/surname corpus, for example a full-name entry from names/global, reaches nameHits and is then skipped here, so valid deny-list person entries silently stop redacting.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(anonymize): support rule-based non-Western name detection across target locales

3 participants