feat(anonymize): support rule-based non-Western name detection across target locales by 1337Xcode · Pull Request #180 · stella/anonymize

1337Xcode · 2026-06-09T01:56:02Z

Proposed change

Adds a rule-based non-Western name detector (detectNonWesternNames) for person name anonymization across 10 locales: India (in), UAE/Saudi Arabia (ar), Singapore/Malaysia (zh-Latn), Hong Kong (zh-Hant), Japan (ja-Latn), South Korea (ko), Thailand (th), Vietnam (vi), Philippines (fil), and Indonesia (id).

Resolves #179.

Added locale-specific honorific and token config under NONWESTERN_HONORIFICS in titles.ts.
Implemented detectNonWesternNames in locale-name-detector.ts using Intl.Segmenter for word segmentation, with CJK Han range detection, suffix chaining, and all-caps filtering.
Added parsing for relational connectors (s/o, d/o, w/o, r/o) and post-nominal designations (S.H., JP, SC).
No changes to types.ts or the core pipeline registry.
All 696 existing tests pass unaffected. 73 new tests added covering all target locales, boundaries, deduplication, and false-positive cases.

Files affected

packages/anonymize/src/config/titles.ts
packages/anonymize/src/detectors/locale-name-detector.ts
packages/anonymize/src/index-shared.ts
packages/anonymize/src/pipeline.ts
packages/anonymize/src/__test__/nonWesternNames.test.ts

Checklist

BUILD ASSURANCE | Builds with no errors or warnings.
TESTING | All 696 existing tests and 73 new tests pass.
DARK MODE SUPPORT | N/A (non-visual library changes).
ISSUE LINKED | Linked feat(anonymize): support rule-based non-Western name detection across target locales #179.
SCREENSHOT INCLUDED | N/A (non-visual library changes).

gemini-code-assist

Code Review

This pull request introduces a non-Western name detector (locale-name-detector.ts) to identify names, honorifics, and relational connectors across various Asian and Middle Eastern locales, integrating it into the main anonymization pipeline alongside a comprehensive test suite. The review feedback identifies a correctness bug where relational connectors like s/o can falsely match substrings of longer words (e.g., s/office), an $O(N^2)$ performance bottleneck in deduplicateSpans that can be optimized to $O(N)$, high memory overhead from Han character matching, and a priority inversion issue caused by using the REGEX detection source instead of NAME_CORPUS.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-09T01:58:13Z

+  for (let idx = 0; idx < words.length; idx++) {
+    const word = words[idx]!;
+    const lower = word.text.toLowerCase();
+    if (
+      (lower === "s" || lower === "d" || lower === "w" || lower === "r") &&
+      fullText[word.end] === "/" &&
+      fullText[word.end + 1]?.toLowerCase() === "o"
+    ) {
+      tokens.push({
+        text: fullText.slice(word.start, word.end + 2),
+        type: TOKEN_TYPE.ARABIC_CONNECTOR, // reuse connector behavior so they chain seamlessly
+        start: word.start,
+        end: word.end + 2,
+      });
+      const nextWord = words[idx + 1];
+      if (
+        nextWord &&
+        nextWord.start === word.end + 1 &&
+        nextWord.text.toLowerCase() === "o"
+      ) {
+        idx++;
+      }
+    } else {
+      tokens.push(classifyToken(word));
+    }
+  }


There is a correctness bug when parsing relational connectors like s/o, d/o, w/o, or r/o. If the character following the / is 'o' but is part of a longer word (e.g., "s/office" or "s/out"), the current implementation will still match "s/o" and push it as an ARABIC_CONNECTOR token. This results in overlapping tokens and incorrect anonymization (e.g., "Rahul s/o" is anonymized, leaving "ffice" or "ut" unredacted).

To fix this, verify that the next word is exactly "o" (case-insensitive) and starts exactly at word.end + 1 before pushing the connector token.

for (let idx = 0; idx < words.length; idx++) { const word = words[idx]!; const lower = word.text.toLowerCase(); const nextWord = words[idx + 1]; if ( (lower === "s" || lower === "d" || lower === "w" || lower === "r") && fullText[word.end] === "/" && nextWord && nextWord.start === word.end + 1 && nextWord.text.toLowerCase() === "o" ) { tokens.push({ text: fullText.slice(word.start, word.end + 2), type: TOKEN_TYPE.ARABIC_CONNECTOR, start: word.start, end: word.end + 2, }); idx++; } else { tokens.push(classifyToken(word)); } }

gemini-code-assist · 2026-06-09T01:58:13Z

+const deduplicateSpans = (entities: Entity[]): Entity[] => {
+  const sorted = [...entities].sort(
+    (a, b) => a.start - b.start || b.end - a.end,
+  );
+  const result: Entity[] = [];
+
+  for (const entity of sorted) {
+    const overlaps = result.some(
+      (accepted) => entity.start < accepted.end && entity.end > accepted.start,
+    );
+    if (!overlaps) {
+      result.push(entity);
+    }
+  }
+
+  return result;
+};


The deduplicateSpans function currently has $O(N^2)$ complexity because of the result.some(...) call inside the loop. Since the input is already sorted by start ascending and end descending, any new entity can only overlap with the most recently accepted entity. We can optimize this to $O(N)$ by only comparing with the last accepted entity in the result array.

const deduplicateSpans = (entities: Entity[]): Entity[] => { const sorted = [...entities].sort( (a, b) => a.start - b.start || b.end - a.end, ); const result: Entity[] = []; for (const entity of sorted) { const last = result.at(-1); if (!last || entity.start >= last.end) { result.push(entity); } } return result; };

gemini-code-assist · 2026-06-09T01:58:13Z

+    });
+  };
+
+  const hanCount = (fullText.match(/\p{Script=Han}/gu) || []).length;


Using fullText.match(/\p{Script=Han}/gu) allocates a new array containing all matched Han characters. For very large documents, this can cause significant memory overhead and garbage collection pressure. Using matchAll avoids allocating the intermediate array of matches.

Suggested change

const hanCount = (fullText.match(/\p{Script=Han}/gu) || []).length;

let hanCount = 0;

for (const _ of fullText.matchAll(/\p{Script=Han}/gu)) {

hanCount++;

}

gemini-code-assist · 2026-06-09T01:58:13Z

+    entities.push({
+      start,
+      end,
+      label: "person",
+      text,
+      score,
+      source: DETECTION_SOURCES.REGEX,


The detectNonWesternNames detector is a rule-based/dictionary-based name detector, but it currently uses DETECTION_SOURCES.REGEX as its source. This causes priority inversion in the pipeline because regex-sourced entities have a higher priority than dictionary-sourced entities, which can prevent proper deduplication or filtering. It should use the appropriate name corpus or dictionary source enum value (e.g., DETECTION_SOURCES.NAME_CORPUS or DETECTION_SOURCES.DICTIONARY).

Suggested change

entities.push({

start,

end,

label: "person",

text,

score,

source: DETECTION_SOURCES.REGEX,

entities.push({

start,

end,

label: "person",

text,

score,

source: DETECTION_SOURCES.NAME_CORPUS,

});

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a new “locale/non-Western” person-name detector and integrates it into the anonymize pipeline so the name-corpus step also finds common non-Western name patterns (incl. CJK and various Latin transliterations).

Changes:

Integrates detectNonWesternNames into runPipeline’s name-corpus phase.
Adds new detector implementation and exports it from the shared index.
Introduces non-Western honorific configuration and comprehensive detector tests.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
packages/anonymize/src/pipeline.ts	Merges Western name-corpus results with the new non-Western detector results.
packages/anonymize/src/index-shared.ts	Exports `detectNonWesternNames` for external consumers.
packages/anonymize/src/detectors/locale-name-detector.ts	Implements non-Western name detection via tokenization + heuristics + large token sets.
packages/anonymize/src/config/titles.ts	Adds `NONWESTERN_HONORIFICS` and tweaks an honorific comment.
packages/anonymize/src/test/nonWesternNames.test.ts	Adds pipeline + unit coverage for many locales and boundary/FP behaviors.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+  const hanCount = (fullText.match(/\p{Script=Han}/gu) || []).length;
+  const isLatinMajority =
+    fullText.length < 100 || hanCount / fullText.length < 0.15;


+const deduplicateSpans = (entities: Entity[]): Entity[] => {
+  const sorted = [...entities].sort(
+    (a, b) => a.start - b.start || b.end - a.end,
+  );
+  const result: Entity[] = [];
+
+  for (const entity of sorted) {
+    const overlaps = result.some(
+      (accepted) => entity.start < accepted.end && entity.end > accepted.start,
+    );
+    if (!overlaps) {
+      result.push(entity);
+    }
+  }
+
+  return result;
+};


+const segmenter = new Intl.Segmenter(undefined, {
+  granularity: "word",
+});


+  "Majumdar",
+  "Ray",
+  "Poddar",
+  "Ray",
+  "Halder",


+  // Only match name tokens when the word is capitalized in the source text.
+  // This prevents lowercase English words like "to", "so", "an" from being
+  // title-cased and falsely matched as Korean/Thai/Vietnamese name tokens.
+  if (UPPER_START_RE.test(text) && NONWESTERN_NAME_TOKENS.has(titleCased)) {
+    return { text, type: TOKEN_TYPE.NONWESTERN_NAME, start, end };
+  }


+    const westernNames = detectNameCorpus(fullText, ctx);
+    const nonWesternNames = detectNonWesternNames(fullText, ctx);
+    rawNameCorpusEntities = [...westernNames, ...nonWesternNames];


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d899f37e07

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-09T02:01:45Z

+    ).length;
+
+    let score: number;
+    if (hasTitle && (hasNonWesternToken || capitalizedCount > 0)) score = 0.95;


Require name evidence before title chains

In legal text that uses court honorifics, this condition redacts non-person phrases as people: The Hon'ble Court issued an order. is tokenized as capitalized + title + capitalized and emits The Hon'ble Court with score 0.95 even though no non-Western name token is present. This is common in Indian legal documents and will corrupt output whenever enableNameCorpus is on, so title chains should require a name token or exclude court/institution terms.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-09T02:01:45Z

+    CJK_NAME_RE.lastIndex = 0;
+    let match: RegExpExecArray | null;
+    while ((match = CJK_NAME_RE.exec(fullText)) !== null) {
+      addCandidate(match.index, match.index + match[0].length, 0.95);


Do not label every short Han phrase as a person

For Latin-majority documents, every isolated 2-4 Han-character sequence is emitted as a high-confidence person without any context or stoplist, so ordinary jurisdictions, cities, and language labels such as 香港, 北京, 上海, or 中文 are redacted as people. Mixed English/Chinese legal documents often contain these terms, so this needs additional name evidence or exclusions before creating a person entity.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-09T02:01:45Z

+    else if (
+      hasNonWesternToken &&
+      (capitalizedCount > 0 || abbreviationCount > 0)
+    )
+      score = 0.9;


Avoid absorbing capitalized document nouns into names

When a known non-Western token appears next to capitalized legal/document words, this branch scores the whole chain as a person, so text like Please review the Sato Agreement. emits Sato Agreement and The Nguyen Contract was signed. emits The Nguyen Contract. That over-redacts non-person words and changes document meaning whenever the detector sees a name-like token before a capitalized noun.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-09T02:01:45Z

+    else if (
+      hasNonWesternToken &&
+      chain.length === 1 &&
+      !isSentenceStart(fullText, token.start)
+    )
+      score = 0.5;


Suppress standalone ambiguous locale tokens

This emits any single non-Western token that is not at sentence start with score 0.5, which is above the default test threshold of 0.3; because the token list includes common jurisdiction/adjective words, This agreement is governed by Thai law. redacts Thai as a person. Standalone tokens need stronger context or a stoplist for locale/legal terms to avoid corrupting ordinary legal clauses.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-09T02:01:45Z

+    else if (hasJaSuffix && (capitalizedCount > 0 || hasNonWesternToken))
+      score = 0.9;


Require a hyphen before Japanese honorific suffixes

Because san is classified as a Japanese suffix even when it is a standalone word, any preceding/following capitalized tokens can become a person span; for example, The San Diego office. emits The San Diego with score 0.9. This will over-redact common place names in mixed legal/business documents, so the -san case should require the hyphen or otherwise avoid treating standalone San as suffix evidence.

Useful? React with 👍 / 👎.

1337Xcode · 2026-06-09T02:04:34Z

Hey @jan-kubica, just pushed the fix for #179 (non-Western name detection). Would appreciate a review when you get a chance.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

+const NONWESTERN_TITLES = new Set([
+  ...Object.values(NONWESTERN_HONORIFICS)
+    .flat()
+    .map((t) => t.replace(/[.-]+$/u, "")), // strip trailing dots/dashes


+  const strippedTitle = text.endsWith(".") ? text.slice(0, -1) : text;
+  if (NONWESTERN_TITLES.has(strippedTitle)) {
+    return { text, type: TOKEN_TYPE.TITLE, start, end };
+  }


+const NONWESTERN_NAME_TOKENS = new Set([
+  "Aah",
+  "Aarti",
+  "Abdel",
+  "Abdul",
+  "Abdullah",


+    const westernNames = detectNameCorpus(fullText, ctx);
+    const nonWesternNames = detectNonWesternNames(fullText, ctx);
+    rawNameCorpusEntities = [...westernNames, ...nonWesternNames];


+    entities.push({
+      start,
+      end,
+      label: "person",
+      text,
+      score,
+      source: DETECTION_SOURCES.REGEX,
+    });


+
+// ── Tests ──────────────────────────────────────────────────────────
+
+describe("Non-Western Name Detection", () => {


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d39a62704f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-09T02:10:53Z

+      if (
+        gap.includes("\n") ||
+        PERSON_CHAIN_BREAK_RE.test(gap) ||
+        breaksOnPeriod
+      )


Break person chains at appositive commas

When a valid name is followed by a comma and an organization/location descriptor, this loop keeps extending through the comma into the next capitalized token. Because addCandidate later drops spans containing organization words, Rahul Sharma, University of Delhi, testified. produces no Rahul Sharma entity at all. Commas that introduce appositives should stop the chain, or the detector needs to keep the accepted prefix before rejecting the longer organization-tainted span.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-09T02:10:53Z

+    const MAX_CHAIN = 5;
+    const chain: ClassifiedToken[] = [token];
+    let j = i + 1;
+
+    while (j < tokens.length && chain.length < MAX_CHAIN) {


Avoid truncating Arabic patronymic chains

The five-token cap truncates common Arabic/Gulf names that include both a title and connectors; for example Sheikh Mohammed bin Rashid Al Maktoum is emitted as Sheikh Mohammed bin Rashid Al, leaving Maktoum unredacted and ending the entity on a dangling connector. This affects the target locales added by this detector, so the chain limit should allow connector-heavy names or avoid finalizing spans that end with a connector.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-09T02:10:54Z

  if (config.enableNameCorpus && !config.enableDenyList) {
    await initNameCorpus(ctx, config.dictionaries, config.nameCorpusLanguages);
    checkAbort(signal);
-    rawNameCorpusEntities = detectNameCorpus(fullText, ctx);
+    const westernNames = detectNameCorpus(fullText, ctx);
+    const nonWesternNames = detectNonWesternNames(fullText, ctx);


Run non-Western names in deny-list mode

This guard means the new rule-based detector never runs when enableDenyList is also true, even though enableNameCorpus is documented and tested as keeping name matches available in deny-list mode. In the common configuration with both enabled, names that only this detector knows about, such as Rahul Sharma without a matching injected dictionary entry, are skipped entirely.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-09T02:10:54Z

+const CJK_NAME_RE =
+  /(?<!\p{Script=Han})\p{Script=Han}{2,4}(?!\p{Script=Han})/gu;


Detect CJK names embedded in Han text

The CJK regex only matches a 2-4 Han-character sequence when it is not adjacent to any other Han character, so names written naturally inside Chinese/Japanese prose are missed; for example 由張小明簽署 produces no 張小明 entity because the name touches surrounding Han characters. This makes the CJK branch work only for space-delimited mixed-language snippets, not ordinary CJK legal text.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-09T02:10:54Z

+      const breaksOnPeriod =
+        gap.includes(".") &&
+        !isInitialContinuationGap(prev.text, gap) &&
+        prev.type !== TOKEN_TYPE.TITLE;


Stop full-word titles at sentence periods

Because every title token bypasses the period break, full-word honorifics are chained across sentence boundaries; The presiding officer was Justice. Kumar filed the appeal. emits Justice. Kumar as a person even though the period ended the title sentence. Only abbreviation-style titles such as Dr. or Mr. should suppress breaksOnPeriod; full-word non-Western titles need to stop at . to avoid redacting text from the next sentence.

Useful? React with 👍 / 👎.

jan-kubica

This PR identified and addresses real and important gap. Thank you for it. I would like to get the following changes in the PR in other not to duplicate existing approaches. LMK if I am missing something please

Extend names.ts instead of forking it. You have the same skeleton+some novel ideas (suffixes, arabic connectors) but in a parallel implementation. This introduces maintenance burden, and extra compute in hot path we can avoid if you land this as chain rules in the existing names.ts
move name tokens to per-locale dictionary data, loaded via initNameCorpus. the convention is to split the logic vs the data

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 25dc116194

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T05:10:28Z

+      if (dictionaries?.nonWesternNames) {
+        for (const [, names] of Object.entries(dictionaries.nonWesternNames)) {


Honor scoped languages for injected non-Western names

When callers pass nameCorpusLanguages to restrict which injected name dictionaries are active, this branch still merges every dictionaries.nonWesternNames locale, unlike the first-name and surname branches above. In a scoped run such as initNameCorpus(ctx, { nonWesternNames: { ja: ["Qwertaro"] } }, ["en"]), detectNameCorpus("Please contact Qwertaro Smith.") still emits a person match, so integrations that scope names to the document language can get unexpected cross-locale redactions.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-11T05:10:28Z

+      if (next.type !== TOKEN_TYPE.OTHER) {
        chain.push(next);
        j++;
      } else {


Preserve hyphenated Korean given names

For hyphenated Korean romanizations, Intl.Segmenter splits the given name around the hyphen and the lowercase continuation is classified as OTHER, so this loop finalizes only the prefix. For example, Kim Min-jun is emitted as Kim Min, leaving jun unredacted; the chain needs to absorb or normalize hyphenated non-Western name continuations before stopping on OTHER.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-11T05:10:28Z

+  if (corpus.nonWesternNames.has(token)) return true;
+  return corpus.nonWesternNames.has(titleCaseWithApostrophe(token));


Match Vietnamese names with their diacritics

The bundled Vietnamese corpus is unaccented, but this lookup only checks the raw and title-cased token, so standard Vietnamese spellings with tone marks are missed or truncated. For example, Trần Thị Mai signed only emits Mai, leaving the surname and middle name visible; the lookup needs a diacritic-folded form, or accented variants in the corpus, for these locale names.

Useful? React with 👍 / 👎.

Copilot

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 7 comments.

+  // 1. Title tokens (Western + non-Western honorifics)
  if (corpus.titleTokens.has(stripped)) {
-    return { text, type: TOKEN_TYPE.TITLE, start, end };
+    // Mark abbreviation-style titles so the chain
+    // assembler knows their trailing dot is not a
+    // sentence boundary (Dr., Mr., Smt. etc.).
+    return {
+      text,
+      type: TOKEN_TYPE.TITLE,
+      start,
+      end,
+      ...(corpus.titleAbbreviations.has(stripped)
+        ? { titleAbbreviation: true }
+        : {}),
+    };
+  }
+
+  // 2. Japanese honorific suffixes (san, sama, sensei)
+  if (JA_SUFFIXES.has(lower)) {
+    return { text, type: TOKEN_TYPE.JA_SUFFIX, start, end };
+  }
+
+  // 3. Arabic patronymic connectors (bin, bint, ibn, al, el)
+  if (ARABIC_CONNECTORS.has(lower)) {
+    return { text, type: TOKEN_TYPE.ARABIC_CONNECTOR, start, end };
+  }


+/** Organization keyword filter. */
+const ORG_WORDS =
+  /\b(?:Group|Company|LLC|LLP|LP|Inc|Ltd|Corp|Corporation|Holdings|Partners|Association|University|Bank|Fund|Trust|Agency|Government|Ministry|Office|Department|Council|Board|Committee|Commission|Services|Solutions|Technologies|Systems|Analytics|Software)\b/i;
+
+const isOrganization = (text: string): boolean => ORG_WORDS.test(text);


+/** Deduplicate overlapping entity spans (first wins). */
+const deduplicateSpans = (entities: Entity[]): Entity[] => {
+  const sorted = [...entities].sort(
+    (a, b) => a.start - b.start || b.end - a.end,
+  );
+  const result: Entity[] = [];
+  for (const entity of sorted) {
+    const last = result.at(-1);
+    if (!last || entity.start >= last.end) {
+      result.push(entity);
+    }
+  }
+  return result;
+};


+      if (dictionaries?.nonWesternNames) {
+        for (const [, names] of Object.entries(dictionaries.nonWesternNames)) {
+          for (const name of names) {
+            nonWesternNames.push(name);
+          }
+        }
+      }


+      const nwLocaleKeys = [
+        "in",
+        "ar",
+        "ja-latn",
+        "ko",
+        "zh-latn",
+        "th",
+        "vi",
+        "fil",
+        "id",
+      ] as const;
+      const [nwNameMods, nwExcludedMod] = await Promise.all([
+        Promise.all(
+          nwLocaleKeys.map(
+            (locale) =>
+              import(`../data/names-nw-${locale}.json`) as Promise<{
+                default: { names: string[] };
+              }>,
+          ),
+        ),
+        import("../data/names-nw-excluded-allcaps.json") as Promise<{
+          default: { words: string[] };
+        }>,
+      ]);


+export const NONWESTERN_HONORIFICS = {
+  in: [
+    "Sri",
+    "Shri",
+    "Smt",
+    "Smt.",
+    "Kumari",
+    "Pandit",
+    "Pt.",
+    "Adv.",
+    "Adv",
+    "Justice",
+    "Hon'ble",
+  ],
+  ar: [
+    "Al-",
+    "El-",
+    "Sheikh",
+    "Shaikh",
+    "Sheikha",
+    "Ustaz",
+    "Ustaza",
+    "Abu",
+    "Umm",
+  ],
+  "zh-Latn": [
+    "Encik",
+    "Puan",
+    "Datuk",
+    "Dato'",
+    "Dato",
+    "Tan Sri",
+    "Tun",
+  ] as const,
+  "ja-Latn": ["-san", "-sama", "-sensei", "Mr.", "Ms."] as const,


+export const getNameCorpusNonWesternNames = (
+  ctx: PipelineContext = defaultContext,
+): readonly string[] => ctx.nameCorpus?.nonWesternNamesList ?? [];


1337Xcode · 2026-06-11T05:28:00Z

Hey @jan-kubica, thanks for the review. Both points addressed:

Extended names.ts instead of forking. The parallel locale-name-detector.ts is now a thin deprecated wrapper that delegates to detectNameCorpus. All non-Western logic (Arabic connectors, Japanese suffixes, non-Western scoring path) lives as chain rules in names.ts. The pipeline calls detectNameCorpus once, no duplicate compute.
Moved name tokens to per-locale JSON data. Added names-nw-{in,ar,ja-latn,ko,zh-latn,th,vi,fil,id}.json loaded via initNameCorpus into corpus.nonWesternNames. Adding names = editing JSON, no code changes.

Also fixed a few things the AI review tools flagged as real issues: full-word titles like "Justice" no longer chain across sentence boundaries (only abbreviation titles like "Dr." suppress period breaks), and bare "HE"/"HH" no longer false-positive as Gulf honorific titles.

jan-kubica

Thank you for the prompt changes. One more thing, I have noticed you are mocking text-search for every test run, presumably because it did not natively work on Windows. I have now addressed the underlying gap and anonymize 1.4.9 (and text search 1.0.6) now ship win32-x64-msvc native bindings for all three engines.

As such, bun install && bun test should now work natively on Windows.

After rebasing and refreshing the lockfile, the mock should be unnecessary >> could you please rebase, reinstall and remove the mock?. Alternatively, if I am missing anything and thre is a reason for the mock, please lmk

… target locales

1337Xcode · 2026-06-11T21:48:25Z

Hey @jan-kubica

Thanks for addressing the Windows native support in v1.4.9!

I have completed the requested changes:

Rebased the branch onto the latest main.
Refreshed the lockfile with bun install to pull in the win32-x64-msvc native bindings (@stll/text-search v1.0.6).
Removed the WASM mock entirely from packages/anonymize/src/__test__/setup.ts.

All 750 tests are now running and passing natively on Windows with the mock removed. I have force-pushed the squashed commit to update the PR. Let me know if you need anything else!

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6b57e6b037

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T21:52:20Z

+      for (const form of Object.values(NONWESTERN_HONORIFICS).flat()) {
+        titles.push(form.replace(/[.-]+$/u, "").toLowerCase());


Do not inject stripped honorific fragments as titles

When enableDenyList and enableNameCorpus are both on, appendNameCorpusEntries adds every title token as a standalone person pattern; this loop now turns non-Western forms like Al-, El-, H., Ba, Bu, and Co into bare tokens (al, el, h, ba, bu, co). In Spanish/Arabic/English legal text this will redact common words or abbreviations such as al and Co as people, so prefix/post-nominal fragments should not be merged into the global title corpus as standalone patterns.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-11T21:52:20Z

+    const lower = word.text.toLowerCase();
+    const nextWord = words[idx + 1];
+    if (
+      (lower === "s" || lower === "d" || lower === "w" || lower === "r") &&


Do not merge r/o residence markers into names

Including r/o in the same connector path as s/o, d/o, and w/o makes Indian party descriptions like Rahul Kumar r/o Delhi become one person span (Rahul Kumar r/o Delhi). In this context r/o means resident of and introduces an address/location, not a patronymic or spouse connector, so the name chain should break before it rather than consume the residence field.

Useful? React with 👍 / 👎.

jan-kubica · 2026-06-12T10:18:57Z

@1337Xcode thank you for your prompt replies and this PR. Appreciate it. I am adding final minor adjustments (e.g., why ship field that does not currently exist and would already land as deprecated...?) directly to this branch and merging in a moment.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e33f5cecdb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-12T10:29:15Z

+  if (isLatinMajority) {
+    CJK_NAME_RE.lastIndex = 0;
+    let match: RegExpExecArray | null;
+    while ((match = CJK_NAME_RE.exec(fullText)) !== null) {


Honor language scoping before CJK pre-pass

When callers scope nameCorpusLanguages to non-CJK locales, this pre-pass still runs because it does not consult the scoped languages at all. For example, after initNameCorpus(ctx, undefined, ["cs"]), detectNameCorpus("Signed by 張小明.", ctx) still emits 張小明 as a person even though every non-Western locale corpus was scoped out; this can cause unexpected cross-locale redactions in integrations that use language scoping. Fresh evidence beyond the earlier scoped-dictionary issue: the hard-coded CJK branch bypasses the corpus/language scope entirely.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-12T10:29:15Z

+const CJK_NAME_RE =
+  /(?<!\p{Script=Han})\p{Script=Han}{2,4}(?!\p{Script=Han})/gu;


Include Hangul in the native-name regex

The native-name pre-pass can never match Korean names written in Hangul because this regex only accepts Script=Han, even though the surname table below includes Hangul surnames such as 김 and 박. In practice detectNameCorpus("Signed by 김민준.") and detectNameCorpus("Signed by 박지훈.") return no person entity, so Korean text in the target locale is left unredacted unless it is romanized.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-12T10:29:15Z

+  if (corpus.nonWesternNames.has(token)) return true;
+  return corpus.nonWesternNames.has(titleCaseWithApostrophe(token));


Match multi-word locale corpus entries

Some bundled non-Western entries are phrases, for example Thai surnames like Na Nakhon, but the detector later looks up one Intl.Segmenter word at a time. Those phrase entries can never match, so The witness Somchai Na Nakhon signed. emits only Somchai at low confidence and leaves the family name visible; phrase entries need a phrase pass or token-level decomposition.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-12T10:29:15Z

+    !isNonWesternNameToken(text, corpus) &&
+    !JA_SUFFIXES.has(lower) &&
+    !ARABIC_CONNECTORS.has(lower) &&
+    !(ALL_UPPER_RE.test(text) && !corpus.excludedAllCaps.has(text))


Whitelist short all-caps post-nominals

This exception treats any two-letter all-caps token not in the exclusion list as name-like, not just known post-nominals such as JP, KC, or QC. Common legal/business acronyms that are absent from the exclusion JSON then get absorbed into person spans, e.g. Sato IP rights, Kim IT policy, or Wong HR reviewed it all become high-confidence person entities; this should be limited to an explicit post-nominal whitelist.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-12T10:29:15Z

+    if (!chain.some(hasPersonNameSource)) {
+      continue;


Preserve curated Names deny-list matches

This guard drops any person chain that does not include the synthetic first-name or surname source, but curated dictionaries in the Names category are added with the normal deny-list source. With enableDenyList and enableNameCorpus both on, a bundled or injected Names entry that is not also in the first/surname corpus, for example a full-name entry from names/global, reaches nameHits and is then skipped here, so valid deny-list person entries silently stop redacting.

Useful? React with 👍 / 👎.

1337Xcode marked this pull request as ready for review June 9, 2026 01:56

Copilot AI review requested due to automatic review settings June 9, 2026 01:56

gemini-code-assist Bot reviewed Jun 9, 2026

View reviewed changes

Copilot AI reviewed Jun 9, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 9, 2026

View reviewed changes

1337Xcode requested a review from Copilot June 9, 2026 02:05

Copilot AI reviewed Jun 9, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 9, 2026

View reviewed changes

jan-kubica self-requested a review June 10, 2026 20:38

jan-kubica requested changes Jun 10, 2026

View reviewed changes

1337Xcode force-pushed the feat/non-western-names branch from 583bb5c to 7f459d7 Compare June 11, 2026 04:57

1337Xcode requested review from Copilot and jan-kubica June 11, 2026 05:08

chatgpt-codex-connector Bot reviewed Jun 11, 2026

View reviewed changes

Copilot AI reviewed Jun 11, 2026

View reviewed changes

This was referenced Jun 11, 2026

feat: build and publish win32-x64-msvc native binding stella/fuzzy-search#64

Merged

chore: bump anonymize to 1.4.9 with win32-capable text-search #185

Merged

jan-kubica requested changes Jun 11, 2026

View reviewed changes

1337Xcode force-pushed the feat/non-western-names branch from 25dc116 to fee0b0c Compare June 11, 2026 21:44

feat(anonymize): support rule-based non-Western name detection across…

6b57e6b

… target locales

1337Xcode force-pushed the feat/non-western-names branch from fee0b0c to 6b57e6b Compare June 11, 2026 21:45

1337Xcode requested a review from jan-kubica June 11, 2026 21:47

chatgpt-codex-connector Bot reviewed Jun 11, 2026

View reviewed changes

jan-kubica added 4 commits June 12, 2026 09:33

fix: mirror non-western name data

3ffb9cb

chore: remove non-western detector shim

b1add62

fix: gate cjk name detection by surname shape

6c3b499

fix: suppress title-only deny-list person spans

7218dd3

jan-kubica added 2 commits June 12, 2026 09:39

fix: scope non-western name corpus by language

7f1fba1

fix: run supplemental name corpus in deny-list mode

e33f5ce

jan-kubica merged commit e2a382e into stella:main Jun 12, 2026

chatgpt-codex-connector Bot reviewed Jun 12, 2026

View reviewed changes

-  const hanCount = (fullText.match(/\p{Script=Han}/gu) || []).length;
+  let hanCount = 0;
+  for (const _ of fullText.matchAll(/\p{Script=Han}/gu)) {
+    hanCount++;
+  }

		else if (hasJaSuffix && (capitalizedCount > 0 \|\| hasNonWesternToken))
		score = 0.9;


		// ── Tests ──────────────────────────────────────────────────────────

		describe("Non-Western Name Detection", () => {

		const CJK_NAME_RE =
		/(?<!\p{Script=Han})\p{Script=Han}{2,4}(?!\p{Script=Han})/gu;

		if (dictionaries?.nonWesternNames) {
		for (const [, names] of Object.entries(dictionaries.nonWesternNames)) {

		if (corpus.nonWesternNames.has(token)) return true;
		return corpus.nonWesternNames.has(titleCaseWithApostrophe(token));

		for (const form of Object.values(NONWESTERN_HONORIFICS).flat()) {
		titles.push(form.replace(/[.-]+$/u, "").toLowerCase());

Conversation

1337Xcode commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Proposed change

Files affected

Checklist

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

1337Xcode commented Jun 9, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

jan-kubica left a comment

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 11, 2026

1337Xcode commented Jun 9, 2026 •

edited

Loading