Skip to content

fix: bound phone trigger values to phone-shaped runs#188

Merged
jan-kubica merged 4 commits into
mainfrom
fix/phone-trigger-span
Jun 11, 2026
Merged

fix: bound phone trigger values to phone-shaped runs#188
jan-kubica merged 4 commits into
mainfrom
fix/phone-trigger-span

Conversation

@jan-kubica

Copy link
Copy Markdown
Contributor

The to-end-of-line trigger strategy bounds phone values at line delimiters and later inline field labels, but nothing stops the capture when the rest of a single-line paragraph follows the number. On HTML-flattened text (where a whole paragraph is one logical line), a mid-sentence trigger like ... phone +420 777 123 456. Signed on 1 January 2024 by Jane Doe ... produced a phone entity spanning from the number to the end of the paragraph, swallowing a date and two person names into one mislabeled span.

Phone values are digits plus separators and end on a digit, so the strategy now also cuts the capture at the end of the phone-shaped run (PHONE_SHAPE_PREFIX_RE), keeping the existing inline-label stop and length cap as-is.

The software-license-agreement snapshot changes accordingly: the phone entity drops a trailing sentence period ("212) 555-0142.""212) 555-0142"), and the period now survives in the redacted text.

Includes the same chore: sync wasm and cli package versions to 1.4.9 commit as #187 (pre-existing check:version failure on main); whichever PR lands second will drop the duplicate on rebase.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a mechanism to prevent phone number triggers from swallowing subsequent text in single-line paragraphs by truncating the captured value at the end of a phone-shaped run. It includes corresponding unit tests and snapshot updates. The reviewer identified a potential edge case where the new PHONE_SHAPE_PREFIX_RE regex could incorrectly swallow the leading digit of a subsequent sentence if it starts with a number, and provided a refined regex suggestion to resolve this issue.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

const LINE_TRIGGER_LOOKAHEAD = 2_048;
const MATCH_PATTERN_LOOKAHEAD = 512;
const PHONE_VALUE_START_RE = /^[+(\d]/;
const PHONE_SHAPE_PREFIX_RE = /^[+(\d][\d\s()./-]*/;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If the next sentence starts with a digit (for example, a date like 1 January or a section number like 1. Section), the current regex PHONE_SHAPE_PREFIX_RE will match the dot, the space, and the leading digit of the next sentence, swallowing it into the phone number (e.g., +420 777 123 456. 1).

We can prevent this by ensuring that a dot . is only matched if it is not followed by whitespace (using a negative lookahead \.(?!\s)). This keeps the phone shape matching robust against mid-sentence triggers followed by digit-starting sentences.

Suggested change
const PHONE_SHAPE_PREFIX_RE = /^[+(\d][\d\s()./-]*/;
const PHONE_SHAPE_PREFIX_RE = /^[+(\d](?:[\d()\/-]|\s|\.(?!\s))*/;

@jan-kubica jan-kubica marked this pull request as ready for review June 11, 2026 18:19

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7c02bbcd81

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/cli/package.json
},
"dependencies": {
"@stll/anonymize": "^1.4.8",
"@stll/anonymize": "^1.4.9",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Regenerate the Bun lockfile for this dependency bump

This manifest now requires @stll/anonymize ^1.4.9, but bun.lock still mirrors the CLI workspace at version 1.4.8 with @stll/anonymize ^1.4.8 (and the wasm workspace is also still 1.4.8). The release workflow installs with bun install --frozen-lockfile in .github/workflows/release.yml, and bun install --help documents that this flag disallows lockfile changes, so the next release/CI install can fail or use stale workspace metadata unless the updated lockfile is committed with these package changes.

Useful? React with 👍 / 👎.

const shapeEnd = shape[0].replace(/\D+$/, "").length;
if (shapeEnd > 0 && shapeEnd < end) {
end = shapeEnd;
foundLineStop = true;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve the cap for long phone-shaped prefixes

When a label is followed by a very long digit/separator run and then prose before the newline, this new stop marks foundLineStop as true, so the existing MAX_TRIGGER_VALUE_LEN fallback below is skipped. For inputs like a flattened PHONE: 1 1 1 ... Signed by ..., the returned phone entity can again grow to hundreds or thousands of characters, undoing the pathological-input bound that this strategy relies on unless the shape stop is still capped or distinguished from a real line delimiter.

Useful? React with 👍 / 👎.

// ("... phone +420 777 123 456. Signed on ...")
// cannot swallow the rest of a single-line
// paragraph through the line delimiter.
const shape = PHONE_SHAPE_PREFIX_RE.exec(valueText.slice(0, end));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Include phone extensions in the bounded phone value

For labelled phone lines that include an extension before the newline, such as PHONE: +1 555 123 4567 ext. 89, this new shape bound stops at the first letter after the main number and drops the extension from the phone entity. The previous to-end-of-line extraction kept the whole inline value under the cap, and the regex detectors shown in regex.ts only cover the main number shape, so the extension digits can now remain unredacted unless the phone-shaped prefix also permits common extension markers.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 02af67ca80

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +941 to +942
if (foundLineStop && end > MAX_TRIGGER_VALUE_LEN) {
end = capAtWordBoundary(valueText, MAX_TRIGGER_VALUE_LEN);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve full newline-terminated phone lines

When the phone value reaches a real newline after more than 100 characters, foundLineStop is already true from the delimiter scan, so this new cap truncates an otherwise well-delimited line. For example, a notice block like Phone: 555-1000 / 555-1001 / ...\n can contain multiple short labelled phone numbers that are only covered by the trigger, and anything after the 100-character cap may be left unredacted; the existing fallback below was intentionally limited to cases with no line delimiter.

Useful? React with 👍 / 👎.

const LINE_TRIGGER_LOOKAHEAD = 2_048;
const MATCH_PATTERN_LOOKAHEAD = 512;
const PHONE_VALUE_START_RE = /^[+(\d]/;
const PHONE_SHAPE_PREFIX_RE = /^[+(\d][\d\s()./-]*/;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Accept Unicode dash phone separators

For labelled phone numbers that use non-ASCII dash separators, such as PHONE: +1 555‑123‑4567, this new shape regex stops at the first Unicode hyphen and leaves only +1 555, which is then rejected by the phone plausibility digit check. The existing regex phone detectors also only admit ASCII -, so these labelled numbers can now be left unredacted even though the previous end-of-line trigger captured them.

Useful? React with 👍 / 👎.

@jan-kubica

Copy link
Copy Markdown
Contributor Author

Review feedback addressed in 02af67c:

  • A dot followed by whitespace now ends the phone-shaped run, so a following numbered clause (… 123 456. 1. Definitions) is no longer pulled into the value; dot separators without whitespace (555.123.4567) are unaffected.
  • Extensions are kept: an optional ext/ext./extension/x + digits suffix is re-admitted after the numeric run (+1 555 123 4567 ext. 89, 555-1234 x42).
  • The shape-derived stop is subject to the existing 100-char MAX_TRIGGER_VALUE_LEN cap with the same word-boundary retreat, so a pathological digit/space run cannot produce an over-long entity.
  • bun.lock updated for the 1.4.9 workspace versions (bun install --frozen-lockfile passes; no unrelated dependency churn).

Three regression tests added alongside the original one; full suite green.

CC on behalf of @jan-kubica

@jan-kubica jan-kubica merged commit 77c9fab into main Jun 11, 2026
2 checks passed
@jan-kubica jan-kubica deleted the fix/phone-trigger-span branch June 11, 2026 19:37

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 932f8746a6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

* ("+1 555‑123‑4567") does not break the run and leave the
* number unredacted.
*/
const PHONE_SHAPE_PREFIX_RE = new RegExp(`^[+(\\d][\\d\\s()./${DASH_INNER}]*`);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Include list separators in phone-shaped runs

For labelled lines that contain multiple local numbers separated by a comma or semicolon, e.g. Phone: 555-1234, 555-5678\n, this prefix now stops at the comma and returns only 555-1234. The generic regex phone detectors do not recover an unlabelled 7-digit follow-up like 555-5678 (they require shapes such as +..., Czech 9-digit numbers, tel prefixes, or US parentheses), so the second labelled phone number can be left unredacted even though the previous end-of-line trigger captured the whole line.

Useful? React with 👍 / 👎.

Comment on lines +921 to +925
const sentenceBreak = /\.\s/.exec(region);
const shapeRegion =
sentenceBreak !== null
? region.slice(0, sentenceBreak.index)
: region;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve dot-space formatted phone values

When a labelled phone number uses dot-plus-space separators, such as Phone: 01. 23. 45. 67. 89, this treats the first . as a sentence break and only tests 01 as the phone-shaped prefix, which then fails the digit-count plausibility check. The configured phone regexes accept either a dot or whitespace separator but not both together, so these labelled numbers can now be left unredacted even though the previous end-of-line capture kept the full value.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant