fix: bound phone trigger values to phone-shaped runs#188
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a mechanism to prevent phone number triggers from swallowing subsequent text in single-line paragraphs by truncating the captured value at the end of a phone-shaped run. It includes corresponding unit tests and snapshot updates. The reviewer identified a potential edge case where the new PHONE_SHAPE_PREFIX_RE regex could incorrectly swallow the leading digit of a subsequent sentence if it starts with a number, and provided a refined regex suggestion to resolve this issue.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| const LINE_TRIGGER_LOOKAHEAD = 2_048; | ||
| const MATCH_PATTERN_LOOKAHEAD = 512; | ||
| const PHONE_VALUE_START_RE = /^[+(\d]/; | ||
| const PHONE_SHAPE_PREFIX_RE = /^[+(\d][\d\s()./-]*/; |
There was a problem hiding this comment.
If the next sentence starts with a digit (for example, a date like 1 January or a section number like 1. Section), the current regex PHONE_SHAPE_PREFIX_RE will match the dot, the space, and the leading digit of the next sentence, swallowing it into the phone number (e.g., +420 777 123 456. 1).
We can prevent this by ensuring that a dot . is only matched if it is not followed by whitespace (using a negative lookahead \.(?!\s)). This keeps the phone shape matching robust against mid-sentence triggers followed by digit-starting sentences.
| const PHONE_SHAPE_PREFIX_RE = /^[+(\d][\d\s()./-]*/; | |
| const PHONE_SHAPE_PREFIX_RE = /^[+(\d](?:[\d()\/-]|\s|\.(?!\s))*/; |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7c02bbcd81
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| }, | ||
| "dependencies": { | ||
| "@stll/anonymize": "^1.4.8", | ||
| "@stll/anonymize": "^1.4.9", |
There was a problem hiding this comment.
Regenerate the Bun lockfile for this dependency bump
This manifest now requires @stll/anonymize ^1.4.9, but bun.lock still mirrors the CLI workspace at version 1.4.8 with @stll/anonymize ^1.4.8 (and the wasm workspace is also still 1.4.8). The release workflow installs with bun install --frozen-lockfile in .github/workflows/release.yml, and bun install --help documents that this flag disallows lockfile changes, so the next release/CI install can fail or use stale workspace metadata unless the updated lockfile is committed with these package changes.
Useful? React with 👍 / 👎.
| const shapeEnd = shape[0].replace(/\D+$/, "").length; | ||
| if (shapeEnd > 0 && shapeEnd < end) { | ||
| end = shapeEnd; | ||
| foundLineStop = true; |
There was a problem hiding this comment.
Preserve the cap for long phone-shaped prefixes
When a label is followed by a very long digit/separator run and then prose before the newline, this new stop marks foundLineStop as true, so the existing MAX_TRIGGER_VALUE_LEN fallback below is skipped. For inputs like a flattened PHONE: 1 1 1 ... Signed by ..., the returned phone entity can again grow to hundreds or thousands of characters, undoing the pathological-input bound that this strategy relies on unless the shape stop is still capped or distinguished from a real line delimiter.
Useful? React with 👍 / 👎.
| // ("... phone +420 777 123 456. Signed on ...") | ||
| // cannot swallow the rest of a single-line | ||
| // paragraph through the line delimiter. | ||
| const shape = PHONE_SHAPE_PREFIX_RE.exec(valueText.slice(0, end)); |
There was a problem hiding this comment.
Include phone extensions in the bounded phone value
For labelled phone lines that include an extension before the newline, such as PHONE: +1 555 123 4567 ext. 89, this new shape bound stops at the first letter after the main number and drops the extension from the phone entity. The previous to-end-of-line extraction kept the whole inline value under the cap, and the regex detectors shown in regex.ts only cover the main number shape, so the extension digits can now remain unredacted unless the phone-shaped prefix also permits common extension markers.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 02af67ca80
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (foundLineStop && end > MAX_TRIGGER_VALUE_LEN) { | ||
| end = capAtWordBoundary(valueText, MAX_TRIGGER_VALUE_LEN); |
There was a problem hiding this comment.
Preserve full newline-terminated phone lines
When the phone value reaches a real newline after more than 100 characters, foundLineStop is already true from the delimiter scan, so this new cap truncates an otherwise well-delimited line. For example, a notice block like Phone: 555-1000 / 555-1001 / ...\n can contain multiple short labelled phone numbers that are only covered by the trigger, and anything after the 100-character cap may be left unredacted; the existing fallback below was intentionally limited to cases with no line delimiter.
Useful? React with 👍 / 👎.
| const LINE_TRIGGER_LOOKAHEAD = 2_048; | ||
| const MATCH_PATTERN_LOOKAHEAD = 512; | ||
| const PHONE_VALUE_START_RE = /^[+(\d]/; | ||
| const PHONE_SHAPE_PREFIX_RE = /^[+(\d][\d\s()./-]*/; |
There was a problem hiding this comment.
Accept Unicode dash phone separators
For labelled phone numbers that use non-ASCII dash separators, such as PHONE: +1 555‑123‑4567, this new shape regex stops at the first Unicode hyphen and leaves only +1 555, which is then rejected by the phone plausibility digit check. The existing regex phone detectors also only admit ASCII -, so these labelled numbers can now be left unredacted even though the previous end-of-line trigger captured them.
Useful? React with 👍 / 👎.
|
Review feedback addressed in 02af67c:
Three regression tests added alongside the original one; full suite green. CC on behalf of @jan-kubica |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 932f8746a6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| * ("+1 555‑123‑4567") does not break the run and leave the | ||
| * number unredacted. | ||
| */ | ||
| const PHONE_SHAPE_PREFIX_RE = new RegExp(`^[+(\\d][\\d\\s()./${DASH_INNER}]*`); |
There was a problem hiding this comment.
Include list separators in phone-shaped runs
For labelled lines that contain multiple local numbers separated by a comma or semicolon, e.g. Phone: 555-1234, 555-5678\n, this prefix now stops at the comma and returns only 555-1234. The generic regex phone detectors do not recover an unlabelled 7-digit follow-up like 555-5678 (they require shapes such as +..., Czech 9-digit numbers, tel prefixes, or US parentheses), so the second labelled phone number can be left unredacted even though the previous end-of-line trigger captured the whole line.
Useful? React with 👍 / 👎.
| const sentenceBreak = /\.\s/.exec(region); | ||
| const shapeRegion = | ||
| sentenceBreak !== null | ||
| ? region.slice(0, sentenceBreak.index) | ||
| : region; |
There was a problem hiding this comment.
Preserve dot-space formatted phone values
When a labelled phone number uses dot-plus-space separators, such as Phone: 01. 23. 45. 67. 89, this treats the first . as a sentence break and only tests 01 as the phone-shaped prefix, which then fails the digit-count plausibility check. The configured phone regexes accept either a dot or whitespace separator but not both together, so these labelled numbers can now be left unredacted even though the previous end-of-line capture kept the full value.
Useful? React with 👍 / 👎.
The
to-end-of-linetrigger strategy bounds phone values at line delimiters and later inline field labels, but nothing stops the capture when the rest of a single-line paragraph follows the number. On HTML-flattened text (where a whole paragraph is one logical line), a mid-sentence trigger like... phone +420 777 123 456. Signed on 1 January 2024 by Jane Doe ...produced a phone entity spanning from the number to the end of the paragraph, swallowing a date and two person names into one mislabeled span.Phone values are digits plus separators and end on a digit, so the strategy now also cuts the capture at the end of the phone-shaped run (
PHONE_SHAPE_PREFIX_RE), keeping the existing inline-label stop and length cap as-is.The
software-license-agreementsnapshot changes accordingly: the phone entity drops a trailing sentence period ("212) 555-0142."→"212) 555-0142"), and the period now survives in the redacted text.Includes the same
chore: sync wasm and cli package versions to 1.4.9commit as #187 (pre-existingcheck:versionfailure on main); whichever PR lands second will drop the duplicate on rebase.