feat: add corpus evaluation tooling package#187
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a new package, @stll/anonymize-corpus, which provides tooling to evaluate the PII detection pipeline against real-world documents. It includes scripts for fetching EDGAR documents, running the detection pipeline, diffing runs against triaged verdicts, and managing a document manifest. Feedback on the changes includes wrapping the EDGAR client's fetch call in a try/catch block to properly retry on network or timeout errors, recording skipped documents in the manifest to avoid redundant fetches, handling potential Git command failures gracefully in run.ts, and checking for file existence in diff.ts before reading text to prevent crashes on fresh clones.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| const response = await fetch(url, { | ||
| headers: { "User-Agent": userAgent }, | ||
| signal: AbortSignal.timeout(REQUEST_TIMEOUT_MS), | ||
| }); | ||
| if (response.ok) { | ||
| return response; | ||
| } | ||
| const retryable = RETRYABLE_STATUS.has(response.status); | ||
| if (!retryable || attempt === MAX_ATTEMPTS) { | ||
| throw new Error(`GET ${url} failed: HTTP ${response.status}`); | ||
| } | ||
| await sleep(RETRY_BASE_DELAY_MS * attempt); |
There was a problem hiding this comment.
The fetch call is not wrapped in a try/catch block inside the retry loop. If fetch throws a network error or a timeout error (which is likely given the AbortSignal.timeout), the error will propagate immediately and bypass the retry logic entirely. Wrapping the request in a try/catch block ensures that transient network issues and timeouts are retried up to MAX_ATTEMPTS.
try {
const response = await fetch(url, {
headers: { "User-Agent": userAgent },
signal: AbortSignal.timeout(REQUEST_TIMEOUT_MS),
});
if (response.ok) {
return response;
}
const retryable = RETRYABLE_STATUS.has(response.status);
if (!retryable || attempt === MAX_ATTEMPTS) {
throw new Error(`GET ${url} failed: HTTP ${response.status}`);
}
} catch (error) {
if (attempt === MAX_ATTEMPTS) {
throw error;
}
}
await sleep(RETRY_BASE_DELAY_MS * attempt);| const path = textBySha.get(doc.sha256); | ||
| const text = path ? await Bun.file(path).text() : null; |
There was a problem hiding this comment.
Bun.file(path).text() is called directly without checking if the file exists. Since corpus/raw/ is gitignored, a fresh clone of the repository won't have these files, causing diff.ts to throw an unhandled exception and crash when trying to validate verdicts. Checking if the file exists first prevents this crash.
const path = textBySha.get(doc.sha256);
const file = path ? Bun.file(path) : null;
const text = file && (await file.exists()) ? await file.text() : null;There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a25e296bd5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| text.replaceAll(/&(#x?[\da-f]+|[a-z]+);/gi, (match, body: string): string => { | ||
| if (body.startsWith("#x") || body.startsWith("#X")) { | ||
| return String.fromCodePoint(Number.parseInt(body.slice(2), 16)); | ||
| } | ||
| if (body.startsWith("#")) { | ||
| return String.fromCodePoint(Number.parseInt(body.slice(1), 10)); |
There was a problem hiding this comment.
Guard malformed numeric HTML entities
When an EDGAR HTML document contains a malformed decimal numeric entity such as &#abc; (or a numeric code point outside the Unicode range), this branch parses NaN/an invalid value and passes it to String.fromCodePoint, which throws and aborts fetch.ts for the whole query. Since unknown named entities are intentionally left untouched, malformed numeric entities should also fall back to match unless the parsed code point is valid.
Useful? React with 👍 / 👎.
| "name": "@stll/anonymize-corpus", | ||
| "version": "0.0.0", | ||
| "dependencies": { | ||
| "@stll/anonymize": "^1.4.8", |
There was a problem hiding this comment.
Synchronize the corpus lockfile dependency
The new workspace package declares @stll/anonymize as ^1.4.9 in packages/corpus/package.json, but the committed lockfile records it as ^1.4.8 here. In environments that run bun install --frozen-lockfile (Bun help: “Disallow changes to lockfile”), this out-of-sync workspace entry forces the install to update or fail before the corpus commands can run.
Useful? React with 👍 / 👎.
| const removed = (baseline?.entities ?? []).filter((entity) => { | ||
| const key = spanKey(entity); | ||
| return !currentKeys.has(key) && !judged.has(key); |
There was a problem hiding this comment.
Surface disappeared true-positive verdicts
When a baseline span already has a committed verdict such as tp and a later run stops detecting it, this filter drops the span from removed just because its key is judged. That hides regressions where known PII disappeared from the pipeline output; only judged false positives should be suppressed on the removed side, while judged true positives need to remain visible as FN regressions or otherwise be counted against the run.
Useful? React with 👍 / 👎.
| const fresh = refs.filter( | ||
| (ref) => !knownIds.has(ref.id) && !newEntries.some((e) => e.id === ref.id), |
There was a problem hiding this comment.
Allow refetching manifest documents
In a fresh checkout, corpus/raw/ is intentionally gitignored while manifest.json is committed, so run.ts will fail on every manifest entry with “missing raw text; re-run fetch”. Re-running fetch.ts cannot repair that state because these lines filter out every known manifest id before downloading; the tool needs to rehydrate known entries whose raw file is missing (or fetch directly from manifest URLs) instead of always skipping them.
Useful? React with 👍 / 👎.
| export const isMaterialContract = (fileType: string | undefined): boolean => | ||
| fileType?.startsWith(MATERIAL_CONTRACT_PREFIX) ?? false; |
There was a problem hiding this comment.
Reject EX-101 interactive-data exhibits
When EFTS returns an EX-101 interactive-data exhibit, this prefix check treats it as a material contract because "EX-101".startsWith("EX-10") is true. That lets non-contract XBRL/XML exhibits into the corpus despite the fetcher intending to evaluate EX-10 material contracts, so the predicate should only accept EX-10 itself or EX-10. variants.
Useful? React with 👍 / 👎.
| if ( | ||
| (await Bun.file(join(runDir, RUN_SUMMARY_FILE)).exists()) && | ||
| !values.force | ||
| ) { | ||
| console.error(`run "${runName}" already exists; use --force to overwrite`); |
There was a problem hiding this comment.
Clear stale run artifacts on force
When rerunning an existing output name with --force after the manifest changes or a document hash changes, this check allows the run to proceed but the existing directory is only reused, not cleared. Old <sha256>.json files remain beside the newly written artifacts, and diff.ts later reads every JSON file in that run directory, so stale documents can reappear as candidates even though the forced run was supposed to overwrite the output.
Useful? React with 👍 / 👎.
|
Review feedback addressed in 22b2614 and 5b99565:
Not adopted: recording size-skips inside CC on behalf of @jan-kubica |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5b995650a3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| type PipelineConfig, | ||
| preparePipelineSearch, | ||
| runPipeline, | ||
| } from "@stll/anonymize"; |
There was a problem hiding this comment.
Make corpus runs work before package builds
In a clean workspace, @stll/anonymize resolves through its package exports to dist/index.mjs, but dist/ is gitignored and the corpus README tells users to run bun src/run.ts directly. That means the run command fails at module resolution unless the library has already been built; import the workspace source like the existing performance script does, or make this command explicitly build/depend on the package first.
Useful? React with 👍 / 👎.
|
|
||
| /** Document ids contain `:`/`/`; map them to safe file names. */ | ||
| export const rawFileName = (docId: string): string => | ||
| `${docId.replaceAll(/[^\w.-]/g, "_")}.txt`; |
There was a problem hiding this comment.
Use collision-free raw filenames
When two manifest ids differ only by characters this sanitizer collapses to _ (for example a:b and a_b, or case-law ids using different punctuation), both raw bodies share the same corpus/raw/*.txt path. Fetching or refilling the second document overwrites the first, so later runs either hash the wrong body or fail the manifest check; use a reversible encoding or include a hash in the raw filename.
Useful? React with 👍 / 👎.
6d0635d to
81e921f
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 81e921fe91
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if ( | ||
| (await Bun.file(join(runDir, RUN_SUMMARY_FILE)).exists()) && | ||
| !values.force |
There was a problem hiding this comment.
Reject incomplete existing run directories
If a previous run is interrupted before run.json is written, this guard treats the existing run directory as reusable because it only checks for the summary file. Retrying the same run name without --force can then leave old <sha256>.json artifacts in place, especially after the manifest changes, and diff.ts reads every JSON artifact in the directory; that can surface stale documents as if they belong to the new run. Check for the directory itself or clear it before writing unless an explicit resume mode is intended.
Useful? React with 👍 / 👎.
| const limit = Number(values.limit ?? DEFAULT_LIMIT_PER_QUERY); | ||
| const pages = Number(values.pages ?? DEFAULT_PAGES_PER_QUERY); |
There was a problem hiding this comment.
Validate numeric fetch options
When --limit or --pages is not a finite positive integer, Number(...) silently produces values that change fetch behavior: for example --limit foo makes the later taken >= limit guard always false, so the command downloads every fresh hit in the requested pages instead of stopping at the default limit, while --pages foo silently fetches nothing. Since this CLI hits EDGAR and writes corpus state, reject invalid numeric input before querying.
Useful? React with 👍 / 👎.
81e921f to
f7482aa
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f7482aae81
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const syncedLockText = lockText.replaceAll( | ||
| SYNCED_DEPENDENCY_RANGE_RE, |
There was a problem hiding this comment.
Sync lockfile workspace versions too
When VERSION is bumped and bun run sync:version is run, this new lockfile sync only rewrites @stll/anonymize dependency ranges; it does not update the "version" fields that bun.lock records for packages/anonymize, packages/anonymize/wasm, and packages/cli. The next bun run check:version can still pass because it never checks those lockfile workspace versions, leaving the lockfile inconsistent with the package.json files; in environments using bun install --frozen-lockfile (Bun help: --frozen-lockfile “Disallow changes to lockfile”), install then has to change or reject the lockfile after every release bump.
Useful? React with 👍 / 👎.
# Conflicts: # packages/anonymize/src/detectors/names.ts
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3c13475730
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1238022a1f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| "license": "MIT", | ||
| "scripts": { | ||
| "typecheck": "tsc --noEmit -p tsconfig.json && tsc --noEmit -p tsconfig.test.json", | ||
| "test": "bun test", |
There was a problem hiding this comment.
Make the corpus test script build its runtime dependency
Fresh evidence beyond the earlier direct-run concern: the package's own test script runs bare bun test, but runtime.test.ts imports @stll/anonymize, which resolves through that package's exports to the gitignored dist/index.mjs. In a clean checkout before a build, bun run --cwd packages/corpus test fails at module resolution before assertions; the script should build/depend on the runtime package or the test should import source explicitly.
Useful? React with 👍 / 👎.
`Bond` collides with legal nouns ("Bond Trustee", "Bond Indenture"),
so the global person stopword suppressed it everywhere — including the
name-extension step, which turned "James Bond" into just "James" and
left the surname in the clear.
Move `bond` from person-stopwords to the allow list: the allow list
still suppresses the standalone/leading token, but `extendPersonName`
never consults it, so a preceding first name carries the surname
through (the same path that already handles `John Law` / `Elon Tesla`).
Mirror the data change into packages/data/config.
- html-text: require a full numeric-entity body, so a mixed token like `{abc;` is left intact instead of decoding its `123` prefix. - verdicts: on a value mismatch, report lengths and short content hashes instead of the verbatim span/document text (gitignored personal data for case-law sources). - edgar: advance EFTS pagination by the hits actually returned and stop on an empty page, so a 100-stride over 10-hit pages no longer skips results. - diff: only compare against a baseline artifact whose sha256 matches the current document; a re-extraction under the same id otherwise diffs offsets across unrelated text.
|
Addressed the latest review in 1d43f6f and 4c7387b. Person pipeline
Corpus tooling
Already handled in earlier commits on this branch (noting for thread cleanup): CC on behalf of @jan-kubica |
Adds
packages/corpus(@stll/anonymize-corpus), a private workspace package for evaluating the detection pipeline against real-world public documents without repeating work between rounds.What's inside
src/fetch.ts— EDGAR full-text search (EFTS) fetcher for EX-10 material contracts.corpus/manifest.json(committed) records every document ever pulled (id, query, sha256), so repeated searches skip known documents. Bodies are stored as plain text under gitignoredcorpus/raw/and are always re-fetchable. Polite request spacing, retry on 429/503, request timeout, HTML-to-text extraction for.htmexhibits.src/run.ts— runs the rules pipeline (NER off, same config asscripts/contract-perf.mjs) over the whole corpus with a shared prepared search context; writes per-document entity artifacts tocorpus/runs/<git-sha>/<sha256>.json, verifying content hashes against the manifest first.src/diff.ts— produces the triage queue: FP candidates (spans not in the baseline run, or all spans without a baseline) and FN candidates (baseline spans that disappeared). Spans already covered by a verdict file are excluded, so triage effort is never repeated.src/verdicts.ts— frozen triage outcomes (tp/fp/fnper span), one JSON file per document keyed by content hash. Mechanical validation enforces that every span quotes the document verbatim at its offsets;diff.tsfails on mismatch.Data boundaries
EDGAR verdicts are committed (public SEC filings, consistent with existing contract fixtures). Document bodies, run artifacts, and case-law verdicts are gitignored; case-law text contains personal data and stays out of the repository entirely (only citation identifiers go in the manifest).
Other changes
chore: sync wasm and cli package versions to 1.4.9— pre-existingcheck:versionfailure on main after the 1.4.9 bump; fixed viabun run sync:version.consoleinpackages/corpus/src/**(CLI tooling), matching the existing.github/tools/**override.25 unit tests cover the pure logic (run diffing, verdict validation, manifest merging, EFTS hit parsing, HTML extraction).