Skip to content

feat: add corpus evaluation tooling package#187

Open
jan-kubica wants to merge 14 commits into
mainfrom
feat/corpus-eval-tooling
Open

feat: add corpus evaluation tooling package#187
jan-kubica wants to merge 14 commits into
mainfrom
feat/corpus-eval-tooling

Conversation

@jan-kubica

Copy link
Copy Markdown
Contributor

Adds packages/corpus (@stll/anonymize-corpus), a private workspace package for evaluating the detection pipeline against real-world public documents without repeating work between rounds.

What's inside

  • src/fetch.ts — EDGAR full-text search (EFTS) fetcher for EX-10 material contracts. corpus/manifest.json (committed) records every document ever pulled (id, query, sha256), so repeated searches skip known documents. Bodies are stored as plain text under gitignored corpus/raw/ and are always re-fetchable. Polite request spacing, retry on 429/503, request timeout, HTML-to-text extraction for .htm exhibits.
  • src/run.ts — runs the rules pipeline (NER off, same config as scripts/contract-perf.mjs) over the whole corpus with a shared prepared search context; writes per-document entity artifacts to corpus/runs/<git-sha>/<sha256>.json, verifying content hashes against the manifest first.
  • src/diff.ts — produces the triage queue: FP candidates (spans not in the baseline run, or all spans without a baseline) and FN candidates (baseline spans that disappeared). Spans already covered by a verdict file are excluded, so triage effort is never repeated.
  • src/verdicts.ts — frozen triage outcomes (tp/fp/fn per span), one JSON file per document keyed by content hash. Mechanical validation enforces that every span quotes the document verbatim at its offsets; diff.ts fails on mismatch.

Data boundaries

EDGAR verdicts are committed (public SEC filings, consistent with existing contract fixtures). Document bodies, run artifacts, and case-law verdicts are gitignored; case-law text contains personal data and stays out of the repository entirely (only citation identifiers go in the manifest).

Other changes

  • chore: sync wasm and cli package versions to 1.4.9 — pre-existing check:version failure on main after the 1.4.9 bump; fixed via bun run sync:version.
  • oxlint override allowing console in packages/corpus/src/** (CLI tooling), matching the existing .github/tools/** override.

25 unit tests cover the pure logic (run diffing, verdict validation, manifest merging, EFTS hit parsing, HTML extraction).

@jan-kubica jan-kubica marked this pull request as ready for review June 11, 2026 17:00

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new package, @stll/anonymize-corpus, which provides tooling to evaluate the PII detection pipeline against real-world documents. It includes scripts for fetching EDGAR documents, running the detection pipeline, diffing runs against triaged verdicts, and managing a document manifest. Feedback on the changes includes wrapping the EDGAR client's fetch call in a try/catch block to properly retry on network or timeout errors, recording skipped documents in the manifest to avoid redundant fetches, handling potential Git command failures gracefully in run.ts, and checking for file existence in diff.ts before reading text to prevent crashes on fresh clones.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread packages/corpus/src/edgar.ts Outdated
Comment on lines +107 to +118
const response = await fetch(url, {
headers: { "User-Agent": userAgent },
signal: AbortSignal.timeout(REQUEST_TIMEOUT_MS),
});
if (response.ok) {
return response;
}
const retryable = RETRYABLE_STATUS.has(response.status);
if (!retryable || attempt === MAX_ATTEMPTS) {
throw new Error(`GET ${url} failed: HTTP ${response.status}`);
}
await sleep(RETRY_BASE_DELAY_MS * attempt);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The fetch call is not wrapped in a try/catch block inside the retry loop. If fetch throws a network error or a timeout error (which is likely given the AbortSignal.timeout), the error will propagate immediately and bypass the retry logic entirely. Wrapping the request in a try/catch block ensures that transient network issues and timeouts are retried up to MAX_ATTEMPTS.

      try {
        const response = await fetch(url, {
          headers: { "User-Agent": userAgent },
          signal: AbortSignal.timeout(REQUEST_TIMEOUT_MS),
        });
        if (response.ok) {
          return response;
        }
        const retryable = RETRYABLE_STATUS.has(response.status);
        if (!retryable || attempt === MAX_ATTEMPTS) {
          throw new Error(`GET ${url} failed: HTTP ${response.status}`);
        }
      } catch (error) {
        if (attempt === MAX_ATTEMPTS) {
          throw error;
        }
      }
      await sleep(RETRY_BASE_DELAY_MS * attempt);

Comment thread packages/corpus/src/fetch.ts
Comment thread packages/corpus/src/run.ts
Comment thread packages/corpus/src/diff.ts Outdated
Comment on lines +117 to +118
const path = textBySha.get(doc.sha256);
const text = path ? await Bun.file(path).text() : null;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Bun.file(path).text() is called directly without checking if the file exists. Since corpus/raw/ is gitignored, a fresh clone of the repository won't have these files, causing diff.ts to throw an unhandled exception and crash when trying to validate verdicts. Checking if the file exists first prevents this crash.

      const path = textBySha.get(doc.sha256);
      const file = path ? Bun.file(path) : null;
      const text = file && (await file.exists()) ? await file.text() : null;

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a25e296bd5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/corpus/src/html-text.ts Outdated
Comment on lines +16 to +21
text.replaceAll(/&(#x?[\da-f]+|[a-z]+);/gi, (match, body: string): string => {
if (body.startsWith("#x") || body.startsWith("#X")) {
return String.fromCodePoint(Number.parseInt(body.slice(2), 16));
}
if (body.startsWith("#")) {
return String.fromCodePoint(Number.parseInt(body.slice(1), 10));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Guard malformed numeric HTML entities

When an EDGAR HTML document contains a malformed decimal numeric entity such as &#abc; (or a numeric code point outside the Unicode range), this branch parses NaN/an invalid value and passes it to String.fromCodePoint, which throws and aborts fetch.ts for the whole query. Since unknown named entities are intentionally left untouched, malformed numeric entities should also fall back to match unless the parsed code point is valid.

Useful? React with 👍 / 👎.

Comment thread bun.lock Outdated
"name": "@stll/anonymize-corpus",
"version": "0.0.0",
"dependencies": {
"@stll/anonymize": "^1.4.8",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Synchronize the corpus lockfile dependency

The new workspace package declares @stll/anonymize as ^1.4.9 in packages/corpus/package.json, but the committed lockfile records it as ^1.4.8 here. In environments that run bun install --frozen-lockfile (Bun help: “Disallow changes to lockfile”), this out-of-sync workspace entry forces the install to update or fail before the corpus commands can run.

Useful? React with 👍 / 👎.

Comment thread packages/corpus/src/diff.ts Outdated
Comment on lines +48 to +50
const removed = (baseline?.entities ?? []).filter((entity) => {
const key = spanKey(entity);
return !currentKeys.has(key) && !judged.has(key);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Surface disappeared true-positive verdicts

When a baseline span already has a committed verdict such as tp and a later run stops detecting it, this filter drops the span from removed just because its key is judged. That hides regressions where known PII disappeared from the pipeline output; only judged false positives should be suppressed on the removed side, while judged true positives need to remain visible as FN regressions or otherwise be counted against the run.

Useful? React with 👍 / 👎.

Comment thread packages/corpus/src/fetch.ts Outdated
Comment on lines +71 to +72
const fresh = refs.filter(
(ref) => !knownIds.has(ref.id) && !newEntries.some((e) => e.id === ref.id),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Allow refetching manifest documents

In a fresh checkout, corpus/raw/ is intentionally gitignored while manifest.json is committed, so run.ts will fail on every manifest entry with “missing raw text; re-run fetch”. Re-running fetch.ts cannot repair that state because these lines filter out every known manifest id before downloading; the tool needs to rehydrate known entries whose raw file is missing (or fetch directly from manifest URLs) instead of always skipping them.

Useful? React with 👍 / 👎.

Comment thread packages/corpus/src/edgar.ts Outdated
Comment on lines +38 to +39
export const isMaterialContract = (fileType: string | undefined): boolean =>
fileType?.startsWith(MATERIAL_CONTRACT_PREFIX) ?? false;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject EX-101 interactive-data exhibits

When EFTS returns an EX-101 interactive-data exhibit, this prefix check treats it as a material contract because "EX-101".startsWith("EX-10") is true. That lets non-contract XBRL/XML exhibits into the corpus despite the fetcher intending to evaluate EX-10 material contracts, so the predicate should only accept EX-10 itself or EX-10. variants.

Useful? React with 👍 / 👎.

Comment thread packages/corpus/src/run.ts Outdated
Comment on lines +53 to +57
if (
(await Bun.file(join(runDir, RUN_SUMMARY_FILE)).exists()) &&
!values.force
) {
console.error(`run "${runName}" already exists; use --force to overwrite`);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clear stale run artifacts on force

When rerunning an existing output name with --force after the manifest changes or a document hash changes, this check allows the run to proceed but the existing directory is only reused, not cleared. Old <sha256>.json files remain beside the newly written artifacts, and diff.ts later reads every JSON file in that run directory, so stale documents can reappear as candidates even though the forced run was supposed to overwrite the output.

Useful? React with 👍 / 👎.

@jan-kubica

Copy link
Copy Markdown
Contributor Author

Review feedback addressed in 22b2614 and 5b99565:

  • politeFetch now retries thrown errors (network failures, abort timeouts) with the same backoff as 429/503, rethrowing with the URL and cause after the final attempt.
  • isMaterialContract matches EX-10/EX-10.x exactly; EX-101 interactive-data exhibits are rejected (tests added).
  • Malformed or out-of-range numeric HTML entities (&#abc;, &#x110000;) no longer throw; the original text is kept (tests added).
  • Size-skipped documents are recorded in a committed corpus/skiplist.json so repeated queries never re-fetch them.
  • New fetch.ts --refill re-downloads manifest entries whose raw file is missing (fresh checkouts), verifying the stored sha256 and failing loudly if extraction has drifted.
  • run.ts survives a missing git binary and --force now clears stale artifacts before writing.
  • The diff is verdict-aware with four buckets: added/removed (unjudged candidates), regressions (judged tp no longer detected), fixed (judged fn now detected). Per the P1: a vanished tp always surfaces — and after 5b99565 regressions are derived from the verdict files themselves, so they are reported even on a first run with no baseline.
  • bun.lock now records ^1.4.9 for @stll/anonymize in the corpus and cli workspace blocks (minimal edit; bun install --frozen-lockfile passes). A full regenerate was avoided to keep unrelated dev-dependency churn out of this PR.

Not adopted: recording size-skips inside manifest.json itself; a separate skip list keeps the manifest's invariant (every entry has stored text with a verified hash) intact.

CC on behalf of @jan-kubica

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5b995650a3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/corpus/src/run.ts
type PipelineConfig,
preparePipelineSearch,
runPipeline,
} from "@stll/anonymize";

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Make corpus runs work before package builds

In a clean workspace, @stll/anonymize resolves through its package exports to dist/index.mjs, but dist/ is gitignored and the corpus README tells users to run bun src/run.ts directly. That means the run command fails at module resolution unless the library has already been built; import the workspace source like the existing performance script does, or make this command explicitly build/depend on the package first.

Useful? React with 👍 / 👎.

Comment thread packages/corpus/src/paths.ts Outdated

/** Document ids contain `:`/`/`; map them to safe file names. */
export const rawFileName = (docId: string): string =>
`${docId.replaceAll(/[^\w.-]/g, "_")}.txt`;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use collision-free raw filenames

When two manifest ids differ only by characters this sanitizer collapses to _ (for example a:b and a_b, or case-law ids using different punctuation), both raw bodies share the same corpus/raw/*.txt path. Fetching or refilling the second document overwrites the first, so later runs either hash the wrong body or fail the manifest check; use a reversible encoding or include a hash in the raw filename.

Useful? React with 👍 / 👎.

@jan-kubica jan-kubica force-pushed the feat/corpus-eval-tooling branch from 6d0635d to 81e921f Compare June 11, 2026 19:38

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 81e921fe91

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/corpus/src/run.ts Outdated
Comment on lines +64 to +66
if (
(await Bun.file(join(runDir, RUN_SUMMARY_FILE)).exists()) &&
!values.force

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject incomplete existing run directories

If a previous run is interrupted before run.json is written, this guard treats the existing run directory as reusable because it only checks for the summary file. Retrying the same run name without --force can then leave old <sha256>.json artifacts in place, especially after the manifest changes, and diff.ts reads every JSON artifact in the directory; that can surface stale documents as if they belong to the new run. Check for the directory itself or clear it before writing unless an explicit resume mode is intended.

Useful? React with 👍 / 👎.

Comment thread packages/corpus/src/fetch.ts Outdated
Comment on lines +111 to +112
const limit = Number(values.limit ?? DEFAULT_LIMIT_PER_QUERY);
const pages = Number(values.pages ?? DEFAULT_PAGES_PER_QUERY);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate numeric fetch options

When --limit or --pages is not a finite positive integer, Number(...) silently produces values that change fetch behavior: for example --limit foo makes the later taken >= limit guard always false, so the command downloads every fresh hit in the requested pages instead of stopping at the default limit, while --pages foo silently fetches nothing. Since this CLI hits EDGAR and writes corpus state, reject invalid numeric input before querying.

Useful? React with 👍 / 👎.

Comment thread packages/corpus/src/verdicts.ts Outdated
@jan-kubica jan-kubica force-pushed the feat/corpus-eval-tooling branch from 81e921f to f7482aa Compare June 12, 2026 11:18

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f7482aae81

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread .github/tools/sync-runtime-version.mjs Outdated
Comment on lines +63 to +64
const syncedLockText = lockText.replaceAll(
SYNCED_DEPENDENCY_RANGE_RE,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Sync lockfile workspace versions too

When VERSION is bumped and bun run sync:version is run, this new lockfile sync only rewrites @stll/anonymize dependency ranges; it does not update the "version" fields that bun.lock records for packages/anonymize, packages/anonymize/wasm, and packages/cli. The next bun run check:version can still pass because it never checks those lockfile workspace versions, leaving the lockfile inconsistent with the package.json files; in environments using bun install --frozen-lockfile (Bun help: --frozen-lockfile “Disallow changes to lockfile”), install then has to change or reject the lockfile after every release bump.

Useful? React with 👍 / 👎.

Comment thread packages/corpus/src/run.ts
Comment thread packages/corpus/src/fetch.ts

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3c13475730

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/corpus/src/html-text.ts Outdated
Comment thread packages/corpus/src/diff.ts Outdated
Comment thread packages/corpus/src/edgar.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1238022a1f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/anonymize/src/data/person-stopwords.json Outdated
"license": "MIT",
"scripts": {
"typecheck": "tsc --noEmit -p tsconfig.json && tsc --noEmit -p tsconfig.test.json",
"test": "bun test",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Make the corpus test script build its runtime dependency

Fresh evidence beyond the earlier direct-run concern: the package's own test script runs bare bun test, but runtime.test.ts imports @stll/anonymize, which resolves through that package's exports to the gitignored dist/index.mjs. In a clean checkout before a build, bun run --cwd packages/corpus test fails at module resolution before assertions; the script should build/depend on the runtime package or the test should import source explicitly.

Useful? React with 👍 / 👎.

`Bond` collides with legal nouns ("Bond Trustee", "Bond Indenture"),
so the global person stopword suppressed it everywhere — including the
name-extension step, which turned "James Bond" into just "James" and
left the surname in the clear.

Move `bond` from person-stopwords to the allow list: the allow list
still suppresses the standalone/leading token, but `extendPersonName`
never consults it, so a preceding first name carries the surname
through (the same path that already handles `John Law` / `Elon Tesla`).
Mirror the data change into packages/data/config.
- html-text: require a full numeric-entity body, so a mixed token like
  `&#123abc;` is left intact instead of decoding its `123` prefix.
- verdicts: on a value mismatch, report lengths and short content
  hashes instead of the verbatim span/document text (gitignored
  personal data for case-law sources).
- edgar: advance EFTS pagination by the hits actually returned and stop
  on an empty page, so a 100-stride over 10-hit pages no longer skips
  results.
- diff: only compare against a baseline artifact whose sha256 matches
  the current document; a re-extraction under the same id otherwise
  diffs offsets across unrelated text.
@jan-kubica

Copy link
Copy Markdown
Contributor Author

Addressed the latest review in 1d43f6f and 4c7387b.

Person pipeline

  • Bond is no longer dropped as a surname. Moved bond from person-stopwords.json to allow-list.json. The allow list still suppresses the standalone/leading token (the Bond Trustee / Bond Indenture legal nouns), but extendPersonName deliberately does not consult the allow list, so a preceding first name carries the surname through: James Bond now stays whole, the same way John Law / Elon Tesla already did. Mirrored into packages/data/config; added a positive regression test next to the existing Bond Trustee guard.

Corpus tooling

  • Numeric HTML entities. Decimal and hex bodies are now matched by separate character classes, so a mixed token like &#123abc; no longer decodes its 123 prefix; it is left intact like other malformed entities. Test added.
  • Verdict validation logging. A value mismatch now reports offsets, lengths, and short content hashes instead of the verbatim span/document slice, so a failed case-law verdict cannot leak personal text into stderr. Test asserts neither value appears in the message.
  • EFTS pagination. Pagination advances from by the hits actually returned and stops on an empty page, instead of striding by an assumed page size of 100 over EDGAR's smaller pages (which skipped the results in between).
  • Run / baseline hash mismatch. diff.ts only compares against a baseline artifact whose sha256 matches the current document; a re-extraction under the same id is treated as having no baseline rather than diffing offsets across unrelated text.

Already handled in earlier commits on this branch (noting for thread cleanup): Bun.spawnSync is wrapped in try/catch; the --force delete is guarded by assertValidRunName, which rejects path separators and ..; same-hash run artifacts and raw filenames carry an id hash; non-text EDGAR exhibits are rejected by extension before fetch; size-skips are recorded in skiplist.json; numeric CLI options are validated; --refill rehydrates missing raw bodies; and the README documents bun run build before first use, with corpus tests building their runtime via turbo ^build.

CC on behalf of @jan-kubica

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant