Skip to content

brotli-compress .socket.facts.json on upload#1291

Draft
Benjamin Barslev Nielsen (barslev) wants to merge 2 commits intov1.xfrom
barslev/compress-socket-facts-file
Draft

brotli-compress .socket.facts.json on upload#1291
Benjamin Barslev Nielsen (barslev) wants to merge 2 commits intov1.xfrom
barslev/compress-socket-facts-file

Conversation

@barslev
Copy link
Copy Markdown
Contributor

@barslev Benjamin Barslev Nielsen (barslev) commented Apr 30, 2026

Summary

Brotli-compress .socket.facts.json at upload time. Coana keeps writing plain JSON; only the wire bytes between socket-cli and depscan change. ~85% reduction on the simple-npm fixture; production mono-repo payloads see 60+ MB → ~10 MB.

How

  • New compressSocketFactsForUpload(scanPaths) in src/utils/coana.mts — async + streaming (createReadStream → createBrotliCompress → createWriteStream via pipeline). For any .socket.facts.json in scanPaths, writes a sibling .socket.facts.json.br next to the source file. Returns swapped paths + cleanup callback.
  • handle-create-new-scan.mts wraps fetchCreateOrgFullScan in try { compress; upload; } finally { cleanup; }.

Why sibling-write (not os.tmpdir(), not a temp subdir)

The SDK passes path.relative(cwd, brPath) as the multipart name. depscan's addStreamEntry rejects entries with .. traversal — a tmpdir path resolves to ../../../var/folders/... and gets dropped to unmatchedFiles. Sibling-write keeps the relative path inside cwd, and keeps directory shape symmetric with the plain .socket.facts.json upload (depscan strips the .br suffix at ingest, so <dir>/.socket.facts.json.br and <dir>/.socket.facts.json resolve to the same storage path).

Concurrent scans against the same source dir already collide on .socket.facts.json itself (coana writes to a single path), so the sibling .br doesn't introduce a new race.

Tests

  • 16 unit cases in src/utils/coana.test.mts (sibling-path round-trip, passthrough, idempotent cleanup that leaves the source intact, plus the existing extractTier1ReachabilityScanId / extractReachabilityErrors matrix).
  • e2e socket scan create --reach against simple-npm: scan_type=socket_tier1, unmatchedFiles: [], root manifest entries are .socket.facts.json and package.json (storage path matches the plain upload — dir-preservation symmetry holds).

Compress `.socket.facts.json` with brotli at upload time, just before
`fetchCreateOrgFullScan` POSTs the multipart `/full-scans` body. Coana
keeps writing plain JSON; the local readers
(`extractTier1ReachabilityScanId`, `extractReachabilityErrors`) keep
reading plain JSON; only the on-the-wire bytes between socket-cli and
depscan change. depscan transparently decompresses at the api-v0
multipart ingest boundary in a coordinated change.

Why:
* Mono-repo `.socket.facts.json` files routinely exceed 60MB. On a
  representative simple-npm fixture, brotli compresses 150,748 bytes
  to 19,971 (~86.8% reduction); production mono-repos see a much
  larger absolute saving.
* Coana has a second consumer downstream - teaching coana to write
  `.br` would break it. Brotli on the wire-only path keeps coana's
  contract invariant.
* `tier1ReachabilityScanId` and reachability-error reporting still
  read plain JSON locally; no brotli round-trip on those paths.
* Compression is a transport detail owned by the upload site; cleanup
  is one `finally` block.

Implementation:
* `src/utils/coana.mts` - new exported
  `compressSocketFactsForUpload(scanPaths)` returning `{ paths, cleanup }`.
  Per `.socket.facts.json` entry, `mkdtempSync` a fresh `.socket-br-XXX/`
  directory inside the source's parent dir (NOT under `os.tmpdir()` -
  see below), `brotliCompressSync` bytes to
  `${td}/.socket.facts.json.br`, swap the path. Other paths pass through.
  Missing facts paths pass through. Cleanup is idempotent with
  `force: true`.
* `src/commands/scan/handle-create-new-scan.mts` - wraps the
  `fetchCreateOrgFullScan` call in
  `try { compress; upload; } finally { cleanup(); }`. Cleanup runs on
  success, throw, or any abort path.

Why the temp lives next to the source:

The SDK computes `path.relative(cwd, brPath)` for the multipart name
field. depscan's multipart ingest (`addStreamEntry`) checks
`containsTraversal(unixified)` and bails on any `..` segment. A
tmpdir-based path resolves to `../../../var/folders/...`, gets dropped
to `unmatchedFiles`, and the SocketFacts content never lands in the
scan. Putting the temp inside `path.dirname(originalFactsPath)`
produces a relative path like `.socket-br-XXX/.socket.facts.json.br` -
traversal-free, ingests cleanly.

Tests:
* `src/utils/coana.test.mts` - 16 cases.
  - `compressSocketFactsForUpload` x 5: round-trip JSON via
    `brotliDecompressSync`, basename swap to `.socket.facts.json.br`,
    non-facts paths pass through, missing facts paths pass through,
    cleanup idempotent, mixed-entry order. Pins the contract that the
    temp lives in a subdir of the source's parent (traversal-free).
  - `extractTier1ReachabilityScanId` x 7: plain JSON, missing file,
    missing field, null, empty/blank, trim, numeric coercion.
  - `extractReachabilityErrors` x 4: extraction, missing file,
    no-components, no-inner-reachability.

This change requires the matching depscan multipart decompression
patch on the receiving side; that change ships first.
@jdalton
Copy link
Copy Markdown
Collaborator

This is rad!

@barslev Benjamin Barslev Nielsen (barslev) changed the title refactor: brotli-compress .socket.facts.json on upload brotli-compress .socket.facts.json on upload May 1, 2026
Previous form wrapped each `.br` in a per-scan `mkdtempSync` subdir
under the source dir for concurrency isolation. That created a
directory-handling asymmetry on depscan's side: a wire path of
`dirA/.socket.facts.json.br` flattened to `.socket.facts.json` at
root via depscan's basename-strip, while plain `dirA/.socket.facts.json`
preserved the dir.

depscan dropped the basename-strip in the corresponding PR; switch
to writing `.br` as a sibling of the source so wire and storage
paths match for both branches. Net effect: a brotli upload from
`<source>/.socket.facts.json` lands at `<source>/.socket.facts.json`
in the manifest tarball - identical to plain.

Concurrent-scan note: coana already writes to a single source path,
so the source `.socket.facts.json` itself is racy under concurrent
runs against the same dir; the sibling `.br` doesn't introduce a
new race that wasn't already there.

* `src/utils/coana.mts`: `compressSocketFactsForUpload` writes
  `${p}.br` next to the source; `cleanup()` does
  `rm(brPath, { force: true })` per file. Drops `mkdtemp` import
  that's no longer used.
* `src/utils/coana.test.mts`: directory-shape assertion replaced
  with `swappedPath === ${input}.br` (sibling). First test now
  also asserts the source survives `cleanup()`. 16 cases.
John-David Dalton (jdalton) added a commit that referenced this pull request May 6, 2026
…) (#1305)

Port of barslev's PR #1291 from v1.x to main. depscan's api-v0
multipart boundary streams brotli decode based on the .br filename
suffix, so the facts file uploads as ~10x smaller without changing
the on-disk format coana writes.

Adds packages/cli/src/utils/coana/compress-facts.mts with
compressSocketFactsForUpload(scanPaths) — streams brotli a sibling
.socket.facts.json.br next to each source file and returns swapped
paths plus a cleanup() callback. Sibling-write keeps the multipart
entry name inside cwd (depscan drops .. traversal entries).
handleCreateNewScan calls it just before fetchCreateOrgFullScan and
runs cleanup() in finally so the .br files are removed even on
upload failure.

Translation from v1.x to main:
- @socketsecurity/registry/lib/fs -> @socketsecurity/lib/fs
- fs.rm -> safeDelete (fleet-wide rule)
- constants default-import -> named import
  (DOT_SOCKET_DOT_FACTS_JSON already lives in
  packages/cli/src/constants.mts)
- v1.x added scan-id and reachability-error tests in the same file as
  the helper; main keeps utils/coana/extract-scan-id.mts separate and
  has no extractReachabilityErrors yet, so only the new helper's
  tests are added at test/unit/utils/coana/compress-facts.test.mts

Skipping pre-commit test suite via DISABLE_PRECOMMIT_TEST=1 because
test/unit/commands/analytics/output-analytics.test.mts has 3
date-dependent snapshot failures unrelated to this PR (snapshots
encode the literal date Apr 18/19/20/21 and fail on any other day).
Targeted vitest run on the touched files (32 tests in
utils/coana/ + handle-create-new-scan) passes 32/32.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants