Skip to content

fix: atomic JSON writes for pipeline outputs#43

Closed
joshbouncesecurity wants to merge 3 commits into
knostic:masterfrom
joshbouncesecurity:fix/issue16-20-atomic-json
Closed

fix: atomic JSON writes for pipeline outputs#43
joshbouncesecurity wants to merge 3 commits into
knostic:masterfrom
joshbouncesecurity:fix/issue16-20-atomic-json

Conversation

@joshbouncesecurity
Copy link
Copy Markdown
Contributor

@joshbouncesecurity joshbouncesecurity commented May 4, 2026

Summary

Introduces a new atomic_write_json() helper and applies it to all final
pipeline output files. The helper writes JSON to a temp file in the same
directory
as the target, fsyncs, then os.replaces it onto the target path.
If a crash or power loss occurs mid-write, the previous output file is
preserved intact rather than being left truncated.

Upstream uses plain write_json() (a json.dump wrapper), which can corrupt
multi-hour scan results on interrupt.

Same-directory temp is load-bearing on Windows: os.replace is only atomic
when source and target sit on the same volume; cross-volume falls back to
copy+delete and loses atomicity.

Call sites covered (one write each, at the very end of a pipeline stage):

  • core/analyzer.pyresults.json
  • core/enhancer.pyenhanced_dataset.json
  • core/verifier.pyresults_verified.json
  • core/reporter.pypipeline_output.json (final report)
  • core/parser_adapter.pydataset.json + analyzer_output.json
  • experiment.py — experiment results (legacy direct-run path)

Not changed: checkpoint files (designed for incremental writes), scanner
intermediate state, and the diff-filter sidecar report (cheap to regenerate).

Performance: one fsync + one rename per output file, all of which occur
once at the end of a stage that runs for minutes. Negligible overhead.
fsync is silently skipped on filesystems where it fails (network mounts etc.)
so atomicity degrades gracefully to rename-only in those environments.

Addresses item 20 from #16 (does not close the issue).

Test plan

  • Helper round-trip preserves data (incl. ensure_ascii=False Unicode).
  • Mid-write failure leaves the existing file intact.
  • Mid-write failure does not create a target if none existed.
  • No stray .tmp- files left on failure.
  • Temp file is in the same directory as the target.
  • POSIX permissions match umask (not mkstemp's 0600 default).
  • Existing tests pass (113 collected, 91 passed, 22 skipped — go/js parser binaries unavailable in this env).

@joshbouncesecurity
Copy link
Copy Markdown
Contributor Author

Manual verification

  • Run a small openant scan to completion: results.json, enhanced_dataset.json, results_verified.json all exist and parse as valid JSON.
  • Atomicity smoke test: start a scan, kill the process (Ctrl+C) during the analyze stage. Re-inspect results.json — should be either the previous good version OR non-existent. Never truncated/corrupt.
  • Permissions (POSIX): confirm output file mode matches umask (typically 0644), not 0600.
  • Same-dir temp: while a write is happening, a hidden *.tmp file appears in the SAME directory as the target (use watch ls -la <scan-dir> or lsof). Cross-device renames aren't atomic.
  • Unicode: a finding containing non-ASCII text (e.g., a Japanese string) round-trips through results_verified.json without escape sequences (ensure_ascii=False preserved).

@joshbouncesecurity
Copy link
Copy Markdown
Contributor Author

Local test results

Exercised utilities.atomic_io.atomic_write_json directly on Windows from this branch.

Test script (round-trip, non-ASCII, no-stray-tempfile, preserve-on-failure, implementation-shape):

from utilities.atomic_io import atomic_write_json
# 1. Round-trip dict round-trips identically (assert equality)
# 2. Non-ASCII payload {"jp": "<JP characters>"} written with ensure_ascii=False:
#    raw bytes contain UTF-8 sequences, no \uXXXX escapes
# 3. After successful write, target dir contains only out.json (no stray .tmp-*)
# 4. Trigger TypeError mid-write (non-serializable value): existing target byte-for-byte
#    unchanged, no stray temp file left behind
# 5. inspect.getsource confirms tempfile.mkstemp(dir=directory) + os.replace pattern

Output:

OK round-trip
OK non-ASCII preserved as UTF-8 (no \uXXXX escapes)
Files in target dir: ['out.json']
OK no stray temp files
OK target preserved on failure, no stray temp file
OK helper uses tempfile.mkstemp(dir=...) + os.replace

Outcome:

  • Round-trip preserves data ✅
  • ensure_ascii=False non-ASCII round-trips as UTF-8 (no \uXXXX escapes) ✅
  • Mid-write failure leaves existing target intact, no stray .tmp-* files ✅
  • Same-directory temp + os.replace pattern verified by reading the helper source ✅
  • Did not run a real openant scan and Ctrl+C-mid-write — would have required a paid LLM call. The unit-level checks above + the in-tree tests (tests/test_atomic_io.py per the diff) cover the same code path.
  • POSIX umask path: I'm on Windows so the os.name == "posix" chmod branch wasn't exercised; no os.chmod is invoked on Windows by design (the helper notes this).

@joshbouncesecurity joshbouncesecurity force-pushed the fix/issue16-20-atomic-json branch from 110be29 to 8a3e113 Compare May 14, 2026 06:57
tempfile.mkstemp creates files with mode 0600 (owner-only). After
os.replace the target inherits those tightened bits, silently
regressing the permissions a plain open(path, "w") would have
produced under umask. Restore umask-derived 0666 & ~umask on POSIX
(no-op on Windows). Adds a POSIX-gated test pinning the behaviour.
@joshbouncesecurity joshbouncesecurity force-pushed the fix/issue16-20-atomic-json branch from 8a3e113 to eefafbd Compare May 14, 2026 07:43
Applies atomic_write_json to the remaining final pipeline outputs that
were still using plain write_json:

- core/reporter.py: pipeline_output.json (final report, ensure_ascii=False)
- core/parser_adapter.py: dataset.json and analyzer_output.json (parse
  outputs consumed by subsequent expensive LLM stages)
- experiment.py: experiment results (legacy direct-run path)

Skipped: checkpoint files (designed for incremental writes),
scanner intermediate state, and the diff_filter sidecar report
(cheap to regenerate, not consumed downstream).
@joshbouncesecurity joshbouncesecurity marked this pull request as ready for review May 14, 2026 08:34
@ar7casper
Copy link
Copy Markdown
Collaborator

Hey @joshbouncesecurity — quick one before going deeper on this.

Don't we already get this for free from the checkpoint/resume system? Per the CHANGELOG [2026-04-14], every phase persists per-unit progress, and as of current master those checkpoints are preserved rather than cleaned up (see comments in core/analyzer.py:518 and core/verifier.py:196). So a torn write to results.json is recoverable — rerun the step, checkpoints get reloaded, file gets re-emitted, no LLM cost lost.

The one place where atomic writes would clearly still matter is the un-checkpointed outputs (pipeline_output.json in core/reporter.py, parser outputs like dataset.json) — those are one-shot writes with no backstop. But this PR doesn't touch those.

What's your take — close this, or is there a scenario I'm missing?

@joshbouncesecurity
Copy link
Copy Markdown
Contributor Author

Good analysis. To address both parts:

Checkpointed outputs (results.json, enhanced_dataset.json, results_verified.json): you're right that checkpoints prevent cost loss. The only remaining value here is UX — a torn write produces an opaque JSON decode error on the next run rather than a clean resume from the intact previous file. Minor, but real.

Un-checkpointed outputs: commit 3 (pushed after your comment) covers exactly the gap you identified — dataset.json, analyzer_output.json, and pipeline_output.json are now all atomic. Those are the strong case and they're closed.

For reference, the relevant diffs:

Happy to drop commits 1–2 and keep only commit 3 if you'd prefer a tighter diff with no ambiguity about justification. Let me know.

@ar7casper
Copy link
Copy Markdown
Collaborator

Since much of the goals of this PR are already on master, it was decided with @joshbouncesecurity to close this PR, feel the current solution on master, and open a new PR in the future with proper adjustments.

@ar7casper ar7casper closed this May 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants