fix: atomic JSON writes for pipeline outputs #43
Manual verification
Local test results
Exercised
Test script (round-trip, non-ASCII, no-stray-tempfile, preserve-on-failure, implementation-shape):
from utilities.atomic_io import atomic_write_json
# 1. Round-trip: dict round-trips identically (assert equality)
# 2. Non-ASCII payload {"jp": "<JP characters>"} written with ensure_ascii=False:
# raw bytes contain UTF-8 sequences, no \uXXXX escapes
# 3. After successful write, target dir contains only out.json (no stray .tmp-*)
# 4. Trigger TypeError mid-write (non-serializable value): existing target byte-for-byte
# unchanged, no stray temp file left behind
# 5. inspect.getsource confirms tempfile.mkstemp(dir=directory) + os.replace pattern
Output:
Outcome:
110be29 to 8a3e113
tempfile.mkstemp creates files with mode 0600 (owner-only). After os.replace the target inherits those tightened bits, silently regressing the permissions a plain open(path, "w") would have produced under umask. Restore umask-derived 0666 & ~umask on POSIX (no-op on Windows). Adds a POSIX-gated test pinning the behaviour.
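The permission fix described in this commit could look roughly like the sketch below. It is not the code from the commit: the helper names current_umask and apply_umask_default_mode are hypothetical, and reading the umask by setting and restoring it is one common approach that is not thread-safe.

```python
import os


def current_umask():
    """Return the process umask.

    os.umask can only read the mask by setting a new one, so set a
    throwaway value and restore the original right away. This is not
    thread-safe (another thread could create a file in between).
    """
    mask = os.umask(0)
    os.umask(mask)
    return mask


def apply_umask_default_mode(path):
    """Give `path` the mode a plain open(path, "w") would have produced.

    Hypothetical helper: on POSIX it sets 0666 & ~umask, elsewhere it
    does nothing.
    """
    if os.name != "posix":
        return  # no-op on Windows, as in the commit description
    os.chmod(path, 0o666 & ~current_umask())
```

In a write helper, a call like this would be applied to the temp file before os.replace, so the target never becomes visible with the tightened 0600 bits.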
8a3e113 to eefafbd
Applies atomic_write_json to the remaining final pipeline outputs that were still using plain write_json:
- core/reporter.py: pipeline_output.json (final report, ensure_ascii=False)
- core/parser_adapter.py: dataset.json and analyzer_output.json (parse outputs consumed by subsequent expensive LLM stages)
- experiment.py: experiment results (legacy direct-run path)

Skipped: checkpoint files (designed for incremental writes), scanner intermediate state, and the diff_filter sidecar report (cheap to regenerate, not consumed downstream).
Hey @joshbouncesecurity, quick one before going deeper on this. Don't we already get this for free from the checkpoint/resume system? Per the CHANGELOG The one place where atomic writes would clearly still matter is the un-checkpointed outputs ( What's your take: close this, or is there a scenario I'm missing?
Good analysis. To address both parts: Checkpointed outputs ( Un-checkpointed outputs: commit 3 (pushed after your comment) covers exactly the gap you identified — For reference, the relevant diffs:
Happy to drop commits 1–2 and keep only commit 3 if you'd prefer a tighter diff with no ambiguity about justification. Let me know.
Since many of this PR's goals are already met on master, it was decided together with @joshbouncesecurity to close this PR, get a feel for the current solution on master, and open a new PR in the future with the proper adjustments.
Summary
Introduces a new atomic_write_json() helper and applies it to all final pipeline output files. The helper writes JSON to a temp file in the same directory as the target, fsyncs, then os.replaces it onto the target path. If a crash or power loss occurs mid-write, the previous output file is preserved intact rather than being left truncated.
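For readers who have not seen the pattern, here is a minimal sketch of a temp-file-plus-os.replace writer. It is illustrative only and not the helper from this PR: the signature, the ensure_ascii and indent parameters, and the ".tmp-" prefix are assumptions.

```python
import json
import os
import tempfile


def atomic_write_json(path, data, *, ensure_ascii=True, indent=2):
    """Write `data` as JSON to `path` via a same-directory temp file.

    Sketch under stated assumptions; the signature is not taken from the PR.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives next to the target so os.replace stays on one volume.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, ensure_ascii=ensure_ascii, indent=indent)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before the rename
        # Swap the finished file into place in one step.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, remove the temp file; the old target is untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Because serialization happens into the temp file, a TypeError from a non-serializable value leaves the existing target byte-for-byte unchanged, which is the property the manual verification checks.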
Upstream uses plain write_json() (a json.dump wrapper), which can corrupt multi-hour scan results on interrupt.
Same-directory temp is load-bearing, notably on Windows: os.replace is only atomic when source and target sit on the same volume. A cross-volume os.replace may fail outright, and a copy+delete fallback (as shutil.move does) loses atomicity.
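To make the same-volume requirement concrete, the sketch below creates the temp file beside the target and compares device IDs. Both helper names are hypothetical, and the st_dev comparison is only a heuristic: matching device IDs make a cross-volume rename unlikely but do not guarantee that os.replace will succeed.

```python
import os
import tempfile


def same_volume(path_a, path_b):
    """Return True when both paths report the same device ID (st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def make_sibling_temp(target):
    """Create a temp file in the target's own directory.

    Returns (fd, temp_path); the caller closes fd and removes or renames
    the file. Hypothetical helper, not taken from the PR.
    """
    directory = os.path.dirname(os.path.abspath(target))
    return tempfile.mkstemp(dir=directory, prefix=".tmp-")
```

A temp file from tempfile.mkstemp() without dir= lands in the system temp directory, which may sit on a different volume from the output directory.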
Call sites covered (one write each, at the very end of a pipeline stage):
- core/analyzer.py: results.json
- core/enhancer.py: enhanced_dataset.json
- core/verifier.py: results_verified.json
- core/reporter.py: pipeline_output.json (final report)
- core/parser_adapter.py: dataset.json + analyzer_output.json
- experiment.py: experiment results (legacy direct-run path)

Not changed: checkpoint files (designed for incremental writes), scanner intermediate state, and the diff-filter sidecar report (cheap to regenerate).
Performance: one fsync + one rename per output file, all of which occur once at the end of a stage that runs for minutes. Negligible overhead. fsync is silently skipped on filesystems where it fails (network mounts etc.) so atomicity degrades gracefully to rename-only in those environments.
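The graceful degradation described above might be sketched as follows. The function name fsync_best_effort is hypothetical, and catching every OSError is an assumption about how the PR treats fsync failures.

```python
import os


def fsync_best_effort(fd):
    """Try to flush `fd` to stable storage; return False instead of raising.

    Assumption: any OSError from fsync is treated as "unsupported here",
    so the caller falls back to rename-only atomicity.
    """
    try:
        os.fsync(fd)
    except OSError:
        return False
    return True
```

Returning a boolean rather than swallowing the error entirely lets a caller log when durability was skipped; whether the PR does so is not stated.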
Addresses item 20 from #16 (does not close the issue).
Test plan
ensure_ascii=False Unicode).
.tmp- files left on failure.