fix: atomic JSON writes for pipeline outputs by joshbouncesecurity · Pull Request #43 · knostic/OpenAnt

joshbouncesecurity · 2026-05-04T18:23:02Z

Summary

Introduces a new atomic_write_json() helper and applies it to all final
pipeline output files. The helper writes JSON to a temp file in the same
directory as the target, fsyncs, then os.replaces it onto the target path.
If a crash or power loss occurs mid-write, the previous output file is
preserved intact rather than being left truncated.

Upstream uses plain write_json() (a json.dump wrapper), which can corrupt
multi-hour scan results on interrupt.

Same-directory temp is load-bearing on Windows: os.replace is only atomic
when source and target sit on the same volume; cross-volume falls back to
copy+delete and loses atomicity.

Call sites covered (one write each, at the very end of a pipeline stage):

core/analyzer.py — results.json
core/enhancer.py — enhanced_dataset.json
core/verifier.py — results_verified.json
core/reporter.py — pipeline_output.json (final report)
core/parser_adapter.py — dataset.json + analyzer_output.json
experiment.py — experiment results (legacy direct-run path)

Not changed: checkpoint files (designed for incremental writes), scanner
intermediate state, and the diff-filter sidecar report (cheap to regenerate).

Performance: one fsync + one rename per output file, all of which occur
once at the end of a stage that runs for minutes. Negligible overhead.
fsync is silently skipped on filesystems where it fails (network mounts etc.)
so atomicity degrades gracefully to rename-only in those environments.

Addresses item 20 from #16 (does not close the issue).

Test plan

Helper round-trip preserves data (incl. ensure_ascii=False Unicode).
Mid-write failure leaves the existing file intact.
Mid-write failure does not create a target if none existed.
No stray .tmp- files left on failure.
Temp file is in the same directory as the target.
POSIX permissions match umask (not mkstemp's 0600 default).
Existing tests pass (113 collected, 91 passed, 22 skipped — go/js parser binaries unavailable in this env).

joshbouncesecurity · 2026-05-04T20:45:40Z

Manual verification

Run a small openant scan to completion: results.json, enhanced_dataset.json, results_verified.json all exist and parse as valid JSON.
Atomicity smoke test: start a scan, kill the process (Ctrl+C) during the analyze stage. Re-inspect results.json — should be either the previous good version OR non-existent. Never truncated/corrupt.
Permissions (POSIX): confirm output file mode matches umask (typically 0644), not 0600.
Same-dir temp: while a write is happening, a hidden *.tmp file appears in the SAME directory as the target (use watch ls -la <scan-dir> or lsof). Cross-device renames aren't atomic.
Unicode: a finding containing non-ASCII text (e.g., a Japanese string) round-trips through results_verified.json without escape sequences (ensure_ascii=False preserved).

joshbouncesecurity · 2026-05-04T21:05:19Z

Local test results

Exercised utilities.atomic_io.atomic_write_json directly on Windows from this branch.

Test script (round-trip, non-ASCII, no-stray-tempfile, preserve-on-failure, implementation-shape):

from utilities.atomic_io import atomic_write_json
# 1. Round-trip dict round-trips identically (assert equality)
# 2. Non-ASCII payload {"jp": "<JP characters>"} written with ensure_ascii=False:
#    raw bytes contain UTF-8 sequences, no \uXXXX escapes
# 3. After successful write, target dir contains only out.json (no stray .tmp-*)
# 4. Trigger TypeError mid-write (non-serializable value): existing target byte-for-byte
#    unchanged, no stray temp file left behind
# 5. inspect.getsource confirms tempfile.mkstemp(dir=directory) + os.replace pattern

Output:

OK round-trip
OK non-ASCII preserved as UTF-8 (no \uXXXX escapes)
Files in target dir: ['out.json']
OK no stray temp files
OK target preserved on failure, no stray temp file
OK helper uses tempfile.mkstemp(dir=...) + os.replace

Outcome:

Round-trip preserves data ✅
ensure_ascii=False non-ASCII round-trips as UTF-8 (no \uXXXX escapes) ✅
Mid-write failure leaves existing target intact, no stray .tmp-* files ✅
Same-directory temp + os.replace pattern verified by reading the helper source ✅
Did not run a real openant scan and Ctrl+C-mid-write — would have required a paid LLM call. The unit-level checks above + the in-tree tests (tests/test_atomic_io.py per the diff) cover the same code path.
POSIX umask path: I'm on Windows so the os.name == "posix" chmod branch wasn't exercised; no os.chmod is invoked on Windows by design (the helper notes this).

tempfile.mkstemp creates files with mode 0600 (owner-only). After os.replace the target inherits those tightened bits, silently regressing the permissions a plain open(path, "w") would have produced under umask. Restore umask-derived 0666 & ~umask on POSIX (no-op on Windows). Adds a POSIX-gated test pinning the behaviour.

Applies atomic_write_json to the remaining final pipeline outputs that were still using plain write_json: - core/reporter.py: pipeline_output.json (final report, ensure_ascii=False) - core/parser_adapter.py: dataset.json and analyzer_output.json (parse outputs consumed by subsequent expensive LLM stages) - experiment.py: experiment results (legacy direct-run path) Skipped: checkpoint files (designed for incremental writes), scanner intermediate state, and the diff_filter sidecar report (cheap to regenerate, not consumed downstream).

ar7casper · 2026-05-14T10:06:25Z

Hey @joshbouncesecurity — quick one before going deeper on this.

Don't we already get this for free from the checkpoint/resume system? Per the CHANGELOG [2026-04-14], every phase persists per-unit progress, and as of current master those checkpoints are preserved rather than cleaned up (see comments in core/analyzer.py:518 and core/verifier.py:196). So a torn write to results.json is recoverable — rerun the step, checkpoints get reloaded, file gets re-emitted, no LLM cost lost.

The one place where atomic writes would clearly still matter is the un-checkpointed outputs (pipeline_output.json in core/reporter.py, parser outputs like dataset.json) — those are one-shot writes with no backstop. But this PR doesn't touch those.

What's your take — close this, or is there a scenario I'm missing?

joshbouncesecurity · 2026-05-14T10:53:36Z

Good analysis. To address both parts:

Checkpointed outputs (results.json, enhanced_dataset.json, results_verified.json): you're right that checkpoints prevent cost loss. The only remaining value here is UX — a torn write produces an opaque JSON decode error on the next run rather than a clean resume from the intact previous file. Minor, but real.

Un-checkpointed outputs: commit 3 (pushed after your comment) covers exactly the gap you identified — dataset.json, analyzer_output.json, and pipeline_output.json are now all atomic. Those are the strong case and they're closed.

For reference, the relevant diffs:

Commits 1–2 (checkpointed outputs + helper): 42fbb32~1...eefafbd
Commit 3 (un-checkpointed outputs): f482e09

Happy to drop commits 1–2 and keep only commit 3 if you'd prefer a tighter diff with no ambiguity about justification. Let me know.

ar7casper · 2026-05-17T11:30:39Z

Since much of the goals of this PR are already on master, it was decided with @joshbouncesecurity to close this PR, feel the current solution on master, and open a new PR in the future with proper adjustments.

joshbouncesecurity force-pushed the fix/issue16-20-atomic-json branch from 110be29 to 8a3e113 Compare May 14, 2026 06:57

joshbouncesecurity added 2 commits May 14, 2026 10:42

fix: atomic JSON writes for pipeline outputs

42fbb32

joshbouncesecurity force-pushed the fix/issue16-20-atomic-json branch from 8a3e113 to eefafbd Compare May 14, 2026 07:43

joshbouncesecurity marked this pull request as ready for review May 14, 2026 08:34

joshbouncesecurity requested review from ar7casper, dgeyshis, shahar-davidson, sounil and yotamleo as code owners May 14, 2026 08:34

ar7casper closed this May 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: atomic JSON writes for pipeline outputs#43

fix: atomic JSON writes for pipeline outputs#43
joshbouncesecurity wants to merge 3 commits into
knostic:masterfrom
joshbouncesecurity:fix/issue16-20-atomic-json

joshbouncesecurity commented May 4, 2026 •

edited

Loading

Uh oh!

joshbouncesecurity commented May 4, 2026

Uh oh!

joshbouncesecurity commented May 4, 2026

Uh oh!

ar7casper commented May 14, 2026

Uh oh!

joshbouncesecurity commented May 14, 2026

Uh oh!

ar7casper commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

joshbouncesecurity commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

joshbouncesecurity commented May 4, 2026

Manual verification

Uh oh!

joshbouncesecurity commented May 4, 2026

Local test results

Uh oh!

ar7casper commented May 14, 2026

Uh oh!

joshbouncesecurity commented May 14, 2026

Uh oh!

ar7casper commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

joshbouncesecurity commented May 4, 2026 •

edited

Loading