
Comments

feat: batch-build with OpenAI Batch API + demo template #71

Merged
marklubin merged 10 commits into main from feat/batch-build-demo
Feb 17, 2026

Conversation

@marklubin
Owner

Summary

  • Batch build engine — synix batch-build run/resume/plan/list/status commands using OpenAI Batch API for eligible transform layers, with automatic fallback to sync for non-OpenAI providers
  • Cassette replay for batch builds — relaxed SYNIX_CASSETTE_MODE=replay check to allow demo replay when batch_responses.json exists, skipping API key validation during replay
  • Demo template 05-batch-build — 2-level pipeline (batch OpenAI work styles → sync Anthropic team summary) with full cassette coverage and golden files
  • Output normalization — added batch-specific normalization patterns (build IDs, pipeline hashes, timestamps) to demo_commands.py for stable golden comparison

Architecture

Source (bios) → WorkStyleBatchTransform (OpenAI, batch=True) → TeamSummaryTransform (Anthropic, batch=False)

BatchLLMClient is a drop-in LLMClient replacement that uses control-flow exceptions (BatchCollecting, BatchInProgress) to signal the runner. BatchState persists build progress to JSON for multi-run resume support.
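
As a rough sketch (class names follow the implementation, but the fields and method bodies below are simplified illustrations, not the actual batch_client.py code):

```python
# Simplified illustration of the control-flow-exception pattern; not the real
# BatchLLMClient, whose state lives in BatchState and persists to JSON.
import hashlib
import json


class BatchCollecting(Exception):
    """Request was queued for a batch that has not been submitted yet."""


class BatchInProgress(Exception):
    """Request belongs to a submitted batch that has not finished."""


class BatchLLMClient:
    """Drop-in for LLMClient: same complete() entry point, batch-aware behavior."""

    def __init__(self):
        self.results = {}       # request key -> downloaded response
        self.submitted = set()  # keys already sent to the OpenAI Batch API
        self.pending = {}       # keys collected for the next submission

    def _key(self, model, messages, **params):
        payload = json.dumps({"model": model, "messages": messages, **params}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def complete(self, model, messages, **params):
        key = self._key(model, messages, **params)
        if key in self.results:
            return self.results[key]      # batch finished earlier: serve cached result
        if key in self.submitted:
            raise BatchInProgress(key)    # batch running: resume on a later invocation
        self.pending[key] = {"model": model, "messages": messages, **params}
        raise BatchCollecting(key)        # first pass: collect the request, no API call


def run_layer(units, transform, client):
    """Runner side: the exceptions mean "not done yet", never "failed"."""
    waiting = False
    for unit in units:
        try:
            transform(unit, client)
        except (BatchCollecting, BatchInProgress):
            waiting = True                # keep iterating so every request is collected
    return "waiting_on_batch" if waiting else "completed"
```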

Test plan

  • 22 E2E tests for batch-build CLI (plan, run, resume, list, status, error handling, mixed providers, corrupted state, fingerprint mismatch)
  • Unit tests for BatchLLMClient and BatchState
  • Demo 05 passes with cassette replay (all 11 golden comparisons)
  • All 5 existing demos still pass (no regression from normalization changes)
  • uv run release passes (1008 tests, lint, template sync, all demos)

Add OpenAI Batch API support for pipeline transforms, enabling async
batch processing of LLM calls at ~50% cost reduction.

Core implementation:
- BatchLLMClient: drop-in LLMClient replacement using control-flow
  exceptions (BatchCollecting/BatchInProgress) for lifecycle management
- BatchState: persistent JSON state tracking for multi-run batch builds
- batch_runner: DAG walker that routes layers to batch vs sync mode
  based on provider and explicit batch=True/False/None config

CLI commands (synix batch-build):
- run: create build instance and submit batches
- resume: continue a previously submitted build
- plan: dry-run showing which layers batch vs sync
- list/status: inspect build instances

Cassette replay fix:
- Allow SYNIX_CASSETTE_MODE=replay when batch_responses.json exists
- Skip API key validation during cassette replay
- Enables deterministic demo replay without real API keys

Demo template (05-batch-build):
- 2-level pipeline: batch OpenAI work styles → sync Anthropic summary
- Full cassette coverage for both batch and sync layers
- Golden files with normalized batch IDs, timestamps, pipeline hashes
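
Roughly, the demo wires the two levels like this (a simplified sketch; the Layer dataclass and constructor arguments are illustrative stand-ins, not the real code in templates/05-batch-build/pipeline.py):

```python
# Illustrative only: this Layer dataclass is a stand-in for the real synix pipeline API.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Layer:
    name: str
    provider: str
    batch: Optional[bool] = None          # True = force batch, False = force sync, None = auto
    depends_on: List[str] = field(default_factory=list)


# Level 1: fan out over the source bios through the OpenAI Batch API (~50% cheaper).
work_styles = Layer(name="work_styles", provider="openai", batch=True, depends_on=["bios"])

# Level 2: one summary call over all work styles; no Anthropic batch path, so stay sync.
team_summary = Layer(name="team_summary", provider="anthropic", batch=False, depends_on=["work_styles"])

pipeline = [work_styles, team_summary]
```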

Tests: 22 E2E tests + unit tests for BatchLLMClient and BatchState
@github-actions
Contributor

Note

Red Team Review — OpenAI GPT-5.2 | Adversarial review (docs + diff only)

Threat assessment — Medium risk: this PR adds a new persistence format + CLI surface + execution mode (async batch) with several correctness and operability gaps that will be painful to unwind once users depend on it.

One-way doors

  1. New CLI surface: synix batch-build ...
    Hard to reverse because it’ll be documented, scripted in CI, and baked into templates/goldens. Safe only if you commit to long-term semantics for run/resume/list/status/plan and flags, or clearly gate behind a feature flag / separate plugin package.

  2. Persistent on-disk build instance format: build/builds/<build_id>/{manifest.json,batch_state.json}
    Hard to reverse because users will have in-flight builds and tooling around these files. Safe only if you version the state schema (explicit schema_version) and provide forward/backward compatibility or migration strategy.

  3. Public API addition: Transform(..., batch: bool | None) in src/synix/core/models.py
    This becomes part of the pipeline authoring contract. Safe only if you’re confident about the tri-state behavior and provider mapping rules; otherwise keep it out of the base model (e.g., batch-build-specific config) until stabilized.

  4. Request key semantics reusing compute_cassette_key()
    This implicitly binds batch correctness to cassette hashing forever. Safe only if you treat compute_cassette_key() as a versioned, provider-aware protocol (include version, include all output-affecting params, don’t normalize content).

Findings

  1. src/synix/build/batch_runner.py: corrupted-state reset uses BatchState.__new__ and manual field poking
    Failure mode: easy to create a partially-initialized object (miss invariants, future fields), leading to silent state loss or crashes later. This is a code smell in a reliability-critical path.
    Severity: [critical]

  2. BatchState.compute_pipeline_hash() only hashes pipeline name/dirs/layer names/types/deps, not prompts/config/code (src/synix/build/batch_state.py)
    Failure mode: “pipeline mismatch” check is weak; you can change prompt/model/transform config and resume without detection if the layer graph is unchanged. That defeats the whole guardrail.
    Severity: [critical]

  3. Cassette-mode contract inconsistency: docs vs implementation
    Docs claim batch-build rejects cassette mode broadly, but code only errors for SYNIX_CASSETTE_MODE=replay when no batch_responses.json exists, and implicitly allows record with no guard (batch_runner.py pre-validation). This is confused and will rot.
    Severity: [warning]

  4. Batch result parsing assumes OpenAI response shape is stable and always has choices[0].message.content (batch_client.py::_parse_batch_results)
    Failure mode: tool calls / non-text responses / empty choices yield empty string artifacts, treated as success. You’ll cache garbage and mark layer “completed.”
    Severity: [warning]

  5. Expired/failed batch handling does not re-queue as spec claims
    Docs: “Batch expired → re-queue failed requests, submit new batch.” Code: marks errors for all missing results and returns terminal True (check_and_download). No requeue logic exists.
    Severity: [warning]

  6. Plan estimation logic is nonsense / misleading (batch_runner.py::plan_batch)
    It uses estimate_output_count of upstream transforms with hardcoded 1 and sums them; doesn’t inspect actual cached/source counts. This will mislead cost/size expectations and erode trust.
    Severity: [minor]

  7. Race conditions / concurrent resume not addressed (BatchState read-modify-write, no locking)
    Two resume --poll processes can submit duplicate batches or clobber state despite atomic_write (atomic write doesn’t serialize writers).
    Severity: [warning]

  8. Template duplication: src/synix/templates/05-batch-build/... and templates/05-batch-build/... both added
    Failure mode: two sources of truth; demo runner likely uses one; drift guaranteed.
    Severity: [warning]

  9. tests/e2e/test_batch_build.py: “cassette replay allowed with empty batch_responses.json” test is incoherent
    It asserts the pre-gate doesn’t reject, but then doesn’t control later behavior (no OpenAI mock, no API key). This test will be flaky depending on code paths.
    Severity: [minor]

Missing

  • Schema/versioning for manifest.json and batch_state.json plus migration story.
  • Strong pipeline fingerprint reuse of existing build fingerprint machinery (prompt/model/config/transform code), not a custom weak hash.
  • File locking strategy (or explicit “single-process only” enforcement) for build instance dirs.
  • Implementation matching docs for expired-batch requeue and cassette-mode rules.
  • Docs updates in README/website for new command group and experimental constraints (right now it’s only in a new doc).

Verdict

Block — the feature is directionally fine, but the pipeline fingerprint guardrail is currently fake and the corrupted-state recovery is unsafe; both are foundational and must be fixed before shipping a new persistent build mode.

Review parameters
  • Model: gpt-5.2
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 3,954 lines
  • Prompt: .github/prompts/openai_review.md
  • Timestamp: 2026-02-17T02:44:22Z

@github-actions
Contributor

Note

Architectural Review — Claude Opus | Blind review (docs + diff only)

Summary

This PR adds OpenAI Batch API support as a new synix batch-build command group. It introduces a BatchLLMClient (drop-in for LLMClient), persistent build instance state, a batch runner that walks the DAG with collect→submit→poll→resume semantics, CLI commands (run/resume/list/status/plan), a demo template, and comprehensive unit + e2e tests. The feature is clearly marked [Experimental].

Alignment

Strong fit. The build-system analogy holds: batch-build is analogous to a distributed build backend — same DAG, same artifacts, same cache, different execution strategy. Artifacts remain content-addressed and immutable (shared ArtifactStore). Provenance chains stay complete — _run_batch_transform records provenance identically to sync. The design respects "audit determinism over reproducibility" (DESIGN.md Appendix C) by reusing compute_cassette_key for request identity. Cache keys correctly capture all inputs. The synix build path is untouched, consistent with the "architecture is a runtime concern" hypothesis — users swap execution strategy without changing their pipeline definition.

Observations

  1. [concern] batch_runner.py:107-120 — reset_state bypasses __init__. Manually constructing BatchState via __new__ and setting private fields is fragile. If BatchState.__init__ gains new fields, this code silently produces an incomplete object. A classmethod like BatchState.create_fresh(build_dir, build_id) would be safer.

  2. [concern] batch_runner.py — duplicated artifact-save logic. _run_sync_transform and _run_batch_transform contain nearly identical 20-line blocks for fingerprint check → save → provenance → append. This should be extracted to a shared helper to prevent drift.

  3. [concern] batch-build-review.md:6-11 — cassette-mode incompatibility flagged but not resolved. The review doc itself identifies that the recording workflow exports SYNIX_CASSETTE_MODE=record then calls batch-build run, but the spec says any cassette mode set is a hard error for replay. The implementation only rejects replay (good), but the design doc language is ambiguous. The review doc ships as part of the PR — should this be resolved before merge, or is it intentional tracking?

  4. [question] batch_state.py:245 — compute_pipeline_hash is shallow. It hashes layer names, types, and dependency names but not llm_config, config dicts, prompt source, or batch parameter. A pipeline that changes model or temperature between submit and resume would pass the fingerprint check. This undermines the safety the mismatch guard is supposed to provide.

  5. [question] batch_runner.py:390 — plan_batch estimates are rough. estimate_output_count is called on dependency layers with a hardcoded 1 input, and the fallback is len(depends_on) or 1. For real pipelines this will often be wrong. Is this acceptable for an experimental plan command?

  6. [positive] Test coverage is excellent. Unit tests for BatchState (persistence, corruption quarantine, round-trips), BatchLLMClient (all exception paths, keying correctness), and e2e tests covering the full CLI flow, mixed providers, fingerprint mismatch, corrupted state recovery, cassette replay gating, and idempotent resume. The test list in batch-build.md maps nearly 1:1 to implemented tests.

  7. [positive] Transform contract is preserved. Transforms don't know they're in batch mode — the _shared_llm_client injection via config and control-flow exceptions is clean. The batch parameter on Transform defaults to None and is ignored by synix build.

  8. [nit] Duplicate template files. templates/05-batch-build/ and src/synix/templates/05-batch-build/ contain identical pipeline.py and sources/. One should be a symlink or the packaging should be clarified.

  9. [nit] _has_cassette_responses() is called repeatedly during validation (once per layer in _resolve_batch_mode). It does filesystem I/O each time. Minor, but could cache the result.

Verdict

This is a well-designed, well-tested incremental step that adds a meaningful cost optimization without disturbing the core build model — good to merge after addressing the __new__ bypass (#1) and the shallow pipeline hash (#4).

Review parameters
  • Model: claude-opus-4-6
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 3,954 lines
  • Prompt: .github/prompts/claude_review.md
  • Timestamp: 2026-02-17T02:44:34Z

…eline hash

- Replace fragile `BatchState.__new__` bypass with `create_fresh()`
  classmethod for corrupted state recovery (both reviewers flagged)
- Strengthen `compute_pipeline_hash()` to include pipeline llm_config,
  per-layer config dicts, and batch parameter — changing model,
  temperature, or batch mode between submit/resume is now detected
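
Sketch of the create_fresh() shape (field names are illustrative; the real class lives in src/synix/build/batch_state.py):

```python
# Sketch: field names are illustrative, not the actual BatchState attributes.
from pathlib import Path


class BatchState:
    def __init__(self, build_dir: Path, build_id: str):
        self.build_dir = build_dir
        self.build_id = build_id
        self.pending = {}
        self.results = {}
        self.errors = {}

    @classmethod
    def create_fresh(cls, build_dir: Path, build_id: str) -> "BatchState":
        # Goes through __init__, so every invariant and any future field is set.
        # The old approach (BatchState.__new__ plus manual attribute assignment)
        # silently produced incomplete objects when __init__ gained new fields.
        return cls(build_dir=build_dir, build_id=build_id)
```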
@github-actions
Contributor

Note

Red Team Review — OpenAI GPT-5.2 | Adversarial review (docs + diff only)

Threat assessment — High risk: it adds a new persistence format + CLI surface + alternate execution engine that will be very hard to unwind once users adopt it, and there are correctness holes around caching, state recovery, and batch semantics.

One-way doors

  1. New public CLI surface: synix batch-build ...
    Hard to reverse because docs/templates/tests will encode this workflow; users will script it. Safe to merge only if you’re confident the command group + flag names are stable or you explicitly gate it (hidden/undocumented) until v1.
  2. On-disk build-instance state format (build/builds/<id>/{manifest.json,batch_state.json})
    Hard to reverse because builds become resumable artifacts; changing schema requires migrations/back-compat. Safe only if you version these JSON schemas now (e.g., "schema_version": 1) and define upgrade behavior.
  3. Transform(batch=...) added to core model (src/synix/core/models.py)
    This changes the Python API shape. Users will start setting batch=True/False. Safe only if you’re sure batch is the right abstraction boundary and won’t need renaming/splitting (e.g., provider capability vs execution strategy).

Findings

  1. src/synix/build/batch_runner.py: cassette-mode gate contradicts docs
    Code allows SYNIX_CASSETTE_MODE=replay if batch_responses.json exists, while docs/batch-build.md says replay is a hard error. Also docs say to unset cassette mode for batch. This mismatch will cause “works on my machine” behavior. [warning]
  2. _resolve_batch_mode: auto-batches all OpenAI transforms, ignoring “eligible if split_count > 1” spec
    Docs claim default auto-batch only when split_count > 1. Implementation: batch=None + OpenAI provider ⇒ always "batch". That’s a behavioral contract break and a potentially expensive/slow default for singletons. [critical]
  3. State reset is unsafe and underspecified (batch_run(... reset_state=True) + BatchState.create_fresh)
    Resetting creates a fresh state but keeps the same build_id and does not clean/mark existing OpenAI batches as cancelled. You’ll leak orphan batches and may later download/merge unexpected results. Safe recovery needs explicit “abandon batch ids” semantics recorded in manifest and surfaced to the user. [critical]
  4. Partial failures don’t actually persist error counts to manifest during run
    BuildInstance.failed_requests is only set at the end of batch_run. If you exit early after submission, list/status will show no failures even though errors are already known. [minor]
  5. Batch request keying is coupled to cassette key + normalization assumptions (BatchLLMClient.complete uses compute_cassette_key)
    This is a hidden global coupling: any future change to cassette hashing changes batch dedupe/resume semantics and vice-versa. You’re creating an implicit “wire format” without versioning. [warning]
  6. BatchLLMClient.submit_batch: unbounded JSONL in memory + no size chunking
    Builds a single in-memory string of all requests; for large layers this is a blowup and may exceed OpenAI batch file limits. Needs chunking into multiple batches by request count/bytes. [critical]
  7. _poll_and_resume: fixed max_polls=1440 assumes 60s interval
    If user sets --poll-interval 30, you only poll for 12h; if 300, you poll for 5 days. This is sloppy reliability behavior. Base it on elapsed time vs SLA. [warning]
  8. batch_runner.py imports unused internals and likely dead code paths (json unused, compute_build_fingerprint etc.)
    Signals copy/paste from sync runner and increases maintenance. [minor]
  9. docs/batch-build.md claims cassette key strips trailing whitespace from message string values
    That’s a semantic change; if true it can cause collisions. Code here doesn’t implement it; it delegates to compute_cassette_key (not shown). You must align wording with actual behavior or explicitly guarantee it. [warning]
  10. Duplicate template paths: both src/synix/templates/05-batch-build/... and templates/05-batch-build/...
    Two copies of the same template will drift; unclear which is authoritative. [warning]

Missing

  • Schema versioning + migration strategy for manifest.json / batch_state.json.
  • Chunking tests / limits tests (file size, request count, multiple batches per layer).
  • Concurrency story: what happens if two resume processes run on same build_id (atomic_write doesn’t prevent logical races).
  • Docs updates to README/website CLI reference if this is user-facing; currently only new docs file covers it.

Verdict

Block — too many contract mismatches and scale/recovery holes for a feature that introduces durable state + new public CLI/API. Fix eligibility semantics, add state schema/versioning, implement batch chunking, and tighten recovery before merging.

Review parameters
  • Model: gpt-5.2
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 3,975 lines
  • Prompt: .github/prompts/openai_review.md
  • Timestamp: 2026-02-17T02:49:53Z

@github-actions
Contributor

Note

Architectural Review — Claude Opus | Blind review (docs + diff only)

Summary

This PR adds OpenAI Batch API support as an experimental synix batch-build command group. It introduces a BatchLLMClient (drop-in for LLMClient), persistent build instance state, a batch runner that walks the DAG with async submit/poll/resume semantics, CLI commands (run/resume/list/status/plan), a demo template, and comprehensive tests. The spec document (batch-build.md) and its self-review (batch-build-review.md) are included.

Alignment

Strong fit. This is a cost optimization layer that preserves all core invariants: artifacts remain content-addressed and immutable, the DAG walk order is unchanged, provenance chains are recorded identically, and cache keys reuse compute_cassette_key(). The synix build path is untouched. DESIGN.md explicitly calls LLM transforms "expensive ($0.01–$1.00 per call)" and highlights cost estimation — 50% cost reduction directly serves that concern. The Python-first principle is respected: batch=True|False|None is a per-layer parameter, not config. Audit determinism is maintained: batch results are stored in batch_state.json with the same content/model/token metadata.

Observations

  1. [positive] batch_client.py:complete() — The four-case dispatch (cached → cassette → in-batch → new) is clean and the control-flow-via-exceptions pattern keeps transforms completely unaware of batch mode. This is the right abstraction boundary.

  2. [positive] Corruption recovery in batch_state.py — Quarantine-then-fail-safe with --reset-state as explicit opt-in is well-designed. Matches the "fail safe instead of silently resetting" principle called out in the self-review.

  3. [concern] batch_runner.py:_run_batch_transform and _run_sync_transform duplicate ~30 lines of artifact-saving logic (fingerprint computation, needs_rebuild, store.save_artifact, provenance recording). This is the same code as in runner.py. Three copies of the same logic will drift. Extract a shared _save_or_cache_artifact() helper.

  4. [concern] batch_runner.py:_poll_and_resume has max_polls = 1440 (24h at 60s) hardcoded. If poll_interval is changed to 30s, this silently doubles to 48h of polling. The limit should be time-based or derived from poll_interval.

  5. [concern] The self-review (batch-build-review.md, finding #1) identifies that SYNIX_CASSETTE_MODE=record + batch-build run conflicts with the documented recording workflow, but the code in batch_runner.py:89-93 only checks for replay, not record. The spec says SYNIX_CASSETTE_MODE set is a hard error, but the implementation only rejects replay. This inconsistency should be resolved explicitly.

  6. [question] batch_runner.py:_resolve_batch_mode treats "deepseek" and "openai-compatible" as batch-eligible providers. Are these actually supported by the OpenAI Batch API endpoint? This could cause silent failures at submit time.

  7. [question] Pipeline files are duplicated: src/synix/templates/05-batch-build/ and templates/05-batch-build/ contain identical pipeline.py and source files. Is this intentional (packaged template vs. dev template)?

  8. [nit] batch_commands.py:resume takes pipeline_path as a positional argument via @pipeline_argument, but the resume instructions printed by run say synix batch-build resume {build_id} pipeline.py — users may not realize the pipeline path is required on resume. Consider storing it in the manifest.

  9. [positive] Test coverage is thorough: unit tests for BatchLLMClient (cached/collecting/in-progress/failed paths), BatchState (persistence, corruption, round-trips), request key distinctness, and e2e tests covering the full CLI flow with mocked OpenAI including fingerprint mismatch, corrupted state, mixed providers, and idempotent resume.

  10. [nit] batch_state.py:compute_pipeline_hash uses string concatenation with | delimiter — a layer named a|b could collide with adjacent fields. Use a structured serialization or null-byte separator.
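
For example, a structured serialization avoids the delimiter collision (a sketch of the suggestion in item 10, not the project's code):

```python
# Sketch: hash a structured serialization instead of "|"-joined strings,
# so a layer literally named "a|b" cannot collide with adjacent fields.
import hashlib
import json


def compute_pipeline_hash(layers) -> str:
    payload = [
        {"name": layer["name"], "type": layer["type"], "depends_on": sorted(layer["depends_on"])}
        for layer in layers
    ]
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
```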

Verdict

This is a well-scoped, architecturally sound addition that respects all core invariants while addressing a real cost concern — the duplicated artifact-saving logic is the main thing to fix before merge.

Review parameters
  • Model: claude-opus-4-6
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 3,975 lines
  • Prompt: .github/prompts/claude_review.md
  • Timestamp: 2026-02-17T02:50:05Z

- Extract _save_or_cache_artifact() shared helper used by both sync and
  batch transform runners (addresses review feedback on duplicated logic)
- Switch max_polls from fixed iteration count to 24-hour time-based loop
- Add Artifact to module-level imports for type annotation
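
The time-based loop, roughly (the constant and helper names below are illustrative, not the exact batch_runner.py code):

```python
# Sketch of the time-based poll loop: the budget is wall-clock time,
# so changing --poll-interval no longer changes how long we wait overall.
import time

MAX_POLL_SECONDS = 24 * 60 * 60  # 24h budget, independent of the interval


def poll_until_done(check_batches, poll_interval: float = 60.0) -> bool:
    """check_batches() returns True once all active batches are terminal."""
    deadline = time.monotonic() + MAX_POLL_SECONDS
    while time.monotonic() < deadline:
        if check_batches():
            return True
        time.sleep(poll_interval)
    return False  # budget exhausted; caller reports the build as still in progress
```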
@github-actions
Contributor

Note

Red Team Review — OpenAI GPT-5.2 | Adversarial review (docs + diff only)

Threat assessment — Medium-high risk: this adds a new execution mode with persistent state + new CLI surface area, and there are a couple of correctness/reliability gaps that will bite users in non-demo scenarios.

One-way doors

  1. New public CLI surface: synix batch-build {run,resume,list,status,plan}

    • Why hard to reverse: Once documented/used, command shapes, flags, and output fields become de facto API.
    • Safe if: You explicitly mark as experimental in README/website CLI reference (not just in command output/docs) and commit to keeping flags stable or versioning them. Also add --json early if you expect CI usage (otherwise you’ll later break scrapers).
  2. New persistent on-disk state format under build/builds/<build_id>/{manifest.json,batch_state.json}

    • Why hard to reverse: Any change requires migration logic or invalidates existing in-flight builds.
    • Safe if: You add a schema_version to both files now and define upgrade/invalid behavior. Without it, you’ll be stuck supporting implicit formats.
  3. Transform(batch: bool|None) added to src/synix/core/models.py

    • Why hard to reverse: It becomes part of the Python API and template patterns.
    • Safe if: You confirm this belongs on the base Transform abstraction (not an OpenAI-only knob). Otherwise you’ve baked provider-specific execution policy into core domain objects.

Findings

  1. src/synix/build/batch_runner.py: cassette-mode gate contradicts docs

    • Pattern: if cassette_mode == "replay" and not _has_cassette_responses(): raise UsageError(...)
    • Risk: Docs claim SYNIX_CASSETTE_MODE=replay is simply incompatible; code allows it if a file exists. Also docs mention rejecting record in one place but code doesn’t.
    • Severity: [warning]
  2. src/synix/build/batch_runner.py::_resolve_batch_mode: auto-batch ignores split_count > 1 spec

    • Pattern: if is_openai: return "batch" for batch=None
    • Risk: Documentation says auto-batch only when split_count > 1. Implementation batches even a single unit, which changes behavior/cost/latency unexpectedly.
    • Severity: [critical]
  3. src/synix/build/batch_runner.py::_run_batch_transform: incorrect handling of per-request failures

    • Pattern: except BatchRequestFailed: logger.warning(...); then continues; collecting/in_progress logic may conclude "completed" while failures exist and artifacts for failed keys are missing.
    • Risk: Layer may be marked completed and pipeline proceeds, but the layer output is silently partial and the failed requests aren’t surfaced in the run result until the very end. This undermines the “completed_with_errors” contract and makes debugging miserable.
    • Severity: [critical]
  4. src/synix/build/batch_client.py::_parse_batch_results: error parsing likely wrong

    • Pattern: on non-200, uses entry.get("error", {}) and HTTP code from response.status_code.
    • Risk: OpenAI batch output format usually nests errors differently; if you parse wrong, you’ll mark successful outputs as failed or store useless "unknown" errors.
    • Severity: [warning]
  5. src/synix/build/batch_state.py::compute_pipeline_hash: truncating to 16 hex chars

    • Pattern: sha256(...).hexdigest()[:16]
    • Risk: collision risk is low but not zero; worse, you’ve created an identifier users will quote and rely on while it’s explicitly non-unique.
    • Severity: [warning]
  6. src/synix/templates/05-batch-build/pipeline.py: uses private APIs

    • Pattern: imports _get_llm_client, _logged_complete from synix.build.llm_transforms
    • Risk: templates become “example canonical usage”; now your private functions are effectively public API. Refactors will break templates and user code copying them.
    • Severity: [warning]
  7. src/synix/cli/batch_commands.py resume: requires pipeline path always

    • Pattern: resume <build_id> PIPELINE
    • Risk: UX mismatch with docs in docs/batch-build.md which show resume <build-id> alone. Also forces pipeline to be present even though manifest already has hash; could instead store pipeline path in manifest.
    • Severity: [minor]
  8. Concurrency/race concerns: shared build_dir state with no locking

    • Pattern: multiple processes can queue_request / save() / record_batch() concurrently with only atomic file replace.
    • Risk: last-writer-wins will drop pending keys or results; duplicates batches; corrupt logical state without JSON corruption.
    • Severity: [warning]

Missing

  • Schema versioning for manifest.json and batch_state.json.
  • A test that asserts auto-batch respects split>1 (right now it doesn’t).
  • A test that failed requests lead to completed_with_errors and do not mark the layer complete without surfacing failure counts per-layer.
  • Docs updates in README + website CLI table to mention batch-build (this is user-facing).
  • File locking strategy (even “single-writer only” documented constraint) for build instances.

Verdict

Block — the auto-batch condition mismatch and the per-request failure handling are correctness issues that will produce confusing, silently partial builds; fix those plus add state versioning before merging.

Review parameters
  • Model: gpt-5.2
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 3,962 lines
  • Prompt: .github/prompts/openai_review.md
  • Timestamp: 2026-02-17T02:57:02Z

@github-actions
Contributor

Note

Architectural Review — Claude Opus | Blind review (docs + diff only)

Summary

This PR adds OpenAI Batch API support as an experimental synix batch-build command, enabling 50% cheaper LLM inference for OpenAI layers. It introduces a BatchLLMClient (drop-in for LLMClient), persistent build instance state, a batch runner that walks the DAG with async submit/poll/resume semantics, CLI commands, a demo template, and comprehensive tests.

Alignment

Strong fit. The build system analogy extends naturally: batch builds are like distributed compilation — same DAG, same artifacts, different execution strategy. Artifacts remain content-addressed and immutable (shared ArtifactStore, unchanged). Provenance chains stay complete — _save_or_cache_artifact records the same parent labels and prompt IDs. Cache keys reuse compute_cassette_key() (SHA256 over all output-affecting params), consistent with DESIGN.md's materialization key philosophy. The batch parameter on Transform is a runtime concern, not an architectural one — aligning with Hypothesis 3 ("architecture is a runtime concern"). The Python-first API (batch=True on layer constructors) follows the CDK-over-config decision.

Observations

  1. [positive] batch_client.py:complete() — The control-flow-exception pattern (BatchCollecting/BatchInProgress/BatchRequestFailed) is clean. Transforms genuinely don't know they're batched. This preserves the transform contract from DESIGN.md.

  2. [positive] Test coverage is thorough: unit tests for client, state, and keying; e2e tests covering happy path, resume idempotency, corrupted state, fingerprint mismatch, mixed providers, force-sync, cassette replay, and request key correctness. The spec review doc (batch-build-review.md) shows self-auditing.

  3. [concern] batch_runner.py:_poll_and_resume has an 86400s (24h) hardcoded max poll duration with no user override. For CI/automation use cases (noted as P1 issue #54, non-interactive automation mode for CI/scripted runs), this could hang a job. Consider exposing --max-poll-duration or at least making the constant configurable.

  4. [concern] batch-build-review.md finding #1 identifies a real contradiction: the recording workflow sets SYNIX_CASSETTE_MODE=record then calls batch-build run, but batch_runner.py only checks for replay mode — record mode silently passes through. The review doc flags this but the code doesn't resolve it. The spec and implementation should be reconciled before merge.

  5. [concern] batch_runner.py:_run_batch_transform — when BatchInProgress is caught, the loop breaks, skipping remaining units. On resume, the full unit list is re-iterated. This is correct but means partially-collected layers lose queued requests from units after the in-progress one. If a layer has units in different batches, only the first in-progress batch blocks.

  6. [question] batch_state.py:compute_pipeline_hash uses json.dumps(layer.config, sort_keys=True) but layer.config could contain non-JSON-serializable values (e.g., callable validators from DESIGN.md's extension model). This would crash on resume fingerprint check.

  7. [nit] pipeline.py template duplicated — identical file at both src/synix/templates/05-batch-build/pipeline.py and templates/05-batch-build/pipeline.py. Same for source bios. Intentional (package vs. repo copy) or accidental?

  8. [nit] batch_commands.py:resume takes pipeline_path as a positional argument after build_id, but the spec doc shows synix batch-build resume <build-id> without explicit pipeline path. The CLI ergonomics differ from the spec's implied UX.

  9. [positive] Golden output files with normalization regexes for batch IDs, pipeline hashes, and timestamps in demo_commands.py — good approach for deterministic CI against inherently dynamic output.

  10. [concern] _load_cassette_responses reads from SYNIX_CASSETTE_DIR but _has_cassette_responses also checks SYNIX_CASSETTE_DIR. Both are called repeatedly in hot paths (per-layer). Minor, but could be cached once per build invocation.

Verdict

This is a well-designed incremental step that adds meaningful cost optimization without compromising the core build-system invariants — the cassette-mode contradiction (finding #4) and the pipeline hash serialization edge case (#6) should be resolved before merge.

Review parameters
  • Model: claude-opus-4-6
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 3,962 lines
  • Prompt: .github/prompts/claude_review.md
  • Timestamp: 2026-02-17T02:57:15Z

…tch-build demo

- Record real OpenAI Batch API + Anthropic cassettes (no more synthetic data)
- Add sync build path (steps 7-10) to demo — same pipeline, no Batch API
- Add `batch-build status --latest` step to exercise all subcommands
- Add normalization rules for pipeline hashes, datetimes, dashboard URLs, OpenAI batch IDs
- Regenerate all goldens from real API recordings
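
The normalization pass looks roughly like this (the patterns below are illustrative examples, not the exact rules added to demo_commands.py):

```python
# Illustrative normalization pass for golden comparison; real patterns differ.
import re

_RULES = [
    (re.compile(r"batch_[A-Za-z0-9]+"), "batch_<ID>"),                       # OpenAI batch IDs
    (re.compile(r"\b[0-9a-f]{16}\b"), "<PIPELINE_HASH>"),                    # truncated pipeline hashes
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*"), "<DATETIME>"),  # timestamps
    (re.compile(r"https://platform\.openai\.com/batches/\S+"), "<DASHBOARD_URL>"),
]


def normalize(text: str) -> str:
    """Replace run-specific values so goldens stay stable across recordings."""
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text
```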
@github-actions
Contributor

Note

Red Team Review — OpenAI GPT-5.2 | Adversarial review (docs + diff only)

Threat assessment — Medium risk: you’re introducing a new execution mode with persistent state + new CLI surface area, and the design has a couple correctness holes that will bite under partial failures and resume scenarios.

One-way doors

  1. New CLI surface: synix batch-build {run,resume,list,status,plan}
    Hard to reverse because it becomes documented workflow + scripts/CI rely on it. Safe to merge only if: CLI semantics are stable-ish (even “experimental”), and you commit to compatibility guarantees around build IDs, directory layout, and resume behavior.

  2. On-disk state format: <build_dir>/builds/<build_id>/{manifest.json,batch_state.json}
    Hard to reverse because users will have in-flight builds and stored state across upgrades. Safe only if: you add a version field + migration/compat story, or explicitly document “state may break between versions” and code defensively.

  3. Public-ish API change: Transform(..., batch: bool|None) in src/synix/core/models.py
    This is now part of pipeline authoring mental model and will get copied into user pipelines/templates. Safe only if: you’re confident this is the right knob (vs runner-level policy), and you document precedence and provider compatibility.

Findings

  1. src/synix/build/batch_runner.py: cassette-mode contract does not match docs
    Pattern: pre-validation only blocks SYNIX_CASSETTE_MODE=replay without batch_responses.json, but docs/batch-build.md says hard error for replay; and earlier review doc flags record confusion.
    Failure mode: users set cassette mode for normal builds/demos and get surprising behavior divergence; spec and implementation drift.
    Severity: [warning]

  2. src/synix/build/batch_runner.py::_run_batch_transform: failed requests are silently dropped from layer outputs
    Pattern: except BatchRequestFailed: logger.warning(...); and continue; no placeholder artifact, no error propagation, and critically no recording into layer_built / layer_artifacts beyond already-built ones.
    Failure mode: downstream layers run with missing inputs without any strong signal at execution time; your “completed_with_errors” only happens at end and is global, not per-layer gating. This violates “build system” expectations—partial outputs should be explicit and discoverable in the build summary.
    Severity: [critical]

  3. src/synix/build/batch_runner.py: no re-queue behavior for expired despite docs
    Docs promise: “Batch expired → re-queue failed requests, submit new batch”.
    Implementation in BatchLLMClient.check_and_download: marks keys as errors on expired/failed/cancelled and never re-queues.
    Failure mode: users believe batch-build is self-healing; it is not. This is a design/doc mismatch and will cause data loss unless manually reset-state and rerun.
    Severity: [critical]

  4. src/synix/build/batch_state.py::compute_pipeline_hash: hash is under-specified and unstable
    Pattern: includes pipeline.source_dir/build_dir and iterates pipeline.layers in declared order; doesn’t include transform source code/prompt identity unless it lives inside layer.config.
    Failure mode: false negatives (pipeline changed but hash unchanged) or false positives (path changes) causing unnecessary hard-stop. Also “resume safety” is weaker than claimed.
    Severity: [warning]

  5. src/synix/build/batch_client.py: request-key correctness depends on compute_cassette_key() behavior you didn’t modify here
    Docs claim normalization (\r\n\n, strip trailing whitespace on string values). This PR doesn’t show changes to cassette keying.
    Failure mode: key collisions or mismatches between cassette and batch modes; duplicated submissions on resume.
    Severity: [warning]

  6. src/synix/build/batch_client.py::_parse_batch_results: fragile parsing of OpenAI JSONL
    Pattern: assumes choices[0].message.content exists; ignores tool/response formats; treats non-200 as entry["error"] but OpenAI may encode errors differently.
    Failure mode: stores empty content as “success”, creating garbage artifacts; or misclassifies errors.
    Severity: [warning]

  7. src/synix/templates/05-batch-build/*: duplicated template under both src/synix/templates and top-level templates/
    Failure mode: packaging confusion and drift (which one is installed vs used in demos).
    Severity: [minor]

Missing

  • State/versioning: manifest.json / batch_state.json lack a schema_version.
  • Concurrency/race handling: no file lock around batch_state.save()/manifest writes; two resumes can corrupt or interleave state.
  • Docs update to README/CLI reference: batch-build is user-facing but not mentioned in README CLI table.
  • Explicit per-layer error reporting in CLI output: run/resume summary doesn’t show failed request counts per layer.

Verdict

Block — doc/behavior mismatches + expired-batch behavior contradicts spec, and silent partial-output handling is a correctness footgun for a build system. Fix those before shipping this CLI/state format.

Review parameters
  • Model: gpt-5.2
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 4,274 lines
  • Prompt: .github/prompts/openai_review.md
  • Timestamp: 2026-02-17T03:57:59Z

@github-actions
Contributor

Note

Architectural Review — Claude Opus | Blind review (docs + diff only)

Summary

This PR adds OpenAI Batch API support (synix batch-build) as an experimental feature for 50% cheaper LLM inference. It introduces a BatchLLMClient drop-in for LLMClient, persistent build instance state, a multi-command CLI group, a demo template (05-batch-build), and comprehensive unit + e2e tests. The spec document (batch-build.md) and a self-review (batch-build-review.md) are also included.

Alignment

Strong fit. DESIGN.md identifies LLM cost as a key differentiator from SQL-based build systems ("$0.01-$1.00 per call vs. milliseconds of SQL. Incremental rebuild matters more."). Batch API directly addresses this. The design preserves core invariants: artifacts remain content-addressed and immutable, the DAG walk order is unchanged, provenance chains are complete (batch artifacts go through _save_or_cache_artifact with the same fingerprint/provenance path), and cache keys reuse compute_cassette_key() — the same materialization-key philosophy from §3.3. The batch parameter on Transform is None-by-default, so existing pipelines are unaffected — consistent with "architecture is a runtime concern."

Observations

  1. [concern] batch-build-review.md finding #1 is unresolved in this PR. The recording workflow in batch-build.md:436-443 sets SYNIX_CASSETTE_MODE=record then runs synix build, but the hard-error table says SYNIX_CASSETTE_MODE set to replay is rejected. The actual code in batch_runner.py:93 only rejects replay without cassette responses — record mode is silently allowed. The review doc flags this but the spec text at line 178 says the broader "SYNIX_CASSETTE_MODE set" is an error. Spec and code should be reconciled.

  2. [concern] _resolve_batch_mode auto-batches any OpenAI layer regardless of split_count. The spec table says auto-batch triggers when "Provider is OpenAI and split_count > 1," but the code at batch_runner.py:229 doesn't check split count — it returns "batch" for any OpenAI provider. This means a single-unit layer (like CoreSynthesis) would unnecessarily go through the batch path.

  3. [concern] Duplicate file tree. pipeline.py and the three sources/bios/*.md files appear identically under both src/synix/templates/05-batch-build/ and templates/05-batch-build/. Other templates presumably follow a convention — is the duplication intentional or an oversight?

  4. [concern] _poll_and_resume has a 24-hour hard timeout with no user feedback. The while loop at batch_runner.py:359 silently spins for up to 86400s. If run interactively without --poll, this isn't hit, but with --poll on a slow batch there's no progress output — just time.sleep() in a tight loop. A periodic status log line would help.

  5. [question] batch_state.json writes on every save() call, which happens after every status check in the poll loop. At 60s intervals this is fine, but at --poll-interval 1 it's a write per second. Is atomic_write cheap enough? Probably yes for JSON this size, but worth noting.

  6. [positive] Test coverage is excellent. Unit tests for BatchLLMClient and BatchState cover happy paths, partial failures, corruption quarantine, request-key distinctness, and round-trip persistence. E2E tests cover the full CLI surface including fingerprint mismatch, --reset-state, cassette replay gating, mixed providers, and idempotent resume. The spec's own "Suggested Test Additions" section lists cassette incompatibility and force-flag safety — both are covered.

  7. [positive] Clean extension model. Transforms remain completely unaware of batch mode — complete() returns LLMResponse or raises control-flow exceptions caught by the runner. No changes to the Transform.execute() contract beyond the documented idempotency requirement.

  8. [nit] Golden file artifacts.stdout.txt:15 has │ ws-bob │ <N> │ where the artifact ID is normalized away, but other rows show real 7-char hashes (876eff9, 6b999a6). Looks like a normalization inconsistency that could cause flaky golden comparisons.

  9. [nit] batch_runner.py imports click for UsageError — a build-engine module depending on a CLI framework is a mild layering violation. Consider raising a domain error and catching it in the CLI layer.

Verdict

This is a well-designed, thoroughly tested feature addition that directly serves the project's cost-reduction goals while preserving all core architectural invariants — a strong incremental step, pending resolution of the spec/code mismatch on auto-batch triggering and cassette-mode semantics.

Review parameters
  • Model: claude-opus-4-6
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 4,274 lines
  • Prompt: .github/prompts/claude_review.md
  • Timestamp: 2026-02-17T03:58:16Z

OpenAI SDK raises on client creation without OPENAI_API_KEY,
even though replay mode never makes real API calls. Provide a
placeholder key so SDK clients can be constructed on CI where
API keys aren't available.
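
Roughly (a sketch of the workaround; the exact code in llm_client.py differs):

```python
# Sketch of the replay-mode workaround; env var names follow the PR.
import os

from openai import OpenAI


def make_openai_client() -> OpenAI:
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key and os.environ.get("SYNIX_CASSETTE_MODE") == "replay":
        # Replay never hits the network, but the SDK refuses to construct a
        # client without a key, so hand it a placeholder on CI.
        api_key = "sk-replay-placeholder"
    return OpenAI(api_key=api_key)
```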
@github-actions
Contributor

Note

Red Team Review — OpenAI GPT-5.2 | Adversarial review (docs + diff only)

Threat assessment — High risk: this PR introduces a new persistent on-disk state machine + new CLI surface area + control-flow-by-exception in core execution paths, with several correctness holes that will be painful to undo.

One-way doors

  1. New public CLI namespace: synix batch-build ...

    • Why hard to reverse: Users will script against these commands/flags and rely on semantics (run/resume/list/status/plan, --allow-pipeline-mismatch, --reset-state, polling behavior).
    • Safe to merge if: Marked clearly experimental in README/website, version-gated, and you commit to a stability story or explicitly reserve the right to break (and reflect that in docs).
  2. Persistent build instance format under build/builds/<build_id>/{manifest.json,batch_state.json}

    • Why hard to reverse: This is a stateful artifact format that will accumulate; breaking it requires migration logic or “delete your build dir” guidance (which is unacceptable if users have in-flight 24h batches).
    • Safe to merge if: You add a state_version, forward-compatible parsing, and explicit migration/invalid-state handling beyond “quarantine and die”.
  3. Public pipeline API change: Transform(..., batch: bool|None)

    • Why hard to reverse: This becomes part of the Python-first contract. Removing/renaming breaks pipelines.
    • Safe to merge if: You define semantics precisely (including interaction with provider overrides, split semantics, and caching), and add validation errors that are consistent across build and batch-build.

Findings

  1. src/synix/build/batch_runner.py — cassette mode contract mismatch

    • Pattern: pre-validation only blocks SYNIX_CASSETTE_MODE=replay unless batch_responses.json exists; docs (docs/batch-build.md) claim replay is incompatible outright (and earlier review notes mention record-mode issues too).
    • Failure mode: users follow docs and hit surprising hard errors / inconsistent behavior; CI/replay story becomes brittle.
    • Severity: [warning]
  2. src/synix/build/batch_runner.py::_resolve_batch_mode — provider classification is wrong

    • Pattern: is_openai = provider in ("openai", "deepseek", "openai-compatible").
    • Failure mode: you’ll attempt OpenAI Batch API against non-OpenAI providers (DeepSeek is not “OpenAI batch”), yielding runtime failures and confusing user experience.
    • Severity: [critical]
  3. batch-build does not actually implement “per-layer batching”; it batches per request key globally

    • Files: BatchState stores _pending keyed only by request key; queue_request() overwrites by key; record_batch() maps key→batch.
    • Failure mode: two different layers producing identical compute_cassette_key() (same model/messages/params) collide; second overwrites first; batch attribution becomes wrong; artifacts can be built from the wrong layer’s results.
    • Severity: [critical]
    • Fix: keys must be namespaced with (pipeline_hash/build_id, layer_name, maybe unit identity) or pending must be nested by layer.
  4. _run_batch_transform swallows BatchRequestFailed without recording failure accounting

    • Pattern: logs warning only; doesn’t persist per-artifact failure into manifest counts, doesn’t mark build completed_with_errors until very end (and even then errors may not be stored if failure occurred before download/parsing).
    • Failure mode: silent partial output; downstream runs with missing artifacts without a strong signal at the layer boundary.
    • Severity: [warning]
  5. batch_state.json growth is unbounded

    • Files: BatchState._results/_errors never pruned; save() writes full pretty-printed JSON every time.
    • Failure mode: large pipelines -> multi-GB JSON, slow writes, high memory, corruption risk; atomic rewrite becomes expensive.
    • Severity: [critical]
    • Fix: move to SQLite or sharded files; at minimum compress/stream and avoid indent; add pruning/compaction.
  6. Race conditions / concurrency unsafety

    • Files: BatchState has no file locking; batch-build “resume” can be invoked concurrently (user mistake, CI retry, etc.).
    • Failure mode: interleaved atomic writes still lose updates (last writer wins) and can cause duplicate submissions.
    • Severity: [critical]
  7. src/synix/build/llm_client.py — placeholder API key in replay mode

    • Pattern: injects "sk-replay-placeholder" when replay.
    • Failure mode: may accidentally hit real network if cassette miss occurs (now you have a “valid-looking” key); also leaks into logs/requests in some SDKs.
    • Severity: [warning]
    • Fix: fail fast on cassette miss; don’t fabricate credentials.
  8. Templates duplicated in two locations

    • Files: both src/synix/templates/05-batch-build/... and templates/05-batch-build/... added.
    • Failure mode: divergence, packaging confusion, tests referencing wrong assets.
    • Severity: [warning]

Missing

  • State versioning in manifest.json/batch_state.json and migration story.
  • File locking or single-writer enforcement for build instance state.
  • Layer-scoped request identity (see collision issue).
  • Docs update in README/website: new command group and explicit experimental/stability notice.
  • Tests for key collision across layers and for concurrent resume/run behavior.
  • Clear “provider support” contract: OpenAI Batch API should be provider=="openai" only, not “compatible”.

Verdict

Block — core correctness issues (request key collisions, wrong provider gating, unbounded state file, no concurrency safety) make this unsafe to ship even as “experimental”.

Review parameters
  • Model: gpt-5.2
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 4,299 lines
  • Prompt: .github/prompts/openai_review.md
  • Timestamp: 2026-02-17T04:07:46Z

@github-actions
Contributor

Note

Architectural Review — Claude Opus | Blind review (docs + diff only)

Summary

This PR adds OpenAI Batch API support as an experimental synix batch-build command group. It introduces a BatchLLMClient (drop-in for LLMClient), persistent build instance state (BatchState/BuildInstance), a batch-aware DAG runner, CLI commands (run/resume/list/status/plan), a demo template (05-batch-build), and comprehensive unit + e2e tests. The design spec (batch-build.md) and its own review doc are included.

Alignment

Strong fit. DESIGN.md §2.1 explicitly calls out that "LLM transforms are expensive — $0.01-$1.00 per call" and that caching matters more than in SQL pipelines. A 50% cost reduction directly serves this. Artifacts remain content-addressed and immutable — BatchLLMClient produces the same LLMResponse objects, and _save_or_cache_artifact runs the same fingerprint/provenance path. Request keys reuse compute_cassette_key() (SHA256 over all output-affecting params), consistent with materialization key philosophy from §3.3. The batch parameter on Transform is None by default and ignored by synix build, preserving the existing build path completely — a clean extension, not a mutation.

Observations

  1. [concern] batch-build-review.md §1 flags that SYNIX_CASSETTE_MODE=record conflicts with the documented recording workflow in batch-build.md:436-443. The code in batch_runner.py:87 only rejects replay (not record), but batch-build.md:178 says "SYNIX_CASSETTE_MODE set" is a hard error. The spec and implementation disagree. The review doc caught this but it's not resolved in the PR.

  2. [concern] batch_runner.py duplicates LLM config resolution logic (base_config.update(layer.config["llm_config"])) in at least four places: _resolve_batch_mode, _run_batch_transform, _poll_and_resume, and the plan function. This should be a helper. It's also fragile — if llm_config is a non-dict mapping type, dict(pipeline.llm_config) could behave unexpectedly.

  3. [concern] _poll_and_resume has an 86400-second hard-coded max duration with no way to override it. For CI or scripted use (issue #54, non-interactive automation mode for CI/scripted runs), this is a footgun — a stuck batch holds the process for 24 hours.

  4. [question] batch_runner.py:_run_source_layer silently swallows source loading exceptions with logger.warning and continues with an empty list. This differs from what regular synix build likely does. Should a source failure in batch mode be a hard error?

  5. [question] The pipeline template at src/synix/templates/05-batch-build/pipeline.py is duplicated identically at templates/05-batch-build/pipeline.py. Same for the three bio source files. Is this intentional (packaged template vs. repo template), or a copy-paste oversight?

  6. [concern] _save_or_cache_artifact in batch_runner.py reimplements logic that likely exists in runner.py. The PR imports several private functions from runner.py (_build_source_config, _build_transform_config, _gather_inputs, etc.) but creates a new save helper. This parallel implementation will drift.

  7. [positive] Test coverage is thorough. Unit tests (test_batch_client.py, test_batch_state.py) cover the happy path, all exception types, key uniqueness, persistence round-trips, corruption quarantine, and partial failures. E2E tests cover mixed providers, fingerprint mismatch, --reset-state, cassette replay gating, and batch=False forcing sync. The spec's own test list in batch-build.md maps closely to what's implemented.

  8. [positive] The corruption recovery design (quarantine + explicit --reset-state) is well-considered and avoids the "silently start fresh" anti-pattern. Atomic writes via atomic_write() are used consistently.

  9. [nit] llm_client.py adds a "sk-replay-placeholder" API key for cassette replay mode. This is practical but should have a comment explaining it won't leak into real calls — it's easy to misread as a security issue.

  10. [positive] Golden output files and _normalize_output regex additions for batch-specific dynamic values (build IDs, timestamps, pipeline hashes) show disciplined approach to snapshot testing.

  11. [concern] batch_state.py:create_fresh uses object.__new__(cls) to bypass __init__, then manually assigns every attribute. This is fragile — adding a field to __init__ without updating create_fresh will cause AttributeError at runtime with no compile-time safety net.

Verdict

This is a well-structured, well-tested incremental step that directly addresses the project's cost sensitivity thesis — the spec/implementation cassette-mode inconsistency (#1) and the duplicated config resolution logic (#2) should be resolved before merge.

Review parameters
  • Model: claude-opus-4-6
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 4,299 lines
  • Prompt: .github/prompts/claude_review.md
  • Timestamp: 2026-02-17T04:08:01Z

- batch=True rejects non-OpenAI providers (deepseek, openai-compatible)
- batch=True rejects custom base_url (Batch API is platform-specific)
- batch=None (auto) only routes to batch for native OpenAI (no base_url)
- Source layer failures propagate instead of being swallowed
- BatchRequestFailed raises RuntimeError instead of logging a warning
- Poll loop errors raise immediately instead of silently retrying
- Cassette parse errors fail hard (no silent None return)
- Plan mode still allows missing API key (shows sync), but provider
  misconfiguration fails immediately
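
The gating rules above, as a sketch (the function shape and names are illustrative, not the exact _resolve_batch_mode implementation):

```python
# Sketch: mirrors the rules listed above; not the exact _resolve_batch_mode code.
from typing import Optional


def resolve_batch_mode(batch: Optional[bool], provider: str, base_url: Optional[str]) -> str:
    """batch is the per-layer tri-state: True, False, or None (auto)."""
    if batch is False:
        return "sync"
    native_openai = provider == "openai" and base_url is None
    if batch is True:
        if not native_openai:
            # The Batch API is platform-specific: reject deepseek, openai-compatible,
            # and custom base_url up front instead of failing later at submit time.
            raise ValueError(f"batch=True requires native OpenAI, got provider={provider!r}")
        return "batch"
    # batch=None (auto): only native OpenAI routes to the Batch API; everything else stays sync.
    return "batch" if native_openai else "sync"
```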
@github-actions
Contributor

Note

Red Team Review — OpenAI GPT-5.2 | Adversarial review (docs + diff only)

Threat assessment — High risk: this PR introduces a new CLI surface + new persistence format + new execution mode, with several correctness holes that will create silent wrong builds or hard-to-debug behavior.

One-way doors

  1. New public CLI group synix batch-build …
    Hard to reverse once documented/templates depend on it. Safe only if command semantics, flags, and error model are nailed down (especially resume/mismatch/reset behavior) and documented in main README/CLI reference.

  2. On-disk build-instance state format under build/builds/<id>/{manifest.json,batch_state.json}
    Users will accumulate these; schema changes require migrations or compatibility code. Safe only if you version the schema (e.g., schema_version) and define forward/backward behavior.

  3. Transform(..., batch: bool|None) added to src/synix/core/models.py
    This becomes part of the Python API. Reversing/renaming later breaks pipelines. Safe only if API stability is explicitly “experimental” in docs and you’re confident batch belongs on Transform rather than an execution option.

Findings

  1. src/synix/build/batch_runner.py: cassette-mode gate is wrong and contradicts docs
    Code only blocks SYNIX_CASSETTE_MODE=replay when no batch_responses.json exists; docs say replay is incompatible (and earlier review mentions record-mode conflict too). This creates surprising behavior and will break the recording story. [critical]

  2. src/synix/build/batch_runner.py + docs: partial failure semantics not implemented
    Docs promise per-request failures yield completed_with_errors and downstream gets partial inputs. Code raises RuntimeError on BatchRequestFailed in _run_batch_transform, which aborts the build. [critical]

  3. src/synix/build/batch_client.py::_parse_batch_results: wrong error parsing
    For non-200 responses, it reads entry.get("error"), but OpenAI batch output typically nests errors differently (often inside response.body.error). This will misclassify failures as unknown and lose messages. [warning]

  4. src/synix/build/batch_runner.py:_poll_and_resume: doesn’t refresh active batch list
    active_batch_ids computed once; if new batches are submitted later (multi-layer), the poll loop won't notice. Also, it only polls the current layer, but the runner returns after the first batchable layer anyway, making multi-batch-per-layer or multi-layer batch sequencing fragile. [warning]

  5. src/synix/build/batch_runner.py: plan_batch request estimation is nonsense
    Uses estimate_output_count of deps in a way that doesn’t reflect actual split units. This will mislead users about batch size/cost. [minor]

  6. src/synix/build/llm_client.py: placeholder API key for replay is a footgun
    You’re mutating auth behavior based on env. If a provider SDK validates key format server-side or logs it, you risk confusing failures/leaking “fake” keys into logs. Better: lazily construct SDK client only when needed. [warning]

  7. src/synix/build/batch_state.py: pipeline hash includes source_dir/build_dir paths
    Resuming the same logical pipeline from a different working directory becomes a fingerprint mismatch for no semantic reason. That makes resume brittle. [warning]

  8. src/synix/build/batch_runner.py:_save_or_cache_artifact: provenance/cache correctness unclear
    It computes build fingerprint from transform_fp and artifact.input_ids, but ignores that batch request key includes model/messages/max_tokens/temperature. If transform_fp doesn’t include those (unknown here), you can cache wrong outputs. Needs explicit linkage. [warning]

  9. templates/05-batch-build: duplicated under both src/synix/templates/... and templates/...
    Two sources of truth; packaging/installation will likely include only one. This will break demo runner or users. [critical]

  10. templates/05-batch-build/golden/artifacts.stdout.txt: corrupted golden (ws-bob artifact id shows <N> with broken table cell)
    This suggests demo normalization is brittle; CI will be flaky. [minor]
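
For illustration, one way the over-matching could be avoided; this is a sketch against an assumed cell-by-cell normalizer, not the actual demo_commands.py logic:

```python
import re

# Hypothetical sketch: collapse only cells that are purely digits, so a
# short artifact hash (e.g. the ws-bob row's id) is never rewritten to <N>.
NUM_CELL = re.compile(r"\d+")


def normalize_cell(cell: str) -> str:
    value = cell.strip()
    return "<N>" if NUM_CELL.fullmatch(value) else cell
```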

Missing

  • README / website / CLI reference updates for the new batch-build command group and the new batch pipeline parameter.
  • Schema versioning in manifest.json / batch_state.json.
  • Concurrency / locking: nothing prevents two resume processes from writing the same state dir → race/corruption.
  • Integration tests for partial failures matching the documented completed_with_errors behavior (current tests don’t assert downstream behavior).
  • Clear contract for cassette modes across build vs batch-build (record/replay/off), enforced consistently.

Verdict

Block. The docs promise behaviors the implementation doesn’t provide (partial failures), state/CLI surfaces are one-way doors without versioning/lock safety, and the template duplication will cause immediate user breakage. Fix the critical items before merging.

Review parameters
  • Model: gpt-5.2
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 4,311 lines
  • Prompt: .github/prompts/openai_review.md
  • Timestamp: 2026-02-17T04:17:44Z

@github-actions
Copy link
Contributor

Note

Architectural Review — Claude Opus | Blind review (docs + diff only)

Summary — Adds OpenAI Batch API support as an experimental synix batch-build command group, enabling 50% cheaper LLM inference for eligible layers. Introduces BatchLLMClient (drop-in for LLMClient), BatchState persistence, a BatchRunner, CLI commands (run/resume/list/status/plan), a demo template, and comprehensive tests. A detailed design spec (batch-build.md) and self-review (batch-build-review.md) are included.

Alignment — Strong fit. DESIGN.md §2.1 explicitly calls out that "LLM transforms are expensive" and caching/cost matter more than in dbt. A 50% cost reduction directly serves this. The batch client reuses compute_cassette_key() for request identity, preserving content-addressed semantics. Artifacts still flow through the shared ArtifactStore with fingerprint-based caching — batch state is lifecycle metadata only. The existing synix build path is completely untouched, consistent with the "Python-first, evolve without breaking" ethos. The batch parameter on Transform is None by default and ignored by synix build, respecting the principle that pipelines are Python objects users control.

Observations

  1. [positive] batch_state.json corruption handling is thorough — quarantine + explicit --reset-state recovery. The self-review doc (batch-build-review.md) shows genuine adversarial thinking about failure modes.

  2. [positive] Test coverage is excellent: 368-line unit test for BatchLLMClient, 287-line unit test for BatchState, 602-line e2e test covering happy path, fingerprint mismatch, corrupted state, partial failures, mixed providers, cassette replay, request key correctness, and idempotent resume.

  3. [concern] batch_runner.py:_resolve_batch_mode checks is_openai_native = provider == "openai" and not base_url, but the auto-detect path (batch=None) silently batches any OpenAI-provider layer. If someone sets provider: "openai" expecting synchronous behavior (e.g., in existing pipelines), batch-build run will batch them without explicit opt-in. The spec says this is "auto" but the surprise radius is worth documenting more prominently.

  4. [concern] The self-review (batch-build-review.md, finding #1) identifies that SYNIX_CASSETTE_MODE=record conflicts with the recording instructions in the spec. The code in batch_runner.py:92 only blocks replay without batch_responses.json, but the spec's recording workflow sets SYNIX_CASSETTE_MODE=record then runs synix build (not batch-build) for Anthropic, then unsets it. This is fine, but the recording instructions at lines 436-443 of the spec could still confuse operators. The review identified this but it's not resolved in the code or docs.

  5. [question] _poll_and_resume has a hardcoded max_duration = 86400 (24h). Should this be configurable or at least documented? A user running --poll might not expect a 24-hour hang.

  6. [concern] batch_runner.py duplicates LLM config resolution logic (lines ~255, ~310, ~350) — base_config = dict(pipeline.llm_config)... if layer.config.... This pattern appears 3+ times. Extract to a shared helper to avoid drift.

  7. [nit] The pipeline template exists in both src/synix/templates/05-batch-build/ and templates/05-batch-build/ with identical content. Source files too. Is this intentional dual-location, or should one be a symlink?

  8. [nit] golden/artifacts.stdout.txt line 15 has │ ws-bob │ <N> │ where the artifact ID got normalized away — looks like a golden file generated with normalization leaking into the expected output differently than other rows.

  9. [positive] BatchRequestFailed as a distinct exception from BatchInProgress and BatchCollecting keeps the control-flow contract clean. Transforms never see error sentinels — consistent with DESIGN.md's "keep the transform interface clean."

  10. [question] plan_batch catches ValueError with "API_KEY" in str(exc) to degrade gracefully. This string-matching heuristic is fragile — a refactored error message breaks the plan command silently.
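
A hedged sketch of the obvious alternative: a dedicated exception type that plan mode can catch without string matching (the exception name and helper are hypothetical):

```python
# Hypothetical: a dedicated exception type instead of string matching.
class MissingAPIKeyError(ValueError):
    """Raised when a provider is selected but its API key is not configured."""


def plan_layer(layer_name: str, resolve_provider) -> str:
    try:
        resolve_provider()  # would raise MissingAPIKeyError if the key is absent
        return f"{layer_name}: batch"
    except MissingAPIKeyError:
        # Plan mode degrades gracefully: report sync instead of failing.
        return f"{layer_name}: sync (no API key)"
```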

Verdict — This is a well-designed, well-tested incremental step that directly addresses the project's cost-of-LLM-calls concern while preserving all core invariants. The duplicated config resolution and the unresolved cassette-mode documentation issue deserve follow-up, but neither blocks merge.

Review parameters
  • Model: claude-opus-4-6
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 4,311 lines
  • Prompt: .github/prompts/claude_review.md
  • Timestamp: 2026-02-17T04:18:01Z

Add CLI reference rows for all batch-build subcommands and a full
"Batch Build (Experimental)" section covering poll/resume workflows,
the batch parameter, provider restrictions, and transform requirements.
@github-actions
Copy link
Contributor

Note

Red Team Review — OpenAI GPT-5.2 | Adversarial review (docs + diff only)

Threat assessment — High risk: this PR introduces new persistent state formats + a new CLI surface + control-flow via exceptions; it’s easy to get subtle correctness and “stuck build” behavior wrong.

One-way doors

  1. New CLI command group: synix batch-build …

    • Why hard to reverse: Once documented, users script against it; flags (--poll, --allow-pipeline-mismatch, --reset-state, --latest) become contractual.
    • Safe if: The command semantics are consistent with build/plan philosophy; error model is stable; docs + help text match actual behavior; deprecation path is defined if it changes.
  2. Persistent build-instance state format under build/builds/<build_id>/{manifest.json,batch_state.json}

    • Why hard to reverse: Users will rely on resuming across versions; any schema changes need migrations or compatibility logic.
    • Safe if: Versioning is embedded in the files (schema version), and readers tolerate additive fields and can detect incompatible versions cleanly.
  3. Public API shape: Transform(..., batch: bool|None) added to core/models.py

    • Why hard to reverse: It’s now part of the Python pipeline declaration model; downstream code and templates will use it.
    • Safe if: It’s clearly “experimental”, validated centrally, and doesn’t leak into non-batch execution semantics (today you claim synix build ignores it—must stay true).
  4. Request-key definition reusing compute_cassette_key()

    • Why hard to reverse: Keying determines dedupe and replay compatibility; changing it invalidates recorded batch responses and can create duplicate submissions.
    • Safe if: The key spec is frozen or explicitly versioned; normalization rules are unambiguous and tested.

Findings

  1. src/synix/build/batch_runner.py: cassette mode gate contradicts spec
    Pattern: if cassette_mode == "replay" and not _has_cassette_responses(): … UsageError
    Risk: Docs (docs/batch-build.md) say hard error for SYNIX_CASSETTE_MODE=replay, but code allows replay if a file exists. That’s a behavior contract mismatch and will rot.
    Severity: [warning]

  2. docs/batch-build.md vs implementation: per-request failure policy not implemented
    Doc says per-request failures -> “artifact not created; continue; completed_with_errors”.
    Code: _run_batch_transform catches BatchRequestFailed and raises RuntimeError, aborting the layer/build. No partial success.
    Severity: [critical]

  3. src/synix/build/batch_client.py::_parse_batch_results: incorrect error handling on non-200
    Pattern: for non-200, uses entry.get("error", {}) but message references response.status_code.
    Risk: OpenAI batch output format may put errors under response.body.error rather than top-level error; you may silently store “unknown” errors or lose diagnostics. Needs to follow actual OpenAI JSONL schema precisely and test against real samples.
    Severity: [warning]

  4. src/synix/build/batch_runner.py: plan_batch() request estimation is nonsense
    Pattern: dep_counts = sum(dep_layer.estimate_output_count(1) ...) then layer.estimate_output_count(dep_counts)
    Risk: This is not a stable estimate of “requests” and will mislead users (and docs claim plan shows “estimated batch count”). Either compute from cached artifact counts / source scan or call it “unknown”.
    Severity: [minor]

  5. src/synix/build/batch_runner.py: batch mode auto-detect ignores “split_count > 1” rule from docs
    Docs claim auto-batch only when multiple work units; code batches for any OpenAI-native transform even if it yields one unit.
    Severity: [warning]

  6. src/synix/build/batch_runner.py: unused imports and likely dead code
    Pattern: imports json, Artifact, click used partially; compute_build_fingerprint vs compute_pipeline_hash inconsistency; cassette_mode == replay special casing.
    Risk: suggests churny implementation; increases maintenance and hidden behavior.
    Severity: [minor]

  7. src/synix/build/llm_client.py: placeholder API key in cassette replay
    Pattern: if replay and missing key, set sk-replay-placeholder.
    Risk: if any code path accidentally makes a network call, you’ll get confusing auth errors instead of a clear “replay should not hit network”. This is masking failures.
    Severity: [warning]

  8. BatchState.compute_pipeline_hash: hash includes source_dir/build_dir paths
    Risk: Resuming across machines/paths becomes impossible without --allow-pipeline-mismatch, pushing users toward the dangerous bypass. Hash should be about build semantics, not local filesystem layout.
    Severity: [warning]

  9. Concurrency/race: no locking for build/builds/<id>
    Risk: two resume --poll processes can clobber batch_state.json despite atomic writes (last-writer wins), causing lost results/pending keys. Need a file lock.
    Severity: [critical]
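
A minimal sketch of what per-instance locking could look like; POSIX-only and purely illustrative:

```python
# Hypothetical sketch: an advisory lock per build instance so two
# `resume --poll` processes cannot interleave writes to batch_state.json.
# POSIX-only (fcntl); a portable build would need another primitive.
import fcntl
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def build_instance_lock(instance_dir: Path):
    lock_path = instance_dir / ".lock"
    with open(lock_path, "w") as fh:
        # Fail fast if another resume already holds the lock.
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
```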

Missing

  • A schema/version field in manifest.json and batch_state.json.
  • File locking (or explicit single-process guarantee + enforcement) around build instance state.
  • Tests that assert partial failures produce completed_with_errors (right now implementation aborts).
  • Documentation update in DESIGN.md / website: you’re adding a new execution mode and a new primitive-ish runner; currently absent.
  • Clear constraints on OpenAI Batch limits (max requests per batch, JSONL size, rate limits) and how you chunk batches.

Verdict

Block — the implementation contradicts the documented completion/error semantics and is unsafe under concurrent resume; fix those before shipping an experimental feature that writes persistent state and charges money.

Review parameters
  • Model: gpt-5.2
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 4,407 lines
  • Prompt: .github/prompts/openai_review.md
  • Timestamp: 2026-02-17T04:29:47Z

@github-actions
Copy link
Contributor

Note

Architectural Review — Claude Opus | Blind review (docs + diff only)

Summary

This PR adds experimental OpenAI Batch API support (synix batch-build) — a new execution mode that submits eligible transform layers as async OpenAI batches at 50% cost, with submit/poll/resume lifecycle management. It includes a BatchLLMClient drop-in, persistent build state, CLI commands, a demo template (05-batch-build), comprehensive unit and e2e tests, and a full design spec.

Alignment

Strong fit. DESIGN.md explicitly calls out that "LLM transforms are expensive — $0.01-$1.00 per call" and that caching/incremental rebuild matter because of cost. Batch mode is a direct response: same DAG, same transforms, same content-addressed artifacts, 50% cheaper. The BatchLLMClient is a clean drop-in for LLMClient — transforms remain unaware of execution mode, preserving the stateless transform contract. Artifacts are still immutable and content-addressed. Provenance chains remain intact. The batch parameter on Transform is a runtime concern (Hypothesis 3: architecture is a runtime concern), not a structural change. The cassette key reuse for request deduplication aligns with the materialization key philosophy — capture all output-affecting inputs.

Observations

  1. [positive] The control-flow-via-exception pattern (BatchCollecting, BatchInProgress) is clever and keeps transforms completely unmodified. The spec in batch-build.md is thorough — error handling, state corruption recovery, partial failures, and downstream behavior are all explicitly defined.

  2. [positive] Test coverage is strong: 368 lines of unit tests for the client, 287 for state, 602 for e2e. Edge cases covered include corrupted state quarantine, fingerprint mismatch, partial failures, cassette replay gating, idempotent resume, and request key distinctness.

  3. [concern] batch-build-review.md (the self-review) identifies a cassette-mode incompatibility: the recording workflow exports SYNIX_CASSETTE_MODE=record then runs batch-build run, but the spec says cassette mode set is a hard error. The code in batch_runner.py:80 only checks for replay, not record, which is inconsistent with the spec's stated contract. The review doc's own finding #1 appears unresolved.

  4. [concern] batch_runner.py has significant code duplication: _resolve_batch_mode resolves LLMConfig from pipeline+layer config with the same 3-line pattern repeated in _run_batch_transform and _poll_and_resume. The _poll_and_resume function also duplicates batch client construction. Extract a helper.

  5. [concern] _poll_and_resume catches Exception broadly (line ~400) and wraps it in RuntimeError. Network transient errors during polling should be retriable per the spec ("retry next interval"), but this implementation raises immediately.

  6. [question] The pipeline template at src/synix/templates/05-batch-build/pipeline.py is duplicated verbatim at templates/05-batch-build/pipeline.py. Same for the source bios. Is this intentional (one for packaging, one for demo runner)? If so, they'll drift.

  7. [nit] _has_cassette_responses() is called repeatedly during _resolve_batch_mode (once per layer) and reads env vars each time. Minor, but could be cached at the top of batch_run.

  8. [concern] plan_batch calls layer.estimate_output_count(dep_counts) but estimate_output_count isn't defined anywhere in this diff or visible on the Transform base class. This will raise AttributeError at plan time unless it exists elsewhere.

  9. [nit] Golden file artifacts.stdout.txt line 15 has │ ws-bob │ <N> │ where the artifact ID is normalized away — inconsistent with the other rows that show 7-char hashes. Suggests an output normalization regex is too greedy.

  10. [positive] The --allow-pipeline-mismatch and --reset-state flags are well-designed safety hatches — dangerous operations require explicit opt-in rather than overloading --force.
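
As referenced in observation 5, a minimal sketch of the "retry next interval" behavior; the function name and transient-error set are assumptions, not the runner's actual code:

```python
# Hypothetical sketch of "retry next interval": transient poll errors are
# logged and retried; anything else, or exhausting the budget, still raises.
import logging
import time

log = logging.getLogger("batch_poll")

TRANSIENT_ERRORS = (ConnectionError, TimeoutError)  # assumed mapping for SDK errors


def poll_until_done(check_once, interval: float = 30.0, max_duration: float = 86400.0) -> None:
    deadline = time.monotonic() + max_duration
    while time.monotonic() < deadline:
        try:
            if check_once():  # True once all batches are terminal
                return
        except TRANSIENT_ERRORS as exc:
            log.warning("transient poll error, retrying next interval: %s", exc)
        time.sleep(interval)
    raise TimeoutError("batch polling exceeded max duration")
```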

Verdict

This is a well-scoped, well-tested incremental addition that directly addresses the cost problem DESIGN.md highlights, without compromising any core invariants — good step for the project, pending resolution of the cassette-mode contract inconsistency and the missing estimate_output_count method.

Review parameters
  • Model: claude-opus-4-6
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 4,407 lines
  • Prompt: .github/prompts/claude_review.md
  • Timestamp: 2026-02-17T04:29:58Z

- BatchState.create_fresh() uses _skip_load parameter instead of
  unsafe object.__new__ bypass
- compute_pipeline_hash() includes transform fingerprints for stronger
  change detection between submit and resume
- Auto-batch skips N:1 transforms (estimate_output_count <= 1) since
  they don't benefit from batching
- Per-unit BatchRequestFailed is caught and logged instead of aborting
  the entire layer — remaining units continue processing
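
For illustration, a minimal sketch of the per-unit behavior described above; the exception and state shapes are assumptions, not the real code:

```python
# Hypothetical sketch of the per-unit behavior described above: a failed
# unit is recorded and skipped; the remaining units keep processing.
class BatchRequestFailed(Exception):
    pass


def run_units(units, build_unit, errors: dict):
    built, failed = [], []
    for unit in units:
        try:
            built.append(build_unit(unit))
        except BatchRequestFailed as exc:
            failed.append(unit)
            errors[str(unit)] = str(exc)  # recorded in state, not re-raised
    return built, failed
```
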
@github-actions
Copy link
Contributor

Note

Red Team Review — OpenAI GPT-5.2 | Adversarial review (docs + diff only)

Threat assessment — Medium-to-high risk: you’re introducing a new persisted state format + new CLI surface + control-flow-by-exception execution path, with several correctness holes that will be painful to unwind if they ship.

One-way doors

  1. New CLI surface: synix batch-build *
    Hard to reverse because users will script against these commands/flags (--poll, --allow-pipeline-mismatch, --reset-state, --latest, --build-dir) and you’ve already documented them in README.
    Safe if: command/flag names are stable (or explicitly versioned/experimental namespace), and semantics are backed by tests that match docs.

  2. On-disk state schema under build/builds/<build_id>/{manifest.json,batch_state.json}
    Hard to reverse because builds will be resumable across versions; schema changes require migrations or break resumes.
    Safe if: you add a schema_version field and migration strategy (even “delete and restart” needs to be explicit).

  3. Public API change: Transform(..., batch: bool|None) in src/synix/core/models.py
    Hard to reverse because it changes the constructor signature for every custom transform.
    Safe if: you confirm backward compatibility for subclasses/instantiation patterns and document it in API docs/changelog.

Findings

  1. src/synix/build/batch_runner.py: cassette mode contract contradicts docs
    Code only rejects SYNIX_CASSETTE_MODE=replay without batch_responses.json; docs/spec say replay is incompatible (and review doc mentions record-mode incompatibility too). This is a semantic mess that will break reproducibility workflows. [critical]

  2. src/synix/build/batch_runner.py::_poll_and_resume: “expired batch requeue” not implemented
    docs/batch-build.md claims expired batches re-queue failed requests and resubmit. Implementation marks errors and stops. That’s a spec/impl divergence. [warning]

  3. src/synix/build/batch_runner.py::_run_batch_transform: error accounting is wrong and pollutes error keys
    You store synthetic keys like "{layer.name}:{unit_desc}" for unit failures, but real request errors are keyed by request hash. Then batch_run() counts len(errors) as failed_requests, mixing apples/oranges. Status reporting becomes meaningless. [warning]

  4. src/synix/build/batch_runner.py::_save_or_cache_artifact: cache/provenance integrity risk
    On cache hit you load by store.load_artifact(artifact.label) (label-based), not by artifact_id/fingerprint. If labels collide across pipeline changes (which you explicitly allow with --allow-pipeline-mismatch), you can “hit cache” but return the wrong artifact content with updated metadata. [critical]

  5. src/synix/build/batch_state.py::compute_pipeline_hash: uses layer.compute_fingerprint(config) with incomplete config
    You pass only pipeline.llm_config dict, ignoring src_dir/build_dir extras that normal runner uses, and ignoring per-layer merged config used in execution. Hash may fail to detect relevant changes or detect irrelevant ones. Resume safety is questionable. [warning]

  6. src/synix/build/batch_client.py::_parse_batch_results: incorrect parsing for OpenAI Batch output shape risk
    You assume a response: {status_code, body} wrapper and choices[0].message.content. If OpenAI’s batch returns responses in a different schema (or tool calls), you’ll silently store empty content. Needs strict validation and errors on unexpected shape. [warning]

  7. src/synix/cli/batch_commands.py::status: dashboard URL check is wrong
    You check bid.startswith("batch_"), but your mocked batch id is "batch-test456" and OpenAI ids aren’t guaranteed to use underscore. Output will often show “—” even when it should link. [minor]

  8. src/synix/build/llm_client.py: placeholder API key in replay mode
    You’re mutating credential behavior globally based on env var. If any code path accidentally makes a real call in replay, it’ll hit OpenAI/Anthropic with a junk key and fail confusingly. This should be localized to the cassette client, not the base client. [warning]
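
A minimal sketch of the localization this finding suggests; the class name and cassette shape are assumptions:

```python
# Hypothetical: keep replay inside a cassette-only client so the base
# LLMClient never needs a placeholder key and can never hit the network.
class CassetteReplayClient:
    def __init__(self, cassette: dict[str, str]):
        self.cassette = cassette  # request_key -> recorded response text

    def complete(self, request_key: str) -> str:
        try:
            return self.cassette[request_key]
        except KeyError:
            # Fail loudly instead of falling back to a real API call.
            raise RuntimeError(
                f"no cassette entry for {request_key}; replay must not hit the network"
            ) from None
```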

Missing

  • Schema/versioning for manifest.json and batch_state.json, plus upgrade policy tests.
  • Concurrency safety: no locking around <build_dir>/builds/<id>; two resumes can race and corrupt state despite atomic writes (atomic write doesn’t prevent last-writer-wins).
  • Integration test against real-ish OpenAI batch response schema (at least a fixture captured from the API) to validate parsing.
  • Clear contract + tests for SYNIX_CASSETTE_MODE=record behavior with batch-build (docs mention it; code doesn’t enforce consistently).
  • Documentation alignment: README and docs should explicitly say what is/isn’t compatible with cassette replay/record, and what happens on expired/failed batches.

Verdict

Block — the cache/provenance correctness issues (label-based cache loads + allow-mismatch) and the cassette/spec inconsistencies are one-way mistakes that will create irreproducible or wrong builds. Fix those, add schema versioning + locking, and then reassess.

Review parameters
  • Model: gpt-5.2
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 4,435 lines
  • Prompt: .github/prompts/openai_review.md
  • Timestamp: 2026-02-17T09:02:12Z

@github-actions
Copy link
Contributor

Note

Architectural Review — Claude Opus | Blind review (docs + diff only)

Summary

This PR adds OpenAI Batch API support (synix batch-build) as an experimental feature. It introduces a BatchLLMClient that intercepts complete() calls via control-flow exceptions, a persistent BatchState for async build lifecycle, CLI commands (run/resume/list/status/plan), a demo template, and comprehensive tests. The synix build path is untouched.

Alignment

Strong fit. The DESIGN.md explicitly identifies LLM cost as a key differentiator from SQL-based build systems ("$0.01-$1.00 per call"). Batch builds at 50% cost directly address this. The implementation preserves core invariants: artifacts remain content-addressed via ArtifactStore, provenance is tracked through the same ProvenanceTracker, cache keys reuse compute_cassette_key (aligned with materialization key principles), and the DAG walk is unchanged. The batch parameter on Transform respects the Python-first API philosophy — it's a keyword argument, not config. The design doc's "audit determinism" bet is maintained: batch_state.json records every request/response pair.

Observations

  1. [concern] Duplicate template files. templates/05-batch-build/ and src/synix/templates/05-batch-build/ contain identical pipeline.py and sources/ files. One is presumably the installed package template, the other the repo-level demo. This will diverge silently — there's no symlink or build step keeping them in sync.

  2. [concern] _poll_and_resume wraps errors too aggressively. In batch_runner.py:415, network errors during check_and_download are re-raised as RuntimeError, losing the original exception type. This makes it impossible for callers to distinguish transient network failures from permanent errors, contradicting the spec's "retry next interval" recovery for network errors.

  3. [concern] Self-review doc ships in PR. docs/batch-build-review.md is an internal review artifact documenting spec inconsistencies (cassette-mode incompatibility, --force overloading). Finding #1 (cassette-mode) appears partially addressed (the implementation rejects replay but allows record), but the review doc itself shouldn't ship as user-facing documentation.

  4. [question] estimate_output_count in auto-detect. _resolve_batch_mode calls layer.estimate_output_count(3) with a hardcoded 3 to decide batch vs sync. Where does estimate_output_count come from on Transform? It's not in the diff or models.py. If it's a method returning 1 by default for N:1 transforms, the heuristic makes sense, but it's not visible here.

  5. [concern] Cassette-mode record during batch-build. The spec (batch-build.md:178) says SYNIX_CASSETTE_MODE set is a hard error, but the implementation only rejects replay (without cassette responses). The review doc flags this exact inconsistency. record mode during batch-build run is silently allowed, which may confuse users following the recording workflow.

  6. [positive] Thorough test coverage. Unit tests cover all BatchLLMClient paths (cached, cassette, in-progress, failed), BatchState corruption/quarantine, request key uniqueness. E2E tests cover plan, run, resume, list, status, mixed providers, corrupted state, fingerprint mismatch, force-sync, cassette mode rejection — matching nearly every case listed in the spec.

  7. [positive] Clean transform contract. Transforms are completely unaware of batch mode. The BatchLLMClient is injected via _shared_llm_client in config, and the control-flow exception pattern keeps execute() unchanged. This is well-designed.

  8. [nit] llm_client.py placeholder key. The sk-replay-placeholder fallback in cassette replay mode is pragmatic but could mask misconfiguration if someone accidentally sets SYNIX_CASSETTE_MODE=replay in production.

  9. [concern] Golden file artifacts.stdout.txt line 15 has │ ws-bob │ <N> │ where the artifact ID appears to be masked as <N> — likely a normalization regex over-matching the short hash. This could cause flaky golden comparisons.

  10. [positive] Atomic writes and corruption recovery. The quarantine-on-corrupt pattern with timestamped suffixes and explicit --reset-state opt-in is well-considered for a feature managing async state across sessions.

Verdict

This is a well-structured, thoroughly tested incremental step that adds meaningful cost optimization while preserving all architectural invariants — ship it after resolving the duplicate template files and the cassette-mode record inconsistency.

Review parameters
  • Model: claude-opus-4-6
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 4,435 lines
  • Prompt: .github/prompts/claude_review.md
  • Timestamp: 2026-02-17T09:02:31Z

Resolve conflicts from ext transforms merge (PR #72):
- Keep v0.12.1 version from main
- Keep HEAD's batch-build fixes (create_fresh, pipeline hash, N:1 gating, per-unit errors)
- Merge CLI table styles from main with full batch-build commands
@github-actions
Copy link
Contributor

Note

Red Team Review — OpenAI GPT-5.2 | Adversarial review (docs + diff only)

Threat assessment — Medium risk: it quietly changes batch/sync execution semantics and introduces new CLI surface area/state coupling without showing the supporting tests/spec alignment.

One-way doors

  1. CLI contract expansion: batch-build plan|run --poll|resume|list|status --latest (README.md)
    Hard to reverse because users will script these subcommands/flags and expect stable output/IDs.
    Safe to merge if: the CLI is implemented, documented in docs/batch-build.md, and has compatibility guarantees (even “experimental” needs best-effort stability) plus golden tests for output shape.

  2. Public pipeline API: per-transform batch: None|True|False + provider restrictions (README.md)
    Hard to reverse because it becomes part of pipeline definitions and mental model (“auto-detect” rules).
    Safe to merge if: the behavior is enforced consistently in code (validation at pipeline construction), and there’s a migration story if heuristics change.

  3. Batch state format and keying: errors{request_key...} + pipeline hash semantics (src/synix/build/batch_state.py)
    Hard to reverse because existing batch builds on disk may become unreadable or “unresumable.”
    Safe to merge if: state JSON is versioned, backward-compatible loaders exist, and corruption/recovery flows are tested.

Findings

  1. README.md: “N:1 transforms don’t benefit from batching” vs actual diff logic in _resolve_batch_mode
    Pattern: if layer.estimate_output_count(3) <= 1: return "sync"
    Failure mode: this disables batching for any transform that returns 1 output per unit—even if it has many units (which is exactly where batching saves cost). “N:1” is being inferred from output count, which is not equivalent. This is likely a correctness bug in the heuristic, not just an optimization. [critical]

  2. src/synix/build/batch_runner.py: per-unit batch submission failures are now non-fatal
    Pattern: catch BatchRequestFailed, log, continue; later store errors but still return "completed"
    Failure mode: pipeline reports completion while silently missing artifacts for failed units; downstream layers may run on partial inputs or stale cache, producing inconsistent provenance. At minimum this should flip layer/run status to failed/partial and block dependent layers. [critical]

  3. src/synix/build/batch_runner.py: failed_units keyed by joined labels
    Pattern: unit_desc = ", ".join(a.label for a in unit_inputs) then store_error(f"{layer.name}:{unit_desc}", ...)
    Failure mode: non-unique/unstable keys (labels can collide, order may vary, commas in labels, truncation); resume/dedup becomes unreliable; errors may overwrite each other. Should use a deterministic unit/request key (materialization key / fingerprint); see the sketch after this list. [warning]

  4. src/synix/build/batch_state.py: create_fresh(..., _skip_load=True) may not create directories/files
    Pattern: previously manual init; now calls __init__ without _load_state()
    Failure mode: if other methods assume _instance_dir exists or state files exist, you’ll get runtime failures later. The old code also didn’t mkdir, but changing construction path without showing invariants is risky. Needs explicit “fresh state initialization” (mkdir + save baseline). [warning]

  5. src/synix/build/batch_state.py: compute_pipeline_hash() now imports synix.core.models.Transform and calls layer.compute_fingerprint(config)
    Failure mode: circular import risk + “hash changes because fingerprint includes unstable data” (timestamps, environment, provider-resolved defaults). Also the config passed is pipeline llm_config only; per-layer overrides may be missed unless already in parts. If the hash flips spuriously, resume becomes impossible. [warning]

  6. README.md vs DESIGN.md mismatch on API shape
    DESIGN.md emphasizes function-based prompts and step_version_hash from inspect.getsource, while README now documents class-based transforms with batch toggles and provider heuristics. That’s a divergence in public story; users won’t know which is canonical. [minor]
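
As referenced in finding 3, a sketch of a deterministic unit key; the function is illustrative, not the implementation:

```python
# Hypothetical sketch: derive a stable error key from a hash of the sorted
# artifact identifiers instead of joining labels with commas.
import hashlib


def unit_error_key(layer_name: str, input_ids: list[str]) -> str:
    digest = hashlib.sha256("\x00".join(sorted(input_ids)).encode()).hexdigest()[:12]
    return f"{layer_name}:{digest}"
```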

Missing

  • Tests proving: (a) batching decision logic, (b) partial failure semantics (what happens downstream), (c) state load/save compatibility and corruption quarantine, (d) compute_pipeline_hash stability across runs.
  • Documentation update to docs/batch-build.md reflecting the new hash inputs and partial-failure behavior (README references it, but diff doesn’t show it changed).
  • Clear run status model for “completed with errors” vs “failed”; and CLI exit codes for run --poll when some units fail.

Verdict

Block. The new batch/sync heuristic is almost certainly wrong, and treating per-unit submission failure as non-fatal without strict downstream gating will create silently incomplete builds and broken provenance.

Review parameters
  • Model: gpt-5.2
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 218 lines
  • Prompt: .github/prompts/openai_review.md
  • Timestamp: 2026-02-17T18:26:49Z

@github-actions
Copy link
Contributor

Note

Architectural Review — Claude Opus | Blind review (docs + diff only)

Summary

This PR makes three improvements to the experimental batch-build feature: (1) N:1 transforms (like CoreSynthesis) are auto-detected and excluded from batching since they produce a single output, (2) per-unit batch failures are now non-fatal — recorded in state and logged rather than halting the pipeline, and (3) the pipeline hash used for submit/resume change detection now includes transform fingerprints (source code + prompts). README is updated with full batch-build documentation.

Alignment

Strong fit. DESIGN.md §3.3 establishes that cache keys must "capture everything affecting what gets produced" — extending the pipeline hash to include transform fingerprints directly advances this. The non-fatal error handling aligns with the §10.2 stance: "Log malformed outputs but don't halt." The N:1 batching optimization is coherent with the build-system model — you don't batch a link step that produces one binary. Batch build itself advances Hypothesis 3 (architecture is a runtime concern) by making iteration cheaper.

Observations

  1. [positive] batch_state.py — Replacing object.__new__ + manual field assignment in create_fresh with cls(build_dir, build_id, _skip_load=True) eliminates a DRY violation where field defaults were duplicated. Clean fix.

  2. [positive] batch_runner.py:302-305 — The estimate_output_count(3) <= 1 heuristic for skipping N:1 transforms is pragmatic. Passing a sentinel input count of 3 to distinguish 1:1 from N:1 is clever, though the magic number deserves a comment explaining why 3; a short illustration follows this list.

  3. [concern] batch_runner.py:477 — Failed units are logged and recorded in state, but layer_built may contain partial results. Downstream layers will receive incomplete inputs. There's no mechanism to mark the layer as partially complete or prevent downstream layers from running on a degraded input set. This could produce artifacts with incomplete provenance chains, which contradicts the "provenance chains are complete" principle from DESIGN.md §3.9.

  4. [question] batch_runner.py:493-500 — store_error is called with a synthesized key f"{layer.name}:{unit_desc}" using artifact labels. Are artifact labels guaranteed unique within a layer? If two units share a label (or labels contain colons), error keys could collide or be ambiguous.

  5. [concern] batch_state.py:284-286 — layer.compute_fingerprint(config) is called during pipeline hash computation. If compute_fingerprint is expensive (LLM config resolution, file I/O), this runs on every resume call. For pipelines with many layers, this could add latency to what should be a quick change-detection check.

  6. [nit] README batch-build section — The core.config = {"llm_config": {...}} pattern in the quick example doesn't appear elsewhere in the README or website. The established pattern is pipeline-level llm_config. This could confuse users about the canonical way to set per-layer config.

  7. [question] The batch parameter is documented on transforms, but DESIGN.md §3.5 defines four build rule types (transform, aggregate, fold, merge). Does batch apply to all step types or only transforms? The README says "each transform" but the provider restriction section implies broader applicability.

  8. [concern] No tests in this diff. The error-handling change (fail-fast → continue-on-failure) is a behavioral change to the execution model. The fingerprint inclusion in pipeline hash changes resume/change-detection semantics. Both need test coverage — at minimum: a test that a failed unit doesn't halt the batch, a test that partial results are recorded, and a test that pipeline hash changes when a transform's prompt changes.
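
As referenced in observation 2, a short illustration of why the sentinel works; the helper is hypothetical:

```python
# Hypothetical illustration of the sentinel: a 1:1 transform reports as many
# outputs as inputs, an N:1 transform collapses everything to one output, so
# any probe value greater than 1 (here 3) separates the two shapes.
def is_n_to_one(estimate_output_count, sentinel_inputs: int = 3) -> bool:
    return estimate_output_count(sentinel_inputs) <= 1


# e.g. per-bio work-style transform: lambda n: n  -> not N:1, eligible for batching
#      team-summary fold:            lambda n: 1  -> N:1, run sync
```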

Verdict

Good directional changes — the N:1 optimization, resilient error handling, and fingerprint-based change detection are all well-motivated — but the missing tests and the unaddressed partial-completion/provenance gap make this not quite ready to merge.

Review parameters
  • Model: claude-opus-4-6
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 218 lines
  • Prompt: .github/prompts/claude_review.md
  • Timestamp: 2026-02-17T18:26:56Z

@marklubin marklubin merged commit 131d525 into main Feb 17, 2026
12 checks passed