
fix: issues 1-8 from claude-sync pipeline build #74

Merged
marklubin merged 2 commits into main from fix/issues-1-8-claude-sync
Feb 19, 2026

Conversation

@marklubin
Owner

Summary

Batch fix for 8 issues discovered during claude-sync pipeline build:

Issues 1-5 (build system fixes):

  • Per-unit cache check in runner — skip cached units, only execute uncached
  • EmbeddingConfig.dimensions defaults to None (not passed to API when unset)
  • EmbeddingConfig.resolve_base_url() for env var fallback; manifest stores identity fields only
  • Batch size default 16 + 413 retry with halved batch
  • Plan estimation uses estimate_output_count when upstream is dirty

Issues 6-8 (new features):

  • Ext transforms propagate input metadata: MapSynthesis auto-propagates (1:1); Group/Reduce/Fold accept metadata_fn for custom fields. All fingerprinted.
  • API key env var collision: api_key_env field + SYNIX_API_KEY/SYNIX_EMBEDDING_API_KEY precedence layers in LLMConfig and EmbeddingConfig
  • Claude Code JSONL adapter: parse .jsonl session files into transcript artifacts with metadata extraction (session_id, date, slug, cwd, git_branch)

Test plan

  • uv run release passes (sync, lint, test, demos)
  • All 5 demo golden files verified
  • 33 new tests in test_issues_6_7_8.py (metadata propagation, API key resolution, JSONL parsing)
  • 60+ new tests in test_issue_fixes.py (per-unit caching, dimensions, base_url, batch size, plan estimation)
  • Existing ext transform fingerprint tests updated for renamed component keys

@github-actions
Contributor

Note

Red Team Review — OpenAI GPT-5.2 | Adversarial review (docs + diff only)

Threat assessment — Medium risk: good intent and decent tests, but a couple of cache/projection changes are format one-way doors and there are some correctness + scale traps.

One-way doors

  1. Projection cache format change (.projection_cache.json: source_hash → content_hash + embedding_hash)

    • Why hard to reverse: Once users have builds on disk, any further change risks “everything rebuilds” or incorrect cache hits. You’re now supporting two formats; dropping either later becomes painful.
    • Safe if: you document the cache schema as explicitly internal/unstable, add a cache “version” field, and add tests that: (a) legacy cache is read, (b) new cache is written, (c) round-trips don’t cause unnecessary rebuilds.
  2. EmbeddingConfig defaults change (dimensions: 384→None, batch_size: 64→16)

    • Why hard to reverse: This directly changes output embeddings and performance characteristics. Users will “feel” search quality/perf shifts across upgrades and may not know why.
    • Safe if: release notes + docs call this out, and you provide a migration note (“set dimensions=384/batch_size=64 to preserve old behavior”).
  3. Registering .jsonl globally to Claude Code adapter

    • Why hard to reverse: .jsonl is generic. Users with arbitrary JSONL sources now silently parse as “not Claude Code” → empty artifacts. That’s a behavior contract.
    • Safe if: autodetect based on content first (like _parse_json_autodetect), or scope to a more specific pattern/flag (--format claude-code), or at least warn loudly when .jsonl yields empty due to non-matching schema.
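
A minimal sketch of the content-based sniffing suggested above, assuming Claude Code session lines carry a sessionId key; the function name and adapter label are hypothetical:

```python
import json

def sniff_jsonl(filepath) -> str | None:
    """Pick an adapter from the first parseable line, not the extension."""
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate leading blank lines
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                return None  # not JSONL at all; let another adapter claim it
            # Assumption: Claude Code session entries include a sessionId field
            if isinstance(entry, dict) and "sessionId" in entry:
                return "claude-code"
            return None
    return None
```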

Findings

  1. src/synix/build/runner.py: per-unit cache key uses tuple(sorted(art.input_ids))

    • Failure mode: input_ids is a list; sorting a list-of-strings is fine, but if any artifact has input_ids=None/mixed types, this will throw at runtime. Also, this key ignores unit config extras; two units with same inputs but different config_extras could be incorrectly treated as cached.
    • Severity: [critical]
  2. src/synix/build/runner.py: existing_artifacts = store.list_artifacts(layer.name) for every layer execution

    • Failure mode: O(N) read of all artifacts in a layer each build, then building a dict keyed by inputs. Works for 100 artifacts, degrades badly for 10k+.
    • Severity: [warning]
  3. src/synix/build/runner.py: embedding-only regeneration calls projection._generate_embeddings(...) (private API)

    • Failure mode: tight coupling to internal projection implementation. Any refactor of projection internals breaks runner. Also bypasses whatever invariants _materialize_projection enforces.
    • Severity: [warning]
  4. src/synix/build/plan.py: _plan_projection mixed legacy/new logic but no explicit cache versioning

    • Failure mode: future schema changes become more conditional spaghetti. Also risk of partial entries (e.g., content_hash present but embedding_hash missing) producing misleading “embedding config changed”.
    • Severity: [warning]
  5. src/synix/ext/*_synthesis.py: metadata propagation is inconsistent

    • MapSynthesis now copies full input metadata; Group/Reduce/Fold do not (only add derived fields).
    • Failure mode: users will assume metadata always propagates; downstream grouping/sorting could break depending on which transform was used.
    • Severity: [warning]
  6. src/synix/ext/*: cache key includes repr(metadata_fn)

    • Failure mode: repr(fn) often contains memory addresses → nondeterministic cache keys across runs, causing spurious rebuilds.
    • Severity: [warning]
  7. src/synix/adapters/claude_code.py: open(filepath) without encoding

    • Failure mode: platform-dependent decoding; non-UTF8 JSONL can explode.
    • Severity: [minor]
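
An explicit-encoding read along the lines finding 7 suggests; a generic sketch, not the adapter's actual code:

```python
import json

def iter_jsonl(filepath):
    """Yield parsed JSONL entries, decoding explicitly as UTF-8 so behavior
    doesn't depend on the platform's default locale encoding."""
    with open(filepath, encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                # Skip malformed lines rather than aborting the whole file
                continue
```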

Missing

  • Docs updates: README/docs don’t mention api_key_env, SYNIX_API_KEY, SYNIX_EMBEDDING_API_KEY, OPENAI_BASE_URL, or embedding-only rebuild behavior.
  • Tests for projection cache migration behavior: you added plan logic, but no tests asserting correct skip/rebuild across old/new cache files.
  • Guardrails for .jsonl: no warning surfaced when .jsonl file is non-Claude-Code and returns empty artifacts (silent failure).

Verdict

Ship with fixes — block until you fix the per-unit cache key correctness (config_extras + input_ids robustness) and address repr(metadata_fn) nondeterminism; the rest can follow quickly but needs documentation before release.

Review parameters
  • Model: gpt-5.2
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 2,422 lines
  • Prompt: .github/prompts/openai_review.md
  • Timestamp: 2026-02-19T06:04:58Z

@github-actions
Contributor

Note

Architectural Review — Claude Opus | Blind review (docs + diff only)

Summary

A multi-issue fix PR (issues #1–#8) adding: per-unit cache checks in the build runner so only uncached units execute, a Claude Code .jsonl adapter, split projection hashing (content vs. embedding config), metadata_fn on all four ext transforms, improved API key resolution with api_key_env/SYNIX_API_KEY, embedding config fixes (dimensions default, batch size, 413 retry), and better plan estimation when upstream is dirty. Accompanied by ~1,300 lines of thorough tests.

Alignment

Strong fit. Per-unit caching directly advances the "only rebuild what changed" principle from DESIGN.md §3.3 — materialization keys now work at unit granularity, not just layer granularity. Split projection hashing respects the cache-key-captures-all-inputs rule while avoiding unnecessary FTS5 rebuilds. The Claude Code adapter extends the "conversations are sources" model. metadata_fn on ext transforms is a clean extension point consistent with DESIGN.md's Python-first philosophy (§4.1) — pipeline authors get code-level control without subclassing. Fingerprint key renaming ("callable" → "label_fn", "group_by") improves auditability, supporting the audit-determinism-over-reproducibility bet.

Observations

  1. [positive] Per-unit caching (runner.py) is the highest-impact change. The cached_by_inputs dictionary keyed on tuple(sorted(input_ids)) correctly identifies units that match both inputs and transform fingerprint. This addresses a real O(n) waste scenario at scale.

  2. [concern] _execute_transform_concurrent unpacks units_to_run as (seq_idx, (_orig_idx, unit_inputs, config_extras)) but units_to_run stores (i, unit_inputs, config_extras) — three elements. The destructuring on line ~445 uses (_orig_idx, unit_inputs, config_extras) inside the enumerate-style loop. This looks correct but the _orig_idx is unused and the original index i loses meaning after filtering. No bug, but confusing for a reader.

  3. [concern] _regenerate_embeddings_only calls projection._generate_embeddings(...) — a private method on the projection class. This couples the runner to an internal API. If the projection implementation changes, this breaks silently. A public regenerate_embeddings() method would be safer.

  4. [question] The .jsonl extension is registered globally for Claude Code. What happens when a user has non-Claude-Code JSONL files in their source directory? The adapter returns [] for non-matching files, which is graceful, but the user gets no artifacts and no error. Is silent skipping the right UX here?

  5. [positive] metadata_fn is correctly incorporated into cache keys, fingerprints, and output metadata across all four ext transforms. The fingerprinting via inspect.getsource with repr() fallback matches the existing pattern from DESIGN.md §4.3.1.

  6. [positive] The 413 retry with recursive batch halving in embeddings.py is well-implemented — base case (single text) re-raises, and the _is_payload_too_large helper checks both status_code attribute and message string patterns.

  7. [nit] metadata_fn is typed as object | None across all ext transforms. Should be Callable (or Callable[[Artifact], dict] | None for Map, Callable[[str, list[Artifact]], dict] | None for Group, etc.). The current typing gives no IDE support and no signature documentation.

  8. [positive] Test coverage is excellent. ~1,300 lines covering happy paths, edge cases (empty files, malformed JSON, recursive 413 halving, dimensions=0 boundary), concurrent paths, and fingerprint invalidation. The per-unit caching tests verify both sequential and concurrent code paths with real pipeline builds.

  9. [nit] SearchIndex.__init__ removed the self.config["embedding_config"] = embedding_config assignment, but _materialize_projection now sets config["embedding_config"] = proj.embedding_config at call time. This works but the intent is non-obvious — a comment explaining the migration would help.

  10. [positive] Legacy cache format compatibility in _plan_projection — checking for "content_hash" vs "source_hash" keys means existing builds upgrade cleanly without a forced clean.
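
One way the legacy/new compatibility in observation 10 could be centralized; a hypothetical helper, not code from the diff:

```python
def entry_hashes(cached_entry: dict) -> tuple[str | None, str | None]:
    """Normalize a projection cache entry to (content_hash, embedding_hash).

    Assumption: new-format entries carry both hashes, while legacy entries
    only have source_hash, which covered content and embedding config together.
    """
    if "content_hash" in cached_entry:
        return cached_entry["content_hash"], cached_entry.get("embedding_hash")
    # Legacy format: one combined hash, so no embedding-only invalidation
    return cached_entry.get("source_hash"), None
```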

Verdict

This is a solid incremental step — the per-unit caching and split projection hashing are meaningful architectural improvements that directly serve the build-system-for-memory model, the new extension points are clean, and the test coverage is thorough.

Review parameters
  • Model: claude-opus-4-6
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 2,422 lines
  • Prompt: .github/prompts/claude_review.md
  • Timestamp: 2026-02-19T06:05:11Z

@marklubin force-pushed the fix/issues-1-8-claude-sync branch from 2105833 to 7e59de5 on February 19, 2026 06:12
@github-actions
Contributor

Note

Red Team Review — OpenAI GPT-5.2 | Adversarial review (docs + diff only)

Threat assessment — Medium risk: several behavior changes in caching and projection materialization that can silently invalidate (or worse, incorrectly reuse) outputs, plus a new adapter with overly-broad .jsonl capture.

One-way doors

  1. .jsonl adapter defaulting to Claude Code parser
    Hard to reverse because users will start dropping arbitrary JSONL sources in and get “empty list” behavior with no clear diagnostics; future adapters for other JSONL formats will be constrained.
    Safe to merge if: registry supports content-based sniffing with multiple JSONL parsers (or explicit --format) and .jsonl doesn’t hard-map to Claude Code globally.

  2. Projection cache format change (source_hash → content_hash + embedding_hash)
    Hard to reverse because it changes on-disk cache semantics and “what causes rebuild” logic; users will depend on embedding-only rebuild behavior.
    Safe to merge if: cache file versioning is explicit, migration/compat is tested, and projection rebuild reasons are correct and stable.

  3. EmbeddingConfig defaults changed (dimensions None, batch_size 16)
    This alters embeddings output shape/behavior and performance characteristics; downstream artifacts/search scores can change and users will build expectations.
    Safe to merge if: documented and treated as a breaking change (even pre-1.0), with a clear upgrade note.

Findings

  1. src/synix/adapters/registry.py: register_adapter([".jsonl"])(parse_claude_code)
    Failure mode: hijacks all JSONL inputs; non-Claude-Code JSONL silently produces zero artifacts. Also blocks adding other JSONL adapters later without redesign.
    Severity: [warning]

  2. src/synix/adapters/claude_code.py: session_id = filepath.stem; entry.get("sessionId") is used only for detection, not identity
    Failure mode: session identity is wrong if filename isn’t sessionId; collisions if multiple files share stem; provenance/labels become misleading.
    Severity: [warning]

  3. src/synix/build/runner.py: per-unit cache key uses tuple(sorted(art.input_ids)) and matches by transform_fp.matches()
    Failure mode: this assumes input_ids uniquely identify a work unit. For transforms where split() produces N>1 inputs per unit or where order matters, sorting destroys ordering. You can return the wrong cached artifact for a unit with same set but different sequence.
    Severity: [critical]

  4. src/synix/build/runner.py: _execute_transform_concurrent filters cached units and returns [] for them
    Failure mode: caller must rely on on_cached side effects to include cached artifacts. That’s fragile coupling; any future caller that expects returned artifacts will silently drop cached results.
    Severity: [warning]

  5. src/synix/build/runner.py: _compute_content_only_hash hashes artifact IDs only
    Failure mode: projection cache won’t invalidate when artifact content changes without ID change. If artifact IDs are not strictly content-addressed (unknown from diff), this is incorrect caching. Even if IDs are content-addressed, projections may depend on labels/metadata/config.
    Severity: [critical]

  6. src/synix/build/runner.py: _compute_embedding_config_hash ignores base_url and other identity fields
    Failure mode: changing embedding endpoint (provider-compatible) can silently reuse embeddings because hash doesn’t change. You’re explicitly optimizing “identity fields only,” but base_url is part of identity for OpenAI-compatible services.
    Severity: [warning]

  7. src/synix/build/runner.py: _regenerate_embeddings_only calls projection._generate_embeddings(...) (private method)
    Failure mode: tight coupling to projection internals; any refactor breaks this. Also unclear if it clears prior embeddings / maintains alignment with FTS rows.
    Severity: [warning]

  8. src/synix/ext/*_synthesis.py: callable fingerprinting uses inspect.getsource() with repr() fallback
    Failure mode: lambdas/closures often don’t have stable repr across runs (addresses), causing unnecessary cache misses or cache divergence across machines. This undermines “build system” determinism.
    Severity: [warning]

  9. src/synix/ext/map_synthesis.py: output metadata copies entire input metadata
    Failure mode: metadata bloat and accidental propagation of sensitive fields; also allows upstream keys to leak into downstream layers unpredictably. No reserved namespace enforcement.
    Severity: [minor]
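
A sketch of filtered propagation that would address finding 9; the reserved-prefix convention here is an assumption, not synix's actual policy:

```python
import copy

# Assumption: internal keys follow a known prefix convention
RESERVED_PREFIXES = ("synix.", "transform_")

def propagate_metadata(input_metadata: dict) -> dict:
    """Copy input metadata downstream, dropping internal keys.

    Deep-copies values so nested structures aren't shared between artifacts
    (the shallow-copy mutability risk noted above).
    """
    return {
        k: copy.deepcopy(v)
        for k, v in input_metadata.items()
        if not k.startswith(RESERVED_PREFIXES)
    }
```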

Missing

  • Documentation updates: README/docs mention nothing about .jsonl now being treated as Claude Code sessions, nor how to force other JSONL parsing.
  • Tests for correctness of per-unit caching with multi-input units / ordering-sensitive splits (current tests only cover 1:1).
  • A clear statement/contract that artifact_id is content-addressed and immutable; otherwise the projection caching change is unsafe.
  • Migration/versioning note for .projection_cache.json (you have partial backward-compat in planning, but runner now always writes new format).

Verdict

Block — the per-unit cache keying (sorted input_ids) and projection content hashing (artifact IDs only) can produce incorrect cache hits and silently wrong outputs. Fix those contracts first, then re-review.

Review parameters
  • Model: gpt-5.2
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 2,523 lines
  • Prompt: .github/prompts/openai_review.md
  • Timestamp: 2026-02-19T06:13:07Z

@github-actions
Contributor

Note

Architectural Review — Claude Opus | Blind review (docs + diff only)

Summary

This is an omnibus PR addressing issues #1–#8: per-unit caching in the build runner, split projection hashing (content vs. embedding config), a metadata_fn hook across all ext transforms, a multi-level API key resolution chain (api_key_env → SYNIX_API_KEY → provider default), a Claude Code .jsonl adapter, 413 retry logic for embeddings, and plan estimation fixes for dirty-upstream layers. Approximately 1,900 lines, heavily weighted toward tests.
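
A minimal sketch of the resolution chain just described; field and function names are assumptions based on the PR description:

```python
import os

def resolve_api_key(explicit_key: str | None,
                    api_key_env: str | None,
                    provider_default_env: str = "OPENAI_API_KEY") -> str | None:
    """Resolve an API key by precedence:
    explicit > api_key_env > SYNIX_API_KEY > provider default env var."""
    if explicit_key:
        return explicit_key
    if api_key_env and os.environ.get(api_key_env):
        return os.environ[api_key_env]
    if os.environ.get("SYNIX_API_KEY"):
        return os.environ["SYNIX_API_KEY"]
    return os.environ.get(provider_default_env)
```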

Alignment

Strong alignment. Per-unit caching (Issue 1) directly advances DESIGN.md's incremental rebuild semantics — materialization keys should mean "if it exists, skip." The split projection hash respects the principle that cache keys capture all inputs: content changes and embedding config changes are now independent invalidation axes, which is correct (changing your embedding model shouldn't re-index FTS5). The metadata_fn hook extends the transform contract cleanly without breaking the existing prompt-centric model. The Claude Code adapter expands source format coverage — "conversations are sources" applied to a new format.

Observations

  1. [concern] _regenerate_embeddings_only calls projection._generate_embeddings — a private method on the projection class. This couples the runner to an internal API that could change. If SearchIndexProjection refactors its internals, this breaks silently. Consider adding a public regenerate_embeddings() method on the projection.

  2. [concern] 413 retry in _embed_chunk is unbounded recursion. If every halved batch still triggers 413, this recurses until single-item batches. With 10,000+ artifacts (per scale concerns), a legitimately oversized single text could cause a stack overflow. Add a depth limit or floor on batch size.

  3. [concern] _stable_callable_repr is duplicated four times (map, group, reduce, fold). Extract to a shared module. This is a maintenance risk — if the fingerprinting logic for callables changes, four files need updating.

  4. [question] SearchIndex no longer stores embedding_config in self.config (removed from models.py), but _materialize_projection now manually sets config["embedding_config"] = proj.embedding_config. Is there downstream code that reads proj.config["embedding_config"] that would break? The old path wrote it in __init__; the new path writes it only at materialization time.

  5. [question] Legacy cache format compatibility in _plan_projection: the "content_hash" in cached_entry branch duplicates ~30 lines of logic. Is there a migration path to drop the legacy format, or will this bifurcation persist indefinitely?

  6. [positive] Test coverage is excellent. 1,337 lines of tests covering per-unit caching (sequential + concurrent), embedding dimension edge cases (falsy 0), 413 recursive halving, API key precedence chains, Claude Code malformed input, middle truncation, and metadata_fn fingerprint invalidation. Edge cases are well-covered.

  7. [positive] metadata_fn properly invalidates caches. It's included in both get_cache_key and compute_fingerprint, meaning a change to the metadata function triggers rebuild. This respects the "cache keys capture all inputs" principle.

  8. [nit] The golden file diffs show search scores changing (1.58→1.60, 1.17→1.19). These appear to be FTS5 scoring changes from the embedding config being removed from self.config (which was previously included in the search index config hash). Worth a comment in the PR noting why scores shifted.

  9. [nit] _compute_content_only_hash hashes artifact IDs, not content fingerprints. The name suggests content but it's actually identity-based. Consider _compute_artifact_identity_hash.

  10. [positive] Claude Code adapter is well-bounded — returns empty list for non-Claude-Code JSONL, handles malformed lines gracefully, middle-truncates to prevent unbounded memory, and requires minimum 2 turns. The is_claude_code flag prevents false positives on arbitrary .jsonl files.

Verdict

This is a solid incremental step — it closes real usability gaps (API key collisions, missing adapter, wasted LLM calls on cached units) while respecting the build system's core invariants around caching and provenance. The 413 recursion depth and the private method coupling warrant fixes before merge.

Review parameters
  • Model: claude-opus-4-6
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 2,523 lines
  • Prompt: .github/prompts/claude_review.md
  • Timestamp: 2026-02-19T06:13:18Z

Issues 1-5 (build system fixes):
- Per-unit cache check in runner — skip cached units, only execute uncached
- EmbeddingConfig.dimensions defaults to None (not passed to API when unset)
- EmbeddingConfig.resolve_base_url() for env var fallback; manifest stores identity fields only
- Batch size default 16 + 413 retry with halved batch
- Plan estimation uses estimate_output_count when upstream is dirty

Issues 6-8 (new features):
- Ext transforms propagate input metadata: MapSynthesis auto-propagates (1:1),
  Group/Reduce/Fold accept metadata_fn for custom fields. All fingerprinted.
- API key env var collision: api_key_env field + SYNIX_API_KEY/SYNIX_EMBEDDING_API_KEY
  precedence layers in LLMConfig and EmbeddingConfig
- Claude Code JSONL adapter: parse .jsonl session files into transcript artifacts
  with metadata extraction (session_id, date, slug, cwd, git_branch)
@marklubin force-pushed the fix/issues-1-8-claude-sync branch from 7e59de5 to 7444bb9 on February 19, 2026 06:19
@github-actions
Contributor

Note

Red Team Review — OpenAI GPT-5.2 | Adversarial review (docs + diff only)

Threat assessment — Medium risk: lots of surface area touched (caching, projection cache format, adapter registry, ext transform fingerprinting/metadata), with a couple of subtle correctness traps.

One-way doors

  1. Projection cache format change (.projection_cache.json: source_hash → content_hash + embedding_hash)

    • Why hard to reverse: Once users have build dirs in the wild, you’re stuck supporting legacy + new formats indefinitely or breaking caches.
    • Safe to merge if: You document the cache file schema + versioning, add an explicit schema version field, and prove forward/back compatibility with tests that load old cache then write new (and vice versa if you care).
  2. Registering .jsonl globally to Claude Code (adapters/registry.py)

    • Why hard to reverse: .jsonl is a generic extension. Users will start relying on “drop any jsonl in sources and it works,” and you’ll collide with future JSONL sources (OpenAI logs, custom event streams, etc.).
    • Safe to merge if: Autodetect is robust (magic header/required keys) and registry supports multiple .jsonl parsers with sniffing/priority. Or constrain to a naming convention (*.claude-code.jsonl) instead of extension-wide registration.
  3. Ext transforms now accept/encode metadata_fn into cache/fingerprint

    • Why hard to reverse: This becomes part of pipeline authoring patterns. Removing/renaming it later breaks pipeline definitions and cache stability expectations.
    • Safe to merge if: You document the callable stability contract (closures, lambdas, decorators) and guarantee deterministic fingerprinting behavior across environments.

Findings

  1. src/synix/build/runner.py: per-unit cache key uses tuple(sorted(art.input_ids))

    • Risk/failure mode: This assumes unit identity == sorted input artifact IDs. That is wrong for transforms where ordering matters (N:1 reduce/fold-like units) or where config extras change the unit output. Also drops None IDs silently; you can end up with key collisions or skipping work incorrectly.
    • Severity: [critical]
  2. src/synix/build/runner.py: cached artifacts selected only by transform_fp.matches(stored_fp) + input_ids

    • Risk: If split() changes unit boundaries (e.g., grouping strategy), you may incorrectly treat an artifact as reusable because it shares the same set of inputs, even though semantics changed. Also unclear if transform_fp includes config extras from split().
    • Severity: [warning]
  3. src/synix/build/plan.py: estimate_output_count used only when upstream_dirty

    • Risk: Planning accuracy now depends on upstream_dirty plumbing and presence of the method. If split() is expensive/fragile even when upstream isn’t “dirty”, plan still calls it. This is a band-aid, not a principled contract.
    • Severity: [minor]
  4. src/synix/build/runner.py: _compute_content_only_hash hashes artifact IDs, not content

    • Risk: Name is misleading. If artifact IDs are not strictly content-addressed (unknown; docs talk about fingerprints but not guaranteed IDs), this breaks projection caching correctness.
    • Severity: [warning]
  5. src/synix/build/runner.py: _regenerate_embeddings_only calls projection._generate_embeddings (private API)

    • Risk: Tight coupling to an internal method; refactors will silently break this path. Also bypasses any invariant _materialize_projection enforces (schema setup, transactional behavior).
    • Severity: [warning]
  6. src/synix/ext/map_synthesis.py: metadata propagation copies full inp.metadata

    • Risk: Metadata bloat and accidental propagation of reserved/internal keys across layers. Also mutability risk if nested structures are reused (shallow copy only).
    • Severity: [warning]
  7. src/synix/ext/_util.py: stable_callable_repr uses inspect.getsource

    • Risk: getsource() can vary with packaging, line numbers, decorators, or missing source distributions. This will cause spurious cache invalidations or (worse) cache collisions with repr() fallback.
    • Severity: [warning]
  8. src/synix/adapters/claude_code.py: loads entire file line-by-line but builds full transcript in memory, then truncates

    • Risk: For very large sessions, you still accumulate turns unbounded before truncation. You’ll blow memory before _middle_cut helps (a bounded-accumulation sketch follows this list).
    • Severity: [warning]
  9. src/synix/adapters/registry.py: .jsonl adapter registration without autodetect multiplexing

    • Risk: Any non-Claude-Code jsonl now routes to parse_claude_code, which returns [] after scanning the entire file. That’s wasted I/O and confusing behavior (“why are my sources empty?”).
    • Severity: [warning]
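
A bounded-accumulation sketch for the memory concern in finding 8; limits and names are illustrative, not from the diff:

```python
from collections import deque

def read_turns_bounded(turns_iter, max_each: int = 500):
    """Keep at most the first and last max_each turns while streaming.

    Bounds memory up front instead of building the full transcript
    and truncating afterwards.
    """
    head: list = []
    tail: deque = deque(maxlen=max_each)
    truncated = False
    for turn in turns_iter:
        if len(head) < max_each:
            head.append(turn)
        else:
            if len(tail) == max_each:
                truncated = True  # a middle turn is about to be dropped
            tail.append(turn)
    return head, list(tail), truncated
```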

Missing

  • Tests proving per-unit caching is correct for non-1:1 units (e.g., Group/Reduce) where unit identity includes grouping key/config extras and/or input order.
  • Documentation updates: README/Docs about new env vars (SYNIX_API_KEY, SYNIX_EMBEDDING_API_KEY, api_key_env) and the changed embedding defaults (dimensions=None, batch_size=16).
  • A schema/version field in .projection_cache.json and a migration/compat note.
  • Concurrency safety discussion: multiple processes touching build dir + .projection_cache.json + manifests (race/corruption risk).

Verdict

Ship with fixes. The direction is fine, but per-unit cache identity is currently too hand-wavy and can skip work incorrectly; .jsonl registration is also overly global without a real autodetect/mux strategy.

Review parameters
  • Model: gpt-5.2
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 2,510 lines
  • Prompt: .github/prompts/openai_review.md
  • Timestamp: 2026-02-19T06:20:41Z

@github-actions
Contributor

Note

Architectural Review — Claude Opus | Blind review (docs + diff only)

Summary

A multi-issue PR (issues #1–#8) that adds per-unit caching within partially-dirty layers, a Claude Code JSONL adapter, metadata_fn hooks across all four ext transforms, split projection hashing (content vs. embedding config), flexible API key resolution chains, 413 payload-too-large retry logic, and several config defaults fixes. Roughly 1,800 lines, mostly tests.

Alignment

Strong alignment with the build-system model. Per-unit caching (issue #1) directly advances the "skip-if-unchanged" incremental rebuild principle from DESIGN.md §3.3 — previously a single new source would re-execute the entire layer. Split projection hashing correctly separates content identity from embedding config, matching the design's emphasis on cache keys capturing all inputs (§3.3). The metadata_fn extension point is consistent with Python-first composability (§4.1). Claude Code adapter expands the source format surface, fitting the "conversations are sources" model. The api_key_env / SYNIX_API_KEY chain supports real-world multi-provider pipelines without contradicting any stated bets.

Observations

  1. [positive] Per-unit caching in runner.py is architecturally significant — it makes incremental rebuilds granular at the work-unit level, not just the layer level. The tests (sequential, concurrent, and the direct _execute_transform_concurrent test) cover the key paths well.

  2. [concern] _execute_transform_concurrent unpacks units_to_run as (seq_idx, (_orig_idx, unit_inputs, config_extras)) (line ~445). The tuple is (i, unit_inputs, config_extras) — so _orig_idx is actually the original index i, and unit_inputs shadows the loop variable name from the enclosing enumerate(units). This works but is confusing to read; a named tuple or clearer destructuring would help.

  3. [concern] _regenerate_embeddings_only calls projection._generate_embeddings(embedding_config, artifacts, synix_logger) — a private method on the projection. This creates a tight coupling to an internal API that isn't visible in the diff. If that method's signature changes, this breaks silently. No test covers this path end-to-end.

  4. [question] The legacy projection cache format compatibility (lines ~680–710 in plan.py) — is there a plan to remove this? It doubles the branching logic and will accumulate cruft. A migration or version bump would be cleaner.

  5. [concern] MapSynthesis.execute now does output_metadata = dict(inp.metadata) to propagate input metadata. For large metadata dicts or many artifacts, this shallow copy could carry unexpected keys downstream (e.g., transform_fingerprint from a prior layer's runner code, visible at line 209). There's no filtering of internal metadata keys before propagation.

  6. [nit] _is_payload_too_large does string matching on "413" in the message. This could false-positive on unrelated messages containing "413" (e.g., "processed 413 items"). Checking status_code first and only falling back to string matching when the attribute is absent would be safer.

  7. [positive] stable_callable_repr in _util.py is a clean extraction. Using inspect.getsource for cache keys across all ext transforms (replacing the generic "callable" fingerprint component key with named keys like "label_fn", "group_by", "sort_by") is a correctness improvement — a transform with both label_fn and metadata_fn would have had a collision under the old single "callable" key.

  8. [positive] Test coverage is excellent — 1,300+ lines of tests covering happy paths, edge cases (dimensions=0, empty files, malformed JSONL, single-text 413), and the concurrent execution path. The mock_llm fixture integration tests for per-unit caching are particularly strong.

  9. [nit] SearchIndex.__init__ removes the line self.config["embedding_config"] = embedding_config but _materialize_projection adds config["embedding_config"] = proj.embedding_config (line 661). This works but scatters the embedding config plumbing across two files.

  10. [question] _MAX_413_DEPTH = 10 allows halving down to batch size 1 from any starting size (2^10 = 1024), but with batch_size defaulting to 16, depth 4 suffices. The constant is fine as a safety bound, but worth a comment explaining the math.
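
A depth-limited version of the halving pattern discussed in observation 10; client.embed and is_payload_too_large are stand-ins for the real calls:

```python
def is_payload_too_large(exc: Exception) -> bool:
    # Stand-in for the PR's _is_payload_too_large: attribute first, then message
    return getattr(exc, "status_code", None) == 413 or "413" in str(exc)

def embed_with_413_retry(client, texts: list[str],
                         depth: int = 0, max_depth: int = 10) -> list[list[float]]:
    """Recursively halve the batch on payload-too-large errors.

    The depth bound is a safety net: halving a default batch of 16 reaches
    single items at depth 4, so max_depth=10 should never bind in practice.
    """
    try:
        return client.embed(texts)
    except Exception as exc:
        if not is_payload_too_large(exc) or len(texts) == 1 or depth >= max_depth:
            raise  # single oversized text or depth limit: re-raise
        mid = len(texts) // 2
        return (embed_with_413_retry(client, texts[:mid], depth + 1, max_depth)
                + embed_with_413_retry(client, texts[mid:], depth + 1, max_depth))
```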

Verdict

This is a solid incremental step — the per-unit caching and split projection hashing are meaningful architectural improvements to the cache system, the metadata propagation unlocks real pipeline-authoring needs, and the test coverage is thorough. The main risk is the private-API coupling in _regenerate_embeddings_only and unchecked metadata propagation in MapSynthesis.

Review parameters
  • Model: claude-opus-4-6
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 2,510 lines
  • Prompt: .github/prompts/claude_review.md
  • Timestamp: 2026-02-19T06:20:55Z

Add metadata_fn parameter to all 4 ext transform tables in pipeline-api.md,
add Claude Code .jsonl to sources table, and update LLMConfig/EmbeddingConfig
docstrings with the new api_key_env and SYNIX_API_KEY resolution chain.
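
A hypothetical metadata_fn usage based on the signature the reviews describe (Callable[[str, list[Artifact]], dict] for Group); the Artifact stand-in below is minimal and not the real synix type:

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:  # minimal stand-in for synix's artifact type
    content: str
    metadata: dict = field(default_factory=dict)

def session_metadata(group_key: str, artifacts: list[Artifact]) -> dict:
    """Example metadata_fn for a Group transform: derive fields from the group."""
    return {
        "group_key": group_key,
        "input_count": len(artifacts),
        "dates": sorted({a.metadata.get("date") for a in artifacts
                         if a.metadata.get("date")}),
    }

# A Group transform would call this once per group and merge the returned
# dict into the output artifact's metadata (per the PR description).
example = session_metadata("session-42", [
    Artifact("...", {"date": "2026-02-18"}),
    Artifact("...", {"date": "2026-02-19"}),
])
```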
@github-actions
Contributor

Note

Red Team Review — OpenAI GPT-5.2 | Adversarial review (docs + diff only)

Threat assessment — Medium risk: mostly incremental improvements, but it introduces subtle cache/projection semantics and a new adapter that can silently drop data.

One-way doors

  1. Projection cache format change (.projection_cache.json now uses content_hash/embedding_hash)
    Hard to reverse because existing build dirs become stateful artifacts users will rely on; reverting would require migration logic in both directions and could trigger full rebuilds unexpectedly.
    Safe to merge if: you document the cache schema/versioning, add a schema version field, and add a deterministic migration test covering legacy→new and mixed-state behavior.

  2. Ext transform metadata propagation + new reserved-ish keys (source_label, group_key, input_count)
    Hard to reverse because downstream transforms/users will start depending on these metadata fields and their merge precedence. Changing later breaks pipelines silently.
    Safe to merge if: you define reserved metadata key policy (namespacing) and guarantee precedence rules in docs/tests (especially collisions).

  3. Default changes for embeddings (dimensions=None, batch_size=16)
    Hard to reverse because it affects output embedding vectors (dimensionality) and performance/cost characteristics; users may bake this into expectations/benchmarks.
    Safe to merge if: release notes call it out, and search index manifests clearly record identity fields so index compatibility is explicit.

Findings

  1. src/synix/ext/map_synthesis.py — metadata overwrite bug
    output_metadata is computed and the artifact is now constructed with metadata=output_metadata, but it is unclear whether another default metadata path in Artifact(...) still merges/overwrites it. Risk is silent metadata loss or duplication depending on Artifact constructor semantics. [warning]

  2. src/synix/ext/*_synthesis.py — cache identity via inspect.getsource is brittle
    stable_callable_repr() uses inspect.getsource() which changes with formatting, comments, or tooling; lambdas in REPL/packaged environments can throw and fall back to repr(), which is often memory-address based and non-stable. This can cause spurious cache misses across machines/installs. [warning]

  3. src/synix/build/runner.py — per-unit caching keyed only by sorted input_ids
    You match cached artifacts by (sorted(input_ids)) + transform_fp.matches(). This ignores ordering for order-sensitive transforms (any transform whose split() groups multiple inputs where order matters). Sorting the IDs makes collisions likely and can return incorrect cached outputs. [critical]

  4. src/synix/build/runner.py — cached artifacts reused without validating unit config extras
    Cache keying doesn’t incorporate config_extras returned by split() (e.g., group keys, per-unit params). Two different units with same inputs but different extras could incorrectly share cached outputs. [critical]

  5. src/synix/build/runner.py — unbounded memory/time building cached_by_inputs
    existing_artifacts = store.list_artifacts(layer.name) then builds a dict for all artifacts each run. This is O(N) extra per layer per run and can get ugly at 10k+ artifacts. Needs indexing in the store or lazy lookups. [warning]

  6. src/synix/build/runner.py — projection “embeddings-only regen” relies on private method
    _regenerate_embeddings_only() calls projection._generate_embeddings(...). That’s an internal API; any refactor breaks cache behavior. Also unclear if it cleans old embeddings / keeps dimensionality consistent. [warning]

  7. src/synix/build/plan.py — cache planning logic now diverges from runner logic
    Plan uses _compute_content_only_hash + _compute_embedding_config_hash; runner uses those too, but legacy handling differs (plan supports legacy, runner writes only new). If a user has legacy cache entries, plan may say “cached” while runner rebuilds (or vice versa) depending on exact fields present. [warning]

  8. src/synix/adapters/claude_code.py — silent data loss by skipping tool blocks
    You drop tool_use / tool_result / thinking. For Claude Code sessions, tool output often contains the actual “work product” (commands, errors). This will confuse provenance/search. At least make this behavior configurable or store tool output in metadata. [warning]

  9. src/synix/adapters/claude_code.py — metadata keys are not namespaced
    Keys like date, title, cwd, git_branch risk colliding with other adapters/transforms. Design doc pushes meta.* namespaces; this violates the spirit even if current code uses a flat dict. [minor]

Missing

  • A regression test proving per-unit caching is correct for multi-input units where order/config_extras differ (this is where your current implementation is wrong).
  • Documentation update on reserved metadata keys / namespacing and collision precedence.
  • Any mention of the new projection cache schema in docs/release notes; no explicit version field in the cache file.
  • Performance tests or at least a note for large artifact stores (runner now scans entire layer artifacts to build cached_by_inputs).

Verdict

Block. Per-unit caching as implemented can return wrong cached artifacts for multi-input/order-sensitive units and ignores per-unit config, which is correctness-breaking and will be very hard to debug once shipped. Fix the cache keying (include ordered input IDs + unit-specific extras) and add targeted tests, then reassess.
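
A sketch of a unit cache key along the lines the verdict asks for: ordered input IDs plus per-unit extras, folded into one digest (names hypothetical):

```python
import hashlib
import json

def unit_cache_key(input_ids: list[str], config_extras: dict,
                   transform_fp: str) -> str:
    """Identity for one work unit.

    Preserves input order instead of sorting (N:1 units may be
    order-sensitive) and folds per-unit config into the key.
    Assumes config_extras is JSON-serializable for determinism.
    """
    payload = {
        "inputs": list(input_ids),   # order preserved, not sorted
        "extras": config_extras,     # group keys / per-unit params
        "transform": transform_fp,
    }
    blob = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```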

Review parameters
  • Model: gpt-5.2
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 2,591 lines
  • Prompt: .github/prompts/openai_review.md
  • Timestamp: 2026-02-19T06:32:56Z

@github-actions
Contributor

Note

Architectural Review — Claude Opus | Blind review (docs + diff only)

Summary

Multi-issue PR (issues #1–#8) addressing: per-unit cache checks during incremental builds, embedding config fixes (dimensions default, batch size, 413 retry), API key resolution chain with api_key_env and SYNIX_API_KEY, metadata propagation via metadata_fn across all ext transforms, split projection hashing (content vs. embedding config), and a new Claude Code .jsonl source adapter. This is a significant quality-of-life and correctness pass.

Alignment

Strong alignment. Per-unit caching (#1) directly advances the "incremental rebuild" core promise — DESIGN.md §3.3 states "if the materialization key exists, skip." Previously, a single new source invalidated all units in a layer; now only uncached units execute. The metadata_fn addition respects the Python-first design decision (§4.1) by keeping the API composable. Split projection hashing correctly separates content identity from embedding config identity, consistent with the principle that cache keys should capture all inputs (§3.3). The Claude Code adapter expands the "conversations are sources" surface area without altering the pipeline model.

Observations

  1. [positive] runner.py per-unit cache check: The cached_by_inputs dict keyed on sorted input_ids is clean. Both sequential and concurrent paths share the same filtering logic, reducing divergence risk.

  2. [concern] _regenerate_embeddings_only calls projection._generate_embeddings() — a private method. If that method signature changes, this breaks silently. No test covers this path end-to-end; only the plan-level split-hash logic is tested.

  3. [concern] _execute_transform_concurrent unpacks units_to_run as (seq_idx, (_orig_idx, unit_inputs, config_extras)) but populates it as (i, unit_inputs, config_extras) — the _orig_idx name in the destructure is misleading since i is already the original index. The results list is sized to len(units_to_run) not len(units), which is correct but the index semantics are subtle enough to trip future contributors.

  4. [positive] stable_callable_repr in _util.py centralizes the inspect.getsource fallback pattern previously duplicated across all four ext transforms. Good DRY improvement.

  5. [question] _middle_cut uses a fixed 50-char budget for the marker (keep = max_chars - 50). The actual marker "\n\n[... middle truncated ...]\n\n" is 35 chars. Not a bug, but the slack is arbitrary — worth a constant or comment (a constant-driven sketch appears after this list).

  6. [nit] plan.py legacy cache format support: the else branch duplicates the old hashing logic wholesale (~20 lines). Consider extracting into a helper or dropping legacy support after one release cycle to avoid carrying two code paths permanently.

  7. [positive] Test coverage is thorough — 1,337 lines of tests across two new files. Happy paths, edge cases (malformed JSONL, dimensions=0 boundary, 413 recursive halving, concurrent caching), and fingerprint invalidation are all covered. The test_cached_units_skip_execute test directly asserts LLM call count, which is the right level of verification.

  8. [concern] SearchIndex.__init__ removed the line self.config["embedding_config"] = embedding_config but _materialize_projection now sets config["embedding_config"] = proj.embedding_config at call time. If any other code path read config["embedding_config"] from the SearchIndex instance directly (not through the runner), it would break. Grep-verifiable but not obvious from the diff alone.

  9. [nit] Golden file score changes (1.58→1.60, 1.17→1.19) suggest FTS scoring shifted — presumably from the metadata propagation adding source_label to indexed content. Worth a comment in the PR description explaining why scores changed.

  10. [positive] API key resolution chain (explicit > api_key_env > SYNIX_API_KEY > provider default) is well-tested and solves a real ergonomic problem for OpenAI-compatible providers sharing OPENAI_API_KEY.
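
A constant-driven version of the _middle_cut budget from observation 5; a sketch, assuming max_chars always exceeds the marker length:

```python
MARKER = "\n\n[... middle truncated ...]\n\n"

def middle_cut(text: str, max_chars: int) -> str:
    """Keep the head and tail of text, replacing the middle with MARKER.

    The slack is derived from len(MARKER) rather than hard-coded, so a
    marker change can't silently overrun the budget.
    """
    if len(text) <= max_chars:
        return text
    keep = max_chars - len(MARKER)  # assumes max_chars > len(MARKER)
    head = keep // 2
    tail = keep - head
    return text[:head] + MARKER + text[-tail:]
```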

Verdict

This is a solid incremental step — it fixes real usability gaps (per-unit caching, API key collisions, Claude Code import) while maintaining alignment with the build-system-for-memory model, though the _regenerate_embeddings_only path needs an integration test before it ships.

Review parameters
  • Model: claude-opus-4-6
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 2,591 lines
  • Prompt: .github/prompts/claude_review.md
  • Timestamp: 2026-02-19T06:33:09Z

@marklubin merged commit 95aef2a into main Feb 19, 2026
12 checks passed
@marklubin deleted the fix/issues-1-8-claude-sync branch February 19, 2026 06:39