feat: add Codex trace adapter with JSONL auto-detection#75

Open
marklubin wants to merge 2 commits into main from feat/codex-trace-adapter

Conversation

@marklubin
Owner

Summary

  • add src/synix/adapters/codex.py to parse Codex traces from:
    • history.jsonl (flat history records)
    • rollout session logs (~/.codex/sessions/**/rollout-*.jsonl style envelopes)
  • emit dual artifacts for Codex inputs:
    • session transcript (artifact_type="transcript")
    • per-turn artifact (artifact_type="transcript_turn")
  • switch .jsonl dispatch in src/synix/adapters/registry.py to auto-detect:
    • Codex JSONL -> parse_codex
    • Claude Code JSONL -> parse_claude_code
  • update adapter module docs and source format docs in docs/pipeline-api.md
  • add dedicated unit coverage in tests/unit/test_codex_adapter.py
  • update existing .jsonl registry assertion in tests/unit/test_issues_6_7_8.py

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Docs
  • Tests

Validation

  • Ran tests locally
  • Added/updated tests where needed
  • Updated docs (if applicable)

Commands run:

uv run pytest tests/unit/test_codex_adapter.py tests/unit/test_issues_6_7_8.py -q
uv run pytest tests/unit/test_codex_adapter.py tests/unit/test_issues_6_7_8.py tests/unit/test_text_adapter.py -q
uv run ruff check src/synix/adapters/codex.py tests/unit/test_codex_adapter.py

Related Issues

N/A

Notes

  • Codex adapter currently includes all assistant text blocks present in Codex response_item messages (including commentary/final phases), as requested.
  • Existing Claude Code .jsonl behavior is preserved via JSONL auto-detection fallback.

@github-actions
Contributor

Note

Red Team Review — OpenAI GPT-5.2 | Adversarial review (docs + diff only)

Threat assessment — Medium risk: adds a new source format and changes .jsonl autodetection behavior in a way that can silently misclassify inputs and alter outputs.

One-way doors

  1. New artifact types and labels for Codex (artifact_type="transcript_turn", labels t-codex-*)
    Hard to reverse because users will build pipelines/filters/search around artifact_type and label prefixes; stored artifacts become part of the build dir’s long-lived corpus.
    Safe to merge only if you’re willing to support these names long-term or you introduce an explicit “adapter version”/namespacing strategy (or document them as unstable/experimental).

  2. Changing .jsonl registry from “Claude Code only” to “autodetect Codex then Claude Code” (registry.py)
    Hard to reverse because existing users may have .jsonl inputs that previously parsed as Claude Code but will now parse as Codex (or vice versa) with different artifact shapes/metadata, causing cache invalidation and confusing diffs.
    Safe to merge only if you add an explicit override mechanism (CLI flag, per-source config, or filename-based routing) and document the precedence + ambiguity rules.

Findings

  1. src/synix/adapters/registry.py _parse_jsonl_autodetect: “try Codex first” heuristic
    Failure mode: a non-Codex JSONL that happens to have session_id + text keys (common in analytics/logging) will be treated as Codex “history” if parse_codex emits anything. That’s silent corruption, not an error.
    Severity: [critical]

  2. src/synix/adapters/codex.py parse_codex: reads entire file into memory (events.append(parsed) for all lines)
    Failure mode: JSONL traces can be huge; this is unbounded memory growth and will blow up on real Codex rollouts. This is exactly where “works at 100 conversations, fails at 10,000” happens. Streaming is required here.
    Severity: [critical]

  3. src/synix/adapters/codex.py _parse_history: forces all turns to role="user"
    Failure mode: you’re encoding history rows as user utterances regardless of whether they’re user/assistant/tool; transcript becomes misleading. If the upstream format truly only contains user prompts, say so and validate it; otherwise you’re baking in bad provenance.
    Severity: [warning]

  4. src/synix/adapters/codex.py _build_artifacts: transcript date derived from first turn, but turns are not sorted
    Failure mode: you sort by session_id but not by timestamp within a session; JSONL order is assumed. If file is merged or out-of-order, transcript ordering and date are wrong.
    Severity: [warning]

  5. src/synix/adapters/codex.py metadata schema doesn’t match DESIGN.md reserved keys
    You use flat keys ("source", "date", "session_id", "source_path") rather than the documented meta.time.* / meta.source.* / meta.chat.* shape. That’s a design violation and will make downstream generic features (time grouping, provenance UI, altitude queries) brittle across adapters.
    Severity: [warning]

  6. src/synix/adapters/codex.py label sanitization and truncation strategy
    _safe_label can collapse distinct session IDs into the same label; no collision handling. Also _middle_cut destroys content continuity (and you do it before any fingerprinting/caching, presumably), which can cause non-obvious cache churn and poor search.
    Severity: [minor]

  7. Docs update is incomplete (docs/pipeline-api.md)
    You list “Codex” support but don’t document the .jsonl ambiguity, the precedence order, emitted artifact types, or metadata fields. This is user-facing behavior.
    Severity: [warning]
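The streaming requirement in finding 2 typically reduces to a generator-based reader. This is a sketch of that shape, not the project's code; the helper name is made up.

```python
import json
from pathlib import Path
from typing import Iterator

def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed record at a time; memory stays bounded by the
    largest single line rather than the whole file."""
    with path.open(encoding="utf-8") as fh:
        for raw in fh:
            stripped = raw.strip()
            if not stripped:
                continue
            try:
                record = json.loads(stripped)
            except json.JSONDecodeError:
                continue  # tolerate occasional junk lines
            if isinstance(record, dict):
                yield record
```

Session grouping can then consume the iterator and flush each completed session as it closes, instead of accumulating every event first.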

Missing

  • A disambiguation mechanism: e.g., Source(..., format="codex") or --format codex so autodetect is not the only path.
  • Streaming parser + tests for large files: at least verify it doesn’t load all lines (or set a hard cap with a loud error).
  • Negative tests for misclassification: ensure Claude Code JSONL cannot be incorrectly parsed as Codex and vice versa; right now you only test one non-codex JSONL shape.
  • Ordering guarantees: tests asserting per-session turn ordering by timestamp when available.
  • Metadata conformance: either adapt to reserved keys or explicitly document adapter-specific metadata and how core grouping/search consumes it.
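For the ordering-guarantees bullet, a stable per-session sort is one concrete fix. A sketch, with the `ts` field name assumed from the history format described above:

```python
def sort_session_turns(turns: list[dict]) -> list[dict]:
    """sorted() is stable: turns carrying a ts are ordered by it, while
    turns without one keep their file order and sink to the end instead of
    corrupting the transcript date derived from the first turn."""
    return sorted(turns, key=lambda t: (t.get("ts") is None, t.get("ts") or 0))
```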

Verdict

Block — the .jsonl autodetect + non-streaming parsing creates silent data-shape changes and scalability failures; add explicit format selection and stream the parser before merging.

Review parameters
  • Model: gpt-5.2
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 488 lines
  • Prompt: .github/prompts/openai_review.md
  • Timestamp: 2026-02-20T00:22:55Z

@github-actions
Contributor

Note

Architectural Review — Claude Opus | Blind review (docs + diff only)

Summary

Adds a Codex JSONL adapter (parse_codex) that handles two Codex export formats — history.jsonl (flat events with session_id/ts/text) and rollout session traces (envelope with type/payload). Updates the .jsonl registry to auto-detect Codex before falling back to Claude Code. Emits both transcript and transcript_turn artifact types per session.

Alignment

Good fit. Sources are the entry point of the build DAG; adding a new source parser is the most natural extension point. Artifacts are created with content and metadata but no identity assumptions — content-addressing and fingerprinting happen downstream. The adapter stays purely mechanical (no LLM calls), consistent with DESIGN.md's separation between sources and steps. The dual-output pattern (transcript + per-turn artifacts) gives pipeline authors flexibility to depends_on at different granularities, which aligns with the altitude/hierarchy thesis.

Observations

  1. [concern] Auto-detection ordering is fragile. _parse_jsonl_autodetect tries Codex first and falls back to Claude Code only if Codex returns empty. The Codex history heuristic (all(isinstance(e.get("session_id"), str) and "text" in e for e in events)) is broad — any JSONL where every line has a string session_id and a text key will match. If a future format or a user's custom JSONL happens to have these fields, it silently gets parsed as Codex. Consider a more discriminating signal (e.g., requiring ts as well in the all() check, or checking a format marker).

  2. [concern] History format treats all events as role="user". In _parse_history, every event becomes {"role": "user", ...}. The test confirms this (assert all(t.metadata["role"] == "user" for t in turns)). If history.jsonl genuinely only contains user prompts, this is fine — but the transcript renders lines as User: ... for all of them, which would be misleading if assistant responses ever appear. Worth a code comment explaining the assumption.

  3. [question] _middle_cut truncation and downstream fingerprinting. Truncating content at DEFAULT_MAX_CHARS = 80_000 means two runs with different max_chars values produce different artifacts from identical sources. Since max_chars is a parameter on parse_codex but not captured in any materialization key, this could cause silent cache inconsistencies if someone changes it between builds. Is this value intended to be fixed?

  4. [positive] Solid test coverage. Tests cover both formats, malformed lines, non-Codex JSONL returning empty, and registry auto-detection for both Codex and Claude Code fallback. The existing Claude Code registry test was correctly relaxed rather than broken.

  5. [nit] Variable shadowing. In parse_codex, line = line.strip() shadows the loop variable. Harmless but slightly unusual — a stripped = line.strip() would be clearer.

  6. [positive] _build_artifacts is shared between both parsers. Clean factoring — history and rollout parsing differ only in how they extract sessions, then converge on the same artifact construction.

  7. [question] Rollout _default_session_id takes last 5 dash-separated parts. For the filename pattern rollout-2026-02-19T14-43-28-019c7812-..., this captures the UUID portion. But timestamps with dashes could shift the window. Is this robust enough, or should it parse more explicitly?

  8. [nit] The pipeline-api.md doc says "Emits transcript and transcript_turn artifacts" — worth noting that transcript_turn is a new artifact type not emitted by any other adapter. Downstream transforms keyed on artifact_type should be aware.

  9. [positive] Existing test updated cleanly. test_jsonl_maps_to_parse_claude_code was correctly generalized to test_jsonl_maps_to_registered_adapter — checks callable rather than identity, which is the right contract after adding the auto-detect wrapper.
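Observation 1's suggested tightening — requiring ts as well in the all() check — could look like the following. A sketch only; the helper name is hypothetical and the exact Codex field types are assumed from the review's description.

```python
def looks_like_codex_history(events: list[dict]) -> bool:
    """Stricter history heuristic sketch: demand session_id, a numeric ts,
    AND text on every row, so generic analytics/logging JSONL carrying only
    session_id + text no longer matches."""
    if not events:
        return False
    return all(
        isinstance(e.get("session_id"), str)
        and isinstance(e.get("ts"), (int, float))
        and not isinstance(e.get("ts"), bool)  # bool is an int subclass
        and "text" in e
        for e in events
    )
```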

Verdict

A clean, well-tested source adapter addition that fits naturally into the project's extension model; the main risk is the JSONL auto-detection heuristic being too permissive, which deserves tightening before more formats land.

Review parameters
  • Model: claude-opus-4-6
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 488 lines
  • Prompt: .github/prompts/claude_review.md
  • Timestamp: 2026-02-20T00:22:58Z

@github-actions
Contributor

Note

Red Team Review — OpenAI GPT-5.2 | Adversarial review (docs + diff only)

Threat assessment — Medium risk: it changes .jsonl adapter behavior globally and introduces new metadata conventions, but the tests cover the happy paths.

One-way doors

  1. .jsonl auto-detection precedence (Codex-first)

    • Why hard to reverse: Users will drop .jsonl files into source_dir and expect stable parsing. If precedence changes later, the same inputs could produce different artifacts/labels → cache invalidation + surprising rebuilds.
    • Safe to merge if: You lock down detection rules in docs (done) and add regression tests for ambiguous JSONL shapes (Codex-vs-Claude-Code collisions) plus a user override (e.g., format="codex" / format="claude-code" or CLI flag) to force adapter selection.
  2. New artifact types emitted by a source adapter: transcript_turn

    • Why hard to reverse: Artifact types become part of downstream pipelines and search/index configs. Removing/renaming breaks user pipelines and any stored artifacts keyed by artifact_type.
    • Safe to merge if: transcript_turn is documented as stable (or explicitly experimental) and the rest of the system can tolerate “extra” artifact types from sources without changing semantics elsewhere (indexing, listing, caching).
  3. Metadata key conventions: mixed flat keys vs dotted meta.* strings

    • Why hard to reverse: This effectively creates two metadata schemas. If the system later formalizes reserved metadata (DESIGN.md uses nested-ish meta.time.*), you’ll have legacy data with incompatible keys.
    • Safe to merge if: There is a single canonical metadata model. Either: (a) actually nest metadata["meta"]["time"]["date"] etc., or (b) explicitly standardize dotted-string keys across all adapters and docs. Right now it’s inconsistent.
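If option (a) — actually nesting — were chosen, a small converter could normalize the dotted string keys this adapter emits into one canonical shape. A sketch under that assumption:

```python
def nest_dotted(meta: dict) -> dict:
    """Turn dotted string keys ('meta.time.date') into nested dicts so a
    single canonical metadata shape exists; keys without dots pass through
    unchanged. Assumes dotted prefixes don't collide with flat keys."""
    out: dict = {}
    for key, value in meta.items():
        parts = key.split(".")
        node = out
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return out
```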

Findings

  1. src/synix/adapters/registry.py: _parse_jsonl_autodetect returns Claude Code artifacts even when Codex detection fails

    • Risk: parse_codex() returns [] for unknown JSONL; then you unconditionally call parse_claude_code(). If Claude Code parser is permissive, you may misclassify random JSONL as “claude-code” rather than returning empty. Your tests only cover one non-codex case (analytics.jsonl) but they call parse_codex directly, not registry fallback.
    • Severity: [warning]
  2. src/synix/adapters/codex.py: mode detection is “first row decides”, with hard fail to []

    • Risk: If the first non-empty line is malformed/unexpected but later lines are valid Codex, you return [] and discard the whole file. That’s a brittle implicit dependency on file ordering. Real traces often contain headers, comments, or occasional junk.
    • Severity: [warning]
  3. src/synix/adapters/codex.py: inconsistent reserved metadata implementation

    • Pattern: meta.update({"meta.source.adapter": "codex", ...}) alongside "source": "codex", "date": date, "timestamp": ....
    • Risk: Downstream code likely expects one convention. DESIGN.md reserves meta.time.created_at/period, meta.chat.conversation_id, etc. This adapter invents meta.time.date and meta.chat.session_id. That fragmentation will leak everywhere (search facets, grouping, rollups).
    • Severity: [warning]
  4. src/synix/adapters/codex.py: potential memory blow-up on large JSONL

    • Pattern: sessions: dict[str, list[dict]] accumulating all turns for all sessions before emitting.
    • Risk: Codex traces can be huge. This is unbounded memory proportional to file size. The max_chars truncation only applies at final transcript rendering, not while accumulating.
    • Severity: [warning]
  5. src/synix/adapters/codex.py: DEFAULT_MAX_CHARS = 80_000 is a magic number

    • Risk: This silently changes content and therefore fingerprints/caching/provenance. Also, 80k chars may still be too big for some downstream LLM prompts, and too small for others.
    • Severity: [minor]
  6. tests/unit/test_codex_adapter.py: missing assertion about registry behavior on truly unknown JSONL

    • Risk: You don’t test parse_file() on non-codex/non-claude-code JSONL. If Claude Code parser starts accepting more shapes, registry will incorrectly ingest garbage.
    • Severity: [minor]

Missing

  • A test that parse_file(analytics.jsonl) (or other random JSONL) returns [] through registry, not just parse_codex.
  • A test for “first line unrecognized, later lines valid Codex” to expose the brittle mode selection.
  • Documentation update clarifying whether transcript_turn is a stable public artifact type and how it interacts with indexing/transforms.
  • A clear, project-wide metadata schema decision (nested vs dotted vs flat) and alignment across adapters.

Verdict

Ship with fixes — the functionality is reasonable, but the .jsonl registry fallback and metadata schema inconsistencies are the kind of early mess that becomes permanent technical debt. Fix the autodetect robustness + add the missing registry tests before merge.

Review parameters
  • Model: gpt-5.2
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 581 lines
  • Prompt: .github/prompts/openai_review.md
  • Timestamp: 2026-02-20T00:26:55Z

@github-actions
Contributor

Note

Architectural Review — Claude Opus | Blind review (docs + diff only)

Summary

Adds a Codex JSONL adapter (parse_codex) that handles two Codex file shapes — history.jsonl (user prompt history) and rollout session traces — producing transcript and transcript_turn artifacts. Updates the .jsonl registry to auto-detect Codex before falling back to Claude Code. Includes documentation and tests.

Alignment

Strong fit. DESIGN.md §1.0 frames Synix around importing conversation exports from multiple providers ("ChatGPT exports, Claude exports"). Adding Codex as another source format is a natural extension. The adapter produces immutable Artifact objects with structured metadata including provenance keys (meta.source.adapter, meta.chat.session_id, meta.time.date), which aligns with the reserved metadata key conventions from DESIGN.md §3.4. Content-addressing is preserved — the adapter is a pure function from file to artifacts.

Observations

  1. [positive] Parser precedence for .jsonl is well-designed — rollout envelope (structural, unambiguous) → history shape (filename-gated) → Claude Code fallback. The filename gate on history.jsonl (_is_history_row checks filepath.name) prevents false positives on arbitrary JSONL files. Test test_parse_codex_history_requires_history_filename explicitly validates this.

  2. [positive] Test coverage is thorough: happy paths for both history and rollout formats, malformed line resilience, non-Codex JSONL rejection, timestamp sorting, registry auto-detection for Codex vs. Claude Code, and the false-positive filename guard. 13 test cases covering meaningful edge cases.

  3. [concern] _middle_cut with max_chars=80_000 silently truncates long sessions. This is a lossy operation that could affect downstream fingerprinting — if a session grows and the truncation boundary shifts, the content hash changes and everything downstream rebuilds even though the actual conversation didn't meaningfully change. Consider documenting this as a known limitation or making the truncation boundary stable (e.g., truncate by turn count instead).

  4. [question] The transcript_turn artifact type emits one artifact per turn. For a Codex session with hundreds of tool-use turns, this could produce thousands of artifacts per file. At 10,000+ source files with verbose sessions, this is potentially O(n) in a way that could stress the build directory and SQLite. Is there a practical bound, or should large sessions skip per-turn emission?

  5. [nit] _parse_rollout_row uses a mutable state dict passed by reference to track known_session_id. This works but is non-obvious — a small dataclass or named state object would improve legibility for new contributors.

  6. [concern] The existing test test_jsonl_maps_to_parse_claude_code was weakened to just assert callable(adapter). This loses the property that the test originally verified — which specific adapter handles .jsonl. The new assertion proves nothing; any function would pass. Consider asserting it's _parse_jsonl_autodetect specifically, or testing the dispatch behavior instead.

  7. [positive] Dual metadata strategy — flat keys (source, session_id) for immediate readability alongside namespaced keys (meta.source.adapter, meta.chat.session_id) for downstream processing — is pragmatic and consistent with the existing Claude Code adapter pattern.

  8. [nit] sort_ts type annotation in the sort lambda uses float | int (Python 3.10+ union syntax). Confirm this matches the project's minimum Python version.

  9. [question] The rollout parser extracts session_meta payload fields (cwd, cli_version, model_provider) into base_metadata but only for the first session_meta event. If a file contains multiple session_meta events (e.g., reconnections), only the last one's fields survive due to .update(). Is this intentional?

Verdict

This is a clean, well-tested source adapter addition that fits naturally into the project's "conversations are sources" model — a good incremental step.

Review parameters
  • Model: claude-opus-4-6
  • Context: README.md, DESIGN.md, synix.dev, PR diff
  • Diff size: 581 lines
  • Prompt: .github/prompts/claude_review.md
  • Timestamp: 2026-02-20T00:27:09Z
