RFC: Trajectory + Event Bus Integration for Multi-Session Debugging #57

@evansenter

Problem

When multiple Claude Code sessions coordinate via claude-event-bus, debugging issues that span sessions is difficult:

  1. Trajectories are per-session: gemicro's Trajectory captures LLM requests/responses within a single agent run, but when sessions communicate via the event bus (e.g., RFC negotiations, dependency signaling), the cross-session flow is invisible.

  2. No correlation: If Session A publishes task_completed and Session B polls and acts on it, there's no way to trace that causal chain after the fact.

  3. Replay is single-session: MockLlmClient can replay one trajectory, but can't reproduce a multi-session workflow where timing and event ordering matter.

Use Cases

1. Cross-Session Debugging

"Session B failed after receiving an event from Session A—what did A actually send?"

Answering this currently requires manually correlating logs from both sessions. With trajectory events on the bus, you'd have a unified timeline.

2. Multi-Session Replay

Reproduce a 3-session parallel workflow offline by replaying:

  • Each session's LLM trajectory

  • The event bus messages in order

  • The causal dependencies between them
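The ordering step could look roughly like the sketch below. It assumes each session and the bus can export records with a comparable timestamp; the `Record` type and its fields (`ts`, `source`, `kind`) are hypothetical and not taken from gemicro or claude-event-bus.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Record:
    ts: float    # hypothetical: seconds since the workflow started
    source: str  # session ID, or "bus" for event bus messages
    kind: str    # e.g. "llm_step" or "task_completed"


def merge_timeline(*streams: Iterable[Record]) -> Iterator[Record]:
    """Merge per-source streams (each already sorted by ts) into one timeline."""
    return heapq.merge(*streams, key=lambda r: r.ts)


# Illustrative data only: two sessions plus the bus.
session_a = [Record(0.0, "A", "llm_step"), Record(2.5, "A", "llm_step")]
session_b = [Record(3.2, "B", "llm_step")]
bus = [Record(3.0, "bus", "task_completed")]

timeline = list(merge_timeline(session_a, session_b, bus))
for record in timeline:
    print(f"{record.ts:>5.1f}  {record.source:<4} {record.kind}")
```

Timestamps only give an ordering. Recovering the causal dependencies would need an explicit link between a step and the event that triggered it, such as the `triggered_by_event` field proposed under Option B.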

3. Swarm Observability

When running distributed Claude sessions (future Tailscale support), centralized trajectory events enable:

  • Bottleneck detection (which session is blocking others?)

  • Error propagation tracing (which session's failure cascaded?)

  • Cost attribution (which session consumed the most tokens?)
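As a rough illustration of cost attribution, the sketch below totals tokens per session. It assumes events shaped like the `trajectory_step_completed` payload proposed under Option A; the sample values and the second session ID are made up.

```python
from collections import Counter

# Illustrative events; only the fields needed for attribution are shown.
events = [
    {"session_id": "abc123", "step": {"tokens_in": 1500, "tokens_out": 450}},
    {"session_id": "def456", "step": {"tokens_in": 800, "tokens_out": 200}},
    {"session_id": "abc123", "step": {"tokens_in": 700, "tokens_out": 350}},
]


def tokens_by_session(events):
    """Sum input and output tokens for each session_id."""
    totals = Counter()
    for event in events:
        step = event["step"]
        totals[event["session_id"]] += step["tokens_in"] + step["tokens_out"]
    return totals


totals = tokens_by_session(events)
top_session, top_tokens = totals.most_common(1)[0]
print(top_session, top_tokens)
```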

Proposed Approach

Option A: Trajectory Events Published to Bus

Sessions publish key trajectory events to the event bus:

# New event types
trajectory_step_completed  # LLM request/response pair finished
tool_execution_failed      # Tool call failed (with error context)
phase_started              # Agent entered new phase
phase_completed            # Agent completed phase
trajectory_finalized       # Full trajectory available for export

Payload structure:

{
  "type": "trajectory_step_completed",
  "session_id": "abc123",
  "correlation_id": "workflow-xyz",
  "step": {
    "phase": "research",
    "model": "claude-sonnet-4-20250514",
    "tokens_in": 1500,
    "tokens_out": 450,
    "duration_ms": 2300,
    "tool_calls": ["web_search", "file_read"]
  }
}
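For reference, a publisher could build this payload from typed values rather than hand-written dicts. The sketch below is one way to do that in Python; the class names `Step` and `TrajectoryStepCompleted` are hypothetical, and only the field names come from the payload above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Step:
    phase: str
    model: str
    tokens_in: int
    tokens_out: int
    duration_ms: int
    tool_calls: List[str] = field(default_factory=list)


@dataclass
class TrajectoryStepCompleted:
    session_id: str
    correlation_id: str
    step: Step
    type: str = "trajectory_step_completed"

    def to_json(self) -> str:
        # asdict() recurses into the nested Step dataclass.
        return json.dumps(asdict(self))


event = TrajectoryStepCompleted(
    session_id="abc123",
    correlation_id="workflow-xyz",
    step=Step(
        phase="research",
        model="claude-sonnet-4-20250514",
        tokens_in=1500,
        tokens_out=450,
        duration_ms=2300,
        tool_calls=["web_search", "file_read"],
    ),
)
print(event.to_json())
```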

Option B: Correlation IDs Only

Lighter-weight: sessions just exchange correlation IDs via the bus, and trajectories reference them:

// Event bus message
{"type": "task_started", "correlation_id": "feature-auth-123"}

// Trajectory step metadata
{"correlation_id": "feature-auth-123", "triggered_by_event": "evt_789"}

Post-hoc analysis joins trajectories by correlation ID.
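A minimal sketch of that join, assuming bus messages and trajectory steps have been exported as plain dicts. The `id` and `session_id` fields on the sample records are assumptions added for illustration; the messages above only show `type`, `correlation_id`, and `triggered_by_event`.

```python
from collections import defaultdict


def join_by_correlation(bus_events, trajectory_steps):
    """Group bus events and trajectory steps that share a correlation_id."""
    joined = defaultdict(lambda: {"events": [], "steps": []})
    for event in bus_events:
        joined[event["correlation_id"]]["events"].append(event)
    for step in trajectory_steps:
        joined[step["correlation_id"]]["steps"].append(step)
    return dict(joined)


# Illustrative exports from the bus and from two sessions' trajectories.
bus_events = [
    {"id": "evt_789", "type": "task_started", "correlation_id": "feature-auth-123"},
]
trajectory_steps = [
    {"session_id": "A", "correlation_id": "feature-auth-123", "triggered_by_event": None},
    {"session_id": "B", "correlation_id": "feature-auth-123", "triggered_by_event": "evt_789"},
]

joined = join_by_correlation(bus_events, trajectory_steps)
workflow = joined["feature-auth-123"]

# Resolve which bus event triggered each step, where one is recorded.
events_by_id = {e["id"]: e for e in workflow["events"]}
triggers = {
    s["session_id"]: events_by_id.get(s["triggered_by_event"])
    for s in workflow["steps"]
}
```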

Option C: Centralized Trajectory Store

Event bus becomes the trajectory store:

  • Sessions stream trajectory steps to the bus as they execute

  • Bus persists to SQLite alongside events

  • Single query surface for "show me everything that happened"

Tradeoff: Higher bus load, but simplest debugging experience.
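To make the single query surface concrete, here is a sketch using Python's built-in `sqlite3` module. The schema is an assumption: it stores bus events and trajectory steps in one `events` table, which is only one possible layout and not the actual claude-event-bus schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (
        id INTEGER PRIMARY KEY,
        ts REAL NOT NULL,
        session_id TEXT NOT NULL,
        correlation_id TEXT,
        type TEXT NOT NULL,
        payload TEXT NOT NULL
    );
    CREATE INDEX idx_events_correlation ON events (correlation_id, ts);
    """
)


def record(ts, session_id, correlation_id, type_, payload):
    """Persist one bus event or trajectory step as a row."""
    conn.execute(
        "INSERT INTO events (ts, session_id, correlation_id, type, payload) "
        "VALUES (?, ?, ?, ?, ?)",
        (ts, session_id, correlation_id, type_, json.dumps(payload)),
    )


def everything_that_happened(correlation_id):
    """Return (ts, session_id, type) rows for one workflow, oldest first."""
    rows = conn.execute(
        "SELECT ts, session_id, type FROM events "
        "WHERE correlation_id = ? ORDER BY ts",
        (correlation_id,),
    )
    return rows.fetchall()


# Illustrative rows: three belong to one workflow, one to another.
record(1.0, "A", "workflow-xyz", "trajectory_step_completed", {"phase": "research"})
record(2.0, "A", "workflow-xyz", "task_completed", {})
record(2.5, "B", "workflow-xyz", "trajectory_step_completed", {"phase": "implement"})
record(3.0, "C", "other", "task_started", {})

timeline = everything_that_happened("workflow-xyz")
```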

Questions

| Question | Why It Matters | Proposed Answer |
|----------|----------------|-----------------|
| Which events are worth publishing? | Too many = noise, too few = gaps | Start with phase boundaries + failures |
| Should full LLM responses be included? | Privacy/size vs. debuggability | No, just metadata (model, tokens, duration) |
| How to handle correlation across repos? | gemicro ↔ rust-genai RFC pattern | Use issue/PR URLs as correlation IDs |
| Real-time vs. batch? | Streaming overhead vs. post-hoc delay | Batch at phase boundaries |
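The proposed answers (metadata only, batched at phase boundaries, failures included) could combine as in the sketch below. `BatchingPublisher` and its method names are hypothetical, and publishing failures immediately rather than with the batch is an assumption, not something this RFC has decided.

```python
class BatchingPublisher:
    """Buffers step metadata and emits one event per phase boundary."""

    def __init__(self, publish):
        self._publish = publish  # any callable that accepts one event dict
        self._buffer = []

    def step_completed(self, metadata):
        # Metadata only (model, tokens, duration); no full LLM responses.
        self._buffer.append(metadata)

    def tool_failed(self, error):
        # Assumption: failures bypass the batch so other sessions see them early.
        self._publish({"type": "tool_execution_failed", "error": error})

    def phase_completed(self, phase):
        self._publish(
            {"type": "phase_completed", "phase": phase, "steps": self._buffer}
        )
        self._buffer = []


# Usage with a list standing in for the bus client.
sent = []
publisher = BatchingPublisher(sent.append)
publisher.step_completed({"model": "claude-sonnet-4-20250514", "tokens_in": 1500})
publisher.step_completed({"model": "claude-sonnet-4-20250514", "tokens_in": 700})
buffered_only = len(sent)  # nothing has been published yet
publisher.tool_failed("web_search timed out")
publisher.phase_completed("research")
```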

Blocking Decisions

| Decision | Owner | Options |
|----------|-------|---------|
| Event granularity | Human | Option A (full events) vs B (correlation only) vs C (centralized) |
| Integration point | Claude | Hook into gemicro's TrajectoryBuilder vs. standalone wrapper |
| Correlation ID format | Claude | UUID vs. semantic (e.g., repo/issue#123) vs. hybrid |
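To make the hybrid option easier to discuss, here is one possible shape: a semantic prefix plus a short random suffix. The `repo#issue:suffix` layout and both function names are assumptions for illustration, not a proposed standard.

```python
import uuid


def make_correlation_id(repo=None, issue=None):
    """Return 'repo#issue:suffix' when semantic info exists, else just the suffix."""
    suffix = uuid.uuid4().hex[:8]
    if repo and issue is not None:
        return f"{repo}#{issue}:{suffix}"
    return suffix


def parse_correlation_id(correlation_id):
    """Split an ID into (semantic part or None, random suffix)."""
    semantic, _, suffix = correlation_id.rpartition(":")
    return (semantic or None, suffix)


hybrid = make_correlation_id("gemicro", 123)
opaque = make_correlation_id()
```

The semantic part keeps IDs readable in logs, while the suffix distinguishes repeated workflows against the same issue.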

Implementation Sketch

  1. gemicro: Add optional EventBusPublisher to TrajectoryBuilder that publishes on step_completed

  2. claude-event-bus: New event types + optional trajectory storage table

  3. dotfiles: /trajectory-timeline command to visualize cross-session flows

  4. Future: TrajectoryDataset::from_event_bus() to load multi-session trajectories for evaluation

Open Questions for Human

  1. Is real-time debugging the priority, or is post-hoc analysis sufficient?

  2. Should trajectory events be opt-in (per-session flag) or always-on?

  3. What's the privacy model for trajectory data on a shared bus?

Metadata

Labels: backlog (Long-term exploration, not actively planned), blocked (Waiting on external dependency), priority:low (Nice to have, backlog items)
