304 changes: 304 additions & 0 deletions docs/routing-design-phase-d.md
@@ -0,0 +1,304 @@
# Phase D Routing Design

## Evidence Source

- Diagnostics log: `~/Library/Logs/cq/routes-v0.18.0-uat.jsonl`
- Corpus marker: `~/Library/Logs/cq/routes-v0.18.0-uat.start` (manually created with `date -u` to record the start of collection before UAT traffic)
- Collection window: 2026-04-28T11:10:56.349183Z to 2026-04-28T11:22:22.404814Z
- cq version: `0.18.0`
- Runtime: Homebrew service, restarted after adding `diagnostics_log` because the field is read at proxy startup.
- Diagnostics file mode: `0600` (`-rw-------`).

Safety validation commands used:

```bash
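DIAG_LOG="$HOME/Library/Logs/cq/routes-v0.18.0-uat.jsonl"
# Every line must parse as JSON.
jq . "$DIAG_LOG" >/dev/null
# Route-kind distribution.
jq -r '.route_kind // "(missing)"' "$DIAG_LOG" | sort | uniq -c | sort -rn
# Account hints that do not match the redacted format (expect no output).
jq -r 'select(.account_hint != null and .account_hint != "") | .account_hint' "$DIAG_LOG" \
| grep -vE '^(claude|codex):[0-9a-f]{12}$'
# Broad credential-leak patterns (expect no output).
grep -E 'Bearer|sk-|oauth|refresh_token|access_token|local_token|@' "$DIAG_LOG" || true
# Negative latency or status values (expect no output).
jq 'select((.latency_ms // 0) < 0)' "$DIAG_LOG"
jq 'select((.status_code // 0) < 0)' "$DIAG_LOG"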
```

Results: JSONL parsed successfully, the account-hint format check produced zero invalid lines, no broad credential-leak patterns were found, and no negative latency or status values were present.
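
The same checks could be folded into one repeatable helper. The following is a minimal sketch, not part of cq: `validate_lines` is a hypothetical name, and the field names and patterns are copied from the commands above.

```python
import json
import re

# Redacted account-hint format, taken from the grep pattern above.
HINT_RE = re.compile(r"^(claude|codex):[0-9a-f]{12}$")
# The same broad leak patterns as the grep check above.
LEAK_RE = re.compile(r"Bearer|sk-|oauth|refresh_token|access_token|local_token|@")


def validate_lines(lines):
    """Return a list of (line_number, problem) tuples; empty means clean."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if LEAK_RE.search(line):
            problems.append((number, "possible credential leak"))
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "invalid JSON"))
            continue
        if not isinstance(event, dict):
            problems.append((number, "not a JSON object"))
            continue
        hint = event.get("account_hint")
        if hint not in (None, "") and not HINT_RE.match(hint):
            problems.append((number, "bad account_hint format"))
        if (event.get("latency_ms") or 0) < 0:
            problems.append((number, "negative latency_ms"))
        if (event.get("status_code") or 0) < 0:
            problems.append((number, "negative status_code"))
    return problems
```

An empty result corresponds to the clean outcome reported above. The leak patterns are deliberately broad, so a hit means "inspect this line", not "a credential leaked".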

## Corpus Coverage

Route-kind distribution:

```text
288 anthropic_messages
7 codex_native
5 anthropic_count_tokens
2 health
2 codex_compact
1 codex_legacy_websocket
1 codex_app_server
```

Covered route kinds:

- `health`
- `anthropic_messages`
- `anthropic_count_tokens`
- `codex_native`
- `codex_compact` (emitted from `internal/proxy/codex_compact.go`)
- `codex_legacy_websocket`
- `codex_app_server`

Not observed:

- None. Every known route kind appears in the refreshed corpus; some hard-to-trigger route kinds were covered by later client traffic while diagnostics remained enabled.

Notes:

- Real CLI usage produced `health`, `anthropic_messages`, `anthropic_count_tokens`, `codex_native`, and `codex_legacy_websocket` traffic.
- Minimal local proxy requests were used to cover `codex_native`, `codex_compact`, and `codex_app_server` where the live clients did not naturally exercise those paths.
- Synthetic route-coverage requests generated expected invalid/request-shape responses for some paths and should not be interpreted as product failures.

Provider/model distribution:

```text
115 codex anthropic_messages gpt-5.5
87 claude anthropic_messages
59 claude anthropic_messages claude-sonnet-4-6
27 claude anthropic_messages claude-opus-4-7
6 codex codex_native
5 codex anthropic_count_tokens gpt-5.5
2 proxy health
2 codex codex_compact gpt-5.5
1 codex codex_legacy_websocket
1 codex codex_native gpt-5.5
1 codex codex_app_server
```

Latency by route kind:

```text
anthropic_count_tokens count=5 avg=357.2 max=1094
anthropic_messages count=288 avg=6837.2 max=117481
codex_app_server count=1 avg=0.0 max=0
codex_compact count=2 avg=77451.5 max=154813
codex_legacy_websocket count=1 avg=469782.0 max=469782
codex_native count=7 avg=22437.6 max=98836
health count=2 avg=102.5 max=105
```

The higher `codex_native`, `codex_compact`, and `codex_legacy_websocket` maxima came from live Codex/client traffic and should be treated as small-sample observations, not routing-policy conclusions. The zero-millisecond `codex_app_server` latency came from an immediately rejected non-upgrade request, so it should not be compared with full upstream requests.
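
The small-sample caveat is visible in the arithmetic: with `count=2`, `avg=77451.5`, and `max=154813`, the other `codex_compact` sample must have been 2 × 77451.5 - 154813 = 90 ms, so the average describes neither request. A minimal sketch of how the table above could be recomputed follows; `latency_by_route_kind` is a hypothetical helper, and only the `route_kind` and `latency_ms` field names come from the log.

```python
import json
from collections import defaultdict


def latency_by_route_kind(lines):
    """Map route_kind -> (count, average latency_ms, max latency_ms)."""
    samples = defaultdict(list)
    for line in lines:
        event = json.loads(line)
        kind = event.get("route_kind") or "(missing)"
        # Treat a missing latency as 0, mirroring `.latency_ms // 0` in jq.
        samples[kind].append(event.get("latency_ms") or 0)
    return {
        kind: (len(values), sum(values) / len(values), max(values))
        for kind, values in sorted(samples.items())
    }
```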

## Observed Request Boundaries

Observed traffic arrived in bursts over roughly 686 seconds. Minute clustering was:

```text
4 2026-04-28T11:10
19 2026-04-28T11:11
22 2026-04-28T11:12
15 2026-04-28T11:13
24 2026-04-28T11:14
43 2026-04-28T11:15
40 2026-04-28T11:16
24 2026-04-28T11:17
33 2026-04-28T11:18
15 2026-04-28T11:19
22 2026-04-28T11:20
31 2026-04-28T11:21
14 2026-04-28T11:22
```

Sorted events showed many idle gaps over five seconds, ranging from about 5.0s to 37.7s before the next request. This suggests an idle-gap heuristic could identify coarse bursts, but the corpus still does not justify a threshold because long model latency and active client sessions can also create similar gaps.
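
The gap measurement can be sketched as below. This is an illustration under stated assumptions: `idle_gaps` and `parse_time` are hypothetical helpers, the `time` field is assumed to use the same microsecond UTC format as the collection window above, and the five-second default only mirrors the figure in this section, not a recommended threshold.

```python
import json
from datetime import datetime


def parse_time(value):
    """Parse a UTC timestamp such as 2026-04-28T11:10:56.349183Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def idle_gaps(lines, threshold_seconds=5.0):
    """Return gaps in seconds between consecutive events that exceed the threshold."""
    # Sort first: events are not guaranteed to be written in time order.
    times = sorted(parse_time(json.loads(line)["time"]) for line in lines)
    gaps = []
    for earlier, later in zip(times, times[1:]):
        gap = (later - earlier).total_seconds()
        if gap > threshold_seconds:
            gaps.append(gap)
    return gaps
```

Because each event carries a single timestamp, a gap measured this way cannot by itself tell a quiet client from one long in-flight response, which is the ambiguity described above.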

Candidate boundaries evaluated against the corpus:

- End of a full `/v1/messages` response: plausible. Each `anthropic_messages` event has final status/latency and can act as a safe point after a completed request.
- `count_tokens` immediately followed by `messages`: observed for Codex-routed Anthropic-compatible traffic. These likely belong to the same request burst and should not independently trigger account rebalance before the following message request.
- End of a Codex app-server/WebSocket session: partially confirmed. One `codex_legacy_websocket` event was observed with long latency, but only one event is insufficient to define close-boundary policy. `codex_app_server` was still covered only by a non-upgrade request that produced `websocket_upgrade_required`.
- Compaction boundaries: two `codex_compact` requests were observed. They remain plausible boundaries because compaction relates to conversation summarisation, but the small count does not prove whether compaction should end or continue stickiness.
- Idle gap between bursts: promising for coarse session segmentation, but sensitive to streaming duration, client retries, and concurrent requests.

Every line of the diagnostics log is valid JSON, but request events are not guaranteed to be written in chronological order under concurrent traffic. Any analysis or future helper should sort by `time` before inferring adjacency.
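
As an example of adjacency analysis that sorts first, the following sketch counts `anthropic_count_tokens` events directly followed by an `anthropic_messages` event. `count_token_pairs` and `event_time` are hypothetical names, and the timestamp format is an assumption based on the collection window recorded earlier.

```python
import json
from datetime import datetime


def event_time(event):
    """Parse the event's `time` field (assumed UTC with a trailing Z)."""
    return datetime.fromisoformat(event["time"].replace("Z", "+00:00"))


def count_token_pairs(lines):
    """Count count_tokens events immediately followed by a messages event."""
    events = sorted((json.loads(line) for line in lines), key=event_time)
    pairs = 0
    for current, following in zip(events, events[1:]):
        if (current.get("route_kind") == "anthropic_count_tokens"
                and following.get("route_kind") == "anthropic_messages"):
            pairs += 1
    return pairs
```

Under concurrent traffic, adjacency in time is only a hint that two events belong together; a correlation field would be needed to confirm it.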

## Account-Hint and Failover Findings

Distinct account hints:

- Claude: 1 (`claude:6697888dcf5a`)
- Codex: 1 (`codex:de879f4afc23`)
- Proxy/health: 0

Account-hint distribution:

```text
115 anthropic_messages codex codex:de879f4afc23
86 anthropic_messages claude claude:6697888dcf5a
7 codex_native codex codex:de879f4afc23
5 anthropic_count_tokens codex codex:de879f4afc23
2 codex_compact codex codex:de879f4afc23
1 codex_legacy_websocket codex codex:de879f4afc23
```

No account-hint churn was observed within either provider because the corpus contains only one Claude account hint and one Codex account hint. The corpus therefore supports privacy/format validation, but it cannot show how often real multi-account selection changes inside longer bursts.
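
For a later multi-account corpus, churn could be measured as the number of hint changes per provider in time order. The sketch below is an assumption-laden illustration: `hint_churn` is a hypothetical helper, and it derives the provider from the `claude:`/`codex:` prefix of `account_hint` instead of relying on a separate provider field.

```python
import json
from datetime import datetime


def hint_churn(lines):
    """Count account-hint changes per provider prefix, in time order."""
    events = [json.loads(line) for line in lines]
    events.sort(key=lambda e: datetime.fromisoformat(e["time"].replace("Z", "+00:00")))
    last_hint = {}
    churn = {}
    for event in events:
        hint = event.get("account_hint")
        if not hint:
            continue  # events without a selected account carry no hint
        provider = hint.split(":", 1)[0]
        churn.setdefault(provider, 0)
        if provider in last_hint and last_hint[provider] != hint:
            churn[provider] += 1
        last_hint[provider] = hint
    return churn
```

On the current corpus this would report zero churn for both providers, which matches the finding above.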

Failover was not observed naturally. This is acceptable for the initial corpus because credentials/quota failures were not deliberately forced.

`pin_active` behaved as expected at the event level: pinned Claude requests showed `pin_active: true`, and the unpinned request showed `pin_active` absent/null. Some invalid-proxy-token synthetic requests also reported `pin_active: true` even though no account was selected. This happens because `PinActive` reflects server pin state before auth validation; it is not proof that the pin was applied to a handled request. Phase D should pair it with `account_hint` and terminal status context.
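
The pairing rule can be expressed as a small predicate. This is a sketch only: `pin_handled_request` is a hypothetical name, the hint pattern is the one used in the safety validation commands, and "terminal status context" is narrowed here to a 2xx `status_code`.

```python
import re

HINT_RE = re.compile(r"^(claude|codex):[0-9a-f]{12}$")


def pin_handled_request(event):
    """True only when pin state, account hint, and status all agree."""
    hint = event.get("account_hint") or ""
    status = event.get("status_code") or 0
    return (
        event.get("pin_active") is True
        and HINT_RE.match(hint) is not None
        and 200 <= status < 300
    )
```

An event from a rejected request, with `pin_active: true` but no account hint, is classified as not handled by the pin.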

Error summary:

```text
87 authentication_error:invalid_proxy_token
6 api_error:codex_upstream_error
1 invalid_request_error:websocket_upgrade_required
```

The `invalid_proxy_token` events came from local non-production/synthetic request paths and demonstrate safe error-code logging. The Codex upstream errors were logged as redacted, safe error codes, and the app-server route produced the expected safe WebSocket-upgrade error code for a non-upgrade request.

## Candidate Natural Boundaries

1. Completed provider request
   - Treat successful or terminal `/v1/messages`, `/responses`, and compact responses as safe points for changing account selection.
   - Do not change account mid-stream.

2. Explicit route-kind boundaries
   - `codex_compact` may indicate a natural summarisation boundary.
   - `codex_app_server`/`codex_legacy_websocket` should be treated as session-like once a successful upgrade/close can be observed.

3. Idle-gap boundary
   - Candidate only after more data.
   - The current corpus shows multiple gaps above five seconds, but not enough diversity to distinguish normal model latency from a true user/session boundary.

4. Model-change boundary
   - The corpus includes `gpt-5.5`, `claude-sonnet-4-6`, and `claude-opus-4-7` events.
   - Model changes can indicate a new request class, but the same broad time window included concurrent Codex and Claude traffic, so model changes alone should not force a boundary.

## Candidate Session Signals

- Idle-gap heuristic: useful as a fallback for clients that do not identify sessions, but only after sorting by event time and choosing a conservative threshold from a larger corpus.
- Explicit future header such as `X-CQ-Session-ID`: strongest option when clients can supply it. It would avoid guessing from timing and model names, but requires client support and privacy review.
- Model-change boundary: weak signal. It may help terminate stickiness when switching between materially different models, but should not be the primary session key.
- Route-kind boundary: strong for known session lifecycle routes once natural `codex_app_server`, `codex_legacy_websocket`, and `codex_compact` coverage is available.
- Account hint continuity: useful for analysing selector behaviour, but current `account_hint` is last-selected-account only and cannot reconstruct attempted failover sequences.
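
If an explicit header is ever adopted, the proxy could reduce it to a short hint before it reaches routing or diagnostics. The sketch below assumes the `X-CQ-Session-ID` header named above; the `session_hint` function, the `session:` prefix, and the 12-character truncation are hypothetical choices made for illustration.

```python
import hashlib


def session_hint(headers):
    """Return a short hashed session hint, or None to fall back to heuristics."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = (lowered.get("x-cq-session-id") or "").strip()
    if not raw:
        return None
    # Sketch only: an unsalted hash of a guessable ID can be brute-forced,
    # so a real design would need a keyed hash and a privacy review.
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    return "session:" + digest[:12]
```

Returning `None` when the header is absent keeps the weaker timing and route-kind signals as a fallback, not a replacement.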

## RouteEvent Gaps for Phase D

Only add fields after a Phase D implementation plan justifies them. The current corpus suggests these possible additions:

- `request_id`: would let analysis correlate client preflight/main requests and sort/reconstruct concurrent event sequences without relying only on timestamps.
- `session_hint`: only if clients can provide an explicit session identifier, for example via `X-CQ-Session-ID`.
- `stream_complete`: would distinguish completed streaming responses from early termination when natural-boundary routing depends on full response completion.
- `websocket_close_code`: would make app-server/WebSocket session boundaries visible without logging payloads.
- `failover_count`: would improve policy analysis beyond the current boolean while avoiding raw attempted-account sequences.
- Attempted-account summary: if needed, log only redacted hints and bounded counts; do not log full IDs or tokens.
- Quota snapshot age/min remaining at selection time: useful for explaining selector choices, but must avoid leaking sensitive account metadata.

Do not add raw request bodies, response bodies, bearer tokens, local proxy tokens, OAuth refresh tokens, API keys, full emails, full account UUIDs, or full credential/account secrets.
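
To make the shape of these candidate fields concrete, here is a language-neutral sketch written in Python. `RouteEventExtras` and `to_jsonl` are hypothetical names, the types are guesses, and the real `RouteEvent` definition in the proxy is not shown in this document.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RouteEventExtras:
    """Hypothetical optional fields proposed above; all default to unset."""
    request_id: Optional[str] = None
    session_hint: Optional[str] = None
    stream_complete: Optional[bool] = None
    websocket_close_code: Optional[int] = None
    failover_count: Optional[int] = None


def to_jsonl(extras):
    """Serialise only the fields that were set, so absent data stays absent."""
    fields = {key: value for key, value in asdict(extras).items() if value is not None}
    return json.dumps(fields, sort_keys=True)
```

Omitting unset fields keeps "not recorded" distinct from values such as `false` or `0`, which matters when `stream_complete` or `failover_count` feed boundary analysis.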

## Proposed Routing Policy Changes

These proposals are design-only output from this UAT phase. No proxy source changes or routing-policy changes are authorised until a separate implementation plan is approved.

### Proposal: Manual pin overrides first

**Evidence:** Pinned Claude events showed `pin_active: true` with a redacted Claude account hint on successful requests. Unpinned events showed `pin_active` absent/null.

**Behaviour:** If a manual pin is active and the pinned account is usable, route eligible Claude traffic to that account before quota balancing or session stickiness. Existing failover behaviour still applies for auth/quota emergencies.
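
The intended precedence can be sketched in a few lines. `choose_account` and its parameters are hypothetical and do not describe the existing selector; the sketch only shows the ordering of manual pin, session stickiness, and quota balancing.

```python
def choose_account(pinned, sticky, balanced, usable):
    """Pick an account in precedence order: manual pin, sticky session, balancing."""
    for candidate in (pinned, sticky, balanced):
        # An unusable candidate is skipped so a stale pin cannot block routing.
        if candidate is not None and candidate in usable:
            return candidate
    return None
```

For example, `choose_account("a", "b", "c", {"b", "c"})` falls through the unusable pin and returns the sticky account `"b"`.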

**Files likely touched:**

- `internal/proxy/router.go`
- `internal/proxy/transport.go`
- `internal/proxy/server.go`
- `cmd/cq/proxy.go`
- `internal/proxy/server_test.go`

**Tests required:** Unit tests for pinned selection precedence, integration tests for diagnostics `pin_active`, and UAT with one pinned and one unpinned Claude request. Assertions that prove a pin handled a request must pair `pin_active` with a valid `account_hint` and a 2xx status.

**Risks:** User surprise if a stale pin silently overrides balancing, loss of cache affinity if pinned and unpinned flows interleave, and privacy exposure if diagnostics over-explain pin identity.

### Proposal: Emergency failover on auth/quota errors

**Evidence:** No natural failover occurred, but diagnostics already provide safe error codes and the corpus confirmed no credential leakage in error fields.

**Behaviour:** Preserve immediate failover on auth/quota errors even when a session or natural-boundary policy would otherwise prefer stickiness. Diagnostics should record that failover occurred without logging raw attempted credentials.
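
A minimal sketch of that behaviour follows, under these assumptions: 401 and 429 stand in for auth/quota errors, `route_with_failover` and the `send` callback are hypothetical, and `failover_count` is taken to mean the number of times the request moved to another account.

```python
EMERGENCY_STATUSES = {401, 429}  # assumed stand-ins for auth/quota errors


def route_with_failover(accounts, sticky, send):
    """Try the sticky account first, then fail over immediately on 401/429."""
    order = ([sticky] if sticky in accounts else []) + [a for a in accounts if a != sticky]
    attempts = 0
    status = None
    for account in order:
        attempts += 1
        status = send(account)
        if status not in EMERGENCY_STATUSES:
            # Record only a bounded count, never the attempted credentials.
            return {"account": account, "status": status, "failover_count": attempts - 1}
    return {"account": None, "status": status, "failover_count": max(attempts - 1, 0)}
```

Stickiness only decides which account is tried first; it never prevents the move to the next account.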

**Files likely touched:**

- `internal/proxy/transport.go`
- `internal/proxy/codex_transport.go`
- `internal/proxy/diag.go`
- `internal/proxy/server_test.go`

**Tests required:** Existing 401/429 failover tests should be extended to assert natural-boundary/session stickiness does not block emergency failover, plus diagnostics tests for `failover` and future `failover_count` if added.

**Risks:** Cache affinity breaks on emergency failover, stale quota state may cause repeated retries, and richer failover diagnostics could reveal account topology if not bounded/redacted.

### Proposal: Quota balancing only at natural boundaries

**Evidence:** The corpus shows many adjacent `anthropic_messages` events within short bursts and only one account hint per provider. It does not show harmful churn, but it does show enough burstiness that mid-stream or mid-session switching would be difficult to reason about without explicit boundaries.

**Behaviour:** Continue selecting by current quota/headroom semantics, but only re-run balancing when no active session/boundary guard applies. Completed message/native/compact requests and conservative idle gaps are candidate boundaries.
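
One way to picture the guard is a small tracker that refuses to rebalance while a response is in flight and, when a session is active, waits for an idle gap. `BoundaryGuard` is a hypothetical sketch, and the threshold is a constructor parameter because the corpus does not justify a specific value.

```python
class BoundaryGuard:
    """Tracks in-flight requests and decides when account selection may change."""

    def __init__(self, idle_threshold_seconds):
        self.idle_threshold_seconds = idle_threshold_seconds
        self.in_flight = 0
        self.last_completed_at = None

    def start(self):
        self.in_flight += 1

    def complete(self, now):
        self.in_flight -= 1
        self.last_completed_at = now

    def may_rebalance(self, now, session_active):
        if self.in_flight > 0:
            return False  # never change account mid-stream
        if not session_active or self.last_completed_at is None:
            return True  # a completed request with no session is a safe point
        # With an active session, wait for a conservative idle gap.
        return now - self.last_completed_at >= self.idle_threshold_seconds
```

Timestamps are passed in as plain seconds so the logic can be tested without a clock.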

**Files likely touched:**

- `internal/proxy/router.go`
- `internal/proxy/server.go`
- `internal/proxy/transport.go`
- `internal/proxy/codex_transport.go`

**Tests required:** Unit tests for boundary detection, selector tests showing no mid-session rebalance, streaming tests ensuring selection is fixed for a response, and UAT comparing account hints across bursts.

**Risks:** Sticky routing can underuse available quota, idle-gap thresholds may be wrong for slow models, and concurrent requests can blur boundary inference unless events are sorted/correlated.

### Proposal: Cache/session stickiness for ongoing conversations when a reliable signal exists

**Evidence:** True conversation identity cannot be inferred from timing in this corpus. The corpus did show that mixed Claude/Codex events and model switches can occur in the same small time window, so timing and model alone are weak session identifiers.

**Behaviour:** Prefer an explicit session signal if available. Use timing or route-kind heuristics only as conservative fallback. Once a session is assigned to an account, keep routing that session to the same account until a natural boundary or emergency failover.
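
The assignment rule could look like the following sketch. `StickyTable` is hypothetical, the session key is whatever reliable signal is eventually chosen, and expiry is left out to keep the example short.

```python
class StickyTable:
    """Maps a session key to an account until a boundary or failover clears it."""

    def __init__(self):
        self._assigned = {}

    def account_for(self, session_key, pick):
        """Return the sticky account, calling pick() only for unseen sessions."""
        if session_key is None:
            return pick()  # no reliable signal: do not create sticky state
        if session_key not in self._assigned:
            self._assigned[session_key] = pick()
        return self._assigned[session_key]

    def release(self, session_key):
        """Forget the assignment at a natural boundary or after emergency failover."""
        self._assigned.pop(session_key, None)
```

Here `pick` stands in for the existing selector, so stickiness changes when selection runs without changing how an account is chosen.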

**Files likely touched:**

- `internal/proxy/router.go`
- `internal/proxy/server.go`
- `internal/proxy/config.go` only if a later product decision adds configuration
- `internal/proxy/server_test.go`

**Tests required:** Tests for explicit session header parsing if added, cache expiry/idle-gap tests, concurrent-session isolation tests, and privacy tests ensuring session IDs are not logged raw if sensitive.

**Risks:** Session identifiers may be unavailable, client-provided IDs may be sensitive, sticky cache state can become stale, and users may be surprised if balancing appears less responsive.

### Proposal: Preserve stale/unknown quota eligibility semantics

**Evidence:** The corpus includes successful Codex and Claude traffic with stable account hints, but no failover and no multi-account churn. It does not justify tightening eligibility for stale/unknown quota accounts.

**Behaviour:** Keep current selector semantics for stale or unknown quota data during Phase D. Natural-boundary routing should constrain when selection changes, not redefine which accounts are eligible.

**Files likely touched:**

- `internal/proxy/router.go`
- `internal/proxy/codex_selector.go`
- `internal/proxy/transport.go`
- `internal/proxy/codex_transport.go`

**Tests required:** Regression tests for stale/unknown quota eligibility, boundary tests proving existing selector results are reused only while sticky, and UAT with fresh and stale quota cache states.

**Risks:** Stale quota can over-select a poor account, while over-tightening can strand usable accounts. Diagnostics may need quota snapshot age to explain decisions.

## Non-Goals for Phase D

- Do not log request bodies or response bodies.
- Do not log raw bearer tokens, local proxy tokens, OAuth refresh tokens, API keys, full emails, full account UUIDs, or credential/account secrets.
- Do not use `pin_active` alone as proof that an account handled a request; require an account hint and successful/terminal status context.
- Do not infer full failover attempt order from the current `failover` boolean.
- Do not make route changes in the diagnostics/UAT phase.
- Do not hot-reload `diagnostics_log` as part of routing design.

## Open Questions

- Do Claude Code clients issue `anthropic_count_tokens` consistently across versions/settings, and should observed count-token/message pairs always share one routing decision?
- What natural app-server/WebSocket lifecycle events are visible during a real Codex interactive session?
- Can Claude Code, Codex CLI, or other clients provide a stable, non-sensitive explicit session ID header?
- What idle-gap threshold separates a user/session boundary from normal streaming/model latency?
- How often do account hints churn in longer multi-account Claude corpora when quota pressure is real?
- What should diagnostics expose for failover depth: boolean, count, redacted attempted-account hints, or only final outcome?
- Should session stickiness be per provider, per route kind, per model, or per explicit client session?