Skip to content

bug: claude-cli provider crashes after ~4 tests in combined eval run (SessionStart hook error) #830

@christso

Description

@christso

Summary

When running multiple eval files in a single agentv eval run invocation against the claude-cli target, the provider crashes after ~4 tests with a SessionStart hook error. All subsequent tests fail with the same error.

Reproduction

# This fails after ~4 tests:
agentv eval run --target claude-cli --workers 1 "evals/hivespec/*.eval.yaml"

# This works — each invocation gets a fresh session:
agentv eval run --target claude-cli --workers 1 evals/hivespec/hs-claim.eval.yaml
agentv eval run --target claude-cli --workers 1 evals/hivespec/hs-explore.eval.yaml
# etc.

Error

{"type":"system","subtype":"hook_started","hook_id":"...","hook_name":"SessionStart:startup","hook_event":"SessionStart","session_id":"..."}

Claude CLI exits with code 1. The error is a raw JSON event from the session hook, not a proper error message.

Observed behavior

  • Tests 1-4: all PASS (1.000)
  • Tests 5-17: all ERROR with the same SessionStart hook failure
  • When the same 17 tests are run as 5 separate invocations (one per eval file), all complete successfully

Expected behavior

All 17 tests should complete when run in a single invocation. The claude-cli provider should clean up sessions between tests and handle hook errors gracefully.

Investigation starting points

  1. Claude CLI provider: `packages/core/src/evaluation/providers/` — find the claude-cli provider implementation
  2. Check how sessions are created and torn down between test cases
  3. Check if there's a session cleanup or process kill between tests
  4. Compare with pi-cli provider which handles 17 sequential tests without issues

Acceptance signals

  • `agentv eval run --target claude-cli --workers 1 "evals/hivespec/*.eval.yaml"` completes all 17 tests without SessionStart errors
  • Each test gets a fresh Claude CLI session (no state leakage between tests)
  • If a SessionStart hook fails, the provider retries with a fresh session before marking the test as an execution error

Non-goals

  • Parallel Claude CLI sessions (workers > 1) — that's a separate concern
  • Changing how Claude CLI hooks work — the fix should be in agentv's provider layer

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions