feat(monitor): backend-driven SWE-bench evaluation with thread-native traces#93

Closed
shuxueshuxue wants to merge 6 commits into main from feat/monitor-session-trace-fix

@shuxueshuxue shuxueshuxue commented Feb 25, 2026

Summary

  • make thread/run the only trace ownership model in monitor backend
  • remove session-trace API and keep session page focused on session metadata + terminal commands
  • render trace on thread page via canonical thread trace endpoint
  • keep Evaluation as links to evaluation threads/runs (not a fake local-only animation)

Why

  • aligns data model with product semantics: trace belongs to thread/run
  • avoids confusion from session-level trace ownership

Validation

  • backend route check: only GET /api/monitor/thread/{thread_id}/trace exists
  • GET /api/monitor/session/{session_id}/trace now returns 404
  • frontend build: npm run build in frontend/monitor
  • backend tests: .venv/bin/pytest -q tests/test_integration_new_arch.py (23 passed)
  • Playwright manual check:
    • thread page shows Thread Trace Conversation
    • session page no longer shows trace panel
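
The route check above could be expressed as a small assertion over the backend's route table. This is a minimal sketch, not the PR's actual check: the `(method, path)` tuple format, the helper names, and the sample route list are assumptions made for illustration. Only the two trace paths come from the description.

```python
# Paths taken from the PR description.
THREAD_TRACE = "/api/monitor/thread/{thread_id}/trace"
SESSION_TRACE = "/api/monitor/session/{session_id}/trace"


def trace_routes(routes):
    """Return the sorted (method, path) pairs whose path ends in /trace.

    `routes` is a hypothetical list of (method, path) tuples; a real check
    would build it from whatever route table the backend framework exposes.
    """
    return sorted((method, path) for method, path in routes if path.endswith("/trace"))


def only_thread_trace(routes):
    """True when the thread trace endpoint is the only trace route registered."""
    return trace_routes(routes) == [("GET", THREAD_TRACE)]


# Hypothetical route table after this PR: no session-level trace route.
routes_after = [
    ("GET", THREAD_TRACE),
    ("GET", "/api/monitor/evaluations"),
]
assert only_thread_trace(routes_after)
```

Adding `("GET", SESSION_TRACE)` back to the list makes `only_thread_trace` return `False`, which is the regression this validation step is meant to catch.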

Screenshots

  • /home/ubuntu/specops0/artifacts/playwright/thread-trace-thread-page.png
  • /home/ubuntu/specops0/artifacts/playwright/thread-trace-session-page-no-trace.png

Follow-up Update

  • switched /api/monitor/evaluations to a direct backend SWE runner (no control prompt is sent to the agent)
  • removed Terminal Commands from session API response and session UI
  • added a checkpoint-based fallback in GET /api/monitor/thread/{thread_id}/trace so SWE-bench threads still render a tool_call/tool_result-style trace when run_events are absent
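
The checkpoint-based fallback described above might look roughly like the following. This is a sketch under stated assumptions, not the PR's implementation: the message shape (`role`, `tool_calls`, `name`, `content`), the trace entry fields, and both function names are hypothetical. Only the idea of falling back to checkpoints when `run_events` are absent, and the `tool_call`/`tool_result` entry types, come from the description.

```python
from typing import Any


def trace_from_checkpoint(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Derive tool_call/tool_result-style entries from checkpointed messages.

    The message layout here is an assumption made for illustration.
    """
    trace: list[dict[str, Any]] = []
    for msg in messages:
        role = msg.get("role")
        if role == "assistant":
            # Each tool invocation requested by the assistant becomes a tool_call entry.
            for call in msg.get("tool_calls", []):
                trace.append(
                    {"type": "tool_call", "name": call["name"], "args": call.get("args", {})}
                )
        elif role == "tool":
            # Each tool message becomes the matching tool_result entry.
            trace.append(
                {"type": "tool_result", "name": msg.get("name"), "content": msg.get("content", "")}
            )
    return trace


def build_thread_trace(
    run_events: list[dict[str, Any]],
    checkpoint_messages: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Prefer recorded run_events; fall back to the checkpoint when there are none."""
    if run_events:
        return list(run_events)
    return trace_from_checkpoint(checkpoint_messages)
```

With an empty `run_events` list, a checkpoint containing one assistant message with a single tool call followed by one tool message yields two entries, a `tool_call` and then a `tool_result`, so the thread page still has something to render.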

Relationship to PR #88

@shuxueshuxue changed the title from "feat(monitor): trace belongs to thread/run, not session" to "feat(monitor): backend-driven SWE-bench evaluation with thread-native traces" on Feb 25, 2026
@shuxueshuxue force-pushed the feat/monitor-session-trace-fix branch from 15ec774 to d216543 on February 25, 2026 12:14
@shuxueshuxue force-pushed the feat/monitor-session-trace-fix branch from a6de3bb to 191ab98 on March 4, 2026 07:30
@shuxueshuxue (Collaborator, Author) commented

superseded by #182
