feat: Add Logfire error tracking to fleet env #6

Open
dzorlu wants to merge 5 commits into deniz/fleet_client from deniz/fleet-logfire
Conversation

dzorlu (Collaborator) commented Feb 26, 2026

Summary

  • Adds structured error tracking via Logfire across fleet env (init failures, tool call failures, MCP timeouts, verifier errors)
  • New telemetry.py wrapper with 4 functions (fleet_info, fleet_warning, fleet_error, fleet_exception)
  • 15 instrumentation sites across task_env.py, client.py, mcp_tools.py
  • No-op if configure_fleet_telemetry() is never called — logfire silently drops events
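The wrapper shape described above can be sketched roughly as follows. This is a stdlib-only illustration, not the PR's telemetry.py: the `sink` parameter and the `_emit` helper are hypothetical stand-ins for the Logfire backend, and the real configure_fleet_telemetry() takes different arguments (the test plan mentions send_to_logfire=False). The point is only the "drop silently until configured" behavior.

```python
import sys
from typing import Any, Callable, Dict, Optional

Event = Dict[str, Any]

# Hypothetical stand-in for the Logfire backend: None means "never configured".
_sink: Optional[Callable[[Event], None]] = None


def configure_fleet_telemetry(sink: Callable[[Event], None]) -> None:
    """Install a sink; until this is called, every event is dropped."""
    global _sink
    _sink = sink


def _emit(level: str, event: str, attrs: Event) -> None:
    if _sink is None:
        return  # never configured: silently drop the event
    _sink({**attrs, "level": level, "event": event})


def fleet_info(event: str, **attrs: Any) -> None:
    _emit("info", event, attrs)


def fleet_warning(event: str, **attrs: Any) -> None:
    _emit("warning", event, attrs)


def fleet_error(event: str, **attrs: Any) -> None:
    _emit("error", event, attrs)


def fleet_exception(event: str, **attrs: Any) -> None:
    """Like fleet_error, but also records the exception being handled."""
    exc = sys.exc_info()[1]
    if exc is not None:
        attrs["exception"] = repr(exc)
    _emit("error", event, attrs)
```

Calling any of the four functions before configure_fleet_telemetry() returns without side effects, which is the property that lets instrumented code run unchanged in tests.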

Test plan

  • All 41 previously-passing tests still pass
  • 4 pre-existing failures unaffected
  • Smoke test with configure_fleet_telemetry(send_to_logfire=False) to verify structured output

🤖 Generated with Claude Code

Structured observability for fleet env errors (init failures, tool call
failures, MCP timeouts, verifier errors). Adds telemetry.py wrapper and
15 instrumentation sites across task_env.py, client.py, mcp_tools.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, modality

- Add set_task_context() to establish base attributes for all events
- All telemetry events now inherit env_key, env_version, task_key, modality
- Parse env_key:version in client.py to log separately
- Add fleet_rollout_started and fleet_rollout_completed events
- Default environment changed to "training_rollouts"
- Update README with new schema and example SQL query
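One way to get "all events inherit the base attributes" is a context variable that every event builder merges in. The sketch below assumes that approach; the PR does not show how set_task_context() is implemented. `build_event` and `split_env_key` are hypothetical helper names, and the `name:version` spec format and the example values are illustrative guesses.

```python
from contextvars import ContextVar
from typing import Any, Dict, Optional, Tuple

# Base attributes attached to every event emitted in the current context.
_task_context: "ContextVar[Dict[str, Any]]" = ContextVar(
    "fleet_task_context", default={}
)


def set_task_context(**attrs: Any) -> None:
    """Replace the base attributes (env_key, env_version, task_key, modality)."""
    _task_context.set(dict(attrs))


def build_event(event: str, **attrs: Any) -> Dict[str, Any]:
    """Merge base attributes with per-event ones; per-event values win."""
    return {**_task_context.get(), "event": event, **attrs}


def split_env_key(spec: str) -> Tuple[str, Optional[str]]:
    """Split 'name:version' into its parts; version is None when absent."""
    name, sep, version = spec.partition(":")
    return name, (version if sep else None)
```

With this layout, the context has to be set before the first event that should carry it, which is why call ordering matters for events emitted during environment creation.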

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Deniz and others added 2 commits February 26, 2026 21:52
Tracks MCP server errors (returned in tool results) separately from
Python exceptions:
- fleet_tool_call_failed: Python exception during call_tool()
- fleet_mcp_tool_error: MCP server returned {"error": ...}, {"status": "failed"}, or {"isError": true}

This aligns telemetry with WandB tool_error counting which tracks both
exception-based errors and error patterns in tool results.
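The three result patterns listed above could be detected with a check along these lines. This is a sketch, not the PR's _is_tool_error: the function name and the handling of non-dict results are assumptions. It tests truthiness rather than key presence, in line with the later commit in this PR that avoids {"error": null} false positives.

```python
from typing import Any


def is_tool_error(result: Any) -> bool:
    """Return True when an MCP tool result looks like a server-side error."""
    if not isinstance(result, dict):
        return False  # assumption: non-dict results are never treated as errors
    if result.get("error"):  # truthy check: {"error": None} is not an error
        return True
    if result.get("status") == "failed":
        return True
    return bool(result.get("isError"))
```

A result flagged here would map to fleet_mcp_tool_error, while an exception raised by call_tool() itself would map to fleet_tool_call_failed.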

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… full context

- Move set_task_context() before from_fleet() call in task_env.py
- Remove explicit env_key/env_version from client.py telemetry calls (now from context)
- This ensures fleet_make_failed events include task_key and modality

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove unused fleet_error import from task_env.py
- Fix _is_tool_error to check truthy values (avoid {"error": null} false positives)
- Make close() exception-safe with try/finally for cleanup
- Emit fleet_rollout_completed on ALL paths (not just verifier success)
- Remove unused _env_name/_env_version parsing in client.py
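The close() change can be illustrated with a small stand-in class. Everything here is hypothetical (the class name, the `emit` callable, the `_client` attribute); it only shows the shape: the completion event is attempted on every path, a telemetry failure is swallowed, and cleanup runs in finally.

```python
from typing import Any, Callable, Optional


class SketchEnv:
    """Hypothetical environment holding a client that must be released."""

    def __init__(self, emit: Callable[..., None]) -> None:
        self._emit = emit
        self._client: Optional[Any] = object()
        self.closed = False

    def close(self) -> None:
        try:
            try:
                # Emitted on all paths, not only after verifier success.
                self._emit("fleet_rollout_completed")
            except Exception:
                pass  # Telemetry failure should not break cleanup.
        finally:
            # Runs even if something other than Exception propagates.
            self._client = None
            self.closed = True
```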

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

fleet_exception(
    "fleet_tools_list_failed",
    step_count=self._step_count,
)
Unprotected telemetry calls in error handlers risk disrupting recovery

Low Severity

fleet_exception calls inside except blocks in reset_async, step_async, and _compute_reward are not wrapped in try/except, unlike the identical pattern in close(), which correctly protects against telemetry failures with the comment "Telemetry failure should not break cleanup." If fleet_exception raises, it disrupts the graceful error handling: for example, at line 274, self._tools_cache = [] is never reached, and at line 434, the rest of step_async (done/reward/obs logic) is skipped. This inconsistency means a telemetry failure could leave the environment in a broken state.
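One possible way to address this, sketched under assumptions: route telemetry calls made from error handlers through a small guard so a telemetry failure cannot skip the fallback logic. `safe_telemetry` and `list_tools_with_fallback` are hypothetical names, not part of the PR, and the telemetry function is passed in as a parameter to keep the example self-contained.

```python
import logging
from typing import Any, Callable, List

_log = logging.getLogger(__name__)


def safe_telemetry(fn: Callable[..., None], event: str, **attrs: Any) -> None:
    """Call a telemetry function, swallowing any exception it raises."""
    try:
        fn(event, **attrs)
    except Exception:
        # Telemetry must never change control flow in an error handler.
        _log.debug("telemetry call %s failed", event, exc_info=True)


def list_tools_with_fallback(
    fetch: Callable[[], List[Any]],
    fleet_exception: Callable[..., None],
) -> List[Any]:
    """Fetch tools; on failure, report the error and fall back to an empty list."""
    try:
        return fetch()
    except Exception:
        safe_telemetry(fleet_exception, "fleet_tools_list_failed")
        return []  # still reached even if the telemetry call raised
```

The same guard could wrap the calls in reset_async, step_async, and _compute_reward, which would make them consistent with the protection already present in close().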

Additional Locations (2)
