feat: Add Logfire error tracking to fleet env#6
dzorlu wants to merge 5 commits into deniz/fleet_client
Structured observability for fleet env errors (init failures, tool call failures, MCP timeouts, verifier errors). Adds telemetry.py wrapper and 15 instrumentation sites across task_env.py, client.py, mcp_tools.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, modality
- Add set_task_context() to establish base attributes for all events
- All telemetry events now inherit env_key, env_version, task_key, modality
- Parse env_key:version in client.py to log separately
- Add fleet_rollout_started and fleet_rollout_completed events
- Default environment changed to "training_rollouts"
- Update README with new schema and example SQL query
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
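The context inheritance described in this commit could be sketched as follows. This is a minimal illustration, not the PR's actual telemetry.py: it assumes the base attributes live in a `contextvars.ContextVar`, and `parse_env_key` and `build_event` are hypothetical helper names introduced only for the example.

```python
import contextvars

# Assumption: base attributes are held in a ContextVar; the real module may differ.
_task_context = contextvars.ContextVar("fleet_task_context", default=None)


def set_task_context(**attrs):
    """Store base attributes that every later event should inherit."""
    _task_context.set(dict(attrs))


def parse_env_key(spec):
    """Split 'env_key:version' into (env_key, env_version); version may be absent."""
    env_key, _, env_version = spec.partition(":")
    return env_key, env_version or None


def build_event(name, **attrs):
    """Merge the base context with per-event attributes; per-event values win."""
    base = _task_context.get() or {}
    return {"event": name, **base, **attrs}


# Usage: establish the context once, then every event carries it.
env_key, env_version = parse_env_key("my_env:v1.2")
set_task_context(
    env_key=env_key,
    env_version=env_version,
    task_key="task-001",
    modality="tool_use",
)
event = build_event("fleet_rollout_started", step_count=0)
```

Setting the context before any call that can fail means failure events emitted from deeper layers still carry the task attributes.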
Tracks MCP server errors (returned in tool results) separately from
Python exceptions:
- fleet_tool_call_failed: Python exception during call_tool()
- fleet_mcp_tool_error: MCP server returned {"error": ...}, {"status": "failed"}, or {"isError": true}
This aligns telemetry with WandB tool_error counting which tracks both
exception-based errors and error patterns in tool results.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
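As a rough sketch of the distinction above, the two event types could be separated like this. The function names `is_tool_error` and `classify_tool_call` are hypothetical, and checking for truthy values (so that `{"error": None}` is not counted) is an assumption about the intended behaviour rather than a copy of the PR's code.

```python
def is_tool_error(result):
    """Return True when a tool result matches one of the three error patterns."""
    if not isinstance(result, dict):
        return False
    if result.get("error"):  # truthy check: {"error": None} is not an error
        return True
    if result.get("status") == "failed":
        return True
    return bool(result.get("isError"))


def classify_tool_call(call):
    """Run call() and return the telemetry event name to emit, or None."""
    try:
        result = call()
    except Exception:
        # A Python exception was raised while calling the tool.
        return "fleet_tool_call_failed"
    # The call returned normally, but the server may still report an error.
    return "fleet_mcp_tool_error" if is_tool_error(result) else None


def _raise_timeout():
    raise TimeoutError("MCP timeout")


ok = classify_tool_call(lambda: {"content": "done"})
server_error = classify_tool_call(lambda: {"isError": True})
exception_error = classify_tool_call(_raise_timeout)
```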
… full context
- Move set_task_context() before from_fleet() call in task_env.py
- Remove explicit env_key/env_version from client.py telemetry calls (now from context)
- This ensures fleet_make_failed events include task_key and modality
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove unused fleet_error import from task_env.py
- Fix _is_tool_error to check truthy values (avoid {"error": null} false positives)
- Make close() exception-safe with try/finally for cleanup
- Emit fleet_rollout_completed on ALL paths (not just verifier success)
- Remove unused _env_name/_env_version parsing in client.py
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
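A minimal sketch of the two control-flow fixes in this commit (exception-safe close() and emitting fleet_rollout_completed on every path) might look like the following. `SketchEnv`, `emit`, the `fleet_close_failed` event name, and the `outcome` attribute are all hypothetical stand-ins; only `fleet_rollout_completed` comes from the commit message.

```python
events = []


def emit(name, **attrs):
    """Stand-in telemetry sink that records events in memory."""
    events.append((name, attrs))


class SketchEnv:
    def __init__(self, emit_fn, fail_on_close=False):
        self._emit = emit_fn
        self._fail_on_close = fail_on_close
        self._handle = object()      # pretend remote environment handle
        self._tools_cache = ["tool"]

    def _release(self):
        if self._fail_on_close:
            raise RuntimeError("remote close failed")

    def close(self):
        try:
            self._release()
        except Exception as exc:
            try:
                self._emit("fleet_close_failed", error=str(exc))
            except Exception:
                pass  # Telemetry failure should not break cleanup.
        finally:
            # Local state is cleared whether or not the release succeeded.
            self._handle = None
            self._tools_cache = []

    def compute_reward(self, verifier):
        reward, outcome = 0.0, "verifier_error"
        try:
            reward = float(verifier())
            outcome = "success"
        except Exception:
            pass  # keep the default reward and outcome
        finally:
            # Emitted on success and failure alike.
            self._emit("fleet_rollout_completed", outcome=outcome, reward=reward)
        return reward


env = SketchEnv(emit, fail_on_close=True)
reward = env.compute_reward(lambda: 1 / 0)  # verifier raises ZeroDivisionError
env.close()
```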
Cursor Bugbot has reviewed your changes and found 1 potential issue.
    fleet_exception(
        "fleet_tools_list_failed",
        step_count=self._step_count,
    )
Unprotected telemetry calls in error handlers risk disrupting recovery
Low Severity
fleet_exception calls inside except blocks in reset_async, step_async, and _compute_reward are not wrapped in try/except, unlike the identical pattern in close() which correctly protects against telemetry failures with a comment "Telemetry failure should not break cleanup." If fleet_exception raises, it disrupts the graceful error handling — e.g., at line 274, self._tools_cache = [] is never reached; at line 434, the rest of step_async (done/reward/obs logic) is skipped. This inconsistency means a telemetry failure could leave the environment in a broken state.
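One way to address this, sketched under assumptions, is to route telemetry calls made inside error handlers through a guard that swallows telemetry failures. `safe_telemetry` and `list_tools` are hypothetical names, and `emit` stands in for a function such as fleet_exception, whose real signature is not shown here.

```python
import logging

logger = logging.getLogger("fleet.telemetry.sketch")


def safe_telemetry(emit, event, **attrs):
    """Call emit(event, **attrs) but never let a telemetry failure propagate."""
    try:
        emit(event, **attrs)
    except Exception:
        # Record locally at debug level; recovery code must keep running.
        logger.debug("telemetry emit failed for %s", event, exc_info=True)


def list_tools(fetch, emit, step_count):
    """Return the tool list, falling back to [] even if telemetry is broken."""
    try:
        return fetch()
    except Exception:
        safe_telemetry(emit, "fleet_tools_list_failed", step_count=step_count)
        return []  # the fallback is still reached


def broken_emit(event, **attrs):
    raise ConnectionError("telemetry backend unavailable")


def failing_fetch():
    raise TimeoutError("MCP timeout")


tools = list_tools(failing_fetch, broken_emit, step_count=3)
```

With a guard like this, the fallback assignment that the reviewer flagged as unreachable would run even when the telemetry backend itself raises.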


Summary
- Adds telemetry.py wrapper with 4 functions (fleet_info, fleet_warning, fleet_error, fleet_exception)
- 15 instrumentation sites across task_env.py, client.py, mcp_tools.py
- If configure_fleet_telemetry() is never called, logfire silently drops events

Test plan
- Run configure_fleet_telemetry(send_to_logfire=False) to verify structured output

🤖 Generated with Claude Code