feat: Add Logfire error tracking to fleet env by dzorlu · Pull Request #6 · fleet-ai/OpenEnv

dzorlu · 2026-02-26T04:54:13Z

Summary

Adds structured error tracking via Logfire across fleet env (init failures, tool call failures, MCP timeouts, verifier errors)
New telemetry.py wrapper with 4 functions (fleet_info, fleet_warning, fleet_error, fleet_exception)
15 instrumentation sites across task_env.py, client.py, mcp_tools.py
No-op if configure_fleet_telemetry() is never called: Logfire silently drops events
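A minimal sketch of what a four-function wrapper with a no-op default could look like. It is not the PR's telemetry.py: the standard library `logging` module stands in for Logfire so the example runs without extra dependencies, and the `handler` parameter, the `_emit` helper, and the `fleet_attrs` record field are assumptions made for illustration.

```python
import logging
from typing import Any, Optional

# Stand-in backend (assumption): the PR sends events to Logfire; stdlib
# logging is used here only so the sketch is runnable on its own.
_logger = logging.getLogger("fleet_telemetry_sketch")
_logger.propagate = False
_configured = False


def configure_fleet_telemetry(handler: Optional[logging.Handler] = None) -> None:
    """Enable emission. Until this is called, every event is dropped."""
    global _configured
    _logger.addHandler(handler or logging.StreamHandler())
    _logger.setLevel(logging.INFO)
    _configured = True


def _emit(level: int, event: str, exc_info: bool = False, **attrs: Any) -> None:
    if not _configured:
        return  # no-op path: nothing was configured, so drop the event
    # Structured attributes travel on the log record (hypothetical field name).
    _logger.log(level, event, extra={"fleet_attrs": attrs}, exc_info=exc_info)


def fleet_info(event: str, **attrs: Any) -> None:
    _emit(logging.INFO, event, **attrs)


def fleet_warning(event: str, **attrs: Any) -> None:
    _emit(logging.WARNING, event, **attrs)


def fleet_error(event: str, **attrs: Any) -> None:
    _emit(logging.ERROR, event, **attrs)


def fleet_exception(event: str, **attrs: Any) -> None:
    # Intended for use inside an except block; attaches the active traceback.
    _emit(logging.ERROR, event, exc_info=True, **attrs)
```

Keeping the event name as the first positional argument and everything else as keyword attributes mirrors the call shape visible in the diff snippet further down (`fleet_exception("fleet_tools_list_failed", step_count=...)`).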

Test plan

All 41 previously-passing tests still pass
4 pre-existing failures unaffected
Smoke test with configure_fleet_telemetry(send_to_logfire=False) to verify structured output

🤖 Generated with Claude Code

Structured observability for fleet env errors (init failures, tool call failures, MCP timeouts, verifier errors). Adds telemetry.py wrapper and 15 instrumentation sites across task_env.py, client.py, mcp_tools.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

src/envs/fleet_env/task_env.py

…, modality
- Add set_task_context() to establish base attributes for all events
- All telemetry events now inherit env_key, env_version, task_key, modality
- Parse env_key:version in client.py to log separately
- Add fleet_rollout_started and fleet_rollout_completed events
- Default environment changed to "training_rollouts"
- Update README with new schema and example SQL query

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
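One way the base-attribute inheritance described in this commit could work is sketched below, assuming a `contextvars`-based store. Only the name `set_task_context` and the four attribute names come from the commit message; `parse_env_key`, `build_event`, and the example values are hypothetical.

```python
import contextvars
from typing import Any, Dict, Optional, Tuple

# Holds the base attributes for the current rollout (assumed storage choice).
_task_context = contextvars.ContextVar("fleet_task_context", default=None)


def parse_env_key(spec: str) -> Tuple[str, Optional[str]]:
    """Split 'env_key:version' into parts; version is None when absent."""
    env_key, sep, version = spec.partition(":")
    return env_key, (version if sep and version else None)


def set_task_context(*, env_key: str, env_version: Optional[str],
                     task_key: str, modality: str) -> None:
    """Record the attributes every later event should carry."""
    _task_context.set({
        "env_key": env_key,
        "env_version": env_version,
        "task_key": task_key,
        "modality": modality,
    })


def build_event(event: str, **attrs: Any) -> Dict[str, Any]:
    """Merge base attributes with per-event ones; per-event values win."""
    base = _task_context.get() or {}
    return {"event": event, **base, **attrs}


# Usage with made-up values:
env_key, env_version = parse_env_key("example_env:v1")
set_task_context(env_key=env_key, env_version=env_version,
                 task_key="task-001", modality="tool_use")
started = build_event("fleet_rollout_started", step_count=0)
```

A context variable keeps concurrent async rollouts from overwriting each other's attributes, which a plain module-level dict would not.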

src/envs/fleet_env/task_env.py

Tracks MCP server errors (returned in tool results) separately from Python exceptions:
- fleet_tool_call_failed: Python exception during call_tool()
- fleet_mcp_tool_error: MCP server returned {"error": ...}, {"status": "failed"}, or {"isError": true}

This aligns telemetry with WandB tool_error counting which tracks both exception-based errors and error patterns in tool results.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
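The three result patterns listed above can be checked with a small predicate. This is a sketch rather than the PR's `_is_tool_error`: it assumes tool results arrive as plain dicts, and it uses truthiness checks so that a payload such as `{"error": null}` is not counted, in line with the later fix in this PR.

```python
from typing import Any


def is_tool_error(result: Any) -> bool:
    """Return True when a tool result matches a known MCP error pattern."""
    if not isinstance(result, dict):
        return False  # assumption: non-dict results are never error payloads
    if result.get("error"):  # truthy only, so {"error": None} is ignored
        return True
    if result.get("status") == "failed":
        return True
    return bool(result.get("isError"))
```

Under this split, a Python exception raised by call_tool() would be reported as fleet_tool_call_failed, while a result for which the predicate returns True would be reported as fleet_mcp_tool_error.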

… full context
- Move set_task_context() before from_fleet() call in task_env.py
- Remove explicit env_key/env_version from client.py telemetry calls (now from context)
- This ensures fleet_make_failed events include task_key and modality

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

src/envs/fleet_env/client.py

src/envs/fleet_env/task_env.py

- Remove unused fleet_error import from task_env.py
- Fix _is_tool_error to check truthy values (avoid {"error": null} false positives)
- Make close() exception-safe with try/finally for cleanup
- Emit fleet_rollout_completed on ALL paths (not just verifier success)
- Remove unused _env_name/_env_version parsing in client.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
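The two close()-related items above (cleanup that always runs, and a completion event on every path) can be illustrated with the following sketch. The function `finish_rollout`, its callback parameters, and the `verifier_ok` and `reward` attribute names are hypothetical; the sketch also assumes a verifier failure yields no reward instead of propagating.

```python
from typing import Any, Callable, Optional


def finish_rollout(
    verify: Callable[[], float],
    emit: Callable[..., Any],
    cleanup: Callable[[], None],
) -> Optional[float]:
    """Run the verifier, always emit a completion event, always clean up."""
    reward: Optional[float] = None
    verifier_ok = False
    try:
        reward = verify()
        verifier_ok = True
    except Exception:
        reward = None  # assumption: verifier errors produce no reward
    finally:
        try:
            # Emitted on success and on failure alike.
            emit("fleet_rollout_completed", verifier_ok=verifier_ok, reward=reward)
        except Exception:
            pass  # telemetry failure should not break cleanup
        finally:
            cleanup()  # runs even if the emit above raised
    return reward
```

Nesting the emit inside its own try/finally is what guarantees that `cleanup()` is reached regardless of how the verifier or the telemetry call behaves.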

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


cursor · 2026-02-27T06:23:09Z

src/envs/fleet_env/task_env.py

+                fleet_exception(
+                    "fleet_tools_list_failed",
+                    step_count=self._step_count,
+                )


Unprotected telemetry calls in error handlers risk disrupting recovery

Low Severity

fleet_exception calls inside except blocks in reset_async, step_async, and _compute_reward are not wrapped in try/except, unlike the identical pattern in close(), which protects against telemetry failures with the comment "Telemetry failure should not break cleanup." If fleet_exception raises, it disrupts the graceful error handling: at line 274, self._tools_cache = [] is never reached, and at line 434, the rest of step_async (done/reward/obs logic) is skipped. Because of this inconsistency, a telemetry failure could leave the environment in a broken state.

Additional Locations (2)

src/envs/fleet_env/task_env.py#L433-L438

src/envs/fleet_env/task_env.py#L535-L542
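One possible way to address the finding is to wrap the reporting call so it can never raise, then use the wrapped version inside except blocks. The sketch below is an assumption about how that could be done, not the fix applied in the PR; `never_raises` and `list_tools_or_empty` are hypothetical names.

```python
import functools
import logging
from typing import Any, Callable, List


def never_raises(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a telemetry function so its failures are swallowed."""
    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        try:
            return fn(*args, **kwargs)
        except Exception:
            # Keep a local trace of the telemetry failure for debugging.
            logging.getLogger(__name__).debug("telemetry call failed", exc_info=True)
            return None
    return wrapper


def list_tools_or_empty(list_tools: Callable[[], List[Any]],
                        report: Callable[..., Any]) -> List[Any]:
    """Return the tool list, or an empty list if listing fails."""
    safe_report = never_raises(report)
    try:
        return list_tools()
    except Exception:
        safe_report("fleet_tools_list_failed")
        return []  # fallback is reached even when reporting itself fails
```

Applying the wrapper once, at the point where the telemetry functions are defined, would protect all instrumentation sites without adding a try/except at each one.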

cursor bot reviewed Feb 26, 2026

src/envs/fleet_env/task_env.py (outdated)

src/envs/fleet_env/task_env.py (outdated)

dzorlu mentioned this pull request Feb 26, 2026

feat: Wire LOGFIRE_TOKEN to fleet env telemetry fleet-ai/SkyRL#258

Merged

2 tasks

cursor bot reviewed Feb 27, 2026

src/envs/fleet_env/task_env.py (outdated)

Deniz and others added 2 commits February 26, 2026 21:52

cursor bot reviewed Feb 27, 2026

src/envs/fleet_env/client.py (outdated)

src/envs/fleet_env/task_env.py

cursor bot reviewed Feb 27, 2026

dzorlu wants to merge 5 commits into deniz/fleet_client from deniz/fleet-logfire
