Skip to content

feat: Fleet client#1

Draft
dzorlu wants to merge 40 commits intomainfrom
deniz/fleet_client
Draft

feat: Fleet client#1
dzorlu wants to merge 40 commits intomainfrom
deniz/fleet_client

Conversation

@dzorlu
Copy link
Collaborator

@dzorlu dzorlu commented Dec 12, 2025

PR: Fleet environments (OpenEnv)

This PR documents and refines the Fleet runtime integration for OpenEnv.

What this enables

  • Run OpenEnv environments on Fleet (remote) with no local Docker.
  • Keep a strict split between:
    • Orchestration (HTTP): reset / step / state
    • Agent actions (MCP): tools/list + tools/call

What this is not

  • This is not the local “Dockerized env server + env container” setup.
  • There is no container/provider abstraction here; Fleet hosts the runtime remotely (HTTP env server + MCP service). The client only connects.

Main abstractions

  • FleetEnvClient (HTTP): orchestrator handle for reset/step/state.
  • FleetMCPTools (MCP): agent handle for listing/calling tools.
    • Unions tools across Fleet’s MCP endpoints (today often api/v1/mcp and mcp)
    • Returns tools in OpenAI “tools” dict format (via convert_tool_format)
    • Routes tool calls to the owning endpoint (cached after discovery)

Quickstart

  • Install: pip install "openenv-core[fleet]"
  • Set: export FLEET_API_KEY="..."
  • Run: python examples/fleet_env_example.py <env_key>

References

  • RFC 001: rfcs/001-abstractions.md
  • RFC 003: rfcs/003-mcp-support.md

TODOs / known sharp edges

  • Endpoint discovery (avoid hardcoding api/v1/mcp vs mcp)
  • Reset inconsistencies across some env keys (better errors + compatibility notes)
  • Tool-name collision policy across endpoints
  • Retries/backoff and clearer “endpoint down” failure modes

@dzorlu dzorlu changed the title Deniz/fleet client feat: Fleet client Dec 13, 2025
Deniz and others added 15 commits December 17, 2025 17:29
- FleetTaskEnv wraps FleetEnvClient with task-oriented interface
- Accepts task configs from export_training_tasks.py
- Creates versioned environments on reset
- Injects task prompt into observations
- Executes verifier for reward computation on episode completion
- Supports both sync and async step methods
- Factory functions: make_fleet_task_env, from_json_file
- Tests: 20 unit tests for init, specs, verifiers, factories

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The MCP images don't exist for all environment versions, causing
FleetVersionNotFoundError when trying to create environments.
Changing the default to None allows the Fleet SDK to use standard
images which are available for all versions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
FleetEnvClient.from_fleet() was not accepting data_key/data_version
parameters, causing them to be passed through **kwargs to HTTPEnvClient
which doesn't accept them.

- Add data_key and data_version as explicit parameters
- Pass them to fleet.make()
- Update task_env.py to pass them separately

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fleet SDK expects data_key in "key:version" format, not as separate
parameters. Updated from_fleet() to combine them before calling
fleet.make().

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
HTTPEnvClient.reset() doesn't support seed parameter yet.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Increases default timeout from 15s to 60s for Fleet API calls.
This prevents timeouts during environment initialization.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously reset() did partial work and reset_async() added tool fetching.
Now reset_async() does all the work (including fetching tools) and reset()
is just a sync wrapper that calls it via run_until_complete().

This ensures both methods return identical results including tools.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
MCP's call_tool() returns a CallToolResult Pydantic object, not plain text.
This was causing ugly repr strings to be passed to agents like:
  "meta=None content=[TextContent(type='text', text='...')] ..."

Now properly extracts:
- Text content from result.content[].text
- Tries JSON parsing for structured results
- Falls back to structuredContent if available
- Handles isError cases

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tests for:
- FleetMCPClient._extract_tool_result():
  - Single text content extraction
  - JSON parsing from text
  - Multiple text contents
  - Error result handling
  - Structured content fallback
  - Empty result handling

- FleetTaskEnv reset:
  - reset_async() returns tools
  - reset() calls reset_async() (sync wrapper)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move fleet.make() and list_tools() into FleetTaskEnv.__init__()
- Tools are now fetched at env creation, not during reset
- reset_async() calls _orch.reset() with error handling, returns cached tools
- Use asyncio.run() for Python 3.13 compatibility
- Update tests for new initialization pattern

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Log task_key and verifier code preview when verifier fails
- Catch syntax errors separately with clear message
- Show which functions were found if 'verify' is missing

Helps debug issues like "Verifier code must define a 'verify' function"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace custom _execute_verifier_local() with Fleet SDK's Task.verify_detailed()
which properly sets up the verifier namespace with:
- Environment type annotation
- Helper functions (normalized_contains, etc.)
- Proper function discovery (not just "verify" function)

This fixes "name 'Environment' is not defined" errors during verifier execution.

Changes:
- _compute_reward: Create Fleet SDK Task and call verify_detailed()
- Support both 'verifier_code' and 'verifier_func' field names
- Add comprehensive logging for debugging
- Remove broken _execute_verifier_local method

Tests:
- Update all verifier tests to mock Fleet SDK Task.verify_detailed()
- Add tests for various edge cases (no verifier, no orch, exceptions)
- Fix fixture to avoid asyncio.run() conflicts with pytest-asyncio

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Deniz and others added 9 commits January 26, 2026 11:01
- Add retry with exponential backoff (3 attempts, 1s/2s/4s delays)
- Log errors instead of silently swallowing exceptions
- Log warning when some clients fail but others succeed
- Log error after all retries exhausted

This fixes silent failures when MCP connections are flaky, which caused
'no tools found' errors in SkyRL training.
call_tool now retries with exponential backoff (3 attempts, 1s/2s/4s)
on connection errors, similar to list_tools.

ValueError (tool not found) is not retried.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds exponential backoff retry (3 attempts, 2s base delay) around
fleet.make() to handle transient Fleet API errors like health check
failures that can occur during instance provisioning.

Only retries on transient errors (health check, timeout, connection).
Permanent errors are raised immediately.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add Toolathlon-style context management tools for long trajectories:
- check_context: Check visible/total turn counts
- manage_context: Drop old turns to free up context space
- search_history: Search all history (including dropped)
- search_tool_output: Search truncated tool output
- view_tool_output: Paginate through truncated output

The ContextManager class can be used by any training framework that
maintains chat_history. It tracks full history and handles truncated
tool outputs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Computer-use tasks require MCP-enabled container images (e.g., famazon:mcp0.0.7)
which have scrot installed for screenshots and the MCP server with 'computer' tool
for mouse/keyboard control.
Deniz and others added 12 commits February 11, 2026 07:46
Previously, tools were only fetched for tool_use modality due to a
restrictive condition. This caused computer_use tasks to fail with
"no tools found in observation" because the computer tool (mouse,
keyboard, screenshot) was never fetched.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When task_modality is computer_use, filter tools to only include
the 'computer' tool. This prevents the model from using API tools
when it should be using mouse/keyboard control.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Two critical fixes for VL (vision-language) model training:

1. ImageContent extraction: _extract_tool_result() now handles MCP
   ImageContent (base64 images with mimeType) and converts them to
   OpenAI-compatible format for VL models.

2. Tool filtering: computer_use modality now always filters to only
   the 'computer' tool. If no computer tool found, clears all tools
   and logs warning (prevents model from using API tools).

Tests added:
- test_extract_image_content
- test_extract_mixed_text_and_image_content
- test_extract_image_default_mimetype
- test_computer_use_filters_to_computer_tool
- test_computer_use_clears_tools_when_no_computer_tool
- test_tool_use_does_not_filter
- test_computer_use_filters_function_format

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
For VL (vision-language) models doing computer_use tasks, the model
needs visual input to know where to click. Previously, reset() only
returned metadata without a screenshot, leaving VL models blind.

Now for computer_use modality, reset_async() automatically takes a
screenshot after reset and includes it in the observation as
`initial_screenshot`. This is in OpenAI-compatible format for VL models.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The manager API (POST /reset) hangs indefinitely on some env images
(e.g. google-maps v0.0.53). Since reset failure is already handled
gracefully (warning + continue), this adds a short dedicated timeout
(default 10s) so the reset fails fast instead of blocking for the
full request_timeout_s (60-120s).

This saves 50-110s per episode during training when the manager API
is unresponsive, while still allowing reset to succeed on healthy envs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add env_key property to FleetTaskEnv
- Prefix all error/warning logs with [env=X] for easy filtering
- Helps identify which environments have infrastructure issues (502s, health checks)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Defense-in-depth against zombie threads: if asyncio cancellation somehow
fails to propagate, HTTP-level timeouts ensure MCP calls fail within
2 minutes instead of hanging forever.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- mcp_tools.py: raise RuntimeError after 3 failed list_tools attempts
  instead of silently returning empty ListToolsAction
- mcp_tools.py: increase retry_base_delay from 1s to 2s
- task_env.py: don't set _tools_fetched=True on failure so next
  reset_async() can retry tool discovery

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dzorlu dzorlu force-pushed the deniz/fleet_client branch 2 times, most recently from 272b0b3 to 8370cd1 Compare February 26, 2026 04:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant