feat: support structured criteria scoring in verifier results #63
andrew-stelmach-fleet wants to merge 66 commits into main from
Conversation
When verifier code contains multiple functions (e.g., a main verifier function and helper functions), the helper functions were not accessible from the main function due to namespace isolation. The exec() call created the functions in local_namespace, but the main function's __globals__ pointed to exec_globals, which didn't contain the helper functions. This caused a NameError when the main function tried to call its helpers, which was silently caught and returned 0.0.

Fix: merge local_namespace into exec_globals after exec() so all defined functions are accessible when the verifier is called.
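The namespace fix can be sketched in a few lines. This is a minimal illustration of the mechanism described above, not the SDK's actual code; the variable names (`verifier_code`, `exec_globals`, `local_namespace`) follow the commit message.

```python
# Verifier source with a main function plus a helper, as in the bug report.
verifier_code = """
def _helper(x):
    return x * 2

def verify(value):
    return _helper(value)
"""

exec_globals = {}
local_namespace = {}
exec(verifier_code, exec_globals, local_namespace)

# Both functions now live in local_namespace, but verify.__globals__ is
# exec_globals, which does NOT contain _helper. Calling verify() at this
# point would raise NameError("name '_helper' is not defined").

# The fix: merge the locally defined names into the globals the verifier
# resolves names against.
exec_globals.update(local_namespace)

result = local_namespace["verify"](21)  # _helper now resolves: 42
```

Without the final `update()`, the NameError was swallowed and surfaced only as a silent 0.0 score.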
…mespace fix: allow verifier helper functions to be called from main verifier
InstanceRequest changes:
- Add: profile_id, async_provision, instance_mode, ssh_public_keys, snapshot_interval_minutes, version (deprecated)
- Fix: region default changed from 'us-west-1' to None (server decides)
- Fix: created_from default changed from None to 'api'

TaskRequest changes:
- Add: verifier_func, project_key, data_id, data_version, writer_metadata
- Add: model_config with extra='ignore' and populate_by_name=True
- Add: alias='env_id' for environment_id field
- Remove: metadata (doesn't exist in orchestrator TaskRequest, only in TaskResponse)
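The TaskRequest config changes above can be sketched as a small Pydantic v2 model. This is a hypothetical reduction: the field names come from the commit message, but the model body and defaults are assumptions, not the SDK's actual definition.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, Field


class TaskRequest(BaseModel):
    # extra='ignore' drops unknown keys instead of raising;
    # populate_by_name=True accepts the field name as well as its alias.
    model_config = ConfigDict(extra="ignore", populate_by_name=True)

    environment_id: Optional[str] = Field(default=None, alias="env_id")
    verifier_func: Optional[str] = None


# Accepted under the alias...
r1 = TaskRequest(env_id="env-123")
# ...and under the field name, thanks to populate_by_name=True.
r2 = TaskRequest(environment_id="env-123")
# Unknown keys are silently dropped, thanks to extra='ignore'.
r3 = TaskRequest(env_id="env-123", unknown_key="x")
```

With `extra='ignore'`, payloads carrying fields the orchestrator added later don't break older SDK versions.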
…API" This reverts commit 9a0af14.
add metadata to tasks in SDK
bump version
…odels Add factual_answer field to support research/factual tasks:
- Task model: stores expected answer for verification
- TaskRequest: accept factual_answer when creating tasks
- TaskResponse: return factual_answer from API

Part of: https://linear.app/fleet-ai/issue/ENG-843/import-script-needs-to-support-output-json-schemas
Co-authored-by: Cursor <cursoragent@cursor.com>
feat: add factual_answer field to Task and API models
Add task_modality field to Task and TaskResponse models to support copying task modality (computer_use, tool_use, browser) when importing tasks via the SDK.

Changes:
- Add task_modality to TaskResponse model (API response)
- Add task_modality to Task model (SDK model)
- Pass task_modality from TaskResponse to Task in load_tasks

Co-authored-by: Cursor <cursoragent@cursor.com>
Addresses Bugbot comment: load_task_from_json wasn't extracting task_modality from JSON data, causing tasks loaded from JSON files to have task_modality=None even when the JSON contains this field. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Add task_modality field to async Task model, TaskResponse model, and update load_task_from_json and load_tasks to preserve task_modality. Co-authored-by: Cursor <cursoragent@cursor.com>
The API returns `environment_id` but load_task_from_json was only looking for `env_id` or `env_key`. Now it checks all three field names. Bump version to 0.2.113. Co-authored-by: Cursor <cursoragent@cursor.com>
fix: handle environment_id in load_task_from_json
Previously, import_single_task would catch all exceptions and return None, making it impossible to debug import failures. Now it raises the exception so callers can handle or report the actual error. Bump version to 0.2.114. Co-authored-by: Cursor <cursoragent@cursor.com>
…dling fix: propagate errors from import_single_task instead of swallowing
This field was missing from the SDK, causing the lifecycle status to be lost when copying tasks. The API returns this field but the SDK wasn't capturing it.

Changes:
- Add task_lifecycle_status field to Task model (sync and async)
- Map task_lifecycle_status in load_task_from_json (sync and async)
- Bump version to 0.2.115

Co-authored-by: Cursor <cursoragent@cursor.com>
The API returns 'environment_id', so just use that directly instead of a fallback chain of env_id/env_key/environment_id. Co-authored-by: Cursor <cursoragent@cursor.com>
The database uses env_key, so the SDK model should match. Added alias="environment_id" so the API response still maps correctly.

Updated all references:
- Task.env_id -> Task.env_key
- TaskInfo.env_id -> TaskInfo.env_key
- Updated docstrings and examples

Co-authored-by: Cursor <cursoragent@cursor.com>
The API expects env_id (or environment_id), so we map env_key to env_id in import_single_task before sending. This keeps the SDK using env_key internally (matching DB) while maintaining API compatibility. No API changes needed - this is SDK-only. Co-authored-by: Cursor <cursoragent@cursor.com>
The task_lifecycle_status field was added to the Task model but was missing from:
- TaskResponse model (sync and async) - needed to parse the API response
- load_tasks method - needed to pass the field to the Task constructor

This completes the task_lifecycle_status support in the SDK.

Co-authored-by: Cursor <cursoragent@cursor.com>
The field was renamed to env_key but there was already a property with the same name, causing infinite recursion. Renamed the property to get_env_key() method. Also restored fallback for env_key in load_task_from_json to support JSON files that use env_key field. Co-authored-by: Cursor <cursoragent@cursor.com>
The field was renamed to env_key but there was already a property with the same name, causing infinite recursion. Renamed the property to get_env_key() method. Also restored env_id fallback in load_task_from_json for backward compatibility with existing JSON files. Co-authored-by: Cursor <cursoragent@cursor.com>
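The recursion bug described above is a classic field/property name collision. Below is an illustrative sketch of both the bug and the get_env_key() fix; the class bodies are assumptions, only the names env_key and get_env_key come from the commit message.

```python
class TaskBuggy:
    # The bug: a property whose body reads the attribute of the same name.
    # The read re-enters the property, recursing until RecursionError.
    @property
    def env_key(self):
        return self.env_key  # infinite recursion


class Task:
    # The fix: keep env_key as a plain field and expose the computed
    # value as a method with a different name.
    def __init__(self, env_key, version=None):
        self.env_key = env_key
        self.version = version

    def get_env_key(self):
        # env_key with the version suffix when one is set
        return f"{self.env_key}:{self.version}" if self.version else self.env_key


t = Task("browser-env", "v2")
t.get_env_key()  # "browser-env:v2"
```

This is also why make() had to call get_env_key() rather than the raw env_key field, as the later commit notes.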
The make() method was using self.env_key (raw field) instead of self.get_env_key() (computed method with version). This would cause environments to be created without the version suffix. Co-authored-by: Cursor <cursoragent@cursor.com>
The API returns env_id but TaskInfo was renamed to use env_key. Added alias="env_id" so Pydantic accepts both field names during deserialization of API responses. Co-authored-by: Cursor <cursoragent@cursor.com>
When export_tasks serializes tasks, it outputs env_key. The loading function needs to check for env_key first (canonical name), then fallback to environment_id (API) and env_id (legacy). Co-authored-by: Cursor <cursoragent@cursor.com>
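The fallback order above can be sketched as a tiny helper. `resolve_env_key` is a hypothetical name for illustration, not the SDK's actual function; the lookup order (env_key, then environment_id, then env_id) is taken from the commit message.

```python
def resolve_env_key(data: dict):
    # Canonical export name first, then the API field, then the legacy name.
    for field in ("env_key", "environment_id", "env_id"):
        if data.get(field) is not None:
            return data[field]
    return None


resolve_env_key({"environment_id": "env-1"})      # API-shaped JSON
resolve_env_key({"env_key": "a", "env_id": "b"})  # env_key wins over legacy
```

Checking the canonical name first means round-tripping through export_tasks and back is lossless even when legacy fields are also present.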
- TaskResponse: rename environment_id -> env_key (alias="environment_id")
- TaskRequest: rename environment_id -> env_key (alias="environment_id")
- Add ConfigDict(populate_by_name=True) for alias support
- Add Task.env_spec property for env_key:version string
- Use task.env_spec in Task.make() and make_for_task()
- Clean up load_tasks to use task_response.env_key directly
- Remove scattered inline env_key:version string building

Co-authored-by: Cursor <cursoragent@cursor.com>
- data_spec: renamed from data_key (data_key kept as alias)
- has_verifier: whether task has verifier_func or verifier
- is_research_based: whether task has a factual_answer
- is_action_based: inverse of is_research_based

Co-authored-by: Cursor <cursoragent@cursor.com>
TaskInfo has alias="env_id" on env_key field but was missing model_config = ConfigDict(populate_by_name=True). Without this, creating TaskInfo(env_key="...") would fail since only the alias name was accepted. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
feat: Add task_lifecycle_status field to Task model
The PUT /v1/tasks/{task_key} endpoint can return environment_id: null,
which caused a Pydantic validation error since env_key was required.
This made update_task crash instead of returning a TaskResponse.
- TaskResponse.env_key: str -> Optional[str]
- Task.env_key: str -> Optional[str]
- Task.env_spec now returns None when env_key is absent
Co-authored-by: Cursor <cursoragent@cursor.com>
When a task has env_key=None, make_for_task would pass None to make() causing a TypeError at ":" in env_key. Now raises a clear ValueError matching the guard in Task.make(). Co-authored-by: Cursor <cursoragent@cursor.com>
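The guard above can be sketched as follows. The function body and error message are illustrative assumptions; only the behavioral change (a clear ValueError instead of a confusing TypeError from string formatting on None) comes from the commit message.

```python
from typing import Optional


def make_for_task(env_key: Optional[str]) -> str:
    # Mirrors the guard in Task.make() per the commit message: fail loudly
    # with a ValueError instead of crashing later on f"{None}:..." handling.
    if env_key is None:
        raise ValueError("Task has no env_key; cannot build environment spec")
    return f"{env_key}:latest"


try:
    make_for_task(None)
except ValueError as e:
    caught = str(e)
```

Raising the same error type from both entry points keeps caller error-handling uniform.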
fix: make TaskResponse.env_key optional to handle null API responses
…v-key revert: restore env_key as required in TaskResponse and Task
Update verifier result processing to preserve full dict results when
they contain structured criteria data (e.g. {"result": 0.84, "criteria": [...]}).
Previously, dict results were always reduced to a float score, which
dropped criteria breakdown data needed by the frontend (client PR #1737).
Changes:
- SyncVerifierFunction.__call__: accept dict returns with "result" key
- SyncVerifierFunction._process_result: preserve criteria dicts
- AsyncVerifierFunction: same changes for async path
- DecoratorVerifierFunction.__call__: same changes for decorator path
- tests/test_verifier_criteria.py: new test suite
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
__call__ was returning the full dict for any dict with "score"/"result" keys, but _process_result only preserved the dict when "criteria" was present. Now both paths behave consistently: extract numeric score unless criteria are present. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
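The consistent rule described above ("extract a numeric score unless criteria are present") can be sketched like this. `process_result` is a stand-in name; the real logic lives in `__call__`/`_process_result` and may differ in detail.

```python
def process_result(result):
    if isinstance(result, dict):
        if "criteria" in result:
            # Structured result: preserve the full dict so the criteria
            # breakdown survives to the orchestrator and frontend.
            return result
        # Score-only dicts are reduced to their numeric value.
        for key in ("result", "score"):
            if key in result:
                return float(result[key])
        return 0.0
    if isinstance(result, (int, float)):
        return float(result)
    return 0.0


full = process_result({"result": 0.84, "criteria": [{"criteria": "Accuracy", "score": 0.95}]})
plain = process_result({"score": 0.84})  # reduced to the float 0.84
```

Under this rule, {"score": 0.84} without "criteria" is reduced to a float on every path, which is exactly the behavioral change Bugbot's review comment below calls out as under-tested.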
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```python
result = v(None)
assert isinstance(result, dict)
assert result["score"] == 0.84
assert len(result["criteria"]) == 1
```
Misleading test masks coverage gap in score-only dict path
Medium Severity
test_call_returns_dict_with_score_key claims to verify that a dict with a "score" key is returned as-is, but the test data also includes a "criteria" key. It is the "criteria" key that triggers the full-dict return path, not the "score" key alone: a dict like {"score": 0.84} (without "criteria") would actually be reduced to the float 0.84 by __call__. The test passes by accident and gives a false sense of coverage for the score-only dict path through __call__, which is a behavioral change from the previous code (which returned full dicts for any dict containing "score").
```python
def test_call_error_returns_zero(self):
    """Errors in verifier function return 0.0."""
    v = self._make_verifier(lambda env: (_ for _ in ()).throw(ValueError("boom")))
```
Unused verifier variable in error test
Low Severity
In test_call_error_returns_zero, the variable v is created from a lambda-based verifier on line 86 but is never called or asserted against. The test only exercises v2. This appears to be leftover scaffolding code that adds confusion without contributing to test coverage.
Summary
SyncVerifierFunction,AsyncVerifierFunction,DecoratorVerifierFunction) to preserve full dict results when they contain structured criteria data (e.g.{"result": 0.84, "criteria": [...]})Changes
fleet/verifiers/verifier.py(sync)__call__(): Accept dict returns with"result"key (not just"score")remote(): Removed-> floatreturn annotation to allow dict returns_process_result(): When dict contains"criteria"key, return full dict instead of extracting just the scorefleet/_async/verifiers/verifier.py(async)__call__,remote,_process_resultfleet/verifiers/decorator.py__call__(): Same dict-awareness pattern — returns full dict when result contains"result"keytests/test_verifier_criteria.py(new)How it works
When a verifier function returns a dict like:

```json
{
  "result": 0.84,
  "criteria": [
    {"criteria": "Accuracy", "score": 0.95, "score_out_of": 1.0},
    {"criteria": "Quality", "score": 0.6, "score_out_of": 1.0}
  ]
}
```

the SDK now preserves the full dict instead of reducing it to 0.84. This allows the orchestrator (theseus PR #1801) to store the full result in the verifier_executions.result jsonb column, and the frontend (client PR #1737) to render the criteria breakdown.

Tests
- tests/test_verifier_criteria.py — 8 test cases covering all verifier types
- Run with pytest tests/test_verifier_criteria.py -v

Test plan
- pytest tests/test_verifier_criteria.py -v passes
- pytest tests/ -v passes
- Verify {"result": 0.84, "criteria": [...]} flows through to Supabase verifier_executions.result

🤖 Generated with Claude Code