feat: support structured criteria scoring in verifier results #63
andrew-stelmach-fleet wants to merge 66 commits into main from
Conversation
When verifier code contains multiple functions (e.g., a main verifier function and helper functions), the helper functions were not accessible from the main function due to namespace isolation. The exec() call created the functions in local_namespace, but the main function's __globals__ pointed to exec_globals, which didn't contain the helper functions. This caused a NameError when the main function tried to call its helpers, which was silently caught and returned 0.0.

Fix: merge local_namespace into exec_globals after exec() so all defined functions are accessible when the verifier is called.
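The namespace fix can be sketched in a few lines. This is a minimal illustration of the mechanism described above, not the SDK's actual code; the variable names (`verifier_code`, `exec_globals`, `local_namespace`) follow the commit message.

```python
# Verifier source with a main function plus a helper, as in the bug report.
verifier_code = """
def _helper(x):
    return x * 2

def verify(value):
    return _helper(value)
"""

exec_globals = {}
local_namespace = {}
exec(verifier_code, exec_globals, local_namespace)

# Both functions now live in local_namespace, but verify.__globals__ is
# exec_globals, which does NOT contain _helper. Calling verify() at this
# point would raise NameError("name '_helper' is not defined").

# The fix: merge the locally defined names into the globals the verifier
# resolves names against.
exec_globals.update(local_namespace)

result = local_namespace["verify"](21)  # _helper now resolves: 42
```

Without the final `update()`, the NameError was swallowed and surfaced only as a silent 0.0 score.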
…mespace fix: allow verifier helper functions to be called from main verifier
InstanceRequest changes:
- Add: profile_id, async_provision, instance_mode, ssh_public_keys, snapshot_interval_minutes, version (deprecated)
- Fix: region default changed from 'us-west-1' to None (server decides)
- Fix: created_from default changed from None to 'api'

TaskRequest changes:
- Add: verifier_func, project_key, data_id, data_version, writer_metadata
- Add: model_config with extra='ignore' and populate_by_name=True
- Add: alias='env_id' for environment_id field
- Remove: metadata (doesn't exist in orchestrator TaskRequest, only in TaskResponse)
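The TaskRequest config changes above can be sketched as a small Pydantic v2 model. This is a hypothetical reduction: the field names come from the commit message, but the model body and defaults are assumptions, not the SDK's actual definition.

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, Field


class TaskRequest(BaseModel):
    # extra='ignore' drops unknown keys instead of raising;
    # populate_by_name=True accepts the field name as well as its alias.
    model_config = ConfigDict(extra="ignore", populate_by_name=True)

    environment_id: Optional[str] = Field(default=None, alias="env_id")
    verifier_func: Optional[str] = None


# Accepted under the alias...
r1 = TaskRequest(env_id="env-123")
# ...and under the field name, thanks to populate_by_name=True.
r2 = TaskRequest(environment_id="env-123")
# Unknown keys are silently dropped, thanks to extra='ignore'.
r3 = TaskRequest(env_id="env-123", unknown_key="x")
```

With `extra='ignore'`, payloads carrying fields the orchestrator added later don't break older SDK versions.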
…API" This reverts commit 9a0af14.
add metadata to tasks in SDK
bump version
…odels Add factual_answer field to support research/factual tasks:
- Task model: stores expected answer for verification
- TaskRequest: accept factual_answer when creating tasks
- TaskResponse: return factual_answer from API

Part of: https://linear.app/fleet-ai/issue/ENG-843/import-script-needs-to-support-output-json-schemas
Co-authored-by: Cursor <cursoragent@cursor.com>
feat: add factual_answer field to Task and API models
Add task_modality field to Task and TaskResponse models to support copying task modality (computer_use, tool_use, browser) when importing tasks via the SDK.

Changes:
- Add task_modality to TaskResponse model (API response)
- Add task_modality to Task model (SDK model)
- Pass task_modality from TaskResponse to Task in load_tasks

Co-authored-by: Cursor <cursoragent@cursor.com>
Addresses Bugbot comment: load_task_from_json wasn't extracting task_modality from JSON data, causing tasks loaded from JSON files to have task_modality=None even when the JSON contains this field. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Add task_modality field to async Task model, TaskResponse model, and update load_task_from_json and load_tasks to preserve task_modality. Co-authored-by: Cursor <cursoragent@cursor.com>
The API returns `environment_id` but load_task_from_json was only looking for `env_id` or `env_key`. Now it checks all three field names. Bump version to 0.2.113. Co-authored-by: Cursor <cursoragent@cursor.com>
fix: handle environment_id in load_task_from_json
Previously, import_single_task would catch all exceptions and return None, making it impossible to debug import failures. Now it raises the exception so callers can handle or report the actual error. Bump version to 0.2.114. Co-authored-by: Cursor <cursoragent@cursor.com>
…dling fix: propagate errors from import_single_task instead of swallowing
This field was missing from the SDK, causing the lifecycle status to be lost when copying tasks. The API returns this field but the SDK wasn't capturing it.

Changes:
- Add task_lifecycle_status field to Task model (sync and async)
- Map task_lifecycle_status in load_task_from_json (sync and async)
- Bump version to 0.2.115

Co-authored-by: Cursor <cursoragent@cursor.com>
The API returns 'environment_id', so just use that directly instead of a fallback chain of env_id/env_key/environment_id. Co-authored-by: Cursor <cursoragent@cursor.com>
The database uses env_key, so the SDK model should match. Added alias="environment_id" so the API response still maps correctly.

Updated all references:
- Task.env_id -> Task.env_key
- TaskInfo.env_id -> TaskInfo.env_key
- Updated docstrings and examples

Co-authored-by: Cursor <cursoragent@cursor.com>
The API expects env_id (or environment_id), so we map env_key to env_id in import_single_task before sending. This keeps the SDK using env_key internally (matching DB) while maintaining API compatibility. No API changes needed - this is SDK-only. Co-authored-by: Cursor <cursoragent@cursor.com>
The task_lifecycle_status field was added to the Task model but was missing from:
- TaskResponse model (sync and async) - needed to parse the API response
- load_tasks method - needed to pass the field to the Task constructor

This completes the task_lifecycle_status support in the SDK.

Co-authored-by: Cursor <cursoragent@cursor.com>
The field was renamed to env_key but there was already a property with the same name, causing infinite recursion. Renamed the property to get_env_key() method. Also restored fallback for env_key in load_task_from_json to support JSON files that use env_key field. Co-authored-by: Cursor <cursoragent@cursor.com>
The field was renamed to env_key but there was already a property with the same name, causing infinite recursion. Renamed the property to get_env_key() method. Also restored env_id fallback in load_task_from_json for backward compatibility with existing JSON files. Co-authored-by: Cursor <cursoragent@cursor.com>
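The recursion bug described above is a classic field/property name collision. Below is an illustrative sketch of both the bug and the get_env_key() fix; the class bodies are assumptions, only the names env_key and get_env_key come from the commit message.

```python
class TaskBuggy:
    # The bug: a property whose body reads the attribute of the same name.
    # The read re-enters the property, recursing until RecursionError.
    @property
    def env_key(self):
        return self.env_key  # infinite recursion


class Task:
    # The fix: keep env_key as a plain field and expose the computed
    # value as a method with a different name.
    def __init__(self, env_key, version=None):
        self.env_key = env_key
        self.version = version

    def get_env_key(self):
        # env_key with the version suffix when one is set
        return f"{self.env_key}:{self.version}" if self.version else self.env_key


t = Task("browser-env", "v2")
t.get_env_key()  # "browser-env:v2"
```

This is also why make() had to call get_env_key() rather than the raw env_key field, as the later commit notes.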
The make() method was using self.env_key (raw field) instead of self.get_env_key() (computed method with version). This would cause environments to be created without the version suffix. Co-authored-by: Cursor <cursoragent@cursor.com>
The API returns env_id but TaskInfo was renamed to use env_key. Added alias="env_id" so Pydantic accepts both field names during deserialization of API responses. Co-authored-by: Cursor <cursoragent@cursor.com>
When export_tasks serializes tasks, it outputs env_key. The loading function needs to check for env_key first (canonical name), then fallback to environment_id (API) and env_id (legacy). Co-authored-by: Cursor <cursoragent@cursor.com>
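The fallback order above can be sketched as a tiny helper. `resolve_env_key` is a hypothetical name for illustration, not the SDK's actual function; the lookup order (env_key, then environment_id, then env_id) is taken from the commit message.

```python
def resolve_env_key(data: dict):
    # Canonical export name first, then the API field, then the legacy name.
    for field in ("env_key", "environment_id", "env_id"):
        if data.get(field) is not None:
            return data[field]
    return None


resolve_env_key({"environment_id": "env-1"})      # API-shaped JSON
resolve_env_key({"env_key": "a", "env_id": "b"})  # env_key wins over legacy
```

Checking the canonical name first means round-tripping through export_tasks and back is lossless even when legacy fields are also present.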
- TaskResponse: rename environment_id -> env_key (alias="environment_id")
- TaskRequest: rename environment_id -> env_key (alias="environment_id")
- Add ConfigDict(populate_by_name=True) for alias support
- Add Task.env_spec property for env_key:version string
- Use task.env_spec in Task.make() and make_for_task()
- Clean up load_tasks to use task_response.env_key directly
- Remove scattered inline env_key:version string building

Co-authored-by: Cursor <cursoragent@cursor.com>
- data_spec: renamed from data_key (data_key kept as alias)
- has_verifier: whether task has verifier_func or verifier
- is_research_based: whether task has a factual_answer
- is_action_based: inverse of is_research_based

Co-authored-by: Cursor <cursoragent@cursor.com>
TaskInfo has alias="env_id" on env_key field but was missing model_config = ConfigDict(populate_by_name=True). Without this, creating TaskInfo(env_key="...") would fail since only the alias name was accepted. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
feat: Add task_lifecycle_status field to Task model
The PUT /v1/tasks/{task_key} endpoint can return environment_id: null,
which caused a Pydantic validation error since env_key was required.
This made update_task crash instead of returning a TaskResponse.
- TaskResponse.env_key: str -> Optional[str]
- Task.env_key: str -> Optional[str]
- Task.env_spec now returns None when env_key is absent
Co-authored-by: Cursor <cursoragent@cursor.com>
When a task has env_key=None, make_for_task would pass None to make() causing a TypeError at ":" in env_key. Now raises a clear ValueError matching the guard in Task.make(). Co-authored-by: Cursor <cursoragent@cursor.com>
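The guard above can be sketched as follows. The function body and error message are illustrative assumptions; only the behavioral change (a clear ValueError instead of a confusing TypeError from string formatting on None) comes from the commit message.

```python
from typing import Optional


def make_for_task(env_key: Optional[str]) -> str:
    # Mirrors the guard in Task.make() per the commit message: fail loudly
    # with a ValueError instead of crashing later on f"{None}:..." handling.
    if env_key is None:
        raise ValueError("Task has no env_key; cannot build environment spec")
    return f"{env_key}:latest"


try:
    make_for_task(None)
except ValueError as e:
    caught = str(e)
```

Raising the same error type from both entry points keeps caller error-handling uniform.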
fix: make TaskResponse.env_key optional to handle null API responses
…v-key revert: restore env_key as required in TaskResponse and Task
Update verifier result processing to preserve full dict results when
they contain structured criteria data (e.g. {"result": 0.84, "criteria": [...]}).
Previously, dict results were always reduced to a float score, which
dropped criteria breakdown data needed by the frontend (client PR #1737).
Changes:
- SyncVerifierFunction.__call__: accept dict returns with "result" key
- SyncVerifierFunction._process_result: preserve criteria dicts
- AsyncVerifierFunction: same changes for async path
- DecoratorVerifierFunction.__call__: same changes for decorator path
- tests/test_verifier_criteria.py: new test suite
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
__call__ was returning the full dict for any dict with "score"/"result" keys, but _process_result only preserved the dict when "criteria" was present. Now both paths behave consistently: extract numeric score unless criteria are present. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
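The consistent rule described above ("extract a numeric score unless criteria are present") can be sketched like this. `process_result` is a stand-in name; the real logic lives in `__call__`/`_process_result` and may differ in detail.

```python
def process_result(result):
    if isinstance(result, dict):
        if "criteria" in result:
            # Structured result: preserve the full dict so the criteria
            # breakdown survives to the orchestrator and frontend.
            return result
        # Score-only dicts are reduced to their numeric value.
        for key in ("result", "score"):
            if key in result:
                return float(result[key])
        return 0.0
    if isinstance(result, (int, float)):
        return float(result)
    return 0.0


full = process_result({"result": 0.84, "criteria": [{"criteria": "Accuracy", "score": 0.95}]})
plain = process_result({"score": 0.84})  # reduced to the float 0.84
```

Under this rule, {"score": 0.84} without "criteria" is reduced to a float on every path, which is exactly the behavioral change Bugbot's review comment below calls out as under-tested.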
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```python
result = v(None)
assert isinstance(result, dict)
assert result["score"] == 0.84
assert len(result["criteria"]) == 1
```
Misleading test masks coverage gap in score-only dict path
Medium Severity
test_call_returns_dict_with_score_key claims to verify that a dict with a "score" key is returned as-is, but the test data also includes a "criteria" key. It is the "criteria" key that triggers the full-dict return path, not the "score" key alone: a dict like {"score": 0.84} (without "criteria") would actually be reduced to the float 0.84 by __call__. The test passes by accident and gives a false sense of coverage for the score-only dict path through __call__, which is a behavioral change from the previous code (which returned full dicts for any dict containing "score").
```python
def test_call_error_returns_zero(self):
    """Errors in verifier function return 0.0."""
    v = self._make_verifier(lambda env: (_ for _ in ()).throw(ValueError("boom")))
```
Unused verifier variable in error test
Low Severity
In test_call_error_returns_zero, the variable v is created from a lambda-based verifier on line 86 but is never called or asserted against. The test only exercises v2. This appears to be leftover scaffolding code that adds confusion without contributing to test coverage.
Summary
SyncVerifierFunction,AsyncVerifierFunction,DecoratorVerifierFunction) to preserve full dict results when they contain structured criteria data (e.g.{"result": 0.84, "criteria": [...]})Changes
fleet/verifiers/verifier.py(sync)__call__(): Accept dict returns with"result"key (not just"score")remote(): Removed-> floatreturn annotation to allow dict returns_process_result(): When dict contains"criteria"key, return full dict instead of extracting just the scorefleet/_async/verifiers/verifier.py(async)__call__,remote,_process_resultfleet/verifiers/decorator.py__call__(): Same dict-awareness pattern — returns full dict when result contains"result"keytests/test_verifier_criteria.py(new)How it works
When a verifier function returns a dict like:

```json
{
  "result": 0.84,
  "criteria": [
    {"criteria": "Accuracy", "score": 0.95, "score_out_of": 1.0},
    {"criteria": "Quality", "score": 0.6, "score_out_of": 1.0}
  ]
}
```

the SDK now preserves the full dict instead of reducing it to 0.84. This allows the orchestrator (theseus PR #1801) to store the full result in the verifier_executions.result jsonb column, and the frontend (client PR #1737) to render the criteria breakdown.

Tests
- tests/test_verifier_criteria.py — 8 test cases covering all verifier types
- Run with pytest tests/test_verifier_criteria.py -v

Test plan
- pytest tests/test_verifier_criteria.py -v passes
- pytest tests/ -v passes
- Verify {"result": 0.84, "criteria": [...]} flows through to Supabase verifier_executions.result

🤖 Generated with Claude Code