
Detect stalled validators with zero evaluation progress#321

Open
thomasvangurp wants to merge 1 commit into ridgesai:main from thomasvangurp:feature/stalled-evaluation-detection

Conversation

@thomasvangurp thomasvangurp commented Mar 9, 2026

Summary

  • Adds background loop that detects validators/screeners whose inference gateway is broken (e.g. returning 502 for all requests)
  • When a screener has been running an evaluation for >15 minutes with zero runs progressed past running_agent, it is disconnected and its evaluation runs are errored out
  • The agent is automatically re-queued for another screener (no additional changes needed — the existing queue views and error handling already support this)

Problem

Screeners with broken inference gateways stay alive (heartbeats continue) but produce zero results. The running evaluation blocks the queue view, so the miner's agent cannot be picked up by a working screener. The agent is stuck until the screener's Docker containers time out (25+ minutes per problem), and even then all runs get 0%.

How re-queuing works (already built into the platform)

  1. Stalled screener is detected and disconnected via delete_validator()
  2. All unfinished runs are set to error with 3xxx (platform) error codes
  3. handle_evaluation_if_finished() sees evaluation status = failure (not success), so agent status is not changed — it stays in screening_1 or screening_2
  4. The screener_X_queue views only block on success or running evaluations — a failure evaluation does not block
  5. Next time a working screener calls request-evaluation, the agent appears in the queue and gets assigned
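The key property in step 4 can be illustrated with a small sketch of the blocking rule. The dict shape and function names here are hypothetical, standing in for the actual `screener_X_queue` view logic.

```python
# Illustrative sketch of the queue-blocking rule from step 4: only `success`
# or `running` evaluations keep an agent out of the screener queue. A
# `failure` evaluation (e.g. one errored out by the stall detector) does not
# block, so the agent becomes eligible for re-assignment.
def blocking_evaluations(evaluations: list[dict]) -> list[dict]:
    """Return the evaluations that keep an agent out of the queue."""
    return [e for e in evaluations if e["status"] in ("success", "running")]

def agent_is_queueable(evaluations: list[dict]) -> bool:
    """An agent re-appears in the queue once nothing blocks it."""
    return not blocking_evaluations(evaluations)
```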

Changes

  • api/endpoints/validator.py: New detect_and_handle_stalled_evaluations() function
  • api/loops/validator_stalled_evaluation.py: New background loop (runs every 60s)
  • api/config.py: Two new env vars with sensible defaults (no .env changes required)
  • api/src/main.py: Wire up the new loop

Config

| Variable | Default | Description |
| --- | --- | --- |
| `VALIDATOR_STALLED_EVALUATION_TIMEOUT_SECONDS` | 900 (15 min) | How long before an evaluation with zero progress is considered stalled |
| `VALIDATOR_STALLED_EVALUATION_CHECK_INTERVAL_SECONDS` | 60 | How often to check for stalled evaluations |
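One plausible way `api/config.py` could read these env vars with defaults (the exact mechanism in the PR is not shown here, so treat this as a sketch):

```python
# Read the two env vars with their documented defaults, so no .env changes
# are required. The use of os.environ here is an assumption about config.py.
import os

VALIDATOR_STALLED_EVALUATION_TIMEOUT_SECONDS = int(
    os.environ.get("VALIDATOR_STALLED_EVALUATION_TIMEOUT_SECONDS", "900")
)
VALIDATOR_STALLED_EVALUATION_CHECK_INTERVAL_SECONDS = int(
    os.environ.get("VALIDATOR_STALLED_EVALUATION_CHECK_INTERVAL_SECONDS", "60")
)
```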

Test plan

  • Deploy to staging, simulate a screener with a broken inference gateway (e.g. invalid RIDGES_INFERENCE_GATEWAY_URL)
  • Verify stalled evaluation is detected and logged after 15 minutes
  • Verify the screener is disconnected and evaluation runs are marked as errored with 3xxx codes
  • Verify the agent remains in its current screening status (not transitioned to failed_screening_X)
  • Verify a second working screener picks up the agent from the queue

🤖 Generated with Claude Code

…on progress

When a screener's inference gateway is broken (e.g. returning 502 for all
requests), the screener stays alive (heartbeats continue) but no evaluation
runs ever complete. This blocks the miner's agent from being reassigned to
a working screener.

This adds a background loop that checks every 60s (configurable) for
validators that have been running an evaluation for over 15 minutes
(configurable) with zero runs progressed past the running_agent phase.
When detected, the validator is disconnected and its evaluation runs are
marked as errored, allowing the agent to be picked up by another screener.

New env vars (with defaults, no .env changes required):
- VALIDATOR_STALLED_EVALUATION_TIMEOUT_SECONDS (default: 900)
- VALIDATOR_STALLED_EVALUATION_CHECK_INTERVAL_SECONDS (default: 60)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
