
Detect stalled validators with zero evaluation progress#321

Open
thomasvangurp wants to merge 1 commit into ridgesai:main from thomasvangurp:feature/stalled-evaluation-detection

Conversation

@thomasvangurp thomasvangurp commented Mar 9, 2026

Summary

  • Adds background loop that detects validators/screeners whose inference gateway is broken (e.g. returning 502 for all requests)
  • When a screener has been running an evaluation for >15 minutes with zero runs progressed past running_agent, it is disconnected and its evaluation runs are errored out
  • The agent is automatically re-queued for another screener (no additional changes needed — the existing queue views and error handling already support this)

Problem

Screeners with broken inference gateways stay alive (heartbeats continue) but produce zero results. The running evaluation blocks the queue view, so the miner's agent cannot be picked up by a working screener. The agent is stuck until the screener's Docker containers time out (25+ minutes per problem), and even then all runs get 0%.

How re-queuing works (already built into the platform)

  1. Stalled screener is detected and disconnected via delete_validator()
  2. All unfinished runs are set to error with 3xxx (platform) error codes
  3. handle_evaluation_if_finished() sees evaluation status = failure (not success), so agent status is not changed — it stays in screening_1 or screening_2
  4. The screener_X_queue views only block on success or running evaluations — a failure evaluation does not block
  5. Next time a working screener calls request-evaluation, the agent appears in the queue and gets assigned
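The key property in step 4 can be illustrated with a small sketch of the blocking rule. The dict shape and function names here are hypothetical, standing in for the actual `screener_X_queue` view logic.

```python
# Illustrative sketch of the queue-blocking rule from step 4: only `success`
# or `running` evaluations keep an agent out of the screener queue. A
# `failure` evaluation (e.g. one errored out by the stall detector) does not
# block, so the agent becomes eligible for re-assignment.
def blocking_evaluations(evaluations: list[dict]) -> list[dict]:
    """Return the evaluations that keep an agent out of the queue."""
    return [e for e in evaluations if e["status"] in ("success", "running")]

def agent_is_queueable(evaluations: list[dict]) -> bool:
    """An agent re-appears in the queue once nothing blocks it."""
    return not blocking_evaluations(evaluations)
```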

Changes

  • api/endpoints/validator.py: New detect_and_handle_stalled_evaluations() function
  • api/loops/validator_stalled_evaluation.py: New background loop (runs every 60s)
  • api/config.py: Two new env vars with sensible defaults (no .env changes required)
  • api/src/main.py: Wire up the new loop

Config

| Variable | Default | Description |
| --- | --- | --- |
| `VALIDATOR_STALLED_EVALUATION_TIMEOUT_SECONDS` | 900 (15 min) | How long before an evaluation with zero progress is considered stalled |
| `VALIDATOR_STALLED_EVALUATION_CHECK_INTERVAL_SECONDS` | 60 | How often to check for stalled evaluations |
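One plausible way `api/config.py` could read these env vars with defaults (the exact mechanism in the PR is not shown here, so treat this as a sketch):

```python
# Read the two env vars with their documented defaults, so no .env changes
# are required. The use of os.environ here is an assumption about config.py.
import os

VALIDATOR_STALLED_EVALUATION_TIMEOUT_SECONDS = int(
    os.environ.get("VALIDATOR_STALLED_EVALUATION_TIMEOUT_SECONDS", "900")
)
VALIDATOR_STALLED_EVALUATION_CHECK_INTERVAL_SECONDS = int(
    os.environ.get("VALIDATOR_STALLED_EVALUATION_CHECK_INTERVAL_SECONDS", "60")
)
```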

Test plan

  • Deploy to staging, simulate a screener with a broken inference gateway (e.g. invalid RIDGES_INFERENCE_GATEWAY_URL)
  • Verify stalled evaluation is detected and logged after 15 minutes
  • Verify the screener is disconnected and evaluation runs are marked as errored with 3xxx codes
  • Verify the agent remains in its current screening status (not transitioned to failed_screening_X)
  • Verify a second working screener picks up the agent from the queue

🤖 Generated with Claude Code

…on progress

When a screener's inference gateway is broken (e.g. returning 502 for all
requests), the screener stays alive (heartbeats continue) but no evaluation
runs ever complete. This blocks the miner's agent from being reassigned to
a working screener.

This adds a background loop that checks every 60s (configurable) for
validators that have been running an evaluation for over 15 minutes
(configurable) with zero runs progressed past the running_agent phase.
When detected, the validator is disconnected and its evaluation runs are
marked as errored, allowing the agent to be picked up by another screener.

New env vars (with defaults, no .env changes required):
- VALIDATOR_STALLED_EVALUATION_TIMEOUT_SECONDS (default: 900)
- VALIDATOR_STALLED_EVALUATION_CHECK_INTERVAL_SECONDS (default: 60)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
