Detect stalled validators with zero evaluation progress #321
Open

thomasvangurp wants to merge 1 commit into ridgesai:main from
Conversation
…on progress

When a screener's inference gateway is broken (e.g. returning 502 for all requests), the screener stays alive (heartbeats continue) but no evaluation runs ever complete. This blocks the miner's agent from being reassigned to a working screener.

This adds a background loop that checks every 60s (configurable) for validators that have been running an evaluation for over 15 minutes (configurable) with zero runs progressed past the running_agent phase. When detected, the validator is disconnected and its evaluation runs are marked as errored, allowing the agent to be picked up by another screener.

New env vars (with defaults, no .env changes required):

- VALIDATOR_STALLED_EVALUATION_TIMEOUT_SECONDS (default: 900)
- VALIDATOR_STALLED_EVALUATION_CHECK_INTERVAL_SECONDS (default: 60)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
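The detection loop described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation: the names `fetch_running`, `disconnect_validator`, and the `runs_past_running_agent`/`started_at` fields are assumptions about the data shape, not the real schema.

```python
import asyncio
import logging
from datetime import datetime, timedelta, timezone

logger = logging.getLogger(__name__)

# Defaults mirroring the env vars described in the PR (assumed values).
STALLED_TIMEOUT_SECONDS = 900    # VALIDATOR_STALLED_EVALUATION_TIMEOUT_SECONDS
CHECK_INTERVAL_SECONDS = 60      # VALIDATOR_STALLED_EVALUATION_CHECK_INTERVAL_SECONDS


def find_stalled(evaluations, now=None):
    """Return evaluations running past the timeout with zero progressed runs.

    Each evaluation is assumed to be a dict with `started_at` (aware datetime)
    and `runs_past_running_agent` (int); the real schema may differ.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(seconds=STALLED_TIMEOUT_SECONDS)
    return [
        e for e in evaluations
        if e["started_at"] < cutoff and e["runs_past_running_agent"] == 0
    ]


async def stalled_evaluation_loop(fetch_running, disconnect_validator):
    """Background loop: periodically disconnect validators with stalled evaluations."""
    while True:
        for ev in find_stalled(await fetch_running()):
            logger.warning("Evaluation %s stalled; disconnecting validator", ev["id"])
            await disconnect_validator(ev["validator_hotkey"])
        await asyncio.sleep(CHECK_INTERVAL_SECONDS)
```

The key design point is that liveness is measured by run progress rather than heartbeats, so a screener that is "alive but useless" still gets recycled.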
Summary
Adds a background loop that detects validators whose evaluation has made zero run progress past running_agent; when detected, the validator is disconnected and its evaluation runs are errored out.

Problem
Screeners with broken inference gateways stay alive (heartbeats continue) but produce zero results. The running evaluation blocks the queue view, so the miner's agent cannot be picked up by a working screener. The agent is stuck until the screener's Docker containers time out (25+ minutes per problem), and even then all runs get 0%.

How re-queuing works (already built into the platform)
- delete_validator() marks the evaluation's runs as error with 3xxx (platform) error codes
- handle_evaluation_if_finished() sees evaluation status = failure (not success), so the agent's status is not changed; it stays in screening_1 or screening_2
- The screener_X_queue views only block on success or running evaluations; a failure evaluation does not block
- On the next request-evaluation, the agent appears in the queue and gets assigned

Changes
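The queue-eligibility rule in the steps above can be expressed as a one-line predicate. This is a hedged sketch of the behavior described, not the actual view logic; the function name and status strings are illustrative.

```python
# Per the PR: queue views only block an agent when its evaluation is
# `success` or `running`; a `failure` evaluation (as produced by
# delete_validator()) leaves the agent eligible for re-assignment.
BLOCKING_STATUSES = {"success", "running"}


def agent_is_queueable(evaluation_status: str) -> bool:
    """True if the agent should appear in a screener queue (illustrative)."""
    return evaluation_status not in BLOCKING_STATUSES
```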
- api/endpoints/validator.py: new detect_and_handle_stalled_evaluations() function
- api/loops/validator_stalled_evaluation.py: new background loop (runs every 60s)
- api/config.py: two new env vars with sensible defaults (no .env changes required)
- api/src/main.py: wire up the new loop

Config
- VALIDATOR_STALLED_EVALUATION_TIMEOUT_SECONDS
- VALIDATOR_STALLED_EVALUATION_CHECK_INTERVAL_SECONDS

Test plan
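A minimal sketch of how api/config.py might read these two env vars, assuming plain os.getenv with the defaults stated in the PR description (900s timeout, 60s check interval); the actual config module may use a different pattern.

```python
import os

# Stall timeout: how long an evaluation may run with zero progress
# past running_agent before the validator is disconnected (default 900s).
VALIDATOR_STALLED_EVALUATION_TIMEOUT_SECONDS = int(
    os.getenv("VALIDATOR_STALLED_EVALUATION_TIMEOUT_SECONDS", "900")
)

# How often the background loop checks for stalled evaluations (default 60s).
VALIDATOR_STALLED_EVALUATION_CHECK_INTERVAL_SECONDS = int(
    os.getenv("VALIDATOR_STALLED_EVALUATION_CHECK_INTERVAL_SECONDS", "60")
)
```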
- … (RIDGES_INFERENCE_GATEWAY_URL)
- … (failed_screening_X)

🤖 Generated with Claude Code