
feat: Detect platform-side inference errors#332

Open
statxc wants to merge 3 commits into ridgesai:main from statxc:feat/platform-inference-error-detection

Conversation

@statxc statxc commented Mar 12, 2026

Detect platform-side inference errors so agents aren't penalized for provider failures

Closes #331

Problem

When an AI provider goes down or returns server errors (500, 502, etc.), the agent's inference() calls return None. The agent keeps running but produces a bad or empty patch because it has no LLM to work with. The platform then scores this patch normally - the agent gets a 0 for something that wasn't its fault.

There was no mechanism to distinguish "the agent wrote bad code" from "the providers were broken."

Solution

Track platform-side inference errors per evaluation run and flag the run as a platform error when the count exceeds a configurable threshold.

Platform errors are provider failures that the agent can't control:

  • 500 Internal Server Error
  • 502 Bad Gateway
  • 503 Service Unavailable
  • 504 Gateway Timeout
  • -1 Internal provider error

Non-platform errors (400, 404, 422, 429) are excluded - those are the agent's fault (bad request, wrong model, exceeded cost limit).
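The split above amounts to a small membership check. A minimal sketch, assuming post-rename names (`PLATFORM_ERROR_CODES` and `is_platform_error` are assumptions; the actual gateway code may differ):

```python
# Provider failures the agent cannot control: 5xx responses plus the
# gateway's internal error sentinel (-1).
PLATFORM_ERROR_CODES = {500, 502, 503, 504, -1}

def is_platform_error(status_code: int) -> bool:
    """Return True only for platform-side (provider) failures.

    Client-side errors (400, 404, 422, 429) deliberately return False:
    bad requests, wrong models, and exceeded cost limits are the
    agent's fault and still count against it.
    """
    return status_code in PLATFORM_ERROR_CODES
```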

What changed

| File | What changed |
| --- | --- |
| `inference_gateway/error_hash_map.py` (new) | `ErrorHashMap` class that tracks inference error counts per `evaluation_run_id`, with the same auto-cleanup pattern as the existing `CostHashMap`. |
| `inference_gateway/config.py` | Added `MAX_INFERENCE_ERRORS_PER_EVALUATION_RUN` (defaults to 5 if not set in `.env`). Existing deployments won't break. |
| `inference_gateway/main.py` | Counts platform errors after each inference/embedding call. Blocks further requests with 503 once the threshold is hit. Extended `/api/usage` to include `inference_errors` and `max_inference_errors`. Added `logger.warning()` when errors are counted and when the threshold blocks a request. |
| `models/evaluation_run.py` | Added `PLATFORM_TOO_MANY_INFERENCE_ERRORS = 3050` in the 3xxx platform error range. |
| `validator/main.py` | After the agent finishes, queries `/api/usage` on the inference gateway with a 10s timeout. If errors exceed the limit, marks the run as a platform error (3050) instead of scoring the patch. Also wired up the `extra` field in `EvaluationRunException` handling - it was designed but never passed through. Now `agent_logs` are included when reporting platform errors. |
| `tests/test_inference_error_tracking.py` (new) | 19 tests covering `ErrorHashMap` unit behavior, platform error classification, error code validation, and integration tests against both inference and embedding gateway endpoints. |
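The per-run counter can be sketched roughly as follows. This is an assumed shape inferred from the description above, not the actual class in `inference_gateway/error_hash_map.py`; the TTL-based cleanup stands in for the `CostHashMap`-style auto-cleanup the PR mentions:

```python
import threading
import time

class ErrorHashMap:
    """Thread-safe per-run inference error counter (illustrative sketch).

    Entries idle longer than the TTL are dropped so finished
    evaluation runs don't leak memory.
    """

    def __init__(self, ttl_seconds: float = 3600.0):
        self._lock = threading.Lock()
        # evaluation_run_id -> (error_count, last_seen_monotonic)
        self._counts: dict[str, tuple[int, float]] = {}
        self._ttl = ttl_seconds

    def increment(self, evaluation_run_id: str) -> int:
        """Record one platform error and return the new count."""
        with self._lock:
            count, _ = self._counts.get(evaluation_run_id, (0, 0.0))
            count += 1
            self._counts[evaluation_run_id] = (count, time.monotonic())
            self._cleanup()
            return count

    def get(self, evaluation_run_id: str) -> int:
        """Current error count for a run (0 if unseen or expired)."""
        with self._lock:
            return self._counts.get(evaluation_run_id, (0, 0.0))[0]

    def _cleanup(self) -> None:
        # Called under the lock: drop entries idle longer than the TTL.
        now = time.monotonic()
        stale = [k for k, (_, seen) in self._counts.items()
                 if now - seen > self._ttl]
        for k in stale:
            del self._counts[k]
```

The gateway would call `increment()` after each failed inference/embedding call and refuse with 503 once the count reaches `MAX_INFERENCE_ERRORS_PER_EVALUATION_RUN`.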

How it works end-to-end

Agent calls inference() → provider returns 500 → gateway counts error → agent gets None
Agent calls inference() → provider returns 500 → gateway counts error → agent gets None
...5th error...
Agent calls inference() → gateway returns 503 (blocked) → agent finishes with bad patch
Validator checks /api/usage → sees errors >= limit → marks run as PLATFORM error (3050)
→ Agent is not scored on this run
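The validator's decision at the end of the flow reduces to a threshold check on the `/api/usage` payload. A minimal sketch of that check as a pure function, with field names taken from the PR description (the real validator first fetches the endpoint over HTTP with a 10s timeout):

```python
from typing import Optional

# Platform error code from models/evaluation_run.py (3xxx = platform range).
PLATFORM_TOO_MANY_INFERENCE_ERRORS = 3050

def classify_run(usage: dict) -> Optional[int]:
    """Decide whether a finished run should be flagged as a platform error.

    `usage` is the JSON returned by the gateway's /api/usage endpoint.
    Returns the platform error code when the error count hit the limit,
    or None when the patch should be scored normally.
    """
    errors = usage.get("inference_errors", 0)
    limit = usage.get("max_inference_errors", 5)
    if errors >= limit:
        return PLATFORM_TOO_MANY_INFERENCE_ERRORS
    return None
```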

Config

Add to your .env if you want to override the default:

MAX_INFERENCE_ERRORS_PER_EVALUATION_RUN=5
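On the Python side, the setting is presumably read with an env-var fallback, which is why existing deployments keep working without touching `.env`. A sketch of what `inference_gateway/config.py` might do (assumed, not the actual code):

```python
import os

# Falls back to 5 when the variable is absent from the environment/.env.
MAX_INFERENCE_ERRORS_PER_EVALUATION_RUN = int(
    os.getenv("MAX_INFERENCE_ERRORS_PER_EVALUATION_RUN", "5")
)
```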

Testing

python3 -m pytest tests/test_inference_error_tracking.py -v


statxc commented Mar 12, 2026

@camfairchild Could you please review the PR? I'd appreciate any feedback.

Comment on lines +121 to +122
def is_non_halting_error(status_code: int) -> bool:
return status_code in NON_HALTING_ERROR_CODES
Contributor

Would be better termed "platform error" or something.

By "halting error" I meant one that isn't caught properly and halts the process

@camfairchild
Contributor

Looks good otherwise. Thank you

@statxc
Author

statxc commented Mar 12, 2026

@camfairchild Thanks for your feedback. I updated the name to "platform error". Could you review again?

@statxc statxc requested a review from camfairchild March 12, 2026 14:12
@statxc
Author

statxc commented Mar 12, 2026

@ibraheem-abe Could you please review this PR? Any feedback is welcome.
Thanks!



Development

Successfully merging this pull request may close these issues.

[subnet] improve detection of platform-side inference errors
