fix(provider): correct model_not_found misclassification and unify re…#2141

Open
xiang33 wants to merge 1 commit into sipeed:main from xiang33:fix/error-classifier-model-not-found

xiang33 commented Mar 29, 2026

…try error handling

  • error_classifier: add modelNotFoundPatterns with highest priority in classifyByMessage(); for transient HTTP statuses (5xx), override with message-level classification when body indicates a concrete non-transient error (e.g. zhipu 503 + model_not_found → FailoverFormat, not timeout)
  • loop: replace ad-hoc string matching in retry loop with errors.As() to extract typed FailoverReason from FallbackChain, preserving string matching as backward-compat fallback for single-candidate path
  • fallback: add Unwrap() to FallbackExhaustedError so errors.As() can traverse into the underlying FailoverError

📝 Description

Fix two issues in the provider error classification and agent retry logic:

  1. model_not_found misclassified as timeout: When providers (e.g. zhipu) return HTTP 503 with model_not_found in the response body, ClassifyError() only checked the HTTP status code (503 → transient timeout) and skipped message-level classification. This caused a non-retriable configuration error to enter cooldown and trigger unnecessary retry loops.

  2. Inconsistent retry error classification: The outer retry loop in loop.go used ad-hoc string matching to classify errors, while FallbackChain already performed typed classification via ClassifyError(). This architectural inconsistency was fragile and could lead to incorrect retry decisions when error message formats change.
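
The first issue can be illustrated with a minimal sketch of status-plus-message classification. This is not the code from the PR: only the names FailoverReason, FailoverFormat, classifyByMessage and modelNotFoundPatterns come from the description above, while the `classify` helper, the `FailoverTimeout` and `FailoverUnknown` constants, and the pattern strings are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// FailoverReason mirrors the type name used in the PR description.
// The concrete values are illustrative, not copied from picoclaw.
type FailoverReason string

const (
	FailoverUnknown FailoverReason = ""        // hypothetical "no match" value
	FailoverTimeout FailoverReason = "timeout" // hypothetical name for the transient class
	FailoverFormat  FailoverReason = "format"  // non-retriable class named in the PR
)

// modelNotFoundPatterns is consulted before any other message pattern.
// The pattern strings themselves are assumptions.
var modelNotFoundPatterns = []string{"model_not_found", "model not found"}

// classifyByMessage derives a reason from the error body alone.
func classifyByMessage(msg string) FailoverReason {
	lower := strings.ToLower(msg)
	for _, p := range modelNotFoundPatterns {
		if strings.Contains(lower, p) {
			return FailoverFormat
		}
	}
	if strings.Contains(lower, "timeout") || strings.Contains(lower, "timed out") {
		return FailoverTimeout
	}
	return FailoverUnknown
}

// classify (hypothetical helper) combines HTTP status and message.
// For 5xx statuses, a concrete non-transient reason found in the body
// overrides the default transient classification.
func classify(status int, msg string) FailoverReason {
	byMsg := classifyByMessage(msg)
	if status >= 500 && status <= 599 {
		if byMsg != FailoverUnknown && byMsg != FailoverTimeout {
			return byMsg
		}
		return FailoverTimeout
	}
	return byMsg
}

func main() {
	fmt.Println(classify(503, `{"error":{"code":"model_not_found"}}`)) // format
	fmt.Println(classify(503, "service unavailable"))                  // timeout
}
```

Checking the body before falling back to the status code keeps a misconfigured model name out of the cooldown and retry path, while a plain 503 is still treated as transient.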

🗣️ Type of Change

  • 🐞 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 📖 Documentation update
  • ⚡ Code refactoring (no functional changes, no api changes)

🤖 AI Code Generation

  • 🤖 Fully AI-generated (100% AI, 0% Human)
  • 🛠️ Mostly AI-generated (AI draft, Human verified/modified)
  • 👨‍💻 Mostly Human-written (Human lead, AI assisted or none)

🔗 Related Issue

📚 Technical Context (Skip for Docs)

  • Reference URL:
  • Reasoning:
    • error_classifier.go: Added modelNotFoundPatterns with highest priority in classifyByMessage(). For transient HTTP statuses (5xx), message-level classification now overrides status-level when body indicates a concrete non-transient error (e.g. 503 + model_not_found → FailoverFormat).
    • loop.go: Replaced ad-hoc string matching in retry loop with errors.As() to extract typed FailoverReason from FallbackChain. String matching preserved as backward-compat fallback for single-candidate path.
    • fallback.go: Added Unwrap() to FallbackExhaustedError so errors.As() can traverse into the underlying FailoverError.
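
The interplay between Unwrap() and errors.As() described above can be sketched as follows. The type names FailoverError, FallbackExhaustedError and FailoverReason come from the PR text; the struct fields, the `shouldRetry` helper, the `FailoverTimeout` constant and the fallback substrings are hypothetical and only serve to make the example self-contained.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

type FailoverReason string

const (
	FailoverTimeout FailoverReason = "timeout" // hypothetical transient class
	FailoverFormat  FailoverReason = "format"  // non-retriable class named in the PR
)

// FailoverError carries the typed classification (fields are assumptions).
type FailoverError struct {
	Reason FailoverReason
	Err    error
}

func (e *FailoverError) Error() string { return fmt.Sprintf("failover (%s): %v", e.Reason, e.Err) }
func (e *FailoverError) Unwrap() error { return e.Err }

// FallbackExhaustedError is returned once every candidate has failed.
// The Last field is a hypothetical stand-in for whatever the real type stores.
type FallbackExhaustedError struct {
	Last *FailoverError
}

func (e *FallbackExhaustedError) Error() string {
	if e.Last == nil {
		return "all candidates failed"
	}
	return "all candidates failed: " + e.Last.Error()
}

// Unwrap exposes the underlying FailoverError so errors.As can reach it.
// The explicit nil check avoids returning a non-nil interface that wraps a nil pointer.
func (e *FallbackExhaustedError) Unwrap() error {
	if e.Last == nil {
		return nil
	}
	return e.Last
}

// shouldRetry (hypothetical) prefers the typed reason and only falls back
// to string matching when no FailoverError is present in the chain.
func shouldRetry(err error) bool {
	if err == nil {
		return false
	}
	var fe *FailoverError
	if errors.As(err, &fe) {
		return fe.Reason == FailoverTimeout
	}
	// Backward-compat fallback for errors without a typed reason.
	msg := strings.ToLower(err.Error())
	return strings.Contains(msg, "timeout") || strings.Contains(msg, "timed out")
}

func main() {
	typed := &FallbackExhaustedError{
		Last: &FailoverError{Reason: FailoverFormat, Err: errors.New("503: model_not_found")},
	}
	fmt.Println(shouldRetry(typed))                           // false
	fmt.Println(shouldRetry(errors.New("request timed out"))) // true
}
```

Without the Unwrap() method on FallbackExhaustedError, errors.As() would stop at the outer error and the retry decision would fall through to the string-matching branch.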

🧪 Test Environment

  • Hardware: Mac (Intel amd64)
  • OS: macOS
  • Model/Provider: zhipu/glm-5, openrouter/MiniMax-M2.5-highspeed
  • Channels: QQ

📸 Evidence (Optional)

All existing tests pass:

ok github.com/sipeed/picoclaw/pkg/providers
ok github.com/sipeed/picoclaw/pkg/providers/anthropic
ok github.com/sipeed/picoclaw/pkg/providers/anthropic_messages
ok github.com/sipeed/picoclaw/pkg/providers/azure
ok github.com/sipeed/picoclaw/pkg/providers/bedrock
ok github.com/sipeed/picoclaw/pkg/providers/common
ok github.com/sipeed/picoclaw/pkg/providers/openai_compat
ok github.com/sipeed/picoclaw/pkg/providers/openai_responses_common
ok github.com/sipeed/picoclaw/pkg/agent

☑️ Checklist

  • My code/docs follow the style of this project.
  • I have performed a self-review of my own changes.
  • I have updated the documentation accordingly.

sipeed-bot (bot) added the labels type: bug, domain: provider, domain: agent, go on Mar 29, 2026
sipeed deleted a comment on Mar 31, 2026

yinwm (Collaborator) commented Apr 1, 2026

plz fix Linter

sipeed-bot (bot) commented Apr 17, 2026

@xiang33 Hi! This PR has been inactive for over a week. If there's no update in the next 7 days, it will be closed automatically. If you're still working on it, just leave a comment to keep it open!

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
