Skip to content

feat: Add error ratio-based circuit breaking policy to api-breaker plugin#12765

Open
HaoTien wants to merge 14 commits intoapache:masterfrom
HaoTien:feat/api-breaker-error-ratio-policy
Open

feat: Add error ratio-based circuit breaking policy to api-breaker plugin#12765
HaoTien wants to merge 14 commits intoapache:masterfrom
HaoTien:feat/api-breaker-error-ratio-policy

Conversation

@HaoTien
Copy link
Copy Markdown

@HaoTien HaoTien commented Nov 21, 2025

feat: Add error ratio-based circuit breaking policy to api-breaker plugin

What this PR does / why we need it

This PR implements error ratio-based circuit breaking (unhealthy-ratio policy) for the api-breaker plugin, providing more intelligent and adaptive circuit breaking behavior based on error rates within a sliding time window, rather than just consecutive failure counts.

Closes #12763

Types of changes

  • New feature (non-breaking change which adds functionality)
  • Documentation update

Description

Current Limitations

  • The existing failure count-based approach only considers consecutive failures
  • It doesn't account for the overall error rate in relation to total requests
  • May be too sensitive during low traffic periods or not sensitive enough during high traffic periods

New Features Added

  • Error ratio-based circuit breaking: New unhealthy-ratio policy that triggers circuit breaker based on error rate within a sliding time window
  • Configurable parameters: Support for error ratio threshold, minimum request threshold, sliding window size, etc.
  • Circuit breaker states: Proper implementation of CLOSED, OPEN, and HALF_OPEN states
  • Backward compatibility: Existing configurations continue to work without changes

New Configuration Parameters

Parameter Type Default Description
policy string "unhealthy-count" Circuit breaker policy
unhealthy.error_ratio number 0.5 Error rate threshold (0-1) to trigger circuit breaker
unhealthy.min_request_threshold integer 10 Minimum requests needed before evaluating error rate
unhealthy.sliding_window_size integer 300 Sliding window size in seconds for error rate calculation
unhealthy.permitted_number_of_calls_in_half_open_state integer 3 Number of permitted calls in half-open state
healthy.success_ratio number 0.6 Success rate threshold to close circuit breaker from half-open state

Example Configuration

{
  "plugins": {
    "api-breaker": {
      "break_response_code": 503,
      "policy": "unhealthy-ratio",
      "max_breaker_sec": 60,
      "unhealthy": {
        "http_statuses": [500, 502, 503, 504],
        "error_ratio": 0.5,
        "min_request_threshold": 10,
        "sliding_window_size": 300,
        "permitted_number_of_calls_in_half_open_state": 3
      },
      "healthy": {
        "http_statuses": [200, 201, 202],
        "success_ratio": 0.6
      }
    }
  }
}

How Has This Been Tested?

  • Schema validation tests for new parameters
  • Functional tests for error ratio calculation
  • Circuit breaker state transition tests
  • Integration tests with various traffic patterns
  • Backward compatibility tests
  • Performance tests to ensure no regression

Test Results

# Run the new test file
prove -I. -r t/plugin/api-breaker2.t

# Verify existing tests still pass
prove -I. -r t/plugin/api-breaker.t

Files Modified

  • apisix/plugins/api-breaker.lua - Core plugin logic with new ratio-based policy
  • t/plugin/api-breaker2.t - New comprehensive test file for ratio-based circuit breaking
  • docs/en/latest/plugins/api-breaker.md - Updated English documentation
  • docs/zh/latest/plugins/api-breaker.md - Updated Chinese documentation

Checklist

  • My code follows the code style of this project
  • My change requires a change to the documentation
  • I have updated the documentation accordingly
  • I have read the CONTRIBUTING document
  • I have added tests to cover my changes
  • All new and existing tests passed
  • I have squashed my commits into logical units
  • My commit messages are in the proper format

Additional Notes

This implementation:

  • Maintains full backward compatibility - existing configurations work unchanged
  • Follows APISIX patterns - consistent with existing plugin architecture
  • Comprehensive testing - covers all scenarios and edge cases
  • Performance optimized - efficient sliding window implementation
  • Well documented - updated both English and Chinese docs

The feature addresses real-world use cases for:

  • High-traffic services with better error spike handling
  • Variable traffic patterns with adaptive behavior
  • Microservices architectures requiring precise circuit breaking
  • SLA-based circuit breaking with configurable error rates

Ready for review and feedback!

…ugin

- Add new 'unhealthy-ratio' policy that triggers circuit breaker based on error rate within sliding time window
- Implement three-state circuit breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED
- Add configurable parameters: error_ratio, min_request_threshold, sliding_window_size, permitted_number_of_calls_in_half_open_state, success_ratio
- Maintain full backward compatibility with existing 'unhealthy-count' policy as default
- Add comprehensive test coverage for new functionality
- Update documentation in both Chinese and English
- Follow APISIX coding standards and testing conventions

This enhancement provides more intelligent circuit breaking for microservices architectures by considering error rates rather than just consecutive failure counts.
@dosubot dosubot Bot added size:XXL This PR changes 1000+ lines, ignoring generated files. doc Documentation things enhancement New feature or request labels Nov 21, 2025
Copy link
Copy Markdown
Contributor

@Baoyuantop Baoyuantop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution! Based on the current configuration, we need to add some test cases:

  1. After the sliding window time (sliding_window_size) expires, are the statistics (total number of requests, number of failures) correctly cleared?

  2. Failure fallback in half-open state (Half-Open -> Open)

  3. Sending more requests than permitted_number_of_calls_in_half_open_state in half-open state

Comment thread apisix/plugins/api-breaker.lua Outdated
Comment thread t/plugin/api-breaker2.t Outdated
@Baoyuantop Baoyuantop added the wait for update wait for the author's response in this issue/PR label Dec 24, 2025
@Baoyuantop
Copy link
Copy Markdown
Contributor

Hi @HaoTien, please fix the lint error

Comment thread t/lib/server.lua

function _M.api_breaker()
ngx.exit(tonumber(ngx.var.arg_code))
local code = tonumber(ngx.var.arg_code) or 200
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace ngx.say with ngx.print. The reason is that the test cases strictly match the content of the response body and do not expect a newline character at the end. ngx.say will automatically add line breaks, while ngx.print will not

Comment thread t/plugin/api-breaker.t
Comment thread t/plugin/api-breaker2.t Outdated
@HaoTien
Copy link
Copy Markdown
Author

HaoTien commented Jan 14, 2026

The current merge check error has nothing to do with the code I submitted

@HaoTien
Copy link
Copy Markdown
Author

HaoTien commented Jan 19, 2026

The current merge check error has nothing to do with the code I submitted

@Baoyuantop

@Baoyuantop Baoyuantop added awaiting review and removed wait for update wait for the author's response in this issue/PR user responded labels Jan 20, 2026
Comment thread apisix/plugins/api-breaker.lua Outdated
Comment thread t/lib/server.lua
@Baoyuantop Baoyuantop requested a review from Copilot January 26, 2026 09:33
@Baoyuantop Baoyuantop added wait for update wait for the author's response in this issue/PR and removed awaiting review labels Jan 26, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Comment thread apisix/plugins/api-breaker.lua
@HaoTien HaoTien requested a review from Baoyuantop January 28, 2026 07:28
if total_requests >= minimum_calls then
local failure_rate = unhealthy_count / total_requests
-- Use precise comparison to avoid floating point issues
local rounded_failure_rate = math.floor(failure_rate * 10000 + 0.5) / 10000
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why choose 4 decimal places? This should be explained in the comments.

Comment thread t/plugin/api-breaker2.t Outdated
Comment thread docs/zh/latest/plugins/api-breaker.md Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 11 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread docs/zh/latest/plugins/api-breaker.md
Comment thread apisix/plugins/api-breaker.lua
Comment thread apisix/plugins/api-breaker.lua Outdated
Comment thread apisix/plugins/api-breaker.lua Outdated
Comment thread apisix/plugins/api-breaker.lua Outdated
Comment thread docs/zh/latest/plugins/api-breaker.md Outdated
Comment thread t/plugin/api-breaker.t
--- error_code: 400
--- response_body
{"error_msg":"failed to check the configuration of plugin api-breaker err: property \"healthy\" validation failed: property \"http_statuses\" validation failed: expected unique items but items 1 and 2 are equal"}
{"error_msg":"failed to check the configuration of plugin api-breaker err: then clause did not match"}
Copy link

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test now asserts a generic schema failure message (then clause did not match). That message is an artifact of the conditional schema and is much less specific than the previous unique-items validation error, making the test less effective at catching regressions. Prefer asserting the actual validation cause (e.g., via response_body_like matching expected unique items), or otherwise adjust the schema so duplicate healthy.http_statuses produces a stable/clear error message.

Copilot uses AI. Check for mistakes.
Comment thread apisix/plugins/api-breaker.lua Outdated
Comment thread t/plugin/api-breaker2.t Outdated
Comment thread docs/en/latest/plugins/api-breaker.md Outdated
@HaoTien HaoTien requested a review from Baoyuantop February 25, 2026 03:05
@Baoyuantop
Copy link
Copy Markdown
Contributor

Please help review @membphis

@HaoTien
Copy link
Copy Markdown
Author

HaoTien commented Mar 11, 2026

The current merge check error has nothing to do with the code I submitted @Baoyuantop

@Baoyuantop Baoyuantop added awaiting review and removed wait for update wait for the author's response in this issue/PR user responded labels Mar 11, 2026
Copy link
Copy Markdown
Member

@moonming moonming left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @HaoTien, thank you for the error ratio-based circuit breaking! This has had 21 reviews, showing strong community engagement.

Error ratio-based breaking (e.g., trip when >50% of requests fail in a window) is indeed smarter than the current consecutive-failures approach.

Since this is in the awaiting review state with extensive review history, could you:

  1. Confirm all review comments have been addressed
  2. Provide a brief summary of the final design decisions made during the 21 reviews
  3. Ensure the documentation clearly explains the new error_ratio policy alongside the existing consecutive_errors policy

This looks close to ready. Let me do a deeper code review once you confirm the above. Thank you for persisting through the extensive review process! 👏

@HaoTien HaoTien requested a review from moonming March 17, 2026 02:08
@HaoTien
Copy link
Copy Markdown
Author

HaoTien commented Mar 17, 2026

Hi @moonming , thank you for your review! Let me address your questions:

  1. Confirmation of Addressed Review Comments
    Yes, all previous review comments from @Baoyuantop and @Copilot have been addressed:

✅ Added test cases for sliding window expiration statistics reset (TEST 17)
✅ Added test cases for half-open state failure fallback (TEST 19)
✅ Added test cases for exceeding half_open_max_calls limit in half-open state (TEST 21)
✅ Fixed lint errors and code style issues
✅ Added comments explaining floating-point precision handling (4 decimal places)
✅ Removed unnecessary formatting changes
2. Summary of Final Design Decisions
Through the extensive review process, the following key design decisions were made:

Circuit Breaker States:

CLOSED → OPEN: Triggered when error rate exceeds error_ratio threshold with minimum min_request_threshold requests
OPEN → HALF_OPEN: After max_breaker_sec timeout, transitions to half-open state for testing
HALF_OPEN → CLOSED: When success rate meets success_ratio threshold
HALF_OPEN → OPEN: Immediately on any failure during half-open state
Key Implementation Decisions:

Uses sliding window (sliding_window_size) for statistics collection, which resets after expiration
Floating-point comparisons use 4 decimal places precision to avoid Lua floating-point issues
Atomic operations for state transitions to prevent race conditions
Full backward compatibility with existing unhealthy-count policy (default behavior unchanged)
3. Documentation Coverage
Both English and Chinese documentation have been updated to clearly explain both policies:

The unhealthy-count policy (default) triggers based on consecutive failure counts
The new unhealthy-ratio policy triggers based on error rate within a sliding window
Each policy has its own dedicated section with:

Configuration attributes table
State transition descriptions
Working examples
The documentation includes a clear note section explaining the differences between the two policies.

@HaoTien
Copy link
Copy Markdown
Author

HaoTien commented Apr 24, 2026

@moonming Please help review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

awaiting review doc Documentation things enhancement New feature or request size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

None yet

4 participants