out_stackdriver: fix metrics on partial failures and retries#11679
out_stackdriver: fix metrics on partial failures and retries#11679pdewilde wants to merge 1 commit intofluent:masterfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughUpdates partial-success parsing and flush handling in the Stackdriver output plugin: introduces explicit Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
c09c036 to
0e32dea
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c09c036e49
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugins/out_stackdriver/stackdriver.c`:
- Around line 3080-3126: The partial-success path (ret_partial_success == 0) is
causing double-counting because the full-batch metric
add_record_metrics(formatted_records, c->resp.status, -1) still runs under
FLB_HAVE_METRICS; change the metrics logic so the full-batch add_record_metrics
call is only executed when NOT a partial-success (ret_partial_success != 0), and
for partial-success only increment cmt_dropped_records via cmt_counter_add and
add_record_metrics for successful_records (as already computed). Locate and
update the branches around the ret_partial_success checks and calls to
add_record_metrics, cmt_counter_add, grpc_status_counts, and
parse_partial_success_response to ensure failures are counted once and
successful_records are the only success increments for partial responses.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 114ebf7d-79b1-45e0-86dc-c3e0ab6b66bc
📒 Files selected for processing (1)
plugins/out_stackdriver/stackdriver.c
25983cd to
2f03b9a
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugins/out_stackdriver/stackdriver.c`:
- Around line 3080-3090: The code treats ret_partial_success == 0 as a
successful “partial-success” match, but parse_partial_success_response()
currently returns 0 even when no WriteLogEntriesPartialErrors detail was found;
change the contract so the helper returns a distinct value when a
partial-success detail was actually matched (e.g., return 1 for matched, 0 for
no-match or an error code), then update the branch around ret_partial_success
(and the similar block at lines ~3108-3125) to only treat the response as
partial-success when the helper indicates a real match (the matched value) and
otherwise set ret_code to the appropriate FLB_* (FLB_RETRY or FLB_ERROR) so
normal 400/403 responses are not treated as FLB_OK; reference
parse_partial_success_response, ret_partial_success and the surrounding logic
that computes failed_records and calls flb_plg_warn.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: dd58ec12-2c6a-4f72-b9a6-9ecbea3f83e2
📒 Files selected for processing (1)
plugins/out_stackdriver/stackdriver.c
…le logic This patch refactors the response handling in the stackdriver output plugin to resolve incorrect metric accounting and inconsistent behavior between builds with and without Prometheus metrics enabled. 1. Defer Metric Accounting to Core Engine: Removed manual increment of cmt_dropped_records on standard 4xx errors. The core engine automatically increments drop counters when the plugin returns FLB_ERROR. Manual increments caused double counting. This aligns with how other plugins (like out_cloudwatch_logs) defer core accounting to the engine. 2. Enable Partial Success Handling for All Builds: Moved parse_partial_success_response outside of #ifdef FLB_HAVE_METRICS. Previously, partial success was only checked if metrics were enabled. Now, all builds will correctly identify partial successes and return FLB_OK to clear the chunk, preventing unnecessary retries and ensuring consistent behavior across different build configurations. 3. Added Visibility Logging: Added a warning log for partial successes that reports the exact number of failed records, providing operator visibility into silent drops. 4. Style Guide Compliance: Ensured all variables are declared at the beginning of the function scope as required by Fluent Bit style guidelines. Moved ts outside of ifdef to support its use on all builds. 5. Fixed Double Counting on Partial Success: Guarded the batch success metrics update in if (ret_code == FLB_OK) to ensure we don't double count successful records that were already accounted for in the partial success block. 6. Fixed Partial Success Parsing Bug and Safety Issue: Fixed parse_partial_success_response to only return 0 when a partial success detail is actually found, preventing accidental dropping of logs on standard 400/403 errors with other details. Also split the type check to ensure ret == 0 before accessing string pointers, avoiding potential crashes. Signed-off-by: pdewilde <pdewilde@google.com>
2f03b9a to
811c57f
Compare
This change makes a few fixes:
ctx->ins->cmt_dropped_recordsis no longer updated outside of the partial failure scenario. The plugin previously incremented this core metric when receiving a retryable 5xx error or standard 4xx errors (which would get double-counted due to core engine incrementing).Upon partial failure ret_code is now set to
FLB_OK.Previously, we would fall through to trigger
FLB_ERRORfor any 4xx. In the case of a partial failure this would double-count failures (plugin would manually increment counter for failed records, then the core engine would increment counter for all records in chunk).This behavior seems to align with what is done in other plugins that handle partial failure. I left in the logic of manually incrementing the failure counter for only the failed records. Other plugins don't do the metric updates, but it seemed like a good feature.
Log number of logs dropped on partial failure. Since we return
FLB_OKto clear the chunk, logs are otherwise silent about these drops. Added aflb_plg_warnto maintain operational visibility.failed_recordsis no longer within #ifdef, put up at top of function definition to align with style guide.Moved the partial success parsing logic outside of the
#ifdef FLB_HAVE_METRICScompile flag. This ensures all builds identify partial successes and clear the chunk consistently.Guarded the batch success metrics update at the bottom of the function to only update them if it wasn't a partial success, stopping double counting of successful records.
Tightened up
parse_partial_success_responseto only return success if it actually found a partial success detail in the response, to avoid accidentally dropping logs on standard 400/403 errors with other details. Also split the type check step there to ensure string operations are safe from garbage memory if parsing failed.Motivation:
The
out_stackdriverplugin was erroneously incrementing the instancecmt_dropped_recordsmetric when receiving a retryable 5xx error from stackdriver.Logs look like:
The retry succeeds but the
fluentbit_output_dropped_records_totalmetric already had been incremented.This violates the contract for
fluentbit_output_dropped_records_total:From my understanding, the plugin shouldn't be updating
ctx->ins->cmt_dropped_recordsas the fluentbit core should be updating the metrics on the batch based on the ret_code (FLB_OK,FLB_RETRY,FLB_ERROR). The partial failure is an edgecase, where it makes sense to increment the failure, though other plugins don't opt to do that.For transparency , I used an LLM to track down the source of the bug I had seen (failed metric incrementing during successful retries). I reviewed the suggested change manually to ensure the changes made sense to me. I am not familiar with the fluentbit codebase though, so there could be a glaring error I misssed.
Enter
[N/A]in the box, if an item is not applicable to your change.Testing
Before we can approve your change; please submit the following in a comment:
If this is a change to packaging of containers or native binaries then please confirm it works for all targets.
ok-package-testlabel to test for all targets (requires maintainer to do).Documentation
Backporting
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
Summary by CodeRabbit