Skip to content

out_stackdriver: fix metrics on partial failures and retries#11679

Open
pdewilde wants to merge 1 commit intofluent:masterfrom
pdewilde:fix/stackdriver-eager-metric-drops
Open

out_stackdriver: fix metrics on partial failures and retries#11679
pdewilde wants to merge 1 commit intofluent:masterfrom
pdewilde:fix/stackdriver-eager-metric-drops

Conversation

@pdewilde
Copy link
Copy Markdown

@pdewilde pdewilde commented Apr 6, 2026

This change makes a few fixes:

  1. ctx->ins->cmt_dropped_records is no longer updated outside of the partial failure scenario. The plugin previously incremented this core metric when receiving a retryable 5xx error or standard 4xx errors (which would get double-counted due to core engine incrementing).

  2. Upon partial failure ret_code is now set to FLB_OK.
    Previously, we would fall through to trigger FLB_ERROR for any 4xx. In the case of a partial failure this would double-count failures (plugin would manually increment counter for failed records, then the core engine would increment counter for all records in chunk).

    This behavior seems to align with what is done in other plugins that handle partial failure. I left in the logic of manually incrementing the failure counter for only the failed records. Other plugins don't do the metric updates, but it seemed like a good feature.

  3. Log number of logs dropped on partial failure. Since we return FLB_OK to clear the chunk, logs are otherwise silent about these drops. Added a flb_plg_warn to maintain operational visibility.

  4. failed_records is no longer within #ifdef, put up at top of function definition to align with style guide.

  5. Moved the partial success parsing logic outside of the #ifdef FLB_HAVE_METRICS compile flag. This ensures all builds identify partial successes and clear the chunk consistently.

  6. Guarded the batch success metrics update at the bottom of the function to only update them if it wasn't a partial success, stopping double counting of successful records.

  7. Tightened up parse_partial_success_response to only return success if it actually found a partial success detail in the response, to avoid accidentally dropping logs on standard 400/403 errors with other details. Also split the type check step there to ensure string operations are safe from garbage memory if parsing failed.

Motivation:

The out_stackdriver plugin was erroneously incrementing the instance cmt_dropped_records metric when receiving a retryable 5xx error from stackdriver.

Logs look like:

[2026/04/02 06:54:08.610] [ warn] [output:stackdriver:stackdriver.0] tag=log.systemd error sending to Cloud Logging: {
"error": {
"code": 503,
"message": "Authentication backend unavailable.",
"status": "UNAVAILABLE"
}
}
[2026/04/02 06:54:08.610] [ warn] [engine] failed to flush chunk '26829-1775112843.103700327.flb', retry in 8 seconds: task_id=7, input=router > output=stackdriver.0 (out_id=0)
[2026/04/02 06:54:16.684] [ info] [engine] flush chunk '26829-1775112843.103700327.flb' succeeded at retry 1: task_id=7, input=router > output=stackdriver.0 (out_id=0)

The retry succeeds but the fluentbit_output_dropped_records_total metric already had been incremented.

This violates the contract for fluentbit_output_dropped_records_total:

The number of log records dropped by the output. These records hit an unrecoverable error or retries expired for their chunk.

From my understanding, the plugin shouldn't be updating ctx->ins->cmt_dropped_records as the fluentbit core should be updating the metrics on the batch based on the ret_code (FLB_OK, FLB_RETRY, FLB_ERROR). The partial failure is an edgecase, where it makes sense to increment the failure, though other plugins don't opt to do that.

For transparency , I used an LLM to track down the source of the bug I had seen (failed metric incrementing during successful retries). I reviewed the suggested change manually to ensure the changes made sense to me. I am not familiar with the fluentbit codebase though, so there could be a glaring error I misssed.


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • [N/A] Example configuration file for the change
  • [N/A] Debug log output from testing the change
  • [N/A] Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • [N/A] Run local packaging test showing all targets (including any new ones) build.
  • [N/A] Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • [N/A] Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • Bug Fixes
    • Corrected partial-success handling so dropped vs processed record counts are reported accurately per batch.
    • Improved warnings and logging to better surface partial failures and when records are retried.
    • Refined metrics: dropped and successful record counters are now updated for partial successes, and per-batch metrics reflect HTTP status classification (client vs server errors) to inform retries and error reporting.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Updates partial-success parsing and flush handling in the Stackdriver output plugin: introduces explicit partial_success_found and upfront failed_records initialization, tightens @type validation, alters return semantics, reorganizes logging and metrics emission, and adjusts retry/error classification based on HTTP status and payload presence. (39 words)

Changes

Cohort / File(s) Summary
Stackdriver flush and partial-success logic
plugins/out_stackdriver/stackdriver.c
Refactor partial-success parsing and cb_stackdriver_flush: add partial_success_found tracking and upfront failed_records, tighten @type validation, change return behavior for parse function, reorganize handling so partial successes set ret_code = FLB_OK and compute dropped records from grpc_status_counts, emit warnings and update metrics only when appropriate; otherwise update per-batch metrics and classify errors/retries (4xx → FLB_ERROR, else → FLB_RETRY).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • braydonk

Poem

🐰 I hopped through logs and counted every skip,
Found tiny partials on my carrot trip,
I nudged the counters, nudged the code to say,
Which bits stayed, which bits hopped far away,
A little thump — the flush is tidy today.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main objective: fixing metrics handling on partial failures and retries in the out_stackdriver plugin.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@pdewilde pdewilde force-pushed the fix/stackdriver-eager-metric-drops branch from c09c036 to 0e32dea Compare April 6, 2026 20:18
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c09c036e49

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@plugins/out_stackdriver/stackdriver.c`:
- Around line 3080-3126: The partial-success path (ret_partial_success == 0) is
causing double-counting because the full-batch metric
add_record_metrics(formatted_records, c->resp.status, -1) still runs under
FLB_HAVE_METRICS; change the metrics logic so the full-batch add_record_metrics
call is only executed when NOT a partial-success (ret_partial_success != 0), and
for partial-success only increment cmt_dropped_records via cmt_counter_add and
add_record_metrics for successful_records (as already computed). Locate and
update the branches around the ret_partial_success checks and calls to
add_record_metrics, cmt_counter_add, grpc_status_counts, and
parse_partial_success_response to ensure failures are counted once and
successful_records are the only success increments for partial responses.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 114ebf7d-79b1-45e0-86dc-c3e0ab6b66bc

📥 Commits

Reviewing files that changed from the base of the PR and between 95af96c and c09c036.

📒 Files selected for processing (1)
  • plugins/out_stackdriver/stackdriver.c

@pdewilde pdewilde force-pushed the fix/stackdriver-eager-metric-drops branch 2 times, most recently from 25983cd to 2f03b9a Compare April 6, 2026 21:56
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@plugins/out_stackdriver/stackdriver.c`:
- Around line 3080-3090: The code treats ret_partial_success == 0 as a
successful “partial-success” match, but parse_partial_success_response()
currently returns 0 even when no WriteLogEntriesPartialErrors detail was found;
change the contract so the helper returns a distinct value when a
partial-success detail was actually matched (e.g., return 1 for matched, 0 for
no-match or an error code), then update the branch around ret_partial_success
(and the similar block at lines ~3108-3125) to only treat the response as
partial-success when the helper indicates a real match (the matched value) and
otherwise set ret_code to the appropriate FLB_* (FLB_RETRY or FLB_ERROR) so
normal 400/403 responses are not treated as FLB_OK; reference
parse_partial_success_response, ret_partial_success and the surrounding logic
that computes failed_records and calls flb_plg_warn.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dd58ec12-2c6a-4f72-b9a6-9ecbea3f83e2

📥 Commits

Reviewing files that changed from the base of the PR and between 25983cd and 2f03b9a.

📒 Files selected for processing (1)
  • plugins/out_stackdriver/stackdriver.c

…le logic

This patch refactors the response handling in the stackdriver output plugin to resolve incorrect metric accounting and inconsistent behavior between builds with and without Prometheus metrics enabled.

1. Defer Metric Accounting to Core Engine: Removed manual increment of cmt_dropped_records on standard 4xx errors. The core engine automatically increments drop counters when the plugin returns FLB_ERROR. Manual increments caused double counting. This aligns with how other plugins (like out_cloudwatch_logs) defer core accounting to the engine.

2. Enable Partial Success Handling for All Builds: Moved parse_partial_success_response outside of #ifdef FLB_HAVE_METRICS. Previously, partial success was only checked if metrics were enabled. Now, all builds will correctly identify partial successes and return FLB_OK to clear the chunk, preventing unnecessary retries and ensuring consistent behavior across different build configurations.

3. Added Visibility Logging: Added a warning log for partial successes that reports the exact number of failed records, providing operator visibility into silent drops.

4. Style Guide Compliance: Ensured all variables are declared at the beginning of the function scope as required by Fluent Bit style guidelines. Moved ts outside of ifdef to support its use on all builds.

5. Fixed Double Counting on Partial Success: Guarded the batch success metrics update in if (ret_code == FLB_OK) to ensure we don't double count successful records that were already accounted for in the partial success block.

6. Fixed Partial Success Parsing Bug and Safety Issue: Fixed parse_partial_success_response to only return 0 when a partial success detail is actually found, preventing accidental dropping of logs on standard 400/403 errors with other details. Also split the type check to ensure ret == 0 before accessing string pointers, avoiding potential crashes.

Signed-off-by: pdewilde <pdewilde@google.com>
@pdewilde pdewilde force-pushed the fix/stackdriver-eager-metric-drops branch from 2f03b9a to 811c57f Compare April 6, 2026 22:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant