
Updates to log loss article.#3267

Open
alanconway wants to merge 1 commit into openshift:master from alanconway:log-loss-2

Conversation

Contributor

@alanconway alanconway commented May 4, 2026

Description

Update the log loss article to address unresolved comments on the original PR:
#3166

/assign jcantrill
/cc Clee2691
/cc r2d2rnd

Links

fix: update log loss article to address comments.

This update is to address the unresolved comments from Pull Request #3166.

Summary by CodeRabbit

  • Documentation
    • Clarified high-volume log loss causes (including short-lived pod/job termination) and added prominent warning against using logs as app storage/transport
    • Clarified forwarder tuning (CPU/memory) and that drops/throttling affect effective write rate; switched rate terminology to bytes/sec
    • Restructured metrics and PromQL guidance with worst-case sizing metric, expanded rotation/audit log notes, recovery/backlog considerations, and a “Bad alternatives” rationale

@openshift-ci openshift-ci Bot requested review from Clee2691 and r2d2rnd May 4, 2026 16:33

coderabbitai Bot commented May 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6a204a75-78de-48ac-9304-37cda8e61971

📥 Commits

Reviewing files that changed from the base of the PR and between d969ade and 08e153e.

📒 Files selected for processing (1)
  • docs/administration/high-volume-log-loss.adoc
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/administration/high-volume-log-loss.adoc

Walkthrough

Refines the high-volume log-loss documentation: clarifies unread-container-log loss (including short-lived pods), switches operational metrics from events/sec to bytes/sec, expands rotation/other-log controls and limitations, restructures metrics guidance, warns about rotation backlog I/O, and replaces “alternatives” with a “Bad alternatives” section plus updated checklist.

Changes

High-volume log-loss guide

Layer — Summary (all rows concern docs/administration/high-volume-log-loss.adoc)

  • Core Concepts / Reader-facing clarifications — Rewords forwarder performance guidance to explicitly reference CPU and memory; adds loss mode where short-lived pods/jobs can delete logs before the collector reads them; updates container-log scope notes.
  • Rotation details & assumptions — Adds NOTE that CRI‑O may compress rotated logs (.gz) which collectors cannot read; states disk-sizing assumes uncompressed rotated logs.
  • Operational units & forwarder effects — Changes writeRate/sendRate definitions from logs/sec to bytes/sec; explains ClusterLogForwarder drop/filter rules reduce effective write rate and collector CPU/memory throttling can slow read/send independent of remote-store capacity.
  • Metrics restructuring & limitations — Restructures metrics guidance into NOTE/TIP blocks with byte/event semantics; replaces prior limitations text with an explicit list: log_logged_bytes_total covers only container logs under /var/log/pods and excludes node/journald, Linux audit, and Kubernetes API audit logs; notes write-vs-send discrepancies.
  • PromQL & node sizing guidance — Adds MaxNodeWriteRateBytes PromQL for busiest-node worst-case sizing; documents how node journal/audit log forwarding affects write/send comparisons depending on forwarder configuration.
  • Other log types and rotation controls — Expands “Other types of logs” with concrete rotation/configuration pointers for journald, Linux audit, and Kubernetes API audit; adds NOTE on verbose Kubernetes API audit logs and ClusterLogForwarder audit filtering.
  • Rotation caution & capacity planning — Adds CAUTION that very large rotation settings increase backlog read I/O after outages (affecting latency-sensitive workloads); changes sizing guidance to use busiest-node write-rate assumptions; updates example and recovery/backlog considerations and node partition layout assumptions.
  • Bad alternatives & summary checklist — Replaces “Alternative (non)-solutions” with a “Bad alternatives” section explaining why large forwarder buffers and PV-backed buffers are problematic (duplication, design mismatch, PV reliability/performance); updates the final checklist to reflect rotation-size and storage planning steps and adds a WARNING against using logs as application storage/transport.
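
The busiest-node query mentioned above might look roughly like the following sketch. This is an illustration only: the metric name log_logged_bytes_total appears in the walkthrough, but the label name and rate window here are assumptions, not taken from the .adoc itself.

```promql
# Hypothetical sketch of MaxNodeWriteRateBytes: per-node container-log
# write rate in bytes/sec, then the worst case across nodes.
# The `hostname` label and 5m window are assumed for illustration.
max(
  sum by (hostname) (rate(log_logged_bytes_total[5m]))
)
```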

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 inconclusive)

Title check: ❓ Inconclusive
  • Explanation: The title 'Updates to log loss article' is vague and generic, using non-descriptive phrasing that doesn't convey the specific nature or scope of the documentation updates.
  • Resolution: Replace with a more specific title that highlights key changes, such as 'Clarify log loss mechanisms and forwarder tuning guidance' or similar.
✅ Passed checks (11 passed)
Check name Status Explanation
Description check ✅ Passed The description addresses unresolved comments from a prior PR and includes proper reviewer assignments and links, but lacks detailed context about what issues were resolved or why the changes matter.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed Examined 1,016 Ginkgo test definitions in this PR. Found no dynamic content (fmt.Sprintf, variable concatenation, generated IDs). All test names are static and deterministic.
Test Structure And Quality ✅ Passed Custom check for Ginkgo test code quality is not applicable to this PR. This is a documentation-only change (docs/administration/high-volume-log-loss.adoc). No test files were modified.
Microshift Test Compatibility ✅ Passed PR is documentation-only; no new Ginkgo e2e tests added. MicroShift compatibility check is not applicable to documentation changes.
Single Node Openshift (Sno) Test Compatibility ✅ Passed This PR only updates documentation (AsciiDoc file). No new Ginkgo e2e tests are added. The custom check is not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only documentation (docs/administration/high-volume-log-loss.adoc). Custom check targets deployment manifests, operator code, and controllers—none of which are present in this change.
Ote Binary Stdout Contract ✅ Passed PR is documentation-only (modifies .adoc file). OTE Binary Stdout Contract check targets executable code and test setup code. No binaries, no code-level changes, no violations possible.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR is documentation-only (high-volume-log-loss.adoc). No Ginkgo e2e tests added or modified. Check does not apply.


Comment @coderabbitai help to get the list of available commands and usage tips.

@qodo-code-review

Review Summary by Qodo

Comprehensive updates to high-volume log loss documentation

📝 Documentation


Walkthroughs

Description
• Clarifies log loss concepts and adds missing technical details
• Improves metrics documentation with better explanations and new queries
• Expands recommendations section with practical guidance on CPU/memory tuning
• Reorganizes and corrects information about other log types (journald, audit)
• Adds warnings about disk I/O impact and per-node variation in capacity planning
• Restructures "bad alternatives" section with clearer explanations of buffer limitations
Diagram
flowchart LR
  A["Log Loss Article"] -->|Clarifies concepts| B["Overview & Rotation"]
  A -->|Expands metrics| C["Metrics Documentation"]
  A -->|Adds guidance| D["Recommendations"]
  A -->|Improves explanations| E["Other Log Types"]
  A -->|Restructures| F["Bad Alternatives"]
  D -->|New section| G["Check Forwarder CPU/Memory"]
  F -->|Better clarity| H["Buffer Limitations"]


File Changes

1. docs/administration/high-volume-log-loss.adoc 📝 Documentation +146/-70

Comprehensive technical clarifications and expanded guidance

• Clarifies log loss mechanisms including short-lived pod scenarios and CRI-O compression effects
• Updates metrics section with improved log_logged_bytes_total explanation and adds MaxNodeWriteRateBytes query
• Expands "Check forwarder CPU and Memory" section with detailed troubleshooting guidance
• Reorganizes "Other types of logs" section with specific configuration details for journald, audit, and Kubernetes API logs
• Adds multiple NOTE and WARNING blocks highlighting per-node variation, disk I/O concerns, and buffer design mismatches
• Restructures "Alternative (non)-solutions" to "Bad alternatives" with clearer explanations of why buffers are insufficient
• Corrects terminology (e.g., "logs per second" to "bytes per second", "journal" to "journald")
• Improves summary with actionable steps and updated console navigation path





qodo-code-review Bot commented May 4, 2026

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0)



Action required

1. Conflicting disk sizing guidance 🐞 Bug ≡ Correctness
Description
The new WARNING advises sizing rotation parameters based on the busiest nodes, but the later disk-sizing section still states sizing is based on cluster-wide average write rates. This contradiction can cause under-provisioned per-node /var/log space and continued log loss if readers follow the cluster-average guidance.
Code

docs/administration/high-volume-log-loss.adoc[R229-236]

+[WARNING]
+====
+Cluster-wide averages can hide per-node variation.
+In practice, a small number of nodes often produce most of the log volume.
+Always size rotation parameters based on the _busiest nodes_, not cluster averages.
+
+Use `MaxNodeWriteRateBytes` (see <<Using metrics to measure log activity>>) to identify the worst-case node.
+====
Evidence
The document now explicitly instructs readers to size for busiest nodes, but later claims disk sizing is based on cluster-wide averages; the worked example also uses the busiest-node write rate. These statements cannot all be true at the same time without clarifying whether the calculations are per-node vs cluster-total, and what each variable represents.

docs/administration/high-volume-log-loss.adoc[229-236]
docs/administration/high-volume-log-loss.adoc[269-275]
docs/administration/high-volume-log-loss.adoc[331-342]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The doc simultaneously (a) recommends sizing based on the busiest nodes and (b) states disk sizing is based on cluster-wide average write rates, while the example uses busiest-node rates. This creates conflicting instructions that can lead to incorrect capacity planning.

### Issue Context
This article is used to size node log rotation/disk usage to prevent log loss. The sizing section should be internally consistent about whether variables and formulas are **per-node** (node /var/log partition sizing) or **cluster-total** (sum across all nodes).

### Fix Focus Areas
- docs/administration/high-volume-log-loss.adoc[229-236]
- docs/administration/high-volume-log-loss.adoc[267-296]
- docs/administration/high-volume-log-loss.adoc[331-342]

### Suggested direction
- Either update the “Estimate total disk requirements” narrative/variables to be explicitly **per-node** (e.g., use busiest-node `MaxNodeWriteRateBytes` / “NodeWriteRateBytes”) and remove/adjust the “cluster-wide average” statement; **or**
- Keep cluster-wide calculations but add a clearly labeled, separate **per-node** sizing subsection (recommended, since node partitions are what actually fill), and make the example match the subsection it’s demonstrating.
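
To make the per-node reading concrete, here is a minimal sketch of the arithmetic the suggested direction implies. The function name and the safety factor are hypothetical; only the "size for the busiest node, not the cluster average" idea comes from the review above.

```python
# Hypothetical per-node sizing sketch: size each node's /var/log partition
# for the busiest node's write rate, not the cluster average, so rotation
# can absorb a forwarder outage without dropping logs.

def required_log_space_bytes(max_node_write_rate_bytes: float,
                             outage_seconds: float,
                             safety_factor: float = 2.0) -> float:
    """Worst-case bytes written on the busiest node during an outage,
    padded by an (assumed) safety factor for bursts and rotation overhead."""
    return max_node_write_rate_bytes * outage_seconds * safety_factor

# Example: busiest node writes 5 MiB/s, tolerate a 1-hour outage.
need = required_log_space_bytes(5 * 1024**2, 3600)
print(f"{need / 1024**3:.2f} GiB")  # 35.16 GiB
```

A cluster-total version of the same formula would sum rates across nodes, but as the review notes, node partitions are what actually fill, so the per-node form is the one that prevents loss.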

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Contributor

openshift-ci Bot commented May 4, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alanconway

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci Bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) May 4, 2026
Comment thread docs/administration/high-volume-log-loss.adoc
Contributor

@jcantrill jcantrill left a comment


couple questions but overall lgtm

Comment thread docs/administration/high-volume-log-loss.adoc Outdated
Comment thread docs/administration/high-volume-log-loss.adoc
Comment thread docs/administration/high-volume-log-loss.adoc
Comment thread docs/administration/high-volume-log-loss.adoc Outdated
This update is to address the unresolved comments from Pull Request openshift#3166.
@alanconway
Contributor Author

@jcantrill fixed your questions

Contributor

openshift-ci Bot commented May 13, 2026

@alanconway: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. release/6.5

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants