
Updates to log loss article.#3267

Open
alanconway wants to merge 1 commit into openshift:master from alanconway:log-loss-2

Conversation

Contributor

@alanconway alanconway commented May 4, 2026

Description

Update the log loss article to address unresolved comments on the original PR:
#3166

/assign jcantrill
/cc Clee2691
/cc r2d2rnd

Links

fix: update log loss article to address comments.

This update is to address the unresolved comments from Pull Request #3166.

Summary by CodeRabbit

  • Documentation
    • Clarified high-volume log loss causes (including short-lived pod/job termination) and added prominent warning against using logs as app storage/transport
    • Clarified forwarder tuning (CPU/memory) and that drops/throttling affect effective write rate; switched rate terminology to bytes/sec
    • Restructured metrics and PromQL guidance with worst-case sizing metric, expanded rotation/audit log notes, recovery/backlog considerations, and a “Bad alternatives” rationale

@openshift-ci openshift-ci Bot requested review from Clee2691 and r2d2rnd May 4, 2026 16:33

coderabbitai Bot commented May 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6a204a75-78de-48ac-9304-37cda8e61971

📥 Commits

Reviewing files that changed from the base of the PR and between d969ade and 08e153e.

📒 Files selected for processing (1)
  • docs/administration/high-volume-log-loss.adoc
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/administration/high-volume-log-loss.adoc

Walkthrough

Refines the high-volume log-loss documentation: clarifies unread-container-log loss (including short-lived pods), switches operational metrics from events/sec to bytes/sec, expands rotation/other-log controls and limitations, restructures metrics guidance, warns about rotation backlog I/O, and replaces “alternatives” with a “Bad alternatives” section plus updated checklist.

Changes

High-volume log-loss guide

Layer — Summary (all rows concern docs/administration/high-volume-log-loss.adoc)

  • Core Concepts / Reader-facing clarifications — Rewords forwarder performance guidance to explicitly reference CPU and memory; adds loss mode where short-lived pods/jobs can delete logs before the collector reads them; updates container-log scope notes.
  • Rotation details & assumptions — Adds NOTE that CRI‑O may compress rotated logs (.gz) which collectors cannot read; states disk-sizing assumes uncompressed rotated logs.
  • Operational units & forwarder effects — Changes writeRate/sendRate definitions from logs/sec to bytes/sec; explains ClusterLogForwarder drop/filter rules reduce effective write rate and collector CPU/memory throttling can slow read/send independent of remote-store capacity.
  • Metrics restructuring & limitations — Restructures metrics guidance into NOTE/TIP blocks with byte/event semantics; replaces prior limitations text with an explicit list: log_logged_bytes_total covers only container logs under /var/log/pods and excludes node/journald, Linux audit, and Kubernetes API audit logs; notes write-vs-send discrepancies.
  • PromQL & node sizing guidance — Adds MaxNodeWriteRateBytes PromQL for busiest-node worst-case sizing; documents how node journal/audit log forwarding affects write/send comparisons depending on forwarder configuration.
  • Other log types and rotation controls — Expands “Other types of logs” with concrete rotation/configuration pointers for journald, Linux audit, and Kubernetes API audit; adds NOTE on verbose Kubernetes API audit logs and ClusterLogForwarder audit filtering.
  • Rotation caution & capacity planning — Adds CAUTION that very large rotation settings increase backlog read I/O after outages (affecting latency-sensitive workloads); changes sizing guidance to use busiest-node write-rate assumptions; updates example and recovery/backlog considerations and node partition layout assumptions.
  • Bad alternatives & summary checklist — Replaces “Alternative (non)-solutions” with a “Bad alternatives” section explaining why large forwarder buffers and PV-backed buffers are problematic (duplication, design mismatch, PV reliability/performance); updates the final checklist to reflect rotation-size and storage planning steps and adds a WARNING against using logs as application storage/transport.
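
The busiest-node query mentioned above might look roughly like the following sketch. This is an illustration only: the metric name log_logged_bytes_total appears in the walkthrough, but the label name and rate window here are assumptions, not taken from the .adoc itself.

```promql
# Hypothetical sketch of MaxNodeWriteRateBytes: per-node container-log
# write rate in bytes/sec, then the worst case across nodes.
# The `hostname` label and 5m window are assumed for illustration.
max(
  sum by (hostname) (rate(log_logged_bytes_total[5m]))
)
```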

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 inconclusive)

Title check: ❓ Inconclusive
  • Explanation: The title 'Updates to log loss article' is vague and generic, using non-descriptive phrasing that doesn't convey the specific nature or scope of the documentation updates.
  • Resolution: Replace with a more specific title that highlights key changes, such as 'Clarify log loss mechanisms and forwarder tuning guidance' or similar.
✅ Passed checks (11 passed)
Check name Status Explanation
Description check ✅ Passed The description addresses unresolved comments from a prior PR and includes proper reviewer assignments and links, but lacks detailed context about what issues were resolved or why the changes matter.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed Examined 1,016 Ginkgo test definitions in this PR. Found no dynamic content (fmt.Sprintf, variable concatenation, generated IDs). All test names are static and deterministic.
Test Structure And Quality ✅ Passed Custom check for Ginkgo test code quality is not applicable to this PR. This is a documentation-only change (docs/administration/high-volume-log-loss.adoc). No test files were modified.
Microshift Test Compatibility ✅ Passed PR is documentation-only; no new Ginkgo e2e tests added. MicroShift compatibility check is not applicable to documentation changes.
Single Node Openshift (Sno) Test Compatibility ✅ Passed This PR only updates documentation (AsciiDoc file). No new Ginkgo e2e tests are added. The custom check is not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only documentation (docs/administration/high-volume-log-loss.adoc). Custom check targets deployment manifests, operator code, and controllers—none of which are present in this change.
Ote Binary Stdout Contract ✅ Passed PR is documentation-only (modifies .adoc file). OTE Binary Stdout Contract check targets executable code and test setup code. No binaries, no code-level changes, no violations possible.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR is documentation-only (high-volume-log-loss.adoc). No Ginkgo e2e tests added or modified. Check does not apply.


Comment @coderabbitai help to get the list of available commands and usage tips.

@qodo-code-review

Review Summary by Qodo

Comprehensive updates to high-volume log loss documentation

📝 Documentation


Walkthroughs

Description
• Clarifies log loss concepts and adds missing technical details
• Improves metrics documentation with better explanations and new queries
• Expands recommendations section with practical guidance on CPU/memory tuning
• Reorganizes and corrects information about other log types (journald, audit)
• Adds warnings about disk I/O impact and per-node variation in capacity planning
• Restructures "bad alternatives" section with clearer explanations of buffer limitations
Diagram
flowchart LR
  A["Log Loss Article"] -->|Clarifies concepts| B["Overview & Rotation"]
  A -->|Expands metrics| C["Metrics Documentation"]
  A -->|Adds guidance| D["Recommendations"]
  A -->|Improves explanations| E["Other Log Types"]
  A -->|Restructures| F["Bad Alternatives"]
  D -->|New section| G["Check Forwarder CPU/Memory"]
  F -->|Better clarity| H["Buffer Limitations"]


File Changes

1. docs/administration/high-volume-log-loss.adoc 📝 Documentation +146/-70

Comprehensive technical clarifications and expanded guidance

• Clarifies log loss mechanisms including short-lived pod scenarios and CRI-O compression effects
• Updates metrics section with improved log_logged_bytes_total explanation and adds MaxNodeWriteRateBytes query
• Expands "Check forwarder CPU and Memory" section with detailed troubleshooting guidance
• Reorganizes "Other types of logs" section with specific configuration details for journald, audit, and Kubernetes API logs
• Adds multiple NOTE and WARNING blocks highlighting per-node variation, disk I/O concerns, and buffer design mismatches
• Restructures "Alternative (non)-solutions" to "Bad alternatives" with clearer explanations of why buffers are insufficient
• Corrects terminology (e.g., "logs per second" to "bytes per second", "journal" to "journald")
• Improves summary with actionable steps and updated console navigation path





qodo-code-review Bot commented May 4, 2026

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0)



Action required

1. Conflicting disk sizing guidance 🐞 Bug ≡ Correctness
Description
The new WARNING advises sizing rotation parameters based on the busiest nodes, but the later disk-sizing section still states sizing is based on cluster-wide average write rates. This contradiction can cause under-provisioned per-node /var/log space and continued log loss if readers follow the cluster-average guidance.
Code

docs/administration/high-volume-log-loss.adoc[R229-236]

+[WARNING]
+====
+Cluster-wide averages can hide per-node variation.
+In practice, a small number of nodes often produce most of the log volume.
+Always size rotation parameters based on the _busiest nodes_, not cluster averages.
+
+Use `MaxNodeWriteRateBytes` (see <<Using metrics to measure log activity>>) to identify the worst-case node.
+====
Evidence
The document now explicitly instructs readers to size for busiest nodes, but later claims disk sizing is based on cluster-wide averages; the worked example also uses the busiest-node write rate. These statements cannot all be true at the same time without clarifying whether the calculations are per-node vs cluster-total, and what each variable represents.

docs/administration/high-volume-log-loss.adoc[229-236]
docs/administration/high-volume-log-loss.adoc[269-275]
docs/administration/high-volume-log-loss.adoc[331-342]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The doc simultaneously (a) recommends sizing based on the busiest nodes and (b) states disk sizing is based on cluster-wide average write rates, while the example uses busiest-node rates. This creates conflicting instructions that can lead to incorrect capacity planning.

### Issue Context
This article is used to size node log rotation/disk usage to prevent log loss. The sizing section should be internally consistent about whether variables and formulas are **per-node** (node /var/log partition sizing) or **cluster-total** (sum across all nodes).

### Fix Focus Areas
- docs/administration/high-volume-log-loss.adoc[229-236]
- docs/administration/high-volume-log-loss.adoc[267-296]
- docs/administration/high-volume-log-loss.adoc[331-342]

### Suggested direction
- Either update the “Estimate total disk requirements” narrative/variables to be explicitly **per-node** (e.g., use busiest-node `MaxNodeWriteRateBytes` / “NodeWriteRateBytes”) and remove/adjust the “cluster-wide average” statement; **or**
- Keep cluster-wide calculations but add a clearly labeled, separate **per-node** sizing subsection (recommended, since node partitions are what actually fill), and make the example match the subsection it’s demonstrating.
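
To make the per-node reading concrete, here is a minimal sketch of the arithmetic the suggested direction implies. The function name and the safety factor are hypothetical; only the "size for the busiest node, not the cluster average" idea comes from the review above.

```python
# Hypothetical per-node sizing sketch: size each node's /var/log partition
# for the busiest node's write rate, not the cluster average, so rotation
# can absorb a forwarder outage without dropping logs.

def required_log_space_bytes(max_node_write_rate_bytes: float,
                             outage_seconds: float,
                             safety_factor: float = 2.0) -> float:
    """Worst-case bytes written on the busiest node during an outage,
    padded by an (assumed) safety factor for bursts and rotation overhead."""
    return max_node_write_rate_bytes * outage_seconds * safety_factor

# Example: busiest node writes 5 MiB/s, tolerate a 1-hour outage.
need = required_log_space_bytes(5 * 1024**2, 3600)
print(f"{need / 1024**3:.2f} GiB")  # 35.16 GiB
```

A cluster-total version of the same formula would sum rates across nodes, but as the review notes, node partitions are what actually fill, so the per-node form is the one that prevents loss.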

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Contributor

openshift-ci Bot commented May 4, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alanconway

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci Bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) May 4, 2026
Comment thread docs/administration/high-volume-log-loss.adoc
Contributor

@jcantrill jcantrill left a comment


couple questions but overall lgtm

Comment thread docs/administration/high-volume-log-loss.adoc Outdated
Comment thread docs/administration/high-volume-log-loss.adoc
Comment thread docs/administration/high-volume-log-loss.adoc
Comment thread docs/administration/high-volume-log-loss.adoc Outdated
This update is to address the unresolved comments from Pull Request openshift#3166.
@alanconway
Contributor Author

@jcantrill fixed your questions

Contributor

openshift-ci Bot commented May 13, 2026

@alanconway: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. release/6.5

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants