@@ -250,19 +250,18 @@ async def query_and_validate(self, prom_query) -> Any:
async def get_cluster_summary(self) -> Dict[str, Any]:
cluster_label = self.get_prometheus_cluster_label()

# use this for queries with no labels. turn ', cluster="xxx"' to 'cluster="xxx"'
single_cluster_label = cluster_label.replace(",", "")
         memory_query = f"""
-            sum(max by (instance) (machine_memory_bytes{{ {single_cluster_label} }}))
+            sum(kube_node_status_capacity{{ resource="memory" {cluster_label} }})
         """
 
         cpu_query = f"""
-            sum(max by (instance) (machine_cpu_cores{{ {single_cluster_label} }}))
+            sum(kube_node_status_capacity{{ resource="cpu" {cluster_label} }})
         """
Comment on lines 253 to 259
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
PROM="${PROMETHEUS_URL:?Set PROMETHEUS_URL (e.g. https://prom.example.com)}"

queries=(
  'sum(kube_node_status_capacity{resource="memory"})'
  'sum(max by (node) (kube_node_status_capacity{resource="memory"}))'
  'sum(kube_node_status_capacity{resource="cpu"})'
  'sum(max by (node) (kube_node_status_capacity{resource="cpu"}))'
)

for q in "${queries[@]}"; do
  echo "QUERY: $q"
  curl -sG "$PROM/api/v1/query" --data-urlencode "query=$q" | jq -r '.data.result[0].value // "no result"'
  echo
done

Repository: robusta-dev/krr

Length of output: 144


🏁 Script executed:

# Read the file at the specified lines
cat -n robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py | sed -n '240,270p'

Repository: robusta-dev/krr

Length of output: 1751


🏁 Script executed:

# Check the get_prometheus_cluster_label function referenced in the scratchpad
grep -n "get_prometheus_cluster_label" robusta_krr/core/integrations/prometheus/metrics/base.py | head -5

Repository: robusta-dev/krr

Length of output: 112


🏁 Script executed:

# Find where these queries are used
rg "memory_query|cpu_query" --context 3 -A 2 -B 2 robusta_krr/core/integrations/prometheus/metrics_service/

Repository: robusta-dev/krr

Length of output: 2031


🏁 Script executed:

# Check the implementation of get_prometheus_cluster_label
cat -n robusta_krr/core/integrations/prometheus/metrics/base.py | sed -n '75,95p'

Repository: robusta-dev/krr

Length of output: 979


🏁 Script executed:

# Look for the class definition and understand the context
grep -n "class.*PrometheusMetricsService" robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py

Repository: robusta-dev/krr

Length of output: 109


🏁 Script executed:

# Search for kube_node_status_capacity usage in the codebase to understand metric structure
rg "kube_node_status_capacity" robusta_krr/

Repository: robusta-dev/krr

Length of output: 393


🏁 Script executed:

# Check if there's any documentation or references to the metric labels
rg "node.*resource.*capacity" robusta_krr/ -A 2 -B 2

Repository: robusta-dev/krr

Length of output: 41


🏁 Script executed:

# Look for any test data or example Prometheus queries
find robusta_krr -name "*.py" -type f | xargs grep -l "kube_state_metrics\|kube-state-metrics" | head -5

Repository: robusta-dev/krr

Length of output: 41


🏁 Script executed:

# Search for how this metric is documented or used in comments
rg "kube_node_status_capacity" -B 5 -A 5 robusta_krr/

Repository: robusta-dev/krr

Length of output: 1965


🏁 Script executed:

# Check the structure of the queries - look for any other aggregations in similar contexts
rg "kube_pod_container_resource_requests|kube_node" robusta_krr/core/integrations/prometheus/metrics_service/ -A 3 -B 1 | head -40

Repository: robusta-dev/krr

Length of output: 2096


🏁 Script executed:

# Look at the similar query patterns for kube_system to understand the aggregation style
cat -n robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py | sed -n '260,265p'

Repository: robusta-dev/krr

Length of output: 501


🏁 Script executed:

# Check if there are any tests or documentation about this query
rg "get_cluster_summary" robusta_krr/ -A 10 -B 2 | head -50

Repository: robusta-dev/krr

Length of output: 4489


🏁 Script executed:

# Research kube-state-metrics documentation about kube_node_status_capacity metric
# Check if there are any comments or references in the codebase about metric labels
rg "node.*label\|resource.*label" robusta_krr/ | head -20

Repository: robusta-dev/krr

Length of output: 41


🏁 Script executed:

# Check the kube-state-metrics version or any vendor-specific documentation
find robusta_krr -name "*.md" -o -name "*.txt" | xargs grep -l "kube.state\|prometheus" 2>/dev/null | head -5

Repository: robusta-dev/krr

Length of output: 41


🌐 Web query:

kube_node_status_capacity kube-state-metrics metric labels

💡 Result:

The kube_node_status_capacity metric from kube-state-metrics is a Gauge that represents the total amount of resources available (capacity) for a Kubernetes node.

Labels:
- node: The node address/name (e.g., node=)
- resource: The name of the resource (e.g., cpu, memory, ephemeral_storage, pods, attachable_volumes_, hugepages_)

Example values:
- cpu=
- memory=
- ephemeral_storage=
- pods=

Status: STABLE

This metric is sourced from the Kubernetes Node .status.capacity field.


🏁 Script executed:

# Check if there are any tests or example configurations
find robusta_krr -name "*test*" -type f -path "*prometheus*" | xargs grep -l "kube_node_status" 2>/dev/null | head -3

Repository: robusta-dev/krr

Length of output: 41


🏁 Script executed:

# Look for any existing aggregation patterns for node metrics in the codebase
rg "by \(node\)" robusta_krr/ -B 2 -A 2

Repository: robusta-dev/krr

Length of output: 41


🏁 Script executed:

# Verify there are no other similar problematic sum() aggregations for node metrics
rg "sum\(kube_node" robusta_krr/

Repository: robusta-dev/krr

Length of output: 393


🏁 Script executed:

# Double-check the cluster_label format one more time
cat -n robusta_krr/core/integrations/prometheus/metrics/base.py | sed -n '79,88p'

Repository: robusta-dev/krr

Length of output: 523


Deduplicate node capacity series before summing.

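sum(kube_node_status_capacity...) adds every matching series, so a node's capacity is counted once per series and the total is inflated whenever the same node is exported by more than one scrape target or job (e.g., HA kube-state-metrics). Taking max by (node) first collapses those duplicates to one series per node; the outer sum then adds each node exactly once.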
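To make the double-counting concrete, here is a minimal sketch in plain Python that mimics the two aggregations on hand-written samples. The node names, the `instance` label values, and the capacities are hypothetical and only illustrate the arithmetic; no Prometheus server is involved.

```python
from collections import defaultdict

GIB = 1024 ** 3

# Hypothetical samples as (labels, value) pairs. "node-a" appears twice
# because two kube-state-metrics replicas are assumed to be scraped.
samples = [
    ({"node": "node-a", "resource": "memory", "instance": "ksm-0"}, 16 * GIB),
    ({"node": "node-a", "resource": "memory", "instance": "ksm-1"}, 16 * GIB),
    ({"node": "node-b", "resource": "memory", "instance": "ksm-0"}, 32 * GIB),
]


def naive_sum(series):
    """Mimics sum(metric): adds every series, duplicates included."""
    return sum(value for _, value in series)


def deduplicated_sum(series):
    """Mimics sum(max by (node) (metric)): one value per node, then sum."""
    per_node = defaultdict(int)
    for labels, value in series:
        per_node[labels["node"]] = max(per_node[labels["node"]], value)
    return sum(per_node.values())


print(naive_sum(samples) // GIB)         # 64
print(deduplicated_sum(samples) // GIB)  # 48
```

With these samples the naive total reports 64 GiB for a cluster that actually has 48 GiB, which is the kind of inflation the review comment describes.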

🔧 Proposed fix
         memory_query = f"""
-            sum(kube_node_status_capacity{{ resource="memory" {cluster_label} }})
+            sum(max by (node) (kube_node_status_capacity{{ resource="memory" {cluster_label} }}))
         """
 
         cpu_query = f"""
-            sum(kube_node_status_capacity{{ resource="cpu" {cluster_label} }})
+            sum(max by (node) (kube_node_status_capacity{{ resource="cpu" {cluster_label} }}))
         """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py`
around lines 253 - 259, The memory_query and cpu_query use
sum(kube_node_status_capacity{...}), which double-counts nodes when the metric is
scraped from multiple targets; change both queries to deduplicate by node first
and then sum the per-node values, e.g. use sum(max by (node)
(kube_node_status_capacity{{ resource="memory" {cluster_label} }})) and the
analogous expression for cpu, keeping the existing cluster_label
interpolation, and update the memory_query and cpu_query strings in
prometheus_metrics_service.py.
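One way to keep the two capacity queries consistent is to build them from a single helper. The sketch below is illustrative only: `build_capacity_query` is a hypothetical name that does not exist in krr, and it assumes `cluster_label` is either an empty string or a fragment with a leading comma such as `, cluster="xxx"`, as the code comment in the diff describes.

```python
def build_capacity_query(resource: str, cluster_label: str = "") -> str:
    """Hypothetical helper: PromQL for deduplicated cluster capacity.

    Assumes cluster_label is "" or starts with a comma, e.g. ', cluster="prod"',
    so it can be appended directly after the resource matcher.
    """
    selector = f'resource="{resource}"{cluster_label}'
    # max by (node) keeps one series per node; the outer sum adds the nodes.
    return f"sum(max by (node) (kube_node_status_capacity{{{selector}}}))"


print(build_capacity_query("memory"))
print(build_capacity_query("cpu", ', cluster="prod"'))
```

Because the resource matcher always comes first, the leading comma in `cluster_label` never produces an invalid selector, which is why this form does not need the comma-stripped variant of the label.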

         kube_system_requests_mem = f"""
-            sum(max(kube_pod_container_resource_requests{{ namespace='kube-system', resource='memory' {cluster_label} }}) by (job, pod, container) )
+            sum(max(kube_pod_container_resource_requests{{ namespace="kube-system", resource="memory" {cluster_label} }}) by (job, pod, container) )
         """
         kube_system_requests_cpu = f"""
-            sum(max(kube_pod_container_resource_requests{{ namespace='kube-system', resource='cpu' {cluster_label} }}) by (job, pod, container) )
+            sum(max(kube_pod_container_resource_requests{{ namespace="kube-system", resource="cpu" {cluster_label} }}) by (job, pod, container) )
         """
try:
cluster_memory_result = await self.query_and_validate(memory_query)