Update memory_query and cpu_query for Prometheus #420
base: main
Changes from all commits
6700a5d
9988554
fc104cb
8284b08
433fca8
| Original file line number | Diff line number | Diff line change |
|---|---|---|
@@ -250,19 +250,18 @@ async def query_and_validate(self, prom_query) -> Any: | |
| async def get_cluster_summary(self) -> Dict[str, Any]: | ||
| cluster_label = self.get_prometheus_cluster_label() | ||
| # use this for queries with no labels. turn ', cluster="xxx"' to 'cluster="xxx"' | ||
| single_cluster_label = cluster_label.replace(",", "") | ||
| memory_query = f""" | ||
| - sum(max by (instance) (machine_memory_bytes{{ {single_cluster_label} }})) | ||
| + sum(kube_node_status_capacity{{ resource="memory" {cluster_label} }}) | ||
| """ | ||
| cpu_query = f""" | ||
| - sum(max by (instance) (machine_cpu_cores{{ {single_cluster_label} }})) | ||
| + sum(kube_node_status_capacity{{ resource="cpu" {cluster_label} }}) | ||
| """ | ||
coderabbitai[bot] marked this conversation as resolved.
Comment on lines 253 to 259
🧩 Analysis chain

🏁 Script executed:
    #!/bin/bash
    set -euo pipefail
    PROM="${PROMETHEUS_URL:?Set PROMETHEUS_URL (e.g. https://prom.example.com)}"
    queries=(
      'sum(kube_node_status_capacity{resource="memory"})'
      'sum(max by (node) (kube_node_status_capacity{resource="memory"}))'
      'sum(kube_node_status_capacity{resource="cpu"})'
      'sum(max by (node) (kube_node_status_capacity{resource="cpu"}))'
    )
    for q in "${queries[@]}"; do
      echo "QUERY: $q"
      curl -sG "$PROM/api/v1/query" --data-urlencode "query=$q" | jq -r '.data.result[0].value // "no result"'
      echo
    done
Repository: robusta-dev/krr Length of output: 144

🏁 Script executed:
    # Read the file at the specified lines
    cat -n robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py | sed -n '240,270p'
Repository: robusta-dev/krr Length of output: 1751

🏁 Script executed:
    # Check the get_prometheus_cluster_label function referenced in the scratchpad
    grep -n "get_prometheus_cluster_label" robusta_krr/core/integrations/prometheus/metrics/base.py | head -5
Repository: robusta-dev/krr Length of output: 112

🏁 Script executed:
    # Find where these queries are used
    rg "memory_query|cpu_query" --context 3 -A 2 -B 2 robusta_krr/core/integrations/prometheus/metrics_service/
Repository: robusta-dev/krr Length of output: 2031

🏁 Script executed:
    # Check the implementation of get_prometheus_cluster_label
    cat -n robusta_krr/core/integrations/prometheus/metrics/base.py | sed -n '75,95p'
Repository: robusta-dev/krr Length of output: 979

🏁 Script executed:
    # Look for the class definition and understand the context
    grep -n "class.*PrometheusMetricsService" robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py
Repository: robusta-dev/krr Length of output: 109

🏁 Script executed:
    # Search for kube_node_status_capacity usage in the codebase to understand metric structure
    rg "kube_node_status_capacity" robusta_krr/
Repository: robusta-dev/krr Length of output: 393

🏁 Script executed:
    # Check if there's any documentation or references to the metric labels
    rg "node.*resource.*capacity" robusta_krr/ -A 2 -B 2
Repository: robusta-dev/krr Length of output: 41

🏁 Script executed:
    # Look for any test data or example Prometheus queries
    find robusta_krr -name "*.py" -type f | xargs grep -l "kube_state_metrics\|kube-state-metrics" | head -5
Repository: robusta-dev/krr Length of output: 41

🏁 Script executed:
    # Search for how this metric is documented or used in comments
    rg "kube_node_status_capacity" -B 5 -A 5 robusta_krr/
Repository: robusta-dev/krr Length of output: 1965

🏁 Script executed:
    # Check the structure of the queries - look for any other aggregations in similar contexts
    rg "kube_pod_container_resource_requests|kube_node" robusta_krr/core/integrations/prometheus/metrics_service/ -A 3 -B 1 | head -40
Repository: robusta-dev/krr Length of output: 2096

🏁 Script executed:
    # Look at the similar query patterns for kube_system to understand the aggregation style
    cat -n robusta_krr/core/integrations/prometheus/metrics_service/prometheus_metrics_service.py | sed -n '260,265p'
Repository: robusta-dev/krr Length of output: 501

🏁 Script executed:
    # Check if there are any tests or documentation about this query
    rg "get_cluster_summary" robusta_krr/ -A 10 -B 2 | head -50
Repository: robusta-dev/krr Length of output: 4489

🏁 Script executed:
    # Research kube-state-metrics documentation about kube_node_status_capacity metric
    # Check if there are any comments or references in the codebase about metric labels
    rg "node.*label\|resource.*label" robusta_krr/ | head -20
Repository: robusta-dev/krr Length of output: 41

🏁 Script executed:
    # Check the kube-state-metrics version or any vendor-specific documentation
    find robusta_krr -name "*.md" -o -name "*.txt" | xargs grep -l "kube.state\|prometheus" 2>/dev/null | head -5
Repository: robusta-dev/krr Length of output: 41

🌐 Web query:

💡 Result: The kube_node_status_capacity metric from kube-state-metrics is a Gauge that represents the total amount of resources available (capacity) for a Kubernetes node.
Labels:
- node: The node address/name (e.g., node=)
- resource: The name of the resource (e.g., cpu, memory, ephemeral_storage, pods, attachable_volumes_, hugepages_)
Example values:
- cpu=
- memory=
- ephemeral_storage=
- pods=
Status: STABLE
This metric is sourced from the Kubernetes Node .status.capacity field.
Citations:

🏁 Script executed:
    # Check if there are any tests or example configurations
    find robusta_krr -name "*test*" -type f -path "*prometheus*" | xargs grep -l "kube_node_status" 2>/dev/null | head -3
Repository: robusta-dev/krr Length of output: 41

🏁 Script executed:
    # Look for any existing aggregation patterns for node metrics in the codebase
    rg "by \(node\)" robusta_krr/ -B 2 -A 2
Repository: robusta-dev/krr Length of output: 41

🏁 Script executed:
    # Verify there are no other similar problematic sum() aggregations for node metrics
    rg "sum\(kube_node" robusta_krr/
Repository: robusta-dev/krr Length of output: 393

🏁 Script executed:
    # Double-check the cluster_label format one more time
    cat -n robusta_krr/core/integrations/prometheus/metrics/base.py | sed -n '79,88p'
Repository: robusta-dev/krr Length of output: 523

Deduplicate node capacity series before summing.
🔧 Proposed fix
      memory_query = f"""
    - sum(kube_node_status_capacity{{ resource="memory" {cluster_label} }})
    + sum(max by (node) (kube_node_status_capacity{{ resource="memory" {cluster_label} }}))
      """
      cpu_query = f"""
    - sum(kube_node_status_capacity{{ resource="cpu" {cluster_label} }})
    + sum(max by (node) (kube_node_status_capacity{{ resource="cpu" {cluster_label} }}))
      """
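To illustrate why the reviewer asks for deduplication, here is a minimal sketch in plain Python that mimics `sum(...)` versus `sum(max by (node) (...))` on hand-made samples. The sample data, the `job` label values, and the helper names are all hypothetical. The scenario assumed here is that one node's capacity is exported by more than one series (for example, two scrape targets reporting the same node); the review thread does not confirm this for any particular cluster.

```python
from typing import Dict, List, Tuple

# A sample is (labels, value), loosely modelled on a Prometheus instant vector.
Sample = Tuple[Dict[str, str], float]

# Hypothetical data: node-a is reported twice, node-b once.
SAMPLES: List[Sample] = [
    ({"node": "node-a", "resource": "cpu", "job": "ksm-0"}, 4.0),
    ({"node": "node-a", "resource": "cpu", "job": "ksm-1"}, 4.0),
    ({"node": "node-b", "resource": "cpu", "job": "ksm-0"}, 8.0),
]


def plain_sum(samples: List[Sample]) -> float:
    """Mimics sum(metric): every series is added, duplicates included."""
    return sum(value for _, value in samples)


def sum_of_max_by(samples: List[Sample], label: str) -> float:
    """Mimics sum(max by (label) (metric)): one value per label value, then summed."""
    best: Dict[str, float] = {}
    for labels, value in samples:
        key = labels.get(label, "")  # series without the label share one group
        best[key] = max(best.get(key, value), value)
    return sum(best.values())


naive_total = plain_sum(SAMPLES)
deduplicated_total = sum_of_max_by(SAMPLES, "node")
print(naive_total, deduplicated_total)
```

With this toy input the plain sum counts node-a twice, while the grouped version counts each node once. When every node has exactly one series, both forms give the same total, so the extra `max by (node)` costs nothing in the common case.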
| kube_system_requests_mem = f""" | ||
| - sum(max(kube_pod_container_resource_requests{{ namespace='kube-system', resource='memory' {cluster_label} }}) by (job, pod, container) ) | ||
| + sum(max(kube_pod_container_resource_requests{{ namespace="kube-system", resource="memory" {cluster_label} }}) by (job, pod, container) ) | ||
| """ | ||
| kube_system_requests_cpu = f""" | ||
| - sum(max(kube_pod_container_resource_requests{{ namespace='kube-system', resource='cpu' {cluster_label} }}) by (job, pod, container) ) | ||
| + sum(max(kube_pod_container_resource_requests{{ namespace="kube-system", resource="cpu" {cluster_label} }}) by (job, pod, container) ) | ||
| """ | ||
| try: | ||
| cluster_memory_result = await self.query_and_validate(memory_query) | ||
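The label handling in this hunk can be sketched as follows. This is a minimal illustration, assuming (per the code comment in the hunk) that `get_prometheus_cluster_label()` returns either an empty string or a fragment with a leading comma such as `, cluster="xxx"`. `build_capacity_query` and `build_legacy_query` are hypothetical helpers written for this example, not functions in krr.

```python
def build_capacity_query(resource: str, cluster_label: str = "") -> str:
    """Render the kube_node_status_capacity query from the diff.

    The selector already contains a resource matcher, so a cluster_label
    that starts with a comma can be appended as-is. Doubled braces are
    f-string escapes that produce literal '{' and '}' in the PromQL text.
    """
    return f'sum(kube_node_status_capacity{{ resource="{resource}" {cluster_label} }})'


def build_legacy_query(metric: str, cluster_label: str = "") -> str:
    """Render the older machine_* style query shown in the diff.

    That selector has no other matcher, so the comma is removed first,
    mirroring the cluster_label.replace(",", "") call in the diff.
    """
    single_cluster_label = cluster_label.replace(",", "")
    return f"sum(max by (instance) ({metric}{{ {single_cluster_label} }}))"


if __name__ == "__main__":
    label = ', cluster="prod"'  # assumed shape of the cluster label fragment
    print(build_capacity_query("memory", label))
    print(build_capacity_query("cpu"))
    print(build_legacy_query("machine_cpu_cores", label))
```

Note that `replace(",", "")` strips every comma in the fragment, not only the leading one, so a label value that itself contained a comma would be altered; the new query form avoids the stripping step entirely.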