
fix: preserve counter metric types in OTLP #198

Open
jingxiang-z wants to merge 3 commits into main from fix/metric-counter-otlp-types

Conversation

@jingxiang-z
Collaborator

@jingxiang-z jingxiang-z commented May 19, 2026

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features

    • Added explicit metric-type tracking (counter vs gauge) and a settable counter metric collector.
  • Improvements

    • OTLP export now emits counters as cumulative monotonic sums and preserves gauges.
    • Scraper and storage now persist and populate metric types; several hardware/network metrics reclassified from gauge → counter.
    • Removed default unit value from the agent "up" metric.
  • Tests

    • Added tests for type propagation, OTLP conversion, scraper, collector, and schema migration.
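
The need for a settable counter can be illustrated with a small sketch. Prometheus's standard Counter interface in client_golang only offers Inc and Add, so a source that already reports cumulative totals needs a way to overwrite the stored value. The type below is a hypothetical, standard-library-only stand-in with a single label; it is not the collector added in this PR, and every name in it is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// settableCounter is a hypothetical stand-in, not the SDK's collector.
// It keeps the latest cumulative total reported for each label value.
type settableCounter struct {
	mu     sync.Mutex
	totals map[string]float64
}

func newSettableCounter() *settableCounter {
	return &settableCounter{totals: make(map[string]float64)}
}

// Set overwrites the stored total for one label value.
func (c *settableCounter) Set(label string, total float64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.totals[label] = total
}

// Delete drops one label value and reports whether it existed.
func (c *settableCounter) Delete(label string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.totals[label]
	delete(c.totals, label)
	return ok
}

// Reset drops every stored label value.
func (c *settableCounter) Reset() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.totals = make(map[string]float64)
}

// Snapshot returns a copy of the current totals.
func (c *settableCounter) Snapshot() map[string]float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make(map[string]float64, len(c.totals))
	for k, v := range c.totals {
		out[k] = v
	}
	return out
}

func main() {
	c := newSettableCounter()
	c.Set("eth0", 1024) // first reading of a device-side total
	c.Set("eth0", 4096) // later reading replaces it, no delta math needed
	fmt.Println(c.Snapshot()["eth0"])
}
```

The sketch leaves out everything Prometheus-specific (descriptors, label currying, the Describe/Collect methods); it only shows the set-instead-of-add behaviour.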


Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
@coderabbitai

coderabbitai Bot commented May 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4c295735-ffa9-44e5-8829-363ec3b9a739

📥 Commits

Reviewing files that changed from the base of the PR and between 0b61d83 and 18d0125.

📒 Files selected for processing (4)
  • third_party/fleet-intelligence-sdk/pkg/metrics/scraper/prometheus.go
  • third_party/fleet-intelligence-sdk/pkg/metrics/scraper/prometheus_test.go
  • third_party/fleet-intelligence-sdk/pkg/metrics/store/sqlite.go
  • third_party/fleet-intelligence-sdk/pkg/metrics/store/sqlite_test.go

📝 Walkthrough

This PR introduces explicit metric type awareness (counter vs gauge) throughout the metrics pipeline: type constants and a Type field on Metric, a settable counter collector, type detection in the scraper, SQLite persistence and migration for metric types, OTLP export routing to Sum or Gauge, and metric definitions updated to use counter collectors.
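
As a rough illustration of the type constants and field mentioned above, the following sketch uses the names from the walkthrough (MetricType, MetricTypeGauge, MetricTypeCounter, Metric.Type). The string values, the other Metric fields, and the typeOrGauge helper are assumptions made for the example and may not match the SDK.

```go
package main

import "fmt"

// MetricType labels a sample as a gauge or a counter.
type MetricType string

const (
	// The string values are assumed; the PR summary only names the constants.
	MetricTypeGauge   MetricType = "gauge"
	MetricTypeCounter MetricType = "counter"
)

// Metric is a simplified stand-in for the SDK's Metric struct.
// Only the Type field is confirmed by the PR; Name and Value are illustrative.
type Metric struct {
	Name  string
	Value float64
	Type  MetricType
}

// typeOrGauge is a hypothetical helper: anything that is not explicitly a
// counter (including an empty type from rows written before the migration)
// is treated as a gauge.
func typeOrGauge(t MetricType) MetricType {
	if t == MetricTypeCounter {
		return MetricTypeCounter
	}
	return MetricTypeGauge
}

func main() {
	legacy := Metric{Name: "temperature_celsius", Value: 61}
	rx := Metric{Name: "rx_bytes_total", Value: 4096, Type: MetricTypeCounter}
	fmt.Println(typeOrGauge(legacy.Type), typeOrGauge(rx.Type))
}
```

Defaulting to gauge keeps rows without a recorded type readable, at the cost of hiding a missing type rather than reporting it.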

Changes

Metric Type System

Layer / File(s) Summary
Metric type definitions
third_party/fleet-intelligence-sdk/pkg/metrics/types.go
Introduces MetricType string type with MetricTypeGauge and MetricTypeCounter constants; adds Type field to Metric struct.
Settable counter vector implementation
third_party/fleet-intelligence-sdk/pkg/metrics/settable_const_metric.go, third_party/fleet-intelligence-sdk/pkg/metrics/settable_const_metric_test.go
Adds SettableConstMetricVec, a Prometheus collector that stores externally provided sample values, supports label currying, implements the Describe/Collect interface, and provides Set(value) mutation handles with Delete/Reset lifecycle methods; tests added for emission and lifecycle.
Prometheus scraper type detection
third_party/fleet-intelligence-sdk/pkg/metrics/scraper/prometheus.go, third_party/fleet-intelligence-sdk/pkg/metrics/scraper/prometheus_test.go
Updates scraper to detect metric type (counter vs gauge) from Prometheus metric fields and populate the Type field on returned metrics; tests updated to assert Type.
SQLite schema evolution
third_party/fleet-intelligence-sdk/pkg/metrics/store/sqlite.go, third_party/fleet-intelligence-sdk/pkg/metrics/store/sqlite_test.go
Extends SQLite table with metric_type column; adds ensureMetricTypeColumn migration helper; updates read/write paths to persist and restore metric type with gauge default; tests added for migration and type persistence.
OTLP metric type routing
internal/exporter/converter/otlp.go, internal/exporter/converter/otlp_test.go
Adds convertMetricToOTLP helper that routes counter metrics to OTLP cumulative monotonic Sum and gauge metrics to OTLP Gauge; removes per-metric Unit assignments and updates tests.
Metric collector type conversions
third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/mem/metrics.go, third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/nvlink/metrics.go, third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/pcie/metrics.go, third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/power/metrics.go, third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/thermal/metrics.go, third_party/fleet-intelligence-sdk/components/accelerator/nvidia/infiniband/metrics.go, third_party/fleet-intelligence-sdk/components/network/ethernet/metrics.go
Updates metric definitions to use NewSettableCounterVec instead of prometheus.NewGaugeVec for counter-semantic metrics across DCGM subsystems, infiniband, and ethernet RX/TX metrics.
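
The OTLP routing row above can be sketched without the OTLP libraries. In the OTLP data model a Sum carries a monotonic flag and an aggregation temporality, while a Gauge carries neither. The otlpShape struct and shapeFor function below are hypothetical stand-ins that only describe which shape would be chosen; they are not the convertMetricToOTLP helper from this PR.

```go
package main

import "fmt"

type MetricType string

const (
	MetricTypeGauge   MetricType = "gauge"   // assumed string value
	MetricTypeCounter MetricType = "counter" // assumed string value
)

// otlpShape describes the OTLP data shape a metric would be exported as.
type otlpShape struct {
	Kind        string // "Sum" or "Gauge"
	Monotonic   bool   // only meaningful for Sum
	Temporality string // "cumulative" for Sum, empty for Gauge
}

// shapeFor routes counters to a cumulative monotonic Sum and everything
// else to a Gauge.
func shapeFor(t MetricType) otlpShape {
	if t == MetricTypeCounter {
		return otlpShape{Kind: "Sum", Monotonic: true, Temporality: "cumulative"}
	}
	return otlpShape{Kind: "Gauge"}
}

func main() {
	fmt.Printf("%+v\n", shapeFor(MetricTypeCounter))
	fmt.Printf("%+v\n", shapeFor(MetricTypeGauge))
}
```

Exporting a counter as a Gauge would drop the monotonic and cumulative hints, so a backend could no longer tell that rates should be derived from successive values.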

Sequence Diagram(s)

sequenceDiagram
  participant Collector as Metric Collector
  participant Scraper as Prometheus Scraper
  participant Store as SQLite Store
  participant Exporter as OTLP Exporter
  Collector->>Scraper: expose Prometheus metric family
  Scraper->>Scraper: detect counter vs gauge -> set Metric.Type
  Scraper->>Store: write Metric with Type
  Store-->>Store: ensure metric_type column / migration
  Exporter->>Store: read Metric with Type
  Store->>Exporter: return Metric with Type
  Exporter->>Exporter: convertMetricToOTLP -> Sum (counter) or Gauge (gauge)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • mukilsh

"🐰 Hopping through metrics with glee,
Types now flow from source to tree,
Counters cumulate, gauges stand tall,
OTLP exports them all!
⚡ SQLite remembers each call"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 9.09% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title accurately describes the main purpose of the PR: preserving counter metric types when converting to OTLP format, which aligns with the comprehensive changes across metric conversion, type handling, and storage.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
third_party/fleet-intelligence-sdk/pkg/metrics/scraper/prometheus.go (1)

71-82: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip unsupported metric kinds before appending.

When a gathered metric is neither a counter nor a gauge, m.Type and m.Value keep their zero values but the row is still appended. That emits bogus zero-valued datapoints downstream, which are later defaulted to gauge and exported as such.

Suggested fix
 			// for now, only support counter and gauge
 			switch {
 			case mtRaw.GetCounter() != nil:
 				m.Type = pkgmetrics.MetricTypeCounter
 				m.Value = mtRaw.GetCounter().GetValue()
 			case mtRaw.GetGauge() != nil:
 				m.Type = pkgmetrics.MetricTypeGauge
 				m.Value = mtRaw.GetGauge().GetValue()
+			default:
+				continue
 			}
 
 			ms = append(ms, m)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@third_party/fleet-intelligence-sdk/pkg/metrics/scraper/prometheus.go` around
lines 71 - 82, The loop currently always appends m even when mtRaw is not a
counter or gauge, producing zero-valued bogus datapoints; update the metric
parsing in prometheus.go (the switch handling mtRaw.GetCounter()/GetGauge() that
sets m.Type and m.Value) to skip unsupported kinds by continuing the loop when
neither case matches (i.e., if mtRaw.GetCounter() == nil && mtRaw.GetGauge() ==
nil then continue) so only properly populated metrics
(pkgmetrics.MetricTypeCounter or pkgmetrics.MetricTypeGauge) are appended to ms.
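
The skip-on-unsupported behaviour described in this comment can be exercised in isolation. The sketch below replaces the Prometheus client model with two hypothetical stand-in structs (sample, rawMetric) whose nil pointers play the role of GetCounter() and GetGauge() returning nil; the real scraper code differs.

```go
package main

import "fmt"

type MetricType string

const (
	MetricTypeGauge   MetricType = "gauge"   // assumed string value
	MetricTypeCounter MetricType = "counter" // assumed string value
)

type Metric struct {
	Type  MetricType
	Value float64
}

// sample and rawMetric are stand-ins for the gathered Prometheus metric:
// at most one of Counter and Gauge is set, and both are nil for other kinds
// such as histograms or summaries.
type sample struct{ Value float64 }

type rawMetric struct {
	Counter *sample
	Gauge   *sample
}

// parse keeps counters and gauges and skips every other metric kind.
func parse(raws []rawMetric) []Metric {
	var ms []Metric
	for _, raw := range raws {
		var m Metric
		switch {
		case raw.Counter != nil:
			m.Type = MetricTypeCounter
			m.Value = raw.Counter.Value
		case raw.Gauge != nil:
			m.Type = MetricTypeGauge
			m.Value = raw.Gauge.Value
		default:
			// continue targets the enclosing for loop, so nothing is appended.
			continue
		}
		ms = append(ms, m)
	}
	return ms
}

func main() {
	raws := []rawMetric{
		{Counter: &sample{Value: 10}},
		{}, // unsupported kind, e.g. a histogram
		{Gauge: &sample{Value: 0.5}},
	}
	fmt.Println(len(parse(raws)))
}
```

Without the default branch the middle entry would be appended with an empty type and a value of 0, which is the bogus datapoint the comment warns about.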
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@third_party/fleet-intelligence-sdk/pkg/metrics/store/sqlite.go`:
- Around line 277-309: ensureMetricTypeColumn currently interpolates the table
variable directly into PRAGMA and ALTER TABLE SQL which risks SQL injection and
broken identifiers; validate the table name with a strict identifier regexp
(e.g. ^[A-Za-z_][A-Za-z0-9_]*$) and/or reject unsafe names, then build a
properly quoted identifier by escaping any internal double-quotes and wrapping
the name in double quotes before using it in fmt.Sprintf; update uses in
ensureMetricTypeColumn (both the PRAGMA table_info(...) and ALTER TABLE ... ADD
COLUMN ...) to use the validated/quoted identifier and keep references to
columnMetricType and pkgmetrics.MetricTypeGauge unchanged.
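
A minimal sketch of the validate-then-quote approach suggested above, using only the standard library. The function names quoteIdent and migrationSQL, the TEXT column type, and the exact DEFAULT clause are assumptions for illustration; the actual ensureMetricTypeColumn may build its statements differently.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// identRE is the strict identifier pattern quoted in the review comment.
var identRE = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_]*$`)

// quoteIdent rejects unsafe names, then wraps the name in double quotes.
// Doubling embedded quotes is defensive: validated names never contain one.
func quoteIdent(name string) (string, error) {
	if !identRE.MatchString(name) {
		return "", fmt.Errorf("unsafe table name %q", name)
	}
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`, nil
}

// migrationSQL builds the two statements the migration needs: one to
// inspect the existing columns and one to add the metric_type column.
func migrationSQL(table string) (pragma, alter string, err error) {
	q, err := quoteIdent(table)
	if err != nil {
		return "", "", err
	}
	pragma = fmt.Sprintf("PRAGMA table_info(%s)", q)
	// Column type and default are assumed for this sketch.
	alter = fmt.Sprintf(
		"ALTER TABLE %s ADD COLUMN metric_type TEXT NOT NULL DEFAULT 'gauge'", q)
	return pragma, alter, nil
}

func main() {
	pragma, alter, err := migrationSQL("metrics")
	fmt.Println(pragma, alter, err)

	_, _, err = migrationSQL("metrics; DROP TABLE metrics")
	fmt.Println(err)
}
```

Table and column names cannot be bound as query parameters in SQLite, which is why the identifier has to be validated and quoted before it is formatted into the statement.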


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 989ad40d-1da4-4108-b0b1-d51db8aaeca0

📥 Commits

Reviewing files that changed from the base of the PR and between 9dda598 and da468c3.

📒 Files selected for processing (16)
  • internal/exporter/converter/otlp.go
  • internal/exporter/converter/otlp_test.go
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/mem/metrics.go
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/nvlink/metrics.go
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/pcie/metrics.go
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/power/metrics.go
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/dcgm/thermal/metrics.go
  • third_party/fleet-intelligence-sdk/components/accelerator/nvidia/infiniband/metrics.go
  • third_party/fleet-intelligence-sdk/components/network/ethernet/metrics.go
  • third_party/fleet-intelligence-sdk/pkg/metrics/scraper/prometheus.go
  • third_party/fleet-intelligence-sdk/pkg/metrics/scraper/prometheus_test.go
  • third_party/fleet-intelligence-sdk/pkg/metrics/settable_const_metric.go
  • third_party/fleet-intelligence-sdk/pkg/metrics/settable_const_metric_test.go
  • third_party/fleet-intelligence-sdk/pkg/metrics/store/sqlite.go
  • third_party/fleet-intelligence-sdk/pkg/metrics/store/sqlite_test.go
  • third_party/fleet-intelligence-sdk/pkg/metrics/types.go

Comment thread third_party/fleet-intelligence-sdk/pkg/metrics/store/sqlite.go
Signed-off-by: Jingxiang Zhang <jingzhang@nvidia.com>
