Skip to content

collector/diskstats: fix rotational queue stats scrape regression#3686

Open
ruthwikkakumani wants to merge 1 commit into
prometheus:masterfrom
ruthwikkakumani:master
Open

collector/diskstats: fix rotational queue stats scrape regression#3686
ruthwikkakumani wants to merge 1 commit into
prometheus:masterfrom
ruthwikkakumani:master

Conversation

@ruthwikkakumani

Copy link
Copy Markdown

Fixes #3282 - 10x diskstats latency regression on systems with many block devices

This PR resolves a severe performance regression introduced in PR #3022. The collector was previously calling SysBlockDeviceQueueStats(dev) for every block device, forcing roughly 30 sysfs file reads per device per scrape. On systems with 1,000+ devices, this resulted in 30,000+ unnecessary file operations, causing scrapes to timeout.

This fix excises the expensive struct call and replaces it with a targeted readRotational(dev) function that reads exactly 1 file per device. The new function safely mirrors previous fallback behavior by strictly validating the content and failing closed to "0" on missing or malformed files. Dynamic tests using t.TempDir() have been added to ensure robust edge-case handling.

@nicolastakashi

Copy link
Copy Markdown

Hey @ruthwikkakumani overall LGTM, do you mind paste the results of the benchmark as part of the PR description?

@ruthwikkakumani ruthwikkakumani force-pushed the master branch 2 times, most recently from d802087 to 65da987 Compare June 17, 2026 03:30
@ruthwikkakumani

ruthwikkakumani commented Jun 17, 2026

Copy link
Copy Markdown
Author

Hi @nicolastakashi! Thanks for the approval

I noticed CI is blocked on "3 workflows awaiting approval" — could you approve the workflow run so the checks can complete?

Benchmark Results

The BenchmarkDiskstatsUpdate test is included in the PR and can be run with:

go test -bench=BenchmarkDiskstatsUpdate -benchtime=5s -benchmem ./collector/

I can't run it locally since the diskstats collector is Linux-only. Analytically:

Before After
Call SysBlockDeviceQueueStats(dev) readRotational(dev)
sysfs reads per device ~30 1
1,000 devices per scrape ~30,000 file reads 1,000 file reads

This ~30× reduction in sysfs I/O directly explains the 10× latency regression reported in #3282.

Note on golangci-lint Failures

All 15 reported lint issues (cpufreq_common.go, paths.go, filesystem_common.go, etc.) are pre-existing on prometheus:master and unrelated to this PR. You can verify by running:

golangci-lint run ./collector/...

directly on upstream master.

@ruthwikkakumani

Copy link
Copy Markdown
Author

Hey @SuperQ @discordianfish — could you please approve the workflow runs so CI can complete? The code has been reviewed and approved by @nicolastakashi. Thank you!

@nicolastakashi nicolastakashi left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM on my side.

Comment thread collector/diskstats_linux.go Outdated
// file reads to exactly 1, fixing the 10× scrape regression on systems with
// large numbers of block devices (issue #3282).
func readRotational(dev string) string {
data, err := os.ReadFile(sysFilePath(fmt.Sprintf("block/%s/queue/rotational", dev)))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We prefer to have these file readers implemented in the procfs library.

I would be fine with an optimized blockdevice.FS.Rotational(dev) function that only reads the one parameter.

The rest of the PR seems fine.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood! I'll add a SysBlockDeviceRotational(device string) (uint64, error) method
to the blockdevice.FS type in prometheus/procfs and update this PR to use it.

The implementation will follow the same pattern as SysBlockDeviceSize() — a targeted
single-file read of /sys/block/<device>/queue/rotational using util.ReadUintFromFile.

I'll open the procfs PR first, then update this one to depend on it.

@ruthwikkakumani

Copy link
Copy Markdown
Author

@SuperQ I've addressed your feedback — the file reader is now in the procfs library as you requested.

prometheus/procfs#824 adds blockdevice.FS.SysBlockDeviceRotational(device string) (uint64, error) — a targeted single-file read of /sys/block//queue/rotational following the same pattern as SysBlockDeviceSize().

This PR now calls fs.SysBlockDeviceRotational(dev) via rotationalLabel(). The go.mod uses a replace directive pointing to the procfs fork until procfs#824 merges — at that point it can be replaced with a proper version bump.

Could you also approve the workflow runs so CI can complete? Thank you!

…ometheus#3282)

Fix a 10× scrape-latency regression introduced in prometheus#3022 on systems with
many block devices.

Previously, Update() called SysBlockDeviceQueueStats(dev) for every
block device per scrape, reading ~30 sysfs files per device. On systems
with 1,000+ devices this resulted in 30,000+ unnecessary file reads per
scrape, causing timeouts.

Replace the expensive struct call with a targeted rotationalLabel()
helper that delegates to blockdevice.FS.SysBlockDeviceRotational(dev)
(added in prometheus/procfs#824). This reads exactly 1 sysfs file per
device:
  /sys/block/<device>/queue/rotational

Returns "1" for rotational (HDD) and "0" for non-rotational (SSD/NVMe),
failing closed to "0" on any error — matching the zero-initialised-struct
behaviour of the old code.

Also add:
- TestRotationalLabel: dynamic t.TempDir()-based tests via blockdevice.FS,
  covering HDD, SSD, and missing-file cases.
- BenchmarkDiskstatsUpdate: measures the full Update() call to catch
  future per-device sysfs I/O regressions before merge.

Fixes prometheus#3282

Signed-off-by: Ruthwik Kakumani <ruthwikkakumani08@gmail.com>
@ruthwikkakumani

Copy link
Copy Markdown
Author

@SuperQ procfs#824 has been merged into prometheus:master. I've updated go.mod to drop the replace directive and now point directly to the official github.com/prometheus/procfs@v0.20.2-0.20260618104242-66a9e6ebfb87. The fork dependency is gone.

Could you please approve the workflow runs so CI can complete? Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Performance regression in diskstats collector

3 participants