feat: add directory-level disk usage stats#33
Conversation
This adds three new methods to Statter: - DiskUsage: Returns actual disk usage of a directory tree (like du -sh) using stat.Blocks for accurate block-level accounting on Unix. Handles hard links by tracking inodes to avoid double-counting. - DiskUsageWithTotal: Same as DiskUsage but includes the filesystem total capacity from statfs/GetDiskFreeSpaceEx. - DiskUsageSimple: A simpler/faster version that uses file sizes instead of blocks. Useful when accuracy for sparse files is not needed. These methods are useful in containerized environments where you need to track usage of specific directories (e.g., /home, /var/lib/docker) rather than the entire filesystem. Fixes #32
Staff Engineer Review 🔍Putting on the review goggles, here are potential issues and gotchas to consider: 1. Symlink handling is missingThe code uses Suggestion: Use 2. Inode tracking doesn't consider device IDsThe hard link deduplication uses Suggestion: Key on 3. Silent error swallowing is riskyReturning Suggestion: Consider logging warnings, or provide an option to collect/surface these errors. 4. Unbounded memory usageThe Consideration: For agent metadata use cases this is probably acceptable, but worth documenting. Could also consider a bounded LRU cache or bloom filter for extreme cases. 5. Race conditions during walkFiles can be created/deleted/modified during the walk. The current code handles some of this gracefully (skips errors), but Consideration: For agent metadata polling (1hr intervals), this is probably fine, but worth documenting. 6. Test coverage gaps
7. API consistency questionThe existing 8. Windows hard link inconsistencyWindows also supports hard links (NTFS), but the Windows implementation doesn't deduplicate them. This creates a platform behavioral inconsistency. If Windows hard link deduplication is desired, could use Summary: Most of these are edge cases that may not matter for the immediate use case (agent metadata in containers), but worth considering before this becomes a widely-used API. At minimum, I'd suggest addressing #1 (symlinks) and #2 (device ID) as they could cause incorrect results in common scenarios. |
- Skip symlinks to avoid following links outside directory or loops - Use (Dev, Ino) tuple for hard link deduplication across filesystems - Add comprehensive tests for symlinks, hard links, and permission errors - Improve documentation with details on edge case handling
|
Addressed the key review feedback in the latest commit: Fixed
New Tests Added
DeferredItems I chose not to address in this PR (could be follow-ups):
|
Summary
Adds three new methods to
Statterfor directory-level disk usage measurement, complementing the existing filesystem-levelDisk()method.New Methods
DiskUsage(prefix Prefix, path string) (*Result, error)Returns actual disk usage of a directory tree (similar to
du -sh). On Unix systems, usesstat.Blocksfor accurate block-level accounting and tracks inodes to avoid double-counting hard links.DiskUsageWithTotal(prefix Prefix, path string) (*Result, error)Same as
DiskUsagebut includes the filesystem total capacity fromstatfs/GetDiskFreeSpaceEx, enabling percentage calculations.DiskUsageSimple(prefix Prefix, path string) (*Result, error)A simpler/faster version that uses file sizes instead of blocks. On Windows, this is identical to
DiskUsage.Use Case
These methods are useful in containerized environments where you need to track usage of specific directories (e.g.,
/home,/var/lib/docker) rather than the entire filesystem. The existingDisk()method reports filesystem-level totals which shows the host filesystem in Docker containers.Implementation Details
filepath.WalkDirfor efficient directory traversalsyscall.Stat_t.Blocks * 512for accurate disk block usagefile.Size()(blocks not exposed by Windows API)Performance Note
These operations can be expensive for large directory trees with many small files (e.g.,
node_modules). Callers should use appropriate refresh intervals (e.g., 1 hour as discussed in the issue).Fixes #32
Created on behalf of @matifali