
Add bindmaster nodes command for GPU/process status monitoring #6

Open
damborik22 wants to merge 1 commit into master from claude/slurm-style-local-jobs-986V6

Conversation

@damborik22
Owner

Summary

Adds a new bindmaster nodes subcommand that queries GPU and process status across configured remote hosts via SSH + nvidia-smi. Output is a human-readable table by default or JSON with --json. Hosts are probed in parallel, and the SSH timeout is configurable.

Changes

  • New module: nodes/status.py (356 lines)

    • Parallel SSH probing of configured hosts using ThreadPoolExecutor
    • Single remote probe script gathers GPU info, compute processes, and user mapping in one roundtrip
    • Parses nvidia-smi output and ps data to build GPU/process inventory
    • Classifies GPU status (FREE/PARTIAL/BUSY) based on utilization and memory thresholds
    • Renders results as colored ASCII table (default) or JSON (with --json flag)
    • Auto-creates ~/.bindmaster/nodes.json config on first run with default hosts (bm1, bm2, bm4)
  • New module: nodes/__init__.py

    • Package marker with docstring
  • Updated: bindmaster.py

    • Integrated nodes subcommand into CLI dispatcher
    • Updated help text and usage examples to include bindmaster nodes [--json] [--config PATH] [--timeout SEC]
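
For reviewers who want a feel for the parsing and FREE/PARTIAL/BUSY classification step, here is a minimal sketch. It is not the code in nodes/status.py. It assumes the GPU rows come from nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv,noheader,nounits, and the threshold constants, the Gpu dataclass, and the function names are hypothetical, since this description does not state the actual values.

```python
from dataclasses import dataclass

# Assumed thresholds for illustration; the real values live in nodes/status.py.
FREE_MAX_UTIL = 5          # percent
FREE_MAX_MEM_FRAC = 0.05   # fraction of total VRAM in use
BUSY_MIN_UTIL = 80         # percent
BUSY_MIN_MEM_FRAC = 0.80


@dataclass
class Gpu:
    index: int
    name: str
    mem_used_mib: int
    mem_total_mib: int
    util_pct: int


def parse_gpu_line(line: str) -> Gpu:
    """Parse one CSV row in the assumed index,name,used,total,util order."""
    idx, name, used, total, util = [field.strip() for field in line.split(",")]
    return Gpu(int(idx), name, int(used), int(total), int(util))


def classify(gpu: Gpu) -> str:
    """Map utilization and memory use to FREE, PARTIAL, or BUSY."""
    mem_frac = gpu.mem_used_mib / gpu.mem_total_mib if gpu.mem_total_mib else 0.0
    if gpu.util_pct <= FREE_MAX_UTIL and mem_frac <= FREE_MAX_MEM_FRAC:
        return "FREE"
    if gpu.util_pct >= BUSY_MIN_UTIL or mem_frac >= BUSY_MIN_MEM_FRAC:
        return "BUSY"
    return "PARTIAL"
```

With these assumed thresholds, a GPU using 10 MiB of 24576 MiB at 0% utilization is FREE, and one at 95% utilization is BUSY whatever its memory use.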

Implementation Details

  • Config schema: JSON file with optional per-host ssh_user override; defaults to ssh's own user resolution
  • Remote probe: Single bash script echoing three sections (GPU, PROC, PS) to minimize SSH roundtrips
  • Error handling: Graceful degradation for unreachable hosts, missing nvidia-smi, and SSH timeouts
  • Concurrency: Configurable worker pool (default 8) for parallel host queries
  • Dependencies: Stdlib only (subprocess, concurrent.futures, json, argparse, pathlib, dataclasses)
  • ANSI colors: Status rows colored by GPU state (green=FREE, yellow=PARTIAL, red=BUSY/error)
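
The single-roundtrip probe means the client has to split one stdout blob back into its three sections. A rough sketch of that step is below. The ===GPU=== style markers and the split_sections name are assumptions for illustration, and the actual delimiters in the probe script may differ.

```python
def split_sections(output: str, names=("GPU", "PROC", "PS")) -> dict[str, list[str]]:
    """Group probe output lines under assumed '===NAME===' marker lines."""
    sections: dict[str, list[str]] = {name: [] for name in names}
    current = None
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith("===") and stripped.endswith("==="):
            tag = stripped.strip("=")
            # Unknown markers switch collection off until a known one appears.
            current = tag if tag in sections else None
            continue
        if current is not None and stripped:
            sections[current].append(stripped)
    return sections
```

A host with no compute processes simply yields an empty PROC list, which keeps the downstream parsing uniform.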

Usage Examples

bindmaster nodes                          # Show table of all configured hosts
bindmaster nodes --json                   # Output JSON for scripting
bindmaster nodes --config /custom/path    # Use alternate config file
bindmaster nodes --timeout 15 --workers 4 # Adjust SSH timeout and parallelism
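
The flags above could be wired up with argparse roughly as follows. This is a sketch, not the parser in bindmaster.py: the build_parser name, the help strings, and the 10-second timeout default are assumptions. Only the worker default of 8 and the config path come from this description.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering the four flags shown in the usage examples."""
    parser = argparse.ArgumentParser(prog="bindmaster nodes")
    parser.add_argument("--json", action="store_true",
                        help="emit JSON instead of the table")
    parser.add_argument("--config", type=Path,
                        default=Path.home() / ".bindmaster" / "nodes.json",
                        help="path to the nodes config file")
    parser.add_argument("--timeout", type=float, default=10.0,  # assumed default
                        help="per-host SSH timeout in seconds")
    parser.add_argument("--workers", type=int, default=8,
                        help="number of hosts probed in parallel")
    return parser
```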

https://claude.ai/code/session_01SoHjgSb8Vh3wGpdbwttgrY

Fans out ssh + nvidia-smi to each configured host in parallel and prints
one row per GPU (free VRAM, util %, top process, FREE/PARTIAL/BUSY).
Stdlib only; no daemon, no shared state.
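
The fan-out itself can be sketched as below, assuming one ssh subprocess per host and a thread pool that collects results in config order. The ssh_probe and probe_all names and the result dict shape are hypothetical. The ssh options used (BatchMode, ConnectTimeout) are standard OpenSSH options, but whether the PR uses them is not stated here.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

QUERY = ("nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu "
         "--format=csv,noheader,nounits")


def ssh_probe(host: str, timeout: float = 10.0) -> dict:
    """Run nvidia-smi on one host; return an error record instead of raising."""
    cmd = ["ssh", "-o", "BatchMode=yes",
           "-o", f"ConnectTimeout={int(timeout)}", host, QUERY]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout + 5)
    except subprocess.TimeoutExpired:
        return {"host": host, "ok": False, "error": "timeout"}
    except OSError as exc:  # e.g. ssh binary not found
        return {"host": host, "ok": False, "error": str(exc)}
    if proc.returncode != 0:
        error = proc.stderr.strip() or f"exit {proc.returncode}"
        return {"host": host, "ok": False, "error": error}
    return {"host": host, "ok": True, "lines": proc.stdout.splitlines()}


def probe_all(hosts, probe=ssh_probe, workers=8):
    """Probe hosts concurrently; results come back in the same order as hosts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(probe, hosts))
```

Passing the probe function as a parameter makes it easy to substitute a fake in tests, so nothing has to touch the network.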

Config at ~/.bindmaster/nodes.json, auto-seeded with bm1/bm2/bm4 on
first run. Add Spark/BM3 entries later (Tailscale name as `host`).
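
A possible shape for the auto-seeding logic is sketched below. The {"hosts": [...]} layout, the load_config and ssh_target names, and the exact file contents are assumptions. This description only states that the file is JSON, is seeded with bm1/bm2/bm4, and supports an optional per-host ssh_user.

```python
import json
from pathlib import Path

# Assumed schema: a list of host entries, each with an optional "ssh_user".
DEFAULT_CONFIG = {"hosts": [{"host": "bm1"}, {"host": "bm2"}, {"host": "bm4"}]}


def load_config(path: Path) -> dict:
    """Load the config, writing the default seed first if the file is missing."""
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(DEFAULT_CONFIG, indent=2) + "\n")
    return json.loads(path.read_text())


def ssh_target(entry: dict) -> str:
    """Build the ssh destination; without ssh_user, ssh resolves the user itself."""
    user = entry.get("ssh_user")
    return f"{user}@{entry['host']}" if user else entry["host"]
```

Under this assumed schema, adding a Tailscale-named machine later would just mean appending another {"host": ...} entry to the list.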