AI cluster debugging for distributed LLM and HPC workloads.
Launch benchmark-style runs, capture GPU/NCCL/network signals, inject failures, and turn noisy telemetry into clear tuning recommendations.
Quick Start | Project Quality | Demo Scenarios | Architecture | NVIDIA Relevance
ClusterScope is a portfolio-grade lab for the work behind production AI clusters: deployment, benchmarking, profiling, failure analysis, and operator guidance. Instead of another chatbot demo, it shows how to reason from workload symptoms to infrastructure causes.
The first milestone runs locally with deterministic synthetic telemetry, so the demo is fast and repeatable. The data model is shaped like a real cluster pipeline, making it straightforward to replace the simulator with Kubernetes jobs, NCCL logs, DCGM exporter metrics, node counters, and storage telemetry later.
- Simulates distributed LLM/HPC runs across 1-16 GPUs and 1-8 nodes.
- Models GPU utilization, NCCL all-reduce latency, network throughput, storage wait, data-loader wait, pod restarts, and rank skew.
- Detects likely bottlenecks such as fabric saturation, poor topology placement, CPU data-loader starvation, collective inefficiency, storage I/O wait, and worker instability.
- Presents a self-serve dashboard with scaling curves, step breakdowns, topology tiles, experiment comparisons, evidence, and tuning actions.
- Scores each run with operator-facing health, risk factors, target status, next actions, and baseline deltas.
- Runs a role-based agent council for Product, Developer, QA, Designer, Performance, and SRE guidance.
- Generates a remediation playbook with prioritized steps, success metrics, validation checks, rollback notes, and runnable experiment commands.
- Exports incident-style Markdown reports for run reviews, interview walkthroughs, and tuning records.
- Includes Kubernetes templates for GPU jobs, DCGM-style telemetry, topology-aware placement, and network fault injection.
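To make the bottleneck-detection bullet concrete, here is a minimal sketch of a rule-based check over a few of the signals listed above. The `Telemetry` fields, the thresholds, and the `diagnose` function are all hypothetical and chosen for illustration; they are not ClusterScope's actual data model or rules.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Telemetry:
    """Hypothetical per-run signals; field names are illustrative only."""
    gpu_utilization: float        # fraction, 0.0-1.0
    link_utilization: float       # fraction of fabric bandwidth in use
    allreduce_share: float        # fraction of step time spent in all-reduce
    dataloader_wait_share: float  # fraction of step time waiting on input
    rank_step_ms: list[float]     # mean step time per rank, in milliseconds


def rank_skew(step_ms: list[float]) -> float:
    """Relative gap between the slowest rank and the median rank."""
    mid = median(step_ms)
    return (max(step_ms) - mid) / mid


def diagnose(t: Telemetry) -> list[str]:
    """Return likely bottlenecks using made-up thresholds."""
    findings = []
    if t.link_utilization > 0.9 and t.allreduce_share > 0.3:
        findings.append("fabric saturation")
    if t.dataloader_wait_share > 0.2 and t.gpu_utilization < 0.7:
        findings.append("data-loader starvation")
    if rank_skew(t.rank_step_ms) > 0.15:
        findings.append("rank skew (check placement)")
    return findings or ["no dominant bottleneck"]


sample = Telemetry(0.63, 0.95, 0.41, 0.05, [210.0, 212.0, 215.0, 260.0])
print(diagnose(sample))
```

Keeping each rule as an independent condition lets several findings appear for one run, which matches the idea of listing evidence rather than a single verdict.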
Use ClusterScope as a compact lab for the kinds of incidents and tuning loops that show up in real AI infrastructure work:
| Use case | What you run | What ClusterScope should explain |
|---|---|---|
| Scaling review before a bigger training run | Compare 1, 2, and 4 GPU runs with the same model and batch settings | Whether efficiency drops below 70 percent, and which subsystem is most likely responsible |
| NCCL or fabric regression triage | Run a healthy baseline beside a network_saturation experiment | Whether all-reduce latency, retries, softirq pressure, and link utilization point to the network path |
| Kubernetes placement validation | Compare default scheduling with topology-aware placement | Whether rank skew and collective time suggest poor GPU, NIC, NUMA, or rack locality |
| Data pipeline sizing | Run dataloader_bottleneck while changing CPU and prefetch assumptions | Whether low GPU utilization is caused by input starvation rather than communication |
| Storage warmup and sharding checks | Run storage_io before and after cache warmup or rank-sharded reads | Whether storage wait is large enough to distort step cadence |
| Resilience drill | Run worker_failure to simulate a killed rank or pod restart | Whether the platform surfaces recovery impact, failed workers, and checkpoint/eviction actions |
| Interview or portfolio demo | Walk through the dashboard and incident reports from baseline to degraded run | Full-stack reasoning from workload symptoms to Kubernetes, GPU, NCCL, and network tuning decisions |
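The scaling-review row above uses a 70 percent efficiency threshold. As a rough sketch, scaling efficiency is commonly defined as measured N-GPU throughput divided by N times the single-GPU throughput. This README does not show ClusterScope's exact formula, so the function name and the sample numbers below are assumptions for illustration.

```python
def scaling_efficiency(throughput_n: float, throughput_1: float, gpus: int) -> float:
    """Measured N-GPU throughput divided by ideal linear scaling."""
    if gpus < 1 or throughput_1 <= 0:
        raise ValueError("need at least one GPU and a positive baseline")
    return throughput_n / (gpus * throughput_1)


# Illustrative numbers only, not ClusterScope output.
baseline = 1000.0                          # tokens/s on 1 GPU
runs = {1: 1000.0, 2: 1800.0, 4: 2600.0}   # tokens/s per GPU count

for gpus, tput in runs.items():
    eff = scaling_efficiency(tput, baseline, gpus)
    flag = "below target" if eff < 0.70 else "ok"
    print(f"{gpus} GPUs: {eff:.0%} ({flag})")
```

With these sample values, only the 4-GPU run falls under the 70 percent target, which is the point at which the review would look for the responsible subsystem.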
python -m clusterscope.cli serve --port 8080

Open http://127.0.0.1:8080.
Run a single diagnosis from the CLI:
python -m clusterscope.cli simulate --scenario network_saturation --gpus 4 --nodes 2

Generate the operator playbook for the same failure mode:

python -m clusterscope.cli plan --scenario network_saturation --gpus 4 --nodes 2

Expected output:
Run: run-network-saturation-... (network_saturation)
Throughput: 2688.6 tokens/s
Scaling efficiency: 51.0%
GPU utilization: 62.8%
Top diagnoses:
- [critical] NCCL traffic is saturating the fabric
Run the tests:
python -m unittest discover -s tests

ClusterScope includes a GitHub Actions CI workflow that compiles the Python package, runs the unit suite with coverage on Python 3.11 and 3.13, checks the dashboard JavaScript, and verifies the static dashboard artifact. The local suite covers diagnosis rules, simulator payloads, API routes, CLI output, Markdown reporting, health insights, remediation playbooks, and data-model edge cases.
ClusterScope includes deterministic advisor agents for the main product-development processes around the app:
| Agent | Role in the workflow |
|---|---|
| Product Manager | Prioritizes the next demo story, user workflow, and portfolio framing |
| Developer | Converts diagnosis gaps into maintainable telemetry or parser work |
| QA Engineer | Turns scenarios into regression tests and acceptance criteria |
| Designer | Improves clarity, scanability, and operator confidence |
| Performance Engineer | Designs the next benchmark or tuning experiment |
| SRE | Adds incident, alerting, and recovery thinking to degraded runs |
The dashboard shows an Agent Council panel for the selected run. The same council is available through the API at /api/agents and /api/runs/{run_id}/agents, or from the CLI:
python -m clusterscope.cli agents --scenario network_saturation

Every diagnosed run now produces a concrete playbook for the next tuning loop. The playbook answers:
| Question | Example output |
|---|---|
| What should we do first? | Validate fabric health before retuning NCCL |
| Who owns it? | SRE, Performance Engineer, Developer, or QA Engineer |
| How do we run it? | A CLI or Kubernetes-oriented command for the next experiment |
| How do we know it worked? | Success metrics such as scaling efficiency, all-reduce p95, retry volume, or data-loader wait |
| How do we stay safe? | Guardrails and rollback notes so changes remain explainable |
The playbook is visible in the dashboard, included in exported Markdown reports, available from /api/runs/{run_id}/playbook, and exposed through:
python -m clusterscope.cli plan --scenario worker_failure

The dashboard is static-hosting ready. The Pages workflow packages web/ plus shared assets/, and the browser falls back to an in-page simulator when the Python API is unavailable.
Live demo: https://arttuan.github.io/clusterscope/
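For scripted access to the API routes mentioned earlier (`/api/runs/{run_id}/agents` and `/api/runs/{run_id}/playbook`), a small client might look like the sketch below. It assumes the local server started by `serve --port 8080` and assumes the routes return JSON; the response schema is not documented here, so the code only decodes the body without interpreting it. The helper names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8080"  # address used by the `serve` command in this README


def playbook_url(run_id: str, base_url: str = BASE_URL) -> str:
    """Build the playbook route for one run, escaping the run ID."""
    return f"{base_url}/api/runs/{quote(run_id, safe='')}/playbook"


def agents_url(run_id: str, base_url: str = BASE_URL) -> str:
    """Build the agent-council route for one run, escaping the run ID."""
    return f"{base_url}/api/runs/{quote(run_id, safe='')}/agents"


def fetch_json(url: str, timeout: float = 5.0):
    """GET a URL and decode the body as JSON (assumes a JSON response)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # "run-example" is a placeholder; use a run ID reported by your own server.
    print(playbook_url("run-example"))
```

Calling `fetch_json(playbook_url(run_id))` requires the Python API to be running; the static demo falls back to the in-page simulator and does not serve these routes.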
| Scenario | Expected diagnosis | Main tuning idea |
|---|---|---|
| healthy | No dominant bottleneck | Save as a baseline before tuning |
| network_saturation | NCCL traffic saturates the fabric | Validate RoCE/MTU/ECN/PFC, placement, GPUDirect RDMA |
| topology_mismatch | Ranks are poorly placed | Add GPU, NUMA, NIC, and rack-aware scheduling |
| dataloader_bottleneck | CPU data loading starves GPUs | Increase prefetch/workers, cache shards, async copies |
| collective_inefficiency | Collectives dominate step time | Tune NCCL algorithm, bucket sizes, gradient accumulation |
| storage_io | Dataset reads slow the step cadence | Warm cache, shard data by rank, measure object-store latency |
| worker_failure | Rank restart reduces throughput | Add checkpoint/resume, elastic recovery, node health checks |
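To sweep every scenario in the table above in one go, the `simulate` command can be driven from a short script. This is a sketch that assumes `simulate` accepts each scenario name listed here with the `--gpus` and `--nodes` flags shown elsewhere in this README; the helper names are hypothetical.

```python
import subprocess
import sys

SCENARIOS = [
    "healthy", "network_saturation", "topology_mismatch",
    "dataloader_bottleneck", "collective_inefficiency",
    "storage_io", "worker_failure",
]


def simulate_command(scenario: str, gpus: int = 4, nodes: int = 2) -> list[str]:
    """Build the `simulate` invocation shown in this README."""
    return [
        sys.executable, "-m", "clusterscope.cli", "simulate",
        "--scenario", scenario, "--gpus", str(gpus), "--nodes", str(nodes),
    ]


def run_all(gpus: int = 4, nodes: int = 2) -> dict[str, str]:
    """Run every scenario and collect stdout keyed by scenario name."""
    outputs = {}
    for scenario in SCENARIOS:
        result = subprocess.run(
            simulate_command(scenario, gpus, nodes),
            capture_output=True, text=True, check=True,
        )
        outputs[scenario] = result.stdout
    return outputs
```

Passing the command as a list avoids shell quoting issues, and `check=True` stops the sweep on the first failing scenario so a broken run is not silently skipped.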
flowchart LR
A["Workload launcher"] --> B["Kubernetes GPU job"]
B --> C["Telemetry collectors"]
C --> D["Run report"]
D --> E["Diagnosis engine"]
E --> F["Dashboard and incident report"]
F --> G["Tuning experiment"]
G --> A
The local implementation replaces the real Kubernetes job and collectors with deterministic simulation. The RunReport interface stays the same, so real telemetry adapters can land without rewriting the dashboard or diagnosis engine.
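The adapter idea can be sketched as follows. The `RunReport` fields here are a hypothetical stand-in, since the real interface is not shown in this README, and `TelemetrySource`, `SimulatedSource`, and `summarize` are illustrative names rather than ClusterScope classes.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class RunReport:
    """Hypothetical stand-in; the real RunReport fields are not shown here."""
    run_id: str
    scenario: str
    tokens_per_s: float
    gpu_utilization: float


class TelemetrySource(Protocol):
    """Anything that can produce a RunReport for a named scenario."""
    def collect(self, scenario: str) -> RunReport: ...


class SimulatedSource:
    """Deterministic source: the same scenario always yields the same report."""
    def collect(self, scenario: str) -> RunReport:
        degraded = scenario != "healthy"
        return RunReport(
            run_id=f"run-{scenario}",
            scenario=scenario,
            tokens_per_s=2500.0 if degraded else 5000.0,  # made-up values
            gpu_utilization=0.60 if degraded else 0.92,   # made-up values
        )


def summarize(source: TelemetrySource, scenario: str) -> str:
    """Consumer code depends only on the RunReport shape, not on the source."""
    report = source.collect(scenario)
    return (f"{report.run_id}: {report.tokens_per_s:.0f} tokens/s, "
            f"GPU {report.gpu_utilization:.0%}")


print(summarize(SimulatedSource(), "healthy"))
```

A real adapter that reads Kubernetes or DCGM data would only need to provide the same `collect` method, leaving `summarize` and any other consumer unchanged.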
clusterscope/ Core simulator, diagnosis engine, API, and CLI
web/ Dependency-free dashboard
k8s/ Kubernetes lab templates for real-cluster experiments
docs/ Architecture, playbook, and incident-style reports
tests/ Unit tests for diagnosis and API-shaped payloads
scripts/ Local helper scripts
assets/ Logo and README visuals
ClusterScope is designed around the full-stack skills that NVIDIA-facing cluster roles often emphasize:
| Skill area | How ClusterScope demonstrates it |
|---|---|
| Distributed AI/HPC systems | Multi-GPU scaling curves, step timing, and utilization analysis |
| Networking and communications | NCCL/all-reduce bottleneck detection and fabric-saturation evidence |
| Kubernetes/DevOps | GPU job manifests, telemetry DaemonSets, RBAC, and placement patches |
| Performance engineering | Before/after experiment framing and quantified efficiency targets |
| Developer empathy | Dashboard explanations that translate metrics into operator actions |
- Launch torchrun, nccl-tests, or MPI jobs through Kubernetes.
- Parse NCCL_DEBUG=INFO, DCGM exporter metrics, Kubernetes events, and node network counters into RunMetrics.
- Persist run reports to SQLite or object storage.
- Re-run tuning experiments with changed batch size, placement, precision, and NCCL settings.
- Generate Markdown incident reports automatically after failed scaling runs.
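As one possible shape for the SQLite persistence item above, the sketch below stores each report as a JSON payload keyed by run ID, using only the standard library. The table name, columns, and helper functions are assumptions for illustration, not an existing ClusterScope schema.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a store with a single hypothetical run_reports table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS run_reports ("
        "run_id TEXT PRIMARY KEY, scenario TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    return conn


def save_report(conn: sqlite3.Connection, run_id: str, scenario: str, report: dict) -> None:
    """Insert or overwrite one report, serialized as JSON."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO run_reports VALUES (?, ?, ?)",
            (run_id, scenario, json.dumps(report, sort_keys=True)),
        )


def load_report(conn: sqlite3.Connection, run_id: str):
    """Return the stored report as a dict, or None if the run is unknown."""
    row = conn.execute(
        "SELECT payload FROM run_reports WHERE run_id = ?", (run_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = open_store()
save_report(conn, "run-example", "network_saturation", {"scaling_efficiency": 0.51})
print(load_report(conn, "run-example"))
```

Storing the payload as JSON keeps the schema stable while the report fields evolve; columns that need to be queried often could be promoted out of the payload later.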
# Start dashboard
python -m clusterscope.cli serve --port 8080
# Simulate a fabric issue
python -m clusterscope.cli simulate --scenario network_saturation --gpus 4 --nodes 2
# Simulate a worker restart
python -m clusterscope.cli simulate --scenario worker_failure --gpus 4 --nodes 2
# Export an incident-style report
python -m clusterscope.cli simulate --scenario network_saturation --markdown
# Ask the app-improvement agent council
python -m clusterscope.cli agents --scenario worker_failure
# Generate an operator remediation playbook
python -m clusterscope.cli plan --scenario network_saturation
# Run tests
python -m unittest discover -s tests