ClusterScope

AI cluster debugging for distributed LLM and HPC workloads.

Launch benchmark-style runs, capture GPU/NCCL/network signals, inject failures, and turn noisy telemetry into clear tuning recommendations.

Quick Start | Project Quality | Demo Scenarios | Architecture | NVIDIA Relevance

Why ClusterScope

ClusterScope is a portfolio-grade lab for the work behind production AI clusters: deployment, benchmarking, profiling, failure analysis, and operator guidance. Instead of another chatbot demo, it shows how to reason from workload symptoms to infrastructure causes.

The first milestone runs locally with deterministic synthetic telemetry, so the demo is fast and repeatable. The data model is shaped like a real cluster pipeline, making it straightforward to replace the simulator with Kubernetes jobs, NCCL logs, DCGM exporter metrics, node counters, and storage telemetry later.

What It Does

Simulates distributed LLM/HPC runs across 1-16 GPUs and 1-8 nodes.
Models GPU utilization, NCCL all-reduce latency, network throughput, storage wait, data-loader wait, pod restarts, and rank skew.
Detects likely bottlenecks such as fabric saturation, poor topology placement, CPU data-loader starvation, collective inefficiency, storage I/O wait, and worker instability.
Presents a self-serve dashboard with scaling curves, step breakdowns, topology tiles, experiment comparisons, evidence, and tuning actions.
Scores each run with operator-facing health, risk factors, target status, next actions, and baseline deltas.
Runs a role-based agent council for Product, Developer, QA, Designer, Performance, and SRE guidance.
Generates a remediation playbook with prioritized steps, success metrics, validation checks, rollback notes, and runnable experiment commands.
Exports incident-style Markdown reports for run reviews, interview walkthroughs, and tuning records.
Includes Kubernetes templates for GPU jobs, DCGM-style telemetry, topology-aware placement, and network fault injection.

Practical Use Cases

Use ClusterScope as a compact lab for the kinds of incidents and tuning loops that show up in real AI infrastructure work:

Use case	What you run	What ClusterScope should explain
Scaling review before a bigger training run	Compare 1, 2, and 4 GPU runs with the same model and batch settings	Whether efficiency drops below 70 percent, and which subsystem is most likely responsible
NCCL or fabric regression triage	Run a healthy baseline beside a `network_saturation` experiment	Whether all-reduce latency, retries, softirq pressure, and link utilization point to the network path
Kubernetes placement validation	Compare default scheduling with topology-aware placement	Whether rank skew and collective time suggest poor GPU, NIC, NUMA, or rack locality
Data pipeline sizing	Run `dataloader_bottleneck` while changing CPU and prefetch assumptions	Whether low GPU utilization is caused by input starvation rather than communication
Storage warmup and sharding checks	Run `storage_io` before and after cache warmup or rank-sharded reads	Whether storage wait is large enough to distort step cadence
Resilience drill	Run `worker_failure` to simulate a killed rank or pod restart	Whether the platform surfaces recovery impact, failed workers, and checkpoint/eviction actions
Interview or portfolio demo	Walk through the dashboard and incident reports from baseline to degraded run	Full-stack reasoning from workload symptoms to Kubernetes, GPU, NCCL, and network tuning decisions
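
The scaling review in the first row relies on a scaling-efficiency figure. The sketch below uses one common definition (measured throughput divided by ideal linear scaling from the single-GPU baseline). It is an illustration only: the function name and throughput numbers are hypothetical, and ClusterScope's own calculation may differ.

```python
def scaling_efficiency(baseline_tokens_per_s: float, tokens_per_s: float, gpus: int) -> float:
    """Return measured throughput as a fraction of ideal linear scaling."""
    if gpus < 1 or baseline_tokens_per_s <= 0:
        raise ValueError("need at least one GPU and a positive baseline")
    return tokens_per_s / (baseline_tokens_per_s * gpus)


# Hypothetical throughputs (tokens/s) for 1, 2, and 4 GPU runs.
runs = {1: 1000.0, 2: 1800.0, 4: 2600.0}
TARGET = 0.70  # the 70 percent threshold from the table above

for gpus, tokens_per_s in runs.items():
    efficiency = scaling_efficiency(runs[1], tokens_per_s, gpus)
    status = "ok" if efficiency >= TARGET else "below target"
    print(f"{gpus} GPUs: {efficiency:.0%} ({status})")
```

With these made-up numbers, only the 4 GPU run falls below the target, which is the point at which the diagnosis evidence becomes worth reading.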

Quick Start

python -m clusterscope.cli serve --port 8080

Open http://127.0.0.1:8080.

Run a single diagnosis from the CLI:

python -m clusterscope.cli simulate --scenario network_saturation --gpus 4 --nodes 2

Generate the operator playbook for the same failure mode:

python -m clusterscope.cli plan --scenario network_saturation --gpus 4 --nodes 2

Expected output of the `simulate` command:

Run: run-network-saturation-... (network_saturation)
Throughput: 2688.6 tokens/s
Scaling efficiency: 51.0%
GPU utilization: 62.8%
Top diagnoses:
- [critical] NCCL traffic is saturating the fabric
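
When scripting around the CLI, the summary lines shown above can be turned into a dictionary. This is a minimal sketch that assumes the output keeps exactly the layout of the sample; `parse_summary` is a hypothetical helper, not part of ClusterScope.

```python
import re

# Sample text copied from the expected output above (run ID left elided).
SAMPLE = """\
Run: run-network-saturation-... (network_saturation)
Throughput: 2688.6 tokens/s
Scaling efficiency: 51.0%
GPU utilization: 62.8%
Top diagnoses:
- [critical] NCCL traffic is saturating the fabric
"""


def parse_summary(text: str) -> dict:
    """Extract numeric metrics and (severity, message) diagnoses."""
    summary = {"diagnoses": []}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if m := re.match(r"- \[(\w+)\] (.+)", line):
            summary["diagnoses"].append((m.group(1), m.group(2)))
        elif m := re.match(r"(Throughput|Scaling efficiency|GPU utilization): ([\d.]+)", line):
            summary[m.group(1)] = float(m.group(2))
    return summary


print(parse_summary(SAMPLE))
```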

Run the tests:

python -m unittest discover -s tests

Project Quality

ClusterScope includes a GitHub Actions CI workflow that compiles the Python package, runs the unit suite with coverage on Python 3.11 and 3.13, checks the dashboard JavaScript, and verifies the static dashboard artifact. The local suite covers diagnosis rules, simulator payloads, API routes, CLI output, Markdown reporting, health insights, remediation playbooks, and data-model edge cases.

Agent Council

ClusterScope includes deterministic advisor agents for the main product-development processes around the app:

Agent	Role in the workflow
Product Manager	Prioritizes the next demo story, user workflow, and portfolio framing
Developer	Converts diagnosis gaps into maintainable telemetry or parser work
QA Engineer	Turns scenarios into regression tests and acceptance criteria
Designer	Improves clarity, scanability, and operator confidence
Performance Engineer	Designs the next benchmark or tuning experiment
SRE	Adds incident, alerting, and recovery thinking to degraded runs

The dashboard shows an Agent Council panel for the selected run. The same council is available through the API at /api/agents and /api/runs/{run_id}/agents, or from the CLI:

python -m clusterscope.cli agents --scenario network_saturation

Remediation Playbooks

Every diagnosed run produces a concrete playbook for the next tuning loop. The playbook answers:

Question	Example output
What should we do first?	Validate fabric health before retuning NCCL
Who owns it?	SRE, Performance Engineer, Developer, or QA Engineer
How do we run it?	A CLI or Kubernetes-oriented command for the next experiment
How do we know it worked?	Success metrics such as scaling efficiency, all-reduce p95, retry volume, or data-loader wait
How do we stay safe?	Guardrails and rollback notes so changes remain explainable

The playbook is visible in the dashboard, included in exported Markdown reports, available from /api/runs/{run_id}/playbook, and exposed through:

python -m clusterscope.cli plan --scenario worker_failure
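
The playbook endpoint can also be called from a script while the local server is running. The sketch below builds the URL from the route named above and assumes, without confirmation from this README, that the response body is JSON. The helper names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8080"  # address used by the Quick Start serve command


def playbook_url(run_id: str, base_url: str = BASE_URL) -> str:
    """Build the /api/runs/{run_id}/playbook URL, escaping the run ID."""
    return f"{base_url}/api/runs/{quote(run_id, safe='')}/playbook"


def fetch_playbook(run_id: str, base_url: str = BASE_URL) -> dict:
    """Fetch a playbook. Assumption: the API responds with a JSON object."""
    with urlopen(playbook_url(run_id, base_url), timeout=5) as response:
        return json.load(response)
```

Call `fetch_playbook` with a run ID reported by the dashboard or by the `simulate` command.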

GitHub Pages

The dashboard is static-hosting ready. The Pages workflow packages web/ plus shared assets/, and the browser falls back to an in-page simulator when the Python API is unavailable.

Live demo: https://arttuan.github.io/clusterscope/

Demo Scenarios

Scenario	Expected diagnosis	Main tuning idea
`healthy`	No dominant bottleneck	Save as a baseline before tuning
`network_saturation`	NCCL traffic saturates the fabric	Validate RoCE/MTU/ECN/PFC, placement, GPUDirect RDMA
`topology_mismatch`	Ranks are poorly placed	Add GPU, NUMA, NIC, and rack-aware scheduling
`dataloader_bottleneck`	CPU data loading starves GPUs	Increase prefetch/workers, cache shards, async copies
`collective_inefficiency`	Collectives dominate step time	Tune NCCL algorithm, bucket sizes, gradient accumulation
`storage_io`	Dataset reads slow the step cadence	Warm cache, shard data by rank, measure object-store latency
`worker_failure`	Rank restart reduces throughput	Add checkpoint/resume, elastic recovery, node health checks
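
The mapping from telemetry to the scenarios above can be pictured as a small set of threshold rules. The sketch below is illustrative only: the field names and thresholds are assumptions chosen for the example, not ClusterScope's actual schema or rule set.

```python
from dataclasses import dataclass


@dataclass
class StepMetrics:
    # Hypothetical fields, expressed as fractions of step time or link capacity.
    allreduce_frac: float
    dataloader_wait_frac: float
    storage_wait_frac: float
    link_utilization: float
    pod_restarts: int


def diagnose(m: StepMetrics) -> list[str]:
    """Return scenario labels whose (assumed) thresholds are exceeded."""
    findings = []
    if m.pod_restarts > 0:
        findings.append("worker_failure")
    if m.allreduce_frac > 0.35 and m.link_utilization > 0.90:
        # Heavy collectives on a nearly full link point at the fabric.
        findings.append("network_saturation")
    elif m.allreduce_frac > 0.35:
        # Heavy collectives with link headroom point at NCCL configuration.
        findings.append("collective_inefficiency")
    if m.dataloader_wait_frac > 0.20:
        findings.append("dataloader_bottleneck")
    if m.storage_wait_frac > 0.15:
        findings.append("storage_io")
    return findings or ["healthy"]


print(diagnose(StepMetrics(0.45, 0.05, 0.02, 0.96, 0)))
```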

Architecture

flowchart LR
  A["Workload launcher"] --> B["Kubernetes GPU job"]
  B --> C["Telemetry collectors"]
  C --> D["Run report"]
  D --> E["Diagnosis engine"]
  E --> F["Dashboard and incident report"]
  F --> G["Tuning experiment"]
  G --> A

The local implementation replaces the real Kubernetes job and collectors with deterministic simulation. The RunReport interface stays the same, so real telemetry adapters can land without rewriting the dashboard or diagnosis engine.
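
One way to picture this seam is a small source protocol that both the simulator and a future real adapter satisfy. This is a sketch under stated assumptions: the `RunReport` fields, `TelemetrySource`, and `SimulatedSource` shown here are hypothetical and may not match the real classes.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class RunReport:
    # Hypothetical subset of fields; the real RunReport may differ.
    run_id: str
    scenario: str
    tokens_per_s: float
    gpu_utilization: float


class TelemetrySource(Protocol):
    """Anything that can produce a RunReport for a scenario."""

    def collect(self, scenario: str) -> RunReport: ...


class SimulatedSource:
    """Deterministic stand-in for real collectors (placeholder values)."""

    def collect(self, scenario: str) -> RunReport:
        return RunReport(f"run-{scenario}", scenario, 1000.0, 0.9)


def summarize(source: TelemetrySource, scenario: str) -> str:
    """Consumers depend only on RunReport, not on where it came from."""
    report = source.collect(scenario)
    return (
        f"{report.run_id}: {report.tokens_per_s:.1f} tokens/s "
        f"at {report.gpu_utilization:.1%} GPU utilization"
    )


print(summarize(SimulatedSource(), "healthy"))
```

A Kubernetes-backed source would implement the same `collect` method, leaving `summarize` and any other consumer unchanged.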

Repo Map

clusterscope/        Core simulator, diagnosis engine, API, and CLI
web/                 Dependency-free dashboard
k8s/                 Kubernetes lab templates for real-cluster experiments
docs/                Architecture, playbook, and incident-style reports
tests/               Unit tests for diagnosis and API-shaped payloads
scripts/             Local helper scripts
assets/              Logo and README visuals

NVIDIA Relevance

ClusterScope is designed around the full stack that NVIDIA-facing cluster roles often emphasize:

Skill area	How ClusterScope demonstrates it
Distributed AI/HPC systems	Multi-GPU scaling curves, step timing, and utilization analysis
Networking and communications	NCCL/all-reduce bottleneck detection and fabric-saturation evidence
Kubernetes/DevOps	GPU job manifests, telemetry DaemonSets, RBAC, and placement patches
Performance engineering	Before/after experiment framing and quantified efficiency targets
Developer empathy	Dashboard explanations that translate metrics into operator actions

Real-Cluster Roadmap

Launch torchrun, nccl-tests, or MPI jobs through Kubernetes.
Parse NCCL_DEBUG=INFO, DCGM exporter metrics, Kubernetes events, and node network counters into RunMetrics.
Persist run reports to SQLite or object storage.
Re-run tuning experiments with changed batch size, placement, precision, and NCCL settings.
Generate Markdown incident reports automatically after failed scaling runs.
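
As an illustration of the persistence item, the sketch below stores run reports as JSON in SQLite using only the standard library. The table name, columns, and report payload are assumptions for the example, not an existing ClusterScope schema.

```python
import json
import sqlite3
from typing import Optional


def save_report(conn: sqlite3.Connection, run_id: str, scenario: str, report: dict) -> None:
    """Insert or overwrite one run report, stored as a JSON string."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS run_reports ("
        "run_id TEXT PRIMARY KEY, scenario TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO run_reports VALUES (?, ?, ?)",
        (run_id, scenario, json.dumps(report, sort_keys=True)),
    )
    conn.commit()


def load_report(conn: sqlite3.Connection, run_id: str) -> Optional[dict]:
    """Return the stored report, or None if the run ID is unknown."""
    row = conn.execute(
        "SELECT payload FROM run_reports WHERE run_id = ?", (run_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


# In-memory database and a made-up payload, for demonstration only.
conn = sqlite3.connect(":memory:")
save_report(conn, "run-demo-1", "network_saturation", {"scaling_efficiency": 0.51})
print(load_report(conn, "run-demo-1"))
```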

Useful Commands

# Start dashboard
python -m clusterscope.cli serve --port 8080

# Simulate a fabric issue
python -m clusterscope.cli simulate --scenario network_saturation --gpus 4 --nodes 2

# Simulate a worker restart
python -m clusterscope.cli simulate --scenario worker_failure --gpus 4 --nodes 2

# Export an incident-style report
python -m clusterscope.cli simulate --scenario network_saturation --markdown

# Ask the app-improvement agent council
python -m clusterscope.cli agents --scenario worker_failure

# Generate an operator remediation playbook
python -m clusterscope.cli plan --scenario network_saturation

# Run tests
python -m unittest discover -s tests
