ArttuAn/clusterscope


ClusterScope

AI cluster debugging for distributed LLM and HPC workloads.

Launch benchmark-style runs, capture GPU/NCCL/network signals, inject failures, and turn noisy telemetry into clear tuning recommendations.

Quick Start | Project Quality | Demo Scenarios | Architecture | NVIDIA Relevance


Why ClusterScope

ClusterScope is a portfolio-grade lab for the work behind production AI clusters: deployment, benchmarking, profiling, failure analysis, and operator guidance. Instead of another chatbot demo, it shows how to reason from workload symptoms to infrastructure causes.

The first milestone runs locally with deterministic synthetic telemetry, so the demo is fast and repeatable. The data model is shaped like a real cluster pipeline, making it straightforward to replace the simulator with Kubernetes jobs, NCCL logs, DCGM exporter metrics, node counters, and storage telemetry later.

What It Does

  • Simulates distributed LLM/HPC runs across 1-16 GPUs and 1-8 nodes.
  • Models GPU utilization, NCCL all-reduce latency, network throughput, storage wait, data-loader wait, pod restarts, and rank skew.
  • Detects likely bottlenecks such as fabric saturation, poor topology placement, CPU data-loader starvation, collective inefficiency, storage I/O wait, and worker instability.
  • Presents a self-serve dashboard with scaling curves, step breakdowns, topology tiles, experiment comparisons, evidence, and tuning actions.
  • Scores each run with operator-facing health, risk factors, target status, next actions, and baseline deltas.
  • Runs a role-based agent council for Product, Developer, QA, Designer, Performance, and SRE guidance.
  • Generates a remediation playbook with prioritized steps, success metrics, validation checks, rollback notes, and runnable experiment commands.
  • Exports incident-style Markdown reports for run reviews, interview walkthroughs, and tuning records.
  • Includes Kubernetes templates for GPU jobs, DCGM-style telemetry, topology-aware placement, and network fault injection.
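The bottleneck detection described above can be pictured as a small set of threshold rules over run metrics. The following is a minimal sketch only: the `StepMetrics` fields, the `diagnose` function, and every threshold are illustrative assumptions, not ClusterScope's actual data model or rules.

```python
from dataclasses import dataclass


@dataclass
class StepMetrics:
    """Hypothetical per-run averages; field names are illustrative only."""
    gpu_utilization: float           # 0.0-1.0
    allreduce_fraction: float        # share of step time spent in all-reduce
    link_utilization: float          # 0.0-1.0, busiest fabric link
    dataloader_wait_fraction: float  # share of step time waiting on input
    storage_wait_fraction: float     # share of step time waiting on storage
    pod_restarts: int


def diagnose(m: StepMetrics) -> list[str]:
    """Return likely bottlenecks using assumed (not project-defined) thresholds."""
    findings = []
    # Heavy collectives on a nearly full link point at the fabric;
    # heavy collectives on an idle link point at the collective itself.
    if m.allreduce_fraction > 0.3 and m.link_utilization > 0.9:
        findings.append("fabric_saturation")
    elif m.allreduce_fraction > 0.3:
        findings.append("collective_inefficiency")
    if m.dataloader_wait_fraction > 0.2 and m.gpu_utilization < 0.7:
        findings.append("dataloader_starvation")
    if m.storage_wait_fraction > 0.15:
        findings.append("storage_io_wait")
    if m.pod_restarts > 0:
        findings.append("worker_instability")
    return findings or ["no_dominant_bottleneck"]
```

Keeping each rule independent means one run can surface several findings at once, which matches the idea of ranking multiple diagnoses by severity.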

Practical Use Cases

Use ClusterScope as a compact lab for the kinds of incidents and tuning loops that show up in real AI infrastructure work:

| Use case | What you run | What ClusterScope should explain |
| --- | --- | --- |
| Scaling review before a bigger training run | Compare 1, 2, and 4 GPU runs with the same model and batch settings | Whether efficiency drops below 70 percent, and which subsystem is most likely responsible |
| NCCL or fabric regression triage | Run a healthy baseline beside a network_saturation experiment | Whether all-reduce latency, retries, softirq pressure, and link utilization point to the network path |
| Kubernetes placement validation | Compare default scheduling with topology-aware placement | Whether rank skew and collective time suggest poor GPU, NIC, NUMA, or rack locality |
| Data pipeline sizing | Run dataloader_bottleneck while changing CPU and prefetch assumptions | Whether low GPU utilization is caused by input starvation rather than communication |
| Storage warmup and sharding checks | Run storage_io before and after cache warmup or rank-sharded reads | Whether storage wait is large enough to distort step cadence |
| Resilience drill | Run worker_failure to simulate a killed rank or pod restart | Whether the platform surfaces recovery impact, failed workers, and checkpoint/eviction actions |
| Interview or portfolio demo | Walk through the dashboard and incident reports from baseline to degraded run | Full-stack reasoning from workload symptoms to Kubernetes, GPU, NCCL, and network tuning decisions |
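For the scaling review, the 70 percent check can be worked through by hand. The sketch below assumes the common definition of scaling efficiency (measured throughput divided by ideal linear scaling from the smallest run); ClusterScope's own formula may differ, and the throughput numbers are made up for illustration.

```python
def scaling_efficiency(throughputs: dict[int, float]) -> dict[int, float]:
    """Map GPU count to efficiency relative to linear scaling from the smallest run."""
    base_gpus = min(throughputs)
    per_gpu_base = throughputs[base_gpus] / base_gpus
    return {n: t / (n * per_gpu_base) for n, t in sorted(throughputs.items())}


# Hypothetical tokens/s for 1, 2, and 4 GPU runs (not real ClusterScope output).
runs = {1: 1000.0, 2: 1800.0, 4: 2600.0}
efficiency = scaling_efficiency(runs)

# Flag every run that falls below the 70 percent target.
below_target = [n for n, e in efficiency.items() if e < 0.70]
```

With these inputs the 4 GPU run lands at 65 percent of linear scaling, so it is the only one flagged for a closer look.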

Quick Start

python -m clusterscope.cli serve --port 8080

Open http://127.0.0.1:8080.

Run a single diagnosis from the CLI:

python -m clusterscope.cli simulate --scenario network_saturation --gpus 4 --nodes 2

Generate the operator playbook for the same failure mode:

python -m clusterscope.cli plan --scenario network_saturation --gpus 4 --nodes 2

Expected output:

Run: run-network-saturation-... (network_saturation)
Throughput: 2688.6 tokens/s
Scaling efficiency: 51.0%
GPU utilization: 62.8%
Top diagnoses:
- [critical] NCCL traffic is saturating the fabric
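If the text summary needs to feed a script or a regression check, it can be parsed line by line. This is a minimal sketch that assumes the output keeps exactly the line format of the sample above; `parse_summary` and its dictionary keys are hypothetical names, not part of the ClusterScope CLI.

```python
import re

# Sample copied from the expected output above (run ID left elided as shown).
SAMPLE = """\
Run: run-network-saturation-... (network_saturation)
Throughput: 2688.6 tokens/s
Scaling efficiency: 51.0%
GPU utilization: 62.8%
Top diagnoses:
- [critical] NCCL traffic is saturating the fabric
"""


def parse_summary(text: str) -> dict:
    """Extract headline metrics and diagnoses from the text summary."""
    summary = {"diagnoses": []}
    for line in text.splitlines():
        if m := re.match(r"Throughput: ([\d.]+) tokens/s", line):
            summary["throughput_tokens_s"] = float(m.group(1))
        elif m := re.match(r"Scaling efficiency: ([\d.]+)%", line):
            summary["scaling_efficiency_pct"] = float(m.group(1))
        elif m := re.match(r"GPU utilization: ([\d.]+)%", line):
            summary["gpu_utilization_pct"] = float(m.group(1))
        elif m := re.match(r"- \[(\w+)\] (.+)", line):
            # Each diagnosis becomes a (severity, message) pair.
            summary["diagnoses"].append((m.group(1), m.group(2)))
    return summary
```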

Run the tests:

python -m unittest discover -s tests

Project Quality

ClusterScope includes a GitHub Actions CI workflow that compiles the Python package, runs the unit suite with coverage on Python 3.11 and 3.13, checks the dashboard JavaScript, and verifies the static dashboard artifact. The local suite covers diagnosis rules, simulator payloads, API routes, CLI output, Markdown reporting, health insights, remediation playbooks, and data-model edge cases.

Agent Council

ClusterScope includes deterministic advisor agents for the main product-development processes around the app:

| Agent | Role in the workflow |
| --- | --- |
| Product Manager | Prioritizes the next demo story, user workflow, and portfolio framing |
| Developer | Converts diagnosis gaps into maintainable telemetry or parser work |
| QA Engineer | Turns scenarios into regression tests and acceptance criteria |
| Designer | Improves clarity, scanability, and operator confidence |
| Performance Engineer | Designs the next benchmark or tuning experiment |
| SRE | Adds incident, alerting, and recovery thinking to degraded runs |

The dashboard shows an Agent Council panel for the selected run. The same council is available through the API at /api/agents and /api/runs/{run_id}/agents, or from the CLI:

python -m clusterscope.cli agents --scenario network_saturation
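The council endpoints can also be called from a script while the local server is running. The sketch below uses only the two paths named above and the Quick Start port; it assumes the endpoints return JSON, and it makes no claim about the payload shape. `agents_url` and `fetch_agents` are hypothetical helper names.

```python
import json
from typing import Optional
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8080"  # matches `serve --port 8080`


def agents_url(run_id: Optional[str] = None, base: str = BASE_URL) -> str:
    """Build the council URL for all runs, or for one run when run_id is given."""
    if run_id is None:
        return f"{base}/api/agents"
    return f"{base}/api/runs/{quote(run_id, safe='')}/agents"


def fetch_agents(run_id: Optional[str] = None):
    """Fetch council guidance; assumes a JSON response (an assumption, not documented)."""
    with urlopen(agents_url(run_id), timeout=5) as resp:
        return json.load(resp)
```

Calling `fetch_agents()` only works after the dashboard server has been started; the URL builder can be used on its own.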

Remediation Playbooks

Every diagnosed run produces a concrete playbook for the next tuning loop. The playbook answers:

| Question | Example output |
| --- | --- |
| What should we do first? | Validate fabric health before retuning NCCL |
| Who owns it? | SRE, Performance Engineer, Developer, or QA Engineer |
| How do we run it? | A CLI or Kubernetes-oriented command for the next experiment |
| How do we know it worked? | Success metrics such as scaling efficiency, all-reduce p95, retry volume, or data-loader wait |
| How do we stay safe? | Guardrails and rollback notes so changes remain explainable |

The playbook is visible in the dashboard, included in exported Markdown reports, available from /api/runs/{run_id}/playbook, and exposed through:

python -m clusterscope.cli plan --scenario worker_failure

GitHub Pages

The dashboard is static-hosting ready. The Pages workflow packages web/ plus shared assets/, and the browser falls back to an in-page simulator when the Python API is unavailable.

Live demo: https://arttuan.github.io/clusterscope/

Demo Scenarios

| Scenario | Expected diagnosis | Main tuning idea |
| --- | --- | --- |
| healthy | No dominant bottleneck | Save as a baseline before tuning |
| network_saturation | NCCL traffic saturates the fabric | Validate RoCE/MTU/ECN/PFC, placement, GPUDirect RDMA |
| topology_mismatch | Ranks are poorly placed | Add GPU, NUMA, NIC, and rack-aware scheduling |
| dataloader_bottleneck | CPU data loading starves GPUs | Increase prefetch/workers, cache shards, async copies |
| collective_inefficiency | Collectives dominate step time | Tune NCCL algorithm, bucket sizes, gradient accumulation |
| storage_io | Dataset reads slow the step cadence | Warm cache, shard data by rank, measure object-store latency |
| worker_failure | Rank restart reduces throughput | Add checkpoint/resume, elastic recovery, node health checks |
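To compare all scenarios in one pass, the documented `simulate` command can be driven from a short script. This sketch reuses only the flags shown in this README and assumes that every scenario name in the table, including `healthy`, is accepted by `--scenario`; `simulate_command` and `run_all` are hypothetical helpers.

```python
import subprocess
import sys

SCENARIOS = [
    "healthy",
    "network_saturation",
    "topology_mismatch",
    "dataloader_bottleneck",
    "collective_inefficiency",
    "storage_io",
    "worker_failure",
]


def simulate_command(scenario: str, gpus: int = 4, nodes: int = 2) -> list[str]:
    """Build the argv for one simulate run using the current interpreter."""
    return [
        sys.executable, "-m", "clusterscope.cli", "simulate",
        "--scenario", scenario,
        "--gpus", str(gpus),
        "--nodes", str(nodes),
    ]


def run_all(gpus: int = 4, nodes: int = 2) -> dict[str, str]:
    """Run every scenario and return its stdout keyed by scenario name."""
    outputs = {}
    for scenario in SCENARIOS:
        result = subprocess.run(
            simulate_command(scenario, gpus, nodes),
            capture_output=True, text=True, check=True,
        )
        outputs[scenario] = result.stdout
    return outputs
```

Running the healthy scenario first gives a baseline to compare the degraded runs against.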

Architecture

flowchart LR
  A["Workload launcher"] --> B["Kubernetes GPU job"]
  B --> C["Telemetry collectors"]
  C --> D["Run report"]
  D --> E["Diagnosis engine"]
  E --> F["Dashboard and incident report"]
  F --> G["Tuning experiment"]
  G --> A

The local implementation replaces the real Kubernetes job and collectors with deterministic simulation. The RunReport interface stays the same, so real telemetry adapters can land without rewriting the dashboard or diagnosis engine.
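One way to picture this seam is a small source interface that both the simulator and a future real adapter satisfy. The sketch below is an assumption about the shape of that boundary: the `RunReport` fields shown, the `TelemetrySource` protocol, the run ID format, and the numbers are all illustrative, not the project's actual definitions.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class RunReport:
    """Illustrative subset of fields; the real RunReport may differ."""
    run_id: str
    scenario: str
    gpus: int
    nodes: int
    throughput_tokens_s: float
    gpu_utilization: float


class TelemetrySource(Protocol):
    """Anything that can turn a run request into a RunReport."""
    def collect(self, scenario: str, gpus: int, nodes: int) -> RunReport: ...


class SimulatedSource:
    """Deterministic stand-in: the same inputs always yield the same report."""
    def collect(self, scenario: str, gpus: int, nodes: int) -> RunReport:
        healthy = scenario == "healthy"
        per_gpu = 1000.0 if healthy else 600.0  # made-up throughput per GPU
        return RunReport(
            run_id=f"run-{scenario}-{gpus}x{nodes}",  # hypothetical ID format
            scenario=scenario,
            gpus=gpus,
            nodes=nodes,
            throughput_tokens_s=per_gpu * gpus,
            gpu_utilization=0.9 if healthy else 0.6,
        )


def summarize(source: TelemetrySource, scenario: str, gpus: int, nodes: int) -> str:
    """Consumer code depends only on the protocol, not on the concrete source."""
    report = source.collect(scenario, gpus, nodes)
    return f"{report.run_id}: {report.throughput_tokens_s:.1f} tokens/s"
```

A Kubernetes-backed source would implement the same `collect` method, so `summarize` and anything else downstream would not need to change.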

Repo Map

clusterscope/        Core simulator, diagnosis engine, API, and CLI
web/                 Dependency-free dashboard
k8s/                 Kubernetes lab templates for real-cluster experiments
docs/                Architecture, playbook, and incident-style reports
tests/               Unit tests for diagnosis and API-shaped payloads
scripts/             Local helper scripts
assets/              Logo and README visuals

NVIDIA Relevance

ClusterScope is designed around the full-stack skills that NVIDIA-facing cluster roles often emphasize:

| Skill area | How ClusterScope demonstrates it |
| --- | --- |
| Distributed AI/HPC systems | Multi-GPU scaling curves, step timing, and utilization analysis |
| Networking and communications | NCCL/all-reduce bottleneck detection and fabric-saturation evidence |
| Kubernetes/DevOps | GPU job manifests, telemetry DaemonSets, RBAC, and placement patches |
| Performance engineering | Before/after experiment framing and quantified efficiency targets |
| Developer empathy | Dashboard explanations that translate metrics into operator actions |

Real-Cluster Roadmap

  • Launch torchrun, nccl-tests, or MPI jobs through Kubernetes.
  • Parse NCCL_DEBUG=INFO, DCGM exporter metrics, Kubernetes events, and node network counters into RunMetrics.
  • Persist run reports to SQLite or object storage.
  • Re-run tuning experiments with changed batch size, placement, precision, and NCCL settings.
  • Generate Markdown incident reports automatically after failed scaling runs.
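As one illustration of the persistence item on this roadmap, run reports could be stored in SQLite as JSON payloads keyed by run ID. This is a sketch under assumptions: the table name, the columns, and the dictionary-shaped report are hypothetical and not taken from the project.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the report store; defaults to an in-memory database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS run_reports ("
        " run_id TEXT PRIMARY KEY,"
        " scenario TEXT NOT NULL,"
        " payload TEXT NOT NULL)"
    )
    return conn


def save_report(conn: sqlite3.Connection, report: dict) -> None:
    """Insert or overwrite one report; the whole dict is kept as JSON."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO run_reports VALUES (?, ?, ?)",
            (report["run_id"], report["scenario"], json.dumps(report)),
        )


def load_report(conn: sqlite3.Connection, run_id: str):
    """Return the stored report dict, or None if the run ID is unknown."""
    row = conn.execute(
        "SELECT payload FROM run_reports WHERE run_id = ?", (run_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Storing the payload as JSON keeps the schema stable while the report fields are still changing; dedicated columns can be added later for anything that needs to be queried.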

Useful Commands

# Start dashboard
python -m clusterscope.cli serve --port 8080

# Simulate a fabric issue
python -m clusterscope.cli simulate --scenario network_saturation --gpus 4 --nodes 2

# Simulate a worker restart
python -m clusterscope.cli simulate --scenario worker_failure --gpus 4 --nodes 2

# Export an incident-style report
python -m clusterscope.cli simulate --scenario network_saturation --markdown

# Ask the app-improvement agent council
python -m clusterscope.cli agents --scenario worker_failure

# Generate an operator remediation playbook
python -m clusterscope.cli plan --scenario network_saturation

# Run tests
python -m unittest discover -s tests
