---
sidebar_position: 5
title: Dashboard
description: Real-time web dashboard for monitoring and managing the Chimaera runtime cluster.
---

# Runtime Dashboard

The `context_visualizer` package provides a lightweight Flask web application that lets you inspect and manage a live Chimaera runtime cluster from your browser. It connects to the runtime using the same client API used by application code and surfaces cluster topology, per-node worker statistics, system resource utilization, block device stats, pool configuration, and the active YAML config.

## Prerequisites

- IOWarp installed with Python support (`WRP_CORE_ENABLE_PYTHON=ON`)
- A running Chimaera runtime (`chimaera runtime start`)
- Python dependencies: `flask`, `pyyaml`, `msgpack`

Install the Python dependencies with any of:

```bash
pip install flask pyyaml msgpack
# or
pip install iowarp-core[visualizer]
# or (conda)
conda install flask pyyaml python-msgpack
```

## Starting the Dashboard

```bash
python -m context_visualizer
```

Then open [http://127.0.0.1:5000](http://127.0.0.1:5000) in your browser.

### CLI Options

| Flag | Default | Description |
|------|---------|-------------|
| `--host` | `127.0.0.1` | Bind address. Use `0.0.0.0` to expose on all interfaces. |
| `--port` | `5000` | Listen port. |
| `--debug` | *(off)* | Enable Flask debug mode (auto-reload, verbose errors). |

```bash
# Expose on all interfaces, non-default port
python -m context_visualizer --host 0.0.0.0 --port 8080

# Debug mode (development only)
python -m context_visualizer --debug
```

## Pages

### Topology (`/`) {#topology}

The landing page shows a live grid of all nodes in the cluster. Each node card displays:

- **Hostname** and **IP address**
- **Status badge** (alive)
- **CPU**, **RAM**, and **GPU** utilization bars (GPU shown only when GPUs are present)
- **Restart** and **Shutdown** action buttons

The search bar supports filtering by node ID (single `3`, range `1-20`, comma-separated `1,3,5`) or by hostname/IP substring.

Clicking a node card navigates to the per-node detail page.
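The search grammar above can be sketched in a few lines of Python. This is a hypothetical illustration of the documented behavior, not the dashboard's actual code; the function and parameter names are invented:

```python
# Sketch of the topology search filter: a query is first tried as node IDs
# (single "3", range "1-20", list "1,3,5"); if it does not parse as IDs,
# it falls back to a hostname/IP substring match.
def node_matches(query: str, node_id: int, hostname: str, ip: str) -> bool:
    query = query.strip()
    ids = set()
    try:
        for part in query.split(","):
            if "-" in part:
                lo, hi = part.split("-", 1)
                ids.update(range(int(lo), int(hi) + 1))
            else:
                ids.add(int(part))
        return node_id in ids
    except ValueError:
        # Not an ID expression: substring match on hostname or IP
        return query in hostname or query in ip
```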
### Node Detail (`/node/<id>`) {#node-detail}

A per-node drilldown page showing:

- **Worker statistics** — per-worker queue depth, blocked tasks, processed count, and more
- **System stats** — time-series CPU, RAM, GPU, and HBM utilization
- **Block device stats** — per-bdev pool throughput and capacity

### Pools (`/pools`)

Lists all pools defined in the `compose` section of the active configuration file:

| Column | Description |
|--------|-------------|
| **Module** | ChiMod shared-library name (`mod_name`) |
| **Pool Name** | User-defined pool name |
| **Pool ID** | Unique pool identifier |
| **Query** | Routing policy (`local`, `dynamic`, `broadcast`) |

### Config (`/config`)

Displays the full contents of the active YAML configuration file as formatted JSON, for quick inspection without opening a terminal.

## REST API

All pages are backed by a JSON API. You can query these endpoints directly for scripting or integration with other monitoring tools.

### Cluster-wide

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/topology` | GET | List all nodes with hostname, IP, CPU/RAM/GPU utilization |
| `/api/system` | GET | High-level system overview (connected, worker/queue/blocked/processed counts) |
| `/api/workers` | GET | Per-worker stats plus a fleet summary (local node) |
| `/api/pools` | GET | Pool list from the `compose` section of the config |
| `/api/config` | GET | Full active configuration as JSON |

### Per-node

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/node/<id>/workers` | GET | Worker stats for a specific node |
| `/api/node/<id>/system_stats` | GET | System resource utilization entries for a specific node |
| `/api/node/<id>/bdev_stats` | GET | Block device stats for a specific node |

### Node Management

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/topology/node/<id>/shutdown` | POST | Gracefully shut down a node via SSH |
| `/api/topology/node/<id>/restart` | POST | Restart a node via SSH |

Shutdown and restart are performed by SSHing from the dashboard host to the target node and running `chimaera runtime stop` or `chimaera runtime restart`. This avoids the problem of a node killing itself mid-RPC. The SSH connection uses `StrictHostKeyChecking=no` and `ConnectTimeout=5`.

**Shutdown response:**
```json
{
  "success": true,
  "returncode": 0,
  "stdout": "",
  "stderr": ""
}
```

Exit codes `0` and `134` (SIGABRT from `std::abort()` in `InitiateShutdown`) are both treated as success.

**Restart** uses `nohup` so the SSH session returns immediately while the node restarts in the background.
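Put together, the management flow above can be sketched in a few lines of Python. This illustrates the documented behavior rather than reproducing the dashboard's source; the helper names and the `user` default are assumptions:

```python
import subprocess

def build_ssh_argv(host: str, remote_cmd: str, user: str = "root") -> list:
    # SSH options match those documented above
    return ["ssh", "-o", "StrictHostKeyChecking=no", "-o", "ConnectTimeout=5",
            f"{user}@{host}", remote_cmd]

def is_shutdown_success(returncode: int) -> bool:
    # 0 = clean exit; 134 = SIGABRT from std::abort() in InitiateShutdown
    return returncode in (0, 134)

def shutdown_node(host: str) -> dict:
    # Runs "chimaera runtime stop" on the target node over SSH and
    # mirrors the JSON shape of the shutdown response shown above.
    r = subprocess.run(build_ssh_argv(host, "chimaera runtime stop"),
                       capture_output=True, text=True)
    return {"success": is_shutdown_success(r.returncode),
            "returncode": r.returncode, "stdout": r.stdout, "stderr": r.stderr}
```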
All endpoints return `Content-Type: application/json`. On error they return an appropriate HTTP status code (e.g., `503` if the runtime is unreachable, `404` if a node is not found) with an `"error"` field in the response body.

### Examples

```bash
# Get cluster topology
curl http://127.0.0.1:5000/api/topology

# Get system overview
curl http://127.0.0.1:5000/api/system

# Get worker stats for node 2
curl http://127.0.0.1:5000/api/node/2/workers

# Shut down node 3
curl -X POST http://127.0.0.1:5000/api/topology/node/3/shutdown

# Restart node 3
curl -X POST http://127.0.0.1:5000/api/topology/node/3/restart
```
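The same endpoints can be scripted from Python using only the standard library. A minimal sketch (helper names are our own; the base URL assumes the default `--host`/`--port`):

```python
import json
import urllib.error
import urllib.request

BASE = "http://127.0.0.1:5000"

def node_stat_url(node_id: int, stat: str) -> str:
    # stat is one of: workers, system_stats, bdev_stats
    return f"{BASE}/api/node/{node_id}/{stat}"

def api_get(path: str):
    # Error responses still carry a JSON body with an "error" field
    # (e.g. 503 when the runtime is unreachable, 404 for an unknown node).
    try:
        with urllib.request.urlopen(BASE + path, timeout=5) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as err:
        return json.loads(err.read())

# Usage: api_get("/api/system"), api_get(node_stat_url(2, "workers"))
```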
## Configuration File Discovery

The dashboard reads the same config file as the runtime, using the same search order:

| Source | Priority |
|--------|----------|
| `CHI_SERVER_CONF` environment variable | **1st** |
| `WRP_RUNTIME_CONF` environment variable | **2nd** |
| `~/.chimaera/chimaera.yaml` | **3rd** |

See [Configuration](./configuration) for details on the config file format.
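If your own tooling needs to locate the same file, the documented precedence can be mirrored in a short helper (a sketch of the search order above, not the package's own lookup code):

```python
import os
from typing import Optional

def find_runtime_config() -> Optional[str]:
    # Precedence documented above: CHI_SERVER_CONF, then WRP_RUNTIME_CONF,
    # then the default path under the home directory.
    for var in ("CHI_SERVER_CONF", "WRP_RUNTIME_CONF"):
        path = os.environ.get(var)
        if path:
            return path
    default = os.path.expanduser("~/.chimaera/chimaera.yaml")
    return default if os.path.exists(default) else None
```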
## Connection Lifecycle

The dashboard connects to the runtime lazily — on the first request that needs live data. If the runtime is not yet running when the dashboard starts, it will show a disconnected state and retry on subsequent requests. Shutdown is handled automatically via `atexit` so the client is finalized cleanly when the server process exits.
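In scripts that start the runtime and dashboard together, it can help to wait for that first successful connection. A small poll against `/api/system` (whose payload includes a `connected` flag) might look like the following sketch, which is not part of the package:

```python
import json
import time
import urllib.error
import urllib.request

def wait_until_connected(base: str = "http://127.0.0.1:5000",
                         timeout: float = 60.0) -> bool:
    # Poll /api/system until the dashboard reports a live runtime connection.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base + "/api/system", timeout=5) as r:
                if json.loads(r.read()).get("connected"):
                    return True
        except (OSError, ValueError):
            pass  # dashboard not up or runtime disconnected; retry
        time.sleep(2.0)
    return False
```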
## Docker / Remote Access

When running the runtime inside Docker or on a remote host, bind the dashboard to all interfaces and forward the port:

```bash
# On the host running the runtime
python -m context_visualizer --host 0.0.0.0 --port 5000
```

```yaml
# docker-compose.yml — expose the dashboard port alongside the runtime
services:
  iowarp:
    image: iowarp/deploy-cpu:latest
    ports:
      - "9413:9413"  # Chimaera RPC
      - "5000:5000"  # Dashboard
    command: >
      bash -c "chimaera runtime start &
               python -m context_visualizer --host 0.0.0.0"
```

:::warning
The dashboard has no authentication. Do not expose it on a public network without a reverse proxy that enforces access control.
:::

## Try It: Interactive Docker Cluster {#interactive-cluster}

An interactive test environment is provided that spins up a **4-node Chimaera cluster** with the dashboard so you can explore all features from your browser.

### Location

```
context-runtime/test/integration/interactive/
├── docker-compose.yml   # 4-node runtime cluster
├── hostfile             # Node IP addresses (172.28.0.10-13)
├── wrp_conf.yaml        # Runtime configuration
└── run.sh               # Launcher script
```

### How It Works

- **4 Docker containers** (`iowarp-interactive-node1` through `node4`) run the Chimaera runtime on a private `172.28.0.0/16` network, each with `sshd` for SSH-based shutdown/restart
- **Node 1** also runs the dashboard alongside its runtime
- The script connects the devcontainer to the Docker network and starts a local port-forward so that `localhost:5000` reaches the dashboard inside Docker — VS Code then auto-forwards this to your host browser
- SSH keys are distributed via a shared Docker volume so the dashboard can authenticate to all nodes

### Running

```bash
cd context-runtime/test/integration/interactive

# Foreground (Ctrl-C to stop)
bash run.sh

# Or run in the background
bash run.sh start

# Follow runtime container logs
bash run.sh logs

# Stop everything (cluster + dashboard)
bash run.sh stop
```

Once the cluster is up (~15 seconds), open [http://localhost:5000](http://localhost:5000) to browse the topology, click into individual nodes, and use the Restart/Shutdown buttons.

If running from a devcontainer or a host where the workspace is at a different path, set `HOST_WORKSPACE`:

```bash
HOST_WORKSPACE=/host/path/to/workspace bash run.sh
```