init observable by JimyMa · Pull Request #91 · DeepLink-org/DLSlime

JimyMa · 2026-05-09T10:03:26Z

Observability v0: Low-Overhead Distributed Reported Status

Motivation

DLSlime currently lacks runtime observability after the RDMA data plane starts. When a transfer becomes slow, stalls, or fails, it is hard to answer basic operational questions:

Which PeerAgent is producing traffic?
Which local NIC is carrying the work?
How many semantic read/write operations are pending?
Are failures coming from CQ errors, post failures, or MR lifecycle problems?
Which directed peer/NIC connections exist in the current cluster?

This PR implements Observability v0: a low-overhead distributed reported-status system for DLSlime.

The goal is not to add full tracing, OpenTelemetry, or Prometheus in this PR. The goal is to make DLSlime’s own transfer runtime visible through bounded counters and Redis snapshots, while keeping the RDMA hot path minimal.

Architecture

┌──────────────────────── Per PeerAgent Process ───────────────────────┐
│                                                                       │
│  RDMAEndpoint        RDMAChannel        RDMAContext / CQ              │
│   ↓ semantic submit   ↓ transport post   ↓ CQ error signal            │
│                                                                       │
│  EndpointOpState owns semantic completion-once accounting             │
│                                                                       │
│   └──────────── relaxed atomic counters (alignas 64) ───────────────┘ │
│                              ↓                                        │
│                    obs_snapshot_json() [pybind11]                     │
│                              ↓                                        │
│                    ObsReporter daemon thread                          │
│                              ↓                                        │
│          SET {scope}:obs:peer:{peer_id} + PEXPIRE                     │
│                              ↓                                        │
└──────────────────────────── Redis ───────────────────────────────────┘
                               ↓
                 nanoctrl obs status / peers / nics / links

The data flow is:

C++ relaxed atomic counters
  -> Python PeerAgent snapshot reporter
  -> Redis reported snapshot
  -> nanoctrl obs CLI

No Redis writes, JSON serialization, Prometheus export, Python calls, mutexes, or dynamic label construction are performed in the RDMA I/O hot path.

Performance Guarantees

| Scenario | Behavior |
| --- | --- |
| `DLSLIME_OBS=0` | Observability is disabled. Record functions return after a gated check. |
| `DLSLIME_OBS=1` | RDMA hot path only performs bounded relaxed atomic updates. |
| Redis reporting | Done by a background PeerAgent reporter once per `DLSLIME_OBS_TIME_STEP_MS`. |
| Snapshot construction | Slow path only. JSON is built only during snapshot reporting. |
| Prometheus / OpenTelemetry | Not included in this PR. |

What is Included

A. C++ Observability Counters

New target and files:

dlslime/csrc/observability/obs.h
dlslime/csrc/observability/obs.cpp
dlslime/csrc/observability/CMakeLists.txt

The observability layer provides:

peer-level aggregate counters
per-NIC aggregate counters
per-op pending breakdown
user/system MR accounting
CQ error counters
transport post counters
JSON snapshot construction

The new _slime_obs shared library is linked into both _slime_c and _slime_rdma, so observability symbols are available even when BUILD_RDMA=OFF.

B. EndpointOpState-Based Semantic Accounting

Semantic accounting is now anchored on EndpointOpState, not on RDMAAssign or raw CQ completions.

EndpointOpState now carries lightweight observability metadata:

uint64_t obs_bytes;
uint32_t obs_assign_count;
uint8_t  obs_op;
uint16_t obs_nic_id;
bool     obs_enabled;
std::atomic<bool> obs_completed;

This ensures that one user-visible semantic operation records completion exactly once, regardless of:

number of QPs
number of CQEs
number of transport slots
callback race order
cancellation path

The semantic accounting invariant is:

one read/write/writeWithImm submit
  -> pending_ops += 1
  -> exactly one complete/fail/cancel path
  -> pending_ops -= 1

Success is recorded only when the final slot has completed across all QPs. Failure and cancellation go through the same obs_completed.exchange(true) guard, so whichever path flips the flag first records the outcome and every later path is a no-op.
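The completion-once guard can be modeled in a few lines. The following is a minimal Python sketch of the idea, not the C++ implementation: `OpState` and `Counters` are hypothetical stand-ins, and a lock emulates `std::atomic<bool>::exchange(true)`.

```python
import threading


class OpState:
    """Hypothetical stand-in for the obs fields carried by EndpointOpState."""

    def __init__(self, nbytes: int) -> None:
        self.obs_bytes = nbytes
        self._completed = False
        self._lock = threading.Lock()

    def exchange_completed(self) -> bool:
        # Emulates obs_completed.exchange(true): set the flag and return
        # the value it held before the call.
        with self._lock:
            previous = self._completed
            self._completed = True
            return previous


class Counters:
    """Hypothetical aggregate counters for one peer."""

    def __init__(self) -> None:
        self.pending_ops = 0
        self.completed_bytes = 0
        self.failed_bytes = 0
        self._lock = threading.Lock()

    def submit(self, op: OpState) -> None:
        with self._lock:
            self.pending_ops += 1

    def finish(self, op: OpState, ok: bool) -> bool:
        # Only the first complete/fail/cancel path gets past the guard.
        if op.exchange_completed():
            return False
        with self._lock:
            self.pending_ops -= 1
            if ok:
                self.completed_bytes += op.obs_bytes
            else:
                self.failed_bytes += op.obs_bytes
        return True
```

Calling `finish` several times for the same operation, from any mix of success, failure, and cancellation paths, decrements `pending_ops` once and attributes the bytes once.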

C. v0 Semantic Scope: One-Sided Ops Only

Observability v0 reports semantic submit/completion for:

read
write
writeWithImm

Two-sided operations are intentionally excluded from semantic pending accounting in v0:

send
recv
immRecv

Their completion paths are not yet integrated with the EndpointOpState completion-once accounting. They can still contribute to transport-level post counters through RDMAChannel::post_*_batch.

D. Transport-Level Post Counters

RDMAChannel records transport-level posting statistics:

post_batch_total
post_wr_total
post_bytes_total
post_failures_total

These are per-local-NIC counters and do not affect semantic pending counts.

E. CQ Error Counters

RDMAContext::cq_poll_handle() no longer records semantic completion.

It only records CQ-level error signals, while semantic completion/failure is handled in the endpoint callback through EndpointOpState.

F. MR Lifecycle Accounting

MR observability is kept because registration/unregistration is a slow path and useful for resource debugging.

The PR tracks:

user_mr_count
user_mr_bytes
sys_mr_count
sys_mr_bytes

Compatibility aliases are kept:

mr_count == user_mr_count
mr_bytes == user_mr_bytes

MRs whose names start with sys. are treated as system/internal MRs, such as:

sys.io_dummy
sys.msg_dummy
sys.send_ctx

Re-registering an MR with a larger size adds only the size delta to the MR byte counter, not the full new size.
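The user/system split and the size-delta rule can be sketched as follows. This is an illustrative Python model, not the C++ code in obs.cpp: the class name `MRAccounting` and the MR name `kv_cache` are hypothetical, and applying a signed delta when an MR is re-registered with a smaller size, as well as decrementing on unregister, are assumptions the description does not spell out.

```python
class MRAccounting:
    """Illustrative model of user/system MR counters."""

    def __init__(self) -> None:
        self._sizes: dict[str, int] = {}  # MR name -> registered bytes
        self.counts = {"user": 0, "sys": 0}
        self.bytes = {"user": 0, "sys": 0}

    @staticmethod
    def kind(name: str) -> str:
        # Names starting with "sys." are system/internal MRs.
        return "sys" if name.startswith("sys.") else "user"

    def register(self, name: str, nbytes: int) -> None:
        kind = self.kind(name)
        previous = self._sizes.get(name)
        if previous is None:
            self.counts[kind] += 1
            self.bytes[kind] += nbytes
        else:
            # Re-registration: add only the size delta (assumed signed).
            self.bytes[kind] += nbytes - previous
        self._sizes[name] = nbytes

    def unregister(self, name: str) -> None:
        nbytes = self._sizes.pop(name, None)
        if nbytes is None:
            return  # unknown MR, nothing to account
        kind = self.kind(name)
        self.counts[kind] -= 1
        self.bytes[kind] -= nbytes

    def snapshot(self) -> dict:
        return {
            "user_mr_count": self.counts["user"],
            "user_mr_bytes": self.bytes["user"],
            "sys_mr_count": self.counts["sys"],
            "sys_mr_bytes": self.bytes["sys"],
            # Compatibility aliases mirror the user counters.
            "mr_count": self.counts["user"],
            "mr_bytes": self.bytes["user"],
        }
```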

G. PeerAgent Redis Snapshot Reporter

New file:

dlslime/peer_agent/_accounting.py

When DLSLIME_OBS=1, each PeerAgent starts an ObsReporter daemon thread.

The reporter periodically:

calls _slime_c.obs_snapshot()
adds PeerAgent metadata
adds connection catalog information
computes peer-level EWMA bandwidth
computes per-NIC EWMA bandwidth
writes a bounded JSON snapshot to Redis
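The EWMA bandwidth step in the list above can be sketched like this. It is a minimal illustration, not the code in _accounting.py: the class name, the smoothing factor `alpha=0.3`, and seeding the average with the first measured rate are all assumptions. The description also does not say whether `bps` means bytes or bits per second; this sketch uses bytes per second.

```python
class EwmaBandwidth:
    """Exponentially weighted moving average over a cumulative byte counter."""

    def __init__(self, alpha: float = 0.3) -> None:
        self.alpha = alpha          # hypothetical smoothing factor
        self.value_bps = 0.0        # bytes per second in this sketch
        self._seeded = False
        self._last_bytes = None
        self._last_ms = None

    def update(self, completed_bytes_total: int, now_ms: int) -> float:
        if self._last_bytes is not None and now_ms > self._last_ms:
            delta_bytes = completed_bytes_total - self._last_bytes
            elapsed_ms = now_ms - self._last_ms
            instantaneous = delta_bytes * 1000.0 / elapsed_ms
            if not self._seeded:
                # First measurable interval: start from the observed rate.
                self.value_bps = instantaneous
                self._seeded = True
            else:
                self.value_bps = (self.alpha * instantaneous
                                  + (1.0 - self.alpha) * self.value_bps)
        self._last_bytes = completed_bytes_total
        self._last_ms = now_ms
        return self.value_bps
```

One instance per peer and one per NIC, each fed with its own `completed_bytes_total` on every reporter tick, would yield the peer-level and per-NIC values.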

Redis key schema:

{scope}:obs:peer:{peer_id}

Alive snapshots use a TTL floor of 180 seconds, so a snapshot from a peer that has stopped reporting remains visible as stale until Redis expires the key.
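A minimal sketch of the write path, assuming a redis-py style client that exposes `set` and `pexpire`. The helper names and the `multiplier` used above the floor are hypothetical, and the braces in the key schema are treated as placeholders.

```python
import json

TTL_FLOOR_MS = 180_000  # the 180-second floor described above


def snapshot_key(scope: str, peer_id: str) -> str:
    # Fills in the {scope}:obs:peer:{peer_id} schema.
    return f"{scope}:obs:peer:{peer_id}"


def snapshot_ttl_ms(time_step_ms: int, multiplier: int = 3) -> int:
    # Hypothetical policy: a few reporting intervals, never below the floor.
    return max(TTL_FLOOR_MS, multiplier * time_step_ms)


def write_snapshot(client, scope: str, snapshot: dict, time_step_ms: int) -> str:
    key = snapshot_key(scope, snapshot["peer_id"])
    # SET followed by PEXPIRE, matching the architecture diagram.
    client.set(key, json.dumps(snapshot))
    client.pexpire(key, snapshot_ttl_ms(time_step_ms))
    return key
```

With redis-py, `client.set(key, value, px=ttl_ms)` would apply the value and the TTL in a single command instead of two.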

On graceful shutdown, the reporter emits one final snapshot with:

"status": "stopped"

This lets nanoctrl obs peers distinguish:

alive   -> actively reporting
stale   -> no fresh snapshot within stale_ms
stopped -> graceful shutdown
gone    -> Redis key expired
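The four states can be derived from a snapshot and the current time. The function below is a Python sketch of that rule, not the NanoCtrl implementation; the name `derive_state` is hypothetical, and it assumes a missing key is passed in as `None`.

```python
from typing import Optional


def derive_state(snapshot: Optional[dict], now_ms: int, stale_ms: int) -> str:
    """Classify a peer from its last reported snapshot."""
    if snapshot is None:
        # The Redis key has expired or was never written.
        return "gone"
    if snapshot.get("status") == "stopped":
        # Final snapshot emitted on graceful shutdown.
        return "stopped"
    age_ms = now_ms - snapshot["reported_at_ms"]
    return "stale" if age_ms > stale_ms else "alive"
```

A hard-killed peer therefore moves from alive to stale once `stale_ms` passes without a fresh report, and to gone once the key's TTL runs out.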

H. NanoCtrl CLI

This PR adds:

nanoctrl obs status
nanoctrl obs peers
nanoctrl obs nics
nanoctrl obs links

All commands support:

--scope <scope>
--stale-ms <milliseconds>
--json

The CLI uses Redis SCAN + MGET, not KEYS.
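SCAN iterates the keyspace incrementally with a cursor, whereas KEYS walks it in one blocking call. The access pattern can be sketched in Python as follows. The CLI itself lives in NanoCtrl, so this only illustrates the pattern: `load_snapshots` and the batch size are hypothetical, and the client is assumed to behave like redis-py (`scan_iter`, `mget`).

```python
import json


def load_snapshots(client, scope: str, batch: int = 256) -> list:
    """Collect all peer snapshots under a scope using SCAN + MGET."""
    pattern = f"{scope}:obs:peer:*"
    # scan_iter drives the SCAN cursor; count is only a hint to the server.
    keys = list(client.scan_iter(match=pattern, count=batch))

    snapshots = []
    for start in range(0, len(keys), batch):
        chunk = keys[start:start + batch]
        for raw in client.mget(chunk):
            if raw is None:
                # Key expired between SCAN and MGET; skip it.
                continue
            snapshots.append(json.loads(raw))
    return snapshots
```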

`nanoctrl obs status`

Cluster-level summary:

Peers: alive / stale / stopped
Total assign
Total batch
Completed bytes
Pending ops
Error total
Estimated EWMA bandwidth

`nanoctrl obs peers`

PeerAgent-level summary:

PEER  HOST  PID  AGE  STATE  ASSIGN  BATCH  BW  BYTES  PENDING  ERRORS

`nanoctrl obs nics`

Local NIC aggregate view under each PeerAgent:

PEER  NIC  AGE  ASSIGN  BATCH  BW  BYTES  PENDING  ERRORS  POST_BYTES  POST_FAIL  CQ_ERR

Definitions:

BW: per-NIC EWMA bandwidth
BYTES: semantic completed bytes
POST_BYTES: transport-level posted bytes
CQ_ERR: CQ error count

This is a local-NIC aggregate view, not a peer-pair traffic matrix.

`nanoctrl obs links`

Directed connection catalog:

SRC_PEER  SRC_NIC  DST_PEER  DST_NIC  STATE  BW  BYTES  PENDING  ERRORS

In v0, obs links uses each PeerAgent’s reported connections list. It shows the directed connection relationship and state.

Per-link traffic counters are not included in this PR. Therefore:

BW / BYTES / PENDING / ERRORS

render as - in obs links.

Per-link traffic accounting is deferred to a follow-up PR.

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `DLSLIME_OBS` | disabled | Set to `1` to enable observability. |
| `DLSLIME_OBS_TIME_STEP_MS` | `1000` | PeerAgent snapshot interval. |
| `DLSLIME_OBS_REDIS` | `1` | Set to `0` to collect counters without Redis reporting. |
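As an illustration of how these variables combine, here is a small Python sketch that reads them with the defaults from the table. The function name and the exact parsing rules (for example, treating only the literal `1` as enabled) are assumptions, not the project's actual parsing code.

```python
import os
from typing import Mapping, Optional


def read_obs_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the observability settings, falling back to documented defaults."""
    env = os.environ if env is None else env
    return {
        # Disabled unless explicitly set to "1".
        "enabled": env.get("DLSLIME_OBS", "0") == "1",
        # Snapshot interval in milliseconds.
        "time_step_ms": int(env.get("DLSLIME_OBS_TIME_STEP_MS", "1000")),
        # Redis reporting stays on unless set to "0".
        "redis_reporting": env.get("DLSLIME_OBS_REDIS", "1") != "0",
    }
```

Setting `DLSLIME_OBS=1` together with `DLSLIME_OBS_REDIS=0` would then keep the counters active while skipping the Redis reporter.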

Snapshot Schema

Example Redis snapshot:

{
  "schema_version": 1,
  "session_id": "agent-0:12345:1715000000000",
  "peer_id": "agent-0",
  "host": "node-0",
  "pid": 12345,
  "reported_at_ms": 1715000000000,
  "summary": {
    "assign_total": 100,
    "batch_total": 10,
    "submitted_bytes_total": 10485760,
    "completed_bytes_total": 10485760,
    "failed_bytes_total": 0,
    "pending_ops": 0,
    "pending_by_op": {
      "read": 0,
      "write": 0,
      "write_with_imm": 0,
      "send": 0,
      "recv": 0,
      "imm_recv": 0
    },
    "error_total": 0,
    "user_mr_count": 2,
    "user_mr_bytes": 1048576,
    "sys_mr_count": 3,
    "sys_mr_bytes": 4096,
    "mr_count": 2,
    "mr_bytes": 1048576
  },
  "nics": [
    {
      "nic": "mlx5_0",
      "nic_bdf": "",
      "assign_total": 100,
      "batch_total": 10,
      "completed_bytes_total": 10485760,
      "pending_ops": 0,
      "error_total": 0,
      "post_bytes_total": 10485760,
      "post_failures_total": 0,
      "cq_errors_total": 0,
      "ewma_bandwidth_bps": 123456789.0
    }
  ],
  "connections": [
    {
      "conn_id": "agent-0:mlx5_0->agent-1:mlx5_1",
      "peer": "agent-1",
      "local_nic": "mlx5_0",
      "remote_nic": "mlx5_1",
      "state": "connected",
      "connected": true
    }
  ],
  "ewma_bandwidth_bps": 123456789.0
}

Tests

This PR adds or strengthens tests for:

C++ obs snapshot shape
per-op pending summary
user/system MR fields
concurrent NIC registration
send/recv/immRecv not recording semantic pending in v0
PeerAgent reporter lifecycle
Redis snapshot write + TTL
graceful shutdown status="stopped" snapshot
connection snapshot extraction from DirectedConnection
NanoCtrl snapshot parsing
stale/alive/stopped state derivation
link catalog parsing

How to Verify

Build

cmake -S . -B build-rdma-on -DBUILD_RDMA=ON -DBUILD_PYTHON=ON
cmake --build build-rdma-on -j

cmake -S . -B build-rdma-off -DBUILD_RDMA=OFF -DBUILD_PYTHON=ON
cmake --build build-rdma-off -j

Python tests

DLSLIME_OBS=1 pytest tests/python/test_obs_counters.py tests/python/test_obs_reporter.py -v

NanoCtrl tests

cd NanoCtrl
cargo test

Smoke test

Start Redis and NanoCtrl, then run a control-plane RDMA example with observability enabled:

export DLSLIME_OBS=1
export DLSLIME_OBS_TIME_STEP_MS=1000
export NANOCTRL_SCOPE=obs-test

python examples/python/p2p_rdma_rc_read_ctrl_plane.py

Query:

nanoctrl obs status --scope obs-test
nanoctrl obs peers --scope obs-test
nanoctrl obs nics --scope obs-test
nanoctrl obs links --scope obs-test

Expected behavior:

pending_ops returns to zero after workload settles.
completed_bytes_total is not multiplied by num_qp.
obs nics shows per-local-NIC aggregate traffic.
obs links shows directed connection catalog, with traffic fields rendered as -.
Graceful shutdown shows STATE=stopped.
Hard-kill or crash shows alive -> stale -> gone.

Not Included in This PR

Deferred follow-ups:

Prometheus exporter
OpenTelemetry
per-link traffic matrix
per-connection counters
error Redis streams
latency histograms
oldest pending ops
send/recv/immRecv semantic pending
hardware NIC counters
Grafana dashboard

Design Summary

This PR intentionally implements a bounded Observability v0.

It provides:

PeerAgent-level reported status
local-NIC aggregate traffic
one-sided semantic pending/completion accounting
MR lifecycle accounting
directed connection catalog

It does not attempt to implement full distributed tracing or peer-pair traffic accounting.

The key invariants are:

EndpointOpState owns semantic accounting.
RDMAAssign and CQ callbacks do not own semantic completion.
Redis receives bounded aggregate snapshots, not per-op traces.

init observable

6185295

JimyMa requested a deployment to self-hosted-rdma May 9, 2026 10:03 — with GitHub Actions Waiting

lint

2be3462

JimyMa requested a deployment to self-hosted-rdma May 9, 2026 11:07 — with GitHub Actions Waiting

refine observability

9fccb7e

JimyMa requested a deployment to self-hosted-rdma May 10, 2026 09:32 — with GitHub Actions Waiting

add multi qp test

53e3e65

JimyMa temporarily deployed to self-hosted-rdma May 10, 2026 10:04 — with GitHub Actions Inactive

JimyMa merged commit f9ffa05 into main May 10, 2026
4 checks passed

JimyMa merged 4 commits into main from observable
