init observable#91

Merged: JimyMa merged 4 commits into main from observable on May 10, 2026

@JimyMa JimyMa commented May 9, 2026

Observability v0: Low-Overhead Distributed Reported Status

Motivation

DLSlime currently lacks runtime observability after the RDMA data plane starts. When a transfer becomes slow, stalls, or fails, it is hard to answer basic operational questions:

  • Which PeerAgent is producing traffic?
  • Which local NIC is carrying the work?
  • How many semantic read/write operations are pending?
  • Are failures coming from CQ errors, post failures, or MR lifecycle problems?
  • Which directed peer/NIC connections exist in the current cluster?

This PR implements Observability v0: a low-overhead distributed reported-status system for DLSlime.

The goal is not to add full tracing, OpenTelemetry, or Prometheus in this PR. The goal is to make DLSlime’s own transfer runtime visible through bounded counters and Redis snapshots, while keeping the RDMA hot path minimal.


Architecture

┌──────────────────────── Per PeerAgent Process ───────────────────────┐
│                                                                       │
│  RDMAEndpoint        RDMAChannel        RDMAContext / CQ              │
│   ↓ semantic submit   ↓ transport post   ↓ CQ error signal            │
│                                                                       │
│  EndpointOpState owns semantic completion-once accounting             │
│                                                                       │
│   └──────────── relaxed atomic counters (alignas 64) ───────────────┘ │
│                              ↓                                        │
│                    obs_snapshot_json() [pybind11]                     │
│                              ↓                                        │
│                    ObsReporter daemon thread                          │
│                              ↓                                        │
│          SET {scope}:obs:peer:{peer_id} + PEXPIRE                     │
│                              ↓                                        │
└──────────────────────────── Redis ───────────────────────────────────┘
                               ↓
                 nanoctrl obs status / peers / nics / links

The data flow is:

C++ relaxed atomic counters
  -> Python PeerAgent snapshot reporter
  -> Redis reported snapshot
  -> nanoctrl obs CLI

No Redis writes, JSON serialization, Prometheus export, Python calls, mutexes, or dynamic label construction are performed in the RDMA I/O hot path.


Performance Guarantees

Scenario                    Behavior
DLSLIME_OBS=0               Observability is disabled. Record functions return after a gated check.
DLSLIME_OBS=1               RDMA hot path only performs bounded relaxed atomic updates.
Redis reporting             Done by a background PeerAgent reporter once per DLSLIME_OBS_TIME_STEP_MS.
Snapshot construction       Slow path only. JSON is built only during snapshot reporting.
Prometheus / OpenTelemetry  Not included in this PR.

What is Included

A. C++ Observability Counters

New target and files:

  • dlslime/csrc/observability/obs.h
  • dlslime/csrc/observability/obs.cpp
  • dlslime/csrc/observability/CMakeLists.txt

The observability layer provides:

  • peer-level aggregate counters
  • per-NIC aggregate counters
  • per-op pending breakdown
  • user/system MR accounting
  • CQ error counters
  • transport post counters
  • JSON snapshot construction

The new _slime_obs shared library is linked into both _slime_c and _slime_rdma, so observability symbols are available even when BUILD_RDMA=OFF.


B. EndpointOpState-Based Semantic Accounting

Semantic accounting is now anchored on EndpointOpState, not on RDMAAssign or raw CQ completions.

EndpointOpState now carries lightweight observability metadata:

uint64_t obs_bytes;
uint32_t obs_assign_count;
uint8_t  obs_op;
uint16_t obs_nic_id;
bool     obs_enabled;
std::atomic<bool> obs_completed;

This ensures that one user-visible semantic operation records completion exactly once, regardless of:

  • number of QPs
  • number of CQEs
  • number of transport slots
  • callback race order
  • cancellation path

The semantic accounting invariant is:

one read/write/writeWithImm submit
  -> pending_ops += 1
  -> exactly one complete/fail/cancel path
  -> pending_ops -= 1

Success is recorded only when the final slot has completed across all QPs. Failure and cancellation use the same obs_completed.exchange(true) guard.
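
As a rough illustration of the completion-once guard, the sketch below models it in Python. The real code is C++ and uses std::atomic<bool>::exchange; here a lock stands in for the atomic, and the class names OpState and Counters, along with their methods, are hypothetical and not taken from this PR.

```python
import threading


class OpState:
    """Hypothetical stand-in for the observability fields on EndpointOpState."""

    def __init__(self, nbytes):
        self.obs_bytes = nbytes
        self._completed = False
        self._lock = threading.Lock()

    def exchange_completed(self):
        # Models obs_completed.exchange(true): set the flag, return the old value.
        with self._lock:
            previous = self._completed
            self._completed = True
            return previous


class Counters:
    """Hypothetical aggregate counters; a lock replaces the relaxed atomics."""

    def __init__(self):
        self._lock = threading.Lock()
        self.pending_ops = 0
        self.completed_bytes_total = 0
        self.failed_bytes_total = 0

    def submit(self, op):
        with self._lock:
            self.pending_ops += 1

    def finish(self, op, ok):
        # Only the first complete/fail/cancel caller gets past the guard.
        if op.exchange_completed():
            return False
        with self._lock:
            self.pending_ops -= 1
            if ok:
                self.completed_bytes_total += op.obs_bytes
            else:
                self.failed_bytes_total += op.obs_bytes
        return True
```

However many callbacks race on the same operation, pending_ops is decremented once and the bytes are attributed to exactly one outcome.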


C. v0 Semantic Scope: One-Sided Ops Only

Observability v0 reports semantic submit/completion for:

  • read
  • write
  • writeWithImm

Two-sided operations are intentionally excluded from semantic pending accounting in v0:

  • send
  • recv
  • immRecv

Their completion paths are not yet integrated with the EndpointOpState completion-once accounting. They can still contribute to transport-level post counters through RDMAChannel::post_*_batch.


D. Transport-Level Post Counters

RDMAChannel records transport-level posting statistics:

  • post_batch_total
  • post_wr_total
  • post_bytes_total
  • post_failures_total

These counters are kept per local NIC and do not affect the semantic pending_ops count.


E. CQ Error Counters

RDMAContext::cq_poll_handle() no longer records semantic completion.

It only records CQ-level error signals, while semantic completion/failure is handled in the endpoint callback through EndpointOpState.


F. MR Lifecycle Accounting

MR observability is kept because registration/unregistration is a slow path and useful for resource debugging.

The PR tracks:

  • user_mr_count
  • user_mr_bytes
  • sys_mr_count
  • sys_mr_bytes

Compatibility aliases are kept:

  • mr_count == user_mr_count
  • mr_bytes == user_mr_bytes

MRs whose names start with sys. are treated as system/internal MRs, such as:

  • sys.io_dummy
  • sys.msg_dummy
  • sys.send_ctx

Re-registering a larger MR only adds the size delta to the MR byte counter.
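
The user/system split and the size-delta rule can be sketched as follows. This is a Python model, not the C++ code in obs.cpp: the MrAccounting class and its method names are hypothetical, and leaving the counters unchanged when an MR is re-registered at the same or a smaller size is an assumption, since the PR only describes the larger case.

```python
class MrAccounting:
    """Hypothetical model of user/system MR accounting."""

    def __init__(self):
        self._sizes = {}  # MR name -> currently accounted bytes
        self.counts = {"user": 0, "sys": 0}
        self.bytes = {"user": 0, "sys": 0}

    @staticmethod
    def kind(name):
        # Names starting with "sys." are system/internal MRs.
        return "sys" if name.startswith("sys.") else "user"

    def register(self, name, nbytes):
        kind = self.kind(name)
        old = self._sizes.get(name)
        if old is None:
            self.counts[kind] += 1
            self.bytes[kind] += nbytes
            self._sizes[name] = nbytes
        elif nbytes > old:
            # Re-registering a larger MR adds only the size delta.
            self.bytes[kind] += nbytes - old
            self._sizes[name] = nbytes
        # Assumption: same-size or smaller re-registration changes nothing.

    def unregister(self, name):
        old = self._sizes.pop(name, None)
        if old is not None:
            kind = self.kind(name)
            self.counts[kind] -= 1
            self.bytes[kind] -= old

    def summary(self):
        return {
            "user_mr_count": self.counts["user"],
            "user_mr_bytes": self.bytes["user"],
            "sys_mr_count": self.counts["sys"],
            "sys_mr_bytes": self.bytes["sys"],
            # Compatibility aliases mirror the user counters.
            "mr_count": self.counts["user"],
            "mr_bytes": self.bytes["user"],
        }
```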


G. PeerAgent Redis Snapshot Reporter

New file:

  • dlslime/peer_agent/_accounting.py

When DLSLIME_OBS=1, each PeerAgent starts an ObsReporter daemon thread.

The reporter periodically:

  1. calls _slime_c.obs_snapshot()
  2. adds PeerAgent metadata
  3. adds connection catalog information
  4. computes peer-level EWMA bandwidth
  5. computes per-NIC EWMA bandwidth
  6. writes a bounded JSON snapshot to Redis

Redis key schema:

{scope}:obs:peer:{peer_id}

Alive snapshots use a TTL floor of 180 seconds so that a peer that stops reporting remains visible as stale before Redis expires its key.
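
One reporting cycle might look roughly like the sketch below. It is not the code in _accounting.py: the names Ewma and report_once, the smoothing factor alpha, the 3 * step_ms term in the TTL, and the use of bytes per second for ewma_bandwidth_bps are all assumptions made for illustration. The client only needs set(key, value) and pexpire(key, ms), which redis-py provides.

```python
import json

TTL_FLOOR_MS = 180_000  # 180-second floor from the PR description


class Ewma:
    """Exponentially weighted moving average of a byte rate (hypothetical helper)."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha  # assumed smoothing factor
        self.value = None
        self._last_bytes = None
        self._last_ms = None

    def update(self, total_bytes, now_ms):
        if self._last_ms is not None and now_ms > self._last_ms:
            # Instantaneous rate in bytes per second since the previous sample.
            rate = (total_bytes - self._last_bytes) * 1000.0 / (now_ms - self._last_ms)
            if self.value is None:
                self.value = rate
            else:
                self.value = self.alpha * rate + (1 - self.alpha) * self.value
        self._last_bytes, self._last_ms = total_bytes, now_ms
        return self.value or 0.0


def report_once(client, scope, peer_id, summary, ewma, now_ms, step_ms, status=None):
    """Build one snapshot and write it with SET followed by PEXPIRE."""
    snapshot = {
        "schema_version": 1,
        "peer_id": peer_id,
        "reported_at_ms": now_ms,
        "summary": summary,
        "ewma_bandwidth_bps": ewma.update(summary["completed_bytes_total"], now_ms),
    }
    if status is not None:
        snapshot["status"] = status  # e.g. "stopped" on graceful shutdown
    key = f"{scope}:obs:peer:{peer_id}"
    ttl_ms = max(TTL_FLOOR_MS, 3 * step_ms)  # the multiplier 3 is an assumption
    client.set(key, json.dumps(snapshot))
    client.pexpire(key, ttl_ms)
    return key, ttl_ms, snapshot
```

In a real reporter this function would run on the daemon thread once per DLSLIME_OBS_TIME_STEP_MS, with summary taken from the C++ snapshot call.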

On graceful shutdown, the reporter emits one final snapshot with:

"status": "stopped"

This lets nanoctrl obs peers distinguish:

alive   -> actively reporting
stale   -> no fresh snapshot within stale_ms
stopped -> graceful shutdown
gone    -> Redis key expired
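
The four states can be derived from a snapshot with a few comparisons. The sketch below is a Python illustration rather than the NanoCtrl implementation; the function name derive_state is hypothetical, and treating an age exactly equal to stale_ms as still alive is an assumption.

```python
def derive_state(snapshot, now_ms, stale_ms):
    """Classify a peer from its latest snapshot (None means the key is missing)."""
    if snapshot is None:
        return "gone"  # Redis key expired or never written
    if snapshot.get("status") == "stopped":
        return "stopped"  # final snapshot from a graceful shutdown
    age_ms = now_ms - snapshot["reported_at_ms"]
    return "stale" if age_ms > stale_ms else "alive"
```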

H. NanoCtrl CLI

This PR adds:

nanoctrl obs status
nanoctrl obs peers
nanoctrl obs nics
nanoctrl obs links

All commands support:

--scope <scope>
--stale-ms <milliseconds>
--json

The CLI uses Redis SCAN + MGET, not KEYS.
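
For readers unfamiliar with the pattern, SCAN + MGET can be sketched in Python as below. The actual CLI lives in NanoCtrl and is not shown in this PR description, so the function name fetch_snapshots and the batch size are hypothetical. The client is expected to offer scan_iter(match=..., count=...) and mget(keys), as redis-py does; with redis-py, keys come back as str only when the client is created with decode_responses=True.

```python
import json


def fetch_snapshots(client, scope, batch=256):
    """Collect peer snapshots without blocking Redis on a KEYS call."""
    pattern = f"{scope}:obs:peer:*"
    # SCAN may return a key more than once, so de-duplicate while keeping order.
    keys = list(dict.fromkeys(client.scan_iter(match=pattern, count=batch)))
    snapshots = {}
    for start in range(0, len(keys), batch):
        chunk = keys[start:start + batch]
        for key, raw in zip(chunk, client.mget(chunk)):
            if raw is None:
                continue  # key expired between SCAN and MGET
            snapshots[key] = json.loads(raw)
    return snapshots
```

SCAN iterates the keyspace incrementally with a cursor, whereas KEYS walks it in a single blocking call, which is why SCAN is the safer choice on a shared Redis instance.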

nanoctrl obs status

Cluster-level summary:

Peers: alive / stale / stopped
Total assign
Total batch
Completed bytes
Pending ops
Error total
Estimated EWMA bandwidth

nanoctrl obs peers

PeerAgent-level summary:

PEER  HOST  PID  AGE  STATE  ASSIGN  BATCH  BW  BYTES  PENDING  ERRORS

nanoctrl obs nics

Local NIC aggregate view under each PeerAgent:

PEER  NIC  AGE  ASSIGN  BATCH  BW  BYTES  PENDING  ERRORS  POST_BYTES  POST_FAIL  CQ_ERR

Definitions:

  • BW: per-NIC EWMA bandwidth
  • BYTES: semantic completed bytes
  • POST_BYTES: transport-level posted bytes
  • CQ_ERR: CQ error count

This is a local-NIC aggregate view, not a peer-pair traffic matrix.

nanoctrl obs links

Directed connection catalog:

SRC_PEER  SRC_NIC  DST_PEER  DST_NIC  STATE  BW  BYTES  PENDING  ERRORS

In v0, obs links uses each PeerAgent’s reported connections list. It shows the directed connection relationship and state.

Per-link traffic counters are not included in this PR. Therefore:

BW / BYTES / PENDING / ERRORS

render as - in obs links.

Per-link traffic accounting is deferred to a follow-up PR.


Environment Variables

Variable                  Default   Description
DLSLIME_OBS               disabled  Set to 1 to enable observability.
DLSLIME_OBS_TIME_STEP_MS  1000      PeerAgent snapshot interval, in milliseconds.
DLSLIME_OBS_REDIS         1         Set to 0 to collect counters without Redis reporting.
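
A minimal sketch of reading these variables with their documented defaults is shown below. The function name load_obs_config and the returned dictionary keys are hypothetical, and the exact parsing rules in DLSlime (for example, how values other than 0 and 1 are treated) are not stated in this PR.

```python
import os


def load_obs_config(env=None):
    """Read the observability settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        # Disabled unless explicitly set to "1".
        "enabled": env.get("DLSLIME_OBS", "0") == "1",
        # Snapshot interval in milliseconds.
        "time_step_ms": int(env.get("DLSLIME_OBS_TIME_STEP_MS", "1000")),
        # Redis reporting stays on unless set to "0".
        "redis": env.get("DLSLIME_OBS_REDIS", "1") != "0",
    }
```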

Snapshot Schema

Example Redis snapshot:

{
  "schema_version": 1,
  "session_id": "agent-0:12345:1715000000000",
  "peer_id": "agent-0",
  "host": "node-0",
  "pid": 12345,
  "reported_at_ms": 1715000000000,
  "summary": {
    "assign_total": 100,
    "batch_total": 10,
    "submitted_bytes_total": 10485760,
    "completed_bytes_total": 10485760,
    "failed_bytes_total": 0,
    "pending_ops": 0,
    "pending_by_op": {
      "read": 0,
      "write": 0,
      "write_with_imm": 0,
      "send": 0,
      "recv": 0,
      "imm_recv": 0
    },
    "error_total": 0,
    "user_mr_count": 2,
    "user_mr_bytes": 1048576,
    "sys_mr_count": 3,
    "sys_mr_bytes": 4096,
    "mr_count": 2,
    "mr_bytes": 1048576
  },
  "nics": [
    {
      "nic": "mlx5_0",
      "nic_bdf": "",
      "assign_total": 100,
      "batch_total": 10,
      "completed_bytes_total": 10485760,
      "pending_ops": 0,
      "error_total": 0,
      "post_bytes_total": 10485760,
      "post_failures_total": 0,
      "cq_errors_total": 0,
      "ewma_bandwidth_bps": 123456789.0
    }
  ],
  "connections": [
    {
      "conn_id": "agent-0:mlx5_0->agent-1:mlx5_1",
      "peer": "agent-1",
      "local_nic": "mlx5_0",
      "remote_nic": "mlx5_1",
      "state": "connected",
      "connected": true
    }
  ],
  "ewma_bandwidth_bps": 123456789.0
}
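
Given snapshots of this shape, a cluster-level roll-up similar in spirit to nanoctrl obs status could be computed as sketched below. This is a Python illustration, not the NanoCtrl code: the function name cluster_status is hypothetical, and summing the per-peer EWMA values to estimate cluster bandwidth is an assumption.

```python
SUMMED_FIELDS = (
    "assign_total",
    "batch_total",
    "completed_bytes_total",
    "pending_ops",
    "error_total",
)


def cluster_status(snapshots):
    """Aggregate per-peer summaries into one cluster-level dictionary."""
    totals = {field: 0 for field in SUMMED_FIELDS}
    bandwidth = 0.0
    for snap in snapshots:
        summary = snap.get("summary", {})
        for field in SUMMED_FIELDS:
            totals[field] += summary.get(field, 0)
        bandwidth += snap.get("ewma_bandwidth_bps", 0.0)
    totals["ewma_bandwidth_bps"] = bandwidth
    totals["peers"] = len(snapshots)
    return totals
```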

Tests

This PR adds or strengthens tests for:

  • C++ obs snapshot shape
  • per-op pending summary
  • user/system MR fields
  • concurrent NIC registration
  • send/recv/immRecv not recording semantic pending in v0
  • PeerAgent reporter lifecycle
  • Redis snapshot write + TTL
  • graceful shutdown status="stopped" snapshot
  • connection snapshot extraction from DirectedConnection
  • NanoCtrl snapshot parsing
  • stale/alive/stopped state derivation
  • link catalog parsing

How to Verify

Build

cmake -S . -B build-rdma-on -DBUILD_RDMA=ON -DBUILD_PYTHON=ON
cmake --build build-rdma-on -j

cmake -S . -B build-rdma-off -DBUILD_RDMA=OFF -DBUILD_PYTHON=ON
cmake --build build-rdma-off -j

Python tests

DLSLIME_OBS=1 pytest tests/python/test_obs_counters.py tests/python/test_obs_reporter.py -v

NanoCtrl tests

cd NanoCtrl
cargo test

Smoke test

Start Redis and NanoCtrl, then run a control-plane RDMA example with observability enabled:

export DLSLIME_OBS=1
export DLSLIME_OBS_TIME_STEP_MS=1000
export NANOCTRL_SCOPE=obs-test

python examples/python/p2p_rdma_rc_read_ctrl_plane.py

Query:

nanoctrl obs status --scope obs-test
nanoctrl obs peers --scope obs-test
nanoctrl obs nics --scope obs-test
nanoctrl obs links --scope obs-test

Expected behavior:

  • pending_ops returns to zero after workload settles.
  • completed_bytes_total is not multiplied by num_qp.
  • obs nics shows per-local-NIC aggregate traffic.
  • obs links shows directed connection catalog, with traffic fields rendered as -.
  • Graceful shutdown shows STATE=stopped.
  • Hard-kill or crash shows alive -> stale -> gone.

Not Included in This PR

Deferred follow-ups:

  • Prometheus exporter
  • OpenTelemetry
  • per-link traffic matrix
  • per-connection counters
  • error Redis streams
  • latency histograms
  • oldest pending ops
  • send/recv/immRecv semantic pending
  • hardware NIC counters
  • Grafana dashboard

Design Summary

This PR intentionally implements a bounded Observability v0.

It provides:

PeerAgent-level reported status
local-NIC aggregate traffic
one-sided semantic pending/completion accounting
MR lifecycle accounting
directed connection catalog

It does not attempt to implement full distributed tracing or peer-pair traffic accounting.

The key invariant is:

EndpointOpState owns semantic accounting.
RDMAAssign and CQ callbacks do not own semantic completion.
Redis receives bounded aggregate snapshots, not per-op traces.

@JimyMa JimyMa requested a deployment to self-hosted-rdma May 9, 2026 10:03 — with GitHub Actions Waiting
@JimyMa JimyMa requested a deployment to self-hosted-rdma May 9, 2026 11:07 — with GitHub Actions Waiting
@JimyMa JimyMa requested a deployment to self-hosted-rdma May 10, 2026 09:32 — with GitHub Actions Waiting
@JimyMa JimyMa temporarily deployed to self-hosted-rdma May 10, 2026 10:04 — with GitHub Actions Inactive
@JimyMa JimyMa merged commit f9ffa05 into main May 10, 2026
4 checks passed