init observable #91
Merged
Observability v0: Low-Overhead Distributed Reported Status
Motivation
DLSlime currently lacks runtime observability after the RDMA data plane starts. When a transfer becomes slow, stalls, or fails, it is hard to answer basic operational questions:
This PR implements Observability v0: a low-overhead distributed reported-status system for DLSlime.
The goal is not to add full tracing, OpenTelemetry, or Prometheus in this PR. The goal is to make DLSlime’s own transfer runtime visible through bounded counters and Redis snapshots, while keeping the RDMA hot path minimal.
Architecture
The data flow is:
No Redis writes, JSON serialization, Prometheus export, Python calls, mutexes, or dynamic label construction are performed in the RDMA I/O hot path.
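The split between hot path and reporting can be pictured with a small Python model. This is only a sketch: the real counters live in C++, and the class and method names below are hypothetical. The hot-path methods do nothing but integer arithmetic, while serialization happens in a separate call that a periodic reporter would make.

```python
import json


class ObsCounters:
    """Hypothetical model of bounded counters; not the DLSlime C++ API."""

    def __init__(self):
        self.completed_bytes_total = 0
        self.pending_ops = 0

    # Hot path: integer arithmetic only, no I/O and no serialization.
    def on_submit(self):
        self.pending_ops += 1

    def on_complete(self, nbytes):
        self.pending_ops -= 1
        self.completed_bytes_total += nbytes

    # Slow path: called off the hot path, e.g. by a periodic reporter.
    def snapshot_json(self):
        return json.dumps(
            {
                "completed_bytes_total": self.completed_bytes_total,
                "pending_ops": self.pending_ops,
            },
            sort_keys=True,
        )


counters = ObsCounters()
counters.on_submit()
counters.on_complete(4096)
print(counters.snapshot_json())
```

Keeping the JSON step in its own method is what lets the reporting interval be tuned without touching the transfer path.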
Performance Guarantees
- `DLSLIME_OBS=0`
- `DLSLIME_OBS=1`
- `DLSLIME_OBS_TIME_STEP_MS`.
What is Included
A. C++ Observability Counters
New target and files:
- `dlslime/csrc/observability/obs.h`
- `dlslime/csrc/observability/obs.cpp`
- `dlslime/csrc/observability/CMakeLists.txt`
The observability layer provides:
The new `_slime_obs` shared library is linked into both `_slime_c` and `_slime_rdma`, so observability symbols are available even when `BUILD_RDMA=OFF`.
B. EndpointOpState-Based Semantic Accounting
Semantic accounting is now anchored on `EndpointOpState`, not on `RDMAAssign` or raw CQ completions.
`EndpointOpState` now carries lightweight observability metadata:
This ensures that one user-visible semantic operation records completion exactly once, regardless of:
The semantic accounting invariant is:
Success is recorded only when the final slot has completed across all QPs. Failure and cancellation use the same `obs_completed.exchange(true)` guard.
C. v0 Semantic Scope: One-Sided Ops Only
Observability v0 reports semantic submit/completion for:
- `read`
- `write`
- `writeWithImm`
Two-sided operations are intentionally excluded from semantic pending accounting in v0:
- `send`
- `recv`
- `immRecv`
Their completion paths are not yet integrated with the `EndpointOpState` completion-once accounting. They can still contribute to transport-level post counters through `RDMAChannel::post_*_batch`.
D. Transport-Level Post Counters
`RDMAChannel` records transport-level posting statistics:
- `post_batch_total`
- `post_wr_total`
- `post_bytes_total`
- `post_failures_total`
These are per-local-NIC counters and do not affect semantic pending.
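To illustrate how transport-level post counters stay separate from completion-once semantic accounting, here is a minimal Python model. It is a sketch under assumptions: `OpState` is a hypothetical stand-in for `EndpointOpState`, the operation is assumed to be split evenly across QPs, and a lock models the atomic `exchange(true)` that the C++ code uses.

```python
import threading


class OpState:
    """Hypothetical stand-in for EndpointOpState's observability metadata."""

    def __init__(self, nbytes, num_qp):
        self.nbytes = nbytes
        self.num_qp = num_qp
        self.remaining_slots = num_qp
        self._completed = False
        # Python has no atomic bool; the lock only models exchange(true).
        self._lock = threading.Lock()

    def exchange_completed(self):
        """Set the flag and return its previous value."""
        with self._lock:
            previous = self._completed
            self._completed = True
            return previous


semantic = {"completed_bytes_total": 0, "failed_bytes_total": 0, "pending_ops": 0}
transport = {"post_wr_total": 0, "post_bytes_total": 0}


def submit(op):
    semantic["pending_ops"] += 1          # one semantic op
    chunk = op.nbytes // op.num_qp        # assumed even split across QPs
    for _ in range(op.num_qp):
        transport["post_wr_total"] += 1   # one work request per QP
        transport["post_bytes_total"] += chunk


def on_slot_done(op):
    op.remaining_slots -= 1
    # Success is recorded only for the final slot, and only once.
    if op.remaining_slots == 0 and not op.exchange_completed():
        semantic["pending_ops"] -= 1
        semantic["completed_bytes_total"] += op.nbytes


def on_failure(op):
    # Failure shares the same guard, so it cannot double count.
    if not op.exchange_completed():
        semantic["pending_ops"] -= 1
        semantic["failed_bytes_total"] += op.nbytes


op = OpState(nbytes=4096, num_qp=4)
submit(op)
for _ in range(4):
    on_slot_done(op)
on_failure(op)  # arrives late; the guard turns it into a no-op
print(semantic, transport)
```

In this model four work requests are posted, yet the semantic byte count is recorded once and is not multiplied by the number of QPs.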
E. CQ Error Counters
`RDMAContext::cq_poll_handle()` no longer records semantic completion.
It only records CQ-level error signals, while semantic completion/failure is handled in the endpoint callback through `EndpointOpState`.
F. MR Lifecycle Accounting
MR observability is kept because registration/unregistration is a slow path and useful for resource debugging.
The PR tracks:
- `user_mr_count`
- `user_mr_bytes`
- `sys_mr_count`
- `sys_mr_bytes`
Compatibility aliases are kept:
- `mr_count == user_mr_count`
- `mr_bytes == user_mr_bytes`
MRs whose names start with `sys.` are treated as system/internal MRs, such as:
- `sys.io_dummy`
- `sys.msg_dummy`
- `sys.send_ctx`
Re-registering a larger MR only adds the size delta to the MR byte counter.
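The MR rules above can be sketched in a few lines of Python. This is a model, not the C++ implementation: the class name, the `kv_cache` MR name, and the behavior for same-size or smaller re-registration (no change) are assumptions made for illustration.

```python
class MrAccounting:
    """Hypothetical model of the MR counters; not the DLSlime C++ API."""

    def __init__(self):
        self._sizes = {}  # MR name -> currently accounted bytes
        self._c = {
            "user_mr_count": 0,
            "user_mr_bytes": 0,
            "sys_mr_count": 0,
            "sys_mr_bytes": 0,
        }

    @staticmethod
    def _kind(name):
        # Names starting with "sys." are system/internal MRs.
        return "sys" if name.startswith("sys.") else "user"

    def register(self, name, nbytes):
        kind = self._kind(name)
        old = self._sizes.get(name)
        if old is None:
            self._c[f"{kind}_mr_count"] += 1
            self._c[f"{kind}_mr_bytes"] += nbytes
            self._sizes[name] = nbytes
        elif nbytes > old:
            # Re-registering a larger MR adds only the size delta.
            self._c[f"{kind}_mr_bytes"] += nbytes - old
            self._sizes[name] = nbytes
        # Assumption: same-size or smaller re-registration changes nothing.

    def unregister(self, name):
        nbytes = self._sizes.pop(name, None)
        if nbytes is not None:
            kind = self._kind(name)
            self._c[f"{kind}_mr_count"] -= 1
            self._c[f"{kind}_mr_bytes"] -= nbytes

    def snapshot(self):
        out = dict(self._c)
        # Compatibility aliases mirror the user counters.
        out["mr_count"] = out["user_mr_count"]
        out["mr_bytes"] = out["user_mr_bytes"]
        return out


acct = MrAccounting()
acct.register("kv_cache", 1024)
acct.register("kv_cache", 4096)   # grows by 3072, count stays at 1
acct.register("sys.io_dummy", 64)
snap = acct.snapshot()
print(snap)
```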
G. PeerAgent Redis Snapshot Reporter
New file:
- `dlslime/peer_agent/_accounting.py`
When `DLSLIME_OBS=1`, each PeerAgent starts an `ObsReporter` daemon thread.
The reporter periodically:
- `_slime_c.obs_snapshot()`
Redis key schema:
Alive snapshots use a TTL floor of 180 seconds so that stale snapshots remain visible before Redis evicts them.
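A TTL floor and a reader-side liveness check could look roughly like the sketch below. Only the 180-second floor, the `reported_at_ms` field, and the `status="stopped"` marker for a final snapshot come from this PR; the function names, the `multiplier` and `stale_factor` knobs, and the staleness rule itself are assumptions for illustration.

```python
import math

TTL_FLOOR_S = 180  # floor stated in the PR description


def snapshot_ttl_seconds(time_step_ms, multiplier=3):
    """TTL for an alive snapshot; `multiplier` is an assumed knob."""
    return max(TTL_FLOOR_S, math.ceil(multiplier * time_step_ms / 1000))


def classify(snapshot, now_ms, time_step_ms, stale_factor=3):
    """Classify a peer from its last snapshot (None means the key expired)."""
    if snapshot is None:
        return "gone"
    if snapshot.get("status") == "stopped":
        return "stopped"
    age_ms = now_ms - snapshot["reported_at_ms"]
    # Assumed rule: a few missed reporting intervals means stale.
    return "stale" if age_ms > stale_factor * time_step_ms else "alive"


now = 1715000010000
fresh = {"reported_at_ms": now - 500}
old = {"reported_at_ms": now - 60000}
print(snapshot_ttl_seconds(1000), classify(fresh, now, 1000), classify(old, now, 1000))
```

Because the TTL is much longer than the reporting interval, a crashed peer is readable as stale for a while before its key expires and it becomes gone.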
On graceful shutdown, the reporter emits one final snapshot with:
This lets `nanoctrl obs peers` distinguish:
H. NanoCtrl CLI
This PR adds:
All commands support:
The CLI uses Redis `SCAN` + `MGET`, not `KEYS`.
`nanoctrl obs status`
Cluster-level summary:
`nanoctrl obs peers`
PeerAgent-level summary:
`nanoctrl obs nics`
Local NIC aggregate view under each PeerAgent:
Definitions:
- `BW`: per-NIC EWMA bandwidth
- `BYTES`: semantic completed bytes
- `POST_BYTES`: transport-level posted bytes
- `CQ_ERR`: CQ error count
This is a local-NIC aggregate view, not a peer-pair traffic matrix.
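For readers unfamiliar with EWMA, the `BW` column can be thought of as a smoothed rate computed from successive byte-counter deltas. The sketch below shows the standard update rule; the smoothing factor `alpha`, the function name, and the choice of bytes per second as the unit are assumptions, since the PR description does not state them.

```python
def ewma_update(prev_bps, delta_bytes, interval_ms, alpha=0.2):
    """Return the new EWMA bandwidth after one reporting interval.

    prev_bps is None before the first sample; alpha is an assumed
    smoothing factor, not a value taken from DLSlime.
    """
    sample_bps = delta_bytes * 1000.0 / interval_ms  # bytes per second
    if prev_bps is None:
        return sample_bps  # seed the average with the first sample
    return alpha * sample_bps + (1.0 - alpha) * prev_bps


bw = None
for delta in (1_000_000, 2_000_000, 0):
    bw = ewma_update(bw, delta, interval_ms=1000, alpha=0.5)
    print(bw)
```

A larger `alpha` tracks bursts more closely, while a smaller one gives a steadier number for an operator scanning the table.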
`nanoctrl obs links`
Directed connection catalog:
In v0, `obs links` uses each PeerAgent’s reported `connections` list. It shows the directed connection relationship and state.
Per-link traffic counters are not included in this PR. Therefore:
render as `-` in `obs links`.
Per-link traffic accounting is deferred to a follow-up PR.
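As a rough illustration of how a directed connection catalog can be derived from one snapshot, the sketch below builds one row per entry in `connections`. The field names come from the example snapshot in this PR; the function name and the column names (`SRC`, `DST`, `BYTES`, `BW`) are hypothetical, and the traffic columns are filled with `-` because v0 has no per-link counters.

```python
def link_rows(snapshot):
    """Build one display row per directed connection in a snapshot."""
    rows = []
    for conn in snapshot.get("connections", []):
        rows.append(
            {
                "SRC": f'{snapshot["peer_id"]}:{conn["local_nic"]}',
                "DST": f'{conn["peer"]}:{conn["remote_nic"]}',
                "STATE": conn["state"],
                # No per-link traffic accounting in v0.
                "BYTES": "-",
                "BW": "-",
            }
        )
    return rows


example = {
    "peer_id": "agent-0",
    "connections": [
        {
            "conn_id": "agent-0:mlx5_0->agent-1:mlx5_1",
            "peer": "agent-1",
            "local_nic": "mlx5_0",
            "remote_nic": "mlx5_1",
            "state": "connected",
            "connected": True,
        }
    ],
}

rows = link_rows(example)
for row in rows:
    print(f'{row["SRC"]} -> {row["DST"]}  {row["STATE"]}  {row["BYTES"]}  {row["BW"]}')
```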
Environment Variables
- `DLSLIME_OBS`: `1` to enable observability.
- `DLSLIME_OBS_TIME_STEP_MS`: `1000`
- `DLSLIME_OBS_REDIS`: `1`; `0` to collect counters without Redis reporting.
Snapshot Schema
Example Redis snapshot:
    {
      "schema_version": 1,
      "session_id": "agent-0:12345:1715000000000",
      "peer_id": "agent-0",
      "host": "node-0",
      "pid": 12345,
      "reported_at_ms": 1715000000000,
      "summary": {
        "assign_total": 100,
        "batch_total": 10,
        "submitted_bytes_total": 10485760,
        "completed_bytes_total": 10485760,
        "failed_bytes_total": 0,
        "pending_ops": 0,
        "pending_by_op": {
          "read": 0,
          "write": 0,
          "write_with_imm": 0,
          "send": 0,
          "recv": 0,
          "imm_recv": 0
        },
        "error_total": 0,
        "user_mr_count": 2,
        "user_mr_bytes": 1048576,
        "sys_mr_count": 3,
        "sys_mr_bytes": 4096,
        "mr_count": 2,
        "mr_bytes": 1048576
      },
      "nics": [
        {
          "nic": "mlx5_0",
          "nic_bdf": "",
          "assign_total": 100,
          "batch_total": 10,
          "completed_bytes_total": 10485760,
          "pending_ops": 0,
          "error_total": 0,
          "post_bytes_total": 10485760,
          "post_failures_total": 0,
          "cq_errors_total": 0,
          "ewma_bandwidth_bps": 123456789.0
        }
      ],
      "connections": [
        {
          "conn_id": "agent-0:mlx5_0->agent-1:mlx5_1",
          "peer": "agent-1",
          "local_nic": "mlx5_0",
          "remote_nic": "mlx5_1",
          "state": "connected",
          "connected": true
        }
      ],
      "ewma_bandwidth_bps": 123456789.0
    }
Tests
This PR adds or strengthens tests for:
- `status="stopped"` snapshot
- `DirectedConnection`
How to Verify
Build
Python tests
NanoCtrl tests
Smoke test
Start Redis and NanoCtrl, then run a control-plane RDMA example with observability enabled:
Query:
Expected behavior:
- `pending_ops` returns to zero after the workload settles.
- `completed_bytes_total` is not multiplied by `num_qp`.
- `obs nics` shows per-local-NIC aggregate traffic.
- `obs links` shows the directed connection catalog, with traffic fields rendered as `-`.
- `STATE=stopped`.
- `alive -> stale -> gone`.
Not Included in This PR
Deferred follow-ups:
Design Summary
This PR intentionally implements a bounded Observability v0.
It provides:
It does not attempt to implement full distributed tracing or peer-pair traffic accounting.
The key invariant is: