feat(kad-dht): add RoutingTableDiagnostics for routing health inspection #1332

Open
bhuvan-somisetty wants to merge 1 commit into libp2p:main from bhuvan-somisetty:feat/kad-dht-routing-table-diagnostics-clean

bhuvan-somisetty commented May 16, 2026

What problem does this solve?

When a KadDHT node starts misbehaving (slow lookups, failed bootstraps, keys that can never be found), operators are flying blind. The routing table is a black box: the only way to debug it today is to sprinkle print statements through the Kademlia internals or stare at raw bucket lists.

This PR adds a first-class diagnostic surface so you can answer the real questions in seconds, not hours:

  • Which k-buckets are under-populated or empty?
  • Where are the keyspace coverage gaps that explain why certain keys can't be found?
  • How fresh are the known peers? Are half of them stale?
  • What is the overall routing-table health as a single score I can log or alert on?

What was added

libp2p/kad_dht/diagnostics.py - core engine

A new RoutingTableDiagnostics class that analyses a live routing table and produces a RoutingTableReport. The analyser is read-only: it never modifies routing table state.

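from libp2p.kad_dht import KadDHT

# `host` is an already-running libp2p host and `mode` is the DHT mode
# (client or server); both are created elsewhere.
dht = KadDHT(host, mode)

# Build a full report, then print it along with the composite score.
report = dht.get_diagnostics().analyse()
print(report.summary())
print(f"Health score: {report.health_score}/100  ({report.verdict})")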

Sample output:

=== KadDHT Routing Table Report ===
Timestamp       : 2026-05-15 23:29:54
Local peer      : 12D3KooWAbcd1234…
Health score    : 73.4/100  (good)

Peers           : 18
Buckets         : 4/6 populated
Keyspace cover  : 61.2%
Coverage gaps   : 2

Peer freshness
  Fresh  (<1 h) : 14
  Aging (1–12 h): 3
  Stale (12–24h): 1
  Very stale    : 0

Top coverage gaps (first 3):
  bucket #5: 00000000…–1fffffff… (0 peers)
  bucket #4: 20000000…–3fffffff… (1 peers)
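The "Peer freshness" block above sorts peers into four age bands. The description does not say how a peer sitting exactly on a band edge is counted, so the sketch below assumes half-open intervals and a per-peer `last_seen` timestamp in seconds. `Freshness`, `classify_freshness`, and `freshness_distribution` are hypothetical names used for illustration only, not the PR's API.

```python
from enum import Enum

HOUR = 3600.0  # seconds


class Freshness(Enum):
    """Hypothetical band names mirroring the sample report."""
    FRESH = "fresh"            # age < 1 h
    AGING = "aging"            # 1 h <= age < 12 h
    STALE = "stale"            # 12 h <= age < 24 h
    VERY_STALE = "very stale"  # age >= 24 h


def classify_freshness(last_seen: float, now: float) -> Freshness:
    """Assign one peer to a band based on seconds since it was last seen.

    Assumption: bands are half-open [lower, upper), so exactly 1 h is aging,
    exactly 12 h is stale and exactly 24 h is very stale.
    """
    age = now - last_seen
    if age < 1 * HOUR:
        return Freshness.FRESH
    if age < 12 * HOUR:
        return Freshness.AGING
    if age < 24 * HOUR:
        return Freshness.STALE
    return Freshness.VERY_STALE


def freshness_distribution(last_seen_times, now):
    """Count peers per band (hypothetical helper, not the PR's code)."""
    counts = {band: 0 for band in Freshness}
    for ts in last_seen_times:
        counts[classify_freshness(ts, now)] += 1
    return counts
```

Pinning the boundary rule down explicitly is what makes edge cases such as "exactly 12 h" testable.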

Health score breakdown (0–100, composite):

| Component | Weight | Notes |
|---|---|---|
| Fill score | 40 pts | Weighted by proximity: nearest buckets count more |
| Freshness score | 35 pts | Ratio of peers seen in the last hour |
| Coverage score | 25 pts | Fraction of the 256-bit keyspace covered |

The weighting reflects Kademlia reality: a full bucket closest to the local node is worth far more than a full bucket at the far end of the keyspace.
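The 40/35/25 split can be made concrete with a small sketch. The description does not give the proximity-weighting function or the bucket capacity, so this assumes a weight of 1/(i+1) for bucket i counted from the nearest bucket outward, and the conventional Kademlia k = 20. `health_score` is a hypothetical helper, not the PR's implementation.

```python
def health_score(bucket_fills, fresh_peers, total_peers, coverage_fraction, k=20):
    """Hypothetical composite 0-100 score following the 40/35/25 split.

    bucket_fills: peer counts per bucket, ordered nearest-to-local-node first.
    k: assumed bucket capacity (conventional Kademlia value).
    """
    # Fill (40 pts): proximity-weighted mean of per-bucket fill ratios.
    # Assumption: weight 1/(i+1), so the nearest bucket counts the most.
    weights = [1.0 / (i + 1) for i in range(len(bucket_fills))]
    if weights:
        weighted = sum(w * min(n / k, 1.0) for w, n in zip(weights, bucket_fills))
        fill = 40.0 * weighted / sum(weights)
    else:
        fill = 0.0

    # Freshness (35 pts): share of peers seen within the last hour.
    freshness = 35.0 * fresh_peers / total_peers if total_peers else 0.0

    # Coverage (25 pts): fraction of the keyspace covered, clamped to [0, 1].
    coverage = 25.0 * max(0.0, min(coverage_fraction, 1.0))

    return round(fill + freshness + coverage, 1)
```

Under this weighting, a full nearest bucket with an empty far bucket, `health_score([20, 0], 0, 20, 0.0)`, scores twice as much as the reverse, `health_score([0, 20], 0, 20, 0.0)`. That ordering matches the rationale above.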

Convenience entry points

# Via KadDHT instance
report = dht.get_diagnostics().analyse()

# Via RoutingTable directly
report = dht.routing_table.get_diagnostics().analyse()

# Partial queries (no full report allocation)
score  = dht.get_diagnostics().get_health_score()
gaps   = dht.get_diagnostics().get_coverage_gaps()
fresh  = dht.get_diagnostics().get_freshness_distribution()
stats  = dht.get_diagnostics().get_bucket_stats()

Public types exported from libp2p.kad_dht

from libp2p.kad_dht import (
    RoutingTableDiagnostics,
    RoutingTableReport,
    BucketStat,
    CoverageGap,
    FreshnessDistribution,
)
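The field layout of these types is not spelled out here, but the test list below names three `BucketStat` properties (`is_full`, `is_empty`, `health`) and mentions stale-peer counting. The following is a hypothetical stand-in showing one plausible shape. The field names, the k = 20 capacity, and the definition of `health` as the fill ratio of non-stale peers are all assumptions.

```python
from dataclasses import dataclass

K = 20  # assumed bucket capacity (conventional Kademlia k)


@dataclass(frozen=True)
class BucketStatSketch:
    """Hypothetical stand-in for BucketStat; field names are illustrative."""
    index: int
    peer_count: int
    stale_count: int = 0
    capacity: int = K

    @property
    def is_full(self) -> bool:
        # A bucket is full once it holds `capacity` peers.
        return self.peer_count >= self.capacity

    @property
    def is_empty(self) -> bool:
        return self.peer_count == 0

    @property
    def health(self) -> float:
        """Assumed definition: fill ratio of non-stale peers, in [0, 1]."""
        live = max(self.peer_count - self.stale_count, 0)
        return min(live / self.capacity, 1.0)
```

Making the record frozen suits a read-only analyser, because a stats snapshot cannot be mutated after the report is built.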

Tests

tests/core/kad_dht/test_routing_table_diagnostics.py - 27 unit tests, fully offline (mock host, no network required). Covers:

  • BucketStat properties (is_full, is_empty, health)
  • FreshnessDistribution ratios and totals
  • RoutingTableDiagnostics with empty tables, single-peer tables, and multi-bucket tables
  • Coverage gap detection and ordering (emptiest first)
  • Health score components individually and composite
  • RoutingTableReport.summary() output
  • Freshness time-band boundaries (exactly at 1 h, 12 h, 24 h)
  • Stale peer counting in bucket stats
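The gap-ordering rule these tests cover ("emptiest first") and the "Keyspace cover" figure can be sketched over plain tuples. The sketch makes several assumptions: each bucket is `(index, start, end_inclusive, peer_count)` over the 256-bit keyspace, ranges do not overlap, a bucket counts as covered when it holds at least one peer, a gap is any bucket below a hypothetical `min_peers=2`, and ties are broken by bucket index. `coverage_fraction` and `coverage_gaps` are illustrative names, not the PR's API.

```python
KEYSPACE = 2 ** 256  # size of the 256-bit Kademlia keyspace


def coverage_fraction(buckets):
    """Share of the keyspace spanned by buckets holding at least one peer.

    Each bucket is (index, start, end_inclusive, peer_count); ranges are
    assumed not to overlap. Python ints keep the 256-bit arithmetic exact.
    """
    covered = sum(end - start + 1 for _, start, end, n in buckets if n > 0)
    return covered / KEYSPACE


def coverage_gaps(buckets, min_peers=2, limit=3):
    """Buckets with fewer than `min_peers` peers, emptiest first.

    Ties are broken by bucket index (an assumption). `limit` mirrors the
    "Top coverage gaps (first 3)" section of the sample report.
    """
    gaps = [b for b in buckets if b[3] < min_peers]
    gaps.sort(key=lambda b: (b[3], b[0]))
    return gaps[:limit]
```

For four equal quarter-keyspace buckets holding 0, 1, 5 and 0 peers, this reports 50% coverage and returns the gaps in the order 0, 3, 1.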

Example

examples/kademlia/routing_table_diagnostics.py - two-node runnable demo:

# Terminal 1 - bootstrap node
python examples/kademlia/routing_table_diagnostics.py --port 8888 --mode server

# Terminal 2 - second node (connects to the bootstrap node, warms up for 5 s, then prints the full report)
python examples/kademlia/routing_table_diagnostics.py \
    --port 9999 --mode server \
    --bootstrap /ip4/127.0.0.1/tcp/8888/p2p/<PeerID>

Non-goals / out of scope

  • No metrics export (Prometheus, OpenTelemetry) - that belongs in a separate PR
  • No periodic background task - callers decide when to run diagnostics
  • No routing table modification - purely observational
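Because there is no background task, the caller decides when to check health and what to do with the score. The sketch below shows one caller-side check built on the `get_health_score()` partial query listed above. The float return type, the `alert_below` threshold of 50, and the logger name are assumptions. `check_routing_health` is a hypothetical helper and is not part of the PR.

```python
import logging
from typing import Protocol

logger = logging.getLogger("kad_dht.health")  # hypothetical logger name


class SupportsHealthScore(Protocol):
    """Anything exposing get_health_score(), e.g. dht.get_diagnostics()."""
    def get_health_score(self) -> float: ...


def check_routing_health(diagnostics: SupportsHealthScore,
                         alert_below: float = 50.0) -> bool:
    """Run one on-demand check and return True if the score is acceptable.

    `alert_below` is an illustrative threshold; the PR does not define one.
    """
    score = diagnostics.get_health_score()
    if score < alert_below:
        logger.warning("KadDHT routing health %.1f/100 is below %.1f",
                       score, alert_below)
        return False
    logger.info("KadDHT routing health %.1f/100", score)
    return True
```

A caller would invoke `check_routing_health(dht.get_diagnostics())` from its own timer or request handler. Typing the parameter as a `Protocol` lets tests pass a small stub instead of a live DHT.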

Checklist

  • New module libp2p/kad_dht/diagnostics.py with full docstrings
  • RoutingTable.get_diagnostics() convenience factory
  • KadDHT.get_diagnostics() convenience factory
  • All new public types exported in libp2p/kad_dht/__init__.py
  • 27 offline unit tests
  • Runnable example in examples/kademlia/
  • No changes to routing table logic — read-only analyser
  • TYPE_CHECKING guard to avoid circular imports

When a KadDHT node misbehaves — slow lookups, unreachable keys, failed
bootstraps — operators previously had no visibility into why. This commit
adds a first-class diagnostic surface that answers the core questions:

  • Which k-buckets are under-populated or empty?
  • Where are the keyspace coverage gaps?
  • How fresh are my known peers? (fresh / aging / stale / very stale)
  • What is the overall routing-table health as a single 0–100 score?

Changes:
  libp2p/kad_dht/diagnostics.py
    New RoutingTableDiagnostics class (read-only analyser).
    Produces a RoutingTableReport with BucketStat list, CoverageGap list,
    FreshnessDistribution, composite health score, and human-readable summary.

  libp2p/kad_dht/routing_table.py
    Add RoutingTable.get_diagnostics() convenience factory.

  libp2p/kad_dht/kad_dht.py
    Add KadDHT.get_diagnostics() convenience factory.

  libp2p/kad_dht/__init__.py
    Export all new public types.

  tests/core/kad_dht/test_routing_table_diagnostics.py
    27 unit tests; fully offline (mock host, no network required).

  examples/kademlia/routing_table_diagnostics.py
    Two-node demo that prints a full report after bootstrapping.

Usage:
    report = dht.get_diagnostics().analyse()
    print(report.summary())
    print(f"Health score: {report.health_score}/100")