
bgpd: speed up peer teardown with per-peer route indexes#21186

Open
nick-bouliane wants to merge 1 commit into FRRouting:master from nick-bouliane:master

Conversation

@nick-bouliane
Contributor

Track adj-in and path_info entries per peer so clear operations avoid full RIB/tree scans during neighbor teardown. Deduplicate queued bgp_dest work items so AddPath peers do not enqueue the same destination repeatedly.

Signed-off-by: Nick Bouliane <nbouliane@coreweave.com>

@greptile-apps

greptile-apps bot commented Mar 16, 2026

Greptile Summary

This PR adds per-peer indexes (linked lists) to track bgp_adj_in and bgp_path_info entries, enabling O(peer_routes) instead of O(table_size) clearing during neighbor teardown. It also deduplicates work queue items for AddPath peers using a hash table, and skips already-removed paths in the clear worker.

  • bgp_adj_in gains a per-peer doubly-linked list (peer_adj_prev/peer_adj_next) and a dest back-pointer; bgp_path_info gains a LIST_ENTRY for per-peer tracking via peer->paths
  • bgp_clear_route_table is rewritten to walk per-peer lists instead of scanning the full RIB, eliminating the need for special VPN/EVPN/ENCAP nested table iteration in bgp_clear_route
  • A hash is used to deduplicate destination queue entries when multiple AddPath routes share the same destination
  • bgp_clear_route_node gains a BGP_PATH_REMOVED check to avoid double-processing already-deleted paths
  • Memory cost: two pointers per bgp_adj_in and one LIST_ENTRY per bgp_path_info, plus one pointer per peer — proportional to existing route count

Confidence Score: 4/5

  • This PR is well-structured and the core logic is sound, with properly paired list add/remove operations and correct safe-iteration patterns.
  • Score of 4 reflects that the algorithmic change is correct and well-reasoned: all per-peer list insertions and removals are properly paired across add/reap paths, the safe iteration patterns are correctly used, VPN/EVPN sub-table handling works because bgp_dest_table() returns the correct sub-table with matching afi/safi, and the batch clearing path (walk_batch_table_helper) also correctly maintains the per-peer lists through bgp_adj_in_remove and bgp_path_info_reap. The dedup hash and BGP_PATH_REMOVED skip are good defensive additions. Deducting 1 point because this is a significant data structure change in critical teardown paths that would benefit from thorough integration testing with AddPath, VPN, EVPN, and graceful restart scenarios.
  • bgpd/bgp_route.c deserves the most attention as it contains the core algorithmic change to bgp_clear_route_table and the new dedup hash logic.

Important Files Changed

| Filename | Overview |
| --- | --- |
| bgpd/bgp_advertise.h | Adds per-peer doubly-linked list pointers and a back-pointer to dest in bgp_adj_in. Straightforward struct extension, no issues. |
| bgpd/bgp_advertise.c | Links/unlinks adj-in entries into the per-peer list on set/remove. Doubly-linked list operations are correctly implemented with proper NULL checks. |
| bgpd/bgp_route.h | Adds LIST_ENTRY(bgp_path_info) peer_thread to bgp_path_info for per-peer path list linkage. Clean struct extension. |
| bgpd/bgp_route.c | Core logic change: bgp_clear_route_table now walks per-peer lists instead of the full table. Adds a dedup hash for AddPath destinations. Adds a BGP_PATH_REMOVED skip in the worker. Eliminates VPN/EVPN nested table iteration by leveraging per-peer indexing. Logic is sound but the hash key function is simplistic. |
| bgpd/bgpd.h | Adds a per-peer path list head (LIST_HEAD) and adj_in_head pointer to struct peer. Clean additions. |
| bgpd/bgpd.c | Initializes the per-peer path list in peer_new. Single-line change, correct placement. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[bgp_clear_route] --> B[bgp_clear_route_table]
    B --> C[Walk peer->adj_in_head list]
    C --> D{afi/safi match?}
    D -- Yes --> E[bgp_adj_in_remove]
    D -- No --> C
    E --> C
    B --> F[Walk peer->paths list via LIST_FOREACH_SAFE]
    F --> G{afi/safi match?}
    G -- No --> F
    G -- Yes --> H{force?}
    H -- Yes --> I[bgp_path_info_reap]
    I --> F
    H -- No --> J{dest in queued_dests hash?}
    J -- Yes/skip --> F
    J -- No --> K[Add to queued_dests hash]
    K --> L[Queue dest to clear_node_queue]
    L --> F
    L -.-> M[bgp_clear_route_node worker]
    M --> N{BGP_PATH_REMOVED?}
    N -- Yes/skip --> M
    N -- No --> O[bgp_rib_remove / mark stale]
```

Last reviewed commit: f199de7

Contributor

@mjstapp mjstapp left a comment


we have a batching mechanism that should be helping when multiple peer connections are closing. that mechanism also does yield/resume, so it tries not to occupy the main pthread. can you say more about why that mechanism isn't being used in your situation? I'd like to see if we can bring more peer processing into that path if we need to.

@nick-bouliane
Contributor Author

> we have a batching mechanism that should be helping when multiple peer connections are closing. that mechanism also does yield/resume, so it tries not to occupy the main pthread. can you say more about why that mechanism isn't being used in your situation? I'd like to see if we can bring more peer processing into that path if we need to.

Good question. We are using the existing clear work-queue path (clear_node_queue / bgp_clear_route_node) already.
The bottleneck I measured was mainly in the synchronous pre-queue discovery phase (bgp_clear_route*) where each clearing peer walked large table structures to find relevant routes. That work happens on the main thread before items are batched/yielded.
The patch focuses on that part: it replaces full scans with per-peer indexed iteration and deduped bgp_dest queueing, so the existing yield/resume worker has much less to do and main-thread stalls drop significantly (in my stress test, multi-second vtysh stalls became near-instant).

Comment thread bgpd/bgpd.h
```c
/* workqueues */
struct work_queue *clear_node_queue;

/* Per-peer path list for fast teardown (avoids full table walk) */
```
Member


Why can't we use the bnc and its list of paths to do this work, instead of adding an entirely new structure (and its associated cost of higher memory usage)?

Contributor Author


Technically BNC could be part of the solution, but for peer clear it is a nexthop-oriented index, not a peer-ownership index. To use it for this workflow we would still need to:

1. iterate all BNC entries (or add/maintain a peer->BNC mapping),
2. iterate each bnc->paths,
3. filter pi->peer == target_peer,
4. dedupe pi->net (bgp_dest) before queue/remove, and
5. run separate adj-in cleanup anyway (BNC does not index adj-in).

The per-peer lists are more direct for this specific operation: they answer “what belongs to this peer?” in bounded time and cover both path and adj-in cleanup.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rough incremental memory estimate (64-bit): about +40 MB for 1 peer with 1M routes, and about +4.2 MB for 10k peers with 10 routes each.

@donaldsharp
Member

I've got some experimental code that uses the bnc for speed up. I will need a chance to get it working. Hopefully in the next day or so, so that we can compare the two solutions.

@nick-bouliane
Contributor Author

> I've got some experimental code that uses the bnc for speed up. I will need a chance to get it working. Hopefully in the next day or so, so that we can compare the two solutions.

@donaldsharp did you have time to play with this one? Also, if you have your code in a repo, I can have a look and play with it. Thanks!
