bgpd: cancel BFD strict hold timer on peer delete#21926
Conversation
8f7ba81 to
3e43ef0
Compare
Greptile SummaryThis PR fixes a use-after-free crash in
Confidence Score: 4/5The C fix is correct and safe to merge; the only finding is a minor edge case in the test's process-alive check. The three-line change in bgp_bfd.c is a well-understood, correctly placed timer cancellation that mirrors the same pattern already used in the hold-time VTY handler. The timer cancel happens before any branch that frees bfd_config, covering both the group-fallback early-return path and the full-free path. The new topotest exercises the crash scenario end-to-end, though its final bgpd-alive assertion has a theoretical gap where a missing pid file would cause the shell one-liner to silently report alive regardless of process state. The test file's process-alive check is the only area worth a second look. Important Files Changed
Sequence DiagramsequenceDiagram
participant BFD as bfdd (remote)
participant BGP as bgpd peer
participant Timer as t_hold_timer
participant Mem as bfd_config memory
Note over BFD,BGP: BFD session established
BFD->>BGP: BFD Down notification
BGP->>Timer: event_add_timer(bgp_bfd_strict_holdtime_expire)
Note over BGP,Timer: hold timer armed (e.g. 5 s)
BGP->>BGP: "no neighbor" → bgp_peer_remove_bfd()
alt BEFORE this fix
BGP->>Mem: XFREE(bfd_config) ← timer still pending!
Timer-->>Mem: timer fires → touches freed memory → CRASH
else AFTER this fix
BGP->>Timer: "event_cancel(&bfd_config->t_hold_timer)"
BGP->>Mem: bfd_sess_free() + XFREE(bfd_config)
Note over Timer: timer cancelled, never fires
end
Prompt To Fix All With AIFix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
tests/topotests/bgp_bfd_down_cease_notification/test_bgp_bfd_strict_hold_timer_peer_delete.py:165-170
The process-alive check has a silent false-positive path. When `cat /var/run/frr/bgpd.pid` fails (e.g. the pid file was never written, or is transiently unreadable), the command substitution expands to the empty string and `test -d /proc/` is evaluated — which always succeeds on Linux, so `"alive"` is printed and the assertion passes even though `bgpd` may have crashed. Replacing the bare `test -d /proc/…` path with an explicit empty-pid guard makes the intent clear and eliminates this edge case.
```suggestion
step("Verify R2 bgpd did not crash after the peer was deleted")
assert (
r2.cmd(
"pid=$(cat /var/run/frr/bgpd.pid 2>/dev/null);"
' [ -n "$pid" ] && test -d /proc/"$pid" && echo alive || true'
)
.strip()
== "alive"
)
```
Reviews (1): Last reviewed commit: "bgpd: cancel BFD strict hold timer on pe..." | Re-trigger Greptile |
| step("Verify R2 bgpd did not crash after the peer was deleted") | ||
| assert ( | ||
| r2.cmd("test -d /proc/$(cat /var/run/frr/bgpd.pid) && echo alive || true") | ||
| .strip() | ||
| == "alive" | ||
| ) |
There was a problem hiding this comment.
The process-alive check has a silent false-positive path. When
cat /var/run/frr/bgpd.pid fails (e.g. the pid file was never written, or is transiently unreadable), the command substitution expands to the empty string and test -d /proc/ is evaluated — which always succeeds on Linux, so "alive" is printed and the assertion passes even though bgpd may have crashed. Replacing the bare test -d /proc/… path with an explicit empty-pid guard makes the intent clear and eliminates this edge case.
| step("Verify R2 bgpd did not crash after the peer was deleted") | |
| assert ( | |
| r2.cmd("test -d /proc/$(cat /var/run/frr/bgpd.pid) && echo alive || true") | |
| .strip() | |
| == "alive" | |
| ) | |
| step("Verify R2 bgpd did not crash after the peer was deleted") | |
| assert ( | |
| r2.cmd( | |
| "pid=$(cat /var/run/frr/bgpd.pid 2>/dev/null);" | |
| ' [ -n "$pid" ] && test -d /proc/"$pid" && echo alive || true' | |
| ) | |
| .strip() | |
| == "alive" | |
| ) |
Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/topotests/bgp_bfd_down_cease_notification/test_bgp_bfd_strict_hold_timer_peer_delete.py
Line: 165-170
Comment:
The process-alive check has a silent false-positive path. When `cat /var/run/frr/bgpd.pid` fails (e.g. the pid file was never written, or is transiently unreadable), the command substitution expands to the empty string and `test -d /proc/` is evaluated — which always succeeds on Linux, so `"alive"` is printed and the assertion passes even though `bgpd` may have crashed. Replacing the bare `test -d /proc/…` path with an explicit empty-pid guard makes the intent clear and eliminates this edge case.
```suggestion
step("Verify R2 bgpd did not crash after the peer was deleted")
assert (
r2.cmd(
"pid=$(cat /var/run/frr/bgpd.pid 2>/dev/null);"
' [ -n "$pid" ] && test -d /proc/"$pid" && echo alive || true'
)
.strip()
== "alive"
)
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Pull request overview
This PR fixes a bgpd crash caused by a stale pending strict BFD hold timer when a BGP peer (and its peer->bfd_config) is deleted while the timer is still armed. It also adds a topotest to reproduce the sequence and ensure bgpd remains alive past the original timer deadline.
Changes:
- Cancel
peer->bfd_config->t_hold_timerinbgp_peer_remove_bfd()before BFD config is removed/replaced. - Add a new topotest that arms the strict hold timer, deletes the peer before expiry, and verifies bgpd stays running.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
bgpd/bgp_bfd.c |
Cancels the strict hold timer during peer BFD teardown to prevent a use-after-free when the timer later expires. |
tests/topotests/bgp_bfd_down_cease_notification/test_bgp_bfd_strict_hold_timer_peer_delete.py |
New regression test covering the peer-delete-with-pending-timer crash scenario. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| assert ( | ||
| r2.cmd("test -d /proc/$(cat /var/run/frr/bgpd.pid) && echo alive || true") | ||
| .strip() | ||
| == "alive" | ||
| ) |
| if ( | ||
| obs["bfdHoldTimerExpireInMsecs"] is None | ||
| or obs["bfdHoldTimerExpireInMsecs"] <= 0 |
|
|
||
| for rname, router in tgen.routers().items(): | ||
| router.load_config( | ||
| TopoRouter.RD_ZEBRA, os.path.join(CWD, "{}/zebra.conf".format(rname)) |
There was a problem hiding this comment.
Please move using a single frr.conf
| ) | ||
|
|
||
| step("Wait beyond the stale strict hold timer deadline") | ||
| topotest.sleep( |
There was a problem hiding this comment.
We should check with run_and_expect if the timer is expired or not.
3e43ef0 to
19dd96e
Compare
A BFD Down notification can arm peer->bfd_config->t_hold_timer when strict BFD hold time is configured. If the peer is deleted before that timer expires, bgp_peer_remove_bfd() releases peer->bfd_config without first canceling the pending timer. When the stale timer later expires, the event code may touch the freed timer slot and bgpd can crash. Cancel the strict hold timer before releasing peer->bfd_config. Add a topotest that stops BFD to arm the timer, deletes the peer, waits past the original expiry time, and verifies bgpd remains alive. Fixes: FRRouting#21921 Signed-off-by: Z-Yivon <652025330042@smail.nju.edu.cn>
19dd96e to
d001ee1
Compare
Fix a stale BFD strict hold timer when deleting a BGP peer.
When BFD strict hold time is configured, a BFD Down notification can arm
peer->bfd_config->t_hold_timer. If the peer is deleted before that timerexpires,
bgp_peer_remove_bfd()can release the peer's BFD configuration whilethe timer is still pending.
When the stale timer later expires, the event code may touch the freed timer
slot and
bgpdcan crash.This patch cancels
peer->bfd_config->t_hold_timerinbgp_peer_remove_bfd()before the peer BFD configuration is removed or replaced.
fix: #21921
Testing
Added a topotest that:
bfddso the local peer observes BFD Down.bgpdremains alive.