bgpd: cancel BFD strict hold timer on peer delete #21926

Open
Z-Yivon wants to merge 1 commit into FRRouting:master from Z-Yivon:bgpd-bfd-strict-hold-peer-delete

Conversation

Z-Yivon commented May 13, 2026

Fix a stale BFD strict hold timer when deleting a BGP peer.

When BFD strict hold time is configured, a BFD Down notification can arm
peer->bfd_config->t_hold_timer. If the peer is deleted before that timer
expires, bgp_peer_remove_bfd() can release the peer's BFD configuration while
the timer is still pending.

When the stale timer later expires, the event code may touch the freed timer
slot and bgpd can crash.

This patch cancels peer->bfd_config->t_hold_timer in bgp_peer_remove_bfd()
before the peer BFD configuration is removed or replaced.
Fixes: #21921
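
The ordering problem described above can be illustrated with a toy model. This is a minimal sketch in Python, not FRR code: `EventLoop`, `BfdConfig`, `remove_bfd`, and `simulate` are hypothetical names, and the "freed" flag stands in for memory that C would actually release. It only shows why the pending timer has to be cancelled before its owner goes away.

```python
import heapq
import itertools


class EventLoop:
    """Toy timer queue standing in for an event library (not FRR's API)."""

    def __init__(self):
        self._queue = []
        self._ids = itertools.count()
        self.now = 0.0

    def add_timer(self, delay, callback, arg):
        timer = {"cancelled": False, "callback": callback, "arg": arg}
        # The unique id breaks ties so the dict itself is never compared.
        heapq.heappush(self._queue, (self.now + delay, next(self._ids), timer))
        return timer

    def cancel(self, timer):
        if timer is not None:
            timer["cancelled"] = True

    def run_until(self, deadline):
        while self._queue and self._queue[0][0] <= deadline:
            when, _, timer = heapq.heappop(self._queue)
            self.now = when
            if not timer["cancelled"]:
                timer["callback"](timer["arg"])
        self.now = deadline


class BfdConfig:
    """Hypothetical stand-in for the peer's BFD configuration."""

    def __init__(self):
        self.freed = False
        self.t_hold_timer = None


def hold_timer_expired(cfg):
    if cfg.freed:
        # In C this would be a use-after-free; the model just raises.
        raise RuntimeError("hold timer fired on freed bfd_config")
    cfg.t_hold_timer = None


def remove_bfd(loop, cfg, cancel_first):
    if cancel_first:
        loop.cancel(cfg.t_hold_timer)
        cfg.t_hold_timer = None
    cfg.freed = True


def simulate(cancel_first):
    loop = EventLoop()
    cfg = BfdConfig()
    cfg.t_hold_timer = loop.add_timer(5.0, hold_timer_expired, cfg)
    loop.run_until(1.0)                  # peer is deleted while the timer is pending
    remove_bfd(loop, cfg, cancel_first)
    try:
        loop.run_until(10.0)             # run past the original deadline
    except RuntimeError:
        return "crash"
    return "ok"
```

In this model, `simulate(cancel_first=False)` reproduces the stale-timer failure and `simulate(cancel_first=True)` corresponds to the ordering this patch introduces.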

Testing

Added a topotest that:

  1. Starts two local BGP peers with BFD enabled.
  2. Configures strict BFD hold time.
  3. Stops bfdd on the remote peer so the local peer observes BFD Down.
  4. Waits until the strict hold timer is armed.
  5. Deletes the local BGP peer before the timer expires.
  6. Waits past the original timer deadline and verifies bgpd remains alive.
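
Steps 4 and 6 can be sketched as two small helpers. This is only an illustration: the field name `bfdHoldTimerExpireInMsecs` is taken from the test snippet quoted later in the review, while the dictionary shape, the helper names, and the safety margin are assumptions rather than the test's actual code.

```python
def hold_timer_armed(neighbor):
    """Return True when the (assumed) neighbor dict reports a pending hold timer."""
    remaining = neighbor.get("bfdHoldTimerExpireInMsecs")
    return remaining is not None and remaining > 0


def seconds_past_deadline(neighbor, margin_s=2.0):
    """Seconds to wait so that the original timer deadline has certainly passed."""
    if not hold_timer_armed(neighbor):
        raise ValueError("hold timer is not armed")
    # Remaining time is reported in milliseconds; add a margin on top.
    return neighbor["bfdHoldTimerExpireInMsecs"] / 1000.0 + margin_s
```

A test would capture the remaining time in step 4, delete the peer in step 5, and then wait for at least the value returned by `seconds_past_deadline` before checking that bgpd is still running.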

Copilot AI review requested due to automatic review settings May 13, 2026 04:35
@frrbot frrbot Bot added the bgp label May 13, 2026
@Z-Yivon Z-Yivon force-pushed the bgpd-bfd-strict-hold-peer-delete branch from 8f7ba81 to 3e43ef0 Compare May 13, 2026 04:38
@frrbot frrbot Bot added the bugfix label May 13, 2026

greptile-apps Bot commented May 13, 2026

Greptile Summary

This PR fixes a use-after-free crash in bgpd where a BFD strict hold timer could fire after the owning peer's BFD configuration was freed during peer deletion. The fix is a targeted, single-statement addition that cancels t_hold_timer inside bgp_peer_remove_bfd() before any BFD config memory is released.

  • C fix (bgp_bfd.c): Three lines added in bgp_peer_remove_bfd() to call event_cancel(&p->bfd_config->t_hold_timer) before the existing group-fallback or free paths execute, mirroring the same cancel already present in the hold-time VTY command handler (line 668).
  • New topotest (test_bgp_bfd_strict_hold_timer_peer_delete.py): End-to-end regression test that kills r1's bfdd, waits for r2 to arm the hold timer, deletes the r2 peer, waits past the original deadline, and then checks that bgpd on r2 is still alive.

Confidence Score: 4/5

The C fix is correct and safe to merge; the only finding is a minor edge case in the test's process-alive check.

The three-line change in bgp_bfd.c is a well-understood, correctly placed timer cancellation that mirrors the same pattern already used in the hold-time VTY handler. The timer cancel happens before any branch that frees bfd_config, covering both the group-fallback early-return path and the full-free path. The new topotest exercises the crash scenario end-to-end, though its final bgpd-alive assertion has a theoretical gap where a missing pid file would cause the shell one-liner to silently report alive regardless of process state.

The test file's process-alive check is the only area worth a second look.

Important Files Changed

| Filename | Overview |
| --- | --- |
| bgpd/bgp_bfd.c | Correct, minimal fix: cancels the BFD strict hold timer in bgp_peer_remove_bfd() before bfd_config is freed, preventing a stale-timer use-after-free crash. |
| tests/topotests/bgp_bfd_down_cease_notification/test_bgp_bfd_strict_hold_timer_peer_delete.py | New regression topotest covering the crash scenario; process-alive check has a theoretical false-positive path when the pid file is unexpectedly absent. |

Sequence Diagram

sequenceDiagram
    participant BFD as bfdd (remote)
    participant BGP as bgpd peer
    participant Timer as t_hold_timer
    participant Mem as bfd_config memory

    Note over BFD,BGP: BFD session established

    BFD->>BGP: BFD Down notification
    BGP->>Timer: event_add_timer(bgp_bfd_strict_holdtime_expire)

    Note over BGP,Timer: hold timer armed (e.g. 5 s)

    BGP->>BGP: "no neighbor" → bgp_peer_remove_bfd()

    alt BEFORE this fix
        BGP->>Mem: XFREE(bfd_config)  ← timer still pending!
        Timer-->>Mem: timer fires → touches freed memory → CRASH
    else AFTER this fix
        BGP->>Timer: "event_cancel(&bfd_config->t_hold_timer)"
        BGP->>Mem: bfd_sess_free() + XFREE(bfd_config)
        Note over Timer: timer cancelled, never fires
    end

Reviews (1): Last reviewed commit: "bgpd: cancel BFD strict hold timer on pe..."

Comment on lines +165 to +170
    step("Verify R2 bgpd did not crash after the peer was deleted")
    assert (
        r2.cmd("test -d /proc/$(cat /var/run/frr/bgpd.pid) && echo alive || true")
        .strip()
        == "alive"
    )

P2 The process-alive check has a silent false-positive path. When cat /var/run/frr/bgpd.pid fails (e.g. the pid file was never written, or is transiently unreadable), the command substitution expands to the empty string and test -d /proc/ is evaluated — which always succeeds on Linux, so "alive" is printed and the assertion passes even though bgpd may have crashed. Replacing the bare test -d /proc/… path with an explicit empty-pid guard makes the intent clear and eliminates this edge case.

Suggested change
-    step("Verify R2 bgpd did not crash after the peer was deleted")
-    assert (
-        r2.cmd("test -d /proc/$(cat /var/run/frr/bgpd.pid) && echo alive || true")
-        .strip()
-        == "alive"
-    )
+    step("Verify R2 bgpd did not crash after the peer was deleted")
+    assert (
+        r2.cmd(
+            "pid=$(cat /var/run/frr/bgpd.pid 2>/dev/null);"
+            ' [ -n "$pid" ] && test -d /proc/"$pid" && echo alive || true'
+        )
+        .strip()
+        == "alive"
+    )
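
The false-positive path can be reproduced without a router. The following is a rough sketch, assuming a temporary directory as a stand-in for /proc; `naive_alive`, `guarded_alive`, and `demo` are hypothetical helpers, not part of the topotest. It shows that an empty pid makes the naive check test the root directory itself, while an explicit guard does not.

```python
import os
import tempfile


def naive_alive(pid_text, proc_root="/proc"):
    # Mirrors `test -d /proc/$(cat pidfile)`: an empty pid yields proc_root itself.
    return os.path.isdir(os.path.join(proc_root, pid_text))


def guarded_alive(pid_text, proc_root="/proc"):
    # Reject empty or non-numeric pid text before touching the filesystem.
    pid_text = pid_text.strip()
    if not pid_text.isdigit():
        return False
    return os.path.isdir(os.path.join(proc_root, pid_text))


def demo():
    # A temporary directory plays the role of /proc with one live pid, 1234.
    with tempfile.TemporaryDirectory() as fake_proc:
        os.mkdir(os.path.join(fake_proc, "1234"))
        return {
            "naive_empty": naive_alive("", fake_proc),
            "guarded_empty": guarded_alive("", fake_proc),
            "guarded_live": guarded_alive("1234\n", fake_proc),
            "guarded_dead": guarded_alive("9999", fake_proc),
        }
```

Only the naive check reports a process for the empty pid, which is the same gap the `[ -n "$pid" ]` guard in the suggestion closes in the shell one-liner.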

Copilot AI left a comment

Pull request overview

This PR fixes a bgpd crash caused by a stale pending strict BFD hold timer when a BGP peer (and its peer->bfd_config) is deleted while the timer is still armed. It also adds a topotest to reproduce the sequence and ensure bgpd remains alive past the original timer deadline.

Changes:

  • Cancel peer->bfd_config->t_hold_timer in bgp_peer_remove_bfd() before BFD config is removed/replaced.
  • Add a new topotest that arms the strict hold timer, deletes the peer before expiry, and verifies bgpd stays running.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| bgpd/bgp_bfd.c | Cancels the strict hold timer during peer BFD teardown to prevent a use-after-free when the timer later expires. |
| tests/topotests/bgp_bfd_down_cease_notification/test_bgp_bfd_strict_hold_timer_peer_delete.py | New regression test covering the peer-delete-with-pending-timer crash scenario. |

Comment on lines +166 to +170
    assert (
        r2.cmd("test -d /proc/$(cat /var/run/frr/bgpd.pid) && echo alive || true")
        .strip()
        == "alive"
    )

Comment on lines +139 to +141
    if (
        obs["bfdHoldTimerExpireInMsecs"] is None
        or obs["bfdHoldTimerExpireInMsecs"] <= 0

    for rname, router in tgen.routers().items():
        router.load_config(
            TopoRouter.RD_ZEBRA, os.path.join(CWD, "{}/zebra.conf".format(rname))
Member

Please move to a single frr.conf.

    )

    step("Wait beyond the stale strict hold timer deadline")
    topotest.sleep(
Member

We should use run_and_expect to check whether the timer has expired.

Author

ok
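
The polling approach requested in this thread can be sketched as a generic retry loop. This is a self-contained stand-in, not the topotest helper itself: `poll_until_expected` and its parameters are hypothetical, and the real `run_and_expect` signature should be checked against the topotests library before use.

```python
import time


def poll_until_expected(func, expected, count=20, wait=0.5, sleep=time.sleep):
    """Call func() up to `count` times until it returns `expected`.

    Returns (True, result) on a match, or (False, last_result) once the
    attempts are exhausted. `sleep` is injectable so callers can avoid real delays.
    """
    result = None
    for attempt in range(count):
        result = func()
        if result == expected:
            return True, result
        if attempt < count - 1:
            sleep(wait)
    return False, result
```

Compared with a fixed sleep, a loop like this returns as soon as the observed state matches, and it reports the last observed value when the check times out.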

@Z-Yivon Z-Yivon force-pushed the bgpd-bfd-strict-hold-peer-delete branch from 3e43ef0 to 19dd96e Compare May 13, 2026 12:28
A BFD Down notification can arm peer->bfd_config->t_hold_timer when
strict BFD hold time is configured. If the peer is deleted before that
timer expires, bgp_peer_remove_bfd() releases peer->bfd_config without
first canceling the pending timer.

When the stale timer later expires, the event code may touch the freed
timer slot and bgpd can crash.

Cancel the strict hold timer before releasing peer->bfd_config. Add a
topotest that stops BFD to arm the timer, deletes the peer, waits past
the original expiry time, and verifies bgpd remains alive.

Fixes: FRRouting#21921
Signed-off-by: Z-Yivon <652025330042@smail.nju.edu.cn>
@Z-Yivon Z-Yivon force-pushed the bgpd-bfd-strict-hold-peer-delete branch from 19dd96e to d001ee1 Compare May 13, 2026 15:04

Development

Successfully merging this pull request may close these issues.

bgpd: stale BFD strict hold timer can crash after peer deletion

3 participants