No kernel nhg original by donaldsharp · Pull Request #21893 · FRRouting/frr

donaldsharp · 2026-05-08T13:24:28Z

PR #21125 attempts to modify slightly how NHG data is sent to the kernel for connected/local/kernel NHG's. The approach taken there was too complicated and in order for me to figure out what was going on I had to reverse engineer it to a degree that my approach is just much simplier and frankly easier to understand. Here's my alternate approach to the problem

greptile-apps · 2026-05-08T13:32:13Z

Greptile Summary

This PR simplifies how NHGs for connected/local/kernel (system) routes are handled in the dataplane by letting them flow through the full dplane queue with a skip_kernel flag instead of short-circuiting early and discarding the context. The previous approach discarded the dplane context before FPM and other providers could see it; the new approach allows providers (particularly FPM) to observe and act on these NHG installs while still preventing kernel programming.

zebra_dplane.c: Replaces the early-return that marked NHGs as installed and freed the context with a call to dplane_ctx_set_skip_kernel(), then lets the context flow normally through the provider chain.
zebra_nhg.c: Removes the now-unnecessary special-case handling of ZEBRA_DPLANE_REQUEST_SUCCESS in zebra_nhg_install_kernel and adds UNSET_FLAG(nhe->flags, NEXTHOP_GROUP_QUEUED) when promoting an NHG from the delayed-install path to a real kernel install, which is critical for correctness.
test_fpm_topo1.py: Adds a new test verifying that system-route NHGs are forwarded to FPM but not installed in the Linux kernel.

Confidence Score: 3/5

The C changes are correct but the new test may always time out, leaving the PR's primary behavioral claim unverified.

The core C implementation is clean and the flag-management changes in zebra_nhg.c are correct. The concern is with the test: NEXTHOP_GROUP_FPM is only written during the FPM batch resync walk (fpm_nhg_send_cb), not when individual NHG contexts flow through fpm_nl_enqueue. A connected/local/kernel NHG added after the last FPM sync will never have fpm=True in its show output, so check_nhg_sent_to_fpm will time out on every CI run.

tests/topotests/fpm_testing_topo1/test_fpm_topo1.py — the FPM presence assertion needs a different verification strategy.

Important Files Changed

Filename	Overview
zebra/zebra_dplane.c	Replaces the early-exit/ctx-discard for NEXTHOP_GROUP_INITIAL_DELAY_INSTALL with dplane_ctx_set_skip_kernel(); ctx now flows through the full provider chain.
zebra/zebra_nhg.c	Removes special SUCCESS handling from the dplane result switch (now always an error) and adds critical NEXTHOP_GROUP_QUEUED clearing when promoting a system-route NHG to a real kernel install.
tests/topotests/fpm_testing_topo1/test_fpm_topo1.py	New test for system-route NHG FPM delivery and kernel-skip; the FPM presence check relies on NEXTHOP_GROUP_FPM, which is only set during batch FPM resync, not via the individual dplane ctx path the PR introduces.

Sequence Diagram

sequenceDiagram
    participant RIB as zebra_rib (main thread)
    participant NHG as zebra_nhg
    participant DPLANE as dplane thread
    participant FPM as FPM provider
    participant KERNEL as kernel provider

    RIB->>NHG: zebra_nhg_install_kernel(nhe, CONNECT/LOCAL/KERNEL)
    Note over NHG: INITIAL_DELAY_INSTALL set skip flag-clearing block
    NHG->>DPLANE: dplane_nexthop_add(nhe)
    DPLANE->>DPLANE: dplane_ctx_set_skip_kernel(ctx)
    DPLANE->>DPLANE: dplane_update_enqueue(ctx)
    DPLANE-->>NHG: ZEBRA_DPLANE_REQUEST_QUEUED
    NHG->>NHG: SET_FLAG NEXTHOP_GROUP_QUEUED

    DPLANE->>FPM: fpm_nl_process ctx with NH_INSTALL op
    FPM->>FPM: fpm_nl_enqueue encode and send NHG to FPM socket
    FPM-->>DPLANE: enqueue to output

    DPLANE->>KERNEL: kernel provider dequeues ctx
    KERNEL->>KERNEL: dplane_ctx_is_skip_kernel true
    KERNEL->>KERNEL: set status SUCCESS skip netlink
    KERNEL-->>DPLANE: enqueue to output

    DPLANE-->>NHG: zebra_nhg_dplane_result SUCCESS
    NHG->>NHG: UNSET QUEUED REINSTALL SET INSTALLED
    NHG->>NHG: zebra_nhg_handle_install(nhe)

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
tests/topotests/fpm_testing_topo1/test_fpm_topo1.py:256-264
**FPM flag not set via dplane ctx path**

`NEXTHOP_GROUP_FPM` is only set inside `fpm_nhg_send_cb` — the batch hash-walk that runs during an FPM reconnect/resync. When an NHG install context flows through the FPM dplane provider via `fpm_nl_enqueue` (the new path this PR introduces), the flag is never written to the `nhg_hash_entry`. For an interface address added *after* the last FPM sync, `nhg_data.get("fpm", False)` will remain `False` indefinitely, causing the assertion to time out on every run. A more reliable check would either force an FPM reconnect before the assertion (triggering a full resync) or inspect the mock FPM server's received message log directly.

### Issue 2 of 2
tests/topotests/fpm_testing_topo1/test_fpm_topo1.py:258-265
**Misleading error message: `result` is a boolean, not NHG data**

`topotest.run_and_expect` returns `(success, last_function_return_value)`. Here `check_nhg_sent_to_fpm` returns `True` or `False`, so when the assertion fails the message reads `NHG data: False` instead of meaningful diagnostic information. Consider capturing the raw NHG JSON inside the check closure and including it in the assertion message.

_{Reviews (1): Last reviewed commit: "tests: Add topotest cases for sending RI..." | Re-trigger Greptile}

greptile-apps · 2026-05-08T13:32:18Z

+    """
+
+    def check_nhg_sent_to_fpm():
+        nhg_id, nhg_data = _get_nhg_for_prefix(router, prefix)
+        if nhg_id is None or nhg_data is None:
+            return False
+        return nhg_data.get("fpm", False) is True
+
+    success, result = topotest.run_and_expect(


FPM flag not set via dplane ctx path

NEXTHOP_GROUP_FPM is only set inside fpm_nhg_send_cb — the batch hash-walk that runs during an FPM reconnect/resync. When an NHG install context flows through the FPM dplane provider via fpm_nl_enqueue (the new path this PR introduces), the flag is never written to the nhg_hash_entry. For an interface address added after the last FPM sync, nhg_data.get("fpm", False) will remain False indefinitely, causing the assertion to time out on every run. A more reliable check would either force an FPM reconnect before the assertion (triggering a full resync) or inspect the mock FPM server's received message log directly.

Prompt To Fix With AI

This is a comment left during a code review. Path: tests/topotests/fpm_testing_topo1/test_fpm_topo1.py Line: 256-264 Comment: **FPM flag not set via dplane ctx path** `NEXTHOP_GROUP_FPM` is only set inside `fpm_nhg_send_cb` — the batch hash-walk that runs during an FPM reconnect/resync. When an NHG install context flows through the FPM dplane provider via `fpm_nl_enqueue` (the new path this PR introduces), the flag is never written to the `nhg_hash_entry`. For an interface address added *after* the last FPM sync, `nhg_data.get("fpm", False)` will remain `False` indefinitely, causing the assertion to time out on every run. A more reliable check would either force an FPM reconnect before the assertion (triggering a full resync) or inspect the mock FPM server's received message log directly. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-08T13:32:21Z

+    def check_nhg_sent_to_fpm():
+        nhg_id, nhg_data = _get_nhg_for_prefix(router, prefix)
+        if nhg_id is None or nhg_data is None:
+            return False
+        return nhg_data.get("fpm", False) is True
+
+    success, result = topotest.run_and_expect(
+        check_nhg_sent_to_fpm, True, count=60, wait=1


Misleading error message: result is a boolean, not NHG data

topotest.run_and_expect returns (success, last_function_return_value). Here check_nhg_sent_to_fpm returns True or False, so when the assertion fails the message reads NHG data: False instead of meaningful diagnostic information. Consider capturing the raw NHG JSON inside the check closure and including it in the assertion message.

Prompt To Fix With AI

This is a comment left during a code review. Path: tests/topotests/fpm_testing_topo1/test_fpm_topo1.py Line: 258-265 Comment: **Misleading error message: `result` is a boolean, not NHG data** `topotest.run_and_expect` returns `(success, last_function_return_value)`. Here `check_nhg_sent_to_fpm` returns `True` or `False`, so when the assertion fails the message reads `NHG data: False` instead of meaningful diagnostic information. Consider capturing the raw NHG JSON inside the check closure and including it in the assertion message. How can I resolve this? If you propose a fix, please make it concise.

Currently zebra sends connected/local/kernel routes NHG's to the kernel only upon initial usage outside of one of these 3. Modify the code to send the result down but to ignore the installation into the kernel. There are dplanes out there that are using the nhg's created for the connected routes being installed, which are also ignored by the linux kernel. Signed-off-by: Donald Sharp <sharpd@nvidia.com>

…t skipping kernel Signed-off-by: Yuqing Zhao <galadriel.zyq@alibaba-inc.com>

Tighten up test code for these two things: a) check_nhg_sent_to_fpm needed to be able to see error messages from a run_and_expect block so it can be communicated better what has gone wrong. b) knowledge of sending of data to the fpm can be gated by what zebra thinks it knows. The topotest was confusing this modify the topotest to grab the data it is looking for from asking the fpm_listener what it knows. Signed-off-by: Donald Sharp <sharpd@nvidia.com>

GaladrielZhao · 2026-05-11T10:58:39Z

Current SONiC fpmsyncd handles NHG via RTM_NEWNEXTHOP / RTM_DELNEXTHOP in onNextHopMsg.

Below is how each route type's NHG is processed:

Connected routes: treated as normal unicast routes. NHG referenced by nh_id is resolved from the in-memory map and expanded into ROUTE_TABLE fields (or NEXTHOP_GROUP_TABLE for multipath). Works correctly.
Kernel routes (proto=kernel, type=RTN_UNICAST): same path as connected routes. No special handling.
Local routes (type=RTN_LOCAL): the route itself is intentionally skipped in onRouteMsg ("BUM routes aren't supported yet"). However, its NHG is still received via RTM_NEWNEXTHOP:
- single NH: stored only in memory (m_nh_groups), never written to DB.
- group NH: written to NEXTHOP_GROUP_TABLE on creation. While this creates a brief orphan entry, multipath local routes are uncommon, so the impact is nearly negligible. Even if it happens, the entry is cleaned up on RTM_DELNEXTHOP.

So from the SONiC side, having FRR send all three types of NHGs causes no functional issue and actually fixes the missing connected-route NHG bug we observed.

GaladrielZhao

Thanks, looks good to me.

frrbot Bot added tests Topotests, make check, etc zebra labels May 8, 2026

github-actions Bot added size/L master labels May 8, 2026

greptile-apps Bot reviewed May 8, 2026

View reviewed changes

donaldsharp mentioned this pull request May 8, 2026

zebra: Allow connected routes NHG sent to dplane #21125

Open

donaldsharp and others added 3 commits May 8, 2026 13:01

tests: Add topotest cases for sending RIB system route NHGs to FPM bu…

454cb7e

…t skipping kernel Signed-off-by: Yuqing Zhao <galadriel.zyq@alibaba-inc.com>

donaldsharp force-pushed the no_kernel_nhg_original branch from afff5d3 to 5a6ed59 Compare May 8, 2026 17:02

GaladrielZhao approved these changes May 11, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

No kernel nhg original#21893

No kernel nhg original#21893
donaldsharp wants to merge 3 commits into
FRRouting:masterfrom
donaldsharp:no_kernel_nhg_original

donaldsharp commented May 8, 2026

Uh oh!

greptile-apps Bot commented May 8, 2026

Uh oh!

greptile-apps Bot May 8, 2026

Uh oh!

greptile-apps Bot May 8, 2026

Uh oh!

GaladrielZhao commented May 11, 2026

Uh oh!

GaladrielZhao left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

donaldsharp commented May 8, 2026

Uh oh!

greptile-apps Bot commented May 8, 2026

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot May 8, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 8, 2026

Choose a reason for hiding this comment

Uh oh!

GaladrielZhao commented May 11, 2026

Uh oh!

GaladrielZhao left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants