Skip to content

zebra: Allow connected routes NHG sent to dplane#21125

Open
GaladrielZhao wants to merge 2 commits into
FRRouting:masterfrom
GaladrielZhao:fix/connected_routes_to_dplane
Open

zebra: Allow connected routes NHG sent to dplane#21125
GaladrielZhao wants to merge 2 commits into
FRRouting:masterfrom
GaladrielZhao:fix/connected_routes_to_dplane

Conversation

@GaladrielZhao
Copy link
Copy Markdown
Contributor

When "fpm use-next-hop-groups" is enabled, connected routes' NHGs are not being sent to dplane.
This causes routes that resolve their nexthop through a connected route to fail insertion in the FPM.

Fix this by restricting the NEXTHOP_GROUP_INITIAL_DELAY_INSTALL flag to KERNEL and LOCAL routes only,
allowing connected routes' NHG to be properly downloaded to the dplane.

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Mar 13, 2026

Greptile Summary

This PR fixes a bug where connected routes' nexthop groups (NHGs) were not being sent to the dataplane when fpm use-next-hop-groups is enabled, causing FPM insertion failures for routes that resolve their nexthop via a connected route.

The fix is achieved through two consistent, minimal changes:

  • zebra/zebra_rib.c (rib_add_multipath): The NEXTHOP_GROUP_INITIAL_DELAY_INSTALL flag is now only set for ZEBRA_ROUTE_KERNEL and ZEBRA_ROUTE_LOCAL routes. Previously it was set for all RIB_SYSTEM_ROUTE types (which also included ZEBRA_ROUTE_CONNECT), causing the dplane subsystem to skip actual NHG programming for connected routes.
  • zebra/zebra_nhg.c (zebra_nhg_install_kernel): ZEBRA_ROUTE_CONNECT is removed from the list of types that preserve the NEXTHOP_GROUP_INITIAL_DELAY_INSTALL flag, ensuring that if a connected-type NHG somehow carries the flag it will be cleared and a real installation triggered.

The two changes are logically paired and correct. The only minor issue is a stale comment in zebra_interface_nhg_reinstall that still mentions "kernel/connected routes" after the semantics have changed.

Confidence Score: 4/5

  • Safe to merge; the logic fix is minimal, well-scoped, and consistent across both modified sites.
  • The two changes are tightly coupled and mutually consistent. No new code paths are introduced; existing dplane, uninstall, and interface-event paths all behave correctly for the newly-enabled connected-route NHG flow. The only finding is a stale comment that does not affect correctness.
  • The comment at zebra/zebra_nhg.c lines 4085–4088 references "kernel/connected routes" and should be updated to "kernel/LOCAL routes".

Important Files Changed

Filename Overview
zebra/zebra_nhg.c Removes ZEBRA_ROUTE_CONNECT from the exemption list in zebra_nhg_install_kernel, so connected routes no longer bypass the NEXTHOP_GROUP_INITIAL_DELAY_INSTALL guard. A comment in zebra_interface_nhg_reinstall still refers to "kernel/connected routes" and needs updating.
zebra/zebra_rib.c Narrows the NEXTHOP_GROUP_INITIAL_DELAY_INSTALL flag assignment in rib_add_multipath from all RIB_SYSTEM_ROUTE types (KERNEL + CONNECT + LOCAL) to only RIB_KERNEL_ROUTE and LOCAL, allowing connected-route NHGs to be properly forwarded to the dataplane.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[rib_add_multipath called] --> B{Route type?}
    B -- "ZEBRA_ROUTE_KERNEL\nor ZEBRA_ROUTE_LOCAL" --> C[SET NEXTHOP_GROUP_INITIAL_DELAY_INSTALL]
    B -- "ZEBRA_ROUTE_CONNECT\n(after fix)" --> D[Flag NOT set]
    B -- "Other routes\ne.g. BGP/OSPF" --> D

    C --> E[zebra_nhg_install_kernel\ntype = KERNEL or LOCAL]
    D --> F[zebra_nhg_install_kernel\ntype = CONNECT or other]

    E --> G{type == LOCAL\nor KERNEL?}
    G -- "Yes" --> H[Flag preserved\nDelay optimization kept]
    G -- "No\ne.g. ZEBRA_ROUTE_MAX" --> I[Flag cleared\nNHG marked uninstalled]

    F --> J{type != LOCAL\nand != KERNEL?}
    J -- "Yes (CONNECT,\nROUTE_MAX, etc.)" --> K{Flag set?}
    K -- "No (normal)" --> L[Proceed to install check]
    K -- "Yes (edge case)" --> I
    J -- "No" --> H

    H --> M[dplane_nexthop_add\nFlag set → fake SUCCESS\nNHG marked INSTALLED\nNo dplane op sent]
    L --> N{NHG VALID\nand not INSTALLED?}
    I --> N
    N -- "Yes" --> O[dplane_nexthop_add\nReal op enqueued\nNHG marked QUEUED]
    N -- "No" --> P[Skip install]

    O --> Q[FPM/Kernel receives NHG]
    M --> R[NHG skipped in FPM]
Loading

Comments Outside Diff (1)

  1. zebra/zebra_nhg.c, line 4085-4088 (link)

    Stale comment after connected route change

    The comment at this location still references "kernel/connected routes" but, after this PR, connected routes (ZEBRA_ROUTE_CONNECT) no longer have NEXTHOP_GROUP_INITIAL_DELAY_INSTALL set — only kernel (ZEBRA_ROUTE_KERNEL) and LOCAL (ZEBRA_ROUTE_LOCAL) routes retain this optimization. The comment should be updated to reflect that.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: zebra/zebra_nhg.c
Line: 4085-4088

Comment:
**Stale comment after connected route change**

The comment at this location still references "kernel/connected routes" but, after this PR, connected routes (`ZEBRA_ROUTE_CONNECT`) no longer have `NEXTHOP_GROUP_INITIAL_DELAY_INSTALL` set — only kernel (`ZEBRA_ROUTE_KERNEL`) and LOCAL (`ZEBRA_ROUTE_LOCAL`) routes retain this optimization. The comment should be updated to reflect that.

```suggestion
		/*
		 * If this nhe has 'initial delay' flag set, we should not install this
		 * in kernel in case of any interface events. Zebra created this entry
		 * while processing the kernel/LOCAL routes, just to pretend
		 * the successful kernel install of this NHG
		 */
```

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: e978a11

@GaladrielZhao GaladrielZhao force-pushed the fix/connected_routes_to_dplane branch from e978a11 to 7695923 Compare March 13, 2026 08:15
Comment thread zebra/zebra_nhg.c Outdated

if ((type != ZEBRA_ROUTE_CONNECT && type != ZEBRA_ROUTE_LOCAL &&
type != ZEBRA_ROUTE_KERNEL) &&
if ((type != ZEBRA_ROUTE_LOCAL && type != ZEBRA_ROUTE_KERNEL) &&
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a straight up NAK from me. This is not an appropriate approach to the problem. It completely recreates the problem that was being fixed here. I spoke w/ Eddie about a completely different approach and this was not taken

@donaldsharp
Copy link
Copy Markdown
Member

At the very least a topotest that shows that we do not install the nhg to the kernel and at the same time it is forwarded to the fpm is needed here.

@GaladrielZhao GaladrielZhao force-pushed the fix/connected_routes_to_dplane branch from 7695923 to 9c4a370 Compare March 29, 2026 10:14
@frrbot frrbot Bot added the tests Topotests, make check, etc label Mar 29, 2026
@github-actions github-actions Bot added size/L rebase PR needs rebase and removed size/XS labels Mar 29, 2026
@GaladrielZhao
Copy link
Copy Markdown
Contributor Author

GaladrielZhao commented Mar 29, 2026

Thanks @donaldsharp , really appreciate your feedback.
I've revised the logic to send RIB system route NHGs to dplane using dplane_ctx_set_skip_kernel(), based on the NEXTHOP_GROUP_INITIAL_DELAY_INSTALL flag, in order to avoid kernel programming.
And I've also added the corresponding test cases in topotest "test_fpm_topo1" to validate the behavior.

@donaldsharp
Copy link
Copy Markdown
Member

I'm on vacation this week and will be able to look again next week

@GaladrielZhao GaladrielZhao force-pushed the fix/connected_routes_to_dplane branch 4 times, most recently from 85dbec2 to fb03ca1 Compare March 31, 2026 03:30
Comment thread zebra/zebra_nhg.c
if (IS_ZEBRA_DEBUG_NHG_DETAIL)
zlog_debug("%s: NHG %pNG delayed-install optimization (flags 0x%x)",
__func__, nhe, nhe->flags);
} else {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should still be there. Why are you removing it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, added back.

Comment thread zebra/zebra_nhg.c Outdated
* Clear QUEUED flag for non-system routes to allow re-installation
* when the NHG is shared between multiple protocols.
*/
if (type != ZEBRA_ROUTE_CONNECT && type != ZEBRA_ROUTE_LOCAL && type != ZEBRA_ROUTE_KERNEL)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's use RSYSTEM_ROUTE

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@donaldsharp
Copy link
Copy Markdown
Member

My alternate proposal is #21893

When "fpm use-next-hop-groups" is enabled,
NHGs for system routes (connected, local, kernel) carry the
NEXTHOP_GROUP_INITIAL_DELAY_INSTALL flag, which was originally
designed to prevent them from being programmed into kernel.
However, this causes routes that resolve their nexthop
through a connected route to fail insertion in the FPM.

Fix this by enqueuing INITIAL_DELAY NHGs into dplane as normal,
but use dplane_ctx_set_skip_kernel() for these NHGs.
This allows FPM to receive and program them
while skipping kernel programming.

Signed-off-by: Yuqing Zhao <galadriel.zyq@alibaba-inc.com>
…t skipping kernel

Signed-off-by: Yuqing Zhao <galadriel.zyq@alibaba-inc.com>
@GaladrielZhao GaladrielZhao force-pushed the fix/connected_routes_to_dplane branch from fb03ca1 to ff7f6d0 Compare May 11, 2026 10:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

master rebase PR needs rebase size/L tests Topotests, make check, etc zebra

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants