No kernel nhg original #21893
Conversation
Greptile Summary

This PR simplifies how NHGs for connected/local/kernel (system) routes are handled in the dataplane by letting them flow through the full dplane queue with a skip-kernel flag set on the dplane context.
Confidence Score: 3/5

The C changes are correct, but the new test may always time out, leaving the PR's primary behavioral claim unverified. The core C implementation is clean, and the flag-management changes in zebra_nhg.c are correct. The concern is with the test: NEXTHOP_GROUP_FPM is only written during the FPM batch resync walk (fpm_nhg_send_cb), not when individual NHG contexts flow through fpm_nl_enqueue. A connected/local/kernel NHG added after the last FPM sync will never show fpm=True in its show output, so check_nhg_sent_to_fpm will time out on every CI run. tests/topotests/fpm_testing_topo1/test_fpm_topo1.py: the FPM presence assertion needs a different verification strategy.
Sequence Diagram

```mermaid
sequenceDiagram
    participant RIB as zebra_rib (main thread)
    participant NHG as zebra_nhg
    participant DPLANE as dplane thread
    participant FPM as FPM provider
    participant KERNEL as kernel provider
    RIB->>NHG: zebra_nhg_install_kernel(nhe, CONNECT/LOCAL/KERNEL)
    Note over NHG: INITIAL_DELAY_INSTALL set, skip flag-clearing block
    NHG->>DPLANE: dplane_nexthop_add(nhe)
    DPLANE->>DPLANE: dplane_ctx_set_skip_kernel(ctx)
    DPLANE->>DPLANE: dplane_update_enqueue(ctx)
    DPLANE-->>NHG: ZEBRA_DPLANE_REQUEST_QUEUED
    NHG->>NHG: SET_FLAG NEXTHOP_GROUP_QUEUED
    DPLANE->>FPM: fpm_nl_process ctx with NH_INSTALL op
    FPM->>FPM: fpm_nl_enqueue encodes and sends NHG to FPM socket
    FPM-->>DPLANE: enqueue to output
    DPLANE->>KERNEL: kernel provider dequeues ctx
    KERNEL->>KERNEL: dplane_ctx_is_skip_kernel is true
    KERNEL->>KERNEL: set status SUCCESS, skip netlink
    KERNEL-->>DPLANE: enqueue to output
    DPLANE-->>NHG: zebra_nhg_dplane_result SUCCESS
    NHG->>NHG: UNSET QUEUED/REINSTALL, SET INSTALLED
    NHG->>NHG: zebra_nhg_handle_install(nhe)
```
Fix the following 2 code review issues.
---
### Issue 1 of 2
tests/topotests/fpm_testing_topo1/test_fpm_topo1.py:256-264
**FPM flag not set via dplane ctx path**
`NEXTHOP_GROUP_FPM` is only set inside `fpm_nhg_send_cb` — the batch hash-walk that runs during an FPM reconnect/resync. When an NHG install context flows through the FPM dplane provider via `fpm_nl_enqueue` (the new path this PR introduces), the flag is never written to the `nhg_hash_entry`. For an interface address added *after* the last FPM sync, `nhg_data.get("fpm", False)` will remain `False` indefinitely, causing the assertion to time out on every run. A more reliable check would either force an FPM reconnect before the assertion (triggering a full resync) or inspect the mock FPM server's received message log directly.
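The "inspect the mock FPM server's received message log" alternative could look like the sketch below. The one-decoded-message-per-line log format and the `id=` field are assumptions for illustration; the regex would need adapting to whatever the real fpm_listener actually prints.

```python
import re

def fpm_listener_saw_nhg(log_text: str, nhg_id: int) -> bool:
    """Scan a mock FPM server's message log for an RTM_NEWNEXTHOP
    carrying the given NHG id.

    Hypothetical log format: one decoded message per line, e.g.
        RTM_NEWNEXTHOP proto=zebra id=73 dev=eth0
    """
    pattern = re.compile(r"RTM_NEWNEXTHOP\b.*\bid[=:]\s*(\d+)")
    for line in log_text.splitlines():
        m = pattern.search(line)
        if m and int(m.group(1)) == nhg_id:
            return True
    return False
```

Polling this predicate from `run_and_expect` checks what the FPM peer actually received, rather than what zebra believes it sent.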
### Issue 2 of 2
tests/topotests/fpm_testing_topo1/test_fpm_topo1.py:258-265
**Misleading error message: `result` is a boolean, not NHG data**
`topotest.run_and_expect` returns `(success, last_function_return_value)`. Here `check_nhg_sent_to_fpm` returns `True` or `False`, so when the assertion fails the message reads `NHG data: False` instead of meaningful diagnostic information. Consider capturing the raw NHG JSON inside the check closure and including it in the assertion message.
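One way to capture the diagnostics, sketched with the test's helpers stubbed out (`get_nhg` stands in for `_get_nhg_for_prefix`; names here are illustrative, not the actual fix):

```python
def make_check(get_nhg, router, prefix):
    """Build a check closure that records the raw NHG lookup result,
    so a failed assertion can report real data instead of False."""
    last = {"nhg_id": None, "nhg_data": None}

    def check_nhg_sent_to_fpm():
        nhg_id, nhg_data = get_nhg(router, prefix)
        last["nhg_id"], last["nhg_data"] = nhg_id, nhg_data
        return nhg_data is not None and nhg_data.get("fpm", False) is True

    return check_nhg_sent_to_fpm, last
```

The assertion can then read something like `assert success, "NHG {} not sent to FPM, data: {}".format(last["nhg_id"], last["nhg_data"])`.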
Reviews (1): Last reviewed commit: "tests: Add topotest cases for sending RI..."
| """ | ||
|
|
||
| def check_nhg_sent_to_fpm(): | ||
| nhg_id, nhg_data = _get_nhg_for_prefix(router, prefix) | ||
| if nhg_id is None or nhg_data is None: | ||
| return False | ||
| return nhg_data.get("fpm", False) is True | ||
|
|
||
| success, result = topotest.run_and_expect( |
There was a problem hiding this comment.
FPM flag not set via dplane ctx path
NEXTHOP_GROUP_FPM is only set inside fpm_nhg_send_cb — the batch hash-walk that runs during an FPM reconnect/resync. When an NHG install context flows through the FPM dplane provider via fpm_nl_enqueue (the new path this PR introduces), the flag is never written to the nhg_hash_entry. For an interface address added after the last FPM sync, nhg_data.get("fpm", False) will remain False indefinitely, causing the assertion to time out on every run. A more reliable check would either force an FPM reconnect before the assertion (triggering a full resync) or inspect the mock FPM server's received message log directly.
Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/topotests/fpm_testing_topo1/test_fpm_topo1.py
Line: 256-264
Comment:
**FPM flag not set via dplane ctx path**
`NEXTHOP_GROUP_FPM` is only set inside `fpm_nhg_send_cb` — the batch hash-walk that runs during an FPM reconnect/resync. When an NHG install context flows through the FPM dplane provider via `fpm_nl_enqueue` (the new path this PR introduces), the flag is never written to the `nhg_hash_entry`. For an interface address added *after* the last FPM sync, `nhg_data.get("fpm", False)` will remain `False` indefinitely, causing the assertion to time out on every run. A more reliable check would either force an FPM reconnect before the assertion (triggering a full resync) or inspect the mock FPM server's received message log directly.
How can I resolve this? If you propose a fix, please make it concise.| def check_nhg_sent_to_fpm(): | ||
| nhg_id, nhg_data = _get_nhg_for_prefix(router, prefix) | ||
| if nhg_id is None or nhg_data is None: | ||
| return False | ||
| return nhg_data.get("fpm", False) is True | ||
|
|
||
| success, result = topotest.run_and_expect( | ||
| check_nhg_sent_to_fpm, True, count=60, wait=1 |
There was a problem hiding this comment.
Misleading error message:
result is a boolean, not NHG data
topotest.run_and_expect returns (success, last_function_return_value). Here check_nhg_sent_to_fpm returns True or False, so when the assertion fails the message reads NHG data: False instead of meaningful diagnostic information. Consider capturing the raw NHG JSON inside the check closure and including it in the assertion message.
Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/topotests/fpm_testing_topo1/test_fpm_topo1.py
Line: 258-265
Comment:
**Misleading error message: `result` is a boolean, not NHG data**
`topotest.run_and_expect` returns `(success, last_function_return_value)`. Here `check_nhg_sent_to_fpm` returns `True` or `False`, so when the assertion fails the message reads `NHG data: False` instead of meaningful diagnostic information. Consider capturing the raw NHG JSON inside the check closure and including it in the assertion message.
Currently zebra sends connected/local/kernel routes' NHGs to the kernel only upon their first use outside one of these three route types. Modify the code to send the result down but to ignore the installation into the kernel. There are dplanes out there that use the NHGs created for connected routes being installed, which are also ignored by the linux kernel.

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
…t skipping kernel Signed-off-by: Yuqing Zhao <galadriel.zyq@alibaba-inc.com>
Tighten up test code for these two things:

a) check_nhg_sent_to_fpm needed to surface error messages from the run_and_expect block so that what has gone wrong can be communicated better.

b) knowledge of whether data was sent to the fpm cannot be gated on what zebra thinks it knows. The topotest was confusing the two; modify the topotest to grab the data it is looking for by asking the fpm_listener what it knows.

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
afff5d3 to 5a6ed59
Current SONiC fpmsyncd handles NHGs via RTM_NEWNEXTHOP / RTM_DELNEXTHOP in onNextHopMsg. Below is how each route type's NHG is processed:

So from the SONiC side, having FRR send all three types of NHGs causes no functional issue, and it actually fixes the missing connected-route NHG bug we observed.
GaladrielZhao left a comment:

Thanks, looks good to me.
PR #21125 attempts to modify slightly how NHG data is sent to the kernel for connected/local/kernel NHGs. The approach taken there was too complicated; to figure out what was going on I had to reverse engineer it, to the point where my approach is just much simpler and frankly easier to understand. Here's my alternate approach to the problem.