Description
We are seeing an OSPFv3 convergence bug in FRR under heavy churn/startup load. We roll out a large number of FRR pods (e.g. >1000).
On some routers, the OSPFv3 neighbor state reaches Full, but routes are not properly installed or usable for a long time (or stay stuck indefinitely). In this state, end-to-end connectivity fails for some destinations even though the adjacency itself looks healthy.
A manual `clear ipv6 ospf6 process` on the affected node recovers it immediately.
Version
`show version` on affected pods reports FRR 10.5.2.
We also applied the changes from PR #20897.
How to reproduce
- Deploy a large IPv6 OSPFv3 topology (in our case Kubernetes pods, roughly 500 to 1200 routers).
- Start all nodes nearly simultaneously.
- Wait until many adjacencies reach Full.
- Run end-to-end connectivity checks.
- Observe that a subset of them fails.
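The consistency check behind the last two steps can be sketched as follows. This is a minimal sketch of what we run per node; the parsing heuristics and the abbreviated sample outputs are assumptions about our setup, not FRR-specified output formats:

```python
import re

def count_full_neighbors(neighbor_output: str) -> int:
    """Count neighbors in Full state in `show ipv6 ospf6 neighbor` output."""
    return sum(1 for line in neighbor_output.splitlines()
               if re.search(r"\bFull\b", line))

def count_ospf6_routes(route_output: str) -> int:
    """Count installed OSPFv3 routes in `show ipv6 route ospf6` output
    (route lines begin with the 'O' protocol code)."""
    return sum(1 for line in route_output.splitlines() if line.startswith("O"))

# Abbreviated sample outputs, illustrative only.
NEIGHBORS = """\
Neighbor ID     Pri    DeadTime    State/IfState         Duration I/F[State]
10.0.0.2          1    00:00:32     Full/PointToPoint    00:10:05 eth0[PointToPoint]
10.0.0.3          1    00:00:35     Full/PointToPoint    00:09:58 eth1[PointToPoint]
"""

ROUTES = """\
O>* 2001:db8:1::/64 [110/20] via fe80::2, eth0, 00:09:50
"""

full = count_full_neighbors(NEIGHBORS)
routes = count_ospf6_routes(ROUTES)
# The bad state we observe: Full adjacencies but fewer routes than expected.
suspicious = full > 0 and routes < full
print(full, routes, suspicious)
```

In the real check the two outputs come from `vtysh -c "show ipv6 ospf6 neighbor"` and `vtysh -c "show ipv6 route ospf6"` on each pod, and the expected route count is derived from the topology rather than from the neighbor count.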
Expected behavior
When neighbor state is Full, OSPFv3 routes should be consistently present and installed into zebra/kernel without manual intervention.
No node should require `clear ipv6 ospf6 process` to become operational.
Actual behavior
Intermittently, a node gets into a bad state:
- the neighbor shows Full,
- but route programming is incomplete or missing,
- and connectivity from/to that node fails for multiple destinations.
Example pattern we repeatedly observed:
- `show ipv6 ospf6 neighbor` => Full neighbors exist
- `show ipv6 route ospf6` => routes missing or too few at the affected moments
- `show ipv6 ospf6 route` may show entries, but forwarding is still not correct on the affected node
- after `clear ipv6 ospf6 process`, routes are reinstalled and connectivity returns
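As a stop-gap we automate the manual recovery with a small watchdog. A minimal sketch of its decision logic, assuming the counts are obtained as described above; the threshold parameter and the `vtysh -c` invocation are assumptions from our environment, not an FRR-provided mechanism:

```python
import subprocess

def should_clear(full_neighbors: int, installed_routes: int,
                 min_expected_routes: int) -> bool:
    """Detect the bad state from the report: adjacencies are Full, yet
    zebra holds fewer OSPFv3 routes than the topology should produce."""
    return full_neighbors > 0 and installed_routes < min_expected_routes

def clear_ospf6_process() -> None:
    """Apply the manual workaround from the report via vtysh."""
    subprocess.run(["vtysh", "-c", "clear ipv6 ospf6 process"], check=True)

# Example decision only (no vtysh call here): 4 Full neighbors, 0 routes.
if should_clear(full_neighbors=4, installed_routes=0, min_expected_routes=3):
    print("would run: clear ipv6 ospf6 process")
```

We rate-limit the clear so a node is not reset repeatedly while it is still converging, since `clear ipv6 ospf6 process` itself tears down and rebuilds all adjacencies.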
Additional context
- Reproduced multiple times in a containerized environment with many concurrent OSPFv3 instances.
- The issue is much easier to trigger under large-scale startup/restart conditions than in very small topologies.
Checklist