Prevent worst-case exponential complexity in dependency evaluation#10523
Conversation
jschmidt-icinga
left a comment
I've tested this with the given config snippet and verified that this indeed fixes the exponential CPU-time use by the master-branch code. It's still not perfect, I now get ~2% usage vs. ~0.5% with about the same amount of objects (dependencies, hosts and services) in a flat hierarchy, but maybe that can't be avoided anyway, I don't know...
What exactly do you mean by flat? I'd expect the most comparable result from a long chain of dependencies (i.e. 0 -> 1 -> 2 -> 3 -> ...) with the same number of checkables (i.e. length being twice the depth of the graph in my example), though that will probably still be a bit cheaper as you only have half the
Same number of objects total, but each service depends on a single host (i.e. the one host).
That's probably it.
This allows using ref-counted pointers to const objects. Adds a second typedef so that T::ConstPtr can be used similar to how T::Ptr currently is.
yhabteab
left a comment
Before:
Bildschirmaufnahme.2025-07-31.um.10.15.07.mov
After:
Bildschirmaufnahme.2025-07-31.um.10.23.12.mov
Checkable::IsReachable() and DependencyGroup::GetState() call each other recursively. Moving them to a common helper class allows adding caching to them in a later commit without having to pass a cache between the functions (through a public interface) or resorting to thread_local variables.
So far, calling Checkable::IsReachable() traversed all possible paths to its parents. In case a parent is reachable via multiple paths, all of its parents were evaluated multiple times, resulting in a worst-case exponential complexity. With this commit, the implementation keeps track of which checkables were already visited and uses the already-computed reachability instead of repeating the computation, ensuring a worst-case linear runtime in the size of the graph.
So far, calling Checkable::IsReachable() traversed all possible paths to its parents. In case a parent is reachable via multiple paths, all of its parents were evaluated multiple times, resulting in a worst-case exponential complexity.
With this PR, the implementation keeps track of which checkables were already visited and uses the already-computed reachability instead of repeating the computation, ensuring a worst-case linear runtime within the graph size.
To implement this, the PR contains two additional preparatory commits:
- `T::ConstPtr` as a typedef for `intrusive_ptr<const T>`, combined with the necessary changes to support this (changing some functions to accept a `const Object*` instead of an `Object*` as well as marking the reference counter attribute as `mutable`). This was done because `Checkable::IsReachable()` is `const`, i.e. `this` is of type `const Checkable*`, so a `Checkable::Ptr` can't be constructed from it, only the new `Checkable::ConstPtr`.
- `Checkable::IsReachable()` and `DependencyGroup::GetState()` are moved to a new helper class (the original methods are kept and now transparently call the ones in the helper class), so that they can easily access common storage (i.e. a cache needed for implementing this) without having to explicitly pass the cache through a public interface or having to resort to `thread_local`.

Test
I've prepared a config file that creates the following dependency graph that is pretty much the worst case for the old implementation. Note that this diagram is reduced to 4 levels, the actual config generates 25 levels, making the issue much more obvious when running the config.
```mermaid
graph TD;
  2-->0; 2-->1;
  3-->0; 3-->1;
  4-->2; 4-->3;
  5-->2; 5-->3;
  6-->4; 6-->5;
  7-->4; 7-->5;
```

bobby-little-dependencies.conf
This can easily be started in a container:
```
docker run --rm -it -v $(pwd)/bobby-little-dependencies.conf:/icinga2.conf:ro icinga/icinga2:dev icinga2 daemon -c /icinga2.conf
```

Before
Heavy CPU usage (fluctuating somewhere between 200% and 800% on my machine), checks aren't executed every 10s as configured.
After
Almost no CPU usage (mostly idle, peaks of around 2% on my machine), regular check execution.
ref/IP/59145
ref/IP/59671