Fix overdue state doesn't account for timeperiods in HA cluster #10562
yhabteab wants to merge 10 commits into efficient-config-and-state-update-queue from
Conversation
jschmidt-icinga
left a comment
So basically, the DB will only be updated in these cases:
- In the non-forced External Command Pipe commands
- Unconditionally via the RescheduleCheck ApiAction
- When scheduling a check that is not forced, is active, and `enable_service_checks` is true
In all other cases we just trigger internal state-changes.
I'm not sure the SetNextCheck() is the right place to make the decision about which event to persist to the DB, but I get that this is a low impact change that fixes the issue, and so I don't want to restart the bikeshedding from #10082.
lib/checker/checkercomponent.cpp
Outdated
-	if (host && !service && (!checkable->GetEnableActiveChecks() || !icingaApp->GetEnableHostChecks())) {
+	activeChecksDisabled = !checkable->GetEnableActiveChecks() || !icingaApp->GetEnableServiceChecks();
+	if (activeChecksDisabled && host && !service) {
 		Log(LogNotice, "CheckerComponent")
 			<< "Skipping check for host '" << host->GetName() << "': active host checks are disabled";
 		check = false;
 	}
-	if (host && service && (!checkable->GetEnableActiveChecks() || !icingaApp->GetEnableServiceChecks())) {
+	} else if (activeChecksDisabled) {
Can you please explain this diff a bit more?
Why are we now checking icingaApp->GetEnableServiceChecks() in the host case? Or is that a mistake?
Is this change meaningful for what this PR does or just a small refactoring since we need the information for the activeChecksDisabled flag?
Also, since we're touching this anyway, maybe it can be simplified to have less duplication, since all that's different is the word "host"/"service" in the log message.
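To illustrate the deduplication suggested above, here's a minimal sketch, assuming a simplified stand-in for the checker branch (the `SkipMessage` helper and its parameters are hypothetical, not Icinga 2 API):

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: compute the "host"/"service" word once instead of
// keeping two near-identical if-blocks that differ only in the log message.
std::string SkipMessage(bool isService, const std::string& name)
{
    const char* type = isService ? "service" : "host";
    std::ostringstream msg;
    msg << "Skipping check for " << type << " '" << name
        << "': active " << type << " checks are disabled";
    return msg.str();
}
```

The disabled-flag computation would then live in one place, and only the log line needs the host/service distinction.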
force-pushed d8c1d07 to f8a4acb
force-pushed f8a4acb to c7775eb
I've changed the implementation according to our offline discussion with @jschmidt-icinga yesterday, and with a bunch of comments added, the change list is a bit bigger than before, but it's way better, I think. So please have a look! PS: I've also updated the PR description!
Would it be possible to add test cases for all permutations of state changes? The checker component doesn't have any dependencies of its own, and you could make a simple mock object connecting to the same events the databases receive. Then reschedule checks on checkables and verify everything is updated and consistent.

Also note that unit tests act as a kind of documentation too. Well-commented tests sometimes describe what code does better than the code itself. (Speaking as someone who still doesn't grasp the entirety of check scheduling in its complexity.)

Edit: To clarify, I don't mean a full unit-test suite for the entire checker component, though that might be useful in the future; for now, maybe just a few cases for the specific issue fixed in this PR.
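The mock-object idea could look roughly like the following sketch. Icinga 2 uses boost::signals2 for these events; a plain callback list stands in here so the sketch stays self-contained, and all names (`NextCheckSignal`, `MockDb`) are hypothetical:

```cpp
#include <functional>
#include <vector>

// Minimal stand-in for a checkable's next-check event source.
struct NextCheckSignal {
    std::vector<std::function<void(double)>> slots;
    void Connect(std::function<void(double)> f) { slots.push_back(std::move(f)); }
    void Emit(double nextCheck) { for (auto& f : slots) f(nextCheck); }
};

// Mock "database" subscribing to the same events the real backends receive,
// so a test can assert exactly what would have been persisted.
struct MockDb {
    std::vector<double> receivedNextChecks;
    void Subscribe(NextCheckSignal& sig) {
        sig.Connect([this](double t) { receivedNextChecks.push_back(t); });
    }
};
```

A test would then reschedule checks, let the signal fire, and compare `receivedNextChecks` against the expected timestamps.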
force-pushed c7775eb to 0bcee67
I've added basic unit tests for the checker now as requested, so the PR size has drastically increased :)! But I hope the tests give you a little overview of how the checker works, or is actually supposed to work.
jschmidt-icinga
left a comment
First of all, amazing work adding these tests with all that infrastructure in what must only have been a few hours of time.
I'm not at all concerned with the increase in diff size due to the tests, as it is pretty clear that the production code actually gets a reduction in complexity from this PR.
For now I'm only going to comment on some of the test methods used, but I'm also continuing to look at what they do and the rest of the PR, which looks very promising.
force-pushed 0bcee67 to 5f3b7d2
I think I've addressed all your requested changes. Plus, our checker should now also have roughly 70-90% code coverage. I've added some more unit tests to cover remotely executed checks aka
Al2Klimov
left a comment
Test: redis-cli MONITOR with one check.
c2c9b6b
1759312891.029505 [0 192.168.65.1:60842] "HSET" "icinga:host:state" "d7e93a933920b2a35baf2479691400a961d92311" "{\"affects_children\":false,\"check_attempt\":1,\"check_commandline\":\"'dummy'\",\"check_source\":\"ws-aklimov7777777.local\",\"check_timeout\":60000,\"environment_id\":\"824bc16396c1df319c0a1e50a335de0df3d2ce9d\",\"execution_time\":0,\"hard_state\":0,\"host_id\":\"d7e93a933920b2a35baf2479691400a961d92311\",\"id\":\"d7e93a933920b2a35baf2479691400a961d92311\",\"in_downtime\":false,\"is_acknowledged\":false,\"is_active\":true,\"is_flapping\":false,\"is_handled\":false,\"is_problem\":false,\"is_reachable\":true,\"is_sticky_acknowledgement\":false,\"last_state_change\":1759312863888,\"last_update\":1759312891027,\"latency\":7,\"next_check\":1759312901019,\"next_update\":1759312911035,\"output\":\"Check was successful.\",\"previous_hard_state\":99,\"previous_soft_state\":99,\"scheduling_source\":\"ws-aklimov7777777.local\",\"severity\":0,\"soft_state\":0,\"state_type\":\"hard\"}"
1759312891.029533 [0 192.168.65.1:60842] "HSET" "icinga:checksum:host:state" "d7e93a933920b2a35baf2479691400a961d92311" "{\"checksum\":\"acb9e73a6beec90b4ed53726d693d85f30085e77\"}"
1759312891.029545 [0 192.168.65.1:60842] "ZADD" "icinga:nextupdate:host" "1759312911.035350" "d7e93a933920b2a35baf2479691400a961d92311"
5f3b7d2
1759313097.299299 [0 192.168.65.1:32736] "HSET" "icinga:host:state" "d7e93a933920b2a35baf2479691400a961d92311" "{\"affects_children\":false,\"check_attempt\":1,\"check_commandline\":\"'dummy'\",\"check_source\":\"ws-aklimov7777777.local\",\"check_timeout\":60000,\"environment_id\":\"824bc16396c1df319c0a1e50a335de0df3d2ce9d\",\"execution_time\":0,\"hard_state\":0,\"host_id\":\"d7e93a933920b2a35baf2479691400a961d92311\",\"id\":\"d7e93a933920b2a35baf2479691400a961d92311\",\"in_downtime\":false,\"is_acknowledged\":false,\"is_active\":true,\"is_flapping\":false,\"is_handled\":false,\"is_problem\":false,\"is_reachable\":true,\"is_sticky_acknowledgement\":false,\"last_state_change\":1759312863888,\"last_update\":1759313097297,\"latency\":6,\"next_check\":1759313106638,\"next_update\":1759313116651,\"output\":\"Check was successful.\",\"previous_hard_state\":99,\"previous_soft_state\":99,\"scheduling_source\":\"ws-aklimov7777777.local\",\"severity\":0,\"soft_state\":0,\"state_type\":\"hard\"}"
1759313097.299331 [0 192.168.65.1:32736] "HSET" "icinga:checksum:host:state" "d7e93a933920b2a35baf2479691400a961d92311" "{\"checksum\":\"6a8714aa442b2d076193cfa3baf4603a35cf5fec\"}"
1759313097.299346 [0 192.168.65.1:32736] "ZADD" "icinga:nextupdate:host" "1759313116.651438" "d7e93a933920b2a35baf2479691400a961d92311"
1759313097.300295 [0 192.168.65.1:32736] "HSET" "icinga:host:state" "d7e93a933920b2a35baf2479691400a961d92311" "{\"affects_children\":false,\"check_attempt\":1,\"check_commandline\":\"'dummy'\",\"check_source\":\"ws-aklimov7777777.local\",\"check_timeout\":60000,\"environment_id\":\"824bc16396c1df319c0a1e50a335de0df3d2ce9d\",\"execution_time\":0,\"hard_state\":0,\"host_id\":\"d7e93a933920b2a35baf2479691400a961d92311\",\"id\":\"d7e93a933920b2a35baf2479691400a961d92311\",\"in_downtime\":false,\"is_acknowledged\":false,\"is_active\":true,\"is_flapping\":false,\"is_handled\":false,\"is_problem\":false,\"is_reachable\":true,\"is_sticky_acknowledgement\":false,\"last_state_change\":1759312863888,\"last_update\":1759313097297,\"latency\":6,\"next_check\":1759313106638,\"next_update\":1759313116651,\"output\":\"Check was successful.\",\"previous_hard_state\":99,\"previous_soft_state\":99,\"scheduling_source\":\"ws-aklimov7777777.local\",\"severity\":0,\"soft_state\":0,\"state_type\":\"hard\"}"
1759313097.300320 [0 192.168.65.1:32736] "HSET" "icinga:checksum:host:state" "d7e93a933920b2a35baf2479691400a961d92311" "{\"checksum\":\"6a8714aa442b2d076193cfa3baf4603a35cf5fec\"}"
Conclusion
Smells like #9501; I didn't even test state changes yet.
Consider a dead simple additional OnNextCheckUpdated cluster event instead.
Any updates on this? Would be great to see this fixed; right now we are just ignoring overdue checks :(
Hi, this one is currently blocked by another PR, so nobody is working on this right now.
This commit changes the ordering of CheckableScheduleInfo in the multi-index container to ensure that checkables with running checks are pushed to the end of the ordering. This prevents them from being prioritized for scheduling ahead of others, which could lead to unnecessary CPU load due to repeated scheduling attempts. By using a very large value for the index of checkables with running checks, they are effectively deprioritized until their current check is completed and they can be reinserted with their actual next check time.
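The "very large index value" idea from this commit can be sketched as follows; this is a simplified stand-in (an ordered `std::multimap` instead of the Boost multi-index container the checker actually uses, with hypothetical names):

```cpp
#include <limits>
#include <map>
#include <string>

// The checker keeps checkables ordered by next-check time; the earliest entry
// is picked first. Keyed here by timestamp, value is the checkable's name.
using ScheduleQueue = std::multimap<double, std::string>;

// Sentinel that sorts after every real timestamp, so entries with a running
// check land at the end of the ordering and are never picked for scheduling
// until the check completes and they are reinserted with a real time.
constexpr double kRunningSentinel = std::numeric_limits<double>::max();

void MarkRunning(ScheduleQueue& q, ScheduleQueue::iterator it)
{
    std::string name = it->second;
    q.erase(it);
    q.emplace(kRunningSentinel, std::move(name)); // deprioritize while running
}
```

Once the check result arrives, the entry would be erased and reinserted with its actual next check time, restoring normal priority.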
Since the scheduler accounts for already running checks, we need to update the `next_check` timestamp in `Checkable::ExecuteCheck()` only where it actually makes sense to do so; for local checks this doesn't make sense at all. The only two cases where we need to update the next check beforehand are:

1) The execute command event is sent to a connected remote endpoint, so we need to set the next check to a time in the future until we actually receive the check result back from the remote endpoint. However, it must not be too far in the future, to avoid the check not being re-run for too long in case the remote endpoint never responds.

2) The check is a remote check, but either the endpoint is currently syncing replay logs or not connected at all, and we are within the magical 5min cold startup window. In these cases, the check is effectively skipped, and no check result will come in for it, so we manually update the next check normally as if the check had been executed.

In the other cases, either the check is executed locally, which means the `m_RunningCheck` flag already prevents the scheduler from re-running the check, or this is a remote check and the endpoint is not connected but we are outside the cold startup window, in which case we also don't do anything, as we've already called `Checkable::ProcessCheckResult()` with an appropriate error state, which in turn calls `Checkable::UpdateNextCheck()`.
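The case analysis above can be condensed into a small decision function. This is a hedged sketch only, with hypothetical names and boolean inputs standing in for the real checkable/endpoint state:

```cpp
// What ExecuteCheck() would do to next_check before the result arrives.
enum class NextCheckAction {
    None,                // leave next_check alone
    TemporaryFutureBump, // case 1: bump into the near future, await remote result
    NormalUpdate         // case 2: check skipped, reschedule as if it ran
};

NextCheckAction DecideNextCheckUpdate(bool remoteCheck, bool endpointUsable,
                                      bool withinColdStartupWindow)
{
    if (!remoteCheck)
        return NextCheckAction::None; // m_RunningCheck already blocks rescheduling
    if (endpointUsable)
        return NextCheckAction::TemporaryFutureBump; // wait for the remote result
    if (withinColdStartupWindow)
        return NextCheckAction::NormalUpdate; // no result will ever come in
    // Outside the window: ProcessCheckResult() with an error state already ran
    // and called UpdateNextCheck() for us.
    return NextCheckAction::None;
}
```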
This commit introduces a new, somewhat special `OnRescheduleCheck` signal that is emitted whenever we want to tell the checker to reschedule the checkable at a specific timestamp without actually changing the next check time. Previously, we called `SetNextCheck` with some arbitrary timestamp just to force the checker to pick it up either immediately or at a specific time. At some point, subscribing to the `OnNextCheckChanged` signal thus became effectively unusable for any purpose other than informing the checker about a new next check time, which resulted in introducing a new signal solely responsible for informing Icinga DB and the IDO about a new next check time in places where calling `SetNextCheck` did make sense. This commit does quite the opposite: it replaces all calls to `SetNextCheck` that were only used to inform the checker about a new next check time with `OnRescheduleCheck` calls. Only places where we actually want to change the next check time still call `SetNextCheck` and thus inform the checker and all other listeners about the new next check time. As a bonus, we got rid of holding the two object locks for child and parent at the same time.
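The signal split described above can be sketched like this; the real code uses boost::signals2, so plain callback lists and the simplified `Checkable` type here are stand-ins, not the actual Icinga 2 API:

```cpp
#include <functional>
#include <vector>

struct Checkable {
    double nextCheck = 0;

    // SetNextCheck(): the persisted time really changes, so everyone
    // (checker, Icinga DB, IDO) is notified.
    std::vector<std::function<void(double)>> onNextCheckChanged;

    // RescheduleCheck(): only nudge the checker to act at a given time;
    // the persisted next-check time stays untouched.
    std::vector<std::function<void(double)>> onRescheduleCheck;

    void SetNextCheck(double t) {
        nextCheck = t;
        for (auto& f : onNextCheckChanged) f(t);
    }

    void RescheduleCheck(double t) {
        for (auto& f : onRescheduleCheck) f(t); // nextCheck unchanged
    }
};
```

With this separation, database backends subscribe only to `onNextCheckChanged` and never see the checker-internal nudges.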
force-pushed 5f3b7d2 to 41c78d7
Given the base branch is #10619 now, #10562 (review) doesn't matter anymore.
force-pushed 41c78d7 to cc491af
So, after the long discussions in #10082 that didn't lead to a final conclusion, here's a minimal implementation without too many refactorings and changes to existing logic. This PR changes the checker queueing behaviour a bit to never reschedule checkables with ongoing checks. Thus, we avoid piling up checkables with running checks in the queue that would never be executed anyway. Since this change affects a critical part of Icinga 2, we should be very careful not to introduce any regressions or performance degradations. So far, I've let it run some checks with 500ms intervals and didn't notice any issues (see the attached graph), though the graph shows a small hiccup at "10:35" caused by some API requests triggered by me.

Another behaviour change is that previously the checker didn't trigger the now dropped `Checkable::OnNextCheckUpdated` signal when active checks were disabled. However, since the Icinga DB `SendNextUpdate` function already has logic to send a `ZREM` Redis command when active checks are disabled, the checker now triggers the `Checkable::OnNextCheckChanged` signal unconditionally.

Now, to the individual changes: each commit is self-explanatory about what it does and why it's needed, and contains some more details about the reasoning behind it. So please have a look at the individual commits.
Tests
Just like before, a simple `/v1/actions/reschedule-check` request triggers two `icinga:nextupdate:host` events... while a `/v1/actions/process-check-result` only triggers a single `icinga:nextupdate:` event.

1758547833.559175 [12 172.18.0.1:57800] "HSET" "icinga:host:state" "4c80e4ad47034480b3c3bb28885970b6bb50983e" "{\"affects_children\":false,\"check_attempt\":1,\"check_source\":\"mbp-yhabteab.fritz.box\",\"check_timeout\":60000,\"environment_id\":\"30e34a7a6c432baf29e051ed428aee3825f25e67\",\"execution_time\":0,\"hard_state\":0,\"host_id\":\"4c80e4ad47034480b3c3bb28885970b6bb50983e\",\"id\":\"4c80e4ad47034480b3c3bb28885970b6bb50983e\",\"in_downtime\":false,\"is_acknowledged\":false,\"is_active\":true,\"is_flapping\":false,\"is_handled\":false,\"is_problem\":true,\"is_reachable\":true,\"is_sticky_acknowledgement\":false,\"last_state_change\":1758547833522,\"last_update\":1758547833522,\"latency\":0,\"next_check\":1758548133522,\"next_update\":1758548193522,\"normalized_performance_data\":\"wrong=85;0;0;0 ok=1;0;0;0 max=215.600000;0;0;0\",\"output\":\" /dev/dm-13: 1.0MiB/s read1, 4.3KiB/s write1, 1.0MiB/s total, 215.6MiB/s max\",\"performance_data\":\"wrong=85;0;0;0; ok=1;0;0;0; max=215.6;0;0;0;\",\"previous_hard_state\":0,\"previous_soft_state\":0,\"scheduling_source\":\"mbp-yhabteab.fritz.box\",\"severity\":2080,\"soft_state\":1,\"state_type\":\"soft\"}"
1758547833.559289 [12 172.18.0.1:57800] "HSET" "icinga:checksum:host:state" "4c80e4ad47034480b3c3bb28885970b6bb50983e" "{\"checksum\":\"0ed63c1311be890a7ad5ffae340a3489db7d57ca\"}"
1758547833.559305 [12 172.18.0.1:57800] "ZADD" "icinga:nextupdate:host" "1758548193.522986" "4c80e4ad47034480b3c3bb28885970b6bb50983e"
1758547833.559430 [12 172.18.0.1:57800] "XADD" "icinga:runtime:state" "MAXLEN" "~" "1000000" "*" "runtime_type" "upsert" "redis_key" "icinga:host:state" "checksum" "0ed63c1311be890a7ad5ffae340a3489db7d57ca" "affects_children" "0" "check_attempt" "1" "check_source" "mbp-yhabteab.fritz.box" "check_timeout" "60000" "environment_id" "30e34a7a6c432baf29e051ed428aee3825f25e67" "execution_time" "0" "hard_state" "0" "host_id" "4c80e4ad47034480b3c3bb28885970b6bb50983e" "id" "4c80e4ad47034480b3c3bb28885970b6bb50983e" "in_downtime" "0" "is_acknowledged" "0" "is_active" "1" "is_flapping" "0" "is_handled" "0" "is_problem" "1" "is_reachable" "1" "is_sticky_acknowledgement" "0" "last_state_change" "1758547833522" "last_update" "1758547833522" "latency" "0" "next_check" "1758548133522" "next_update" "1758548193522" "normalized_performance_data" "wrong=85;0;0;0 ok=1;0;0;0 max=215.600000;0;0;0" "output" " /dev/dm-13: 1.0MiB/s read1, 4.3KiB/s write1, 1.0MiB/s total, 215.6MiB/s max" "performance_data" "wrong=85;0;0;0; ok=1;0;0;0; max=215.6;0;0;0;" "previous_hard_state" "0" "previous_soft_state" "0" "scheduling_source" "mbp-yhabteab.fritz.box" "severity" "2080" "soft_state" "1" "state_type" "soft"

Blocked by `RedisConnection` enhancements #10619

fixes #10082