
Write checkable create & delete sla history events#566

Closed
yhabteab wants to merge 7 commits into main from add-create-delete-history-events

Conversation

@yhabteab
Member

@yhabteab yhabteab commented Feb 23, 2023

Why do we need this?

Currently, we generate SLA history events only when e.g. a state change occurs or a downtime starts or ends for a checkable. Under some circumstances (if a checkable is created once and never deleted) this is sufficient. However, if you e.g. delete a host, recreate it a couple of days later, and then want to generate an SLA report for this host at the end of the week, the result can vary depending on the state the host was in before it was deleted. In order to generate SLA reports as accurately as possible, we decided to track checkable creation and deletion times on top of the existing information. And since Icinga 2 doesn't really know when an object has been deleted (at least not in a simple way), this PR takes care of it.

Icinga DB doesn't know when an object was deleted either; it just takes the time at which the delete event for that object arrived and writes it to the new table. This means that if you delete checkables while Icinga DB is stopped, the events Icinga DB writes after it is restarted won't reflect the actual delete/create time. However, there is no better way to handle this gracefully.

Config sync

The upgrade script for 1.3.0 generates a created_at sla lifecycle entry for all existing hosts and services once it is applied, as proposed in #566 (comment). Thus, all special cases, such as the custom fingerprint type for services1 and the extra query to retrieve host IDs from the database for runtime-deleted services, have been removed.

Implementation

The new table sla_lifecycle has a primary key over (id, delete_time), where delete_time = 0 means "not deleted yet" (the column has to be NOT NULL because it is part of the primary key). id is either the service or host ID for which the sla lifecycle is being generated. This ensures that there can only be one row per object stating that the object is currently alive in Icinga 2.
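For illustration, the table could be sketched like this (MySQL syntax; the column types are assumptions modeled on other Icinga DB tables, not the authoritative schema, only the delete_time default and the composite primary key are taken from this PR):

```sql
CREATE TABLE sla_lifecycle (
  id binary(20) NOT NULL,              -- host ID or service ID
  environment_id binary(20) NOT NULL,
  host_id binary(20) NOT NULL,
  service_id binary(20) DEFAULT NULL,  -- NULL for host rows
  create_time bigint unsigned NOT NULL,
  delete_time bigint unsigned NOT NULL DEFAULT 0, -- 0 = not deleted yet
  CONSTRAINT pk_sla_lifecycle PRIMARY KEY (id, delete_time)
);
```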

Initial sync

Icinga DB unconditionally performs a simple INSERT statement for the Host and Service types after each initial config dump, but it only matches hosts and services that don't already have a create_time entry with delete_time = 0 in the sla_lifecycle table, and sets their create_time to now. Additionally, it updates the delete_time of each existing sla_lifecycle entry whose host/service ID cannot be found in the host/service tables. It's unlikely, but if a given checkable doesn't have a create_time entry in the database, the update query won't update anything. Likewise, the insert statement becomes a no-op if the checkable already has a create_time entry with delete_time = 0.
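For hosts, the two statements described above come down to queries of roughly this shape (following the form proposed in the review comments; environment_id conditions are omitted for brevity, and the service variants work analogously):

```sql
-- Mark every host without an alive sla_lifecycle row as created now.
INSERT INTO sla_lifecycle (id, environment_id, host_id, create_time, delete_time)
SELECT id, environment_id, id, UNIX_TIMESTAMP() * 1000, 0
FROM host
WHERE id NOT IN (
  SELECT id FROM sla_lifecycle WHERE service_id IS NULL AND delete_time = 0
);

-- Mark alive rows whose host no longer exists as deleted now.
UPDATE sla_lifecycle
SET delete_time = UNIX_TIMESTAMP() * 1000
WHERE service_id IS NULL
  AND delete_time = 0
  AND id NOT IN (SELECT id FROM host);
```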

Create

Nothing to be done here (all newly created objects will be covered by the bulk INSERT ... SELECT FROM host/service queries after the config dump).

Update

Nothing to be done here (object existed before and continues to exist).

Delete

Nothing to be done here (all deleted objects will be covered by the general bulk sla_lifecycle queries after the config dump).

Runtime updates

Upsert

Performs an INSERT, ignoring duplicate keys, for both create and update events (these look identical in the runtime update stream). If the object is already marked as alive in sla_lifecycle, this does nothing; otherwise, it marks the object as created now (including when an object that was created before this feature was enabled receives an update).
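The upsert can be sketched as a single duplicate-tolerant insert (MySQL syntax; the :placeholders are illustrative, and on PostgreSQL the equivalent would be INSERT ... ON CONFLICT DO NOTHING):

```sql
-- No-op if a row with (id, delete_time = 0) already exists,
-- i.e. if the object is already marked as alive.
INSERT IGNORE INTO sla_lifecycle
  (id, environment_id, host_id, service_id, create_time, delete_time)
VALUES
  (:id, :environment_id, :host_id, :service_id, :now, 0);
```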

Delete

It assumes that a created_at sla_lifecycle entry exists for the checkable currently being deleted and performs a simple UPDATE statement setting delete_time = now (i.e. it updates the PK of the row), marking the alive row for that object as deleted. If, for whatever reason, there is no corresponding created_at entry for this checkable, the update statement becomes a no-op; the upgrade script and/or the initial config dump should have generated the necessary entries for all objects that were created before this feature was available.
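That update can be sketched as follows (the :placeholders are illustrative; since the WHERE clause only matches the single alive row, it is automatically a no-op when no created_at entry exists):

```sql
UPDATE sla_lifecycle
SET delete_time = :now
WHERE id = :id
  AND delete_time = 0;
```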

Footnotes

  1. Before @julianbrost proposed a change in https://github.com/Icinga/icingadb/pull/566#issuecomment-2273088195, services had to implement an additional custom fingerprint type ServiceFingerprint, which was also used to retrieve their host IDs when computing the config delta of the initial sync. Introducing this type eliminated the need to always perform an extra SELECT query to retrieve the host IDs, as the host ID is always required for the sla lifecycles to work.

@cla-bot cla-bot bot added the cla/signed label Feb 23, 2023
@yhabteab yhabteab force-pushed the add-create-delete-history-events branch 3 times, most recently from 2853ab4 to 87e94ac on February 24, 2023 09:14
@yhabteab yhabteab requested a review from Al2Klimov February 24, 2023 09:41
@yhabteab yhabteab force-pushed the add-create-delete-history-events branch 2 times, most recently from 1cc1f09 to 80a76e5 on February 27, 2023 12:00
@yhabteab yhabteab force-pushed the add-create-delete-history-events branch 5 times, most recently from 2fc3663 to c160354 on March 2, 2023 12:54
@yhabteab yhabteab requested a review from Al2Klimov March 2, 2023 12:56
@yhabteab yhabteab force-pushed the add-create-delete-history-events branch 3 times, most recently from 2a4824a to 2717c61 on March 2, 2023 14:41
@yhabteab yhabteab requested a review from Al2Klimov March 2, 2023 14:42
@yhabteab yhabteab force-pushed the add-create-delete-history-events branch 2 times, most recently from 93d6f3f to 69bc4ad on March 2, 2023 16:37
@yhabteab yhabteab requested review from Al2Klimov and removed request for Al2Klimov March 2, 2023 16:39
@julianbrost
Member

I just had an idea for what to call this type of SLA history, since we didn't really come up with a good name for it initially: lifecycle

@yhabteab yhabteab force-pushed the add-create-delete-history-events branch 4 times, most recently from 753dba4 to f1878aa on March 3, 2023 09:47

@Al2Klimov Al2Klimov left a comment


Please don’t force-push for now.

@yhabteab yhabteab force-pushed the add-create-delete-history-events branch from 5ee05e7 to be89ac0 on May 26, 2023 12:56
@Al2Klimov Al2Klimov removed their request for review May 26, 2023 13:07
@yhabteab yhabteab force-pushed the add-create-delete-history-events branch 2 times, most recently from 52f5cbb to b8f9080 on June 12, 2023 16:37
@julianbrost
Member

Now that I'm taking a fresh look at this after some time, I'm wondering: what was the reason for updating rows instead of inserting separate rows for create and delete events?

@julianbrost
Member

I've updated the PR description with a summary of what queries are performed when (section "Implementation").

Now that I'm taking a fresh look at this after some time, I'm wondering: what was the reason for updating rows instead of inserting separate rows for create and delete events?

This also gave sort of an answer to this question: having that database structure with the delete time being part of the primary key prevents duplicate rows from being inserted for runtime updates.

@Al2Klimov
Member

Admittedly not the most beautiful concept, but it solves problems we'd have with 1 event = 1 row 👍

@julianbrost
Member

While looking at this after over a year (🙈), I had an idea: what if during the schema upgrade, we just insert rows for all currently existing hosts and services into the new table, shouldn't that allow us to get rid of two special cases (emphasis by me)?

2. Additionally, it performs an INSERT with ignore for duplicate keys with the same timestamp. So in case there was no row to be updated, it will now be inserted (otherwise, this query is a no-op). This is especially important for the case where objects were created before this feature becomes available.

Runtime updates

Upsert

Performs an INSERT with ignore for duplicate keys for both create and update events (these look identical in the runtime update stream). If the object is already marked as alive in sla_history_lifecycle, this will do nothing, otherwise it will mark it as created now (including when an object that was created before this feature was enabled is updated).

@yhabteab
Member Author

yhabteab commented Aug 7, 2024

I had an idea: what if during the schema upgrade, we just insert rows for all currently existing hosts and services into the new table, shouldn't that allow us to get rid of two special cases (emphasis by me)?

How do you plan to perform this hashing with pure SQL?

sl.Id = utils.Checksum(objectpacker.MustPackSlice(sl.EnvironmentId, sl.HostId, sl.ServiceId))

@julianbrost
Member

How do you plan to perform this hashing with pure SQL?

sl.Id = utils.Checksum(objectpacker.MustPackSlice(sl.EnvironmentId, sl.HostId, sl.ServiceId))

That's introduced by this PR, so it could also be done differently if this makes things easier. Which brings me right to my next ideas, which I have plenty of. 😅 Please read them, ask any question that may come up and let me know if you think that the approach I'm suggesting in the following may indeed significantly simplify this PR.

Doesn't much of the new code for selecting the host ID for a service primarily exist just because of the choice to compute the id column based on it? The service ID itself already includes the host ID, so including it is not actually necessary for uniqueness of the resulting IDs, i.e. the same service on different hosts already has different IDs. So if that was just omitted from the hash, shouldn't good parts of the new code actually become unnecessary?

Going one more step further: host and service IDs already include the environment ID as well, so even simply using sla_lifecycle.id = host.id or sla_lifecycle.id = service.id should be an option1. That would then allow a pretty simple query to be used in the schema migration (for hosts, would work for services analogously):

insert into sla_lifecycle (id, environment_id, host_id, create_time, delete_time)
select id, environment_id, id, UNIX_TIMESTAMP() * 1000, 0
from host;

That wasn't my last idea; this could be taken even further: one could add a where clause so that instead of all hosts, only those that don't yet have a row in sla_lifecycle are inserted:

insert into sla_lifecycle (id, environment_id, host_id, create_time, delete_time)
select id, environment_id, id, UNIX_TIMESTAMP() * 1000, 0
from host
where id not in (select id from sla_lifecycle where service_id is null and delete_time = 0);

So this gives a simple query that could just be executed each time after the initial delta sync. That would take care of created objects, deleted objects could be handled similarly by updating all sla_lifecycle rows with delete_time = 0 that reference dead IDs:

update sla_lifecycle
set delete_time = UNIX_TIMESTAMP() * 1000
where service_id is null
and delete_time = 0
and id not in (select id from host);

So with 4 queries in total2, this should take care of all the updates needed during the initial delta sync, leaving only the runtime updates; it should be possible to reduce those to a single insert/update each, as potentially missing rows were already created by the previous delta sync.

Footnotes

  1. If I remember correctly, there can't be conflicts between host IDs and service IDs, which would be a requirement for doing it that way. That's something that has to be verified, but even if it's not the case, this could trivially be fixed by including an object_type column in the primary key (such a column already exists in the other SLA tables).

  2. Note that the queries I gave are still missing checks on the environment_id in the where clauses.

@yhabteab
Member Author

yhabteab commented Aug 8, 2024

I have reworked this as suggested by Julian in #566 (comment) and also updated the PR description accordingly.

@yhabteab
Member Author

I have reworked this as suggested by Julian in #566 (comment) and also updated the PR description accordingly.

The proposals from #566 (comment) should have been fully implemented by now and of course the PR description is also updated again.


@oxzi oxzi left a comment


First of all, thanks for writing the documentation. It makes it easier for both users and me to understand the motivation for this change.

I just made a first walk through and added some comments.


@oxzi oxzi left a comment


I have played with this PR in a testing environment and have not encountered any errors so far.

However, I have a more general question: How will the sla_lifecycle table be used later on? The get_sla_ok_percent function does not honor it and, unless I am missing something, cannot even do this for deleted checkable objects.

delete_time biguint NOT NULL DEFAULT 0,

CONSTRAINT pk_sla_lifecycle PRIMARY KEY (id, delete_time)
);


I was thinking about creating some indices for the queries within the SyncCheckablesSlaLifecycle function.

In a huge environment, those queries may take some time. However, I cannot say how huge they must become to make those queries slow or if Icinga 2 will be the performance bottleneck before.

I have not tested this or anything, just wanted to put the idea out there.


or if Icinga 2 will be the performance bottleneck before

If there's an issue, you could probably trigger it without Icinga 2 becoming an issue. You can create and delete many hosts and services over time, keeping the active config in Icinga 2 at a constant size but growing this table.


Valid point, I have not considered cloud native fast moving infrastructures with a multitude of short-lived VMs.

@yhabteab
Member Author

The get_sla_ok_percent function does not honor it and, unless I am missing something, cannot even do this for deleted checkable objects.

That's right, and it never will! As we discussed last time, we will only collect the data for the time being, until we get the time to finalise Icinga/icingadb-web#710, which will make use of this newly introduced table, move the SLA generation from the SQL procedure to PHP code, and put the get_sla_ok_percent procedure into feature freeze.

@yhabteab
Member Author

yhabteab commented Sep 4, 2024

I've pushed the new transparent screenshots and also changed the main function a bit, so I won't be pushing anything from now on unless someone requests a change.

the database without any interpretation. In order to generate and visualise SLA reports of specific hosts and services
based on the accumulated events over time, [Icinga Reporting](https://icinga.com/docs/icinga-reporting/latest/doc/02-Installation/)
is the optimal complement, facilitating comprehensive SLA report generation within a specific timeframe.


I would change the tone of the first two paragraphs to make them sound more like a technical document and less like a PR post. For example, the "legally binding" part could be dropped, as that need not be the case. Furthermore, Icinga Reporting is (currently) not the "optimal complement", but the only complementing UI.

Please don't take this the wrong way. If I read this as a technical user, and this is technical documentation, I would get the feeling that somebody wants to sell me something, not that I am going to find the details here.

| PENDING (no check result yet) | OK |
| OK | Warning |
| Warning | Critical |
| Critical | OK |

As discussed earlier, please include the states' magic numbers used within the database. This eases debugging and understanding when using this technical document together with one's own setup.

// clearly identifies its state. Consequently, such events become irrelevant for the purposes of calculating
// the SLA and we must exclude the duration of that PENDING state from the total time.
total_time = total_time - (event.event_time - last_event_time)
else if (previous_hard_state is greater than OK/UP

Without the mapping from states to their numeric representation, this is quite cryptic. Thus, either introduce a mapping before or make it more explicit, like listing all acceptable previous_hard_state values.

Member Author


I'm not sure if I understand you correctly, but I don't quite agree that this is cryptic; rather, it's the exact opposite. These are just normal Icinga 2 host/service states that have their own documentation, except for the PENDING state. Apart from that, I have actually listed the possible values that the previous_hard_state column represents; do you want me to add another listing here?


Please try to put yourself in a reader's position, who knows a bit about Icinga, but definitely no implementation details. When reading the comparison in this line, it may be unclear what "greater than" means in this context, as no numerical representation or hierarchy of states was introduced.

Apart from that, I have actually listed the possible values that the previous_hard_state column represents and you want me to add another listing here?

Based on this table, one would assume the following order: PENDING < OK < WARNING < CRITICAL.
And what is about UNKNOWN and the host states UP and DOWN?

Coming back to the if cases, there is

  1. event.previous_hard_state is PENDING and
  2. previous_hard_state is greater than OK/UP AND previous_hard_state is not PENDING AND checkable is not in DOWNTIME (this one misses the event. prefix, but I guess it was just forgotten?).

Shouldn't it be possible to rewrite the second clause to event.previous_hard_state is not in {OK, UP} AND checkable is not in DOWNTIME as event.previous_hard_state cannot be PENDING due to the first check? Doing so eliminates any order which has to be explained otherwise.

@yhabteab
Member Author

Won't be ready any time soon, so I'm closing it.


Labels

cla/signed, enhancement (New feature or request)
