Background task: complete never-completed audit log entries #10026

david-crespo wants to merge 7 commits into main from
Conversation
```sql
-- Supports "find stale incomplete rows ordered by time_started".
CREATE INDEX IF NOT EXISTS audit_log_incomplete_by_time_started
    ON omicron.public.audit_log (time_started, id)
    WHERE time_completed IS NULL;
```
The robots in my employ pointed out that this index does have a cost — every single request that comes through causes this index to be updated because the audit log entry for it has time_completed null while the operation is in progress. That's probably worth it — otherwise I think the job would not work as written because it would have to do a table scan to find the incomplete rows.
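For reference, the lookup this partial index serves looks roughly like the following sketch (the real query is built with diesel; the literal limit stands in for `max_update_per_activation`):

```sql
-- Sketch of the scan the partial index supports: stale incomplete rows,
-- oldest first. The WHERE clause matches the index predicate, so only
-- in-progress entries are ever read from the index.
SELECT id
    FROM omicron.public.audit_log
    WHERE time_completed IS NULL AND time_started < $1
    ORDER BY time_started
    LIMIT 1000;
```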
```rust
        entries (may indicate Nexus crashes or bugs)";
    "cutoff" => %cutoff,
    "limit" => self.max_update_per_activation,
);
```
I figured this should be `warn!` since things showing up here should be quite unusual.
```diff
-        diesel::update(audit_log::table)
+        let rows = diesel::update(audit_log::table)
             .filter(audit_log::id.eq(entry.id))
             .filter(audit_log::time_completed.is_null())
```
This is really important, otherwise in the bizarro world situation where the request actually does finish after the row has been timed out, completing it would change its time_completed, violating the Rules of Audit Log. See the comment above.
```rust
        ),
    }
}
```
This task isn't changed here but it was causing the tests to be flaky for me locally depending on whether I ran one test or the whole module.
```rust
        .execute_async(&*self.pool_connection_authorized(opctx).await?)
        .await
        .map_err(|e| public_error_from_diesel(e, ErrorHandler::Server))
    }
```
This is the core of the task: for any audit log entries older than the cutoff with a null `time_completed`, mark those as timed out. In practice the ordering and limit should not matter because the number of timed-out requests should be small.
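Concretely, the update amounts to something like this sketch (the real code goes through diesel; CockroachDB's `UPDATE ... ORDER BY ... LIMIT` form and the `result_kind` column name are assumed here, and the literal limit stands in for `max_update_per_activation`):

```sql
-- Time out stale incomplete entries: anything started before the
-- cutoff that never got a time_completed is finalized as 'timeout'.
UPDATE omicron.public.audit_log
    SET time_completed = now(), result_kind = 'timeout'
    WHERE time_completed IS NULL AND time_started < $1
    ORDER BY time_started
    LIMIT 1000;
```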
Force-pushed 8639848 to b4bf111.
The numbers from colo: 5M rows in the table, only two (!!!!!) incomplete rows.
nexus/examples/config.toml (Outdated)

```toml
session_cleanup.max_delete_per_activation = 10000
audit_log_timeout_incomplete.period_secs = 600
audit_log_timeout_incomplete.timeout_secs = 14400
audit_log_timeout_incomplete.max_update_per_activation = 1000
```
I wonder if it's even necessary to make these tunable in the config as opposed to hard coded in the task definition. I guess it's nice to be able to change them without making a code change, but will we really have to?
🤷♀️ i don't really care about being able to tune these in the config, but it's fine to leave them since you've already done it?
```rust
    /// of the immutable record. Overwriting it would change its timestamp
    /// and violate the guarantee that the past doesn't change.
    #[tokio::test]
    async fn test_audit_log_timeout_then_late_completion() {
```
I think this test belongs here rather than in the timeout task file because it's really about what `datastore.audit_log_entry_complete()` does with an already-timed-out row. I got rid of the other tests that were redundant with the one in the task file, though.
Force-pushed 9ed1955 to c5b7f89.
```diff
 };

-        let cutoff = Utc::now() - timeout_delta;
+        let cutoff = Utc::now() - self.timeout;
```
should this fail if it's out of range?
```rust
    ) -> Self {
        let Ok(timeout) = TimeDelta::from_std(timeout) else {
            panic!(
                "invalid timeout {timeout:?} \
```
it would be nice if we said the name of the config file key here, rather than just "invalid timeout"
```rust
let audit_log_sub =
    diesel::alias!(nexus_db_schema::schema::audit_log as audit_log_sub);

let subquery = audit_log_sub
```
i might give this a more semantically meaningful name --- it's selecting the timed-out rows to update, right?
```rust
/// The `time_completed IS NULL` filter is critical: once an entry has been
/// completed — whether by the normal request path or by the background
/// timeout task — it is part of the immutable audit log. Overwriting it
/// would change its `time_completed` timestamp, shifting its position in
/// the `(time_completed, id)` pagination order and violating the guarantee
/// that past query results are stable.
```
this doesn't feel like API docs to me --- i think it should go in the place where you're actually setting that filter?
Closes #8817
The most important thing about an audit log is that every event you care about is in there. This PR adds a job to make sure requests that never complete (e.g., because Nexus crashed in the middle) appear in the audit log after some timeout.
The audit log gets written in two steps. At the beginning of a request we initialize the log entry row. If this fails, the request fails. This guarantees we never miss an initialization. Then, at the end of the request, we "complete" the row with the result of the operation and a `time_completed`. The audit log list API response is ordered by `time_completed`. This lets us guarantee that if you retrieve a chunk of the audit log that is fully in the past, it will never change — you never have to fetch that chunk of the log again. But this requires that entries do not appear in the log until they are completed.

Completion can fail for a variety of hopefully very rare reasons. It seemed unreasonably complicated to guarantee completion by rolling back whatever operation happened on a failed completion. Instead, the idea was that we would run a job that cleans up incomplete rows after a long enough amount of time that it's very unlikely the request will complete. When we clean up such rows, we don't know what actually happened to the request, so we have to give it a special
`result_kind` of `timeout` (as opposed to `success` or `error`).

omicron/nexus/db-model/src/audit_log.rs, lines 275 to 283 in 3d2d0c1
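The two-step write described above can be sketched roughly in SQL (table name taken from the index DDL earlier in this thread; the exact column list and `result_kind` representation are assumptions):

```sql
-- Step 1, at request start; if this INSERT fails, the request fails,
-- so an audit log entry is never silently missing.
INSERT INTO omicron.public.audit_log (id, time_started)
    VALUES (gen_random_uuid(), now());

-- Step 2, at request end: complete the entry. The IS NULL guard keeps
-- an already-finalized (e.g. timed-out) row immutable.
UPDATE omicron.public.audit_log
    SET time_completed = now(), result_kind = 'success'
    WHERE id = $1 AND time_completed IS NULL;
```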
In the API response this becomes `AuditLogEntryResult::Unknown` because I feel like "timeout" is an implementation detail and makes it sound like the operation itself timed out.

omicron/nexus/db-model/src/audit_log.rs, lines 413 to 415 in 3d2d0c1
How often are log entries left incomplete?
Fortunately, this should be a very rare event — I checked on dogfood and there were around 200k rows in the table and there was exactly 1 incomplete `disk_create` call from October 2025. Need to check on the colo rack.

Config options
All three live in `[background_tasks.audit_log_timeout_incomplete]` in the Nexus config (`AuditLogTimeoutIncompleteConfig` in `nexus_config.rs`):

- `period_secs` — how often the background task activates (currently 600 = 10 min)
- `timeout_secs` — how old an incomplete entry must be before it gets timed out (default 14400 = 4 hours)
- `max_update_per_activation` — max rows updated per activation (currently 1000), following the naming convention from `session_cleanup.max_delete_per_activation`

On the timeout, I think 4 hours is pretty conservative. My worry is that we could have a legitimate API call that takes an hour or two because it's a post of a giant file or something. I don't know how reasonable that worry is. This timeout could probably go down to two hours or an hour.
On the interval, 10 minutes seems maybe overkill for something that happens once every 5 months? But it's just one query, basically.
Claude skill for adding background tasks

After #10009 I had Claude write the `add-background-task` skill and used it to make this one. It was pretty good, so I think it's probably worth checking in.