Background task: delete audit log entries older than retention period #10042

Open · david-crespo wants to merge 9 commits into audit-log-timeout-task from audit-log-cleanup
Conversation

david-crespo (Contributor) commented Mar 12, 2026

Closes #8818

The audit log table grows unbounded. Here we prevent that by deleting old entries. We expect customers who rely on the audit log to fetch it and push the results into some external monitoring system at a very frequent interval, so the retention period could be quite short. Here I am proposing 7 days, but it could probably be 3. It's fine as long as we make it clear that if you want to rely on the log long-term, you have to fetch it and store it yourself.

Tunables

  • audit_log_cleanup.period_secs = 600
  • audit_log_cleanup.retention_days = 30
  • audit_log_cleanup.max_delete_per_activation = 10000

Table size on colo

5M total, 385k in the past week. At 10,000 per activation running every 10 minutes, it would take about 83 hours to clean up the whole table. That seems ok to me, and after the initial backlog is cleared, each activation will have far less to do: 10k new entries per ten-minute period would be a lot.

Adding up the ranges, it's about 1336 MB, or about 26MB per 100k rows. On dogfood it's more like 45MB per 100k rows (see comment below). I would expect the latter to be more typical — on colo we have a ton of rows due to automation that does a ton of login requests with no silo or actor ID.

root@[fd00:1122:3344:116::3]:32221/omicron> select count(*) from omicron.public.audit_log;
   count
-----------
  4898486
(1 row)

root@[fd00:1122:3344:116::3]:32221/omicron> select count(*) from omicron.public.audit_log where time_started > now() - interval '7days';
  count
----------
  385581
(1 row)

root@[fd00:1122:3344:116::3]:32221/omicron> show ranges from table omicron.public.audit_log;
                    start_key                   |                    end_key                    | range_id |     range_size_mb     | lease_holder | lease_holder_locality |  replicas   | replica_localities
------------------------------------------------+-----------------------------------------------+----------+-----------------------+--------------+-----------------------+-------------+---------------------
  NULL                                          | /"D&}&\xc4-JE\xa0\xa5\x03W\x10^Ⱥ"             |     1691 | 314.92982300000000000 |            3 |                       | {1,2,3,4,5} | {,,,,}
  /"D&}&\xc4-JE\xa0\xa5\x03W\x10^Ⱥ"             | /"{ʕr\x8d\xceE6\xb0\xc0.\xbe\xeb\x01\x00G"    |     2257 | 264.22612800000000000 |            4 |                       | {1,2,3,4,5} | {,,,,}
  /"{ʕr\x8d\xceE6\xb0\xc0.\xbe\xeb\x01\x00G"    | /"\xbcƊb\xa3\xd2G\x00\xbc\x1fYo1;\x9d\xf5"    |     2325 | 271.09559900000000000 |            3 |                       | {1,2,3,4,5} | {,,,,}
  /"\xbcƊb\xa3\xd2G\x00\xbc\x1fYo1;\x9d\xf5"    | /"\xff\x9e\x9b\x06\x97\x9eNڦ\xadmkJ\xf1<\xd5" |     2336 | 317.12977100000000000 |            2 |                       | {1,2,3,4,5} | {,,,,}
  /"\xff\x9e\x9b\x06\x97\x9eNڦ\xadmkJ\xf1<\xd5" | NULL                                          |     2381 | 168.61788100000000000 |            4 |                       | {1,2,3,4,5} | {,,,,}
(5 rows)
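The duration and size estimates above can be double-checked with quick arithmetic. This is just a sketch using the exact figures from the query output; the ~83-hour estimate in the description rounds the row count up to 5M, so the result here comes out slightly lower:

```rust
fn main() {
    // Row count and per-range sizes taken from the colo queries above.
    let total_rows: u64 = 4_898_486;
    let max_delete_per_activation: u64 = 10_000;
    let period_minutes: u64 = 10;

    // Activations needed to drain the whole table, rounded up.
    let activations =
        (total_rows + max_delete_per_activation - 1) / max_delete_per_activation;
    let hours = activations * period_minutes / 60;
    println!("~{hours} hours to drain the backlog"); // ~81 hours

    // Summed range sizes (MB) divided by row count, per 100k rows.
    let total_mb = 314.93 + 264.23 + 271.10 + 317.13 + 168.62;
    let mb_per_100k = total_mb / (total_rows as f64 / 100_000.0);
    println!("~{mb_per_100k:.0} MB per 100k rows"); // ~27 MB
}
```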

.execute_async(&*datastore.pool_connection_for_tests().await.unwrap())
.await
.unwrap();
}
Contributor Author:
This is test-only code.

@david-crespo david-crespo marked this pull request as ready for review March 12, 2026 05:19
@david-crespo david-crespo requested a review from hawkw March 12, 2026 05:19
david-crespo (Contributor Author):
On dogfood, we have 200k rows and it takes up 89 MB. So let's say it's about 45MB per 100k rows.

Dogfood

root@[fd00:1122:3344:109::3]:32221/omicron> show ranges from table omicron.public.audit_log;
  start_key | end_key | range_id |     range_size_mb     | lease_holder | lease_holder_locality |  replicas   | replica_localities
------------+---------+----------+-----------------------+--------------+-----------------------+-------------+---------------------
  NULL      | NULL    |     4007 | 89.608930000000000000 |            4 |                       | {1,2,3,4,5} | {,,,,}
(1 row)


root@[fd00:1122:3344:109::3]:32221/omicron> select count(*) from omicron.public.audit_log;
  count
----------
  221481
(1 row)

Comment on lines +165 to +175
// Diesel's DeleteStatement doesn't support LIMIT, so we use raw
// SQL rather than the subquery workaround needed for the DSL.
diesel::sql_query(
    "DELETE FROM omicron.public.audit_log \
     WHERE time_completed IS NOT NULL \
     AND time_completed < $1 \
     ORDER BY time_completed, id \
     LIMIT $2",
)
.bind::<diesel::sql_types::Timestamptz, _>(cutoff)
.bind::<diesel::sql_types::BigInt, _>(i64::from(limit))
Member:
it would be nice if, when writing raw SQL queries, we used the various test utils expectorate helpers so that we could at least get a unit test that asserts that the query is valid SQL. see, for example:

// Explain the possible forms of the SQL query to ensure that it
// creates a valid SQL string.
#[tokio::test]
async fn explainable() {
    let logctx = dev::test_setup_log("explainable");
    let db = TestDatabase::new_with_pool(&logctx.log).await;
    let pool = db.pool();
    let conn = pool.claim().await.unwrap();
    let query =
        dataset_update_query(vec![Uuid::nil(), Uuid::nil(), Uuid::nil()]);
    let _ = query
        .explain_async(&conn)
        .await
        .expect("Failed to explain query - is it valid SQL?");
    db.terminate().await;
    logctx.cleanup_successful();
}

Contributor Author:
I'm reluctant on this one — the integration tests cover that it's valid SQL better than a unit test since in the latter there's no guarantee the query is being used in the spot where it's supposed to be.

Member:
That's fair, I just kinda think these are nice to have, since the unit test fails much faster than the integration tests and the error is very obviously "you have a syntax error in your SQL". But it's not important to me.

/// The cutoff time used: completed entries older than this were eligible.
pub cutoff: DateTime<Utc>,
/// Configured max rows to delete in this activation.
pub max_delete_per_activation: u32,
Member:
i feel like this should be "max_deleted", but maybe that's just me?

Contributor:
Perhaps "max_to_delete" or "max_deletions"? This is just based on the description above. "max_deleted" sounds like something that happened in the past, no?

david-crespo (Contributor Author), Mar 13, 2026:
Could even be something like batch_size, though that leaves open the possibility of the activation doing multiple batches.

Member:
> Perhaps "max_to_delete" or "max_deletions" ? This just based on the description above. "max_deleted" sounds like something that happened in the past no?

I think if it's max_deleted_per_activation it makes sense --- I wasn't suggesting dropping the _per_activation part.

> Could even be something like batch_size, though that leaves open the possibility of the activation doing multiple batches.

Hm, I think that --- for the reason you mentioned --- I'd rather avoid batch_size; it's the maximum number of records to delete during the activation, not the size of the individual batch in which the rows are deleted.

david-crespo (Contributor Author) commented Mar 13, 2026

Thanks for the thorough review! Here is what I did in e587600:

  • I realized the datastore test was testing exactly the same stuff as the integration test in the task, which is pointless, so I deleted it. Now there is only the integration test for the task.
  • The test now asserts about the set of IDs that remain in the log after the cleanup job runs
  • Rolled the test about respecting the limit per activation into the first test
  • max_delete_per_activation -> max_deleted_per_activation
  • Set-timestamp helpers in tests now use the Diesel DSL

david-crespo added a commit that referenced this pull request Mar 13, 2026
david-crespo (Contributor Author):

In f58acf9 I added more explicit checking for overflow when we subtract the days from now and more detailed assertions about which entries get cleaned up and which remain after each activation.

hawkw (Member) left a comment:
The new tests are much nicer, thanks! I had some admittedly pedantic complaints about the date overflow checking stuff. Sorry.

Comment on lines +40 to +55
let cutoff = match cutoff {
    Some(c) => c,
    None => {
        let msg = format!(
            "retention_days {} overflows date arithmetic",
            self.retention_days,
        );
        slog::error!(&opctx.log, "{msg}");
        return AuditLogCleanupStatus {
            rows_deleted: 0,
            cutoff: Utc::now(),
            max_deleted_per_activation: self.max_deleted_per_activation,
            error: Some(msg),
        };
    }
};
Member:
Hm, I kind of wonder if what we would actually want to do here is to saturate to the longest representable retention period, instead of just giving up. But, this would lead to us deleting rows that are actually younger than the configured retention period, which also feels a bit bad. So, maybe it's correct to just give up, in the hopes that eventually the cutoff period becomes representable as time moves forwards.

TimeDelta::try_days(i64::from(self.retention_days.get()))
.expect("retention_days fits in TimeDelta");
let cutoff = Utc::now() - retention_delta;
let cutoff = TimeDelta::try_days(i64::from(self.retention_days.get()))
Member:
so, I don't love that the conversion of self.retention_days to a TimeDelta is part of the same fallible conversion that we attempt every time the task activates. it seems like ideally, the calculation involving the fixed value from the config file failed once when we parsed the config and was treated as an invalid config file, and only the subtraction of the current date from the time delta was fallible every time the task activates. if self.retention_days is not representable as a TimeDelta of days, this will never succeed, which should be treated as an invalid config.
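The split being suggested might look something like the following sketch. The names (`RetentionPeriod`, `from_days`, `cutoff`) are illustrative, not the actual code, and it uses std types in place of chrono; the `i64` day count mirrors the signature of `chrono::TimeDelta::try_days`:

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical validated config value: the fallible days-to-duration
/// conversion happens exactly once, at config-parse time.
#[derive(Clone, Copy)]
struct RetentionPeriod(Duration);

impl RetentionPeriod {
    /// Rejects unrepresentable (or negative) day counts when the config
    /// is loaded, so the background task never has to re-check it.
    fn from_days(days: i64) -> Result<Self, String> {
        days.checked_mul(86_400)
            .and_then(|secs| u64::try_from(secs).ok())
            .map(|secs| RetentionPeriod(Duration::from_secs(secs)))
            .ok_or_else(|| format!("invalid retention_days: {days}"))
    }

    /// The only per-activation fallible step: subtracting the retention
    /// period from "now" (which can still fail near the epoch).
    fn cutoff(&self, now: SystemTime) -> Option<SystemTime> {
        now.checked_sub(self.0)
    }
}

fn main() {
    // Validate once, at "config parse" time.
    let retention = RetentionPeriod::from_days(90).expect("valid config");
    // Each activation then only performs the date subtraction.
    assert!(retention.cutoff(SystemTime::now()).is_some());
    // Unrepresentable values are caught up front instead of every activation.
    assert!(RetentionPeriod::from_days(i64::MAX).is_err());
    println!("config-time validation ok");
}
```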

inickles (Contributor):

> We expect customers who rely on the audit log to be fetching the log and putting the result in some external monitoring system on a very frequent interval, so the retention period could be quite short.

I see the retention period is now 30 days, so "short" from this original comment is a little out of date, but I wanted to mention that I think the longer retention is important for giving our customers the room to notice and fix audit log fetch automation they may have. Or even set it up in the first place after initial rack installation. Unless we're worried about the size of the audit log taking up too much space I recommend extending that even further, such as 90 days, 180 days, or even a year, considering this isn't tunable.

david-crespo (Contributor Author):

Bumped to 90 after talking offline, and we'll plan to make this configurable.

Comment on lines +31 to +32
let retention = TimeDelta::try_days(i64::from(retention_days.get()))
    .expect("retention_days must be representable as a TimeDelta");
Member:
nitpicky: it would be nice to include the invalid value in the panic message, something like:

Suggested change
- let retention = TimeDelta::try_days(i64::from(retention_days.get()))
-     .expect("retention_days must be representable as a TimeDelta");
+ let Some(retention) = TimeDelta::try_days(i64::from(retention_days.get())) else {
+     panic!("invalid retention_days {retention_days} (must be representable as a TimeDelta)");
+ };

or something like that --- it would maybe be nicer to also include the actual max value but i was too lazy to calculate what it is.

Member:
i also felt a bit sketched out about the i64::from and worried that we might accidentally be making it into a negative TimeDelta, but then realized this couldn't happen since retention_days is a u32, so... it's fine, but maybe a comment about that would be nice?

Comment on lines +66 to +67
let msg = format!("audit log cleanup failed: {err:#}");
slog::error!(&opctx.log, "{msg}");
Member:
nitpicky, you're welcome to ignore this: i feel like audit log cleanup failed: is useful to have in the Nexus logs (even though they're also tagged with the name of the background task), but is maybe a bit less useful in the omdb db background-tasks status output, since...you asked for the status of the audit log cleanup task, you know that if there's an error, it means that was the task that failed, so it just kinda makes the line longer?

Suggested change
- let msg = format!("audit log cleanup failed: {err:#}");
- slog::error!(&opctx.log, "{msg}");
+ let msg = format!("{err:#}");
+ slog::error!(&opctx.log, "audit log cleanup failed: {msg}");

Member:
also, i think generally when we log errors in a slog log, we try to log them as structured "err" => ... fields and use the slog_error_chain crate?

Contributor Author:
Good tips, I did em. Regarding the error chain, I don't think it currently does anything because nothing in the error we get from the datastore function uses #[source] to nest other errors inside it. But it still seemed smart to use InlineErrorChain so that it automatically gets logged correctly in the future. slog::error! with the structured thing apparently already does this.

hawkw (Member) left a comment:
thanks for the changes to the retention period validation, i think this is basically good. i had a couple minor nits.
