Background task: delete audit log entries older than retention period #10042

Open · david-crespo wants to merge 9 commits into audit-log-timeout-task from audit-log-cleanup
Conversation

david-crespo (Contributor) commented Mar 12, 2026

Closes #8818

The audit log table grows unbounded. Here we prevent that by deleting old entries. We expect customers who rely on the audit log to fetch it and push the results into some external monitoring system at a very frequent interval, so the retention period could be quite short. Here I am proposing 7 days, but it could probably be 3. It's fine as long as we make it clear that if you want to rely on the log long-term, you have to fetch it and store it yourself.

Tunables

  • audit_log_cleanup.period_secs = 600
  • audit_log_cleanup.retention_days = 30
  • audit_log_cleanup.max_delete_per_activation = 10000

Table size on colo

5M total, 385k in the past week. At 10,000 per activation running every 10 minutes, it would take about 83 hours to clean up the whole table. That seems ok to me, and after the initial backlog is cleared, each activation will have far less to do: 10k new entries per ten-minute period would be a lot.

Adding up the ranges, it's about 1336 MB, or about 26MB per 100k rows. On dogfood it's more like 45MB per 100k rows (see comment below). I would expect the latter to be more typical — on colo we have a ton of rows due to automation that does a ton of login requests with no silo or actor ID.

root@[fd00:1122:3344:116::3]:32221/omicron> select count(*) from omicron.public.audit_log;
   count
-----------
  4898486
(1 row)

root@[fd00:1122:3344:116::3]:32221/omicron> select count(*) from omicron.public.audit_log where time_started > now() - interval '7days';
  count
----------
  385581
(1 row)

root@[fd00:1122:3344:116::3]:32221/omicron> show ranges from table omicron.public.audit_log;
                    start_key                   |                    end_key                    | range_id |     range_size_mb     | lease_holder | lease_holder_locality |  replicas   | replica_localities
------------------------------------------------+-----------------------------------------------+----------+-----------------------+--------------+-----------------------+-------------+---------------------
  NULL                                          | /"D&}&\xc4-JE\xa0\xa5\x03W\x10^Ⱥ"             |     1691 | 314.92982300000000000 |            3 |                       | {1,2,3,4,5} | {,,,,}
  /"D&}&\xc4-JE\xa0\xa5\x03W\x10^Ⱥ"             | /"{ʕr\x8d\xceE6\xb0\xc0.\xbe\xeb\x01\x00G"    |     2257 | 264.22612800000000000 |            4 |                       | {1,2,3,4,5} | {,,,,}
  /"{ʕr\x8d\xceE6\xb0\xc0.\xbe\xeb\x01\x00G"    | /"\xbcƊb\xa3\xd2G\x00\xbc\x1fYo1;\x9d\xf5"    |     2325 | 271.09559900000000000 |            3 |                       | {1,2,3,4,5} | {,,,,}
  /"\xbcƊb\xa3\xd2G\x00\xbc\x1fYo1;\x9d\xf5"    | /"\xff\x9e\x9b\x06\x97\x9eNڦ\xadmkJ\xf1<\xd5" |     2336 | 317.12977100000000000 |            2 |                       | {1,2,3,4,5} | {,,,,}
  /"\xff\x9e\x9b\x06\x97\x9eNڦ\xadmkJ\xf1<\xd5" | NULL                                          |     2381 | 168.61788100000000000 |            4 |                       | {1,2,3,4,5} | {,,,,}
(5 rows)
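The duration and size estimates above can be double-checked with quick arithmetic. This is just a sketch using the exact figures from the query output; the ~83-hour estimate in the description rounds the row count up to 5M, so the result here comes out slightly lower:

```rust
fn main() {
    // Row count and per-range sizes taken from the colo queries above.
    let total_rows: u64 = 4_898_486;
    let max_delete_per_activation: u64 = 10_000;
    let period_minutes: u64 = 10;

    // Activations needed to drain the whole table, rounded up.
    let activations =
        (total_rows + max_delete_per_activation - 1) / max_delete_per_activation;
    let hours = activations * period_minutes / 60;
    println!("~{hours} hours to drain the backlog"); // ~81 hours

    // Summed range sizes (MB) divided by row count, per 100k rows.
    let total_mb = 314.93 + 264.23 + 271.10 + 317.13 + 168.62;
    let mb_per_100k = total_mb / (total_rows as f64 / 100_000.0);
    println!("~{mb_per_100k:.0} MB per 100k rows"); // ~27 MB
}
```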

.execute_async(&*datastore.pool_connection_for_tests().await.unwrap())
.await
.unwrap();
}
Contributor Author:
This is test-only code.

@david-crespo david-crespo marked this pull request as ready for review March 12, 2026 05:19
@david-crespo david-crespo requested a review from hawkw March 12, 2026 05:19
david-crespo (Contributor Author):
On dogfood, we have 200k rows and it takes up 89 MB. So let's say it's about 45MB per 100k rows.

Dogfood

root@[fd00:1122:3344:109::3]:32221/omicron> show ranges from table omicron.public.audit_log;
  start_key | end_key | range_id |     range_size_mb     | lease_holder | lease_holder_locality |  replicas   | replica_localities
------------+---------+----------+-----------------------+--------------+-----------------------+-------------+---------------------
  NULL      | NULL    |     4007 | 89.608930000000000000 |            4 |                       | {1,2,3,4,5} | {,,,,}
(1 row)


root@[fd00:1122:3344:109::3]:32221/omicron> select count(*) from omicron.public.audit_log;
  count
----------
  221481
(1 row)

Comment on lines +165 to +175
// Diesel's DeleteStatement doesn't support LIMIT, so we use raw
// SQL rather than the subquery workaround needed for the DSL.
diesel::sql_query(
    "DELETE FROM omicron.public.audit_log \
     WHERE time_completed IS NOT NULL \
     AND time_completed < $1 \
     ORDER BY time_completed, id \
     LIMIT $2",
)
.bind::<diesel::sql_types::Timestamptz, _>(cutoff)
.bind::<diesel::sql_types::BigInt, _>(i64::from(limit))
Member:
it would be nice if, when writing raw SQL queries, we used the various test utils expectorate helpers so that we could at least get a unit test that asserts that the query is valid SQL. see, for example:

// Explain the possible forms of the SQL query to ensure that it
// creates a valid SQL string.
#[tokio::test]
async fn explainable() {
    let logctx = dev::test_setup_log("explainable");
    let db = TestDatabase::new_with_pool(&logctx.log).await;
    let pool = db.pool();
    let conn = pool.claim().await.unwrap();
    let query =
        dataset_update_query(vec![Uuid::nil(), Uuid::nil(), Uuid::nil()]);
    let _ = query
        .explain_async(&conn)
        .await
        .expect("Failed to explain query - is it valid SQL?");
    db.terminate().await;
    logctx.cleanup_successful();
}

Contributor Author:
I'm reluctant on this one — the integration tests cover that it's valid SQL better than a unit test since in the latter there's no guarantee the query is being used in the spot where it's supposed to be.

Member:
That's fair, I just kinda think these are nice to have, since the unit test fails much faster than the integration tests and the error is very obviously "you have a syntax error in your SQL". But it's not important to me.

/// The cutoff time used: completed entries older than this were eligible.
pub cutoff: DateTime<Utc>,
/// Configured max rows to delete in this activation.
pub max_delete_per_activation: u32,
Member:
i feel like this should be "max_deleted", but maybe that's just me?

Contributor:
Perhaps "max_to_delete" or "max_deletions"? This is just based on the description above. "max_deleted" sounds like something that happened in the past, no?

david-crespo (Contributor Author), Mar 13, 2026:
Could even be something like batch_size, though that leaves open the possibility of the activation doing multiple batches.

Member:
> Perhaps "max_to_delete" or "max_deletions" ? This just based on the description above. "max_deleted" sounds like something that happened in the past no?

I think if it's max_deleted_per_activation it makes sense --- I wasn't suggesting dropping the _per_activation part.

> Could even be something like batch_size, though that leaves open the possibility of the activation doing multiple batches.

Hm, I think that --- for the reason you mentioned --- I'd rather avoid batch_size; it's the maximum number of records to delete during the activation, not the size of the individual batch in which the rows are deleted.

david-crespo (Contributor Author) commented Mar 13, 2026

Thanks for the thorough review! Here is what I did in e587600:

  • I realized the datastore test was testing exactly the same stuff as the integration test in the task, which is pointless, so I deleted it. Now there is only the integration test for the task.
  • The test now asserts about the set of IDs that remain in the log after the cleanup job runs
  • Rolled the test about respecting the limit per activation into the first test
  • max_delete_per_activation -> max_deleted_per_activation
  • Set-timestamp helpers in tests now use the Diesel DSL

david-crespo added a commit that referenced this pull request Mar 13, 2026
david-crespo (Contributor Author):

In f58acf9 I added more explicit checking for overflow when we subtract the days from now and more detailed assertions about which entries get cleaned up and which remain after each activation.

hawkw (Member) left a comment:
The new tests are much nicer, thanks! I had some admittedly pedantic complaints about the date overflow checking stuff. Sorry.

Comment on lines +40 to +55
let cutoff = match cutoff {
    Some(c) => c,
    None => {
        let msg = format!(
            "retention_days {} overflows date arithmetic",
            self.retention_days,
        );
        slog::error!(&opctx.log, "{msg}");
        return AuditLogCleanupStatus {
            rows_deleted: 0,
            cutoff: Utc::now(),
            max_deleted_per_activation: self.max_deleted_per_activation,
            error: Some(msg),
        };
    }
};
Member:
Hm, I kind of wonder if what we would actually want to do here is to saturate to the longest representable retention period, instead of just giving up. But, this would lead to us deleting rows that are actually younger than the configured retention period, which also feels a bit bad. So, maybe it's correct to just give up, in the hopes that eventually the cutoff period becomes representable as time moves forwards.

TimeDelta::try_days(i64::from(self.retention_days.get()))
.expect("retention_days fits in TimeDelta");
let cutoff = Utc::now() - retention_delta;
let cutoff = TimeDelta::try_days(i64::from(self.retention_days.get()))
Member:
so, I don't love that the conversion of self.retention_days to a TimeDelta is part of the same fallible conversion that we attempt every time the task activates. it seems like ideally, the calculation involving the fixed value from the config file failed once when we parsed the config and was treated as an invalid config file, and only the subtraction of the current date from the time delta was fallible every time the task activates. if self.retention_days is not representable as a TimeDelta of days, this will never succeed, which should be treated as an invalid config.
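The split being suggested might look something like the following sketch. The names (`RetentionPeriod`, `from_days`, `cutoff`) are illustrative, not the actual code, and it uses std types in place of chrono; the `i64` day count mirrors the signature of `chrono::TimeDelta::try_days`:

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical validated config value: the fallible days-to-duration
/// conversion happens exactly once, at config-parse time.
#[derive(Clone, Copy)]
struct RetentionPeriod(Duration);

impl RetentionPeriod {
    /// Rejects unrepresentable (or negative) day counts when the config
    /// is loaded, so the background task never has to re-check it.
    fn from_days(days: i64) -> Result<Self, String> {
        days.checked_mul(86_400)
            .and_then(|secs| u64::try_from(secs).ok())
            .map(|secs| RetentionPeriod(Duration::from_secs(secs)))
            .ok_or_else(|| format!("invalid retention_days: {days}"))
    }

    /// The only per-activation fallible step: subtracting the retention
    /// period from "now" (which can still fail near the epoch).
    fn cutoff(&self, now: SystemTime) -> Option<SystemTime> {
        now.checked_sub(self.0)
    }
}

fn main() {
    // Validate once, at "config parse" time.
    let retention = RetentionPeriod::from_days(90).expect("valid config");
    // Each activation then only performs the date subtraction.
    assert!(retention.cutoff(SystemTime::now()).is_some());
    // Unrepresentable values are caught up front instead of every activation.
    assert!(RetentionPeriod::from_days(i64::MAX).is_err());
    println!("config-time validation ok");
}
```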

inickles (Contributor):

> We expect customers who rely on the audit log to be fetching the log and putting the result in some external monitoring system on a very frequent interval, so the retention period could be quite short.

I see the retention period is now 30 days, so "short" from this original comment is a little out of date, but I wanted to mention that I think the longer retention is important for giving our customers the room to notice and fix audit log fetch automation they may have. Or even set it up in the first place after initial rack installation. Unless we're worried about the size of the audit log taking up too much space I recommend extending that even further, such as 90 days, 180 days, or even a year, considering this isn't tunable.

david-crespo (Contributor Author):

Bumped to 90 after talking offline, and we'll plan to make this configurable.

Comment on lines +31 to +32
let retention = TimeDelta::try_days(i64::from(retention_days.get()))
    .expect("retention_days must be representable as a TimeDelta");
Member:
nitpicky: it would be nice to include the invalid value in the panic message, something like:

Suggested change
- let retention = TimeDelta::try_days(i64::from(retention_days.get()))
-     .expect("retention_days must be representable as a TimeDelta");
+ let Some(retention) = TimeDelta::try_days(i64::from(retention_days.get())) else {
+     panic!("invalid retention_days {retention_days} (must be representable as a TimeDelta)");
+ };

or something like that --- it would maybe be nicer to also include the actual max value but i was too lazy to calculate what it is.

Member:
i also felt a bit sketched out about the i64::from and worried that we might accidentally be making it into a negative TimeDelta, but then realized this couldn't happen since retention_days is a u32, so... it's fine, but maybe a comment about that would be nice?

Comment on lines +66 to +67
let msg = format!("audit log cleanup failed: {err:#}");
slog::error!(&opctx.log, "{msg}");
Member:
nitpicky, you're welcome to ignore this: i feel like audit log cleanup failed: is useful to have in the Nexus logs (even though they're also tagged with the name of the background task), but is maybe a bit less useful in the omdb db background-tasks status output, since...you asked for the status of the audit log cleanup task, you know that if there's an error, it means that was the task that failed, so it just kinda makes the line longer?

Suggested change
- let msg = format!("audit log cleanup failed: {err:#}");
- slog::error!(&opctx.log, "{msg}");
+ let msg = format!("{err:#}");
+ slog::error!(&opctx.log, "audit log cleanup failed: {msg}");

Member:
also, i think generally when we log errors in a slog log, we try to log them as structured "err" => ... fields and use the slog_error_chain crate?

Contributor Author:
Good tips, I did em. Regarding the error chain, I don't think it currently does anything because nothing in the error we get from the datastore function uses #[source] to nest other errors inside it. But it still seemed smart to use InlineErrorChain so that it automatically gets logged correctly in the future. slog::error! with the structured thing apparently already does this.

hawkw (Member) left a comment:
thanks for the changes to the retention period validation, i think this is basically good. i had a couple minor nits.
