-
Notifications
You must be signed in to change notification settings - Fork 69
Background task: complete never-completed audit log entries #10026
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
8f06de6
b4bf111
c5b7f89
d37eed5
a6c7e63
51f6f1c
99eab7b
a247d54
e9cafb9
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,99 @@ | ||
| --- | ||
| name: add-background-task | ||
| description: Add a new Nexus background task. Use when the user wants to create a periodic background task in Nexus that runs on a timer. | ||
| --- | ||
|
|
||
| # Add a Nexus background task | ||
|
|
||
| All background tasks live in Nexus. A task implements the `BackgroundTask` trait (`nexus/src/app/background/mod.rs`), runs on a configurable period, and reports status as `serde_json::Value`. | ||
|
|
||
| ## General approach | ||
|
|
||
| There are many existing background tasks in `nexus/src/app/background/tasks/`. Before writing anything, read a few tasks that are similar in shape to the one you're adding (e.g., a simple periodic cleanup vs. a task that watches a channel). Use those as models for structure, naming, logging, error handling, and status reporting. The goal is to conform to the patterns already in use, not to invent new ones. | ||
|
|
||
| ## Checklist | ||
|
|
||
| These are the touch points for adding a new background task. Follow them in order. | ||
|
|
||
| ### 1. Status type (`nexus/types/src/internal_api/background.rs`) | ||
|
|
||
| Define a struct for the task's activation status. Derive `Clone, Debug, Deserialize, Serialize, PartialEq, Eq`. For errors, use `Option<String>` if the task can only fail in one way per activation, or `Vec<String>` if it accumulates multiple independent errors. Match what similar tasks do. | ||
|
|
||
| ### 2. Task implementation (`nexus/src/app/background/tasks/<name>.rs`) | ||
|
|
||
| Create the task module. The struct holds whatever state it needs (typically `Arc<DataStore>` plus config). Implement `BackgroundTask::activate` by delegating to an `actually_activate` helper, then serialize the status to `serde_json::Value`. The `actually_activate` pattern makes unit testing easy without going through the trait. | ||
|
|
||
| `actually_activate` can either build and return the status (`async fn actually_activate(&mut self, opctx) -> YourStatus`), or take a mutable reference to one (`async fn actually_activate(&mut self, opctx, status: &mut YourStatus) -> Result<(), Error>`). The first is simpler and works well when the task either fully succeeds or fully fails. The second is better when the task can partially complete (e.g., it loops over work items): `activate` creates the status struct up front, passes it in, and serializes it afterward regardless of `Ok`/`Err`, so any progress already recorded in `status` (items processed, partial counts, earlier errors) is preserved even if the method bails out with `?` later. | ||
|
|
||
| Logging conventions: `debug` when there's nothing to do, `info` when routine work was done, `warn` when the work done indicates something is wrong (e.g., cleaning up after a crash), `error` on failure. Log errors as structured fields with the `; &err` slog syntax (which uses the `SlogInlineError` trait), not by interpolating into the message string. For the error string in the status struct, use `InlineErrorChain::new(&err).to_string()` (from `slog_error_chain`) to capture the full cause chain. Status error strings should not repeat the task name — omdb already shows which task you're looking at. | ||
|
|
||
| If the task takes config values that need conversion or validation (e.g., converting a `Duration` to `TimeDelta`, or checking a numeric range), do it once in `new()` and store the validated form. Don't re-validate on every activation — if the config is invalid, panic in `new()` with a message that includes the invalid value. | ||
|
|
||
| Include a unit test in the same file using `TestDatabase::new_with_datastore` that calls `actually_activate` directly. If the task has a datastore method, a single test exercising the task end-to-end (including the limit/batching behavior) is sufficient — don't add a redundant test for the datastore method separately unless it has complex logic worth testing in isolation. | ||
|
|
||
| ### 3. Register the module (`nexus/src/app/background/tasks/mod.rs`) | ||
|
|
||
| Add `pub mod <name>;` in alphabetical order. | ||
|
|
||
| ### 4. Activator (`nexus/background-task-interface/src/init.rs`) | ||
|
|
||
| Add `pub task_<name>: Activator` to the `BackgroundTasks` struct, maintaining alphabetical order among the task fields. | ||
|
|
||
| ### 5. Config (`nexus-config/src/nexus_config.rs`) | ||
|
|
||
| Add a config struct (e.g., `YourTaskConfig`) with at minimum `period_secs: Duration` (using `#[serde_as(as = "DurationSeconds<u64>")]`). If the task does bounded work per activation, name the limit field `max_<past_tense_verb>_per_activation` (e.g., `max_deleted_per_activation`, `max_timed_out_per_activation`) to match existing conventions. Add the field to `BackgroundTaskConfig`. Update the test config literal and expected parse output at the bottom of the file. | ||
|
|
||
| ### 6. Config files | ||
|
|
||
| Add the new config fields to all of these: | ||
| - `nexus/examples/config.toml` | ||
| - `nexus/examples/config-second.toml` | ||
| - `nexus/tests/config.test.toml` | ||
| - `smf/nexus/single-sled/config-partial.toml` | ||
| - `smf/nexus/multi-sled/config-partial.toml` | ||
|
|
||
| ### 7. Wire up in `nexus/src/app/background/init.rs` | ||
|
|
||
| - Import the task module. | ||
| - Add `Activator::new()` in the `BackgroundTasks` constructor. | ||
| - Destructure it in the `start` method. | ||
| - Call `driver.register(TaskDefinition { ... })` with the task. The last task registered should pass `datastore` by move (not `.clone()`), so adjust the previous last task if needed. | ||
| - If extra data is needed from `BackgroundTasksData`, add the field there and plumb it from `nexus/src/app/mod.rs`. | ||
|
|
||
| ### 8. Schema migration (if needed) | ||
|
|
||
| If the task needs a new index or schema change to support its query, add a migration under `schema/crdb/`. See `schema/crdb/README.adoc` for the procedure. Also update `dbinit.sql` and bump the version in `nexus/db-model/src/schema_versions.rs`. | ||
|
|
||
| ### 9. Datastore method (if needed) | ||
|
|
||
| If the task needs a new query, add it in the appropriate `nexus/db-queries/src/db/datastore/` file. Prefer the Diesel typed DSL over raw SQL (`diesel::sql_query`) for queries and test helpers. Only fall back to raw SQL when the DSL genuinely can't express the query. | ||
|
|
||
| If the task modifies rows that other code paths also modify, think about races: what happens if both run concurrently on the same row? Both paths should typically guard their writes so only one succeeds. | ||
|
|
||
| ### 10. omdb output (`dev-tools/omdb/src/bin/omdb/nexus.rs`) | ||
|
|
||
| Add a `print_task_<name>` function and wire it into the match in `print_task_details` (alphabetical order). Import the status type. Use the `const_max_len` + `WIDTH` pattern to align columns: | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Wow another thing I did that one time has now been determined to be a Best Practice that the robot should do! This almost makes me feel okay about being replaced! :D |
||
|
|
||
| ```rust | ||
| const LABEL: &str = "label:"; | ||
| const WIDTH: usize = const_max_len(&[LABEL, ...]) + 1; | ||
| println!(" {LABEL:<WIDTH$}{}", status.field); | ||
| ``` | ||
|
|
||
| ### 11. Update test output (`dev-tools/omdb/tests/`) | ||
|
|
||
| Run the omdb tests with `EXPECTORATE=overwrite` to update the expected output snapshots (`env.out` and `successes.out`): | ||
|
|
||
| ``` | ||
| EXPECTORATE=overwrite cargo nextest run -p omicron-omdb | ||
| ``` | ||
|
|
||
| Review the diff to make sure only your new task's output was added. | ||
|
|
||
| ### 12. Verify | ||
|
|
||
| - `cargo check -p omicron-nexus --all-targets` | ||
| - `cargo fmt` | ||
| - `cargo xtask clippy` | ||
| - Run the new task's unit tests | ||
| - Run the omdb tests: `cargo nextest run -p omicron-omdb` | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -52,6 +52,7 @@ use nexus_types::deployment::OximeterReadPolicy; | |
| use nexus_types::fm; | ||
| use nexus_types::internal_api::background::AbandonedVmmReaperStatus; | ||
| use nexus_types::internal_api::background::AttachedSubnetManagerStatus; | ||
| use nexus_types::internal_api::background::AuditLogTimeoutIncompleteStatus; | ||
| use nexus_types::internal_api::background::BlueprintPlannerStatus; | ||
| use nexus_types::internal_api::background::BlueprintRendezvousStats; | ||
| use nexus_types::internal_api::background::BlueprintRendezvousStatus; | ||
|
|
@@ -1222,6 +1223,9 @@ fn print_task_details(bgtask: &BackgroundTask, details: &serde_json::Value) { | |
| "attached_subnet_manager" => { | ||
| print_task_attached_subnet_manager_status(details); | ||
| } | ||
| "audit_log_timeout_incomplete" => { | ||
| print_task_audit_log_timeout_incomplete(details); | ||
| } | ||
| "blueprint_planner" => { | ||
| print_task_blueprint_planner(details); | ||
| } | ||
|
|
@@ -1273,6 +1277,9 @@ fn print_task_details(bgtask: &BackgroundTask, details: &serde_json::Value) { | |
| "read_only_region_replacement_start" => { | ||
| print_task_read_only_region_replacement_start(details); | ||
| } | ||
| "reconfigurator_config_watcher" => { | ||
| print_task_reconfigurator_config_watcher(details); | ||
| } | ||
| "region_replacement" => { | ||
| print_task_region_replacement(details); | ||
| } | ||
|
|
@@ -2323,6 +2330,16 @@ fn print_task_read_only_region_replacement_start(details: &serde_json::Value) { | |
| } | ||
| } | ||
|
|
||
| fn print_task_reconfigurator_config_watcher(details: &serde_json::Value) { | ||
| match details.get("config_updated").and_then(|v| v.as_bool()) { | ||
| Some(updated) => println!(" config updated: {updated}"), | ||
| None => eprintln!( | ||
| "warning: failed to interpret task details: {:?}", | ||
| details | ||
| ), | ||
| } | ||
| } | ||
|
|
||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This task isn't changed here but it was causing the tests to be flaky for me locally depending on whether I ran one test or the whole module. |
||
| fn print_task_region_replacement(details: &serde_json::Value) { | ||
| match serde_json::from_value::<RegionReplacementStatus>(details.clone()) { | ||
| Err(error) => eprintln!( | ||
|
|
@@ -2671,6 +2688,38 @@ fn print_task_saga_recovery(details: &serde_json::Value) { | |
| } | ||
| } | ||
|
|
||
| fn print_task_audit_log_timeout_incomplete(details: &serde_json::Value) { | ||
| match serde_json::from_value::<AuditLogTimeoutIncompleteStatus>( | ||
| details.clone(), | ||
| ) { | ||
| Err(error) => eprintln!( | ||
| "warning: failed to interpret task details: {:?}: {:?}", | ||
| error, details | ||
| ), | ||
| Ok(status) => { | ||
| const TIMED_OUT: &str = "timed_out:"; | ||
| const CUTOFF: &str = "cutoff:"; | ||
| const MAX_UPDATE: &str = "max_timed_out_per_activation:"; | ||
| const ERROR: &str = "error:"; | ||
| const WIDTH: usize = | ||
| const_max_len(&[TIMED_OUT, CUTOFF, MAX_UPDATE, ERROR]) + 1; | ||
|
|
||
| println!(" {TIMED_OUT:<WIDTH$}{}", status.timed_out); | ||
| println!( | ||
| " {CUTOFF:<WIDTH$}{}", | ||
| status.cutoff.to_rfc3339_opts(SecondsFormat::AutoSi, true), | ||
| ); | ||
| println!( | ||
| " {MAX_UPDATE:<WIDTH$}{}", | ||
| status.max_timed_out_per_activation | ||
| ); | ||
| if let Some(error) = &status.error { | ||
| println!(" {ERROR:<WIDTH$}{error}"); | ||
| } | ||
| } | ||
| }; | ||
| } | ||
|
|
||
| fn print_task_session_cleanup(details: &serde_json::Value) { | ||
| match serde_json::from_value::<SessionCleanupStatus>(details.clone()) { | ||
| Err(error) => eprintln!( | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of this (the stuff about slog error chains) feels like maybe something
slogshould always be instructed to do, not just in bg tasks? If there's another place to put that advice that will cause claude to use it any time it writes code in this repo, that could be a good idea.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A checked-in
CLAUDE.mdfor the repo would be a good spot for that. I've held off making one because the work people do in here is so varied that it's hard to write down truly general guidelines.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah. i feel like there is a lot of stuff that everyone would want it to know, such like "don't use the logger incorrectly" and "remember to run
cargo xtask clippyinstead of some other thing" and "if you run the tests usingcargo testrather thancargo nextest, most of them will not work"