99 changes: 99 additions & 0 deletions .claude/skills/add-background-task/SKILL.md
@@ -0,0 +1,99 @@
---
name: add-background-task
description: Add a new Nexus background task. Use when the user wants to create a periodic background task in Nexus that runs on a timer.
---

# Add a Nexus background task

All background tasks live in Nexus. A task implements the `BackgroundTask` trait (`nexus/src/app/background/mod.rs`), runs on a configurable period, and reports status as `serde_json::Value`.

## General approach

There are many existing background tasks in `nexus/src/app/background/tasks/`. Before writing anything, read a few tasks that are similar in shape to the one you're adding (e.g., a simple periodic cleanup vs. a task that watches a channel). Use those as models for structure, naming, logging, error handling, and status reporting. The goal is to conform to the patterns already in use, not to invent new ones.

## Checklist

These are the touch points for adding a new background task. Follow them in order.

### 1. Status type (`nexus/types/src/internal_api/background.rs`)

Define a struct for the task's activation status. Derive `Clone, Debug, Deserialize, Serialize, PartialEq, Eq`. For errors, use `Option<String>` if the task can only fail in one way per activation, or `Vec<String>` if it accumulates multiple independent errors. Match what similar tasks do.
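
As a sketch (all names hypothetical), a status type for a task that can fail at most one way per activation might look like the following; the real type would also derive serde's `Serialize` and `Deserialize`, elided here to keep the sketch dependency-free:

```rust
// Hypothetical status type for nexus/types/src/internal_api/background.rs.
// In the real file, also derive serde::{Serialize, Deserialize}.
#[derive(Clone, Debug, PartialEq, Eq, Default)]
pub struct ExampleCleanupStatus {
    /// number of rows transitioned during this activation
    pub rows_cleaned: usize,
    /// single error slot, since this task fails at most one way per activation
    pub error: Option<String>,
}
```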

### 2. Task implementation (`nexus/src/app/background/tasks/<name>.rs`)

Create the task module. The struct holds whatever state it needs (typically `Arc<DataStore>` plus config). Implement `BackgroundTask::activate` by delegating to an `actually_activate` helper, then serialize the status to `serde_json::Value`. The `actually_activate` pattern makes unit testing easy without going through the trait.

`actually_activate` can either build and return the status (`async fn actually_activate(&mut self, opctx) -> YourStatus`), or take a mutable reference to one (`async fn actually_activate(&mut self, opctx, status: &mut YourStatus) -> Result<(), Error>`). The first is simpler and works well when the task either fully succeeds or fully fails. The second is better when the task can partially complete (e.g., it loops over work items): `activate` creates the status struct up front, passes it in, and serializes it afterward regardless of `Ok`/`Err`, so any progress already recorded in `status` (items processed, partial counts, earlier errors) is preserved even if the method bails out with `?` later.
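
A dependency-free sketch of the two shapes (synchronous here, with the real `OpContext`, `DataStore`, and serde serialization elided; all names hypothetical):

```rust
#[derive(Debug, Default)]
struct CleanupStatus {
    rows_deleted: usize,
    error: Option<String>,
}

struct CleanupTask {
    items: Vec<u32>, // stand-in for datastore work items
}

impl CleanupTask {
    // Shape 1: build and return the status. Fine when the activation
    // either fully succeeds or fully fails.
    fn actually_activate_returning(&mut self) -> CleanupStatus {
        CleanupStatus { rows_deleted: self.items.drain(..).count(), error: None }
    }

    // Shape 2: mutate a status passed in by the caller. Progress recorded
    // before an early `return Err` (or `?`) survives, because `activate`
    // still serializes `status` afterward.
    fn actually_activate_in_place(
        &mut self,
        status: &mut CleanupStatus,
    ) -> Result<(), String> {
        for item in std::mem::take(&mut self.items) {
            if item == 0 {
                return Err("hit a bad item".to_string()); // partial progress kept
            }
            status.rows_deleted += 1;
        }
        Ok(())
    }

    // `activate` delegates, records any error, and in the real trait impl
    // would then serialize the status to `serde_json::Value`.
    fn activate(&mut self) -> CleanupStatus {
        let mut status = CleanupStatus::default();
        if let Err(e) = self.actually_activate_in_place(&mut status) {
            status.error = Some(e);
        }
        status
    }
}
```

The key property of shape 2 is visible here: the items processed before the failure remain counted in the status alongside the error.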

Logging conventions: `debug` when there's nothing to do, `info` when routine work was done, `warn` when the work done indicates something is wrong (e.g., cleaning up after a crash), `error` on failure. Log errors as structured fields with the `; &err` slog syntax (which uses the `SlogInlineError` trait), not by interpolating into the message string. For the error string in the status struct, use `InlineErrorChain::new(&err).to_string()` (from `slog_error_chain`) to capture the full cause chain. Status error strings should not repeat the task name — omdb already shows which task you're looking at.
> **Review comment (Member):** Some of this (the stuff about slog error chains) feels like maybe something slog should always be instructed to do, not just in bg tasks? If there's another place to put that advice that will cause Claude to use it any time it writes code in this repo, that could be a good idea.
>
> **Reply (Contributor, Author):** A checked-in CLAUDE.md for the repo would be a good spot for that. I've held off making one because the work people do in here is so varied that it's hard to write down truly general guidelines.
>
> **Reply (Member):** Yeah, I feel like there is a lot of stuff that everyone would want it to know, like "don't use the logger incorrectly", "remember to run `cargo xtask clippy` instead of some other thing", and "if you run the tests using `cargo test` rather than `cargo nextest`, most of them will not work".


If the task takes config values that need conversion or validation (e.g., converting a `Duration` to `TimeDelta`, or checking a numeric range), do it once in `new()` and store the validated form. Don't re-validate on every activation — if the config is invalid, panic in `new()` with a message that includes the invalid value.
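
A minimal sketch of the pattern, with hypothetical names (the conversion happens once in `new()` and the stored form is what activations use):

```rust
use std::time::Duration;

// Hypothetical task and config; the validated/converted form is computed
// once in `new()` and stored, so activations never re-check it.
struct ExampleTaskConfig {
    period_secs: Duration,
    max_deleted_per_activation: u32,
}

struct ExampleTask {
    limit: i64, // converted once from the u32 config value
}

impl ExampleTask {
    fn new(config: &ExampleTaskConfig) -> ExampleTask {
        // Invalid config is a deployment bug: panic with the offending value.
        assert!(
            config.max_deleted_per_activation > 0,
            "max_deleted_per_activation must be nonzero (got {})",
            config.max_deleted_per_activation,
        );
        ExampleTask { limit: i64::from(config.max_deleted_per_activation) }
    }
}
```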

Include a unit test in the same file using `TestDatabase::new_with_datastore` that calls `actually_activate` directly. If the task has a datastore method, a single test exercising the task end-to-end (including the limit/batching behavior) is sufficient — don't add a redundant test for the datastore method separately unless it has complex logic worth testing in isolation.

### 3. Register the module (`nexus/src/app/background/tasks/mod.rs`)

Add `pub mod <name>;` in alphabetical order.

### 4. Activator (`nexus/background-task-interface/src/init.rs`)

Add `pub task_<name>: Activator` to the `BackgroundTasks` struct, maintaining alphabetical order among the task fields.

### 5. Config (`nexus-config/src/nexus_config.rs`)

Add a config struct (e.g., `YourTaskConfig`) with at minimum `period_secs: Duration` (using `#[serde_as(as = "DurationSeconds<u64>")]`). If the task does bounded work per activation, name the limit field `max_<past_tense_verb>_per_activation` (e.g., `max_deleted_per_activation`, `max_timed_out_per_activation`) to match existing conventions. Add the field to `BackgroundTaskConfig`. Update the test config literal and expected parse output at the bottom of the file.

### 6. Config files

Add the new config fields to all of these:
- `nexus/examples/config.toml`
- `nexus/examples/config-second.toml`
- `nexus/tests/config.test.toml`
- `smf/nexus/single-sled/config-partial.toml`
- `smf/nexus/multi-sled/config-partial.toml`
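
Using the audit-log task from this PR as an example, each file gets entries like the following (the section header is elided here; place them alongside the existing background-task entries, and pick values appropriate to each file):

```toml
# Values mirror the test config in this PR; adjust per file.
audit_log_timeout_incomplete.period_secs = 600
audit_log_timeout_incomplete.timeout_secs = 14400
audit_log_timeout_incomplete.max_timed_out_per_activation = 1000
```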

### 7. Wire up in `nexus/src/app/background/init.rs`

- Import the task module.
- Add `Activator::new()` in the `BackgroundTasks` constructor.
- Destructure it in the `start` method.
- Call `driver.register(TaskDefinition { ... })` with the task. The last task registered should pass `datastore` by move (not `.clone()`), so adjust the previous last task if needed.
- If extra data is needed from `BackgroundTasksData`, add the field there and plumb it from `nexus/src/app/mod.rs`.

### 8. Schema migration (if needed)

If the task needs a new index or schema change to support its query, add a migration under `schema/crdb/`. See `schema/crdb/README.adoc` for the procedure. Also update `dbinit.sql` and bump the version in `nexus/db-model/src/schema_versions.rs`.

### 9. Datastore method (if needed)

If the task needs a new query, add it in the appropriate `nexus/db-queries/src/db/datastore/` file. Prefer the Diesel typed DSL over raw SQL (`diesel::sql_query`) for queries and test helpers. Only fall back to raw SQL when the DSL genuinely can't express the query.

If the task modifies rows that other code paths also modify, think about races: what happens if both run concurrently on the same row? Both paths should typically guard their writes so only one succeeds.
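
A dependency-free sketch of the idea (in SQL this guard typically lives in the statement itself, e.g. `UPDATE ... WHERE time_completed IS NULL`, so the check and the write are atomic):

```rust
// Toy model of two code paths racing to "complete" the same row. Each path
// writes only if the row is still incomplete, so exactly one write wins.
// In the database, this guard belongs in the UPDATE's WHERE clause.
#[derive(Default)]
struct Row {
    time_completed: Option<&'static str>, // None = still incomplete
}

fn try_complete(row: &mut Row, who: &'static str) -> bool {
    if row.time_completed.is_none() {
        row.time_completed = Some(who);
        true // this path's write succeeded
    } else {
        false // the other path got there first; don't overwrite
    }
}
```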

### 10. omdb output (`dev-tools/omdb/src/bin/omdb/nexus.rs`)

Add a `print_task_<name>` function and wire it into the match in `print_task_details` (alphabetical order). Import the status type. Use the `const_max_len` + `WIDTH` pattern to align columns:
> **Review comment (Member):** Wow another thing I did that one time has now been determined to be a Best Practice that the robot should do! This almost makes me feel okay about being replaced! :D


```rust
const LABEL: &str = "label:";
const WIDTH: usize = const_max_len(&[LABEL, ...]) + 1;
println!(" {LABEL:<WIDTH$}{}", status.field);
```

### 11. Update test output (`dev-tools/omdb/tests/`)

Run the omdb tests with `EXPECTORATE=overwrite` to update the expected output snapshots (`env.out` and `successes.out`):

```
EXPECTORATE=overwrite cargo nextest run -p omicron-omdb
```

Review the diff to make sure only your new task's output was added.

### 12. Verify

- `cargo check -p omicron-nexus --all-targets`
- `cargo fmt`
- `cargo xtask clippy`
- Run the new task's unit tests
- Run the omdb tests: `cargo nextest run -p omicron-omdb`
49 changes: 49 additions & 0 deletions dev-tools/omdb/src/bin/omdb/nexus.rs
@@ -52,6 +52,7 @@ use nexus_types::deployment::OximeterReadPolicy;
use nexus_types::fm;
use nexus_types::internal_api::background::AbandonedVmmReaperStatus;
use nexus_types::internal_api::background::AttachedSubnetManagerStatus;
use nexus_types::internal_api::background::AuditLogTimeoutIncompleteStatus;
use nexus_types::internal_api::background::BlueprintPlannerStatus;
use nexus_types::internal_api::background::BlueprintRendezvousStats;
use nexus_types::internal_api::background::BlueprintRendezvousStatus;
@@ -1222,6 +1223,9 @@ fn print_task_details(bgtask: &BackgroundTask, details: &serde_json::Value) {
"attached_subnet_manager" => {
print_task_attached_subnet_manager_status(details);
}
"audit_log_timeout_incomplete" => {
print_task_audit_log_timeout_incomplete(details);
}
"blueprint_planner" => {
print_task_blueprint_planner(details);
}
@@ -1273,6 +1277,9 @@ fn print_task_details(bgtask: &BackgroundTask, details: &serde_json::Value) {
"read_only_region_replacement_start" => {
print_task_read_only_region_replacement_start(details);
}
"reconfigurator_config_watcher" => {
print_task_reconfigurator_config_watcher(details);
}
"region_replacement" => {
print_task_region_replacement(details);
}
@@ -2323,6 +2330,16 @@ fn print_task_read_only_region_replacement_start(details: &serde_json::Value) {
}
}

fn print_task_reconfigurator_config_watcher(details: &serde_json::Value) {
match details.get("config_updated").and_then(|v| v.as_bool()) {
Some(updated) => println!(" config updated: {updated}"),
None => eprintln!(
"warning: failed to interpret task details: {:?}",
details
),
}
}

> **Review comment (Contributor, Author):** This task isn't changed here but it was causing the tests to be flaky for me locally depending on whether I ran one test or the whole module.

fn print_task_region_replacement(details: &serde_json::Value) {
match serde_json::from_value::<RegionReplacementStatus>(details.clone()) {
Err(error) => eprintln!(
@@ -2671,6 +2688,38 @@ fn print_task_saga_recovery(details: &serde_json::Value) {
}
}

fn print_task_audit_log_timeout_incomplete(details: &serde_json::Value) {
match serde_json::from_value::<AuditLogTimeoutIncompleteStatus>(
details.clone(),
) {
Err(error) => eprintln!(
"warning: failed to interpret task details: {:?}: {:?}",
error, details
),
Ok(status) => {
const TIMED_OUT: &str = "timed_out:";
const CUTOFF: &str = "cutoff:";
const MAX_UPDATE: &str = "max_timed_out_per_activation:";
const ERROR: &str = "error:";
const WIDTH: usize =
const_max_len(&[TIMED_OUT, CUTOFF, MAX_UPDATE, ERROR]) + 1;

println!(" {TIMED_OUT:<WIDTH$}{}", status.timed_out);
println!(
" {CUTOFF:<WIDTH$}{}",
status.cutoff.to_rfc3339_opts(SecondsFormat::AutoSi, true),
);
println!(
" {MAX_UPDATE:<WIDTH$}{}",
status.max_timed_out_per_activation
);
if let Some(error) = &status.error {
println!(" {ERROR:<WIDTH$}{error}");
}
}
};
}

fn print_task_session_cleanup(details: &serde_json::Value) {
match serde_json::from_value::<SessionCleanupStatus>(details.clone()) {
Err(error) => eprintln!(
15 changes: 15 additions & 0 deletions dev-tools/omdb/tests/env.out
@@ -38,6 +38,11 @@ task: "attached_subnet_manager"
distributes attached subnets to sleds and switch


task: "audit_log_timeout_incomplete"
transitions stale incomplete audit log entries to timeout status so they
become visible in the audit log


task: "bfd_manager"
Manages bidirectional fowarding detection (BFD) configuration on rack
switches
@@ -288,6 +293,11 @@ task: "attached_subnet_manager"
distributes attached subnets to sleds and switch


task: "audit_log_timeout_incomplete"
transitions stale incomplete audit log entries to timeout status so they
become visible in the audit log


task: "bfd_manager"
Manages bidirectional fowarding detection (BFD) configuration on rack
switches
@@ -525,6 +535,11 @@ task: "attached_subnet_manager"
distributes attached subnets to sleds and switch


task: "audit_log_timeout_incomplete"
transitions stale incomplete audit log entries to timeout status so they
become visible in the audit log


task: "bfd_manager"
Manages bidirectional fowarding detection (BFD) configuration on rack
switches
25 changes: 23 additions & 2 deletions dev-tools/omdb/tests/successes.out
@@ -273,6 +273,11 @@ task: "attached_subnet_manager"
distributes attached subnets to sleds and switch


task: "audit_log_timeout_incomplete"
transitions stale incomplete audit log entries to timeout status so they
become visible in the audit log


task: "bfd_manager"
Manages bidirectional fowarding detection (BFD) configuration on rack
switches
@@ -593,6 +598,14 @@ task: "attached_subnet_manager"
no dendrite instances found
no sleds found

task: "audit_log_timeout_incomplete"
configured period: every <REDACTED_DURATION>m
last completed activation: <REDACTED ITERATIONS>, triggered by <TRIGGERED_BY_REDACTED>
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
timed_out: 0
cutoff: <REDACTED_TIMESTAMP>
max_timed_out_per_activation: 1000

task: "bfd_manager"
configured period: every <REDACTED_DURATION>s
last completed activation: <REDACTED ITERATIONS>, triggered by <TRIGGERED_BY_REDACTED>
@@ -782,7 +795,7 @@ task: "reconfigurator_config_watcher"
configured period: every <REDACTED_DURATION>s
last completed activation: <REDACTED ITERATIONS>, triggered by <TRIGGERED_BY_REDACTED>
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
warning: unknown background task: "reconfigurator_config_watcher" (don't know how to interpret details: Object {"config_updated": Bool(false)})
config updated: <CONFIG_UPDATED_REDACTED>

task: "region_replacement"
configured period: every <REDACTED_DURATION>days <REDACTED_DURATION>h <REDACTED_DURATION>m <REDACTED_DURATION>s
@@ -1194,6 +1207,14 @@ task: "attached_subnet_manager"
no dendrite instances found
no sleds found

task: "audit_log_timeout_incomplete"
configured period: every <REDACTED_DURATION>m
last completed activation: <REDACTED ITERATIONS>, triggered by <TRIGGERED_BY_REDACTED>
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
timed_out: 0
cutoff: <REDACTED_TIMESTAMP>
max_timed_out_per_activation: 1000

task: "bfd_manager"
configured period: every <REDACTED_DURATION>s
last completed activation: <REDACTED ITERATIONS>, triggered by <TRIGGERED_BY_REDACTED>
@@ -1383,7 +1404,7 @@ task: "reconfigurator_config_watcher"
configured period: every <REDACTED_DURATION>s
last completed activation: <REDACTED ITERATIONS>, triggered by <TRIGGERED_BY_REDACTED>
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
warning: unknown background task: "reconfigurator_config_watcher" (don't know how to interpret details: Object {"config_updated": Bool(false)})
config updated: <CONFIG_UPDATED_REDACTED>

task: "region_replacement"
configured period: every <REDACTED_DURATION>days <REDACTED_DURATION>h <REDACTED_DURATION>m <REDACTED_DURATION>s
4 changes: 4 additions & 0 deletions dev-tools/omdb/tests/test_all_output.rs
@@ -333,6 +333,10 @@ async fn test_omdb_success_cases(cptestctx: &ControlPlaneTestContext) {
redactor.extra_variable_length("cockroachdb_version", &crdb_version);
}

// The `reconfigurator_config_watcher` task's output depends on
// whether it has had time to complete an activation.
redactor.field("config updated:", r"\w+");

// The `tuf_artifact_replication` task's output depends on how
// many sleds happened to register with Nexus before its first
// execution. These redactions work around the issue described in
29 changes: 29 additions & 0 deletions nexus-config/src/nexus_config.rs
@@ -437,6 +437,8 @@ pub struct BackgroundTaskConfig {
pub attached_subnet_manager: AttachedSubnetManagerConfig,
/// configuration for console session cleanup task
pub session_cleanup: SessionCleanupConfig,
/// configuration for audit log incomplete timeout task
pub audit_log_timeout_incomplete: AuditLogTimeoutIncompleteConfig,
}

#[serde_as]
@@ -450,6 +452,21 @@ pub struct SessionCleanupConfig {
pub max_delete_per_activation: u32,
}

#[serde_as]
#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize)]
pub struct AuditLogTimeoutIncompleteConfig {
/// period (in seconds) for periodic activations of this task
#[serde_as(as = "DurationSeconds<u64>")]
pub period_secs: Duration,

/// how old an incomplete entry must be before it is timed out
#[serde_as(as = "DurationSeconds<u64>")]
pub timeout_secs: Duration,

/// max rows per SQL statement
pub max_timed_out_per_activation: u32,
}

#[serde_as]
#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize)]
pub struct DnsTasksConfig {
@@ -1276,6 +1293,9 @@ mod test {
attached_subnet_manager.period_secs = 60
session_cleanup.period_secs = 300
session_cleanup.max_delete_per_activation = 10000
audit_log_timeout_incomplete.period_secs = 600
audit_log_timeout_incomplete.timeout_secs = 14400
audit_log_timeout_incomplete.max_timed_out_per_activation = 1000
[default_region_allocation_strategy]
type = "random"
seed = 0
@@ -1544,6 +1564,12 @@
period_secs: Duration::from_secs(300),
max_delete_per_activation: 10_000,
},
audit_log_timeout_incomplete:
AuditLogTimeoutIncompleteConfig {
period_secs: Duration::from_secs(600),
timeout_secs: Duration::from_secs(14400),
max_timed_out_per_activation: 1000,
},
},
multicast: MulticastConfig { enabled: false },
default_region_allocation_strategy:
@@ -1652,6 +1678,9 @@
attached_subnet_manager.period_secs = 60
session_cleanup.period_secs = 300
session_cleanup.max_delete_per_activation = 10000
audit_log_timeout_incomplete.period_secs = 600
audit_log_timeout_incomplete.timeout_secs = 14400
audit_log_timeout_incomplete.max_timed_out_per_activation = 1000

[default_region_allocation_strategy]
type = "random"
1 change: 1 addition & 0 deletions nexus/background-task-interface/src/init.rs
@@ -37,6 +37,7 @@ pub struct BackgroundTasks {
pub task_instance_reincarnation: Activator,
pub task_service_firewall_propagation: Activator,
pub task_abandoned_vmm_reaper: Activator,
pub task_audit_log_timeout_incomplete: Activator,
pub task_vpc_route_manager: Activator,
pub task_saga_recovery: Activator,
pub task_lookup_region_port: Activator,
12 changes: 6 additions & 6 deletions nexus/db-model/src/audit_log.rs
@@ -224,7 +224,7 @@ impl From<AuditLogEntryInitParams> for AuditLogEntryInit {

/// `audit_log_complete` is a view on `audit_log` filtering for rows with
/// non-null `time_completed`, not its own table.
#[derive(Queryable, Selectable, Clone, Debug)]
#[derive(Queryable, Selectable, Clone, Debug, PartialEq)]
#[diesel(table_name = audit_log_complete)]
pub struct AuditLogEntry {
pub id: Uuid,
@@ -276,15 +276,15 @@ pub enum AuditLogCompletion {
/// error, and I don't think we even have API timeouts) but rather that the
/// attempts to complete the log entry failed (or were never even attempted
/// because, e.g., Nexus crashed during the operation), and this entry had
/// to be cleaned up later by a background job (which doesn't exist yet)
/// after a timeout. Note we represent this result status as "Unknown" in
/// the external API because timeout is an implementation detail and makes
/// it sound like the operation timed out.
/// to be cleaned up later by a background job after a timeout. Note we
/// represent this result status as "Unknown" in the external API because
/// timeout is an implementation detail and makes it sound like the
/// operation timed out.
Timeout,
}

#[derive(AsChangeset, Clone)]
#[diesel(table_name = audit_log)]
#[diesel(table_name = audit_log, treat_none_as_null = true)]
pub struct AuditLogCompletionUpdate {
pub time_completed: DateTime<Utc>,
pub result_kind: AuditLogResultKind,
Expand Down