Skip to content

Commit d2bf8fc

Browse files
committed
Add review context comments
1 parent 9b5d906 commit d2bf8fc

3 files changed

Lines changed: 42 additions & 2 deletions

File tree

nexus/db-model/src/fm.rs

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,14 @@
1212
//! These types are used when inserting and reading sitreps in CRDB; when in
1313
//! use, the sitrep is represented as a [`nexus_types::fm::Sitrep`]. See the
1414
//! documentation in [`nexus_types::fm`] for more information.
15+
//!
16+
//! # Conventions
17+
//!
18+
//! Fields in `Queryable`/`Insertable` structs should be ordered to match the
19+
//! column order in `dbinit.sql`. Diesel maps `Queryable` fields positionally,
20+
//! so matching the schema column order prevents subtle bugs if a query ever
21+
//! omits `Selectable`, and keeps the Rust types easy to cross-reference with
22+
//! the schema definition.
1523
1624
use crate::SqlU32;
1725
use crate::typed_uuid::DbTypedUuid;

nexus/db-queries/src/db/datastore/fm.rs

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,17 @@
77
//!
88
//! See [RFD 603](https://rfd.shared.oxide.computer/rfd/0603) for details on the
99
//! fault management sitrep.
10+
//!
11+
//! # Conventions
12+
//!
13+
//! Methods with an `_on_conn` suffix take an explicit database connection
14+
//! parameter. In other parts of the codebase this suffix sometimes implies the
15+
//! method is called within a transaction, but that isn't always the case here.
16+
//! In the FM datastore, `_on_conn` methods exist primarily so that multiple
17+
//! queries can share the same connection for consistency (e.g. loading a sitrep
18+
//! reads child records first and metadata last on the same connection, to
19+
//! detect concurrent deletes without requiring a transaction, see
20+
//! [`DataStore::fm_sitrep_read_on_conn`] and issue #9594).
1021
1122
use super::DataStore;
1223
use crate::authz;
@@ -503,6 +514,13 @@ impl DataStore {
503514
// rather than doing smaller ones for each case in the sitrep. This uses
504515
// more memory in Nexus but reduces the number of small db queries we
505516
// perform.
517+
//
518+
// The ordering of inserts among case child records (ereports, alert
519+
// requests) and case metadata doesn't matter: there are no foreign key
520+
// constraints between these tables, and garbage collection is keyed on
521+
// sitrep_id (which is inserted first above). If we crash partway
522+
// through, orphaned child records will be cleaned up when the orphaned
523+
// sitrep is garbage collected.
506524
let mut cases = Vec::with_capacity(sitrep.cases.len());
507525
let mut alerts_requested = Vec::new();
508526
let mut case_ereports = Vec::new();

nexus/src/app/background/tasks/fm_rendezvous.rs

Lines changed: 16 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -82,7 +82,19 @@ impl FmRendezvous {
8282

8383
// XXX(eliza): is it better to allocate all of these into a big array
8484
// and do a single `INSERT INTO` query, or iterate over them one by one
85-
// (not allocating) but insert one at a time?
85+
// (not allocating) but insert one at a time? Note that a batched insert
86+
// would need to use `ON CONFLICT DO NOTHING` rather than checking for
87+
// `Conflict` errors from individual inserts, since multiple Nexus
88+
// instances may run this task concurrently.
89+
//
90+
// Currently, these `alert_create` calls have no guard against a stale
91+
// Nexus inserting alerts from an outdated sitrep. This is fine for now
92+
// because alert requests are carried forward into newer sitreps, so a
93+
// stale insert is redundant rather than incorrect. However, if alerts
94+
// are ever hard-deleted (e.g. when a case is closed), a lagging Nexus
95+
// could re-create "zombie" alert records after deletion. At that point,
96+
// the INSERT should be guarded by a CTE that checks the sitrep
97+
// generation matches the current one.
8698
for (case_id, req) in sitrep.alerts_requested() {
8799
let &AlertRequest { id, class, requested_sitrep_id, .. } = req;
88100
status.total_alerts_requested += 1;
@@ -97,7 +109,9 @@ impl FmRendezvous {
97109
)
98110
.await
99111
{
100-
// Alert already exists, that's fine.
112+
// Alert already exists --- this is expected, since multiple
113+
// Nexus instances may run this task concurrently for the same
114+
// sitrep, or a previous activation may have partially completed.
101115
Err(Error::Conflict { .. }) => {}
102116
Err(e) => {
103117
slog::warn!(

0 commit comments

Comments
 (0)