[fm] always record the physical location of an ereport reporter #10096

Open
hawkw wants to merge 11 commits into main from eliza/ereport-refactor

@hawkw hawkw commented Mar 18, 2026

This is a large, and somewhat brutish, migration which attempts to
correct my past lack of forethought in designing the
omicron.public.ereport schema. In particular, a younger, dumber
version of Eliza foolishly chose to make the physical location of the
reporter in the rack (the sp_type and sp_slot columns) nullable, and
only include them when the reporter is an SP, and not when it's a
sled's host OS. This made a lot of people¹ very unhappy, and is
widely regarded as a bad move.

While SP reporters are uniquely indexed by the sp_type and sp_slot
(as they are the keys Nexus uses to request SP ereports from MGS), host
OS ereports are identified by the sled UUID (as it's the primary key of
the entry in the sled table through which Nexus will discover the
address of the sled-agent that it asks for the sled's ereports). At
the time, I thought that we would only need to hang onto the sled UUID,
as we could always get the physical slot of the sled by going and doing
some JOINs to look that up by sled UUID. However, it's much less
pleasant to do that than I had anticipated, as turning a sled UUID into
a slot requires looking up the hw_baseboard_id for the sled in the
inventory's inv_sled_agent table, and then using the hw_baseboard_id
in the inv_service_processor table, which actually knows the slot.
This is a bit of a pain to do, and because old inventory collections and
sled entries are deleted, we may no longer be able to find the slot
for a sled UUID that references a sled that no longer exists. Thus, we
really should have been recording the physical location in the ereport
table if we want to be able to have it for historic ereports.
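To make the lookup chain above concrete, here is a minimal in-memory sketch of it. The `Inventory` struct, its field names, and the `u128`/`i32` stand-ins for UUIDs and slots are assumptions for illustration only; the real lookup is a SQL join over inv_sled_agent and inv_service_processor. The point is that every hop can fail, so the result is necessarily optional.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real UUID-typed IDs.
type SledId = u128;
type BaseboardId = u128;

/// Simplified, assumed model of the two inventory tables in the join chain.
struct Inventory {
    /// inv_sled_agent: sled UUID -> hw_baseboard_id (which may be NULL).
    sled_agents: HashMap<SledId, Option<BaseboardId>>,
    /// inv_service_processor: hw_baseboard_id -> physical slot.
    service_processors: HashMap<BaseboardId, i32>,
}

impl Inventory {
    /// Resolve a sled UUID to its physical slot.
    /// Returns `None` if the sled is absent from the inventory, has no
    /// baseboard ID recorded, or has no matching service processor row.
    fn slot_for_sled(&self, sled_id: SledId) -> Option<i32> {
        let baseboard = self.sled_agents.get(&sled_id).copied().flatten()?;
        self.service_processors.get(&baseboard).copied()
    }
}
```

Once old collections or sled entries are deleted, the first `get` has nothing to find, which is the case that motivates storing the slot on the ereport itself.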

This PR rights these wrongs by replacing the nullable sp_type and
sp_slot columns in the ereport table with non-null slot_type and
slot columns (renamed to reflect that they are no longer specifically
for SPs), and changing the CHECK constraints to permit host OS
ereports to also have those columns. We attempt to backfill the slot for
host OS ereports using the nasty join chain I described above. If we are
unable to do this for a host OS ereport because it refers to a sled UUID
that no longer exists in the inventory, we just delete it. This feels
quite icky, but it's worth noting that, at the time of writing, we simply
don't have any code for collecting ereports from the host OS into CRDB
anyway, so no real ereports are going to get dropped here. Making an
attempt to backfill them is really just an intellectual exercise, but it
made me feel better.
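The backfill-or-delete rule described above could be sketched roughly as follows. This is an illustrative model, not the migration itself (which is SQL): `OldReporter`, `NewEreport`, `backfill`, and the `slots_by_sled` map are hypothetical names, UUIDs are modeled as `u128`, and the `Switch` and `Power` variants of `SlotType` are assumed.

```rust
use std::collections::HashMap;

/// Assumed mirror of the sp_type enum; only `Sled` appears in the discussion.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SlotType {
    Sled,
    Switch,
    Power,
}

/// Reporter identity under the old schema.
#[derive(Debug, Clone, PartialEq)]
enum OldReporter {
    /// SP reporters already carried their physical location.
    Sp { slot_type: SlotType, slot: i32 },
    /// Host OS reporters only carried the sled UUID.
    HostOs { sled_id: u128 },
}

/// Row under the new schema: the location is always present.
#[derive(Debug, Clone, PartialEq)]
struct NewEreport {
    slot_type: SlotType,
    slot: i32,
    sled_id: Option<u128>,
}

/// Backfill host OS rows from a sled -> slot map, dropping rows whose
/// sled can no longer be resolved.
fn backfill(old: Vec<OldReporter>, slots_by_sled: &HashMap<u128, i32>) -> Vec<NewEreport> {
    old.into_iter()
        .filter_map(|r| match r {
            OldReporter::Sp { slot_type, slot } => {
                Some(NewEreport { slot_type, slot, sled_id: None })
            }
            OldReporter::HostOs { sled_id } => slots_by_sled.get(&sled_id).map(|&slot| {
                NewEreport { slot_type: SlotType::Sled, slot, sled_id: Some(sled_id) }
            }),
        })
        .collect()
}
```

SP rows pass through unchanged, host OS rows gain a location when one can be found, and unresolvable host OS rows are the only ones that disappear.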

Footnotes

  1. Well...mostly just me.

@hawkw hawkw changed the title from "[wip] always record the physical location of an ereport reporter" to "[fm] always record the physical location of an ereport reporter" Mar 19, 2026
@hawkw hawkw marked this pull request as ready for review March 19, 2026 17:20
@hawkw hawkw requested review from mergeconflict and smklein March 19, 2026 17:21
@hawkw hawkw self-assigned this Mar 19, 2026
@hawkw hawkw added the fault-management label ("Everything related to the fault-management initiative (RFD480 and others)") Mar 19, 2026
ON isa.hw_baseboard_id = isp.hw_baseboard_id
AND isa.inv_collection_id = isp.inv_collection_id
WHERE isa.hw_baseboard_id IS NOT NULL
ORDER BY isa.sled_id, isa.inv_collection_id DESC
Collaborator

You're ordering by UUIDs descending here, rather than, like, a generation number or time collected or something. Is that intentional?
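A small sketch of the concern raised here, assuming collection IDs are random UUIDs (modeled as `u128`) and that each collection has some timestamp to sort on. The `Collection` struct and its `time_started` field are hypothetical names for illustration. With random IDs, the largest ID need not belong to the most recent collection.

```rust
/// Hypothetical, simplified inventory collection record.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Collection {
    /// Stand-in for a random UUID.
    id: u128,
    /// Assumed collection timestamp, in seconds.
    time_started: u64,
}

/// What "ORDER BY id DESC, take the first row" selects.
fn latest_by_id(collections: &[Collection]) -> Option<Collection> {
    collections.iter().copied().max_by_key(|c| c.id)
}

/// What ordering by collection time would select instead.
fn latest_by_time(collections: &[Collection]) -> Option<Collection> {
    collections.iter().copied().max_by_key(|c| c.time_started)
}
```

If an older collection happens to have the larger ID, the two functions disagree, so ordering by ID only picks an arbitrary (though deterministic) collection rather than the newest one.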

  const PORT_SETTINGS_ID_165_2: &str = "8b777d9b-62a3-4c4d-b0b7-314315c2a7fc";
  const PORT_SETTINGS_ID_165_3: &str = "7c675e89-74b1-45da-9577-cf75f028107a";
- const PORT_SETTINGS_ID_165_4: &str = "e2413d63-9307-4918-b9c4-bce959c63042";
+ const PORT_SETTINGS_ID_165_4: &str = "e2423d63-9307-4918-b9c4-bce959c63042";
Collaborator

This seems like it might be a bad merge? Why are we changing this UUID?

Member Author

argh, i have no idea how that happened, i think the cat stepped on my keyboard or something. will fix


/* physical slot location of the reporter. */
slot_type omicron.public.sp_type NOT NULL,
slot INT4 NOT NULL,
Collaborator

The implication here is that "slot will always be known to the host OS, for all ereports it ever generates", right?

I'm totally on-board with backfilling this value - preferring it to be non-null - but I want to make sure I understand the implications of this constraint that it cannot be NULL. We'll always have the slot, under all host OS ereports we care about?

Member Author

That was kind of the entire intent of the change, yes. Am I correct that if a sled-agent is known to Nexus enough to be able to send it an HTTP request, it should also exist in the inventory?

Member Author

(for the record, the idea was that the location would be coming from Nexus when it's writing what it collected from the sled, not from the sled itself)

Contributor

> Am I correct that if a sled-agent is known to Nexus enough to be able to send it an HTTP request, it should also exist in the inventory?

Should - yes. Guaranteed - no. Inventory is "best effort" collection (so can be pretty lossy), and is very non-atomic (collected over a period of several tens of seconds, with several minutes between collections, in general). A couple of examples where sled-agent could talk to Nexus but not be present in inventory:

  • The Nexus that collected the most recent inventory collection(s) was (or still is!) partitioned off from the sled, but the sled can find and send ereports to a different Nexus.
  • The sled was off or not present during the last inventory collection and sends ereports before a new collection has been made that sees it.

sled_id UUID,

/* physical slot location of the reporter. */
slot_type omicron.public.sp_type NOT NULL,
Collaborator

Should this type also be renamed to "slot_type"?

It's funny for a host OS to reference the "sp_type::sled", because we really are referring to "the sled", rather than "the sled's SP" in that case.

Collaborator

Also I respect that this might be a pain in the ass from a schema point-of-view, so, push back if this sucks too much

Member Author

unfortunately, uh, this enum is used all over the place and i didn't really think it was prudent to do the amount of deleting and recreating columns it would take to rename the type...?
