[fm] always record the physical location of an ereport reporter #10096
Conversation
```sql
ON isa.hw_baseboard_id = isp.hw_baseboard_id
AND isa.inv_collection_id = isp.inv_collection_id
WHERE isa.hw_baseboard_id IS NOT NULL
ORDER BY isa.sled_id, isa.inv_collection_id DESC
```
You're ordering by UUIDs descending here, rather than, like, a generation number or time collected or something. Is that intentional?
```rust
const PORT_SETTINGS_ID_165_2: &str = "8b777d9b-62a3-4c4d-b0b7-314315c2a7fc";
const PORT_SETTINGS_ID_165_3: &str = "7c675e89-74b1-45da-9577-cf75f028107a";
const PORT_SETTINGS_ID_165_4: &str = "e2413d63-9307-4918-b9c4-bce959c63042";
const PORT_SETTINGS_ID_165_4: &str = "e2423d63-9307-4918-b9c4-bce959c63042";
```
This seems like it might be a bad merge? Why are we changing this UUID?
argh, i have no idea how that happened, i think the cat stepped on my keyboard or something. will fix
```sql
/* physical slot location of the reporter. */
slot_type omicron.public.sp_type NOT NULL,
slot INT4 NOT NULL,
```
The implication here is that "slot will always be known to the host OS, for all ereports it ever generates", right?
I'm totally on-board with backfilling this value - preferring it to be non-null - but I want to make sure I understand the implications of this constraint that it cannot be NULL. We'll always have the slot, under all host OS ereports we care about?
That was kind of the entire intent of the change, yes. Am I correct that if a sled-agent is known to Nexus enough to be able to send it an HTTP request, it should also exist in the inventory?
(for the record, the idea was that the location would be coming from Nexus when it's writing what it collected from the sled, not from the sled itself)
> Am I correct that if a sled-agent is known to Nexus enough to be able to send it an HTTP request, it should also exist in the inventory?
Should - yes. Guaranteed - no. Inventory is "best effort" collection (so can be pretty lossy), and is very non-atomic (collected over a period of several 10s of seconds, with several minutes between collections, in general). A couple examples where sled-agent could talk to Nexus but not be present in inventory:
- The Nexus that collected the most recent inventory collection(s) was (or still is!) partitioned off from the sled, but the sled can find and send ereports to a different Nexus.
- The sled was off or not present during the last inventory collection and sends ereports before a new collection has been made that sees it.
```sql
sled_id UUID,

/* physical slot location of the reporter. */
slot_type omicron.public.sp_type NOT NULL,
```
Should this type also be renamed to "slot_type"?
It's funny for a host OS to reference the "sp_type::sled", because we really are referring to "the sled", rather than "the sled's SP" in that case.
Also I respect that this might be a pain in the ass from a schema point-of-view, so, push back if this sucks too much
unfortunately, uh, this enum is used all over the place and i didn't really think it was prudent to do the amount of deleting and recreating columns it would take to rename the type...?
This is a large, and somewhat brutish, migration which attempts to correct my past lack of forethought in designing the `omicron.public.ereport` schema. In particular, a younger, dumber version of Eliza foolishly chose to make the physical location of the reporter in the rack (the `sp_type` and `sp_slot` columns) nullable, and only include them when the reporter is an SP, and not when it's a sled's host OS. This made a lot of people[^1] very unhappy, and is widely regarded as a bad move.
While SP reporters are uniquely indexed by `sp_type` and `sp_slot` (as they are the keys Nexus uses to request SP ereports from MGS), host OS ereports are identified by the sled UUID (as it's the primary key of the entry in the `sled` table through which Nexus will discover the address of the `sled-agent` that it asks for the sled's ereports). At the time, I thought that we would only need to hang onto the sled UUID, as we could always get the physical slot of the sled by going and doing some JOINs to look that up by sled UUID. However, it's much less pleasant to do that than I had anticipated, as turning a sled UUID into a slot requires looking up the `hw_baseboard_id` for the sled in the inventory's `inv_sled_agent` table, and then using the `hw_baseboard_id` in the `inv_service_processor` table, which actually knows the slot. This is a bit of a pain to do, and because old inventory collections and `sled` entries are deleted, we may no longer be able to find the slot for a sled UUID that references a sled that no longer exists. Thus, we really should have been recording the physical location in the ereport table if we want to be able to have it for historic ereports.
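The join chain described above can be sketched with a toy sqlite3 model. The table shapes here are simplified stand-ins for illustration, not the real omicron schema:

```python
# Toy sqlite3 model of the join chain: sled UUID -> hw_baseboard_id
# (via inv_sled_agent) -> slot (via inv_service_processor).
# Column sets are simplified stand-ins, not the actual omicron tables.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE inv_sled_agent (
    inv_collection_id TEXT,
    sled_id TEXT,
    hw_baseboard_id TEXT
);
CREATE TABLE inv_service_processor (
    inv_collection_id TEXT,
    hw_baseboard_id TEXT,
    sp_type TEXT,
    sp_slot INTEGER
);
""")
db.execute("INSERT INTO inv_sled_agent VALUES ('coll-1', 'sled-aaaa', 'bb-01')")
db.execute(
    "INSERT INTO inv_service_processor VALUES ('coll-1', 'bb-01', 'sled', 14)"
)

def slot_for_sled(sled_id):
    # Returns (sp_type, sp_slot) for a sled UUID, or None when the sled is
    # absent from inventory -- the lossy case discussed in the review above.
    return db.execute(
        """
        SELECT isp.sp_type, isp.sp_slot
        FROM inv_sled_agent isa
        JOIN inv_service_processor isp
          ON isa.hw_baseboard_id = isp.hw_baseboard_id
         AND isa.inv_collection_id = isp.inv_collection_id
        WHERE isa.sled_id = ?
        """,
        (sled_id,),
    ).fetchone()

print(slot_for_sled("sled-aaaa"))  # ('sled', 14)
print(slot_for_sled("sled-gone"))  # None: nothing to backfill from
```

The second lookup is the case where the migration has no choice but to delete the ereport.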
This PR rights these wrongs by replacing the nullable `sp_type` and `sp_slot` columns in the `ereport` table with non-null `slot_type` and `slot` columns (renamed to reflect that they are no longer specifically for SPs), and changing the `CHECK` constraints to permit host OS ereports to also have those columns. We attempt to backfill the slot for host OS ereports using the nasty join chain I described above. If we are unable to do this for a host OS ereport because it refers to a sled UUID that no longer exists in the inventory, we just delete it. This feels quite icky, but it's worth noting that, at the time of writing, we simply don't have any code for collecting ereports from the host OS into CRDB anyway, so there aren't actually going to be any ereports getting dropped here --- making an attempt to backfill them is really just an intellectual exercise, but it made me feel better.
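A toy illustration of what the non-null columns rule out, using sqlite3 in place of CRDB and a deliberately simplified `ereport` shape (this is not the real table definition):

```python
# Toy illustration (sqlite3, not CRDB; simplified columns) of the constraint
# change: slot_type and slot are NOT NULL for every reporter kind, so an
# ereport row with no physical location is rejected outright.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
CREATE TABLE ereport (
    id TEXT PRIMARY KEY,
    sled_id TEXT,            -- still nullable: only set for host OS reporters
    slot_type TEXT NOT NULL, -- renamed from sp_type
    slot INTEGER NOT NULL    -- renamed from sp_slot
)
""")

# A host OS ereport now carries its slot alongside the sled UUID.
db.execute("INSERT INTO ereport VALUES ('er-1', 'sled-aaaa', 'sled', 14)")

# An ereport with no physical location is no longer representable.
try:
    db.execute("INSERT INTO ereport VALUES ('er-2', 'sled-bbbb', NULL, NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```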
[^1]: Well... mostly just me.