A downstairs region reports missing context slot #1906

@leftwo

My instance on dogfood was stuck in starting.

Looking in the propolis log, I found it was waiting for one of the downstairs to come online after being told to activate:

21:24:22.961Z INFO propolis-server (vm_state_driver): connecting to [fd00:1122:3344:106::d]:19033
     = downstairs
    client = 1
    session_id = 303bf55e-2b8e-44a5-a521-c8eac984801a
21:24:22.961Z WARN propolis-server (vm_state_driver): ds_connection connect to [fd00:1122:3344:106::d]:19033 failure: Os { code: 146, kind: ConnectionRefused, message: "Connection refused" }
     = downstairs
    client = 1
    session_id = 303bf55e-2b8e-44a5-a521-c8eac984801a

There is only one disk on my instance, so it was easy enough to find its ID.
Dumping info about it tells me what the region IDs are, which sled contains each of them, and the IPs of all the downstairs:

root@oxz_switch0:~# omdb db disks info bfa68c2e-ad75-4793-88eb-05c48527f567
HOST_SERIAL DISK_NAME                        INSTANCE_NAME PROPOLIS_ZONE                                            VOLUME_ID                            DISK_STATE IMPORT_ADDRESS READONLY
BRM42220031 nightlydebug-btrium-image-1190c3 nightlydebug  oxz_propolis-server_9c447451-23b0-405e-b705-1cce40b4427e b5f12294-e414-4308-a959-0367b846664b attached   -              false
HOST_SERIAL REGION                               DATASET                              PHYSICAL_DISK
BRM27230045 8915bf83-b1e8-4782-8526-31530f6a4b79 7cdb3c11-f792-4195-9152-16a948fb271e 1f938f43-7003-4c9b-b961-0b9c1a7fecbe
BRM13250012 bc095a4b-3b8b-47fe-a660-b380a55c7b74 8aaf6a36-a2fc-4ce8-9372-888bc20854b1 4b75f4a4-22c6-4391-b0e1-b1c101f7ebbb
BRM42220006 2b3ca602-dc02-4abb-97d5-67eb2e3b2bbe fffddf56-10ca-4b62-9be3-5b3764a5f682 7e6e461b-c246-4b88-bcc8-28f0d9f5495e
VCR from volume ID b5f12294-e414-4308-a959-0367b846664b
ID                                   BS  SUB_VOLUMES READ_ONLY_PARENT
bfa68c2e-ad75-4793-88eb-05c48527f567 512 1           false

SUB VOLUME 0
    ID                                   BS  BPE    EC   GENERATION READ_ONLY 
    bfa68c2e-ad75-4793-88eb-05c48527f567 512 131072 4016 10         false
    [fd00:1122:3344:129::22]:19004
    [fd00:1122:3344:106::d]:19033
    [fd00:1122:3344:127::29]:19023

I know my downstairs has this IP: [fd00:1122:3344:106::d], so I just need to figure out which sled has fd00:1122:3344:106 and then which crucible region of the three above matches.

root@oxz_switch0:~# pilot host exec -c "ipadm | grep fd00 | grep sled6" 0-31
 2  BRM22250001        ok: underlay0/sled6   static   ok           fd00:1122:3344:128::1/64
 3  BRM13250012        ok: underlay0/sled6   static   ok           fd00:1122:3344:129::1/64
 7  BRM27230045        ok: underlay0/sled6   static   ok           fd00:1122:3344:127::1/64
 8  BRM44220011        ok: underlay0/sled6   static   ok           fd00:1122:3344:103::1/64
 9  BRM44220005        ok: underlay0/sled6   static   ok           fd00:1122:3344:105::1/64
10  BRM42220009        ok: underlay0/sled6   static   ok           fd00:1122:3344:107::1/64
11  BRM42220006        ok: underlay0/sled6   static   ok           fd00:1122:3344:106::1/64
12  BRM42220057        ok: underlay0/sled6   static   ok           fd00:1122:3344:104::1/64
14  BRM42220051        ok: underlay0/sled6   static   ok           fd00:1122:3344:10b::1/64
16  BRM42220014        ok: underlay0/sled6   static   ok           fd00:1122:3344:108::1/64
17  BRM42220017        ok: underlay0/sled6   static   ok           fd00:1122:3344:109::1/64
21  BRM42220031        ok: underlay0/sled6   static   ok           fd00:1122:3344:102::1/64
23  BRM42220016        ok: underlay0/sled6   static   ok           fd00:1122:3344:10a::1/64
25  BRM44220010        ok: underlay0/sled6   static   ok           fd00:1122:3344:101::1/64

So, sled 11 (BRM42220006) has our bad downstairs. This is region 2b3ca602-dc02-4abb-97d5-67eb2e3b2bbe.
Sled to region lookup:

03 BRM13250012 bc095a4b-3b8b-47fe-a660-b380a55c7b74 
07 BRM27230045 8915bf83-b1e8-4782-8526-31530f6a4b79 
11 BRM42220006 2b3ca602-dc02-4abb-97d5-67eb2e3b2bbe <------ problem
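
The address-to-sled step can also be scripted. Below is a minimal sketch that assumes the `pilot host exec` output shown earlier has been saved to a file and is fed in on stdin. The function name `find_sled` is hypothetical, and the prefix handling assumes the first four address groups are written out in full, which holds for the addresses in this issue.

```shell
#!/bin/sh
# Hypothetical helper: print the sled number and serial whose underlay /64
# contains the given downstairs address. Reads saved ipadm output on stdin,
# in the column layout shown above (address is the last field).
find_sled() {
    addr="$1"
    # "[fd00:1122:3344:106::d]:19033" -> "fd00:1122:3344:106::d"
    addr="${addr#\[}"
    addr="${addr%%\]*}"
    # Keep the first four groups as the /64 prefix (assumes no "::"
    # shorthand inside the first 64 bits).
    prefix="$(echo "$addr" | cut -d: -f1-4)"
    # Exact comparison against the sled's ::1/64 address, so that a
    # prefix such as ...:10 cannot match ...:10b by accident.
    awk -v p="${prefix}::1/64" '$NF == p { print $1, $2 }'
}
```

For example, `find_sled '[fd00:1122:3344:106::d]:19033' < sleds.txt` should print the sled number and serial for the `106` subnet, where `sleds.txt` is an assumed name for the saved output.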

On sled 11, we can find our zone a few ways, but this one works by searching for the IP:

BRM42220006 # for zzz in $(zoneadm list | grep crucible); do  echo -n "$zzz "; zlogin $zzz ipadm | grep fd00; done
oxz_crucible_0022703b-dcfc-44d4-897a-b42f6f53b433 oxControlService13/omicron6 static ok   fd00:1122:3344:106::c/64
oxz_crucible_12afe1c3-bfe6-4278-8240-91d401347d36 oxControlService14/omicron6 static ok   fd00:1122:3344:106::8/64
oxz_crucible_46d1afcc-cc3f-4b17-aafc-054dd4862d15 oxControlService15/omicron6 static ok   fd00:1122:3344:106::5/64
oxz_crucible_605be8b9-c652-4a5f-94ca-068ec7a39472 oxControlService17/omicron6 static ok   fd00:1122:3344:106::a/64
oxz_crucible_65b3db59-9361-4100-9cee-04e32a8c67d3 oxControlService18/omicron6 static ok   fd00:1122:3344:106::7/64
oxz_crucible_9b8194ee-917d-4abc-a55c-94cea6cdaea1 oxControlService20/omicron6 static ok   fd00:1122:3344:106::6/64
oxz_crucible_af8a8712-457c-4ea7-a8b6-aecb04761c1b oxControlService21/omicron6 static ok   fd00:1122:3344:106::9/64
oxz_crucible_b369e133-485c-4d98-8fee-83542d1fd94d oxControlService22/omicron6 static ok   fd00:1122:3344:106::4/64
oxz_crucible_c33b5912-9985-43ed-98f2-41297e2b796a oxControlService23/omicron6 static ok   fd00:1122:3344:106::b/64
oxz_crucible_pantry_e9ea27c2-600a-45fb-9224-36eca95d87b6 oxControlService24/omicron6 static ok   fd00:1122:3344:106::74/64
oxz_crucible_fffddf56-10ca-4b62-9be3-5b3764a5f682 oxControlService25/omicron6 static ok   fd00:1122:3344:106::d/64  <---- our IP

Inside the zone, let's find the logs from that downstairs:

zlogin oxz_crucible_fffddf56-10ca-4b62-9be3-5b3764a5f682
root@oxz_crucible_fffddf56:~# svcs | grep 2b3ca602-dc02-4abb-97d5-67eb2e3b2bbe
online         21:35:00 svc:/oxide/crucible/downstairs:downstairs-2b3ca602-dc02-4abb-97d5-67eb2e3b2bbe
root@oxz_crucible_fffddf56:~# tail -f $(svcs -L svc:/oxide/crucible/downstairs:downstairs-2b3ca602-dc02-4abb-97d5-67eb2e3b2bbe)
[ Mar 13 21:35:10 Executing start method ("/opt/oxide/lib/svc/manifest/crucible/downstairs.sh"). ]
{"msg":"Opened existing region file \"/data/regions/2b3ca602-dc02-4abb-97d5-67eb2e3b2bbe/region.json\"","v":0,"name":"crucible","level":30,"time":"2026-03-13T21:35:10.992893765Z","hostname":"oxz_crucible_fffddf56-10ca-4b62-9be3-5b3764a5f682","pid":4923}
{"msg":"Database read version 1","v":0,"name":"crucible","level":30,"time":"2026-03-13T21:35:10.993166652Z","hostname":"oxz_crucible_fffddf56-10ca-4b62-9be3-5b3764a5f682","pid":4923}
{"msg":"Database write version 1","v":0,"name":"crucible","level":30,"time":"2026-03-13T21:35:10.993178104Z","hostname":"oxz_crucible_fffddf56-10ca-4b62-9be3-5b3764a5f682","pid":4923}
Error: missing context slot for block 58607 in extent 2799
[ Mar 13 21:35:12 Stopping because service exited with an error. ]

And there is our problem, a block with no context slot: Error: missing context slot for block 58607 in extent 2799

This will cause the downstairs service to exit (and restart) and will prevent the upstairs from ever connecting to it.
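
To check other downstairs logs for the same failure, something like the following sketch could be run against a saved copy of the SMF log. All three function names are hypothetical. The `region_block` arithmetic rests on two assumptions that this issue does not confirm: that the block number in the error is relative to its extent, and that the BPE column in the `omdb` output means blocks per extent.

```shell
#!/bin/sh
# Sketch only: summarize a downstairs SMF log read from stdin.

# Print "block=N extent=M" for every missing-context-slot error.
missing_slots() {
    sed -n 's/^Error: missing context slot for block \([0-9][0-9]*\) in extent \([0-9][0-9]*\).*$/block=\1 extent=\2/p'
}

# Count how many times SMF ran the start method, as a rough restart count.
start_count() {
    grep -c 'Executing start method'
}

# ASSUMPTION: block is extent-relative and bpe is blocks per extent.
# Under that assumption, print the block's index from the start of the region.
region_block() {
    block="$1"
    extent="$2"
    bpe="$3"
    echo $(( extent * bpe + block ))
}
```

Inside the zone, the log path comes from `svcs -L` as used above, so `missing_slots < "$(svcs -L <fmri>)"` would list every affected block, with `<fmri>` standing in for the downstairs service name.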
