Description
My instance on dogfood was stuck in starting.
Looking in the propolis log, I found it was waiting for one of the downstairs to come online after being told to activate:
21:24:22.961Z INFO propolis-server (vm_state_driver): connecting to [fd00:1122:3344:106::d]:19033
= downstairs
client = 1
session_id = 303bf55e-2b8e-44a5-a521-c8eac984801a
21:24:22.961Z WARN propolis-server (vm_state_driver): ds_connection connect to [fd00:1122:3344:106::d]:19033 failure: Os { code: 146, kind: ConnectionRefused, message: "Connection refused" }
= downstairs
client = 1
session_id = 303bf55e-2b8e-44a5-a521-c8eac984801a
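When the log is long, the refused addresses can be pulled out mechanically. This is a minimal sketch that assumes the WARN line has the shape shown above and is not wrapped; `refused_downstairs` is a hypothetical helper name, not part of propolis or crucible.

```shell
# Print each distinct downstairs address that refused a connection.
# Reads propolis-server log text on stdin.
refused_downstairs() {
    grep 'ConnectionRefused' |
        # Keep only the "[addr]:port" that follows "connect to".
        sed -n 's/.*connect to \(\[[0-9a-fA-F:]*]:[0-9]*\) failure.*/\1/p' |
        sort -u
}
```

Feeding it the propolis log should leave just the socket addresses to chase down, one per line.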
There is only one disk on my instance, so it was easy enough to find its ID.
Dumping info about it tells me what the region IDs are, which sled contains each of them, and the IPs of all the downstairs:
root@oxz_switch0:~# omdb db disks info bfa68c2e-ad75-4793-88eb-05c48527f567
HOST_SERIAL DISK_NAME INSTANCE_NAME PROPOLIS_ZONE VOLUME_ID DISK_STATE IMPORT_ADDRESS READONLY
BRM42220031 nightlydebug-btrium-image-1190c3 nightlydebug oxz_propolis-server_9c447451-23b0-405e-b705-1cce40b4427e b5f12294-e414-4308-a959-0367b846664b attached - false
HOST_SERIAL REGION DATASET PHYSICAL_DISK
BRM27230045 8915bf83-b1e8-4782-8526-31530f6a4b79 7cdb3c11-f792-4195-9152-16a948fb271e 1f938f43-7003-4c9b-b961-0b9c1a7fecbe
BRM13250012 bc095a4b-3b8b-47fe-a660-b380a55c7b74 8aaf6a36-a2fc-4ce8-9372-888bc20854b1 4b75f4a4-22c6-4391-b0e1-b1c101f7ebbb
BRM42220006 2b3ca602-dc02-4abb-97d5-67eb2e3b2bbe fffddf56-10ca-4b62-9be3-5b3764a5f682 7e6e461b-c246-4b88-bcc8-28f0d9f5495e
VCR from volume ID b5f12294-e414-4308-a959-0367b846664b
ID BS SUB_VOLUMES READ_ONLY_PARENT
bfa68c2e-ad75-4793-88eb-05c48527f567 512 1 false
SUB VOLUME 0
ID BS BPE EC GENERATION READ_ONLY
bfa68c2e-ad75-4793-88eb-05c48527f567 512 131072 4016 10 false
[fd00:1122:3344:129::22]:19004
[fd00:1122:3344:106::d]:19033
[fd00:1122:3344:127::29]:19023
I know my downstairs has this IP: [fd00:1122:3344:106::d], so I just need to figure out which sled has fd00:1122:3344:106 and then which crucible region of the three above matches.
root@oxz_switch0:~# pilot host exec -c "ipadm | grep fd00 | grep sled6" 0-31
2 BRM22250001 ok: underlay0/sled6 static ok fd00:1122:3344:128::1/64
3 BRM13250012 ok: underlay0/sled6 static ok fd00:1122:3344:129::1/64
7 BRM27230045 ok: underlay0/sled6 static ok fd00:1122:3344:127::1/64
8 BRM44220011 ok: underlay0/sled6 static ok fd00:1122:3344:103::1/64
9 BRM44220005 ok: underlay0/sled6 static ok fd00:1122:3344:105::1/64
10 BRM42220009 ok: underlay0/sled6 static ok fd00:1122:3344:107::1/64
11 BRM42220006 ok: underlay0/sled6 static ok fd00:1122:3344:106::1/64
12 BRM42220057 ok: underlay0/sled6 static ok fd00:1122:3344:104::1/64
14 BRM42220051 ok: underlay0/sled6 static ok fd00:1122:3344:10b::1/64
16 BRM42220014 ok: underlay0/sled6 static ok fd00:1122:3344:108::1/64
17 BRM42220017 ok: underlay0/sled6 static ok fd00:1122:3344:109::1/64
21 BRM42220031 ok: underlay0/sled6 static ok fd00:1122:3344:102::1/64
23 BRM42220016 ok: underlay0/sled6 static ok fd00:1122:3344:10a::1/64
25 BRM44220010 ok: underlay0/sled6 static ok fd00:1122:3344:101::1/64
So sled 11, BRM42220006, has our bad downstairs. This is region 2b3ca602-dc02-4abb-97d5-67eb2e3b2bbe.
Sled to region lookup:
03 BRM13250012 bc095a4b-3b8b-47fe-a660-b380a55c7b74
07 BRM27230045 8915bf83-b1e8-4782-8526-31530f6a4b79
11 BRM42220006 2b3ca602-dc02-4abb-97d5-67eb2e3b2bbe <------ problem
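The prefix match above can be scripted instead of done by eye. This is a rough sketch: `prefix64` and `sled_for_addr` are hypothetical helper names, and it assumes the first four groups of each address are written out in full, as they are in this output (an address with "::" inside the first 64 bits is not handled).

```shell
# Print the first four groups of an IPv6 address, i.e. its /64 prefix.
# Accepts a bare address or the "[addr]:port" form from the log.
prefix64() {
    printf '%s\n' "$1" | tr -d '[]' | cut -d: -f1-4
}

# Read lines of the form "cubby serial ... address/len" on stdin and print
# "cubby serial" for the sled whose underlay address shares the /64 of $1.
sled_for_addr() {
    want=$(prefix64 "$1")
    awk -v want="$want" '{
        split($NF, g, ":")    # last field, e.g. fd00:1122:3344:106::1/64
        if ((g[1] ":" g[2] ":" g[3] ":" g[4]) == want) print $1, $2
    }'
}

# Usage, with the command from the text:
#   pilot host exec -c "ipadm | grep fd00 | grep sled6" 0-31 |
#       sled_for_addr '[fd00:1122:3344:106::d]:19033'
```

The serial it prints can then be matched against the HOST_SERIAL column of the region table from omdb.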
On sled 11, we can find our zone a few ways, but this one works by searching for the IP:
BRM42220006 # for zzz in $(zoneadm list | grep crucible); do echo -n "$zzz "; zlogin $zzz ipadm | grep fd00; done
oxz_crucible_0022703b-dcfc-44d4-897a-b42f6f53b433 oxControlService13/omicron6 static ok fd00:1122:3344:106::c/64
oxz_crucible_12afe1c3-bfe6-4278-8240-91d401347d36 oxControlService14/omicron6 static ok fd00:1122:3344:106::8/64
oxz_crucible_46d1afcc-cc3f-4b17-aafc-054dd4862d15 oxControlService15/omicron6 static ok fd00:1122:3344:106::5/64
oxz_crucible_605be8b9-c652-4a5f-94ca-068ec7a39472 oxControlService17/omicron6 static ok fd00:1122:3344:106::a/64
oxz_crucible_65b3db59-9361-4100-9cee-04e32a8c67d3 oxControlService18/omicron6 static ok fd00:1122:3344:106::7/64
oxz_crucible_9b8194ee-917d-4abc-a55c-94cea6cdaea1 oxControlService20/omicron6 static ok fd00:1122:3344:106::6/64
oxz_crucible_af8a8712-457c-4ea7-a8b6-aecb04761c1b oxControlService21/omicron6 static ok fd00:1122:3344:106::9/64
oxz_crucible_b369e133-485c-4d98-8fee-83542d1fd94d oxControlService22/omicron6 static ok fd00:1122:3344:106::4/64
oxz_crucible_c33b5912-9985-43ed-98f2-41297e2b796a oxControlService23/omicron6 static ok fd00:1122:3344:106::b/64
oxz_crucible_pantry_e9ea27c2-600a-45fb-9224-36eca95d87b6 oxControlService24/omicron6 static ok fd00:1122:3344:106::74/64
oxz_crucible_fffddf56-10ca-4b62-9be3-5b3764a5f682 oxControlService25/omicron6 static ok fd00:1122:3344:106::d/64 <---- our IP
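With this many zones, an exact match on the address is safer than scanning the list or using a plain grep: a substring search for 106::7, for example, would also hit the pantry zone at 106::74. A small sketch, where `zone_for_ip` is a hypothetical helper that reads the output of the loop above:

```shell
# Read "zone interface type state address/len" lines on stdin and print the
# zone whose address is exactly $1 (the prefix length is ignored).
zone_for_ip() {
    awk -v ip="$1" '{
        addr = $NF
        sub(/\/[0-9]+$/, "", addr)    # drop the "/64" suffix
        if (addr == ip) print $1
    }'
}

# Usage: pipe the for-loop from the text into
#   zone_for_ip fd00:1122:3344:106::d
```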
Inside the zone, let's find the logs from that downstairs:
zlogin oxz_crucible_fffddf56-10ca-4b62-9be3-5b3764a5f682
root@oxz_crucible_fffddf56:~# svcs | grep 2b3ca602-dc02-4abb-97d5-67eb2e3b2bbe
online 21:35:00 svc:/oxide/crucible/downstairs:downstairs-2b3ca602-dc02-4abb-97d5-67eb2e3b2bbe
root@oxz_crucible_fffddf56:~# tail -f $(svcs -L svc:/oxide/crucible/downstairs:downstairs-2b3ca602-dc02-4abb-97d5-67eb2e3b2bbe)
[ Mar 13 21:35:10 Executing start method ("/opt/oxide/lib/svc/manifest/crucible/downstairs.sh"). ]
{"msg":"Opened existing region file \"/data/regions/2b3ca602-dc02-4abb-97d5-67eb2e3b2bbe/region.json\"","v":0,"name":"crucible","level":30,"time":"2026-03-13T21:35:10.992893765Z","hostname":"oxz_crucible_fffddf56-10ca-4b62-9be3-5b3764a5f682","pid":4923}
{"msg":"Database read version 1","v":0,"name":"crucible","level":30,"time":"2026-03-13T21:35:10.993166652Z","hostname":"oxz_crucible_fffddf56-10ca-4b62-9be3-5b3764a5f682","pid":4923}
{"msg":"Database write version 1","v":0,"name":"crucible","level":30,"time":"2026-03-13T21:35:10.993178104Z","hostname":"oxz_crucible_fffddf56-10ca-4b62-9be3-5b3764a5f682","pid":4923}
Error: missing context slot for block 58607 in extent 2799
[ Mar 13 21:35:12 Stopping because service exited with an error. ]
And there is our problem, a block with no context slot: Error: missing context slot for block 58607 in extent 2799
This will cause the downstairs service to exit (and restart) and will prevent the upstairs from ever connecting to it.
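Because the service keeps restarting, the same error repeats in the log. To check whether more than one extent is affected, the error lines can be reduced to distinct (extent, block) pairs. This is a sketch that assumes the message format shown above; `missing_context_slots` is a hypothetical helper name.

```shell
# Read downstairs log text on stdin and print "extent block" for every
# "missing context slot" error found.
missing_context_slots() {
    sed -n 's/.*missing context slot for block \([0-9][0-9]*\) in extent \([0-9][0-9]*\).*/\2 \1/p'
}

# Usage, inside the crucible zone:
#   missing_context_slots < "$(svcs -L svc:/oxide/crucible/downstairs:downstairs-2b3ca602-dc02-4abb-97d5-67eb2e3b2bbe)" | sort -u
```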