Recurring firmware measurement mismatch at index 9 after VM reboot on GCP Spot instances (H100 + TDX) #132

## Summary

We are experiencing a **recurring firmware measurement mismatch at index 9** on H100 GPUs running in Confidential VMs (Intel TDX) on GCP. The issue occurs after VM reboots (both manual and GCP-initiated) and can **only be resolved by performing a full stop + start** of the VM instance from the GCP console.

This issue was previously discussed in #90, where @steven-bellock suggested opening a dedicated issue.

---

## Environment

| Component | Value |
|-----------|-------|
| Cloud provider | Google Cloud Platform (GCP) |
| VM type | `a3-highgpu-1g` (Spot instance) |
| Confidential computing | Intel TDX (`--confidential-compute-type=TDX`) |
| GPU | NVIDIA H100 80GB HBM3 (`GH100 A01 GSP BROM`) |
| Driver version | `580.126.09` (open kernel module, `nvidia-driver-580-server-open`) |
| VBIOS version | `96.00.CF.00.01` |
| Kernel | `6.17.0-1008-gcp` (Ubuntu 24.04) |
| nvattest version | `1.1.1.1770245582-1` |
| CC status | `ON` (Production mode, not DevTools) |
| Secure Boot | Enabled |
| Maintenance policy | `TERMINATE` |
| LKCA | Configured (`/etc/modprobe.d/nvidia-lkca.conf`) |
| Persistence mode | Enabled |

---

## Behavior

### Timeline of occurrences

| Date | Event | Attestation result | Fix applied |
|------|-------|--------------------|-------------|
| Feb 19, 2026 | VM reboot (GCP reallocation) | `measres: fail`, index 9, `result_code: 12` | Stop + Start from GCP console |
| Mar 1, 2026 17:19 UTC | VM started (fresh start) | **Passed** (`result_code: 0`) | — |
| Mar 2, 2026 12:01 UTC | VM rebooted (automatic) | `measres: fail`, index 9, `result_code: 12` | Pending |

### Reproducible pattern

1. After a fresh **stop + start** from GCP console → attestation **passes** consistently.
2. After a **reboot** (either `sudo reboot`, automatic GCP reboot, or host maintenance) → attestation **fails** with measurement mismatch at index 9.
3. Subsequent reboots **do not fix** the issue. Only a full stop + start resolves it.
4. We have a daily auto start/stop schedule on our Spot instances. The issue appears to be triggered when the VM reboots without a full hardware deallocation/reallocation cycle.

---

## Failing attestation output

**Command:**

```bash
nvattest attest --device gpu --verifier local --format json --nonce <64-char-hex-nonce>
```

**Result:**

```json
{
    "result_code": 12,
    "result_message": "Overall Attestation Result is False"
}
```
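
To record a pass/fail signal after every boot, the command above could be wrapped in a small script. The following is a minimal sketch, not part of nvattest. It assumes the CLI prints a single JSON document to stdout with top-level `result_code` and `result_message` fields, as in the output shown. The helper names `build_command`, `parse_result`, and `run_attestation` are hypothetical.

```python
import json
import secrets
import subprocess


def build_command(nonce: str) -> list[str]:
    """Build the nvattest invocation used in this report."""
    return [
        "nvattest", "attest",
        "--device", "gpu",
        "--verifier", "local",
        "--format", "json",
        "--nonce", nonce,
    ]


def parse_result(stdout: str) -> tuple[int, str]:
    """Extract (result_code, result_message) from the JSON output.

    Assumption: stdout is one JSON object with these top-level keys.
    """
    doc = json.loads(stdout)
    return int(doc["result_code"]), doc.get("result_message", "")


def run_attestation() -> bool:
    """Run one attestation with a fresh nonce; True means result_code == 0."""
    nonce = secrets.token_hex(32)  # 32 random bytes -> 64 hex characters
    proc = subprocess.run(build_command(nonce), capture_output=True, text=True)
    code, message = parse_result(proc.stdout)
    print(f"result_code={code} message={message!r}")
    return code == 0
```

Calling `run_attestation()` from a boot-time job and logging its result next to the boot type (reboot vs. stop + start) would give a per-boot record comparable to the timeline above.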

### Claims (relevant fields)

```json
{
    "hwmodel": "GH100 A01 GSP BROM",
    "measres": "fail",
    "oemid": "5703",
    "secboot": null,
    "dbgstat": null,
    "x-nvidia-gpu-driver-version": "580.126.09",
    "x-nvidia-gpu-vbios-version": "96.00.CF.00.01",
    "x-nvidia-gpu-attestation-report-signature-verified": true,
    "x-nvidia-gpu-attestation-report-nonce-match": true,
    "x-nvidia-gpu-driver-rim-version-match": true,
    "x-nvidia-gpu-driver-rim-signature-verified": true,
    "x-nvidia-gpu-vbios-rim-version-match": true,
    "x-nvidia-gpu-vbios-rim-signature-verified": true,
    "x-nvidia-gpu-arch-check": true,
    "x-nvidia-gpu-attestation-report-cert-chain-fwid-match": true,
    "x-nvidia-mismatch-measurement-records": [
        {
            "index": 9,
            "measurementSource": "Firmware",
            "goldenSize": 48,
            "goldenValue": "4b3ed0f834d10fef95e61615edc5b4e98ec78cff39323993b3218f0cd62507978cf64e4487520bc7e560fde71ea0fc75",
            "runtimeSize": 48,
            "runtimeValue": "c80a9b62ce0d41184bb1ad0f6334d9400a2d2514ef92003b1c043410f91b7309144325a3e01c58b8bd6e198f5dda3b9b"
        }
    ]
}
```

---

## Key observations

- **Only index 9 fails.** All other verification checks pass (signature, cert chain, RIM, nonce, driver/VBIOS version match).
- The `goldenValue` and `runtimeValue` are **identical across occurrences** (same values on Feb 19 and Mar 2), suggesting a deterministic mismatch rather than random corruption.
- `secboot` and `dbgstat` are both `null` (a known issue tracked as NVIDIA internal issue `5916701`, as mentioned by @steven-bellock in #90).
- All certificate chains are valid with `good` OCSP status.
- We cannot use [gpu-admin-tools](https://github.com/NVIDIA/gpu-admin-tools) to toggle CC mode on GCP, as GCP does not allow GPU reset from within the VM.
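
The first two observations can be checked mechanically against saved claims documents. The sketch below is a hypothetical helper, not part of nvattest. It assumes each claims document is a single JSON object using the field names from the claims excerpt above, and it treats every boolean `x-nvidia-gpu-*` claim as a verification check.

```python
def summarize_claims(claims: dict) -> dict:
    """Summarize which boolean checks failed and which measurements mismatched."""
    failed_checks = sorted(
        key for key, value in claims.items()
        if key.startswith("x-nvidia-gpu-") and value is False
    )
    records = claims.get("x-nvidia-mismatch-measurement-records") or []
    mismatches = []
    for rec in records:
        mismatches.append({
            "index": rec["index"],
            "source": rec.get("measurementSource"),
            "golden": rec["goldenValue"],
            "runtime": rec["runtimeValue"],
            # Each hex string should encode exactly the declared number of bytes.
            "sizes_consistent": (
                len(rec["goldenValue"]) == 2 * rec["goldenSize"]
                and len(rec["runtimeValue"]) == 2 * rec["runtimeSize"]
            ),
        })
    return {
        "measres": claims.get("measres"),
        "failed_checks": failed_checks,
        "mismatches": mismatches,
    }


def same_mismatch(a: dict, b: dict) -> bool:
    """True if two claims documents report identical mismatch records."""
    def key(claims: dict) -> list:
        return [
            (m["index"], m["golden"], m["runtime"])
            for m in summarize_claims(claims)["mismatches"]
        ]
    return key(a) == key(b)
```

Loading the Feb 19 and Mar 2 claims with `json.load` and passing them to `same_mismatch` would confirm that both occurrences report the same index and the same golden/runtime pair.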

---

## Configuration verification

We followed the [GCP Confidential VM with GPU guide](https://docs.cloud.google.com/confidential-computing/confidential-vm/docs/create-a-confidential-vm-instance-with-gpu) and verified all steps:

```bash
# CC mode
$ nvidia-smi conf-compute -f
CC status: ON

$ nvidia-smi conf-compute -e
CC Environment: PRODUCTION

# TDX
$ dmesg | grep tdx
tdx: Guest detected

# Secure Boot
$ mokutil --sb-state
SecureBoot enabled

# LKCA configured
$ cat /etc/modprobe.d/nvidia-lkca.conf
install nvidia /sbin/modprobe ecdsa_generic; /sbin/modprobe ecdh; /sbin/modprobe --ignore-install nvidia

# ecdh_generic is built into the kernel
$ modinfo ecdh_generic
filename: (builtin)

# Persistence mode
$ nvidia-smi --query-gpu=persistence_mode --format=csv,noheader
Enabled

# Host maintenance policy
$ curl -s -H "Metadata-Flavor: Google" \
    http://metadata.google.internal/computeMetadata/v1/instance/scheduling/on-host-maintenance
TERMINATE
```
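
Because the mismatch only appears after certain boots, it may help to capture these checks automatically alongside each attestation run. The following is a minimal sketch under that assumption. The commands and expected substrings are copied from the transcript above, while `CHECKS`, `shell`, `run_checks`, and the injectable `run` parameter are hypothetical names introduced for illustration.

```python
import subprocess
from typing import Callable

# (label, shell command, substring expected in its output), from the transcript above.
CHECKS = [
    ("cc_status", "nvidia-smi conf-compute -f", "CC status: ON"),
    ("cc_environment", "nvidia-smi conf-compute -e", "CC Environment: PRODUCTION"),
    ("secure_boot", "mokutil --sb-state", "SecureBoot enabled"),
    ("persistence_mode",
     "nvidia-smi --query-gpu=persistence_mode --format=csv,noheader",
     "Enabled"),
]


def shell(command: str) -> str:
    """Run a command through the shell and return stdout followed by stderr."""
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    return proc.stdout + proc.stderr


def run_checks(run: Callable[[str], str] = shell) -> dict[str, bool]:
    """Return {label: passed} for each configuration check."""
    return {label: expected in run(command) for label, command, expected in CHECKS}
```

Passing a custom `run` callable makes it possible to exercise `run_checks` without GPU hardware. On the VM itself, `run_checks()` with the default runner executes the real commands.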

---

## Questions

1. What does measurement index 9 (`Firmware`) correspond to? Is there documentation on the semantics of each measurement index?
2. Why does a **reboot** cause the runtime measurement to diverge from the golden value, while a **stop + start** does not?
3. Is this a known issue with GCP Spot instances and H100 GPUs in TDX mode?
4. Is there a workaround that does not require a full stop + start cycle?

---

## Related issues

- #90 — Original report of this issue (same `goldenValue`/`runtimeValue`, same index 9)
- #28 — Similar measurement mismatch report
