Recurring firmware measurement mismatch at index 9 after VM reboot on GCP Spot instances (H100 + TDX) #132

@lripoll96

Summary

We are experiencing a recurring firmware measurement mismatch at index 9 on H100 GPUs running in Confidential VMs (Intel TDX) on GCP. The issue occurs after VM reboots (both manual and GCP-initiated) and can only be resolved by performing a full stop + start of the VM instance from the GCP console.

This issue was previously discussed in #90, where @steven-bellock suggested opening a dedicated issue.


Environment

| Component | Value |
| --- | --- |
| Cloud provider | Google Cloud Platform (GCP) |
| VM type | a3-highgpu-1g (Spot instance) |
| Confidential computing | Intel TDX (--confidential-compute-type=TDX) |
| GPU | NVIDIA H100 80GB HBM3 (GH100 A01 GSP BROM) |
| Driver version | 580.126.09 (open kernel module, nvidia-driver-580-server-open) |
| VBIOS version | 96.00.CF.00.01 |
| Kernel | 6.17.0-1008-gcp (Ubuntu 24.04) |
| nvattest version | 1.1.1.1770245582-1 |
| CC status | ON (Production mode, not DevTools) |
| Secure Boot | Enabled |
| Maintenance policy | TERMINATE |
| LKCA | Configured (/etc/modprobe.d/nvidia-lkca.conf) |
| Persistence mode | Enabled |

Behavior

Timeline of occurrences

| Date | Event | Attestation result | Fix applied |
| --- | --- | --- | --- |
| Feb 19, 2026 | VM reboot (GCP reallocation) | measres: fail, index 9, result_code: 12 | Stop + Start from GCP console |
| Mar 1, 2026 17:19 UTC | VM started (fresh start) | Passed (result_code: 0) | |
| Mar 2, 2026 12:01 UTC | VM rebooted (automatic) | measres: fail, index 9, result_code: 12 | Pending |

Reproducible pattern

  1. After a fresh stop + start from GCP console → attestation passes consistently.
  2. After a reboot (sudo reboot, an automatic GCP reboot, or host maintenance) → attestation fails with a measurement mismatch at index 9.
  3. Subsequent reboots do not fix the issue. Only a full stop + start resolves it.
  4. We have a daily auto start/stop schedule on our Spot instances. The issue appears to be triggered when the VM reboots without a full hardware deallocation/reallocation cycle.

Failing attestation output

Command:

nvattest attest --device gpu --verifier local --format json --nonce <64-char-hex-nonce>

Result:

{
    "result_code": 12,
    "result_message": "Overall Attestation Result is False"
}
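For anyone trying to reproduce this after each boot, here is a minimal Python sketch that wraps the command above with a fresh nonce and checks the result code. The helper names are hypothetical, and it assumes that nvattest prints a single JSON document on stdout with `result_code` at the top level, which we have not verified beyond the output shown here.

```python
import json
import secrets
import subprocess


def new_nonce() -> str:
    # 32 random bytes rendered as hex give the 64-character nonce used above.
    return secrets.token_hex(32)


def build_attest_command(nonce: str) -> list[str]:
    # Same flags as the command in this issue; nothing else is added.
    return [
        "nvattest", "attest",
        "--device", "gpu",
        "--verifier", "local",
        "--format", "json",
        "--nonce", nonce,
    ]


def attestation_passed(result: dict) -> bool:
    # In our runs, result_code 0 meant success and 12 meant overall failure.
    return result.get("result_code") == 0


def run_attestation() -> dict:
    # Assumption: stdout holds one JSON document with the result fields.
    proc = subprocess.run(
        build_attest_command(new_nonce()),
        capture_output=True,
        text=True,
        check=False,
    )
    return json.loads(proc.stdout)
```

Calling `attestation_passed(run_attestation())` from a boot-time unit would give a simple pass/fail signal to log next to the boot type (reboot vs. stop + start).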

Claims (relevant fields)

{
    "hwmodel": "GH100 A01 GSP BROM",
    "measres": "fail",
    "oemid": "5703",
    "secboot": null,
    "dbgstat": null,
    "x-nvidia-gpu-driver-version": "580.126.09",
    "x-nvidia-gpu-vbios-version": "96.00.CF.00.01",
    "x-nvidia-gpu-attestation-report-signature-verified": true,
    "x-nvidia-gpu-attestation-report-nonce-match": true,
    "x-nvidia-gpu-driver-rim-version-match": true,
    "x-nvidia-gpu-driver-rim-signature-verified": true,
    "x-nvidia-gpu-vbios-rim-version-match": true,
    "x-nvidia-gpu-vbios-rim-signature-verified": true,
    "x-nvidia-gpu-arch-check": true,
    "x-nvidia-gpu-attestation-report-cert-chain-fwid-match": true,
    "x-nvidia-mismatch-measurement-records": [
        {
            "index": 9,
            "measurementSource": "Firmware",
            "goldenSize": 48,
            "goldenValue": "4b3ed0f834d10fef95e61615edc5b4e98ec78cff39323993b3218f0cd62507978cf64e4487520bc7e560fde71ea0fc75",
            "runtimeSize": 48,
            "runtimeValue": "c80a9b62ce0d41184bb1ad0f6334d9400a2d2514ef92003b1c043410f91b7309144325a3e01c58b8bd6e198f5dda3b9b"
        }
    ]
}
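To pull only the failing measurements out of a claims document like the one above, something along these lines may help. It is a rough sketch: the function name is made up for illustration, and it relies only on the field names visible in the claims we captured.

```python
MISMATCH_KEY = "x-nvidia-mismatch-measurement-records"


def describe_mismatches(claims: dict) -> list[str]:
    """Return one short line per mismatched measurement record."""
    lines = []
    # The key may be missing or null when every measurement matches.
    for rec in claims.get(MISMATCH_KEY) or []:
        lines.append(
            f"index {rec['index']} ({rec['measurementSource']}): "
            f"golden={rec['goldenValue'][:16]}... "
            f"runtime={rec['runtimeValue'][:16]}..."
        )
    return lines
```

Given the claims from this issue loaded with `json.load`, this should yield a single line for index 9 with source `Firmware`.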

Key observations

  • Only index 9 fails. All other verification checks pass (signature, cert chain, RIM, nonce, driver/VBIOS version match).
  • The goldenValue and runtimeValue are identical across occurrences (same values on Feb 19 and Mar 2), suggesting a deterministic mismatch rather than random corruption.
  • secboot and dbgstat are both null (a known issue tracked as NVIDIA internal issue 5916701, as mentioned by @steven-bellock in #90, "Measurement mismatch in idx 9").
  • All certificate chains are valid with good OCSP status.
  • We cannot use gpu-admin-tools to toggle CC mode on GCP, as GCP does not allow GPU reset from within the VM.
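The "deterministic mismatch" observation can be checked mechanically by comparing saved claims from each failing occurrence. The sketch below is illustrative only: the function names are hypothetical, and it assumes each occurrence's claims were saved as JSON and loaded into a dict.

```python
def mismatch_fingerprint(claims: dict) -> frozenset:
    # Reduce a claims document to the set of (index, golden, runtime) triples.
    records = claims.get("x-nvidia-mismatch-measurement-records") or []
    return frozenset(
        (r["index"], r["goldenValue"], r["runtimeValue"]) for r in records
    )


def is_deterministic(occurrences: list[dict]) -> bool:
    # True when every occurrence failed on exactly the same records and values.
    # An empty list returns False, since there is nothing to compare.
    fingerprints = {mismatch_fingerprint(c) for c in occurrences}
    return len(fingerprints) == 1
```

Passing the claims from the Feb 19 and Mar 2 failures to `is_deterministic` would confirm that both runs report the same index and the same golden/runtime pair.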

Configuration verification

We followed the GCP Confidential VM with GPU guide and verified all steps:

# CC mode
$ nvidia-smi conf-compute -f
CC status: ON

$ nvidia-smi conf-compute -e
CC Environment: PRODUCTION

# TDX
$ dmesg | grep tdx
tdx: Guest detected

# Secure Boot
$ mokutil --sb-state
SecureBoot enabled

# LKCA configured
$ cat /etc/modprobe.d/nvidia-lkca.conf
install nvidia /sbin/modprobe ecdsa_generic; /sbin/modprobe ecdh; /sbin/modprobe --ignore-install nvidia

# ecdh_generic is builtin in the kernel
$ modinfo ecdh_generic
filename: (builtin)

# Persistence mode
$ nvidia-smi --query-gpu=persistence_mode --format=csv,noheader
Enabled

# Host maintenance policy
$ curl -s -H "Metadata-Flavor: Google" \
    http://metadata.google.internal/computeMetadata/v1/instance/scheduling/on-host-maintenance
TERMINATE

Questions

  1. What does measurement index 9 (Firmware) correspond to? Is there documentation on the semantics of each measurement index?
  2. Why does a reboot cause the runtime measurement to diverge from the golden value, while a stop + start does not?
  3. Is this a known issue with GCP Spot instances and H100 GPUs in TDX mode?
  4. Is there a workaround that does not require a full stop + start cycle?
