KVM: x86: pKVM: align AMX CPUID with host-provided model #88

Open
i-yyi wants to merge 1 commit into intel-staging:pkvm-v6.18 from i-yyi:pkvm-v6.18

Conversation

@i-yyi

i-yyi commented Mar 20, 2026

This is a vibe-coded version for #87. If you think it generally matches what you had in mind, I can do the first round of review myself. Once I’ve confirmed it, we can move on to the next round of discussion.


For protected VMs, pKVM enforces a host-like CPUID and may append missing CPUID leaves from the default set. On AMX-capable hosts, that can expose AMX-related CPUID state even when the host userspace VMM didn't provide AMX in the guest CPUID model.

That mismatch is problematic in two ways. First, it changes the guest CPU model behind the VMM's back. Second, when AMX state is added during pKVM enforcement, the hypervisor may require a larger fpstate buffer than what the host side prepared for the protected vCPU.

Align pKVM's AMX handling with the CPUID model provided by the host VMM:

  • if the host-provided CPUID contains AMX, keep the AMX-related CPUID bits/leaves and reallocate fpstate before synchronizing CPUID into the hypervisor;
  • if the host-provided CPUID does not contain AMX, clear the AMX-related CPUID state from the enforced result and refresh xstate sizing.

This keeps AMX exposure under the VMM's control while preserving the required fpstate sizing when AMX is intentionally exposed.
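
The two cases above can be sketched as a small Python model. It is only an illustration under stated assumptions, not the patch itself: the dictionary layout and the names `host_has_amx` and `align_amx` are hypothetical, and the xstate sizing refresh is left out. The bit positions (AMX-BF16/TILE/INT8 in leaf 7 EDX bits 22/24/25, XTILECFG/XTILEDATA in XCR0 bits 17/18) and the AMX-only leaves 0x1D and 0x1E are taken from the SDM.

```python
# Toy model of the policy, not pKVM code. CPUID entries are modeled as
# {(leaf, subleaf): {"eax": ..., "ebx": ..., "ecx": ..., "edx": ...}}.

AMX_TILE = 1 << 24
AMX_LEAF7_EDX = (1 << 22) | AMX_TILE | (1 << 25)  # AMX-BF16, AMX-TILE, AMX-INT8
XTILE_MASK = (1 << 17) | (1 << 18)                # XTILECFG, XTILEDATA in XCR0
# Entries that only make sense when AMX is exposed.
AMX_ONLY_ENTRIES = {(0x1D, 0), (0x1D, 1), (0x1E, 0), (0xD, 17), (0xD, 18)}


def host_has_amx(host):
    """True if the VMM-provided model advertises AMX-TILE (leaf 7, EDX bit 24)."""
    return bool(host.get((7, 0), {}).get("edx", 0) & AMX_TILE)


def align_amx(host, enforced):
    """Return (aligned_entries, need_fpstate_realloc)."""
    if host_has_amx(host):
        # AMX is intended: keep everything, but the caller must grow fpstate
        # before the CPUID is synchronized into the hypervisor.
        return dict(enforced), True

    aligned = {}
    for key, regs in enforced.items():
        if key in AMX_ONLY_ENTRIES:
            continue  # drop AMX-only leaves and sub-leaves
        regs = dict(regs)
        if key == (7, 0):
            regs["edx"] = regs.get("edx", 0) & ~AMX_LEAF7_EDX
        elif key == (0xD, 0):
            regs["eax"] = regs.get("eax", 0) & ~XTILE_MASK
        aligned[key] = regs
    return aligned, False
```

Returning the realloc flag alongside the entries keeps the ordering requirement (resize first, then sync) visible to the caller in this sketch.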

Signed-off-by: Your Name <you@example.com>
@cxdong
Contributor

cxdong commented Mar 20, 2026

This is a vibe-coded version for #87. If you think it generally matches what you had in mind, I can do the first round of review myself. Once I’ve confirmed it, we can move on to the next round of discussion.

Basically it is. I have posted my understanding of the root cause in #87 (comment). With this understanding, I think we may not need to modify pkvm_host.c.

Specifically, we may only need to align the XFEATURE_MASK_USER_DYNAMIC bits in CPUID leaf 0xd, sub-leaf 0 (ECX = 0); refer to kvm_check_cpuid. If the CPUID from the host VMM doesn't have the XFEATURE_MASK_USER_DYNAMIC bits, then make sure pkvm_enforce_cpuid won't add them. We may also need to update the size information stored in the CPUID 0xd main leaf and sub-leaves. For the size calculation, refer to __do_cpuid_func.

What do you think?
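
A rough Python sketch of that alignment, to make the proposal concrete. Everything here is an assumption for illustration: `align_leaf_0xd` and `xsave_size` are hypothetical helpers, and the offset/size table is an example standard-format layout rather than values read from real hardware (the kernel obtains them from CPUID 0xd sub-leaves). EBX is left out because it depends on the guest's current XCR0.

```python
XFEATURE_MASK_XTILE_DATA = 1 << 18
# In current Linux, XTILE_DATA is the only user-dynamic xfeature.
XFEATURE_MASK_USER_DYNAMIC = XFEATURE_MASK_XTILE_DATA

LEGACY_AND_HEADER = 512 + 64  # legacy region plus XSAVE header
# Assumed, illustrative layout: component bit -> (offset, size) in bytes.
LAYOUT = {2: (576, 256), 9: (2688, 8), 17: (2752, 64), 18: (2816, 8192)}


def xsave_size(mask):
    """Standard-format size needed by the components set in mask."""
    size = LEGACY_AND_HEADER
    for bit, (offset, length) in LAYOUT.items():
        if mask & (1 << bit):
            size = max(size, offset + length)
    return size


def align_leaf_0xd(host_xcr0, enforced_xcr0):
    """Drop dynamic bits the host did not provide and recompute ECX."""
    dropped = XFEATURE_MASK_USER_DYNAMIC & ~host_xcr0
    xcr0 = enforced_xcr0 & ~dropped
    return {
        "eax": xcr0 & 0xFFFFFFFF,  # low 32 bits of the supported XCR0 mask
        "edx": xcr0 >> 32,         # high 32 bits
        "ecx": xsave_size(xcr0),   # max size for all supported features
    }
```

The point of the sketch is that the mask and the size fields must change together: clearing a bit without recomputing ECX would leave the leaf internally inconsistent.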

@i-yyi
Author

i-yyi commented Mar 20, 2026

I understand, I agree with your point of view, and I will complete this PR during my free time next week.

@maluka-dmytro
Contributor

For protected VMs, pKVM enforces a host-like CPUID and may append missing CPUID leaves from the default set. On AMX-capable hosts, that can expose AMX-related CPUID state even when the host userspace VMM didn't provide AMX in the guest CPUID model.

Am I correct that this problem exists only in the case when the VMM doesn't use the KVM_GET_SUPPORTED_CPUID ioctl for creating the guest CPUID (or uses it but for some reason removes AMX stuff from it)?

But in #87 you mentioned you observed this with crosvm, whereas crosvm does use KVM_GET_SUPPORTED_CPUID, and it doesn't look like crosvm removes AMX stuff. What am I missing?

That mismatch is problematic in two ways. First, it changes the guest CPU model behind the VMM's back.

Yeah, I agree this is not quite nice. Basically it's a hack (BTW together with the optimistic assumption that the aligned buffer will be large enough for extra entries).

As I see it, a proper solution would be: pKVM requires the VMM to prepare a CPUID matching pKVM's requirements, and validates this CPUID and returns an error if it doesn't conform, instead of silently modifying it. IIUC that is more or less how it works in TDX. (The requirements themselves might be the same as what pKVM already enforces, i.e. roughly speaking: based on KVM_GET_SUPPORTED_CPUID, with only a few leaves allowed to be tweaked by the VMM.)
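
The validate-instead-of-modify idea can be illustrated with a minimal Python sketch. This is an assumption about what such a gate could look like, not an existing pKVM or TDX interface: `validate_cpuid`, `CpuidMismatch`, and the `tweakable_leaves` parameter are hypothetical names, and entries are modeled as `(eax, ebx, ecx, edx)` tuples keyed by `(leaf, subleaf)`.

```python
class CpuidMismatch(Exception):
    """Raised when the VMM-provided CPUID does not conform."""


def validate_cpuid(vmm, required, tweakable_leaves=frozenset()):
    """Reject, rather than silently fix, any non-conforming entry.

    Leaves listed in tweakable_leaves are the few the VMM may adjust freely.
    """
    bad = [
        key
        for key in sorted(set(vmm) | set(required))
        if key[0] not in tweakable_leaves and vmm.get(key) != required.get(key)
    ]
    if bad:
        names = ", ".join(f"{leaf:#x}.{sub}" for leaf, sub in bad)
        raise CpuidMismatch(f"non-conforming CPUID entries: {names}")
```

With this shape, the error names the offending entries, so the VMM can fix its model instead of discovering later that the guest sees something different from what it asked for.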

I think basically the reason why we didn't already implement it this way was just to spare the effort, and spare the need to modify crosvm for now...

@i-yyi
Author

i-yyi commented Mar 25, 2026

Am I correct that this problem exists only in the case when the VMM doesn't use the KVM_GET_SUPPORTED_CPUID ioctl for creating the guest CPUID (or uses it but for some reason removes AMX stuff from it)?
But in #87 you mentioned you observed this with crosvm, whereas crosvm does use KVM_GET_SUPPORTED_CPUID, and it doesn't look like crosvm removes AMX stuff. What am I missing?

I don't think so. My understanding is that for protected VMs, even if the VMM (such as crosvm) calls KVM_GET_SUPPORTED_CPUID and provides appropriate settings, pKVM will not obey the CPUID provided by crosvm, but will instead overwrite it according to the host CPU.
If I am wrong, please point out my mistake.

As I see it, a proper solution would be: pKVM requires the VMM to prepare a CPUID matching pKVM's requirements, and validates this CPUID and returns an error if it doesn't conform, instead of silently modifying it. IIUC that is more or less how it works in TDX. (The requirements themselves might be the same as what pKVM already enforces, i.e. roughly speaking: based on KVM_GET_SUPPORTED_CPUID, with only a few leaves allowed to be tweaked by the VMM.)

Yes, pKVM is better suited to acting as a verification gate than to intervening directly. I agree with you.

@cxdong
Contributor

cxdong commented Mar 25, 2026

Am I correct that this problem exists only in the case when the VMM doesn't use the KVM_GET_SUPPORTED_CPUID ioctl for creating the guest CPUID (or uses it but for some reason removes AMX stuff from it)?

I guess the CPUID bits are not removed by crosvm. When the host KVM returns the supported CPUID for the KVM_GET_SUPPORTED_CPUID ioctl, it has already removed the XTILE_DATA bit from permitted_xcr0 via kvm_get_filtered_xcr0(), because the crosvm process has not requested to use it. pKVM, however, has no way to know the crosvm process's permitted XCR0, so it retains the XTILE_DATA bit, and that makes the difference when calculating the FPU memory size.

AFAIK, the Linux kernel makes a process explicitly request AMX, rather than allowing it by default, to reduce the per-process memory consumed by FPU state, as not every process needs AMX.
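
To illustrate the filtering described above, here is a small Python model. It follows my reading of the upstream kvm_get_filtered_xcr0() helper and should be treated as a sketch, not a transcription: `filtered_xcr0` and its `guest_perm` parameter are hypothetical stand-ins, and the extra step that also drops XTILE_CFG when XTILE_DATA is not permitted is an assumption about the helper's behavior.

```python
XFEATURE_MASK_XTILE_CFG = 1 << 17
XFEATURE_MASK_XTILE_DATA = 1 << 18
# In current Linux, XTILE_DATA is the only user-dynamic xfeature.
XFEATURE_MASK_USER_DYNAMIC = XFEATURE_MASK_XTILE_DATA


def filtered_xcr0(supported_xcr0, guest_perm):
    """XCR0 bits a given process may expose to its guest.

    guest_perm models the per-process guest FPU permission mask: all default
    features, plus any dynamic feature the process explicitly requested.
    """
    permitted = supported_xcr0
    if permitted & XFEATURE_MASK_USER_DYNAMIC:
        permitted &= guest_perm
        # Without XTILE_DATA, XTILE_CFG is of no use to the guest either.
        if not permitted & XFEATURE_MASK_XTILE_DATA:
            permitted &= ~XFEATURE_MASK_XTILE_CFG
    return permitted


# Host side: the VMM process never requested the dynamic feature.
supported = 0x207 | XFEATURE_MASK_XTILE_CFG | XFEATURE_MASK_XTILE_DATA
host_view = filtered_xcr0(supported, supported & ~XFEATURE_MASK_USER_DYNAMIC)
# pKVM side: a fixed permission that always includes everything supported.
pkvm_view = filtered_xcr0(supported, supported)
print(hex(host_view), hex(pkvm_view))
```

The two views differ only because the permission input differs, which is the per-process state pKVM currently cannot see.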

As I see it, a proper solution would be: pKVM requires the VMM to prepare a CPUID matching pKVM's requirements, and validates this CPUID and returns an error if it doesn't conform, instead of silently modifying it. IIUC that is more or less how it works in TDX. (The requirements themselves might be the same as what pKVM already enforces, i.e. roughly speaking: based on KVM_GET_SUPPORTED_CPUID, with only a few leaves allowed to be tweaked by the VMM.)

I think basically the reason why we didn't already implement it this way was just to spare the effort, and spare the need to modify crosvm for now...

Yes, true. That is the reason. With the current CPUID enforcement mechanism, I think XTILE_DATA could be a bit that the host is allowed to tweak, so that pKVM can honor this bit from the host (e.g., in this case, remove XTILE_DATA on the pKVM side and update the size information in the CPUID as well).

@cxdong
Contributor

cxdong commented Mar 25, 2026

pKVM, however, has no way to know the crosvm process's permitted XCR0, so it retains the XTILE_DATA bit, and that makes the difference when calculating the FPU memory size.

This reminds me that a possible alternative is to let pKVM respect the host KVM's guest FPU permission, which is used to calculate the permitted XCR0 in xstate_get_group_perm(), so that when pKVM calculates the permitted XCR0, it gets the same result as the host. Currently the guest FPU permission on the pKVM side is fixed, so there is no per-VM permission, which this alternative may need.

@maluka-dmytro
Contributor

When the host KVM returns the supported CPUID for the KVM_GET_SUPPORTED_CPUID ioctl, it has already removed the XTILE_DATA bit from permitted_xcr0 via kvm_get_filtered_xcr0(), because the crosvm process has not requested to use it. pKVM, however, has no way to know the crosvm process's permitted XCR0, so it retains the XTILE_DATA bit, and that makes the difference when calculating the FPU memory size.

AFAIK, the Linux kernel makes a process explicitly request AMX, rather than allowing it by default, to reduce the per-process memory consumed by FPU state, as not every process needs AMX.

I see. So in a nutshell, the problem is that this is per userspace process, and pKVM currently has no knowledge about such host's per-process states.

Also, this problem seems quite independent of how pKVM enforces the CPUID - by silently modifying it (like it does now) or just by validating it? (i.e. in the latter case we'd still need to address this problem somehow?)

I'm not really familiar with all this FPU stuff, but it feels it would be better to address this problem in some generic way, not by adding ad-hoc logic for this specific AMX issue...

@maluka-dmytro
Contributor

BTW just for the record: not sure if exactly related to this issue, but I recall that when originally reviewing Kevin's patch in https://android-review.googlesource.com/c/kernel/common/+/3813637, I had the following observation:

On all vCPUs, pKVM enforces different values of some FPU params (according to the SDM, those params are "Maximum size (bytes, from the beginning of the XSAVE/XRSTOR save area) required by enabled features in XCR0" for sub-leaf 0 EBX and, for sub-leaf 1 EBX, the size in bytes of the XSAVE area containing all states enabled by XCR0 | IA32_XSS):

[  230.485715] pkvm: orig cpuid: func=d idx=0 flags=1 eax=207 ebx=240 ecx=a88 edx=0
[  230.493979] pkvm:  new cpuid: func=d idx=0 flags=1 eax=207 ebx=a88 ecx=a88 edx=0
[  230.502230] pkvm: orig cpuid: func=d idx=1 flags=1 eax=f ebx=240 ecx=0 edx=0
[  230.510102] pkvm:  new cpuid: func=d idx=1 flags=1 eax=f ebx=348 ecx=0 edx=0

I haven't analyzed if it is ok that pKVM changes these values. So, is it ok?
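
For reference, the four EBX values in the log above are consistent with a simple size calculation, sketched here in Python. The layout table is an assumption inferred from the log (eax=207 means x87, SSE, AVX, and PKRU; the PKRU offset of 2688 follows from ecx=a88 and its 8-byte size), and `standard_size` and `compacted_size` are hypothetical helper names. The compacted calculation ignores the optional 64-byte alignment of components, which happens not to matter for this feature set.

```python
LEGACY_AND_HEADER = 512 + 64  # 0x240: legacy region plus XSAVE header
# Assumed layout: component bit -> (standard-format offset, size) in bytes.
COMPONENTS = {2: (576, 256), 9: (2688, 8)}  # AVX, PKRU


def standard_size(mask):
    """Sub-leaf 0 EBX: the furthest end among enabled components."""
    size = LEGACY_AND_HEADER
    for bit, (offset, length) in COMPONENTS.items():
        if mask & (1 << bit):
            size = max(size, offset + length)
    return size


def compacted_size(mask):
    """Sub-leaf 1 EBX: enabled components packed back to back."""
    size = LEGACY_AND_HEADER
    for bit in sorted(COMPONENTS):
        if mask & (1 << bit):
            size += COMPONENTS[bit][1]
    return size


for xcr0 in (0x1, 0x207):
    print(f"xcr0={xcr0:#x} "
          f"standard={standard_size(xcr0):#x} "
          f"compacted={compacted_size(xcr0):#x}")
```

Under these assumptions, the "orig" values match an XCR0 with only x87 enabled and the "new" values match the full supported mask.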

And Kevin's reply was:

FPU is managed by pKVM hence the state size must be set by pKVM (e.g. it doesn't sound secure if the host reports a smaller area size than supported FPU features to guest, leading to xsave/xrstor not covering the full state in guest). In this specific case the difference comes from:

a) crosvm gets size a88 from kvm-high, based on supported FPU features
b) crosvm passes a88 to kvm-high
c) kvm-high updates it to 240 based on the initial xcr0 value. See kvm_set_cpuid(), which calls __kvm_update_cpuid_runtime()
d) kvm-high issues hypercall with 240 to pkvm, at the end of kvm_set_cpuid()
e) pkvm calls pkvm_enforce_cpuid(), which updates 240 to a88 based on supported FPU features
f) pkvm calls kvm_set_cpuid(), same as step c). Then 240 again

at run-time, guest kernel updates xcr0, emulated by __kvm_set_xcr() which calls kvm_update_cpuid_runtime() to change 240 to a88 based on the new xcr0 value.

So in the end the guest observes the desired value.
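
The back-and-forth in steps a) to f) is easier to follow as a trace. The Python sketch below only replays the values quoted in this thread; the lookup table and the `trace_ebx` helper are hypothetical, and the mapping of each size to an XCR0 value is an assumption based on the explanation above.

```python
SUPPORTED_XCR0 = 0x207
# Size reported in CPUID 0xd sub-leaf 0 EBX for a given XCR0 (values from the log).
SIZE_FOR_XCR0 = {0x1: 0x240, 0x207: 0xA88}


def trace_ebx(initial_xcr0=0x1, guest_xcr0=0x207):
    """Replay how EBX changes hands between crosvm, kvm-high, and pkvm."""
    steps = []
    # a/b: crosvm gets the size for all supported features and passes it on.
    steps.append(("a/b crosvm", SIZE_FOR_XCR0[SUPPORTED_XCR0]))
    # c/d: kvm-high recomputes from the initial XCR0 and issues the hypercall.
    steps.append(("c/d kvm-high", SIZE_FOR_XCR0[initial_xcr0]))
    # e: pkvm enforcement restores the size for all supported features.
    steps.append(("e pkvm enforce", SIZE_FOR_XCR0[SUPPORTED_XCR0]))
    # f: pkvm's own kvm_set_cpuid() recomputes from the initial XCR0 again.
    steps.append(("f pkvm set", SIZE_FOR_XCR0[initial_xcr0]))
    # Run time: the guest writes XCR0 and the size follows the new value.
    steps.append(("runtime", SIZE_FOR_XCR0[guest_xcr0]))
    return steps


for name, ebx in trace_ebx():
    print(f"{name:15s} ebx={ebx:#x}")
```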

@cxdong
Contributor

cxdong commented Mar 26, 2026

I see. So in a nutshell, the problem is that this is per userspace process, and pKVM currently has no knowledge about such host's per-process states.

Yes it is. So I raised the alternative in #88 (comment), which is to sync the guest vCPU process's fpu permission to the pKVM. With this, the pKVM will also remove the XFEATURE_XTILE_DATA bit from its side if this is not requested by the host. This alternative doesn't need to modify the current CPUID enforcement logic.

Also, this problem seems quite independent of how pKVM enforces the CPUID - by silently modifying it (like it does now) or just by validating it? (i.e. in the latter case we'd still need to address this problem somehow?)

I'm not really familiar with all this FPU stuff, but it feels it would be better to address this problem in some generic way, not by adding ad-hoc logic for this specific AMX issue...

If pKVM only validates the CPUID bits rather than silently modifying them, and returns an error code to the host, then the host has to guarantee that it exposes only the expected CPUID bits to pKVM to avoid a validation failure. This is the solution you mentioned in #88 (comment), right? Without the above alternative, crosvm would have to request the XFEATURE_XTILE_DATA fpu permission for the guest vCPU process so that the host KVM can expose a CPUID containing the XFEATURE_XTILE_DATA bit required by pKVM, which seems unnecessary if crosvm wants to create a guest that doesn't need the AMX feature. With the above alternative, crosvm doesn't need to do this, as the XFEATURE_XTILE_DATA bit will be tweaked by pKVM according to the guest vCPU process's fpu permission.

So it seems this alternative is needed in either case: for the solution in #88 (comment) or for the current CPUID enforcement.


3 participants