KVM: x86: pKVM: align AMX CPUID with host-provided model #88
i-yyi wants to merge 1 commit into intel-staging:pkvm-v6.18
For protected VMs, pKVM enforces a host-like CPUID and may append missing CPUID leaves from the default set. On AMX-capable hosts, that can expose AMX-related CPUID state even when the host userspace VMM didn't provide AMX in the guest CPUID model.

That mismatch is problematic in two ways. First, it changes the guest CPU model behind the VMM's back. Second, when AMX state is added during pKVM enforcement, the hypervisor may require a larger fpstate buffer than what the host side prepared for the protected vCPU.

Align pKVM's AMX handling with the CPUID model provided by the host VMM:

- if the host-provided CPUID contains AMX, keep the AMX-related CPUID bits/leaves and reallocate fpstate before synchronizing CPUID into the hypervisor;
- if the host-provided CPUID does not contain AMX, clear the AMX-related CPUID state from the enforced result and refresh xstate sizing.

This keeps AMX exposure under the VMM's control while preserving the required fpstate sizing when AMX is intentionally exposed.

Signed-off-by: Your Name <you@example.com>
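The two cases in the commit message can be sketched as a small standalone C model. This is not the patch itself: `struct cpuid_entry`, `cpuid_has_amx`, `cpuid_clear_amx` and `align_amx_with_host` are hypothetical names made up for illustration, and only the bit positions (leaf 7 EDX bits 22/24/25, XCR0 bits 17/18, leaves 0x1D/0x1E) are architectural. Recomputing the leaf 0xD size fields is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Architectural bit positions (Intel SDM). */
#define XCR0_XTILE_CFG      (1u << 17)
#define XCR0_XTILE_DATA     (1u << 18)
#define XCR0_XTILE          (XCR0_XTILE_CFG | XCR0_XTILE_DATA)
#define CPUID7_EDX_AMX_TILE (1u << 24)
/* AMX-BF16 (22), AMX-TILE (24), AMX-INT8 (25) in CPUID.(EAX=7,ECX=0):EDX. */
#define CPUID7_EDX_AMX      ((1u << 22) | (1u << 24) | (1u << 25))

/* Hypothetical stand-in for one CPUID entry; not the kernel's struct. */
struct cpuid_entry {
    uint32_t function, index;
    uint32_t eax, ebx, ecx, edx;
};

/* True if the VMM-provided model advertises AMX-TILE in leaf 7, subleaf 0. */
static bool cpuid_has_amx(const struct cpuid_entry *e, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (e[i].function == 7 && e[i].index == 0)
            return (e[i].edx & CPUID7_EDX_AMX_TILE) != 0;
    return false;
}

static void zero_regs(struct cpuid_entry *e)
{
    e->eax = e->ebx = e->ecx = e->edx = 0;
}

/*
 * Strip AMX from the enforced result. The caller is still expected to
 * refresh the xstate sizes reported in leaf 0xD (EBX/ECX).
 */
static void cpuid_clear_amx(struct cpuid_entry *e, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (e[i].function == 7 && e[i].index == 0)
            e[i].edx &= ~CPUID7_EDX_AMX;
        else if (e[i].function == 0xD && e[i].index == 0)
            e[i].eax &= ~XCR0_XTILE;          /* supported XCR0, low half */
        else if (e[i].function == 0xD && (e[i].index == 17 || e[i].index == 18))
            zero_regs(&e[i]);                 /* per-component subleaves */
        else if (e[i].function == 0x1D || e[i].function == 0x1E)
            zero_regs(&e[i]);                 /* tile and TMUL information */
    }
}

/*
 * Returns true when the host model contains AMX, i.e. the caller should
 * reallocate fpstate before syncing CPUID. Otherwise AMX is removed from
 * the enforced result and false is returned.
 */
bool align_amx_with_host(const struct cpuid_entry *host, size_t nhost,
                         struct cpuid_entry *enforced, size_t nenf)
{
    assert(host != NULL && enforced != NULL);
    if (cpuid_has_amx(host, nhost))
        return true;
    cpuid_clear_amx(enforced, nenf);
    return false;
}
```

A caller would pass the VMM's CPUID array as `host` and the post-enforcement array as `enforced`, then act on the return value.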
Basically it is. I have posted my understanding about the root cause in #87 (comment). With this understanding, I think we may not need to modify the pkvm_host.c. Specifically, we may only need to align the What do you think?
I understand, I agree with your point of view, and I will complete this PR during my free time next week.
Am I correct that this problem exists only in the case when the VMM doesn't use the KVM_GET_SUPPORTED_CPUID ioctl for creating the guest CPUID (or uses it but for some reason removes AMX stuff from it)? But in #87 you mentioned you observed this with crosvm, whereas crosvm does use KVM_GET_SUPPORTED_CPUID, and it doesn't look like crosvm removes AMX stuff. What am I missing?
Yeah, I agree this is not quite nice. Basically it's a hack (BTW together with the optimistic assumption that the aligned buffer will be large enough for extra entries). As I see it, a proper solution would be: pKVM requires the VMM to prepare a CPUID matching pKVM's requirements, validates this CPUID, and returns an error if it doesn't conform, instead of silently modifying it. IIUC that is more or less how it works in TDX. (The requirements themselves might be the same as what pKVM already enforces, i.e. roughly speaking: based on KVM_GET_SUPPORTED_CPUID, with only a few leaves allowed to be tweaked by the VMM.) I think the reason we didn't already implement it this way was just to spare the effort, and to spare the need to modify crosvm for now...
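To make the "validate instead of silently modify" idea concrete, here is a minimal standalone sketch in C. Everything in it is an assumption for illustration: `struct cpuid_rule`, its `tweak` masks and `validate_cpuid` are hypothetical and do not correspond to existing pKVM or TDX code. Each rule holds the value pKVM requires plus a per-register mask of bits the VMM may change.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CPUID entry: regs[] holds EAX, EBX, ECX, EDX in that order. */
struct cpuid_entry {
    uint32_t function, index;
    uint32_t regs[4];
};

/* Hypothetical rule: required value plus the bits the VMM may tweak. */
struct cpuid_rule {
    struct cpuid_entry want;
    uint32_t tweak[4];
};

static const struct cpuid_entry *find_entry(const struct cpuid_entry *e,
                                            size_t n, uint32_t function,
                                            uint32_t index)
{
    for (size_t i = 0; i < n; i++)
        if (e[i].function == function && e[i].index == index)
            return &e[i];
    return NULL;
}

/*
 * Returns 0 if every required leaf is present in the VMM's CPUID and
 * differs from the requirement only in tweakable bits, -EINVAL otherwise.
 * The VMM's CPUID is never modified.
 */
int validate_cpuid(const struct cpuid_rule *rules, size_t nrules,
                   const struct cpuid_entry *vmm, size_t nvmm)
{
    assert(nrules == 0 || rules != NULL);
    for (size_t i = 0; i < nrules; i++) {
        const struct cpuid_entry *got =
            find_entry(vmm, nvmm, rules[i].want.function, rules[i].want.index);
        if (!got)
            return -EINVAL;
        for (int r = 0; r < 4; r++) {
            uint32_t diff = got->regs[r] ^ rules[i].want.regs[r];
            if (diff & ~rules[i].tweak[r])
                return -EINVAL;
        }
    }
    return 0;
}
```

With a rule that marks XCR0 bit 18 (XTILE_DATA) of leaf 0xD as tweakable, a VMM model that drops only that bit passes, while any other difference is rejected.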
I don't think so. My understanding is that for protected VMs, even if the VMM (such as crosvm) calls KVM_GET_SUPPORTED_CPUID and provides appropriate settings, pKVM will not obey the CPUID provided by crosvm, but will instead overwrite it according to the host CPU.
Yes, pKVM is better suited as a verification gate than for direct intervention. I agree with you.
I guess the CPUID bits are not removed by crosvm. When the host KVM returns the supported CPUID for the KVM_GET_SUPPORTED_CPUID ioctl, the host KVM has already removed the XTILE_DATA bit from the permitted_xcr0 via the AFAIK, the Linux kernel introduced this mechanism, which makes a process explicitly request the AMX feature rather than allowing it by default, to reduce the memory consumed by the FPU state of each process, as not every process will need AMX.
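For reference, a minimal sketch of how a VMM process can opt in to guest AMX state, assuming the mechanism discussed here is the `arch_prctl` dynamic-xstate interface from the Linux uapi header `asm/prctl.h`. The `ARCH_*` values and the feature number 18 come from that interface; the wrapper names (`request_guest_amx`, `get_guest_perm`, `perm_allows_amx`) are hypothetical. On platforms other than x86-64 Linux the wrappers simply return `-ENOSYS`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#if defined(__linux__) && defined(__x86_64__)
#include <sys/syscall.h>
#include <unistd.h>
#endif

/* Values from the Linux uapi header asm/prctl.h. */
#ifndef ARCH_GET_XCOMP_GUEST_PERM
#define ARCH_GET_XCOMP_GUEST_PERM 0x1024
#endif
#ifndef ARCH_REQ_XCOMP_GUEST_PERM
#define ARCH_REQ_XCOMP_GUEST_PERM 0x1025
#endif
#define XFEATURE_XTILE_DATA 18   /* XCR0 bit number of AMX tile data */

/* Ask the kernel to permit AMX tile data for guests of this process. */
int request_guest_amx(void)
{
#if defined(__linux__) && defined(__x86_64__)
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_GUEST_PERM,
                XFEATURE_XTILE_DATA) != 0)
        return -errno;
    return 0;
#else
    return -ENOSYS;
#endif
}

/* Read the current guest xstate permission bitmap of this process. */
int get_guest_perm(uint64_t *perm)
{
    assert(perm != NULL);
#if defined(__linux__) && defined(__x86_64__)
    if (syscall(SYS_arch_prctl, ARCH_GET_XCOMP_GUEST_PERM, perm) != 0)
        return -errno;
    return 0;
#else
    return -ENOSYS;
#endif
}

/* Pure helper: does a permission bitmap include AMX tile data? */
bool perm_allows_amx(uint64_t perm)
{
    return (perm & (1ULL << XFEATURE_XTILE_DATA)) != 0;
}
```

A VMM that never makes the request keeps a permission bitmap without bit 18, which is consistent with XTILE_DATA being absent from the CPUID it later receives from KVM.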
Yes, true. That is the reason. With the current CPUID enforcement mechanism, I think XTILE_DATA could be a bit that the host is allowed to tweak, so that pKVM can honor this bit from the host (e.g., in this case, remove XTILE_DATA on the pKVM side and update the size information in the CPUID as well).
This reminds me that a possible alternative is to let pKVM respect the host KVM's guest FPU permission, which is used to calculate the permitted XCR0 in
I see. So in a nutshell, the problem is that this is per userspace process, and pKVM currently has no knowledge of such per-process host state. Also, this problem seems quite independent of how pKVM enforces the CPUID: by silently modifying it (like it does now) or just by validating it? (i.e. in the latter case we'd still need to address this problem somehow?) I'm not really familiar with all this FPU stuff, but it feels like it would be better to address this problem in some generic way, not by adding ad-hoc logic for this specific AMX issue...
BTW just for the record: not sure if exactly related to this issue, but I recall that when originally reviewing Kevin's patch in https://android-review.googlesource.com/c/kernel/common/+/3813637, I had the following observation:
And Kevin's reply was:
Yes it is. So I raised the alternative in #88 (comment), which is to sync the guest vCPU process's FPU permission to the pKVM. With this, the pKVM will also remove the
If the pKVM only validates rather than silently modifies the CPUID bits, and returns an error code to the host, then the host must ensure it exposes only the expected CPUID bits to the pKVM to avoid a validation failure. This is the solution you mentioned in #88 (comment), right? Without the above alternative, the crosvm should request the So it seems this alternative is needed either for the solution in #88 (comment) or for the current CPUID enforcement.
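The permission-sync alternative can be modelled in a few lines of standalone C. This is only a sketch under assumptions: `permitted_xcr0` and its parameters are hypothetical names, and how the per-process guest permission would actually reach pKVM is not shown. The only architectural facts used are that XTILE_CFG and XTILE_DATA are XCR0 bits 17 and 18 and that they can only be enabled together.

```c
#include <assert.h>
#include <stdint.h>

#define XCR0_XTILE_CFG  (1ULL << 17)
#define XCR0_XTILE_DATA (1ULL << 18)
#define XCR0_XTILE      (XCR0_XTILE_CFG | XCR0_XTILE_DATA)

/*
 * Hypothetical model: the XCR0 bits a guest may use are the hardware
 * supported bits filtered by the guest permission of the VMM process.
 * If only one of the two AMX components survives the filter, both are
 * dropped, since XCR0[18:17] must be set or cleared as a pair.
 */
uint64_t permitted_xcr0(uint64_t supported_xcr0, uint64_t guest_perm)
{
    uint64_t xcr0 = supported_xcr0 & guest_perm;

    if ((xcr0 & XCR0_XTILE) != XCR0_XTILE)
        xcr0 &= ~XCR0_XTILE;

    /* Invariant: the AMX bits are either both set or both clear. */
    assert((xcr0 & XCR0_XTILE) == 0 || (xcr0 & XCR0_XTILE) == XCR0_XTILE);
    return xcr0;
}
```

If pKVM derived its leaf 0xD contents from such a filtered value, a process that never requested AMX permission would get a CPUID without the AMX components in both the validate-only and the current enforcement scheme.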
This is a vibe-coded version for #87. If you think it generally matches what you had in mind, I can do the first round of review myself. Once I’ve confirmed it, we can move on to the next round of discussion.