KVM: pVMX: Use fpu_user_cfg.max_size to compute fpsize #77

Open
mmisono wants to merge 6 commits into intel-staging:pkvm-v6.18 from mmisono:fix/pkvm-v6.18-fpsize

@mmisono mmisono commented Feb 24, 2026

Since pKVM enforces the host's CPUID values, calculate fpsize from what the hardware supports rather than from what the vCPU reports.

Without this, pKVM fails to boot a VM on a machine with Intel AMX, because crosvm does not request that feature.

Fixes: 58f48d1 ("KVM: pVMX: Add new fpstate memory for xfd")

Dmytro Maluka and others added 6 commits February 12, 2026 15:57
Currently host KVM skips zapping all guest mappings in
kvm_arch_flush_shadow_all() during VM teardown if pKVM is enabled,
assuming that it is not needed since the pKVM hypervisor is about to
destroy the guest MMU anyway, when handling the vm_destroy hypercall.

However, the host kernel may start using guest pages for other needs
right after kvm_arch_flush_shadow_all(), before vm_destroy, since KVM
executes kvm_arch_flush_shadow_all() as a part of the MMU notifier
release, thus telling the host kernel's MM that the pages are available
to use for other needs.

This is not an issue for pVMs, since pVM pages are still pinned (they
are only unpinned after vm_destroy). However, this is an issue for
npVMs, since the host-side state of npVM pages is still SHARED_OWNED,
not OWNED, so the host will fail e.g. to donate those pages to a pVM if
it tries to do that immediately after kvm_arch_flush_shadow_all().

So don't skip kvm_arch_flush_shadow_all() but let it properly ask pKVM
to unmap guest pages and thus properly update their host state to OWNED,
so that the host kernel can immediately start using them for other
needs.

Fixes: 30e9a30 ("KVM: x86/mmu: pKVM: Skip kvm_arch_flush_shadow_all()")
Signed-off-by: Dmytro Maluka <dmaluka@google.com>
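
The ownership argument in this commit message can be pictured with a toy state model. This is only a sketch: the enum, the helper names, and the two-state simplification are invented for illustration and are not pKVM's actual types or page-state machinery.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names are illustrative, not pKVM's real types. */
enum demo_page_state { DEMO_OWNED, DEMO_SHARED_OWNED };

/* Donating a page to a pVM requires exclusive host ownership. */
static bool demo_can_donate(enum demo_page_state s)
{
	return s == DEMO_OWNED;
}

/* Models the unmap requested by the flush: the share ends, the host owns the page again. */
static enum demo_page_state demo_unmap_from_guest(enum demo_page_state s)
{
	(void)s;
	return DEMO_OWNED;
}

/* Usage: an npVM page cannot be donated until it has been unmapped. */
static void demo_check_states(void)
{
	enum demo_page_state page = DEMO_SHARED_OWNED;

	assert(!demo_can_donate(page));
	page = demo_unmap_from_guest(page);
	assert(demo_can_donate(page));
}
```

In this model, skipping the unmap step leaves the page in the shared state, which is the window in which the donation described above would fail.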
…aps()

commit f8ade833b733ae0b72e87ac6d2202a1afbe3eb4a upstream.

Explicitly configure KVM's supported XSS as part of each vendor's setup
flow to fix a bug where clearing SHSTK and IBT in kvm_cpu_caps, e.g. due
to lack of CET XFEATURE support, makes kvm-intel.ko unloadable when nested
VMX is enabled, i.e. when nested=1.  The late clearing results in
nested_vmx_setup_{entry,exit}_ctls() clearing VM_{ENTRY,EXIT}_LOAD_CET_STATE
when nested_vmx_setup_ctls_msrs() runs during the CPU compatibility checks,
ultimately leading to a mismatched VMCS config due to the reference config
having the CET bits set, but every CPU's "local" config having the bits
cleared.

Note, kvm_caps.supported_{xcr0,xss} are unconditionally initialized by
kvm_x86_vendor_init(), before calling into vendor code, and not referenced
between ops->hardware_setup() and their current/old location.

Fixes: 69cc3e8 ("KVM: x86: Add XSS support for CET_KERNEL and CET_USER")
Cc: stable@vger.kernel.org
Cc: Mathias Krause <minipli@grsecurity.net>
Cc: John Allen <john.allen@amd.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Chao Gao <chao.gao@intel.com>
Cc: Binbin Wu <binbin.wu@linux.intel.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://patch.msgid.link/20260128014310.3255561-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
After commit b5a0229 ("KVM: x86: Explicitly configure supported
XSS from {svm,vmx}_set_cpu_caps()"), the pKVM hypervisor build fails
with an undefined reference to `kvm_setup_xss_caps`.

Move `kvm_setup_xss_caps` outside the `#ifndef __PKVM_HYP__` block
to ensure it is compiled and available for the pKVM hypervisor,
resolving the linking error.

At the same time, drop the XSS-related configuration from
pkvm_x86_vendor_init(), since kvm_setup_xss_caps(), which handles it,
is now called by vmx_set_cpu_caps() in vmx_hardware_setup() for both
KVM and pKVM.

Fixes: b5a0229 ("KVM: x86: Explicitly configure supported XSS from {svm,vmx}_set_cpu_caps()")

Signed-off-by: Grzegorz Jaszczyk <jaszczyk@google.com>
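
The build failure described in this commit message comes down to where a definition sits relative to a preprocessor guard. A minimal sketch follows, with `DEMO_HYP_BUILD` standing in for `__PKVM_HYP__` and with made-up function names; it is not the kernel code.

```c
#include <assert.h>

/* DEMO_HYP_BUILD stands in for __PKVM_HYP__ and models the hypervisor build. */
#define DEMO_HYP_BUILD 1

#ifndef DEMO_HYP_BUILD
/* Host-only code: not compiled when DEMO_HYP_BUILD is defined. */
static int demo_host_only_init(void)
{
	return 1;
}
#endif

/* Placed outside the guard, so both builds get a definition. */
static int demo_setup_caps(void)
{
	return 0;
}

/* Usage: the shared helper is callable regardless of the build flavor. */
static void demo_check_guard(void)
{
	assert(demo_setup_caps() == 0);
#ifndef DEMO_HYP_BUILD
	assert(demo_host_only_init() == 1);
#endif
}
```

Had `demo_setup_caps()` been left inside the guarded region, any caller compiled for the hypervisor build would reference a function that was never defined, which is the kind of error the commit reports.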
Since pKVM does not support kvmclock for protected VMs, a pVM can't find
out the TSC frequency from the host directly. So Linux pVMs find it out
by calibrating the TSC against a PIT or HPET timer emulated by the host,
which is inherently unreliable: if the host is under a heavy load during
this calibration, the calibration will produce inaccurate results or,
more likely, will just fail. And if the calibration fails, the guest
will deem the TSC unstable (whereas actually it is the timer that is
inaccurate, while the TSC is accurate, since it is native), and will
fall back to using a jiffies-based clocksource instead of TSC, thus
requiring periodic timer interrupts and thus causing permanently high
CPU usage as long as the pVM is running.

Furthermore, on PTL (not on older CPUs) for some reason this problem
(failed TSC calibration) is observed all the time, even if the system is
not under a high load when the pVM starts.

So to avoid this problem and yet avoid special changes on the guest
side, enforce exposing the TSC frequency info to the pVM via the CPUID
leaf 0x15, so that the pVM doesn't need to calibrate the TSC.

In the future this enforcement might be also reused for security
purposes, as a part of providing secure TSC for pVMs (which pKVM doesn't
provide yet).

Signed-off-by: Dmytro Maluka <dmaluka@google.com>
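
For context on why leaf 0x15 is enough for the guest, here is a sketch of how a TSC frequency can be derived from that leaf. The register layout follows the Intel SDM (EAX is the ratio denominator, EBX the numerator, ECX the core crystal clock in Hz); the function name and the sample register values are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * CPUID leaf 0x15: EAX = denominator and EBX = numerator of the
 * TSC / core crystal clock ratio, ECX = crystal frequency in Hz.
 * Returns 0 when the leaf does not enumerate enough to derive a frequency.
 */
static uint64_t demo_tsc_hz_from_cpuid15(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
	if (eax == 0 || ebx == 0 || ecx == 0)
		return 0;
	return (uint64_t)ecx * ebx / eax;
}

/* Usage with made-up values: a 38.4 MHz crystal and a 125/2 ratio. */
static void demo_check_tsc(void)
{
	assert(demo_tsc_hz_from_cpuid15(2, 125, 38400000) == 2400000000ULL);
	assert(demo_tsc_hz_from_cpuid15(2, 125, 0) == 0);
}
```

A guest that can compute the frequency this way has no need to calibrate against an emulated timer, which is the failure mode the commit message describes.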
When pKVM handles host hypercalls, it allocates an output structure on
the stack, fills it during hypercall handling, and finally uses
pkvm_hc_set_output() to map the data to the proper host vCPU registers
so the host can read it back.

However, errors may occur during hypercall handling that prevent the
output data from being set. Because pkvm_handle_host_hypercall()
unconditionally copies the 'out' structure to the vCPU registers,
uninitialized hyp stack (which potentially holds sensitive data) can be
leaked to the host.

To prevent this hypervisor stack leak, zero-initialize
pkvm_hc_data 'out'.

Fixes: b0738f5 ("pKVM: x86: Return hypercall outputs to the host")

Signed-off-by: Grzegorz Jaszczyk <jaszczyk@google.com>
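
The leak and its fix can be sketched in plain C. The struct, the handler, and the dispatcher below are hypothetical stand-ins for pkvm_hc_data and the hypercall path, not the real code; only the pattern (zero-initialize, then copy unconditionally) is taken from the commit message.

```c
#include <assert.h>
#include <string.h>

#define DEMO_NR_OUT 4

/* Hypothetical stand-in for pkvm_hc_data. */
struct demo_hc_data {
	unsigned long data[DEMO_NR_OUT];
};

/* Hypothetical handler: fills *out only on success. */
static int demo_handle(int fail, struct demo_hc_data *out)
{
	if (fail)
		return -1;	/* error path: out is left untouched */
	out->data[0] = 0x1234;
	return 0;
}

/* Models the dispatcher: out is always copied to the "registers". */
static int demo_dispatch(int fail, unsigned long regs[DEMO_NR_OUT])
{
	struct demo_hc_data out = {0};	/* the fix: no stale stack bytes */
	int ret = demo_handle(fail, &out);

	memcpy(regs, out.data, sizeof(out.data));
	return ret;
}

/* Usage: even on the error path the caller only ever sees zeros. */
static void demo_check_dispatch(void)
{
	unsigned long regs[DEMO_NR_OUT] = { 1, 1, 1, 1 };

	assert(demo_dispatch(1, regs) == -1);
	assert(regs[0] == 0 && regs[DEMO_NR_OUT - 1] == 0);
	assert(demo_dispatch(0, regs) == 0);
	assert(regs[0] == 0x1234);
}
```

Without the `= {0}` initializer, the error path would copy whatever happened to be on the stack into `regs`.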
Since pKVM enforces the host's CPUID values, calculate fpsize from what
the hardware supports rather than from what the vCPU reports.

Without this, pKVM fails to boot a VM on a machine with Intel AMX, because
crosvm does not request that feature.

Fixes: 58f48d1 ("KVM: pVMX: Add new fpstate memory for xfd")

Signed-off-by: Masanori Misono <m.misono760@gmail.com>
 int ret;

-fpsize = PAGE_ALIGN(vcpu->arch.guest_fpu.fpstate->size +
+fpsize = PAGE_ALIGN(fpu_user_cfg.max_size +
Contributor

It seems I didn't fully understand the root cause. I don't have a machine with the Intel AMX feature, but if the guest CPUID has the XFD feature enabled, I would expect vcpu->arch.guest_fpu.fpstate->size to be set by the host KVM via kvm_check_cpuid() -> fpu_enable_guest_xfd_features() -> __xfd_enable_feature() -> fpstate_realloc(). Is that what you see on your side?

But, as you mentioned in the commit message, crosvm doesn't request the XFD feature for the guest, so this function will not be called because (vcpu->arch.guest_fpu.xfeatures & XFEATURE_MASK_USER_DYNAMIC) == false.
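
The condition in question is a plain bitmask test. A minimal sketch, assuming the upstream Linux definition where XFEATURE_MASK_USER_DYNAMIC is the AMX tile data bit (bit 18); the function name and the guest mask value are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 18 is the AMX tile data component (assumed to be the only dynamic feature). */
#define DEMO_MASK_USER_DYNAMIC (UINT64_C(1) << 18)

static bool demo_has_dynamic_features(uint64_t xfeatures)
{
	return (xfeatures & DEMO_MASK_USER_DYNAMIC) != 0;
}

/* Usage: a hypothetical guest mask without AMX versus a mask with bit 18 set. */
static void demo_check_mask(void)
{
	assert(!demo_has_dynamic_features(0x7));	/* x87, SSE, AVX only */
	assert(demo_has_dynamic_features(0x40000));	/* bit 18 set */
}
```

When the test is false for the guest's feature mask, any code gated on it is skipped, which is the situation the comment describes.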

Author

On my machine, the pkvm_vcpu_after_set_cpuid hypercall fails because __xfd_enable_feature() fails. AFAICT, in the hypercall handler, pkvm_enforce_cpuid() populates the actual CPU's CPUID entries. Then pkvm_vcpu_after_set_cpuid() calls kvm_set_cpuid() -> kvm_check_cpuid() -> fpu_enable_guest_xfd_features() -> __xfd_enable_feature(), which returns ENOMEM.

> so this function will not be called because (vcpu->arch.guest_fpu.xfeatures & XFEATURE_MASK_USER_DYNAMIC) == false.

Yes, pkvm_vcpu_realloc_fpstate() is not called because of this. On my machine, I get:

[  138.684348] pkvm_host: [pkvm] vcpu->arch.guest_fpu.xfeatures & XFEATURE_MASK_USER_DYNAMIC = 0
[  138.684351] pkvm_host: [pkvm] vcpu->arch.guest_fpu.fpstate->size = 2560
[  138.684353] pkvm_host: [pkvm] fpu_user_cfg.max_features & XFEATURE_MASK_USER_DYNAMIC = 0x40000
[  138.684353] pkvm_host: [pkvm] fpu_user_cfg.max_size = 11008
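
To put the two logged sizes side by side, the sketch below page-aligns each of them. It assumes 4 KiB pages and leaves out the extra term that the truncated PAGE_ALIGN() expression in the diff adds, so the real totals are somewhat larger; the helper names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4 KiB pages, as on x86. */
#define DEMO_PAGE_SIZE 4096UL

/* Round size up to the next multiple of the page size, like PAGE_ALIGN(). */
static size_t demo_page_align(size_t size)
{
	return (size + DEMO_PAGE_SIZE - 1) & ~(DEMO_PAGE_SIZE - 1);
}

/* Number of whole pages needed to hold size bytes. */
static size_t demo_pages_needed(size_t size)
{
	return demo_page_align(size) / DEMO_PAGE_SIZE;
}

/* Usage with the sizes from the log above. */
static void demo_check_sizes(void)
{
	assert(demo_page_align(2560) == 4096);		/* fpstate->size: 1 page */
	assert(demo_page_align(11008) == 12288);	/* max_size: 3 pages */
	assert(demo_pages_needed(11008) == 3);
}
```

Under these assumptions, a buffer sized from fpstate->size holds one page, while the state for all host-supported features needs three.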

Contributor

So it looks like the CPUID entries that pkvm_enforce_cpuid() populates from the actual CPU (leaf 0xd and its subleaves) are not the same as the CPUID entries set by crosvm?

If so, npVM should be fine?

Author

> So it looks like the CPUID entries that pkvm_enforce_cpuid() populates from the actual CPU (leaf 0xd and its subleaves) are not the same as the CPUID entries set by crosvm?

That is my understanding. Alternatively, enforce_cpuid() could respect crosvm's XFD configuration.

> If so, npVM should be fine?

My commit message is ambiguous, but I see this issue with a pVM. I just confirmed that, as you said, an npVM works fine without this change.

Contributor

> That is my understanding. Alternatively, enforce_cpuid() could respect crosvm's xfd configuration.

This seems to be a better way.

> My commit message is ambiguous but I have this issue for pVM. I just confirmed that npVM works fine without this change as you said.

Thanks for confirming this.

Author

> This seems to be a better way.

As this patch fixes my issue, I don't plan to work on this for the moment. Please feel free to discard or adopt this change in any way.

4 participants