diff options
author | Linus Torvalds | 2020-10-23 11:17:56 -0700 |
---|---|---|
committer | Linus Torvalds | 2020-10-23 11:17:56 -0700 |
commit | f9a705ad1c077ec2872c641f0db9c0d5b4a097bb (patch) | |
tree | 7f5d18d74f700be5bcf72ec5f4955f016eac9ab9 /Documentation/virt | |
parent | 9313f8026328d0309d093f6774be4b8f5340c0e5 (diff) | |
parent | 29cf0f5007a215b51feb0ae25ca5353480d53ead (diff) |
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"For x86, there is a new alternative and (in the future) more scalable
implementation of extended page tables that does not need a reverse
map from guest physical addresses to host physical addresses.
For now it is disabled by default because it is still lacking a few of
the existing MMU's bells and whistles. However it is a very solid
piece of work and it is already available for people to hammer on it.
Other updates:
ARM:
- New page table code for both hypervisor and guest stage-2
- Introduction of a new EL2-private host context
- Allow EL2 to have its own private per-CPU variables
- Support of PMU event filtering
- Complete rework of the Spectre mitigation
PPC:
- Fix for running nested guests with in-kernel IRQ chip
- Fix race condition causing occasional host hard lockup
- Minor cleanups and bugfixes
x86:
- allow trapping unknown MSRs to userspace
- allow userspace to force #GP on specific MSRs
- INVPCID support on AMD
- nested AMD cleanup, on demand allocation of nested SVM state
- hide PV MSRs and hypercalls for features not enabled in CPUID
- new test for MSR_IA32_TSC writes from host and guest
- cleanups: MMU, CPUID, shared MSRs
- LAPIC latency optimizations ad bugfixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (232 commits)
kvm: x86/mmu: NX largepage recovery for TDP MMU
kvm: x86/mmu: Don't clear write flooding count for direct roots
kvm: x86/mmu: Support MMIO in the TDP MMU
kvm: x86/mmu: Support write protection for nesting in tdp MMU
kvm: x86/mmu: Support disabling dirty logging for the tdp MMU
kvm: x86/mmu: Support dirty logging for the TDP MMU
kvm: x86/mmu: Support changed pte notifier in tdp MMU
kvm: x86/mmu: Add access tracking for tdp_mmu
kvm: x86/mmu: Support invalidate range MMU notifier for TDP MMU
kvm: x86/mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU
kvm: x86/mmu: Add TDP MMU PF handler
kvm: x86/mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg
kvm: x86/mmu: Support zapping SPTEs in the TDP MMU
KVM: Cache as_id in kvm_memory_slot
kvm: x86/mmu: Add functions to handle changed TDP SPTEs
kvm: x86/mmu: Allocate and free TDP MMU roots
kvm: x86/mmu: Init / Uninit the TDP MMU
kvm: x86/mmu: Introduce tdp_iter
KVM: mmu: extract spte.h and spte.c
KVM: mmu: Separate updating a PTE from kvm_set_pte_rmapp
...
Diffstat (limited to 'Documentation/virt')
-rw-r--r-- | Documentation/virt/kvm/api.rst | 216 | ||||
-rw-r--r-- | Documentation/virt/kvm/cpuid.rst | 88 | ||||
-rw-r--r-- | Documentation/virt/kvm/devices/vcpu.rst | 57 |
3 files changed, 304 insertions, 57 deletions
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 1f26d83e6b16..36d5f1f3c6dd 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -4498,11 +4498,14 @@ Currently, the following list of CPUID leaves are returned: - HYPERV_CPUID_ENLIGHTMENT_INFO - HYPERV_CPUID_IMPLEMENT_LIMITS - HYPERV_CPUID_NESTED_FEATURES + - HYPERV_CPUID_SYNDBG_VENDOR_AND_MAX_FUNCTIONS + - HYPERV_CPUID_SYNDBG_INTERFACE + - HYPERV_CPUID_SYNDBG_PLATFORM_CAPABILITIES HYPERV_CPUID_NESTED_FEATURES leaf is only exposed when Enlightened VMCS was enabled on the corresponding vCPU (KVM_CAP_HYPERV_ENLIGHTENED_VMCS). -Userspace invokes KVM_GET_SUPPORTED_CPUID by passing a kvm_cpuid2 structure +Userspace invokes KVM_GET_SUPPORTED_HV_CPUID by passing a kvm_cpuid2 structure with the 'nent' field indicating the number of entries in the variable-size array 'entries'. If the number of entries is too low to describe all Hyper-V feature leaves, an error (E2BIG) is returned. If the number is more or equal @@ -4704,6 +4707,106 @@ KVM_PV_VM_VERIFY Verify the integrity of the unpacked image. Only if this succeeds, KVM is allowed to start protected VCPUs. +4.126 KVM_X86_SET_MSR_FILTER +---------------------------- + +:Capability: KVM_X86_SET_MSR_FILTER +:Architectures: x86 +:Type: vm ioctl +:Parameters: struct kvm_msr_filter +:Returns: 0 on success, < 0 on error + +:: + + struct kvm_msr_filter_range { + #define KVM_MSR_FILTER_READ (1 << 0) + #define KVM_MSR_FILTER_WRITE (1 << 1) + __u32 flags; + __u32 nmsrs; /* number of msrs in bitmap */ + __u32 base; /* MSR index the bitmap starts at */ + __u8 *bitmap; /* a 1 bit allows the operations in flags, 0 denies */ + }; + + #define KVM_MSR_FILTER_MAX_RANGES 16 + struct kvm_msr_filter { + #define KVM_MSR_FILTER_DEFAULT_ALLOW (0 << 0) + #define KVM_MSR_FILTER_DEFAULT_DENY (1 << 0) + __u32 flags; + struct kvm_msr_filter_range ranges[KVM_MSR_FILTER_MAX_RANGES]; + }; + +flags values for ``struct kvm_msr_filter_range``: + +``KVM_MSR_FILTER_READ`` + + Filter read accesses to MSRs using the given bitmap. A 0 in the bitmap + indicates that a read should immediately fail, while a 1 indicates that + a read for a particular MSR should be handled regardless of the default + filter action. + +``KVM_MSR_FILTER_WRITE`` + + Filter write accesses to MSRs using the given bitmap. A 0 in the bitmap + indicates that a write should immediately fail, while a 1 indicates that + a write for a particular MSR should be handled regardless of the default + filter action. + +``KVM_MSR_FILTER_READ | KVM_MSR_FILTER_WRITE`` + + Filter both read and write accesses to MSRs using the given bitmap. A 0 + in the bitmap indicates that both reads and writes should immediately fail, + while a 1 indicates that reads and writes for a particular MSR are not + filtered by this range. + +flags values for ``struct kvm_msr_filter``: + +``KVM_MSR_FILTER_DEFAULT_ALLOW`` + + If no filter range matches an MSR index that is getting accessed, KVM will + fall back to allowing access to the MSR. + +``KVM_MSR_FILTER_DEFAULT_DENY`` + + If no filter range matches an MSR index that is getting accessed, KVM will + fall back to rejecting access to the MSR. In this mode, all MSRs that should + be processed by KVM need to explicitly be marked as allowed in the bitmaps. + +This ioctl allows user space to define up to 16 bitmaps of MSR ranges to +specify whether a certain MSR access should be explicitly filtered for or not. + +If this ioctl has never been invoked, MSR accesses are not guarded and the +default KVM in-kernel emulation behavior is fully preserved. + +Calling this ioctl with an empty set of ranges (all nmsrs == 0) disables MSR +filtering. In that mode, ``KVM_MSR_FILTER_DEFAULT_DENY`` is invalid and causes +an error. + +As soon as the filtering is in place, every MSR access is processed through +the filtering except for accesses to the x2APIC MSRs (from 0x800 to 0x8ff); +x2APIC MSRs are always allowed, independent of the ``default_allow`` setting, +and their behavior depends on the ``X2APIC_ENABLE`` bit of the APIC base +register. + +If a bit is within one of the defined ranges, read and write accesses are +guarded by the bitmap's value for the MSR index if the kind of access +is included in the ``struct kvm_msr_filter_range`` flags. If no range +cover this particular access, the behavior is determined by the flags +field in the kvm_msr_filter struct: ``KVM_MSR_FILTER_DEFAULT_ALLOW`` +and ``KVM_MSR_FILTER_DEFAULT_DENY``. + +Each bitmap range specifies a range of MSRs to potentially allow access on. +The range goes from MSR index [base .. base+nmsrs]. The flags field +indicates whether reads, writes or both reads and writes are filtered +by setting a 1 bit in the bitmap for the corresponding MSR index. + +If an MSR access is not permitted through the filtering, it generates a +#GP inside the guest. When combined with KVM_CAP_X86_USER_SPACE_MSR, that +allows user space to deflect and potentially handle various MSR accesses +into user space. + +If a vCPU is in running state while this ioctl is invoked, the vCPU may +experience inconsistent filtering behavior on MSR accesses. + 5. The kvm_run structure ======================== @@ -4869,14 +4972,13 @@ to the byte array. .. note:: - For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR and - KVM_EXIT_EPR the corresponding - -operations are complete (and guest state is consistent) only after userspace -has re-entered the kernel with KVM_RUN. The kernel side will first finish -incomplete operations and then check for pending signals. Userspace -can re-enter the guest with an unmasked signal pending to complete -pending operations. + For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR, + KVM_EXIT_EPR, KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR the corresponding + operations are complete (and guest state is consistent) only after userspace + has re-entered the kernel with KVM_RUN. The kernel side will first finish + incomplete operations and then check for pending signals. Userspace + can re-enter the guest with an unmasked signal pending to complete + pending operations. :: @@ -5165,6 +5267,44 @@ if it decides to decode and emulate the instruction. :: + /* KVM_EXIT_X86_RDMSR / KVM_EXIT_X86_WRMSR */ + struct { + __u8 error; /* user -> kernel */ + __u8 pad[7]; + __u32 reason; /* kernel -> user */ + __u32 index; /* kernel -> user */ + __u64 data; /* kernel <-> user */ + } msr; + +Used on x86 systems. When the VM capability KVM_CAP_X86_USER_SPACE_MSR is +enabled, MSR accesses to registers that would invoke a #GP by KVM kernel code +will instead trigger a KVM_EXIT_X86_RDMSR exit for reads and KVM_EXIT_X86_WRMSR +exit for writes. + +The "reason" field specifies why the MSR trap occurred. User space will only +receive MSR exit traps when a particular reason was requested during through +ENABLE_CAP. Currently valid exit reasons are: + + KVM_MSR_EXIT_REASON_UNKNOWN - access to MSR that is unknown to KVM + KVM_MSR_EXIT_REASON_INVAL - access to invalid MSRs or reserved bits + KVM_MSR_EXIT_REASON_FILTER - access blocked by KVM_X86_SET_MSR_FILTER + +For KVM_EXIT_X86_RDMSR, the "index" field tells user space which MSR the guest +wants to read. To respond to this request with a successful read, user space +writes the respective data into the "data" field and must continue guest +execution to ensure the read data is transferred into guest register state. + +If the RDMSR request was unsuccessful, user space indicates that with a "1" in +the "error" field. This will inject a #GP into the guest when the VCPU is +executed again. + +For KVM_EXIT_X86_WRMSR, the "index" field tells user space which MSR the guest +wants to write. Once finished processing the event, user space must continue +vCPU execution. If the MSR write was unsuccessful, user space also sets the +"error" field to "1". + +:: + /* Fix the size of the union. */ char padding[256]; }; @@ -5852,6 +5992,28 @@ controlled by the kvm module parameter halt_poll_ns. This capability allows the maximum halt time to specified on a per-VM basis, effectively overriding the module parameter for the target VM. +7.21 KVM_CAP_X86_USER_SPACE_MSR +------------------------------- + +:Architectures: x86 +:Target: VM +:Parameters: args[0] contains the mask of KVM_MSR_EXIT_REASON_* events to report +:Returns: 0 on success; -1 on error + +This capability enables trapping of #GP invoking RDMSR and WRMSR instructions +into user space. + +When a guest requests to read or write an MSR, KVM may not implement all MSRs +that are relevant to a respective system. It also does not differentiate by +CPU type. + +To allow more fine grained control over MSR handling, user space may enable +this capability. With it enabled, MSR accesses that match the mask specified in +args[0] and trigger a #GP event inside the guest by KVM will instead trigger +KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR exit notifications which user space +can then handle to implement model specific MSR handling and/or user notifications +to inform a user that an MSR was not handled. + 8. Other capabilities. ====================== @@ -6193,3 +6355,39 @@ distribution...) If this capability is available, then the CPNC and CPVC can be synchronized between KVM and userspace via the sync regs mechanism (KVM_SYNC_DIAG318). + +8.26 KVM_CAP_X86_USER_SPACE_MSR +------------------------------- + +:Architectures: x86 + +This capability indicates that KVM supports deflection of MSR reads and +writes to user space. It can be enabled on a VM level. If enabled, MSR +accesses that would usually trigger a #GP by KVM into the guest will +instead get bounced to user space through the KVM_EXIT_X86_RDMSR and +KVM_EXIT_X86_WRMSR exit notifications. + +8.25 KVM_X86_SET_MSR_FILTER +--------------------------- + +:Architectures: x86 + +This capability indicates that KVM supports that accesses to user defined MSRs +may be rejected. With this capability exposed, KVM exports new VM ioctl +KVM_X86_SET_MSR_FILTER which user space can call to specify bitmaps of MSR +ranges that KVM should reject access to. + +In combination with KVM_CAP_X86_USER_SPACE_MSR, this allows user space to +trap and emulate MSRs that are outside of the scope of KVM as well as +limit the attack surface on KVM's MSR emulation code. + + +8.26 KVM_CAP_ENFORCE_PV_CPUID +----------------------------- + +Architectures: x86 + +When enabled, KVM will disable paravirtual features provided to the +guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf +(0x40000001). Otherwise, a guest may use the paravirtual features +regardless of what has actually been exposed through the CPUID leaf. diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst index 9150e9d1c39b..7d81c0aa4a59 100644 --- a/Documentation/virt/kvm/cpuid.rst +++ b/Documentation/virt/kvm/cpuid.rst @@ -38,64 +38,64 @@ returns:: where ``flag`` is defined as below: -================================= =========== ================================ -flag value meaning -================================= =========== ================================ -KVM_FEATURE_CLOCKSOURCE 0 kvmclock available at msrs - 0x11 and 0x12 +================================== =========== ================================ +flag value meaning +================================== =========== ================================ +KVM_FEATURE_CLOCKSOURCE 0 kvmclock available at msrs + 0x11 and 0x12 -KVM_FEATURE_NOP_IO_DELAY 1 not necessary to perform delays - on PIO operations +KVM_FEATURE_NOP_IO_DELAY 1 not necessary to perform delays + on PIO operations -KVM_FEATURE_MMU_OP 2 deprecated +KVM_FEATURE_MMU_OP 2 deprecated -KVM_FEATURE_CLOCKSOURCE2 3 kvmclock available at msrs - 0x4b564d00 and 0x4b564d01 +KVM_FEATURE_CLOCKSOURCE2 3 kvmclock available at msrs + 0x4b564d00 and 0x4b564d01 -KVM_FEATURE_ASYNC_PF 4 async pf can be enabled by - writing to msr 0x4b564d02 +KVM_FEATURE_ASYNC_PF 4 async pf can be enabled by + writing to msr 0x4b564d02 -KVM_FEATURE_STEAL_TIME 5 steal time can be enabled by - writing to msr 0x4b564d03 +KVM_FEATURE_STEAL_TIME 5 steal time can be enabled by + writing to msr 0x4b564d03 -KVM_FEATURE_PV_EOI 6 paravirtualized end of interrupt - handler can be enabled by - writing to msr 0x4b564d04 +KVM_FEATURE_PV_EOI 6 paravirtualized end of interrupt + handler can be enabled by + writing to msr 0x4b564d04 -KVM_FEATURE_PV_UNHAULT 7 guest checks this feature bit - before enabling paravirtualized - spinlock support +KVM_FEATURE_PV_UNHALT 7 guest checks this feature bit + before enabling paravirtualized + spinlock support -KVM_FEATURE_PV_TLB_FLUSH 9 guest checks this feature bit - before enabling paravirtualized - tlb flush +KVM_FEATURE_PV_TLB_FLUSH 9 guest checks this feature bit + before enabling paravirtualized + tlb flush -KVM_FEATURE_ASYNC_PF_VMEXIT 10 paravirtualized async PF VM EXIT - can be enabled by setting bit 2 - when writing to msr 0x4b564d02 +KVM_FEATURE_ASYNC_PF_VMEXIT 10 paravirtualized async PF VM EXIT + can be enabled by setting bit 2 + when writing to msr 0x4b564d02 -KVM_FEATURE_PV_SEND_IPI 11 guest checks this feature bit - before enabling paravirtualized - sebd IPIs +KVM_FEATURE_PV_SEND_IPI 11 guest checks this feature bit + before enabling paravirtualized + send IPIs -KVM_FEATURE_POLL_CONTROL 12 host-side polling on HLT can - be disabled by writing - to msr 0x4b564d05. +KVM_FEATURE_POLL_CONTROL 12 host-side polling on HLT can + be disabled by writing + to msr 0x4b564d05. -KVM_FEATURE_PV_SCHED_YIELD 13 guest checks this feature bit - before using paravirtualized - sched yield. +KVM_FEATURE_PV_SCHED_YIELD 13 guest checks this feature bit + before using paravirtualized + sched yield. -KVM_FEATURE_ASYNC_PF_INT 14 guest checks this feature bit - before using the second async - pf control msr 0x4b564d06 and - async pf acknowledgment msr - 0x4b564d07. +KVM_FEATURE_ASYNC_PF_INT 14 guest checks this feature bit + before using the second async + pf control msr 0x4b564d06 and + async pf acknowledgment msr + 0x4b564d07. -KVM_FEATURE_CLOCSOURCE_STABLE_BIT 24 host will warn if no guest-side - per-cpu warps are expeced in - kvmclock -================================= =========== ================================ +KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24 host will warn if no guest-side + per-cpu warps are expected in + kvmclock +================================== =========== ================================ :: diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst index ca374d3fe085..2acec3b9ef65 100644 --- a/Documentation/virt/kvm/devices/vcpu.rst +++ b/Documentation/virt/kvm/devices/vcpu.rst @@ -25,8 +25,10 @@ Returns: ======= ======================================================== -EBUSY The PMU overflow interrupt is already set - -ENXIO The overflow interrupt not set when attempting to get it - -ENODEV PMUv3 not supported + -EFAULT Error reading interrupt number + -ENXIO PMUv3 not supported or the overflow interrupt not set + when attempting to get it + -ENODEV KVM_ARM_VCPU_PMU_V3 feature missing from VCPU -EINVAL Invalid PMU overflow interrupt number supplied or trying to set the IRQ number without using an in-kernel irqchip. @@ -45,9 +47,10 @@ all vcpus, while as an SPI it must be a separate number per vcpu. Returns: ======= ====================================================== + -EEXIST Interrupt number already used -ENODEV PMUv3 not supported or GIC not initialized - -ENXIO PMUv3 not properly configured or in-kernel irqchip not - configured as required prior to calling this attribute + -ENXIO PMUv3 not supported, missing VCPU feature or interrupt + number not set -EBUSY PMUv3 already initialized ======= ====================================================== @@ -55,6 +58,52 @@ Request the initialization of the PMUv3. If using the PMUv3 with an in-kernel virtual GIC implementation, this must be done after initializing the in-kernel irqchip. +1.3 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_FILTER +----------------------------------------- + +:Parameters: in kvm_device_attr.addr the address for a PMU event filter is a + pointer to a struct kvm_pmu_event_filter + +:Returns: + + ======= ====================================================== + -ENODEV PMUv3 not supported or GIC not initialized + -ENXIO PMUv3 not properly configured or in-kernel irqchip not + configured as required prior to calling this attribute + -EBUSY PMUv3 already initialized + -EINVAL Invalid filter range + ======= ====================================================== + +Request the installation of a PMU event filter described as follows:: + + struct kvm_pmu_event_filter { + __u16 base_event; + __u16 nevents; + + #define KVM_PMU_EVENT_ALLOW 0 + #define KVM_PMU_EVENT_DENY 1 + + __u8 action; + __u8 pad[3]; + }; + +A filter range is defined as the range [@base_event, @base_event + @nevents), +together with an @action (KVM_PMU_EVENT_ALLOW or KVM_PMU_EVENT_DENY). The +first registered range defines the global policy (global ALLOW if the first +@action is DENY, global DENY if the first @action is ALLOW). Multiple ranges +can be programmed, and must fit within the event space defined by the PMU +architecture (10 bits on ARMv8.0, 16 bits from ARMv8.1 onwards). + +Note: "Cancelling" a filter by registering the opposite action for the same +range doesn't change the default action. For example, installing an ALLOW +filter for event range [0:10) as the first filter and then applying a DENY +action for the same range will leave the whole range as disabled. + +Restrictions: Event 0 (SW_INCR) is never filtered, as it doesn't count a +hardware event. Filtering event 0x1E (CHAIN) has no effect either, as it +isn't strictly speaking an event. Filtering the cycle counter is possible +using event 0x11 (CPU_CYCLES). + 2. GROUP: KVM_ARM_VCPU_TIMER_CTRL ================================= |