- Apr 07, 2022
-
-
Oliver Upton authored
Unfortunately, there is no guarantee that KVM was able to instantiate a debugfs directory for a particular VM. To that end, KVM shouldn't even attempt to create new debugfs files in this case. If the specified parent dentry is NULL, debugfs_create_file() will instantiate files at the root of debugfs. For arm64, it is possible to create the vgic-state file outside of a VM directory, the file is not cleaned up when a VM is destroyed. Nonetheless, the corresponding struct kvm is freed when the VM is destroyed. Nip the problem in the bud for all possible errant debugfs file creations by initializing kvm->debugfs_dentry to -ENOENT. In so doing, debugfs_create_file() will fail instead of creating the file in the root directory. Cc: stable@kernel.org Fixes: 929f45e3 ("kvm: no need to check return value of debugfs_create functions") Signed-off-by:
Oliver Upton <oupton@google.com> Signed-off-by:
Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220406235615.1447180-2-oupton@google.com
-
- Apr 06, 2022
-
-
Paolo Bonzini authored
kvm_vcpu_release() will call kvm_dirty_ring_free(), freeing ring->dirty_gfns and setting it to NULL. Afterwards, it calls kvm_arch_vcpu_destroy(). However, if closing the file descriptor races with KVM_RUN in such away that vcpu->arch.st.preempted == 0, the following call stack leads to a NULL pointer dereference in kvm_dirty_run_push(): mark_page_dirty_in_slot+0x192/0x270 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3171 kvm_steal_time_set_preempted arch/x86/kvm/x86.c:4600 [inline] kvm_arch_vcpu_put+0x34e/0x5b0 arch/x86/kvm/x86.c:4618 vcpu_put+0x1b/0x70 arch/x86/kvm/../../../virt/kvm/kvm_main.c:211 vmx_free_vcpu+0xcb/0x130 arch/x86/kvm/vmx/vmx.c:6985 kvm_arch_vcpu_destroy+0x76/0x290 arch/x86/kvm/x86.c:11219 kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline] The fix is to release the dirty page ring after kvm_arch_vcpu_destroy has run. Reported-by:
Qiuhao Li <qiuhao@sysec.org> Reported-by:
Gaoning Pan <pgn@zju.edu.cn> Reported-by:
Yongkang Jia <kangel@zju.edu.cn> Cc: stable@vger.kernel.org Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Apr 02, 2022
-
-
David Woodhouse authored
It isn't OK to cache the dirty status of a page in internal structures for an indefinite period of time. Any time a vCPU exits the run loop to userspace might be its last; the VMM might do its final check of the dirty log, flush the last remaining dirty pages to the destination and complete a live migration. If we have internal 'dirty' state which doesn't get flushed until the vCPU is finally destroyed on the source after migration is complete, then we have lost data because that will escape the final copy. This problem already exists with the use of kvm_vcpu_unmap() to mark pages dirty in e.g. VMX nesting. Note that the actual Linux MM already considers the page to be dirty since we have a writeable mapping of it. This is just about the KVM dirty logging. For the nesting-style use cases (KVM_GUEST_USES_PFN) we will need to track which gfn_to_pfn_caches have been used and explicitly mark the corresponding pages dirty before returning to userspace. But we would have needed external tracking of that anyway, rather than walking the full list of GPCs to find those belonging to this vCPU which are dirty. So let's rely *solely* on that external tracking, and keep it simple rather than laying a tempting trap for callers to fall into. Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-3-dwmw2@infradead.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Replace the guest_uses_pa and kernel_map booleans in the PFN cache code with a unified enum/bitmask. Using explicit names makes it easier to review and audit call sites. Opportunistically add a WARN to prevent passing garbage; instantating a cache without declaring its usage is either buggy or pointless. Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-2-dwmw2@infradead.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Don't actually set a request bit in vcpu->requests when making a request purely to force a vCPU to exit the guest. Logging a request but not actually consuming it would cause the vCPU to get stuck in an infinite loop during KVM_RUN because KVM would see the pending request and bail from VM-Enter to service the request. Note, it's currently impossible for KVM to set KVM_REQ_GPC_INVALIDATE as nothing in KVM is wired up to set guest_uses_pa=true. But, it'd be all too easy for arch code to introduce use of kvm_gfn_to_pfn_cache_init() without implementing handling of the request, especially since getting test coverage of MMU notifier interaction with specific KVM features usually requires a directed test. Opportunistically rename gfn_to_pfn_cache_invalidate_start()'s wake_vcpus to evict_vcpus. The purpose of the request is to get vCPUs out of guest mode, it's supposed to _avoid_ waking vCPUs that are blocking. Opportunistically rename KVM_REQ_GPC_INVALIDATE to be more specific as to what it wants to accomplish, and to genericize the name so that it can used for similar but unrelated scenarios, should they arise in the future. Add a comment and documentation to explain why the "no action" request exists. Add compile-time assertions to help detect improper usage. Use the inner assertless helper in the one s390 path that makes requests without a hardcoded request. Cc: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20220223165302.3205276-1-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
David Woodhouse authored
If the cache's user host virtual address becomes invalid, there is still a path from kvm_gfn_to_pfn_cache_refresh() where __release_gpc() could release the pfn but the gpc->pfn field has not been overwritten with an error value. If this happens, kvm_gfn_to_pfn_cache_unmap will call put_page again on the same page. Cc: stable@vger.kernel.org Fixes: 982ed0de ("KVM: Reinstate gfn_to_pfn_cache with invalidation support") Signed-off-by:
David Woodhouse <dwmw2@infradead.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Mar 29, 2022
-
-
David Matlack authored
This reverts commit 3d3aab1b . Now that the KVM module's lifetime is tied to kvm.users_count, there is no need to also tie it's lifetime to the lifetime of the VM and vCPU file descriptors. Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
David Matlack <dmatlack@google.com> Message-Id: <20220303183328.1499189-3-dmatlack@google.com> Reviewed-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
David Matlack authored
Tie the lifetime the KVM module to the lifetime of each VM via kvm.users_count. This way anything that grabs a reference to the VM via kvm_get_kvm() cannot accidentally outlive the KVM module. Prior to this commit, the lifetime of the KVM module was tied to the lifetime of /dev/kvm file descriptors, VM file descriptors, and vCPU file descriptors by their respective file_operations "owner" field. This approach is insufficient because references grabbed via kvm_get_kvm() do not prevent closing any of the aforementioned file descriptors. This fixes a long standing theoretical bug in KVM that at least affects async page faults. kvm_setup_async_pf() grabs a reference via kvm_get_kvm(), and drops it in an asynchronous work callback. Nothing prevents the VM file descriptor from being closed and the KVM module from being unloaded before this callback runs. Fixes: af585b92 ("KVM: Halt vcpu if page it tries to access is swapped out") Fixes: 3d3aab1b ("KVM: set owner of cpu and vm file operations") Cc: stable@vger.kernel.org Suggested-by:
Ben Gardon <bgardon@google.com> [ Based on a patch from Ben implemented for Google's kernel. ] Signed-off-by:
David Matlack <dmatlack@google.com> Message-Id: <20220303183328.1499189-2-dmatlack@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Mar 11, 2022
-
-
Guo Ren authored
Current riscv doesn't support the 32bit KVM API. Let's make it clear by not selecting KVM_COMPAT. Signed-off-by:
Guo Ren <guoren@linux.alibaba.com> Signed-off-by:
Guo Ren <guoren@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Anup Patel <anup@brainfault.org> Reviewed-by:
Anup Patel <anup@brainfault.org> Signed-off-by:
Anup Patel <anup@brainfault.org>
-
- Mar 08, 2022
-
-
Paolo Bonzini authored
Allocations whose size is related to the memslot size can be arbitrarily large. Do not use kvzalloc/kvcalloc, as those are limited to "not crazy" sizes that fit in 32 bits. Cc: stable@vger.kernel.org Fixes: 7661809d ("mm: don't allow oversized kvmalloc() calls") Reviewed-by:
David Hildenbrand <david@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Mar 01, 2022
-
-
Sean Christopherson authored
Remove the generic kvm_reload_remote_mmus() and open code its functionality into the two x86 callers. x86 is (obviously) the only architecture that uses the hook, and is also the only architecture that uses KVM_REQ_MMU_RELOAD in a way that's consistent with the name. That will change in a future patch, as x86's usage when zapping a single shadow page x86 doesn't actually _need_ to reload all vCPUs' MMUs, only MMUs whose root is being zapped actually need to be reloaded. s390 also uses KVM_REQ_MMU_RELOAD, but for a slightly different purpose. Drop the generic code in anticipation of implementing s390 and x86 arch specific requests, which will allow dropping KVM_REQ_MMU_RELOAD entirely. Opportunistically reword the x86 TDP MMU comment to avoid making references to functions (and requests!) when possible, and to remove the rather ambiguous "this". No functional change intended. Cc: Ben Gardon <bgardon@google.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Ben Gardon <bgardon@google.com> Message-Id: <20220225182248.3812651-4-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Feb 25, 2022
-
-
Vipin Sharma authored
VM worker kthreads can linger in the VM process's cgroup for sometime after KVM terminates the VM process. KVM terminates the worker kthreads by calling kthread_stop() which waits on the 'exited' completion, triggered by exit_mm(), via mm_release(), in do_exit() during the kthread's exit. However, these kthreads are removed from the cgroup using the cgroup_exit() which happens after the exit_mm(). Therefore, A VM process can terminate in between the exit_mm() and cgroup_exit() calls, leaving only worker kthreads in the cgroup. Moving worker kthreads back to the original cgroup (kthreadd_task's cgroup) makes sure that the cgroup is empty as soon as the main VM process is terminated. Signed-off-by:
Vipin Sharma <vipinsh@google.com> Suggested-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20220222054848.563321-1-vipinsh@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Feb 17, 2022
-
-
Wanpeng Li authored
I saw the below splatting after the host suspended and resumed. WARNING: CPU: 0 PID: 2943 at kvm/arch/x86/kvm/../../../virt/kvm/kvm_main.c:5531 kvm_resume+0x2c/0x30 [kvm] CPU: 0 PID: 2943 Comm: step_after_susp Tainted: G W IOE 5.17.0-rc3+ #4 RIP: 0010:kvm_resume+0x2c/0x30 [kvm] Call Trace: <TASK> syscore_resume+0x90/0x340 suspend_devices_and_enter+0xaee/0xe90 pm_suspend.cold+0x36b/0x3c2 state_store+0x82/0xf0 kernfs_fop_write_iter+0x1b6/0x260 new_sync_write+0x258/0x370 vfs_write+0x33f/0x510 ksys_write+0xc9/0x160 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae lockdep_is_held() can return -1 when lockdep is disabled which triggers this warning. Let's use lockdep_assert_not_held() which can detect incorrect calls while holding a lock and it also avoids false negatives when lockdep is disabled. Signed-off-by:
Wanpeng Li <wanpengli@tencent.com> Message-Id: <1644920142-81249-...
-
- Feb 10, 2022
-
-
Jinrong Liang authored
The "struct kvm *kvm" parameter of kvm_make_vcpu_request() is not used, so remove it. No functional change intended. Signed-off-by:
Jinrong Liang <cloudliang@tencent.com> Message-Id: <20220125095909.38122-19-cloudliang@tencent.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Jan 28, 2022
-
-
Hou Wenlong authored
Fix the following false positive warning: ============================= WARNING: suspicious RCU usage 5.16.0-rc4+ #57 Not tainted ----------------------------- arch/x86/kvm/../../../virt/kvm/eventfd.c:484 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 3 locks held by fc_vcpu 0/330: #0: ffff8884835fc0b0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x88/0x6f0 [kvm] #1: ffffc90004c0bb68 (&kvm->srcu){....}-{0:0}, at: vcpu_enter_guest+0x600/0x1860 [kvm] #2: ffffc90004c0c1d0 (&kvm->irq_srcu){....}-{0:0}, at: kvm_notify_acked_irq+0x36/0x180 [kvm] stack backtrace: CPU: 26 PID: 330 Comm: fc_vcpu 0 Not tainted 5.16.0-rc4+ Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x44/0x57 kvm_notify_acked_gsi+0x6b/0x70 [kvm] kvm_notify_acked_irq+0x8d/0x180 [kvm] kvm_ioapic_update_eoi+0x92/0x240 [kvm] kvm_apic_set_eoi_accelerated+0x2a/0xe0 [kvm] handle_apic_eoi_induced+0x3d/0x60 [kvm_intel] vmx_handle_exit+0x19c/0x6a0 [kvm_intel] vcpu_enter_guest+0x66e/0x1860 [kvm] kvm_arch_vcpu_ioctl_run+0x438/0x7f0 [kvm] kvm_vcpu_ioctl+0x38a/0x6f0 [kvm] __x64_sys_ioctl+0x89/0xc0 do_syscall_64+0x3a/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Since kvm_unregister_irq_ack_notifier() does synchronize_srcu(&kvm->irq_srcu), kvm->irq_ack_notifier_list is protected by kvm->irq_srcu. In fact, kvm->irq_srcu SRCU read lock is held in kvm_notify_acked_irq(), making it a false positive warning. So use hlist_for_each_entry_srcu() instead of hlist_for_each_entry_rcu(). Reviewed-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Hou Wenlong <houwenlong93@linux.alibaba.com> Message-Id: <f98bac4f5052bad2c26df9ad50f7019e40434512.1643265976.git.houwenlong.hwl@antgroup.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Jan 26, 2022
-
-
Sean Christopherson authored
Revert a completely broken check on an "invalid" RIP in SVM's workaround for the DecodeAssists SMAP errata. kvm_vcpu_gfn_to_memslot() obviously expects a gfn, i.e. operates in the guest physical address space, whereas RIP is a virtual (not even linear) address. The "fix" worked for the problematic KVM selftest because the test identity mapped RIP. Fully revert the hack instead of trying to translate RIP to a GPA, as the non-SEV case is now handled earlier, and KVM cannot access guest page tables to translate RIP. This reverts commit e72436bc. Fixes: e72436bc ("KVM: SVM: avoid infinite loop on NPF from bad address") Reported-by:
Liam Merwick <liam.merwick@oracle.com> Cc: stable@vger.kernel.org Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Liam Merwick <liam.merwick@oracle.com> Message-Id: <20220120010719.711476-3-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Jan 24, 2022
-
-
Xianting Tian authored
The async parameter of hva_to_pfn_remapped() is not used, so remove it. Signed-off-by:
Xianting Tian <xianting.tian@linux.alibaba.com> Message-Id: <20220124020456.156386-1-xianting.tian@linux.alibaba.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Jan 19, 2022
-
-
Sean Christopherson authored
Move the seemingly generic block_vcpu_list from kvm_vcpu to vcpu_vmx, and rename the list and all associated variables to clarify that it tracks the set of vCPU that need to be poked on a posted interrupt to the wakeup vector. The list is not used to track _all_ vCPUs that are blocking, and the term "blocked" can be misleading as it may refer to a blocking condition in the host or the guest, where as the PI wakeup case is specifically for the vCPUs that are actively blocking from within the guest. No functional change intended. Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-7-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Remove kvm_vcpu.pre_pcpu as it no longer has any users. No functional change intended. Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-6-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Christian Borntraeger authored
Avoid warnings on s390 like [ 1801.980931] CPU: 12 PID: 117600 Comm: kworker/12:0 Tainted: G E 5.17.0-20220113.rc0.git0.32ce2abb03cf.300.fc35.s390x+next #1 [ 1801.980938] Workqueue: events irqfd_inject [kvm] [...] [ 1801.981057] Call Trace: [ 1801.981060] [<000003ff805f0f5c>] mark_page_dirty_in_slot+0xa4/0xb0 [kvm] [ 1801.981083] [<000003ff8060e9fe>] adapter_indicators_set+0xde/0x268 [kvm] [ 1801.981104] [<000003ff80613c24>] set_adapter_int+0x64/0xd8 [kvm] [ 1801.981124] [<000003ff805fb9aa>] kvm_set_irq+0xc2/0x130 [kvm] [ 1801.981144] [<000003ff805f8d86>] irqfd_inject+0x76/0xa0 [kvm] [ 1801.981164] [<0000000175e56906>] process_one_work+0x1fe/0x470 [ 1801.981173] [<0000000175e570a4>] worker_thread+0x64/0x498 [ 1801.981176] [<0000000175e5ef2c>] kthread+0x10c/0x110 [ 1801.981180] [<0000000175de73c8>] __ret_from_fork+0x40/0x58 [ 1801.981185] [<000000017698440a>] ret_from_fork+0xa/0x40 when writing to a guest from an irqfd worker as long as we do not have the dirty ring. Signed-off-by:
Christian Borntraeger <borntraeger@linux.ibm.com> Reluctantly-acked-by:
David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20220113122924.740496-1-borntraeger@linux.ibm.com> Fixes: 2efd61a6 ("KVM: Warn if mark_page_dirty() is called without an active vCPU") Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Jan 07, 2022
-
-
David Woodhouse authored
This can be used in two modes. There is an atomic mode where the cached mapping is accessed while holding the rwlock, and a mode where the physical address is used by a vCPU in guest mode. For the latter case, an invalidation will wake the vCPU with the new KVM_REQ_GPC_INVALIDATE, and the architecture will need to refresh any caches it still needs to access before entering guest mode again. Only one vCPU can be targeted by the wake requests; it's simple enough to make it wake all vCPUs or even a mask but I don't see a use case for that additional complexity right now. Invalidation happens from the invalidate_range_start MMU notifier, which needs to be able to sleep in order to wake the vCPU and wait for it. This means that revalidation potentially needs to "wait" for the MMU operation to complete and the invalidate_range_end notifier to be invoked. Like the vCPU when it takes a page fault in that period, we just spin — fixing that in a future patch by implementing an actual *wait* may be another part of shaving this particularly hirsute yak. As noted in the comments in the function itself, the only case where the invalidate_range_start notifier is expected to be called *without* being able to sleep is when the OOM reaper is killing the process. In that case, we expect the vCPU threads already to have exited, and thus there will be nothing to wake, and no reason to wait. So we clear the KVM_REQUEST_WAIT bit and send the request anyway, then complain loudly if there actually *was* anything to wake up. Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211210163625.2886-3-dwmw2@infradead.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
David Woodhouse authored
The various kvm_write_guest() and mark_page_dirty() functions must only ever be called in the context of an active vCPU, because if dirty ring tracking is enabled it may simply oops when kvm_get_running_vcpu() returns NULL for the vcpu and then kvm_dirty_ring_get() dereferences it. This oops was reported by "butt3rflyh4ck" <butterflyhuangxx@gmail.com> in https://lore.kernel.org/kvm/CAFcO6XOmoS7EacN_n6v4Txk7xL7iqRa2gABg3F7E3Naf5uG94g@mail.gmail.com/ That actual bug will be fixed under separate cover but this warning should help to prevent new ones from being added. Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211210163625.2886-2-dwmw2@infradead.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Dec 09, 2021
-
-
David Woodhouse authored
Splitting kvm_main.c out into smaller and better-organized files is slightly non-trivial when it involves editing a bunch of per-arch KVM makefiles. Provide virt/kvm/Makefile.kvm for them to include. Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Acked-by:
Marc Zyngier <maz@kernel.org> Message-Id: <20211121125451.9489-3-dwmw2@infradead.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
David Woodhouse authored
I'd like to make the build include dirty_ring.c based on whether the arch wants it or not. That's a whole lot simpler if there's a config symbol instead of doing it implicitly on KVM_DIRTY_LOG_PAGE_OFFSET being set to something non-zero. Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211121125451.9489-2-dwmw2@infradead.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Dec 08, 2021
-
-
Sean Christopherson authored
Add helpers to wake and query a blocking vCPU. In addition to providing nice names, the helpers reduce the probability of KVM neglecting to use kvm_arch_vcpu_get_wait(). No functional change intended. Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-20-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Calculate the halt-polling "stop" time using "start" instead of redoing ktime_get(). In practice, the numbers involved are in the noise (e.g., in the happy case where hardware correctly predicts do_halt_poll and there are no interrupts, "start" is probably only a few cycles old) and either approach is perfectly ok. But it's more precise to count any extra latency toward the halt-polling time. Reviewed-by:
David Matlack <dmatlack@google.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-17-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Jing Zhang authored
Add a "blocking" stat that userspace can use to detect the case where a vCPU is not being run because of an vCPU/guest action, e.g. HLT or WFS on x86, WFI on arm64, etc... Current guest/host/halt stats don't show this well, e.g. if a guest halts for a long period of time then the vCPU could could appear pathologically blocked due to a host condition, when in reality the vCPU has been put into a not-runnable state by the guest. Originally-by:
Cannon Matthews <cannonmatthews@google.com> Suggested-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
David Matlack <dmatlack@google.com> Signed-off-by:
Jing Zhang <jingzhangos@google.com> [sean: renamed stat to "blocking", massaged changelog] Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-16-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Factor out the "block" part of kvm_vcpu_halt() so that x86 can emulate non-halt wait/sleep/block conditions that should not be subjected to halt-polling. No functional change intended. Reviewed-by:
Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by:
David Matlack <dmatlack@google.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-15-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Rename kvm_vcpu_block() to kvm_vcpu_halt() in preparation for splitting the actual "block" sequences into a separate helper (to be named kvm_vcpu_block()). x86 will use the standalone block-only path to handle non-halt cases where the vCPU is not runnable. Rename block_ns to halt_ns to match the new function name. No functional change intended. Reviewed-by:
David Matlack <dmatlack@google.com> Reviewed-by:
Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-14-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Drop kvm_arch_vcpu_block_finish() now that all arch implementations are nops. No functional change intended. Acked-by:
Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by:
David Matlack <dmatlack@google.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-10-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Invoke the arch hooks for block+unblock if and only if KVM actually attempts to block the vCPU. The only non-nop implementation is on x86, specifically SVM's AVIC, and there is no need to put the AVIC prior to halt-polling; KVM x86's kvm_vcpu_has_events() will scour the full vIRR to find pending IRQs regardless of whether the AVIC is loaded/"running". The primary motivation is to allow future cleanup to split out "block" from "halt", but this is also likely a small performance boost on x86 SVM when halt-polling is successful. Adjust the post-block path to update "cur" after unblocking, i.e. include AVIC load time in halt_wait_ns and halt_wait_hist, so that the behavior is consistent. Moving just the pre-block arch hook would result in only the AVIC put latency being included in the halt_wait stats. There is no obvious evidence that one way or the other is correct, so just ensure KVM is consistent. Note, x86 has two separate paths for handling APICv with respect to vCPU blocking. VMX uses hooks in x86's vcpu_block(), while SVM uses the arch hooks in kvm_vcpu_block(). Prior to this path, the two paths were more or less functionally identical. That is very much not the case after this patch, as the hooks used by VMX _must_ fire before halt-polling. x86's entire mess will be cleaned up in future patches. Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-12-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Move the halt-polling "success" and histogram stats update into the dedicated helper to fix a discrepancy where the success/fail "time" stats consider polling successful so long as the wait is avoided, but the main "success" and histogram stats consider polling successful if and only if a wake event was detected by the halt-polling loop. Move halt_attempted_poll to the helper as well so that all the stats are updated in a single location. While it's a bit odd to update the stat well after the fact, practically speaking there's no meaningful advantage to updating before polling. Note, there is a functional change in addition to the success vs. fail change. The histogram updates previously called ktime_get() instead of using "cur". But that change is desirable as it means all the stats are now updated with the same polling time, and avoids the extra ktime_get(), which isn't expensive but isn't free either. Reviewed-by:
David Matlack <dmatlack@google.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-8-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Add a comment to document that halt-polling is considered successful even if the polling loop itself didn't detect a wake event, i.e. if a wake event was detect in the final kvm_vcpu_check_block(). Invert the param to update helper so that the helper is a dumb function that is "told" whether or not polling was successful, as opposed to determining success based on blocking behavior. Opportunistically tweak the params to the update helper to reduce the line length for the call site so that it fits on a single line, and so that the prototype conforms to the more traditional kernel style. No functional change intended. Reviewed-by:
David Matlack <dmatlack@google.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-7-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Don't update halt-polling stats if halt-polling wasn't attempted. This is a nop as @poll_ns is guaranteed to be '0' (poll_end == start); in a future patch (to move the histogram stats into the helper), it will avoid to avoid a discrepancy in what is considered a "successful" halt-poll. No functional change intended. Reviewed-by:
David Matlack <dmatlack@google.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-6-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Do not define/reference kvm_vcpu.wait if __KVM_HAVE_ARCH_WQP is true, and instead force the architecture (PPC) to define its own rcuwait object. Allowing common KVM to directly access vcpu->wait without a guard makes it all too easy to introduce potential bugs, e.g. kvm_vcpu_block(), kvm_vcpu_on_spin(), and async_pf_execute() all operate on vcpu->wait, not the result of kvm_arch_vcpu_get_wait(), and so may do the wrong thing for PPC. Due to PPC's shenanigans with respect to callbacks and waits (it switches to the virtual core's wait object at KVM_RUN!?!?), it's not clear whether or not this fixes any bugs. Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-5-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Wrap s390's halt_poll_max_steal with READ_ONCE and snapshot the result of kvm_arch_no_poll() in kvm_vcpu_block() to avoid a mostly-theoretical, largely benign bug on s390 where the result of kvm_arch_no_poll() could change due to userspace modifying halt_poll_max_steal while the vCPU is blocking. The bug is largely benign as it will either cause KVM to skip updating halt-polling times (no_poll toggles false=>true) or to update halt-polling times with a slightly flawed block_ns. Note, READ_ONCE is unnecessary in the current code, add it in case the arch hook is ever inlined, and to provide a hint that userspace can change the param at will. Fixes: 8b905d28 ("KVM: s390: provide kvm_arch_no_poll function") Reviewed-by:
Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-4-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
If we do have the vcpu mutex, as is the case if kvm_running_vcpu is set to the target vcpu of the kick, changes to vcpu->mode do not need atomic operations; cmpxchg is only needed _outside_ the mutex to ensure that the IN_GUEST_MODE->EXITING_GUEST_MODE change does not race with the vcpu thread going OUTSIDE_GUEST_MODE. Use this to optimize the case of a vCPU sending an interrupt to itself. Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Ben Gardon authored
In preparation for implementing in-place hugepage promotion, various functions will need to be called from zap_collapsible_spte_range, which has the const qualifier on its memslot argument. Propagate the const qualifier to the various functions which will be needed. This just serves to simplify the following patch. No functional change intended. Signed-off-by:
Ben Gardon <bgardon@google.com> Message-Id: <20211115234603.2908381-11-bgardon@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Allocate the "new" memslot for !DELETE memslot updates straight away instead of filling an intermediate on-stack object and forcing kvm_set_memslot() to juggle the allocation and do weird things like reuse the old memslot object in MOVE. In the MOVE case, this results in an "extra" memslot allocation due to allocating both the "new" slot and the "invalid" slot, but that's a temporary and not-huge allocation, and MOVE is a relatively rare memslot operation. Regarding MOVE, drop the open-coded management of the gfn tree with a call to kvm_replace_memslot(), which already handles the case where new->base_gfn != old->base_gfn. This is made possible by virtue of not having to copy the "new" memslot data after erasing the old memslot from the gfn tree. Using kvm_replace_memslot(), and more specifically not reusing the old memslot, means the MOVE case now does hva tree and hash list updates, but that's a small price to pay for simplifying the code and making MOVE align with all the other flavors of updates. The "extra" updates are firmly in the noise from a performance perspective, e.g. the "move (in)active area" selfttests show a (very, very) slight improvement. Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
-
Sean Christopherson authored
Initialize the "new" memslot in the !DELETE path only after the various sanity checks have passed. This will allow a future commit to allocate @new dynamically without having to copy a memslot, and without having to deal with freeing @new in error paths and in the "nothing to change" path that's hiding in the sanity checks. No functional change intended. Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <a084d0531ca3a826a7f861eb2b08b5d1c06ef265.1638817641.git.maciej.szmigiero@oracle.com>
-