- Jan 07, 2022
-
-
David Woodhouse authored
This can be used in two modes. There is an atomic mode where the cached mapping is accessed while holding the rwlock, and a mode where the physical address is used by a vCPU in guest mode. For the latter case, an invalidation will wake the vCPU with the new KVM_REQ_GPC_INVALIDATE, and the architecture will need to refresh any caches it still needs to access before entering guest mode again. Only one vCPU can be targeted by the wake requests; it's simple enough to make it wake all vCPUs or even a mask but I don't see a use case for that additional complexity right now. Invalidation happens from the invalidate_range_start MMU notifier, which needs to be able to sleep in order to wake the vCPU and wait for it. This means that revalidation potentially needs to "wait" for the MMU operation to complete and the invalidate_range_end notifier to be invoked. Like the vCPU when it takes a page fault in that period, we just spin — fixing that in a future patch by implementing an actual *wait* may be another part of shaving this particularly hirsute yak. As noted in the comments in the function itself, the only case where the invalidate_range_start notifier is expected to be called *without* being able to sleep is when the OOM reaper is killing the process. In that case, we expect the vCPU threads already to have exited, and thus there will be nothing to wake, and no reason to wait. So we clear the KVM_REQUEST_WAIT bit and send the request anyway, then complain loudly if there actually *was* anything to wake up. Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211210163625.2886-3-dwmw2@infradead.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
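A minimal sketch of the rwlock-protected (atomic-mode) usage pattern this describes; the kvm_gpc_check()/kvm_gpc_refresh() helper names and field layout here are illustrative assumptions, not necessarily the exact API the patch adds:

    /* Illustrative only: helper names and fields are assumptions. */
    static int write_via_gpc(struct gfn_to_pfn_cache *gpc,
                             const void *data, unsigned long len)
    {
        read_lock(&gpc->lock);
        while (!kvm_gpc_check(gpc, len)) {      /* hypothetical helper */
            read_unlock(&gpc->lock);
            /*
             * Revalidate the gpa->pfn mapping.  If an invalidation is
             * in flight, this effectively spins until the MMU notifier's
             * invalidate_range_end runs, as the changelog above notes.
             */
            if (kvm_gpc_refresh(gpc, len))      /* hypothetical helper */
                return -EFAULT;
            read_lock(&gpc->lock);
        }
        memcpy(gpc->khva, data, len);           /* safe while lock is held */
        read_unlock(&gpc->lock);
        return 0;
    }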
-
David Woodhouse authored
The various kvm_write_guest() and mark_page_dirty() functions must only ever be called in the context of an active vCPU, because if dirty ring tracking is enabled it may simply oops when kvm_get_running_vcpu() returns NULL for the vcpu and then kvm_dirty_ring_get() dereferences it. This oops was reported by "butt3rflyh4ck" <butterflyhuangxx@gmail.com> in https://lore.kernel.org/kvm/CAFcO6XOmoS7EacN_n6v4Txk7xL7iqRa2gABg3F7E3Naf5uG94g@mail.gmail.com/ That actual bug will be fixed under separate cover but this warning should help to prevent new ones from being added. Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211210163625.2886-2-dwmw2@infradead.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
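In rough shape, the guard being added looks like this (placement and exact form approximated from the description):

    static void mark_page_dirty_in_slot(struct kvm *kvm,
                                        const struct kvm_memory_slot *memslot,
                                        gfn_t gfn)
    {
        struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

    #ifdef CONFIG_HAVE_KVM_DIRTY_RING
        /* Dirty ring tracking requires an active vCPU context. */
        if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
            return;
    #endif
        /* ... dirty bitmap and/or dirty ring update follows ... */
    }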
-
- Dec 09, 2021
-
-
David Woodhouse authored
Splitting kvm_main.c out into smaller and better-organized files is slightly non-trivial when it involves editing a bunch of per-arch KVM makefiles. Provide virt/kvm/Makefile.kvm for them to include. Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Acked-by:
Marc Zyngier <maz@kernel.org> Message-Id: <20211121125451.9489-3-dwmw2@infradead.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
David Woodhouse authored
I'd like to make the build include dirty_ring.c based on whether the arch wants it or not. That's a whole lot simpler if there's a config symbol instead of doing it implicitly on KVM_DIRTY_LOG_PAGE_OFFSET being set to something non-zero. Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211121125451.9489-2-dwmw2@infradead.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Dec 08, 2021
-
-
Sean Christopherson authored
Add helpers to wake and query a blocking vCPU. In addition to providing nice names, the helpers reduce the probability of KVM neglecting to use kvm_arch_vcpu_get_wait(). No functional change intended. Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-20-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
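The query helper is essentially a thin wrapper over the arch accessor, and the wake side funnels through the same accessor (simplified sketch; the real wake path also updates vcpu->ready and the halt_wakeup stat):

    static inline bool kvm_vcpu_is_blocking(struct kvm_vcpu *vcpu)
    {
        return rcuwait_active(kvm_arch_vcpu_get_wait(vcpu));
    }

    static inline bool kvm_vcpu_wake_up_sketch(struct kvm_vcpu *vcpu)
    {
        /* rcuwait_wake_up() reports whether a task was actually woken. */
        return rcuwait_wake_up(kvm_arch_vcpu_get_wait(vcpu));
    }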
-
Sean Christopherson authored
Calculate the halt-polling "stop" time using "start" instead of redoing ktime_get(). In practice, the numbers involved are in the noise (e.g., in the happy case where hardware correctly predicts do_halt_poll and there are no interrupts, "start" is probably only a few cycles old) and either approach is perfectly ok. But it's more precise to count any extra latency toward the halt-polling time. Reviewed-by:
David Matlack <dmatlack@google.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-17-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
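The change boils down to a single line in the polling setup, roughly:

    ktime_t start, stop;

    start = ktime_get();
    /* Previously: stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns); */
    stop = ktime_add_ns(start, vcpu->halt_poll_ns); /* setup latency now counts */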
-
Jing Zhang authored
Add a "blocking" stat that userspace can use to detect the case where a vCPU is not being run because of an vCPU/guest action, e.g. HLT or WFS on x86, WFI on arm64, etc... Current guest/host/halt stats don't show this well, e.g. if a guest halts for a long period of time then the vCPU could could appear pathologically blocked due to a host condition, when in reality the vCPU has been put into a not-runnable state by the guest. Originally-by:
Cannon Matthews <cannonmatthews@google.com> Suggested-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
David Matlack <dmatlack@google.com> Signed-off-by:
Jing Zhang <jingzhangos@google.com> [sean: renamed stat to "blocking", massaged changelog] Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-16-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
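A sketch of how a binary stat like this brackets the blocking section (approximate; the surrounding wait loop is simplified):

    vcpu->stat.generic.blocking = 1; /* visible to userspace while blocked */

    prepare_to_rcuwait(kvm_arch_vcpu_get_wait(vcpu));
    for (;;) {
        set_current_state(TASK_INTERRUPTIBLE);
        if (kvm_vcpu_check_block(vcpu) < 0)
            break;
        schedule();
    }
    finish_rcuwait(kvm_arch_vcpu_get_wait(vcpu));

    vcpu->stat.generic.blocking = 0;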
-
Sean Christopherson authored
Factor out the "block" part of kvm_vcpu_halt() so that x86 can emulate non-halt wait/sleep/block conditions that should not be subjected to halt-polling. No functional change intended. Reviewed-by:
Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by:
David Matlack <dmatlack@google.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-15-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Rename kvm_vcpu_block() to kvm_vcpu_halt() in preparation for splitting the actual "block" sequences into a separate helper (to be named kvm_vcpu_block()). x86 will use the standalone block-only path to handle non-halt cases where the vCPU is not runnable. Rename block_ns to halt_ns to match the new function name. No functional change intended. Reviewed-by:
David Matlack <dmatlack@google.com> Reviewed-by:
Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-14-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Drop kvm_arch_vcpu_block_finish() now that all arch implementations are nops. No functional change intended. Acked-by:
Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by:
David Matlack <dmatlack@google.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-10-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Invoke the arch hooks for block+unblock if and only if KVM actually attempts to block the vCPU. The only non-nop implementation is on x86, specifically SVM's AVIC, and there is no need to put the AVIC prior to halt-polling; KVM x86's kvm_vcpu_has_events() will scour the full vIRR to find pending IRQs regardless of whether the AVIC is loaded/"running". The primary motivation is to allow future cleanup to split out "block" from "halt", but this is also likely a small performance boost on x86 SVM when halt-polling is successful. Adjust the post-block path to update "cur" after unblocking, i.e. include AVIC load time in halt_wait_ns and halt_wait_hist, so that the behavior is consistent. Moving just the pre-block arch hook would result in only the AVIC put latency being included in the halt_wait stats. There is no obvious evidence that one way or the other is correct, so just ensure KVM is consistent. Note, x86 has two separate paths for handling APICv with respect to vCPU blocking. VMX uses hooks in x86's vcpu_block(), while SVM uses the arch hooks in kvm_vcpu_block(). Prior to this patch, the two paths were more or less functionally identical. That is very much not the case after this patch, as the hooks used by VMX _must_ fire before halt-polling. x86's entire mess will be cleaned up in future patches. Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-12-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
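Schematically, the hooks now bracket only the actual wait, with the post-block timestamp taken after the unblock hook (sketch):

    kvm_arch_vcpu_blocking(vcpu);    /* e.g. SVM AVIC put */
    /* ... rcuwait loop, schedule() until a wake event ... */
    kvm_arch_vcpu_unblocking(vcpu);  /* e.g. SVM AVIC load */

    cur = ktime_get(); /* taken after the hook, so AVIC load time lands
                        * in halt_wait_ns / halt_wait_hist, per the above */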
-
Sean Christopherson authored
Move the halt-polling "success" and histogram stats update into the dedicated helper to fix a discrepancy where the success/fail "time" stats consider polling successful so long as the wait is avoided, but the main "success" and histogram stats consider polling successful if and only if a wake event was detected by the halt-polling loop. Move halt_attempted_poll to the helper as well so that all the stats are updated in a single location. While it's a bit odd to update the stat well after the fact, practically speaking there's no meaningful advantage to updating before polling. Note, there is a functional change in addition to the success vs. fail change. The histogram updates previously called ktime_get() instead of using "cur". But that change is desirable as it means all the stats are now updated with the same polling time, and avoids the extra ktime_get(), which isn't expensive but isn't free either. Reviewed-by:
David Matlack <dmatlack@google.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-8-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Add a comment to document that halt-polling is considered successful even if the polling loop itself didn't detect a wake event, i.e. if a wake event was detected in the final kvm_vcpu_check_block(). Invert the param to the update helper so that the helper is a dumb function that is "told" whether or not polling was successful, as opposed to determining success based on blocking behavior. Opportunistically tweak the params to the update helper to reduce the line length for the call site so that it fits on a single line, and so that the prototype conforms to the more traditional kernel style. No functional change intended. Reviewed-by:
David Matlack <dmatlack@google.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-7-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
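After the inversion, the helper and its call site look roughly like this (variable names approximate):

    static void update_halt_poll_stats(struct kvm_vcpu *vcpu, ktime_t start,
                                       ktime_t end, bool success);

    /* Polling "succeeded" iff the vCPU never actually blocked. */
    if (do_halt_poll)
        update_halt_poll_stats(vcpu, start, poll_end, !waited);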
-
Sean Christopherson authored
Don't update halt-polling stats if halt-polling wasn't attempted. This is a nop as @poll_ns is guaranteed to be '0' (poll_end == start); in a future patch (to move the histogram stats into the helper), it will avoid a discrepancy in what is considered a "successful" halt-poll. No functional change intended. Reviewed-by:
David Matlack <dmatlack@google.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-6-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Do not define/reference kvm_vcpu.wait if __KVM_HAVE_ARCH_WQP is true, and instead force the architecture (PPC) to define its own rcuwait object. Allowing common KVM to directly access vcpu->wait without a guard makes it all too easy to introduce potential bugs, e.g. kvm_vcpu_block(), kvm_vcpu_on_spin(), and async_pf_execute() all operate on vcpu->wait, not the result of kvm_arch_vcpu_get_wait(), and so may do the wrong thing for PPC. Due to PPC's shenanigans with respect to callbacks and waits (it switches to the virtual core's wait object at KVM_RUN!?!?), it's not clear whether or not this fixes any bugs. Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-5-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
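The guard pattern ends up roughly as follows in the common header (sketch; PPC supplies the rcuwait via its arch struct):

    struct kvm_vcpu {
        /* ... */
    #ifndef __KVM_HAVE_ARCH_WQP
        struct rcuwait wait;
    #endif
        /* ... */
    };

    static inline struct rcuwait *kvm_arch_vcpu_get_wait(struct kvm_vcpu *vcpu)
    {
    #ifdef __KVM_HAVE_ARCH_WQP
        return vcpu->arch.waitp; /* arch-provided (PPC) */
    #else
        return &vcpu->wait;
    #endif
    }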
-
Sean Christopherson authored
Wrap s390's halt_poll_max_steal with READ_ONCE and snapshot the result of kvm_arch_no_poll() in kvm_vcpu_block() to avoid a mostly-theoretical, largely benign bug on s390 where the result of kvm_arch_no_poll() could change due to userspace modifying halt_poll_max_steal while the vCPU is blocking. The bug is largely benign as it will either cause KVM to skip updating halt-polling times (no_poll toggles false=>true) or to update halt-polling times with a slightly flawed block_ns. Note, READ_ONCE is unnecessary in the current code; add it in case the arch hook is ever inlined, and to provide a hint that userspace can change the param at will. Fixes: 8b905d28 ("KVM: s390: provide kvm_arch_no_poll function") Reviewed-by:
Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-4-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
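The snapshot pattern itself is simple (sketch; variable names approximate):

    /*
     * Evaluate the hook exactly once, so a racing userspace write to
     * halt_poll_max_steal on s390 cannot flip the answer mid-block.
     */
    bool halt_poll_allowed = !kvm_arch_no_poll(vcpu);
    bool do_halt_poll = halt_poll_allowed && vcpu->halt_poll_ns;

    if (do_halt_poll) {
        /* ... poll ... */
    }
    /* ... block if needed; adjust poll window only if halt_poll_allowed ... */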
-
Paolo Bonzini authored
If we do have the vcpu mutex, as is the case if kvm_running_vcpu is set to the target vcpu of the kick, changes to vcpu->mode do not need atomic operations; cmpxchg is only needed _outside_ the mutex to ensure that the IN_GUEST_MODE->EXITING_GUEST_MODE change does not race with the vcpu thread going OUTSIDE_GUEST_MODE. Use this to optimize the case of a vCPU sending an interrupt to itself. Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
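The optimized kick then looks roughly like this (simplified; structure approximated from the description):

    int cpu, me = get_cpu();

    if (vcpu == __this_cpu_read(kvm_running_vcpu)) {
        /*
         * Kicking ourselves: the vcpu mutex is held, so a plain write
         * suffices; no other thread can race on vcpu->mode.
         */
        if (vcpu->mode == IN_GUEST_MODE)
            WRITE_ONCE(vcpu->mode, EXITING_GUEST_MODE);
    } else if (kvm_arch_vcpu_should_kick(vcpu)) {
        /* The cross-thread path still does the cmpxchg internally. */
        cpu = READ_ONCE(vcpu->cpu);
        if (cpu != me && (unsigned int)cpu < nr_cpu_ids && cpu_online(cpu))
            smp_send_reschedule(cpu);
    }
    put_cpu();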
-
Ben Gardon authored
In preparation for implementing in-place hugepage promotion, various functions will need to be called from zap_collapsible_spte_range, which has the const qualifier on its memslot argument. Propagate the const qualifier to the various functions which will be needed. This just serves to simplify the following patch. No functional change intended. Signed-off-by:
Ben Gardon <bgardon@google.com> Message-Id: <20211115234603.2908381-11-bgardon@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Allocate the "new" memslot for !DELETE memslot updates straight away instead of filling an intermediate on-stack object and forcing kvm_set_memslot() to juggle the allocation and do weird things like reuse the old memslot object in MOVE. In the MOVE case, this results in an "extra" memslot allocation due to allocating both the "new" slot and the "invalid" slot, but that's a temporary and not-huge allocation, and MOVE is a relatively rare memslot operation. Regarding MOVE, drop the open-coded management of the gfn tree with a call to kvm_replace_memslot(), which already handles the case where new->base_gfn != old->base_gfn. This is made possible by virtue of not having to copy the "new" memslot data after erasing the old memslot from the gfn tree. Using kvm_replace_memslot(), and more specifically not reusing the old memslot, means the MOVE case now does hva tree and hash list updates, but that's a small price to pay for simplifying the code and making MOVE align with all the other flavors of updates. The "extra" updates are firmly in the noise from a performance perspective, e.g. the "move (in)active area" selfttests show a (very, very) slight improvement. Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <f0d8c72727aa825cf682bd4e3da4b3fa68215dd4.1638817641.git.maciej.szmigiero@oracle.com>
-
Sean Christopherson authored
Initialize the "new" memslot in the !DELETE path only after the various sanity checks have passed. This will allow a future commit to allocate @new dynamically without having to copy a memslot, and without having to deal with freeing @new in error paths and in the "nothing to change" path that's hiding in the sanity checks. No functional change intended. Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <a084d0531ca3a826a7f861eb2b08b5d1c06ef265.1638817641.git.maciej.szmigiero@oracle.com>
-
Maciej S. Szmigiero authored
Do a quick lookup for possibly overlapping gfns when creating or moving a memslot instead of performing a linear scan of the whole memslot set. Signed-off-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> [sean: tweaked params to avoid churn in future cleanup] Reviewed-by:
Sean Christopherson <seanjc@google.com> Message-Id: <a4795e5c2f624754e9c0aab023ebda1966feb3e1.1638817641.git.maciej.szmigiero@oracle.com>
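The lookup is a short walk of only the slots intersecting the new range, along these lines (close to the commit, but treat details as approximate):

    static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id,
                                          gfn_t start, gfn_t end)
    {
        struct kvm_memslot_iter iter;

        /* Visit only slots whose gfn range intersects [start, end). */
        kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
            if (iter.slot->id != id)
                return true;
        }

        return false;
    }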
-
Maciej S. Szmigiero authored
kvm_invalidate_memslot() calls kvm_arch_flush_shadow_memslot() on the active, but KVM_MEMSLOT_INVALID slot. Do it on the inactive (but valid) old slot instead since arch code really should not get passed such an invalid slot. Note that this means that the "arch" field of the slot provided to kvm_arch_flush_shadow_memslot() may have stale data since this function is called with slots_arch_lock released. Suggested-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Reviewed-by:
Sean Christopherson <seanjc@google.com> Message-Id: <813595ecc193d6ae39a87709899d4251523b05f8.1638817641.git.maciej.szmigiero@oracle.com>
-
Maciej S. Szmigiero authored
The current memslot code uses a (reverse gfn-ordered) memslot array for keeping track of them. Because the memslot array that is currently in use cannot be modified, every memslot management operation (create, delete, move, change flags) has to make a copy of the whole array so it has a scratch copy to work on. Strictly speaking, however, it is only necessary to make a copy of the memslot that is being modified; copying all the memslots currently present is just a limitation of the array-based memslot implementation. Two memslot sets, however, are still needed so the VM continues to run on the currently active set while the requested operation is being performed on the second, currently inactive one. In order to have two memslot sets, but only one copy of actual memslots, it is necessary to split out the memslot data from the memslot sets. The memslots themselves should be also kept independent of each other so they can be individually added or deleted. These two memslot sets should normally point to the same set of memslots. They can, however, be desynchronized when performing a memslot management operation by replacing the memslot to be modified by its copy. After the operation is complete, both memslot sets once again point to the same, common set of memslot data. This commit implements the aforementioned idea. For tracking of gfns an ordinary rbtree is used since memslots cannot overlap in the guest address space and so this data structure is sufficient for ensuring that lookups are done quickly. The "last used slot" mini-caches (both the per-slot-set one and the per-vCPU one), that keep track of the last found-by-gfn memslot, are still present in the new code. Co-developed-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
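The core of the arrangement, roughly as it lands in the structures (field names close to the commit, but treat this as a sketch):

    struct kvm_memslots {
        u64 generation;
        struct rb_root_cached hva_tree;
        struct rb_root gfn_tree;       /* gfn-ordered lookups */
        DECLARE_HASHTABLE(id_hash, 7); /* id -> memslot */
        int node_idx;                  /* which of the two sets this is */
    };

    struct kvm {
        /* ... */
        /* Two backing sets per address space; at most one is active. */
        struct kvm_memslots __memslots[KVM_ADDRESS_SPACE_NUM][2];
        /* The currently active set, published to readers via RCU. */
        struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
        /* ... */
    };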
-
Maciej S. Szmigiero authored
The current memslots implementation only allows quick binary search by gfn, quick lookup by hva is not possible - the implementation has to do a linear scan of the whole memslots array, even though the operation being performed might apply just to a single memslot. This significantly hurts performance of per-hva operations with higher memslot counts. Since hva ranges can overlap between memslots an interval tree is needed for tracking them. [sean: handle interval tree updates in kvm_replace_memslot()] Signed-off-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <d66b9974becaa9839be9c4e1a5de97b177b4ac20.1638817640.git.maciej.szmigiero@oracle.com>
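Sketch of the mechanism (close to the commit; treat details as approximate): each memslot carries an interval tree node spanning its hva range, and walkers visit only the overlapping slots.

    struct kvm_memory_slot {
        struct interval_tree_node hva_node; /* [userspace_addr, +size) */
        /* ... base_gfn, npages, flags, ... */
    };

    /* MMU notifier handlers then visit only the overlapping slots: */
    struct interval_tree_node *node;

    kvm_for_each_memslot_in_hva_range(node, slots, start, last) {
        struct kvm_memory_slot *slot =
            container_of(node, struct kvm_memory_slot, hva_node);
        /* ... operate on just this slot ... */
    }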
-
Maciej S. Szmigiero authored
Memslot ID to the corresponding memslot mappings are currently kept as indices in a static id_to_index array. The size of this array depends on the maximum allowed memslot count (regardless of the number of memslots actually in use). This has become especially problematic recently, when the memslot count cap was removed, so the maximum count is now a full 32k memslots - the maximum allowed by the current KVM API. Keeping these IDs in a hash table (instead of an array) avoids this problem. Resolving a memslot ID to the actual memslot (instead of its index) will also enable transitioning away from an array-based implementation of the whole memslots structure in a later commit. Co-developed-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Sean Christopherson <seanjc@google.com> Signed-off-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <117fb2c04320e6cd6cf34f205a72eadb0aa8d5f9.1638817640.git.maciej.szmigiero@oracle.com>
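The replacement lookup is a straightforward hash walk, essentially (as of this commit, before the two-set rework above):

    struct kvm_memslots {
        /* ... */
        DECLARE_HASHTABLE(id_hash, 7); /* fixed 128 buckets, not 32k entries */
    };

    static inline struct kvm_memory_slot *
    id_to_memslot(struct kvm_memslots *slots, int id)
    {
        struct kvm_memory_slot *slot;

        hash_for_each_possible(slots->id_hash, slot, id_node, id) {
            if (slot->id == id)
                return slot;
        }

        return NULL;
    }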
-
Maciej S. Szmigiero authored
Since kvm_memslot_move_forward() can theoretically return a negative memslot index even when kvm_memslot_move_backward() returned a positive one (and so did not WARN) let's just move the warning to the common code. Signed-off-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Reviewed-by:
Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by:
Sean Christopherson <seanjc@google.com> Message-Id: <eeed890ccb951e7b0dce15bc170eb2661d5b02da.1638817640.git.maciej.szmigiero@oracle.com>
-
Maciej S. Szmigiero authored
s390 arch has gfn_to_memslot_approx() which is almost identical to search_memslots(), differing only in that in case the gfn falls in a hole, one of the memslots bordering the hole is returned. Add this lookup mode as an option to search_memslots() so we don't have two almost identical functions for looking up a memslot by its gfn. Signed-off-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> [sean: tweaked helper names to keep gfn_to_memslot_approx() in s390] Reviewed-by:
Sean Christopherson <seanjc@google.com> Message-Id: <171cd89b52c718dbe180ecd909b4437a64a7e2ec.1638817640.git.maciej.szmigiero@oracle.com>
-
Sean Christopherson authored
Stop making a full copy of the old memslot in __kvm_set_memory_region() now that metadata updates are handled by kvm_set_memslot(), i.e. now that the old memslot's dirty bitmap doesn't need to be referenced after the memslot and its pointer is modified/invalidated by kvm_set_memslot(). No functional change intended. Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <5dce0946b41bba8c83f6e3424c6955c56bcc9f86.1638817640.git.maciej.szmigiero@oracle.com>
-
Sean Christopherson authored
Handle the generic memslot metadata, a.k.a. dirty bitmap, updates at the same time that arch handles its own metadata updates, i.e. at memslot prepare and commit. This will simplify converting @new to a dynamically allocated object, and more closely aligns common KVM with architecture code. No functional change intended. Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <2ddd5446e3706fe3c1e52e3df279f04c458be830.1638817640.git.maciej.szmigiero@oracle.com>
-
Sean Christopherson authored
Drop the @mem param from kvm_arch_{prepare,commit}_memory_region() now that its use has been removed in all architectures. No functional change intended. Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <aa5ed3e62c27e881d0d8bc0acbc1572bc336dc19.1638817640.git.maciej.szmigiero@oracle.com>
-
Sean Christopherson authored
Pass the "old" slot to kvm_arch_prepare_memory_region() and force arch code to handle propagating arch specific data from "new" to "old" when necessary. This is a baby step towards dynamically allocating "new" from the get go, and is a (very) minor performance boost on x86 due to not unnecessarily copying arch data. For PPC HV, copy the rmap in the !CREATE and !DELETE paths, i.e. for MOVE and FLAGS_ONLY. This is functionally a nop as the previous behavior would overwrite the pointer for CREATE, and eventually discard/ignore it for DELETE. For x86, copy the arch data only for FLAGS_ONLY changes. Unlike PPC HV, x86 needs to reallocate arch data in the MOVE case as the size of x86's allocations depend on the alignment of the memslot's gfn. Opportunistically tweak kvm_arch_prepare_memory_region()'s param order to match the "commit" prototype. Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> [mss: add missing RISCV kvm_arch_prepare_memory_region() change] Signed-off-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <67dea5f11bbcfd71e3da5986f11e87f5dd4013f9.1638817639.git.maciej.szmigiero@oracle.com>
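Combined with the @mem removal above, the arch hook prototypes end up along these lines (sketch; ordering per the "commit" prototype the changelog mentions):

    int kvm_arch_prepare_memory_region(struct kvm *kvm,
                                       const struct kvm_memory_slot *old,
                                       struct kvm_memory_slot *new,
                                       enum kvm_mr_change change);

    void kvm_arch_commit_memory_region(struct kvm *kvm,
                                       struct kvm_memory_slot *old,
                                       const struct kvm_memory_slot *new,
                                       enum kvm_mr_change change);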
-
Sean Christopherson authored
Now that the address space ID is stored in every slot, including fake slots used for deletion, use the slot's as_id instead of passing in the redundant information as a param to kvm_set_memslot(). This will greatly simplify future memslot work by avoiding passing a large number of variables around purely to honor @as_id. Drop a comment in the DELETE path about new->as_id being provided purely for debug, as that's now a lie. No functional change intended. Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <03189577be214ab8530a4b3a3ee3ed1c2f9e5815.1638817639.git.maciej.szmigiero@oracle.com>
-
Maciej S. Szmigiero authored
There is no need to copy the whole memslot data after releasing slots_arch_lock for a moment to install a temporary memslots copy in kvm_set_memslot() since this lock only protects the arch field of each memslot. Just resync this particular field after reacquiring slots_arch_lock. Note, this also eliminates the need to manually clear the INVALID flag when restoring memslots; the "setting" of the INVALID flag was an unwanted side effect of copying the entire memslots. Since kvm_copy_memslots() has just one caller remaining now, open-code it instead. Signed-off-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> [sean: tweak shortlog, note INVALID flag in changelog, revert comment] Reviewed-by:
Sean Christopherson <seanjc@google.com> Message-Id: <b63035d114707792e9042f074478337f770dff6a.1638817638.git.maciej.szmigiero@oracle.com>
-
Sean Christopherson authored
Fold kvm_delete_memslot() into __kvm_set_memory_region() to free up the "kvm_delete_memslot()" name for use in a future helper. The delete logic isn't so complex/long that it truly needs a helper, and it will be simplified a wee bit further in upcoming commits. No functional change intended. Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <2887631c31a82947faa488ab72f55f8c68b7c194.1638817638.git.maciej.szmigiero@oracle.com>
-
Sean Christopherson authored
Explicitly disallow creating more memslot pages than can fit in an unsigned long; KVM doesn't correctly handle a total number of memslot pages that doesn't fit in an unsigned long and remedying that would be a waste of time. For a 64-bit kernel, this is a nop as memslots are not allowed to overlap in the gfn address space. With a 32-bit kernel, userspace can at most address 3gb of virtual memory, whereas wrapping the total number of pages would require 4tb+ of guest physical memory. Even with x86's second address space for SMM, userspace would need to alias all of guest memory more than one _thousand_ times. And on older x86 hardware with MAXPHYADDR < 43, the guest couldn't actually access any of those aliases even if userspace lied about guest.MAXPHYADDR. On s390 and arm64, this is a nop as they don't support 32-bit hosts. On x86, practically speaking this is simply acknowledging reality as the existing kvm_mmu_calculate_default_mmu_pages() assumes the total number of pages fits in an "unsigned long". On PPC, this is likely a nop as every flavor of PPC KVM assumes gfns (and gpas!) fit in unsigned long. arch/powerpc/kvm/book3s_32_mmu_host.c goes a step further and fails the build if CONFIG_PTE_64BIT=y, which presumably means that it doesn't support 64-bit physical addresses. On MIPS, this is also likely a nop as the core MMU helpers assume gpas fit in unsigned long, e.g. see kvm_mips_##name##_pte. And finally, RISC-V is a "don't care" as it doesn't exist in any release, i.e. there is no established ABI to break. Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <1c2c91baf8e78acccd4dad38da591002e61c013c.1638817638.git.maciej.szmigiero@oracle.com>
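The restriction itself is a classic unsigned-overflow check (approximate; assumes a running nr_memslot_pages total as the commit describes):

    if (change == KVM_MR_CREATE) {
        /* The total must keep fitting in an unsigned long. */
        if ((kvm->nr_memslot_pages + new->npages) < kvm->nr_memslot_pages)
            return -EINVAL;
    }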
-
Marc Zyngier authored
Everywhere we use kvm_for_each_vcpu(), we use an int as the vcpu index. Unfortunately, we're about to rework the iterator, which requires this to be upgraded to an unsigned long. Let's bite the bullet and repaint all of it in one go. Signed-off-by:
Marc Zyngier <maz@kernel.org> Message-Id: <20211116160403.4074052-7-maz@kernel.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Marc Zyngier authored
At least on arm64 and x86, the vcpus array is pretty huge (up to 1024 entries on x86) and is mostly empty in the majority of the cases (running 1k vcpu VMs is not that common). This means that we end up with a 4kB block of unused memory in the middle of the kvm structure. Instead of wasting away this memory, let's use an xarray instead, which gives us almost the same flexibility as a normal array, but with a reduced memory usage with smaller VMs. Signed-off-by:
Marc Zyngier <maz@kernel.org> Message-Id: <20211116160403.4074052-6-maz@kernel.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
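A sketch of the switch (accessors simplified; the real lookup also bounds-checks against online_vcpus):

    struct kvm {
        /* ... */
        struct xarray vcpu_array; /* replaces struct kvm_vcpu *vcpus[] */
    };

    /* Simplified lookup; sparse indices consume no memory for the gaps. */
    static inline struct kvm_vcpu *kvm_get_vcpu_sketch(struct kvm *kvm, int i)
    {
        return xa_load(&kvm->vcpu_array, i);
    }

    /* Insertion at vCPU creation time (error handling elided): */
    static int kvm_install_vcpu_sketch(struct kvm *kvm, struct kvm_vcpu *vcpu)
    {
        return xa_err(xa_store(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu,
                               GFP_KERNEL_ACCOUNT));
    }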
-
Marc Zyngier authored
All architectures have similar loops iterating over the vcpus, freeing one vcpu at a time, and eventually wiping the reference off the vcpus array. They are also inconsistently taking the kvm->lock mutex when wiping the references from the array. Make this code common, which will simplify further changes. The locking is dropped altogether, as this should only be called when there are no further references to the kvm structure. Reviewed-by:
Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by:
Marc Zyngier <maz@kernel.org> Message-Id: <20211116160403.4074052-2-maz@kernel.org> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
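Close to the common helper this introduces (sketch; this predates the xarray conversion above, so the vcpus array is still in play):

    void kvm_destroy_vcpus(struct kvm *kvm)
    {
        unsigned int i;
        struct kvm_vcpu *vcpu;

        kvm_for_each_vcpu(i, vcpu, kvm) {
            kvm_vcpu_destroy(vcpu);
            kvm->vcpus[i] = NULL; /* wipe the reference */
        }

        /*
         * No lock taken: callers must guarantee that no other reference
         * to the kvm structure remains at this point.
         */
        atomic_set(&kvm->online_vcpus, 0);
    }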
-
- Nov 26, 2021
-
-
Paolo Bonzini authored
This is not an unrecoverable situation. Users of kvm_read_guest_offset_cached and kvm_write_guest_offset_cached must expect the read/write to fail, and therefore it is possible to just return early with an error value. Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
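In code terms, the downgrade is roughly (the exact errno is an assumption):

    /* Before: fatal assertion. */
    /* BUG_ON(len + offset > ghc->len); */

    /* After: warn once and let the caller observe the failure. */
    if (WARN_ON_ONCE(len + offset > ghc->len))
        return -EFAULT;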
-
- Nov 18, 2021
-
-
Sean Christopherson authored
Reject userspace memslots whose size exceeds the storage capacity of an "unsigned long". KVM's uAPI takes the size as u64 to support large slots on 64-bit hosts, but does not account for the size being truncated on 32-bit hosts in various flows. The access_ok() check on the userspace virtual address in particular casts the size to "unsigned long" and will check the wrong number of bytes. KVM doesn't actually support slots whose size doesn't fit in an "unsigned long", e.g. KVM's internal kvm_memory_slot.npages is an "unsigned long", not a "u64", and misc arch specific code follows that behavior. Fixes: fa3d315a ("KVM: Validate userspace_addr of memslot when registered") Cc: stable@vger.kernel.org Signed-off-by:
Sean Christopherson <seanjc@google.com> Reviewed-by:
Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <20211104002531.1176691-3-seanjc@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
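The added check boils down to catching the u64-to-unsigned-long truncation directly:

    /* In __kvm_set_memory_region()'s sanity checks (approximate): */
    if (mem->memory_size != (unsigned long)mem->memory_size)
        return -EINVAL; /* size would be silently truncated on 32-bit */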
-