  1. Apr 08, 2022
    • KVM: Prevent module exit until all VMs are freed · eccfee44
      David Matlack authored
      commit 5f6de5cb upstream.
      
      Tie the lifetime of the KVM module to the lifetime of each VM via
      kvm.users_count. This way anything that grabs a reference to the VM via
      kvm_get_kvm() cannot accidentally outlive the KVM module.
      
      Prior to this commit, the lifetime of the KVM module was tied to the
      lifetime of /dev/kvm file descriptors, VM file descriptors, and vCPU
      file descriptors by their respective file_operations "owner" field.
      This approach is insufficient because references grabbed via
      kvm_get_kvm() do not prevent closing any of the aforementioned file
      descriptors.
      
      This fixes a long-standing theoretical bug in KVM that at least affects
      async page faults. kvm_setup_async_pf() grabs a reference via
      kvm_get_kvm(), and drops it in an asynchronous work callback. Nothing
      prevents the VM file descriptor from being closed and the KVM module
      from being unloaded before this callback runs.
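
      As a sketch of the resulting lifecycle (the use of kvm_chardev_ops.owner
      and the exact placement in kvm_create_vm()/kvm_destroy_vm() are
      assumptions based on this description, not a quote of the patch):

        static struct kvm *kvm_create_vm(unsigned long type)
        {
        	...
        	/* Pin the module for as long as this VM object exists. */
        	if (!try_module_get(kvm_chardev_ops.owner)) {
        		r = -ENODEV;
        		goto out_err;
        	}
        	...
        }

        static void kvm_destroy_vm(struct kvm *kvm)
        {
        	/* Runs only once kvm.users_count has dropped to zero. */
        	...
        	module_put(kvm_chardev_ops.owner);
        }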
      
      Fixes: af585b92 ("KVM: Halt vcpu if page it tries to access is swapped out")
      Fixes: 3d3aab1b ("KVM: set owner of cpu and vm file operations")
      Cc: stable@vger.kernel.org
      Suggested-by: Ben Gardon <bgardon@google.com>
      [ Based on a patch from Ben implemented for Google's kernel. ]
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220303183328.1499189-2-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. Feb 16, 2022
    • KVM: eventfd: Fix false positive RCU usage warning · dc129275
      Hou Wenlong authored
      [ Upstream commit 6a0c6170 ]
      
      Fix the following false positive warning:
       =============================
       WARNING: suspicious RCU usage
       5.16.0-rc4+ #57 Not tainted
       -----------------------------
       arch/x86/kvm/../../../virt/kvm/eventfd.c:484 RCU-list traversed in non-reader section!!
      
       other info that might help us debug this:
      
       rcu_scheduler_active = 2, debug_locks = 1
       3 locks held by fc_vcpu 0/330:
        #0: ffff8884835fc0b0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x88/0x6f0 [kvm]
        #1: ffffc90004c0bb68 (&kvm->srcu){....}-{0:0}, at: vcpu_enter_guest+0x600/0x1860 [kvm]
        #2: ffffc90004c0c1d0 (&kvm->irq_srcu){....}-{0:0}, at: kvm_notify_acked_irq+0x36/0x180 [kvm]
      
       stack backtrace:
       CPU: 26 PID: 330 Comm: fc_vcpu 0 Not tainted 5.16.0-rc4+
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
       Call Trace:
        <TASK>
        dump_stack_lvl+0x44/0x57
        kvm_notify_acked_gsi+0x6b/0x70 [kvm]
        kvm_notify_acked_irq+0x8d/0x180 [kvm]
        kvm_ioapic_update_eoi+0x92/0x240 [kvm]
        kvm_apic_set_eoi_accelerated+0x2a/0xe0 [kvm]
        handle_apic_eoi_induced+0x3d/0x60 [kvm_intel]
        vmx_handle_exit+0x19c/0x6a0 [kvm_intel]
        vcpu_enter_guest+0x66e/0x1860 [kvm]
        kvm_arch_vcpu_ioctl_run+0x438/0x7f0 [kvm]
        kvm_vcpu_ioctl+0x38a/0x6f0 [kvm]
        __x64_sys_ioctl+0x89/0xc0
        do_syscall_64+0x3a/0x90
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Since kvm_unregister_irq_ack_notifier() does synchronize_srcu(&kvm->irq_srcu),
      kvm->irq_ack_notifier_list is protected by kvm->irq_srcu. In fact,
      kvm->irq_srcu SRCU read lock is held in kvm_notify_acked_irq(), making it
      a false positive warning. So use hlist_for_each_entry_srcu() instead of
      hlist_for_each_entry_rcu().
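
      Sketched as a diff against kvm_notify_acked_gsi() (the lockdep cookie
      argument follows the generic hlist_for_each_entry_srcu() API and is an
      assumption here):

       -	hlist_for_each_entry_rcu(kian, &kvm->irq_ack_notifier_list, link)
       +	hlist_for_each_entry_srcu(kian, &kvm->irq_ack_notifier_list, link,
       +				  srcu_read_lock_held(&kvm->irq_srcu))
       		if (kian->gsi == gsi)
       			kian->irq_acked(kian);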
      
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
      Message-Id: <f98bac4f5052bad2c26df9ad50f7019e40434512.1643265976.git.houwenlong.hwl@antgroup.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  3. Feb 01, 2022
  4. Dec 22, 2021
  5. Dec 08, 2021
  6. Oct 09, 2021
    • KVM: do not shrink halt_poll_ns below grow_start · 6d0ff920
      Sergey Senozhatsky authored
      [ Upstream commit ae232ea4 ]
      
      grow_halt_poll_ns() ignores values between 0 and
      halt_poll_ns_grow_start (10000 by default). However,
      when we shrink halt_poll_ns we may fall way below
      halt_poll_ns_grow_start and end up with halt_poll_ns
      values that don't make a lot of sense: like 1 or 9,
      or 19.
      
      VCPU1 trace (halt_poll_ns_shrink equals 2):
      
      VCPU1 grow 10000
      VCPU1 shrink 5000
      VCPU1 shrink 2500
      VCPU1 shrink 1250
      VCPU1 shrink 625
      VCPU1 shrink 312
      VCPU1 shrink 156
      VCPU1 shrink 78
      VCPU1 shrink 39
      VCPU1 shrink 19
      VCPU1 shrink 9
      VCPU1 shrink 4
      
      Mirror what grow_halt_poll_ns() does and set halt_poll_ns
      to 0 as soon as the shrunken halt_poll_ns value falls
      below halt_poll_ns_grow_start.
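
      A minimal sketch of the shrink path after this change (the helper and
      trace names follow kvm_main.c conventions and are assumptions here):

        static void shrink_halt_poll_ns(struct kvm_vcpu *vcpu)
        {
        	unsigned int old, val, shrink, grow_start;

        	old = val = vcpu->halt_poll_ns;
        	shrink = READ_ONCE(halt_poll_ns_shrink);
        	grow_start = READ_ONCE(halt_poll_ns_grow_start);
        	if (shrink == 0)
        		val = 0;
        	else
        		val /= shrink;

        	/* Mirror grow_halt_poll_ns(): never linger below grow_start. */
        	if (val < grow_start)
        		val = 0;

        	vcpu->halt_poll_ns = val;
        	trace_kvm_halt_poll_ns_shrink(vcpu->vcpu_id, val, old);
        }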
      
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210902031100.252080-1-senozhatsky@chromium.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  7. Aug 12, 2021
    • KVM: Do not leak memory for duplicate debugfs directories · 32f55c25
      Paolo Bonzini authored
      commit 85cd39af upstream.
      
      KVM creates a debugfs directory for each VM in order to store statistics
      about the virtual machine.  The directory name is built from the process
      pid and a VM fd.  While generally unique, it is possible to keep a
      file descriptor alive in a way that causes duplicate directories, which
      manifests as these messages:
      
        [  471.846235] debugfs: Directory '20245-4' with parent 'kvm' already present!
      
      Even though this should not happen in practice, it is more or less
      expected in the case of KVM for testcases that call KVM_CREATE_VM and
      close the resulting file descriptor repeatedly and in parallel.
      
      When this happens, debugfs_create_dir() returns an error but
      kvm_create_vm_debugfs() goes on to allocate stat data structs which are
      later leaked.  The slow memory leak was spotted by syzkaller, where it
      caused OOM reports.
      
      Since the issue only affects debugfs, do a lookup before calling
      debugfs_create_dir(), so that the message is downgraded and rate-limited.
      While at it, ensure kvm->debugfs_dentry is NULL rather than an error
      if it is not created.  This fixes kvm_destroy_vm_debugfs, which was not
      checking IS_ERR_OR_NULL correctly.
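
      A sketch of the lookup-first logic in kvm_create_vm_debugfs() (the
      directory-name construction and the rate-limited message are assumptions
      based on the description above):

        snprintf(dir_name, sizeof(dir_name), "%d-%d", task_pid_nr(current), fd);
        /* Probe first: a duplicate becomes a quiet, rate-limited warning. */
        dent = debugfs_lookup(dir_name, kvm_debugfs_dir);
        if (dent) {
        	pr_warn_ratelimited("KVM: debugfs: duplicate directory %s\n", dir_name);
        	dput(dent);
        	dent = NULL;		/* keep kvm->debugfs_dentry NULL, not ERR_PTR */
        } else {
        	dent = debugfs_create_dir(dir_name, kvm_debugfs_dir);
        }
        kvm->debugfs_dentry = dent;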
      
      Cc: stable@vger.kernel.org
      Fixes: 536a6f88 ("KVM: Create debugfs dir and stat files for each VM")
      Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  8. Aug 04, 2021
  9. Jul 20, 2021
    • KVM: mmio: Fix use-after-free Read in kvm_vm_ioctl_unregister_coalesced_mmio · 679837dc
      Kefeng Wang authored
      commit 23fa2e46 upstream.
      
      BUG: KASAN: use-after-free in kvm_vm_ioctl_unregister_coalesced_mmio+0x7c/0x1ec arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:183
      Read of size 8 at addr ffff0000c03a2500 by task syz-executor083/4269
      
      CPU: 5 PID: 4269 Comm: syz-executor083 Not tainted 5.10.0 #7
      Hardware name: linux,dummy-virt (DT)
      Call trace:
       dump_backtrace+0x0/0x2d0 arch/arm64/kernel/stacktrace.c:132
       show_stack+0x28/0x34 arch/arm64/kernel/stacktrace.c:196
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x110/0x164 lib/dump_stack.c:118
       print_address_description+0x78/0x5c8 mm/kasan/report.c:385
       __kasan_report mm/kasan/report.c:545 [inline]
       kasan_report+0x148/0x1e4 mm/kasan/report.c:562
       check_memory_region_inline mm/kasan/generic.c:183 [inline]
       __asan_load8+0xb4/0xbc mm/kasan/generic.c:252
       kvm_vm_ioctl_unregister_coalesced_mmio+0x7c/0x1ec arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:183
       kvm_vm_ioctl+0xe30/0x14c4 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:3755
       vfs_ioctl fs/ioctl.c:48 [inline]
       __do_sys_ioctl fs/ioctl.c:753 [inline]
       __se_sys_ioctl fs/ioctl.c:739 [inline]
       __arm64_sys_ioctl+0xf88/0x131c fs/ioctl.c:739
       __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
       invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
       el0_svc_common arch/arm64/kernel/syscall.c:158 [inline]
       do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:220
       el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367
       el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383
       el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670
      
      Allocated by task 4269:
       stack_trace_save+0x80/0xb8 kernel/stacktrace.c:121
       kasan_save_stack mm/kasan/common.c:48 [inline]
       kasan_set_track mm/kasan/common.c:56 [inline]
       __kasan_kmalloc+0xdc/0x120 mm/kasan/common.c:461
       kasan_kmalloc+0xc/0x14 mm/kasan/common.c:475
       kmem_cache_alloc_trace include/linux/slab.h:450 [inline]
       kmalloc include/linux/slab.h:552 [inline]
       kzalloc include/linux/slab.h:664 [inline]
       kvm_vm_ioctl_register_coalesced_mmio+0x78/0x1cc arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:146
       kvm_vm_ioctl+0x7e8/0x14c4 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:3746
       vfs_ioctl fs/ioctl.c:48 [inline]
       __do_sys_ioctl fs/ioctl.c:753 [inline]
       __se_sys_ioctl fs/ioctl.c:739 [inline]
       __arm64_sys_ioctl+0xf88/0x131c fs/ioctl.c:739
       __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
       invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
       el0_svc_common arch/arm64/kernel/syscall.c:158 [inline]
       do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:220
       el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367
       el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383
       el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670
      
      Freed by task 4269:
       stack_trace_save+0x80/0xb8 kernel/stacktrace.c:121
       kasan_save_stack mm/kasan/common.c:48 [inline]
       kasan_set_track+0x38/0x6c mm/kasan/common.c:56
       kasan_set_free_info+0x20/0x40 mm/kasan/generic.c:355
       __kasan_slab_free+0x124/0x150 mm/kasan/common.c:422
       kasan_slab_free+0x10/0x1c mm/kasan/common.c:431
       slab_free_hook mm/slub.c:1544 [inline]
       slab_free_freelist_hook mm/slub.c:1577 [inline]
       slab_free mm/slub.c:3142 [inline]
       kfree+0x104/0x38c mm/slub.c:4124
       coalesced_mmio_destructor+0x94/0xa4 arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:102
       kvm_iodevice_destructor include/kvm/iodev.h:61 [inline]
       kvm_io_bus_unregister_dev+0x248/0x280 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:4374
       kvm_vm_ioctl_unregister_coalesced_mmio+0x158/0x1ec arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:186
       kvm_vm_ioctl+0xe30/0x14c4 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:3755
       vfs_ioctl fs/ioctl.c:48 [inline]
       __do_sys_ioctl fs/ioctl.c:753 [inline]
       __se_sys_ioctl fs/ioctl.c:739 [inline]
       __arm64_sys_ioctl+0xf88/0x131c fs/ioctl.c:739
       __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
       invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
       el0_svc_common arch/arm64/kernel/syscall.c:158 [inline]
       do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:220
       el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367
       el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383
       el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670
      
      If kvm_io_bus_unregister_dev() returns -ENOMEM, kvm_iodevice_destructor()
      has already been called inside that function to delete
      'struct kvm_coalesced_mmio_dev *dev' from the list and free the dev;
      calling kvm_iodevice_destructor() again then leads to the issue above.

      Let's check the return value of kvm_io_bus_unregister_dev() and only call
      kvm_iodevice_destructor() if the return value is 0.
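
      Sketched against kvm_vm_ioctl_unregister_coalesced_mmio() (the loop
      details are assumptions based on the description above):

        list_for_each_entry_safe(dev, tmp, &kvm->coalesced_zones, list) {
        	if (coalesced_mmio_in_range(dev, zone->addr, zone->size)) {
        		r = kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS, &dev->dev);
        		/*
        		 * On failure, kvm_io_bus_unregister_dev() has already
        		 * destroyed the device; destroying it again here is the
        		 * use-after-free/double-free.
        		 */
        		if (r)
        			break;
        		kvm_iodevice_destructor(&dev->dev);
        	}
        }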
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: kvm@vger.kernel.org
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Message-Id: <20210626070304.143456-1-wangkefeng.wang@huawei.com>
      Cc: stable@vger.kernel.org
      Fixes: 5d3c4c79 ("KVM: Stop looking for coalesced MMIO zones if the bus is destroyed", 2021-04-20)
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  10. Jun 30, 2021
  11. Jun 03, 2021
  12. May 19, 2021
  13. May 14, 2021
  14. Feb 26, 2021
    • KVM: Use kvm_pfn_t for local PFN variable in hva_to_pfn_remapped() · 3320aa64
      Sean Christopherson authored
      commit a9545779 upstream.
      
      Use kvm_pfn_t, a.k.a. u64, for the local 'pfn' variable when retrieving
      a so-called "remapped" hva/pfn pair.  In theory, the hva could resolve to
      a pfn in high memory on a 32-bit kernel.

      This bug was inadvertently exposed by commit bd2fae8d ("KVM: do not
      assume PTE is writable after follow_pfn"), which added an error PFN value
      to the mix, causing gcc to complain about overflowing the unsigned long.
      
        arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function ‘hva_to_pfn_remapped’:
        include/linux/kvm_host.h:89:30: error: conversion from ‘long long unsigned int’
                                        to ‘long unsigned int’ changes value from
                                        ‘9218868437227405314’ to ‘2’ [-Werror=overflow]
         89 | #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
            |                              ^
      virt/kvm/kvm_main.c:1935:9: note: in expansion of macro ‘KVM_PFN_ERR_RO_FAULT’
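
      The shape of the fix, sketched (surrounding code elided):

       -	unsigned long pfn;	/* truncates a 64-bit kvm_pfn_t on 32-bit */
       +	kvm_pfn_t pfn;		/* u64 everywhere, fits KVM_PFN_ERR_* values */
       	...
       	*p_pfn = pfn;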
      
      Cc: stable@vger.kernel.org
      Fixes: add6a0cd ("KVM: MMU: try to fix up page faults before giving up")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210208201940.1258328-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: provide a saner PTE walking API for modules · a42150f1
      Paolo Bonzini authored
      commit 9fd6dad1 upstream.
      
      Currently, the follow_pfn function is exported for modules but
      follow_pte is not.  However, follow_pfn is very easy to misuse,
      because it does not provide protections (so most of its callers
      assume the page is writable!) and because it returns after having
      already unlocked the page table lock.
      
      Provide instead a simplified version of follow_pte that does
      not have the pmdpp and range arguments.  The older version
      survives as follow_invalidate_pte() for use by fs/dax.c.
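
      The simplified signature, as a sketch (callers are expected to drop the
      page table lock with pte_unmap_unlock() when done):

        int follow_pte(struct mm_struct *mm, unsigned long address,
        	       pte_t **ptepp, spinlock_t **ptlp);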
      
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: do not assume PTE is writable after follow_pfn · 83d42c25
      Paolo Bonzini authored
      commit bd2fae8d upstream.
      
      In order to convert an HVA to a PFN, KVM usually tries to use
      the get_user_pages family of functions.  This however is not
      possible for VM_IO vmas; in that case, KVM instead uses follow_pfn.
      
      In doing this, however, KVM loses the information on whether the
      PFN is writable.  That is usually not a problem because the main
      use of VM_IO vmas with KVM is for BARs in PCI device assignment;
      however, it is a bug.  To fix it, use follow_pte and check pte_write
      while under the protection of the PTE lock.  The information can
      be used to fail hva_to_pfn_remapped or passed back to the
      caller via *writable.
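
      A sketch of the fixed lookup in hva_to_pfn_remapped() (error handling
      and the retry path are elided; the names follow the description above):

        r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
        if (r)
        	return r;

        if (write_fault && !pte_write(*ptep)) {
        	pfn = KVM_PFN_ERR_RO_FAULT;	/* fail the write fault... */
        	goto out;
        }
        if (writable)
        	*writable = pte_write(*ptep);	/* ...or report back to the caller */
        pfn = pte_pfn(*ptep);
        out:
        pte_unmap_unlock(ptep, ptl);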
      
      Usage of follow_pfn was introduced in commit add6a0cd ("KVM: MMU: try to fix
      up page faults before giving up", 2016-07-05); however, even older versions
      have the same issue, all the way back to commit 2e2e3738 ("KVM:
      Handle vma regions with no backing page", 2008-07-20), as they also did
      not check whether the PFN was writable.
      
      Fixes: 2e2e3738 ("KVM: Handle vma regions with no backing page")
      Reported-by: David Stevens <stevensd@google.com>
      Cc: 3pvd@google.com
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  15. Feb 03, 2021
  16. Jan 12, 2021
    • kvm: check tlbs_dirty directly · ffee6772
      Lai Jiangshan authored
      commit 88bf56d0 upstream.
      
      In kvm_mmu_notifier_invalidate_range_start(), tlbs_dirty is used as:
              need_tlb_flush |= kvm->tlbs_dirty;
      with need_tlb_flush's type being int and tlbs_dirty's type being long.
      
      It means that tlbs_dirty is always used as an int and the higher 32 bits
      are useless.  We need to check tlbs_dirty in a correct way and this
      change checks it directly without propagating it to need_tlb_flush.
      
      Note: it's _extremely_ unlikely this neglecting of higher 32 bits can
      cause problems in practice.  It would require encountering tlbs_dirty
      on a 4 billion count boundary, and KVM would need to be using shadow
      paging or be running a nested guest.
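
      The fix, sketched as a diff against
      kvm_mmu_notifier_invalidate_range_start():

       -	need_tlb_flush |= kvm->tlbs_dirty;	/* long truncated into an int */
       -	if (need_tlb_flush)
       +	if (need_tlb_flush || kvm->tlbs_dirty)
       		kvm_flush_remote_tlbs(kvm);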
      
      Cc: stable@vger.kernel.org
      Fixes: a4ee1ca4 ("KVM: MMU: delay flush all tlbs on sync_page path")
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20201217154118.16497-1-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  17. Oct 23, 2020
  18. Oct 21, 2020
  19. Sep 28, 2020
  20. Sep 11, 2020
  21. Aug 21, 2020
    • KVM: Pass MMU notifier range flags to kvm_unmap_hva_range() · fdfe7cbd
      Will Deacon authored
      
      The 'flags' field of 'struct mmu_notifier_range' is used to indicate
      whether invalidate_range_{start,end}() are permitted to block. In the
      case of kvm_mmu_notifier_invalidate_range_start(), this field is not
      forwarded on to the architecture-specific implementation of
      kvm_unmap_hva_range() and therefore the backend cannot sensibly decide
      whether or not to block.
      
      Add an extra 'flags' parameter to kvm_unmap_hva_range() so that
      architectures are aware as to whether or not they are permitted to block.
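
      The signature change, sketched (the range->flags forwarding follows the
      description above):

       -int kvm_unmap_hva_range(struct kvm *kvm,
       -			unsigned long start, unsigned long end);
       +int kvm_unmap_hva_range(struct kvm *kvm,
       +			unsigned long start, unsigned long end,
       +			unsigned flags);

        /* caller, kvm_mmu_notifier_invalidate_range_start(): */
        need_tlb_flush = kvm_unmap_hva_range(kvm, range->start, range->end,
        				     range->flags);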
      
      Cc: <stable@vger.kernel.org>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
      Message-Id: <20200811102725.7121-2-will@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  22. Aug 12, 2020
  23. Aug 05, 2020
  24. Jul 29, 2020
  25. Jul 24, 2020
    • entry: Provide infrastructure for work before transitioning to guest mode · 935ace2f
      Thomas Gleixner authored
      Entering a guest is similar to exiting to user space. Pending work like
      handling signals, rescheduling, task work etc. needs to be handled before
      that.
      
      Provide generic infrastructure to avoid duplication of the same handling
      code all over the place.
      
      The transfer to guest mode handling is different from the exit to usermode
      handling, e.g. vs. rseq and live patching, so a separate function is used.
      
      The initial list of work items handled is:
      
          TIF_SIGPENDING, TIF_NEED_RESCHED, TIF_NOTIFY_RESUME
      
      Architecture specific TIF flags can be added via defines in the
      architecture specific include files.
      
      The calling convention is also different from the syscall/interrupt entry
      functions as KVM invokes this from the outer vcpu_run() loop with
      interrupts and preemption enabled. To prevent missing a pending work item
      it invokes a check for pending TIF work from interrupt disabled code right
      before transitioning to guest mode. The lockdep, RC...
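
      As a sketch of how a hypervisor's outer run loop is expected to use this
      (the function names follow the <linux/entry-kvm.h> API that grew out of
      this work and may differ from the names in this initial patch):

        for (;;) {
        	/* interrupts and preemption are still enabled here */
        	if (xfer_to_guest_mode_work_pending()) {
        		r = xfer_to_guest_mode_handle_work(vcpu);
        		if (r)
        			return r;	/* back out instead of entering the guest */
        	}
        	/* ... disable interrupts, re-check pending work, enter the guest ... */
        }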
  26. Jul 09, 2020
  27. Jul 08, 2020
  28. Jul 02, 2020
  29. Jun 11, 2020
  30. Jun 09, 2020
    • mmap locking API: use coccinelle to convert mmap_sem rwsem call sites · d8ed45c5
      Michel Lespinasse authored
      
      This change converts the existing mmap_sem rwsem calls to use the new mmap
      locking API instead.
      
      The change is generated using coccinelle with the following rule:
      
      // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .
      
      @@
      expression mm;
      @@
      (
      -init_rwsem
      +mmap_init_lock
      |
      -down_write
      +mmap_write_lock
      |
      -down_write_killable
      +mmap_write_lock_killable
      |
      -down_write_trylock
      +mmap_write_trylock
      |
      -up_write
      +mmap_write_unlock
      |
      -downgrade_write
      +mmap_write_downgrade
      |
      -down_read
      +mmap_read_lock
      |
      -down_read_killable
      +mmap_read_lock_killable
      |
      -down_read_trylock
      +mmap_read_trylock
      |
      -up_read
      +mmap_read_unlock
      )
      -(&mm->mmap_sem)
      +(mm)
      
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dbueso@...
    • mm: don't include asm/pgtable.h if linux/mm.h is already included · e31cf2f4
      Mike Rapoport authored
      
      Patch series "mm: consolidate definitions of page table accessors", v2.
      
      The low level page table accessors (pXY_index(), pXY_offset()) are
      duplicated across all architectures and sometimes more than once.  For
      instance, we have 31 definition of pgd_offset() for 25 supported
      architectures.
      
      Most of these definitions are actually identical and typically it boils
      down to, e.g.
      
      static inline unsigned long pmd_index(unsigned long address)
      {
              return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
      }
      
      static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
      {
              return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
      }
      
      These definitions can be shared among 90% of the arches provided
      XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.
      
      For architectures that really need a custom version there is always
      possibility to override the generic version with the usual ifdefs magic.
      
      These patches introduce include/linux/pgtable.h that replaces
      include/asm-generic/pgtable.h and add the definitions of the page table
      accessors to the new header.
      
      This patch (of 12):
      
      The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
      functions involving page table manipulations, e.g.  pte_alloc() and
      pmd_alloc().  So, there is no point to explicitly include <asm/pgtable.h>
      in the files that include <linux/mm.h>.
      
      The include statements in such cases are removed with a simple loop:
      
      	for f in $(git grep -l "include <linux/mm.h>") ; do
      		sed -i -e '/include <asm\/pgtable.h>/ d' $f
      	done
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
      Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  31. Jun 08, 2020
    • mm/gup.c: convert to use get_user_{page|pages}_fast_only() · dadbb612
      Souptick Joarder authored
      
      API __get_user_pages_fast() renamed to get_user_pages_fast_only() to
      align with pin_user_pages_fast_only().

      As part of this we get rid of the write parameter.  Instead, the caller
      will pass FOLL_WRITE to get_user_pages_fast_only().  This will not change
      any existing functionality of the API.
      
      All the callers are changed to pass FOLL_WRITE.
      
      Also introduce get_user_page_fast_only(), and use it in a few places
      that hard-code nr_pages to 1.
      
      Updated the documentation of the API.
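
      Illustrative before/after for a caller (the flags and call sites are
      examples, not quotes from the patch):

       -	npages = __get_user_pages_fast(addr, 1, 1, &page);
       +	npages = get_user_pages_fast_only(addr, 1, FOLL_WRITE, &page);

      and, for the hard-coded single-page case:

       -	if (__get_user_pages_fast(addr, 1, 0, &page) == 1)
       +	if (get_user_page_fast_only(addr, 0, &page))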
      
      Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Paul Mackerras <paulus@ozlabs.org>		[arch/powerpc/kvm]
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Michal Suchanek <msuchanek@suse.de>
      Link: http://lkml.kernel.org/r/1590396812-31277-1-git-send-email-jrdr.linux@gmail.com
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • KVM: x86: Fix APIC page invalidation race · e649b3f0
      Eiichi Tsukata authored
      Commit b1394e74 ("KVM: x86: fix APIC page invalidation") tried
      to fix inappropriate APIC page invalidation by re-introducing arch
      specific kvm_arch_mmu_notifier_invalidate_range() and calling it from
      kvm_mmu_notifier_invalidate_range_start. However, the patch left a
      possible race where the VMCS APIC address cache is updated *before*
      it is unmapped:
      
        (Invalidator) kvm_mmu_notifier_invalidate_range_start()
        (Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD)
        (KVM VCPU) vcpu_enter_guest()
        (KVM VCPU) kvm_vcpu_reload_apic_access_page()
        (Invalidator) actually unmap page
      
      Because of the above race, there can be a mismatch between the
      host physical address stored in the APIC_ACCESS_PAGE VMCS field and
      the host physical address stored in the EPT entry for the APIC GPA
      (0xfee00000).  When this happens, the processor will not trap APIC
      accesses, and will instead show the raw contents of the APIC-access page.
      Because Windows OS periodically checks for unexpected modifications to
      the LAPIC register, this will show up as a BSOD crash with BugCheck
      CRITICAL_STRUCTURE_CORRUPTION (109) we are currently seeing in
      https://bugzilla.redhat.com/show_bug.cgi?id=1751017.
      
      The root cause of the issue is that kvm_arch_mmu_notifier_invalidate_range()
      cannot guarantee that no additional references are taken to the pages in
      the range before kvm_mmu_notifier_invalidate_range_end().  Fortunately,
      this case is supported by the MMU notifier API, as documented in
      include/linux/mmu_notifier.h:
      
      	 * If the subsystem
               * can't guarantee that no additional references are taken to
               * the pages in the range, it has to implement the
               * invalidate_range() notifier to remove any references taken
               * after invalidate_range_start().
      
      The fix therefore is to reload the APIC-access page field in the VMCS
      from kvm_mmu_notifier_invalidate_range() instead of ..._range_start().
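
      As a sketch of the relocated x86 hook (this mirrors the fix as
      described above; the exact body is an assumption):

        void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
        					    unsigned long start,
        					    unsigned long end)
        {
        	unsigned long apic_address;

        	/*
        	 * The APIC-access page is being unmapped here, so any cached
        	 * host physical address for it must be refreshed before the
        	 * next guest entry.
        	 */
        	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
        	if (start <= apic_address && apic_address < end)
        		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
        }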
      
      Cc: stable@vger.kernel.org
      Fixes: b1394e74 ("KVM: x86: fix APIC page invalidation")
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=197951
      
      
      Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com>
      Message-Id: <20200606042627.61070-1-eiichi.tsukata@nutanix.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>