  Apr 21, 2022
    • RISC-V: cpuidle: fix Kconfig select for RISCV_SBI_CPUIDLE · bf9bac40
      Randy Dunlap authored
      There can be lots of build errors when building cpuidle-riscv-sbi.o.
      They are all caused by a kconfig problem with this warning:
      
      WARNING: unmet direct dependencies detected for RISCV_SBI_CPUIDLE
        Depends on [n]: CPU_IDLE [=y] && RISCV [=y] && RISCV_SBI [=n]
        Selected by [y]:
        - SOC_VIRT [=y] && CPU_IDLE [=y]
      
      so make the 'select' of RISCV_SBI_CPUIDLE also depend on RISCV_SBI.
      
      Fixes: c5179ef1 ("RISC-V: Enable RISC-V SBI CPU Idle driver for QEMU virt machine")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Reported-by: kernel test robot <lkp@intel.com>
      Reviewed-by: Anup Patel <anup@brainfault.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
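
      A minimal sketch of the Kconfig change described above (assuming the
      SOC_VIRT entry in the RISC-V SoC Kconfig; not the verbatim patch):

        config SOC_VIRT
                bool "QEMU Virt Machine"
                # Only select the cpuidle driver when its own RISCV_SBI
                # dependency can be met, so the "unmet direct dependencies"
                # warning above cannot trigger.
                select RISCV_SBI_CPUIDLE if CPU_IDLE && RISCV_SBI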
    • RISC-V: mm: Fix set_satp_mode() for platform not having Sv57 · d5fdade9
      Anup Patel authored
      When Sv57 is not available, the satp.MODE test in set_satp_mode() will
      fail and lead to pgdir re-programming for Sv48. The pgdir re-programming
      will fail as well, due to the pre-existing pgdir entry used for Sv57, and
      as a result the kernel fails to boot on RISC-V platforms that do not have
      Sv57.
      
      To fix the above issue, clear the pgdir memory in set_satp_mode() before
      re-programming.
      
      Fixes: 011f09d1 ("riscv: mm: Set sv57 on defaultly")
      Reported-by: Mayuresh Chitale <mchitale@ventanamicro.com>
      Signed-off-by: Anup Patel <apatel@ventanamicro.com>
      Reviewed-by: Atish Patra <atishp@rivosinc.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
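
      A hedged sketch of the shape of the fix: zero the page-table pages used
      for the Sv57 probe before retrying with a lower mode. The early_* names
      follow the RISC-V early MMU setup by assumption and are not quoted from
      the patch:

        /* In set_satp_mode(), after the Sv57 probe fails: */
        memset(early_pg_dir, 0, PAGE_SIZE);  /* drop the stale Sv57 entries */
        memset(early_p4d, 0, PAGE_SIZE);
        memset(early_pud, 0, PAGE_SIZE);
        memset(early_pmd, 0, PAGE_SIZE);
        /* ...then re-program the page tables and satp for Sv48. */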
    • KVM: SEV: add cache flush to solve SEV cache incoherency issues · 683412cc
      Mingwei Zhang authored
      Flush the CPU caches when memory is reclaimed from an SEV guest (where
      reclaim also includes it being unmapped from KVM's memslots).  Due to lack
      of coherency for SEV encrypted memory, failure to flush results in silent
      data corruption if userspace is malicious/broken and doesn't ensure SEV
      guest memory is properly pinned and unpinned.
      
      Cache coherency is not enforced across the VM boundary in SEV (AMD APM
      vol.2 Section 15.34.7). Confidential cachelines, generated by confidential
      VM guests, have to be explicitly flushed on the host side. If a memory page
      containing dirty confidential cachelines was released by the VM and
      reallocated to another user, the cachelines may corrupt the new user at a
      later time.
      
      KVM takes a shortcut by assuming all confidential memory remains pinned
      until the end of the VM lifetime. Therefore, KVM does not flush the cache at
      mmu_notifier invalidation events. Because of this incorrect assumption and
      the lack of cache flushing, ma...
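
      A hedged sketch of the approach: flush CPU caches whenever memory is
      reclaimed from an SEV guest. The hook and helper names below are
      assumptions based on the commit text, not a quote of the patch:

        /* Arch hook invoked when memory is unmapped/reclaimed from a memslot. */
        void kvm_arch_guest_memory_reclaimed(struct kvm *kvm)
        {
                /*
                 * SEV encrypted cachelines are not coherent with the host's
                 * view of the page, so flush everything before the page can
                 * be reused by another owner.
                 */
                if (sev_guest(kvm))
                        wbinvd_on_all_cpus();
        }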
    • KVM: SVM: Flush when freeing encrypted pages even on SME_COHERENT CPUs · d45829b3
      Mingwei Zhang authored
      Use clflush_cache_range() to flush the confidential memory when
      SME_COHERENT is supported by the AMD CPU. A cache flush is still needed,
      since SME_COHERENT only provides cache coherency on the CPU side; all
      confidential cache lines are still incoherent with DMA devices.
      
      Cc: stable@vger.kernel.org
      
      Fixes: add5e2f0 ("KVM: SVM: Add support for the SEV-ES VMSA")
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Mingwei Zhang <mizhang@google.com>
      Message-Id: <20220421031407.2516575-3-mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
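
      A hedged illustration of the point: even on SME_COHERENT parts the page
      is still flushed, just with the cheaper clflush_cache_range() (a real
      kernel helper); the surrounding function name is an assumption:

        /* Per-page flush used when freeing SEV pages (illustrative). */
        static void sev_clflush_page(void *va)
        {
                /* SME_COHERENT covers CPU caches only; DMA can still see
                 * stale lines, so an explicit flush is still required. */
                if (boot_cpu_has(X86_FEATURE_SME_COHERENT))
                        clflush_cache_range(va, PAGE_SIZE);
                /* Non-coherent parts take a heavier path (see the next entry). */
        }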
    • KVM: SVM: Simplify and harden helper to flush SEV guest page(s) · 4bbef7e8
      Sean Christopherson authored
      
      Rework sev_flush_guest_memory() to explicitly handle only a single page,
      and harden it to fall back to WBINVD if VM_PAGE_FLUSH fails.  Per-page
      flushing is currently used only to flush the VMSA, and in its current
      form, the helper is completely broken with respect to flushing actual
      guest memory, i.e. won't work correctly for an arbitrary memory range.
      
      VM_PAGE_FLUSH takes a host virtual address, and is subject to normal page
      walks, i.e. will fault if the address is not present in the host page
      tables or does not have the correct permissions.  Current AMD CPUs also
      do not honor SMAP overrides (undocumented in kernel versions of the APM),
      so passing in a userspace address is completely out of the question.  In
      other words, KVM would need to manually walk the host page tables to get
      the pfn, ensure the pfn is stable, and then use the direct map to invoke
      VM_PAGE_FLUSH.  And the latter might not even work, e.g. if userspace is
      particularly evil/clever and backs the guest with Secret Memory (which
      unmaps memory from the direct map).
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Fixes: add5e2f0 ("KVM: SVM: Add support for the SEV-ES VMSA")
      Reported-by: Mingwei Zhang <mizhang@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Mingwei Zhang <mizhang@google.com>
      Message-Id: <20220421031407.2516575-2-mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
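
      A hedged sketch of the hardened single-page flush: try the targeted
      VM_PAGE_FLUSH MSR on a kernel mapping of the page and fall back to
      WBINVD on all CPUs if that fails. Helper names and the exact MSR
      plumbing are assumptions, not a quote of the patch:

        static void sev_flush_encrypted_page(struct kvm_vcpu *vcpu, void *va)
        {
                unsigned long addr = (unsigned long)va;

                if (boot_cpu_has(X86_FEATURE_SME_COHERENT)) {
                        clflush_cache_range(va, PAGE_SIZE);
                        return;
                }

                /* VM_PAGE_FLUSH takes a host VA plus the guest's ASID. */
                if (wrmsrl_safe(MSR_AMD64_VM_PAGE_FLUSH,
                                addr | sev_get_asid(vcpu->kvm)))
                        wbinvd_on_all_cpus();  /* flush failed: nuke everything */
        }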
    • KVM: x86/pmu: Update AMD PMC sample period to fix guest NMI-watchdog · 75189d1d
      Like Xu authored
      The NMI-watchdog is one of the favorite features of kernel developers,
      but it does not work in an AMD guest even with vPMU enabled and, worse,
      the system misrepresents this capability via /proc.
      
      This is a PMC emulation error. KVM does not pass the latest valid
      value to perf_event in time when guest NMI-watchdog is running, thus
      the perf_event corresponding to the watchdog counter will enter the
      old state at some point after the first guest NMI injection, forcing
      the hardware register PMC0 to be constantly written to 0x800000000001.
      
      Meanwhile, the running counter should accurately reflect its new value
      based on the latest coordinated pmc->counter (from vPMC's point of view)
      rather than the value written directly by the guest.
      
      Fixes: 168d918f ("KVM: x86: Adjust counter sample period after a wrmsr")
      Reported-by: Dongli Cao <caodongli@kingsoft.com>
      Signed-off-by: Like Xu <likexu@tencent.com>
      Reviewed-by: Yanan Wang <wangyanan55@huawei.com>
      Tested-by: Yanan Wang <wangyanan55@huawei.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Message-Id: <20220409015226.38619-1-likexu@tencent.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
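
      A hedged sketch of the idea: after folding the guest's MSR write into
      pmc->counter, immediately re-arm the perf_event with a period derived
      from that value so the next overflow (and watchdog NMI) fires where the
      guest expects. Helper names follow KVM's PMU code by assumption:

        static void pmc_update_sample_period(struct kvm_pmc *pmc)
        {
                if (!pmc->perf_event || pmc->is_paused)
                        return;

                /* Push the latest counter value into perf right away instead
                 * of leaving the stale sample period armed. */
                perf_event_period(pmc->perf_event,
                                  get_sample_period(pmc, pmc->counter));
        }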
    • x86/kvm: Preserve BSP MSR_KVM_POLL_CONTROL across suspend/resume · 0361bdfd
      Wanpeng Li authored
      
      MSR_KVM_POLL_CONTROL is cleared on reset, thus reverting guests to
      host-side polling after suspend/resume.  Non-bootstrap CPUs are
      restored correctly by the haltpoll driver because they are hot-unplugged
      during suspend and hot-plugged during resume; however, the BSP
      is not hotpluggable and remains in host-side polling mode after
      the guest resumes.  This makes the guest pay for the cost of vmexits
      every time the guest enters idle.
      
      Fix it by recording BSP's haltpoll state and resuming it during guest
      resume.
      
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1650267752-46796-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
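
      A hedged sketch of the mechanism: a syscore suspend/resume pair in the
      guest records whether the BSP had disabled host-side polling and rewrites
      MSR_KVM_POLL_CONTROL on resume, since the BSP (unlike the APs) is never
      hot-unplugged. Names and placement are assumptions based on the commit text:

        static bool bsp_guest_poll;  /* saved across suspend */

        static int kvm_poll_suspend(void)
        {
                u64 val = 0;

                if (kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL))
                        rdmsrl(MSR_KVM_POLL_CONTROL, val);
                bsp_guest_poll = !(val & 1);  /* bit 0 clear => guest polls */
                return 0;
        }

        static void kvm_poll_resume(void)
        {
                /* Reset re-enabled host-side polling; restore the guest's choice. */
                if (kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL) && bsp_guest_poll)
                        wrmsrl(MSR_KVM_POLL_CONTROL, 0);
        }

        static struct syscore_ops kvm_poll_syscore_ops = {
                .suspend = kvm_poll_suspend,
                .resume  = kvm_poll_resume,
        };
        /* registered once via register_syscore_ops(&kvm_poll_syscore_ops) */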
    • KVM: x86: Skip KVM_GUESTDBG_BLOCKIRQ APICv update if APICv is disabled · 0047fb33
      Sean Christopherson authored
      Skip the APICv inhibit update for KVM_GUESTDBG_BLOCKIRQ if APICv is
      disabled at the module level to avoid having to acquire the mutex and
      potentially process all vCPUs. The DISABLE inhibit will (barring bugs)
      never be lifted, so piling on more inhibits is unnecessary.
      
      Fixes: cae72dcc ("KVM: x86: inhibit APICv when KVM_GUESTDBG_BLOCKIRQ active")
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220420013732.3308816-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
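
      A hedged sketch of the early-out (the surrounding helper name is an
      assumption):

        static void update_blockirq_apicv_inhibit(struct kvm *kvm, bool set)
        {
                /* APICv is off at the module level: the permanent DISABLE
                 * inhibit already covers this, so skip the expensive update. */
                if (!enable_apicv)
                        return;

                kvm_set_or_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_BLOCKIRQ, set);
        }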
    • KVM: x86: Pend KVM_REQ_APICV_UPDATE during vCPU creation to fix a race · 423ecfea
      Sean Christopherson authored
      
      Make a KVM_REQ_APICV_UPDATE request when creating a vCPU with an
      in-kernel local APIC and APICv enabled at the module level.  Consuming
      kvm_apicv_activated() and stuffing vcpu->arch.apicv_active directly can
      race with __kvm_set_or_clear_apicv_inhibit(), as vCPU creation happens
      before the vCPU is fully onlined, i.e. it won't get the request made to
      "all" vCPUs.  If APICv is globally inhibited between setting apicv_active
      and onlining the vCPU, the vCPU will end up running with APICv enabled
      and trigger KVM's sanity check.
      
      Mark APICv as active during vCPU creation if APICv is enabled at the
      module level, both to be optimistic about its final state, e.g. to avoid
      additional VMWRITEs on VMX, and because there are likely bugs lurking
      since KVM checks apicv_active in multiple vCPU creation paths.  While
      keeping the current behavior of consuming kvm_apicv_activated() is
      arguably safer from a regression perspective, force apicv_active so that
      vCPU creation runs with deterministic state and so that if there are bugs,
      they are found sooner than later, i.e. not when some crazy race condition
      is hit.
      
        WARNING: CPU: 0 PID: 484 at arch/x86/kvm/x86.c:9877 vcpu_enter_guest+0x2ae3/0x3ee0 arch/x86/kvm/x86.c:9877
        Modules linked in:
        CPU: 0 PID: 484 Comm: syz-executor361 Not tainted 5.16.13 #2
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1~cloud0 04/01/2014
        RIP: 0010:vcpu_enter_guest+0x2ae3/0x3ee0 arch/x86/kvm/x86.c:9877
        Call Trace:
         <TASK>
         vcpu_run arch/x86/kvm/x86.c:10039 [inline]
         kvm_arch_vcpu_ioctl_run+0x337/0x15e0 arch/x86/kvm/x86.c:10234
         kvm_vcpu_ioctl+0x4d2/0xc80 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3727
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:874 [inline]
         __se_sys_ioctl fs/ioctl.c:860 [inline]
         __x64_sys_ioctl+0x16d/0x1d0 fs/ioctl.c:860
         do_syscall_x64 arch/x86/entry/common.c:50 [inline]
         do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The bug was hit by a syzkaller spamming VM creation with 2 vCPUs and a
      call to KVM_SET_GUEST_DEBUG.
      
        r0 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000000), 0x0, 0x0)
        r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
        ioctl$KVM_CAP_SPLIT_IRQCHIP(r1, 0x4068aea3, &(0x7f0000000000)) (async)
        r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0) (async)
        r3 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x400000000000002)
        ioctl$KVM_SET_GUEST_DEBUG(r3, 0x4048ae9b, &(0x7f00000000c0)={0x5dda9c14aa95f5c5})
        ioctl$KVM_RUN(r2, 0xae80, 0x0)
      
      Reported-by: Gaoning Pan <pgn@zju.edu.cn>
      Reported-by: Yongkang Jia <kangel@zju.edu.cn>
      Fixes: 8df14af4 ("kvm: x86: Add support for dynamic APICv activation")
      Cc: stable@vger.kernel.org
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220420013732.3308816-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
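
      A hedged sketch of the fix at vCPU creation (surrounding code assumed):

        /* In kvm_arch_vcpu_create(): */
        if (irqchip_in_kernel(vcpu->kvm) && enable_apicv) {
                vcpu->arch.apicv_active = true;
                /* Re-evaluate the global inhibits on first entry, closing the
                 * race with __kvm_set_or_clear_apicv_inhibit() that can run
                 * before this vCPU is visible to "all vCPUs" requests. */
                kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
        }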
    • KVM: nVMX: Defer APICv updates while L2 is active until L1 is active · 7c69661e
      Sean Christopherson authored
      
      Defer APICv updates that occur while L2 is active until nested VM-Exit,
      i.e. until L1 regains control.  vmx_refresh_apicv_exec_ctrl() assumes L1
      is active and (a) stomps all over vmcs02 and (b) neglects to ever update
      vmcs01.  E.g. if vmcs12 doesn't enable the TPR shadow for L2 (and thus no
      APICv controls), L1 performs nested VM-Enter APICv inhibited, and APICv
      becomes uninhibited while L2 is active, KVM will set various APICv controls
      in vmcs02 and trigger a failed VM-Entry.  The kicker is that, unless
      running with nested_early_check=1, KVM blames L1 and chaos ensues.
      
      In all cases, ignoring vmcs02 and always deferring the inhibition change
      to vmcs01 is correct (or at least acceptable).  The ABSENT and DISABLE
      inhibitions cannot truly change while L2 is active (see below).
      
      IRQ_BLOCKING can change, but it is firmly a best effort debug feature.
      Furthermore, only L2's APIC is accelerated/virtualized to the full extent
      possible, e.g. even if L1 passes through its APIC to L2, normal MMIO/MSR
      interception will apply to the virtual APIC managed by KVM.
      The exception is the SELF_IPI register when x2APIC is enabled, but that's
      an acceptable hole.
      
      Lastly, Hyper-V's Auto EOI can technically be toggled if L1 exposes the
      MSRs to L2, but for that to work in any sane capacity, L1 would need to
      pass through IRQs to L2 as well, and IRQs must be intercepted to enable
      virtual interrupt delivery.  I.e. exposing Auto EOI to L2 and enabling
      VID for L2 are, for all intents and purposes, mutually exclusive.
      
      Lack of dynamic toggling is also why this scenario is all but impossible
      to encounter in KVM's current form.  But a future patch will pend an
      APICv update request _during_ vCPU creation to plug a race where a vCPU
      that's being created doesn't get included in the "all vCPUs request"
      because it's not yet visible to other vCPUs.  If userspace restores L2
      after VM creation (hello, KVM selftests), the first KVM_RUN will occur
      while L2 is active and thus service the APICv update request made during
      VM creation.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220420013732.3308816-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
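
      One plausible shape of the deferral, with field and function names
      assumed rather than quoted: note that vmcs01 is stale while L2 runs,
      then pend the update at nested VM-Exit:

        /* In the VMX APICv refresh path: */
        if (is_guest_mode(vcpu)) {
                /* Don't touch vmcs02; apply the change to vmcs01 once L1
                 * regains control. */
                vmx->nested.update_vmcs01_apicv_status = true;
                return;
        }

        /* At nested VM-Exit, back in L1's context: */
        if (vmx->nested.update_vmcs01_apicv_status) {
                vmx->nested.update_vmcs01_apicv_status = false;
                kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
        }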
    • KVM: x86: Tag APICv DISABLE inhibit, not ABSENT, if APICv is disabled · 80f0497c
      Sean Christopherson authored
      Set the DISABLE inhibit, not the ABSENT inhibit, if APICv is disabled via
      module param.  A recent refactoring to add a wrapper for setting/clearing
      inhibits unintentionally changed the flag, probably due to a copy+paste
      goof.
      
      Fixes: 4f4c4a3e ("KVM: x86: Trace all APICv inhibit changes and capture overall status")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220420013732.3308816-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
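
      A sketch of the corrected call, with the surrounding helper assumed:

        if (!enable_apicv)
                kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_DISABLE);
        /* not APICV_INHIBIT_REASON_ABSENT, which is reserved for VMs that
         * lack an in-kernel local APIC */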
    • KVM: Add helpers to wrap vcpu->srcu_idx and yell if it's abused · 2031f287
      Sean Christopherson authored
      
      Add wrappers to acquire/release KVM's SRCU lock when stashing the index
      in vcpu->srcu_idx, along with rudimentary detection of illegal usage,
      e.g. re-acquiring SRCU and thus overwriting vcpu->srcu_idx.  Because the
      SRCU index is (currently) either 0 or 1, illegal nesting bugs can go
      unnoticed for quite some time and only cause problems when the nested
      lock happens to get a different index.
      
      Wrap the WARNs in PROVE_RCU=y, and make them ONCE, otherwise KVM will
      likely yell so loudly that it will bring the kernel to its knees.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Tested-by: Fabiano Rosas <farosas@linux.ibm.com>
      Message-Id: <20220415004343.2203171-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
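
      A hedged sketch of such wrappers; the srcu_depth bookkeeping field and
      message text are assumptions:

        static inline void kvm_vcpu_srcu_read_lock(struct kvm_vcpu *vcpu)
        {
        #ifdef CONFIG_PROVE_RCU
                WARN_ONCE(vcpu->srcu_depth++, "KVM: illegal nested SRCU read lock");
        #endif
                vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
        }

        static inline void kvm_vcpu_srcu_read_unlock(struct kvm_vcpu *vcpu)
        {
                srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
        #ifdef CONFIG_PROVE_RCU
                WARN_ONCE(--vcpu->srcu_depth, "KVM: unbalanced SRCU read unlock");
        #endif
        }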
    • KVM: RISC-V: Use kvm_vcpu.srcu_idx, drop RISC-V's unnecessary copy · fdd6f6ac
      Sean Christopherson authored
      
      Use the generic kvm_vcpu's srcu_idx instead of using an identical field
      in RISC-V's version of kvm_vcpu_arch.  Generic KVM very intentionally
      does not touch vcpu->srcu_idx, i.e. there's zero chance of running afoul
      of common code.
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220415004343.2203171-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Don't re-acquire SRCU lock in complete_emulated_io() · 2d089356
      Sean Christopherson authored
      Don't re-acquire SRCU in complete_emulated_io() now that KVM acquires the
      lock in kvm_arch_vcpu_ioctl_run().  More importantly, don't overwrite
      vcpu->srcu_idx.  If the index acquired by complete_emulated_io() differs
      from the one acquired by kvm_arch_vcpu_ioctl_run(), KVM will effectively
      leak a lock and hang if/when synchronize_srcu() is invoked for the
      relevant grace period.
      
      Fixes: 8d25b7be ("KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220415004343.2203171-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • powerpc/perf: Fix 32bit compile · bb82c574
      Alexey Kardashevskiy authored
      
      The "read_bhrb" global symbol is only called under CONFIG_PPC64 of
      arch/powerpc/perf/core-book3s.c but it is compiled for both 32 and 64 bit
      anyway (and LLVM fails to link this on 32bit).
      
      This fixes it by moving bhrb.o to obj64 targets.
      
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20220421025756.571995-1-aik@ozlabs.ru
    • powerpc/perf: Fix power10 event alternatives · c6cc9a85
      Athira Rajeev authored
      When scheduling a group of events, there are constraint checks done to
      make sure all events can go in a group. For example, one of the criteria
      is that events in a group cannot use the same PMC. But the
      platform-specific PMU supports alternative events for some of the event
      codes. During perf_event_open(), if any event group doesn't match the
      constraint check criteria, a further lookup is done to find an
      alternative event.
      
      By current design, the array of alternative events in the PMU code is
      expected to be sorted by column 0. This is because the return criteria
      in find_alternative() is based on event code comparison, i.e.
      "event < ev_alt[i][0]". This optimisation is there since
      find_alternative() can be called multiple times. In the power10 PMU
      code, the alternative event array is not sorted properly and hence
      finding an alternative event breaks.
      
      To work with the existing logic, fix the alternative event array to be
      sorted by column 0 for power10-pmu.c.
      
      Results:
      
      In cases where an alternative event is not chosen when it could be,
      events will be multiplexed, i.e. time sliced where they could actually
      run concurrently.
      
      For example, in power10, PM_INST_CMPL_ALT (0x00002) has the alternative
      event PM_INST_CMPL (0x500fa). Without the fix, if a group of events with
      PMC1 to PMC4 is used along with PM_INST_CMPL_ALT, it will be time sliced,
      since all programmable PMCs are already consumed. But with the fix, when
      it picks the alternative event on PMC5, all events will run concurrently.
      
      Before:
      
       # perf stat -e r00002,r100fc,r200fa,r300fc,r400fc
      
       Performance counter stats for 'system wide':
      
               328668935      r00002               (79.94%)
                56501024      r100fc               (79.95%)
                49564238      r200fa               (79.95%)
                     376      r300fc               (80.19%)
                     660      r400fc               (79.97%)
      
             4.039150522 seconds time elapsed
      
      With the fix, since the alternative event is chosen to run on PMC6,
      events will run concurrently.
      
      After:
      
       # perf stat -e r00002,r100fc,r200fa,r300fc,r400fc
      
       Performance counter stats for 'system wide':
      
                23596607      r00002
                 4907738      r100fc
                 2283608      r200fa
                     135      r300fc
                     248      r400fc
      
             1.664671390 seconds time elapsed
      
      Fixes: a64e697c ("powerpc/perf: power10 Performance Monitoring support")
      Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
      Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20220419114828.89843-2-atrajeev@linux.vnet.ibm.com
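
      A hedged sketch of why the table must be sorted by column 0; the loop
      mirrors the find_alternative() comparison quoted in the commit, not the
      exact kernel code:

        static int find_alternative(u64 event, const unsigned int ev_alt[][MAX_ALT],
                                    int size)
        {
                int i, j;

                for (i = 0; i < size; ++i) {
                        /* Early out assumes column 0 is sorted ascending; with
                         * an unsorted table, matching rows below are skipped. */
                        if (event < ev_alt[i][0])
                                break;

                        for (j = 0; j < MAX_ALT && ev_alt[i][j]; ++j)
                                if (event == ev_alt[i][j])
                                        return i;
                }
                return -1;
        }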
    • powerpc/perf: Fix power9 event alternatives · 0dcad700
      Athira Rajeev authored
      When scheduling a group of events, there are constraint checks done to
      make sure all events can go in a group. For example, one of the criteria
      is that events in a group cannot use the same PMC. But the
      platform-specific PMU supports alternative events for some of the event
      codes. During perf_event_open(), if any event group doesn't match the
      constraint check criteria, a further lookup is done to find an
      alternative event.
      
      By current design, the array of alternative events in the PMU code is
      expected to be sorted by column 0. This is because the return criteria
      in find_alternative() is based on event code comparison, i.e.
      "event < ev_alt[i][0]". This optimisation is there since
      find_alternative() can be called multiple times. In the power9 PMU
      code, the alternative event array is not sorted properly and hence
      finding alternative events breaks.
      
      To work with the existing logic, fix the alternative event array to be
      sorted by column 0 for power9-pmu.c.
      
      Results:
      
      With alternative events, multiplexing can be avoided. For example, in
      power9, PM_LD_MISS_L1 (0x3e054) has the alternative event
      PM_LD_MISS_L1_ALT (0x400f0). This is an identical event which can be
      programmed in a different PMC.
      
      Before:
      
       # perf stat -e r3e054,r300fc
      
       Performance counter stats for 'system wide':
      
                 1057860      r3e054              (50.21%)
                     379      r300fc              (49.79%)
      
             0.944329741 seconds time elapsed
      
      Since both the events are using PMC3 in this case, they are
      multiplexed here.
      
      After:
      
       # perf stat -e r3e054,r300fc
      
       Performance counter stats for 'system wide':
      
                 1006948      r3e054
                     182      r300fc
      
      Fixes: 91e0bd1e ("powerpc/perf: Add PM_LD_MISS_L1 and PM_BR_2PATH to power9 event list")
      Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
      Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20220419114828.89843-1-atrajeev@linux.vnet.ibm.com
    • KVM: PPC: Fix TCE handling for VFIO · 26a62b75
      Alexey Kardashevskiy authored
      The LoPAPR spec defines a guest-visible IOMMU with a variable page size.
      Currently QEMU advertises 4K, 64K, 2M and 16MB pages, and a Linux VM picks
      the biggest (16MB). In the case of a passed-through PCI device, there is
      a hardware IOMMU which does not support all page sizes from the above -
      P8 cannot do 2MB and P9 cannot do 16MB. So for each emulated
      16MB IOMMU page we may create several smaller mappings ("TCEs") in
      the hardware IOMMU.
      
      The code wrongly uses the emulated TCE index instead of the hardware TCE
      index in error handling. The problem is easier to see on POWER8 with
      multi-level TCE tables (when only the first level is preallocated),
      as hash mode uses real-mode TCE hypercall handlers.
      The kernel starts using indirect tables when VMs get bigger than 128GB
      (depending on the max page order).
      The very first real-mode hcall is going to fail with H_TOO_HARD, as in
      real mode we cannot allocate memory for TCEs (we can in virtual mode),
      but on the way out the code attempts to clear hardware TCEs using
      emulated TCE indexes, which corrupts random kernel memory because
      it_offset==1<<59 is subtracted from those indexes and the resulting index
      is out of the TCE table bounds.
      
      This fixes kvmppc_clear_tce() to use the correct TCE indexes.
      
      While at it, this fixes TCE cache invalidation which uses emulated TCE
      indexes instead of the hardware ones. This went unnoticed as 64bit DMA
      is used these days and VMs map all RAM in one go and only then do DMA
      and this is when the TCE cache gets populated.
      
      Potentially this could slow down mapping, however normally 16MB
      emulated pages are backed by 64K hardware pages so it is one write to
      the "TCE Kill" per 256 updates which is not that bad considering the size
      of the cache (1024 TCEs or so).
      
      Fixes: ca1fc489 ("KVM: PPC: Book3S: Allow backing bigger guest IOMMU pages with smaller physical pages")
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Tested-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20220420050840.328223-1-aik@ozlabs.ru
    • powerpc/time: Always set decrementer in timer_interrupt() · d2b9be1f
      Michael Ellerman authored
      This is a partial revert of commit 0faf20a1 ("powerpc/64s/interrupt:
      Don't enable MSR[EE] in irq handlers unless perf is in use").
      
      Prior to that commit, we always set the decrementer in
      timer_interrupt(), to clear the timer interrupt. Otherwise we could end
      up continuously taking timer interrupts.
      
      When high res timers are enabled there is no problem seen with leaving
      the decrementer untouched in timer_interrupt(), because it will be
      programmed via hrtimer_interrupt() -> tick_program_event() ->
      clockevents_program_event() -> decrementer_set_next_event().
      
      However with CONFIG_HIGH_RES_TIMERS=n or booting with highres=off, we
      see a stall/lockup, because tick_nohz_handler() does not cause a
      reprogram of the decrementer, leading to endless timer interrupts.
      Example trace:
      
        [    1.898617][    T7] Freeing initrd memory: 2624K^M
        [   22.680919][    C1] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:^M
        [   22.682281][    C1] rcu:     0-....: (25 ticks this GP) idle=073/0/0x1 softirq=10/16 fqs=1050 ^M
        [   22.682851][    C1]  (detected by 1, t=2102 jiffies, g=-1179, q=476)^M
        [   22.683649][    C1] Sending NMI from CPU 1 to CPUs 0:^M
        [   22.685252][    C0] NMI backtrace for cpu 0^M
        [   22.685649][    C0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.16.0-rc2-00185-g0faf20a1ad16 #145^M
        [   22.686393][    C0] NIP:  c000000000016d64 LR: c000000000f6cca4 CTR: c00000000019c6e0^M
        [   22.686774][    C0] REGS: c000000002833590 TRAP: 0500   Not tainted  (5.16.0-rc2-00185-g0faf20a1ad16)^M
        [   22.687222][    C0] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24000222  XER: 00000000^M
        [   22.688297][    C0] CFAR: c00000000000c854 IRQMASK: 0 ^M
        ...
        [   22.692637][    C0] NIP [c000000000016d64] arch_local_irq_restore+0x174/0x250^M
        [   22.694443][    C0] LR [c000000000f6cca4] __do_softirq+0xe4/0x3dc^M
        [   22.695762][    C0] Call Trace:^M
        [   22.696050][    C0] [c000000002833830] [c000000000f6cc80] __do_softirq+0xc0/0x3dc (unreliable)^M
        [   22.697377][    C0] [c000000002833920] [c000000000151508] __irq_exit_rcu+0xd8/0x130^M
        [   22.698739][    C0] [c000000002833950] [c000000000151730] irq_exit+0x20/0x40^M
        [   22.699938][    C0] [c000000002833970] [c000000000027f40] timer_interrupt+0x270/0x460^M
        [   22.701119][    C0] [c0000000028339d0] [c0000000000099a8] decrementer_common_virt+0x208/0x210^M
      
      Possibly this should be fixed in the lowres timing code, but that would
      be a generic change and could take some time and may not backport
      easily, so for now make the programming of the decrementer unconditional
      again in timer_interrupt() to avoid the stall/lockup.
      
      Fixes: 0faf20a1 ("powerpc/64s/interrupt: Don't enable MSR[EE] in irq handlers unless perf is in use")
      Reported-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Link: https://lore.kernel.org/r/20220420141657.771442-1-mpe@ellerman.id.au
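
      An illustrative, hedged sketch of the restored behaviour at the tail of
      timer_interrupt(); the helper names (set_dec_or_work, decrementer_max)
      follow the powerpc timer code by assumption, and this is not the exact patch:

        /* Next event is still in the future: always re-arm the decrementer,
         * even when highres timers are off, so it cannot sit expired and
         * immediately re-fire. */
        now = *next_tb - now;
        if (now <= decrementer_max)
                set_dec_or_work(now);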
  Apr 15, 2022
    • xtensa: fix a7 clobbering in coprocessor context load/store · 839769c3
      Max Filippov authored
      Fast coprocessor exception handler saves a3..a6, but coprocessor context
      load/store code uses a4..a7 as temporaries, potentially clobbering a7.
      'Potentially' because coprocessor state load/store macros may not use
      all four temporary registers (and neither FPU nor HiFi macros do).
      Use a3..a6 as intended.
      
      Cc: stable@vger.kernel.org
      Fixes: c658eac6 ("[XTENSA] Add support for configurable registers and coprocessors")
      Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
    • mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore · c12cd77c
      Omar Sandoval authored
      Commit 3ee48b6a ("mm, x86: Saving vmcore with non-lazy freeing of
      vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to
      lazy_max_pages() + 1, ensuring that any future vunmaps() immediately
      purge the vmap areas instead of doing it lazily.
      
      Commit 690467c8 ("mm/vmalloc: Move draining areas out of caller
      context") moved the purging from the vunmap() caller to a worker thread.
      Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin
      (possibly forever).  For example, consider the following scenario:
      
       1. Thread reads from /proc/vmcore. This eventually calls
          __copy_oldmem_page() -> set_iounmap_nonlazy(), which sets
          vmap_lazy_nr to lazy_max_pages() + 1.
      
       2. Then it calls free_vmap_area_noflush() (via iounmap()), which adds 2
          pages (one page plus the guard page) to the purge list and
          vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so the
          drain_vmap_work is scheduled.
      
       3. Thread returns from the kernel and is scheduled out.
      
       4. Worker thread is scheduled in and calls drain_vmap_area_work(). It
          frees the 2 pages on the purge list. vmap_lazy_nr is now
          lazy_max_pages() + 1.
      
       5. This is still over the threshold, so it tries to purge areas again,
          but doesn't find anything.
      
       6. Repeat 5.
      
      If the system is running with only one CPU (which is typical for kdump)
      and preemption is disabled, then this will never make forward progress:
      there aren't any more pages to purge, so it hangs.  If there is more
      than one CPU or preemption is enabled, then the worker thread will spin
      forever in the background.  (Note that if there were already pages to be
      purged at the time that set_iounmap_nonlazy() was called, this bug is
      avoided.)
      
      This can be reproduced with anything that reads from /proc/vmcore
      multiple times.  E.g., vmcore-dmesg /proc/vmcore.
      
      It turns out that improvements to vmap() over the years have obsoleted
      the need for this "optimization".  I benchmarked `dd if=/proc/vmcore
      of=/dev/null` with 4k and 1M read sizes on a system with a 32GB vmcore.
      The test was run on 5.17, 5.18-rc1 with a fix that avoided the hang, and
      5.18-rc1 with set_iounmap_nonlazy() removed entirely:
      
          |5.17  |5.18+fix|5.18+removal
        4k|40.86s|  40.09s|      26.73s
        1M|24.47s|  23.98s|      21.84s
      
      The removal was the fastest (by a wide margin with 4k reads).  This
      patch removes set_iounmap_nonlazy().
      
      Link: https://lkml.kernel.org/r/52f819991051f9b865e9ce25605509bfdbacadcd.1649277321.git.osandov@fb.com
      Fixes: 690467c8 ("mm/vmalloc: Move draining areas out of caller context")
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Acked-by: Chris Down <chris@chrisdown.name>
      Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
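
      An illustrative sketch (not the exact 5.18-rc1 code) of how the worker
      ends up spinning: it re-checks vmap_lazy_nr after each purge, but the
      artificial bump added by set_iounmap_nonlazy() has no backing vmap areas
      and therefore never drains:

        static void drain_vmap_area_work(struct work_struct *work)
        {
                unsigned long nr_lazy;

                do {
                        mutex_lock(&vmap_purge_lock);
                        __purge_vmap_area_lazy(ULONG_MAX, 0);  /* frees queued areas */
                        mutex_unlock(&vmap_purge_lock);

                        /* Still above lazy_max_pages() because of the bump, yet
                         * nothing is left to purge: loop forever. */
                        nr_lazy = atomic_long_read(&vmap_lazy_nr);
                } while (nr_lazy > lazy_max_pages());
        }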