  1. Apr 15, 2022
    • mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore · c12cd77c
      Omar Sandoval authored
      Commit 3ee48b6a ("mm, x86: Saving vmcore with non-lazy freeing of
      vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to
      lazy_max_pages() + 1, ensuring that any future vunmaps() immediately
      purge the vmap areas instead of doing it lazily.
      
      Commit 690467c8 ("mm/vmalloc: Move draining areas out of caller
      context") moved the purging from the vunmap() caller to a worker thread.
      Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin
      (possibly forever).  For example, consider the following scenario:
      
       1. Thread reads from /proc/vmcore. This eventually calls
          __copy_oldmem_page() -> set_iounmap_nonlazy(), which sets
          vmap_lazy_nr to lazy_max_pages() + 1.
      
       2. Then it calls free_vmap_area_noflush() (via iounmap()), which adds 2
          pages (one page plus the guard page) to the purge list and
          vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so the
          drain_vmap_work is scheduled.
      
       3. Thread returns from the kernel and is scheduled out.
      
       4. Worker thread is scheduled in and calls drain_vmap_area_work(). It
          frees the 2 pages on the purge list. vmap_lazy_nr is now
          lazy_max_pages() + 1.
      
       5. This is still over the threshold, so it tries to purge areas again,
          but doesn't find anything.
      
       6. Repeat 5.
      
       If the system is running with only one CPU (which is typical for kdump)
      and preemption is disabled, then this will never make forward progress:
      there aren't any more pages to purge, so it hangs.  If there is more
      than one CPU or preemption is enabled, then the worker thread will spin
      forever in the background.  (Note that if there were already pages to be
      purged at the time that set_iounmap_nonlazy() was called, this bug is
      avoided.)
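
       A minimal userspace sketch of the retry loop described above (the names
       and the threshold value are illustrative stand-ins, not the kernel's
       code):

         #include <stdbool.h>
         #include <stdio.h>

         static unsigned long lazy_max_pages(void) { return 32; } /* stand-in threshold */
         static unsigned long vmap_lazy_nr;      /* pages pending lazy purge */
         static unsigned long purge_list_pages;  /* pages currently on the purge list */

         /* Models purging: frees whatever is on the list, reports progress. */
         static bool purge_vmap_areas(void)
         {
                 if (purge_list_pages == 0)
                         return false;
                 vmap_lazy_nr -= purge_list_pages;
                 purge_list_pages = 0;
                 return true;
         }

         /* Models drain_vmap_area_work(): retries while over the threshold. */
         static void drain_vmap_area_work(void)
         {
                 int attempts = 0;

                 while (vmap_lazy_nr > lazy_max_pages()) {
                         if (!purge_vmap_areas() && ++attempts > 3) {
                                 printf("stuck: vmap_lazy_nr=%lu, purge list empty\n",
                                        vmap_lazy_nr);
                                 break; /* the real worker has no such escape */
                         }
                 }
         }

         int main(void)
         {
                 vmap_lazy_nr = lazy_max_pages() + 1; /* set_iounmap_nonlazy() */
                 vmap_lazy_nr += 2;                   /* iounmap(): page + guard page */
                 purge_list_pages = 2;
                 drain_vmap_area_work();              /* frees 2 pages, then spins */
                 return 0;
         }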
      
      This can be reproduced with anything that reads from /proc/vmcore
      multiple times.  E.g., vmcore-dmesg /proc/vmcore.
      
      It turns out that improvements to vmap() over the years have obsoleted
      the need for this "optimization".  I benchmarked `dd if=/proc/vmcore
      of=/dev/null` with 4k and 1M read sizes on a system with a 32GB vmcore.
      The test was run on 5.17, 5.18-rc1 with a fix that avoided the hang, and
      5.18-rc1 with set_iounmap_nonlazy() removed entirely:
      
         read size |   5.17 | 5.18+fix | 5.18+removal
                4k | 40.86s |   40.09s |       26.73s
                1M | 24.47s |   23.98s |       21.84s
      
      The removal was the fastest (by a wide margin with 4k reads).  This
      patch removes set_iounmap_nonlazy().
      
      Link: https://lkml.kernel.org/r/52f819991051f9b865e9ce25605509bfdbacadcd.1649277321.git.osandov@fb.com
       Fixes: 690467c8 ("mm/vmalloc: Move draining areas out of caller context")
       Signed-off-by: Omar Sandoval <osandov@fb.com>
       Acked-by: Chris Down <chris@chrisdown.name>
       Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
       Reviewed-by: Christoph Hellwig <hch@lst.de>
       Acked-by: Baoquan He <bhe@redhat.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. Apr 06, 2022
    • Revert "powerpc: Set max_mapnr correctly" · 1ff5c8e8
      Kefeng Wang authored
      This reverts commit 602946ec.
      
       If CONFIG_HIGHMEM is enabled, no highmem will be added with max_mapnr
       set to max_low_pfn: the loop in mem_init() frees highmem pages for pfns
       from highmem_mapnr up to max_mapnr, so capping max_mapnr at max_low_pfn
       leaves no highmem pfns to free:
      
        for (pfn = highmem_mapnr; pfn < max_mapnr; ++pfn) {
              ...
              free_highmem_page();
        }
      
      Now that virt_addr_valid() has been fixed in the previous commit, we can
      revert the change to max_mapnr.
      
       Fixes: 602946ec ("powerpc: Set max_mapnr correctly")
       Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
       Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
       Reported-by: Erhard F. <erhard_f@mailbox.org>
       [mpe: Update change log to reflect series reordering]
       Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
       Link: https://lore.kernel.org/r/20220406145802.538416-2-mpe@ellerman.id.au
    • powerpc: Fix virt_addr_valid() for 64-bit Book3E & 32-bit · ffa0b64e
      Kefeng Wang authored
      mpe: On 64-bit Book3E vmalloc space starts at 0x8000000000000000.
      
      Because of the way __pa() works we have:
        __pa(0x8000000000000000) == 0, and therefore
        virt_to_pfn(0x8000000000000000) == 0, and therefore
        virt_addr_valid(0x8000000000000000) == true
      
       Which is wrong: virt_addr_valid() should be false for vmalloc space.
      In fact all vmalloc addresses that alias with a valid PFN will return
      true from virt_addr_valid(). That can cause bugs with hardened usercopy
      as described below by Kefeng Wang:
      
        When running ethtool eth0 on 64-bit Book3E, a BUG occurred:
      
          usercopy: Kernel memory exposure attempt detected from SLUB object not in SLUB page?! (offset 0, size 1048)!
          kernel BUG at mm/usercopy.c:99
          ...
          usercopy_abort+0x64/0xa0 (unreliable)
          __check_heap_object+0x168/0x190
          __check_object_size+0x1a0/0x200
          dev_ethtool+0x2494/0x2b20
          dev_ioctl+0x5d0/0x770
          sock_do_ioctl+0xf0/0x1d0
          sock_ioctl+0x3ec/0x5a0
          __se_sys_ioctl+0xf0/0x160
          system_call_exception+0xfc/0x1f0
          system_call_common+0xf8/0x200
      
         The relevant code is shown below:
      
          data = vzalloc(array_size(gstrings.len, ETH_GSTRING_LEN));
          copy_to_user(useraddr, data, gstrings.len * ETH_GSTRING_LEN))
      
         The data is allocated by vmalloc(), but virt_addr_valid(ptr) returns
         true on 64-bit Book3E, which leads to the panic.
      
        As commit 4dd7554a ("powerpc/64: Add VIRTUAL_BUG_ON checks for __va
        and __pa addresses") does, make sure the virt addr above PAGE_OFFSET in
        the virt_addr_valid() for 64-bit, also add upper limit check to make
        sure the virt is below high_memory.
      
        Meanwhile, for 32-bit PAGE_OFFSET is the virtual address of the start
        of lowmem, high_memory is the upper low virtual address, the check is
        suitable for 32-bit, this will fix the issue mentioned in commit
        602946ec ("powerpc: Set max_mapnr correctly") too.
      
       On 32-bit there is a similar problem with high memory that was fixed in
       commit 602946ec ("powerpc: Set max_mapnr correctly"), but that commit
       breaks highmem and needs to be reverted.
      
       We can't easily fix __pa(); we have code that relies on its current
       behaviour. So for now add extra checks to virt_addr_valid().
      
       For 64-bit Book3S the extra checks are not necessary; the combination of
       virt_to_pfn() and pfn_valid() already yields the correct result, and the
       extra checks are harmless.
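
       A minimal sketch of the kind of bounds check described above (standalone
       C with illustrative constants and a stand-in pfn_valid(); not the actual
       powerpc patch):

         #include <stdbool.h>
         #include <stdint.h>

         #define PAGE_SHIFT   12
         #define PAGE_OFFSET  0xc000000000000000UL  /* example linear-map base */

         /* Example top of lowmem; the kernel's real bound is high_memory. */
         static uintptr_t high_memory = PAGE_OFFSET + (1UL << 30);

         static bool pfn_valid(unsigned long pfn)
         {
                 return pfn < (1UL << 20);  /* stand-in for the arch helper */
         }

         static bool virt_addr_valid(const void *kaddr)
         {
                 uintptr_t addr = (uintptr_t)kaddr;

                 /* vmalloc and other non-linear addresses fail the bounds checks */
                 if (addr < PAGE_OFFSET || addr >= high_memory)
                         return false;

                 /* only linear-map addresses reach the pfn check */
                 return pfn_valid((addr - PAGE_OFFSET) >> PAGE_SHIFT);
         }

         int main(void)
         {
                 /* a linear-map address passes; a vmalloc-style address below
                  * PAGE_OFFSET (as on 64-bit Book3E) is rejected */
                 return (virt_addr_valid((void *)(PAGE_OFFSET + 0x1000)) &&
                         !virt_addr_valid((void *)0x8000000000000000UL)) ? 0 : 1;
         }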
      
       Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
       Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
       [mpe: Add additional change log detail]
       Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
       Link: https://lore.kernel.org/r/20220406145802.538416-1-mpe@ellerman.id.au
    • ARM: config: u8500: Re-enable AB8500 battery charging · 62f64245
      Linus Walleij authored
      This is effectively a revert of the temporary disablement
      patch. Battery charging now works!
      
      We also enable static battery data for the Samsung SDI
      batteries as used by the U8500 Samsung phones.
      
      Cc: Lee Jones <lee.jones@linaro.org>
       Fixes: a1149ae9 ("ARM: ux500: Disable Power Supply and Battery Management by default")
       Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
    • KVM: arm64: mixed-width check should be skipped for uninitialized vCPUs · 26bf74bd
      Reiji Watanabe authored
      KVM allows userspace to configure either all EL1 32bit or 64bit vCPUs
      for a guest.  At vCPU reset, vcpu_allowed_register_width() checks
      if the vcpu's register width is consistent with all other vCPUs'.
       Since the check is done even against vCPUs that have not yet been
       initialized (KVM_ARM_VCPU_INIT has not been done), the uninitialized
       vCPUs are erroneously treated as 64-bit vCPUs, which causes the function
       to incorrectly detect a mixed-width VM.
      
       Introduce KVM_ARCH_FLAG_EL1_32BIT and KVM_ARCH_FLAG_REG_WIDTH_CONFIGURED
       bits for kvm->arch.flags.  The EL1_32BIT bit indicates whether the guest
       must be configured with all 32-bit or all 64-bit vCPUs, and the
       REG_WIDTH_CONFIGURED bit indicates whether the EL1_32BIT bit is valid
       (i.e. has already been set up).  The values of those bits are set at the
       first KVM_ARM_VCPU_INIT for the guest, based on the KVM_ARM_VCPU_EL1_32BIT
       configuration for that vCPU.
      
      Check vcpu's register width against those new bits at the vcpu's
      KVM_ARM_VCPU_INIT (instead of against other vCPUs' register width).
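
       An illustrative sketch of the flag scheme described above (standalone C;
       the struct and helper are simplified stand-ins, not the actual KVM code):

         #include <stdbool.h>

         #define KVM_ARCH_FLAG_REG_WIDTH_CONFIGURED  (1u << 0)
         #define KVM_ARCH_FLAG_EL1_32BIT             (1u << 1)

         struct kvm_arch { unsigned int flags; };

         /* Called at KVM_ARM_VCPU_INIT time for each vCPU. */
         static bool vcpu_register_width_ok(struct kvm_arch *arch, bool is_32bit)
         {
                 if (!(arch->flags & KVM_ARCH_FLAG_REG_WIDTH_CONFIGURED)) {
                         /* First init: record the VM-wide register width. */
                         if (is_32bit)
                                 arch->flags |= KVM_ARCH_FLAG_EL1_32BIT;
                         arch->flags |= KVM_ARCH_FLAG_REG_WIDTH_CONFIGURED;
                         return true;
                 }

                 /* Later inits must match the recorded width, so vCPUs that
                  * were never initialized no longer influence the check. */
                 return is_32bit == !!(arch->flags & KVM_ARCH_FLAG_EL1_32BIT);
         }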
      
       Fixes: 66e94d5c ("KVM: arm64: Prevent mixed-width VM creation")
       Signed-off-by: Reiji Watanabe <reijiw@google.com>
       Reviewed-by: Oliver Upton <oupton@google.com>
       Signed-off-by: Marc Zyngier <maz@kernel.org>
       Link: https://lore.kernel.org/r/20220329031924.619453-2-reijiw@google.com
    • s390: allow to compile with z16 optimizations · e69a7ff8
      Heiko Carstens authored
      
       Add config and compile options which allow compiling with z16
       optimizations if the compiler supports them.
      
       Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
    • s390: add z16 elf platform · 6203ac30
      Heiko Carstens authored
      
      Add detection for machine types 0x3931 and 0x3932 and set ELF platform
      name to z16.
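
       An illustrative sketch of the mapping described above (the machine-type
       values are from the commit message; the surrounding function is a
       hypothetical stand-in, not the actual s390 setup code):

         static const char *elf_platform_for_machine_type(unsigned int type)
         {
                 switch (type) {
                 case 0x3931:
                 case 0x3932:
                         return "z16";
                 default:
                         return "unknown";
                 }
         }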
      
       Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
    • arm64: alternatives: mark patch_alternative() as `noinstr` · a2c0b0fb
      Joey Gouly authored
      
      The alternatives code must be `noinstr` such that it does not patch itself,
      as the cache invalidation is only performed after all the alternatives have
      been applied.
      
      Mark patch_alternative() as `noinstr`. Mark branch_insn_requires_update()
      and get_alt_insn() with `__always_inline` since they are both only called
      through patch_alternative().
      
       Booting a kernel in QEMU TCG with KCSAN=y and ARM64_USE_LSE_ATOMICS=y
       caused a boot hang:

         [    0.241121] CPU: All CPU(s) started at EL2
      
      The alternatives code was patching the atomics in __tsan_read4() from LL/SC
      atomics to LSE atomics.
      
      The following fragment is using LL/SC atomics in the .text section:
        | <__tsan_unaligned_read4+304>:     ldxr    x6, [x2]
        | <__tsan_unaligned_read4+308>:     add     x6, x6, x5
        | <__tsan_unaligned_read4+312>:     stxr    w7, x6, [x2]
        | <__tsan_unaligned_read4+316>:     cbnz    w7, <__tsan_unaligned_read4+304>
      
      This LL/SC atomic sequence was to be replaced with LSE atomics. However since
      the alternatives code was instrumentable, __tsan_read4() was being called after
      only the first instruction was replaced, which led to the following code in memory:
        | <__tsan_unaligned_read4+304>:     ldadd   x5, x6, [x2]
        | <__tsan_unaligned_read4+308>:     add     x6, x6, x5
        | <__tsan_unaligned_read4+312>:     stxr    w7, x6, [x2]
        | <__tsan_unaligned_read4+316>:     cbnz    w7, <__tsan_unaligned_read4+304>
      
       This caused an infinite loop: the `stxr` never completed successfully
       because the patched-in `ldadd` does not set up the exclusive monitor, so
       `w7` was never 0 and the `cbnz` kept branching back.
      
       Signed-off-by: Joey Gouly <joey.gouly@arm.com>
       Cc: Mark Rutland <mark.rutland@arm.com>
       Cc: Catalin Marinas <catalin.marinas@arm.com>
       Cc: Will Deacon <will@kernel.org>
       Link: https://lore.kernel.org/r/20220405104733.11476-1-joey.gouly@arm.com
       Signed-off-by: Will Deacon <will@kernel.org>
    • KVM: arm64: Don't split hugepages outside of MMU write lock · f587661f
      Oliver Upton authored
      It is possible to take a stage-2 permission fault on a page larger than
      PAGE_SIZE. For example, when running a guest backed by 2M HugeTLB, KVM
      eagerly maps at the largest possible block size. When dirty logging is
      enabled on a memslot, KVM does *not* eagerly split these 2M stage-2
      mappings and instead clears the write bit on the pte.
      
      Since dirty logging is always performed at PAGE_SIZE granularity, KVM
      lazily splits these 2M block mappings down to PAGE_SIZE in the stage-2
      fault handler. This operation must be done under the write lock. Since
      commit f783ef1c ("KVM: arm64: Add fast path to handle permission
      relaxation during dirty logging"), the stage-2 fault handler
      conditionally takes the read lock on permission faults with dirty
       logging enabled. As a result, it is possible to split a 2M block mapping
       while only holding the read lock.
      
      The problem is demonstrated by running kvm_page_table_test with 2M
      anonymous HugeTLB, which splats like so:
      
        WARNING: CPU: 5 PID: 15276 at arch/arm64/kvm/hyp/pgtable.c:153 stage2_map_walk_leaf+0x124/0x158
      
        [...]
      
        Call trace:
        stage2_map_walk_leaf+0x124/0x158
        stage2_map_walker+0x5c/0xf0
        __kvm_pgtable_walk+0x100/0x1d4
        __kvm_pgtable_walk+0x140/0x1d4
        __kvm_pgtable_walk+0x140/0x1d4
        kvm_pgtable_walk+0xa0/0xf8
        kvm_pgtable_stage2_map+0x15c/0x198
        user_mem_abort+0x56c/0x838
        kvm_handle_guest_abort+0x1fc/0x2a4
        handle_exit+0xa4/0x120
        kvm_arch_vcpu_ioctl_run+0x200/0x448
        kvm_vcpu_ioctl+0x588/0x664
        __arm64_sys_ioctl+0x9c/0xd4
        invoke_syscall+0x4c/0x144
        el0_svc_common+0xc4/0x190
        do_el0_svc+0x30/0x8c
        el0_svc+0x28/0xcc
        el0t_64_sync_handler+0x84/0xe4
        el0t_64_sync+0x1a4/0x1a8
      
       Fix the issue by only acquiring the read lock if the guest faulted on a
       PAGE_SIZE granule with dirty logging enabled. Add a WARN to catch locking
       bugs in future changes.
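
       An illustrative sketch of the locking rule described above (standalone C
       using a pthread rwlock as a stand-in for the MMU lock; not the actual
       KVM fault handler):

         #include <pthread.h>
         #include <stdbool.h>
         #include <stddef.h>

         #define PAGE_SIZE 4096UL

         struct vm { pthread_rwlock_t mmu_lock; };

         static void handle_stage2_fault(struct vm *vm, size_t fault_granule,
                                         bool write_fault, bool logging_active)
         {
                 /* Only single-page permission relaxation during dirty logging
                  * may run under the read lock; anything that could split a
                  * block mapping needs the write lock. */
                 bool use_read_lock = logging_active && write_fault &&
                                      fault_granule == PAGE_SIZE;

                 if (use_read_lock)
                         pthread_rwlock_rdlock(&vm->mmu_lock);
                 else
                         pthread_rwlock_wrlock(&vm->mmu_lock);

                 /* ... relax permissions or split/map the block here ... */

                 pthread_rwlock_unlock(&vm->mmu_lock);
         }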
      
       Fixes: f783ef1c ("KVM: arm64: Add fast path to handle permission relaxation during dirty logging")
      Cc: Jing Zhang <jingzhangos@google.com>
       Signed-off-by: Oliver Upton <oupton@google.com>
       Reviewed-by: Reiji Watanabe <reijiw@google.com>
       Signed-off-by: Marc Zyngier <maz@kernel.org>
       Link: https://lore.kernel.org/r/20220401194652.950240-1-oupton@google.com