  1. Jun 08, 2023
  2. Feb 15, 2023
    • Fix page corruption caused by racy check in __free_pages · 0a626e27
      David Chen authored
      commit 462a8e08 upstream.
      
      When we upgraded our kernel, we started seeing some page corruption like
      the following consistently:
      
        BUG: Bad page state in process ganesha.nfsd  pfn:1304ca
        page:0000000022261c55 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0 pfn:0x1304ca
        flags: 0x17ffffc0000000()
        raw: 0017ffffc0000000 ffff8a513ffd4c98 ffffeee24b35ec08 0000000000000000
        raw: 0000000000000000 0000000000000001 00000000ffffff7f 0000000000000000
        page dumped because: nonzero mapcount
        CPU: 0 PID: 15567 Comm: ganesha.nfsd Kdump: loaded Tainted: P    B      O      5.10.158-1.nutanix.20221209.el7.x86_64 #1
        Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
        Call Trace:
         dump_stack+0x74/0x96
         bad_page.cold+0x63/0x94
         check_new_page_bad+0x6d/0x80
         rmqueue+0x46e/0x970
         get_page_from_freelist+0xcb/0x3f0
         ? _cond_resched+0x19/0x40
         __alloc_...
    • migrate: hugetlb: check for hugetlb shared PMD in node migration · dbe5a119
      Mike Kravetz authored
      [ Upstream commit 73bdf65e ]
      
      migrate_pages/mempolicy semantics state that CAP_SYS_NICE is required to
      move pages shared with another process to a different node.  page_mapcount
      > 1 is being used to determine if a hugetlb page is shared.  However, a
      hugetlb page will have a mapcount of 1 if mapped by multiple processes via
      a shared PMD.  As a result, hugetlb pages shared by multiple processes and
      mapped with a shared PMD can be moved by a process without CAP_SYS_NICE.
      
      To fix, check for a shared PMD if mapcount is 1.  If a shared PMD is found
      consider the page shared.
      
      Link: https://lkml.kernel.org/r/20230126222721.222195-3-mike.kravetz@oracle.com
      Fixes: e2d8cf40 ("migrate: add hugepage migration code to migrate_pages()")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
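
      As an illustration of the check described above, here is a minimal
      userspace sketch of the decision logic; struct fake_page, its fields and
      hugetlb_page_shared() are stand-ins for illustration, not kernel API,
      and this is not the actual patch:

        #include <stdbool.h>
        #include <stdio.h>

        struct fake_page {
                int mapcount;        /* stand-in for page_mapcount()    */
                bool pmd_is_shared;  /* stand-in for a shared-PMD probe */
        };

        /* A hugetlb page needs CAP_SYS_NICE to migrate when it is shared. */
        static bool hugetlb_page_shared(const struct fake_page *page)
        {
                if (page->mapcount > 1)
                        return true;         /* mapped by several processes */
                /* mapcount == 1 can still mean "shared" via a shared PMD */
                return page->pmd_is_shared;
        }

        int main(void)
        {
                struct fake_page via_shared_pmd = { .mapcount = 1, .pmd_is_shared = true };
                struct fake_page private_page   = { .mapcount = 1, .pmd_is_shared = false };

                printf("shared-PMD page treated as shared: %d\n",
                       hugetlb_page_shared(&via_shared_pmd));
                printf("private page treated as shared:    %d\n",
                       hugetlb_page_shared(&private_page));
                return 0;
        }
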
    • mm/migration: return errno when isolate_huge_page failed · 97a5104d
      Miaohe Lin authored
      [ Upstream commit 7ce82f4c ]
      
      We might fail to isolate a huge page because, e.g., the page is under
      migration, which cleared HPageMigratable.  We should return an errno in
      this case rather than always return 1, which could confuse the user,
      i.e. the caller might think all of the memory was migrated while the
      hugetlb page was left behind.  Make the prototype of isolate_huge_page
      consistent with isolate_lru_page as suggested by Huang Ying, and rename
      isolate_huge_page to isolate_hugetlb as suggested by Muchun to improve
      readability.
      
      Link: https://lkml.kernel.org/r/20220530113016.16663-4-linmiaohe@huawei.com
      Fixes: e8db67eb ("mm: migrate: move_pages() supports thp migration")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Suggested-by: Huang Ying <ying.huang@intel.com>
      Reported-by: kernel test robot <lkp@intel.com> (build error)
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph ...
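
      The return-convention change described above can be sketched in a few
      lines of stand-in userspace C (isolate_hugetlb_sketch() is hypothetical,
      not the kernel function): report why isolation failed instead of a
      bool-like 1, matching the isolate_lru_page() convention.

        #include <errno.h>
        #include <stdio.h>

        /* stand-in: isolation fails because the page is already migrating */
        static int isolate_hugetlb_sketch(int page_is_migratable)
        {
                if (!page_is_migratable)
                        return -EBUSY;  /* propagate the reason, not just 0/1 */
                return 0;
        }

        int main(void)
        {
                int err = isolate_hugetlb_sketch(0);

                if (err)
                        fprintf(stderr, "isolation failed: errno %d\n", -err);
                return 0;
        }
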
    • mm/swapfile: add cond_resched() in get_swap_pages() · 30187be2
      Longlong Xia authored
      commit 7717fc1a upstream.
      
      The softlockup still occurs in get_swap_pages() under memory pressure:
      64 CPU cores, 64GB of memory, and 28 zram devices, where each zram
      device has a disksize of 50MB and the same priority as si.  Using the
      stress-ng tool to increase memory pressure causes the system to oom
      frequently.
      
      The plist_for_each_entry_safe() loops in get_swap_pages() could reach tens
      of thousands of times to find available space (extreme case:
      cond_resched() is not called in scan_swap_map_slots()).  Let's add
      cond_resched() into get_swap_pages() when it fails to find available
      space, to avoid the softlockup.
      
      Link: https://lkml.kernel.org/r/20230128094757.1060525-1-xialonglong1@huawei.com
      
      
      Signed-off-by: Longlong Xia <xialonglong1@huawei.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Chen Wandun <chenwandun@huawei.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Nanyong Sun <sunnanyong@huawei.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
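
      The pattern added by the fix can be modeled in userspace: when a search
      loop may spin for a very long time without finding anything, yield the
      CPU on every failed pass.  The sketch below uses sched_yield() as a
      stand-in for the kernel's cond_resched(); the loop and device count are
      illustrative only.

        #include <sched.h>
        #include <stdio.h>

        #define NDEVICES 28

        static int scan_device_for_space(int dev)
        {
                (void)dev;
                return 0;           /* stand-in: every device is full */
        }

        int main(void)
        {
                int found = 0;

                for (int dev = 0; dev < NDEVICES && !found; dev++) {
                        found = scan_device_for_space(dev);
                        if (!found)
                                sched_yield();  /* kernel: cond_resched() */
                }
                printf("space found: %d\n", found);
                return 0;
        }
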
  3. Feb 01, 2023
  4. Jan 24, 2023
    • mm/khugepaged: fix collapse_pte_mapped_thp() to allow anon_vma · 8bc72b49
      Hugh Dickins authored
      commit ab0c3f12 upstream.
      
      uprobe_write_opcode() uses collapse_pte_mapped_thp() to restore huge pmd,
      when removing a breakpoint from hugepage text: vma->anon_vma is always set
      in that case, so undo the prohibition.  And MADV_COLLAPSE ought to be able
      to collapse some page tables in a vma which happens to have anon_vma set
      from CoWing elsewhere.
      
      Is anon_vma lock required?  Almost not: if any page other than expected
      subpage of the non-anon huge page is found in the page table, collapse is
      aborted without making any change.  However, it is possible that an anon
      page was CoWed from this extent in another mm or vma, in which case a
      concurrent lookup might look here: so keep it away while clearing pmd (but
      perhaps we shall go back to using pmd_lock() there in future).
      
      Note that collapse_pte_mapped_thp() is exceptional in freeing a page table
      without having cleared its ptes: I'm uneasy about that, and had thought
      pte_clear()ing appropriate; but exclusive i_mmap lock does fix the
      problem, and we would have to move the mmu_notification if clearing those
      ptes.
      
      What this fixes is not a dangerous instability.  But I suggest Cc stable
      because uprobes "healing" has regressed in that way, so this should follow
      8d3c106e into those stable releases where it was backported (and may
      want adjustment there - I'll supply backports as needed).
      
      Link: https://lkml.kernel.org/r/b740c9fb-edba-92ba-59fb-7a5592e5dfc@google.com
      Fixes: 8d3c106e ("mm/khugepaged: take the right locks for page table retraction")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: <stable@vger.kernel.org>    [5.4+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. Jan 18, 2023
    • mm: Always release pages to the buddy allocator in memblock_free_late(). · 60806adc
      Aaron Thompson authored
      [ Upstream commit 115d9d77 ]
      
      If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages()
      only releases pages to the buddy allocator if they are not in the
      deferred range. This is correct for free pages (as defined by
      for_each_free_mem_pfn_range_in_zone()) because free pages in the
      deferred range will be initialized and released as part of the deferred
      init process. memblock_free_pages() is called by memblock_free_late(),
      which is used to free reserved ranges after memblock_free_all() has
      run. All pages in reserved ranges have been initialized at that point,
      and accordingly, those pages are not touched by the deferred init
      process. This means that currently, if the pages that
      memblock_free_late() intends to release are in the deferred range, they
      will never be released to the buddy allocator. They will forever be
      reserved.
      
      In addition, memblock_free_pages() calls kmsan_memblock_free_pages(),
      which is also correct for free pages but is not correct for reserved
      pages. KMSAN metadata for reserved pages is initialized by
      kmsan_init_shadow(), which runs shortly before memblock_free_all().
      
      For both of these reasons, memblock_free_pages() should only be called
      for free pages, and memblock_free_late() should call __free_pages_core()
      directly instead.
      
      One case where this issue can occur in the wild is EFI boot on
      x86_64. The x86 EFI code reserves all EFI boot services memory ranges
      via memblock_reserve() and frees them later via memblock_free_late()
      (efi_reserve_boot_services() and efi_free_boot_services(),
      respectively). If any of those ranges happens to fall within the
      deferred init range, the pages will not be released and that memory will
      be unavailable.
      
      For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:
      
      v6.2-rc2:
        # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
        Node 0, zone      DMA
                spanned  4095
                present  3999
                managed  3840
        Node 0, zone    DMA32
                spanned  246652
                present  245868
                managed  178867
      
      v6.2-rc2 + patch:
        # grep -E 'Node|spanned|present|managed' /proc/zoneinfo
        Node 0, zone      DMA
                spanned  4095
                present  3999
                managed  3840
        Node 0, zone    DMA32
                spanned  246652
                present  245868
                managed  222816   # +43,949 pages
      
      Fixes: 3a80a7fa ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
      Signed-off-by: Aaron Thompson <dev@aaront.org>
      Link: https://lore.kernel.org/r/01010185892de53e-e379acfb-7044-4b24-b30a-e2657c1ba989-000000@us-west-2.amazonses.com
      
      
      Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  6. Jan 14, 2023
    • mm, compaction: fix fast_isolate_around() to stay within boundaries · 882734bb
      NARIBAYASHI Akira authored
      commit be21b32a upstream.
      
      Depending on the memory configuration, isolate_freepages_block() may scan
      pages out of the target range and cause a panic.
      
      Panic can occur on systems with multiple zones in a single pageblock.
      
      The reason it is rare is that it only happens in special
      configurations.  Depending on how many similar systems there are, it
      may be a good idea to fix this problem for older kernels as well.
      
      The problem is that pfn as argument of fast_isolate_around() could be out
      of the target range.  Therefore we should consider the case where pfn <
      start_pfn, and also the case where end_pfn < pfn.
      
      This problem should have been addressed by commit 6e2b7044 ("mm,
      compaction: make fast_isolate_freepages() stay within zone") but there was
      an oversight.
      
       Case1: pfn < start_pfn
      
        <at memory compaction for node Y>
        |  node X's zone  | node Y's zone
        +-----------------+-----------------------------...
  7. Dec 14, 2022
  8. Dec 08, 2022
    • v4l2: don't fall back to follow_pfn() if pin_user_pages_fast() fails · d072a10c
      Linus Torvalds authored
      commit 6647e76a upstream.
      
      The V4L2_MEMORY_USERPTR interface is long deprecated and shouldn't be
      used (and is discouraged for any modern v4l drivers).  And Seth Jenkins
      points out that the fallback to VM_PFNMAP/VM_IO is fundamentally racy
      and dangerous.
      
      Note that it's not even a case that should trigger, since any normal
      user pointer logic ends up just using the pin_user_pages_fast() call
      that does the proper page reference counting.  That's not the problem
      case, only if you try to use special device mappings do you have any
      issues.
      
      Normally I'd just remove this during the merge window, but since Seth
      pointed out the problem cases, we really want to know as soon as
      possible if there are actually any users of this odd special case of a
      legacy interface.  Neither Hans nor Mauro seem to think that such
      mis-uses of the old legacy interface should exist.  As Mauro says:
      
       "See, V4L2 has actually 4 streaming APIs:
              - Kernel-allocated mmap (usually referred simply as just mmap);
              - USERPTR mmap;
              - read();
              - dmabuf;
      
        The USERPTR is one of the oldest way to use it, coming from V4L
        version 1 times, and by far the least used one"
      
      And Hans chimed in on the USERPTR interface:
      
       "To be honest, I wouldn't mind if it goes away completely, but that's a
        bit of a pipe dream right now"
      
      but while removing this legacy interface entirely may be a pipe dream we
      can at least try to remove the unlikely (and actively broken) case of
      using special device mappings for USERPTR accesses.
      
      This replaces it with a WARN_ONCE() that we can remove once we've
      hopefully confirmed that no actual users exist.
      
      NOTE! Longer term, this means that a 'struct frame_vector' only ever
      contains proper page pointers, and all the games we have with converting
      them to pages can go away (grep for 'frame_vector_to_pages()' and the
      uses of 'vec->is_pfns').  But this is just the first step, to verify
      that this code really is all dead, and do so as quickly as possible.
      
      Reported-by: Seth Jenkins <sethjenkins@google.com>
      Acked-by: Hans Verkuil <hverkuil@xs4all.nl>
      Acked-by: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
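
      The "warn loudly once instead of silently using a broken fallback" idea
      can be sketched in userspace; get_user_frames() and the flag below are
      illustrative stand-ins, not the v4l2 code (the kernel uses WARN_ONCE(),
      which keeps one already-warned flag per call site).

        #include <stdbool.h>
        #include <stdio.h>

        static int get_user_frames(int is_special_mapping)
        {
                static bool warned;     /* one-shot flag, like WARN_ONCE() state */

                if (is_special_mapping) {
                        /* the old code fell back to follow_pfn() here */
                        if (!warned) {
                                warned = true;
                                fprintf(stderr, "unsupported VM_IO/VM_PFNMAP userptr mapping\n");
                        }
                        return -1;
                }
                return 0;               /* normal path: pin_user_pages_fast() analogue */
        }

        int main(void)
        {
                get_user_frames(1);
                get_user_frames(1);     /* second hit: warning not repeated */
                return 0;
        }
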
  9. Dec 02, 2022
    • mm: vmscan: fix extreme overreclaim and swap floods · d925dd3e
      Johannes Weiner authored
      commit f53af428 upstream.
      
      During proactive reclaim, we sometimes observe severe overreclaim, with
      several thousand times more pages reclaimed than requested.
      
      This trace was obtained from shrink_lruvec() during such an instance:
      
          prio:0 anon_cost:1141521 file_cost:7767
          nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
          nr=[7161123 345 578 1111]
      
      While the reclaimer requested 4M, vmscan reclaimed close to 16G, most of it
      by swapping.  These requests take over a minute, during which the write()
      to memory.reclaim is unkillably stuck inside the kernel.
      
      Digging into the source, this is caused by the proportional reclaim
      bailout logic.  This code tries to resolve a fundamental conflict: to
      reclaim roughly what was requested, while also aging all LRUs fairly and
      in accordance to their size, swappiness, refault rates etc.  The way it
      attempts fairness is that once the reclaim goal has been reached, it stops
      scanning the LRUs with the smaller remaining scan targets, and adjusts the
      remainder of the bigger LRUs according to how much of the smaller LRUs was
      scanned.  It then finishes scanning that remainder regardless of the
      reclaim goal.
      
      This works fine if priority levels are low and the LRU lists are
      comparable in size.  However, in this instance, the cgroup that is
      targeted by proactive reclaim has almost no files left - they've already
      been squeezed out by proactive reclaim earlier - and the remaining anon
      pages are hot.  Anon rotations cause the priority level to drop to 0,
      which results in reclaim targeting all of anon (a lot) and all of file
      (almost nothing).  By the time reclaim decides to bail, it has scanned
      most or all of the file target, and therefore must also scan most or all of
      the enormous anon target.  This target is thousands of times larger than
      the reclaim goal, thus causing the overreclaim.
      
      The bailout code hasn't changed in years, why is this failing now?  The
      most likely explanations are two other recent changes in anon reclaim:
      
      1. Before the series starting with commit 5df74196 ("mm: fix LRU
         balancing effect of new transparent huge pages"), the VM was
         overall relatively reluctant to swap at all, even if swap was
         configured. This means the LRU balancing code didn't come into play
         as often as it does now, and mostly in high pressure situations
         where pronounced swap activity wouldn't be as surprising.
      
      2. For historic reasons, shrink_lruvec() loops on the scan targets of
         all LRU lists except the active anon one, meaning it would bail if
         the only remaining pages to scan were active anon - even if there
         were a lot of them.
      
         Before the series starting with commit ccc5dc67 ("mm/vmscan:
         make active/inactive ratio as 1:1 for anon lru"), most anon pages
         would live on the active LRU; the inactive one would contain only a
         handful of preselected reclaim candidates. After the series, anon
         gets aged similarly to file, and the inactive list is the default
         for new anon pages as well, making it often the much bigger list.
      
         As a result, the VM is now more likely to actually finish large
         anon targets than before.
      
      Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward the
      larger LRU lists is made before bailing out on a met reclaim goal.
      
      This fixes the extreme overreclaim problem.
      
      Fairness is more subtle and harder to evaluate.  No obvious misbehavior
      was observed on the test workload, in any case.  Conceptually, fairness
      should primarily be a cumulative effect from regular, lower priority
      scans.  Once the VM is in trouble and needs to escalate scan targets to
      make forward progress, fairness needs to take a backseat.  This is also
      acknowledged by the myriad exceptions in get_scan_count().  This patch
      makes fairness decrease gradually, as it keeps fairness work static over
      increasing priority levels with growing scan targets.  This should make
      more sense - although we may have to re-visit the exact values.
      
      Link: https://lkml.kernel.org/r/20220802162811.39216-1-hannes@cmpxchg.org
      
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  10. Nov 25, 2022
    • mm: fs: initialize fsdata passed to write_begin/write_end interface · 294ef12d
      Alexander Potapenko authored
      commit 1468c6f4 upstream.
      
      Functions implementing the a_ops->write_end() interface accept the `void
      *fsdata` parameter that is supposed to be initialized by the corresponding
      a_ops->write_begin() (which accepts `void **fsdata`).
      
      However not all a_ops->write_begin() implementations initialize `fsdata`
      unconditionally, so it may get passed uninitialized to a_ops->write_end(),
      resulting in undefined behavior.
      
      Fix this by initializing fsdata with NULL before the call to
      write_begin(), rather than doing so in all possible a_ops implementations.
      
      This patch covers only the following cases found by running x86 KMSAN
      under syzkaller:
      
       - generic_perform_write()
       - cont_expand_zero() and generic_cont_expand_simple()
       - page_symlink()
      
      Other cases of passing uninitialized fsdata may persist in the codebase.
      
      Link: https://lkml.kernel.org/r/20220915150417.722975-43-glider@google.com
      
      
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Eric Biggers <ebiggers@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Ilya Leoshkevich <iii@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
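
      The caller-side fix is a one-line initialization; the sketch below models
      it in userspace with function stand-ins for a_ops->write_begin() and
      a_ops->write_end() (not the VFS code).  Because not every write_begin
      implementation stores something through *fsdata, the caller zeroes the
      pointer before the call so write_end never sees stack garbage.

        #include <stddef.h>
        #include <stdio.h>

        /* stand-in "write_begin" that forgets to set *fsdata at all */
        static int sketch_write_begin(void **fsdata)
        {
                (void)fsdata;           /* deliberately leaves *fsdata untouched */
                return 0;
        }

        static int sketch_write_end(void *fsdata)
        {
                /* safe: fsdata is NULL rather than uninitialized */
                printf("write_end sees fsdata=%p\n", fsdata);
                return 0;
        }

        int main(void)
        {
                void *fsdata = NULL;    /* the fix: initialize before write_begin */

                sketch_write_begin(&fsdata);
                sketch_write_end(fsdata);
                return 0;
        }
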
    • maccess: Fix writing offset in case of fault in strncpy_from_kernel_nofault() · db744288
      Alban Crequy authored
      commit 8678ea06 upstream.
      
      If a page fault occurs while copying the first byte, this function resets
      one byte before dst.  As a consequence, an address could be modified,
      leading to kernel crashes in case the modified address was accessed later.
      
      Fixes: b58294ea ("maccess: allow architectures to provide kernel probing directly")
      Signed-off-by: Alban Crequy <albancrequy@linux.microsoft.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Tested-by: Francis Laniel <flaniel@linux.microsoft.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: <stable@vger.kernel.org> [5.8]
      Link: https://lore.kernel.org/bpf/20221110085614.111213-2-albancrequy@linux.microsoft.com
      
      
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
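
      The bug is an off-by-one on the error path and is easy to model in
      userspace.  The sketch below is illustrative only (read_byte() and
      copy_string_sketch() are stand-ins, not the kernel function): terminating
      the string with dst[-1] = '\0' is fine once at least one byte was copied,
      but if the very first read faults, dst has not advanced and the store
      lands one byte before the destination buffer.

        #include <stdio.h>
        #include <string.h>

        /* stand-in for __get_kernel_nofault(): nonzero means "faulted" */
        static int read_byte(const char *src, char *out, int fault_on_first)
        {
                if (fault_on_first)
                        return -1;
                *out = *src;
                return 0;
        }

        static long copy_string_sketch(char *dst, const char *src, long count,
                                       int fault_on_first)
        {
                char *start = dst;

                while (dst - start < count) {
                        if (read_byte(src, dst, fault_on_first))
                                goto fault;
                        dst++;
                        src++;
                        if (!dst[-1])
                                break;
                }
                return dst - start;

        fault:
                if (dst == start)
                        return -1;      /* fixed path: nothing copied, leave dst alone */
                dst[-1] = '\0';         /* the buggy version did this unconditionally */
                return -1;
        }

        int main(void)
        {
                char buf[8];
                long ret;

                memset(buf, 'x', sizeof(buf));
                /* simulate a fault on the very first byte */
                ret = copy_string_sketch(buf, "hello", sizeof(buf), 1);
                printf("ret=%ld, buf[0]='%c' (untouched)\n", ret, buf[0]);
                return 0;
        }
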
  11. Nov 16, 2022
  12. Nov 03, 2022
    • mm,hugetlb: take hugetlb_lock before decrementing h->resv_huge_pages · 568e3812
      Rik van Riel authored
      commit 12df140f upstream.
      
      The h->*_huge_pages counters are protected by the hugetlb_lock, but
      alloc_huge_page has a corner case where it can decrement the counter
      outside of the lock.
      
      This could lead to a corrupted value of h->resv_huge_pages, which we have
      observed on our systems.
      
      Take the hugetlb_lock before decrementing h->resv_huge_pages to avoid a
      potential race.
      
      Link: https://lkml.kernel.org/r/20221017202505.0e6a4fcd@imladris.surriel.com
      Fixes: a88c7695 ("mm: hugetlb: fix hugepage memory leak caused by wrong reserve count")
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Glen McCready <gkmccready@meta.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
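
      The rule the fix restores is simply "take the lock that protects the
      counter before touching it".  A userspace model with a pthread mutex as
      a stand-in for hugetlb_lock and a plain long for h->resv_huge_pages
      (illustrative only, not the kernel code):

        #include <pthread.h>
        #include <stdio.h>

        static pthread_mutex_t hugetlb_lock_sketch = PTHREAD_MUTEX_INITIALIZER;
        static long resv_huge_pages_sketch = 1;

        static void consume_reservation(void)
        {
                /* racy variant: resv_huge_pages_sketch-- outside the lock */
                pthread_mutex_lock(&hugetlb_lock_sketch);
                resv_huge_pages_sketch--;    /* serialized with other updaters */
                pthread_mutex_unlock(&hugetlb_lock_sketch);
        }

        int main(void)
        {
                consume_reservation();
                printf("resv_huge_pages = %ld\n", resv_huge_pages_sketch);
                return 0;
        }
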
    • mm/memory: add non-anonymous page check in the copy_present_page() · 935a8b62
      Yuanzheng Song authored
      The vma->anon_vma of the child process may be NULL because
      the entire vma does not contain anonymous pages.  In this
      case, a BUG will occur when copy_present_page() passes
      a copy of a non-anonymous page of that vma to
      page_add_new_anon_rmap() to set up a new anonymous rmap.
      
      ------------[ cut here ]------------
      kernel BUG at mm/rmap.c:1044!
      Internal error: Oops - BUG: 0 [#1] SMP
      Modules linked in:
      CPU: 2 PID: 3617 Comm: test Not tainted 5.10.149 #1
      Hardware name: linux,dummy-virt (DT)
      pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
      pc : __page_set_anon_rmap+0xbc/0xf8
      lr : __page_set_anon_rmap+0xbc/0xf8
      sp : ffff800014c1b870
      x29: ffff800014c1b870 x28: 0000000000000001
      x27: 0000000010100073 x26: ffff1d65c517baa8
      x25: ffff1d65cab0f000 x24: ffff1d65c416d800
      x23: ffff1d65cab5f248 x22: 0000000020000000
      x21: 0000000000000001 x20: 0000000000000000
      x19: fffffe75970023c0 x18: 0000000000000000
      x17: 0000000000000000 x16: 0000000000000000
      x15: 0000000000000000 x14: 0000000000000000
      x13: 0000000000000000 x12: 0000000000000000
      x11: 0000000000000000 x10: 0000000000000000
      x9 : ffffc3096d5fb858 x8 : 0000000000000000
      x7 : 0000000000000011 x6 : ffff5a5c9089c000
      x5 : 0000000000020000 x4 : ffff5a5c9089c000
      x3 : ffffc3096d200000 x2 : ffffc3096e8d0000
      x1 : ffff1d65ca3da740 x0 : 0000000000000000
      Call trace:
       __page_set_anon_rmap+0xbc/0xf8
       page_add_new_anon_rmap+0x1e0/0x390
       copy_pte_range+0xd00/0x1248
       copy_page_range+0x39c/0x620
       dup_mmap+0x2e0/0x5a8
       dup_mm+0x78/0x140
       copy_process+0x918/0x1a20
       kernel_clone+0xac/0x638
       __do_sys_clone+0x78/0xb0
       __arm64_sys_clone+0x30/0x40
       el0_svc_common.constprop.0+0xb0/0x308
       do_el0_svc+0x48/0xb8
       el0_svc+0x24/0x38
       el0_sync_handler+0x160/0x168
       el0_sync+0x180/0x1c0
      Code: 97f8ff85 f9400294 17ffffeb 97f8ff82 (d4210000)
      ---[ end trace a972347688dc9bd4 ]---
      Kernel panic - not syncing: Oops - BUG: Fatal exception
      SMP: stopping secondary CPUs
      Kernel Offset: 0x43095d200000 from 0xffff800010000000
      PHYS_OFFSET: 0xffffe29a80000000
      CPU features: 0x08200022,61806082
      Memory Limit: none
      ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]---
      
      This problem has been fixed by commit fb3d824d
      ("mm/rmap: split page_dup_rmap() into page_dup_file_rmap()
      and page_try_dup_anon_rmap()"), but it still exists in the
      linux-5.10.y branch.
      
      That patch is not applicable to this version because
      of the large version differences.  Therefore, fix it by
      adding a non-anonymous page check in copy_present_page().
      
      Cc: stable@vger.kernel.org
      Fixes: 70e806e4 ("mm: Do early cow for pinned pages during fork() for ptes")
      Signed-off-by: Yuanzheng Song <songyuanzheng@huawei.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  13. Oct 26, 2022
    • mm: hugetlb: fix UAF in hugetlb_handle_userfault · 45c33966
      Liu Shixin authored
      commit 958f32ce upstream.
      
      The vma_lock and hugetlb_fault_mutex are dropped before handling the
      userfault and reacquired again after handle_userfault(), but reacquiring
      the vma_lock could lead to a UAF [1,2] due to the following race,
      
      hugetlb_fault
        hugetlb_no_page
          /*unlock vma_lock */
          hugetlb_handle_userfault
            handle_userfault
              /* unlock mm->mmap_lock*/
                                                 vm_mmap_pgoff
                                                   do_mmap
                                                     mmap_region
                                                       munmap_vma_range
                                                         /* clean old vma */
              /* lock vma_lock again  <--- UAF */
          /* unlock vma_lock */
      
      Since the vma_lock will unlock immediately after
      hugetlb_handle_userfault(), let's drop the unneeded lock and unlock in
      hugetlb_handle_userfault() to fix the issue.
      
      [1] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/
      [2] https://lore.kernel.org/linux-mm/20220921014457.1668-1-liuzixian4@huawei.com/
      Link: https://lkml.kernel.org/r/20220923042113.137273-1-liushixin2@huawei.com
      Fixes: 1a1aad8a ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
      Signed-off-by: Liu Shixin <liushixin2@huawei.com>
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reported-by: <syzbot+193f9cee8638750b23cf@syzkaller.appspotmail.com>
      Reported-by: Liu Zixian <liuzixian4@huawei.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: <stable@vger.kernel.org>	[4.14+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/mmap: undo ->mmap() when arch_validate_flags() fails · a3c08c02
      Carlos Llamas authored
      commit deb0f656 upstream.
      
      Commit c462ac28 ("mm: Introduce arch_validate_flags()") added a late
      check in mmap_region() to let architectures validate vm_flags.  The check
      needs to happen after calling ->mmap() as the flags can potentially be
      modified during this callback.
      
      If the arch_validate_flags() check fails we unmap and free the vma.
      However, the error path fails to undo the ->mmap() call that previously
      succeeded, and depending on the specific ->mmap() implementation this
      translates to reference increments, memory allocations and other
      operations that will not be cleaned up.
      
      There are several places (mainly device drivers) where this is an issue.
      However, one specific example is bpf_map_mmap() which keeps count of the
      mappings in map->writecnt.  The count is incremented on ->mmap() and then
      decremented on vm_ops->close().  When arch_validate_flags() fails this
      count is off since bpf_map_mmap_close() is never called.
      
      One can reproduce this issue on arm64 devices with MTE support.  Here the
      vm_flags are checked to only allow VM_MTE if VM_MTE_ALLOWED has been set
      previously.  From userspace it is then enough to pass the PROT_MTE flag to
      the mmap() syscall to trigger the arch_validate_flags() failure.
      
      The following program reproduces this issue:
      
        #include <stdio.h>
        #include <unistd.h>
        #include <linux/unistd.h>
        #include <linux/bpf.h>
        #include <sys/mman.h>
      
        int main(void)
        {
      	union bpf_attr attr = {
      		.map_type = BPF_MAP_TYPE_ARRAY,
      		.key_size = sizeof(int),
      		.value_size = sizeof(long long),
      		.max_entries = 256,
      		.map_flags = BPF_F_MMAPABLE,
      	};
      	int fd;
      
      	fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
      	mmap(NULL, 4096, PROT_WRITE | PROT_MTE, MAP_SHARED, fd, 0);
      
      	return 0;
        }
      
      By manually adding some log statements to the vm_ops callbacks we can
      confirm that when passing PROT_MTE to mmap() the map->writecnt is off upon
      ->release():
      
      With PROT_MTE flag:
        root@debian:~# ./bpf-test
        [  111.263874] bpf_map_write_active_inc: map=9 writecnt=1
        [  111.288763] bpf_map_release: map=9 writecnt=1
      
      Without PROT_MTE flag:
        root@debian:~# ./bpf-test
        [  157.816912] bpf_map_write_active_inc: map=10 writecnt=1
        [  157.830442] bpf_map_write_active_dec: map=10 writecnt=0
        [  157.832396] bpf_map_release: map=10 writecnt=0
      
      This patch fixes the above issue by calling vm_ops->close() when the
      arch_validate_flags() check fails, after this we can proceed to unmap and
      free the vma on the error path.
      
      Link: https://lkml.kernel.org/r/20220930003844.1210987-1-cmllamas@google.com
      Fixes: c462ac28 ("mm: Introduce arch_validate_flags()")
      Signed-off-by: Carlos Llamas <cmllamas@google.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
      Cc: Christian Brauner (Microsoft) <brauner@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: <stable@vger.kernel.org>	[5.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
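
      The essence of the fix is error-path ordering: if validation fails after
      ->mmap() has succeeded, give the driver a chance to undo its setup via
      ->close() before unmapping and freeing the vma.  A hedged stand-in
      sketch (struct sketch_vm_ops and the driver callbacks are hypothetical,
      not mm/mmap.c):

        #include <errno.h>
        #include <stdio.h>

        struct sketch_vm_ops {
                int  (*mmap)(void);
                void (*close)(void);
        };

        static int drv_mmap(void)
        {
                puts("driver ->mmap(): state set up");
                return 0;
        }

        static void drv_close(void)
        {
                puts("driver ->close(): state torn down");
        }

        static const struct sketch_vm_ops ops = { .mmap = drv_mmap, .close = drv_close };

        static int mmap_region_sketch(int arch_flags_ok)
        {
                int err = ops.mmap();

                if (err)
                        return err;

                if (!arch_flags_ok) {
                        /* the fix: let the driver undo what ->mmap() set up... */
                        if (ops.close)
                                ops.close();
                        /* ...then take the existing unmap-and-free error path */
                        return -EINVAL;
                }
                return 0;
        }

        int main(void)
        {
                return mmap_region_sketch(0) ? 1 : 0;
        }
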
  14. Oct 15, 2022
    • mm: gup: fix the fast GUP race against THP collapse · 377c60dd
      Yang Shi authored
      commit 70cbc3cc upstream.
      
      Since general RCU GUP fast was introduced in commit 2667f50e ("mm:
      introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
      sufficient to handle concurrent GUP-fast in all cases; it only handles
      traditional IPI-based GUP-fast correctly.  On architectures that send an
      IPI broadcast on TLB flush, it works as expected.  But on the
      architectures that do not use IPI to broadcast TLB flush, it may have the
      below race:
      
         CPU A                                          CPU B
      THP collapse                                     fast GUP
                                                    gup_pmd_range() <-- see valid pmd
                                                        gup_pte_range() <-- work on pte
      pmdp_collapse_flush() <-- clear pmd and flush
      __collapse_huge_page_isolate()
          check page pinned <-- before GUP bump refcount
                                                            pin the page
                                                            check PTE <-- no change
      __collapse_huge_page_copy()
          copy data to huge page
          ptep_clear()
      install huge pmd for the huge page
                                                            return the stale page
      discard the stale page
      
      The race can be fixed by checking whether PMD is changed or not after
      taking the page pin in fast GUP, just like what it does for PTE.  If the
      PMD is changed it means there may be parallel THP collapse, so GUP should
      back off.
      
      Also update the stale comment about serializing against fast GUP in
      khugepaged.
      
      Link: https://lkml.kernel.org/r/20220907180144.555485-1-shy828301@gmail.com
      Fixes: 2667f50e ("mm: introduce a general RCU get_user_pages_fast()")
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
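
      The fix follows a classic "re-check after pinning" pattern, sketched
      below in single-threaded userspace C (pmd_value, pin_count and
      gup_fast_sketch() are stand-ins; the racing collapse is simulated by a
      flag rather than a second CPU):

        #include <stdio.h>

        static long pmd_value = 0x1000;   /* stand-in for the pmd entry */
        static int  pin_count;

        static int gup_fast_sketch(int collapse_races_in)
        {
                long pmd = pmd_value;     /* snapshot seen before walking PTEs */

                pin_count++;              /* "pin" the page */

                if (collapse_races_in)
                        pmd_value = 0;    /* concurrent pmdp_collapse_flush() */

                if (pmd_value != pmd) {   /* the added re-check after the pin */
                        pin_count--;      /* unpin and back off; retry slowly */
                        return 0;
                }
                return 1;                 /* pmd unchanged: page is not stale */
        }

        int main(void)
        {
                printf("got page: %d (0 means GUP backed off)\n", gup_fast_sketch(1));
                printf("pins left: %d\n", pin_count);
                return 0;
        }
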
  15. Oct 05, 2022
  16. Sep 28, 2022
    • mm/slub: fix to return errno if kmalloc() fails · 379ac790
      Chao Yu authored
      commit 7e9c323c upstream.
      
      In create_unique_id(), kmalloc(, GFP_KERNEL) can fail due to
      out-of-memory.  If it fails, return errno correctly rather than
      triggering a panic via BUG_ON():
      
      kernel BUG at mm/slub.c:5893!
      Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
      
      Call trace:
       sysfs_slab_add+0x258/0x260 mm/slub.c:5973
       __kmem_cache_create+0x60/0x118 mm/slub.c:4899
       create_cache mm/slab_common.c:229 [inline]
       kmem_cache_create_usercopy+0x19c/0x31c mm/slab_common.c:335
       kmem_cache_create+0x1c/0x28 mm/slab_common.c:390
       f2fs_kmem_cache_create fs/f2fs/f2fs.h:2766 [inline]
       f2fs_init_xattr_caches+0x78/0xb4 fs/f2fs/xattr.c:808
       f2fs_fill_super+0x1050/0x1e0c fs/f2fs/super.c:4149
       mount_bdev+0x1b8/0x210 fs/super.c:1400
       f2fs_mount+0x44/0x58 fs/f2fs/super.c:4512
       legacy_get_tree+0x30/0x74 fs/fs_context.c:610
       vfs_get_tree+0x40/0x140 fs/super.c:1530
       do_new_mount+0x1dc/0x4e4 fs/namespace.c:3040
       path_mount+0x358/0x914 fs/namespace.c:3370
       do_mount fs/namespace.c:3383 [inline]
       __do_sys_mount fs/namespace.c:3591 [inline]
       __se_sys_mount fs/namespace.c:3568 [inline]
       __arm64_sys_mount+0x2f8/0x408 fs/namespace.c:3568
      
      Cc: <stable@kernel.org>
      Fixes: 81819f0f ("SLUB core")
      Reported-by: <syzbot+81684812ea68216e08c5@syzkaller.appspotmail.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Signed-off-by: Chao Yu <chao.yu@oppo.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
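
      The shape of the fix is generic error handling: when an allocation that
      was assumed to never fail can in fact fail, propagate -ENOMEM instead of
      crashing the machine.  A stand-in sketch (create_unique_id_sketch() is
      hypothetical, not the slub code):

        #include <errno.h>
        #include <stdio.h>
        #include <stdlib.h>

        static int create_unique_id_sketch(char **out, int simulate_oom)
        {
                char *name = simulate_oom ? NULL : malloc(64);

                if (!name)
                        return -ENOMEM;  /* old behaviour: BUG_ON(!name) panicked */

                snprintf(name, 64, "id-sketch");
                *out = name;
                return 0;
        }

        int main(void)
        {
                char *id;
                int err = create_unique_id_sketch(&id, 1);

                if (err) {
                        fprintf(stderr, "create_unique_id failed: %d\n", err);
                        return 1;
                }
                free(id);
                return 0;
        }
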
  17. Sep 20, 2022
  18. Sep 15, 2022
    • Revert "mm: kmemleak: take a full lowmem check in kmemleak_*_phys()" · a14f1799
      Yee Lee authored
      This reverts commit 23c2d497.
      
      Commit 23c2d497 ("mm: kmemleak: take a full lowmem check in
      kmemleak_*_phys()") brought false leak alarms on some archs like arm64
      that do not init the pfn boundary in early booting.  The final solution
      lands on linux-6.0: commit 0c24e061 ("mm: kmemleak: add rbtree and
      store physical address for objects allocated with PA").
      
      Revert this commit before linux-6.0.  The original issue of an invalid
      PA can be mitigated by an additional check in devicetree.
      
      The false alarm report is as follows: Kmemleak output: (Qemu/arm64)
      unreferenced object 0xffff0000c0170a00 (size 128):
        comm "swapper/0", pid 1, jiffies 4294892404 (age 126.208s)
        hex dump (first 32 bytes):
       62 61 73 65 00 00 00 00 00 00 00 00 00 00 00 00  base............
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<(____ptrval____)>] __kmalloc_track_caller+0x1b0/0x2e4
          [<(____ptrval____)>] kstrdup_const+0x8c/0xc4
          [<(____ptrval____)>] kvasprintf_const+0xbc/0xec
          [<(____ptrval____)>] kobject_set_name_vargs+0x58/0xe4
          [<(____ptrval____)>] kobject_add+0x84/0x100
          [<(____ptrval____)>] __of_attach_node_sysfs+0x78/0xec
          [<(____ptrval____)>] of_core_init+0x68/0x104
          [<(____ptrval____)>] driver_init+0x28/0x48
          [<(____ptrval____)>] do_basic_setup+0x14/0x28
          [<(____ptrval____)>] kernel_init_freeable+0x110/0x178
          [<(____ptrval____)>] kernel_init+0x20/0x1a0
          [<(____ptrval____)>] ret_from_fork+0x10/0x20
      
      This patch is also applicable to linux-5.17.y/linux-5.18.y/linux-5.19.y
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Yee Lee <yee.lee@mediatek.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  19. Sep 08, 2022
    • mm: pagewalk: Fix race between unmap and page walker · 47a73e5e
      Steven Price authored
      [ Upstream commit 8782fb61 ]
      
      The mmap lock protects the page walker from changes to the page tables
      during the walk.  However a read lock is insufficient to protect those
      areas which don't have a VMA as munmap() detaches the VMAs before
      downgrading to a read lock and actually tearing down PTEs/page tables.
      
      For users of walk_page_range() the solution is to simply call pte_hole()
      immediately without checking the actual page tables when a VMA is not
      present. We now never call __walk_page_range() without a valid vma.
      
      For walk_page_range_novma() the locking requirements are tightened to
      require the mmap write lock to be taken, and then walking the pgd
      directly with 'no_vma' set.
      
      This in turn means that all page walkers either have a valid vma, or
      it's that special 'novma' case for page table debugging.  As a result,
      all the odd '(!walk->vma && !walk->no_vma)' tests can be removed.
      
      Fixes: dd2283f2 ("mm: mmap: zap pages with read mmap_sem in munmap")
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Steven Price <steven.price@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  20. Sep 05, 2022
    • mm/rmap: Fix anon_vma->degree ambiguity leading to double-reuse · 98f401d3
      Jann Horn authored
      commit 2555283e upstream.
      
      anon_vma->degree tracks the combined number of child anon_vmas and VMAs
      that use the anon_vma as their ->anon_vma.
      
      anon_vma_clone() then assumes that for any anon_vma attached to
      src->anon_vma_chain other than src->anon_vma, it is impossible for it to
      be a leaf node of the VMA tree, meaning that for such VMAs ->degree is
      elevated by 1 because of a child anon_vma, meaning that if ->degree
      equals 1 there are no VMAs that use the anon_vma as their ->anon_vma.
      
      This assumption is wrong because the ->degree optimization leads to leaf
      nodes being abandoned on anon_vma_clone() - an existing anon_vma is
      reused and no new parent-child relationship is created.  So it is
      possible to reuse an anon_vma for one VMA while it is still tied to
      another VMA.
      
      This is an issue because is_mergeable_anon_vma() and its callers assume
      that if two VMAs have the same ->anon_vma, the list of anon_vmas
      attached to the VMAs is guaranteed to be the same.  When this assumption
      is violated, vma_merge() can merge pages into a VMA that is not attached
      to the corresponding anon_vma, leading to dangling page->mapping
      pointers that will be dereferenced during rmap walks.
      
      Fix it by separately tracking the number of child anon_vmas and the
      number of VMAs using the anon_vma as their ->anon_vma.
      
      Fixes: 7a3ef208 ("mm: prevent endless growth of anon_vma hierarchy")
      Cc: stable@kernel.org
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
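
      The data-structure side of the fix can be pictured with a stand-in
      struct (field names here are illustrative, not a copy of the kernel
      struct): keep the two meanings that were folded into ->degree as two
      separate counters, so "no VMA currently uses this anon_vma" can be
      tested directly before reusing it.

        #include <stdio.h>

        struct anon_vma_sketch {
                /* old: one "degree" mixing both meanings */
                unsigned long num_children;     /* child anon_vmas             */
                unsigned long num_active_vmas;  /* VMAs whose ->anon_vma is us */
        };

        int main(void)
        {
                struct anon_vma_sketch av = { .num_children = 1, .num_active_vmas = 0 };

                /* reuse for a new VMA only when no other VMA is attached */
                printf("safe to reuse: %d\n", av.num_active_vmas == 0);
                return 0;
        }
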
    • mm: Force TLB flush for PFNMAP mappings before unlink_file_vma() · 895428ee
      Jann Horn authored
      commit b67fbebd upstream.
      
      Some drivers rely on having all VMAs through which a PFN might be
      accessible listed in the rmap for correctness.
      However, on X86, it was possible for a VMA with stale TLB entries
      to not be listed in the rmap.
      
      This was fixed in mainline with
      commit b67fbebd ("mmu_gather: Force tlb-flush VM_PFNMAP vmas"),
      but that commit relies on preceding refactoring in
      commit 18ba064e ("mmu_gather: Let there be one tlb_{start,end}_vma()
      implementation") and commit 1e9fdf21 ("mmu_gather: Remove per arch
      tlb_{start,end}_vma()").
      
      This patch provides equivalent protection without needing that
      refactoring, by forcing a TLB flush between removing PTEs in
      unmap_vmas() and the call to unlink_file_vma() in free_pgtables().
      
      [This is a stable-specific rewrite of the upstream commit!]
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  21. Aug 31, 2022
    • mm/hugetlb: fix hugetlb not supporting softdirty tracking · 62af37c5
      David Hildenbrand authored
      commit f96f7a40 upstream.
      
      Patch series "mm/hugetlb: fix write-fault handling for shared mappings", v2.
      
      I observed that hugetlb does not support/expect write-faults in shared
      mappings that would have to map the R/O-mapped page writable -- and I
      found two cases where we could currently get such faults and would
      erroneously map an anon page into a shared mapping.
      
      Reproducers are part of the patches.
      
      I propose to backport both fixes to stable trees.  The first fix needs a
      small adjustment.
      
      
      This patch (of 2):
      
      Staring at hugetlb_wp(), one might wonder where all the logic for shared
      mappings is when stumbling over a write-protected page in a shared
      mapping.  In fact, there is none, and so far we thought we could get away
      with that because e.g., mprotect() should always do the right thing and
      map all pages directly writable.
      
      Looks like we were wrong:
      
      --------------------------------------------------------------------------
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <fcntl.h>
       #include <unistd.h>
       #include <errno.h>
       #include <sys/mman.h>
      
       #define HUGETLB_SIZE (2 * 1024 * 1024u)
      
       static void clear_softdirty(void)
       {
               int fd = open("/proc/self/clear_refs", O_WRONLY);
               const char *ctrl = "4";
               int ret;
      
               if (fd < 0) {
                       fprintf(stderr, "open(clear_refs) failed\n");
                       exit(1);
               }
               ret = write(fd, ctrl, strlen(ctrl));
               if (ret != strlen(ctrl)) {
                       fprintf(stderr, "write(clear_refs) failed\n");
                       exit(1);
               }
               close(fd);
       }
      
       int main(int argc, char **argv)
       {
               char *map;
               int fd;
      
               fd = open("/dev/hugepages/tmp", O_RDWR | O_CREAT);
               if (!fd) {
                       fprintf(stderr, "open() failed\n");
                       return -errno;
               }
               if (ftruncate(fd, HUGETLB_SIZE)) {
                       fprintf(stderr, "ftruncate() failed\n");
                       return -errno;
               }
      
               map = mmap(NULL, HUGETLB_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
               if (map == MAP_FAILED) {
                       fprintf(stderr, "mmap() failed\n");
                       return -errno;
               }
      
               *map = 0;
      
               if (mprotect(map, HUGETLB_SIZE, PROT_READ)) {
                       fprintf(stderr, "mmprotect() failed\n");
                       return -errno;
               }
      
               clear_softdirty();
      
               if (mprotect(map, HUGETLB_SIZE, PROT_READ|PROT_WRITE)) {
                       fprintf(stderr, "mmprotect() failed\n");
                       return -errno;
               }
      
               *map = 0;
      
               return 0;
       }
      --------------------------------------------------------------------------
      
      The above test fails with SIGBUS when there is only a single free hugetlb page.
       # echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
       # ./test
       Bus error (core dumped)
      
      And worse, with sufficient free hugetlb pages it will map an anonymous page
      into a shared mapping, for example, messing up accounting during unmap
      and breaking MAP_SHARED semantics:
       # echo 2 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
       # ./test
       # cat /proc/meminfo | grep HugePages_
       HugePages_Total:       2
       HugePages_Free:        1
       HugePages_Rsvd:    18446744073709551615
       HugePages_Surp:        0
      
      The reason in this particular case is that vma_wants_writenotify() will
      return "true", removing VM_SHARED in vma_set_page_prot() to map pages
      write-protected. Let's teach vma_wants_writenotify() that hugetlb does not
      support softdirty tracking.
      
      Link: https://lkml.kernel.org/r/20220811103435.188481-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20220811103435.188481-2-david@redhat.com
      Fixes: 64e45507 ("mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Jamie Liu <jamieliu@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>	[3.18+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/huge_memory.c: use helper function migration_entry_to_page() · c7c77185
      Miaohe Lin authored
      [ Upstream commit a44f89dc ]
      
      It is recommended to use the helper function migration_entry_to_page()
      to get the page via a migration entry.  We also benefit from the
      PageLocked() check there.
      
      Link: https://lkml.kernel.org/r/20210318122722.13135-7-linmiaohe@huawei.com
      
      
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Thomas Hellström (Intel) <thomas_os@shipmail.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: yuleixzhang <yulei.kernel@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
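
      The cleanup is the usual "prefer a named helper over open-coding":
      convert the migration entry to its page in one place so the shared
      sanity check lives with the conversion.  The sketch below uses stand-in
      types (not the kernel's swp_entry_t or struct page):

        #include <assert.h>
        #include <stdbool.h>
        #include <stdio.h>

        struct page_sketch { bool locked; };
        struct swp_entry_sketch { struct page_sketch *page; };

        /* stand-in for migration_entry_to_page(): one place for the checks */
        static struct page_sketch *entry_to_page_sketch(struct swp_entry_sketch e)
        {
                assert(e.page->locked);  /* migration entries imply a locked page */
                return e.page;
        }

        int main(void)
        {
                struct page_sketch pg = { .locked = true };
                struct swp_entry_sketch entry = { .page = &pg };

                printf("page locked: %d\n", entry_to_page_sketch(entry)->locked);
                return 0;
        }
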
  22. Aug 21, 2022