  1. Apr 15, 2022
    • mm: kmemleak: take a full lowmem check in kmemleak_*_phys() · 23c2d497
      Patrick Wang authored
      The kmemleak_*_phys() APIs do not check the address against lowmem's
      lower boundary, so a caller may pass an address below lowmem, which
      triggers an oops:
      
        # echo scan > /sys/kernel/debug/kmemleak
        Unable to handle kernel paging request at virtual address ff5fffffffe00000
        Oops [#1]
        Modules linked in:
        CPU: 2 PID: 134 Comm: bash Not tainted 5.18.0-rc1-next-20220407 #33
        Hardware name: riscv-virtio,qemu (DT)
        epc : scan_block+0x74/0x15c
         ra : scan_block+0x72/0x15c
        epc : ffffffff801e5806 ra : ffffffff801e5804 sp : ff200000104abc30
         gp : ffffffff815cd4e8 tp : ff60000004cfa340 t0 : 0000000000000200
         t1 : 00aaaaaac23954cc t2 : 00000000000003ff s0 : ff200000104abc90
         s1 : ffffffff81b0ff28 a0 : 0000000000000000 a1 : ff5fffffffe01000
         a2 : ffffffff81b0ff28 a3 : 0000000000000002 a4 : 0000000000000001
         a5 : 0000000000000000 a6 : ff200000104abd7c a7 : 0000000000000005
         s2 : ff5fffffffe00ff9 s3 : ffffffff815cd998 s4 : ffffffff815d0e90
         s5 : ffffffff81b0ff28 s6 : 0000000000000020 s7 : ffffffff815d0eb0
         s8 : ffffffffffffffff s9 : ff5fffffffe00000 s10: ff5fffffffe01000
         s11: 0000000000000022 t3 : 00ffffffaa17db4c t4 : 000000000000000f
         t5 : 0000000000000001 t6 : 0000000000000000
        status: 0000000000000100 badaddr: ff5fffffffe00000 cause: 000000000000000d
          scan_gray_list+0x12e/0x1a6
          kmemleak_scan+0x2aa/0x57e
          kmemleak_write+0x32a/0x40c
          full_proxy_write+0x56/0x82
          vfs_write+0xa6/0x2a6
          ksys_write+0x6c/0xe2
          sys_write+0x22/0x2a
          ret_from_syscall+0x0/0x2
      
      Callers may not know the actual address they pass (e.g. one taken from
      the devicetree), so the kmemleak_*_phys() APIs should guarantee that the
      address they finally use is within the lowmem range.  Check the address
      against lowmem's lower boundary as well.
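
      The full range check described above can be sketched in userspace C.
      This is only an illustration: struct lowmem_bounds, phys_in_lowmem()
      and the 4 KiB PAGE_SHIFT are assumptions made for the example, not the
      kernel's own definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical stand-in for the kernel's lowmem page frame bounds. */
struct lowmem_bounds {
	uint64_t min_low_pfn; /* first lowmem page frame */
	uint64_t max_low_pfn; /* one past the last lowmem page frame */
};

/* True only if phys lies inside [min_low_pfn, max_low_pfn). */
static bool phys_in_lowmem(const struct lowmem_bounds *b, uint64_t phys)
{
	uint64_t pfn = phys >> PAGE_SHIFT;

	assert(b->min_low_pfn <= b->max_low_pfn);
	/* Checking only the upper bound would let addresses below lowmem through. */
	return pfn >= b->min_low_pfn && pfn < b->max_low_pfn;
}
```

      A caller would skip the kmemleak operation whenever phys_in_lowmem()
      returns false, instead of translating the address and scanning it.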
      
      Link: https://lkml.kernel.org/r/20220413122925.33856-1-patrick.wang.shcn@gmail.com
      Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore · c12cd77c
      Omar Sandoval authored
      Commit 3ee48b6a ("mm, x86: Saving vmcore with non-lazy freeing of
      vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to
      lazy_max_pages() + 1, ensuring that any future vunmaps() immediately
      purge the vmap areas instead of doing it lazily.
      
      Commit 690467c8 ("mm/vmalloc: Move draining areas out of caller
      context") moved the purging from the vunmap() caller to a worker thread.
      Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin
      (possibly forever).  For example, consider the following scenario:
      
       1. Thread reads from /proc/vmcore. This eventually calls
          __copy_oldmem_page() -> set_iounmap_nonlazy(), which sets
          vmap_lazy_nr to lazy_max_pages() + 1.
      
       2. Then it calls free_vmap_area_noflush() (via iounmap()), which adds 2
          pages (one page plus the guard page) to the purge list and
          vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so the
          drain_vmap_work is scheduled.
      
       3. Thread returns from the kernel and is scheduled out.
      
       4. Worker thread is scheduled in and calls drain_vmap_area_work(). It
          frees the 2 pages on the purge list. vmap_lazy_nr is now
          lazy_max_pages() + 1.
      
       5. This is still over the threshold, so it tries to purge areas again,
          but doesn't find anything.
      
       6. Repeat 5.
      
      If the system is running with only one CPU (which is typical for kdump)
      and preemption is disabled, then this will never make forward progress:
      there aren't any more pages to purge, so it hangs.  If there is more
      than one CPU or preemption is enabled, then the worker thread will spin
      forever in the background.  (Note that if there were already pages to be
      purged at the time that set_iounmap_nonlazy() was called, this bug is
      avoided.)
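
      The counter arithmetic in the scenario above can be replayed with a toy
      model.  This is a sketch, not the vmalloc code: the function name, its
      parameters and the pass cap are hypothetical, and the cap exists only so
      that the example terminates.

```c
#include <assert.h>

/*
 * Toy model of the lazy-purge counter.  "bias" is the value written by
 * set_iounmap_nonlazy() with no pages behind it; "queued" is the number of
 * pages really put on the purge list.  Returns how many drain passes found
 * nothing to purge while the counter was still over the threshold.
 */
static int count_empty_drain_passes(long lazy_max, long bias, long queued,
				    int max_passes)
{
	long lazy_nr = bias + queued; /* what the counter claims */
	long on_list = queued;        /* what can actually be purged */
	int empty = 0;

	assert(max_passes >= 0);
	for (int pass = 0; pass < max_passes && lazy_nr > lazy_max; pass++) {
		if (on_list == 0)
			empty++;      /* over the threshold, nothing to free */
		lazy_nr -= on_list;   /* purge whatever is on the list */
		on_list = 0;
	}
	return empty;
}
```

      With lazy_max = 100, bias = lazy_max + 1 and two queued pages, every
      pass after the first is an empty one; without the bias the loop ends
      after a single pass.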
      
      This can be reproduced with anything that reads from /proc/vmcore
      multiple times.  E.g., vmcore-dmesg /proc/vmcore.
      
      It turns out that improvements to vmap() over the years have obsoleted
      the need for this "optimization".  I benchmarked `dd if=/proc/vmcore
      of=/dev/null` with 4k and 1M read sizes on a system with a 32GB vmcore.
      The test was run on 5.17, 5.18-rc1 with a fix that avoided the hang, and
      5.18-rc1 with set_iounmap_nonlazy() removed entirely:
      
          |5.17  |5.18+fix|5.18+removal
        4k|40.86s|  40.09s|      26.73s
        1M|24.47s|  23.98s|      21.84s
      
      The removal was the fastest (by a wide margin with 4k reads).  This
      patch removes set_iounmap_nonlazy().
      
      Link: https://lkml.kernel.org/r/52f819991051f9b865e9ce25605509bfdbacadcd.1649277321.git.osandov@fb.com
      Fixes: 690467c8 ("mm/vmalloc: Move draining areas out of caller context")
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Acked-by: Chris Down <chris@chrisdown.name>
      Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: do not demote poisoned hugetlb pages · 5a317412
      Mike Kravetz authored
      It is possible for poisoned hugetlb pages to reside on the free lists.
      The huge page allocation routines which dequeue entries from the free
      lists make a point of avoiding poisoned pages.  There is no such check
      and avoidance in the demote code path.
      
      If a hugetlb page is on a free list, poison will only be set in the
      head page rather than in the page with the actual error.  If such a
      page is demoted, then the poison flag may follow the wrong page.  A page
      without an error could have poison set, and a page with an error could
      end up without the flag set.
      
      Check for poison before attempting to demote a hugetlb page.  Also,
      return -EBUSY to the caller if only poisoned pages are on the free list.
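
      The selection rule described above can be sketched as follows.  The
      struct and function names are hypothetical and model only the poison
      flag; returning -EBUSY for an empty list as well is an assumption of
      this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a free huge page; not the kernel's struct page. */
struct free_hpage_model {
	bool poisoned;
};

/*
 * Return the index of the first free page that is safe to demote,
 * or -EBUSY if no non-poisoned page is available.
 */
static long pick_demotable(const struct free_hpage_model *list, size_t n)
{
	assert(list != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!list[i].poisoned)
			return (long)i; /* skip poisoned pages, take the first clean one */
	}
	return -EBUSY;
}
```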
      
      Link: https://lkml.kernel.org/r/20220307215707.50916-1-mike.kravetz@oracle.com
      Fixes: 8531fc6f ("hugetlb: add hugetlb demote page support")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: fix compiler warning when CONFIG_COMPACTION=n · 31ca72fa
      Charan Teja Kalla authored
      The below warning is reported when CONFIG_COMPACTION=n:
      
         mm/compaction.c:56:27: warning: 'HPAGE_FRAG_CHECK_INTERVAL_MSEC' defined but not used [-Wunused-const-variable=]
            56 | static const unsigned int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;
               |                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Fix it by moving 'HPAGE_FRAG_CHECK_INTERVAL_MSEC' under the
      CONFIG_COMPACTION guard.

      Also, since this is just a 'static const' integer, use #define for it.
      
      Link: https://lkml.kernel.org/r/1647608518-20924-1-git-send-email-quic_charante@quicinc.com
      Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Nitin Gupta <nigupta@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix unexpected zeroed page mapping with zram swap · e914d8f0
      Minchan Kim authored
      When two processes share an address space via CLONE_VM, a user process
      can be corrupted by unexpectedly seeing a zeroed page.
      
            CPU A                        CPU B
      
        do_swap_page                do_swap_page
        SWP_SYNCHRONOUS_IO path     SWP_SYNCHRONOUS_IO path
        swap_readpage valid data
          swap_slot_free_notify
            delete zram entry
                                    swap_readpage zeroed(invalid) data
                                    pte_lock
                                    map the *zero data* to userspace
                                    pte_unlock
        pte_lock
        if (!pte_same)
          goto out_nomap;
        pte_unlock
        return and next refault will
        read zeroed data
      
      swap_slot_free_notify is bogus in the CLONE_VM case: copy_mm does not
      increase the refcount of the swap slot, so the function cannot tell
      whether it is safe to discard the data from the backing device.  In
      that case, the only lock it could rely on to synchronize swap slot
      freeing is the page table lock.  Thus, this patch gets rid of the
      swap_slot_free_notify function.  With this patch, CPU A will see
      correct data.
      
            CPU A                        CPU B
      
        do_swap_page                do_swap_page
        SWP_SYNCHRONOUS_IO path     SWP_SYNCHRONOUS_IO path
                                    swap_readpage original data
                                    pte_lock
                                    map the original data
                                    swap_free
                                      swap_range_free
                                        bd_disk->fops->swap_slot_free_notify
        swap_readpage read zeroed data
                                    pte_unlock
        pte_lock
        if (!pte_same)
          goto out_nomap;
        pte_unlock
        return
        on next refault will see mapped data by CPU B
      
      One concern is that the patch could increase memory consumption, since
      a page may be kept both in compressed form in zram and in uncompressed
      form in the address space.  However, most zram setups use no readahead
      and do_swap_page is followed by swap_free, so the compressed copy in
      zram will be freed quickly.
      
      Link: https://lkml.kernel.org/r/YjTVVxIAsnKAXjTd@google.com
      Fixes: 0bcac06f ("mm, swap: skip swapcache for swapin of synchronous device")
      Reported-by: Ivan Babrou <ivan@cloudflare.com>
      Tested-by: Ivan Babrou <ivan@cloudflare.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.14+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: fix build_zonerefs_node() · e553f62f
      Juergen Gross authored
      Since commit 6aa303de ("mm, vmscan: only allocate and reclaim from
      zones with pages managed by the buddy allocator") only zones with free
      memory are included in a built zonelist.  This is problematic when e.g.
      all memory of a zone has been ballooned out when zonelists are being
      rebuilt.
      
      The decision whether to rebuild the zonelists when onlining new memory
      is based on populated_zone() returning 0 for the zone the memory will
      be added to.  The new zone is added to the zonelists only if it has
      free memory pages (managed_zone() returns a non-zero value) after the
      memory has been onlined.  This implies that onlining memory will always
      free the added pages to the allocator immediately, but this is not true
      in all cases: when running as a Xen guest, for example, the newly
      onlined memory is added only to the ballooned memory list and is freed
      only when the guest is ballooned up afterwards.

      Another problem with using managed_zone() to decide whether a zone is
      added to the zonelists is that a zone with all of its memory in use
      will in fact be removed from all zonelists if the zonelists happen to
      be rebuilt.
      
      Use populated_zone() when building a zonelist as it has been done before
      that commit.
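
      The difference between the two predicates can be shown with a
      simplified model.  The struct and the *_model helpers below are
      hypothetical; they assume that populated_zone() tests the pages present
      in a zone and managed_zone() tests the pages currently handed to the
      buddy allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified zone: only the two counters that matter here. */
struct zone_model {
	unsigned long present_pages; /* pages physically present in the zone */
	unsigned long managed_pages; /* pages currently owned by the allocator */
};

static bool populated_zone_model(const struct zone_model *z)
{
	return z->present_pages != 0;
}

static bool managed_zone_model(const struct zone_model *z)
{
	return z->managed_pages != 0;
}

/* Count the zones a zonelist build would include under the given predicate. */
static size_t count_zonerefs(const struct zone_model *zones, size_t n,
			     bool (*include)(const struct zone_model *))
{
	size_t count = 0;

	assert(zones != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (include(&zones[i]))
			count++;
	}
	return count;
}
```

      A zone whose memory has been fully ballooned out (present but not
      managed) is kept by the populated predicate and dropped by the managed
      one.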
      
      There was a report that QubesOS (based on Xen) is hitting this problem.
      Xen has switched to use the zone device functionality in kernel 5.9 and
      QubesOS wants to use memory hotplugging for guests in order to be able
      to start a guest with minimal memory and expand it as needed.  This was
      the report leading to the patch.
      
      Link: https://lkml.kernel.org/r/20220407120637.9035-1-jgross@suse.com
      Fixes: 6aa303de ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, kfence: support kmem_dump_obj() for KFENCE objects · 2dfe63e6
      Marco Elver authored
      Calling kmem_obj_info() via kmem_dump_obj() on KFENCE objects has been
      producing garbage data due to the object not actually being maintained
      by SLAB or SLUB.
      
      Fix this by implementing __kfence_obj_info() that copies relevant
      information to struct kmem_obj_info when the object was allocated by
      KFENCE; this is called by a common kmem_obj_info(), which also calls the
      slab/slub/slob specific variant now called __kmem_obj_info().
      
      For completeness, kmem_dump_obj() now displays if the object was
      allocated by KFENCE.
      
      Link: https://lore.kernel.org/all/20220323090520.GG16885@xsang-OptiPlex-9020/
      Link: https://lkml.kernel.org/r/20220406131558.3558585-1-elver@google.com
      Fixes: b89fb5ef ("mm, kfence: insert KFENCE hooks for SLUB")
      Fixes: d3fb45f3 ("mm, kfence: insert KFENCE hooks for SLAB")
      Signed-off-by: Marco Elver <elver@google.com>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>	[slab]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kasan: fix hw tags enablement when KUNIT tests are disabled · b1add418
      Vincenzo Frascino authored
      KASAN enables hw tags via kasan_enable_tagging(), which selects the
      correct hw backend based on the mode passed on the kernel command line.
      kasan_enable_tagging() is meant to be invoked indirectly via the cpu
      features framework of the architectures that support these backends.
      Currently the invocation of this function is guarded by
      CONFIG_KASAN_KUNIT_TEST, which allows the correct backend to be enabled
      only when KUNIT tests are enabled in the kernel.
      
      This inconsistency was introduced in commit:
      
        ed6d7444 ("kasan: test: support async (again) and asymm modes for HW_TAGS")
      
      ... and prevents MTE from being enabled on arm64 when the KUNIT tests
      for kasan hw_tags are disabled.

      Fix the issue by making sure that the CONFIG_KASAN_KUNIT_TEST guard
      does not prevent the correct invocation of kasan_enable_tagging().
      
      Link: https://lkml.kernel.org/r/20220408124323.10028-1-vincenzo.frascino@arm.com
      Fixes: ed6d7444 ("kasan: test: support async (again) and asymm modes for HW_TAGS")
      Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/secretmem: fix panic when growing a memfd_secret · f9b141f9
      Axel Rasmussen authored
      
      When one tries to grow an existing memfd_secret with ftruncate, one gets
      a panic [1].  For example, doing the following reliably induces the
      panic:
      
          fd = memfd_secret();
      
          ftruncate(fd, 10);
          ptr = mmap(NULL, 10, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          strcpy(ptr, "123456789");
      
          munmap(ptr, 10);
          ftruncate(fd, 20);
      
      The basic reason for this is, when we grow with ftruncate, we call down
      into simple_setattr, and then truncate_inode_pages_range, and eventually
      we try to zero part of the memory.  The normal truncation code does this
      via the direct map (i.e., it calls page_address() and hands that to
      memset()).
      
      For memfd_secret though, we specifically don't map our pages via the
      direct map (i.e.  we call set_direct_map_invalid_noflush() on every
      fault).  So the address returned by page_address() isn't useful, and
      when we try to memset() with it we panic.
      
      This patch avoids the panic by implementing a custom setattr for
      memfd_secret, which detects resizes specifically (setting the size for
      the first time works just fine, since there are no existing pages to try
      to zero), and rejects them with EINVAL.
      
      One could argue growing should be supported, but I think that will
      require a significantly more lengthy change.  So, I propose a minimal
      fix for the benefit of stable kernels, and then perhaps to extend
      memfd_secret to support growing in a separate patch.
      
      [1]:
      
        BUG: unable to handle page fault for address: ffffa0a889277028
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0002) - not-present page
        PGD afa01067 P4D afa01067 PUD 83f909067 PMD 83f8bf067 PTE 800ffffef6d88060
        Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
        CPU: 0 PID: 281 Comm: repro Not tainted 5.17.0-dbg-DEV #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
        RIP: 0010:memset_erms+0x9/0x10
        Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
        RSP: 0018:ffffb932c09afbf0 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffffda63c4249dc0 RCX: 0000000000000fd8
        RDX: 0000000000000fd8 RSI: 0000000000000000 RDI: ffffa0a889277028
        RBP: ffffb932c09afc00 R08: 0000000000001000 R09: ffffa0a889277028
        R10: 0000000000020023 R11: 0000000000000000 R12: ffffda63c4249dc0
        R13: ffffa0a890d70d98 R14: 0000000000000028 R15: 0000000000000fd8
        FS:  00007f7294899580(0000) GS:ffffa0af9bc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: ffffa0a889277028 CR3: 0000000107ef6006 CR4: 0000000000370ef0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         ? zero_user_segments+0x82/0x190
         truncate_inode_partial_folio+0xd4/0x2a0
         truncate_inode_pages_range+0x380/0x830
         truncate_setsize+0x63/0x80
         simple_setattr+0x37/0x60
         notify_change+0x3d8/0x4d0
         do_sys_ftruncate+0x162/0x1d0
         __x64_sys_ftruncate+0x1c/0x20
         do_syscall_64+0x44/0xa0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        Modules linked in: xhci_pci xhci_hcd virtio_net net_failover failover virtio_blk virtio_balloon uhci_hcd ohci_pci ohci_hcd evdev ehci_pci ehci_hcd 9pnet_virtio 9p netfs 9pnet
        CR2: ffffa0a889277028
      
      [lkp@intel.com: secretmem_iops can be static]
      Signed-off-by: kernel test robot <lkp@intel.com>
      [axelrasmussen@google.com: return EINVAL]
      
      Link: https://lkml.kernel.org/r/20220324210909.1843814-1-axelrasmussen@google.com
      Link: https://lkml.kernel.org/r/20220412193023.279320-1-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: kernel test robot <lkp@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: fix regressions from wider use of ZERO_PAGE · 1bdec44b
      Hugh Dickins authored
      Chuck Lever reported fsx-based xfstests generic 075 091 112 127 failing
      when 5.18-rc1 NFS server exports tmpfs: bisected to recent tmpfs change.
      
      Whilst nfsd_splice_action() does contain some questionable handling of
      repeated pages, and Chuck was able to work around there, history from
      Mark Hemment makes clear that there might be similar dangers elsewhere:
      it was not a good idea for me to pass ZERO_PAGE down to unknown actors.
      
      Revert shmem_file_read_iter() to using ZERO_PAGE for holes only when
      iter_is_iovec(); in other cases, use the more natural iov_iter_zero()
      instead of copy_page_to_iter().
      
      We would use iov_iter_zero() throughout, but the x86 clear_user() is not
      nearly so well optimized as copy to user (dd of 1T sparse tmpfs file
      takes 57 seconds rather than 44 seconds).
      
      And now pagecache_init() does not need to SetPageUptodate(ZERO_PAGE(0)),
      which had caused boot failure on arm noMMU STM32F7 and STM32H7 boards.
      
      Link: https://lkml.kernel.org/r/9a978571-8648-e830-5735-1f4748ce2e30@google.com
      Fixes: 56a8c8eb ("tmpfs: do not allocate pages on read")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reported-by: Patrice CHOTARD <patrice.chotard@foss.st.com>
      Reported-by: Chuck Lever III <chuck.lever@oracle.com>
      Tested-by: Chuck Lever III <chuck.lever@oracle.com>
      Cc: Mark Hemment <markhemm@googlemail.com>
      Cc: Patrice CHOTARD <patrice.chotard@foss.st.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Lukas Czerner <lczerner@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "Darrick J. Wong" <djwong@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. Apr 08, 2022
  3. Apr 07, 2022
  4. Apr 05, 2022
  5. Apr 01, 2022
  6. Mar 30, 2022
  7. Mar 28, 2022
  8. Mar 27, 2022
  9. Mar 25, 2022
    • fs/iomap: Fix buffered write page prefaulting · 631f871f
      Andreas Gruenbacher authored
      When part of the user buffer passed to generic_perform_write() or
      iomap_file_buffered_write() cannot be faulted in for reading, the entire
      write currently fails.  The correct behavior would be to write all the
      data that can be written, up to the point of failure.
      
      Commit a6294593 ("iov_iter: Turn iov_iter_fault_in_readable into
      fault_in_iov_iter_readable") gave us the information needed, so fix the
      page prefaulting in generic_perform_write() and iomap_write_iter() to
      only bail out when no pages could be faulted in.
      
      We already factor in that pages that are faulted in may no longer be
      resident by the time they are accessed.  Paging out pages has the same
      effect as not faulting in those pages in the first place, so the code
      can already deal with that.
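
      The bail-out rule can be reduced to a single comparison.  The sketch
      below assumes that the prefault helper reports the number of bytes it
      could not fault in; prefault_status() is a hypothetical name used only
      for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * bytes:       size of the chunk about to be copied from user space
 * not_faulted: bytes of that chunk that could not be faulted in
 *
 * Fail only when nothing at all could be faulted in; a partial prefault
 * lets the write proceed up to the point of failure.
 */
static int prefault_status(size_t bytes, size_t not_faulted)
{
	assert(bytes > 0 && not_faulted <= bytes);
	if (not_faulted == bytes)
		return -EFAULT;
	return 0;
}
```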
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  10. Mar 24, 2022
    • mm: madvise: MADV_DONTNEED_LOCKED · 9457056a
      Johannes Weiner authored
      MADV_DONTNEED historically rejects mlocked ranges, but with MLOCK_ONFAULT
      and MCL_ONFAULT making it possible to mlock without populating, there are
      valid use cases for depopulating locked ranges as well.
      
      Users mlock memory to protect secrets.  There are allocators for secure
      buffers that want in-use memory generally mlocked, but cleared and
      invalidated memory to give up the physical pages.  This could be done with
      explicit munlock -> mlock calls on free -> alloc of course, but that adds
      two unnecessary syscalls, heavy mmap_sem write locks, vma splits and
      re-merges - only to get rid of the backing pages.
      
      Users also mlockall(MCL_ONFAULT) to suppress sustained paging, but are
      okay with on-demand initial population.  It seems valid to selectively
      free some memory during the lifetime of such a process, without having to
      mess with its overall policy.
      
      Why add a separate flag? Isn't this a pretty niche usecase?
      
      - MADV_DONTNEED has been bailing on locked vmas forever. It's at least
        conceivable that someone, somewhere is relying on mlock to protect
        data from perhaps broader invalidation calls. Changing this behavior
        now could lead to quiet data corruption.
      
      - It also clarifies expectations around MADV_FREE and maybe
        MADV_REMOVE. It avoids the situation where one quietly behaves
        differently than the others. MADV_FREE_LOCKED can be added later.
      
      - The combination of mlock() and madvise() in the first place is
        probably niche. But where it happens, I'd say that dropping pages
        from a locked region once they don't contain secrets or won't page
        anymore is much saner than relying on mlock to protect memory from
        speculative or errant invalidation calls. It's just that we can't
        change the default behavior because of the two previous points.
      
      Given that, an explicit new flag seems to make the most sense.
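
      From user space, the new flag would be used roughly as sketched below.
      drop_locked_range() is a hypothetical wrapper, and the fallback value 24
      is the asm-generic definition, supplied here only as an assumption for
      C libraries whose headers predate the flag.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_DONTNEED_LOCKED
#define MADV_DONTNEED_LOCKED 24 /* assumption: asm-generic value */
#endif

/*
 * Drop the backing pages of [addr, addr + len) even if the range is
 * mlocked.  Returns 0 on success or a negative errno; kernels without
 * the flag are expected to report -EINVAL.
 */
static int drop_locked_range(void *addr, size_t len)
{
	assert(addr != NULL && len > 0);
	if (madvise(addr, len, MADV_DONTNEED_LOCKED) != 0)
		return -errno;
	return 0;
}
```

      A secure-buffer allocator could call this when a buffer is freed and
      leave the mlock in place, avoiding the munlock -> mlock round trip
      mentioned above.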
      
      [hannes@cmpxchg.org: fix mips build]
      
      Link: https://lkml.kernel.org/r/20220304171912.305060-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>