  1. Apr 21, 2022
    • oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup · e4a38402
      Nico Pache authored
      The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which
      can be targeted by the oom reaper.  This mapping is used to store the
      futex robust list head; the kernel does not keep a copy of the robust
      list and instead references a userspace address so that robustness can
      be maintained across process death.
      
      A race can occur between exit_mm and the oom reaper that allows the oom
      reaper to free the memory of the futex robust list before the exit path
      has handled the futex death:
      
          CPU1                               CPU2
          --------------------------------------------------------------------
          page_fault
          do_exit "signal"
          wake_oom_reaper
                                              oom_reaper
                                              oom_reap_task_mm (invalidates mm)
          exit_mm
          exit_mm_release
          futex_exit_release
          futex_cleanup
          exit_robust_list
          get_user (EFAULT- can't access memory)
      
      If get_user() EFAULTs, the kernel will be unable to recover the
      waiters on the robust_list, leaving userspace mutexes hung
      indefinitely.
      
      Delay the OOM reaper, allowing more time for the exit path to perform
      the futex cleanup.
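
      For context, the userspace facility at stake is the robust mutex; a
      minimal POSIX illustration (not part of the patch) of what is left
      hanging when the robust list cannot be walked:

        #include <pthread.h>

        int main(void)
        {
                pthread_mutexattr_t attr;
                pthread_mutex_t m;

                pthread_mutexattr_init(&attr);
                /* Ask the kernel to track this mutex on the robust list */
                pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
                pthread_mutex_init(&m, &attr);

                pthread_mutex_lock(&m);
                /* If the owner dies here, the next locker only sees
                 * EOWNERDEAD if exit_robust_list() could read the list
                 * head -- the very access that races with the reaper. */
                pthread_mutex_unlock(&m);
                pthread_mutex_destroy(&m);
                return 0;
        }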
      
      Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
      
      Based on a patch by Michal Hocko.
      
      Link: https://elixir.bootlin.com/glibc/glibc-2.35/source/nptl/allocatestack.c#L370 [1]
      Link: https://lkml.kernel.org/r/20220414144042.677008-1-npache@redhat.com
      Fixes: 21292580 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
      Signed-off-by: Joel Savitz <jsavitz@redhat.com>
      Signed-off-by: Nico Pache <npache@redhat.com>
      Co-developed-by: Joel Savitz <jsavitz@redhat.com>
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Herton R. Krzesinski <herton@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joel Savitz <jsavitz@redhat.com>
      Cc: Darren Hart <dvhart@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: allow for "high" userspace addresses · 5f24d5a5
      Christophe Leroy authored
      This is a fix for commit f6795053 ("mm: mmap: Allow for "high"
      userspace addresses") for hugetlb.
      
      This patch adds support for "high" userspace addresses that are
      optionally supported on the system and have to be requested via a hint
      mechanism ("high" addr parameter to mmap).
      
      Architectures such as powerpc and x86 achieve this by making changes to
      their architectural versions of hugetlb_get_unmapped_area() function.
      However, arm64 uses the generic version of that function.
      
      So take arch_get_mmap_base() and arch_get_mmap_end() into account in
      hugetlb_get_unmapped_area().  To allow that, move those two macros
      out of mm/mmap.c into include/linux/sched/mm.h.
      
      If these macros are not defined by the architecture code, they
      default to (TASK_SIZE) and (base), so they should not introduce any
      behavioural changes to architectures that do not define them.
      
      For the time being, only ARM64 is affected by this change.
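
      A sketch of the fallback definitions described above (guard style
      approximate):

        #ifndef arch_get_mmap_end
        #define arch_get_mmap_end(addr)         (TASK_SIZE)
        #endif

        #ifndef arch_get_mmap_base
        #define arch_get_mmap_base(addr, base)  (base)
        #endif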
      
      Catalin (ARM64) said
       "We should have fixed hugetlb_get_unmapped_area() as well when we added
        support for 52-bit VA. The reason for commit f6795053 was to
        prevent normal mmap() from returning addresses above 48-bit by default
        as some user-space had hard assumptions about this.
      
        It's a slight ABI change if you do this for hugetlb_get_unmapped_area()
        but I doubt anyone would notice. It's more likely that the current
        behaviour would cause issues, so I'd rather have them consistent.
      
        Basically when arm64 gained support for 52-bit addresses we did not
        want user-space calling mmap() to suddenly get such high addresses,
        otherwise we could have inadvertently broken some programs (similar
        behaviour to x86 here). Hence we added commit f6795053. But we
        missed hugetlbfs which could still get such high mmap() addresses. So
        in theory that's a potential regression that should have been addressed
        at the same time as commit f6795053 (and before arm64 enabled
        52-bit addresses)"
      
      Link: https://lkml.kernel.org/r/ab847b6edb197bffdfe189e70fb4ac76bfe79e0d.1650033747.git.christophe.leroy@csgroup.eu
      Fixes: f6795053 ("mm: mmap: Allow for "high" userspace addresses")
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: <stable@vger.kernel.org>	[5.0.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: sync flush only if periodic flush is delayed · 9b301615
      Shakeel Butt authored
      Daniel Dao has reported [1] a regression on workloads that may trigger a
      lot of refaults (anon and file).  The underlying issue is that flushing
      rstat is expensive.  Although rstat flushes are batched per (nr_cpus *
      MEMCG_BATCH) stat updates, there are workloads which genuinely do
      more stat updates than the batch value within a short amount of
      time.  Since the rstat flush can happen in performance-critical
      codepaths like page faults, such workloads can suffer greatly.
      
      This patch fixes this regression by making the rstat flushing
      conditional in the performance-critical codepaths.  More
      specifically, the kernel relies on the async periodic rstat flusher
      to flush the stats, and only if the periodic flusher is delayed by
      more than twice its normal time window does the kernel allow rstat
      flushing from the performance-critical codepaths.
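
      A simplified sketch of the conditional flush (names approximate;
      flush_next_time is assumed to be a timestamp pushed forward by the
      periodic worker):

        static void mem_cgroup_flush_stats_delayed(void)
        {
                /* Flush synchronously only if the periodic flusher
                 * has fallen behind its scheduled window. */
                if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
                        mem_cgroup_flush_stats();
        }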
      
      Now the question: what are the side-effects of this change?  The
      worst that can happen is that the refault codepath will see lruvec
      stats that are up to 4 seconds old and may cause false (or missed)
      activations of the refaulted page, which may under- or overestimate
      the workingset size.  That is not very concerning, as the kernel can
      already miss or do false activations.
      
      There are two more codepaths whose flushing behavior is not changed
      by this patch, and we may need to revisit them in the future.  One
      is the writeback stats used by dirty throttling, and the second is
      the deactivation heuristic in reclaim.  For now we will keep an eye
      on them, and if regressions due to these codepaths are reported, we
      will reevaluate.
      
      Link: https://lore.kernel.org/all/CA+wXwBSyO87ZX5PVwdHm-=dBjZYECGmfnydUicUyrQqndgX2MQ@mail.gmail.com [1]
      Link: https://lkml.kernel.org/r/20220304184040.1304781-1-shakeelb@google.com
      Fixes: 1f828223 ("memcg: flush lruvec stats in the refault")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: Daniel Dao <dqminh@cloudflare.com>
      Tested-by: Ivan Babrou <ivan@cloudflare.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Frank Hofmann <fhofmann@cloudflare.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb() · 405ce051
      Naoya Horiguchi authored
      There is a race condition between memory_failure_hugetlb() and hugetlb
      free/demotion, which causes setting PageHWPoison flag on the wrong page.
      One simple consequence is that the wrong processes can be killed,
      but a more serious one is that the actual error is left unhandled,
      so nothing prevents later access to it, which might lead to worse
      results like consuming corrupted data.
      
      Think about the below race window:
      
        CPU 1                                   CPU 2
        memory_failure_hugetlb
        struct page *head = compound_head(p);
                                                hugetlb page might be freed to
                                                buddy, or even changed to another
                                                compound page.
      
        get_hwpoison_page -- page is not what we want now...
      
      The current code does rough prechecks first and then reconfirms
      after taking a refcount, but this turned out to make the code overly
      complicated, so move the prechecks into a single hugetlb_lock range.
      
      A newly introduced function, try_memory_failure_hugetlb(), always
      takes hugetlb_lock (even for non-hugetlb pages).  That can be
      improved, but memory_failure() is rare in principle, so it should
      not be a big problem.
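
      A simplified sketch of the reordering (illustrative, not the exact
      helper):

        spin_lock_irq(&hugetlb_lock);
        if (PageHuge(p) && head == compound_head(p)) {
                /* The page cannot be freed or demoted while
                 * hugetlb_lock is held, so the precheck and the
                 * refcount bump are atomic with respect to both. */
                ret = get_page_unless_zero(head);
        }
        spin_unlock_irq(&hugetlb_lock);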
      
      Link: https://lkml.kernel.org/r/20220408135323.1559401-2-naoya.horiguchi@linux.dev
      Fixes: 761ad8d7 ("mm: hwpoison: introduce memory_failure_hugetlb()")
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. Apr 19, 2022
    • vmalloc: replace VM_NO_HUGE_VMAP with VM_ALLOW_HUGE_VMAP · 559089e0
      Song Liu authored
      Huge page backed vmalloc memory could benefit performance in many cases.
      However, some users of vmalloc may not be ready to handle huge pages for
      various reasons: hardware constraints, potential pages split, etc.
      VM_NO_HUGE_VMAP was introduced to allow vmalloc users to opt out of
      huge pages.  However, it is not easy to track down all the users
      that require the opt-out, as the allocations are passed down
      different call stacks and may cause issues in different layers.
      
      To address this issue, replace VM_NO_HUGE_VMAP with an opt-in flag,
      VM_ALLOW_HUGE_VMAP, so that users that benefit from huge pages can
      ask for them specifically.
      
      Also, remove vmalloc_no_huge() and add the opt-in helper vmalloc_huge().
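
      Illustrative usage after this change:

        /* A caller that can cope with huge mappings opts in: */
        void *buf = vmalloc_huge(8 * 1024 * 1024, GFP_KERNEL);

        /* Plain vmalloc() users can no longer get huge pages: */
        void *small = vmalloc(64 * 1024);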
      
      Fixes: fac54e2b ("x86/Kconfig: Select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP")
      Link: https://lore.kernel.org/netdev/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/
      
      "
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      Reviewed-b...
    • fs: fix acl translation · 705191b0
      Christian Brauner authored
      Last cycle we extended the idmapped mounts infrastructure to support
      idmapped mounts of idmapped filesystems (no such filesystem exists yet).
      Since then, the meaning of an idmapped mount is a mount whose idmapping
      is different from the filesystems idmapping.
      
      While doing that work we missed adapting the acl translation
      helpers.  They still assume that checking for the identity mapping
      is enough, but they need to use the no_idmapping() helper instead.
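
      For reference, a simplified sketch of that helper as introduced by
      the idmapped-mounts work:

        static inline bool no_idmapping(const struct user_namespace *mnt_userns,
                                        const struct user_namespace *fs_userns)
        {
                /* A mount is idmapped iff its idmapping differs from
                 * the filesystem's idmapping. */
                return mnt_userns == fs_userns;
        }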
      
      Note, POSIX ACLs are always translated right at the userspace-kernel
      boundary using the caller's current idmapping and the initial idmapping.
      The order depends on whether we're coming from or going to userspace.
      The filesystem's idmapping doesn't matter at the border.
      
      Consequently, if a non-idmapped mount is passed, we need to make
      sure to always pass the initial idmapping as the mount's idmapping
      and not the filesystem's idmapping.  Since the filesystem's
      idmapping is irrelevant at this boundary, passing it would yield
      invalid ids and prevent setting acls on filesystems that are
      mountable in a userns and support posix acls (tmpfs and fuse).
      
      I verified the regression reported in [1] and verified that this patch
      fixes it.  A regression test will be added to xfstests in parallel.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=215849 [1]
      Fixes: bd303368 ("fs: support mapped mounts of mapped filesystems")
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: <stable@vger.kernel.org> # 5.17
      Cc: <regressions@lists.linux.dev>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. Apr 13, 2022
    • vfio/pci: Fix vf_token mechanism when device-specific VF drivers are used · 1ef3342a
      Jason Gunthorpe authored
      get_pf_vdev() tries to check if a PF is a VFIO PF by looking at the driver:
      
             if (pci_dev_driver(physfn) != pci_dev_driver(vdev->pdev)) {
      
      However now that we have multiple VF and PF drivers this is no longer
      reliable.
      
      This means that security tests related to vf_token can be skipped by
      mixing and matching different VFIO PCI drivers.
      
      Instead of trying to use the driver core to find the PF devices,
      maintain a linked list of all PF vfio_pci_core_devices on which we
      have called pci_enable_sriov().
      
      When registering a VF just search the list to see if the PF is present and
      record the match permanently in the struct. PCI core locking prevents a PF
      from passing pci_disable_sriov() while VF drivers are attached so the VFIO
      owned PF becomes a static property of the VF.
      
      In common cases where vfio does not own the PF the global list remains
      empty and the VF's pointer is statically NULL.
      
      This also fixes a lockdep splat from recursive locking of the
      vfio_group::device_lock between vfio_device_get_from_name() and
      vfio_device_get_from_dev(). If the VF and PF share the same group this
      would deadlock.
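
      A sketch of the list-based tracking described above (names
      approximate):

        static LIST_HEAD(vfio_pci_sriov_pfs);
        static DEFINE_MUTEX(vfio_pci_sriov_pfs_mutex);

        /* When a PF calls pci_enable_sriov(): */
        mutex_lock(&vfio_pci_sriov_pfs_mutex);
        list_add_tail(&vdev->sriov_pfs_item, &vfio_pci_sriov_pfs);
        mutex_unlock(&vfio_pci_sriov_pfs_mutex);

        /* When a VF registers, walk the list to see whether its PF is
         * VFIO-owned and record the result permanently in the struct. */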
      
      Fixes: ff53edf6 ("vfio/pci: Split the pci_driver code out of vfio_pci_core.c")
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
      Link: https://lore.kernel.org/r/0-v3-876570980634+f2e8-vfio_vf_token_jgg@nvidia.com
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
    • random: make random_get_entropy() return an unsigned long · b0c3e796
      Jason A. Donenfeld authored
      
      Some implementations were returning type `unsigned long`, while others
      that fell back to get_cycles() were implicitly returning a `cycles_t` or
      an untyped constant int literal. That makes for weird and confusing
      code, and basically all code in the kernel already handled it like it
      was an `unsigned long`. I recently tried to handle it as the largest
      type it could be, a `cycles_t`, but doing so doesn't really help with
      much.
      
      Instead let's just make random_get_entropy() return an unsigned long all
      the time. This also matches the commonly used `arch_get_random_long()`
      function, so now RDRAND and RDTSC return the same sized integer, which
      means one can fallback to the other more gracefully.
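
      The generic fallback then reduces to (as in the timex.h fallback
      after this change):

        #define random_get_entropy()    ((unsigned long)get_cycles())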
      
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    • esp: limit skb_page_frag_refill use to a single page · 5bd8baab
      Sabrina Dubroca authored
      Commit ebe48d36 ("esp: Fix possible buffer overflow in ESP
      transformation") tried to fix skb_page_frag_refill usage in ESP by
      capping allocsize to 32k, but that doesn't completely solve the issue,
      as skb_page_frag_refill may return a single page. If that happens, we
      will write out of bounds, despite the check introduced in the previous
      patch.
      
      This patch forces COW in cases where we would end up calling
      skb_page_frag_refill with a size larger than a page (first in
      esp_output_head with tailen, then in esp_output_tail with
      skb->data_len).
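
      Conceptually, the guard looks like this (hypothetical names, not
      the exact patch):

        /* Take the page-frag fast path only when one page suffices;
         * otherwise fall back to the COW / skb_cow_data() path. */
        if (allocsize <= PAGE_SIZE &&
            skb_page_frag_refill(allocsize, pfrag, gfp))
                use_page_frag(skb, pfrag);      /* hypothetical helper */
        else
                err = skb_cow_data(skb, tailen, &trailer);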
      
      Fixes: cac2661c ("esp4: Avoid skb_cow_data whenever possible")
      Fixes: 03e2a30f ("esp6: Avoid skb_cow_data whenever possible")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
    • ALSA: memalloc: Add fallback SG-buffer allocations for x86 · 925ca893
      Takashi Iwai authored
      The recent change to the memory allocator replaced the SG-buffer
      handling helper for x86 with the standard non-contiguous page
      handler.  This works for most cases, but there is a corner case I
      obviously overlooked, namely, the fallback of the non-contiguous
      handler without IOMMU.  When the system runs without IOMMU, the core
      handler tries to use contiguous pages with a single SGL entry.  It
      works nicely for most cases, but when the system memory gets
      fragmented, the large allocation may fail frequently.
      
      Ideally the non-contig handler could deal with proper SG pages, but
      it's cumbersome to extend for now.  As a workaround, add new types
      for (minimalistic) SG allocations instead, so that the allocator
      falls back to those types automatically when the allocation with
      the standard API fails.
      
      BTW, one minor improvement over the previous SG-buffer code is that
      this provides proper mmap support without the PCM's page fault
      handling.
      
      Fixes: 2c95b92e ("ALSA: memalloc: Unify x86 SG-buffer handling (take#3)")
      BugLink: https://gitlab.freedesktop.org/pipewire/pipewire/-/issues/2272
      BugLink: https://bugzilla.suse.com/show_bug.cgi?id=1198248
      Cc: <stable@vger.kernel.org>
      Link: https://lore.kernel.org/r/20220413054808.7547-1-tiwai@suse.de

      Signed-off-by: Takashi Iwai <tiwai@suse.de>
  4. Apr 12, 2022
    • asm-generic: fix __get_unaligned_be48() on 32 bit platforms · b9768752
      Alexander Lobakin authored
      While testing the new macros for working with 48 bit containers,
      I faced a weird problem:
      
      32 + 16: 0x2ef6e8da 0x79e60000
      48: 0xffffe8da + 0x79e60000
      
      All the bits starting from the 32nd were getting set to 1 in 9 out
      of 10 cases.  The debug output showed:
      
      p[0]: 0x00002e0000000000
      p[1]: 0x00002ef600000000
      p[2]: 0xffffffffe8000000
      p[3]: 0xffffffffe8da0000
      p[4]: 0xffffffffe8da7900
      p[5]: 0xffffffffe8da79e6
      
      that the value becomes garbage after the third OR, i.e. on
      `p[2] << 24`.
      When the 31st bit is 1 and there's no explicit cast to an unsigned
      type, the operand is treated as a signed int and gets sign-extended
      on the OR, so `e8000000` becomes `ffffffffe8000000` and messes up
      the result.
      Cast the @p[2] to u64 as well to avoid this. Now:
      
      32 + 16: 0x7ef6a490 0xddc10000
      48: 0x7ef6a490 + 0xddc10000
      
      p[0]: 0x00007e0000000000
      p[1]: 0x00007ef600000000
      p[2]: 0x00007ef6a4000000
      p[3]: 0x00007ef6a4900000
      p[4]: 0x00007ef6a490dd00
      p[5]: 0x00007ef6a490ddc1
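
      The fixed accessor then reads (the u64 cast on p[2] stops the sign
      extension):

        static inline u64 __get_unaligned_be48(const u8 *p)
        {
                return (u64)p[0] << 40 | (u64)p[1] << 32 |
                       (u64)p[2] << 24 | p[3] << 16 | p[4] << 8 | p[5];
        }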
      
      Fixes: c2ea5fcf ("asm-generic: introduce be48 unaligned accessors")
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Link: https://lore.kernel.org/r/20220412215220.75677-1-alobakin@pm.me

      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • ALSA: core: Add snd_card_free_on_error() helper · fee2b871
      Takashi Iwai authored
      This is a small helper function to handle the error path more
      easily when an error happens during the probe of a device with a
      device-managed card.  Since devres releases resources in the
      reverse order of their creation, snd_card_free() usually gets
      called last in the probe error path, unless the probe already
      reached the snd_card_register() call.  Due to this, when a driver
      expects resource releases in card->private_free, the callback might
      be invoked too late.
      
      As a workaround, one should call the probe like:
      
       static int __some_probe(...) { // do real probe.... }
      
       static int some_probe(...)
       {
      	return snd_card_free_on_error(dev, __some_probe(dev, ...));
       }
      
      so that the snd_card_free() is called explicitly at the beginning of
      the error path from the probe.
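
      A simplified sketch of what the helper does (the exact devres
      lookup may differ):

        int snd_card_free_on_error(struct device *dev, int ret)
        {
                struct snd_card *card;

                if (!ret)
                        return 0;       /* probe succeeded, keep devres */
                card = devres_find(dev, __snd_card_release, NULL, NULL);
                if (card)
                        snd_card_free(card);
                return ret;
        }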
      
      This function will be used in the upcoming fixes to address the
      regressions by devres usages.
      
      Fixes: e8ad415b ("ALSA: core: Add managed card creation")
      Cc: <stable@vger.kernel.org>
      Link: https://lore.kernel.org/r/20220412093141.8008-2-tiwai@suse.de

      Signed-off-by: Takashi Iwai <tiwai@suse.de>
  5. Apr 10, 2022
    • io_uring: flag the fact that linked file assignment is sane · c4212f3e
      Jens Axboe authored
      Give applications a way to tell if the kernel supports sane linked files,
      as in files being assigned at the right time to be able to reliably
      do <open file direct into slot X><read file from slot X> while using
      IOSQE_IO_LINK to order them.
      
      Not really a bug fix, but flag it as such so that it gets pulled in with
      backports of the deferred file assignment.
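
      An illustrative liburing check for the new flag (feature name as
      introduced here):

        struct io_uring ring;
        struct io_uring_params p = { };

        if (io_uring_queue_init_params(8, &ring, &p) == 0 &&
            (p.features & IORING_FEAT_LINKED_FILE)) {
                /* open-direct -> read-from-slot links are reliable */
        }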
      
      Fixes: 6bf9c47a ("io_uring: defer file assignment")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. Apr 07, 2022
    • NFS: Ensure rpc_run_task() cannot fail in nfs_async_rename() · 88dee0cc
      Trond Myklebust authored
      Ensure the call to rpc_run_task() cannot fail by preallocating the
      rpc_task.
      
      Fixes: 910ad386 ("NFS: Fix memory allocation in rpc_alloc_task()")
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
    • SUNRPC: Ensure we flush any closed sockets before xs_xprt_free() · f0043206
      Trond Myklebust authored
      
      We must ensure that all sockets are closed before we call xprt_free()
      and release the reference to the net namespace. The problem is that
      calling fput() will defer closing the socket until delayed_fput() gets
      called.
      Let's fix the situation by allowing rpciod and the transport teardown
      code (which runs on the system wq) to call __fput_sync(), and directly
      close the socket.
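
      Conceptually (illustrative, not the exact patch):

        /* In the transport teardown path, where a deferred close could
         * outlive the reference to the net namespace: */
        if (sock->file)
                __fput_sync(sock->file); /* close now, not in delayed_fput() */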
      
      Reported-by: Felix Fu <foyjog@gmail.com>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Fixes: a73881c9 ("SUNRPC: Fix an Oops in udp_poll()")
      Cc: stable@vger.kernel.org # 5.1.x: 3be232f1: SUNRPC: Prevent immediate close+reconnect
      Cc: stable@vger.kernel.org # 5.1.x: 89f42494: SUNRPC: Don't call connect() more than once on a TCP socket
      Cc: stable@vger.kernel.org # 5.1.x
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
    • SUNRPC: Fix the svc_deferred_event trace class · 4d500445
      Chuck Lever authored
      Fix a NULL deref crash that occurs when an svc_rqst is deferred
      while the sunrpc tracing subsystem is enabled. svc_revisit() sets
      dr->xprt to NULL, so it can't be relied upon in the tracepoint to
      provide the remote's address.
      
      Unfortunately we can't revert the "svc_deferred_class" hunk in
      commit ece200dd ("sunrpc: Save remote presentation address in
      svc_xprt for trace events") because there is now a specific check
      of event format specifiers for unsafe dereferences. The warning
      that check emits is:
      
        event svc_defer_recv has unsafe dereference of argument 1
      
      A "%pISpc" format specifier with a "struct sockaddr *" is indeed
      flagged by this check.
      
      Instead, take the brute-force approach used by the svcrdma_qp_error
      tracepoint. Convert the dr::addr field into a presentation address
      in the TP_fast_assign() arm of the trace event, and store that as
      a string. This fix can be backported to -stable kernels.
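
      A sketch of that conversion, mirroring svcrdma_qp_error
      (simplified):

        TP_STRUCT__entry(
                __field(const void *, dr)
                __array(__u8, addr, INET6_ADDRSTRLEN + 10)
        ),

        TP_fast_assign(
                __entry->dr = dr;
                snprintf(__entry->addr, sizeof(__entry->addr) - 1,
                         "%pISpc", (struct sockaddr *)&dr->addr);
        ),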
      
      In the meantime, commit c6ced229 ("tracing: Update print fmt
      check to handle new __get_sockaddr() macro") is now in v5.18, so
      this wonky fix can be replaced with __sockaddr() and friends
      properly during the v5.19 merge window.
      
      Fixes: ece200dd ("sunrpc: Save remote presentation address in svc_xprt for trace events")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
    • mm: Add vma_alloc_folio() · f584b680
      Matthew Wilcox (Oracle) authored
      
      This wrapper around alloc_pages_vma() calls prep_transhuge_page(),
      removing the obligation from the caller.  This is in the same spirit
      as __folio_alloc().
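
      Illustrative calls (signature as added by this patch):

        /* order-0 allocation at a user address: */
        folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr, false);

        /* THP-sized allocation; prep_transhuge_page() is now done
         * inside the wrapper: */
        folio = vma_alloc_folio(GFP_TRANSHUGE, HPAGE_PMD_ORDER, vma,
                                haddr, true);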
      
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
  7. Apr 06, 2022
    • SUNRPC: Fix NFSD's request deferral on RDMA transports · 773f91b2
      Chuck Lever authored
      Trond Myklebust reports an NFSD crash in svc_rdma_sendto(). Further
      investigation shows that the crash occurred while NFSD was handling
      a deferred request.
      
      This patch addresses two inter-related issues that prevent request
      deferral from working correctly for RPC/RDMA requests:
      
      1. Prevent the crash by ensuring that the original
         svc_rqst::rq_xprt_ctxt value is available when the request is
         revisited. Otherwise svc_rdma_sendto() does not have a Receive
         context available with which to construct its reply.
      
      2. Possibly since before commit 71641d99 ("svcrdma: Properly
         compute .len and .buflen for received RPC Calls"),
         svc_rdma_recvfrom() did not include the transport header in the
         returned xdr_buf. There should have been no need for svc_defer()
         and friends to save and restore that header, as of that commit.
         This issue is addressed in a backport-friendly way by simply
         having svc_rdma_recvfrom() set rq_xprt_hlen to zero
         unconditionally, just as svc_tcp_recvfrom() does. This enables
         svc_deferred_recv() to correctly reconstruct an RPC message
         received via RPC/RDMA.
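
      A sketch of both fixes (field name approximate):

        /* 1. svc_defer(): stash the Receive context with the
         *    deferred request, and restore it on revisit: */
        dr->xprt_ctxt = rqstp->rq_xprt_ctxt;
        ...
        rqstp->rq_xprt_ctxt = dr->xprt_ctxt;

        /* 2. svc_rdma_recvfrom() now always reports no transport
         *    header in the xdr_buf: */
        rqstp->rq_xprt_hlen = 0;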
      
      Reported-by: Trond Myklebust <trondmy@hammerspace.com>
      Link: https://lore.kernel.org/linux-nfs/82662b7190f26fb304eb0ab1bb04279072439d4e.camel@hammerspace.com/

      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Cc: <stable@vger.kernel.org>
    • tlb: hugetlb: Add more sizes to tlb_remove_huge_tlb_entry · 697a1d44
      Steve Capper authored
      
      tlb_remove_huge_tlb_entry only considers PMD_SIZE and PUD_SIZE when
      updating the mmu_gather structure.
      
      Unfortunately on arm64 there are two additional huge page sizes
      that need to be covered: CONT_PTE_SIZE and CONT_PMD_SIZE.  When an
      end-user attempts to employ contiguous huge pages, a VM_BUG_ON can
      be triggered because the tlb structure hasn't been correctly
      updated by the relevant tlb_flush_p.._range() call from
      tlb_remove_huge_tlb_entry.
      
      This patch adds inequality logic to the generic implementation of
      tlb_remove_huge_tlb_entry so that CONT_PTE_SIZE and CONT_PMD_SIZE
      are effectively covered on arm64.  In addition to ptes, pmds and
      puds, p4ds are now considered too.
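
      The generic macro then becomes, in sketch form:

        #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)        \
                do {                                                    \
                        unsigned long _sz = huge_page_size(h);          \
                        if (_sz >= P4D_SIZE)                            \
                                tlb_flush_p4d_range(tlb, address, _sz); \
                        else if (_sz >= PUD_SIZE)                       \
                                tlb_flush_pud_range(tlb, address, _sz); \
                        else if (_sz >= PMD_SIZE)                       \
                                tlb_flush_pmd_range(tlb, address, _sz); \
                        else                                            \
                                tlb_flush_pte_range(tlb, address, _sz); \
                        __tlb_remove_tlb_entry(tlb, ptep, address);     \
                } while (0)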
      
      Reported-by: David Hildenbrand <david@redhat.com>
      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Link: https://lore.kernel.org...