  1. Mar 22, 2022
    • drivers/base/node: consolidate node device subsystem initialization in node_dev_init() · 2848a28b
      David Hildenbrand authored
      ...  and call node_dev_init() after memory_dev_init() from driver_init(),
      so before any of the existing arch/subsys calls.  All online nodes should
      be known at that point: early during boot, arch code determines node and
      zone ranges and sets the relevant nodes online; usually this happens in
      setup_arch().
      
      This is in line with memory_dev_init(), which initializes the memory
      device subsystem and creates all memory block devices.
      
      Similar to memory_dev_init(), panic() if anything goes wrong: we don't
      want to continue with such basic initialization errors.
      
      The important part is that node_dev_init() gets called after
      memory_dev_init() and after cpu_dev_init(), but before any of the relevant
      archs call register_cpu() to register the new cpu device under the node
      device.  The latter should be the case for the current users of
      topology_init().
      
      Link: https://lkml.kernel.org/r/20220203105212.30385-1-david@redhat.com
      
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Tested-by: Anatoly Pugachev <matorola@gmail.com> (sparc64)
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2848a28b
    • mm, memory_hotplug: drop arch_free_nodedata · 390511e1
      Michal Hocko authored
      Prior to "mm: handle uninitialized numa nodes gracefully", memory
      hotplug used to allocate the pgdat when memory was added to a node
      (hotadd_init_pgdat).  arch_free_nodedata was only used in the failure
      path, because once the pgdat is exported (i.e. visible via
      NODE_DATA(nid)) it cannot really be freed: there is no synchronization
      available for that.
      
      pgdat is now allocated for each possible node, so memory hotplug never
      needs to use arch_free_nodedata; drop it.
      
      This patch doesn't introduce any functional change.
      
      Link: https://lkml.kernel.org/r/20220127085305.20890-4-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Rafael Aquini <raquini@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alexey Makhalov <amakhalov@vmware.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Nico Pache <npache@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      390511e1
    • mm: handle uninitialized numa nodes gracefully · 09f49dca
      Michal Hocko authored
      We have had several reports [1][2][3] that page allocator blows up when an
      allocation from a possible node is requested.  The underlying reason is
      that NODE_DATA for the specific node is not allocated.
      
      NUMA specific initialization is arch specific and it can vary a lot.
      E.g.  x86 tries to initialize all nodes that have some cpu affinity
      (see init_cpu_to_node) but this can be insufficient because, for
      example, the node might be cpuless.
      
      One way to address this problem would be to check for !node_online
      nodes when trying to get a zonelist and silently fall back to another
      node.  Unfortunately, that adds a branch to the allocator hot path
      and it doesn't handle any other potential NODE_DATA users.
      
      This patch takes a different approach (following the lead of [3]) and
      preallocates pgdat for all possible nodes in arch-independent code, in
      free_area_init().  All uninitialized nodes are treated as memoryless
      nodes.  node_state of the node is not changed because that would lead
      to other side effects - e.g.  a sysfs representation of such a node -
      and from past discussions [4] it is known that some tools might have
      problems digesting that.
      
      Newly allocated pgdat only gets a minimal initialization and the rest of
      the work is expected to be done by the memory hotplug - hotadd_new_pgdat
      (renamed to hotadd_init_pgdat).
      
      generic_alloc_nodedata is changed to use the memblock allocator because
      neither page nor slab allocators are available at the stage when all
      pgdats are allocated.  Hotplug doesn't allocate pgdat anymore so we can
      use the early boot allocator.  The only arch specific implementation is
      ia64 and that is changed to use the early allocator as well.
      
      [1] http://lkml.kernel.org/r/20211101201312.11589-1-amakhalov@vmware.com
      [2] http://lkml.kernel.org/r/20211207224013.880775-1-npache@redhat.com
      [3] http://lkml.kernel.org/r/20190114082416.30939-1-mhocko@kernel.org
      [4] http://lkml.kernel.org/r/20200428093836.27190-1-srikar@linux.vnet.ibm.com
      
      [akpm@linux-foundation.org: replace comment, per Mike]
      
      Link: https://lkml.kernel.org/r/Yfe7RBeLCijnWBON@dhcp22.suse.cz
      
      
      Reported-by: Alexey Makhalov <amakhalov@vmware.com>
      Tested-by: Alexey Makhalov <amakhalov@vmware.com>
      Reported-by: Nico Pache <npache@redhat.com>
      Acked-by: Rafael Aquini <raquini@redhat.com>
      Tested-by: Rafael Aquini <raquini@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09f49dca
    • mm, memory_hotplug: make arch_alloc_nodedata independent on CONFIG_MEMORY_HOTPLUG · e930d999
      Michal Hocko authored
      Patch series "mm, memory_hotplug: handle uninitialized numa node gracefully".
      
      The core of the fix is patch 2 which also links existing bug reports.  The
      high level goal is to have all possible numa nodes have their pgdat
      allocated and initialized so
      
      	for_each_possible_node(nid)
      		NODE_DATA(nid)
      
      will never return garbage.  This has proven to be a problem in several
      places: an offline numa node gets used for an allocation, only for the
      caller to find that node_data, and therefore the allocation fallback
      zonelists, are not initialized, so the allocation request blows up.
      
      There were attempts to address that by checking node_online in several
      places including the page allocator.  This patchset approaches the problem
      from a different perspective and instead of special casing, which just
      adds a runtime overhead, it allocates pglist_data for each possible node.
      This can add some memory overhead for platforms with a high number of
      possible nodes that do not contain any memory.  This should be a
      rather rare configuration though.
      
      How to test this?  David has provided an excellent howto:
      http://lkml.kernel.org/r/6e5ebc19-890c-b6dd-1924-9f25c441010d@redhat.com
      
      Patches 1 and 3-6 are mostly cleanups.  The patchset has been reviewed by
      Rafael (thanks!) and the core fix tested by Rafael and Alexey (thanks to
      both).  David has tested as per instructions above and hasn't found any
      fallouts in the memory hotplug scenarios.
      
      This patch (of 6):
      
      This is a preparatory patch and it doesn't introduce any functional
      change.  It merely pulls out arch_alloc_nodedata (and co) outside of
      CONFIG_MEMORY_HOTPLUG because the following patch will need to call this
      from the generic MM code.
      
      Link: https://lkml.kernel.org/r/20220127085305.20890-1-mhocko@kernel.org
      Link: https://lkml.kernel.org/r/20220127085305.20890-2-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Rafael Aquini <raquini@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Cc: Alexey Makhalov <amakhalov@vmware.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Nico Pache <npache@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e930d999
    • powerpc/fadump: opt out from freeing pages on cma activation failure · ee97347f
      Hari Bathini authored
      With commit a4e92ce8 ("powerpc/fadump: Reservationless firmware
      assisted dump"), Linux kernel's Contiguous Memory Allocator (CMA) based
      reservation was introduced in fadump.  That change was aimed at using CMA
      to let applications utilize the memory reserved for fadump while blocking
      it from being used for kernel pages.  The assumption was that even if
      CMA activation fails for whatever reason, the memory still remains
      reserved, preventing it from being used for kernel pages.  But commit 072355c1
      ("mm/cma: expose all pages to the buddy if activation of an area fails")
      breaks this assumption, as it started exposing all pages to the buddy
      allocator on CMA activation failure.  This led to warning messages
      like the one below while running crash-utility on the vmcore of a
      kernel containing the above two commits:
      
        crash: seek error: kernel virtual address: <from reserved region>
      
      To fix this problem, opt out from exposing pages to the buddy
      allocator on CMA activation failure for fadump reserved memory.
      
      Link: https://lkml.kernel.org/r/20220117075246.36072-3-hbathini@linux.ibm.com
      
      
      Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ee97347f
    • mm/hugetlb: generalize ARCH_WANT_GENERAL_HUGETLB · 07431506
      Anshuman Khandual authored
      The ARCH_WANT_GENERAL_HUGETLB config has duplicate definitions on
      platforms that subscribe to it.  Instead, make it a generic config
      option which can be selected on applicable platforms when required.
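A sketch of the resulting Kconfig shape (the exact file placement and the example platform are assumptions for illustration): the option is defined once generically, and subscribing platforms simply select it instead of duplicating the definition:

```kconfig
# Defined once in generic code (e.g. mm/Kconfig):
config ARCH_WANT_GENERAL_HUGETLB
	bool

# A subscribing platform (e.g. arch/x86/Kconfig) then only selects it:
config X86
	select ARCH_WANT_GENERAL_HUGETLB
```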
      
      Link: https://lkml.kernel.org/r/1643718465-4324-1-git-send-email-anshuman.khandual@arm.com
      
      
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      07431506
    • mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler · d1fe111f
      luofei authored
      When the hwpoison page meets the filter conditions, it should not be
      regarded as successful memory_failure() processing by the mce handler,
      but should return a distinct value; otherwise the mce handler regards
      the error page as having been identified and isolated, which may lead
      to calling set_mce_nospec() to change page attributes, etc.
      
      Here memory_failure() returns -EOPNOTSUPP to indicate that the error
      event is filtered; the mce handler should not take any action in this
      situation, and the hwpoison injector should treat it as handled
      correctly.
      
      Link: https://lkml.kernel.org/r/20220223082135.2769649-1-luofei@unicloud.com
      
      
      Signed-off-by: luofei <luofei@unicloud.com>
      Acked-by: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony....
      d1fe111f
    • arch/x86/mm/numa: Do not initialize nodes twice · 1ca75fa7
      Oscar Salvador authored
      On x86, prior to ("mm: handle uninitialized numa nodes gracefully"),
      NUMA nodes could be allocated at three different places.
      
       - numa_register_memblks
       - init_cpu_to_node
       - init_gi_nodes
      
      All these calls happen at setup_arch, and have the following order:
      
      setup_arch
        ...
        x86_numa_init
         numa_init
          numa_register_memblks
        ...
        init_cpu_to_node
         init_memory_less_node
          alloc_node_data
          free_area_init_memoryless_node
        init_gi_nodes
         init_memory_less_node
          alloc_node_data
          free_area_init_memoryless_node
      
      numa_register_memblks() is only interested in those nodes which have
      memory, so it skips over any memoryless node it finds.  Later on, when
      we have read ACPI's SRAT table, we call init_cpu_to_node() and
      init_gi_nodes(), which initialize any memoryless nodes we might have
      that have either CPU or Initiator affinity, meaning we allocate a
      pg_data_t struct for them and mark them as ONLINE.
      
      So far so good, but the thing is that after ("mm: handle uninitialized
      numa nodes gracefully"), we allocate all possible NUMA nodes in
      free_area_init(), meaning we have a picture like the following:
      
      setup_arch
        x86_numa_init
         numa_init
          numa_register_memblks  <-- allocate non-memoryless node
        x86_init.paging.pagetable_init
         ...
          free_area_init
           free_area_init_memoryless <-- allocate memoryless node
        init_cpu_to_node
         alloc_node_data             <-- allocate memoryless node with CPU
         free_area_init_memoryless_node
        init_gi_nodes
         alloc_node_data             <-- allocate memoryless node with Initiator
         free_area_init_memoryless_node
      
      free_area_init() already allocates all possible NUMA nodes, but
      init_cpu_to_node() and init_gi_nodes() are clueless about that, so they
      go ahead and allocate a new pg_data_t struct without checking anything,
      meaning we end up allocating twice.
      
      It should be made clear that this only happens when a memoryless NUMA
      node happens to have CPU/Initiator affinity.
      
      So get rid of init_memory_less_node() and just set the node online.
      
      Note that setting the node online is needed, otherwise we choke down the
      chain when bringup_nonboot_cpus() ends up calling
      __try_online_node()->register_one_node()->...  and we blow up in
      bus_add_device().  As can be seen here:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000060
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 0 P4D 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.0-rc4-1-default+ #45
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/4
        RIP: 0010:bus_add_device+0x5a/0x140
        Code: 8b 74 24 20 48 89 df e8 84 96 ff ff 85 c0 89 c5 75 38 48 8b 53 50 48 85 d2 0f 84 bb 00 004
        RSP: 0000:ffffc9000022bd10 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffff888100987400 RCX: ffff8881003e4e19
        RDX: ffff8881009a5e00 RSI: ffff888100987400 RDI: ffff888100987400
        RBP: 0000000000000000 R08: ffff8881003e4e18 R09: ffff8881003e4c98
        R10: 0000000000000000 R11: ffff888100402bc0 R12: ffffffff822ceba0
        R13: 0000000000000000 R14: ffff888100987400 R15: 0000000000000000
        FS:  0000000000000000(0000) GS:ffff88853fc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000060 CR3: 000000000200a001 CR4: 00000000001706b0
        Call Trace:
         device_add+0x4c0/0x910
         __register_one_node+0x97/0x2d0
         __try_online_node+0x85/0xc0
         try_online_node+0x25/0x40
         cpu_up+0x4f/0x100
         bringup_nonboot_cpus+0x4f/0x60
         smp_init+0x26/0x79
         kernel_init_freeable+0x130/0x2f1
         kernel_init+0x17/0x150
         ret_from_fork+0x22/0x30
      
      The reason is simple: by the time bringup_nonboot_cpus() gets called,
      we have not registered the node_subsys bus yet, so we crash when
      bus_add_device() tries to dereference bus()->p.
      
      The following shows the order of the calls:
      
      kernel_init_freeable
       smp_init
        bringup_nonboot_cpus
         ...
           bus_add_device()      <- we did not register node_subsys yet
       do_basic_setup
        do_initcalls
         postcore_initcall(register_node_type);
          register_node_type
           subsys_system_register
            subsys_register
             bus_register         <- register node_subsys bus
      
      Why does setting the node online save us, then?  Simply because
      __try_online_node() backs off when the node is online, meaning we do
      not end up calling register_one_node() in the first place.
      
      This is subtle and broken, and deserves deep analysis and thought
      about how to put it into shape, but for now let us have this easy fix
      for the memory leak issue.
      
      [osalvador@suse.de: add comments]
        Link: https://lkml.kernel.org/r/20220221142649.3457-1-osalvador@suse.de
      
      Link: https://lkml.kernel.org/r/20220218224302.5282-2-osalvador@suse.de
      
      
      Fixes: da4490c958ad ("mm: handle uninitialized numa nodes gracefully")
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Rafael Aquini <raquini@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Alexey Makhalov <amakhalov@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1ca75fa7
    • cma: factor out minimum alignment requirement · e16faf26
      David Hildenbrand authored
      Patch series "mm: enforce pageblock_order < MAX_ORDER".
      
      Having pageblock_order >= MAX_ORDER can happen in corner cases, and
      some parts of the kernel are not prepared for it.
      
      For example, Aneesh has shown [1] that such kernels can be compiled on
      ppc64 with 64k base pages by setting FORCE_MAX_ZONEORDER=8, which will
      run into a WARN_ON_ONCE(order >= MAX_ORDER) in compaction code right
      during boot.
      
      We can get pageblock_order >= MAX_ORDER when the default hugetlb size is
      bigger than the maximum allocation granularity of the buddy, in which
      case we are no longer talking about huge pages but instead gigantic
      pages.
      
      Having pageblock_order >= MAX_ORDER can only make alloc_contig_range()
      of such gigantic pages more likely to succeed.
      
      Reliable use of gigantic pages either requires boot time allocation or
      CMA; there is no need to overcomplicate some places in the kernel to
      optimize for corner cases that are broken in other areas of the
      kernel.
      
      This patch (of 2):
      
      Let's enforce pageblock_order < MAX_ORDER and simplify.
      
      Especially patch #1 can be regarded as a cleanup before:
      	[PATCH v5 0/6] Use pageblock_order for cma and alloc_contig_range
      	alignment. [2]
      
      [1] https://lkml.kernel.org/r/87r189a2ks.fsf@linux.ibm.com
      [2] https://lkml.kernel.org/r/20220211164135.1803616-1-zi.yan@sent.com
      
      Link: https://lkml.kernel.org/r/20220214174132.219303-2-david@redhat.com
      
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Acked-by: Rob Herring <robh@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: John Garry via iommu <iommu@lists.linux-foundation.org>
      
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e16faf26
    • mm: remove mmu_gathers storage from remaining architectures · d6d22442
      Stafford Horne authored
      Originally the mmu_gathers were removed in commit 1c395176 ("mm: now
      that all old mmu_gather code is gone, remove the storage").  However,
      the openrisc and hexagon architectures were merged around the same
      time, and their mmu_gathers were not removed.
      
      This patch removes them from openrisc, hexagon and nds32.
      
      Noticed while cleaning this warning:
      
          arch/openrisc/mm/init.c:41:1: warning: symbol 'mmu_gathers' was not declared. Should it be static?
      
      Link: https://lkml.kernel.org/r/20220205141956.3315419-1-shorne@gmail.com
      
      
      Signed-off-by: Stafford Horne <shorne@gmail.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: Russell King <rmk+kernel@armlinux.org.uk>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d6d22442
    • mm: merge pte_mkhuge() call into arch_make_huge_pte() · 16785bd7
      Anshuman Khandual authored
      Each call into pte_mkhuge() is invariably followed by
      arch_make_huge_pte().  Instead, arch_make_huge_pte() can accommodate
      pte_mkhuge() at the beginning.  This updates the generic fallback stub
      for arch_make_huge_pte() and the available platform definitions.  This
      makes huge pte creation much cleaner and easier to follow.
      
      Link: https://lkml.kernel.org/r/1643860669-26307-1-git-send-email-anshuman.khandual@arm.com
      
      
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16785bd7
    • sched/headers: ARM needs asm/paravirt_api_clock.h too · ffea9fb3
      Randy Dunlap authored
      
      Add <asm/paravirt_api_clock.h> for arch/arm/, mapped to <asm/paravirt.h>,
      to simplify #ifdeffery in generic code.
      
      Fixes this build error introduced by the scheduler tree:
      
        In file included from ../kernel/sched/core.c:81:
        ../kernel/sched/sched.h:87:11: fatal error: asm/paravirt_api_clock.h: No such file or directory
           87 | # include <asm/paravirt_api_clock.h>
      
      Reviewed-by: Nathan Chancellor <nathan@kernel.org>
      Fixes: 4ff8f2ca ("sched/headers: Reorganize, clean up and optimize kernel/sched/sched.h dependencies")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/20220316204146.14000-1-rdunlap@infradead.org
      ffea9fb3
  2. Mar 21, 2022
  3. Mar 20, 2022
    • kvm/emulate: Fix SETcc emulation function offsets with SLS · fe83f5ea
      Borislav Petkov authored
      The commit in Fixes started adding INT3 after RETs as a mitigation
      against straight-line speculation.
      
      The fastop SETcc implementation in kvm's insn emulator uses macro magic
      to generate all possible SETcc functions and to jump to them when
      emulating the respective instruction.
      
      However, it hardcodes the size and alignment of those functions to 4: a
      three-byte SETcc insn and a single-byte RET. BUT, with SLS, there's an
      INT3 that gets slapped after the RET, which brings the whole scheme out
      of alignment:
      
        15:   0f 90 c0                seto   %al
        18:   c3                      ret
        19:   cc                      int3
        1a:   0f 1f 00                nopl   (%rax)
        1d:   0f 91 c0                setno  %al
        20:   c3                      ret
        21:   cc                      int3
        22:   0f 1f 00                nopl   (%rax)
        25:   0f 92 c0                setb   %al
        28:   c3                      ret
        29:   cc                      int3
      
      and this explodes like this:
      
        int3: 0000 [#1] PREEMPT SMP PTI
        CPU: 0 PID: 2435 Comm: qemu-system-x86 Not tainted 5.17.0-rc8-sls #1
        Hardware name: Dell Inc. Precision WorkStation T3400  /0TP412, BIOS A14 04/30/2012
        RIP: 0010:setc+0x5/0x8 [kvm]
        Code: 00 00 0f 1f 00 0f b6 05 43 24 06 00 c3 cc 0f 1f 80 00 00 00 00 0f 90 c0 c3 cc 0f \
      	  1f 00 0f 91 c0 c3 cc 0f 1f 00 0f 92 c0 c3 cc <0f> 1f 00 0f 93 c0 c3 cc 0f 1f 00 \
      	  0f 94 c0 c3 cc 0f 1f 00 0f 95 c0
        Call Trace:
         <TASK>
         ? x86_emulate_insn [kvm]
         ? x86_emulate_instruction [kvm]
         ? vmx_handle_exit [kvm_intel]
         ? kvm_arch_vcpu_ioctl_run [kvm]
         ? kvm_vcpu_ioctl [kvm]
         ? __x64_sys_ioctl
         ? do_syscall_64
         ? entry_SYSCALL_64_after_hwframe
         </TASK>
      
      Raise the alignment value when SLS is enabled and use a macro for that
      instead of hard-coding naked numbers.
      
      Fixes: e463a09a ("x86: Add straight-line-speculation mitigation")
      Reported-by: Jamie Heilman <jamie@audible.transient.net>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Jamie Heilman <jamie@audible.transient.net>
      Link: https://lore.kernel.org/r/YjGzJwjrvxg5YZ0Z@audible.transient.net
      
      
      [Add a comment and a bit of safety checking, since this is going to be changed
       again for IBT support. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fe83f5ea
  4. Mar 18, 2022
  5. Mar 17, 2022
  6. Mar 16, 2022
  7. Mar 11, 2022