- Mar 22, 2022
-
-
David Hildenbrand authored
... and call node_dev_init() after memory_dev_init() from driver_init(), so before any of the existing arch/subsys calls. All online nodes should be known at that point: early during boot, arch code determines node and zone ranges and sets the relevant nodes online; usually this happens in setup_arch(). This is in line with memory_dev_init(), which initializes the memory device subsystem and creates all memory block devices. Similar to memory_dev_init(), panic() if anything goes wrong, we don't want to continue with such basic initialization errors. The important part is that node_dev_init() gets called after memory_dev_init() and after cpu_dev_init(), but before any of the relevant archs call register_cpu() to register the new cpu device under the node device. The latter should be the case for the current users of topology_init(). Link: https://lkml.kernel.org/r/20220203105212.30385-1-david@redhat.com Signed-off-by:
David Hildenbrand <david@redhat.com> Reviewed-by:
Oscar Salvador <osalvador@suse.de> Tested-by: Anatoly Pugachev <matorola@gmail.com> (sparc64) Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mike Rapoport <rppt@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Michal Hocko authored
Prior to "mm: handle uninitialized numa nodes gracefully" memory hotplug used to allocate pgdat when memory has been added to a node (hotadd_init_pgdat) arch_free_nodedata has been only used in the failure path because once the pgdat is exported (to be visible by NODA_DATA(nid)) it cannot really be freed because there is no synchronization available for that. pgdat is allocated for each possible nodes now so the memory hotplug doesn't need to do the ever use arch_free_nodedata so drop it. This patch doesn't introduce any functional change. Link: https://lkml.kernel.org/r/20220127085305.20890-4-mhocko@kernel.org Signed-off-by:
Michal Hocko <mhocko@suse.com> Acked-by:
Rafael Aquini <raquini@redhat.com> Acked-by:
David Hildenbrand <david@redhat.com> Acked-by:
Mike Rapoport <rppt@linux.ibm.com> Reviewed-by:
Oscar Salvador <osalvador@suse.de> Cc: Alexey Makhalov <amakhalov@vmware.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Nico Pache <npache@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Michal Hocko authored
We have had several reports [1][2][3] that page allocator blows up when an allocation from a possible node is requested. The underlying reason is that NODE_DATA for the specific node is not allocated. NUMA specific initialization is arch specific and it can vary a lot. E.g. x86 tries to initialize all nodes that have some cpu affinity (see init_cpu_to_node) but this can be insufficient because the node might be cpuless for example. One way to address this problem would be to check for !node_online nodes when trying to get a zonelist and silently fall back to another node. That is unfortunately adding a branch into allocator hot path and it doesn't handle any other potential NODE_DATA users. This patch takes a different approach (following a lead of [3]) and it pre allocates pgdat for all possible nodes in an arch indipendent code - free_area_init. All uninitialized nodes are treated as memoryless nodes. node_state of the node is not changed because that would lead to other side effects - e.g. sysfs representation of such a node and from past discussions [4] it is known that some tools might have problems digesting that. Newly allocated pgdat only gets a minimal initialization and the rest of the work is expected to be done by the memory hotplug - hotadd_new_pgdat (renamed to hotadd_init_pgdat). generic_alloc_nodedata is changed to use the memblock allocator because neither page nor slab allocators are available at the stage when all pgdats are allocated. Hotplug doesn't allocate pgdat anymore so we can use the early boot allocator. The only arch specific implementation is ia64 and that is changed to use the early allocator as well. [1] http://lkml.kernel.org/r/20211101201312.11589-1-amakhalov@vmware.com [2] http://lkml.kernel.org/r/20211207224013.880775-1-npache@redhat.com [3] http://lkml.kernel.org/r/20190114082416.30939-1-mhocko@kernel.org [4] http://lkml.kernel.org/r/20200428093836.27190-1-srikar@linux.vnet.ibm.com [akpm@linux-foundation.org: replace comment, per Mike] Link: https://lkml.kernel.org/r/Yfe7RBeLCijnWBON@dhcp22.suse.cz Reported-by:
Alexey Makhalov <amakhalov@vmware.com> Tested-by:
Alexey Makhalov <amakhalov@vmware.com> Reported-by:
Nico Pache <npache@redhat.com> Acked-by:
Rafael Aquini <raquini@redhat.com> Tested-by:
Rafael Aquini <raquini@redhat.com> Acked-by:
David Hildenbrand <david@redhat.com> Reviewed-by:
Oscar Salvador <osalvador@suse.de> Acked-by:
Mike Rapoport <rppt@linux.ibm.com> Signed-off-by:
Michal Hocko <mhocko@suse.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Michal Hocko authored
Patch series "mm, memory_hotplug: handle unitialized numa node gracefully". The core of the fix is patch 2 which also links existing bug reports. The high level goal is to have all possible numa nodes have their pgdat allocated and initialized so for_each_possible_node(nid) NODE_DATA(nid) will never return garbage. This has proven to be problem in several places when an offline numa node is used for an allocation just to realize that node_data and therefore allocation fallback zonelists are not initialized and such an allocation request blows up. There were attempts to address that by checking node_online in several places including the page allocator. This patchset approaches the problem from a different perspective and instead of special casing, which just adds a runtime overhead, it allocates pglist_data for each possible node. This can add some memory overhead for platforms with high number of possible nodes if they do not contain any memory. This should be a rather rare configuration though. How to test this? David has provided and excellent howto: http://lkml.kernel.org/r/6e5ebc19-890c-b6dd-1924-9f25c441010d@redhat.com Patches 1 and 3-6 are mostly cleanups. The patchset has been reviewed by Rafael (thanks!) and the core fix tested by Rafael and Alexey (thanks to both). David has tested as per instructions above and hasn't found any fallouts in the memory hotplug scenarios. This patch (of 6): This is a preparatory patch and it doesn't introduce any functional change. It merely pulls out arch_alloc_nodedata (and co) outside of CONFIG_MEMORY_HOTPLUG because the following patch will need to call this from the generic MM code. Link: https://lkml.kernel.org/r/20220127085305.20890-1-mhocko@kernel.org Link: https://lkml.kernel.org/r/20220127085305.20890-2-mhocko@kernel.org Signed-off-by:
Michal Hocko <mhocko@suse.com> Acked-by:
Rafael Aquini <raquini@redhat.com> Acked-by:
David Hildenbrand <david@redhat.com> Acked-by:
Mike Rapoport <rppt@linux.ibm.com> Reviewed-by:
Oscar Salvador <osalvador@suse.de> Reviewed-by:
Wei Yang <richard.weiyang@gmail.com> Cc: Alexey Makhalov <amakhalov@vmware.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Nico Pache <npache@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Hari Bathini authored
With commit a4e92ce8 ("powerpc/fadump: Reservationless firmware assisted dump"), Linux kernel's Contiguous Memory Allocator (CMA) based reservation was introduced in fadump. That change was aimed at using CMA to let applications utilize the memory reserved for fadump while blocking it from being used for kernel pages. The assumption was, even if CMA activation fails for whatever reason, the memory still remains reserved to avoid it from being used for kernel pages. But commit 072355c1 ("mm/cma: expose all pages to the buddy if activation of an area fails") breaks this assumption as it started exposing all pages to buddy allocator on CMA activation failure. It led to warning messages like below while running crash-utility on vmcore of a kernel having above two commits: crash: seek error: kernel virtual address: <from reserved region> To fix this problem, opt out from exposing pages to buddy allocator on CMA activation failure for fadump reserved memory. Link: https://lkml.kernel.org/r/20220117075246.36072-3-hbathini@linux.ibm.com Signed-off-by:
Hari Bathini <hbathini@linux.ibm.com> Acked-by:
David Hildenbrand <david@redhat.com> Acked-by:
Michael Ellerman <mpe@ellerman.id.au> Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Anshuman Khandual authored
ARCH_WANT_GENERAL_HUGETLB config has duplicate definitions on platforms that subscribe it. Instead make it a generic config option which can be selected on applicable platforms when required. Link: https://lkml.kernel.org/r/1643718465-4324-1-git-send-email-anshuman.khandual@arm.com Signed-off-by:
Anshuman Khandual <anshuman.khandual@arm.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
luofei authored
When the hwpoison page meets the filter conditions, it should not be regarded as successful memory_failure() processing for mce handler, but should return a distinct value, otherwise mce handler regards the error page has been identified and isolated, which may lead to calling set_mce_nospec() to change page attribute, etc. Here memory_failure() return -EOPNOTSUPP to indicate that the error event is filtered, mce handler should not take any action for this situation and hwpoison injector should treat as correct. Link: https://lkml.kernel.org/r/20220223082135.2769649-1-luofei@unicloud.com Signed-off-by:
luofei <luofei@unicloud.com> Acked-by:
Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony....
-
Oscar Salvador authored
On x86, prior to ("mm: handle uninitialized numa nodes gracecully"), NUMA nodes could be allocated at three different places. - numa_register_memblks - init_cpu_to_node - init_gi_nodes All these calls happen at setup_arch, and have the following order: setup_arch ... x86_numa_init numa_init numa_register_memblks ... init_cpu_to_node init_memory_less_node alloc_node_data free_area_init_memoryless_node init_gi_nodes init_memory_less_node alloc_node_data free_area_init_memoryless_node numa_register_memblks() is only interested in those nodes which have memory, so it skips over any memoryless node it founds. Later on, when we have read ACPI's SRAT table, we call init_cpu_to_node() and init_gi_nodes(), which initialize any memoryless node we might have that have either CPU or Initiator affinity, meaning we allocate pg_data_t struct for them and we mark them as ONLINE. So far so good, but the thing is that after ("mm: handle uninitialized numa nodes gracefully"), we allocate all possible NUMA nodes in free_area_init(), meaning we have a picture like the following: setup_arch x86_numa_init numa_init numa_register_memblks <-- allocate non-memoryless node x86_init.paging.pagetable_init ... free_area_init free_area_init_memoryless <-- allocate memoryless node init_cpu_to_node alloc_node_data <-- allocate memoryless node with CPU free_area_init_memoryless_node init_gi_nodes alloc_node_data <-- allocate memoryless node with Initiator free_area_init_memoryless_node free_area_init() already allocates all possible NUMA nodes, but init_cpu_to_node() and init_gi_nodes() are clueless about that, so they go ahead and allocate a new pg_data_t struct without checking anything, meaning we end up allocating twice. It should be mad clear that this only happens in the case where memoryless NUMA node happens to have a CPU/Initiator affinity. So get rid of init_memory_less_node() and just set the node online. Note that setting the node online is needed, otherwise we choke down the chain when bringup_nonboot_cpus() ends up calling __try_online_node()->register_one_node()->... and we blow up in bus_add_device(). As can be seen here: BUG: kernel NULL pointer dereference, address: 0000000000000060 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.0-rc4-1-default+ #45 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/4 RIP: 0010:bus_add_device+0x5a/0x140 Code: 8b 74 24 20 48 89 df e8 84 96 ff ff 85 c0 89 c5 75 38 48 8b 53 50 48 85 d2 0f 84 bb 00 004 RSP: 0000:ffffc9000022bd10 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888100987400 RCX: ffff8881003e4e19 RDX: ffff8881009a5e00 RSI: ffff888100987400 RDI: ffff888100987400 RBP: 0000000000000000 R08: ffff8881003e4e18 R09: ffff8881003e4c98 R10: 0000000000000000 R11: ffff888100402bc0 R12: ffffffff822ceba0 R13: 0000000000000000 R14: ffff888100987400 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88853fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000060 CR3: 000000000200a001 CR4: 00000000001706b0 Call Trace: device_add+0x4c0/0x910 __register_one_node+0x97/0x2d0 __try_online_node+0x85/0xc0 try_online_node+0x25/0x40 cpu_up+0x4f/0x100 bringup_nonboot_cpus+0x4f/0x60 smp_init+0x26/0x79 kernel_init_freeable+0x130/0x2f1 kernel_init+0x17/0x150 ret_from_fork+0x22/0x30 The reason is simple, by the time bringup_nonboot_cpus() gets called, we did not register the node_subsys bus yet, so we crash when bus_add_device() tries to dereference bus()->p. The following shows the order of the calls: kernel_init_freeable smp_init bringup_nonboot_cpus ... bus_add_device() <- we did not register node_subsys yet do_basic_setup do_initcalls postcore_initcall(register_node_type); register_node_type subsys_system_register subsys_register bus_register <- register node_subsys bus Why setting the node online saves us then? Well, simply because __try_online_node() backs off when the node is online, meaning we do not end up calling register_one_node() in the first place. This is subtle, broken and deserves a deep analysis and thought about how to put this into shape, but for now let us have this easy fix for the leaking memory issue. [osalvador@suse.de: add comments] Link: https://lkml.kernel.org/r/20220221142649.3457-1-osalvador@suse.de Link: https://lkml.kernel.org/r/20220218224302.5282-2-osalvador@suse.de Fixes: da4490c958ad ("mm: handle uninitialized numa nodes gracefully") Signed-off-by:
Oscar Salvador <osalvador@suse.de> Acked-by:
Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Alexey Makhalov <amakhalov@vmware.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
David Hildenbrand authored
Patch series "mm: enforce pageblock_order < MAX_ORDER". Having pageblock_order >= MAX_ORDER seems to be able to happen in corner cases and some parts of the kernel are not prepared for it. For example, Aneesh has shown [1] that such kernels can be compiled on ppc64 with 64k base pages by setting FORCE_MAX_ZONEORDER=8, which will run into a WARN_ON_ONCE(order >= MAX_ORDER) in comapction code right during boot. We can get pageblock_order >= MAX_ORDER when the default hugetlb size is bigger than the maximum allocation granularity of the buddy, in which case we are no longer talking about huge pages but instead gigantic pages. Having pageblock_order >= MAX_ORDER can only make alloc_contig_range() of such gigantic pages more likely to succeed. Reliable use of gigantic pages either requires boot time allcoation or CMA, no need to overcomplicate some places in the kernel to optimize for corner cases that are broken in other areas of the kernel. This patch (of 2): Let's enforce pageblock_order < MAX_ORDER and simplify. Especially patch #1 can be regarded a cleanup before: [PATCH v5 0/6] Use pageblock_order for cma and alloc_contig_range alignment. [2] [1] https://lkml.kernel.org/r/87r189a2ks.fsf@linux.ibm.com [2] https://lkml.kernel.org/r/20220211164135.1803616-1-zi.yan@sent.com Link: https://lkml.kernel.org/r/20220214174132.219303-2-david@redhat.com Signed-off-by:
David Hildenbrand <david@redhat.com> Reviewed-by:
Zi Yan <ziy@nvidia.com> Acked-by:
Rob Herring <robh@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Frank Rowand <frowand.list@gmail.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: John Garry via iommu <iommu@lists.linux-foundation.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Stafford Horne authored
Originally the mmu_gathers were removed in commit 1c395176 ("mm: now that all old mmu_gather code is gone, remove the storage"). However, the openrisc and hexagon architecture were merged around the same time and mmu_gathers was not removed. This patch removes them from openrisc, hexagon and nds32: Noticed while cleaning this warning: arch/openrisc/mm/init.c:41:1: warning: symbol 'mmu_gathers' was not declared. Should it be static? Link: https://lkml.kernel.org/r/20220205141956.3315419-1-shorne@gmail.com Signed-off-by:
Stafford Horne <shorne@gmail.com> Acked-by:
Mike Rapoport <rppt@linux.ibm.com> Cc: Brian Cain <bcain@codeaurora.org> Cc: Nick Hu <nickhu@andestech.com> Cc: Greentime Hu <green.hu@gmail.com> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Russell King <rmk+kernel@armlinux.org.uk> Cc: David Hildenbrand <david@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Anshuman Khandual authored
Each call into pte_mkhuge() is invariably followed by arch_make_huge_pte(). Instead arch_make_huge_pte() can accommodate pte_mkhuge() at the beginning. This updates generic fallback stub for arch_make_huge_pte() and available platforms definitions. This makes huge pte creation much cleaner and easier to follow. Link: https://lkml.kernel.org/r/1643860669-26307-1-git-send-email-anshuman.khandual@arm.com Signed-off-by:
Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by:
Christophe Leroy <christophe.leroy@csgroup.eu> Acked-by:
Mike Kravetz <mike.kravetz@oracle.com> Acked-by:
Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Randy Dunlap authored
Add <asm/paravirt_api_clock.h> for arch/arm/, mapped to <asm/paravirt.h>, to simplify #ifdeffery in generic code. Fixes this build error introduced by the scheduler tree: In file included from ../kernel/sched/core.c:81: ../kernel/sched/sched.h:87:11: fatal error: asm/paravirt_api_clock.h: No such file or directory 87 | # include <asm/paravirt_api_clock.h> Reviewed-by:
Nathan Chancellor <nathan@kernel.org> Fixes: 4ff8f2ca ("sched/headers: Reorganize, clean up and optimize kernel/sched/sched.h dependencies") Signed-off-by:
Randy Dunlap <rdunlap@infradead.org> Signed-off-by:
Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220316204146.14000-1-rdunlap@infradead.org
-
- Mar 21, 2022
-
-
Matthew Wilcox (Oracle) authored
We need to use this function in common code; pull it out of pmd_page(). Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org>
-
Matthew Wilcox (Oracle) authored
This is straightforward for everything except nohash64 where we indirect through pmd_page(). There must be a better way to do this. Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org>
-
Matthew Wilcox (Oracle) authored
Whether or not the platform supports PMD sized pages, we need to provide pmd_pfn() for an upcoming patch. Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org>
-
Mike Rapoport authored
We need to use this function in common code, so define it for architectures and/or configrations that miss it. The result of pmd_pfn() will only be used if TRANSPARENT_HUGEPAGE is enabled, but a function or macro called pmd_pfn() must be defined, even on machines with two level page tables. Signed-off-by:
Mike Rapoport <rppt@linux.ibm.com> Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org>
-
Matthew Wilcox (Oracle) authored
Add isolate_lru_page() as a wrapper around isolate_lru_folio(). TestClearPageLRU() would have always failed on a tail page, so returning -EBUSY is the same behaviour. Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
John Hubbard <jhubbard@nvidia.com> Reviewed-by:
Jason Gunthorpe <jgg@nvidia.com> Reviewed-by:
William Kucharski <william.kucharski@oracle.com>
-
Kees Cook authored
Add SUBARCH target for Clang+um (which must go last, not alphabetically, so the other SUBARCHes are assigned). Remove open-coded "DEFINE" macro, instead using linux/kbuild.h's version which was updated to use Clang-friendly assembly in commit cf0c3e68 ("kbuild: fix asm-offset generation to work with clang"). Redefine "DEFINE_LONGS" in terms of "COMMENT" and "DEFINE" so that the intended coment actually has useful content. Add a missed "break" to avoid implicit fall-through warnings. This lets me run KUnit tests with Clang: $ ./tools/testing/kunit/kunit.py run --make_options LLVM=1 ... Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: David Gow <davidgow@google.com> Cc: linux-um@lists.infradead.org Cc: linux-kbuild@vger.kernel.org Cc: linux-kselftest@vger.kernel.org Cc: kunit-dev@googlegroups.com Cc: llvm@lists.linux.dev Reviewed-by:
Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/lkml/Yg2YubZxvYvx7%2Fnm@dev-arch.archlinux-ax161/ Tested-by:
David Gow <davidgow@google.com> Link: https://lore.kernel.org/lkml/CABVgOSk=oFxsbSbQE-v65VwR2+mXeGXDDjzq8t7FShwjJ3+kUg@mail.gmail.com/ Signed-off-by:
Kees Cook <keescook@chromium.org> --- v1: https://lore.kernel.org/lkml/20220217002843.2312603-1-keescook@chromium.org v2: https://lore.kernel.org/lkml/20220224055831.1854786-1-keescook@chromium.org v3: - use kbuild.h to avoid duplication (Masahiro) - fix intended comments (Masahiro) - use SUBARCH (Nathan)
-
John David Anglin authored
Cache move-in for virtual accesses is controlled by the TLB. Thus, we must generally purge TLB entries before flushing. The flush routines must use TLB entries that inhibit cache move-in. V2: Load physical address prior to flushing TLB. In flush_cache_page, flush TLB when flushing and purging. V3: Don't flush when start equals end. Signed-off-by:
John David Anglin <dave.anglin@bell.net> Signed-off-by:
Helge Deller <deller@gmx.de>
-
- Mar 20, 2022
-
-
Borislav Petkov authored
The commit in Fixes started adding INT3 after RETs as a mitigation against straight-line speculation. The fastop SETcc implementation in kvm's insn emulator uses macro magic to generate all possible SETcc functions and to jump to them when emulating the respective instruction. However, it hardcodes the size and alignment of those functions to 4: a three-byte SETcc insn and a single-byte RET. BUT, with SLS, there's an INT3 that gets slapped after the RET, which brings the whole scheme out of alignment: 15: 0f 90 c0 seto %al 18: c3 ret 19: cc int3 1a: 0f 1f 00 nopl (%rax) 1d: 0f 91 c0 setno %al 20: c3 ret 21: cc int3 22: 0f 1f 00 nopl (%rax) 25: 0f 92 c0 setb %al 28: c3 ret 29: cc int3 and this explodes like this: int3: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 2435 Comm: qemu-system-x86 Not tainted 5.17.0-rc8-sls #1 Hardware name: Dell Inc. Precision WorkStation T3400 /0TP412, BIOS A14 04/30/2012 RIP: 0010:setc+0x5/0x8 [kvm] Code: 00 00 0f 1f 00 0f b6 05 43 24 06 00 c3 cc 0f 1f 80 00 00 00 00 0f 90 c0 c3 cc 0f \ 1f 00 0f 91 c0 c3 cc 0f 1f 00 0f 92 c0 c3 cc <0f> 1f 00 0f 93 c0 c3 cc 0f 1f 00 \ 0f 94 c0 c3 cc 0f 1f 00 0f 95 c0 Call Trace: <TASK> ? x86_emulate_insn [kvm] ? x86_emulate_instruction [kvm] ? vmx_handle_exit [kvm_intel] ? kvm_arch_vcpu_ioctl_run [kvm] ? kvm_vcpu_ioctl [kvm] ? __x64_sys_ioctl ? do_syscall_64 ? entry_SYSCALL_64_after_hwframe </TASK> Raise the alignment value when SLS is enabled and use a macro for that instead of hard-coding naked numbers. Fixes: e463a09a ("x86: Add straight-line-speculation mitigation") Reported-by:
Jamie Heilman <jamie@audible.transient.net> Signed-off-by:
Borislav Petkov <bp@suse.de> Acked-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by:
Jamie Heilman <jamie@audible.transient.net> Link: https://lore.kernel.org/r/YjGzJwjrvxg5YZ0Z@audible.transient.net [Add a comment and a bit of safety checking, since this is going to be changed again for IBT support. - Paolo] Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- Mar 18, 2022
-
-
Helge Deller authored
Avoid flushing caches in __flush_cache_page() and __purge_cache_page() if the machine hasn't data or instruction caches - as e.g. in qemu. Signed-off-by:
Helge Deller <deller@gmx.de>
-
Arnd Bergmann authored
The '.type' field is initialized both in place and in the macro as reported by this W=1 warning: arch/arm64/include/asm/cpufeature.h:281:9: error: initialized field overwritten [-Werror=override-init] 281 | (ARM64_CPUCAP_SCOPE_LOCAL_CPU | ARM64_CPUCAP_OPTIONAL_FOR_LATE_CPU) | ^ arch/arm64/kernel/cpu_errata.c:136:17: note: in expansion of macro 'ARM64_CPUCAP_LOCAL_CPU_ERRATUM' 136 | .type = ARM64_CPUCAP_LOCAL_CPU_ERRATUM, \ | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ arch/arm64/kernel/cpu_errata.c:145:9: note: in expansion of macro 'ERRATA_MIDR_RANGE' 145 | ERRATA_MIDR_RANGE(m, var, r_min, var, r_max) | ^~~~~~~~~~~~~~~~~ arch/arm64/kernel/cpu_errata.c:613:17: note: in expansion of macro 'ERRATA_MIDR_REV_RANGE' 613 | ERRATA_MIDR_REV_RANGE(MIDR_CORTEX_A510, 0, 0, 2), | ^~~~~~~~~~~~~~~~~~~~~ arch/arm64/include/asm/cpufeature.h:281:9: note: (near initialization for 'arm64_errata[18].type') 281 | (ARM64_CPUCAP_SCOPE_LOCAL_CPU | ARM64_CPUCAP_OPTIONAL_FOR_LATE_CPU) | ^ Remove the extranous initializer. Signed-off-by:
Arnd Bergmann <arnd@arndb.de> Fixes: 1dd498e5 ("KVM: arm64: Workaround Cortex-A510's single-step and PAC trap errata") Link: https://lore.kernel.org/r/20220316183800.1546731-1-arnd@kernel.org Signed-off-by:
Catalin Marinas <catalin.marinas@arm.com>
-
Arnd Bergmann authored
The newly introduced TRAMP_VALIAS definition causes a build warning with clang-14: arch/arm64/include/asm/vectors.h:66:31: error: arithmetic on a null pointer treated as a cast from integer to pointer is a GNU extension [-Werror,-Wnull-pointer-arithmetic] return (char *)TRAMP_VALIAS + SZ_2K * slot; Change the addition to something clang does not complain about. Fixes: bd09128d ("arm64: Add percpu vectors for EL1") Signed-off-by:
Arnd Bergmann <arnd@arndb.de> Acked-by:
James Morse <james.morse@arm.com> Link: https://lore.kernel.org/r/20220316183833.1563139-1-arnd@kernel.org Signed-off-by:
Catalin Marinas <catalin.marinas@arm.com>
-
Helge Deller authored
This patch changes the kprobe and kretprobe feature to use another break instruction instead of relying on the hardware single-step feature. That way those kprobes now work in qemu as well, because in qemu we don't emulate yet single-stepping. Signed-off-by:
Helge Deller <deller@gmx.de>
-
- Mar 17, 2022
-
-
Helge Deller authored
Improve CPU bootup info text from: CPU1: thread -1, cpu 0, socket 1 to CPU1: cpu core 0 of socket 1 Signed-off-by:
Helge Deller <deller@gmx.de>
-
Helge Deller authored
Allow to enable page table boot-up checks. Suggested-by:
Rolf Eike Beer <eike-kernel@sf-tec.de> Signed-off-by:
Helge Deller <deller@gmx.de>
-
- Mar 16, 2022
-
-
Helge Deller authored
At least the qemu virtual machine does not provide D- and I-caches, so skip triggering SMP irqs to flush caches on such machines. Further optimize the caching code by using static branches and making some functions static. Signed-off-by:
Helge Deller <deller@gmx.de>
-
Vladimir Oltean authored
This reverts commit 869f0ec0. That updated the expected device tree binding format for the ls-extirq driver, without also updating the parsing code (ls_extirq_parse_map) to the new format. The context is that the ls-extirq driver uses the standard "interrupt-map" OF property in a non-standard way, as suggested by Rob Herring during review: https://lore.kernel.org/lkml/20190927161118.GA19333@bogus/ This has turned out to be problematic, as Marc Zyngier discovered through commit 04128418 ("of/irq: Allow matching of an interrupt-map local to an interrupt controller"), later fixed through commit de4adddc ("of/irq: Add a quirk for controllers with their own definition of interrupt-map"). Marc's position, expressed on multiple opportunities, is that: (a) [ making private use of the reserved "interrupt-map" name in a driver ] "is wrong, by the very letter of what an interrupt-map means. If the interrupt map points to an interrupt controller, that's the target for the interrupt." https://lore.kernel.org/lkml/87k0g8jlmg.wl-maz@kernel.org/ (b) [ updating the driver's bindings to accept a non-reserved name for this property, as an alternative, is ] "is totally pointless. These machines have been in the wild for years, and existing DTs will be there *forever*." https://lore.kernel.org/lkml/87ilvrk1r0.wl-maz@kernel.org/ Considering the above, the Linux kernel has quirks in place to deal with the ls-extirq's non-standard use of the "interrupt-map". These quirks may be needed in other operating systems that consume this device tree, yet this is seen as the only viable solution. Therefore, the premise of the patch being reverted here is invalid. It doesn't matter whether the driver, in its non-standard use of the property, complies to the standard format or not, since this property isn't expected to be used for interrupt translation by the core. This change restores LS1088A, LS2088A/LS2085A and LX2160A to their previous bindings, which allows these systems to continue to use external interrupt lines with the correct polarity. Fixes: 869f0ec0 ("arm64: dts: freescale: Fix 'interrupt-map' parent address cells") Signed-off-by:
Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by:
Marc Zyngier <maz@kernel.org> Signed-off-by:
Arnd Bergmann <arnd@arndb.de>
-
David Woodhouse authored
The ACPI specification says that OSPM should refuse to restore from hibernate if the hardware signature changes, and should boot from scratch. However, real BIOSes often vary the hardware signature in cases where we *do* want to resume from hibernate, so Linux doesn't follow the spec by default. However, in a virtual environment there's no reason for the VMM to vary the hardware signature *unless* it wants to trigger a clean reboot as defined by the ACPI spec. So enable the check by default if a hypervisor is detected. Signed-off-by:
David Woodhouse <dwmw@amazon.co.uk> Signed-off-by:
Rafael J. Wysocki <rafael.j.wysocki@intel.com>
-
Jiri Kosina authored
The Do you have a strange power saving mode enabled? hint when unknown NMI happens dates back to i386 stone age, and isn't currently really helpful. Unknown NMIs are coming for many different reasons (broken firmware, faulty hardware, ...) and rarely have anything to do with 'strange power saving mode' (whatever that even is). Just remove it as it's largerly misleading. Signed-off-by:
Jiri Kosina <jkosina@suse.cz> Signed-off-by:
Borislav Petkov <bp@suse.de> Acked-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/nycvar.YFH.7.76.2203140924120.24795@cbobk.fhfr.pm
-
- Mar 11, 2022
-
-
Randy Dunlap authored
When CONFIG_GENERIC_CPU_VULNERABILITIES is not set, references to spectre_v2_update_state() cause a build error, so provide an empty stub for that function when the Kconfig option is not set. Fixes this build error: arm-linux-gnueabi-ld: arch/arm/mm/proc-v7-bugs.o: in function `cpu_v7_bugs_init': proc-v7-bugs.c:(.text+0x52): undefined reference to `spectre_v2_update_state' arm-linux-gnueabi-ld: proc-v7-bugs.c:(.text+0x82): undefined reference to `spectre_v2_update_state' Fixes: b9baf5c8 ("ARM: Spectre-BHB workaround") Signed-off-by:
Randy Dunlap <rdunlap@infradead.org> Reported-by:
kernel test robot <lkp@intel.com> Cc: Russell King <rmk+kernel@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: patches@armlinux.org.uk Acked-by:
Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Russell King (Oracle) authored
When building for Thumb2, the vectors make use of a local label. Sadly, the Spectre BHB code also uses a local label with the same number which results in the Thumb2 reference pointing at the wrong place. Fix this by changing the number used for the Spectre BHB local label. Fixes: b9baf5c8 ("ARM: Spectre-BHB workaround") Tested-by:
Nathan Chancellor <nathan@kernel.org> Signed-off-by:
Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
John David Anglin authored
In testing the "Fix non-access data TLB cache flush faults" change, I noticed a significant improvement in glibc build and check times. This led me to investigate the parisc_cache_flush_threshold setting. It determines when we switch from line flushing to whole cache flushing. It turned out that the parisc_cache_flush_threshold setting on mako and mako2 machines (PA8800 and PA8900 processors) was way too small. Adjusting this setting provided almost a factor two improvement in the glibc build and check time. Signed-off-by:
John David Anglin <dave.anglin@bell.net> Signed-off-by:
Helge Deller <deller@gmx.de>
-
Helge Deller authored
Userspace is up to now limited to 32-bit, so it's sufficient to print only 32-bit values when showing pointer addresses. Signed-off-by:
Helge Deller <deller@gmx.de>
-
Helge Deller authored
Convert to use real temp variables instead of clobbering processor registers. Signed-off-by:
Helge Deller <deller@gmx.de>
-
Helge Deller authored
Convert to use real temp variables instead of clobbering processor registers. Signed-off-by:
Helge Deller <deller@gmx.de>
-
Helge Deller authored
Convert to use real temp variables instead of clobbering processor registers. Signed-off-by:
Helge Deller <deller@gmx.de>
-
Helge Deller authored
Convert to use real temp variables instead of clobbering processor registers. Signed-off-by:
Helge Deller <deller@gmx.de>
-
Helge Deller authored
Convert the inline assembly code to use the automatic EFAULT exception handler. With that the fixup code can be dropped. The other change is to allow double-word only when a 64-bit kernel is used instead of depending on CONFIG_PA20. Signed-off-by:
Helge Deller <deller@gmx.de>
-
Helge Deller authored
The get_current() code uses the mfctl() macro to get the pointer to the current task struct from %cr30. The problem with the mfctl() macro is, that it is marked volatile which is basically correct, because mfctl() is used to get e.g. the current internal timer or interrupt flags as well. But specifically the task struct pointer (%cr30) doesn't change over time when the kernel executes code for a task. So, by dropping the volatile when retrieving %cr30 the compiler is now able to get this value only once and optimize the generated code a lot. A bloat-o-meter comparism shows that this patch saves ~5kB kernel code on a 32-bit kernel and ~6kB kernel code on a 64-bit kernel. Signed-off-by:
Helge Deller <deller@gmx.de>
-