  1. Apr 02, 2022
    • tracing: Rename the staging files for trace_events · 84055411
      Steven Rostedt (Google) authored
      
      When looking for the implementation of the different phases of the
      creation of the TRACE_EVENT() macro, it is not very helpful that all the
      helper macro redefinitions live in files labeled "stageX_defines.h".
      Rename them to state which phase each file is for.  For instance, when
      looking for the defines used to create the event fields, seeing
      "stage4_event_fields.h" gives the developer a good idea that the defines
      are in that file.
      
      Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      84055411
  2. Mar 23, 2022
  3. Mar 22, 2022
    • mm/damon/core: add number of each enum type values · 5257f36e
      SeongJae Park authored
      This commit declares the number of legal values for each DAMON enum type
      to make traversals of those enum types easy and safe.
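
      For illustration only (the enum and its members below are made up, not
      necessarily the names this patch adds), a trailing NR_* value makes a
      bounded, safe traversal trivial:

          /* Illustrative sketch only; member names are made up. */
          enum damos_example_action {
                  DAMOS_EX_WILLNEED,
                  DAMOS_EX_COLD,
                  DAMOS_EX_PAGEOUT,
                  NR_DAMOS_EX_ACTIONS,    /* number of legal values, kept last */
          };

          static void for_each_example_action(void (*fn)(int action))
          {
                  int i;

                  /* The sentinel gives an explicit, safe traversal bound. */
                  for (i = 0; i < NR_DAMOS_EX_ACTIONS; i++)
                          fn(i);
          }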
      
      Link: https://lkml.kernel.org/r/20220228081314.5770-3-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <skhan@linuxfoundation.org>
      Cc: Xin Hao <xhao@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5257f36e
    • mm/damon/core: allow non-exclusive DAMON start/stop · 8b9b0d33
      SeongJae Park authored
      Patch series "Introduce DAMON sysfs interface", v3.
      
      Introduction
      ============
      
      DAMON's debugfs-based user interface (DAMON_DBGFS) has served very well
      so far.  However, it unnecessarily depends on debugfs, even though DAMON
      is not aimed at being used only for debugging.  Also, the interface
      receives multiple values via a single file; for example, the schemes
      file receives 18 values.  As a result, it is inefficient, hard to use,
      and difficult to extend.  In particular, keeping backward compatibility
      for user-space tools is becoming ever more challenging.  It would be
      better to implement another reliable and flexible interface and
      deprecate DAMON_DBGFS in the long term.

      For that reason, this patchset introduces a new, sysfs-based user
      interface for DAMON.  The idea of the new interface is to use directory
      hierarchies and to have one dedicated file for each value.  As a short
      example, users can do virtual address monitoring via the interface as
      below:
      
          # cd /sys/kernel/mm/damon/admin/
          # echo 1 > kdamonds/nr_kdamonds
          # echo 1 > kdamonds/0/contexts/nr_contexts
          # echo vaddr > kdamonds/0/contexts/0/operations
          # echo 1 > kdamonds/0/contexts/0/targets/nr_targets
          # echo $(pidof <workload>) > kdamonds/0/contexts/0/targets/0/pid_target
          # echo on > kdamonds/0/state
      
      A brief representation of the file hierarchy of the DAMON sysfs
      interface is shown below.  Children are represented with indentation,
      directories have a '/' suffix, and files in each directory are separated
      by commas.
      
          /sys/kernel/mm/damon/admin
          │ kdamonds/nr_kdamonds
          │ │ 0/state,pid
          │ │ │ contexts/nr_contexts
          │ │ │ │ 0/operations
          │ │ │ │ │ monitoring_attrs/
          │ │ │ │ │ │ intervals/sample_us,aggr_us,update_us
          │ │ │ │ │ │ nr_regions/min,max
          │ │ │ │ │ targets/nr_targets
          │ │ │ │ │ │ 0/pid_target
          │ │ │ │ │ │ │ regions/nr_regions
          │ │ │ │ │ │ │ │ 0/start,end
          │ │ │ │ │ │ │ │ ...
          │ │ │ │ │ │ ...
          │ │ │ │ │ schemes/nr_schemes
          │ │ │ │ │ │ 0/action
          │ │ │ │ │ │ │ access_pattern/
          │ │ │ │ │ │ │ │ sz/min,max
          │ │ │ │ │ │ │ │ nr_accesses/min,max
          │ │ │ │ │ │ │ │ age/min,max
          │ │ │ │ │ │ │ quotas/ms,bytes,reset_interval_ms
          │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil
          │ │ │ │ │ │ │ watermarks/metric,interval_us,high,mid,low
          │ │ │ │ │ │ │ stats/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds
          │ │ │ │ │ │ ...
          │ │ │ │ ...
          │ │ ...
      
      Detailed usage of the files will be described in the final Documentation
      patch of this patchset.
      
      Main Difference Between DAMON_DBGFS and DAMON_SYSFS
      ---------------------------------------------------
      
      At the moment, DAMON_DBGFS and DAMON_SYSFS provide the same features.
      One important difference between them is their exclusiveness.
      DAMON_DBGFS works in an exclusive manner, so that no other DAMON worker
      thread (kdamond) in the system can run concurrently and interfere.  For
      that reason, DAMON_DBGFS asks users to construct all monitoring contexts
      and start them at once.  That is not a big problem, but it makes the
      operation a little more complex and inflexible.

      For more flexible usage, DAMON_SYSFS moves the responsibility of
      preventing any possible interference to the admins and works in a
      non-exclusive manner.  That is, users can configure and start contexts
      one by one.  Note that DAMON respects both exclusive and non-exclusive
      groups of contexts, in a manner similar to that of reader-writer locks.
      That is, if any exclusive monitoring contexts (e.g., contexts started
      via DAMON_DBGFS) are running, DAMON_SYSFS does not start new contexts,
      and vice versa.
      
      Future Plan of DAMON_DBGFS Deprecation
      ======================================
      
      Once this patchset is merged, DAMON_DBGFS development will be frozen.
      That is, we will maintain it so that it keeps working as it does now and
      no users will be broken, but it will not be extended to provide any new
      DAMON feature.  Support will be continued only until the next LTS
      release.  After that, we will drop DAMON_DBGFS.
      
      User-space Tooling Compatibility
      --------------------------------
      
      As DAMON_SYSFS provides all features of DAMON_DBGFS, all user-space
      tooling can move to DAMON_SYSFS.  Because we will continue supporting
      DAMON_DBGFS until the next LTS kernel release, user-space tools should
      have enough time to move to DAMON_SYSFS.

      The official user-space tool, damo[1], already supports both
      DAMON_SYSFS and DAMON_DBGFS.  Both the correctness tests[2] and the
      performance tests[3] of DAMON using DAMON_SYSFS also pass.
      
      [1] https://github.com/awslabs/damo
      [2] https://github.com/awslabs/damon-tests/tree/master/corr
      [3] https://github.com/awslabs/damon-tests/tree/master/perf
      
      Sequence of Patches
      ===================
      
      The first two patches (patches 1-2) make core changes for DAMON_SYSFS.
      The first one (patch 1) allows non-exclusive DAMON contexts so that
      DAMON_SYSFS can work in non-exclusive mode, while the second one (patch
      2) adds the number of values of the DAMON enum types so that DAMON API
      users can safely iterate over the enums.

      The third patch (patch 3) implements a basic sysfs stub for virtual
      address space monitoring.  Note that this implements only the sysfs
      files; DAMON is not yet linked.  The fourth patch (patch 4) links
      DAMON_SYSFS to DAMON so that users can control DAMON using the sysfs
      files.

      The following six patches (patches 5-10) implement, one by one, the
      other DAMON features that DAMON_DBGFS supports (physical address space
      monitoring, DAMON-based operation schemes, scheme quotas, scheme
      prioritization weights, scheme watermarks, and scheme stats).

      The following patch (patch 11) adds a simple selftest for DAMON_SYSFS,
      and the final one (patch 12) documents DAMON_SYSFS.

      This patch (of 12):
      
      To avoid interference between DAMON contexts monitoring overlapping
      memory regions, damon_start() works in an exclusive manner.  That is,
      damon_start() does nothing but fails if any context that was started by
      another instance of the function is still running.  This makes its usage
      a little restrictive.  However, in some cases admins can be aware of
      each DAMON usage and address such interference on their own.

      This commit hence implements a non-exclusive mode of the function and
      allows callers to select the mode.  Note that exclusive groups and
      non-exclusive groups of contexts will respect each other in a manner
      similar to that of reader-writer locks.  Therefore, this commit does not
      cause any behavioral change for the exclusive groups.
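
      As an illustrative sketch of the idea (simplified pseudo-kernel C, not
      the literal code of this patch), the reader-writer-lock-like gating can
      be pictured as an 'exclusive' flag checked against the contexts that are
      already running:

          /* Illustrative sketch only; names and locking are simplified. */
          static int nr_running_ctxs;
          static bool running_exclusive_ctxs;
          static DEFINE_MUTEX(damon_lock);

          int damon_start(struct damon_ctx **ctxs, int nr_ctxs, bool exclusive)
          {
                  int i, err = 0;

                  mutex_lock(&damon_lock);
                  /* Exclusive and non-exclusive groups never run together. */
                  if ((exclusive && nr_running_ctxs) || running_exclusive_ctxs) {
                          mutex_unlock(&damon_lock);
                          return -EBUSY;
                  }
                  for (i = 0; i < nr_ctxs; i++) {
                          err = __damon_start(ctxs[i]);   /* spawns one kdamond */
                          if (err)
                                  break;
                          nr_running_ctxs++;
                  }
                  if (exclusive && nr_running_ctxs)
                          running_exclusive_ctxs = true;
                  mutex_unlock(&damon_lock);
                  return err;
          }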
      
      Link: https://lkml.kernel.org/r/20220228081314.5770-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20220228081314.5770-2-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Shuah Khan <skhan@linuxfoundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Xin Hao <xhao@linux.alibaba.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8b9b0d33
    • mm/damon/paddr,vaddr: remove damon_{p,v}a_{target_valid,set_operations}() · 85104056
      SeongJae Park authored
      Because the DAMON debugfs interface and DAMON-based proactive reclaim
      now use monitoring operations via the registration mechanism, the
      damon_{p,v}a_{target_valid,set_operations}() functions have no user.
      This commit cleans them up.
      
      Link: https://lkml.kernel.org/r/20220215184603.1479-9-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Xin Hao <xhao@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      85104056
    • mm/damon: let monitoring operations can be registered and selected · 9f7b053a
      SeongJae Park authored
      In-kernel DAMON user code like the DAMON debugfs interface has to set
      the 'struct damon_operations' of its 'struct damon_ctx' on its own.
      Therefore, the client code has to depend on every monitoring operations
      implementation that it could use.  For example, the DAMON debugfs
      interface depends on both vaddr and paddr, while some of its users are
      not always interested in both.

      To minimize such unnecessary dependencies, this commit lets the
      monitoring operations be registered by the implementing code and then
      dynamically selected by the user code, without a build-time dependency.
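
      A minimal sketch of such a register-and-select scheme (illustrative
      only; the identifiers below are simplified stand-ins, not necessarily
      the exact API added by this commit):

          enum example_ops_id {
                  EXAMPLE_OPS_VADDR,
                  EXAMPLE_OPS_PADDR,
                  NR_EXAMPLE_OPS,
          };

          struct example_ctx;

          struct example_operations {
                  enum example_ops_id id;
                  unsigned int (*check_accesses)(struct example_ctx *ctx);
          };

          struct example_ctx {
                  struct example_operations ops;
          };

          static struct example_operations registered_ops[NR_EXAMPLE_OPS];

          /* Implementations (vaddr, paddr, ...) register themselves at init. */
          int example_register_ops(struct example_operations *ops)
          {
                  if (ops->id >= NR_EXAMPLE_OPS)
                          return -EINVAL;
                  registered_ops[ops->id] = *ops;
                  return 0;
          }

          /* User code selects an implementation by id, with no build-time
           * dependency on the implementing code.
           */
          int example_select_ops(struct example_ctx *ctx, enum example_ops_id id)
          {
                  if (id >= NR_EXAMPLE_OPS || !registered_ops[id].check_accesses)
                          return -EINVAL;
                  ctx->ops = registered_ops[id];
                  return 0;
          }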
      
      Link: https://lkml.kernel.org/r/20220215184603.1479-3-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Xin Hao <xhao@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9f7b053a
    • mm/damon: rename damon_primitives to damon_operations · f7d911c3
      SeongJae Park authored
      Patch series "Allow DAMON user code independent of monitoring primitives".
      
      In-kernel DAMON user code is required to configure the monitoring
      context (struct damon_ctx) with proper monitoring primitives (struct
      damon_primitive).  This makes the user code dependent on all supported
      monitoring primitives.  For example, the DAMON debugfs interface depends
      on both DAMON_VADDR and DAMON_PADDR, though some users are interested in
      only one of the use cases.  As more monitoring primitives are
      introduced, the problem will only get bigger.

      To minimize such unnecessary dependencies, this patchset lets monitoring
      primitives be registered by the implementing code and later dynamically
      searched and selected by the user code.

      In addition, this patchset renames monitoring primitives to monitoring
      operations, which makes it easier to intuitively understand what they
      mean and how they are structured.
      
      This patch (of 8):
      
      DAMON has a set of callback functions called monitoring primitives that
      lets it be configured with various implementations, for easy extension
      to different address spaces and usages.  However, the word 'primitive'
      is not very explicit.  Meanwhile, many other structs serving a similar
      purpose call themselves 'operations'.  To make the code easier to
      understand, this commit renames 'damon_primitives' to 'damon_operations'
      before it is too late to rename.
      
      Link: https://lkml.kernel.org/r/20220215184603.1479-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20220215184603.1479-2-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Xin Hao <xhao@linux.alibaba.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f7d911c3
    • mm/damon: remove the target id concept · 1971bd63
      SeongJae Park authored
      DAMON asks each monitoring target ('struct damon_target') to have one
      'unsigned long' integer called 'id', which should be unique among the
      targets of the same monitoring context.  Its meaning is, however,
      totally up to the monitoring primitives registered to the monitoring
      context.  For example, the virtual address space monitoring primitives
      treat the id as a 'struct pid' pointer.

      This makes the code flexible but ugly, poorly documented, and
      type-unsafe[1].  Also, each target can be identified via its index.  For
      that reason, this commit removes the concept and uses a clear type
      definition.  For now, only a 'struct pid' pointer is used, for the
      virtual address space monitoring.  If DAMON is extended in the future so
      that another identifier field needs to be put in the struct, we will use
      a union for such primitives-dependent fields and document which
      primitives use which type.
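
      An illustrative before/after sketch of the struct layout (simplified,
      not necessarily the exact fields after this commit):

          /* Before: opaque, primitive-defined identifier. */
          struct damon_target_old {
                  unsigned long id;       /* secretly a 'struct pid *' for vaddr */
                  struct list_head list;
          };

          /* After: clearly typed; a union could host other primitive-specific
           * identifiers later, should they become necessary.
           */
          struct damon_target_new {
                  struct pid *pid;        /* used by virtual address monitoring */
                  struct list_head list;
          };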
      
      [1] https://lore.kernel.org/linux-mm/20211013154535.4aaeaaf9d0182922e405dd1e@linux-foundation.org/
      
      Link: https://lkml.kernel.org/r/20211230100723.2238-5-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1971bd63
    • mm/damon/core: move damon_set_targets() into dbgfs · 43642825
      SeongJae Park authored
      The damon_set_targets() function is defined in the core for general use
      cases, but it is called only from dbgfs.  Also, because the function is
      for general use cases, dbgfs does additional handling for the pid-type
      target id case.  To make the situation simpler, this commit moves the
      function into dbgfs and makes it do the pid-type case handling on its
      own.
      
      Link: https://lkml.kernel.org/r/20211230100723.2238-4-sj@kernel.org
      
      
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43642825
    • highmem: document kunmap_local() · d7ca25c5
      Ira Weiny authored
      Some users of kmap() add an offset to the kmap() address to be used
      during the mapping.
      
      When converting to kmap_local_page(), the base address does not need to
      be stored, because any address within the page can be used with
      kunmap_local().  However, this was not clear from the documentation and
      caused some questions.[1]
      
      Document that any address in the page can be used in kunmap_local() to
      clarify this for future users.
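
      A small illustrative sketch of the pattern being documented (the helper
      copy_into_page() and its parameters are made up for the example): the
      offset address itself can be handed back to kunmap_local():

          #include <linux/highmem.h>

          static void copy_into_page(struct page *page, size_t offset,
                                     const void *src, size_t len)
          {
                  /* Map the page and work at an offset inside it. */
                  char *vaddr = kmap_local_page(page);

                  memcpy(vaddr + offset, src, len);

                  /* Any address within the page is fine; no need to keep vaddr. */
                  kunmap_local(vaddr + offset);
          }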
      
      [1] https://lore.kernel.org/lkml/20211213154543.GM3538886@iweiny-DESK2.sc.intel.com/
      
      [ira.weiny@intel.com: updates per Christoph]
        Link: https://lkml.kernel.org/r/20220124182138.816693-1-ira.weiny@intel.com
      
      Link: https://lkml.kernel.org/r/20220124013045.806718-1-ira.weiny@intel.com
      
      
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7ca25c5
    • mm: uninline copy_overflow() · ad7489d5
      Christophe Leroy authored
      While building a small config with CONFIG_CC_OPTIMISE_FOR_SIZE, I ended
      up with more than 50 copies of the following function in vmlinux,
      because GCC doesn't honor the 'inline' keyword:
      
      	c00243bc <copy_overflow>:
      	c00243bc:	94 21 ff f0 	stwu    r1,-16(r1)
      	c00243c0:	7c 85 23 78 	mr      r5,r4
      	c00243c4:	7c 64 1b 78 	mr      r4,r3
      	c00243c8:	3c 60 c0 62 	lis     r3,-16286
      	c00243cc:	7c 08 02 a6 	mflr    r0
      	c00243d0:	38 63 5e e5 	addi    r3,r3,24293
      	c00243d4:	90 01 00 14 	stw     r0,20(r1)
      	c00243d8:	4b ff 82 45 	bl      c001c61c <__warn_printk>
      	c00243dc:	0f e0 00 00 	twui    r0,0
      	c00243e0:	80 01 00 14 	lwz     r0,20(r1)
      	c00243e4:	38 21 00 10 	addi    r1,r1,16
      	c00243e8:	7c 08 03 a6 	mtlr    r0
      	c00243ec:	4e 80 00 20 	blr
      
      With -Winline, GCC says:
      
      	/include/linux/thread_info.h:212:20: warning: inlining failed in call to 'copy_overflow': call is unlikely and code size would grow [-Winline]
      
      copy_overflow() is an unconditional warning called by check_copy_size()
      on an error path.

      check_copy_size() has to remain inlined in order to benefit from
      constant folding, but copy_overflow() is not worth inlining.

      Uninline the warning when CONFIG_BUG is selected.

      When CONFIG_BUG is not selected, WARN() does nothing, so skip it
      entirely.

      This reduces the size of vmlinux by almost 4 kbytes.
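
      A rough sketch of the shape of the change (illustrative, not the literal
      diff): the warning body moves out of line when CONFIG_BUG is set, and
      compiles away to nothing otherwise:

          /* include/linux/thread_info.h (sketch) */
          #ifdef CONFIG_BUG
          void copy_overflow(int size, unsigned long count);  /* out of line */
          #else
          static inline void copy_overflow(int size, unsigned long count)
          {
                  /* Without CONFIG_BUG, WARN() is a no-op, so emit nothing. */
          }
          #endif

          /* out-of-line definition (sketch) */
          #ifdef CONFIG_BUG
          void copy_overflow(int size, unsigned long count)
          {
                  WARN(1, "Buffer overflow detected (%d < %lu)!\n", size, count);
          }
          #endif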
      
      Link: https://lkml.kernel.org/r/e1723b9cfa924bcefcd41f69d0025b38e4c9364e.1644819985.git.christophe.leroy@csgroup.eu
      
      
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ad7489d5
    • mm: remove usercopy_warn() · 6eada26f
      Christophe Leroy authored
      Users of usercopy_warn() were removed by commit 53944f17 ("mm:
      remove HARDENED_USERCOPY_FALLBACK").
      
      Remove it.
      
      Link: https://lkml.kernel.org/r/5f26643fc70b05f8455b60b99c30c17d635fa640.1644231910.git.christophe.leroy@csgroup.eu
      
      
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Stephen Kitt <steve@sk2.org>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6eada26f
    • mm: only re-generate demotion targets when a numa node changes its N_CPU state · 734c1570
      Oscar Salvador authored
      Abhishek reported that after patch [1], hotplug operations are taking
      roughly double the expected time.  [2]
      
      The reason behind this is that the CPU callbacks that
      migrate_on_reclaim_init() sets up always call
      set_migration_target_nodes() whenever a CPU is brought up or down.
      
      But we only care about numa nodes going from having cpus to becoming
      cpuless, and vice versa, as that influences the demotion_target order.
      
      We do already have two CPU callbacks (vmstat_cpu_online() and
      vmstat_cpu_dead()) that check exactly that, so get rid of the CPU
      callbacks in migrate_on_reclaim_init() and only call
      set_migration_target_nodes() from vmstat_cpu_{dead,online}() whenever a
      numa node changes its N_CPU state.
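
      An illustrative sketch (simplified, not the literal patch) of the kind
      of check the vmstat callbacks can make, so that the demotion targets are
      only regenerated when a node's N_CPU state actually flips:

          static int vmstat_cpu_online(unsigned int cpu)
          {
                  if (!node_state(cpu_to_node(cpu), N_CPU)) {
                          /* Node gained its first CPU: N_CPU flips, rebuild. */
                          node_set_state(cpu_to_node(cpu), N_CPU);
                          set_migration_target_nodes();
                  }
                  return 0;
          }

          static int vmstat_cpu_dead(unsigned int cpu)
          {
                  int node = cpu_to_node(cpu);

                  if (cpumask_weight(cpumask_of_node(node)) > 0)
                          return 0;       /* node still has online CPUs */

                  /* Node became cpuless: N_CPU flips, rebuild. */
                  node_clear_state(node, N_CPU);
                  set_migration_target_nodes();
                  return 0;
          }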
      
      [1] https://lore.kernel.org/linux-mm/20210721063926.3024591-2-ying.huang@intel.com/
      [2] https://lore.kernel.org/linux-mm/eb438ddd-2919-73d4-bd9f-b7eecdd9577a@linux.vnet.ibm.com/
      
      [osalvador@suse.de: add feedback from Huang Ying]
        Link: https://lkml.kernel.org/r/20220314150945.12694-1-osalvador@suse.de
      
      Link: https://lkml.kernel.org/r/20220310120749.23077-1-osalvador@suse.de
      Fixes: 884a6e5d ("mm/migrate: update node demotion order on hotplug events")
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reported-by: Abhishek Goel <huntbag@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Abhishek Goel <huntbag@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      734c1570
    • drivers/base/memory: determine and store zone for single-zone memory blocks · 395f6081
      David Hildenbrand authored
      test_pages_in_a_zone() is just another nasty PFN walker that can easily
      stumble over ZONE_DEVICE memory ranges falling into the same memory block
      as ordinary system RAM: the memmap of parts of these ranges might possibly
      be uninitialized.  In fact, we observed (on an older kernel) with UBSAN:
      
        UBSAN: Undefined behaviour in ./include/linux/mm.h:1133:50
        index 7 is out of range for type 'zone [5]'
        CPU: 121 PID: 35603 Comm: read_all Kdump: loaded Tainted: [...]
        Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.12.2 11/15/2019
        Call Trace:
         dump_stack+0x9a/0xf0
         ubsan_epilogue+0x9/0x7a
         __ubsan_handle_out_of_bounds+0x13a/0x181
         test_pages_in_a_zone+0x3c4/0x500
         show_valid_zones+0x1fa/0x380
         dev_attr_show+0x43/0xb0
         sysfs_kf_seq_show+0x1c5/0x440
         seq_read+0x49d/0x1190
         vfs_read+0xff/0x300
         ksys_read+0xb8/0x170
         do_syscall_64+0xa5/0x4b0
         entry_SYSCALL_64_after_hwframe+0x6a/0xdf
        RIP: 0033:0x7f01f4439b52
      
      We seem to stumble over a memmap that contains a garbage zone id.  While
      we could try inserting pfn_to_online_page() calls, it will just make
      memory offlining slower, because we use test_pages_in_a_zone() to make
      sure we're offlining pages that all belong to the same zone.
      
      Let's just get rid of this PFN walker and determine the single zone of a
      memory block -- if any -- for early memory blocks during boot.  For memory
      onlining, we know the single zone already.  Let's avoid any additional
      memmap scanning and just rely on the zone information available during
      boot.
      
      For memory hot(un)plug, we only really care about memory blocks that:
      * span a single zone (and, thereby, a single node)
      * are completely System RAM (IOW, no holes, no ZONE_DEVICE)
      If one of these conditions is not met, we reject memory offlining.
      Hotplugged memory blocks (starting out offline) always meet both
      conditions.
      
      There are three scenarios to handle:
      
      (1) Memory hot(un)plug
      
      A memory block with zone == NULL cannot be offlined, corresponding to
      our previous test_pages_in_a_zone() check.
      
      After successful memory onlining/offlining, we simply set the zone
      accordingly.
      * Memory onlining: set the zone we just used for onlining
      * Memory offlining: set zone = NULL
      
      So a hotplugged memory block starts with zone = NULL. Once memory
      onlining is done, we set the proper zone.
      
      (2) Boot memory with !CONFIG_NUMA
      
      We know that there is just a single pgdat, so we simply scan all zones
      of that pgdat for an intersection with our memory block PFN range when
      adding the memory block. If more than one zone intersects (e.g., DMA and
      DMA32 on x86 for the first memory block) we set zone = NULL and
      consequently mimic what test_pages_in_a_zone() used to do.
      
      (3) Boot memory with CONFIG_NUMA
      
      At the point in time we create the memory block devices during boot, we
      don't know yet which nodes *actually* span a memory block. While we could
      scan all zones of all nodes for intersections, overlapping nodes complicate
      the situation and scanning all nodes is possibly expensive. But that
      problem has already been solved by the code that sets the node of a memory
      block and creates the link in the sysfs --
      do_register_memory_block_under_node().
      
      So, we hook into the code that sets the node id for a memory block. If
      we already have a different node id set for the memory block, we know
      that multiple nodes *actually* have PFNs falling into our memory block:
      we set zone = NULL and consequently mimic what test_pages_in_a_zone() used
      to do. If there is no node id set, we do the same as (2) for the given
      node.
      
      Note that the call order in driver_init() is:
      -> memory_dev_init(): create memory block devices
      -> node_dev_init(): link memory block devices to the node and set the
      		    node id
      
      So, in summary, we detect whether a single zone is responsible for a
      given memory block and, if so, store that zone in the memory block,
      updating it during memory onlining/offlining.
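
      A condensed, illustrative sketch of the boot-time detection for cases
      (2)/(3) (helper name simplified, not the exact function from the patch):
      scan the candidate node's zones and keep a zone only if exactly one of
      them intersects the block's PFN range:

          /* Illustrative sketch only. */
          static struct zone *single_zone_for_block(int nid, unsigned long start_pfn,
                                                    unsigned long nr_pages)
          {
                  struct zone *zone, *matching = NULL;
                  int i;

                  for (i = 0; i < MAX_NR_ZONES; i++) {
                          zone = &NODE_DATA(nid)->node_zones[i];
                          if (!zone_intersects(zone, start_pfn, nr_pages))
                                  continue;
                          if (matching)
                                  return NULL;    /* more than one zone: give up */
                          matching = zone;
                  }
                  return matching;        /* NULL if no zone intersects either */
          }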
      
      Link: https://lkml.kernel.org/r/20220210184359.235565-3-david@redhat.com
      
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Rafael Parra <rparrazo@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rafael Parra <rparrazo@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      395f6081
    • drivers/base/node: rename link_mem_sections() to register_memory_block_under_node() · cc651559
      David Hildenbrand authored
      Patch series "drivers/base/memory: determine and store zone for single-zone memory blocks", v2.
      
      I remember talking to Michal in the past about removing
      test_pages_in_a_zone(), which we use for:
      * verifying that a memory block we intend to offline is really only managed
        by a single zone. We don't support offlining of memory blocks that are
        managed by multiple zones (e.g., multiple nodes, DMA and DMA32)
      * exposing that zone to user space via
        /sys/devices/system/memory/memory*/valid_zones
      
      Now that I have identified some more cases where test_pages_in_a_zone()
      might go wrong, and we received an UBSAN report (see patch #2), let's
      get rid of this PFN walker.
      
      So instead of detecting the zone at runtime with test_pages_in_a_zone() by
      scanning the memmap, let's determine and remember for each memory block if
      it's managed by a single zone.  The stored zone can then be used for the
      above two cases, avoiding a manual lookup using test_pages_in_a_zone().
      
      This avoids eventually stumbling over uninitialized memmaps in corner
      cases, especially when ZONE_DEVICE ranges partly fall into memory blocks
      (that are responsible for managing System RAM).
      
      Handling memory onlining is easy, because we online to exactly one zone.
      Handling boot memory is more tricky, because we want to avoid scanning all
      zones of all nodes to detect possible zones that overlap with the physical
      memory region of interest.  Fortunately, we already have code that
      determines the applicable nodes for a memory block, to create sysfs links
      -- we'll hook into that.
      
      Patch #1 is a simple cleanup I have had lying around for a while.
      Patch #2 contains the main logic to remove test_pages_in_a_zone() and
      further details.
      
      [1] https://lkml.kernel.org/r/20220128144540.153902-1-david@redhat.com
      [2] https://lkml.kernel.org/r/20220203105212.30385-1-david@redhat.com
      
      This patch (of 2):
      
      Let's adjust the stale terminology, making it match
      unregister_memory_block_under_nodes() and
      do_register_memory_block_under_node().  We're dealing with memory block
      devices, which span 1..X memory sections.
      
      Link: https://lkml.kernel.org/r/20220210184359.235565-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20220210184359.235565-2-david@redhat.com
      
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Oscar Salvador <osalvador@suse.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Rafael Parra <rparrazo@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cc651559
    • drivers/base/node: consolidate node device subsystem initialization in node_dev_init() · 2848a28b
      David Hildenbrand authored
      ...  and call node_dev_init() after memory_dev_init() from driver_init(),
      so before any of the existing arch/subsys calls.  All online nodes should
      be known at that point: early during boot, arch code determines node and
      zone ranges and sets the relevant nodes online; usually this happens in
      setup_arch().
      
      This is in line with memory_dev_init(), which initializes the memory
      device subsystem and creates all memory block devices.
      
      Similar to memory_dev_init(), panic() if anything goes wrong; we don't
      want to continue with such basic initialization errors.
      
      The important part is that node_dev_init() gets called after
      memory_dev_init() and after cpu_dev_init(), but before any of the relevant
      archs call register_cpu() to register the new cpu device under the node
      device.  The latter should be the case for the current users of
      topology_init().
      
      Link: https://lkml.kernel.org/r/20220203105212.30385-1-david@redhat.com
      
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Tested-by: Anatoly Pugachev <matorola@gmail.com> (sparc64)
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2848a28b
    • mm, memory_hotplug: reorganize new pgdat initialization · 70b5b46a
      Michal Hocko authored
      When a !node_online node is brought up, it needs hotplug-specific
      initialization, because the node could either be not yet initialized or
      have been recycled after a previous hotremove.  hotadd_init_pgdat is
      responsible for that.
      
      Internal pgdat state is initialized at two places currently
      	- hotadd_init_pgdat
      	- free_area_init_core_hotplug
      
      There is no real clear cut between what should go where, but this patch
      chooses to move the whole internal state initialization into
      free_area_init_core_hotplug.  hotadd_init_pgdat is still responsible for
      pulling all the parts together - most notably for initializing
      zonelists, because those depend on the overall topology.
      
      This patch doesn't introduce any functional change.
      
      Link: https://lkml.kernel.org/r/20220127085305.20890-5-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Rafael Aquini <raquini@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alexey Makhalov <amakhalov@vmware.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nico Pache <npache@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      70b5b46a
    • mm, memory_hotplug: drop arch_free_nodedata · 390511e1
      Michal Hocko authored
      Prior to "mm: handle uninitialized numa nodes gracefully", memory
      hotplug used to allocate the pgdat when memory was added to a node
      (hotadd_init_pgdat).  arch_free_nodedata has only been used in the
      failure path, because once the pgdat is exported (to be visible via
      NODE_DATA(nid)) it cannot really be freed, as there is no
      synchronization available for that.

      The pgdat is now allocated for each possible node, so memory hotplug
      never needs to use arch_free_nodedata; drop it.
      
      This patch doesn't introduce any functional change.
      
      Link: https://lkml.kernel.org/r/20220127085305.20890-4-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Rafael Aquini <raquini@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alexey Makhalov <amakhalov@vmware.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Nico Pache <npache@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      390511e1
    • mm: handle uninitialized numa nodes gracefully · 09f49dca
      Michal Hocko authored
      We have had several reports [1][2][3] that the page allocator blows up
      when an allocation from a possible node is requested.  The underlying
      reason is that NODE_DATA for the specific node is not allocated.
      
      NUMA specific initialization is arch specific and it can vary a lot.  E.g.
      x86 tries to initialize all nodes that have some cpu affinity (see
      init_cpu_to_node) but this can be insufficient because the node might be
      cpuless for example.
      
      One way to address this problem would be to check for !node_online nodes
      when trying to get a zonelist and silently fall back to another node.
      That is unfortunately adding a branch into allocator hot path and it
      doesn't handle any other potential NODE_DATA users.
      
      This patch takes a different approach (following the lead of [3]) and
      preallocates the pgdat for all possible nodes in arch-independent code -
      free_area_init.  All uninitialized nodes are treated as memoryless
      nodes.  The node_state of the node is not changed, because that would
      lead to other side effects - e.g. a sysfs representation of such a node
      - and from past discussions [4] it is known that some tools might have
      problems digesting that.

      A newly allocated pgdat only gets a minimal initialization and the rest
      of the work is expected to be done by memory hotplug - hotadd_new_pgdat
      (renamed to hotadd_init_pgdat).

      generic_alloc_nodedata is changed to use the memblock allocator because
      neither the page nor the slab allocator is available at the stage when
      all pgdats are allocated.  Hotplug doesn't allocate the pgdat anymore,
      so we can use the early boot allocator.  The only arch-specific
      implementation is ia64, and that is changed to use the early allocator
      as well.
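
      An illustrative sketch of the preallocation idea (simplified
      pseudo-kernel C, not the literal hunk of this patch): during
      free_area_init(), walk all possible nodes and give the offline ones a
      minimally initialized, memoryless pgdat:

          /* Illustrative sketch only. */
          static void preallocate_offline_pgdats(void)
          {
                  int nid;

                  for_each_node(nid) {
                          pg_data_t *pgdat;

                          if (node_online(nid))
                                  continue;       /* already fully set up */

                          /* Early boot: only the memblock allocator is usable. */
                          pgdat = memblock_alloc(sizeof(*pgdat), SMP_CACHE_BYTES);
                          if (!pgdat)
                                  panic("cannot allocate pgdat for node %d\n", nid);

                          /* Minimal init; memory hotplug finishes the job later. */
                          pgdat->node_id = nid;
                          arch_refresh_nodedata(nid, pgdat);
                  }
          }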
      
      [1] http://lkml.kernel.org/r/20211101201312.11589-1-amakhalov@vmware.com
      [2] http://lkml.kernel.org/r/20211207224013.880775-1-npache@redhat.com
      [3] http://lkml.kernel.org/r/20190114082416.30939-1-mhocko@kernel.org
      [4] http://lkml.kernel.org/r/20200428093836.27190-1-srikar@linux.vnet.ibm.com
      
      [akpm@linux-foundation.org: replace comment, per Mike]
      
      Link: https://lkml.kernel.org/r/Yfe7RBeLCijnWBON@dhcp22.suse.cz
      
      
      Reported-by: Alexey Makhalov <amakhalov@vmware.com>
      Tested-by: Alexey Makhalov <amakhalov@vmware.com>
      Reported-by: Nico Pache <npache@redhat.com>
      Acked-by: Rafael Aquini <raquini@redhat.com>
      Tested-by: Rafael Aquini <raquini@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09f49dca
    • mm, memory_hotplug: make arch_alloc_nodedata independent on CONFIG_MEMORY_HOTPLUG · e930d999
      Michal Hocko authored
      Patch series "mm, memory_hotplug: handle uninitialized numa node gracefully".
      
      The core of the fix is patch 2 which also links existing bug reports.  The
      high level goal is to have all possible numa nodes have their pgdat
      allocated and initialized so
      
      	for_each_possible_node(nid)
      		NODE_DATA(nid)
      
      will never return garbage.  This has proven to be a problem in several
      places when an offline numa node is used for an allocation, just to
      realize that node_data, and therefore the allocation fallback zonelists,
      are not initialized and such an allocation request blows up.
      
      There were attempts to address that by checking node_online in several
      places including the page allocator.  This patchset approaches the problem
      from a different perspective and instead of special casing, which just
      adds a runtime overhead, it allocates pglist_data for each possible node.
      This can add some memory overhead for platforms with a high number of
      possible nodes if they do not contain any memory.  This should be a
      rather rare configuration, though.
      
      How to test this?  David has provided an excellent howto:
      http://lkml.kernel.org/r/6e5ebc19-890c-b6dd-1924-9f25c441010d@redhat.com
      
      Patches 1 and 3-6 are mostly cleanups.  The patchset has been reviewed by
      Rafael (thanks!) and the core fix tested by Rafael and Alexey (thanks to
      both).  David has tested as per instructions above and hasn't found any
      fallouts in the memory hotplug scenarios.
      
      This patch (of 6):
      
      This is a preparatory patch and it doesn't introduce any functional
      change.  It merely pulls arch_alloc_nodedata (and co.) out of
      CONFIG_MEMORY_HOTPLUG, because the following patch will need to call it
      from the generic MM code.
      
      Link: https://lkml.kernel.org/r/20220127085305.20890-1-mhocko@kernel.org
      Link: https://lkml.kernel.org/r/20220127085305.20890-2-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Rafael Aquini <raquini@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Cc: Alexey Makhalov <amakhalov@vmware.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Nico Pache <npache@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e930d999
    • mm/vmstat: add event for ksm swapping in copy · 4d45c3af
      Yang Yang authored
      When what used to be a KSM page is faulted in from swap, and that page
      had been swapped in before, the system has to make a copy and leaves
      remerging the pages to a later pass of ksmd.

      That is not good for performance; we'd better reduce this kind of copy.
      There are some ways to reduce it, for example lowering swappiness or
      madvise()ing the range with MADV_MERGEABLE.  So add this event to
      support doing such tuning, just like the patch "mm, THP, swap: add THP
      swapping out fallback counting".
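
      As a usage sketch, the new event shows up alongside the other KSM
      counters in /proc/vmstat and can be watched while tuning (the command
      simply filters for the KSM entries; the exact counter name is the one
      the patch adds):

          # grep -i ksm /proc/vmstat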
      
      Link: https://lkml.kernel.org/r/20220113023839.758845-1-yang.yang29@zte.com.cn
      
      
      Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
      Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Saravanan D <saravanand@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4d45c3af
    • NUMA balancing: optimize page placement for memory tiering system · c574bbe9
      Huang Ying authored
      With the advent of various new memory types, some machines will have
      multiple types of memory, e.g.  DRAM and PMEM (persistent memory).  The
      memory subsystem of these machines can be called a memory tiering
      system, because the performance of the different types of memory is
      usually different.

      In such a system, because the memory access pattern changes over time,
      some pages in the slow memory may become globally hot.  So in this
      patch, the NUMA balancing mechanism is enhanced to dynamically optimize
      page placement among the different memory types according to hot/cold.
      
      In a typical memory tiering system, there are CPUs, fast memory and slow
      memory in each physical NUMA node.  The CPUs and the fast memory will be
      put in one logical node (called fast memory node), while the slow memory
      will be put in another (faked) logical node (called slow memory node).
      That is, the fast memory is regarded as local while the slow memory is
      regarded as remote.  So it's possible for the recently accessed pages in
      the slow memory node to be promoted to the fast memory node via the
      existing NUMA balancing mechanism.
      
      The original NUMA balancing mechanism will stop migrating pages if the
      free memory of the target node falls below the high watermark.  This is
      a reasonable policy if there's only one memory type.  But it makes the
      original NUMA balancing mechanism almost useless for optimizing page
      placement among different memory types.  Details are as follows.

      It's the common case that the working-set size of the workload is
      larger than the size of the fast memory nodes; otherwise, it would be
      unnecessary to use the slow memory at all.  So, there are almost never
      enough free pages in the fast memory nodes, and the globally hot pages
      in the slow memory node cannot be promoted to the fast memory node.  To
      solve the issue, we have two choices, as follows:
      
      a. Ignore the free pages watermark checking when promoting hot pages
         from the slow memory node to the fast memory node.  This will
         create some memory pressure in the fast memory node, thus triggering
         memory reclaim.  That way, the cold pages in the fast memory node
         will be demoted to the slow memory node.
      
      b. Define a new watermark called wmark_promo which is higher than
         wmark_high, and have kswapd reclaiming pages until free pages reach
         such watermark.  The scenario is as follows: when we want to promote
         hot-pages from a slow memory to a fast memory, but fast memory's free
         pages would go lower than high watermark with such promotion, we wake
         up kswapd with wmark_promo watermark in order to demote cold pages and
         free us up some space.  So, next time we want to promote hot-pages we
         might have a chance of doing so.
      
      Choice "a" may create high memory pressure in the fast memory node.
      If the memory pressure of the workload is high, the pressure may become
      so high that the memory allocation latency of the workload is affected,
      e.g.  direct reclaim may be triggered.

      Choice "b" works much better in this respect.  If the memory pressure
      of the workload is high, hot page promotion will stop earlier, because
      its allocation watermark is higher than that of a normal memory
      allocation.  So, choice "b" is implemented in this patch.  A new zone
      watermark (WMARK_PROMO) is added, which is larger than the high
      watermark and can be controlled via watermark_scale_factor.
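
      Illustratively, the new watermark slots in above the existing ones in
      the zone watermark enum (a sketch of the shape of the change, not the
      literal diff):

          enum zone_watermarks {
                  WMARK_MIN,
                  WMARK_LOW,
                  WMARK_HIGH,
                  WMARK_PROMO,    /* new: promotion target, above WMARK_HIGH */
                  NR_WMARK,
          };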
      
      In addition to the original page placement optimization among sockets,
      the NUMA balancing mechanism is extended to optimize page placement
      according to hot/cold among different memory types.  So the sysctl user
      space interface (numa_balancing) is extended in a backward-compatible
      way, as follows, so that users can enable/disable these functionalities
      individually.

      The sysctl is converted from a Boolean value to a bit field.  The flags
      are defined as follows (a usage example appears after the list):
      
      - 0: NUMA_BALANCING_DISABLED
      - 1: NUMA_BALANCING_NORMAL
      - 2: NUMA_BALANCING_MEMORY_TIERING
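
      For instance, enabling only the memory-tiering optimization could look
      like the following (a sketch; /proc/sys/kernel/numa_balancing is the
      standard sysctl path, and because the value is a bit field, writing 3
      would enable both modes):

          # echo 2 > /proc/sys/kernel/numa_balancing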
      
      We have tested the patch with the pmbench memory accessing benchmark
      with the 80:20 read/write ratio and the Gauss access address
      distribution on a 2-socket Intel server with Optane DC Persistent
      Memory Model.  The test results show that the pmbench score can improve
      by up to 95.9%.

      Thanks to Andrew Morton for helping fix the documentation format error.
      
      Link: https://lkml.kernel.org/r/20220221084529.1052339-3-ying.huang@intel.com
      
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Feng Tang <feng.tang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c574bbe9
    • NUMA Balancing: add page promotion counter · e39bb6be
      Huang Ying authored
      Patch series "NUMA balancing: optimize memory placement for memory tiering system", v13
      
      With the advent of various new memory types, some machines will have
      multiple types of memory, e.g.  DRAM and PMEM (persistent memory).  The
      memory subsystem of these machines can be called a memory tiering
      system, because the performance of the different types of memory is
      different.
      
      After commit c221c0b0 ("device-dax: "Hotplug" persistent memory for
      use like normal RAM"), the PMEM could be used as the cost-effective
      volatile memory in separate NUMA nodes.  In a typical memory tiering
      system, there are CPUs, DRAM and PMEM in each physical NUMA node.  The
      CPUs and the DRAM will be put in one logical node, while the PMEM will
      be put in another (faked) logical node.
      
      To optimize the overall system performance, hot pages should be placed
      in the DRAM node.  To do that, we need to identify the hot pages in the
      PMEM node and migrate them to the DRAM node via NUMA migration.

      In the original NUMA balancing, there is already a set of existing
      mechanisms to identify the pages recently accessed by the CPUs of a node
      and migrate those pages to the node.  So we can reuse these mechanisms
      to build the machinery to optimize page placement in the memory tiering
      system.  This is implemented in this patchset.

      On the other hand, cold pages should be placed in the PMEM node.  So,
      we also need to identify the cold pages in the DRAM node and migrate
      them to the PMEM node.
      
      In commit 26aa2d19 ("mm/migrate: demote pages during reclaim"), a
      mechanism to demote the cold DRAM pages to PMEM node under memory
      pressure is implemented.  Based on that, the cold DRAM pages can be
      demoted to PMEM node proactively to free some memory space on DRAM node
      to accommodate the promoted hot PMEM pages.  This is implemented in this
      patchset too.
      
      We have tested the solution with the pmbench memory accessing benchmark
      with the 80:20 read/write ratio and the Gauss access address
      distribution on a 2 socket Intel server with Optane DC Persistent Memory
      Model.  The test results show that the pmbench score can improve by up
      to 95.9%.
      
      This patch (of 3):
      
      In a system with multiple memory types, e.g.  DRAM and PMEM, the CPU
      and DRAM in one socket will be put in one NUMA node as before, while
      the PMEM will be put in another NUMA node, as described in commit
      c221c0b0 ("device-dax: "Hotplug" persistent memory for use like
      normal RAM").  So, the NUMA balancing mechanism will identify all PMEM
      accesses as remote accesses and try to promote the PMEM pages to DRAM.

      To distinguish the number of inter-type promoted pages from that of
      inter-socket migrated pages, a new vmstat counter is added.  The counter
      is per-node (counted in the target node), so it can be used to identify
      promotion imbalance among the NUMA nodes.
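
      As a usage sketch (the paths below are the standard per-node vmstat
      locations; the counter name is the one introduced by this patch), the
      per-node counts can be compared to spot such an imbalance:

          # grep -i promote /sys/devices/system/node/node0/vmstat
          # grep -i promote /sys/devices/system/node/node1/vmstat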
      
      Link: https://lkml.kernel.org/r/20220301085329.3210428-1-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20220221084529.1052339-1-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20220221084529.1052339-2-ying.huang@intel.com
      
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e39bb6be
    • mm/cma: provide option to opt out from exposing pages on activation failure · 27d121d0
      Hari Bathini authored
      Patch series "powerpc/fadump: handle CMA activation failure appropriately", v3.
      
      Commit 072355c1 ("mm/cma: expose all pages to the buddy if
      activation of an area fails") started exposing all pages to the buddy
      allocator on CMA activation failure.  But there can be CMA users that
      want to handle the reserved memory differently on CMA activation
      failure.

      Provide an option to opt out from exposing pages to the buddy allocator
      for such cases.
      
      Link: https://lkml.kernel.org/r/20220117075246.36072-1-hbathini@linux.ibm.com
      Link: https://lkml.kernel.org/r/20220117075246.36072-2-hbathini@linux.ibm.com
      
      
      Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com>
      Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      27d121d0
    • andrew.yang's avatar
      mm/migrate: fix race between lock page and clear PG_Isolated · 356ea386
      andrew.yang authored
      When memory is tight, the system may start compacting memory to satisfy
      large contiguous memory demands.  If a process tries to lock a memory
      page that is currently locked and isolated for compaction, it may wait a
      long time or even forever.  This is because compaction performs a
      non-atomic clear of PG_isolated while holding the page lock; this may
      overwrite PG_waiters, which was set by the process that failed to obtain
      the page lock and added itself to the wait queue for the lock.
      
        CPU1                            CPU2
        lock_page(page); (successful)
                                        lock_page(); (failed)
        __ClearPageIsolated(page);      SetPageWaiters(page) (may be overwritten)
        unlock_page(page);
      
      The solution is not to perform non-atomic operations on page flags
      while holding the page lock.
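
      A sketch of the difference between the two kinds of helpers; which of
      them exist for PG_isolated depends on how the flag is declared in
      page-flags.h:

      	__ClearPageIsolated(page);	/* non-atomic RMW: a concurrent SetPageWaiters() can be lost */
      	ClearPageIsolated(page);	/* atomic bit op: safe against concurrent page->flags updates */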
      
      Link: https://lkml.kernel.org/r/20220315030515.20263-1-andrew.yang@mediatek.com
      
      
      Signed-off-by: default avatarandrew.yang <andrew.yang@mediatek.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Vlastimil Babka" <vbabka@suse.cz>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "William Kucharski" <william.kucharski@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Nicholas Tang <nicholas.tang@mediatek.com>
      Cc: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      356ea386
    • Baolin Wang's avatar
      mm: compaction: cleanup the compaction trace events · abd4349f
      Baolin Wang authored
      As Steven suggested [1], we should pass the pointers to the trace event
      and dereference them there, rather than dereferencing them when calling
      the tracepoint function, so that no dereference happens while the
      tracepoint is disabled.
      
      [1] https://lkml.org/lkml/2021/11/3/409
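
      As an illustration of the pattern (not the patch's actual events), a
      trace event that takes a pointer and only dereferences it in
      TP_fast_assign, i.e. only when the event is enabled; the event name is
      hypothetical:

      	TRACE_EVENT(mm_compaction_example,	/* hypothetical event name */
      		TP_PROTO(struct zone *zone),
      		TP_ARGS(zone),
      		TP_STRUCT__entry(
      			__field(unsigned long, start_pfn)
      		),
      		TP_fast_assign(
      			/* dereference happens here, only when the event is enabled */
      			__entry->start_pfn = zone->zone_start_pfn;
      		),
      		TP_printk("start_pfn=%lu", __entry->start_pfn)
      	);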
      
      Link: https://lkml.kernel.org/r/4cd393b4d57f8f01ed72c001509b28e3a3b1a8c1.1646985115.git.baolin.wang@linux.alibaba.com
      
      
      Signed-off-by: default avatarBaolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      abd4349f
    • Hugh Dickins's avatar
      mm: __isolate_lru_page_prepare() in isolate_migratepages_block() · 89f6c88a
      Hugh Dickins authored
      __isolate_lru_page_prepare() conflates two unrelated functions, with the
      flags to one disjoint from the flags to the other; and it hides some of
      the important checks outside of isolate_migratepages_block(), where the
      sequence is better kept visible.  It comes from the days of lumpy
      reclaim, before compaction, when the combination made more sense.
      
      Move what's needed by mm/compaction.c isolate_migratepages_block() inline
      there, and what's needed by mm/vmscan.c isolate_lru_pages() inline there.
      
      Shorten "isolate_mode" to "mode", so the sequence of conditions is easier
      to read.  Declare a "mapping" variable, to save one call to page_mapping()
      (but not another: calling again after page is locked is necessary).
      Simplify isolate_lru_pages() with a "move_to" list pointer.
      
      Link: https://lkml.kernel.org/r/879d62a8-91cc-d3c6-fb3b-69768236df68@google.com
      
      
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: default avatarAlex Shi <alexs@kernel.org>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      89f6c88a
    • Hugh Dickins's avatar
      mm/fs: delete PF_SWAPWRITE · b698f0a1
      Hugh Dickins authored
      PF_SWAPWRITE has been redundant since v3.2 commit ee72886d ("mm:
      vmscan: do not writeback filesystem pages in direct reclaim").
      
      Coincidentally, NeilBrown's current patch "remove inode_congested()"
      deletes may_write_to_inode(), which appeared to be the one function which
      took notice of PF_SWAPWRITE.  But if you study the old logic, and the
      conditions under which may_write_to_inode() was called, you discover that
      flag and function have been pointless for a decade.
      
      Link: https://lkml.kernel.org/r/75e80e7-742d-e3bd-531-614db8961e4@google.com
      
      
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Jan Kara <jack@suse.de>
      Cc: "Darrick J. Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b698f0a1
    • Nadav Amit's avatar
      userfaultfd: provide unmasked address on page-fault · 824ddc60
      Nadav Amit authored
      Userfaultfd is supposed to provide the full (i.e., unmasked) address of
      the faulting access back to userspace.  However, that has not been the
      case for quite some time.
      
      Even running "userfaultfd_demo" from the userfaultfd man page provides the
      wrong output (and contradicts the man page).  Notice that
      "UFFD_EVENT_PAGEFAULT event" shows the masked address (7fc5e30b3000) and
      not the first read address (0x7fc5e30b300f).
      
      	Address returned by mmap() = 0x7fc5e30b3000
      
      	fault_handler_thread():
      	    poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
      	    UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fc5e30b3000
      		(uffdio_copy.copy returned 4096)
      	Read address 0x7fc5e30b300f in main(): A
      	Read address 0x7fc5e30b340f in main(): A
      	Read address 0x7fc5e30b380f in main(): A
      	Read address 0x7fc5e30b3c0f in main(): A
      
      The exact address is useful for various reasons and specifically for
      prefetching decisions.  If it is known that the memory is populated by
      certain objects whose size is not page-aligned, then based on the faulting
      address, the uffd-monitor can decide whether to prefetch and prefault the
      adjacent page.
      
      This bug has been in the kernel for quite some time: since commit
      1a29d85e ("mm: use vmf->address instead of vmf->virtual_address"),
      which dates back to 2016.  A concern has been raised that existing
      userspace applications might rely on the old/wrong behavior in which the
      address is masked.  Therefore, it was suggested to provide the masked
      address unless the user explicitly asks for the exact address.
      
      Add a new userfaultfd feature UFFD_FEATURE_EXACT_ADDRESS to direct
      userfaultfd to provide the exact address.  Add a new "real_address" field
      to vmf to hold the unmasked address.  Provide the address to userspace
      accordingly.
      
      Initialize real_address in various code-paths to be consistent with
      address, even when it is not used, to be on the safe side.
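
      From userspace, opting in would look roughly like the following; this
      is a sketch of the UFFDIO_API handshake, assuming a kernel that
      advertises the new feature bit:

      	#include <fcntl.h>
      	#include <sys/ioctl.h>
      	#include <sys/syscall.h>
      	#include <unistd.h>
      	#include <linux/userfaultfd.h>

      	/* minimal sketch: open a userfaultfd and request exact fault addresses */
      	static int open_uffd_exact_addr(void)
      	{
      		int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
      		struct uffdio_api api = {
      			.api = UFFD_API,
      			.features = UFFD_FEATURE_EXACT_ADDRESS,
      		};

      		if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) < 0)
      			return -1;

      		/*
      		 * UFFD_EVENT_PAGEFAULT now reports the unmasked address in
      		 * msg.arg.pagefault.address instead of the page-aligned one.
      		 */
      		return uffd;
      	}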
      
      [namit@vmware.com: initialize real_address on all code paths, per Jan]
        Link: https://lkml.kernel.org/r/20220226022655.350562-1-namit@vmware.com
      [akpm@linux-foundation.org: fix typo in comment, per Jan]
      
      Link: https://lkml.kernel.org/r/20220218041003.3508-1-namit@vmware.com
      
      
      Signed-off-by: default avatarNadav Amit <namit@vmware.com>
      Acked-by: default avatarPeter Xu <peterx@redhat.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      824ddc60
    • Muchun Song's avatar
      mm: sparsemem: move vmemmap related to HugeTLB to CONFIG_HUGETLB_PAGE_FREE_VMEMMAP · e5408417
      Muchun Song authored
      The vmemmap_remap_free/alloc functions are relevant only to HugeTLB, so
      move them under the scope of CONFIG_HUGETLB_PAGE_FREE_VMEMMAP.
      
      Link: https://lkml.kernel.org/r/20211101031651.75851-6-songmuchun@bytedance.com
      
      
      Signed-off-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Reviewed-by: default avatarBarry Song <song.bao.hua@hisilicon.com>
      Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e5408417
    • Muchun Song's avatar
      mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key · a6b40850
      Muchun Song authored
      page_fixed_fake_head() is used throughout memory management, and its
      conditional check requires reading a global variable.  Although the
      overhead of this check may be small, it increases when the memory cache
      comes under pressure.  Also, the global variable is not modified after
      system boot, so it is a very good fit for the static key mechanism.
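
      A minimal sketch of the pattern, with illustrative names (the actual
      key name and config wiring in the patch may differ):

      	DEFINE_STATIC_KEY_FALSE(hugetlb_free_vmemmap_enabled_key);

      	static __always_inline bool hugetlb_free_vmemmap_enabled(void)
      	{
      		/* patched branch: no memory load on the hot path */
      		return static_branch_unlikely(&hugetlb_free_vmemmap_enabled_key);
      	}

      	/* flipped once during early boot, e.g. from a __setup() handler */
      	static_branch_enable(&hugetlb_free_vmemmap_enabled_key);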
      
      Link: https://lkml.kernel.org/r/20211101031651.75851-3-songmuchun@bytedance.com
      
      
      Signed-off-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Reviewed-by: default avatarBarry Song <song.bao.hua@hisilicon.com>
      Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a6b40850
    • Muchun Song's avatar
      mm: hugetlb: free the 2nd vmemmap page associated with each HugeTLB page · e7d32485
      Muchun Song authored
      Patch series "Free the 2nd vmemmap page associated with each HugeTLB
      page", v7.
      
      This series can minimize the overhead of struct page for 2MB HugeTLB
      pages significantly.  It further reduces the overhead of struct page by
      12.5% for a 2MB HugeTLB compared to the previous approach, which means
      2GB per 1TB HugeTLB.  It is a nice gain.  Comments and reviews are
      welcome.  Thanks.
      
      For the main implementation and details, refer to the commit log of
      patch 1.  In this series, I have changed the following four helpers; the
      following table shows the impact on the overhead of those helpers.
      
      	+------------------+-----------------------+
      	|       APIs       | head page | tail page |
      	+------------------+-----------+-----------+
      	|    PageHead()    |     Y     |     N     |
      	+------------------+-----------+-----------+
      	|    PageTail()    |     Y     |     N     |
      	+------------------+-----------+-----------+
      	|  PageCompound()  |     N     |     N     |
      	+------------------+-----------+-----------+
      	|  compound_head() |     Y     |     N     |
      	+------------------+-----------+-----------+
      
      	Y: Overhead is increased.
      	N: Overhead is _NOT_ increased.
      
      It shows that the overhead of those helpers on a tail page doesn't
      change between "hugetlb_free_vmemmap=on" and "hugetlb_free_vmemmap=off",
      but the overhead on a head page will be increased when
      "hugetlb_free_vmemmap=on" (except for PageCompound()).  So I believe
      that Matthew Wilcox's folio series will help with this.
      
      The users of PageHead() and PageTail() are far fewer than those of
      compound_head(), and most users of PageTail() are in VM_BUG_ON() checks,
      so I have done some tests on the overhead of compound_head() on head
      pages.
      
      I have tested the overhead of calling compound_head() on a head page,
      which is 2.11ns (measured by calling compound_head() 10 million times
      and averaging).
      
      For a head page whose address is not aligned with PAGE_SIZE, or for a
      non-compound page, the overhead of compound_head() is 2.54ns, an
      increase of 20%.  For a head page whose address is aligned with
      PAGE_SIZE, the overhead of compound_head() is 2.97ns, an increase of
      40%.  Most pages are the former.  I do not think the overhead is
      significant since the overhead of compound_head() itself is low.
      
      This patch (of 5):
      
      This patch minimizes the overhead of struct page for 2MB HugeTLB pages
      significantly.  It further reduces the overhead of struct page by 12.5%
      for a 2MB HugeTLB compared to the previous approach, which means 2GB per
      1TB HugeTLB (2MB type).
      
      After the feature "Free some vmemmap pages of HugeTLB page" is
      enabled, the mapping of the vmemmap addresses associated with a 2MB
      HugeTLB page becomes the figure below.
      
           HugeTLB                    struct pages(8 pages)         page frame(8 pages)
       +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+---> PG_head
       |           |                     |     0     | -------------> |     0     |
       |           |                     +-----------+                +-----------+
       |           |                     |     1     | -------------> |     1     |
       |           |                     +-----------+                +-----------+
       |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
       |           |                     +-----------+                   | | | | |
       |           |                     |     3     | ------------------+ | | | |
       |           |                     +-----------+                     | | | |
       |           |                     |     4     | --------------------+ | | |
       |    2MB    |                     +-----------+                       | | |
       |           |                     |     5     | ----------------------+ | |
       |           |                     +-----------+                         | |
       |           |                     |     6     | ------------------------+ |
       |           |                     +-----------+                           |
       |           |                     |     7     | --------------------------+
       |           |                     +-----------+
       |           |
       |           |
       |           |
       +-----------+
      
      As we can see, the 2nd vmemmap page frame (indexed by 1) is reused and
      remapped.  However, the 2nd vmemmap page frame can also be freed to
      the buddy allocator, in which case we can change the mapping from the
      figure above to the figure below.
      
          HugeTLB                    struct pages(8 pages)         page frame(8 pages)
       +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+---> PG_head
       |           |                     |     0     | -------------> |     0     |
       |           |                     +-----------+                +-----------+
       |           |                     |     1     | ---------------^ ^ ^ ^ ^ ^ ^
       |           |                     +-----------+                  | | | | | |
       |           |                     |     2     | -----------------+ | | | | |
       |           |                     +-----------+                    | | | | |
       |           |                     |     3     | -------------------+ | | | |
       |           |                     +-----------+                      | | | |
       |           |                     |     4     | ---------------------+ | | |
       |    2MB    |                     +-----------+                        | | |
       |           |                     |     5     | -----------------------+ | |
       |           |                     +-----------+                          | |
       |           |                     |     6     | -------------------------+ |
       |           |                     +-----------+                            |
       |           |                     |     7     | ---------------------------+
       |           |                     +-----------+
       |           |
       |           |
       |           |
       +-----------+
      
      After we do this, all tail vmemmap pages (1-7) are mapped to the head
      vmemmap page frame (0).  In other words, there is more than one page
      struct with PG_head associated with each HugeTLB page.  We __know__ that
      there is only one real head page struct; the tail page structs with
      PG_head are fake head page structs.  We need an approach to distinguish
      between those two different types of page structs so that
      compound_head(), PageHead() and PageTail() can work properly when the
      parameter is a tail page struct that has PG_head set.
      
      The following code snippet describes how to distinguish between real and
      fake head page structs.
      
      	if (test_bit(PG_head, &page->flags)) {
      		unsigned long head = READ_ONCE(page[1].compound_head);
      
      		if (head & 1) {
      			if (head == (unsigned long)page + 1)
      				==> head page struct
      			else
      				==> tail page struct
      		} else
      			==> head page struct
      	}
      
      We can safely access the fields of @page[1] when @page has PG_head set,
      because @page is a compound page composed of at least two contiguous
      pages.
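
      Folded into a helper, the check could look roughly like this; the
      function name is illustrative and the real helper in the patch may
      differ in detail:

      	static __always_inline const struct page *fixup_fake_head(const struct page *page)
      	{
      		if (test_bit(PG_head, &page->flags)) {
      			unsigned long head = READ_ONCE(page[1].compound_head);

      			/* bit 0 set: page[1] records the real head page */
      			if (head & 1)
      				return (const struct page *)(head - 1);
      		}
      		return page;
      	}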
      
      [songmuchun@bytedance.com: restore lost comment changes]
      
      Link: https://lkml.kernel.org/r/20211101031651.75851-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20211101031651.75851-2-songmuchun@bytedance.com
      
      
      Signed-off-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Reviewed-by: default avatarBarry Song <song.bao.hua@hisilicon.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e7d32485
    • Vlastimil Babka's avatar
      mm, fault-injection: declare should_fail_alloc_page() · 1e7a8181
      Vlastimil Babka authored
      The mm/ directory can almost fully be built with W=1, which would help
      in local development.  One remaining issue is the missing prototype for
      should_fail_alloc_page().  Thus add it next to the should_failslab()
      prototype.
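
      The added declaration would look like the line below, assuming the
      signature matches the gfp_t/order definition in mm/page_alloc.c:

      	bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order);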
      
      Note that the previous attempt by commit f7173090 ("mm/page_alloc: make
      should_fail_alloc_page() static") had to be reverted by commit
      54aa3866, as it caused an unresolved symbol error with
      CONFIG_DEBUG_INFO_BTF=y.
      
      Link: https://lkml.kernel.org/r/20220314165724.16071-1-vbabka@suse.cz
      
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1e7a8181
    • Miaohe Lin's avatar
      mm/memory-failure.c: fix race with changing page compound again · 888af270
      Miaohe Lin authored
      Patch series "A few fixup patches for memory failure", v2.
      
      This series contains a few patches to fix the race with a page's
      compound status changing, to make non-LRU movable pages unhandlable, and
      so on.  More details can be found in the respective changelogs.
      
      There is a race window where, after we get the compound_head, the
      hugetlb page could be freed to the buddy allocator, or even changed to
      another compound page, just before we try to get the hwpoison page.
      Think about the below race window:
      
        CPU 1					  CPU 2
        memory_failure_hugetlb
        struct page *head = compound_head(p);
      					  hugetlb page might be freed to
      					  buddy, or even changed to another
      					  compound page.
      
        get_hwpoison_page -- page is not what we want now...
      
      If this race happens, just bail out.  Also MF_MSG_DIFFERENT_PAGE_SIZE is
      introduced to record this event.
      
      [akpm@linux-foundation.org: s@/**@/*@, per Naoya Horiguchi]
      
      Link: https://lkml.kernel.org/r/20220312074613.4798-1-linmiaohe@huawei.com
      L...
      888af270
    • Oscar Salvador's avatar
      arch/x86/mm/numa: Do not initialize nodes twice · 1ca75fa7
      Oscar Salvador authored
      On x86, prior to ("mm: handle uninitialized numa nodes gracefully"), NUMA
      nodes could be allocated at three different places.
      
       - numa_register_memblks
       - init_cpu_to_node
       - init_gi_nodes
      
      All these calls happen at setup_arch, and have the following order:
      
      setup_arch
        ...
        x86_numa_init
         numa_init
          numa_register_memblks
        ...
        init_cpu_to_node
         init_memory_less_node
          alloc_node_data
          free_area_init_memoryless_node
        init_gi_nodes
         init_memory_less_node
          alloc_node_data
          free_area_init_memoryless_node
      
      numa_register_memblks() is only interested in those nodes which have
      memory, so it skips over any memoryless node it finds.  Later on, when
      we have read ACPI's SRAT table, we call init_cpu_to_node() and
      init_gi_nodes(), which initialize any memoryless nodes we might have
      that have either CPU or Initiator affinity, meaning we allocate a
      pg_data_t struct for them and mark them as ONLINE.
      
      So far so good, but the thing is that after ("mm: handle uninitialized
      numa nodes gracefully"), we allocate all possible NUMA nodes in
      free_area_init(), meaning we have a picture like the following:
      
      setup_arch
        x86_numa_init
         numa_init
          numa_register_memblks  <-- allocate non-memoryless node
        x86_init.paging.pagetable_init
         ...
          free_area_init
           free_area_init_memoryless <-- allocate memoryless node
        init_cpu_to_node
         alloc_node_data             <-- allocate memoryless node with CPU
         free_area_init_memoryless_node
        init_gi_nodes
         alloc_node_data             <-- allocate memoryless node with Initiator
         free_area_init_memoryless_node
      
      free_area_init() already allocates all possible NUMA nodes, but
      init_cpu_to_node() and init_gi_nodes() are clueless about that, so they
      go ahead and allocate a new pg_data_t struct without checking anything,
      meaning we end up allocating twice.
      
      It should be made clear that this only happens when a memoryless NUMA
      node happens to have CPU/Initiator affinity.
      
      So get rid of init_memory_less_node() and just set the node online.
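
      The replacement, sketched: the pg_data_t is already allocated by
      free_area_init(), so the callers only need to mark the node online (the
      surrounding loop and context here are illustrative):

      	if (!node_online(node))
      		node_set_online(node);	/* pg_data_t already allocated in free_area_init() */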
      
      Note that setting the node online is needed; otherwise we choke further
      down the chain when bringup_nonboot_cpus() ends up calling
      __try_online_node()->register_one_node()->...  and we blow up in
      bus_add_device().  As can be seen here:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000060
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 0 P4D 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.0-rc4-1-default+ #45
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/4
        RIP: 0010:bus_add_device+0x5a/0x140
        Code: 8b 74 24 20 48 89 df e8 84 96 ff ff 85 c0 89 c5 75 38 48 8b 53 50 48 85 d2 0f 84 bb 00 004
        RSP: 0000:ffffc9000022bd10 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffff888100987400 RCX: ffff8881003e4e19
        RDX: ffff8881009a5e00 RSI: ffff888100987400 RDI: ffff888100987400
        RBP: 0000000000000000 R08: ffff8881003e4e18 R09: ffff8881003e4c98
        R10: 0000000000000000 R11: ffff888100402bc0 R12: ffffffff822ceba0
        R13: 0000000000000000 R14: ffff888100987400 R15: 0000000000000000
        FS:  0000000000000000(0000) GS:ffff88853fc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000060 CR3: 000000000200a001 CR4: 00000000001706b0
        Call Trace:
         device_add+0x4c0/0x910
         __register_one_node+0x97/0x2d0
         __try_online_node+0x85/0xc0
         try_online_node+0x25/0x40
         cpu_up+0x4f/0x100
         bringup_nonboot_cpus+0x4f/0x60
         smp_init+0x26/0x79
         kernel_init_freeable+0x130/0x2f1
         kernel_init+0x17/0x150
         ret_from_fork+0x22/0x30
      
      The reason is simple: by the time bringup_nonboot_cpus() gets called, we
      have not registered the node_subsys bus yet, so we crash when
      bus_add_device() tries to dereference bus()->p.
      
      The following shows the order of the calls:
      
      kernel_init_freeable
       smp_init
        bringup_nonboot_cpus
         ...
           bus_add_device()      <- we did not register node_subsys yet
       do_basic_setup
        do_initcalls
         postcore_initcall(register_node_type);
          register_node_type
           subsys_system_register
            subsys_register
             bus_register         <- register node_subsys bus
      
      Why does setting the node online save us then?  Simply because
      __try_online_node() backs off when the node is already online, meaning
      we do not end up calling register_one_node() in the first place.
      
      This is subtle and broken, and it deserves deep analysis and thought
      about how to put it into shape, but for now let us have this easy fix
      for the memory leak issue.
      
      [osalvador@suse.de: add comments]
        Link: https://lkml.kernel.org/r/20220221142649.3457-1-osalvador@suse.de
      
      Link: https://lkml.kernel.org/r/20220218224302.5282-2-osalvador@suse.de
      
      
      Fixes: da4490c958ad ("mm: handle uninitialized numa nodes gracefully")
      Signed-off-by: default avatarOscar Salvador <osalvador@suse.de>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Rafael Aquini <raquini@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Alexey Makhalov <amakhalov@vmware.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1ca75fa7
    • David Hildenbrand's avatar
      mm: enforce pageblock_order < MAX_ORDER · b3d40a2b
      David Hildenbrand authored
      Some places in the kernel don't really expect pageblock_order >=
      MAX_ORDER, and it looks like this is only possible in corner cases:
      
      1) With CONFIG_DEFERRED_STRUCT_PAGE_INIT we'll end up freeing pageblock_order
         pages via __free_pages_core(), which cannot possibly work.
      
      2) find_zone_movable_pfns_for_nodes() will round up the ZONE_MOVABLE
         start PFN to MAX_ORDER_NR_PAGES. Consequently, with a bigger
         pageblock_order, we could have a single pageblock partially managed by
         two zones.
      
      3) compaction code runs into __fragmentation_index() with order
         >= MAX_ORDER, when checking WARN_ON_ONCE(order >= MAX_ORDER). [1]
      
      4) mm/page_reporting.c won't be reporting any pages with default
         page_reporting_order == pageblock_order, as we'll be skipping the
         reporting loop inside page_reporting_process_zone().
      
      5) __rmqueue_fallback() will never be able to steal with
         ALLOC_NOFRAGMENT.
      
      pageblock_order >= MAX_ORDER is weird either way: it's a pure
      optimization for making alloc_contig_range(), as used for allocation of
      gigantic pages, a little more likely to succeed.  However, if there is
      demand for somewhat reliable allocation of gigantic pages, affected
      setups should be using CMA or boottime allocations instead.
      
      So let's make sure that pageblock_order < MAX_ORDER and simplify.
      
      [1] https://lkml.kernel.org/r/87r189a2ks.fsf@linux.ibm.com
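
      One way the guarantee could be expressed, assuming a
      HUGETLB_PAGE_SIZE_VARIABLE-style definition of pageblock_order; the
      exact form in the patch may differ:

      	/* never let a pageblock exceed what the buddy allocator can manage */
      	#define pageblock_order	min_t(unsigned int, HUGETLB_PAGE_ORDER, MAX_ORDER - 1)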
      
      Link: https://lkml.kernel.org/r/20220214174132.219303-3-david@redhat.com
      
      
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarZi Yan <ziy@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: John Garry via iommu <iommu@lists.linux-foundation.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b3d40a2b
    • David Hildenbrand's avatar
      cma: factor out minimum alignment requirement · e16faf26
      David Hildenbrand authored
      Patch series "mm: enforce pageblock_order < MAX_ORDER".
      
      Having pageblock_order >= MAX_ORDER can happen in corner cases, and some
      parts of the kernel are not prepared for it.
      
      For example, Aneesh has shown [1] that such kernels can be compiled on
      ppc64 with 64k base pages by setting FORCE_MAX_ZONEORDER=8, which will
      run into a WARN_ON_ONCE(order >= MAX_ORDER) in compaction code right
      during boot.
      
      We can get pageblock_order >= MAX_ORDER when the default hugetlb size is
      bigger than the maximum allocation granularity of the buddy, in which
      case we are no longer talking about huge pages but instead gigantic
      pages.
      
      Having pageblock_order >= MAX_ORDER can only make alloc_contig_range()
      of such gigantic pages more likely to succeed.
      
      Reliable use of gigantic pages either requires boot time allocation or
      CMA; there is no need to overcomplicate some places in the kernel to
      optimize for corner cases that are broken in other areas of the kernel.
      
      This patch (of 2...
      e16faf26
    • Miaohe Lin's avatar
      mm/mmzone.h: remove unused macros · 7f37e49c
      Miaohe Lin authored
      Remove pgdat_page_nr, nid_page_nr and NODE_MEM_MAP.  They are unused
      now.
      
      Link: https://lkml.kernel.org/r/20220127093210.62293-1-linmiaohe@huawei.com
      
      
      Signed-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7f37e49c
    • Zi Yan's avatar
      mm: page_alloc: avoid merging non-fallbackable pageblocks with others · 1dd214b8
      Zi Yan authored
      This is done in addition to MIGRATE_ISOLATE pageblock merge avoidance.
      It prepares for the upcoming removal of the MAX_ORDER-1 alignment
      requirement for CMA and alloc_contig_range().
      
      MIGRATE_HIGHATOMIC should not merge with other migratetypes like
      MIGRATE_ISOLATE and MIGRATE_CMA [1], so this commit prevents that too.
      
      Remove MIGRATE_CMA and MIGRATE_ISOLATE from the fallbacks list, since
      they are never used.
      
      [1] https://lore.kernel.org/linux-mm/20211130100853.GP3366@techsingularity.net/
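
      A sketch of how the merge check can be centralized; the helper name is
      illustrative, and it relies on MIGRATE_HIGHATOMIC, MIGRATE_CMA and
      MIGRATE_ISOLATE all sorting after the regular per-cpu list types:

      	/* only unmovable/movable/reclaimable pageblocks may merge freely */
      	static inline bool migratetype_is_mergeable(int mt)
      	{
      		return mt < MIGRATE_PCPTYPES;
      	}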
      
      Link: https://lkml.kernel.org/r/20220124175957.1261961-1-zi.yan@sent.com
      
      
      Signed-off-by: default avatarZi Yan <ziy@nvidia.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: Mike Rapoport <rppt@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1dd214b8
    • Bang Li's avatar
      mm/vmalloc: fix comments about vmap_area struct · ff11a7ce
      Bang Li authored
      The vmap_area_root should be in the "busy" tree and the
      free_vmap_area_root should be in the "free" tree.
      
      Link: https://lkml.kernel.org/r/20220305011510.33596-1-libang.linuxer@gmail.com
      Fixes: 688fcbfc
      
       ("mm/vmalloc: modify struct vmap_area to reduce its size")
      Signed-off-by: default avatarBang Li <libang.linuxer@gmail.com>
      Reviewed-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Pengfei Li <lpf.vector@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ff11a7ce