Forum | Documentation | Website | Blog

Skip to content
Snippets Groups Projects
  1. Jun 24, 2016
    • Rasmus Villemoes's avatar
      init/main.c: fix initcall_blacklisted on ia64, ppc64 and parisc64 · 0fd5ed8d
      Rasmus Villemoes authored
      When I replaced kasprintf("%pf") with a direct call to
      sprint_symbol_no_offset I must have broken the initcall blacklisting
      feature on the arches where dereference_function_descriptor() is
      non-trivial.
      
      Fixes: c8cdd2be (init/main.c: simplify initcall_blacklisted())
      Link: http://lkml.kernel.org/r/1466027283-4065-1-git-send-email-linux@rasmusvillemoes.dk
      
      
      Signed-off-by: default avatarRasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0fd5ed8d
    • Linus Torvalds's avatar
      Clarify naming of thread info/stack allocators · b235beea
      Linus Torvalds authored
      
      We've had the thread info allocated together with the thread stack for
      most architectures for a long time (since the thread_info was split off
      from the task struct), but that is about to change.
      
      But the patches that move the thread info to be off-stack (and a part of
      the task struct instead) made it clear how confused the allocator and
      freeing functions are.
      
      Because the common case was that we share an allocation with the thread
      stack and the thread_info, the two pointers were identical.  That
      identity then meant that we would have things like
      
      	ti = alloc_thread_info_node(tsk, node);
      	...
      	tsk->stack = ti;
      
      which certainly _worked_ (since stack and thread_info have the same
      value), but is rather confusing: why are we assigning a thread_info to
      the stack? And if we move the thread_info away, the "confusing" code
      just gets to be entirely bogus.
      
      So remove all this confusion, and make it clear that we are doing the
      stack allocation by renaming and clarifying the function names to be
      about the stack.  The fact that the thread_info then shares the
      allocation is an implementation detail, and not really about the
      allocation itself.
      
      This is a pure renaming and type fix: we pass in the same pointer, it's
      just that we clarify what the pointer means.
      
      The ia64 code that actually only has one single allocation (for all of
      task_struct, thread_info and kernel thread stack) now looks a bit odd,
      but since "tsk->stack" is actually not even used there, that oddity
      doesn't matter.  It would be a separate thing to clean that up, I
      intentionally left the ia64 changes as a pure brute-force renaming and
      type change.
      
      Acked-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b235beea
  2. May 27, 2016
  3. May 20, 2016
    • Rasmus Villemoes's avatar
      init/main.c: simplify initcall_blacklisted() · c8cdd2be
      Rasmus Villemoes authored
      
      Using kasprintf to get the function name makes us look up the name
      twice, along with all the vsnprintf overhead of parsing the format
      string etc.  It also means there is an allocation failure case to deal
      with.  Since symbol_string in vsprintf.c would anyway allocate an array
      of size KSYM_SYMBOL_LEN on the stack, that might as well be done up
      here.
      
      Moreover, since this is a debug feature and the blacklisted_initcalls
      list is usually empty, we might as well test that and thus avoid looking
      up the symbol name even once in the common case.
      
      Signed-off-by: default avatarRasmus Villemoes <linux@rasmusvillemoes.dk>
      Acked-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      Acked-by: default avatarPrarit Bhargava <prarit@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c8cdd2be
    • Petr Mladek's avatar
      printk/nmi: increase the size of NMI buffer and make it configurable · 427934b8
      Petr Mladek authored
      Testing has shown that the backtrace sometimes does not fit into the 4kB
      temporary buffer that is used in NMI context.  The warnings are gone
      when I double the temporary buffer size.
      
      This patch doubles the buffer size and makes it configurable.
      
      Note that this problem existed even in the x86-specific implementation
      that was added by the commit a9edc880
      
       ("x86/nmi: Perform a safe NMI
      stack trace on all CPUs").  Nobody noticed it because it did not print
      any warnings.
      
      Signed-off-by: default avatarPetr Mladek <pmladek@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Cc: Jiri Kosina <jkosina@suse.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      427934b8
    • Petr Mladek's avatar
      printk/nmi: generic solution for safe printk in NMI · 42a0bb3f
      Petr Mladek authored
      printk() takes some locks and could not be used a safe way in NMI
      context.
      
      The chance of a deadlock is real especially when printing stacks from
      all CPUs.  This particular problem has been addressed on x86 by the
      commit a9edc880 ("x86/nmi: Perform a safe NMI stack trace on all
      CPUs").
      
      The patchset brings two big advantages.  First, it makes the NMI
      backtraces safe on all architectures for free.  Second, it makes all NMI
      messages almost safe on all architectures (the temporary buffer is
      limited.  We still should keep the number of messages in NMI context at
      minimum).
      
      Note that there already are several messages printed in NMI context:
      WARN_ON(in_nmi()), BUG_ON(in_nmi()), anything being printed out from MCE
      handlers.  These are not easy to avoid.
      
      This patch reuses most of the code and makes it generic.  It is useful
      for all messages and architectures that support NMI.
      
      The alternative printk_func is set when entering and is reseted when
      leaving NMI...
      42a0bb3f
    • Yang Shi's avatar
      mm: call page_ext_init() after all struct pages are initialized · b8f1a75d
      Yang Shi authored
      When DEFERRED_STRUCT_PAGE_INIT is enabled, just a subset of memmap at
      boot are initialized, then the rest are initialized in parallel by
      starting one-off "pgdatinitX" kernel thread for each node X.
      
      If page_ext_init is called before it, some pages will not have valid
      extension, this may lead the below kernel oops when booting up kernel:
      
        BUG: unable to handle kernel NULL pointer dereference at           (null)
        IP: [<ffffffff8118d982>] free_pcppages_bulk+0x2d2/0x8d0
        PGD 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        Modules linked in:
        CPU: 11 PID: 106 Comm: pgdatinit1 Not tainted 4.6.0-rc5-next-20160427 #26
        Hardware name: Intel Corporation S5520HC/S5520HC, BIOS S5500.86B.01.10.0025.030220091519 03/02/2009
        task: ffff88017c080040 ti: ffff88017c084000 task.ti: ffff88017c084000
        RIP: 0010:[<ffffffff8118d982>]  [<ffffffff8118d982>] free_pcppages_bulk+0x2d2/0x8d0
        RSP: 0000:ffff88017c087c48  EFLAGS: 00010046
        RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
        RDX: 0000000000000980 RSI: 0000000000000080 RDI: 0000000000660401
        RBP: ffff88017c087cd0 R08: 0000000000000401 R09: 0000000000000009
        R10: ffff88017c080040 R11: 000000000000000a R12: 0000000000000400
        R13: ffffea0019810000 R14: ffffea0019810040 R15: ffff88066cfe6080
        FS:  0000000000000000(0000) GS:ffff88066cd40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 0000000002406000 CR4: 00000000000006e0
        Call Trace:
          free_hot_cold_page+0x192/0x1d0
          __free_pages+0x5c/0x90
          __free_pages_boot_core+0x11a/0x14e
          deferred_free_range+0x50/0x62
          deferred_init_memmap+0x220/0x3c3
          kthread+0xf8/0x110
          ret_from_fork+0x22/0x40
        Code: 49 89 d4 48 c1 e0 06 49 01 c5 e9 de fe ff ff 4c 89 f7 44 89 4d b8 4c 89 45 c0 44 89 5d c8 48 89 4d d0 e8 62 c7 07 00 48 8b 4d d0 <48> 8b 00 44 8b 5d c8 4c 8b 45 c0 44 8b 4d b8 a8 02 0f 84 05 ff
        RIP  [<ffffffff8118d982>] free_pcppages_bulk+0x2d2/0x8d0
         RSP <ffff88017c087c48>
        CR2: 0000000000000000
      
      Move page_ext_init() after page_alloc_init_late() to make sure page extension
      is setup for all pages.
      
      Link: http://lkml.kernel.org/r/1463696006-31360-1-git-send-email-yang.shi@linaro.org
      
      
      Signed-off-by: default avatarYang Shi <yang.shi@linaro.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b8f1a75d
  4. May 19, 2016
    • Thomas Garnier's avatar
      mm: SLAB freelist randomization · c7ce4f60
      Thomas Garnier authored
      Provides an optional config (CONFIG_SLAB_FREELIST_RANDOM) to randomize
      the SLAB freelist.  The list is randomized during initialization of a
      new set of pages.  The order on different freelist sizes is pre-computed
      at boot for performance.  Each kmem_cache has its own randomized
      freelist.  Before pre-computed lists are available freelists are
      generated dynamically.  This security feature reduces the predictability
      of the kernel SLAB allocator against heap overflows rendering attacks
      much less stable.
      
      For example this attack against SLUB (also applicable against SLAB)
      would be affected:
      
        https://jon.oberheide.org/blog/2010/09/10/linux-kernel-can-slub-overflow/
      
      
      
      Also, since v4.6 the freelist was moved at the end of the SLAB.  It
      means a controllable heap is opened to new attacks not yet publicly
      discussed.  A kernel heap overflow can be transformed to multiple
      use-after-free.  This feature makes this type of attack harder too.
      
      To generate entropy, we use get_random_bytes_arch because 0 bits of
      entropy is available in the boot stage.  In the worse case this function
      will fallback to the get_random_bytes sub API.  We also generate a shift
      random number to shift pre-computed freelist for each new set of pages.
      
      The config option name is not specific to the SLAB as this approach will
      be extended to other allocators like SLUB.
      
      Performance results highlighted no major changes:
      
      Hackbench (running 90 10 times):
      
        Before average: 0.0698
        After average: 0.0663 (-5.01%)
      
      slab_test 1 run on boot.  Difference only seen on the 2048 size test
      being the worse case scenario covered by freelist randomization.  New
      slab pages are constantly being created on the 10000 allocations.
      Variance should be mainly due to getting new pages every few
      allocations.
      
      Before:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        10000 times kmalloc(8) -> 99 cycles kfree -> 112 cycles
        10000 times kmalloc(16) -> 109 cycles kfree -> 140 cycles
        10000 times kmalloc(32) -> 129 cycles kfree -> 137 cycles
        10000 times kmalloc(64) -> 141 cycles kfree -> 141 cycles
        10000 times kmalloc(128) -> 152 cycles kfree -> 148 cycles
        10000 times kmalloc(256) -> 195 cycles kfree -> 167 cycles
        10000 times kmalloc(512) -> 257 cycles kfree -> 199 cycles
        10000 times kmalloc(1024) -> 393 cycles kfree -> 251 cycles
        10000 times kmalloc(2048) -> 649 cycles kfree -> 228 cycles
        10000 times kmalloc(4096) -> 806 cycles kfree -> 370 cycles
        10000 times kmalloc(8192) -> 814 cycles kfree -> 411 cycles
        10000 times kmalloc(16384) -> 892 cycles kfree -> 455 cycles
        2. Kmalloc: alloc/free test
        10000 times kmalloc(8)/kfree -> 121 cycles
        10000 times kmalloc(16)/kfree -> 121 cycles
        10000 times kmalloc(32)/kfree -> 121 cycles
        10000 times kmalloc(64)/kfree -> 121 cycles
        10000 times kmalloc(128)/kfree -> 121 cycles
        10000 times kmalloc(256)/kfree -> 119 cycles
        10000 times kmalloc(512)/kfree -> 119 cycles
        10000 times kmalloc(1024)/kfree -> 119 cycles
        10000 times kmalloc(2048)/kfree -> 119 cycles
        10000 times kmalloc(4096)/kfree -> 121 cycles
        10000 times kmalloc(8192)/kfree -> 119 cycles
        10000 times kmalloc(16384)/kfree -> 119 cycles
      
      After:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        10000 times kmalloc(8) -> 130 cycles kfree -> 86 cycles
        10000 times kmalloc(16) -> 118 cycles kfree -> 86 cycles
        10000 times kmalloc(32) -> 121 cycles kfree -> 85 cycles
        10000 times kmalloc(64) -> 176 cycles kfree -> 102 cycles
        10000 times kmalloc(128) -> 178 cycles kfree -> 100 cycles
        10000 times kmalloc(256) -> 205 cycles kfree -> 109 cycles
        10000 times kmalloc(512) -> 262 cycles kfree -> 136 cycles
        10000 times kmalloc(1024) -> 342 cycles kfree -> 157 cycles
        10000 times kmalloc(2048) -> 701 cycles kfree -> 238 cycles
        10000 times kmalloc(4096) -> 803 cycles kfree -> 364 cycles
        10000 times kmalloc(8192) -> 835 cycles kfree -> 404 cycles
        10000 times kmalloc(16384) -> 896 cycles kfree -> 441 cycles
        2. Kmalloc: alloc/free test
        10000 times kmalloc(8)/kfree -> 121 cycles
        10000 times kmalloc(16)/kfree -> 121 cycles
        10000 times kmalloc(32)/kfree -> 123 cycles
        10000 times kmalloc(64)/kfree -> 142 cycles
        10000 times kmalloc(128)/kfree -> 121 cycles
        10000 times kmalloc(256)/kfree -> 119 cycles
        10000 times kmalloc(512)/kfree -> 119 cycles
        10000 times kmalloc(1024)/kfree -> 119 cycles
        10000 times kmalloc(2048)/kfree -> 119 cycles
        10000 times kmalloc(4096)/kfree -> 119 cycles
        10000 times kmalloc(8192)/kfree -> 119 cycles
        10000 times kmalloc(16384)/kfree -> 119 cycles
      
      [akpm@linux-foundation.org: propagate gfp_t into cache_random_seq_create()]
      Signed-off-by: default avatarThomas Garnier <thgarnie@google.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Laura Abbott <labbott@fedoraproject.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c7ce4f60
  5. May 10, 2016
    • Arnd Bergmann's avatar
      Kbuild: change CC_OPTIMIZE_FOR_SIZE definition · 877417e6
      Arnd Bergmann authored
      
      CC_OPTIMIZE_FOR_SIZE disables the often useful -Wmaybe-unused warning,
      because that causes a ridiculous amount of false positives when combined
      with -Os.
      
      This means a lot of warnings don't show up in testing by the developers
      that should see them with an 'allmodconfig' kernel that has
      CC_OPTIMIZE_FOR_SIZE enabled, but only later in randconfig builds
      that don't.
      
      This changes the Kconfig logic around CC_OPTIMIZE_FOR_SIZE to make
      it a 'choice' statement defaulting to CC_OPTIMIZE_FOR_PERFORMANCE
      that gets added for this purpose. The allmodconfig and allyesconfig
      kernels now default to -O2 with the maybe-unused warning enabled.
      
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarMichal Marek <mmarek@suse.com>
      877417e6
  6. Apr 01, 2016
    • Andi Kleen's avatar
      Make CONFIG_FHANDLE default y · f76be617
      Andi Kleen authored
      
      Newer Fedora and OpenSUSE didn't boot with my standard configuration.
      It took me some time to figure out why, in fact I had to write a script
      to try different config options systematically.
      
      The problem is that something (systemd) in dracut depends on
      CONFIG_FHANDLE, which adds open by file handle syscalls.
      
      While it is set in defconfigs it is very easy to miss when updating
      older configs because it is not default y.
      
      Make it default y and also depend on EXPERT, as dracut use is likely
      widespread.
      
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      Cc: Richard Weinberger <richard.weinberger@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f76be617
  7. Mar 29, 2016
  8. Mar 15, 2016
    • Ard Biesheuvel's avatar
      kallsyms: add support for relative offsets in kallsyms address table · 2213e9a6
      Ard Biesheuvel authored
      
      Similar to how relative extables are implemented, it is possible to emit
      the kallsyms table in such a way that it contains offsets relative to
      some anchor point in the kernel image rather than absolute addresses.
      
      On 64-bit architectures, it cuts the size of the kallsyms address table
      in half, since offsets between kernel symbols can typically be expressed
      in 32 bits.  This saves several hundreds of kilobytes of permanent
      .rodata on average.  In addition, the kallsyms address table is no
      longer subject to dynamic relocation when CONFIG_RELOCATABLE is in
      effect, so the relocation work done after decompression now doesn't have
      to do relocation updates for all these values.  This saves up to 24
      bytes (i.e., the size of a ELF64 RELA relocation table entry) per value,
      which easily adds up to a couple of megabytes of uncompressed __init
      data on ppc64 or arm64.  Even if these relocation entries typically
      compress well, the combined size reduction of 2.8 MB uncompressed for a
      ppc64_defconfig build (of which 2.4 MB is __init data) results in a ~500
      KB space saving in the compressed image.
      
      Since it is useful for some architectures (like x86) to retain the
      ability to emit absolute values as well, this patch also adds support
      for capturing both absolute and relative values when
      KALLSYMS_ABSOLUTE_PERCPU is in effect, by emitting absolute per-cpu
      addresses as positive 32-bit values, and addresses relative to the
      lowest encountered relative symbol as negative values, which are
      subtracted from the runtime address of this base symbol to produce the
      actual address.
      
      Support for the above is enabled by default for all architectures except
      IA-64 and Tile-GX, whose symbols are too far apart to capture in this
      manner.
      
      Signed-off-by: default avatarArd Biesheuvel <ard.biesheuvel@linaro.org>
      Tested-by: default avatarGuenter Roeck <linux@roeck-us.net>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Tested-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2213e9a6
    • Ard Biesheuvel's avatar
      x86: kallsyms: disable absolute percpu symbols on !SMP · 4d5d5664
      Ard Biesheuvel authored
      
      scripts/kallsyms.c has a special --absolute-percpu command line option
      which deals with the zero based per cpu offsets that are used when
      building for SMP on x86_64.  This means that the option should only be
      passed in that case, so add a Kconfig symbol with the correct predicate,
      and use that instead.
      
      Signed-off-by: default avatarArd Biesheuvel <ard.biesheuvel@linaro.org>
      Tested-by: default avatarGuenter Roeck <linux@roeck-us.net>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Tested-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4d5d5664
    • Geliang Tang's avatar
      init/main.c: use list_for_each_entry() · e6fd1fb3
      Geliang Tang authored
      
      Use list_for_each_entry() instead of list_for_each() to simplify the code.
      
      Signed-off-by: default avatarGeliang Tang <geliangtang@163.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e6fd1fb3
  9. Mar 05, 2016
  10. Mar 03, 2016
    • David Howells's avatar
      akcipher: Move the RSA DER encoding check to the crypto layer · d43de6c7
      David Howells authored
      
      Move the RSA EMSA-PKCS1-v1_5 encoding from the asymmetric-key public_key
      subtype to the rsa crypto module's pkcs1pad template.  This means that the
      public_key subtype no longer has any dependencies on public key type.
      
      To make this work, the following changes have been made:
      
       (1) The rsa pkcs1pad template is now used for RSA keys.  This strips off the
           padding and returns just the message hash.
      
       (2) In a previous patch, the pkcs1pad template gained an optional second
           parameter that, if given, specifies the hash used.  We now give this,
           and pkcs1pad checks the encoded message E(M) for the EMSA-PKCS1-v1_5
           encoding and verifies that the correct digest OID is present.
      
       (3) The crypto driver in crypto/asymmetric_keys/rsa.c is now reduced to
           something that doesn't care about what the encryption actually does
           and and has been merged into public_key.c.
      
       (4) CONFIG_PUBLIC_KEY_ALGO_RSA is gone.  Module signing must set
           CONFIG_CRYPTO_RSA=y instead.
      
      Thoughts:
      
       (*) Should the encoding style (eg. raw, EMSA-PKCS1-v1_5) also be passed to
           the padding template?  Should there be multiple padding templates
           registered that share most of the code?
      
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarTadeusz Struk <tadeusz.struk@intel.com>
      Acked-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      d43de6c7
  11. Mar 01, 2016
    • Thomas Gleixner's avatar
      cpu/hotplug: Unpark smpboot threads from the state machine · 931ef163
      Thomas Gleixner authored
      
      Handle the smpboot threads in the state machine.
      
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: linux-arch@vger.kernel.org
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Turner <pjt@google.com>
      Link: http://lkml.kernel.org/r/20160226182341.295777684@linutronix.de
      
      
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      931ef163
    • Thomas Gleixner's avatar
      cpu/hotplug: Convert to a state machine for the control processor · cff7d378
      Thomas Gleixner authored
      
      Move the split out steps into a callback array and let the cpu_up/down
      code iterate through the array functions. For now most of the
      callbacks are asymmetric to resemble the current hotplug maze.
      
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: linux-arch@vger.kernel.org
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Rafael Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Srivatsa S. Bhat" <srivatsa@mit.edu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Turner <pjt@google.com>
      Link: http://lkml.kernel.org/r/20160226182340.671816690@linutronix.de
      
      
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      cff7d378
  12. Feb 22, 2016
    • Kees Cook's avatar
      mm/init: Add 'rodata=off' boot cmdline parameter to disable read-only kernel mappings · d2aa1aca
      Kees Cook authored
      
      It may be useful to debug writes to the readonly sections of memory,
      so provide a cmdline "rodata=off" to allow for this. This can be
      expanded in the future to support "log" and "write" modes, but that
      will need to be architecture-specific.
      
      This also makes KDB software breakpoints more usable, as read-only
      mappings can now be disabled on any kernel.
      
      Suggested-by: default avatarH. Peter Anvin <hpa@zytor.com>
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Brown <david.brown@linaro.org>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Emese Revfy <re.emese@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: PaX Team <pageexec@freemail.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: T...
      d2aa1aca
  13. Feb 09, 2016
  14. Jan 20, 2016
  15. Jan 16, 2016
  16. Jan 13, 2016
  17. Dec 26, 2015
  18. Dec 18, 2015
  19. Dec 12, 2015
  20. Dec 04, 2015
  21. Sep 11, 2015
    • Mathieu Desnoyers's avatar
      sys_membarrier(): system-wide memory barrier (generic, x86) · 5b25b13a
      Mathieu Desnoyers authored
      Here is an implementation of a new system call, sys_membarrier(), which
      executes a memory barrier on all threads running on the system.  It is
      implemented by calling synchronize_sched().  It can be used to
      distribute the cost of user-space memory barriers asymmetrically by
      transforming pairs of memory barriers into pairs consisting of
      sys_membarrier() and a compiler barrier.  For synchronization primitives
      that distinguish between read-side and write-side (e.g.  userspace RCU
      [1], rwlocks), the read-side can be accelerated significantly by moving
      the bulk of the memory barrier overhead to the write-side.
      
      The existing applications of which I am aware that would be improved by
      this system call are as follows:
      
      * Through Userspace RCU library (http://urcu.so)
        - DNS server (Knot DNS) https://www.knot-dns.cz/
        - Network sniffer (http://netsniff-ng.org/)
        - Distributed object storage (https://sheepdog.github.io/sheepdog/)
        - User-space tracing (http://lttng.org)
        - Network storage system (https://www.gluster.org/)
        - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
        - Financial software (https://lkml.org/lkml/2015/3/23/189)
      
      Those projects use RCU in userspace to increase read-side speed and
      scalability compared to locking.  Especially in the case of RCU used by
      libraries, sys_membarrier can speed up the read-side by moving the bulk of
      the memory barrier cost to synchronize_rcu().
      
      * Direct users of sys_membarrier
        - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
      
      Microsoft core dotnet GC developers are planning to use the mprotect()
      side-effect of issuing memory barriers through IPIs as a way to implement
      Windows FlushProcessWriteBuffers() on Linux.  They are referring to
      sys_membarrier in their github thread, specifically stating that
      sys_membarrier() is what they are looking for.
      
      To explain the benefit of this scheme, let's introduce two example threads:
      
      Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
      Thread B (frequent, e.g. executing liburcu
      rcu_read_lock()/rcu_read_unlock())
      
      In a scheme where all smp_mb() in thread A are ordering memory accesses
      with respect to smp_mb() present in Thread B, we can change each
      smp_mb() within Thread A into calls to sys_membarrier() and each
      smp_mb() within Thread B into compiler barriers "barrier()".
      
      Before the change, we had, for each smp_mb() pairs:
      
      Thread A                    Thread B
      previous mem accesses       previous mem accesses
      smp_mb()                    smp_mb()
      following mem accesses      following mem accesses
      
      After the change, these pairs become:
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      As we can see, there are two possible scenarios: either Thread B memory
      accesses do not happen concurrently with Thread A accesses (1), or they
      do (2).
      
      1) Non-concurrent Thread A vs Thread B accesses:
      
      Thread A                    Thread B
      prev mem accesses
      sys_membarrier()
      follow mem accesses
                                  prev mem accesses
                                  barrier()
                                  follow mem accesses
      
      In this case, thread B accesses will be weakly ordered. This is OK,
      because at that point, thread A is not particularly interested in
      ordering them with respect to its own accesses.
      
      2) Concurrent Thread A vs Thread B accesses
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      In this case, thread B accesses, which are ensured to be in program
      order thanks to the compiler barrier, will be "upgraded" to full
      smp_mb() by synchronize_sched().
      
      * Benchmarks
      
      On Intel Xeon E5405 (8 cores)
      (one thread is calling sys_membarrier, the other 7 threads are busy
      looping)
      
      1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call.
      
      * User-space user of this system call: Userspace RCU library
      
      Both the signal-based and the sys_membarrier userspace RCU schemes
      permit us to remove the memory barrier from the userspace RCU
      rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
      accelerating them. These memory barriers are replaced by compiler
      barriers on the read-side, and all matching memory barriers on the
      write-side are turned into an invocation of a memory barrier on all
      active threads in the process. By letting the kernel perform this
      synchronization rather than dumbly sending a signal to every process
      threads (as we currently do), we diminish the number of unnecessary wake
      ups and only issue the memory barriers on active threads. Non-running
      threads do not need to execute such barrier anyway, because these are
      implied by the scheduler context switches.
      
      Results in liburcu:
      
      Operations in 10s, 6 readers, 2 writers:
      
      memory barriers in reader:    1701557485 reads, 2202847 writes
      signal-based scheme:          9830061167 reads,    6700 writes
      sys_membarrier:               9952759104 reads,     425 writes
      sys_membarrier (dyn. check):  7970328887 reads,     425 writes
      
      The dynamic sys_membarrier availability check adds some overhead to
      the read-side compared to the signal-based scheme, but besides that,
      sys_membarrier slightly outperforms the signal-based scheme. However,
      this non-expedited sys_membarrier implementation has a much slower grace
      period than signal and memory barrier schemes.
      
      Besides diminishing the number of wake-ups, one major advantage of the
      membarrier system call over the signal-based scheme is that it does not
      need to reserve a signal. This plays much more nicely with libraries,
      and with processes injected into for tracing purposes, for which we
      cannot expect that signals will be unused by the application.
      
      An expedited version of this system call can be added later on to speed
      up the grace period. Its implementation will likely depend on reading
      the cpu_curr()->mm without holding each CPU's rq lock.
      
      This patch adds the system call to x86 and to asm-generic.
      
      [1] http://urcu.so
      
      
      
      membarrier(2) man page:
      
      MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)
      
      NAME
             membarrier - issue memory barriers on a set of threads
      
      SYNOPSIS
             #include <linux/membarrier.h>
      
             int membarrier(int cmd, int flags);
      
      DESCRIPTION
             The cmd argument is one of the following:
      
             MEMBARRIER_CMD_QUERY
                    Query  the  set  of  supported commands. It returns a bitmask of
                    supported commands.
      
             MEMBARRIER_CMD_SHARED
                    Execute a memory barrier on all threads running on  the  system.
                    Upon  return from system call, the caller thread is ensured that
                    all running threads have passed through a state where all memory
                    accesses  to  user-space  addresses  match program order between
                    entry to and return from the system  call  (non-running  threads
                    are de facto in such a state). This covers threads from all pro=E2=80=90
                    cesses running on the system.  This command returns 0.
      
             The flags argument needs to be 0. For future extensions.
      
             All memory accesses performed  in  program  order  from  each  targeted
             thread is guaranteed to be ordered with respect to sys_membarrier(). If
             we use the semantic "barrier()" to represent a compiler barrier forcing
             memory  accesses  to  be performed in program order across the barrier,
             and smp_mb() to represent explicit memory barriers forcing full  memory
             ordering  across  the barrier, we have the following ordering table for
             each pair of barrier(), sys_membarrier() and smp_mb():
      
             The pair ordering is detailed as (O: ordered, X: not ordered):
      
                                    barrier()   smp_mb() sys_membarrier()
                    barrier()          X           X            O
                    smp_mb()           X           O            O
                    sys_membarrier()   O           O            O
      
      RETURN VALUE
             On success, these system calls return zero.  On error, -1 is  returned,
             and errno is set appropriately. For a given command, with flags
             argument set to 0, this system call is guaranteed to always return the
             same value until reboot.
      
      ERRORS
             ENOSYS System call is not implemented.
      
             EINVAL Invalid arguments.
      
      Linux                             2015-04-15                     MEMBARRIER(2)
      
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: default avatarJosh Triplett <josh@joshtriplett.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Nicholas Miell <nmiell@comcast.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Pranith Kumar <bobby.prani@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5b25b13a
  22. Sep 10, 2015
    • Dave Young's avatar
      kexec: split kexec_load syscall from kexec core code · 2965faa5
      Dave Young authored
      
      There are two kexec load syscalls, kexec_load another and kexec_file_load.
       kexec_file_load has been splited as kernel/kexec_file.c.  In this patch I
      split kexec_load syscall code to kernel/kexec.c.
      
      And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
      use kexec_file_load only, or vice verse.
      
      The original requirement is from Ted Ts'o, he want kexec kernel signature
      being checked with CONFIG_KEXEC_VERIFY_SIG enabled.  But kexec-tools use
      kexec_load syscall can bypass the checking.
      
      Vivek Goyal proposed to create a common kconfig option so user can compile
      in only one syscall for loading kexec kernel.  KEXEC/KEXEC_FILE selects
      KEXEC_CORE so that old config files still work.
      
      Because there's general code need CONFIG_KEXEC_CORE, so I updated all the
      architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects
      KEXEC_CORE in arch Kconfig.  Also updated general kernel code with to
      kexec_load syscall.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarDave Young <dyoung@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2965faa5
    • Frederic Weisbecker's avatar
      kmod: use system_unbound_wq instead of khelper · 90f02303
      Frederic Weisbecker authored
      We need to launch the usermodehelper kernel threads with the widest
      affinity and this is partly why we use khelper.  This workqueue has
      unbound properties and thus a wide affinity inherited by all its children.
      
      Now khelper also has special properties that we aren't much interested in:
      ordered and singlethread.  There is really no need about ordering as all
      we do is creating kernel threads.  This can be done concurrently.  And
      singlethread is a useless limitation as well.
      
      The workqueue engine already proposes generic unbound workqueues that
      don't share these useless properties and handle well parallel jobs.
      
      The only worrysome specific is their affinity to the node of the current
      CPU.  It's fine for creating the usermodehelper kernel threads but those
      inherit this affinity for longer jobs such as requesting modules.
      
      This patch proposes to use these node affine unbound workqueues assuming
      that a node is sufficient to handle several parallel usermodehelper
      re...
      90f02303
  23. Sep 04, 2015
    • Mel Gorman's avatar
      mm: send one IPI per CPU to TLB flush all entries after unmapping pages · 72b252ae
      Mel Gorman authored
      
      An IPI is sent to flush remote TLBs when a page is unmapped that was
      potentially accesssed by other CPUs.  There are many circumstances where
      this happens but the obvious one is kswapd reclaiming pages belonging to a
      running process as kswapd and the task are likely running on separate
      CPUs.
      
      On small machines, this is not a significant problem but as machine gets
      larger with more cores and more memory, the cost of these IPIs can be
      high.  This patch uses a simple structure that tracks CPUs that
      potentially have TLB entries for pages being unmapped.  When the unmapping
      is complete, the full TLB is flushed on the assumption that a refill cost
      is lower than flushing individual entries.
      
      Architectures wishing to do this must give the following guarantee.
      
              If a clean page is unmapped and not immediately flushed, the
              architecture must guarantee that a write to that linear address
              from a CPU with a cached TLB entry will trap a page fault.
      
      This is essentially what the kernel already depends on but the window is
      much larger with this patch applied and is worth highlighting.  The
      architecture should consider whether the cost of the full TLB flush is
      higher than sending an IPI to flush each individual entry.  An additional
      architecture helper called flush_tlb_local is required.  It's a trivial
      wrapper with some accounting in the x86 case.
      
      The impact of this patch depends on the workload as measuring any benefit
      requires both mapped pages co-located on the LRU and memory pressure.  The
      case with the biggest impact is multiple processes reading mapped pages
      taken from the vm-scalability test suite.  The test case uses NR_CPU
      readers of mapped files that consume 10*RAM.
      
      Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs
      
                                                 4.2.0-rc1          4.2.0-rc1
                                                   vanilla       flushfull-v7
      Ops lru-file-mmap-read-elapsed      159.62 (  0.00%)   120.68 ( 24.40%)
      Ops lru-file-mmap-read-time_range    30.59 (  0.00%)     2.80 ( 90.85%)
      Ops lru-file-mmap-read-time_stddv     6.70 (  0.00%)     0.64 ( 90.38%)
      
                 4.2.0-rc1    4.2.0-rc1
                   vanilla flushfull-v7
      User          581.00       611.43
      System       5804.93      4111.76
      Elapsed       161.03       122.12
      
      This is showing that the readers completed 24.40% faster with 29% less
      system CPU time.  From vmstats, it is known that the vanilla kernel was
      interrupted roughly 900K times per second during the steady phase of the
      test and the patched kernel was interrupts 180K times per second.
      
      The impact is lower on a single socket machine.
      
                                                 4.2.0-rc1          4.2.0-rc1
                                                   vanilla       flushfull-v7
      Ops lru-file-mmap-read-elapsed       25.33 (  0.00%)    20.38 ( 19.54%)
      Ops lru-file-mmap-read-time_range     0.91 (  0.00%)     1.44 (-58.24%)
      Ops lru-file-mmap-read-time_stddv     0.28 (  0.00%)     0.47 (-65.34%)
      
                 4.2.0-rc1    4.2.0-rc1
                   vanilla flushfull-v7
      User           58.09        57.64
      System        111.82        76.56
      Elapsed        27.29        22.55
      
      It's still a noticeable improvement with vmstat showing interrupts went
      from roughly 500K per second to 45K per second.
      
      The patch will have no impact on workloads with no memory pressure or have
      relatively few mapped pages.  It will have an unpredictable impact on the
      workload running on the CPU being flushed as it'll depend on how many TLB
      entries need to be refilled and how long that takes.  Worst case, the TLB
      will be completely cleared of active entries when the target PFNs were not
      resident at all.
      
      [sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Acked-by: default avatarIngo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      72b252ae
    • Andrea Arcangeli's avatar
      userfaultfd: buildsystem activation · a14c151e
      Andrea Arcangeli authored
      
      This allows to select the userfaultfd during configuration to build it.
      
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a14c151e
  24. Aug 15, 2015
  25. Aug 14, 2015
  26. Aug 12, 2015
  27. Aug 07, 2015