  1. Jul 14, 2021
    • fs: add vfs_parse_fs_param_source() helper · d1d488d8
      Christian Brauner authored
      Add a simple helper that filesystems can use in their parameter parser
      to parse the "source" parameter. A few places open-coded this function
      and that already caused a bug in the cgroup v1 parser that we fixed.
      Let's make it harder to get this wrong by introducing a helper which
      performs all necessary checks.
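      The checks such a helper performs can be sketched as standalone C. The structs below are simplified hypothetical stand-ins for the kernel's fs_context and fs_parameter (only the fields relevant here), parse_fs_param_source() is an illustrative name rather than the kernel symbol, and ENOPARAM is defined by hand since it is a kernel-internal errno:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#ifndef ENOPARAM
#define ENOPARAM 519   /* kernel-internal errno value, defined by hand */
#endif

/* Simplified stand-ins for the kernel types; only the fields used here. */
enum fs_value_type { fs_value_is_string, fs_value_is_file };

struct fs_parameter {
	const char *key;
	enum fs_value_type type;
	char *string;          /* valid only when type == fs_value_is_string */
};

struct fs_context {
	char *source;
};

/*
 * Mirror of the helper's logic: accept only the "source" parameter,
 * only as a string, and only once; on success take ownership of the
 * string so the caller does not free it.
 */
static int parse_fs_param_source(struct fs_context *fc,
				 struct fs_parameter *param)
{
	if (strcmp(param->key, "source") != 0)
		return -ENOPARAM;          /* not ours; caller keeps parsing */
	if (param->type != fs_value_is_string)
		return -EINVAL;            /* e.g. an FSCONFIG_SET_FD value */
	if (fc->source)
		return -EINVAL;            /* "source" may only be set once */
	fc->source = param->string;
	param->string = NULL;              /* ownership moved to fc */
	return 0;
}
```

      The type check is exactly what the open-coded cgroup v1 parser was missing.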
      
      Link: https://syzkaller.appspot.com/bug?id=6312526aba5beae046fdae8f00399f87aab48b12
      
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d1d488d8
    • cgroup: verify that source is a string · 3b046272
      Christian Brauner authored
      The following sequence can be used to trigger a UAF:
      
          int fscontext_fd = fsopen("cgroup", 0);
          int fd_null = open("/dev/null", O_RDONLY);
          fsconfig(fscontext_fd, FSCONFIG_SET_FD, "source", fd_null);
          close_range(3, ~0U, 0);
      
      The cgroup v1 specific fs parser expects a string for the "source"
      parameter.  However, it is perfectly legitimate to e.g.  specify a file
      descriptor for the "source" parameter.  The fs parser doesn't know what
      a filesystem allows there.  So it's a bug to assume that "source" is
      always of type fs_value_is_string when it can reasonably also be
      fs_value_is_file.
      
      This assumption in the cgroup code causes a UAF because struct
      fs_parameter uses a union for the actual value.  Access to that union is
      guarded by the param->type member.  Since the cgroup parameter parser
      didn't check param->type but unconditionally moved param->string into
      fc->source, a close on the fscontext_fd would trigger a UAF during
      put_fs_context(), which frees fc->source and thereby the file stashed in
      param->file, causing a UAF during the close of fd_null.
      
      Fix this by verifying that param->type is actually a string and
      reporting an error if it is not.
      
      In follow-up patches I'll add a new generic helper that can be used here
      and by other filesystems instead of this error-prone copy-pasta fix.
      But fixing it here first makes backporting to stable a lot easier.
      
      Fixes: 8d2451f4 ("cgroup1: switch to option-by-option parsing")
      Reported-by: <syzbot+283ce5a46486d6acdbaf@syzkaller.appspotmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: <stable@kernel.org>
      Cc: syzkaller-bugs <syzkaller-bugs@googlegroups.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b046272
  2. Jul 13, 2021
    • bpf: Fix tail_call_reachable rejection for interpreter when jit failed · 5dd0a6b8
      Daniel Borkmann authored
      During testing of f263a814 ("bpf: Track subprog poke descriptors correctly
      and fix use-after-free") under various failure conditions, for example, when
      jit_subprogs() fails and tries to clean up the program to be run under the
      interpreter, we ran into the following freeze:
      
        [...]
        #127/8 tailcall_bpf2bpf_3:FAIL
        [...]
        [   92.041251] BUG: KASAN: slab-out-of-bounds in ___bpf_prog_run+0x1b9d/0x2e20
        [   92.042408] Read of size 8 at addr ffff88800da67f68 by task test_progs/682
        [   92.043707]
        [   92.044030] CPU: 1 PID: 682 Comm: test_progs Tainted: G   O   5.13.0-53301-ge6c08cb33a30-dirty #87
        [   92.045542] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
        [   92.046785] Call Trace:
        [   92.047171]  ? __bpf_prog_run_args64+0xc0/0xc0
        [   92.047773]  ? __bpf_prog_run_args32+0x8b/0xb0
        [   92.048389]  ? __bpf_prog_run_args64+0xc0/0xc0
        [   92.049019]  ? ktime_get+0x117/0x130
        [...] // few hundred [similar] lines more
        [   92.659025]  ? ktime_get+0x117/0x130
        [   92.659845]  ? __bpf_prog_run_args64+0xc0/0xc0
        [   92.660738]  ? __bpf_prog_run_args32+0x8b/0xb0
        [   92.661528]  ? __bpf_prog_run_args64+0xc0/0xc0
        [   92.662378]  ? print_usage_bug+0x50/0x50
        [   92.663221]  ? print_usage_bug+0x50/0x50
        [   92.664077]  ? bpf_ksym_find+0x9c/0xe0
        [   92.664887]  ? ktime_get+0x117/0x130
        [   92.665624]  ? kernel_text_address+0xf5/0x100
        [   92.666529]  ? __kernel_text_address+0xe/0x30
        [   92.667725]  ? unwind_get_return_address+0x2f/0x50
        [   92.668854]  ? ___bpf_prog_run+0x15d4/0x2e20
        [   92.670185]  ? ktime_get+0x117/0x130
        [   92.671130]  ? __bpf_prog_run_args64+0xc0/0xc0
        [   92.672020]  ? __bpf_prog_run_args32+0x8b/0xb0
        [   92.672860]  ? __bpf_prog_run_args64+0xc0/0xc0
        [   92.675159]  ? ktime_get+0x117/0x130
        [   92.677074]  ? lock_is_held_type+0xd5/0x130
        [   92.678662]  ? ___bpf_prog_run+0x15d4/0x2e20
        [   92.680046]  ? ktime_get+0x117/0x130
        [   92.681285]  ? __bpf_prog_run32+0x6b/0x90
        [   92.682601]  ? __bpf_prog_run64+0x90/0x90
        [   92.683636]  ? lock_downgrade+0x370/0x370
        [   92.684647]  ? mark_held_locks+0x44/0x90
        [   92.685652]  ? ktime_get+0x117/0x130
        [   92.686752]  ? lockdep_hardirqs_on+0x79/0x100
        [   92.688004]  ? ktime_get+0x117/0x130
        [   92.688573]  ? __cant_migrate+0x2b/0x80
        [   92.689192]  ? bpf_test_run+0x2f4/0x510
        [   92.689869]  ? bpf_test_timer_continue+0x1c0/0x1c0
        [   92.690856]  ? rcu_read_lock_bh_held+0x90/0x90
        [   92.691506]  ? __kasan_slab_alloc+0x61/0x80
        [   92.692128]  ? eth_type_trans+0x128/0x240
        [   92.692737]  ? __build_skb+0x46/0x50
        [   92.693252]  ? bpf_prog_test_run_skb+0x65e/0xc50
        [   92.693954]  ? bpf_prog_test_run_raw_tp+0x2d0/0x2d0
        [   92.694639]  ? __fget_light+0xa1/0x100
        [   92.695162]  ? bpf_prog_inc+0x23/0x30
        [   92.695685]  ? __sys_bpf+0xb40/0x2c80
        [   92.696324]  ? bpf_link_get_from_fd+0x90/0x90
        [   92.697150]  ? mark_held_locks+0x24/0x90
        [   92.698007]  ? lockdep_hardirqs_on_prepare+0x124/0x220
        [   92.699045]  ? finish_task_switch+0xe6/0x370
        [   92.700072]  ? lockdep_hardirqs_on+0x79/0x100
        [   92.701233]  ? finish_task_switch+0x11d/0x370
        [   92.702264]  ? __switch_to+0x2c0/0x740
        [   92.703148]  ? mark_held_locks+0x24/0x90
        [   92.704155]  ? __x64_sys_bpf+0x45/0x50
        [   92.705146]  ? do_syscall_64+0x35/0x80
        [   92.706953]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
        [...]
      
      Turns out that the program rejection from e411901c ("bpf: allow for tailcalls
      in BPF subprograms for x64 JIT") is buggy since env->prog->aux->tail_call_reachable
      is never true. Commit ebf7d1f5 ("bpf, x64: rework pro/epilogue and tailcall
      handling in JIT") added a tracker into check_max_stack_depth() which propagates
      the tail_call_reachable condition throughout the subprograms. This info is then
      assigned to the subprogram's func[i]->aux->tail_call_reachable. However, in
      the case of the rejection check upon JIT failure, env->prog->aux->tail_call_reachable
      is used, and func[0]->aux->tail_call_reachable, which represents the main
      program's information, was never propagated to the outer env->prog->aux.
      Add this propagation into check_max_stack_depth() where it belongs so that
      the check can be done reliably.
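      The missing propagation can be modeled in userspace C; the structs below are heavily reduced hypothetical stand-ins for the verifier's types (the real ones carry far more state), showing only the one-line mirroring of the main program's flag:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the verifier structures. */
struct bpf_prog_aux { bool tail_call_reachable; };
struct bpf_prog { struct bpf_prog_aux *aux; };
struct bpf_subprog_info { bool tail_call_reachable; };

struct bpf_verifier_env {
	struct bpf_prog *prog;
	struct bpf_subprog_info subprog_info[4];
	int subprog_cnt;
};

/*
 * Sketch of the fix: after the stack-depth walk has marked which
 * subprograms can reach a tail call, mirror the main program's flag
 * (subprog_info[0]) into env->prog->aux so that later checks -- such
 * as rejecting the interpreter fallback when the JIT fails -- see it.
 */
static void propagate_tail_call_reachable(struct bpf_verifier_env *env)
{
	if (env->subprog_info[0].tail_call_reachable)
		env->prog->aux->tail_call_reachable = true;
}
```

      Without this mirroring, checks that consult env->prog->aux instead of subprog_info[0] always see the flag as false, which is the bug described above.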
      
      Fixes: ebf7d1f5 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
      Fixes: e411901c ("bpf: allow for tailcalls in BPF subprograms for x64 JIT")
      Co-developed-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/bpf/618c34e3163ad1a36b1e82377576a6081e182f25.1626123173.git.daniel@iogearbox.net
      5dd0a6b8
  3. Jul 09, 2021
    • bpf: Track subprog poke descriptors correctly and fix use-after-free · f263a814
      John Fastabend authored
      Subprograms are calling map_poke_track(), but on program release there is no
      hook to call map_poke_untrack(). However, on program release, the aux memory
      (and poke descriptor table) is freed even though we still have a reference to
      it in the element list of the map aux data. When we run map_poke_run(), we then
      end up accessing free'd memory, triggering KASAN in prog_array_map_poke_run():
      
        [...]
        [  402.824689] BUG: KASAN: use-after-free in prog_array_map_poke_run+0xc2/0x34e
        [  402.824698] Read of size 4 at addr ffff8881905a7940 by task hubble-fgs/4337
        [  402.824705] CPU: 1 PID: 4337 Comm: hubble-fgs Tainted: G          I       5.12.0+ #399
        [  402.824715] Call Trace:
        [  402.824719]  dump_stack+0x93/0xc2
        [  402.824727]  print_address_description.constprop.0+0x1a/0x140
        [  402.824736]  ? prog_array_map_poke_run+0xc2/0x34e
        [  402.824740]  ? prog_array_map_poke_run+0xc2/0x34e
        [  402.824744]  kasan_report.cold+0x7c/0xd8
        [  402.824752]  ? prog_array_map_poke_run+0xc2/0x34e
        [  402.824757]  prog_array_map_poke_run+0xc2/0x34e
        [  402.824765]  bpf_fd_array_map_update_elem+0x124/0x1a0
        [...]
      
      The elements concerned are walked as follows:
      
          for (i = 0; i < elem->aux->size_poke_tab; i++) {
                 poke = &elem->aux->poke_tab[i];
          [...]
      
      The access to size_poke_tab is a 4 byte read, verified by checking offsets
      in the KASAN dump:
      
        [  402.825004] The buggy address belongs to the object at ffff8881905a7800
                       which belongs to the cache kmalloc-1k of size 1024
        [  402.825008] The buggy address is located 320 bytes inside of
                       1024-byte region [ffff8881905a7800, ffff8881905a7c00)
      
      The pahole output of bpf_prog_aux:
      
        struct bpf_prog_aux {
          [...]
          /* --- cacheline 5 boundary (320 bytes) --- */
          u32                        size_poke_tab;        /*   320     4 */
          [...]
      
      In general, subprograms do not necessarily manage their own data structures.
      For example, BTF func_info and linfo are just pointers to the main program
      structure. This allows reference counting and cleanup to be done on the latter
      which simplifies their management a bit. The aux->poke_tab struct, however,
      did not follow this logic. The initial proposed fix for this use-after-free
      bug further embedded poke data tracking into the subprogram with proper
      reference counting. However, Daniel and Alexei questioned why we were
      treating these objects as special; I agree, it's unnecessary. The fix here
      removes the per-subprogram poke table allocation and map tracking and
      instead simply points the aux->poke_tab pointer at the main program's poke
      table. This way, map tracking is simplified to the main program and we do
      not need to manage it per subprogram.
      
      This also means that bpf_prog_free_deferred(), which unwinds the program
      reference counting and kfrees objects, needs to ensure that we don't try
      to double free the poke_tab when freeing the subprog structures. This is
      easily solved by NULL'ing the poke_tab pointer. The second detail is to
      ensure that per-subprogram JIT logic only does fixups on poke_tab[]
      entries it owns. To do this, we add a pointer in the poke structure to
      point at the subprogram value so JITs can easily check while walking the
      poke_tab structure if the current entry belongs to the current program.
      The aux pointer is stable and therefore suitable for such comparison. On
      the jit_subprogs() error path, we omit cleaning up the poke->aux field
      because these are only ever referenced from the JIT side, but on error we
      will never make it to the JIT, so it's fine to leave them dangling.
      Removing these pointers would complicate the error path for no reason.
      However, we do need to untrack all poke descriptors from the main program
      as otherwise they could race with the freeing of JIT memory from the
      subprograms. Lastly, a748c697 ("bpf: propagate poke descriptors to
      subprograms") had an off-by-one on the subprogram instruction index range
      check: it tested 'insn_idx >= subprog_start && insn_idx <= subprog_end',
      but subprog_end is the next subprogram's start instruction, so the upper
      bound must be strict.
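      The sharing scheme and the corrected range check can be sketched in userspace C; the structs are reduced hypothetical stand-ins (only the fields the description mentions), and the function names are illustrative, not the kernel symbols:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced, hypothetical stand-ins for the structures involved. */
struct bpf_prog_aux;
struct bpf_jit_poke_descriptor {
	struct bpf_prog_aux *aux;   /* owning (sub)program; stable identity */
	int insn_idx;
};
struct bpf_prog_aux {
	struct bpf_jit_poke_descriptor *poke_tab;
	int size_poke_tab;
};

/* Subprograms alias the main program's table instead of copying it. */
static void share_poke_tab(struct bpf_prog_aux *main_aux,
			   struct bpf_prog_aux *sub)
{
	sub->poke_tab = main_aux->poke_tab;
	sub->size_poke_tab = main_aux->size_poke_tab;
}

/* On subprog teardown, drop the alias so the table is freed only once. */
static void free_subprog(struct bpf_prog_aux *sub)
{
	sub->poke_tab = NULL;
	sub->size_poke_tab = 0;
}

/*
 * Instruction-range ownership test with the off-by-one fixed:
 * subprog_end is the *next* subprogram's start, so the upper
 * comparison must be strict.
 */
static bool insn_in_subprog(int insn_idx, int subprog_start, int subprog_end)
{
	return insn_idx >= subprog_start && insn_idx < subprog_end;
}
```

      A JIT walking poke_tab would additionally compare each entry's aux pointer against its own prog's aux to skip entries it does not own.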
      
      Fixes: a748c697 ("bpf: propagate poke descriptors to subprograms")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210707223848.14580-2-john.fastabend@gmail.com
      f263a814
  4. Jul 08, 2021
    • kdump: use vmlinux_build_id to simplify · 44e8a5e9
      Stephen Boyd authored
      We can use the vmlinux_build_id array here now instead of open coding it.
      This mostly consolidates code.
      
      Link: https://lkml.kernel.org/r/20210511003845.2429846-14-swboyd@chromium.org
      
      
      Signed-off-by: Stephen Boyd <swboyd@chromium.org>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Evan Green <evgreen@chromium.org>
      Cc: Hsin-Yi Wang <hsinyi@chromium.org>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Steven Rostedt <...
      44e8a5e9
    • module: add printk formats to add module build ID to stacktraces · 9294523e
      Stephen Boyd authored
      Let's make kernel stacktraces easier to identify by including the build
      ID[1] of a module if the stacktrace is printing a symbol from a module.
      This makes it simpler for developers to locate a kernel module's full
      debuginfo for a particular stacktrace.  Combined with
      scripts/decode_stacktrace.sh, a developer can download the matching
      debuginfo from a debuginfod[2] server and find the exact file and line
      number for the functions plus offsets in a stacktrace that match the
      module.  This is especially useful for pstore crash debugging where the
      kernel crashes are recorded in something like console-ramoops and the
      recovery kernel/modules are different or the debuginfo doesn't exist on
      the device due to space concerns (the debuginfo can be too large for space
      limited devices).
      
      Originally, I put this on the %pS format, but that was quickly rejected
      given that %pS is used in other places such as ftrace where build IDs
      aren't meaningful.  There were discussions on the list about putting every
      module build ID into the "Modules linked in:" section of the stacktrace
      message but that quickly becomes very hard to read once you have more than
      three or four modules linked in.  It also provides too much information
      when we don't expect each module to be traversed in a stacktrace.  Having
      the build ID for modules that aren't important just makes things messy.
      Splitting it to multiple lines for each module quickly explodes the number
      of lines printed in an oops too, possibly wrapping the warning off the
      console.  And finally, trying to stash away each module used in a
      callstack to provide the ID of each symbol printed is cumbersome and would
      require changes to each architecture to stash away modules and return
      their build IDs once unwinding has completed.
      
      Instead, we opt for the simpler approach of introducing new printk formats
      '%pS[R]b' for "pointer symbolic backtrace with module build ID" and '%pBb'
      for "pointer backtrace with module build ID" and then updating the few
      places in the architecture layer where the stacktrace is printed to use
      this new format.
      
      Before:
      
       Call trace:
        lkdtm_WARNING+0x28/0x30 [lkdtm]
        direct_entry+0x16c/0x1b4 [lkdtm]
        full_proxy_write+0x74/0xa4
        vfs_write+0xec/0x2e8
      
      After:
      
       Call trace:
        lkdtm_WARNING+0x28/0x30 [lkdtm 6c2215028606bda50de823490723dc4bc5bf46f9]
        direct_entry+0x16c/0x1b4 [lkdtm 6c2215028606bda50de823490723dc4bc5bf46f9]
        full_proxy_write+0x74/0xa4
        vfs_write+0xec/0x2e8
      
      [akpm@linux-foundation.org: fix build with CONFIG_MODULES=n, tweak code layout]
      [rdunlap@infradead.org: fix build when CONFIG_MODULES is not set]
        Link: https://lkml.kernel.org/r/20210513171510.20328-1-rdunlap@infradead.org
      [akpm@linux-foundation.org: make kallsyms_lookup_buildid() static]
      [cuibixuan@huawei.com: fix build error when CONFIG_SYSFS is disabled]
        Link: https://lkml.kernel.org/r/20210525105049.34804-1-cuibixuan@huawei.com
      
      Link: https://lkml.kernel.org/r/20210511003845.2429846-6-swboyd@chromium.org
      Link: https://fedoraproject.org/wiki/Releases/FeatureBuildId [1]
      Link: https://sourceware.org/elfutils/Debuginfod.html [2]
      Signed-off-by: Stephen Boyd <swboyd@chromium.org>
      Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Jessica Yu <jeyu@kernel.org>
      Cc: Evan Green <evgreen@chromium.org>
      Cc: Hsin-Yi Wang <hsinyi@chromium.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9294523e
    • PM: hibernate: disable when there are active secretmem users · 9a436f8f
      Mike Rapoport authored
      It is unsafe to allow saving of secretmem areas to the hibernation
      snapshot as they would be visible after the resume and this essentially
      will defeat the purpose of secret memory mappings.
      
      Prevent hibernation whenever there are active secret memory users.
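      The gating can be modeled with a plain reference count. The names below (secretmem_users, secretmem_active(), the hibernate() stub) are illustrative stand-ins mirroring the description, not the kernel's exact API:

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Count of live secretmem users (file descriptors / mappings). */
static atomic_int secretmem_users;

static void secretmem_get(void) { atomic_fetch_add(&secretmem_users, 1); }
static void secretmem_put(void) { atomic_fetch_sub(&secretmem_users, 1); }
static bool secretmem_active(void) { return atomic_load(&secretmem_users) != 0; }

/* Hibernation entry point refuses to snapshot while secretmem is in use. */
static int hibernate(void)
{
	if (secretmem_active())
		return -EBUSY;   /* secret areas must never reach the snapshot */
	return 0;            /* ...proceed with creating the snapshot... */
}
```

      Hibernation simply fails with a busy error until the last secretmem user goes away.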
      
      Link: https://lkml.kernel.org/r/20210518072034.31572-6-rppt@kernel.org
      
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Elena Reshetova <elena.reshetova@intel.com>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Bottomley <jejb@linux.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Palmer Dabbelt <palmerdabbelt@google.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tycho Andersen <tycho@tycho.ws>
      Cc: Will Deacon <will@kernel.org>
      Cc: kernel test robot <lkp@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9a436f8f
    • mm: introduce memfd_secret system call to create "secret" memory areas · 1507f512
      Mike Rapoport authored
      Introduce a "memfd_secret" system call with the ability to create memory
      areas visible only in the context of the owning process and mapped
      neither into other processes nor into the kernel page tables.
      
      The secretmem feature is off by default and the user must explicitly
      enable it at boot time.
      
      Once secretmem is enabled, the user will be able to create a file
      descriptor using the memfd_secret() system call.  The memory areas created
      by mmap() calls from this file descriptor will be unmapped from the kernel
      direct map and they will be only mapped in the page table of the processes
      that have access to the file descriptor.
      
      Secretmem is designed to provide the following protections:
      
      * Enhanced protection (in conjunction with all the other in-kernel
        attack prevention systems) against ROP attacks.  Secretmem makes
        "simple" ROP insufficient to perform exfiltration, which increases the
        required complexity of the attack.  Along with other protections like
        the kernel stack size limit and address space layout randomization
        which make finding gadgets really hard, the absence of any in-kernel
        primitive for accessing secret memory means the one-gadget ROP attack
        can't work.  Since the only way to access secret memory is to
        reconstruct the missing mapping entry, the attacker has to recover the
        physical page, insert a PTE pointing to it in the kernel, and then
        retrieve the contents.  That takes at least three gadgets, which is a
        level of difficulty beyond most standard attacks.
      
      * Prevent cross-process secret userspace memory exposures.  Once the
        secret memory is allocated, the user can't accidentally pass it into
        the kernel to be transmitted somewhere.  The secretmem pages cannot be
        accessed via the direct map and they are disallowed in GUP.
      
      * Harden against exploited kernel flaws.  In order to access secretmem,
        a kernel-side attack would need to either walk the page tables and
        create new ones, or spawn a new privileged userspace process to
        perform secrets exfiltration using ptrace.
      
      File descriptor based memory has several advantages over the
      "traditional" mm interfaces such as mlock(), mprotect(), and madvise().
      The file descriptor approach allows explicit and controlled sharing of
      the memory areas and allows sealing of operations.  Besides, file
      descriptor based memory paves the way for VMMs to remove the secret
      memory range from the userspace hypervisor process, for instance QEMU.
      Andy Lutomirski says:
      
        "Getting fd-backed memory into a guest will take some possibly major
        work in the kernel, but getting vma-backed memory into a guest without
        mapping it in the host user address space seems much, much worse."
      
      memfd_secret() is made a dedicated system call rather than an extension
      to memfd_create() because its purpose is to allow the user to create
      more secure memory mappings rather than to simply allow file based
      access to the memory.  Nowadays the cost of a new system call is
      negligible, while it is way simpler for userspace to deal with a
      clear-cut system call than with a multiplexer or an overloaded syscall.
      Moreover, the initial implementation of memfd_secret() is completely
      distinct from memfd_create(), so there is not much sense in overloading
      memfd_create() to begin with.  If a need for code sharing between these
      implementations arises, it can easily be achieved without adjusting
      user-visible APIs.
      
      The secret memory remains accessible in the process context using uaccess
      primitives, but it is not exposed to the kernel otherwise; secret memory
      areas are removed from the direct map and functions in the
      follow_page()/get_user_page() family will refuse to return a page that
      belongs to the secret memory area.
      
      If a use case ever requires exposing secretmem to the kernel, it will be
      an opt-in request in the system call flags so that the user has to
      decide what data can be exposed to the kernel.
      
      Removing pages from the direct map may cause its fragmentation on
      architectures that use large pages to map the physical memory, which
      affects the system performance.  However, the original Kconfig text for
      CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "...  can
      improve the kernel's performance a tiny bit ..." (commit 00d1c5e0
      ("x86: add gbpages switches")) and the recent report [1] showed that "...
      although 1G mappings are a good default choice, there is no compelling
      evidence that it must be the only choice".  Hence, it is sufficient to
      have secretmem disabled by default with the ability of a system
      administrator to enable it at boot time.
      
      Pages in the secretmem regions are unevictable and unmovable to avoid
      accidental exposure of the sensitive data via swap or during page
      migration.
      
      Since the secretmem mappings are locked in memory they cannot exceed
      RLIMIT_MEMLOCK.  Since these mappings are already locked independently
      from mlock(), an attempt to mlock()/munlock() secretmem range would fail
      and mlockall()/munlockall() will ignore secretmem mappings.
      
      However, unlike mlock()ed memory, secretmem currently behaves more like
      long-term GUP: secretmem mappings are unmovable mappings directly consumed
      by user space.  With default limits, there is no excessive use of
      secretmem and it poses no real problem in combination with
      ZONE_MOVABLE/CMA, but in the future this should be addressed to allow
      balanced use of large amounts of secretmem along with ZONE_MOVABLE/CMA.
      
      A page that was a part of the secret memory area is cleared when it is
      freed to ensure the data is not exposed to the next user of that page.
      
      The following example demonstrates creation of a secret mapping (error
      handling is omitted):
      
      	fd = memfd_secret(0);
      	ftruncate(fd, MAP_SIZE);
      	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
      		   MAP_SHARED, fd, 0);
      
      [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/
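      A fuller, self-contained version of the example above, as a sketch: __NR_memfd_secret is defined by hand with its x86-64 number (other architectures differ), and the function degrades gracefully when the running kernel lacks secretmem or it wasn't enabled at boot, since the syscall then fails with ENOSYS:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_memfd_secret
#define __NR_memfd_secret 447   /* x86-64 number; other arches differ */
#endif

#define SECRET_SIZE 4096UL

/*
 * Returns 0 both on success and when secretmem is simply unavailable
 * (kernel too old, CONFIG_SECRETMEM off, or booted without
 * secretmem.enable=1, or memory-locking limits hit); -1 only on
 * unexpected errors.
 */
static int memfd_secret_demo(void)
{
	int fd = (int)syscall(__NR_memfd_secret, 0);
	if (fd < 0) {
		fprintf(stderr, "memfd_secret unavailable: %s\n",
			strerror(errno));
		return 0;
	}
	if (ftruncate(fd, SECRET_SIZE) < 0) {
		close(fd);
		return -1;
	}
	char *ptr = mmap(NULL, SECRET_SIZE, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (ptr == MAP_FAILED) {
		/* e.g. RLIMIT_MEMLOCK exhausted; not a logic error */
		close(fd);
		return 0;
	}
	/* These pages are now absent from the kernel's direct map. */
	strcpy(ptr, "top secret");
	munmap(ptr, SECRET_SIZE);
	close(fd);
	return 0;
}
```

      On a kernel booted with secretmem.enable=1, the mapping behaves like ordinary anonymous memory from userspace, but its pages cannot be reached through the direct map or GUP.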
      
      [akpm@linux-foundation.org: suppress Kconfig whine]
      
      Link: https://lkml.kernel.org/r/20210518072034.31572-5-rppt@kernel.org
      
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
      Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Elena Reshetova <elena.reshetova@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Bottomley <jejb@linux.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Palmer Dabbelt <palmerdabbelt@google.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tycho Andersen <tycho@tycho.ws>
      Cc: Will Deacon <will@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: kernel test robot <lkp@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1507f512
    • Fix UCOUNT_RLIMIT_SIGPENDING counter leak · f3791f4d
      Alexey Gladkov authored
      We must properly handle errors when we increase the rlimit counter
      and the ucounts reference counter. We have to do this with RCU
      protection to prevent a possible use-after-free that could occur due to
      a concurrent put_cred_rcu().
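      The unwinding on failure can be modeled in userspace C. The struct and function names below are reduced, hypothetical stand-ins mirroring the description (the real code additionally needs RCU around the reference drop, which this model omits):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reduced model of the ucounts object. */
struct ucounts {
	atomic_long count;      /* reference counter */
	atomic_long sigpending; /* RLIMIT_SIGPENDING usage */
};

/*
 * Take a reference and bump the rlimit counter; if the limit is
 * exceeded, unwind BOTH counters before reporting failure -- leaving
 * either one bumped is exactly the kind of leak being fixed.
 */
static bool inc_rlimit_get_ucounts(struct ucounts *uc, long max)
{
	atomic_fetch_add(&uc->count, 1);              /* pin the ucounts */
	long usage = atomic_fetch_add(&uc->sigpending, 1) + 1;
	if (usage > max) {
		atomic_fetch_sub(&uc->sigpending, 1); /* undo the bump */
		atomic_fetch_sub(&uc->count, 1);      /* drop the ref */
		return false;
	}
	return true;
}

static void dec_rlimit_put_ucounts(struct ucounts *uc)
{
	atomic_fetch_sub(&uc->sigpending, 1);
	atomic_fetch_sub(&uc->count, 1);
}
```

      The key invariant is that a failed increase leaves both counters exactly as they were, so no reference or rlimit charge can leak.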
      
      The following reproducer triggers the problem:
      
        $ cat testcase.sh
        case "${STEP:-0}" in
        0)
      	ulimit -Si 1
      	ulimit -Hi 1
      	STEP=1 unshare -rU "$0"
      	killall sleep
      	;;
        1)
      	for i in 1 2 3 4 5; do unshare -rU sleep 5 & done
      	;;
        esac
      
      with the KASAN report being along the lines of
      
        BUG: KASAN: use-after-free in put_ucounts+0x17/0xa0
        Write of size 4 at addr ffff8880045f031c by task swapper/2/0
      
        CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.13.0+ #19
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-alt4 04/01/2014
        Call Trace:
         <IRQ>
         put_ucounts+0x17/0xa0
         put_cred_rcu+0xd5/0x190
         rcu_core+0x3bf/0xcb0
         __do_softirq+0xe3/0x341
         irq_exit_rcu+0xbe/0xe0
         sysvec_apic_timer_interrupt+0x6a/0x90
         </IRQ>
         asm_sysvec_apic_timer_interrupt+0x12/0x20
         default_idle_call+0x53/0x130
         do_idle+0x311/0x3c0
         cpu_startup_entry+0x14/0x20
         secondary_startup_64_no_verify+0xc2/0xcb
      
        Allocated by task 127:
         kasan_save_stack+0x1b/0x40
         __kasan_kmalloc+0x7c/0x90
         alloc_ucounts+0x169/0x2b0
         set_cred_ucounts+0xbb/0x170
         ksys_unshare+0x24c/0x4e0
         __x64_sys_unshare+0x16/0x20
         do_syscall_64+0x37/0x70
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
        Freed by task 0:
         kasan_save_stack+0x1b/0x40
         kasan_set_track+0x1c/0x30
         kasan_set_free_info+0x20/0x30
         __kasan_slab_free+0xeb/0x120
         kfree+0xaa/0x460
         put_cred_rcu+0xd5/0x190
         rcu_core+0x3bf/0xcb0
         __do_softirq+0xe3/0x341
      
        The buggy address belongs to the object at ffff8880045f0300
         which belongs to the cache kmalloc-192 of size 192
        The buggy address is located 28 bytes inside of
         192-byte region [ffff8880045f0300, ffff8880045f03c0)
        The buggy address belongs to the page:
        page:000000008de0a388 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff8880045f0000 pfn:0x45f0
        flags: 0x100000000000200(slab|node=0|zone=1)
        raw: 0100000000000200 ffffea00000f4640 0000000a0000000a ffff888001042a00
        raw: ffff8880045f0000 000000008010000d 00000001ffffffff 0000000000000000
        page dumped because: kasan: bad access detected
      
        Memory state around the buggy address:
         ffff8880045f0200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff8880045f0280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
        >ffff8880045f0300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                    ^
         ffff8880045f0380: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
         ffff8880045f0400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ==================================================================
        Disabling lock debugging due to kernel taint
      
      Fixes: d6469690 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarAlexey Gladkov <legion@kernel.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f3791f4d
    • Baokun Li's avatar
      ftrace: Use list_move instead of list_del/list_add · 3ecda644
      Baokun Li authored
      Use list_move() instead of list_del() + list_add().
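
      The refactor amounts to the following, shown here with a minimal
      userspace re-implementation of the kernel's circular list (only the
      few helpers needed, not the full <linux/list.h>):

      ```c
      #include <assert.h>
      #include <stdio.h>

      struct list_head { struct list_head *next, *prev; };

      static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

      /* Insert "new" right after "head". */
      static void list_add(struct list_head *new, struct list_head *head)
      {
          new->next = head->next;
          new->prev = head;
          head->next->prev = new;
          head->next = new;
      }

      /* Unlink "entry" from whatever list it is on. */
      static void list_del(struct list_head *entry)
      {
          entry->prev->next = entry->next;
          entry->next->prev = entry->prev;
      }

      /* The helper the commit switches to: delete + add in one call. */
      static void list_move(struct list_head *entry, struct list_head *head)
      {
          list_del(entry);
          list_add(entry, head);
      }

      int main(void)
      {
          struct list_head src, dst, node;
          INIT_LIST_HEAD(&src);
          INIT_LIST_HEAD(&dst);
          list_add(&node, &src);

          list_move(&node, &dst);   /* one call instead of two */

          assert(src.next == &src); /* src is empty again */
          assert(dst.next == &node);/* node now lives on dst */
          puts("ok");
          return 0;
      }
      ```

      Behavior is identical; list_move() just expresses the intent directly.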
      
      Link: https://lkml.kernel.org/r/20210608031108.2820996-1-libaokun1@huawei.com

      Reported-by: default avatarHulk Robot <hulkci@huawei.com>
      Signed-off-by: default avatarBaokun Li <libaokun1@huawei.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      3ecda644
  5. Jul 07, 2021
    • Steven Rostedt (VMware)'s avatar
      tracing/histograms: Fix parsing of "sym-offset" modifier · 26c56373
      Steven Rostedt (VMware) authored
      With the addition of simple mathematical operations (plus and minus), the
      parsing of the "sym-offset" modifier broke, as it took the '-' part of the
      "sym-offset" as a minus, and tried to break it up into a mathematical
      operation of "field.sym - offset", in which case it failed to parse
      (unless the event had a field called "offset").
      
      Neither the .sym nor the .sym-offset modifier should be entered into
      mathematical calculations anyway. If ".sym-offset" is found in the
      modifier, simply treat the field as a value that cannot be operated on.
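
      The idea can be sketched in userspace C (a hypothetical helper, not
      the histogram parser's actual code): before treating a '-' as
      subtraction, check whether it sits inside a ".sym-offset" modifier.

      ```c
      #include <stdbool.h>
      #include <stdio.h>
      #include <string.h>

      /* Return true if the '-' at "minus" is a real subtraction operator,
       * false if it is part of the ".sym-offset" field modifier. */
      static bool is_subtraction(const char *expr, const char *minus)
      {
          const char *mod = strstr(expr, ".sym-offset");

          /* '-' inside ".sym-offset" belongs to the modifier */
          if (mod && minus > mod && minus < mod + strlen(".sym-offset"))
              return false;
          return true;
      }

      int main(void)
      {
          const char *a = "call_site.sym-offset";   /* modifier, no math */
          const char *b = "bytes_req-bytes_alloc";  /* genuine minus */

          printf("%d %d\n",
                 is_subtraction(a, strchr(a, '-')),
                 is_subtraction(b, strchr(b, '-')));
          return 0;
      }
      ```

      With a check along these lines the parser no longer tries to split
      "field.sym-offset" into "field.sym - offset".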
      
      Link: https://lkml.kernel.org/r/20210707110821.188ae255@oasis.local.home
      
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 100719dc ("tracing: Add simple expression support to hist triggers")
      Reviewed-by: default avatarTom Zanussi <zanussi@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      26c56373
  6. Jul 05, 2021
  7. Jul 02, 2021
  8. Jul 01, 2021
  9. Jun 30, 2021
    • Viresh Kumar's avatar
      cpufreq: CPPC: Add support for frequency invariance · 1eb5dde6
      Viresh Kumar authored
      
      The Frequency Invariance Engine (FIE) provides a frequency scaling
      correction factor that helps achieve more accurate load-tracking.
      
      Normally, this scaling factor can be obtained directly with the help of
      the cpufreq drivers as they know the exact frequency the hardware is
      running at. But that isn't the case for the CPPC cpufreq driver.
      
      Another way of obtaining that is using the arch specific counter
      support, which is already present in the kernel, but that hardware
      is optional for platforms.
      
      This patch updates the CPPC driver to register itself with the topology
      core to provide its own implementation (cppc_scale_freq_tick()) of
      topology_scale_freq_tick() which gets called by the scheduler on every
      tick. Note that the arch specific counters have higher priority than
      CPPC counters, if available, though the CPPC driver doesn't need to have
      any special handling for that.
      
      On an invocation of cppc_scale_freq_tick(), we schedule an irq work
      (since we reach here from hard-irq context), which then schedules a
      normal work item and cppc_scale_freq_workfn() updates the per_cpu
      arch_freq_scale variable based on the counter updates since the last
      tick.
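
      The per-tick arithmetic can be sketched roughly as follows. This is a
      simplified userspace model of the idea described above, not the
      driver's exact code; the names and the capping behaviour are
      assumptions:

      ```c
      #include <stdint.h>
      #include <stdio.h>

      #define SCHED_CAPACITY_SCALE 1024ULL

      /* Estimate delivered performance from the counter deltas since the
       * last tick, then express it as a fraction of the highest
       * performance level, scaled to SCHED_CAPACITY_SCALE. */
      static uint64_t freq_scale(uint64_t delivered_delta,
                                 uint64_t reference_delta,
                                 uint64_t reference_perf,
                                 uint64_t highest_perf)
      {
          /* performance actually delivered over the window */
          uint64_t perf = reference_perf * delivered_delta / reference_delta;
          uint64_t scale = perf * SCHED_CAPACITY_SCALE / highest_perf;

          return scale > SCHED_CAPACITY_SCALE ? SCHED_CAPACITY_SCALE : scale;
      }

      int main(void)
      {
          /* CPU ran at half its highest perf over the last tick -> 512 */
          printf("%llu\n",
                 (unsigned long long)freq_scale(1000, 1000, 50, 100));
          return 0;
      }
      ```

      The result is what ends up in the per_cpu arch_freq_scale variable,
      which the scheduler reads to correct its load-tracking signals.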
      
      To allow platforms to disable this CPPC counter-based frequency
      invariance support, this is all done under CONFIG_ACPI_CPPC_CPUFREQ_FIE,
      which is enabled by default.
      
      This also exports sched_setattr_nocheck() as the CPPC driver can be
      built as a module.
      
      Cc: linux-acpi@vger.kernel.org
      Tested-by: default avatarVincent Guittot <vincent.guittot@linaro.org>
      Reviewed-by: default avatarIonela Voinescu <ionela.voinescu@arm.com>
      Tested-by: default avatarQian Cai <quic_qiancai@quicinc.com>
      Signed-off-by: default avatarViresh Kumar <viresh.kumar@linaro.org>
      1eb5dde6
    • Paul Burton's avatar
      tracing: Simplify & fix saved_tgids logic · b81b3e95
      Paul Burton authored
      The tgid_map array records a mapping from pid to tgid, where the index
      of an entry within the array is the pid & the value stored at that index
      is the tgid.
      
      The saved_tgids_next() function iterates over pointers into the tgid_map
      array & dereferences the pointers which results in the tgid, but then it
      passes that dereferenced value to trace_find_tgid() which treats it as a
      pid & does a further lookup within the tgid_map array. It seems likely
      that the intent here was to skip over entries in tgid_map for which the
      recorded tgid is zero, but instead we end up skipping over entries for
      which the thread group leader hasn't yet had its own tgid recorded in
      tgid_map.
      
      A minimal fix would be to remove the call to trace_find_tgid, turning:
      
        if (trace_find_tgid(*ptr))
      
      into:
      
        if (*ptr)
      
      ..but it seems like this logic can be much simpler if we simply let
      seq_read() iterate over the whole tgid_map array & filter out empty
      entries by returning SEQ_SKIP from saved_tgids_show(). Here we take that
      approach, removing the incorrect logic here entirely.
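
      The resulting logic can be sketched in userspace C (SEQ_SKIP and the
      map are modelled locally here, not the kernel's seq_file machinery):

      ```c
      #include <stdio.h>

      #define PID_MAX  8
      #define SEQ_SKIP 1  /* stand-in for the kernel's SEQ_SKIP */

      static int tgid_map[PID_MAX]; /* index = pid, value = recorded tgid */

      /* Simplified saved_tgids_show(): print the entry if a tgid was
       * recorded for this pid, otherwise skip it -- no second lookup. */
      static int saved_tgids_show(int pid)
      {
          int tgid = tgid_map[pid];

          if (tgid == 0)
              return SEQ_SKIP;
          printf("%d %d\n", pid, tgid);
          return 0;
      }

      int main(void)
      {
          tgid_map[3] = 2; /* pid 3 belongs to thread group 2... */
          /* ...while pid 2, the group leader, was never recorded */

          for (int pid = 0; pid < PID_MAX; pid++)
              saved_tgids_show(pid);
          return 0;
      }
      ```

      The old trace_find_tgid(*ptr) check would have suppressed the "3 2"
      line here, because the leader's own entry (tgid_map[2]) is empty.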
      
      Link: https://lkml.kernel.org/r/20210630003406.4013668-1-paulburton@google.com
      
      Fixes: d914ba37 ("tracing: Add support for recording tgid of tasks")
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarPaul Burton <paulburton@google.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      b81b3e95
    • Austin Kim's avatar
      tracing: Change variable type as bool for clean-up · bfbf8d15
      Austin Kim authored
      The wakeup_rt, wakeup_dl, and tracing_dl variables are only ever set
      to 0 or 1, so changing their type to bool makes the relevant
      routines more readable.
      
      Link: https://lkml.kernel.org/r/20210629140548.GA1627@raspberrypi

      Signed-off-by: default avatarAustin Kim <austin.kim@lge.com>
      [ Removed unneeded initialization of static bool tracing_dl ]
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      bfbf8d15
  10. Jun 29, 2021