  1. Dec 10, 2021
  2. Dec 09, 2021
  3. Dec 08, 2021
  4. Dec 04, 2021
  5. Dec 03, 2021
• bpf: Fix the off-by-two error in range markings · 2fa7d94a
      Maxim Mikityanskiy authored
      The first commit cited below attempts to fix the off-by-one error that
      appeared in some comparisons with an open range. Due to this error,
      arithmetically equivalent pieces of code could get different verdicts
      from the verifier, for example (pseudocode):
      
        // 1. Passes the verifier:
        if (data + 8 > data_end)
            return early
        read *(u64 *)data, i.e. [data; data+7]
      
        // 2. Rejected by the verifier (should still pass):
        if (data + 7 >= data_end)
            return early
        read *(u64 *)data, i.e. [data; data+7]
      
The attempted fix, however, shifts the range by one in the wrong
direction, so the bug not only remains, but code like the following
also starts failing in the verifier:
      
        // 3. Rejected by the verifier, but the check is stricter than in #1.
        if (data + 8 >= data_end)
            return early
        read *(u64 *)data, i.e. [data; data+7]
      
The change performed by that fix converted the off-by-one bug into an
off-by-two one. The second commit cited below added BPF selftests
written to ensure that code chunks like #3 are rejected; however, they
should be accepted.
      
This commit fixes the off-by-two error by adjusting new_range in the
right direction and fixes the tests by changing the range to one that
should actually fail.
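
For illustration, a minimal XDP-style snippet of the bounds-check pattern
discussed above (not taken from the patch; the program layout and section
names are just the usual libbpf conventions), using the #2 form that the
verifier should accept:

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int bounds_check_example(struct xdp_md *ctx)
{
	void *data     = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;

	/* form #2 above: arithmetically the same as `data + 8 > data_end` */
	if (data + 7 >= data_end)
		return XDP_DROP;

	/* reads [data; data+7] */
	return (*(__u64 *)data & 1) ? XDP_PASS : XDP_DROP;
}

char LICENSE[] SEC("license") = "GPL";
```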
      
Fixes: fb2a311a ("bpf: fix off by one for range markings with L{T, E} patterns")
Fixes: b37242c7 ("bpf: add test cases to bpf selftests to cover all access tests")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20211130181607.593149-1-maximmi@nvidia.com
      2fa7d94a
  6. Dec 02, 2021
  7. Dec 01, 2021
• kprobes: Limit max data_size of the kretprobe instances · 6bbfa441
      Masami Hiramatsu authored
The 'kprobe::data_size' is unsigned, thus it cannot be negative.  But if
the user sets it to a large enough value (e.g. (size_t)-8), the result of
'data_size + sizeof(struct kretprobe_instance)' becomes smaller than
sizeof(struct kretprobe_instance) or even zero. As a result, the
kretprobe_instance is allocated without enough memory, and kretprobe
accesses outside of the allocated memory.
      
To avoid this issue, introduce a maximum limit on kretprobe::data_size.
4KB per instance should be enough.
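
As a rough userspace illustration of the wraparound and the kind of cap
described above (the struct below is a stand-in, not the real kernel
layout, and the 4KB limit is taken from the message):

```c
#include <stdio.h>
#include <stddef.h>

/* stand-in for the kernel structure; only its size matters here */
struct kretprobe_instance_demo {
	void *ret_addr;
	void *task;
	char data[];
};

int main(void)
{
	size_t data_size = (size_t)-8;	/* a "negative" value in disguise */
	size_t alloc_size = data_size + sizeof(struct kretprobe_instance_demo);

	/* the addition wraps around, so the allocation is far too small */
	printf("alloc_size = %zu (struct alone needs %zu)\n",
	       alloc_size, sizeof(struct kretprobe_instance_demo));

	/* the remedy: reject an oversized data_size before doing the math */
	if (data_size > 4096)
		printf("rejected: data_size exceeds the 4KB limit\n");
	return 0;
}
```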
      
      Link: https://lkml.kernel.org/r/163836995040.432120.10322772773821182925.stgit@devnote2
      
      Cc: stable@vger.kernel.org
Fixes: f47cd9b5 ("kprobes: kretprobe user entry-handler")
Reported-by: zhangyue <zhangyue1@kylinos.cn>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      6bbfa441
• tracing: Fix a kmemleak false positive in tracing_map · f25667e5
      Chen Jun authored
      Doing the command:
        echo 'hist:key=common_pid.execname,common_timestamp' > /sys/kernel/debug/tracing/events/xxx/trigger
      
      Triggers many kmemleak reports:
      
      unreferenced object 0xffff0000c7ea4980 (size 128):
        comm "bash", pid 338, jiffies 4294912626 (age 9339.324s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<00000000f3469921>] kmem_cache_alloc_trace+0x4c0/0x6f0
          [<0000000054ca40c3>] hist_trigger_elt_data_alloc+0x140/0x178
          [<00000000633bd154>] tracing_map_init+0x1f8/0x268
          [<000000007e814ab9>] event_hist_trigger_func+0xca0/0x1ad0
          [<00000000bf8520ed>] trigger_process_regex+0xd4/0x128
          [<00000000f549355a>] event_trigger_write+0x7c/0x120
          [<00000000b80f898d>] vfs_write+0xc4/0x380
          [<00000000823e1055>] ksys_write+0x74/0xf8
          [<000000008a9374aa>] __arm64_sys_write+0x24/0x30
          [<0000000087124017>] do_el0_svc+0x88/0x1c0
          [<00000000efd0dcd1>] el0_svc+0x1c/0x28
          [<00000000dbfba9b3>] el0_sync_handler+0x88/0xc0
          [<00000000e7399680>] el0_sync+0x148/0x180
      
The reason is that elts->pages[i] is allocated by get_zeroed_page(), and
kmemleak does not scan areas allocated by get_zeroed_page(), so the
addresses stored in elts->pages are regarded as leaked.

That is, elts->pages[i] will have pointers stored into it as well, and
without telling kmemleak about it, those pointers look like memory
without a reference.

To fix this, call kmemleak_alloc() to tell kmemleak to scan elts->pages[i].
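
A minimal sketch of that allocation pattern (illustrative only; the
surrounding tracing_map fields and error handling are omitted):

```c
/* allocate the per-element page and make it visible to kmemleak so the
 * pointers later stored in it are scanned rather than reported as leaks */
elts->pages[i] = (void *)get_zeroed_page(GFP_KERNEL);
if (!elts->pages[i])
	return -ENOMEM;

kmemleak_alloc(elts->pages[i], PAGE_SIZE, 1, GFP_KERNEL);
```

The matching teardown would pair kmemleak_free() with the page free.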
      
      Link: https://lkml.kernel.org/r/20211124140801.87121-1-chenjun102@huawei.com
      
      
      
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      f25667e5
• tracing/histograms: String compares should not care about signed values · 450fec13
      Steven Rostedt (VMware) authored
When comparing two strings for the "onmatch" histogram trigger, fields
that are strings use string comparisons, which do not care about
signedness.

Do not fail to match two string fields if one is an unsigned char array
and the other is a signed char array.
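
A small userspace sketch of the idea (the helper names are hypothetical,
not the kernel's): strip a leading "unsigned " qualifier before comparing
the two string field types.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* hypothetical helper: step over a leading "unsigned " qualifier */
static const char *skip_unsigned(const char *type)
{
	static const char prefix[] = "unsigned ";

	if (strncmp(type, prefix, sizeof(prefix) - 1) == 0)
		return type + sizeof(prefix) - 1;
	return type;
}

/* "char[64]" and "unsigned char[64]" should count as the same string type */
static bool string_types_match(const char *a, const char *b)
{
	return strcmp(skip_unsigned(a), skip_unsigned(b)) == 0;
}

int main(void)
{
	printf("%d\n", string_types_match("char[64]", "unsigned char[64]"));	/* 1 */
	return 0;
}
```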
      
      Link: https://lore.kernel.org/all/20211129123043.5cfd687a@gandalf.local.home/
      
Cc: stable@vger.kernel.org
      Cc: Tom Zanussi <zanussi@kernel.org>
      Cc: Yafang Shao <laoar.shao@gmail.com>
Fixes: b05e89ae ("tracing: Accept different type for synthetic event fields")
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reported-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      450fec13
  8. Nov 27, 2021
  9. Nov 26, 2021
  10. Nov 24, 2021
• PM: hibernate: Fix snapshot partial write lengths · 88a5045f
      Evan Green authored
      
snapshot_write() is inappropriately limiting the amount of data that can
be written in cases where a partial page has already been written. For
example, one would expect to be able to write 1 byte, then 4095 bytes to
the snapshot device, and have both of those writes complete fully (since
we are then aligned to a page again). But what ends up happening is the
1-byte write succeeds, and then only 4094 of the following 4095 bytes
complete successfully.
      
      The reason is that simple_write_to_buffer()'s second argument is the
      total size of the buffer, not the size of the buffer minus the offset.
      Since simple_write_to_buffer() accounts for the offset in its
      implementation, snapshot_write() can just pass the full page size
      directly down.
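
A hedged sketch of the pattern described above (variable names are
illustrative, not the literal patch):

```c
/* simple_write_to_buffer(to, available, ppos, from, count) already
 * accounts for the current offset in *ppos, so the available length
 * should be the full page, not PAGE_SIZE minus the offset. */
static ssize_t write_page_chunk(void *page_buf, loff_t *pg_offp,
				const char __user *buf, size_t count)
{
	/* before the fix: PAGE_SIZE - *pg_offp was passed here, which
	 * truncated the 4095-byte write that follows a 1-byte write */
	return simple_write_to_buffer(page_buf, PAGE_SIZE, pg_offp, buf, count);
}
```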
      
Signed-off-by: Evan Green <evgreen@chromium.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      88a5045f
• PM: hibernate: use correct mode for swsusp_close() · cefcf24b
      Thomas Zeitlhofer authored
      Commit 39fbef4b ("PM: hibernate: Get block device exclusively in
      swsusp_check()") changed the opening mode of the block device to
      (FMODE_READ | FMODE_EXCL).
      
      In the corresponding calls to swsusp_close(), the mode is still just
      FMODE_READ which triggers the warning in blkdev_flush_mapping() on
      resume from hibernate.
      
      So, use the mode (FMODE_READ | FMODE_EXCL) also when closing the
      device.
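
Illustrative pairing of the open and close modes (a sketch, not the
literal patch; the holder token below is hypothetical):

```c
/* any stable non-NULL cookie works as the exclusive holder */
static char swsusp_holder;

struct block_device *bdev;

bdev = blkdev_get_by_dev(swsusp_resume_device, FMODE_READ | FMODE_EXCL,
			 &swsusp_holder);
if (!IS_ERR(bdev)) {
	/* ... check the hibernation image signature ... */

	/* a device claimed with FMODE_EXCL must be released with FMODE_EXCL,
	 * otherwise the exclusive claim is never dropped */
	blkdev_put(bdev, FMODE_READ | FMODE_EXCL);	/* was: FMODE_READ */
}
```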
      
Fixes: 39fbef4b ("PM: hibernate: Get block device exclusively in swsusp_check()")
Signed-off-by: Thomas Zeitlhofer <thomas.zeitlhofer+lkml@ze-it.at>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      cefcf24b
• sched/scs: Reset task stack state in bringup_cpu() · dce1ca05
      Mark Rutland authored
To hot unplug a CPU, the idle task on that CPU calls a few layers of C
code before finally leaving the kernel. When KASAN is in use, poisoned
shadow is left around for each of the active stack frames, and when
shadow call stacks (SCS) are in use, the task's saved SCS SP is left
pointing at an arbitrary point within the task's shadow call stack.
      
When a CPU is offlined and then onlined back into the kernel, this stale
state can adversely affect execution. Stale KASAN shadow can alias new
stackframes and result in bogus KASAN warnings. A stale SCS SP is
effectively a memory leak, and prevents a portion of the shadow call
stack being used. Across a number of hotplug cycles the idle task's
entire shadow call stack can become unusable.
      
      We previously fixed the KASAN issue in commit:
      
        e1b77c92 ("sched/kasan: remove stale KASAN poison after hotplug")
      
      ... by removing any stale KASAN stack poison immediately prior to
      onlining a CPU.
      
      Subsequently in commit:
      
        f1a0a376 ("sched/core: Initialize the idle task with preemption disabled")
      
      ... the refactoring left the KASAN and SCS cleanup in one-time idle
      thread initialization code rather than something invoked prior to each
      CPU being onlined, breaking both as above.
      
      We fixed SCS (but not KASAN) in commit:
      
        63acd42c ("sched/scs: Reset the shadow stack when idle_task_exit")
      
      ... but as this runs in the context of the idle task being offlined it's
      potentially fragile.
      
To fix these consistently and more robustly, reset the SCS SP and KASAN
shadow of a CPU's idle task immediately before we online that CPU in
bringup_cpu(). This ensures the idle task always has a consistent state
when it is running, and removes the need to do so when exiting an idle
task.

Whenever any thread is created, dup_task_struct() will give the task a
stack which is free of KASAN shadow, and initialize the task's SCS SP,
so there's no need to specially initialize either for the idle thread
within init_idle(), as this was only necessary to handle hotplug cycles.
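
A hedged sketch of where that reset lands (error handling and the rest of
bringup_cpu() omitted; the reset helpers are the generic SCS/KASAN ones):

```c
static int bringup_cpu(unsigned int cpu)
{
	struct task_struct *idle = idle_thread_get(cpu);

	/* Reset stale stack state from the last time this CPU was online:
	 * drop any KASAN poison left on the idle task's stack and rewind
	 * its shadow call stack pointer back to the base. */
	scs_task_reset(idle);
	kasan_unpoison_task_stack(idle);

	/* ... bring the CPU up as before ... */
	return 0;
}
```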
      
      I've tested this on arm64 with:
      
      * gcc 11.1.0, defconfig +KASAN_INLINE, KASAN_STACK
      * clang 12.0.0, defconfig +KASAN_INLINE, KASAN_STACK, SHADOW_CALL_STACK
      
... offlining and onlining CPUs with:
      
      | while true; do
      |   for C in /sys/devices/system/cpu/cpu*/online; do
      |     echo 0 > $C;
      |     echo 1 > $C;
      |   done
      | done
      
Fixes: f1a0a376 ("sched/core: Initialize the idle task with preemption disabled")
Reported-by: Qian Cai <quic_qiancai@quicinc.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Qian Cai <quic_qiancai@quicinc.com>
      Link: https://lore.kernel.org/lkml/20211115113310.35693-1-mark.rutland@arm.com/
      dce1ca05
  11. Nov 23, 2021
• tracing/uprobe: Fix uprobe_perf_open probes iteration · 1880ed71
      Jiri Olsa authored
Add the missing 'tu' variable initialization in the probes loop;
otherwise the head 'tu' is used instead of the added probes.
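
A generic illustration of the bug class (identifiers here are
hypothetical, not the kernel code): the variable actually used inside the
loop must be the loop cursor.

```c
/* buggy pattern: 'tu' is set once before the loop and never advances,
 * so every iteration operates on the head probe */
tu = list_first_entry(probe_list, struct trace_uprobe, list);
list_for_each_entry(pos, probe_list, list)
	open_probe(tu);

/* fixed pattern: iterate 'tu' itself so each added probe is opened */
list_for_each_entry(tu, probe_list, list)
	open_probe(tu);
```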
      
      Link: https://lkml.kernel.org/r/20211123142801.182530-1-jolsa@kernel.org
      
      Cc: stable@vger.kernel.org
Fixes: 99c9a923 ("tracing/uprobe: Fix double perf_event linking on multiprobe uprobe")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      1880ed71
• perf: Ignore sigtrap for tracepoints destined for other tasks · 73743c3b
      Marco Elver authored
      syzbot reported that the warning in perf_sigtrap() fires, saying that
      the event's task does not match current:
      
       | WARNING: CPU: 0 PID: 9090 at kernel/events/core.c:6446 perf_pending_event+0x40d/0x4b0 kernel/events/core.c:6513
       | Modules linked in:
       | CPU: 0 PID: 9090 Comm: syz-executor.1 Not tainted 5.15.0-syzkaller #0
       | Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
       | RIP: 0010:perf_sigtrap kernel/events/core.c:6446 [inline]
       | RIP: 0010:perf_pending_event_disable kernel/events/core.c:6470 [inline]
       | RIP: 0010:perf_pending_event+0x40d/0x4b0 kernel/events/core.c:6513
       | ...
       | Call Trace:
       |  <IRQ>
       |  irq_work_single+0x106/0x220 kernel/irq_work.c:211
       |  irq_work_run_list+0x6a/0x90 kernel/irq_work.c:242
       |  irq_work_run+0x4f/0xd0 kernel/irq_work.c:251
       |  __sysvec_irq_work+0x95/0x3d0 arch/x86/kernel/irq_work.c:22
       |  sysvec_irq_work+0x8e/0xc0 arch/x86/kernel/irq_work.c:17
       |  </IRQ>
       |  <TASK>
       |  asm_sysvec_irq_work+0x12/0x20 arch/x86/include/asm/idtentry.h:664
       | RIP: 0010:__raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:152 [inline]
       | RIP: 0010:_raw_spin_unlock_irqrestore+0x38/0x70 kernel/locking/spinlock.c:194
       | ...
       |  coredump_task_exit kernel/exit.c:371 [inline]
       |  do_exit+0x1865/0x25c0 kernel/exit.c:771
       |  do_group_exit+0xe7/0x290 kernel/exit.c:929
       |  get_signal+0x3b0/0x1ce0 kernel/signal.c:2820
       |  arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
       |  handle_signal_work kernel/entry/common.c:148 [inline]
       |  exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
       |  exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
       |  __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
       |  syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
       |  do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
       |  entry_SYSCALL_64_after_hwframe+0x44/0xae
      
This shouldn't happen on x86, which has arch_irq_work_raise().
      
      The test program sets up a perf event with sigtrap set to fire on the
      'sched_wakeup' tracepoint, which fired in ttwu_do_wakeup().
      
      This happened because the 'sched_wakeup' tracepoint also takes a task
      argument passed on to perf_tp_event(), which is used to deliver the
      event to that other task.
      
      Since we cannot deliver synchronous signals to other tasks, skip an event if
      perf_tp_event() is targeted at another task and perf_event_attr::sigtrap is
      set, which will avoid ever entering perf_sigtrap() for such events.
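
A hedged sketch of that guard inside the tracepoint delivery loop
(context and surrounding code simplified):

```c
/* inside the loop that delivers a tracepoint event on behalf of 'task':
 * a synchronous SIGTRAP can only be raised in the current task, so skip
 * sigtrap events aimed at anyone else */
if (event->attr.sigtrap && task && task != current)
	continue;
```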
      
Fixes: 97ba62b2 ("perf: Add support for SIGTRAP on perf events")
Reported-by: <syzbot+663359e32ce6f1a305ad@syzkaller.appspotmail.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/YYpoCOBmC/kJWfmI@elver.google.com
      73743c3b
• locking/rwsem: Optimize down_read_trylock() under highly contended case · 14c24048
      Muchun Song authored
We found that a process with 10 thousand threads encountered a
regression when moving from Linux v4.14 to Linux v5.4. The workload
sometimes concurrently allocates lots of memory in different threads,
and in this case down_read_trylock() shows up as a hotspot. We therefore
suspect that rwsem has had a regression at least since Linux v5.4. To
make the problem easy to debug, we wrote a simple benchmark that creates
a similar situation, shown below.
      
        ```c++
        #include <sys/mman.h>
        #include <sys/time.h>
        #include <sys/resource.h>
        #include <sched.h>
      
  #include <cstdio>
  #include <cstdlib>
        #include <cassert>
        #include <thread>
        #include <vector>
        #include <chrono>
      
        volatile int mutex;
      
        void trigger(int cpu, char* ptr, std::size_t sz)
        {
        	cpu_set_t set;
        	CPU_ZERO(&set);
        	CPU_SET(cpu, &set);
        	assert(pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0);
      
        	while (mutex);
      
        	for (std::size_t i = 0; i < sz; i += 4096) {
        		*ptr = '\0';
        		ptr += 4096;
        	}
        }
      
        int main(int argc, char* argv[])
        {
        	std::size_t sz = 100;
      
        	if (argc > 1)
        		sz = atoi(argv[1]);
      
        	auto nproc = std::thread::hardware_concurrency();
        	std::vector<std::thread> thr;
        	sz <<= 30;
        	auto* ptr = mmap(nullptr, sz, PROT_READ | PROT_WRITE, MAP_ANON |
      			 MAP_PRIVATE, -1, 0);
        	assert(ptr != MAP_FAILED);
        	char* cptr = static_cast<char*>(ptr);
        	auto run = sz / nproc;
        	run = (run >> 12) << 12;
      
        	mutex = 1;
      
        	for (auto i = 0U; i < nproc; ++i) {
        		thr.emplace_back(std::thread([i, cptr, run]() { trigger(i, cptr, run); }));
        		cptr += run;
        	}
      
        	rusage usage_start;
        	getrusage(RUSAGE_SELF, &usage_start);
        	auto start = std::chrono::system_clock::now();
      
        	mutex = 0;
      
        	for (auto& t : thr)
        		t.join();
      
        	rusage usage_end;
        	getrusage(RUSAGE_SELF, &usage_end);
        	auto end = std::chrono::system_clock::now();
        	timeval utime;
        	timeval stime;
        	timersub(&usage_end.ru_utime, &usage_start.ru_utime, &utime);
        	timersub(&usage_end.ru_stime, &usage_start.ru_stime, &stime);
        	printf("usr: %ld.%06ld\n", utime.tv_sec, utime.tv_usec);
        	printf("sys: %ld.%06ld\n", stime.tv_sec, stime.tv_usec);
        	printf("real: %lu\n",
        	       std::chrono::duration_cast<std::chrono::milliseconds>(end -
        	       start).count());
      
        	return 0;
        }
        ```
      
The above program simply creates `nproc` threads, each of which tries
to touch memory (trigger page faults) on a different CPU. We then see a
profile like the following with `perf top`.
      
        25.55%  [kernel]                  [k] down_read_trylock
        14.78%  [kernel]                  [k] handle_mm_fault
        13.45%  [kernel]                  [k] up_read
         8.61%  [kernel]                  [k] clear_page_erms
         3.89%  [kernel]                  [k] __do_page_fault
      
The hottest instruction in down_read_trylock(), accounting for about
92% of its samples, is the cmpxchg below.
      
        91.89 │      lock   cmpxchg %rdx,(%rdi)
      
Since the problem appeared when migrating from Linux v4.14 to Linux
v5.4, we easily found that commit ddb20d1d ("locking/rwsem: Optimize
down_read_trylock()") caused the regression. That commit assumes the
rwsem is not contended at all, which is not always true for the mmap
lock, which can be contended by thousands of threads. As a result, most
threads need to run the cmpxchg at least twice to acquire the lock.
Atomic operations are more expensive than non-atomic instructions, which
caused the regression.
      
Using the above benchmark, the real execution times on an x86-64 system
before and after the patch are:
      
                        Before Patch  After Patch
         # of Threads      real          real     reduced by
         ------------     ------        ------    ----------
               1          65,373        65,206       ~0.0%
               4          15,467        15,378       ~0.5%
              40           6,214         5,528      ~11.0%
      
For the uncontended case, the new down_read_trylock() is the same as
before. For the contended cases, the new down_read_trylock() is faster
than before, and the more contended the lock, the larger the gain.
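
A hedged sketch of the shape of the change (rwsem-internal constants and
helpers as in kernel/locking/rwsem.c; debug and owner bookkeeping
omitted): read the current count first and retry the cmpxchg from
whatever value is observed, instead of always starting from an
assumed-uncontended zero.

```c
static inline int __down_read_trylock(struct rw_semaphore *sem)
{
	long tmp = atomic_long_read(&sem->count);

	while (!(tmp & RWSEM_READ_FAILED_MASK)) {
		/* on failure, try_cmpxchg refreshes tmp with the current
		 * value, so the loop re-checks instead of retrying from 0 */
		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
						    tmp + RWSEM_READER_BIAS))
			return 1;
	}
	return 0;
}
```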
      
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
      Link: https://lore.kernel.org/r/20211118094455.9068-1-songmuchun@bytedance.com
      14c24048
• locking/rwsem: Make handoff bit handling more consistent · d257cc8c
      Waiman Long authored
There are some inconsistencies in the way the handoff bit is handled
for readers and writers that lead to a race condition.

Firstly, when a queue-head writer sets the handoff bit, it clears it if
the writer is killed or interrupted on its way out without acquiring the
lock. That is not the case for a queue-head reader: there, the handoff
bit is simply inherited by the next waiter.
      
Secondly, in the out_nolock path of rwsem_down_read_slowpath(), both
the waiter and handoff bits are cleared if the wait queue becomes
empty.  In rwsem_down_write_slowpath(), however, the handoff bit is
neither checked nor cleared if the wait queue is empty. This can leave
the handoff bit set with an empty wait queue.

Worse, rwsem_down_write_slowpath() relies on wstate, a variable set
outside of the critical section containing the ->count manipulation.
This leads to a race condition where RWSEM_FLAG_HANDOFF can be double
subtracted, corrupting ->count.
      
To make the handoff bit handling more consistent and robust, extract
the handoff bit clearing code into a new rwsem_del_waiter() helper
function. Also, completely eliminate wstate; always evaluate everything
inside the same critical section.

The common helper only uses atomic_long_andnot() to clear bits when the
wait queue is empty, to avoid a possible race condition. If the first
waiter with the handoff bit set is killed or interrupted and exits the
slowpath without acquiring the lock, the next waiter will inherit the
handoff bit.
      
      While at it, simplify the trylock for loop in
      rwsem_down_write_slowpath() to make it easier to read.
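
A hedged sketch of the helper described above (close to the shape the
message implies, but not necessarily the literal patch): the flags are
only cleared once the queue is actually empty.

```c
static inline void rwsem_del_waiter(struct rw_semaphore *sem,
				    struct rwsem_waiter *waiter)
{
	lockdep_assert_held(&sem->wait_lock);
	list_del(&waiter->list);

	if (likely(!list_empty(&sem->wait_list)))
		return;

	/* last waiter gone: drop WAITERS and any stale HANDOFF together */
	atomic_long_andnot(RWSEM_FLAG_HANDOFF | RWSEM_FLAG_WAITERS, &sem->count);
}
```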
      
Fixes: 4f23dbc1 ("locking/rwsem: Implement lock handoff to prevent lock starvation")
Reported-by: Zhenhua Ma <mazhenhua@xiaomi.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20211116012912.723980-1-longman@redhat.com
      d257cc8c
  12. Nov 19, 2021
  13. Nov 18, 2021
  14. Nov 15, 2021
• bpf: Fix toctou on read-only map's constant scalar tracking · 353050be
      Daniel Borkmann authored
      Commit a23740ec ("bpf: Track contents of read-only maps as scalars") is
      checking whether maps are read-only both from BPF program side and user space
      side, and then, given their content is constant, reading out their data via
      map->ops->map_direct_value_addr() which is then subsequently used as known
      scalar value for the register, that is, it is marked as __mark_reg_known()
      with the read value at verification time. Before a23740ec, the register
      content was marked as an unknown scalar so the verifier could not make any
      assumptions about the map content.
      
      The current implementation however is prone to a TOCTOU race, meaning, the
      value read as known scalar for the register is not guaranteed to be exactly
      the same at a later point when the program is executed, and as such, the
      prior made assumptions of the verifier with regards to the program will be
      invalid which can cause issues such as OOB access, etc.
      
      While the BPF_F_RDONLY_PROG map flag is always fixed and required to be
      specified at map creation time, the map->frozen property is initially set to
      false for the map given the map value needs to be populated, e.g. for global
      data sections. Once complete, the loader "freezes" the map from user space
      such that no subsequent updates/deletes are possible anymore. For the rest
      of the lifetime of the map, this freeze one-time trigger cannot be undone
      anymore after a successful BPF_MAP_FREEZE cmd return. Meaning, any new BPF_*
      cmd calls which would update/delete map entries will be rejected with -EPERM
      since map_get_sys_perms() removes the FMODE_CAN_WRITE permission. This also
      means that pending update/delete map entries must still complete before this
      guarantee is given. This corner case is not an issue for loaders since they
      create and prepare such program private map in successive steps.
      
      However, a malicious user is able to trigger this TOCTOU race in two different
      ways: i) via userfaultfd, and ii) via batched updates. For i) userfaultfd is
      used to expand the competition interval, so that map_update_elem() can modify
      the contents of the map after map_freeze() and bpf_prog_load() were executed.
      This works, because userfaultfd halts the parallel thread which triggered a
      map_update_elem() at the time where we copy key/value from the user buffer and
      this already passed the FMODE_CAN_WRITE capability test given at that time the
      map was not "frozen". Then, the main thread performs the map_freeze() and
      bpf_prog_load(), and once that had completed successfully, the other thread
      is woken up to complete the pending map_update_elem() which then changes the
      map content. For ii) the idea of the batched update is similar, meaning, when
      there are a large number of updates to be processed, it can increase the
      competition interval between the two. It is therefore possible in practice to
      modify the contents of the map after executing map_freeze() and bpf_prog_load().
      
      One way to fix both i) and ii) at the same time is to expand the use of the
      map's map->writecnt. The latter was introduced in fc970227 ("bpf: Add mmap()
      support for BPF_MAP_TYPE_ARRAY") and further refined in 1f6cb19b ("bpf:
      Prevent re-mmap()'ing BPF map as writable for initially r/o mapping") with
      the rationale to make a writable mmap()'ing of a map mutually exclusive with
      read-only freezing. The counter indicates writable mmap() mappings and then
      prevents/fails the freeze operation. Its semantics can be expanded beyond
      just mmap() by generally indicating ongoing write phases. This would essentially
      span any parallel regular and batched flavor of update/delete operation and
      then also have map_freeze() fail with -EBUSY. For the check_mem_access() in
      the verifier we expand upon the bpf_map_is_rdonly() check ensuring that all
      last pending writes have completed via bpf_map_write_active() test. Once the
      map->frozen is set and bpf_map_write_active() indicates a map->writecnt of 0
      only then we are really guaranteed to use the map's data as known constants.
      For map->frozen being set and pending writes in process of still being completed
      we fall back to marking that register as unknown scalar so we don't end up
      making assumptions about it. With this, both TOCTOU reproducers from i) and
      ii) are fixed.
      
Note that the map->writecnt has been converted into an atomic64 in the fix in
order to avoid a double freeze_mutex mutex_{un,}lock() pair when updating
      map->writecnt in the various map update/delete BPF_* cmd flavors. Spanning
      the freeze_mutex over entire map update/delete operations in syscall side
      would not be possible due to then causing everything to be serialized.
      Similarly, something like synchronize_rcu() after setting map->frozen to wait
      for update/deletes to complete is not possible either since it would also
      have to span the user copy which can sleep. On the libbpf side, this won't
      break d66562fb ("libbpf: Add BPF object skeleton support") as the
      anonymous mmap()-ed "map initialization image" is remapped as a BPF map-backed
      mmap()-ed memory where for .rodata it's non-writable.
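
A hedged sketch of the strengthened verifier-side check described above
(simplified; the real helpers live in the bpf syscall and verifier code):

```c
/* a pending regular or batched update/delete keeps writecnt elevated */
static bool bpf_map_write_active(const struct bpf_map *map)
{
	return atomic64_read(&map->writecnt) != 0;
}

/* the map's contents may be treated as known constants only once it is
 * frozen AND no write is still in flight; otherwise the register stays
 * an unknown scalar */
static bool bpf_map_is_rdonly(const struct bpf_map *map)
{
	return (map->map_flags & BPF_F_RDONLY_PROG) &&
	       READ_ONCE(map->frozen) &&
	       !bpf_map_write_active(map);
}
```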
      
Fixes: a23740ec ("bpf: Track contents of read-only maps as scalars")
Reported-by: <w1tcher.bupt@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      353050be
• bpf: Forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs · 5e0bc308
      Dmitrii Banshchikov authored
      Use of bpf_ktime_get_coarse_ns() and bpf_timer_* helpers in tracing
      progs may result in locking issues.
      
      bpf_ktime_get_coarse_ns() uses ktime_get_coarse_ns() time accessor that
      isn't safe for any context:
      ======================================================
      WARNING: possible circular locking dependency detected
      5.15.0-syzkaller #0 Not tainted
      ------------------------------------------------------
      syz-executor.4/14877 is trying to acquire lock:
      ffffffff8cb30008 (tk_core.seq.seqcount){----}-{0:0}, at: ktime_get_coarse_ts64+0x25/0x110 kernel/time/timekeeping.c:2255
      
      but task is already holding lock:
      ffffffff90dbf200 (&obj_hash[i].lock){-.-.}-{2:2}, at: debug_object_deactivate+0x61/0x400 lib/debugobjects.c:735
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (&obj_hash[i].lock){-.-.}-{2:2}:
             lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5625
             __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
             _raw_spin_lock_irqsave+0xd1/0x120 kernel/locking/spinlock.c:162
             __debug_object_init+0xd9/0x1860 lib/debugobjects.c:569
             debug_hrtimer_init kernel/time/hrtimer.c:414 [inline]
             debug_init kernel/time/hrtimer.c:468 [inline]
             hrtimer_init+0x20/0x40 kernel/time/hrtimer.c:1592
             ntp_init_cmos_sync kernel/time/ntp.c:676 [inline]
             ntp_init+0xa1/0xad kernel/time/ntp.c:1095
             timekeeping_init+0x512/0x6bf kernel/time/timekeeping.c:1639
             start_kernel+0x267/0x56e init/main.c:1030
             secondary_startup_64_no_verify+0xb1/0xbb
      
      -> #0 (tk_core.seq.seqcount){----}-{0:0}:
             check_prev_add kernel/locking/lockdep.c:3051 [inline]
             check_prevs_add kernel/locking/lockdep.c:3174 [inline]
             validate_chain+0x1dfb/0x8240 kernel/locking/lockdep.c:3789
             __lock_acquire+0x1382/0x2b00 kernel/locking/lockdep.c:5015
             lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5625
             seqcount_lockdep_reader_access+0xfe/0x230 include/linux/seqlock.h:103
             ktime_get_coarse_ts64+0x25/0x110 kernel/time/timekeeping.c:2255
             ktime_get_coarse include/linux/timekeeping.h:120 [inline]
             ktime_get_coarse_ns include/linux/timekeeping.h:126 [inline]
             ____bpf_ktime_get_coarse_ns kernel/bpf/helpers.c:173 [inline]
             bpf_ktime_get_coarse_ns+0x7e/0x130 kernel/bpf/helpers.c:171
             bpf_prog_a99735ebafdda2f1+0x10/0xb50
             bpf_dispatcher_nop_func include/linux/bpf.h:721 [inline]
             __bpf_prog_run include/linux/filter.h:626 [inline]
             bpf_prog_run include/linux/filter.h:633 [inline]
             BPF_PROG_RUN_ARRAY include/linux/bpf.h:1294 [inline]
             trace_call_bpf+0x2cf/0x5d0 kernel/trace/bpf_trace.c:127
             perf_trace_run_bpf_submit+0x7b/0x1d0 kernel/events/core.c:9708
             perf_trace_lock+0x37c/0x440 include/trace/events/lock.h:39
             trace_lock_release+0x128/0x150 include/trace/events/lock.h:58
             lock_release+0x82/0x810 kernel/locking/lockdep.c:5636
             __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:149 [inline]
             _raw_spin_unlock_irqrestore+0x75/0x130 kernel/locking/spinlock.c:194
             debug_hrtimer_deactivate kernel/time/hrtimer.c:425 [inline]
             debug_deactivate kernel/time/hrtimer.c:481 [inline]
             __run_hrtimer kernel/time/hrtimer.c:1653 [inline]
             __hrtimer_run_queues+0x2f9/0xa60 kernel/time/hrtimer.c:1749
             hrtimer_interrupt+0x3b3/0x1040 kernel/time/hrtimer.c:1811
             local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1086 [inline]
             __sysvec_apic_timer_interrupt+0xf9/0x270 arch/x86/kernel/apic/apic.c:1103
             sysvec_apic_timer_interrupt+0x8c/0xb0 arch/x86/kernel/apic/apic.c:1097
             asm_sysvec_apic_timer_interrupt+0x12/0x20
             __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:152 [inline]
             _raw_spin_unlock_irqrestore+0xd4/0x130 kernel/locking/spinlock.c:194
             try_to_wake_up+0x702/0xd20 kernel/sched/core.c:4118
             wake_up_process kernel/sched/core.c:4200 [inline]
             wake_up_q+0x9a/0xf0 kernel/sched/core.c:953
             futex_wake+0x50f/0x5b0 kernel/futex/waitwake.c:184
             do_futex+0x367/0x560 kernel/futex/syscalls.c:127
             __do_sys_futex kernel/futex/syscalls.c:199 [inline]
             __se_sys_futex+0x401/0x4b0 kernel/futex/syscalls.c:180
             do_syscall_x64 arch/x86/entry/common.c:50 [inline]
             do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      There is a possible deadlock with bpf_timer_* set of helpers:
      hrtimer_start()
        lock_base();
        trace_hrtimer...()
          perf_event()
            bpf_run()
              bpf_timer_start()
                hrtimer_start()
                  lock_base()         <- DEADLOCK
      
      Forbid use of bpf_ktime_get_coarse_ns() and bpf_timer_* helpers in
      BPF_PROG_TYPE_KPROBE, BPF_PROG_TYPE_TRACEPOINT, BPF_PROG_TYPE_PERF_EVENT
      and BPF_PROG_TYPE_RAW_TRACEPOINT prog types.
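
One way to express that restriction is for the tracing program types'
func_proto hook simply to stop handing out these helpers, so the verifier
rejects any call to them; this is a hedged illustration of the idea, not
necessarily the literal patch:

```c
static const struct bpf_func_proto *
tracing_func_proto_example(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
	switch (func_id) {
	case BPF_FUNC_ktime_get_coarse_ns:
	case BPF_FUNC_timer_init:
	case BPF_FUNC_timer_set_callback:
	case BPF_FUNC_timer_start:
	case BPF_FUNC_timer_cancel:
		return NULL;	/* not available to tracing programs */
	default:
		return bpf_base_func_proto(func_id);
	}
}
```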
      
Fixes: d0551261 ("bpf: Add bpf_ktime_get_coarse_ns helper")
Fixes: b00628b1 ("bpf: Introduce bpf timers.")
Reported-by: <syzbot+43fd005b5a1b4d10781e@syzkaller.appspotmail.com>
Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211113142227.566439-2-me@ubique.spb.ru
      5e0bc308
  15. Nov 14, 2021
  16. Nov 12, 2021
  17. Nov 11, 2021
• perf/core: Avoid put_page() when GUP fails · 4716023a
      Greg Thelen authored
      PEBS PERF_SAMPLE_PHYS_ADDR events use perf_virt_to_phys() to convert PMU
      sampled virtual addresses to physical using get_user_page_fast_only()
      and page_to_phys().
      
      Some get_user_page_fast_only() error cases return false, indicating no
      page reference, but still initialize the output page pointer with an
      unreferenced page. In these error cases perf_virt_to_phys() calls
      put_page(). This causes page reference count underflow, which can lead
      to unintentional page sharing.
      
      Fix perf_virt_to_phys() to only put_page() if get_user_page_fast_only()
      returns a referenced page.
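
A hedged sketch of the corrected pattern in perf_virt_to_phys()
(simplified; the surrounding kernel-address handling is omitted):

```c
if (current->mm != NULL) {
	struct page *p;

	pagefault_disable();
	/* only a true return means we hold a reference on *p */
	if (get_user_page_fast_only(virt, 0, &p)) {
		phys_addr = page_to_phys(p) + virt % PAGE_SIZE;
		put_page(p);	/* safe: the reference above is real */
	}
	pagefault_enable();
}
```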
      
Fixes: fc7ce9c7 ("perf/core, x86: Add PERF_SAMPLE_PHYS_ADDR")
Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20211111021814.757086-1-gthelen@google.com
      4716023a
• preempt: Restore preemption model selection configs · a8b76910
      Valentin Schneider authored
      Commit c597bfdd ("sched: Provide Kconfig support for default dynamic
      preempt mode") changed the selectable config names for the preemption
      model. This means a config file must now select
      
        CONFIG_PREEMPT_BEHAVIOUR=y
      
      rather than
      
        CONFIG_PREEMPT=y
      
      to get a preemptible kernel. This means all arch config files would need to
      be updated - right now they'll all end up with the default
      CONFIG_PREEMPT_NONE_BEHAVIOUR.
      
Rather than touch a good hundred config files, restore usage of
CONFIG_PREEMPT{_NONE, _VOLUNTARY}. Make them configure:
      o The build-time preemption model when !PREEMPT_DYNAMIC
      o The default boot-time preemption model when PREEMPT_DYNAMIC
      
      Add siblings of those configs with the _BUILD suffix to unconditionally
      designate the build-time preemption model (PREEMPT_DYNAMIC is built with
      the "highest" preemption model it supports, aka PREEMPT). Downstream
      configs should by now all be depending / selected by CONFIG_PREEMPTION
      rather than CONFIG_...
      a8b76910
• sched/fair: Prevent dead task groups from regaining cfs_rq's · b027789e
      Mathias Krause authored
Kevin is reporting crashes which point to a use-after-free of a cfs_rq
in update_blocked_averages(). Initial debugging revealed that we have
live cfs_rq's (on_list=1) in an about-to-be-kfree()'d task group in
free_fair_sched_group(). However, it was unclear how that can happen.
      
      His kernel config happened to lead to a layout of struct sched_entity
      that put the 'my_q' member directly into the middle of the object
      which makes it incidentally overlap with SLUB's freelist pointer.
      That, in combination with SLAB_FREELIST_HARDENED's freelist pointer
      mangling, leads to a reliable access violation in form of a #GP which
      made the UAF fail fast.
      
      Michal seems to have run into the same issue[1]. He already correctly
      diagnosed that commit a7b359fc ("sched/fair: Correctly insert
      cfs_rq's to list on unthrottle") is causing the preconditions for the
      UAF to happen by re-adding cfs_rq's also to task groups that have no
      more running tasks, i.e. also to dead ones. His analysis, however,
      misses the real root cause and it cannot be seen from the crash
      backtrace only, as the real offender is tg_unthrottle_up() getting
      called via sched_cfs_period_timer() via the timer interrupt at an
      inconvenient time.
      
      When unregister_fair_sched_group() unlinks all cfs_rq's from the dying
      task group, it doesn't protect itself from getting interrupted. If the
      timer interrupt triggers while we iterate over all CPUs or after
      unregister_fair_sched_group() has finished but prior to unlinking the
      task group, sched_cfs_period_timer() will execute and walk the list of
      task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the
      dying task group. These will later -- in free_fair_sched_group() -- be
      kfree()'ed while still being linked, leading to the fireworks Kevin
      and Michal are seeing.
      
      To fix this race, ensure the dying task group gets unlinked first.
      However, simply switching the order of unregistering and unlinking the
      task group isn't sufficient, as concurrent RCU walkers might still see
      it, as can be seen below:
      
          CPU1:                                      CPU2:
            :                                        timer IRQ:
            :                                          do_sched_cfs_period_timer():
            :                                            :
            :                                            distribute_cfs_runtime():
            :                                              rcu_read_lock();
            :                                              :
            :                                              unthrottle_cfs_rq():
          sched_offline_group():                             :
            :                                                walk_tg_tree_from(…,tg_unthrottle_up,…):
            list_del_rcu(&tg->list);                           :
       (1)  :                                                  list_for_each_entry_rcu(child, &parent->children, siblings)
            :                                                    :
       (2)  list_del_rcu(&tg->siblings);                         :
            :                                                    tg_unthrottle_up():
            unregister_fair_sched_group():                         struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
              :                                                    :
              list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);               :
              :                                                    :
              :                                                    if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
       (3)    :                                                        list_add_leaf_cfs_rq(cfs_rq);
            :                                                      :
            :                                                    :
            :                                                  :
            :                                                :
            :                                              :
       (4)  :                                              rcu_read_unlock();
      
      CPU 2 walks the task group list in parallel to sched_offline_group(),
      specifically, it'll read the soon to be unlinked task group entry at
      (1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from
      still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink
      all cfs_rq's via list_del_leaf_cfs_rq() in
      unregister_fair_sched_group().  Meanwhile CPU 2 will re-add some of
      these at (3), which is the cause of the UAF later on.
      
      To prevent this additional race from happening, we need to wait until
      walk_tg_tree_from() has finished traversing the task groups, i.e.
      after the RCU read critical section ends in (4). Afterwards we're safe
      to call unregister_fair_sched_group(), as each new walk won't see the
      dying task group any more.
      
      On top of that, we need to wait yet another RCU grace period after
      unregister_fair_sched_group() to ensure print_cfs_stats(), which might
      run concurrently, always sees valid objects, i.e. not already free'd
      ones.
      
      This patch survives Michal's reproducer[2] for 8h+ now, which used to
      trigger within minutes before.
      
        [1] https://lore.kernel.org/lkml/20211011172236.11223-1-mkoutny@suse.com/
        [2] https://lore.kernel.org/lkml/20211102160228.GA57072@blackbody.suse.cz/
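
The shape of the resulting teardown can be sketched as two chained RCU
grace periods (the names below are illustrative and simplified, not the
literal patch):

```c
/* grace period #2 has elapsed: only now is it safe to free the group */
static void free_group_rcu_cb(struct rcu_head *rhp)
{
	struct task_group *tg = container_of(rhp, struct task_group, rcu);

	sched_free_group(tg);
}

/* grace period #1 has elapsed: no walk_tg_tree_from() user can still see
 * tg, so unlinking its cfs_rq's cannot race with them being re-added */
static void unregister_group_rcu_cb(struct rcu_head *rhp)
{
	struct task_group *tg = container_of(rhp, struct task_group, rcu);

	unregister_fair_sched_group(tg);
	call_rcu(&tg->rcu, free_group_rcu_cb);
}

void sched_destroy_group(struct task_group *tg)
{
	/* tg was already unlinked via sched_offline_group() */
	call_rcu(&tg->rcu, unregister_group_rcu_cb);
}
```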
      
Fixes: a7b359fc ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
      [peterz: shuffle code around a bit]
Reported-by: Kevin Tanguy <kevin.tanguy@corp.ovh.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      b027789e