  1. Apr 15, 2022
    • irq_work: use kasan_record_aux_stack_noalloc() record callstack · 25934fcf
      Zqiang authored
      On a PREEMPT_RT kernel with KASAN enabled, kasan_record_aux_stack() may
      call alloc_pages(), which acquires an rt-spinlock.  If this happens in
      atomic context, it triggers the following warning:
      
        BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46
        in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 239, name: bootlogd
        Preemption disabled at:
        [<ffffffffbab1a531>] rt_mutex_slowunlock+0xa1/0x4e0
        CPU: 3 PID: 239 Comm: bootlogd Tainted: G        W 5.17.1-rt17-yocto-preempt-rt+ #105
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
        Call Trace:
           __might_resched.cold+0x13b/0x173
           rt_spin_lock+0x5b/0xf0
           get_page_from_freelist+0x20c/0x1610
           __alloc_pages+0x25e/0x5e0
           __stack_depot_save+0x3c0/0x4a0
           kasan_save_stack+0x3a/0x50
           __kasan_record_aux_stack+0xb6/0xc0
           kasan_record_aux_stack+0xe/0x10
           irq_work_queue_on+0x6a/0x1c0
           pull_rt_task+0x631/0x6b0
           do_balance_callbacks+0x56/0x80
           __balance_callbacks+0x63/0x90
           rt_mutex_setprio+0x349/0x880
           rt_mutex_slowunlock+0x22a/0x4e0
           rt_spin_unlock+0x49/0x80
           uart_write+0x186/0x2b0
           do_output_char+0x2e9/0x3a0
           n_tty_write+0x306/0x800
           file_tty_write.isra.0+0x2af/0x450
           tty_write+0x22/0x30
           new_sync_write+0x27c/0x3a0
           vfs_write+0x3f7/0x5d0
           ksys_write+0xd9/0x180
           __x64_sys_write+0x43/0x50
           do_syscall_64+0x44/0x90
           entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fix it by using kasan_record_aux_stack_noalloc() to avoid the call to
      alloc_pages().
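
      A minimal sketch of the substitution in the irq_work queueing path (the
      exact call site and the surrounding code are abridged assumptions):

        /* kernel/irq_work.c (sketch, surrounding code elided) */
        if (!irq_work_claim(work))
                return false;

        /*
         * The _noalloc() variant records the auxiliary call stack without
         * going through alloc_pages(), so no rt-spinlock can be taken from
         * atomic context on PREEMPT_RT.
         */
        kasan_record_aux_stack_noalloc(work);   /* was: kasan_record_aux_stack(work) */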
      
      Link: https://lkml.kernel.org/r/20220402142555.2699582-1-qiang1.zhang@intel.com
      
      
      Signed-off-by: Zqiang <qiang1.zhang@intel.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. Apr 14, 2022
    • dma-direct: avoid redundant memory sync for swiotlb · 9e02977b
      Chao Gao authored
      While looking into FIO performance with swiotlb enabled in a VM, we found
      that swiotlb_bounce() is always called one more time than expected for
      each DMA read request.
      
      It turns out that the bounce buffer is copied to the original DMA buffer
      twice after the completion of a DMA request (once in
      dma_direct_sync_single_for_cpu(), and again in swiotlb_tbl_unmap_single()).
      But the content of the bounce buffer doesn't actually change between the
      two rounds of copying, so one of them is redundant.
      
      Pass the DMA_ATTR_SKIP_CPU_SYNC flag to swiotlb_tbl_unmap_single() to
      skip the memory copy there.
      
      This fix increases FIO 64KB sequential read throughput in a guest with
      swiotlb=force by 5.6%.
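
      A sketch of the described change, assuming the unmap path looks roughly
      like the one in kernel/dma/direct.h (details abridged):

        static inline void dma_direct_unmap_page(struct device *dev, dma_addr_t addr,
                        size_t size, enum dma_data_direction dir, unsigned long attrs)
        {
                phys_addr_t phys = dma_to_phys(dev, addr);

                if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
                        dma_direct_sync_single_for_cpu(dev, addr, size, dir);

                if (unlikely(is_swiotlb_buffer(dev, phys)))
                        /* the sync above already bounced the data back, so
                         * skip the second copy inside the unmap */
                        swiotlb_tbl_unmap_single(dev, phys, size, dir,
                                                 attrs | DMA_ATTR_SKIP_CPU_SYNC);
        }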
      
      Fixes: 55897af6 ("dma-direct: merge swiotlb_dma_ops into the dma_direct code")
      Reported-by: Wang Zhaoyang1 <zhaoyang1.wang@intel.com>
      Reported-by: Gao Liang <liang.gao@intel.com>
      Signed-off-by: Chao Gao <chao.gao@intel.com>
      Reviewed-by: Kevin Tian <kevin.tian@intel.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  3. Apr 05, 2022
    • perf/core: Always set cpuctx cgrp when enable cgroup event · e19cd0b6
      Chengming Zhou authored
      
      When enabling a cgroup event, setting cpuctx->cgrp is conditional on the
      current task's cgrp matching the event's cgroup, so it has to be done for
      every new event.  That brings complexity but no advantage.
      
      To keep it simple, always set cpuctx->cgrp when the first cgroup event is
      enabled, and reset it to NULL when the last cgroup event is disabled.
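
      Roughly, the enable/disable bookkeeping then becomes (a sketch; the
      helper and field names follow the existing perf cgroup code and are
      partly assumptions):

        /* perf_cgroup_event_enable() (sketch) */
        if (ctx->nr_cgroups++)
                return;
        cpuctx->cgrp = perf_cgroup_from_task(current, ctx);

        /* perf_cgroup_event_disable() (sketch) */
        if (--ctx->nr_cgroups)
                return;
        cpuctx->cgrp = NULL;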
      
      Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20220329154523.86438-5-zhouchengming@bytedance.com
    • perf/core: Fix perf_cgroup_switch() · 96492a6c
      Chengming Zhou authored
      There is a race problem that can trigger WARN_ON_ONCE(cpuctx->cgrp)
      in perf_cgroup_switch().
      
      CPU1						CPU2
      perf_cgroup_sched_out(prev, next)
        cgrp1 = perf_cgroup_from_task(prev)
        cgrp2 = perf_cgroup_from_task(next)
        if (cgrp1 != cgrp2)
          perf_cgroup_switch(prev, PERF_CGROUP_SWOUT)
      						cgroup_migrate_execute()
      						  task->cgroups = ?
      						  perf_cgroup_attach()
      						    task_function_call(task, __perf_cgroup_move)
      perf_cgroup_sched_in(prev, next)
        cgrp1 = perf_cgroup_from_task(prev)
        cgrp2 = perf_cgroup_from_task(next)
        if (cgrp1 != cgrp2)
          perf_cgroup_switch(next, PERF_CGROUP_SWIN)
      						__perf_cgroup_move()
      						  perf_cgroup_switch(task, PERF_CGROUP_SWOUT | PERF_CGROUP_SWIN)
      
      Commit a8d757ef ("perf events: Fix slow and broken cgroup context switch
      code") wanted to skip perf_cgroup_switch() when the perf_cgroup of "prev"
      and "next" is the same.
      
      But task->cgroups can change concurrently with context_switch(), in
      cgroup_migrate_execute().  If cgrp1 == cgrp2 in sched_out(), cpuctx won't
      do sched_out.  If task->cgroups then changes so that cgrp1 != cgrp2 in
      sched_in(), cpuctx will do sched_in, triggering
      WARN_ON_ONCE(cpuctx->cgrp).
      
      Even though __perf_cgroup_move() is synchronized with the context switch
      because interrupts are disabled, context_switch() can still see
      task->cgroups changing in the middle, since task->cgroups is changed
      before the IPI is sent.
      
      So we have to combine perf_cgroup_sched_in() with perf_cgroup_sched_out()
      into a unified perf_cgroup_switch(), to fix the inconsistency between
      perf_cgroup_sched_out() and perf_cgroup_sched_in().
      
      But we can't just compare prev->cgroups with next->cgroups to decide
      whether to skip cpuctx sched_out/in, since prev->cgroups is changing
      too.  For example:
      
      CPU1					CPU2
      					cgroup_migrate_execute()
      					  prev->cgroups = ?
      					  perf_cgroup_attach()
      					    task_function_call(task, __perf_cgroup_move)
      perf_cgroup_switch(task)
        cgrp1 = perf_cgroup_from_task(prev)
        cgrp2 = perf_cgroup_from_task(next)
        if (cgrp1 != cgrp2)
          cpuctx sched_out/in ...
      					task_function_call() will return -ESRCH
      
      In the above example, the change of prev->cgroups causes (cgrp1 == cgrp2)
      to be true, so cpuctx sched_out/in is skipped.  Later, task_function_call()
      returns -ESRCH since the prev task is no longer running on the CPU.  So we
      would leave perf_events of the old prev->cgroups still scheduled on the
      CPU, which is wrong.
      
      The solution is to compare cpuctx->cgrp with the next task's perf_cgroup.
      Since cpuctx->cgrp can only be changed on the local CPU, and we have irqs
      disabled here, we can read cpuctx->cgrp for the comparison without
      holding the ctx lock.
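
      The unified perf_cgroup_switch() then looks roughly like this (a sketch;
      the iteration over per-CPU contexts, locking and PMU disable/enable are
      elided):

        static void perf_cgroup_switch(struct task_struct *task)
        {
                struct perf_cgroup *cgrp = perf_cgroup_from_task(task, NULL);
                struct perf_cpu_context *cpuctx;

                /* for each per-CPU context that has cgroup events ... */

                /* cpuctx->cgrp only changes on the local CPU and irqs are
                 * disabled here, so no ctx lock is needed for the compare */
                if (READ_ONCE(cpuctx->cgrp) == cgrp)
                        return;

                cpu_ctx_sched_out(cpuctx, EVENT_ALL);
                /* set before sched_in so timestamps start on the new cgroup */
                cpuctx->cgrp = cgrp;
                cpu_ctx_sched_in(cpuctx, EVENT_ALL);
        }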
      
      Fixes: a8d757ef ("perf events: Fix slow and broken cgroup context switch code")
      Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20220329154523.86438-4-zhouchengming@bytedance.com
    • perf/core: Use perf_cgroup_info->active to check if cgroup is active · 6875186a
      Chengming Zhou authored
      
      We use perf_cgroup_set_timestamp() to start cgroup time and set active
      to 1, and update_cgrp_time_from_cpuctx() to stop cgroup time and set
      active to 0.
      
      So we can use info->active directly to check whether a cgroup is active.
      
      Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20220329154523.86438-3-zhouchengming@bytedance.com
    • perf/core: Don't pass task around when ctx sched in · a0827713
      Chengming Zhou authored
      
      The current code passes the task around for ctx_sched_in() only to get
      the task's perf_cgroup, then updates the timestamps of that cgroup and
      its ancestors and marks them active.
      
      But we can use cpuctx->cgrp to get the active perf_cgroup and its
      ancestors, since cpuctx->cgrp has already been set before ctx_sched_in().
      
      This patch removes the task argument from ctx_sched_in() and cleans up
      the related code.
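
      For illustration, call sites change along these lines (a sketch, not the
      full diff):

        /* before: callers passed the task just to find its perf_cgroup */
        ctx_sched_in(ctx, cpuctx, EVENT_PINNED, task);

        /* after: the active cgroup is taken from cpuctx->cgrp instead */
        ctx_sched_in(ctx, cpuctx, EVENT_PINNED);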
      
      Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20220329154523.86438-2-zhouchengming@bytedance.com
    • perf/core: Inherit event_caps · e3265a43
      Namhyung Kim authored
      
      It was reported that some perf event setups can make fork() fail on
      ARM64.  The case was a group of mixed hw and sw events, and it failed in
      perf_event_init_task() due to armpmu_event_init().
      
      The ARM PMU code checks that all the events in a group belong to the same
      PMU, except for software events.  But the event_caps of inherited events
      were not set, so they were no longer identified as software events, and
      the check failed in the child process.
      
      A simple reproducer is:
      
        $ perf stat -e '{cycles,cs,instructions}' perf bench sched messaging
        # Running 'sched/messaging' benchmark:
        perf: fork(): Invalid argument
      
      The perf stat was fine but the perf bench failed in fork().  Let's
      inherit the event caps from the parent.
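
      A minimal sketch of the fix, assuming it lands in inherit_event() in
      kernel/events/core.c:

        /* after allocating the child event, copy the capability bits so an
         * inherited software event is still recognised as such by the ARM
         * PMU group validation */
        child_event->event_caps = parent_event->event_caps;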
      
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20220328200112.457740-1-namhyung@kernel.org
    • static_call: Don't make __static_call_return0 static · 8fd4ddda
      Christophe Leroy authored
      System.map shows that vmlinux contains several instances of
      __static_call_return0():
      
      	c0004fc0 t __static_call_return0
      	c0011518 t __static_call_return0
      	c00d8160 t __static_call_return0
      
      arch_static_call_transform() uses the middle one to check whether we are
      setting a call to __static_call_return0 or not:
      
      	c0011520 <arch_static_call_transform>:
      	c0011520:       3d 20 c0 01     lis     r9,-16383	<== r9 =  0xc001 << 16
      	c0011524:       39 29 15 18     addi    r9,r9,5400	<== r9 += 0x1518
      	c0011528:       7c 05 48 00     cmpw    r5,r9		<== r9 has value 0xc0011518 here
      
      So if static_call_update() is called with one of the other instances of
      __static_call_return0(), arch_static_call_transform() won't recognise it.
      
      In order for this to work properly, a single global instance of
      __static_call_return0() is required.
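
      Roughly, that means one exported definition plus a declaration for
      everyone else (a sketch; the exact file placement is an assumption):

        /* include/linux/static_call.h */
        extern long __static_call_return0(void);

        /* single global definition (previously 'static', so each object
         * file got its own copy and the addresses didn't match) */
        long __static_call_return0(void)
        {
                return 0;
        }
        EXPORT_SYMBOL_GPL(__static_call_return0);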
      
      Fixes: 3f2a8fc4 ("static_call/x86: Add __static_call_return0()")
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Peter Z...
    • entry: Fix compile error in dynamic_irqentry_exit_cond_resched() · 0a70045e
      Sven Schnelle authored
      kernel/entry/common.c: In function ‘dynamic_irqentry_exit_cond_resched’:
      kernel/entry/common.c:409:14: error: implicit declaration of function ‘static_key_unlikely’; did you mean ‘static_key_enable’? [-Werror=implicit-function-declaration]
        409 |         if (!static_key_unlikely(&sk_dynamic_irqentry_exit_cond_resched))
            |              ^~~~~~~~~~~~~~~~~~~
            |              static_key_enable
      
      static_key_unlikely() should be static_branch_unlikely().
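
      The corrected function body then reads as follows (a sketch based on the
      error message above):

        void dynamic_irqentry_exit_cond_resched(void)
        {
                if (!static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched))
                        return;
                raw_irqentry_exit_cond_resched();
        }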
      
      Fixes: 99cf983c ("sched/preempt: Add PREEMPT_DYNAMIC using static keys")
      Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Mark Rutland <mark.rutland@arm.com>
      Link: https://lore.kernel.org/r/20220330084328.1805665-1-svens@linux.ibm.com
    • sched: Teach the forced-newidle balancer about CPU affinity limitation. · 386ef214
      Sebastian Andrzej Siewior authored
      try_steal_cookie() looks at task_struct::cpus_mask to decide if the
      task could be moved to `this' CPU. It ignores that the task might be in
      a migration-disabled section while not on the CPU. In this case the task
      must not be moved, otherwise per-CPU assumptions are broken.
      
      Use is_cpu_allowed(), as suggested by Peter Zijlstra, to decide if a
      task can be moved.
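
      A sketch of the substitution in try_steal_cookie() (the surrounding loop
      is elided):

        /*
         * was: if (!cpumask_test_cpu(this, &p->cpus_mask)) goto next;
         *
         * is_cpu_allowed() additionally respects migration-disabled tasks
         * (and other per-CPU constraints), so such a task is never stolen.
         */
        if (!is_cpu_allowed(p, this))
                goto next;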
      
      Fixes: d2dfa17b ("sched: Trivial forced-newidle balancer")
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/YjNK9El+3fzGmswf@linutronix.de
    • sched/core: Fix forceidle balancing · 5b6547ed
      Peter Zijlstra authored
      Steve reported that ChromeOS encounters the forceidle balancer being
      run from rt_mutex_setprio()'s balance_callback() invocation and
      explodes.
      
      Now, the forceidle balancer gets queued every time the idle task gets
      selected, i.e. in set_next_task(), which is strictly too often.
      rt_mutex_setprio() also uses set_next_task() in the 'change' pattern:
      
      	queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
      	running = task_current(rq, p); /* rq->curr == p */
      
      	if (queued)
      		dequeue_task(...);
      	if (running)
      		put_prev_task(...);
      
      	/* change task properties */
      
      	if (queued)
      		enqueue_task(...);
      	if (running)
      		set_next_task(...);
      
      However, rt_mutex_setprio() will explicitly not run this pattern on
      the idle task (since priority boosting the idle task is quite insane).
      Most other 'change' pattern users are pidhash based and would also not
      apply to idle.
      
      Also, the change pattern doesn't contain a __balance_callback()
      invocation and hence we could have an out-of-band balance-callback,
      which *should* trigger the WARN in rq_pin_lock() (which guards against
      this exact anti-pattern).
      
      So while none of that explains how this happens, it does indicate that
      having it in set_next_task() might not be the most robust option.
      
      Instead, explicitly queue the forceidle balancer from pick_next_task()
      when it does indeed result in forceidle selection. Having it here
      ensures it can only be triggered under the __schedule() rq->lock
      instance, and hence must be run from that context.
      
      This also happens to clean up the code a little, so win-win.
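
      The shape of the change, roughly (a sketch; the exact condition and
      helper name used in pick_next_task() are assumptions):

        /* core-scheduling pick_next_task() (sketch): queue the forceidle
         * balancer only when the pick actually forced a sibling idle,
         * under the __schedule() rq->lock */
        if (rq->core->core_forceidle_count && next == rq->idle)
                queue_core_balance(rq);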
      
      Fixes: d2dfa17b ("sched: Trivial forced-newidle balancer")
      Reported-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: T.J. Alumbaugh <talumbau@chromium.org>
      Link: https://lkml.kernel.org/r/20220330160535.GN8939@worktop.programming.kicks-ass.net
  4. Mar 26, 2022
    • Revert "swiotlb: rework "fix info leak with DMA_FROM_DEVICE"" · bddac7c1
      Linus Torvalds authored
      This reverts commit aa6f8dcb.
      
      It turns out this breaks at least the ath9k wireless driver, and
      possibly others.
      
      What the ath9k driver does on packet receive is to set up the DMA
      transfer with:
      
        int ath_rx_init(..)
        ..
                      bf->bf_buf_addr = dma_map_single(sc->dev, skb->data,
                                                       common->rx_bufsize,
                                                       DMA_FROM_DEVICE);
      
      and then the receive logic (through ath_rx_tasklet()) will fetch
      incoming packets
      
        static bool ath_edma_get_buffers(..)
        ..
              dma_sync_single_for_cpu(sc->dev, bf->bf_buf_addr,
                                      common->rx_bufsize, DMA_FROM_DEVICE);
      
              ret = ath9k_hw_process_rxdesc_edma(ah, rs, skb->data);
              if (ret == -EINPROGRESS) {
                      /*let device gain the buffer again*/
                      dma_sync_single_for_device(sc->dev, bf->bf_buf_addr,
                                      common->rx_bufsize, DMA_FROM_DEVICE);
                      return false;
              }
      
      and it's worth noting how that first DMA sync:
      
          dma_sync_single_for_cpu(..DMA_FROM_DEVICE);
      
      is there to make sure the CPU can read the DMA buffer (possibly by
      copying it from the bounce buffer area, or by doing some cache flush).
      The iommu correctly turns that into a "copy from bounce buffer" so that
      the driver can look at the state of the packets.
      
      In the meantime, the device may continue to write to the DMA buffer, but
      we at least have a snapshot of the state due to that first DMA sync.
      
      But that _second_ DMA sync:
      
          dma_sync_single_for_device(..DMA_FROM_DEVICE);
      
      is telling the DMA mapping that the CPU wasn't interested in the area
      because the packet wasn't there.  In the case of a DMA bounce buffer,
      that is a no-op.
      
      Note how it's not a sync for the CPU (the "for_device()" part), and it's
      not a sync for data written by the CPU (the "DMA_FROM_DEVICE" part).
      
      Or rather, it _should_ be a no-op.  That's what commit aa6f8dcb broke:
      it made the code bounce the buffer unconditionally, and changed the
      DMA_FROM_DEVICE to just unconditionally and illogically be
      DMA_TO_DEVICE.
      
      [ Side note: purely within the confines of the swiotlb driver it wasn't
        entirely illogical: The reason it did that odd DMA_FROM_DEVICE ->
        DMA_TO_DEVICE conversion thing is because inside the swiotlb driver,
        it uses just a swiotlb_bounce() helper that doesn't care about the
        whole distinction of who the sync is for - only which direction to
        bounce.
      
        So it took the "sync for device" to mean that the CPU must have been
        the one writing, and thought it meant DMA_TO_DEVICE. ]
      
      Also note how the commentary in that commit was wrong, probably due to
      that whole confusion, claiming that the commit makes the swiotlb code
      
                                        "bounce unconditionally (that is, also
          when dir == DMA_TO_DEVICE) in order do avoid synchronising back stale
          data from the swiotlb buffer"
      
      which is nonsensical for two reasons:
      
       - that "also when dir == DMA_TO_DEVICE" is nonsensical, as that was
         exactly when it always did - and should do - the bounce.
      
       - since this is a sync for the device (not for the CPU), we're clearly
         fundamentally not copying back stale data from the bounce buffers at
         all, because we'd be copying *to* the bounce buffers.
      
      So that commit was just very confused.  It confused the direction of the
      synchronization (to the device, not the cpu) with the direction of the
      DMA (from the device).
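
      For reference, the post-revert sync-for-device helper only bounces in the
      direction where the CPU has written data (a sketch of the restored
      behaviour):

        void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
                                            size_t size, enum dma_data_direction dir)
        {
                if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
                        swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
                else
                        BUG_ON(dir != DMA_FROM_DEVICE); /* for_device(FROM_DEVICE): nothing to copy */
        }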
      
      Reported-and-bisected-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Reported-by: Olha Cherevyk <olha.cherevyk@gmail.com>
      Cc: Halil Pasic <pasic@linux.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Kalle Valo <kvalo@kernel.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Toke Høiland-Jørgensen <toke@toke.dk>
      Cc: Maxime Bizon <mbizon@freebox.fr>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>