  1. Apr 05, 2022
    • perf/core: Always set cpuctx cgrp when enable cgroup event · e19cd0b6
      Chengming Zhou authored
      
      When enabling a cgroup event, setting cpuctx->cgrp is conditional
      on the current task's cgrp matching the event's cgroup, so this has
      to be done for every new event. That brings complexity but no
      advantage.
      
      To keep it simple, this patch always sets cpuctx->cgrp when the
      first cgroup event is enabled, and resets it to NULL when the last
      cgroup event is disabled.
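      
      A minimal sketch of that scheme (the helper and counter names below
      are illustrative, not the exact kernel identifiers):
      
      	/*
      	 * Illustrative only: flip cpuctx->cgrp at the 0 <-> 1 transitions
      	 * of a per-context cgroup-event count, instead of re-checking the
      	 * current task's cgroup for every event.
      	 */
      	static void cgroup_event_enable(struct perf_cpu_context *cpuctx,
      					struct perf_cgroup *cgrp)
      	{
      		if (cpuctx->ctx.nr_cgroups++)
      			return;			/* not the first cgroup event */
      		cpuctx->cgrp = cgrp;		/* first one: publish the cgroup */
      	}
      
      	static void cgroup_event_disable(struct perf_cpu_context *cpuctx)
      	{
      		if (--cpuctx->ctx.nr_cgroups)
      			return;			/* other cgroup events remain */
      		cpuctx->cgrp = NULL;		/* last one gone: reset */
      	}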
      
      Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20220329154523.86438-5-zhouchengming@bytedance.com
      e19cd0b6
    • perf/core: Fix perf_cgroup_switch() · 96492a6c
      Chengming Zhou authored
      There is a race problem that can trigger WARN_ON_ONCE(cpuctx->cgrp)
      in perf_cgroup_switch().
      
      CPU1						CPU2
      perf_cgroup_sched_out(prev, next)
        cgrp1 = perf_cgroup_from_task(prev)
        cgrp2 = perf_cgroup_from_task(next)
        if (cgrp1 != cgrp2)
          perf_cgroup_switch(prev, PERF_CGROUP_SWOUT)
      						cgroup_migrate_execute()
      						  task->cgroups = ?
      						  perf_cgroup_attach()
      						    task_function_call(task, __perf_cgroup_move)
      perf_cgroup_sched_in(prev, next)
        cgrp1 = perf_cgroup_from_task(prev)
        cgrp2 = perf_cgroup_from_task(next)
        if (cgrp1 != cgrp2)
          perf_cgroup_switch(next, PERF_CGROUP_SWIN)
      						__perf_cgroup_move()
      						  perf_cgroup_switch(task, PERF_CGROUP_SWOUT | PERF_CGROUP_SWIN)
      
      Commit a8d757ef ("perf events: Fix slow and broken cgroup
      context switch code") wanted to skip perf_cgroup_switch() when the
      perf_cgroup of "prev" and "next" is the same.
      
      But task->cgroups can change concurrently with context_switch(),
      in cgroup_migrate_execute(). If cgrp1 == cgrp2 in sched_out(),
      cpuctx won't do sched_out. Then, if the change to task->cgroups
      makes cgrp1 != cgrp2 in sched_in(), cpuctx will do sched_in and
      trigger WARN_ON_ONCE(cpuctx->cgrp).
      
      Even though __perf_cgroup_move() is serialized against the context
      switch by the disabled interrupts, context_switch() can still see
      task->cgroups changing in the middle, since task->cgroups is
      updated before the IPI is sent.
      
      So we have to merge perf_cgroup_sched_in() into
      perf_cgroup_sched_out(), unifying them into perf_cgroup_switch(),
      to fix the inconsistency between perf_cgroup_sched_out() and
      perf_cgroup_sched_in().
      
      But we can't just compare prev->cgroups with next->cgroups to decide
      whether to skip cpuctx sched_out/in, since prev->cgroups may be
      changing too. For example:
      
      CPU1					CPU2
      					cgroup_migrate_execute()
      					  prev->cgroups = ?
      					  perf_cgroup_attach()
      					    task_function_call(task, __perf_cgroup_move)
      perf_cgroup_switch(task)
        cgrp1 = perf_cgroup_from_task(prev)
        cgrp2 = perf_cgroup_from_task(next)
        if (cgrp1 != cgrp2)
          cpuctx sched_out/in ...
      					task_function_call() will return -ESRCH
      
      In the above example, the change to prev->cgroups makes
      (cgrp1 == cgrp2) true, so cpuctx sched_out/in is skipped. Later,
      task_function_call() returns -ESRCH since the prev task is no
      longer running on the CPU. So we would leave perf events of the
      old prev->cgroups still scheduled on the CPU, which is wrong.
      
      The solution is to compare cpuctx->cgrp with the next task's
      perf_cgroup. Since cpuctx->cgrp can only be changed on the local
      CPU and we have IRQs disabled, we can read cpuctx->cgrp for the
      comparison without holding the ctx lock.
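      
      A rough sketch of the unified switch under this scheme (helper names
      and parameters are simplified from the kernel sources; locking and the
      actual sched_out/in of the events are elided):
      
      	static void perf_cgroup_switch(struct perf_cpu_context *cpuctx,
      				       struct task_struct *next)
      	{
      		/* cpuctx->cgrp is only written by the local CPU and we run
      		 * with IRQs disabled, so it is stable here. */
      		struct perf_cgroup *cgrp = perf_cgroup_from_task(next, NULL);
      
      		if (cpuctx->cgrp == cgrp)
      			return;		/* same cgroup, nothing to switch */
      
      		/* sched out the events of the old cpuctx->cgrp,
      		 * publish the new cgroup, then sched in its events: */
      		cpuctx->cgrp = cgrp;
      	}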
      
      Fixes: a8d757ef ("perf events: Fix slow and broken cgroup context switch code")
      Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20220329154523.86438-4-zhouchengming@bytedance.com
      96492a6c
    • perf/core: Use perf_cgroup_info->active to check if cgroup is active · 6875186a
      Chengming Zhou authored
      
      Since we use perf_cgroup_set_timestamp() to start cgroup time and
      set active to 1, and update_cgrp_time_from_cpuctx() to stop cgroup
      time and set active to 0, we can use info->active directly to check
      whether the cgroup is active.
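      
      As a sketch, a cgroup-time read can then be guarded like this (the
      surrounding arithmetic is an illustrative model, not the exact
      kernel code; only the info->active check is the point):
      
      	struct perf_cgroup_info *info = this_cpu_ptr(event->cgrp->info);
      
      	/* active is set to 1 by perf_cgroup_set_timestamp() and back to 0
      	 * by update_cgrp_time_from_cpuctx(): */
      	if (!info->active)
      		return info->time;	/* cgroup clock is stopped */
      	return info->time + (now - info->timestamp);	/* still running */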
      
      Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20220329154523.86438-3-zhouchengming@bytedance.com
      6875186a
    • perf/core: Don't pass task around when ctx sched in · a0827713
      Chengming Zhou authored
      
      The current code passes the task around for ctx_sched_in(), only
      to get the perf_cgroup of the task, then updates the timestamps
      of it and its ancestors and sets them active.
      
      But we can use cpuctx->cgrp to get the active perf_cgroup and
      its ancestors, since cpuctx->cgrp has been set before
      ctx_sched_in().
      
      This patch removes the task argument from ctx_sched_in()
      and cleans up the related code.
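      
      A sketch of the direction of the cleanup, with the parameter lists
      simplified:
      
      	/* before: the task was only passed down to find its cgroup */
      	ctx_sched_in(ctx, cpuctx, event_type, task);
      	perf_cgroup_set_timestamp(task, ctx);
      
      	/* after: cpuctx->cgrp was already set before ctx_sched_in(),
      	 * so the cgroup (and its ancestors) come from cpuctx instead: */
      	ctx_sched_in(ctx, cpuctx, event_type);
      	perf_cgroup_set_timestamp(cpuctx);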
      
      Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20220329154523.86438-2-zhouchengming@bytedance.com
      a0827713
    • perf/core: Inherit event_caps · e3265a43
      Namhyung Kim authored
      
      It was reported that some perf event setups can make fork() fail on
      ARM64.  It was the case for a group of mixed hw and sw events, and it
      failed in perf_event_init_task() due to armpmu_event_init().
      
      The ARM PMU code checks that all the events in a group belong to the
      same PMU, except for software events.  But it didn't set the event_caps
      of inherited events, so they were no longer identified as software
      events.  Therefore the test failed in a child process.
      
      A simple reproducer is:
      
        $ perf stat -e '{cycles,cs,instructions}' perf bench sched messaging
        # Running 'sched/messaging' benchmark:
        perf: fork(): Invalid argument
      
      The perf stat itself was fine, but the perf bench failed in fork().
      Let's inherit the event caps from the parent.
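      
      The fix boils down to copying the capability bits when the child
      event is created, roughly like this (the surrounding inheritance
      helper is assumed, not quoted from the patch):
      
      	/* carry the capability bits (e.g. PERF_EV_CAP_SOFTWARE) over so
      	 * the child's software events are still recognised as such: */
      	child_event->event_caps = parent_event->event_caps;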
      
      Signed-off-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20220328200112.457740-1-namhyung@kernel.org
      e3265a43
    • static_call: Don't make __static_call_return0 static · 8fd4ddda
      Christophe Leroy authored
      System.map shows that vmlinux contains several instances of
      __static_call_return0():
      
      	c0004fc0 t __static_call_return0
      	c0011518 t __static_call_return0
      	c00d8160 t __static_call_return0
      
      arch_static_call_transform() uses the middle one to check whether we are
      setting a call to __static_call_return0 or not:
      
      	c0011520 <arch_static_call_transform>:
      	c0011520:       3d 20 c0 01     lis     r9,-16383	<== r9 =  0xc001 << 16
      	c0011524:       39 29 15 18     addi    r9,r9,5400	<== r9 += 0x1518
      	c0011528:       7c 05 48 00     cmpw    r5,r9		<== r9 has value 0xc0011518 here
      
      So if static_call_update() is called with one of the other instances of
      __static_call_return0(), arch_static_call_transform() won't recognise it.
      
      In order to work properly, a single global instance of
      __static_call_return0() is required.
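      
      In other words, __static_call_return0() should be defined once,
      non-static, and only declared in the header, roughly as follows (the
      EXPORT_SYMBOL_GPL for module users is an assumption, not quoted from
      the patch):
      
      	/* one global definition, e.g. in kernel/static_call.c: */
      	long __static_call_return0(void)
      	{
      		return 0;
      	}
      	EXPORT_SYMBOL_GPL(__static_call_return0);
      
      	/* header: declaration only, no per-translation-unit static copy */
      	extern long __static_call_return0(void);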
      
      Fixes: 3f2a8fc4 ("static_call/x86: Add __static_call_return0()")
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      8fd4ddda
    • entry: Fix compile error in dynamic_irqentry_exit_cond_resched() · 0a70045e
      Sven Schnelle authored
      kernel/entry/common.c: In function ‘dynamic_irqentry_exit_cond_resched’:
      kernel/entry/common.c:409:14: error: implicit declaration of function ‘static_key_unlikely’; did you mean ‘static_key_enable’? [-Werror=implicit-function-declaration]
        409 |         if (!static_key_unlikely(&sk_dynamic_irqentry_exit_cond_resched))
            |              ^~~~~~~~~~~~~~~~~~~
            |              static_key_enable
      
      static_key_unlikely() should be static_branch_unlikely().
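      
      The corrected check, per the diagnostic above:
      
      	if (!static_branch_unlikely(&sk_dynamic_irqentry_exit_cond_resched))
      		return;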
      
      Fixes: 99cf983c ("sched/preempt: Add PREEMPT_DYNAMIC using static keys")
      Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Mark Rutland <mark.rutland@arm.com>
      Link: https://lore.kernel.org/r/20220330084328.1805665-1-svens@linux.ibm.com
      0a70045e
    • sched: Teach the forced-newidle balancer about CPU affinity limitation. · 386ef214
      Sebastian Andrzej Siewior authored
      try_steal_cookie() looks at task_struct::cpus_mask to decide if the
      task could be moved to `this' CPU. It ignores that the task might be
      in a migration-disabled section while not on the CPU. In that case
      the task must not be moved, otherwise per-CPU assumptions are broken.
      
      Use is_cpu_allowed(), as suggested by Peter Zijlstra, to decide if a
      task can be moved.
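      
      A sketch of the changed check inside try_steal_cookie() (the
      surrounding loop and its `next' label are from the existing code and
      elided here):
      
      	/* was: if (!cpumask_test_cpu(this, &p->cpus_mask))
      	 * is_cpu_allowed() additionally honours migrate_disable(): */
      	if (!is_cpu_allowed(p, this))
      		goto next;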
      
      Fixes: d2dfa17b ("sched: Trivial forced-newidle balancer")
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/YjNK9El+3fzGmswf@linutronix.de
      386ef214
    • sched/core: Fix forceidle balancing · 5b6547ed
      Peter Zijlstra authored
      Steve reported that ChromeOS encounters the forceidle balancer being
      run from rt_mutex_setprio()'s balance_callback() invocation and
      explodes.
      
      Now, the forceidle balancer gets queued every time the idle task gets
      selected, in set_next_task(), which is strictly too often.
      rt_mutex_setprio() also uses set_next_task() in the 'change' pattern:
      
      	queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
      	running = task_current(rq, p); /* rq->curr == p */
      
      	if (queued)
      		dequeue_task(...);
      	if (running)
      		put_prev_task(...);
      
      	/* change task properties */
      
      	if (queued)
      		enqueue_task(...);
      	if (running)
      		set_next_task(...);
      
      However, rt_mutex_setprio() will explicitly not run this pattern on
      the idle task (since priority boosting the idle task is quite insane).
      Most other 'change' pattern users are pidhash based and would also not
      apply to idle.
      
      Also, the change pattern doesn't contain a __balance_callback()
      invocation and hence we could have an out-of-band balance-callback,
      which *should* trigger the WARN in rq_pin_lock() (which guards against
      this exact anti-pattern).
      
      So while none of that explains how this happens, it does indicate that
      having it in set_next_task() might not be the most robust option.
      
      Instead, explicitly queue the forceidle balancer from pick_next_task()
      when it does indeed result in forceidle selection. Having it there
      ensures it can only be triggered under the __schedule() rq->lock
      instance, and hence must be run from that context.
      
      This also happens to clean up the code a little, so win-win.
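      
      A rough sketch of where the queueing moves to (names follow the
      core-scheduling pick path; illustrative, not the exact diff):
      
      	/* in the core-scheduling pick path, under the __schedule()
      	 * rq->lock, queue the balancer only when forceidle was in fact
      	 * selected: */
      	if (rq->core->core_forceidle_count && next == rq->idle)
      		queue_core_balance(rq);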
      
      Fixes: d2dfa17b ("sched: Trivial forced-newidle balancer")
      Reported-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: T.J. Alumbaugh <talumbau@chromium.org>
      Link: https://lkml.kernel.org/r/20220330160535.GN8939@worktop.programming.kicks-ass.net
      5b6547ed
    • Revert "swiotlb: rework "fix info leak with DMA_FROM_DEVICE"" · bddac7c1
      Linus Torvalds authored
      This reverts commit aa6f8dcb.
      
      It turns out this breaks at least the ath9k wireless driver, and
      possibly others.
      
      What the ath9k driver does on packet receive is to set up the DMA
      transfer with:
      
        int ath_rx_init(..)
        ..
                      bf->bf_buf_addr = dma_map_single(sc->dev, skb->data,
                                                       common->rx_bufsize,
                                                       DMA_FROM_DEVICE);
      
      and then the receive logic (through ath_rx_tasklet()) will fetch
      incoming packets
      
        static bool ath_edma_get_buffers(..)
        ..
              dma_sync_single_for_cpu(sc->dev, bf->bf_buf_addr,
                                      common->rx_bufsize, DMA_FROM_DEVICE);
      
              ret = ath9k_hw_process_rxdesc_edma(ah, rs, skb->data);
              if (ret == -EINPROGRESS) {
                      /*let device gain the buffer again*/
                      dma_sync_single_for_device(sc->dev, bf->bf_buf_addr,
                                      common->rx_bufsize, DMA_FROM_DEVICE);
                      return false;
              }
      
      and it's worth noting how that first DMA sync:
      
          dma_sync_single_for_cpu(..DMA_FROM_DEVICE);
      
      is there to make sure the CPU can read the DMA buffer (possibly by
      copying it from the bounce buffer area, or by doing some cache flush).
      The iommu correctly turns that into a "copy from bounce buffer" so that
      the driver can look at the state of the packets.
      
      In the meantime, the device may continue to write to the DMA buffer, but
      we at least have a snapshot of the state due to that first DMA sync.
      
      But that _second_ DMA sync:
      
          dma_sync_single_for_device(..DMA_FROM_DEVICE);
      
      is telling the DMA mapping that the CPU wasn't interested in the area
      because the packet wasn't there.  In the case of a DMA bounce buffer,
      that is a no-op.
      
      Note how it's not a sync for the CPU (the "for_device()" part), and it's
      not a sync for data written by the CPU (the "DMA_FROM_DEVICE" part).
      
      Or rather, it _should_ be a no-op.  That's what commit aa6f8dcb
      broke: it made the code bounce the buffer unconditionally, and changed
      the DMA_FROM_DEVICE to just unconditionally and illogically be
      DMA_TO_DEVICE.
      
      [ Side note: purely within the confines of the swiotlb driver it wasn't
        entirely illogical: The reason it did that odd DMA_FROM_DEVICE ->
        DMA_TO_DEVICE conversion thing is because inside the swiotlb driver,
        it uses just a swiotlb_bounce() helper that doesn't care about the
        whole distinction of who the sync is for - only which direction to
        bounce.
      
        So it took the "sync for device" to mean that the CPU must have been
        the one writing, and thought it meant DMA_TO_DEVICE. ]
      
      Also note how the commentary in that commit was wrong, probably due to
      that whole confusion, claiming that the commit makes the swiotlb code
      
                                        "bounce unconditionally (that is, also
          when dir == DMA_TO_DEVICE) in order do avoid synchronising back stale
          data from the swiotlb buffer"
      
      which is nonsensical for two reasons:
      
       - that "also when dir == DMA_TO_DEVICE" is nonsensical, as that was
         exactly when it always did - and should do - the bounce.
      
       - since this is a sync for the device (not for the CPU), we're clearly
        fundamentally not copying back stale data from the bounce buffers at
         all, because we'd be copying *to* the bounce buffers.
      
      So that commit was just very confused.  It confused the direction of the
      synchronization (to the device, not the cpu) with the direction of the
      DMA (from the device).
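      
      For reference, the behaviour the revert restores can be sketched
      roughly like this (a simplified model of the swiotlb sync-for-device
      path, not the exact source):
      
      	void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
      					    size_t size, enum dma_data_direction dir)
      	{
      		/* only data written by the CPU needs to reach the device: */
      		if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
      			swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
      		/* DMA_FROM_DEVICE: nothing to copy here; device data is
      		 * bounced to the CPU in the sync_for_cpu path instead. */
      	}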
      
      Reported-and-bisected-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Reported-by: Olha Cherevyk <olha.cherevyk@gmail.com>
      Cc: Halil Pasic <pasic@linux.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Kalle Valo <kvalo@kernel.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Toke Høiland-Jørgensen <toke@toke.dk>
      Cc: Maxime Bizon <mbizon@freebox.fr>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bddac7c1