Forum | Documentation | Website | Blog

Skip to content
Snippets Groups Projects
  1. Mar 22, 2022
  2. Feb 04, 2022
  3. Jan 22, 2022
  4. Nov 20, 2021
    • Alexander Mikhalitsyn's avatar
      shm: extend forced shm destroy to support objects from several IPC nses · 85b6d246
      Alexander Mikhalitsyn authored
      Currently, the exit_shm() function not designed to work properly when
      task->sysvshm.shm_clist holds shm objects from different IPC namespaces.
      
      This is a real pain when sysctl kernel.shm_rmid_forced = 1, because it
      leads to use-after-free (reproducer exists).
      
      This is an attempt to fix the problem by extending exit_shm mechanism to
      handle shm's destroy from several IPC ns'es.
      
      To achieve that we do several things:
      
      1. add a namespace (non-refcounted) pointer to the struct shmid_kernel
      
      2. during new shm object creation (newseg()/shmget syscall) we
         initialize this pointer by current task IPC ns
      
      3. exit_shm() fully reworked such that it traverses over all shp's in
         task->sysvshm.shm_clist and gets IPC namespace not from current task
         as it was before but from shp's object itself, then call
         shm_destroy(shp, ns).
      
      Note: We need to be really careful here, because as it was said before
      (1), our pointer to IPC ns non-refcnt'ed.  T...
      85b6d246
    • Alexander Mikhalitsyn's avatar
      ipc: WARN if trying to remove ipc object which is absent · 126e8bee
      Alexander Mikhalitsyn authored
      Patch series "shm: shm_rmid_forced feature fixes".
      
      Some time ago I met kernel crash after CRIU restore procedure,
      fortunately, it was CRIU restore, so, I had dump files and could do
      restore many times and crash reproduced easily.  After some
      investigation I've constructed the minimal reproducer.  It was found
      that it's use-after-free and it happens only if sysctl
      kernel.shm_rmid_forced = 1.
      
      The key of the problem is that the exit_shm() function not handles shp's
      object destroy when task->sysvshm.shm_clist contains items from
      different IPC namespaces.  In most cases this list will contain only
      items from one IPC namespace.
      
      How can this list contain object from different namespaces? The
      exit_shm() function is designed to clean up this list always when
      process leaves IPC namespace.  But we made a mistake a long time ago and
      did not add a exit_shm() call into the setns() syscall procedures.
      
      The first idea was just to add this call to setns() syscall...
      126e8bee
  5. Nov 09, 2021
  6. Sep 14, 2021
    • Vasily Averin's avatar
      ipc: remove memcg accounting for sops objects in do_semtimedop() · 6a4746ba
      Vasily Averin authored
      Linus proposes to revert an accounting for sops objects in
      do_semtimedop() because it's really just a temporary buffer
      for a single semtimedop() system call.
      
      This object can consume up to 2 pages, syscall is sleeping
      one, size and duration can be controlled by user, and this
      allocation can be repeated by many thread at the same time.
      
      However Shakeel Butt pointed that there are much more popular
      objects with the same life time and similar memory
      consumption, the accounting of which was decided to be
      rejected for performance reasons.
      
      Considering at least 2 pages for task_struct and 2 pages for
      the kernel stack, a back of the envelope calculation gives a
      footprint amplification of <1.5 so this temporal buffer can be
      safely ignored.
      
      The factor would IMO be interesting if it was >> 2 (from the
      PoV of excessive (ab)use, fine-grained accounting seems to be
      currently unfeasible due to performance impact).
      
      Link: https://lore.kernel.org/lkml/90e2...
      6a4746ba
  7. Sep 08, 2021
    • Rafael Aquini's avatar
      ipc: replace costly bailout check in sysvipc_find_ipc() · 20401d10
      Rafael Aquini authored
      sysvipc_find_ipc() was left with a costly way to check if the offset
      position fed to it is bigger than the total number of IPC IDs in use.  So
      much so that the time it takes to iterate over /proc/sysvipc/* files grows
      exponentially for a custom benchmark that creates "N" SYSV shm segments
      and then times the read of /proc/sysvipc/shm (milliseconds):
      
          12 msecs to read   1024 segs from /proc/sysvipc/shm
          18 msecs to read   2048 segs from /proc/sysvipc/shm
          65 msecs to read   4096 segs from /proc/sysvipc/shm
         325 msecs to read   8192 segs from /proc/sysvipc/shm
        1303 msecs to read  16384 segs from /proc/sysvipc/shm
        5182 msecs to read  32768 segs from /proc/sysvipc/shm
      
      The root problem lies with the loop that computes the total amount of ids
      in use to check if the "pos" feeded to sysvipc_find_ipc() grew bigger than
      "ids->in_use".  That is a quite inneficient way to get to the maximum
      index in the id lookup table, specially when that value is already
      provided by struct ipc_ids.max_idx.
      
      This patch follows up on the optimization introduced via commit
      15df03c8 ("sysvipc: make get_maxid O(1) again") and gets rid of the
      aforementioned costly loop replacing it by a simpler checkpoint based on
      ipc_get_maxidx() returned value, which allows for a smooth linear increase
      in time complexity for the same custom benchmark:
      
           2 msecs to read   1024 segs from /proc/sysvipc/shm
           2 msecs to read   2048 segs from /proc/sysvipc/shm
           4 msecs to read   4096 segs from /proc/sysvipc/shm
           9 msecs to read   8192 segs from /proc/sysvipc/shm
          19 msecs to read  16384 segs from /proc/sysvipc/shm
          39 msecs to read  32768 segs from /proc/sysvipc/shm
      
      Link: https://lkml.kernel.org/r/20210809203554.1562989-1-aquini@redhat.com
      
      
      Signed-off-by: default avatarRafael Aquini <aquini@redhat.com>
      Acked-by: default avatarDavidlohr Bueso <dbueso@suse.de>
      Acked-by: default avatarManfred Spraul <manfred@colorfullife.com>
      Cc: Waiman Long <llong@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      20401d10
  8. Sep 03, 2021
  9. Aug 20, 2021
    • Arnd Bergmann's avatar
      ARM: 9114/1: oabi-compat: rework sys_semtimedop emulation · bdec0145
      Arnd Bergmann authored
      
      sys_oabi_semtimedop() is one of the last users of set_fs() on Arm. To
      remove this one, expose the internal code of the actual implementation
      that operates on a kernel pointer and call it directly after copying.
      
      There should be no measurable impact on the normal execution of this
      function, and it makes the overly long function a little shorter, which
      may help readability.
      
      While reworking the oabi version, make it behave a little more like
      the native one, using kvmalloc_array() and restructure the code
      flow in a similar way.
      
      The naming of __do_semtimedop() is not very good, I hope someone can
      come up with a better name.
      
      One regression was spotted by kernel test robot <rong.a.chen@intel.com>
      and fixed before the first mailing list submission.
      
      Acked-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarRussell King (Oracle) <rmk+kernel@armlinux.org.uk>
      bdec0145
  10. Jul 01, 2021
  11. May 22, 2021
    • Varad Gautam's avatar
      ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiry · a11ddb37
      Varad Gautam authored
      do_mq_timedreceive calls wq_sleep with a stack local address.  The
      sender (do_mq_timedsend) uses this address to later call pipelined_send.
      
      This leads to a very hard to trigger race where a do_mq_timedreceive
      call might return and leave do_mq_timedsend to rely on an invalid
      address, causing the following crash:
      
        RIP: 0010:wake_q_add_safe+0x13/0x60
        Call Trace:
         __x64_sys_mq_timedsend+0x2a9/0x490
         do_syscall_64+0x80/0x680
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f5928e40343
      
      The race occurs as:
      
      1. do_mq_timedreceive calls wq_sleep with the address of `struct
         ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it
         holds a valid `struct ext_wait_queue *` as long as the stack has not
         been overwritten.
      
      2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and
         do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call
         __pipelined_op.
      
      3. Sender calls __pipelined...
      a11ddb37
  12. May 07, 2021
  13. Apr 30, 2021
  14. Jan 24, 2021
  15. Dec 15, 2020
    • Dmitry Safonov's avatar
      vm_ops: rename .split() callback to .may_split() · dd3b614f
      Dmitry Safonov authored
      Rename the callback to reflect that it's not called *on* or *after* split,
      but rather some time before the splitting to check if it's possible.
      
      Link: https://lkml.kernel.org/r/20201013013416.390574-5-dima@arista.com
      
      
      Signed-off-by: default avatarDmitry Safonov <dima@arista.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.frank...
      dd3b614f
  16. Sep 05, 2020
    • Tobias Klauser's avatar
      ipc: adjust proc_ipc_sem_dointvec definition to match prototype · fff1662c
      Tobias Klauser authored
      Commit 32927393 ("sysctl: pass kernel pointers to ->proc_handler")
      changed ctl_table.proc_handler to take a kernel pointer.  Adjust the
      signature of proc_ipc_sem_dointvec to match ctl_table.proc_handler which
      fixes the following sparse error/warning:
      
        ipc/ipc_sysctl.c:94:47: warning: incorrect type in argument 3 (different address spaces)
        ipc/ipc_sysctl.c:94:47:    expected void *buffer
        ipc/ipc_sysctl.c:94:47:    got void [noderef] __user *buffer
        ipc/ipc_sysctl.c:194:35: warning: incorrect type in initializer (incompatible argument 3 (different address spaces))
        ipc/ipc_sysctl.c:194:35:    expected int ( [usertype] *proc_handler )( ... )
        ipc/ipc_sysctl.c:194:35:    got int ( * )( ... )
      
      Fixes: 32927393
      
       ("sysctl: pass kernel pointers to ->proc_handler")
      Signed-off-by: default avatarTobias Klauser <tklauser@distanz.ch>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv...
      fff1662c
  17. Aug 23, 2020
  18. Aug 19, 2020
    • Kirill Tkhai's avatar
      ipc: Use generic ns_common::count · 137ec390
      Kirill Tkhai authored
      Switch over ipc namespaces to use the newly introduced common lifetime
      counter.
      
      Currently every namespace type has its own lifetime counter which is stored
      in the specific namespace struct. The lifetime counters are used
      identically for all namespaces types. Namespaces may of course have
      additional unrelated counters and these are not altered.
      
      This introduces a common lifetime counter into struct ns_common. The
      ns_common struct encompasses information that all namespaces share. That
      should include the lifetime counter since its common for all of them.
      
      It also allows us to unify the type of the counters across all namespaces.
      Most of them use refcount_t but one uses atomic_t and at least one uses
      kref. Especially the last one doesn't make much sense since it's just a
      wrapper around refcount_t since 2016 and actually complicates cleanup
      operations by having to use container_of() to cast the correct namespace
      struct out of struct ns_common.
      
      Having the lifetime counter fo...
      137ec390
  19. Aug 12, 2020
  20. Aug 07, 2020
  21. Jun 09, 2020
  22. Jun 08, 2020
  23. May 14, 2020
    • Vasily Averin's avatar
      ipc/util.c: sysvipc_find_ipc() incorrectly updates position index · 5e698222
      Vasily Averin authored
      Commit 89163f93 ("ipc/util.c: sysvipc_find_ipc() should increase
      position index") is causing this bug (seen on 5.6.8):
      
         # ipcs -q
      
         ------ Message Queues --------
         key        msqid      owner      perms      used-bytes   messages
      
         # ipcmk -Q
         Message queue id: 0
         # ipcs -q
      
         ------ Message Queues --------
         key        msqid      owner      perms      used-bytes   messages
         0x82db8127 0          root       644        0            0
      
         # ipcmk -Q
         Message queue id: 1
         # ipcs -q
      
         ------ Message Queues --------
         key        msqid      owner      perms      used-bytes   messages
         0x82db8127 0          root       644        0            0
         0x76d1fb2a 1          root       644        0            0
      
         # ipcrm -q 0
         # ipcs -q
      
         ------ Message Queues --------
         key        msqid      owner      perms      used-bytes   messages
         0x76d1fb2a 1          root       644        0            0
         0x76d1fb2a 1      ...
      5e698222
  24. May 09, 2020
    • Christian Brauner's avatar
      nsproxy: add struct nsset · f2a8d52e
      Christian Brauner authored
      Add a simple struct nsset. It holds all necessary pieces to switch to a new
      set of namespaces without leaving a task in a half-switched state which we
      will make use of in the next patch. This patch switches the existing setns
      logic over without causing a change in setns() behavior. This brings
      setns() closer to how unshare() works(). The prepare_ns() function is
      responsible to prepare all necessary information. This has two reasons.
      First it minimizes dependencies between individual namespaces, i.e. all
      install handler can expect that all fields are properly initialized
      independent in what order they are called in. Second, this makes the code
      easier to maintain and easier to follow if it needs to be changed.
      
      The prepare_ns() helper will only be switched over to use a flags argument
      in the next patch. Here it will still use nstype as a simple integer
      argument which was argued would be clearer. I'm not particularly
      opinionated about this if it really helps or not. The struct nsset...
      f2a8d52e
  25. May 07, 2020
    • Oleg Nesterov's avatar
      ipc/mqueue.c: change __do_notify() to bypass check_kill_permission() · b5f20061
      Oleg Nesterov authored
      Commit cc731525 ("signal: Remove kernel interal si_code magic")
      changed the value of SI_FROMUSER(SI_MESGQ), this means that mq_notify() no
      longer works if the sender doesn't have rights to send a signal.
      
      Change __do_notify() to use do_send_sig_info() instead of kill_pid_info()
      to avoid check_kill_permission().
      
      This needs the additional notify.sigev_signo != 0 check, shouldn't we
      change do_mq_notify() to deny sigev_signo == 0 ?
      
      Test-case:
      
      	#include <signal.h>
      	#include <mqueue.h>
      	#include <unistd.h>
      	#include <sys/wait.h>
      	#include <assert.h>
      
      	static int notified;
      
      	static void sigh(int sig)
      	{
      		notified = 1;
      	}
      
      	int main(void)
      	{
      		signal(SIGIO, sigh);
      
      		int fd = mq_open("/mq", O_RDWR|O_CREAT, 0666, NULL);
      		assert(fd >= 0);
      
      		struct sigevent se = {
      			.sigev_notify	= SIGEV_SIGNAL,
      			.sigev_signo	= SIGIO,
      		};
      		assert(mq_notify(fd, &se) == 0);
      
      		if (!fork()) {
      			assert(setuid(1) == 0);
      			mq_send(fd, "",1,0);
      			return ...
      b5f20061
  26. Apr 27, 2020
  27. Apr 10, 2020