  1. Jan 19, 2022
    • ipv4: add net_hash_mix() dispersion to fib_info_laddrhash keys · 79eb15da
      Eric Dumazet authored
      
      net/ipv4/fib_semantics.c uses a hash table (fib_info_laddrhash)
      in which fib_sync_down_addr() can locate fib_info
      based on IPv4 local address.
      
      This hash table is resized based on total number of
      hashed fib_info, but the hash function is only
      using the local address.
      
      For hosts having many active network namespaces,
      all fib_info for loopback devices (IPv4 address 127.0.0.1)
      are hashed into a single bucket, making netns dismantles
      very slow.
      
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
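The effect described in the fib_info_laddrhash commit above can be modelled in userspace. The following is a minimal sketch, not the kernel code: `model_hash32()`, `model_buckets_used()`, the 256-bucket table size and the per-namespace `salts` array are all assumptions made for illustration, with the salt standing in for what net_hash_mix() provides.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GOLDEN_RATIO_32 0x61C88647u
#define MODEL_HASH_BITS 8u              /* 256 buckets in this model */

/* Multiplicative hash: fold a 32-bit value into MODEL_HASH_BITS bits. */
static uint32_t model_hash32(uint32_t val)
{
	return (uint32_t)(val * GOLDEN_RATIO_32) >> (32u - MODEL_HASH_BITS);
}

/*
 * Count how many distinct buckets one local address occupies across
 * nr_ns namespaces. With mixed == 0 the key is the address alone;
 * otherwise the (hypothetical) per-namespace salt is XORed in first.
 */
size_t model_buckets_used(uint32_t addr, const uint32_t *salts,
			  size_t nr_ns, int mixed)
{
	unsigned char seen[1u << MODEL_HASH_BITS];
	size_t used = 0;

	memset(seen, 0, sizeof(seen));
	for (size_t i = 0; i < nr_ns; i++) {
		uint32_t key = mixed ? (salts[i] ^ addr) : addr;
		uint32_t slot = model_hash32(key);

		if (!seen[slot]) {
			seen[slot] = 1;
			used++;
		}
	}
	return used;
}
```

With the address as the only key, every namespace's 127.0.0.1 entry maps to the same slot, so the count is one regardless of nr_ns. Mixing in a per-namespace value lets the same address land in different slots.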
    • ipv4: avoid quadratic behavior in netns dismantle · d07418af
      Eric Dumazet authored
      net/ipv4/fib_semantics.c uses a hash table of 256 slots,
      keyed by device ifindexes: fib_info_devhash[DEVINDEX_HASHSIZE]
      
      The problem is that, with network namespaces, devices tend
      to use the same ifindex.

      The lo device, for instance, has a fixed ifindex of one
      in every network namespace.

      This means that hosts with thousands of netns spend
      a lot of time walking hash buckets that hold thousands
      of elements, notably at netns dismantle.

      Simply add a per-netns perturbation (net_hash_mix())
      to spread elements more uniformly.

      Also change fib_devindex_hashfn() to use more entropy.
      
      Fixes: aa79e66e ("net: Make ifindex generation per-net namespace")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • nfc: llcp: fix NULL error pointer dereference on sendmsg() after failed bind() · dded0892
      Krzysztof Kozlowski authored
      
      Syzbot detected a NULL pointer dereference of nfc_llcp_sock->dev pointer
      (which is a 'struct nfc_dev *') with calls to llcp_sock_sendmsg() after
      a failed llcp_sock_bind(). The message being sent is a SOCK_DGRAM.
      
      KASAN report:
      
        BUG: KASAN: null-ptr-deref in nfc_alloc_send_skb+0x2d/0xc0
        Read of size 4 at addr 00000000000005c8 by task llcp_sock_nfc_a/899
      
        CPU: 5 PID: 899 Comm: llcp_sock_nfc_a Not tainted 5.16.0-rc6-next-20211224-00001-gc6437fbf18b0 #125
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
        Call Trace:
         <TASK>
         dump_stack_lvl+0x45/0x59
         ? nfc_alloc_send_skb+0x2d/0xc0
         __kasan_report.cold+0x117/0x11c
         ? mark_lock+0x480/0x4f0
         ? nfc_alloc_send_skb+0x2d/0xc0
         kasan_report+0x38/0x50
         nfc_alloc_send_skb+0x2d/0xc0
         nfc_llcp_send_ui_frame+0x18c/0x2a0
         ? nfc_llcp_send_i_frame+0x230/0x230
         ? __local_bh_enable_ip+0x86/0xe0
         ? llcp_sock_connect+0x470/0x470
         ? llcp_sock_connect+0x470/0x470
         sock_sendmsg+0x8e/0xa0
         ____sys_sendmsg+0x253/0x3f0
         ...
      
      The issue was visible only with multiple simultaneous calls to bind() and
      sendmsg(), which caused most of the bind() calls to fail.  bind() failed
      when checking for an available WKS/SDP/SAP (the respective bit in the
      'struct nfc_llcp_local' fields).  When no WKS/SDP/SAP was available,
      bind() returned an error, but a sendmsg() on that socket could still
      trigger the NULL pointer dereference of nfc_llcp_sock->dev.

      The code is simply racy.  It currently protects several paths against
      the race with checks for (!nfc_llcp_sock->local), which is set to NULL
      in the error paths of bind().  llcp_sock_sendmsg() had no such check;
      the function it calls, nfc_llcp_send_ui_frame(), did, but not under
      lock_sock().
      
      Therefore the race could look like (same socket is used all the time):
        CPU0                                     CPU1
        ====                                     ====
        llcp_sock_bind()
        - lock_sock()
          - success
        - release_sock()
        - return 0
                                                 llcp_sock_sendmsg()
                                                 - lock_sock()
                                                 - release_sock()
        llcp_sock_bind(), same socket
        - lock_sock()
          - error
                                                 - nfc_llcp_send_ui_frame()
                                                   - if (!llcp_sock->local)
          - llcp_sock->local = NULL
          - nfc_put_device(dev)
                                                   - dereference llcp_sock->dev
        - release_sock()
        - return -ERRNO
      
      nfc_llcp_send_ui_frame() checked llcp_sock->local outside of the
      lock, which is a racy and ineffective check.  Instead, its caller,
      llcp_sock_sendmsg(), should perform the check inside lock_sock().
      
      Reported-and-tested-by: <syzbot+7f23bcddf626e0593a39@syzkaller.appspotmail.com>
      Fixes: b874dec2 ("NFC: Implement LLCP connection less Tx path")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
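The rule stated in the nfc commit above, that the validity check must run under the same lock as the error path that clears the pointer, can be sketched in userspace. This is a minimal model under stated assumptions, not the kernel fix: `struct model_sock`, `model_bind()`, `model_sendmsg()` and the error codes are hypothetical, and a pthread mutex stands in for lock_sock()/release_sock().

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the socket state discussed above. */
struct model_sock {
	pthread_mutex_t lock;   /* plays the role of lock_sock()/release_sock() */
	void *local;            /* NULL after a failed bind */
	void *dev;              /* only meaningful while local is non-NULL */
};

/* Bind: on failure, clear both pointers while holding the lock. */
int model_bind(struct model_sock *sk, void *local, void *dev, int fail)
{
	int ret = 0;

	pthread_mutex_lock(&sk->lock);
	if (fail) {
		sk->local = NULL;
		sk->dev = NULL;
		ret = -EADDRINUSE;      /* illustrative error code */
	} else {
		sk->local = local;
		sk->dev = dev;
	}
	pthread_mutex_unlock(&sk->lock);
	return ret;
}

/* Send: check sk->local and read sk->dev inside the same critical section. */
int model_sendmsg(struct model_sock *sk, void **dev_out)
{
	pthread_mutex_lock(&sk->lock);
	if (!sk->local) {
		pthread_mutex_unlock(&sk->lock);
		return -ENODEV;         /* illustrative error code */
	}
	*dev_out = sk->dev;
	pthread_mutex_unlock(&sk->lock);
	return 0;
}
```

Because the check and the read of `dev` happen in one critical section, a concurrent failing bind cannot clear the pointers between them. Real code that keeps using the device after unlocking would also need to hold a reference to it.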
  2. Jan 18, 2022
    • netns: add schedule point in ops_exit_list() · 2836615a
      Eric Dumazet authored
      When under stress, cleanup_net() may have to dismantle
      a large number of netns. ops_exit_list() currently calls
      many helpers [1] that have no schedule point, and we can
      end up with soft lockups, particularly on hosts
      with many cpus.

      Even for a moderate number of netns processed by cleanup_net(),
      this patch avoids latency spikes.

      [1] Some of these helpers, like fib_sync_up() and fib_sync_down_dev(),
      are very slow because net/ipv4/fib_semantics.c uses host-wide hash tables,
      and ifindex is used as the only input of two hash functions.
      ifindexes tend to be the same for all netns (lo.ifindex==1 per instance).
      This will be fixed in a separate patch.
      
      Fixes: 72ad937a ("net: Add support for batching network namespace cleanups")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. Jan 17, 2022
  4. Jan 16, 2022
    • net/smc: Fix hung_task when removing SMC-R devices · 56d99e81
      Wen Gu authored
      A hung_task is observed when removing SMC-R devices. Suppose that
      a link group has two active links (lnk_A, lnk_B) associated with two
      different SMC-R devices (dev_A, dev_B). When dev_A is removed, the
      link group is removed from smc_lgr_list and added to
      lgr_linkdown_list. lnk_A is cleared and smcibdev(A)->lnk_cnt
      reaches zero. However, when dev_B is then removed, the link
      group can't be found in smc_lgr_list and lnk_B is not cleared,
      so smcibdev->lnk_cnt never reaches zero, which causes a hung_task.

      This patch fixes the issue by restoring the implementation of
      smc_smcr_terminate_all() to what it was before commit 349d4312
      ("net/smc: fix kernel panic caused by race of smc_sock"). The original
      implementation also satisfies the intent of destroying the QP before
      the CQ, because we always wait for smcibdev->lnk_cnt to reach zero,
      which guarantees that the QP has been destroyed.
      
      Fixes: 349d4312 ("net/smc: fix kernel panic caused by race of smc_sock")
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: update fib_info_cnt under spinlock protection · 0a6e6b3c
      Eric Dumazet authored
      In the past, free_fib_info() was supposed to be called
      under RTNL protection.
      
      This eventually was no longer the case.
      
      Instead of enforcing RTNL, it seems we can simply
      move the fib_info_cnt changes so that they occur while
      fib_info_lock is held.

      v2: David Laight suggested updating fib_info_cnt
      only when an entry is added to or deleted from the hash table,
      as fib_info_cnt is used to make sure the hash table size
      is optimal.
      
      BUG: KCSAN: data-race in fib_create_info / free_fib_info
      
      write to 0xffffffff86e243a0 of 4 bytes by task 26429 on cpu 0:
       fib_create_info+0xe78/0x3440 net/ipv4/fib_semantics.c:1428
       fib_table_insert+0x148/0x10c0 net/ipv4/fib_trie.c:1224
       fib_magic+0x195/0x1e0 net/ipv4/fib_frontend.c:1087
       fib_add_ifaddr+0xd0/0x2e0 net/ipv4/fib_frontend.c:1109
       fib_netdev_event+0x178/0x510 net/ipv4/fib_frontend.c:1466
       notifier_call_chain kernel/notifier.c:83 [inline]
       raw_notifier_call_chain+0x53/0xb0 kernel/notifier.c:391
       __dev_notify_flags+0x1d3/0x3b0
       dev_change_flags+0xa2/0xc0 net/core/dev.c:8872
       do_setlink+0x810/0x2410 net/core/rtnetlink.c:2719
       rtnl_group_changelink net/core/rtnetlink.c:3242 [inline]
       __rtnl_newlink net/core/rtnetlink.c:3396 [inline]
       rtnl_newlink+0xb10/0x13b0 net/core/rtnetlink.c:3506
       rtnetlink_rcv_msg+0x745/0x7e0 net/core/rtnetlink.c:5571
       netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2496
       rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5589
       netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
       netlink_unicast+0x5fc/0x6c0 net/netlink/af_netlink.c:1345
       netlink_sendmsg+0x726/0x840 net/netlink/af_netlink.c:1921
       sock_sendmsg_nosec net/socket.c:704 [inline]
       sock_sendmsg net/socket.c:724 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2409
       ___sys_sendmsg net/socket.c:2463 [inline]
       __sys_sendmsg+0x195/0x230 net/socket.c:2492
       __do_sys_sendmsg net/socket.c:2501 [inline]
       __se_sys_sendmsg net/socket.c:2499 [inline]
       __x64_sys_sendmsg+0x42/0x50 net/socket.c:2499
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffffffff86e243a0 of 4 bytes by task 31505 on cpu 1:
       free_fib_info+0x35/0x80 net/ipv4/fib_semantics.c:252
       fib_info_put include/net/ip_fib.h:575 [inline]
       nsim_fib4_rt_destroy drivers/net/netdevsim/fib.c:294 [inline]
       nsim_fib4_rt_replace drivers/net/netdevsim/fib.c:403 [inline]
       nsim_fib4_rt_insert drivers/net/netdevsim/fib.c:431 [inline]
       nsim_fib4_event drivers/net/netdevsim/fib.c:461 [inline]
       nsim_fib_event drivers/net/netdevsim/fib.c:881 [inline]
       nsim_fib_event_work+0x15ca/0x2cf0 drivers/net/netdevsim/fib.c:1477
       process_one_work+0x3fc/0x980 kernel/workqueue.c:2298
       process_scheduled_works kernel/workqueue.c:2361 [inline]
       worker_thread+0x7df/0xa70 kernel/workqueue.c:2447
       kthread+0x2c7/0x2e0 kernel/kthread.c:327
       ret_from_fork+0x1f/0x30
      
      value changed: 0x00000d2d -> 0x00000d2e
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 31505 Comm: kworker/1:21 Not tainted 5.16.0-rc6-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: events nsim_fib_event_work
      
      Fixes: 48bb9eb4 ("netdevsim: fib: Add dummy implementation for FIB offload")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Ido Schimmel <idosch@mellanox.com>
      Cc: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. Jan 15, 2022
    • net/smc: Remove unused function declaration · 9404bc1e
      Wen Gu authored
      The declaration of smc_wr_tx_dismiss_slots() is unused.
      So remove it.

      Fixes: 349d4312 ("net/smc: fix kernel panic caused by race of smc_sock")
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mm: introduce memalloc_retry_wait() · 4034247a
      NeilBrown authored
      Various places in the kernel - largely in filesystems - respond to a
      memory allocation failure by looping around and re-trying.  Some of
      these cannot conveniently use __GFP_NOFAIL, for reasons such as:
      
       - a GFP_ATOMIC allocation, which __GFP_NOFAIL doesn't work on
       - a need to check for the process being signalled between failures
       - the possibility that other recovery actions could be performed
       - the allocation is quite deep in support code, and passing down an
         extra flag to say if __GFP_NOFAIL is wanted would be clumsy.
      
      Many of these currently use congestion_wait() which (in almost all
      cases) simply waits the given timeout - congestion isn't tracked for
      most devices.
      
      It isn't clear what the best delay is for loops, but it is clear that
      the various filesystems shouldn't be responsible for choosing a timeout.
      
      This patch introduces memalloc_retry_wait(), which takes on that
      responsibility.  Code that wants to retry a memory allocation can call
      this function, passing the GFP flags that were used.  It will wait
      for however long is appropriate.

      For now, it only considers __GFP_NORETRY and whatever
      gfpflags_allow_blocking() tests.  If blocking is allowed without
      __GFP_NORETRY, then alloc_page() either made some reclaim progress or
      waited for a while before failing, so there is no need for much
      further waiting: memalloc_retry_wait() will wait until the current
      jiffy ends.  If this condition is not met, then alloc_page() won't have
      waited much, if at all.  In that case memalloc_retry_wait() waits about
      200ms.  This is the delay that most current loops use.
      
      linux/sched/mm.h needs to be included in some files now,
      but linux/backing-dev.h does not.
      
      Link: https://lkml.kernel.org/r/163754371968.13692.1277530886009912421@noble.neil.brown.name
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
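The delay selection described in the memalloc_retry_wait() commit above can be written down as a small decision function. This is a sketch of the described policy only: the `MODEL_*` flag bits, `model_retry_wait_ms()` and the `ms_left_in_tick` parameter are hypothetical and are not the kernel's GFP values or API; the 200 ms figure is taken from the description.

```c
#include <stdbool.h>

/* Hypothetical flag bits for this model; they are not the kernel's GFP values. */
#define MODEL_ALLOW_BLOCKING 0x1u
#define MODEL_NORETRY        0x2u

#define MODEL_LONG_WAIT_MS 200u   /* "about 200ms", per the description above */

/*
 * Return how long a retry loop should sleep, in milliseconds.
 * ms_left_in_tick models "wait until the current jiffy ends".
 */
unsigned int model_retry_wait_ms(unsigned int flags, unsigned int ms_left_in_tick)
{
	/* Blocking allowed and retries not disabled: the allocator already waited. */
	bool allocator_already_waited =
		(flags & MODEL_ALLOW_BLOCKING) && !(flags & MODEL_NORETRY);

	if (allocator_already_waited)
		return ms_left_in_tick;
	return MODEL_LONG_WAIT_MS;
}
```

A caller's retry loop would sleep for the returned duration before attempting the allocation again, so the filesystem code no longer picks a timeout itself.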
    • mm: allow !GFP_KERNEL allocations for kvmalloc · a421ef30
      Michal Hocko authored
      Support for GFP_NO{FS,IO} and __GFP_NOFAIL has been implemented by
      previous patches, so we can now allow them for kvmalloc.  This will
      allow some external users to simplify or completely remove their
      helpers.

      GFP_NOWAIT semantics have not been supported so far, but that has not
      been explicitly documented, so let's add a note about it.

      ceph_kvmalloc is the first helper to be dropped and changed to kvmalloc.
      
      Link: https://lkml.kernel.org/r/20211122153233.9924-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. Jan 14, 2022
    • af_unix: annote lockless accesses to unix_tot_inflight & gc_in_progress · 9d6d7f1c
      Eric Dumazet authored
      wait_for_unix_gc() reads unix_tot_inflight & gc_in_progress
      without synchronization.

      Add READ_ONCE()/WRITE_ONCE() and their associated comments
      to better document the intent.
      
      BUG: KCSAN: data-race in unix_inflight / wait_for_unix_gc
      
      write to 0xffffffff86e2b7c0 of 4 bytes by task 9380 on cpu 0:
       unix_inflight+0x1e8/0x260 net/unix/scm.c:63
       unix_attach_fds+0x10c/0x1e0 net/unix/scm.c:121
       unix_scm_to_skb net/unix/af_unix.c:1674 [inline]
       unix_dgram_sendmsg+0x679/0x16b0 net/unix/af_unix.c:1817
       unix_seqpacket_sendmsg+0xcc/0x110 net/unix/af_unix.c:2258
       sock_sendmsg_nosec net/socket.c:704 [inline]
       sock_sendmsg net/socket.c:724 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2409
       ___sys_sendmsg net/socket.c:2463 [inline]
       __sys_sendmmsg+0x267/0x4c0 net/socket.c:2549
       __do_sys_sendmmsg net/socket.c:2578 [inline]
       __se_sys_sendmmsg net/socket.c:2575 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2575
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffffffff86e2b7c0 of 4 bytes by task 9375 on cpu 1:
       wait_for_unix_gc+0x24/0x160 net/unix/garbage.c:196
       unix_dgram_sendmsg+0x8e/0x16b0 net/unix/af_unix.c:1772
       unix_seqpacket_sendmsg+0xcc/0x110 net/unix/af_unix.c:2258
       sock_sendmsg_nosec net/socket.c:704 [inline]
       sock_sendmsg net/socket.c:724 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2409
       ___sys_sendmsg net/socket.c:2463 [inline]
       __sys_sendmmsg+0x267/0x4c0 net/socket.c:2549
       __do_sys_sendmmsg net/socket.c:2578 [inline]
       __se_sys_sendmmsg net/socket.c:2575 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2575
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x00000002 -> 0x00000004
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 9375 Comm: syz-executor.1 Not tainted 5.16.0-rc7-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 9915672d ("af_unix: limit unix_tot_inflight")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Link: https://lore.kernel.org/r/20220114164328.2038499-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
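The annotation pattern from the af_unix commit above has a rough userspace analogue in C11 relaxed atomics. The sketch below is an illustration under assumptions, not the kernel code: `model_inflight_inc()`, `model_should_trigger_gc()`, the `gc_lock` mutex and the threshold parameter are hypothetical names, and relaxed atomic accesses only approximate what READ_ONCE()/WRITE_ONCE() express in the kernel.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static pthread_mutex_t gc_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_uint tot_inflight;        /* zero-initialized counter */

/* Writer: serialized by gc_lock; the relaxed store stands in for WRITE_ONCE(). */
void model_inflight_inc(void)
{
	unsigned int cur;

	pthread_mutex_lock(&gc_lock);
	cur = atomic_load_explicit(&tot_inflight, memory_order_relaxed);
	atomic_store_explicit(&tot_inflight, cur + 1, memory_order_relaxed);
	pthread_mutex_unlock(&gc_lock);
}

/* Lockless reader: the relaxed load stands in for READ_ONCE(). */
bool model_should_trigger_gc(unsigned int threshold)
{
	return atomic_load_explicit(&tot_inflight, memory_order_relaxed) > threshold;
}
```

The reader may observe a slightly stale value, which is acceptable for a heuristic threshold check. The annotations record that the unsynchronized access is intentional and keep the compiler from tearing or refetching it.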
    • virtio: wrap config->reset calls · d9679d00
      Michael S. Tsirkin authored

      This will enable cleanups down the road.
      The idea is to disable cbs, then add a "flush_queued_cbs" callback
      as a parameter; this way drivers can flush any work
      queued after callbacks have been disabled.

      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Link: https://lore.kernel.org/r/20211013105226.20225-1-mst@redhat.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
  7. Jan 13, 2022
    • net_sched: restore "mpu xxx" handling · fb80445c
      Kevin Bracey authored
      commit 56b765b7 ("htb: improved accuracy at high rates") broke
      "overhead X", "linklayer atm" and "mpu X" attributes.
      
      "overhead X" and "linklayer atm" have already been fixed. This restores
      the "mpu X" handling, as might be used by DOCSIS or Ethernet shaping:
      
          tc class add ... htb rate X overhead 4 mpu 64
      
      The code being fixed is used by htb, tbf and act_police. Cake has its
      own mpu handling. qdisc_calculate_pkt_len still uses the size table
      containing values adjusted for mpu by user space.
      
      iproute2 tc has always passed mpu into the kernel via a tc_ratespec
      structure, but the kernel never directly acted on it, merely stored it
      so that it could be read back by `tc class show`.
      
      Rather, tc would generate length-to-time tables that included the mpu
      (and linklayer) in their construction, and the kernel used those tables.
      
      Since v3.7, the tables were no longer used. Along with "mpu", this also
      broke "overhead" and "linklayer" which were fixed in 01cb71d2
      ("net_sched: restore "overhead xxx" handling", v3.10) and 8a8e3d84
      ("net_sched: restore "linklayer atm" handling", v3.11).
      
      "overhead" was fixed by simply restoring use of tc_ratespec::overhead -
      this had originally been used by the kernel but was initially omitted
      from the new non-table-based calculations.
      
      "linklayer" had been handled in the table like "mpu", but the mode was
      not originally passed in tc_ratespec. The new implementation was made to
      handle it by getting new versions of tc to pass the mode in an extended
      tc_ratespec, and for older versions of tc the table contents were analysed
      at load time to deduce linklayer.
      
      As "mpu" has always been given to the kernel in tc_ratespec,
      accompanying the mpu-based table, we can restore system functionality
      with no userspace change by making the kernel act on the tc_ratespec
      value.
      
      Fixes: 56b765b7 ("htb: improved accuracy at high rates")
      Signed-off-by: Kevin Bracey <kevin@bracey.fi>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Cc: Vimalkumar <j.vimal@gmail.com>
      Link: https://lore.kernel.org/r/20220112170210.1014351-1-kevin@bracey.fi
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
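The arithmetic behind "overhead X" and "mpu X" in the net_sched commit above can be sketched as follows. This is a simplified model under assumptions, not the kernel's rate code: `model_accounted_len()` and `model_tx_time_ns()` are hypothetical helpers, the rate is given in bytes per second, and "linklayer atm" cell rounding is left out.

```c
#include <stdint.h>

/* Length used for rate accounting: add overhead, then clamp up to mpu. */
uint32_t model_accounted_len(uint32_t len, uint32_t overhead, uint32_t mpu)
{
	len += overhead;
	if (len < mpu)
		len = mpu;
	return len;
}

/* Transmission time in nanoseconds; rate_bytes_per_sec must be non-zero. */
uint64_t model_tx_time_ns(uint32_t accounted_len, uint64_t rate_bytes_per_sec)
{
	return ((uint64_t)accounted_len * 1000000000ull) / rate_bytes_per_sec;
}
```

With `overhead 4 mpu 64` as in the tc example above, a 40-byte packet is accounted as 64 bytes (44 is below the minimum), while a 100-byte packet is accounted as 104 bytes.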
    • net/smc: Resolve the race between SMC-R link access and clear · 20c9398d
      Wen Gu authored

      We encountered some crashes caused by a race between SMC-R
      link access and a link clear triggered by abnormal link
      group termination, such as a port error.

      Here is an example of this kind of crash:
      
       BUG: kernel NULL pointer dereference, address: 0000000000000000
       Workqueue: smc_hs_wq smc_listen_work [smc]
       RIP: 0010:smc_llc_flow_initiate+0x44/0x190 [smc]
       Call Trace:
        <TASK>
        ? __smc_buf_create+0x75a/0x950 [smc]
        smcr_lgr_reg_rmbs+0x2a/0xbf [smc]
        smc_listen_work+0xf72/0x1230 [smc]
        ? process_one_work+0x25c/0x600
        process_one_work+0x25c/0x600
        worker_thread+0x4f/0x3a0
        ? process_one_work+0x600/0x600
        kthread+0x15d/0x1a0
        ? set_kthread_struct+0x40/0x40
        ret_from_fork+0x1f/0x30
        </TASK>
      
      smc_listen_work()                     __smc_lgr_terminate()
      ---------------------------------------------------------------
                                          | smc_lgr_free()
                                          |  |- smcr_link_clear()
                                          |      |- memset(lnk, 0)
      smc_listen_rdma_reg()               |
       |- smcr_lgr_reg_rmbs()             |
           |- smc_llc_flow_initiate()     |
               |- access lnk->lgr (panic) |
      
      These crashes all stem from clearing SMC-R link
      resources while some functions are still accessing them.
      This patch tries to fix the issue by introducing a reference
      count for SMC-R links and ensuring that the sensitive resources
      of a link are not cleared until its reference count reaches zero.

      The operations on the SMC-R link reference count can be summarized
      as follows:
      
      object          [hold or initialized as 1]         [put]
      --------------------------------------------------------------------
      links           smcr_link_init()                   smcr_link_clear()
      connections     smc_conn_create()                  smc_conn_free()
      
      In this way, an SMC-R link is cleared only after all the
      smc connections above it have been freed, which avoids
      unsafe references to SMC-R links.

      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Introduce a new conn->lgr validity check helper · ea89c6c0
      Wen Gu authored

      It is no longer suitable to identify whether an smc connection
      is registered in a link group by checking whether conn->lgr
      is NULL, because conn->lgr is not reset even when the connection
      is unregistered from a link group.

      So this patch introduces a new helper, smc_conn_lgr_valid(), and
      replaces all the checks of conn->lgr in the original implementation
      with the new helper to judge whether conn->lgr is valid to use.

      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: annotate races around fqdir->dead and fqdir->high_thresh · 91341fa0
      Eric Dumazet authored
      Both fields can be read/written without synchronization,
      add proper accessors and documentation.

      Fixes: d5dd8879 ("inet: fix various use-after-free in defrags units")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Resolve the race between link group access and termination · 61f434b0
      Wen Gu authored

      We encountered some crashes caused by a race between the access
      and the termination of link groups.

      Here are some of the panic stacks we encountered:
      
      1) Race between smc_clc_wait_msg() and __smc_lgr_terminate()
      
       BUG: kernel NULL pointer dereference, address: 00000000000002f0
       Workqueue: smc_hs_wq smc_listen_work [smc]
       RIP: 0010:smc_clc_wait_msg+0x3eb/0x5c0 [smc]
       Call Trace:
        <TASK>
        ? smc_clc_send_accept+0x45/0xa0 [smc]
        ? smc_clc_send_accept+0x45/0xa0 [smc]
        smc_listen_work+0x783/0x1220 [smc]
        ? finish_task_switch+0xc4/0x2e0
        ? process_one_work+0x1ad/0x3c0
        process_one_work+0x1ad/0x3c0
        worker_thread+0x4c/0x390
        ? rescuer_thread+0x320/0x320
        kthread+0x149/0x190
        ? set_kthread_struct+0x40/0x40
        ret_from_fork+0x1f/0x30
        </TASK>
      
      smc_listen_work()                abnormal case like port error
      ---------------------------------------------------------------
                                      | __smc_lgr_terminate()
                                      |  |- smc_conn_kill()
                                      |      |- smc_lgr_unregister_conn()
                                      |          |- set conn->lgr = NULL
      smc_clc_wait_msg()              |
       |- access conn->lgr (panic)    |
      
      2) Race between smc_setsockopt() and __smc_lgr_terminate()
      
       BUG: kernel NULL pointer dereference, address: 00000000000002e8
       RIP: 0010:smc_setsockopt+0x17a/0x280 [smc]
       Call Trace:
        <TASK>
        __sys_setsockopt+0xfc/0x190
        __x64_sys_setsockopt+0x20/0x30
        do_syscall_64+0x34/0x90
        entry_SYSCALL_64_after_hwframe+0x44/0xae
        </TASK>
      
      smc_setsockopt()                 abnormal case like port error
      --------------------------------------------------------------
                                      | __smc_lgr_terminate()
                                      |  |- smc_conn_kill()
                                      |      |- smc_lgr_unregister_conn()
                                      |          |- set conn->lgr = NULL
      mod_delayed_work()              |
       |- access conn->lgr (panic)    |
      
      There are other panic sites with a similar cause to the one
      described above: the link group is accessed after its
      termination, which yields a NULL pointer or an invalid
      resource.

      Currently, there seems to be no synchronization between
      link group access and a sudden termination of the link group. This
      patch tries to fix this by introducing a reference count for the link
      group and not freeing the link group until the reference count is zero.

      A link group might be referred to by links or smc connections, so
      the operations on the link group reference count can be summarized
      as follows:
      
      object          [hold or initialized as 1]       [put]
      -------------------------------------------------------------------
      link group      smc_lgr_create()                 smc_lgr_free()
      connections     smc_conn_create()                smc_conn_free()
      links           smcr_link_init()                 smcr_link_clear()
      
      In this way, we extend the life cycle of the link group and
      ensure it is longer than the life cycle of the connections and links
      above it, which avoids invalid access to the link group after its
      termination.

      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
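The hold/put scheme tabulated in the link group commit above follows the usual reference counting pattern, which can be sketched in userspace C. This is a generic illustration, not the SMC implementation: `struct model_lgr`, `model_lgr_create()`, `model_lgr_hold()`, `model_lgr_put()` and the `freed_flag` observation hook are hypothetical.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a link group with a reference count. */
struct model_lgr {
	atomic_int refcnt;
	int *freed_flag;        /* set to 1 when the object is released */
};

/* Create with a count of 1, owned by the creator. */
struct model_lgr *model_lgr_create(int *freed_flag)
{
	struct model_lgr *lgr = malloc(sizeof(*lgr));

	if (!lgr)
		return NULL;
	atomic_init(&lgr->refcnt, 1);
	lgr->freed_flag = freed_flag;
	return lgr;
}

/* Each connection or link that uses the group takes a reference. */
void model_lgr_hold(struct model_lgr *lgr)
{
	atomic_fetch_add(&lgr->refcnt, 1);
}

/* Drop a reference; the last put releases the memory. */
void model_lgr_put(struct model_lgr *lgr)
{
	if (atomic_fetch_sub(&lgr->refcnt, 1) == 1) {
		int *flag = lgr->freed_flag;

		free(lgr);
		*flag = 1;
	}
}
```

If a connection takes a reference when it is created and drops it when it is freed, terminating the group only drops the creator's reference, and the memory stays valid until the last connection is gone.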
    • libceph: rename parse_fsid() to ceph_parse_fsid() and export · 4153c7fc
      Venky Shankar authored

      ... as it is too generic. Also, use __func__ when logging
      rather than hardcoding the function name.

      Signed-off-by: Venky Shankar <vshankar@redhat.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • libceph: generalize addr/ip parsing based on delimiter · 2d7c86a8
      Venky Shankar authored

      ... and remove hardcoded function name in ceph_parse_ips().

      [ idryomov: delim parameter, drop CEPH_ADDR_PARSE_DEFAULT_DELIM ]

      Signed-off-by: Venky Shankar <vshankar@redhat.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • sch_api: Don't skip qdisc attach on ingress · de2d807b
      Maxim Mikityanskiy authored
      The attach callback of struct Qdisc_ops is used by only a few qdiscs:
      mq, mqprio and htb. qdisc_graft() contains the following logic
      (pseudocode):
      
          if (!qdisc->ops->attach) {
              if (ingress)
                  do ingress stuff;
              else
                  do egress stuff;
          }
          if (!ingress) {
              ...
              if (qdisc->ops->attach)
                  qdisc->ops->attach(qdisc);
          } else {
              ...
          }
      
      As we see, the attach callback is not called if the qdisc is being
      attached to ingress (TC_H_INGRESS). That wasn't a problem for mq and
      mqprio, since they contain a check that they are attached to TC_H_ROOT,
      and they can't be attached to TC_H_INGRESS anyway.
      
      However, the commit cited below added the attach callback to htb. It is
      needed for the hardware offload, but in the non-offload mode it
      simulates the "do egress stuff" part of the pseudocode above. The
      problem is that when htb is attached to ingress, neither "do ingress
      stuff" nor attach() is called. It results in an inconsistency, and the
      following message is printed to dmesg:
      
      unregister_netdevice: waiting for lo to become free. Usage count = 2
      
      This commit addresses the issue by running "do ingress stuff" in the
      ingress flow even if the attach callback is present, which is fine,
      because attach isn't going to be called afterwards.
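
      The control flow before and after the change can be modeled as a
      small sketch. The graft_before() and graft_after() helpers and the
      result struct are hypothetical and only record which branch of the
      pseudocode would run; they are not the qdisc_graft() code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical record of which steps of the pseudocode were executed. */
struct graft_result {
    bool did_ingress;
    bool did_egress;
    bool called_attach;
};

/* Models the original logic: a qdisc with attach() skips both branches. */
struct graft_result graft_before(bool has_attach, bool ingress)
{
    struct graft_result r = { false, false, false };

    if (!has_attach) {
        if (ingress)
            r.did_ingress = true;
        else
            r.did_egress = true;
    }
    if (!ingress && has_attach)
        r.called_attach = true;
    return r;
}

/* Models the described fix: ingress work runs even if attach() exists. */
struct graft_result graft_after(bool has_attach, bool ingress)
{
    struct graft_result r = { false, false, false };

    if (!has_attach || ingress) {
        if (ingress)
            r.did_ingress = true;
        else
            r.did_egress = true;
    }
    if (!ingress && has_attach)
        r.called_attach = true;
    return r;
}
```

      In this model, a qdisc with an attach callback grafted on ingress
      runs neither step before the change and runs only the ingress step
      after it; the egress cases are unchanged.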
      
      The bug was found by syzbot and reported by Eric.
      
      Fixes: d03b195b ("sch_htb: Hierarchical QoS hardware offload")
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Reported-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      de2d807b
  8. Jan 12, 2022
    • Ignat Korchagin's avatar
      sit: allow encapsulated IPv6 traffic to be delivered locally · ed6ae5ca
      Ignat Korchagin authored
      
      While experimenting with FOU encapsulation Amir noticed that encapsulated IPv6
      traffic fails to be delivered, if the peer IP address is configured locally.
      
      It can be easily verified by creating a sit interface like below:
      
      $ sudo ip link add name fou_test type sit remote 127.0.0.1 encap fou encap-sport auto encap-dport 1111
      $ sudo ip link set fou_test up
      
      and sending some IPv4 and IPv6 traffic to it
      
      $ ping -I fou_test -c 1 1.1.1.1
      $ ping6 -I fou_test -c 1 fe80::d0b0:dfff:fe4c:fcbc
      
      "tcpdump -i any udp dst port 1111" will confirm that only the first IPv4 ping
      was encapsulated and attempted to be delivered.
      
      This seems like a limitation: for example, in a cloud environment the "peer"
      service may be arbitrarily scheduled on any server within the cluster, where all
      nodes are trying to send encapsulated traffic. And the unlucky node will not be
      able to. Moreover, delivering encapsulated IPv4 traffic locally is allowed.
      
      But I may not have all the context about this restriction and this code predates
      the observable git history.
      
      Reported-by: Amir Razmjou <arazmjou@cloudflare.com>
      Signed-off-by: Ignat Korchagin <ignat@cloudflare.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20220107123842.211335-1-ignat@cloudflare.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      ed6ae5ca
    • Eric Dumazet's avatar
      net/smc: fix possible NULL deref in smc_pnet_add_eth() · 7b9b1d44
      Eric Dumazet authored
      I missed that @ndev value can be NULL.
      
      I prefer not factorizing this NULL check, and instead
      clearly document where a NULL might be expected.
      
      general protection fault, probably for non-canonical address 0xdffffc00000000ba: 0000 [#1] PREEMPT SMP KASAN
      KASAN: null-ptr-deref in range [0x00000000000005d0-0x00000000000005d7]
      CPU: 0 PID: 19875 Comm: syz-executor.2 Not tainted 5.16.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:__lock_acquire+0xd7a/0x5470 kernel/locking/lockdep.c:4897
      Code: 14 0e 41 bf 01 00 00 00 0f 86 c8 00 00 00 89 05 5c 20 14 0e e9 bd 00 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 9f 2e 00 00 49 81 3e 20 c5 1a 8f 0f 84 52 f3 ff
      RSP: 0018:ffffc900057071d0 EFLAGS: 00010002
      RAX: dffffc0000000000 RBX: 1ffff92000ae0e65 RCX: 1ffff92000ae0e4c
      RDX: 00000000000000ba RSI: 0000000000000000 RDI: 0000000000000001
      RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000001
      R10: fffffbfff1b24ae2 R11: 000000000008808a R12: 0000000000000000
      R13: ffff888040ca4000 R14: 00000000000005d0 R15: 0000000000000000
      FS:  00007fbd683e0700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000001b2be22000 CR3: 0000000013fea000 CR4: 00000000003526f0
      Call Trace:
       <TASK>
       lock_acquire kernel/locking/lockdep.c:5637 [inline]
       lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5602
       __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
       _raw_spin_lock_irqsave+0x39/0x50 kernel/locking/spinlock.c:162
       ref_tracker_alloc+0x182/0x440 lib/ref_tracker.c:84
       netdev_tracker_alloc include/linux/netdevice.h:3859 [inline]
       smc_pnet_add_eth net/smc/smc_pnet.c:372 [inline]
       smc_pnet_enter net/smc/smc_pnet.c:492 [inline]
       smc_pnet_add+0x49a/0x14d0 net/smc/smc_pnet.c:555
       genl_family_rcv_msg_doit+0x228/0x320 net/netlink/genetlink.c:731
       genl_family_rcv_msg net/netlink/genetlink.c:775 [inline]
       genl_rcv_msg+0x328/0x580 net/netlink/genetlink.c:792
       netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
       genl_rcv+0x24/0x40 net/netlink/genetlink.c:803
       netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
       netlink_unicast+0x539/0x7e0 net/netlink/af_netlink.c:1343
       netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1919
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:725
       ____sys_sendmsg+0x6e8/0x810 net/socket.c:2413
       ___sys_sendmsg+0xf3/0x170 net/socket.c:2467
       __sys_sendmsg+0xe5/0x1b0 net/socket.c:2496
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: b6064524 ("net/smc: add net device tracker to struct smc_pnetentry")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7b9b1d44
    • Eric Dumazet's avatar
      net: bridge: fix net device refcount tracking issue in error path · fcfb894d
      Eric Dumazet authored
      I left one dev_put() in br_add_if() error path and sure enough
      syzbot found its way.
      
      As the tracker is allocated in new_nbp(), we must make sure
      to properly free it.
      
      We have to call dev_put_track(dev, &p->dev_tracker) before
      @p object is freed, of course. This is not an issue because
      br_add_if() owns a reference on @dev.
      
      Fixes: b2dcdc7f ("net: bridge: add net device refcount tracker")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fcfb894d
    • Miroslav Lichvar's avatar
      net: fix sock_timestamping_bind_phc() to release device · 2a4d75bf
      Miroslav Lichvar authored
      Don't forget to release the device in sock_timestamping_bind_phc() after
      it was used to get the vclock indices.
      
      Fixes: d463126e ("net: sock: extend SO_TIMESTAMPING for PHC binding")
      Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
      Cc: Yangbo Lu <yangbo.lu@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2a4d75bf
    • Michael Walle's avatar
      Revert "of: net: support NVMEM cells with MAC in text format" · 3486eb77
      Michael Walle authored
      This reverts commit 9ed319e4.
      
      We can already post process a nvmem cell value in a particular driver.
      Instead of having yet another place to convert the values, the post
      processing hook of the nvmem provider should be used in this case.
      
      Signed-off-by: Michael Walle <michael@walle.cc>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3486eb77
  9. Jan 11, 2022
  10. Jan 10, 2022
    • Linus Torvalds's avatar
      netfilter: nf_tables: don't use 'data_size' uninitialized · 63045bfd
      Linus Torvalds authored
      Commit 2c865a8a ("netfilter: nf_tables: add rule blob layout") never
      initialized the new 'data_size' variable.
      
      I'm not sure how it ever worked, but it might have worked almost by
      accident - gcc seems to occasionally miss these kinds of 'variable used
      uninitialized' situations, but I've seen it do so because it ended up
      zero-initializing them due to some other simplification.
      
      But clang is very unhappy about it all, and correctly reports
      
          net/netfilter/nf_tables_api.c:8278:4: error: variable 'data_size' is uninitialized when used here [-Werror,-Wuninitialized]
                                  data_size += sizeof(*prule) + rule->dlen;
                                  ^~~~~~~~~
          net/netfilter/nf_tables_api.c:8263:30: note: initialize the variable 'data_size' to silence this warning
                  unsigned int size, data_size;
                                              ^
                                               = 0
          1 error generated.
      
      and this fix just initializes 'data_size' to zero before the loop.
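
      The pattern being fixed can be shown with a minimal sketch. The
      rule_model struct and total_blob_size() are hypothetical names used
      only for illustration; the point is that an accumulator updated with
      += inside a loop must be given a starting value first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a rule that carries dlen bytes of data. */
struct rule_model {
    size_t dlen;
};

/* Sums a fixed per-rule header size plus each rule's data length. */
size_t total_blob_size(const struct rule_model *rules, size_t n, size_t hdr)
{
    size_t data_size = 0;   /* without "= 0" the first += reads garbage */

    for (size_t i = 0; i < n; i++)
        data_size += hdr + rules[i].dlen;
    return data_size;
}
```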
      
      Fixes: 2c865a8a ("netfilter: nf_tables: add rule blob layout")
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      63045bfd
    • Chuck Lever's avatar
      SUNRPC: Fix sockaddr handling in the svc_xprt_create_error trace point · dc6c6fb3
      Chuck Lever authored
      While testing, I got an unexpected KASAN splat:
      
      Jan 08 13:50:27 oracle-102.nfsv4.dev kernel: BUG: KASAN: stack-out-of-bounds in trace_event_raw_event_svc_xprt_create_err+0x190/0x210 [sunrpc]
      Jan 08 13:50:27 oracle-102.nfsv4.dev kernel: Read of size 28 at addr ffffc9000008f728 by task mount.nfs/4628
      
      The memcpy() in the TP_fast_assign section of this trace point
      copies the size of the destination buffer in order that the buffer
      won't be overrun.
      
      In other similar trace points, the source buffer for this memcpy is
      a "struct sockaddr_storage" so the actual length of the source
      buffer is always long enough to prevent the memcpy from reading
      uninitialized or unallocated memory.
      
      However, for this trace point, the source buffer can be as small as
      a "struct sockaddr_in". For AF_INET sockaddrs, the memcpy() reads
      memory that follows the source buffer, which is not always valid
      memory.
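
      The overread described above comes from sizing the copy by the
      destination alone. As a generic userspace illustration, and not
      necessarily the approach this commit takes, a copy can be bounded by
      both the destination size and the caller-supplied source length. The
      copy_sockaddr_bounded() helper is a hypothetical name.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Hypothetical helper: copy at most srclen bytes of src into dst, never
 * more than dst can hold, and zero the remainder of dst.
 * Returns the number of bytes actually copied.
 */
size_t copy_sockaddr_bounded(struct sockaddr_storage *dst,
                             const struct sockaddr *src, socklen_t srclen)
{
    size_t n = srclen < sizeof(*dst) ? (size_t)srclen : sizeof(*dst);

    assert(dst != NULL && src != NULL);
    memset(dst, 0, sizeof(*dst));   /* bytes beyond n stay zeroed */
    memcpy(dst, src, n);            /* never reads past the source length */
    return n;
}
```

      For an AF_INET source, only sizeof(struct sockaddr_in) bytes are
      read in this sketch, so memory following the smaller source buffer
      is never touched.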
      
      To avoid copying past the end of the passed-in sockaddr, ...
      dc6c6fb3
  11. Jan 09, 2022