  1. Jan 29, 2021
  2. Jan 28, 2021
    • Merge branch 'nexthop-preparations-for-resilient-next-hop-groups' · 67d25ce8
      Jakub Kicinski authored
      Petr Machata says:
      
      ====================
      nexthop: Preparations for resilient next-hop groups
      
      At this moment, there is only one type of next-hop group: an mpath group.
      Mpath groups implement the hash-threshold algorithm, described in RFC
      2992[1].
      
      To select a next hop, the hash-threshold algorithm first assigns a range of
      hashes to each next hop in the group, and then selects the next hop by
      comparing the SKB hash with the individual ranges. When a next hop is
      removed from the group, the ranges are recomputed, which leads to
      reassignment of parts of hash space from one next hop to another. RFC 2992
      illustrates it thus:
      
                   +-------+-------+-------+-------+-------+
                   |   1   |   2   |   3   |   4   |   5   |
                   +-------+-+-----+---+---+-----+-+-------+
                   |    1    |    2    |    4    |    5    |
                   +---------+---------+---------+---------+
      
                    Before and after deletion of next hop 3
                    under the hash-threshold algorithm.
      
      Note how next hop 2 gave up part of the hash space in favor of next hop 1,
      and 4 in favor of 5. While there will usually be some overlap between the
      previous and the new distribution, some traffic flows change the next hop
      that they resolve to.
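
      The reassignment described above can be reproduced with a minimal
      sketch, assuming equal weights and a 32-bit hash. The function name
      hash_threshold_select() and the array-of-IDs representation are
      illustrative assumptions, not the kernel's code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hash-threshold selection with equal weights (illustrative sketch):
 * the 32-bit hash space is cut into n contiguous regions of equal size,
 * and region i maps to hops[i].
 */
static int hash_threshold_select(const int *hops, unsigned int n, uint32_t hash)
{
	uint64_t region_size;
	unsigned int region;

	assert(n > 0);
	region_size = ((uint64_t)1 << 32) / n;	/* rounded down */
	region = (unsigned int)(hash / region_size);
	if (region >= n)	/* rounding leftovers go to the last hop */
		region = n - 1;
	return hops[region];
}
```

      With hops {1, 2, 3, 4, 5}, a hash of 0x38000000 resolves to next hop 2.
      After next hop 3 is removed, the same hash resolves to next hop 1, even
      though next hop 2 is still in the group.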
      
      If a multipath group is used for load-balancing between multiple servers,
      this hash space reassignment means that packets from a single flow can
      suddenly start arriving at a server that does not expect them, which may
      lead to a TCP reset.
      
      If a multipath group is used for load-balancing among available paths to
      the same server, the issue is that different latencies and reordering along
      the way cause the packets to arrive in the wrong order.
      
      Resilient hashing is a technique to address the above problems. A resilient
      next-hop group has another layer of indirection between the group itself
      and its constituent next hops: a hash table. The selection algorithm uses a
      straightforward modulo operation to choose a hash bucket, and then reads
      the next hop that this bucket contains, and forwards traffic there.
      
      This indirection brings an important feature. In the hash-threshold
      algorithm, the range of hashes associated with a next hop must be
      contiguous. With a hash table, the mapping between the hash table buckets
      and the individual next hops is arbitrary. Therefore, when a next hop is
      deleted, the buckets that held it are simply reassigned to other next hops:
      
                   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                   |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
                   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                    v v v v
                   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                   |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
                   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      
                    Before and after deletion of next hop 3
                    under the resilient hashing algorithm.
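
      The bucket table in the diagram can be sketched as follows. This is a
      simplified illustration, not the planned implementation: the names
      resilient_select() and resilient_remove() are hypothetical, and the
      round-robin choice of replacement next hops is an assumption made only
      to match the diagram.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pick the bucket with a plain modulo, then read the next hop stored in it. */
static int resilient_select(const int *buckets, size_t num_buckets,
			    uint32_t hash)
{
	assert(num_buckets > 0);
	return buckets[hash % num_buckets];
}

/*
 * Remove next hop `dead`: only buckets that pointed at it are rewritten,
 * cycling through the surviving next hops. All other buckets keep their
 * next hop. Returns the number of buckets that changed.
 */
static size_t resilient_remove(int *buckets, size_t num_buckets, int dead,
			       const int *alive, size_t num_alive)
{
	size_t i, next = 0, changed = 0;

	assert(num_alive > 0);
	for (i = 0; i < num_buckets; i++) {
		if (buckets[i] != dead)
			continue;
		buckets[i] = alive[next++ % num_alive];
		changed++;
	}
	return changed;
}
```

      Applied to the 20-bucket table above, removing next hop 3 rewrites
      exactly four buckets; flows that hash to any other bucket keep their
      next hop.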
      
      When weights of next hops in a group are altered, it may be possible to
      choose a subset of buckets that are currently not used for forwarding
      traffic, and use those to satisfy the new next-hop distribution demands,
      keeping the "busy" buckets intact. This way, established flows ideally
      continue to be forwarded to the same endpoints through the same paths as
      before the next-hop group change.
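
      One way to picture this is the sketch below, which migrates only idle
      buckets toward a new target distribution. Everything in it is an
      assumption for illustration: the struct bucket layout, the boolean
      "busy" flag (a real implementation would more likely track activity
      over time), the MAX_NH cap, and the rebalance_idle() name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NH 16	/* arbitrary cap for this sketch */

struct bucket {
	size_t nh;	/* index of the next hop this bucket forwards to */
	bool busy;	/* true if the bucket recently forwarded traffic */
};

/*
 * Move idle buckets from next hops holding more buckets than wanted[] to
 * next hops holding fewer. Busy buckets are never touched, so the result
 * may fall short of the target. Returns the number of buckets moved.
 */
static size_t rebalance_idle(struct bucket *tbl, size_t n,
			     const size_t *wanted, size_t num_nh)
{
	size_t have[MAX_NH] = {0};
	size_t i, dst, moved = 0;

	assert(num_nh > 0 && num_nh <= MAX_NH);
	for (i = 0; i < n; i++) {
		assert(tbl[i].nh < num_nh);
		have[tbl[i].nh]++;
	}

	for (i = 0; i < n; i++) {
		size_t src = tbl[i].nh;

		if (tbl[i].busy || have[src] <= wanted[src])
			continue;
		for (dst = 0; dst < num_nh; dst++)
			if (have[dst] < wanted[dst])
				break;
		if (dst == num_nh)
			break;	/* every next hop already has its share */
		tbl[i].nh = dst;
		have[src]--;
		have[dst]++;
		moved++;
	}
	return moved;
}
```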
      
      This patchset prepares the next-hop code for eventual introduction of
      resilient hashing groups.
      
      - Patches #1-#4 carry otherwise disjoint changes that just remove certain
        assumptions in the next-hop code.
      
      - Patches #5-#6 extend the in-kernel next-hop notifiers to support more
        next-hop group types.
      
      - Patches #7-#12 refactor RTNL message handlers. Resilient next-hop groups
        will introduce a new logical object, a hash table bucket. It turns out
        that handling bucket-related messages is similar to how next-hop messages
        are handled. These patches extract the commonalities into reusable
        components.
      
      The plan is to contribute approximately the following patchsets:
      
      1) Nexthop policy refactoring (already pushed)
      2) Preparations for resilient next-hop groups (this patchset)
      3) Implementation of resilient next-hop groups
      4) Netdevsim offload plus a suite of selftests
      5) Preparations for mlxsw offload of resilient next-hop groups
      6) mlxsw offload including selftests
      
      Interested parties can look at the current state of the code at [2] and
      [3].
      
      [1] https://tools.ietf.org/html/rfc2992
      [2] https://github.com/idosch/linux/commits/submit/res_integ_v1
      [3] https://github.com/idosch/iproute2/commits/submit/res_v1
      ====================
      
      Link: https://lore.kernel.org/r/cover.1611836479.git.petrm@nvidia.com
      
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      67d25ce8
    • nexthop: Extract a helper for validation of get/del RTNL requests · 0bccf8ed
      Petr Machata authored
      
      Validation of messages for get / del of a next hop is the same as
      validation of messages for get of a resilient next-hop group bucket will
      be. The difference is that the policy for resilient next-hop group buckets
      is a superset of that used for next-hop get.
      
      It is therefore possible to reuse the code that validates the nhmsg fields,
      extracts the next-hop ID, and validates that. To that end, extract from
      nh_valid_get_del_req() a helper __nh_valid_get_del_req() that does just
      that.
      
      Make the nlh argument const so that the function can be called from the
      dump context, which only has a const nlh. Propagate the constness to
      nh_valid_get_del_req().
      
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      0bccf8ed
    • nexthop: Add a callback parameter to rtm_dump_walk_nexthops() · e948217d
      Petr Machata authored
      
      In order to allow different handling for the next-hop tree dumper and for
      the bucket dumper, parameterize the next-hop tree walker with a callback.
      Add rtm_dump_nexthop_cb() with just the bits relevant for next-hop tree
      dumping.
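
      The shape of such a callback-parameterized walker can be sketched in
      userspace C. This is not the kernel code: the next hops live in a plain
      array rather than a tree, and struct nexthop, nh_walk_cb,
      walk_nexthops(), struct dump_buf and dump_nexthop_cb() are all
      hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

struct nexthop {	/* hypothetical, reduced to an ID */
	unsigned int id;
};

/* Called once per next hop; a non-zero return stops the walk. */
typedef int (*nh_walk_cb)(const struct nexthop *nh, void *data);

/*
 * Walk next hops starting at *idx, calling cb on each one. On an early
 * stop, *idx is left at the entry that was refused, so that a later call
 * can resume from there.
 */
static int walk_nexthops(const struct nexthop *nhs, size_t count, size_t *idx,
			 nh_walk_cb cb, void *data)
{
	int err;

	assert(cb);
	for (; *idx < count; (*idx)++) {
		err = cb(&nhs[*idx], data);
		if (err)
			return err;
	}
	return 0;
}

/* Example callback: "dump" next-hop IDs into a buffer with limited room. */
struct dump_buf {
	unsigned int ids[2];
	size_t used;
};

static int dump_nexthop_cb(const struct nexthop *nh, void *data)
{
	struct dump_buf *buf = data;

	if (buf->used == sizeof(buf->ids) / sizeof(buf->ids[0]))
		return -1;	/* buffer full: ask the walker to stop */
	buf->ids[buf->used++] = nh->id;
	return 0;
}
```

      A different dumper would reuse walk_nexthops() unchanged and pass its
      own callback and private data.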
      
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e948217d
    • nexthop: Extract a helper for walking the next-hop tree · cbee1807
      Petr Machata authored
      
      Extract from rtm_dump_nexthop() a helper to walk the next-hop tree. A
      separate function for this will be reusable from the bucket dumper.
      
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      cbee1807
    • nexthop: Strongly-type context of rtm_dump_nexthop() · a6fbbaa6
      Petr Machata authored
      
      The dump operations need to keep state from one invocation to another. A
      scratch area is dedicated for this purpose in the passed-in argument, cb,
      namely via two aliased arrays, struct netlink_callback.args and .ctx.
      
      Dumping of buckets will end up having to iterate over next hops as well,
      and it would be nice to be able to reuse the iteration logic with the NH
      dumper. The fact that the logic currently relies on a fixed index into the
      .args array, and that the indices would have to be coordinated between the
      two dumpers, makes this somewhat awkward.
      
      To make the access patterns clearer, introduce a helper struct with a NH
      index, and instead of using the .args array directly, use it through this
      structure.
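
      The idea can be sketched outside the kernel as follows. The struct
      fake_netlink_callback type is a stand-in that only mimics the two
      aliased scratch arrays, and struct dump_nh_ctx and dump_nh_ctx() are
      hypothetical names. Note that the cast relies on the kind of type
      punning the kernel permits by building with -fno-strict-aliasing.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for struct netlink_callback: two aliased scratch arrays. */
struct fake_netlink_callback {
	union {
		uint8_t ctx[48];
		long args[6];
	};
};

/* Typed view of the scratch area used by the next-hop dumper. */
struct dump_nh_ctx {
	uint32_t idx;	/* next-hop index at which to resume the dump */
};

/* Fail the build if the typed context outgrows the scratch area. */
_Static_assert(sizeof(struct dump_nh_ctx) <=
	       sizeof(((struct fake_netlink_callback *)0)->ctx),
	       "context must fit in the scratch area");

static struct dump_nh_ctx *dump_nh_ctx(struct fake_netlink_callback *cb)
{
	assert(cb);
	return (struct dump_nh_ctx *)(void *)cb->ctx;
}
```

      Callers then write dump_nh_ctx(cb)->idx instead of indexing .args by a
      magic number, and a second dumper can embed this struct in its own,
      larger context.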
      
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      a6fbbaa6
    • nexthop: Extract a common helper for parsing dump attributes · b9ebea12
      Petr Machata authored
      
      Requests to dump nexthops have many attributes in common with those that
      requests to dump buckets of resilient NH groups will have. However, they
      have different policies. To allow reuse of this code, extract a
      policy-agnostic helper out of nh_valid_dump_req(), and convert this
      function into a thin wrapper around it.
      
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b9ebea12
    • nexthop: Extract dump filtering parameters into a single structure · 56450ec6
      Petr Machata authored
      
      Requests to dump nexthops have many attributes in common with those that
      requests to dump buckets of resilient NH groups will have. In order to make
      reuse of this code simpler, convert the code to use a single structure with
      filtering configuration instead of passing around the parameters one by
      one.
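
      As a rough illustration of the pattern, the sketch below gathers the
      filtering knobs into one structure and passes a pointer to it. The
      field names, the struct nh_info type and the nh_filtered() helper are
      assumptions chosen for the example and may not match the kernel's
      definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of a next hop, just enough to filter on. */
struct nh_info {
	int dev_idx;
	int master_idx;
	bool is_group;
	bool is_fdb;
};

/* All dump filtering parameters in one place (illustrative fields). */
struct dump_filter {
	int dev_idx;		/* 0 means: do not filter by device */
	int master_idx;		/* 0 means: do not filter by master */
	bool group_filter;	/* only dump next-hop groups */
	bool fdb_filter;	/* only dump FDB next hops */
};

/* Returns true if the next hop should be skipped by the dump. */
static bool nh_filtered(const struct nh_info *nh, const struct dump_filter *f)
{
	assert(nh && f);
	if (f->group_filter && !nh->is_group)
		return true;
	if (f->fdb_filter && !nh->is_fdb)
		return true;
	if (f->dev_idx && nh->dev_idx != f->dev_idx)
		return true;
	if (f->master_idx && nh->master_idx != f->master_idx)
		return true;
	return false;
}
```

      Adding a new filtering criterion then means adding a field, without
      touching the signature of every function on the call path.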
      
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      56450ec6
    • nexthop: Dispatch notifier init()/fini() by group type · da230501
      Petr Machata authored
      
      Once there are several next-hop group types, initialization and
      finalization of the notifier type need to reflect the actual group type.
      Transform nh_notifier_grp_info_init() and _fini() to make extending them
      easier.
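
      A minimal userspace sketch of this kind of dispatch follows. The types
      and the mpath_info_init()/mpath_info_fini() helpers are hypothetical;
      the point is only that the generic init/fini pair becomes a thin
      dispatcher, so that another group type adds a branch rather than a
      rewrite.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct group {			/* hypothetical next-hop group */
	bool mpath;		/* set when the group is of the mpath type */
	unsigned int num_nh;
};

struct grp_notifier_info {	/* hypothetical notifier payload */
	unsigned int *weights;	/* owned by the notifier info */
	unsigned int num_nh;
};

static int mpath_info_init(struct grp_notifier_info *info,
			   const struct group *g)
{
	info->weights = calloc(g->num_nh ? g->num_nh : 1,
			       sizeof(*info->weights));
	if (!info->weights)
		return -1;
	info->num_nh = g->num_nh;
	return 0;
}

static void mpath_info_fini(struct grp_notifier_info *info)
{
	free(info->weights);
	info->weights = NULL;
}

/* Generic entry points: dispatch on the group type. */
static int grp_info_init(struct grp_notifier_info *info, const struct group *g)
{
	assert(info && g);
	if (g->mpath)
		return mpath_info_init(info, g);
	return -1;	/* no handler for this group type */
}

static void grp_info_fini(struct grp_notifier_info *info,
			  const struct group *g)
{
	assert(info && g);
	if (g->mpath)
		mpath_info_fini(info);
}
```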
      
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      da230501
    • nexthop: Use enum to encode notification type · 09ad6bec
      Ido Schimmel authored
      
      Currently there are only two types of in-kernel nexthop notification.
      The two are distinguished by the 'is_grp' boolean field in 'struct
      nh_notifier_info'.
      
      As more notification types are introduced for more next-hop group types, a
      boolean is not an easily extensible interface. Instead, convert it to an
      enum.
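
      The difference is easy to see in a small sketch. The enumerator names,
      the struct layout and info_num_nexthops() below are illustrative
      assumptions, not the kernel's definitions.

```c
#include <assert.h>

/* A boolean can only say "group or not"; an enum leaves room to grow. */
enum notifier_info_type {
	INFO_TYPE_SINGLE,	/* notification about a single next hop */
	INFO_TYPE_GRP,		/* notification about a next-hop group */
	/* further group types would be added here */
};

struct notifier_info {
	enum notifier_info_type type;
	union {
		unsigned int nh_id;	/* valid for INFO_TYPE_SINGLE */
		unsigned int grp_size;	/* valid for INFO_TYPE_GRP */
	};
};

/* Consumers switch on the type instead of testing a flag. */
static unsigned int info_num_nexthops(const struct notifier_info *info)
{
	assert(info);
	switch (info->type) {
	case INFO_TYPE_SINGLE:
		return 1;
	case INFO_TYPE_GRP:
		return info->grp_size;
	}
	return 0;	/* unknown type */
}
```

      With a switch over an enum, compilers can also warn when a newly added
      enumerator is not handled, which a boolean test cannot offer.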
      
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      09ad6bec
    • nexthop: Assert the invariant that a NH group is of only one type · 720ccd9a
      Petr Machata authored
      
      Most of the code that deals with nexthop groups relies on the fact that the
      group is of exactly one well-known type. Currently there is only one type,
      "mpath", but as more next-hop group types come, it becomes desirable to
      have a central place where the setting is validated. Introduce such a place
      in nexthop_create_group(), such that the check is done before the code
      that relies on that invariant is invoked.
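
      A check of this kind might look like the sketch below, assuming that
      group types are stored as one flag per type. The second flag and the
      group_check_type() name are hypothetical and exist only to show what
      the invariant guards against.

```c
#include <assert.h>
#include <stdbool.h>

struct group {
	bool mpath;	/* the only group type that exists today */
	bool resilient;	/* hypothetical future type, for illustration */
};

/* Returns 0 when exactly one type flag is set, -1 otherwise. */
static int group_check_type(const struct group *g)
{
	int num_types;

	assert(g);
	num_types = (int)g->mpath + (int)g->resilient;
	return num_types == 1 ? 0 : -1;
}
```

      Running the check once at group creation lets the rest of the code
      branch on the type flags without re-validating them.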
      
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      720ccd9a