  1. Dec 03, 2020
  2. Dec 02, 2020
    • Merge branch 'switch to memcg-based memory accounting' · 97306be4
      Alexei Starovoitov authored
      
      Roman Gushchin says:
      
      ====================
      
      Currently bpf is using the memlock rlimit for the memory accounting.
      This approach has its downsides and over time has created a significant
      amount of problems:
      
      1) The limit is per-user, but because most bpf operations are performed
         as root, the limit has little value.
      
      2) It's hard to come up with a specific maximum value, especially because
         the counter is shared with non-bpf use cases (e.g. mlock()).
         Any specific value is either too low and creates false failures,
         or too high and useless.
      
      3) Charging is not connected to the actual memory allocation. Bpf code
         has to manually calculate the estimated cost and charge the counter,
         and then take care of uncharging, including on all failure paths.
         This adds to the code complexity and makes it easy to leak a charge.
      
      4) There is no simple way of getting the current value of the counter.
         We've used drgn for it, but it's far from convenient.
      
      5) A cryptic -EPERM is returned when the limit is exceeded. Libbpf even
         had a function to "explain" this case to users.
      
      6) rlimits are generally considered (at least partially) obsolete.
         They do not provide a comprehensive system for the control of physical
         resources: memory, cpu, io etc. All resource control developments
         in recent years have been related to cgroups.
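To illustrate point 3, here is a minimal userspace sketch in Python of how a manually maintained charge can leak. It is a toy model, not kernel code: `MemlockCounter`, `create_map_buggy`, `create_map_fixed` and `failing_alloc` are hypothetical names invented for this example, and `PermissionError` merely stands in for -EPERM.

```python
class MemlockCounter:
    """Toy stand-in for a per-user memlock counter (not kernel code)."""

    def __init__(self, limit):
        self.limit = limit
        self.charged = 0

    def charge(self, nbytes):
        if self.charged + nbytes > self.limit:
            # Stands in for the -EPERM returned when the limit is exceeded.
            raise PermissionError("memlock limit exceeded")
        self.charged += nbytes

    def uncharge(self, nbytes):
        self.charged -= nbytes


def create_map_buggy(counter, estimated_cost, alloc):
    """Charge an estimate, then allocate; forgets to uncharge on failure."""
    counter.charge(estimated_cost)
    return alloc()  # if this raises, the charge above is leaked


def create_map_fixed(counter, estimated_cost, alloc):
    """Same flow, but the failure path undoes the charge by hand."""
    counter.charge(estimated_cost)
    try:
        return alloc()
    except MemoryError:
        counter.uncharge(estimated_cost)
        raise


def failing_alloc():
    """Simulated allocation failure."""
    raise MemoryError("allocation failed")


def leaked_after_failure(create):
    """Run one failing creation against a fresh counter; return what stays charged."""
    counter = MemlockCounter(limit=64 * 1024)
    try:
        create(counter, 4096, failing_alloc)
    except MemoryError:
        pass
    return counter.charged


print("buggy path leaks:", leaked_after_failure(create_map_buggy))
print("fixed path leaks:", leaked_after_failure(create_map_fixed))
```

Every additional failure path needs the same hand-written cleanup, which is the complexity the cover letter refers to.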
      
      To overcome these problems, let's switch to memory cgroup-based
      accounting of bpf objects. With the recent addition of percpu memory
      accounting, it's now possible to provide comprehensive accounting
      of the memory used by bpf programs and maps.
      
      This approach has the following advantages:
      1) The limit is per-cgroup and hierarchical. It's far more flexible and
         allows better control over memory usage by different workloads.
      
      2) The actual memory consumption is taken into account. Charging happens
         automatically at allocation time if the __GFP_ACCOUNT flag is passed.
         Uncharging is also performed automatically when the memory is released.
         So the code on the bpf side becomes simpler and safer.
      
      3) There is a simple way to get the current value and statistics.
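As a rough illustration of point 3, the sketch below reads the memory usage of the calling process's own cgroup from userspace. It assumes a cgroup v2 hierarchy mounted at `/sys/fs/cgroup` and relies on the `memory.current` and `memory.stat` interface files; the function names are hypothetical, and the result is `None` when those files are not available (for example in the root cgroup or on a cgroup v1 system).

```python
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")  # assumption: cgroup v2 mount point


def parse_memory_stat(text):
    """Parse 'key value' lines from a memory.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def own_cgroup_path(proc_cgroup_text):
    """Return the cgroup v2 path found in /proc/self/cgroup content, or None."""
    for line in proc_cgroup_text.splitlines():
        hierarchy_id, _, rest = line.partition(":")
        controllers, _, path = rest.partition(":")
        # The unified (v2) hierarchy is listed as "0::<path>".
        if hierarchy_id == "0" and controllers == "":
            return path
    return None


def read_own_memory_usage():
    """Best-effort read of memory.current and memory.stat; None if unavailable."""
    try:
        path = own_cgroup_path(Path("/proc/self/cgroup").read_text())
        if path is None:
            return None
        base = CGROUP_ROOT / path.lstrip("/")
        current = int((base / "memory.current").read_text())
        stats = parse_memory_stat((base / "memory.stat").read_text())
        return current, stats
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print(read_own_memory_usage())
```

The parsing helpers take plain text, so they can be exercised without a cgroup filesystem.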
      
      Cgroup-based accounting adds new requirements:
      1) The kernel config should have CONFIG_CGROUPS and CONFIG_MEMCG_KMEM enabled.
         These options are usually enabled, except perhaps in tiny builds for
         embedded devices.
      2) The system should have a configured cgroup hierarchy, including reasonable
         memory limits and/or guarantees. Modern systems usually delegate this task
         to systemd or similar task managers.
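The first requirement can be checked from userspace. The following is a minimal sketch, assuming the running kernel exposes its configuration either as `/proc/config.gz` or as `/boot/config-<release>`; neither location is guaranteed to exist, and the helper names are hypothetical.

```python
import gzip
import platform
import zlib
from pathlib import Path

REQUIRED = ("CONFIG_CGROUPS", "CONFIG_MEMCG_KMEM")  # options named in the cover letter


def enabled_options(config_text):
    """Return the set of CONFIG_* options set to 'y' in kernel config text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and line.endswith("=y"):
            enabled.add(line[: -len("=y")])
    return enabled


def missing_options(config_text, required=REQUIRED):
    """List the required options that are not enabled in the given config text."""
    enabled = enabled_options(config_text)
    return [opt for opt in required if opt not in enabled]


def load_kernel_config():
    """Best-effort read of the running kernel's config; None if not exposed."""
    candidates = [
        Path("/proc/config.gz"),
        Path("/boot") / f"config-{platform.release()}",
    ]
    for path in candidates:
        try:
            data = path.read_bytes()
            if path.name.endswith(".gz"):
                data = gzip.decompress(data)
        except (OSError, EOFError, zlib.error):
            continue
        return data.decode("utf-8", errors="replace")
    return None


if __name__ == "__main__":
    text = load_kernel_config()
    if text is None:
        print("kernel config not available")
    else:
        print("missing options:", missing_options(text))
```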
      
      If these requirements are not met, there are no limits on how much memory
      bpf can use, and a non-root user is able to hurt the system by allocating
      too much. But because per-user rlimits do not provide a functional system
      to protect and manage physical resources anyway, anyone who seriously
      depends on such protection should use cgroups.
      
      When a bpf map is created, the memory cgroup of the process which creates
      the map is recorded. Subsequently, all memory allocations related to the
      bpf map are charged to the same cgroup. This includes allocations made
      from interrupts and by any other processes. Bpf program memory is charged
      to the memory cgroup of the process which loads the program.
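The charging rule described above can be sketched as a toy Python model. This is not the kernel's implementation: `Cgroup` and `BpfMap` are hypothetical classes, the hierarchical limit check is simplified, and `MemoryError` stands in for an allocation that fails because a limit would be exceeded.

```python
class Cgroup:
    """Toy memory cgroup with a hierarchical limit."""

    def __init__(self, name, limit=None, parent=None):
        self.name = name
        self.limit = limit  # None means no limit at this level
        self.parent = parent
        self.usage = 0

    def self_and_ancestors(self):
        node = self
        while node is not None:
            yield node
            node = node.parent

    def charge(self, nbytes):
        chain = list(self.self_and_ancestors())
        # Check every level first so a failed charge changes nothing.
        for node in chain:
            if node.limit is not None and node.usage + nbytes > node.limit:
                raise MemoryError(f"limit of cgroup {node.name!r} exceeded")
        for node in chain:
            node.usage += nbytes

    def uncharge(self, nbytes):
        for node in self.self_and_ancestors():
            node.usage -= nbytes


class BpfMap:
    """Toy map that remembers the cgroup of its creator."""

    def __init__(self, creator_cgroup):
        self.memcg = creator_cgroup  # recorded once, at creation time
        self.entries = []

    def alloc_entry(self, nbytes, current_cgroup=None):
        # current_cgroup is deliberately ignored: whoever triggers the
        # allocation, the charge goes to the saved cgroup.
        self.memcg.charge(nbytes)
        self.entries.append(nbytes)

    def free_all(self):
        for nbytes in self.entries:
            self.memcg.uncharge(nbytes)
        self.entries.clear()


root = Cgroup("root")
workload = Cgroup("workload", limit=8192, parent=root)
other = Cgroup("other", parent=root)

bpf_map = BpfMap(workload)
bpf_map.alloc_entry(4096)
bpf_map.alloc_entry(2048, current_cgroup=other)  # still charged to "workload"
print(workload.usage, other.usage, root.usage)
```

Here the second allocation is triggered on behalf of a different cgroup, yet only the creator's cgroup and its ancestors see the usage.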
      
      The patchset consists of the following parts:
      1) 4 patches on the mm side, without which vmallocs cannot be mapped
         to userspace
      2) memcg-based accounting for various bpf objects: progs and maps
      3) removal of the rlimit-based accounting
      4) removal of rlimit adjustments in userspace samples
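For context on part 4: with rlimit-based accounting, userspace code commonly raised the memlock rlimit before creating maps or loading programs. The samples are written in C; the sketch below shows the same pattern in Python using the standard `resource` module. The helper name `bump_memlock_rlimit` is hypothetical, and the call is expected to fail for unprivileged users.

```python
import resource


def bump_memlock_rlimit():
    """Try to lift RLIMIT_MEMLOCK to unlimited; return True on success."""
    unlimited = (resource.RLIM_INFINITY, resource.RLIM_INFINITY)
    try:
        resource.setrlimit(resource.RLIMIT_MEMLOCK, unlimited)
    except (ValueError, OSError):
        # Raising the hard limit requires privilege, so this can fail.
        return False
    return True


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print("memlock rlimit before:", soft, hard)
    print("raised to unlimited:", bump_memlock_rlimit())
```

With memcg-based accounting this adjustment is no longer needed for bpf, which is why it is removed from the samples.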
      
      v9:
        - always charge the saved memory cgroup, by Daniel, Toke and Alexei
        - added bpf_map_kzalloc()
        - rebase and minor fixes
      
      v8:
        - extended the cover letter to be more clear on new requirements, by Daniel
        - an approximate value is provided by map memlock info, by Alexei
      
      v7:
        - introduced bpf_map_kmalloc_node() and bpf_map_alloc_percpu(), by Alexei
        - switched allocations made from an interrupt context to new helpers,
          by Daniel
        - rebase and minor fixes
      
      v6:
        - rebased to the latest version of the remote charging API
        - fixed signatures, added acks
      
      v5:
        - rebased to the latest version of the remote charging API
        - implemented kmem accounting from an interrupt context, by Shakeel
        - rebased to the latest changes in mm that allow mapping vmallocs to userspace
        - fixed a build issue in kselftests, by Alexei
        - fixed a use-after-free bug in bpf_map_free_deferred()
        - added bpf line info coverage, by Shakeel
        - split bpf map charging preparations into a separate patch
      
      v4:
        - covered allocations made from an interrupt context, by Daniel
        - added some clarifications to the cover letter
      
      v3:
        - dropped the userspace part for further discussions/refinements,
          by Andrii and Song
      
      v2:
        - fixed a build issue caused by the remaining rlimit-based accounting
          for sockhash maps
      ====================
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>