From: Yosry Ahmed
Date: Wed, 9 Mar 2022 12:27:15 -0800
Subject: [RFC bpf-next] Hierarchical Cgroup Stats Collection Using BPF
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Johannes Weiner
Cc: Hao Luo, Shakeel Butt, Stanislav Fomichev, David Rientjes, bpf@vger.kernel.org, KP Singh, cgroups@vger.kernel.org, linux-mm@kvack.org
Hey everyone,

I would like to discuss an idea to facilitate collection of hierarchical cgroup stats using BPF programs. We want to provide a simple interface for BPF programs to collect hierarchical cgroup stats and integrate with the existing rstat aggregation mechanism in the kernel. The most prominent use case is the ability to extend memcg stats (and histograms) with BPF programs. This also integrates nicely with Hao's work [1] that enables reading those stats through files, similar to cgroupfs. This proposal is mainly concerned with the stats collection path.

The main idea is to introduce a new map type (let's call it a BPF cgroup stats map for now). The map is keyed by cgroup_id (similar to cgroup storage). The value is an array (or a struct, more on this later) whose size and element type the user chooses, and which holds the stats. The main properties of the map are as follows:
1. Map entry creation and deletion is handled automatically by the kernel.
2. Internally, each map entry contains per-cpu arrays, a total array, and a pending array.
3. BPF programs and user space see the entry as a single array: updates are transparently made to the per-cpu arrays, and lookups invoke stats flushing.

The main difference between this and cgroup storage is that it naturally integrates with rstat hierarchical aggregation (more on that later). The reasons we do not want to do the aggregation in BPF programs or in user space are:
1. Each program would loop through the cgroup's descendants to do its own stats aggregation: lots of repeated work.
2. We would loop through all the descendants, even those that have no updates.

These problems are already addressed by the rstat aggregation mechanism in the kernel, which is primarily used for memcg stats. We want to provide a way for BPF programs to make use of this as well.
The lifetime of map entries can be handled as follows:
- When the map is created, it takes an initial cgroup_id as a parameter, maybe through the map_extra field of union bpf_attr. The map is created with entries for the initial cgroup and all of its descendants.
- The update and delete interfaces are disabled. The kernel creates entries for new cgroups and removes entries for destroyed cgroups (we can use cgroup_bpf_inherit() and cgroup_bpf_release()).
- When all the entries in the map are deleted (i.e. the initial cgroup is destroyed), the map is destroyed.

The map usage by BPF programs and the integration with rstat can be as follows:
- Internally, each map entry has per-cpu arrays, a total array, and a pending array. BPF programs and user space only see one array.
- The update interface is disabled. BPF programs use helpers to modify elements. Internally, the modifications are made to the per-cpu arrays and invoke a call to cgroup_bpf_updated() or an equivalent.
- Lookups (from BPF programs or user space) invoke an rstat flush and read from the total array.
- cgroup_rstat_flush_locked() flushes BPF stats as well.

Flushing of BPF stats can work as follows:
- For every cgroup, we will either use flags to distinguish BPF stats updates from normal stats updates, or just flush both (memcg stats are periodically flushed anyway).
- We will need to link cgroups to the maps that have entries for them. One possible implementation is to store the map entries in struct cgroup_bpf in a htable indexed by map fd. The update helpers will also use this to avoid lookups.
- For each updated cgroup, we go through all of its maps, accumulate the per-cpu arrays into the total array, then propagate the total to the parent's pending array (the same mechanism as memcg stats flushing).
There is room for extensions or generalizations here:
- Provide flags to enable/disable the per-cpu arrays (for stats that are not updated frequently), and to enable/disable hierarchical aggregation (non-hierarchical stats can still benefit from the automatic entry creation and deletion).
- Provide different hierarchical aggregation operations: SUM, MAX, MIN, etc.
- Instead of an array as the map value, use a struct, and let the user provide an aggregator function in the form of a BPF program.

I am happy to hear your thoughts about the idea in general and any comments or concerns.

[1] https://lwn.net/Articles/886292/