From: Kumar Kartikeya Dwivedi
Date: Tue, 29 Apr 2025 03:56:54 +0200
Subject: Re: [PATCH rfc 00/12] mm: BPF OOM
To: Roman Gushchin
Cc: Matt Bobrowski, linux-kernel@vger.kernel.org, Andrew Morton,
    Alexei Starovoitov, Johannes Weiner, Michal Hocko, Shakeel Butt,
    Suren Baghdasaryan, David Rientjes, Josh Don, Chuyi Zhou,
    cgroups@vger.kernel.org, linux-mm@kvack.org, bpf@vger.kernel.org
References: <20250428033617.3797686-1-roman.gushchin@linux.dev>

On Mon, 28 Apr 2025 at 19:24, Roman Gushchin wrote:
>
> On Mon, Apr 28, 2025 at 10:43:07AM +0000, Matt Bobrowski wrote:
> > On Mon, Apr 28, 2025 at 03:36:05AM +0000, Roman Gushchin wrote:
> > > This patchset adds an ability to customize the out of memory
> > > handling using bpf.
> > >
> > > It focuses on two parts:
> > > 1) OOM handling policy,
> > > 2) PSI-based OOM invocation.
> > >
> > > The idea to use bpf for customizing the OOM handling is not new, but
> > > unlike the previous proposal [1], which augmented the existing task
> > > ranking-based policy, this one tries to be as generic as possible and
> > > leverage the full power of the modern bpf.
> > >
> > > It provides a generic hook which is called before the existing OOM
> > > killer code and allows implementing any policy, e.g. picking a victim
> > > task or memory cgroup or potentially even releasing memory in other
> > > ways, e.g. deleting tmpfs files (the last one might require some
> > > additional but relatively simple changes).
> > >
> > > The past attempt to implement memory-cgroup aware policy [2] showed
> > > that there are multiple opinions on what the best policy is. As it's
> > > highly workload-dependent and specific to a concrete way of organizing
> > > workloads, the structure of the cgroup tree etc, a customizable
> > > bpf-based implementation is preferable over an in-kernel implementation
> > > with a dozen of sysctls.
> > >
> > > The second part is related to the fundamental question on when to
> > > declare the OOM event. It's a trade-off between the risk of
> > > unnecessary OOM kills and associated work losses and the risk of
> > > infinite thrashing and effective soft lockups. In the last few years
> > > several PSI-based userspace solutions were developed (e.g. OOMd [3] or
> > > systemd-OOMd [4]). The common idea was to use userspace daemons to
> > > implement custom OOM logic as well as rely on PSI monitoring to avoid
> > > stalls. In this scenario the userspace daemon was supposed to handle
> > > the majority of OOMs, while the in-kernel OOM killer worked as the
> > > last resort measure to guarantee that the system would never deadlock
> > > on the memory. But this approach creates additional infrastructure
> > > churn: userspace OOM daemon is a separate entity which needs to be
> > > deployed, updated, monitored. A completely different pipeline needs to
> > > be built to monitor both types of OOM events and collect associated
> > > logs. A userspace daemon is more restricted in terms of what data is
> > > available to it. Implementing a daemon which can work reliably under a
> > > heavy memory pressure in the system is also tricky.
> > >
> > > [1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/
> > > [2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/
> > > [3]: https://github.com/facebookincubator/oomd
> > > [4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
> > >
> > > ----
> > >
> > > This is an RFC version, which is not intended to be merged in the current form.
> > > Open questions/TODOs:
> > > 1) Program type/attachment type for the bpf_handle_out_of_memory() hook.
> > >    It has to be able to return a value, to be sleepable (to use cgroup iterators)
> > >    and to have trusted arguments to pass oom_control down to bpf_oom_kill_process().
> > >    Current patchset has a workaround (patch "bpf: treat fmodret tracing program's
> > >    arguments as trusted"), which is not safe. One option is to fake acquire/release
> > >    semantics for the oom_control pointer. Other option is to introduce a completely
> > >    new attachment or program type, similar to lsm hooks.
> >
> > Thinking out loud now, but rather than introducing and having a single
> > BPF-specific function/interface, and BPF program for that matter,
> > which can effectively be used to short-circuit steps from within
> > out_of_memory(), why not introduce a
> > tcp_congestion_ops/sched_ext_ops-like interface which essentially
> > provides a multifaceted interface for controlling OOM killing
> > (->select_bad_process, ->oom_kill_process, etc), optionally also from
> > the context of a BPF program (BPF_PROG_TYPE_STRUCT_OPS)?
>
> It's certainly an option and I thought about it. I don't think we need a bunch
> of hooks though. This patchset adds 2 and they belong to completely different
> subsystems (mm and sched/psi), so Idk how well they can be gathered
> into a single struct ops. But maybe it's fine.
>
> The only potentially new hook I can envision now is one to customize
> the oom reporting.
>

If you're considering scoping it down to a particular cgroup (as you
allude to in the TODO), or building a hierarchical interface, using
struct_ops will be much better than fmod_ret etc., which is global in
nature. That holds even if you don't support it now: I don't think
struct_ops is warranted only when you have more than a few callbacks.
As an illustration, sched_ext started out without supporting
hierarchical attachment, but will piggy-back on the struct_ops
interface to do so in the near future.

> Thanks for the suggestion!
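
To make the scoping argument concrete, here is a rough userspace mock in C.
It is not kernel code and not the interface proposed in this series. The
names oom_ops, cgroup_node, oom_ops_attach(), oom_ops_lookup() and
handle_oom() are all hypothetical. The sketch only assumes that an ops
table can be attached to one cgroup and is inherited by its descendants,
which a single global fmod_ret-style hook cannot express.

```c
/* Userspace mock, NOT kernel code. Every name below is hypothetical. */
#include <assert.h>
#include <stddef.h>

struct cgroup_node;

/* Hypothetical ops table, loosely modelled on the struct_ops pattern. */
struct oom_ops {
	const char *name;
	/* Return nonzero if the OOM was handled for this cgroup. */
	int (*handle_out_of_memory)(struct cgroup_node *cg);
};

struct cgroup_node {
	const char *path;
	struct cgroup_node *parent;  /* NULL for the root */
	const struct oom_ops *ops;   /* NULL if no policy is attached here */
};

/* Attach a policy to one cgroup; descendants pick it up via lookup. */
static void oom_ops_attach(struct cgroup_node *cg, const struct oom_ops *ops)
{
	assert(cg != NULL);
	cg->ops = ops;
}

/* Walk up from the OOMing cgroup to the nearest attached policy. */
static const struct oom_ops *oom_ops_lookup(const struct cgroup_node *cg)
{
	for (; cg; cg = cg->parent)
		if (cg->ops)
			return cg->ops;
	return NULL;
}

/* Returns 0 when no policy handled the OOM (default killer would run). */
static int handle_oom(struct cgroup_node *cg)
{
	const struct oom_ops *ops = oom_ops_lookup(cg);

	return ops ? ops->handle_out_of_memory(cg) : 0;
}

/* Trivial demo policy that always claims to have handled the OOM. */
static int demo_handle(struct cgroup_node *cg)
{
	(void)cg;
	return 1;
}

static const struct oom_ops demo_policy = {
	.name = "demo",
	.handle_out_of_memory = demo_handle,
};
```

With a tree / -> /a -> /a/b, attaching demo_policy to /a makes
handle_oom() use it for /a and /a/b, while an OOM at the root still falls
through to the default path. A global hook would instead have to encode
that cgroup check inside the program itself.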