From: Roman Gushchin
To: linux-kernel@vger.kernel.org
Cc: Andrew Morton, Alexei Starovoitov, Johannes Weiner, Michal Hocko,
    Shakeel Butt, Suren Baghdasaryan, David Rientjes, Josh Don,
    Chuyi Zhou, cgroups@vger.kernel.org, linux-mm@kvack.org,
    bpf@vger.kernel.org, Roman Gushchin
Subject: [PATCH rfc 00/12] mm: BPF OOM
Date: Mon, 28 Apr 2025 03:36:05 +0000
Message-ID: <20250428033617.3797686-1-roman.gushchin@linux.dev>

This patchset adds the ability to customize out-of-memory
handling using bpf.

It focuses on two parts:
1) OOM handling policy,
2) PSI-based OOM invocation.

The idea to use bpf for customizing the OOM handling is not new, but
unlike the previous proposal [1], which augmented the existing task
ranking-based policy, this one tries to be as generic as possible and
leverage the full power of modern bpf.

It provides a generic hook which is called before the existing OOM
killer code and allows implementing any policy, e.g. picking a victim
task or memory cgroup or potentially even releasing memory in other
ways, e.g. deleting tmpfs files (the last one might require some
additional but relatively simple changes).

The past attempt to implement a memory-cgroup aware policy [2] showed
that there are multiple opinions on what the best policy is. As it's
highly workload-dependent and specific to a concrete way of organizing
workloads, the structure of the cgroup tree etc, a customizable
bpf-based implementation is preferable over an in-kernel implementation
with a dozen of sysctls.

The second part is related to the fundamental question of when to
declare the OOM event. It's a trade-off between the risk of
unnecessary OOM kills and associated work losses and the risk of
infinite thrashing and effective soft lockups. In the last few years
several PSI-based userspace solutions were developed (e.g. OOMd [3] or
systemd-OOMd [4]). The common idea was to use userspace daemons to
implement custom OOM logic as well as rely on PSI monitoring to avoid
stalls. In this scenario the userspace daemon was supposed to handle
the majority of OOMs, while the in-kernel OOM killer worked as the
last resort measure to guarantee that the system would never deadlock
on memory. But this approach creates additional infrastructure
churn: a userspace OOM daemon is a separate entity which needs to be
deployed, updated, monitored. A completely different pipeline needs to
be built to monitor both types of OOM events and collect associated
logs. A userspace daemon is more restricted in terms of what data is
available to it.
Implementing a daemon which can work reliably under heavy memory
pressure is also tricky.

[1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/
[2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/
[3]: https://github.com/facebookincubator/oomd
[4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html

----

This is an RFC version, which is not intended to be merged in the
current form. Open questions/TODOs:

1) Program type/attachment type for the bpf_handle_out_of_memory() hook.
   It has to be able to return a value, to be sleepable (to use cgroup
   iterators) and to have trusted arguments to pass oom_control down to
   bpf_oom_kill_process(). The current patchset has a workaround (patch
   "bpf: treat fmodret tracing program's arguments as trusted"), which
   is not safe. One option is to fake acquire/release semantics for the
   oom_control pointer. Another option is to introduce a completely new
   attachment or program type, similar to lsm hooks.

2) Currently lockdep complains about a potential circular dependency
   because the sleepable bpf_handle_out_of_memory() hook calls
   might_fault() under oom_lock. One way to fix it is to make it
   non-sleepable, but then it will require some additional work to
   allow it to use cgroup iterators. It's intertwined with 1).

3) What kind of hierarchical features are required? Do we want to nest
   oom policies? Do we want to attach oom policies to cgroups? I think
   it's too complicated, but if we want full hierarchical support, it
   might be required. Patch "mm: introduce bpf_get_root_mem_cgroup()
   bpf kfunc" exposes the true root memcg, which is potentially outside
   of the ns of the loading process. Does it require some additional
   capabilities checks? Should it be removed?

4) Documentation is lacking and will be added in the next version.

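To make the PSI-based trigger discussed above more concrete, here is a
minimal userspace sketch of the kind of check a PSI-driven OOM policy
might perform. It is not code from this patchset, which implements the
trigger in the kernel through a bpf hook. The line format is the one
exposed by /proc/pressure/memory; struct psi_line, psi_parse_line(),
psi_should_declare_oom() and the threshold passed to it are hypothetical
names and values chosen for illustration only.

```c
#include <stdio.h>
#include <string.h>

/*
 * One line of /proc/pressure/memory, for example:
 *   full avg10=1.23 avg60=0.50 avg300=0.10 total=12345
 * (struct and field names are illustrative, not from the patchset)
 */
struct psi_line {
	char kind[8];			/* "some" or "full" */
	double avg10, avg60, avg300;	/* stall percentages */
	unsigned long long total;	/* total stall time, microseconds */
};

/* Returns 0 on success, -1 if the line does not match the PSI format. */
static int psi_parse_line(const char *line, struct psi_line *out)
{
	int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		       out->kind, &out->avg10, &out->avg60, &out->avg300,
		       &out->total);

	return n == 5 ? 0 : -1;
}

/*
 * Hypothetical policy: declare an OOM when the "full" 10-second average
 * reaches the given threshold (in percent). Real policies would likely
 * look at more than one signal.
 */
static int psi_should_declare_oom(const struct psi_line *p, double threshold)
{
	return strcmp(p->kind, "full") == 0 && p->avg10 >= threshold;
}
```

A caller would read /proc/pressure/memory (or a cgroup's memory.pressure
file) line by line, pass each line to psi_parse_line() and act when
psi_should_declare_oom() returns non-zero. Choosing the threshold is
exactly the trade-off between unnecessary kills and prolonged thrashing
described above.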
Roman Gushchin (12):
  mm: introduce a bpf hook for OOM handling
  bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
  bpf: treat fmodret tracing program's arguments as trusted
  mm: introduce bpf_oom_kill_process() bpf kfunc
  mm: introduce bpf kfuncs to deal with memcg pointers
  mm: introduce bpf_get_root_mem_cgroup() bpf kfunc
  bpf: selftests: introduce read_cgroup_file() helper
  bpf: selftests: bpf OOM handler test
  sched: psi: bpf hook to handle psi events
  mm: introduce bpf_out_of_memory() bpf kfunc
  bpf: selftests: introduce open_cgroup_file() helper
  bpf: selftests: psi handler test

 include/linux/memcontrol.h                   |   2 +
 include/linux/oom.h                          |   5 +
 kernel/bpf/btf.c                             |   9 +-
 kernel/bpf/verifier.c                        |   5 +
 kernel/sched/psi.c                           |  36 ++-
 mm/Makefile                                  |   3 +
 mm/bpf_memcontrol.c                          | 108 +++++++++
 mm/oom_kill.c                                | 140 +++++++++++
 tools/testing/selftests/bpf/cgroup_helpers.c |  67 ++++++
 tools/testing/selftests/bpf/cgroup_helpers.h |   3 +
 tools/testing/selftests/bpf/prog_tests/oom.c | 227 ++++++++++++++++++
 tools/testing/selftests/bpf/prog_tests/psi.c | 234 +++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_oom.c | 103 ++++++++
 tools/testing/selftests/bpf/progs/test_psi.c |  43 ++++
 14 files changed, 983 insertions(+), 2 deletions(-)
 create mode 100644 mm/bpf_memcontrol.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/oom.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/psi.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_oom.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_psi.c

--
2.49.0.901.g37484f566f-goog