From: Suren Baghdasaryan
Date: Tue, 29 Apr 2025 15:44:14 -0700
Subject: Re: [PATCH rfc 00/12] mm: BPF OOM
To: Roman Gushchin
Cc: linux-kernel@vger.kernel.org, Andrew Morton, Alexei Starovoitov,
    Johannes Weiner, Michal Hocko, Shakeel Butt, David Rientjes, Josh Don,
    Chuyi Zhou, cgroups@vger.kernel.org, linux-mm@kvack.org,
    bpf@vger.kernel.org
In-Reply-To: <20250428033617.3797686-1-roman.gushchin@linux.dev>

On Sun, Apr 27, 2025 at 8:36 PM Roman Gushchin wrote:
>
> This patchset adds an ability to customize the out of memory
> handling using bpf.
>
> It focuses on two parts:
> 1) OOM handling policy,
> 2) PSI-based OOM invocation.
>
> The idea to use bpf for customizing the OOM handling is not new, but
> unlike the previous proposal [1], which augmented the existing task
> ranking-based policy, this one tries to be as generic as possible and
> leverage the full power of the modern bpf.
>
> It provides a generic hook which is called before the existing OOM
> killer code and allows implementing any policy, e.g. picking a victim
> task or memory cgroup or potentially even releasing memory in other
> ways, e.g. deleting tmpfs files (the last one might require some
> additional but relatively simple changes).
>
> The past attempt to implement memory-cgroup aware policy [2] showed
> that there are multiple opinions on what the best policy is. As it's
> highly workload-dependent and specific to a concrete way of organizing
> workloads, the structure of the cgroup tree etc., a customizable
> bpf-based implementation is preferable over an in-kernel implementation
> with a dozen sysctls.
>
> The second part is related to the fundamental question on when to
> declare the OOM event. It's a trade-off between the risk of
> unnecessary OOM kills and associated work losses and the risk of
> infinite thrashing and effective soft lockups. In the last few years
> several PSI-based userspace solutions were developed (e.g. OOMd [3] or
> systemd-OOMd [4]). The common idea was to use userspace daemons to
> implement custom OOM logic as well as rely on PSI monitoring to avoid
> stalls. In this scenario the userspace daemon was supposed to handle
> the majority of OOMs, while the in-kernel OOM killer worked as the
> last resort measure to guarantee that the system would never deadlock
> on the memory. But this approach creates additional infrastructure
> churn: userspace OOM daemon is a separate entity which needs to be
> deployed, updated, monitored. A completely different pipeline needs to
> be built to monitor both types of OOM events and collect associated
> logs.
> A userspace daemon is more restricted in terms of what data is
> available to it. Implementing a daemon which can work reliably under
> heavy memory pressure in the system is also tricky.

I haven't read the whole patchset yet, but I want to mention a couple of
features that we should not forget:
- memory reaping. Maybe you already call oom_reap_task_mm() after the BPF
  oom-handler kills a process, or maybe the BPF handler is expected to
  implement it?
- kill reporting to userspace. I think the BPF handler would be expected
  to implement it?

>
> [1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/
> [2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/
> [3]: https://github.com/facebookincubator/oomd
> [4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
>
> ----
>
> This is an RFC version, which is not intended to be merged in the current form.
> Open questions/TODOs:
>  1) Program type/attachment type for the bpf_handle_out_of_memory() hook.
>     It has to be able to return a value, to be sleepable (to use cgroup iterators)
>     and to have trusted arguments to pass oom_control down to bpf_oom_kill_process().
>     Current patchset has a workaround (patch "bpf: treat fmodret tracing program's
>     arguments as trusted"), which is not safe. One option is to fake acquire/release
>     semantics for the oom_control pointer. Other option is to introduce a completely
>     new attachment or program type, similar to lsm hooks.
>  2) Currently lockdep complains about a potential circular dependency because
>     sleepable bpf_handle_out_of_memory() hook calls might_fault() under oom_lock.
>     One way to fix it is to make it non-sleepable, but then it will require some
>     additional work to allow it using cgroup iterators. It's intertwined with 1).
>  3) What kind of hierarchical features are required? Do we want to nest oom policies?
>     Do we want to attach oom policies to cgroups?
>     I think it's too complicated,
>     but if we want a full hierarchical support, it might be required.
>     Patch "mm: introduce bpf_get_root_mem_cgroup() bpf kfunc" exposes the true root
>     memcg, which is potentially outside of the ns of the loading process. Does
>     it require some additional capabilities checks? Should it be removed?
>  4) Documentation is lacking and will be added in the next version.
>
>
> Roman Gushchin (12):
>   mm: introduce a bpf hook for OOM handling
>   bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
>   bpf: treat fmodret tracing program's arguments as trusted
>   mm: introduce bpf_oom_kill_process() bpf kfunc
>   mm: introduce bpf kfuncs to deal with memcg pointers
>   mm: introduce bpf_get_root_mem_cgroup() bpf kfunc
>   bpf: selftests: introduce read_cgroup_file() helper
>   bpf: selftests: bpf OOM handler test
>   sched: psi: bpf hook to handle psi events
>   mm: introduce bpf_out_of_memory() bpf kfunc
>   bpf: selftests: introduce open_cgroup_file() helper
>   bpf: selftests: psi handler test
>
>  include/linux/memcontrol.h                   |   2 +
>  include/linux/oom.h                          |   5 +
>  kernel/bpf/btf.c                             |   9 +-
>  kernel/bpf/verifier.c                        |   5 +
>  kernel/sched/psi.c                           |  36 ++-
>  mm/Makefile                                  |   3 +
>  mm/bpf_memcontrol.c                          | 108 +++++++++
>  mm/oom_kill.c                                | 140 +++++++++++
>  tools/testing/selftests/bpf/cgroup_helpers.c |  67 ++++++
>  tools/testing/selftests/bpf/cgroup_helpers.h |   3 +
>  tools/testing/selftests/bpf/prog_tests/oom.c | 227 ++++++++++++++++++
>  tools/testing/selftests/bpf/prog_tests/psi.c | 234 +++++++++++++++++++
>  tools/testing/selftests/bpf/progs/test_oom.c | 103 ++++++++
>  tools/testing/selftests/bpf/progs/test_psi.c |  43 ++++
>  14 files changed, 983 insertions(+), 2 deletions(-)
>  create mode 100644 mm/bpf_memcontrol.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/oom.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/psi.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_oom.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_psi.c
>
> --
> 2.49.0.901.g37484f566f-goog
>
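
For reference, the PSI monitoring that the quoted cover letter attributes
to userspace daemons such as oomd can be sketched roughly as below. This is
only an illustration and not code from the patchset: struct psi_line,
psi_parse_line() and psi_over_threshold() are hypothetical names, the line
layout is the one used by /proc/pressure/memory, and the "full avg10 above
a threshold" rule is an arbitrary example policy.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * One line of /proc/pressure/memory, for example:
 *   "full avg10=0.12 avg60=0.05 avg300=0.01 total=12345"
 * (hypothetical struct, for illustration only)
 */
struct psi_line {
	char kind[8];			/* "some" or "full" */
	double avg10, avg60, avg300;	/* stall percentages */
	unsigned long long total;	/* cumulative stall time, in us */
};

/* Parse a single PSI line. Returns 0 on success, -1 on malformed input. */
static int psi_parse_line(const char *s, struct psi_line *out)
{
	int n = sscanf(s, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		       out->kind, &out->avg10, &out->avg60, &out->avg300,
		       &out->total);

	return n == 5 ? 0 : -1;
}

/*
 * Example policy (an assumption, not taken from the patchset): report
 * pressure when the "full" 10-second average exceeds the given percentage.
 */
static int psi_over_threshold(const struct psi_line *l, double full_avg10_pct)
{
	assert(l != NULL);
	return strcmp(l->kind, "full") == 0 && l->avg10 > full_avg10_pct;
}
```

A daemon would read the file periodically, feed each line to
psi_parse_line() and act when psi_over_threshold() returns non-zero. The
patchset moves this kind of decision into a bpf program instead.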