Date: Mon, 28 Apr 2025 10:43:07 +0000
From: Matt Bobrowski
To: Roman Gushchin
Cc: linux-kernel@vger.kernel.org, Andrew Morton, Alexei Starovoitov,
 Johannes Weiner, Michal Hocko, Shakeel Butt, Suren Baghdasaryan,
 David Rientjes, Josh Don, Chuyi Zhou, cgroups@vger.kernel.org,
 linux-mm@kvack.org, bpf@vger.kernel.org
Subject: Re: [PATCH rfc 00/12] mm: BPF OOM
References: <20250428033617.3797686-1-roman.gushchin@linux.dev>
In-Reply-To: <20250428033617.3797686-1-roman.gushchin@linux.dev>

On Mon, Apr 28, 2025 at 03:36:05AM +0000, Roman Gushchin wrote:
> This patchset adds an ability to customize the out of memory
> handling using bpf.
>
> It focuses on two parts:
> 1) OOM handling policy,
> 2) PSI-based OOM invocation.
>
> The idea to use bpf for customizing the OOM handling is not new, but
> unlike the previous proposal [1], which augmented the existing task
> ranking-based policy, this one tries to be as generic as possible and
> leverage the full power of the modern bpf.
>
> It provides a generic hook which is called before the existing OOM
> killer code and allows implementing any policy, e.g. picking a victim
> task or memory cgroup or potentially even releasing memory in other
> ways, e.g. deleting tmpfs files (the last one might require some
> additional but relatively simple changes).
>
> The past attempt to implement memory-cgroup aware policy [2] showed
> that there are multiple opinions on what the best policy is.
> As it's highly workload-dependent and specific to a concrete way of
> organizing workloads, the structure of the cgroup tree etc, a
> customizable bpf-based implementation is preferable over an in-kernel
> implementation with a dozen sysctls.
>
> The second part is related to the fundamental question of when to
> declare the OOM event. It's a trade-off between the risk of
> unnecessary OOM kills and associated work losses and the risk of
> infinite thrashing and effective soft lockups. In the last few years
> several PSI-based userspace solutions were developed (e.g. OOMd [3] or
> systemd-OOMd [4]). The common idea was to use userspace daemons to
> implement custom OOM logic as well as rely on PSI monitoring to avoid
> stalls. In this scenario the userspace daemon was supposed to handle
> the majority of OOMs, while the in-kernel OOM killer worked as the
> last resort measure to guarantee that the system would never deadlock
> on the memory. But this approach creates additional infrastructure
> churn: userspace OOM daemon is a separate entity which needs to be
> deployed, updated, monitored. A completely different pipeline needs to
> be built to monitor both types of OOM events and collect associated
> logs. A userspace daemon is more restricted in terms of what data is
> available to it. Implementing a daemon which can work reliably under
> heavy memory pressure in the system is also tricky.
>
> [1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/
> [2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/
> [3]: https://github.com/facebookincubator/oomd
> [4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
>
> ----
>
> This is an RFC version, which is not intended to be merged in the current form.
> Open questions/TODOs:
> 1) Program type/attachment type for the bpf_handle_out_of_memory() hook.
>    It has to be able to return a value, to be sleepable (to use cgroup iterators)
>    and to have trusted arguments to pass oom_control down to bpf_oom_kill_process().
>    Current patchset has a workaround (patch "bpf: treat fmodret tracing program's
>    arguments as trusted"), which is not safe. One option is to fake acquire/release
>    semantics for the oom_control pointer. Other option is to introduce a completely
>    new attachment or program type, similar to lsm hooks.

Thinking out loud here: rather than introducing a single BPF-specific
function/interface (and a single BPF program, for that matter) that can
effectively be used to short-circuit steps from within out_of_memory(),
why not introduce a tcp_congestion_ops/sched_ext_ops-like interface? It
would provide a set of operations for controlling OOM killing
(->select_bad_process, ->oom_kill_process, etc), which could optionally
also be implemented from the context of a BPF program
(BPF_PROG_TYPE_STRUCT_OPS).

I don't know whether that's what you meant by introducing a new
attachment or program type, but an approach like this is what
immediately comes to mind when wanting to provide more than a single
implementation for a set of operations within the Linux kernel,
particularly from the context of a BPF program.

> 2) Currently lockdep complains about a potential circular dependency because
>    sleepable bpf_handle_out_of_memory() hook calls might_fault() under oom_lock.
>    One way to fix it is to make it non-sleepable, but then it will require some
>    additional work to allow it using cgroup iterators. It's intertwined with 1).
> 3) What kind of hierarchical features are required? Do we want to nest oom policies?
>    Do we want to attach oom policies to cgroups? I think it's too complicated,
>    but if we want a full hierarchical support, it might be required.
>    Patch "mm: introduce bpf_get_root_mem_cgroup() bpf kfunc" exposes the true root
>    memcg, which is potentially outside of the ns of the loading process.
>    Does it require some additional capabilities checks? Should it be removed?
> 4) Documentation is lacking and will be added in the next version.
>
>
> Roman Gushchin (12):
>   mm: introduce a bpf hook for OOM handling
>   bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
>   bpf: treat fmodret tracing program's arguments as trusted
>   mm: introduce bpf_oom_kill_process() bpf kfunc
>   mm: introduce bpf kfuncs to deal with memcg pointers
>   mm: introduce bpf_get_root_mem_cgroup() bpf kfunc
>   bpf: selftests: introduce read_cgroup_file() helper
>   bpf: selftests: bpf OOM handler test
>   sched: psi: bpf hook to handle psi events
>   mm: introduce bpf_out_of_memory() bpf kfunc
>   bpf: selftests: introduce open_cgroup_file() helper
>   bpf: selftests: psi handler test
>
>  include/linux/memcontrol.h                   |   2 +
>  include/linux/oom.h                          |   5 +
>  kernel/bpf/btf.c                             |   9 +-
>  kernel/bpf/verifier.c                        |   5 +
>  kernel/sched/psi.c                           |  36 ++-
>  mm/Makefile                                  |   3 +
>  mm/bpf_memcontrol.c                          | 108 +++++++++
>  mm/oom_kill.c                                | 140 +++++++++++
>  tools/testing/selftests/bpf/cgroup_helpers.c |  67 ++++++
>  tools/testing/selftests/bpf/cgroup_helpers.h |   3 +
>  tools/testing/selftests/bpf/prog_tests/oom.c | 227 ++++++++++++++++++
>  tools/testing/selftests/bpf/prog_tests/psi.c | 234 +++++++++++++++++++
>  tools/testing/selftests/bpf/progs/test_oom.c | 103 ++++++++
>  tools/testing/selftests/bpf/progs/test_psi.c |  43 ++++
>  14 files changed, 983 insertions(+), 2 deletions(-)
>  create mode 100644 mm/bpf_memcontrol.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/oom.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/psi.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_oom.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_psi.c
>
> --
> 2.49.0.901.g37484f566f-goog
>
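
To make the ops-table idea from the reply to open question 1) a bit more
concrete, here is a rough userspace model in plain C. It is only a sketch
of the dispatch pattern and not kernel code: apart from the
->select_bad_process and ->oom_kill_process callback names mentioned
in that reply, every identifier here (struct oom_policy_ops,
oom_register_policy(), handle_oom(), struct task, struct oom_ctx) is
hypothetical, and a real version would presumably be registered via
struct_ops rather than through a global pointer.

```c
/* Userspace model only, NOT kernel code. All names are hypothetical
 * except the two callback names, which come from the discussion above. */
#include <stddef.h>

struct task {
	int pid;
	unsigned long rss_pages;	/* stand-in for an OOM badness score */
	int killed;
};

struct oom_ctx {
	struct task *tasks;
	size_t nr_tasks;
};

/* Ops table in the spirit of tcp_congestion_ops / sched_ext_ops. */
struct oom_policy_ops {
	const char *name;
	struct task *(*select_bad_process)(struct oom_ctx *ctx);
	int (*oom_kill_process)(struct oom_ctx *ctx, struct task *victim);
};

/* Default policy: pick the live task with the largest footprint. */
static struct task *default_select(struct oom_ctx *ctx)
{
	struct task *worst = NULL;

	for (size_t i = 0; i < ctx->nr_tasks; i++) {
		struct task *t = &ctx->tasks[i];

		if (t->killed)
			continue;
		if (!worst || t->rss_pages > worst->rss_pages)
			worst = t;
	}
	return worst;
}

/* Default kill: just mark the victim as gone in this model. */
static int default_kill(struct oom_ctx *ctx, struct task *victim)
{
	(void)ctx;
	victim->killed = 1;
	return 0;
}

static const struct oom_policy_ops default_ops = {
	.name = "default",
	.select_bad_process = default_select,
	.oom_kill_process = default_kill,
};

static const struct oom_policy_ops *active_ops = &default_ops;

/* Install a custom policy; passing NULL restores the default one. */
static void oom_register_policy(const struct oom_policy_ops *ops)
{
	active_ops = ops ? ops : &default_ops;
}

/* Model of the OOM call site: each step is dispatched through the
 * active ops table, falling back to the default for unset callbacks.
 * Returns the pid of the killed task, or -1 if nothing was killed. */
static int handle_oom(struct oom_ctx *ctx)
{
	const struct oom_policy_ops *ops = active_ops;
	struct task *(*select)(struct oom_ctx *) =
		ops->select_bad_process ? ops->select_bad_process : default_select;
	int (*kill)(struct oom_ctx *, struct task *) =
		ops->oom_kill_process ? ops->oom_kill_process : default_kill;
	struct task *victim = select(ctx);

	if (!victim || kill(ctx, victim) != 0)
		return -1;
	return victim->pid;
}

/* Example custom policy: kill the smallest live task first. */
static struct task *smallest_select(struct oom_ctx *ctx)
{
	struct task *best = NULL;

	for (size_t i = 0; i < ctx->nr_tasks; i++) {
		struct task *t = &ctx->tasks[i];

		if (!t->killed && (!best || t->rss_pages < best->rss_pages))
			best = t;
	}
	return best;
}

/* Only ->select_bad_process is overridden; the kill step is inherited. */
static const struct oom_policy_ops smallest_first_ops = {
	.name = "smallest_first",
	.select_bad_process = smallest_select,
};
```

The point of the model is that a policy can override only the step it
cares about (victim selection here) and inherit the rest, which is the
property that makes the tcp_congestion_ops/sched_ext_ops style attractive
compared to a single all-or-nothing hook.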