From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 29 Apr 2025 13:42:11 +0200
From: Michal Hocko <mhocko@suse.com>
To: Roman Gushchin
Cc: linux-kernel@vger.kernel.org, Andrew Morton, Alexei Starovoitov,
 Johannes Weiner, Shakeel Butt, Suren Baghdasaryan, David Rientjes,
 Josh Don, Chuyi Zhou, cgroups@vger.kernel.org, linux-mm@kvack.org,
 bpf@vger.kernel.org
Subject: Re: [PATCH rfc 00/12] mm: BPF OOM
References: <20250428033617.3797686-1-roman.gushchin@linux.dev>
In-Reply-To: <20250428033617.3797686-1-roman.gushchin@linux.dev>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
On Mon 28-04-25 03:36:05, Roman Gushchin wrote:
> This patchset adds an ability to customize the out of memory
> handling using bpf.
>
> It focuses on two parts:
> 1) OOM handling policy,
> 2) PSI-based OOM invocation.
>
> The idea to use bpf for customizing the OOM handling is not new, but
> unlike the previous proposal [1], which augmented the existing task
> ranking-based policy, this one tries to be as generic as possible and
> leverage the full power of the modern bpf.
>
> It provides a generic hook which is called before the existing OOM
> killer code and allows implementing any policy, e.g. picking a victim
> task or memory cgroup, or potentially even releasing memory in other
> ways, e.g. deleting tmpfs files (the last one might require some
> additional but relatively simple changes).

Makes sense to me. I still have a slight concern though. We have 3
different OOM handlers smashed into a single one with special casing
involved. This is manageable (although not great) for the in-kernel
code, but I am wondering whether we should do better for BPF-based OOM
implementations. Would it make sense to have different callbacks for
the cpuset, memcg and global OOM killer handlers?
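To illustrate the suggestion: a toy userspace model in Python (purely illustrative, none of these names are the patchset's API) of what per-context dispatch could look like. In the kernel the three cases are special-cased inside a single OOM path (e.g. a non-NULL memcg in oom_control for memcg OOMs); here they are split into separate optional callbacks with an in-kernel fallback:

```python
# Toy model of per-context BPF OOM callbacks -- illustrative only,
# all struct/field names are simplified stand-ins, not kernel code.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class OomControl:
    memcg: Optional[str] = None       # non-None for a memcg OOM
    cpuset_constrained: bool = False  # cpuset-constrained allocation

@dataclass
class BpfOomOps:
    handle_memcg_oom: Optional[Callable[[OomControl], str]] = None
    handle_cpuset_oom: Optional[Callable[[OomControl], str]] = None
    handle_global_oom: Optional[Callable[[OomControl], str]] = None

def bpf_oom_dispatch(ops: BpfOomOps, oc: OomControl) -> str:
    """Try the context-specific handler; fall back to the in-kernel
    policy when no handler is attached for that context."""
    if oc.memcg is not None and ops.handle_memcg_oom:
        return ops.handle_memcg_oom(oc)
    if oc.cpuset_constrained and ops.handle_cpuset_oom:
        return ops.handle_cpuset_oom(oc)
    if ops.handle_global_oom:
        return ops.handle_global_oom(oc)
    return "in-kernel fallback"

# A policy that only cares about memcg OOMs; everything else falls back.
ops = BpfOomOps(handle_memcg_oom=lambda oc: f"memcg handler for {oc.memcg}")
print(bpf_oom_dispatch(ops, OomControl(memcg="/workload")))
print(bpf_oom_dispatch(ops, OomControl()))
```

The point of splitting the callbacks is that a memcg-only policy never has to re-derive which of the three contexts it was invoked in.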
I can see you have already added some helper functions to deal with
memcgs, but I do not see anything to iterate processes or find a
process to kill etc. Is that functionality generally available (sorry,
I am not really familiar with BPF all that much, so please bear with
me)? I like the way you have naturally hooked into existing OOM
primitives like oom_kill_process, but I do not see tsk_is_oom_victim
exposed. Are you waiting for a first user that needs to implement OOM
victim synchronization, or do you plan to integrate that into task
iterators? I am mostly asking because it is exactly these kinds of
details that make the current in-kernel OOM handler quite complex, and
it would be great if custom ones did not have to reproduce that
complexity and could focus only on the high-level policy.

> The second part is related to the fundamental question of when to
> declare the OOM event. It's a trade-off between the risk of
> unnecessary OOM kills and associated work losses and the risk of
> infinite thrashing and effective soft lockups. In the last few years
> several PSI-based userspace solutions were developed (e.g. OOMd [3] or
> systemd-OOMd [4]). The common idea was to use userspace daemons to
> implement custom OOM logic as well as rely on PSI monitoring to avoid
> stalls.

This makes sense to me as well. I have to admit I am not fully familiar
with the PSI integration into the sched code, but from what I can see
the evaluation is done on a regular basis from a worker context kicked
off from the scheduler code. There shouldn't be any locking
constraints, which is good. Is there any risk if the OOM handler took
too long, though?

Also an important question: I can see selftests which are using the
infrastructure, but have you tried to implement a real OOM handler with
this proposed infrastructure?
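For context, the kind of victim-selection loop a custom handler would have to get right: iterate candidate tasks, skip the selection entirely when a kill is already in flight (the condition tsk_is_oom_victim() reports in the kernel), and only then pick a victim. A toy Python model, illustrative only and not the patchset's helpers:

```python
# Toy model of OOM victim selection -- illustrative only.
# Real handlers would iterate tasks via BPF iterators and use
# kernel-provided badness/victim state, not this simplified scoring.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    pid: int
    rss_pages: int          # simplified "badness": resident memory
    oom_victim: bool = False  # already selected and being reaped

def pick_victim(tasks) -> Optional[Task]:
    best = None
    for t in tasks:
        if t.oom_victim:
            # A previous OOM kill has not finished yet; a correct
            # handler should wait for the reaper rather than stack
            # further kills on top of it.
            return None
        if best is None or t.rss_pages > best.rss_pages:
            best = t
    return best

tasks = [Task(1, 100), Task(2, 5000), Task(3, 800)]
print(pick_victim(tasks).pid)
```

This is exactly the synchronization detail the paragraph above is about: without something like tsk_is_oom_victim exposed, every custom policy has to reinvent the "is a kill already pending" check.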
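On the PSI side, this is roughly what the userspace daemons mentioned above (oomd, systemd-oomd) do today: read /proc/pressure/memory and act on a sustained stall average. A toy Python sketch; the line format is the real PSI one, but the threshold policy here is made up for illustration:

```python
# Toy PSI-based OOM trigger -- the /proc/pressure/memory format is
# real, the threshold policy is an illustrative stand-in for what
# oomd-style daemons configure.
def parse_psi_memory(text: str) -> dict:
    """Parse PSI file contents into {'some': {...}, 'full': {...}}."""
    out = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        out[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return out

def should_trigger_oom(psi: dict, avg10_threshold: float = 40.0) -> bool:
    # 'full' means all non-idle tasks were stalled on memory at once;
    # a sustained high short-term average is a common kill signal.
    return psi["full"]["avg10"] >= avg10_threshold

sample = (
    "some avg10=62.04 avg60=21.03 avg300=5.01 total=1234567\n"
    "full avg10=45.50 avg60=10.20 avg300=2.00 total=654321\n"
)
psi = parse_psi_memory(sample)
print(should_trigger_oom(psi))
```

An in-kernel PSI-invoked BPF handler would replace the polling daemon, which is why the "what if the handler takes too long" question above matters: the worker that evaluates PSI would be the one blocked on it.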
> [1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/
> [2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/
> [3]: https://github.com/facebookincubator/oomd
> [4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
>
> ----
>
> This is an RFC version, which is not intended to be merged in the current form.
> Open questions/TODOs:
> 1) Program type/attachment type for the bpf_handle_out_of_memory() hook.
> It has to be able to return a value, to be sleepable (to use cgroup iterators)
> and to have trusted arguments to pass oom_control down to bpf_oom_kill_process().
> The current patchset has a workaround (patch "bpf: treat fmodret tracing program's
> arguments as trusted"), which is not safe. One option is to fake acquire/release
> semantics for the oom_control pointer. The other option is to introduce a completely
> new attachment or program type, similar to lsm hooks.
> 2) Currently lockdep complains about a potential circular dependency because
> the sleepable bpf_handle_out_of_memory() hook calls might_fault() under oom_lock.
> One way to fix it is to make it non-sleepable, but then it will require some
> additional work to allow it to use cgroup iterators. This is intertwined with 1).

I cannot see this in the code. Could you be more specific please? Where
is this might_fault coming from? Is this a BPF constraint?

> 3) What kind of hierarchical features are required? Do we want to nest oom policies?
> Do we want to attach oom policies to cgroups? I think it's too complicated,
> but if we want full hierarchical support, it might be required.
> The patch "mm: introduce bpf_get_root_mem_cgroup() bpf kfunc" exposes the true root
> memcg, which is potentially outside of the ns of the loading process. Does
> it require some additional capability checks? Should it be removed?

Yes, let's start simple and see where we get from there.

> 4) Documentation is lacking and will be added in the next version.
+1

Thanks!
-- 
Michal Hocko
SUSE Labs