Date: Tue, 29 Apr 2025 22:17:41 +0000
From: Roman Gushchin <roman.gushchin@linux.dev>
To: Suren Baghdasaryan
Cc: Michal Hocko, linux-kernel@vger.kernel.org, Andrew Morton,
	Alexei Starovoitov, Johannes Weiner, Shakeel Butt, David Rientjes,
	Josh Don, Chuyi Zhou, cgroups@vger.kernel.org, linux-mm@kvack.org,
	bpf@vger.kernel.org
Subject: Re: [PATCH rfc 00/12] mm: BPF OOM
References: <20250428033617.3797686-1-roman.gushchin@linux.dev>
	<87selrrpqz.fsf@linux.dev>

On Tue, Apr 29, 2025 at 02:56:31PM -0700, Suren Baghdasaryan wrote:
> On Tue, Apr 29, 2025 at 7:45 AM Roman Gushchin wrote:
> >
> > Michal Hocko writes:
> >
> > > On Mon 28-04-25 03:36:05, Roman Gushchin wrote:
> > >> This patchset adds an ability to customize the out of memory
> > >> handling using bpf.
> > >>
> > >> It focuses on two parts:
> > >> 1) OOM handling policy,
> > >> 2) PSI-based OOM invocation.
> > >>
> > >> The idea to use bpf for customizing the OOM handling is not new, but
> > >> unlike the previous proposal [1], which augmented the existing task
> > >> ranking-based policy, this one tries to be as generic as possible and
> > >> leverage the full power of the modern bpf.
> > >>
> > >> It provides a generic hook which is called before the existing OOM
> > >> killer code and allows implementing any policy, e.g. picking a victim
> > >> task or memory cgroup or potentially even releasing memory in other
> > >> ways, e.g. deleting tmpfs files (the last one might require some
> > >> additional but relatively simple changes).
> > >
> > > Makes sense to me. I still have a slight concern though. We have 3
> > > different oom handlers smashed into a single one with special casing
> > > involved.
> > > This is manageable (although not great) for the in-kernel
> > > code but I am wondering whether we should do better for BPF-based OOM
> > > implementations. Would it make sense to have different callbacks for
> > > cpuset, memcg and global oom killer handlers?
> >
> > Yes, it's certainly possible. If we go the struct_ops path, we can even
> > have both the common hook which handles all types of OOMs and separate
> > hooks for each type. The user then can choose what's more convenient.
>
> Good point.
>
> > > I can see you have already added some helper functions to deal with
> > > memcgs but I do not see anything to iterate processes or find a process to
> > > kill etc. Is that functionality generally available (sorry I am not
> > > really familiar with BPF all that much so please bear with me)?
> >
> > Yes, the task iterator is available since v6.7:
> > https://docs.ebpf.io/linux/kfuncs/bpf_iter_task_new/
> >
> > > I like the way you naturally hooked into existing OOM primitives
> > > like oom_kill_process but I do not see tsk_is_oom_victim exposed. Are
> > > you waiting for a first user that needs to implement oom victim
> > > synchronization or do you plan to integrate that into task iterators?
> >
> > It can be implemented in bpf directly, but I agree that it probably
> > deserves at least an example in the test or a separate in-kernel helper.
>
> An in-kernel helper is probably a better idea.
>
> > > I am mostly asking because it is exactly these kinds of details that
> > > make the current in-kernel oom handler quite complex and it would be
> > > great if custom ones do not have to reproduce that complexity and only
> > > focus on the high-level policy.
> >
> > Totally agree.
> >
> > >> The second part is related to the fundamental question of when to
> > >> declare the OOM event. It's a trade-off between the risk of
> > >> unnecessary OOM kills and associated work losses and the risk of
> > >> infinite thrashing and effective soft lockups. In the last few years
> > >> several PSI-based userspace solutions were developed (e.g. OOMd [3] or
> > >> systemd-OOMd [4]). The common idea was to use userspace daemons to
> > >> implement custom OOM logic as well as rely on PSI monitoring to avoid
> > >> stalls.
> > >
> > > This makes sense to me as well. I have to admit I am not fully familiar
> > > with PSI integration into the sched code, but from what I can see the
> > > evaluation is done on a regular basis from the worker context kicked off
> > > from the scheduler code. There shouldn't be any locking constraints, which
> > > is good. Is there any risk if the oom handler takes too long, though?
> >
> > It's a good question. In theory yes, it can affect the timing of other
> > PSI events. An option here is to move it into a separate work item, however
> > I'm not sure it's worth the added complexity. I actually tried this
> > approach in an earlier version of this patchset, but the problem was
> > that the code for scheduling this work would have to be dynamically turned
> > on/off when a bpf program is attached/detached, otherwise it's an
> > obvious cpu overhead.
> > It's doable, but I don't know if it's justified.
>
> > > Also an important question. I can see selftests which are using the
> > > infrastructure. But have you tried to implement a real OOM handler with
> > > this proposed infrastructure?
> >
> > Not yet. Given the size and complexity of the infrastructure of my
> > current employer, it's not a short process. But we're working on it.
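
To make the discussion above a bit more concrete, here is a rough sketch of
what the skeleton of such a handler could look like, built on the open-coded
task iterator kfuncs linked earlier in the thread. The struct_ops section
name, the oom_control argument and the return convention below are purely
illustrative guesses, not the interface proposed by this series; only the
iterator and RCU kfunc declarations correspond to APIs that already exist.

	/* SPDX-License-Identifier: GPL-2.0 */
	/* Sketch only: the attach point and calling convention are made up
	 * for illustration; the task iterator and RCU kfuncs are real.
	 */
	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	extern void bpf_rcu_read_lock(void) __ksym;
	extern void bpf_rcu_read_unlock(void) __ksym;
	extern int bpf_iter_task_new(struct bpf_iter_task *it,
				     struct task_struct *task,
				     unsigned int flags) __ksym;
	extern struct task_struct *bpf_iter_task_next(struct bpf_iter_task *it) __ksym;
	extern void bpf_iter_task_destroy(struct bpf_iter_task *it) __ksym;

	SEC("struct_ops/handle_out_of_memory")	/* hypothetical attach point */
	int BPF_PROG(handle_out_of_memory, struct oom_control *oc)
	{
		struct bpf_iter_task it;
		struct task_struct *task;

		bpf_rcu_read_lock();
		bpf_iter_task_new(&it, NULL, BPF_TASK_ITER_ALL_PROCS);
		while ((task = bpf_iter_task_next(&it))) {
			/* Custom policy goes here: rank tasks by memory
			 * usage, cgroup membership, userspace-provided
			 * tags, etc., and remember the chosen victim.
			 */
		}
		bpf_iter_task_destroy(&it);
		bpf_rcu_read_unlock();

		/* In this sketch, 0 means "no decision made, fall back to
		 * the in-kernel OOM killer"; the real return convention is
		 * whatever the patchset defines.
		 */
		return 0;
	}

	char LICENSE[] SEC("license") = "GPL";

The intent is only to show that the selection loop itself stays small: all of
the policy lives inside the loop body, as discussed above.
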
> Hi Roman,
> This might end up being very useful for Android. Since we have a
> shared current employer, we might be able to provide an earlier test
> environment for this concept on Android and speed up development of a
> real OOM handler. I'll be following the development of this patchset
> and will see if we can come up with an early prototype for testing.

Hi Suren,

Sounds great, thank you!
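
P.S. On the tsk_is_oom_victim point above: until a dedicated in-kernel helper
exists, a BPF program could open-code the same check, since in the kernel
tsk_is_oom_victim() just tests tsk->signal->oom_mm. A minimal sketch (the
helper name below is mine, not something this series provides):

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_core_read.h>

	/* Open-coded equivalent of the kernel's tsk_is_oom_victim():
	 * a task is an OOM victim once tsk->signal->oom_mm has been set
	 * by mark_oom_victim(). Illustrative only.
	 */
	static __always_inline bool bpf_task_is_oom_victim(struct task_struct *tsk)
	{
		return BPF_CORE_READ(tsk, signal, oom_mm) != NULL;
	}
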