From mboxrd@z Thu Jan  1 00:00:00 1970
From: Suren Baghdasaryan <surenb@google.com>
Date: Tue, 29 Apr 2025 14:56:31 -0700
Subject: Re: [PATCH rfc 00/12] mm: BPF OOM
To: Roman Gushchin
Cc: Michal Hocko, linux-kernel@vger.kernel.org, Andrew Morton,
 Alexei Starovoitov, Johannes Weiner, Shakeel Butt, David Rientjes,
 Josh Don, Chuyi Zhou, cgroups@vger.kernel.org, linux-mm@kvack.org,
 bpf@vger.kernel.org
In-Reply-To: <87selrrpqz.fsf@linux.dev>
References: <20250428033617.3797686-1-roman.gushchin@linux.dev> <87selrrpqz.fsf@linux.dev>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
On Tue, Apr 29, 2025 at 7:45 AM Roman Gushchin wrote:
>
> Michal Hocko writes:
>
> > On Mon 28-04-25 03:36:05, Roman Gushchin wrote:
> >> This patchset adds an ability to customize the out of memory
> >> handling
> >> using bpf.
> >>
> >> It focuses on two parts:
> >> 1) OOM handling policy,
> >> 2) PSI-based OOM invocation.
> >>
> >> The idea to use bpf for customizing the OOM handling is not new, but
> >> unlike the previous proposal [1], which augmented the existing task
> >> ranking-based policy, this one tries to be as generic as possible and
> >> leverage the full power of modern bpf.
> >>
> >> It provides a generic hook which is called before the existing OOM
> >> killer code and allows implementing any policy, e.g. picking a victim
> >> task or memory cgroup, or potentially even releasing memory in other
> >> ways, e.g. deleting tmpfs files (the last one might require some
> >> additional but relatively simple changes).
> >
> > Makes sense to me. I still have a slight concern though. We have 3
> > different oom handlers smashed into a single one with special casing
> > involved. This is manageable (although not great) for the in-kernel
> > code, but I am wondering whether we should do better for BPF-based
> > OOM implementations. Would it make sense to have different callbacks
> > for the cpuset, memcg and global oom killer handlers?
>
> Yes, it's certainly possible. If we go the struct_ops path, we can even
> have both a common hook which handles all types of OOMs and separate
> hooks for each type. The user can then choose what's more convenient.
> Good point.
>
> > I can see you have already added some helper functions to deal with
> > memcgs, but I do not see anything to iterate processes or find a
> > process to kill etc. Is that functionality generally available?
> > (Sorry, I am not really familiar with BPF all that much, so please
> > bear with me.)
>
> Yes, the task iterator is available since v6.7:
> https://docs.ebpf.io/linux/kfuncs/bpf_iter_task_new/
>
> > I like the way you naturally hooked into existing OOM primitives
> > like oom_kill_process, but I do not see tsk_is_oom_victim exposed.
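For concreteness, a minimal sketch of walking all processes with those open-coded task iterator kfuncs might look like the following. This is illustrative only, not code from the patchset; it assumes a v6.7+ kernel, a generated vmlinux.h, and the kfunc signatures as documented, and the pid-based "metric" is a placeholder for a real victim-selection policy:

```c
/* Illustrative sketch only, not code from the RFC patchset.
 * Assumes a >= v6.7 kernel (open-coded task iterator kfuncs)
 * and a generated vmlinux.h. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

int bpf_iter_task_new(struct bpf_iter_task *it,
                      struct task_struct *task,
                      unsigned int flags) __ksym;
struct task_struct *bpf_iter_task_next(struct bpf_iter_task *it) __ksym;
void bpf_iter_task_destroy(struct bpf_iter_task *it) __ksym;

/* Walk all processes and pick the one with the largest pid, standing
 * in for whatever victim-selection metric a real policy would use. */
static int pick_victim_pid(void)
{
	struct bpf_iter_task it;
	struct task_struct *t;
	int victim_pid = -1;

	bpf_iter_task_new(&it, NULL, BPF_TASK_ITER_ALL_PROCS);
	while ((t = bpf_iter_task_next(&it)))
		if (t->pid > victim_pid)
			victim_pid = t->pid;
	bpf_iter_task_destroy(&it);

	return victim_pid;
}

char LICENSE[] SEC("license") = "GPL";
```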
> > Are you waiting for a first user that needs to implement oom victim
> > synchronization, or do you plan to integrate that into the task
> > iterators?
>
> It can be implemented in bpf directly, but I agree that it probably
> deserves at least an example in the tests or a separate in-kernel
> helper. An in-kernel helper is probably a better idea.
>
> > I am mostly asking because it is exactly these kinds of details that
> > make the current in-kernel oom handler quite complex, and it would be
> > great if custom ones did not have to reproduce that complexity and
> > could focus only on the high-level policy.
>
> Totally agree.
>
> >> The second part is related to the fundamental question of when to
> >> declare the OOM event. It's a trade-off between the risk of
> >> unnecessary OOM kills with their associated loss of work, and the
> >> risk of infinite thrashing and effective soft lockups. In the last
> >> few years several PSI-based userspace solutions were developed
> >> (e.g. oomd [3] or systemd-oomd [4]). The common idea was to use
> >> userspace daemons to implement custom OOM logic as well as rely on
> >> PSI monitoring to avoid stalls.
> >
> > This makes sense to me as well. I have to admit I am not fully
> > familiar with the PSI integration into the sched code, but from what
> > I can see the evaluation is done on a regular basis from a worker
> > context kicked off from the scheduler code. There shouldn't be any
> > locking constraints, which is good. Is there any risk if the oom
> > handler took too long though?
>
> It's a good question. In theory yes, it can affect the timing of other
> PSI events. An option here is to move it into a separate work item,
> however I'm not sure it's worth the added complexity. I actually tried
> this approach in an earlier version of this patchset, but the problem
> was that the code for scheduling this work had to be dynamically turned
> on/off when a bpf program is attached/detached, otherwise it's an
> obvious cpu overhead.
> It's doable, but I don't know if it's justified.
>
> > Also an important question: I can see selftests which are using the
> > infrastructure, but have you tried to implement a real OOM handler
> > with this proposed infrastructure?
>
> Not yet. Given the size and complexity of the infrastructure of my
> current employer, it's not a short process. But we're working on it.

Hi Roman,

This might end up being very useful for Android. Since we have a shared
current employer, we might be able to provide an earlier test
environment for this concept on Android and speed up the development of
a real OOM handler. I'll be following the development of this patchset
and will see if we can come up with an early prototype for testing.

> >> [1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/
> >> [2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/
> >> [3]: https://github.com/facebookincubator/oomd
> >> [4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
> >>
> >> ----
> >>
> >> This is an RFC version, which is not intended to be merged in the
> >> current form. Open questions/TODOs:
> >> 1) Program type/attachment type for the bpf_handle_out_of_memory()
> >>    hook. It has to be able to return a value, to be sleepable (to
> >>    use cgroup iterators) and to have trusted arguments to pass
> >>    oom_control down to bpf_oom_kill_process(). The current patchset
> >>    has a workaround (patch "bpf: treat fmodret tracing program's
> >>    arguments as trusted"), which is not safe. One option is to fake
> >>    acquire/release semantics for the oom_control pointer. Another
> >>    option is to introduce a completely new attachment or program
> >>    type, similar to lsm hooks.
> >> 2) Currently lockdep complains about a potential circular
> >>    dependency, because the sleepable bpf_handle_out_of_memory()
> >>    hook calls might_fault() under oom_lock.
> >>    One way to fix it is to make it non-sleepable, but then it will
> >>    require some additional work to allow it to use cgroup
> >>    iterators. It's intertwined with 1).
> >
> > I cannot see this in the code. Could you be more specific please?
> > Where is this might_fault coming from? Is this a BPF constraint?
>
> It's in __bpf_prog_enter_sleepable(). But I hope I can make this hook
> non-sleepable (by going the struct_ops path) and the problem will go
> away.
>
> >> 3) What kind of hierarchical features are required? Do we want to
> >>    nest oom policies? Do we want to attach oom policies to cgroups?
> >>    I think it's too complicated, but if we want full hierarchical
> >>    support, it might be required. The patch "mm: introduce
> >>    bpf_get_root_mem_cgroup() bpf kfunc" exposes the true root
> >>    memcg, which is potentially outside of the ns of the loading
> >>    process. Does it require some additional capability checks?
> >>    Should it be removed?
> >
> > Yes, let's start simple and see where we get from there.
>
> Agree.
>
> Thank you for taking a look and your comments/ideas!
>
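To make the open questions above concrete, here is a purely hypothetical sketch of a policy attached to the bpf_handle_out_of_memory() hook, assembled only from the names mentioned in this thread (bpf_handle_out_of_memory, bpf_oom_kill_process, and a victim-selection helper). The attachment type, the exact kfunc signatures, and the hypothetical pick_victim() helper are precisely what the RFC leaves unsettled, so treat this as pseudocode in C syntax rather than working code:

```c
/* Hypothetical sketch, not code from the RFC. The fmod_ret attachment
 * mirrors the RFC's current (admittedly unsafe) workaround; struct_ops
 * is discussed in the thread as a likely replacement. */
SEC("fmod_ret/bpf_handle_out_of_memory")
int BPF_PROG(my_oom_policy, struct oom_control *oc)
{
	/* pick_victim() stands in for a policy walking tasks or memcgs,
	 * e.g. via the task iterator kfuncs mentioned above. */
	struct task_struct *victim = pick_victim(oc);

	if (!victim)
		return 0;	/* fall back to the in-kernel OOM killer */

	/* Delegate the actual kill to the kernel primitive the RFC
	 * exposes; its real signature may differ. */
	bpf_oom_kill_process(oc, victim, "bpf oom policy");
	return 1;		/* OOM handled by the bpf policy */
}
```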