From: Suren Baghdasaryan
Date: Tue, 29 Apr 2025 16:01:52 -0700
Subject: Re: [PATCH rfc 00/12] mm: BPF OOM
To: Roman Gushchin
Cc: Michal Hocko, linux-kernel@vger.kernel.org, Andrew Morton, Alexei Starovoitov, Johannes Weiner, Shakeel Butt, David Rientjes, Josh Don, Chuyi Zhou, cgroups@vger.kernel.org, linux-mm@kvack.org, bpf@vger.kernel.org
In-Reply-To: <87selrrpqz.fsf@linux.dev>
References: <20250428033617.3797686-1-roman.gushchin@linux.dev> <87selrrpqz.fsf@linux.dev>
On Tue, Apr 29, 2025 at 7:45 AM Roman Gushchin wrote:
>
> Michal Hocko writes:
>
> > On Mon 28-04-25 03:36:05, Roman Gushchin wrote:
> >> This patchset adds an ability to customize the out of memory
> >> handling
> >> using bpf.
> >>
> >> It focuses on two parts:
> >> 1) OOM handling policy,
> >> 2) PSI-based OOM invocation.
> >>
> >> The idea to use bpf for customizing the OOM handling is not new, but
> >> unlike the previous proposal [1], which augmented the existing task
> >> ranking-based policy, this one tries to be as generic as possible and
> >> leverage the full power of modern bpf.
> >>
> >> It provides a generic hook which is called before the existing OOM
> >> killer code and allows implementing any policy, e.g. picking a victim
> >> task or memory cgroup, or potentially even releasing memory in other
> >> ways, e.g. deleting tmpfs files (the last one might require some
> >> additional but relatively simple changes).
> >
> > Makes sense to me. I still have a slight concern though. We have 3
> > different oom handlers smashed into a single one with special casing
> > involved. This is manageable (although not great) for the in-kernel
> > code, but I am wondering whether we should do better for BPF-based OOM
> > implementations. Would it make sense to have different callbacks for
> > the cpuset, memcg and global oom killer handlers?
>
> Yes, it's certainly possible. If we go the struct_ops path, we can even
> have both a common hook which handles all types of OOMs and separate
> hooks for each type. The user can then choose what's more convenient.
> Good point.
>
> > I can see you have already added some helper functions to deal with
> > memcgs, but I do not see anything to iterate processes or find a
> > process to kill etc. Is that functionality generally available (sorry,
> > I am not really familiar with BPF all that much, so please bear with me)?
>
> Yes, the task iterator has been available since v6.7:
> https://docs.ebpf.io/linux/kfuncs/bpf_iter_task_new/
>
> > I like the way you naturally hooked into existing OOM primitives
> > like oom_kill_process, but I do not see tsk_is_oom_victim exposed.
Are > > you waiting for a first user that needs to implement oom victim > > synchronization or do you plan to integrate that into tasks iterators? > > It can be implemented in bpf directly, but I agree that it probably > deserves at least an example in the test or a separate in-kernel helper. > In-kernel helper is probably a better idea. > > > I am mostly asking because it is exactly these kind of details that > > make the current in kernel oom handler quite complex and it would be > > great if custom ones do not have to reproduce that complexity and only > > focus on the high level policy. > > Totally agree. > > > > >> The second part is related to the fundamental question on when to > >> declare the OOM event. It's a trade-off between the risk of > >> unnecessary OOM kills and associated work losses and the risk of > >> infinite trashing and effective soft lockups. In the last few years > >> several PSI-based userspace solutions were developed (e.g. OOMd [3] or > >> systemd-OOMd [4]). The common idea was to use userspace daemons to > >> implement custom OOM logic as well as rely on PSI monitoring to avoid > >> stalls. > > > > This makes sense to me as well. I have to admit I am not fully familiar > > with PSI integration into sched code but from what I can see the > > evaluation is done on regular bases from the worker context kicked off > > from the scheduler code. There shouldn't be any locking constrains whic= h > > is good. Is there any risk if the oom handler took too long though? > > It's a good question. In theory yes, it can affect the timing of other > PSI events. An option here is to move it into a separate work, however > I'm not sure if it worth the added complexity. I actually tried this > approach in an earlier version of this patchset, but the problem was > that the code for scheduling this work should be dynamically turned > on/off when a bpf program is attached/detached, otherwise it's an > obvious cpu overhead. 
> It's doable, but I don't know if it's justified.

I think this is a legitimate concern. bpf_handle_psi_event() can block
update_triggers() and delay other PSI triggers.

> > Also an important question. I can see selftests which are using the
> > infrastructure. But have you tried to implement a real OOM handler with
> > this proposed infrastructure?
>
> Not yet. Given the size and complexity of the infrastructure of my
> current employer, it's not a short process. But we're working on it.
>
> >> [1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/
> >> [2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/
> >> [3]: https://github.com/facebookincubator/oomd
> >> [4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
> >>
> >> ----
> >>
> >> This is an RFC version, which is not intended to be merged in the current form.
> >> Open questions/TODOs:
> >> 1) Program type/attachment type for the bpf_handle_out_of_memory() hook.
> >>    It has to be able to return a value, to be sleepable (to use cgroup iterators)
> >>    and to have trusted arguments to pass oom_control down to bpf_oom_kill_process().
> >>    The current patchset has a workaround (patch "bpf: treat fmodret tracing program's
> >>    arguments as trusted"), which is not safe. One option is to fake acquire/release
> >>    semantics for the oom_control pointer. Another option is to introduce a completely
> >>    new attachment or program type, similar to lsm hooks.
> >> 2) Currently lockdep complains about a potential circular dependency because
> >>    the sleepable bpf_handle_out_of_memory() hook calls might_fault() under oom_lock.
> >>    One way to fix it is to make it non-sleepable, but then it will require some
> >>    additional work to allow it to use cgroup iterators. It's intertwined with 1).
> >
> > I cannot see this in the code. Could you be more specific, please? Where
> > is this might_fault coming from? Is this a BPF constraint?
> It's in __bpf_prog_enter_sleepable(). But I hope I can make this hook
> non-sleepable (by going the struct_ops path) and the problem will go away.
>
> >> 3) What kind of hierarchical features are required? Do we want to nest oom policies?
> >>    Do we want to attach oom policies to cgroups? I think it's too complicated,
> >>    but if we want full hierarchical support, it might be required.
> >>    The patch "mm: introduce bpf_get_root_mem_cgroup() bpf kfunc" exposes the true root
> >>    memcg, which is potentially outside of the ns of the loading process. Does
> >>    it require some additional capability checks? Should it be removed?
> >
> > Yes, let's start simple and see where we get from there.
>
> Agree.
>
> Thank you for taking a look and your comments/ideas!
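FWIW, for readers following the thread: a policy of the kind discussed
above could look roughly like this on the BPF side. This is only a sketch
under assumptions — bpf_handle_out_of_memory() and bpf_oom_kill_process()
are names from this RFC, not a merged kernel API; the kill-helper signature
is guessed from the cover letter; and the program/attachment type is exactly
the open question in item 1, so don't expect this to pass the verifier as-is.
The task iterator is the real v6.7 kfunc linked earlier in the thread.

```c
/*
 * Sketch only: a "largest total_vm wins" OOM policy on top of the
 * proposed hook. Hook and helper names come from the RFC cover letter,
 * not a merged API; the bpf_oom_kill_process() signature is assumed.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

/* Assumed signature, modeled on the cover letter's description. */
extern void bpf_oom_kill_process(struct oom_control *oc,
				 struct task_struct *task,
				 const char *msg) __ksym;

SEC("fmod_ret/bpf_handle_out_of_memory")
int BPF_PROG(pick_largest_task, struct oom_control *oc)
{
	struct task_struct *t, *victim = NULL;
	unsigned long victim_vm = 0;
	int handled = 0;

	/* Open-coded task iterator (bpf_iter_task_new), available since v6.7. */
	bpf_rcu_read_lock();
	bpf_for_each(task, t, NULL, BPF_TASK_ITER_ALL_PROCS) {
		struct mm_struct *mm = t->mm;

		if (!mm)	/* skip kernel threads */
			continue;
		if (mm->total_vm > victim_vm) {	/* naive size metric */
			victim_vm = mm->total_vm;
			victim = t;
		}
	}
	if (victim) {
		bpf_oom_kill_process(oc, victim, "bpf: largest total_vm");
		handled = 1;
	}
	bpf_rcu_read_unlock();

	/* Non-zero would tell the kernel the OOM was handled by the policy. */
	return handled;
}
```

A real handler would also need the tsk_is_oom_victim-style synchronization
discussed above, and would likely weight tasks the way oom_badness() does
rather than by raw total_vm.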