From mboxrd@z Thu Jan  1 00:00:00 1970
From: Suren Baghdasaryan <surenb@google.com>
Date: Tue, 29 Apr 2025 14:56:31 -0700
Subject: Re: [PATCH rfc 00/12] mm: BPF OOM
To: Roman Gushchin
Cc: Michal Hocko, linux-kernel@vger.kernel.org, Andrew Morton,
 Alexei Starovoitov, Johannes Weiner, Shakeel Butt, David Rientjes,
 Josh Don, Chuyi Zhou, cgroups@vger.kernel.org, linux-mm@kvack.org,
 bpf@vger.kernel.org
In-Reply-To: <87selrrpqz.fsf@linux.dev>
References: <20250428033617.3797686-1-roman.gushchin@linux.dev> <87selrrpqz.fsf@linux.dev>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
On Tue, Apr 29, 2025 at 7:45 AM Roman Gushchin wrote:
>
> Michal Hocko writes:
>
> > On Mon 28-04-25 03:36:05, Roman Gushchin wrote:
> >> This patchset adds an ability to customize the out of memory
> >> handling
> >> using bpf.
> >>
> >> It focuses on two parts:
> >> 1) OOM handling policy,
> >> 2) PSI-based OOM invocation.
> >>
> >> The idea to use bpf for customizing the OOM handling is not new, but
> >> unlike the previous proposal [1], which augmented the existing task
> >> ranking-based policy, this one tries to be as generic as possible and
> >> leverage the full power of modern bpf.
> >>
> >> It provides a generic hook which is called before the existing OOM
> >> killer code and allows implementing any policy, e.g. picking a victim
> >> task or memory cgroup, or potentially even releasing memory in other
> >> ways, e.g. deleting tmpfs files (the last one might require some
> >> additional but relatively simple changes).
> >
> > Makes sense to me. I still have a slight concern though. We have 3
> > different oom handlers smashed into a single one with special casing
> > involved. This is manageable (although not great) for the in-kernel
> > code, but I am wondering whether we should do better for BPF-based
> > OOM implementations. Would it make sense to have different callbacks
> > for the cpuset, memcg and global oom killer handlers?
>
> Yes, it's certainly possible. If we go the struct_ops path, we can even
> have both a common hook which handles all types of OOMs and separate
> hooks for each type. The user can then choose what's more convenient.
> Good point.
>
> > I can see you have already added some helper functions to deal with
> > memcgs, but I do not see anything to iterate processes or find a
> > process to kill etc. Is that functionality generally available?
> > (Sorry, I am not really familiar with BPF all that much, so please
> > bear with me.)
>
> Yes, the task iterator is available since v6.7:
> https://docs.ebpf.io/linux/kfuncs/bpf_iter_task_new/
>
> > I like the way you naturally hooked into existing OOM primitives
> > like oom_kill_process, but I do not see tsk_is_oom_victim exposed.
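For concreteness, a minimal sketch of walking all processes with those open-coded task iterator kfuncs might look like the following. This is illustrative only, not code from the patchset; it assumes a v6.7+ kernel, a generated vmlinux.h, and the kfunc signatures as documented, and the pid-based "metric" is a placeholder for a real victim-selection policy:

```c
/* Illustrative sketch only, not code from the RFC patchset.
 * Assumes a >= v6.7 kernel (open-coded task iterator kfuncs)
 * and a generated vmlinux.h. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

int bpf_iter_task_new(struct bpf_iter_task *it,
                      struct task_struct *task,
                      unsigned int flags) __ksym;
struct task_struct *bpf_iter_task_next(struct bpf_iter_task *it) __ksym;
void bpf_iter_task_destroy(struct bpf_iter_task *it) __ksym;

/* Walk all processes and pick the one with the largest pid, standing
 * in for whatever victim-selection metric a real policy would use. */
static int pick_victim_pid(void)
{
	struct bpf_iter_task it;
	struct task_struct *t;
	int victim_pid = -1;

	bpf_iter_task_new(&it, NULL, BPF_TASK_ITER_ALL_PROCS);
	while ((t = bpf_iter_task_next(&it)))
		if (t->pid > victim_pid)
			victim_pid = t->pid;
	bpf_iter_task_destroy(&it);

	return victim_pid;
}

char LICENSE[] SEC("license") = "GPL";
```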
> > Are you waiting for a first user that needs to implement oom victim
> > synchronization, or do you plan to integrate that into the task
> > iterators?
>
> It can be implemented in bpf directly, but I agree that it probably
> deserves at least an example in the tests or a separate in-kernel
> helper. An in-kernel helper is probably a better idea.
>
> > I am mostly asking because it is exactly these kinds of details that
> > make the current in-kernel oom handler quite complex, and it would be
> > great if custom ones did not have to reproduce that complexity and
> > could focus only on the high-level policy.
>
> Totally agree.
>
> >> The second part is related to the fundamental question of when to
> >> declare the OOM event. It's a trade-off between the risk of
> >> unnecessary OOM kills with their associated loss of work, and the
> >> risk of infinite thrashing and effective soft lockups. In the last
> >> few years several PSI-based userspace solutions were developed
> >> (e.g. oomd [3] or systemd-oomd [4]). The common idea was to use
> >> userspace daemons to implement custom OOM logic as well as rely on
> >> PSI monitoring to avoid stalls.
> >
> > This makes sense to me as well. I have to admit I am not fully
> > familiar with the PSI integration into the sched code, but from what
> > I can see the evaluation is done on a regular basis from a worker
> > context kicked off from the scheduler code. There shouldn't be any
> > locking constraints, which is good. Is there any risk if the oom
> > handler took too long though?
>
> It's a good question. In theory yes, it can affect the timing of other
> PSI events. An option here is to move it into a separate work item,
> however I'm not sure it's worth the added complexity. I actually tried
> this approach in an earlier version of this patchset, but the problem
> was that the code for scheduling this work had to be dynamically turned
> on/off when a bpf program is attached/detached, otherwise it's an
> obvious cpu overhead.
> It's doable, but I don't know if it's justified.
>
> > Also an important question: I can see selftests which are using the
> > infrastructure, but have you tried to implement a real OOM handler
> > with this proposed infrastructure?
>
> Not yet. Given the size and complexity of the infrastructure of my
> current employer, it's not a short process. But we're working on it.

Hi Roman,

This might end up being very useful for Android. Since we have a shared
current employer, we might be able to provide an earlier test
environment for this concept on Android and speed up the development of
a real OOM handler. I'll be following the development of this patchset
and will see if we can come up with an early prototype for testing.

> >> [1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/
> >> [2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/
> >> [3]: https://github.com/facebookincubator/oomd
> >> [4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
> >>
> >> ----
> >>
> >> This is an RFC version, which is not intended to be merged in the
> >> current form. Open questions/TODOs:
> >> 1) Program type/attachment type for the bpf_handle_out_of_memory()
> >>    hook. It has to be able to return a value, to be sleepable (to
> >>    use cgroup iterators) and to have trusted arguments to pass
> >>    oom_control down to bpf_oom_kill_process(). The current patchset
> >>    has a workaround (patch "bpf: treat fmodret tracing program's
> >>    arguments as trusted"), which is not safe. One option is to fake
> >>    acquire/release semantics for the oom_control pointer. Another
> >>    option is to introduce a completely new attachment or program
> >>    type, similar to lsm hooks.
> >> 2) Currently lockdep complains about a potential circular
> >>    dependency, because the sleepable bpf_handle_out_of_memory()
> >>    hook calls might_fault() under oom_lock.
> >>    One way to fix it is to make it non-sleepable, but then it will
> >>    require some additional work to allow it to use cgroup
> >>    iterators. It's intertwined with 1).
> >
> > I cannot see this in the code. Could you be more specific please?
> > Where is this might_fault coming from? Is this a BPF constraint?
>
> It's in __bpf_prog_enter_sleepable(). But I hope I can make this hook
> non-sleepable (by going the struct_ops path) and the problem will go
> away.
>
> >> 3) What kind of hierarchical features are required? Do we want to
> >>    nest oom policies? Do we want to attach oom policies to cgroups?
> >>    I think it's too complicated, but if we want full hierarchical
> >>    support, it might be required. The patch "mm: introduce
> >>    bpf_get_root_mem_cgroup() bpf kfunc" exposes the true root
> >>    memcg, which is potentially outside of the ns of the loading
> >>    process. Does it require some additional capability checks?
> >>    Should it be removed?
> >
> > Yes, let's start simple and see where we get from there.
>
> Agree.
>
> Thank you for taking a look and your comments/ideas!
>
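To make the open questions above concrete, here is a purely hypothetical sketch of a policy attached to the bpf_handle_out_of_memory() hook, assembled only from the names mentioned in this thread (bpf_handle_out_of_memory, bpf_oom_kill_process, and a victim-selection helper). The attachment type, the exact kfunc signatures, and the hypothetical pick_victim() helper are precisely what the RFC leaves unsettled, so treat this as pseudocode in C syntax rather than working code:

```c
/* Hypothetical sketch, not code from the RFC. The fmod_ret attachment
 * mirrors the RFC's current (admittedly unsafe) workaround; struct_ops
 * is discussed in the thread as a likely replacement. */
SEC("fmod_ret/bpf_handle_out_of_memory")
int BPF_PROG(my_oom_policy, struct oom_control *oc)
{
	/* pick_victim() stands in for a policy walking tasks or memcgs,
	 * e.g. via the task iterator kfuncs mentioned above. */
	struct task_struct *victim = pick_victim(oc);

	if (!victim)
		return 0;	/* fall back to the in-kernel OOM killer */

	/* Delegate the actual kill to the kernel primitive the RFC
	 * exposes; its real signature may differ. */
	bpf_oom_kill_process(oc, victim, "bpf oom policy");
	return 1;		/* OOM handled by the bpf policy */
}
```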