Date: Tue, 22 Apr 2025 11:23:17 +0200
From: Christian Brauner
To: Shakeel Butt
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Yosry Ahmed, Tejun Heo, Michal Koutný, Greg Thelen,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, Meta kernel team
Subject: Re: [PATCH v2] memcg: introduce non-blocking limit setting option
Message-ID: <20250422-daumen-ozonbelastung-93d90ca81dfa@brauner>
References: <20250419183545.1982187-1-shakeel.butt@linux.dev>
In-Reply-To: <20250419183545.1982187-1-shakeel.butt@linux.dev>

On Sat, Apr 19, 2025 at 11:35:45AM -0700,
Shakeel Butt wrote:
> Setting the max and high limits can trigger synchronous reclaim and/or
> oom-kill if the usage is higher than the given limit. This behavior is
> fine for newly created cgroups but it can cause issues for the node
> controller while setting limits for existing cgroups.
>
> In our production multi-tenant and overcommitted environment, we are
> seeing priority inversion when the node controller dynamically adjusts
> the limits of running jobs of different priorities. Based on the system
> situation, the node controller may reduce the limits of lower priority
> jobs and increase the limits of higher priority jobs. However we are
> seeing node controller getting stuck for long period of time while
> reclaiming from lower priority jobs while setting their limits and also
> spends a lot of its own CPU.
>
> One of the workaround we are trying is to fork a new process which sets
> the limit of the lower priority job along with setting an alarm to get
> itself killed if it get stuck in the reclaim for lower priority job.
> However we are finding it very unreliable and costly. Either we need a
> good enough time buffer for the alarm to be delivered after setting
> limit and potentialy spend a lot of CPU in the reclaim or be unreliable
> in setting the limit for much shorter but cheaper (less reclaim) alarms.
>
> Let's introduce new limit setting option which does not trigger
> reclaim and/or oom-kill and let the processes in the target cgroup to
> trigger reclaim and/or throttling and/or oom-kill in their next charge
> request. This will make the node controller on multi-tenant
> overcommitted environment much more reliable.
>
> Signed-off-by: Shakeel Butt
> ---
> Changes since v1:
> - Instead of new interfaces use O_NONBLOCK flag (Greg, Roman & Tejun)
>
>  Documentation/admin-guide/cgroup-v2.rst | 14 ++++++++++++++
>  mm/memcontrol.c                         | 10 ++++++++--
>  2 files changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 8fb14ffab7d1..c14514da4d9a 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1299,6 +1299,13 @@ PAGE_SIZE multiple when read back.
>  	monitors the limited cgroup to alleviate heavy reclaim
>  	pressure.
>
> +	If memory.high is opened with O_NONBLOCK then the synchronous
> +	reclaim is bypassed. This is useful for admin processes that

As written, this isn't restricted to admin processes though, no? So any
unprivileged container can open that file with O_NONBLOCK and avoid
synchronous reclaim?

That might be fine, I have no idea, but it should be pointed out
explicitly. (The alternative is to restrict opening with O_NONBLOCK
through a relevant capability check when the file is opened, or to use a
write-time check.)

> +	need to dynamically adjust the job's memory limits without
> +	expending their own CPU resources on memory reclamation. The
> +	job will trigger the reclaim and/or get throttled on its
> +	next charge request.
> +
>    memory.max
>  	A read-write single value file which exists on non-root
>  	cgroups.  The default is "max".
> @@ -1316,6 +1323,13 @@ PAGE_SIZE multiple when read back.
>  	Caller could retry them differently, return into userspace
>  	as -ENOMEM or silently ignore in cases like disk readahead.
>
> +	If memory.max is opened with O_NONBLOCK, then the synchronous
> +	reclaim and oom-kill are bypassed. This is useful for admin
> +	processes that need to dynamically adjust the job's memory limits
> +	without expending their own CPU resources on memory reclamation.
> +	The job will trigger the reclaim and/or oom-kill on its next
> +	charge request.
> +
>    memory.reclaim
>  	A write-only nested-keyed file which exists for all cgroups.
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5e2ea8b8a898..6f7362a7756a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4252,6 +4252,9 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>
>  	page_counter_set_high(&memcg->memory, high);
>
> +	if (of->file->f_flags & O_NONBLOCK)
> +		goto out;
> +
>  	for (;;) {
>  		unsigned long nr_pages = page_counter_read(&memcg->memory);
>  		unsigned long reclaimed;
> @@ -4274,7 +4277,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>  		if (!reclaimed && !nr_retries--)
>  			break;
>  	}
> -
> +out:
>  	memcg_wb_domain_size_changed(memcg);
>  	return nbytes;
>  }
> @@ -4301,6 +4304,9 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>
>  	xchg(&memcg->memory.max, max);
>
> +	if (of->file->f_flags & O_NONBLOCK)
> +		goto out;
> +
>  	for (;;) {
>  		unsigned long nr_pages = page_counter_read(&memcg->memory);
>
> @@ -4328,7 +4334,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>  			break;
>  		cond_resched();
>  	}
> -
> +out:
>  	memcg_wb_domain_size_changed(memcg);
>  	return nbytes;
>  }
> --
> 2.47.1
>
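
To make the question above concrete, here is a minimal userspace sketch
of how a caller would use the proposed interface. It is an illustration
only: set_limit_nonblock() is a hypothetical helper that is not part of
the patch, and skipping reclaim is what the patch describes, not
something an unpatched kernel does. Note that nothing in it needs more
than write access to the file:

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the patch: write a limit string to a
 * cgroup v2 control file (e.g. memory.high or memory.max) through a file
 * descriptor opened with O_NONBLOCK. With the patch applied, the kernel
 * would update the limit and return without reclaiming in the writer's
 * context. Returns 0 on success or a negative errno value on failure.
 */
static int set_limit_nonblock(const char *path, const char *value)
{
	size_t len = strlen(value);
	ssize_t ret;
	int err = 0;
	int fd;

	/* The only "opt-in" is the O_NONBLOCK open flag. */
	fd = open(path, O_WRONLY | O_NONBLOCK);
	if (fd < 0)
		return -errno;

	ret = write(fd, value, len);
	if (ret < 0)
		err = -errno;
	else if ((size_t)ret != len)
		err = -EIO;	/* short write: treat as an error */

	close(fd);
	return err;
}
```

A node controller would call something like
set_limit_nonblock(path_to_memory_high, "1073741824"), where both the
path and the value are placeholders.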