Date: Thu, 30 Jan 2025 15:19:45 -0500
From: Johannes Weiner <hannes@cmpxchg.org>
To: Waiman Long
Cc: Yosry Ahmed, Tejun Heo, Michal Koutný, Jonathan Corbet,
    Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
    Andrew Morton, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
    linux-mm@kvack.org, linux-doc@vger.kernel.org, Peter Hunt
Subject: Re: [RFC PATCH] mm, memcg: introduce memory.high.throttle
Message-ID: <20250130201945.GA13575@cmpxchg.org>
References: <20250129191204.368199-1-longman@redhat.com>
 <366fd30f-033d-48d6-92b4-ac67c44d0d9b@redhat.com>
 <20250130163904.GB1283@cmpxchg.org>
On Thu, Jan 30, 2025 at 12:07:31PM -0500, Waiman Long wrote:
> On 1/30/25 11:39 AM, Johannes Weiner wrote:
> > On Thu, Jan 30, 2025 at 09:52:29AM -0500, Waiman Long wrote:
> >> On 1/29/25 3:10 PM, Yosry Ahmed wrote:
> >>> On Wed, Jan 29, 2025 at 02:12:04PM -0500, Waiman Long wrote:
> >>>> Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when
> >>>> failing reclaim over memory.high"), the amount of allocator
> >>>> throttling has increased substantially. As a result, it can be
> >>>> difficult for a misbehaving application that consumes an increasing
> >>>> amount of memory to get OOM-killed if memory.high is set. Instead,
> >>>> the application may just crawl along while holding close to the
> >>>> allowed memory.high amount of memory for its memory cgroup for a
> >>>> very long time, especially if it does a lot of memcg charging and
> >>>> uncharging operations.
> >>>>
> >>>> This behavior makes the upstream Kubernetes community hesitant to
> >>>> use memory.high. Instead, they use only memory.max for memory
> >>>> control, similar to what is being done for cgroup v1 [1].
> >>>>
> >>>> To allow better control of the amount of throttling, and hence the
> >>>> speed at which a misbehaving task can be OOM-killed, a new
> >>>> single-value memory.high.throttle control file is added. The
> >>>> allowable range is 0-32. By default, it has a value of 0, which
> >>>> means maximum throttling, as before. Any non-zero positive value
> >>>> represents the corresponding power-of-2 reduction of throttling and
> >>>> makes OOM kills easier to happen.
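[ As a minimal sketch of the power-of-2 semantics described above:
  throttle == 0 keeps the full allocator sleep, and every increment
  halves it. The function name and the 2-second penalty cap at HZ=100
  are assumptions for illustration, not the patch's actual code. ]

  #include <stdio.h>

  /* Sketch only: scale the memory.high allocator penalty down by a
   * power of two, per the proposed 0-32 memory.high.throttle value. */
  static unsigned long throttled_penalty(unsigned long penalty_jiffies,
                                         unsigned int throttle)
  {
      /* throttle can be up to 32; shifting a 32-bit unsigned long
       * by 32 would be undefined, so clamp to zero instead. */
      if (throttle >= sizeof(unsigned long) * 8)
          return 0;
      return penalty_jiffies >> throttle;
  }

  int main(void)
  {
      unsigned long max_penalty = 200;  /* assumed ~2s cap at HZ=100 */
      unsigned int t;

      for (t = 0; t <= 10; t++)
          printf("throttle=%2u  sleep=%3lu jiffies\n",
                 t, throttled_penalty(max_penalty, t));
      return 0;
  }
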
> >>>>
> >>>> System administrators can now use this parameter to determine how
> >>>> easily they want OOM kills to happen for applications that tend to
> >>>> consume a lot of memory, without the need to run a special
> >>>> userspace memory management tool to monitor memory consumption
> >>>> when memory.high is set.
> >>>>
> >>>> Below are the test results of a simple program showing how
> >>>> different values of memory.high.throttle affect its run time (in
> >>>> seconds) until it gets OOM-killed. The test program continuously
> >>>> allocates pages from the kernel. There are some run-to-run
> >>>> variations, and the results are just one possible set of samples.
> >>>>
> >>>>   # systemd-run -p MemoryHigh=10M -p MemoryMax=20M -p MemorySwapMax=10M \
> >>>>       --wait -t timeout 300 /tmp/mmap-oom
> >>>>
> >>>>   memory.high.throttle    service runtime
> >>>>   --------------------    ---------------
> >>>>            0                  120.521
> >>>>            1                  103.376
> >>>>            2                   85.881
> >>>>            3                   69.698
> >>>>            4                   42.668
> >>>>            5                   45.782
> >>>>            6                   22.179
> >>>>            7                    9.909
> >>>>            8                    5.347
> >>>>            9                    3.100
> >>>>           10                    1.757
> >>>>           11                    1.084
> >>>>           12                    0.919
> >>>>           13                    0.650
> >>>>           14                    0.650
> >>>>           15                    0.655
> >>>>
> >>>> [1] https://docs.google.com/document/d/1mY0MTT34P-Eyv5G1t_Pqs4OWyIH-cg9caRKWmqYlSbI/edit?tab=t.0
> >>>>
> >>>> Signed-off-by: Waiman Long
> >>>> ---
> >>>>  Documentation/admin-guide/cgroup-v2.rst | 16 ++++++++--
> >>>>  include/linux/memcontrol.h              |  2 ++
> >>>>  mm/memcontrol.c                         | 41 +++++++++++++++++++++++++
> >>>>  3 files changed, 57 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> >>>> index cb1b4e759b7e..df9410ad8b3b 100644
> >>>> --- a/Documentation/admin-guide/cgroup-v2.rst
> >>>> +++ b/Documentation/admin-guide/cgroup-v2.rst
> >>>> @@ -1291,8 +1291,20 @@ PAGE_SIZE multiple when read back.
> >>>>  	Going over the high limit never invokes the OOM killer and
> >>>>  	under extreme conditions the limit may be breached. The high
> >>>>  	limit should be used in scenarios where an external process
> >>>> -	monitors the limited cgroup to alleviate heavy reclaim
> >>>> -	pressure.
> >>>> +	monitors the limited cgroup to alleviate heavy reclaim pressure
> >>>> +	unless a high enough value is set in "memory.high.throttle".
> >>>> +
> >>>> +  memory.high.throttle
> >>>> +	A read-write single value file which exists on non-root
> >>>> +	cgroups. The default is 0.
> >>>> +
> >>>> +	Memory usage throttle control. This value controls the amount
> >>>> +	of throttling that will be applied when memory consumption
> >>>> +	exceeds the "memory.high" limit. The larger the value is,
> >>>> +	the smaller the amount of throttling will be and the easier an
> >>>> +	offending application may get OOM killed.
> >>> memory.high is supposed to never invoke the OOM killer (see above).
> >>> It's unclear to me if you are referring to OOM kills from the kernel
> >>> or userspace in the commit message. If the latter, I think it
> >>> shouldn't be in kernel docs.
> >> I am sorry for not being clear. What I meant is that if an
> >> application is consuming more memory than can be recovered by memory
> >> reclaim, it will reach memory.max faster, if set, and get OOM-killed.
> >> I will clarify that in the next version.
> > You're not really supposed to use max and high in conjunction. One is
> > for kernel OOM killing, the other for userspace OOM killing. That's
> > also what the documentation that you edited is trying to explain.
> >
> > What's the use case you have in mind?
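[ The /tmp/mmap-oom binary used in the test quoted above is not
  included in the thread. A minimal stand-in, assuming the intent is
  simply to keep faulting in anonymous memory until memory.max
  triggers the kernel OOM killer: ]

  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  int main(void)
  {
      const size_t chunk = 1 << 20;   /* 1 MiB per iteration */

      for (;;) {
          char *p = mmap(NULL, chunk, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (p == MAP_FAILED) {
              perror("mmap");
              return 1;
          }
          memset(p, 0xaa, chunk);     /* fault in every page */
      }
  }
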
> That is new to me that high and max are not supposed to be used
> together. One problem with v1 is that by the time the limit is reached
> and memory reclaim is not able to recover enough memory in time, the
> task will be OOM-killed. I always thought that by setting high to a
> bit below max, say 90%, early memory reclaim would reduce the chance
> of OOM kills. There are certainly others who think like that.

I can't fault you or them for this, because this was the original plan
for these knobs. However, this didn't end up working in practice.

If you have a non-throttling, non-killing limit, then reclaim will
either work and keep the workload to that limit; or it won't work, and
the workload escapes to the hard limit and gets killed. You'll notice
you get the same behavior with just memory.max set by itself - either
reclaim can keep up, or OOM is triggered.

> So the use case here is to reduce the chance of OOM kills without
> letting really misbehaving tasks hold up useful memory for too long.

That brings us to the idea of a medium amount of throttling. The
premise would be that, by throttling *to a certain degree*, you can
slow the workload down just enough to tide over the pressure peak and
avert the OOM kill.

This assumes that some tasks inside the cgroup can independently make
forward progress and release memory while allocating tasks inside the
group are already throttled.

[ Keep in mind, it's a cgroup-internal limit, so no memory freeing
  outside of the group can alleviate the situation. Progress must
  happen from within the cgroup. ]

But this sort of parallelism in a pressured cgroup is unlikely in
practice. By the time reclaim fails, usually *every task* in the
cgroup ends up having to allocate, because they lose executables to
cache reclaim, or heap memory to swap etc., and then page fault.

We found that more often than not, it just deteriorates into a single
sequence of events. Slowing it down just drags out the inevitable.

As a result, we eventually moved away from the idea of gradual
throttling. The last remnants of this idea finally disappeared from
the docs last year (commit 5647e53f7856bb39dae781fe26aa65a699e2fc9f).

memory.high now effectively puts the cgroup to sleep when reclaim
fails (similar to OOM killer disabling in v1, but without the caveats
of that implementation). This is useful for letting userspace
implement custom OOM killing policies.
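[ As a minimal sketch of such a userspace policy, assuming a cgroup at
  /sys/fs/cgroup/workload and an arbitrary pressure threshold;
  cgroup.kill needs a v5.14+ kernel. A production monitor would arm a
  PSI trigger and poll() on memory.pressure instead of busy-polling: ]

  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  #define CG "/sys/fs/cgroup/workload"

  /* Parse the "full avg10=" figure from the cgroup's PSI file. */
  static double full_avg10(void)
  {
      char line[256];
      double avg = 0.0;
      FILE *f = fopen(CG "/memory.pressure", "r");

      if (!f)
          return 0.0;
      while (fgets(line, sizeof(line), f)) {
          if (!strncmp(line, "full ", 5)) {
              sscanf(line, "full avg10=%lf", &avg);
              break;
          }
      }
      fclose(f);
      return avg;
  }

  int main(void)
  {
      for (;;) {
          if (full_avg10() > 20.0) {  /* sustained full stall */
              FILE *k = fopen(CG "/cgroup.kill", "w");
              if (k) {
                  fputs("1", k);      /* kill the whole cgroup */
                  fclose(k);
              }
              return 0;
          }
          sleep(2);
      }
  }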