Message-ID: <7e3d3ff6-b453-404b-beaf-cdd23fb3e1a2@linux.dev>
Date: Wed, 22 Nov 2023 23:32:50 +0800
Subject: Re: Question: memcg dirty throttle caused by low per-memcg dirty thresh
From: Chengming Zhou <chengming.zhou@linux.dev>
To: Jan Kara
Cc: LKML, linux-mm, Tejun Heo, Johannes Weiner, Christoph Hellwig,
    shr@devkernel.io, neilb@suse.de, Michal Hocko
In-Reply-To: <20231122144932.m44oiw5lufwkc5pw@quack3>
References: <109029e0-1772-4102-a2a8-ab9076462454@linux.dev>
 <20231122144932.m44oiw5lufwkc5pw@quack3>

On 2023/11/22 22:49, Jan Kara wrote:
> Hello!
>
> On Wed 22-11-23 17:38:25, Chengming Zhou wrote:
>> Sorry to bother you, we encountered a problem related to the memcg
>> dirty throttle after migrating from cgroup v1 to v2, so we want to ask
>> for some comments or suggestions.
>>
>> 1. Problem
>>
>> We have the "containerd" service running under system.slice, with its
>> memory.max set to 5GB. It is constantly throttled in
>> balance_dirty_pages(), since the memcg has more dirty memory than the
>> memcg dirty thresh.
>>
>> We didn't have this problem on cgroup v1, because cgroup v1 has
>> neither per-memcg writeback nor a per-memcg dirty thresh; only the
>> global dirty thresh is checked in balance_dirty_pages().
>
> As Michal writes, if you allow too many memcg pages to become dirty, you
> might be facing issues with page reclaim, so there are actually good
> reasons why you want the amount of dirty pages in each memcg reasonably
> limited. Also

Yes, the memcg dirty limit (20%) is good for the memcg reclaim path.

But some workloads (like a burst dirtier) may only create many dirty
pages in a short time, and for those we would like 60% of memory.max to
be dirtiable without throttling. That is not very harmful to their memcg
reclaim path.
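To put rough numbers on our case (taking memcg avail as roughly equal to
memory.max, i.e. ignoring the clean-memory cap for a moment):

  memory.max = 5GB
  dirty_ratio = 20  ->  memcg dirty thresh ~= 5GB * 20% = 1GB
  dirty_ratio = 60  ->  memcg dirty thresh ~= 5GB * 60% = 3GB

So a burst that dirties a couple of GB quickly hits the 1GB wall with the
default 20%, but fits under 60%.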
> generally increasing the number of available dirty pages beyond say 1GB
> is not going to bring any benefit in the overall writeback performance.
> It may still be useful in case you generate a lot of (or large)
> temporary files which get quickly deleted and thus with a high enough
> dirty limit they don't have to be written to the disk at all. Similarly,
> if the generation of dirty data is very bursty (i.e. you generate a lot
> of dirty data in a short while and then don't dirty anything for a long
> time), having a higher dirty limit may be useful. What is your usecase
> that you think you'll benefit from a higher dirty limit?

I think it's the burst dirtier case for us, and we see a good performance
improvement if we change the global dirty_ratio to 60 just for testing.

>> 2. Thinking
>>
>> So we wonder if we can support a per-memcg dirty thresh interface?
>> Right now the memcg dirty thresh is just calculated from memcg max *
>> ratio, where the ratio can be set via /proc/sys/vm/dirty_ratio.
>>
>> We have to set it to 60 instead of the default 20 as a workaround for
>> now, but we worry about the potential side effects.
>>
>> If we can support a per-memcg dirty thresh interface, we can set some
>> containers to a much higher dirty_ratio, especially for hungry dirtier
>> workloads like "containerd".
>
> As Michal wrote, if this ought to be configurable per memcg, then
> configuring the dirty amount directly in bytes may be more sensible.

Yes, "memory.dirty_limit" should be more sensible than
"memory.dirty_ratio".

>> 3. Solution?
>>
>> But we couldn't think of a good solution to support this. The current
>> memcg dirty thresh is calculated from a complex rule:
>>
>>   memcg dirty thresh = memcg avail * dirty_ratio
>>
>> where memcg avail comes from a combination of memcg max/high and memcg
>> file pages, capped by the system-wide clean memory excluding the
>> amount being used in the memcg.
>>
>> Although we may find a way to calculate the per-memcg dirty thresh, we
>> can't use it directly, since we still need to calculate/distribute the
>> dirty thresh into per-wb dirty thresh shares.
>>
>> R - A - B
>>      \-- C
>>
>> For example, if we know the dirty thresh of A, but the wb is in C, we
>> have no way to distribute the dirty thresh shares to the wb in C.
>>
>> But we have to get the dirty thresh of the wb in C, since we need it
>> to control the throttling of that wb in balance_dirty_pages().
>>
>> I may have missed something above, but the problem seems clear IMHO.
>> Looking forward to any comment or suggestion.
>
> I'm not sure I follow what the problem is here. In balance_dirty_pages()
> we have the global dirty threshold (tracked in gdtc) and the memcg dirty
> threshold (tracked in mdtc). This can get further scaled down based on
> the device throughput (that is the difference between 'thresh' and
> 'wb_thresh'), but that is independent of the way mdtc->thresh is
> calculated. So if we provide a different way of calculating
> mdtc->thresh, technically everything should keep working as is.

Sorry for the confusion. The problem is exactly how to calculate
mdtc->thresh.

R - A - B
     \-- C

Case 1: Suppose C has "memory.dirty_limit" set. Should we just use it as
mdtc->thresh? I see the current code also considers the system-wide clean
memory in mdtc_calc_avail(); do we still need to consider that when
"memory.dirty_limit" is set?

Case 2: Suppose C doesn't have "memory.dirty_limit" set, but A does. How
do we calculate C's mdtc->thresh? Obviously we can't use A's
"memory.dirty_limit" directly, since it should be distributed between B
and C.
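To make it concrete, below is a rough standalone model of how I read the
current calculation in mm/page-writeback.c (mdtc_calc_avail() plus the
ratio part of domain_dirty_limits()). Everything is simplified to plain
bytes, and "dirty_limit" here is only the hypothetical per-memcg knob we
are discussing, not an existing interface:

/* Simplified, standalone model -- not the kernel code itself. */
#include <stdio.h>

#define MIN(a, b)       ((a) < (b) ? (a) : (b))
#define GB              (1024UL * 1024 * 1024)

/* stand-in for struct dirty_throttle_control */
struct dtc {
        unsigned long avail;
        unsigned long dirty;
        unsigned long thresh;
};

static unsigned long vm_dirty_ratio = 20;

/*
 * Roughly what mdtc_calc_avail() does: memcg file pages plus the part
 * of the memcg headroom that is backed by clean memory elsewhere in
 * the system.
 */
static void model_mdtc_calc_avail(struct dtc *mdtc, const struct dtc *gdtc,
                                  unsigned long filepages,
                                  unsigned long headroom)
{
        unsigned long clean = filepages - MIN(filepages, mdtc->dirty);
        unsigned long global_clean = gdtc->avail -
                                     MIN(gdtc->avail, gdtc->dirty);
        unsigned long other_clean = global_clean -
                                    MIN(global_clean, clean);

        mdtc->avail = filepages + MIN(headroom, other_clean);
}

int main(void)
{
        struct dtc gdtc = { .avail = 32 * GB, .dirty = 2 * GB };
        struct dtc mdtc = { .dirty = 1 * GB };

        /* containerd-like memcg C: memory.max = 5GB, mostly page cache */
        model_mdtc_calc_avail(&mdtc, &gdtc, 4 * GB /* filepages */,
                              1 * GB /* headroom */);

        /* today: one global knob for every memcg domain */
        mdtc.thresh = mdtc.avail * vm_dirty_ratio / 100;
        printf("thresh today      : %lu MB\n", mdtc.thresh >> 20);

        /*
         * Case 1: C itself has the (hypothetical) dirty_limit.  Do we
         * take it as-is, or still cap it by the clean-memory-aware avail?
         */
        unsigned long c_dirty_limit = 3 * GB;
        printf("case 1, as-is     : %lu MB\n", c_dirty_limit >> 20);
        printf("case 1, capped    : %lu MB\n",
               MIN(c_dirty_limit, mdtc.avail) >> 20);

        /*
         * Case 2: only the ancestor A has dirty_limit set.  There is no
         * line to write here -- I don't see a sensible rule for splitting
         * A's limit between the wbs in B and C.
         */
        return 0;
}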
So the problem is that I don't know how to reasonably calculate
mdtc->thresh, even given a memcg tree where some memcgs have
"memory.dirty_limit" set. :\

Thanks!