From: Hillf Danton <hdanton@sina.com>
To: Dave Chinner
Cc: Michal Hocko, Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Mel Gorman
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] Congestion
Date: Thu, 13 May 2021 15:44:09 +0800
Message-Id: <20210513074409.3181-1-hdanton@sina.com>
In-Reply-To: <20200106232100.GL23195@dread.disaster.area>
References: <20191231125908.GD6788@bombadil.infradead.org> <20200106115514.GG12699@dhcp22.suse.cz>

On Tue, 7 Jan 2020 10:21:00 Dave Chinner wrote:
>On Mon, Jan 06, 2020 at 12:55:14PM +0100, Michal Hocko wrote:
>> On Tue 31-12-19 04:59:08, Matthew Wilcox wrote:
>> >
>> > I don't want to present this topic; I merely noticed the problem.
>> > I nominate Jens Axboe and Michal Hocko as session leaders. See the
>> > thread here:
>>
>> Thanks for bringing this up, Matthew! The change in the behavior came as
>> a surprise to me. I can lead the session for the MM side.
>>
>> > https://lore.kernel.org/linux-mm/20190923111900.GH15392@bombadil.infradead.org/
>> >
>> > Summary: Congestion is broken and has been for years, and everybody's
>> > system is sleeping waiting for congestion that will never clear.
>> >
>> > A good outcome for this meeting would be:
>> >
>> >  - MM defines what information they want from the block stack.
>>
>> The history of the congestion waiting is kinda hairy but I will try to
>> summarize the expectations we used to have, and we can discuss how much of
>> that has been real and what followed as a cargo cult. Maybe we just
>> find out that we do not need functionality like that anymore. I believe
>> Mel would be a great contributor to the discussion.
>
>We most definitely do need some form of reclaim throttling based on
>IO congestion, because it is trivial to drive the system into swap
>storms and OOM killer invocation when there are large dirty slab
>caches that require IO to make reclaim progress and there's little
>in the way of page cache to reclaim.
>
>This is one of the biggest issues I've come across trying to make
>XFS inode reclaim non-blocking - the existing code blocks on inode
>writeback IO congestion to throttle the overall reclaim rate and
>so prevents swap storms and OOM killer rampages from occurring.
>
>The moment I remove the inode writeback blocking from the reclaim
>path and move the backoffs to the core reclaim congestion backoff
>algorithms, I see a substantial increase in the typical reclaim scan
>priority. This is because the reclaim code does not have an
>integrated back-off mechanism that can balance reclaim throttling
>between slab cache and page cache reclaim. This results in
>insufficient page reclaim backoff under slab cache backoff
>conditions, leading to excessive page cache reclaim and swapping out
>all the anonymous pages in memory. Then performance goes to hell as
>userspace starts to block on page faults, swap thrashing like this:
>
>page_fault
>  swap_in
>    alloc page
>      direct reclaim
>        swap out anon page
>          submit_bio
>            wbt_throttle
>
>IOWs, page reclaim doesn't back off until userspace gets throttled
>in the block layer doing swap-out during swap-in during page
>faults. For these sorts of workloads there should be little to no
>swap thrashing occurring - throttling reclaim to the rate at which
>inodes are cleaned by async IO dispatcher threads is what is needed
>here, not continuing to wind up the reclaim priority until swap storms
>and the OOM killer end up killing the machine...
>
>I also see this when the inode cache load is on a separate device to
>the swap partition - both devices end up at 100% utilisation, one
>doing inode writeback flat out (about 300,000 inodes/sec from an
>inode cache of 5-10 million inodes), the other swap thrashing
>from a page cache of only 250-500 pages in size.

Is there a watermark of clean inodes in the inode cache, say 3% of
the cache size? A laundry thread could kick in once clean inodes drop
below it, preferably independent of dirty page writeback and kswapd,
to take the load off direct reclaimers.
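
Something like the rough sketch below, in kernel style; the watermark,
both counters and every helper in it are hypothetical and exist nowhere
in the tree today:

/*
 * Rough sketch only: nothing below is existing kernel API.
 */
#include <linux/kthread.h>
#include <linux/wait.h>
#include <linux/atomic.h>

#define CLEAN_INODE_WMARK_PCT	3	/* keep >= 3% of cached inodes clean */

static atomic_long_t nr_cached_inodes;		/* all inodes in the cache */
static atomic_long_t nr_clean_inodes;		/* the clean subset */
static DECLARE_WAIT_QUEUE_HEAD(inode_laundry_wq);

static bool clean_inodes_below_wmark(void)
{
	long total = atomic_long_read(&nr_cached_inodes);
	long clean = atomic_long_read(&nr_clean_inodes);

	return clean * 100 < total * CLEAN_INODE_WMARK_PCT;
}

static void inode_laundry_batch(void)
{
	/*
	 * Placeholder: push one batch of dirty inodes to async
	 * writeback and bump nr_clean_inodes as they come back
	 * clean.  Deliberately not tied to dirty page writeback
	 * or kswapd.
	 */
}

/* started once at boot via kthread_run() */
static int inode_laundry(void *unused)
{
	while (!kthread_should_stop()) {
		wait_event_interruptible(inode_laundry_wq,
					 clean_inodes_below_wmark() ||
					 kthread_should_stop());
		while (clean_inodes_below_wmark() && !kthread_should_stop())
			inode_laundry_batch();
	}
	return 0;
}

A direct reclaimer that finds the watermark broken would then just
wake_up(&inode_laundry_wq) and back off, instead of issuing and
waiting on inode writeback itself.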
Hillf

>
>Hence the way congestion was historically dealt with as a "global
>condition" still needs to exist in some manner - congestion on a
>single device is sufficient to cause the high-level reclaim
>algorithms to misbehave badly...
>
>Hence it seems to me that having IO load feedback to the memory
>reclaim algorithms is most definitely required for memory reclaim to
>be able to make the correct decisions about what to reclaim. If the
>shrinker for the cache that uses 50% of RAM in the machine is saying
>"backoff needed" and its underlying device is
>congested and limiting object reclaim rates, then it's a pretty good
>indication that reclaim should back off and wait for IO progress to
>be made instead of trying to reclaim from other LRUs that hold an
>insignificant amount of memory compared to the huge cache that is
>backed up waiting on IO completion to make progress....
>
>Cheers,
>
>Dave.
>--
>Dave Chinner
>david@fromorbit.com
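
To make the "backoff needed" feedback above concrete, here is a rough
sketch of what the shrinker side could report; shrink_control_ext,
inode_io_congested() and reclaim_wait_for_io() are all invented for
illustration, since today a shrinker can only return a freed count or
SHRINK_STOP, which the core cannot tell apart from "nothing left to
reclaim":

/*
 * Sketch only: the extended control block and both helpers are
 * hypothetical, not existing kernel API.
 */
#include <linux/kernel.h>
#include <linux/shrinker.h>

struct shrink_control_ext {
	struct shrink_control sc;
	bool backoff_needed;	/* set by the shrinker: progress waits on IO */
};

static bool inode_io_congested(void)
{
	/* Placeholder: ask the backing device about writeback congestion. */
	return false;
}

static unsigned long free_clean_inodes(unsigned long nr)
{
	/* Placeholder: reclaim up to nr clean inodes, return the count. */
	return 0;
}

static unsigned long inode_cache_scan(struct shrinker *shrink,
				      struct shrink_control *sc)
{
	struct shrink_control_ext *scx =
		container_of(sc, struct shrink_control_ext, sc);
	unsigned long freed = free_clean_inodes(sc->nr_to_scan);

	/*
	 * Instead of blocking here on inode writeback, tell the core
	 * that further progress depends on IO completion.
	 */
	if (freed < sc->nr_to_scan && inode_io_congested())
		scx->backoff_needed = true;

	return freed;
}

/*
 * The core reclaim loop would then back off on IO rather than winding
 * up the scan priority and going after the page LRUs:
 *
 *	if (scx.backoff_needed)
 *		reclaim_wait_for_io(HZ / 10);
 */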