From: Hillf Danton <hdanton@sina.com>
To: Dave Chinner
Cc: Michal Hocko, Matthew Wilcox, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Mel Gorman
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] Congestion
Date: Thu, 13 May 2021 15:44:09 +0800
Message-Id: <20210513074409.3181-1-hdanton@sina.com>
In-Reply-To: <20200106232100.GL23195@dread.disaster.area>
References: <20191231125908.GD6788@bombadil.infradead.org> <20200106115514.GG12699@dhcp22.suse.cz>

On Tue, 7 Jan 2020 10:21:00 Dave Chinner wrote:
>On Mon, Jan 06, 2020 at 12:55:14PM +0100, Michal Hocko wrote:
>> On Tue 31-12-19 04:59:08, Matthew Wilcox wrote:
>> >
>> > I don't want to present this topic; I merely noticed the problem.
>> > I nominate Jens Axboe and Michal Hocko as session leaders. See the
>> > thread here:
>>
>> Thanks for bringing this up, Matthew! The change in the behavior came as
>> a surprise to me. I can lead the session for the MM side.
>>
>> > https://lore.kernel.org/linux-mm/20190923111900.GH15392@bombadil.infradead.org/
>> >
>> > Summary: Congestion is broken and has been for years, and everybody's
>> > system is sleeping waiting for congestion that will never clear.
>> >
>> > A good outcome for this meeting would be:
>> >
>> >  - MM defines what information they want from the block stack.
>>
>> The history of the congestion waiting is kinda hairy but I will try to
>> summarize the expectations we used to have, and we can discuss how much of
>> that has been real and what followed as a cargo cult. Maybe we just
>> find out that we do not need functionality like that anymore. I believe
>> Mel would be a great contributor to the discussion.
>
>We most definitely do need some form of reclaim throttling based on
>IO congestion, because it is trivial to drive the system into swap
>storms and OOM killer invocation when there are large dirty slab
>caches that require IO to make reclaim progress and there's little
>in the way of page cache to reclaim.
>
>This is one of the biggest issues I've come across trying to make
>XFS inode reclaim non-blocking - the existing code blocks on inode
>writeback IO congestion to throttle the overall reclaim rate and
>so prevents swap storms and OOM killer rampages from occurring.
>
>The moment I remove the inode writeback blocking from the reclaim
>path and move the backoffs to the core reclaim congestion backoff
>algorithms, I see a substantial increase in the typical reclaim scan
>priority. This is because the reclaim code does not have an
>integrated back-off mechanism that can balance reclaim throttling
>between slab cache and page cache reclaim. This results in
>insufficient page reclaim backoff under slab cache backoff
>conditions, leading to excessive page cache reclaim and swapping out
>all the anonymous pages in memory. Then performance goes to hell as
>userspace starts to block on page faults, swap thrashing like this:
>
>page_fault
>  swap_in
>    alloc page
>      direct reclaim
>        swap out anon page
>          submit_bio
>            wbt_throttle
>
>IOWs, page reclaim doesn't back off until userspace gets throttled
>in the block layer doing swap-out during swap-in during page
>faults. For these sorts of workloads there should be little to no
>swap thrashing occurring - throttling reclaim to the rate at which
>inodes are cleaned by async IO dispatcher threads is what is needed
>here, not continuing to wind up the reclaim priority until swap storms
>and the OOM killer end up killing the machine...
>
>I also see this when the inode cache load is on a separate device to
>the swap partition - both devices end up at 100% utilisation, one
>doing inode writeback flat out (about 300,000 inodes/sec from an
>inode cache of 5-10 million inodes), the other swap thrashing
>from a page cache of only 250-500 pages in size.

Is there a watermark of clean inodes in the inode cache, say 3% of
the cache size? A laundry thread could kick in once clean inodes drop
below it, preferably independent of dirty page writeback and kswapd,
to take the load off direct reclaimers.
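
Something like the rough sketch below, in kernel style; the watermark,
both counters and every helper in it are hypothetical and exist nowhere
in the tree today:

/*
 * Rough sketch only: nothing below is existing kernel API.
 */
#include <linux/kthread.h>
#include <linux/wait.h>
#include <linux/atomic.h>

#define CLEAN_INODE_WMARK_PCT	3	/* keep >= 3% of cached inodes clean */

static atomic_long_t nr_cached_inodes;		/* all inodes in the cache */
static atomic_long_t nr_clean_inodes;		/* the clean subset */
static DECLARE_WAIT_QUEUE_HEAD(inode_laundry_wq);

static bool clean_inodes_below_wmark(void)
{
	long total = atomic_long_read(&nr_cached_inodes);
	long clean = atomic_long_read(&nr_clean_inodes);

	return clean * 100 < total * CLEAN_INODE_WMARK_PCT;
}

static void inode_laundry_batch(void)
{
	/*
	 * Placeholder: push one batch of dirty inodes to async
	 * writeback and bump nr_clean_inodes as they come back
	 * clean.  Deliberately not tied to dirty page writeback
	 * or kswapd.
	 */
}

/* started once at boot via kthread_run() */
static int inode_laundry(void *unused)
{
	while (!kthread_should_stop()) {
		wait_event_interruptible(inode_laundry_wq,
					 clean_inodes_below_wmark() ||
					 kthread_should_stop());
		while (clean_inodes_below_wmark() && !kthread_should_stop())
			inode_laundry_batch();
	}
	return 0;
}

A direct reclaimer that finds the watermark broken would then just
wake_up(&inode_laundry_wq) and back off, instead of issuing and
waiting on inode writeback itself.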
Hillf

>
>Hence the way congestion was historically dealt with as a "global
>condition" still needs to exist in some manner - congestion on a
>single device is sufficient to cause the high-level reclaim
>algorithms to misbehave badly...
>
>Hence it seems to me that having IO load feedback to the memory
>reclaim algorithms is most definitely required for memory reclaim to
>be able to make the correct decisions about what to reclaim. If the
>shrinker for the cache that uses 50% of RAM in the machine is saying
>"backoff needed" and its underlying device is
>congested and limiting object reclaim rates, then it's a pretty good
>indication that reclaim should back off and wait for IO progress to
>be made instead of trying to reclaim from other LRUs that hold an
>insignificant amount of memory compared to the huge cache that is
>backed up waiting on IO completion to make progress....
>
>Cheers,
>
>Dave.
>--
>Dave Chinner
>david@fromorbit.com
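
To make the "backoff needed" feedback above concrete, here is a rough
sketch of what the shrinker side could report; shrink_control_ext,
inode_io_congested() and reclaim_wait_for_io() are all invented for
illustration, since today a shrinker can only return a freed count or
SHRINK_STOP, which the core cannot tell apart from "nothing left to
reclaim":

/*
 * Sketch only: the extended control block and both helpers are
 * hypothetical, not existing kernel API.
 */
#include <linux/kernel.h>
#include <linux/shrinker.h>

struct shrink_control_ext {
	struct shrink_control sc;
	bool backoff_needed;	/* set by the shrinker: progress waits on IO */
};

static bool inode_io_congested(void)
{
	/* Placeholder: ask the backing device about writeback congestion. */
	return false;
}

static unsigned long free_clean_inodes(unsigned long nr)
{
	/* Placeholder: reclaim up to nr clean inodes, return the count. */
	return 0;
}

static unsigned long inode_cache_scan(struct shrinker *shrink,
				      struct shrink_control *sc)
{
	struct shrink_control_ext *scx =
		container_of(sc, struct shrink_control_ext, sc);
	unsigned long freed = free_clean_inodes(sc->nr_to_scan);

	/*
	 * Instead of blocking here on inode writeback, tell the core
	 * that further progress depends on IO completion.
	 */
	if (freed < sc->nr_to_scan && inode_io_congested())
		scx->backoff_needed = true;

	return freed;
}

/*
 * The core reclaim loop would then back off on IO rather than winding
 * up the scan priority and going after the page LRUs:
 *
 *	if (scx.backoff_needed)
 *		reclaim_wait_for_io(HZ / 10);
 */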