Date: Wed, 16 Mar 2022 13:52:23 +1100
From: Dave Chinner
To: Roman Gushchin, Matthew Wilcox, Stephen Brennan,
	lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Gautham Ananthakrishna, khlebnikov@yandex-team.ru
Subject: Re: [LSF/MM TOPIC] Better handling of negative dentries
Message-ID: <20220316025223.GR661808@dread.disaster.area>

On Wed, Mar 16, 2022 at 10:07:19AM +0800, Gao Xiang wrote:
> On Tue, Mar 15, 2022 at 01:56:18PM -0700, Roman Gushchin wrote:
> >
> > > On Mar 15, 2022, at 12:56 PM, Matthew Wilcox wrote:
> > >
> > > The number of negative dentries is effectively constrained only by
> > > memory size. Systems which do not experience significant memory
> > > pressure for an extended period can build up millions of negative
> > > dentries which clog the dcache. That can have different symptoms,
> > > such as inotify taking a long time [1], high memory usage [2] and
> > > even just poor lookup performance [3]. We've also seen problems
> > > with cgroups being pinned by negative dentries, though I think we
> > > now reparent those dentries to their parent cgroup instead.
> >
> > Yes, it should be fixed already.
> >
> > > We don't have a really good solution yet, and maybe some focused
> > > brainstorming on the problem would lead to something that actually
> > > works.
> >
> > I'd be happy to join this discussion. And in my opinion it's going
> > beyond negative dentries: there are other types of objects which
> > tend to grow beyond any reasonable limits if there is no memory
> > pressure.
>
> +1, we once had a similar issue as well, and agree that it is not
> only limited to negative dentries but applies to all of the too-many
> LRU-ed dentries and inodes.

Yup, any discussion solely about managing buildup of negative
dentries doesn't acknowledge that it is just a symptom of larger
problems that need to be addressed.

> Limiting the total number may help servers avoid shrinker spikes.

No, we don't want to set hard limits on object counts - that's just
asking for systems that need frequent hand tuning and are impossible
to get right under changing workloads. Caches need to auto-size
according to the workload's working set to find a steady-state
balance, not be bound by arbitrary limits. But even cache sizing
isn't the problem here - it's just another symptom.

> > A perfect example of when this happens is when a machine is almost
> > idle for some period of time. Periodically running processes
> > create various kernel objects (mostly vfs cache) which over time
> > fill significant portions of the total memory. And when the need
> > for memory arises, we realize that the memory is heavily
> > fragmented and it's costly to reclaim it back.

Yup, the underlying issue here is that memory reclaim does nothing
to manage long term build-up of single use cached objects when
*there is no memory pressure*. There's plenty of idle time and spare
resources to manage caches sanely, but we don't. e.g. there is no
periodic rotation of caches that could lead to detection and reclaim
of single use objects (say over a period of minutes) and hence
prevent them from filling up all of memory unnecessarily and
creating transient memory reclaim and allocation latency spikes when
memory finally fills up.
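As a strawman, the aging pass itself is simple - something like the
userspace sketch below, where a periodic sweep clears per-object
referenced bits and reclaims anything that hasn't been touched since
the previous sweep. (All types and names here are made up for
illustration; this is not the kernel's list_lru or dcache code.)

/*
 * Pressure-independent cache aging, simulated in userspace: each
 * sweep reclaims objects whose referenced bit is still clear from
 * the previous sweep, then clears the bit on the survivors.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct cache_obj {
	struct cache_obj *next;
	int id;
	bool referenced;	/* set on lookup hit, cleared by the ager */
};

static struct cache_obj *cache_head;

/* A lookup hit marks the object as recently used. */
static void cache_touch(struct cache_obj *obj)
{
	obj->referenced = true;
}

/*
 * One aging pass: anything untouched since the previous pass is
 * reclaimed; everything else has its referenced bit cleared so the
 * next pass catches it if it stays idle.  Runs off a timer, not off
 * memory pressure.
 */
static void age_cache(void)
{
	struct cache_obj **pp = &cache_head;

	while (*pp) {
		struct cache_obj *obj = *pp;

		if (!obj->referenced) {
			*pp = obj->next;	/* unlink and reclaim */
			printf("reclaimed idle object %d\n", obj->id);
			free(obj);
		} else {
			obj->referenced = false;
			pp = &obj->next;
		}
	}
}

int main(void)
{
	/* populate a small cache; everything starts out "hot" */
	for (int i = 0; i < 4; i++) {
		struct cache_obj *obj = malloc(sizeof(*obj));

		obj->id = i;
		obj->referenced = true;
		obj->next = cache_head;
		cache_head = obj;
	}

	age_cache();			/* pass 1: just clears referenced bits */
	cache_touch(cache_head);	/* one object gets used between passes */
	age_cache();			/* pass 2: reclaims the three idle objects */
	return 0;
}

In the kernel the sweep would presumably hang off a per-cache
delayed work item; the important property is that the aging clock
ticks on wall time rather than on allocation pressure, so an idle
machine trims itself instead of hoarding single-use objects.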
IOWs, negative dentries getting out of hand and shrinker spikes are
both symptoms of the same problem: while memory allocation is free,
memory reclaim does nothing to manage cache aging. Hence we only
find out we've got a badly aged cache when we finally realise it has
filled all of memory, and then we have heaps of work to do before
memory can be made available for allocation again....

And then if you're going to talk memory reclaim, the elephant in the
room is the lack of integration between shrinkers and the main
reclaim infrastructure. There's no priority determination, there's
no progress feedback, and there's no mechanism to allow shrinkers to
throttle reclaim rather than have the reclaim infrastructure wind up
the priority and OOM kill when a shrinker cannot make progress
quickly, etc.
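For reference, this is roughly the entire interface a shrinker has
to work with - simplified from include/linux/shrinker.h of this era,
with fields trimmed for brevity, so check the real header rather
than trusting this excerpt:

struct shrink_control {
	gfp_t gfp_mask;			/* allocation context of the caller */
	int nid;			/* NUMA node being reclaimed */
	unsigned long nr_to_scan;	/* objects to try to free this call */
	unsigned long nr_scanned;	/* objects scanned, reported back */
	struct mem_cgroup *memcg;	/* memcg target, if memcg-aware */
};

struct shrinker {
	/* how many freeable objects does this cache hold? */
	unsigned long (*count_objects)(struct shrinker *,
				       struct shrink_control *sc);
	/* free up to nr_to_scan of them; SHRINK_STOP gives up */
	unsigned long (*scan_objects)(struct shrinker *,
				      struct shrink_control *sc);
	/* ... plus list linkage, flags and deferral bookkeeping ... */
};

Everything a shrinker learns about the state of reclaim has to be
inferred from gfp_mask and nr_to_scan; there is no field carrying
the reclaim priority, and no way to say "back off, I'm throttled"
short of returning SHRINK_STOP and abandoning the scan entirely.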
Then there's direct reclaim hammering shrinkers with unbound
concurrency, so individual shrinkers have no chance of determining
how much memory pressure there really is by themselves, not to
mention the lock contention problems that unbound reclaim
concurrency on things like LRU lists can cause. And, of course,
memcg-based reclaim is still only tacked onto the side of the
shrinker infrastructure...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com