Date: Mon, 28 Jun 2021 13:12:03 -0400
From: Johannes Weiner
To: Dave Chinner
Cc: Andrew Morton, Roman Gushchin, Tejun Heo, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH 4/4] vfs: keep inodes with page cache off the inode shrinker LRU
References: <20210614211904.14420-1-hannes@cmpxchg.org>
	<20210614211904.14420-4-hannes@cmpxchg.org>
	<20210615062640.GD2419729@dread.disaster.area>
	<20210616012008.GE2419729@dread.disaster.area>
In-Reply-To: <20210616012008.GE2419729@dread.disaster.area>

On Wed, Jun 16, 2021 at 11:20:08AM +1000, Dave Chinner wrote:
> On Tue, Jun 15, 2021 at 02:50:09PM -0400, Johannes Weiner wrote:
> > On Tue, Jun 15, 2021 at 04:26:40PM +1000, Dave Chinner wrote:
> > > And in __list_lru_walk_one() just add:
> > >
> > > 		case LRU_ROTATE_NODEFER:
> > > 			isolated++;
> > > 			/* fallthrough */
> > > 		case LRU_ROTATE:
> > > 			list_move_tail(item, &l->list);
> > > 			break;
> > >
> > > And now inodes with active page cache rotated to the tail of the
> > > list and are considered to have had work done on them. Hence they
> > > don't add to the work accumulation that the shrinker infrastructure
> > > defers, and so will allow the page reclaim to do it's stuff with
> > > page reclaim before such inodes will get reclaimed.
> > >
> > > That's *much* simpler than your proposed patch and should get you
> > > pretty much the same result.
> >
> > It solves the deferred work buildup, but it's absurdly inefficient.
>
> So you keep saying. Show us the numbers. Show us that it's so
> inefficient that it's completely unworkable. _You_ need to justify
> why violating modularity and layering is the only viable solution to
> this problem. Given that there is an alternative simple, straight
> forward solution to the problem, it's on you to prove it is
> insufficient to solve your issues.
>
> I'm sceptical that the complexity is necessary given that in general
> workloads, the inode shrinker doesn't even register in kernel
> profiles and that the problem being avoided generally isn't even hit
> in most workloads. IOWs, I'll take a simple but inefficient solution
> for avoiding a corner case behaviour over a solution that is
> complex, fragile and full of layering violations any day of the
> weeks.

I spent time last week benchmarking both implementations with various
combinations of icache and page cache size proportions. You're right
that most workloads don't care. But there are workloads that do, and
for them the behavior can become pathological during drop-behind
reclaim.

Page cache reclaim has two modes:

1. Workingset transitions, where we flush out the old as quickly as
   possible, and

2. Streaming buffered IO that doesn't benefit from caching, and so
   gets confined to the smallest possible amount of memory without
   touching active pages.

During 1. we may rotate busy inodes a few times until their page cache
disappears. This isn't great, but it is at least temporary.

The issue is 2. We may do drop-behind reclaim for extended periods of
time, during which the cache workingset remains completely untouched
and the corresponding inodes never become eligible for freeing.
Rotating them over and over represents a continuous parasitic drag on
reclaim. Depending on the proportions between the icache and the
inactive cache list, this drag can make up a sizable portion or even
the majority of the overall CPU consumed by reclaim.
(And if you recall the discussion around RWF_UNCACHED, drop-behind
reclaim is already bottlenecked on CPU with decent IO devices.)

My test is doing drop-behind reclaim while most memory is filled with
a cache workingset that is held by an increasing number of inodes. The
first number here is the inodes, the second is the active pages held
by each:

    1,000 * 3072 pages:   0.39%   0.05%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
   10,000 *  307 pages:   0.39%   0.04%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
  100,000 *   32 pages:   1.29%   0.05%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
  500,000 *    6 pages:  11.36%   0.08%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
1,000,000 *    3 pages:  26.40%   0.04%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
1,500,000 *    2 pages:  42.97%   0.00%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
3,000,000 *    1 page:   45.22%   0.00%  kswapd0  [kernel.kallsyms]  [k] shrink_slab

As we get into higher inode counts, the shrinkers end up burning most
of the reclaim cycles to rotate workingset inodes.

For perspective, with 3 million inodes, when the shrinkers eat 45% of
the cycles to busy-poll the workingset inodes, page reclaim only
consumes about 10% to actually make forward progress.

IMO it goes from suboptimal to being a problem somewhere between 100k
and 500k in this table. That's not *that* many inodes - I'm counting
~74k files in my linux git tree alone.

North of 500k, it becomes pathological. That's probably less common,
but it happens in the real world. I checked the file servers that host
our internal source code trees. They have 16 times the memory of my
test box, but they routinely deal with 50 million+ inodes.

I think the additional complexity of updating the inode LRU according
to cache population state is justified in order to avoid these
pathological corner cases.
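To illustrate the rotation drag described above, here is a minimal
user-space sketch. It is a toy model, not kernel code and not the
benchmark behind the numbers in the table: toy_inode, toy_lru and
toy_lru_scan() are hypothetical names, and a singly linked queue stands
in for the real inode LRU. It only shows that a scan over inodes that
are all pinned by page cache does work proportional to the scan count
while freeing nothing.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode sitting on the shrinker LRU. */
struct toy_inode {
	struct toy_inode *next;
	bool has_page_cache;	/* true while workingset pages pin the inode */
};

/* Hypothetical LRU: reclaim scans from head, rotation appends at tail. */
struct toy_lru {
	struct toy_inode *head;
	struct toy_inode *tail;
};

struct toy_scan_result {
	size_t rotated;	/* busy inodes moved back to the tail */
	size_t freed;	/* idle inodes taken off the list */
};

static void toy_lru_add_tail(struct toy_lru *lru, struct toy_inode *inode)
{
	inode->next = NULL;
	if (lru->tail)
		lru->tail->next = inode;
	else
		lru->head = inode;
	lru->tail = inode;
}

static struct toy_inode *toy_lru_pop_head(struct toy_lru *lru)
{
	struct toy_inode *inode = lru->head;

	if (!inode)
		return NULL;
	lru->head = inode->next;
	if (!lru->head)
		lru->tail = NULL;
	inode->next = NULL;
	return inode;
}

/*
 * Scan up to nr_to_scan entries from the head. Busy inodes are rotated
 * to the tail; idle inodes are removed and counted as freed (the caller
 * owns the memory in this toy). Every rotation costs a list operation
 * but makes no reclaim progress.
 */
static struct toy_scan_result toy_lru_scan(struct toy_lru *lru,
					   size_t nr_to_scan)
{
	struct toy_scan_result res = { 0, 0 };
	size_t i;

	for (i = 0; i < nr_to_scan; i++) {
		struct toy_inode *inode = toy_lru_pop_head(lru);

		if (!inode)
			break;
		if (inode->has_page_cache) {
			toy_lru_add_tail(lru, inode);
			res.rotated++;
		} else {
			res.freed++;
		}
	}
	return res;
}
```

With every inode marked has_page_cache, each call to toy_lru_scan()
reports rotated equal to nr_to_scan and freed equal to zero, no matter
how often it is repeated. That is the shape of the cost during extended
drop-behind reclaim: the workingset never changes, so the same inodes
are visited again on every pass.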