From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Mon, 28 Jun 2021 13:12:03 -0400
From: Johannes Weiner <hannes@cmpxchg.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Andrew Morton <akpm@linux-foundation.org>, Roman Gushchin <guro@fb.com>,
	Tejun Heo <tj@kernel.org>, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	kernel-team@fb.com
Subject: Re: [PATCH 4/4] vfs: keep inodes with page cache off the inode
 shrinker LRU
Message-ID: <YNoC49bWOCxnkQ/v@cmpxchg.org>
References: <20210614211904.14420-1-hannes@cmpxchg.org>
 <20210614211904.14420-4-hannes@cmpxchg.org>
 <20210615062640.GD2419729@dread.disaster.area>
 <YMj2YbqJvVh1busC@cmpxchg.org>
 <20210616012008.GE2419729@dread.disaster.area>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20210616012008.GE2419729@dread.disaster.area>
List-ID: <linux-mm.kvack.org>

On Wed, Jun 16, 2021 at 11:20:08AM +1000, Dave Chinner wrote:
> On Tue, Jun 15, 2021 at 02:50:09PM -0400, Johannes Weiner wrote:
> > On Tue, Jun 15, 2021 at 04:26:40PM +1000, Dave Chinner wrote:
> > > And in __list_lru_walk_one() just add:
> > > 
> > > 		case LRU_ROTATE_NODEFER:
> > > 			isolated++;
> > > 			/* fallthrough */
> > > 		case LRU_ROTATE:
> > > 			list_move_tail(item, &l->list);
> > > 			break;
> > > 
> > > And now inodes with active page cache  rotated to the tail of the
> > > list and are considered to have had work done on them. Hence they
> > > don't add to the work accumulation that the shrinker infrastructure
> > > defers, and so will allow the page reclaim to do it's stuff with
> > > page reclaim before such inodes will get reclaimed.
> > > 
> > > That's *much* simpler than your proposed patch and should get you
> > > pretty much the same result.
> > 
> > It solves the deferred work buildup, but it's absurdly inefficient.
> 
> So you keep saying. Show us the numbers. Show us that it's so
> inefficient that it's completely unworkable. _You_ need to justify
> why violating modularity and layering is the only viable solution to
> this problem. Given that there is an alternative simple, straight
> forward solution to the problem, it's on you to prove it is
> insufficient to solve your issues.
> 
> I'm sceptical that the complexity is necessary given that in general
> workloads, the inode shrinker doesn't even register in kernel
> profiles and that the problem being avoided generally isn't even hit
> in most workloads. IOWs, I'll take a simple but inefficient solution
> for avoiding a corner case behaviour over a solution that is
> complex, fragile and full of layering violations any day of the
> weeks.

I spent time last week benchmarking both implementations with various
combinations of icache and page cache size proportions.

You're right that most workloads don't care. But there are workloads
that do, and for them the behavior can become pathological during
drop-behind reclaim.

Page cache reclaim has two modes: 1. Workingset transitions where we
flush out the old as quickly as possible and 2. Streaming buffered IO
that doesn't benefit from caching, and so gets confined to the
smallest possible amount of memory without touching active pages.

During 1. we may rotate busy inodes a few times until their page cache
disappears. This isn't great, but it's at least temporary.

The issue is 2. We may do drop-behind reclaim for extended periods of
time, during which the cache workingset remains completely untouched
and the corresponding inodes never become eligible for freeing.
Rotating them over and over represents a continuous parasitic drag on
reclaim. Depending on the proportions between the icache and the
inactive cache list, this drag can make up a sizable portion or even
the majority of overall CPU consumed by reclaim. (And if you recall
the discussion around RWF_UNCACHED, drop-behind reclaim is already
bottlenecked on CPU with decent IO devices.)
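
To make the drag concrete, here is a toy userspace model of the
rotate-on-scan behavior. It is a sketch only, not kernel code: the
names toy_lru, toy_lru_create, toy_lru_scan and scan_stats are made up
for illustration, and the list is modeled as a simple ring buffer. It
assumes that an inode with page cache is always rotated to the tail
and an inode without page cache is always freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy LRU: ring buffer of flags, true = inode still holds page cache. */
struct toy_lru {
	bool *has_cache;
	size_t cap;	/* allocated slots */
	size_t head;	/* index of the oldest entry */
	size_t len;	/* entries currently on the list */
};

/* What one scan pass accomplished. */
struct scan_stats {
	size_t rotated;	/* visited, moved to the tail, nothing freed */
	size_t freed;	/* visited and removed from the list */
};

/* Build a list with `pinned` cache-holding inodes followed by `idle` ones. */
static struct toy_lru toy_lru_create(size_t pinned, size_t idle)
{
	struct toy_lru lru = { .cap = pinned + idle, .len = pinned + idle };

	lru.has_cache = malloc(lru.cap * sizeof(*lru.has_cache));
	assert(lru.has_cache || lru.cap == 0);
	for (size_t i = 0; i < lru.cap; i++)
		lru.has_cache[i] = i < pinned;
	return lru;
}

/* Walk up to nr_to_scan entries from the head, rotating or freeing each. */
static struct scan_stats toy_lru_scan(struct toy_lru *lru, size_t nr_to_scan)
{
	struct scan_stats st = { 0, 0 };

	while (nr_to_scan > 0 && lru->len > 0) {
		bool pinned = lru->has_cache[lru->head];

		nr_to_scan--;
		/* Take the entry off the head of the list. */
		lru->head = (lru->head + 1) % lru->cap;
		lru->len--;

		if (pinned) {
			/* Rotate: put it back at the tail, no memory freed. */
			size_t tail = (lru->head + lru->len) % lru->cap;

			lru->has_cache[tail] = true;
			lru->len++;
			st.rotated++;
		} else {
			st.freed++;
		}
	}
	return st;
}
```

With a list that holds only cache-backed inodes, every entry visited
by toy_lru_scan() is rotated and none is freed, no matter how large
nr_to_scan gets. That is the situation in mode 2: the scan cost keeps
recurring while the list never gets shorter.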

My test is doing drop-behind reclaim while most memory is filled with
a cache workingset that is held by an increasing number of inodes. The
first number here is the inode count, the second is the number of
active pages held by each:

inodes * pages each      Children    Self  Command  Shared Object      Symbol
    1,000 * 3072 pages:     0.39%   0.05%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
   10,000 *  307 pages:     0.39%   0.04%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
  100,000 *   32 pages:     1.29%   0.05%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
  500,000 *    6 pages:    11.36%   0.08%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
1,000,000 *    3 pages:    26.40%   0.04%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
1,500,000 *    2 pages:    42.97%   0.00%  kswapd0  [kernel.kallsyms]  [k] shrink_slab
3,000,000 *    1 page:     45.22%   0.00%  kswapd0  [kernel.kallsyms]  [k] shrink_slab

As we get into higher inode counts, the shrinkers end up burning most
of the reclaim cycles to rotate workingset inodes. For perspective,
with 3 million inodes, when the shrinkers eat 45% of the cycles to
busy-poll the workingset inodes, page reclaim only consumes about 10%
to actually make forward progress.

IMO it goes from suboptimal to being a problem somewhere between 100k
and 500k in this table. That's not *that* many inodes - I'm counting
~74k files in my linux git tree alone.

North of 500k, it becomes pathological. That's probably less common,
but it happens in the real world. I checked the file servers that host
our internal source code trees. They have 16 times the memory of my
test box, but they routinely deal with 50 million+ inodes.

I think the additional complexity of updating the inode LRU according
to cache population state is justified in order to avoid these
pathological corner cases.