From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 27 Nov 2024 12:06:49 +0100
From: Jan Kara
To: NeilBrown
Cc: Jan Kara, Anders Blomdell, Philippe Troin, "Matthew Wilcox (Oracle)",
	Andrew Morton, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, David Howells, netfs@lists.linux.dev
Subject: Re: Regression in NFS probably due to very large amounts of readahead
Message-ID: <20241127110649.yg2k4s3fzohb2pgg@quack3>
References: <> <20241126150613.a4b57y2qmolapsuc@quack3> <173269663098.1734440.13407516531783940860@noble.neil.brown.name>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <173269663098.1734440.13407516531783940860@noble.neil.brown.name>

Added David Howells to CC since this seems to be mostly netfs related.

On Wed 27-11-24 19:37:10, NeilBrown wrote:
> On Wed, 27 Nov 2024, Jan Kara wrote:
> > On Tue 26-11-24 11:37:19, Jan Kara wrote:
> > > On Tue 26-11-24 09:01:35, Anders Blomdell wrote:
> > > > On 2024-11-26 02:48, Philippe Troin wrote:
> > > > > On Sat, 2024-11-23 at 23:32 +0100, Anders Blomdell wrote:
> > > > > > When we (re)started one of our servers with 6.11.3-200.fc40.x86_64,
> > > > > > we got terrible performance (lots of nfs: server x.x.x.x not
> > > > > > responding).
> > > > > > What triggered this problem was virtual machines with NFS-mounted
> > > > > > qcow2 disks
> > > > > > that often triggered large readaheads that generates long streaks of
> > > > > > disk I/O
> > > > > > of 150-600 MB/s (4 ordinary HDD's) that filled up the buffer/cache
> > > > > > area of the
> > > > > > machine.
> > > > > >
> > > > > > A git bisect gave the following suspect:
> > > > > >
> > > > > > git bisect start
> > > > >
> > > > > 8< snip >8
> > > > >
> > > > > > # first bad commit: [7c877586da3178974a8a94577b6045a48377ff25]
> > > > > > readahead: properly shorten readahead when falling back to
> > > > > > do_page_cache_ra()
> > > > >
> > > > > Thank you for taking the time to bisect, this issue has been bugging
> > > > > me, but it's been non-deterministic, and hence hard to bisect.
> > > > >
> > > > > I'm seeing the same problem on 6.11.10 (and earlier 6.11.x kernels) in
> > > > > slightly different setups:
> > > > >
> > > > > (1) On machines mounting NFSv3 shared drives. The symptom here is a
> > > > > "nfs server XXX not responding, still trying" that never recovers
> > > > > (while the server remains pingable and other NFSv3 volumes from the
> > > > > hanging server can be mounted).
> > > > >
> > > > > (2) On VMs running over qemu-kvm, I see very long stalls (can be up to
> > > > > several minutes) on random I/O. These stalls eventually recover.
> > > > > >
> > > > > I've built a 6.11.10 kernel with
> > > > > 7c877586da3178974a8a94577b6045a48377ff25 reverted and I'm back to
> > > > > normal (no more NFS hangs, no more VM stalls).
> > > > > >
> > > > Some printk debugging, seems to indicate that the problem
> > > > is that the entity 'ra->size - (index - start)' goes
> > > > negative, which then gets cast to a very large unsigned
> > > > 'nr_to_read' when calling 'do_page_cache_ra'. Where the true
> > > > bug is still eludes me, though.
> > > >
> > > Thanks for the report, bisection and debugging! I think I see what's going
> > > on. read_pages() can go and reduce ra->size when ->readahead() callback
> > > failed to read all folios prepared for reading and apparently that's what
> > > happens with NFS and what can lead to negative argument to
> > > do_page_cache_ra(). Now at this point I'm of the opinion that updating
> > > ra->size / ra->async_size does more harm than good (because those values
> > > show *desired* readahead to happen, not exact number of pages read),
> > > furthermore it is problematic because ra can be shared by multiple
> > > processes and so updates are inherently racy. If we indeed need to store
> > > number of read pages, we could do it through ractl which is call-site local
> > > and used for communication between readahead generic functions and callers.
> > > But I have to do some more history digging and code reading to understand
> > > what is using this logic in read_pages().
> >
> > Hum, checking the history the update of ra->size has been added by Neil two
> > years ago in 9fd472af84ab ("mm: improve cleanup when ->readpages doesn't
> > process all pages"). Neil, the changelog seems as there was some real
> > motivation behind updating of ra->size in read_pages(). What was it? Now I
> > somewhat disagree with reducing ra->size in read_pages() because it seems
> > like a wrong place to do that and if we do need something like that,
> > readahead window sizing logic should rather be changed to take that into
> > account? But it all depends on what was the real rationale behind reducing
> > ra->size in read_pages()...
> >
>
> I cannot tell you much more than what the commit itself says.
> If there are any pages still in the rac, then we didn't try read-ahead
> and shouldn't pretend that we did. Else the numbers will be wrong.
>
> I think the important part of the patch was the
> delete_from_page_cache().
> Leaving pages in the page cache which we didn't try to read will cause
> a future read-ahead to skip those pages and they can only be read
> synchronously.

Yes, I agree with the delete_from_page_cache() part (although it seems a bit
of a band-aid, but I guess the KISS principle wins here).

> But maybe you are right that ra, being shared, shouldn't be modified
> like this.

OK, I was wondering whether this ra update isn't some way in which NFS tries
to steer the optimal readahead size for itself. It would be weird but
possible. If this is mostly a theoretical concern, then I'd be for dropping
the ra update in read_pages().

I took a small excursion into nfs_readahead() and that function itself seems
to read all the pages unless there's some error like ENOMEM. But before doing
this, nfs_readahead() does:

	ret = nfs_netfs_readahead(ractl);
	if (!ret)
		goto out;

And that is more interesting because if the inode has a netfs cache, we will
do netfs_readahead(ractl) and return 0. So whatever netfs_readahead() reads
is the final result.
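As an aside, the wrap-around Anders spotted above is easy to demonstrate in
isolation. Here is a tiny userspace sketch (made-up numbers, and a stub
standing in for do_page_cache_ra(), so this is not the actual mm/readahead.c
code) of how the shrunken ra->size turns into a huge unsigned nr_to_read:

    #include <stdio.h>

    /* Stub standing in for do_page_cache_ra(); it only prints the count
     * it was asked to read. */
    static void stub_do_page_cache_ra(unsigned long nr_to_read)
    {
        printf("nr_to_read = %lu\n", nr_to_read);
    }

    int main(void)
    {
        unsigned long start = 1000;   /* index where this readahead began */
        unsigned long index = 1024;   /* how far we got before falling back */
        unsigned long ra_size = 32;   /* originally planned window size */

        /* ->readahead() consumed only part of the window, so read_pages()
         * shrank ra->size, here from 32 down to 16. */
        ra_size = 16;

        /* 16 - (1024 - 1000) is -8, but in unsigned long arithmetic it
         * wraps to 2^64 - 8 on 64-bit, i.e. a gigantic request. */
        stub_do_page_cache_ra(ra_size - (index - start));
        return 0;
    }

A request that large would match the long streaks of disk I/O from the
original report, and it is exactly the kind of thing that goes away if
read_pages() stops shrinking ra->size.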
Coming back to netfs_readahead(): that function actually seems to read only
PAGEVEC_SIZE folios (because that's what fits in its request structure) and
then aborts. Now unless you have fscache on tmpfs or really large folios in
the page cache, reading PAGEVEC_SIZE folios can be too little to get decent
performance. But that's somewhat beside the point of this thread. The fact is
that netfs can indeed read very few folios out of the readahead it was asked
to do. The question is how the generic readahead code should handle such a
case. Either we could say that such behavior is not really supported (outside
of error recovery, where performance is not an issue) and fix netfs to try
harder to submit all the folios the generic code asked it to read. Or we
could say the generic readahead code needs to accommodate such behavior in a
performant way, but then the readahead limitation should be communicated in
advance so that we avoid creating tons of folios only to discard them a
moment later. I guess the decision depends on how practical the "try to read
all folios" solution is for netfs... What do you think, guys?

								Honza
-- 
Jan Kara
SUSE Labs, CR