From: Jan Kara <jack@suse.cz>
To: Anders Blomdell <anders.blomdell@gmail.com>
Cc: Philippe Troin <phil@fifi.org>, Jan Kara <jack@suse.cz>,
"Matthew Wilcox (Oracle)" <willy@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, NeilBrown <neilb@suse.de>
Subject: Re: Regression in NFS probably due to very large amounts of readahead
Date: Tue, 26 Nov 2024 16:06:13 +0100 [thread overview]
Message-ID: <20241126150613.a4b57y2qmolapsuc@quack3> (raw)
In-Reply-To: <20241126103719.bvd2umwarh26pmb3@quack3>
On Tue 26-11-24 11:37:19, Jan Kara wrote:
> On Tue 26-11-24 09:01:35, Anders Blomdell wrote:
> > On 2024-11-26 02:48, Philippe Troin wrote:
> > > On Sat, 2024-11-23 at 23:32 +0100, Anders Blomdell wrote:
> > > > When we (re)started one of our servers with 6.11.3-200.fc40.x86_64,
> > > > we got terrible performance (lots of nfs: server x.x.x.x not
> > > > responding).
> > > > What triggered this problem was virtual machines with NFS-mounted
> > > > qcow2 disks
> > > > that often triggered large readaheads that generates long streaks of
> > > > disk I/O
> > > > of 150-600 MB/s (4 ordinary HDD's) that filled up the buffer/cache
> > > > area of the
> > > > machine.
> > > >
> > > > A git bisect gave the following suspect:
> > > >
> > > > git bisect start
> > >
> > > 8< snip >8
> > >
> > > > # first bad commit: [7c877586da3178974a8a94577b6045a48377ff25]
> > > > readahead: properly shorten readahead when falling back to
> > > > do_page_cache_ra()
> > >
> > > Thank you for taking the time to bisect, this issue has been bugging
> > > me, but it's been non-deterministic, and hence hard to bisect.
> > >
> > > I'm seeing the same problem on 6.11.10 (and earlier 6.11.x kernels) in
> > > slightly different setups:
> > >
> > > (1) On machines mounting NFSv3 shared drives. The symptom here is a
> > > "nfs server XXX not responding, still trying" that never recovers
> > > (while the server remains pingable and other NFSv3 volumes from the
> > > hanging server can be mounted).
> > >
> > > (2) On VMs running over qemu-kvm, I see very long stalls (can be up to
> > > several minutes) on random I/O. These stalls eventually recover.
> > >
> > > I've built a 6.11.10 kernel with
> > > 7c877586da3178974a8a94577b6045a48377ff25 reverted and I'm back to
> > > normal (no more NFS hangs, no more VM stalls).
> > >
> > Some printk debugging, seems to indicate that the problem
> > is that the entity 'ra->size - (index - start)' goes
> > negative, which then gets cast to a very large unsigned
> > 'nr_to_read' when calling 'do_page_cache_ra'. Where the true
> > bug is still eludes me, though.
>
> Thanks for the report, bisection and debugging! I think I see what's going
> on. read_pages() can go and reduce ra->size when ->readahead() callback
> failed to read all folios prepared for reading and apparently that's what
> happens with NFS and what can lead to negative argument to
> do_page_cache_ra(). Now at this point I'm of the opinion that updating
> ra->size / ra->async_size does more harm than good (because those values
> show *desired* readahead to happen, not exact number of pages read),
> furthermore it is problematic because ra can be shared by multiple
> processes and so updates are inherently racy. If we indeed need to store
> number of read pages, we could do it through ractl which is call-site local
> and used for communication between readahead generic functions and callers.
> But I have to do some more history digging and code reading to understand
> what is using this logic in read_pages().
Hum, checking the history the update of ra->size has been added by Neil two
years ago in 9fd472af84ab ("mm: improve cleanup when ->readpages doesn't
process all pages"). Neil, the changelog seems as there was some real
motivation behind updating of ra->size in read_pages(). What was it? Now I
somewhat disagree with reducing ra->size in read_pages() because it seems
like a wrong place to do that and if we do need something like that,
readahead window sizing logic should rather be changed to take that into
account? But it all depends on what was the real rationale behind reducing
ra->size in read_pages()...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
next prev parent reply other threads:[~2024-11-26 15:06 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-11-23 22:32 Anders Blomdell
2024-11-26 1:48 ` Philippe Troin
2024-11-26 8:01 ` Anders Blomdell
2024-11-26 10:37 ` Jan Kara
2024-11-26 12:49 ` Anders Blomdell
2024-11-26 13:24 ` Anders Blomdell
2024-11-26 15:00 ` Jan Kara
2024-11-26 15:06 ` Jan Kara [this message]
2024-11-26 15:28 ` Anders Blomdell
2024-11-26 16:55 ` Matthew Wilcox
2024-11-26 17:26 ` Anders Blomdell
2024-11-26 18:42 ` Matthew Wilcox
2024-11-26 20:22 ` Anders Blomdell
2024-11-27 7:55 ` Anders Blomdell
2024-11-27 8:37 ` NeilBrown
2024-11-27 11:06 ` Jan Kara
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20241126150613.a4b57y2qmolapsuc@quack3 \
--to=jack@suse.cz \
--cc=akpm@linux-foundation.org \
--cc=anders.blomdell@gmail.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=neilb@suse.de \
--cc=phil@fifi.org \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox