Re: Regression in NFS probably due to very large amounts of readahead


From: Anders Blomdell <anders.blomdell@gmail.com>
To: Jan Kara <jack@suse.cz>
Cc: Philippe Troin <phil@fifi.org>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, NeilBrown <neilb@suse.de>
Subject: Re: Regression in NFS probably due to very large amounts of readahead
Date: Tue, 26 Nov 2024 16:28:04 +0100
Message-ID: <fba6bc0c-2ea8-467c-b7ea-8810c9e13b84@gmail.com>
In-Reply-To: <20241126150613.a4b57y2qmolapsuc@quack3>



On 2024-11-26 16:06, Jan Kara wrote:
> On Tue 26-11-24 11:37:19, Jan Kara wrote:
>> On Tue 26-11-24 09:01:35, Anders Blomdell wrote:
>>> On 2024-11-26 02:48, Philippe Troin wrote:
>>>> On Sat, 2024-11-23 at 23:32 +0100, Anders Blomdell wrote:
>>>>> When we (re)started one of our servers with 6.11.3-200.fc40.x86_64,
>>>>> we got terrible performance (lots of nfs: server x.x.x.x not
>>>>> responding).
>>>>> What triggered this problem was virtual machines with NFS-mounted
>>>>> qcow2 disks
>>>>> that often triggered large readaheads that generates long streaks of
>>>>> disk I/O
>>>>> of 150-600 MB/s (4 ordinary HDD's) that filled up the buffer/cache
>>>>> area of the
>>>>> machine.
>>>>>
>>>>> A git bisect gave the following suspect:
>>>>>
>>>>> git bisect start
>>>>
>>>> 8< snip >8
>>>>
>>>>> # first bad commit: [7c877586da3178974a8a94577b6045a48377ff25]
>>>>> readahead: properly shorten readahead when falling back to
>>>>> do_page_cache_ra()
>>>>
>>>> Thank you for taking the time to bisect, this issue has been bugging
>>>> me, but it's been non-deterministic, and hence hard to bisect.
>>>>
>>>> I'm seeing the same problem on 6.11.10 (and earlier 6.11.x kernels) in
>>>> slightly different setups:
>>>>
>>>> (1) On machines mounting NFSv3 shared drives. The symptom here is a
>>>> "nfs server XXX not responding, still trying" that never recovers
>>>> (while the server remains pingable and other NFSv3 volumes from the
>>>> hanging server can be mounted).
>>>>
>>>> (2) On VMs running over qemu-kvm, I see very long stalls (can be up to
>>>> several minutes) on random I/O. These stalls eventually recover.
>>>>
>>>> I've built a 6.11.10 kernel with
>>>> 7c877586da3178974a8a94577b6045a48377ff25 reverted and I'm back to
>>>> normal (no more NFS hangs, no more VM stalls).
>>>>
>>> Some printk debugging, seems to indicate that the problem
>>> is that the entity 'ra->size - (index - start)' goes
>>> negative, which then gets cast to a very large unsigned
>>> 'nr_to_read' when calling 'do_page_cache_ra'. Where the true
>>> bug is still eludes me, though.
>>
>> Thanks for the report, bisection and debugging! I think I see what's going
>> on. read_pages() can go and reduce ra->size when ->readahead() callback
>> failed to read all folios prepared for reading and apparently that's what
>> happens with NFS and what can lead to negative argument to
>> do_page_cache_ra(). Now at this point I'm of the opinion that updating
>> ra->size / ra->async_size does more harm than good (because those values
>> show *desired* readahead to happen, not exact number of pages read),
>> furthermore it is problematic because ra can be shared by multiple
>> processes and so updates are inherently racy. If we indeed need to store
>> number of read pages, we could do it through ractl which is call-site local
>> and used for communication between readahead generic functions and callers.
>> But I have to do some more history digging and code reading to understand
>> what is using this logic in read_pages().
> 
> Hum, checking the history the update of ra->size has been added by Neil two
> years ago in 9fd472af84ab ("mm: improve cleanup when ->readpages doesn't
> process all pages"). Neil, the changelog seems as there was some real
> motivation behind updating of ra->size in read_pages(). What was it? Now I
> somewhat disagree with reducing ra->size in read_pages() because it seems
> like a wrong place to do that and if we do need something like that,
> readahead window sizing logic should rather be changed to take that into
> account? But it all depends on what was the real rationale behind reducing
> ra->size in read_pages()...
> 
> 								Honza
My (rather limited) understanding of the patch is that it was intended to read those pages
that didn't get read because the allocation of a bigger folio failed, while not redoing what
readpages already did. How it was actually going to accomplish that is still unclear to me,
and I don't even quite understand the comment...

	/*
	 * If there were already pages in the page cache, then we may have
	 * left some gaps.  Let the regular readahead code take care of this
	 * situation.
	 */

The reason for leaving async_size unchanged is also beyond my understanding.
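
The wraparound described in the quoted printk debugging ('ra->size - (index - start)'
going negative and turning into a very large unsigned 'nr_to_read') can be reproduced in
user space. What follows is only a minimal sketch: 'struct ra_sketch' and both helper
functions are hypothetical stand-ins rather than kernel code, the field types are assumed
to be an unsigned int size and unsigned long page indices, and the clamped variant is an
illustration of the arithmetic, not a proposed fix.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the readahead state; not the kernel struct. */
struct ra_sketch {
	unsigned long start;	/* first page index of the readahead window */
	unsigned int size;	/* desired window size, in pages */
};

/* Same shape as 'ra->size - (index - start)': plain unsigned subtraction. */
static unsigned long remaining_unchecked(const struct ra_sketch *ra,
					 unsigned long index)
{
	return ra->size - (index - ra->start);
}

/* Illustrative variant that returns 0 instead of wrapping around. */
static unsigned long remaining_clamped(const struct ra_sketch *ra,
				       unsigned long index)
{
	unsigned long done = index - ra->start;

	return done < ra->size ? ra->size - done : 0;
}

/* Walk through the scenario: the window shrinks after index has advanced. */
static void show_wraparound(void)
{
	struct ra_sketch ra = { .start = 100, .size = 64 };
	unsigned long index = ra.start + 40;	/* 40 pages already handled */

	printf("before shrink: %lu pages left\n",
	       remaining_unchecked(&ra, index));

	ra.size = 32;	/* assumed: something reduced size behind our back */

	printf("after shrink, unchecked: %lu\n",
	       remaining_unchecked(&ra, index));
	printf("after shrink, clamped:   %lu\n",
	       remaining_clamped(&ra, index));
	assert(remaining_clamped(&ra, index) == 0);
}
```

Calling show_wraparound() from a main() shows the unchecked value jumping from 24 pages
to a number just below ULONG_MAX once size drops below the distance already covered,
which is the kind of request size that would explain very long readahead streaks.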

/Anders
