From: Anders Blomdell <anders.blomdell@gmail.com>
To: Jan Kara <jack@suse.cz>
Cc: Philippe Troin <phil@fifi.org>,
"Matthew Wilcox (Oracle)" <willy@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: Regression in NFS probably due to very large amounts of readahead
Date: Tue, 26 Nov 2024 14:24:05 +0100
Message-ID: <fc8fca1c-d03e-4b11-84f4-5e7560086e42@gmail.com>
In-Reply-To: <6777d050-99a2-4f3c-b398-4b4271c427d5@gmail.com>
On 2024-11-26 13:49, Anders Blomdell wrote:
>
>
> On 2024-11-26 11:37, Jan Kara wrote:
>> On Tue 26-11-24 09:01:35, Anders Blomdell wrote:
>>> On 2024-11-26 02:48, Philippe Troin wrote:
>>>> On Sat, 2024-11-23 at 23:32 +0100, Anders Blomdell wrote:
>>>>> When we (re)started one of our servers with 6.11.3-200.fc40.x86_64,
>>>>> we got terrible performance (lots of nfs: server x.x.x.x not
>>>>> responding).
>>>>> What triggered this problem was virtual machines with NFS-mounted
>>>>> qcow2 disks that often triggered large readaheads, generating long
>>>>> streaks of disk I/O of 150-600 MB/s (4 ordinary HDDs) that filled
>>>>> up the buffer/cache area of the machine.
>>>>>
>>>>> A git bisect gave the following suspect:
>>>>>
>>>>> git bisect start
>>>>
>>>> 8< snip >8
>>>>
>>>>> # first bad commit: [7c877586da3178974a8a94577b6045a48377ff25]
>>>>> readahead: properly shorten readahead when falling back to
>>>>> do_page_cache_ra()
>>>>
>>>> Thank you for taking the time to bisect, this issue has been bugging
>>>> me, but it's been non-deterministic, and hence hard to bisect.
>>>>
>>>> I'm seeing the same problem on 6.11.10 (and earlier 6.11.x kernels) in
>>>> slightly different setups:
>>>>
>>>> (1) On machines mounting NFSv3 shared drives. The symptom here is a
>>>> "nfs server XXX not responding, still trying" that never recovers
>>>> (while the server remains pingable and other NFSv3 volumes from the
>>>> hanging server can be mounted).
>>>>
>>>> (2) On VMs running over qemu-kvm, I see very long stalls (can be up to
>>>> several minutes) on random I/O. These stalls eventually recover.
>>>>
>>>> I've built a 6.11.10 kernel with
>>>> 7c877586da3178974a8a94577b6045a48377ff25 reverted and I'm back to
>>>> normal (no more NFS hangs, no more VM stalls).
>>>>
>>> Some printk debugging seems to indicate that the problem
>>> is that the expression 'ra->size - (index - start)' goes
>>> negative, which then gets cast to a very large unsigned
>>> 'nr_to_read' when calling 'do_page_cache_ra'. Where the true
>>> bug is still eludes me, though.
>>
>> Thanks for the report, bisection and debugging! I think I see what's going
>> on. read_pages() can go and reduce ra->size when the ->readahead() callback
>> fails to read all folios prepared for reading, and apparently that's what
>> happens with NFS and what can lead to a negative argument to
>> do_page_cache_ra(). Now at this point I'm of the opinion that updating
>> ra->size / ra->async_size does more harm than good (because those values
>> show *desired* readahead to happen, not exact number of pages read),
>> furthermore it is problematic because ra can be shared by multiple
>> processes and so updates are inherently racy. If we indeed need to store
>> number of read pages, we could do it through ractl which is call-site local
>> and used for communication between readahead generic functions and callers.
>> But I have to do some more history digging and code reading to understand
>> what is using this logic in read_pages().
>>
>> Honza
> Good, I look forward to a quick revert, and don't forget to CC GKH, so I get recent kernels that work ASAP.

BTW, here is the output of the problematic reads from my printk-modified kernel, with all the good ones omitted:
nov 13:49:11 fay-02 kernel: mm/readahead.c:490 000000002cdf0a09: nr_to_read=-3 size=8 index=173952 mark=173947 start=173941 async=5 err=-17
nov 13:49:12 fay-02 kernel: mm/readahead.c:490 000000002cdf0a09: nr_to_read=-7 size=20 index=4158252 mark=4158225 start=4158225 async=20 err=-17
nov 13:49:16 fay-02 kernel: mm/readahead.c:490 0000000036189388: nr_to_read=-8 size=4 index=17978832 mark=17978820 start=17978820 async=4 err=-17
nov 13:49:19 fay-02 kernel: mm/readahead.c:490 00000000ce741f0d: nr_to_read=-5 size=8 index=3074784 mark=3074771 start=3074771 async=8 err=-17
nov 13:49:21 fay-02 kernel: mm/readahead.c:490 00000000ce741f0d: nr_to_read=-4 size=6 index=3087040 mark=3087030 start=3087030 async=6 err=-17
nov 13:49:23 fay-02 kernel: mm/readahead.c:490 0000000036189388: nr_to_read=-2 size=16 index=16118408 mark=16118405 start=16118390 async=10 err=-17
nov 13:49:24 fay-02 kernel: mm/readahead.c:490 0000000036189388: nr_to_read=-10 size=16 index=20781128 mark=20781118 start=20781102 async=16 err=-17
nov 13:49:24 fay-02 kernel: mm/readahead.c:490 0000000036189388: nr_to_read=-13 size=16 index=20679424 mark=20679411 start=20679395 async=10 err=-17
nov 13:49:25 fay-02 kernel: mm/readahead.c:490 0000000036189388: nr_to_read=-9 size=4 index=20792116 mark=20792103 start=20792103 async=4 err=-17
nov 13:50:22 fay-02 kernel: mm/readahead.c:490 000000009b8f0763: nr_to_read=-7 size=4 index=4172 mark=4167 start=4161 async=1 err=-17
nov 13:50:24 fay-02 kernel: mm/readahead.c:490 00000000295f3a99: nr_to_read=-7 size=4 index=4108 mark=4097 start=4097 async=1 err=-17
nov 13:50:24 fay-02 kernel: mm/readahead.c:490 00000000295f3a99: nr_to_read=-7 size=4 index=4428 mark=4417 start=4417 async=4 err=-17
nov 13:56:48 fay-02 kernel: mm/readahead.c:490 000000009b8f0763: nr_to_read=-10 size=18 index=85071484 mark=85071456 start=85071456 async=18 err=-17
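
The logged values can be cross-checked against the expression 'ra->size - (index - start)'. Below is a minimal userspace sketch, not kernel code: the function names are hypothetical, plain integer types stand in for the kernel's, and it assumes the callee takes its count as an 'unsigned long', which is what the "very large unsigned" observation earlier in the thread suggests.

```c
#include <stdio.h>

/*
 * Userspace model of the fallback argument in page_cache_ra_order().
 * The function names are hypothetical and exist only for illustration.
 */
static long fallback_nr_to_read(unsigned int size, unsigned long index,
				unsigned long start)
{
	/* Same expression as in the log: ra->size - (index - start). */
	return (long)size - (long)(index - start);
}

/*
 * What a callee declared with an 'unsigned long nr_to_read' parameter
 * (an assumption here) would receive for a negative value.
 */
static unsigned long as_unsigned_arg(long nr_to_read)
{
	return (unsigned long)nr_to_read;
}
```

With the first log line (size=8, index=173952, start=173941), fallback_nr_to_read() gives 8 - 11 = -3, matching the logged nr_to_read, and as_unsigned_arg(-3) is ULONG_MAX - 2, i.e. an enormous page count.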
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -485,7 +485,21 @@ void page_cache_ra_order(struct readahead_control *ractl,
 	if (!err)
 		return;
 fallback:
-	do_page_cache_ra(ractl, ra->size - (index - start), ra->async_size);
+	long nr_to_read = ra->size - (index - start);
+	if (index > mark) {
+		printk("%s:%d %p: "
+		       "nr_to_read=%ld "
+		       "size=%d index=%ld mark=%ld start=%ld async=%d err=%d",
+		       __FILE__, __LINE__,
+		       ractl->mapping->host,
+		       nr_to_read,
+		       ra->size, index, mark, start, ra->async_size, err);
+	}
+	if (nr_to_read < 0) {
+		printk("SKIP");
+		return;
+	}
+	do_page_cache_ra(ractl, nr_to_read, ra->async_size);
 }
 
 static unsigned long ractl_max_pages(struct readahead_control *ractl,
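
To illustrate the sequence Jan describes in the quoted text (the window size being reduced after the index has already advanced), here is a small userspace model. It is only a sketch under stated assumptions: 'struct ra_model', 'remaining_after_shrink', the original window size of 16 and the shrink amount of 12 are all hypothetical, chosen so that the result lines up with the third log line (size=4, index - start = 12, nr_to_read=-8).

```c
#include <stdio.h>

/* Hypothetical stand-in for the readahead state fields used here. */
struct ra_model {
	unsigned long start;	/* first page index of the window */
	unsigned int size;	/* desired window size in pages */
};

/*
 * Model of the suspected sequence: index has advanced 'advanced' pages
 * past start, then size is reduced by 'unread' pages (the adjustment
 * attributed to read_pages() in the thread), and only afterwards is the
 * fallback argument computed from the shrunken size.
 */
static long remaining_after_shrink(struct ra_model ra, unsigned long advanced,
				   unsigned int unread)
{
	unsigned long index = ra.start + advanced;

	ra.size -= unread;	/* window shrunk after index moved on */
	return (long)ra.size - (long)(index - ra.start);
}
```

With start=17978820, an assumed original size of 16, 12 pages advanced and 12 pages unread, the model returns 4 - 12 = -8; without the shrink it returns 16 - 12 = 4, a sane remaining count.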
Regards
/Anders