* Regression in NFS probably due to very large amounts of readahead
From: Anders Blomdell @ 2024-11-23 22:32 UTC
To: Jan Kara, Matthew Wilcox (Oracle), Andrew Morton, linux-fsdevel, linux-mm, linux-kernel

When we (re)started one of our servers with 6.11.3-200.fc40.x86_64, we got
terrible performance (lots of nfs: server x.x.x.x not responding).
What triggered this problem was virtual machines with NFS-mounted qcow2 disks
that often triggered large readaheads that generates long streaks of disk I/O
of 150-600 MB/s (4 ordinary HDD's) that filled up the buffer/cache area of the
machine.

A git bisect gave the following suspect:

git bisect start
# status: waiting for both good and bad commits
# bad: [8e24a758d14c0b1cd42ab0aea980a1030eea811f] Linux 6.11.3
git bisect bad 8e24a758d14c0b1cd42ab0aea980a1030eea811f
# status: waiting for good commit(s), bad commit known
# good: [8a886bee7aa574611df83a028ab435aeee071e00] Linux 6.10.11
git bisect good 8a886bee7aa574611df83a028ab435aeee071e00
# good: [0c3836482481200ead7b416ca80c68a29cfdaabd] Linux 6.10
git bisect good 0c3836482481200ead7b416ca80c68a29cfdaabd
# good: [f669aac34c5f76b58e6cad1fef0643e5ae16d413] Merge tag 'trace-v6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
git bisect good f669aac34c5f76b58e6cad1fef0643e5ae16d413
# bad: [78eb4ea25cd5fdbdae7eb9fdf87b99195ff67508] sysctl: treewide: constify the ctl_table argument of proc_handlers
git bisect bad 78eb4ea25cd5fdbdae7eb9fdf87b99195ff67508
# good: [acc5965b9ff8a1889f5b51466562896d59c6e1b9] Merge tag 'char-misc-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
git bisect good acc5965b9ff8a1889f5b51466562896d59c6e1b9
# good: [8e313211f7d46d42b6aa7601b972fe89dcc4a076] Merge tag 'pinctrl-v6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
git bisect good 8e313211f7d46d42b6aa7601b972fe89dcc4a076
# bad: [fbc90c042cd1dc7258ebfebe6d226017e5b5ac8c] Merge tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
git bisect bad fbc90c042cd1dc7258ebfebe6d226017e5b5ac8c
# good: [f416817197e102b9bc6118101c3be652dac01a44] kmsan: support SLAB_POISON
git bisect good f416817197e102b9bc6118101c3be652dac01a44
# bad: [f6a6de245fdb1dfb4307b0a80ce7fa35ba2c35a6] Docs/mm/damon/index: add links to admin-guide doc
git bisect bad f6a6de245fdb1dfb4307b0a80ce7fa35ba2c35a6
# bad: [a0b856b617c585b86a077aae5176c946e1462b7d] mm/ksm: optimize the chain()/chain_prune() interfaces
git bisect bad a0b856b617c585b86a077aae5176c946e1462b7d
# good: [b1a80f4be7691a1ea007e24ebb3c8ca2e4a20f00] kmsan: do not pass NULL pointers as 0
git bisect good b1a80f4be7691a1ea007e24ebb3c8ca2e4a20f00
# bad: [58540f5cde404f512c80fb7b868b12005f0e2747] readahead: simplify gotos in page_cache_sync_ra()
git bisect bad 58540f5cde404f512c80fb7b868b12005f0e2747
# bad: [7c877586da3178974a8a94577b6045a48377ff25] readahead: properly shorten readahead when falling back to do_page_cache_ra()
git bisect bad 7c877586da3178974a8a94577b6045a48377ff25
# good: [ee86814b0562f18255b55c5e6a01a022895994cf] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
git bisect good ee86814b0562f18255b55c5e6a01a022895994cf
# good: [901a269ff3d59c9ee0e6be35c6044dc4bf2c0fdf] filemap: fix page_cache_next_miss() when no hole found
git bisect good 901a269ff3d59c9ee0e6be35c6044dc4bf2c0fdf
# first bad commit: [7c877586da3178974a8a94577b6045a48377ff25] readahead: properly shorten readahead when falling back to do_page_cache_ra()

I would much appreciate some guidance on how to proceed to track down what
goes wrong.

Best regards

Anders Blomdell
* Re: Regression in NFS probably due to very large amounts of readahead 2024-11-23 22:32 Regression in NFS probably due to very large amounts of readahead Anders Blomdell @ 2024-11-26 1:48 ` Philippe Troin 2024-11-26 8:01 ` Anders Blomdell 0 siblings, 1 reply; 16+ messages in thread From: Philippe Troin @ 2024-11-26 1:48 UTC (permalink / raw) To: Anders Blomdell, Jan Kara, Matthew Wilcox (Oracle), Andrew Morton, linux-fsdevel, linux-mm, linux-kernel Cc: Jan Kara On Sat, 2024-11-23 at 23:32 +0100, Anders Blomdell wrote: > When we (re)started one of our servers with 6.11.3-200.fc40.x86_64, > we got terrible performance (lots of nfs: server x.x.x.x not > responding). > What triggered this problem was virtual machines with NFS-mounted > qcow2 disks > that often triggered large readaheads that generates long streaks of > disk I/O > of 150-600 MB/s (4 ordinary HDD's) that filled up the buffer/cache > area of the > machine. > > A git bisect gave the following suspect: > > git bisect start 8< snip >8 > # first bad commit: [7c877586da3178974a8a94577b6045a48377ff25] > readahead: properly shorten readahead when falling back to > do_page_cache_ra() Thank you for taking the time to bisect, this issue has been bugging me, but it's been non-deterministic, and hence hard to bisect. I'm seeing the same problem on 6.11.10 (and earlier 6.11.x kernels) in slightly different setups: (1) On machines mounting NFSv3 shared drives. The symptom here is a "nfs server XXX not responding, still trying" that never recovers (while the server remains pingable and other NFSv3 volumes from the hanging server can be mounted). (2) On VMs running over qemu-kvm, I see very long stalls (can be up to several minutes) on random I/O. These stalls eventually recover. I've built a 6.11.10 kernel with 7c877586da3178974a8a94577b6045a48377ff25 reverted and I'm back to normal (no more NFS hangs, no more VM stalls). Phil. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Regression in NFS probably due to very large amounts of readahead 2024-11-26 1:48 ` Philippe Troin @ 2024-11-26 8:01 ` Anders Blomdell 2024-11-26 10:37 ` Jan Kara 0 siblings, 1 reply; 16+ messages in thread From: Anders Blomdell @ 2024-11-26 8:01 UTC (permalink / raw) To: Philippe Troin, Jan Kara, Matthew Wilcox (Oracle), Andrew Morton, linux-fsdevel, linux-mm, linux-kernel On 2024-11-26 02:48, Philippe Troin wrote: > On Sat, 2024-11-23 at 23:32 +0100, Anders Blomdell wrote: >> When we (re)started one of our servers with 6.11.3-200.fc40.x86_64, >> we got terrible performance (lots of nfs: server x.x.x.x not >> responding). >> What triggered this problem was virtual machines with NFS-mounted >> qcow2 disks >> that often triggered large readaheads that generates long streaks of >> disk I/O >> of 150-600 MB/s (4 ordinary HDD's) that filled up the buffer/cache >> area of the >> machine. >> >> A git bisect gave the following suspect: >> >> git bisect start > > 8< snip >8 > >> # first bad commit: [7c877586da3178974a8a94577b6045a48377ff25] >> readahead: properly shorten readahead when falling back to >> do_page_cache_ra() > > Thank you for taking the time to bisect, this issue has been bugging > me, but it's been non-deterministic, and hence hard to bisect. > > I'm seeing the same problem on 6.11.10 (and earlier 6.11.x kernels) in > slightly different setups: > > (1) On machines mounting NFSv3 shared drives. The symptom here is a > "nfs server XXX not responding, still trying" that never recovers > (while the server remains pingable and other NFSv3 volumes from the > hanging server can be mounted). > > (2) On VMs running over qemu-kvm, I see very long stalls (can be up to > several minutes) on random I/O. These stalls eventually recover. > > I've built a 6.11.10 kernel with > 7c877586da3178974a8a94577b6045a48377ff25 reverted and I'm back to > normal (no more NFS hangs, no more VM stalls). > > Phil. Some printk debugging, seems to indicate that the problem is that the entity 'ra->size - (index - start)' goes negative, which then gets cast to a very large unsigned 'nr_to_read' when calling 'do_page_cache_ra'. Where the true bug is still eludes me, though. /Anders ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Regression in NFS probably due to very large amounts of readahead 2024-11-26 8:01 ` Anders Blomdell @ 2024-11-26 10:37 ` Jan Kara 2024-11-26 12:49 ` Anders Blomdell 2024-11-26 15:06 ` Jan Kara 0 siblings, 2 replies; 16+ messages in thread From: Jan Kara @ 2024-11-26 10:37 UTC (permalink / raw) To: Anders Blomdell Cc: Philippe Troin, Jan Kara, Matthew Wilcox (Oracle), Andrew Morton, linux-fsdevel, linux-mm, linux-kernel On Tue 26-11-24 09:01:35, Anders Blomdell wrote: > On 2024-11-26 02:48, Philippe Troin wrote: > > On Sat, 2024-11-23 at 23:32 +0100, Anders Blomdell wrote: > > > When we (re)started one of our servers with 6.11.3-200.fc40.x86_64, > > > we got terrible performance (lots of nfs: server x.x.x.x not > > > responding). > > > What triggered this problem was virtual machines with NFS-mounted > > > qcow2 disks > > > that often triggered large readaheads that generates long streaks of > > > disk I/O > > > of 150-600 MB/s (4 ordinary HDD's) that filled up the buffer/cache > > > area of the > > > machine. > > > > > > A git bisect gave the following suspect: > > > > > > git bisect start > > > > 8< snip >8 > > > > > # first bad commit: [7c877586da3178974a8a94577b6045a48377ff25] > > > readahead: properly shorten readahead when falling back to > > > do_page_cache_ra() > > > > Thank you for taking the time to bisect, this issue has been bugging > > me, but it's been non-deterministic, and hence hard to bisect. > > > > I'm seeing the same problem on 6.11.10 (and earlier 6.11.x kernels) in > > slightly different setups: > > > > (1) On machines mounting NFSv3 shared drives. The symptom here is a > > "nfs server XXX not responding, still trying" that never recovers > > (while the server remains pingable and other NFSv3 volumes from the > > hanging server can be mounted). > > > > (2) On VMs running over qemu-kvm, I see very long stalls (can be up to > > several minutes) on random I/O. These stalls eventually recover. > > > > I've built a 6.11.10 kernel with > > 7c877586da3178974a8a94577b6045a48377ff25 reverted and I'm back to > > normal (no more NFS hangs, no more VM stalls). > > > Some printk debugging, seems to indicate that the problem > is that the entity 'ra->size - (index - start)' goes > negative, which then gets cast to a very large unsigned > 'nr_to_read' when calling 'do_page_cache_ra'. Where the true > bug is still eludes me, though. Thanks for the report, bisection and debugging! I think I see what's going on. read_pages() can go and reduce ra->size when ->readahead() callback failed to read all folios prepared for reading and apparently that's what happens with NFS and what can lead to negative argument to do_page_cache_ra(). Now at this point I'm of the opinion that updating ra->size / ra->async_size does more harm than good (because those values show *desired* readahead to happen, not exact number of pages read), furthermore it is problematic because ra can be shared by multiple processes and so updates are inherently racy. If we indeed need to store number of read pages, we could do it through ractl which is call-site local and used for communication between readahead generic functions and callers. But I have to do some more history digging and code reading to understand what is using this logic in read_pages(). Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 16+ messages in thread
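To make the failure mode Jan describes concrete, the following is a minimal stand-alone model of the sequence (plain userspace C, not kernel code: the ra_state struct, the 16-page window and the "3 of 11 pages consumed" split are invented for illustration; the start/index/async values are chosen to mirror the first entry of the debug output posted later in this thread):

#include <stdio.h>

/* simplified stand-in for the relevant fields of struct file_ra_state */
struct ra_state {
	unsigned int size;		/* readahead window, in pages */
	unsigned int async_size;
};

int main(void)
{
	struct ra_state ra = { .size = 16, .async_size = 5 };
	unsigned long start = 173941;	/* pgoff_t is unsigned long */
	unsigned long index = start;

	/*
	 * page_cache_ra_order() allocates folios and advances 'index' until
	 * an allocation fails (e.g. because a page is already present); say
	 * it covered 11 pages before the failure that sends it to the
	 * fallback path.
	 */
	index += 11;

	/*
	 * read_pages() hands the prepared folios to ->readahead().  If the
	 * filesystem (here: NFS) consumes only 3 of the 11 pages, the cleanup
	 * added in 9fd472af84ab subtracts the remaining 8 from ra->size.
	 */
	ra.size -= 11 - 3;		/* 16 -> 8 */

	/* the fallback then computes the argument for do_page_cache_ra() */
	unsigned long nr_to_read = ra.size - (index - start);

	printf("nr_to_read = %lu (%ld as signed)\n",
	       nr_to_read, (long)nr_to_read);
	return 0;
}

On a 64-bit build this prints "nr_to_read = 18446744073709551613 (-3 as signed)": the subtraction is done in unsigned arithmetic, so the "negative" result becomes an enormous nr_to_read argument for do_page_cache_ra(), which is what can turn the fallback into a very large readahead.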
* Re: Regression in NFS probably due to very large amounts of readahead 2024-11-26 10:37 ` Jan Kara @ 2024-11-26 12:49 ` Anders Blomdell 2024-11-26 13:24 ` Anders Blomdell 2024-11-26 15:00 ` Jan Kara 2024-11-26 15:06 ` Jan Kara 1 sibling, 2 replies; 16+ messages in thread From: Anders Blomdell @ 2024-11-26 12:49 UTC (permalink / raw) To: Jan Kara Cc: Philippe Troin, Matthew Wilcox (Oracle), Andrew Morton, linux-fsdevel, linux-mm, linux-kernel On 2024-11-26 11:37, Jan Kara wrote: > On Tue 26-11-24 09:01:35, Anders Blomdell wrote: >> On 2024-11-26 02:48, Philippe Troin wrote: >>> On Sat, 2024-11-23 at 23:32 +0100, Anders Blomdell wrote: >>>> When we (re)started one of our servers with 6.11.3-200.fc40.x86_64, >>>> we got terrible performance (lots of nfs: server x.x.x.x not >>>> responding). >>>> What triggered this problem was virtual machines with NFS-mounted >>>> qcow2 disks >>>> that often triggered large readaheads that generates long streaks of >>>> disk I/O >>>> of 150-600 MB/s (4 ordinary HDD's) that filled up the buffer/cache >>>> area of the >>>> machine. >>>> >>>> A git bisect gave the following suspect: >>>> >>>> git bisect start >>> >>> 8< snip >8 >>> >>>> # first bad commit: [7c877586da3178974a8a94577b6045a48377ff25] >>>> readahead: properly shorten readahead when falling back to >>>> do_page_cache_ra() >>> >>> Thank you for taking the time to bisect, this issue has been bugging >>> me, but it's been non-deterministic, and hence hard to bisect. >>> >>> I'm seeing the same problem on 6.11.10 (and earlier 6.11.x kernels) in >>> slightly different setups: >>> >>> (1) On machines mounting NFSv3 shared drives. The symptom here is a >>> "nfs server XXX not responding, still trying" that never recovers >>> (while the server remains pingable and other NFSv3 volumes from the >>> hanging server can be mounted). >>> >>> (2) On VMs running over qemu-kvm, I see very long stalls (can be up to >>> several minutes) on random I/O. These stalls eventually recover. >>> >>> I've built a 6.11.10 kernel with >>> 7c877586da3178974a8a94577b6045a48377ff25 reverted and I'm back to >>> normal (no more NFS hangs, no more VM stalls). >>> >> Some printk debugging, seems to indicate that the problem >> is that the entity 'ra->size - (index - start)' goes >> negative, which then gets cast to a very large unsigned >> 'nr_to_read' when calling 'do_page_cache_ra'. Where the true >> bug is still eludes me, though. > > Thanks for the report, bisection and debugging! I think I see what's going > on. read_pages() can go and reduce ra->size when ->readahead() callback > failed to read all folios prepared for reading and apparently that's what > happens with NFS and what can lead to negative argument to > do_page_cache_ra(). Now at this point I'm of the opinion that updating > ra->size / ra->async_size does more harm than good (because those values > show *desired* readahead to happen, not exact number of pages read), > furthermore it is problematic because ra can be shared by multiple > processes and so updates are inherently racy. If we indeed need to store > number of read pages, we could do it through ractl which is call-site local > and used for communication between readahead generic functions and callers. > But I have to do some more history digging and code reading to understand > what is using this logic in read_pages(). > > Honza Good, look forward to a quick revert, and don't forget to CC GKH, so I get kernels recent that work ASAP. Regards /Anders ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Regression in NFS probably due to very large amounts of readahead 2024-11-26 12:49 ` Anders Blomdell @ 2024-11-26 13:24 ` Anders Blomdell 2024-11-26 15:00 ` Jan Kara 1 sibling, 0 replies; 16+ messages in thread From: Anders Blomdell @ 2024-11-26 13:24 UTC (permalink / raw) To: Jan Kara Cc: Philippe Troin, Matthew Wilcox (Oracle), Andrew Morton, linux-fsdevel, linux-mm, linux-kernel On 2024-11-26 13:49, Anders Blomdell wrote: > > > On 2024-11-26 11:37, Jan Kara wrote: >> On Tue 26-11-24 09:01:35, Anders Blomdell wrote: >>> On 2024-11-26 02:48, Philippe Troin wrote: >>>> On Sat, 2024-11-23 at 23:32 +0100, Anders Blomdell wrote: >>>>> When we (re)started one of our servers with 6.11.3-200.fc40.x86_64, >>>>> we got terrible performance (lots of nfs: server x.x.x.x not >>>>> responding). >>>>> What triggered this problem was virtual machines with NFS-mounted >>>>> qcow2 disks >>>>> that often triggered large readaheads that generates long streaks of >>>>> disk I/O >>>>> of 150-600 MB/s (4 ordinary HDD's) that filled up the buffer/cache >>>>> area of the >>>>> machine. >>>>> >>>>> A git bisect gave the following suspect: >>>>> >>>>> git bisect start >>>> >>>> 8< snip >8 >>>> >>>>> # first bad commit: [7c877586da3178974a8a94577b6045a48377ff25] >>>>> readahead: properly shorten readahead when falling back to >>>>> do_page_cache_ra() >>>> >>>> Thank you for taking the time to bisect, this issue has been bugging >>>> me, but it's been non-deterministic, and hence hard to bisect. >>>> >>>> I'm seeing the same problem on 6.11.10 (and earlier 6.11.x kernels) in >>>> slightly different setups: >>>> >>>> (1) On machines mounting NFSv3 shared drives. The symptom here is a >>>> "nfs server XXX not responding, still trying" that never recovers >>>> (while the server remains pingable and other NFSv3 volumes from the >>>> hanging server can be mounted). >>>> >>>> (2) On VMs running over qemu-kvm, I see very long stalls (can be up to >>>> several minutes) on random I/O. These stalls eventually recover. >>>> >>>> I've built a 6.11.10 kernel with >>>> 7c877586da3178974a8a94577b6045a48377ff25 reverted and I'm back to >>>> normal (no more NFS hangs, no more VM stalls). >>>> >>> Some printk debugging, seems to indicate that the problem >>> is that the entity 'ra->size - (index - start)' goes >>> negative, which then gets cast to a very large unsigned >>> 'nr_to_read' when calling 'do_page_cache_ra'. Where the true >>> bug is still eludes me, though. >> >> Thanks for the report, bisection and debugging! I think I see what's going >> on. read_pages() can go and reduce ra->size when ->readahead() callback >> failed to read all folios prepared for reading and apparently that's what >> happens with NFS and what can lead to negative argument to >> do_page_cache_ra(). Now at this point I'm of the opinion that updating >> ra->size / ra->async_size does more harm than good (because those values >> show *desired* readahead to happen, not exact number of pages read), >> furthermore it is problematic because ra can be shared by multiple >> processes and so updates are inherently racy. If we indeed need to store >> number of read pages, we could do it through ractl which is call-site local >> and used for communication between readahead generic functions and callers. >> But I have to do some more history digging and code reading to understand >> what is using this logic in read_pages(). >> >> Honza > Good, look forward to a quick revert, and don't forget to CC GKH, so I get kernels recent that work ASAP. 
BTW, here is the output of the problematic reads from my printk modified
kernel, all the good ones omitted:

nov 13:49:11 fay-02 kernel: mm/readahead.c:490 000000002cdf0a09: nr_to_read=-3 size=8 index=173952 mark=173947 start=173941 async=5 err=-17
nov 13:49:12 fay-02 kernel: mm/readahead.c:490 000000002cdf0a09: nr_to_read=-7 size=20 index=4158252 mark=4158225 start=4158225 async=20 err=-17
nov 13:49:16 fay-02 kernel: mm/readahead.c:490 0000000036189388: nr_to_read=-8 size=4 index=17978832 mark=17978820 start=17978820 async=4 err=-17
nov 13:49:19 fay-02 kernel: mm/readahead.c:490 00000000ce741f0d: nr_to_read=-5 size=8 index=3074784 mark=3074771 start=3074771 async=8 err=-17
nov 13:49:21 fay-02 kernel: mm/readahead.c:490 00000000ce741f0d: nr_to_read=-4 size=6 index=3087040 mark=3087030 start=3087030 async=6 err=-17
nov 13:49:23 fay-02 kernel: mm/readahead.c:490 0000000036189388: nr_to_read=-2 size=16 index=16118408 mark=16118405 start=16118390 async=10 err=-17
nov 13:49:24 fay-02 kernel: mm/readahead.c:490 0000000036189388: nr_to_read=-10 size=16 index=20781128 mark=20781118 start=20781102 async=16 err=-17
nov 13:49:24 fay-02 kernel: mm/readahead.c:490 0000000036189388: nr_to_read=-13 size=16 index=20679424 mark=20679411 start=20679395 async=10 err=-17
nov 13:49:25 fay-02 kernel: mm/readahead.c:490 0000000036189388: nr_to_read=-9 size=4 index=20792116 mark=20792103 start=20792103 async=4 err=-17
nov 13:50:22 fay-02 kernel: mm/readahead.c:490 000000009b8f0763: nr_to_read=-7 size=4 index=4172 mark=4167 start=4161 async=1 err=-17
nov 13:50:24 fay-02 kernel: mm/readahead.c:490 00000000295f3a99: nr_to_read=-7 size=4 index=4108 mark=4097 start=4097 async=1 err=-17
nov 13:50:24 fay-02 kernel: mm/readahead.c:490 00000000295f3a99: nr_to_read=-7 size=4 index=4428 mark=4417 start=4417 async=4 err=-17
nov 13:56:48 fay-02 kernel: mm/readahead.c:490 000000009b8f0763: nr_to_read=-10 size=18 index=85071484 mark=85071456 start=85071456 async=18 err=-17

--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -485,7 +485,21 @@ void page_cache_ra_order(struct readahead_control *ractl,
 	if (!err)
 		return;
 fallback:
-	do_page_cache_ra(ractl, ra->size - (index - start), ra->async_size);
+	long nr_to_read = ra->size - (index - start);
+	if (index > mark) {
+		printk("%s:%d %p: "
+		       "nr_to_read=%ld "
+		       "size=%d index=%ld mark=%ld start=%ld async=%d err=%d",
+		       __FILE__, __LINE__,
+		       ractl->mapping->host,
+		       nr_to_read,
+		       ra->size, index, mark, start, ra->async_size, err);
+	}
+	if (nr_to_read < 0) {
+		printk("SKIP");
+		return;
+	}
+	do_page_cache_ra(ractl, nr_to_read, ra->async_size);
 }
 
 static unsigned long ractl_max_pages(struct readahead_control *ractl,

Regards

/Anders
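To read these lines against the patch above: the printed nr_to_read is size - (index - start), so for the first entry it is 8 - (173952 - 173941) = -3, and since the unpatched call site does this arithmetic in unsigned types, do_page_cache_ra() would have been asked for 2^64 - 3 pages rather than -3. The err=-17 in every entry is -EEXIST, which presumably means ra_alloc_folio() stopped early because a page was already present in the page cache, i.e. exactly the case where the fallback path is taken after read_pages() has already shrunk ra->size for the folios NFS did not consume.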
* Re: Regression in NFS probably due to very large amounts of readahead 2024-11-26 12:49 ` Anders Blomdell 2024-11-26 13:24 ` Anders Blomdell @ 2024-11-26 15:00 ` Jan Kara 1 sibling, 0 replies; 16+ messages in thread From: Jan Kara @ 2024-11-26 15:00 UTC (permalink / raw) To: Anders Blomdell Cc: Jan Kara, Philippe Troin, Matthew Wilcox (Oracle), Andrew Morton, linux-fsdevel, linux-mm, linux-kernel On Tue 26-11-24 13:49:09, Anders Blomdell wrote: > > > On 2024-11-26 11:37, Jan Kara wrote: > > On Tue 26-11-24 09:01:35, Anders Blomdell wrote: > > > On 2024-11-26 02:48, Philippe Troin wrote: > > > > On Sat, 2024-11-23 at 23:32 +0100, Anders Blomdell wrote: > > > > > When we (re)started one of our servers with 6.11.3-200.fc40.x86_64, > > > > > we got terrible performance (lots of nfs: server x.x.x.x not > > > > > responding). > > > > > What triggered this problem was virtual machines with NFS-mounted > > > > > qcow2 disks > > > > > that often triggered large readaheads that generates long streaks of > > > > > disk I/O > > > > > of 150-600 MB/s (4 ordinary HDD's) that filled up the buffer/cache > > > > > area of the > > > > > machine. > > > > > > > > > > A git bisect gave the following suspect: > > > > > > > > > > git bisect start > > > > > > > > 8< snip >8 > > > > > > > > > # first bad commit: [7c877586da3178974a8a94577b6045a48377ff25] > > > > > readahead: properly shorten readahead when falling back to > > > > > do_page_cache_ra() > > > > > > > > Thank you for taking the time to bisect, this issue has been bugging > > > > me, but it's been non-deterministic, and hence hard to bisect. > > > > > > > > I'm seeing the same problem on 6.11.10 (and earlier 6.11.x kernels) in > > > > slightly different setups: > > > > > > > > (1) On machines mounting NFSv3 shared drives. The symptom here is a > > > > "nfs server XXX not responding, still trying" that never recovers > > > > (while the server remains pingable and other NFSv3 volumes from the > > > > hanging server can be mounted). > > > > > > > > (2) On VMs running over qemu-kvm, I see very long stalls (can be up to > > > > several minutes) on random I/O. These stalls eventually recover. > > > > > > > > I've built a 6.11.10 kernel with > > > > 7c877586da3178974a8a94577b6045a48377ff25 reverted and I'm back to > > > > normal (no more NFS hangs, no more VM stalls). > > > > > > > Some printk debugging, seems to indicate that the problem > > > is that the entity 'ra->size - (index - start)' goes > > > negative, which then gets cast to a very large unsigned > > > 'nr_to_read' when calling 'do_page_cache_ra'. Where the true > > > bug is still eludes me, though. > > > > Thanks for the report, bisection and debugging! I think I see what's going > > on. read_pages() can go and reduce ra->size when ->readahead() callback > > failed to read all folios prepared for reading and apparently that's what > > happens with NFS and what can lead to negative argument to > > do_page_cache_ra(). Now at this point I'm of the opinion that updating > > ra->size / ra->async_size does more harm than good (because those values > > show *desired* readahead to happen, not exact number of pages read), > > furthermore it is problematic because ra can be shared by multiple > > processes and so updates are inherently racy. If we indeed need to store > > number of read pages, we could do it through ractl which is call-site local > > and used for communication between readahead generic functions and callers. 
> > But I have to do some more history digging and code reading to understand
> > what is using this logic in read_pages().
> >
> > 							Honza
> Good, look forward to a quick revert, and don't forget to CC GKH, so I
> get kernels recent that work ASAP.

Well, Greg won't merge any patch until it gets upstream. I've sent the
revert now to Andrew (the MM maintainer); once it lands in Linus' tree,
Greg will take it, since the stable tree is CCed.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: Regression in NFS probably due to very large amounts of readahead 2024-11-26 10:37 ` Jan Kara 2024-11-26 12:49 ` Anders Blomdell @ 2024-11-26 15:06 ` Jan Kara 2024-11-26 15:28 ` Anders Blomdell 2024-11-27 8:37 ` NeilBrown 1 sibling, 2 replies; 16+ messages in thread From: Jan Kara @ 2024-11-26 15:06 UTC (permalink / raw) To: Anders Blomdell Cc: Philippe Troin, Jan Kara, Matthew Wilcox (Oracle), Andrew Morton, linux-fsdevel, linux-mm, linux-kernel, NeilBrown On Tue 26-11-24 11:37:19, Jan Kara wrote: > On Tue 26-11-24 09:01:35, Anders Blomdell wrote: > > On 2024-11-26 02:48, Philippe Troin wrote: > > > On Sat, 2024-11-23 at 23:32 +0100, Anders Blomdell wrote: > > > > When we (re)started one of our servers with 6.11.3-200.fc40.x86_64, > > > > we got terrible performance (lots of nfs: server x.x.x.x not > > > > responding). > > > > What triggered this problem was virtual machines with NFS-mounted > > > > qcow2 disks > > > > that often triggered large readaheads that generates long streaks of > > > > disk I/O > > > > of 150-600 MB/s (4 ordinary HDD's) that filled up the buffer/cache > > > > area of the > > > > machine. > > > > > > > > A git bisect gave the following suspect: > > > > > > > > git bisect start > > > > > > 8< snip >8 > > > > > > > # first bad commit: [7c877586da3178974a8a94577b6045a48377ff25] > > > > readahead: properly shorten readahead when falling back to > > > > do_page_cache_ra() > > > > > > Thank you for taking the time to bisect, this issue has been bugging > > > me, but it's been non-deterministic, and hence hard to bisect. > > > > > > I'm seeing the same problem on 6.11.10 (and earlier 6.11.x kernels) in > > > slightly different setups: > > > > > > (1) On machines mounting NFSv3 shared drives. The symptom here is a > > > "nfs server XXX not responding, still trying" that never recovers > > > (while the server remains pingable and other NFSv3 volumes from the > > > hanging server can be mounted). > > > > > > (2) On VMs running over qemu-kvm, I see very long stalls (can be up to > > > several minutes) on random I/O. These stalls eventually recover. > > > > > > I've built a 6.11.10 kernel with > > > 7c877586da3178974a8a94577b6045a48377ff25 reverted and I'm back to > > > normal (no more NFS hangs, no more VM stalls). > > > > > Some printk debugging, seems to indicate that the problem > > is that the entity 'ra->size - (index - start)' goes > > negative, which then gets cast to a very large unsigned > > 'nr_to_read' when calling 'do_page_cache_ra'. Where the true > > bug is still eludes me, though. > > Thanks for the report, bisection and debugging! I think I see what's going > on. read_pages() can go and reduce ra->size when ->readahead() callback > failed to read all folios prepared for reading and apparently that's what > happens with NFS and what can lead to negative argument to > do_page_cache_ra(). Now at this point I'm of the opinion that updating > ra->size / ra->async_size does more harm than good (because those values > show *desired* readahead to happen, not exact number of pages read), > furthermore it is problematic because ra can be shared by multiple > processes and so updates are inherently racy. If we indeed need to store > number of read pages, we could do it through ractl which is call-site local > and used for communication between readahead generic functions and callers. > But I have to do some more history digging and code reading to understand > what is using this logic in read_pages(). 
Hum, checking the history the update of ra->size has been added by Neil two
years ago in 9fd472af84ab ("mm: improve cleanup when ->readpages doesn't
process all pages"). Neil, the changelog seems as there was some real
motivation behind updating of ra->size in read_pages(). What was it? Now I
somewhat disagree with reducing ra->size in read_pages() because it seems
like a wrong place to do that and if we do need something like that,
readahead window sizing logic should rather be changed to take that into
account? But it all depends on what was the real rationale behind reducing
ra->size in read_pages()...

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
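For reference, the logic being asked about here is the cleanup loop that 9fd472af84ab added to read_pages(). Roughly paraphrased from mm/readahead.c (not verbatim; helper names and details differ between kernel versions, so check the tree for the exact code), it does something like:

	if (aops->readahead) {
		aops->readahead(rac);
		/*
		 * Clean up folios the filesystem left unprocessed: adjust
		 * ->ra so it reflects what actually got read, and drop the
		 * untouched folios from the page cache again.
		 */
		while ((folio = readahead_folio(rac)) != NULL) {
			unsigned long nr = folio_nr_pages(folio);

			folio_get(folio);
			rac->ra->size -= nr;
			if (rac->ra->async_size >= nr) {
				rac->ra->async_size -= nr;
				filemap_remove_folio(folio);
			}
			folio_unlock(folio);
			folio_put(folio);
		}
	}

So every folio that ->readahead() did not consume is subtracted from ra->size (and, if it falls in the async part of the window, removed from the page cache), which is how ra->size can end up smaller than (index - start) by the time page_cache_ra_order() reaches its fallback.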
* Re: Regression in NFS probably due to very large amounts of readahead 2024-11-26 15:06 ` Jan Kara @ 2024-11-26 15:28 ` Anders Blomdell 2024-11-26 16:55 ` Matthew Wilcox 2024-11-27 8:37 ` NeilBrown 1 sibling, 1 reply; 16+ messages in thread From: Anders Blomdell @ 2024-11-26 15:28 UTC (permalink / raw) To: Jan Kara Cc: Philippe Troin, Matthew Wilcox (Oracle), Andrew Morton, linux-fsdevel, linux-mm, linux-kernel, NeilBrown On 2024-11-26 16:06, Jan Kara wrote: > On Tue 26-11-24 11:37:19, Jan Kara wrote: >> On Tue 26-11-24 09:01:35, Anders Blomdell wrote: >>> On 2024-11-26 02:48, Philippe Troin wrote: >>>> On Sat, 2024-11-23 at 23:32 +0100, Anders Blomdell wrote: >>>>> When we (re)started one of our servers with 6.11.3-200.fc40.x86_64, >>>>> we got terrible performance (lots of nfs: server x.x.x.x not >>>>> responding). >>>>> What triggered this problem was virtual machines with NFS-mounted >>>>> qcow2 disks >>>>> that often triggered large readaheads that generates long streaks of >>>>> disk I/O >>>>> of 150-600 MB/s (4 ordinary HDD's) that filled up the buffer/cache >>>>> area of the >>>>> machine. >>>>> >>>>> A git bisect gave the following suspect: >>>>> >>>>> git bisect start >>>> >>>> 8< snip >8 >>>> >>>>> # first bad commit: [7c877586da3178974a8a94577b6045a48377ff25] >>>>> readahead: properly shorten readahead when falling back to >>>>> do_page_cache_ra() >>>> >>>> Thank you for taking the time to bisect, this issue has been bugging >>>> me, but it's been non-deterministic, and hence hard to bisect. >>>> >>>> I'm seeing the same problem on 6.11.10 (and earlier 6.11.x kernels) in >>>> slightly different setups: >>>> >>>> (1) On machines mounting NFSv3 shared drives. The symptom here is a >>>> "nfs server XXX not responding, still trying" that never recovers >>>> (while the server remains pingable and other NFSv3 volumes from the >>>> hanging server can be mounted). >>>> >>>> (2) On VMs running over qemu-kvm, I see very long stalls (can be up to >>>> several minutes) on random I/O. These stalls eventually recover. >>>> >>>> I've built a 6.11.10 kernel with >>>> 7c877586da3178974a8a94577b6045a48377ff25 reverted and I'm back to >>>> normal (no more NFS hangs, no more VM stalls). >>>> >>> Some printk debugging, seems to indicate that the problem >>> is that the entity 'ra->size - (index - start)' goes >>> negative, which then gets cast to a very large unsigned >>> 'nr_to_read' when calling 'do_page_cache_ra'. Where the true >>> bug is still eludes me, though. >> >> Thanks for the report, bisection and debugging! I think I see what's going >> on. read_pages() can go and reduce ra->size when ->readahead() callback >> failed to read all folios prepared for reading and apparently that's what >> happens with NFS and what can lead to negative argument to >> do_page_cache_ra(). Now at this point I'm of the opinion that updating >> ra->size / ra->async_size does more harm than good (because those values >> show *desired* readahead to happen, not exact number of pages read), >> furthermore it is problematic because ra can be shared by multiple >> processes and so updates are inherently racy. If we indeed need to store >> number of read pages, we could do it through ractl which is call-site local >> and used for communication between readahead generic functions and callers. >> But I have to do some more history digging and code reading to understand >> what is using this logic in read_pages(). 
>
> Hum, checking the history the update of ra->size has been added by Neil two
> years ago in 9fd472af84ab ("mm: improve cleanup when ->readpages doesn't
> process all pages"). Neil, the changelog seems as there was some real
> motivation behind updating of ra->size in read_pages(). What was it? Now I
> somewhat disagree with reducing ra->size in read_pages() because it seems
> like a wrong place to do that and if we do need something like that,
> readahead window sizing logic should rather be changed to take that into
> account? But it all depends on what was the real rationale behind reducing
> ra->size in read_pages()...
>
> 								Honza

My (rather limited) understanding of the patch is that it was intended to read those pages
that didn't get read because the allocation of a bigger folio failed, while not redoing what
readpages already did; how it was actually going to accomplish that is still unclear to me,
but I even don't even quite understand the comment...

/*
 * If there were already pages in the page cache, then we may have
 * left some gaps. Let the regular readahead code take care of this
 * situation.
 */

the reason for an unchanged async_size is also beyond my understanding.

/Anders
* Re: Regression in NFS probably due to very large amounts of readahead 2024-11-26 15:28 ` Anders Blomdell @ 2024-11-26 16:55 ` Matthew Wilcox 2024-11-26 17:26 ` Anders Blomdell 0 siblings, 1 reply; 16+ messages in thread From: Matthew Wilcox @ 2024-11-26 16:55 UTC (permalink / raw) To: Anders Blomdell Cc: Jan Kara, Philippe Troin, Andrew Morton, linux-fsdevel, linux-mm, linux-kernel, NeilBrown On Tue, Nov 26, 2024 at 04:28:04PM +0100, Anders Blomdell wrote: > On 2024-11-26 16:06, Jan Kara wrote: > > Hum, checking the history the update of ra->size has been added by Neil two > > years ago in 9fd472af84ab ("mm: improve cleanup when ->readpages doesn't > > process all pages"). Neil, the changelog seems as there was some real > > motivation behind updating of ra->size in read_pages(). What was it? Now I > > somewhat disagree with reducing ra->size in read_pages() because it seems > > like a wrong place to do that and if we do need something like that, > > readahead window sizing logic should rather be changed to take that into > > account? But it all depends on what was the real rationale behind reducing > > ra->size in read_pages()... > > My (rather limited) understanding of the patch is that it was intended to read those pages > that didn't get read because the allocation of a bigger folio failed, while not redoing what > readpages already did; how it was actually going to accomplish that is still unclear to me, > but I even don't even quite understand the comment... > > /* > * If there were already pages in the page cache, then we may have > * left some gaps. Let the regular readahead code take care of this > * situation. > */ > > the reason for an unchanged async_size is also beyond my understanding. This isn't because we couldn't allocate a folio, this is when we allocated folios, tried to read them and we failed to submit the I/O. This is a pretty rare occurrence under normal conditions. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Regression in NFS probably due to very large amounts of readahead 2024-11-26 16:55 ` Matthew Wilcox @ 2024-11-26 17:26 ` Anders Blomdell 2024-11-26 18:42 ` Matthew Wilcox 0 siblings, 1 reply; 16+ messages in thread From: Anders Blomdell @ 2024-11-26 17:26 UTC (permalink / raw) To: Matthew Wilcox Cc: Jan Kara, Philippe Troin, Andrew Morton, linux-fsdevel, linux-mm, linux-kernel, NeilBrown On 2024-11-26 17:55, Matthew Wilcox wrote: > On Tue, Nov 26, 2024 at 04:28:04PM +0100, Anders Blomdell wrote: >> On 2024-11-26 16:06, Jan Kara wrote: >>> Hum, checking the history the update of ra->size has been added by Neil two >>> years ago in 9fd472af84ab ("mm: improve cleanup when ->readpages doesn't >>> process all pages"). Neil, the changelog seems as there was some real >>> motivation behind updating of ra->size in read_pages(). What was it? Now I >>> somewhat disagree with reducing ra->size in read_pages() because it seems >>> like a wrong place to do that and if we do need something like that, >>> readahead window sizing logic should rather be changed to take that into >>> account? But it all depends on what was the real rationale behind reducing >>> ra->size in read_pages()... >> >> My (rather limited) understanding of the patch is that it was intended to read those pages >> that didn't get read because the allocation of a bigger folio failed, while not redoing what >> readpages already did; how it was actually going to accomplish that is still unclear to me, >> but I even don't even quite understand the comment... >> >> /* >> * If there were already pages in the page cache, then we may have >> * left some gaps. Let the regular readahead code take care of this >> * situation. >> */ >> >> the reason for an unchanged async_size is also beyond my understanding. > > This isn't because we couldn't allocate a folio, this is when we > allocated folios, tried to read them and we failed to submit the I/O. > This is a pretty rare occurrence under normal conditions. I beg to differ, the code is reached when there is no folio support or ra->size < 4 (not considered in this discussion) or falling throug when !err, err is set by: err = ra_alloc_folio(ractl, index, mark, order, gfp); if (err) break; isn't the reading done by: read_pages(ractl); which does not set err! /Anders ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Regression in NFS probably due to very large amounts of readahead 2024-11-26 17:26 ` Anders Blomdell @ 2024-11-26 18:42 ` Matthew Wilcox 2024-11-26 20:22 ` Anders Blomdell 2024-11-27 7:55 ` Anders Blomdell 0 siblings, 2 replies; 16+ messages in thread From: Matthew Wilcox @ 2024-11-26 18:42 UTC (permalink / raw) To: Anders Blomdell Cc: Jan Kara, Philippe Troin, Andrew Morton, linux-fsdevel, linux-mm, linux-kernel, NeilBrown On Tue, Nov 26, 2024 at 06:26:13PM +0100, Anders Blomdell wrote: > On 2024-11-26 17:55, Matthew Wilcox wrote: > > On Tue, Nov 26, 2024 at 04:28:04PM +0100, Anders Blomdell wrote: > > > On 2024-11-26 16:06, Jan Kara wrote: > > > > Hum, checking the history the update of ra->size has been added by Neil two > > > > years ago in 9fd472af84ab ("mm: improve cleanup when ->readpages doesn't > > > > process all pages"). Neil, the changelog seems as there was some real > > > > motivation behind updating of ra->size in read_pages(). What was it? Now I > > > > somewhat disagree with reducing ra->size in read_pages() because it seems > > > > like a wrong place to do that and if we do need something like that, > > > > readahead window sizing logic should rather be changed to take that into > > > > account? But it all depends on what was the real rationale behind reducing > > > > ra->size in read_pages()... > > > > > > My (rather limited) understanding of the patch is that it was intended to read those pages > > > that didn't get read because the allocation of a bigger folio failed, while not redoing what > > > readpages already did; how it was actually going to accomplish that is still unclear to me, > > > but I even don't even quite understand the comment... > > > > > > /* > > > * If there were already pages in the page cache, then we may have > > > * left some gaps. Let the regular readahead code take care of this > > > * situation. > > > */ > > > > > > the reason for an unchanged async_size is also beyond my understanding. > > > > This isn't because we couldn't allocate a folio, this is when we > > allocated folios, tried to read them and we failed to submit the I/O. > > This is a pretty rare occurrence under normal conditions. > > I beg to differ, the code is reached when there is > no folio support or ra->size < 4 (not considered in > this discussion) or falling throug when !err, err > is set by: > > err = ra_alloc_folio(ractl, index, mark, order, gfp); > if (err) > break; > > isn't the reading done by: > > read_pages(ractl); > > which does not set err! You're misunderstanding. Yes, read_pages() is called when we fail to allocate a fresh folio; either because there's already one in the page cache, or because -ENOMEM (or if we raced to install one), but it's also called when all folios are normally allocated. Here: /* * Now start the IO. We ignore I/O errors - if the folio is not * uptodate then the caller will launch read_folio again, and * will then handle the error. */ read_pages(ractl); So at the point that read_pages() is called, all folios that ractl describes are present in the page cache, locked and !uptodate. After calling aops->readahead() in read_pages(), most filesystems will have consumed all folios described by ractl. It seems that NFS is choosing not to submit some folios, so rather than leave them sitting around in the page cache, Neil decided that we should remove them from the page cache. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Regression in NFS probably due to very large amounts of readahead 2024-11-26 18:42 ` Matthew Wilcox 2024-11-26 20:22 ` Anders Blomdell @ 2024-11-27 7:55 ` Anders Blomdell 1 sibling, 0 replies; 16+ messages in thread From: Anders Blomdell @ 2024-11-27 7:55 UTC (permalink / raw) To: Matthew Wilcox Cc: Jan Kara, Philippe Troin, Andrew Morton, linux-fsdevel, linux-mm, linux-kernel, NeilBrown On 2024-11-26 19:42, Matthew Wilcox wrote: > On Tue, Nov 26, 2024 at 06:26:13PM +0100, Anders Blomdell wrote: >> On 2024-11-26 17:55, Matthew Wilcox wrote: >>> On Tue, Nov 26, 2024 at 04:28:04PM +0100, Anders Blomdell wrote: >>>> On 2024-11-26 16:06, Jan Kara wrote: >>>>> Hum, checking the history the update of ra->size has been added by Neil two >>>>> years ago in 9fd472af84ab ("mm: improve cleanup when ->readpages doesn't >>>>> process all pages"). Neil, the changelog seems as there was some real >>>>> motivation behind updating of ra->size in read_pages(). What was it? Now I >>>>> somewhat disagree with reducing ra->size in read_pages() because it seems >>>>> like a wrong place to do that and if we do need something like that, >>>>> readahead window sizing logic should rather be changed to take that into >>>>> account? But it all depends on what was the real rationale behind reducing >>>>> ra->size in read_pages()... >>>> >>>> My (rather limited) understanding of the patch is that it was intended to read those pages >>>> that didn't get read because the allocation of a bigger folio failed, while not redoing what >>>> readpages already did; how it was actually going to accomplish that is still unclear to me, >>>> but I even don't even quite understand the comment... >>>> >>>> /* >>>> * If there were already pages in the page cache, then we may have >>>> * left some gaps. Let the regular readahead code take care of this >>>> * situation. >>>> */ >>>> >>>> the reason for an unchanged async_size is also beyond my understanding. >>> >>> This isn't because we couldn't allocate a folio, this is when we >>> allocated folios, tried to read them and we failed to submit the I/O. >>> This is a pretty rare occurrence under normal conditions. >> >> I beg to differ, the code is reached when there is >> no folio support or ra->size < 4 (not considered in >> this discussion) or falling throug when !err, err >> is set by: >> >> err = ra_alloc_folio(ractl, index, mark, order, gfp); >> if (err) >> break; >> >> isn't the reading done by: >> >> read_pages(ractl); >> >> which does not set err! > > You're misunderstanding. Yes, read_pages() is called when we fail to > allocate a fresh folio; either because there's already one in the > page cache, or because -ENOMEM (or if we raced to install one), but > it's also called when all folios are normally allocated. Here: > > /* > * Now start the IO. We ignore I/O errors - if the folio is not > * uptodate then the caller will launch read_folio again, and > * will then handle the error. > */ > read_pages(ractl); > > So at the point that read_pages() is called, all folios that ractl > describes are present in the page cache, locked and !uptodate. > > After calling aops->readahead() in read_pages(), most filesystems will > have consumed all folios described by ractl. It seems that NFS is > choosing not to submit some folios, so rather than leave them sitting > around in the page cache, Neil decided that we should remove them from > the page cache. More like me not reading the comments properly, sorry. 
What I thought I said was that the problematic code in the call to
do_page_cache_ra was reached when the folio allocation returned an error.

Sorry for not being clear, and thanks for your patience.

/Anders
* Re: Regression in NFS probably due to very large amounts of readahead 2024-11-26 15:06 ` Jan Kara 2024-11-26 15:28 ` Anders Blomdell @ 2024-11-27 8:37 ` NeilBrown 2024-11-27 11:06 ` Jan Kara 1 sibling, 1 reply; 16+ messages in thread From: NeilBrown @ 2024-11-27 8:37 UTC (permalink / raw) To: Jan Kara Cc: Anders Blomdell, Philippe Troin, Jan Kara, Matthew Wilcox (Oracle), Andrew Morton, linux-fsdevel, linux-mm, linux-kernel On Wed, 27 Nov 2024, Jan Kara wrote: > On Tue 26-11-24 11:37:19, Jan Kara wrote: > > On Tue 26-11-24 09:01:35, Anders Blomdell wrote: > > > On 2024-11-26 02:48, Philippe Troin wrote: > > > > On Sat, 2024-11-23 at 23:32 +0100, Anders Blomdell wrote: > > > > > When we (re)started one of our servers with 6.11.3-200.fc40.x86_64, > > > > > we got terrible performance (lots of nfs: server x.x.x.x not > > > > > responding). > > > > > What triggered this problem was virtual machines with NFS-mounted > > > > > qcow2 disks > > > > > that often triggered large readaheads that generates long streaks of > > > > > disk I/O > > > > > of 150-600 MB/s (4 ordinary HDD's) that filled up the buffer/cache > > > > > area of the > > > > > machine. > > > > > > > > > > A git bisect gave the following suspect: > > > > > > > > > > git bisect start > > > > > > > > 8< snip >8 > > > > > > > > > # first bad commit: [7c877586da3178974a8a94577b6045a48377ff25] > > > > > readahead: properly shorten readahead when falling back to > > > > > do_page_cache_ra() > > > > > > > > Thank you for taking the time to bisect, this issue has been bugging > > > > me, but it's been non-deterministic, and hence hard to bisect. > > > > > > > > I'm seeing the same problem on 6.11.10 (and earlier 6.11.x kernels) in > > > > slightly different setups: > > > > > > > > (1) On machines mounting NFSv3 shared drives. The symptom here is a > > > > "nfs server XXX not responding, still trying" that never recovers > > > > (while the server remains pingable and other NFSv3 volumes from the > > > > hanging server can be mounted). > > > > > > > > (2) On VMs running over qemu-kvm, I see very long stalls (can be up to > > > > several minutes) on random I/O. These stalls eventually recover. > > > > > > > > I've built a 6.11.10 kernel with > > > > 7c877586da3178974a8a94577b6045a48377ff25 reverted and I'm back to > > > > normal (no more NFS hangs, no more VM stalls). > > > > > > > Some printk debugging, seems to indicate that the problem > > > is that the entity 'ra->size - (index - start)' goes > > > negative, which then gets cast to a very large unsigned > > > 'nr_to_read' when calling 'do_page_cache_ra'. Where the true > > > bug is still eludes me, though. > > > > Thanks for the report, bisection and debugging! I think I see what's going > > on. read_pages() can go and reduce ra->size when ->readahead() callback > > failed to read all folios prepared for reading and apparently that's what > > happens with NFS and what can lead to negative argument to > > do_page_cache_ra(). Now at this point I'm of the opinion that updating > > ra->size / ra->async_size does more harm than good (because those values > > show *desired* readahead to happen, not exact number of pages read), > > furthermore it is problematic because ra can be shared by multiple > > processes and so updates are inherently racy. If we indeed need to store > > number of read pages, we could do it through ractl which is call-site local > > and used for communication between readahead generic functions and callers. 
> > But I have to do some more history digging and code reading to understand
> > what is using this logic in read_pages().
>
> Hum, checking the history the update of ra->size has been added by Neil two
> years ago in 9fd472af84ab ("mm: improve cleanup when ->readpages doesn't
> process all pages"). Neil, the changelog seems as there was some real
> motivation behind updating of ra->size in read_pages(). What was it? Now I
> somewhat disagree with reducing ra->size in read_pages() because it seems
> like a wrong place to do that and if we do need something like that,
> readahead window sizing logic should rather be changed to take that into
> account? But it all depends on what was the real rationale behind reducing
> ra->size in read_pages()...
>

I cannot tell you much more than what the commit itself says.
If there are any pages still in the rac, then we didn't try read-ahead
and shouldn't pretend that we did. Else the numbers will be wrong.

I think the important part of the patch was the delete_from_page_cache().
Leaving pages in the page cache which we didn't try to read will cause
a future read-ahead to skip those pages and they can only be read
synchronously.

But maybe you are right that ra, being shared, shouldn't be modified
like this.

Thanks,
NeilBrown
* Re: Regression in NFS probably due to very large amounts of readahead 2024-11-27 8:37 ` NeilBrown @ 2024-11-27 11:06 ` Jan Kara 0 siblings, 0 replies; 16+ messages in thread From: Jan Kara @ 2024-11-27 11:06 UTC (permalink / raw) To: NeilBrown Cc: Jan Kara, Anders Blomdell, Philippe Troin, Matthew Wilcox (Oracle), Andrew Morton, linux-fsdevel, linux-mm, linux-kernel, David Howells, netfs Added David Howells to CC since this seems to be mostly netfs related. On Wed 27-11-24 19:37:10, NeilBrown wrote: > On Wed, 27 Nov 2024, Jan Kara wrote: > > On Tue 26-11-24 11:37:19, Jan Kara wrote: > > > On Tue 26-11-24 09:01:35, Anders Blomdell wrote: > > > > On 2024-11-26 02:48, Philippe Troin wrote: > > > > > On Sat, 2024-11-23 at 23:32 +0100, Anders Blomdell wrote: > > > > > > When we (re)started one of our servers with 6.11.3-200.fc40.x86_64, > > > > > > we got terrible performance (lots of nfs: server x.x.x.x not > > > > > > responding). > > > > > > What triggered this problem was virtual machines with NFS-mounted > > > > > > qcow2 disks > > > > > > that often triggered large readaheads that generates long streaks of > > > > > > disk I/O > > > > > > of 150-600 MB/s (4 ordinary HDD's) that filled up the buffer/cache > > > > > > area of the > > > > > > machine. > > > > > > > > > > > > A git bisect gave the following suspect: > > > > > > > > > > > > git bisect start > > > > > > > > > > 8< snip >8 > > > > > > > > > > > # first bad commit: [7c877586da3178974a8a94577b6045a48377ff25] > > > > > > readahead: properly shorten readahead when falling back to > > > > > > do_page_cache_ra() > > > > > > > > > > Thank you for taking the time to bisect, this issue has been bugging > > > > > me, but it's been non-deterministic, and hence hard to bisect. > > > > > > > > > > I'm seeing the same problem on 6.11.10 (and earlier 6.11.x kernels) in > > > > > slightly different setups: > > > > > > > > > > (1) On machines mounting NFSv3 shared drives. The symptom here is a > > > > > "nfs server XXX not responding, still trying" that never recovers > > > > > (while the server remains pingable and other NFSv3 volumes from the > > > > > hanging server can be mounted). > > > > > > > > > > (2) On VMs running over qemu-kvm, I see very long stalls (can be up to > > > > > several minutes) on random I/O. These stalls eventually recover. > > > > > > > > > > I've built a 6.11.10 kernel with > > > > > 7c877586da3178974a8a94577b6045a48377ff25 reverted and I'm back to > > > > > normal (no more NFS hangs, no more VM stalls). > > > > > > > > > Some printk debugging, seems to indicate that the problem > > > > is that the entity 'ra->size - (index - start)' goes > > > > negative, which then gets cast to a very large unsigned > > > > 'nr_to_read' when calling 'do_page_cache_ra'. Where the true > > > > bug is still eludes me, though. > > > > > > Thanks for the report, bisection and debugging! I think I see what's going > > > on. read_pages() can go and reduce ra->size when ->readahead() callback > > > failed to read all folios prepared for reading and apparently that's what > > > happens with NFS and what can lead to negative argument to > > > do_page_cache_ra(). Now at this point I'm of the opinion that updating > > > ra->size / ra->async_size does more harm than good (because those values > > > show *desired* readahead to happen, not exact number of pages read), > > > furthermore it is problematic because ra can be shared by multiple > > > processes and so updates are inherently racy. 
If we indeed need to store
> > > number of read pages, we could do it through ractl which is call-site local
> > > and used for communication between readahead generic functions and callers.
> > > But I have to do some more history digging and code reading to understand
> > > what is using this logic in read_pages().
> >
> > Hum, checking the history the update of ra->size has been added by Neil two
> > years ago in 9fd472af84ab ("mm: improve cleanup when ->readpages doesn't
> > process all pages"). Neil, the changelog seems as there was some real
> > motivation behind updating of ra->size in read_pages(). What was it? Now I
> > somewhat disagree with reducing ra->size in read_pages() because it seems
> > like a wrong place to do that and if we do need something like that,
> > readahead window sizing logic should rather be changed to take that into
> > account? But it all depends on what was the real rationale behind reducing
> > ra->size in read_pages()...
> >
>
> I cannot tell you much more than what the commit itself says.
> If there are any pages still in the rac, then we didn't try read-ahead
> and shouldn't pretend that we did. Else the numbers will be wrong.
>
> I think the important part of the patch was the delete_from_page_cache().
> Leaving pages in the page cache which we didn't try to read will cause
> a future read-ahead to skip those pages and they can only be read
> synchronously.

Yes, I agree with the delete_from_page_cache() part (although it seems a
bit of a band aid, but I guess the KISS principle wins here).

> But maybe you are right that ra, being shared, shouldn't be modified
> like this.

OK, I was wondering whether this ra update isn't some way for NFS to steer
the optimal readahead size. It would be weird but possible. If this was
mostly a theoretical concern, then I'd be for dropping the ra update in
read_pages().

I did a small excursion to nfs_readahead() and that function itself seems
to read all the pages unless there's some error like ENOMEM. But before
doing this nfs_readahead() does:

	ret = nfs_netfs_readahead(ractl);
	if (!ret)
		goto out;

And that is more interesting because if the inode has netfs cache, we will
do netfs_readahead(ractl) and return 0. So whatever netfs_readahead() reads
is the final result. And that function actually seems to read only
PAGEVEC_SIZE folios (because that's what fits in its request structure) and
aborts. Now unless you have fscache in tmpfs or you have really large
folios in the page cache, reading PAGEVEC_SIZE folios can be too small to
get decent performance. But that's somewhat beside the point of this
thread.

The fact is that netfs can indeed read very few folios from the readahead
it was asked to do. The question is how the generic readahead code should
handle such a case. Either we could say that such behavior is not really
supported (besides error recovery, where performance is not an issue) and
fix netfs to try harder to submit all the folios the generic code asked it
to read. Or we can say the generic readahead code needs to accommodate such
behavior in a performant way, but then the readahead limitation should be
communicated in advance so that we avoid creating tons of folios only to
discard them a while later. I guess the decision depends on how practical
the "try to read all folios" solution is for netfs...

What do you think guys?

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR