* Re: + fs-break-generic_file_buffered_read-up-into-multiple-functions.patch added to -mm tree
  [not found] <20201025220817.XxXVE%akpm@linux-foundation.org>
@ 2020-10-27 13:35 ` Matthew Wilcox
  2020-10-28 22:22   ` Andrew Morton
  0 siblings, 1 reply; 10+ messages in thread
From: Matthew Wilcox @ 2020-10-27 13:35 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, axboe, kent.overstreet

On Sun, Oct 25, 2020 at 03:08:17PM -0700, akpm@linux-foundation.org wrote:
> The patch titled
>      Subject: mm/filemap/c: break generic_file_buffered_read up into multiple functions
> has been added to the -mm tree.  Its filename is
>      fs-break-generic_file_buffered_read-up-into-multiple-functions.patch

Can we back this out?  It really makes the THP patchset unhappy.  I think
we can do something like this afterwards, but doing it this way round is
ridiculously hard.

Long explanation ...

If we readahead a THP and hit a read error, the iomap code currently
tries to split the page to handle the error on a more fine-grained
boundary.  We can't split a page while someone else holds a reference
to it, and generic/273 has 500 threads livelocking on the same page,
all looking it up, trying to lock it, sleeping, eventually getting the
lock, discovering somebody else has the lock, and retrying.

I changed from using wait_on_page_locked_killable() to
put_and_wait_on_page_locked(), so the threads all sleep without holding
a reference to the page.  It worked great, and was a nice simplification:

 mm/filemap.c | 69 ++++++++++++++++++-------------------------------------------------------------
 1 file changed, 20 insertions(+), 49 deletions(-)

With this reorganisation, not only do we try to take the page lock a
long way down the chain, we've also got a batch of pages that we're
holding a refcount on, so we need to call put_page() on the rest of the
batch and then call put_and_wait_on_page_locked() on the one page that
we need to wait for.  I'm sure it's possible to make this work, but
it's going to be completely unlike what I've been testing.  I basically
have to start again on doing these changes.

commit 9fd2f26c6d00c7f0c6c730ec10a421780fedaa79
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Mon Oct 12 16:36:11 2020 -0400

    mm/filemap: Don't hold a page reference while waiting for unlock

    In the upcoming THP patch series, if we find a !Uptodate page, it is
    because of a read error.  In this case, we want to split the THP into
    smaller pages so we can handle the error in as granular a fashion as
    possible.  But xfstests generic/273 defeats this strategy by having
    500 threads all sleeping on the same page, each waiting for their
    turn to split the page.  None of them will ever succeed because
    splitting a page requires that you hold the only reference to it.

    To fix this, use put_and_wait_on_page_locked() to sleep without
    holding a reference to the page.  Each of the readers will then go
    back and retry the page lookup after the page is unlocked.

    Since we now get the page lock a little earlier in
    generic_file_buffered_read(), we can eliminate a number of duplicate
    checks.  The original intent (commit ebded02788b5 ("avoid unnecessary
    calls to lock_page when waiting for IO to complete during a read"))
    behind getting the page lock later was to avoid re-locking the page
    after it has been brought uptodate by another thread.  We will still
    avoid that because we go through the normal lookup path again after
    the winning thread has brought the page uptodate.

    Using the "fail 10% of readaheads" patch, which will induce the
    !Uptodate case, I can detect no significant difference in the
    generic/273 test from applying this patch.

    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

diff --git a/mm/filemap.c b/mm/filemap.c
index e496833d3ede..a6a2132087cd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1358,14 +1358,6 @@ static int __wait_on_page_locked_async(struct page *page,
 	return ret;
 }
 
-static int wait_on_page_locked_async(struct page *page,
-				     struct wait_page_queue *wait)
-{
-	if (!PageLocked(page))
-		return 0;
-	return __wait_on_page_locked_async(compound_head(page), wait, false);
-}
-
 /**
  * put_and_wait_on_page_locked - Drop a reference and wait for it to be unlocked
  * @page: The page to wait for.
@@ -2279,34 +2271,37 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
 				put_page(page);
 				goto out;
 			}
-			error = wait_on_page_locked_async(page,
-							iocb->ki_waitq);
+			error = lock_page_async(page, iocb->ki_waitq);
+			if (error)
+				goto readpage_error;
+		} else if (iocb->ki_flags & IOCB_NOWAIT) {
+			put_page(page);
+			goto would_block;
 		} else {
-			if (iocb->ki_flags & IOCB_NOWAIT) {
-				put_page(page);
-				goto would_block;
+			if (!trylock_page(page)) {
+				put_and_wait_on_page_locked(page,
+							TASK_KILLABLE);
+				continue;
 			}
-			error = wait_on_page_locked_killable(page);
 		}
-		if (unlikely(error))
-			goto readpage_error;
 		if (PageUptodate(page))
-			goto page_ok;
+			goto uptodate;
+		if (!page->mapping) {
+			unlock_page(page);
+			put_page(page);
+			continue;
+		}
 
 		if (inode->i_blkbits == PAGE_SHIFT ||
 				!mapping->a_ops->is_partially_uptodate)
-			goto page_not_up_to_date;
+			goto readpage;
 		/* pipes can't handle partially uptodate pages */
 		if (unlikely(iov_iter_is_pipe(iter)))
-			goto page_not_up_to_date;
-		if (!trylock_page(page))
-			goto page_not_up_to_date;
-		/* Did it get truncated before we got the lock? */
-		if (!page->mapping)
-			goto page_not_up_to_date_locked;
+			goto readpage;
 		if (!mapping->a_ops->is_partially_uptodate(page,
 						offset, iter->count))
-			goto page_not_up_to_date_locked;
+			goto readpage;
+uptodate:
 		unlock_page(page);
 	}
 page_ok:
@@ -2372,30 +2367,6 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
 			goto out;
 		}
 		continue;
-
-page_not_up_to_date:
-		/* Get exclusive access to the page ... */
-		if (iocb->ki_flags & IOCB_WAITQ)
-			error = lock_page_async(page, iocb->ki_waitq);
-		else
-			error = lock_page_killable(page);
-		if (unlikely(error))
-			goto readpage_error;
-
-page_not_up_to_date_locked:
-		/* Did it get truncated before we got the lock? */
-		if (!page->mapping) {
-			unlock_page(page);
-			put_page(page);
-			continue;
-		}
-
-		/* Did somebody else fill it already? */
-		if (PageUptodate(page)) {
-			unlock_page(page);
-			goto page_ok;
-		}
-
 readpage:
 		if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT)) {
 			unlock_page(page);

^ permalink raw reply	[flat|nested] 10+ messages in thread
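The heart of the change is the wait pattern in the lookup loop. A condensed
before/after sketch, distilled from the diff above (error handling and the
IOCB_WAITQ/IOCB_NOWAIT cases are elided):

	/* Before: sleep on the page lock while still holding a page
	 * reference.  A sleeping reader's reference alone is enough to
	 * make split_huge_page() fail in the thread that holds the lock.
	 */
	error = wait_on_page_locked_killable(page);

	/* After: drop the reference before sleeping, then redo the page
	 * cache lookup from scratch once the page is unlocked.
	 */
	if (!trylock_page(page)) {
		put_and_wait_on_page_locked(page, TASK_KILLABLE);
		continue;	/* back to the top of the lookup loop */
	}

The retry after put_and_wait_on_page_locked() is what makes the scheme safe:
by the time a sleeping reader runs again, the page it slept on may have been
split or replaced, so it must not keep using the old struct page pointer.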
* Re: + fs-break-generic_file_buffered_read-up-into-multiple-functions.patch added to -mm tree
  2020-10-27 13:35 ` + fs-break-generic_file_buffered_read-up-into-multiple-functions.patch added to -mm tree Matthew Wilcox
@ 2020-10-28 22:22   ` Andrew Morton
  2020-10-28 22:26     ` Jens Axboe
  2020-10-28 22:50     ` Matthew Wilcox
  0 siblings, 2 replies; 10+ messages in thread
From: Andrew Morton @ 2020-10-28 22:22 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, axboe, kent.overstreet

On Tue, 27 Oct 2020 13:35:51 +0000 Matthew Wilcox <willy@infradead.org> wrote:

> On Sun, Oct 25, 2020 at 03:08:17PM -0700, akpm@linux-foundation.org wrote:
> > The patch titled
> >      Subject: mm/filemap/c: break generic_file_buffered_read up into multiple functions
> > has been added to the -mm tree.  Its filename is
> >      fs-break-generic_file_buffered_read-up-into-multiple-functions.patch
>
> Can we back this out?  It really makes the THP patchset unhappy.  I think
> we can do something like this afterwards, but doing it this way round is
> ridiculously hard.

Two concerns:

: On my test box, 4k buffered random reads go from ~150k to ~250k iops,
: and the improvements to big sequential reads are even bigger.

That's a big improvement!  We want that improvement.  Throwing it away
on behalf of an as-yet-unmerged feature patchset hurts.  Can we expect
that this improvement will be available post-that-patchset?  And when?

(This improvement is rather hard to believe, really - more details on
the test environment would be useful.  Can we expect that people will
in general see similar benefits, or was there something special about
the testing?)

Secondly,

: This incorporates the fix for IOCB_WAITQ handling that Jens just
: posted as well

Is this (mysterious, unreferenced!) fix against v1 of Kent's patchset,
or is it against mainline?  If it's against mainline then it's
presumably first priority.  Can we please get that sent out in a
standalone form asap?

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: + fs-break-generic_file_buffered_read-up-into-multiple-functions.patch added to -mm tree
  2020-10-28 22:22   ` Andrew Morton
@ 2020-10-28 22:26     ` Jens Axboe
  2020-10-29 13:57       ` Jens Axboe
  0 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2020-10-28 22:26 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox; +Cc: linux-mm, kent.overstreet

On 10/28/20 4:22 PM, Andrew Morton wrote:
> On Tue, 27 Oct 2020 13:35:51 +0000 Matthew Wilcox <willy@infradead.org> wrote:
>
>> On Sun, Oct 25, 2020 at 03:08:17PM -0700, akpm@linux-foundation.org wrote:
>>> The patch titled
>>>      Subject: mm/filemap/c: break generic_file_buffered_read up into multiple functions
>>> has been added to the -mm tree.  Its filename is
>>>      fs-break-generic_file_buffered_read-up-into-multiple-functions.patch
>>
>> Can we back this out?  It really makes the THP patchset unhappy.  I think
>> we can do something like this afterwards, but doing it this way round is
>> ridiculously hard.
>
> Two concerns:
>
> : On my test box, 4k buffered random reads go from ~150k to ~250k iops,
> : and the improvements to big sequential reads are even bigger.
>
> That's a big improvement!  We want that improvement.  Throwing it away
> on behalf of an as-yet-unmerged feature patchset hurts.  Can we expect
> that this improvement will be available post-that-patchset?  And when?
>
> (This improvement is rather hard to believe, really - more details on
> the test environment would be useful.  Can we expect that people will
> in general see similar benefits, or was there something special about
> the testing?)

I did see some wins when I tested this. I'll try and run some testing
tomorrow and report back. If there's something specifically you want to
see tested, let me know.

> Secondly,
>
> : This incorporates the fix for IOCB_WAITQ handling that Jens just
> : posted as well
>
> Is this (mysterious, unreferenced!) fix against v1 of Kent's patchset,
> or is it against mainline?  If it's against mainline then it's
> presumably first priority.  Can we please get that sent out in a
> standalone form asap?

It's this one:

commit 13bd691421bc191a402d2e0d3da5f248d170a632
Author: Jens Axboe <axboe@kernel.dk>
Date:   Sat Oct 17 08:31:29 2020 -0600

    mm: mark async iocb read as NOWAIT once some data has been copied

which is in Linus's tree. Kent spotted the issue and had it fixed in his
5.9-based patchset, which is probably where this confusion comes from.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 10+ messages in thread
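The gist of that commit, roughly paraphrased (a sketch of the idea, not a
verbatim quote of the mainline hunk): once an async (IOCB_WAITQ) buffered
read has copied some data, it can no longer safely go to sleep and return
-EIOCBQUEUED, because the caller's retry would start over and copy the same
data again, so the read is treated as NOWAIT from that point on:

	/* In generic_file_buffered_read(): if we've already copied some
	 * data for an IOCB_WAITQ request, don't go async on a later page.
	 * Return the partial byte count and let the caller retry from the
	 * new file offset instead.
	 */
	if (written && (iocb->ki_flags & IOCB_WAITQ))
		iocb->ki_flags |= IOCB_NOWAIT;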
* Re: + fs-break-generic_file_buffered_read-up-into-multiple-functions.patch added to -mm tree
  2020-10-28 22:26     ` Jens Axboe
@ 2020-10-29 13:57       ` Jens Axboe
  2020-10-29 14:57         ` Matthew Wilcox
  0 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2020-10-29 13:57 UTC (permalink / raw)
  To: Andrew Morton, Matthew Wilcox; +Cc: linux-mm, kent.overstreet

On 10/28/20 4:26 PM, Jens Axboe wrote:
> On 10/28/20 4:22 PM, Andrew Morton wrote:
>> On Tue, 27 Oct 2020 13:35:51 +0000 Matthew Wilcox <willy@infradead.org> wrote:
>>
>>> On Sun, Oct 25, 2020 at 03:08:17PM -0700, akpm@linux-foundation.org wrote:
>>>> The patch titled
>>>>      Subject: mm/filemap/c: break generic_file_buffered_read up into multiple functions
>>>> has been added to the -mm tree.  Its filename is
>>>>      fs-break-generic_file_buffered_read-up-into-multiple-functions.patch
>>>
>>> Can we back this out?  It really makes the THP patchset unhappy.  I think
>>> we can do something like this afterwards, but doing it this way round is
>>> ridiculously hard.
>>
>> Two concerns:
>>
>> : On my test box, 4k buffered random reads go from ~150k to ~250k iops,
>> : and the improvements to big sequential reads are even bigger.
>>
>> That's a big improvement!  We want that improvement.  Throwing it away
>> on behalf of an as-yet-unmerged feature patchset hurts.  Can we expect
>> that this improvement will be available post-that-patchset?  And when?
>>
>> (This improvement is rather hard to believe, really - more details on
>> the test environment would be useful.  Can we expect that people will
>> in general see similar benefits, or was there something special about
>> the testing?)
>
> I did see some wins when I tested this. I'll try and run some testing
> tomorrow and report back. If there's something specifically you want to
> see tested, let me know.

I did some testing, unfortunately it's _very_ hard to produce somewhat
consistent and good numbers as it quickly becomes a game of kswapd.
Here's a basic case of 4 threads doing 32k random reads:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    462 root      20   0       0      0      0 R  65.5   0.0   0:08.02 kswapd0
   2287 axboe     20   0 1303448   2176   1072 R  46.6   0.0   0:05.35 fio
   2289 axboe     20   0 1303456   2196   1092 D  46.6   0.0   0:05.34 fio
   2290 axboe     20   0 1303460   2216   1112 D  46.6   0.0   0:05.37 fio
   2288 axboe     20   0 1303452   2224   1120 R  45.9   0.0   0:05.33 fio

Sad face... Unfortunately once kswapd kicks in, performance also
plummets. This box only has 32G of ram, and you can fill that in less
than 10 seconds doing buffered reads like that.

I ran 4k and 32k testing, and using 1 and 4 threads. But given the above
sadness, it quickly ends up looking the same for me.

What I noticed in my initial testing of Kent's patches (which was
focused on correctness) was that a read+write verify workload had
consistently better read throughput.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: + fs-break-generic_file_buffered_read-up-into-multiple-functions.patch added to -mm tree
  2020-10-29 13:57       ` Jens Axboe
@ 2020-10-29 14:57         ` Matthew Wilcox
  2020-10-29 15:02           ` Jens Axboe
  0 siblings, 1 reply; 10+ messages in thread
From: Matthew Wilcox @ 2020-10-29 14:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, linux-mm, kent.overstreet

On Thu, Oct 29, 2020 at 07:57:34AM -0600, Jens Axboe wrote:
> On 10/28/20 4:26 PM, Jens Axboe wrote:
> > I did see some wins when I tested this. I'll try and run some testing
> > tomorrow and report back. If there's something specifically you want to
> > see tested, let me know.
>
> I did some testing, unfortunately it's _very_ hard to produce somewhat
> consistent and good numbers as it quickly becomes a game of kswapd.
> Here's a basic case of 4 threads doing 32k random reads:
>
>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>     462 root      20   0       0      0      0 R  65.5   0.0   0:08.02 kswapd0
>    2287 axboe     20   0 1303448   2176   1072 R  46.6   0.0   0:05.35 fio
>    2289 axboe     20   0 1303456   2196   1092 D  46.6   0.0   0:05.34 fio
>    2290 axboe     20   0 1303460   2216   1112 D  46.6   0.0   0:05.37 fio
>    2288 axboe     20   0 1303452   2224   1120 R  45.9   0.0   0:05.33 fio
>
> Sad face... Unfortunately once kswapd kicks in, performance also
> plummets. This box only has 32G of ram, and you can fill that in less
> than 10 seconds doing buffered reads like that.
>
> I ran 4k and 32k testing, and using 1 and 4 threads. But given the above
> sadness, it quickly ends up looking the same for me.

What if your workload actually fits in memory?  That would seem to be
the situation where Kent's patches would make a difference.

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: + fs-break-generic_file_buffered_read-up-into-multiple-functions.patch added to -mm tree
  2020-10-29 14:57         ` Matthew Wilcox
@ 2020-10-29 15:02           ` Jens Axboe
  2020-10-29 15:03             ` Matthew Wilcox
  0 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2020-10-29 15:02 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Andrew Morton, linux-mm, kent.overstreet

On 10/29/20 8:57 AM, Matthew Wilcox wrote:
> On Thu, Oct 29, 2020 at 07:57:34AM -0600, Jens Axboe wrote:
>> On 10/28/20 4:26 PM, Jens Axboe wrote:
>>> I did see some wins when I tested this. I'll try and run some testing
>>> tomorrow and report back. If there's something specifically you want to
>>> see tested, let me know.
>>
>> I did some testing, unfortunately it's _very_ hard to produce somewhat
>> consistent and good numbers as it quickly becomes a game of kswapd.
>> Here's a basic case of 4 threads doing 32k random reads:
>>
>>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>     462 root      20   0       0      0      0 R  65.5   0.0   0:08.02 kswapd0
>>    2287 axboe     20   0 1303448   2176   1072 R  46.6   0.0   0:05.35 fio
>>    2289 axboe     20   0 1303456   2196   1092 D  46.6   0.0   0:05.34 fio
>>    2290 axboe     20   0 1303460   2216   1112 D  46.6   0.0   0:05.37 fio
>>    2288 axboe     20   0 1303452   2224   1120 R  45.9   0.0   0:05.33 fio
>>
>> Sad face... Unfortunately once kswapd kicks in, performance also
>> plummets. This box only has 32G of ram, and you can fill that in less
>> than 10 seconds doing buffered reads like that.
>>
>> I ran 4k and 32k testing, and using 1 and 4 threads. But given the above
>> sadness, it quickly ends up looking the same for me.
>
> What if your workload actually fits in memory?  That would seem to be
> the situation where Kent's patches would make a difference.

That was my point, if I do multi-page reads then memory is filled in
seconds, which makes it pretty hard to provide any accurate numbers. I
don't have anything slow in this test box, I'll see if I can find
something to stick in it.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: + fs-break-generic_file_buffered_read-up-into-multiple-functions.patch added to -mm tree
  2020-10-29 15:02           ` Jens Axboe
@ 2020-10-29 15:03             ` Matthew Wilcox
  2020-10-29 15:05               ` Jens Axboe
  0 siblings, 1 reply; 10+ messages in thread
From: Matthew Wilcox @ 2020-10-29 15:03 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Andrew Morton, linux-mm, kent.overstreet

On Thu, Oct 29, 2020 at 09:02:31AM -0600, Jens Axboe wrote:
> On 10/29/20 8:57 AM, Matthew Wilcox wrote:
> > On Thu, Oct 29, 2020 at 07:57:34AM -0600, Jens Axboe wrote:
> >> On 10/28/20 4:26 PM, Jens Axboe wrote:
> >>> I did see some wins when I tested this. I'll try and run some testing
> >>> tomorrow and report back. If there's something specifically you want to
> >>> see tested, let me know.
> >>
> >> I did some testing, unfortunately it's _very_ hard to produce somewhat
> >> consistent and good numbers as it quickly becomes a game of kswapd.
> >> Here's a basic case of 4 threads doing 32k random reads:
> >>
> >>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> >>     462 root      20   0       0      0      0 R  65.5   0.0   0:08.02 kswapd0
> >>    2287 axboe     20   0 1303448   2176   1072 R  46.6   0.0   0:05.35 fio
> >>    2289 axboe     20   0 1303456   2196   1092 D  46.6   0.0   0:05.34 fio
> >>    2290 axboe     20   0 1303460   2216   1112 D  46.6   0.0   0:05.37 fio
> >>    2288 axboe     20   0 1303452   2224   1120 R  45.9   0.0   0:05.33 fio
> >>
> >> Sad face... Unfortunately once kswapd kicks in, performance also
> >> plummets. This box only has 32G of ram, and you can fill that in less
> >> than 10 seconds doing buffered reads like that.
> >>
> >> I ran 4k and 32k testing, and using 1 and 4 threads. But given the above
> >> sadness, it quickly ends up looking the same for me.
> >
> > What if your workload actually fits in memory?  That would seem to be
> > the situation where Kent's patches would make a difference.
>
> That was my point, if I do multi-page reads then memory is filled in
> seconds, which makes it pretty hard to provide any accurate numbers. I
> don't have anything slow in this test box, I'll see if I can find
> something to stick in it.

I meant re-reading files which fit in memory, so you take ten seconds
to fill the page cache, then read from the same files over and over.

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: + fs-break-generic_file_buffered_read-up-into-multiple-functions.patch added to -mm tree
  2020-10-29 15:03             ` Matthew Wilcox
@ 2020-10-29 15:05               ` Jens Axboe
  2020-10-29 15:17                 ` Jens Axboe
  0 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2020-10-29 15:05 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Andrew Morton, linux-mm, kent.overstreet

On 10/29/20 9:03 AM, Matthew Wilcox wrote:
> On Thu, Oct 29, 2020 at 09:02:31AM -0600, Jens Axboe wrote:
>> On 10/29/20 8:57 AM, Matthew Wilcox wrote:
>>> On Thu, Oct 29, 2020 at 07:57:34AM -0600, Jens Axboe wrote:
>>>> On 10/28/20 4:26 PM, Jens Axboe wrote:
>>>>> I did see some wins when I tested this. I'll try and run some testing
>>>>> tomorrow and report back. If there's something specifically you want to
>>>>> see tested, let me know.
>>>>
>>>> I did some testing, unfortunately it's _very_ hard to produce somewhat
>>>> consistent and good numbers as it quickly becomes a game of kswapd.
>>>> Here's a basic case of 4 threads doing 32k random reads:
>>>>
>>>>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>>>     462 root      20   0       0      0      0 R  65.5   0.0   0:08.02 kswapd0
>>>>    2287 axboe     20   0 1303448   2176   1072 R  46.6   0.0   0:05.35 fio
>>>>    2289 axboe     20   0 1303456   2196   1092 D  46.6   0.0   0:05.34 fio
>>>>    2290 axboe     20   0 1303460   2216   1112 D  46.6   0.0   0:05.37 fio
>>>>    2288 axboe     20   0 1303452   2224   1120 R  45.9   0.0   0:05.33 fio
>>>>
>>>> Sad face... Unfortunately once kswapd kicks in, performance also
>>>> plummets. This box only has 32G of ram, and you can fill that in less
>>>> than 10 seconds doing buffered reads like that.
>>>>
>>>> I ran 4k and 32k testing, and using 1 and 4 threads. But given the above
>>>> sadness, it quickly ends up looking the same for me.
>>>
>>> What if your workload actually fits in memory?  That would seem to be
>>> the situation where Kent's patches would make a difference.
>>
>> That was my point, if I do multi-page reads then memory is filled in
>> seconds, which makes it pretty hard to provide any accurate numbers. I
>> don't have anything slow in this test box, I'll see if I can find
>> something to stick in it.
>
> I meant re-reading files which fit in memory, so you take ten seconds
> to fill the page cache, then read from the same files over and over.

That I can certainly try.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: + fs-break-generic_file_buffered_read-up-into-multiple-functions.patch added to -mm tree
  2020-10-29 15:05               ` Jens Axboe
@ 2020-10-29 15:17                 ` Jens Axboe
  0 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2020-10-29 15:17 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Andrew Morton, linux-mm, kent.overstreet

On 10/29/20 9:05 AM, Jens Axboe wrote:
> On 10/29/20 9:03 AM, Matthew Wilcox wrote:
>> On Thu, Oct 29, 2020 at 09:02:31AM -0600, Jens Axboe wrote:
>>> On 10/29/20 8:57 AM, Matthew Wilcox wrote:
>>>> On Thu, Oct 29, 2020 at 07:57:34AM -0600, Jens Axboe wrote:
>>>>> On 10/28/20 4:26 PM, Jens Axboe wrote:
>>>>>> I did see some wins when I tested this. I'll try and run some testing
>>>>>> tomorrow and report back. If there's something specifically you want to
>>>>>> see tested, let me know.
>>>>>
>>>>> I did some testing, unfortunately it's _very_ hard to produce somewhat
>>>>> consistent and good numbers as it quickly becomes a game of kswapd.
>>>>> Here's a basic case of 4 threads doing 32k random reads:
>>>>>
>>>>>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>>>>     462 root      20   0       0      0      0 R  65.5   0.0   0:08.02 kswapd0
>>>>>    2287 axboe     20   0 1303448   2176   1072 R  46.6   0.0   0:05.35 fio
>>>>>    2289 axboe     20   0 1303456   2196   1092 D  46.6   0.0   0:05.34 fio
>>>>>    2290 axboe     20   0 1303460   2216   1112 D  46.6   0.0   0:05.37 fio
>>>>>    2288 axboe     20   0 1303452   2224   1120 R  45.9   0.0   0:05.33 fio
>>>>>
>>>>> Sad face... Unfortunately once kswapd kicks in, performance also
>>>>> plummets. This box only has 32G of ram, and you can fill that in less
>>>>> than 10 seconds doing buffered reads like that.
>>>>>
>>>>> I ran 4k and 32k testing, and using 1 and 4 threads. But given the above
>>>>> sadness, it quickly ends up looking the same for me.
>>>>
>>>> What if your workload actually fits in memory?  That would seem to be
>>>> the situation where Kent's patches would make a difference.
>>>
>>> That was my point, if I do multi-page reads then memory is filled in
>>> seconds, which makes it pretty hard to provide any accurate numbers. I
>>> don't have anything slow in this test box, I'll see if I can find
>>> something to stick in it.
>>
>> I meant re-reading files which fit in memory, so you take ten seconds
>> to fill the page cache, then read from the same files over and over.
>
> That I can certainly try.

Reading a 16G file randomly for 10 seconds, using 1 or 4 threads and
either 4k or 32k reads:

test                      5.10-rc1     5.10-rc1+kent
-------------------------------------------------------------
1 thread, 4k                  976K     1030K  (+5.5%)
4 threads, 4k                3462K     3453K  (-0.3%)
1 thread, 32k                 299K      322K  (+7.7%)
4 threads, 32k                769K      785K  (+2.0%)

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 10+ messages in thread
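For reference, a cached re-read test of that shape can be expressed as a
small fio job. This is a sketch under stated assumptions: the filename,
ioengine, and pre-warming approach here are illustrative, not Jens's
actual invocation, which the thread does not show.

[global]
filename=/data/testfile   ; a pre-written 16G file that fits in RAM
size=16g
rw=randread
time_based=1
runtime=10
ioengine=psync
invalidate=0              ; don't drop the file's page cache at job start

[randread]
bs=4k                     ; or bs=32k for the 32k rows
numjobs=4                 ; or numjobs=1 for the single-threaded rows

One throwaway run populates the page cache; the timed runs that follow
then read entirely from cache, which is the case Kent's batched-lookup
changes target.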
* Re: + fs-break-generic_file_buffered_read-up-into-multiple-functions.patch added to -mm tree
  2020-10-28 22:22   ` Andrew Morton
  2020-10-28 22:26     ` Jens Axboe
@ 2020-10-28 22:50     ` Matthew Wilcox
  1 sibling, 0 replies; 10+ messages in thread
From: Matthew Wilcox @ 2020-10-28 22:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, axboe, kent.overstreet

On Wed, Oct 28, 2020 at 03:22:35PM -0700, Andrew Morton wrote:
> On Tue, 27 Oct 2020 13:35:51 +0000 Matthew Wilcox <willy@infradead.org> wrote:
>
> > On Sun, Oct 25, 2020 at 03:08:17PM -0700, akpm@linux-foundation.org wrote:
> > > The patch titled
> > >      Subject: mm/filemap/c: break generic_file_buffered_read up into multiple functions
> > > has been added to the -mm tree.  Its filename is
> > >      fs-break-generic_file_buffered_read-up-into-multiple-functions.patch
> >
> > Can we back this out?  It really makes the THP patchset unhappy.  I think
> > we can do something like this afterwards, but doing it this way round is
> > ridiculously hard.
>
> Two concerns:
>
> : On my test box, 4k buffered random reads go from ~150k to ~250k iops,
> : and the improvements to big sequential reads are even bigger.
>
> That's a big improvement!  We want that improvement.  Throwing it away
> on behalf of an as-yet-unmerged feature patchset hurts.  Can we expect
> that this improvement will be available post-that-patchset?  And when?

I agree we want that improvement!  I don't understand how this improves
4k buffered reads though.  It should improve 8k to 60k io sizes, but it
should be the same for 4k.  Honestly, it should be better for 4097 bytes
on up (but nobody tests 4097-byte block sizes).

I've been working on redoing my patchset on top of current -next for the
past two days.  It's mostly there, but I've hit a problem when testing
with the readahead error injector.

The good news is that the performance improvements should be cumulative --
Kent's improvements mean that we get pages out of the page cache in
batches, while mine will do fewer atomic ops (on page->refcount).

Funnily, this has led to finding a bug in copy_page_to_iter_iovec() and,
honestly, a lot of the other iov_iter code, but I don't think that's
going to hold us up.

^ permalink raw reply	[flat|nested] 10+ messages in thread
Thread overview: 10+ messages
[not found] <20201025220817.XxXVE%akpm@linux-foundation.org>
2020-10-27 13:35 ` + fs-break-generic_file_buffered_read-up-into-multiple-functions.patch added to -mm tree Matthew Wilcox
2020-10-28 22:22   ` Andrew Morton
2020-10-28 22:26     ` Jens Axboe
2020-10-29 13:57       ` Jens Axboe
2020-10-29 14:57         ` Matthew Wilcox
2020-10-29 15:02           ` Jens Axboe
2020-10-29 15:03             ` Matthew Wilcox
2020-10-29 15:05               ` Jens Axboe
2020-10-29 15:17                 ` Jens Axboe
2020-10-28 22:50   ` Matthew Wilcox