From: Jan Kara <jack@suse.cz>
To: "Bhatnagar, Rishabh" <risbhat@amazon.com>
Cc: Jan Kara <jack@suse.cz>,
	tytso@mit.edu, akpm@linux-foundation.org, linux-mm@kvack.org,
	linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org,
	abuehaze@amazon.com
Subject: Re: EXT4 IOPS degradation between 4.14 and 5.10
Date: Thu, 9 Feb 2023 09:34:49 +0100
Message-ID: <20230209083449.ibmskbri5noojh4u@quack3>
In-Reply-To: <1c134792-1763-8c92-6516-78a330d4756e@amazon.com>

On Wed 08-02-23 11:17:21, Bhatnagar, Rishabh wrote:
> > > > > > > + /*
> > > > > > > + * We come here when we got at @end. We take care to not overflow the
> > > > > > > + * index @index as it confuses some of the callers. This breaks the
> > > > > > > + * iteration when there is page at index -1 but that is already broken
> > > > > > > + * anyway.
> > > > > > > + */
> > > > > > > + if (end == (pgoff_t)-1)
> > > > > > > + *index = (pgoff_t)-1;
> > > > > > > + else
> > > > > > > + *index = end + 1;
> > > > > > > +out:
> > > > > > > rcu_read_unlock();
> > > > > > > 
> > > > > > > - if (ret)
> > > > > > > - *index = pages[ret - 1]->index + 1;
> > > > > > > -
> > > > > > > 
> > > > > > >    From the description of the patch I didn't see any mention of this
> > > > > > > functional change.
> > > > > > > Was this change intentional, and did it help some use case or give a
> > > > > > > general performance improvement?
> > > > > > So the change was intentional. When I was working on the series, I was
> > > > > > somewhat concerned that the old code could end up in a pathological
> > > > > > situation like:
> > > > > >      We scan range 0-1000000, find the only dirty page at index 0, return it.
> > > > > >      We scan range 1-1000000, find the only dirty page at index 1, return it.
> > > > > >      ...
> > > > > > 
> > > > > > This way we end up with rather inefficient scanning, and in theory a
> > > > > > malicious user could livelock writeback like this. That being said, this
> > > > > > was/is mostly a theoretical concern.
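
To make the concern concrete, here is a small userspace toy model (hypothetical
names like lookup() and writeback(), not the real mm/filemap.c code): lookup()
mimics a tagged page-cache lookup returning at most BATCH pages from
[*index, end], and the caller plays the adversary by redirtying one page just
behind the scan.  With the old way of advancing the index the caller keeps
finding work and ends up calling the lookup once per page; with the new way it
stops after a single pass over the range.

#include <stdbool.h>
#include <stdio.h>

#define END	10000UL		/* scaled-down stand-in for 0..1000000 */
#define BATCH	15UL		/* like PAGEVEC_SIZE */

static bool dirty[END + 2];

/* Return up to BATCH dirty "pages" in [*index, end], cleaning them. */
static unsigned long lookup(unsigned long *index, unsigned long end,
			    unsigned long *pages, bool new_behaviour)
{
	unsigned long found = 0, i;

	for (i = *index; i <= end && found < BATCH; i++) {
		if (dirty[i]) {
			dirty[i] = false;
			pages[found++] = i;
		}
	}
	if (new_behaviour && i > end)
		*index = end + 1;		/* scanned the whole range: done */
	else if (found)
		*index = pages[found - 1] + 1;	/* old: just past the last hit */
	else
		*index = end + 1;
	return found;
}

static unsigned long writeback(bool new_behaviour)
{
	unsigned long index = 0, calls = 0, n, pages[BATCH];

	dirty[0] = true;
	while (index <= END) {
		n = lookup(&index, END, pages, new_behaviour);
		calls++;
		if (!n)
			break;
		/* adversary: redirty the page right after the last one found */
		if (pages[n - 1] < END)
			dirty[pages[n - 1] + 1] = true;
	}
	return calls;
}

int main(void)
{
	printf("old index update: %lu lookup calls\n", writeback(false));
	printf("new index update: %lu lookup calls\n", writeback(true));
	return 0;
}

Built with plain gcc this should report ~10001 lookup calls for the old
behaviour versus 1 for the new one, which is the livelock pattern above in
miniature.
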
> > > > > OK, so it's more of a security concern. But do you think this has a latency
> > > > > impact? I didn't see much latency impact between the two sets, and throughput
> > > > > is higher.
> > > > Yes, I expect there will be a latency impact, but for this workload probably
> > > > a relatively small one. I expect the expensive part of your workload is the
> > > > fsync(2) call, in particular the committing of the transaction and the
> > > > flushing of the disk caches as a result of that. Data writes themselves are
> > > > relatively cheap. If you had say 1MB blocks instead of 16k ones, I'd expect
> > > > the numbers to start looking different as the cost of IO and of cache
> > > > flushing becomes more comparable - obviously it all depends on the details
> > > > of the backing storage as well. Perhaps you could measure how much time we
> > > > spend in file_write_and_wait_range() calls vs in the whole ext4_sync_file()
> > > > call to confirm that?
> > > > 
> > > > Overall your tests show we could gain some throughput without sacrificing
> > > > too much latency if we somehow batch tiny fsync requests more. The trick is
> > > > how to do this without regressing other workloads and also across various
> > > > storage types which are going to have very different properties.
> > > Yeah, I agree fsync is a much more expensive operation than just dirtying
> > > buffers in the page cache.  I did use the ext4dist tool from bcc to get the
> > > distribution of write vs fsync.  Overall fsync is by far the more expensive
> > > operation, so if we can get higher throughput here with batching it
> > > shouldn't impact fsync latency that much.
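
(The "Tracing 1 functions for ..." histograms further down look like output
from bcc's funclatency tool, a sibling of ext4dist; if so, the same tool can
be pointed at the individual functions, e.g. "funclatency ext4_sync_file" or
"funclatency __filemap_fdatawrite_range", to get exactly the per-function
split discussed next.)
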
> > Well, I was more interested in a file_write_and_wait_range() vs
> > ext4_sync_file() latency comparison. write(2) calls are indeed very fast
> > because they just copy into the page cache, so that is not very interesting.
> > But file_write_and_wait_range() is more interesting because it measures
> > the cost of writing file data to the disk, while the whole ext4_sync_file()
> > measures the cost of writing file data to the disk + the cost of flushing
> > the journal, and I'm interested in how costly the flushing of the journal
> > is compared to the data writeback.
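
For reference, that split maps directly onto the structure of ext4_sync_file().
A heavily simplified paraphrase of fs/ext4/fsync.c from the 5.10 era (error
handling, the nojournal and data=journal cases omitted; details vary between
kernel versions):

int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
{
	struct inode *inode = file->f_mapping->host;
	bool needs_barrier = false;
	int ret;

	/* Part 1: write out and wait for the file's dirty data. */
	ret = file_write_and_wait_range(file, start, end);
	if (ret)
		return ret;

	/*
	 * Part 2: make the covering transaction stable - with fast
	 * commits enabled this ends up in ext4_fc_commit(), otherwise
	 * in a full jbd2 commit of the current transaction.
	 */
	ret = ext4_fsync_journal(inode, datasync, &needs_barrier);

	/* Part 3: flush the disk write cache if the commit did not. */
	if (needs_barrier)
		blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL);

	return ret;
}

Part 1 is what file_write_and_wait_range() covers (internally
__filemap_fdatawrite_range() plus __filemap_fdatawait_range()); parts 2 and 3
are the extra cost that only shows up in the full ext4_sync_file() number.
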
> 
> Sorry, I misunderstood your comment. Here is the revised data. Flushing the
> journal is very heavy compared to flushing the data.
> 
> ext4_sync_file:     ~18.6 msecs
> fdatawrite_range:   ~4 usecs
> fdatawait_range:    ~83.6 usecs
> fc_commit:          ~18.6 msecs
> 
> Tracing 1 functions for "ext4_sync_file"
> 
> 
>      nsecs               : count     distribution
>    1048576 -> 2097151    : 75 |                                        |
>    2097152 -> 4194303    : 1496 |****                                    |
>    4194304 -> 8388607    : 3461 |**********                              |
>    8388608 -> 16777215   : 6693 |********************                    |
>   16777216 -> 33554431   : 13355 |****************************************|
>   33554432 -> 67108863   : 1631 |****                                    |
> 
> avg = 18624922 nsecs, total: 505778389231 nsecs, count: 27156
> 
> 
> Tracing 1 functions for "__filemap_fdatawrite_range"
> 
>      nsecs               : count     distribution
>        512 -> 1023       : 0 |                                        |
>       1024 -> 2047       : 1 |**                                      |
>       2048 -> 4095       : 14 |****************************************|
>       4096 -> 8191       : 5 |**************                          |
>       8192 -> 16383      : 1 |**                                      |
> 
> avg = 3943 nsecs, total: 82809 nsecs, count: 21
> 
> Tracing 1 functions for "__filemap_fdatawait_range
> 
>      nsecs               : count     distribution
>        128 -> 255        : 0 |                                        |
>        256 -> 511        : 1 |********************                    |
>        512 -> 1023       : 2 |****************************************|
>       1024 -> 2047       : 2 |****************************************|
>       2048 -> 4095       : 0 |                                        |
>       4096 -> 8191       : 0 |                                        |
>       8192 -> 16383      : 0 |                                        |
>      16384 -> 32767      : 0 |                                        |
>      32768 -> 65535      : 0 |                                        |
>      65536 -> 131071     : 0 |                                        |
>     131072 -> 262143     : 0 |                                        |
>     262144 -> 524287     : 1 |********************                    |
> 
> avg = 83644 nsecs, total: 501866 nsecs, count: 6
> 
> Tracing 1 functions for "ext4_fc_commit
> 
>      nsecs               : count     distribution
>    1048576 -> 2097151    : 70 |                                        |
>    2097152 -> 4194303    : 1512 |****                                    |
>    4194304 -> 8388607    : 3436 |**********                              |
>    8388608 -> 16777215   : 6699 |********************                    |
>   16777216 -> 33554431   : 13349 |****************************************|
>   33554432 -> 67108863   : 1641 |****                                    |
> 
> avg = 18622010 nsecs, total: 505699318966 nsecs, count: 27156

Thanks for the data! So indeed, as I suspected, for your storage the
transaction commit (likely due to the flushing of disk caches) costs the vast
majority of the time: the ~4 us of fdatawrite plus ~84 us of fdatawait are
negligible next to the ~18.6 ms spent in ext4_fc_commit() per fsync. I guess
I'll try to experiment with this a bit and see if we can somehow achieve more
batching in ext4_sync_file()...
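
For context on what "more batching" could look like: the regular jbd2 commit
path already has a fsync-batching heuristic in jbd2_journal_stop(), tunable
through the min_batch_time/max_batch_time mount options. A simplified
paraphrase of that logic from fs/jbd2/transaction.c:

/*
 * When a synchronous handle is released by a new writer, sleep for
 * roughly one average commit time (clamped to [j_min_batch_time,
 * j_max_batch_time]) before the commit is kicked off, so that other
 * small fsyncs can join the same transaction.
 */
if (handle->h_sync && journal->j_last_sync_writer != pid &&
    journal->j_max_batch_time) {
	u64 commit_time = journal->j_average_commit_time;
	u64 trans_time = ktime_to_ns(ktime_sub(ktime_get(),
					       transaction->t_start_time));

	journal->j_last_sync_writer = pid;

	commit_time = max_t(u64, commit_time,
			    1000 * journal->j_min_batch_time);
	commit_time = min_t(u64, commit_time,
			    1000 * journal->j_max_batch_time);

	if (trans_time < commit_time) {
		ktime_t expires = ktime_add_ns(ktime_get(), commit_time);

		set_current_state(TASK_UNINTERRUPTIBLE);
		schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
	}
}

Whether an equivalent wait can be put around the fast commit path without
hurting latency for lone fsync callers is presumably what the experiment has
to answer.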

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR



Thread overview: 9+ messages
2023-01-26  0:33 Bhatnagar, Rishabh
2023-01-26  9:32 ` Jan Kara
2023-01-26 18:23   ` Bhatnagar, Rishabh
2023-01-27 12:17     ` Jan Kara
2023-01-30 17:45       ` Bhatnagar, Rishabh
2023-02-08 14:02         ` Jan Kara
2023-02-08 19:17           ` Bhatnagar, Rishabh
2023-02-09  8:34             ` Jan Kara [this message]
2023-02-08 19:37           ` Theodore Ts'o
