linux-mm.kvack.org archive mirror
From: Bharata B Rao <bharata@amd.com>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	nikunj@amd.com, willy@infradead.org, vbabka@suse.cz,
	david@redhat.com, akpm@linux-foundation.org, yuzhao@google.com,
	axboe@kernel.dk, viro@zeniv.linux.org.uk, brauner@kernel.org,
	jack@suse.cz, joshdon@google.com, clm@meta.com
Subject: Re: [RFC PATCH 0/1] Large folios in block buffered IO path
Date: Wed, 27 Nov 2024 17:48:31 +0530
Message-ID: <3947869f-90d4-4912-a42f-197147fe64f0@amd.com>
In-Reply-To: <CAGudoHEvrML100XBTT=sBDud5L2zeQ3ja5BmBCL2TTYYoEC55A@mail.gmail.com>

On 27-Nov-24 11:49 AM, Mateusz Guzik wrote:
> On Wed, Nov 27, 2024 at 7:13 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>> On Wed, Nov 27, 2024 at 6:48 AM Bharata B Rao <bharata@amd.com> wrote:
>>>
>>> Recently we discussed the scalability issues while running large
>>> instances of FIO with buffered IO option on NVME block devices here:
>>>
>>> https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
>>>
>>> One of the suggestions Chris Mason gave (during private discussions) was
>>> to enable large folios in the block buffered IO path, as that could
>>> ease the scalability problems and reduce the lock contention.
>>>
>>
>> I have no basis to comment on the idea.
>>
>> However, it is pretty apparent whatever the situation it is being
>> heavily disfigured by lock contention in blkdev_llseek:
>>
>>> perf-lock contention output
>>> ---------------------------
>>> The lock contention data doesn't look all that conclusive but for 30% rwmixwrite
>>> mix it looks like this:
>>>
>>> perf-lock contention default
>>>   contended   total wait     max wait     avg wait         type   caller
>>>
>>> 1337359017     64.69 h     769.04 us    174.14 us     spinlock   rwsem_wake.isra.0+0x42
>>>                          0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>>>                          0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
>>>                          0xffffffff8f39e7d2  rwsem_wake.isra.0+0x42
>>>                          0xffffffff8f39e88f  up_write+0x4f
>>>                          0xffffffff8f9d598e  blkdev_llseek+0x4e
>>>                          0xffffffff8f703322  ksys_lseek+0x72
>>>                          0xffffffff8f7033a8  __x64_sys_lseek+0x18
>>>                          0xffffffff8f20b983  x64_sys_call+0x1fb3
>>>     2665573     64.38 h       1.98 s      86.95 ms      rwsem:W   blkdev_llseek+0x31
>>>                          0xffffffff903f15bc  rwsem_down_write_slowpath+0x36c
>>>                          0xffffffff903f18fb  down_write+0x5b
>>>                          0xffffffff8f9d5971  blkdev_llseek+0x31
>>>                          0xffffffff8f703322  ksys_lseek+0x72
>>>                          0xffffffff8f7033a8  __x64_sys_lseek+0x18
>>>                          0xffffffff8f20b983  x64_sys_call+0x1fb3
>>>                          0xffffffff903dce5e  do_syscall_64+0x7e
>>>                          0xffffffff9040012b  entry_SYSCALL_64_after_hwframe+0x76
>>
>> Admittedly I'm not familiar with this code, but at a quick glance the
>> lock can be just straight up removed here?
>>
>> static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
>> {
>>         struct inode *bd_inode = bdev_file_inode(file);
>>         loff_t retval;
>>
>>         inode_lock(bd_inode);
>>         retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
>>         inode_unlock(bd_inode);
>>         return retval;
>> }
>>
>> At best it stabilizes the size for the duration of the call. Sounds
>> like it helps nothing since if the size can change, the file offset
>> will still be altered as if there was no locking?
>>
>> Suppose grabbing the size cannot be avoided for whatever reason.
>>
>> While the above fio invocation did not work for me, I ran some crapper
>> which I had in my shell history and according to strace:
>> [pid 271829] lseek(7, 0, SEEK_SET)      = 0
>> [pid 271829] lseek(7, 0, SEEK_SET)      = 0
>> [pid 271830] lseek(7, 0, SEEK_SET)      = 0
>>
>> ... the lseeks just rewind to the beginning, *definitely* not needing
>> to know the size. One would have to check but this is most likely the
>> case in your test as well.
>>
>> And for that there is 0 need to grab the size, and consequently the inode lock.

Here is the complete FIO cmdline I am using:

fio -filename=/dev/nvme1n1p1 -direct=0 -thread -size=800G -rw=rw \
    -rwmixwrite=30 --norandommap --randrepeat=0 -ioengine=sync -bs=64k \
    -numjobs=1 -runtime=3600 --time_based -group_reporting -name=mytest

And that results in lseek patterns like these:

lseek(6, 0, SEEK_SET)             = 0
lseek(6, 131072, SEEK_SET)        = 131072
lseek(6, 65536, SEEK_SET)         = 65536
lseek(6, 196608, SEEK_SET)        = 196608
lseek(6, 131072, SEEK_SET)        = 131072
lseek(6, 393216, SEEK_SET)        = 393216
lseek(6, 196608, SEEK_SET)        = 196608
lseek(6, 458752, SEEK_SET)        = 458752
lseek(6, 262144, SEEK_SET)        = 262144
lseek(6, 1114112, SEEK_SET)       = 1114112

The lseeks are interspersed with read and write calls.
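
For anyone who wants to see this access pattern without a full fio run, here is a minimal userspace sketch that replays the offsets from the trace above as "lseek, then read one 64k block". The scratch file, the helper names (seek_then_read, replay_trace) and the read-only replay are my assumptions for illustration; it is not what fio does internally and it does not touch a block device.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 65536 /* matches -bs=64k from the fio command line */

/* Offsets copied from the strace excerpt above. */
static const off_t trace_offsets[] = {
	0, 131072, 65536, 196608, 131072,
	393216, 196608, 458752, 262144, 1114112,
};
#define TRACE_LEN (sizeof(trace_offsets) / sizeof(trace_offsets[0]))

/* One sync-style I/O: position with lseek(), then read one block. */
static ssize_t seek_then_read(int fd, off_t offset, char *buf)
{
	if (lseek(fd, offset, SEEK_SET) != offset)
		return -1;
	return read(fd, buf, BLOCK_SIZE);
}

/*
 * Replay the trace against a sparse scratch file (a hypothetical stand-in
 * for the block device). Returns the number of I/Os that read a full
 * block, or -1 if the scratch file could not be set up.
 */
static int replay_trace(void)
{
	char path[] = "/tmp/lseek-trace-XXXXXX";
	static char buf[BLOCK_SIZE];
	int fd = mkstemp(path);
	int ok = 0;

	if (fd < 0)
		return -1;
	unlink(path); /* the file disappears once fd is closed */

	/* Size the file to cover the highest offset plus one block. */
	if (ftruncate(fd, 1114112 + BLOCK_SIZE) != 0) {
		close(fd);
		return -1;
	}

	for (size_t i = 0; i < TRACE_LEN; i++)
		if (seek_then_read(fd, trace_offsets[i], buf) == BLOCK_SIZE)
			ok++;

	close(fd);
	return ok;
}
```

Every I/O here costs one lseek() call before the actual read, which is why the llseek path shows up so prominently once many threads share the same device inode.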

> 
> That is to say bare minimum this needs to be benchmarked before/after
> with the lock removed from the picture, like so:
> 
> diff --git a/block/fops.c b/block/fops.c
> index 2d01c9007681..7f9e9e2f9081 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -534,12 +534,8 @@ const struct address_space_operations def_blk_aops = {
>   static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
>   {
>          struct inode *bd_inode = bdev_file_inode(file);
> -       loff_t retval;
> 
> -       inode_lock(bd_inode);
> -       retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
> -       inode_unlock(bd_inode);
> -       return retval;
> +       return fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
>   }
> 
>   static int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
> 
> To be aborted if it blows up (but I don't see why it would).
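
For reference, this is how I read the semantics that remain after the patch, written as a simplified userspace model. It is a sketch under my own assumptions, not the kernel implementation: model_off_t and model_fixed_size_llseek are hypothetical names, the file position is passed in explicitly instead of living in struct file, and overflow handling is left out.

```c
#include <errno.h>
#include <stdio.h> /* SEEK_SET, SEEK_CUR, SEEK_END */

typedef long long model_off_t;

/*
 * Simplified model of a fixed-size llseek: resolve the new position from
 * (current position, offset, whence) and reject results outside [0, size].
 * Returns the new position or -EINVAL.
 */
static model_off_t model_fixed_size_llseek(model_off_t pos, model_off_t offset,
					   int whence, model_off_t size)
{
	switch (whence) {
	case SEEK_SET:
		/* Absolute offset: used as given. */
		break;
	case SEEK_CUR:
		/* Relative to the current file position. */
		offset += pos;
		break;
	case SEEK_END:
		/* Relative to the end, so the size is the base here. */
		offset += size;
		break;
	default:
		return -EINVAL;
	}

	/* The size also acts as the upper bound for every whence value. */
	if (offset < 0 || offset > size)
		return -EINVAL;

	return offset;
}
```

In this model the size is read once and only used as the SEEK_END base and as an upper bound, so holding a lock around the call would only keep the size stable for the duration of that single computation.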

Thanks for this fix, will try and get back with results.

Regards,
Bharata.

