From: Jan Kara <jack@suse.cz>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: Bharata B Rao <bharata@amd.com>,
linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
nikunj@amd.com, willy@infradead.org, vbabka@suse.cz,
david@redhat.com, akpm@linux-foundation.org, yuzhao@google.com,
axboe@kernel.dk, viro@zeniv.linux.org.uk, brauner@kernel.org,
jack@suse.cz, joshdon@google.com, clm@meta.com
Subject: Re: [RFC PATCH 0/1] Large folios in block buffered IO path
Date: Wed, 27 Nov 2024 13:02:35 +0100 [thread overview]
Message-ID: <20241127120235.ejpvpks3fosbzbkr@quack3> (raw)
In-Reply-To: <CAGudoHEvrML100XBTT=sBDud5L2zeQ3ja5BmBCL2TTYYoEC55A@mail.gmail.com>
On Wed 27-11-24 07:19:59, Mateusz Guzik wrote:
> On Wed, Nov 27, 2024 at 7:13 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > On Wed, Nov 27, 2024 at 6:48 AM Bharata B Rao <bharata@amd.com> wrote:
> > >
> > > Recently we discussed the scalability issues while running large
> > > instances of FIO with buffered IO option on NVME block devices here:
> > >
> > > https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
> > >
> > > One of the suggestions Chris Mason gave (during private discussions) was
> > > to enable large folios in block buffered IO path as that could
> > > improve the scalability problems and improve the lock contention
> > > scenarios.
> > >
> >
> > I have no basis to comment on the idea.
> >
> > However, it is pretty apparent whatever the situation it is being
> > heavily disfigured by lock contention in blkdev_llseek:
> >
> > > perf-lock contention output
> > > ---------------------------
> > > The lock contention data doesn't look all that conclusive but for 30% rwmixwrite
> > > mix it looks like this:
> > >
> > > perf-lock contention default
> > > contended total wait max wait avg wait type caller
> > >
> > > 1337359017 64.69 h 769.04 us 174.14 us spinlock rwsem_wake.isra.0+0x42
> > > 0xffffffff903f60a3 native_queued_spin_lock_slowpath+0x1f3
> > > 0xffffffff903f537c _raw_spin_lock_irqsave+0x5c
> > > 0xffffffff8f39e7d2 rwsem_wake.isra.0+0x42
> > > 0xffffffff8f39e88f up_write+0x4f
> > > 0xffffffff8f9d598e blkdev_llseek+0x4e
> > > 0xffffffff8f703322 ksys_lseek+0x72
> > > 0xffffffff8f7033a8 __x64_sys_lseek+0x18
> > > 0xffffffff8f20b983 x64_sys_call+0x1fb3
> > > 2665573 64.38 h 1.98 s 86.95 ms rwsem:W blkdev_llseek+0x31
> > > 0xffffffff903f15bc rwsem_down_write_slowpath+0x36c
> > > 0xffffffff903f18fb down_write+0x5b
> > > 0xffffffff8f9d5971 blkdev_llseek+0x31
> > > 0xffffffff8f703322 ksys_lseek+0x72
> > > 0xffffffff8f7033a8 __x64_sys_lseek+0x18
> > > 0xffffffff8f20b983 x64_sys_call+0x1fb3
> > > 0xffffffff903dce5e do_syscall_64+0x7e
> > > 0xffffffff9040012b entry_SYSCALL_64_after_hwframe+0x76
> >
> > Admittedly I'm not familiar with this code, but at a quick glance the
> > lock can be just straight up removed here?
> >
> > 534 static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
> > 535 {
> > 536 │ struct inode *bd_inode = bdev_file_inode(file);
> > 537 │ loff_t retval;
> > 538 │
> > 539 │ inode_lock(bd_inode);
> > 540 │ retval = fixed_size_llseek(file, offset, whence,
> > i_size_read(bd_inode));
> > 541 │ inode_unlock(bd_inode);
> > 542 │ return retval;
> > 543 }
> >
> > At best it stabilizes the size for the duration of the call. Sounds
> > like it helps nothing since if the size can change, the file offset
> > will still be altered as if there was no locking?
> >
> > Suppose this cannot be avoided to grab the size for whatever reason.
> >
> > While the above fio invocation did not work for me, I ran some crapper
> > which I had in my shell history and according to strace:
> > [pid 271829] lseek(7, 0, SEEK_SET) = 0
> > [pid 271829] lseek(7, 0, SEEK_SET) = 0
> > [pid 271830] lseek(7, 0, SEEK_SET) = 0
> >
> > ... the lseeks just rewind to the beginning, *definitely* not needing
> > to know the size. One would have to check but this is most likely the
> > case in your test as well.
> >
> > And for that there is 0 need to grab the size, and consequently the inode lock.
>
> That is to say bare minimum this needs to be benchmarked before/after
> with the lock removed from the picture, like so:
Yeah, I've noticed this in the locking profiles as well and I agree
bd_inode locking seems unnecessary here. Even some filesystems (e.g. ext4)
get away without using inode lock in their llseek handler...
Honza
> diff --git a/block/fops.c b/block/fops.c
> index 2d01c9007681..7f9e9e2f9081 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -534,12 +534,8 @@ const struct address_space_operations def_blk_aops = {
> static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
> {
> struct inode *bd_inode = bdev_file_inode(file);
> - loff_t retval;
>
> - inode_lock(bd_inode);
> - retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
> - inode_unlock(bd_inode);
> - return retval;
> + return fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
> }
>
> static int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
>
> To be aborted if it blows up (but I don't see why it would).
>
> --
> Mateusz Guzik <mjguzik gmail.com>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
next prev parent reply other threads:[~2024-11-27 12:02 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-11-27 5:47 Bharata B Rao
2024-11-27 5:47 ` [RFC PATCH 1/1] block/ioctl: Add an ioctl to enable large folios for " Bharata B Rao
2024-11-27 6:26 ` Christoph Hellwig
2024-11-27 10:37 ` Bharata B Rao
2024-11-28 5:43 ` Christoph Hellwig
2024-11-27 6:13 ` [RFC PATCH 0/1] Large folios in " Mateusz Guzik
2024-11-27 6:19 ` Mateusz Guzik
2024-11-27 12:02 ` Jan Kara [this message]
2024-11-27 12:13 ` Christian Brauner
2024-11-28 5:40 ` Ritesh Harjani
2024-11-27 12:18 ` Bharata B Rao
2024-11-27 12:28 ` Mateusz Guzik
2024-11-28 4:01 ` Bharata B Rao
2024-11-28 4:22 ` Matthew Wilcox
2024-11-28 4:37 ` Bharata B Rao
2024-11-28 11:23 ` Bharata B Rao
2024-11-28 23:31 ` Mateusz Guzik
2024-11-29 10:32 ` Bharata B Rao
2024-11-28 4:22 ` Mateusz Guzik
2024-11-28 4:31 ` Mateusz Guzik
2024-12-02 9:37 ` Bharata B Rao
2024-12-02 10:08 ` Mateusz Guzik
2024-12-03 5:01 ` Bharata B Rao
2024-11-28 4:43 ` Matthew Wilcox
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20241127120235.ejpvpks3fosbzbkr@quack3 \
--to=jack@suse.cz \
--cc=akpm@linux-foundation.org \
--cc=axboe@kernel.dk \
--cc=bharata@amd.com \
--cc=brauner@kernel.org \
--cc=clm@meta.com \
--cc=david@redhat.com \
--cc=joshdon@google.com \
--cc=linux-block@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mjguzik@gmail.com \
--cc=nikunj@amd.com \
--cc=vbabka@suse.cz \
--cc=viro@zeniv.linux.org.uk \
--cc=willy@infradead.org \
--cc=yuzhao@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox