From: Mateusz Guzik <mjguzik@gmail.com>
To: Bharata B Rao <bharata@amd.com>
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	nikunj@amd.com,  willy@infradead.org, vbabka@suse.cz,
	david@redhat.com,  akpm@linux-foundation.org, yuzhao@google.com,
	axboe@kernel.dk,  viro@zeniv.linux.org.uk, brauner@kernel.org,
	jack@suse.cz, joshdon@google.com,  clm@meta.com
Subject: Re: [RFC PATCH 0/1] Large folios in block buffered IO path
Date: Thu, 28 Nov 2024 05:31:38 +0100
Message-ID: <CAGudoHHo4sLNpoVw-WTGVCc-gL0xguYWfUWfV1CSsQo6-bGnFg@mail.gmail.com>
In-Reply-To: <CAGudoHHBu663RSjQUwi14_d+Ln6mw_ESvYCc6dTec-O0Wi1-Eg@mail.gmail.com>

On Thu, Nov 28, 2024 at 5:22 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Thu, Nov 28, 2024 at 5:02 AM Bharata B Rao <bharata@amd.com> wrote:
> >
> > The contention with inode_lock is gone after your above changes. The new
> > top 10 contention data looks like this now:
> >
> >   contended   total wait     max wait     avg wait         type   caller
> >
> > 2441494015    172.15 h       1.72 ms    253.83 us     spinlock   folio_wait_bit_common+0xd5
> >                          0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
> >                          0xffffffffadbf5d01  _raw_spin_lock_irq+0x51
> >                          0xffffffffacdd1905  folio_wait_bit_common+0xd5
> >                          0xffffffffacdd2d0a  filemap_get_pages+0x68a
> >                          0xffffffffacdd2e73  filemap_read+0x103
> >                          0xffffffffad1d67ba  blkdev_read_iter+0x6a
> >                          0xffffffffacf06937  vfs_read+0x297
> >                          0xffffffffacf07653  ksys_read+0x73
> >    25269947      1.58 h       1.72 ms    225.44 us     spinlock   folio_wake_bit+0x62
> >                          0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
> >                          0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
> >                          0xffffffffacdcf322  folio_wake_bit+0x62
> >                          0xffffffffacdd2ca7  filemap_get_pages+0x627
> >                          0xffffffffacdd2e73  filemap_read+0x103
> >                          0xffffffffad1d67ba  blkdev_read_iter+0x6a
> >                          0xffffffffacf06937  vfs_read+0x297
> >                          0xffffffffacf07653  ksys_read+0x73
> >    44757761      1.05 h       1.55 ms     84.41 us     spinlock   folio_wake_bit+0x62
> >                          0xffffffffadbf60a3  native_queued_spin_lock_slowpath+0x1f3
> >                          0xffffffffadbf537c  _raw_spin_lock_irqsave+0x5c
> >                          0xffffffffacdcf322  folio_wake_bit+0x62
> >                          0xffffffffacdcf7bc  folio_end_read+0x2c
> >                          0xffffffffacf6d4cf  mpage_read_end_io+0x6f
> >                          0xffffffffad1d8abb  bio_endio+0x12b
> >                          0xffffffffad1f07bd  blk_mq_end_request_batch+0x12d
> >                          0xffffffffc05e4e9b  nvme_pci_complete_batch+0xbb
> [snip]
> > However, a point of concern is that FIO bandwidth comes down
> > drastically after the change.
> >
>
> Nicely put :)
>
> >                 default                         inode_lock-fix
> > rw=30%
> > Instance 1      r=55.7GiB/s,w=23.9GiB/s         r=9616MiB/s,w=4121MiB/s
> > Instance 2      r=38.5GiB/s,w=16.5GiB/s         r=8482MiB/s,w=3635MiB/s
> > Instance 3      r=37.5GiB/s,w=16.1GiB/s         r=8609MiB/s,w=3690MiB/s
> > Instance 4      r=37.4GiB/s,w=16.0GiB/s         r=8486MiB/s,w=3637MiB/s
> >
>
> This means the folio waiting machinery has poor scalability, but
> without digging into it I have no idea what can be done. The easy way
> out would be to speculatively spin before buggering off to sleep, but
> one would have to check what happens in real workloads -- presumably
> the lock holder can be off CPU for a long time (and I presume there
> is no place in the folio to store the owner).
>
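To make the speculative-spin idea above concrete, here is a minimal,
untested sketch; the helper name and the spin bound are made up for
illustration, it only shows the shape of the idea, not a proposal:

#include <linux/pagemap.h>	/* folio_trylock() */

/*
 * Arbitrary bound: with no stored owner we cannot spin "while the
 * holder is on CPU" the way rwsems/mutexes do, so all we can do is
 * cap the iterations and hope the hold times are short.
 */
#define FOLIO_SPIN_LIMIT	128

static bool folio_trylock_spin(struct folio *folio)
{
	int spins = FOLIO_SPIN_LIMIT;

	do {
		if (folio_trylock(folio))
			return true;	/* got the lock, skip the wait queue */
		cpu_relax();
	} while (--spins > 0);

	return false;	/* caller falls back to folio_wait_bit_common() */
}

Note that a successful spin never touches the hashed wait queue at
all, so in principle it would relieve both the folio_wait_bit_common
and folio_wake_bit paths seen above -- but if the holder is off CPU
for long stretches it only burns cycles, hence the need to measure on
real workloads first.
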
> The now-removed lock was an rwsem, which behaves better under
> contention, and it was pulling contention away from the folios --
> artificially *helping* performance by exercising the folio
> bottleneck less.
>
> The right thing to do in the long run is still to whack the llseek
> lock acquire, but in the light of the above it can probably wait for
> better times.

Willy mentioned the folio wait queue hash table could be grown; you
can find it in mm/filemap.c:
#define PAGE_WAIT_TABLE_BITS 8
#define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
static wait_queue_head_t folio_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;

static wait_queue_head_t *folio_waitqueue(struct folio *folio)
{
	return &folio_wait_table[hash_ptr(folio, PAGE_WAIT_TABLE_BITS)];
}

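For scale: at 8 bits that is only 256 buckets shared by every folio
in the system, so at this level of traffic collisions on the bucket
spinlocks are all but guaranteed. A purely illustrative bump, untested
and with the size picked arbitrarily, would be:

#define PAGE_WAIT_TABLE_BITS 16	/* was 8: 64k buckets instead of 256 */
#define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
static wait_queue_head_t folio_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;

folio_waitqueue() derives the bucket from PAGE_WAIT_TABLE_BITS, so it
needs no change. At one cacheline per entry this costs 4 MiB of
static storage, which suggests a real fix would rather size the table
at boot based on available memory (the way the dentry/inode hash
tables are sized via alloc_large_system_hash()) than hardcode it.
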
Can you collect off-CPU time? offcputime-bpfcc -K > /tmp/out

On Debian this ships in the bpfcc-tools package.
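If it helps, the tool also accepts a trailing duration argument, e.g.
offcputime-bpfcc -K 30 > /tmp/out for a 30-second capture while the
benchmark is running; exact options may vary with the bcc version.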


-- 
Mateusz Guzik <mjguzik gmail.com>



Thread overview: 24+ messages
2024-11-27  5:47 Bharata B Rao
2024-11-27  5:47 ` [RFC PATCH 1/1] block/ioctl: Add an ioctl to enable large folios for " Bharata B Rao
2024-11-27  6:26   ` Christoph Hellwig
2024-11-27 10:37     ` Bharata B Rao
2024-11-28  5:43       ` Christoph Hellwig
2024-11-27  6:13 ` [RFC PATCH 0/1] Large folios in " Mateusz Guzik
2024-11-27  6:19   ` Mateusz Guzik
2024-11-27 12:02     ` Jan Kara
2024-11-27 12:13       ` Christian Brauner
2024-11-28  5:40       ` Ritesh Harjani
2024-11-27 12:18     ` Bharata B Rao
2024-11-27 12:28       ` Mateusz Guzik
2024-11-28  4:01         ` Bharata B Rao
2024-11-28  4:22           ` Matthew Wilcox
2024-11-28  4:37             ` Bharata B Rao
2024-11-28 11:23               ` Bharata B Rao
2024-11-28 23:31                 ` Mateusz Guzik
2024-11-29 10:32                   ` Bharata B Rao
2024-11-28  4:22           ` Mateusz Guzik
2024-11-28  4:31             ` Mateusz Guzik [this message]
2024-12-02  9:37               ` Bharata B Rao
2024-12-02 10:08                 ` Mateusz Guzik
2024-12-03  5:01                   ` Bharata B Rao
2024-11-28  4:43             ` Matthew Wilcox
