From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, willy@infradead.org, linux-mm@kvack.org
Subject: Re: [PATCH 3/3] xfs: convert buffer cache to use high order folios
Date: Thu, 18 Jan 2024 17:31:00 -0800
Message-ID: <20240119013100.GR674499@frogsfrogsfrogs>
In-Reply-To: <20240118222216.4131379-4-david@fromorbit.com>
On Fri, Jan 19, 2024 at 09:19:41AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Now that we have the buffer cache using the folio API, we can extend
> the use of folios to allocate high order folios for multi-page
> buffers rather than an array of single pages that are then vmapped
> into a contiguous range.
>
> This creates two types of buffers: single folio buffers that can
> have arbitrary order, and multi-folio buffers made up of many single
> page folios that get vmapped. The latter is essentially the existing
> code, so there are no logic changes to handle this case.
>
> There are a few places where we iterate the folios on a buffer.
> These need to be converted to handle the high order folio case.
> Luckily, this only occurs when bp->b_folio_count == 1, and the code
> for handling this case is just a simple application of the folio API
> to the operations that need to be performed.
>
> The code that allocates buffers will optimistically attempt a high
> order folio allocation as a fast path. If this high order allocation
> fails, then we fall back to the existing multi-folio allocation
> code. This now forms the slow allocation path, and hopefully will be
> largely unused in normal conditions.
>
> This should improve performance of large buffer operations (e.g.
> large directory block sizes) as we should now mostly avoid the
> expense of vmapping large buffers (and the vmap lock contention that
> can occur) as well as avoid the runtime pressure that frequently
> accessing kernel vmapped pages puts on the TLBs.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/xfs_buf.c | 150 +++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 119 insertions(+), 31 deletions(-)
>
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 15907e92d0d3..df363f17ea1a 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -74,6 +74,10 @@ xfs_buf_is_vmapped(
> return bp->b_addr && bp->b_folio_count > 1;
> }
>
> +/*
> + * See comment above xfs_buf_alloc_folios() about the constraints placed on
> + * allocating vmapped buffers.
> + */
> static inline int
> xfs_buf_vmap_len(
> struct xfs_buf *bp)
> @@ -344,14 +348,72 @@ xfs_buf_alloc_kmem(
> bp->b_addr = NULL;
> return -ENOMEM;
> }
> - bp->b_offset = offset_in_page(bp->b_addr);
> bp->b_folios = bp->b_folio_array;
> bp->b_folios[0] = kmem_to_folio(bp->b_addr);
> + bp->b_offset = offset_in_folio(bp->b_folios[0], bp->b_addr);
> bp->b_folio_count = 1;
> bp->b_flags |= _XBF_KMEM;
> return 0;
> }
>
> +/*
> + * Allocating a high order folio makes the assumption that buffers are a
> + * power-of-2 size so that ilog2() returns the exact order needed to fit
> + * the contents of the buffer. Buffer lengths are mostly a power of two,
> + * so this is not an unreasonable approach to take by default.
> + *
> + * The exception here is user xattr data buffers, which can be arbitrarily
> + * sized up to 64kB plus structure metadata. In that case, round up the order.
> + */
> +static bool
> +xfs_buf_alloc_folio(
> + struct xfs_buf *bp,
> + gfp_t gfp_mask)
> +{
> + int length = BBTOB(bp->b_length);
> + int order;
> +
> + order = ilog2(length);
> + if ((1 << order) < length)
> + order = ilog2(length - 1) + 1;
> +
> + if (order <= PAGE_SHIFT)
> + order = 0;
> + else
> + order -= PAGE_SHIFT;
> +
> + bp->b_folio_array[0] = folio_alloc(gfp_mask, order);
> + if (!bp->b_folio_array[0])
> + return false;
> +
> + bp->b_folios = bp->b_folio_array;
> + bp->b_folio_count = 1;
> + bp->b_flags |= _XBF_FOLIOS;
> + return true;
Hmm. So I guess with this patchset, either we get one multi-page large
folio for the whole buffer, or we fall back to an array of base page
sized folios?
/me wonders if the extra flexibility from alloc_folio_bulk_array and a
folioized vm_map_ram would eliminate all this special casing?
Uhoh, lights flickering again and ice crashing off the roof. I better
go before the lights go out again. :(
--D
> +}
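(Side note: the rounding in the order calculation above can be sanity
checked standalone; a minimal userspace sketch, assuming 4kB pages and
substituting __builtin_clz() for the kernel's ilog2():)

	#include <assert.h>

	#define MY_PAGE_SHIFT	12		/* assume 4kB pages */

	static int my_ilog2(unsigned int x)
	{
		return 31 - __builtin_clz(x);	/* highest set bit */
	}

	/* mirrors the order calculation in xfs_buf_alloc_folio() above */
	static int buf_len_to_order(int length)
	{
		int order = my_ilog2(length);

		if ((1 << order) < length)	/* round non-power-of-2 up */
			order = my_ilog2(length - 1) + 1;
		return order <= MY_PAGE_SHIFT ? 0 : order - MY_PAGE_SHIFT;
	}

	int main(void)
	{
		assert(buf_len_to_order(4096) == 0);	/* one page */
		assert(buf_len_to_order(65536) == 4);	/* 64kB dir block */
		assert(buf_len_to_order(66048) == 5);	/* 64kB + 512 rounds up */
		return 0;
	}

A 64kB directory block maps cleanly to an order-4 folio, while the
64kB-plus-metadata xattr case rounds up to order 5.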
> +
> +/*
> + * When we allocate folios for a buffer, we end up with one of two types of
> + * buffer.
> + *
> + * The first type is a single folio buffer - this may be a high order
> + * folio or just a single page sized folio, but either way it is treated the
> + * same way by the rest of the code - the buffer memory spans a single
> + * contiguous memory region that we don't have to map and unmap to access the
> + * data directly.
> + *
> + * The second type of buffer is the multi-folio buffer. These are *always* made
> + * up of single page folios so that they can either be fed to vm_map_ram() to
> + * return a contiguous memory region we can access the data through, or marked
> + * as XBF_UNMAPPED so the data is accessed directly through individual
> + * folio_address() calls.
> + *
> + * We don't use high order folios for this second type of buffer (yet) because
> + * having variable size folios makes offset-to-folio indexing and iteration of
> + * the data range more complex than if they are fixed size. This case should now
> + * be the slow path, though, so unless we regularly fail to allocate high order
> + * folios, there should be little need to optimise this path.
> + */
> static int
> xfs_buf_alloc_folios(
> struct xfs_buf *bp,
> @@ -363,7 +425,15 @@ xfs_buf_alloc_folios(
> if (flags & XBF_READ_AHEAD)
> gfp_mask |= __GFP_NORETRY;
>
> - /* Make sure that we have a page list */
> + /* Assure zeroed buffer for non-read cases. */
> + if (!(flags & XBF_READ))
> + gfp_mask |= __GFP_ZERO;
> +
> + /* Optimistically attempt a single high order folio allocation. */
> + if (xfs_buf_alloc_folio(bp, gfp_mask))
> + return 0;
> +
> + /* Fall back to allocating an array of single page folios. */
> bp->b_folio_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
> if (bp->b_folio_count <= XB_FOLIOS) {
> bp->b_folios = bp->b_folio_array;
> @@ -375,9 +445,6 @@ xfs_buf_alloc_folios(
> }
> bp->b_flags |= _XBF_FOLIOS;
>
> - /* Assure zeroed buffer for non-read cases. */
> - if (!(flags & XBF_READ))
> - gfp_mask |= __GFP_ZERO;
>
> /*
> * Bulk filling of pages can take multiple calls. Not filling the entire
> @@ -418,7 +485,7 @@ _xfs_buf_map_folios(
> {
> ASSERT(bp->b_flags & _XBF_FOLIOS);
> if (bp->b_folio_count == 1) {
> - /* A single page buffer is always mappable */
> + /* A single folio buffer is always mappable */
> bp->b_addr = folio_address(bp->b_folios[0]);
> } else if (flags & XBF_UNMAPPED) {
> bp->b_addr = NULL;
> @@ -1465,20 +1532,28 @@ xfs_buf_ioapply_map(
> int *count,
> blk_opf_t op)
> {
> - int page_index;
> - unsigned int total_nr_pages = bp->b_folio_count;
> - int nr_pages;
> + int folio_index;
> + unsigned int total_nr_folios = bp->b_folio_count;
> + int nr_folios;
> struct bio *bio;
> sector_t sector = bp->b_maps[map].bm_bn;
> int size;
> int offset;
>
> - /* skip the pages in the buffer before the start offset */
> - page_index = 0;
> + /*
> + * If the start offset is larger than a single page, we need to be
> + * careful. We might have a high order folio, in which case the indexing
> + * is from the start of the buffer. However, if the buffer is made up of
> + * multiple single page folios, we need to skip the folios in the buffer
> + * before the start offset.
> + */
> + folio_index = 0;
> offset = *buf_offset;
> - while (offset >= PAGE_SIZE) {
> - page_index++;
> - offset -= PAGE_SIZE;
> + if (bp->b_folio_count > 1) {
> + while (offset >= PAGE_SIZE) {
> + folio_index++;
> + offset -= PAGE_SIZE;
> + }
> }
>
> /*
> @@ -1491,28 +1566,28 @@ xfs_buf_ioapply_map(
>
> next_chunk:
> atomic_inc(&bp->b_io_remaining);
> - nr_pages = bio_max_segs(total_nr_pages);
> + nr_folios = bio_max_segs(total_nr_folios);
>
> - bio = bio_alloc(bp->b_target->bt_bdev, nr_pages, op, GFP_NOIO);
> + bio = bio_alloc(bp->b_target->bt_bdev, nr_folios, op, GFP_NOIO);
> bio->bi_iter.bi_sector = sector;
> bio->bi_end_io = xfs_buf_bio_end_io;
> bio->bi_private = bp;
>
> - for (; size && nr_pages; nr_pages--, page_index++) {
> - int rbytes, nbytes = PAGE_SIZE - offset;
> + for (; size && nr_folios; nr_folios--, folio_index++) {
> + struct folio *folio = bp->b_folios[folio_index];
> + int nbytes = folio_size(folio) - offset;
>
> if (nbytes > size)
> nbytes = size;
>
> - rbytes = bio_add_folio(bio, bp->b_folios[page_index], nbytes,
> - offset);
> - if (rbytes < nbytes)
> + if (!bio_add_folio(bio, folio, nbytes,
> + offset_in_folio(folio, offset)))
> break;
>
> offset = 0;
> sector += BTOBB(nbytes);
> size -= nbytes;
> - total_nr_pages--;
> + total_nr_folios--;
> }
>
> if (likely(bio->bi_iter.bi_size)) {
> @@ -1722,6 +1797,13 @@ xfs_buf_offset(
> if (bp->b_addr)
> return bp->b_addr + offset;
>
> + /* Single folio buffers may use large folios. */
> + if (bp->b_folio_count == 1) {
> + folio = bp->b_folios[0];
> + return folio_address(folio) + offset_in_folio(folio, offset);
> + }
> +
> + /* Multi-folio buffers always use PAGE_SIZE folios */
> folio = bp->b_folios[offset >> PAGE_SHIFT];
> return folio_address(folio) + (offset & (PAGE_SIZE-1));
> }
> @@ -1737,18 +1819,24 @@ xfs_buf_zero(
> bend = boff + bsize;
> while (boff < bend) {
> struct folio *folio;
> - int page_index, page_offset, csize;
> + int folio_index, folio_offset, csize;
>
> - page_index = (boff + bp->b_offset) >> PAGE_SHIFT;
> - page_offset = (boff + bp->b_offset) & ~PAGE_MASK;
> - folio = bp->b_folios[page_index];
> - csize = min_t(size_t, PAGE_SIZE - page_offset,
> + /* Single folio buffers may use large folios. */
> + if (bp->b_folio_count == 1) {
> + folio = bp->b_folios[0];
> + folio_offset = offset_in_folio(folio,
> + bp->b_offset + boff);
> + } else {
> + folio_index = (boff + bp->b_offset) >> PAGE_SHIFT;
> + folio_offset = (boff + bp->b_offset) & ~PAGE_MASK;
> + folio = bp->b_folios[folio_index];
> + }
> +
> + csize = min_t(size_t, folio_size(folio) - folio_offset,
> BBTOB(bp->b_length) - boff);
> + ASSERT((csize + folio_offset) <= folio_size(folio));
>
> - ASSERT((csize + page_offset) <= PAGE_SIZE);
> -
> - memset(folio_address(folio) + page_offset, 0, csize);
> -
> + memset(folio_address(folio) + folio_offset, 0, csize);
> boff += csize;
> }
> }
> --
> 2.43.0
>
>