From: Brian Foster <bfoster@redhat.com>
To: Kundan Kumar <kundan.kumar@samsung.com>
Cc: viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
	willy@infradead.org, mcgrof@kernel.org, clm@meta.com,
	david@fromorbit.com, amir73il@gmail.com, axboe@kernel.dk,
	hch@lst.de, ritesh.list@gmail.com, djwong@kernel.org,
	dave@stgolabs.net, cem@kernel.org, wangyufei@vivo.com,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-xfs@vger.kernel.org, gost.dev@samsung.com,
	anuj20.g@samsung.com, vishak.g@samsung.com, joshi.k@samsung.com
Subject: Re: [PATCH v3 0/6] AG aware parallel writeback for XFS
Date: Wed, 21 Jan 2026 14:54:40 -0500
Message-ID: <aXEvAD5Rf5QLp4Ma@bfoster>
In-Reply-To: <20260116100818.7576-1-kundan.kumar@samsung.com>

On Fri, Jan 16, 2026 at 03:38:12PM +0530, Kundan Kumar wrote:
> This series explores AG aware parallel writeback for XFS. The goal is
> to reduce writeback contention and improve scalability by allowing
> writeback to be distributed across allocation groups (AGs).
> 
> Problem statement
> =================
> Today, XFS writeback walks the page cache serially per inode and funnels
> all writeback through a single writeback context. For aging filesystems,
> especially under highly parallel buffered IO, this leads to limited
> concurrency across independent AGs.
> 
> The filesystem already has strong AG level parallelism for allocation and
> metadata operations, but writeback remains largely AG agnostic.
> 
> High-level approach
> ===================
> This series introduces AG aware writeback with the following model:
> 1) Predict the target AG for buffered writes (mapped or delalloc) at write
>    time.
> 2) Tag AG hints per folio (via lightweight metadata / xarray).
> 3) Track dirty AGs per inode using a bitmap.
> 4) Offload writeback to per AG worker threads, each performing a one-pass
>    scan.
> 5) Each worker filters folios and submits only those tagged for its AG.
> 
> Unlike our earlier approach that parallelized writeback by introducing
> multiple writeback contexts per BDI, this series keeps all changes within
> XFS and is orthogonal to that work. The AG aware mechanism uses per folio
> AG hints to route writeback to AG specific workers, and therefore applies
> even when a single inode’s data spans multiple AGs. This avoids the
> earlier limitation of relying on inode-based AG locality, which can break
> down on aged/fragmented filesystems.
> 
> IOPS and throughput
> ===================
> We see a significant improvement in IOPS if files span multiple AGs:
> 
> Workload: 12 files of 500M each in 12 directories (AGs), numjobs = 12
>     - NVMe device Intel Optane
>         Base XFS                : 308 MiB/s
>         Parallel Writeback XFS  : 1534 MiB/s  (+398%)
> 
> Workload: 6 files of 6G each in 6 directories (AGs), numjobs = 12
>     - NVMe device Intel Optane
>         Base XFS                : 409 MiB/s
>         Parallel Writeback XFS  : 1245 MiB/s  (+204%)
> 

Hi Kundan,

Could you provide more detail on how you're testing here? I threw this
at some beefier storage I have around out of curiosity and I'm not
seeing much of a difference. It could be I'm missing some details or
maybe the storage outweighs the processing benefit. For example, what fio
command is being used, if any? Is there preallocation? What type of
storage? Is a particular fs geometry being targeted for this
optimization (i.e. smaller AGs), etc.?

FWIW, I skimmed through the code a bit and the main thing that kind of
stands out to me is the write time per-folio hinting. Writeback handling
for the overwrite (i.e. non-delalloc) case is basically a single lookup
per mapping under shared inode lock. The question that comes to mind
there is what is the value of per-ag batching as opposed to just adding
generic concurrency? It seems unnecessary to me to take care to shuffle
overwrites into per-ag based workers when the underlying locking is
already shared.
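
Just to make sure we're talking about the same mechanism, the write time
bookkeeping I'm referring to is roughly the shape below. This is my own
shorthand for steps 2-4 of the cover letter, not the actual patch code;
the xfs_folio_set_ag_hint() helper and the i_dirty_ags field are
hypothetical stand-ins for whatever the series really uses:

	/* my paraphrase of the write time hinting, names are made up */
	static void
	xfs_tag_folio_agno(
		struct xfs_inode	*ip,
		struct folio		*folio,
		xfs_agnumber_t		agno)
	{
		/* stash the predicted AG in the per-folio metadata ... */
		xfs_folio_set_ag_hint(folio, agno);	/* hypothetical helper */
		/* ... and mark that AG dirty in the per-inode bitmap */
		set_bit(agno, ip->i_dirty_ags);		/* hypothetical field */
	}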

WRT delalloc, it looks like we're basically taking the inode AG as the
starting point and guessing based on the on-disk AGF free blocks counter
at the time of the write. The delalloc accounting doesn't count against
the AGF, however, so ISTM that in many cases this would just effectively
land on the inode AG for larger delalloc writes. Is that not the case?
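
For reference, the prediction logic as I'm reading it amounts to roughly
the sketch below. xfs_ag_freeblks() is a hypothetical stand-in for however
the series reads the on-disk AGF free block counter, and the rest is my
paraphrase rather than the actual patch code:

	/* guess a target AG for a buffered write covering 'len' blocks */
	static xfs_agnumber_t
	xfs_predict_write_agno(
		struct xfs_inode	*ip,
		xfs_extlen_t		len)
	{
		struct xfs_mount	*mp = ip->i_mount;
		xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
		xfs_agnumber_t		tries;

		for (tries = 0; tries < mp->m_sb.sb_agcount; tries++) {
			/* xfs_ag_freeblks(): hypothetical read of the AGF counter */
			if (xfs_ag_freeblks(mp, agno) >= len)
				break;
			agno = (agno + 1) % mp->m_sb.sb_agcount;
		}
		return agno;
	}

Since delalloc reservations never decrement the AGF counters, the inode AG
keeps looking like it has plenty of room, so most hints collapse back onto
it; that's the effect I'm asking about above.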

Once we get to delalloc writeback, we're under exclusive inode lock and
fall into the block allocator. The latter iterates the AGs via trylock,
looking for a good candidate. So what's the advantage of per-ag
splitting delalloc at writeback time if we're sending the same inode to
per-ag workers that all 1. require exclusive inode lock and 2. call into
an allocator that is designed to be scalable (i.e. if one AG is locked
it will just move to the next)?

Yet another consideration is how delalloc conversion works at the
xfs_bmapi_convert_delalloc() -> xfs_bmapi_convert_one_delalloc() level.
If you take a look at the latter, we look up the entire delalloc extent
backing the folio under writeback and attempt to allocate it all at once
(not just the blocks backing the folio). So in theory if we were to end
up tagging a sequence of contiguous delalloc backed folios at buffered
write time with different AGs, we're still going to try to allocate all
of that in one AG at writeback time. So the per-ag hinting also sort of
competes with this by shuffling writeback of the same potential extent
into different workers, making it a little hard to reason about.

So stepping back, it kind of feels to me like the write time hinting has
so much potential for inaccuracy and unpredictable writeback time
behavior (in the delalloc case) that I wonder if we're effectively just
enabling arbitrary concurrency at writeback time and seeing benefit from
that. If so, I wonder whether the same value could be gained by
simplifying this to not require write time hinting at all.

Have you run any experiments that perhaps rotor inodes to the
individual wb workers based on the inode AG (i.e. basically ignoring all
the write time stuff) by chance? Or anything that otherwise helps
quantify the value of per-ag batching over just basic concurrency? I'd
be interested to see if/how behavior changes with something like that.
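
Something as crude as the sketch below is what I have in mind, i.e. route
an inode to a worker purely by its home AG and ignore the per-folio
hinting entirely. Untested, and the surrounding worker pool is hand-waved:

	/*
	 * Pick a writeback worker for an inode based only on which AG the
	 * inode itself lives in; nr_workers and the worker infrastructure
	 * around this are hypothetical.
	 */
	static unsigned int
	xfs_wb_pick_worker(
		struct xfs_inode	*ip,
		unsigned int		nr_workers)
	{
		return XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino) % nr_workers;
	}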

Brian

> These changes are on top of the v6.18 kernel release.
> 
> Future work involves tightening writeback control (wbc) handling to integrate
> with global writeback accounting and range semantics, and evaluating the
> interaction with higher-level writeback parallelism.
> 
> Kundan Kumar (6):
>   iomap: add write ops hook to attach metadata to folios
>   xfs: add helpers to pack AG prediction info for per-folio tracking
>   xfs: add per-inode AG prediction map and dirty-AG bitmap
>   xfs: tag folios with AG number during buffered write via iomap attach
>     hook
>   xfs: add per-AG writeback workqueue infrastructure
>   xfs: offload writeback by AG using per-inode dirty bitmap and per-AG
>     workers
> 
>  fs/iomap/buffered-io.c |   3 +
>  fs/xfs/xfs_aops.c      | 257 +++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_aops.h      |   3 +
>  fs/xfs/xfs_icache.c    |  27 +++++
>  fs/xfs/xfs_inode.h     |   5 +
>  fs/xfs/xfs_iomap.c     | 114 ++++++++++++++++++
>  fs/xfs/xfs_iomap.h     |  31 +++++
>  fs/xfs/xfs_mount.c     |   2 +
>  fs/xfs/xfs_mount.h     |  10 ++
>  fs/xfs/xfs_super.c     |   2 +
>  include/linux/iomap.h  |   3 +
>  11 files changed, 457 insertions(+)
> 
> -- 
> 2.25.1
> 
> 


