From: Kundan Kumar <kundan.kumar@samsung.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
willy@infradead.org, mcgrof@kernel.org, clm@meta.com,
david@fromorbit.com, amir73il@gmail.com, axboe@kernel.dk,
hch@lst.de, ritesh.list@gmail.com, dave@stgolabs.net,
cem@kernel.org, wangyufei@vivo.com,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
linux-xfs@vger.kernel.org, gost.dev@samsung.com,
anuj20.g@samsung.com, vishak.g@samsung.com, joshi.k@samsung.com
Subject: Re: [PATCH v3 6/6] xfs: offload writeback by AG using per-inode dirty bitmap and per-AG workers
Date: Tue, 3 Feb 2026 13:10:24 +0530 [thread overview]
Message-ID: <1eb3e208-f207-4a04-adbd-9ca143c4f869@samsung.com> (raw)
In-Reply-To: <20260129223421.GE7712@frogsfrogsfrogs>
On 1/30/2026 4:04 AM, Darrick J. Wong wrote:
> On Fri, Jan 16, 2026 at 03:38:18PM +0530, Kundan Kumar wrote:
>> Offload XFS writeback to per-AG workers based on the inode dirty-AG
>> bitmap. Each worker scans and submits writeback only for folios
>> belonging to its AG.
>>
>> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
>> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
>> ---
>> fs/xfs/xfs_aops.c | 178 ++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 178 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
>> index 9d5b65922cd2..55c3154fb2b5 100644
>> --- a/fs/xfs/xfs_aops.c
>> +++ b/fs/xfs/xfs_aops.c
>> @@ -678,6 +678,180 @@ xfs_zoned_writeback_submit(
>> return 0;
>> }
>>
>> +static bool xfs_agp_match(struct xfs_inode *ip, pgoff_t index,
>> + xfs_agnumber_t agno)
>> +{
>> + void *ent;
>> + u32 v;
>> + bool match = false;
>> +
>> + ent = xa_load(&ip->i_ag_pmap, index);
>> + if (ent && xa_is_value(ent)) {
>> + v = xa_to_value(ent);
>> + if (xfs_agp_valid(v))
>> + match = (xfs_agp_agno(v) == (u32)agno);
>> + }
>> +
>> + return match;
>> +}
>> +
>> +static bool xfs_folio_matches_ag(struct folio *folio, xfs_agnumber_t agno)
>> +{
>> + struct xfs_inode *ip = XFS_I(folio_mapping(folio)->host);
>> +
>> + return xfs_agp_match(ip, folio->index, agno);
>> +}
>> +
>> +static int xfs_writepages_ag(struct xfs_inode *ip,
>> + struct writeback_control *wbc,
>> + xfs_agnumber_t agno)
>> +{
>> + struct inode *inode = VFS_I(ip);
>> + struct address_space *mapping = inode->i_mapping;
>> + struct folio_batch *fbatch = &wbc->fbatch;
>> + int ret = 0;
>> + pgoff_t index, end;
>> +
>> + wbc->range_cyclic = 0;
>> +
>> + folio_batch_init(fbatch);
>> + index = wbc->range_start >> PAGE_SHIFT;
>> + end = wbc->range_end >> PAGE_SHIFT;
>> +
>> + struct xfs_writepage_ctx wpc = {
>> + .ctx = {
>> + .inode = inode,
>> + .wbc = wbc,
>> + .ops = &xfs_writeback_ops,
>> + },
>> + };
>> +
>> + while (index <= end) {
>> + int i, nr;
>> +
>> + /* get a batch of DIRTY folios starting at index */
>> + nr = filemap_get_folios_tag(mapping, &index, end,
>> + PAGECACHE_TAG_DIRTY, fbatch);
>> + if (!nr)
>> + break;
>> +
>> + for (i = 0; i < nr; i++) {
>> + struct folio *folio = fbatch->folios[i];
>> +
>> + /* Filter BEFORE locking */
>> + if (!xfs_folio_matches_ag(folio, agno))
>
> So we grab a batch of dirty folios, and only /then/ check to see if
> they've been tagged with the target agno? That doesn't seem very
> efficient if there are a lot of dirty folios and they're not evenly
> distributed among AGs.
>
i_ag_dirty_bitmap is a hint populated at write time to indicate that an
inode likely has dirty folios tagged for AG X, so we only kick workers
for AGs that should have work. That said, folios may not be evenly
distributed across AGs, so a worker can still end up scanning and
finding few or no matches in some cases.
>> + continue;
>> +
>> + folio_lock(folio);
>> +
>> + /*
>> + * Now it's ours: clear dirty and submit.
>> + * This prevents *this AG worker* from seeing it again
>> + * next time.
>> + */
>> + if (!folio_clear_dirty_for_io(folio)) {
>> + folio_unlock(folio);
>> + continue;
>> + }
>> + xa_erase(&ip->i_ag_pmap, folio->index);
>
> Why erase the association? Is this because once we've written the folio
> back to storage, we want a subsequent write to a fsblock within that
> folio to tag the folio with the agnumber of that fsblock? Hrm, maybe
> that's how you deal with multi-fsblock folios; the folio tag reflects
> the first block to be dirtied within the folio?
>
The primary reason for erasing the entry is to keep the metadata
bounded: once the folio has been written back and is clean, there is no
need to retain any AG association, so we drop it to free the per-folio
state. On the next write, the folio is re-tagged based on the iomap for
that write. This also helps with multi-fsblock folios, since re-tagging
on each clean->dirty transition lets the tag track the first block
dirtied in the new cycle.
>> +
>> + ret = iomap_writeback_folio(&wpc.ctx, folio);
>> + folio_unlock(folio);
>> +
>> + if (ret) {
>> + folio_batch_release(fbatch);
>> + goto out;
>> + }
>> + }
>> +
>> + folio_batch_release(fbatch);
>> + cond_resched();
>> + }
>> +
>> +out:
>> + if (wpc.ctx.wb_ctx && wpc.ctx.ops && wpc.ctx.ops->writeback_submit)
>> + wpc.ctx.ops->writeback_submit(&wpc.ctx, ret);
>> +
>> + return ret;
>> +}
>> +
>> +static void xfs_ag_writeback_work(struct work_struct *work)
>> +{
>> + struct xfs_ag_wb *awb = container_of(to_delayed_work(work),
>> + struct xfs_ag_wb, ag_work);
>> + struct xfs_ag_wb_task *task;
>> + struct xfs_mount *mp;
>> + struct inode *inode;
>> + struct xfs_inode *ip;
>> + int ret;
>> +
>> + for (;;) {
>> + spin_lock(&awb->lock);
>> + task = list_first_entry_or_null(&awb->task_list,
>> + struct xfs_ag_wb_task, list);
>> + if (task)
>> + list_del_init(&task->list);
>> + spin_unlock(&awb->lock);
>> +
>> + if (!task)
>> + break;
>> +
>> + ip = task->ip;
>> + mp = ip->i_mount;
>> + inode = VFS_I(ip);
>> +
>> + ret = xfs_writepages_ag(ip, &task->wbc, task->agno);
>> +
>> + /* If didn't submit everything for this AG, set its bit */
>> + if (ret)
>> + set_bit(task->agno, ip->i_ag_dirty_bitmap);
>> +
>> + iput(inode); /* drop igrab */
>> + mempool_free(task, mp->m_ag_task_pool);
>> + }
>> +}
>> +
>> +static int xfs_vm_writepages_offload(struct address_space *mapping,
>> + struct writeback_control *wbc)
>> +{
>> + struct inode *inode = mapping->host;
>> + struct xfs_inode *ip = XFS_I(inode);
>> + struct xfs_mount *mp = ip->i_mount;
>> + struct xfs_ag_wb *awb;
>> + struct xfs_ag_wb_task *task;
>> + xfs_agnumber_t agno;
>> +
>> + if (!ip->i_ag_dirty_bits)
>> + return 0;
>> +
>> + for_each_set_bit(agno, ip->i_ag_dirty_bitmap, ip->i_ag_dirty_bits) {
>> + if (!test_and_clear_bit(agno, ip->i_ag_dirty_bitmap))
>> + continue;
>> +
>> + task = mempool_alloc(mp->m_ag_task_pool, GFP_NOFS);
>
> Allocating memory (even from a mempool) during writeback makes me
> nervous...
>
Agreed. We'll remove this allocation entirely. Instead of allocating an
xfs_ag_wb_task in ->writepages, we'll maintain a mount-wide list of
"active" inodes (those with a non-empty dirty-AG hint). The offload path
will only enqueue the inode on that list and kick the relevant AG
workers. Each AG worker will then walk the active inode list and scan
the pagecache, filtering folios tagged for its AG and submitting IO.
This keeps the writeback path allocation-free and avoids mempool use
under reclaim.
>> + if (!task) {
>> + set_bit(agno, ip->i_ag_dirty_bitmap);
>> + continue;
>> + }
>
> ...because apparently the allocation can fail. If so, then why don't we
> just fall back to serial writeback instead of ... moving on to the next
> AG and seeing if there's more memory?
>
>> +
>> + INIT_LIST_HEAD(&task->list);
>> + task->ip = ip;
>> + task->agno = agno;
>> + task->wbc = *wbc;
>> + igrab(inode); /* worker owns inode ref */
>
> Shouldn't we check for a null return value here? That should never
> happen, but we /do/ have the option of falling back to standard
> iomap_writepages.
>
Yes, I will fix this in the next version.
>> +
>> + awb = &mp->m_ag_wb[agno];
>> +
>> + spin_lock(&awb->lock);
>> + list_add_tail(&task->list, &awb->task_list);
>> + spin_unlock(&awb->lock);
>> +
>> + mod_delayed_work(mp->m_ag_wq, &awb->ag_work, 0);
>> + }
>> +
>> + return 0;
>> +}
>> +
>> static const struct iomap_writeback_ops xfs_zoned_writeback_ops = {
>> .writeback_range = xfs_zoned_writeback_range,
>> .writeback_submit = xfs_zoned_writeback_submit,
>> @@ -706,6 +880,7 @@ xfs_init_ag_writeback(struct xfs_mount *mp)
>> for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
>> struct xfs_ag_wb *awb = &mp->m_ag_wb[agno];
>>
>> + INIT_DELAYED_WORK(&awb->ag_work, xfs_ag_writeback_work);
>> spin_lock_init(&awb->lock);
>> INIT_LIST_HEAD(&awb->task_list);
>> awb->agno = agno;
>> @@ -769,6 +944,9 @@ xfs_vm_writepages(
>> xfs_open_zone_put(xc.open_zone);
>> return error;
>> } else {
>> + if (wbc->sync_mode != WB_SYNC_ALL)
>> + return xfs_vm_writepages_offload(mapping, wbc);
>> +
>> struct xfs_writepage_ctx wpc = {
>> .ctx = {
>> .inode = mapping->host,
>> --
>> 2.25.1
>>
>>
>