* [PATCH v3 0/6] AG aware parallel writeback for XFS
[not found] <CGME20260116101236epcas5p12ba3de776976f4ea6666e16a33ab6ec4@epcas5p1.samsung.com>
@ 2026-01-16 10:08 ` Kundan Kumar
  [not found] ` <CGME20260116101241epcas5p330f9c335a096aaaefda4b7d3c38d6038@epcas5p3.samsung.com>
  ` (7 more replies)
  0 siblings, 8 replies; 48+ messages in thread
From: Kundan Kumar @ 2026-01-16 10:08 UTC (permalink / raw)
  To: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe,
	hch, ritesh.list, djwong, dave, cem, wangyufei
  Cc: linux-fsdevel, linux-mm, linux-xfs, gost.dev, kundan.kumar,
	anuj20.g, vishak.g, joshi.k

This series explores AG aware parallel writeback for XFS. The goal is
to reduce writeback contention and improve scalability by allowing
writeback to be distributed across allocation groups (AGs).

Problem statement
=================
Today, XFS writeback walks the page cache serially per inode and
funnels all writeback through a single writeback context. For aging
filesystems, especially under highly parallel buffered I/O, this leads
to limited concurrency across independent AGs. The filesystem already
has strong AG-level parallelism for allocation and metadata
operations, but writeback remains largely AG agnostic.

High-level approach
===================
This series introduces AG aware writeback with the following model
(a sketch of the write-time half follows the diffstat below):

1) Predict the target AG for buffered writes (mapped or delalloc) at
   write time.
2) Tag AG hints per folio (via lightweight metadata / xarray).
3) Track dirty AGs per inode using a bitmap.
4) Offload writeback to per-AG worker threads, each performing a
   one-pass scan.
5) Workers filter folios and submit only those tagged for their AG.

Unlike our earlier approach, which parallelized writeback by
introducing multiple writeback contexts per BDI, this series keeps all
changes within XFS and is orthogonal to that work. The AG aware
mechanism uses per-folio AG hints to route writeback to AG specific
workers, and therefore applies even when a single inode's data spans
multiple AGs. This avoids the earlier limitation of relying on
inode-based AG locality, which can break down on aged/fragmented
filesystems.

IOPS and throughput
===================
We see a significant improvement in throughput when files span
multiple AGs.

Workload: 12 files of 500M each, in 12 directories (AGs)
- numjobs = 12
- NVMe device: Intel Optane

  Base XFS               :  308 MiB/s
  Parallel Writeback XFS : 1534 MiB/s (+398%)

Workload: 6 files of 6G each, in 6 directories (AGs)
- numjobs = 12
- NVMe device: Intel Optane

  Base XFS               :  409 MiB/s
  Parallel Writeback XFS : 1245 MiB/s (+204%)

These changes are on top of the v6.18 kernel release.

Future work involves tightening writeback control (wbc) handling to
integrate with global writeback accounting and range semantics, and
evaluating the interaction with higher-level writeback parallelism.
Kundan Kumar (6):
  iomap: add write ops hook to attach metadata to folios
  xfs: add helpers to pack AG prediction info for per-folio tracking
  xfs: add per-inode AG prediction map and dirty-AG bitmap
  xfs: tag folios with AG number during buffered write via iomap
    attach hook
  xfs: add per-AG writeback workqueue infrastructure
  xfs: offload writeback by AG using per-inode dirty bitmap and
    per-AG workers

 fs/iomap/buffered-io.c |   3 +
 fs/xfs/xfs_aops.c      | 257 +++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_aops.h      |   3 +
 fs/xfs/xfs_icache.c    |  27 +++++
 fs/xfs/xfs_inode.h     |   5 +
 fs/xfs/xfs_iomap.c     | 114 ++++++++++++++++++
 fs/xfs/xfs_iomap.h     |  31 +++++
 fs/xfs/xfs_mount.c     |   2 +
 fs/xfs/xfs_mount.h     |  10 ++
 fs/xfs/xfs_super.c     |   2 +
 include/linux/iomap.h  |   3 +
 11 files changed, 457 insertions(+)

-- 
2.25.1

^ permalink raw reply	[flat|nested] 48+ messages in thread
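[For illustration only: a minimal sketch of the write-time half of the
model above (steps 1-3), assuming the xfs_agp_* helpers from patch 2
and the i_ag_pmap/i_ag_dirty_bitmap fields from patch 3.
xfs_ag_predict_folio() is a hypothetical name; the series actually
routes this through the iomap hook added in patch 1, and the exact
xarray storage convention belongs to patch 4.]

/*
 * Sketch of write-time AG prediction and tagging; error handling and
 * the delalloc case are omitted.
 */
static void
xfs_ag_predict_folio(
	struct xfs_inode	*ip,
	const struct iomap	*iomap,
	struct folio		*folio)
{
	struct xfs_mount	*mp = ip->i_mount;
	xfs_agnumber_t		agno;

	if (iomap->type != IOMAP_MAPPED)
		return;	/* a real version also predicts delalloc AGs */

	/* step 1: map the folio's first block to an AG via the extent */
	agno = xfs_daddr_to_agno(mp, BTOBB(iomap->addr));

	/* step 2: tag the folio; one entry per folio index */
	xa_store(&ip->i_ag_pmap, folio->index,
			xa_mk_value(xfs_agp_pack(agno, iomap->type, true)),
			GFP_NOFS);

	/* step 3: note that this inode now has dirty data in this AG */
	set_bit(agno, ip->i_ag_dirty_bitmap);
}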
[parent not found: <CGME20260116101241epcas5p330f9c335a096aaaefda4b7d3c38d6038@epcas5p3.samsung.com>]
* [PATCH v3 1/6] iomap: add write ops hook to attach metadata to folios
  [not found] ` <CGME20260116101241epcas5p330f9c335a096aaaefda4b7d3c38d6038@epcas5p3.samsung.com>
@ 2026-01-16 10:08   ` Kundan Kumar
  0 siblings, 0 replies; 48+ messages in thread
From: Kundan Kumar @ 2026-01-16 10:08 UTC (permalink / raw)
  To: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe,
	hch, ritesh.list, djwong, dave, cem, wangyufei
  Cc: linux-fsdevel, linux-mm, linux-xfs, gost.dev, kundan.kumar,
	anuj20.g, vishak.g, joshi.k

Add an optional tag_folio callback to struct iomap_write_ops.
Filesystems may use this hook to tag folios during write-begin without
changing iomap core behavior.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 fs/iomap/buffered-io.c | 3 +++
 include/linux/iomap.h  | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8b847a1e27f1..701f2e9cd010 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -699,6 +699,9 @@ static int __iomap_write_begin(const struct iomap_iter *iter,
 	size_t from = offset_in_folio(folio, pos), to = from + len;
 	size_t poff, plen;
 
+	if (write_ops && write_ops->tag_folio)
+		write_ops->tag_folio(&iter->iomap, folio, pos, len);
+
 	/*
 	 * If the write or zeroing completely overlaps the current folio, then
 	 * entire folio will be dirtied so there is no need for
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 73dceabc21c8..14ef88f8ee84 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -176,6 +176,9 @@ struct iomap_write_ops {
 	 */
 	int (*read_folio_range)(const struct iomap_iter *iter,
 			struct folio *folio, loff_t pos, size_t len);
+
+	void (*tag_folio)(const struct iomap *iomap,
+			struct folio *folio, loff_t pos, size_t len);
 };
 
 /*
-- 
2.25.1

^ permalink raw reply	[flat|nested] 48+ messages in thread
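[To make the contract concrete: a sketch of how a filesystem would
wire up the new hook. The callback name and the ops instance are
placeholders, not the series' actual XFS wiring (that lands in
patch 4); the ops structure is then passed down the buffered-write
path alongside the usual iomap_ops.]

/* Placeholder consumer of the optional ->tag_folio hook. */
static void
xfs_wb_tag_folio(
	const struct iomap	*iomap,
	struct folio		*folio,
	loff_t			pos,
	size_t			len)
{
	/* record which AG this folio is predicted to write back to */
}

static const struct iomap_write_ops xfs_buffered_write_ops = {
	.tag_folio	= xfs_wb_tag_folio,
};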
[parent not found: <CGME20260116101245epcas5p30269c6aa35784db67e6d6ca800a683a7@epcas5p3.samsung.com>]
* [PATCH v3 2/6] xfs: add helpers to pack AG prediction info for per-folio tracking
  [not found] ` <CGME20260116101245epcas5p30269c6aa35784db67e6d6ca800a683a7@epcas5p3.samsung.com>
@ 2026-01-16 10:08   ` Kundan Kumar
  2026-01-29  0:45     ` Darrick J. Wong
  2026-02-04  7:37     ` Nirjhar Roy (IBM)
  0 siblings, 2 replies; 48+ messages in thread
From: Kundan Kumar @ 2026-01-16 10:08 UTC (permalink / raw)
  To: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe,
	hch, ritesh.list, djwong, dave, cem, wangyufei
  Cc: linux-fsdevel, linux-mm, linux-xfs, gost.dev, kundan.kumar,
	anuj20.g, vishak.g, joshi.k

Introduce helper routines to pack and unpack AG prediction metadata
for folios. This provides a compact and self-contained representation
for AG tracking.

The packed layout uses:
 - bit 31     : valid
 - bits 24-30 : iomap type
 - bits 0-23  : AG number

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 fs/xfs/xfs_iomap.h | 31 +++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index ebcce7d49446..eaf4513f6759 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -12,6 +12,37 @@ struct xfs_inode;
 struct xfs_bmbt_irec;
 struct xfs_zone_alloc_ctx;
 
+/* pack prediction in a u32 stored in xarray */
+#define XFS_AGP_VALID_SHIFT	31
+#define XFS_AGP_TYPE_SHIFT	24
+#define XFS_AGP_TYPE_MASK	0x7fu
+#define XFS_AGP_AGNO_MASK	0x00ffffffu
+
+static inline u32 xfs_agp_pack(u32 agno, u8 iomap_type, bool valid)
+{
+	u32 v = agno & XFS_AGP_AGNO_MASK;
+
+	v |= ((u32)iomap_type & XFS_AGP_TYPE_MASK) << XFS_AGP_TYPE_SHIFT;
+	if (valid)
+		v |= (1u << XFS_AGP_VALID_SHIFT);
+	return v;
+}
+
+static inline bool xfs_agp_valid(u32 v)
+{
+	return v >> XFS_AGP_VALID_SHIFT;
+}
+
+static inline u32 xfs_agp_agno(u32 v)
+{
+	return v & XFS_AGP_AGNO_MASK;
+}
+
+static inline u8 xfs_agp_type(u32 v)
+{
+	return (u8)((v >> XFS_AGP_TYPE_SHIFT) & XFS_AGP_TYPE_MASK);
+}
+
 int xfs_iomap_write_direct(struct xfs_inode *ip, xfs_fileoff_t offset_fsb,
 		xfs_fileoff_t count_fsb, unsigned int flags,
 		struct xfs_bmbt_irec *imap, u64 *sequence);
-- 
2.25.1

^ permalink raw reply	[flat|nested] 48+ messages in thread
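[As a quick sanity check on this layout: packing and unpacking should
round-trip. A sketch, where ASSERT is the XFS debug macro, the values
are arbitrary, and the last line demonstrates the 24-bit AG-number
truncation raised in the review below.]

/* Round-trip check for the packed prediction layout. */
static void
xfs_agp_selftest(void)
{
	u32	v = xfs_agp_pack(7, IOMAP_DELALLOC, true);

	ASSERT(xfs_agp_valid(v));
	ASSERT(xfs_agp_agno(v) == 7);
	ASSERT(xfs_agp_type(v) == IOMAP_DELALLOC);

	/* an agno at or above 2^24 silently truncates in this layout */
	ASSERT(xfs_agp_agno(xfs_agp_pack(1u << 24, IOMAP_MAPPED, true)) == 0);
}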
* Re: [PATCH v3 2/6] xfs: add helpers to pack AG prediction info for per-folio tracking 2026-01-16 10:08 ` [PATCH v3 2/6] xfs: add helpers to pack AG prediction info for per-folio tracking Kundan Kumar @ 2026-01-29 0:45 ` Darrick J. Wong 2026-02-03 7:15 ` Kundan Kumar 2026-02-04 7:37 ` Nirjhar Roy (IBM) 1 sibling, 1 reply; 48+ messages in thread From: Darrick J. Wong @ 2026-01-29 0:45 UTC (permalink / raw) To: Kundan Kumar Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, Jan 16, 2026 at 03:38:14PM +0530, Kundan Kumar wrote: > Introduce helper routines to pack and unpack AG prediction metadata > for folios. This provides a compact and self-contained representation > for AG tracking. > > The packed layout uses: > - bit 31 : valid > - bit 24-30 : iomap type > - bit 0-23 : AG number There are only 5 iomap types, why do you need 7 bits for that? Also, can you store more bits on a 64-bit system to avoid truncating the AG number? --D > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > --- > fs/xfs/xfs_iomap.h | 31 +++++++++++++++++++++++++++++++ > 1 file changed, 31 insertions(+) > > diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h > index ebcce7d49446..eaf4513f6759 100644 > --- a/fs/xfs/xfs_iomap.h > +++ b/fs/xfs/xfs_iomap.h > @@ -12,6 +12,37 @@ struct xfs_inode; > struct xfs_bmbt_irec; > struct xfs_zone_alloc_ctx; > > +/* pack prediction in a u32 stored in xarray */ > +#define XFS_AGP_VALID_SHIFT 31 > +#define XFS_AGP_TYPE_SHIFT 24 > +#define XFS_AGP_TYPE_MASK 0x7fu > +#define XFS_AGP_AGNO_MASK 0x00ffffffu > + > +static inline u32 xfs_agp_pack(u32 agno, u8 iomap_type, bool valid) > +{ > + u32 v = agno & XFS_AGP_AGNO_MASK; > + > + v |= ((u32)iomap_type & XFS_AGP_TYPE_MASK) << XFS_AGP_TYPE_SHIFT; > + if (valid) > + v |= (1u << XFS_AGP_VALID_SHIFT); > + return v; > +} > + > +static inline bool xfs_agp_valid(u32 v) > +{ > + return v >> XFS_AGP_VALID_SHIFT; > +} > + > +static inline u32 xfs_agp_agno(u32 v) > +{ > + return v & XFS_AGP_AGNO_MASK; > +} > + > +static inline u8 xfs_agp_type(u32 v) > +{ > + return (u8)((v >> XFS_AGP_TYPE_SHIFT) & XFS_AGP_TYPE_MASK); > +} > + > int xfs_iomap_write_direct(struct xfs_inode *ip, xfs_fileoff_t offset_fsb, > xfs_fileoff_t count_fsb, unsigned int flags, > struct xfs_bmbt_irec *imap, u64 *sequence); > -- > 2.25.1 > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 2/6] xfs: add helpers to pack AG prediction info for per-folio tracking 2026-01-29 0:45 ` Darrick J. Wong @ 2026-02-03 7:15 ` Kundan Kumar 2026-02-05 16:39 ` Darrick J. Wong 0 siblings, 1 reply; 48+ messages in thread From: Kundan Kumar @ 2026-02-03 7:15 UTC (permalink / raw) To: Darrick J. Wong Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On 1/29/2026 6:15 AM, Darrick J. Wong wrote: > On Fri, Jan 16, 2026 at 03:38:14PM +0530, Kundan Kumar wrote: >> Introduce helper routines to pack and unpack AG prediction metadata >> for folios. This provides a compact and self-contained representation >> for AG tracking. >> >> The packed layout uses: >> - bit 31 : valid >> - bit 24-30 : iomap type >> - bit 0-23 : AG number > > There are only 5 iomap types, why do you need 7 bits for that? > > Also, can you store more bits on a 64-bit system to avoid truncating the > AG number? > > --D I’ll reduce the type field to 3 bits (8 values). For the AG number, I can drop the artificial 24-bit cap by packing into an unsigned long and storing it via xa_mk_value(), which provides ~60 bits on 64-bit systems and ~28 bits on 32-bit systems. > >> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> >> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> >> --- >> fs/xfs/xfs_iomap.h | 31 +++++++++++++++++++++++++++++++ >> 1 file changed, 31 insertions(+) >> >> diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h >> index ebcce7d49446..eaf4513f6759 100644 >> --- a/fs/xfs/xfs_iomap.h >> +++ b/fs/xfs/xfs_iomap.h >> @@ -12,6 +12,37 @@ struct xfs_inode; >> struct xfs_bmbt_irec; >> struct xfs_zone_alloc_ctx; >> >> +/* pack prediction in a u32 stored in xarray */ >> +#define XFS_AGP_VALID_SHIFT 31 >> +#define XFS_AGP_TYPE_SHIFT 24 >> +#define XFS_AGP_TYPE_MASK 0x7fu >> +#define XFS_AGP_AGNO_MASK 0x00ffffffu >> + >> +static inline u32 xfs_agp_pack(u32 agno, u8 iomap_type, bool valid) >> +{ >> + u32 v = agno & XFS_AGP_AGNO_MASK; >> + >> + v |= ((u32)iomap_type & XFS_AGP_TYPE_MASK) << XFS_AGP_TYPE_SHIFT; >> + if (valid) >> + v |= (1u << XFS_AGP_VALID_SHIFT); >> + return v; >> +} >> + >> +static inline bool xfs_agp_valid(u32 v) >> +{ >> + return v >> XFS_AGP_VALID_SHIFT; >> +} >> + >> +static inline u32 xfs_agp_agno(u32 v) >> +{ >> + return v & XFS_AGP_AGNO_MASK; >> +} >> + >> +static inline u8 xfs_agp_type(u32 v) >> +{ >> + return (u8)((v >> XFS_AGP_TYPE_SHIFT) & XFS_AGP_TYPE_MASK); >> +} >> + >> int xfs_iomap_write_direct(struct xfs_inode *ip, xfs_fileoff_t offset_fsb, >> xfs_fileoff_t count_fsb, unsigned int flags, >> struct xfs_bmbt_irec *imap, u64 *sequence); >> -- >> 2.25.1 >> >> > ^ permalink raw reply [flat|nested] 48+ messages in thread
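[For reference, the xa_mk_value()-based packing proposed above could
look roughly like the following. This is illustrative only and the
real v4 encoding may differ; xa_mk_value() consumes one bit
internally, so an unsigned long payload keeps BITS_PER_LONG - 1 bits:
one valid bit, three type bits, and the remainder for the AG number.]

#define AGP_VALID_BIT	0
#define AGP_TYPE_SHIFT	1
#define AGP_TYPE_BITS	3
#define AGP_AGNO_SHIFT	(AGP_TYPE_SHIFT + AGP_TYPE_BITS)

/* pack an AG prediction into an xarray value entry (hypothetical) */
static inline void *agp_pack_entry(unsigned long agno, unsigned int type)
{
	unsigned long v = (agno << AGP_AGNO_SHIFT) |
			  ((unsigned long)type << AGP_TYPE_SHIFT) |
			  BIT(AGP_VALID_BIT);

	return xa_mk_value(v);
}

static inline unsigned long agp_entry_agno(void *entry)
{
	return xa_to_value(entry) >> AGP_AGNO_SHIFT;
}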
* Re: [PATCH v3 2/6] xfs: add helpers to pack AG prediction info for per-folio tracking 2026-02-03 7:15 ` Kundan Kumar @ 2026-02-05 16:39 ` Darrick J. Wong 0 siblings, 0 replies; 48+ messages in thread From: Darrick J. Wong @ 2026-02-05 16:39 UTC (permalink / raw) To: Kundan Kumar Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Tue, Feb 03, 2026 at 12:45:33PM +0530, Kundan Kumar wrote: > On 1/29/2026 6:15 AM, Darrick J. Wong wrote: > > On Fri, Jan 16, 2026 at 03:38:14PM +0530, Kundan Kumar wrote: > >> Introduce helper routines to pack and unpack AG prediction metadata > >> for folios. This provides a compact and self-contained representation > >> for AG tracking. > >> > >> The packed layout uses: > >> - bit 31 : valid > >> - bit 24-30 : iomap type > >> - bit 0-23 : AG number > > > > There are only 5 iomap types, why do you need 7 bits for that? > > > > Also, can you store more bits on a 64-bit system to avoid truncating the > > AG number? > > > > --D > > I’ll reduce the type field to 3 bits (8 values). > > For the AG number, I can drop the artificial 24-bit cap by packing into > an unsigned long and storing it via xa_mk_value(), which provides ~60 > bits on 64-bit systems and ~28 bits on 32-bit systems. <nod> > > > >> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > >> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > >> --- > >> fs/xfs/xfs_iomap.h | 31 +++++++++++++++++++++++++++++++ > >> 1 file changed, 31 insertions(+) > >> > >> diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h > >> index ebcce7d49446..eaf4513f6759 100644 > >> --- a/fs/xfs/xfs_iomap.h > >> +++ b/fs/xfs/xfs_iomap.h > >> @@ -12,6 +12,37 @@ struct xfs_inode; > >> struct xfs_bmbt_irec; > >> struct xfs_zone_alloc_ctx; > >> > >> +/* pack prediction in a u32 stored in xarray */ > >> +#define XFS_AGP_VALID_SHIFT 31 > >> +#define XFS_AGP_TYPE_SHIFT 24 > >> +#define XFS_AGP_TYPE_MASK 0x7fu > >> +#define XFS_AGP_AGNO_MASK 0x00ffffffu > >> + > >> +static inline u32 xfs_agp_pack(u32 agno, u8 iomap_type, bool valid) > >> +{ > >> + u32 v = agno & XFS_AGP_AGNO_MASK; > >> + > >> + v |= ((u32)iomap_type & XFS_AGP_TYPE_MASK) << XFS_AGP_TYPE_SHIFT; > >> + if (valid) > >> + v |= (1u << XFS_AGP_VALID_SHIFT); > >> + return v; > >> +} > >> + > >> +static inline bool xfs_agp_valid(u32 v) > >> +{ > >> + return v >> XFS_AGP_VALID_SHIFT; Isn't this just a mask? return v & (1U << XFS_AGP_VALID_SHIFT) > >> +} > >> + > >> +static inline u32 xfs_agp_agno(u32 v) > >> +{ > >> + return v & XFS_AGP_AGNO_MASK; > >> +} > >> + > >> +static inline u8 xfs_agp_type(u32 v) > >> +{ > >> + return (u8)((v >> XFS_AGP_TYPE_SHIFT) & XFS_AGP_TYPE_MASK); > >> +} And as Nirjhar noted, please try to use richer types when possible. s/u32 agno/xfs_agnumber_t agno/ --D > >> + > >> int xfs_iomap_write_direct(struct xfs_inode *ip, xfs_fileoff_t offset_fsb, > >> xfs_fileoff_t count_fsb, unsigned int flags, > >> struct xfs_bmbt_irec *imap, u64 *sequence); > >> -- > >> 2.25.1 > >> > >> > > > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 2/6] xfs: add helpers to pack AG prediction info for per-folio tracking
  2026-01-16 10:08   ` [PATCH v3 2/6] xfs: add helpers to pack AG prediction info for per-folio tracking Kundan Kumar
  2026-01-29  0:45     ` Darrick J. Wong
@ 2026-02-04  7:37     ` Nirjhar Roy (IBM)
  1 sibling, 0 replies; 48+ messages in thread
From: Nirjhar Roy (IBM) @ 2026-02-04  7:37 UTC (permalink / raw)
  To: Kundan Kumar, viro, brauner, jack, willy, mcgrof, clm, david,
	amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei
  Cc: linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g,
	joshi.k

On Fri, 2026-01-16 at 15:38 +0530, Kundan Kumar wrote:
> Introduce helper routines to pack and unpack AG prediction metadata
> for folios. This provides a compact and self-contained representation
> for AG tracking.
>
> The packed layout uses:
> - bit 31 : valid
> - bit 24-30 : iomap type
> - bit 0-23 : AG number

# AG number limited to 2^24 - is this a reasonable assumption?

>
> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> ---
>  fs/xfs/xfs_iomap.h | 31 +++++++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
>
> diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
> index ebcce7d49446..eaf4513f6759 100644
> --- a/fs/xfs/xfs_iomap.h
> +++ b/fs/xfs/xfs_iomap.h
> @@ -12,6 +12,37 @@ struct xfs_inode;
>  struct xfs_bmbt_irec;
>  struct xfs_zone_alloc_ctx;
>
> +/* pack prediction in a u32 stored in xarray */

Nit: Maybe add a brief comment about what these are doing and their
significance?

> +#define XFS_AGP_VALID_SHIFT	31
> +#define XFS_AGP_TYPE_SHIFT	24
> +#define XFS_AGP_TYPE_MASK	0x7fu
> +#define XFS_AGP_AGNO_MASK	0x00ffffffu
> +
> +static inline u32 xfs_agp_pack(u32 agno, u8 iomap_type, bool valid)

Nit: Maybe use xfs_agnumber_t instead of u32 for agno?

Nit: I think xfs style is like

	static inline u32
	<function name>(<parameters etc>)

> +{
> +	u32 v = agno & XFS_AGP_AGNO_MASK;
> +
> +	v |= ((u32)iomap_type & XFS_AGP_TYPE_MASK) << XFS_AGP_TYPE_SHIFT;
> +	if (valid)
> +		v |= (1u << XFS_AGP_VALID_SHIFT);
> +	return v;
> +}
> +
> +static inline bool xfs_agp_valid(u32 v)
> +{
> +	return v >> XFS_AGP_VALID_SHIFT;
> +}
> +
> +static inline u32 xfs_agp_agno(u32 v)
> +{
> +	return v & XFS_AGP_AGNO_MASK;
> +}
> +
> +static inline u8 xfs_agp_type(u32 v)
> +{
> +	return (u8)((v >> XFS_AGP_TYPE_SHIFT) & XFS_AGP_TYPE_MASK);
> +}

Again nit: we introduce functions here but only use them in upcoming
patches. Any reason why we can't introduce and use the functions in
the same patch? That way we could review the definitions and the call
sites together instead of switching between two patches - one for the
definitions and another for the usages/call-sites. Again, no hard
preferences here - just a suggestion.

--NR

> +
>  int xfs_iomap_write_direct(struct xfs_inode *ip, xfs_fileoff_t offset_fsb,
>  		xfs_fileoff_t count_fsb, unsigned int flags,
>  		struct xfs_bmbt_irec *imap, u64 *sequence);

^ permalink raw reply	[flat|nested] 48+ messages in thread
[parent not found: <CGME20260116101251epcas5p1cf5b48f2efb14fe4387be3053b3c3ebc@epcas5p1.samsung.com>]
* [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap
  [not found] ` <CGME20260116101251epcas5p1cf5b48f2efb14fe4387be3053b3c3ebc@epcas5p1.samsung.com>
@ 2026-01-16 10:08   ` Kundan Kumar
  2026-01-29  0:44     ` Darrick J. Wong
  ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Kundan Kumar @ 2026-01-16 10:08 UTC (permalink / raw)
  To: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe,
	hch, ritesh.list, djwong, dave, cem, wangyufei
  Cc: linux-fsdevel, linux-mm, linux-xfs, gost.dev, kundan.kumar,
	anuj20.g, vishak.g, joshi.k

Add per-inode structures to track predicted AGs of dirty folios using
an xarray and bitmap. This enables efficient identification of AGs
involved in writeback.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 fs/xfs/xfs_icache.c | 27 +++++++++++++++++++++++++++
 fs/xfs/xfs_inode.h  |  5 +++++
 2 files changed, 32 insertions(+)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index e44040206851..f97aa6d66271 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -80,6 +80,25 @@ static inline xa_mark_t ici_tag_to_mark(unsigned int tag)
 	return XFS_PERAG_BLOCKGC_MARK;
 }
 
+static int xfs_inode_init_ag_bitmap(struct xfs_inode *ip)
+{
+	unsigned int bits = ip->i_mount->m_sb.sb_agcount;
+	unsigned int nlongs;
+
+	xa_init_flags(&ip->i_ag_pmap, XA_FLAGS_LOCK_IRQ);
+	ip->i_ag_dirty_bitmap = NULL;
+	ip->i_ag_dirty_bits = bits;
+
+	if (!bits)
+		return 0;
+
+	nlongs = BITS_TO_LONGS(bits);
+	ip->i_ag_dirty_bitmap = kcalloc(nlongs, sizeof(unsigned long),
+					GFP_NOFS);
+
+	return ip->i_ag_dirty_bitmap ? 0 : -ENOMEM;
+}
+
 /*
  * Allocate and initialise an xfs_inode.
  */
@@ -131,6 +150,8 @@ xfs_inode_alloc(
 	ip->i_next_unlinked = NULLAGINO;
 	ip->i_prev_unlinked = 0;
 
+	xfs_inode_init_ag_bitmap(ip);
+
 	return ip;
 }
 
@@ -194,6 +215,12 @@ xfs_inode_free(
 	ip->i_ino = 0;
 	spin_unlock(&ip->i_flags_lock);
 
+	/* free xarray contents (values are immediate packed ints) */
+	xa_destroy(&ip->i_ag_pmap);
+	kfree(ip->i_ag_dirty_bitmap);
+	ip->i_ag_dirty_bitmap = NULL;
+	ip->i_ag_dirty_bits = 0;
+
 	__xfs_inode_free(ip);
 }
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index bd6d33557194..dee449168605 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -99,6 +99,11 @@ typedef struct xfs_inode {
 	spinlock_t		i_ioend_lock;
 	struct work_struct	i_ioend_work;
 	struct list_head	i_ioend_list;
+
+	/* AG prediction map: pgoff_t -> packed u32 */
+	struct xarray		i_ag_pmap;
+	unsigned long		*i_ag_dirty_bitmap;
+	unsigned int		i_ag_dirty_bits;
 } xfs_inode_t;
 
 static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip)
-- 
2.25.1

^ permalink raw reply	[flat|nested] 48+ messages in thread
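[For context: a sketch of the intended consumer of i_ag_dirty_bitmap
(patch 6 of the series) - walk the bitmap at writeback time and kick
one worker per dirty AG. xfs_queue_ag_writeback() is a hypothetical
placeholder, and a real version must serialize against writers
setting bits concurrently.]

/* Wake per-AG writeback workers for every AG this inode dirtied. */
static void
xfs_wake_dirty_ag_workers(
	struct xfs_inode	*ip)
{
	unsigned int		agno;

	for_each_set_bit(agno, ip->i_ag_dirty_bitmap, ip->i_ag_dirty_bits) {
		clear_bit(agno, ip->i_ag_dirty_bitmap);
		xfs_queue_ag_writeback(ip, agno);	/* hypothetical */
	}
}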
* Re: [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap
  2026-01-16 10:08   ` [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap Kundan Kumar
@ 2026-01-29  0:44     ` Darrick J. Wong
  2026-02-03  7:20       ` Kundan Kumar
  2026-02-05  6:44       ` Nirjhar Roy (IBM)
  0 siblings, 2 replies; 48+ messages in thread
From: Darrick J. Wong @ 2026-01-29  0:44 UTC (permalink / raw)
  To: Kundan Kumar
  Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe,
	hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm,
	linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k

On Fri, Jan 16, 2026 at 03:38:15PM +0530, Kundan Kumar wrote:
> Add per-inode structures to track predicted AGs of dirty folios using
> an xarray and bitmap. This enables efficient identification of AGs
> involved in writeback.
>
> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> ---
>  fs/xfs/xfs_icache.c | 27 +++++++++++++++++++++++++++
>  fs/xfs/xfs_inode.h  |  5 +++++
>  2 files changed, 32 insertions(+)
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index e44040206851..f97aa6d66271 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -80,6 +80,25 @@ static inline xa_mark_t ici_tag_to_mark(unsigned int tag)
>  	return XFS_PERAG_BLOCKGC_MARK;
>  }
>
> +static int xfs_inode_init_ag_bitmap(struct xfs_inode *ip)
> +{
> +	unsigned int bits = ip->i_mount->m_sb.sb_agcount;
> +	unsigned int nlongs;
> +
> +	xa_init_flags(&ip->i_ag_pmap, XA_FLAGS_LOCK_IRQ);

This increases the size of struct xfs_inode by 40 bytes...

> +	ip->i_ag_dirty_bitmap = NULL;
> +	ip->i_ag_dirty_bits = bits;
> +
> +	if (!bits)
> +		return 0;
> +
> +	nlongs = BITS_TO_LONGS(bits);
> +	ip->i_ag_dirty_bitmap = kcalloc(nlongs, sizeof(unsigned long),
> +					GFP_NOFS);

...and there could be hundreds or thousands of AGs for each filesystem.
That's a lot of kernel memory to handle this prediction stuff, and I'm
not even sure what ag_dirty_bitmap does yet.

> +
> +	return ip->i_ag_dirty_bitmap ? 0 : -ENOMEM;
> +}
> +
>  /*
>   * Allocate and initialise an xfs_inode.
>   */
> @@ -131,6 +150,8 @@ xfs_inode_alloc(
>  	ip->i_next_unlinked = NULLAGINO;
>  	ip->i_prev_unlinked = 0;
>
> +	xfs_inode_init_ag_bitmap(ip);

Unchecked return value???

> +
>  	return ip;
>  }
>
> @@ -194,6 +215,12 @@ xfs_inode_free(
>  	ip->i_ino = 0;
>  	spin_unlock(&ip->i_flags_lock);
>
> +	/* free xarray contents (values are immediate packed ints) */
> +	xa_destroy(&ip->i_ag_pmap);
> +	kfree(ip->i_ag_dirty_bitmap);
> +	ip->i_ag_dirty_bitmap = NULL;
> +	ip->i_ag_dirty_bits = 0;
> +
>  	__xfs_inode_free(ip);
>  }
>
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index bd6d33557194..dee449168605 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -99,6 +99,11 @@ typedef struct xfs_inode {
>  	spinlock_t		i_ioend_lock;
>  	struct work_struct	i_ioend_work;
>  	struct list_head	i_ioend_list;
> +
> +	/* AG prediction map: pgoff_t -> packed u32 */

What about blocksize < pagesize filesystems? Which packed agno do you
associate with the pgoff_t?

Also, do you have an xarray entry for each pgoff_t in a large folio?

--D

> +	struct xarray		i_ag_pmap;
> +	unsigned long		*i_ag_dirty_bitmap;
> +	unsigned int		i_ag_dirty_bits;
>  } xfs_inode_t;
>
>  static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip)
> --
> 2.25.1
>
>

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap
  2026-01-29  0:44     ` Darrick J. Wong
@ 2026-02-03  7:20       ` Kundan Kumar
  2026-02-05 16:42         ` Darrick J. Wong
  2026-02-05  6:44       ` Nirjhar Roy (IBM)
  1 sibling, 1 reply; 48+ messages in thread
From: Kundan Kumar @ 2026-02-03  7:20 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe,
	hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm,
	linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k

On 1/29/2026 6:14 AM, Darrick J. Wong wrote:
> On Fri, Jan 16, 2026 at 03:38:15PM +0530, Kundan Kumar wrote:
>> Add per-inode structures to track predicted AGs of dirty folios using
>> an xarray and bitmap. This enables efficient identification of AGs
>> involved in writeback.
>>
>> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
>> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
>> ---
>>  fs/xfs/xfs_icache.c | 27 +++++++++++++++++++++++++++
>>  fs/xfs/xfs_inode.h  |  5 +++++
>>  2 files changed, 32 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
>> index e44040206851..f97aa6d66271 100644
>> --- a/fs/xfs/xfs_icache.c
>> +++ b/fs/xfs/xfs_icache.c
>> @@ -80,6 +80,25 @@ static inline xa_mark_t ici_tag_to_mark(unsigned int tag)
>>  	return XFS_PERAG_BLOCKGC_MARK;
>>  }
>>
>> +static int xfs_inode_init_ag_bitmap(struct xfs_inode *ip)
>> +{
>> +	unsigned int bits = ip->i_mount->m_sb.sb_agcount;
>> +	unsigned int nlongs;
>> +
>> +	xa_init_flags(&ip->i_ag_pmap, XA_FLAGS_LOCK_IRQ);
>
> This increases the size of struct xfs_inode by 40 bytes...
>

I'll make this lazy and sparse: move AG writeback state behind a
pointer allocated on first use, and replace the bitmap with a sparse
dirty-AG set (xarray keyed by agno) so memory scales with the AGs
actually touched by the inode.

>> +	ip->i_ag_dirty_bitmap = NULL;
>> +	ip->i_ag_dirty_bits = bits;
>> +
>> +	if (!bits)
>> +		return 0;
>> +
>> +	nlongs = BITS_TO_LONGS(bits);
>> +	ip->i_ag_dirty_bitmap = kcalloc(nlongs, sizeof(unsigned long),
>> +					GFP_NOFS);
>
> ...and there could be hundreds or thousands of AGs for each filesystem.
> That's a lot of kernel memory to handle this prediction stuff, and I'm
> not even sure what ag_dirty_bitmap does yet.
>

The bit for an AG is set in ag_dirty_bitmap at write time. During
writeback, we check which AG bits are set, wake only those AG-specific
workers, and each worker scans the page cache, filters folios tagged
for its AG, and submits the I/O.

>> +
>> +	return ip->i_ag_dirty_bitmap ? 0 : -ENOMEM;
>> +}
>> +
>>  /*
>>   * Allocate and initialise an xfs_inode.
>>   */
>> @@ -131,6 +150,8 @@ xfs_inode_alloc(
>>  	ip->i_next_unlinked = NULLAGINO;
>>  	ip->i_prev_unlinked = 0;
>>
>> +	xfs_inode_init_ag_bitmap(ip);
>
> Unchecked return value???

Will correct in the next version.

>
>> +
>>  	return ip;
>>  }
>>
>> @@ -194,6 +215,12 @@ xfs_inode_free(
>>  	ip->i_ino = 0;
>>  	spin_unlock(&ip->i_flags_lock);
>>
>> +	/* free xarray contents (values are immediate packed ints) */
>> +	xa_destroy(&ip->i_ag_pmap);
>> +	kfree(ip->i_ag_dirty_bitmap);
>> +	ip->i_ag_dirty_bitmap = NULL;
>> +	ip->i_ag_dirty_bits = 0;
>> +
>>  	__xfs_inode_free(ip);
>>  }
>>
>> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
>> index bd6d33557194..dee449168605 100644
>> --- a/fs/xfs/xfs_inode.h
>> +++ b/fs/xfs/xfs_inode.h
>> @@ -99,6 +99,11 @@ typedef struct xfs_inode {
>>  	spinlock_t		i_ioend_lock;
>>  	struct work_struct	i_ioend_work;
>>  	struct list_head	i_ioend_list;
>> +
>> +	/* AG prediction map: pgoff_t -> packed u32 */
>
> What about blocksize < pagesize filesystems? Which packed agno do you
> associate with the pgoff_t?
>
> Also, do you have an xarray entry for each pgoff_t in a large folio?
>
> --D
>

pgoff_t here is the pagecache index (folio->index), i.e. the file
offset in PAGE_SIZE units, not a filesystem block index. So blocksize
< PAGE_SIZE doesn't change the association; the packed agno is
attached to the folio at that pagecache index.

We store one xarray entry per folio index (the start of the folio). We
do not create entries for each base page inside a large folio. If a
large folio could span multiple extents/AGs, we'll treat the hint as
advisory and tag it invalid (falling back to normal writeback routing)
rather than trying to encode per-subpage AGs.

>> +	struct xarray		i_ag_pmap;
>> +	unsigned long		*i_ag_dirty_bitmap;
>> +	unsigned int		i_ag_dirty_bits;
>>  } xfs_inode_t;
>>
>>  static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip)
>> --
>> 2.25.1
>>
>>
>

^ permalink raw reply	[flat|nested] 48+ messages in thread
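[A minimal sketch of the lazily allocated, sparse state proposed in
the reply above. The structure layout, the i_ag_wb field, and the
helper name are all hypothetical; a real version needs a cmpxchg (or
lock) against racing first-use allocations.]

/* Per-inode AG writeback state, allocated on first dirtying. */
struct xfs_ag_wb_state {
	struct xarray	pmap;		/* folio index -> packed prediction */
	struct xarray	dirty_ags;	/* agno present iff AG has dirty data */
};

static int
xfs_mark_ag_dirty(
	struct xfs_inode	*ip,
	xfs_agnumber_t		agno)
{
	struct xfs_ag_wb_state	*st = ip->i_ag_wb;	/* hypothetical field */

	if (!st) {
		st = kzalloc(sizeof(*st), GFP_NOFS);
		if (!st)
			return -ENOMEM;
		xa_init_flags(&st->pmap, XA_FLAGS_LOCK_IRQ);
		xa_init(&st->dirty_ags);
		ip->i_ag_wb = st;	/* racy; see note above */
	}

	/* memory now scales with AGs actually touched by this inode */
	return xa_err(xa_store(&st->dirty_ags, agno, xa_mk_value(1),
			GFP_NOFS));
}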
* Re: [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap 2026-02-03 7:20 ` Kundan Kumar @ 2026-02-05 16:42 ` Darrick J. Wong 0 siblings, 0 replies; 48+ messages in thread From: Darrick J. Wong @ 2026-02-05 16:42 UTC (permalink / raw) To: Kundan Kumar Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Tue, Feb 03, 2026 at 12:50:53PM +0530, Kundan Kumar wrote: > On 1/29/2026 6:14 AM, Darrick J. Wong wrote: > > On Fri, Jan 16, 2026 at 03:38:15PM +0530, Kundan Kumar wrote: > >> Add per-inode structures to track predicted AGs of dirty folios using > >> an xarray and bitmap. This enables efficient identification of AGs > >> involved in writeback. > >> > >> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > >> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > >> --- > >> fs/xfs/xfs_icache.c | 27 +++++++++++++++++++++++++++ > >> fs/xfs/xfs_inode.h | 5 +++++ > >> 2 files changed, 32 insertions(+) > >> > >> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > >> index e44040206851..f97aa6d66271 100644 > >> --- a/fs/xfs/xfs_icache.c > >> +++ b/fs/xfs/xfs_icache.c > >> @@ -80,6 +80,25 @@ static inline xa_mark_t ici_tag_to_mark(unsigned int tag) > >> return XFS_PERAG_BLOCKGC_MARK; > >> } > >> > >> +static int xfs_inode_init_ag_bitmap(struct xfs_inode *ip) > >> +{ > >> + unsigned int bits = ip->i_mount->m_sb.sb_agcount; > >> + unsigned int nlongs; > >> + > >> + xa_init_flags(&ip->i_ag_pmap, XA_FLAGS_LOCK_IRQ); > > > > This increases the size of struct xfs_inode by 40 bytes... > > > > I’ll make this lazy and sparse: move AG writeback state behind a pointer > allocated on first use, and replace the bitmap with a sparse dirty-AG > set(xarray keyed by agno) so memory scales with AGs actually touched by > the inode. > > >> + ip->i_ag_dirty_bitmap = NULL; > >> + ip->i_ag_dirty_bits = bits; > >> + > >> + if (!bits) > >> + return 0; > >> + > >> + nlongs = BITS_TO_LONGS(bits); > >> + ip->i_ag_dirty_bitmap = kcalloc(nlongs, sizeof(unsigned long), > >> + GFP_NOFS); > > > > ...and there could be hundreds or thousands of AGs for each filesystem. > > That's a lot of kernel memory to handle this prediction stuff, and I"m > > not even sure what ag_dirty_bitmap does yet. > > > > The bit for an AG is set in ag_dirty_bitmap at write time. During > writeback, we check which AG bits are set, wake only those AG-specific > workers, and each worker scans the page cache, filters folios tagged for > its AG, and submits the I/O. > > >> + > >> + return ip->i_ag_dirty_bitmap ? 0 : -ENOMEM; > >> +} > >> + > >> /* > >> * Allocate and initialise an xfs_inode. > >> */ > >> @@ -131,6 +150,8 @@ xfs_inode_alloc( > >> ip->i_next_unlinked = NULLAGINO; > >> ip->i_prev_unlinked = 0; > >> > >> + xfs_inode_init_ag_bitmap(ip); > > > > Unchecked return value??? 
> > Will correct in next version > > > > >> + > >> return ip; > >> } > >> > >> @@ -194,6 +215,12 @@ xfs_inode_free( > >> ip->i_ino = 0; > >> spin_unlock(&ip->i_flags_lock); > >> > >> + /* free xarray contents (values are immediate packed ints) */ > >> + xa_destroy(&ip->i_ag_pmap); > >> + kfree(ip->i_ag_dirty_bitmap); > >> + ip->i_ag_dirty_bitmap = NULL; > >> + ip->i_ag_dirty_bits = 0; > >> + > >> __xfs_inode_free(ip); > >> } > >> > >> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > >> index bd6d33557194..dee449168605 100644 > >> --- a/fs/xfs/xfs_inode.h > >> +++ b/fs/xfs/xfs_inode.h > >> @@ -99,6 +99,11 @@ typedef struct xfs_inode { > >> spinlock_t i_ioend_lock; > >> struct work_struct i_ioend_work; > >> struct list_head i_ioend_list; > >> + > >> + /* AG prediction map: pgoff_t -> packed u32 */ > > > > What about blocksize < pagesize filesystems? Which packed agno do you > > associate with the pgoff_t? > > > > Also, do you have an xarray entry for each pgoff_t in a large folio? > > > > --D > > > > pgoff_t here is the pagecache index (folio->index), i.e. file offset in > PAGE_SIZE units, not a filesystem block index. So blocksize < PAGE_SIZE > doesn’t change the association, the packed agno is attached to the folio > at that pagecache index. Ok, so the tag is entirely determined by the AG of the first fsblock within the folio. > We store one xarray entry per folio index (the start of the folio). We > do not create entries for each base-page inside a large folio. If a > large folio could span multiple extents/AGs, we’ll treat the hint as > advisory and tag it invalid (fallback to normal writeback routing) > rather than trying to encode per-subpage AGs. Oh, ok, so if you have the mapping and the folio at the same time you can determine that the entire large folio maps to a single extent, and tag the whole large folio as belonging to a single AG. That clears things up, thank you. It's only in the case of extreme fragmentation that a large folio gets flung at the old writeback paths, which is probably good enough anyway. --D > >> + struct xarray i_ag_pmap; > >> + unsigned long *i_ag_dirty_bitmap; > >> + unsigned int i_ag_dirty_bits; > >> } xfs_inode_t; > >> > >> static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip) > >> -- > >> 2.25.1 > >> > >> > > > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap 2026-01-29 0:44 ` Darrick J. Wong 2026-02-03 7:20 ` Kundan Kumar @ 2026-02-05 6:44 ` Nirjhar Roy (IBM) 2026-02-05 16:32 ` Darrick J. Wong 1 sibling, 1 reply; 48+ messages in thread From: Nirjhar Roy (IBM) @ 2026-02-05 6:44 UTC (permalink / raw) To: Darrick J. Wong, Kundan Kumar Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Wed, 2026-01-28 at 16:44 -0800, Darrick J. Wong wrote: > On Fri, Jan 16, 2026 at 03:38:15PM +0530, Kundan Kumar wrote: > > Add per-inode structures to track predicted AGs of dirty folios using > > an xarray and bitmap. This enables efficient identification of AGs > > involved in writeback. > > > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > > --- > > fs/xfs/xfs_icache.c | 27 +++++++++++++++++++++++++++ > > fs/xfs/xfs_inode.h | 5 +++++ > > 2 files changed, 32 insertions(+) > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > index e44040206851..f97aa6d66271 100644 > > --- a/fs/xfs/xfs_icache.c > > +++ b/fs/xfs/xfs_icache.c > > @@ -80,6 +80,25 @@ static inline xa_mark_t ici_tag_to_mark(unsigned int tag) > > return XFS_PERAG_BLOCKGC_MARK; > > } > > > > +static int xfs_inode_init_ag_bitmap(struct xfs_inode *ip) > > +{ > > + unsigned int bits = ip->i_mount->m_sb.sb_agcount; > > + unsigned int nlongs; > > + > > + xa_init_flags(&ip->i_ag_pmap, XA_FLAGS_LOCK_IRQ); > > This increases the size of struct xfs_inode by 40 bytes... > > > + ip->i_ag_dirty_bitmap = NULL; > > + ip->i_ag_dirty_bits = bits; > > + > > + if (!bits) > > + return 0; > > + > > + nlongs = BITS_TO_LONGS(bits); > > + ip->i_ag_dirty_bitmap = kcalloc(nlongs, sizeof(unsigned long), > > + GFP_NOFS); > > ...and there could be hundreds or thousands of AGs for each filesystem. > That's a lot of kernel memory to handle this prediction stuff, and I"m > not even sure what ag_dirty_bitmap does yet. > > > + > > + return ip->i_ag_dirty_bitmap ? 0 : -ENOMEM; > > +} > > + > > /* > > * Allocate and initialise an xfs_inode. > > */ > > @@ -131,6 +150,8 @@ xfs_inode_alloc( > > ip->i_next_unlinked = NULLAGINO; > > ip->i_prev_unlinked = 0; > > > > + xfs_inode_init_ag_bitmap(ip); > > Unchecked return value??? > > > + > > return ip; > > } > > > > @@ -194,6 +215,12 @@ xfs_inode_free( > > ip->i_ino = 0; > > spin_unlock(&ip->i_flags_lock); > > > > + /* free xarray contents (values are immediate packed ints) */ > > + xa_destroy(&ip->i_ag_pmap); > > + kfree(ip->i_ag_dirty_bitmap); > > + ip->i_ag_dirty_bitmap = NULL; > > + ip->i_ag_dirty_bits = 0; > > + > > __xfs_inode_free(ip); > > } > > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > > index bd6d33557194..dee449168605 100644 > > --- a/fs/xfs/xfs_inode.h > > +++ b/fs/xfs/xfs_inode.h > > @@ -99,6 +99,11 @@ typedef struct xfs_inode { > > spinlock_t i_ioend_lock; > > struct work_struct i_ioend_work; > > struct list_head i_ioend_list; > > + > > + /* AG prediction map: pgoff_t -> packed u32 */ > > What about blocksize < pagesize filesystems? Which packed agno do you > associate with the pgoff_t? Sorry but I am bit unfamiliar with the term packed agno? Can you please briefly explain does packed agno mean? --NR > > Also, do you have an xarray entry for each pgoff_t in a large folio? 
> > --D > > > + struct xarray i_ag_pmap; > > + unsigned long *i_ag_dirty_bitmap; > > + unsigned int i_ag_dirty_bits; > > } xfs_inode_t; > > > > static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip) > > -- > > 2.25.1 > > > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap 2026-02-05 6:44 ` Nirjhar Roy (IBM) @ 2026-02-05 16:32 ` Darrick J. Wong 2026-02-06 5:41 ` Nirjhar Roy (IBM) 0 siblings, 1 reply; 48+ messages in thread From: Darrick J. Wong @ 2026-02-05 16:32 UTC (permalink / raw) To: Nirjhar Roy (IBM) Cc: Kundan Kumar, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Thu, Feb 05, 2026 at 12:14:35PM +0530, Nirjhar Roy (IBM) wrote: > On Wed, 2026-01-28 at 16:44 -0800, Darrick J. Wong wrote: > > On Fri, Jan 16, 2026 at 03:38:15PM +0530, Kundan Kumar wrote: > > > Add per-inode structures to track predicted AGs of dirty folios using > > > an xarray and bitmap. This enables efficient identification of AGs > > > involved in writeback. > > > > > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > > > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > > > --- > > > fs/xfs/xfs_icache.c | 27 +++++++++++++++++++++++++++ > > > fs/xfs/xfs_inode.h | 5 +++++ > > > 2 files changed, 32 insertions(+) > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > index e44040206851..f97aa6d66271 100644 > > > --- a/fs/xfs/xfs_icache.c > > > +++ b/fs/xfs/xfs_icache.c > > > @@ -80,6 +80,25 @@ static inline xa_mark_t ici_tag_to_mark(unsigned int tag) > > > return XFS_PERAG_BLOCKGC_MARK; > > > } > > > > > > +static int xfs_inode_init_ag_bitmap(struct xfs_inode *ip) > > > +{ > > > + unsigned int bits = ip->i_mount->m_sb.sb_agcount; > > > + unsigned int nlongs; > > > + > > > + xa_init_flags(&ip->i_ag_pmap, XA_FLAGS_LOCK_IRQ); > > > > This increases the size of struct xfs_inode by 40 bytes... > > > > > + ip->i_ag_dirty_bitmap = NULL; > > > + ip->i_ag_dirty_bits = bits; > > > + > > > + if (!bits) > > > + return 0; > > > + > > > + nlongs = BITS_TO_LONGS(bits); > > > + ip->i_ag_dirty_bitmap = kcalloc(nlongs, sizeof(unsigned long), > > > + GFP_NOFS); > > > > ...and there could be hundreds or thousands of AGs for each filesystem. > > That's a lot of kernel memory to handle this prediction stuff, and I"m > > not even sure what ag_dirty_bitmap does yet. > > > > > + > > > + return ip->i_ag_dirty_bitmap ? 0 : -ENOMEM; > > > +} > > > + > > > /* > > > * Allocate and initialise an xfs_inode. > > > */ > > > @@ -131,6 +150,8 @@ xfs_inode_alloc( > > > ip->i_next_unlinked = NULLAGINO; > > > ip->i_prev_unlinked = 0; > > > > > > + xfs_inode_init_ag_bitmap(ip); > > > > Unchecked return value??? > > > > > + > > > return ip; > > > } > > > > > > @@ -194,6 +215,12 @@ xfs_inode_free( > > > ip->i_ino = 0; > > > spin_unlock(&ip->i_flags_lock); > > > > > > + /* free xarray contents (values are immediate packed ints) */ > > > + xa_destroy(&ip->i_ag_pmap); > > > + kfree(ip->i_ag_dirty_bitmap); > > > + ip->i_ag_dirty_bitmap = NULL; > > > + ip->i_ag_dirty_bits = 0; > > > + > > > __xfs_inode_free(ip); > > > } > > > > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > > > index bd6d33557194..dee449168605 100644 > > > --- a/fs/xfs/xfs_inode.h > > > +++ b/fs/xfs/xfs_inode.h > > > @@ -99,6 +99,11 @@ typedef struct xfs_inode { > > > spinlock_t i_ioend_lock; > > > struct work_struct i_ioend_work; > > > struct list_head i_ioend_list; > > > + > > > + /* AG prediction map: pgoff_t -> packed u32 */ > > > > What about blocksize < pagesize filesystems? Which packed agno do you > > associate with the pgoff_t? > Sorry but I am bit unfamiliar with the term packed agno? 
Can you please briefly explain does packed > agno mean? I was talking about the "xfs_agp" numbers introduced in the previous patch, which seem to contain the bottom 25 bits of the ag number. Really I was just asking about multi-fsblock folios -- which block within that folio do we map to an AG for tagging purposes? I think Kundan replied elsewhere that it's the first block. --D > --NR > > > > Also, do you have an xarray entry for each pgoff_t in a large folio? > > > > --D > > > > > + struct xarray i_ag_pmap; > > > + unsigned long *i_ag_dirty_bitmap; > > > + unsigned int i_ag_dirty_bits; > > > } xfs_inode_t; > > > > > > static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip) > > > -- > > > 2.25.1 > > > > > > > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap 2026-02-05 16:32 ` Darrick J. Wong @ 2026-02-06 5:41 ` Nirjhar Roy (IBM) 0 siblings, 0 replies; 48+ messages in thread From: Nirjhar Roy (IBM) @ 2026-02-06 5:41 UTC (permalink / raw) To: Darrick J. Wong Cc: Kundan Kumar, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Thu, 2026-02-05 at 08:32 -0800, Darrick J. Wong wrote: > On Thu, Feb 05, 2026 at 12:14:35PM +0530, Nirjhar Roy (IBM) wrote: > > On Wed, 2026-01-28 at 16:44 -0800, Darrick J. Wong wrote: > > > On Fri, Jan 16, 2026 at 03:38:15PM +0530, Kundan Kumar wrote: > > > > Add per-inode structures to track predicted AGs of dirty folios using > > > > an xarray and bitmap. This enables efficient identification of AGs > > > > involved in writeback. > > > > > > > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > > > > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > > > > --- > > > > fs/xfs/xfs_icache.c | 27 +++++++++++++++++++++++++++ > > > > fs/xfs/xfs_inode.h | 5 +++++ > > > > 2 files changed, 32 insertions(+) > > > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > > index e44040206851..f97aa6d66271 100644 > > > > --- a/fs/xfs/xfs_icache.c > > > > +++ b/fs/xfs/xfs_icache.c > > > > @@ -80,6 +80,25 @@ static inline xa_mark_t ici_tag_to_mark(unsigned int tag) > > > > return XFS_PERAG_BLOCKGC_MARK; > > > > } > > > > > > > > +static int xfs_inode_init_ag_bitmap(struct xfs_inode *ip) > > > > +{ > > > > + unsigned int bits = ip->i_mount->m_sb.sb_agcount; > > > > + unsigned int nlongs; > > > > + > > > > + xa_init_flags(&ip->i_ag_pmap, XA_FLAGS_LOCK_IRQ); > > > > > > This increases the size of struct xfs_inode by 40 bytes... > > > > > > > + ip->i_ag_dirty_bitmap = NULL; > > > > + ip->i_ag_dirty_bits = bits; > > > > + > > > > + if (!bits) > > > > + return 0; > > > > + > > > > + nlongs = BITS_TO_LONGS(bits); > > > > + ip->i_ag_dirty_bitmap = kcalloc(nlongs, sizeof(unsigned long), > > > > + GFP_NOFS); > > > > > > ...and there could be hundreds or thousands of AGs for each filesystem. > > > That's a lot of kernel memory to handle this prediction stuff, and I"m > > > not even sure what ag_dirty_bitmap does yet. > > > > > > > + > > > > + return ip->i_ag_dirty_bitmap ? 0 : -ENOMEM; > > > > +} > > > > + > > > > /* > > > > * Allocate and initialise an xfs_inode. > > > > */ > > > > @@ -131,6 +150,8 @@ xfs_inode_alloc( > > > > ip->i_next_unlinked = NULLAGINO; > > > > ip->i_prev_unlinked = 0; > > > > > > > > + xfs_inode_init_ag_bitmap(ip); > > > > > > Unchecked return value??? 
> > > > > > > + > > > > return ip; > > > > } > > > > > > > > @@ -194,6 +215,12 @@ xfs_inode_free( > > > > ip->i_ino = 0; > > > > spin_unlock(&ip->i_flags_lock); > > > > > > > > + /* free xarray contents (values are immediate packed ints) */ > > > > + xa_destroy(&ip->i_ag_pmap); > > > > + kfree(ip->i_ag_dirty_bitmap); > > > > + ip->i_ag_dirty_bitmap = NULL; > > > > + ip->i_ag_dirty_bits = 0; > > > > + > > > > __xfs_inode_free(ip); > > > > } > > > > > > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > > > > index bd6d33557194..dee449168605 100644 > > > > --- a/fs/xfs/xfs_inode.h > > > > +++ b/fs/xfs/xfs_inode.h > > > > @@ -99,6 +99,11 @@ typedef struct xfs_inode { > > > > spinlock_t i_ioend_lock; > > > > struct work_struct i_ioend_work; > > > > struct list_head i_ioend_list; > > > > + > > > > + /* AG prediction map: pgoff_t -> packed u32 */ > > > > > > What about blocksize < pagesize filesystems? Which packed agno do you > > > associate with the pgoff_t? > > Sorry but I am bit unfamiliar with the term packed agno? Can you please briefly explain does packed > > agno mean? > > I was talking about the "xfs_agp" numbers introduced in the previous > patch, which seem to contain the bottom 25 bits of the ag number. Okay, right. Thank you. > > Really I was just asking about multi-fsblock folios -- which block > within that folio do we map to an AG for tagging purposes? > > I think Kundan replied elsewhere that it's the first block. Okay. --NR > > --D > > > --NR > > > Also, do you have an xarray entry for each pgoff_t in a large folio? > > > > > > --D > > > > > > > + struct xarray i_ag_pmap; > > > > + unsigned long *i_ag_dirty_bitmap; > > > > + unsigned int i_ag_dirty_bits; > > > > } xfs_inode_t; > > > > > > > > static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip) > > > > -- > > > > 2.25.1 > > > > > > > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap 2026-01-16 10:08 ` [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap Kundan Kumar 2026-01-29 0:44 ` Darrick J. Wong @ 2026-02-05 6:36 ` Nirjhar Roy (IBM) 2026-02-05 16:36 ` Darrick J. Wong 2026-02-06 7:00 ` Christoph Hellwig 2 siblings, 1 reply; 48+ messages in thread From: Nirjhar Roy (IBM) @ 2026-02-05 6:36 UTC (permalink / raw) To: Kundan Kumar, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei Cc: linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, 2026-01-16 at 15:38 +0530, Kundan Kumar wrote: > Add per-inode structures to track predicted AGs of dirty folios using > an xarray and bitmap. This enables efficient identification of AGs > involved in writeback. > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > --- > fs/xfs/xfs_icache.c | 27 +++++++++++++++++++++++++++ > fs/xfs/xfs_inode.h | 5 +++++ > 2 files changed, 32 insertions(+) > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > index e44040206851..f97aa6d66271 100644 > --- a/fs/xfs/xfs_icache.c > +++ b/fs/xfs/xfs_icache.c > @@ -80,6 +80,25 @@ static inline xa_mark_t ici_tag_to_mark(unsigned int tag) > return XFS_PERAG_BLOCKGC_MARK; > } > > +static int xfs_inode_init_ag_bitmap(struct xfs_inode *ip) Similar comment as before: static int xfs_inode_init...() > +{ > + unsigned int bits = ip->i_mount->m_sb.sb_agcount; > + unsigned int nlongs; > + > + xa_init_flags(&ip->i_ag_pmap, XA_FLAGS_LOCK_IRQ); Nit: The name of the functions suggests that it is initializing the tracking bitmap which it does - however, the above line does slightly different thing? Maybe move the xarray init outside the bitmap init function? > + ip->i_ag_dirty_bitmap = NULL; > + ip->i_ag_dirty_bits = bits; > + > + if (!bits) Umm, !bits means agcount is 0. Shouldn't we ASSERT that bits >= 2? Or am I missing something? > + return 0; > + > + nlongs = BITS_TO_LONGS(bits); > + ip->i_ag_dirty_bitmap = kcalloc(nlongs, sizeof(unsigned long), > + GFP_NOFS); > + > + return ip->i_ag_dirty_bitmap ? 0 : -ENOMEM; > +} > + > /* > * Allocate and initialise an xfs_inode. > */ > @@ -131,6 +150,8 @@ xfs_inode_alloc( > ip->i_next_unlinked = NULLAGINO; > ip->i_prev_unlinked = 0; > > + xfs_inode_init_ag_bitmap(ip); xfs_inode_init_ag_bitmap() returns int - error handling for -ENOMEM? > + > return ip; > } > > @@ -194,6 +215,12 @@ xfs_inode_free( > ip->i_ino = 0; > spin_unlock(&ip->i_flags_lock); > > + /* free xarray contents (values are immediate packed ints) */ > + xa_destroy(&ip->i_ag_pmap); Nit:Maybe have a small wrapper for freeing it the prediction map? No hard preferences though. > + kfree(ip->i_ag_dirty_bitmap); > + ip->i_ag_dirty_bitmap = NULL; Nit: Usually while freeing the pointers I prefer: t = ip->i_ag_dirty_bitmap; ip->i_ag_dirty_bitmap = NULL; kfree(t); In this way, the pointer(i_ag_dirty_bitmap in this case) that I am freeing never points to an already freed address. 
> + ip->i_ag_dirty_bits = 0; > + > __xfs_inode_free(ip); > } > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > index bd6d33557194..dee449168605 100644 > --- a/fs/xfs/xfs_inode.h > +++ b/fs/xfs/xfs_inode.h > @@ -99,6 +99,11 @@ typedef struct xfs_inode { > spinlock_t i_ioend_lock; > struct work_struct i_ioend_work; > struct list_head i_ioend_list; > + > + /* AG prediction map: pgoff_t -> packed u32 */ > + struct xarray i_ag_pmap; > + unsigned long *i_ag_dirty_bitmap; > + unsigned int i_ag_dirty_bits; Not sure but, I mostly see the typedefed versions of data types being used like uint32 etc. Darrick, hch, are the above fine? --NR > } xfs_inode_t; > > static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip) ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap 2026-02-05 6:36 ` Nirjhar Roy (IBM) @ 2026-02-05 16:36 ` Darrick J. Wong 2026-02-06 5:36 ` Nirjhar Roy (IBM) 0 siblings, 1 reply; 48+ messages in thread From: Darrick J. Wong @ 2026-02-05 16:36 UTC (permalink / raw) To: Nirjhar Roy (IBM) Cc: Kundan Kumar, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Thu, Feb 05, 2026 at 12:06:19PM +0530, Nirjhar Roy (IBM) wrote: > On Fri, 2026-01-16 at 15:38 +0530, Kundan Kumar wrote: > > Add per-inode structures to track predicted AGs of dirty folios using > > an xarray and bitmap. This enables efficient identification of AGs > > involved in writeback. > > > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > > --- > > fs/xfs/xfs_icache.c | 27 +++++++++++++++++++++++++++ > > fs/xfs/xfs_inode.h | 5 +++++ > > 2 files changed, 32 insertions(+) > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > index e44040206851..f97aa6d66271 100644 > > --- a/fs/xfs/xfs_icache.c > > +++ b/fs/xfs/xfs_icache.c > > @@ -80,6 +80,25 @@ static inline xa_mark_t ici_tag_to_mark(unsigned int tag) > > return XFS_PERAG_BLOCKGC_MARK; > > } > > > > +static int xfs_inode_init_ag_bitmap(struct xfs_inode *ip) > Similar comment as before: > static int > xfs_inode_init...() > > +{ > > + unsigned int bits = ip->i_mount->m_sb.sb_agcount; > > + unsigned int nlongs; > > + > > + xa_init_flags(&ip->i_ag_pmap, XA_FLAGS_LOCK_IRQ); > Nit: The name of the functions suggests that it is initializing the tracking bitmap which it does - > however, the above line does slightly different thing? Maybe move the xarray init outside the bitmap > init function? Or just call it something else? xfs_inode_init_perag_wb? > > + ip->i_ag_dirty_bitmap = NULL; > > + ip->i_ag_dirty_bits = bits; > > + > > + if (!bits) > Umm, !bits means agcount is 0. Shouldn't we ASSERT that bits >= 2? Or am I missing something? Technically you can have 1 AG, but you definitely can't mount a zero AG filesystem. > > + return 0; > > + > > + nlongs = BITS_TO_LONGS(bits); > > + ip->i_ag_dirty_bitmap = kcalloc(nlongs, sizeof(unsigned long), > > + GFP_NOFS); > > + > > + return ip->i_ag_dirty_bitmap ? 0 : -ENOMEM; > > +} > > + > > /* > > * Allocate and initialise an xfs_inode. > > */ > > @@ -131,6 +150,8 @@ xfs_inode_alloc( > > ip->i_next_unlinked = NULLAGINO; > > ip->i_prev_unlinked = 0; > > > > + xfs_inode_init_ag_bitmap(ip); > xfs_inode_init_ag_bitmap() returns int - error handling for -ENOMEM? > > + > > return ip; > > } > > > > @@ -194,6 +215,12 @@ xfs_inode_free( > > ip->i_ino = 0; > > spin_unlock(&ip->i_flags_lock); > > > > + /* free xarray contents (values are immediate packed ints) */ > > + xa_destroy(&ip->i_ag_pmap); > Nit:Maybe have a small wrapper for freeing it the prediction map? No hard preferences though. > > + kfree(ip->i_ag_dirty_bitmap); > > + ip->i_ag_dirty_bitmap = NULL; > Nit: Usually while freeing the pointers I prefer: > t = ip->i_ag_dirty_bitmap; > ip->i_ag_dirty_bitmap = NULL; > kfree(t); > In this way, the pointer(i_ag_dirty_bitmap in this case) that I am freeing never points to an > already freed address. 
> > > + ip->i_ag_dirty_bits = 0; > > + > > __xfs_inode_free(ip); > > } > > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > > index bd6d33557194..dee449168605 100644 > > --- a/fs/xfs/xfs_inode.h > > +++ b/fs/xfs/xfs_inode.h > > @@ -99,6 +99,11 @@ typedef struct xfs_inode { > > spinlock_t i_ioend_lock; > > struct work_struct i_ioend_work; > > struct list_head i_ioend_list; > > + > > + /* AG prediction map: pgoff_t -> packed u32 */ > > + struct xarray i_ag_pmap; > > + unsigned long *i_ag_dirty_bitmap; > > + unsigned int i_ag_dirty_bits; > Not sure but, I mostly see the typedefed versions of data types being used like uint32 etc. Darrick, > hch, are the above fine? Yes, please don't mix types. Pick one type and stick with it. (and yes I wish we could struct bitmap_t(unsigned long)) --D > --NR > > } xfs_inode_t; > > > > static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip) > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap 2026-02-05 16:36 ` Darrick J. Wong @ 2026-02-06 5:36 ` Nirjhar Roy (IBM) 2026-02-06 5:57 ` Darrick J. Wong 0 siblings, 1 reply; 48+ messages in thread From: Nirjhar Roy (IBM) @ 2026-02-06 5:36 UTC (permalink / raw) To: Darrick J. Wong Cc: Kundan Kumar, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Thu, 2026-02-05 at 08:36 -0800, Darrick J. Wong wrote: > On Thu, Feb 05, 2026 at 12:06:19PM +0530, Nirjhar Roy (IBM) wrote: > > On Fri, 2026-01-16 at 15:38 +0530, Kundan Kumar wrote: > > > Add per-inode structures to track predicted AGs of dirty folios using > > > an xarray and bitmap. This enables efficient identification of AGs > > > involved in writeback. > > > > > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > > > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > > > --- > > > fs/xfs/xfs_icache.c | 27 +++++++++++++++++++++++++++ > > > fs/xfs/xfs_inode.h | 5 +++++ > > > 2 files changed, 32 insertions(+) > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > index e44040206851..f97aa6d66271 100644 > > > --- a/fs/xfs/xfs_icache.c > > > +++ b/fs/xfs/xfs_icache.c > > > @@ -80,6 +80,25 @@ static inline xa_mark_t ici_tag_to_mark(unsigned int tag) > > > return XFS_PERAG_BLOCKGC_MARK; > > > } > > > > > > +static int xfs_inode_init_ag_bitmap(struct xfs_inode *ip) > > Similar comment as before: > > static int > > xfs_inode_init...() > > > +{ > > > + unsigned int bits = ip->i_mount->m_sb.sb_agcount; > > > + unsigned int nlongs; > > > + > > > + xa_init_flags(&ip->i_ag_pmap, XA_FLAGS_LOCK_IRQ); > > Nit: The name of the functions suggests that it is initializing the tracking bitmap which it does - > > however, the above line does slightly different thing? Maybe move the xarray init outside the bitmap > > init function? > > Or just call it something else? xfs_inode_init_perag_wb? > > > > + ip->i_ag_dirty_bitmap = NULL; > > > + ip->i_ag_dirty_bits = bits; > > > + > > > + if (!bits) > > Umm, !bits means agcount is 0. Shouldn't we ASSERT that bits >= 2? Or am I missing something? > > Technically you can have 1 AG, but you definitely can't mount a zero AG > filesystem. Okay, but: /home/ubuntu$ mkfs.xfs -f -d agcount=1 /dev/loop0 Filesystem must have at least 2 superblocks for redundancy! Usage: mkfs.xfs Or maybe this restriction is just at the userspace tool level? > > > > + return 0; > > > + > > > + nlongs = BITS_TO_LONGS(bits); > > > + ip->i_ag_dirty_bitmap = kcalloc(nlongs, sizeof(unsigned long), > > > + GFP_NOFS); > > > + > > > + return ip->i_ag_dirty_bitmap ? 0 : -ENOMEM; > > > +} > > > + > > > /* > > > * Allocate and initialise an xfs_inode. > > > */ > > > @@ -131,6 +150,8 @@ xfs_inode_alloc( > > > ip->i_next_unlinked = NULLAGINO; > > > ip->i_prev_unlinked = 0; > > > > > > + xfs_inode_init_ag_bitmap(ip); > > xfs_inode_init_ag_bitmap() returns int - error handling for -ENOMEM? > > > + > > > return ip; > > > } > > > > > > @@ -194,6 +215,12 @@ xfs_inode_free( > > > ip->i_ino = 0; > > > spin_unlock(&ip->i_flags_lock); > > > > > > + /* free xarray contents (values are immediate packed ints) */ > > > + xa_destroy(&ip->i_ag_pmap); > > Nit:Maybe have a small wrapper for freeing it the prediction map? No hard preferences though. 
> > > + kfree(ip->i_ag_dirty_bitmap); > > > + ip->i_ag_dirty_bitmap = NULL; > > Nit: Usually while freeing the pointers I prefer: > > t = ip->i_ag_dirty_bitmap; > > ip->i_ag_dirty_bitmap = NULL; > > kfree(t); > > In this way, the pointer(i_ag_dirty_bitmap in this case) that I am freeing never points to an > > already freed address. > > > > > + ip->i_ag_dirty_bits = 0; > > > + > > > __xfs_inode_free(ip); > > > } > > > > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > > > index bd6d33557194..dee449168605 100644 > > > --- a/fs/xfs/xfs_inode.h > > > +++ b/fs/xfs/xfs_inode.h > > > @@ -99,6 +99,11 @@ typedef struct xfs_inode { > > > spinlock_t i_ioend_lock; > > > struct work_struct i_ioend_work; > > > struct list_head i_ioend_list; > > > + > > > + /* AG prediction map: pgoff_t -> packed u32 */ > > > + struct xarray i_ag_pmap; > > > + unsigned long *i_ag_dirty_bitmap; > > > + unsigned int i_ag_dirty_bits; > > Not sure but, I mostly see the typedefed versions of data types being used like uint32 etc. Darrick, > > hch, are the above fine? > > Yes, please don't mix types. Pick one type and stick with it. > > (and yes I wish we could struct bitmap_t(unsigned long)) > > --D > > > --NR > > > } xfs_inode_t; > > > > > > static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip) ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap 2026-02-06 5:36 ` Nirjhar Roy (IBM) @ 2026-02-06 5:57 ` Darrick J. Wong 2026-02-06 6:03 ` Nirjhar Roy (IBM) 0 siblings, 1 reply; 48+ messages in thread From: Darrick J. Wong @ 2026-02-06 5:57 UTC (permalink / raw) To: Nirjhar Roy (IBM) Cc: Kundan Kumar, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, Feb 06, 2026 at 11:06:03AM +0530, Nirjhar Roy (IBM) wrote: > On Thu, 2026-02-05 at 08:36 -0800, Darrick J. Wong wrote: > > On Thu, Feb 05, 2026 at 12:06:19PM +0530, Nirjhar Roy (IBM) wrote: > > > On Fri, 2026-01-16 at 15:38 +0530, Kundan Kumar wrote: > > > > Add per-inode structures to track predicted AGs of dirty folios using > > > > an xarray and bitmap. This enables efficient identification of AGs > > > > involved in writeback. > > > > > > > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > > > > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > > > > --- > > > > fs/xfs/xfs_icache.c | 27 +++++++++++++++++++++++++++ > > > > fs/xfs/xfs_inode.h | 5 +++++ > > > > 2 files changed, 32 insertions(+) > > > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > > index e44040206851..f97aa6d66271 100644 > > > > --- a/fs/xfs/xfs_icache.c > > > > +++ b/fs/xfs/xfs_icache.c > > > > @@ -80,6 +80,25 @@ static inline xa_mark_t ici_tag_to_mark(unsigned int tag) > > > > return XFS_PERAG_BLOCKGC_MARK; > > > > } > > > > > > > > +static int xfs_inode_init_ag_bitmap(struct xfs_inode *ip) > > > Similar comment as before: > > > static int > > > xfs_inode_init...() > > > > +{ > > > > + unsigned int bits = ip->i_mount->m_sb.sb_agcount; > > > > + unsigned int nlongs; > > > > + > > > > + xa_init_flags(&ip->i_ag_pmap, XA_FLAGS_LOCK_IRQ); > > > Nit: The name of the functions suggests that it is initializing the tracking bitmap which it does - > > > however, the above line does slightly different thing? Maybe move the xarray init outside the bitmap > > > init function? > > > > Or just call it something else? xfs_inode_init_perag_wb? > > > > > > + ip->i_ag_dirty_bitmap = NULL; > > > > + ip->i_ag_dirty_bits = bits; > > > > + > > > > + if (!bits) > > > Umm, !bits means agcount is 0. Shouldn't we ASSERT that bits >= 2? Or am I missing something? > > > > Technically you can have 1 AG, but you definitely can't mount a zero AG > > filesystem. > Okay, but: > /home/ubuntu$ mkfs.xfs -f -d agcount=1 /dev/loop0 > Filesystem must have at least 2 superblocks for redundancy! > Usage: mkfs.xfs > Or maybe this restriction is just at the userspace tool level? Yeah. If the only super dies then the filesystem is completely unrecoverable, which is why you have to really fight mkfs to spit out single-AG filesystems. --D > > > > + return 0; > > > > + > > > > + nlongs = BITS_TO_LONGS(bits); > > > > + ip->i_ag_dirty_bitmap = kcalloc(nlongs, sizeof(unsigned long), > > > > + GFP_NOFS); > > > > + > > > > + return ip->i_ag_dirty_bitmap ? 0 : -ENOMEM; > > > > +} > > > > + > > > > /* > > > > * Allocate and initialise an xfs_inode. > > > > */ > > > > @@ -131,6 +150,8 @@ xfs_inode_alloc( > > > > ip->i_next_unlinked = NULLAGINO; > > > > ip->i_prev_unlinked = 0; > > > > > > > > + xfs_inode_init_ag_bitmap(ip); > > > xfs_inode_init_ag_bitmap() returns int - error handling for -ENOMEM? 
> > > > + > > > > return ip; > > > > } > > > > > > > > @@ -194,6 +215,12 @@ xfs_inode_free( > > > > ip->i_ino = 0; > > > > spin_unlock(&ip->i_flags_lock); > > > > > > > > + /* free xarray contents (values are immediate packed ints) */ > > > > + xa_destroy(&ip->i_ag_pmap); > > > Nit:Maybe have a small wrapper for freeing it the prediction map? No hard preferences though. > > > > + kfree(ip->i_ag_dirty_bitmap); > > > > + ip->i_ag_dirty_bitmap = NULL; > > > Nit: Usually while freeing the pointers I prefer: > > > t = ip->i_ag_dirty_bitmap; > > > ip->i_ag_dirty_bitmap = NULL; > > > kfree(t); > > > In this way, the pointer(i_ag_dirty_bitmap in this case) that I am freeing never points to an > > > already freed address. > > > > > > > + ip->i_ag_dirty_bits = 0; > > > > + > > > > __xfs_inode_free(ip); > > > > } > > > > > > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > > > > index bd6d33557194..dee449168605 100644 > > > > --- a/fs/xfs/xfs_inode.h > > > > +++ b/fs/xfs/xfs_inode.h > > > > @@ -99,6 +99,11 @@ typedef struct xfs_inode { > > > > spinlock_t i_ioend_lock; > > > > struct work_struct i_ioend_work; > > > > struct list_head i_ioend_list; > > > > + > > > > + /* AG prediction map: pgoff_t -> packed u32 */ > > > > + struct xarray i_ag_pmap; > > > > + unsigned long *i_ag_dirty_bitmap; > > > > + unsigned int i_ag_dirty_bits; > > > Not sure but, I mostly see the typedefed versions of data types being used like uint32 etc. Darrick, > > > hch, are the above fine? > > > > Yes, please don't mix types. Pick one type and stick with it. > > > > (and yes I wish we could struct bitmap_t(unsigned long)) > > > > --D > > > > > --NR > > > > } xfs_inode_t; > > > > > > > > static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip) > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap 2026-02-06 5:57 ` Darrick J. Wong @ 2026-02-06 6:03 ` Nirjhar Roy (IBM) 0 siblings, 0 replies; 48+ messages in thread From: Nirjhar Roy (IBM) @ 2026-02-06 6:03 UTC (permalink / raw) To: Darrick J. Wong Cc: Kundan Kumar, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On 2/6/26 11:27, Darrick J. Wong wrote: > On Fri, Feb 06, 2026 at 11:06:03AM +0530, Nirjhar Roy (IBM) wrote: >> On Thu, 2026-02-05 at 08:36 -0800, Darrick J. Wong wrote: >>> On Thu, Feb 05, 2026 at 12:06:19PM +0530, Nirjhar Roy (IBM) wrote: >>>> On Fri, 2026-01-16 at 15:38 +0530, Kundan Kumar wrote: >>>>> Add per-inode structures to track predicted AGs of dirty folios using >>>>> an xarray and bitmap. This enables efficient identification of AGs >>>>> involved in writeback. >>>>> >>>>> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> >>>>> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> >>>>> --- >>>>> fs/xfs/xfs_icache.c | 27 +++++++++++++++++++++++++++ >>>>> fs/xfs/xfs_inode.h | 5 +++++ >>>>> 2 files changed, 32 insertions(+) >>>>> >>>>> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c >>>>> index e44040206851..f97aa6d66271 100644 >>>>> --- a/fs/xfs/xfs_icache.c >>>>> +++ b/fs/xfs/xfs_icache.c >>>>> @@ -80,6 +80,25 @@ static inline xa_mark_t ici_tag_to_mark(unsigned int tag) >>>>> return XFS_PERAG_BLOCKGC_MARK; >>>>> } >>>>> >>>>> +static int xfs_inode_init_ag_bitmap(struct xfs_inode *ip) >>>> Similar comment as before: >>>> static int >>>> xfs_inode_init...() >>>>> +{ >>>>> + unsigned int bits = ip->i_mount->m_sb.sb_agcount; >>>>> + unsigned int nlongs; >>>>> + >>>>> + xa_init_flags(&ip->i_ag_pmap, XA_FLAGS_LOCK_IRQ); >>>> Nit: The name of the functions suggests that it is initializing the tracking bitmap which it does - >>>> however, the above line does slightly different thing? Maybe move the xarray init outside the bitmap >>>> init function? >>> Or just call it something else? xfs_inode_init_perag_wb? >>> >>>>> + ip->i_ag_dirty_bitmap = NULL; >>>>> + ip->i_ag_dirty_bits = bits; >>>>> + >>>>> + if (!bits) >>>> Umm, !bits means agcount is 0. Shouldn't we ASSERT that bits >= 2? Or am I missing something? >>> Technically you can have 1 AG, but you definitely can't mount a zero AG >>> filesystem. >> Okay, but: >> /home/ubuntu$ mkfs.xfs -f -d agcount=1 /dev/loop0 >> Filesystem must have at least 2 superblocks for redundancy! >> Usage: mkfs.xfs >> Or maybe this restriction is just at the userspace tool level? > Yeah. If the only super dies then the filesystem is completely > unrecoverable, which is why you have to really fight mkfs to spit out > single-AG filesystems. Okay. But I think there are certain places in the kernel code where 1 AG is treated as an -EINVAL. For instance [1] [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/xfs_fsops.c#n140 --NR > > --D > >>>>> + return 0; >>>>> + >>>>> + nlongs = BITS_TO_LONGS(bits); >>>>> + ip->i_ag_dirty_bitmap = kcalloc(nlongs, sizeof(unsigned long), >>>>> + GFP_NOFS); >>>>> + >>>>> + return ip->i_ag_dirty_bitmap ? 0 : -ENOMEM; >>>>> +} >>>>> + >>>>> /* >>>>> * Allocate and initialise an xfs_inode. >>>>> */ >>>>> @@ -131,6 +150,8 @@ xfs_inode_alloc( >>>>> ip->i_next_unlinked = NULLAGINO; >>>>> ip->i_prev_unlinked = 0; >>>>> >>>>> + xfs_inode_init_ag_bitmap(ip); >>>> xfs_inode_init_ag_bitmap() returns int - error handling for -ENOMEM? 
>>>>> + >>>>> return ip; >>>>> } >>>>> >>>>> @@ -194,6 +215,12 @@ xfs_inode_free( >>>>> ip->i_ino = 0; >>>>> spin_unlock(&ip->i_flags_lock); >>>>> >>>>> + /* free xarray contents (values are immediate packed ints) */ >>>>> + xa_destroy(&ip->i_ag_pmap); >>>> Nit:Maybe have a small wrapper for freeing it the prediction map? No hard preferences though. >>>>> + kfree(ip->i_ag_dirty_bitmap); >>>>> + ip->i_ag_dirty_bitmap = NULL; >>>> Nit: Usually while freeing the pointers I prefer: >>>> t = ip->i_ag_dirty_bitmap; >>>> ip->i_ag_dirty_bitmap = NULL; >>>> kfree(t); >>>> In this way, the pointer(i_ag_dirty_bitmap in this case) that I am freeing never points to an >>>> already freed address. >>>> >>>>> + ip->i_ag_dirty_bits = 0; >>>>> + >>>>> __xfs_inode_free(ip); >>>>> } >>>>> >>>>> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h >>>>> index bd6d33557194..dee449168605 100644 >>>>> --- a/fs/xfs/xfs_inode.h >>>>> +++ b/fs/xfs/xfs_inode.h >>>>> @@ -99,6 +99,11 @@ typedef struct xfs_inode { >>>>> spinlock_t i_ioend_lock; >>>>> struct work_struct i_ioend_work; >>>>> struct list_head i_ioend_list; >>>>> + >>>>> + /* AG prediction map: pgoff_t -> packed u32 */ >>>>> + struct xarray i_ag_pmap; >>>>> + unsigned long *i_ag_dirty_bitmap; >>>>> + unsigned int i_ag_dirty_bits; >>>> Not sure but, I mostly see the typedefed versions of data types being used like uint32 etc. Darrick, >>>> hch, are the above fine? >>> Yes, please don't mix types. Pick one type and stick with it. >>> >>> (and yes I wish we could struct bitmap_t(unsigned long)) >>> >>> --D >>> >>>> --NR >>>>> } xfs_inode_t; >>>>> >>>>> static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip) >> -- Nirjhar Roy Linux Kernel Developer IBM, Bangalore ^ permalink raw reply [flat|nested] 48+ messages in thread
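Given that single-AG filesystems can exist even though mkfs resists creating them, one low-cost way to settle this question in xfs_inode_init_ag_bitmap() would be to treat agcount <= 1 as "nothing to shard" and leave such inodes on the normal writeback path. A sketch under that assumption; this is not part of the posted patch:

	/*
	 * A zero-AG fs is unmountable and a single-AG fs has nothing to
	 * shard across, so skip the bitmap allocation entirely and let
	 * writeback take the regular path.
	 */
	if (bits <= 1) {
		ip->i_ag_dirty_bits = 0;
		return 0;
	}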
* Re: [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap 2026-01-16 10:08 ` [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap Kundan Kumar 2026-01-29 0:44 ` Darrick J. Wong 2026-02-05 6:36 ` Nirjhar Roy (IBM) @ 2026-02-06 7:00 ` Christoph Hellwig 2 siblings, 0 replies; 48+ messages in thread From: Christoph Hellwig @ 2026-02-06 7:00 UTC (permalink / raw) To: Kundan Kumar Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, Jan 16, 2026 at 03:38:15PM +0530, Kundan Kumar wrote: > @@ -99,6 +99,11 @@ typedef struct xfs_inode { > spinlock_t i_ioend_lock; > struct work_struct i_ioend_work; > struct list_head i_ioend_list; > + > + /* AG prediction map: pgoff_t -> packed u32 */ > + struct xarray i_ag_pmap; > + unsigned long *i_ag_dirty_bitmap; > + unsigned int i_ag_dirty_bits; I don't think bloating the inode like this is acceptable. As said in my reply to patch 5, I think we're better off with more coarse-grained sharding anyway, i.e. just track an AG per inode. We did this for zoned xfs by pointing to the open zone using the existing and otherwise unused inode->i_private field with great success. You won't get as good placement for some use cases where files are fragmented over multiple AGs and you can place data near close blocks right now, but you'll save a lot of overhead searching the alloc btrees, and probably generate a write pattern more suitable for SSDs. ^ permalink raw reply [flat|nested] 48+ messages in thread
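For reference, a minimal sketch of the coarser per-inode sharding suggested here, modelled loosely on the zoned-XFS use of inode->i_private; the helper names are illustrative and not part of this series or the zoned code:

	/* remember the AG the next allocation for this inode should target */
	static inline void
	xfs_inode_set_wb_agno(
		struct xfs_inode	*ip,
		xfs_agnumber_t		agno)
	{
		/* stored as an immediate value, so there is nothing to free */
		WRITE_ONCE(VFS_I(ip)->i_private, (void *)(unsigned long)agno);
	}

	static inline xfs_agnumber_t
	xfs_inode_wb_agno(
		struct xfs_inode	*ip)
	{
		return (xfs_agnumber_t)(unsigned long)
				READ_ONCE(VFS_I(ip)->i_private);
	}

A single per-inode hint like this would make the dirty-AG bitmap and the per-folio xarray unnecessary, at the cost of routing all of an inode's writeback through one AG worker at a time.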
[parent not found: <CGME20260116101256epcas5p2d6125a6bcad78c33f737fdc3484aca79@epcas5p2.samsung.com>]
* [PATCH v3 4/6] xfs: tag folios with AG number during buffered write via iomap attach hook [not found] ` <CGME20260116101256epcas5p2d6125a6bcad78c33f737fdc3484aca79@epcas5p2.samsung.com> @ 2026-01-16 10:08 ` Kundan Kumar 2026-01-29 0:47 ` Darrick J. Wong 2026-02-06 6:44 ` Nirjhar Roy (IBM) 0 siblings, 2 replies; 48+ messages in thread From: Kundan Kumar @ 2026-01-16 10:08 UTC (permalink / raw) To: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei Cc: linux-fsdevel, linux-mm, linux-xfs, gost.dev, kundan.kumar, anuj20.g, vishak.g, joshi.k Use the iomap attach hook to tag folios with their predicted allocation group at write time. Mapped extents derive AG directly; delalloc and hole cases use a lightweight predictor. Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> --- fs/xfs/xfs_iomap.c | 114 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 114 insertions(+) diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 490e12cb99be..3c927ce118fe 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -12,6 +12,9 @@ #include "xfs_trans_resv.h" #include "xfs_mount.h" #include "xfs_inode.h" +#include "xfs_alloc.h" +#include "xfs_ag.h" +#include "xfs_ag_resv.h" #include "xfs_btree.h" #include "xfs_bmap_btree.h" #include "xfs_bmap.h" @@ -92,8 +95,119 @@ xfs_iomap_valid( return true; } +static xfs_agnumber_t +xfs_predict_delalloc_agno(const struct xfs_inode *ip, loff_t pos, loff_t len) +{ + struct xfs_mount *mp = ip->i_mount; + xfs_agnumber_t start_agno, agno, best_agno; + struct xfs_perag *pag; + + xfs_extlen_t free, resv, avail; + xfs_extlen_t need_fsbs, min_free_fsbs; + xfs_extlen_t best_free = 0; + xfs_agnumber_t agcount = mp->m_sb.sb_agcount; + + /* RT inodes allocate from the realtime volume */ + if (XFS_IS_REALTIME_INODE(ip)) + return XFS_INO_TO_AGNO(mp, ip->i_ino); + + start_agno = XFS_INO_TO_AGNO(mp, ip->i_ino); + + /* + * size-based minimum free requirement. + * Convert bytes to fsbs and require some slack. + */ + need_fsbs = XFS_B_TO_FSB(mp, (xfs_fsize_t)len); + min_free_fsbs = need_fsbs + max_t(xfs_extlen_t, need_fsbs >> 2, 128); + + /* + * scan AGs starting at start_agno and wrapping. + * Pick the first AG that meets min_free_fsbs after reservations. + * Keep a "best" fallback = maximum (free - resv). 
+ */ + best_agno = start_agno; + + for (xfs_agnumber_t i = 0; i < agcount; i++) { + agno = (start_agno + i) % agcount; + pag = xfs_perag_get(mp, agno); + + if (!xfs_perag_initialised_agf(pag)) + goto next; + + free = READ_ONCE(pag->pagf_freeblks); + resv = xfs_ag_resv_needed(pag, XFS_AG_RESV_NONE); + + if (free <= resv) + goto next; + + avail = free - resv; + + if (avail >= min_free_fsbs) { + xfs_perag_put(pag); + return agno; + } + + if (avail > best_free) { + best_free = avail; + best_agno = agno; + } +next: + xfs_perag_put(pag); + } + + return best_agno; +} + +static inline xfs_agnumber_t xfs_ag_from_iomap(const struct xfs_mount *mp, + const struct iomap *iomap, + const struct xfs_inode *ip, loff_t pos, size_t len) +{ + if (iomap->type == IOMAP_MAPPED || iomap->type == IOMAP_UNWRITTEN) { + /* iomap->addr is byte address on device for buffered I/O */ + xfs_fsblock_t fsb = XFS_BB_TO_FSBT(mp, BTOBB(iomap->addr)); + + return XFS_FSB_TO_AGNO(mp, fsb); + } else if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_DELALLOC) { + return xfs_predict_delalloc_agno(ip, pos, len); + } + + return XFS_INO_TO_AGNO(mp, ip->i_ino); +} + +static void xfs_agp_set(struct xfs_inode *ip, pgoff_t index, + xfs_agnumber_t agno, u8 type) +{ + u32 packed = xfs_agp_pack((u32)agno, type, true); + + /* store as immediate value */ + xa_store(&ip->i_ag_pmap, index, xa_mk_value(packed), GFP_NOFS); + + /* Mark this AG as having potential dirty work */ + if (ip->i_ag_dirty_bitmap && (u32)agno < ip->i_ag_dirty_bits) + set_bit((u32)agno, ip->i_ag_dirty_bitmap); +} + +static void +xfs_iomap_tag_folio(const struct iomap *iomap, struct folio *folio, + loff_t pos, size_t len) +{ + struct inode *inode; + struct xfs_inode *ip; + struct xfs_mount *mp; + xfs_agnumber_t agno; + + inode = folio_mapping(folio)->host; + ip = XFS_I(inode); + mp = ip->i_mount; + + agno = xfs_ag_from_iomap(mp, iomap, ip, pos, len); + + xfs_agp_set(ip, folio->index, agno, (u8)iomap->type); +} + const struct iomap_write_ops xfs_iomap_write_ops = { .iomap_valid = xfs_iomap_valid, + .tag_folio = xfs_iomap_tag_folio, }; int -- 2.25.1 ^ permalink raw reply [flat|nested] 48+ messages in thread
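For context, the iomap side of the .tag_folio hook comes from patch 1/6, which is not quoted in this subthread; the call site would look roughly like the following in the buffered write path, with the exact placement assumed:

	/* fs/iomap/buffered-io.c: once the folio for this write is ready */
	if (write_ops && write_ops->tag_folio)
		write_ops->tag_folio(&iter->iomap, folio, pos, len);

The hook is advisory: filesystems that do not set .tag_folio see no change in behavior.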
* Re: [PATCH v3 4/6] xfs: tag folios with AG number during buffered write via iomap attach hook 2026-01-16 10:08 ` [PATCH v3 4/6] xfs: tag folios with AG number during buffered write via iomap attach hook Kundan Kumar @ 2026-01-29 0:47 ` Darrick J. Wong 2026-01-29 22:40 ` Darrick J. Wong 2026-02-03 7:28 ` Kundan Kumar 2026-02-06 6:44 ` Nirjhar Roy (IBM) 1 sibling, 2 replies; 48+ messages in thread From: Darrick J. Wong @ 2026-01-29 0:47 UTC (permalink / raw) To: Kundan Kumar Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, Jan 16, 2026 at 03:38:16PM +0530, Kundan Kumar wrote: > Use the iomap attach hook to tag folios with their predicted > allocation group at write time. Mapped extents derive AG directly; > delalloc and hole cases use a lightweight predictor. > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > --- > fs/xfs/xfs_iomap.c | 114 +++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 114 insertions(+) > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index 490e12cb99be..3c927ce118fe 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -12,6 +12,9 @@ > #include "xfs_trans_resv.h" > #include "xfs_mount.h" > #include "xfs_inode.h" > +#include "xfs_alloc.h" > +#include "xfs_ag.h" > +#include "xfs_ag_resv.h" > #include "xfs_btree.h" > #include "xfs_bmap_btree.h" > #include "xfs_bmap.h" > @@ -92,8 +95,119 @@ xfs_iomap_valid( > return true; > } > > +static xfs_agnumber_t > +xfs_predict_delalloc_agno(const struct xfs_inode *ip, loff_t pos, loff_t len) > +{ > + struct xfs_mount *mp = ip->i_mount; > + xfs_agnumber_t start_agno, agno, best_agno; > + struct xfs_perag *pag; > + > + xfs_extlen_t free, resv, avail; > + xfs_extlen_t need_fsbs, min_free_fsbs; > + xfs_extlen_t best_free = 0; > + xfs_agnumber_t agcount = mp->m_sb.sb_agcount; > + > + /* RT inodes allocate from the realtime volume */ > + if (XFS_IS_REALTIME_INODE(ip)) > + return XFS_INO_TO_AGNO(mp, ip->i_ino); > + > + start_agno = XFS_INO_TO_AGNO(mp, ip->i_ino); > + > + /* > + * size-based minimum free requirement. > + * Convert bytes to fsbs and require some slack. > + */ > + need_fsbs = XFS_B_TO_FSB(mp, (xfs_fsize_t)len); > + min_free_fsbs = need_fsbs + max_t(xfs_extlen_t, need_fsbs >> 2, 128); > + > + /* > + * scan AGs starting at start_agno and wrapping. > + * Pick the first AG that meets min_free_fsbs after reservations. > + * Keep a "best" fallback = maximum (free - resv). 
> + */ > + best_agno = start_agno; > + > + for (xfs_agnumber_t i = 0; i < agcount; i++) { > + agno = (start_agno + i) % agcount; > + pag = xfs_perag_get(mp, agno); > + > + if (!xfs_perag_initialised_agf(pag)) > + goto next; > + > + free = READ_ONCE(pag->pagf_freeblks); > + resv = xfs_ag_resv_needed(pag, XFS_AG_RESV_NONE); > + > + if (free <= resv) > + goto next; > + > + avail = free - resv; > + > + if (avail >= min_free_fsbs) { > + xfs_perag_put(pag); > + return agno; > + } > + > + if (avail > best_free) { > + best_free = avail; > + best_agno = agno; > + } > +next: > + xfs_perag_put(pag); > + } > + > + return best_agno; > +} > + > +static inline xfs_agnumber_t xfs_ag_from_iomap(const struct xfs_mount *mp, > + const struct iomap *iomap, > + const struct xfs_inode *ip, loff_t pos, size_t len) > +{ > + if (iomap->type == IOMAP_MAPPED || iomap->type == IOMAP_UNWRITTEN) { > + /* iomap->addr is byte address on device for buffered I/O */ > + xfs_fsblock_t fsb = XFS_BB_TO_FSBT(mp, BTOBB(iomap->addr)); > + > + return XFS_FSB_TO_AGNO(mp, fsb); > + } else if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_DELALLOC) { > + return xfs_predict_delalloc_agno(ip, pos, len); Is it worth doing an AG scan to guess where the allocation might come from? The predictions could turn out to be wrong by virtue of other delalloc regions being written back between the time that xfs_agp_set is called, and the actual bmapi_write call. > + } > + > + return XFS_INO_TO_AGNO(mp, ip->i_ino); > +} > + > +static void xfs_agp_set(struct xfs_inode *ip, pgoff_t index, > + xfs_agnumber_t agno, u8 type) > +{ > + u32 packed = xfs_agp_pack((u32)agno, type, true); > + > + /* store as immediate value */ > + xa_store(&ip->i_ag_pmap, index, xa_mk_value(packed), GFP_NOFS); > + > + /* Mark this AG as having potential dirty work */ > + if (ip->i_ag_dirty_bitmap && (u32)agno < ip->i_ag_dirty_bits) > + set_bit((u32)agno, ip->i_ag_dirty_bitmap); > +} > + > +static void > +xfs_iomap_tag_folio(const struct iomap *iomap, struct folio *folio, > + loff_t pos, size_t len) > +{ > + struct inode *inode; > + struct xfs_inode *ip; > + struct xfs_mount *mp; > + xfs_agnumber_t agno; > + > + inode = folio_mapping(folio)->host; > + ip = XFS_I(inode); > + mp = ip->i_mount; > + > + agno = xfs_ag_from_iomap(mp, iomap, ip, pos, len); > + > + xfs_agp_set(ip, folio->index, agno, (u8)iomap->type); Hrm, so no, the ag_pmap only caches the ag number for the index of a folio, even if it spans many many blocks. --D > +} > + > const struct iomap_write_ops xfs_iomap_write_ops = { > .iomap_valid = xfs_iomap_valid, > + .tag_folio = xfs_iomap_tag_folio, > }; > > int > -- > 2.25.1 > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 4/6] xfs: tag folios with AG number during buffered write via iomap attach hook 2026-01-29 0:47 ` Darrick J. Wong @ 2026-01-29 22:40 ` Darrick J. Wong 2026-02-03 7:32 ` Kundan Kumar 2026-02-03 7:28 ` Kundan Kumar 1 sibling, 1 reply; 48+ messages in thread From: Darrick J. Wong @ 2026-01-29 22:40 UTC (permalink / raw) To: Kundan Kumar Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Wed, Jan 28, 2026 at 04:47:45PM -0800, Darrick J. Wong wrote: > On Fri, Jan 16, 2026 at 03:38:16PM +0530, Kundan Kumar wrote: > > Use the iomap attach hook to tag folios with their predicted > > allocation group at write time. Mapped extents derive AG directly; > > delalloc and hole cases use a lightweight predictor. > > > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > > --- > > fs/xfs/xfs_iomap.c | 114 +++++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 114 insertions(+) > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > > index 490e12cb99be..3c927ce118fe 100644 > > --- a/fs/xfs/xfs_iomap.c > > +++ b/fs/xfs/xfs_iomap.c > > @@ -12,6 +12,9 @@ > > #include "xfs_trans_resv.h" > > #include "xfs_mount.h" > > #include "xfs_inode.h" > > +#include "xfs_alloc.h" > > +#include "xfs_ag.h" > > +#include "xfs_ag_resv.h" > > #include "xfs_btree.h" > > #include "xfs_bmap_btree.h" > > #include "xfs_bmap.h" > > @@ -92,8 +95,119 @@ xfs_iomap_valid( > > return true; > > } > > > > +static xfs_agnumber_t > > +xfs_predict_delalloc_agno(const struct xfs_inode *ip, loff_t pos, loff_t len) > > +{ > > + struct xfs_mount *mp = ip->i_mount; > > + xfs_agnumber_t start_agno, agno, best_agno; > > + struct xfs_perag *pag; > > + > > + xfs_extlen_t free, resv, avail; > > + xfs_extlen_t need_fsbs, min_free_fsbs; > > + xfs_extlen_t best_free = 0; > > + xfs_agnumber_t agcount = mp->m_sb.sb_agcount; > > + > > + /* RT inodes allocate from the realtime volume */ > > + if (XFS_IS_REALTIME_INODE(ip)) > > + return XFS_INO_TO_AGNO(mp, ip->i_ino); > > + > > + start_agno = XFS_INO_TO_AGNO(mp, ip->i_ino); > > + > > + /* > > + * size-based minimum free requirement. > > + * Convert bytes to fsbs and require some slack. > > + */ > > + need_fsbs = XFS_B_TO_FSB(mp, (xfs_fsize_t)len); > > + min_free_fsbs = need_fsbs + max_t(xfs_extlen_t, need_fsbs >> 2, 128); > > + > > + /* > > + * scan AGs starting at start_agno and wrapping. > > + * Pick the first AG that meets min_free_fsbs after reservations. > > + * Keep a "best" fallback = maximum (free - resv). 
> > + */ > > + best_agno = start_agno; > > + > > + for (xfs_agnumber_t i = 0; i < agcount; i++) { > > + agno = (start_agno + i) % agcount; > > + pag = xfs_perag_get(mp, agno); > > + > > + if (!xfs_perag_initialised_agf(pag)) > > + goto next; > > + > > + free = READ_ONCE(pag->pagf_freeblks); > > + resv = xfs_ag_resv_needed(pag, XFS_AG_RESV_NONE); > > + > > + if (free <= resv) > > + goto next; > > + > > + avail = free - resv; > > + > > + if (avail >= min_free_fsbs) { > > + xfs_perag_put(pag); > > + return agno; > > + } > > + > > + if (avail > best_free) { > > + best_free = avail; > > + best_agno = agno; > > + } > > +next: > > + xfs_perag_put(pag); > > + } > > + > > + return best_agno; > > +} > > + > > +static inline xfs_agnumber_t xfs_ag_from_iomap(const struct xfs_mount *mp, > > + const struct iomap *iomap, > > + const struct xfs_inode *ip, loff_t pos, size_t len) > > +{ > > + if (iomap->type == IOMAP_MAPPED || iomap->type == IOMAP_UNWRITTEN) { > > + /* iomap->addr is byte address on device for buffered I/O */ > > + xfs_fsblock_t fsb = XFS_BB_TO_FSBT(mp, BTOBB(iomap->addr)); > > + > > + return XFS_FSB_TO_AGNO(mp, fsb); Also, what happens if this is a realtime file? For pre-rtgroups filesystems there is no group number to use; and for rtgroups you have to use xfs_rtb_to_rgno. The i_ag_dirty_bitmap and the m_ag_wb array will be the wrong size if rgcount != agcount; and also you probably don't want to have in the same per-group writeback list two inodes with folios having the same group number but writing to two different devices (data vs. rt). --D > > + } else if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_DELALLOC) { > > + return xfs_predict_delalloc_agno(ip, pos, len); > > Is it worth doing an AG scan to guess where the allocation might come > from? The predictions could turn out to be wrong by virtue of other > delalloc regions being written back between the time that xfs_agp_set is > called, and the actual bmapi_write call. > > > + } > > + > > + return XFS_INO_TO_AGNO(mp, ip->i_ino); > > +} > > + > > +static void xfs_agp_set(struct xfs_inode *ip, pgoff_t index, > > + xfs_agnumber_t agno, u8 type) > > +{ > > + u32 packed = xfs_agp_pack((u32)agno, type, true); > > + > > + /* store as immediate value */ > > + xa_store(&ip->i_ag_pmap, index, xa_mk_value(packed), GFP_NOFS); > > + > > + /* Mark this AG as having potential dirty work */ > > + if (ip->i_ag_dirty_bitmap && (u32)agno < ip->i_ag_dirty_bits) > > + set_bit((u32)agno, ip->i_ag_dirty_bitmap); > > +} > > + > > +static void > > +xfs_iomap_tag_folio(const struct iomap *iomap, struct folio *folio, > > + loff_t pos, size_t len) > > +{ > > + struct inode *inode; > > + struct xfs_inode *ip; > > + struct xfs_mount *mp; > > + xfs_agnumber_t agno; > > + > > + inode = folio_mapping(folio)->host; > > + ip = XFS_I(inode); > > + mp = ip->i_mount; > > + > > + agno = xfs_ag_from_iomap(mp, iomap, ip, pos, len); > > + > > + xfs_agp_set(ip, folio->index, agno, (u8)iomap->type); > > Hrm, so no, the ag_pmap only caches the ag number for the index of a > folio, even if it spans many many blocks. > > --D > > > +} > > + > > const struct iomap_write_ops xfs_iomap_write_ops = { > > .iomap_valid = xfs_iomap_valid, > > + .tag_folio = xfs_iomap_tag_folio, > > }; > > > > int > > -- > > 2.25.1 > > > > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 4/6] xfs: tag folios with AG number during buffered write via iomap attach hook 2026-01-29 22:40 ` Darrick J. Wong @ 2026-02-03 7:32 ` Kundan Kumar 0 siblings, 0 replies; 48+ messages in thread From: Kundan Kumar @ 2026-02-03 7:32 UTC (permalink / raw) To: Darrick J. Wong Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k > > Also, what happens if this is a realtime file? For pre-rtgroups > filesystems there is no group number to use; and for rtgroups you have > to use xfs_rtb_to_rgno. The i_ag_dirty_bitmap and the m_ag_wb array > will be the wrong size if rgcount != agcount; and also you probably > don't want to have in the same per-group writeback list two inodes with > folios having the same group number but writing to two different devices > (data vs. rt). > > --D > For this series I’ll explicitly exclude realtime inodes from AG routing (skip tagging / fall back to normal writeback). Rtgroup-aware routing would need a separate rg-domain (separate worker array/queues sized by rgcount) as follow-up work. ^ permalink raw reply [flat|nested] 48+ messages in thread
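A sketch of that exclusion, assuming the check sits at the top of the tagging hook as posted (v4 may structure it differently):

	static void
	xfs_iomap_tag_folio(
		const struct iomap	*iomap,
		struct folio		*folio,
		loff_t			pos,
		size_t			len)
	{
		struct xfs_inode	*ip = XFS_I(folio_mapping(folio)->host);
		xfs_agnumber_t		agno;

		/*
		 * Realtime files allocate from the rt device, so an AG hint
		 * is meaningless; leave the folio untagged and let it take
		 * the normal writeback path.
		 */
		if (XFS_IS_REALTIME_INODE(ip))
			return;

		agno = xfs_ag_from_iomap(ip->i_mount, iomap, ip, pos, len);
		xfs_agp_set(ip, folio->index, agno, (u8)iomap->type);
	}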
* Re: [PATCH v3 4/6] xfs: tag folios with AG number during buffered write via iomap attach hook 2026-01-29 0:47 ` Darrick J. Wong 2026-01-29 22:40 ` Darrick J. Wong @ 2026-02-03 7:28 ` Kundan Kumar 2026-02-05 15:56 ` Brian Foster 1 sibling, 1 reply; 48+ messages in thread From: Kundan Kumar @ 2026-02-03 7:28 UTC (permalink / raw) To: Darrick J. Wong Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On 1/29/2026 6:17 AM, Darrick J. Wong wrote: > On Fri, Jan 16, 2026 at 03:38:16PM +0530, Kundan Kumar wrote: >> Use the iomap attach hook to tag folios with their predicted >> allocation group at write time. Mapped extents derive AG directly; >> delalloc and hole cases use a lightweight predictor. >> >> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> >> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> >> --- >> fs/xfs/xfs_iomap.c | 114 +++++++++++++++++++++++++++++++++++++++++++++ >> 1 file changed, 114 insertions(+) >> >> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c >> index 490e12cb99be..3c927ce118fe 100644 >> --- a/fs/xfs/xfs_iomap.c >> +++ b/fs/xfs/xfs_iomap.c >> @@ -12,6 +12,9 @@ >> #include "xfs_trans_resv.h" >> #include "xfs_mount.h" >> #include "xfs_inode.h" >> +#include "xfs_alloc.h" >> +#include "xfs_ag.h" >> +#include "xfs_ag_resv.h" >> #include "xfs_btree.h" >> #include "xfs_bmap_btree.h" >> #include "xfs_bmap.h" >> @@ -92,8 +95,119 @@ xfs_iomap_valid( >> return true; >> } >> >> +static xfs_agnumber_t >> +xfs_predict_delalloc_agno(const struct xfs_inode *ip, loff_t pos, loff_t len) >> +{ >> + struct xfs_mount *mp = ip->i_mount; >> + xfs_agnumber_t start_agno, agno, best_agno; >> + struct xfs_perag *pag; >> + >> + xfs_extlen_t free, resv, avail; >> + xfs_extlen_t need_fsbs, min_free_fsbs; >> + xfs_extlen_t best_free = 0; >> + xfs_agnumber_t agcount = mp->m_sb.sb_agcount; >> + >> + /* RT inodes allocate from the realtime volume */ >> + if (XFS_IS_REALTIME_INODE(ip)) >> + return XFS_INO_TO_AGNO(mp, ip->i_ino); >> + >> + start_agno = XFS_INO_TO_AGNO(mp, ip->i_ino); >> + >> + /* >> + * size-based minimum free requirement. >> + * Convert bytes to fsbs and require some slack. >> + */ >> + need_fsbs = XFS_B_TO_FSB(mp, (xfs_fsize_t)len); >> + min_free_fsbs = need_fsbs + max_t(xfs_extlen_t, need_fsbs >> 2, 128); >> + >> + /* >> + * scan AGs starting at start_agno and wrapping. >> + * Pick the first AG that meets min_free_fsbs after reservations. >> + * Keep a "best" fallback = maximum (free - resv). 
>> + */ >> + best_agno = start_agno; >> + >> + for (xfs_agnumber_t i = 0; i < agcount; i++) { >> + agno = (start_agno + i) % agcount; >> + pag = xfs_perag_get(mp, agno); >> + >> + if (!xfs_perag_initialised_agf(pag)) >> + goto next; >> + >> + free = READ_ONCE(pag->pagf_freeblks); >> + resv = xfs_ag_resv_needed(pag, XFS_AG_RESV_NONE); >> + >> + if (free <= resv) >> + goto next; >> + >> + avail = free - resv; >> + >> + if (avail >= min_free_fsbs) { >> + xfs_perag_put(pag); >> + return agno; >> + } >> + >> + if (avail > best_free) { >> + best_free = avail; >> + best_agno = agno; >> + } >> +next: >> + xfs_perag_put(pag); >> + } >> + >> + return best_agno; >> +} >> + >> +static inline xfs_agnumber_t xfs_ag_from_iomap(const struct xfs_mount *mp, >> + const struct iomap *iomap, >> + const struct xfs_inode *ip, loff_t pos, size_t len) >> +{ >> + if (iomap->type == IOMAP_MAPPED || iomap->type == IOMAP_UNWRITTEN) { >> + /* iomap->addr is byte address on device for buffered I/O */ >> + xfs_fsblock_t fsb = XFS_BB_TO_FSBT(mp, BTOBB(iomap->addr)); >> + >> + return XFS_FSB_TO_AGNO(mp, fsb); >> + } else if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_DELALLOC) { >> + return xfs_predict_delalloc_agno(ip, pos, len); > > Is it worth doing an AG scan to guess where the allocation might come > from? The predictions could turn out to be wrong by virtue of other > delalloc regions being written back between the time that xfs_agp_set is > called, and the actual bmapi_write call. > The delalloc prediction works well in the common cases: (1) when an AG has sufficient free space and allocations stay within it, and (2) when an AG becomes full and allocation naturally moves to the next suitable AG. The only case where the prediction can be wrong is when an AG is in the process of being exhausted concurrently with writeback, so allocation shifts between the time we tag the folio and the actual bmapi_write. My understanding is that window is narrow, and only a small fraction of IOs would be misrouted. >> + } >> + >> + return XFS_INO_TO_AGNO(mp, ip->i_ino); >> +} >> + >> +static void xfs_agp_set(struct xfs_inode *ip, pgoff_t index, >> + xfs_agnumber_t agno, u8 type) >> +{ >> + u32 packed = xfs_agp_pack((u32)agno, type, true); >> + >> + /* store as immediate value */ >> + xa_store(&ip->i_ag_pmap, index, xa_mk_value(packed), GFP_NOFS); >> + >> + /* Mark this AG as having potential dirty work */ >> + if (ip->i_ag_dirty_bitmap && (u32)agno < ip->i_ag_dirty_bits) >> + set_bit((u32)agno, ip->i_ag_dirty_bitmap); >> +} >> + >> +static void >> +xfs_iomap_tag_folio(const struct iomap *iomap, struct folio *folio, >> + loff_t pos, size_t len) >> +{ >> + struct inode *inode; >> + struct xfs_inode *ip; >> + struct xfs_mount *mp; >> + xfs_agnumber_t agno; >> + >> + inode = folio_mapping(folio)->host; >> + ip = XFS_I(inode); >> + mp = ip->i_mount; >> + >> + agno = xfs_ag_from_iomap(mp, iomap, ip, pos, len); >> + >> + xfs_agp_set(ip, folio->index, agno, (u8)iomap->type); > > Hrm, so no, the ag_pmap only caches the ag number for the index of a > folio, even if it spans many many blocks. > > --D > Thanks for pointing out, I will rework to handle this case. >> +} >> + >> const struct iomap_write_ops xfs_iomap_write_ops = { >> .iomap_valid = xfs_iomap_valid, >> + .tag_folio = xfs_iomap_tag_folio, >> }; >> >> int >> -- >> 2.25.1 >> >> > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 4/6] xfs: tag folios with AG number during buffered write via iomap attach hook 2026-02-03 7:28 ` Kundan Kumar @ 2026-02-05 15:56 ` Brian Foster 0 siblings, 0 replies; 48+ messages in thread From: Brian Foster @ 2026-02-05 15:56 UTC (permalink / raw) To: Kundan Kumar Cc: Darrick J. Wong, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Tue, Feb 03, 2026 at 12:58:34PM +0530, Kundan Kumar wrote: > On 1/29/2026 6:17 AM, Darrick J. Wong wrote: > > On Fri, Jan 16, 2026 at 03:38:16PM +0530, Kundan Kumar wrote: > >> Use the iomap attach hook to tag folios with their predicted > >> allocation group at write time. Mapped extents derive AG directly; > >> delalloc and hole cases use a lightweight predictor. > >> > >> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > >> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > >> --- > >> fs/xfs/xfs_iomap.c | 114 +++++++++++++++++++++++++++++++++++++++++++++ > >> 1 file changed, 114 insertions(+) > >> > >> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > >> index 490e12cb99be..3c927ce118fe 100644 > >> --- a/fs/xfs/xfs_iomap.c > >> +++ b/fs/xfs/xfs_iomap.c > >> @@ -12,6 +12,9 @@ > >> #include "xfs_trans_resv.h" > >> #include "xfs_mount.h" > >> #include "xfs_inode.h" > >> +#include "xfs_alloc.h" > >> +#include "xfs_ag.h" > >> +#include "xfs_ag_resv.h" > >> #include "xfs_btree.h" > >> #include "xfs_bmap_btree.h" > >> #include "xfs_bmap.h" > >> @@ -92,8 +95,119 @@ xfs_iomap_valid( > >> return true; > >> } > >> > >> +static xfs_agnumber_t > >> +xfs_predict_delalloc_agno(const struct xfs_inode *ip, loff_t pos, loff_t len) > >> +{ > >> + struct xfs_mount *mp = ip->i_mount; > >> + xfs_agnumber_t start_agno, agno, best_agno; > >> + struct xfs_perag *pag; > >> + > >> + xfs_extlen_t free, resv, avail; > >> + xfs_extlen_t need_fsbs, min_free_fsbs; > >> + xfs_extlen_t best_free = 0; > >> + xfs_agnumber_t agcount = mp->m_sb.sb_agcount; > >> + > >> + /* RT inodes allocate from the realtime volume */ > >> + if (XFS_IS_REALTIME_INODE(ip)) > >> + return XFS_INO_TO_AGNO(mp, ip->i_ino); > >> + > >> + start_agno = XFS_INO_TO_AGNO(mp, ip->i_ino); > >> + > >> + /* > >> + * size-based minimum free requirement. > >> + * Convert bytes to fsbs and require some slack. > >> + */ > >> + need_fsbs = XFS_B_TO_FSB(mp, (xfs_fsize_t)len); > >> + min_free_fsbs = need_fsbs + max_t(xfs_extlen_t, need_fsbs >> 2, 128); > >> + > >> + /* > >> + * scan AGs starting at start_agno and wrapping. > >> + * Pick the first AG that meets min_free_fsbs after reservations. > >> + * Keep a "best" fallback = maximum (free - resv). 
> >> + */ > >> + best_agno = start_agno; > >> + > >> + for (xfs_agnumber_t i = 0; i < agcount; i++) { > >> + agno = (start_agno + i) % agcount; > >> + pag = xfs_perag_get(mp, agno); > >> + > >> + if (!xfs_perag_initialised_agf(pag)) > >> + goto next; > >> + > >> + free = READ_ONCE(pag->pagf_freeblks); > >> + resv = xfs_ag_resv_needed(pag, XFS_AG_RESV_NONE); > >> + > >> + if (free <= resv) > >> + goto next; > >> + > >> + avail = free - resv; > >> + > >> + if (avail >= min_free_fsbs) { > >> + xfs_perag_put(pag); > >> + return agno; > >> + } > >> + > >> + if (avail > best_free) { > >> + best_free = avail; > >> + best_agno = agno; > >> + } > >> +next: > >> + xfs_perag_put(pag); > >> + } > >> + > >> + return best_agno; > >> +} > >> + > >> +static inline xfs_agnumber_t xfs_ag_from_iomap(const struct xfs_mount *mp, > >> + const struct iomap *iomap, > >> + const struct xfs_inode *ip, loff_t pos, size_t len) > >> +{ > >> + if (iomap->type == IOMAP_MAPPED || iomap->type == IOMAP_UNWRITTEN) { > >> + /* iomap->addr is byte address on device for buffered I/O */ > >> + xfs_fsblock_t fsb = XFS_BB_TO_FSBT(mp, BTOBB(iomap->addr)); > >> + > >> + return XFS_FSB_TO_AGNO(mp, fsb); > >> + } else if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_DELALLOC) { > >> + return xfs_predict_delalloc_agno(ip, pos, len); > > > > Is it worth doing an AG scan to guess where the allocation might come > > from? The predictions could turn out to be wrong by virtue of other > > delalloc regions being written back between the time that xfs_agp_set is > > called, and the actual bmapi_write call. > > > > The delalloc prediction works well in the common cases: (1) when an AG > has sufficient free space and allocations stay within it, and (2) when > an AG becomes full and allocation naturally moves to the next suitable AG. > > The only case where the prediction can be wrong is when an AG is in the > process of being exhausted concurrently with writeback, so allocation > shifts between the time we tag the folio and the actual bmapi_write. > My understanding is that window is narrow, and only a small fraction of > IOs would be misrouted. > I wonder how true that would be under more mixed workloads. For example, if writeback is iterating AGs under a trylock, all it really takes to redirect incorrectly is lock contention, which then seems like it could be a compounding factor for other AG workers. Another thing that comes to mind is the writeback delay. For example, if we buffer up enough delalloc in pagecache to one or more inodes that target the same AG, then it seems possible to hint folios to AGs that are already full, they "just don't know it yet." Maybe that is more of an odd/rare case though. Perhaps the better question here is.. how would one test for this? It might be interesting to have stats counters or something that could indicate hits and misses wrt the hint such that this could be more easily evaluated against different workloads (assuming that doesn't already exist and I missed it).. hm? 
Brian > >> + } > >> + > >> + return XFS_INO_TO_AGNO(mp, ip->i_ino); > >> +} > >> + > >> +static void xfs_agp_set(struct xfs_inode *ip, pgoff_t index, > >> + xfs_agnumber_t agno, u8 type) > >> +{ > >> + u32 packed = xfs_agp_pack((u32)agno, type, true); > >> + > >> + /* store as immediate value */ > >> + xa_store(&ip->i_ag_pmap, index, xa_mk_value(packed), GFP_NOFS); > >> + > >> + /* Mark this AG as having potential dirty work */ > >> + if (ip->i_ag_dirty_bitmap && (u32)agno < ip->i_ag_dirty_bits) > >> + set_bit((u32)agno, ip->i_ag_dirty_bitmap); > >> +} > >> + > >> +static void > >> +xfs_iomap_tag_folio(const struct iomap *iomap, struct folio *folio, > >> + loff_t pos, size_t len) > >> +{ > >> + struct inode *inode; > >> + struct xfs_inode *ip; > >> + struct xfs_mount *mp; > >> + xfs_agnumber_t agno; > >> + > >> + inode = folio_mapping(folio)->host; > >> + ip = XFS_I(inode); > >> + mp = ip->i_mount; > >> + > >> + agno = xfs_ag_from_iomap(mp, iomap, ip, pos, len); > >> + > >> + xfs_agp_set(ip, folio->index, agno, (u8)iomap->type); > > > > Hrm, so no, the ag_pmap only caches the ag number for the index of a > > folio, even if it spans many many blocks. > > > > --D > > > > Thanks for pointing out, I will rework to handle this case. > > >> +} > >> + > >> const struct iomap_write_ops xfs_iomap_write_ops = { > >> .iomap_valid = xfs_iomap_valid, > >> + .tag_folio = xfs_iomap_tag_folio, > >> }; > >> > >> int > >> -- > >> 2.25.1 > >> > >> > > > > ^ permalink raw reply [flat|nested] 48+ messages in thread
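One way to wire up the counters Brian is asking about, sketched with an assumed structure and call site; nothing like this exists in the posted series:

	struct xfs_agwb_stats {
		struct percpu_counter	hint_hits;	/* alloc landed in tagged AG */
		struct percpu_counter	hint_misses;	/* alloc went elsewhere */
	};

	/* call once the real allocation is known, e.g. after xfs_bmapi_write() */
	static void
	xfs_agwb_account_hint(
		struct xfs_agwb_stats	*stats,
		xfs_agnumber_t		tagged_agno,
		xfs_agnumber_t		alloc_agno)
	{
		if (tagged_agno == alloc_agno)
			percpu_counter_inc(&stats->hint_hits);
		else
			percpu_counter_inc(&stats->hint_misses);
	}

Exporting the two counts through sysfs or the existing xfs stats machinery would make the hit rate directly observable per workload.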
* Re: [PATCH v3 4/6] xfs: tag folios with AG number during buffered write via iomap attach hook 2026-01-16 10:08 ` [PATCH v3 4/6] xfs: tag folios with AG number during buffered write via iomap attach hook Kundan Kumar 2026-01-29 0:47 ` Darrick J. Wong @ 2026-02-06 6:44 ` Nirjhar Roy (IBM) 1 sibling, 0 replies; 48+ messages in thread From: Nirjhar Roy (IBM) @ 2026-02-06 6:44 UTC (permalink / raw) To: Kundan Kumar, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei Cc: linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, 2026-01-16 at 15:38 +0530, Kundan Kumar wrote: > Use the iomap attach hook to tag folios with their predicted > allocation group at write time. Mapped extents derive AG directly; > delalloc and hole cases use a lightweight predictor. > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > --- > fs/xfs/xfs_iomap.c | 114 +++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 114 insertions(+) > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index 490e12cb99be..3c927ce118fe 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -12,6 +12,9 @@ > #include "xfs_trans_resv.h" > #include "xfs_mount.h" > #include "xfs_inode.h" > +#include "xfs_alloc.h" > +#include "xfs_ag.h" > +#include "xfs_ag_resv.h" > #include "xfs_btree.h" > #include "xfs_bmap_btree.h" > #include "xfs_bmap.h" > @@ -92,8 +95,119 @@ xfs_iomap_valid( > return true; > } > > +static xfs_agnumber_t > +xfs_predict_delalloc_agno(const struct xfs_inode *ip, loff_t pos, loff_t len) > +{ > + struct xfs_mount *mp = ip->i_mount; > + xfs_agnumber_t start_agno, agno, best_agno; > + struct xfs_perag *pag; Nit: Coding style - I think XFS uses a tab between the data type and the identifier and makes all the identifiers align. Something like int a; struct xfs_mount *mp; > + Nit: Extra blank line? > + xfs_extlen_t free, resv, avail; > + xfs_extlen_t need_fsbs, min_free_fsbs; > + xfs_extlen_t best_free = 0; > + xfs_agnumber_t agcount = mp->m_sb.sb_agcount; Similar comment as above > + > + /* RT inodes allocate from the realtime volume */ > + if (XFS_IS_REALTIME_INODE(ip)) > + return XFS_INO_TO_AGNO(mp, ip->i_ino); What are we returning for a realtime volume? AG sizes and rtgroup sizes may be different, right? Can we use the above macro for both? Also, rt volumes work with extents (which can be larger than the fsblock size) - so will the above work? > + > + start_agno = XFS_INO_TO_AGNO(mp, ip->i_ino); > + > + /* > + * size-based minimum free requirement. > + * Convert bytes to fsbs and require some slack. > + */ > + need_fsbs = XFS_B_TO_FSB(mp, (xfs_fsize_t)len); > + min_free_fsbs = need_fsbs + max_t(xfs_extlen_t, need_fsbs >> 2, 128); A short comment explaining the above? > + > + /* > + * scan AGs starting at start_agno and wrapping. > + * Pick the first AG that meets min_free_fsbs after reservations. > + * Keep a "best" fallback = maximum (free - resv). > + */ > + best_agno = start_agno; > + > + for (xfs_agnumber_t i = 0; i < agcount; i++) { Maybe use for_each_perag_wrap_range or for_each_perag_wrap (defined in fs/xfs/libxfs/xfs_ag.h)? 
> + agno = (start_agno + i) % agcount; > + pag = xfs_perag_get(mp, agno); > + > + if (!xfs_perag_initialised_agf(pag)) > + goto next; > + > + free = READ_ONCE(pag->pagf_freeblks); > + resv = xfs_ag_resv_needed(pag, XFS_AG_RESV_NONE); > + > + if (free <= resv) > + goto next; > + > + avail = free - resv; > + > + if (avail >= min_free_fsbs) { > + xfs_perag_put(pag); > + return agno; > + } > + > + if (avail > best_free) { Just for my understanding - we are doing a largest-fit selection, aren't we? > + best_free = avail; > + best_agno = agno; > + } > +next: > + xfs_perag_put(pag); > + } > + > + return best_agno; > +} > + > +static inline xfs_agnumber_t xfs_ag_from_iomap(const struct xfs_mount *mp, > + const struct iomap *iomap, > + const struct xfs_inode *ip, loff_t pos, size_t len) > +{ > + if (iomap->type == IOMAP_MAPPED || iomap->type == IOMAP_UNWRITTEN) { > + /* iomap->addr is byte address on device for buffered I/O */ > + xfs_fsblock_t fsb = XFS_BB_TO_FSBT(mp, BTOBB(iomap->addr)); > + > + return XFS_FSB_TO_AGNO(mp, fsb); > + } else if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_DELALLOC) { > + return xfs_predict_delalloc_agno(ip, pos, len); > + } > + > + return XFS_INO_TO_AGNO(mp, ip->i_ino); > +} > + > +static void xfs_agp_set(struct xfs_inode *ip, pgoff_t index, > + xfs_agnumber_t agno, u8 type) > +{ > + u32 packed = xfs_agp_pack((u32)agno, type, true); > + > + /* store as immediate value */ > + xa_store(&ip->i_ag_pmap, index, xa_mk_value(packed), GFP_NOFS); Maybe nit: Unhandled return from xa_store() - it returns void * > + > + /* Mark this AG as having potential dirty work */ > + if (ip->i_ag_dirty_bitmap && (u32)agno < ip->i_ag_dirty_bits) if agno >= i_ag_dirty_bits, then shouldn't we throw a WARN or an ASSERT() or XFS_IS_CORRUPT() - Darrick, any thoughts? > + set_bit((u32)agno, ip->i_ag_dirty_bitmap); > +} > + > +static void > +xfs_iomap_tag_folio(const struct iomap *iomap, struct folio *folio, > + loff_t pos, size_t len) Nit: Coding style: I think xfs functions declare one parameter per line, with the ")" right after the last parameter on the same line. Look at something like xfs_swap_extent_rmap() defined in fs/xfs/xfs_bmap_util.c. Not sure if the maintainers have hard preferences with this - so better to cross check. > +{ > + struct inode *inode; > + struct xfs_inode *ip; > + struct xfs_mount *mp; > + xfs_agnumber_t agno; Coding style as mentioned before: struct inode *inode; struct xfs_inode *ip; --NR > + > + inode = folio_mapping(folio)->host; > + ip = XFS_I(inode); > + mp = ip->i_mount; > + > + agno = xfs_ag_from_iomap(mp, iomap, ip, pos, len); > + > + xfs_agp_set(ip, folio->index, agno, (u8)iomap->type); > +} > + > const struct iomap_write_ops xfs_iomap_write_ops = { > .iomap_valid = xfs_iomap_valid, > + .tag_folio = xfs_iomap_tag_folio, > }; > > int ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <CGME20260116101259epcas5p1cfa6ab02e5a01f7c46cc78df95c57ce0@epcas5p1.samsung.com>]
* [PATCH v3 5/6] xfs: add per-AG writeback workqueue infrastructure [not found] ` <CGME20260116101259epcas5p1cfa6ab02e5a01f7c46cc78df95c57ce0@epcas5p1.samsung.com> @ 2026-01-16 10:08 ` Kundan Kumar 2026-01-29 22:21 ` Darrick J. Wong ` (2 more replies) 0 siblings, 3 replies; 48+ messages in thread From: Kundan Kumar @ 2026-01-16 10:08 UTC (permalink / raw) To: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei Cc: linux-fsdevel, linux-mm, linux-xfs, gost.dev, kundan.kumar, anuj20.g, vishak.g, joshi.k Introduce per-AG writeback worker infrastructure at mount time. This patch adds initialization and teardown only, without changing writeback behavior. Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> --- fs/xfs/xfs_aops.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/xfs_aops.h | 3 ++ fs/xfs/xfs_mount.c | 2 ++ fs/xfs/xfs_mount.h | 10 ++++++ fs/xfs/xfs_super.c | 2 ++ 5 files changed, 96 insertions(+) diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index a26f79815533..9d5b65922cd2 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -23,6 +23,23 @@ #include "xfs_zone_alloc.h" #include "xfs_rtgroup.h" +#define XFS_AG_TASK_POOL_MIN 1024 + +struct xfs_ag_wb_task { + struct list_head list; + struct xfs_inode *ip; + struct writeback_control wbc; + xfs_agnumber_t agno; +}; + +struct xfs_ag_wb { + struct delayed_work ag_work; + spinlock_t lock; + struct list_head task_list; + xfs_agnumber_t agno; + struct xfs_mount *mp; +}; + struct xfs_writepage_ctx { struct iomap_writepage_ctx ctx; unsigned int data_seq; @@ -666,6 +683,68 @@ static const struct iomap_writeback_ops xfs_zoned_writeback_ops = { .writeback_submit = xfs_zoned_writeback_submit, }; +void +xfs_init_ag_writeback(struct xfs_mount *mp) +{ + xfs_agnumber_t agno; + + mp->m_ag_wq = alloc_workqueue("xfs-ag-wb", WQ_UNBOUND | WQ_MEM_RECLAIM, + 0); + if (!mp->m_ag_wq) + return; + + mp->m_ag_wb = kcalloc(mp->m_sb.sb_agcount, + sizeof(struct xfs_ag_wb), + GFP_KERNEL); + + if (!mp->m_ag_wb) { + destroy_workqueue(mp->m_ag_wq); + mp->m_ag_wq = NULL; + return; + } + + for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) { + struct xfs_ag_wb *awb = &mp->m_ag_wb[agno]; + + spin_lock_init(&awb->lock); + INIT_LIST_HEAD(&awb->task_list); + awb->agno = agno; + awb->mp = mp; + } + + mp->m_ag_task_cachep = kmem_cache_create("xfs_ag_wb_task", + sizeof(struct xfs_ag_wb_task), + 0, + SLAB_RECLAIM_ACCOUNT, + NULL); + + mp->m_ag_task_pool = mempool_create_slab_pool(XFS_AG_TASK_POOL_MIN, + mp->m_ag_task_cachep); + + if (!mp->m_ag_task_pool) { + kmem_cache_destroy(mp->m_ag_task_cachep); + mp->m_ag_task_cachep = NULL; + } +} + +void +xfs_destroy_ag_writeback(struct xfs_mount *mp) +{ + if (mp->m_ag_wq) { + flush_workqueue(mp->m_ag_wq); + destroy_workqueue(mp->m_ag_wq); + mp->m_ag_wq = NULL; + } + kfree(mp->m_ag_wb); + mp->m_ag_wb = NULL; + + mempool_destroy(mp->m_ag_task_pool); + mp->m_ag_task_pool = NULL; + + kmem_cache_destroy(mp->m_ag_task_cachep); + mp->m_ag_task_cachep = NULL; +} + STATIC int xfs_vm_writepages( struct address_space *mapping, diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h index 5a7a0f1a0b49..e84acb7e8ca8 100644 --- a/fs/xfs/xfs_aops.h +++ b/fs/xfs/xfs_aops.h @@ -12,4 +12,7 @@ extern const struct address_space_operations xfs_dax_aops; int xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size); void xfs_end_bio(struct bio *bio); +void xfs_init_ag_writeback(struct xfs_mount *mp); +void 
xfs_destroy_ag_writeback(struct xfs_mount *mp); + #endif /* __XFS_AOPS_H__ */ diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c index 0953f6ae94ab..26224503c4bf 100644 --- a/fs/xfs/xfs_mount.c +++ b/fs/xfs/xfs_mount.c @@ -1323,6 +1323,8 @@ xfs_unmountfs( xfs_qm_unmount(mp); + xfs_destroy_ag_writeback(mp); + /* * Unreserve any blocks we have so that when we unmount we don't account * the reserved free space as used. This is really only necessary for diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index b871dfde372b..c44155de2883 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -342,6 +342,16 @@ typedef struct xfs_mount { /* Hook to feed dirent updates to an active online repair. */ struct xfs_hooks m_dir_update_hooks; + + + /* global XFS AG writeback wq */ + struct workqueue_struct *m_ag_wq; + /* array of [sb_agcount] */ + struct xfs_ag_wb *m_ag_wb; + + /* task cache and pool */ + struct kmem_cache *m_ag_task_cachep; + mempool_t *m_ag_task_pool; } xfs_mount_t; #define M_IGEO(mp) (&(mp)->m_ino_geo) diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index bc71aa9dcee8..73f8d2942df4 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1765,6 +1765,8 @@ xfs_fs_fill_super( if (error) goto out_free_sb; + xfs_init_ag_writeback(mp); + /* * V4 support is undergoing deprecation. * -- 2.25.1 ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 5/6] xfs: add per-AG writeback workqueue infrastructure 2026-01-16 10:08 ` [PATCH v3 5/6] xfs: add per-AG writeback workqueue infrastructure Kundan Kumar @ 2026-01-29 22:21 ` Darrick J. Wong 2026-02-03 7:35 ` Kundan Kumar 2026-02-06 6:46 ` Christoph Hellwig 2026-02-10 11:56 ` Nirjhar Roy (IBM) 2 siblings, 1 reply; 48+ messages in thread From: Darrick J. Wong @ 2026-01-29 22:21 UTC (permalink / raw) To: Kundan Kumar Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, Jan 16, 2026 at 03:38:17PM +0530, Kundan Kumar wrote: > Introduce per-AG writeback worker infrastructure at mount time. > This patch adds initialization and teardown only, without changing > writeback behavior. > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > --- > fs/xfs/xfs_aops.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++ > fs/xfs/xfs_aops.h | 3 ++ > fs/xfs/xfs_mount.c | 2 ++ > fs/xfs/xfs_mount.h | 10 ++++++ > fs/xfs/xfs_super.c | 2 ++ > 5 files changed, 96 insertions(+) > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > index a26f79815533..9d5b65922cd2 100644 > --- a/fs/xfs/xfs_aops.c > +++ b/fs/xfs/xfs_aops.c > @@ -23,6 +23,23 @@ > #include "xfs_zone_alloc.h" > #include "xfs_rtgroup.h" > > +#define XFS_AG_TASK_POOL_MIN 1024 > + > +struct xfs_ag_wb_task { > + struct list_head list; > + struct xfs_inode *ip; > + struct writeback_control wbc; > + xfs_agnumber_t agno; > +}; > + > +struct xfs_ag_wb { > + struct delayed_work ag_work; > + spinlock_t lock; > + struct list_head task_list; > + xfs_agnumber_t agno; > + struct xfs_mount *mp; > +}; Help me understand the data structures here ... for each AG there's an xfs_ag_wb object which can be run as a delayed workqueue item. This xfs_ag_wb is the head of a list of xfs_ag_wb_task items? In turn, each xfs_ag_wb_task list item points to an inode and (redundantly?) the agnumber? So in effect each AG has a kworker that can say "do all of this file's pagecache writeback for all dirty folios tagged with the same agnumber"? <shrug> It's hard to tell with no comments about how these two pieces of data relate to each other. --D > struct xfs_writepage_ctx { > struct iomap_writepage_ctx ctx; > unsigned int data_seq; > @@ -666,6 +683,68 @@ static const struct iomap_writeback_ops xfs_zoned_writeback_ops = { > .writeback_submit = xfs_zoned_writeback_submit, > }; > > +void > +xfs_init_ag_writeback(struct xfs_mount *mp) > +{ > + xfs_agnumber_t agno; > + > + mp->m_ag_wq = alloc_workqueue("xfs-ag-wb", WQ_UNBOUND | WQ_MEM_RECLAIM, > + 0); > + if (!mp->m_ag_wq) > + return; > + > + mp->m_ag_wb = kcalloc(mp->m_sb.sb_agcount, > + sizeof(struct xfs_ag_wb), > + GFP_KERNEL); > + > + if (!mp->m_ag_wb) { > + destroy_workqueue(mp->m_ag_wq); > + mp->m_ag_wq = NULL; > + return; > + } > + > + for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) { > + struct xfs_ag_wb *awb = &mp->m_ag_wb[agno]; > + > + spin_lock_init(&awb->lock); > + INIT_LIST_HEAD(&awb->task_list); > + awb->agno = agno; > + awb->mp = mp; Can't you stuff this information in struct xfs_perag instead of asking for a potentially huge allocation? 
> + } > + > + mp->m_ag_task_cachep = kmem_cache_create("xfs_ag_wb_task", > + sizeof(struct xfs_ag_wb_task), > + 0, > + SLAB_RECLAIM_ACCOUNT, > + NULL); > + > + mp->m_ag_task_pool = mempool_create_slab_pool(XFS_AG_TASK_POOL_MIN, > + mp->m_ag_task_cachep); > + > + if (!mp->m_ag_task_pool) { > + kmem_cache_destroy(mp->m_ag_task_cachep); > + mp->m_ag_task_cachep = NULL; > + } > +} > + > +void > +xfs_destroy_ag_writeback(struct xfs_mount *mp) > +{ > + if (mp->m_ag_wq) { > + flush_workqueue(mp->m_ag_wq); > + destroy_workqueue(mp->m_ag_wq); > + mp->m_ag_wq = NULL; > + } > + kfree(mp->m_ag_wb); > + mp->m_ag_wb = NULL; > + > + mempool_destroy(mp->m_ag_task_pool); > + mp->m_ag_task_pool = NULL; > + > + kmem_cache_destroy(mp->m_ag_task_cachep); > + mp->m_ag_task_cachep = NULL; > +} > + > STATIC int > xfs_vm_writepages( > struct address_space *mapping, > diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h > index 5a7a0f1a0b49..e84acb7e8ca8 100644 > --- a/fs/xfs/xfs_aops.h > +++ b/fs/xfs/xfs_aops.h > @@ -12,4 +12,7 @@ extern const struct address_space_operations xfs_dax_aops; > int xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size); > void xfs_end_bio(struct bio *bio); > > +void xfs_init_ag_writeback(struct xfs_mount *mp); > +void xfs_destroy_ag_writeback(struct xfs_mount *mp); > + > #endif /* __XFS_AOPS_H__ */ > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c > index 0953f6ae94ab..26224503c4bf 100644 > --- a/fs/xfs/xfs_mount.c > +++ b/fs/xfs/xfs_mount.c > @@ -1323,6 +1323,8 @@ xfs_unmountfs( > > xfs_qm_unmount(mp); > > + xfs_destroy_ag_writeback(mp); > + > /* > * Unreserve any blocks we have so that when we unmount we don't account > * the reserved free space as used. This is really only necessary for > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h > index b871dfde372b..c44155de2883 100644 > --- a/fs/xfs/xfs_mount.h > +++ b/fs/xfs/xfs_mount.h > @@ -342,6 +342,16 @@ typedef struct xfs_mount { > > /* Hook to feed dirent updates to an active online repair. */ > struct xfs_hooks m_dir_update_hooks; > + > + > + /* global XFS AG writeback wq */ > + struct workqueue_struct *m_ag_wq; > + /* array of [sb_agcount] */ > + struct xfs_ag_wb *m_ag_wb; > + > + /* task cache and pool */ > + struct kmem_cache *m_ag_task_cachep; > + mempool_t *m_ag_task_pool; > } xfs_mount_t; > > #define M_IGEO(mp) (&(mp)->m_ino_geo) > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c > index bc71aa9dcee8..73f8d2942df4 100644 > --- a/fs/xfs/xfs_super.c > +++ b/fs/xfs/xfs_super.c > @@ -1765,6 +1765,8 @@ xfs_fs_fill_super( > if (error) > goto out_free_sb; > > + xfs_init_ag_writeback(mp); > + > /* > * V4 support is undergoing deprecation. > * > -- > 2.25.1 > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 5/6] xfs: add per-AG writeback workqueue infrastructure 2026-01-29 22:21 ` Darrick J. Wong @ 2026-02-03 7:35 ` Kundan Kumar 0 siblings, 0 replies; 48+ messages in thread From: Kundan Kumar @ 2026-02-03 7:35 UTC (permalink / raw) To: Darrick J. Wong Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On 1/30/2026 3:51 AM, Darrick J. Wong wrote: > On Fri, Jan 16, 2026 at 03:38:17PM +0530, Kundan Kumar wrote: >> Introduce per-AG writeback worker infrastructure at mount time. >> This patch adds initialization and teardown only, without changing >> writeback behavior. >> >> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> >> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> >> --- >> fs/xfs/xfs_aops.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++ >> fs/xfs/xfs_aops.h | 3 ++ >> fs/xfs/xfs_mount.c | 2 ++ >> fs/xfs/xfs_mount.h | 10 ++++++ >> fs/xfs/xfs_super.c | 2 ++ >> 5 files changed, 96 insertions(+) >> >> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c >> index a26f79815533..9d5b65922cd2 100644 >> --- a/fs/xfs/xfs_aops.c >> +++ b/fs/xfs/xfs_aops.c >> @@ -23,6 +23,23 @@ >> #include "xfs_zone_alloc.h" >> #include "xfs_rtgroup.h" >> >> +#define XFS_AG_TASK_POOL_MIN 1024 >> + >> +struct xfs_ag_wb_task { >> + struct list_head list; >> + struct xfs_inode *ip; >> + struct writeback_control wbc; >> + xfs_agnumber_t agno; >> +}; >> + >> +struct xfs_ag_wb { >> + struct delayed_work ag_work; >> + spinlock_t lock; >> + struct list_head task_list; >> + xfs_agnumber_t agno; >> + struct xfs_mount *mp; >> +}; > > Help me understand the data structures here ... for each AG there's an > xfs_ag_wb object which can be run as a delayed workqueue item. This > xfs_ag_wb is the head of a list of xfs_ag_wb_task items? > > In turn, each xfs_ag_wb_task list item points to an inode and > (redundantly?) the agnumber? So in effect each AG has a kworker that > can say "do all of this file's pagecache writeback for all dirty folios > tagged with the same agnumber"? > > <shrug> It's hard to tell with no comments about how these two pieces of > data relate to each other. > > --D > Yes, that’s the intent. struct xfs_ag_wb is per-AG state and owns the delayed_work that runs the AG worker on m_ag_wq. Each xfs_ag_wb maintains a queue of pending work items that request "process inode X for AG Y". struct xfs_ag_wb_task is the queued request, it carries the target inode plus a snapshot of wbc. When the AG worker runs, it drains its task list and for each inode does a one-pass scan of the mapping, filtering folios tagged for that AG and submitting IO for those folios only. 
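To make that concrete, for the next version I can add comments along these lines (wording tentative):

/*
 * A queued writeback request: "write back inode X's dirty folios that
 * are tagged for AG Y". Lives on the owning AG's xfs_ag_wb task_list.
 */
struct xfs_ag_wb_task {
	struct list_head		list;	/* entry in xfs_ag_wb->task_list */
	struct xfs_inode		*ip;	/* inode to scan */
	struct writeback_control	wbc;	/* snapshot of the caller's wbc */
	xfs_agnumber_t			agno;	/* AG whose folios to submit */
};

/*
 * Per-AG writeback state. ag_work runs on mp->m_ag_wq; it drains
 * task_list and, for each queued inode, does a one-pass scan of the
 * mapping, submitting only folios tagged for this AG.
 */
struct xfs_ag_wb {
	struct delayed_work	ag_work;
	spinlock_t		lock;		/* protects task_list */
	struct list_head	task_list;	/* pending xfs_ag_wb_tasks */
	xfs_agnumber_t		agno;
	struct xfs_mount	*mp;
};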
>> struct xfs_writepage_ctx { >> struct iomap_writepage_ctx ctx; >> unsigned int data_seq; >> @@ -666,6 +683,68 @@ static const struct iomap_writeback_ops xfs_zoned_writeback_ops = { >> .writeback_submit = xfs_zoned_writeback_submit, >> }; >> >> +void >> +xfs_init_ag_writeback(struct xfs_mount *mp) >> +{ >> + xfs_agnumber_t agno; >> + >> + mp->m_ag_wq = alloc_workqueue("xfs-ag-wb", WQ_UNBOUND | WQ_MEM_RECLAIM, >> + 0); >> + if (!mp->m_ag_wq) >> + return; >> + >> + mp->m_ag_wb = kcalloc(mp->m_sb.sb_agcount, >> + sizeof(struct xfs_ag_wb), >> + GFP_KERNEL); >> + >> + if (!mp->m_ag_wb) { >> + destroy_workqueue(mp->m_ag_wq); >> + mp->m_ag_wq = NULL; >> + return; >> + } >> + >> + for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) { >> + struct xfs_ag_wb *awb = &mp->m_ag_wb[agno]; >> + >> + spin_lock_init(&awb->lock); >> + INIT_LIST_HEAD(&awb->task_list); >> + awb->agno = agno; >> + awb->mp = mp; > > Can't you stuff this information in struct xfs_perag instead of asking > for a potentially huge allocation? > I can switch the xfs_ag_wb fields (lock/list/work/agno/mp) into struct xfs_perag and initialize them when perags are set up. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 5/6] xfs: add per-AG writeback workqueue infrastructure 2026-01-16 10:08 ` [PATCH v3 5/6] xfs: add per-AG writeback workqueue infrastructure Kundan Kumar 2026-01-29 22:21 ` Darrick J. Wong @ 2026-02-06 6:46 ` Christoph Hellwig 2026-02-10 11:56 ` Nirjhar Roy (IBM) 2 siblings, 0 replies; 48+ messages in thread From: Christoph Hellwig @ 2026-02-06 6:46 UTC (permalink / raw) To: Kundan Kumar Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, Jan 16, 2026 at 03:38:17PM +0530, Kundan Kumar wrote: > Introduce per-AG writeback worker infrastructure at mount time. > This patch adds initialization and teardown only, without changing > writeback behavior. We really should not be adding extra writeback threads in the file system. btrfs has done this and created a lot of madness. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 5/6] xfs: add per-AG writeback workqueue infrastructure 2026-01-16 10:08 ` [PATCH v3 5/6] xfs: add per-AG writeback workqueue infrastructure Kundan Kumar 2026-01-29 22:21 ` Darrick J. Wong 2026-02-06 6:46 ` Christoph Hellwig @ 2026-02-10 11:56 ` Nirjhar Roy (IBM) 2 siblings, 0 replies; 48+ messages in thread From: Nirjhar Roy (IBM) @ 2026-02-10 11:56 UTC (permalink / raw) To: Kundan Kumar, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei Cc: linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, 2026-01-16 at 15:38 +0530, Kundan Kumar wrote: > Introduce per-AG writeback worker infrastructure at mount time. > This patch adds initialization and teardown only, without changing > writeback behavior. > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > --- > fs/xfs/xfs_aops.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++ > fs/xfs/xfs_aops.h | 3 ++ > fs/xfs/xfs_mount.c | 2 ++ > fs/xfs/xfs_mount.h | 10 ++++++ > fs/xfs/xfs_super.c | 2 ++ > 5 files changed, 96 insertions(+) > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > index a26f79815533..9d5b65922cd2 100644 > --- a/fs/xfs/xfs_aops.c > +++ b/fs/xfs/xfs_aops.c > @@ -23,6 +23,23 @@ > #include "xfs_zone_alloc.h" > #include "xfs_rtgroup.h" > > +#define XFS_AG_TASK_POOL_MIN 1024 > + > +struct xfs_ag_wb_task { > + struct list_head list; > + struct xfs_inode *ip; > + struct writeback_control wbc; > + xfs_agnumber_t agno; Is this the agno where the ip resides, or the agno of any one of the blocks belonging to this ip? > +}; Nit: Coding style - tab between the data type and the identifier > + > +struct xfs_ag_wb { > + struct delayed_work ag_work; > + spinlock_t lock; > + struct list_head task_list; > + xfs_agnumber_t agno; > + struct xfs_mount *mp; > +}; > + > struct xfs_writepage_ctx { > struct iomap_writepage_ctx ctx; > unsigned int data_seq; > @@ -666,6 +683,68 @@ static const struct iomap_writeback_ops xfs_zoned_writeback_ops = { > .writeback_submit = xfs_zoned_writeback_submit, > }; > > +void > +xfs_init_ag_writeback(struct xfs_mount *mp) > +{ > + xfs_agnumber_t agno; > + > + mp->m_ag_wq = alloc_workqueue("xfs-ag-wb", WQ_UNBOUND | WQ_MEM_RECLAIM, > + 0); Nit: I think we follow 2-tab indentation if the parameter list exceeds the per-line length limit. Similar comments for such changes in the function call sites below. > + if (!mp->m_ag_wq) > + return; > + > + mp->m_ag_wb = kcalloc(mp->m_sb.sb_agcount, > + sizeof(struct xfs_ag_wb), > + GFP_KERNEL); > + > + if (!mp->m_ag_wb) { > + destroy_workqueue(mp->m_ag_wq); > + mp->m_ag_wq = NULL; > + return; > + } > + > + for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) { > + struct xfs_ag_wb *awb = &mp->m_ag_wb[agno]; > + > + spin_lock_init(&awb->lock); > + INIT_LIST_HEAD(&awb->task_list); > + awb->agno = agno; > + awb->mp = mp; > + } > + > + mp->m_ag_task_cachep = kmem_cache_create("xfs_ag_wb_task", > + sizeof(struct xfs_ag_wb_task), > + 0, > + SLAB_RECLAIM_ACCOUNT, > + NULL); > + > + mp->m_ag_task_pool = mempool_create_slab_pool(XFS_AG_TASK_POOL_MIN, > + mp->m_ag_task_cachep); Nit: 2 tabs indentation > + > + if (!mp->m_ag_task_pool) { > + kmem_cache_destroy(mp->m_ag_task_cachep); > + mp->m_ag_task_cachep = NULL; Shouldn't we also be freeing mp->m_ag_wq and the mp->m_ag_wb[] array?
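Something along these lines would unwind fully on that failure (untested sketch, purely to illustrate the point):

	if (!mp->m_ag_task_pool) {
		kmem_cache_destroy(mp->m_ag_task_cachep);
		mp->m_ag_task_cachep = NULL;
		kfree(mp->m_ag_wb);
		mp->m_ag_wb = NULL;
		destroy_workqueue(mp->m_ag_wq);
		mp->m_ag_wq = NULL;
	}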
> + } > +} > + > +void > +xfs_destroy_ag_writeback(struct xfs_mount *mp) > +{ > + if (mp->m_ag_wq) { > + flush_workqueue(mp->m_ag_wq); > + destroy_workqueue(mp->m_ag_wq); > + mp->m_ag_wq = NULL; > + } > + kfree(mp->m_ag_wb); > + mp->m_ag_wb = NULL; > + > + mempool_destroy(mp->m_ag_task_pool); > + mp->m_ag_task_pool = NULL; > + > + kmem_cache_destroy(mp->m_ag_task_cachep); > + mp->m_ag_task_cachep = NULL; > +} > + > STATIC int > xfs_vm_writepages( > struct address_space *mapping, > diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h > index 5a7a0f1a0b49..e84acb7e8ca8 100644 > --- a/fs/xfs/xfs_aops.h > +++ b/fs/xfs/xfs_aops.h > @@ -12,4 +12,7 @@ extern const struct address_space_operations xfs_dax_aops; > int xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size); > void xfs_end_bio(struct bio *bio); > > +void xfs_init_ag_writeback(struct xfs_mount *mp); > +void xfs_destroy_ag_writeback(struct xfs_mount *mp); > + > #endif /* __XFS_AOPS_H__ */ > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c > index 0953f6ae94ab..26224503c4bf 100644 > --- a/fs/xfs/xfs_mount.c > +++ b/fs/xfs/xfs_mount.c > @@ -1323,6 +1323,8 @@ xfs_unmountfs( > > xfs_qm_unmount(mp); > > + xfs_destroy_ag_writeback(mp); > + > /* > * Unreserve any blocks we have so that when we unmount we don't account > * the reserved free space as used. This is really only necessary for > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h > index b871dfde372b..c44155de2883 100644 > --- a/fs/xfs/xfs_mount.h > +++ b/fs/xfs/xfs_mount.h > @@ -342,6 +342,16 @@ typedef struct xfs_mount { > > /* Hook to feed dirent updates to an active online repair. */ > struct xfs_hooks m_dir_update_hooks; > + > + > + /* global XFS AG writeback wq */ > + struct workqueue_struct *m_ag_wq; > + /* array of [sb_agcount] */ > + struct xfs_ag_wb *m_ag_wb; > + > + /* task cache and pool */ > + struct kmem_cache *m_ag_task_cachep; > + mempool_t *m_ag_task_pool; Again, nit: coding style - tab between the data type and the identifier > } xfs_mount_t; > > #define M_IGEO(mp) (&(mp)->m_ino_geo) > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c > index bc71aa9dcee8..73f8d2942df4 100644 > --- a/fs/xfs/xfs_super.c > +++ b/fs/xfs/xfs_super.c > @@ -1765,6 +1765,8 @@ xfs_fs_fill_super( > if (error) > goto out_free_sb; > > + xfs_init_ag_writeback(mp); So if we are not able to init the writeback workqueue infra, we are still okay to mount the fs, except that the AG aware writeback facility won't be available, i.e. we aren't treating it as a fatal error, correct? --NR > + > /* > * V4 support is undergoing deprecation. > * ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <CGME20260116101305epcas5p497cd6d9027301853669f1c1aaffbf128@epcas5p4.samsung.com>]
* [PATCH v3 6/6] xfs: offload writeback by AG using per-inode dirty bitmap and per-AG workers [not found] ` <CGME20260116101305epcas5p497cd6d9027301853669f1c1aaffbf128@epcas5p4.samsung.com> @ 2026-01-16 10:08 ` Kundan Kumar 2026-01-29 22:34 ` Darrick J. Wong 2026-02-11 9:39 ` Nirjhar Roy (IBM) 0 siblings, 2 replies; 48+ messages in thread From: Kundan Kumar @ 2026-01-16 10:08 UTC (permalink / raw) To: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei Cc: linux-fsdevel, linux-mm, linux-xfs, gost.dev, kundan.kumar, anuj20.g, vishak.g, joshi.k Offload XFS writeback to per-AG workers based on the inode dirty-AG bitmap. Each worker scans and submits writeback only for folios belonging to its AG. Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> --- fs/xfs/xfs_aops.c | 178 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 178 insertions(+) diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index 9d5b65922cd2..55c3154fb2b5 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -678,6 +678,180 @@ xfs_zoned_writeback_submit( return 0; } +static bool xfs_agp_match(struct xfs_inode *ip, pgoff_t index, + xfs_agnumber_t agno) +{ + void *ent; + u32 v; + bool match = false; + + ent = xa_load(&ip->i_ag_pmap, index); + if (ent && xa_is_value(ent)) { + v = xa_to_value(ent); + if (xfs_agp_valid(v)) + match = (xfs_agp_agno(v) == (u32)agno); + } + + return match; +} + +static bool xfs_folio_matches_ag(struct folio *folio, xfs_agnumber_t agno) +{ + struct xfs_inode *ip = XFS_I(folio_mapping(folio)->host); + + return xfs_agp_match(ip, folio->index, agno); +} + +static int xfs_writepages_ag(struct xfs_inode *ip, + struct writeback_control *wbc, + xfs_agnumber_t agno) +{ + struct inode *inode = VFS_I(ip); + struct address_space *mapping = inode->i_mapping; + struct folio_batch *fbatch = &wbc->fbatch; + int ret = 0; + pgoff_t index, end; + + wbc->range_cyclic = 0; + + folio_batch_init(fbatch); + index = wbc->range_start >> PAGE_SHIFT; + end = wbc->range_end >> PAGE_SHIFT; + + struct xfs_writepage_ctx wpc = { + .ctx = { + .inode = inode, + .wbc = wbc, + .ops = &xfs_writeback_ops, + }, + }; + + while (index <= end) { + int i, nr; + + /* get a batch of DIRTY folios starting at index */ + nr = filemap_get_folios_tag(mapping, &index, end, + PAGECACHE_TAG_DIRTY, fbatch); + if (!nr) + break; + + for (i = 0; i < nr; i++) { + struct folio *folio = fbatch->folios[i]; + + /* Filter BEFORE locking */ + if (!xfs_folio_matches_ag(folio, agno)) + continue; + + folio_lock(folio); + + /* + * Now it's ours: clear dirty and submit. + * This prevents *this AG worker* from seeing it again + * next time. 
+ */ + if (!folio_clear_dirty_for_io(folio)) { + folio_unlock(folio); + continue; + } + xa_erase(&ip->i_ag_pmap, folio->index); + + ret = iomap_writeback_folio(&wpc.ctx, folio); + folio_unlock(folio); + + if (ret) { + folio_batch_release(fbatch); + goto out; + } + } + + folio_batch_release(fbatch); + cond_resched(); + } + +out: + if (wpc.ctx.wb_ctx && wpc.ctx.ops && wpc.ctx.ops->writeback_submit) + wpc.ctx.ops->writeback_submit(&wpc.ctx, ret); + + return ret; +} + +static void xfs_ag_writeback_work(struct work_struct *work) +{ + struct xfs_ag_wb *awb = container_of(to_delayed_work(work), + struct xfs_ag_wb, ag_work); + struct xfs_ag_wb_task *task; + struct xfs_mount *mp; + struct inode *inode; + struct xfs_inode *ip; + int ret; + + for (;;) { + spin_lock(&awb->lock); + task = list_first_entry_or_null(&awb->task_list, + struct xfs_ag_wb_task, list); + if (task) + list_del_init(&task->list); + spin_unlock(&awb->lock); + + if (!task) + break; + + ip = task->ip; + mp = ip->i_mount; + inode = VFS_I(ip); + + ret = xfs_writepages_ag(ip, &task->wbc, task->agno); + + /* If didn't submit everything for this AG, set its bit */ + if (ret) + set_bit(task->agno, ip->i_ag_dirty_bitmap); + + iput(inode); /* drop igrab */ + mempool_free(task, mp->m_ag_task_pool); + } +} + +static int xfs_vm_writepages_offload(struct address_space *mapping, + struct writeback_control *wbc) +{ + struct inode *inode = mapping->host; + struct xfs_inode *ip = XFS_I(inode); + struct xfs_mount *mp = ip->i_mount; + struct xfs_ag_wb *awb; + struct xfs_ag_wb_task *task; + xfs_agnumber_t agno; + + if (!ip->i_ag_dirty_bits) + return 0; + + for_each_set_bit(agno, ip->i_ag_dirty_bitmap, ip->i_ag_dirty_bits) { + if (!test_and_clear_bit(agno, ip->i_ag_dirty_bitmap)) + continue; + + task = mempool_alloc(mp->m_ag_task_pool, GFP_NOFS); + if (!task) { + set_bit(agno, ip->i_ag_dirty_bitmap); + continue; + } + + INIT_LIST_HEAD(&task->list); + task->ip = ip; + task->agno = agno; + task->wbc = *wbc; + igrab(inode); /* worker owns inode ref */ + + awb = &mp->m_ag_wb[agno]; + + spin_lock(&awb->lock); + list_add_tail(&task->list, &awb->task_list); + spin_unlock(&awb->lock); + + mod_delayed_work(mp->m_ag_wq, &awb->ag_work, 0); + } + + return 0; +} + static const struct iomap_writeback_ops xfs_zoned_writeback_ops = { .writeback_range = xfs_zoned_writeback_range, .writeback_submit = xfs_zoned_writeback_submit, @@ -706,6 +880,7 @@ xfs_init_ag_writeback(struct xfs_mount *mp) for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) { struct xfs_ag_wb *awb = &mp->m_ag_wb[agno]; + INIT_DELAYED_WORK(&awb->ag_work, xfs_ag_writeback_work); spin_lock_init(&awb->lock); INIT_LIST_HEAD(&awb->task_list); awb->agno = agno; @@ -769,6 +944,9 @@ xfs_vm_writepages( xfs_open_zone_put(xc.open_zone); return error; } else { + if (wbc->sync_mode != WB_SYNC_ALL) + return xfs_vm_writepages_offload(mapping, wbc); + struct xfs_writepage_ctx wpc = { .ctx = { .inode = mapping->host, -- 2.25.1 ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 6/6] xfs: offload writeback by AG using per-inode dirty bitmap and per-AG workers 2026-01-16 10:08 ` [PATCH v3 6/6] xfs: offload writeback by AG using per-inode dirty bitmap and per-AG workers Kundan Kumar @ 2026-01-29 22:34 ` Darrick J. Wong 2026-02-03 7:40 ` Kundan Kumar 0 siblings, 1 reply; 48+ messages in thread From: Darrick J. Wong @ 2026-01-29 22:34 UTC (permalink / raw) To: Kundan Kumar Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, Jan 16, 2026 at 03:38:18PM +0530, Kundan Kumar wrote: > Offload XFS writeback to per-AG workers based on the inode dirty-AG > bitmap. Each worker scans and submits writeback only for folios > belonging to its AG. > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > --- > fs/xfs/xfs_aops.c | 178 ++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 178 insertions(+) > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > index 9d5b65922cd2..55c3154fb2b5 100644 > --- a/fs/xfs/xfs_aops.c > +++ b/fs/xfs/xfs_aops.c > @@ -678,6 +678,180 @@ xfs_zoned_writeback_submit( > return 0; > } > > +static bool xfs_agp_match(struct xfs_inode *ip, pgoff_t index, > + xfs_agnumber_t agno) > +{ > + void *ent; > + u32 v; > + bool match = false; > + > + ent = xa_load(&ip->i_ag_pmap, index); > + if (ent && xa_is_value(ent)) { > + v = xa_to_value(ent); > + if (xfs_agp_valid(v)) > + match = (xfs_agp_agno(v) == (u32)agno); > + } > + > + return match; > +} > + > +static bool xfs_folio_matches_ag(struct folio *folio, xfs_agnumber_t agno) > +{ > + struct xfs_inode *ip = XFS_I(folio_mapping(folio)->host); > + > + return xfs_agp_match(ip, folio->index, agno); > +} > + > +static int xfs_writepages_ag(struct xfs_inode *ip, > + struct writeback_control *wbc, > + xfs_agnumber_t agno) > +{ > + struct inode *inode = VFS_I(ip); > + struct address_space *mapping = inode->i_mapping; > + struct folio_batch *fbatch = &wbc->fbatch; > + int ret = 0; > + pgoff_t index, end; > + > + wbc->range_cyclic = 0; > + > + folio_batch_init(fbatch); > + index = wbc->range_start >> PAGE_SHIFT; > + end = wbc->range_end >> PAGE_SHIFT; > + > + struct xfs_writepage_ctx wpc = { > + .ctx = { > + .inode = inode, > + .wbc = wbc, > + .ops = &xfs_writeback_ops, > + }, > + }; > + > + while (index <= end) { > + int i, nr; > + > + /* get a batch of DIRTY folios starting at index */ > + nr = filemap_get_folios_tag(mapping, &index, end, > + PAGECACHE_TAG_DIRTY, fbatch); > + if (!nr) > + break; > + > + for (i = 0; i < nr; i++) { > + struct folio *folio = fbatch->folios[i]; > + > + /* Filter BEFORE locking */ > + if (!xfs_folio_matches_ag(folio, agno)) So we grab a batch of dirty folios, and only /then/ check to see if they've been tagged with the target agno? That doesn't seem very efficient if there are a lot of dirty folios and they're not evenly distributed among AGs. > + continue; > + > + folio_lock(folio); > + > + /* > + * Now it's ours: clear dirty and submit. > + * This prevents *this AG worker* from seeing it again > + * next time. > + */ > + if (!folio_clear_dirty_for_io(folio)) { > + folio_unlock(folio); > + continue; > + } > + xa_erase(&ip->i_ag_pmap, folio->index); Why erase the association?
Is this because once we've written the folio back to storage, we want a subsequent write to a fsblock within that folio to tag the folio with the agnumber of that fsblock? Hrm, maybe that's how you deal with multi-fsblock folios; the folio tag reflects the first block to be dirtied within the folio? > + > + ret = iomap_writeback_folio(&wpc.ctx, folio); > + folio_unlock(folio); > + > + if (ret) { > + folio_batch_release(fbatch); > + goto out; > + } > + } > + > + folio_batch_release(fbatch); > + cond_resched(); > + } > + > +out: > + if (wpc.ctx.wb_ctx && wpc.ctx.ops && wpc.ctx.ops->writeback_submit) > + wpc.ctx.ops->writeback_submit(&wpc.ctx, ret); > + > + return ret; > +} > + > +static void xfs_ag_writeback_work(struct work_struct *work) > +{ > + struct xfs_ag_wb *awb = container_of(to_delayed_work(work), > + struct xfs_ag_wb, ag_work); > + struct xfs_ag_wb_task *task; > + struct xfs_mount *mp; > + struct inode *inode; > + struct xfs_inode *ip; > + int ret; > + > + for (;;) { > + spin_lock(&awb->lock); > + task = list_first_entry_or_null(&awb->task_list, > + struct xfs_ag_wb_task, list); > + if (task) > + list_del_init(&task->list); > + spin_unlock(&awb->lock); > + > + if (!task) > + break; > + > + ip = task->ip; > + mp = ip->i_mount; > + inode = VFS_I(ip); > + > + ret = xfs_writepages_ag(ip, &task->wbc, task->agno); > + > + /* If didn't submit everything for this AG, set its bit */ > + if (ret) > + set_bit(task->agno, ip->i_ag_dirty_bitmap); > + > + iput(inode); /* drop igrab */ > + mempool_free(task, mp->m_ag_task_pool); > + } > +} > + > +static int xfs_vm_writepages_offload(struct address_space *mapping, > + struct writeback_control *wbc) > +{ > + struct inode *inode = mapping->host; > + struct xfs_inode *ip = XFS_I(inode); > + struct xfs_mount *mp = ip->i_mount; > + struct xfs_ag_wb *awb; > + struct xfs_ag_wb_task *task; > + xfs_agnumber_t agno; > + > + if (!ip->i_ag_dirty_bits) > + return 0; > + > + for_each_set_bit(agno, ip->i_ag_dirty_bitmap, ip->i_ag_dirty_bits) { > + if (!test_and_clear_bit(agno, ip->i_ag_dirty_bitmap)) > + continue; > + > + task = mempool_alloc(mp->m_ag_task_pool, GFP_NOFS); Allocating memory (even from a mempool) during writeback makes me nervous... > + if (!task) { > + set_bit(agno, ip->i_ag_dirty_bitmap); > + continue; > + } ...because apparently the allocation can fail. If so, then why don't we just fall back to serial writeback instead of ... moving on to the next AG and seeing if there's more memory? > + > + INIT_LIST_HEAD(&task->list); > + task->ip = ip; > + task->agno = agno; > + task->wbc = *wbc; > + igrab(inode); /* worker owns inode ref */ Shouldn't we check for a null return value here? That should never happen, but we /do/ have the option of falling back to standard iomap_writepages. 
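(i.e. something like this, untested, and the fallback plumbing is hand-waved; it just restores the dirty bit and bails so the caller can do serial writeback:)

		if (!igrab(inode)) {
			mempool_free(task, mp->m_ag_task_pool);
			set_bit(agno, ip->i_ag_dirty_bitmap);
			break;	/* fall back to plain iomap_writepages */
		}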
> + > + awb = &mp->m_ag_wb[agno]; > + > + spin_lock(&awb->lock); > + list_add_tail(&task->list, &awb->task_list); > + spin_unlock(&awb->lock); > + > + mod_delayed_work(mp->m_ag_wq, &awb->ag_work, 0); > + } > + > + return 0; > +} > + > static const struct iomap_writeback_ops xfs_zoned_writeback_ops = { > .writeback_range = xfs_zoned_writeback_range, > .writeback_submit = xfs_zoned_writeback_submit, > @@ -706,6 +880,7 @@ xfs_init_ag_writeback(struct xfs_mount *mp) > for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) { > struct xfs_ag_wb *awb = &mp->m_ag_wb[agno]; > > + INIT_DELAYED_WORK(&awb->ag_work, xfs_ag_writeback_work); > spin_lock_init(&awb->lock); > INIT_LIST_HEAD(&awb->task_list); > awb->agno = agno; > @@ -769,6 +944,9 @@ xfs_vm_writepages( > xfs_open_zone_put(xc.open_zone); > return error; > } else { > + if (wbc->sync_mode != WB_SYNC_ALL) > + return xfs_vm_writepages_offload(mapping, wbc); > + > struct xfs_writepage_ctx wpc = { > .ctx = { > .inode = mapping->host, > -- > 2.25.1 > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 6/6] xfs: offload writeback by AG using per-inode dirty bitmap and per-AG workers 2026-01-29 22:34 ` Darrick J. Wong @ 2026-02-03 7:40 ` Kundan Kumar 0 siblings, 0 replies; 48+ messages in thread From: Kundan Kumar @ 2026-02-03 7:40 UTC (permalink / raw) To: Darrick J. Wong Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On 1/30/2026 4:04 AM, Darrick J. Wong wrote: > On Fri, Jan 16, 2026 at 03:38:18PM +0530, Kundan Kumar wrote: >> Offload XFS writeback to per-AG workers based on the inode dirty-AG >> bitmap. Each worker scans and submits writeback only for folios >> belonging to its AG. >> >> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> >> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> >> --- >> fs/xfs/xfs_aops.c | 178 ++++++++++++++++++++++++++++++++++++++++++++++ >> 1 file changed, 178 insertions(+) >> >> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c >> index 9d5b65922cd2..55c3154fb2b5 100644 >> --- a/fs/xfs/xfs_aops.c >> +++ b/fs/xfs/xfs_aops.c >> @@ -678,6 +678,180 @@ xfs_zoned_writeback_submit( >> return 0; >> } >> >> +static bool xfs_agp_match(struct xfs_inode *ip, pgoff_t index, >> + xfs_agnumber_t agno) >> +{ >> + void *ent; >> + u32 v; >> + bool match = false; >> + >> + ent = xa_load(&ip->i_ag_pmap, index); >> + if (ent && xa_is_value(ent)) { >> + v = xa_to_value(ent); >> + if (xfs_agp_valid(v)) >> + match = (xfs_agp_agno(v) == (u32)agno); >> + } >> + >> + return match; >> +} >> + >> +static bool xfs_folio_matches_ag(struct folio *folio, xfs_agnumber_t agno) >> +{ >> + struct xfs_inode *ip = XFS_I(folio_mapping(folio)->host); >> + >> + return xfs_agp_match(ip, folio->index, agno); >> +} >> + >> +static int xfs_writepages_ag(struct xfs_inode *ip, >> + struct writeback_control *wbc, >> + xfs_agnumber_t agno) >> +{ >> + struct inode *inode = VFS_I(ip); >> + struct address_space *mapping = inode->i_mapping; >> + struct folio_batch *fbatch = &wbc->fbatch; >> + int ret = 0; >> + pgoff_t index, end; >> + >> + wbc->range_cyclic = 0; >> + >> + folio_batch_init(fbatch); >> + index = wbc->range_start >> PAGE_SHIFT; >> + end = wbc->range_end >> PAGE_SHIFT; >> + >> + struct xfs_writepage_ctx wpc = { >> + .ctx = { >> + .inode = inode, >> + .wbc = wbc, >> + .ops = &xfs_writeback_ops, >> + }, >> + }; >> + >> + while (index <= end) { >> + int i, nr; >> + >> + /* get a batch of DIRTY folios starting at index */ >> + nr = filemap_get_folios_tag(mapping, &index, end, >> + PAGECACHE_TAG_DIRTY, fbatch); >> + if (!nr) >> + break; >> + >> + for (i = 0; i < nr; i++) { >> + struct folio *folio = fbatch->folios[i]; >> + >> + /* Filter BEFORE locking */ >> + if (!xfs_folio_matches_ag(folio, agno)) > > So we grab a batch of dirty folios, and only /then/ check to see if > they've been tagged with the target agno? That doesn't seem very > efficient if there are a lot of dirty folio and they're not evenly > distributed among AGs. > i_ag_dirty_bitmap is a hint populated at write time to indicate that an inode likely has dirty folios tagged for AG X, so we only kick workers for AGs that should have work. That said, folios may not be evenly distributed across AGs, so a worker can still end up scanning and finding few or no matches in some cases. >> + continue; >> + >> + folio_lock(folio); >> + >> + /* >> + * Now it's ours: clear dirty and submit. >> + * This prevents *this AG worker* from seeing it again >> + * next time. 
>> + */ >> + if (!folio_clear_dirty_for_io(folio)) { >> + folio_unlock(folio); >> + continue; >> + } >> + xa_erase(&ip->i_ag_pmap, folio->index); > > Why erase the association? Is this because once we've written the folio > back to storage, we want a subsequent write to a fsblock within that > folio to tag the folio with the agnumber of that fsblock? Hrm, maybe > that's how you deal with multi-fsblock folios; the folio tag reflects > the first block to be dirtied within the folio? > The primary reason for erasing the entry is to keep the metadata bounded: once the folio is written back and clean, there is no need to retain any AG association, so we drop it to free the per-folio state. On the next write, the folio is re-tagged based on the iomap for that write. This also helps with multi-fsblock folios, since the tag is refreshed on each clean->dirty transition. >> + >> + ret = iomap_writeback_folio(&wpc.ctx, folio); >> + folio_unlock(folio); >> + >> + if (ret) { >> + folio_batch_release(fbatch); >> + goto out; >> + } >> + } >> + >> + folio_batch_release(fbatch); >> + cond_resched(); >> + } >> + >> +out: >> + if (wpc.ctx.wb_ctx && wpc.ctx.ops && wpc.ctx.ops->writeback_submit) >> + wpc.ctx.ops->writeback_submit(&wpc.ctx, ret); >> + >> + return ret; >> +} >> + >> +static void xfs_ag_writeback_work(struct work_struct *work) >> +{ >> + struct xfs_ag_wb *awb = container_of(to_delayed_work(work), >> + struct xfs_ag_wb, ag_work); >> + struct xfs_ag_wb_task *task; >> + struct xfs_mount *mp; >> + struct inode *inode; >> + struct xfs_inode *ip; >> + int ret; >> + >> + for (;;) { >> + spin_lock(&awb->lock); >> + task = list_first_entry_or_null(&awb->task_list, >> + struct xfs_ag_wb_task, list); >> + if (task) >> + list_del_init(&task->list); >> + spin_unlock(&awb->lock); >> + >> + if (!task) >> + break; >> + >> + ip = task->ip; >> + mp = ip->i_mount; >> + inode = VFS_I(ip); >> + >> + ret = xfs_writepages_ag(ip, &task->wbc, task->agno); >> + >> + /* If didn't submit everything for this AG, set its bit */ >> + if (ret) >> + set_bit(task->agno, ip->i_ag_dirty_bitmap); >> + >> + iput(inode); /* drop igrab */ >> + mempool_free(task, mp->m_ag_task_pool); >> + } >> +} >> + >> +static int xfs_vm_writepages_offload(struct address_space *mapping, >> + struct writeback_control *wbc) >> +{ >> + struct inode *inode = mapping->host; >> + struct xfs_inode *ip = XFS_I(inode); >> + struct xfs_mount *mp = ip->i_mount; >> + struct xfs_ag_wb *awb; >> + struct xfs_ag_wb_task *task; >> + xfs_agnumber_t agno; >> + >> + if (!ip->i_ag_dirty_bits) >> + return 0; >> + >> + for_each_set_bit(agno, ip->i_ag_dirty_bitmap, ip->i_ag_dirty_bits) { >> + if (!test_and_clear_bit(agno, ip->i_ag_dirty_bitmap)) >> + continue; >> + >> + task = mempool_alloc(mp->m_ag_task_pool, GFP_NOFS); > > Allocating memory (even from a mempool) during writeback makes me > nervous... > Agreed. We'll remove this allocation entirely. Instead of allocating an xfs_ag_wb_task in ->writepages, we'll maintain a mount-wide list of "active" inodes (those with a non-empty dirty-AG hint). The offload path will only enqueue the inode on that list and kick the relevant AG workers. Each AG worker will then walk the active inode list and scan the pagecache, filtering folios tagged for its AG and submitting IO. This keeps the writeback path allocation-free and avoids mempool use under reclaim. >> + if (!task) { >> + set_bit(agno, ip->i_ag_dirty_bitmap); >> + continue; >> + } > > ...because apparently the allocation can fail. If so, then why don't we > just fall back to serial writeback instead of ... moving on to the next > AG and seeing if there's more memory? > >> + >> + INIT_LIST_HEAD(&task->list); >> + task->ip = ip; >> + task->agno = agno; >> + task->wbc = *wbc; >> + igrab(inode); /* worker owns inode ref */ > > Shouldn't we check for a null return value here? That should never > happen, but we /do/ have the option of falling back to standard > iomap_writepages. > Yes, I will fix this in the next version. >> + >> + awb = &mp->m_ag_wb[agno]; >> + >> + spin_lock(&awb->lock); >> + list_add_tail(&task->list, &awb->task_list); >> + spin_unlock(&awb->lock); >> + >> + mod_delayed_work(mp->m_ag_wq, &awb->ag_work, 0); >> + } >> + >> + return 0; >> +} >> + >> static const struct iomap_writeback_ops xfs_zoned_writeback_ops = { >> .writeback_range = xfs_zoned_writeback_range, >> .writeback_submit = xfs_zoned_writeback_submit, >> @@ -706,6 +880,7 @@ xfs_init_ag_writeback(struct xfs_mount *mp) >> for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) { >> struct xfs_ag_wb *awb = &mp->m_ag_wb[agno]; >> >> + INIT_DELAYED_WORK(&awb->ag_work, xfs_ag_writeback_work); >> spin_lock_init(&awb->lock); >> INIT_LIST_HEAD(&awb->task_list); >> awb->agno = agno; >> @@ -769,6 +944,9 @@ xfs_vm_writepages( >> xfs_open_zone_put(xc.open_zone); >> return error; >> } else { >> + if (wbc->sync_mode != WB_SYNC_ALL) >> + return xfs_vm_writepages_offload(mapping, wbc); >> + >> struct xfs_writepage_ctx wpc = { >> .ctx = { >> .inode = mapping->host, >> -- >> 2.25.1 >> >> > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 6/6] xfs: offload writeback by AG using per-inode dirty bitmap and per-AG workers 2026-01-16 10:08 ` [PATCH v3 6/6] xfs: offload writeback by AG using per-inode dirty bitmap and per-AG workers Kundan Kumar 2026-01-29 22:34 ` Darrick J. Wong @ 2026-02-11 9:39 ` Nirjhar Roy (IBM) 1 sibling, 0 replies; 48+ messages in thread From: Nirjhar Roy (IBM) @ 2026-02-11 9:39 UTC (permalink / raw) To: Kundan Kumar, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei Cc: linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, 2026-01-16 at 15:38 +0530, Kundan Kumar wrote: > Offload XFS writeback to per-AG workers based on the inode dirty-AG > bitmap. Each worker scans and submits writeback only for folios > belonging to its AG. > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > --- > fs/xfs/xfs_aops.c | 178 ++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 178 insertions(+) > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > index 9d5b65922cd2..55c3154fb2b5 100644 > --- a/fs/xfs/xfs_aops.c > +++ b/fs/xfs/xfs_aops.c > @@ -678,6 +678,180 @@ xfs_zoned_writeback_submit( > return 0; > } > > +static bool xfs_agp_match(struct xfs_inode *ip, pgoff_t index, > + xfs_agnumber_t agno) > +{ > + void *ent; > + u32 v; > + bool match = false; Similar coding style comments as mentioned in the previous patches in this series - variable declaration and function prototype style > + > + ent = xa_load(&ip->i_ag_pmap, index); > + if (ent && xa_is_value(ent)) { > + v = xa_to_value(ent); > + if (xfs_agp_valid(v)) > + match = (xfs_agp_agno(v) == (u32)agno); > + } > + > + return match; > +} > + > +static bool xfs_folio_matches_ag(struct folio *folio, xfs_agnumber_t agno) > +{ > + struct xfs_inode *ip = XFS_I(folio_mapping(folio)->host); > + > + return xfs_agp_match(ip, folio->index, agno); > +} > + > +static int xfs_writepages_ag(struct xfs_inode *ip, > + struct writeback_control *wbc, > + xfs_agnumber_t agno) > +{ > + struct inode *inode = VFS_I(ip); > + struct address_space *mapping = inode->i_mapping; > + struct folio_batch *fbatch = &wbc->fbatch; > + int ret = 0; > + pgoff_t index, end; > + Coding style ^^ > + wbc->range_cyclic = 0; > + > + folio_batch_init(fbatch); > + index = wbc->range_start >> PAGE_SHIFT; > + end = wbc->range_end >> PAGE_SHIFT; > + > + struct xfs_writepage_ctx wpc = { > + .ctx = { > + .inode = inode, > + .wbc = wbc, > + .ops = &xfs_writeback_ops, > + }, > + }; Nit: Maybe group all the declarations together? I did get a comment like this in one of my patches. > + > + while (index <= end) { > + int i, nr; Coding style > + > + /* get a batch of DIRTY folios starting at index */ > + nr = filemap_get_folios_tag(mapping, &index, end, > + PAGECACHE_TAG_DIRTY, fbatch); 2 tabs indentation > + if (!nr) > + break; > + > + for (i = 0; i < nr; i++) { > + struct folio *folio = fbatch->folios[i]; > + > + /* Filter BEFORE locking */ > + if (!xfs_folio_matches_ag(folio, agno)) > + continue; > + > + folio_lock(folio); > + > + /* > + * Now it's ours: clear dirty and submit. > + * This prevents *this AG worker* from seeing it again > + * next time. 
> + */ > + if (!folio_clear_dirty_for_io(folio)) { > + folio_unlock(folio); > + continue; > + } > + xa_erase(&ip->i_ag_pmap, folio->index); > + > + ret = iomap_writeback_folio(&wpc.ctx, folio); > + folio_unlock(folio); > + > + if (ret) { > + folio_batch_release(fbatch); > + goto out; > + } > + } > + > + folio_batch_release(fbatch); > + cond_resched(); > + } > + > +out: > + if (wpc.ctx.wb_ctx && wpc.ctx.ops && wpc.ctx.ops->writeback_submit) > + wpc.ctx.ops->writeback_submit(&wpc.ctx, ret); > + > + return ret; > +} > + > +static void xfs_ag_writeback_work(struct work_struct *work) > +{ > + struct xfs_ag_wb *awb = container_of(to_delayed_work(work), > + struct xfs_ag_wb, ag_work); > + struct xfs_ag_wb_task *task; > + struct xfs_mount *mp; > + struct inode *inode; > + struct xfs_inode *ip; > + int ret; Coding style issues for variable declarations > + > + for (;;) { Nit: I am not sure if this is a common practice in XFS to use such infinite for loops. Maybe have some conditional flag variable or something? Darrick, thoughts? > + spin_lock(&awb->lock); > + task = list_first_entry_or_null(&awb->task_list, > + struct xfs_ag_wb_task, list); 2 tabs indentation ^^ > + if (task) > + list_del_init(&task->list); > + spin_unlock(&awb->lock); > + > + if (!task) > + break; > + > + ip = task->ip; > + mp = ip->i_mount; > + inode = VFS_I(ip); > + > + ret = xfs_writepages_ag(ip, &task->wbc, task->agno); > + > + /* If didn't submit everything for this AG, set its bit */ > + if (ret) > + set_bit(task->agno, ip->i_ag_dirty_bitmap); > + > + iput(inode); /* drop igrab */ > + mempool_free(task, mp->m_ag_task_pool); > + } > +} > + > +static int xfs_vm_writepages_offload(struct address_space *mapping, > + struct writeback_control *wbc) 2 tabs indentation ^^ > +{ > + struct inode *inode = mapping->host; > + struct xfs_inode *ip = XFS_I(inode); > + struct xfs_mount *mp = ip->i_mount; > + struct xfs_ag_wb *awb; > + struct xfs_ag_wb_task *task; > + xfs_agnumber_t agno; > + Coding style for variable declaration ^^ > + if (!ip->i_ag_dirty_bits) > + return 0; > + > + for_each_set_bit(agno, ip->i_ag_dirty_bitmap, ip->i_ag_dirty_bits) { > + if (!test_and_clear_bit(agno, ip->i_ag_dirty_bitmap)) > + continue; > + > + task = mempool_alloc(mp->m_ag_task_pool, GFP_NOFS); > + if (!task) { > + set_bit(agno, ip->i_ag_dirty_bitmap); > + continue; > + } > + > + INIT_LIST_HEAD(&task->list); > + task->ip = ip; > + task->agno = agno; > + task->wbc = *wbc; > + igrab(inode); /* worker owns inode ref */ > + > + awb = &mp->m_ag_wb[agno]; > + > + spin_lock(&awb->lock); > + list_add_tail(&task->list, &awb->task_list); > + spin_unlock(&awb->lock); > + > + mod_delayed_work(mp->m_ag_wq, &awb->ag_work, 0); > + } > + > + return 0; It seems that this function always returns zero - maybe in that case use void? 
> +} > + > static const struct iomap_writeback_ops xfs_zoned_writeback_ops = { > .writeback_range = xfs_zoned_writeback_range, > .writeback_submit = xfs_zoned_writeback_submit, > @@ -706,6 +880,7 @@ xfs_init_ag_writeback(struct xfs_mount *mp) > for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) { > struct xfs_ag_wb *awb = &mp->m_ag_wb[agno]; > > + INIT_DELAYED_WORK(&awb->ag_work, xfs_ag_writeback_work); > spin_lock_init(&awb->lock); > INIT_LIST_HEAD(&awb->task_list); > awb->agno = agno; > @@ -769,6 +944,9 @@ xfs_vm_writepages( > xfs_open_zone_put(xc.open_zone); > return error; > } else { > + if (wbc->sync_mode != WB_SYNC_ALL) > + return xfs_vm_writepages_offload(mapping, wbc); > + Since this patch (and the overall patch series) introduces a couple of new data structures and a lot of new functions, can we please have some concise comments at the beginning of the functions or at least the functions introduced in this patch - just for a quicker understanding? --NR > struct xfs_writepage_ctx wpc = { > .ctx = { > .inode = mapping->host, ^ permalink raw reply [flat|nested] 48+ messages in thread
* [syzbot ci] Re: AG aware parallel writeback for XFS 2026-01-16 10:08 ` [PATCH v3 0/6] AG aware parallel writeback for XFS Kundan Kumar ` (5 preceding siblings ...) [not found] ` <CGME20260116101305epcas5p497cd6d9027301853669f1c1aaffbf128@epcas5p4.samsung.com> @ 2026-01-16 16:13 ` syzbot ci 2026-01-21 19:54 ` [PATCH v3 0/6] " Brian Foster 7 siblings, 0 replies; 48+ messages in thread From: syzbot ci @ 2026-01-16 16:13 UTC (permalink / raw) To: amir73il, anuj20.g, axboe, brauner, cem, clm, dave, david, djwong, gost.dev, hch, jack, joshi.k, kundan.kumar, linux-fsdevel, linux-mm, linux-xfs, mcgrof, ritesh.list, viro, vishak.g, wangyufei, willy Cc: syzbot, syzkaller-bugs syzbot ci has tested the following series [v3] AG aware parallel writeback for XFS https://lore.kernel.org/all/20260116100818.7576-1-kundan.kumar@samsung.com * [PATCH v3 1/6] iomap: add write ops hook to attach metadata to folios * [PATCH v3 2/6] xfs: add helpers to pack AG prediction info for per-folio tracking * [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap * [PATCH v3 4/6] xfs: tag folios with AG number during buffered write via iomap attach hook * [PATCH v3 5/6] xfs: add per-AG writeback workqueue infrastructure * [PATCH v3 6/6] xfs: offload writeback by AG using per-inode dirty bitmap and per-AG workers and found the following issue: WARNING in xfs_init_ag_writeback Full report is available here: https://ci.syzbot.org/series/0e34236b-3594-400f-b9cd-6f59b196014f *** WARNING in xfs_init_ag_writeback tree: mm-new URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git base: eeb33083cc4749bdb61582eaeb5c200702607703 arch: amd64 compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8 config: https://ci.syzbot.org/builds/f21aae46-4d21-4410-9132-190ad8ae3994/config C repro: https://ci.syzbot.org/findings/24ec6312-5e15-4750-aa2a-08d381daca26/c_repro syz repro: https://ci.syzbot.org/findings/24ec6312-5e15-4750-aa2a-08d381daca26/syz_repro loop2: detected capacity change from 0 to 32768 ------------[ cut here ]------------ kmem_cache of name 'xfs_ag_wb_task' already exists WARNING: mm/slab_common.c:110 at kmem_cache_sanity_check mm/slab_common.c:109 [inline], CPU#1: syz.2.19/6098 WARNING: mm/slab_common.c:110 at __kmem_cache_create_args+0x99/0x310 mm/slab_common.c:310, CPU#1: syz.2.19/6098 Modules linked in: CPU: 1 UID: 0 PID: 6098 Comm: syz.2.19 Not tainted syzkaller #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 RIP: 0010:kmem_cache_sanity_check mm/slab_common.c:109 [inline] RIP: 0010:__kmem_cache_create_args+0x9c/0x310 mm/slab_common.c:310 Code: 43 8e 4d 8b 24 24 49 81 fc b8 78 43 8e 74 20 49 8b 7c 24 f8 48 89 de e8 f2 c0 67 09 85 c0 75 e2 48 8d 3d f7 ca bc 0d 48 89 de <67> 48 0f b9 3a 48 89 df be 20 00 00 00 e8 92 c2 67 09 48 85 c0 0f RSP: 0018:ffffc90003817a08 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffffff8bc734a0 RCX: 0000000000000007 RDX: 000000008bc73406 RSI: ffffffff8bc734a0 RDI: ffffffff8fc6ee40 RBP: 0000000000080000 R08: ffffffff8fc26e77 R09: 1ffffffff1f84dce R10: dffffc0000000000 R11: fffffbfff1f84dcf R12: ffff88810b7a7300 R13: ffff888115440000 R14: ffffc90003817aa0 R15: 0000000000000190 FS: 0000555567cf7500(0000) GS:ffff8882a9a05000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f049ffff000 CR3: 00000001b9726000 CR4: 00000000000006f0 Call Trace: <TASK> __kmem_cache_create include/linux/slab.h:384 [inline] 
xfs_init_ag_writeback+0x41b/0x570 fs/xfs/xfs_aops.c:890 xfs_fs_fill_super+0x7e3/0x1640 fs/xfs/xfs_super.c:1759 get_tree_bdev_flags+0x431/0x4f0 fs/super.c:1691 vfs_get_tree+0x92/0x2a0 fs/super.c:1751 fc_mount fs/namespace.c:1199 [inline] do_new_mount_fc fs/namespace.c:3636 [inline] do_new_mount+0x329/0xa50 fs/namespace.c:3712 do_mount fs/namespace.c:4035 [inline] __do_sys_mount fs/namespace.c:4224 [inline] __se_sys_mount+0x31d/0x420 fs/namespace.c:4201 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f32c139bf4a Code: 48 c7 c2 e8 ff ff ff f7 d8 64 89 02 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffd30494b68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 00007ffd30494bf0 RCX: 00007f32c139bf4a RDX: 0000200000009740 RSI: 0000200000009780 RDI: 00007ffd30494bb0 RBP: 0000200000009740 R08: 00007ffd30494bf0 R09: 0000000000004010 R10: 0000000000004010 R11: 0000000000000246 R12: 0000200000009780 R13: 00007ffd30494bb0 R14: 0000000000009768 R15: 0000200000000300 </TASK> ---------------- Code disassembly (best guess): 0: 43 8e 4d 8b rex.XB mov -0x75(%r13),%cs 4: 24 24 and $0x24,%al 6: 49 81 fc b8 78 43 8e cmp $0xffffffff8e4378b8,%r12 d: 74 20 je 0x2f f: 49 8b 7c 24 f8 mov -0x8(%r12),%rdi 14: 48 89 de mov %rbx,%rsi 17: e8 f2 c0 67 09 call 0x967c10e 1c: 85 c0 test %eax,%eax 1e: 75 e2 jne 0x2 20: 48 8d 3d f7 ca bc 0d lea 0xdbccaf7(%rip),%rdi # 0xdbccb1e 27: 48 89 de mov %rbx,%rsi * 2a: 67 48 0f b9 3a ud1 (%edx),%rdi <-- trapping instruction 2f: 48 89 df mov %rbx,%rdi 32: be 20 00 00 00 mov $0x20,%esi 37: e8 92 c2 67 09 call 0x967c2ce 3c: 48 85 c0 test %rax,%rax 3f: 0f .byte 0xf *** If these findings have caused you to resend the series or submit a separate fix, please add the following tag to your commit message: Tested-by: syzbot@syzkaller.appspotmail.com --- This report is generated by a bot. It may contain errors. syzbot ci engineers can be reached at syzkaller@googlegroups.com. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 0/6] AG aware parallel writeback for XFS 2026-01-16 10:08 ` [PATCH v3 0/6] AG aware parallel writeback for XFS Kundan Kumar ` (6 preceding siblings ...) 2026-01-16 16:13 ` [syzbot ci] Re: AG aware parallel writeback for XFS syzbot ci @ 2026-01-21 19:54 ` Brian Foster 2026-01-22 16:15 ` Kundan Kumar 7 siblings, 1 reply; 48+ messages in thread From: Brian Foster @ 2026-01-21 19:54 UTC (permalink / raw) To: Kundan Kumar Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, Jan 16, 2026 at 03:38:12PM +0530, Kundan Kumar wrote: > This series explores AG aware parallel writeback for XFS. The goal is > to reduce writeback contention and improve scalability by allowing > writeback to be distributed across allocation groups (AGs). > > Problem statement > ================= > Today, XFS writeback walks the page cache serially per inode and funnels > all writeback through a single writeback context. For aging filesystems, > especially with high parallel buffered IO this leads to limited > concurrency across independent AGs. > > The filesystem already has strong AG level parallelism for allocation and > metadata operations, but writeback remains largely AG agnostic. > > High-level approach > =================== > This series introduces an AG aware writeback with following model: > 1) Predict the target AG for buffered writes (mapped or delalloc) at write > time. > 2) Tag AG hints per folio (via lightweight metadata / xarray). > 3) Track dirty AGs per inode using bitmap. > 4) Offload writeback to per AG worker threads, each performing a onepass > scan. > 5) Workers filter folios and submit folios which are tagged for its AG. > > Unlike our earlier approach that parallelized writeback by introducing > multiple writeback contexts per BDI, this series keeps all changes within > XFS and is orthogonal to that work. The AG aware mechanism uses per folio > AG hints to route writeback to AG specific workers, and therefore applies > even when a single inode’s data spans multiple AGs. This avoids the > earlier limitation of relying on inode-based AG locality, which can break > down on aged/fragmented filesystems. > > IOPS and throughput > =================== > We see significant improvemnt in IOPS if files span across multiple AG > > Workload 12 files each of 500M in 12 directories(AGs) - numjobs = 12 > - NVMe device Intel Optane > Base XFS : 308 MiB/s > Parallel Writeback XFS : 1534 MiB/s (+398%) > > Workload 6 files each of 6G in 6 directories(AGs) - numjobs = 12 > - NVMe device Intel Optane > Base XFS : 409 MiB/s > Parallel Writeback XFS : 1245 MiB/s (+204%) > Hi Kundan, Could you provide more detail on how you're testing here? I threw this at some beefier storage I have around out of curiosity and I'm not seeing much of a difference. It could be I'm missing some details or maybe the storage outweighs the processing benefit. But for example, is this a fio test command being used? Is there preallocation? What type of storage? Is a particular fs geometry being targeted for this optimization (i.e. smaller AGs), etc.? FWIW, I skimmed through the code a bit and the main thing that kind of stands out to me is the write time per-folio hinting. Writeback handling for the overwrite (i.e. non-delalloc) case is basically a single lookup per mapping under shared inode lock. 
The question that comes to mind there is what is the value of per-ag batching as opposed to just adding generic concurrency? It seems unnecessary to me to take care to shuffle overwrites into per-ag based workers when the underlying locking is already shared. WRT delalloc, it looks like we're basically taking the inode AG as the starting point and guessing based on the on-disk AGF free blocks counter at the time of the write. The delalloc accounting doesn't count against the AGF, however, so ISTM that in many cases this would just effectively land on the inode AG for larger delalloc writes. Is that not the case? Once we get to delalloc writeback, we're under exclusive inode lock and fall into the block allocator. The latter trylock iterates the AGs looking for a good candidate. So what's the advantage of per-ag splitting delalloc at writeback time if we're sending the same inode to per-ag workers that all 1. require exclusive inode lock and 2. call into an allocator that is designed to be scalable (i.e. if one AG is locked it will just move to the next)? Yet another consideration is how delalloc conversion works at the xfs_bmapi_convert_delalloc() -> xfs_bmapi_convert_one_delalloc() level. If you take a look at the latter, we look up the entire delalloc extent backing the folio under writeback and attempt to allocate it all at once (not just the blocks backing the folio). So in theory if we were to end up tagging a sequence of contiguous delalloc backed folios at buffered write time with different AGs, we're still going to try to allocate all of that in one AG at writeback time. So the per-ag hinting also sort of competes with this by shuffling writeback of the same potential extent into different workers, making it a little hard to try and reason about. So stepping back it kind of feels to me like the write time hinting has so much potential for inaccuracy and unpredictability of writeback time behavior (for the delalloc case) that it makes me wonder if we're effectively just enabling arbitrary concurrency at writeback time and perhaps seeing benefit from that. If so, that makes me wonder if the associated value can be gained by somehow simplifying this to not require write time hinting at all. Have you run any experiments that perhaps rotor inodes to the individual wb workers based on the inode AG (i.e. basically ignoring all the write time stuff) by chance? Or anything that otherwise helps quantify the value of per-ag batching over just basic concurrency? I'd be interested to see if/how behavior changes with something like that. Brian
> > Kundan Kumar (6): > iomap: add write ops hook to attach metadata to folios > xfs: add helpers to pack AG prediction info for per-folio tracking > xfs: add per-inode AG prediction map and dirty-AG bitmap > xfs: tag folios with AG number during buffered write via iomap attach > hook > xfs: add per-AG writeback workqueue infrastructure > xfs: offload writeback by AG using per-inode dirty bitmap and per-AG > workers > > fs/iomap/buffered-io.c | 3 + > fs/xfs/xfs_aops.c | 257 +++++++++++++++++++++++++++++++++++++++++ > fs/xfs/xfs_aops.h | 3 + > fs/xfs/xfs_icache.c | 27 +++++ > fs/xfs/xfs_inode.h | 5 + > fs/xfs/xfs_iomap.c | 114 ++++++++++++++++++ > fs/xfs/xfs_iomap.h | 31 +++++ > fs/xfs/xfs_mount.c | 2 + > fs/xfs/xfs_mount.h | 10 ++ > fs/xfs/xfs_super.c | 2 + > include/linux/iomap.h | 3 + > 11 files changed, 457 insertions(+) > > -- > 2.25.1 > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 0/6] AG aware parallel writeback for XFS 2026-01-21 19:54 ` [PATCH v3 0/6] " Brian Foster @ 2026-01-22 16:15 ` Kundan Kumar 2026-01-23 9:36 ` Pankaj Raghav (Samsung) 2026-01-23 13:26 ` Brian Foster 0 siblings, 2 replies; 48+ messages in thread From: Kundan Kumar @ 2026-01-22 16:15 UTC (permalink / raw) To: Brian Foster Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k > > Could you provide more detail on how you're testing here? I threw this > at some beefier storage I have around out of curiosity and I'm not > seeing much of a difference. It could be I'm missing some details or > maybe the storage outweighs the processing benefit. But for example, is > this a fio test command being used? Is there preallocation? What type of > storage? Is a particular fs geometry being targeted for this > optimization (i.e. smaller AGs), etc.? Thanks, Brian, for the detailed review and for taking the time to look through the code. The numbers quoted were from fio buffered write workloads on NVMe (Optane devices), using multiple files placed in different directories mapping to different AGs. The workload was buffered randwrite, run as multiple parallel jobs. This can be tested either with fallocate=none or with preallocation. Sample script for 12 jobs to 12 directories (AGs): mkfs.xfs -f -d agcount=12 /dev/nvme0n1 mount /dev/nvme0n1 /mnt sync echo 3 > /proc/sys/vm/drop_caches for i in {1..12}; do mkdir -p /mnt/dir$i done fio job.fio umount /mnt echo 3 > /proc/sys/vm/drop_caches The job file: [global] bs=4k iodepth=32 rw=randwrite ioengine=io_uring fallocate=none nrfiles=12 numjobs=1 size=6G direct=0 group_reporting=1 create_on_open=1 name=test [job1] directory=/mnt/dir1 [job2] directory=/mnt/dir2 ... ... [job12] directory=/mnt/dir12 > > FWIW, I skimmed through the code a bit and the main thing that kind of > stands out to me is the write time per-folio hinting. Writeback handling > for the overwrite (i.e. non-delalloc) case is basically a single lookup > per mapping under shared inode lock. The question that comes to mind > there is what is the value of per-ag batching as opposed to just adding > generic concurrency? It seems unnecessary to me to take care to shuffle > overwrites into per-ag based workers when the underlying locking is > already shared. > That’s a fair point. For the overwrite (non-delalloc) case, the per-folio AG hinting is not meant to change allocation behavior, and I agree the underlying inode locking remains shared. The primary value I’m seeing there is the ability to partition writeback iteration and submission when dirty data spans multiple AGs. I will try routing overwrite writeback to workers irrespective of AG (e.g. hash/inode based) to compare generic concurrency against AG batching. > WRT delalloc, it looks like we're basically taking the inode AG as the > starting point and guessing based on the on-disk AGF free blocks counter > at the time of the write. The delalloc accounting doesn't count against > the AGF, however, so ISTM that in many cases this would just effectively > land on the inode AG for larger delalloc writes. Is that not the case? > > Once we get to delalloc writeback, we're under exclusive inode lock and > fall into the block allocator. The latter trylock iterates the AGs > looking for a good candidate.
So what's the advantage of per-ag > splitting delalloc at writeback time if we're sending the same inode to > per-ag workers that all 1. require exclusive inode lock and 2. call into > an allocator that is designed to be scalable (i.e. if one AG is locked > it will just move to the next)? > The intent of per-AG splitting is not to parallelize allocation within a single inode or override allocator behavior, but to partition writeback scheduling so that inodes associated with different AGs are routed to different workers. This implicitly distributes inodes across AG workers, even though each inode’s delalloc conversion remains serialized. > Yet another consideration is how delalloc conversion works at the > xfs_bmapi_convert_delalloc() -> xfs_bmapi_convert_one_delalloc() level. > If you take a look at the latter, we look up the entire delalloc extent > backing the folio under writeback and attempt to allocate it all at once > (not just the blocks backing the folio). So in theory if we were to end > up tagging a sequence of contiguous delalloc backed folios at buffered > write time with different AGs, we're still going to try to allocate all > of that in one AG at writeback time. So the per-ag hinting also sort of > competes with this by shuffling writeback of the same potential extent > into different workers, making it a little hard to try and reason about. > Agreed: delalloc conversion happens at extent granularity, so per-folio AG hints are not meant to steer final allocation. In this series the hints are used purely as writeback scheduling tokens; allocation still occurs once per extent under XFS_ILOCK_EXCL using existing allocator logic. The goal is to partition writeback work and avoid funneling multiple inodes through a single writeback path, not to influence extent placement. > So stepping back it kind of feels to me like the write time hinting has > so much potential for inaccuracy and unpredictability of writeback time > behavior (for the delalloc case), that it makes me wonder if we're > effectively just enabling arbitrary concurrency at writeback time and > perhaps seeing benefit from that. If so, that makes me wonder if the > associated value can be gained by somehow simplifying this to not > require write time hinting at all. > > Have you run any experiments that perhaps rotor inodes to the > individual wb workers based on the inode AG (i.e. basically ignoring all > the write time stuff) by chance? Or anything that otherwise helps > quantify the value of per-ag batching over just basic concurrency? I'd > be interested to see if/how behavior changes with something like that. Yes, inode-AG based routing has been explored as part of earlier higher-level writeback work (link below), where inodes are affined to writeback contexts based on inode AG. That effectively provides generic concurrency and serves as a useful baseline. https://lore.kernel.org/all/20251014120845.2361-1-kundan.kumar@samsung.com/ The motivation for this series is the complementary case where a single inode’s dirty data spans multiple AGs on aged/fragmented filesystems, where inode-AG affinity breaks down. The folio-level AG hinting here is intended to explore whether finer-grained partitioning provides additional benefit beyond inode-based routing. ^ permalink raw reply [flat|nested] 48+ messages in thread
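A minimal sketch of the two routing policies compared in the message above, purely illustrative: hash_64() and XFS_INO_TO_AGNO() are existing kernel/XFS helpers, while nr_wb_workers and both functions are hypothetical names for this sketch.

/*
 * Illustrative only: pick a writeback worker for an inode either by
 * hashing its inode number (generic concurrency) or by deriving it
 * from the inode's home AG (AG batching).
 */
static unsigned int
xfs_wb_worker_by_hash(struct xfs_inode *ip, unsigned int nr_wb_workers)
{
	/* spread inodes uniformly across workers, ignoring AG locality */
	return hash_64(ip->i_ino, 32) % nr_wb_workers;
}

static unsigned int
xfs_wb_worker_by_agno(struct xfs_mount *mp, struct xfs_inode *ip,
		      unsigned int nr_wb_workers)
{
	/* affine the inode to a worker derived from its home AG */
	return XFS_INO_TO_AGNO(mp, ip->i_ino) % nr_wb_workers;
}

If the two perform comparably for the overwrite case, that would point to the benefit coming from concurrency itself rather than from AG batching.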
* Re: [PATCH v3 0/6] AG aware parallel writeback for XFS 2026-01-22 16:15 ` Kundan Kumar @ 2026-01-23 9:36 ` Pankaj Raghav (Samsung) 2026-01-23 13:26 ` Brian Foster 0 siblings, 0 replies; 48+ messages in thread From: Pankaj Raghav (Samsung) @ 2026-01-23 9:36 UTC (permalink / raw) To: Kundan Kumar Cc: Brian Foster, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k > > So stepping back it kind of feels to me like the write time hinting has > > so much potential for inaccuracy and unpredictability of writeback time > > behavior (for the delalloc case), that it makes me wonder if we're > > effectively just enabling arbitrary concurrency at writeback time and > > perhaps seeing benefit from that. If so, that makes me wonder if the > > associated value can be gained by somehow simplifying this to not > > require write time hinting at all. > > > > Have you run any experiments that perhaps rotor inodes to the > > individual wb workers based on the inode AG (i.e. basically ignoring all > > the write time stuff) by chance? Or anything that otherwise helps > > quantify the value of per-ag batching over just basic concurrency? I'd > > be interested to see if/how behavior changes with something like that. > > Yes, inode-AG based routing has been explored as part of earlier > higher-level writeback work (link below), where inodes are affined to > writeback contexts based on inode AG. That effectively provides > generic concurrency and serves as a useful baseline. > https://lore.kernel.org/all/20251014120845.2361-1-kundan.kumar@samsung.com/ > > The motivation for this series is the complementary case where a > single inode’s dirty data spans multiple AGs on aged/fragmented > filesystems, where inode-AG affinity breaks down. The folio-level AG > hinting here is intended to explore whether finer-grained partitioning > provides additional benefit beyond inode-based routing. > One thing that would be useful to see in the cover letter is how these patches behave on an aged XFS filesystem, as that was the main reason you moved to a different approach. And given that you have completely changed the design, it would have been better to post this series as an RFC as well. -- Pankaj ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 0/6] AG aware parallel writeback for XFS 2026-01-22 16:15 ` Kundan Kumar 2026-01-23 9:36 ` Pankaj Raghav (Samsung) @ 2026-01-23 13:26 ` Brian Foster 2026-01-28 18:28 ` Kundan Kumar 1 sibling, 1 reply; 48+ messages in thread From: Brian Foster @ 2026-01-23 13:26 UTC (permalink / raw) To: Kundan Kumar Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Thu, Jan 22, 2026 at 09:45:05PM +0530, Kundan Kumar wrote: > > > > Could you provide more detail on how you're testing here? I threw this > > at some beefier storage I have around out of curiosity and I'm not > > seeing much of a difference. It could be I'm missing some details or > > maybe the storage outweighs the processing benefit. But for example, is > > this a fio test command being used? Is there preallocation? What type of > > storage? Is a particular fs geometry being targeted for this > > optimization (i.e. smaller AGs), etc.? > > Thanks Brian for the detailed review and for taking the time to > look through the code. > > The numbers quoted were from fio buffered write workloads on NVMe > (Optane devices), using multiple files placed in different > directories mapping to different AGs. The jobs were buffered > randwrite, run in parallel. This can be tested either with > fallocate=none or with preallocation. > > Sample script for 12 jobs to 12 directories (AGs): > > mkfs.xfs -f -d agcount=12 /dev/nvme0n1 > mount /dev/nvme0n1 /mnt > sync > echo 3 > /proc/sys/vm/drop_caches > > for i in {1..12}; do > mkdir -p /mnt/dir$i > done > > fio job.fio > > umount /mnt > echo 3 > /proc/sys/vm/drop_caches > > The job file: > > [global] > bs=4k > iodepth=32 > rw=randwrite > ioengine=io_uring > fallocate=none > nrfiles=12 > numjobs=1 > size=6G > direct=0 > group_reporting=1 > create_on_open=1 > name=test > > [job1] > directory=/mnt/dir1 > > [job2] > directory=/mnt/dir2 > ... > ... > [job12] > directory=/mnt/dir12 > Thanks.. > > > > FWIW, I skimmed through the code a bit and the main thing that kind of > > stands out to me is the write time per-folio hinting. Writeback handling > > for the overwrite (i.e. non-delalloc) case is basically a single lookup > > per mapping under shared inode lock. The question that comes to mind > > there is what is the value of per-ag batching as opposed to just adding > > generic concurrency? It seems unnecessary to me to take care to shuffle > > overwrites into per-ag based workers when the underlying locking is > > already shared. > > > > That’s a fair point. For the overwrite (non-delalloc) case, the > per-folio AG hinting is not meant to change allocation behavior, and > I agree the underlying inode locking remains shared. The primary value > I’m seeing there is the ability to partition writeback iteration and > submission when dirty data spans multiple AGs. > I will try routing overwrite writeback to workers irrespective of AG > (e.g. hash/inode based), to compare generic concurrency against AG > batching. > > > WRT delalloc, it looks like we're basically taking the inode AG as the > > starting point and guessing based on the on-disk AGF free blocks counter > > at the time of the write. The delalloc accounting doesn't count against > > the AGF, however, so ISTM that in many cases this would just effectively > > land on the inode AG for larger delalloc writes. Is that not the case?
> > > > Once we get to delalloc writeback, we're under exclusive inode lock and > > fall into the block allocator. The latter trylock iterates the AGs > > looking for a good candidate. So what's the advantage of per-ag > > splitting delalloc at writeback time if we're sending the same inode to > > per-ag workers that all 1. require exclusive inode lock and 2. call into > > an allocator that is designed to be scalable (i.e. if one AG is locked > > it will just move to the next)? > > > > The intent of per-AG splitting is not to parallelize allocation > within a single inode or override allocator behavior, but to > partition writeback scheduling so that inodes associated with > different AGs are routed to different workers. This implicitly > distributes inodes across AG workers, even though each inode’s > delalloc conversion remains serialized. > > > Yet another consideration is how delalloc conversion works at the > > xfs_bmapi_convert_delalloc() -> xfs_bmapi_convert_one_delalloc() level. > > If you take a look at the latter, we look up the entire delalloc extent > > backing the folio under writeback and attempt to allocate it all at once > > (not just the blocks backing the folio). So in theory if we were to end > > up tagging a sequence of contiguous delalloc backed folios at buffered > > write time with different AGs, we're still going to try to allocate all > > of that in one AG at writeback time. So the per-ag hinting also sort of > > competes with this by shuffling writeback of the same potential extent > > into different workers, making it a little hard to try and reason about. > > > > Agreed: delalloc conversion happens at extent granularity, so > per-folio AG hints are not meant to steer final allocation. In this > series the hints are used purely as writeback scheduling tokens; > allocation still occurs once per extent under XFS_ILOCK_EXCL using > existing allocator logic. The goal is to partition writeback work and > avoid funneling multiple inodes through a single writeback path, not > to influence extent placement. > > Yeah.. I realize none of this is really intended to drive allocation behavior. The observation that all this per-folio tracking ultimately boils down to either sharding based on information we have at writeback time (i.e. overwrites) or effectively batching based on on-disk AG state at the time of the write is kind of what suggests that the folio granular hinting is potentially overkill. > > So stepping back it kind of feels to me like the write time hinting has > > so much potential for inaccuracy and unpredictability of writeback time > > behavior (for the delalloc case), that it makes me wonder if we're > > effectively just enabling arbitrary concurrency at writeback time and > > perhaps seeing benefit from that. If so, that makes me wonder if the > > associated value can be gained by somehow simplifying this to not > > require write time hinting at all. > > > > Have you run any experiments that perhaps rotor inodes to the > > individual wb workers based on the inode AG (i.e. basically ignoring all > > the write time stuff) by chance? Or anything that otherwise helps > > quantify the value of per-ag batching over just basic concurrency? I'd > > be interested to see if/how behavior changes with something like that. > > Yes, inode-AG based routing has been explored as part of earlier > higher-level writeback work (link below), where inodes are affined to > writeback contexts based on inode AG.
That effectively provides > generic concurrency and serves as a useful baseline. > https://lore.kernel.org/all/20251014120845.2361-1-kundan.kumar@samsung.com/ > Ah, I recall seeing that. A couple questions.. That link states the following: "For XFS, affining inodes to writeback threads resulted in a decline in IOPS for certain devices. The issue was caused by AG lock contention in xfs_end_io, where multiple writeback threads competed for the same AG lock." Can you quantify that? It seems like xfs_end_io() mostly cares about things like unwritten conversion, COW remapping, etc., so block allocation shouldn't be prominent. Is this producing something where frequent unwritten conversion results in a lot of bmapbt splits or something? Also, how safe is it to break off writeback tasks at the XFS layer like this? For example, is it safe to spread around the wbc to a bunch of tasks like this? What about serialization for things like bandwidth accounting and whatnot in the core/calling code? Should the code that splits off wq tasks in XFS be responsible to wait for parallel submission completion before returning (I didn't see anything doing that on a scan, but could have missed it)..? > The motivation for this series is the complementary case where a > single inode’s dirty data spans multiple AGs on aged/fragmented > filesystems, where inode-AG affinity breaks down. The folio-level AG > hinting here is intended to explore whether finer-grained partitioning > provides additional benefit beyond inode-based routing. > But also I'm not sure I follow the high level goal here. I have the same question as Pankaj in that regard.. is this series intended to replace the previous bdi level approach, or go along with it somehow? Doing something at the bdi level seems like a more natural approach in general, so I'm curious why the change in direction. Brian ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 0/6] AG aware parallel writeback for XFS 2026-01-23 13:26 ` Brian Foster @ 2026-01-28 18:28 ` Kundan Kumar 2026-02-06 6:25 ` Christoph Hellwig 0 siblings, 1 reply; 48+ messages in thread From: Kundan Kumar @ 2026-01-28 18:28 UTC (permalink / raw) To: Brian Foster Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On 1/23/2026 6:56 PM, Brian Foster wrote: > On Thu, Jan 22, 2026 at 09:45:05PM +0530, Kundan Kumar wrote: >> > >> > Could you provide more detail on how you're testing here? I threw this >> > at some beefier storage I have around out of curiosity and I'm not >> > seeing much of a difference. It could be I'm missing some details or >> > maybe the storage outweighs the processing benefit. But for example, is >> > this a fio test command being used? Is there preallocation? What type of >> > storage? Is a particular fs geometry being targeted for this >> > optimization (i.e. smaller AGs), etc.? >> >> Thanks Brian for the detailed review and for taking the time to >> look through the code. >> >> The numbers quoted were from fio buffered write workloads on NVMe >> (Optane devices), using multiple files placed in different >> directories mapping to different AGs. The jobs were buffered >> randwrite, run in parallel. This can be tested either with >> fallocate=none or with preallocation. >> >> Sample script for 12 jobs to 12 directories (AGs): >> >> mkfs.xfs -f -d agcount=12 /dev/nvme0n1 >> mount /dev/nvme0n1 /mnt >> sync >> echo 3 > /proc/sys/vm/drop_caches >> >> for i in {1..12}; do >> mkdir -p /mnt/dir$i >> done >> >> fio job.fio >> >> umount /mnt >> echo 3 > /proc/sys/vm/drop_caches >> >> The job file: >> >> [global] >> bs=4k >> iodepth=32 >> rw=randwrite >> ioengine=io_uring >> fallocate=none >> nrfiles=12 >> numjobs=1 >> size=6G >> direct=0 >> group_reporting=1 >> create_on_open=1 >> name=test >> >> [job1] >> directory=/mnt/dir1 >> >> [job2] >> directory=/mnt/dir2 >> ... >> ... >> [job12] >> directory=/mnt/dir12 >> > > Thanks.. > >>> >>> FWIW, I skimmed through the code a bit and the main thing that kind of >>> stands out to me is the write time per-folio hinting. Writeback handling >>> for the overwrite (i.e. non-delalloc) case is basically a single lookup >>> per mapping under shared inode lock. The question that comes to mind >>> there is what is the value of per-ag batching as opposed to just adding >>> generic concurrency? It seems unnecessary to me to take care to shuffle >>> overwrites into per-ag based workers when the underlying locking is >>> already shared. >>> >> >> That’s a fair point. For the overwrite (non-delalloc) case, the >> per-folio AG hinting is not meant to change allocation behavior, and >> I agree the underlying inode locking remains shared. The primary value >> I’m seeing there is the ability to partition writeback iteration and >> submission when dirty data spans multiple AGs. >> I will try routing overwrite writeback to workers irrespective of AG >> (e.g. hash/inode based), to compare generic concurrency against AG >> batching. >> >>> WRT delalloc, it looks like we're basically taking the inode AG as the >>> starting point and guessing based on the on-disk AGF free blocks counter >>> at the time of the write. The delalloc accounting doesn't count against >>> the AGF, however, so ISTM that in many cases this would just effectively >>> land on the inode AG for larger delalloc writes. Is that not the case?
>>> >>> Once we get to delalloc writeback, we're under exclusive inode lock and >>> fall into the block allocator. The latter trylock iterates the AGs >>> looking for a good candidate. So what's the advantage of per-ag >>> splitting delalloc at writeback time if we're sending the same inode to >>> per-ag workers that all 1. require exclusive inode lock and 2. call into >>> an allocator that is designed to be scalable (i.e. if one AG is locked >>> it will just move to the next)? >>> >> >> The intent of per-AG splitting is not to parallelize allocation >> within a single inode or override allocator behavior, but to >> partition writeback scheduling so that inodes associated with >> different AGs are routed to different workers. This implicitly >> distributes inodes across AG workers, even though each inode’s >> delalloc conversion remains serialized. >> >>> Yet another consideration is how delalloc conversion works at the >>> xfs_bmapi_convert_delalloc() -> xfs_bmapi_convert_one_delalloc() level. >>> If you take a look at the latter, we look up the entire delalloc extent >>> backing the folio under writeback and attempt to allocate it all at once >>> (not just the blocks backing the folio). So in theory if we were to end >>> up tagging a sequence of contiguous delalloc backed folios at buffered >>> write time with different AGs, we're still going to try to allocate all >>> of that in one AG at writeback time. So the per-ag hinting also sort of >>> competes with this by shuffling writeback of the same potential extent >>> into different workers, making it a little hard to try and reason about. >>> >> >> Agreed: delalloc conversion happens at extent granularity, so >> per-folio AG hints are not meant to steer final allocation. In this >> series the hints are used purely as writeback scheduling tokens; >> allocation still occurs once per extent under XFS_ILOCK_EXCL using >> existing allocator logic. The goal is to partition writeback work and >> avoid funneling multiple inodes through a single writeback path, not >> to influence extent placement. >> > > Yeah.. I realize none of this is really intended to drive allocation > behavior. The observation that all this per-folio tracking ultimately > boils down to either sharding based on information we have at writeback > time (i.e. overwrites) or effectively batching based on on-disk AG state > at the time of the write is kind of what suggests that the folio > granular hinting is potentially overkill. > >>> So stepping back it kind of feels to me like the write time hinting has >>> so much potential for inaccuracy and unpredictability of writeback time >>> behavior (for the delalloc case), that it makes me wonder if we're >>> effectively just enabling arbitrary concurrency at writeback time and >>> perhaps seeing benefit from that. If so, that makes me wonder if the >>> associated value can be gained by somehow simplifying this to not >>> require write time hinting at all. >>> >>> Have you run any experiments that perhaps rotor inodes to the >>> individual wb workers based on the inode AG (i.e. basically ignoring all >>> the write time stuff) by chance? Or anything that otherwise helps >>> quantify the value of per-ag batching over just basic concurrency? I'd >>> be interested to see if/how behavior changes with something like that. >> >> Yes, inode-AG based routing has been explored as part of earlier >> higher-level writeback work (link below), where inodes are affined to >> writeback contexts based on inode AG.
That effectively provides >> generic concurrency and serves as a useful baseline. >> https://lore.kernel.org/all/20251014120845.2361-1-kundan.kumar@samsung.com/ >> > > Ah, I recall seeing that. A couple questions.. > > That link states the following: > > "For XFS, affining inodes to writeback threads resulted in a decline > in IOPS for certain devices. The issue was caused by AG lock contention > in xfs_end_io, where multiple writeback threads competed for the same > AG lock." > > Can you quantify that? It seems like xfs_end_io() mostly cares about > things like unwritten conversion, COW remapping, etc., so block > allocation shouldn't be prominent. Is this producing something where > frequent unwritten conversion results in a lot of bmapbt splits or > something? > I captured stacks from the contending completion workers and the hotspot was in the unwritten conversion path (xfs_end_io() -> xfs_iomap_write_unwritten()). We were repeatedly contending on the AGF buffer lock via xfs_alloc_fix_freelist() / xfs_alloc_read_agf() when writeback threads were affined per-inode. This contention went away once writeback was distributed across AG-based workers, pointing to reduced AGF hotspotting during unwritten conversion (rmap/btree updates and freelist fixes), rather than block allocation in the write path itself. > Also, how safe is it to break off writeback tasks at the XFS layer > like this? For example, is it safe to spread around the wbc to a bunch > of tasks like this? What about serialization for things like bandwidth > accounting and whatnot in the core/calling code? Should the code that > splits off wq tasks in XFS be responsible to wait for parallel > submission completion before returning (I didn't see anything doing that > on a scan, but could have missed it)..? > You are right that core writeback accounting assumes serialized updates. The current series copies wbc per worker to avoid concurrent mutation, but that is not sufficient for strict global accounting semantics. For this series we only offload the async path (wbc->sync_mode != WB_SYNC_ALL), so we do not wait for worker completion before returning from ->writepages(). Sync writeback continues down the existing iomap_writepages path. >> The motivation for this series is the complementary case where a >> single inode’s dirty data spans multiple AGs on aged/fragmented >> filesystems, where inode-AG affinity breaks down. The folio-level AG >> hinting here is intended to explore whether finer-grained partitioning >> provides additional benefit beyond inode-based routing. >> > > But also I'm not sure I follow the high level goal here. I have the same > question as Pankaj in that regard.. is this series intended to replace > the previous bdi level approach, or go along with it somehow? Doing > something at the bdi level seems like a more natural approach in > general, so I'm curious why the change in direction. > > Brian > This series is intended to replace the earlier BDI-level approach for XFS, not to go alongside it. While BDI-level sharding is the more natural generic mechanism, we saw XFS regressions on some setups when inodes were affined to wb threads due to completion-side AG contention. The goal here is to make concurrency an XFS policy decision by routing writeback using AG-aware folio tags, so we avoid inode-affinity hotspots and handle cases where a single inode spans multiple AGs on aged or fragmented filesystems.
If this approach does not hold up across workloads and devices, we can fall back to the generic BDI sharding model. - Kundan ^ permalink raw reply [flat|nested] 48+ messages in thread
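For reference, a minimal sketch of the async-only split described in the reply above; wbc->sync_mode and WB_SYNC_ALL are existing kernel interfaces, while the two xfs_* helpers are hypothetical stand-ins for what the series implements:

/*
 * Rough sketch only: sync writeback keeps the existing serial iomap
 * path, while async writeback is fanned out to per-AG workers, each
 * given a private copy of the caller's wbc so no worker mutates the
 * shared writeback state. Note the async side returns without waiting,
 * which is exactly the accounting concern raised above.
 */
static int
xfs_vm_writepages(struct address_space *mapping,
		  struct writeback_control *wbc)
{
	if (wbc->sync_mode == WB_SYNC_ALL)
		return xfs_writepages_serial(mapping, wbc);

	return xfs_ag_writeback_offload(mapping, wbc);
}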
* Re: [PATCH v3 0/6] AG aware parallel writeback for XFS 2026-01-28 18:28 ` Kundan Kumar @ 2026-02-06 6:25 ` Christoph Hellwig 2026-02-06 10:07 ` Kundan Kumar 2026-02-09 15:54 ` Kundan Kumar 0 siblings, 2 replies; 48+ messages in thread From: Christoph Hellwig @ 2026-02-06 6:25 UTC (permalink / raw) To: Kundan Kumar Cc: Brian Foster, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Wed, Jan 28, 2026 at 11:58:25PM +0530, Kundan Kumar wrote: > This series is intended to replace the earlier BDI-level approach for > XFS, not to go alongside it. While BDI-level sharding is the more > natural generic mechanism, we saw XFS regressions on some setups when > inodes were affined to wb threads due to completion-side AG contention. > > The goal here is to make concurrency an XFS policy decision by routing > writeback using AG-aware folio tags, so we avoid inode-affinity > hotspots and handle cases where a single inode spans multiple AGs on > aged or fragmented filesystems. > > If this approach does not hold up across workloads and devices, we can > fall back to the generic BDI sharding model. I fear we're deep down a rabbit hole solving the wrong problem here. Traditionally block allocation, in XFS and in general, was about finding the "best" location to avoid seeks. With SSDs the seeks themselves are kinda pointless, although large sequential write streams are still very useful of course, as is avoiding both freespace and bmap fragmentation. On the other hand avoiding contention from multiple writers is a good thing. (this is discounting the HDD case, where the industry is very rapidly moving to zoned devices, for which zoned XFS has a totally different allocator) With multi-threaded writeback this becomes important for writeback, but even before that it would be useful for direct and uncached I/O. So I think the first thing I'd look into is tuning the allocator to avoid that contention, by spreading different allocation streams from different cores to different AGs, and relaxing the very sophisticated and detailed placement done by the XFS allocator. ^ permalink raw reply [flat|nested] 48+ messages in thread
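One way to read "spreading different allocation streams from different cores to different AGs" concretely; this is a guess at the shape of the idea, not anything from the series, and everything here is hypothetical except raw_smp_processor_id() and sb_agcount:

/*
 * Hypothetical sketch: derive an allocation stream's starting AG from
 * the submitting CPU, so concurrent writers on different cores tend to
 * allocate from different AGs instead of contending on one.
 */
static xfs_agnumber_t
xfs_alloc_stream_start_agno(struct xfs_mount *mp)
{
	return raw_smp_processor_id() % mp->m_sb.sb_agcount;
}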
* Re: [PATCH v3 0/6] AG aware parallel writeback for XFS 2026-02-06 6:25 ` Christoph Hellwig @ 2026-02-06 10:07 ` Kundan Kumar 2026-02-06 17:42 ` Darrick J. Wong 2026-02-09 6:30 ` Christoph Hellwig 1 sibling, 2 replies; 48+ messages in thread From: Kundan Kumar @ 2026-02-06 10:07 UTC (permalink / raw) To: Christoph Hellwig, djwong Cc: Brian Foster, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On 2/6/2026 11:55 AM, Christoph Hellwig wrote: > I fear we're deep down a rabbit hole solving the wrong problem here. > Traditionally block allocation, in XFS and in general, was about finding > the "best" location to avoid seeks. With SSDs the seeks themselves are > kinda pointless, although large sequential write streams are still very > useful of course, as is avoiding both freespace and bmap fragmentation. > On the other hand avoiding contention from multiple writers is a good > thing. (this is discounting the HDD case, where the industry is very > rapidly moving to zoned devices, for which zoned XFS has a totally > different allocator) > > With multi-threaded writeback this becomes important for writeback, but > even before that it would be useful for direct and uncached I/O. > > So I think the first thing I'd look into is tuning the allocator to > avoid that contention, by spreading different allocation streams from > different cores to different AGs, and relaxing the very sophisticated and > detailed placement done by the XFS allocator. When you say "coarse-grained sharding", do you mean tracking a single "home AG" per inode (no per-folio tagging) and using it as a best-effort hint for writeback routing? If so, we can align with that and keep the threading model generic by relying on bdi writeback contexts. Concretely, XFS would set up a bounded number of bdi wb contexts at mount time, and route each inode to its home shard. Does this align with what you have in mind? We had implemented a similar approach in earlier versions of this series[1]. But the feedback[2] that we got was that mapping high level writeback to eventual AG allocation can sometimes be inaccurate (aged filesystems, inode spanning across multiple AGs, etc.), so we moved to per-folio tagging to make the routing decision closer to the actual IO. That said, I agree that the per-folio approach is complex. Darrick, does this direction look reasonable to you as well? [1] https://lore.kernel.org/all/20251014120845.2361-1-kundan.kumar@samsung.com/ [2] https://lore.kernel.org/all/20251107133742.GA5596@lst.de/ ^ permalink raw reply [flat|nested] 48+ messages in thread
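To make the proposal above concrete, one hypothetical shape for the mount-time setup and per-inode routing; m_nr_wb_shards and both helpers are invented names for this sketch, while sb_agcount, num_online_cpus(), min_t() and XFS_INO_TO_AGNO() exist:

/*
 * Hypothetical sketch: cap the number of writeback contexts at mount
 * time and derive an inode's shard once from its home AG; no per-folio
 * state is kept.
 */
static void
xfs_init_wb_shards(struct xfs_mount *mp)
{
	mp->m_nr_wb_shards = min_t(unsigned int, mp->m_sb.sb_agcount,
				   num_online_cpus());
}

static unsigned int
xfs_inode_wb_shard(struct xfs_mount *mp, struct xfs_inode *ip)
{
	/* best-effort hint only; allocation may still land elsewhere */
	return XFS_INO_TO_AGNO(mp, ip->i_ino) % mp->m_nr_wb_shards;
}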
* Re: [PATCH v3 0/6] AG aware parallel writeback for XFS 2026-02-06 10:07 ` Kundan Kumar @ 2026-02-06 17:42 ` Darrick J. Wong 0 siblings, 0 replies; 48+ messages in thread From: Darrick J. Wong @ 2026-02-06 17:42 UTC (permalink / raw) To: Kundan Kumar Cc: Christoph Hellwig, Brian Foster, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, Feb 06, 2026 at 03:37:38PM +0530, Kundan Kumar wrote: > On 2/6/2026 11:55 AM, Christoph Hellwig wrote: > > I fear we're deep down a rabbit hole solving the wrong problem here. > > Traditionally block allocation, in XFS and in general, was about finding > > the "best" location to avoid seeks. With SSDs the seeks themselves are > > kinda pointless, although large sequential write streams are still very > > useful of course, as is avoiding both freespace and bmap fragmentation. > > On the other hand avoiding contention from multiple writers is a good > > thing. (this is discounting the HDD case, where the industry is very > > rapidly moving to zoned devices, for which zoned XFS has a totally > > different allocator) > > > > With multi-threaded writeback this becomes important for writeback, but > > even before that it would be useful for direct and uncached I/O. > > > > So I think the first thing I'd look into is tuning the allocator to > > avoid that contention, by spreading different allocation streams from > > different cores to different AGs, and relaxing the very sophisticated and > > detailed placement done by the XFS allocator. > > When you say "coarse-grained sharding", do you mean tracking a single > "home AG" per inode (no per-folio tagging) and using it as a best-effort > hint for writeback routing? > > If so, we can align with that and keep the threading model generic by > relying on bdi writeback contexts. Concretely, XFS would set up a > bounded number of bdi wb contexts at mount time, and route each inode to > its home shard. Does this align with what you have in mind? > > We had implemented a similar approach in earlier versions of this > series[1]. But the feedback[2] that we got was that mapping high level > writeback to eventual AG allocation can sometimes be inaccurate (aged > filesystems, inode spanning across multiple AGs, etc.), so we moved to > per-folio tagging to make the routing decision closer to the actual IO. > That said, I agree that the per-folio approach is complex. > > Darrick, does this direction look reasonable to you as well? Yeah, I think a simpler scheme would be a better starting place; we can make things more complex later once we can show that there's a need. --D > > [1] > https://lore.kernel.org/all/20251014120845.2361-1-kundan.kumar@samsung.com/ > [2] https://lore.kernel.org/all/20251107133742.GA5596@lst.de/ > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 0/6] AG aware parallel writeback for XFS 2026-02-06 10:07 ` Kundan Kumar 2026-02-06 17:42 ` Darrick J. Wong @ 2026-02-09 6:30 ` Christoph Hellwig 1 sibling, 0 replies; 48+ messages in thread From: Christoph Hellwig @ 2026-02-09 6:30 UTC (permalink / raw) To: Kundan Kumar Cc: Christoph Hellwig, djwong, Brian Foster, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, Feb 06, 2026 at 03:37:38PM +0530, Kundan Kumar wrote: > When you say "coarse-grained sharding", do you mean tracking a single > "home AG" per inode (no per-folio tagging) and using it as a best-effort > hint for writeback routing? Yes, but.. > If so, we can align with that and keep the threading model generic by > relying on bdi writeback contexts. Concretely, XFS would set up a > bounded number of bdi wb contexts at mount time, and route each inode to > its home shard. Does this align with what you have in mind? Yes. > We had implemented a similar approach in earlier versions of this > series[1]. But the feedback[2] that we got was that mapping high level > writeback to eventual AG allocation can sometimes be inaccurate (aged > filesystems, inode spanning across multiple AGs, etc.), so we moved to > per-folio tagging to make the routing decision closer to the actual IO. > That said, I agree that the per-folio approach is complex. It is inaccurate, not to say often wrong. And that's the point I'm trying to make here: start with simplifying the allocation policies so that they make sense in this kind of environment, and then align the sharding with that. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 0/6] AG aware parallel writeback for XFS 2026-02-06 6:25 ` Christoph Hellwig 2026-02-06 10:07 ` Kundan Kumar @ 2026-02-09 15:54 ` Kundan Kumar 2026-02-10 15:38 ` Christoph Hellwig 1 sibling, 1 reply; 48+ messages in thread From: Kundan Kumar @ 2026-02-09 15:54 UTC (permalink / raw) To: Christoph Hellwig Cc: Brian Foster, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, ritesh.list, djwong, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On 2/6/2026 11:55 AM, Christoph Hellwig wrote: > I fear we're deep down a rabbit hole solving the wrong problem here. > Traditionally block allocation, in XFS and in general, was about finding > the "best" location to avoid seeks. With SSDs the seeks themselves are > kinda pointless, although large sequential write streams are still very > useful of course, as is avoiding both freespace and bmap fragmentation. > On the other hand avoiding contention from multiple writers is a good > thing. (this is discounting the HDD case, where the industry is very > rapidly moving to zoned devices, for which zoned XFS has a totally > different allocator) > > With multi-threaded writeback this becomes important for writeback, but > even before that it would be useful for direct and uncached I/O. > > So I think the first thing I'd look into is tuning the allocator to > avoid that contention, by spreading different allocation streams from > different cores to different AGs, and relaxing the very sophisticated and > detailed placement done by the XFS allocator. Thanks, I'm going to restate what I think you are suggesting to make sure I'm tracking correctly. We will step back from per-folio tagging and instead align coarse sharding with a simpler allocation policy:

- Create a bounded number of bdi wb contexts at mount time (capped, e.g. ≤ agcount).
- Store a per-inode stream/shard id (no per-folio state).
- Assign the stream id once and use it to select the wb context for writeback.
- In the delalloc allocator, bias AG selection from the stream id by partitioning AG space into per-stream "bands" and rotating the start AG within that band; fall back to the existing allocator when allocation can't be satisfied.

Does this align with what you have in mind? ^ permalink raw reply [flat|nested] 48+ messages in thread
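A purely illustrative sketch of the "bands" bullet above; mp->m_sb.sb_agcount and the max_t()/atomic helpers are real, while the function itself and its parameters are hypothetical:

/*
 * Hypothetical per-stream AG bands: carve the AG space into one
 * contiguous band per stream and rotate the starting AG within the
 * stream's band; the existing allocator still provides the fallback
 * when the band cannot satisfy the allocation.
 */
static xfs_agnumber_t
xfs_stream_start_agno(struct xfs_mount *mp, unsigned int stream_id,
		      unsigned int nr_streams, atomic_t *rotor)
{
	xfs_agnumber_t agcount = mp->m_sb.sb_agcount;
	xfs_agnumber_t band_len = max_t(xfs_agnumber_t, 1,
					agcount / nr_streams);
	xfs_agnumber_t band_start = (stream_id % nr_streams) * band_len;
	unsigned int rot = (unsigned int)atomic_inc_return(rotor);

	/* rotate within the band to spread load across its AGs */
	return (band_start + rot % band_len) % agcount;
}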
* Re: [PATCH v3 0/6] AG aware parallel writeback for XFS 2026-02-09 15:54 ` Kundan Kumar @ 2026-02-10 15:38 ` Christoph Hellwig 0 siblings, 0 replies; 48+ messages in thread From: Christoph Hellwig @ 2026-02-10 15:38 UTC (permalink / raw) To: Kundan Kumar Cc: Christoph Hellwig, Brian Foster, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, ritesh.list, djwong, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Mon, Feb 09, 2026 at 09:24:49PM +0530, Kundan Kumar wrote:

> - Create a bounded number of bdi wb contexts at mount time (capped, e.g. ≤ agcount).

Yeah. And then optimally map them to CPU cores, similar to the blk-mq cpumap.

> - Store a per-inode stream/shard id (no per-folio state).

Yes.

> - Assign the stream id once and use it to select the wb context for writeback.

Yes.

> - In the delalloc allocator, bias AG selection from the stream id by partitioning AG space into per-stream "bands" and rotating the start AG within that band; fall back to the existing allocator when allocation can't be satisfied.

Yes. We might also need something that falls back to fewer helpers if the free space is distributed unevenly, but probably not for the first prototype. ^ permalink raw reply [flat|nested] 48+ messages in thread
end of thread, other threads:[~2026-02-11 9:39 UTC | newest]
Thread overview: 48+ messages
[not found] <CGME20260116101236epcas5p12ba3de776976f4ea6666e16a33ab6ec4@epcas5p1.samsung.com>
2026-01-16 10:08 ` [PATCH v3 0/6] AG aware parallel writeback for XFS Kundan Kumar
[not found] ` <CGME20260116101241epcas5p330f9c335a096aaaefda4b7d3c38d6038@epcas5p3.samsung.com>
2026-01-16 10:08 ` [PATCH v3 1/6] iomap: add write ops hook to attach metadata to folios Kundan Kumar
[not found] ` <CGME20260116101245epcas5p30269c6aa35784db67e6d6ca800a683a7@epcas5p3.samsung.com>
2026-01-16 10:08 ` [PATCH v3 2/6] xfs: add helpers to pack AG prediction info for per-folio tracking Kundan Kumar
2026-01-29 0:45 ` Darrick J. Wong
2026-02-03 7:15 ` Kundan Kumar
2026-02-05 16:39 ` Darrick J. Wong
2026-02-04 7:37 ` Nirjhar Roy (IBM)
[not found] ` <CGME20260116101251epcas5p1cf5b48f2efb14fe4387be3053b3c3ebc@epcas5p1.samsung.com>
2026-01-16 10:08 ` [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap Kundan Kumar
2026-01-29 0:44 ` Darrick J. Wong
2026-02-03 7:20 ` Kundan Kumar
2026-02-05 16:42 ` Darrick J. Wong
2026-02-05 6:44 ` Nirjhar Roy (IBM)
2026-02-05 16:32 ` Darrick J. Wong
2026-02-06 5:41 ` Nirjhar Roy (IBM)
2026-02-05 6:36 ` Nirjhar Roy (IBM)
2026-02-05 16:36 ` Darrick J. Wong
2026-02-06 5:36 ` Nirjhar Roy (IBM)
2026-02-06 5:57 ` Darrick J. Wong
2026-02-06 6:03 ` Nirjhar Roy (IBM)
2026-02-06 7:00 ` Christoph Hellwig
[not found] ` <CGME20260116101256epcas5p2d6125a6bcad78c33f737fdc3484aca79@epcas5p2.samsung.com>
2026-01-16 10:08 ` [PATCH v3 4/6] xfs: tag folios with AG number during buffered write via iomap attach hook Kundan Kumar
2026-01-29 0:47 ` Darrick J. Wong
2026-01-29 22:40 ` Darrick J. Wong
2026-02-03 7:32 ` Kundan Kumar
2026-02-03 7:28 ` Kundan Kumar
2026-02-05 15:56 ` Brian Foster
2026-02-06 6:44 ` Nirjhar Roy (IBM)
[not found] ` <CGME20260116101259epcas5p1cfa6ab02e5a01f7c46cc78df95c57ce0@epcas5p1.samsung.com>
2026-01-16 10:08 ` [PATCH v3 5/6] xfs: add per-AG writeback workqueue infrastructure Kundan Kumar
2026-01-29 22:21 ` Darrick J. Wong
2026-02-03 7:35 ` Kundan Kumar
2026-02-06 6:46 ` Christoph Hellwig
2026-02-10 11:56 ` Nirjhar Roy (IBM)
[not found] ` <CGME20260116101305epcas5p497cd6d9027301853669f1c1aaffbf128@epcas5p4.samsung.com>
2026-01-16 10:08 ` [PATCH v3 6/6] xfs: offload writeback by AG using per-inode dirty bitmap and per-AG workers Kundan Kumar
2026-01-29 22:34 ` Darrick J. Wong
2026-02-03 7:40 ` Kundan Kumar
2026-02-11 9:39 ` Nirjhar Roy (IBM)
2026-01-16 16:13 ` [syzbot ci] Re: AG aware parallel writeback for XFS syzbot ci
2026-01-21 19:54 ` [PATCH v3 0/6] " Brian Foster
2026-01-22 16:15 ` Kundan Kumar
2026-01-23 9:36 ` Pankaj Raghav (Samsung)
2026-01-23 13:26 ` Brian Foster
2026-01-28 18:28 ` Kundan Kumar
2026-02-06 6:25 ` Christoph Hellwig
2026-02-06 10:07 ` Kundan Kumar
2026-02-06 17:42 ` Darrick J. Wong
2026-02-09 6:30 ` Christoph Hellwig
2026-02-09 15:54 ` Kundan Kumar
2026-02-10 15:38 ` Christoph Hellwig