* [PATCH v3 0/6] AG aware parallel writeback for XFS
[not found] <CGME20260116101236epcas5p12ba3de776976f4ea6666e16a33ab6ec4@epcas5p1.samsung.com>
@ 2026-01-16 10:08 ` Kundan Kumar
  [not found] ` <CGME20260116101241epcas5p330f9c335a096aaaefda4b7d3c38d6038@epcas5p3.samsung.com>
  ` (7 more replies)
  0 siblings, 8 replies; 48+ messages in thread
From: Kundan Kumar @ 2026-01-16 10:08 UTC (permalink / raw)
  To: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe,
	hch, ritesh.list, djwong, dave, cem, wangyufei
  Cc: linux-fsdevel, linux-mm, linux-xfs, gost.dev, kundan.kumar,
	anuj20.g, vishak.g, joshi.k

This series explores AG aware parallel writeback for XFS. The goal is
to reduce writeback contention and improve scalability by allowing
writeback to be distributed across allocation groups (AGs).

Problem statement
=================
Today, XFS writeback walks the page cache serially per inode and
funnels all writeback through a single writeback context. For aging
filesystems, especially under highly parallel buffered I/O, this leads
to limited concurrency across independent AGs. The filesystem already
has strong AG-level parallelism for allocation and metadata
operations, but writeback remains largely AG agnostic.

High-level approach
===================
This series introduces AG aware writeback with the following model
(a sketch of the write-time half follows the diffstat below):

1) Predict the target AG for buffered writes (mapped or delalloc) at
   write time.
2) Tag AG hints per folio (via lightweight metadata / xarray).
3) Track dirty AGs per inode using a bitmap.
4) Offload writeback to per-AG worker threads, each performing a
   one-pass scan.
5) Workers filter folios and submit only those tagged for their AG.

Unlike our earlier approach, which parallelized writeback by
introducing multiple writeback contexts per BDI, this series keeps all
changes within XFS and is orthogonal to that work. The AG aware
mechanism uses per-folio AG hints to route writeback to AG specific
workers, and therefore applies even when a single inode's data spans
multiple AGs. This avoids the earlier limitation of relying on
inode-based AG locality, which can break down on aged/fragmented
filesystems.

IOPS and throughput
===================
We see a significant improvement in throughput when files span
multiple AGs.

Workload: 12 files of 500M each, in 12 directories (AGs)
- numjobs = 12
- NVMe device: Intel Optane

  Base XFS               :  308 MiB/s
  Parallel Writeback XFS : 1534 MiB/s (+398%)

Workload: 6 files of 6G each, in 6 directories (AGs)
- numjobs = 12
- NVMe device: Intel Optane

  Base XFS               :  409 MiB/s
  Parallel Writeback XFS : 1245 MiB/s (+204%)

These changes are on top of the v6.18 kernel release.

Future work involves tightening writeback control (wbc) handling to
integrate with global writeback accounting and range semantics, and
evaluating the interaction with higher-level writeback parallelism.
Kundan Kumar (6):
  iomap: add write ops hook to attach metadata to folios
  xfs: add helpers to pack AG prediction info for per-folio tracking
  xfs: add per-inode AG prediction map and dirty-AG bitmap
  xfs: tag folios with AG number during buffered write via iomap
    attach hook
  xfs: add per-AG writeback workqueue infrastructure
  xfs: offload writeback by AG using per-inode dirty bitmap and
    per-AG workers

 fs/iomap/buffered-io.c |   3 +
 fs/xfs/xfs_aops.c      | 257 +++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_aops.h      |   3 +
 fs/xfs/xfs_icache.c    |  27 +++++
 fs/xfs/xfs_inode.h     |   5 +
 fs/xfs/xfs_iomap.c     | 114 ++++++++++++++++++
 fs/xfs/xfs_iomap.h     |  31 +++++
 fs/xfs/xfs_mount.c     |   2 +
 fs/xfs/xfs_mount.h     |  10 ++
 fs/xfs/xfs_super.c     |   2 +
 include/linux/iomap.h  |   3 +
 11 files changed, 457 insertions(+)

-- 
2.25.1

^ permalink raw reply	[flat|nested] 48+ messages in thread
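[For illustration only: a minimal sketch of the write-time half of the
model above (steps 1-3), assuming the xfs_agp_* helpers from patch 2
and the i_ag_pmap/i_ag_dirty_bitmap fields from patch 3.
xfs_ag_predict_folio() is a hypothetical name; the series actually
routes this through the iomap hook added in patch 1, and the exact
xarray storage convention belongs to patch 4.]

/*
 * Sketch of write-time AG prediction and tagging; error handling and
 * the delalloc case are omitted.
 */
static void
xfs_ag_predict_folio(
	struct xfs_inode	*ip,
	const struct iomap	*iomap,
	struct folio		*folio)
{
	struct xfs_mount	*mp = ip->i_mount;
	xfs_agnumber_t		agno;

	if (iomap->type != IOMAP_MAPPED)
		return;	/* a real version also predicts delalloc AGs */

	/* step 1: map the folio's first block to an AG via the extent */
	agno = xfs_daddr_to_agno(mp, BTOBB(iomap->addr));

	/* step 2: tag the folio; one entry per folio index */
	xa_store(&ip->i_ag_pmap, folio->index,
			xa_mk_value(xfs_agp_pack(agno, iomap->type, true)),
			GFP_NOFS);

	/* step 3: note that this inode now has dirty data in this AG */
	set_bit(agno, ip->i_ag_dirty_bitmap);
}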
[parent not found: <CGME20260116101241epcas5p330f9c335a096aaaefda4b7d3c38d6038@epcas5p3.samsung.com>]
* [PATCH v3 1/6] iomap: add write ops hook to attach metadata to folios
  [not found] ` <CGME20260116101241epcas5p330f9c335a096aaaefda4b7d3c38d6038@epcas5p3.samsung.com>
@ 2026-01-16 10:08   ` Kundan Kumar
  0 siblings, 0 replies; 48+ messages in thread
From: Kundan Kumar @ 2026-01-16 10:08 UTC (permalink / raw)
  To: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe,
	hch, ritesh.list, djwong, dave, cem, wangyufei
  Cc: linux-fsdevel, linux-mm, linux-xfs, gost.dev, kundan.kumar,
	anuj20.g, vishak.g, joshi.k

Add an optional tag_folio callback to struct iomap_write_ops.
Filesystems may use this hook to tag folios during write-begin without
changing iomap core behavior.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 fs/iomap/buffered-io.c | 3 +++
 include/linux/iomap.h  | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8b847a1e27f1..701f2e9cd010 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -699,6 +699,9 @@ static int __iomap_write_begin(const struct iomap_iter *iter,
 	size_t from = offset_in_folio(folio, pos), to = from + len;
 	size_t poff, plen;
 
+	if (write_ops && write_ops->tag_folio)
+		write_ops->tag_folio(&iter->iomap, folio, pos, len);
+
 	/*
 	 * If the write or zeroing completely overlaps the current folio, then
 	 * entire folio will be dirtied so there is no need for
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 73dceabc21c8..14ef88f8ee84 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -176,6 +176,9 @@ struct iomap_write_ops {
 	 */
 	int (*read_folio_range)(const struct iomap_iter *iter,
 			struct folio *folio, loff_t pos, size_t len);
+
+	void (*tag_folio)(const struct iomap *iomap,
+			struct folio *folio, loff_t pos, size_t len);
 };
 
 /*
-- 
2.25.1

^ permalink raw reply	[flat|nested] 48+ messages in thread
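[To make the contract concrete: a sketch of how a filesystem would
wire up the new hook. The callback name and the ops instance are
placeholders, not the series' actual XFS wiring (that lands in
patch 4); the ops structure is then passed down the buffered-write
path alongside the usual iomap_ops.]

/* Placeholder consumer of the optional ->tag_folio hook. */
static void
xfs_wb_tag_folio(
	const struct iomap	*iomap,
	struct folio		*folio,
	loff_t			pos,
	size_t			len)
{
	/* record which AG this folio is predicted to write back to */
}

static const struct iomap_write_ops xfs_buffered_write_ops = {
	.tag_folio	= xfs_wb_tag_folio,
};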
[parent not found: <CGME20260116101245epcas5p30269c6aa35784db67e6d6ca800a683a7@epcas5p3.samsung.com>]
* [PATCH v3 2/6] xfs: add helpers to pack AG prediction info for per-folio tracking
  [not found] ` <CGME20260116101245epcas5p30269c6aa35784db67e6d6ca800a683a7@epcas5p3.samsung.com>
@ 2026-01-16 10:08   ` Kundan Kumar
  2026-01-29  0:45     ` Darrick J. Wong
  2026-02-04  7:37     ` Nirjhar Roy (IBM)
  0 siblings, 2 replies; 48+ messages in thread
From: Kundan Kumar @ 2026-01-16 10:08 UTC (permalink / raw)
  To: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe,
	hch, ritesh.list, djwong, dave, cem, wangyufei
  Cc: linux-fsdevel, linux-mm, linux-xfs, gost.dev, kundan.kumar,
	anuj20.g, vishak.g, joshi.k

Introduce helper routines to pack and unpack AG prediction metadata
for folios. This provides a compact and self-contained representation
for AG tracking.

The packed layout uses:
 - bit 31     : valid
 - bits 24-30 : iomap type
 - bits 0-23  : AG number

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 fs/xfs/xfs_iomap.h | 31 +++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index ebcce7d49446..eaf4513f6759 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -12,6 +12,37 @@ struct xfs_inode;
 struct xfs_bmbt_irec;
 struct xfs_zone_alloc_ctx;
 
+/* pack prediction in a u32 stored in xarray */
+#define XFS_AGP_VALID_SHIFT	31
+#define XFS_AGP_TYPE_SHIFT	24
+#define XFS_AGP_TYPE_MASK	0x7fu
+#define XFS_AGP_AGNO_MASK	0x00ffffffu
+
+static inline u32 xfs_agp_pack(u32 agno, u8 iomap_type, bool valid)
+{
+	u32 v = agno & XFS_AGP_AGNO_MASK;
+
+	v |= ((u32)iomap_type & XFS_AGP_TYPE_MASK) << XFS_AGP_TYPE_SHIFT;
+	if (valid)
+		v |= (1u << XFS_AGP_VALID_SHIFT);
+	return v;
+}
+
+static inline bool xfs_agp_valid(u32 v)
+{
+	return v >> XFS_AGP_VALID_SHIFT;
+}
+
+static inline u32 xfs_agp_agno(u32 v)
+{
+	return v & XFS_AGP_AGNO_MASK;
+}
+
+static inline u8 xfs_agp_type(u32 v)
+{
+	return (u8)((v >> XFS_AGP_TYPE_SHIFT) & XFS_AGP_TYPE_MASK);
+}
+
 int xfs_iomap_write_direct(struct xfs_inode *ip, xfs_fileoff_t offset_fsb,
 		xfs_fileoff_t count_fsb, unsigned int flags,
 		struct xfs_bmbt_irec *imap, u64 *sequence);
-- 
2.25.1

^ permalink raw reply	[flat|nested] 48+ messages in thread
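[As a quick sanity check on this layout: packing and unpacking should
round-trip. A sketch, where ASSERT is the XFS debug macro, the values
are arbitrary, and the last line demonstrates the 24-bit AG-number
truncation raised in the review below.]

/* Round-trip check for the packed prediction layout. */
static void
xfs_agp_selftest(void)
{
	u32	v = xfs_agp_pack(7, IOMAP_DELALLOC, true);

	ASSERT(xfs_agp_valid(v));
	ASSERT(xfs_agp_agno(v) == 7);
	ASSERT(xfs_agp_type(v) == IOMAP_DELALLOC);

	/* an agno at or above 2^24 silently truncates in this layout */
	ASSERT(xfs_agp_agno(xfs_agp_pack(1u << 24, IOMAP_MAPPED, true)) == 0);
}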
* Re: [PATCH v3 2/6] xfs: add helpers to pack AG prediction info for per-folio tracking 2026-01-16 10:08 ` [PATCH v3 2/6] xfs: add helpers to pack AG prediction info for per-folio tracking Kundan Kumar @ 2026-01-29 0:45 ` Darrick J. Wong 2026-02-03 7:15 ` Kundan Kumar 2026-02-04 7:37 ` Nirjhar Roy (IBM) 1 sibling, 1 reply; 48+ messages in thread From: Darrick J. Wong @ 2026-01-29 0:45 UTC (permalink / raw) To: Kundan Kumar Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, Jan 16, 2026 at 03:38:14PM +0530, Kundan Kumar wrote: > Introduce helper routines to pack and unpack AG prediction metadata > for folios. This provides a compact and self-contained representation > for AG tracking. > > The packed layout uses: > - bit 31 : valid > - bit 24-30 : iomap type > - bit 0-23 : AG number There are only 5 iomap types, why do you need 7 bits for that? Also, can you store more bits on a 64-bit system to avoid truncating the AG number? --D > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > --- > fs/xfs/xfs_iomap.h | 31 +++++++++++++++++++++++++++++++ > 1 file changed, 31 insertions(+) > > diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h > index ebcce7d49446..eaf4513f6759 100644 > --- a/fs/xfs/xfs_iomap.h > +++ b/fs/xfs/xfs_iomap.h > @@ -12,6 +12,37 @@ struct xfs_inode; > struct xfs_bmbt_irec; > struct xfs_zone_alloc_ctx; > > +/* pack prediction in a u32 stored in xarray */ > +#define XFS_AGP_VALID_SHIFT 31 > +#define XFS_AGP_TYPE_SHIFT 24 > +#define XFS_AGP_TYPE_MASK 0x7fu > +#define XFS_AGP_AGNO_MASK 0x00ffffffu > + > +static inline u32 xfs_agp_pack(u32 agno, u8 iomap_type, bool valid) > +{ > + u32 v = agno & XFS_AGP_AGNO_MASK; > + > + v |= ((u32)iomap_type & XFS_AGP_TYPE_MASK) << XFS_AGP_TYPE_SHIFT; > + if (valid) > + v |= (1u << XFS_AGP_VALID_SHIFT); > + return v; > +} > + > +static inline bool xfs_agp_valid(u32 v) > +{ > + return v >> XFS_AGP_VALID_SHIFT; > +} > + > +static inline u32 xfs_agp_agno(u32 v) > +{ > + return v & XFS_AGP_AGNO_MASK; > +} > + > +static inline u8 xfs_agp_type(u32 v) > +{ > + return (u8)((v >> XFS_AGP_TYPE_SHIFT) & XFS_AGP_TYPE_MASK); > +} > + > int xfs_iomap_write_direct(struct xfs_inode *ip, xfs_fileoff_t offset_fsb, > xfs_fileoff_t count_fsb, unsigned int flags, > struct xfs_bmbt_irec *imap, u64 *sequence); > -- > 2.25.1 > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 2/6] xfs: add helpers to pack AG prediction info for per-folio tracking 2026-01-29 0:45 ` Darrick J. Wong @ 2026-02-03 7:15 ` Kundan Kumar 2026-02-05 16:39 ` Darrick J. Wong 0 siblings, 1 reply; 48+ messages in thread From: Kundan Kumar @ 2026-02-03 7:15 UTC (permalink / raw) To: Darrick J. Wong Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On 1/29/2026 6:15 AM, Darrick J. Wong wrote: > On Fri, Jan 16, 2026 at 03:38:14PM +0530, Kundan Kumar wrote: >> Introduce helper routines to pack and unpack AG prediction metadata >> for folios. This provides a compact and self-contained representation >> for AG tracking. >> >> The packed layout uses: >> - bit 31 : valid >> - bit 24-30 : iomap type >> - bit 0-23 : AG number > > There are only 5 iomap types, why do you need 7 bits for that? > > Also, can you store more bits on a 64-bit system to avoid truncating the > AG number? > > --D I’ll reduce the type field to 3 bits (8 values). For the AG number, I can drop the artificial 24-bit cap by packing into an unsigned long and storing it via xa_mk_value(), which provides ~60 bits on 64-bit systems and ~28 bits on 32-bit systems. > >> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> >> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> >> --- >> fs/xfs/xfs_iomap.h | 31 +++++++++++++++++++++++++++++++ >> 1 file changed, 31 insertions(+) >> >> diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h >> index ebcce7d49446..eaf4513f6759 100644 >> --- a/fs/xfs/xfs_iomap.h >> +++ b/fs/xfs/xfs_iomap.h >> @@ -12,6 +12,37 @@ struct xfs_inode; >> struct xfs_bmbt_irec; >> struct xfs_zone_alloc_ctx; >> >> +/* pack prediction in a u32 stored in xarray */ >> +#define XFS_AGP_VALID_SHIFT 31 >> +#define XFS_AGP_TYPE_SHIFT 24 >> +#define XFS_AGP_TYPE_MASK 0x7fu >> +#define XFS_AGP_AGNO_MASK 0x00ffffffu >> + >> +static inline u32 xfs_agp_pack(u32 agno, u8 iomap_type, bool valid) >> +{ >> + u32 v = agno & XFS_AGP_AGNO_MASK; >> + >> + v |= ((u32)iomap_type & XFS_AGP_TYPE_MASK) << XFS_AGP_TYPE_SHIFT; >> + if (valid) >> + v |= (1u << XFS_AGP_VALID_SHIFT); >> + return v; >> +} >> + >> +static inline bool xfs_agp_valid(u32 v) >> +{ >> + return v >> XFS_AGP_VALID_SHIFT; >> +} >> + >> +static inline u32 xfs_agp_agno(u32 v) >> +{ >> + return v & XFS_AGP_AGNO_MASK; >> +} >> + >> +static inline u8 xfs_agp_type(u32 v) >> +{ >> + return (u8)((v >> XFS_AGP_TYPE_SHIFT) & XFS_AGP_TYPE_MASK); >> +} >> + >> int xfs_iomap_write_direct(struct xfs_inode *ip, xfs_fileoff_t offset_fsb, >> xfs_fileoff_t count_fsb, unsigned int flags, >> struct xfs_bmbt_irec *imap, u64 *sequence); >> -- >> 2.25.1 >> >> > ^ permalink raw reply [flat|nested] 48+ messages in thread
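[For reference, the xa_mk_value()-based packing proposed above could
look roughly like the following. This is illustrative only and the
real v4 encoding may differ; xa_mk_value() consumes one bit
internally, so an unsigned long payload keeps BITS_PER_LONG - 1 bits:
one valid bit, three type bits, and the remainder for the AG number.]

#define AGP_VALID_BIT	0
#define AGP_TYPE_SHIFT	1
#define AGP_TYPE_BITS	3
#define AGP_AGNO_SHIFT	(AGP_TYPE_SHIFT + AGP_TYPE_BITS)

/* pack an AG prediction into an xarray value entry (hypothetical) */
static inline void *agp_pack_entry(unsigned long agno, unsigned int type)
{
	unsigned long v = (agno << AGP_AGNO_SHIFT) |
			  ((unsigned long)type << AGP_TYPE_SHIFT) |
			  BIT(AGP_VALID_BIT);

	return xa_mk_value(v);
}

static inline unsigned long agp_entry_agno(void *entry)
{
	return xa_to_value(entry) >> AGP_AGNO_SHIFT;
}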
* Re: [PATCH v3 2/6] xfs: add helpers to pack AG prediction info for per-folio tracking 2026-02-03 7:15 ` Kundan Kumar @ 2026-02-05 16:39 ` Darrick J. Wong 0 siblings, 0 replies; 48+ messages in thread From: Darrick J. Wong @ 2026-02-05 16:39 UTC (permalink / raw) To: Kundan Kumar Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Tue, Feb 03, 2026 at 12:45:33PM +0530, Kundan Kumar wrote: > On 1/29/2026 6:15 AM, Darrick J. Wong wrote: > > On Fri, Jan 16, 2026 at 03:38:14PM +0530, Kundan Kumar wrote: > >> Introduce helper routines to pack and unpack AG prediction metadata > >> for folios. This provides a compact and self-contained representation > >> for AG tracking. > >> > >> The packed layout uses: > >> - bit 31 : valid > >> - bit 24-30 : iomap type > >> - bit 0-23 : AG number > > > > There are only 5 iomap types, why do you need 7 bits for that? > > > > Also, can you store more bits on a 64-bit system to avoid truncating the > > AG number? > > > > --D > > I’ll reduce the type field to 3 bits (8 values). > > For the AG number, I can drop the artificial 24-bit cap by packing into > an unsigned long and storing it via xa_mk_value(), which provides ~60 > bits on 64-bit systems and ~28 bits on 32-bit systems. <nod> > > > >> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > >> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > >> --- > >> fs/xfs/xfs_iomap.h | 31 +++++++++++++++++++++++++++++++ > >> 1 file changed, 31 insertions(+) > >> > >> diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h > >> index ebcce7d49446..eaf4513f6759 100644 > >> --- a/fs/xfs/xfs_iomap.h > >> +++ b/fs/xfs/xfs_iomap.h > >> @@ -12,6 +12,37 @@ struct xfs_inode; > >> struct xfs_bmbt_irec; > >> struct xfs_zone_alloc_ctx; > >> > >> +/* pack prediction in a u32 stored in xarray */ > >> +#define XFS_AGP_VALID_SHIFT 31 > >> +#define XFS_AGP_TYPE_SHIFT 24 > >> +#define XFS_AGP_TYPE_MASK 0x7fu > >> +#define XFS_AGP_AGNO_MASK 0x00ffffffu > >> + > >> +static inline u32 xfs_agp_pack(u32 agno, u8 iomap_type, bool valid) > >> +{ > >> + u32 v = agno & XFS_AGP_AGNO_MASK; > >> + > >> + v |= ((u32)iomap_type & XFS_AGP_TYPE_MASK) << XFS_AGP_TYPE_SHIFT; > >> + if (valid) > >> + v |= (1u << XFS_AGP_VALID_SHIFT); > >> + return v; > >> +} > >> + > >> +static inline bool xfs_agp_valid(u32 v) > >> +{ > >> + return v >> XFS_AGP_VALID_SHIFT; Isn't this just a mask? return v & (1U << XFS_AGP_VALID_SHIFT) > >> +} > >> + > >> +static inline u32 xfs_agp_agno(u32 v) > >> +{ > >> + return v & XFS_AGP_AGNO_MASK; > >> +} > >> + > >> +static inline u8 xfs_agp_type(u32 v) > >> +{ > >> + return (u8)((v >> XFS_AGP_TYPE_SHIFT) & XFS_AGP_TYPE_MASK); > >> +} And as Nirjhar noted, please try to use richer types when possible. s/u32 agno/xfs_agnumber_t agno/ --D > >> + > >> int xfs_iomap_write_direct(struct xfs_inode *ip, xfs_fileoff_t offset_fsb, > >> xfs_fileoff_t count_fsb, unsigned int flags, > >> struct xfs_bmbt_irec *imap, u64 *sequence); > >> -- > >> 2.25.1 > >> > >> > > > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 2/6] xfs: add helpers to pack AG prediction info for per-folio tracking
  2026-01-16 10:08   ` [PATCH v3 2/6] xfs: add helpers to pack AG prediction info for per-folio tracking Kundan Kumar
  2026-01-29  0:45     ` Darrick J. Wong
@ 2026-02-04  7:37     ` Nirjhar Roy (IBM)
  1 sibling, 0 replies; 48+ messages in thread
From: Nirjhar Roy (IBM) @ 2026-02-04  7:37 UTC (permalink / raw)
  To: Kundan Kumar, viro, brauner, jack, willy, mcgrof, clm, david,
	amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei
  Cc: linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g,
	joshi.k

On Fri, 2026-01-16 at 15:38 +0530, Kundan Kumar wrote:
> Introduce helper routines to pack and unpack AG prediction metadata
> for folios. This provides a compact and self-contained representation
> for AG tracking.
>
> The packed layout uses:
> - bit 31 : valid
> - bit 24-30 : iomap type
> - bit 0-23 : AG number

# AG number limited to 2^24 - is this a reasonable assumption?

>
> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> ---
>  fs/xfs/xfs_iomap.h | 31 +++++++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
>
> diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
> index ebcce7d49446..eaf4513f6759 100644
> --- a/fs/xfs/xfs_iomap.h
> +++ b/fs/xfs/xfs_iomap.h
> @@ -12,6 +12,37 @@ struct xfs_inode;
>  struct xfs_bmbt_irec;
>  struct xfs_zone_alloc_ctx;
>
> +/* pack prediction in a u32 stored in xarray */

Nit: Maybe add a brief comment about what these are doing and their
significance?

> +#define XFS_AGP_VALID_SHIFT	31
> +#define XFS_AGP_TYPE_SHIFT	24
> +#define XFS_AGP_TYPE_MASK	0x7fu
> +#define XFS_AGP_AGNO_MASK	0x00ffffffu
> +
> +static inline u32 xfs_agp_pack(u32 agno, u8 iomap_type, bool valid)

Nit: Maybe use xfs_agnumber_t instead of u32 for agno?

Nit: I think xfs style is like

	static inline u32
	<function name>(<parameters etc>)

> +{
> +	u32 v = agno & XFS_AGP_AGNO_MASK;
> +
> +	v |= ((u32)iomap_type & XFS_AGP_TYPE_MASK) << XFS_AGP_TYPE_SHIFT;
> +	if (valid)
> +		v |= (1u << XFS_AGP_VALID_SHIFT);
> +	return v;
> +}
> +
> +static inline bool xfs_agp_valid(u32 v)
> +{
> +	return v >> XFS_AGP_VALID_SHIFT;
> +}
> +
> +static inline u32 xfs_agp_agno(u32 v)
> +{
> +	return v & XFS_AGP_AGNO_MASK;
> +}
> +
> +static inline u8 xfs_agp_type(u32 v)
> +{
> +	return (u8)((v >> XFS_AGP_TYPE_SHIFT) & XFS_AGP_TYPE_MASK);
> +}

Again nit: we introduce functions here but only use them in upcoming
patches. Any reason why we can't introduce and use the functions in
the same patch? That way we could review the definitions and the call
sites together instead of switching between two patches - one for the
definitions and another for the usages/call-sites. Again, no hard
preferences here - just a suggestion.

--NR

> +
>  int xfs_iomap_write_direct(struct xfs_inode *ip, xfs_fileoff_t offset_fsb,
>  		xfs_fileoff_t count_fsb, unsigned int flags,
>  		struct xfs_bmbt_irec *imap, u64 *sequence);

^ permalink raw reply	[flat|nested] 48+ messages in thread
[parent not found: <CGME20260116101251epcas5p1cf5b48f2efb14fe4387be3053b3c3ebc@epcas5p1.samsung.com>]
* [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap
  [not found] ` <CGME20260116101251epcas5p1cf5b48f2efb14fe4387be3053b3c3ebc@epcas5p1.samsung.com>
@ 2026-01-16 10:08   ` Kundan Kumar
  2026-01-29  0:44     ` Darrick J. Wong
  ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Kundan Kumar @ 2026-01-16 10:08 UTC (permalink / raw)
  To: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe,
	hch, ritesh.list, djwong, dave, cem, wangyufei
  Cc: linux-fsdevel, linux-mm, linux-xfs, gost.dev, kundan.kumar,
	anuj20.g, vishak.g, joshi.k

Add per-inode structures to track predicted AGs of dirty folios using
an xarray and bitmap. This enables efficient identification of AGs
involved in writeback.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
---
 fs/xfs/xfs_icache.c | 27 +++++++++++++++++++++++++++
 fs/xfs/xfs_inode.h  |  5 +++++
 2 files changed, 32 insertions(+)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index e44040206851..f97aa6d66271 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -80,6 +80,25 @@ static inline xa_mark_t ici_tag_to_mark(unsigned int tag)
 	return XFS_PERAG_BLOCKGC_MARK;
 }
 
+static int xfs_inode_init_ag_bitmap(struct xfs_inode *ip)
+{
+	unsigned int bits = ip->i_mount->m_sb.sb_agcount;
+	unsigned int nlongs;
+
+	xa_init_flags(&ip->i_ag_pmap, XA_FLAGS_LOCK_IRQ);
+	ip->i_ag_dirty_bitmap = NULL;
+	ip->i_ag_dirty_bits = bits;
+
+	if (!bits)
+		return 0;
+
+	nlongs = BITS_TO_LONGS(bits);
+	ip->i_ag_dirty_bitmap = kcalloc(nlongs, sizeof(unsigned long),
+					GFP_NOFS);
+
+	return ip->i_ag_dirty_bitmap ? 0 : -ENOMEM;
+}
+
 /*
  * Allocate and initialise an xfs_inode.
  */
@@ -131,6 +150,8 @@ xfs_inode_alloc(
 	ip->i_next_unlinked = NULLAGINO;
 	ip->i_prev_unlinked = 0;
 
+	xfs_inode_init_ag_bitmap(ip);
+
 	return ip;
 }
 
@@ -194,6 +215,12 @@ xfs_inode_free(
 	ip->i_ino = 0;
 	spin_unlock(&ip->i_flags_lock);
 
+	/* free xarray contents (values are immediate packed ints) */
+	xa_destroy(&ip->i_ag_pmap);
+	kfree(ip->i_ag_dirty_bitmap);
+	ip->i_ag_dirty_bitmap = NULL;
+	ip->i_ag_dirty_bits = 0;
+
 	__xfs_inode_free(ip);
 }
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index bd6d33557194..dee449168605 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -99,6 +99,11 @@ typedef struct xfs_inode {
 	spinlock_t		i_ioend_lock;
 	struct work_struct	i_ioend_work;
 	struct list_head	i_ioend_list;
+
+	/* AG prediction map: pgoff_t -> packed u32 */
+	struct xarray		i_ag_pmap;
+	unsigned long		*i_ag_dirty_bitmap;
+	unsigned int		i_ag_dirty_bits;
 } xfs_inode_t;
 
 static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip)
-- 
2.25.1

^ permalink raw reply	[flat|nested] 48+ messages in thread
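[For context: a sketch of the intended consumer of i_ag_dirty_bitmap
(patch 6 of the series) - walk the bitmap at writeback time and kick
one worker per dirty AG. xfs_queue_ag_writeback() is a hypothetical
placeholder, and a real version must serialize against writers
setting bits concurrently.]

/* Wake per-AG writeback workers for every AG this inode dirtied. */
static void
xfs_wake_dirty_ag_workers(
	struct xfs_inode	*ip)
{
	unsigned int		agno;

	for_each_set_bit(agno, ip->i_ag_dirty_bitmap, ip->i_ag_dirty_bits) {
		clear_bit(agno, ip->i_ag_dirty_bitmap);
		xfs_queue_ag_writeback(ip, agno);	/* hypothetical */
	}
}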
* Re: [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap
  2026-01-16 10:08   ` [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap Kundan Kumar
@ 2026-01-29  0:44     ` Darrick J. Wong
  2026-02-03  7:20       ` Kundan Kumar
  2026-02-05  6:44       ` Nirjhar Roy (IBM)
  0 siblings, 2 replies; 48+ messages in thread
From: Darrick J. Wong @ 2026-01-29  0:44 UTC (permalink / raw)
  To: Kundan Kumar
  Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe,
	hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm,
	linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k

On Fri, Jan 16, 2026 at 03:38:15PM +0530, Kundan Kumar wrote:
> Add per-inode structures to track predicted AGs of dirty folios using
> an xarray and bitmap. This enables efficient identification of AGs
> involved in writeback.
>
> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
> ---
>  fs/xfs/xfs_icache.c | 27 +++++++++++++++++++++++++++
>  fs/xfs/xfs_inode.h  |  5 +++++
>  2 files changed, 32 insertions(+)
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index e44040206851..f97aa6d66271 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -80,6 +80,25 @@ static inline xa_mark_t ici_tag_to_mark(unsigned int tag)
>  	return XFS_PERAG_BLOCKGC_MARK;
>  }
>
> +static int xfs_inode_init_ag_bitmap(struct xfs_inode *ip)
> +{
> +	unsigned int bits = ip->i_mount->m_sb.sb_agcount;
> +	unsigned int nlongs;
> +
> +	xa_init_flags(&ip->i_ag_pmap, XA_FLAGS_LOCK_IRQ);

This increases the size of struct xfs_inode by 40 bytes...

> +	ip->i_ag_dirty_bitmap = NULL;
> +	ip->i_ag_dirty_bits = bits;
> +
> +	if (!bits)
> +		return 0;
> +
> +	nlongs = BITS_TO_LONGS(bits);
> +	ip->i_ag_dirty_bitmap = kcalloc(nlongs, sizeof(unsigned long),
> +					GFP_NOFS);

...and there could be hundreds or thousands of AGs for each filesystem.
That's a lot of kernel memory to handle this prediction stuff, and I'm
not even sure what ag_dirty_bitmap does yet.

> +
> +	return ip->i_ag_dirty_bitmap ? 0 : -ENOMEM;
> +}
> +
>  /*
>   * Allocate and initialise an xfs_inode.
>   */
> @@ -131,6 +150,8 @@ xfs_inode_alloc(
>  	ip->i_next_unlinked = NULLAGINO;
>  	ip->i_prev_unlinked = 0;
>
> +	xfs_inode_init_ag_bitmap(ip);

Unchecked return value???

> +
>  	return ip;
>  }
>
> @@ -194,6 +215,12 @@ xfs_inode_free(
>  	ip->i_ino = 0;
>  	spin_unlock(&ip->i_flags_lock);
>
> +	/* free xarray contents (values are immediate packed ints) */
> +	xa_destroy(&ip->i_ag_pmap);
> +	kfree(ip->i_ag_dirty_bitmap);
> +	ip->i_ag_dirty_bitmap = NULL;
> +	ip->i_ag_dirty_bits = 0;
> +
>  	__xfs_inode_free(ip);
>  }
>
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index bd6d33557194..dee449168605 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -99,6 +99,11 @@ typedef struct xfs_inode {
>  	spinlock_t		i_ioend_lock;
>  	struct work_struct	i_ioend_work;
>  	struct list_head	i_ioend_list;
> +
> +	/* AG prediction map: pgoff_t -> packed u32 */

What about blocksize < pagesize filesystems? Which packed agno do you
associate with the pgoff_t?

Also, do you have an xarray entry for each pgoff_t in a large folio?

--D

> +	struct xarray		i_ag_pmap;
> +	unsigned long		*i_ag_dirty_bitmap;
> +	unsigned int		i_ag_dirty_bits;
>  } xfs_inode_t;
>
>  static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip)
> --
> 2.25.1
>
>

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap
  2026-01-29  0:44     ` Darrick J. Wong
@ 2026-02-03  7:20       ` Kundan Kumar
  2026-02-05 16:42         ` Darrick J. Wong
  2026-02-05  6:44       ` Nirjhar Roy (IBM)
  1 sibling, 1 reply; 48+ messages in thread
From: Kundan Kumar @ 2026-02-03  7:20 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe,
	hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm,
	linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k

On 1/29/2026 6:14 AM, Darrick J. Wong wrote:
> On Fri, Jan 16, 2026 at 03:38:15PM +0530, Kundan Kumar wrote:
>> Add per-inode structures to track predicted AGs of dirty folios using
>> an xarray and bitmap. This enables efficient identification of AGs
>> involved in writeback.
>>
>> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
>> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
>> ---
>>  fs/xfs/xfs_icache.c | 27 +++++++++++++++++++++++++++
>>  fs/xfs/xfs_inode.h  |  5 +++++
>>  2 files changed, 32 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
>> index e44040206851..f97aa6d66271 100644
>> --- a/fs/xfs/xfs_icache.c
>> +++ b/fs/xfs/xfs_icache.c
>> @@ -80,6 +80,25 @@ static inline xa_mark_t ici_tag_to_mark(unsigned int tag)
>>  	return XFS_PERAG_BLOCKGC_MARK;
>>  }
>>
>> +static int xfs_inode_init_ag_bitmap(struct xfs_inode *ip)
>> +{
>> +	unsigned int bits = ip->i_mount->m_sb.sb_agcount;
>> +	unsigned int nlongs;
>> +
>> +	xa_init_flags(&ip->i_ag_pmap, XA_FLAGS_LOCK_IRQ);
>
> This increases the size of struct xfs_inode by 40 bytes...
>

I'll make this lazy and sparse: move AG writeback state behind a
pointer allocated on first use, and replace the bitmap with a sparse
dirty-AG set (xarray keyed by agno) so memory scales with the AGs
actually touched by the inode.

>> +	ip->i_ag_dirty_bitmap = NULL;
>> +	ip->i_ag_dirty_bits = bits;
>> +
>> +	if (!bits)
>> +		return 0;
>> +
>> +	nlongs = BITS_TO_LONGS(bits);
>> +	ip->i_ag_dirty_bitmap = kcalloc(nlongs, sizeof(unsigned long),
>> +					GFP_NOFS);
>
> ...and there could be hundreds or thousands of AGs for each filesystem.
> That's a lot of kernel memory to handle this prediction stuff, and I'm
> not even sure what ag_dirty_bitmap does yet.
>

The bit for an AG is set in ag_dirty_bitmap at write time. During
writeback, we check which AG bits are set, wake only those AG-specific
workers, and each worker scans the page cache, filters folios tagged
for its AG, and submits the I/O.

>> +
>> +	return ip->i_ag_dirty_bitmap ? 0 : -ENOMEM;
>> +}
>> +
>>  /*
>>   * Allocate and initialise an xfs_inode.
>>   */
>> @@ -131,6 +150,8 @@ xfs_inode_alloc(
>>  	ip->i_next_unlinked = NULLAGINO;
>>  	ip->i_prev_unlinked = 0;
>>
>> +	xfs_inode_init_ag_bitmap(ip);
>
> Unchecked return value???

Will correct in the next version.

>
>> +
>>  	return ip;
>>  }
>>
>> @@ -194,6 +215,12 @@ xfs_inode_free(
>>  	ip->i_ino = 0;
>>  	spin_unlock(&ip->i_flags_lock);
>>
>> +	/* free xarray contents (values are immediate packed ints) */
>> +	xa_destroy(&ip->i_ag_pmap);
>> +	kfree(ip->i_ag_dirty_bitmap);
>> +	ip->i_ag_dirty_bitmap = NULL;
>> +	ip->i_ag_dirty_bits = 0;
>> +
>>  	__xfs_inode_free(ip);
>>  }
>>
>> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
>> index bd6d33557194..dee449168605 100644
>> --- a/fs/xfs/xfs_inode.h
>> +++ b/fs/xfs/xfs_inode.h
>> @@ -99,6 +99,11 @@ typedef struct xfs_inode {
>>  	spinlock_t		i_ioend_lock;
>>  	struct work_struct	i_ioend_work;
>>  	struct list_head	i_ioend_list;
>> +
>> +	/* AG prediction map: pgoff_t -> packed u32 */
>
> What about blocksize < pagesize filesystems? Which packed agno do you
> associate with the pgoff_t?
>
> Also, do you have an xarray entry for each pgoff_t in a large folio?
>
> --D
>

pgoff_t here is the pagecache index (folio->index), i.e. the file
offset in PAGE_SIZE units, not a filesystem block index. So blocksize
< PAGE_SIZE doesn't change the association; the packed agno is
attached to the folio at that pagecache index.

We store one xarray entry per folio index (the start of the folio). We
do not create entries for each base page inside a large folio. If a
large folio could span multiple extents/AGs, we'll treat the hint as
advisory and tag it invalid (falling back to normal writeback routing)
rather than trying to encode per-subpage AGs.

>> +	struct xarray		i_ag_pmap;
>> +	unsigned long		*i_ag_dirty_bitmap;
>> +	unsigned int		i_ag_dirty_bits;
>>  } xfs_inode_t;
>>
>>  static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip)
>> --
>> 2.25.1
>>
>>
>

^ permalink raw reply	[flat|nested] 48+ messages in thread
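[A minimal sketch of the lazily allocated, sparse state proposed in
the reply above. The structure layout, the i_ag_wb field, and the
helper name are all hypothetical; a real version needs a cmpxchg (or
lock) against racing first-use allocations.]

/* Per-inode AG writeback state, allocated on first dirtying. */
struct xfs_ag_wb_state {
	struct xarray	pmap;		/* folio index -> packed prediction */
	struct xarray	dirty_ags;	/* agno present iff AG has dirty data */
};

static int
xfs_mark_ag_dirty(
	struct xfs_inode	*ip,
	xfs_agnumber_t		agno)
{
	struct xfs_ag_wb_state	*st = ip->i_ag_wb;	/* hypothetical field */

	if (!st) {
		st = kzalloc(sizeof(*st), GFP_NOFS);
		if (!st)
			return -ENOMEM;
		xa_init_flags(&st->pmap, XA_FLAGS_LOCK_IRQ);
		xa_init(&st->dirty_ags);
		ip->i_ag_wb = st;	/* racy; see note above */
	}

	/* memory now scales with AGs actually touched by this inode */
	return xa_err(xa_store(&st->dirty_ags, agno, xa_mk_value(1),
			GFP_NOFS));
}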
* Re: [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap 2026-02-03 7:20 ` Kundan Kumar @ 2026-02-05 16:42 ` Darrick J. Wong 0 siblings, 0 replies; 48+ messages in thread From: Darrick J. Wong @ 2026-02-05 16:42 UTC (permalink / raw) To: Kundan Kumar Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Tue, Feb 03, 2026 at 12:50:53PM +0530, Kundan Kumar wrote: > On 1/29/2026 6:14 AM, Darrick J. Wong wrote: > > On Fri, Jan 16, 2026 at 03:38:15PM +0530, Kundan Kumar wrote: > >> Add per-inode structures to track predicted AGs of dirty folios using > >> an xarray and bitmap. This enables efficient identification of AGs > >> involved in writeback. > >> > >> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > >> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > >> --- > >> fs/xfs/xfs_icache.c | 27 +++++++++++++++++++++++++++ > >> fs/xfs/xfs_inode.h | 5 +++++ > >> 2 files changed, 32 insertions(+) > >> > >> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > >> index e44040206851..f97aa6d66271 100644 > >> --- a/fs/xfs/xfs_icache.c > >> +++ b/fs/xfs/xfs_icache.c > >> @@ -80,6 +80,25 @@ static inline xa_mark_t ici_tag_to_mark(unsigned int tag) > >> return XFS_PERAG_BLOCKGC_MARK; > >> } > >> > >> +static int xfs_inode_init_ag_bitmap(struct xfs_inode *ip) > >> +{ > >> + unsigned int bits = ip->i_mount->m_sb.sb_agcount; > >> + unsigned int nlongs; > >> + > >> + xa_init_flags(&ip->i_ag_pmap, XA_FLAGS_LOCK_IRQ); > > > > This increases the size of struct xfs_inode by 40 bytes... > > > > I’ll make this lazy and sparse: move AG writeback state behind a pointer > allocated on first use, and replace the bitmap with a sparse dirty-AG > set(xarray keyed by agno) so memory scales with AGs actually touched by > the inode. > > >> + ip->i_ag_dirty_bitmap = NULL; > >> + ip->i_ag_dirty_bits = bits; > >> + > >> + if (!bits) > >> + return 0; > >> + > >> + nlongs = BITS_TO_LONGS(bits); > >> + ip->i_ag_dirty_bitmap = kcalloc(nlongs, sizeof(unsigned long), > >> + GFP_NOFS); > > > > ...and there could be hundreds or thousands of AGs for each filesystem. > > That's a lot of kernel memory to handle this prediction stuff, and I"m > > not even sure what ag_dirty_bitmap does yet. > > > > The bit for an AG is set in ag_dirty_bitmap at write time. During > writeback, we check which AG bits are set, wake only those AG-specific > workers, and each worker scans the page cache, filters folios tagged for > its AG, and submits the I/O. > > >> + > >> + return ip->i_ag_dirty_bitmap ? 0 : -ENOMEM; > >> +} > >> + > >> /* > >> * Allocate and initialise an xfs_inode. > >> */ > >> @@ -131,6 +150,8 @@ xfs_inode_alloc( > >> ip->i_next_unlinked = NULLAGINO; > >> ip->i_prev_unlinked = 0; > >> > >> + xfs_inode_init_ag_bitmap(ip); > > > > Unchecked return value??? 
> > Will correct in next version > > > > >> + > >> return ip; > >> } > >> > >> @@ -194,6 +215,12 @@ xfs_inode_free( > >> ip->i_ino = 0; > >> spin_unlock(&ip->i_flags_lock); > >> > >> + /* free xarray contents (values are immediate packed ints) */ > >> + xa_destroy(&ip->i_ag_pmap); > >> + kfree(ip->i_ag_dirty_bitmap); > >> + ip->i_ag_dirty_bitmap = NULL; > >> + ip->i_ag_dirty_bits = 0; > >> + > >> __xfs_inode_free(ip); > >> } > >> > >> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > >> index bd6d33557194..dee449168605 100644 > >> --- a/fs/xfs/xfs_inode.h > >> +++ b/fs/xfs/xfs_inode.h > >> @@ -99,6 +99,11 @@ typedef struct xfs_inode { > >> spinlock_t i_ioend_lock; > >> struct work_struct i_ioend_work; > >> struct list_head i_ioend_list; > >> + > >> + /* AG prediction map: pgoff_t -> packed u32 */ > > > > What about blocksize < pagesize filesystems? Which packed agno do you > > associate with the pgoff_t? > > > > Also, do you have an xarray entry for each pgoff_t in a large folio? > > > > --D > > > > pgoff_t here is the pagecache index (folio->index), i.e. file offset in > PAGE_SIZE units, not a filesystem block index. So blocksize < PAGE_SIZE > doesn’t change the association, the packed agno is attached to the folio > at that pagecache index. Ok, so the tag is entirely determined by the AG of the first fsblock within the folio. > We store one xarray entry per folio index (the start of the folio). We > do not create entries for each base-page inside a large folio. If a > large folio could span multiple extents/AGs, we’ll treat the hint as > advisory and tag it invalid (fallback to normal writeback routing) > rather than trying to encode per-subpage AGs. Oh, ok, so if you have the mapping and the folio at the same time you can determine that the entire large folio maps to a single extent, and tag the whole large folio as belonging to a single AG. That clears things up, thank you. It's only in the case of extreme fragmentation that a large folio gets flung at the old writeback paths, which is probably good enough anyway. --D > >> + struct xarray i_ag_pmap; > >> + unsigned long *i_ag_dirty_bitmap; > >> + unsigned int i_ag_dirty_bits; > >> } xfs_inode_t; > >> > >> static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip) > >> -- > >> 2.25.1 > >> > >> > > > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap 2026-01-29 0:44 ` Darrick J. Wong 2026-02-03 7:20 ` Kundan Kumar @ 2026-02-05 6:44 ` Nirjhar Roy (IBM) 2026-02-05 16:32 ` Darrick J. Wong 1 sibling, 1 reply; 48+ messages in thread From: Nirjhar Roy (IBM) @ 2026-02-05 6:44 UTC (permalink / raw) To: Darrick J. Wong, Kundan Kumar Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Wed, 2026-01-28 at 16:44 -0800, Darrick J. Wong wrote: > On Fri, Jan 16, 2026 at 03:38:15PM +0530, Kundan Kumar wrote: > > Add per-inode structures to track predicted AGs of dirty folios using > > an xarray and bitmap. This enables efficient identification of AGs > > involved in writeback. > > > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > > --- > > fs/xfs/xfs_icache.c | 27 +++++++++++++++++++++++++++ > > fs/xfs/xfs_inode.h | 5 +++++ > > 2 files changed, 32 insertions(+) > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > index e44040206851..f97aa6d66271 100644 > > --- a/fs/xfs/xfs_icache.c > > +++ b/fs/xfs/xfs_icache.c > > @@ -80,6 +80,25 @@ static inline xa_mark_t ici_tag_to_mark(unsigned int tag) > > return XFS_PERAG_BLOCKGC_MARK; > > } > > > > +static int xfs_inode_init_ag_bitmap(struct xfs_inode *ip) > > +{ > > + unsigned int bits = ip->i_mount->m_sb.sb_agcount; > > + unsigned int nlongs; > > + > > + xa_init_flags(&ip->i_ag_pmap, XA_FLAGS_LOCK_IRQ); > > This increases the size of struct xfs_inode by 40 bytes... > > > + ip->i_ag_dirty_bitmap = NULL; > > + ip->i_ag_dirty_bits = bits; > > + > > + if (!bits) > > + return 0; > > + > > + nlongs = BITS_TO_LONGS(bits); > > + ip->i_ag_dirty_bitmap = kcalloc(nlongs, sizeof(unsigned long), > > + GFP_NOFS); > > ...and there could be hundreds or thousands of AGs for each filesystem. > That's a lot of kernel memory to handle this prediction stuff, and I"m > not even sure what ag_dirty_bitmap does yet. > > > + > > + return ip->i_ag_dirty_bitmap ? 0 : -ENOMEM; > > +} > > + > > /* > > * Allocate and initialise an xfs_inode. > > */ > > @@ -131,6 +150,8 @@ xfs_inode_alloc( > > ip->i_next_unlinked = NULLAGINO; > > ip->i_prev_unlinked = 0; > > > > + xfs_inode_init_ag_bitmap(ip); > > Unchecked return value??? > > > + > > return ip; > > } > > > > @@ -194,6 +215,12 @@ xfs_inode_free( > > ip->i_ino = 0; > > spin_unlock(&ip->i_flags_lock); > > > > + /* free xarray contents (values are immediate packed ints) */ > > + xa_destroy(&ip->i_ag_pmap); > > + kfree(ip->i_ag_dirty_bitmap); > > + ip->i_ag_dirty_bitmap = NULL; > > + ip->i_ag_dirty_bits = 0; > > + > > __xfs_inode_free(ip); > > } > > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > > index bd6d33557194..dee449168605 100644 > > --- a/fs/xfs/xfs_inode.h > > +++ b/fs/xfs/xfs_inode.h > > @@ -99,6 +99,11 @@ typedef struct xfs_inode { > > spinlock_t i_ioend_lock; > > struct work_struct i_ioend_work; > > struct list_head i_ioend_list; > > + > > + /* AG prediction map: pgoff_t -> packed u32 */ > > What about blocksize < pagesize filesystems? Which packed agno do you > associate with the pgoff_t? Sorry but I am bit unfamiliar with the term packed agno? Can you please briefly explain does packed agno mean? --NR > > Also, do you have an xarray entry for each pgoff_t in a large folio? 
> > --D > > > + struct xarray i_ag_pmap; > > + unsigned long *i_ag_dirty_bitmap; > > + unsigned int i_ag_dirty_bits; > > } xfs_inode_t; > > > > static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip) > > -- > > 2.25.1 > > > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap 2026-02-05 6:44 ` Nirjhar Roy (IBM) @ 2026-02-05 16:32 ` Darrick J. Wong 2026-02-06 5:41 ` Nirjhar Roy (IBM) 0 siblings, 1 reply; 48+ messages in thread From: Darrick J. Wong @ 2026-02-05 16:32 UTC (permalink / raw) To: Nirjhar Roy (IBM) Cc: Kundan Kumar, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Thu, Feb 05, 2026 at 12:14:35PM +0530, Nirjhar Roy (IBM) wrote: > On Wed, 2026-01-28 at 16:44 -0800, Darrick J. Wong wrote: > > On Fri, Jan 16, 2026 at 03:38:15PM +0530, Kundan Kumar wrote: > > > Add per-inode structures to track predicted AGs of dirty folios using > > > an xarray and bitmap. This enables efficient identification of AGs > > > involved in writeback. > > > > > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > > > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > > > --- > > > fs/xfs/xfs_icache.c | 27 +++++++++++++++++++++++++++ > > > fs/xfs/xfs_inode.h | 5 +++++ > > > 2 files changed, 32 insertions(+) > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > index e44040206851..f97aa6d66271 100644 > > > --- a/fs/xfs/xfs_icache.c > > > +++ b/fs/xfs/xfs_icache.c > > > @@ -80,6 +80,25 @@ static inline xa_mark_t ici_tag_to_mark(unsigned int tag) > > > return XFS_PERAG_BLOCKGC_MARK; > > > } > > > > > > +static int xfs_inode_init_ag_bitmap(struct xfs_inode *ip) > > > +{ > > > + unsigned int bits = ip->i_mount->m_sb.sb_agcount; > > > + unsigned int nlongs; > > > + > > > + xa_init_flags(&ip->i_ag_pmap, XA_FLAGS_LOCK_IRQ); > > > > This increases the size of struct xfs_inode by 40 bytes... > > > > > + ip->i_ag_dirty_bitmap = NULL; > > > + ip->i_ag_dirty_bits = bits; > > > + > > > + if (!bits) > > > + return 0; > > > + > > > + nlongs = BITS_TO_LONGS(bits); > > > + ip->i_ag_dirty_bitmap = kcalloc(nlongs, sizeof(unsigned long), > > > + GFP_NOFS); > > > > ...and there could be hundreds or thousands of AGs for each filesystem. > > That's a lot of kernel memory to handle this prediction stuff, and I"m > > not even sure what ag_dirty_bitmap does yet. > > > > > + > > > + return ip->i_ag_dirty_bitmap ? 0 : -ENOMEM; > > > +} > > > + > > > /* > > > * Allocate and initialise an xfs_inode. > > > */ > > > @@ -131,6 +150,8 @@ xfs_inode_alloc( > > > ip->i_next_unlinked = NULLAGINO; > > > ip->i_prev_unlinked = 0; > > > > > > + xfs_inode_init_ag_bitmap(ip); > > > > Unchecked return value??? > > > > > + > > > return ip; > > > } > > > > > > @@ -194,6 +215,12 @@ xfs_inode_free( > > > ip->i_ino = 0; > > > spin_unlock(&ip->i_flags_lock); > > > > > > + /* free xarray contents (values are immediate packed ints) */ > > > + xa_destroy(&ip->i_ag_pmap); > > > + kfree(ip->i_ag_dirty_bitmap); > > > + ip->i_ag_dirty_bitmap = NULL; > > > + ip->i_ag_dirty_bits = 0; > > > + > > > __xfs_inode_free(ip); > > > } > > > > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > > > index bd6d33557194..dee449168605 100644 > > > --- a/fs/xfs/xfs_inode.h > > > +++ b/fs/xfs/xfs_inode.h > > > @@ -99,6 +99,11 @@ typedef struct xfs_inode { > > > spinlock_t i_ioend_lock; > > > struct work_struct i_ioend_work; > > > struct list_head i_ioend_list; > > > + > > > + /* AG prediction map: pgoff_t -> packed u32 */ > > > > What about blocksize < pagesize filesystems? Which packed agno do you > > associate with the pgoff_t? > Sorry but I am bit unfamiliar with the term packed agno? 
Can you please briefly explain does packed > agno mean? I was talking about the "xfs_agp" numbers introduced in the previous patch, which seem to contain the bottom 25 bits of the ag number. Really I was just asking about multi-fsblock folios -- which block within that folio do we map to an AG for tagging purposes? I think Kundan replied elsewhere that it's the first block. --D > --NR > > > > Also, do you have an xarray entry for each pgoff_t in a large folio? > > > > --D > > > > > + struct xarray i_ag_pmap; > > > + unsigned long *i_ag_dirty_bitmap; > > > + unsigned int i_ag_dirty_bits; > > > } xfs_inode_t; > > > > > > static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip) > > > -- > > > 2.25.1 > > > > > > > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap 2026-02-05 16:32 ` Darrick J. Wong @ 2026-02-06 5:41 ` Nirjhar Roy (IBM) 0 siblings, 0 replies; 48+ messages in thread From: Nirjhar Roy (IBM) @ 2026-02-06 5:41 UTC (permalink / raw) To: Darrick J. Wong Cc: Kundan Kumar, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Thu, 2026-02-05 at 08:32 -0800, Darrick J. Wong wrote: > On Thu, Feb 05, 2026 at 12:14:35PM +0530, Nirjhar Roy (IBM) wrote: > > On Wed, 2026-01-28 at 16:44 -0800, Darrick J. Wong wrote: > > > On Fri, Jan 16, 2026 at 03:38:15PM +0530, Kundan Kumar wrote: > > > > Add per-inode structures to track predicted AGs of dirty folios using > > > > an xarray and bitmap. This enables efficient identification of AGs > > > > involved in writeback. > > > > > > > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > > > > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > > > > --- > > > > fs/xfs/xfs_icache.c | 27 +++++++++++++++++++++++++++ > > > > fs/xfs/xfs_inode.h | 5 +++++ > > > > 2 files changed, 32 insertions(+) > > > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > > index e44040206851..f97aa6d66271 100644 > > > > --- a/fs/xfs/xfs_icache.c > > > > +++ b/fs/xfs/xfs_icache.c > > > > @@ -80,6 +80,25 @@ static inline xa_mark_t ici_tag_to_mark(unsigned int tag) > > > > return XFS_PERAG_BLOCKGC_MARK; > > > > } > > > > > > > > +static int xfs_inode_init_ag_bitmap(struct xfs_inode *ip) > > > > +{ > > > > + unsigned int bits = ip->i_mount->m_sb.sb_agcount; > > > > + unsigned int nlongs; > > > > + > > > > + xa_init_flags(&ip->i_ag_pmap, XA_FLAGS_LOCK_IRQ); > > > > > > This increases the size of struct xfs_inode by 40 bytes... > > > > > > > + ip->i_ag_dirty_bitmap = NULL; > > > > + ip->i_ag_dirty_bits = bits; > > > > + > > > > + if (!bits) > > > > + return 0; > > > > + > > > > + nlongs = BITS_TO_LONGS(bits); > > > > + ip->i_ag_dirty_bitmap = kcalloc(nlongs, sizeof(unsigned long), > > > > + GFP_NOFS); > > > > > > ...and there could be hundreds or thousands of AGs for each filesystem. > > > That's a lot of kernel memory to handle this prediction stuff, and I"m > > > not even sure what ag_dirty_bitmap does yet. > > > > > > > + > > > > + return ip->i_ag_dirty_bitmap ? 0 : -ENOMEM; > > > > +} > > > > + > > > > /* > > > > * Allocate and initialise an xfs_inode. > > > > */ > > > > @@ -131,6 +150,8 @@ xfs_inode_alloc( > > > > ip->i_next_unlinked = NULLAGINO; > > > > ip->i_prev_unlinked = 0; > > > > > > > > + xfs_inode_init_ag_bitmap(ip); > > > > > > Unchecked return value??? 
> > > > > > > + > > > > return ip; > > > > } > > > > > > > > @@ -194,6 +215,12 @@ xfs_inode_free( > > > > ip->i_ino = 0; > > > > spin_unlock(&ip->i_flags_lock); > > > > > > > > + /* free xarray contents (values are immediate packed ints) */ > > > > + xa_destroy(&ip->i_ag_pmap); > > > > + kfree(ip->i_ag_dirty_bitmap); > > > > + ip->i_ag_dirty_bitmap = NULL; > > > > + ip->i_ag_dirty_bits = 0; > > > > + > > > > __xfs_inode_free(ip); > > > > } > > > > > > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > > > > index bd6d33557194..dee449168605 100644 > > > > --- a/fs/xfs/xfs_inode.h > > > > +++ b/fs/xfs/xfs_inode.h > > > > @@ -99,6 +99,11 @@ typedef struct xfs_inode { > > > > spinlock_t i_ioend_lock; > > > > struct work_struct i_ioend_work; > > > > struct list_head i_ioend_list; > > > > + > > > > + /* AG prediction map: pgoff_t -> packed u32 */ > > > > > > What about blocksize < pagesize filesystems? Which packed agno do you > > > associate with the pgoff_t? > > Sorry but I am bit unfamiliar with the term packed agno? Can you please briefly explain does packed > > agno mean? > > I was talking about the "xfs_agp" numbers introduced in the previous > patch, which seem to contain the bottom 25 bits of the ag number. Okay, right. Thank you. > > Really I was just asking about multi-fsblock folios -- which block > within that folio do we map to an AG for tagging purposes? > > I think Kundan replied elsewhere that it's the first block. Okay. --NR > > --D > > > --NR > > > Also, do you have an xarray entry for each pgoff_t in a large folio? > > > > > > --D > > > > > > > + struct xarray i_ag_pmap; > > > > + unsigned long *i_ag_dirty_bitmap; > > > > + unsigned int i_ag_dirty_bits; > > > > } xfs_inode_t; > > > > > > > > static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip) > > > > -- > > > > 2.25.1 > > > > > > > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap 2026-01-16 10:08 ` [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap Kundan Kumar 2026-01-29 0:44 ` Darrick J. Wong @ 2026-02-05 6:36 ` Nirjhar Roy (IBM) 2026-02-05 16:36 ` Darrick J. Wong 2026-02-06 7:00 ` Christoph Hellwig 2 siblings, 1 reply; 48+ messages in thread From: Nirjhar Roy (IBM) @ 2026-02-05 6:36 UTC (permalink / raw) To: Kundan Kumar, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei Cc: linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, 2026-01-16 at 15:38 +0530, Kundan Kumar wrote: > Add per-inode structures to track predicted AGs of dirty folios using > an xarray and bitmap. This enables efficient identification of AGs > involved in writeback. > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > --- > fs/xfs/xfs_icache.c | 27 +++++++++++++++++++++++++++ > fs/xfs/xfs_inode.h | 5 +++++ > 2 files changed, 32 insertions(+) > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > index e44040206851..f97aa6d66271 100644 > --- a/fs/xfs/xfs_icache.c > +++ b/fs/xfs/xfs_icache.c > @@ -80,6 +80,25 @@ static inline xa_mark_t ici_tag_to_mark(unsigned int tag) > return XFS_PERAG_BLOCKGC_MARK; > } > > +static int xfs_inode_init_ag_bitmap(struct xfs_inode *ip) Similar comment as before: static int xfs_inode_init...() > +{ > + unsigned int bits = ip->i_mount->m_sb.sb_agcount; > + unsigned int nlongs; > + > + xa_init_flags(&ip->i_ag_pmap, XA_FLAGS_LOCK_IRQ); Nit: The name of the functions suggests that it is initializing the tracking bitmap which it does - however, the above line does slightly different thing? Maybe move the xarray init outside the bitmap init function? > + ip->i_ag_dirty_bitmap = NULL; > + ip->i_ag_dirty_bits = bits; > + > + if (!bits) Umm, !bits means agcount is 0. Shouldn't we ASSERT that bits >= 2? Or am I missing something? > + return 0; > + > + nlongs = BITS_TO_LONGS(bits); > + ip->i_ag_dirty_bitmap = kcalloc(nlongs, sizeof(unsigned long), > + GFP_NOFS); > + > + return ip->i_ag_dirty_bitmap ? 0 : -ENOMEM; > +} > + > /* > * Allocate and initialise an xfs_inode. > */ > @@ -131,6 +150,8 @@ xfs_inode_alloc( > ip->i_next_unlinked = NULLAGINO; > ip->i_prev_unlinked = 0; > > + xfs_inode_init_ag_bitmap(ip); xfs_inode_init_ag_bitmap() returns int - error handling for -ENOMEM? > + > return ip; > } > > @@ -194,6 +215,12 @@ xfs_inode_free( > ip->i_ino = 0; > spin_unlock(&ip->i_flags_lock); > > + /* free xarray contents (values are immediate packed ints) */ > + xa_destroy(&ip->i_ag_pmap); Nit:Maybe have a small wrapper for freeing it the prediction map? No hard preferences though. > + kfree(ip->i_ag_dirty_bitmap); > + ip->i_ag_dirty_bitmap = NULL; Nit: Usually while freeing the pointers I prefer: t = ip->i_ag_dirty_bitmap; ip->i_ag_dirty_bitmap = NULL; kfree(t); In this way, the pointer(i_ag_dirty_bitmap in this case) that I am freeing never points to an already freed address. 
> + ip->i_ag_dirty_bits = 0; > + > __xfs_inode_free(ip); > } > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > index bd6d33557194..dee449168605 100644 > --- a/fs/xfs/xfs_inode.h > +++ b/fs/xfs/xfs_inode.h > @@ -99,6 +99,11 @@ typedef struct xfs_inode { > spinlock_t i_ioend_lock; > struct work_struct i_ioend_work; > struct list_head i_ioend_list; > + > + /* AG prediction map: pgoff_t -> packed u32 */ > + struct xarray i_ag_pmap; > + unsigned long *i_ag_dirty_bitmap; > + unsigned int i_ag_dirty_bits; Not sure but, I mostly see the typedefed versions of data types being used like uint32 etc. Darrick, hch, are the above fine? --NR > } xfs_inode_t; > > static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip) ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap 2026-02-05 6:36 ` Nirjhar Roy (IBM) @ 2026-02-05 16:36 ` Darrick J. Wong 2026-02-06 5:36 ` Nirjhar Roy (IBM) 0 siblings, 1 reply; 48+ messages in thread From: Darrick J. Wong @ 2026-02-05 16:36 UTC (permalink / raw) To: Nirjhar Roy (IBM) Cc: Kundan Kumar, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Thu, Feb 05, 2026 at 12:06:19PM +0530, Nirjhar Roy (IBM) wrote: > On Fri, 2026-01-16 at 15:38 +0530, Kundan Kumar wrote: > > Add per-inode structures to track predicted AGs of dirty folios using > > an xarray and bitmap. This enables efficient identification of AGs > > involved in writeback. > > > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > > --- > > fs/xfs/xfs_icache.c | 27 +++++++++++++++++++++++++++ > > fs/xfs/xfs_inode.h | 5 +++++ > > 2 files changed, 32 insertions(+) > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > index e44040206851..f97aa6d66271 100644 > > --- a/fs/xfs/xfs_icache.c > > +++ b/fs/xfs/xfs_icache.c > > @@ -80,6 +80,25 @@ static inline xa_mark_t ici_tag_to_mark(unsigned int tag) > > return XFS_PERAG_BLOCKGC_MARK; > > } > > > > +static int xfs_inode_init_ag_bitmap(struct xfs_inode *ip) > Similar comment as before: > static int > xfs_inode_init...() > > +{ > > + unsigned int bits = ip->i_mount->m_sb.sb_agcount; > > + unsigned int nlongs; > > + > > + xa_init_flags(&ip->i_ag_pmap, XA_FLAGS_LOCK_IRQ); > Nit: The name of the functions suggests that it is initializing the tracking bitmap which it does - > however, the above line does slightly different thing? Maybe move the xarray init outside the bitmap > init function? Or just call it something else? xfs_inode_init_perag_wb? > > + ip->i_ag_dirty_bitmap = NULL; > > + ip->i_ag_dirty_bits = bits; > > + > > + if (!bits) > Umm, !bits means agcount is 0. Shouldn't we ASSERT that bits >= 2? Or am I missing something? Technically you can have 1 AG, but you definitely can't mount a zero AG filesystem. > > + return 0; > > + > > + nlongs = BITS_TO_LONGS(bits); > > + ip->i_ag_dirty_bitmap = kcalloc(nlongs, sizeof(unsigned long), > > + GFP_NOFS); > > + > > + return ip->i_ag_dirty_bitmap ? 0 : -ENOMEM; > > +} > > + > > /* > > * Allocate and initialise an xfs_inode. > > */ > > @@ -131,6 +150,8 @@ xfs_inode_alloc( > > ip->i_next_unlinked = NULLAGINO; > > ip->i_prev_unlinked = 0; > > > > + xfs_inode_init_ag_bitmap(ip); > xfs_inode_init_ag_bitmap() returns int - error handling for -ENOMEM? > > + > > return ip; > > } > > > > @@ -194,6 +215,12 @@ xfs_inode_free( > > ip->i_ino = 0; > > spin_unlock(&ip->i_flags_lock); > > > > + /* free xarray contents (values are immediate packed ints) */ > > + xa_destroy(&ip->i_ag_pmap); > Nit:Maybe have a small wrapper for freeing it the prediction map? No hard preferences though. > > + kfree(ip->i_ag_dirty_bitmap); > > + ip->i_ag_dirty_bitmap = NULL; > Nit: Usually while freeing the pointers I prefer: > t = ip->i_ag_dirty_bitmap; > ip->i_ag_dirty_bitmap = NULL; > kfree(t); > In this way, the pointer(i_ag_dirty_bitmap in this case) that I am freeing never points to an > already freed address. 
> > > + ip->i_ag_dirty_bits = 0; > > + > > __xfs_inode_free(ip); > > } > > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > > index bd6d33557194..dee449168605 100644 > > --- a/fs/xfs/xfs_inode.h > > +++ b/fs/xfs/xfs_inode.h > > @@ -99,6 +99,11 @@ typedef struct xfs_inode { > > spinlock_t i_ioend_lock; > > struct work_struct i_ioend_work; > > struct list_head i_ioend_list; > > + > > + /* AG prediction map: pgoff_t -> packed u32 */ > > + struct xarray i_ag_pmap; > > + unsigned long *i_ag_dirty_bitmap; > > + unsigned int i_ag_dirty_bits; > Not sure but, I mostly see the typedefed versions of data types being used like uint32 etc. Darrick, > hch, are the above fine? Yes, please don't mix types. Pick one type and stick with it. (and yes I wish we could struct bitmap_t(unsigned long)) --D > --NR > > } xfs_inode_t; > > > > static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip) > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap 2026-02-05 16:36 ` Darrick J. Wong @ 2026-02-06 5:36 ` Nirjhar Roy (IBM) 2026-02-06 5:57 ` Darrick J. Wong 0 siblings, 1 reply; 48+ messages in thread From: Nirjhar Roy (IBM) @ 2026-02-06 5:36 UTC (permalink / raw) To: Darrick J. Wong Cc: Kundan Kumar, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Thu, 2026-02-05 at 08:36 -0800, Darrick J. Wong wrote: > On Thu, Feb 05, 2026 at 12:06:19PM +0530, Nirjhar Roy (IBM) wrote: > > On Fri, 2026-01-16 at 15:38 +0530, Kundan Kumar wrote: > > > Add per-inode structures to track predicted AGs of dirty folios using > > > an xarray and bitmap. This enables efficient identification of AGs > > > involved in writeback. > > > > > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > > > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > > > --- > > > fs/xfs/xfs_icache.c | 27 +++++++++++++++++++++++++++ > > > fs/xfs/xfs_inode.h | 5 +++++ > > > 2 files changed, 32 insertions(+) > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > index e44040206851..f97aa6d66271 100644 > > > --- a/fs/xfs/xfs_icache.c > > > +++ b/fs/xfs/xfs_icache.c > > > @@ -80,6 +80,25 @@ static inline xa_mark_t ici_tag_to_mark(unsigned int tag) > > > return XFS_PERAG_BLOCKGC_MARK; > > > } > > > > > > +static int xfs_inode_init_ag_bitmap(struct xfs_inode *ip) > > Similar comment as before: > > static int > > xfs_inode_init...() > > > +{ > > > + unsigned int bits = ip->i_mount->m_sb.sb_agcount; > > > + unsigned int nlongs; > > > + > > > + xa_init_flags(&ip->i_ag_pmap, XA_FLAGS_LOCK_IRQ); > > Nit: The name of the functions suggests that it is initializing the tracking bitmap which it does - > > however, the above line does slightly different thing? Maybe move the xarray init outside the bitmap > > init function? > > Or just call it something else? xfs_inode_init_perag_wb? > > > > + ip->i_ag_dirty_bitmap = NULL; > > > + ip->i_ag_dirty_bits = bits; > > > + > > > + if (!bits) > > Umm, !bits means agcount is 0. Shouldn't we ASSERT that bits >= 2? Or am I missing something? > > Technically you can have 1 AG, but you definitely can't mount a zero AG > filesystem. Okay, but: /home/ubuntu$ mkfs.xfs -f -d agcount=1 /dev/loop0 Filesystem must have at least 2 superblocks for redundancy! Usage: mkfs.xfs Or maybe this restriction is just at the userspace tool level? > > > > + return 0; > > > + > > > + nlongs = BITS_TO_LONGS(bits); > > > + ip->i_ag_dirty_bitmap = kcalloc(nlongs, sizeof(unsigned long), > > > + GFP_NOFS); > > > + > > > + return ip->i_ag_dirty_bitmap ? 0 : -ENOMEM; > > > +} > > > + > > > /* > > > * Allocate and initialise an xfs_inode. > > > */ > > > @@ -131,6 +150,8 @@ xfs_inode_alloc( > > > ip->i_next_unlinked = NULLAGINO; > > > ip->i_prev_unlinked = 0; > > > > > > + xfs_inode_init_ag_bitmap(ip); > > xfs_inode_init_ag_bitmap() returns int - error handling for -ENOMEM? > > > + > > > return ip; > > > } > > > > > > @@ -194,6 +215,12 @@ xfs_inode_free( > > > ip->i_ino = 0; > > > spin_unlock(&ip->i_flags_lock); > > > > > > + /* free xarray contents (values are immediate packed ints) */ > > > + xa_destroy(&ip->i_ag_pmap); > > Nit:Maybe have a small wrapper for freeing it the prediction map? No hard preferences though. 
> > > + kfree(ip->i_ag_dirty_bitmap); > > > + ip->i_ag_dirty_bitmap = NULL; > > Nit: Usually while freeing the pointers I prefer: > > t = ip->i_ag_dirty_bitmap; > > ip->i_ag_dirty_bitmap = NULL; > > kfree(t); > > In this way, the pointer(i_ag_dirty_bitmap in this case) that I am freeing never points to an > > already freed address. > > > > > + ip->i_ag_dirty_bits = 0; > > > + > > > __xfs_inode_free(ip); > > > } > > > > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > > > index bd6d33557194..dee449168605 100644 > > > --- a/fs/xfs/xfs_inode.h > > > +++ b/fs/xfs/xfs_inode.h > > > @@ -99,6 +99,11 @@ typedef struct xfs_inode { > > > spinlock_t i_ioend_lock; > > > struct work_struct i_ioend_work; > > > struct list_head i_ioend_list; > > > + > > > + /* AG prediction map: pgoff_t -> packed u32 */ > > > + struct xarray i_ag_pmap; > > > + unsigned long *i_ag_dirty_bitmap; > > > + unsigned int i_ag_dirty_bits; > > Not sure but, I mostly see the typedefed versions of data types being used like uint32 etc. Darrick, > > hch, are the above fine? > > Yes, please don't mix types. Pick one type and stick with it. > > (and yes I wish we could struct bitmap_t(unsigned long)) > > --D > > > --NR > > > } xfs_inode_t; > > > > > > static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip) ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap 2026-02-06 5:36 ` Nirjhar Roy (IBM) @ 2026-02-06 5:57 ` Darrick J. Wong 2026-02-06 6:03 ` Nirjhar Roy (IBM) 0 siblings, 1 reply; 48+ messages in thread From: Darrick J. Wong @ 2026-02-06 5:57 UTC (permalink / raw) To: Nirjhar Roy (IBM) Cc: Kundan Kumar, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, Feb 06, 2026 at 11:06:03AM +0530, Nirjhar Roy (IBM) wrote: > On Thu, 2026-02-05 at 08:36 -0800, Darrick J. Wong wrote: > > On Thu, Feb 05, 2026 at 12:06:19PM +0530, Nirjhar Roy (IBM) wrote: > > > On Fri, 2026-01-16 at 15:38 +0530, Kundan Kumar wrote: > > > > Add per-inode structures to track predicted AGs of dirty folios using > > > > an xarray and bitmap. This enables efficient identification of AGs > > > > involved in writeback. > > > > > > > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > > > > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > > > > --- > > > > fs/xfs/xfs_icache.c | 27 +++++++++++++++++++++++++++ > > > > fs/xfs/xfs_inode.h | 5 +++++ > > > > 2 files changed, 32 insertions(+) > > > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > > index e44040206851..f97aa6d66271 100644 > > > > --- a/fs/xfs/xfs_icache.c > > > > +++ b/fs/xfs/xfs_icache.c > > > > @@ -80,6 +80,25 @@ static inline xa_mark_t ici_tag_to_mark(unsigned int tag) > > > > return XFS_PERAG_BLOCKGC_MARK; > > > > } > > > > > > > > +static int xfs_inode_init_ag_bitmap(struct xfs_inode *ip) > > > Similar comment as before: > > > static int > > > xfs_inode_init...() > > > > +{ > > > > + unsigned int bits = ip->i_mount->m_sb.sb_agcount; > > > > + unsigned int nlongs; > > > > + > > > > + xa_init_flags(&ip->i_ag_pmap, XA_FLAGS_LOCK_IRQ); > > > Nit: The name of the functions suggests that it is initializing the tracking bitmap which it does - > > > however, the above line does slightly different thing? Maybe move the xarray init outside the bitmap > > > init function? > > > > Or just call it something else? xfs_inode_init_perag_wb? > > > > > > + ip->i_ag_dirty_bitmap = NULL; > > > > + ip->i_ag_dirty_bits = bits; > > > > + > > > > + if (!bits) > > > Umm, !bits means agcount is 0. Shouldn't we ASSERT that bits >= 2? Or am I missing something? > > > > Technically you can have 1 AG, but you definitely can't mount a zero AG > > filesystem. > Okay, but: > /home/ubuntu$ mkfs.xfs -f -d agcount=1 /dev/loop0 > Filesystem must have at least 2 superblocks for redundancy! > Usage: mkfs.xfs > Or maybe this restriction is just at the userspace tool level? Yeah. If the only super dies then the filesystem is completely unrecoverable, which is why you have to really fight mkfs to spit out single-AG filesystems. --D > > > > + return 0; > > > > + > > > > + nlongs = BITS_TO_LONGS(bits); > > > > + ip->i_ag_dirty_bitmap = kcalloc(nlongs, sizeof(unsigned long), > > > > + GFP_NOFS); > > > > + > > > > + return ip->i_ag_dirty_bitmap ? 0 : -ENOMEM; > > > > +} > > > > + > > > > /* > > > > * Allocate and initialise an xfs_inode. > > > > */ > > > > @@ -131,6 +150,8 @@ xfs_inode_alloc( > > > > ip->i_next_unlinked = NULLAGINO; > > > > ip->i_prev_unlinked = 0; > > > > > > > > + xfs_inode_init_ag_bitmap(ip); > > > xfs_inode_init_ag_bitmap() returns int - error handling for -ENOMEM? 
> > > > + > > > > return ip; > > > > } > > > > > > > > @@ -194,6 +215,12 @@ xfs_inode_free( > > > > ip->i_ino = 0; > > > > spin_unlock(&ip->i_flags_lock); > > > > > > > > + /* free xarray contents (values are immediate packed ints) */ > > > > + xa_destroy(&ip->i_ag_pmap); > > > Nit:Maybe have a small wrapper for freeing it the prediction map? No hard preferences though. > > > > + kfree(ip->i_ag_dirty_bitmap); > > > > + ip->i_ag_dirty_bitmap = NULL; > > > Nit: Usually while freeing the pointers I prefer: > > > t = ip->i_ag_dirty_bitmap; > > > ip->i_ag_dirty_bitmap = NULL; > > > kfree(t); > > > In this way, the pointer(i_ag_dirty_bitmap in this case) that I am freeing never points to an > > > already freed address. > > > > > > > + ip->i_ag_dirty_bits = 0; > > > > + > > > > __xfs_inode_free(ip); > > > > } > > > > > > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h > > > > index bd6d33557194..dee449168605 100644 > > > > --- a/fs/xfs/xfs_inode.h > > > > +++ b/fs/xfs/xfs_inode.h > > > > @@ -99,6 +99,11 @@ typedef struct xfs_inode { > > > > spinlock_t i_ioend_lock; > > > > struct work_struct i_ioend_work; > > > > struct list_head i_ioend_list; > > > > + > > > > + /* AG prediction map: pgoff_t -> packed u32 */ > > > > + struct xarray i_ag_pmap; > > > > + unsigned long *i_ag_dirty_bitmap; > > > > + unsigned int i_ag_dirty_bits; > > > Not sure but, I mostly see the typedefed versions of data types being used like uint32 etc. Darrick, > > > hch, are the above fine? > > > > Yes, please don't mix types. Pick one type and stick with it. > > > > (and yes I wish we could struct bitmap_t(unsigned long)) > > > > --D > > > > > --NR > > > > } xfs_inode_t; > > > > > > > > static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip) > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap 2026-02-06 5:57 ` Darrick J. Wong @ 2026-02-06 6:03 ` Nirjhar Roy (IBM) 0 siblings, 0 replies; 48+ messages in thread From: Nirjhar Roy (IBM) @ 2026-02-06 6:03 UTC (permalink / raw) To: Darrick J. Wong Cc: Kundan Kumar, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On 2/6/26 11:27, Darrick J. Wong wrote: > On Fri, Feb 06, 2026 at 11:06:03AM +0530, Nirjhar Roy (IBM) wrote: >> On Thu, 2026-02-05 at 08:36 -0800, Darrick J. Wong wrote: >>> On Thu, Feb 05, 2026 at 12:06:19PM +0530, Nirjhar Roy (IBM) wrote: >>>> On Fri, 2026-01-16 at 15:38 +0530, Kundan Kumar wrote: >>>>> Add per-inode structures to track predicted AGs of dirty folios using >>>>> an xarray and bitmap. This enables efficient identification of AGs >>>>> involved in writeback. >>>>> >>>>> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> >>>>> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> >>>>> --- >>>>> fs/xfs/xfs_icache.c | 27 +++++++++++++++++++++++++++ >>>>> fs/xfs/xfs_inode.h | 5 +++++ >>>>> 2 files changed, 32 insertions(+) >>>>> >>>>> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c >>>>> index e44040206851..f97aa6d66271 100644 >>>>> --- a/fs/xfs/xfs_icache.c >>>>> +++ b/fs/xfs/xfs_icache.c >>>>> @@ -80,6 +80,25 @@ static inline xa_mark_t ici_tag_to_mark(unsigned int tag) >>>>> return XFS_PERAG_BLOCKGC_MARK; >>>>> } >>>>> >>>>> +static int xfs_inode_init_ag_bitmap(struct xfs_inode *ip) >>>> Similar comment as before: >>>> static int >>>> xfs_inode_init...() >>>>> +{ >>>>> + unsigned int bits = ip->i_mount->m_sb.sb_agcount; >>>>> + unsigned int nlongs; >>>>> + >>>>> + xa_init_flags(&ip->i_ag_pmap, XA_FLAGS_LOCK_IRQ); >>>> Nit: The name of the functions suggests that it is initializing the tracking bitmap which it does - >>>> however, the above line does slightly different thing? Maybe move the xarray init outside the bitmap >>>> init function? >>> Or just call it something else? xfs_inode_init_perag_wb? >>> >>>>> + ip->i_ag_dirty_bitmap = NULL; >>>>> + ip->i_ag_dirty_bits = bits; >>>>> + >>>>> + if (!bits) >>>> Umm, !bits means agcount is 0. Shouldn't we ASSERT that bits >= 2? Or am I missing something? >>> Technically you can have 1 AG, but you definitely can't mount a zero AG >>> filesystem. >> Okay, but: >> /home/ubuntu$ mkfs.xfs -f -d agcount=1 /dev/loop0 >> Filesystem must have at least 2 superblocks for redundancy! >> Usage: mkfs.xfs >> Or maybe this restriction is just at the userspace tool level? > Yeah. If the only super dies then the filesystem is completely > unrecoverable, which is why you have to really fight mkfs to spit out > single-AG filesystems. Okay. But I think there are certain places in the kernel code where 1 AG is treated as an -EINVAL. For instance [1] [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/xfs_fsops.c#n140 --NR > > --D > >>>>> + return 0; >>>>> + >>>>> + nlongs = BITS_TO_LONGS(bits); >>>>> + ip->i_ag_dirty_bitmap = kcalloc(nlongs, sizeof(unsigned long), >>>>> + GFP_NOFS); >>>>> + >>>>> + return ip->i_ag_dirty_bitmap ? 0 : -ENOMEM; >>>>> +} >>>>> + >>>>> /* >>>>> * Allocate and initialise an xfs_inode. >>>>> */ >>>>> @@ -131,6 +150,8 @@ xfs_inode_alloc( >>>>> ip->i_next_unlinked = NULLAGINO; >>>>> ip->i_prev_unlinked = 0; >>>>> >>>>> + xfs_inode_init_ag_bitmap(ip); >>>> xfs_inode_init_ag_bitmap() returns int - error handling for -ENOMEM? 
>>>>> + >>>>> return ip; >>>>> } >>>>> >>>>> @@ -194,6 +215,12 @@ xfs_inode_free( >>>>> ip->i_ino = 0; >>>>> spin_unlock(&ip->i_flags_lock); >>>>> >>>>> + /* free xarray contents (values are immediate packed ints) */ >>>>> + xa_destroy(&ip->i_ag_pmap); >>>> Nit:Maybe have a small wrapper for freeing it the prediction map? No hard preferences though. >>>>> + kfree(ip->i_ag_dirty_bitmap); >>>>> + ip->i_ag_dirty_bitmap = NULL; >>>> Nit: Usually while freeing the pointers I prefer: >>>> t = ip->i_ag_dirty_bitmap; >>>> ip->i_ag_dirty_bitmap = NULL; >>>> kfree(t); >>>> In this way, the pointer(i_ag_dirty_bitmap in this case) that I am freeing never points to an >>>> already freed address. >>>> >>>>> + ip->i_ag_dirty_bits = 0; >>>>> + >>>>> __xfs_inode_free(ip); >>>>> } >>>>> >>>>> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h >>>>> index bd6d33557194..dee449168605 100644 >>>>> --- a/fs/xfs/xfs_inode.h >>>>> +++ b/fs/xfs/xfs_inode.h >>>>> @@ -99,6 +99,11 @@ typedef struct xfs_inode { >>>>> spinlock_t i_ioend_lock; >>>>> struct work_struct i_ioend_work; >>>>> struct list_head i_ioend_list; >>>>> + >>>>> + /* AG prediction map: pgoff_t -> packed u32 */ >>>>> + struct xarray i_ag_pmap; >>>>> + unsigned long *i_ag_dirty_bitmap; >>>>> + unsigned int i_ag_dirty_bits; >>>> Not sure but, I mostly see the typedefed versions of data types being used like uint32 etc. Darrick, >>>> hch, are the above fine? >>> Yes, please don't mix types. Pick one type and stick with it. >>> >>> (and yes I wish we could struct bitmap_t(unsigned long)) >>> >>> --D >>> >>>> --NR >>>>> } xfs_inode_t; >>>>> >>>>> static inline bool xfs_inode_on_unlinked_list(const struct xfs_inode *ip) >> -- Nirjhar Roy Linux Kernel Developer IBM, Bangalore ^ permalink raw reply [flat|nested] 48+ messages in thread
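Given that single-AG filesystems can exist even though mkfs resists creating them, one low-cost way to settle this question in xfs_inode_init_ag_bitmap() would be to treat agcount <= 1 as "nothing to shard" and leave such inodes on the normal writeback path. A sketch under that assumption; this is not part of the posted patch:

	/*
	 * A zero-AG fs is unmountable and a single-AG fs has nothing to
	 * shard across, so skip the bitmap allocation entirely and let
	 * writeback take the regular path.
	 */
	if (bits <= 1) {
		ip->i_ag_dirty_bits = 0;
		return 0;
	}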
* Re: [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap 2026-01-16 10:08 ` [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap Kundan Kumar 2026-01-29 0:44 ` Darrick J. Wong 2026-02-05 6:36 ` Nirjhar Roy (IBM) @ 2026-02-06 7:00 ` Christoph Hellwig 2 siblings, 0 replies; 48+ messages in thread From: Christoph Hellwig @ 2026-02-06 7:00 UTC (permalink / raw) To: Kundan Kumar Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, Jan 16, 2026 at 03:38:15PM +0530, Kundan Kumar wrote: > @@ -99,6 +99,11 @@ typedef struct xfs_inode { > spinlock_t i_ioend_lock; > struct work_struct i_ioend_work; > struct list_head i_ioend_list; > + > + /* AG prediction map: pgoff_t -> packed u32 */ > + struct xarray i_ag_pmap; > + unsigned long *i_ag_dirty_bitmap; > + unsigned int i_ag_dirty_bits; I don't think bloating the inode like this is acceptable. As said in my reply to patch 5, I think we're better off with more coarse-grained sharding anyway, i.e. just track an AG per inode. We did this for zoned xfs by pointing to the open zone using the existing and otherwise unused inode->i_private field with great success. You won't get as good placement for some use cases where files are fragmented over multiple AGs and you can place data near close blocks right now, but you'll save a lot of overhead searching the alloc btrees, and probably generate a write pattern more suitable for SSDs. ^ permalink raw reply [flat|nested] 48+ messages in thread
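For reference, a minimal sketch of the coarser per-inode sharding suggested here, modelled loosely on the zoned-XFS use of inode->i_private; the helper names are illustrative and not part of this series or the zoned code:

	/* remember the AG the next allocation for this inode should target */
	static inline void
	xfs_inode_set_wb_agno(
		struct xfs_inode	*ip,
		xfs_agnumber_t		agno)
	{
		/* stored as an immediate value, so there is nothing to free */
		WRITE_ONCE(VFS_I(ip)->i_private, (void *)(unsigned long)agno);
	}

	static inline xfs_agnumber_t
	xfs_inode_wb_agno(
		struct xfs_inode	*ip)
	{
		return (xfs_agnumber_t)(unsigned long)
				READ_ONCE(VFS_I(ip)->i_private);
	}

A single per-inode hint like this would make the dirty-AG bitmap and the per-folio xarray unnecessary, at the cost of routing all of an inode's writeback through one AG worker at a time.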
[parent not found: <CGME20260116101256epcas5p2d6125a6bcad78c33f737fdc3484aca79@epcas5p2.samsung.com>]
* [PATCH v3 4/6] xfs: tag folios with AG number during buffered write via iomap attach hook [not found] ` <CGME20260116101256epcas5p2d6125a6bcad78c33f737fdc3484aca79@epcas5p2.samsung.com> @ 2026-01-16 10:08 ` Kundan Kumar 2026-01-29 0:47 ` Darrick J. Wong 2026-02-06 6:44 ` Nirjhar Roy (IBM) 0 siblings, 2 replies; 48+ messages in thread From: Kundan Kumar @ 2026-01-16 10:08 UTC (permalink / raw) To: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei Cc: linux-fsdevel, linux-mm, linux-xfs, gost.dev, kundan.kumar, anuj20.g, vishak.g, joshi.k Use the iomap attach hook to tag folios with their predicted allocation group at write time. Mapped extents derive AG directly; delalloc and hole cases use a lightweight predictor. Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> --- fs/xfs/xfs_iomap.c | 114 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 114 insertions(+) diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 490e12cb99be..3c927ce118fe 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -12,6 +12,9 @@ #include "xfs_trans_resv.h" #include "xfs_mount.h" #include "xfs_inode.h" +#include "xfs_alloc.h" +#include "xfs_ag.h" +#include "xfs_ag_resv.h" #include "xfs_btree.h" #include "xfs_bmap_btree.h" #include "xfs_bmap.h" @@ -92,8 +95,119 @@ xfs_iomap_valid( return true; } +static xfs_agnumber_t +xfs_predict_delalloc_agno(const struct xfs_inode *ip, loff_t pos, loff_t len) +{ + struct xfs_mount *mp = ip->i_mount; + xfs_agnumber_t start_agno, agno, best_agno; + struct xfs_perag *pag; + + xfs_extlen_t free, resv, avail; + xfs_extlen_t need_fsbs, min_free_fsbs; + xfs_extlen_t best_free = 0; + xfs_agnumber_t agcount = mp->m_sb.sb_agcount; + + /* RT inodes allocate from the realtime volume */ + if (XFS_IS_REALTIME_INODE(ip)) + return XFS_INO_TO_AGNO(mp, ip->i_ino); + + start_agno = XFS_INO_TO_AGNO(mp, ip->i_ino); + + /* + * size-based minimum free requirement. + * Convert bytes to fsbs and require some slack. + */ + need_fsbs = XFS_B_TO_FSB(mp, (xfs_fsize_t)len); + min_free_fsbs = need_fsbs + max_t(xfs_extlen_t, need_fsbs >> 2, 128); + + /* + * scan AGs starting at start_agno and wrapping. + * Pick the first AG that meets min_free_fsbs after reservations. + * Keep a "best" fallback = maximum (free - resv). 
+ */ + best_agno = start_agno; + + for (xfs_agnumber_t i = 0; i < agcount; i++) { + agno = (start_agno + i) % agcount; + pag = xfs_perag_get(mp, agno); + + if (!xfs_perag_initialised_agf(pag)) + goto next; + + free = READ_ONCE(pag->pagf_freeblks); + resv = xfs_ag_resv_needed(pag, XFS_AG_RESV_NONE); + + if (free <= resv) + goto next; + + avail = free - resv; + + if (avail >= min_free_fsbs) { + xfs_perag_put(pag); + return agno; + } + + if (avail > best_free) { + best_free = avail; + best_agno = agno; + } +next: + xfs_perag_put(pag); + } + + return best_agno; +} + +static inline xfs_agnumber_t xfs_ag_from_iomap(const struct xfs_mount *mp, + const struct iomap *iomap, + const struct xfs_inode *ip, loff_t pos, size_t len) +{ + if (iomap->type == IOMAP_MAPPED || iomap->type == IOMAP_UNWRITTEN) { + /* iomap->addr is byte address on device for buffered I/O */ + xfs_fsblock_t fsb = XFS_BB_TO_FSBT(mp, BTOBB(iomap->addr)); + + return XFS_FSB_TO_AGNO(mp, fsb); + } else if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_DELALLOC) { + return xfs_predict_delalloc_agno(ip, pos, len); + } + + return XFS_INO_TO_AGNO(mp, ip->i_ino); +} + +static void xfs_agp_set(struct xfs_inode *ip, pgoff_t index, + xfs_agnumber_t agno, u8 type) +{ + u32 packed = xfs_agp_pack((u32)agno, type, true); + + /* store as immediate value */ + xa_store(&ip->i_ag_pmap, index, xa_mk_value(packed), GFP_NOFS); + + /* Mark this AG as having potential dirty work */ + if (ip->i_ag_dirty_bitmap && (u32)agno < ip->i_ag_dirty_bits) + set_bit((u32)agno, ip->i_ag_dirty_bitmap); +} + +static void +xfs_iomap_tag_folio(const struct iomap *iomap, struct folio *folio, + loff_t pos, size_t len) +{ + struct inode *inode; + struct xfs_inode *ip; + struct xfs_mount *mp; + xfs_agnumber_t agno; + + inode = folio_mapping(folio)->host; + ip = XFS_I(inode); + mp = ip->i_mount; + + agno = xfs_ag_from_iomap(mp, iomap, ip, pos, len); + + xfs_agp_set(ip, folio->index, agno, (u8)iomap->type); +} + const struct iomap_write_ops xfs_iomap_write_ops = { .iomap_valid = xfs_iomap_valid, + .tag_folio = xfs_iomap_tag_folio, }; int -- 2.25.1 ^ permalink raw reply [flat|nested] 48+ messages in thread
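For context, the iomap side of the .tag_folio hook comes from patch 1/6, which is not quoted in this subthread; the call site would look roughly like the following in the buffered write path, with the exact placement assumed:

	/* fs/iomap/buffered-io.c: once the folio for this write is ready */
	if (write_ops && write_ops->tag_folio)
		write_ops->tag_folio(&iter->iomap, folio, pos, len);

The hook is advisory: filesystems that do not set .tag_folio see no change in behavior.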
* Re: [PATCH v3 4/6] xfs: tag folios with AG number during buffered write via iomap attach hook 2026-01-16 10:08 ` [PATCH v3 4/6] xfs: tag folios with AG number during buffered write via iomap attach hook Kundan Kumar @ 2026-01-29 0:47 ` Darrick J. Wong 2026-01-29 22:40 ` Darrick J. Wong 2026-02-03 7:28 ` Kundan Kumar 2026-02-06 6:44 ` Nirjhar Roy (IBM) 1 sibling, 2 replies; 48+ messages in thread From: Darrick J. Wong @ 2026-01-29 0:47 UTC (permalink / raw) To: Kundan Kumar Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, Jan 16, 2026 at 03:38:16PM +0530, Kundan Kumar wrote: > Use the iomap attach hook to tag folios with their predicted > allocation group at write time. Mapped extents derive AG directly; > delalloc and hole cases use a lightweight predictor. > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > --- > fs/xfs/xfs_iomap.c | 114 +++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 114 insertions(+) > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index 490e12cb99be..3c927ce118fe 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -12,6 +12,9 @@ > #include "xfs_trans_resv.h" > #include "xfs_mount.h" > #include "xfs_inode.h" > +#include "xfs_alloc.h" > +#include "xfs_ag.h" > +#include "xfs_ag_resv.h" > #include "xfs_btree.h" > #include "xfs_bmap_btree.h" > #include "xfs_bmap.h" > @@ -92,8 +95,119 @@ xfs_iomap_valid( > return true; > } > > +static xfs_agnumber_t > +xfs_predict_delalloc_agno(const struct xfs_inode *ip, loff_t pos, loff_t len) > +{ > + struct xfs_mount *mp = ip->i_mount; > + xfs_agnumber_t start_agno, agno, best_agno; > + struct xfs_perag *pag; > + > + xfs_extlen_t free, resv, avail; > + xfs_extlen_t need_fsbs, min_free_fsbs; > + xfs_extlen_t best_free = 0; > + xfs_agnumber_t agcount = mp->m_sb.sb_agcount; > + > + /* RT inodes allocate from the realtime volume */ > + if (XFS_IS_REALTIME_INODE(ip)) > + return XFS_INO_TO_AGNO(mp, ip->i_ino); > + > + start_agno = XFS_INO_TO_AGNO(mp, ip->i_ino); > + > + /* > + * size-based minimum free requirement. > + * Convert bytes to fsbs and require some slack. > + */ > + need_fsbs = XFS_B_TO_FSB(mp, (xfs_fsize_t)len); > + min_free_fsbs = need_fsbs + max_t(xfs_extlen_t, need_fsbs >> 2, 128); > + > + /* > + * scan AGs starting at start_agno and wrapping. > + * Pick the first AG that meets min_free_fsbs after reservations. > + * Keep a "best" fallback = maximum (free - resv). 
> + */ > + best_agno = start_agno; > + > + for (xfs_agnumber_t i = 0; i < agcount; i++) { > + agno = (start_agno + i) % agcount; > + pag = xfs_perag_get(mp, agno); > + > + if (!xfs_perag_initialised_agf(pag)) > + goto next; > + > + free = READ_ONCE(pag->pagf_freeblks); > + resv = xfs_ag_resv_needed(pag, XFS_AG_RESV_NONE); > + > + if (free <= resv) > + goto next; > + > + avail = free - resv; > + > + if (avail >= min_free_fsbs) { > + xfs_perag_put(pag); > + return agno; > + } > + > + if (avail > best_free) { > + best_free = avail; > + best_agno = agno; > + } > +next: > + xfs_perag_put(pag); > + } > + > + return best_agno; > +} > + > +static inline xfs_agnumber_t xfs_ag_from_iomap(const struct xfs_mount *mp, > + const struct iomap *iomap, > + const struct xfs_inode *ip, loff_t pos, size_t len) > +{ > + if (iomap->type == IOMAP_MAPPED || iomap->type == IOMAP_UNWRITTEN) { > + /* iomap->addr is byte address on device for buffered I/O */ > + xfs_fsblock_t fsb = XFS_BB_TO_FSBT(mp, BTOBB(iomap->addr)); > + > + return XFS_FSB_TO_AGNO(mp, fsb); > + } else if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_DELALLOC) { > + return xfs_predict_delalloc_agno(ip, pos, len); Is it worth doing an AG scan to guess where the allocation might come from? The predictions could turn out to be wrong by virtue of other delalloc regions being written back between the time that xfs_agp_set is called, and the actual bmapi_write call. > + } > + > + return XFS_INO_TO_AGNO(mp, ip->i_ino); > +} > + > +static void xfs_agp_set(struct xfs_inode *ip, pgoff_t index, > + xfs_agnumber_t agno, u8 type) > +{ > + u32 packed = xfs_agp_pack((u32)agno, type, true); > + > + /* store as immediate value */ > + xa_store(&ip->i_ag_pmap, index, xa_mk_value(packed), GFP_NOFS); > + > + /* Mark this AG as having potential dirty work */ > + if (ip->i_ag_dirty_bitmap && (u32)agno < ip->i_ag_dirty_bits) > + set_bit((u32)agno, ip->i_ag_dirty_bitmap); > +} > + > +static void > +xfs_iomap_tag_folio(const struct iomap *iomap, struct folio *folio, > + loff_t pos, size_t len) > +{ > + struct inode *inode; > + struct xfs_inode *ip; > + struct xfs_mount *mp; > + xfs_agnumber_t agno; > + > + inode = folio_mapping(folio)->host; > + ip = XFS_I(inode); > + mp = ip->i_mount; > + > + agno = xfs_ag_from_iomap(mp, iomap, ip, pos, len); > + > + xfs_agp_set(ip, folio->index, agno, (u8)iomap->type); Hrm, so no, the ag_pmap only caches the ag number for the index of a folio, even if it spans many many blocks. --D > +} > + > const struct iomap_write_ops xfs_iomap_write_ops = { > .iomap_valid = xfs_iomap_valid, > + .tag_folio = xfs_iomap_tag_folio, > }; > > int > -- > 2.25.1 > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 4/6] xfs: tag folios with AG number during buffered write via iomap attach hook 2026-01-29 0:47 ` Darrick J. Wong @ 2026-01-29 22:40 ` Darrick J. Wong 2026-02-03 7:32 ` Kundan Kumar 2026-02-03 7:28 ` Kundan Kumar 1 sibling, 1 reply; 48+ messages in thread From: Darrick J. Wong @ 2026-01-29 22:40 UTC (permalink / raw) To: Kundan Kumar Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Wed, Jan 28, 2026 at 04:47:45PM -0800, Darrick J. Wong wrote: > On Fri, Jan 16, 2026 at 03:38:16PM +0530, Kundan Kumar wrote: > > Use the iomap attach hook to tag folios with their predicted > > allocation group at write time. Mapped extents derive AG directly; > > delalloc and hole cases use a lightweight predictor. > > > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > > --- > > fs/xfs/xfs_iomap.c | 114 +++++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 114 insertions(+) > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > > index 490e12cb99be..3c927ce118fe 100644 > > --- a/fs/xfs/xfs_iomap.c > > +++ b/fs/xfs/xfs_iomap.c > > @@ -12,6 +12,9 @@ > > #include "xfs_trans_resv.h" > > #include "xfs_mount.h" > > #include "xfs_inode.h" > > +#include "xfs_alloc.h" > > +#include "xfs_ag.h" > > +#include "xfs_ag_resv.h" > > #include "xfs_btree.h" > > #include "xfs_bmap_btree.h" > > #include "xfs_bmap.h" > > @@ -92,8 +95,119 @@ xfs_iomap_valid( > > return true; > > } > > > > +static xfs_agnumber_t > > +xfs_predict_delalloc_agno(const struct xfs_inode *ip, loff_t pos, loff_t len) > > +{ > > + struct xfs_mount *mp = ip->i_mount; > > + xfs_agnumber_t start_agno, agno, best_agno; > > + struct xfs_perag *pag; > > + > > + xfs_extlen_t free, resv, avail; > > + xfs_extlen_t need_fsbs, min_free_fsbs; > > + xfs_extlen_t best_free = 0; > > + xfs_agnumber_t agcount = mp->m_sb.sb_agcount; > > + > > + /* RT inodes allocate from the realtime volume */ > > + if (XFS_IS_REALTIME_INODE(ip)) > > + return XFS_INO_TO_AGNO(mp, ip->i_ino); > > + > > + start_agno = XFS_INO_TO_AGNO(mp, ip->i_ino); > > + > > + /* > > + * size-based minimum free requirement. > > + * Convert bytes to fsbs and require some slack. > > + */ > > + need_fsbs = XFS_B_TO_FSB(mp, (xfs_fsize_t)len); > > + min_free_fsbs = need_fsbs + max_t(xfs_extlen_t, need_fsbs >> 2, 128); > > + > > + /* > > + * scan AGs starting at start_agno and wrapping. > > + * Pick the first AG that meets min_free_fsbs after reservations. > > + * Keep a "best" fallback = maximum (free - resv). 
> > + */ > > + best_agno = start_agno; > > + > > + for (xfs_agnumber_t i = 0; i < agcount; i++) { > > + agno = (start_agno + i) % agcount; > > + pag = xfs_perag_get(mp, agno); > > + > > + if (!xfs_perag_initialised_agf(pag)) > > + goto next; > > + > > + free = READ_ONCE(pag->pagf_freeblks); > > + resv = xfs_ag_resv_needed(pag, XFS_AG_RESV_NONE); > > + > > + if (free <= resv) > > + goto next; > > + > > + avail = free - resv; > > + > > + if (avail >= min_free_fsbs) { > > + xfs_perag_put(pag); > > + return agno; > > + } > > + > > + if (avail > best_free) { > > + best_free = avail; > > + best_agno = agno; > > + } > > +next: > > + xfs_perag_put(pag); > > + } > > + > > + return best_agno; > > +} > > + > > +static inline xfs_agnumber_t xfs_ag_from_iomap(const struct xfs_mount *mp, > > + const struct iomap *iomap, > > + const struct xfs_inode *ip, loff_t pos, size_t len) > > +{ > > + if (iomap->type == IOMAP_MAPPED || iomap->type == IOMAP_UNWRITTEN) { > > + /* iomap->addr is byte address on device for buffered I/O */ > > + xfs_fsblock_t fsb = XFS_BB_TO_FSBT(mp, BTOBB(iomap->addr)); > > + > > + return XFS_FSB_TO_AGNO(mp, fsb); Also, what happens if this is a realtime file? For pre-rtgroups filesystems there is no group number to use; and for rtgroups you have to use xfs_rtb_to_rgno. The i_ag_dirty_bitmap and the m_ag_wb array will be the wrong size if rgcount != agcount; and also you probably don't want to have in the same per-group writeback list two inodes with folios having the same group number but writing to two different devices (data vs. rt). --D > > + } else if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_DELALLOC) { > > + return xfs_predict_delalloc_agno(ip, pos, len); > > Is it worth doing an AG scan to guess where the allocation might come > from? The predictions could turn out to be wrong by virtue of other > delalloc regions being written back between the time that xfs_agp_set is > called, and the actual bmapi_write call. > > > + } > > + > > + return XFS_INO_TO_AGNO(mp, ip->i_ino); > > +} > > + > > +static void xfs_agp_set(struct xfs_inode *ip, pgoff_t index, > > + xfs_agnumber_t agno, u8 type) > > +{ > > + u32 packed = xfs_agp_pack((u32)agno, type, true); > > + > > + /* store as immediate value */ > > + xa_store(&ip->i_ag_pmap, index, xa_mk_value(packed), GFP_NOFS); > > + > > + /* Mark this AG as having potential dirty work */ > > + if (ip->i_ag_dirty_bitmap && (u32)agno < ip->i_ag_dirty_bits) > > + set_bit((u32)agno, ip->i_ag_dirty_bitmap); > > +} > > + > > +static void > > +xfs_iomap_tag_folio(const struct iomap *iomap, struct folio *folio, > > + loff_t pos, size_t len) > > +{ > > + struct inode *inode; > > + struct xfs_inode *ip; > > + struct xfs_mount *mp; > > + xfs_agnumber_t agno; > > + > > + inode = folio_mapping(folio)->host; > > + ip = XFS_I(inode); > > + mp = ip->i_mount; > > + > > + agno = xfs_ag_from_iomap(mp, iomap, ip, pos, len); > > + > > + xfs_agp_set(ip, folio->index, agno, (u8)iomap->type); > > Hrm, so no, the ag_pmap only caches the ag number for the index of a > folio, even if it spans many many blocks. > > --D > > > +} > > + > > const struct iomap_write_ops xfs_iomap_write_ops = { > > .iomap_valid = xfs_iomap_valid, > > + .tag_folio = xfs_iomap_tag_folio, > > }; > > > > int > > -- > > 2.25.1 > > > > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 4/6] xfs: tag folios with AG number during buffered write via iomap attach hook 2026-01-29 22:40 ` Darrick J. Wong @ 2026-02-03 7:32 ` Kundan Kumar 0 siblings, 0 replies; 48+ messages in thread From: Kundan Kumar @ 2026-02-03 7:32 UTC (permalink / raw) To: Darrick J. Wong Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k > > Also, what happens if this is a realtime file? For pre-rtgroups > filesystems there is no group number to use; and for rtgroups you have > to use xfs_rtb_to_rgno. The i_ag_dirty_bitmap and the m_ag_wb array > will be the wrong size if rgcount != agcount; and also you probably > don't want to have in the same per-group writeback list two inodes with > folios having the same group number but writing to two different devices > (data vs. rt). > > --D > For this series I’ll explicitly exclude realtime inodes from AG routing (skip tagging / fall back to normal writeback). Rtgroup-aware routing would need a separate rg-domain (separate worker array/queues sized by rgcount) as follow-up work. ^ permalink raw reply [flat|nested] 48+ messages in thread
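A sketch of that exclusion, assuming the check sits at the top of the tagging hook as posted (v4 may structure it differently):

	static void
	xfs_iomap_tag_folio(
		const struct iomap	*iomap,
		struct folio		*folio,
		loff_t			pos,
		size_t			len)
	{
		struct xfs_inode	*ip = XFS_I(folio_mapping(folio)->host);
		xfs_agnumber_t		agno;

		/*
		 * Realtime files allocate from the rt device, so an AG hint
		 * is meaningless; leave the folio untagged and let it take
		 * the normal writeback path.
		 */
		if (XFS_IS_REALTIME_INODE(ip))
			return;

		agno = xfs_ag_from_iomap(ip->i_mount, iomap, ip, pos, len);
		xfs_agp_set(ip, folio->index, agno, (u8)iomap->type);
	}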
* Re: [PATCH v3 4/6] xfs: tag folios with AG number during buffered write via iomap attach hook 2026-01-29 0:47 ` Darrick J. Wong 2026-01-29 22:40 ` Darrick J. Wong @ 2026-02-03 7:28 ` Kundan Kumar 2026-02-05 15:56 ` Brian Foster 1 sibling, 1 reply; 48+ messages in thread From: Kundan Kumar @ 2026-02-03 7:28 UTC (permalink / raw) To: Darrick J. Wong Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On 1/29/2026 6:17 AM, Darrick J. Wong wrote: > On Fri, Jan 16, 2026 at 03:38:16PM +0530, Kundan Kumar wrote: >> Use the iomap attach hook to tag folios with their predicted >> allocation group at write time. Mapped extents derive AG directly; >> delalloc and hole cases use a lightweight predictor. >> >> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> >> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> >> --- >> fs/xfs/xfs_iomap.c | 114 +++++++++++++++++++++++++++++++++++++++++++++ >> 1 file changed, 114 insertions(+) >> >> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c >> index 490e12cb99be..3c927ce118fe 100644 >> --- a/fs/xfs/xfs_iomap.c >> +++ b/fs/xfs/xfs_iomap.c >> @@ -12,6 +12,9 @@ >> #include "xfs_trans_resv.h" >> #include "xfs_mount.h" >> #include "xfs_inode.h" >> +#include "xfs_alloc.h" >> +#include "xfs_ag.h" >> +#include "xfs_ag_resv.h" >> #include "xfs_btree.h" >> #include "xfs_bmap_btree.h" >> #include "xfs_bmap.h" >> @@ -92,8 +95,119 @@ xfs_iomap_valid( >> return true; >> } >> >> +static xfs_agnumber_t >> +xfs_predict_delalloc_agno(const struct xfs_inode *ip, loff_t pos, loff_t len) >> +{ >> + struct xfs_mount *mp = ip->i_mount; >> + xfs_agnumber_t start_agno, agno, best_agno; >> + struct xfs_perag *pag; >> + >> + xfs_extlen_t free, resv, avail; >> + xfs_extlen_t need_fsbs, min_free_fsbs; >> + xfs_extlen_t best_free = 0; >> + xfs_agnumber_t agcount = mp->m_sb.sb_agcount; >> + >> + /* RT inodes allocate from the realtime volume */ >> + if (XFS_IS_REALTIME_INODE(ip)) >> + return XFS_INO_TO_AGNO(mp, ip->i_ino); >> + >> + start_agno = XFS_INO_TO_AGNO(mp, ip->i_ino); >> + >> + /* >> + * size-based minimum free requirement. >> + * Convert bytes to fsbs and require some slack. >> + */ >> + need_fsbs = XFS_B_TO_FSB(mp, (xfs_fsize_t)len); >> + min_free_fsbs = need_fsbs + max_t(xfs_extlen_t, need_fsbs >> 2, 128); >> + >> + /* >> + * scan AGs starting at start_agno and wrapping. >> + * Pick the first AG that meets min_free_fsbs after reservations. >> + * Keep a "best" fallback = maximum (free - resv). 
>> + */ >> + best_agno = start_agno; >> + >> + for (xfs_agnumber_t i = 0; i < agcount; i++) { >> + agno = (start_agno + i) % agcount; >> + pag = xfs_perag_get(mp, agno); >> + >> + if (!xfs_perag_initialised_agf(pag)) >> + goto next; >> + >> + free = READ_ONCE(pag->pagf_freeblks); >> + resv = xfs_ag_resv_needed(pag, XFS_AG_RESV_NONE); >> + >> + if (free <= resv) >> + goto next; >> + >> + avail = free - resv; >> + >> + if (avail >= min_free_fsbs) { >> + xfs_perag_put(pag); >> + return agno; >> + } >> + >> + if (avail > best_free) { >> + best_free = avail; >> + best_agno = agno; >> + } >> +next: >> + xfs_perag_put(pag); >> + } >> + >> + return best_agno; >> +} >> + >> +static inline xfs_agnumber_t xfs_ag_from_iomap(const struct xfs_mount *mp, >> + const struct iomap *iomap, >> + const struct xfs_inode *ip, loff_t pos, size_t len) >> +{ >> + if (iomap->type == IOMAP_MAPPED || iomap->type == IOMAP_UNWRITTEN) { >> + /* iomap->addr is byte address on device for buffered I/O */ >> + xfs_fsblock_t fsb = XFS_BB_TO_FSBT(mp, BTOBB(iomap->addr)); >> + >> + return XFS_FSB_TO_AGNO(mp, fsb); >> + } else if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_DELALLOC) { >> + return xfs_predict_delalloc_agno(ip, pos, len); > > Is it worth doing an AG scan to guess where the allocation might come > from? The predictions could turn out to be wrong by virtue of other > delalloc regions being written back between the time that xfs_agp_set is > called, and the actual bmapi_write call. > The delalloc prediction works well in the common cases: (1) when an AG has sufficient free space and allocations stay within it, and (2) when an AG becomes full and allocation naturally moves to the next suitable AG. The only case where the prediction can be wrong is when an AG is in the process of being exhausted concurrently with writeback, so allocation shifts between the time we tag the folio and the actual bmapi_write. My understanding is that window is narrow, and only a small fraction of IOs would be misrouted. >> + } >> + >> + return XFS_INO_TO_AGNO(mp, ip->i_ino); >> +} >> + >> +static void xfs_agp_set(struct xfs_inode *ip, pgoff_t index, >> + xfs_agnumber_t agno, u8 type) >> +{ >> + u32 packed = xfs_agp_pack((u32)agno, type, true); >> + >> + /* store as immediate value */ >> + xa_store(&ip->i_ag_pmap, index, xa_mk_value(packed), GFP_NOFS); >> + >> + /* Mark this AG as having potential dirty work */ >> + if (ip->i_ag_dirty_bitmap && (u32)agno < ip->i_ag_dirty_bits) >> + set_bit((u32)agno, ip->i_ag_dirty_bitmap); >> +} >> + >> +static void >> +xfs_iomap_tag_folio(const struct iomap *iomap, struct folio *folio, >> + loff_t pos, size_t len) >> +{ >> + struct inode *inode; >> + struct xfs_inode *ip; >> + struct xfs_mount *mp; >> + xfs_agnumber_t agno; >> + >> + inode = folio_mapping(folio)->host; >> + ip = XFS_I(inode); >> + mp = ip->i_mount; >> + >> + agno = xfs_ag_from_iomap(mp, iomap, ip, pos, len); >> + >> + xfs_agp_set(ip, folio->index, agno, (u8)iomap->type); > > Hrm, so no, the ag_pmap only caches the ag number for the index of a > folio, even if it spans many many blocks. > > --D > Thanks for pointing out, I will rework to handle this case. >> +} >> + >> const struct iomap_write_ops xfs_iomap_write_ops = { >> .iomap_valid = xfs_iomap_valid, >> + .tag_folio = xfs_iomap_tag_folio, >> }; >> >> int >> -- >> 2.25.1 >> >> > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 4/6] xfs: tag folios with AG number during buffered write via iomap attach hook 2026-02-03 7:28 ` Kundan Kumar @ 2026-02-05 15:56 ` Brian Foster 0 siblings, 0 replies; 48+ messages in thread From: Brian Foster @ 2026-02-05 15:56 UTC (permalink / raw) To: Kundan Kumar Cc: Darrick J. Wong, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Tue, Feb 03, 2026 at 12:58:34PM +0530, Kundan Kumar wrote: > On 1/29/2026 6:17 AM, Darrick J. Wong wrote: > > On Fri, Jan 16, 2026 at 03:38:16PM +0530, Kundan Kumar wrote: > >> Use the iomap attach hook to tag folios with their predicted > >> allocation group at write time. Mapped extents derive AG directly; > >> delalloc and hole cases use a lightweight predictor. > >> > >> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > >> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > >> --- > >> fs/xfs/xfs_iomap.c | 114 +++++++++++++++++++++++++++++++++++++++++++++ > >> 1 file changed, 114 insertions(+) > >> > >> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > >> index 490e12cb99be..3c927ce118fe 100644 > >> --- a/fs/xfs/xfs_iomap.c > >> +++ b/fs/xfs/xfs_iomap.c > >> @@ -12,6 +12,9 @@ > >> #include "xfs_trans_resv.h" > >> #include "xfs_mount.h" > >> #include "xfs_inode.h" > >> +#include "xfs_alloc.h" > >> +#include "xfs_ag.h" > >> +#include "xfs_ag_resv.h" > >> #include "xfs_btree.h" > >> #include "xfs_bmap_btree.h" > >> #include "xfs_bmap.h" > >> @@ -92,8 +95,119 @@ xfs_iomap_valid( > >> return true; > >> } > >> > >> +static xfs_agnumber_t > >> +xfs_predict_delalloc_agno(const struct xfs_inode *ip, loff_t pos, loff_t len) > >> +{ > >> + struct xfs_mount *mp = ip->i_mount; > >> + xfs_agnumber_t start_agno, agno, best_agno; > >> + struct xfs_perag *pag; > >> + > >> + xfs_extlen_t free, resv, avail; > >> + xfs_extlen_t need_fsbs, min_free_fsbs; > >> + xfs_extlen_t best_free = 0; > >> + xfs_agnumber_t agcount = mp->m_sb.sb_agcount; > >> + > >> + /* RT inodes allocate from the realtime volume */ > >> + if (XFS_IS_REALTIME_INODE(ip)) > >> + return XFS_INO_TO_AGNO(mp, ip->i_ino); > >> + > >> + start_agno = XFS_INO_TO_AGNO(mp, ip->i_ino); > >> + > >> + /* > >> + * size-based minimum free requirement. > >> + * Convert bytes to fsbs and require some slack. > >> + */ > >> + need_fsbs = XFS_B_TO_FSB(mp, (xfs_fsize_t)len); > >> + min_free_fsbs = need_fsbs + max_t(xfs_extlen_t, need_fsbs >> 2, 128); > >> + > >> + /* > >> + * scan AGs starting at start_agno and wrapping. > >> + * Pick the first AG that meets min_free_fsbs after reservations. > >> + * Keep a "best" fallback = maximum (free - resv). 
> >> + */ > >> + best_agno = start_agno; > >> + > >> + for (xfs_agnumber_t i = 0; i < agcount; i++) { > >> + agno = (start_agno + i) % agcount; > >> + pag = xfs_perag_get(mp, agno); > >> + > >> + if (!xfs_perag_initialised_agf(pag)) > >> + goto next; > >> + > >> + free = READ_ONCE(pag->pagf_freeblks); > >> + resv = xfs_ag_resv_needed(pag, XFS_AG_RESV_NONE); > >> + > >> + if (free <= resv) > >> + goto next; > >> + > >> + avail = free - resv; > >> + > >> + if (avail >= min_free_fsbs) { > >> + xfs_perag_put(pag); > >> + return agno; > >> + } > >> + > >> + if (avail > best_free) { > >> + best_free = avail; > >> + best_agno = agno; > >> + } > >> +next: > >> + xfs_perag_put(pag); > >> + } > >> + > >> + return best_agno; > >> +} > >> + > >> +static inline xfs_agnumber_t xfs_ag_from_iomap(const struct xfs_mount *mp, > >> + const struct iomap *iomap, > >> + const struct xfs_inode *ip, loff_t pos, size_t len) > >> +{ > >> + if (iomap->type == IOMAP_MAPPED || iomap->type == IOMAP_UNWRITTEN) { > >> + /* iomap->addr is byte address on device for buffered I/O */ > >> + xfs_fsblock_t fsb = XFS_BB_TO_FSBT(mp, BTOBB(iomap->addr)); > >> + > >> + return XFS_FSB_TO_AGNO(mp, fsb); > >> + } else if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_DELALLOC) { > >> + return xfs_predict_delalloc_agno(ip, pos, len); > > > > Is it worth doing an AG scan to guess where the allocation might come > > from? The predictions could turn out to be wrong by virtue of other > > delalloc regions being written back between the time that xfs_agp_set is > > called, and the actual bmapi_write call. > > > > The delalloc prediction works well in the common cases: (1) when an AG > has sufficient free space and allocations stay within it, and (2) when > an AG becomes full and allocation naturally moves to the next suitable AG. > > The only case where the prediction can be wrong is when an AG is in the > process of being exhausted concurrently with writeback, so allocation > shifts between the time we tag the folio and the actual bmapi_write. > My understanding is that window is narrow, and only a small fraction of > IOs would be misrouted. > I wonder how true that would be under more mixed workloads. For example, if writeback is iterating AGs under a trylock, all it really takes to redirect incorrectly is lock contention, which then seems like it could be a compounding factor for other AG workers. Another thing that comes to mind is the writeback delay. For example, if we buffer up enough delalloc in pagecache to one or more inodes that target the same AG, then it seems possible to hint folios to AGs that are already full, they "just don't know it yet." Maybe that is more of an odd/rare case though. Perhaps the better question here is.. how would one test for this? It might be interesting to have stats counters or something that could indicate hits and misses wrt the hint such that this could be more easily evaluated against different workloads (assuming that doesn't already exist and I missed it).. hm? 
Brian > >> + } > >> + > >> + return XFS_INO_TO_AGNO(mp, ip->i_ino); > >> +} > >> + > >> +static void xfs_agp_set(struct xfs_inode *ip, pgoff_t index, > >> + xfs_agnumber_t agno, u8 type) > >> +{ > >> + u32 packed = xfs_agp_pack((u32)agno, type, true); > >> + > >> + /* store as immediate value */ > >> + xa_store(&ip->i_ag_pmap, index, xa_mk_value(packed), GFP_NOFS); > >> + > >> + /* Mark this AG as having potential dirty work */ > >> + if (ip->i_ag_dirty_bitmap && (u32)agno < ip->i_ag_dirty_bits) > >> + set_bit((u32)agno, ip->i_ag_dirty_bitmap); > >> +} > >> + > >> +static void > >> +xfs_iomap_tag_folio(const struct iomap *iomap, struct folio *folio, > >> + loff_t pos, size_t len) > >> +{ > >> + struct inode *inode; > >> + struct xfs_inode *ip; > >> + struct xfs_mount *mp; > >> + xfs_agnumber_t agno; > >> + > >> + inode = folio_mapping(folio)->host; > >> + ip = XFS_I(inode); > >> + mp = ip->i_mount; > >> + > >> + agno = xfs_ag_from_iomap(mp, iomap, ip, pos, len); > >> + > >> + xfs_agp_set(ip, folio->index, agno, (u8)iomap->type); > > > > Hrm, so no, the ag_pmap only caches the ag number for the index of a > > folio, even if it spans many many blocks. > > > > --D > > > > Thanks for pointing out, I will rework to handle this case. > > >> +} > >> + > >> const struct iomap_write_ops xfs_iomap_write_ops = { > >> .iomap_valid = xfs_iomap_valid, > >> + .tag_folio = xfs_iomap_tag_folio, > >> }; > >> > >> int > >> -- > >> 2.25.1 > >> > >> > > > > ^ permalink raw reply [flat|nested] 48+ messages in thread
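One way to wire up the counters Brian is asking about, sketched with an assumed structure and call site; nothing like this exists in the posted series:

	struct xfs_agwb_stats {
		struct percpu_counter	hint_hits;	/* alloc landed in tagged AG */
		struct percpu_counter	hint_misses;	/* alloc went elsewhere */
	};

	/* call once the real allocation is known, e.g. after xfs_bmapi_write() */
	static void
	xfs_agwb_account_hint(
		struct xfs_agwb_stats	*stats,
		xfs_agnumber_t		tagged_agno,
		xfs_agnumber_t		alloc_agno)
	{
		if (tagged_agno == alloc_agno)
			percpu_counter_inc(&stats->hint_hits);
		else
			percpu_counter_inc(&stats->hint_misses);
	}

Exporting the two counts through sysfs or the existing xfs stats machinery would make the hit rate directly observable per workload.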
* Re: [PATCH v3 4/6] xfs: tag folios with AG number during buffered write via iomap attach hook 2026-01-16 10:08 ` [PATCH v3 4/6] xfs: tag folios with AG number during buffered write via iomap attach hook Kundan Kumar 2026-01-29 0:47 ` Darrick J. Wong @ 2026-02-06 6:44 ` Nirjhar Roy (IBM) 1 sibling, 0 replies; 48+ messages in thread From: Nirjhar Roy (IBM) @ 2026-02-06 6:44 UTC (permalink / raw) To: Kundan Kumar, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei Cc: linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, 2026-01-16 at 15:38 +0530, Kundan Kumar wrote: > Use the iomap attach hook to tag folios with their predicted > allocation group at write time. Mapped extents derive AG directly; > delalloc and hole cases use a lightweight predictor. > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > --- > fs/xfs/xfs_iomap.c | 114 +++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 114 insertions(+) > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index 490e12cb99be..3c927ce118fe 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -12,6 +12,9 @@ > #include "xfs_trans_resv.h" > #include "xfs_mount.h" > #include "xfs_inode.h" > +#include "xfs_alloc.h" > +#include "xfs_ag.h" > +#include "xfs_ag_resv.h" > #include "xfs_btree.h" > #include "xfs_bmap_btree.h" > #include "xfs_bmap.h" > @@ -92,8 +95,119 @@ xfs_iomap_valid( > return true; > } > > +static xfs_agnumber_t > +xfs_predict_delalloc_agno(const struct xfs_inode *ip, loff_t pos, loff_t len) > +{ > + struct xfs_mount *mp = ip->i_mount; > + xfs_agnumber_t start_agno, agno, best_agno; > + struct xfs_perag *pag; Nit: Coding style - I think XFS uses a tab between the data type and the identifier and makes all the identifiers align. Something like int a; struct xfs_mount *mp; > + Nit: Extra blank line? > + xfs_extlen_t free, resv, avail; > + xfs_extlen_t need_fsbs, min_free_fsbs; > + xfs_extlen_t best_free = 0; > + xfs_agnumber_t agcount = mp->m_sb.sb_agcount; Similar comment as above > + > + /* RT inodes allocate from the realtime volume */ > + if (XFS_IS_REALTIME_INODE(ip)) > + return XFS_INO_TO_AGNO(mp, ip->i_ino); What are we returning for a realtime volume? AG sizes and rtgroup sizes may be different, right? Can we use the above macro for both? Also, rt volumes work with extents (which can be larger than the fsblock size) - so will the above work? > + > + start_agno = XFS_INO_TO_AGNO(mp, ip->i_ino); > + > + /* > + * size-based minimum free requirement. > + * Convert bytes to fsbs and require some slack. > + */ > + need_fsbs = XFS_B_TO_FSB(mp, (xfs_fsize_t)len); > + min_free_fsbs = need_fsbs + max_t(xfs_extlen_t, need_fsbs >> 2, 128); A short comment explaining the above? > + > + /* > + * scan AGs starting at start_agno and wrapping. > + * Pick the first AG that meets min_free_fsbs after reservations. > + * Keep a "best" fallback = maximum (free - resv). > + */ > + best_agno = start_agno; > + > + for (xfs_agnumber_t i = 0; i < agcount; i++) { Maybe use for_each_perag_wrap_range or for_each_perag_wrap (defined in fs/xfs/libxfs/xfs_ag.h)? 
> + agno = (start_agno + i) % agcount; > + pag = xfs_perag_get(mp, agno); > + > + if (!xfs_perag_initialised_agf(pag)) > + goto next; > + > + free = READ_ONCE(pag->pagf_freeblks); > + resv = xfs_ag_resv_needed(pag, XFS_AG_RESV_NONE); > + > + if (free <= resv) > + goto next; > + > + avail = free - resv; > + > + if (avail >= min_free_fsbs) { > + xfs_perag_put(pag); > + return agno; > + } > + > + if (avail > best_free) { Just for my understanding - we are doing a largest-fit selection, aren't we? > + best_free = avail; > + best_agno = agno; > + } > +next: > + xfs_perag_put(pag); > + } > + > + return best_agno; > +} > + > +static inline xfs_agnumber_t xfs_ag_from_iomap(const struct xfs_mount *mp, > + const struct iomap *iomap, > + const struct xfs_inode *ip, loff_t pos, size_t len) > +{ > + if (iomap->type == IOMAP_MAPPED || iomap->type == IOMAP_UNWRITTEN) { > + /* iomap->addr is byte address on device for buffered I/O */ > + xfs_fsblock_t fsb = XFS_BB_TO_FSBT(mp, BTOBB(iomap->addr)); > + > + return XFS_FSB_TO_AGNO(mp, fsb); > + } else if (iomap->type == IOMAP_HOLE || iomap->type == IOMAP_DELALLOC) { > + return xfs_predict_delalloc_agno(ip, pos, len); > + } > + > + return XFS_INO_TO_AGNO(mp, ip->i_ino); > +} > + > +static void xfs_agp_set(struct xfs_inode *ip, pgoff_t index, > + xfs_agnumber_t agno, u8 type) > +{ > + u32 packed = xfs_agp_pack((u32)agno, type, true); > + > + /* store as immediate value */ > + xa_store(&ip->i_ag_pmap, index, xa_mk_value(packed), GFP_NOFS); Maybe nit: Unhandled return from xa_store() - it returns void * > + > + /* Mark this AG as having potential dirty work */ > + if (ip->i_ag_dirty_bitmap && (u32)agno < ip->i_ag_dirty_bits) if agno >= i_ag_dirty_bits, then shouldn't we throw a WARN or an ASSERT() or XFS_IS_CORRUPT() - Darrick, any thoughts? > + set_bit((u32)agno, ip->i_ag_dirty_bitmap); > +} > + > +static void > +xfs_iomap_tag_folio(const struct iomap *iomap, struct folio *folio, > + loff_t pos, size_t len) Nit: Coding style: I think xfs functions declare one parameter per line, with the ")" right after the last parameter on the same line. Look at something like xfs_swap_extent_rmap() defined in fs/xfs/xfs_bmap_util.c. Not sure if the maintainers have hard preferences with this - so better to cross check. > +{ > + struct inode *inode; > + struct xfs_inode *ip; > + struct xfs_mount *mp; > + xfs_agnumber_t agno; Coding style as mentioned before: struct inode *inode; struct xfs_inode *ip; --NR > + > + inode = folio_mapping(folio)->host; > + ip = XFS_I(inode); > + mp = ip->i_mount; > + > + agno = xfs_ag_from_iomap(mp, iomap, ip, pos, len); > + > + xfs_agp_set(ip, folio->index, agno, (u8)iomap->type); > +} > + > const struct iomap_write_ops xfs_iomap_write_ops = { > .iomap_valid = xfs_iomap_valid, > + .tag_folio = xfs_iomap_tag_folio, > }; > > int ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <CGME20260116101259epcas5p1cfa6ab02e5a01f7c46cc78df95c57ce0@epcas5p1.samsung.com>]
* [PATCH v3 5/6] xfs: add per-AG writeback workqueue infrastructure [not found] ` <CGME20260116101259epcas5p1cfa6ab02e5a01f7c46cc78df95c57ce0@epcas5p1.samsung.com> @ 2026-01-16 10:08 ` Kundan Kumar 2026-01-29 22:21 ` Darrick J. Wong ` (2 more replies) 0 siblings, 3 replies; 48+ messages in thread From: Kundan Kumar @ 2026-01-16 10:08 UTC (permalink / raw) To: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei Cc: linux-fsdevel, linux-mm, linux-xfs, gost.dev, kundan.kumar, anuj20.g, vishak.g, joshi.k Introduce per-AG writeback worker infrastructure at mount time. This patch adds initialization and teardown only, without changing writeback behavior. Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> --- fs/xfs/xfs_aops.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/xfs_aops.h | 3 ++ fs/xfs/xfs_mount.c | 2 ++ fs/xfs/xfs_mount.h | 10 ++++++ fs/xfs/xfs_super.c | 2 ++ 5 files changed, 96 insertions(+) diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index a26f79815533..9d5b65922cd2 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -23,6 +23,23 @@ #include "xfs_zone_alloc.h" #include "xfs_rtgroup.h" +#define XFS_AG_TASK_POOL_MIN 1024 + +struct xfs_ag_wb_task { + struct list_head list; + struct xfs_inode *ip; + struct writeback_control wbc; + xfs_agnumber_t agno; +}; + +struct xfs_ag_wb { + struct delayed_work ag_work; + spinlock_t lock; + struct list_head task_list; + xfs_agnumber_t agno; + struct xfs_mount *mp; +}; + struct xfs_writepage_ctx { struct iomap_writepage_ctx ctx; unsigned int data_seq; @@ -666,6 +683,68 @@ static const struct iomap_writeback_ops xfs_zoned_writeback_ops = { .writeback_submit = xfs_zoned_writeback_submit, }; +void +xfs_init_ag_writeback(struct xfs_mount *mp) +{ + xfs_agnumber_t agno; + + mp->m_ag_wq = alloc_workqueue("xfs-ag-wb", WQ_UNBOUND | WQ_MEM_RECLAIM, + 0); + if (!mp->m_ag_wq) + return; + + mp->m_ag_wb = kcalloc(mp->m_sb.sb_agcount, + sizeof(struct xfs_ag_wb), + GFP_KERNEL); + + if (!mp->m_ag_wb) { + destroy_workqueue(mp->m_ag_wq); + mp->m_ag_wq = NULL; + return; + } + + for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) { + struct xfs_ag_wb *awb = &mp->m_ag_wb[agno]; + + spin_lock_init(&awb->lock); + INIT_LIST_HEAD(&awb->task_list); + awb->agno = agno; + awb->mp = mp; + } + + mp->m_ag_task_cachep = kmem_cache_create("xfs_ag_wb_task", + sizeof(struct xfs_ag_wb_task), + 0, + SLAB_RECLAIM_ACCOUNT, + NULL); + + mp->m_ag_task_pool = mempool_create_slab_pool(XFS_AG_TASK_POOL_MIN, + mp->m_ag_task_cachep); + + if (!mp->m_ag_task_pool) { + kmem_cache_destroy(mp->m_ag_task_cachep); + mp->m_ag_task_cachep = NULL; + } +} + +void +xfs_destroy_ag_writeback(struct xfs_mount *mp) +{ + if (mp->m_ag_wq) { + flush_workqueue(mp->m_ag_wq); + destroy_workqueue(mp->m_ag_wq); + mp->m_ag_wq = NULL; + } + kfree(mp->m_ag_wb); + mp->m_ag_wb = NULL; + + mempool_destroy(mp->m_ag_task_pool); + mp->m_ag_task_pool = NULL; + + kmem_cache_destroy(mp->m_ag_task_cachep); + mp->m_ag_task_cachep = NULL; +} + STATIC int xfs_vm_writepages( struct address_space *mapping, diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h index 5a7a0f1a0b49..e84acb7e8ca8 100644 --- a/fs/xfs/xfs_aops.h +++ b/fs/xfs/xfs_aops.h @@ -12,4 +12,7 @@ extern const struct address_space_operations xfs_dax_aops; int xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size); void xfs_end_bio(struct bio *bio); +void xfs_init_ag_writeback(struct xfs_mount *mp); +void 
xfs_destroy_ag_writeback(struct xfs_mount *mp); + #endif /* __XFS_AOPS_H__ */ diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c index 0953f6ae94ab..26224503c4bf 100644 --- a/fs/xfs/xfs_mount.c +++ b/fs/xfs/xfs_mount.c @@ -1323,6 +1323,8 @@ xfs_unmountfs( xfs_qm_unmount(mp); + xfs_destroy_ag_writeback(mp); + /* * Unreserve any blocks we have so that when we unmount we don't account * the reserved free space as used. This is really only necessary for diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index b871dfde372b..c44155de2883 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -342,6 +342,16 @@ typedef struct xfs_mount { /* Hook to feed dirent updates to an active online repair. */ struct xfs_hooks m_dir_update_hooks; + + + /* global XFS AG writeback wq */ + struct workqueue_struct *m_ag_wq; + /* array of [sb_agcount] */ + struct xfs_ag_wb *m_ag_wb; + + /* task cache and pool */ + struct kmem_cache *m_ag_task_cachep; + mempool_t *m_ag_task_pool; } xfs_mount_t; #define M_IGEO(mp) (&(mp)->m_ino_geo) diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index bc71aa9dcee8..73f8d2942df4 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1765,6 +1765,8 @@ xfs_fs_fill_super( if (error) goto out_free_sb; + xfs_init_ag_writeback(mp); + /* * V4 support is undergoing deprecation. * -- 2.25.1 ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 5/6] xfs: add per-AG writeback workqueue infrastructure 2026-01-16 10:08 ` [PATCH v3 5/6] xfs: add per-AG writeback workqueue infrastructure Kundan Kumar @ 2026-01-29 22:21 ` Darrick J. Wong 2026-02-03 7:35 ` Kundan Kumar 2026-02-06 6:46 ` Christoph Hellwig 2026-02-10 11:56 ` Nirjhar Roy (IBM) 2 siblings, 1 reply; 48+ messages in thread From: Darrick J. Wong @ 2026-01-29 22:21 UTC (permalink / raw) To: Kundan Kumar Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, Jan 16, 2026 at 03:38:17PM +0530, Kundan Kumar wrote: > Introduce per-AG writeback worker infrastructure at mount time. > This patch adds initialization and teardown only, without changing > writeback behavior. > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > --- > fs/xfs/xfs_aops.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++ > fs/xfs/xfs_aops.h | 3 ++ > fs/xfs/xfs_mount.c | 2 ++ > fs/xfs/xfs_mount.h | 10 ++++++ > fs/xfs/xfs_super.c | 2 ++ > 5 files changed, 96 insertions(+) > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > index a26f79815533..9d5b65922cd2 100644 > --- a/fs/xfs/xfs_aops.c > +++ b/fs/xfs/xfs_aops.c > @@ -23,6 +23,23 @@ > #include "xfs_zone_alloc.h" > #include "xfs_rtgroup.h" > > +#define XFS_AG_TASK_POOL_MIN 1024 > + > +struct xfs_ag_wb_task { > + struct list_head list; > + struct xfs_inode *ip; > + struct writeback_control wbc; > + xfs_agnumber_t agno; > +}; > + > +struct xfs_ag_wb { > + struct delayed_work ag_work; > + spinlock_t lock; > + struct list_head task_list; > + xfs_agnumber_t agno; > + struct xfs_mount *mp; > +}; Help me understand the data structures here ... for each AG there's an xfs_ag_wb object which can be run as a delayed workqueue item. This xfs_ag_wb is the head of a list of xfs_ag_wb_task items? In turn, each xfs_ag_wb_task list item points to an inode and (redundantly?) the agnumber? So in effect each AG has a kworker that can say "do all of this file's pagecache writeback for all dirty folios tagged with the same agnumber"? <shrug> It's hard to tell with no comments about how these two pieces of data relate to each other. --D > struct xfs_writepage_ctx { > struct iomap_writepage_ctx ctx; > unsigned int data_seq; > @@ -666,6 +683,68 @@ static const struct iomap_writeback_ops xfs_zoned_writeback_ops = { > .writeback_submit = xfs_zoned_writeback_submit, > }; > > +void > +xfs_init_ag_writeback(struct xfs_mount *mp) > +{ > + xfs_agnumber_t agno; > + > + mp->m_ag_wq = alloc_workqueue("xfs-ag-wb", WQ_UNBOUND | WQ_MEM_RECLAIM, > + 0); > + if (!mp->m_ag_wq) > + return; > + > + mp->m_ag_wb = kcalloc(mp->m_sb.sb_agcount, > + sizeof(struct xfs_ag_wb), > + GFP_KERNEL); > + > + if (!mp->m_ag_wb) { > + destroy_workqueue(mp->m_ag_wq); > + mp->m_ag_wq = NULL; > + return; > + } > + > + for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) { > + struct xfs_ag_wb *awb = &mp->m_ag_wb[agno]; > + > + spin_lock_init(&awb->lock); > + INIT_LIST_HEAD(&awb->task_list); > + awb->agno = agno; > + awb->mp = mp; Can't you stuff this information in struct xfs_perag instead of asking for a potentially huge allocation? 
> + } > + > + mp->m_ag_task_cachep = kmem_cache_create("xfs_ag_wb_task", > + sizeof(struct xfs_ag_wb_task), > + 0, > + SLAB_RECLAIM_ACCOUNT, > + NULL); > + > + mp->m_ag_task_pool = mempool_create_slab_pool(XFS_AG_TASK_POOL_MIN, > + mp->m_ag_task_cachep); > + > + if (!mp->m_ag_task_pool) { > + kmem_cache_destroy(mp->m_ag_task_cachep); > + mp->m_ag_task_cachep = NULL; > + } > +} > + > +void > +xfs_destroy_ag_writeback(struct xfs_mount *mp) > +{ > + if (mp->m_ag_wq) { > + flush_workqueue(mp->m_ag_wq); > + destroy_workqueue(mp->m_ag_wq); > + mp->m_ag_wq = NULL; > + } > + kfree(mp->m_ag_wb); > + mp->m_ag_wb = NULL; > + > + mempool_destroy(mp->m_ag_task_pool); > + mp->m_ag_task_pool = NULL; > + > + kmem_cache_destroy(mp->m_ag_task_cachep); > + mp->m_ag_task_cachep = NULL; > +} > + > STATIC int > xfs_vm_writepages( > struct address_space *mapping, > diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h > index 5a7a0f1a0b49..e84acb7e8ca8 100644 > --- a/fs/xfs/xfs_aops.h > +++ b/fs/xfs/xfs_aops.h > @@ -12,4 +12,7 @@ extern const struct address_space_operations xfs_dax_aops; > int xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size); > void xfs_end_bio(struct bio *bio); > > +void xfs_init_ag_writeback(struct xfs_mount *mp); > +void xfs_destroy_ag_writeback(struct xfs_mount *mp); > + > #endif /* __XFS_AOPS_H__ */ > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c > index 0953f6ae94ab..26224503c4bf 100644 > --- a/fs/xfs/xfs_mount.c > +++ b/fs/xfs/xfs_mount.c > @@ -1323,6 +1323,8 @@ xfs_unmountfs( > > xfs_qm_unmount(mp); > > + xfs_destroy_ag_writeback(mp); > + > /* > * Unreserve any blocks we have so that when we unmount we don't account > * the reserved free space as used. This is really only necessary for > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h > index b871dfde372b..c44155de2883 100644 > --- a/fs/xfs/xfs_mount.h > +++ b/fs/xfs/xfs_mount.h > @@ -342,6 +342,16 @@ typedef struct xfs_mount { > > /* Hook to feed dirent updates to an active online repair. */ > struct xfs_hooks m_dir_update_hooks; > + > + > + /* global XFS AG writeback wq */ > + struct workqueue_struct *m_ag_wq; > + /* array of [sb_agcount] */ > + struct xfs_ag_wb *m_ag_wb; > + > + /* task cache and pool */ > + struct kmem_cache *m_ag_task_cachep; > + mempool_t *m_ag_task_pool; > } xfs_mount_t; > > #define M_IGEO(mp) (&(mp)->m_ino_geo) > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c > index bc71aa9dcee8..73f8d2942df4 100644 > --- a/fs/xfs/xfs_super.c > +++ b/fs/xfs/xfs_super.c > @@ -1765,6 +1765,8 @@ xfs_fs_fill_super( > if (error) > goto out_free_sb; > > + xfs_init_ag_writeback(mp); > + > /* > * V4 support is undergoing deprecation. > * > -- > 2.25.1 > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 5/6] xfs: add per-AG writeback workqueue infrastructure 2026-01-29 22:21 ` Darrick J. Wong @ 2026-02-03 7:35 ` Kundan Kumar 0 siblings, 0 replies; 48+ messages in thread From: Kundan Kumar @ 2026-02-03 7:35 UTC (permalink / raw) To: Darrick J. Wong Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On 1/30/2026 3:51 AM, Darrick J. Wong wrote: > On Fri, Jan 16, 2026 at 03:38:17PM +0530, Kundan Kumar wrote: >> Introduce per-AG writeback worker infrastructure at mount time. >> This patch adds initialization and teardown only, without changing >> writeback behavior. >> >> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> >> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> >> --- >> fs/xfs/xfs_aops.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++ >> fs/xfs/xfs_aops.h | 3 ++ >> fs/xfs/xfs_mount.c | 2 ++ >> fs/xfs/xfs_mount.h | 10 ++++++ >> fs/xfs/xfs_super.c | 2 ++ >> 5 files changed, 96 insertions(+) >> >> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c >> index a26f79815533..9d5b65922cd2 100644 >> --- a/fs/xfs/xfs_aops.c >> +++ b/fs/xfs/xfs_aops.c >> @@ -23,6 +23,23 @@ >> #include "xfs_zone_alloc.h" >> #include "xfs_rtgroup.h" >> >> +#define XFS_AG_TASK_POOL_MIN 1024 >> + >> +struct xfs_ag_wb_task { >> + struct list_head list; >> + struct xfs_inode *ip; >> + struct writeback_control wbc; >> + xfs_agnumber_t agno; >> +}; >> + >> +struct xfs_ag_wb { >> + struct delayed_work ag_work; >> + spinlock_t lock; >> + struct list_head task_list; >> + xfs_agnumber_t agno; >> + struct xfs_mount *mp; >> +}; > > Help me understand the data structures here ... for each AG there's an > xfs_ag_wb object which can be run as a delayed workqueue item. This > xfs_ag_wb is the head of a list of xfs_ag_wb_task items? > > In turn, each xfs_ag_wb_task list item points to an inode and > (redundantly?) the agnumber? So in effect each AG has a kworker that > can say "do all of this file's pagecache writeback for all dirty folios > tagged with the same agnumber"? > > <shrug> It's hard to tell with no comments about how these two pieces of > data relate to each other. > > --D > Yes, that’s the intent. struct xfs_ag_wb is per-AG state and owns the delayed_work that runs the AG worker on m_ag_wq. Each xfs_ag_wb maintains a queue of pending work items that request "process inode X for AG Y". struct xfs_ag_wb_task is the queued request, it carries the target inode plus a snapshot of wbc. When the AG worker runs, it drains its task list and for each inode does a one-pass scan of the mapping, filtering folios tagged for that AG and submitting IO for those folios only. 
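To make that concrete, for the next version I can add comments along these lines (wording tentative):

/*
 * A queued writeback request: "write back inode X's dirty folios that
 * are tagged for AG Y". Lives on the owning AG's xfs_ag_wb task_list.
 */
struct xfs_ag_wb_task {
	struct list_head		list;	/* entry in xfs_ag_wb->task_list */
	struct xfs_inode		*ip;	/* inode to scan */
	struct writeback_control	wbc;	/* snapshot of the caller's wbc */
	xfs_agnumber_t			agno;	/* AG whose folios to submit */
};

/*
 * Per-AG writeback state. ag_work runs on mp->m_ag_wq; it drains
 * task_list and, for each queued inode, does a one-pass scan of the
 * mapping, submitting only folios tagged for this AG.
 */
struct xfs_ag_wb {
	struct delayed_work	ag_work;
	spinlock_t		lock;		/* protects task_list */
	struct list_head	task_list;	/* pending xfs_ag_wb_tasks */
	xfs_agnumber_t		agno;
	struct xfs_mount	*mp;
};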
>> struct xfs_writepage_ctx { >> struct iomap_writepage_ctx ctx; >> unsigned int data_seq; >> @@ -666,6 +683,68 @@ static const struct iomap_writeback_ops xfs_zoned_writeback_ops = { >> .writeback_submit = xfs_zoned_writeback_submit, >> }; >> >> +void >> +xfs_init_ag_writeback(struct xfs_mount *mp) >> +{ >> + xfs_agnumber_t agno; >> + >> + mp->m_ag_wq = alloc_workqueue("xfs-ag-wb", WQ_UNBOUND | WQ_MEM_RECLAIM, >> + 0); >> + if (!mp->m_ag_wq) >> + return; >> + >> + mp->m_ag_wb = kcalloc(mp->m_sb.sb_agcount, >> + sizeof(struct xfs_ag_wb), >> + GFP_KERNEL); >> + >> + if (!mp->m_ag_wb) { >> + destroy_workqueue(mp->m_ag_wq); >> + mp->m_ag_wq = NULL; >> + return; >> + } >> + >> + for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) { >> + struct xfs_ag_wb *awb = &mp->m_ag_wb[agno]; >> + >> + spin_lock_init(&awb->lock); >> + INIT_LIST_HEAD(&awb->task_list); >> + awb->agno = agno; >> + awb->mp = mp; > > Can't you stuff this information in struct xfs_perag instead of asking > for a potentially huge allocation? > I can switch the xfs_ag_wb fields (lock/list/work/agno/mp) into struct xfs_perag and initialize them when perags are set up. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 5/6] xfs: add per-AG writeback workqueue infrastructure 2026-01-16 10:08 ` [PATCH v3 5/6] xfs: add per-AG writeback workqueue infrastructure Kundan Kumar 2026-01-29 22:21 ` Darrick J. Wong @ 2026-02-06 6:46 ` Christoph Hellwig 2026-02-10 11:56 ` Nirjhar Roy (IBM) 2 siblings, 0 replies; 48+ messages in thread From: Christoph Hellwig @ 2026-02-06 6:46 UTC (permalink / raw) To: Kundan Kumar Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, Jan 16, 2026 at 03:38:17PM +0530, Kundan Kumar wrote: > Introduce per-AG writeback worker infrastructure at mount time. > This patch adds initialization and teardown only, without changing > writeback behavior. We really should not be adding extra writeback threads in the file system. btrfs has done this and created a lot of madness. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 5/6] xfs: add per-AG writeback workqueue infrastructure 2026-01-16 10:08 ` [PATCH v3 5/6] xfs: add per-AG writeback workqueue infrastructure Kundan Kumar 2026-01-29 22:21 ` Darrick J. Wong 2026-02-06 6:46 ` Christoph Hellwig @ 2026-02-10 11:56 ` Nirjhar Roy (IBM) 2 siblings, 0 replies; 48+ messages in thread From: Nirjhar Roy (IBM) @ 2026-02-10 11:56 UTC (permalink / raw) To: Kundan Kumar, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei Cc: linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, 2026-01-16 at 15:38 +0530, Kundan Kumar wrote: > Introduce per-AG writeback worker infrastructure at mount time. > This patch adds initialization and teardown only, without changing > writeback behavior. > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > --- > fs/xfs/xfs_aops.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++ > fs/xfs/xfs_aops.h | 3 ++ > fs/xfs/xfs_mount.c | 2 ++ > fs/xfs/xfs_mount.h | 10 ++++++ > fs/xfs/xfs_super.c | 2 ++ > 5 files changed, 96 insertions(+) > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > index a26f79815533..9d5b65922cd2 100644 > --- a/fs/xfs/xfs_aops.c > +++ b/fs/xfs/xfs_aops.c > @@ -23,6 +23,23 @@ > #include "xfs_zone_alloc.h" > #include "xfs_rtgroup.h" > > +#define XFS_AG_TASK_POOL_MIN 1024 > + > +struct xfs_ag_wb_task { > + struct list_head list; > + struct xfs_inode *ip; > + struct writeback_control wbc; > + xfs_agnumber_t agno; Is this the agno where the ip resides, or the agno of any one of the blocks belonging to this ip? > +}; Nit: Coding style - tab between the data type and the identifier > + > +struct xfs_ag_wb { > + struct delayed_work ag_work; > + spinlock_t lock; > + struct list_head task_list; > + xfs_agnumber_t agno; > + struct xfs_mount *mp; > +}; > + > struct xfs_writepage_ctx { > struct iomap_writepage_ctx ctx; > unsigned int data_seq; > @@ -666,6 +683,68 @@ static const struct iomap_writeback_ops xfs_zoned_writeback_ops = { > .writeback_submit = xfs_zoned_writeback_submit, > }; > > +void > +xfs_init_ag_writeback(struct xfs_mount *mp) > +{ > + xfs_agnumber_t agno; > + > + mp->m_ag_wq = alloc_workqueue("xfs-ag-wb", WQ_UNBOUND | WQ_MEM_RECLAIM, > + 0); Nit: I think we follow 2-tab indentation if the parameter list exceeds the per-line length limit. Similar comments for such changes in the function call sites below. > + if (!mp->m_ag_wq) > + return; > + > + mp->m_ag_wb = kcalloc(mp->m_sb.sb_agcount, > + sizeof(struct xfs_ag_wb), > + GFP_KERNEL); > + > + if (!mp->m_ag_wb) { > + destroy_workqueue(mp->m_ag_wq); > + mp->m_ag_wq = NULL; > + return; > + } > + > + for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) { > + struct xfs_ag_wb *awb = &mp->m_ag_wb[agno]; > + > + spin_lock_init(&awb->lock); > + INIT_LIST_HEAD(&awb->task_list); > + awb->agno = agno; > + awb->mp = mp; > + } > + > + mp->m_ag_task_cachep = kmem_cache_create("xfs_ag_wb_task", > + sizeof(struct xfs_ag_wb_task), > + 0, > + SLAB_RECLAIM_ACCOUNT, > + NULL); > + > + mp->m_ag_task_pool = mempool_create_slab_pool(XFS_AG_TASK_POOL_MIN, > + mp->m_ag_task_cachep); Nit: 2 tabs indentation > + > + if (!mp->m_ag_task_pool) { > + kmem_cache_destroy(mp->m_ag_task_cachep); > + mp->m_ag_task_cachep = NULL; Shouldn't we also be freeing mp->m_ag_wq and the mp->m_ag_wb[] array?
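Something along these lines would unwind fully on that failure (untested sketch, purely to illustrate the point):

	if (!mp->m_ag_task_pool) {
		kmem_cache_destroy(mp->m_ag_task_cachep);
		mp->m_ag_task_cachep = NULL;
		kfree(mp->m_ag_wb);
		mp->m_ag_wb = NULL;
		destroy_workqueue(mp->m_ag_wq);
		mp->m_ag_wq = NULL;
	}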
> + } > +} > + > +void > +xfs_destroy_ag_writeback(struct xfs_mount *mp) > +{ > + if (mp->m_ag_wq) { > + flush_workqueue(mp->m_ag_wq); > + destroy_workqueue(mp->m_ag_wq); > + mp->m_ag_wq = NULL; > + } > + kfree(mp->m_ag_wb); > + mp->m_ag_wb = NULL; > + > + mempool_destroy(mp->m_ag_task_pool); > + mp->m_ag_task_pool = NULL; > + > + kmem_cache_destroy(mp->m_ag_task_cachep); > + mp->m_ag_task_cachep = NULL; > +} > + > STATIC int > xfs_vm_writepages( > struct address_space *mapping, > diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h > index 5a7a0f1a0b49..e84acb7e8ca8 100644 > --- a/fs/xfs/xfs_aops.h > +++ b/fs/xfs/xfs_aops.h > @@ -12,4 +12,7 @@ extern const struct address_space_operations xfs_dax_aops; > int xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size); > void xfs_end_bio(struct bio *bio); > > +void xfs_init_ag_writeback(struct xfs_mount *mp); > +void xfs_destroy_ag_writeback(struct xfs_mount *mp); > + > #endif /* __XFS_AOPS_H__ */ > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c > index 0953f6ae94ab..26224503c4bf 100644 > --- a/fs/xfs/xfs_mount.c > +++ b/fs/xfs/xfs_mount.c > @@ -1323,6 +1323,8 @@ xfs_unmountfs( > > xfs_qm_unmount(mp); > > + xfs_destroy_ag_writeback(mp); > + > /* > * Unreserve any blocks we have so that when we unmount we don't account > * the reserved free space as used. This is really only necessary for > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h > index b871dfde372b..c44155de2883 100644 > --- a/fs/xfs/xfs_mount.h > +++ b/fs/xfs/xfs_mount.h > @@ -342,6 +342,16 @@ typedef struct xfs_mount { > > /* Hook to feed dirent updates to an active online repair. */ > struct xfs_hooks m_dir_update_hooks; > + > + > + /* global XFS AG writeback wq */ > + struct workqueue_struct *m_ag_wq; > + /* array of [sb_agcount] */ > + struct xfs_ag_wb *m_ag_wb; > + > + /* task cache and pool */ > + struct kmem_cache *m_ag_task_cachep; > + mempool_t *m_ag_task_pool; Again, nit: coding style - tab between the data type and the identifier > } xfs_mount_t; > > #define M_IGEO(mp) (&(mp)->m_ino_geo) > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c > index bc71aa9dcee8..73f8d2942df4 100644 > --- a/fs/xfs/xfs_super.c > +++ b/fs/xfs/xfs_super.c > @@ -1765,6 +1765,8 @@ xfs_fs_fill_super( > if (error) > goto out_free_sb; > > + xfs_init_ag_writeback(mp); So if we are not able to init the writeback workqueue infra, we are still okay to mount the fs, except that the AG aware writeback facility won't be available, i.e. we aren't treating it as a fatal error, correct? --NR > + > /* > * V4 support is undergoing deprecation. > * ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <CGME20260116101305epcas5p497cd6d9027301853669f1c1aaffbf128@epcas5p4.samsung.com>]
* [PATCH v3 6/6] xfs: offload writeback by AG using per-inode dirty bitmap and per-AG workers [not found] ` <CGME20260116101305epcas5p497cd6d9027301853669f1c1aaffbf128@epcas5p4.samsung.com> @ 2026-01-16 10:08 ` Kundan Kumar 2026-01-29 22:34 ` Darrick J. Wong 2026-02-11 9:39 ` Nirjhar Roy (IBM) 0 siblings, 2 replies; 48+ messages in thread From: Kundan Kumar @ 2026-01-16 10:08 UTC (permalink / raw) To: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei Cc: linux-fsdevel, linux-mm, linux-xfs, gost.dev, kundan.kumar, anuj20.g, vishak.g, joshi.k Offload XFS writeback to per-AG workers based on the inode dirty-AG bitmap. Each worker scans and submits writeback only for folios belonging to its AG. Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> --- fs/xfs/xfs_aops.c | 178 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 178 insertions(+) diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index 9d5b65922cd2..55c3154fb2b5 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -678,6 +678,180 @@ xfs_zoned_writeback_submit( return 0; } +static bool xfs_agp_match(struct xfs_inode *ip, pgoff_t index, + xfs_agnumber_t agno) +{ + void *ent; + u32 v; + bool match = false; + + ent = xa_load(&ip->i_ag_pmap, index); + if (ent && xa_is_value(ent)) { + v = xa_to_value(ent); + if (xfs_agp_valid(v)) + match = (xfs_agp_agno(v) == (u32)agno); + } + + return match; +} + +static bool xfs_folio_matches_ag(struct folio *folio, xfs_agnumber_t agno) +{ + struct xfs_inode *ip = XFS_I(folio_mapping(folio)->host); + + return xfs_agp_match(ip, folio->index, agno); +} + +static int xfs_writepages_ag(struct xfs_inode *ip, + struct writeback_control *wbc, + xfs_agnumber_t agno) +{ + struct inode *inode = VFS_I(ip); + struct address_space *mapping = inode->i_mapping; + struct folio_batch *fbatch = &wbc->fbatch; + int ret = 0; + pgoff_t index, end; + + wbc->range_cyclic = 0; + + folio_batch_init(fbatch); + index = wbc->range_start >> PAGE_SHIFT; + end = wbc->range_end >> PAGE_SHIFT; + + struct xfs_writepage_ctx wpc = { + .ctx = { + .inode = inode, + .wbc = wbc, + .ops = &xfs_writeback_ops, + }, + }; + + while (index <= end) { + int i, nr; + + /* get a batch of DIRTY folios starting at index */ + nr = filemap_get_folios_tag(mapping, &index, end, + PAGECACHE_TAG_DIRTY, fbatch); + if (!nr) + break; + + for (i = 0; i < nr; i++) { + struct folio *folio = fbatch->folios[i]; + + /* Filter BEFORE locking */ + if (!xfs_folio_matches_ag(folio, agno)) + continue; + + folio_lock(folio); + + /* + * Now it's ours: clear dirty and submit. + * This prevents *this AG worker* from seeing it again + * next time. 
+ */ + if (!folio_clear_dirty_for_io(folio)) { + folio_unlock(folio); + continue; + } + xa_erase(&ip->i_ag_pmap, folio->index); + + ret = iomap_writeback_folio(&wpc.ctx, folio); + folio_unlock(folio); + + if (ret) { + folio_batch_release(fbatch); + goto out; + } + } + + folio_batch_release(fbatch); + cond_resched(); + } + +out: + if (wpc.ctx.wb_ctx && wpc.ctx.ops && wpc.ctx.ops->writeback_submit) + wpc.ctx.ops->writeback_submit(&wpc.ctx, ret); + + return ret; +} + +static void xfs_ag_writeback_work(struct work_struct *work) +{ + struct xfs_ag_wb *awb = container_of(to_delayed_work(work), + struct xfs_ag_wb, ag_work); + struct xfs_ag_wb_task *task; + struct xfs_mount *mp; + struct inode *inode; + struct xfs_inode *ip; + int ret; + + for (;;) { + spin_lock(&awb->lock); + task = list_first_entry_or_null(&awb->task_list, + struct xfs_ag_wb_task, list); + if (task) + list_del_init(&task->list); + spin_unlock(&awb->lock); + + if (!task) + break; + + ip = task->ip; + mp = ip->i_mount; + inode = VFS_I(ip); + + ret = xfs_writepages_ag(ip, &task->wbc, task->agno); + + /* If didn't submit everything for this AG, set its bit */ + if (ret) + set_bit(task->agno, ip->i_ag_dirty_bitmap); + + iput(inode); /* drop igrab */ + mempool_free(task, mp->m_ag_task_pool); + } +} + +static int xfs_vm_writepages_offload(struct address_space *mapping, + struct writeback_control *wbc) +{ + struct inode *inode = mapping->host; + struct xfs_inode *ip = XFS_I(inode); + struct xfs_mount *mp = ip->i_mount; + struct xfs_ag_wb *awb; + struct xfs_ag_wb_task *task; + xfs_agnumber_t agno; + + if (!ip->i_ag_dirty_bits) + return 0; + + for_each_set_bit(agno, ip->i_ag_dirty_bitmap, ip->i_ag_dirty_bits) { + if (!test_and_clear_bit(agno, ip->i_ag_dirty_bitmap)) + continue; + + task = mempool_alloc(mp->m_ag_task_pool, GFP_NOFS); + if (!task) { + set_bit(agno, ip->i_ag_dirty_bitmap); + continue; + } + + INIT_LIST_HEAD(&task->list); + task->ip = ip; + task->agno = agno; + task->wbc = *wbc; + igrab(inode); /* worker owns inode ref */ + + awb = &mp->m_ag_wb[agno]; + + spin_lock(&awb->lock); + list_add_tail(&task->list, &awb->task_list); + spin_unlock(&awb->lock); + + mod_delayed_work(mp->m_ag_wq, &awb->ag_work, 0); + } + + return 0; +} + static const struct iomap_writeback_ops xfs_zoned_writeback_ops = { .writeback_range = xfs_zoned_writeback_range, .writeback_submit = xfs_zoned_writeback_submit, @@ -706,6 +880,7 @@ xfs_init_ag_writeback(struct xfs_mount *mp) for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) { struct xfs_ag_wb *awb = &mp->m_ag_wb[agno]; + INIT_DELAYED_WORK(&awb->ag_work, xfs_ag_writeback_work); spin_lock_init(&awb->lock); INIT_LIST_HEAD(&awb->task_list); awb->agno = agno; @@ -769,6 +944,9 @@ xfs_vm_writepages( xfs_open_zone_put(xc.open_zone); return error; } else { + if (wbc->sync_mode != WB_SYNC_ALL) + return xfs_vm_writepages_offload(mapping, wbc); + struct xfs_writepage_ctx wpc = { .ctx = { .inode = mapping->host, -- 2.25.1 ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 6/6] xfs: offload writeback by AG using per-inode dirty bitmap and per-AG workers 2026-01-16 10:08 ` [PATCH v3 6/6] xfs: offload writeback by AG using per-inode dirty bitmap and per-AG workers Kundan Kumar @ 2026-01-29 22:34 ` Darrick J. Wong 2026-02-03 7:40 ` Kundan Kumar 0 siblings, 1 reply; 48+ messages in thread From: Darrick J. Wong @ 2026-01-29 22:34 UTC (permalink / raw) To: Kundan Kumar Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, Jan 16, 2026 at 03:38:18PM +0530, Kundan Kumar wrote: > Offload XFS writeback to per-AG workers based on the inode dirty-AG > bitmap. Each worker scans and submits writeback only for folios > belonging to its AG. > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > --- > fs/xfs/xfs_aops.c | 178 ++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 178 insertions(+) > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > index 9d5b65922cd2..55c3154fb2b5 100644 > --- a/fs/xfs/xfs_aops.c > +++ b/fs/xfs/xfs_aops.c > @@ -678,6 +678,180 @@ xfs_zoned_writeback_submit( > return 0; > } > > +static bool xfs_agp_match(struct xfs_inode *ip, pgoff_t index, > + xfs_agnumber_t agno) > +{ > + void *ent; > + u32 v; > + bool match = false; > + > + ent = xa_load(&ip->i_ag_pmap, index); > + if (ent && xa_is_value(ent)) { > + v = xa_to_value(ent); > + if (xfs_agp_valid(v)) > + match = (xfs_agp_agno(v) == (u32)agno); > + } > + > + return match; > +} > + > +static bool xfs_folio_matches_ag(struct folio *folio, xfs_agnumber_t agno) > +{ > + struct xfs_inode *ip = XFS_I(folio_mapping(folio)->host); > + > + return xfs_agp_match(ip, folio->index, agno); > +} > + > +static int xfs_writepages_ag(struct xfs_inode *ip, > + struct writeback_control *wbc, > + xfs_agnumber_t agno) > +{ > + struct inode *inode = VFS_I(ip); > + struct address_space *mapping = inode->i_mapping; > + struct folio_batch *fbatch = &wbc->fbatch; > + int ret = 0; > + pgoff_t index, end; > + > + wbc->range_cyclic = 0; > + > + folio_batch_init(fbatch); > + index = wbc->range_start >> PAGE_SHIFT; > + end = wbc->range_end >> PAGE_SHIFT; > + > + struct xfs_writepage_ctx wpc = { > + .ctx = { > + .inode = inode, > + .wbc = wbc, > + .ops = &xfs_writeback_ops, > + }, > + }; > + > + while (index <= end) { > + int i, nr; > + > + /* get a batch of DIRTY folios starting at index */ > + nr = filemap_get_folios_tag(mapping, &index, end, > + PAGECACHE_TAG_DIRTY, fbatch); > + if (!nr) > + break; > + > + for (i = 0; i < nr; i++) { > + struct folio *folio = fbatch->folios[i]; > + > + /* Filter BEFORE locking */ > + if (!xfs_folio_matches_ag(folio, agno)) So we grab a batch of dirty folios, and only /then/ check to see if they've been tagged with the target agno? That doesn't seem very efficient if there are a lot of dirty folios and they're not evenly distributed among AGs. > + continue; > + > + folio_lock(folio); > + > + /* > + * Now it's ours: clear dirty and submit. > + * This prevents *this AG worker* from seeing it again > + * next time. > + */ > + if (!folio_clear_dirty_for_io(folio)) { > + folio_unlock(folio); > + continue; > + } > + xa_erase(&ip->i_ag_pmap, folio->index); Why erase the association?
Is this because once we've written the folio back to storage, we want a subsequent write to a fsblock within that folio to tag the folio with the agnumber of that fsblock? Hrm, maybe that's how you deal with multi-fsblock folios; the folio tag reflects the first block to be dirtied within the folio? > + > + ret = iomap_writeback_folio(&wpc.ctx, folio); > + folio_unlock(folio); > + > + if (ret) { > + folio_batch_release(fbatch); > + goto out; > + } > + } > + > + folio_batch_release(fbatch); > + cond_resched(); > + } > + > +out: > + if (wpc.ctx.wb_ctx && wpc.ctx.ops && wpc.ctx.ops->writeback_submit) > + wpc.ctx.ops->writeback_submit(&wpc.ctx, ret); > + > + return ret; > +} > + > +static void xfs_ag_writeback_work(struct work_struct *work) > +{ > + struct xfs_ag_wb *awb = container_of(to_delayed_work(work), > + struct xfs_ag_wb, ag_work); > + struct xfs_ag_wb_task *task; > + struct xfs_mount *mp; > + struct inode *inode; > + struct xfs_inode *ip; > + int ret; > + > + for (;;) { > + spin_lock(&awb->lock); > + task = list_first_entry_or_null(&awb->task_list, > + struct xfs_ag_wb_task, list); > + if (task) > + list_del_init(&task->list); > + spin_unlock(&awb->lock); > + > + if (!task) > + break; > + > + ip = task->ip; > + mp = ip->i_mount; > + inode = VFS_I(ip); > + > + ret = xfs_writepages_ag(ip, &task->wbc, task->agno); > + > + /* If didn't submit everything for this AG, set its bit */ > + if (ret) > + set_bit(task->agno, ip->i_ag_dirty_bitmap); > + > + iput(inode); /* drop igrab */ > + mempool_free(task, mp->m_ag_task_pool); > + } > +} > + > +static int xfs_vm_writepages_offload(struct address_space *mapping, > + struct writeback_control *wbc) > +{ > + struct inode *inode = mapping->host; > + struct xfs_inode *ip = XFS_I(inode); > + struct xfs_mount *mp = ip->i_mount; > + struct xfs_ag_wb *awb; > + struct xfs_ag_wb_task *task; > + xfs_agnumber_t agno; > + > + if (!ip->i_ag_dirty_bits) > + return 0; > + > + for_each_set_bit(agno, ip->i_ag_dirty_bitmap, ip->i_ag_dirty_bits) { > + if (!test_and_clear_bit(agno, ip->i_ag_dirty_bitmap)) > + continue; > + > + task = mempool_alloc(mp->m_ag_task_pool, GFP_NOFS); Allocating memory (even from a mempool) during writeback makes me nervous... > + if (!task) { > + set_bit(agno, ip->i_ag_dirty_bitmap); > + continue; > + } ...because apparently the allocation can fail. If so, then why don't we just fall back to serial writeback instead of ... moving on to the next AG and seeing if there's more memory? > + > + INIT_LIST_HEAD(&task->list); > + task->ip = ip; > + task->agno = agno; > + task->wbc = *wbc; > + igrab(inode); /* worker owns inode ref */ Shouldn't we check for a null return value here? That should never happen, but we /do/ have the option of falling back to standard iomap_writepages. 
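(i.e. something like this, untested, and the fallback plumbing is hand-waved; it just restores the dirty bit and bails so the caller can do serial writeback:)

		if (!igrab(inode)) {
			mempool_free(task, mp->m_ag_task_pool);
			set_bit(agno, ip->i_ag_dirty_bitmap);
			break;	/* fall back to plain iomap_writepages */
		}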
> + > + awb = &mp->m_ag_wb[agno]; > + > + spin_lock(&awb->lock); > + list_add_tail(&task->list, &awb->task_list); > + spin_unlock(&awb->lock); > + > + mod_delayed_work(mp->m_ag_wq, &awb->ag_work, 0); > + } > + > + return 0; > +} > + > static const struct iomap_writeback_ops xfs_zoned_writeback_ops = { > .writeback_range = xfs_zoned_writeback_range, > .writeback_submit = xfs_zoned_writeback_submit, > @@ -706,6 +880,7 @@ xfs_init_ag_writeback(struct xfs_mount *mp) > for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) { > struct xfs_ag_wb *awb = &mp->m_ag_wb[agno]; > > + INIT_DELAYED_WORK(&awb->ag_work, xfs_ag_writeback_work); > spin_lock_init(&awb->lock); > INIT_LIST_HEAD(&awb->task_list); > awb->agno = agno; > @@ -769,6 +944,9 @@ xfs_vm_writepages( > xfs_open_zone_put(xc.open_zone); > return error; > } else { > + if (wbc->sync_mode != WB_SYNC_ALL) > + return xfs_vm_writepages_offload(mapping, wbc); > + > struct xfs_writepage_ctx wpc = { > .ctx = { > .inode = mapping->host, > -- > 2.25.1 > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 6/6] xfs: offload writeback by AG using per-inode dirty bitmap and per-AG workers 2026-01-29 22:34 ` Darrick J. Wong @ 2026-02-03 7:40 ` Kundan Kumar 0 siblings, 0 replies; 48+ messages in thread From: Kundan Kumar @ 2026-02-03 7:40 UTC (permalink / raw) To: Darrick J. Wong Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On 1/30/2026 4:04 AM, Darrick J. Wong wrote: > On Fri, Jan 16, 2026 at 03:38:18PM +0530, Kundan Kumar wrote: >> Offload XFS writeback to per-AG workers based on the inode dirty-AG >> bitmap. Each worker scans and submits writeback only for folios >> belonging to its AG. >> >> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> >> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> >> --- >> fs/xfs/xfs_aops.c | 178 ++++++++++++++++++++++++++++++++++++++++++++++ >> 1 file changed, 178 insertions(+) >> >> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c >> index 9d5b65922cd2..55c3154fb2b5 100644 >> --- a/fs/xfs/xfs_aops.c >> +++ b/fs/xfs/xfs_aops.c >> @@ -678,6 +678,180 @@ xfs_zoned_writeback_submit( >> return 0; >> } >> >> +static bool xfs_agp_match(struct xfs_inode *ip, pgoff_t index, >> + xfs_agnumber_t agno) >> +{ >> + void *ent; >> + u32 v; >> + bool match = false; >> + >> + ent = xa_load(&ip->i_ag_pmap, index); >> + if (ent && xa_is_value(ent)) { >> + v = xa_to_value(ent); >> + if (xfs_agp_valid(v)) >> + match = (xfs_agp_agno(v) == (u32)agno); >> + } >> + >> + return match; >> +} >> + >> +static bool xfs_folio_matches_ag(struct folio *folio, xfs_agnumber_t agno) >> +{ >> + struct xfs_inode *ip = XFS_I(folio_mapping(folio)->host); >> + >> + return xfs_agp_match(ip, folio->index, agno); >> +} >> + >> +static int xfs_writepages_ag(struct xfs_inode *ip, >> + struct writeback_control *wbc, >> + xfs_agnumber_t agno) >> +{ >> + struct inode *inode = VFS_I(ip); >> + struct address_space *mapping = inode->i_mapping; >> + struct folio_batch *fbatch = &wbc->fbatch; >> + int ret = 0; >> + pgoff_t index, end; >> + >> + wbc->range_cyclic = 0; >> + >> + folio_batch_init(fbatch); >> + index = wbc->range_start >> PAGE_SHIFT; >> + end = wbc->range_end >> PAGE_SHIFT; >> + >> + struct xfs_writepage_ctx wpc = { >> + .ctx = { >> + .inode = inode, >> + .wbc = wbc, >> + .ops = &xfs_writeback_ops, >> + }, >> + }; >> + >> + while (index <= end) { >> + int i, nr; >> + >> + /* get a batch of DIRTY folios starting at index */ >> + nr = filemap_get_folios_tag(mapping, &index, end, >> + PAGECACHE_TAG_DIRTY, fbatch); >> + if (!nr) >> + break; >> + >> + for (i = 0; i < nr; i++) { >> + struct folio *folio = fbatch->folios[i]; >> + >> + /* Filter BEFORE locking */ >> + if (!xfs_folio_matches_ag(folio, agno)) > > So we grab a batch of dirty folios, and only /then/ check to see if > they've been tagged with the target agno? That doesn't seem very > efficient if there are a lot of dirty folio and they're not evenly > distributed among AGs. > i_ag_dirty_bitmap is a hint populated at write time to indicate that an inode likely has dirty folios tagged for AG X, so we only kick workers for AGs that should have work. That said, folios may not be evenly distributed across AGs, so a worker can still end up scanning and finding few or no matches in some cases. >> + continue; >> + >> + folio_lock(folio); >> + >> + /* >> + * Now it's ours: clear dirty and submit. >> + * This prevents *this AG worker* from seeing it again >> + * next time. 
>> + */ >> + if (!folio_clear_dirty_for_io(folio)) { >> + folio_unlock(folio); >> + continue; >> + } >> + xa_erase(&ip->i_ag_pmap, folio->index); > > Why erase the association? Is this because once we've written the folio > back to storage, we want a subsequent write to a fsblock within that > folio to tag the folio with the agnumber of that fsblock? Hrm, maybe > that's how you deal with multi-fsblock folios; the folio tag reflects > the first block to be dirtied within the folio? > The primary reason for erasing the entry is to keep the metadata bounded: once the folio is written back and clean, there is no need to retain any AG association, so we drop it to free the per-folio state. On the next write, the folio is re-tagged based on the iomap for that write. This also helps with multi-fsblock folios, since the tag is refreshed on each clean->dirty transition. >> + >> + ret = iomap_writeback_folio(&wpc.ctx, folio); >> + folio_unlock(folio); >> + >> + if (ret) { >> + folio_batch_release(fbatch); >> + goto out; >> + } >> + } >> + >> + folio_batch_release(fbatch); >> + cond_resched(); >> + } >> + >> +out: >> + if (wpc.ctx.wb_ctx && wpc.ctx.ops && wpc.ctx.ops->writeback_submit) >> + wpc.ctx.ops->writeback_submit(&wpc.ctx, ret); >> + >> + return ret; >> +} >> + >> +static void xfs_ag_writeback_work(struct work_struct *work) >> +{ >> + struct xfs_ag_wb *awb = container_of(to_delayed_work(work), >> + struct xfs_ag_wb, ag_work); >> + struct xfs_ag_wb_task *task; >> + struct xfs_mount *mp; >> + struct inode *inode; >> + struct xfs_inode *ip; >> + int ret; >> + >> + for (;;) { >> + spin_lock(&awb->lock); >> + task = list_first_entry_or_null(&awb->task_list, >> + struct xfs_ag_wb_task, list); >> + if (task) >> + list_del_init(&task->list); >> + spin_unlock(&awb->lock); >> + >> + if (!task) >> + break; >> + >> + ip = task->ip; >> + mp = ip->i_mount; >> + inode = VFS_I(ip); >> + >> + ret = xfs_writepages_ag(ip, &task->wbc, task->agno); >> + >> + /* If didn't submit everything for this AG, set its bit */ >> + if (ret) >> + set_bit(task->agno, ip->i_ag_dirty_bitmap); >> + >> + iput(inode); /* drop igrab */ >> + mempool_free(task, mp->m_ag_task_pool); >> + } >> +} >> + >> +static int xfs_vm_writepages_offload(struct address_space *mapping, >> + struct writeback_control *wbc) >> +{ >> + struct inode *inode = mapping->host; >> + struct xfs_inode *ip = XFS_I(inode); >> + struct xfs_mount *mp = ip->i_mount; >> + struct xfs_ag_wb *awb; >> + struct xfs_ag_wb_task *task; >> + xfs_agnumber_t agno; >> + >> + if (!ip->i_ag_dirty_bits) >> + return 0; >> + >> + for_each_set_bit(agno, ip->i_ag_dirty_bitmap, ip->i_ag_dirty_bits) { >> + if (!test_and_clear_bit(agno, ip->i_ag_dirty_bitmap)) >> + continue; >> + >> + task = mempool_alloc(mp->m_ag_task_pool, GFP_NOFS); > > Allocating memory (even from a mempool) during writeback makes me > nervous... > Agreed. We'll remove this allocation entirely. Instead of allocating an xfs_ag_wb_task in ->writepages, we'll maintain a mount-wide list of "active" inodes (those with a non-empty dirty-AG hint). The offload path will only enqueue the inode on that list and kick the relevant AG workers. Each AG worker will then walk the active inode list and scan the pagecache, filtering folios tagged for its AG and submitting IO. This keeps the writeback path allocation-free and avoids mempool use under reclaim. >> + if (!task) { >> + set_bit(agno, ip->i_ag_dirty_bitmap); >> + continue; >> + } > > ...because apparently the allocation can fail. If so, then why don't we > just fall back to serial writeback instead of ... moving on to the next > AG and seeing if there's more memory? > >> + >> + INIT_LIST_HEAD(&task->list); >> + task->ip = ip; >> + task->agno = agno; >> + task->wbc = *wbc; >> + igrab(inode); /* worker owns inode ref */ > > Shouldn't we check for a null return value here? That should never > happen, but we /do/ have the option of falling back to standard > iomap_writepages. > Yes, I will fix this in the next version. >> + >> + awb = &mp->m_ag_wb[agno]; >> + >> + spin_lock(&awb->lock); >> + list_add_tail(&task->list, &awb->task_list); >> + spin_unlock(&awb->lock); >> + >> + mod_delayed_work(mp->m_ag_wq, &awb->ag_work, 0); >> + } >> + >> + return 0; >> +} >> + >> static const struct iomap_writeback_ops xfs_zoned_writeback_ops = { >> .writeback_range = xfs_zoned_writeback_range, >> .writeback_submit = xfs_zoned_writeback_submit, >> @@ -706,6 +880,7 @@ xfs_init_ag_writeback(struct xfs_mount *mp) >> for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) { >> struct xfs_ag_wb *awb = &mp->m_ag_wb[agno]; >> >> + INIT_DELAYED_WORK(&awb->ag_work, xfs_ag_writeback_work); >> spin_lock_init(&awb->lock); >> INIT_LIST_HEAD(&awb->task_list); >> awb->agno = agno; >> @@ -769,6 +944,9 @@ xfs_vm_writepages( >> xfs_open_zone_put(xc.open_zone); >> return error; >> } else { >> + if (wbc->sync_mode != WB_SYNC_ALL) >> + return xfs_vm_writepages_offload(mapping, wbc); >> + >> struct xfs_writepage_ctx wpc = { >> .ctx = { >> .inode = mapping->host, >> -- >> 2.25.1 >> >> > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 6/6] xfs: offload writeback by AG using per-inode dirty bitmap and per-AG workers 2026-01-16 10:08 ` [PATCH v3 6/6] xfs: offload writeback by AG using per-inode dirty bitmap and per-AG workers Kundan Kumar 2026-01-29 22:34 ` Darrick J. Wong @ 2026-02-11 9:39 ` Nirjhar Roy (IBM) 1 sibling, 0 replies; 48+ messages in thread From: Nirjhar Roy (IBM) @ 2026-02-11 9:39 UTC (permalink / raw) To: Kundan Kumar, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei Cc: linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, 2026-01-16 at 15:38 +0530, Kundan Kumar wrote: > Offload XFS writeback to per-AG workers based on the inode dirty-AG > bitmap. Each worker scans and submits writeback only for folios > belonging to its AG. > > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > --- > fs/xfs/xfs_aops.c | 178 ++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 178 insertions(+) > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > index 9d5b65922cd2..55c3154fb2b5 100644 > --- a/fs/xfs/xfs_aops.c > +++ b/fs/xfs/xfs_aops.c > @@ -678,6 +678,180 @@ xfs_zoned_writeback_submit( > return 0; > } > > +static bool xfs_agp_match(struct xfs_inode *ip, pgoff_t index, > + xfs_agnumber_t agno) > +{ > + void *ent; > + u32 v; > + bool match = false; Similar coding style comments as mentioned in the previous patches in this series - variable declaration and function prototype style > + > + ent = xa_load(&ip->i_ag_pmap, index); > + if (ent && xa_is_value(ent)) { > + v = xa_to_value(ent); > + if (xfs_agp_valid(v)) > + match = (xfs_agp_agno(v) == (u32)agno); > + } > + > + return match; > +} > + > +static bool xfs_folio_matches_ag(struct folio *folio, xfs_agnumber_t agno) > +{ > + struct xfs_inode *ip = XFS_I(folio_mapping(folio)->host); > + > + return xfs_agp_match(ip, folio->index, agno); > +} > + > +static int xfs_writepages_ag(struct xfs_inode *ip, > + struct writeback_control *wbc, > + xfs_agnumber_t agno) > +{ > + struct inode *inode = VFS_I(ip); > + struct address_space *mapping = inode->i_mapping; > + struct folio_batch *fbatch = &wbc->fbatch; > + int ret = 0; > + pgoff_t index, end; > + Coding style ^^ > + wbc->range_cyclic = 0; > + > + folio_batch_init(fbatch); > + index = wbc->range_start >> PAGE_SHIFT; > + end = wbc->range_end >> PAGE_SHIFT; > + > + struct xfs_writepage_ctx wpc = { > + .ctx = { > + .inode = inode, > + .wbc = wbc, > + .ops = &xfs_writeback_ops, > + }, > + }; Nit: Maybe group all the declarations together? I did get a comment like this in one of my patches. > + > + while (index <= end) { > + int i, nr; Coding style > + > + /* get a batch of DIRTY folios starting at index */ > + nr = filemap_get_folios_tag(mapping, &index, end, > + PAGECACHE_TAG_DIRTY, fbatch); 2 tabs indentation > + if (!nr) > + break; > + > + for (i = 0; i < nr; i++) { > + struct folio *folio = fbatch->folios[i]; > + > + /* Filter BEFORE locking */ > + if (!xfs_folio_matches_ag(folio, agno)) > + continue; > + > + folio_lock(folio); > + > + /* > + * Now it's ours: clear dirty and submit. > + * This prevents *this AG worker* from seeing it again > + * next time. 
> + */ > + if (!folio_clear_dirty_for_io(folio)) { > + folio_unlock(folio); > + continue; > + } > + xa_erase(&ip->i_ag_pmap, folio->index); > + > + ret = iomap_writeback_folio(&wpc.ctx, folio); > + folio_unlock(folio); > + > + if (ret) { > + folio_batch_release(fbatch); > + goto out; > + } > + } > + > + folio_batch_release(fbatch); > + cond_resched(); > + } > + > +out: > + if (wpc.ctx.wb_ctx && wpc.ctx.ops && wpc.ctx.ops->writeback_submit) > + wpc.ctx.ops->writeback_submit(&wpc.ctx, ret); > + > + return ret; > +} > + > +static void xfs_ag_writeback_work(struct work_struct *work) > +{ > + struct xfs_ag_wb *awb = container_of(to_delayed_work(work), > + struct xfs_ag_wb, ag_work); > + struct xfs_ag_wb_task *task; > + struct xfs_mount *mp; > + struct inode *inode; > + struct xfs_inode *ip; > + int ret; Coding style issues for variable declarations > + > + for (;;) { Nit: I am not sure if this is a common practice in XFS to use such infinite for loops. Maybe have some conditional flag variable or something? Darrick, thoughts? > + spin_lock(&awb->lock); > + task = list_first_entry_or_null(&awb->task_list, > + struct xfs_ag_wb_task, list); 2 tabs indentation ^^ > + if (task) > + list_del_init(&task->list); > + spin_unlock(&awb->lock); > + > + if (!task) > + break; > + > + ip = task->ip; > + mp = ip->i_mount; > + inode = VFS_I(ip); > + > + ret = xfs_writepages_ag(ip, &task->wbc, task->agno); > + > + /* If didn't submit everything for this AG, set its bit */ > + if (ret) > + set_bit(task->agno, ip->i_ag_dirty_bitmap); > + > + iput(inode); /* drop igrab */ > + mempool_free(task, mp->m_ag_task_pool); > + } > +} > + > +static int xfs_vm_writepages_offload(struct address_space *mapping, > + struct writeback_control *wbc) 2 tabs indentation ^^ > +{ > + struct inode *inode = mapping->host; > + struct xfs_inode *ip = XFS_I(inode); > + struct xfs_mount *mp = ip->i_mount; > + struct xfs_ag_wb *awb; > + struct xfs_ag_wb_task *task; > + xfs_agnumber_t agno; > + Coding style for variable declaration ^^ > + if (!ip->i_ag_dirty_bits) > + return 0; > + > + for_each_set_bit(agno, ip->i_ag_dirty_bitmap, ip->i_ag_dirty_bits) { > + if (!test_and_clear_bit(agno, ip->i_ag_dirty_bitmap)) > + continue; > + > + task = mempool_alloc(mp->m_ag_task_pool, GFP_NOFS); > + if (!task) { > + set_bit(agno, ip->i_ag_dirty_bitmap); > + continue; > + } > + > + INIT_LIST_HEAD(&task->list); > + task->ip = ip; > + task->agno = agno; > + task->wbc = *wbc; > + igrab(inode); /* worker owns inode ref */ > + > + awb = &mp->m_ag_wb[agno]; > + > + spin_lock(&awb->lock); > + list_add_tail(&task->list, &awb->task_list); > + spin_unlock(&awb->lock); > + > + mod_delayed_work(mp->m_ag_wq, &awb->ag_work, 0); > + } > + > + return 0; It seems that this function always returns zero - maybe in that case use void? 
> +} > + > static const struct iomap_writeback_ops xfs_zoned_writeback_ops = { > .writeback_range = xfs_zoned_writeback_range, > .writeback_submit = xfs_zoned_writeback_submit, > @@ -706,6 +880,7 @@ xfs_init_ag_writeback(struct xfs_mount *mp) > for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) { > struct xfs_ag_wb *awb = &mp->m_ag_wb[agno]; > > + INIT_DELAYED_WORK(&awb->ag_work, xfs_ag_writeback_work); > spin_lock_init(&awb->lock); > INIT_LIST_HEAD(&awb->task_list); > awb->agno = agno; > @@ -769,6 +944,9 @@ xfs_vm_writepages( > xfs_open_zone_put(xc.open_zone); > return error; > } else { > + if (wbc->sync_mode != WB_SYNC_ALL) > + return xfs_vm_writepages_offload(mapping, wbc); > + Since this patch (and the overall patch series) introduces a couple of new data structures and a lot of new functions, can we please have some concise comments at the beginning of the functions or at least the functions introduced in this patch - just for a quicker understanding? --NR > struct xfs_writepage_ctx wpc = { > .ctx = { > .inode = mapping->host, ^ permalink raw reply [flat|nested] 48+ messages in thread
* [syzbot ci] Re: AG aware parallel writeback for XFS 2026-01-16 10:08 ` [PATCH v3 0/6] AG aware parallel writeback for XFS Kundan Kumar ` (5 preceding siblings ...) [not found] ` <CGME20260116101305epcas5p497cd6d9027301853669f1c1aaffbf128@epcas5p4.samsung.com> @ 2026-01-16 16:13 ` syzbot ci 2026-01-21 19:54 ` [PATCH v3 0/6] " Brian Foster 7 siblings, 0 replies; 48+ messages in thread From: syzbot ci @ 2026-01-16 16:13 UTC (permalink / raw) To: amir73il, anuj20.g, axboe, brauner, cem, clm, dave, david, djwong, gost.dev, hch, jack, joshi.k, kundan.kumar, linux-fsdevel, linux-mm, linux-xfs, mcgrof, ritesh.list, viro, vishak.g, wangyufei, willy Cc: syzbot, syzkaller-bugs syzbot ci has tested the following series [v3] AG aware parallel writeback for XFS https://lore.kernel.org/all/20260116100818.7576-1-kundan.kumar@samsung.com * [PATCH v3 1/6] iomap: add write ops hook to attach metadata to folios * [PATCH v3 2/6] xfs: add helpers to pack AG prediction info for per-folio tracking * [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap * [PATCH v3 4/6] xfs: tag folios with AG number during buffered write via iomap attach hook * [PATCH v3 5/6] xfs: add per-AG writeback workqueue infrastructure * [PATCH v3 6/6] xfs: offload writeback by AG using per-inode dirty bitmap and per-AG workers and found the following issue: WARNING in xfs_init_ag_writeback Full report is available here: https://ci.syzbot.org/series/0e34236b-3594-400f-b9cd-6f59b196014f *** WARNING in xfs_init_ag_writeback tree: mm-new URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git base: eeb33083cc4749bdb61582eaeb5c200702607703 arch: amd64 compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8 config: https://ci.syzbot.org/builds/f21aae46-4d21-4410-9132-190ad8ae3994/config C repro: https://ci.syzbot.org/findings/24ec6312-5e15-4750-aa2a-08d381daca26/c_repro syz repro: https://ci.syzbot.org/findings/24ec6312-5e15-4750-aa2a-08d381daca26/syz_repro loop2: detected capacity change from 0 to 32768 ------------[ cut here ]------------ kmem_cache of name 'xfs_ag_wb_task' already exists WARNING: mm/slab_common.c:110 at kmem_cache_sanity_check mm/slab_common.c:109 [inline], CPU#1: syz.2.19/6098 WARNING: mm/slab_common.c:110 at __kmem_cache_create_args+0x99/0x310 mm/slab_common.c:310, CPU#1: syz.2.19/6098 Modules linked in: CPU: 1 UID: 0 PID: 6098 Comm: syz.2.19 Not tainted syzkaller #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 RIP: 0010:kmem_cache_sanity_check mm/slab_common.c:109 [inline] RIP: 0010:__kmem_cache_create_args+0x9c/0x310 mm/slab_common.c:310 Code: 43 8e 4d 8b 24 24 49 81 fc b8 78 43 8e 74 20 49 8b 7c 24 f8 48 89 de e8 f2 c0 67 09 85 c0 75 e2 48 8d 3d f7 ca bc 0d 48 89 de <67> 48 0f b9 3a 48 89 df be 20 00 00 00 e8 92 c2 67 09 48 85 c0 0f RSP: 0018:ffffc90003817a08 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffffff8bc734a0 RCX: 0000000000000007 RDX: 000000008bc73406 RSI: ffffffff8bc734a0 RDI: ffffffff8fc6ee40 RBP: 0000000000080000 R08: ffffffff8fc26e77 R09: 1ffffffff1f84dce R10: dffffc0000000000 R11: fffffbfff1f84dcf R12: ffff88810b7a7300 R13: ffff888115440000 R14: ffffc90003817aa0 R15: 0000000000000190 FS: 0000555567cf7500(0000) GS:ffff8882a9a05000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f049ffff000 CR3: 00000001b9726000 CR4: 00000000000006f0 Call Trace: <TASK> __kmem_cache_create include/linux/slab.h:384 [inline] 
xfs_init_ag_writeback+0x41b/0x570 fs/xfs/xfs_aops.c:890 xfs_fs_fill_super+0x7e3/0x1640 fs/xfs/xfs_super.c:1759 get_tree_bdev_flags+0x431/0x4f0 fs/super.c:1691 vfs_get_tree+0x92/0x2a0 fs/super.c:1751 fc_mount fs/namespace.c:1199 [inline] do_new_mount_fc fs/namespace.c:3636 [inline] do_new_mount+0x329/0xa50 fs/namespace.c:3712 do_mount fs/namespace.c:4035 [inline] __do_sys_mount fs/namespace.c:4224 [inline] __se_sys_mount+0x31d/0x420 fs/namespace.c:4201 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f32c139bf4a Code: 48 c7 c2 e8 ff ff ff f7 d8 64 89 02 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffd30494b68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 00007ffd30494bf0 RCX: 00007f32c139bf4a RDX: 0000200000009740 RSI: 0000200000009780 RDI: 00007ffd30494bb0 RBP: 0000200000009740 R08: 00007ffd30494bf0 R09: 0000000000004010 R10: 0000000000004010 R11: 0000000000000246 R12: 0000200000009780 R13: 00007ffd30494bb0 R14: 0000000000009768 R15: 0000200000000300 </TASK> ---------------- Code disassembly (best guess): 0: 43 8e 4d 8b rex.XB mov -0x75(%r13),%cs 4: 24 24 and $0x24,%al 6: 49 81 fc b8 78 43 8e cmp $0xffffffff8e4378b8,%r12 d: 74 20 je 0x2f f: 49 8b 7c 24 f8 mov -0x8(%r12),%rdi 14: 48 89 de mov %rbx,%rsi 17: e8 f2 c0 67 09 call 0x967c10e 1c: 85 c0 test %eax,%eax 1e: 75 e2 jne 0x2 20: 48 8d 3d f7 ca bc 0d lea 0xdbccaf7(%rip),%rdi # 0xdbccb1e 27: 48 89 de mov %rbx,%rsi * 2a: 67 48 0f b9 3a ud1 (%edx),%rdi <-- trapping instruction 2f: 48 89 df mov %rbx,%rdi 32: be 20 00 00 00 mov $0x20,%esi 37: e8 92 c2 67 09 call 0x967c2ce 3c: 48 85 c0 test %rax,%rax 3f: 0f .byte 0xf *** If these findings have caused you to resend the series or submit a separate fix, please add the following tag to your commit message: Tested-by: syzbot@syzkaller.appspotmail.com --- This report is generated by a bot. It may contain errors. syzbot ci engineers can be reached at syzkaller@googlegroups.com. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 0/6] AG aware parallel writeback for XFS 2026-01-16 10:08 ` [PATCH v3 0/6] AG aware parallel writeback for XFS Kundan Kumar ` (6 preceding siblings ...) 2026-01-16 16:13 ` [syzbot ci] Re: AG aware parallel writeback for XFS syzbot ci @ 2026-01-21 19:54 ` Brian Foster 2026-01-22 16:15 ` Kundan Kumar 7 siblings, 1 reply; 48+ messages in thread From: Brian Foster @ 2026-01-21 19:54 UTC (permalink / raw) To: Kundan Kumar Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, Jan 16, 2026 at 03:38:12PM +0530, Kundan Kumar wrote: > This series explores AG aware parallel writeback for XFS. The goal is > to reduce writeback contention and improve scalability by allowing > writeback to be distributed across allocation groups (AGs). > > Problem statement > ================= > Today, XFS writeback walks the page cache serially per inode and funnels > all writeback through a single writeback context. For aging filesystems, > especially with high parallel buffered IO this leads to limited > concurrency across independent AGs. > > The filesystem already has strong AG level parallelism for allocation and > metadata operations, but writeback remains largely AG agnostic. > > High-level approach > =================== > This series introduces an AG aware writeback with following model: > 1) Predict the target AG for buffered writes (mapped or delalloc) at write > time. > 2) Tag AG hints per folio (via lightweight metadata / xarray). > 3) Track dirty AGs per inode using bitmap. > 4) Offload writeback to per AG worker threads, each performing a onepass > scan. > 5) Workers filter folios and submit folios which are tagged for its AG. > > Unlike our earlier approach that parallelized writeback by introducing > multiple writeback contexts per BDI, this series keeps all changes within > XFS and is orthogonal to that work. The AG aware mechanism uses per folio > AG hints to route writeback to AG specific workers, and therefore applies > even when a single inode’s data spans multiple AGs. This avoids the > earlier limitation of relying on inode-based AG locality, which can break > down on aged/fragmented filesystems. > > IOPS and throughput > =================== > We see significant improvemnt in IOPS if files span across multiple AG > > Workload 12 files each of 500M in 12 directories(AGs) - numjobs = 12 > - NVMe device Intel Optane > Base XFS : 308 MiB/s > Parallel Writeback XFS : 1534 MiB/s (+398%) > > Workload 6 files each of 6G in 6 directories(AGs) - numjobs = 12 > - NVMe device Intel Optane > Base XFS : 409 MiB/s > Parallel Writeback XFS : 1245 MiB/s (+204%) > Hi Kundan, Could you provide more detail on how you're testing here? I threw this at some beefier storage I have around out of curiosity and I'm not seeing much of a difference. It could be I'm missing some details or maybe the storage outweighs the processing benefit. But for example, is this a fio test command being used? Is there preallocation? What type of storage? Is a particular fs geometry being targeted for this optimization (i.e. smaller AGs), etc.? FWIW, I skimmed through the code a bit and the main thing that kind of stands out to me is the write time per-folio hinting. Writeback handling for the overwrite (i.e. non-delalloc) case is basically a single lookup per mapping under shared inode lock. 
The question that comes to mind there is what is the value of per-ag batching as opposed to just adding generic concurrency? It seems unnecessary to me to take care to shuffle overwrites into per-ag based workers when the underlying locking is already shared. WRT delalloc, it looks like we're basically taking the inode AG as the starting point and guessing based on the on-disk AGF free blocks counter at the time of the write. The delalloc accounting doesn't count against the AGF, however, so ISTM that in many cases this would just effectively land on the inode AG for larger delalloc writes. Is that not the case? Once we get to delalloc writeback, we're under exclusive inode lock and fall into the block allocator. The latter trylock iterates the AGs looking for a good candidate. So what's the advantage of per-ag splitting delalloc at writeback time if we're sending the same inode to per-ag workers that all 1. require exclusive inode lock and 2. call into an allocator that is designed to be scalable (i.e. if one AG is locked it will just move to the next)? Yet another consideration is how delalloc conversion works at the xfs_bmapi_convert_delalloc() -> xfs_bmapi_convert_one_delalloc() level. If you take a look at the latter, we look up the entire delalloc extent backing the folio under writeback and attempt to allocate it all at once (not just the blocks backing the folio). So in theory if we were to end up tagging a sequence of contiguous delalloc backed folios at buffered write time with different AGs, we're still going to try to allocate all of that in one AG at writeback time. So the per-ag hinting also sort of competes with this by shuffling writeback of the same potential extent into different workers, making it a little hard to try and reason about. So stepping back it kind of feels to me like the write time hinting has so much potential for inaccuracy and unpredictability of writeback time behavior (for the delalloc case) that it makes me wonder if we're effectively just enabling arbitrary concurrency at writeback time and perhaps seeing benefit from that. If so, that makes me wonder if the associated value can be gained by somehow simplifying this to not require write time hinting at all. Have you run any experiments that perhaps rotor inodes to the individual wb workers based on the inode AG (i.e. basically ignoring all the write time stuff) by chance? Or anything that otherwise helps quantify the value of per-ag batching over just basic concurrency? I'd be interested to see if/how behavior changes with something like that. Brian
> > Kundan Kumar (6): > iomap: add write ops hook to attach metadata to folios > xfs: add helpers to pack AG prediction info for per-folio tracking > xfs: add per-inode AG prediction map and dirty-AG bitmap > xfs: tag folios with AG number during buffered write via iomap attach > hook > xfs: add per-AG writeback workqueue infrastructure > xfs: offload writeback by AG using per-inode dirty bitmap and per-AG > workers > > fs/iomap/buffered-io.c | 3 + > fs/xfs/xfs_aops.c | 257 +++++++++++++++++++++++++++++++++++++++++ > fs/xfs/xfs_aops.h | 3 + > fs/xfs/xfs_icache.c | 27 +++++ > fs/xfs/xfs_inode.h | 5 + > fs/xfs/xfs_iomap.c | 114 ++++++++++++++++++ > fs/xfs/xfs_iomap.h | 31 +++++ > fs/xfs/xfs_mount.c | 2 + > fs/xfs/xfs_mount.h | 10 ++ > fs/xfs/xfs_super.c | 2 + > include/linux/iomap.h | 3 + > 11 files changed, 457 insertions(+) > > -- > 2.25.1 > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 0/6] AG aware parallel writeback for XFS 2026-01-21 19:54 ` [PATCH v3 0/6] " Brian Foster @ 2026-01-22 16:15 ` Kundan Kumar 2026-01-23 9:36 ` Pankaj Raghav (Samsung) 2026-01-23 13:26 ` Brian Foster 0 siblings, 2 replies; 48+ messages in thread From: Kundan Kumar @ 2026-01-22 16:15 UTC (permalink / raw) To: Brian Foster Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k > > Could you provide more detail on how you're testing here? I threw this > at some beefier storage I have around out of curiosity and I'm not > seeing much of a difference. It could be I'm missing some details or > maybe the storage outweighs the processing benefit. But for example, is > this a fio test command being used? Is there preallocation? What type of > storage? Is a particular fs geometry being targeted for this > optimization (i.e. smaller AGs), etc.? Thanks, Brian, for the detailed review and for taking the time to look through the code. The numbers quoted were from fio buffered write workloads on NVMe (Optane devices), using multiple files placed in different directories mapping to different AGs. The workload was buffered randwrite, run as multiple parallel jobs. This can be tested either with fallocate=none or with preallocation. Sample script for 12 jobs to 12 directories (AGs): mkfs.xfs -f -d agcount=12 /dev/nvme0n1 mount /dev/nvme0n1 /mnt sync echo 3 > /proc/sys/vm/drop_caches for i in {1..12}; do mkdir -p /mnt/dir$i done fio job.fio umount /mnt echo 3 > /proc/sys/vm/drop_caches The job file: [global] bs=4k iodepth=32 rw=randwrite ioengine=io_uring fallocate=none nrfiles=12 numjobs=1 size=6G direct=0 group_reporting=1 create_on_open=1 name=test [job1] directory=/mnt/dir1 [job2] directory=/mnt/dir2 ... ... [job12] directory=/mnt/dir12 > > FWIW, I skimmed through the code a bit and the main thing that kind of > stands out to me is the write time per-folio hinting. Writeback handling > for the overwrite (i.e. non-delalloc) case is basically a single lookup > per mapping under shared inode lock. The question that comes to mind > there is what is the value of per-ag batching as opposed to just adding > generic concurrency? It seems unnecessary to me to take care to shuffle > overwrites into per-ag based workers when the underlying locking is > already shared. > That’s a fair point. For the overwrite (non-delalloc) case, the per-folio AG hinting is not meant to change allocation behavior, and I agree the underlying inode locking remains shared. The primary value I’m seeing there is the ability to partition writeback iteration and submission when dirty data spans multiple AGs. I will try routing overwrite writeback to workers irrespective of AG (e.g. hash/inode based) to compare generic concurrency against AG batching. > WRT delalloc, it looks like we're basically taking the inode AG as the > starting point and guessing based on the on-disk AGF free blocks counter > at the time of the write. The delalloc accounting doesn't count against > the AGF, however, so ISTM that in many cases this would just effectively > land on the inode AG for larger delalloc writes. Is that not the case? > > Once we get to delalloc writeback, we're under exclusive inode lock and > fall into the block allocator. The latter trylock iterates the AGs > looking for a good candidate.
So what's the advantage of per-ag > splitting delalloc at writeback time if we're sending the same inode to > per-ag workers that all 1. require exclusive inode lock and 2. call into > an allocator that is designed to be scalable (i.e. if one AG is locked > it will just move to the next)? > The intent of per-AG splitting is not to parallelize allocation within a single inode or override allocator behavior, but to partition writeback scheduling so that inodes associated with different AGs are routed to different workers. This implicitly distributes inodes across AG workers, even though each inode’s delalloc conversion remains serialized. > Yet another consideration is how delalloc conversion works at the > xfs_bmapi_convert_delalloc() -> xfs_bmapi_convert_one_delalloc() level. > If you take a look at the latter, we look up the entire delalloc extent > backing the folio under writeback and attempt to allocate it all at once > (not just the blocks backing the folio). So in theory if we were to end > up tagging a sequence of contiguous delalloc backed folios at buffered > write time with different AGs, we're still going to try to allocate all > of that in one AG at writeback time. So the per-ag hinting also sort of > competes with this by shuffling writeback of the same potential extent > into different workers, making it a little hard to try and reason about. > Agreed: delalloc conversion happens at extent granularity, so per-folio AG hints are not meant to steer final allocation. In this series the hints are used purely as writeback scheduling tokens; allocation still occurs once per extent under XFS_ILOCK_EXCL using existing allocator logic. The goal is to partition writeback work and avoid funneling multiple inodes through a single writeback path, not to influence extent placement. > So stepping back it kind of feels to me like the write time hinting has > so much potential for inaccuracy and unpredictability of writeback time > behavior (for the delalloc case), that it makes me wonder if we're > effectively just enabling arbitrary concurrency at writeback time and > perhaps seeing benefit from that. If so, that makes me wonder if the > associated value can be gained by somehow simplifying this to not > require write time hinting at all. > > Have you run any experiments that perhaps rotor inodes to the > individual wb workers based on the inode AG (i.e. basically ignoring all > the write time stuff) by chance? Or anything that otherwise helps > quantify the value of per-ag batching over just basic concurrency? I'd > be interested to see if/how behavior changes with something like that. Yes, inode-AG based routing has been explored as part of earlier higher-level writeback work (link below), where inodes are affined to writeback contexts based on inode AG. That effectively provides generic concurrency and serves as a useful baseline. https://lore.kernel.org/all/20251014120845.2361-1-kundan.kumar@samsung.com/ The motivation for this series is the complementary case where a single inode’s dirty data spans multiple AGs on aged/fragmented filesystems, where inode-AG affinity breaks down. The folio-level AG hinting here is intended to explore whether finer-grained partitioning provides additional benefit beyond inode-based routing. ^ permalink raw reply [flat|nested] 48+ messages in thread
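A minimal sketch of the two routing policies compared in the message above, purely illustrative: hash_64() and XFS_INO_TO_AGNO() are existing kernel/XFS helpers, while nr_wb_workers and both functions are hypothetical names for this sketch.

/*
 * Illustrative only: pick a writeback worker for an inode either by
 * hashing its inode number (generic concurrency) or by deriving it
 * from the inode's home AG (AG batching).
 */
static unsigned int
xfs_wb_worker_by_hash(struct xfs_inode *ip, unsigned int nr_wb_workers)
{
	/* spread inodes uniformly across workers, ignoring AG locality */
	return hash_64(ip->i_ino, 32) % nr_wb_workers;
}

static unsigned int
xfs_wb_worker_by_agno(struct xfs_mount *mp, struct xfs_inode *ip,
		      unsigned int nr_wb_workers)
{
	/* affine the inode to a worker derived from its home AG */
	return XFS_INO_TO_AGNO(mp, ip->i_ino) % nr_wb_workers;
}

If the two perform comparably for the overwrite case, that would point to the benefit coming from concurrency itself rather than from AG batching.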
* Re: [PATCH v3 0/6] AG aware parallel writeback for XFS 2026-01-22 16:15 ` Kundan Kumar @ 2026-01-23 9:36 ` Pankaj Raghav (Samsung) 2026-01-23 13:26 ` Brian Foster 0 siblings, 0 replies; 48+ messages in thread From: Pankaj Raghav (Samsung) @ 2026-01-23 9:36 UTC (permalink / raw) To: Kundan Kumar Cc: Brian Foster, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k > > So stepping back it kind of feels to me like the write time hinting has > > so much potential for inaccuracy and unpredictability of writeback time > > behavior (for the delalloc case), that it makes me wonder if we're > > effectively just enabling arbitrary concurrency at writeback time and > > perhaps seeing benefit from that. If so, that makes me wonder if the > > associated value can be gained by somehow simplifying this to not > > require write time hinting at all. > > > > Have you run any experiments that perhaps rotor inodes to the > > individual wb workers based on the inode AG (i.e. basically ignoring all > > the write time stuff) by chance? Or anything that otherwise helps > > quantify the value of per-ag batching over just basic concurrency? I'd > > be interested to see if/how behavior changes with something like that. > > Yes, inode-AG based routing has been explored as part of earlier > higher-level writeback work (link below), where inodes are affined to > writeback contexts based on inode AG. That effectively provides > generic concurrency and serves as a useful baseline. > https://lore.kernel.org/all/20251014120845.2361-1-kundan.kumar@samsung.com/ > > The motivation for this series is the complementary case where a > single inode’s dirty data spans multiple AGs on aged/fragmented > filesystems, where inode-AG affinity breaks down. The folio-level AG > hinting here is intended to explore whether finer-grained partitioning > provides additional benefit beyond inode-based routing. > One thing that would be useful to see in the cover letter is how these patches behave on an aged XFS filesystem, as that was the main reason you moved to a different approach. And given that you have completely changed the design, it would have been better to post this series as an RFC as well. -- Pankaj ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 0/6] AG aware parallel writeback for XFS 2026-01-22 16:15 ` Kundan Kumar 2026-01-23 9:36 ` Pankaj Raghav (Samsung) @ 2026-01-23 13:26 ` Brian Foster 2026-01-28 18:28 ` Kundan Kumar 1 sibling, 1 reply; 48+ messages in thread From: Brian Foster @ 2026-01-23 13:26 UTC (permalink / raw) To: Kundan Kumar Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Thu, Jan 22, 2026 at 09:45:05PM +0530, Kundan Kumar wrote: > > > > Could you provide more detail on how you're testing here? I threw this > > at some beefier storage I have around out of curiosity and I'm not > > seeing much of a difference. It could be I'm missing some details or > > maybe the storage outweighs the processing benefit. But for example, is > > this a fio test command being used? Is there preallocation? What type of > > storage? Is a particular fs geometry being targeted for this > > optimization (i.e. smaller AGs), etc.? > > Thanks Brian for the detailed review and for taking the time to > look through the code. > > The numbers quoted were from fio buffered write workloads on NVMe > (Optane devices), using multiple files placed in different > directories mapping to different AGs. The jobs were buffered > randwrite, run in parallel. This can be tested either with > fallocate=none or with preallocation. > > Sample script for 12 jobs to 12 directories (AGs): > > mkfs.xfs -f -d agcount=12 /dev/nvme0n1 > mount /dev/nvme0n1 /mnt > sync > echo 3 > /proc/sys/vm/drop_caches > > for i in {1..12}; do > mkdir -p /mnt/dir$i > done > > fio job.fio > > umount /mnt > echo 3 > /proc/sys/vm/drop_caches > > The job file: > > [global] > bs=4k > iodepth=32 > rw=randwrite > ioengine=io_uring > fallocate=none > nrfiles=12 > numjobs=1 > size=6G > direct=0 > group_reporting=1 > create_on_open=1 > name=test > > [job1] > directory=/mnt/dir1 > > [job2] > directory=/mnt/dir2 > ... > ... > [job12] > directory=/mnt/dir12 > Thanks.. > > > > FWIW, I skimmed through the code a bit and the main thing that kind of > > stands out to me is the write time per-folio hinting. Writeback handling > > for the overwrite (i.e. non-delalloc) case is basically a single lookup > > per mapping under shared inode lock. The question that comes to mind > > there is what is the value of per-ag batching as opposed to just adding > > generic concurrency? It seems unnecessary to me to take care to shuffle > > overwrites into per-ag based workers when the underlying locking is > > already shared. > > > > That’s a fair point. For the overwrite (non-delalloc) case, the > per-folio AG hinting is not meant to change allocation behavior, and > I agree the underlying inode locking remains shared. The primary value > I’m seeing there is the ability to partition writeback iteration and > submission when dirty data spans multiple AGs. > I will try routing overwrite writeback to workers irrespective of AG > (e.g. hash/inode based), to compare generic concurrency against AG > batching. > > > WRT delalloc, it looks like we're basically taking the inode AG as the > > starting point and guessing based on the on-disk AGF free blocks counter > > at the time of the write. The delalloc accounting doesn't count against > > the AGF, however, so ISTM that in many cases this would just effectively > > land on the inode AG for larger delalloc writes. Is that not the case?
> > > > Once we get to delalloc writeback, we're under exclusive inode lock and > > fall into the block allocator. The latter trylock iterates the AGs > > looking for a good candidate. So what's the advantage of per-ag > > splitting delalloc at writeback time if we're sending the same inode to > > per-ag workers that all 1. require exclusive inode lock and 2. call into > > an allocator that is designed to be scalable (i.e. if one AG is locked > > it will just move to the next)? > > > > The intent of per-AG splitting is not to parallelize allocation > within a single inode or override allocator behavior, but to > partition writeback scheduling so that inodes associated with > different AGs are routed to different workers. This implicitly > distributes inodes across AG workers, even though each inode’s > delalloc conversion remains serialized. > > > Yet another consideration is how delalloc conversion works at the > > xfs_bmapi_convert_delalloc() -> xfs_bmapi_convert_one_delalloc() level. > > If you take a look at the latter, we look up the entire delalloc extent > > backing the folio under writeback and attempt to allocate it all at once > > (not just the blocks backing the folio). So in theory if we were to end > > up tagging a sequence of contiguous delalloc backed folios at buffered > > write time with different AGs, we're still going to try to allocate all > > of that in one AG at writeback time. So the per-ag hinting also sort of > > competes with this by shuffling writeback of the same potential extent > > into different workers, making it a little hard to try and reason about. > > > > Agreed: delalloc conversion happens at extent granularity, so > per-folio AG hints are not meant to steer final allocation. In this > series the hints are used purely as writeback scheduling tokens; > allocation still occurs once per extent under XFS_ILOCK_EXCL using > existing allocator logic. The goal is to partition writeback work and > avoid funneling multiple inodes through a single writeback path, not > to influence extent placement. > > Yeah.. I realize none of this is really intended to drive allocation behavior. The observation that all this per-folio tracking ultimately boils down to either sharding based on information we have at writeback time (i.e. overwrites) or effectively batching based on on-disk AG state at the time of the write is kind of what suggests that the folio granular hinting is potentially overkill. > > So stepping back it kind of feels to me like the write time hinting has > > so much potential for inaccuracy and unpredictability of writeback time > > behavior (for the delalloc case), that it makes me wonder if we're > > effectively just enabling arbitrary concurrency at writeback time and > > perhaps seeing benefit from that. If so, that makes me wonder if the > > associated value can be gained by somehow simplifying this to not > > require write time hinting at all. > > > > Have you run any experiments that perhaps rotor inodes to the > > individual wb workers based on the inode AG (i.e. basically ignoring all > > the write time stuff) by chance? Or anything that otherwise helps > > quantify the value of per-ag batching over just basic concurrency? I'd > > be interested to see if/how behavior changes with something like that. > > Yes, inode-AG based routing has been explored as part of earlier > higher-level writeback work (link below), where inodes are affined to > writeback contexts based on inode AG.
That effectively provides > generic concurrency and serves as a useful baseline. > https://lore.kernel.org/all/20251014120845.2361-1-kundan.kumar@samsung.com/ > Ah, I recall seeing that. A couple questions.. That link states the following: "For XFS, affining inodes to writeback threads resulted in a decline in IOPS for certain devices. The issue was caused by AG lock contention in xfs_end_io, where multiple writeback threads competed for the same AG lock." Can you quantify that? It seems like xfs_end_io() mostly cares about things like unwritten conversion, COW remapping, etc., so block allocation shouldn't be prominent. Is this producing something where frequent unwritten conversion results in a lot of bmapbt splits or something? Also, how safe is it to break off writeback tasks at the XFS layer like this? For example, is it safe to spread around the wbc to a bunch of tasks like this? What about serialization for things like bandwidth accounting and whatnot in the core/calling code? Should the code that splits off wq tasks in XFS be responsible to wait for parallel submission completion before returning (I didn't see anything doing that on a scan, but could have missed it)..? > The motivation for this series is the complementary case where a > single inode’s dirty data spans multiple AGs on aged/fragmented > filesystems, where inode-AG affinity breaks down. The folio-level AG > hinting here is intended to explore whether finer-grained partitioning > provides additional benefit beyond inode-based routing. > But also I'm not sure I follow the high level goal here. I have the same question as Pankaj in that regard.. is this series intended to replace the previous bdi level approach, or go along with it somehow? Doing something at the bdi level seems like a more natural approach in general, so I'm curious why the change in direction. Brian ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 0/6] AG aware parallel writeback for XFS 2026-01-23 13:26 ` Brian Foster @ 2026-01-28 18:28 ` Kundan Kumar 2026-02-06 6:25 ` Christoph Hellwig 0 siblings, 1 reply; 48+ messages in thread From: Kundan Kumar @ 2026-01-28 18:28 UTC (permalink / raw) To: Brian Foster Cc: viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On 1/23/2026 6:56 PM, Brian Foster wrote: > On Thu, Jan 22, 2026 at 09:45:05PM +0530, Kundan Kumar wrote: >> > >> > Could you provide more detail on how you're testing here? I threw this >> > at some beefier storage I have around out of curiosity and I'm not >> > seeing much of a difference. It could be I'm missing some details or >> > maybe the storage outweighs the processing benefit. But for example, is >> > this a fio test command being used? Is there preallocation? What type of >> > storage? Is a particular fs geometry being targeted for this >> > optimization (i.e. smaller AGs), etc.? >> >> Thanks Brian for the detailed review and for taking the time to >> look through the code. >> >> The numbers quoted were from fio buffered write workloads on NVMe >> (Optane devices), using multiple files placed in different >> directories mapping to different AGs. The jobs were buffered >> randwrite, run in parallel. This can be tested either with >> fallocate=none or with preallocation. >> >> Sample script for 12 jobs to 12 directories (AGs): >> >> mkfs.xfs -f -d agcount=12 /dev/nvme0n1 >> mount /dev/nvme0n1 /mnt >> sync >> echo 3 > /proc/sys/vm/drop_caches >> >> for i in {1..12}; do >> mkdir -p /mnt/dir$i >> done >> >> fio job.fio >> >> umount /mnt >> echo 3 > /proc/sys/vm/drop_caches >> >> The job file: >> >> [global] >> bs=4k >> iodepth=32 >> rw=randwrite >> ioengine=io_uring >> fallocate=none >> nrfiles=12 >> numjobs=1 >> size=6G >> direct=0 >> group_reporting=1 >> create_on_open=1 >> name=test >> >> [job1] >> directory=/mnt/dir1 >> >> [job2] >> directory=/mnt/dir2 >> ... >> ... >> [job12] >> directory=/mnt/dir12 >> > > Thanks.. > >>> >>> FWIW, I skimmed through the code a bit and the main thing that kind of >>> stands out to me is the write time per-folio hinting. Writeback handling >>> for the overwrite (i.e. non-delalloc) case is basically a single lookup >>> per mapping under shared inode lock. The question that comes to mind >>> there is what is the value of per-ag batching as opposed to just adding >>> generic concurrency? It seems unnecessary to me to take care to shuffle >>> overwrites into per-ag based workers when the underlying locking is >>> already shared. >>> >> >> That’s a fair point. For the overwrite (non-delalloc) case, the >> per-folio AG hinting is not meant to change allocation behavior, and >> I agree the underlying inode locking remains shared. The primary value >> I’m seeing there is the ability to partition writeback iteration and >> submission when dirty data spans multiple AGs. >> I will try routing overwrite writeback to workers irrespective of AG >> (e.g. hash/inode based), to compare generic concurrency against AG >> batching. >> >>> WRT delalloc, it looks like we're basically taking the inode AG as the >>> starting point and guessing based on the on-disk AGF free blocks counter >>> at the time of the write. The delalloc accounting doesn't count against >>> the AGF, however, so ISTM that in many cases this would just effectively >>> land on the inode AG for larger delalloc writes. Is that not the case?
>>> >>> Once we get to delalloc writeback, we're under exclusive inode lock and >>> fall into the block allocator. The latter trylock iterates the AGs >>> looking for a good candidate. So what's the advantage of per-ag >>> splitting delalloc at writeback time if we're sending the same inode to >>> per-ag workers that all 1. require exclusive inode lock and 2. call into >>> an allocator that is designed to be scalable (i.e. if one AG is locked >>> it will just move to the next)? >>> >> >> The intent of per-AG splitting is not to parallelize allocation >> within a single inode or override allocator behavior, but to >> partition writeback scheduling so that inodes associated with >> different AGs are routed to different workers. This implicitly >> distributes inodes across AG workers, even though each inode’s >> delalloc conversion remains serialized. >> >>> Yet another consideration is how delalloc conversion works at the >>> xfs_bmapi_convert_delalloc() -> xfs_bmapi_convert_one_delalloc() level. >>> If you take a look at the latter, we look up the entire delalloc extent >>> backing the folio under writeback and attempt to allocate it all at once >>> (not just the blocks backing the folio). So in theory if we were to end >>> up tagging a sequence of contiguous delalloc backed folios at buffered >>> write time with different AGs, we're still going to try to allocate all >>> of that in one AG at writeback time. So the per-ag hinting also sort of >>> competes with this by shuffling writeback of the same potential extent >>> into different workers, making it a little hard to try and reason about. >>> >> >> Agreed: delalloc conversion happens at extent granularity, so >> per-folio AG hints are not meant to steer final allocation. In this >> series the hints are used purely as writeback scheduling tokens; >> allocation still occurs once per extent under XFS_ILOCK_EXCL using >> existing allocator logic. The goal is to partition writeback work and >> avoid funneling multiple inodes through a single writeback path, not >> to influence extent placement. >> > > Yeah.. I realize none of this is really intended to drive allocation > behavior. The observation that all this per-folio tracking ultimately > boils down to either sharding based on information we have at writeback > time (i.e. overwrites) or effectively batching based on on-disk AG state > at the time of the write is kind of what suggests that the folio > granular hinting is potentially overkill. > >>> So stepping back it kind of feels to me like the write time hinting has >>> so much potential for inaccuracy and unpredictability of writeback time >>> behavior (for the delalloc case), that it makes me wonder if we're >>> effectively just enabling arbitrary concurrency at writeback time and >>> perhaps seeing benefit from that. If so, that makes me wonder if the >>> associated value can be gained by somehow simplifying this to not >>> require write time hinting at all. >>> >>> Have you run any experiments that perhaps rotor inodes to the >>> individual wb workers based on the inode AG (i.e. basically ignoring all >>> the write time stuff) by chance? Or anything that otherwise helps >>> quantify the value of per-ag batching over just basic concurrency? I'd >>> be interested to see if/how behavior changes with something like that. >> >> Yes, inode-AG based routing has been explored as part of earlier >> higher-level writeback work (link below), where inodes are affined to >> writeback contexts based on inode AG.
That effectively provides >> generic concurrency and serves as a useful baseline. >> https://lore.kernel.org/all/20251014120845.2361-1-kundan.kumar@samsung.com/ >> > > Ah, I recall seeing that. A couple questions.. > > That link states the following: > > "For XFS, affining inodes to writeback threads resulted in a decline > in IOPS for certain devices. The issue was caused by AG lock contention > in xfs_end_io, where multiple writeback threads competed for the same > AG lock." > > Can you quantify that? It seems like xfs_end_io() mostly cares about > things like unwritten conversion, COW remapping, etc., so block > allocation shouldn't be prominent. Is this producing something where > frequent unwritten conversion results in a lot of bmapbt splits or > something? > I captured stacks from the contending completion workers and the hotspot was in the unwritten conversion path (xfs_end_io() -> xfs_iomap_write_unwritten()). We were repeatedly contending on the AGF buffer lock via xfs_alloc_fix_freelist() / xfs_alloc_read_agf() when writeback threads were affined per-inode. This contention went away once writeback was distributed across AG-based workers, pointing to reduced AGF hotspotting during unwritten conversion (rmap/btree updates and freelist fixes), rather than block allocation in the write path itself. > Also, how safe is it to break off writeback tasks at the XFS layer > like this? For example, is it safe to spread around the wbc to a bunch > of tasks like this? What about serialization for things like bandwidth > accounting and whatnot in the core/calling code? Should the code that > splits off wq tasks in XFS be responsible to wait for parallel > submission completion before returning (I didn't see anything doing that > on a scan, but could have missed it)..? > You are right that core writeback accounting assumes serialized updates. The current series copies wbc per worker to avoid concurrent mutation, but that is not sufficient for strict global accounting semantics. For this series we only offload the async path (wbc->sync_mode != WB_SYNC_ALL), so we do not wait for worker completion before returning from ->writepages(). Sync writeback continues down the existing iomap_writepages path. >> The motivation for this series is the complementary case where a >> single inode’s dirty data spans multiple AGs on aged/fragmented >> filesystems, where inode-AG affinity breaks down. The folio-level AG >> hinting here is intended to explore whether finer-grained partitioning >> provides additional benefit beyond inode-based routing. >> > > But also I'm not sure I follow the high level goal here. I have the same > question as Pankaj in that regard.. is this series intended to replace > the previous bdi level approach, or go along with it somehow? Doing > something at the bdi level seems like a more natural approach in > general, so I'm curious why the change in direction. > > Brian > This series is intended to replace the earlier BDI-level approach for XFS, not to go alongside it. While BDI-level sharding is the more natural generic mechanism, we saw XFS regressions on some setups when inodes were affined to wb threads due to completion-side AG contention. The goal here is to make concurrency an XFS policy decision by routing writeback using AG-aware folio tags, so we avoid inode-affinity hotspots and handle cases where a single inode spans multiple AGs on aged or fragmented filesystems.
If this approach does not hold up across workloads and devices, we can fall back to the generic BDI sharding model. - Kundan ^ permalink raw reply [flat|nested] 48+ messages in thread
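For reference, a minimal sketch of the async-only split described in the reply above; wbc->sync_mode and WB_SYNC_ALL are existing kernel interfaces, while the two xfs_* helpers are hypothetical stand-ins for what the series implements:

/*
 * Rough sketch only: sync writeback keeps the existing serial iomap
 * path, while async writeback is fanned out to per-AG workers, each
 * given a private copy of the caller's wbc so no worker mutates the
 * shared writeback state. Note the async side returns without waiting,
 * which is exactly the accounting concern raised above.
 */
static int
xfs_vm_writepages(struct address_space *mapping,
		  struct writeback_control *wbc)
{
	if (wbc->sync_mode == WB_SYNC_ALL)
		return xfs_writepages_serial(mapping, wbc);

	return xfs_ag_writeback_offload(mapping, wbc);
}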
* Re: [PATCH v3 0/6] AG aware parallel writeback for XFS 2026-01-28 18:28 ` Kundan Kumar @ 2026-02-06 6:25 ` Christoph Hellwig 2026-02-06 10:07 ` Kundan Kumar 2026-02-09 15:54 ` Kundan Kumar 0 siblings, 2 replies; 48+ messages in thread From: Christoph Hellwig @ 2026-02-06 6:25 UTC (permalink / raw) To: Kundan Kumar Cc: Brian Foster, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, hch, ritesh.list, djwong, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Wed, Jan 28, 2026 at 11:58:25PM +0530, Kundan Kumar wrote: > This series is intended to replace the earlier BDI-level approach for > XFS, not to go alongside it. While BDI-level sharding is the more > natural generic mechanism, we saw XFS regressions on some setups when > inodes were affined to wb threads due to completion-side AG contention. > > The goal here is to make concurrency an XFS policy decision by routing > writeback using AG-aware folio tags, so we avoid inode-affinity > hotspots and handle cases where a single inode spans multiple AGs on > aged or fragmented filesystems. > > If this approach does not hold up across workloads and devices, we can > fall back to the generic BDI sharding model. I fear we're deep down a rabbit hole solving the wrong problem here. Traditionally block allocation, in XFS and in general, was about finding the "best" location to avoid seeks. With SSDs the seeks themselves are kinda pointless, although large sequential write streams are still very useful of course, as is avoiding both freespace and bmap fragmentation. On the other hand avoiding contention from multiple writers is a good thing. (this is discounting the HDD case, where the industry is very rapidly moving to zoned devices, for which zoned XFS has a totally different allocator) With multi-threaded writeback this becomes important for writeback, but even before that it would be useful for direct and uncached I/O. So I think the first thing I'd look into is tuning the allocator to avoid that contention, by spreading different allocation streams from different cores to different AGs, and relaxing the very sophisticated and detailed placement done by the XFS allocator. ^ permalink raw reply [flat|nested] 48+ messages in thread
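One way to read "spreading different allocation streams from different cores to different AGs" concretely; this is a guess at the shape of the idea, not anything from the series, and everything here is hypothetical except raw_smp_processor_id() and sb_agcount:

/*
 * Hypothetical sketch: derive an allocation stream's starting AG from
 * the submitting CPU, so concurrent writers on different cores tend to
 * allocate from different AGs instead of contending on one.
 */
static xfs_agnumber_t
xfs_alloc_stream_start_agno(struct xfs_mount *mp)
{
	return raw_smp_processor_id() % mp->m_sb.sb_agcount;
}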
* Re: [PATCH v3 0/6] AG aware parallel writeback for XFS 2026-02-06 6:25 ` Christoph Hellwig @ 2026-02-06 10:07 ` Kundan Kumar 2026-02-06 17:42 ` Darrick J. Wong 2026-02-09 6:30 ` Christoph Hellwig 1 sibling, 2 replies; 48+ messages in thread From: Kundan Kumar @ 2026-02-06 10:07 UTC (permalink / raw) To: Christoph Hellwig, djwong Cc: Brian Foster, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On 2/6/2026 11:55 AM, Christoph Hellwig wrote: > I fear we're deep down a rabbit hole solving the wrong problem here. > Traditionally block allocation, in XFS and in general, was about finding > the "best" location to avoid seeks. With SSDs the seeks themselves are > kinda pointless, although large sequential write streams are still very > useful of course, as is avoiding both freespace and bmap fragmentation. > On the other hand avoiding contention from multiple writers is a good > thing. (this is discounting the HDD case, where the industry is very > rapidly moving to zoned devices, for which zoned XFS has a totally > different allocator) > > With multi-threaded writeback this becomes important for writeback, but > even before that it would be useful for direct and uncached I/O. > > So I think the first thing I'd look into is tuning the allocator to > avoid that contention, by spreading different allocation streams from > different cores to different AGs, and relaxing the very sophisticated and > detailed placement done by the XFS allocator. When you say "coarse-grained sharding", do you mean tracking a single "home AG" per inode (no per-folio tagging) and using it as a best-effort hint for writeback routing? If so, we can align with that and keep the threading model generic by relying on bdi writeback contexts. Concretely, XFS would set up a bounded number of bdi wb contexts at mount time, and route each inode to its home shard. Does this align with what you have in mind? We had implemented a similar approach in earlier versions of this series[1]. But the feedback[2] that we got was that mapping high level writeback to eventual AG allocation can sometimes be inaccurate (aged filesystems, inode spanning across multiple AGs, etc.), so we moved to per-folio tagging to make the routing decision closer to the actual IO. That said, I agree that the per-folio approach is complex. Darrick, does this direction look reasonable to you as well? [1] https://lore.kernel.org/all/20251014120845.2361-1-kundan.kumar@samsung.com/ [2] https://lore.kernel.org/all/20251107133742.GA5596@lst.de/ ^ permalink raw reply [flat|nested] 48+ messages in thread
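To make the proposal above concrete, one hypothetical shape for the mount-time setup and per-inode routing; m_nr_wb_shards and both helpers are invented names for this sketch, while sb_agcount, num_online_cpus(), min_t() and XFS_INO_TO_AGNO() exist:

/*
 * Hypothetical sketch: cap the number of writeback contexts at mount
 * time and derive an inode's shard once from its home AG; no per-folio
 * state is kept.
 */
static void
xfs_init_wb_shards(struct xfs_mount *mp)
{
	mp->m_nr_wb_shards = min_t(unsigned int, mp->m_sb.sb_agcount,
				   num_online_cpus());
}

static unsigned int
xfs_inode_wb_shard(struct xfs_mount *mp, struct xfs_inode *ip)
{
	/* best-effort hint only; allocation may still land elsewhere */
	return XFS_INO_TO_AGNO(mp, ip->i_ino) % mp->m_nr_wb_shards;
}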
* Re: [PATCH v3 0/6] AG aware parallel writeback for XFS 2026-02-06 10:07 ` Kundan Kumar @ 2026-02-06 17:42 ` Darrick J. Wong 0 siblings, 0 replies; 48+ messages in thread From: Darrick J. Wong @ 2026-02-06 17:42 UTC (permalink / raw) To: Kundan Kumar Cc: Christoph Hellwig, Brian Foster, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, Feb 06, 2026 at 03:37:38PM +0530, Kundan Kumar wrote: > On 2/6/2026 11:55 AM, Christoph Hellwig wrote: > > I fear we're deep down a rabbit hole solving the wrong problem here. > > Traditionally block allocation, in XFS and in general, was about finding > > the "best" location to avoid seeks. With SSDs the seeks themselves are > > kinda pointless, although large sequential write streams are still very > > useful of course, as is avoiding both freespace and bmap fragmentation. > > On the other hand avoiding contention from multiple writers is a good > > thing. (this is discounting the HDD case, where the industry is very > > rapidly moving to zoned devices, for which zoned XFS has a totally > > different allocator) > > > > With multi-threaded writeback this becomes important for writeback, but > > even before that it would be useful for direct and uncached I/O. > > > > So I think the first thing I'd look into is tuning the allocator to > > avoid that contention, by spreading different allocation streams from > > different cores to different AGs, and relaxing the very sophisticated and > > detailed placement done by the XFS allocator. > > When you say "coarse-grained sharding", do you mean tracking a single > "home AG" per inode (no per-folio tagging) and using it as a best-effort > hint for writeback routing? > > If so, we can align with that and keep the threading model generic by > relying on bdi writeback contexts. Concretely, XFS would set up a > bounded number of bdi wb contexts at mount time, and route each inode to > its home shard. Does this align with what you have in mind? > > We had implemented a similar approach in earlier versions of this > series[1]. But the feedback[2] that we got was that mapping high level > writeback to eventual AG allocation can sometimes be inaccurate (aged > filesystems, inode spanning across multiple AGs, etc.), so we moved to > per-folio tagging to make the routing decision closer to the actual IO. > That said, I agree that the per-folio approach is complex. > > Darrick, does this direction look reasonable to you as well? Yeah, I think a simpler scheme would be a better starting place; we can make things more complex later once we can show that there's a need. --D > > [1] > https://lore.kernel.org/all/20251014120845.2361-1-kundan.kumar@samsung.com/ > [2] https://lore.kernel.org/all/20251107133742.GA5596@lst.de/ > > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 0/6] AG aware parallel writeback for XFS 2026-02-06 10:07 ` Kundan Kumar 2026-02-06 17:42 ` Darrick J. Wong @ 2026-02-09 6:30 ` Christoph Hellwig 1 sibling, 0 replies; 48+ messages in thread From: Christoph Hellwig @ 2026-02-09 6:30 UTC (permalink / raw) To: Kundan Kumar Cc: Christoph Hellwig, djwong, Brian Foster, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, ritesh.list, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Fri, Feb 06, 2026 at 03:37:38PM +0530, Kundan Kumar wrote: > When you say "coarse-grained sharding", do you mean tracking a single > "home AG" per inode (no per-folio tagging) and using it as a best-effort > hint for writeback routing? Yes, but.. > If so, we can align with that and keep the threading model generic by > relying on bdi writeback contexts. Concretely, XFS would set up a > bounded number of bdi wb contexts at mount time, and route each inode to > its home shard. Does this align with what you have in mind? Yes. > We had implemented a similar approach in earlier versions of this > series[1]. But the feedback[2] that we got was that mapping high level > writeback to eventual AG allocation can sometimes be inaccurate (aged > filesystems, inode spanning across multiple AGs, etc.), so we moved to > per-folio tagging to make the routing decision closer to the actual IO. > That said, I agree that the per-folio approach is complex. It is inaccurate, not to say often wrong. And that's the point I'm trying to make here: start with simplifying the allocation policies so that they make sense in this kind of environment, and then align the sharding with that. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 0/6] AG aware parallel writeback for XFS 2026-02-06 6:25 ` Christoph Hellwig 2026-02-06 10:07 ` Kundan Kumar @ 2026-02-09 15:54 ` Kundan Kumar 2026-02-10 15:38 ` Christoph Hellwig 1 sibling, 1 reply; 48+ messages in thread From: Kundan Kumar @ 2026-02-09 15:54 UTC (permalink / raw) To: Christoph Hellwig Cc: Brian Foster, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, ritesh.list, djwong, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On 2/6/2026 11:55 AM, Christoph Hellwig wrote: > I fear we're deep down a rabbit hole solving the wrong problem here. > Traditionally block allocation, in XFS and in general, was about finding > the "best" location to avoid seeks. With SSDs the seeks themselves are > kinda pointless, although large sequential write streams are still very > useful of course, as is avoiding both freespace and bmap fragmentation. > On the other hand avoiding contention from multiple writers is a good > thing. (this is discounting the HDD case, where the industry is very > rapidly moving to zoned devices, for which zoned XFS has a totally > different allocator) > > With multi-threaded writeback this becomes important for writeback, but > even before that it would be useful for direct and uncached I/O. > > So I think the first thing I'd look into is tuning the allocator to > avoid that contention, by spreading different allocation streams from > different cores to different AGs, and relaxing the very sophisticated and > detailed placement done by the XFS allocator. Thanks, I'm going to restate what I think you are suggesting to make sure I'm tracking correctly. We will step back from per-folio tagging and instead align coarse sharding with a simpler allocation policy:

- Create a bounded number of bdi wb contexts at mount time (capped, e.g. ≤ agcount).
- Store a per-inode stream/shard id (no per-folio state).
- Assign the stream id once and use it to select the wb context for writeback.
- In the delalloc allocator, bias AG selection from the stream id by partitioning AG space into per-stream "bands" and rotating the start AG within that band; fall back to the existing allocator when allocation can't be satisfied.

Does this align with what you have in mind? ^ permalink raw reply [flat|nested] 48+ messages in thread
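A purely illustrative sketch of the "bands" bullet above; mp->m_sb.sb_agcount and the max_t()/atomic helpers are real, while the function itself and its parameters are hypothetical:

/*
 * Hypothetical per-stream AG bands: carve the AG space into one
 * contiguous band per stream and rotate the starting AG within the
 * stream's band; the existing allocator still provides the fallback
 * when the band cannot satisfy the allocation.
 */
static xfs_agnumber_t
xfs_stream_start_agno(struct xfs_mount *mp, unsigned int stream_id,
		      unsigned int nr_streams, atomic_t *rotor)
{
	xfs_agnumber_t agcount = mp->m_sb.sb_agcount;
	xfs_agnumber_t band_len = max_t(xfs_agnumber_t, 1,
					agcount / nr_streams);
	xfs_agnumber_t band_start = (stream_id % nr_streams) * band_len;
	unsigned int rot = (unsigned int)atomic_inc_return(rotor);

	/* rotate within the band to spread load across its AGs */
	return (band_start + rot % band_len) % agcount;
}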
* Re: [PATCH v3 0/6] AG aware parallel writeback for XFS 2026-02-09 15:54 ` Kundan Kumar @ 2026-02-10 15:38 ` Christoph Hellwig 0 siblings, 0 replies; 48+ messages in thread From: Christoph Hellwig @ 2026-02-10 15:38 UTC (permalink / raw) To: Kundan Kumar Cc: Christoph Hellwig, Brian Foster, viro, brauner, jack, willy, mcgrof, clm, david, amir73il, axboe, ritesh.list, djwong, dave, cem, wangyufei, linux-fsdevel, linux-mm, linux-xfs, gost.dev, anuj20.g, vishak.g, joshi.k On Mon, Feb 09, 2026 at 09:24:49PM +0530, Kundan Kumar wrote:

> - Create a bounded number of bdi wb contexts at mount time (capped, e.g. ≤ agcount).

Yeah. And then optimally map them to CPU cores, similar to the blk-mq cpumap.

> - Store a per-inode stream/shard id (no per-folio state).

Yes.

> - Assign the stream id once and use it to select the wb context for writeback.

Yes.

> - In the delalloc allocator, bias AG selection from the stream id by partitioning AG space into per-stream "bands" and rotating the start AG within that band; fall back to the existing allocator when allocation can't be satisfied.

Yes. We might also need something that falls back to fewer helpers if the free space is distributed unevenly, but probably not for the first prototype. ^ permalink raw reply [flat|nested] 48+ messages in thread
end of thread, other threads:[~2026-02-11 9:39 UTC | newest]
Thread overview: 48+ messages
[not found] <CGME20260116101236epcas5p12ba3de776976f4ea6666e16a33ab6ec4@epcas5p1.samsung.com>
2026-01-16 10:08 ` [PATCH v3 0/6] AG aware parallel writeback for XFS Kundan Kumar
[not found] ` <CGME20260116101241epcas5p330f9c335a096aaaefda4b7d3c38d6038@epcas5p3.samsung.com>
2026-01-16 10:08 ` [PATCH v3 1/6] iomap: add write ops hook to attach metadata to folios Kundan Kumar
[not found] ` <CGME20260116101245epcas5p30269c6aa35784db67e6d6ca800a683a7@epcas5p3.samsung.com>
2026-01-16 10:08 ` [PATCH v3 2/6] xfs: add helpers to pack AG prediction info for per-folio tracking Kundan Kumar
2026-01-29 0:45 ` Darrick J. Wong
2026-02-03 7:15 ` Kundan Kumar
2026-02-05 16:39 ` Darrick J. Wong
2026-02-04 7:37 ` Nirjhar Roy (IBM)
[not found] ` <CGME20260116101251epcas5p1cf5b48f2efb14fe4387be3053b3c3ebc@epcas5p1.samsung.com>
2026-01-16 10:08 ` [PATCH v3 3/6] xfs: add per-inode AG prediction map and dirty-AG bitmap Kundan Kumar
2026-01-29 0:44 ` Darrick J. Wong
2026-02-03 7:20 ` Kundan Kumar
2026-02-05 16:42 ` Darrick J. Wong
2026-02-05 6:44 ` Nirjhar Roy (IBM)
2026-02-05 16:32 ` Darrick J. Wong
2026-02-06 5:41 ` Nirjhar Roy (IBM)
2026-02-05 6:36 ` Nirjhar Roy (IBM)
2026-02-05 16:36 ` Darrick J. Wong
2026-02-06 5:36 ` Nirjhar Roy (IBM)
2026-02-06 5:57 ` Darrick J. Wong
2026-02-06 6:03 ` Nirjhar Roy (IBM)
2026-02-06 7:00 ` Christoph Hellwig
[not found] ` <CGME20260116101256epcas5p2d6125a6bcad78c33f737fdc3484aca79@epcas5p2.samsung.com>
2026-01-16 10:08 ` [PATCH v3 4/6] xfs: tag folios with AG number during buffered write via iomap attach hook Kundan Kumar
2026-01-29 0:47 ` Darrick J. Wong
2026-01-29 22:40 ` Darrick J. Wong
2026-02-03 7:32 ` Kundan Kumar
2026-02-03 7:28 ` Kundan Kumar
2026-02-05 15:56 ` Brian Foster
2026-02-06 6:44 ` Nirjhar Roy (IBM)
[not found] ` <CGME20260116101259epcas5p1cfa6ab02e5a01f7c46cc78df95c57ce0@epcas5p1.samsung.com>
2026-01-16 10:08 ` [PATCH v3 5/6] xfs: add per-AG writeback workqueue infrastructure Kundan Kumar
2026-01-29 22:21 ` Darrick J. Wong
2026-02-03 7:35 ` Kundan Kumar
2026-02-06 6:46 ` Christoph Hellwig
2026-02-10 11:56 ` Nirjhar Roy (IBM)
[not found] ` <CGME20260116101305epcas5p497cd6d9027301853669f1c1aaffbf128@epcas5p4.samsung.com>
2026-01-16 10:08 ` [PATCH v3 6/6] xfs: offload writeback by AG using per-inode dirty bitmap and per-AG workers Kundan Kumar
2026-01-29 22:34 ` Darrick J. Wong
2026-02-03 7:40 ` Kundan Kumar
2026-02-11 9:39 ` Nirjhar Roy (IBM)
2026-01-16 16:13 ` [syzbot ci] Re: AG aware parallel writeback for XFS syzbot ci
2026-01-21 19:54 ` [PATCH v3 0/6] " Brian Foster
2026-01-22 16:15 ` Kundan Kumar
2026-01-23 9:36 ` Pankaj Raghav (Samsung)
2026-01-23 13:26 ` Brian Foster
2026-01-28 18:28 ` Kundan Kumar
2026-02-06 6:25 ` Christoph Hellwig
2026-02-06 10:07 ` Kundan Kumar
2026-02-06 17:42 ` Darrick J. Wong
2026-02-09 6:30 ` Christoph Hellwig
2026-02-09 15:54 ` Kundan Kumar
2026-02-10 15:38 ` Christoph Hellwig