* [PATCH, RFC] limit per-inode writeback size considered harmful
From: Christoph Hellwig @ 2025-10-13 7:21 UTC
To: jack, willy
Cc: akpm, linux-fsdevel, linux-mm, dlemoal, linux-xfs, hans.holmberg
Hi all,
we have a customer workload where the current core writeback behavior
causes severe fragmentation on zoned XFS despite a friendly write pattern
from the application. We tracked this down to writeback_chunk_size only
giving about 30-40MB to each inode before switching to a new inode,
which will cause files that are aligned to the zone size (256MB on HDD)
to be fragmented into usually 5-7 extents spread over different zones.
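As a rough sanity check on those numbers (the exact chunk size depends on
the measured bandwidth, so treat this as an assumption): a fully dirty
256MB file written back in 30-40MB chunks needs roughly 256/40 ~= 6 to
256/30 ~= 8 passes, with other inodes serviced in between, so each pass
can land in a different open zone, which lines up with the 5-7 extents
observed.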
Using the hack below makes this problem go away entirely by always
writing an inode fully up to the zone size. Damien came up with a
heuristic here:
https://lore.kernel.org/linux-xfs/20251013070945.GA2446@lst.de/T/#t
that also papers over this, but it falls apart on larger memory
systems where we can cache more of these files in the page cache
than we have open zones.
Does anyone remember the reason for this limit on the writeback size? I
looked at the git history and the code touched here comes from a
refactoring in 2011, and before that it's really hard to figure out where
the original, even worse behavior came from. At least for zoned devices,
based on a flag or something similar, we'd love to avoid switching between
inodes during writeback, as that would drastically reduce the potential
for self-induced fragmentation.
---
fs/fs-writeback.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 2b35e80037fe..9dd9c5f4d86b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1892,9 +1892,11 @@ static long writeback_chunk_size(struct bdi_writeback *wb,
* (quickly) tag currently dirty pages
* (maybe slowly) sync all tagged pages
*/
- if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
+ if (1) { /* XXX: check flag */
+ pages = SZ_256M; /* Don't hard code? */
+ } else if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages) {
pages = LONG_MAX;
- else {
+ } else {
pages = min(wb->avg_write_bandwidth / 2,
global_wb_domain.dirty_limit / DIRTY_SCOPE);
pages = min(pages, work->nr_pages);
--
2.47.3
* Re: [PATCH, RFC] limit per-inode writeback size considered harmful
From: Jan Kara @ 2025-10-13 11:01 UTC
To: Christoph Hellwig
Cc: jack, willy, akpm, linux-fsdevel, linux-mm, dlemoal, linux-xfs,
hans.holmberg
Hello!
On Mon 13-10-25 16:21:42, Christoph Hellwig wrote:
> we have a customer workload where the current core writeback behavior
> causes severe fragmentation on zoned XFS despite a friendly write pattern
> from the application. We tracked this down to writeback_chunk_size only
> giving about 30-40MB to each inode before switching to a new inode,
> which will cause files that are aligned to the zone size (256MB on HDD)
> to be fragmented into usually 5-7 extents spread over different zones.
> Using the hack below makes this problem go away entirely by always
> writing an inode fully up to the zone size. Damien came up with a
> heuristic here:
>
> https://lore.kernel.org/linux-xfs/20251013070945.GA2446@lst.de/T/#t
>
> that also papers over this, but it falls apart on larger memory
> systems where we can cache more of these files in the page cache
> than we have open zones.
>
> Does anyone remember the reason for this limit on the writeback size? I
> looked at the git history and the code touched here comes from a
> refactoring in 2011, and before that it's really hard to figure out where
> the original, even worse behavior came from. At least for zoned devices,
> based on a flag or something similar, we'd love to avoid switching between
> inodes during writeback, as that would drastically reduce the potential
> for self-induced fragmentation.
That was a long time ago, but as far as I remember the idea of the
logic in writeback_chunk_size() is that for background writeback we want
to:
a) Reasonably often bail out to the main writeback loop to recheck whether
more writeback is still needed (we are still over the background threshold
and there isn't other higher-priority writeback work such as sync etc.).
b) Alternate between inodes needing writeback so that continuously dirtying
one inode doesn't starve writeback on other inodes.
c) Write enough so that writeback can be efficient.
Currently we have MIN_WRITEBACK_PAGES, which is hardwired to 4MB and which
defines the granularity of the write chunk. Now your problem sounds like
you'd like to configure MIN_WRITEBACK_PAGES on a per-BDI basis, and I think
that makes sense. Do I understand you right?
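To make that concrete, here is a minimal and completely untested sketch of
the idea; the min_writeback_pages field on struct backing_dev_info is
hypothetical (0 would mean "keep today's MIN_WRITEBACK_PAGES default"), and
the value would have to stay a power of two for round_down() to work:

	/* Hypothetical per-BDI override of the writeback chunk granularity. */
	static unsigned long wb_min_chunk_pages(struct bdi_writeback *wb)
	{
		/* e.g. a zoned block device could set this to its zone size */
		return wb->bdi->min_writeback_pages ?: MIN_WRITEBACK_PAGES;
	}

	static long writeback_chunk_size(struct bdi_writeback *wb,
					 struct wb_writeback_work *work)
	{
		unsigned long min_pages = wb_min_chunk_pages(wb);
		long pages;

		if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages) {
			pages = LONG_MAX;
		} else {
			pages = min(wb->avg_write_bandwidth / 2,
				    global_wb_domain.dirty_limit / DIRTY_SCOPE);
			pages = min(pages, work->nr_pages);
			/* floor at the (possibly much larger) per-BDI granularity */
			pages = round_down(pages + min_pages, min_pages);
		}

		return pages;
	}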
Honza
>
> ---
> fs/fs-writeback.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 2b35e80037fe..9dd9c5f4d86b 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1892,9 +1892,11 @@ static long writeback_chunk_size(struct bdi_writeback *wb,
> * (quickly) tag currently dirty pages
> * (maybe slowly) sync all tagged pages
> */
> - if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
> + if (1) { /* XXX: check flag */
> + pages = SZ_256M; /* Don't hard code? */
> + } else if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages) {
> pages = LONG_MAX;
> - else {
> + } else {
> pages = min(wb->avg_write_bandwidth / 2,
> global_wb_domain.dirty_limit / DIRTY_SCOPE);
> pages = min(pages, work->nr_pages);
> --
> 2.47.3
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH, RFC] limit per-inode writeback size considered harmful
From: Dave Chinner @ 2025-10-13 21:16 UTC
To: Jan Kara
Cc: Christoph Hellwig, willy, akpm, linux-fsdevel, linux-mm, dlemoal,
linux-xfs, hans.holmberg
On Mon, Oct 13, 2025 at 01:01:49PM +0200, Jan Kara wrote:
> Hello!
>
> On Mon 13-10-25 16:21:42, Christoph Hellwig wrote:
> > we have a customer workload where the current core writeback behavior
> > causes severe fragmentation on zoned XFS despite a friendly write pattern
> > from the application. We tracked this down to writeback_chunk_size only
> > giving about 30-40MB to each inode before switching to a new inode,
> > which will cause files that are aligned to the zone size (256MB on HDD)
> > to be fragmented into usually 5-7 extents spread over different zones.
> > Using the hack below makes this problem go away entirely by always
> > writing an inode fully up to the zone size. Damien came up with a
> > heuristic here:
> >
> > https://lore.kernel.org/linux-xfs/20251013070945.GA2446@lst.de/T/#t
> >
> > that also papers over this, but it falls apart on larger memory
> > systems where we can cache more of these files in the page cache
> > than we have open zones.
> >
> > Does anyone remember the reason for this limit on the writeback size? I
> > looked at the git history and the code touched here comes from a
> > refactoring in 2011, and before that it's really hard to figure out where
> > the original, even worse behavior came from. At least for zoned devices,
> > based on a flag or something similar, we'd love to avoid switching between
> > inodes during writeback, as that would drastically reduce the potential
> > for self-induced fragmentation.
>
> That was a long time ago, but as far as I remember the idea of the
> logic in writeback_chunk_size() is that for background writeback we want
> to:
>
> a) Reasonably often bail out to the main writeback loop to recheck whether
> more writeback is still needed (we are still over the background threshold
> and there isn't other higher-priority writeback work such as sync etc.).
*nod*
> b) Alternate between inodes needing writeback so that continuously dirtying
> one inode doesn't starve writeback on other inodes.
Yes, this was a big concern at the time - I remember semi-regular
bug reports from users with large machines (think hundreds of GB of
RAM back in the pre-2010 era) where significant amounts of user data was
lost because the system crashed hours after the data was written.
The suspect was always writeback starving an inode of writeback for
long periods of time, and those problems largely went away when this
mechanism was introduced.
> c) Write enough so that writeback can be efficient.
>
> Currently we have MIN_WRITEBACK_PAGES, which is hardwired to 4MB and which
> defines the granularity of the write chunk.
Historically speaking, this writeback clustering granularity was
needed w/ XFS to minimise fragmentation during delayed allocation
when there were lots of medium-sized dirty files (e.g. untarring a
tarball with lots of multi-megabyte files) and incoming write
throttling triggered writeback.
In situations where writeback does lots of small writes across
multiple inodes, XFS will pack the allocations for them tightly,
optimising the write IO patterns to be sequential across multi-inode
writeback. However, having a minimum writeback chunk that is too small
would lead to excessive fragmentation and very poor sequential read IO
patterns (and hence performance issues). This was especially true in
the times before IO-less write throttling was introduced, because of
the random single-page IO that the old write throttling algorithm did.
Hence, for a long time before the writeback code had
MIN_WRITEBACK_PAGES, XFS had a hard-coded minimum writeback cluster
size (a minimum of 1024 pages, IIRC) that overrode the VFS
"nr_to_write" value to avoid this problem.
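Roughly, the effect was a clamp of the following shape (an illustrative
sketch only, not the historical XFS code; the 1024-page figure is the IIRC
number above):

	/*
	 * Illustrative only: never let non-integrity writeback hand this
	 * inode a window smaller than ~1024 pages (4MB with 4k pages), so
	 * delayed allocation can build reasonably sized extents.
	 */
	static void xfs_clamp_writeback_window(struct writeback_control *wbc)
	{
		const long xfs_min_cluster_pages = 1024;	/* IIRC value */

		if (wbc->sync_mode == WB_SYNC_NONE &&
		    wbc->nr_to_write < xfs_min_cluster_pages)
			wbc->nr_to_write = xfs_min_cluster_pages;
	}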
> Now your problem sounds like you'd like to configure
> MIN_WRITEBACK_PAGES on a per-BDI basis, and I think that makes sense.
> Do I understand you right?
Yeah, given the above XFS history and its use of a minimum writeback
chunk size to mitigate the issue, this seems like the right approach
to take to me.
-Dave.
--
Dave Chinner
david@fromorbit.com