From: Jan Kara <jack@suse.cz>
To: Christoph Hellwig <hch@lst.de>
Cc: jack@suse.cz, willy@infradead.org, akpm@linux-foundation.org,
linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
dlemoal@kernel.org, linux-xfs@vger.kernel.org,
hans.holmberg@wdc.com
Subject: Re: [PATCH, RFC] limit per-inode writeback size considered harmful
Date: Mon, 13 Oct 2025 13:01:49 +0200
Message-ID: <j55u2ol6bconzpeaxdldqjimyrmnuafx5jarzhvic3r2ljbdus@tkmjzu4ka7eh>
In-Reply-To: <20251013072738.4125498-1-hch@lst.de>
Hello!
On Mon 13-10-25 16:21:42, Christoph Hellwig wrote:
> we have a customer workload where the current core writeback behavior
> causes severe fragmentation on zoned XFS despite a friendly write pattern
> from the application. We tracked this down to writeback_chunk_size only
> giving about 30-40MBs to each inode before switching to a new inode,
> which will cause files that are aligned to the zone size (256MB on HDD)
> to be fragmented into usually 5-7 extents spread over different zones.
> Using the hack below makes this problem go away entirely by always
> writing an inode fully up to the zone size. Damien came up with a
> heuristic here:
>
> https://lore.kernel.org/linux-xfs/20251013070945.GA2446@lst.de/T/#t
>
> that also papers over this, but it falls apart on larger memory
> systems where we can cache more of these files in the page cache
> than we open zones.
>
> Does anyone remember the reason for this limit writeback size? I
> looked at git history and the code touched comes from a refactoring in
> 2011, and before that it's really hard to figure out where the original
> even worse behavior came from. At least for zoned devices based
> on a flag or something similar we'd love to avoid switching between
> inodes during writeback, as that would drastically reduce the
> potential for self-induced fragmentation.
That was a long time ago, but as far as I remember the idea of the logic
in writeback_chunk_size() is that for background writeback we want to:

a) Bail out to the main writeback loop reasonably often to recheck whether
   more writeback is still needed (we are still over the background
   threshold, there isn't other higher-priority writeback work such as
   sync, etc.).
b) Alternate between inodes needing writeback so that continuously dirtying
   one inode doesn't starve writeback on other inodes.
c) Write enough at a time so that writeback can be efficient.
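
To illustrate how goals (b) and (c) interact with the fragmentation
described above, here is a minimal userspace sketch, not kernel code. It
visits a set of dirty inodes round-robin, writing at most one chunk per
visit, and counts the visits each inode needs. The function name and
parameters are made up for illustration, and it assumes every visit is
interleaved with writes from the other inodes, so each visit can end up as
a separate extent on a zoned allocator. Goal (a) is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model: write back n inodes round-robin, at most `chunk`
 * pages per visit, until nothing is dirty. visits[i] ends up holding the
 * number of separate visits inode i needed.
 */
static void round_robin_visits(long *dirty, long *visits, size_t n, long chunk)
{
	int progress = 1;

	assert(chunk > 0);
	for (size_t i = 0; i < n; i++)
		visits[i] = 0;

	while (progress) {
		progress = 0;
		for (size_t i = 0; i < n; i++) {
			long nr;

			if (dirty[i] <= 0)
				continue;
			/* Write one chunk, or whatever is left of the inode. */
			nr = dirty[i] < chunk ? dirty[i] : chunk;
			dirty[i] -= nr;
			visits[i]++;
			progress = 1;
		}
	}
}
```

With 4K pages, a 256MB file is 65536 pages. A 40MB chunk (10240 pages)
needs 7 visits per file in this model, while a chunk equal to the zone
size needs exactly one.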
Currently we have MIN_WRITEBACK_PAGES, which is hardwired to 4MB and
defines the granularity of the write chunk. Your problem sounds like you'd
like to configure MIN_WRITEBACK_PAGES on a per-BDI basis, and I think that
makes sense. Do I understand you right?
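
As a rough sketch of what a per-BDI minimum could look like, here is a
userspace model of the chunk computation, not a patch. The two min() steps
are taken from the hunk quoted below; the final rounding step, the value
of DIRTY_SCOPE and the 4K page size are my assumptions about the current
code, and the min_chunk_pages parameter is hypothetical (it stands in for
a per-BDI replacement of MIN_WRITEBACK_PAGES). All quantities are in pages.

```c
#include <assert.h>
#include <limits.h>

#define MODEL_PAGE_SHIFT	12	/* assumption: 4K pages */
#define MODEL_DIRTY_SCOPE	8	/* assumption: mirrors DIRTY_SCOPE */
/* 4MB expressed in pages, mirroring the hardwired MIN_WRITEBACK_PAGES. */
#define MODEL_MIN_WRITEBACK_PAGES (4096L >> (MODEL_PAGE_SHIFT - 10))

static long min_long(long a, long b)
{
	return a < b ? a : b;
}

/*
 * Model of the per-inode chunk size. min_chunk_pages is the hypothetical
 * per-BDI granularity; pass MODEL_MIN_WRITEBACK_PAGES for today's value.
 */
static long chunk_size_model(int sync_all, long avg_write_bandwidth,
			     long dirty_limit, long nr_pages,
			     long min_chunk_pages)
{
	long pages;

	assert(min_chunk_pages > 0);

	/* WB_SYNC_ALL / tagged writepages: no per-inode limit. */
	if (sync_all)
		return LONG_MAX;

	pages = min_long(avg_write_bandwidth / 2,
			 dirty_limit / MODEL_DIRTY_SCOPE);
	pages = min_long(pages, nr_pages);

	/* Round down (pages + min) to a multiple of min, so never below min. */
	return (pages + min_chunk_pages) / min_chunk_pages * min_chunk_pages;
}
```

For example, with avg_write_bandwidth = 20000, dirty_limit = 800000 and
nr_pages = 1000000, the model gives 10240 pages (40MB) with the 4MB
minimum, and 65536 pages (256MB) when min_chunk_pages is set to the zone
size.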
Honza
>
> ---
> fs/fs-writeback.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 2b35e80037fe..9dd9c5f4d86b 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1892,9 +1892,11 @@ static long writeback_chunk_size(struct bdi_writeback *wb,
>  	 * (quickly) tag currently dirty pages
>  	 * (maybe slowly) sync all tagged pages
>  	 */
> -	if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
> +	if (1) { /* XXX: check flag */
> +		pages = SZ_256M; /* Don't hard code? */
> +	} else if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages) {
>  		pages = LONG_MAX;
> -	else {
> +	} else {
>  		pages = min(wb->avg_write_bandwidth / 2,
>  			    global_wb_domain.dirty_limit / DIRTY_SCOPE);
>  		pages = min(pages, work->nr_pages);
> --
> 2.47.3
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR