From: Jeff Layton <jlayton@kernel.org>
To: Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
"Matthew Wilcox (Oracle)" <willy@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>,
Mike Snitzer <snitzer@kernel.org>, Jens Axboe <axboe@kernel.dk>,
Chuck Lever <chuck.lever@oracle.com>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-nfs@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2 0/3] mm: improve write performance with RWF_DONTCACHE
Date: Wed, 08 Apr 2026 14:45:32 -0400 [thread overview]
Message-ID: <c2351220336d0ad99396a331bad34c6177bf354e.camel@kernel.org> (raw)
In-Reply-To: <20260408-dontcache-v2-0-948dec1e756b@kernel.org>
On Wed, 2026-04-08 at 10:25 -0400, Jeff Layton wrote:
> This version adopts Christoph's suggestion to have generic_write_sync()
> kick the flusher thread for the superblock instead of initiating
> writeback directly. This seems to perform as well or better in most
> cases than doing the writeback directly.
>
> Here are results on XFS, both local and exported via knfsd:
>
> nfsd: https://markdownpastebin.com/?id=1884b9487c404ff4b7094ed41cc48f05
> xfs: https://markdownpastebin.com/?id=3c6b262182184b25b7d58fb211374475
>
> Ritesh had also asked about getting perf lock traces to confirm the
> source of the contention. I did that (and I can post them if you like),
> but the results from the unpatched dontcache runs didn't point out any
> specific lock contention. That leads me to believe that the bottlenecks
> were from normal queueing work, and not contention for the xa_lock after
> all.
>
> Kicking the writeback thread seems to be a clear improvement over the
> status quo in my testing, but I do wonder if having dontcache writes
> spamming writeback for the whole bdi is the best idea.
>
> I'm benchmarking a patch that has the flusher do a
> writeback_single_inode() for the work. I don't expect it to perform
> measurably better in this testing, but it would better isolate the
> DONTCACHE writeback behavior to just those inodes touched by DONTCACHE
> writes.
>
> Assuming that looks OK, I'll probably send a v3. Original cover letter
> from v1 follows:
>
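For anyone skimming, the "kick the flusher" idea above amounts to
something like the following. This is a rough sketch, not the actual
patch: the exact hook point, helper, and WB_REASON value are my
assumptions here.

```c
/*
 * Sketch only: on a DONTCACHE write, wake the flusher thread for
 * the inode's bdi instead of starting writeback inline from the
 * write path.
 */
static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
{
	if (iocb->ki_flags & IOCB_DSYNC) {
		/* existing O_DSYNC handling unchanged */
	} else if (iocb->ki_flags & IOCB_DONTCACHE) {
		struct address_space *mapping = iocb->ki_filp->f_mapping;

		/* Defer the flush: the flusher thread writes the dirty
		 * folios back and drops them from the page cache.
		 * WB_REASON_VMSCAN is a placeholder reason. */
		wakeup_flusher_threads_bdi(inode_to_bdi(mapping->host),
					   WB_REASON_VMSCAN);
	}
	return count;
}
```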
Actually, that version regressed performance in a couple of cases. I
think v2 is probably the best approach, on balance. Maybe we can get
this into -next so that it can make v7.2?
Here's the comparison between this version and a writeback_single_inode() flush version:
------------------8<-----------------------
● Comparing dontcache numbers against the previous whole-BDI flusher kernel (from /tmp/dontcache-local-4way-flusher.md):
Per-Inode vs Whole-BDI Flusher — DONTCACHE on Local XFS
Single-Client Writes
┌──────────────────┬───────────┬───────────┬─────────────┐
│ Benchmark        │ Whole-BDI │ Per-Inode │ Change      │
├──────────────────┼───────────┼───────────┼─────────────┤
│ Seq write MB/s   │ 1450      │ 1438      │ -1% (noise) │
├──────────────────┼───────────┼───────────┼─────────────┤
│ Seq write p99.9  │ 23.5 ms   │ 23.5 ms   │ identical   │
├──────────────────┼───────────┼───────────┼─────────────┤
│ Rand write MB/s  │ 363       │ 286       │ -21%        │
├──────────────────┼───────────┼───────────┼─────────────┤
│ Rand write p99.9 │ 1.8 ms    │ 16.7 ms   │ regression  │
└──────────────────┴───────────┴───────────┴─────────────┘
Seq write is identical. Rand write regressed — the whole-BDI flusher batched all dirty pages in one pass with writeback_sb_inodes() under a single blk_plug, while per-inode write_inode_now() loses that batching.
Single-Client Reads
┌────────────────┬───────────┬───────────┬────────┐
│ Benchmark      │ Whole-BDI │ Per-Inode │ Change │
├────────────────┼───────────┼───────────┼────────┤
│ Seq read MB/s  │ 2950      │ 2350      │ -20%   │
├────────────────┼───────────┼───────────┼────────┤
│ Rand read MB/s │ 651       │ 519       │ -20%   │
└────────────────┴───────────┴───────────┴────────┘
Reads shouldn't be affected by writeback path changes. Buffered reads also dropped (2888 → 2331), suggesting different system conditions between runs rather than a per-inode regression.
Multi-Writer (Scenario A)
┌────────────────┬───────────┬───────────┬────────────┐
│ Metric         │ Whole-BDI │ Per-Inode │ Change     │
├────────────────┼───────────┼───────────┼────────────┤
│ Aggregate MB/s │ 1478      │ 999       │ -32%       │
├────────────────┼───────────┼───────────┼────────────┤
│ p99.9          │ 46 ms     │ 77 ms     │ +67% worse │
└────────────────┴───────────┴───────────┴────────────┘
This is the biggest regression. With whole-BDI, the flusher did one batched pass through all dirty inodes via writeback_sb_inodes(). With per-inode, each of 4 writers queues a separate work item processed serially by write_inode_now() — losing the batch I/O merging benefit.
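The batching difference can be sketched roughly like this (simplified
from the fs/fs-writeback.c call pattern; argument lists and the work
setup are illustrative, not exact):

```c
/* Whole-BDI flusher: one plugged pass over every dirty inode on the
 * sb, so requests from many inodes can merge before submission. */
struct blk_plug plug;

blk_start_plug(&plug);
writeback_sb_inodes(sb, wb, &work);	/* walks the whole dirty list */
blk_finish_plug(&plug);			/* merged I/O submitted here */

/* Per-inode variant: each queued work item flushes a single inode,
 * so each inode's pages go out as a separate, smaller batch. */
write_inode_now(inode, 0);		/* WB_SYNC_NONE, one inode */
```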
Scenario C & D (Noisy Neighbor)
┌─────────────────────────┬───────────┬───────────┬─────────────┐
│ Metric                  │ Whole-BDI │ Per-Inode │ Change      │
├─────────────────────────┼───────────┼───────────┼─────────────┤
│ Scenario C writer       │ 1468      │ 1386      │ -6%         │
├─────────────────────────┼───────────┼───────────┼─────────────┤
│ Scenario C readers      │ 18.7 MB/s │ 18.7 MB/s │ identical   │
├─────────────────────────┼───────────┼───────────┼─────────────┤
│ Scenario D writer       │ 1472      │ 1467      │ identical   │
├─────────────────────────┼───────────┼───────────┼─────────────┤
│ Scenario D readers      │ 496 MB/s  │ 507 MB/s  │ +2%         │
├─────────────────────────┼───────────┼───────────┼─────────────┤
│ Scenario D reader p99.9 │ 440 us    │ 358 us    │ +19% better │
└─────────────────────────┴───────────┴───────────┴─────────────┘
Mixed-mode (Scenario D) is the intended production case and it's essentially identical or slightly better — per-inode writeback creates less device contention for buffered readers.
Summary
The per-inode approach is neutral-to-slightly-better for the production
scenario (Scenario D), but regresses on multi-writer and random write
workloads. The core issue is loss of I/O batching:
writeback_sb_inodes() processes all dirty inodes in one blk_plug'd
pass, while per-inode write_inode_now() calls are processed one at a
time. The read regressions likely reflect different system conditions,
since buffered/direct reads also dropped ~20%.
--
Jeff Layton <jlayton@kernel.org>
Thread overview: 5+ messages
2026-04-08 14:25 Jeff Layton
2026-04-08 14:25 ` [PATCH v2 1/3] mm: kick writeback flusher instead of inline flush for IOCB_DONTCACHE Jeff Layton
2026-04-08 14:25 ` [PATCH v2 2/3] testing: add nfsd-io-bench NFS server benchmark suite Jeff Layton
2026-04-08 14:25 ` [PATCH v2 3/3] testing: add dontcache-bench local filesystem " Jeff Layton
2026-04-08 18:45 ` Jeff Layton [this message]