linux-mm.kvack.org archive mirror
From: Jeff Layton <jlayton@kernel.org>
To: Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner	 <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	"Matthew Wilcox (Oracle)"	 <willy@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@kernel.org>,
	Lorenzo Stoakes <ljs@kernel.org>,
	"Liam R. Howlett"	 <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@kernel.org>,
	Mike Rapoport	 <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko	 <mhocko@suse.com>,
	Mike Snitzer <snitzer@kernel.org>, Jens Axboe <axboe@kernel.dk>,
	 Chuck Lever <chuck.lever@oracle.com>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	 linux-nfs@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2 0/3] mm: improve write performance with RWF_DONTCACHE
Date: Wed, 08 Apr 2026 14:45:32 -0400
Message-ID: <c2351220336d0ad99396a331bad34c6177bf354e.camel@kernel.org>
In-Reply-To: <20260408-dontcache-v2-0-948dec1e756b@kernel.org>

On Wed, 2026-04-08 at 10:25 -0400, Jeff Layton wrote:
> This version adopts Christoph's suggestion to have generic_write_sync()
> kick the flusher thread for the superblock instead of initiating
> writeback directly. In most cases this seems to perform as well as, or
> better than, doing the writeback directly.
> 
> Here are results on XFS, both local and exported via knfsd:
> 
>     nfsd: https://markdownpastebin.com/?id=1884b9487c404ff4b7094ed41cc48f05
>     xfs: https://markdownpastebin.com/?id=3c6b262182184b25b7d58fb211374475
> 
> Ritesh had also asked about getting perf lock traces to confirm the
> source of the contention. I did that (and I can post them if you like),
> but the results from the unpatched dontcache runs didn't point out any
> specific lock contention. That leads me to believe that the bottlenecks
> were from normal queueing work, and not contention for the xa_lock after
> all.
> 
> Kicking the writeback thread seems to be a clear improvement over the
> status quo in my testing, but I do wonder if having dontcache writes
> spamming writeback for the whole bdi is the best idea.
> 
> I'm benchmarking a patch that has the flusher do a
> writeback_single_inode() for each work item. I don't expect it to perform
> measurably better in this testing, but it would better isolate the
> DONTCACHE writeback behavior to just those inodes touched by DONTCACHE
> writes.
> 
> Assuming that looks OK, I'll probably send a v3. Original cover letter
> from v1 follows:
> 

Actually, that version regressed performance in a couple of cases. I think v2 is probably the best approach, on balance. Maybe we can get this into -next so that it can make v7.2?

Here's the comparison between this version and a writeback_single_inode() flush version:

------------------8<-----------------------

Comparing dontcache numbers against the previous whole-BDI flusher kernel (from /tmp/dontcache-local-4way-flusher.md):

  Per-Inode vs Whole-BDI Flusher — DONTCACHE on Local XFS

  Single-Client Writes

  ┌──────────────────┬───────────┬───────────┬─────────────┐
  │    Benchmark     │ Whole-BDI │ Per-Inode │   Change    │
  ├──────────────────┼───────────┼───────────┼─────────────┤
  │ Seq write MB/s   │ 1450      │ 1438      │ -1% (noise) │
  ├──────────────────┼───────────┼───────────┼─────────────┤
  │ Seq write p99.9  │ 23.5 ms   │ 23.5 ms   │ identical   │
  ├──────────────────┼───────────┼───────────┼─────────────┤
  │ Rand write MB/s  │ 363       │ 286       │ -21%        │
  ├──────────────────┼───────────┼───────────┼─────────────┤
  │ Rand write p99.9 │ 1.8 ms    │ 16.7 ms   │ regression  │
  └──────────────────┴───────────┴───────────┴─────────────┘

  Seq write is essentially unchanged. Rand write regressed — the whole-BDI flusher batched all dirty pages in one pass with writeback_sb_inodes() under a single blk_plug, while per-inode write_inode_now() loses that batching.

  Single-Client Reads

  ┌────────────────┬───────────┬───────────┬────────┐
  │   Benchmark    │ Whole-BDI │ Per-Inode │ Change │
  ├────────────────┼───────────┼───────────┼────────┤
  │ Seq read MB/s  │ 2950      │ 2350      │ -20%   │
  ├────────────────┼───────────┼───────────┼────────┤
  │ Rand read MB/s │ 651       │ 519       │ -20%   │
  └────────────────┴───────────┴───────────┴────────┘

  Reads shouldn't be affected by writeback path changes. Buffered reads also dropped (2888 → 2331), suggesting different system conditions between runs rather than a per-inode regression.

  Multi-Writer (Scenario A)

  ┌────────────────┬───────────┬───────────┬────────────┐
  │     Metric     │ Whole-BDI │ Per-Inode │   Change   │
  ├────────────────┼───────────┼───────────┼────────────┤
  │ Aggregate MB/s │ 1478      │ 999       │ -32%       │
  ├────────────────┼───────────┼───────────┼────────────┤
  │ p99.9          │ 46 ms     │ 77 ms     │ +67% worse │
  └────────────────┴───────────┴───────────┴────────────┘

  This is the biggest regression. With whole-BDI, the flusher did one batched pass through all dirty inodes via writeback_sb_inodes(). With per-inode, each of 4 writers queues a separate work item processed serially by write_inode_now() — losing the batch I/O merging benefit.

  Scenario C & D (Noisy Neighbor)

  ┌─────────────────────────┬───────────┬───────────┬─────────────┐
  │         Metric          │ Whole-BDI │ Per-Inode │   Change    │
  ├─────────────────────────┼───────────┼───────────┼─────────────┤
  │ Scenario C writer       │ 1468      │ 1386      │ -6%         │
  ├─────────────────────────┼───────────┼───────────┼─────────────┤
  │ Scenario C readers      │ 18.7 MB/s │ 18.7 MB/s │ identical   │
  ├─────────────────────────┼───────────┼───────────┼─────────────┤
  │ Scenario D writer       │ 1472      │ 1467      │ identical   │
  ├─────────────────────────┼───────────┼───────────┼─────────────┤
  │ Scenario D readers      │ 496 MB/s  │ 507 MB/s  │ +2%         │
  ├─────────────────────────┼───────────┼───────────┼─────────────┤
  │ Scenario D reader p99.9 │ 440 us    │ 358 us    │ +19% better │
  └─────────────────────────┴───────────┴───────────┴─────────────┘

  Mixed-mode (Scenario D) is the intended production case and it's essentially identical or slightly better — per-inode writeback creates less device contention for buffered readers.

  Summary

  The per-inode approach is neutral-to-slightly-better for the production scenario (Scenario D), but regresses on multi-writer and random write workloads. The core issue is loss of I/O batching — writeback_sb_inodes() processes all dirty inodes in one blk_plug'd pass, while per-inode write_inode_now() calls are processed one at a time. The read regressions likely reflect different system conditions, since buffered/direct reads also dropped ~20%.

-- 
Jeff Layton <jlayton@kernel.org>


Thread overview: 5+ messages
2026-04-08 14:25 Jeff Layton
2026-04-08 14:25 ` [PATCH v2 1/3] mm: kick writeback flusher instead of inline flush for IOCB_DONTCACHE Jeff Layton
2026-04-08 14:25 ` [PATCH v2 2/3] testing: add nfsd-io-bench NFS server benchmark suite Jeff Layton
2026-04-08 14:25 ` [PATCH v2 3/3] testing: add dontcache-bench local filesystem " Jeff Layton
2026-04-08 18:45 ` Jeff Layton [this message]
