linux-mm.kvack.org archive mirror
From: Yunzhao Li <yunzhao@cloudflare.com>
To: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Jan Kara <jack@suse.cz>,
	 linux-fsdevel@vger.kernel.org,
	Jesper Brouer <jesper@cloudflare.com>
Subject: balance_dirty_pages() causes 40% IO PSI (full) with no drain benefit on 384 GB machine
Date: Tue, 17 Mar 2026 15:53:51 -0700
Message-ID: <CAAnvDSwOoEOkMEDgKUcHkx-SdmZ_ar5AbVy4Gp-XT7i04ON5fg@mail.gmail.com>

Hello,

On a 384 GB machine with NVMe storage (2x NVMe RAID0, dm-crypt,
XFS, kernel 6.12, AMD EPYC 9684X 96-Core), balance_dirty_pages()
throttles writers via io_schedule_timeout(), causing 26-40% IO PSI (full).
But the throttling doesn't actually drain dirty pages faster.
The flusher only submits ~578 MB/s of writeback regardless of
whether writers are throttled, and the NVMe device has ample
spare capacity (1,044 MB/s benchmarked).

I'd like to understand whether this is expected and what the
right approach is.

The setup
---------

  dirty_background_ratio=10, dirty_ratio=20 (defaults)
  dirtyable memory: ~77 GB
  -> bg_thresh:       10% * 77 GB         =  7.7 GB
  -> freerun ceiling: (20%+10%)/2 * 77 GB = 11.7 GB
  -> limit (hard):    20% * 77 GB         = 15.5 GB

  Write generation:   ~580 MB/s (HTTP cache miss writes)
  Flusher drain rate: ~578 MB/s (device can do 1044 MB/s;
                       flusher can't feed it fast enough)
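
For anyone who wants to redo the threshold arithmetic above, here is
a minimal sketch. It assumes the rounded 77 GB dirtyable figure and
the freerun = (bg_thresh + limit) / 2 relation used in the table; the
function name is made up for illustration and is not a kernel symbol.

```python
def dirty_thresholds(dirtyable_gb, background_ratio, dirty_ratio):
    """Return (bg_thresh, freerun, limit) in GB for the given vm.dirty_* ratios."""
    bg_thresh = dirtyable_gb * background_ratio / 100.0
    limit = dirtyable_gb * dirty_ratio / 100.0
    # Freerun ceiling is the midpoint between the background and hard thresholds.
    freerun = (bg_thresh + limit) / 2.0
    return bg_thresh, freerun, limit


# Assumption: 77 GB is the rounded dirtyable-memory figure from the table.
bg_thresh, freerun, limit = dirty_thresholds(77.0, 10, 20)
print(f"bg_thresh={bg_thresh:.1f} GB  freerun={freerun:.1f} GB  limit={limit:.1f} GB")
```

With the rounded input this gives roughly 7.7, 11.6 and 15.4 GB,
close to the values quoted above (which presumably come from the
unrounded dirtyable figure).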

Below freerun, balance_dirty_pages() returns immediately.
Between freerun and limit, pos_ratio ramps from 2.0 down to 0
via a cubic polynomial, and tasks sleep proportionally in
io_schedule_timeout(). At limit, pos_ratio=0 and all writers
block (max 200ms sleep).
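
A rough floating-point sketch of that ramp, for illustration only. It
models just the global cubic control line (2.0 at freerun, 1.0 at the
setpoint midway between freerun and limit, 0 at limit) and ignores
the per-device position control and the fixed-point arithmetic the
kernel uses. The pause model is simplified, and all names and the
base rate parameter are hypothetical.

```python
MAX_PAUSE_S = 0.2  # the 200 ms maximum sleep mentioned above


def pos_ratio(dirty, freerun, limit):
    """Simplified global control line: 2.0 at freerun, 1.0 at setpoint, 0.0 at limit."""
    if dirty <= freerun:
        return 2.0  # at or below freerun the writer is not throttled at all
    if dirty >= limit:
        return 0.0
    setpoint = (freerun + limit) / 2.0
    x = (setpoint - dirty) / (limit - setpoint)
    return 1.0 + x ** 3


def pause_seconds(pages_dirtied, base_rate_pages_per_s, ratio):
    """Simplified pause: pages dirtied divided by the scaled rate, capped at 200 ms."""
    if ratio <= 0.0:
        return MAX_PAUSE_S
    return min(pages_dirtied / (base_rate_pages_per_s * ratio), MAX_PAUSE_S)


# Assumption: thresholds in GB taken from the table above.
for dirty_gb in (11.7, 12.5, 13.6, 14.5, 15.5):
    print(dirty_gb, round(pos_ratio(dirty_gb, 11.7, 15.5), 3))
```

The point of the sketch is only that the allowed rate shrinks, and
the sleep grows, as dirty moves from the freerun ceiling toward the
limit.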

Generation ≈ drain, so dirty settles at 10-14 GB, crossing
the freerun ceiling into the proportional throttle zone.

The observation
---------------

                 throughput  IO PSI full
  dirty 5-10 GB:  494 MB/s       1.4%
  dirty >10 GB:   578 MB/s      26.2%
                  (dirty still accumulating at +2 MB/s)

  Peak IO PSI full: 39.5%.
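
For reference, IO PSI figures like these can be sampled from
/proc/pressure/io on kernels built with PSI support. The sketch below
parses that file's "some"/"full" lines. The sample text uses
placeholder numbers, not measurements from this machine, and which
averaging window the table above used is not stated here.

```python
def parse_psi(text):
    """Parse PSI text into {"some": {...}, "full": {...}} with float values."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        result[kind] = {k: float(v) for k, v in (p.split("=", 1) for p in pairs)}
    return result


def read_io_psi(path="/proc/pressure/io"):
    """Read and parse the IO pressure file (requires a PSI-enabled kernel)."""
    with open(path) as f:
        return parse_psi(f.read())


# Placeholder sample in the /proc/pressure/io format (values are made up).
SAMPLE = (
    "some avg10=3.00 avg60=2.00 avg300=1.00 total=1000\n"
    "full avg10=2.50 avg60=1.50 avg300=0.50 total=500\n"
)
psi = parse_psi(SAMPLE)
print("IO PSI full avg10:", psi["full"]["avg10"])
```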

The proportional throttle adds 26% IO PSI (full) but dirty
still grows. The flusher is already at its submission ceiling
and putting writers to sleep doesn't help it submit I/O faster. The
device is actually starved: writeback-in-flight drops from
6-8 MB (baseline) to 1.8 MB (during throttle), and NVMe QD
drops from 45 to 37. The device could drain more if fed
more, but the flusher can't feed it faster.

Meanwhile, memory is not scarce:

  Dirty:          16 GB
  Clean file LRU: 57 GB  (instantly reclaimable)
  Memory PSI:     1-2%

The dirty pages aren't causing memory pressure. 57 GB of clean
pages remain available for instant reclaim. The throttle is
protecting a resource that isn't scarce, at a cost of up to 40% IO
PSI (full).

Our workaround plan: dirty_background_ratio=5, dirty_ratio=40.
This raises freerun to ~17.5 GB, keeping dirty in freerun.
The flusher drains identically. It runs to bg_thresh either
way.
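
A quick check of which zone the observed 10-14 GB dirty range falls
into under the default and proposed ratios. This is a sketch only: it
assumes the rounded 77 GB dirtyable figure, and the function and zone
names are made up for illustration.

```python
def classify(dirty_gb, dirtyable_gb, background_ratio, dirty_ratio):
    """Return the throttle zone for a dirty level under the given ratios."""
    bg_thresh = dirtyable_gb * background_ratio / 100.0
    limit = dirtyable_gb * dirty_ratio / 100.0
    freerun = (bg_thresh + limit) / 2.0
    if dirty_gb <= freerun:
        return "freerun"
    if dirty_gb < limit:
        return "proportional throttle"
    return "hard limit"


# (dirty_background_ratio, dirty_ratio) pairs from this mail.
CONFIGS = {"defaults": (10, 20), "workaround": (5, 40)}

for name, (bg_ratio, ratio) in CONFIGS.items():
    for dirty_gb in (10.0, 14.0):
        zone = classify(dirty_gb, 77.0, bg_ratio, ratio)
        print(f"{name}: dirty={dirty_gb} GB -> {zone}")
```

With the rounded input the proposed freerun ceiling works out to
about 17.3 GB, close to the ~17.5 GB quoted above, so the whole
10-14 GB range stays in freerun.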

Questions
---------

1. When should balance_dirty_pages() sleep writers? Currently
   the criterion is "dirty > fraction of dirtyable memory."
   This doesn't consider whether sleeping actually helps
   drain dirty faster, or whether the remaining clean pages
   are sufficient. Should the decision factor in flusher/
   device saturation or available reclaimable memory?

2. Is tuning dirty_ratio to 30-40% the expected approach for
   high-memory (>256 GB) systems? Documentation doesn't
   cover this.

3. The freerun ceiling gates entry into the proportional
   throttle path. Even moderate sleeping shows up as IO PSI
   (io_schedule_timeout is accounted as IO stall). Dirty
   never hits the hard limit in our case. It sits in the
   proportional zone, but cumulative PSI from many tasks
   sleeping short durations is already 26-40% (full). Should
   the throttle path be skipped when sleeping cannot help
   drain?

Thanks,

Yunzhao
