From: Jan Kara <jack@suse.cz>
To: Joanne Koong <joannelkoong@gmail.com>
Cc: Jan Kara <jack@suse.cz>,
	lsf-pc@lists.linux-foundation.org,
	 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	 "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Improving large folio writeback performance
Date: Fri, 17 Jan 2025 12:53:12 +0100	[thread overview]
Message-ID: <kugnldi6l2rr4m2pcyh3ystyjsnwhcp3jrukqt7ni2ipnw3vpg@l7ieaeq3uosk> (raw)
In-Reply-To: <CAJnrk1aZYpGe+x3=Fz0W30FfXB9RADutDpp+4DeuoBSVHp9XHA@mail.gmail.com>

On Thu 16-01-25 15:38:54, Joanne Koong wrote:
> On Thu, Jan 16, 2025 at 3:01 AM Jan Kara <jack@suse.cz> wrote:
> > On Tue 14-01-25 16:50:53, Joanne Koong wrote:
> > > I would like to propose a discussion topic about improving large folio
> > > writeback performance. As more filesystems adopt large folios, it
> > > becomes increasingly important that writeback is made to be as
> > > performant as possible. There are two areas I'd like to discuss:
> > >
> > > == Granularity of dirty pages writeback ==
> > > Currently, the granularity of writeback is at the folio level. If one
> > > byte in a folio is dirty, the entire folio will be written back. This
> > > becomes unscalable for larger folios and significantly degrades
> > > performance, especially for workloads that employ random writes.
> > >
> > > One idea is to track dirty pages at a smaller granularity using a
> > > 64-bit bitmap stored inside the folio struct where each bit tracks a
> > > smaller chunk of pages (e.g. for 2MB folios, each bit would track a 32KB
> > > chunk), and only write back dirty chunks rather than the entire folio.
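
To make the proposal concrete, such per-chunk tracking could look roughly
like the sketch below (purely illustrative; the type and helper names are
made up and this is not existing kernel code):

#include <stddef.h>

/*
 * Illustrative sketch of a per-folio dirty bitmap: each of the 64 bits
 * covers folio_size / 64 bytes (32KB for a 2MB folio).
 */
struct folio_dirty_map {
	unsigned long long bitmap;	/* bit n set => chunk n is dirty */
};

/* Mark the byte range [off, off + len) of the folio dirty (len > 0). */
static void dirty_map_set_range(struct folio_dirty_map *map,
				size_t folio_size, size_t off, size_t len)
{
	size_t chunk = folio_size / 64;	/* bytes covered by one bit */
	size_t first = off / chunk;
	size_t last = (off + len - 1) / chunk;

	for (size_t i = first; i <= last; i++)
		map->bitmap |= 1ULL << i;
}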
> >
> > Yes, this is a known problem and, as Dave pointed out, it is currently up
> > to the lower layer to handle finer-grained dirtiness. You can take
> > inspiration from the iomap layer, which already does this, or you can
> > convert your filesystem to use iomap (the preferred way).
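
For reference, the consumer side of such a bitmap would look roughly like
this at writeback time (again just a sketch reusing the folio_dirty_map
above; this is not the actual iomap code, which keeps its per-block state
in its own folio state structure):

/* Submit only the dirty chunks of the folio instead of the whole folio. */
static void writeback_dirty_chunks(struct folio_dirty_map *map,
				   size_t folio_size,
				   void (*write_range)(size_t off, size_t len))
{
	size_t chunk = folio_size / 64;

	for (size_t i = 0; i < 64; i++) {
		if (map->bitmap & (1ULL << i))
			write_range(i * chunk, chunk);
	}
	map->bitmap = 0;	/* all dirty chunks have been submitted */
}

A real implementation would also merge adjacent dirty chunks into a single
larger I/O rather than submitting them one by one.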
> >
> > > == Balancing dirty pages ==
> > > It was observed that the dirty page balancing logic used in
> > > balance_dirty_pages() fails to scale for large folios [1]. For
> > > example, FUSE saw around a 125% drop in throughput for writes when
> > > using large folios vs small folios with 1MB block sizes, which was
> > > attributed to scheduled io waits in the dirty page balancing logic. In
> > > generic_perform_write(), dirty pages are balanced after every write to
> > > the page cache by the filesystem. With large folios, each write
> > > dirties a larger number of pages which can grossly exceed the
> > > ratelimit, whereas with small folios each write is one page and so
> > > pages are balanced more incrementally and adhere more closely to the
> > > ratelimit. To accommodate large folios, the dirty page balancing logic
> > > likely needs to be reworked.
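
For context, the buffered write path follows this pattern (a heavily
simplified sketch of what generic_perform_write() does; only
balance_dirty_pages_ratelimited() is a real kernel function here, the
other helpers and types are placeholders):

#include <stddef.h>

struct address_space;
struct folio;

/* Placeholder helpers, not real kernel APIs. */
struct folio *get_write_folio(struct address_space *mapping, long long pos);
size_t copy_into_folio(struct folio *folio, long long pos, size_t len);
void mark_folio_dirty(struct folio *folio);

/* Real kernel API (declared in linux/writeback.h). */
void balance_dirty_pages_ratelimited(struct address_space *mapping);

static void perform_write_sketch(struct address_space *mapping,
				 long long pos, size_t count)
{
	while (count) {
		/* May be a large folio, up to 2MB. */
		struct folio *folio = get_write_folio(mapping, pos);
		size_t copied = copy_into_folio(folio, pos, count);

		mark_folio_dirty(folio);	/* whole folio accounted as dirty */
		pos += copied;
		count -= copied;

		/*
		 * Throttle after every copy: with 4KB folios each call sees
		 * one newly dirtied page, with 2MB folios up to 512 at once.
		 */
		balance_dirty_pages_ratelimited(mapping);
	}
}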
> >
> > I think there are several separate issues here. One is that
> > folio_account_dirtied() will consider the whole folio as needing writeback
> > which is not necessarily the case (as e.g. iomap will writeback only dirty
> > blocks in it). This was OKish when pages were 4k and you were using 1k
> > blocks (which was an uncommon configuration anyway; usually you had a 4k
> > block size), but it starts to hurt a lot with 2M folios, so we might need
> > to find a way to propagate the information about the really dirty bits
> > into the writeback accounting.
> 
> Agreed. The only workable solution I see is to have some sort of api
> similar to filemap_dirty_folio() that takes in the number of pages
> dirtied as an arg, but maybe there's a better solution.

Yes, something like that I suppose.
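For instance, something along these lines (a strawman signature only, not
an existing kernel API):

#include <stdbool.h>

struct address_space;
struct folio;

/*
 * Strawman: like filemap_dirty_folio(), but the filesystem reports how
 * many pages of the folio actually became dirty, so the dirtied/writeback
 * accounting is bumped by nr_dirtied instead of folio_nr_pages().
 */
bool filemap_dirty_folio_pages(struct address_space *mapping,
			       struct folio *folio,
			       unsigned int nr_dirtied);

A write that dirties a single 4KB block of a 2MB folio would then pass
nr_dirtied == 1 instead of having all 512 pages accounted as dirtied.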

> > Another problem *may* be that fast increments to dirtied pages (as we dirty
> > 512 pages at once instead of the 16 we did in the past) cause over-reaction
> > in the dirtiness balancing logic and we throttle the task too much. The
> > heuristics there try to find the right amount of time to block a task so
> > that the dirtying speed matches the writeback speed, and it's plausible
> > that the large increments make this logic oscillate between two extremes,
> > leading to suboptimal throughput. Also, since this was observed with FUSE,
> > I believe a significant factor is that FUSE enables the "strictlimit"
> > feature of the BDI
> > which makes dirty throttling more aggressive (generally the amount of
> > allowed dirty pages is lower). Anyway, these are mostly speculations from
> > my end. This needs more data to decide what exactly (if anything) needs
> > tweaking in the dirty throttling logic.
> 
> I tested this experimentally and you're right, on FUSE this is
> impacted a lot by the "strictlimit". I didn't see any bottlenecks when
> strictlimit wasn't enabled on FUSE. AFAICT, the strictlimit affects
> the dirty throttle control freerun flag (which gets used to determine
> whether throttling can be skipped) in the balance_dirty_pages() logic.
> For FUSE, we can't turn off strictlimit for unprivileged servers, but
> maybe we can make the throttling check more permissive by upping the
> value of the min_pause calculation in wb_min_pause() for writes that
> support large folios? As it stands, the current logic makes writing large
> folios infeasible in FUSE (estimates show around a 75% drop in
> throughput).

I think tweaking min_pause is the wrong way to do this; that is just a
symptom. Can you run something like:

while true; do
	cat /sys/kernel/debug/bdi/<fuse-bdi>/stats
	echo "---------"
	sleep 1
done >bdi-debug.txt

while you are writing to the FUSE filesystem and share the output file?
That should tell us a bit more about what's happening inside the writeback
throttling. Also, do you somehow configure min/max_ratio for the FUSE bdi?
You can check in /sys/class/bdi/<fuse-bdi>/{min,max}_ratio . I suspect the
problem is that the BDI dirty limit does not ramp up properly when we
increase dirtied pages in large chunks.
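
For reference, the decision that strictlimit changes has roughly this
shape (a simplified illustration of the freerun check in
balance_dirty_pages(); the helper below is an approximation, not the
actual mm/page-writeback.c code):

/*
 * Approximation of the "freerun" decision: while we are in the freerun
 * zone, the writer is not throttled at all.
 */
static int in_freerun_zone(int strictlimit,
			   unsigned long dirty, unsigned long thresh,
			   unsigned long bg_thresh,
			   unsigned long wb_dirty, unsigned long wb_thresh,
			   unsigned long wb_bg_thresh)
{
	if (strictlimit) {
		/*
		 * With the strictlimit feature (which FUSE enables), the
		 * check is done against the per-bdi numbers, which are much
		 * smaller and ramp up slowly, so a burst of 512 newly
		 * dirtied pages leaves the freerun zone almost immediately.
		 */
		dirty = wb_dirty;
		thresh = wb_thresh;
		bg_thresh = wb_bg_thresh;
	}

	/* Freerun while below the midpoint of the background and hard limits. */
	return dirty <= (thresh + bg_thresh) / 2;
}

If wb_thresh stays low because it has not ramped up yet, the writer gets
paused even though the global limits are nowhere near exceeded.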

Actually, there's a patch queued in the mm tree that improves the ramping up
of the bdi dirty limit for strictlimit bdis [1]. It would be nice if you could
test whether it changes something in the behavior you observe. Thanks!

								Honza

[1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-page-writeback-consolidate-wb_thresh-bumping-logic-into-__wb_calc_thresh.patch

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

