linux-mm.kvack.org archive mirror
* [LSF/MM/BPF TOPIC] Improving large folio writeback performance
@ 2025-01-15  0:50 Joanne Koong
  2025-01-15  1:21 ` Dave Chinner
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Joanne Koong @ 2025-01-15  0:50 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-fsdevel, linux-mm, Matthew Wilcox (Oracle)

Hi all,

I would like to propose a discussion topic about improving large folio
writeback performance. As more filesystems adopt large folios, it
becomes increasingly important that writeback is made to be as
performant as possible. There are two areas I'd like to discuss:


== Granularity of dirty pages writeback ==
Currently, the granularity of writeback is at the folio level. If one
byte in a folio is dirty, the entire folio will be written back. This
becomes unscalable for larger folios and significantly degrades
performance, especially for workloads that employ random writes.

One idea is to track dirty pages at a smaller granularity using a
64-bit bitmap stored inside the folio struct, where each bit tracks a
smaller chunk of pages (eg for 2 MB folios, each bit would track a
32k chunk), and only write back dirty chunks rather than the entire folio.
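As a rough sketch of this idea (an illustrative Python model only; the chunk size, class name, and helpers below are assumptions for illustration, not kernel code):

```python
# Illustrative model (not kernel code): track the dirty state of a
# 2 MB folio with a 64-bit bitmap, one bit per 32k chunk, and write
# back only the chunks that are actually dirty.

FOLIO_SIZE = 2 * 1024 * 1024
CHUNK_SIZE = FOLIO_SIZE // 64  # 32 KB tracked per bit

class FolioDirtyBitmap:
    def __init__(self):
        self.bits = 0  # 64-bit dirty bitmap

    def mark_dirty(self, offset, length):
        """Set the dirty bit for every chunk touched by [offset, offset+length)."""
        first = offset // CHUNK_SIZE
        last = (offset + length - 1) // CHUNK_SIZE
        for bit in range(first, last + 1):
            self.bits |= 1 << bit

    def dirty_ranges(self):
        """Yield (offset, length) for each contiguous run of dirty chunks."""
        start = None
        for bit in range(65):  # one extra iteration to flush a trailing run
            dirty = bit < 64 and (self.bits >> bit) & 1
            if dirty and start is None:
                start = bit
            elif not dirty and start is not None:
                yield (start * CHUNK_SIZE, (bit - start) * CHUNK_SIZE)
                start = None

f = FolioDirtyBitmap()
f.mark_dirty(100, 1)              # a 1-byte write dirties one 32k chunk
f.mark_dirty(1024 * 1024, 4096)   # a 4k write elsewhere dirties another
assert list(f.dirty_ranges()) == [(0, 32768), (1024 * 1024, 32768)]
```

A random 1-byte write thus costs one 32k chunk of writeback instead of the full 2 MB folio.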


== Balancing dirty pages ==
It was observed that the dirty page balancing logic used in
balance_dirty_pages() fails to scale for large folios [1]. For
example, fuse writes with large folios achieved less than half the
throughput of small folios on 1MB block sizes (small folios
outperformed large folios by around 125%), which was attributed to
scheduled io waits in the dirty page balancing logic. In
generic_perform_write(), dirty pages are balanced after every write to
the page cache by the filesystem. With large folios, each write
dirties a larger number of pages, which can grossly exceed the
ratelimit, whereas with small folios each write is one page, so pages
are balanced more incrementally and adhere more closely to the
ratelimit. To accommodate large folios, the dirty page balancing
logic likely needs to be reworked.
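A toy model of that asymmetry (the ratelimit value is assumed for illustration; this is not the real balance_dirty_pages() heuristic):

```python
# Toy model: a task is checked against a ratelimit after each write to
# the page cache. A small-folio write dirties 1 page per check; a 2 MB
# large-folio write dirties 512 pages at once, so a single write can
# grossly exceed the limit. RATELIMIT is an illustrative number only.

RATELIMIT = 32  # pages a task may dirty between balance checks (assumed)

def pages_over_limit(pages_dirtied_per_write):
    """How far one write overshoots the ratelimit, in pages."""
    return max(0, pages_dirtied_per_write - RATELIMIT)

assert pages_over_limit(1) == 0      # small folio: balanced incrementally
assert pages_over_limit(512) == 480  # 2 MB folio: overshoots by 480 pages
```

The throttling heuristics then react to these large, bursty increments rather than a smooth stream of single-page updates.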


Thanks,
Joanne

[1] https://lore.kernel.org/linux-fsdevel/Z1N505RCcH1dXlLZ@casper.infradead.org/T/#m9e3dd273aa202f9f4e12eb9c96602b5fec2d383d


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Improving large folio writeback performance
  2025-01-15  0:50 [LSF/MM/BPF TOPIC] Improving large folio writeback performance Joanne Koong
@ 2025-01-15  1:21 ` Dave Chinner
  2025-01-16 20:14   ` Joanne Koong
  2025-01-15  1:50 ` Darrick J. Wong
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2025-01-15  1:21 UTC (permalink / raw)
  To: Joanne Koong; +Cc: lsf-pc, linux-fsdevel, linux-mm, Matthew Wilcox (Oracle)

On Tue, Jan 14, 2025 at 04:50:53PM -0800, Joanne Koong wrote:
> Hi all,
> 
> I would like to propose a discussion topic about improving large folio
> writeback performance. As more filesystems adopt large folios, it
> becomes increasingly important that writeback is made to be as
> performant as possible. There are two areas I'd like to discuss:
> 
> 
> == Granularity of dirty pages writeback ==
> Currently, the granularity of writeback is at the folio level. If one
> byte in a folio is dirty, the entire folio will be written back. This
> becomes unscalable for larger folios and significantly degrades
> performance, especially for workloads that employ random writes.

This sounds familiar, probably because we fixed this exact issue in
the iomap infrastructure some while ago.

commit 4ce02c67972211be488408c275c8fbf19faf29b3
Author: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Date:   Mon Jul 10 14:12:43 2023 -0700

    iomap: Add per-block dirty state tracking to improve performance
    
    When filesystem blocksize is less than folio size (either with
    mapping_large_folio_support() or with blocksize < pagesize) and when the
    folio is uptodate in pagecache, then even a byte write can cause
    an entire folio to be written to disk during writeback. This happens
    because we currently don't have a mechanism to track per-block dirty
    state within struct iomap_folio_state. We currently only track uptodate
    state.
    
    This patch implements support for tracking per-block dirty state in
    iomap_folio_state->state bitmap. This should help improve the filesystem
    write performance and help reduce write amplification.
    
    Performance testing of below fio workload reveals ~16x performance
    improvement using nvme with XFS (4k blocksize) on Power (64K pagesize)
    FIO reported write bw scores improved from around ~28 MBps to ~452 MBps.
    
    1. <test_randwrite.fio>
    [global]
            ioengine=psync
            rw=randwrite
            overwrite=1
            pre_read=1
            direct=0
            bs=4k
            size=1G
            dir=./
            numjobs=8
            fdatasync=1
            runtime=60
            iodepth=64
            group_reporting=1
    
    [fio-run]
    
    2. Also our internal performance team reported that this patch improves
       their database workload performance by around ~83% (with XFS on Power)
    
    Reported-by: Aravinda Herle <araherle@in.ibm.com>
    Reported-by: Brian Foster <bfoster@redhat.com>
    Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>


> One idea is to track dirty pages at a smaller granularity using a
> 64-bit bitmap stored inside the folio struct where each bit tracks a
> smaller chunk of pages (eg for 2 MB folios, each bit would track a
> 32k chunk), and only write back dirty chunks rather than the entire folio.

Have a look at how sub-folio state is tracked via the
folio->iomap_folio_state->state{} bitmaps.

Essentially it is up to the subsystem to track sub-folio state if
they require it; there is some generic filesystem infrastructure
support already in place (like iomap), but if that doesn't fit a
filesystem then it will need to provide its own dirty/uptodate
tracking....
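A simplified model of the layout Dave points at: struct iomap_folio_state keeps a single bitmap in which the first nr_blocks bits are per-block uptodate state and the next nr_blocks bits are per-block dirty state. The Python below only illustrates that layout; it is not the kernel implementation.

```python
# Simplified model of iomap's per-block state tracking: one bitmap,
# uptodate bits in [0, nr_blocks), dirty bits in [nr_blocks, 2*nr_blocks).
# Illustration only; see struct iomap_folio_state for the real thing.

class IomapFolioState:
    def __init__(self, folio_size, block_size):
        self.nr_blocks = folio_size // block_size
        self.state = 0  # combined uptodate + dirty bitmap

    def set_block_uptodate(self, block):
        self.state |= 1 << block

    def set_block_dirty(self, block):
        self.state |= 1 << (self.nr_blocks + block)

    def block_is_dirty(self, block):
        return bool((self.state >> (self.nr_blocks + block)) & 1)

    def dirty_blocks(self):
        return [b for b in range(self.nr_blocks) if self.block_is_dirty(b)]

ifs = IomapFolioState(folio_size=64 * 1024, block_size=4096)  # 16 blocks
ifs.set_block_dirty(3)
ifs.set_block_dirty(4)
assert ifs.dirty_blocks() == [3, 4]  # writeback can skip the other 14 blocks
```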

-Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [LSF/MM/BPF TOPIC] Improving large folio writeback performance
  2025-01-15  0:50 [LSF/MM/BPF TOPIC] Improving large folio writeback performance Joanne Koong
  2025-01-15  1:21 ` Dave Chinner
@ 2025-01-15  1:50 ` Darrick J. Wong
  2025-01-16 11:01 ` [Lsf-pc] " Jan Kara
  2025-01-17 11:40 ` Vlastimil Babka
  3 siblings, 0 replies; 16+ messages in thread
From: Darrick J. Wong @ 2025-01-15  1:50 UTC (permalink / raw)
  To: Joanne Koong; +Cc: lsf-pc, linux-fsdevel, linux-mm, Matthew Wilcox (Oracle)

On Tue, Jan 14, 2025 at 04:50:53PM -0800, Joanne Koong wrote:
> Hi all,
> 
> I would like to propose a discussion topic about improving large folio
> writeback performance. As more filesystems adopt large folios, it
> becomes increasingly important that writeback is made to be as
> performant as possible. There are two areas I'd like to discuss:
> 
> 
> == Granularity of dirty pages writeback ==
> Currently, the granularity of writeback is at the folio level. If one
> byte in a folio is dirty, the entire folio will be written back. This
> becomes unscalable for larger folios and significantly degrades
> performance, especially for workloads that employ random writes.
> 
> One idea is to track dirty pages at a smaller granularity using a
> 64-bit bitmap stored inside the folio struct where each bit tracks a
> smaller chunk of pages (eg for 2 MB folios, each bit would track a
> 32k chunk), and only write back dirty chunks rather than the entire folio.
> 
> 
> == Balancing dirty pages ==
> It was observed that the dirty page balancing logic used in
> balance_dirty_pages() fails to scale for large folios [1]. For
> example, fuse writes with large folios achieved less than half the
> throughput of small folios on 1MB block sizes (small folios
> outperformed large folios by around 125%), which was attributed to
> scheduled io waits in the dirty page balancing logic. In
> generic_perform_write(), dirty pages are balanced after every write to
> the page cache by the filesystem. With large folios, each write
> dirties a larger number of pages, which can grossly exceed the
> ratelimit, whereas with small folios each write is one page, so pages
> are balanced more incrementally and adhere more closely to the
> ratelimit. To accommodate large folios, the dirty page balancing
> logic likely needs to be reworked.

Hmrmm.... it's a pity that folio_account_dirtied charges the process
for all the pages in the folio even if it only wrote one byte, and then
the ratelimit thresholds haven't caught up to filesystems batching calls
to balance_dirty_pages.  But I'm no expert on how that ratelimiting
stuff works so that's all I have to say about that. :/

--D

> 
> Thanks,
> Joanne
> 
> [1] https://lore.kernel.org/linux-fsdevel/Z1N505RCcH1dXlLZ@casper.infradead.org/T/#m9e3dd273aa202f9f4e12eb9c96602b5fec2d383d
> 



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Improving large folio writeback performance
  2025-01-15  0:50 [LSF/MM/BPF TOPIC] Improving large folio writeback performance Joanne Koong
  2025-01-15  1:21 ` Dave Chinner
  2025-01-15  1:50 ` Darrick J. Wong
@ 2025-01-16 11:01 ` Jan Kara
  2025-01-16 23:38   ` Joanne Koong
  2025-01-17 11:40 ` Vlastimil Babka
  3 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2025-01-16 11:01 UTC (permalink / raw)
  To: Joanne Koong; +Cc: lsf-pc, linux-fsdevel, linux-mm, Matthew Wilcox (Oracle)


Hello!

On Tue 14-01-25 16:50:53, Joanne Koong wrote:
> I would like to propose a discussion topic about improving large folio
> writeback performance. As more filesystems adopt large folios, it
> becomes increasingly important that writeback is made to be as
> performant as possible. There are two areas I'd like to discuss:
> 
> == Granularity of dirty pages writeback ==
> Currently, the granularity of writeback is at the folio level. If one
> byte in a folio is dirty, the entire folio will be written back. This
> becomes unscalable for larger folios and significantly degrades
> performance, especially for workloads that employ random writes.
> 
> One idea is to track dirty pages at a smaller granularity using a
> 64-bit bitmap stored inside the folio struct where each bit tracks a
> smaller chunk of pages (eg for 2 MB folios, each bit would track a
> 32k chunk), and only write back dirty chunks rather than the entire folio.

Yes, this is a known problem and, as Dave pointed out, currently it is up
to the lower layer to handle finer grained dirtiness. You can take
inspiration from the iomap layer that already does this, or you can convert
your filesystem to use iomap (the preferred way).

> == Balancing dirty pages ==
> It was observed that the dirty page balancing logic used in
> balance_dirty_pages() fails to scale for large folios [1]. For
> example, fuse writes with large folios achieved less than half the
> throughput of small folios on 1MB block sizes (small folios
> outperformed large folios by around 125%), which was attributed to
> scheduled io waits in the dirty page balancing logic. In
> generic_perform_write(), dirty pages are balanced after every write to
> the page cache by the filesystem. With large folios, each write
> dirties a larger number of pages, which can grossly exceed the
> ratelimit, whereas with small folios each write is one page, so pages
> are balanced more incrementally and adhere more closely to the
> ratelimit. To accommodate large folios, the dirty page balancing
> logic likely needs to be reworked.

I think there are several separate issues here. One is that
folio_account_dirtied() will consider the whole folio as needing writeback
which is not necessarily the case (as e.g. iomap will writeback only dirty
blocks in it). This was OKish when pages were 4k and you were using 1k
blocks (which was uncommon configuration anyway, usually you had 4k block
size), it starts to hurt a lot with 2M folios, so we might need to find a
way to propagate the information about really dirty bits into writeback
accounting.

Another problem *may* be that fast increments to dirtied pages (as we dirty
512 pages at once instead of 16 we did in the past) cause over-reaction in
the dirtiness balancing logic and we throttle the task too much. The
heuristics there try to find the right amount of time to block a task so
that dirtying speed matches the writeback speed and it's plausible that
the large increments make this logic oscillate between two extremes leading
to suboptimal throughput. Also, since this was observed with FUSE, I believe
a significant factor is that FUSE enables "strictlimit" feature of the BDI
which makes dirty throttling more aggressive (generally the amount of
allowed dirty pages is lower). Anyway, these are mostly speculations from
my end. This needs more data to decide what exactly (if anything) needs
tweaking in the dirty throttling logic.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR



* Re: [LSF/MM/BPF TOPIC] Improving large folio writeback performance
  2025-01-15  1:21 ` Dave Chinner
@ 2025-01-16 20:14   ` Joanne Koong
  0 siblings, 0 replies; 16+ messages in thread
From: Joanne Koong @ 2025-01-16 20:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: lsf-pc, linux-fsdevel, linux-mm, Matthew Wilcox (Oracle)

On Tue, Jan 14, 2025 at 5:21 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Jan 14, 2025 at 04:50:53PM -0800, Joanne Koong wrote:
> > Hi all,
> >
> > I would like to propose a discussion topic about improving large folio
> > writeback performance. As more filesystems adopt large folios, it
> > becomes increasingly important that writeback is made to be as
> > performant as possible. There are two areas I'd like to discuss:
> >
> >
> > == Granularity of dirty pages writeback ==
> > Currently, the granularity of writeback is at the folio level. If one
> > byte in a folio is dirty, the entire folio will be written back. This
> > becomes unscalable for larger folios and significantly degrades
> > performance, especially for workloads that employ random writes.
>
> This sounds familiar, probably because we fixed this exact issue in
> the iomap infrastructure some while ago.
>
> commit 4ce02c67972211be488408c275c8fbf19faf29b3
> Author: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> Date:   Mon Jul 10 14:12:43 2023 -0700
>
>     iomap: Add per-block dirty state tracking to improve performance
>
>     When filesystem blocksize is less than folio size (either with
>     mapping_large_folio_support() or with blocksize < pagesize) and when the
>     folio is uptodate in pagecache, then even a byte write can cause
>     an entire folio to be written to disk during writeback. This happens
>     because we currently don't have a mechanism to track per-block dirty
>     state within struct iomap_folio_state. We currently only track uptodate
>     state.
>
>     This patch implements support for tracking per-block dirty state in
>     iomap_folio_state->state bitmap. This should help improve the filesystem
>     write performance and help reduce write amplification.
>
>     Performance testing of below fio workload reveals ~16x performance
>     improvement using nvme with XFS (4k blocksize) on Power (64K pagesize)
>     FIO reported write bw scores improved from around ~28 MBps to ~452 MBps.
>
>     1. <test_randwrite.fio>
>     [global]
>             ioengine=psync
>             rw=randwrite
>             overwrite=1
>             pre_read=1
>             direct=0
>             bs=4k
>             size=1G
>             dir=./
>             numjobs=8
>             fdatasync=1
>             runtime=60
>             iodepth=64
>             group_reporting=1
>
>     [fio-run]
>
>     2. Also our internal performance team reported that this patch improves
>        their database workload performance by around ~83% (with XFS on Power)
>
>     Reported-by: Aravinda Herle <araherle@in.ibm.com>
>     Reported-by: Brian Foster <bfoster@redhat.com>
>     Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>     Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>
>
> > One idea is to track dirty pages at a smaller granularity using a
> > 64-bit bitmap stored inside the folio struct where each bit tracks a
> > smaller chunk of pages (eg for 2 MB folios, each bit would track a
> > 32k chunk), and only write back dirty chunks rather than the entire folio.
>
> Have a look at how sub-folio state is tracked via the
> folio->iomap_folio_state->state{} bitmaps.
>
> Essentially it is up to the subsystem to track sub-folio state if
> they require it; there is some generic filesystem infrastructure
> support already in place (like iomap), but if that doesn't fit a
> filesystem then it will need to provide its own dirty/uptodate
> tracking....

Great, thanks for the info. I'll take a look at how the iomap layer does this.

>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Improving large folio writeback performance
  2025-01-16 11:01 ` [Lsf-pc] " Jan Kara
@ 2025-01-16 23:38   ` Joanne Koong
  2025-01-17 11:53     ` Jan Kara
  0 siblings, 1 reply; 16+ messages in thread
From: Joanne Koong @ 2025-01-16 23:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: lsf-pc, linux-fsdevel, linux-mm, Matthew Wilcox (Oracle)

On Thu, Jan 16, 2025 at 3:01 AM Jan Kara <jack@suse.cz> wrote:
>
>
> Hello!
>
> On Tue 14-01-25 16:50:53, Joanne Koong wrote:
> > I would like to propose a discussion topic about improving large folio
> > writeback performance. As more filesystems adopt large folios, it
> > becomes increasingly important that writeback is made to be as
> > performant as possible. There are two areas I'd like to discuss:
> >
> > == Granularity of dirty pages writeback ==
> > Currently, the granularity of writeback is at the folio level. If one
> > byte in a folio is dirty, the entire folio will be written back. This
> > becomes unscalable for larger folios and significantly degrades
> > performance, especially for workloads that employ random writes.
> >
> > One idea is to track dirty pages at a smaller granularity using a
> > 64-bit bitmap stored inside the folio struct where each bit tracks a
> > smaller chunk of pages (eg for 2 MB folios, each bit would track a
> > 32k chunk), and only write back dirty chunks rather than the entire folio.
>
> Yes, this is a known problem and, as Dave pointed out, currently it is up
> to the lower layer to handle finer grained dirtiness. You can take
> inspiration from the iomap layer that already does this, or you can convert
> your filesystem to use iomap (the preferred way).
>
> > == Balancing dirty pages ==
> > It was observed that the dirty page balancing logic used in
> > balance_dirty_pages() fails to scale for large folios [1]. For
> > example, fuse writes with large folios achieved less than half the
> > throughput of small folios on 1MB block sizes (small folios
> > outperformed large folios by around 125%), which was attributed to
> > scheduled io waits in the dirty page balancing logic. In
> > generic_perform_write(), dirty pages are balanced after every write to
> > the page cache by the filesystem. With large folios, each write
> > dirties a larger number of pages, which can grossly exceed the
> > ratelimit, whereas with small folios each write is one page, so pages
> > are balanced more incrementally and adhere more closely to the
> > ratelimit. To accommodate large folios, the dirty page balancing
> > logic likely needs to be reworked.
>
> I think there are several separate issues here. One is that
> folio_account_dirtied() will consider the whole folio as needing writeback
> which is not necessarily the case (as e.g. iomap will writeback only dirty
> blocks in it). This was OKish when pages were 4k and you were using 1k
> blocks (which was uncommon configuration anyway, usually you had 4k block
> > size), it starts to hurt a lot with 2M folios, so we might need to find a
> > way to propagate the information about really dirty bits into writeback
> > accounting.

Agreed. The only workable solution I see is to have some sort of api
similar to filemap_dirty_folio() that takes in the number of pages
dirtied as an arg, but maybe there's a better solution.
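A sketch of what such accounting could look like (the class and method names below are hypothetical illustrations, not an existing kernel API; today filemap_dirty_folio() has no "pages dirtied" argument):

```python
# Hypothetical accounting model: charge the task for only the pages it
# actually dirtied instead of the whole folio. All names here are made
# up for illustration.

class WritebackAccounting:
    def __init__(self):
        self.dirtied = 0  # pages charged against the dirty limits

    def account_whole_folio(self, folio_pages):
        # current behavior: a 1-byte write to a 2 MB folio charges 512 pages
        self.dirtied += folio_pages

    def account_dirtied_range(self, folio_pages, pages_dirtied):
        # proposed behavior: charge only what was actually dirtied
        self.dirtied += min(pages_dirtied, folio_pages)

old, new = WritebackAccounting(), WritebackAccounting()
old.account_whole_folio(512)            # whole-folio charge
new.account_dirtied_range(512, 8)       # charge only the 8 dirtied pages
assert (old.dirtied, new.dirtied) == (512, 8)
```

The throttling math would then see charges proportional to real write amplification rather than folio size.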

>
> Another problem *may* be that fast increments to dirtied pages (as we dirty
> 512 pages at once instead of 16 we did in the past) cause over-reaction in
> the dirtiness balancing logic and we throttle the task too much. The
> heuristics there try to find the right amount of time to block a task so
> that dirtying speed matches the writeback speed and it's plausible that
> the large increments make this logic oscillate between two extremes leading
> to suboptimal throughput. Also, since this was observed with FUSE, I believe
> a significant factor is that FUSE enables "strictlimit" feature of the BDI
> which makes dirty throttling more aggressive (generally the amount of
> allowed dirty pages is lower). Anyway, these are mostly speculations from
> my end. This needs more data to decide what exactly (if anything) needs
> tweaking in the dirty throttling logic.
>

I tested this experimentally and you're right, on FUSE this is
impacted a lot by the "strictlimit". I didn't see any bottlenecks when
strictlimit wasn't enabled on FUSE. AFAICT, the strictlimit affects
the dirty throttle control freerun flag (which gets used to determine
whether throttling can be skipped) in the balance_dirty_pages() logic.
For FUSE, we can't turn off strictlimit for unprivileged servers, but
maybe we can make the throttling check more permissive by upping the
value of the min_pause calculation in wb_min_pause() for writes that
support large folios? As of right now, the current logic makes writing
large folios infeasible in FUSE (estimates show around a 75% drop in
throughput).


Thanks,
Joanne


>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR



* Re: [LSF/MM/BPF TOPIC] Improving large folio writeback performance
  2025-01-15  0:50 [LSF/MM/BPF TOPIC] Improving large folio writeback performance Joanne Koong
                   ` (2 preceding siblings ...)
  2025-01-16 11:01 ` [Lsf-pc] " Jan Kara
@ 2025-01-17 11:40 ` Vlastimil Babka
  2025-01-17 11:56   ` [Lsf-pc] " Jan Kara
  3 siblings, 1 reply; 16+ messages in thread
From: Vlastimil Babka @ 2025-01-17 11:40 UTC (permalink / raw)
  To: Joanne Koong, lsf-pc; +Cc: linux-fsdevel, linux-mm, Matthew Wilcox (Oracle)

On 1/15/25 01:50, Joanne Koong wrote:
> Hi all,
> 
> I would like to propose a discussion topic about improving large folio
> writeback performance. As more filesystems adopt large folios, it
> becomes increasingly important that writeback is made to be as
> performant as possible. There are two areas I'd like to discuss:
> 
> 
> == Granularity of dirty pages writeback ==
> Currently, the granularity of writeback is at the folio level. If one
> byte in a folio is dirty, the entire folio will be written back. This
> becomes unscalable for larger folios and significantly degrades
> performance, especially for workloads that employ random writes.
> 
> One idea is to track dirty pages at a smaller granularity using a
> 64-bit bitmap stored inside the folio struct where each bit tracks a
> smaller chunk of pages (eg for 2 MB folios, each bit would track a
> 32k chunk), and only write back dirty chunks rather than the entire folio.

I think this might be tricky in some cases? I.e. with a 2 MB pmd-mapped
folio, it's only possible to write-protect the whole pmd, not individual 32k
chunks, in order to catch the first write to a chunk and mark it dirty.




* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Improving large folio writeback performance
  2025-01-16 23:38   ` Joanne Koong
@ 2025-01-17 11:53     ` Jan Kara
  2025-01-17 22:45       ` Joanne Koong
  0 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2025-01-17 11:53 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Jan Kara, lsf-pc, linux-fsdevel, linux-mm, Matthew Wilcox (Oracle)

On Thu 16-01-25 15:38:54, Joanne Koong wrote:
> On Thu, Jan 16, 2025 at 3:01 AM Jan Kara <jack@suse.cz> wrote:
> > On Tue 14-01-25 16:50:53, Joanne Koong wrote:
> > > I would like to propose a discussion topic about improving large folio
> > > writeback performance. As more filesystems adopt large folios, it
> > > becomes increasingly important that writeback is made to be as
> > > performant as possible. There are two areas I'd like to discuss:
> > >
> > > == Granularity of dirty pages writeback ==
> > > Currently, the granularity of writeback is at the folio level. If one
> > > byte in a folio is dirty, the entire folio will be written back. This
> > > becomes unscalable for larger folios and significantly degrades
> > > performance, especially for workloads that employ random writes.
> > >
> > > One idea is to track dirty pages at a smaller granularity using a
> > > 64-bit bitmap stored inside the folio struct where each bit tracks a
> > > smaller chunk of pages (eg for 2 MB folios, each bit would track a
> > > 32k chunk), and only write back dirty chunks rather than the entire folio.
> >
> > Yes, this is a known problem and, as Dave pointed out, currently it is up
> > to the lower layer to handle finer grained dirtiness. You can take
> > inspiration from the iomap layer that already does this, or you can convert
> > your filesystem to use iomap (the preferred way).
> >
> > > == Balancing dirty pages ==
> > > It was observed that the dirty page balancing logic used in
> > > balance_dirty_pages() fails to scale for large folios [1]. For
> > > example, fuse writes with large folios achieved less than half the
> > > throughput of small folios on 1MB block sizes (small folios
> > > outperformed large folios by around 125%), which was attributed to
> > > scheduled io waits in the dirty page balancing logic. In
> > > generic_perform_write(), dirty pages are balanced after every write to
> > > the page cache by the filesystem. With large folios, each write
> > > dirties a larger number of pages, which can grossly exceed the
> > > ratelimit, whereas with small folios each write is one page, so pages
> > > are balanced more incrementally and adhere more closely to the
> > > ratelimit. To accommodate large folios, the dirty page balancing
> > > logic likely needs to be reworked.
> >
> > I think there are several separate issues here. One is that
> > folio_account_dirtied() will consider the whole folio as needing writeback
> > which is not necessarily the case (as e.g. iomap will writeback only dirty
> > blocks in it). This was OKish when pages were 4k and you were using 1k
> > blocks (which was uncommon configuration anyway, usually you had 4k block
> > size), it starts to hurt a lot with 2M folios, so we might need to find a
> > way to propagate the information about really dirty bits into writeback
> > accounting.
> 
> Agreed. The only workable solution I see is to have some sort of api
> similar to filemap_dirty_folio() that takes in the number of pages
> dirtied as an arg, but maybe there's a better solution.

Yes, something like that I suppose.

> > Another problem *may* be that fast increments to dirtied pages (as we dirty
> > 512 pages at once instead of 16 we did in the past) cause over-reaction in
> > the dirtiness balancing logic and we throttle the task too much. The
> > heuristics there try to find the right amount of time to block a task so
> > that dirtying speed matches the writeback speed and it's plausible that
> > the large increments make this logic oscillate between two extremes leading
> > to suboptimal throughput. Also, since this was observed with FUSE, I believe
> > a significant factor is that FUSE enables "strictlimit" feature of the BDI
> > which makes dirty throttling more aggressive (generally the amount of
> > allowed dirty pages is lower). Anyway, these are mostly speculations from
> > my end. This needs more data to decide what exactly (if anything) needs
> > tweaking in the dirty throttling logic.
> 
> I tested this experimentally and you're right, on FUSE this is
> impacted a lot by the "strictlimit". I didn't see any bottlenecks when
> strictlimit wasn't enabled on FUSE. AFAICT, the strictlimit affects
> the dirty throttle control freerun flag (which gets used to determine
> whether throttling can be skipped) in the balance_dirty_pages() logic.
> For FUSE, we can't turn off strictlimit for unprivileged servers, but
> maybe we can make the throttling check more permissive by upping the
> value of the min_pause calculation in wb_min_pause() for writes that
> support large folios? As of right now, the current logic makes writing
> large folios infeasible in FUSE (estimates show around a 75% drop in
> throughput).

I think tweaking min_pause is a wrong way to do this. I think that is just a
symptom. Can you run something like:

while true; do
	cat /sys/kernel/debug/bdi/<fuse-bdi>/stats
	echo "---------"
	sleep 1
done >bdi-debug.txt

while you are writing to the FUSE filesystem and share the output file?
That should tell us a bit more about what's happening inside the writeback
throttling. Also do you somehow configure min/max_ratio for the FUSE bdi?
You can check in /sys/block/<fuse-bdi>/bdi/{min,max}_ratio . I suspect the
problem is that the BDI dirty limit does not ramp up properly when we
increase dirtied pages in large chunks.

Actually, there's a patch queued in mm tree that improves the ramping up of
bdi dirty limit for strictlimit bdis [1]. It would be nice if you could
test whether it changes something in the behavior you observe. Thanks!

								Honza

[1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-page-writeback-consolidate-wb_thresh-bumping-logic-into-__wb_calc_thresh.patch

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Improving large folio writeback performance
  2025-01-17 11:40 ` Vlastimil Babka
@ 2025-01-17 11:56   ` Jan Kara
  2025-01-17 14:17     ` Matthew Wilcox
  0 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2025-01-17 11:56 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Joanne Koong, lsf-pc, linux-fsdevel, linux-mm, Matthew Wilcox (Oracle)

On Fri 17-01-25 12:40:15, Vlastimil Babka wrote:
> On 1/15/25 01:50, Joanne Koong wrote:
> > Hi all,
> > 
> > I would like to propose a discussion topic about improving large folio
> > writeback performance. As more filesystems adopt large folios, it
> > becomes increasingly important that writeback is made to be as
> > performant as possible. There are two areas I'd like to discuss:
> > 
> > 
> > == Granularity of dirty pages writeback ==
> > Currently, the granularity of writeback is at the folio level. If one
> > byte in a folio is dirty, the entire folio will be written back. This
> > becomes unscalable for larger folios and significantly degrades
> > performance, especially for workloads that employ random writes.
> > 
> > One idea is to track dirty pages at a smaller granularity using a
> > 64-bit bitmap stored inside the folio struct where each bit tracks a
> > smaller chunk of pages (eg for 2 MB folios, each bit would track 32k
> > pages), and only write back dirty chunks rather than the entire folio.
> 
> I think this might be tricky in some cases? I.e. with a 2 MB pmd-mapped
> folio, it's only possible to write-protect the whole pmd, not individual 32k
> chunks, in order to catch the first write to a chunk and mark it dirty.

Definitely. Once you map a folio through PMD entry, you have no other
option than consider whole 2MB dirty. But with PTE mappings or
modifications through syscalls you can do more fine-grained dirtiness
tracking and there're enough cases like that that it pays off.
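The distinction can be sketched as follows (illustrative Python, not kernel code; the chunk count is an assumption matching the 64-bit bitmap idea above):

```python
# Sketch of the distinction above: a write caught through a PMD mapping
# can only be detected at whole-folio granularity, so every chunk must
# be considered dirty, while a write() syscall or a PTE-mapped write
# can dirty just the affected chunk. Illustration only.

CHUNKS_PER_FOLIO = 64  # e.g. 32k chunks in a 2 MB folio

def dirty_bits(pmd_mapped, chunk=None):
    """Return the dirty bitmap resulting from one write."""
    if pmd_mapped:
        return (1 << CHUNKS_PER_FOLIO) - 1  # whole folio considered dirty
    return 1 << chunk                        # only the written chunk

assert bin(dirty_bits(pmd_mapped=True)).count("1") == CHUNKS_PER_FOLIO
assert bin(dirty_bits(pmd_mapped=False, chunk=7)).count("1") == 1
```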

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Improving large folio writeback performance
  2025-01-17 11:56   ` [Lsf-pc] " Jan Kara
@ 2025-01-17 14:17     ` Matthew Wilcox
  2025-01-22 11:15       ` David Hildenbrand
  0 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2025-01-17 14:17 UTC (permalink / raw)
  To: Jan Kara; +Cc: Vlastimil Babka, Joanne Koong, lsf-pc, linux-fsdevel, linux-mm

On Fri, Jan 17, 2025 at 12:56:52PM +0100, Jan Kara wrote:
> On Fri 17-01-25 12:40:15, Vlastimil Babka wrote:
> > I think this might be tricky in some cases? I.e. with 2 MB and pmd-mapped
> > folio, it's possible to write-protect only the whole pmd, not individual 32k
> > chunks in order to catch the first write to a chunk to mark it dirty.
> 
> Definitely. Once you map a folio through a PMD entry, you have no option
> but to consider the whole 2MB dirty. But with PTE mappings or
> modifications through syscalls you can do more fine-grained dirtiness
> tracking, and there are enough such cases that it pays off.

Almost no applications use shared mmap writes to write to files.  The
error handling story is crap and there's only limited control over when
writeback actually happens.  Almost every application uses write(), even
if they have the file mmapped.  This isn't a scenario worth worrying about.



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Improving large folio writeback performance
  2025-01-17 11:53     ` Jan Kara
@ 2025-01-17 22:45       ` Joanne Koong
  2025-01-20 22:42         ` Jan Kara
  0 siblings, 1 reply; 16+ messages in thread
From: Joanne Koong @ 2025-01-17 22:45 UTC (permalink / raw)
  To: Jan Kara; +Cc: lsf-pc, linux-fsdevel, linux-mm, Matthew Wilcox (Oracle)

On Fri, Jan 17, 2025 at 3:53 AM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 16-01-25 15:38:54, Joanne Koong wrote:
> > On Thu, Jan 16, 2025 at 3:01 AM Jan Kara <jack@suse.cz> wrote:
> > > On Tue 14-01-25 16:50:53, Joanne Koong wrote:
> > > > I would like to propose a discussion topic about improving large folio
> > > > writeback performance. As more filesystems adopt large folios, it
> > > > becomes increasingly important that writeback is made to be as
> > > > performant as possible. There are two areas I'd like to discuss:
> > > >
> > > > == Granularity of dirty pages writeback ==
> > > > Currently, the granularity of writeback is at the folio level. If one
> > > > byte in a folio is dirty, the entire folio will be written back. This
> > > > becomes unscalable for larger folios and significantly degrades
> > > > performance, especially for workloads that employ random writes.
> > > >
> > > > One idea is to track dirty pages at a smaller granularity using a
> > > > 64-bit bitmap stored inside the folio struct where each bit tracks a
> > > > smaller chunk of pages (e.g. for 2 MB folios, each bit would track a
> > > > 32 KB chunk of pages), and only write back dirty chunks rather than the
> > > > entire folio.
> > >
> > > Yes, this is a known problem and, as Dave pointed out, it is currently up
> > > to the lower layer to handle finer-grained dirtiness tracking. You can
> > > take inspiration from the iomap layer, which already does this, or you
> > > can convert your filesystem to use iomap (the preferred way).
> > >
> > > > == Balancing dirty pages ==
> > > > It was observed that the dirty page balancing logic used in
> > > > balance_dirty_pages() fails to scale for large folios [1]. For
> > > > example, fuse saw around a 125% drop in throughput for writes when
> > > > using large folios vs small folios on 1MB block sizes, which was
> > > > attributed to scheduled io waits in the dirty page balancing logic. In
> > > > generic_perform_write(), dirty pages are balanced after every write to
> > > > the page cache by the filesystem. With large folios, each write
> > > > dirties a larger number of pages which can grossly exceed the
> > > > ratelimit, whereas with small folios each write is one page and so
> > > > pages are balanced more incrementally and adhere more closely to the
> > > > ratelimit. In order to accommodate large folios, the logic for
> > > > balancing dirty pages likely needs to be reworked.
> > >
> > > I think there are several separate issues here. One is that
> > > folio_account_dirtied() will consider the whole folio as needing writeback
> > > which is not necessarily the case (as e.g. iomap will writeback only dirty
> > > blocks in it). This was OKish when pages were 4k and you were using 1k
> > > blocks (an uncommon configuration anyway; usually you had a 4k block
> > > size), but it starts to hurt a lot with 2M folios, so we may need to find
> > > a way to propagate the information about the really dirty bits into the
> > > writeback accounting.
> >
> > Agreed. The only workable solution I see is to have some sort of api
> > similar to filemap_dirty_folio() that takes in the number of pages
> > dirtied as an arg, but maybe there's a better solution.
>
> Yes, something like that I suppose.
>
> > > Another problem *may* be that fast increments to dirtied pages (as we dirty
> > > 512 pages at once instead of 16 we did in the past) cause over-reaction in
> > > the dirtiness balancing logic and we throttle the task too much. The
> > > heuristics there try to find the right amount of time to block a task so
> > > that dirtying speed matches the writeback speed and it's plausible that
> > > the large increments make this logic oscillate between two extremes leading
> > > to suboptimal throughput. Also, since this was observed with FUSE, I believe
> > > a significant factor is that FUSE enables "strictlimit" feature of the BDI
> > > which makes dirty throttling more aggressive (generally the amount of
> > > allowed dirty pages is lower). Anyway, these are mostly speculations from
> > > my end. This needs more data to decide what exactly (if anything) needs
> > > tweaking in the dirty throttling logic.
> >
> > I tested this experimentally and you're right, on FUSE this is
> > impacted a lot by the "strictlimit". I didn't see any bottlenecks when
> > strictlimit wasn't enabled on FUSE. AFAICT, the strictlimit affects
> > the dirty throttle control freerun flag (which gets used to determine
> > whether throttling can be skipped) in the balance_dirty_pages() logic.
> > For FUSE, we can't turn off strictlimit for unprivileged servers, but
> > maybe we can make the throttling check more permissive by upping the
> > value of the min_pause calculation in wb_min_pause() for writes that
> > support large folios? As of right now, the current logic makes writing
> > large folios unfeasible in FUSE (estimates show around a 75% drop in
> > throughput).
>
> I think tweaking min_pause is the wrong way to do this; it just treats a
> symptom. Can you run something like:
>
> while true; do
>         cat /sys/kernel/debug/bdi/<fuse-bdi>/stats
>         echo "---------"
>         sleep 1
> done >bdi-debug.txt
>
> while you are writing to the FUSE filesystem and share the output file?
> That should tell us a bit more about what's happening inside the writeback
> throttling. Also do you somehow configure min/max_ratio for the FUSE bdi?
> You can check in /sys/block/<fuse-bdi>/bdi/{min,max}_ratio . I suspect the
> problem is that the BDI dirty limit does not ramp up properly when we
> increase dirtied pages in large chunks.

This is the debug info I see for FUSE large folio writes where bs=1M
and size=1G:


BdiWriteback:                0 kB
BdiReclaimable:              0 kB
BdiDirtyThresh:            896 kB
DirtyThresh:            359824 kB
BackgroundThresh:       179692 kB
BdiDirtied:            1071104 kB
BdiWritten:               4096 kB
BdiWriteBandwidth:           0 kBps
b_dirty:                     0
b_io:                        0
b_more_io:                   0
b_dirty_time:                0
bdi_list:                    1
state:                       1
---------
BdiWriteback:                0 kB
BdiReclaimable:              0 kB
BdiDirtyThresh:           3596 kB
DirtyThresh:            359824 kB
BackgroundThresh:       179692 kB
BdiDirtied:            1290240 kB
BdiWritten:               4992 kB
BdiWriteBandwidth:           0 kBps
b_dirty:                     0
b_io:                        0
b_more_io:                   0
b_dirty_time:                0
bdi_list:                    1
state:                       1
---------
BdiWriteback:                0 kB
BdiReclaimable:              0 kB
BdiDirtyThresh:           3596 kB
DirtyThresh:            359824 kB
BackgroundThresh:       179692 kB
BdiDirtied:            1517568 kB
BdiWritten:               5824 kB
BdiWriteBandwidth:       25692 kBps
b_dirty:                     0
b_io:                        1
b_more_io:                   0
b_dirty_time:                0
bdi_list:                    1
state:                       7
---------
BdiWriteback:                0 kB
BdiReclaimable:              0 kB
BdiDirtyThresh:           3596 kB
DirtyThresh:            359824 kB
BackgroundThresh:       179692 kB
BdiDirtied:            1747968 kB
BdiWritten:               6720 kB
BdiWriteBandwidth:           0 kBps
b_dirty:                     0
b_io:                        0
b_more_io:                   0
b_dirty_time:                0
bdi_list:                    1
state:                       1
---------
BdiWriteback:                0 kB
BdiReclaimable:              0 kB
BdiDirtyThresh:            896 kB
DirtyThresh:            359824 kB
BackgroundThresh:       179692 kB
BdiDirtied:            1949696 kB
BdiWritten:               7552 kB
BdiWriteBandwidth:           0 kBps
b_dirty:                     0
b_io:                        0
b_more_io:                   0
b_dirty_time:                0
bdi_list:                    1
state:                       1
---------
BdiWriteback:                0 kB
BdiReclaimable:              0 kB
BdiDirtyThresh:           3612 kB
DirtyThresh:            361300 kB
BackgroundThresh:       180428 kB
BdiDirtied:            2097152 kB
BdiWritten:               8128 kB
BdiWriteBandwidth:           0 kBps
b_dirty:                     0
b_io:                        0
b_more_io:                   0
b_dirty_time:                0
bdi_list:                    1
state:                       1
---------


I didn't do anything to configure/change the FUSE bdi min/max_ratio.
This is what I see on my system:

cat /sys/class/bdi/0:52/min_ratio
0
cat /sys/class/bdi/0:52/max_ratio
1


>
> Actually, there's a patch queued in mm tree that improves the ramping up of
> bdi dirty limit for strictlimit bdis [1]. It would be nice if you could
> test whether it changes something in the behavior you observe. Thanks!
>
>                                                                 Honza
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-page-writeback-consolidate-wb_thresh-bumping-logic-into-__wb_calc_thresh.patch

I still see the same results (~230 MiB/s throughput using fio) with
this patch applied, unfortunately. Here's the debug info I see with
this patch (same test scenario as above on FUSE large folio writes
where bs=1M and size=1G):

BdiWriteback:                0 kB
BdiReclaimable:           2048 kB
BdiDirtyThresh:           3588 kB
DirtyThresh:            359132 kB
BackgroundThresh:       179348 kB
BdiDirtied:              51200 kB
BdiWritten:                128 kB
BdiWriteBandwidth:      102400 kBps
b_dirty:                     1
b_io:                        0
b_more_io:                   0
b_dirty_time:                0
bdi_list:                    1
state:                       5
---------
BdiWriteback:                0 kB
BdiReclaimable:              0 kB
BdiDirtyThresh:           3588 kB
DirtyThresh:            359144 kB
BackgroundThresh:       179352 kB
BdiDirtied:             331776 kB
BdiWritten:               1216 kB
BdiWriteBandwidth:           0 kBps
b_dirty:                     0
b_io:                        0
b_more_io:                   0
b_dirty_time:                0
bdi_list:                    1
state:                       1
---------
BdiWriteback:                0 kB
BdiReclaimable:              0 kB
BdiDirtyThresh:           3588 kB
DirtyThresh:            359144 kB
BackgroundThresh:       179352 kB
BdiDirtied:             562176 kB
BdiWritten:               2176 kB
BdiWriteBandwidth:           0 kBps
b_dirty:                     0
b_io:                        0
b_more_io:                   0
b_dirty_time:                0
bdi_list:                    1
state:                       1
---------
BdiWriteback:                0 kB
BdiReclaimable:              0 kB
BdiDirtyThresh:           3588 kB
DirtyThresh:            359144 kB
BackgroundThresh:       179352 kB
BdiDirtied:             792576 kB
BdiWritten:               3072 kB
BdiWriteBandwidth:           0 kBps
b_dirty:                     0
b_io:                        0
b_more_io:                   0
b_dirty_time:                0
bdi_list:                    1
state:                       1
---------
BdiWriteback:               64 kB
BdiReclaimable:              0 kB
BdiDirtyThresh:           3588 kB
DirtyThresh:            359144 kB
BackgroundThresh:       179352 kB
BdiDirtied:            1026048 kB
BdiWritten:               3904 kB
BdiWriteBandwidth:           0 kBps
b_dirty:                     0
b_io:                        0
b_more_io:                   0
b_dirty_time:                0
bdi_list:                    1
state:                       1
---------


Thanks,
Joanne
>
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Improving large folio writeback performance
  2025-01-17 22:45       ` Joanne Koong
@ 2025-01-20 22:42         ` Jan Kara
  2025-01-22  0:29           ` Joanne Koong
  0 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2025-01-20 22:42 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Jan Kara, lsf-pc, linux-fsdevel, linux-mm, Matthew Wilcox (Oracle)

On Fri 17-01-25 14:45:01, Joanne Koong wrote:
> On Fri, Jan 17, 2025 at 3:53 AM Jan Kara <jack@suse.cz> wrote:
> > On Thu 16-01-25 15:38:54, Joanne Koong wrote:
> > I think tweaking min_pause is a wrong way to do this. I think that is just a
> > symptom. Can you run something like:
> >
> > while true; do
> >         cat /sys/kernel/debug/bdi/<fuse-bdi>/stats
> >         echo "---------"
> >         sleep 1
> > done >bdi-debug.txt
> >
> > while you are writing to the FUSE filesystem and share the output file?
> > That should tell us a bit more about what's happening inside the writeback
> > throttling. Also do you somehow configure min/max_ratio for the FUSE bdi?
> > You can check in /sys/block/<fuse-bdi>/bdi/{min,max}_ratio . I suspect the
> > problem is that the BDI dirty limit does not ramp up properly when we
> > increase dirtied pages in large chunks.
> 
> This is the debug info I see for FUSE large folio writes where bs=1M
> and size=1G:
> 
> 
> BdiWriteback:                0 kB
> BdiReclaimable:              0 kB
> BdiDirtyThresh:            896 kB
> DirtyThresh:            359824 kB
> BackgroundThresh:       179692 kB
> BdiDirtied:            1071104 kB
> BdiWritten:               4096 kB
> BdiWriteBandwidth:           0 kBps
> b_dirty:                     0
> b_io:                        0
> b_more_io:                   0
> b_dirty_time:                0
> bdi_list:                    1
> state:                       1
> ---------
> BdiWriteback:                0 kB
> BdiReclaimable:              0 kB
> BdiDirtyThresh:           3596 kB
> DirtyThresh:            359824 kB
> BackgroundThresh:       179692 kB
> BdiDirtied:            1290240 kB
> BdiWritten:               4992 kB
> BdiWriteBandwidth:           0 kBps
> b_dirty:                     0
> b_io:                        0
> b_more_io:                   0
> b_dirty_time:                0
> bdi_list:                    1
> state:                       1
> ---------
> BdiWriteback:                0 kB
> BdiReclaimable:              0 kB
> BdiDirtyThresh:           3596 kB
> DirtyThresh:            359824 kB
> BackgroundThresh:       179692 kB
> BdiDirtied:            1517568 kB
> BdiWritten:               5824 kB
> BdiWriteBandwidth:       25692 kBps
> b_dirty:                     0
> b_io:                        1
> b_more_io:                   0
> b_dirty_time:                0
> bdi_list:                    1
> state:                       7
> ---------
> BdiWriteback:                0 kB
> BdiReclaimable:              0 kB
> BdiDirtyThresh:           3596 kB
> DirtyThresh:            359824 kB
> BackgroundThresh:       179692 kB
> BdiDirtied:            1747968 kB
> BdiWritten:               6720 kB
> BdiWriteBandwidth:           0 kBps
> b_dirty:                     0
> b_io:                        0
> b_more_io:                   0
> b_dirty_time:                0
> bdi_list:                    1
> state:                       1
> ---------
> BdiWriteback:                0 kB
> BdiReclaimable:              0 kB
> BdiDirtyThresh:            896 kB
> DirtyThresh:            359824 kB
> BackgroundThresh:       179692 kB
> BdiDirtied:            1949696 kB
> BdiWritten:               7552 kB
> BdiWriteBandwidth:           0 kBps
> b_dirty:                     0
> b_io:                        0
> b_more_io:                   0
> b_dirty_time:                0
> bdi_list:                    1
> state:                       1
> ---------
> BdiWriteback:                0 kB
> BdiReclaimable:              0 kB
> BdiDirtyThresh:           3612 kB
> DirtyThresh:            361300 kB
> BackgroundThresh:       180428 kB
> BdiDirtied:            2097152 kB
> BdiWritten:               8128 kB
> BdiWriteBandwidth:           0 kBps
> b_dirty:                     0
> b_io:                        0
> b_more_io:                   0
> b_dirty_time:                0
> bdi_list:                    1
> state:                       1
> ---------
> 
> 
> I didn't do anything to configure/change the FUSE bdi min/max_ratio.
> This is what I see on my system:
> 
> cat /sys/class/bdi/0:52/min_ratio
> 0
> cat /sys/class/bdi/0:52/max_ratio
> 1

OK, we can see that BdiDirtyThresh stabilized more or less at 3.6MB.
Checking the code, this shows we are hitting __wb_calc_thresh() logic:

        if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
                unsigned long limit = hard_dirty_limit(dom, dtc->thresh);
                u64 wb_scale_thresh = 0;

                if (limit > dtc->dirty)
                        wb_scale_thresh = (limit - dtc->dirty) / 100;
                wb_thresh = max(wb_thresh, min(wb_scale_thresh, wb_max_thresh /
        }

so BdiDirtyThresh is set to DirtyThresh/100. This also shows the bdi never
generates enough throughput to ramp up its share from this initial value.
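A back-of-the-envelope check of that /100 factor against the numbers above:

```python
# DirtyThresh from the bdi stats above, in kB.
dirty_thresh_kb = 359824

# Under strictlimit, before the bdi has earned any bandwidth share,
# wb_scale_thresh is roughly (limit - dirty) / 100, i.e. ~DirtyThresh/100.
estimated_bdi_thresh_kb = dirty_thresh_kb / 100

print(round(estimated_bdi_thresh_kb))  # 3598, matching the observed ~3596 kB
```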

> > Actually, there's a patch queued in mm tree that improves the ramping up of
> > bdi dirty limit for strictlimit bdis [1]. It would be nice if you could
> > test whether it changes something in the behavior you observe. Thanks!
> >
> >                                                                 Honza
> >
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-page-writeback-consolidate-wb_thresh-bumping-logic-into-__wb_calc_thresh.patch
> 
> I still see the same results (~230 MiB/s throughput using fio) with
> this patch applied, unfortunately. Here's the debug info I see with
> this patch (same test scenario as above on FUSE large folio writes
> where bs=1M and size=1G):
> 
> BdiWriteback:                0 kB
> BdiReclaimable:           2048 kB
> BdiDirtyThresh:           3588 kB
> DirtyThresh:            359132 kB
> BackgroundThresh:       179348 kB
> BdiDirtied:              51200 kB
> BdiWritten:                128 kB
> BdiWriteBandwidth:      102400 kBps
> b_dirty:                     1
> b_io:                        0
> b_more_io:                   0
> b_dirty_time:                0
> bdi_list:                    1
> state:                       5
> ---------
> BdiWriteback:                0 kB
> BdiReclaimable:              0 kB
> BdiDirtyThresh:           3588 kB
> DirtyThresh:            359144 kB
> BackgroundThresh:       179352 kB
> BdiDirtied:             331776 kB
> BdiWritten:               1216 kB
> BdiWriteBandwidth:           0 kBps
> b_dirty:                     0
> b_io:                        0
> b_more_io:                   0
> b_dirty_time:                0
> bdi_list:                    1
> state:                       1
> ---------
> BdiWriteback:                0 kB
> BdiReclaimable:              0 kB
> BdiDirtyThresh:           3588 kB
> DirtyThresh:            359144 kB
> BackgroundThresh:       179352 kB
> BdiDirtied:             562176 kB
> BdiWritten:               2176 kB
> BdiWriteBandwidth:           0 kBps
> b_dirty:                     0
> b_io:                        0
> b_more_io:                   0
> b_dirty_time:                0
> bdi_list:                    1
> state:                       1
> ---------
> BdiWriteback:                0 kB
> BdiReclaimable:              0 kB
> BdiDirtyThresh:           3588 kB
> DirtyThresh:            359144 kB
> BackgroundThresh:       179352 kB
> BdiDirtied:             792576 kB
> BdiWritten:               3072 kB
> BdiWriteBandwidth:           0 kBps
> b_dirty:                     0
> b_io:                        0
> b_more_io:                   0
> b_dirty_time:                0
> bdi_list:                    1
> state:                       1
> ---------
> BdiWriteback:               64 kB
> BdiReclaimable:              0 kB
> BdiDirtyThresh:           3588 kB
> DirtyThresh:            359144 kB
> BackgroundThresh:       179352 kB
> BdiDirtied:            1026048 kB
> BdiWritten:               3904 kB
> BdiWriteBandwidth:           0 kBps
> b_dirty:                     0
> b_io:                        0
> b_more_io:                   0
> b_dirty_time:                0
> bdi_list:                    1
> state:                       1
> ---------

Yeah, here the situation is really the same. As an experiment, can you
try setting min_ratio for the FUSE bdi to 1, 2, 3, ..., 10 (I don't
expect you'll need to go past 10) and figure out when there's enough
slack space for the writeback bandwidth to ramp up to full speed?
Thanks!
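Such a sweep could be scripted roughly as follows; the bdi id 0:52 is taken from the output earlier in the thread, and the fio invocation is only illustrative of the bs=1M/size=1G job being discussed, so both should be adjusted to the actual setup:

```shell
#!/bin/sh
# Sweep min_ratio for the FUSE bdi and record throughput at each value.
# 0:52 is the bdi id from the earlier output; adjust for your system.
BDI=/sys/class/bdi/0:52

for ratio in 1 2 3 4 5 6 7 8 9 10; do
    echo "$ratio" > "$BDI/min_ratio"
    echo "=== min_ratio=$ratio ==="
    # Illustrative fio job; use the same workload as the earlier runs.
    fio --name=wb --directory=/mnt/fuse --rw=write --bs=1M --size=1G \
        --end_fsync=1 | grep 'bw='
done
```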

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Improving large folio writeback performance
  2025-01-20 22:42         ` Jan Kara
@ 2025-01-22  0:29           ` Joanne Koong
  2025-01-22  9:22             ` Jan Kara
  0 siblings, 1 reply; 16+ messages in thread
From: Joanne Koong @ 2025-01-22  0:29 UTC (permalink / raw)
  To: Jan Kara; +Cc: lsf-pc, linux-fsdevel, linux-mm, Matthew Wilcox (Oracle)

On Mon, Jan 20, 2025 at 2:42 PM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 17-01-25 14:45:01, Joanne Koong wrote:
> > On Fri, Jan 17, 2025 at 3:53 AM Jan Kara <jack@suse.cz> wrote:
> > > On Thu 16-01-25 15:38:54, Joanne Koong wrote:
> > > I think tweaking min_pause is a wrong way to do this. I think that is just a
> > > symptom. Can you run something like:
> > >
> > > while true; do
> > >         cat /sys/kernel/debug/bdi/<fuse-bdi>/stats
> > >         echo "---------"
> > >         sleep 1
> > > done >bdi-debug.txt
> > >
> > > while you are writing to the FUSE filesystem and share the output file?
> > > That should tell us a bit more about what's happening inside the writeback
> > > throttling. Also do you somehow configure min/max_ratio for the FUSE bdi?
> > > You can check in /sys/block/<fuse-bdi>/bdi/{min,max}_ratio . I suspect the
> > > problem is that the BDI dirty limit does not ramp up properly when we
> > > increase dirtied pages in large chunks.
> >
> > This is the debug info I see for FUSE large folio writes where bs=1M
> > and size=1G:
> >
> >
> > BdiWriteback:                0 kB
> > BdiReclaimable:              0 kB
> > BdiDirtyThresh:            896 kB
> > DirtyThresh:            359824 kB
> > BackgroundThresh:       179692 kB
> > BdiDirtied:            1071104 kB
> > BdiWritten:               4096 kB
> > BdiWriteBandwidth:           0 kBps
> > b_dirty:                     0
> > b_io:                        0
> > b_more_io:                   0
> > b_dirty_time:                0
> > bdi_list:                    1
> > state:                       1
> > ---------
> > BdiWriteback:                0 kB
> > BdiReclaimable:              0 kB
> > BdiDirtyThresh:           3596 kB
> > DirtyThresh:            359824 kB
> > BackgroundThresh:       179692 kB
> > BdiDirtied:            1290240 kB
> > BdiWritten:               4992 kB
> > BdiWriteBandwidth:           0 kBps
> > b_dirty:                     0
> > b_io:                        0
> > b_more_io:                   0
> > b_dirty_time:                0
> > bdi_list:                    1
> > state:                       1
> > ---------
> > BdiWriteback:                0 kB
> > BdiReclaimable:              0 kB
> > BdiDirtyThresh:           3596 kB
> > DirtyThresh:            359824 kB
> > BackgroundThresh:       179692 kB
> > BdiDirtied:            1517568 kB
> > BdiWritten:               5824 kB
> > BdiWriteBandwidth:       25692 kBps
> > b_dirty:                     0
> > b_io:                        1
> > b_more_io:                   0
> > b_dirty_time:                0
> > bdi_list:                    1
> > state:                       7
> > ---------
> > BdiWriteback:                0 kB
> > BdiReclaimable:              0 kB
> > BdiDirtyThresh:           3596 kB
> > DirtyThresh:            359824 kB
> > BackgroundThresh:       179692 kB
> > BdiDirtied:            1747968 kB
> > BdiWritten:               6720 kB
> > BdiWriteBandwidth:           0 kBps
> > b_dirty:                     0
> > b_io:                        0
> > b_more_io:                   0
> > b_dirty_time:                0
> > bdi_list:                    1
> > state:                       1
> > ---------
> > BdiWriteback:                0 kB
> > BdiReclaimable:              0 kB
> > BdiDirtyThresh:            896 kB
> > DirtyThresh:            359824 kB
> > BackgroundThresh:       179692 kB
> > BdiDirtied:            1949696 kB
> > BdiWritten:               7552 kB
> > BdiWriteBandwidth:           0 kBps
> > b_dirty:                     0
> > b_io:                        0
> > b_more_io:                   0
> > b_dirty_time:                0
> > bdi_list:                    1
> > state:                       1
> > ---------
> > BdiWriteback:                0 kB
> > BdiReclaimable:              0 kB
> > BdiDirtyThresh:           3612 kB
> > DirtyThresh:            361300 kB
> > BackgroundThresh:       180428 kB
> > BdiDirtied:            2097152 kB
> > BdiWritten:               8128 kB
> > BdiWriteBandwidth:           0 kBps
> > b_dirty:                     0
> > b_io:                        0
> > b_more_io:                   0
> > b_dirty_time:                0
> > bdi_list:                    1
> > state:                       1
> > ---------
> >
> >
> > I didn't do anything to configure/change the FUSE bdi min/max_ratio.
> > This is what I see on my system:
> >
> > cat /sys/class/bdi/0:52/min_ratio
> > 0
> > cat /sys/class/bdi/0:52/max_ratio
> > 1
>
> OK, we can see that BdiDirtyThresh stabilized more or less at 3.6MB.
> Checking the code, this shows we are hitting __wb_calc_thresh() logic:
>
>         if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
>                 unsigned long limit = hard_dirty_limit(dom, dtc->thresh);
>                 u64 wb_scale_thresh = 0;
>
>                 if (limit > dtc->dirty)
>                         wb_scale_thresh = (limit - dtc->dirty) / 100;
>                 wb_thresh = max(wb_thresh, min(wb_scale_thresh, wb_max_thresh /
>         }
>
> so BdiDirtyThresh is set to DirtyThresh/100. This also shows the bdi never
> generates enough throughput to ramp up its share from this initial value.
>
> > > Actually, there's a patch queued in mm tree that improves the ramping up of
> > > bdi dirty limit for strictlimit bdis [1]. It would be nice if you could
> > > test whether it changes something in the behavior you observe. Thanks!
> > >
> > >                                                                 Honza
> > >
> > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-page-writeback-consolidate-wb_thresh-bumping-logic-into-__wb_calc_thresh.patch
> >
> > I still see the same results (~230 MiB/s throughput using fio) with
> > this patch applied, unfortunately. Here's the debug info I see with
> > this patch (same test scenario as above on FUSE large folio writes
> > where bs=1M and size=1G):
> >
> > BdiWriteback:                0 kB
> > BdiReclaimable:           2048 kB
> > BdiDirtyThresh:           3588 kB
> > DirtyThresh:            359132 kB
> > BackgroundThresh:       179348 kB
> > BdiDirtied:              51200 kB
> > BdiWritten:                128 kB
> > BdiWriteBandwidth:      102400 kBps
> > b_dirty:                     1
> > b_io:                        0
> > b_more_io:                   0
> > b_dirty_time:                0
> > bdi_list:                    1
> > state:                       5
> > ---------
> > BdiWriteback:                0 kB
> > BdiReclaimable:              0 kB
> > BdiDirtyThresh:           3588 kB
> > DirtyThresh:            359144 kB
> > BackgroundThresh:       179352 kB
> > BdiDirtied:             331776 kB
> > BdiWritten:               1216 kB
> > BdiWriteBandwidth:           0 kBps
> > b_dirty:                     0
> > b_io:                        0
> > b_more_io:                   0
> > b_dirty_time:                0
> > bdi_list:                    1
> > state:                       1
> > ---------
> > BdiWriteback:                0 kB
> > BdiReclaimable:              0 kB
> > BdiDirtyThresh:           3588 kB
> > DirtyThresh:            359144 kB
> > BackgroundThresh:       179352 kB
> > BdiDirtied:             562176 kB
> > BdiWritten:               2176 kB
> > BdiWriteBandwidth:           0 kBps
> > b_dirty:                     0
> > b_io:                        0
> > b_more_io:                   0
> > b_dirty_time:                0
> > bdi_list:                    1
> > state:                       1
> > ---------
> > BdiWriteback:                0 kB
> > BdiReclaimable:              0 kB
> > BdiDirtyThresh:           3588 kB
> > DirtyThresh:            359144 kB
> > BackgroundThresh:       179352 kB
> > BdiDirtied:             792576 kB
> > BdiWritten:               3072 kB
> > BdiWriteBandwidth:           0 kBps
> > b_dirty:                     0
> > b_io:                        0
> > b_more_io:                   0
> > b_dirty_time:                0
> > bdi_list:                    1
> > state:                       1
> > ---------
> > BdiWriteback:               64 kB
> > BdiReclaimable:              0 kB
> > BdiDirtyThresh:           3588 kB
> > DirtyThresh:            359144 kB
> > BackgroundThresh:       179352 kB
> > BdiDirtied:            1026048 kB
> > BdiWritten:               3904 kB
> > BdiWriteBandwidth:           0 kBps
> > b_dirty:                     0
> > b_io:                        0
> > b_more_io:                   0
> > b_dirty_time:                0
> > bdi_list:                    1
> > state:                       1
> > ---------
>
> Yeah, here the situation is really the same. As an experiment, can you
> try setting min_ratio for the FUSE bdi to 1, 2, 3, ..., 10 (I don't
> expect you'll need to go past 10) and figure out when there's enough
> slack space for the writeback bandwidth to ramp up to full speed?
> Thanks!
>
>                                                                 Honza

When locally testing this, I'm seeing that the max_ratio affects the
bandwidth more so than min_ratio (eg the different min_ratios have
roughly the same bandwidth per max_ratio). I'm also seeing somewhat
high variance across runs which makes it hard to gauge what's
accurate, but on average this is what I'm seeing:

max_ratio=1 --- bandwidth= ~230 MiB/s
max_ratio=2 --- bandwidth= ~420 MiB/s
max_ratio=3 --- bandwidth= ~550 MiB/s
max_ratio=4 --- bandwidth= ~653 MiB/s
max_ratio=5 --- bandwidth= ~700 MiB/s
max_ratio=6 --- bandwidth= ~810 MiB/s
max_ratio=7 --- bandwidth= ~1040 MiB/s (and then a lot of times, 561
MiB/s on subsequent runs)


Thanks,
Joanne

> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Improving large folio writeback performance
  2025-01-22  0:29           ` Joanne Koong
@ 2025-01-22  9:22             ` Jan Kara
  2025-01-22 22:17               ` Joanne Koong
  0 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2025-01-22  9:22 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Jan Kara, lsf-pc, linux-fsdevel, linux-mm, Matthew Wilcox (Oracle)

On Tue 21-01-25 16:29:57, Joanne Koong wrote:
> On Mon, Jan 20, 2025 at 2:42 PM Jan Kara <jack@suse.cz> wrote:
> > On Fri 17-01-25 14:45:01, Joanne Koong wrote:
> > > On Fri, Jan 17, 2025 at 3:53 AM Jan Kara <jack@suse.cz> wrote:
> > > > On Thu 16-01-25 15:38:54, Joanne Koong wrote:
> > > > I think tweaking min_pause is a wrong way to do this. I think that is just a
> > > > symptom. Can you run something like:
> > > >
> > > > while true; do
> > > >         cat /sys/kernel/debug/bdi/<fuse-bdi>/stats
> > > >         echo "---------"
> > > >         sleep 1
> > > > done >bdi-debug.txt
> > > >
> > > > while you are writing to the FUSE filesystem and share the output file?
> > > > That should tell us a bit more about what's happening inside the writeback
> > > > throttling. Also do you somehow configure min/max_ratio for the FUSE bdi?
> > > > You can check in /sys/block/<fuse-bdi>/bdi/{min,max}_ratio . I suspect the
> > > > problem is that the BDI dirty limit does not ramp up properly when we
> > > > increase dirtied pages in large chunks.
> > >
> > > This is the debug info I see for FUSE large folio writes where bs=1M
> > > and size=1G:
> > >
> > >
> > > BdiWriteback:                0 kB
> > > BdiReclaimable:              0 kB
> > > BdiDirtyThresh:            896 kB
> > > DirtyThresh:            359824 kB
> > > BackgroundThresh:       179692 kB
> > > BdiDirtied:            1071104 kB
> > > BdiWritten:               4096 kB
> > > BdiWriteBandwidth:           0 kBps
> > > b_dirty:                     0
> > > b_io:                        0
> > > b_more_io:                   0
> > > b_dirty_time:                0
> > > bdi_list:                    1
> > > state:                       1
> > > ---------
> > > BdiWriteback:                0 kB
> > > BdiReclaimable:              0 kB
> > > BdiDirtyThresh:           3596 kB
> > > DirtyThresh:            359824 kB
> > > BackgroundThresh:       179692 kB
> > > BdiDirtied:            1290240 kB
> > > BdiWritten:               4992 kB
> > > BdiWriteBandwidth:           0 kBps
> > > b_dirty:                     0
> > > b_io:                        0
> > > b_more_io:                   0
> > > b_dirty_time:                0
> > > bdi_list:                    1
> > > state:                       1
> > > ---------
> > > BdiWriteback:                0 kB
> > > BdiReclaimable:              0 kB
> > > BdiDirtyThresh:           3596 kB
> > > DirtyThresh:            359824 kB
> > > BackgroundThresh:       179692 kB
> > > BdiDirtied:            1517568 kB
> > > BdiWritten:               5824 kB
> > > BdiWriteBandwidth:       25692 kBps
> > > b_dirty:                     0
> > > b_io:                        1
> > > b_more_io:                   0
> > > b_dirty_time:                0
> > > bdi_list:                    1
> > > state:                       7
> > > ---------
> > > BdiWriteback:                0 kB
> > > BdiReclaimable:              0 kB
> > > BdiDirtyThresh:           3596 kB
> > > DirtyThresh:            359824 kB
> > > BackgroundThresh:       179692 kB
> > > BdiDirtied:            1747968 kB
> > > BdiWritten:               6720 kB
> > > BdiWriteBandwidth:           0 kBps
> > > b_dirty:                     0
> > > b_io:                        0
> > > b_more_io:                   0
> > > b_dirty_time:                0
> > > bdi_list:                    1
> > > state:                       1
> > > ---------
> > > BdiWriteback:                0 kB
> > > BdiReclaimable:              0 kB
> > > BdiDirtyThresh:            896 kB
> > > DirtyThresh:            359824 kB
> > > BackgroundThresh:       179692 kB
> > > BdiDirtied:            1949696 kB
> > > BdiWritten:               7552 kB
> > > BdiWriteBandwidth:           0 kBps
> > > b_dirty:                     0
> > > b_io:                        0
> > > b_more_io:                   0
> > > b_dirty_time:                0
> > > bdi_list:                    1
> > > state:                       1
> > > ---------
> > > BdiWriteback:                0 kB
> > > BdiReclaimable:              0 kB
> > > BdiDirtyThresh:           3612 kB
> > > DirtyThresh:            361300 kB
> > > BackgroundThresh:       180428 kB
> > > BdiDirtied:            2097152 kB
> > > BdiWritten:               8128 kB
> > > BdiWriteBandwidth:           0 kBps
> > > b_dirty:                     0
> > > b_io:                        0
> > > b_more_io:                   0
> > > b_dirty_time:                0
> > > bdi_list:                    1
> > > state:                       1
> > > ---------
> > >
> > >
> > > I didn't do anything to configure/change the FUSE bdi min/max_ratio.
> > > This is what I see on my system:
> > >
> > > cat /sys/class/bdi/0:52/min_ratio
> > > 0
> > > cat /sys/class/bdi/0:52/max_ratio
> > > 1
> >
> > OK, we can see that BdiDirtyThresh stabilized more or less at 3.6MB.
> > Checking the code, this shows we are hitting __wb_calc_thresh() logic:
> >
> >         if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
> >                 unsigned long limit = hard_dirty_limit(dom, dtc->thresh);
> >                 u64 wb_scale_thresh = 0;
> >
> >                 if (limit > dtc->dirty)
> >                         wb_scale_thresh = (limit - dtc->dirty) / 100;
> >                 wb_thresh = max(wb_thresh, min(wb_scale_thresh, wb_max_thresh /
> >         }
> >
> > so BdiDirtyThresh is set to DirtyThresh/100. This also shows bdi never
> > generates enough throughput to ramp up its share from this initial value.
> >
> > > > Actually, there's a patch queued in mm tree that improves the ramping up of
> > > > bdi dirty limit for strictlimit bdis [1]. It would be nice if you could
> > > > test whether it changes something in the behavior you observe. Thanks!
> > > >
> > > >                                                                 Honza
> > > >
> > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-page-writeback-consolidate-wb_thresh-bumping-logic-into-__wb_calc_thresh.patch
> > >
> > > I still see the same results (~230 MiB/s throughput using fio) with
> > > this patch applied, unfortunately. Here's the debug info I see with
> > > this patch (same test scenario as above on FUSE large folio writes
> > > where bs=1M and size=1G):
> > >
> > > BdiWriteback:                0 kB
> > > BdiReclaimable:           2048 kB
> > > BdiDirtyThresh:           3588 kB
> > > DirtyThresh:            359132 kB
> > > BackgroundThresh:       179348 kB
> > > BdiDirtied:              51200 kB
> > > BdiWritten:                128 kB
> > > BdiWriteBandwidth:      102400 kBps
> > > b_dirty:                     1
> > > b_io:                        0
> > > b_more_io:                   0
> > > b_dirty_time:                0
> > > bdi_list:                    1
> > > state:                       5
> > > ---------
> > > BdiWriteback:                0 kB
> > > BdiReclaimable:              0 kB
> > > BdiDirtyThresh:           3588 kB
> > > DirtyThresh:            359144 kB
> > > BackgroundThresh:       179352 kB
> > > BdiDirtied:             331776 kB
> > > BdiWritten:               1216 kB
> > > BdiWriteBandwidth:           0 kBps
> > > b_dirty:                     0
> > > b_io:                        0
> > > b_more_io:                   0
> > > b_dirty_time:                0
> > > bdi_list:                    1
> > > state:                       1
> > > ---------
> > > BdiWriteback:                0 kB
> > > BdiReclaimable:              0 kB
> > > BdiDirtyThresh:           3588 kB
> > > DirtyThresh:            359144 kB
> > > BackgroundThresh:       179352 kB
> > > BdiDirtied:             562176 kB
> > > BdiWritten:               2176 kB
> > > BdiWriteBandwidth:           0 kBps
> > > b_dirty:                     0
> > > b_io:                        0
> > > b_more_io:                   0
> > > b_dirty_time:                0
> > > bdi_list:                    1
> > > state:                       1
> > > ---------
> > > BdiWriteback:                0 kB
> > > BdiReclaimable:              0 kB
> > > BdiDirtyThresh:           3588 kB
> > > DirtyThresh:            359144 kB
> > > BackgroundThresh:       179352 kB
> > > BdiDirtied:             792576 kB
> > > BdiWritten:               3072 kB
> > > BdiWriteBandwidth:           0 kBps
> > > b_dirty:                     0
> > > b_io:                        0
> > > b_more_io:                   0
> > > b_dirty_time:                0
> > > bdi_list:                    1
> > > state:                       1
> > > ---------
> > > BdiWriteback:               64 kB
> > > BdiReclaimable:              0 kB
> > > BdiDirtyThresh:           3588 kB
> > > DirtyThresh:            359144 kB
> > > BackgroundThresh:       179352 kB
> > > BdiDirtied:            1026048 kB
> > > BdiWritten:               3904 kB
> > > BdiWriteBandwidth:           0 kBps
> > > b_dirty:                     0
> > > b_io:                        0
> > > b_more_io:                   0
> > > b_dirty_time:                0
> > > bdi_list:                    1
> > > state:                       1
> > > ---------
> >
> > Yeah, here the situation is really the same. As an experiment, can you
> > try setting min_ratio for the FUSE bdi to 1, 2, 3, ..., 10 (I
> > don't expect you should need to go past 10) and figure out when there's
> > enough slack space for the writeback bandwidth to ramp up to a full speed?
> > Thanks!
> >
> >                                                                 Honza
> 
> When locally testing this, I'm seeing that the max_ratio affects the
> bandwidth more so than min_ratio (eg the different min_ratios have
> roughly the same bandwidth per max_ratio). I'm also seeing somewhat
> high variance across runs which makes it hard to gauge what's
> accurate, but on average this is what I'm seeing:
> 
> max_ratio=1 --- bandwidth= ~230 MiB/s
> max_ratio=2 --- bandwidth= ~420 MiB/s
> max_ratio=3 --- bandwidth= ~550 MiB/s
> max_ratio=4 --- bandwidth= ~653 MiB/s
> max_ratio=5 --- bandwidth= ~700 MiB/s
> max_ratio=6 --- bandwidth= ~810 MiB/s
> max_ratio=7 --- bandwidth= ~1040 MiB/s (and then a lot of times, 561
> MiB/s on subsequent runs)

Ah, sorry. I actually misinterpreted your reply from the previous email:

> > > cat /sys/class/bdi/0:52/max_ratio
> > > 1

This means the amount of dirty pages for the fuse filesystem is indeed
hard-capped at 1% of the dirty limit, which happens to be ~3MB on your
machine. Checking where this comes from, I can see that fuse_bdi_init()
does this via:

	bdi_set_max_ratio(sb->s_bdi, 1);

So FUSE restricts itself, and with only a 3MB dirty limit and a 2MB
dirtying granularity it is not surprising that dirty throttling doesn't
work well.
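The arithmetic is easy to check against the trace: with max_ratio=1 and
DirtyThresh of ~359,144 kB, a 1% cap gives ~3,591 kB, matching the
BdiDirtyThresh readings of 3,588-3,612 kB above. As a toy model (the
helper name is made up, and the real kernel applies the ratio with
finer internal scaling, but the effect for max_ratio=1 is the same):

```c
#include <assert.h>

/*
 * Toy model of the bdi max_ratio cap: the per-bdi dirty threshold is
 * limited to max_ratio percent of the global dirty threshold.
 */
static unsigned long bdi_capped_thresh_kb(unsigned long dirty_thresh_kb,
					  unsigned int max_ratio_pct)
{
	return dirty_thresh_kb * max_ratio_pct / 100;
}
```

With the numbers from the trace, bdi_capped_thresh_kb(359144, 1) gives
3591 kB, so a single 2 MB (2048 kB) folio consumes over half of the
bdi's entire dirty budget.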

I'd say there needs to be some better heuristic within FUSE that balances
maximum folio size and maximum dirty limit setting for the filesystem to a
sensible compromise (so that there's space for at least say 10 dirty
max-sized folios within the dirty limit).
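A sketch of what such a heuristic might look like (purely illustrative:
the function name, the factor of 10, and the standalone arithmetic are
assumptions, not existing FUSE code):

```c
#include <assert.h>

#define PAGE_SIZE_BYTES 4096UL
#define MIN_DIRTY_FOLIOS 10UL	/* keep room for ~10 max-sized dirty folios */

/*
 * Pick the largest folio order such that MIN_DIRTY_FOLIOS folios of
 * that size still fit under the bdi's dirty limit. With a ~3MB limit
 * this caps folios at order 6 (256KB) rather than order 9 (2MB).
 */
static unsigned int max_folio_order_for_limit(unsigned long dirty_limit_bytes)
{
	unsigned long budget = dirty_limit_bytes / MIN_DIRTY_FOLIOS;
	unsigned int order = 0;

	while ((PAGE_SIZE_BYTES << (order + 1)) <= budget)
		order++;
	return order;
}
```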

But I guess this is just a shorter-term workaround. Long-term, finer
grained dirtiness tracking within FUSE (and writeback counters tracking in
MM) is going to be a more effective solution.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR



* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Improving large folio writeback performance
  2025-01-17 14:17     ` Matthew Wilcox
@ 2025-01-22 11:15       ` David Hildenbrand
  0 siblings, 0 replies; 16+ messages in thread
From: David Hildenbrand @ 2025-01-22 11:15 UTC (permalink / raw)
  To: Matthew Wilcox, Jan Kara
  Cc: Vlastimil Babka, Joanne Koong, lsf-pc, linux-fsdevel, linux-mm

On 17.01.25 15:17, Matthew Wilcox wrote:
> On Fri, Jan 17, 2025 at 12:56:52PM +0100, Jan Kara wrote:
>> On Fri 17-01-25 12:40:15, Vlastimil Babka wrote:
>>> I think this might be tricky in some cases? I.e. with 2 MB and pmd-mapped
>>> folio, it's possible to write-protect only the whole pmd, not individual 32k
>>> chunks in order to catch the first write to a chunk to mark it dirty.
>>
>> Definitely. Once you map a folio through PMD entry, you have no other
>> option than consider whole 2MB dirty. But with PTE mappings or
>> modifications through syscalls you can do more fine-grained dirtiness
>> tracking and there're enough cases like that that it pays off.
> 
> Almost no applications use shared mmap writes to write to files.  The
> error handling story is crap and there's only limited control about when
> writeback actually happens.  Almost every application uses write(), even
> if they have the file mmaped.  This isn't a scenario worth worrying about.

Right, for example while ordinary files can be used for backing VMs, VMs 
permanently dirty a lot of memory, resulting in a constant writeback 
stream and nasty storage wear (so I've been told). There are some use 
cases for it, but we primarily use shmem/memfd/anon instead.

-- 
Cheers,

David / dhildenb




* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Improving large folio writeback performance
  2025-01-22  9:22             ` Jan Kara
@ 2025-01-22 22:17               ` Joanne Koong
  0 siblings, 0 replies; 16+ messages in thread
From: Joanne Koong @ 2025-01-22 22:17 UTC (permalink / raw)
  To: Jan Kara; +Cc: lsf-pc, linux-fsdevel, linux-mm, Matthew Wilcox (Oracle)

On Wed, Jan 22, 2025 at 1:22 AM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 21-01-25 16:29:57, Joanne Koong wrote:
> > On Mon, Jan 20, 2025 at 2:42 PM Jan Kara <jack@suse.cz> wrote:
> > > On Fri 17-01-25 14:45:01, Joanne Koong wrote:
> > > > On Fri, Jan 17, 2025 at 3:53 AM Jan Kara <jack@suse.cz> wrote:
> > > > > On Thu 16-01-25 15:38:54, Joanne Koong wrote:
> > > > > I think tweaking min_pause is a wrong way to do this. I think that is just a
> > > > > symptom. Can you run something like:
> > > > >
> > > > > while true; do
> > > > >         cat /sys/kernel/debug/bdi/<fuse-bdi>/stats
> > > > >         echo "---------"
> > > > >         sleep 1
> > > > > done >bdi-debug.txt
> > > > >
> > > > > while you are writing to the FUSE filesystem and share the output file?
> > > > > That should tell us a bit more about what's happening inside the writeback
> > > > > throttling. Also do you somehow configure min/max_ratio for the FUSE bdi?
> > > > > You can check in /sys/block/<fuse-bdi>/bdi/{min,max}_ratio . I suspect the
> > > > > problem is that the BDI dirty limit does not ramp up properly when we
> > > > > increase dirtied pages in large chunks.
> > > >
> > > > This is the debug info I see for FUSE large folio writes where bs=1M
> > > > and size=1G:
> > > >
> > > >
> > > > BdiWriteback:                0 kB
> > > > BdiReclaimable:              0 kB
> > > > BdiDirtyThresh:            896 kB
> > > > DirtyThresh:            359824 kB
> > > > BackgroundThresh:       179692 kB
> > > > BdiDirtied:            1071104 kB
> > > > BdiWritten:               4096 kB
> > > > BdiWriteBandwidth:           0 kBps
> > > > b_dirty:                     0
> > > > b_io:                        0
> > > > b_more_io:                   0
> > > > b_dirty_time:                0
> > > > bdi_list:                    1
> > > > state:                       1
> > > > ---------
> > > > BdiWriteback:                0 kB
> > > > BdiReclaimable:              0 kB
> > > > BdiDirtyThresh:           3596 kB
> > > > DirtyThresh:            359824 kB
> > > > BackgroundThresh:       179692 kB
> > > > BdiDirtied:            1290240 kB
> > > > BdiWritten:               4992 kB
> > > > BdiWriteBandwidth:           0 kBps
> > > > b_dirty:                     0
> > > > b_io:                        0
> > > > b_more_io:                   0
> > > > b_dirty_time:                0
> > > > bdi_list:                    1
> > > > state:                       1
> > > > ---------
> > > > BdiWriteback:                0 kB
> > > > BdiReclaimable:              0 kB
> > > > BdiDirtyThresh:           3596 kB
> > > > DirtyThresh:            359824 kB
> > > > BackgroundThresh:       179692 kB
> > > > BdiDirtied:            1517568 kB
> > > > BdiWritten:               5824 kB
> > > > BdiWriteBandwidth:       25692 kBps
> > > > b_dirty:                     0
> > > > b_io:                        1
> > > > b_more_io:                   0
> > > > b_dirty_time:                0
> > > > bdi_list:                    1
> > > > state:                       7
> > > > ---------
> > > > BdiWriteback:                0 kB
> > > > BdiReclaimable:              0 kB
> > > > BdiDirtyThresh:           3596 kB
> > > > DirtyThresh:            359824 kB
> > > > BackgroundThresh:       179692 kB
> > > > BdiDirtied:            1747968 kB
> > > > BdiWritten:               6720 kB
> > > > BdiWriteBandwidth:           0 kBps
> > > > b_dirty:                     0
> > > > b_io:                        0
> > > > b_more_io:                   0
> > > > b_dirty_time:                0
> > > > bdi_list:                    1
> > > > state:                       1
> > > > ---------
> > > > BdiWriteback:                0 kB
> > > > BdiReclaimable:              0 kB
> > > > BdiDirtyThresh:            896 kB
> > > > DirtyThresh:            359824 kB
> > > > BackgroundThresh:       179692 kB
> > > > BdiDirtied:            1949696 kB
> > > > BdiWritten:               7552 kB
> > > > BdiWriteBandwidth:           0 kBps
> > > > b_dirty:                     0
> > > > b_io:                        0
> > > > b_more_io:                   0
> > > > b_dirty_time:                0
> > > > bdi_list:                    1
> > > > state:                       1
> > > > ---------
> > > > BdiWriteback:                0 kB
> > > > BdiReclaimable:              0 kB
> > > > BdiDirtyThresh:           3612 kB
> > > > DirtyThresh:            361300 kB
> > > > BackgroundThresh:       180428 kB
> > > > BdiDirtied:            2097152 kB
> > > > BdiWritten:               8128 kB
> > > > BdiWriteBandwidth:           0 kBps
> > > > b_dirty:                     0
> > > > b_io:                        0
> > > > b_more_io:                   0
> > > > b_dirty_time:                0
> > > > bdi_list:                    1
> > > > state:                       1
> > > > ---------
> > > >
> > > >
> > > > I didn't do anything to configure/change the FUSE bdi min/max_ratio.
> > > > This is what I see on my system:
> > > >
> > > > cat /sys/class/bdi/0:52/min_ratio
> > > > 0
> > > > cat /sys/class/bdi/0:52/max_ratio
> > > > 1
> > >
> > > OK, we can see that BdiDirtyThresh stabilized more or less at 3.6MB.
> > > Checking the code, this shows we are hitting __wb_calc_thresh() logic:
> > >
> > >         if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
> > >                 unsigned long limit = hard_dirty_limit(dom, dtc->thresh);
> > >                 u64 wb_scale_thresh = 0;
> > >
> > >                 if (limit > dtc->dirty)
> > >                         wb_scale_thresh = (limit - dtc->dirty) / 100;
> > >                 wb_thresh = max(wb_thresh, min(wb_scale_thresh, wb_max_thresh /
> > >         }
> > >
> > > so BdiDirtyThresh is set to DirtyThresh/100. This also shows bdi never
> > > generates enough throughput to ramp up its share from this initial value.
> > >
> > > > > Actually, there's a patch queued in mm tree that improves the ramping up of
> > > > > bdi dirty limit for strictlimit bdis [1]. It would be nice if you could
> > > > > test whether it changes something in the behavior you observe. Thanks!
> > > > >
> > > > >                                                                 Honza
> > > > >
> > > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-page-writeback-consolidate-wb_thresh-bumping-logic-into-__wb_calc_thresh.patch
> > > >
> > > > I still see the same results (~230 MiB/s throughput using fio) with
> > > > this patch applied, unfortunately. Here's the debug info I see with
> > > > this patch (same test scenario as above on FUSE large folio writes
> > > > where bs=1M and size=1G):
> > > >
> > > > BdiWriteback:                0 kB
> > > > BdiReclaimable:           2048 kB
> > > > BdiDirtyThresh:           3588 kB
> > > > DirtyThresh:            359132 kB
> > > > BackgroundThresh:       179348 kB
> > > > BdiDirtied:              51200 kB
> > > > BdiWritten:                128 kB
> > > > BdiWriteBandwidth:      102400 kBps
> > > > b_dirty:                     1
> > > > b_io:                        0
> > > > b_more_io:                   0
> > > > b_dirty_time:                0
> > > > bdi_list:                    1
> > > > state:                       5
> > > > ---------
> > > > BdiWriteback:                0 kB
> > > > BdiReclaimable:              0 kB
> > > > BdiDirtyThresh:           3588 kB
> > > > DirtyThresh:            359144 kB
> > > > BackgroundThresh:       179352 kB
> > > > BdiDirtied:             331776 kB
> > > > BdiWritten:               1216 kB
> > > > BdiWriteBandwidth:           0 kBps
> > > > b_dirty:                     0
> > > > b_io:                        0
> > > > b_more_io:                   0
> > > > b_dirty_time:                0
> > > > bdi_list:                    1
> > > > state:                       1
> > > > ---------
> > > > BdiWriteback:                0 kB
> > > > BdiReclaimable:              0 kB
> > > > BdiDirtyThresh:           3588 kB
> > > > DirtyThresh:            359144 kB
> > > > BackgroundThresh:       179352 kB
> > > > BdiDirtied:             562176 kB
> > > > BdiWritten:               2176 kB
> > > > BdiWriteBandwidth:           0 kBps
> > > > b_dirty:                     0
> > > > b_io:                        0
> > > > b_more_io:                   0
> > > > b_dirty_time:                0
> > > > bdi_list:                    1
> > > > state:                       1
> > > > ---------
> > > > BdiWriteback:                0 kB
> > > > BdiReclaimable:              0 kB
> > > > BdiDirtyThresh:           3588 kB
> > > > DirtyThresh:            359144 kB
> > > > BackgroundThresh:       179352 kB
> > > > BdiDirtied:             792576 kB
> > > > BdiWritten:               3072 kB
> > > > BdiWriteBandwidth:           0 kBps
> > > > b_dirty:                     0
> > > > b_io:                        0
> > > > b_more_io:                   0
> > > > b_dirty_time:                0
> > > > bdi_list:                    1
> > > > state:                       1
> > > > ---------
> > > > BdiWriteback:               64 kB
> > > > BdiReclaimable:              0 kB
> > > > BdiDirtyThresh:           3588 kB
> > > > DirtyThresh:            359144 kB
> > > > BackgroundThresh:       179352 kB
> > > > BdiDirtied:            1026048 kB
> > > > BdiWritten:               3904 kB
> > > > BdiWriteBandwidth:           0 kBps
> > > > b_dirty:                     0
> > > > b_io:                        0
> > > > b_more_io:                   0
> > > > b_dirty_time:                0
> > > > bdi_list:                    1
> > > > state:                       1
> > > > ---------
> > >
> > > Yeah, here the situation is really the same. As an experiment, can you
> > > try setting min_ratio for the FUSE bdi to 1, 2, 3, ..., 10 (I
> > > don't expect you should need to go past 10) and figure out when there's
> > > enough slack space for the writeback bandwidth to ramp up to a full speed?
> > > Thanks!
> > >
> > >                                                                 Honza
> >
> > When locally testing this, I'm seeing that the max_ratio affects the
> > bandwidth more so than min_ratio (eg the different min_ratios have
> > roughly the same bandwidth per max_ratio). I'm also seeing somewhat
> > high variance across runs which makes it hard to gauge what's
> > accurate, but on average this is what I'm seeing:
> >
> > max_ratio=1 --- bandwidth= ~230 MiB/s
> > max_ratio=2 --- bandwidth= ~420 MiB/s
> > max_ratio=3 --- bandwidth= ~550 MiB/s
> > max_ratio=4 --- bandwidth= ~653 MiB/s
> > max_ratio=5 --- bandwidth= ~700 MiB/s
> > max_ratio=6 --- bandwidth= ~810 MiB/s
> > max_ratio=7 --- bandwidth= ~1040 MiB/s (and then a lot of times, 561
> > MiB/s on subsequent runs)
>
> Ah, sorry. I actually misinterpreted your reply from the previous email:
>
> > > > cat /sys/class/bdi/0:52/max_ratio
> > > > 1
>
> This means the amount of dirty pages for the fuse filesystem is indeed
> hard-capped at 1% of the dirty limit, which happens to be ~3MB on your
> machine. Checking where this comes from, I can see that fuse_bdi_init()
> does this via:
>
>         bdi_set_max_ratio(sb->s_bdi, 1);
>
> So FUSE restricts itself, and with only a 3MB dirty limit and a 2MB
> dirtying granularity it is not surprising that dirty throttling doesn't
> work well.
>
> I'd say there needs to be some better heuristic within FUSE that balances
> maximum folio size and maximum dirty limit setting for the filesystem to a
> sensible compromise (so that there's space for at least say 10 dirty
> max-sized folios within the dirty limit).
>
> But I guess this is just a shorter-term workaround. Long-term, finer
> grained dirtiness tracking within FUSE (and writeback counters tracking in
> MM) is going to be a more effective solution.
>

Thanks for taking a look, Jan. I'll play around with the bdi limits,
though I don't think we'll be able to raise this for unprivileged FUSE
servers. I'm planning to add finer-grained dirtiness tracking to FUSE
and the associated mm writeback counter changes, but even then, having
full writes be that much slower is probably a no-go, so I'll
experiment with limiting the fgf order.
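For reference, the finer-grained tracking discussed in this thread can
be modeled as a small per-folio bitmap where each bit covers one chunk
of the folio. The sketch below is a hypothetical userspace model (all
names invented), not proposed kernel code:

```c
#include <stdint.h>

/*
 * Model of per-folio sub-dirty tracking: one 64-bit bitmap per folio,
 * each bit covering folio_bytes / 64 (e.g. 32KB chunks for a 2MB
 * folio). Writeback would then submit only the set chunks instead of
 * the whole folio.
 */
struct folio_dirty_map {
	uint64_t bits;
	unsigned long folio_bytes;
};

/* Mark the byte range [pos, pos + len) of the folio dirty. */
static void folio_mark_dirty_range(struct folio_dirty_map *m,
				   unsigned long pos, unsigned long len)
{
	unsigned long chunk = m->folio_bytes / 64;
	unsigned long first, last, i;

	if (!len)
		return;
	first = pos / chunk;
	last = (pos + len - 1) / chunk;
	for (i = first; i <= last && i < 64; i++)
		m->bits |= 1ULL << i;
}
```

A random 1-byte write then dirties one 32KB chunk of a 2MB folio
instead of all 2MB, which is the writeback-amplification problem this
topic proposal is aimed at.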


Thanks,
Joanne

>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR



end of thread, other threads:[~2025-01-22 22:17 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-15  0:50 [LSF/MM/BPF TOPIC] Improving large folio writeback performance Joanne Koong
2025-01-15  1:21 ` Dave Chinner
2025-01-16 20:14   ` Joanne Koong
2025-01-15  1:50 ` Darrick J. Wong
2025-01-16 11:01 ` [Lsf-pc] " Jan Kara
2025-01-16 23:38   ` Joanne Koong
2025-01-17 11:53     ` Jan Kara
2025-01-17 22:45       ` Joanne Koong
2025-01-20 22:42         ` Jan Kara
2025-01-22  0:29           ` Joanne Koong
2025-01-22  9:22             ` Jan Kara
2025-01-22 22:17               ` Joanne Koong
2025-01-17 11:40 ` Vlastimil Babka
2025-01-17 11:56   ` [Lsf-pc] " Jan Kara
2025-01-17 14:17     ` Matthew Wilcox
2025-01-22 11:15       ` David Hildenbrand
