* Re: [PATCH v2 0/6] btrfs: preparation patches for the incoming metadata folio conversion
[not found] <cover.1689143654.git.wqu@suse.com>
@ 2023-07-12 16:41 ` Christoph Hellwig
2023-07-12 23:58 ` Qu Wenruo
0 siblings, 1 reply; 6+ messages in thread
From: Christoph Hellwig @ 2023-07-12 16:41 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs, willy, linux-mm
On Wed, Jul 12, 2023 at 02:37:40PM +0800, Qu Wenruo wrote:
> One of the biggest problems for the metadata folio conversion is that we
> still need the current page-based solution (or order-0 folios) as a
> fallback when we cannot get a high-order folio.
Do we? btrfs by default uses a 16k nodesize (order 2 on x86), with
a maximum of 64k (order 4). IIRC we should be able to get them pretty
reliably.
If not, the best thing is to just use a virtually contiguous allocation as
the fallback, i.e. use vm_map_ram. That's what XFS uses in its buffer
cache, and it already did so before it stopped using the page cache to
back its buffer cache, something I plan to do for the btrfs buffer
cache as well, as the page cache algorithms tend not to work very
well for buffer-based metadata, never mind that there is an incredible
amount of complex code just working around the interactions.
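For illustration, a rough sketch of such a fallback (not code from this series; struct metadata_buf and the helper are invented names, and it assumes PAGE_SIZE <= nodesize <= 64K): try one physically contiguous high-order folio first, and only stitch order-0 pages together with vm_map_ram() when that fails.

/* Sketch only: invented names, assumes PAGE_SIZE <= nodesize <= 64K. */
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

struct metadata_buf {
	struct folio *folio;	/* set when the high-order allocation worked */
	struct page *pages[16];	/* order-0 fallback pages */
	void *vaddr;		/* contiguous (possibly only virtual) view */
	unsigned int nr_pages;
};

static int metadata_buf_alloc(struct metadata_buf *buf, unsigned int nodesize)
{
	unsigned int nr = nodesize >> PAGE_SHIFT;
	unsigned int i;

	buf->nr_pages = nr;

	/* Fast path: one physically contiguous high-order folio. */
	buf->folio = folio_alloc(GFP_NOFS | __GFP_NOWARN, get_order(nodesize));
	if (buf->folio) {
		buf->vaddr = folio_address(buf->folio);
		return 0;
	}

	/* Fallback: order-0 pages made virtually contiguous via vm_map_ram(). */
	for (i = 0; i < nr; i++) {
		buf->pages[i] = alloc_page(GFP_NOFS);
		if (!buf->pages[i])
			goto out_free;
	}
	buf->vaddr = vm_map_ram(buf->pages, nr, NUMA_NO_NODE);
	if (!buf->vaddr)
		goto out_free;
	return 0;

out_free:
	while (i--)
		__free_page(buf->pages[i]);
	return -ENOMEM;
}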
* Re: [PATCH v2 0/6] btrfs: preparation patches for the incoming metadata folio conversion
2023-07-12 16:41 ` [PATCH v2 0/6] btrfs: preparation patches for the incoming metadata folio conversion Christoph Hellwig
@ 2023-07-12 23:58 ` Qu Wenruo
2023-07-13 11:16 ` Christoph Hellwig
2023-07-13 11:26 ` David Sterba
0 siblings, 2 replies; 6+ messages in thread
From: Qu Wenruo @ 2023-07-12 23:58 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-btrfs, willy, linux-mm
On 2023/7/13 00:41, Christoph Hellwig wrote:
> On Wed, Jul 12, 2023 at 02:37:40PM +0800, Qu Wenruo wrote:
>> One of the biggest problems for the metadata folio conversion is that we
>> still need the current page-based solution (or order-0 folios) as a
>> fallback when we cannot get a high-order folio.
>
> Do we? btrfs by default uses a 16k nodesize (order 2 on x86), with
> a maximum of 64k (order 4). IIRC we should be able to get them pretty
> reliably.
If it can be done as reliably as order 0 with NOFAIL, I'm totally fine
with that.
>
> If not, the best thing is to just use a virtually contiguous allocation as
> the fallback, i.e. use vm_map_ram.
That's also what Sweet Tea Dorminy mentioned, and I believe it's the
correct way to go (as the fallback).
Although my concern is my lack of experience with MM code, and whether those
pages can still be attached to an address space (with PagePrivate set).
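For context, a rough sketch of the address-space side of that concern (attach_eb_page() is an invented name, not the real btrfs helper); whether such pages can also stay vm_map_ram()-mapped at the same time is exactly the open question.

/* Sketch only: attach_eb_page() is an invented name. */
#include <linux/gfp.h>
#include <linux/pagemap.h>

static int attach_eb_page(struct address_space *mapping, struct page *page,
			  pgoff_t index, void *eb)
{
	int ret;

	/* Insert the order-0 page into the btree inode's page cache. */
	ret = add_to_page_cache_lru(page, mapping, index, GFP_NOFS);
	if (ret)
		return ret;

	/* Sets PagePrivate and stores the eb pointer in page->private. */
	attach_page_private(page, eb);
	return 0;
}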
> That's what XFS uses in its buffer
> cache, and it already did so before it stopped using the page cache to
> back its buffer cache, something I plan to do for the btrfs buffer
> cache as well, as the page cache algorithms tend not to work very
> well for buffer-based metadata, never mind that there is an incredible
> amount of complex code just working around the interactions.
Thus we have the preparation patchset as the first step.
It should help no matter which direction we go next.
Thanks,
Qu
* Re: [PATCH v2 0/6] btrfs: preparation patches for the incoming metadata folio conversion
2023-07-12 23:58 ` Qu Wenruo
@ 2023-07-13 11:16 ` Christoph Hellwig
2023-07-13 11:26 ` David Sterba
1 sibling, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2023-07-13 11:16 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Christoph Hellwig, linux-btrfs, willy, linux-mm
On Thu, Jul 13, 2023 at 07:58:17AM +0800, Qu Wenruo wrote:
> > Do we? btrfs by default uses a 16k nodesize (order 2 on x86), with
> > a maximum of 64k (order 4). IIRC we should be able to get them pretty
> > reliably.
>
> If it can be done as reliably as order 0 with NOFAIL, I'm totally fine with
> that.
I think that is the aim. I'm not entirely sure if we are there yet,
thus the Ccs.
> > If not, the best thing is to just use a virtually contiguous allocation as
> > the fallback, i.e. use vm_map_ram.
>
> That's also what Sweet Tea Dorminy mentioned, and I believe it's the correct
> way to go (as the fallback).
>
> Although my concern is my lack of experience with MM code, and whether those
> pages can still be attached to an address space (with PagePrivate set).
At least they could back in the day when XFS did exactly that. In fact
that was the use case for which I originally added vmap back in 2002.
* Re: [PATCH v2 0/6] btrfs: preparation patches for the incoming metadata folio conversion
2023-07-12 23:58 ` Qu Wenruo
2023-07-13 11:16 ` Christoph Hellwig
@ 2023-07-13 11:26 ` David Sterba
2023-07-13 11:41 ` Qu Wenruo
1 sibling, 1 reply; 6+ messages in thread
From: David Sterba @ 2023-07-13 11:26 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Christoph Hellwig, linux-btrfs, willy, linux-mm
On Thu, Jul 13, 2023 at 07:58:17AM +0800, Qu Wenruo wrote:
> On 2023/7/13 00:41, Christoph Hellwig wrote:
> > On Wed, Jul 12, 2023 at 02:37:40PM +0800, Qu Wenruo wrote:
> >> One of the biggest problems for the metadata folio conversion is that we
> >> still need the current page-based solution (or order-0 folios) as a
> >> fallback when we cannot get a high-order folio.
> >
> > Do we? btrfs by default uses a 16k nodesize (order 2 on x86), with
> > a maximum of 64k (order 4). IIRC we should be able to get them pretty
> > reliably.
>
> If it can be done as reliably as order 0 with NOFAIL, I'm totally fine
> with that.
I have mentioned my concerns about the problems of allocations with order
higher than 0 in the past. The allocator gives some guarantees about not
failing up to a certain order; right now that's 1 (mm/fail_page_alloc.c,
fail_page_alloc.min_order = 1).
Per comment in page_alloc.c:rmqueue()
2814 /*
2815 * We most definitely don't want callers attempting to
2816 * allocate greater than order-1 page units with __GFP_NOFAIL.
2817 */
2818 WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
For allocations with higher order (order 2 for the default 16K nodes, up to
order 4 for 64K), this increases memory pressure and can trigger compaction;
see the logic around PAGE_ALLOC_COSTLY_ORDER, which is 3.
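Put differently (a hedged illustration, not proposed btrfs code): since __GFP_NOFAIL is not supported above order 1, any order-2..4 attempt for a 16K or 64K node has to be allowed to fail and needs a fallback.

/* Sketch only: not proposed btrfs code. */
#include <linux/gfp.h>

static struct folio *alloc_metadata_folio(unsigned int order)
{
	gfp_t gfp = GFP_NOFS | __GFP_NOWARN;

	/*
	 * Only order <= 1 may use __GFP_NOFAIL (rmqueue() warns otherwise),
	 * and orders above PAGE_ALLOC_COSTLY_ORDER (3) additionally put
	 * compaction/reclaim pressure on the system, so higher orders must
	 * be failable.
	 */
	if (order <= 1)
		gfp |= __GFP_NOFAIL;

	return folio_alloc(gfp, order);
}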
> > If not, the best thing is to just use a virtually contiguous allocation as
> > the fallback, i.e. use vm_map_ram.
So we can allocate order-0 pages and then map them to virtual addresses,
which needs manipulation of the PTEs (page table entries) and requires
additional memory. This is what xfs does in
fs/xfs/xfs_buf.c:_xfs_buf_map_pages(); it needs some care with aliased
memory, so vm_unmap_aliases() is required and brings some overhead, and at
the end vm_unmap_ram() needs to be called, another overhead but probably
bearable.
With all that in place there would be a contiguous memory range
representing the metadata, so a simple memcpy() can be done. Sure, at the
cost of higher overhead and decreased reliability due to potentially
failing memory allocations - for metadata operations.
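A rough sketch of that lifecycle, loosely modelled on _xfs_buf_map_pages() (the helper names here are invented, not btrfs or xfs functions):

/* Sketch only: invented helper names. */
#include <linux/mm.h>
#include <linux/vmalloc.h>

static void *metadata_map_pages(struct page **pages, unsigned int nr)
{
	unsigned int retried = 0;
	void *vaddr;

	do {
		/* Setting up the PTEs may itself need memory and can fail. */
		vaddr = vm_map_ram(pages, nr, NUMA_NO_NODE);
		if (vaddr)
			return vaddr;
		/*
		 * Purge lazily freed vmap areas (the aliasing overhead
		 * mentioned above) and retry, as xfs does.
		 */
		vm_unmap_aliases();
	} while (retried++ == 0);

	return NULL;
}

static void metadata_unmap_pages(void *vaddr, unsigned int nr)
{
	/* The matching teardown; another bit of per-buffer overhead. */
	vm_unmap_ram(vaddr, nr);
}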
Compare that to what we have:
Pages are allocated as order 0, so there's a much higher chance of getting
them under memory pressure, and we don't increase the pressure otherwise. We
don't need any virtual mappings. The cost is that we have to iterate the
pages and do the partial copying ourselves, but this is hidden in helpers.
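As an illustration of the kind of helper meant here (a simplified sketch, not the actual read_extent_buffer() code):

/* Sketch only: simplified cross-page copy loop. */
#include <linux/mm.h>
#include <linux/string.h>

static void copy_from_metadata_pages(char *dst, struct page **pages,
				     unsigned long start, unsigned long len)
{
	while (len) {
		unsigned long i = start >> PAGE_SHIFT;
		unsigned long off = offset_in_page(start);
		unsigned long cur = min(len, PAGE_SIZE - off);

		/* Copy the part of the range that fits in this page. */
		memcpy(dst, page_address(pages[i]) + off, cur);
		dst += cur;
		start += cur;
		len -= cur;
	}
}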
We have a different usage pattern for the metadata buffers than xfs, so
the fact that it does something with vmapped contiguous buffers may not
transfer easily to btrfs and could bring us new problems.
The conversion to folios will happen eventually, though I don't want to
sacrifice reliability just for API convenience. First the conversion
should be done 1:1, with pages and folios both order 0, before switching
to some higher-order allocations hidden behind API calls.
* Re: [PATCH v2 0/6] btrfs: preparation patches for the incoming metadata folio conversion
2023-07-13 11:26 ` David Sterba
@ 2023-07-13 11:41 ` Qu Wenruo
2023-07-13 11:49 ` David Sterba
0 siblings, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2023-07-13 11:41 UTC (permalink / raw)
To: dsterba, Qu Wenruo; +Cc: Christoph Hellwig, linux-btrfs, willy, linux-mm
On 2023/7/13 19:26, David Sterba wrote:
> On Thu, Jul 13, 2023 at 07:58:17AM +0800, Qu Wenruo wrote:
>> On 2023/7/13 00:41, Christoph Hellwig wrote:
>>> On Wed, Jul 12, 2023 at 02:37:40PM +0800, Qu Wenruo wrote:
>>>> One of the biggest problems for the metadata folio conversion is that we
>>>> still need the current page-based solution (or order-0 folios) as a
>>>> fallback when we cannot get a high-order folio.
>>>
>>> Do we? btrfs by default uses a 16k nodesize (order 2 on x86), with
>>> a maximum of 64k (order 4). IIRC we should be able to get them pretty
>>> reliably.
>>
>> If it can be done as reliably as order 0 with NOFAIL, I'm totally fine
>> with that.
>
> I have mentioned my concerns about the problems of allocations with order
> higher than 0 in the past. The allocator gives some guarantees about not
> failing up to a certain order; right now that's 1 (mm/fail_page_alloc.c,
> fail_page_alloc.min_order = 1).
>
> Per comment in page_alloc.c:rmqueue()
>
> 2814 /*
> 2815 * We most definitely don't want callers attempting to
> 2816 * allocate greater than order-1 page units with __GFP_NOFAIL.
> 2817 */
> 2818 WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>
> For allocations with higher order (order 2 for the default 16K nodes, up to
> order 4 for 64K), this increases memory pressure and can trigger compaction;
> see the logic around PAGE_ALLOC_COSTLY_ORDER, which is 3.
>
>>> If not, the best thing is to just use a virtually contiguous allocation as
>>> the fallback, i.e. use vm_map_ram.
>
> So we can allocate order-0 pages and then map them to virtual addresses,
> which needs manipulation of the PTEs (page table entries) and requires
> additional memory. This is what xfs does in
> fs/xfs/xfs_buf.c:_xfs_buf_map_pages(); it needs some care with aliased
> memory, so vm_unmap_aliases() is required and brings some overhead, and at
> the end vm_unmap_ram() needs to be called, another overhead but probably
> bearable.
>
> With all that in place there would be a contiguous memory range
> representing the metadata, so a simple memcpy() can be done. Sure, at the
> cost of higher overhead and decreased reliability due to potentially
> failing memory allocations - for metadata operations.
>
> Compare that to what we have:
>
> Pages are allocated as order 0, so there's a much higher chance of getting
> them under memory pressure, and we don't increase the pressure otherwise. We
> don't need any virtual mappings. The cost is that we have to iterate the
> pages and do the partial copying ourselves, but this is hidden in helpers.
>
> We have a different usage pattern for the metadata buffers than xfs, so
> the fact that it does something with vmapped contiguous buffers may not
> transfer easily to btrfs and could bring us new problems.
>
> The conversion to folios will happen eventually, though I don't want to
> sacrifice reliability just for API convenience. First the conversion
> should be done 1:1, with pages and folios both order 0, before switching
> to some higher-order allocations hidden behind API calls.
In fact, I have another solution as a middle ground before bringing folios
into the picture.
Check if the pages are already physically contiguous.
If so, everything can go without any cross-page handling.
If not, we can either keep the current cross-page handling, or migrate
to virtually contiguous mapped pages.
Currently around 50~66% of eb pages are already allocated physically
contiguously.
If we can just reduce the cross-page handling for more than half of the
ebs, it's already a win.
For the vmapped pages, I'm not sure about the overhead, but I can try to
go that path and check the result.
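A minimal sketch of that contiguity check could look like this (hypothetical helper, just to illustrate the idea):

/* Sketch only: hypothetical helper. */
#include <linux/mm.h>

static bool eb_pages_contiguous(struct page **pages, unsigned int nr)
{
	unsigned int i;

	for (i = 1; i < nr; i++) {
		/* Physically contiguous means strictly consecutive PFNs. */
		if (page_to_pfn(pages[i]) != page_to_pfn(pages[i - 1]) + 1)
			return false;
	}
	return true;
}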
Thanks,
Qu
* Re: [PATCH v2 0/6] btrfs: preparation patches for the incoming metadata folio conversion
2023-07-13 11:41 ` Qu Wenruo
@ 2023-07-13 11:49 ` David Sterba
0 siblings, 0 replies; 6+ messages in thread
From: David Sterba @ 2023-07-13 11:49 UTC (permalink / raw)
To: Qu Wenruo
Cc: dsterba, Qu Wenruo, Christoph Hellwig, linux-btrfs, willy, linux-mm
On Thu, Jul 13, 2023 at 07:41:53PM +0800, Qu Wenruo wrote:
> On 2023/7/13 19:26, David Sterba wrote:
> > On Thu, Jul 13, 2023 at 07:58:17AM +0800, Qu Wenruo wrote:
> >> On 2023/7/13 00:41, Christoph Hellwig wrote:
> >>> On Wed, Jul 12, 2023 at 02:37:40PM +0800, Qu Wenruo wrote:
> >>>> One of the biggest problems for the metadata folio conversion is that we
> >>>> still need the current page-based solution (or order-0 folios) as a
> >>>> fallback when we cannot get a high-order folio.
> >>>
> >>> Do we? btrfs by default uses a 16k nodesize (order 2 on x86), with
> >>> a maximum of 64k (order 4). IIRC we should be able to get them pretty
> >>> reliably.
> >>
> >> If it can be done as reliably as order 0 with NOFAIL, I'm totally fine
> >> with that.
> >
> > I have mentioned my concerns about the problems of allocations with order
> > higher than 0 in the past. The allocator gives some guarantees about not
> > failing up to a certain order; right now that's 1 (mm/fail_page_alloc.c,
> > fail_page_alloc.min_order = 1).
> >
> > Per comment in page_alloc.c:rmqueue()
> >
> > 2814 /*
> > 2815 * We most definitely don't want callers attempting to
> > 2816 * allocate greater than order-1 page units with __GFP_NOFAIL.
> > 2817 */
> > 2818 WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> >
> > For allocations with higher order (order 2 for the default 16K nodes, up to
> > order 4 for 64K), this increases memory pressure and can trigger compaction;
> > see the logic around PAGE_ALLOC_COSTLY_ORDER, which is 3.
> >
> >>> If not, the best thing is to just use a virtually contiguous allocation as
> >>> the fallback, i.e. use vm_map_ram.
> >
> > So we can allocate order-0 pages and then map them to virtual addresses,
> > which needs manipulation of the PTEs (page table entries) and requires
> > additional memory. This is what xfs does in
> > fs/xfs/xfs_buf.c:_xfs_buf_map_pages(); it needs some care with aliased
> > memory, so vm_unmap_aliases() is required and brings some overhead, and at
> > the end vm_unmap_ram() needs to be called, another overhead but probably
> > bearable.
> >
> > With all that in place there would be a contiguous memory range
> > representing the metadata, so a simple memcpy() can be done. Sure, at the
> > cost of higher overhead and decreased reliability due to potentially
> > failing memory allocations - for metadata operations.
> >
> > Compare that to what we have:
> >
> > Pages are allocated as order 0, so there's a much higher chance of getting
> > them under memory pressure, and we don't increase the pressure otherwise. We
> > don't need any virtual mappings. The cost is that we have to iterate the
> > pages and do the partial copying ourselves, but this is hidden in helpers.
> >
> > We have a different usage pattern for the metadata buffers than xfs, so
> > the fact that it does something with vmapped contiguous buffers may not
> > transfer easily to btrfs and could bring us new problems.
> >
> > The conversion to folios will happen eventually, though I don't want to
> > sacrifice reliability just for API convenience. First the conversion
> > should be done 1:1, with pages and folios both order 0, before switching
> > to some higher-order allocations hidden behind API calls.
>
> In fact, I have another solution as a middle ground before bringing folios
> into the picture.
>
> Check if the pages are already physically contiguous.
> If so, everything can go without any cross-page handling.
>
> If not, we can either keep the current cross-page handling, or migrate
> to virtually contiguous mapped pages.
>
> Currently around 50~66% of eb pages are already allocated physically
> contiguously.
Memory fragmentation becomes a problem over time on systems running for
weeks/months; the contiguous ranges will then become scarce. So if you
measure that on a system with a lot of memory and over a short time, then
of course you will see a high rate of contiguous pages.