linux-mm.kvack.org archive mirror
From: David Sterba <dsterba@suse.cz>
To: Qu Wenruo <wqu@suse.com>
Cc: Christoph Hellwig <hch@infradead.org>,
	linux-btrfs@vger.kernel.org, willy@infradead.org,
	linux-mm@kvack.org
Subject: Re: [PATCH v2 0/6] btrfs: preparation patches for the incoming metadata folio conversion
Date: Thu, 13 Jul 2023 13:26:05 +0200	[thread overview]
Message-ID: <20230713112605.GO30916@twin.jikos.cz> (raw)
In-Reply-To: <ff78f3e8-6438-4b29-02c0-c14fb5949360@suse.com>

On Thu, Jul 13, 2023 at 07:58:17AM +0800, Qu Wenruo wrote:
> On 2023/7/13 00:41, Christoph Hellwig wrote:
> > On Wed, Jul 12, 2023 at 02:37:40PM +0800, Qu Wenruo wrote:
> >> One of the biggest problems for the metadata folio conversion is that we
> >> still need the current page based solution (or folios with order 0) as a
> >> fallback when we can not get a high order folio.
> > 
> > Do we?  btrfs by default uses a 16k nodesize (order 2 on x86), with
> > a maximum of 64k (order 4).  IIRC we should be able to get them pretty
> > reliably.
> 
> If it can be done as reliable as order 0 with NOFAIL, I'm totally fine 
> with that.

I have mentioned my concerns about the allocation problems with orders
higher than 0 in the past. The allocator gives some guarantees about not
failing, but only up to a certain order, currently 1 (mm/fail_page_alloc.c,
fail_page_alloc.min_order = 1).

Per the comment in page_alloc.c:rmqueue():

	/*
	 * We most definitely don't want callers attempting to
	 * allocate greater than order-1 page units with __GFP_NOFAIL.
	 */
	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));

For allocations with higher order, e.g. 2 to match the default 16K
nodesize (or 4 for the maximum 64K), this increases memory pressure and
can trigger compaction; see the logic around PAGE_ALLOC_COSTLY_ORDER,
which is 3 and above which allocations are treated as costly.

> > If not, the best thing is to just use a virtually contiguous allocation
> > as fallback, i.e. use vm_map_ram.

So we can allocate order-0 pages and then map them into a contiguous
virtual range, which needs manipulation of PTEs (page table entries) and
requires additional memory. This is what XFS does in
fs/xfs/xfs_buf.c:_xfs_buf_map_pages(). It needs some care with aliased
mappings, so vm_unmap_aliases() is required and brings some overhead,
and at the end vm_unmap_ram() must be called, another overhead, but
probably a bearable one.
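The shape of that approach, as a rough pseudocode-style sketch against the kernel APIs (not compilable outside the kernel; error handling and the exact flags are elided, and NR/nr are placeholders for the page count):

```c
/* Kernel-API sketch: allocate order-0 pages and stitch them into one
 * virtually contiguous buffer, roughly as _xfs_buf_map_pages() does. */
struct page *pages[NR];
void *vaddr;

for (i = 0; i < nr; i++)
	pages[i] = alloc_page(GFP_NOFS);	/* order 0 only */

vaddr = vm_map_ram(pages, nr, NUMA_NO_NODE);	/* may fail: vmalloc
						 * space, PTE memory */
...
memcpy(vaddr + off, src, len);	/* one contiguous view, simple copies */
...
vm_unmap_ram(vaddr, nr);	/* plus the vm_unmap_aliases() cost */
```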

With all that in place there would be a contiguous memory range
representing the metadata, so a simple memcpy() can be done. Sure,
with higher overhead and decreased reliability due to potentially
failing memory allocations - for metadata operations.

Compare that to what we have:

Pages are allocated as order 0, so there's a much higher chance to get
them under pressure, and we don't increase the pressure otherwise. We
don't need any virtual mappings. The cost is that we have to iterate the
pages and do the partial copying ourselves, but that is hidden in
helpers.

We have a different usage pattern for the metadata buffers than XFS, so
the fact that it works with vmapped contiguous buffers may not transfer
easily to btrfs and could bring us new problems.

The conversion to folios will happen eventually, though I don't want to
sacrifice reliability just for the convenience of an API. First the
conversion should be done 1:1, with pages and folios both order 0,
before switching to higher-order allocations hidden behind API calls.



Thread overview: 6+ messages
     [not found] <cover.1689143654.git.wqu@suse.com>
2023-07-12 16:41 ` Christoph Hellwig
2023-07-12 23:58   ` Qu Wenruo
2023-07-13 11:16     ` Christoph Hellwig
2023-07-13 11:26     ` David Sterba [this message]
2023-07-13 11:41       ` Qu Wenruo
2023-07-13 11:49         ` David Sterba
