linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: James Houghton <jthoughton@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>,
	Hugh Dickins <hughd@google.com>,
	Muchun Song <songmuchun@bytedance.com>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	David Hildenbrand <david@redhat.com>,
	David Rientjes <rientjes@google.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Jiaqi Yan <jiaqiyan@google.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/2] mm: rmap: merge HugeTLB mapcount logic with THPs
Date: Thu, 9 Mar 2023 14:29:47 -0500	[thread overview]
Message-ID: <ZAozq9a25utZtsaD@x1n> (raw)
In-Reply-To: <CADrL8HXk6F+7N9sYfFb3Q=T9Tda2+St1JoBCThrgc2j2yU3bcQ@mail.gmail.com>

On Thu, Mar 09, 2023 at 10:05:12AM -0800, James Houghton wrote:
> On Wed, Mar 8, 2023 at 2:10 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Mon, Mar 06, 2023 at 11:00:02PM +0000, James Houghton wrote:
> > > HugeTLB pages may soon support being mapped with PTEs. To allow for this
> > > case, merge HugeTLB's mapcount scheme with THP's.
> > >
> > > The first patch of this series comes from the HugeTLB high-granularity
> > > mapping series[1], though with some updates, as the original version
> > > was buggy[2] and incomplete.
> > >
> > > I am sending this change as part of this smaller series in hopes that it
> > > can be more thoroughly scrutinized.
> > >
> > > I haven't run any THP performance tests with this series applied.
> > > HugeTLB pages don't currently support being mapped with
> > > `compound=false`, but this mapcount scheme will make collapsing
> > > compound=false mappings in HugeTLB pages quite slow. This can be
> > > optimized with future patches (likely by taking advantage of HugeTLB's
> > > alignment guarantees).
> > >
> > > Matthew Wilcox is working on a mapcounting scheme[3] that will avoid
> > > the use of each subpage's mapcount. If this series is applied, Matthew's
> > > new scheme will automatically apply to HugeTLB pages.
> >
> > Is this the plan?
> >
> > I may have not followed closely on the latest development of Matthew's
> > idea.  The thing is if the design requires ptes being installed / removed
> > at the same time for the whole folio, then it may not work directly for HGM
> > if HGM wants to support at least postcopy, iiuc, because if we install the
> > whole folio ptes at the same time it seems to beat the whole purpose of
> > having HGM..
> 
> My understanding is that it doesn't *require* all the PTEs in a folio
> to be mapped at the same time. I don't see how it possibly could,
> given that UFFDIO_CONTINUE exists (which can already create PTE-mapped
> THPs today). It would be faster to populate all the PTEs at the same
> time (you would only need to traverse the page table once for the
> entire group to see if you should be incrementing mapcount).
> 
> Though, with respect to unmapping, if PTEs aren't all unmapped at the
> same time, then you could end up with a case where mapcount is still
> incremented but nothing is really mapped. I'm not really sure what
> should be done there, but this problem applies to PTE-mapped THPs the
> same way that it applies to HGMed HugeTLB pages.
> 
> > The patch (especially patch 1) looks good.  So it's a pure question just to
> > make sure we're on the same page.  IIUC your other mapcount proposal may
> > work, but it still needs to be able to take care of ptes in less-than-folio
> > sizes whatever it'll look like at last.
> 
> By my "other mapcount proposal", I assume you mean the "using the
> PAGE_SPECIAL bit to track if mapcount has been incremented or not". It
> really only serves as an optimization for Matthew's scheme (see below
> [2] for some more thoughts), and it doesn't have to only apply to
> HugeTLB.
> 
> I originally thought[1] that Matthew's scheme would be really painful
> for postcopy for HGM without this optimization, but it's actually not
> so bad. Let's assume the worst case, that we're UFFDIO_CONTINUEing
> from the end to the beginning, like in [1]:
> 
> First CONTINUE: pvmw finds an empty PUD, so quickly returns false.
> Second CONTINUE: pvmw finds 511 empty PMDs, then finds 511 empty PTEs,
> then finds a present PTE (from the first CONTINUE).
> Third CONTINUE: pvmw finds 511 empty PMDs, then finds 510 empty PTEs.
> ...
> 514th CONTINUE: pvmw finds 510 empty PMDs, then finds 511 empty PTEs.
> 
> So it'll be slow, but it won't have to check 262k empty PTEs per
> CONTINUE (though you could make this possible with MADV_DONTNEED).
> Even with an HGM implementation that only allows PTE-mapping of
> HugeTLB pages, it should still behave just like this, too.
> 
> > A trivial comment on patch 2 since we're at it: does "a future plan on some
> > arch to support 512GB huge page" justify itself?  It would be better
> > justified, IMHO, when that support is added (and decided to use HGM)?
> 
> That's fine with me. I'm happy to drop that patch.
> 
> > What I feel like is missing (rather than patch 2 itself) is some guard to
> > make sure thp mapcountings will not be abused with new hugetlb sizes
> > coming.
> >
> > How about another BUG_ON() squashed into patch 1 (probably somewhere in
> > page_add_file|anon_rmap()) to make sure folio_size() is always smaller than
> > COMPOUND_MAPPED / 2)?
> 
> Sure, I can add that.
> 
> Thanks, Peter!
> 
> - James
> 
> [1]: https://lore.kernel.org/linux-mm/CADrL8HUrEgt+1qAtEsOHuQeA+WWnggGfLj8_nqHF0k-pqPi52w@mail.gmail.com/
> 
> [2]: Some details on what the optimization might look like:
> 
> So an excerpt of Matthew's scheme would look something like this:
> 
>     /* if we're mapping < folio_nr_pages(folio) worth of PTEs. */
>     if (!folio_has_ptes(folio, vma))
>       atomic_inc(folio->_mapcount);
> 
> where folio_has_ptes() is defined like:
> 
>     if (!page_vma_mapped_walk(...))
>       return false;
>     page_vma_mapped_walk_done(...);
>     return true;
> 
> You might be able to optimize folio_has_ptes() with a block like this
> at the beginning:
> 
>     if (folio_is_naturally_aligned(folio, vma)) {
>       /* optimization for naturally-aligned folios. */
>       if (folio_test_hugetlb(folio)) {
>         /* check hstate-level PTE, and do a similar check as below. */
>       }
>       /* for naturally-aligned THPs: */
>       pmdp = mm_find_pmd(...); /* or just pass it in. */
>       pmd = READ_ONCE(*pmdp);
>       BUG_ON(!pmd_present(pmd) || pmd_leaf(pmd));
>       if (pmd_special(pmd))
>         return true;
>       /* we already hold the PTL for the PTE. */
>       ptl = pmd_lock(mm, pmdp);
>       /* test and set pmd_special */
>       pmd_unlock(ptl)
>       return if_we_set_pmd_special;
>     }
> 
> (pmd_special() doesn't currently exist.) If HugeTLB walking code can
> be merged with generic mm, then HugeTLB wouldn't have a special case
> at all here.

I see what you mean now, thanks.  That looks fine.  I just suspect the
pte_special trick will still be needed if this will start to apply to HGM,
as it seems to not suite perfectly with a large folio size, still.  The
MADV_DONTNEED worst case of having it loop over ~folio_size() times of none
pte is still possible.

-- 
Peter Xu



      reply	other threads:[~2023-03-09 19:29 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-06 23:00 James Houghton
2023-03-06 23:00 ` [PATCH 1/2] mm: rmap: make hugetlb pages participate in _nr_pages_mapped James Houghton
2023-03-07 21:54   ` Mike Kravetz
2023-03-08  0:36     ` James Houghton
2023-03-08 21:56       ` Peter Xu
2023-03-09 19:58         ` James Houghton
2023-03-06 23:00 ` [PATCH 2/2] mm: rmap: increase COMPOUND_MAPPED to support 512G HugeTLB pages James Houghton
2023-03-08 22:10 ` [PATCH 0/2] mm: rmap: merge HugeTLB mapcount logic with THPs Peter Xu
2023-03-09 18:05   ` James Houghton
2023-03-09 19:29     ` Peter Xu [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZAozq9a25utZtsaD@x1n \
    --to=peterx@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=axelrasmussen@google.com \
    --cc=david@redhat.com \
    --cc=hughd@google.com \
    --cc=jiaqiyan@google.com \
    --cc=jthoughton@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mike.kravetz@oracle.com \
    --cc=rientjes@google.com \
    --cc=songmuchun@bytedance.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox