From: Jane Chu <jane.chu@oracle.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: Will Deacon <will@kernel.org>,
	Nanyong Sun <sunnanyong@huawei.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	muchun.song@linux.dev, akpm@linux-foundation.org,
	anshuman.khandual@arm.com, wangkefeng.wang@huawei.com,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
Date: Thu, 8 Feb 2024 11:21:19 -0800	[thread overview]
Message-ID: <78eee4ef-99ed-46a3-a776-a74bcd83ba44@oracle.com> (raw)
In-Reply-To: <ZcT4DH7VE1XLBvVc@casper.infradead.org>

On 2/8/2024 7:49 AM, Matthew Wilcox wrote:

> On Wed, Feb 07, 2024 at 06:24:52PM -0800, Jane Chu wrote:
>> On 2/7/2024 6:17 AM, Matthew Wilcox wrote:
>>> While this array of ~512 pages has been allocated to hugetlbfs, and one
>>> would think that there would be no way that there could still be
>>> references to them, another CPU can have a pointer to this struct page
>>> (eg attempting a speculative page cache reference or
>>> get_user_pages_fast()).  That means it will try to call
>>> atomic_add_unless(&page->_refcount, 1, 0);
>>>
>>> Actually, I wonder if this isn't a problem on x86 too?  Do we need to
>>> explicitly go through an RCU grace period before freeing the pages
>>> for use by somebody else?
>>>
>> Sorry, not sure what I'm missing, please help.
> Having written out the analysis, I now think it can't happen on x86,
> but let's walk through it because it's non-obvious (and I think it
> illustrates what people are afraid of on Arm).
>
> CPU A calls either get_user_pages_fast() or __filemap_get_folio().
> Let's do the latter this time.
>
>          folio = filemap_get_entry(mapping, index);
> filemap_get_entry:
> 	rcu_read_lock();
>          folio = xas_load(&xas);
>          if (!folio_try_get_rcu(folio))
>                  goto repeat;
>          if (unlikely(folio != xas_reload(&xas))) {
>                  folio_put(folio);
>                  goto repeat;
>          }
> folio_try_get_rcu:
> 	folio_ref_try_add_rcu(folio, 1);
> folio_ref_try_add_rcu:
>          if (unlikely(!folio_ref_add_unless(folio, count, 0))) {
>                  /* Either the folio has been freed, or will be freed. */
>                  return false;
> folio_ref_add_unless:
>          return page_ref_add_unless(&folio->page, nr, u);
> page_ref_add_unless:
> 	atomic_add_unless(&page->_refcount, nr, u);
>
> A rather deep callchain there, but for our purposes the important part
> is: we take the RCU read lock, we look up a folio, we increment its
> refcount if it's not zero, then check that looking up this index gets
> the same folio; if it doesn't, we decrement the refcount again and retry
> the lookup.
>
> For this analysis, we can be preempted at any point after we've got the
> folio pointer from xa_load().
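
Let me restate that lookup to make sure I'm reading it right -- condensed
into one place, in my own words (a paraphrase, not the exact kernel source):

        rcu_read_lock();
repeat:
        folio = xas_load(&xas);         /* pointer may already be stale */
        if (folio) {
                /* take a reference only if the refcount is non-zero */
                if (!folio_ref_add_unless(folio, 1, 0))
                        goto repeat;    /* freed, or about to be freed */
                /* re-check that the index still maps to the same folio */
                if (unlikely(folio != xas_reload(&xas))) {
                        folio_put(folio);       /* raced with removal */
                        goto repeat;
                }
        }
        rcu_read_unlock();

and, as you say, we can be preempted anywhere after xas_load() hands back
the pointer.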
>
>> From the hugetlb allocation perspective, one scenario is run-time
>> hugetlb page allocation (say, 2MB pages): the buddy allocator returns
>> compound pages, the head page is set to frozen, and then the folio
>> (compound pages) is put through the HVO process, one step of which is
>> vmemmap_split_pmd() in case a vmemmap page is a PMD page.
>>
>> Until the HVO process completes, none of the pages represented by that
>> vmemmap are available to any thread, so what could cause IRQ threads to
>> access their vmemmap pages?
> Yup, this sounds like enough, but it's not.  The problem is the person
> who's looking up the folio in the pagecache under RCU.  They've got
> the folio pointer and have been preempted.  So now what happens to our
> victim folio?
>
> Something happens to remove it from the page cache.  Maybe the file is
> truncated, perhaps vmscan comes along and kicks it out.  Either way, it's
> removed from the xarray and gets its refcount set to 0.  If the lookup
> were to continue at this time, everything would be fine because it would
> see a refcount of 0 and not increment it (in page_ref_add_unless()).
> And this is where my analysis of RCU tends to go wrong, because I only
> think of interleaving events A and B.  I don't think about B and then C
> happening before A resumes.  But it can!  Let's follow the journey of
> this struct page.
>
> Now that it's been removed from the page cache, it's allocated by hugetlb,
> as you describe.  And it's one of the tail pages towards the end of
> the 512 contiguous struct pages.  That means that we alter vmemmap so
> that the pointer to struct page now points to a different struct page
> (one of the earlier ones).  Then the original page of vmemmap containing
> our lucky struct page is returned to the page allocator.  At this point,
> it no longer contains struct pages; it can contain literally anything.
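
For my own understanding, I picture that remapping step roughly like this
(hand-waved; the real code is vmemmap_remap_free()/vmemmap_remap_pte() in
mm/hugetlb_vmemmap.c, and the variable names below are made up):

        /*
         * A 2MB hugetlb folio has 512 struct pages backed by 8 vmemmap
         * pages.  HVO keeps the first vmemmap page, rewrites the PTEs of
         * the other 7 virtual pages to alias it read-only, and returns
         * the 7 replaced pages to the buddy allocator.
         */
        for (addr = reuse_addr + PAGE_SIZE; addr < end; addr += PAGE_SIZE) {
                pte_t *pte = pte_offset_kernel(pmd, addr);

                /* point this vmemmap virtual page at the reused page, RO */
                set_pte_at(&init_mm, addr, pte,
                           mk_pte(reuse_page, PAGE_KERNEL_RO));
        }
        flush_tlb_kernel_range(reuse_addr + PAGE_SIZE, end);
        /* ...then the replaced vmemmap pages are freed */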
>
> Where my analysis went wrong was that CPU A _no longer has a pointer
> to it_.  CPU A has a pointer into vmemmap.  So it will access the
> replacement struct page (which definitely has a refcount 0) instead of
> the one which has been freed.  I had thought that CPU A would access the
> original memory which has now been allocated to someone else.  But no,
> it can't because its pointer is virtual, not physical.
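
If I follow, the saving grace is that CPU A's saved pointer is a vmemmap
(virtual) address, so when it resumes it sees something like:

        struct page *page = &folio->page;       /* captured before HVO ran */

        /*
         * The PTE behind this vmemmap address has since been rewritten,
         * so the same virtual address now reads the substituted struct
         * page, whose _refcount is 0.  The physical page that used to
         * back it has gone back to the allocator, but CPU A never
         * touches it.
         */
        atomic_add_unless(&page->_refcount, 1, 0);      /* fails, so retry */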
>
>
> ---
>
> Now I'm thinking more about this and there's another scenario which I
> thought might go wrong, and doesn't.  For 7 of the 512 pages which are
> freed, the struct page pointer gathered by CPU A will not point to a
> page with a refcount of 0.  Instead it will point to an alias of the
> head page with a positive refcount.  For those pages, CPU A will see
> folio_try_get_rcu() succeed.  Then it will call xas_reload() and see
> the folio isn't there any more, so it will call folio_put() on something
> which used to be a folio, and isn't any more.
>
> But folio_put() calls folio_put_testzero() which calls put_page_testzero()
> without asserting that the pointer is actually to a folio.
> So everything's fine, but really only by coincidence; I don't think
> anybody's thought about this scenario before (maybe Muchun has, but I
> don't remember it being discussed).

Wow!  Marvelous analysis, thank you!

So is the solution as simple as making folio_put_testzero() check 
whether the folio pointer actually points to a folio?

Or is there more to consider?
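
For reference, the path I am looking at is just (paraphrasing the
include/linux/mm.h helpers, so please correct me if I have it wrong):

        static inline void folio_put(struct folio *folio)
        {
                if (folio_put_testzero(folio))  /* purely a refcount op */
                        __folio_put(folio);
        }

        /*
         * folio_put_testzero() boils down to put_page_testzero(), i.e. an
         * atomic_dec_and_test() on page->_refcount.  Nothing on this path
         * checks that the pointer still refers to a live folio; it just
         * decrements whatever _refcount happens to sit there.
         */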

Thanks a lot!

-jane


