From: Yu Zhao <yuzhao@google.com>
To: Muchun Song <muchun.song@linux.dev>,
Frank van der Linden <fvdl@google.com>,
David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>,
Jane Chu <jane.chu@oracle.com>, Will Deacon <will@kernel.org>,
Nanyong Sun <sunnanyong@huawei.com>,
Catalin Marinas <catalin.marinas@arm.com>,
akpm@linux-foundation.org, anshuman.khandual@arm.com,
wangkefeng.wang@huawei.com,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
Date: Wed, 5 Jun 2024 14:50:44 -0600
Message-ID: <CAOUHufYF2E-hM-u8eZc+APZAsMX3pOAmto7kB3orH5_MRgvSkA@mail.gmail.com>
In-Reply-To: <917FFC7F-0615-44DD-90EE-9F85F8EA9974@linux.dev>
On Sun, Feb 11, 2024 at 5:00 AM Muchun Song <muchun.song@linux.dev> wrote:
>
> > On Feb 8, 2024, at 23:49, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Wed, Feb 07, 2024 at 06:24:52PM -0800, Jane Chu wrote:
> >> On 2/7/2024 6:17 AM, Matthew Wilcox wrote:
> >>> While this array of ~512 pages has been allocated to hugetlbfs, one
> >>> would think that there could be no remaining references to them, yet
> >>> another CPU can still have a pointer to one of these struct pages
> >>> (eg attempting a speculative page cache reference or
> >>> get_user_pages_fast()). That means it will try to call
> >>> atomic_add_unless(&page->_refcount, 1, 0);
> >>>
> >>> Actually, I wonder if this isn't a problem on x86 too? Do we need to
> >>> explicitly go through an RCU grace period before freeing the pages
> >>> for use by somebody else?
> >>>
> >> Sorry, not sure what I'm missing, please help.
> >
> > Having written out the analysis, I now think it can't happen on x86,
> > but let's walk through it because it's non-obvious (and I think it
> > illustrates what people are afraid of on Arm).
> >
> > CPU A calls either get_user_pages_fast() or __filemap_get_folio().
> > Let's do the latter this time.
> >
> > folio = filemap_get_entry(mapping, index);
> >
> > filemap_get_entry:
> >     rcu_read_lock();
> >     folio = xas_load(&xas);
> >     if (!folio_try_get_rcu(folio))
> >         goto repeat;
> >     if (unlikely(folio != xas_reload(&xas))) {
> >         folio_put(folio);
> >         goto repeat;
> >     }
> >
> > folio_try_get_rcu:
> >     folio_ref_try_add_rcu(folio, 1);
> >
> > folio_ref_try_add_rcu:
> >     if (unlikely(!folio_ref_add_unless(folio, count, 0))) {
> >         /* Either the folio has been freed, or will be freed. */
> >         return false;
> >     }
> >
> > folio_ref_add_unless:
> >     return page_ref_add_unless(&folio->page, nr, u);
> >
> > page_ref_add_unless:
> >     atomic_add_unless(&page->_refcount, nr, u);
> >
> > A rather deep callchain there, but for our purposes the important part
> > is: we take the RCU read lock, we look up a folio, we increment its
> > refcount if it's not zero, then check that looking up this index gets
> > the same folio; if it doesn't, we decrement the refcount again and retry
> > the lookup.
> >
> > For this analysis, we can be preempted at any point after we've got the
> > folio pointer from xa_load().
> >
> >> From hugetlb allocation perspective, one of the scenarios is run time
> >> hugetlb page allocation (say 2M pages), starting from the buddy allocator
> >> returns compound pages, then the head page is set to frozen, then the
> >> folio(compound pages) is put thru the HVO process, one of which is
> >> vmemmap_split_pmd() in case a vmemmap page is a PMD page.
> >>
> >> Until the HVO process completes, none of the vmemmap represented pages are
> >> available to any threads, so what are the causes for IRQ threads to access
> >> their vmemmap pages?
> >
> > Yup, this sounds like enough, but it's not. The problem is the person
> > who's looking up the folio in the pagecache under RCU. They've got
> > the folio pointer and have been preempted. So now what happens to our
> > victim folio?
> >
> > Something happens to remove it from the page cache. Maybe the file is
> > truncated, perhaps vmscan comes along and kicks it out. Either way, it's
> > removed from the xarray and gets its refcount set to 0. If the lookup
> > were to continue at this time, everything would be fine because it would
> > see a refcount of 0 and not increment it (in page_ref_add_unless()).
> > And this is where my analysis of RCU tends to go wrong, because I only
> > think of interleaving event A and B. I don't think about B and then C
> > happening before A resumes. But it can! Let's follow the journey of
> > this struct page.
> >
> > Now that it's been removed from the page cache, it's allocated by hugetlb,
> > as you describe. And it's one of the tail pages towards the end of
> > the 512 contiguous struct pages. That means that we alter vmemmap so
> > that the pointer to struct page now points to a different struct page
> > (one of the earlier ones). Then the original page of vmemmap containing
> > our lucky struct page is returned to the page allocator. At this point,
> > it no longer contains struct pages; it can contain literally anything.
> >
> > Where my analysis went wrong was that CPU A _no longer has a pointer
> > to it_. CPU A has a pointer into vmemmap. So it will access the
> > replacement struct page (which definitely has a refcount 0) instead of
> > the one which has been freed. I had thought that CPU A would access the
> > original memory which has now been allocated to someone else. But no,
> > it can't because its pointer is virtual, not physical.
> >
> >
> > ---
> >
> > Now I'm thinking more about this and there's another scenario which I
> > thought might go wrong, and doesn't. For 7 of the 512 pages which are
> > freed, the struct page pointer gathered by CPU A will not point to a
> > page with a refcount of 0. Instead it will point to an alias of the
> > head page with a positive refcount. For those pages, CPU A will see
> > folio_try_get_rcu() succeed. Then it will call xas_reload() and see
> > the folio isn't there any more, so it will call folio_put() on something
> > which used to be a folio, and isn't any more.
> >
> > But folio_put() calls folio_put_testzero() which calls put_page_testzero()
> > without asserting that the pointer is actually to a folio.
> > So everything's fine, but really only by coincidence; I don't think
> > anybody's thought about this scenario before (maybe Muchun has, but I
> > don't remember it being discussed).
>
> I have to say it is a really great analysis, I haven't thought about the
> case of get_page_unless_zero() so deeply.
>
> To avoid increasing the refcount of a tail page struct, I made all
> 7 tail pages read-only when I first wrote that code.
I think making tail page metadata RO is a good design choice. Details below.
> But it
> is a real problem, because it will panic (due to the RO permission)
> when the above scenario tries to increase its refcount.
>
> In order to fix the race with __filemap_get_folio(), my first
> thought was to add a synchronize_rcu() after the HVO processing
> and before the page is handed to users. Note that HugeTLB pages
> are frozen before going through the HVO processing, meaning the
> refcounts of all their struct pages are 0. Therefore,
> folio_try_get_rcu() in __filemap_get_folio() will fail unless the
> HugeTLB page has been allocated to the user.
>
> But I realized there are some users who may pass an arbitrary
> page struct (which may be one of those 7 special tail page structs,
> aliases of the head page struct of a HugeTLB page) to the following
> helpers, which could also take a refcount on a tail page struct.
> Those helpers also need to be fixed.
>
> 1) get_page_unless_zero
> 2) folio_try_get
> 3) folio_try_get_rcu
>
> I have checked all the users of 1). If I am not wrong, all of them
> already handle HugeTLB pages before calling get_page_unless_zero().
> Although there is no problem with 1) now, it would be fragile to rely
> on users guaranteeing that they never pass a tail page of a HugeTLB
> page to 1). So I want to change 1) to the following to fix this.
>
> static inline bool get_page_unless_zero(struct page *page)
> {
>         if (page_ref_add_unless(page, 1, 0)) {
>                 /* @page must be a genuine head or alias head page here. */
>                 struct page *head = page_fixed_fake_head(page);
>
>                 if (likely(head == page))
>                         return true;
>                 put_page(head);
>         }
>
>         return false;
> }
>
> 2) and 3) should adopt a similar approach to make sure we cannot
> increase a tail page's refcount. 2) and 3) would look like the
> following (only the key logic is shown):
>
> static inline bool folio_try_get(struct folio *folio) /* and folio_ref_try_add_rcu */
> {
>         if (folio_ref_add_unless(folio, 1, 0)) {
>                 struct folio *genuine = page_folio(&folio->page);
>
>                 if (likely(genuine == folio))
>                         return true;
>                 folio_put(genuine);
>         }
>
>         return false;
> }
>
> Additionally, we should also change the permission of those 7 tail
> pages from RO to RW to avoid the panic().
We can use RCU, which IMO is a better choice, as follows:

get_page_unless_zero()
{
        bool rc = false;

        rcu_read_lock();

        if (page_is_fake_head(page) || !page_ref_count(page)) {
                smp_mb(); /* implied by atomic_add_unless() */
                goto unlock;
        }

        rc = page_ref_add_unless(page, 1, 0);
unlock:
        rcu_read_unlock();

        return rc;
}

And on the HVO/de-HVO side:

folio_ref_unfreeze();
synchronize_rcu();
HVO/de-HVO;
I think this is a lot better than making tail page metadata RW because:
1. it helps debugging, IMO, a lot;
2. I don't think HVO is the only one that needs this.
David (we missed you in today's THP meeting),
Please correct me if I'm wrong -- I think virtio-mem also suffers from
the same problem when freeing offlined struct pages, since I wasn't
able to find anything that would prevent a **speculative** struct page
walker from trying to access struct pages belonging to pages being
concurrently offlined.
If this is true, we might want to map a "zero struct page" rather than
leave a hole in vmemmap when offlining pages. And the logic on the hot
removal side would be similar to that of HVO.
> There is no problem in the following helpers, since all of them already
> handle the HVO case through _compound_head(); they will get the
> __genuine__ head page struct and increase its refcount.
>
> 1) try_get_page
> 2) folio_get
> 3) get_page
>
> Just some thoughts of mine; maybe you have simpler and more graceful
> approaches. Comments are welcome.
>
> Muchun,
> Thanks.
>
>
Thread overview: 43+ messages
2024-01-13 9:44 Nanyong Sun
2024-01-13 9:44 ` [PATCH v3 1/3] mm: HVO: introduce helper function to update and flush pgtable Nanyong Sun
2024-01-13 9:44 ` [PATCH v3 2/3] arm64: mm: HVO: support BBM of vmemmap pgtable safely Nanyong Sun
2024-01-15 2:38 ` Muchun Song
2024-02-07 12:21 ` Mark Rutland
2024-02-08 9:30 ` Nanyong Sun
2024-01-13 9:44 ` [PATCH v3 3/3] arm64: mm: Re-enable OPTIMIZE_HUGETLB_VMEMMAP Nanyong Sun
2024-01-25 18:06 ` [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize Catalin Marinas
2024-01-27 5:04 ` Nanyong Sun
2024-02-07 11:12 ` Will Deacon
2024-02-07 11:21 ` Matthew Wilcox
2024-02-07 12:11 ` Will Deacon
2024-02-07 12:24 ` Mark Rutland
2024-02-07 14:17 ` Matthew Wilcox
2024-02-08 2:24 ` Jane Chu
2024-02-08 15:49 ` Matthew Wilcox
2024-02-08 19:21 ` Jane Chu
2024-02-11 11:59 ` Muchun Song
2024-06-05 20:50 ` Yu Zhao [this message]
2024-06-06 8:30 ` David Hildenbrand
2024-06-07 16:55 ` Frank van der Linden
2024-02-07 12:20 ` Catalin Marinas
2024-02-08 9:44 ` Nanyong Sun
2024-02-08 13:17 ` Will Deacon
2024-03-13 23:32 ` David Rientjes
2024-03-25 15:24 ` Nanyong Sun
2024-03-26 12:54 ` Will Deacon
2024-06-24 5:39 ` Yu Zhao
2024-06-27 14:33 ` Nanyong Sun
2024-06-27 21:03 ` Yu Zhao
2024-07-04 11:47 ` Nanyong Sun
2024-07-04 19:45 ` Yu Zhao
2024-02-07 12:44 ` Catalin Marinas
2024-06-27 21:19 ` Yu Zhao
2024-07-05 15:49 ` Catalin Marinas
2024-07-05 17:41 ` Yu Zhao
2024-07-10 16:51 ` Catalin Marinas
2024-07-10 17:12 ` Yu Zhao
2024-07-10 22:29 ` Catalin Marinas
2024-07-10 23:07 ` Yu Zhao
2024-07-11 8:31 ` Yu Zhao
2024-07-11 11:39 ` Catalin Marinas
2024-07-11 17:38 ` Yu Zhao