* [PATCH mm-unstable v1] mm/hugetlb_vmemmap: fix memory loads ordering
From: Yu Zhao @ 2025-01-07 4:35 UTC
To: Andrew Morton
Cc: David Hildenbrand, Mateusz Guzik, Matthew Wilcox (Oracle),
Muchun Song, linux-mm, linux-kernel, Yu Zhao, Will Deacon
Using x86_64 as an example, for a 32KB struct page[] area describing a
2MB hugeTLB, HVO reduces the area to 4KB by the following steps:
1. Split the (r/w vmemmap) PMD mapping the area into 512 (r/w) PTEs;
2. For the 8 PTEs mapping the area, remap PTE 1-7 to the page mapped
by PTE 0, and at the same time change the permission from r/w to
r/o;
3. Free the pages PTE 1-7 used to map, hence the reduction from 32KB
to 4KB.
However, the following race can happen due to improper memory load
ordering:
CPU 1 (HVO)                     CPU 2 (speculative PFN walker)

page_ref_freeze()
synchronize_rcu()
                                rcu_read_lock()
                                page_is_fake_head() is false
vmemmap_remap_pte()
XXX: struct page[] becomes r/o

page_ref_unfreeze()
                                page_ref_count() is not zero

                                atomic_add_unless(&page->_refcount)
                                XXX: try to modify r/o struct page[]
Specifically, page_is_fake_head() must be ordered after
page_ref_count() on CPU 2 so that it is guaranteed to return true in
this case, avoiding the later attempt to modify the r/o struct page[].

This patch adds the missing memory barrier and ensures the tests on
page_ref_count() and page_is_fake_head() are done in the proper order.
Fixes: bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
Reported-by: Will Deacon <will@kernel.org>
Closes: https://lore.kernel.org/20241128142028.GA3506@willie-the-truck/
Signed-off-by: Yu Zhao <yuzhao@google.com>
---
include/linux/page-flags.h | 2 +-
include/linux/page_ref.h | 8 ++++++--
2 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 691506bdf2c5..6b8ecf86f1b6 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -212,7 +212,7 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
* cold cacheline in some cases.
*/
if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
- test_bit(PG_head, &page->flags)) {
+ test_bit_acquire(PG_head, &page->flags)) {
/*
* We can safely access the field of the @page[1] with PG_head
* because the @page is a compound page composed with at least
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 8c236c651d1d..5becea98bd79 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -233,8 +233,12 @@ static inline bool page_ref_add_unless(struct page *page, int nr, int u)
bool ret = false;
rcu_read_lock();
- /* avoid writing to the vmemmap area being remapped */
- if (!page_is_fake_head(page) && page_ref_count(page) != u)
+ /*
+ * To avoid writing to the vmemmap area remapped into r/o in parallel,
+ * the page_ref_count() test must precede the page_is_fake_head() test
+ * so that test_bit_acquire() in the latter is ordered after the former.
+ */
+ if (page_ref_count(page) != u && !page_is_fake_head(page))
ret = atomic_add_unless(&page->_refcount, nr, u);
rcu_read_unlock();
--
2.47.1.613.gc27f4b7a9f-goog
* Re: [PATCH mm-unstable v1] mm/hugetlb_vmemmap: fix memory loads ordering
From: Muchun Song @ 2025-01-07 8:41 UTC
To: Yu Zhao
Cc: Andrew Morton, David Hildenbrand, Mateusz Guzik,
Matthew Wilcox (Oracle),
linux-mm, linux-kernel, Will Deacon
> On Jan 7, 2025, at 12:35, Yu Zhao <yuzhao@google.com> wrote:
>
> Using x86_64 as an example, for a 32KB struct page[] area describing a
> 2MB hugeTLB, HVO reduces the area to 4KB by the following steps:
> 1. Split the (r/w vmemmap) PMD mapping the area into 512 (r/w) PTEs;
> 2. For the 8 PTEs mapping the area, remap PTE 1-7 to the page mapped
> by PTE 0, and at the same time change the permission from r/w to
> r/o;
> 3. Free the pages PTE 1-7 used to map, hence the reduction from 32KB
> to 4KB.
>
> However, the following race can happen due to improperly memory loads
> ordering:
> CPU 1 (HVO)                     CPU 2 (speculative PFN walker)
>
> page_ref_freeze()
> synchronize_rcu()
>                                 rcu_read_lock()
>                                 page_is_fake_head() is false
> vmemmap_remap_pte()
> XXX: struct page[] becomes r/o
>
> page_ref_unfreeze()
>                                 page_ref_count() is not zero
>
>                                 atomic_add_unless(&page->_refcount)
>                                 XXX: try to modify r/o struct page[]
>
> Specifically, page_is_fake_head() must be ordered after
> page_ref_count() on CPU 2 so that it can only return true for this
> case, to avoid the later attempt to modify r/o struct page[].
>
> This patch adds the missing memory barrier and makes the tests on
> page_is_fake_head() and page_ref_count() done in the proper order.
>
> Fixes: bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> Reported-by: Will Deacon <will@kernel.org>
> Closes: https://lore.kernel.org/20241128142028.GA3506@willie-the-truck/
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
> include/linux/page-flags.h | 2 +-
> include/linux/page_ref.h | 8 ++++++--
> 2 files changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 691506bdf2c5..6b8ecf86f1b6 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -212,7 +212,7 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
> * cold cacheline in some cases.
> */
> if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
> - test_bit(PG_head, &page->flags)) {
> + test_bit_acquire(PG_head, &page->flags)) {
> /*
> * We can safely access the field of the @page[1] with PG_head
> * because the @page is a compound page composed with at least
> diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
> index 8c236c651d1d..5becea98bd79 100644
> --- a/include/linux/page_ref.h
> +++ b/include/linux/page_ref.h
> @@ -233,8 +233,12 @@ static inline bool page_ref_add_unless(struct page *page, int nr, int u)
> bool ret = false;
>
> rcu_read_lock();
> - /* avoid writing to the vmemmap area being remapped */
> - if (!page_is_fake_head(page) && page_ref_count(page) != u)
> + /*
> + * To avoid writing to the vmemmap area remapped into r/o in parallel,
> + * the page_ref_count() test must precede the page_is_fake_head() test
> + * so that test_bit_acquire() in the latter is ordered after the former.
> + */
> + if (page_ref_count(page) != u && !page_is_fake_head(page))
IIUC, we need to insert a memory barrier between page_ref_count() and page_is_fake_head(),
specifically between the accesses to page->_refcount and page->flags. So we should insert a
read memory barrier here, right? But I see you added an acquire barrier in page_fixed_fake_head(),
and I don't understand how an acquire barrier can stop the CPU from reordering the accesses
between them. What am I missing here?
Muchun,
Thanks.
> ret = atomic_add_unless(&page->_refcount, nr, u);
> rcu_read_unlock();
>
> --
> 2.47.1.613.gc27f4b7a9f-goog
>
* Re: [PATCH mm-unstable v1] mm/hugetlb_vmemmap: fix memory loads ordering
From: David Hildenbrand @ 2025-01-07 8:49 UTC
To: Yu Zhao, Andrew Morton
Cc: Mateusz Guzik, Matthew Wilcox (Oracle),
Muchun Song, linux-mm, linux-kernel, Will Deacon
On 07.01.25 05:35, Yu Zhao wrote:
> Using x86_64 as an example, for a 32KB struct page[] area describing a
> 2MB hugeTLB, HVO reduces the area to 4KB by the following steps:
> 1. Split the (r/w vmemmap) PMD mapping the area into 512 (r/w) PTEs;
> 2. For the 8 PTEs mapping the area, remap PTE 1-7 to the page mapped
> by PTE 0, and at the same time change the permission from r/w to
> r/o;
> 3. Free the pages PTE 1-7 used to map, hence the reduction from 32KB
> to 4KB.
>
> However, the following race can happen due to improperly memory loads
> ordering:
> CPU 1 (HVO)                     CPU 2 (speculative PFN walker)
>
> page_ref_freeze()
> synchronize_rcu()
>                                 rcu_read_lock()
>                                 page_is_fake_head() is false
> vmemmap_remap_pte()
> XXX: struct page[] becomes r/o
>
> page_ref_unfreeze()
>                                 page_ref_count() is not zero
>
>                                 atomic_add_unless(&page->_refcount)
>                                 XXX: try to modify r/o struct page[]
>
> Specifically, page_is_fake_head() must be ordered after
> page_ref_count() on CPU 2 so that it can only return true for this
> case, to avoid the later attempt to modify r/o struct page[].
I *think* this is correct.
>
> This patch adds the missing memory barrier and makes the tests on
> page_is_fake_head() and page_ref_count() done in the proper order.
>
> Fixes: bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> Reported-by: Will Deacon <will@kernel.org>
> Closes: https://lore.kernel.org/20241128142028.GA3506@willie-the-truck/
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> ---
> include/linux/page-flags.h | 2 +-
> include/linux/page_ref.h | 8 ++++++--
> 2 files changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 691506bdf2c5..6b8ecf86f1b6 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -212,7 +212,7 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
> * cold cacheline in some cases.
> */
> if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
> - test_bit(PG_head, &page->flags)) {
> + test_bit_acquire(PG_head, &page->flags)) {
This change will affect all page_fixed_fake_head() users, like ordinary
PageTail even on !hugetlb.
I assume you want an explicit memory barrier in the single problematic
caller instead.
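Something along those lines might look like the sketch below (untested,
just to illustrate the idea of leaving test_bit() alone and ordering the
two loads in page_ref_add_unless() with an explicit read barrier):

	rcu_read_lock();
	if (page_ref_count(page) != u) {
		/*
		 * Order the PG_head test in page_is_fake_head() after the
		 * refcount test, so a page whose vmemmap was remapped r/o
		 * by HVO is seen as a fake head before we try to write to
		 * its refcount.
		 */
		smp_rmb();
		if (!page_is_fake_head(page))
			ret = atomic_add_unless(&page->_refcount, nr, u);
	}
	rcu_read_unlock();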
--
Cheers,
David / dhildenb
* Re: [PATCH mm-unstable v1] mm/hugetlb_vmemmap: fix memory loads ordering
From: Matthew Wilcox @ 2025-01-07 16:35 UTC
To: David Hildenbrand
Cc: Yu Zhao, Andrew Morton, Mateusz Guzik, Muchun Song, linux-mm,
linux-kernel, Will Deacon
On Tue, Jan 07, 2025 at 09:49:18AM +0100, David Hildenbrand wrote:
> > +++ b/include/linux/page-flags.h
> > @@ -212,7 +212,7 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
> > * cold cacheline in some cases.
> > */
> > if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
> > - test_bit(PG_head, &page->flags)) {
> > + test_bit_acquire(PG_head, &page->flags)) {
>
> This change will affect all page_fixed_fake_head() users, like ordinary
> PageTail even on !hugetlb.
I've been looking at the callers of PageTail() because it's going to
be a bit of a weird thing to be checking in the separate-page-and-folio
world. Obviously we can implement it, but there's a bit of a "But why
would you want to ask that question" question.
Most current occurrences of PageTail() are in assertions of one form or
another. Fair enough, not performance critical.
make_device_exclusive_range() is a little weird; looks like it's trying
to make sure that each folio is only made exclusive once, and ignore any
partial folios which overlap the start of the area.
damon_get_folio() wants to fail for tail pages. Fair enough.
split_huge_pages_all() is debug code.
page_idle_get_folio() is like damon.
That's it. We don't seem to have any PageTail() callers in critical
code any more.
* Re: [PATCH mm-unstable v1] mm/hugetlb_vmemmap: fix memory loads ordering
From: David Hildenbrand @ 2025-01-07 17:02 UTC
To: Matthew Wilcox
Cc: Yu Zhao, Andrew Morton, Mateusz Guzik, Muchun Song, linux-mm,
linux-kernel, Will Deacon
On 07.01.25 17:35, Matthew Wilcox wrote:
> On Tue, Jan 07, 2025 at 09:49:18AM +0100, David Hildenbrand wrote:
>>> +++ b/include/linux/page-flags.h
>>> @@ -212,7 +212,7 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
>>> * cold cacheline in some cases.
>>> */
>>> if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
>>> - test_bit(PG_head, &page->flags)) {
>>> + test_bit_acquire(PG_head, &page->flags)) {
>>
>> This change will affect all page_fixed_fake_head() users, like ordinary
>> PageTail even on !hugetlb.
>
> I've been looking at the callers of PageTail() because it's going to
> be a bit of a weird thing to be checking in the separate-page-and-folio
> world. Obviously we can implement it, but there's a bit of a "But why
> would you want to ask that question" question.
>
> Most current occurrences of PageTail() are in assertions of one form or
> another. Fair enough, not performance critical.
>
> make_device_exclusive_range() is a little weird; looks like it's trying
> to make sure that each folio is only made exclusive once, and ignore any
> partial folios which overlap the start of the area.
I could have sworn we only support small folios here, but it looks like
we do support large folios.

IIUC, there is no way to reliably identify "this folio is device exclusive";
the only hint is "no mappings". The following might do:
diff --git a/mm/rmap.c b/mm/rmap.c
index c6c4d4ea29a7e..1424d0a351a86 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2543,7 +2543,13 @@ int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
for (i = 0; i < npages; i++, start += PAGE_SIZE) {
struct folio *folio = page_folio(pages[i]);
- if (PageTail(pages[i]) || !folio_trylock(folio)) {
+
+ /*
+ * If there are no mappings, either the folio is actually
+ * unmapped or only device-exclusive swap entries point at
+ * this folio.
+ */
+ if (!folio_mapped(folio) || !folio_trylock(folio)) {
folio_put(folio);
pages[i] = NULL;
continue;
>
> damon_get_folio() wants to fail for tail pages. Fair enough.
>
> split_huge_pages_all() is debug code.
>
> page_idle_get_folio() is like damon.
>
> That's it. We don't seem to have any PageTail() callers in critical
> code any more.
Ah, you're right. Interestingly, PageTransTail() is even unused?
--
Cheers,
David / dhildenb
* Re: [PATCH mm-unstable v1] mm/hugetlb_vmemmap: fix memory loads ordering
From: Yu Zhao @ 2025-01-08 7:32 UTC
To: Muchun Song
Cc: Andrew Morton, David Hildenbrand, Mateusz Guzik,
Matthew Wilcox (Oracle),
linux-mm, linux-kernel, Will Deacon
On Tue, Jan 7, 2025 at 1:41 AM Muchun Song <muchun.song@linux.dev> wrote:
>
>
>
> > On Jan 7, 2025, at 12:35, Yu Zhao <yuzhao@google.com> wrote:
> >
> > Using x86_64 as an example, for a 32KB struct page[] area describing a
> > 2MB hugeTLB, HVO reduces the area to 4KB by the following steps:
> > 1. Split the (r/w vmemmap) PMD mapping the area into 512 (r/w) PTEs;
> > 2. For the 8 PTEs mapping the area, remap PTE 1-7 to the page mapped
> > by PTE 0, and at the same time change the permission from r/w to
> > r/o;
> > 3. Free the pages PTE 1-7 used to map, hence the reduction from 32KB
> > to 4KB.
> >
> > However, the following race can happen due to improperly memory loads
> > ordering:
> > CPU 1 (HVO)                     CPU 2 (speculative PFN walker)
> >
> > page_ref_freeze()
> > synchronize_rcu()
> >                                 rcu_read_lock()
> >                                 page_is_fake_head() is false
> > vmemmap_remap_pte()
> > XXX: struct page[] becomes r/o
> >
> > page_ref_unfreeze()
> >                                 page_ref_count() is not zero
> >
> >                                 atomic_add_unless(&page->_refcount)
> >                                 XXX: try to modify r/o struct page[]
> >
> > Specifically, page_is_fake_head() must be ordered after
> > page_ref_count() on CPU 2 so that it can only return true for this
> > case, to avoid the later attempt to modify r/o struct page[].
> >
> > This patch adds the missing memory barrier and makes the tests on
> > page_is_fake_head() and page_ref_count() done in the proper order.
> >
> > Fixes: bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> > Reported-by: Will Deacon <will@kernel.org>
> > Closes: https://lore.kernel.org/20241128142028.GA3506@willie-the-truck/
> > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > ---
> > include/linux/page-flags.h | 2 +-
> > include/linux/page_ref.h | 8 ++++++--
> > 2 files changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 691506bdf2c5..6b8ecf86f1b6 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -212,7 +212,7 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
> > * cold cacheline in some cases.
> > */
> > if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
> > - test_bit(PG_head, &page->flags)) {
> > + test_bit_acquire(PG_head, &page->flags)) {
> > /*
> > * We can safely access the field of the @page[1] with PG_head
> > * because the @page is a compound page composed with at least
> > diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
> > index 8c236c651d1d..5becea98bd79 100644
> > --- a/include/linux/page_ref.h
> > +++ b/include/linux/page_ref.h
> > @@ -233,8 +233,12 @@ static inline bool page_ref_add_unless(struct page *page, int nr, int u)
> > bool ret = false;
> >
> > rcu_read_lock();
> > - /* avoid writing to the vmemmap area being remapped */
> > - if (!page_is_fake_head(page) && page_ref_count(page) != u)
> > + /*
> > + * To avoid writing to the vmemmap area remapped into r/o in parallel,
> > + * the page_ref_count() test must precede the page_is_fake_head() test
> > + * so that test_bit_acquire() in the latter is ordered after the former.
> > + */
> > + if (page_ref_count(page) != u && !page_is_fake_head(page))
>
> IIUC, we need to insert a memory barrier between page_ref_count() and page_is_fake_head().
> Specifically, accessing between page->_refcount and page->flags. So we should insert a
> read memory barrier here, right?
Correct, i.e., page_ref_count(page) != u; smp_rmb(); !page_is_fake_head(page).
> But I saw you added an acquire barrier in page_fixed_fake_head(),
> I don't understand why an acquire barrier could stop the CPU reordering the accessing
> between them. What am I missing here?
A load-acquire on page->_refcount would be equivalent to the smp_rmb()
above. But apparently I used it on page->flags because I misremembered
whether a load-acquire acts like an smp_rmb() before or after the load
(it's after, not before). Will fix this in v2.
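
In other words, v2 might read roughly like the following (only a sketch
of the intent, not the actual patch):

	static inline bool page_ref_add_unless(struct page *page, int nr, int u)
	{
		bool ret = false;

		rcu_read_lock();
		/*
		 * The acquire load of page->_refcount orders the later
		 * PG_head load in page_is_fake_head() after it, which is
		 * the same ordering the smp_rmb() above would provide.
		 */
		if (atomic_read_acquire(&page->_refcount) != u &&
		    !page_is_fake_head(page))
			ret = atomic_add_unless(&page->_refcount, nr, u);
		rcu_read_unlock();

		return ret;
	}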
* Re: [PATCH mm-unstable v1] mm/hugetlb_vmemmap: fix memory loads ordering
From: Yu Zhao @ 2025-01-08 7:34 UTC
To: David Hildenbrand
Cc: Andrew Morton, Mateusz Guzik, Matthew Wilcox (Oracle),
Muchun Song, linux-mm, linux-kernel, Will Deacon
On Tue, Jan 7, 2025 at 1:49 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 07.01.25 05:35, Yu Zhao wrote:
> > Using x86_64 as an example, for a 32KB struct page[] area describing a
> > 2MB hugeTLB, HVO reduces the area to 4KB by the following steps:
> > 1. Split the (r/w vmemmap) PMD mapping the area into 512 (r/w) PTEs;
> > 2. For the 8 PTEs mapping the area, remap PTE 1-7 to the page mapped
> > by PTE 0, and at the same time change the permission from r/w to
> > r/o;
> > 3. Free the pages PTE 1-7 used to map, hence the reduction from 32KB
> > to 4KB.
> >
> > However, the following race can happen due to improperly memory loads
> > ordering:
> > CPU 1 (HVO)                     CPU 2 (speculative PFN walker)
> >
> > page_ref_freeze()
> > synchronize_rcu()
> >                                 rcu_read_lock()
> >                                 page_is_fake_head() is false
> > vmemmap_remap_pte()
> > XXX: struct page[] becomes r/o
> >
> > page_ref_unfreeze()
> >                                 page_ref_count() is not zero
> >
> >                                 atomic_add_unless(&page->_refcount)
> >                                 XXX: try to modify r/o struct page[]
> >
> > Specifically, page_is_fake_head() must be ordered after
> > page_ref_count() on CPU 2 so that it can only return true for this
> > case, to avoid the later attempt to modify r/o struct page[].
>
> I *think* this is correct.
>
> >
> > This patch adds the missing memory barrier and makes the tests on
> > page_is_fake_head() and page_ref_count() done in the proper order.
> >
> > Fixes: bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers")
> > Reported-by: Will Deacon <will@kernel.org>
> > Closes: https://lore.kernel.org/20241128142028.GA3506@willie-the-truck/
> > Signed-off-by: Yu Zhao <yuzhao@google.com>
> > ---
> > include/linux/page-flags.h | 2 +-
> > include/linux/page_ref.h | 8 ++++++--
> > 2 files changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 691506bdf2c5..6b8ecf86f1b6 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -212,7 +212,7 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
> > * cold cacheline in some cases.
> > */
> > if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
> > - test_bit(PG_head, &page->flags)) {
> > + test_bit_acquire(PG_head, &page->flags)) {
>
> This change will affect all page_fixed_fake_head() users, like ordinary
> PageTail even on !hugetlb.
>
> I assume you want an explicit memory barrier in the single problematic
> caller instead.
Let me make it HVO specific in v2. It might look cleaner that way.
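
For example (purely an illustration of what "HVO specific" could mean
here, not the actual v2): the barrier only matters when HVO can remap
struct pages r/o, so it could be confined to that configuration.
page_is_fake_head() already compiles to a constant false without
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP, so only the barrier needs the
guard:

	rcu_read_lock();
	if (page_ref_count(page) != u) {
		/* Only HVO remaps the vmemmap r/o. */
		if (IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP))
			smp_rmb();	/* order the PG_head test after the refcount test */
		if (!page_is_fake_head(page))
			ret = atomic_add_unless(&page->_refcount, nr, u);
	}
	rcu_read_unlock();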
* Re: [PATCH mm-unstable v1] mm/hugetlb_vmemmap: fix memory loads ordering
From: David Hildenbrand @ 2025-01-10 19:04 UTC
To: Matthew Wilcox
Cc: Yu Zhao, Andrew Morton, Mateusz Guzik, Muchun Song, linux-mm,
linux-kernel, Will Deacon
On 07.01.25 18:02, David Hildenbrand wrote:
> On 07.01.25 17:35, Matthew Wilcox wrote:
>> On Tue, Jan 07, 2025 at 09:49:18AM +0100, David Hildenbrand wrote:
>>>> +++ b/include/linux/page-flags.h
>>>> @@ -212,7 +212,7 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
>>>> * cold cacheline in some cases.
>>>> */
>>>> if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
>>>> - test_bit(PG_head, &page->flags)) {
>>>> + test_bit_acquire(PG_head, &page->flags)) {
>>>
>>> This change will affect all page_fixed_fake_head() users, like ordinary
>>> PageTail even on !hugetlb.
>>
>> I've been looking at the callers of PageTail() because it's going to
>> be a bit of a weird thing to be checking in the separate-page-and-folio
>> world. Obviously we can implement it, but there's a bit of a "But why
>> would you want to ask that question" question.
>>
>> Most current occurrences of PageTail() are in assertions of one form or
>> another. Fair enough, not performance critical.
>>
>> make_device_exclusive_range() is a little weird; looks like it's trying
>> to make sure that each folio is only made exclusive once, and ignore any
>> partial folios which overlap the start of the area.
>
> I could have sworn we only support small folios here, but looks like
> we do support large folios.
>
> IIUC, there is no way to identify reliably "this folio is device exclusive",
> the only hint is "no mappings". The following might do:
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index c6c4d4ea29a7e..1424d0a351a86 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2543,7 +2543,13 @@ int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
>
> for (i = 0; i < npages; i++, start += PAGE_SIZE) {
> struct folio *folio = page_folio(pages[i]);
> - if (PageTail(pages[i]) || !folio_trylock(folio)) {
> +
> + /*
> + * If there are no mappings, either the folio is actually
> + * unmapped or only device-exclusive swap entries point at
> + * this folio.
> + */
> + if (!folio_mapped(folio) || !folio_trylock(folio)) {
> folio_put(folio);
> pages[i] = NULL;
> continue;
I stared longer at this, and I'm not sure that will work.
The PageTail() is in place because we return with the folio locked on
success, so we won't trylock again on tail pages.
But staring at page_make_device_exclusive_one(), I am not sure if it
does what we want in all cases ...
... and the hmm selftests just keep failing upstream as well?! huh. :)
I'll try spending some time on this to see if I can grasp what needs to
be done and how it could be handled ... better.
--
Cheers,
David / dhildenb
* Re: [PATCH mm-unstable v1] mm/hugetlb_vmemmap: fix memory loads ordering
From: David Hildenbrand @ 2025-01-10 19:17 UTC
To: Matthew Wilcox
Cc: Yu Zhao, Andrew Morton, Mateusz Guzik, Muchun Song, linux-mm,
linux-kernel, Will Deacon
On 10.01.25 20:04, David Hildenbrand wrote:
> On 07.01.25 18:02, David Hildenbrand wrote:
>> On 07.01.25 17:35, Matthew Wilcox wrote:
>>> On Tue, Jan 07, 2025 at 09:49:18AM +0100, David Hildenbrand wrote:
>>>>> +++ b/include/linux/page-flags.h
>>>>> @@ -212,7 +212,7 @@ static __always_inline const struct page *page_fixed_fake_head(const struct page
>>>>> * cold cacheline in some cases.
>>>>> */
>>>>> if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
>>>>> - test_bit(PG_head, &page->flags)) {
>>>>> + test_bit_acquire(PG_head, &page->flags)) {
>>>>
>>>> This change will affect all page_fixed_fake_head() users, like ordinary
>>>> PageTail even on !hugetlb.
>>>
>>> I've been looking at the callers of PageTail() because it's going to
>>> be a bit of a weird thing to be checking in the separate-page-and-folio
>>> world. Obviously we can implement it, but there's a bit of a "But why
>>> would you want to ask that question" question.
>>>
>>> Most current occurrences of PageTail() are in assertions of one form or
>>> another. Fair enough, not performance critical.
>>>
>>> make_device_exclusive_range() is a little weird; looks like it's trying
>>> to make sure that each folio is only made exclusive once, and ignore any
>>> partial folios which overlap the start of the area.
>>
>> I could have sworn we only support small folios here, but looks like
>> we do support large folios.
>>
>> IIUC, there is no way to identify reliably "this folio is device exclusive",
>> the only hint is "no mappings". The following might do:
>>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index c6c4d4ea29a7e..1424d0a351a86 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -2543,7 +2543,13 @@ int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
>>
>> for (i = 0; i < npages; i++, start += PAGE_SIZE) {
>> struct folio *folio = page_folio(pages[i]);
>> - if (PageTail(pages[i]) || !folio_trylock(folio)) {
>> +
>> + /*
>> + * If there are no mappings, either the folio is actually
>> + * unmapped or only device-exclusive swap entries point at
>> + * this folio.
>> + */
>> + if (!folio_mapped(folio) || !folio_trylock(folio)) {
>> folio_put(folio);
>> pages[i] = NULL;
>> continue;
>
> I stared longer at this, and not sure if that will work.
>
> The PageTail() is in place because we return with the folio locked on
> success, so we won't trylock again on tail pages.
>
> But staring at page_make_device_exclusive_one(), I am not sure if it
> does what we want in all cases ...
>
> ... and the hmm selftests just keeps failing upstream as well?! huh. :)
>
> I'll try spending some time on this to see if I can grasp what needs to
> be done and how it could be handled ... better.
>
As expected ...
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# ./hmm-tests
...
# RUN hmm.hmm_device_private.exclusive ...
# OK hmm.hmm_device_private.exclusive
ok 21 hmm.hmm_device_private.exclusive
# RUN hmm.hmm_device_private.exclusive_mprotect ...
# OK hmm.hmm_device_private.exclusive_mprotect
ok 22 hmm.hmm_device_private.exclusive_mprotect
# RUN hmm.hmm_device_private.exclusive_cow ...
# OK hmm.hmm_device_private.exclusive_cow
ok 23 hmm.hmm_device_private.exclusive_cow
# RUN hmm.hmm_device_private.hmm_gup_test ...
# OK hmm.hmm_device_private.hmm_gup_test
...
# echo always > /sys/kernel/mm/transparent_hugepage/enabled
...
# RUN hmm.hmm_device_private.exclusive ...
# hmm-tests.c:1751:exclusive:Expected ret (-16) == 0 (0)
# exclusive: Test terminated by assertion
# FAIL hmm.hmm_device_private.exclusive
not ok 21 hmm.hmm_device_private.exclusive
# RUN hmm.hmm_device_private.exclusive_mprotect ...
# hmm-tests.c:1805:exclusive_mprotect:Expected ret (-16) == 0 (0)
# exclusive_mprotect: Test terminated by assertion
# FAIL hmm.hmm_device_private.exclusive_mprotect
not ok 22 hmm.hmm_device_private.exclusive_mprotect
# RUN hmm.hmm_device_private.exclusive_cow ...
# hmm-tests.c:1858:exclusive_cow:Expected ret (-16) == 0 (0)
# exclusive_cow: Test terminated by assertion
# FAIL hmm.hmm_device_private.exclusive_cow
not ok 23 hmm.hmm_device_private.exclusive_cow
So rejecting folio_test_large() would likely achieve the same thing
right now.
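I.e. something like this (untested, only meant to spell out that
suggestion on top of the current code):

	-		if (PageTail(pages[i]) || !folio_trylock(folio)) {
	+		/* large folios are not supported here yet */
	+		if (folio_test_large(folio) || !folio_trylock(folio)) {
	 			folio_put(folio);
	 			pages[i] = NULL;
	 			continue;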
--
Cheers,
David / dhildenb