linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Mike Kravetz <mike.kravetz@oracle.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	akpm@linux-foundation.org, songmuchun@bytedance.com,
	almasrymina@google.com, linmiaohe@huawei.com,
	minhquangbui99@gmail.com, aneesh.kumar@linux.ibm.com
Subject: Re: [PATCH v2 5/9] mm/hugetlb: convert isolate_or_dissolve_huge_page to folios
Date: Tue, 13 Jun 2023 16:29:40 -0700	[thread overview]
Message-ID: <20230613232940.GA3898@monkey> (raw)
In-Reply-To: <20230612233451.GF3704@monkey>

On 06/12/23 16:34, Mike Kravetz wrote:
> On 06/12/23 18:41, Matthew Wilcox wrote:
> > On Tue, Nov 01, 2022 at 03:30:55PM -0700, Sidhartha Kumar wrote:
> > > +++ b/mm/hugetlb.c
> > > @@ -2815,7 +2815,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
> > >  int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
> > >  {
> > >  	struct hstate *h;
> > > -	struct page *head;
> > > +	struct folio *folio = page_folio(page);
> > 
> > Is this safe?  I was reviewing a different patch today, and I spotted
> > this.  With THP, we can relatively easily hit this case:
> > 
> > struct page points to a page with pfn 0x40305, in a folio of order 2.
> > We call page_folio() on it and the resulting pointer is for the folio
> > with pfn 0x40304.
> > If we don't have our own refcount (or some other protection ...) against
> > freeing, the folio can now be freed and reallocated.  Say it's now part
> > of an order-3 folio.
> > Our 'folio' pointer is now actually a pointer to a tail page, and we
> > have various assertions that a folio pointer doesn't point to a tail
> > page, so they trigger.
> > 
> > It seems to me that this ...
> > 
> >         /*
> >          * The page might have been dissolved from under our feet, so make sure
> >          * to carefully check the state under the lock.
> >          * Return success when racing as if we dissolved the page ourselves.
> >          */
> >         spin_lock_irq(&hugetlb_lock);
> >         if (folio_test_hugetlb(folio)) {
> >                 h = folio_hstate(folio);
> >         } else {
> >                 spin_unlock_irq(&hugetlb_lock);
> >                 return 0;
> >         }
> > 
> > implies that we don't have our own reference on the folio, so we might
> > find a situation where the folio pointer we have is no longer a folio
> > pointer.
> 
> Your analysis is correct.
> 
> This is not safe because we hold no locks or references.  The folio
> pointer obtained via page_folio(page) may not be valid when calling
> folio_test_hugetlb(folio) and later.
> 
> My bad for the Reviewed-by: :(
> 

I was looking at this more closely and need a bit of clarification.  As
mentioned, your analysis is correct.  However, it appears that there is
other code doing:

folio = page_folio(page);
...
if (folio_test_hugetlb(folio))

without holding a folio ref or some type of lock.  split_huge_pages_all()
is one such example.

So, either this code has the same issue or there are folio routines that
can be called without holding a ref/lock.  The kerneldoc for
folio_test_hugetlb says "Caller should have a reference on the folio to
prevent it from being turned into a tail page.".  However, is that mostly
to make sure the returned value is consistent/valid?  Can it really lead
to an assert if folio pointer is changed to point to something else?

> > Maybe the page_folio() call should be moved inside the hugetlb_lock
> > protection?  Is that enough?  I don't know enough about how hugetlb
> > pages are split, freed & allocated to know what's going on.

Upon further thought, I think we should move the page_folio() inside the
lock just to be more correct.

> > 
> > But then we _drop_ the lock, and keep referring to ...
> > 
> > > @@ -2841,10 +2840,10 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
> > >  	if (hstate_is_gigantic(h))
> > >  		return -ENOMEM;
> > >  
> > > -	if (page_count(head) && !isolate_hugetlb(head, list))
> > > +	if (folio_ref_count(folio) && !isolate_hugetlb(&folio->page, list))
> > >  		ret = 0;
> > > -	else if (!page_count(head))
> > > -		ret = alloc_and_dissolve_huge_page(h, head, list);
> > > +	else if (!folio_ref_count(folio))
> > > +		ret = alloc_and_dissolve_huge_page(h, &folio->page, list);
> 
> The above was OK when using struct page instead of folio.  The 'racy'
> part was getting the ref count on the head page.  It was OK because this
> was only a check to see if we should TRY to isolate or dissolve.  The
> code to actually isolate or dissolve would take the appropriate locks.

page_count() is doing 'folio_ref_count(page_folio(page));' and there I suspect
there are many places doing page_count without taking a page ref or locking.
So, it seems like this would also be safe?

> I'm afraid the code is now making even more use of a potentially invalid
> folio.  Here is how the above now looks in v6.3:
> 
> 	spin_unlock_irq(&hugetlb_lock);
> 
> 	/*
> 	 * Fence off gigantic pages as there is a cyclic dependency between
> 	 * alloc_contig_range and them. Return -ENOMEM as this has the effect
> 	 * of bailing out right away without further retrying.
> 	 */
> 	if (hstate_is_gigantic(h))
> 		return -ENOMEM;
> 
> 	if (folio_ref_count(folio) && isolate_hugetlb(folio, list))
> 		ret = 0;
> 	else if (!folio_ref_count(folio))
> 		ret = alloc_and_dissolve_hugetlb_folio(h, folio, list);
> 
> Looks like that potentially invalid folio is being passed to other
> routines.  Previous code would take lock and revalidate that struct page
> was still a hugetlb page.  We can not do the same with a folio.

Perhaps I spoke too soon.  Yes, we pass a potentially invalid folio
pointer to isolate_hugetlb() and alloc_and_dissolve_hugetlb_folio().
However, it seems the validation they perform should be sufficient.

bool isolate_hugetlb(struct folio *folio, struct list_head *list)
{
	bool ret = true;

	spin_lock_irq(&hugetlb_lock);
	if (!folio_test_hugetlb(folio) ||
	    !folio_test_hugetlb_migratable(folio) ||
	    !folio_try_get(folio)) {
		ret = false;
		goto unlock;


static int alloc_and_dissolve_hugetlb_folio(struct hstate *h,
			struct folio *old_folio, struct list_head *list)
{
	...
retry:
	spin_lock_irq(&hugetlb_lock);
	if (!folio_test_hugetlb(old_folio)) {
		...
	} else if (folio_ref_count(old_folio)) {
		...
	} else if (!folio_test_hugetlb_freed(old_folio)) {
		...
		goto retry;
	} else {
		/*
		 * Ok, old_folio is still a genuine free hugepage.

Upon further consideration, I do not see an issue with the existing
code.  If there are issues with calling folio_test_hugetlb() or
folio_ref_count() on a potentially invalid folio pointer, then we do
have issues here.  However, such an issue would be more widespread as
there is more code doing the same. 
-- 
Mike Kravetz


  reply	other threads:[~2023-06-13 23:30 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-01 22:30 [PATCH v2 0/9] convert hugetlb_cgroup helper functions " Sidhartha Kumar
2022-11-01 22:30 ` [PATCH v2 1/9] mm/hugetlb_cgroup: convert __set_hugetlb_cgroup() " Sidhartha Kumar
2022-11-02  3:05   ` Muchun Song
2022-11-01 22:30 ` [PATCH v2 2/9] mm/hugetlb_cgroup: convert hugetlb_cgroup_from_page() " Sidhartha Kumar
2022-11-02  6:25   ` Muchun Song
2022-11-01 22:30 ` [PATCH v2 3/9] mm/hugetlb_cgroup: convert set_hugetlb_cgroup*() " Sidhartha Kumar
2022-11-02  6:45   ` Muchun Song
2022-11-10  0:20     ` Sidhartha Kumar
2022-11-10  7:34       ` Muchun Song
2022-11-10 19:08   ` Mike Kravetz
2022-11-01 22:30 ` [PATCH v2 4/9] mm/hugetlb_cgroup: convert hugetlb_cgroup_migrate " Sidhartha Kumar
2022-11-02  6:47   ` Muchun Song
2022-11-01 22:30 ` [PATCH v2 5/9] mm/hugetlb: convert isolate_or_dissolve_huge_page " Sidhartha Kumar
2022-11-02  6:48   ` Muchun Song
2023-06-12 17:41   ` Matthew Wilcox
2023-06-12 18:45     ` Sidhartha Kumar
2023-06-12 23:34     ` Mike Kravetz
2023-06-13 23:29       ` Mike Kravetz [this message]
2022-11-01 22:30 ` [PATCH v2 6/9] mm/hugetlb: convert free_huge_page " Sidhartha Kumar
2022-11-02  6:53   ` Muchun Song
2022-11-01 22:30 ` [PATCH v2 7/9] mm/hugetlb_cgroup: convert hugetlb_cgroup_uncharge_page() " Sidhartha Kumar
2022-11-02  6:56   ` Muchun Song
2022-11-01 22:30 ` [PATCH v2 8/9] mm/hugeltb_cgroup: convert hugetlb_cgroup_commit_charge*() " Sidhartha Kumar
2022-11-02  6:57   ` Muchun Song
2022-11-01 22:30 ` [PATCH v2 9/9] mm/hugetlb: convert move_hugetlb_state() " Sidhartha Kumar
2022-11-02  7:01   ` Muchun Song

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230613232940.GA3898@monkey \
    --to=mike.kravetz@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=almasrymina@google.com \
    --cc=aneesh.kumar@linux.ibm.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=minhquangbui99@gmail.com \
    --cc=sidhartha.kumar@oracle.com \
    --cc=songmuchun@bytedance.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox