linux-mm.kvack.org archive mirror
From: Alistair Popple <apopple@nvidia.com>
To: Zi Yan <ziy@nvidia.com>
Cc: Matthew Brost <matthew.brost@intel.com>,
	 Balbir Singh <balbirs@nvidia.com>,
	Francois Dugast <francois.dugast@intel.com>,
	 intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	 Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, David Hildenbrand <david@kernel.org>,
	 Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	 "Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Nico Pache <npache@redhat.com>,
	 Ryan Roberts <ryan.roberts@arm.com>, Dev Jain <dev.jain@arm.com>,
	Barry Song <baohua@kernel.org>,
	 Lance Yang <lance.yang@linux.dev>,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: [PATCH 1/4] mm/migrate: Add migrate_device_split_page
Date: Thu, 8 Jan 2026 14:14:28 +1100	[thread overview]
Message-ID: <w22y7ioppcgcqikwko5v6zf4c42pdv345d6vt2kvk3dcmtmi6e@mn6undehc6xb> (raw)
In-Reply-To: <0419825F-D627-41E9-887D-51E18A2D54E7@nvidia.com>

On 2026-01-08 at 13:53 +1100, Zi Yan <ziy@nvidia.com> wrote...
> On 7 Jan 2026, at 21:17, Matthew Brost wrote:
> 
> > On Thu, Jan 08, 2026 at 11:56:03AM +1100, Balbir Singh wrote:
> >> On 1/8/26 08:03, Zi Yan wrote:
> >>> On 7 Jan 2026, at 16:15, Matthew Brost wrote:
> >>>
> >>>> On Wed, Jan 07, 2026 at 03:38:35PM -0500, Zi Yan wrote:
> >>>>> On 7 Jan 2026, at 15:20, Zi Yan wrote:
> >>>>>
> >>>>>> +THP folks
> >>>>>
> >>>>> +willy, since he commented in another thread.
> >>>>>
> >>>>>>
> >>>>>> On 16 Dec 2025, at 15:10, Francois Dugast wrote:
> >>>>>>
> >>>>>>> From: Matthew Brost <matthew.brost@intel.com>
> >>>>>>>
> >>>>>>> Introduce migrate_device_split_page() to split a device page into
> >>>>>>> lower-order pages. Used when a folio allocated as higher-order is freed
> >>>>>>> and later reallocated at a smaller order by the driver memory manager.
> >>>>>>>
> >>>>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
> >>>>>>> Cc: Balbir Singh <balbirs@nvidia.com>
> >>>>>>> Cc: dri-devel@lists.freedesktop.org
> >>>>>>> Cc: linux-mm@kvack.org
> >>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> >>>>>>> Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> >>>>>>> ---
> >>>>>>>  include/linux/huge_mm.h |  3 +++
> >>>>>>>  include/linux/migrate.h |  1 +
> >>>>>>>  mm/huge_memory.c        |  6 ++---
> >>>>>>>  mm/migrate_device.c     | 49 +++++++++++++++++++++++++++++++++++++++++
> >>>>>>>  4 files changed, 56 insertions(+), 3 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >>>>>>> index a4d9f964dfde..6ad8f359bc0d 100644
> >>>>>>> --- a/include/linux/huge_mm.h
> >>>>>>> +++ b/include/linux/huge_mm.h
> >>>>>>> @@ -374,6 +374,9 @@ int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list
> >>>>>>>  int folio_split_unmapped(struct folio *folio, unsigned int new_order);
> >>>>>>>  unsigned int min_order_for_split(struct folio *folio);
> >>>>>>>  int split_folio_to_list(struct folio *folio, struct list_head *list);
> >>>>>>> +int __split_unmapped_folio(struct folio *folio, int new_order,
> >>>>>>> +			   struct page *split_at, struct xa_state *xas,
> >>>>>>> +			   struct address_space *mapping, enum split_type split_type);
> >>>>>>>  int folio_check_splittable(struct folio *folio, unsigned int new_order,
> >>>>>>>  			   enum split_type split_type);
> >>>>>>>  int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
> >>>>>>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> >>>>>>> index 26ca00c325d9..ec65e4fd5f88 100644
> >>>>>>> --- a/include/linux/migrate.h
> >>>>>>> +++ b/include/linux/migrate.h
> >>>>>>> @@ -192,6 +192,7 @@ void migrate_device_pages(unsigned long *src_pfns, unsigned long *dst_pfns,
> >>>>>>>  			unsigned long npages);
> >>>>>>>  void migrate_device_finalize(unsigned long *src_pfns,
> >>>>>>>  			unsigned long *dst_pfns, unsigned long npages);
> >>>>>>> +int migrate_device_split_page(struct page *page);
> >>>>>>>
> >>>>>>>  #endif /* CONFIG_MIGRATION */
> >>>>>>>
> >>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >>>>>>> index 40cf59301c21..7ded35a3ecec 100644
> >>>>>>> --- a/mm/huge_memory.c
> >>>>>>> +++ b/mm/huge_memory.c
> >>>>>>> @@ -3621,9 +3621,9 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
> >>>>>>>   * Return: 0 - successful, <0 - failed (if -ENOMEM is returned, @folio might be
> >>>>>>>   * split but not to @new_order, the caller needs to check)
> >>>>>>>   */
> >>>>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
> >>>>>>> -		struct page *split_at, struct xa_state *xas,
> >>>>>>> -		struct address_space *mapping, enum split_type split_type)
> >>>>>>> +int __split_unmapped_folio(struct folio *folio, int new_order,
> >>>>>>> +			   struct page *split_at, struct xa_state *xas,
> >>>>>>> +			   struct address_space *mapping, enum split_type split_type)
> >>>>>>>  {
> >>>>>>>  	const bool is_anon = folio_test_anon(folio);
> >>>>>>>  	int old_order = folio_order(folio);
> >>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> >>>>>>> index 23379663b1e1..eb0f0e938947 100644
> >>>>>>> --- a/mm/migrate_device.c
> >>>>>>> +++ b/mm/migrate_device.c
> >>>>>>> @@ -775,6 +775,49 @@ int migrate_vma_setup(struct migrate_vma *args)
> >>>>>>>  EXPORT_SYMBOL(migrate_vma_setup);
> >>>>>>>
> >>>>>>>  #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
> >>>>>>> +/**
> >>>>>>> + * migrate_device_split_page() - Split device page
> >>>>>>> + * @page: Device page to split
> >>>>>>> + *
> >>>>>>> + * Splits a device page into smaller pages. Typically called when reallocating a
> >>>>>>> + * folio to a smaller size. Inherently racy—only safe if the caller ensures
> >>>>>>> + * mutual exclusion within the page's folio (i.e., no other threads are using
> >>>>>>> + * pages within the folio). Expected to be called on a free device page, and
> >>>>>>> + * it restores all split-out pages to a free state.
> >>>>>>> + */
> >>>>>
> >>>>> Do you mind explaining why __split_unmapped_folio() is needed for a free device
> >>>>> page? A free page is not supposed to be a large folio, at least from a core
> >>>>> MM point of view. __split_unmapped_folio() is intended to work on large folios
> >>>>> (or compound pages), even if the input folio has refcount == 0 (because it is
> >>>>> frozen).
> >>>>>
> >>>>
> >>>> Well, then maybe this is a bug in core MM where the freed page is still
> >>>> a THP. Let me explain the scenario and why this is needed from my POV.
> >>>>
> >>>> Our VRAM allocator in Xe (and several other DRM drivers) is DRM buddy.
> >>>> This is a shared pool between traditional DRM GEMs (buffer objects) and
> >>>> SVM allocations (pages). It doesn’t have any view of the page backing—it
> >>>> basically just hands back a pointer to VRAM space that we allocate from.
> >>>> From that, if it’s an SVM allocation, we can derive the device pages.
> >>>>
> >>>> What I see happening is: a 2M buddy allocation occurs, we make the
> >>>> backing device pages a large folio, and sometime later the folio
> >>>> refcount goes to zero and we free the buddy allocation. Later, the buddy
> >>>> allocation is reused for a smaller allocation (e.g., 4K or 64K), but the
> >>>> backing pages are still a large folio. Here is where we need to split
> >>>
> >>> I agree with you that it might be a bug in free_zone_device_folio(), based
> >>> on my understanding: zone_device_page_init() calls prep_compound_page()
> >>> for >0 orders, but free_zone_device_folio() never reverses the process.
> >>>
> >>> Balbir and Alistair might be able to help here.

Just catching up after the Christmas break.

> >>
> >> I agree it's an API limitation
> 
> I am not sure. If free_zone_device_folio() does not get rid of large folio
> metadata, there is no guarantee that a freed large device private folio will
> be reallocated as a large device private folio. And when mTHP support is
> added, the folio order might change too. That can cause issues when
> compound_head() is called on a tail page of a previously large folio, since
> compound_head() will return the old head page instead of the tail page itself.

I agree that freeing the device folio should get rid of the large folio
metadata. That would also keep it consistent with what we do for FS DAX, for
example.

> >>
> >>>
> >>> I cherry-picked the code from __free_frozen_pages() to reverse the process.
> >>> Can you give it a try to see if it solves the above issue? Thanks.

It would be nice if this could be a common helper for freeing compound
ZONE_DEVICE pages. FS DAX already does this, for example:

static inline unsigned long dax_folio_put(struct folio *folio)
{
	unsigned long ref;
	int order, i;

	if (!dax_folio_is_shared(folio))
		ref = 0;
	else
		ref = --folio->share;

	if (ref)
		return ref;

	folio->mapping = NULL;
	order = folio_order(folio);
	if (!order)
		return 0;
	folio_reset_order(folio);

	for (i = 0; i < (1UL << order); i++) {
		struct dev_pagemap *pgmap = page_pgmap(&folio->page);
		struct page *page = folio_page(folio, i);
		struct folio *new_folio = (struct folio *)page;

		ClearPageHead(page);
		clear_compound_head(page);

		new_folio->mapping = NULL;
		/*
		 * Reset pgmap which was over-written by
		 * prep_compound_page().
		 */
		new_folio->pgmap = pgmap;
		new_folio->share = 0;
		WARN_ON_ONCE(folio_ref_count(new_folio));
	}

	return ref;
}

Aside from the weird refcount checks that FS DAX needs to do at the start of this
function, I don't think there is anything in it that wouldn't apply equally to
DEVICE_PRIVATE pages.
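
A minimal sketch of what a shared helper might look like (untested, the name is
made up, and it assumes the FS DAX share-count handling above stays in
dax_folio_put()):

static void zone_device_folio_reset_order(struct folio *folio)
{
	struct dev_pagemap *pgmap = page_pgmap(&folio->page);
	int order = folio_order(folio);
	int i;

	if (!order)
		return;
	folio_reset_order(folio);

	for (i = 0; i < (1UL << order); i++) {
		struct page *page = folio_page(folio, i);
		struct folio *new_folio = (struct folio *)page;

		ClearPageHead(page);
		clear_compound_head(page);

		new_folio->mapping = NULL;
		/* Restore pgmap, which prep_compound_page() over-wrote */
		new_folio->pgmap = pgmap;
		WARN_ON_ONCE(folio_ref_count(new_folio));
	}
}

free_zone_device_folio() could then call something like this for DEVICE_PRIVATE
and DEVICE_COHERENT folios as well, before the pages are handed back to the
driver.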

> >>>
> >>> From 3aa03baa39b7e62ea079e826de6ed5aab3061e46 Mon Sep 17 00:00:00 2001
> >>> From: Zi Yan <ziy@nvidia.com>
> >>> Date: Wed, 7 Jan 2026 16:49:52 -0500
> >>> Subject: [PATCH] mm/memremap: free device private folio fix
> >>> Content-Type: text/plain; charset="utf-8"
> >>>
> >>> Signed-off-by: Zi Yan <ziy@nvidia.com>
> >>> ---
> >>>  mm/memremap.c | 15 +++++++++++++++
> >>>  1 file changed, 15 insertions(+)
> >>>
> >>> diff --git a/mm/memremap.c b/mm/memremap.c
> >>> index 63c6ab4fdf08..483666ff7271 100644
> >>> --- a/mm/memremap.c
> >>> +++ b/mm/memremap.c
> >>> @@ -475,6 +475,21 @@ void free_zone_device_folio(struct folio *folio)
> >>>  		pgmap->ops->folio_free(folio);
> >>>  		break;
> >>>  	}
> >>> +
> >>> +	if (nr > 1) {
> >>> +		struct page *head = folio_page(folio, 0);
> >>> +
> >>> +		head[1].flags.f &= ~PAGE_FLAGS_SECOND;
> >>> +#ifdef NR_PAGES_IN_LARGE_FOLIO
> >>> +		folio->_nr_pages = 0;
> >>> +#endif
> >>> +		for (i = 1; i < nr; i++) {
> >>> +			(head + i)->mapping = NULL;
> >>> +			clear_compound_head(head + i);
> >>
> >> I see that you're skipping the checks in free_tail_page_prepare()? IIUC, we should be able
> >> to invoke it even for zone device private pages
> 
> I am not sure which parts of the compound page metadata are also used in a device
> private folio. Yes, it is better to add the right checks.
> 
> >>
> >>> +		}
> >>> +		folio->mapping = NULL;
> >>
> >> This is already done in free_zone_device_folio()
> >>
> >>> +		head->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> >>
> >> I don't think this is required for zone device private folios, but I suppose it
> >> keeps the code generic
> >>
> >
> > Well, the above code doesn’t work, but I think it’s the right idea.
> > clear_compound_head() zeroes the compound_head field, which aliases pgmap and
> > which we don’t want to be NULL. I
> 
> Thank you for pointing it out. I am not familiar with device private page code.
> 
> > believe the individual pages likely need their flags cleared (?), and
> 
> Yes, I missed the tail page flag clearing part.
> 
> > this step must be done before calling folio_free and include a barrier,
> > as the page can be immediately reallocated.
> >
> > So here’s what I came up with, and it seems to work (for Xe, GPU SVM):
> >
> >  mm/memremap.c | 21 +++++++++++++++++++++
> >  1 file changed, 21 insertions(+)
> >
> > diff --git a/mm/memremap.c b/mm/memremap.c
> > index 63c6ab4fdf08..ac20abb6a441 100644
> > --- a/mm/memremap.c
> > +++ b/mm/memremap.c
> > @@ -448,6 +448,27 @@ void free_zone_device_folio(struct folio *folio)
> >             pgmap->type != MEMORY_DEVICE_GENERIC)
> >                 folio->mapping = NULL;
> >
> > +       if (nr > 1) {
> > +               struct page *head = folio_page(folio, 0);
> > +
> > +               head[1].flags.f &= ~PAGE_FLAGS_SECOND;
> > +#ifdef NR_PAGES_IN_LARGE_FOLIO
> > +               folio->_nr_pages = 0;
> > +#endif
> > +               for (i = 1; i < nr; i++) {
> > +                       struct folio *new_folio = (struct folio *)(head + i);
> > +
> > +                       (head + i)->mapping = NULL;
> > +                       (head + i)->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > +
> > +                       /* Overwrite compound_head with pgmap */
> > +                       new_folio->pgmap = pgmap;
> > +               }
> > +
> > +               head->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > +               smp_wmb();	/* Changes must be visible before freeing folio */
> > +       }
> > +
> >         switch (pgmap->type) {
> >         case MEMORY_DEVICE_PRIVATE:
> >         case MEMORY_DEVICE_COHERENT:
> >
> 
> It looks good to me, but I am very likely missing details on device private
> pages. Like Balbir pointed out above, for tail pages, calling
> free_tail_page_prepare() might be better so you get the same sanity checks as a
> normal large folio, although you will need to set ->pgmap after it.
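
A rough, untested sketch of that suggestion (assuming free_tail_page_prepare()
were made callable from mm/memremap.c, as it is currently static in
mm/page_alloc.c) would be to replace the tail-page loop above with:

	for (i = 1; i < nr; i++) {
		struct folio *new_folio = (struct folio *)(head + i);

		/*
		 * Same sanity checks as when freeing a normal large folio;
		 * this also clears compound_head and the tail ->mapping.
		 */
		free_tail_page_prepare(head, head + i);
		(head + i)->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;

		/* compound_head aliases pgmap, so restore it afterwards */
		new_folio->pgmap = pgmap;
	}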
> 
> It is better to send it as a proper patch and get reviews from other
> MM folks.
> 
> >>> +	}
> >>>  }
> >>>
> >>>  void zone_device_page_init(struct page *page, unsigned int order)
> >>
> >>
> >> Otherwise, it seems like the right way to solve the issue.
> >>
> >
> > My question is: why isn’t Nouveau hitting this issue, or your Nvidia
> > out-of-tree driver? (Lack of testing, perhaps; Xe’s test suite coverage is quite
> > good at finding corners.)
> >
> > Also, will this change in behavior break either of those drivers?
> >
> > Matt
> >
> >> Balbir
> 
> 
> Best Regards,
> Yan, Zi


Thread overview: 15+ messages
     [not found] <20251216201206.1660899-1-francois.dugast@intel.com>
2025-12-16 20:10 ` Francois Dugast
2025-12-16 20:34   ` Matthew Wilcox
2025-12-16 21:39     ` Matthew Brost
2026-01-06  2:39       ` Matthew Brost
2026-01-07 20:15         ` Zi Yan
2026-01-07 20:20   ` Zi Yan
2026-01-07 20:38     ` Zi Yan
2026-01-07 21:15       ` Matthew Brost
2026-01-07 22:03         ` Zi Yan
2026-01-08  0:56           ` Balbir Singh
2026-01-08  2:17             ` Matthew Brost
2026-01-08  2:53               ` Zi Yan
2026-01-08  3:14                 ` Alistair Popple [this message]
2026-01-08  3:42                   ` Matthew Brost
2026-01-08  4:47                     ` Balbir Singh
