From: Matthew Brost <matthew.brost@intel.com>
To: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <balbirs@nvidia.com>,
	Francois Dugast <francois.dugast@intel.com>,
	<intel-xe@lists.freedesktop.org>,
	<dri-devel@lists.freedesktop.org>, Zi Yan <ziy@nvidia.com>,
	David Hildenbrand <david@kernel.org>,
	Oscar Salvador <osalvador@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>, Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>, <linux-mm@kvack.org>,
	<linux-cxl@vger.kernel.org>, <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper
Date: Mon, 12 Jan 2026 18:16:50 -0800
Message-ID: <aWWrEtUvlqTVKs2N@lstrano-desk.jf.intel.com>
In-Reply-To: <lwbeop6muq4tbrdauwfi42nw5jss7yvgbmls546sxvygivpcwg@persiopzpqed>

On Tue, Jan 13, 2026 at 01:06:02PM +1100, Alistair Popple wrote:
> On 2026-01-13 at 12:40 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> > On Tue, Jan 13, 2026 at 12:35:31PM +1100, Alistair Popple wrote:
> > > On 2026-01-13 at 12:07 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> > > > On Tue, Jan 13, 2026 at 11:43:51AM +1100, Alistair Popple wrote:
> > > > > On 2026-01-13 at 11:23 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> > > > > > On Tue, Jan 13, 2026 at 10:58:27AM +1100, Alistair Popple wrote:
> > > > > > > On 2026-01-12 at 12:16 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> > > > > > > > On Mon, Jan 12, 2026 at 11:44:15AM +1100, Balbir Singh wrote:
> > > > > > > > > On 1/12/26 06:55, Francois Dugast wrote:
> > > > > > > > > > From: Matthew Brost <matthew.brost@intel.com>
> > > > > > > > > > 
> > > > > > > > > > Add free_zone_device_folio_prepare(), a helper that restores large
> > > > > > > > > > ZONE_DEVICE folios to a sane, initial state before freeing them.
> > > > > > > > > > 
> > > > > > > > > > Compound ZONE_DEVICE folios overwrite per-page state (e.g. pgmap and
> > > > > > > > > > compound metadata). Before returning such pages to the device pgmap
> > > > > > > > > > allocator, each constituent page must be reset to a standalone
> > > > > > > > > > ZONE_DEVICE folio with a valid pgmap and no compound state.
> > > > > > > > > > 
> > > > > > > > > > Use this helper prior to folio_free() for device-private and
> > > > > > > > > > device-coherent folios to ensure consistent device page state for
> > > > > > > > > > subsequent allocations.
> > > > > > > > > > 
> > > > > > > > > > Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios")
> > > > > > > > > > Cc: Zi Yan <ziy@nvidia.com>
> > > > > > > > > > Cc: David Hildenbrand <david@kernel.org>
> > > > > > > > > > Cc: Oscar Salvador <osalvador@suse.de>
> > > > > > > > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > > > > > > > Cc: Balbir Singh <balbirs@nvidia.com>
> > > > > > > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > > > > > > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> > > > > > > > > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > > > > > > > > Cc: Mike Rapoport <rppt@kernel.org>
> > > > > > > > > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > > > > > > > > Cc: Michal Hocko <mhocko@suse.com>
> > > > > > > > > > Cc: Alistair Popple <apopple@nvidia.com>
> > > > > > > > > > Cc: linux-mm@kvack.org
> > > > > > > > > > Cc: linux-cxl@vger.kernel.org
> > > > > > > > > > Cc: linux-kernel@vger.kernel.org
> > > > > > > > > > Suggested-by: Alistair Popple <apopple@nvidia.com>
> > > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > > > > > Signed-off-by: Francois Dugast <francois.dugast@intel.com>
> > > > > > > > > > ---
> > > > > > > > > >  include/linux/memremap.h |  1 +
> > > > > > > > > >  mm/memremap.c            | 55 ++++++++++++++++++++++++++++++++++++++++
> > > > > > > > > >  2 files changed, 56 insertions(+)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > > > > > > > > > index 97fcffeb1c1e..88e1d4707296 100644
> > > > > > > > > > --- a/include/linux/memremap.h
> > > > > > > > > > +++ b/include/linux/memremap.h
> > > > > > > > > > @@ -230,6 +230,7 @@ static inline bool is_fsdax_page(const struct page *page)
> > > > > > > > > >  
> > > > > > > > > >  #ifdef CONFIG_ZONE_DEVICE
> > > > > > > > > >  void zone_device_page_init(struct page *page, unsigned int order);
> > > > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio);
> > > > > > > > > >  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> > > > > > > > > >  void memunmap_pages(struct dev_pagemap *pgmap);
> > > > > > > > > >  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
> > > > > > > > > > diff --git a/mm/memremap.c b/mm/memremap.c
> > > > > > > > > > index 39dc4bd190d0..375a61e18858 100644
> > > > > > > > > > --- a/mm/memremap.c
> > > > > > > > > > +++ b/mm/memremap.c
> > > > > > > > > > @@ -413,6 +413,60 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn)
> > > > > > > > > >  }
> > > > > > > > > >  EXPORT_SYMBOL_GPL(get_dev_pagemap);
> > > > > > > > > >  
> > > > > > > > > > +/**
> > > > > > > > > > + * free_zone_device_folio_prepare() - Prepare a ZONE_DEVICE folio for freeing.
> > > > > > > > > > + * @folio: ZONE_DEVICE folio to prepare for release.
> > > > > > > > > > + *
> > > > > > > > > > + * ZONE_DEVICE pages/folios (e.g., device-private memory or fsdax-backed pages)
> > > > > > > > > > + * can be compound. When freeing a compound ZONE_DEVICE folio, the tail pages
> > > > > > > > > > + * must be restored to a sane ZONE_DEVICE state before they are released.
> > > > > > > > > > + *
> > > > > > > > > > + * This helper:
> > > > > > > > > > + *   - Clears @folio->mapping and, for compound folios, clears each page's
> > > > > > > > > > + *     compound-head state (ClearPageHead()/clear_compound_head()).
> > > > > > > > > > + *   - Resets the compound order metadata (folio_reset_order()) and then
> > > > > > > > > > + *     initializes each constituent page as a standalone ZONE_DEVICE folio:
> > > > > > > > > > + *       * clears ->mapping
> > > > > > > > > > + *       * restores ->pgmap (prep_compound_page() overwrites it)
> > > > > > > > > > + *       * clears ->share (only relevant for fsdax; unused for device-private)
> > > > > > > > > > + *
> > > > > > > > > > + * If @folio is order-0, only the mapping is cleared and no further work is
> > > > > > > > > > + * required.
> > > > > > > > > > + */
> > > > > > > > > > +void free_zone_device_folio_prepare(struct folio *folio)
> > > > > > > 
> > > > > > > I don't really like the naming here - we're not preparing a folio to be
> > > > > > > freed, from the core-mm perspective the folio is already free. This is just
> > > > > > > reinitialising the folio metadata ready for the driver to reuse it, which may
> > > > > > > actually involve just recreating a compound folio.
> > > > > > > 
> > > > > > > So maybe zone_device_folio_reinitialise()? Or would it be possible to
> > > > > > 
> > > > > > zone_device_folio_reinitialise - that works for me... but it seems like
> > > > > > everyone has an opinion.
> > > > > 
> > > > > Well of course :) There are only two hard problems in programming and
> > > > > I forget the other one. But I didn't want to just say I don't like
> > > > > free_zone_device_folio_prepare() without offering an alternative, I'd be open
> > > > > to others.
> > > > > 
> > > > 
> > > > zone_device_folio_reinitialise is good with me.
> > > > 
> > > > > > 
> > > > > > > roll this into a zone_device_folio_init() type function (similar to
> > > > > > > zone_device_page_init()) that just deals with everything at allocation time?
> > > > > > > 
> > > > > > 
> > > > > > I don’t think doing this at allocation actually works without a big lock
> > > > > > per pgmap. Consider the case where a VRAM allocator allocates two
> > > > > > distinct subsets of a large folio and you have a multi-threaded GPU page
> > > > > > fault handler (Xe does). It’s possible two threads could call
> > > > > > zone_device_folio_reinitialise at the same time, racing and causing all
> > > > > > sorts of issues. My plan is to just call this function in the driver’s
> > > > > > ->folio_free() prior to returning the VRAM allocation to my driver pool.
> > > > > 
> > > > > This doesn't make sense to me (at least as someone who doesn't know DRM SVM
> > > > > intimately) - the folio metadata initialisation should only happen after the
> > > > > VRAM allocation has occurred.
> > > > > 
> > > > > IOW the VRAM allocator needs to deal with the locking; once you have the VRAM
> > > > > physical range you just initialise the folio/pages associated with that range
> > > > > with zone_device_folio_(re)initialise() and you're done.
> > > > > 
> > > > 
> > > > Our VRAM allocator does have locking (via DRM buddy), but that layer
> > > 
> > > I mean I assumed it did :-)
> > > 
> > > > doesn’t have visibility into the folio or its pages. By the time we
> > > > handle the folio/pages in the GPU fault handler, there are no global
> > > > locks preventing two GPU faults from each having, say, 16 pages from the
> > > > same order-9 folio. I believe if both threads call
> > > > zone_device_folio_reinitialise/init at the same time, bad things could
> > > > happen.
> > > 
> > > This is confusing to me. If you are getting a GPU fault it implies no page is
> > > mapped at a particular virtual address. The normal process (or at least the
> > > process I'm familiar with) for handling this is to allocate and map a page at
> > > the faulting virtual address. So in the scenario of two GPUs faulting on the
> > > same VA each thread would allocate VRAM using DRM buddy, presumably getting
> > 
> > Different VAs.
> > 
> > > different physical pages, and so the zone_device_folio_init() call would be to
> > 
> > Yes, different physical pages but the same folio, which is possible if it
> > hasn’t been split yet (i.e., both threads have a different subset of pages
> > in the same folio, try to split at the same time, and boom, something bad
> > happens).
> 
> So is your concern something like this:
> 
> 1) There is a free folio A of order 9, starting at physical address 0.
> 2) You have two GPU faults, both call into DRM Buddy to get a 4K page.
> 3) GPU 1 gets allocated physical address 0 (ie. folio_page(folio_A, 0))
> 4) GPU 2 gets allocated physical address 0x1000 (ie. folio_page(folio_A, 1))
> 5) Both call zone_device_folio_init() which splits the folio, meaning the
>    thread from the previous step would touch folio_page(folio_A, 0) even
>    though it has not been allocated physical address 0.
> 

Yes.

> If that's the concern then what I'm saying (and what I think Jason was getting
> at) is that (5) above is wrong - the driver doesn't (and shouldn't) update the
> compound head (ie. folio_page(folio_A, 0)) - zone_device_folio_init() should
> just overwrite all the metadata in the struct pages it has been allocated. We're
> not really splitting folios, because it makes no sense to talk of splitting a
> free folio which I think is why some core-mm people took notice.
> 
> Also, it doesn't matter that you are leaving the previous compound head struct
> pages in some weird state; the core-mm doesn't care about them anymore, and the
> struct page/folio is only used by the core-mm, not drivers. They will get
> properly (re)initialised when needed by the core-mm in zone_device_folio_init(),
> which in this case would happen in step 3.
>

Something like this should work too. I started implementing it on my
side earlier today, but of course, I was hitting hangs. From an API
point of view, zone_device_folio_init would need to be updated to accept
a pgmap argument. In this example, folio_page(folio_A, 1) wouldn’t have
a valid pgmap to retrieve. It could look at the folio’s pgmap, but that
also seems like it could race under the right conditions.
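
Roughly the direction I have in mind, as an untested sketch only (the
function name, the extra pgmap argument, and the call ordering below are
my assumptions about what an updated API could look like, not something
that exists today):

void zone_device_folio_init(struct page *page, unsigned int order,
			    struct dev_pagemap *pgmap)
{
	unsigned long i;

	/*
	 * Only touch the pages this caller was actually allocated; a
	 * previous compound head outside this range is deliberately
	 * left alone.
	 */
	for (i = 0; i < (1UL << order); i++) {
		struct folio *folio = (struct folio *)(page + i);

		ClearPageHead(page + i);
		clear_compound_head(page + i);
		folio->mapping = NULL;
	}

	if (order)
		prep_compound_page(page, order);

	/* pgmap comes from the caller, never read back from a stale head */
	page_folio(page)->pgmap = pgmap;
	page_folio(page)->share = 0;
}

That way the pgmap always comes from the allocation path rather than from
whatever the old compound head happens to contain.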

Let me see what this looks like and whether I can get it working.

Matt
 
>  - Alistair
> 
> > > different folios/pages.
> > > 
> > > Then eventually one thread would succeed in creating the mapping from VA->VRAM
> > > and the losing thread would free the VRAM allocation back to DRM buddy.
> > > 
> > > So I'm a bit confused by the above statement that two GPU faults could each
> > > have the same pages or be calling zone_device_folio_init() on the same pages.
> > > How would that happen?
> > > 
> > 
> > See above. I hope my above statements make this clear.
> > 
> > Matt
> > 
> > > > > Is the concern that reinitialisation would touch pages outside of the allocated
> > > > > VRAM range if it was previously a large folio?
> > > > 
> > > > No, just two threads calling zone_device_folio_reinitialise/init at the
> > > > same time, on the same folio.
> > > > 
> > > > If we call zone_device_folio_reinitialise in ->folio_free, this problem
> > > > goes away. We could solve this with split_lock or something, but I’d
> > > > prefer not to add a lock for this (although some prior revs did do
> > > > this; maybe we will revisit this later).
> > > > 
> > > > Anyways - this falls in driver detail / choice IMO.
> > > 
> > > Agreed.
> > > 
> > >  - Alistair
> > > 
> > > > Matt
> > > > 
> > > > > 
> > > > > > > > > > +{
> > > > > > > > > > +	struct dev_pagemap *pgmap = page_pgmap(&folio->page);
> > > > > > > > > > +	int order, i;
> > > > > > > > > > +
> > > > > > > > > > +	VM_WARN_ON_FOLIO(!folio_is_zone_device(folio), folio);
> > > > > > > > > > +
> > > > > > > > > > +	folio->mapping = NULL;
> > > > > > > > > > +	order = folio_order(folio);
> > > > > > > > > > +	if (!order)
> > > > > > > > > > +		return;
> > > > > > > > > > +
> > > > > > > > > > +	folio_reset_order(folio);
> > > > > > > > > > +
> > > > > > > > > > +	for (i = 0; i < (1UL << order); i++) {
> > > > > > > > > > +		struct page *page = folio_page(folio, i);
> > > > > > > > > > +		struct folio *new_folio = (struct folio *)page;
> > > > > > > > > > +
> > > > > > > > > > +		ClearPageHead(page);
> > > > > > > > > > +		clear_compound_head(page);
> > > > > > > > > > +
> > > > > > > > > > +		new_folio->mapping = NULL;
> > > > > > > > > > +		/*
> > > > > > > > > > +		 * Reset pgmap which was over-written by
> > > > > > > > > > +		 * prep_compound_page().
> > > > > > > > > > +		 */
> > > > > > > > > > +		new_folio->pgmap = pgmap;
> > > > > > > > > > +		new_folio->share = 0;	/* fsdax only, unused for device private */
> > > > > > > > > > +		VM_WARN_ON_FOLIO(folio_ref_count(new_folio), new_folio);
> > > > > > > > > > +		VM_WARN_ON_FOLIO(!folio_is_zone_device(new_folio), new_folio);
> > > > > > > > > 
> > > > > > > > > Does calling the free_folio() callback on new_folio solve the issue you are facing, or is
> > > > > > > > > that PMD_ORDER more frees than we'd like?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > No, calling free_folio() more often doesn’t solve anything—in fact, that
> > > > > > > > would make my implementation explode. I explained this in detail here [1]
> > > > > > > > to Zi.
> > > > > > > > 
> > > > > > > > To recap [1], my memory allocator has no visibility into individual
> > > > > > > > pages or folios; it is DRM Buddy layered on top of TTM BO. This design
> > > > > > > > allows VRAM to be allocated or evicted for both traditional GPU
> > > > > > > > allocations (GEMs) and SVM allocations.
> > > > > > > > 
> > > > > > > > Now, to recap the actual issue: if device folios are not split upon free
> > > > > > > > and are later reallocated with a different order in
> > > > > > > > zone_device_page_init, the implementation breaks. This problem is not
> > > > > > > > specific to Xe—Nouveau happens to always allocate at the same order, so
> > > > > > > > it works by coincidence. Reallocating at a different order is valid
> > > > > > > > behavior and must be supported.
> > > > > > > 
> > > > > > > I agree it's probably by coincidence but it is a perfectly valid design to
> > > > > > > always just (re)allocate at the same order and not worry about having to
> > > > > > > reinitialise things to different orders.
> > > > > > > 
> > > > > > 
> > > > > > I would agree with this statement too — it’s perfectly valid if a driver
> > > > > > always wants to (re)allocate at the same order.
> > > > > > 
> > > > > > Matt
> > > > > > 
> > > > > > >  - Alistair
> > > > > > > 
> > > > > > > > Matt
> > > > > > > > 
> > > > > > > > [1] https://patchwork.freedesktop.org/patch/697710/?series=159119&rev=3#comment_1282413
> > > > > > > > 
> > > > > > > > > > +	}
> > > > > > > > > > +}
> > > > > > > > > > +EXPORT_SYMBOL_GPL(free_zone_device_folio_prepare);
> > > > > > > > > > +
> > > > > > > > > >  void free_zone_device_folio(struct folio *folio)
> > > > > > > > > >  {
> > > > > > > > > >  	struct dev_pagemap *pgmap = folio->pgmap;
> > > > > > > > > > @@ -454,6 +508,7 @@ void free_zone_device_folio(struct folio *folio)
> > > > > > > > > >  	case MEMORY_DEVICE_COHERENT:
> > > > > > > > > >  		if (WARN_ON_ONCE(!pgmap->ops || !pgmap->ops->folio_free))
> > > > > > > > > >  			break;
> > > > > > > > > > +		free_zone_device_folio_prepare(folio);
> > > > > > > > > >  		pgmap->ops->folio_free(folio, order);
> > > > > > > > > >  		percpu_ref_put_many(&folio->pgmap->ref, nr);
> > > > > > > > > >  		break;
> > > > > > > > > 
> > > > > > > > > Balbir


Thread overview: 47+ messages
2026-01-11 20:55 [PATCH v4 0/7] Enable THP support in drm_pagemap Francois Dugast
2026-01-11 20:55 ` [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback Francois Dugast
2026-01-11 22:35   ` Matthew Wilcox
2026-01-12  0:19     ` Balbir Singh
2026-01-12  0:51       ` Zi Yan
2026-01-12  1:37         ` Matthew Brost
2026-01-12  4:50         ` Balbir Singh
2026-01-12 13:45         ` Jason Gunthorpe
2026-01-12 16:31           ` Zi Yan
2026-01-12 16:50             ` Jason Gunthorpe
2026-01-12 17:46               ` Zi Yan
2026-01-12 18:25                 ` Jason Gunthorpe
2026-01-12 18:55                   ` Zi Yan
2026-01-12 19:28                     ` Jason Gunthorpe
2026-01-12 23:34                       ` Zi Yan
2026-01-12 23:53                         ` Jason Gunthorpe
2026-01-13  0:35                           ` Zi Yan
2026-01-12 23:07               ` Matthew Brost
2026-01-12 21:49           ` Matthew Brost
2026-01-12 23:15             ` Zi Yan
2026-01-12 23:22               ` Matthew Brost
2026-01-12 23:44                 ` Alistair Popple
2026-01-12 23:54                   ` Jason Gunthorpe
2026-01-12 23:31               ` Jason Gunthorpe
2026-01-11 20:55 ` [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper Francois Dugast
2026-01-12  0:44   ` Balbir Singh
2026-01-12  1:16     ` Matthew Brost
2026-01-12  2:15       ` Balbir Singh
2026-01-12  2:37         ` Matthew Brost
2026-01-12  2:50           ` Matthew Brost
2026-01-12 23:58       ` Alistair Popple
2026-01-13  0:23         ` Matthew Brost
2026-01-13  0:43           ` Alistair Popple
2026-01-13  1:07             ` Matthew Brost
2026-01-13  1:35               ` Alistair Popple
2026-01-13  1:40                 ` Matthew Brost
2026-01-13  2:06                   ` Alistair Popple
2026-01-13  2:16                     ` Matthew Brost [this message]
2026-01-13  2:31                       ` Alistair Popple
2026-01-11 20:55 ` [PATCH v4 3/7] fs/dax: Use " Francois Dugast
2026-01-12  4:14   ` kernel test robot
2026-01-11 20:55 ` [PATCH v4 4/7] drm/pagemap: Unlock and put folios when possible Francois Dugast
2026-01-11 20:55 ` [PATCH v4 5/7] drm/pagemap: Add helper to access zone_device_data Francois Dugast
2026-01-11 20:55 ` [PATCH v4 6/7] drm/pagemap: Correct cpages calculation for migrate_vma_setup Francois Dugast
2026-01-12 14:17   ` Francois Dugast
2026-01-11 20:55 ` [PATCH v4 7/7] drm/pagemap: Enable THP support for GPU memory migration Francois Dugast
2026-01-11 21:37   ` Matthew Brost
