From: Alistair Popple <apopple@nvidia.com>
To: Matthew Brost <matthew.brost@intel.com>
Cc: "Zi Yan" <ziy@nvidia.com>, "Jason Gunthorpe" <jgg@ziepe.ca>,
"Matthew Wilcox" <willy@infradead.org>,
"Balbir Singh" <balbirs@nvidia.com>,
"Francois Dugast" <francois.dugast@intel.com>,
intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
"Madhavan Srinivasan" <maddy@linux.ibm.com>,
"Nicholas Piggin" <npiggin@gmail.com>,
"Michael Ellerman" <mpe@ellerman.id.au>,
"Christophe Leroy (CS GROUP)" <chleroy@kernel.org>,
"Felix Kuehling" <Felix.Kuehling@amd.com>,
"Alex Deucher" <alexander.deucher@amd.com>,
"Christian König" <christian.koenig@amd.com>,
"David Airlie" <airlied@gmail.com>,
"Simona Vetter" <simona@ffwll.ch>,
"Maarten Lankhorst" <maarten.lankhorst@linux.intel.com>,
"Maxime Ripard" <mripard@kernel.org>,
"Thomas Zimmermann" <tzimmermann@suse.de>,
"Lyude Paul" <lyude@redhat.com>,
"Danilo Krummrich" <dakr@kernel.org>,
"Bjorn Helgaas" <bhelgaas@google.com>,
"Logan Gunthorpe" <logang@deltatee.com>,
"David Hildenbrand" <david@kernel.org>,
"Oscar Salvador" <osalvador@suse.de>,
"Andrew Morton" <akpm@linux-foundation.org>,
"Leon Romanovsky" <leon@kernel.org>,
"Lorenzo Stoakes" <lorenzo.stoakes@oracle.com>,
"Liam R . Howlett" <Liam.Howlett@oracle.com>,
"Vlastimil Babka" <vbabka@suse.cz>,
"Mike Rapoport" <rppt@kernel.org>,
"Suren Baghdasaryan" <surenb@google.com>,
"Michal Hocko" <mhocko@suse.com>,
linuxppc-dev@lists.ozlabs.org, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, amd-gfx@lists.freedesktop.org,
nouveau@lists.freedesktop.org, linux-pci@vger.kernel.org,
linux-mm@kvack.org, linux-cxl@vger.kernel.org
Subject: Re: [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback
Date: Tue, 13 Jan 2026 10:44:27 +1100 [thread overview]
Message-ID: <fzpd6caij2l73jkdvvmlk4jxlrdbt5ozu4yladpsbdc4c4jvag@d72h42nfolgh> (raw)
In-Reply-To: <aWWCK0C23CUl9zEq@lstrano-desk.jf.intel.com>
On 2026-01-13 at 10:22 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> On Mon, Jan 12, 2026 at 06:15:26PM -0500, Zi Yan wrote:
> > On 12 Jan 2026, at 16:49, Matthew Brost wrote:
> >
> > > On Mon, Jan 12, 2026 at 09:45:10AM -0400, Jason Gunthorpe wrote:
> > >
> > > Hi, catching up here.
> > >
> > >> On Sun, Jan 11, 2026 at 07:51:01PM -0500, Zi Yan wrote:
> > >>> On 11 Jan 2026, at 19:19, Balbir Singh wrote:
> > >>>
> > >>>> On 1/12/26 08:35, Matthew Wilcox wrote:
> > >>>>> On Sun, Jan 11, 2026 at 09:55:40PM +0100, Francois Dugast wrote:
> > >>>>>> The core MM splits the folio before calling folio_free, restoring the
> > >>>>>> zone pages associated with the folio to an initialized state (e.g.,
> > >>>>>> non-compound, pgmap valid, etc...). The order argument represents the
> > >>>>>> folio’s order prior to the split which can be used driver side to know
> > >>>>>> how many pages are being freed.
> > >>>>>
> > >>>>> This really feels like the wrong way to fix this problem.
> > >>>>>
> > >>>
> > >>> Hi Matthew,
> > >>>
> > >>> I think the wording is confusing, since the actual issue is that:
> > >>>
> > >>> 1. zone_device_page_init() calls prep_compound_page() to form a large folio,
> > >>> 2. but free_zone_device_folio() never reverse the course,
> > >>> 3. the undo of prep_compound_page() in free_zone_device_folio() needs to
> > >>> be done before driver callback ->folio_free(), since once ->folio_free()
> > >>> is called, the folio can be reallocated immediately,
> > >>> 4. after the undo of prep_compound_page(), folio_order() can no longer provide
> > >>> the original order information, thus, folio_free() needs that for proper
> > >>> device side ref manipulation.
> > >>
> > >> There is something wrong with the driver if the "folio can be
> > >> reallocated immediately".
> > >>
> > >> The flow generally expects there to be a driver allocator linked to
> > >> folio_free()
> > >>
> > >> 1) Allocator finds free memory
> > >> 2) zone_device_page_init() allocates the memory and makes refcount=1
> > >> 3) __folio_put() knows the recount 0.
> > >> 4) free_zone_device_folio() calls folio_free(), but it doesn't
> > >> actually need to undo prep_compound_page() because *NOTHING* can
> > >> use the page pointer at this point.
> > >
> > > Correct—nothing can use the folio prior to calling folio_free(). Once
> > > folio_free() returns, the driver side is free to immediately reallocate
> > > the folio (or a subset of its pages).
> > >
> > >> 5) Driver puts the memory back into the allocator and now #1 can
> > >> happen. It knows how much memory to put back because folio->order
> > >> is valid from #2
> > >> 6) #1 happens again, then #2 happens again and the folio is in the
> > >> right state for use. The successor #2 fully undoes the work of the
> > >> predecessor #2.
> > >>
> > >> If you have races where #1 can happen immediately after #3 then the
> > >> driver design is fundamentally broken and passing around order isn't
> > >> going to help anything.
> > >>
> > >
> > > The above race does not exist; if it did, I agree we’d be solving
> > > nothing here.
> > >
> > >> If the allocator is using the struct page memory then step #5 should
> > >> also clean up the struct page with the allocator data before returning
> > >> it to the allocator.
> > >>
> > >
> > > We could move the call to free_zone_device_folio_prepare() [1] into the
> > > driver-side implementation of ->folio_free() and drop the order argument
> > > here. Zi didn’t particularly like that; he preferred calling
> > > free_zone_device_folio_prepare() [2] before invoking ->folio_free(),
> > > which is why this patch exists.
> >
> > On a second thought, if calling free_zone_device_folio_prepare() in
> > ->folio_free() works, feel free to do so.
I think making drivers do this is the correct approach and is consistent with
what P2PDMA and DAX does. All the interfaces for mapping a ZONE_DEVICE folio
currently rely on the driver correctly initialising the folio, so this special
case for ZONE_DEVICE_PRIVATE/COHERENT seemed weird to me - they shouldn't rely
on the core-mm to do some of the re-initialisation in the free paths.
Also drivers may have different strategies than just resetting everything back
to small pages. For example the may choose to only ever allocate large folios
making the whole clearing/resetting of folio fields superfluous.
- Alistair
> +1, testing this change right now and it does indeed work.
>
> Matt
>
> > >
> > > FWIW, I do not have a strong opinion here—either way works. Xe doesn’t
> > > actually need the order regardless of where
> > > free_zone_device_folio_prepare() is called, but Nouveau does need the
> > > order if free_zone_device_folio_prepare() is called before
> > > ->folio_free().
> > >
> > > [1] https://patchwork.freedesktop.org/patch/697877/?series=159120&rev=4
> > > [2] https://patchwork.freedesktop.org/patch/697709/?series=159120&rev=3#comment_1282405
> > >
> > >> I vaugely remember talking about this before in the context of the Xe
> > >> driver.. You can't just take an existing VRAM allocator and layer it
> > >> on top of the folios and have it broadly ignore the folio_free
> > >> callback.
> > >>
> > >
> > > We are definitely not ignoring the ->folio_free callback—that is the
> > > point at which we tell our VRAM allocator (DRM buddy) it is okay to
> > > release the allocation and make it available for reuse.
> > >
> > > Matt
> > >
> > >> Jsaon
> >
> >
> > Best Regards,
> > Yan, Zi
next prev parent reply other threads:[~2026-01-12 23:44 UTC|newest]
Thread overview: 46+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-01-11 20:55 [PATCH v4 0/7] Enable THP support in drm_pagemap Francois Dugast
2026-01-11 20:55 ` [PATCH v4 1/7] mm/zone_device: Add order argument to folio_free callback Francois Dugast
2026-01-11 22:35 ` Matthew Wilcox
2026-01-12 0:19 ` Balbir Singh
2026-01-12 0:51 ` Zi Yan
2026-01-12 1:37 ` Matthew Brost
2026-01-12 4:50 ` Balbir Singh
2026-01-12 13:45 ` Jason Gunthorpe
2026-01-12 16:31 ` Zi Yan
2026-01-12 16:50 ` Jason Gunthorpe
2026-01-12 17:46 ` Zi Yan
2026-01-12 18:25 ` Jason Gunthorpe
2026-01-12 18:55 ` Zi Yan
2026-01-12 19:28 ` Jason Gunthorpe
2026-01-12 23:34 ` Zi Yan
2026-01-12 23:53 ` Jason Gunthorpe
2026-01-13 0:35 ` Zi Yan
2026-01-12 23:07 ` Matthew Brost
2026-01-12 21:49 ` Matthew Brost
2026-01-12 23:15 ` Zi Yan
2026-01-12 23:22 ` Matthew Brost
2026-01-12 23:44 ` Alistair Popple [this message]
2026-01-12 23:54 ` Jason Gunthorpe
2026-01-12 23:31 ` Jason Gunthorpe
2026-01-11 20:55 ` [PATCH v4 2/7] mm/zone_device: Add free_zone_device_folio_prepare() helper Francois Dugast
2026-01-12 0:44 ` Balbir Singh
2026-01-12 1:16 ` Matthew Brost
2026-01-12 2:15 ` Balbir Singh
2026-01-12 2:37 ` Matthew Brost
2026-01-12 2:50 ` Matthew Brost
2026-01-12 23:58 ` Alistair Popple
2026-01-13 0:23 ` Matthew Brost
2026-01-13 0:43 ` Alistair Popple
2026-01-13 1:07 ` Matthew Brost
2026-01-13 1:35 ` Alistair Popple
2026-01-13 1:40 ` Matthew Brost
2026-01-13 2:06 ` Alistair Popple
2026-01-13 2:16 ` Matthew Brost
2026-01-11 20:55 ` [PATCH v4 3/7] fs/dax: Use " Francois Dugast
2026-01-12 4:14 ` kernel test robot
2026-01-11 20:55 ` [PATCH v4 4/7] drm/pagemap: Unlock and put folios when possible Francois Dugast
2026-01-11 20:55 ` [PATCH v4 5/7] drm/pagemap: Add helper to access zone_device_data Francois Dugast
2026-01-11 20:55 ` [PATCH v4 6/7] drm/pagemap: Correct cpages calculation for migrate_vma_setup Francois Dugast
2026-01-12 14:17 ` Francois Dugast
2026-01-11 20:55 ` [PATCH v4 7/7] drm/pagemap: Enable THP support for GPU memory migration Francois Dugast
2026-01-11 21:37 ` Matthew Brost
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=fzpd6caij2l73jkdvvmlk4jxlrdbt5ozu4yladpsbdc4c4jvag@d72h42nfolgh \
--to=apopple@nvidia.com \
--cc=Felix.Kuehling@amd.com \
--cc=Liam.Howlett@oracle.com \
--cc=airlied@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=alexander.deucher@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=balbirs@nvidia.com \
--cc=bhelgaas@google.com \
--cc=chleroy@kernel.org \
--cc=christian.koenig@amd.com \
--cc=dakr@kernel.org \
--cc=david@kernel.org \
--cc=dri-devel@lists.freedesktop.org \
--cc=francois.dugast@intel.com \
--cc=intel-xe@lists.freedesktop.org \
--cc=jgg@ziepe.ca \
--cc=kvm@vger.kernel.org \
--cc=leon@kernel.org \
--cc=linux-cxl@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-pci@vger.kernel.org \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=logang@deltatee.com \
--cc=lorenzo.stoakes@oracle.com \
--cc=lyude@redhat.com \
--cc=maarten.lankhorst@linux.intel.com \
--cc=maddy@linux.ibm.com \
--cc=matthew.brost@intel.com \
--cc=mhocko@suse.com \
--cc=mpe@ellerman.id.au \
--cc=mripard@kernel.org \
--cc=nouveau@lists.freedesktop.org \
--cc=npiggin@gmail.com \
--cc=osalvador@suse.de \
--cc=rppt@kernel.org \
--cc=simona@ffwll.ch \
--cc=surenb@google.com \
--cc=tzimmermann@suse.de \
--cc=vbabka@suse.cz \
--cc=willy@infradead.org \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox