From: Alistair Popple <apopple@nvidia.com>
To: Matthew Brost <matthew.brost@intel.com>
Cc: "Balbir Singh" <balbirs@nvidia.com>,
"Jason Gunthorpe" <jgg@nvidia.com>,
"Vlastimil Babka" <vbabka@suse.cz>,
"Francois Dugast" <francois.dugast@intel.com>,
intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
"Zi Yan" <ziy@nvidia.com>,
"Madhavan Srinivasan" <maddy@linux.ibm.com>,
"Nicholas Piggin" <npiggin@gmail.com>,
"Michael Ellerman" <mpe@ellerman.id.au>,
"Christophe Leroy (CS GROUP)" <chleroy@kernel.org>,
"Felix Kuehling" <Felix.Kuehling@amd.com>,
"Alex Deucher" <alexander.deucher@amd.com>,
"Christian König" <christian.koenig@amd.com>,
"David Airlie" <airlied@gmail.com>,
"Simona Vetter" <simona@ffwll.ch>,
"Maarten Lankhorst" <maarten.lankhorst@linux.intel.com>,
"Maxime Ripard" <mripard@kernel.org>,
"Thomas Zimmermann" <tzimmermann@suse.de>,
"Lyude Paul" <lyude@redhat.com>,
"Danilo Krummrich" <dakr@kernel.org>,
"David Hildenbrand" <david@kernel.org>,
"Oscar Salvador" <osalvador@suse.de>,
"Andrew Morton" <akpm@linux-foundation.org>,
"Leon Romanovsky" <leon@kernel.org>,
"Lorenzo Stoakes" <lorenzo.stoakes@oracle.com>,
"Liam R . Howlett" <Liam.Howlett@oracle.com>,
"Mike Rapoport" <rppt@kernel.org>,
"Suren Baghdasaryan" <surenb@google.com>,
"Michal Hocko" <mhocko@suse.com>,
linuxppc-dev@lists.ozlabs.org, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, amd-gfx@lists.freedesktop.org,
nouveau@lists.freedesktop.org, linux-mm@kvack.org,
linux-cxl@vger.kernel.org
Subject: Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
Date: Mon, 19 Jan 2026 16:59:56 +1100
Message-ID: <4k72r4n5poss2glrof5fsapczkpcrnpokposeikw5wjvtodbto@wpqsxoxzpvy6>
In-Reply-To: <aWsdv6dX2RgqajFQ@lstrano-desk.jf.intel.com>
On 2026-01-17 at 16:27 +1100, Matthew Brost <matthew.brost@intel.com> wrote...
> On Sat, Jan 17, 2026 at 03:42:16PM +1100, Balbir Singh wrote:
> > On 1/17/26 14:55, Matthew Brost wrote:
> > > On Fri, Jan 16, 2026 at 08:51:14PM -0400, Jason Gunthorpe wrote:
> > >> On Fri, Jan 16, 2026 at 12:31:25PM -0800, Matthew Brost wrote:
> > >>>> I suppose we could be getting say an order-9 folio that was previously used
> > >>>> as two order-8 folios? And each of them had their _nr_pages in their head
> > >>>
> > >>> Yes, this is a good example. At this point we have no idea what previous
> > >>> allocation(s) order(s) were - we could have multiple places in the loop
> > >>> where _nr_pages is populated, thus we have to clear this everywhere.
> > >>
> > >> Why? The fact you have to use such a crazy expression to even access
> > >> _nr_pages strongly says nothing will read it as _nr_pages.
> > >>
> > >> Explain each thing:
> > >>
> > >> new_page->flags.f &= ~0xffUL; /* Clear possible order, page head */
> > >>
> > >> OK, the tail page flags need to be set right, and prep_compound_page()
> > >> called later depends on them being zero.
> > >>
> > >> ((struct folio *)(new_page - 1))->_nr_pages = 0;
> > >>
> > >> Can't see a reason, nothing reads _nr_pages from a random tail
> > >> page. _nr_pages is the last 8 bytes of struct page so it overlaps
> > >> memcg_data, which is also not supposed to be read from a tail page?
This is (or was) either an order-0 page, a head page or a tail page, who
knows. So it doesn't really matter whether _nr_pages or memcg_data are
supposed to be read from a tail page. What really matters is whether any of
vm_insert_page(), migrate_vma_*() or prep_compound_page() expect this to be a
particular value when called on this page.

AFAIK memcg_data is at least expected to be NULL for migrate_vma_*() when called
on an order-0 page, which means it has to be cleared.

I think it would be far less confusing if it was just written as clearing
page->memcg_data rather than using the folio math, but it achieves the same
thing and is technically correct.
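
As a rough user-space sketch of why the two spellings hit the same word: the
mock_page and mock_folio layouts below are made up for illustration and are not
the real struct page / struct folio definitions. The only property carried over
from the discussion is that _nr_pages of a folio cast from (page - 1) lands on
the last word of page, which is the slot memcg_data uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct page (not the kernel layout). */
struct mock_page {
	unsigned long flags;
	unsigned long other[6];
	unsigned long memcg_data;	/* last word of the page struct */
};

/* Hypothetical stand-in for struct folio, overlaying two consecutive pages. */
struct mock_folio {
	struct mock_page head;		/* overlays page[0] */
	unsigned long _flags_1;		/* overlays page[1].flags */
	unsigned long _other_1[6];
	unsigned long _nr_pages;	/* overlays page[1].memcg_data */
};

/* A union keeps the overlay well defined in standard C. */
union mock_memmap {
	struct mock_page pages[2];
	struct mock_folio folio;
};

_Static_assert(offsetof(struct mock_folio, _nr_pages) ==
	       sizeof(struct mock_page) + offsetof(struct mock_page, memcg_data),
	       "_nr_pages must alias the second page's memcg_data");

/* The "folio math" spelling: clear _nr_pages of the folio starting at page - 1. */
static void clear_via_folio(union mock_memmap *m)
{
	m->folio._nr_pages = 0;
}

/* The direct spelling: clear memcg_data of the page itself. */
static void clear_via_page(union mock_memmap *m)
{
	m->pages[1].memcg_data = 0;
}
```

In this mock both helpers write the same eight bytes, so which one to use is
purely a readability question.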
> > >> new_folio->mapping = NULL;
> > >>
> > >> Pointless, prep_compound_page() -> prep_compound_tail() -> p->mapping = TAIL_MAPPING;
Not pointless - vm_insert_page() for example expects folio_test_anon(), which
won't be the case if p->mapping was previously set to TAIL_MAPPING, so it
needs to be cleared. migrate_vma_setup() has a similar issue.
> > >>
> > >> new_folio->pgmap = pgmap; /* Also clear compound head */
> > >>
> > >> Pointless, compound_head is set in prep_compound_tail(): set_compound_head(p, head);
No it isn't - we're not clearing tail pages here, we're initialising ZONE_DEVICE
struct pages ready for use by the core-mm which means the pgmap needs to be
correct.
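
A minimal user-space sketch of the point, assuming (as the quoted comment
suggests) that the pgmap pointer shares its word with compound_head, where bit 0
marks a tail page. The mock_page layout and the mock_* helpers are hypothetical
and only exist for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct dev_pagemap. */
struct mock_pgmap {
	int id;
};

/* Hypothetical, simplified page: one word shared by pgmap and compound_head. */
struct mock_page {
	unsigned long flags;
	union {
		struct mock_pgmap *pgmap;	/* device pagemap pointer */
		uintptr_t compound_head;	/* head pointer | 1 for a tail page */
	};
};

/* Mark @tail as a tail page of @head, as a previous large allocation would. */
static void mock_set_compound_head(struct mock_page *tail, struct mock_page *head)
{
	tail->compound_head = (uintptr_t)head | 1;
}

/* Bit 0 of the shared word says "this is a tail page". */
static bool mock_page_is_tail(const struct mock_page *page)
{
	return page->compound_head & 1;
}

/*
 * Reinitialise the page for reuse: storing the (aligned) pgmap pointer
 * overwrites the stale compound_head value, including its tail bit.
 */
static void mock_reinit(struct mock_page *page, struct mock_pgmap *pgmap)
{
	page->pgmap = pgmap;
}
```

So in this mock the single store does two jobs: the page ends up with the right
pgmap, and it no longer looks like a tail page left over from an earlier
allocation.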
> > >> new_folio->share = 0; /* fsdax only, unused for device private */
> > >>
> > >> Not sure, certainly share isn't read from a tail page..
Yeah, not useful for now because FS DAX isn't using this function. Arguably it
should though.
> > >>>>> Why can't this use the normal helpers, like memmap_init_compound()?
Because that's not what this function is trying to do - e.g. we might not be
trying to create a compound page. Although something like
memmap_init_zone_device() looks like it would be a good starting point, with the
page order being a parameter instead of read from the pgmap.
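
Very roughly, the shape I have in mind is something like the user-space sketch
below. Everything in it is hypothetical: mock_zone_device_init(), the mock_page
fields and the MOCK_* flag values are invented for illustration and are not
existing kernel interfaces. It only shows the idea of resetting every page in
the range first and then building a compound page only when order is non-zero.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_PG_HEAD	(1UL << 6)	/* hypothetical head-page flag bit */
#define MOCK_ORDER_MASK	0xffUL		/* low byte: order (first tail) and head bit */

/* Hypothetical stand-in for struct dev_pagemap. */
struct mock_pgmap {
	int id;
};

/* Hypothetical, simplified page with only the fields discussed here. */
struct mock_page {
	unsigned long flags;
	void *mapping;
	struct mock_pgmap *pgmap;
	struct mock_page *head;		/* non-NULL only for tail pages */
	unsigned long memcg_data;
};

static void mock_zone_device_init(struct mock_page *page,
				  struct mock_pgmap *pgmap,
				  unsigned int order)
{
	unsigned long i, nr = 1UL << order;

	/* Step 1: forget whatever the previous allocation left behind. */
	for (i = 0; i < nr; i++) {
		struct mock_page *p = page + i;

		p->flags &= ~MOCK_ORDER_MASK;
		p->mapping = NULL;
		p->memcg_data = 0;
		p->head = NULL;
		p->pgmap = pgmap;
	}

	/* Step 2: only build a compound page when one was asked for. */
	if (!order)
		return;

	page->flags |= MOCK_PG_HEAD;
	page[1].flags |= order;		/* in this mock, order lives in the first tail */
	for (i = 1; i < nr; i++)
		page[i].head = page;
}
```

The order-0 case falls out naturally: the same reset runs, and the compound
setup is simply skipped.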
> > >>>>>
> > >>>>> struct folio *new_folio = page
> > >>>>>
> > >>>>> /* First 4 tail pages are part of struct folio */
> > >>>>> for (i = 4; i < (1UL << order); i++) {
> > >>>>> prep_compound_tail(..)
> > >>>>> }
> > >>>>>
> > >>>>> prep_compound_head(page, order)
> > >>>>> new_folio->_nr_pages = 0
> > >>>>>
> > >>>>> ??
> > >>>
> > >>> I've beat this to death with Alistair, normal helpers do not work here.
> > >> What do you mean? It already calls prep_compound_page()! The issue
> > >> seems to be that prep_compound_page() makes assumptions about what
> > >> values are in flags already?
> > >>
> > >> So how about move that page flags mask logic into
> > >> prep_compound_tail()? I think that would help Vlastimil's
> > >> concern. That function is already touching most of the cache line so
> > >> an extra word shouldn't make a performance difference.
> > >>
> > >>> An order zero allocation could have _nr_pages set in its page,
> > >>> new_folio->_nr_pages is page + 1 memory.
> > >>
> > >> An order zero allocation does not have _nr_pages because it is in page
> > >> +1 memory that doesn't exist.
> > >>
> > >> An order zero allocation might have memcg_data in the same slot, does
> > >> it need zeroing? If so why not add that to prep_compound_head() ?
> > >>
> > >> Also, prep_compound_head() handles order 0 too:
> > >>
> > >> if (IS_ENABLED(CONFIG_64BIT) || order > 1) {
> > >> atomic_set(&folio->_pincount, 0);
> > >> atomic_set(&folio->_entire_mapcount, -1);
> > >> }
> > >> if (order > 1)
> > >> INIT_LIST_HEAD(&folio->_deferred_list);
> > >>
> > >> So some of the problem here looks to be not calling it:
> > >>
> > >> if (order)
> > >> prep_compound_page(page, order);
> > >>
> > >> So, remove that if ? Also shouldn't it be moved above the
> > >> set_page_count/lock_page ?
> > >>
> > >
> > > I'm not addressing each comment, some might be valid, others are not.
Hopefully some of my explanations above help.
> > >
> > > Ok, can I rework this in a follow-up - I will commit to that? Anything
> > > we touch here is extremely sensitive to failures - Intel is the primary
> > > test vector for any modification to device pages from what I can tell.
> > >
> > > The fact is that large device pages do not really work without this
> > > patch, or prior revs. I’ve spent a lot of time getting large device
> > > pages stable, both here and in the initial series, committing to help in
> > > follow on series touch SVM related things.
> > >
> >
> > Matthew, I feel your frustration and appreciate your help.
> > For the current state of 6.19, your changes work for me, I added a
> > Reviewed-by to the patch. It affects a small number of drivers and makes
> > them work for zone device folios. I am happy to maintain the changes
> > sent out as a part of zone_device_page_init()
No problem with the above, and FWIW it seems correct. Although I suspect just
setting page->memcg_data = 0 would have been far less controversial ;)
> +1
>
> > We can rework the details in a follow up series, there are many ideas
> > and ways of doing this (Jason, Alistair, Zi have good ideas as well).
> >
>
> I agree we can rework this in a follow-up — the core MM is hard, and for
> valid reasons, but we can all work together on cleaning it up.
>
> Matt
>
> > > I’m going to miss my merge window with this (RB’d) patch blocked for
> > > large device pages. Expect my commitment to helping other vendors to
> > > drop if this happens. I’ll maybe just say: that doesn’t work in my CI,
> > > try again.
> > >
> > > Or perhaps we just revert large device pages in 6.19 if we can't get a
> > > consensus here as we shouldn't ship a non-functional kernel.
> > >
> > > Matt
> > >
> > >> Jason
> >