From: Matthew Brost <matthew.brost@intel.com>
To: "Mika Penttilä" <mpenttil@redhat.com>
Cc: <linux-kernel@vger.kernel.org>, <dri-devel@lists.freedesktop.org>,
	<linux-mm@kvack.org>, Andrew Morton <akpm@linux-foundation.org>,
	"David Hildenbrand" <david@redhat.com>, Zi Yan <ziy@nvidia.com>,
	Joshua Hahn <joshua.hahnjy@gmail.com>,
	Rakie Kim <rakie.kim@sk.com>, Byungchul Park <byungchul@sk.com>,
	Gregory Price <gourry@gourry.net>,
	Ying Huang <ying.huang@linux.alibaba.com>,
	Alistair Popple <apopple@nvidia.com>,
	"Oscar Salvador" <osalvador@suse.de>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Nico Pache <npache@redhat.com>,
	Ryan Roberts <ryan.roberts@arm.com>, Dev Jain <dev.jain@arm.com>,
	Barry Song <baohua@kernel.org>, Lyude Paul <lyude@redhat.com>,
	Danilo Krummrich <dakr@kernel.org>,
	David Airlie <airlied@gmail.com>, Simona Vetter <simona@ffwll.ch>,
	Ralph Campbell <rcampbell@nvidia.com>,
	Francois Dugast <francois.dugast@intel.com>,
	Balbir Singh <balbirs@nvidia.com>
Subject: Re: [PATCH] fixup: mm/migrate_device: handle partially mapped folios during
Date: Thu, 20 Nov 2025 14:04:37 -0800	[thread overview]
Message-ID: <aR+QdfjRw+U5go9+@lstrano-desk.jf.intel.com> (raw)
In-Reply-To: <a6d4dff5-15be-48e3-9bb8-00bb44dc5584@redhat.com>

On Thu, Nov 20, 2025 at 07:43:42PM +0200, Mika Penttilä wrote:
> Hi,
> 
> On 11/20/25 19:05, Matthew Brost wrote:
> 
> > Splitting a partially mapped folio caused a regression in the Intel Xe
> > SVM test suite in the mremap section, resulting in the following stack
> > trace:
> >
> >  INFO: task kworker/u65:2:1642 blocked for more than 30 seconds.
> > [  212.624286]       Tainted: G S      W           6.18.0-rc6-xe+ #1719
> > [  212.638288] Workqueue: xe_page_fault_work_queue xe_pagefault_queue_work [xe]
> > [  212.638323] Call Trace:
> > [  212.638324]  <TASK>
> > [  212.638325]  __schedule+0x4b0/0x990
> > [  212.638330]  schedule+0x22/0xd0
> > [  212.638331]  io_schedule+0x41/0x60
> > [  212.638333]  migration_entry_wait_on_locked+0x1d8/0x2d0
> > [  212.638336]  ? __pfx_wake_page_function+0x10/0x10
> > [  212.638339]  migration_entry_wait+0xd2/0xe0
> > [  212.638341]  hmm_vma_walk_pmd+0x7c9/0x8d0
> > [  212.638343]  walk_pgd_range+0x51d/0xa40
> > [  212.638345]  __walk_page_range+0x75/0x1e0
> > [  212.638347]  walk_page_range_mm+0x138/0x1f0
> > [  212.638349]  hmm_range_fault+0x59/0xa0
> > [  212.638351]  drm_gpusvm_get_pages+0x194/0x7b0 [drm_gpusvm_helper]
> > [  212.638354]  drm_gpusvm_range_get_pages+0x2d/0x40 [drm_gpusvm_helper]
> > [  212.638355]  __xe_svm_handle_pagefault+0x259/0x900 [xe]
> > [  212.638375]  ? update_load_avg+0x7f/0x6c0
> > [  212.638377]  ? update_curr+0x13d/0x170
> > [  212.638379]  xe_svm_handle_pagefault+0x37/0x90 [xe]
> > [  212.638396]  xe_pagefault_queue_work+0x2da/0x3c0 [xe]
> > [  212.638420]  process_one_work+0x16e/0x2e0
> > [  212.638422]  worker_thread+0x284/0x410
> > [  212.638423]  ? __pfx_worker_thread+0x10/0x10
> > [  212.638425]  kthread+0xec/0x210
> > [  212.638427]  ? __pfx_kthread+0x10/0x10
> > [  212.638428]  ? __pfx_kthread+0x10/0x10
> > [  212.638430]  ret_from_fork+0xbd/0x100
> > [  212.638433]  ? __pfx_kthread+0x10/0x10
> > [  212.638434]  ret_from_fork_asm+0x1a/0x30
> > [  212.638436]  </TASK>
> >
> > The issue appears to be that migration PTEs are not properly removed
> > after a split.
> >
> > This change refactors the code to perform the split in a slightly
> > different manner while retaining the original patch’s intent. With this
> > update, the Intel Xe SVM test suite fully passes.
> >
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Zi Yan <ziy@nvidia.com>
> > Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
> > Cc: Rakie Kim <rakie.kim@sk.com>
> > Cc: Byungchul Park <byungchul@sk.com>
> > Cc: Gregory Price <gourry@gourry.net>
> > Cc: Ying Huang <ying.huang@linux.alibaba.com>
> > Cc: Alistair Popple <apopple@nvidia.com>
> > Cc: Oscar Salvador <osalvador@suse.de>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> > Cc: Nico Pache <npache@redhat.com>
> > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > Cc: Dev Jain <dev.jain@arm.com>
> > Cc: Barry Song <baohua@kernel.org>
> > Cc: Lyude Paul <lyude@redhat.com>
> > Cc: Danilo Krummrich <dakr@kernel.org>
> > Cc: David Airlie <airlied@gmail.com>
> > Cc: Simona Vetter <simona@ffwll.ch>
> > Cc: Ralph Campbell <rcampbell@nvidia.com>
> > Cc: Mika Penttilä <mpenttil@redhat.com>
> > Cc: Francois Dugast <francois.dugast@intel.com>
> > Cc: Balbir Singh <balbirs@nvidia.com>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> >
> > ---
> > This fixup should be squashed into the patch "mm/migrate_device: handle
> > partially mapped folios during" in mm/mm-unstable
> >
> > I replaced the original patch with a local patch I authored a while back
> > that solves the same problem but uses a different code structure. The
> > failing test case—only available on an Xe driver—passes with this patch.
> > I can attempt to fix up the original patch within its structure if
> > that’s preferred.
> > ---
> >  mm/migrate_device.c | 42 ++++++++++++++++++++++++------------------
> >  1 file changed, 24 insertions(+), 18 deletions(-)
> >
> > diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> > index fa42d2ebd024..69e88f4a2563 100644
> > --- a/mm/migrate_device.c
> > +++ b/mm/migrate_device.c
> > @@ -254,6 +254,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >  	spinlock_t *ptl;
> >  	struct folio *fault_folio = migrate->fault_page ?
> >  		page_folio(migrate->fault_page) : NULL;
> > +	struct folio *split_folio = NULL;
> >  	pte_t *ptep;
> >  
> >  again:
> > @@ -266,10 +267,11 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >  			return 0;
> >  	}
> >  
> > -	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> > +	ptep = pte_offset_map_lock(mm, pmdp, start, &ptl);
> >  	if (!ptep)
> >  		goto again;
> >  	arch_enter_lazy_mmu_mode();
> > +	ptep += (addr - start) / PAGE_SIZE;
> >  
> >  	for (; addr < end; addr += PAGE_SIZE, ptep++) {
> >  		struct dev_pagemap *pgmap;
> > @@ -347,22 +349,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >  					pgmap->owner != migrate->pgmap_owner)
> >  					goto next;
> >  			}
> > -			folio = page ? page_folio(page) : NULL;
> > -			if (folio && folio_test_large(folio)) {
> > -				int ret;
> > -
> > -				pte_unmap_unlock(ptep, ptl);
> > -				ret = migrate_vma_split_folio(folio,
> > -							  migrate->fault_page);
> > -
> > -				if (ret) {
> > -					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> > -					goto next;
> > -				}
> > -
> > -				addr = start;
> > -				goto again;
> > -			}
> >  			mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
> >  			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
> >  		}
> > @@ -400,6 +386,11 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >  			bool anon_exclusive;
> >  			pte_t swp_pte;
> >  
> > +			if (folio_order(folio)) {
> > +				split_folio = folio;
> > +				goto split;
> > +			}
> > +
> >  			flush_cache_page(vma, addr, pte_pfn(pte));
> >  			anon_exclusive = folio_test_anon(folio) &&
> >  					  PageAnonExclusive(page);
> > @@ -478,8 +469,23 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >  	if (unmapped)
> >  		flush_tlb_range(walk->vma, start, end);
> >  
> > +split:
> >  	arch_leave_lazy_mmu_mode();
> > -	pte_unmap_unlock(ptep - 1, ptl);
> > +	pte_unmap_unlock(ptep - 1 + !!split_folio, ptl);
> > +
> > +	if (split_folio) {
> > +		int ret;
> > +
> > +		ret = split_folio(split_folio);
> > +		if (fault_folio != split_folio)
> > +			folio_unlock(split_folio);
> 
> I think wrong folio is left locked in case of fault_folio != NULL. Look how
> migrate_vma_split_folio() handles it.
>

Ah, yes. It took me a minute, but I think I understand what the code is
doing. It has to relock the fault page if it was included in the split.
I actually think that code has a corner-case bug in it too.
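
For reference, the relock dance goes roughly like this. This is a sketch
paraphrased from memory, not the verbatim mm-unstable helper
(sketch_split() is my name for it), and it assumes the caller holds a
reference and the folio lock on entry:

/*
 * Sketch only, paraphrased from memory. After split_folio() succeeds,
 * only the folio containing the old head page is still locked. If
 * fault_page now lives in a different (tail-derived) folio, the lock
 * has to be moved over to that folio before returning.
 */
static int sketch_split(struct folio *folio, struct page *fault_page)
{
	struct folio *fault_folio = fault_page ?
		page_folio(fault_page) : NULL;
	struct folio *new_folio;
	int ret;

	ret = split_folio(folio);
	if (ret)
		return ret;

	if (!fault_folio) {
		folio_unlock(folio);
		folio_put(folio);
		return 0;
	}

	/* Re-resolve: the split may have moved fault_page to a new folio. */
	new_folio = page_folio(fault_page);
	if (new_folio != folio) {
		folio_get(new_folio);
		folio_lock(new_folio);
		folio_unlock(folio);
		folio_put(folio);
	}

	return 0;
}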
 
> > +		folio_put(split_folio);
> > +		if (ret)
> > +			return migrate_vma_collect_skip(addr, end, walk);
> > +
> > +		split_folio = NULL;
> > +		goto again;
> > +	}
> >  
> >  	return 0;
> >  }
> 
> How is this making a difference? I suppose it's only the
> migrate_vma_collect_skip() after a failed split?

Yes, that is one part, but perhaps that is actually harmless, as the
split will just keep on failing.
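
As far as I recall (worth double-checking against mm/migrate_device.c),
migrate_vma_collect_skip() just records empty src/dst slots for the
range, so a repeatedly failing split only means those pages are left
unmigrated:

/* From memory, not a verbatim copy of mm/migrate_device.c. */
static int migrate_vma_collect_skip(unsigned long start,
				    unsigned long end,
				    struct mm_walk *walk)
{
	struct migrate_vma *migrate = walk->private;
	unsigned long addr;

	for (addr = start; addr < end; addr += PAGE_SIZE) {
		/* Zero src/dst means "do not migrate this page". */
		migrate->dst[migrate->npages] = 0;
		migrate->src[migrate->npages++] = 0;
	}

	return 0;
}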

I think the real issue is that it restarts the loop with goto again by
resetting addr to start. If pages earlier in the range were already
locked and had migration entries installed, the second pass fails to
lock those pages and records NULL mpfns for them, so the migration
entries are never removed. That is what leads to the stack trace I'm
seeing.
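
To make that concrete: once the first pass has replaced a PTE with a
migration entry, the restarted walk sees a non-present entry that is
not device-private and records a zero mpfn for it. Roughly, paraphrased
from the shape of migrate_vma_collect_pmd() rather than the exact code:

	pte = ptep_get(ptep);
	if (!pte_present(pte)) {
		swp_entry_t entry = pte_to_swp_entry(pte);

		/*
		 * A migration entry installed by the first pass is not
		 * a device-private entry, so it falls through with
		 * mpfn == 0. Nothing later restores the PTE, and anyone
		 * touching the page sleeps forever in
		 * migration_entry_wait().
		 */
		if (!is_device_private_entry(entry))
			goto next;	/* mpfn stays 0 */
	}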

> Why are you just removing the other migrate_vma_split_folio() call? 

I was a bit rushed and overloaded with other things. I think I have a
way to fix this within the framework of the original patch. Another
patch in Balbir’s series has the same problem. Let me try this one
again; I'm testing another version now.

Thanks for the input.

Matt

> 
> --Mika
> 

