Re: [v7 11/16] mm/migrate_device: add THP splitting during migration

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Balbir Singh <balbirs@nvidia.com>
To: Zi Yan <ziy@nvidia.com>
Cc: linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org,
	linux-mm@kvack.org, akpm@linux-foundation.org,
	"David Hildenbrand" <david@redhat.com>,
	"Joshua Hahn" <joshua.hahnjy@gmail.com>,
	"Rakie Kim" <rakie.kim@sk.com>,
	"Byungchul Park" <byungchul@sk.com>,
	"Gregory Price" <gourry@gourry.net>,
	"Ying Huang" <ying.huang@linux.alibaba.com>,
	"Alistair Popple" <apopple@nvidia.com>,
	"Oscar Salvador" <osalvador@suse.de>,
	"Lorenzo Stoakes" <lorenzo.stoakes@oracle.com>,
	"Baolin Wang" <baolin.wang@linux.alibaba.com>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	"Nico Pache" <npache@redhat.com>,
	"Ryan Roberts" <ryan.roberts@arm.com>,
	"Dev Jain" <dev.jain@arm.com>, "Barry Song" <baohua@kernel.org>,
	"Lyude Paul" <lyude@redhat.com>,
	"Danilo Krummrich" <dakr@kernel.org>,
	"David Airlie" <airlied@gmail.com>,
	"Simona Vetter" <simona@ffwll.ch>,
	"Ralph Campbell" <rcampbell@nvidia.com>,
	"Mika Penttilä" <mpenttil@redhat.com>,
	"Matthew Brost" <matthew.brost@intel.com>,
	"Francois Dugast" <francois.dugast@intel.com>
Subject: Re: [v7 11/16] mm/migrate_device: add THP splitting during migration
Date: Tue, 14 Oct 2025 08:33:45 +1100	[thread overview]
Message-ID: <7213aa5c-7129-4a3f-acbf-ab922f0f6cac@nvidia.com> (raw)
In-Reply-To: <FB8D8651-639F-4882-9868-7FA0A8B62397@nvidia.com>

On 10/14/25 08:17, Zi Yan wrote:
> On 1 Oct 2025, at 2:57, Balbir Singh wrote:
> 
>> Implement migrate_vma_split_pages() to handle THP splitting during the
>> migration process when destination cannot allocate compound pages.
>>
>> This addresses the common scenario where migrate_vma_setup() succeeds with
>> MIGRATE_PFN_COMPOUND pages, but the destination device cannot allocate
>> large pages during the migration phase.
>>
>> Key changes:
>> - migrate_vma_split_pages(): Split already-isolated pages during migration
>> - Enhanced folio_split() and __split_unmapped_folio() with isolated
>>   parameter to avoid redundant unmap/remap operations
>>
>> This provides a fallback mechansim to ensure migration succeeds even when
>> large page allocation fails at the destination.
>>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
>> Cc: Rakie Kim <rakie.kim@sk.com>
>> Cc: Byungchul Park <byungchul@sk.com>
>> Cc: Gregory Price <gourry@gourry.net>
>> Cc: Ying Huang <ying.huang@linux.alibaba.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>> Cc: Nico Pache <npache@redhat.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Dev Jain <dev.jain@arm.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Lyude Paul <lyude@redhat.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: Simona Vetter <simona@ffwll.ch>
>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>> Cc: Mika Penttilä <mpenttil@redhat.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Francois Dugast <francois.dugast@intel.com>
>>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>>  include/linux/huge_mm.h | 11 +++++-
>>  lib/test_hmm.c          |  9 +++++
>>  mm/huge_memory.c        | 46 ++++++++++++----------
>>  mm/migrate_device.c     | 85 +++++++++++++++++++++++++++++++++++------
>>  4 files changed, 117 insertions(+), 34 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2d669be7f1c8..a166be872628 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -365,8 +365,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>  		vm_flags_t vm_flags);
>>
>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> -		unsigned int new_order);
>> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> +		unsigned int new_order, bool unmapped);
>>  int min_order_for_split(struct folio *folio);
>>  int split_folio_to_list(struct folio *folio, struct list_head *list);
>>  bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>> @@ -375,6 +375,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
>>  		bool warns);
>>  int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
>>  		struct list_head *list);
>> +
>> +static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> +		unsigned int new_order)
>> +{
>> +	return __split_huge_page_to_list_to_order(page, list, new_order, false);
>> +}
>> +
>>  /*
>>   * try_folio_split - try to split a @folio at @page using non uniform split.
>>   * @folio: folio to be split
>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>> index 46fa9e200db8..df429670633e 100644
>> --- a/lib/test_hmm.c
>> +++ b/lib/test_hmm.c
>> @@ -1612,6 +1612,15 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>  	order = folio_order(page_folio(vmf->page));
>>  	nr = 1 << order;
>>
>> +	/*
>> +	 * When folios are partially mapped, we can't rely on the folio
>> +	 * order of vmf->page as the folio might not be fully split yet
>> +	 */
>> +	if (vmf->pte) {
>> +		order = 0;
>> +		nr = 1;
>> +	}
>> +
>>  	/*
>>  	 * Consider a per-cpu cache of src and dst pfns, but with
>>  	 * large number of cpus that might not scale well.
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 8c95a658b3ec..022b0729f826 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3463,15 +3463,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>>  		new_folio->mapping = folio->mapping;
>>  		new_folio->index = folio->index + i;
>>
>> -		/*
>> -		 * page->private should not be set in tail pages. Fix up and warn once
>> -		 * if private is unexpectedly set.
>> -		 */
>> -		if (unlikely(new_folio->private)) {
>> -			VM_WARN_ON_ONCE_PAGE(true, new_head);
>> -			new_folio->private = NULL;
>> -		}
>> -
>>  		if (folio_test_swapcache(folio))
>>  			new_folio->swap.val = folio->swap.val + i;
>>
>> @@ -3700,6 +3691,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>   * @lock_at: a page within @folio to be left locked to caller
>>   * @list: after-split folios will be put on it if non NULL
>>   * @uniform_split: perform uniform split or not (non-uniform split)
>> + * @unmapped: The pages are already unmapped, they are migration entries.
>>   *
>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>   * It is in charge of checking whether the split is supported or not and
>> @@ -3715,7 +3707,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>   */
>>  static int __folio_split(struct folio *folio, unsigned int new_order,
>>  		struct page *split_at, struct page *lock_at,
>> -		struct list_head *list, bool uniform_split)
>> +		struct list_head *list, bool uniform_split, bool unmapped)
>>  {
>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>> @@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>  		 * is taken to serialise against parallel split or collapse
>>  		 * operations.
>>  		 */
>> -		anon_vma = folio_get_anon_vma(folio);
>> -		if (!anon_vma) {
>> -			ret = -EBUSY;
>> -			goto out;
>> +		if (!unmapped) {
>> +			anon_vma = folio_get_anon_vma(folio);
>> +			if (!anon_vma) {
>> +				ret = -EBUSY;
>> +				goto out;
>> +			}
>> +			anon_vma_lock_write(anon_vma);
>>  		}
>>  		mapping = NULL;
>> -		anon_vma_lock_write(anon_vma);
>>  	} else {
>>  		unsigned int min_order;
>>  		gfp_t gfp;
>> @@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>  		goto out_unlock;
>>  	}
>>
>> -	unmap_folio(folio);
>> +	if (!unmapped)
>> +		unmap_folio(folio);
>>
>>  	/* block interrupt reentry in xa_lock and spinlock */
>>  	local_irq_disable();
>> @@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>
>>  			next = folio_next(new_folio);
>>
>> +			zone_device_private_split_cb(folio, new_folio);
>> +
>>  			expected_refs = folio_expected_ref_count(new_folio) + 1;
>>  			folio_ref_unfreeze(new_folio, expected_refs);
>>
>> -			lru_add_split_folio(folio, new_folio, lruvec, list);
>> +			if (!unmapped)
>> +				lru_add_split_folio(folio, new_folio, lruvec, list);
>>
>>  			/*
>>  			 * Anonymous folio with swap cache.
>> @@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>  			__filemap_remove_folio(new_folio, NULL);
>>  			folio_put_refs(new_folio, nr_pages);
>>  		}
>> +
>> +		zone_device_private_split_cb(folio, NULL);
>>  		/*
>>  		 * Unfreeze @folio only after all page cache entries, which
>>  		 * used to point to it, have been updated with new folios.
>> @@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>
>>  	local_irq_enable();
>>
>> +	if (unmapped)
>> +		return ret;
>> +
>>  	if (nr_shmem_dropped)
>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>
>> @@ -4072,12 +4075,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>   * Returns -EINVAL when trying to split to an order that is incompatible
>>   * with the folio. Splitting to order 0 is compatible with all folios.
>>   */
>> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> -				     unsigned int new_order)
>> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> +				     unsigned int new_order, bool unmapped)
>>  {
>>  	struct folio *folio = page_folio(page);
>>
>> -	return __folio_split(folio, new_order, &folio->page, page, list, true);
>> +	return __folio_split(folio, new_order, &folio->page, page, list, true,
>> +				unmapped);
>>  }
>>
>>  /*
>> @@ -4106,7 +4110,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
>>  		struct page *split_at, struct list_head *list)
>>  {
>>  	return __folio_split(folio, new_order, split_at, &folio->page, list,
>> -			false);
>> +			false, false);
>>  }
>>
>>  int min_order_for_split(struct folio *folio)
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 4156fd6190d2..fa42d2ebd024 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -306,6 +306,23 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>  			    pgmap->owner != migrate->pgmap_owner)
>>  				goto next;
>>
>> +			folio = page_folio(page);
>> +			if (folio_test_large(folio)) {
>> +				int ret;
>> +
>> +				pte_unmap_unlock(ptep, ptl);
>> +				ret = migrate_vma_split_folio(folio,
>> +							  migrate->fault_page);
>> +
>> +				if (ret) {
>> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>> +					goto next;
>> +				}
>> +
>> +				addr = start;
>> +				goto again;
>> +			}
>> +
>>  			mpfn = migrate_pfn(page_to_pfn(page)) |
>>  					MIGRATE_PFN_MIGRATE;
>>  			if (is_writable_device_private_entry(entry))
>> @@ -880,6 +897,29 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
>>  		src[i] &= ~MIGRATE_PFN_MIGRATE;
>>  	return 0;
>>  }
>> +
>> +static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
>> +					    unsigned long idx, unsigned long addr,
>> +					    struct folio *folio)
>> +{
>> +	unsigned long i;
>> +	unsigned long pfn;
>> +	unsigned long flags;
>> +	int ret = 0;
>> +
>> +	folio_get(folio);
>> +	split_huge_pmd_address(migrate->vma, addr, true);
>> +	ret = __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL,
>> +							0, true);
> 
> Why not just call __split_unmapped_folio() here? Then, you do not need to add
> a new unmapped parameter in __folio_split().
> 
> 

The benefit comes from the ref count checks and freeze/unfreeze (common code) in
__folio_split() and also from the callbacks that are to be made to the drivers on
folio split. These paths are required for both mapped and unmapped folios.

Otherwise we'd have to replicate that logic and checks again for unmapped folios
and handle post split processing again.

[...]

Thanks,
Balbir

next prev parent reply	other threads:[~2025-10-13 21:34 UTC|newest]

Thread overview: 75+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-01  6:56 [v7 00/16] mm: support device-private THP Balbir Singh
2025-10-01  6:56 ` [v7 01/16] mm/zone_device: support large zone device private folios Balbir Singh
2025-10-12  6:10   ` Lance Yang
2025-10-12 22:54     ` Balbir Singh
2025-10-01  6:56 ` [v7 02/16] mm/zone_device: Rename page_free callback to folio_free Balbir Singh
2025-10-01  6:56 ` [v7 03/16] mm/huge_memory: add device-private THP support to PMD operations Balbir Singh
2025-10-12 15:46   ` Lance Yang
2025-10-13  0:01     ` Balbir Singh
2025-10-13  1:48       ` Lance Yang
2025-10-17 14:49   ` linux-next: KVM/s390x regression (was: [v7 03/16] mm/huge_memory: add device-private THP support to PMD operations) Christian Borntraeger
2025-10-17 14:54     ` linux-next: KVM/s390x regression David Hildenbrand
2025-10-17 15:01       ` Christian Borntraeger
2025-10-17 15:07         ` David Hildenbrand
2025-10-17 15:20           ` Christian Borntraeger
2025-10-17 17:07             ` David Hildenbrand
2025-10-17 21:56               ` Balbir Singh
2025-10-17 22:15                 ` David Hildenbrand
2025-10-17 22:41                   ` David Hildenbrand
2025-10-20  7:01                     ` Christian Borntraeger
2025-10-20  7:00                 ` Christian Borntraeger
2025-10-20  8:41                   ` David Hildenbrand
2025-10-20  9:04                     ` Claudio Imbrenda
2025-10-27 16:47                     ` Claudio Imbrenda
2025-10-27 16:59                       ` David Hildenbrand
2025-10-27 17:06                       ` Christian Borntraeger
2025-10-28  9:24                         ` Balbir Singh
2025-10-28 13:01                         ` [PATCH v1 0/1] KVM: s390: Fix missing present bit for gmap puds Claudio Imbrenda
2025-10-28 13:01                           ` [PATCH v1 1/1] " Claudio Imbrenda
2025-10-28 21:23                             ` Balbir Singh
2025-10-29 10:00                             ` David Hildenbrand
2025-10-29 10:20                               ` Claudio Imbrenda
2025-10-28 22:53                           ` [PATCH v1 0/1] " Andrew Morton
2025-10-01  6:56 ` [v7 04/16] mm/rmap: extend rmap and migration support device-private entries Balbir Singh
2025-10-22 11:54   ` Lance Yang
2025-10-01  6:56 ` [v7 05/16] mm/huge_memory: implement device-private THP splitting Balbir Singh
2025-10-01  6:56 ` [v7 06/16] mm/migrate_device: handle partially mapped folios during collection Balbir Singh
2025-10-01  6:56 ` [v7 07/16] mm/migrate_device: implement THP migration of zone device pages Balbir Singh
2025-10-01  6:56 ` [v7 08/16] mm/memory/fault: add THP fault handling for zone device private pages Balbir Singh
2025-10-01  6:57 ` [v7 09/16] lib/test_hmm: add zone device private THP test infrastructure Balbir Singh
2025-10-01  6:57 ` [v7 10/16] mm/memremap: add driver callback support for folio splitting Balbir Singh
2025-10-01  6:57 ` [v7 11/16] mm/migrate_device: add THP splitting during migration Balbir Singh
2025-10-13 21:17   ` Zi Yan
2025-10-13 21:33     ` Balbir Singh [this message]
2025-10-13 21:55       ` Zi Yan
2025-10-13 22:50         ` Balbir Singh
2025-10-19  8:19   ` Wei Yang
2025-10-19 22:49     ` Balbir Singh
2025-10-19 22:59       ` Zi Yan
2025-10-21 21:34         ` Balbir Singh
2025-10-22  2:59           ` Zi Yan
2025-10-22  7:16             ` Balbir Singh
2025-10-22 15:26               ` Zi Yan
2025-10-28  9:32                 ` Balbir Singh
2025-10-01  6:57 ` [v7 12/16] lib/test_hmm: add large page allocation failure testing Balbir Singh
2025-10-01  6:57 ` [v7 13/16] selftests/mm/hmm-tests: new tests for zone device THP migration Balbir Singh
2025-10-01  6:57 ` [v7 14/16] selftests/mm/hmm-tests: partial unmap, mremap and anon_write tests Balbir Singh
2025-10-01  6:57 ` [v7 15/16] selftests/mm/hmm-tests: new throughput tests including THP Balbir Singh
2025-10-01  6:57 ` [v7 16/16] gpu/drm/nouveau: enable THP support for GPU memory migration Balbir Singh
2025-10-09  3:17 ` [v7 00/16] mm: support device-private THP Andrew Morton
2025-10-09  3:26   ` Balbir Singh
2025-10-09 10:33     ` Matthew Brost
2025-10-13 22:51       ` Balbir Singh
2025-11-11 23:43       ` Andrew Morton
2025-11-11 23:52         ` Balbir Singh
2025-11-12  0:24           ` Andrew Morton
2025-11-12  0:36             ` Balbir Singh
2025-11-20  2:40           ` Matthew Brost
2025-11-20  2:50             ` Balbir Singh
2025-11-20  2:59               ` Balbir Singh
2025-11-20  3:15                 ` Matthew Brost
2025-11-20  3:58                   ` Balbir Singh
2025-11-20  5:46                     ` Balbir Singh
2025-11-20  5:53                     ` Matthew Brost
2025-11-20  6:03                       ` Balbir Singh
2025-11-20 17:27                         ` Matthew Brost

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7213aa5c-7129-4a3f-acbf-ab922f0f6cac@nvidia.com \
    --to=balbirs@nvidia.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=airlied@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=apopple@nvidia.com \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=byungchul@sk.com \
    --cc=dakr@kernel.org \
    --cc=david@redhat.com \
    --cc=dev.jain@arm.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=francois.dugast@intel.com \
    --cc=gourry@gourry.net \
    --cc=joshua.hahnjy@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=lyude@redhat.com \
    --cc=matthew.brost@intel.com \
    --cc=mpenttil@redhat.com \
    --cc=npache@redhat.com \
    --cc=osalvador@suse.de \
    --cc=rakie.kim@sk.com \
    --cc=rcampbell@nvidia.com \
    --cc=ryan.roberts@arm.com \
    --cc=simona@ffwll.ch \
    --cc=ying.huang@linux.alibaba.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox