* [RFC PATCH 0/5] drm/amdkfd: Add batch SVM range registration support
@ 2025-11-12 7:29 Honglei Huang
2025-11-12 7:29 ` [PATCH 1/5] drm/amdkfd: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute Honglei Huang
` (5 more replies)
0 siblings, 6 replies; 10+ messages in thread
From: Honglei Huang @ 2025-11-12 7:29 UTC (permalink / raw)
To: Felix.Kuehling, alexander.deucher, christian.koenig, Ray.Huang
Cc: dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm, honghuang
Hi all,
This RFC patch series introduces a new mechanism for batch registration of
multiple non-contiguous SVM (Shared Virtual Memory) ranges in a single ioctl
call. The primary goal of this series is to start a discussion about the best
approach to handle scattered user memory allocations in GPU workloads.
Background and Motivation
==========================
Current applications using ROCm/HSA often need to register many scattered
memory buffers (e.g., multiple malloc() allocations) for GPU access. With the
existing AMDKFD_IOC_SVM ioctl, each range must be registered individually,
leading to:
- Blocking issue in some special use cases with many memory ranges
- High system call overhead when dealing with dozens or hundreds of ranges
- Inefficient resource management
- Complexity in userspace applications
Use Case Example
================
Consider a typical ML/HPC workload that allocates 100+ small buffers across
different parts of the address space. Currently, this requires 100+ separate
ioctl calls. The proposed batch interface reduces this to a single call.
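To make the contrast concrete, below is a rough userspace sketch of both
flows. It is illustrative only: nbufs, bufs[], sizes[], ranges[],
span_start/span_size and kfd_fd are placeholders, and the batch
structures are the ones introduced in patches 2 and 3 of this series.

  struct kfd_ioctl_svm_args args = { .op = KFD_IOCTL_SVM_OP_SET_ATTR };
  struct kfd_ioctl_svm_ranges_args batch = { .op = KFD_IOCTL_SVM_OP_SET_ATTR };
  int i;

  /* Today: one AMDKFD_IOC_SVM call per scattered buffer. */
  for (i = 0; i < nbufs; i++) {
          args.start_addr = (uint64_t)bufs[i];
          args.size = sizes[i];
          ioctl(kfd_fd, AMDKFD_IOC_SVM, &args);
  }

  /* Proposed: describe all buffers once, then make a single call. */
  batch.start_addr = span_start;
  batch.size = span_size;
  batch.nranges = nbufs;
  batch.ranges_ptr = (uint64_t)ranges; /* one {addr, size} per buffer */
  ioctl(kfd_fd, AMDKFD_IOC_SVM_RANGES, &batch);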
Paravirtualized environments exacerbate this issue, as KVM's memory backing
is often non-contiguous at the host level. In virtualized environments, guest
physical memory appears contiguous to the VM but is actually scattered across
host memory pages. This fragmentation means that what appears as a single
large allocation in the guest may require multiple discrete SVM registrations
to properly handle the underlying host memory layout, further multiplying the
number of required ioctl calls.
Current Implementation - A Workaround Approach
===============================================
This patch series implements a WORKAROUND solution that pins user pages in
memory to enable batch registration. While functional, this approach has
several significant limitations:
**Major Concern: Memory Pinning**
- The implementation uses pin_user_pages_fast() to lock pages in RAM
- This defeats the purpose of SVM's on-demand paging mechanism
- Prevents memory oversubscription and dynamic migration
- May cause memory pressure on systems with limited RAM
- Goes against the fundamental design philosophy of HMM-based SVM
**Known Limitations:**
1. Increased memory footprint due to pinned pages
2. Potential for memory fragmentation
3. No support for transparent huge pages in pinned regions
4. Limited interaction with memory cgroups and resource controls
5. Complexity in handling VMA operations and lifecycle management
6. May interfere with NUMA optimization and page migration
Why Submit This RFC?
====================
Despite the limitations above, I am submitting this series to:
1. **Start the Discussion**: I want community feedback on whether batch
registration is a useful feature worth pursuing.
2. **Explore Better Alternatives**: Is there a way to achieve batch
registration without pinning? Could I extend HMM to better support
this use case?
3. **Understand Trade-offs**: For some workloads, the performance benefit
of batch registration might outweigh the drawbacks of pinning. I'd
like to understand where the balance lies.
Questions for the Community
============================
1. Are there existing mechanisms in HMM or mm that could support batch
operations without pinning?
2. Would a different approach (e.g., async registration, delayed validation)
be more acceptable?
Alternative Approaches Considered
==================================
I've considered several alternatives:
A) **Pure HMM approach**: Register ranges without pinning, rely entirely on
   HMM's fault-based on-demand paging.
B) **Userspace batching library**: Hide multiple ioctls behind a library.
Patch Series Overview
=====================
Patch 1: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute type
Patch 2: Define data structures for batch SVM range registration
Patch 3: Add new AMDKFD_IOC_SVM_RANGES ioctl command
Patch 4: Implement page pinning mechanism for scattered ranges
Patch 5: Wire up the ioctl handler and attribute processing
Testing
=======
The series has been tested with:
- Multiple scattered malloc() allocations (2-2000+ ranges)
- Various allocation sizes (4KB to 1G+)
- GPU compute workloads using the registered ranges
- Memory pressure scenarios
- OpenCL CTS in KVM guest environment
- HIP catch tests in KVM guest environment
- Some AI applications like Stable Diffusion, ComfyUI, 3B LLM models based
on HuggingFace transformers
I understand this approach is not ideal and am committed to working on a
better solution based on community feedback. This RFC is the starting point
for that discussion.
Thank you for your time and consideration.
Best regards,
Honglei Huang
---
Honglei Huang (5):
drm/amdkfd: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute
drm/amdkfd: Add SVM ranges data structures
drm/amdkfd: Add AMDKFD_IOC_SVM_RANGES ioctl command
drm/amdkfd: Add support for pinned user pages in SVM ranges
drm/amdkfd: Wire up SVM ranges ioctl handler
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 67 +++++++++++
drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 232 +++++++++++++++++++++++++++++--
drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 3 +
include/uapi/linux/kfd_ioctl.h | 52 +++++++-
4 files changed, 348 insertions(+), 6 deletions(-)
* [PATCH 1/5] drm/amdkfd: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute
2025-11-12 7:29 [RFC PATCH 0/5] drm/amdkfd: Add batch SVM range registration support Honglei Huang
@ 2025-11-12 7:29 ` Honglei Huang
2025-11-12 7:29 ` [PATCH 2/5] drm/amdkfd: Add SVM ranges data structures Honglei Huang
` (4 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Honglei Huang @ 2025-11-12 7:29 UTC (permalink / raw)
To: Felix.Kuehling, alexander.deucher, christian.koenig, Ray.Huang
Cc: dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm, honghuang, Honglei Huang
From: Honglei Huang <Honglei1.Huang@amd.com>
Add a new SVM attribute type to indicate whether a memory range is
a special mapped VMA (VM_PFNMAP or VM_MIXEDMAP). This attribute will
be used to support non-contiguous memory mappings in SVM ranges.
The MAPPED attribute allows the driver to distinguish between regular
anonymous memory and pre-mapped device or reserved memory regions,
enabling different handling paths for page pinning and GPU mapping.
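As an illustration, userspace would pass the new attribute alongside the
existing ones in the attrs[] array of the SVM ioctl. A hedged sketch
(the SET_FLAGS entry is just an example of an existing attribute):

  struct kfd_ioctl_svm_attribute attrs[] = {
          { .type = KFD_IOCTL_SVM_ATTR_SET_FLAGS,
            .value = KFD_IOCTL_SVM_FLAG_HOST_ACCESS },
          /* new: mark the range as a special mapped VMA */
          { .type = KFD_IOCTL_SVM_ATTR_MAPPED, .value = 1 },
  };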
Signed-off-by: Honglei Huang <Honglei1.Huang@amd.com>
---
include/uapi/linux/kfd_ioctl.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
index 2040a470ddb4..320a4a0e10bc 100644
--- a/include/uapi/linux/kfd_ioctl.h
+++ b/include/uapi/linux/kfd_ioctl.h
@@ -703,6 +703,7 @@ enum kfd_ioctl_svm_location {
* @KFD_IOCTL_SVM_ATTR_CLR_FLAGS: bitmask of flags to clear
* @KFD_IOCTL_SVM_ATTR_GRANULARITY: migration granularity
* (log2 num pages)
+ * @KFD_IOCTL_SVM_ATTR_MAPPED: indicates whether the range is VM_PFNMAP or VM_MIXEDMAP
*/
enum kfd_ioctl_svm_attr_type {
KFD_IOCTL_SVM_ATTR_PREFERRED_LOC,
@@ -712,7 +713,8 @@ enum kfd_ioctl_svm_attr_type {
KFD_IOCTL_SVM_ATTR_NO_ACCESS,
KFD_IOCTL_SVM_ATTR_SET_FLAGS,
KFD_IOCTL_SVM_ATTR_CLR_FLAGS,
- KFD_IOCTL_SVM_ATTR_GRANULARITY
+ KFD_IOCTL_SVM_ATTR_GRANULARITY,
+ KFD_IOCTL_SVM_ATTR_MAPPED
};
/**
--
2.34.1
* [PATCH 2/5] drm/amdkfd: Add SVM ranges data structures
2025-11-12 7:29 [RFC PATCH 0/5] drm/amdkfd: Add batch SVM range registration support Honglei Huang
2025-11-12 7:29 ` [PATCH 1/5] drm/amdkfd: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute Honglei Huang
@ 2025-11-12 7:29 ` Honglei Huang
2025-11-12 7:29 ` [PATCH 3/5] drm/amdkfd: Add AMDKFD_IOC_SVM_RANGES ioctl command Honglei Huang
` (3 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Honglei Huang @ 2025-11-12 7:29 UTC (permalink / raw)
To: Felix.Kuehling, alexander.deucher, christian.koenig, Ray.Huang
Cc: dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm, honghuang, Honglei Huang
From: Honglei Huang <Honglei1.Huang@amd.com>
Add new UAPI data structures to support batch SVM range registration:
- struct kfd_ioctl_svm_range: Describes a single SVM range with its
virtual address and size.
- struct kfd_ioctl_svm_ranges_args: Arguments for batch registration
of multiple non-contiguous SVM ranges. This structure allows
registering multiple ranges with the same set of attributes in a
single ioctl call, improving efficiency over multiple individual
ioctl calls.
The new structures enable userspace to efficiently register scattered
memory buffers (e.g., multiple malloc allocations) to GPU address
space without requiring them to be physically or virtually contiguous.
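Unlike kfd_ioctl_svm_args, which carries its attributes in a trailing
flexible array, both arrays here are passed as __u64-encoded user
pointers. A hedged usage sketch (buf_a, buf_b, total_span, attrs and
nattr are placeholders):

  struct kfd_ioctl_svm_range ranges[2] = {
          { .addr = (uint64_t)buf_a, .size = 0x1000 },
          { .addr = (uint64_t)buf_b, .size = 0x2000 },
  };
  struct kfd_ioctl_svm_ranges_args args = {
          .start_addr = (uint64_t)buf_a,
          .size = total_span,
          .op = KFD_IOCTL_SVM_OP_SET_ATTR,
          .nattr = nattr,
          .attrs_ptr = (uint64_t)attrs,
          .nranges = 2,
          .ranges_ptr = (uint64_t)ranges,
  };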
Signed-off-by: Honglei Huang <Honglei1.Huang@amd.com>
---
include/uapi/linux/kfd_ioctl.h | 42 ++++++++++++++++++++++++++++++++++
1 file changed, 42 insertions(+)
diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
index 320a4a0e10bc..d782bda1d2ca 100644
--- a/include/uapi/linux/kfd_ioctl.h
+++ b/include/uapi/linux/kfd_ioctl.h
@@ -777,6 +777,48 @@ struct kfd_ioctl_svm_args {
struct kfd_ioctl_svm_attribute attrs[];
};
+/**
+ * kfd_ioctl_svm_range - SVM range descriptor
+ *
+ * @addr: starting virtual address of the SVM range
+ * @size: size of the SVM range in bytes
+ *
+ * Both @addr and @size must be page aligned.
+ */
+struct kfd_ioctl_svm_range {
+ __u64 addr;
+ __u64 size;
+};
+
+/**
+ * kfd_ioctl_svm_ranges_args - Arguments for SVM register ranges ioctl
+ *
+ * @start_addr: starting virtual address of the overall span
+ * @size: total size of the span in bytes
+ * @op: operation to perform (see enum @kfd_ioctl_svm_op)
+ * @nattr: number of attributes pointed to by @attrs_ptr
+ * @nranges: number of ranges pointed to by @ranges_ptr
+ *
+ * This ioctl allows registering multiple SVM ranges with the same
+ * set of attributes. This is more efficient than calling the SVM
+ * ioctl multiple times for each range.
+ *
+ * The semantics of the operations and attributes are the same as
+ * for kfd_ioctl_svm_args.
+ */
+struct kfd_ioctl_svm_ranges_args {
+ __u64 start_addr;
+ __u64 size;
+ __u32 op;
+ __u32 nattr;
+ /* Pointer to user array of attributes */
+ __u64 attrs_ptr;
+ __u32 nranges;
+ __u32 pad;
+ /* Pointer to user array of ranges */
+ __u64 ranges_ptr;
+};
+
/**
* kfd_ioctl_set_xnack_mode_args - Arguments for set_xnack_mode
*
--
2.34.1
* [PATCH 3/5] drm/amdkfd: Add AMDKFD_IOC_SVM_RANGES ioctl command
2025-11-12 7:29 [RFC PATCH 0/5] drm/amdkfd: Add batch SVM range registration support Honglei Huang
2025-11-12 7:29 ` [PATCH 1/5] drm/amdkfd: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute Honglei Huang
2025-11-12 7:29 ` [PATCH 2/5] drm/amdkfd: Add SVM ranges data structures Honglei Huang
@ 2025-11-12 7:29 ` Honglei Huang
2025-11-12 7:29 ` [PATCH 4/5] drm/amdkfd: Add support for pinned user pages in SVM ranges Honglei Huang
` (2 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Honglei Huang @ 2025-11-12 7:29 UTC (permalink / raw)
To: Felix.Kuehling, alexander.deucher, christian.koenig, Ray.Huang
Cc: dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm, honghuang, Honglei Huang
From: Honglei Huang <Honglei1.Huang@amd.com>
Define a new ioctl command AMDKFD_IOC_SVM_RANGES (0x27) to support
batch registration of multiple SVM ranges. Update AMDKFD_COMMAND_END
from 0x27 to 0x28 accordingly.
This ioctl provides a more efficient interface for userspace to
register multiple non-contiguous memory ranges with the same set
of SVM attributes in a single system call, reducing context switching
overhead compared to multiple AMDKFD_IOC_SVM calls.
Signed-off-by: Honglei Huang <Honglei1.Huang@amd.com>
---
include/uapi/linux/kfd_ioctl.h | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
index d782bda1d2ca..c5f9595ef30d 100644
--- a/include/uapi/linux/kfd_ioctl.h
+++ b/include/uapi/linux/kfd_ioctl.h
@@ -663,7 +663,6 @@ enum kfd_mmio_remap {
#define KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED 0x00000040
/* Fine grained coherency between all devices using device-scope atomics */
#define KFD_IOCTL_SVM_FLAG_EXT_COHERENT 0x00000080
-
/**
* kfd_ioctl_svm_op - SVM ioctl operations
*
@@ -1622,7 +1621,10 @@ struct kfd_ioctl_dbg_trap_args {
#define AMDKFD_IOC_DBG_TRAP \
AMDKFD_IOWR(0x26, struct kfd_ioctl_dbg_trap_args)
+#define AMDKFD_IOC_SVM_RANGES \
+ AMDKFD_IOWR(0x27, struct kfd_ioctl_svm_ranges_args)
+
#define AMDKFD_COMMAND_START 0x01
-#define AMDKFD_COMMAND_END 0x27
+#define AMDKFD_COMMAND_END 0x28
#endif
--
2.34.1
* [PATCH 4/5] drm/amdkfd: Add support for pinned user pages in SVM ranges
2025-11-12 7:29 [RFC PATCH 0/5] drm/amdkfd: Add batch SVM range registration support Honglei Huang
` (2 preceding siblings ...)
2025-11-12 7:29 ` [PATCH 3/5] drm/amdkfd: Add AMDKFD_IOC_SVM_RANGES ioctl command Honglei Huang
@ 2025-11-12 7:29 ` Honglei Huang
2025-11-12 7:29 ` [PATCH 5/5] drm/amdkfd: Wire up SVM ranges ioctl handler Honglei Huang
2025-11-12 8:34 ` [RFC PATCH 0/5] drm/amdkfd: Add batch SVM range registration support Christian König
5 siblings, 0 replies; 10+ messages in thread
From: Honglei Huang @ 2025-11-12 7:29 UTC (permalink / raw)
To: Felix.Kuehling, alexander.deucher, christian.koenig, Ray.Huang
Cc: dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm, honghuang, Honglei Huang
From: Honglei Huang <Honglei1.Huang@amd.com>
Implement core functionality to pin and manage user pages for
non-contiguous SVM ranges:
1. Add svm_pin_user_ranges() function:
- Pin multiple non-contiguous user memory ranges
- Use pin_user_pages_fast() to lock pages in memory
- Store pinned pages in VMA's vm_private_data
- Set up custom VMA operations for fault handling
2. Add svm_range_get_mapped_pages() function:
- Optimized path for pre-mapped VMAs
- Retrieve pages directly from vm_private_data
- Bypass HMM for already-pinned pages
3. Implement svm_iovec_ops VMA operations:
- svm_iovec_fault(): Handle page faults by returning pre-pinned pages
- svm_iovec_close(): Cleanup and unpin pages on VMA close
4. Add is_map flag to struct svm_range:
- Track whether a range uses the pinned pages mechanism
- Enable conditional logic in DMA mapping and validation paths
5. Update DMA mapping logic:
- Skip special device page handling for pinned user pages
- Treat pinned pages as regular system memory for DMA
6. Modify validation logic:
- svm_range_is_valid() accepts mapped VMAs when is_map flag is set
- svm_range_validate_and_map() uses appropriate page retrieval path
This infrastructure enables efficient handling of scattered user
buffers without requiring memory to be virtually contiguous,
supporting use cases like multiple malloc() allocations being
registered to GPU address space.
Signed-off-by: Honglei Huang <Honglei1.Huang@amd.com>
---
drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 232 ++++++++++++++++++++++++++-
drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 3 +
2 files changed, 229 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 31e500859ab0..fef0d147d938 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -179,7 +179,7 @@ svm_range_dma_map_dev(struct amdgpu_device *adev, struct svm_range *prange,
dma_unmap_page(dev, addr[i], PAGE_SIZE, dir);
page = hmm_pfn_to_page(hmm_pfns[i]);
- if (is_zone_device_page(page)) {
+ if (is_zone_device_page(page) && prange->svm_bo && !prange->is_map) {
struct amdgpu_device *bo_adev = prange->svm_bo->node->adev;
addr[i] = (hmm_pfns[i] << PAGE_SHIFT) +
@@ -682,6 +682,18 @@ static int svm_range_bo_validate(void *param, struct amdgpu_bo *bo)
return ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
}
+static bool
+svm_range_has_mapped_attr(uint32_t nattr, struct kfd_ioctl_svm_attribute *attrs)
+{
+ uint32_t i;
+
+ for (i = 0; i < nattr; i++) {
+ if (attrs[i].type == KFD_IOCTL_SVM_ATTR_MAPPED)
+ return true;
+ }
+ return false;
+}
+
static int
svm_range_check_attr(struct kfd_process *p,
uint32_t nattr, struct kfd_ioctl_svm_attribute *attrs)
@@ -713,6 +725,8 @@ svm_range_check_attr(struct kfd_process *p,
break;
case KFD_IOCTL_SVM_ATTR_GRANULARITY:
break;
+ case KFD_IOCTL_SVM_ATTR_MAPPED:
+ break;
default:
pr_debug("unknown attr type 0x%x\n", attrs[i].type);
return -EINVAL;
@@ -777,6 +791,9 @@ svm_range_apply_attrs(struct kfd_process *p, struct svm_range *prange,
case KFD_IOCTL_SVM_ATTR_GRANULARITY:
prange->granularity = min_t(uint32_t, attrs[i].value, 0x3F);
break;
+ case KFD_IOCTL_SVM_ATTR_MAPPED:
+ prange->is_map = true;
+ break;
default:
WARN_ONCE(1, "svm_range_check_attrs wasn't called?");
}
@@ -830,6 +847,8 @@ svm_range_is_same_attrs(struct kfd_process *p, struct svm_range *prange,
if (prange->granularity != attrs[i].value)
return false;
break;
+ case KFD_IOCTL_SVM_ATTR_MAPPED:
+ return false;
default:
WARN_ONCE(1, "svm_range_check_attrs wasn't called?");
}
@@ -1547,6 +1566,81 @@ static void *kfd_svm_page_owner(struct kfd_process *p, int32_t gpuidx)
return SVM_ADEV_PGMAP_OWNER(pdd->dev->adev);
}
+static bool svm_range_is_mapped_vma(struct vm_area_struct *vma)
+{
+ return vma && (vma->vm_flags & (VM_IO | VM_PFNMAP));
+}
+
+static int svm_range_get_mapped_pages(struct mmu_interval_notifier *notifier,
+ struct mm_struct *mm, struct page **pages,
+ uint64_t start, uint64_t npages,
+ struct hmm_range **phmm_range,
+ bool readonly, bool mmap_locked,
+ void *owner, struct vm_area_struct *vma)
+{
+ struct hmm_range *hmm_range;
+ unsigned long *pfns;
+
+ unsigned long vma_size;
+ struct page **vma_pages;
+ unsigned long vma_start_offset;
+ unsigned long i;
+ int r = 0;
+
+ hmm_range = kzalloc(sizeof(*hmm_range), GFP_KERNEL);
+ if (unlikely(!hmm_range))
+ return -ENOMEM;
+
+ pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
+ if (unlikely(!pfns)) {
+ r = -ENOMEM;
+ goto out_free_range;
+ }
+
+ hmm_range->notifier = notifier;
+ hmm_range->default_flags = HMM_PFN_REQ_FAULT;
+ if (!readonly)
+ hmm_range->default_flags |= HMM_PFN_REQ_WRITE;
+ hmm_range->hmm_pfns = pfns;
+ hmm_range->start = start;
+ hmm_range->end = start + npages * PAGE_SIZE;
+ hmm_range->dev_private_owner = owner;
+
+ hmm_range->notifier_seq = mmu_interval_read_begin(notifier);
+
+ vma_size = vma->vm_end - vma->vm_start;
+ vma_start_offset = (unsigned long)start - vma->vm_start;
+ if ((vma_size >> PAGE_SHIFT) < npages) {
+ pr_err("mapped vma pages 0x%lx < requested npages 0x%llx\n",
+ vma_size >> PAGE_SHIFT, npages);
+ r = -EINVAL;
+ goto out_free_pfns;
+ }
+
+ if (likely(!mmap_locked))
+ mmap_read_lock(mm);
+ vma_pages = vma->vm_private_data;
+ for (i = 0; i < npages; i++)
+ pfns[i] = page_to_pfn(
+ vma_pages[(vma_start_offset >> PAGE_SHIFT) + i]);
+
+ if (likely(!mmap_locked))
+ mmap_read_unlock(mm);
+
+ for (i = 0; pages && i < npages; i++)
+ pages[i] = hmm_pfn_to_page(pfns[i]);
+
+ *phmm_range = hmm_range;
+
+ return 0;
+
+out_free_pfns:
+ kvfree(pfns);
+out_free_range:
+ kfree(hmm_range);
+ return r;
+}
+
/*
* Validation+GPU mapping with concurrent invalidation (MMU notifiers)
*
@@ -1674,7 +1768,15 @@ static int svm_range_validate_and_map(struct mm_struct *mm,
next = min(vma->vm_end, end);
npages = (next - addr) >> PAGE_SHIFT;
WRITE_ONCE(p->svms.faulting_task, current);
- r = amdgpu_hmm_range_get_pages(&prange->notifier, addr, npages,
+ if (svm_range_is_mapped_vma(vma))
+ {
+ r = svm_range_get_mapped_pages(&prange->notifier, mm, NULL,
+ addr, npages, &hmm_range,
+ readonly, true, owner, vma);
+ prange->is_map = true;
+ }
+ else
+ r = amdgpu_hmm_range_get_pages(&prange->notifier, addr, npages,
readonly, owner, NULL,
&hmm_range);
WRITE_ONCE(p->svms.faulting_task, NULL);
@@ -3269,9 +3371,9 @@ svm_range_check_vm(struct kfd_process *p, uint64_t start, uint64_t last,
* 0 - OK, otherwise error code
*/
static int
-svm_range_is_valid(struct kfd_process *p, uint64_t start, uint64_t size)
+svm_range_is_valid(struct kfd_process *p, uint64_t start, uint64_t size, bool mapped)
{
- const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP;
+ const unsigned long device_vma = mapped ? 0 : VM_IO | VM_PFNMAP | VM_MIXEDMAP;
struct vm_area_struct *vma;
unsigned long end;
unsigned long start_unchg = start;
@@ -3510,6 +3612,8 @@ static void svm_range_evict_svm_bo_worker(struct work_struct *work)
svm_range_bo_unref(svm_bo);
}
+
+
static int
svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
uint64_t start, uint64_t size, uint32_t nattr,
@@ -3525,6 +3629,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
struct svm_range *next;
bool update_mapping = false;
bool flush_tlb;
+ bool if_mapped;
int r, ret = 0;
pr_debug("pasid 0x%x svms 0x%p [0x%llx 0x%llx] pages 0x%llx\n",
@@ -3540,7 +3645,9 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
svm_range_list_lock_and_flush_work(svms, mm);
- r = svm_range_is_valid(p, start, size);
+ if_mapped = svm_range_has_mapped_attr(nattr, attrs);
+
+ r = svm_range_is_valid(p, start, size, if_mapped);
if (r) {
pr_debug("invalid range r=%d\n", r);
mmap_write_unlock(mm);
@@ -3679,7 +3786,7 @@ svm_range_get_attr(struct kfd_process *p, struct mm_struct *mm,
flush_work(&p->svms.deferred_list_work);
mmap_read_lock(mm);
- r = svm_range_is_valid(p, start, size);
+ r = svm_range_is_valid(p, start, size, false);
mmap_read_unlock(mm);
if (r) {
pr_debug("invalid range r=%d\n", r);
@@ -4153,3 +4260,116 @@ svm_ioctl(struct kfd_process *p, enum kfd_ioctl_svm_op op, uint64_t start,
return r;
}
+
+static void svm_iovec_close(struct vm_area_struct *vma)
+{
+ struct page **pages = vma->vm_private_data;
+ uint32_t npages;
+
+ if (!pages)
+ return;
+
+ npages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+
+ unpin_user_pages_dirty_lock(pages, npages, false);
+ pr_debug("svm_iovec_close, unpin pages, start: 0x%lx, npages: 0x%x\n",
+ vma->vm_start, npages);
+
+ kvfree(pages);
+ vma->vm_private_data = NULL;
+}
+
+static vm_fault_t svm_iovec_fault(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct page **pages;
+
+ if ((vmf->pgoff << PAGE_SHIFT) >= (vma->vm_end - vma->vm_start)) {
+ return VM_FAULT_SIGBUS;
+ }
+
+ pages = (struct page **)vma->vm_private_data;
+ if (!pages) {
+ return VM_FAULT_SIGBUS;
+ }
+
+ vmf->page = pages[vmf->pgoff];
+
+ return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct svm_iovec_ops = {
+ .close = svm_iovec_close,
+ .fault = svm_iovec_fault,
+};
+
+int svm_pin_user_ranges(struct kfd_process *p, uint64_t start, uint64_t size,
+ struct kfd_ioctl_svm_range *ranges, uint64_t nranges)
+{
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+ struct page **pages = NULL, **cur_page;
+ uint32_t vma_size, npages = 0, pinned_pages = 0;
+ int i, ret;
+
+ mmap_read_lock(mm);
+ vma = find_vma(mm, start);
+ if (!vma) {
+ pr_err("failed to find vma, start: 0x%llx\n", start);
+ mmap_read_unlock(mm);
+ return -EINVAL;
+ }
+ mmap_read_unlock(mm);
+
+ if (vma->vm_ops == &svm_iovec_ops)
+ return 0;
+
+ vma_size = vma->vm_end - vma->vm_start;
+ if (size > vma_size) {
+ pr_err("vma size: 0x%x < target size: 0x%llx\n", vma_size, size);
+ ret = -EINVAL;
+ goto failed_free;
+ }
+
+ for (i = 0; i < nranges; i++)
+ npages += ranges[i].size >> PAGE_SHIFT;
+
+ pages = kvmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
+ if (!pages) {
+ pr_err("failed to allocate pages\n");
+ ret = -ENOMEM;
+ goto failed_free;
+ }
+
+ cur_page = pages;
+
+ for (i = 0; i < nranges; i++) {
+ ret = pin_user_pages_fast(ranges[i].addr,
+ (ranges[i].size >> PAGE_SHIFT),
+ FOLL_WRITE | FOLL_FORCE, cur_page);
+ if (ret < 0) {
+ pr_err("failed to pin user pages, addr: 0x%llx, size: 0x%llx\n",
+ ranges[i].addr, ranges[i].size);
+ if (pinned_pages)
+ unpin_user_pages(pages, pinned_pages);
+ goto failed_free;
+ }
+
+ cur_page += (ranges[i].size >> PAGE_SHIFT);
+ pinned_pages += (ranges[i].size >> PAGE_SHIFT);
+ }
+
+ mmap_write_lock(mm);
+ vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
+ vma->vm_private_data = pages;
+ vma->vm_ops = &svm_iovec_ops;
+ mmap_write_unlock(mm);
+ return 0;
+
+failed_free:
+ if (pages) {
+ unpin_user_pages_dirty_lock(pages, pinned_pages, false);
+ kvfree(pages);
+ }
+ return ret;
+}
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
index 70c1776611c4..ebaa10fce8c1 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -137,6 +137,7 @@ struct svm_range {
DECLARE_BITMAP(bitmap_access, MAX_GPU_INSTANCE);
DECLARE_BITMAP(bitmap_aip, MAX_GPU_INSTANCE);
bool mapped_to_gpu;
+ bool is_map;
};
static inline void svm_range_lock(struct svm_range *prange)
@@ -207,6 +208,8 @@ void svm_range_bo_unref_async(struct svm_range_bo *svm_bo);
void svm_range_set_max_pages(struct amdgpu_device *adev);
int svm_range_switch_xnack_reserve_mem(struct kfd_process *p, bool xnack_enabled);
+int svm_pin_user_ranges(struct kfd_process *p, uint64_t start, uint64_t size,
+ struct kfd_ioctl_svm_range *ranges, uint64_t nranges);
#else
--
2.34.1
* [PATCH 5/5] drm/amdkfd: Wire up SVM ranges ioctl handler
2025-11-12 7:29 [RFC PATCH 0/5] drm/amdkfd: Add batch SVM range registration support Honglei Huang
` (3 preceding siblings ...)
2025-11-12 7:29 ` [PATCH 4/5] drm/amdkfd: Add support for pinned user pages in SVM ranges Honglei Huang
@ 2025-11-12 7:29 ` Honglei Huang
2025-11-12 8:34 ` [RFC PATCH 0/5] drm/amdkfd: Add batch SVM range registration support Christian König
5 siblings, 0 replies; 10+ messages in thread
From: Honglei Huang @ 2025-11-12 7:29 UTC (permalink / raw)
To: Felix.Kuehling, alexander.deucher, christian.koenig, Ray.Huang
Cc: dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm, honghuang, Honglei Huang
From: Honglei Huang <Honglei1.Huang@amd.com>
Implement the kfd_ioctl_svm_ranges() handler that integrates the
SVM ranges functionality:
1. kfd_ioctl_svm_ranges() implementation:
- Validate input parameters (ranges, attributes, addresses)
- Copy range descriptors and attributes from userspace
- Call svm_pin_user_ranges() to pin the specified memory ranges
- Construct kfd_ioctl_svm_args and invoke existing kfd_ioctl_svm()
- Properly handle memory allocation and cleanup on error paths
2. Extend attribute handling:
- svm_range_check_attr(): Accept KFD_IOCTL_SVM_ATTR_MAPPED attribute
- svm_range_apply_attrs(): Set prange->is_map when MAPPED attr present
- svm_range_is_same_attrs(): Force update when MAPPED attribute used
- svm_range_has_mapped_attr(): Helper to detect MAPPED in attr list
3. Register ioctl in amdkfd_ioctls table:
- Add AMDKFD_IOC_SVM_RANGES entry with kfd_ioctl_svm_ranges handler
- No special flags required (use default permissions)
This completes the implementation of batch SVM range registration,
allowing userspace to efficiently register multiple non-contiguous
memory buffers with a single ioctl call.
The implementation reuses existing SVM infrastructure while adding
the ability to handle pre-pinned memory pages, reducing overhead
for applications that need to register many scattered allocations.
Signed-off-by: Honglei Huang <Honglei1.Huang@amd.com>
---
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 67 ++++++++++++++++++++++++
1 file changed, 67 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index fdf171ad4a3c..7e7e00d3f873 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1739,6 +1739,70 @@ static int kfd_ioctl_svm(struct file *filep, struct kfd_process *p, void *data)
return r;
}
+
+static int kfd_ioctl_svm_ranges(struct file *filep, struct kfd_process *p,
+ void *data)
+{
+ struct kfd_ioctl_svm_ranges_args *args = data;
+ struct kfd_ioctl_svm_args *svm_args;
+ int r = 0, err;
+ struct kfd_ioctl_svm_range *ranges;
+ size_t sattr;
+
+ if (!args->nranges || !args->ranges_ptr)
+ return -EINVAL;
+ if (!args->start_addr || !args->size)
+ return -EINVAL;
+
+ pr_debug("start 0x%llx size 0x%llx op 0x%x nattr 0x%x nranges 0x%x\n",
+ args->start_addr, args->size, args->op, args->nattr, args->nranges);
+
+ if (args->nranges && args->ranges_ptr) {
+ ranges = kvmalloc_array(args->nranges, sizeof(*ranges),
+ GFP_KERNEL);
+ if (!ranges)
+ return -ENOMEM;
+
+ err = copy_from_user(ranges, (void __user *)args->ranges_ptr,
+ args->nranges * sizeof(*ranges));
+ if (err != 0) {
+ kvfree(ranges);
+ return -EFAULT;
+ }
+
+ r = svm_pin_user_ranges(p, args->start_addr, args->size, ranges,
+ args->nranges);
+
+ kvfree(ranges);
+
+ if (r)
+ return r;
+ }
+
+ sattr = args->nattr * sizeof(struct kfd_ioctl_svm_attribute);
+
+ svm_args = kvmalloc(sizeof(*svm_args) + sattr, GFP_KERNEL);
+ if (!svm_args)
+ return -ENOMEM;
+
+ svm_args->start_addr = args->start_addr;
+ svm_args->size = args->size;
+ svm_args->nattr = args->nattr;
+ svm_args->op = args->op;
+
+ err = copy_from_user(&svm_args->attrs[0], (void __user *)args->attrs_ptr,
+ sattr);
+ if (err != 0) {
+ kvfree(svm_args);
+ return -EFAULT;
+ }
+
+ r = kfd_ioctl_svm(filep, p, svm_args);
+
+ kvfree(svm_args);
+ return r;
+}
+
#else
static int kfd_ioctl_set_xnack_mode(struct file *filep,
struct kfd_process *p, void *data)
@@ -3226,6 +3290,9 @@ static const struct amdkfd_ioctl_desc amdkfd_ioctls[] = {
AMDKFD_IOCTL_DEF(AMDKFD_IOC_DBG_TRAP,
kfd_ioctl_set_debug_trap, 0),
+
+ AMDKFD_IOCTL_DEF(AMDKFD_IOC_SVM_RANGES,
+ kfd_ioctl_svm_ranges, 0),
};
#define AMDKFD_CORE_IOCTL_COUNT ARRAY_SIZE(amdkfd_ioctls)
--
2.34.1
* Re: [RFC PATCH 0/5] drm/amdkfd: Add batch SVM range registration support
2025-11-12 7:29 [RFC PATCH 0/5] drm/amdkfd: Add batch SVM range registration support Honglei Huang
` (4 preceding siblings ...)
2025-11-12 7:29 ` [PATCH 5/5] drm/amdkfd: Wire up SVM ranges ioctl handler Honglei Huang
@ 2025-11-12 8:34 ` Christian König
2025-11-12 12:10 ` Honglei1.Huang@amd.com
5 siblings, 1 reply; 10+ messages in thread
From: Christian König @ 2025-11-12 8:34 UTC (permalink / raw)
To: Honglei Huang, Felix.Kuehling, alexander.deucher, Ray.Huang
Cc: dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm, honghuang
Hi,
On 11/12/25 08:29, Honglei Huang wrote:
> Hi all,
>
> This RFC patch series introduces a new mechanism for batch registration of
> multiple non-contiguous SVM (Shared Virtual Memory) ranges in a single ioctl
> call. The primary goal of this series is to start a discussion about the best
> approach to handle scattered user memory allocations in GPU workloads.
>
> Background and Motivation
> ==========================
>
> Current applications using ROCm/HSA often need to register many scattered
> memory buffers (e.g., multiple malloc() allocations) for GPU access. With the
> existing AMDKFD_IOC_SVM ioctl, each range must be registered individually,
> leading to:
> - Blocking issue in some special use cases with many memory ranges
> - High system call overhead when dealing with dozens or hundreds of ranges
> - Inefficient resource management
> - Complexity in userspace applications
>
> Use Case Example
> ================
>
> Consider a typical ML/HPC workload that allocates 100+ small buffers across
> different parts of the address space. Currently, this requires 100+ separate
> ioctl calls. The proposed batch interface reduces this to a single call.
Yeah, that's an intentional limitation.
In an IOCTL interface you usually need to guarantee that the operation either completes or fails in a transactional manner.
It is possible to implement this, but usually rather tricky if you do multiple operations in a single IOCTL. So you really need a good use case to justify the added complexity.
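For reference, the usual shape of such an all-or-nothing batch is an
unwind pattern along these lines (a generic sketch, not code from this
series; do_one()/undo_one() and struct item are placeholders):

  /* Either all n operations succeed, or everything already
   * applied is rolled back before returning the error. */
  static int batch_op(struct item *items, int n)
  {
          int i, r;

          for (i = 0; i < n; i++) {
                  r = do_one(&items[i]);
                  if (r)
                          goto unwind;
          }
          return 0;

  unwind:
          while (--i >= 0)
                  undo_one(&items[i]);
          return r;
  }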
> Paravirtualized environments exacerbate this issue, as KVM's memory backing
> is often non-contiguous at the host level. In virtualized environments, guest
> physical memory appears contiguous to the VM but is actually scattered across
> host memory pages. This fragmentation means that what appears as a single
> large allocation in the guest may require multiple discrete SVM registrations
> to properly handle the underlying host memory layout, further multiplying the
> number of required ioctl calls.
SVM with dynamic migration under KVM is most likely a dead end to begin with.
The only possibility to implement it is with memory pinning which is basically userptr.
Or a rather slow client side IOMMU emulation to catch concurrent DMA transfers to get the necessary information onto the host side.
Intel calls this approach coIOMMU: https://www.usenix.org/system/files/atc20-paper236-slides-tian.pdf
> Current Implementation - A Workaround Approach
> ===============================================
>
> This patch series implements a WORKAROUND solution that pins user pages in
> memory to enable batch registration. While functional, this approach has
> several significant limitations:
>
> **Major Concern: Memory Pinning**
> - The implementation uses pin_user_pages_fast() to lock pages in RAM
> - This defeats the purpose of SVM's on-demand paging mechanism
> - Prevents memory oversubscription and dynamic migration
> - May cause memory pressure on systems with limited RAM
> - Goes against the fundamental design philosophy of HMM-based SVM
That again is perfectly intentional. Any other mode doesn't really make sense with KVM.
> **Known Limitations:**
> 1. Increased memory footprint due to pinned pages
> 2. Potential for memory fragmentation
> 3. No support for transparent huge pages in pinned regions
> 4. Limited interaction with memory cgroups and resource controls
> 5. Complexity in handling VMA operations and lifecycle management
> 6. May interfere with NUMA optimization and page migration
>
> Why Submit This RFC?
> ====================
>
> Despite the limitations above, I am submitting this series to:
>
> 1. **Start the Discussion**: I want community feedback on whether batch
> registration is a useful feature worth pursuing.
>
> 2. **Explore Better Alternatives**: Is there a way to achieve batch
> registration without pinning? Could I extend HMM to better support
> this use case?
There is an ongoing unification project between KFD and KGD; we are currently looking into the SVM part on a weekly basis.
That said, we probably need a really good justification to add new features to the KFD interfaces because this is going to delay the unification.
Regards,
Christian.
>
> 3. **Understand Trade-offs**: For some workloads, the performance benefit
> of batch registration might outweigh the drawbacks of pinning. I'd
> like to understand where the balance lies.
>
> Questions for the Community
> ============================
>
> 1. Are there existing mechanisms in HMM or mm that could support batch
> operations without pinning?
>
> 2. Would a different approach (e.g., async registration, delayed validation)
> be more acceptable?
>
> Alternative Approaches Considered
> ==================================
>
> I've considered several alternatives:
>
> A) **Pure HMM approach**: Register ranges without pinning, rely entirely on
>    HMM's fault-based on-demand paging.
>
> B) **Userspace batching library**: Hide multiple ioctls behind a library.
>
> Patch Series Overview
> =====================
>
> Patch 1: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute type
> Patch 2: Define data structures for batch SVM range registration
> Patch 3: Add new AMDKFD_IOC_SVM_RANGES ioctl command
> Patch 4: Implement page pinning mechanism for scattered ranges
> Patch 5: Wire up the ioctl handler and attribute processing
>
> Testing
> =======
>
> The series has been tested with:
> - Multiple scattered malloc() allocations (2-2000+ ranges)
> - Various allocation sizes (4KB to 1G+)
> - GPU compute workloads using the registered ranges
> - Memory pressure scenarios
> - OpenCL CTS in KVM guest environment
> - HIP catch tests in KVM guest environment
> - Some AI applications like Stable Diffusion, ComfyUI, 3B LLM models based
> on HuggingFace transformers
>
> I understand this approach is not ideal and am committed to working on a
> better solution based on community feedback. This RFC is the starting point
> for that discussion.
>
> Thank you for your time and consideration.
>
> Best regards,
> Honglei Huang
>
> ---
>
> Honglei Huang (5):
> drm/amdkfd: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute
> drm/amdkfd: Add SVM ranges data structures
> drm/amdkfd: Add AMDKFD_IOC_SVM_RANGES ioctl command
> drm/amdkfd: Add support for pinned user pages in SVM ranges
> drm/amdkfd: Wire up SVM ranges ioctl handler
>
> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 67 +++++++++++
> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 232 +++++++++++++++++++++++++++++--
> drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 3 +
> include/uapi/linux/kfd_ioctl.h | 52 +++++++-
> 4 files changed, 348 insertions(+), 6 deletions(-)
* Re: [RFC PATCH 0/5] drm/amdkfd: Add batch SVM range registration support
2025-11-12 8:34 ` [RFC PATCH 0/5] drm/amdkfd: Add batch SVM range registration support Christian König
@ 2025-11-12 12:10 ` Honglei1.Huang@amd.com
2025-11-12 12:50 ` Christian König
0 siblings, 1 reply; 10+ messages in thread
From: Honglei1.Huang@amd.com @ 2025-11-12 12:10 UTC (permalink / raw)
To: Christian König
Cc: Felix.Kuehling, alexander.deucher, Ray.Huang, dmitry.osipenko,
Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel, linux-kernel,
linux-mm, akpm, Honglei Huang
Hi Christian,
Thanks a lot for the detailed feedback and insights. Your comments are
incredibly helpful and clear.
On 2025/11/12 16:34, Christian König wrote:
> Hi,
>
> On 11/12/25 08:29, Honglei Huang wrote:
>> Hi all,
>>
>> This RFC patch series introduces a new mechanism for batch registration of
>> multiple non-contiguous SVM (Shared Virtual Memory) ranges in a single ioctl
>> call. The primary goal of this series is to start a discussion about the best
>> approach to handle scattered user memory allocations in GPU workloads.
>>
>> Background and Motivation
>> ==========================
>>
>> Current applications using ROCm/HSA often need to register many scattered
>> memory buffers (e.g., multiple malloc() allocations) for GPU access. With the
>> existing AMDKFD_IOC_SVM ioctl, each range must be registered individually,
>> leading to:
>> - Blocking issue in some special use cases with many memory ranges
>> - High system call overhead when dealing with dozens or hundreds of ranges
>> - Inefficient resource management
>> - Complexity in userspace applications
>>
>> Use Case Example
>> ================
>>
>> Consider a typical ML/HPC workload that allocates 100+ small buffers across
>> different parts of the address space. Currently, this requires 100+ separate
>> ioctl calls. The proposed batch interface reduces this to a single call.
>
> Yeah, that's an intentional limitation.
>
> In an IOCTL interface you usually need to guarantee that the operation either completes or fails in a transactional manner.
>
> It is possible to implement this, but usually rather tricky if you do multiple operations in a single IOCTL. So you really need a good use case to justify the added complexity.
>
You're absolutely right about the transactional complexity. This
operation indeed requires proper rollback mechanisms and error handling
to maintain atomicity.
>> Paravirtualized environments exacerbate this issue, as KVM's memory backing
>> is often non-contiguous at the host level. In virtualized environments, guest
>> physical memory appears contiguous to the VM but is actually scattered across
>> host memory pages. This fragmentation means that what appears as a single
>> large allocation in the guest may require multiple discrete SVM registrations
>> to properly handle the underlying host memory layout, further multiplying the
>> number of required ioctl calls.
> SVM with dynamic migration under KVM is most likely a dead end to begin with.
>
> The only possibility to implement it is with memory pinning which is basically userptr.
>
> Or a rather slow client side IOMMU emulation to catch concurrent DMA transfers to get the necessary information onto the host side.
>
> Intel calls this approach coIOMMU: https://www.usenix.org/system/files/atc20-paper236-slides-tian.pdf
>
This is very helpful context. Your confirmation that memory pinning
(userptr-style) is the practical approach helps me understand that what
I initially saw as a "workaround" is actually the intended solution for
this use case.
For coIOMMU, I'll study it to better understand the alternatives and
their trade-offs.
>> Current Implementation - A Workaround Approach
>> ===============================================
>>
>> This patch series implements a WORKAROUND solution that pins user pages in
>> memory to enable batch registration. While functional, this approach has
>> several significant limitations:
>>
>> **Major Concern: Memory Pinning**
>> - The implementation uses pin_user_pages_fast() to lock pages in RAM
>> - This defeats the purpose of SVM's on-demand paging mechanism
>> - Prevents memory oversubscription and dynamic migration
>> - May cause memory pressure on systems with limited RAM
>> - Goes against the fundamental design philosophy of HMM-based SVM
>
> That again is perfectly intentional. Any other mode doesn't really make sense with KVM.
>
>> **Known Limitations:**
>> 1. Increased memory footprint due to pinned pages
>> 2. Potential for memory fragmentation
>> 3. No support for transparent huge pages in pinned regions
>> 4. Limited interaction with memory cgroups and resource controls
>> 5. Complexity in handling VMA operations and lifecycle management
>> 6. May interfere with NUMA optimization and page migration
>>
>> Why Submit This RFC?
>> ====================
>>
>> Despite the limitations above, I am submitting this series to:
>>
>> 1. **Start the Discussion**: I want community feedback on whether batch
>> registration is a useful feature worth pursuing.
>>
>> 2. **Explore Better Alternatives**: Is there a way to achieve batch
>> registration without pinning? Could I extend HMM to better support
>> this use case?
>
> There is an ongoing unification project between KFD and KGD; we are currently looking into the SVM part on a weekly basis.
>
> That said, we probably need a really good justification to add new features to the KFD interfaces because this is going to delay the unification.
>
> Regards,
> Christian.
Thank you for sharing this critical information. Is there a public
discussion forum or mailing list for the KFD/KGD unification where I
could follow progress and understand the design direction?
Regarding the use case justification: I need to be honest here - the
primary driver for this feature is indeed KVM/virtualized environments.
The scattered allocation problem exists in native environments too, but
the overhead is tolerable there. However, I do want to raise one
consideration for the unified interface design:
GPU computing in virtualized/cloud environments is growing rapidly:
major cloud providers (AWS, Azure) now offer GPU instances, and ROCm
in containers/VMs is becoming more common. So while my current use
case is specific to KVM, the virtualized GPU workload pattern may
become more prevalent.
So during the unified interface design, please keep the door open for
batch-style operations if they don't complicate the core design.
I really appreciate your time and guidance on this.
Regards,
Honglei
>
>>
>> 3. **Understand Trade-offs**: For some workloads, the performance benefit
>> of batch registration might outweigh the drawbacks of pinning. I'd
>> like to understand where the balance lies.
>>
>> Questions for the Community
>> ============================
>>
>> 1. Are there existing mechanisms in HMM or mm that could support batch
>> operations without pinning?
>>
>> 2. Would a different approach (e.g., async registration, delayed validation)
>> be more acceptable?
>>
>> Alternative Approaches Considered
>> ==================================
>>
>> I've considered several alternatives:
>>
>> A) **Pure HMM approach**: Register ranges without pinning, rely entirely on
>>    HMM's fault-based on-demand paging.
>>
>> B) **Userspace batching library**: Hide multiple ioctls behind a library.
>>
>> Patch Series Overview
>> =====================
>>
>> Patch 1: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute type
>> Patch 2: Define data structures for batch SVM range registration
>> Patch 3: Add new AMDKFD_IOC_SVM_RANGES ioctl command
>> Patch 4: Implement page pinning mechanism for scattered ranges
>> Patch 5: Wire up the ioctl handler and attribute processing
>>
>> Testing
>> =======
>>
>> The series has been tested with:
>> - Multiple scattered malloc() allocations (2-2000+ ranges)
>> - Various allocation sizes (4KB to 1G+)
>> - GPU compute workloads using the registered ranges
>> - Memory pressure scenarios
>> - OpenCL CTS in KVM guest environment
>> - HIP catch tests in KVM guest environment
>> - Some AI applications like Stable Diffusion, ComfyUI, 3B LLM models based
>> on HuggingFace transformers
>>
>> I understand this approach is not ideal and am committed to working on a
>> better solution based on community feedback. This RFC is the starting point
>> for that discussion.
>>
>> Thank you for your time and consideration.
>>
>> Best regards,
>> Honglei Huang
>>
>> ---
>>
>> Honglei Huang (5):
>> drm/amdkfd: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute
>> drm/amdkfd: Add SVM ranges data structures
>> drm/amdkfd: Add AMDKFD_IOC_SVM_RANGES ioctl command
>> drm/amdkfd: Add support for pinned user pages in SVM ranges
>> drm/amdkfd: Wire up SVM ranges ioctl handler
>>
>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 67 +++++++++++
>> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 232 +++++++++++++++++++++++++++++--
>> drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 3 +
>> include/uapi/linux/kfd_ioctl.h | 52 +++++++-
>> 4 files changed, 348 insertions(+), 6 deletions(-)
>
* Re: [RFC PATCH 0/5] drm/amdkfd: Add batch SVM range registration support
2025-11-12 12:10 ` Honglei1.Huang@amd.com
@ 2025-11-12 12:50 ` Christian König
0 siblings, 0 replies; 10+ messages in thread
From: Christian König @ 2025-11-12 12:50 UTC (permalink / raw)
To: Honglei1.Huang@amd.com
Cc: Felix.Kuehling, alexander.deucher, Ray.Huang, dmitry.osipenko,
Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel, linux-kernel,
linux-mm, akpm, Honglei Huang
Hi Honglei,
On 11/12/25 13:10, Honglei1.Huang@amd.com wrote:
>>> Paravirtualized environments exacerbate this issue, as KVM's memory backing
>>> is often non-contiguous at the host level. In virtualized environments, guest
>>> physical memory appears contiguous to the VM but is actually scattered across
>>> host memory pages. This fragmentation means that what appears as a single
>>> large allocation in the guest may require multiple discrete SVM registrations
>>> to properly handle the underlying host memory layout, further multiplying the
>>> number of required ioctl calls.
>> SVM with dynamic migration under KVM is most likely a dead end to begin with.
>>
>> The only possibility to implement it is with memory pinning which is basically userptr.
>>
>> Or a rather slow client side IOMMU emulation to catch concurrent DMA transfers to get the necessary information onto the host side.
>>
>> Intel calls this approach coIOMMU: https://www.usenix.org/system/files/atc20-paper236-slides-tian.pdf
>>
>
> This is very helpful context. Your confirmation that memory pinning (userptr-style) is the practical approach helps me understand that what I initially saw as a "workaround" is actually the intended solution for this use case.
Well "intended" is maybe not the right term, I would rather say "possible" with the current SW/HW stack design in virtualization.
In general, fault-based SVM/HMM would still be nice to have even in a virtualized environment; it's just not really feasible at the moment.
> For coIOMMU, I'll study it to better understand the alternatives and their trade-offs.
I haven't looked into it in detail either. It's mostly developed with the pass-through use case in mind, but it avoids pinning memory on the host side, which is one of many prerequisites for getting HMM-based migration working as well.
...
>>> Why Submit This RFC?
>>> ====================
>>>
>>> Despite the limitations above, I am submitting this series to:
>>>
>>> 1. **Start the Discussion**: I want community feedback on whether batch
>>> registration is a useful feature worth pursuing.
>>>
>>> 2. **Explore Better Alternatives**: Is there a way to achieve batch
>>> registration without pinning? Could I extend HMM to better support
>>> this use case?
>>
>> There is an ongoing unification project between KFD and KGD; we are currently looking into the SVM part on a weekly basis.
>>
>> That said, we probably need a really good justification to add new features to the KFD interfaces because this is going to delay the unification.
>>
>> Regards,
>> Christian.
>
> Thank you for sharing this critical information. Is there a public discussion forum or mailing list for the KFD/KGD unification where I could follow progress and understand the design direction?
Alex is driving this. No mailing list, but IIRC Alex has organized a lot of topics on some Confluence page; I can't find it offhand.
> Regarding the use case justification: I need to be honest here - the
> primary driver for this feature is indeed KVM/virtualized environments.
> The scattered allocation problem exists in native environments too, but
> the overhead is tolerable there. However, I do want to raise one consideration for the unified interface design:
>
> GPU computing in virtualized/cloud environments is growing rapidly: major cloud providers (AWS, Azure) now offer GPU instances, and ROCm in containers/VMs is becoming more common. So while my current use case is specific to KVM, the virtualized GPU workload pattern may become more prevalent.
>
> So during the unified interface design, please keep the door open for batch-style operations if they don't complicate the core design.
Oh, yes! That's definitely valuable information to have and more or less a new requirement for the SVM userspace API.
I already expected that we would run into such things sooner or later, but having it definitely confirmed is really good to have.
Regards,
Christian.
>
> I really appreciate your time and guidance on this.
>
> Regards,
> Honglei
>
>
>
>>
>>>
>>> 3. **Understand Trade-offs**: For some workloads, the performance benefit
>>> of batch registration might outweigh the drawbacks of pinning. I'd
>>> like to understand where the balance lies.
>>>
>>> Questions for the Community
>>> ============================
>>>
>>> 1. Are there existing mechanisms in HMM or mm that could support batch
>>> operations without pinning?
>>>
>>> 2. Would a different approach (e.g., async registration, delayed validation)
>>> be more acceptable?
>>>
>>> Alternative Approaches Considered
>>> ==================================
>>>
>>> I've considered several alternatives:
>>>
>>> A) **Pure HMM approach**: Register ranges without pinning, rely entirely on
>>>    HMM's fault-based on-demand paging.
>>>
>>> B) **Userspace batching library**: Hide multiple ioctls behind a library.
>>>
>>> Patch Series Overview
>>> =====================
>>>
>>> Patch 1: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute type
>>> Patch 2: Define data structures for batch SVM range registration
>>> Patch 3: Add new AMDKFD_IOC_SVM_RANGES ioctl command
>>> Patch 4: Implement page pinning mechanism for scattered ranges
>>> Patch 5: Wire up the ioctl handler and attribute processing
>>>
>>> Testing
>>> =======
>>>
>>> The series has been tested with:
>>> - Multiple scattered malloc() allocations (2-2000+ ranges)
>>> - Various allocation sizes (4KB to 1G+)
>>> - GPU compute workloads using the registered ranges
>>> - Memory pressure scenarios
>>> - OpenCL CTS in KVM guest environment
>>> - HIP catch tests in KVM guest environment
>>> - Some AI applications like Stable Diffusion, ComfyUI, 3B LLM models based
>>> on HuggingFace transformers
>>>
>>> I understand this approach is not ideal and am committed to working on a
>>> better solution based on community feedback. This RFC is the starting point
>>> for that discussion.
>>>
>>> Thank you for your time and consideration.
>>>
>>> Best regards,
>>> Honglei Huang
>>>
>>> ---
>>>
>>> Honglei Huang (5):
>>> drm/amdkfd: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute
>>> drm/amdkfd: Add SVM ranges data structures
>>> drm/amdkfd: Add AMDKFD_IOC_SVM_RANGES ioctl command
>>> drm/amdkfd: Add support for pinned user pages in SVM ranges
>>> drm/amdkfd: Wire up SVM ranges ioctl handler
>>>
>>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 67 +++++++++++
>>> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 232 +++++++++++++++++++++++++++++--
>>> drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 3 +
>>> include/uapi/linux/kfd_ioctl.h | 52 +++++++-
>>> 4 files changed, 348 insertions(+), 6 deletions(-)
>>
>
* [RFC PATCH 0/5] drm/amdkfd: Add batch SVM range registration support
@ 2025-11-12 7:35 Honglei Huang
0 siblings, 0 replies; 10+ messages in thread
From: Honglei Huang @ 2025-11-12 7:35 UTC (permalink / raw)
To: Felix.Kuehling, alexander.deucher, christian.koenig, Ray.Huang
Cc: dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm, honghuan, Honglei Huang
From: Honglei Huang <Honglei1.Huang@amd.com>
Hi all,
This RFC patch series introduces a new mechanism for batch registration of
multiple non-contiguous SVM (Shared Virtual Memory) ranges in a single ioctl
call. The primary goal of this series is to start a discussion about the best
approach to handle scattered user memory allocations in GPU workloads.
Background and Motivation
==========================
Current applications using ROCm/HSA often need to register many scattered
memory buffers (e.g., multiple malloc() allocations) for GPU access. With the
existing AMDKFD_IOC_SVM ioctl, each range must be registered individually,
leading to:
- Blocking issue in some special use cases with many memory ranges
- High system call overhead when dealing with dozens or hundreds of ranges
- Inefficient resource management
- Complexity in userspace applications
Use Case Example
================
Consider a typical ML/HPC workload that allocates 100+ small buffers across
different parts of the address space. Currently, this requires 100+ separate
ioctl calls. The proposed batch interface reduces this to a single call.
Paravirtualized environments exacerbate this issue, as KVM's memory backing
is often non-contiguous at the host level. In virtualized environments, guest
physical memory appears contiguous to the VM but is actually scattered across
host memory pages. This fragmentation means that what appears as a single
large allocation in the guest may require multiple discrete SVM registrations
to properly handle the underlying host memory layout, further multiplying the
number of required ioctl calls.
Current Implementation - A Workaround Approach
===============================================
This patch series implements a WORKAROUND solution that pins user pages in
memory to enable batch registration. While functional, this approach has
several significant limitations:
**Major Concern: Memory Pinning**
- The implementation uses pin_user_pages_fast() to lock pages in RAM
- This defeats the purpose of SVM's on-demand paging mechanism
- Prevents memory oversubscription and dynamic migration
- May cause memory pressure on systems with limited RAM
- Goes against the fundamental design philosophy of HMM-based SVM
**Known Limitations:**
1. Increased memory footprint due to pinned pages
2. Potential for memory fragmentation
3. No support for transparent huge pages in pinned regions
4. Limited interaction with memory cgroups and resource controls
5. Complexity in handling VMA operations and lifecycle management
6. May interfere with NUMA optimization and page migration
Why Submit This RFC?
====================
Despite the limitations above, I am submitting this series to:
1. **Start the Discussion**: I want community feedback on whether batch
registration is a useful feature worth pursuing.
2. **Explore Better Alternatives**: Is there a way to achieve batch
registration without pinning? Could I extend HMM to better support
this use case?
3. **Understand Trade-offs**: For some workloads, the performance benefit
of batch registration might outweigh the drawbacks of pinning. I'd
like to understand where the balance lies.
Questions for the Community
============================
1. Are there existing mechanisms in HMM or mm that could support batch
operations without pinning?
2. Would a different approach (e.g., async registration, delayed validation)
be more acceptable?
Alternative Approaches Considered
==================================
I've considered several alternatives:
A) **Pure HMM approach**: Register ranges without pinning, rely entirely on
   HMM's fault-based on-demand paging.
B) **Userspace batching library**: Hide multiple ioctls behind a library.
Patch Series Overview
=====================
Patch 1: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute type
Patch 2: Define data structures for batch SVM range registration
Patch 3: Add new AMDKFD_IOC_SVM_RANGES ioctl command
Patch 4: Implement page pinning mechanism for scattered ranges
Patch 5: Wire up the ioctl handler and attribute processing
Testing
=======
The series has been tested with:
- Multiple scattered malloc() allocations (2-2000+ ranges)
- Various allocation sizes (4KB to 1G+)
- GPU compute workloads using the registered ranges
- Memory pressure scenarios
- OpenCL CTS in KVM guest environment
- HIP catch tests in KVM guest environment
- Some AI applications like Stable Diffusion, ComfyUI, 3B LLM models based
on HuggingFace transformers
I understand this approach is not ideal and am committed to working on a
better solution based on community feedback. This RFC is the starting point
for that discussion.
Thank you for your time and consideration.
Best regards,
Honglei Huang
Honglei Huang (5):
drm/amdkfd: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute
drm/amdkfd: Add SVM ranges data structures
drm/amdkfd: Add AMDKFD_IOC_SVM_RANGES ioctl command
drm/amdkfd: Add support for pinned user pages in SVM ranges
drm/amdkfd: Wire up SVM ranges ioctl handler
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 67 +++++++
drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 232 ++++++++++++++++++++++-
drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 3 +
include/uapi/linux/kfd_ioctl.h | 52 ++++-
4 files changed, 345 insertions(+), 9 deletions(-)
--
2.34.1