* [PATCH v3 1/8] drm/amdkfd: Add userptr batch allocation UAPI structures
2026-02-06 6:25 [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support Honglei Huang
@ 2026-02-06 6:25 ` Honglei Huang
2026-02-06 6:25 ` [PATCH v3 2/8] drm/amdkfd: Add user_range_info infrastructure to kgd_mem Honglei Huang
` (7 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Honglei Huang @ 2026-02-06 6:25 UTC (permalink / raw)
To: Felix.Kuehling, alexander.deucher, christian.koenig, Ray.Huang
Cc: dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm, honghuan
From: Honglei Huang <honghuan@amd.com>
Introduce new UAPI structures to support batch allocation of
non-contiguous userptr ranges in a single ioctl call. A brief
user-space usage sketch follows the list below.
This adds:
- KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
- struct kfd_ioctl_userptr_range for individual ranges
- struct kfd_ioctl_userptr_ranges_data for batch data
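A minimal user-space sketch of the intended usage (illustrative only,
not part of this patch; the helper name, flag selection and buffer
handling are assumptions, and every range must be page aligned):

  #include <stdint.h>
  #include <stdlib.h>
  #include <sys/ioctl.h>
  #include <linux/kfd_ioctl.h>

  /* Map two page-aligned CPU buffers to one contiguous GPU VA through a
   * single batch userptr allocation on an already opened /dev/kfd fd.
   */
  static int alloc_userptr_batch(int kfd_fd, uint32_t gpu_id, uint64_t gpu_va,
                                 void *buf0, uint64_t len0,
                                 void *buf1, uint64_t len1)
  {
          struct kfd_ioctl_alloc_memory_of_gpu_args args = {0};
          struct kfd_ioctl_userptr_ranges_data *data;
          int ret;

          data = calloc(1, sizeof(*data) + 2 * sizeof(data->ranges[0]));
          if (!data)
                  return -1;

          data->num_ranges = 2;           /* reserved field stays 0 */
          data->ranges[0].start = (uint64_t)(uintptr_t)buf0;
          data->ranges[0].size = len0;
          data->ranges[1].start = (uint64_t)(uintptr_t)buf1;
          data->ranges[1].size = len1;

          args.va_addr = gpu_va;          /* contiguous GPU VA */
          args.size = len0 + len1;        /* total size of all ranges */
          args.gpu_id = gpu_id;
          args.mmap_offset = (uint64_t)(uintptr_t)data; /* overloaded: ranges data */
          args.flags = KFD_IOC_ALLOC_MEM_FLAGS_USERPTR |
                       KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH |
                       KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE;

          ret = ioctl(kfd_fd, AMDKFD_IOC_ALLOC_MEMORY_OF_GPU, &args);
          free(data);
          return ret;     /* args.handle holds the buffer handle on success */
  }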
Signed-off-by: Honglei Huang <honghuan@amd.com>
---
include/uapi/linux/kfd_ioctl.h | 31 ++++++++++++++++++++++++++++++-
1 file changed, 30 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
index 84aa24c02..579850e70 100644
--- a/include/uapi/linux/kfd_ioctl.h
+++ b/include/uapi/linux/kfd_ioctl.h
@@ -420,16 +420,45 @@ struct kfd_ioctl_acquire_vm_args {
#define KFD_IOC_ALLOC_MEM_FLAGS_UNCACHED (1 << 25)
#define KFD_IOC_ALLOC_MEM_FLAGS_EXT_COHERENT (1 << 24)
#define KFD_IOC_ALLOC_MEM_FLAGS_CONTIGUOUS (1 << 23)
+#define KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH (1 << 22)
+
+/* Userptr range for batch allocation
+ *
+ * @start: start address of user virtual memory range
+ * @size: size of this user virtual memory range in bytes
+ */
+struct kfd_ioctl_userptr_range {
+ __u64 start; /* to KFD */
+ __u64 size; /* to KFD */
+};
+
+/* Complete userptr batch allocation data structure
+ *
+ * This structure combines the header and ranges array for convenience.
+ * User space can allocate memory for this structure with the desired
+ * number of ranges and pass a pointer to it via mmap_offset field.
+ *
+ * @num_ranges: number of ranges in the ranges array
+ * @reserved: reserved for future use, must be 0
+ * @ranges: flexible array of userptr ranges
+ */
+struct kfd_ioctl_userptr_ranges_data {
+ __u32 num_ranges; /* to KFD */
+ __u32 reserved; /* to KFD, must be 0 */
+ struct kfd_ioctl_userptr_range ranges[]; /* to KFD */
+};
/* Allocate memory for later SVM (shared virtual memory) mapping.
*
* @va_addr: virtual address of the memory to be allocated
* all later mappings on all GPUs will use this address
- * @size: size in bytes
+ * @size: size in bytes (total size for batch allocation)
* @handle: buffer handle returned to user mode, used to refer to
* this allocation for mapping, unmapping and freeing
* @mmap_offset: for CPU-mapping the allocation by mmapping a render node
* for userptrs this is overloaded to specify the CPU address
+ * for batch userptr (KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH),
+ * this should point to a kfd_ioctl_userptr_ranges_data structure
* @gpu_id: device identifier
* @flags: memory type and attributes. See KFD_IOC_ALLOC_MEM_FLAGS above
*/
--
2.34.1
* [PATCH v3 2/8] drm/amdkfd: Add user_range_info infrastructure to kgd_mem
2026-02-06 6:25 [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support Honglei Huang
2026-02-06 6:25 ` [PATCH v3 1/8] drm/amdkfd: Add userptr batch allocation UAPI structures Honglei Huang
@ 2026-02-06 6:25 ` Honglei Huang
2026-02-06 6:25 ` [PATCH v3 3/8] drm/amdkfd: Implement interval tree for userptr ranges Honglei Huang
` (6 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Honglei Huang @ 2026-02-06 6:25 UTC (permalink / raw)
To: Felix.Kuehling, alexander.deucher, christian.koenig, Ray.Huang
Cc: dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm, honghuan
From: Honglei Huang <honghuan@amd.com>
Add data structures to support batch userptr allocations with
multiple non-contiguous CPU virtual address ranges.
This adds:
- struct user_range_info: per-range metadata including the HMM range,
an invalidation counter, and an interval tree node
- New fields in kgd_mem: num_user_ranges, the user_ranges array,
batch_va_min/max, batch_notifier, and user_ranges_itree
Signed-off-by: Honglei Huang <honghuan@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 23 ++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 321cbf9a1..58917a4b3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -48,6 +48,7 @@ enum TLB_FLUSH_TYPE {
struct amdgpu_device;
struct kfd_process_device;
+struct kfd_ioctl_userptr_range;
struct amdgpu_reset_context;
enum kfd_mem_attachment_type {
@@ -67,6 +68,15 @@ struct kfd_mem_attachment {
uint64_t pte_flags;
};
+/* Individual range info for batch userptr allocations */
+struct user_range_info {
+ uint64_t start; /* CPU virtual address start */
+ uint64_t size; /* Size in bytes */
+ struct hmm_range *range; /* HMM range for this userptr */
+ uint32_t invalid; /* Per-range invalidation counter */
+ struct interval_tree_node it_node; /* Interval tree node for fast overlap lookup */
+};
+
struct kgd_mem {
struct mutex lock;
struct amdgpu_bo *bo;
@@ -89,6 +99,14 @@ struct kgd_mem {
uint32_t gem_handle;
bool aql_queue;
bool is_imported;
+
+ /* For batch userptr allocation: multiple non-contiguous CPU VA ranges */
+ uint32_t num_user_ranges;
+ struct user_range_info *user_ranges;
+ uint64_t batch_va_min;
+ uint64_t batch_va_max;
+ struct mmu_interval_notifier batch_notifier;
+ struct rb_root_cached user_ranges_itree;
};
/* KFD Memory Eviction */
@@ -313,6 +331,11 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
struct amdgpu_device *adev, uint64_t va, uint64_t size,
void *drm_priv, struct kgd_mem **mem,
uint64_t *offset, uint32_t flags, bool criu_resume);
+int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch(
+ struct amdgpu_device *adev, uint64_t va, uint64_t size,
+ void *drm_priv, struct kgd_mem **mem,
+ uint64_t *offset, struct kfd_ioctl_userptr_range *ranges,
+ uint32_t num_ranges, uint32_t flags, bool criu_resume);
int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
struct amdgpu_device *adev, struct kgd_mem *mem, void *drm_priv,
uint64_t *size);
--
2.34.1
* [PATCH v3 3/8] drm/amdkfd: Implement interval tree for userptr ranges
2026-02-06 6:25 [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support Honglei Huang
2026-02-06 6:25 ` [PATCH v3 1/8] drm/amdkfd: Add userptr batch allocation UAPI structures Honglei Huang
2026-02-06 6:25 ` [PATCH v3 2/8] drm/amdkfd: Add user_range_info infrastructure to kgd_mem Honglei Huang
@ 2026-02-06 6:25 ` Honglei Huang
2026-02-06 6:25 ` [PATCH v3 4/8] drm/amdkfd: Add batch MMU notifier support Honglei Huang
` (5 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Honglei Huang @ 2026-02-06 6:25 UTC (permalink / raw)
To: Felix.Kuehling, alexander.deucher, christian.koenig, Ray.Huang
Cc: dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm, honghuan
From: Honglei Huang <honghuan@amd.com>
Add interval tree support for efficient lookup of affected userptr
ranges during MMU notifier callbacks.
This adds:
- An include of linux/interval_tree.h
- mark_invalid_ranges(), which walks the interval tree to identify
and mark the ranges affected by a given invalidation event
Signed-off-by: Honglei Huang <honghuan@amd.com>
---
.../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 21 +++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index a32b46355..3b7fc6d15 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -25,6 +25,7 @@
#include <linux/pagemap.h>
#include <linux/sched/mm.h>
#include <linux/sched/task.h>
+#include <linux/interval_tree.h>
#include <drm/ttm/ttm_tt.h>
#include <drm/drm_exec.h>
@@ -1122,6 +1123,26 @@ static int init_user_pages(struct kgd_mem *mem, uint64_t user_addr,
return ret;
}
+static bool mark_invalid_ranges(struct kgd_mem *mem,
+ unsigned long inv_start, unsigned long inv_end)
+{
+ struct interval_tree_node *node;
+ struct user_range_info *range_info;
+ bool any_invalid = false;
+
+ for (node = interval_tree_iter_first(&mem->user_ranges_itree, inv_start, inv_end - 1);
+ node;
+ node = interval_tree_iter_next(node, inv_start, inv_end - 1)) {
+ range_info = container_of(node, struct user_range_info, it_node);
+ range_info->invalid++;
+ any_invalid = true;
+ pr_debug("Range [0x%llx-0x%llx) marked invalid (count=%u)\n",
+ range_info->start, range_info->start + range_info->size,
+ range_info->invalid);
+ }
+ return any_invalid;
+}
+
/* Reserving a BO and its page table BOs must happen atomically to
* avoid deadlocks. Some operations update multiple VMs at once. Track
* all the reservation info in a context structure. Optionally a sync
--
2.34.1
* [PATCH v3 4/8] drm/amdkfd: Add batch MMU notifier support
2026-02-06 6:25 [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support Honglei Huang
` (2 preceding siblings ...)
2026-02-06 6:25 ` [PATCH v3 3/8] drm/amdkfd: Implement interval tree for userptr ranges Honglei Huang
@ 2026-02-06 6:25 ` Honglei Huang
2026-02-06 6:25 ` [PATCH v3 5/8] drm/amdkfd: Implement batch userptr page management Honglei Huang
` (4 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Honglei Huang @ 2026-02-06 6:25 UTC (permalink / raw)
To: Felix.Kuehling, alexander.deucher, christian.koenig, Ray.Huang
Cc: dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm, honghuan
From: Honglei Huang <honghuan@amd.com>
Implement MMU notifier callbacks for batch userptr allocations.
This adds:
- amdgpu_amdkfd_evict_userptr_batch(): handles MMU invalidation
events for batch allocations, using interval tree to identify
affected ranges
- amdgpu_amdkfd_invalidate_userptr_batch(): wrapper for invalidate
callback
- amdgpu_amdkfd_hsa_batch_ops: MMU notifier ops structure
Signed-off-by: Honglei Huang <honghuan@amd.com>
---
.../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 57 +++++++++++++++++++
1 file changed, 57 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 3b7fc6d15..af6db20de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1143,6 +1143,63 @@ static bool mark_invalid_ranges(struct kgd_mem *mem,
return any_invalid;
}
+static int amdgpu_amdkfd_evict_userptr_batch(struct mmu_interval_notifier *mni,
+ const struct mmu_notifier_range *range,
+ unsigned long cur_seq)
+{
+ struct kgd_mem *mem;
+ struct amdkfd_process_info *process_info;
+ int r = 0;
+
+ mem = container_of(mni, struct kgd_mem, batch_notifier);
+ process_info = mem->process_info;
+
+ if (READ_ONCE(process_info->block_mmu_notifications))
+ return 0;
+
+ if (!mark_invalid_ranges(mem, range->start, range->end)) {
+ pr_debug("Batch userptr: invalidation [0x%lx-0x%lx) does not affect any range\n",
+ range->start, range->end);
+ return 0;
+ }
+
+ mutex_lock(&process_info->notifier_lock);
+ mmu_interval_set_seq(mni, cur_seq);
+
+ mem->invalid++;
+
+ if (++process_info->evicted_bos == 1) {
+ r = kgd2kfd_quiesce_mm(mni->mm,
+ KFD_QUEUE_EVICTION_TRIGGER_USERPTR);
+
+ if (r && r != -ESRCH)
+ pr_err("Failed to quiesce KFD\n");
+
+ if (r != -ESRCH)
+ queue_delayed_work(system_freezable_wq,
+ &process_info->restore_userptr_work,
+ msecs_to_jiffies(AMDGPU_USERPTR_RESTORE_DELAY_MS));
+ }
+ mutex_unlock(&process_info->notifier_lock);
+
+ pr_debug("Batch userptr evicted: va_min=0x%llx va_max=0x%llx, inv_range=[0x%lx-0x%lx)\n",
+ mem->batch_va_min, mem->batch_va_max, range->start, range->end);
+
+ return r;
+}
+
+static bool amdgpu_amdkfd_invalidate_userptr_batch(struct mmu_interval_notifier *mni,
+ const struct mmu_notifier_range *range,
+ unsigned long cur_seq)
+{
+ amdgpu_amdkfd_evict_userptr_batch(mni, range, cur_seq);
+ return true;
+}
+
+static const struct mmu_interval_notifier_ops amdgpu_amdkfd_hsa_batch_ops = {
+ .invalidate = amdgpu_amdkfd_invalidate_userptr_batch,
+};
+
/* Reserving a BO and its page table BOs must happen atomically to
* avoid deadlocks. Some operations update multiple VMs at once. Track
* all the reservation info in a context structure. Optionally a sync
--
2.34.1
* [PATCH v3 5/8] drm/amdkfd: Implement batch userptr page management
2026-02-06 6:25 [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support Honglei Huang
` (3 preceding siblings ...)
2026-02-06 6:25 ` [PATCH v3 4/8] drm/amdkfd: Add batch MMU notifier support Honglei Huang
@ 2026-02-06 6:25 ` Honglei Huang
2026-02-06 6:25 ` [PATCH v3 6/8] drm/amdkfd: Add batch allocation function and export API Honglei Huang
` (3 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Honglei Huang @ 2026-02-06 6:25 UTC (permalink / raw)
To: Felix.Kuehling, alexander.deucher, christian.koenig, Ray.Huang
Cc: dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm, honghuan
From: Honglei Huang <honghuan@amd.com>
Add core page management functions for batch userptr allocations.
This adds:
- get_user_pages_batch(): gets user pages for a single range within
a batch allocation using HMM
- set_user_pages_batch(): populates TTM page array from multiple
HMM ranges
Signed-off-by: Honglei Huang <honghuan@amd.com>
---
.../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 54 +++++++++++++++++++
1 file changed, 54 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index af6db20de..7aca1868d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1200,6 +1200,60 @@ static const struct mmu_interval_notifier_ops amdgpu_amdkfd_hsa_batch_ops = {
.invalidate = amdgpu_amdkfd_invalidate_userptr_batch,
};
+static int get_user_pages_batch(struct mm_struct *mm,
+ struct kgd_mem *mem,
+ struct user_range_info *range,
+ struct hmm_range **range_hmm, bool readonly)
+{
+ struct vm_area_struct *vma;
+ int r = 0;
+
+ *range_hmm = NULL;
+
+ if (!mmget_not_zero(mm))
+ return -ESRCH;
+
+ mmap_read_lock(mm);
+ vma = vma_lookup(mm, range->start);
+ if (unlikely(!vma)) {
+ r = -EFAULT;
+ goto out_unlock;
+ }
+
+ r = amdgpu_hmm_range_get_pages(&mem->batch_notifier, range->start,
+ range->size >> PAGE_SHIFT, readonly,
+ NULL, range_hmm);
+
+out_unlock:
+ mmap_read_unlock(mm);
+ mmput(mm);
+ return r;
+}
+
+static int set_user_pages_batch(struct ttm_tt *ttm,
+ struct user_range_info *ranges,
+ uint32_t nranges)
+{
+ uint32_t i, j, k = 0, range_npfns;
+
+ for (i = 0; i < nranges; ++i) {
+ if (!ranges[i].range || !ranges[i].range->hmm_pfns)
+ return -EINVAL;
+
+ range_npfns = (ranges[i].range->end - ranges[i].range->start) >>
+ PAGE_SHIFT;
+
+ if (k + range_npfns > ttm->num_pages)
+ return -EOVERFLOW;
+
+ for (j = 0; j < range_npfns; ++j)
+ ttm->pages[k++] =
+ hmm_pfn_to_page(ranges[i].range->hmm_pfns[j]);
+ }
+
+ return 0;
+}
+
/* Reserving a BO and its page table BOs must happen atomically to
* avoid deadlocks. Some operations update multiple VMs at once. Track
* all the reservation info in a context structure. Optionally a sync
--
2.34.1
* [PATCH v3 6/8] drm/amdkfd: Add batch allocation function and export API
2026-02-06 6:25 [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support Honglei Huang
` (4 preceding siblings ...)
2026-02-06 6:25 ` [PATCH v3 5/8] drm/amdkfd: Implement batch userptr page management Honglei Huang
@ 2026-02-06 6:25 ` Honglei Huang
2026-02-06 6:25 ` [PATCH v3 7/8] drm/amdkfd: Unify userptr cleanup and update paths Honglei Huang
` (2 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Honglei Huang @ 2026-02-06 6:25 UTC (permalink / raw)
To: Felix.Kuehling, alexander.deucher, christian.koenig, Ray.Huang
Cc: dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm, honghuan
From: Honglei Huang <honghuan@amd.com>
Implement the main batch userptr allocation function and export it
through the AMDKFD API.
This adds:
- init_user_pages_batch(): initializes batch allocation by setting
up interval tree, registering single MMU notifier, and getting
pages for all ranges
- amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch(): main entry point
for batch userptr allocation
- Function export in amdgpu_amdkfd.h
Signed-off-by: Honglei Huang <honghuan@amd.com>
---
.../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 264 ++++++++++++++++++
1 file changed, 264 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 7aca1868d..bc075f5f1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1254,6 +1254,151 @@ static int set_user_pages_batch(struct ttm_tt *ttm,
return 0;
}
+static int init_user_pages_batch(struct kgd_mem *mem,
+ struct kfd_ioctl_userptr_range *ranges,
+ uint32_t num_ranges, bool criu_resume,
+ uint64_t user_addr, uint32_t size)
+{
+ struct amdkfd_process_info *process_info = mem->process_info;
+ struct amdgpu_bo *bo = mem->bo;
+ struct ttm_operation_ctx ctx = { true, false };
+ struct hmm_range *range;
+ uint64_t va_min = ULLONG_MAX, va_max = 0;
+ int ret = 0;
+ uint32_t i;
+
+ if (!num_ranges || !ranges)
+ return -EINVAL;
+
+ mutex_lock(&process_info->lock);
+
+ mem->user_ranges = kvcalloc(num_ranges, sizeof(struct user_range_info),
+ GFP_KERNEL);
+
+ if (!mem->user_ranges) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ mem->num_user_ranges = num_ranges;
+
+ mem->user_ranges_itree = RB_ROOT_CACHED;
+
+ ret = amdgpu_ttm_tt_set_userptr(&bo->tbo, user_addr, 0);
+ if (ret) {
+ pr_err("%s: Failed to set userptr: %d\n", __func__, ret);
+ goto out;
+ }
+
+ for (i = 0; i < num_ranges; i++) {
+ uint64_t range_end;
+
+ mem->user_ranges[i].start = ranges[i].start;
+ mem->user_ranges[i].size = ranges[i].size;
+ mem->user_ranges[i].range = NULL;
+
+ range_end = ranges[i].start + ranges[i].size;
+
+ mem->user_ranges[i].it_node.start = ranges[i].start;
+ mem->user_ranges[i].it_node.last = range_end - 1;
+ interval_tree_insert(&mem->user_ranges[i].it_node, &mem->user_ranges_itree);
+
+ if (ranges[i].start < va_min)
+ va_min = ranges[i].start;
+ if (range_end > va_max)
+ va_max = range_end;
+
+ pr_debug("Initializing userptr range %u: addr=0x%llx size=0x%llx\n",
+ i, mem->user_ranges[i].start, mem->user_ranges[i].size);
+ }
+
+ mem->batch_va_min = va_min;
+ mem->batch_va_max = va_max;
+
+ pr_debug("Batch userptr: registering single notifier for span [0x%llx - 0x%llx)\n",
+ va_min, va_max);
+
+ ret = mmu_interval_notifier_insert(&mem->batch_notifier,
+ current->mm, va_min, va_max - va_min,
+ &amdgpu_amdkfd_hsa_batch_ops);
+ if (ret) {
+ pr_err("%s: Failed to register batch MMU notifier: %d\n",
+ __func__, ret);
+ goto err_cleanup_ranges;
+ }
+
+ if (criu_resume) {
+ mutex_lock(&process_info->notifier_lock);
+ mem->invalid++;
+ mutex_unlock(&process_info->notifier_lock);
+ mutex_unlock(&process_info->lock);
+ return 0;
+ }
+
+ for (i = 0; i < num_ranges; i++) {
+ ret = get_user_pages_batch(
+ current->mm, mem, &mem->user_ranges[i], &range,
+ amdgpu_ttm_tt_is_readonly(bo->tbo.ttm));
+ if (ret) {
+ if (ret == -EAGAIN)
+ pr_debug("Failed to get user pages for range %u, try again\n", i);
+ else
+ pr_err("%s: Failed to get user pages for range %u: %d\n",
+ __func__, i, ret);
+ goto err_unregister;
+ }
+
+ mem->user_ranges[i].range = range;
+ }
+
+ ret = amdgpu_bo_reserve(bo, true);
+ if (ret) {
+ pr_err("%s: Failed to reserve BO\n", __func__);
+ goto release_pages;
+ }
+
+ if (bo->tbo.ttm->pages) {
+ set_user_pages_batch(bo->tbo.ttm,
+ mem->user_ranges,
+ num_ranges);
+ } else {
+ pr_err("%s: TTM pages array is NULL\n", __func__);
+ ret = -EINVAL;
+ amdgpu_bo_unreserve(bo);
+ goto release_pages;
+ }
+
+ amdgpu_bo_placement_from_domain(bo, mem->domain);
+ ret = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
+ if (ret)
+ pr_err("%s: failed to validate BO\n", __func__);
+
+ amdgpu_bo_unreserve(bo);
+
+release_pages:
+ for (i = 0; i < num_ranges; i++) {
+ if (mem->user_ranges[i].range) {
+ amdgpu_ttm_tt_get_user_pages_done(bo->tbo.ttm,
+ mem->user_ranges[i].range);
+ }
+ }
+
+err_unregister:
+ if (ret && mem->batch_notifier.mm) {
+ mmu_interval_notifier_remove(&mem->batch_notifier);
+ mem->batch_notifier.mm = NULL;
+ }
+err_cleanup_ranges:
+ if (ret) {
+ for (i = 0; i < num_ranges; i++) {
+ mem->user_ranges[i].range = NULL;
+ }
+ }
+
+out:
+ mutex_unlock(&process_info->lock);
+ return ret;
+}
+
/* Reserving a BO and its page table BOs must happen atomically to
* avoid deadlocks. Some operations update multiple VMs at once. Track
* all the reservation info in a context structure. Optionally a sync
@@ -2012,6 +2157,125 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
return ret;
}
+int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch(
+ struct amdgpu_device *adev, uint64_t va, uint64_t size, void *drm_priv,
+ struct kgd_mem **mem, uint64_t *offset,
+ struct kfd_ioctl_userptr_range *ranges, uint32_t num_ranges,
+ uint32_t flags, bool criu_resume)
+{
+ struct amdgpu_vm *avm = drm_priv_to_vm(drm_priv);
+ struct amdgpu_bo *bo;
+ struct drm_gem_object *gobj = NULL;
+ u32 domain, alloc_domain;
+ uint64_t aligned_size;
+ int8_t xcp_id = -1;
+ u64 alloc_flags;
+ int ret;
+
+ if (!(flags & KFD_IOC_ALLOC_MEM_FLAGS_USERPTR)) {
+ pr_err("Batch allocation requires USERPTR flag\n");
+ return -EINVAL;
+ }
+
+ if (flags & KFD_IOC_ALLOC_MEM_FLAGS_AQL_QUEUE_MEM) {
+ pr_err("Batch userptr does not support AQL queue\n");
+ return -EINVAL;
+ }
+
+ domain = AMDGPU_GEM_DOMAIN_GTT;
+ alloc_domain = AMDGPU_GEM_DOMAIN_CPU;
+ alloc_flags = AMDGPU_GEM_CREATE_PREEMPTIBLE;
+
+ if (flags & KFD_IOC_ALLOC_MEM_FLAGS_COHERENT)
+ alloc_flags |= AMDGPU_GEM_CREATE_COHERENT;
+ if (flags & KFD_IOC_ALLOC_MEM_FLAGS_EXT_COHERENT)
+ alloc_flags |= AMDGPU_GEM_CREATE_EXT_COHERENT;
+ if (flags & KFD_IOC_ALLOC_MEM_FLAGS_UNCACHED)
+ alloc_flags |= AMDGPU_GEM_CREATE_UNCACHED;
+
+ *mem = kzalloc(sizeof(struct kgd_mem), GFP_KERNEL);
+ if (!*mem) {
+ ret = -ENOMEM;
+ goto err;
+ }
+ INIT_LIST_HEAD(&(*mem)->attachments);
+ mutex_init(&(*mem)->lock);
+ (*mem)->aql_queue = false;
+
+ aligned_size = PAGE_ALIGN(size);
+
+ (*mem)->alloc_flags = flags;
+
+ amdgpu_sync_create(&(*mem)->sync);
+
+ ret = amdgpu_amdkfd_reserve_mem_limit(adev, aligned_size, flags,
+ xcp_id);
+ if (ret) {
+ pr_debug("Insufficient memory\n");
+ goto err_reserve_limit;
+ }
+
+ pr_debug("\tcreate BO VA 0x%llx size 0x%llx for batch userptr (ranges=%u)\n",
+ va, size, num_ranges);
+
+ ret = amdgpu_gem_object_create(adev, aligned_size, 1, alloc_domain, alloc_flags,
+ ttm_bo_type_device, NULL, &gobj, xcp_id + 1);
+ if (ret) {
+ pr_debug("Failed to create BO on domain %s. ret %d\n",
+ domain_string(alloc_domain), ret);
+ goto err_bo_create;
+ }
+
+ ret = drm_vma_node_allow(&gobj->vma_node, drm_priv);
+ if (ret) {
+ pr_debug("Failed to allow vma node access. ret %d\n", ret);
+ goto err_node_allow;
+ }
+
+ ret = drm_gem_handle_create(adev->kfd.client.file, gobj, &(*mem)->gem_handle);
+ if (ret)
+ goto err_gem_handle_create;
+
+ bo = gem_to_amdgpu_bo(gobj);
+ bo->kfd_bo = *mem;
+ bo->flags |= AMDGPU_AMDKFD_CREATE_USERPTR_BO;
+
+ (*mem)->bo = bo;
+ (*mem)->va = va;
+ (*mem)->domain = domain;
+ (*mem)->mapped_to_gpu_memory = 0;
+ (*mem)->process_info = avm->process_info;
+
+ add_kgd_mem_to_kfd_bo_list(*mem, avm->process_info, ranges[0].start);
+
+ ret = init_user_pages_batch(*mem, ranges, num_ranges, criu_resume, va, aligned_size);
+ if (ret) {
+ pr_err("Failed to initialize batch user pages: %d\n", ret);
+ goto allocate_init_user_pages_failed;
+ }
+
+ return 0;
+
+allocate_init_user_pages_failed:
+ remove_kgd_mem_from_kfd_bo_list(*mem, avm->process_info);
+ drm_gem_handle_delete(adev->kfd.client.file, (*mem)->gem_handle);
+err_gem_handle_create:
+ drm_vma_node_revoke(&gobj->vma_node, drm_priv);
+err_node_allow:
+ goto err_reserve_limit;
+err_bo_create:
+ amdgpu_amdkfd_unreserve_mem_limit(adev, aligned_size, flags, xcp_id);
+err_reserve_limit:
+ amdgpu_sync_free(&(*mem)->sync);
+ mutex_destroy(&(*mem)->lock);
+ if (gobj)
+ drm_gem_object_put(gobj);
+ else
+ kfree(*mem);
+err:
+ return ret;
+}
+
int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
struct amdgpu_device *adev, struct kgd_mem *mem, void *drm_priv,
uint64_t *size)
--
2.34.1
* [PATCH v3 7/8] drm/amdkfd: Unify userptr cleanup and update paths
2026-02-06 6:25 [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support Honglei Huang
` (5 preceding siblings ...)
2026-02-06 6:25 ` [PATCH v3 6/8] drm/amdkfd: Add batch allocation function and export API Honglei Huang
@ 2026-02-06 6:25 ` Honglei Huang
2026-02-06 6:25 ` [PATCH v3 8/8] drm/amdkfd: Wire up batch allocation in ioctl handler Honglei Huang
2026-02-06 13:56 ` [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support Christian König
8 siblings, 0 replies; 22+ messages in thread
From: Honglei Huang @ 2026-02-06 6:25 UTC (permalink / raw)
To: Felix.Kuehling, alexander.deucher, christian.koenig, Ray.Huang
Cc: dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm, honghuan
From: Honglei Huang <honghuan@amd.com>
Refactor userptr management code to handle both single and batch
allocations uniformly.
This adds:
- cleanup_userptr_resources(): unified cleanup for single/batch
- discard_user_pages_batch(): discard pages for batch ranges
- amdgpu_amdkfd_update_user_pages_batch(): update pages for batch
- valid_user_pages_batch(): validate batch pages
Modified functions to support batch mode:
- update_invalid_user_pages(): uses batch update when applicable
- confirm_valid_user_pages_locked(): checks batch validity
- amdgpu_amdkfd_gpuvm_free_memory_of_gpu(): uses unified cleanup
Signed-off-by: Honglei Huang <honghuan@amd.com>
---
.../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 143 +++++++++++++++---
1 file changed, 126 insertions(+), 17 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index bc075f5f1..bea365bdc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -2276,6 +2276,35 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch(
return ret;
}
+static void cleanup_userptr_resources(struct kgd_mem *mem,
+ struct amdkfd_process_info *process_info)
+{
+ uint32_t i;
+
+ if (!amdgpu_ttm_tt_get_usermm(mem->bo->tbo.ttm))
+ return;
+
+ if (mem->num_user_ranges > 0 && mem->user_ranges) {
+ for (i = 0; i < mem->num_user_ranges; i++)
+ interval_tree_remove(&mem->user_ranges[i].it_node,
+ &mem->user_ranges_itree);
+
+ if (mem->batch_notifier.mm) {
+ mmu_interval_notifier_remove(&mem->batch_notifier);
+ mem->batch_notifier.mm = NULL;
+ }
+
+ kvfree(mem->user_ranges);
+ mem->user_ranges = NULL;
+ mem->num_user_ranges = 0;
+ } else {
+ amdgpu_hmm_unregister(mem->bo);
+ mutex_lock(&process_info->notifier_lock);
+ amdgpu_ttm_tt_discard_user_pages(mem->bo->tbo.ttm, mem->range);
+ mutex_unlock(&process_info->notifier_lock);
+ }
+}
+
int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
struct amdgpu_device *adev, struct kgd_mem *mem, void *drm_priv,
uint64_t *size)
@@ -2317,12 +2346,7 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
mutex_unlock(&process_info->lock);
/* Cleanup user pages and MMU notifiers */
- if (amdgpu_ttm_tt_get_usermm(mem->bo->tbo.ttm)) {
- amdgpu_hmm_unregister(mem->bo);
- mutex_lock(&process_info->notifier_lock);
- amdgpu_ttm_tt_discard_user_pages(mem->bo->tbo.ttm, mem->range);
- mutex_unlock(&process_info->notifier_lock);
- }
+ cleanup_userptr_resources(mem, process_info);
ret = reserve_bo_and_cond_vms(mem, NULL, BO_VM_ALL, &ctx);
if (unlikely(ret))
@@ -2909,6 +2933,44 @@ int amdgpu_amdkfd_evict_userptr(struct mmu_interval_notifier *mni,
return r;
}
+static void discard_user_pages_batch(struct amdgpu_bo *bo, struct kgd_mem *mem)
+{
+ uint32_t i;
+
+ for (i = 0; i < mem->num_user_ranges; i++) {
+ if (mem->user_ranges[i].invalid && mem->user_ranges[i].range) {
+ amdgpu_ttm_tt_discard_user_pages(bo->tbo.ttm,
+ mem->user_ranges[i].range);
+ mem->user_ranges[i].range = NULL;
+ }
+ }
+}
+
+static int amdgpu_amdkfd_update_user_pages_batch(struct mm_struct *mm,
+ struct amdgpu_bo *bo,
+ struct kgd_mem *mem)
+{
+ uint32_t i;
+ int ret = 0;
+
+ for (i = 0; i < mem->num_user_ranges; i++) {
+ if (!mem->user_ranges[i].invalid)
+ continue;
+
+ ret = get_user_pages_batch(
+ mm, mem, &mem->user_ranges[i],
+ &mem->user_ranges[i].range,
+ amdgpu_ttm_tt_is_readonly(bo->tbo.ttm));
+ if (ret) {
+ pr_debug("Failed %d to get user pages for range %u\n",
+ ret, i);
+ break;
+ }
+ }
+
+ return ret;
+}
+
/* Update invalid userptr BOs
*
* Moves invalidated (evicted) userptr BOs from userptr_valid_list to
@@ -2946,8 +3008,12 @@ static int update_invalid_user_pages(struct amdkfd_process_info *process_info,
bo = mem->bo;
- amdgpu_ttm_tt_discard_user_pages(bo->tbo.ttm, mem->range);
- mem->range = NULL;
+ if (mem->num_user_ranges > 0 && mem->user_ranges)
+ discard_user_pages_batch(bo, mem);
+ else {
+ amdgpu_ttm_tt_discard_user_pages(bo->tbo.ttm, mem->range);
+ mem->range = NULL;
+ }
/* BO reservations and getting user pages (hmm_range_fault)
* must happen outside the notifier lock
@@ -2971,7 +3037,11 @@ static int update_invalid_user_pages(struct amdkfd_process_info *process_info,
}
/* Get updated user pages */
- ret = amdgpu_ttm_tt_get_user_pages(bo, &mem->range);
+ if (mem->num_user_ranges > 0 && mem->user_ranges)
+ ret = amdgpu_amdkfd_update_user_pages_batch(mm, bo, mem);
+ else
+ ret = amdgpu_ttm_tt_get_user_pages(bo, &mem->range);
+
if (ret) {
pr_debug("Failed %d to get user pages\n", ret);
@@ -3005,7 +3075,10 @@ static int update_invalid_user_pages(struct amdkfd_process_info *process_info,
ret = 0;
}
- amdgpu_ttm_tt_set_user_pages(bo->tbo.ttm, mem->range);
+ if (mem->num_user_ranges == 0)
+ amdgpu_ttm_tt_set_user_pages(bo->tbo.ttm, mem->range);
+ else
+ set_user_pages_batch(bo->tbo.ttm, mem->user_ranges, mem->num_user_ranges);
mutex_lock(&process_info->notifier_lock);
@@ -3019,6 +3092,14 @@ static int update_invalid_user_pages(struct amdkfd_process_info *process_info,
/* set mem valid if mem has hmm range associated */
if (mem->range)
mem->invalid = 0;
+
+ if (mem->num_user_ranges > 0 && mem->user_ranges) {
+ uint32_t i;
+
+ for (i = 0; i < mem->num_user_ranges; i++)
+ mem->user_ranges[i].invalid = 0;
+ mem->invalid = 0;
+ }
}
unlock_out:
@@ -3126,6 +3207,29 @@ static int validate_invalid_user_pages(struct amdkfd_process_info *process_info)
return ret;
}
+static bool valid_user_pages_batch(struct kgd_mem *mem)
+{
+ uint32_t i;
+ bool all_valid = true;
+
+ if (!mem->user_ranges || mem->num_user_ranges == 0)
+ return true;
+
+ for (i = 0; i < mem->num_user_ranges; i++) {
+ if (!mem->user_ranges[i].range)
+ continue;
+
+ if (!amdgpu_ttm_tt_get_user_pages_done(mem->bo->tbo.ttm,
+ mem->user_ranges[i].range)) {
+ all_valid = false;
+ }
+
+ mem->user_ranges[i].range = NULL;
+ }
+
+ return all_valid;
+}
+
/* Confirm that all user pages are valid while holding the notifier lock
*
* Moves valid BOs from the userptr_inval_list back to userptr_val_list.
@@ -3140,15 +3244,20 @@ static int confirm_valid_user_pages_locked(struct amdkfd_process_info *process_i
validate_list) {
bool valid;
- /* keep mem without hmm range at userptr_inval_list */
- if (!mem->range)
- continue;
+ if (mem->num_user_ranges > 0 && mem->user_ranges)
+ valid = valid_user_pages_batch(mem);
+ else {
+ /* keep mem without hmm range at userptr_inval_list */
+ if (!mem->range)
+ continue;
- /* Only check mem with hmm range associated */
- valid = amdgpu_ttm_tt_get_user_pages_done(
- mem->bo->tbo.ttm, mem->range);
+ /* Only check mem with hmm range associated */
+ valid = amdgpu_ttm_tt_get_user_pages_done(
+ mem->bo->tbo.ttm, mem->range);
+
+ mem->range = NULL;
+ }
- mem->range = NULL;
if (!valid) {
WARN(!mem->invalid, "Invalid BO not marked invalid");
ret = -EAGAIN;
--
2.34.1
* [PATCH v3 8/8] drm/amdkfd: Wire up batch allocation in ioctl handler
2026-02-06 6:25 [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support Honglei Huang
` (6 preceding siblings ...)
2026-02-06 6:25 ` [PATCH v3 7/8] drm/amdkfd: Unify userptr cleanup and update paths Honglei Huang
@ 2026-02-06 6:25 ` Honglei Huang
2026-02-06 13:56 ` [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support Christian König
8 siblings, 0 replies; 22+ messages in thread
From: Honglei Huang @ 2026-02-06 6:25 UTC (permalink / raw)
To: Felix.Kuehling, alexander.deucher, christian.koenig, Ray.Huang
Cc: dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm, honghuan
From: Honglei Huang <honghuan@amd.com>
Integrate batch userptr allocation into the KFD ioctl interface.
This adds:
- kfd_copy_userptr_ranges(): validates and copies batch range data
from userspace, checking alignment, sizes, and total size match
- Modifications to kfd_ioctl_alloc_memory_of_gpu() to detect batch
mode and route to appropriate allocation function
- SVM conflict checking extended for batch ranges
Signed-off-by: Honglei Huang <honghuan@amd.com>
---
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 128 +++++++++++++++++++++--
1 file changed, 122 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index a72cc980a..d0b56d5cc 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1047,10 +1047,79 @@ static int kfd_ioctl_get_available_memory(struct file *filep,
return 0;
}
+static int kfd_copy_userptr_ranges(void __user *user_data, uint64_t expected_size,
+ struct kfd_ioctl_userptr_range **ranges_out,
+ uint32_t *num_ranges_out)
+{
+ struct kfd_ioctl_userptr_ranges_data ranges_header;
+ struct kfd_ioctl_userptr_range *ranges;
+ uint64_t total_size = 0;
+ uint32_t num_ranges;
+ size_t header_size;
+ uint32_t i;
+
+ if (!user_data) {
+ pr_err("Batch allocation: ranges pointer is NULL\n");
+ return -EINVAL;
+ }
+
+ header_size = offsetof(struct kfd_ioctl_userptr_ranges_data, ranges);
+ if (copy_from_user(&ranges_header, user_data, header_size)) {
+ pr_err("Failed to copy ranges data header from user space\n");
+ return -EFAULT;
+ }
+
+ num_ranges = ranges_header.num_ranges;
+ if (num_ranges == 0) {
+ pr_err("Batch allocation: invalid number of ranges %u\n", num_ranges);
+ return -EINVAL;
+ }
+
+ if (ranges_header.reserved != 0) {
+ pr_err("Batch allocation: reserved field must be 0\n");
+ return -EINVAL;
+ }
+
+ ranges = kvmalloc_array(num_ranges, sizeof(*ranges), GFP_KERNEL);
+ if (!ranges)
+ return -ENOMEM;
+
+ if (copy_from_user(ranges, user_data + header_size,
+ num_ranges * sizeof(*ranges))) {
+ pr_err("Failed to copy ranges from user space\n");
+ kvfree(ranges);
+ return -EFAULT;
+ }
+
+ for (i = 0; i < num_ranges; i++) {
+ if (!ranges[i].start || !ranges[i].size ||
+ (ranges[i].start & ~PAGE_MASK) ||
+ (ranges[i].size & ~PAGE_MASK)) {
+ pr_err("Invalid range %u: start=0x%llx size=0x%llx\n",
+ i, ranges[i].start, ranges[i].size);
+ kvfree(ranges);
+ return -EINVAL;
+ }
+ total_size += ranges[i].size;
+ }
+
+ if (total_size != expected_size) {
+ pr_err("Size mismatch: provided %llu != calculated %llu\n",
+ expected_size, total_size);
+ kvfree(ranges);
+ return -EINVAL;
+ }
+
+ *ranges_out = ranges;
+ *num_ranges_out = num_ranges;
+ return 0;
+}
+
static int kfd_ioctl_alloc_memory_of_gpu(struct file *filep,
struct kfd_process *p, void *data)
{
struct kfd_ioctl_alloc_memory_of_gpu_args *args = data;
+ struct kfd_ioctl_userptr_range *ranges = NULL;
struct kfd_process_device *pdd;
void *mem;
struct kfd_node *dev;
@@ -1058,16 +1127,32 @@ static int kfd_ioctl_alloc_memory_of_gpu(struct file *filep,
long err;
uint64_t offset = args->mmap_offset;
uint32_t flags = args->flags;
+ uint32_t num_ranges = 0;
+ bool is_batch = false;
if (args->size == 0)
return -EINVAL;
+ if ((flags & KFD_IOC_ALLOC_MEM_FLAGS_USERPTR) &&
+ (flags & KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH)) {
+ is_batch = true;
+ }
+
if (p->context_id != KFD_CONTEXT_ID_PRIMARY && (flags & KFD_IOC_ALLOC_MEM_FLAGS_USERPTR)) {
pr_debug("USERPTR is not supported on non-primary kfd_process\n");
return -EOPNOTSUPP;
}
+ if (is_batch) {
+ err = kfd_copy_userptr_ranges((void __user *)args->mmap_offset,
+ args->size, &ranges, &num_ranges);
+ if (err)
+ return err;
+
+ offset = 0;
+ }
+
#if IS_ENABLED(CONFIG_HSA_AMD_SVM)
/* Flush pending deferred work to avoid racing with deferred actions
* from previous memory map changes (e.g. munmap).
@@ -1086,13 +1171,15 @@ static int kfd_ioctl_alloc_memory_of_gpu(struct file *filep,
pr_err("Address: 0x%llx already allocated by SVM\n",
args->va_addr);
mutex_unlock(&p->svms.lock);
- return -EADDRINUSE;
+ err = -EADDRINUSE;
+ goto err_free_ranges;
}
/* When register user buffer check if it has been registered by svm by
* buffer cpu virtual address.
+ * For batch mode, check each range individually below.
*/
- if ((flags & KFD_IOC_ALLOC_MEM_FLAGS_USERPTR) &&
+ if ((flags & KFD_IOC_ALLOC_MEM_FLAGS_USERPTR) && !is_batch &&
interval_tree_iter_first(&p->svms.objects,
args->mmap_offset >> PAGE_SHIFT,
(args->mmap_offset + args->size - 1) >> PAGE_SHIFT)) {
@@ -1102,6 +1189,22 @@ static int kfd_ioctl_alloc_memory_of_gpu(struct file *filep,
return -EADDRINUSE;
}
+ /* Check each userptr range for SVM conflicts in batch mode */
+ if (is_batch) {
+ uint32_t i;
+ for (i = 0; i < num_ranges; i++) {
+ if (interval_tree_iter_first(&p->svms.objects,
+ ranges[i].start >> PAGE_SHIFT,
+ (ranges[i].start + ranges[i].size - 1) >> PAGE_SHIFT)) {
+ pr_err("Userptr range %u (0x%llx) already allocated by SVM\n",
+ i, ranges[i].start);
+ mutex_unlock(&p->svms.lock);
+ err = -EADDRINUSE;
+ goto err_free_ranges;
+ }
+ }
+ }
+
mutex_unlock(&p->svms.lock);
#endif
mutex_lock(&p->mutex);
@@ -1149,10 +1252,17 @@ static int kfd_ioctl_alloc_memory_of_gpu(struct file *filep,
}
}
- err = amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
- dev->adev, args->va_addr, args->size,
- pdd->drm_priv, (struct kgd_mem **) &mem, &offset,
- flags, false);
+ if (is_batch) {
+ err = amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch(
+ dev->adev, args->va_addr, args->size, pdd->drm_priv,
+ (struct kgd_mem **)&mem, &offset, ranges, num_ranges,
+ flags, false);
+ } else {
+ err = amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
+ dev->adev, args->va_addr, args->size,
+ pdd->drm_priv, (struct kgd_mem **) &mem, &offset,
+ flags, false);
+ }
if (err)
goto err_unlock;
@@ -1184,6 +1294,9 @@ static int kfd_ioctl_alloc_memory_of_gpu(struct file *filep,
args->mmap_offset = KFD_MMAP_TYPE_MMIO
| KFD_MMAP_GPU_ID(args->gpu_id);
+ if (is_batch)
+ kvfree(ranges);
+
return 0;
err_free:
@@ -1193,6 +1306,9 @@ static int kfd_ioctl_alloc_memory_of_gpu(struct file *filep,
err_pdd:
err_large_bar:
mutex_unlock(&p->mutex);
+err_free_ranges:
+ if (ranges)
+ kvfree(ranges);
return err;
}
--
2.34.1
* Re: [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support
2026-02-06 6:25 [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support Honglei Huang
` (7 preceding siblings ...)
2026-02-06 6:25 ` [PATCH v3 8/8] drm/amdkfd: Wire up batch allocation in ioctl handler Honglei Huang
@ 2026-02-06 13:56 ` Christian König
2026-02-09 6:14 ` Honglei Huang
8 siblings, 1 reply; 22+ messages in thread
From: Christian König @ 2026-02-06 13:56 UTC (permalink / raw)
To: Honglei Huang, Felix.Kuehling, alexander.deucher, Ray.Huang
Cc: dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm, honghuan
On 2/6/26 07:25, Honglei Huang wrote:
> From: Honglei Huang <honghuan@amd.com>
>
> Hi all,
>
> This is v3 of the patch series to support allocating multiple non-contiguous
> CPU virtual address ranges that map to a single contiguous GPU virtual address.
>
> v3:
> 1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
> - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
That is most likely not the best approach, but Felix or Philip need to comment here since I don't know such IOCTLs well either.
> - When flag is set, mmap_offset field points to range array
> - Minimal API surface change
Why range of VA space for each entry?
> 2. Improved MMU notifier handling:
> - Single mmu_interval_notifier covering the VA span [va_min, va_max]
> - Interval tree for efficient lookup of affected ranges during invalidation
> - Avoids per-range notifier overhead mentioned in v2 review
That won't work unless you also modify hmm_range_fault() to take multiple VA addresses (or ranges) at the same time.
The problem is that we must rely on hmm_range.notifier_seq to detect changes to the page tables in question, but that in turn works only if you have one hmm_range structure and not multiple.
What might work is doing an XOR or CRC over all hmm_range.notifier_seq you have, but that is a bit flaky.
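Just to illustrate the idea, something like this (rough sketch,
assuming one hmm_range per batch range as in this series, completely
untested):

  /* Fold every range's notifier_seq into one value at fault time and
   * recompute/compare the fold under the notifier lock before
   * committing the mapping.
   */
  static unsigned long batch_notifier_seq_fold(struct user_range_info *ranges,
                                               uint32_t num_ranges)
  {
          unsigned long fold = 0;
          uint32_t i;

          for (i = 0; i < num_ranges; i++)
                  fold ^= ranges[i].range->notifier_seq;

          return fold;
  }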
Regards,
Christian.
>
> 3. Better code organization: Split into 8 focused patches for easier review
>
> v2:
> - Each CPU VA range gets its own mmu_interval_notifier for invalidation
> - All ranges validated together and mapped to contiguous GPU VA
> - Single kgd_mem object with array of user_range_info structures
> - Unified eviction/restore path for all ranges in a batch
>
> Current Implementation Approach
> ===============================
>
> This series implements a practical solution within existing kernel constraints:
>
> 1. Single MMU notifier for VA span: Register one notifier covering the
> entire range from lowest to highest address in the batch
>
> 2. Interval tree filtering: Use interval tree to efficiently identify
> which specific ranges are affected during invalidation callbacks,
> avoiding unnecessary processing for unrelated address changes
>
> 3. Unified eviction/restore: All ranges in a batch share eviction and
> restore paths, maintaining consistency with existing userptr handling
>
> Patch Series Overview
> =====================
>
> Patch 1/8: Add userptr batch allocation UAPI structures
> - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
> - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data structures
>
> Patch 2/8: Add user_range_info infrastructure to kgd_mem
> - user_range_info structure for per-range tracking
> - Fields for batch allocation in kgd_mem
>
> Patch 3/8: Implement interval tree for userptr ranges
> - Interval tree for efficient range lookup during invalidation
> - mark_invalid_ranges() function
>
> Patch 4/8: Add batch MMU notifier support
> - Single notifier for entire VA span
> - Invalidation callback using interval tree filtering
>
> Patch 5/8: Implement batch userptr page management
> - get_user_pages_batch() and set_user_pages_batch()
> - Per-range page array management
>
> Patch 6/8: Add batch allocation function and export API
> - init_user_pages_batch() main initialization
> - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point
>
> Patch 7/8: Unify userptr cleanup and update paths
> - Shared eviction/restore handling for batch allocations
> - Integration with existing userptr validation flows
>
> Patch 8/8: Wire up batch allocation in ioctl handler
> - Input validation and range array parsing
> - Integration with existing alloc_memory_of_gpu path
>
> Testing
> =======
>
> - Multiple scattered malloc() allocations (2-4000+ ranges)
> - Various allocation sizes (4KB to 1G+ per range)
> - Memory pressure scenarios and eviction/restore cycles
> - OpenCL CTS and HIP catch tests in KVM guest environment
> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
> - Small LLM inference (3B-7B models)
> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
> - Performance improvement: 2x-2.4x faster than userspace approach
>
> Thank you for your review and feedback.
>
> Best regards,
> Honglei Huang
>
> Honglei Huang (8):
> drm/amdkfd: Add userptr batch allocation UAPI structures
> drm/amdkfd: Add user_range_info infrastructure to kgd_mem
> drm/amdkfd: Implement interval tree for userptr ranges
> drm/amdkfd: Add batch MMU notifier support
> drm/amdkfd: Implement batch userptr page management
> drm/amdkfd: Add batch allocation function and export API
> drm/amdkfd: Unify userptr cleanup and update paths
> drm/amdkfd: Wire up batch allocation in ioctl handler
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 23 +
> .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 539 +++++++++++++++++-
> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 128 ++++-
> include/uapi/linux/kfd_ioctl.h | 31 +-
> 4 files changed, 697 insertions(+), 24 deletions(-)
>
* Re: [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support
2026-02-06 13:56 ` [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support Christian König
@ 2026-02-09 6:14 ` Honglei Huang
2026-02-09 10:16 ` Christian König
0 siblings, 1 reply; 22+ messages in thread
From: Honglei Huang @ 2026-02-09 6:14 UTC (permalink / raw)
To: Felix.Kuehling, Christian König, alexander.deucher,
Philip.Yang, Ray.Huang
Cc: dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm
I've reworked the implementation in v4. The fix is actually inspired
by the DRM GPU SVM framework (drivers/gpu/drm/drm_gpusvm.c).
DRM GPU SVM uses wide notifiers (recommended 512M or larger) to track
multiple user virtual address ranges under a single mmu_interval_notifier,
and these ranges can be non-contiguous, which is essentially the same
problem that batch userptr needs to solve: one BO backed by multiple
non-contiguous CPU VA ranges sharing one notifier.
The wide notifier is created in drm_gpusvm_notifier_alloc():

  notifier->itree.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
  notifier->itree.last = ALIGN(fault_addr + 1, gpusvm->notifier_size) - 1;

The Xe driver passes xe_modparam.svm_notifier_size * SZ_1M as the
notifier_size in xe_svm_init(), so one notifier can cover many MB of
VA space containing multiple non-contiguous ranges.
And DRM GPU SVM solves the per-range validity problem with flag-based
validation instead of seq-based validation:

- drm_gpusvm_pages_valid() checks flags.has_dma_mapping, not
  notifier_seq. The comment explicitly states:
  "This is akin to a notifier seqno check in the HMM documentation
  but due to wider notifiers (i.e., notifiers which span multiple
  ranges) this function is required for finer grained checking"
- __drm_gpusvm_unmap_pages() clears flags.has_dma_mapping = false
  under notifier_lock
- drm_gpusvm_get_pages() sets flags.has_dma_mapping = true under
  notifier_lock

I adopted the same approach.
DRM GPU SVM:
  drm_gpusvm_notifier_invalidate()
    down_write(&gpusvm->notifier_lock);
    mmu_interval_set_seq(mni, cur_seq);
    gpusvm->ops->invalidate()
      -> xe_svm_invalidate()
           drm_gpusvm_for_each_range()
             -> __drm_gpusvm_unmap_pages()
                  WRITE_ONCE(flags.has_dma_mapping = false); // clear flag
    up_write(&gpusvm->notifier_lock);

KFD batch userptr:
  amdgpu_amdkfd_evict_userptr_batch()
    mutex_lock(&process_info->notifier_lock);
    mmu_interval_set_seq(mni, cur_seq);
    discard_invalid_ranges()
      interval_tree_iter_first/next()
        range_info->valid = false; // clear flag
    mutex_unlock(&process_info->notifier_lock);
Both implementations:
- Acquire notifier_lock FIRST, before any flag changes
- Call mmu_interval_set_seq() under the lock
- Use interval tree to find affected ranges within the wide notifier
- Mark per-range flag as invalid/valid under the lock
The page fault path and final validation path also follow the same
pattern as DRM GPU SVM: fault outside the lock, set/check per-range
flag under the lock.
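As an illustration, a minimal sketch of the final check under
process_info->notifier_lock (field and helper names follow the flow
above and are only illustrative):

  /* Sketch: mirrors drm_gpusvm_pages_valid(); a per-range flag replaces
   * the single hmm_range notifier_seq check of the non-batch path.
   */
  static bool batch_user_pages_valid_locked(struct kgd_mem *mem)
  {
          uint32_t i;

          for (i = 0; i < mem->num_user_ranges; i++)
                  if (!mem->user_ranges[i].valid)
                          return false;   /* invalidated since fault, retry */

          return true;
  }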
Regards,
Honglei
On 2026/2/6 21:56, Christian König wrote:
> On 2/6/26 07:25, Honglei Huang wrote:
>> From: Honglei Huang <honghuan@amd.com>
>>
>> Hi all,
>>
>> This is v3 of the patch series to support allocating multiple non-contiguous
>> CPU virtual address ranges that map to a single contiguous GPU virtual address.
>>
>> v3:
>> 1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>> - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>
> That is most likely not the best approach, but Felix or Philip need to comment here since I don't know such IOCTLs well either.
>
>> - When flag is set, mmap_offset field points to range array
>> - Minimal API surface change
>
> Why range of VA space for each entry?
>
>> 2. Improved MMU notifier handling:
>> - Single mmu_interval_notifier covering the VA span [va_min, va_max]
>> - Interval tree for efficient lookup of affected ranges during invalidation
>> - Avoids per-range notifier overhead mentioned in v2 review
>
> That won't work unless you also modify hmm_range_fault() to take multiple VA addresses (or ranges) at the same time.
>
> The problem is that we must rely on hmm_range.notifier_seq to detect changes to the page tables in question, but that in turn works only if you have one hmm_range structure and not multiple.
>
> What might work is doing an XOR or CRC over all hmm_range.notifier_seq you have, but that is a bit flaky.
>
> Regards,
> Christian.
>
>>
>> 3. Better code organization: Split into 8 focused patches for easier review
>>
>> v2:
>> - Each CPU VA range gets its own mmu_interval_notifier for invalidation
>> - All ranges validated together and mapped to contiguous GPU VA
>> - Single kgd_mem object with array of user_range_info structures
>> - Unified eviction/restore path for all ranges in a batch
>>
>> Current Implementation Approach
>> ===============================
>>
>> This series implements a practical solution within existing kernel constraints:
>>
>> 1. Single MMU notifier for VA span: Register one notifier covering the
>> entire range from lowest to highest address in the batch
>>
>> 2. Interval tree filtering: Use interval tree to efficiently identify
>> which specific ranges are affected during invalidation callbacks,
>> avoiding unnecessary processing for unrelated address changes
>>
>> 3. Unified eviction/restore: All ranges in a batch share eviction and
>> restore paths, maintaining consistency with existing userptr handling
>>
>> Patch Series Overview
>> =====================
>>
>> Patch 1/8: Add userptr batch allocation UAPI structures
>> - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
>> - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data structures
>>
>> Patch 2/8: Add user_range_info infrastructure to kgd_mem
>> - user_range_info structure for per-range tracking
>> - Fields for batch allocation in kgd_mem
>>
>> Patch 3/8: Implement interval tree for userptr ranges
>> - Interval tree for efficient range lookup during invalidation
>> - mark_invalid_ranges() function
>>
>> Patch 4/8: Add batch MMU notifier support
>> - Single notifier for entire VA span
>> - Invalidation callback using interval tree filtering
>>
>> Patch 5/8: Implement batch userptr page management
>> - get_user_pages_batch() and set_user_pages_batch()
>> - Per-range page array management
>>
>> Patch 6/8: Add batch allocation function and export API
>> - init_user_pages_batch() main initialization
>> - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point
>>
>> Patch 7/8: Unify userptr cleanup and update paths
>> - Shared eviction/restore handling for batch allocations
>> - Integration with existing userptr validation flows
>>
>> Patch 8/8: Wire up batch allocation in ioctl handler
>> - Input validation and range array parsing
>> - Integration with existing alloc_memory_of_gpu path
>>
>> Testing
>> =======
>>
>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>> - Various allocation sizes (4KB to 1G+ per range)
>> - Memory pressure scenarios and eviction/restore cycles
>> - OpenCL CTS and HIP catch tests in KVM guest environment
>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>> - Small LLM inference (3B-7B models)
>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>> - Performance improvement: 2x-2.4x faster than userspace approach
>>
>> Thank you for your review and feedback.
>>
>> Best regards,
>> Honglei Huang
>>
>> Honglei Huang (8):
>> drm/amdkfd: Add userptr batch allocation UAPI structures
>> drm/amdkfd: Add user_range_info infrastructure to kgd_mem
>> drm/amdkfd: Implement interval tree for userptr ranges
>> drm/amdkfd: Add batch MMU notifier support
>> drm/amdkfd: Implement batch userptr page management
>> drm/amdkfd: Add batch allocation function and export API
>> drm/amdkfd: Unify userptr cleanup and update paths
>> drm/amdkfd: Wire up batch allocation in ioctl handler
>>
>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 23 +
>> .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 539 +++++++++++++++++-
>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 128 ++++-
>> include/uapi/linux/kfd_ioctl.h | 31 +-
>> 4 files changed, 697 insertions(+), 24 deletions(-)
>>
>
* Re: [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support
2026-02-09 6:14 ` Honglei Huang
@ 2026-02-09 10:16 ` Christian König
2026-02-09 12:52 ` Honglei Huang
0 siblings, 1 reply; 22+ messages in thread
From: Christian König @ 2026-02-09 10:16 UTC (permalink / raw)
To: Honglei Huang, Felix.Kuehling, alexander.deucher, Philip.Yang, Ray.Huang
Cc: dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm
On 2/9/26 07:14, Honglei Huang wrote:
>
> I've reworked the implementation in v4. The fix is actually inspired
> by the DRM GPU SVM framework (drivers/gpu/drm/drm_gpusvm.c).
>
> DRM GPU SVM uses wide notifiers (recommended 512M or larger) to track
> multiple user virtual address ranges under a single mmu_interval_notifier,
> and these ranges can be non-contiguous which is essentially the same
> problem that batch userptr needs to solve: one BO backed by multiple
> non-contiguous CPU VA ranges sharing one notifier.
That still doesn't solve the sequencing problem.
As far as I can see, you can't use hmm_range_fault() with this approach, or it would just not be very valuable.
So how should that work with your patch set?
Regards,
Christian.
>
> The wide notifier is created in drm_gpusvm_notifier_alloc:
> notifier->itree.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
> notifier->itree.last = ALIGN(fault_addr + 1, gpusvm->notifier_size) - 1;
> The Xe driver passes
> xe_modparam.svm_notifier_size * SZ_1M in xe_svm_init
> as the notifier_size, so one notifier can cover many of MB of VA space
> containing multiple non-contiguous ranges.
>
> And DRM GPU SVM solves the per-range validity problem with flag-based
> validation instead of seq-based validation in:
> - drm_gpusvm_pages_valid() checks
> flags.has_dma_mapping
> not notifier_seq. The comment explicitly states:
> "This is akin to a notifier seqno check in the HMM documentation
> but due to wider notifiers (i.e., notifiers which span multiple
> ranges) this function is required for finer grained checking"
> - __drm_gpusvm_unmap_pages() clears
> flags.has_dma_mapping = false under notifier_lock
> - drm_gpusvm_get_pages() sets
> flags.has_dma_mapping = true under notifier_lock
> I adopted the same approach.
>
> DRM GPU SVM:
> drm_gpusvm_notifier_invalidate()
> down_write(&gpusvm->notifier_lock);
> mmu_interval_set_seq(mni, cur_seq);
> gpusvm->ops->invalidate()
> -> xe_svm_invalidate()
> drm_gpusvm_for_each_range()
> -> __drm_gpusvm_unmap_pages()
> WRITE_ONCE(flags.has_dma_mapping = false); // clear flag
> up_write(&gpusvm->notifier_lock);
>
> KFD batch userptr:
> amdgpu_amdkfd_evict_userptr_batch()
> mutex_lock(&process_info->notifier_lock);
> mmu_interval_set_seq(mni, cur_seq);
> discard_invalid_ranges()
> interval_tree_iter_first/next()
> range_info->valid = false; // clear flag
> mutex_unlock(&process_info->notifier_lock);
>
> Both implementations:
> - Acquire notifier_lock FIRST, before any flag changes
> - Call mmu_interval_set_seq() under the lock
> - Use interval tree to find affected ranges within the wide notifier
> - Mark per-range flag as invalid/valid under the lock
>
> The page fault path and final validation path also follow the same
> pattern as DRM GPU SVM: fault outside the lock, set/check per-range
> flag under the lock.
>
> Regards,
> Honglei
>
>
> On 2026/2/6 21:56, Christian König wrote:
>> On 2/6/26 07:25, Honglei Huang wrote:
>>> From: Honglei Huang <honghuan@amd.com>
>>>
>>> Hi all,
>>>
>>> This is v3 of the patch series to support allocating multiple non-contiguous
>>> CPU virtual address ranges that map to a single contiguous GPU virtual address.
>>>
>>> v3:
>>> 1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>>> - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>>
>> That is most likely not the best approach, but Felix or Philip need to comment here since I don't know such IOCTLs well either.
>>
>>> - When flag is set, mmap_offset field points to range array
>>> - Minimal API surface change
>>
>> Why range of VA space for each entry?
>>
>>> 2. Improved MMU notifier handling:
>>> - Single mmu_interval_notifier covering the VA span [va_min, va_max]
>>> - Interval tree for efficient lookup of affected ranges during invalidation
>>> - Avoids per-range notifier overhead mentioned in v2 review
>>
>> That won't work unless you also modify hmm_range_fault() to take multiple VA addresses (or ranges) at the same time.
>>
>> The problem is that we must rely on hmm_range.notifier_seq to detect changes to the page tables in question, but that in turn works only if you have one hmm_range structure and not multiple.
>>
>> What might work is doing an XOR or CRC over all hmm_range.notifier_seq you have, but that is a bit flaky.
>>
>> Regards,
>> Christian.
>>
>>>
>>> 3. Better code organization: Split into 8 focused patches for easier review
>>>
>>> v2:
>>> - Each CPU VA range gets its own mmu_interval_notifier for invalidation
>>> - All ranges validated together and mapped to contiguous GPU VA
>>> - Single kgd_mem object with array of user_range_info structures
>>> - Unified eviction/restore path for all ranges in a batch
>>>
>>> Current Implementation Approach
>>> ===============================
>>>
>>> This series implements a practical solution within existing kernel constraints:
>>>
>>> 1. Single MMU notifier for VA span: Register one notifier covering the
>>> entire range from lowest to highest address in the batch
>>>
>>> 2. Interval tree filtering: Use interval tree to efficiently identify
>>> which specific ranges are affected during invalidation callbacks,
>>> avoiding unnecessary processing for unrelated address changes
>>>
>>> 3. Unified eviction/restore: All ranges in a batch share eviction and
>>> restore paths, maintaining consistency with existing userptr handling
>>>
>>> Patch Series Overview
>>> =====================
>>>
>>> Patch 1/8: Add userptr batch allocation UAPI structures
>>> - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
>>> - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data structures
>>>
>>> Patch 2/8: Add user_range_info infrastructure to kgd_mem
>>> - user_range_info structure for per-range tracking
>>> - Fields for batch allocation in kgd_mem
>>>
>>> Patch 3/8: Implement interval tree for userptr ranges
>>> - Interval tree for efficient range lookup during invalidation
>>> - mark_invalid_ranges() function
>>>
>>> Patch 4/8: Add batch MMU notifier support
>>> - Single notifier for entire VA span
>>> - Invalidation callback using interval tree filtering
>>>
>>> Patch 5/8: Implement batch userptr page management
>>> - get_user_pages_batch() and set_user_pages_batch()
>>> - Per-range page array management
>>>
>>> Patch 6/8: Add batch allocation function and export API
>>> - init_user_pages_batch() main initialization
>>> - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point
>>>
>>> Patch 7/8: Unify userptr cleanup and update paths
>>> - Shared eviction/restore handling for batch allocations
>>> - Integration with existing userptr validation flows
>>>
>>> Patch 8/8: Wire up batch allocation in ioctl handler
>>> - Input validation and range array parsing
>>> - Integration with existing alloc_memory_of_gpu path
>>>
>>> Testing
>>> =======
>>>
>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>> - Various allocation sizes (4KB to 1G+ per range)
>>> - Memory pressure scenarios and eviction/restore cycles
>>> - OpenCL CTS and HIP catch tests in KVM guest environment
>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>>> - Small LLM inference (3B-7B models)
>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>> - Performance improvement: 2x-2.4x faster than userspace approach
>>>
>>> Thank you for your review and feedback.
>>>
>>> Best regards,
>>> Honglei Huang
>>>
>>> Honglei Huang (8):
>>> drm/amdkfd: Add userptr batch allocation UAPI structures
>>> drm/amdkfd: Add user_range_info infrastructure to kgd_mem
>>> drm/amdkfd: Implement interval tree for userptr ranges
>>> drm/amdkfd: Add batch MMU notifier support
>>> drm/amdkfd: Implement batch userptr page management
>>> drm/amdkfd: Add batch allocation function and export API
>>> drm/amdkfd: Unify userptr cleanup and update paths
>>> drm/amdkfd: Wire up batch allocation in ioctl handler
>>>
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 23 +
>>> .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 539 +++++++++++++++++-
>>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 128 ++++-
>>> include/uapi/linux/kfd_ioctl.h | 31 +-
>>> 4 files changed, 697 insertions(+), 24 deletions(-)
>>>
>>
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support
2026-02-09 10:16 ` Christian König
@ 2026-02-09 12:52 ` Honglei Huang
2026-02-09 12:59 ` Christian König
0 siblings, 1 reply; 22+ messages in thread
From: Honglei Huang @ 2026-02-09 12:52 UTC (permalink / raw)
To: Christian König
Cc: Felix.Kuehling, Philip.Yang, Ray.Huang, alexander.deucher,
dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm
DRM GPU SVM does use hmm_range_fault(); see drm_gpusvm_get_pages().
My implementation follows the same pattern. The detailed comparison
of the invalidation paths was provided in the second half of my previous mail.
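To make the comparison concrete, here is a minimal sketch of the per-range
pattern I am referring to (simplified, with illustrative names for the lock
and the valid flag; this is not the literal drm_gpusvm_get_pages() body, and
a real implementation also bounds the -EBUSY retries with a timeout):

static int fault_one_range(struct mmu_interval_notifier *notifier,
			   struct mutex *notifier_lock,
			   struct hmm_range *range, bool *range_valid)
{
	int err;

	/* range->start/end/hmm_pfns are assumed to be filled in by the caller */
retry:
	range->notifier = notifier;
	range->notifier_seq = mmu_interval_read_begin(notifier);

	mmap_read_lock(current->mm);
	err = hmm_range_fault(range);	/* fault outside the notifier lock */
	mmap_read_unlock(current->mm);
	if (err == -EBUSY)
		goto retry;
	if (err)
		return err;

	mutex_lock(notifier_lock);
	if (mmu_interval_read_retry(notifier, range->notifier_seq)) {
		mutex_unlock(notifier_lock);
		goto retry;		/* raced with an invalidation */
	}
	*range_valid = true;		/* per-range flag, set under the lock */
	mutex_unlock(notifier_lock);

	return 0;
}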
On 2026/2/9 18:16, Christian König wrote:
> On 2/9/26 07:14, Honglei Huang wrote:
>>
>> I've reworked the implementation in v4. The fix is actually inspired
>> by the DRM GPU SVM framework (drivers/gpu/drm/drm_gpusvm.c).
>>
>> DRM GPU SVM uses wide notifiers (recommended 512M or larger) to track
>> multiple user virtual address ranges under a single mmu_interval_notifier,
>> and these ranges can be non-contiguous which is essentially the same
>> problem that batch userptr needs to solve: one BO backed by multiple
>> non-contiguous CPU VA ranges sharing one notifier.
>
> That still doesn't solve the sequencing problem.
>
> As far as I can see you can't use hmm_range_fault with this approach or it would just not be very valuable.
>
> So how should that work with your patch set?
>
> Regards,
> Christian.
>
>>
>> The wide notifier is created in drm_gpusvm_notifier_alloc:
>> notifier->itree.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
>> notifier->itree.last = ALIGN(fault_addr + 1, gpusvm->notifier_size) - 1;
>> The Xe driver passes
>> xe_modparam.svm_notifier_size * SZ_1M in xe_svm_init
>> as the notifier_size, so one notifier can cover many of MB of VA space
>> containing multiple non-contiguous ranges.
>>
>> And DRM GPU SVM solves the per-range validity problem with flag-based
>> validation instead of seq-based validation in:
>> - drm_gpusvm_pages_valid() checks
>> flags.has_dma_mapping
>> not notifier_seq. The comment explicitly states:
>> "This is akin to a notifier seqno check in the HMM documentation
>> but due to wider notifiers (i.e., notifiers which span multiple
>> ranges) this function is required for finer grained checking"
>> - __drm_gpusvm_unmap_pages() clears
>> flags.has_dma_mapping = false under notifier_lock
>> - drm_gpusvm_get_pages() sets
>> flags.has_dma_mapping = true under notifier_lock
>> I adopted the same approach.
>>
>> DRM GPU SVM:
>> drm_gpusvm_notifier_invalidate()
>> down_write(&gpusvm->notifier_lock);
>> mmu_interval_set_seq(mni, cur_seq);
>> gpusvm->ops->invalidate()
>> -> xe_svm_invalidate()
>> drm_gpusvm_for_each_range()
>> -> __drm_gpusvm_unmap_pages()
>> WRITE_ONCE(flags.has_dma_mapping = false); // clear flag
>> up_write(&gpusvm->notifier_lock);
>>
>> KFD batch userptr:
>> amdgpu_amdkfd_evict_userptr_batch()
>> mutex_lock(&process_info->notifier_lock);
>> mmu_interval_set_seq(mni, cur_seq);
>> discard_invalid_ranges()
>> interval_tree_iter_first/next()
>> range_info->valid = false; // clear flag
>> mutex_unlock(&process_info->notifier_lock);
>>
>> Both implementations:
>> - Acquire notifier_lock FIRST, before any flag changes
>> - Call mmu_interval_set_seq() under the lock
>> - Use interval tree to find affected ranges within the wide notifier
>> - Mark per-range flag as invalid/valid under the lock
>>
>> The page fault path and final validation path also follow the same
>> pattern as DRM GPU SVM: fault outside the lock, set/check per-range
>> flag under the lock.
>>
>> Regards,
>> Honglei
>>
>>
>> On 2026/2/6 21:56, Christian König wrote:
>>> On 2/6/26 07:25, Honglei Huang wrote:
>>>> From: Honglei Huang <honghuan@amd.com>
>>>>
>>>> Hi all,
>>>>
>>>> This is v3 of the patch series to support allocating multiple non-contiguous
>>>> CPU virtual address ranges that map to a single contiguous GPU virtual address.
>>>>
>>>> v3:
>>>> 1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>>>> - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>>>
>>> That is most likely not the best approach, but Felix or Philip need to comment here since I don't know such IOCTLs well either.
>>>
>>>> - When flag is set, mmap_offset field points to range array
>>>> - Minimal API surface change
>>>
>>> Why range of VA space for each entry?
>>>
>>>> 2. Improved MMU notifier handling:
>>>> - Single mmu_interval_notifier covering the VA span [va_min, va_max]
>>>> - Interval tree for efficient lookup of affected ranges during invalidation
>>>> - Avoids per-range notifier overhead mentioned in v2 review
>>>
>>> That won't work unless you also modify hmm_range_fault() to take multiple VA addresses (or ranges) at the same time.
>>>
>>> The problem is that we must rely on hmm_range.notifier_seq to detect changes to the page tables in question, but that in turn works only if you have one hmm_range structure and not multiple.
>>>
>>> What might work is doing an XOR or CRC over all hmm_range.notifier_seq you have, but that is a bit flaky.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> 3. Better code organization: Split into 8 focused patches for easier review
>>>>
>>>> v2:
>>>> - Each CPU VA range gets its own mmu_interval_notifier for invalidation
>>>> - All ranges validated together and mapped to contiguous GPU VA
>>>> - Single kgd_mem object with array of user_range_info structures
>>>> - Unified eviction/restore path for all ranges in a batch
>>>>
>>>> Current Implementation Approach
>>>> ===============================
>>>>
>>>> This series implements a practical solution within existing kernel constraints:
>>>>
>>>> 1. Single MMU notifier for VA span: Register one notifier covering the
>>>> entire range from lowest to highest address in the batch
>>>>
>>>> 2. Interval tree filtering: Use interval tree to efficiently identify
>>>> which specific ranges are affected during invalidation callbacks,
>>>> avoiding unnecessary processing for unrelated address changes
>>>>
>>>> 3. Unified eviction/restore: All ranges in a batch share eviction and
>>>> restore paths, maintaining consistency with existing userptr handling
>>>>
>>>> Patch Series Overview
>>>> =====================
>>>>
>>>> Patch 1/8: Add userptr batch allocation UAPI structures
>>>> - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
>>>> - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data structures
>>>>
>>>> Patch 2/8: Add user_range_info infrastructure to kgd_mem
>>>> - user_range_info structure for per-range tracking
>>>> - Fields for batch allocation in kgd_mem
>>>>
>>>> Patch 3/8: Implement interval tree for userptr ranges
>>>> - Interval tree for efficient range lookup during invalidation
>>>> - mark_invalid_ranges() function
>>>>
>>>> Patch 4/8: Add batch MMU notifier support
>>>> - Single notifier for entire VA span
>>>> - Invalidation callback using interval tree filtering
>>>>
>>>> Patch 5/8: Implement batch userptr page management
>>>> - get_user_pages_batch() and set_user_pages_batch()
>>>> - Per-range page array management
>>>>
>>>> Patch 6/8: Add batch allocation function and export API
>>>> - init_user_pages_batch() main initialization
>>>> - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point
>>>>
>>>> Patch 7/8: Unify userptr cleanup and update paths
>>>> - Shared eviction/restore handling for batch allocations
>>>> - Integration with existing userptr validation flows
>>>>
>>>> Patch 8/8: Wire up batch allocation in ioctl handler
>>>> - Input validation and range array parsing
>>>> - Integration with existing alloc_memory_of_gpu path
>>>>
>>>> Testing
>>>> =======
>>>>
>>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>>> - Various allocation sizes (4KB to 1G+ per range)
>>>> - Memory pressure scenarios and eviction/restore cycles
>>>> - OpenCL CTS and HIP catch tests in KVM guest environment
>>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>>>> - Small LLM inference (3B-7B models)
>>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>>> - Performance improvement: 2x-2.4x faster than userspace approach
>>>>
>>>> Thank you for your review and feedback.
>>>>
>>>> Best regards,
>>>> Honglei Huang
>>>>
>>>> Honglei Huang (8):
>>>> drm/amdkfd: Add userptr batch allocation UAPI structures
>>>> drm/amdkfd: Add user_range_info infrastructure to kgd_mem
>>>> drm/amdkfd: Implement interval tree for userptr ranges
>>>> drm/amdkfd: Add batch MMU notifier support
>>>> drm/amdkfd: Implement batch userptr page management
>>>> drm/amdkfd: Add batch allocation function and export API
>>>> drm/amdkfd: Unify userptr cleanup and update paths
>>>> drm/amdkfd: Wire up batch allocation in ioctl handler
>>>>
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 23 +
>>>> .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 539 +++++++++++++++++-
>>>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 128 ++++-
>>>> include/uapi/linux/kfd_ioctl.h | 31 +-
>>>> 4 files changed, 697 insertions(+), 24 deletions(-)
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support
2026-02-09 12:52 ` Honglei Huang
@ 2026-02-09 12:59 ` Christian König
2026-02-09 13:11 ` Honglei Huang
0 siblings, 1 reply; 22+ messages in thread
From: Christian König @ 2026-02-09 12:59 UTC (permalink / raw)
To: Honglei Huang
Cc: Felix.Kuehling, Philip.Yang, Ray.Huang, alexander.deucher,
dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm
On 2/9/26 13:52, Honglei Huang wrote:
> DRM GPU SVM does use hmm_range_fault(), see drm_gpusvm_get_pages()
I'm not sure what you are talking about; drm_gpusvm_get_pages() only supports a single range as well, not scatter/gather of VA addresses.
As far as I can see that doesn't help in the slightest.
> My implementation follows the same pattern. The detailed comparison
> of invalidation path was provided in the second half of my previous mail.
Yeah, and as I said, that is not very valuable because it doesn't solve the sequence problem.
As far as I can see the approach you try here is a clear NAK from my side.
Regards,
Christian.
>
> On 2026/2/9 18:16, Christian König wrote:
>> On 2/9/26 07:14, Honglei Huang wrote:
>>>
>>> I've reworked the implementation in v4. The fix is actually inspired
>>> by the DRM GPU SVM framework (drivers/gpu/drm/drm_gpusvm.c).
>>>
>>> DRM GPU SVM uses wide notifiers (recommended 512M or larger) to track
>>> multiple user virtual address ranges under a single mmu_interval_notifier,
>>> and these ranges can be non-contiguous which is essentially the same
>>> problem that batch userptr needs to solve: one BO backed by multiple
>>> non-contiguous CPU VA ranges sharing one notifier.
>>
>> That still doesn't solve the sequencing problem.
>>
>> As far as I can see you can't use hmm_range_fault with this approach or it would just not be very valuable.
>>
>> So how should that work with your patch set?
>>
>> Regards,
>> Christian.
>>
>>>
>>> The wide notifier is created in drm_gpusvm_notifier_alloc:
>>> notifier->itree.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
>>> notifier->itree.last = ALIGN(fault_addr + 1, gpusvm->notifier_size) - 1;
>>> The Xe driver passes
>>> xe_modparam.svm_notifier_size * SZ_1M in xe_svm_init
>>> as the notifier_size, so one notifier can cover many of MB of VA space
>>> containing multiple non-contiguous ranges.
>>>
>>> And DRM GPU SVM solves the per-range validity problem with flag-based
>>> validation instead of seq-based validation in:
>>> - drm_gpusvm_pages_valid() checks
>>> flags.has_dma_mapping
>>> not notifier_seq. The comment explicitly states:
>>> "This is akin to a notifier seqno check in the HMM documentation
>>> but due to wider notifiers (i.e., notifiers which span multiple
>>> ranges) this function is required for finer grained checking"
>>> - __drm_gpusvm_unmap_pages() clears
>>> flags.has_dma_mapping = false under notifier_lock
>>> - drm_gpusvm_get_pages() sets
>>> flags.has_dma_mapping = true under notifier_lock
>>> I adopted the same approach.
>>>
>>> DRM GPU SVM:
>>> drm_gpusvm_notifier_invalidate()
>>> down_write(&gpusvm->notifier_lock);
>>> mmu_interval_set_seq(mni, cur_seq);
>>> gpusvm->ops->invalidate()
>>> -> xe_svm_invalidate()
>>> drm_gpusvm_for_each_range()
>>> -> __drm_gpusvm_unmap_pages()
>>> WRITE_ONCE(flags.has_dma_mapping = false); // clear flag
>>> up_write(&gpusvm->notifier_lock);
>>>
>>> KFD batch userptr:
>>> amdgpu_amdkfd_evict_userptr_batch()
>>> mutex_lock(&process_info->notifier_lock);
>>> mmu_interval_set_seq(mni, cur_seq);
>>> discard_invalid_ranges()
>>> interval_tree_iter_first/next()
>>> range_info->valid = false; // clear flag
>>> mutex_unlock(&process_info->notifier_lock);
>>>
>>> Both implementations:
>>> - Acquire notifier_lock FIRST, before any flag changes
>>> - Call mmu_interval_set_seq() under the lock
>>> - Use interval tree to find affected ranges within the wide notifier
>>> - Mark per-range flag as invalid/valid under the lock
>>>
>>> The page fault path and final validation path also follow the same
>>> pattern as DRM GPU SVM: fault outside the lock, set/check per-range
>>> flag under the lock.
>>>
>>> Regards,
>>> Honglei
>>>
>>>
>>> On 2026/2/6 21:56, Christian König wrote:
>>>> On 2/6/26 07:25, Honglei Huang wrote:
>>>>> From: Honglei Huang <honghuan@amd.com>
>>>>>
>>>>> Hi all,
>>>>>
>>>>> This is v3 of the patch series to support allocating multiple non-contiguous
>>>>> CPU virtual address ranges that map to a single contiguous GPU virtual address.
>>>>>
>>>>> v3:
>>>>> 1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>>>>> - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>>>>
>>>> That is most likely not the best approach, but Felix or Philip need to comment here since I don't know such IOCTLs well either.
>>>>
>>>>> - When flag is set, mmap_offset field points to range array
>>>>> - Minimal API surface change
>>>>
>>>> Why range of VA space for each entry?
>>>>
>>>>> 2. Improved MMU notifier handling:
>>>>> - Single mmu_interval_notifier covering the VA span [va_min, va_max]
>>>>> - Interval tree for efficient lookup of affected ranges during invalidation
>>>>> - Avoids per-range notifier overhead mentioned in v2 review
>>>>
>>>> That won't work unless you also modify hmm_range_fault() to take multiple VA addresses (or ranges) at the same time.
>>>>
>>>> The problem is that we must rely on hmm_range.notifier_seq to detect changes to the page tables in question, but that in turn works only if you have one hmm_range structure and not multiple.
>>>>
>>>> What might work is doing an XOR or CRC over all hmm_range.notifier_seq you have, but that is a bit flaky.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> 3. Better code organization: Split into 8 focused patches for easier review
>>>>>
>>>>> v2:
>>>>> - Each CPU VA range gets its own mmu_interval_notifier for invalidation
>>>>> - All ranges validated together and mapped to contiguous GPU VA
>>>>> - Single kgd_mem object with array of user_range_info structures
>>>>> - Unified eviction/restore path for all ranges in a batch
>>>>>
>>>>> Current Implementation Approach
>>>>> ===============================
>>>>>
>>>>> This series implements a practical solution within existing kernel constraints:
>>>>>
>>>>> 1. Single MMU notifier for VA span: Register one notifier covering the
>>>>> entire range from lowest to highest address in the batch
>>>>>
>>>>> 2. Interval tree filtering: Use interval tree to efficiently identify
>>>>> which specific ranges are affected during invalidation callbacks,
>>>>> avoiding unnecessary processing for unrelated address changes
>>>>>
>>>>> 3. Unified eviction/restore: All ranges in a batch share eviction and
>>>>> restore paths, maintaining consistency with existing userptr handling
>>>>>
>>>>> Patch Series Overview
>>>>> =====================
>>>>>
>>>>> Patch 1/8: Add userptr batch allocation UAPI structures
>>>>> - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
>>>>> - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data structures
>>>>>
>>>>> Patch 2/8: Add user_range_info infrastructure to kgd_mem
>>>>> - user_range_info structure for per-range tracking
>>>>> - Fields for batch allocation in kgd_mem
>>>>>
>>>>> Patch 3/8: Implement interval tree for userptr ranges
>>>>> - Interval tree for efficient range lookup during invalidation
>>>>> - mark_invalid_ranges() function
>>>>>
>>>>> Patch 4/8: Add batch MMU notifier support
>>>>> - Single notifier for entire VA span
>>>>> - Invalidation callback using interval tree filtering
>>>>>
>>>>> Patch 5/8: Implement batch userptr page management
>>>>> - get_user_pages_batch() and set_user_pages_batch()
>>>>> - Per-range page array management
>>>>>
>>>>> Patch 6/8: Add batch allocation function and export API
>>>>> - init_user_pages_batch() main initialization
>>>>> - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point
>>>>>
>>>>> Patch 7/8: Unify userptr cleanup and update paths
>>>>> - Shared eviction/restore handling for batch allocations
>>>>> - Integration with existing userptr validation flows
>>>>>
>>>>> Patch 8/8: Wire up batch allocation in ioctl handler
>>>>> - Input validation and range array parsing
>>>>> - Integration with existing alloc_memory_of_gpu path
>>>>>
>>>>> Testing
>>>>> =======
>>>>>
>>>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>>>> - Various allocation sizes (4KB to 1G+ per range)
>>>>> - Memory pressure scenarios and eviction/restore cycles
>>>>> - OpenCL CTS and HIP catch tests in KVM guest environment
>>>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>>>>> - Small LLM inference (3B-7B models)
>>>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>>>> - Performance improvement: 2x-2.4x faster than userspace approach
>>>>>
>>>>> Thank you for your review and feedback.
>>>>>
>>>>> Best regards,
>>>>> Honglei Huang
>>>>>
>>>>> Honglei Huang (8):
>>>>> drm/amdkfd: Add userptr batch allocation UAPI structures
>>>>> drm/amdkfd: Add user_range_info infrastructure to kgd_mem
>>>>> drm/amdkfd: Implement interval tree for userptr ranges
>>>>> drm/amdkfd: Add batch MMU notifier support
>>>>> drm/amdkfd: Implement batch userptr page management
>>>>> drm/amdkfd: Add batch allocation function and export API
>>>>> drm/amdkfd: Unify userptr cleanup and update paths
>>>>> drm/amdkfd: Wire up batch allocation in ioctl handler
>>>>>
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 23 +
>>>>> .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 539 +++++++++++++++++-
>>>>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 128 ++++-
>>>>> include/uapi/linux/kfd_ioctl.h | 31 +-
>>>>> 4 files changed, 697 insertions(+), 24 deletions(-)
>>>>>
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support
2026-02-09 12:59 ` Christian König
@ 2026-02-09 13:11 ` Honglei Huang
2026-02-09 13:27 ` Christian König
0 siblings, 1 reply; 22+ messages in thread
From: Honglei Huang @ 2026-02-09 13:11 UTC (permalink / raw)
To: Christian König
Cc: Felix.Kuehling, Philip.Yang, Ray.Huang, alexander.deucher,
dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm
So is DRM GPU SVM also a NAK?
This code has passed local testing with OpenCL and ROCr, and I also
provided a detailed code path and analysis.
You have only stated a conclusion without giving any reasons or
evidence, which makes it difficult to be convinced so far.
On 2026/2/9 20:59, Christian König wrote:
> On 2/9/26 13:52, Honglei Huang wrote:
>> DRM GPU SVM does use hmm_range_fault(), see drm_gpusvm_get_pages()
>
> I'm not sure what you are talking about, drm_gpusvm_get_pages() only supports a single range as well and not scatter gather of VA addresses.
>
> As far as I can see that doesn't help the slightest.
>
>> My implementation follows the same pattern. The detailed comparison
>> of invalidation path was provided in the second half of my previous mail.
>
> Yeah and as I said that is not very valuable because it doesn't solves the sequence problem.
>
> As far as I can see the approach you try here is a clear NAK from my side.
>
> Regards,
> Christian.
>
>>
>> On 2026/2/9 18:16, Christian König wrote:
>>> On 2/9/26 07:14, Honglei Huang wrote:
>>>>
>>>> I've reworked the implementation in v4. The fix is actually inspired
>>>> by the DRM GPU SVM framework (drivers/gpu/drm/drm_gpusvm.c).
>>>>
>>>> DRM GPU SVM uses wide notifiers (recommended 512M or larger) to track
>>>> multiple user virtual address ranges under a single mmu_interval_notifier,
>>>> and these ranges can be non-contiguous which is essentially the same
>>>> problem that batch userptr needs to solve: one BO backed by multiple
>>>> non-contiguous CPU VA ranges sharing one notifier.
>>>
>>> That still doesn't solve the sequencing problem.
>>>
>>> As far as I can see you can't use hmm_range_fault with this approach or it would just not be very valuable.
>>>
>>> So how should that work with your patch set?
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> The wide notifier is created in drm_gpusvm_notifier_alloc:
>>>> notifier->itree.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
>>>> notifier->itree.last = ALIGN(fault_addr + 1, gpusvm->notifier_size) - 1;
>>>> The Xe driver passes
>>>> xe_modparam.svm_notifier_size * SZ_1M in xe_svm_init
>>>> as the notifier_size, so one notifier can cover many of MB of VA space
>>>> containing multiple non-contiguous ranges.
>>>>
>>>> And DRM GPU SVM solves the per-range validity problem with flag-based
>>>> validation instead of seq-based validation in:
>>>> - drm_gpusvm_pages_valid() checks
>>>> flags.has_dma_mapping
>>>> not notifier_seq. The comment explicitly states:
>>>> "This is akin to a notifier seqno check in the HMM documentation
>>>> but due to wider notifiers (i.e., notifiers which span multiple
>>>> ranges) this function is required for finer grained checking"
>>>> - __drm_gpusvm_unmap_pages() clears
>>>> flags.has_dma_mapping = false under notifier_lock
>>>> - drm_gpusvm_get_pages() sets
>>>> flags.has_dma_mapping = true under notifier_lock
>>>> I adopted the same approach.
>>>>
>>>> DRM GPU SVM:
>>>> drm_gpusvm_notifier_invalidate()
>>>> down_write(&gpusvm->notifier_lock);
>>>> mmu_interval_set_seq(mni, cur_seq);
>>>> gpusvm->ops->invalidate()
>>>> -> xe_svm_invalidate()
>>>> drm_gpusvm_for_each_range()
>>>> -> __drm_gpusvm_unmap_pages()
>>>> WRITE_ONCE(flags.has_dma_mapping = false); // clear flag
>>>> up_write(&gpusvm->notifier_lock);
>>>>
>>>> KFD batch userptr:
>>>> amdgpu_amdkfd_evict_userptr_batch()
>>>> mutex_lock(&process_info->notifier_lock);
>>>> mmu_interval_set_seq(mni, cur_seq);
>>>> discard_invalid_ranges()
>>>> interval_tree_iter_first/next()
>>>> range_info->valid = false; // clear flag
>>>> mutex_unlock(&process_info->notifier_lock);
>>>>
>>>> Both implementations:
>>>> - Acquire notifier_lock FIRST, before any flag changes
>>>> - Call mmu_interval_set_seq() under the lock
>>>> - Use interval tree to find affected ranges within the wide notifier
>>>> - Mark per-range flag as invalid/valid under the lock
>>>>
>>>> The page fault path and final validation path also follow the same
>>>> pattern as DRM GPU SVM: fault outside the lock, set/check per-range
>>>> flag under the lock.
>>>>
>>>> Regards,
>>>> Honglei
>>>>
>>>>
>>>> On 2026/2/6 21:56, Christian König wrote:
>>>>> On 2/6/26 07:25, Honglei Huang wrote:
>>>>>> From: Honglei Huang <honghuan@amd.com>
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> This is v3 of the patch series to support allocating multiple non-contiguous
>>>>>> CPU virtual address ranges that map to a single contiguous GPU virtual address.
>>>>>>
>>>>>> v3:
>>>>>> 1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>>>>>> - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>>>>>
>>>>> That is most likely not the best approach, but Felix or Philip need to comment here since I don't know such IOCTLs well either.
>>>>>
>>>>>> - When flag is set, mmap_offset field points to range array
>>>>>> - Minimal API surface change
>>>>>
>>>>> Why range of VA space for each entry?
>>>>>
>>>>>> 2. Improved MMU notifier handling:
>>>>>> - Single mmu_interval_notifier covering the VA span [va_min, va_max]
>>>>>> - Interval tree for efficient lookup of affected ranges during invalidation
>>>>>> - Avoids per-range notifier overhead mentioned in v2 review
>>>>>
>>>>> That won't work unless you also modify hmm_range_fault() to take multiple VA addresses (or ranges) at the same time.
>>>>>
>>>>> The problem is that we must rely on hmm_range.notifier_seq to detect changes to the page tables in question, but that in turn works only if you have one hmm_range structure and not multiple.
>>>>>
>>>>> What might work is doing an XOR or CRC over all hmm_range.notifier_seq you have, but that is a bit flaky.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>> 3. Better code organization: Split into 8 focused patches for easier review
>>>>>>
>>>>>> v2:
>>>>>> - Each CPU VA range gets its own mmu_interval_notifier for invalidation
>>>>>> - All ranges validated together and mapped to contiguous GPU VA
>>>>>> - Single kgd_mem object with array of user_range_info structures
>>>>>> - Unified eviction/restore path for all ranges in a batch
>>>>>>
>>>>>> Current Implementation Approach
>>>>>> ===============================
>>>>>>
>>>>>> This series implements a practical solution within existing kernel constraints:
>>>>>>
>>>>>> 1. Single MMU notifier for VA span: Register one notifier covering the
>>>>>> entire range from lowest to highest address in the batch
>>>>>>
>>>>>> 2. Interval tree filtering: Use interval tree to efficiently identify
>>>>>> which specific ranges are affected during invalidation callbacks,
>>>>>> avoiding unnecessary processing for unrelated address changes
>>>>>>
>>>>>> 3. Unified eviction/restore: All ranges in a batch share eviction and
>>>>>> restore paths, maintaining consistency with existing userptr handling
>>>>>>
>>>>>> Patch Series Overview
>>>>>> =====================
>>>>>>
>>>>>> Patch 1/8: Add userptr batch allocation UAPI structures
>>>>>> - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
>>>>>> - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data structures
>>>>>>
>>>>>> Patch 2/8: Add user_range_info infrastructure to kgd_mem
>>>>>> - user_range_info structure for per-range tracking
>>>>>> - Fields for batch allocation in kgd_mem
>>>>>>
>>>>>> Patch 3/8: Implement interval tree for userptr ranges
>>>>>> - Interval tree for efficient range lookup during invalidation
>>>>>> - mark_invalid_ranges() function
>>>>>>
>>>>>> Patch 4/8: Add batch MMU notifier support
>>>>>> - Single notifier for entire VA span
>>>>>> - Invalidation callback using interval tree filtering
>>>>>>
>>>>>> Patch 5/8: Implement batch userptr page management
>>>>>> - get_user_pages_batch() and set_user_pages_batch()
>>>>>> - Per-range page array management
>>>>>>
>>>>>> Patch 6/8: Add batch allocation function and export API
>>>>>> - init_user_pages_batch() main initialization
>>>>>> - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point
>>>>>>
>>>>>> Patch 7/8: Unify userptr cleanup and update paths
>>>>>> - Shared eviction/restore handling for batch allocations
>>>>>> - Integration with existing userptr validation flows
>>>>>>
>>>>>> Patch 8/8: Wire up batch allocation in ioctl handler
>>>>>> - Input validation and range array parsing
>>>>>> - Integration with existing alloc_memory_of_gpu path
>>>>>>
>>>>>> Testing
>>>>>> =======
>>>>>>
>>>>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>>>>> - Various allocation sizes (4KB to 1G+ per range)
>>>>>> - Memory pressure scenarios and eviction/restore cycles
>>>>>> - OpenCL CTS and HIP catch tests in KVM guest environment
>>>>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>>>>>> - Small LLM inference (3B-7B models)
>>>>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>>>>> - Performance improvement: 2x-2.4x faster than userspace approach
>>>>>>
>>>>>> Thank you for your review and feedback.
>>>>>>
>>>>>> Best regards,
>>>>>> Honglei Huang
>>>>>>
>>>>>> Honglei Huang (8):
>>>>>> drm/amdkfd: Add userptr batch allocation UAPI structures
>>>>>> drm/amdkfd: Add user_range_info infrastructure to kgd_mem
>>>>>> drm/amdkfd: Implement interval tree for userptr ranges
>>>>>> drm/amdkfd: Add batch MMU notifier support
>>>>>> drm/amdkfd: Implement batch userptr page management
>>>>>> drm/amdkfd: Add batch allocation function and export API
>>>>>> drm/amdkfd: Unify userptr cleanup and update paths
>>>>>> drm/amdkfd: Wire up batch allocation in ioctl handler
>>>>>>
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 23 +
>>>>>> .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 539 +++++++++++++++++-
>>>>>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 128 ++++-
>>>>>> include/uapi/linux/kfd_ioctl.h | 31 +-
>>>>>> 4 files changed, 697 insertions(+), 24 deletions(-)
>>>>>>
>>>>>
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support
2026-02-09 13:11 ` Honglei Huang
@ 2026-02-09 13:27 ` Christian König
2026-02-09 14:16 ` Honglei Huang
0 siblings, 1 reply; 22+ messages in thread
From: Christian König @ 2026-02-09 13:27 UTC (permalink / raw)
To: Honglei Huang
Cc: Felix.Kuehling, Philip.Yang, Ray.Huang, alexander.deucher,
dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm
On 2/9/26 14:11, Honglei Huang wrote:
>
> So the drm svm is also a NAK?
>
> These codes have passed local testing, opencl and rocr, I also provided a detailed code path and analysis.
> You only said the conclusion without providing any reasons or evidence. Your statement has no justifiable reasons and is difficult to convince
> so far.
That sounds like you don't understand what the issue here is, so I will try to explain it once more with pseudo-code.
Page tables are updated without holding a lock, so when you want to grab physical addresses from them you need to use an opportunistic, retry-based approach to make sure that the data you got is still valid.
In other words something like this here is needed:
retry:
	hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
	hmm_range.hmm_pfns = kvmalloc_array(npages, ...);
	...
	while (true) {
		mmap_read_lock(mm);
		/* fault the pages outside the notifier lock */
		err = hmm_range_fault(&hmm_range);
		mmap_read_unlock(mm);

		if (err == -EBUSY) {
			if (time_after(jiffies, timeout))
				break;

			hmm_range.notifier_seq =
				mmu_interval_read_begin(notifier);
			continue;
		}
		break;
	}
	...
	for (i = 0, j = 0; i < npages; ++j) {
		...
		dma_map_page(...)
		...
	}

	grab_notifier_lock();
	/* the seq check under the notifier lock decides whether the
	 * addresses faulted above are still valid
	 */
	if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq))
		goto retry;
	restart_queues();
	drop_notifier_lock();
	...
Now hmm_range.notifier_seq indicates if your DMA addresses are still valid or not after you grabbed the notifier lock.
The problem is that hmm_range works only on a single range/sequence combination, so when you do multiple calls to hmm_range_fault() for scattered VAs it can easily be that one call invalidates the ranges of another call.
So as long as you only have a few hundred hmm_ranges for your userptrs that kind of works, but it doesn't scale up into the thousands of different VA addresses you get for scattered handling.
That's why hmm_range_fault() needs to be modified to handle an array of VA addresses instead of just an A..B range.
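Just to illustrate what I mean, a purely hypothetical sketch (nothing like
this exists in the kernel today, all of the names below are made up): one
notifier_seq covering all scattered sub-ranges, so a single
mmu_interval_read_retry() check validates the whole batch.

/* Hypothetical illustration only, no such interface exists today. */
struct hmm_va_span {
	unsigned long start;
	unsigned long end;
};

struct hmm_range_batch {
	struct mmu_interval_notifier *notifier;
	unsigned long notifier_seq;		/* one seq for the complete batch */
	unsigned int nr_spans;
	const struct hmm_va_span *spans;	/* scattered, non-contiguous VAs */
	unsigned long *hmm_pfns;		/* concatenated result PFNs */
	unsigned long default_flags;
	unsigned long pfn_flags_mask;
};

int hmm_range_fault_batch(struct hmm_range_batch *batch);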
Regards,
Christian.
>
> On 2026/2/9 20:59, Christian König wrote:
>> On 2/9/26 13:52, Honglei Huang wrote:
>>> DRM GPU SVM does use hmm_range_fault(), see drm_gpusvm_get_pages()
>>
>> I'm not sure what you are talking about, drm_gpusvm_get_pages() only supports a single range as well and not scatter gather of VA addresses.
>>
>> As far as I can see that doesn't help the slightest.
>>
>>> My implementation follows the same pattern. The detailed comparison
>>> of invalidation path was provided in the second half of my previous mail.
>>
>> Yeah and as I said that is not very valuable because it doesn't solves the sequence problem.
>>
>> As far as I can see the approach you try here is a clear NAK from my side.
>>
>> Regards,
>> Christian.
>>
>>>
>>> On 2026/2/9 18:16, Christian König wrote:
>>>> On 2/9/26 07:14, Honglei Huang wrote:
>>>>>
>>>>> I've reworked the implementation in v4. The fix is actually inspired
>>>>> by the DRM GPU SVM framework (drivers/gpu/drm/drm_gpusvm.c).
>>>>>
>>>>> DRM GPU SVM uses wide notifiers (recommended 512M or larger) to track
>>>>> multiple user virtual address ranges under a single mmu_interval_notifier,
>>>>> and these ranges can be non-contiguous which is essentially the same
>>>>> problem that batch userptr needs to solve: one BO backed by multiple
>>>>> non-contiguous CPU VA ranges sharing one notifier.
>>>>
>>>> That still doesn't solve the sequencing problem.
>>>>
>>>> As far as I can see you can't use hmm_range_fault with this approach or it would just not be very valuable.
>>>>
>>>> So how should that work with your patch set?
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> The wide notifier is created in drm_gpusvm_notifier_alloc:
>>>>> notifier->itree.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
>>>>> notifier->itree.last = ALIGN(fault_addr + 1, gpusvm->notifier_size) - 1;
>>>>> The Xe driver passes
>>>>> xe_modparam.svm_notifier_size * SZ_1M in xe_svm_init
>>>>> as the notifier_size, so one notifier can cover many of MB of VA space
>>>>> containing multiple non-contiguous ranges.
>>>>>
>>>>> And DRM GPU SVM solves the per-range validity problem with flag-based
>>>>> validation instead of seq-based validation in:
>>>>> - drm_gpusvm_pages_valid() checks
>>>>> flags.has_dma_mapping
>>>>> not notifier_seq. The comment explicitly states:
>>>>> "This is akin to a notifier seqno check in the HMM documentation
>>>>> but due to wider notifiers (i.e., notifiers which span multiple
>>>>> ranges) this function is required for finer grained checking"
>>>>> - __drm_gpusvm_unmap_pages() clears
>>>>> flags.has_dma_mapping = false under notifier_lock
>>>>> - drm_gpusvm_get_pages() sets
>>>>> flags.has_dma_mapping = true under notifier_lock
>>>>> I adopted the same approach.
>>>>>
>>>>> DRM GPU SVM:
>>>>> drm_gpusvm_notifier_invalidate()
>>>>> down_write(&gpusvm->notifier_lock);
>>>>> mmu_interval_set_seq(mni, cur_seq);
>>>>> gpusvm->ops->invalidate()
>>>>> -> xe_svm_invalidate()
>>>>> drm_gpusvm_for_each_range()
>>>>> -> __drm_gpusvm_unmap_pages()
>>>>> WRITE_ONCE(flags.has_dma_mapping = false); // clear flag
>>>>> up_write(&gpusvm->notifier_lock);
>>>>>
>>>>> KFD batch userptr:
>>>>> amdgpu_amdkfd_evict_userptr_batch()
>>>>> mutex_lock(&process_info->notifier_lock);
>>>>> mmu_interval_set_seq(mni, cur_seq);
>>>>> discard_invalid_ranges()
>>>>> interval_tree_iter_first/next()
>>>>> range_info->valid = false; // clear flag
>>>>> mutex_unlock(&process_info->notifier_lock);
>>>>>
>>>>> Both implementations:
>>>>> - Acquire notifier_lock FIRST, before any flag changes
>>>>> - Call mmu_interval_set_seq() under the lock
>>>>> - Use interval tree to find affected ranges within the wide notifier
>>>>> - Mark per-range flag as invalid/valid under the lock
>>>>>
>>>>> The page fault path and final validation path also follow the same
>>>>> pattern as DRM GPU SVM: fault outside the lock, set/check per-range
>>>>> flag under the lock.
>>>>>
>>>>> Regards,
>>>>> Honglei
>>>>>
>>>>>
>>>>> On 2026/2/6 21:56, Christian König wrote:
>>>>>> On 2/6/26 07:25, Honglei Huang wrote:
>>>>>>> From: Honglei Huang <honghuan@amd.com>
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> This is v3 of the patch series to support allocating multiple non-contiguous
>>>>>>> CPU virtual address ranges that map to a single contiguous GPU virtual address.
>>>>>>>
>>>>>>> v3:
>>>>>>> 1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>>>>>>> - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>>>>>>
>>>>>> That is most likely not the best approach, but Felix or Philip need to comment here since I don't know such IOCTLs well either.
>>>>>>
>>>>>>> - When flag is set, mmap_offset field points to range array
>>>>>>> - Minimal API surface change
>>>>>>
>>>>>> Why range of VA space for each entry?
>>>>>>
>>>>>>> 2. Improved MMU notifier handling:
>>>>>>> - Single mmu_interval_notifier covering the VA span [va_min, va_max]
>>>>>>> - Interval tree for efficient lookup of affected ranges during invalidation
>>>>>>> - Avoids per-range notifier overhead mentioned in v2 review
>>>>>>
>>>>>> That won't work unless you also modify hmm_range_fault() to take multiple VA addresses (or ranges) at the same time.
>>>>>>
>>>>>> The problem is that we must rely on hmm_range.notifier_seq to detect changes to the page tables in question, but that in turn works only if you have one hmm_range structure and not multiple.
>>>>>>
>>>>>> What might work is doing an XOR or CRC over all hmm_range.notifier_seq you have, but that is a bit flaky.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>> 3. Better code organization: Split into 8 focused patches for easier review
>>>>>>>
>>>>>>> v2:
>>>>>>> - Each CPU VA range gets its own mmu_interval_notifier for invalidation
>>>>>>> - All ranges validated together and mapped to contiguous GPU VA
>>>>>>> - Single kgd_mem object with array of user_range_info structures
>>>>>>> - Unified eviction/restore path for all ranges in a batch
>>>>>>>
>>>>>>> Current Implementation Approach
>>>>>>> ===============================
>>>>>>>
>>>>>>> This series implements a practical solution within existing kernel constraints:
>>>>>>>
>>>>>>> 1. Single MMU notifier for VA span: Register one notifier covering the
>>>>>>> entire range from lowest to highest address in the batch
>>>>>>>
>>>>>>> 2. Interval tree filtering: Use interval tree to efficiently identify
>>>>>>> which specific ranges are affected during invalidation callbacks,
>>>>>>> avoiding unnecessary processing for unrelated address changes
>>>>>>>
>>>>>>> 3. Unified eviction/restore: All ranges in a batch share eviction and
>>>>>>> restore paths, maintaining consistency with existing userptr handling
>>>>>>>
>>>>>>> Patch Series Overview
>>>>>>> =====================
>>>>>>>
>>>>>>> Patch 1/8: Add userptr batch allocation UAPI structures
>>>>>>> - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
>>>>>>> - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data structures
>>>>>>>
>>>>>>> Patch 2/8: Add user_range_info infrastructure to kgd_mem
>>>>>>> - user_range_info structure for per-range tracking
>>>>>>> - Fields for batch allocation in kgd_mem
>>>>>>>
>>>>>>> Patch 3/8: Implement interval tree for userptr ranges
>>>>>>> - Interval tree for efficient range lookup during invalidation
>>>>>>> - mark_invalid_ranges() function
>>>>>>>
>>>>>>> Patch 4/8: Add batch MMU notifier support
>>>>>>> - Single notifier for entire VA span
>>>>>>> - Invalidation callback using interval tree filtering
>>>>>>>
>>>>>>> Patch 5/8: Implement batch userptr page management
>>>>>>> - get_user_pages_batch() and set_user_pages_batch()
>>>>>>> - Per-range page array management
>>>>>>>
>>>>>>> Patch 6/8: Add batch allocation function and export API
>>>>>>> - init_user_pages_batch() main initialization
>>>>>>> - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point
>>>>>>>
>>>>>>> Patch 7/8: Unify userptr cleanup and update paths
>>>>>>> - Shared eviction/restore handling for batch allocations
>>>>>>> - Integration with existing userptr validation flows
>>>>>>>
>>>>>>> Patch 8/8: Wire up batch allocation in ioctl handler
>>>>>>> - Input validation and range array parsing
>>>>>>> - Integration with existing alloc_memory_of_gpu path
>>>>>>>
>>>>>>> Testing
>>>>>>> =======
>>>>>>>
>>>>>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>>>>>> - Various allocation sizes (4KB to 1G+ per range)
>>>>>>> - Memory pressure scenarios and eviction/restore cycles
>>>>>>> - OpenCL CTS and HIP catch tests in KVM guest environment
>>>>>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>>>>>>> - Small LLM inference (3B-7B models)
>>>>>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>>>>>> - Performance improvement: 2x-2.4x faster than userspace approach
>>>>>>>
>>>>>>> Thank you for your review and feedback.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Honglei Huang
>>>>>>>
>>>>>>> Honglei Huang (8):
>>>>>>> drm/amdkfd: Add userptr batch allocation UAPI structures
>>>>>>> drm/amdkfd: Add user_range_info infrastructure to kgd_mem
>>>>>>> drm/amdkfd: Implement interval tree for userptr ranges
>>>>>>> drm/amdkfd: Add batch MMU notifier support
>>>>>>> drm/amdkfd: Implement batch userptr page management
>>>>>>> drm/amdkfd: Add batch allocation function and export API
>>>>>>> drm/amdkfd: Unify userptr cleanup and update paths
>>>>>>> drm/amdkfd: Wire up batch allocation in ioctl handler
>>>>>>>
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 23 +
>>>>>>> .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 539 +++++++++++++++++-
>>>>>>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 128 ++++-
>>>>>>> include/uapi/linux/kfd_ioctl.h | 31 +-
>>>>>>> 4 files changed, 697 insertions(+), 24 deletions(-)
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support
2026-02-09 13:27 ` Christian König
@ 2026-02-09 14:16 ` Honglei Huang
2026-02-09 14:25 ` Christian König
0 siblings, 1 reply; 22+ messages in thread
From: Honglei Huang @ 2026-02-09 14:16 UTC (permalink / raw)
To: Christian König
Cc: Felix.Kuehling, Philip.Yang, Ray.Huang, alexander.deucher,
dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm
The case you described, one hmm_range_fault() invalidating another's
seq under the same notifier, is already handled in the implementation.
For example, suppose ranges A, B and C share one notifier:
1. hmm_range_fault(A) succeeds, seq_A recorded
2. External invalidation occurs, triggers callback:
     mutex_lock(notifier_lock)
     → mmu_interval_set_seq()
     → range_A->valid = false
     → mem->invalid++
     mutex_unlock(notifier_lock)
3. hmm_range_fault(B) succeeds
4. Commit phase:
     mutex_lock(notifier_lock)
     → check mem->invalid != saved_invalid
     → return -EAGAIN, retry the entire batch
     mutex_unlock(notifier_lock)
All concurrent invalidations are caught by the mem->invalid counter.
Additionally, amdgpu_ttm_tt_get_user_pages_done() in
confirm_valid_user_pages_locked
performs a per-range mmu_interval_read_retry() as a final safety check.
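In rough C, the commit-phase check boils down to something like this (a
sketch of the logic described above, not the literal driver code;
mark_all_ranges_valid() is a placeholder for the per-range valid-flag
updates done under the lock):

static int commit_batch_locked_check(struct kgd_mem *mem,
				     struct mutex *notifier_lock,
				     unsigned int saved_invalid)
{
	mutex_lock(notifier_lock);

	/* Any invalidation between the per-range faults and this point has
	 * incremented mem->invalid under the same lock, so it cannot be missed.
	 */
	if (mem->invalid != saved_invalid) {
		mutex_unlock(notifier_lock);
		return -EAGAIN;		/* retry the whole batch */
	}

	mark_all_ranges_valid(mem);	/* placeholder helper */

	mutex_unlock(notifier_lock);
	return 0;
}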
DRM GPU SVM uses the same approach: drm_gpusvm_get_pages() also calls
hmm_range_fault() per range independently; there is no array version
of hmm_range_fault() in DRM GPU SVM either. If you consider this approach
unworkable, then DRM GPU SVM would be unworkable too, yet it has been
accepted upstream.
The number of batch ranges is controllable. And even if it
scales to thousands, DRM GPU SVM faces exactly the same situation:
it does not need an array version of hmm_range_fault() either, which
shows this is a correctness question, not a performance one. For
correctness, I believe DRM GPU SVM already demonstrates that the
approach is sound.
For performance, I have tested with thousands of ranges present:
performance reaches 80%-95% of the native driver, and all OpenCL
and ROCr test suites pass with no correctness issues.
Here is how DRM GPU SVM handles correctness with multiple ranges
under one wide notifier doing per-range hmm_range_fault:
Invalidation: drm_gpusvm_notifier_invalidate()
- Acquires notifier_lock
- Calls mmu_interval_set_seq()
- Iterates affected ranges via driver callback (xe_svm_invalidate)
- Clears has_dma_mapping = false for each affected range (under lock)
- Releases notifier_lock
Fault: drm_gpusvm_get_pages() (called per-range independently)
- mmu_interval_read_begin() to get seq
- hmm_range_fault() outside lock
- Acquires notifier_lock
- mmu_interval_read_retry() → if stale, release lock and retry
- DMA map pages + set has_dma_mapping = true (under lock)
- Releases notifier_lock
Validation: drm_gpusvm_pages_valid()
- Checks has_dma_mapping flag (under lock), NOT seq
If invalidation occurs between two per-range faults, the flag is
cleared under lock, and either mmu_interval_read_retry catches it
in the current fault, or drm_gpusvm_pages_valid() catches it at
validation time. No stale pages are ever committed.
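The shape of that flag-based check, reduced to its essence (not the literal
drm_gpusvm code, and the rwsem is shown as a plain mutex for brevity):
validity is decided per range by a flag that is only ever written under the
notifier lock, instead of by one seqno for the whole span.

static bool range_pages_valid(struct mutex *notifier_lock, bool *has_dma_mapping)
{
	bool valid;

	mutex_lock(notifier_lock);
	valid = *has_dma_mapping;	/* cleared by the invalidation path */
	mutex_unlock(notifier_lock);

	return valid;
}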
KFD batch userptr uses the same three-step pattern:
Invalidation: amdgpu_amdkfd_evict_userptr_batch()
- Acquires notifier_lock
- Calls mmu_interval_set_seq()
- Iterates affected ranges via interval_tree
- Sets range->valid = false for each affected range (under lock)
- Increments mem->invalid (under lock)
- Releases notifier_lock
Fault: update_invalid_user_pages()
- Per-range hmm_range_fault() outside lock
- Acquires notifier_lock
- Checks mem->invalid != saved_invalid → if changed, -EAGAIN retry
- Sets range->valid = true for faulted ranges (under lock)
- Releases notifier_lock
Validation: valid_user_pages_batch()
- Checks range->valid flag
- Calls amdgpu_ttm_tt_get_user_pages_done() (mmu_interval_read_retry)
The logic is equivalent as far as I can see.
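For completeness, a sketch of the wide-notifier invalidation side as
described above (the batch_bo/batch_range structures and field names are
illustrative, only the lock/seq/interval-tree pattern matters):

struct batch_range {
	struct interval_tree_node it;	/* [start, last] of one CPU VA range */
	bool valid;
};

struct batch_bo {
	struct mmu_interval_notifier notifier;	/* one wide notifier */
	struct mutex notifier_lock;
	struct rb_root_cached ranges;		/* interval tree of batch_range */
	unsigned int invalid;			/* batch-wide counter */
};

static bool batch_invalidate(struct mmu_interval_notifier *mni,
			     const struct mmu_notifier_range *range,
			     unsigned long cur_seq)
{
	struct batch_bo *bo = container_of(mni, struct batch_bo, notifier);
	struct interval_tree_node *node;

	if (!mmu_notifier_range_blockable(range))
		return false;

	mutex_lock(&bo->notifier_lock);
	mmu_interval_set_seq(mni, cur_seq);

	/* Only ranges overlapping the invalidated span are marked; unrelated
	 * ranges under the same wide notifier stay valid.
	 */
	for (node = interval_tree_iter_first(&bo->ranges, range->start,
					     range->end - 1);
	     node;
	     node = interval_tree_iter_next(node, range->start,
					    range->end - 1)) {
		struct batch_range *r = container_of(node, struct batch_range, it);

		r->valid = false;
		bo->invalid++;
	}
	mutex_unlock(&bo->notifier_lock);

	return true;
}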
Regards,
Honglei
On 2026/2/9 21:27, Christian König wrote:
> On 2/9/26 14:11, Honglei Huang wrote:
>>
>> So the drm svm is also a NAK?
>>
>> These codes have passed local testing, opencl and rocr, I also provided a detailed code path and analysis.
>> You only said the conclusion without providing any reasons or evidence. Your statement has no justifiable reasons and is difficult to convince
>> so far.
>
> That sounds like you don't understand what the issue here is, I will try to explain this once more on pseudo-code.
>
> Page tables are updated without holding a lock, so when you want to grab physical addresses from the then you need to use an opportunistically retry based approach to make sure that the data you got is still valid.
>
> In other words something like this here is needed:
>
> retry:
> hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
> hmm_range.hmm_pfns = kvmalloc_array(npages, ...);
> ...
> while (true) {
> mmap_read_lock(mm);
> err = hmm_range_fault(&hmm_range);
> mmap_read_unlock(mm);
>
> if (err == -EBUSY) {
> if (time_after(jiffies, timeout))
> break;
>
> hmm_range.notifier_seq =
> mmu_interval_read_begin(notifier);
> continue;
> }
> break;
> }
> ...
> for (i = 0, j = 0; i < npages; ++j) {
> ...
> dma_map_page(...)
> ...
> grab_notifier_lock();
> if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq))
> goto retry;
> restart_queues();
> drop_notifier_lock();
> ...
>
> Now hmm_range.notifier_seq indicates if your DMA addresses are still valid or not after you grabbed the notifier lock.
>
> The problem is that hmm_range works only on a single range/sequence combination, so when you do multiple calls to hmm_range_fault() for scattered VA is can easily be that one call invalidates the ranges of another call.
>
> So as long as you only have a few hundred hmm_ranges for your userptrs that kind of works, but it doesn't scale up into the thousands of different VA addresses you get for scattered handling.
>
> That's why hmm_range_fault needs to be modified to handle an array of VA addresses instead of just a A..B range.
>
> Regards,
> Christian.
>
>
>>
>> On 2026/2/9 20:59, Christian König wrote:
>>> On 2/9/26 13:52, Honglei Huang wrote:
>>>> DRM GPU SVM does use hmm_range_fault(), see drm_gpusvm_get_pages()
>>>
>>> I'm not sure what you are talking about, drm_gpusvm_get_pages() only supports a single range as well and not scatter gather of VA addresses.
>>>
>>> As far as I can see that doesn't help the slightest.
>>>
>>>> My implementation follows the same pattern. The detailed comparison
>>>> of invalidation path was provided in the second half of my previous mail.
>>>
>>> Yeah and as I said that is not very valuable because it doesn't solves the sequence problem.
>>>
>>> As far as I can see the approach you try here is a clear NAK from my side.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> On 2026/2/9 18:16, Christian König wrote:
>>>>> On 2/9/26 07:14, Honglei Huang wrote:
>>>>>>
>>>>>> I've reworked the implementation in v4. The fix is actually inspired
>>>>>> by the DRM GPU SVM framework (drivers/gpu/drm/drm_gpusvm.c).
>>>>>>
>>>>>> DRM GPU SVM uses wide notifiers (recommended 512M or larger) to track
>>>>>> multiple user virtual address ranges under a single mmu_interval_notifier,
>>>>>> and these ranges can be non-contiguous which is essentially the same
>>>>>> problem that batch userptr needs to solve: one BO backed by multiple
>>>>>> non-contiguous CPU VA ranges sharing one notifier.
>>>>>
>>>>> That still doesn't solve the sequencing problem.
>>>>>
>>>>> As far as I can see you can't use hmm_range_fault with this approach or it would just not be very valuable.
>>>>>
>>>>> So how should that work with your patch set?
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>> The wide notifier is created in drm_gpusvm_notifier_alloc:
>>>>>> notifier->itree.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
>>>>>> notifier->itree.last = ALIGN(fault_addr + 1, gpusvm->notifier_size) - 1;
>>>>>> The Xe driver passes
>>>>>> xe_modparam.svm_notifier_size * SZ_1M in xe_svm_init
>>>>>> as the notifier_size, so one notifier can cover many of MB of VA space
>>>>>> containing multiple non-contiguous ranges.
>>>>>>
>>>>>> And DRM GPU SVM solves the per-range validity problem with flag-based
>>>>>> validation instead of seq-based validation in:
>>>>>> - drm_gpusvm_pages_valid() checks
>>>>>> flags.has_dma_mapping
>>>>>> not notifier_seq. The comment explicitly states:
>>>>>> "This is akin to a notifier seqno check in the HMM documentation
>>>>>> but due to wider notifiers (i.e., notifiers which span multiple
>>>>>> ranges) this function is required for finer grained checking"
>>>>>> - __drm_gpusvm_unmap_pages() clears
>>>>>> flags.has_dma_mapping = false under notifier_lock
>>>>>> - drm_gpusvm_get_pages() sets
>>>>>> flags.has_dma_mapping = true under notifier_lock
>>>>>> I adopted the same approach.
>>>>>>
>>>>>> DRM GPU SVM:
>>>>>> drm_gpusvm_notifier_invalidate()
>>>>>> down_write(&gpusvm->notifier_lock);
>>>>>> mmu_interval_set_seq(mni, cur_seq);
>>>>>> gpusvm->ops->invalidate()
>>>>>> -> xe_svm_invalidate()
>>>>>> drm_gpusvm_for_each_range()
>>>>>> -> __drm_gpusvm_unmap_pages()
>>>>>> WRITE_ONCE(flags.has_dma_mapping = false); // clear flag
>>>>>> up_write(&gpusvm->notifier_lock);
>>>>>>
>>>>>> KFD batch userptr:
>>>>>> amdgpu_amdkfd_evict_userptr_batch()
>>>>>> mutex_lock(&process_info->notifier_lock);
>>>>>> mmu_interval_set_seq(mni, cur_seq);
>>>>>> discard_invalid_ranges()
>>>>>> interval_tree_iter_first/next()
>>>>>> range_info->valid = false; // clear flag
>>>>>> mutex_unlock(&process_info->notifier_lock);
>>>>>>
>>>>>> Both implementations:
>>>>>> - Acquire notifier_lock FIRST, before any flag changes
>>>>>> - Call mmu_interval_set_seq() under the lock
>>>>>> - Use interval tree to find affected ranges within the wide notifier
>>>>>> - Mark per-range flag as invalid/valid under the lock
>>>>>>
>>>>>> The page fault path and final validation path also follow the same
>>>>>> pattern as DRM GPU SVM: fault outside the lock, set/check per-range
>>>>>> flag under the lock.
>>>>>>
>>>>>> Regards,
>>>>>> Honglei
>>>>>>
>>>>>>
>>>>>> On 2026/2/6 21:56, Christian König wrote:
>>>>>>> On 2/6/26 07:25, Honglei Huang wrote:
>>>>>>>> From: Honglei Huang <honghuan@amd.com>
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> This is v3 of the patch series to support allocating multiple non-contiguous
>>>>>>>> CPU virtual address ranges that map to a single contiguous GPU virtual address.
>>>>>>>>
>>>>>>>> v3:
>>>>>>>> 1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>>>>>>>> - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>>>>>>>
>>>>>>> That is most likely not the best approach, but Felix or Philip need to comment here since I don't know such IOCTLs well either.
>>>>>>>
>>>>>>>> - When flag is set, mmap_offset field points to range array
>>>>>>>> - Minimal API surface change
>>>>>>>
>>>>>>> Why range of VA space for each entry?
>>>>>>>
>>>>>>>> 2. Improved MMU notifier handling:
>>>>>>>> - Single mmu_interval_notifier covering the VA span [va_min, va_max]
>>>>>>>> - Interval tree for efficient lookup of affected ranges during invalidation
>>>>>>>> - Avoids per-range notifier overhead mentioned in v2 review
>>>>>>>
>>>>>>> That won't work unless you also modify hmm_range_fault() to take multiple VA addresses (or ranges) at the same time.
>>>>>>>
>>>>>>> The problem is that we must rely on hmm_range.notifier_seq to detect changes to the page tables in question, but that in turn works only if you have one hmm_range structure and not multiple.
>>>>>>>
>>>>>>> What might work is doing an XOR or CRC over all hmm_range.notifier_seq you have, but that is a bit flaky.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>> 3. Better code organization: Split into 8 focused patches for easier review
>>>>>>>>
>>>>>>>> v2:
>>>>>>>> - Each CPU VA range gets its own mmu_interval_notifier for invalidation
>>>>>>>> - All ranges validated together and mapped to contiguous GPU VA
>>>>>>>> - Single kgd_mem object with array of user_range_info structures
>>>>>>>> - Unified eviction/restore path for all ranges in a batch
>>>>>>>>
>>>>>>>> Current Implementation Approach
>>>>>>>> ===============================
>>>>>>>>
>>>>>>>> This series implements a practical solution within existing kernel constraints:
>>>>>>>>
>>>>>>>> 1. Single MMU notifier for VA span: Register one notifier covering the
>>>>>>>> entire range from lowest to highest address in the batch
>>>>>>>>
>>>>>>>> 2. Interval tree filtering: Use interval tree to efficiently identify
>>>>>>>> which specific ranges are affected during invalidation callbacks,
>>>>>>>> avoiding unnecessary processing for unrelated address changes
>>>>>>>>
>>>>>>>> 3. Unified eviction/restore: All ranges in a batch share eviction and
>>>>>>>> restore paths, maintaining consistency with existing userptr handling
>>>>>>>>
>>>>>>>> Patch Series Overview
>>>>>>>> =====================
>>>>>>>>
>>>>>>>> Patch 1/8: Add userptr batch allocation UAPI structures
>>>>>>>> - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
>>>>>>>> - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data structures
>>>>>>>>
>>>>>>>> Patch 2/8: Add user_range_info infrastructure to kgd_mem
>>>>>>>> - user_range_info structure for per-range tracking
>>>>>>>> - Fields for batch allocation in kgd_mem
>>>>>>>>
>>>>>>>> Patch 3/8: Implement interval tree for userptr ranges
>>>>>>>> - Interval tree for efficient range lookup during invalidation
>>>>>>>> - mark_invalid_ranges() function
>>>>>>>>
>>>>>>>> Patch 4/8: Add batch MMU notifier support
>>>>>>>> - Single notifier for entire VA span
>>>>>>>> - Invalidation callback using interval tree filtering
>>>>>>>>
>>>>>>>> Patch 5/8: Implement batch userptr page management
>>>>>>>> - get_user_pages_batch() and set_user_pages_batch()
>>>>>>>> - Per-range page array management
>>>>>>>>
>>>>>>>> Patch 6/8: Add batch allocation function and export API
>>>>>>>> - init_user_pages_batch() main initialization
>>>>>>>> - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point
>>>>>>>>
>>>>>>>> Patch 7/8: Unify userptr cleanup and update paths
>>>>>>>> - Shared eviction/restore handling for batch allocations
>>>>>>>> - Integration with existing userptr validation flows
>>>>>>>>
>>>>>>>> Patch 8/8: Wire up batch allocation in ioctl handler
>>>>>>>> - Input validation and range array parsing
>>>>>>>> - Integration with existing alloc_memory_of_gpu path
>>>>>>>>
>>>>>>>> Testing
>>>>>>>> =======
>>>>>>>>
>>>>>>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>>>>>>> - Various allocation sizes (4KB to 1G+ per range)
>>>>>>>> - Memory pressure scenarios and eviction/restore cycles
>>>>>>>> - OpenCL CTS and HIP catch tests in KVM guest environment
>>>>>>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>>>>>>>> - Small LLM inference (3B-7B models)
>>>>>>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>>>>>>> - Performance improvement: 2x-2.4x faster than userspace approach
>>>>>>>>
>>>>>>>> Thank you for your review and feedback.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Honglei Huang
>>>>>>>>
>>>>>>>> Honglei Huang (8):
>>>>>>>> drm/amdkfd: Add userptr batch allocation UAPI structures
>>>>>>>> drm/amdkfd: Add user_range_info infrastructure to kgd_mem
>>>>>>>> drm/amdkfd: Implement interval tree for userptr ranges
>>>>>>>> drm/amdkfd: Add batch MMU notifier support
>>>>>>>> drm/amdkfd: Implement batch userptr page management
>>>>>>>> drm/amdkfd: Add batch allocation function and export API
>>>>>>>> drm/amdkfd: Unify userptr cleanup and update paths
>>>>>>>> drm/amdkfd: Wire up batch allocation in ioctl handler
>>>>>>>>
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 23 +
>>>>>>>> .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 539 +++++++++++++++++-
>>>>>>>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 128 ++++-
>>>>>>>> include/uapi/linux/kfd_ioctl.h | 31 +-
>>>>>>>> 4 files changed, 697 insertions(+), 24 deletions(-)
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 22+ messages in thread* Re: [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support
2026-02-09 14:16 ` Honglei Huang
@ 2026-02-09 14:25 ` Christian König
2026-02-09 14:44 ` Honglei Huang
0 siblings, 1 reply; 22+ messages in thread
From: Christian König @ 2026-02-09 14:25 UTC (permalink / raw)
To: Honglei Huang
Cc: Felix.Kuehling, Philip.Yang, Ray.Huang, alexander.deucher,
dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm
On 2/9/26 15:16, Honglei Huang wrote:
> The case you described, one hmm_range_fault() invalidating another's
> seq under the same notifier, is already handled in the implementation.
>
> Example: suppose ranges A, B and C share one notifier:
>
> 1. hmm_range_fault(A) succeeds, seq_A recorded
> 2. External invalidation occurs, triggers callback:
> mutex_lock(notifier_lock)
> → mmu_interval_set_seq()
> → range_A->valid = false
> → mem->invalid++
> mutex_unlock(notifier_lock)
> 3. hmm_range_fault(B) succeeds
> 4. Commit phase:
> mutex_lock(notifier_lock)
> → check mem->invalid != saved_invalid
> → return -EAGAIN, retry the entire batch
> mutex_unlock(notifier_lock)
>
> All concurrent invalidations are caught by the mem->invalid counter.
> Additionally, amdgpu_ttm_tt_get_user_pages_done() in confirm_valid_user_pages_locked
> performs a per-range mmu_interval_read_retry() as a final safety check.
>
> DRM GPU SVM uses the same approach: drm_gpusvm_get_pages() also calls
> hmm_range_fault() per-range independently; there is no array version
> of hmm_range_fault in DRM GPU SVM either. If you consider this approach
> unworkable, then DRM GPU SVM would be unworkable too, yet it has been
> accepted upstream.
>
> The number of batch ranges is controllable. And even if it
> scales to thousands, DRM GPU SVM faces exactly the same situation:
> it does not need an array version of hmm_range_fault either, which
> shows this is a correctness question, not a performance one. For
> correctness, I believe DRM GPU SVM already demonstrates the approach
> is ok.
Well yes, GPU SVM would have exactly the same problems. But it also doesn't have an interface for creating bulk userptrs.
The implementation is simply not made for this use case, and as far as I know no current upstream implementation is.
> For performance, I have tested with thousands of ranges present:
> performance reaches 80%~95% of the native driver, and all OpenCL
> and ROCr test suites pass with no correctness issues.
Testing can only falsify a system and not verify it.
> Here is how DRM GPU SVM handles correctness with multiple ranges
> under one wide notifier doing per-range hmm_range_fault:
>
> Invalidation: drm_gpusvm_notifier_invalidate()
> - Acquires notifier_lock
> - Calls mmu_interval_set_seq()
> - Iterates affected ranges via driver callback (xe_svm_invalidate)
> - Clears has_dma_mapping = false for each affected range (under lock)
> - Releases notifier_lock
>
> Fault: drm_gpusvm_get_pages() (called per-range independently)
> - mmu_interval_read_begin() to get seq
> - hmm_range_fault() outside lock
> - Acquires notifier_lock
> - mmu_interval_read_retry() → if stale, release lock and retry
> - DMA map pages + set has_dma_mapping = true (under lock)
> - Releases notifier_lock
>
> Validation: drm_gpusvm_pages_valid()
> - Checks has_dma_mapping flag (under lock), NOT seq
>
> If invalidation occurs between two per-range faults, the flag is
> cleared under lock, and either mmu_interval_read_retry catches it
> in the current fault, or drm_gpusvm_pages_valid() catches it at
> validation time. No stale pages are ever committed.
>
> KFD batch userptr uses the same three-step pattern:
>
> Invalidation: amdgpu_amdkfd_evict_userptr_batch()
> - Acquires notifier_lock
> - Calls mmu_interval_set_seq()
> - Iterates affected ranges via interval_tree
> - Sets range->valid = false for each affected range (under lock)
> - Increments mem->invalid (under lock)
> - Releases notifier_lock
>
> Fault: update_invalid_user_pages()
> - Per-range hmm_range_fault() outside lock
And here the idea falls apart. Each hmm_range_fault() can invalidate the other ranges while faulting them in.
That is not fundamentally solvable, but moving the handling further into hmm_range_fault() makes it much less likely that something goes wrong.
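
To spell out where the window is, here is roughly what a per-range
fault loop looks like; ranges[] and nranges are just placeholders, not
code taken from this series:

	for (i = 0; i < nranges; i++) {
		struct hmm_range *r = &ranges[i].hmm_range;

		/*
		 * Each iteration gets its own notifier_seq. Faulting
		 * range i can trigger reclaim and invalidate pages of
		 * ranges[0..i-1] that were already faulted in this
		 * pass; nothing in this loop notices, only a later
		 * batch-level check can.
		 */
		r->notifier_seq = mmu_interval_read_begin(r->notifier);
		mmap_read_lock(mm);
		ret = hmm_range_fault(r);
		mmap_read_unlock(mm);
		if (ret)
			break;
	}
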
So once more: as long as this still uses this hacky approach, I will clearly reject this implementation.
Regards,
Christian.
> - Acquires notifier_lock
> - Checks mem->invalid != saved_invalid → if changed, -EAGAIN retry
> - Sets range->valid = true for faulted ranges (under lock)
> - Releases notifier_lock
>
> Validation: valid_user_pages_batch()
> - Checks range->valid flag
> - Calls amdgpu_ttm_tt_get_user_pages_done() (mmu_interval_read_retry)
>
> The logic is equivalent as far as I can see.
>
> Regards,
> Honglei
>
>
>
> On 2026/2/9 21:27, Christian König wrote:
>> On 2/9/26 14:11, Honglei Huang wrote:
>>>
>>> So the drm svm is also a NAK?
>>>
>>> This code has passed local testing with OpenCL and ROCr, and I also provided a detailed code path and analysis.
>>> You only stated the conclusion without giving any reasons or evidence, so your statement is difficult to find convincing
>>> so far.
>>
>> That sounds like you don't understand what the issue here is, so I will try to explain it once more with pseudo-code.
>>
>> Page tables are updated without holding a lock, so when you want to grab physical addresses from them you need to use an opportunistic, retry-based approach to make sure that the data you got is still valid.
>>
>> In other words something like this here is needed:
>>
>> retry:
>> hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
>> hmm_range.hmm_pfns = kvmalloc_array(npages, ...);
>> ...
>> while (true) {
>> mmap_read_lock(mm);
>> err = hmm_range_fault(&hmm_range);
>> mmap_read_unlock(mm);
>>
>> if (err == -EBUSY) {
>> if (time_after(jiffies, timeout))
>> break;
>>
>> hmm_range.notifier_seq =
>> mmu_interval_read_begin(notifier);
>> continue;
>> }
>> break;
>> }
>> ...
>> for (i = 0, j = 0; i < npages; ++j) {
>> ...
>> dma_map_page(...)
>> ...
>> grab_notifier_lock();
>> if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq))
>> goto retry;
>> restart_queues();
>> drop_notifier_lock();
>> ...
>>
>> Now hmm_range.notifier_seq indicates if your DMA addresses are still valid or not after you grabbed the notifier lock.
>>
>> The problem is that hmm_range works only on a single range/sequence combination, so when you do multiple calls to hmm_range_fault() for scattered VAs it can easily be that one call invalidates the ranges of another call.
>>
>> So as long as you only have a few hundred hmm_ranges for your userptrs that kind of works, but it doesn't scale up into the thousands of different VA addresses you get for scattered handling.
>>
>> That's why hmm_range_fault needs to be modified to handle an array of VA addresses instead of just an A..B range.
>>
>> Regards,
>> Christian.
>>
>>
>>>
>>> On 2026/2/9 20:59, Christian König wrote:
>>>> On 2/9/26 13:52, Honglei Huang wrote:
>>>>> DRM GPU SVM does use hmm_range_fault(), see drm_gpusvm_get_pages()
>>>>
>>>> I'm not sure what you are talking about, drm_gpusvm_get_pages() only supports a single range as well and not scatter gather of VA addresses.
>>>>
>>>> As far as I can see that doesn't help the slightest.
>>>>
>>>>> My implementation follows the same pattern. The detailed comparison
>>>>> of invalidation path was provided in the second half of my previous mail.
>>>>
>>>> Yeah and as I said that is not very valuable because it doesn't solves the sequence problem.
>>>>
>>>> As far as I can see the approach you try here is a clear NAK from my side.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> On 2026/2/9 18:16, Christian König wrote:
>>>>>> On 2/9/26 07:14, Honglei Huang wrote:
>>>>>>>
>>>>>>> I've reworked the implementation in v4. The fix is actually inspired
>>>>>>> by the DRM GPU SVM framework (drivers/gpu/drm/drm_gpusvm.c).
>>>>>>>
>>>>>>> DRM GPU SVM uses wide notifiers (recommended 512M or larger) to track
>>>>>>> multiple user virtual address ranges under a single mmu_interval_notifier,
>>>>>>> and these ranges can be non-contiguous which is essentially the same
>>>>>>> problem that batch userptr needs to solve: one BO backed by multiple
>>>>>>> non-contiguous CPU VA ranges sharing one notifier.
>>>>>>
>>>>>> That still doesn't solve the sequencing problem.
>>>>>>
>>>>>> As far as I can see you can't use hmm_range_fault with this approach or it would just not be very valuable.
>>>>>>
>>>>>> So how should that work with your patch set?
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>> The wide notifier is created in drm_gpusvm_notifier_alloc:
>>>>>>> notifier->itree.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
>>>>>>> notifier->itree.last = ALIGN(fault_addr + 1, gpusvm->notifier_size) - 1;
>>>>>>> The Xe driver passes
>>>>>>> xe_modparam.svm_notifier_size * SZ_1M in xe_svm_init
>>>>>>> as the notifier_size, so one notifier can cover many of MB of VA space
>>>>>>> containing multiple non-contiguous ranges.
>>>>>>>
>>>>>>> And DRM GPU SVM solves the per-range validity problem with flag-based
>>>>>>> validation instead of seq-based validation in:
>>>>>>> - drm_gpusvm_pages_valid() checks
>>>>>>> flags.has_dma_mapping
>>>>>>> not notifier_seq. The comment explicitly states:
>>>>>>> "This is akin to a notifier seqno check in the HMM documentation
>>>>>>> but due to wider notifiers (i.e., notifiers which span multiple
>>>>>>> ranges) this function is required for finer grained checking"
>>>>>>> - __drm_gpusvm_unmap_pages() clears
>>>>>>> flags.has_dma_mapping = false under notifier_lock
>>>>>>> - drm_gpusvm_get_pages() sets
>>>>>>> flags.has_dma_mapping = true under notifier_lock
>>>>>>> I adopted the same approach.
>>>>>>>
>>>>>>> DRM GPU SVM:
>>>>>>> drm_gpusvm_notifier_invalidate()
>>>>>>> down_write(&gpusvm->notifier_lock);
>>>>>>> mmu_interval_set_seq(mni, cur_seq);
>>>>>>> gpusvm->ops->invalidate()
>>>>>>> -> xe_svm_invalidate()
>>>>>>> drm_gpusvm_for_each_range()
>>>>>>> -> __drm_gpusvm_unmap_pages()
>>>>>>> WRITE_ONCE(flags.has_dma_mapping = false); // clear flag
>>>>>>> up_write(&gpusvm->notifier_lock);
>>>>>>>
>>>>>>> KFD batch userptr:
>>>>>>> amdgpu_amdkfd_evict_userptr_batch()
>>>>>>> mutex_lock(&process_info->notifier_lock);
>>>>>>> mmu_interval_set_seq(mni, cur_seq);
>>>>>>> discard_invalid_ranges()
>>>>>>> interval_tree_iter_first/next()
>>>>>>> range_info->valid = false; // clear flag
>>>>>>> mutex_unlock(&process_info->notifier_lock);
>>>>>>>
>>>>>>> Both implementations:
>>>>>>> - Acquire notifier_lock FIRST, before any flag changes
>>>>>>> - Call mmu_interval_set_seq() under the lock
>>>>>>> - Use interval tree to find affected ranges within the wide notifier
>>>>>>> - Mark per-range flag as invalid/valid under the lock
>>>>>>>
>>>>>>> The page fault path and final validation path also follow the same
>>>>>>> pattern as DRM GPU SVM: fault outside the lock, set/check per-range
>>>>>>> flag under the lock.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Honglei
>>>>>>>
>>>>>>>
>>>>>>> On 2026/2/6 21:56, Christian König wrote:
>>>>>>>> On 2/6/26 07:25, Honglei Huang wrote:
>>>>>>>>> From: Honglei Huang <honghuan@amd.com>
>>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> This is v3 of the patch series to support allocating multiple non-contiguous
>>>>>>>>> CPU virtual address ranges that map to a single contiguous GPU virtual address.
>>>>>>>>>
>>>>>>>>> v3:
>>>>>>>>> 1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>>>>>>>>> - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>>>>>>>>
>>>>>>>> That is most likely not the best approach, but Felix or Philip need to comment here since I don't know such IOCTLs well either.
>>>>>>>>
>>>>>>>>> - When flag is set, mmap_offset field points to range array
>>>>>>>>> - Minimal API surface change
>>>>>>>>
>>>>>>>> Why range of VA space for each entry?
>>>>>>>>
>>>>>>>>> 2. Improved MMU notifier handling:
>>>>>>>>> - Single mmu_interval_notifier covering the VA span [va_min, va_max]
>>>>>>>>> - Interval tree for efficient lookup of affected ranges during invalidation
>>>>>>>>> - Avoids per-range notifier overhead mentioned in v2 review
>>>>>>>>
>>>>>>>> That won't work unless you also modify hmm_range_fault() to take multiple VA addresses (or ranges) at the same time.
>>>>>>>>
>>>>>>>> The problem is that we must rely on hmm_range.notifier_seq to detect changes to the page tables in question, but that in turn works only if you have one hmm_range structure and not multiple.
>>>>>>>>
>>>>>>>> What might work is doing an XOR or CRC over all hmm_range.notifier_seq you have, but that is a bit flaky.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> 3. Better code organization: Split into 8 focused patches for easier review
>>>>>>>>>
>>>>>>>>> v2:
>>>>>>>>> - Each CPU VA range gets its own mmu_interval_notifier for invalidation
>>>>>>>>> - All ranges validated together and mapped to contiguous GPU VA
>>>>>>>>> - Single kgd_mem object with array of user_range_info structures
>>>>>>>>> - Unified eviction/restore path for all ranges in a batch
>>>>>>>>>
>>>>>>>>> Current Implementation Approach
>>>>>>>>> ===============================
>>>>>>>>>
>>>>>>>>> This series implements a practical solution within existing kernel constraints:
>>>>>>>>>
>>>>>>>>> 1. Single MMU notifier for VA span: Register one notifier covering the
>>>>>>>>> entire range from lowest to highest address in the batch
>>>>>>>>>
>>>>>>>>> 2. Interval tree filtering: Use interval tree to efficiently identify
>>>>>>>>> which specific ranges are affected during invalidation callbacks,
>>>>>>>>> avoiding unnecessary processing for unrelated address changes
>>>>>>>>>
>>>>>>>>> 3. Unified eviction/restore: All ranges in a batch share eviction and
>>>>>>>>> restore paths, maintaining consistency with existing userptr handling
>>>>>>>>>
>>>>>>>>> Patch Series Overview
>>>>>>>>> =====================
>>>>>>>>>
>>>>>>>>> Patch 1/8: Add userptr batch allocation UAPI structures
>>>>>>>>> - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
>>>>>>>>> - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data structures
>>>>>>>>>
>>>>>>>>> Patch 2/8: Add user_range_info infrastructure to kgd_mem
>>>>>>>>> - user_range_info structure for per-range tracking
>>>>>>>>> - Fields for batch allocation in kgd_mem
>>>>>>>>>
>>>>>>>>> Patch 3/8: Implement interval tree for userptr ranges
>>>>>>>>> - Interval tree for efficient range lookup during invalidation
>>>>>>>>> - mark_invalid_ranges() function
>>>>>>>>>
>>>>>>>>> Patch 4/8: Add batch MMU notifier support
>>>>>>>>> - Single notifier for entire VA span
>>>>>>>>> - Invalidation callback using interval tree filtering
>>>>>>>>>
>>>>>>>>> Patch 5/8: Implement batch userptr page management
>>>>>>>>> - get_user_pages_batch() and set_user_pages_batch()
>>>>>>>>> - Per-range page array management
>>>>>>>>>
>>>>>>>>> Patch 6/8: Add batch allocation function and export API
>>>>>>>>> - init_user_pages_batch() main initialization
>>>>>>>>> - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point
>>>>>>>>>
>>>>>>>>> Patch 7/8: Unify userptr cleanup and update paths
>>>>>>>>> - Shared eviction/restore handling for batch allocations
>>>>>>>>> - Integration with existing userptr validation flows
>>>>>>>>>
>>>>>>>>> Patch 8/8: Wire up batch allocation in ioctl handler
>>>>>>>>> - Input validation and range array parsing
>>>>>>>>> - Integration with existing alloc_memory_of_gpu path
>>>>>>>>>
>>>>>>>>> Testing
>>>>>>>>> =======
>>>>>>>>>
>>>>>>>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>>>>>>>> - Various allocation sizes (4KB to 1G+ per range)
>>>>>>>>> - Memory pressure scenarios and eviction/restore cycles
>>>>>>>>> - OpenCL CTS and HIP catch tests in KVM guest environment
>>>>>>>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>>>>>>>>> - Small LLM inference (3B-7B models)
>>>>>>>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>>>>>>>> - Performance improvement: 2x-2.4x faster than userspace approach
>>>>>>>>>
>>>>>>>>> Thank you for your review and feedback.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Honglei Huang
>>>>>>>>>
>>>>>>>>> Honglei Huang (8):
>>>>>>>>> drm/amdkfd: Add userptr batch allocation UAPI structures
>>>>>>>>> drm/amdkfd: Add user_range_info infrastructure to kgd_mem
>>>>>>>>> drm/amdkfd: Implement interval tree for userptr ranges
>>>>>>>>> drm/amdkfd: Add batch MMU notifier support
>>>>>>>>> drm/amdkfd: Implement batch userptr page management
>>>>>>>>> drm/amdkfd: Add batch allocation function and export API
>>>>>>>>> drm/amdkfd: Unify userptr cleanup and update paths
>>>>>>>>> drm/amdkfd: Wire up batch allocation in ioctl handler
>>>>>>>>>
>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 23 +
>>>>>>>>> .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 539 +++++++++++++++++-
>>>>>>>>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 128 ++++-
>>>>>>>>> include/uapi/linux/kfd_ioctl.h | 31 +-
>>>>>>>>> 4 files changed, 697 insertions(+), 24 deletions(-)
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 22+ messages in thread* Re: [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support
2026-02-09 14:25 ` Christian König
@ 2026-02-09 14:44 ` Honglei Huang
2026-02-09 15:07 ` Christian König
0 siblings, 1 reply; 22+ messages in thread
From: Honglei Huang @ 2026-02-09 14:44 UTC (permalink / raw)
To: Christian König
Cc: Felix.Kuehling, Philip.Yang, Ray.Huang, alexander.deucher,
dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm
You said that DRM GPU SVM has the same pattern, but argued
that it is not designed for "batch userptr". However, this distinction
has no technical significance. The core problem is "multiple ranges
under one wide notifier doing per-range hmm_range_fault", and whether
these ranges are dynamically created by GPU page faults or
batch-specified via ioctl, the concurrency safety mechanism is the
same.
You said "each hmm_range_fault() can invalidate the other ranges
while faulting them in". Yes, this can happen, but this is precisely
the scenario that the mem->invalid counter catches:
1. hmm_range_fault(A) succeeds
2. hmm_range_fault(B) triggers reclaim → A's pages swapped out
→ MMU notifier callback:
mutex_lock(notifier_lock)
range_A->valid = false
mem->invalid++
mutex_unlock(notifier_lock)
3. hmm_range_fault(B) completes
4. Commit phase:
mutex_lock(notifier_lock)
mem->invalid != saved_invalid
→ return -EAGAIN, retry entire batch
mutex_unlock(notifier_lock)
Invalid pages are never committed.
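
For reference, a minimal sketch of that commit phase; the structure and
field names follow the description in this thread rather than the exact
code of the series:

static int commit_user_pages_batch(struct amdkfd_process_info *process_info,
				   struct kgd_mem *mem, uint32_t saved_invalid)
{
	int i;

	mutex_lock(&process_info->notifier_lock);

	/* An invalidation raced with one of the per-range faults above. */
	if (mem->invalid != saved_invalid) {
		mutex_unlock(&process_info->notifier_lock);
		return -EAGAIN;	/* caller retries the whole batch */
	}

	/* Publish per-range validity only while holding the notifier lock. */
	for (i = 0; i < mem->num_ranges; i++)
		mem->ranges[i].valid = true;

	mutex_unlock(&process_info->notifier_lock);
	return 0;
}

Everything that publishes validity does so under
process_info->notifier_lock, which is what makes the counter comparison
meaningful.
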
Regards,
Honglei
On 2026/2/9 22:25, Christian König wrote:
> On 2/9/26 15:16, Honglei Huang wrote:
>> The case you described: one hmm_range_fault() invalidating another's
>> seq under the same notifier, is already handled in the implementation.
>>
>> example: suppose ranges A, B, C share one notifier:
>>
>> 1. hmm_range_fault(A) succeeds, seq_A recorded
>> 2. External invalidation occurs, triggers callback:
>> mutex_lock(notifier_lock)
>> → mmu_interval_set_seq()
>> → range_A->valid = false
>> → mem->invalid++
>> mutex_unlock(notifier_lock)
>> 3. hmm_range_fault(B) succeeds
>> 4. Commit phase:
>> mutex_lock(notifier_lock)
>> → check mem->invalid != saved_invalid
>> → return -EAGAIN, retry the entire batch
>> mutex_unlock(notifier_lock)
>>
>> All concurrent invalidations are caught by the mem->invalid counter.
>> Additionally, amdgpu_ttm_tt_get_user_pages_done() in confirm_valid_user_pages_locked
>> performs a per-range mmu_interval_read_retry() as a final safety check.
>>
>> DRM GPU SVM uses the same approach: drm_gpusvm_get_pages() also calls
>> hmm_range_fault() per-range independently there is no array version
>> of hmm_range_fault in DRM GPU SVM either. If you consider this approach
>> unworkable, then DRM GPU SVM would be unworkable too, yet it has been
>> accepted upstream.
>>
>> The number of batch ranges is controllable. And even if it
>> scales to thousands, DRM GPU SVM faces exactly the same situation:
>> it does not need an array version of hmm_range_fault either, which
>> shows this is a correctness question, not a performance one. For
>> correctness, I believe DRM GPU SVM already demonstrates the approach
>> is ok.
>
> Well yes, GPU SVM would have exactly the same problems. But that also doesn't have a create bulk userptr interface.
>
> The implementation is simply not made for this use case, and as far as I know no current upstream implementation is.
>
>> For performance, I have tested with thousands of ranges present:
>> performance reaches 80%~95% of the native driver, and all OpenCL
>> and ROCr test suites pass with no correctness issues.
>
> Testing can only falsify a system and not verify it.
>
>> Here is how DRM GPU SVM handles correctness with multiple ranges
>> under one wide notifier doing per-range hmm_range_fault:
>>
>> Invalidation: drm_gpusvm_notifier_invalidate()
>> - Acquires notifier_lock
>> - Calls mmu_interval_set_seq()
>> - Iterates affected ranges via driver callback (xe_svm_invalidate)
>> - Clears has_dma_mapping = false for each affected range (under lock)
>> - Releases notifier_lock
>>
>> Fault: drm_gpusvm_get_pages() (called per-range independently)
>> - mmu_interval_read_begin() to get seq
>> - hmm_range_fault() outside lock
>> - Acquires notifier_lock
>> - mmu_interval_read_retry() → if stale, release lock and retry
>> - DMA map pages + set has_dma_mapping = true (under lock)
>> - Releases notifier_lock
>>
>> Validation: drm_gpusvm_pages_valid()
>> - Checks has_dma_mapping flag (under lock), NOT seq
>>
>> If invalidation occurs between two per-range faults, the flag is
>> cleared under lock, and either mmu_interval_read_retry catches it
>> in the current fault, or drm_gpusvm_pages_valid() catches it at
>> validation time. No stale pages are ever committed.
>>
>> KFD batch userptr uses the same three-step pattern:
>>
>> Invalidation: amdgpu_amdkfd_evict_userptr_batch()
>> - Acquires notifier_lock
>> - Calls mmu_interval_set_seq()
>> - Iterates affected ranges via interval_tree
>> - Sets range->valid = false for each affected range (under lock)
>> - Increments mem->invalid (under lock)
>> - Releases notifier_lock
>>
>> Fault: update_invalid_user_pages()
>> - Per-range hmm_range_fault() outside lock
>
> And here the idea falls apart. Each hmm_range_fault() can invalidate the other ranges while faulting them in.
>
> That is not fundamentally solveable, but by moving the handling further into hmm_range_fault it makes it much less likely that something goes wrong.
>
> So once more as long as this still uses this hacky approach I will clearly reject this implementation.
>
> Regards,
> Christian.
>
>> - Acquires notifier_lock
>> - Checks mem->invalid != saved_invalid → if changed, -EAGAIN retry
>> - Sets range->valid = true for faulted ranges (under lock)
>> - Releases notifier_lock
>>
>> Validation: valid_user_pages_batch()
>> - Checks range->valid flag
>> - Calls amdgpu_ttm_tt_get_user_pages_done() (mmu_interval_read_retry)
>>
>> The logic is equivalent as far as I can see.
>>
>> Regards,
>> Honglei
>>
>>
>>
>> On 2026/2/9 21:27, Christian König wrote:
>>> On 2/9/26 14:11, Honglei Huang wrote:
>>>>
>>>> So the drm svm is also a NAK?
>>>>
>>>> These codes have passed local testing, opencl and rocr, I also provided a detailed code path and analysis.
>>>> You only said the conclusion without providing any reasons or evidence. Your statement has no justifiable reasons and is difficult to convince
>>>> so far.
>>>
>>> That sounds like you don't understand what the issue here is, I will try to explain this once more on pseudo-code.
>>>
>>> Page tables are updated without holding a lock, so when you want to grab physical addresses from the then you need to use an opportunistically retry based approach to make sure that the data you got is still valid.
>>>
>>> In other words something like this here is needed:
>>>
>>> retry:
>>> hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
>>> hmm_range.hmm_pfns = kvmalloc_array(npages, ...);
>>> ...
>>> while (true) {
>>> mmap_read_lock(mm);
>>> err = hmm_range_fault(&hmm_range);
>>> mmap_read_unlock(mm);
>>>
>>> if (err == -EBUSY) {
>>> if (time_after(jiffies, timeout))
>>> break;
>>>
>>> hmm_range.notifier_seq =
>>> mmu_interval_read_begin(notifier);
>>> continue;
>>> }
>>> break;
>>> }
>>> ...
>>> for (i = 0, j = 0; i < npages; ++j) {
>>> ...
>>> dma_map_page(...)
>>> ...
>>> grab_notifier_lock();
>>> if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq))
>>> goto retry;
>>> restart_queues();
>>> drop_notifier_lock();
>>> ...
>>>
>>> Now hmm_range.notifier_seq indicates if your DMA addresses are still valid or not after you grabbed the notifier lock.
>>>
>>> The problem is that hmm_range works only on a single range/sequence combination, so when you do multiple calls to hmm_range_fault() for scattered VA is can easily be that one call invalidates the ranges of another call.
>>>
>>> So as long as you only have a few hundred hmm_ranges for your userptrs that kind of works, but it doesn't scale up into the thousands of different VA addresses you get for scattered handling.
>>>
>>> That's why hmm_range_fault needs to be modified to handle an array of VA addresses instead of just a A..B range.
>>>
>>> Regards,
>>> Christian.
>>>
>>>
>>>>
>>>> On 2026/2/9 20:59, Christian König wrote:
>>>>> On 2/9/26 13:52, Honglei Huang wrote:
>>>>>> DRM GPU SVM does use hmm_range_fault(), see drm_gpusvm_get_pages()
>>>>>
>>>>> I'm not sure what you are talking about, drm_gpusvm_get_pages() only supports a single range as well and not scatter gather of VA addresses.
>>>>>
>>>>> As far as I can see that doesn't help the slightest.
>>>>>
>>>>>> My implementation follows the same pattern. The detailed comparison
>>>>>> of invalidation path was provided in the second half of my previous mail.
>>>>>
>>>>> Yeah and as I said that is not very valuable because it doesn't solves the sequence problem.
>>>>>
>>>>> As far as I can see the approach you try here is a clear NAK from my side.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>> On 2026/2/9 18:16, Christian König wrote:
>>>>>>> On 2/9/26 07:14, Honglei Huang wrote:
>>>>>>>>
>>>>>>>> I've reworked the implementation in v4. The fix is actually inspired
>>>>>>>> by the DRM GPU SVM framework (drivers/gpu/drm/drm_gpusvm.c).
>>>>>>>>
>>>>>>>> DRM GPU SVM uses wide notifiers (recommended 512M or larger) to track
>>>>>>>> multiple user virtual address ranges under a single mmu_interval_notifier,
>>>>>>>> and these ranges can be non-contiguous which is essentially the same
>>>>>>>> problem that batch userptr needs to solve: one BO backed by multiple
>>>>>>>> non-contiguous CPU VA ranges sharing one notifier.
>>>>>>>
>>>>>>> That still doesn't solve the sequencing problem.
>>>>>>>
>>>>>>> As far as I can see you can't use hmm_range_fault with this approach or it would just not be very valuable.
>>>>>>>
>>>>>>> So how should that work with your patch set?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>> The wide notifier is created in drm_gpusvm_notifier_alloc:
>>>>>>>> notifier->itree.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
>>>>>>>> notifier->itree.last = ALIGN(fault_addr + 1, gpusvm->notifier_size) - 1;
>>>>>>>> The Xe driver passes
>>>>>>>> xe_modparam.svm_notifier_size * SZ_1M in xe_svm_init
>>>>>>>> as the notifier_size, so one notifier can cover many of MB of VA space
>>>>>>>> containing multiple non-contiguous ranges.
>>>>>>>>
>>>>>>>> And DRM GPU SVM solves the per-range validity problem with flag-based
>>>>>>>> validation instead of seq-based validation in:
>>>>>>>> - drm_gpusvm_pages_valid() checks
>>>>>>>> flags.has_dma_mapping
>>>>>>>> not notifier_seq. The comment explicitly states:
>>>>>>>> "This is akin to a notifier seqno check in the HMM documentation
>>>>>>>> but due to wider notifiers (i.e., notifiers which span multiple
>>>>>>>> ranges) this function is required for finer grained checking"
>>>>>>>> - __drm_gpusvm_unmap_pages() clears
>>>>>>>> flags.has_dma_mapping = false under notifier_lock
>>>>>>>> - drm_gpusvm_get_pages() sets
>>>>>>>> flags.has_dma_mapping = true under notifier_lock
>>>>>>>> I adopted the same approach.
>>>>>>>>
>>>>>>>> DRM GPU SVM:
>>>>>>>> drm_gpusvm_notifier_invalidate()
>>>>>>>> down_write(&gpusvm->notifier_lock);
>>>>>>>> mmu_interval_set_seq(mni, cur_seq);
>>>>>>>> gpusvm->ops->invalidate()
>>>>>>>> -> xe_svm_invalidate()
>>>>>>>> drm_gpusvm_for_each_range()
>>>>>>>> -> __drm_gpusvm_unmap_pages()
>>>>>>>> WRITE_ONCE(flags.has_dma_mapping = false); // clear flag
>>>>>>>> up_write(&gpusvm->notifier_lock);
>>>>>>>>
>>>>>>>> KFD batch userptr:
>>>>>>>> amdgpu_amdkfd_evict_userptr_batch()
>>>>>>>> mutex_lock(&process_info->notifier_lock);
>>>>>>>> mmu_interval_set_seq(mni, cur_seq);
>>>>>>>> discard_invalid_ranges()
>>>>>>>> interval_tree_iter_first/next()
>>>>>>>> range_info->valid = false; // clear flag
>>>>>>>> mutex_unlock(&process_info->notifier_lock);
>>>>>>>>
>>>>>>>> Both implementations:
>>>>>>>> - Acquire notifier_lock FIRST, before any flag changes
>>>>>>>> - Call mmu_interval_set_seq() under the lock
>>>>>>>> - Use interval tree to find affected ranges within the wide notifier
>>>>>>>> - Mark per-range flag as invalid/valid under the lock
>>>>>>>>
>>>>>>>> The page fault path and final validation path also follow the same
>>>>>>>> pattern as DRM GPU SVM: fault outside the lock, set/check per-range
>>>>>>>> flag under the lock.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Honglei
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2026/2/6 21:56, Christian König wrote:
>>>>>>>>> On 2/6/26 07:25, Honglei Huang wrote:
>>>>>>>>>> From: Honglei Huang <honghuan@amd.com>
>>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> This is v3 of the patch series to support allocating multiple non-contiguous
>>>>>>>>>> CPU virtual address ranges that map to a single contiguous GPU virtual address.
>>>>>>>>>>
>>>>>>>>>> v3:
>>>>>>>>>> 1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>>>>>>>>>> - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>>>>>>>>>
>>>>>>>>> That is most likely not the best approach, but Felix or Philip need to comment here since I don't know such IOCTLs well either.
>>>>>>>>>
>>>>>>>>>> - When flag is set, mmap_offset field points to range array
>>>>>>>>>> - Minimal API surface change
>>>>>>>>>
>>>>>>>>> Why range of VA space for each entry?
>>>>>>>>>
>>>>>>>>>> 2. Improved MMU notifier handling:
>>>>>>>>>> - Single mmu_interval_notifier covering the VA span [va_min, va_max]
>>>>>>>>>> - Interval tree for efficient lookup of affected ranges during invalidation
>>>>>>>>>> - Avoids per-range notifier overhead mentioned in v2 review
>>>>>>>>>
>>>>>>>>> That won't work unless you also modify hmm_range_fault() to take multiple VA addresses (or ranges) at the same time.
>>>>>>>>>
>>>>>>>>> The problem is that we must rely on hmm_range.notifier_seq to detect changes to the page tables in question, but that in turn works only if you have one hmm_range structure and not multiple.
>>>>>>>>>
>>>>>>>>> What might work is doing an XOR or CRC over all hmm_range.notifier_seq you have, but that is a bit flaky.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 3. Better code organization: Split into 8 focused patches for easier review
>>>>>>>>>>
>>>>>>>>>> v2:
>>>>>>>>>> - Each CPU VA range gets its own mmu_interval_notifier for invalidation
>>>>>>>>>> - All ranges validated together and mapped to contiguous GPU VA
>>>>>>>>>> - Single kgd_mem object with array of user_range_info structures
>>>>>>>>>> - Unified eviction/restore path for all ranges in a batch
>>>>>>>>>>
>>>>>>>>>> Current Implementation Approach
>>>>>>>>>> ===============================
>>>>>>>>>>
>>>>>>>>>> This series implements a practical solution within existing kernel constraints:
>>>>>>>>>>
>>>>>>>>>> 1. Single MMU notifier for VA span: Register one notifier covering the
>>>>>>>>>> entire range from lowest to highest address in the batch
>>>>>>>>>>
>>>>>>>>>> 2. Interval tree filtering: Use interval tree to efficiently identify
>>>>>>>>>> which specific ranges are affected during invalidation callbacks,
>>>>>>>>>> avoiding unnecessary processing for unrelated address changes
>>>>>>>>>>
>>>>>>>>>> 3. Unified eviction/restore: All ranges in a batch share eviction and
>>>>>>>>>> restore paths, maintaining consistency with existing userptr handling
>>>>>>>>>>
>>>>>>>>>> Patch Series Overview
>>>>>>>>>> =====================
>>>>>>>>>>
>>>>>>>>>> Patch 1/8: Add userptr batch allocation UAPI structures
>>>>>>>>>> - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
>>>>>>>>>> - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data structures
>>>>>>>>>>
>>>>>>>>>> Patch 2/8: Add user_range_info infrastructure to kgd_mem
>>>>>>>>>> - user_range_info structure for per-range tracking
>>>>>>>>>> - Fields for batch allocation in kgd_mem
>>>>>>>>>>
>>>>>>>>>> Patch 3/8: Implement interval tree for userptr ranges
>>>>>>>>>> - Interval tree for efficient range lookup during invalidation
>>>>>>>>>> - mark_invalid_ranges() function
>>>>>>>>>>
>>>>>>>>>> Patch 4/8: Add batch MMU notifier support
>>>>>>>>>> - Single notifier for entire VA span
>>>>>>>>>> - Invalidation callback using interval tree filtering
>>>>>>>>>>
>>>>>>>>>> Patch 5/8: Implement batch userptr page management
>>>>>>>>>> - get_user_pages_batch() and set_user_pages_batch()
>>>>>>>>>> - Per-range page array management
>>>>>>>>>>
>>>>>>>>>> Patch 6/8: Add batch allocation function and export API
>>>>>>>>>> - init_user_pages_batch() main initialization
>>>>>>>>>> - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point
>>>>>>>>>>
>>>>>>>>>> Patch 7/8: Unify userptr cleanup and update paths
>>>>>>>>>> - Shared eviction/restore handling for batch allocations
>>>>>>>>>> - Integration with existing userptr validation flows
>>>>>>>>>>
>>>>>>>>>> Patch 8/8: Wire up batch allocation in ioctl handler
>>>>>>>>>> - Input validation and range array parsing
>>>>>>>>>> - Integration with existing alloc_memory_of_gpu path
>>>>>>>>>>
>>>>>>>>>> Testing
>>>>>>>>>> =======
>>>>>>>>>>
>>>>>>>>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>>>>>>>>> - Various allocation sizes (4KB to 1G+ per range)
>>>>>>>>>> - Memory pressure scenarios and eviction/restore cycles
>>>>>>>>>> - OpenCL CTS and HIP catch tests in KVM guest environment
>>>>>>>>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>>>>>>>>>> - Small LLM inference (3B-7B models)
>>>>>>>>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>>>>>>>>> - Performance improvement: 2x-2.4x faster than userspace approach
>>>>>>>>>>
>>>>>>>>>> Thank you for your review and feedback.
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>> Honglei Huang
>>>>>>>>>>
>>>>>>>>>> Honglei Huang (8):
>>>>>>>>>> drm/amdkfd: Add userptr batch allocation UAPI structures
>>>>>>>>>> drm/amdkfd: Add user_range_info infrastructure to kgd_mem
>>>>>>>>>> drm/amdkfd: Implement interval tree for userptr ranges
>>>>>>>>>> drm/amdkfd: Add batch MMU notifier support
>>>>>>>>>> drm/amdkfd: Implement batch userptr page management
>>>>>>>>>> drm/amdkfd: Add batch allocation function and export API
>>>>>>>>>> drm/amdkfd: Unify userptr cleanup and update paths
>>>>>>>>>> drm/amdkfd: Wire up batch allocation in ioctl handler
>>>>>>>>>>
>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 23 +
>>>>>>>>>> .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 539 +++++++++++++++++-
>>>>>>>>>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 128 ++++-
>>>>>>>>>> include/uapi/linux/kfd_ioctl.h | 31 +-
>>>>>>>>>> 4 files changed, 697 insertions(+), 24 deletions(-)
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 22+ messages in thread* Re: [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support
2026-02-09 14:44 ` Honglei Huang
@ 2026-02-09 15:07 ` Christian König
2026-02-09 15:46 ` Honglei Huang
0 siblings, 1 reply; 22+ messages in thread
From: Christian König @ 2026-02-09 15:07 UTC (permalink / raw)
To: Honglei Huang
Cc: Felix.Kuehling, Philip.Yang, Ray.Huang, alexander.deucher,
dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm
On 2/9/26 15:44, Honglei Huang wrote:
> you said that DRM GPU SVM has the same pattern, but argued
> that it is not designed for "batch userptr". However, this distinction
> has no technical significance. The core problem is "multiple ranges
> under one wide notifier doing per-range hmm_range_fault". Whether
> these ranges are dynamically created by GPU page faults or
> batch-specified via ioctl, the concurrency safety mechanism is
> same.
>
> You said "each hmm_range_fault() can invalidate the other ranges
> while faulting them in". Yes, this can happen but this is precisely
> the scenario that mem->invalid catches:
>
> 1. hmm_range_fault(A) succeeds
> 2. hmm_range_fault(B) triggers reclaim → A's pages swapped out
> → MMU notifier callback:
> mutex_lock(notifier_lock)
> range_A->valid = false
> mem->invalid++
> mutex_unlock(notifier_lock)
> 3. hmm_range_fault(B) completes
> 4. Commit phase:
> mutex_lock(notifier_lock)
> mem->invalid != saved_invalid
> → return -EAGAIN, retry entire batch
> mutex_unlock(notifier_lock)
>
> invalid pages are never committed.
Once more, that is not the problem. I completely agree that this is all correctly handled.
The problem is that the more hmm_ranges you get, the more likely it is that getting another pfn invalidates a pfn you previously acquired.
So this can end up in an endless loop, and that's why the GPUSVM code also has a timeout on the retry.
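
Roughly what such a timeout amounts to, with made-up names for the
batch retry in this series:

	unsigned long timeout = jiffies +
		msecs_to_jiffies(BATCH_FAULT_TIMEOUT_MS);
	int ret;

	do {
		/* the per-range fault + -EAGAIN commit path discussed above */
		ret = fault_and_commit_batch(mem);
	} while (ret == -EAGAIN && !time_after(jiffies, timeout));
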
What you need to figure out is how to teach hmm_range_fault() and the underlying walk_page_range() to skip the entries you are not interested in.
Just a trivial example, assuming these are the VAs you want your userptr to be filled in with, in this order: 3, 1, 5, 8, 7, 2
To handle this case you need to build a data structure which tells you the smallest, the largest and where each VA in the middle comes in. So you need something like: 1->1, 2->5, 3->0, 5->2, 7->4, 8->3
Then you would call walk_page_range(mm, 1, 8, ops, data); the pud walk decides if it needs to go into the pmd or eventually fault, the pmd walk decides if ptes need to be filled in, etc...
The final pte handler then fills in the pfns linearly for the addresses you need.
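
A rough sketch of that lookup structure and walk, just to illustrate
the idea; the names are made up, and faulting of non-present entries,
huge entries, hmm_pfn flag handling and the notifier retry are all left
out:

#include <linux/mm.h>
#include <linux/pagewalk.h>
#include <linux/sort.h>
#include <linux/bsearch.h>

struct scatter_ent {
	unsigned long va;	/* page-aligned user VA */
	unsigned long slot;	/* index in the caller-ordered pfn array */
};

struct scatter_walk {
	struct scatter_ent *ents;	/* sorted by va */
	unsigned long nents;
	unsigned long *pfns;		/* output, indexed by ent->slot */
};

static int scatter_ent_cmp(const void *a, const void *b)
{
	const struct scatter_ent *x = a, *y = b;

	return x->va < y->va ? -1 : x->va > y->va;
}

static int scatter_pte_entry(pte_t *ptep, unsigned long addr,
			     unsigned long next, struct mm_walk *walk)
{
	struct scatter_walk *sw = walk->private;
	struct scatter_ent key = { .va = addr };
	struct scatter_ent *ent;
	pte_t pte = ptep_get(ptep);

	/* Skip addresses which are not part of the scattered set. */
	ent = bsearch(&key, sw->ents, sw->nents, sizeof(key),
		      scatter_ent_cmp);
	if (!ent)
		return 0;

	if (!pte_present(pte))
		return -EAGAIN;	/* would need to fault here, like hmm does */

	sw->pfns[ent->slot] = pte_pfn(pte);
	return 0;
}

static const struct mm_walk_ops scatter_walk_ops = {
	.pte_entry = scatter_pte_entry,
};

/* Caller must hold mmap_read_lock(mm). */
static int scatter_walk_range(struct mm_struct *mm, struct scatter_walk *sw)
{
	unsigned long start, end;

	sort(sw->ents, sw->nents, sizeof(*sw->ents), scatter_ent_cmp, NULL);
	start = sw->ents[0].va;
	end = sw->ents[sw->nents - 1].va + PAGE_SIZE;

	return walk_page_range(mm, start, end, &scatter_walk_ops, sw);
}

With something like this the whole batch is handled by one walk over
[start, end) and therefore by one notifier_seq, which is the point of
the exercise.
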
And yeah, I know perfectly well that this is horribly complicated, but as far as I can see everything else will just not scale.
Creating hundreds of separate userptrs only scales up to a few megabytes and then falls apart.
Regards,
Christian.
>
> Regards,
> Honglei
>
>
> On 2026/2/9 22:25, Christian König wrote:
>> On 2/9/26 15:16, Honglei Huang wrote:
>>> The case you described: one hmm_range_fault() invalidating another's
>>> seq under the same notifier, is already handled in the implementation.
>>>
>>> example: suppose ranges A, B, C share one notifier:
>>>
>>> 1. hmm_range_fault(A) succeeds, seq_A recorded
>>> 2. External invalidation occurs, triggers callback:
>>> mutex_lock(notifier_lock)
>>> → mmu_interval_set_seq()
>>> → range_A->valid = false
>>> → mem->invalid++
>>> mutex_unlock(notifier_lock)
>>> 3. hmm_range_fault(B) succeeds
>>> 4. Commit phase:
>>> mutex_lock(notifier_lock)
>>> → check mem->invalid != saved_invalid
>>> → return -EAGAIN, retry the entire batch
>>> mutex_unlock(notifier_lock)
>>>
>>> All concurrent invalidations are caught by the mem->invalid counter.
>>> Additionally, amdgpu_ttm_tt_get_user_pages_done() in confirm_valid_user_pages_locked
>>> performs a per-range mmu_interval_read_retry() as a final safety check.
>>>
>>> DRM GPU SVM uses the same approach: drm_gpusvm_get_pages() also calls
>>> hmm_range_fault() per-range independently there is no array version
>>> of hmm_range_fault in DRM GPU SVM either. If you consider this approach
>>> unworkable, then DRM GPU SVM would be unworkable too, yet it has been
>>> accepted upstream.
>>>
>>> The number of batch ranges is controllable. And even if it
>>> scales to thousands, DRM GPU SVM faces exactly the same situation:
>>> it does not need an array version of hmm_range_fault either, which
>>> shows this is a correctness question, not a performance one. For
>>> correctness, I believe DRM GPU SVM already demonstrates the approach
>>> is ok.
>>
>> Well yes, GPU SVM would have exactly the same problems. But that also doesn't have a create bulk userptr interface.
>>
>> The implementation is simply not made for this use case, and as far as I know no current upstream implementation is.
>>
>>> For performance, I have tested with thousands of ranges present:
>>> performance reaches 80%~95% of the native driver, and all OpenCL
>>> and ROCr test suites pass with no correctness issues.
>>
>> Testing can only falsify a system and not verify it.
>>
>>> Here is how DRM GPU SVM handles correctness with multiple ranges
>>> under one wide notifier doing per-range hmm_range_fault:
>>>
>>> Invalidation: drm_gpusvm_notifier_invalidate()
>>> - Acquires notifier_lock
>>> - Calls mmu_interval_set_seq()
>>> - Iterates affected ranges via driver callback (xe_svm_invalidate)
>>> - Clears has_dma_mapping = false for each affected range (under lock)
>>> - Releases notifier_lock
>>>
>>> Fault: drm_gpusvm_get_pages() (called per-range independently)
>>> - mmu_interval_read_begin() to get seq
>>> - hmm_range_fault() outside lock
>>> - Acquires notifier_lock
>>> - mmu_interval_read_retry() → if stale, release lock and retry
>>> - DMA map pages + set has_dma_mapping = true (under lock)
>>> - Releases notifier_lock
>>>
>>> Validation: drm_gpusvm_pages_valid()
>>> - Checks has_dma_mapping flag (under lock), NOT seq
>>>
>>> If invalidation occurs between two per-range faults, the flag is
>>> cleared under lock, and either mmu_interval_read_retry catches it
>>> in the current fault, or drm_gpusvm_pages_valid() catches it at
>>> validation time. No stale pages are ever committed.
>>>
>>> KFD batch userptr uses the same three-step pattern:
>>>
>>> Invalidation: amdgpu_amdkfd_evict_userptr_batch()
>>> - Acquires notifier_lock
>>> - Calls mmu_interval_set_seq()
>>> - Iterates affected ranges via interval_tree
>>> - Sets range->valid = false for each affected range (under lock)
>>> - Increments mem->invalid (under lock)
>>> - Releases notifier_lock
>>>
>>> Fault: update_invalid_user_pages()
>>> - Per-range hmm_range_fault() outside lock
>>
>> And here the idea falls apart. Each hmm_range_fault() can invalidate the other ranges while faulting them in.
>>
>> That is not fundamentally solveable, but by moving the handling further into hmm_range_fault it makes it much less likely that something goes wrong.
>>
>> So once more as long as this still uses this hacky approach I will clearly reject this implementation.
>>
>> Regards,
>> Christian.
>>
>>> - Acquires notifier_lock
>>> - Checks mem->invalid != saved_invalid → if changed, -EAGAIN retry
>>> - Sets range->valid = true for faulted ranges (under lock)
>>> - Releases notifier_lock
>>>
>>> Validation: valid_user_pages_batch()
>>> - Checks range->valid flag
>>> - Calls amdgpu_ttm_tt_get_user_pages_done() (mmu_interval_read_retry)
>>>
>>> The logic is equivalent as far as I can see.
>>>
>>> Regards,
>>> Honglei
>>>
>>>
>>>
>>> On 2026/2/9 21:27, Christian König wrote:
>>>> On 2/9/26 14:11, Honglei Huang wrote:
>>>>>
>>>>> So the drm svm is also a NAK?
>>>>>
>>>>> These codes have passed local testing, opencl and rocr, I also provided a detailed code path and analysis.
>>>>> You only said the conclusion without providing any reasons or evidence. Your statement has no justifiable reasons and is difficult to convince
>>>>> so far.
>>>>
>>>> That sounds like you don't understand what the issue here is, I will try to explain this once more on pseudo-code.
>>>>
>>>> Page tables are updated without holding a lock, so when you want to grab physical addresses from the then you need to use an opportunistically retry based approach to make sure that the data you got is still valid.
>>>>
>>>> In other words something like this here is needed:
>>>>
>>>> retry:
>>>> hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
>>>> hmm_range.hmm_pfns = kvmalloc_array(npages, ...);
>>>> ...
>>>> while (true) {
>>>> mmap_read_lock(mm);
>>>> err = hmm_range_fault(&hmm_range);
>>>> mmap_read_unlock(mm);
>>>>
>>>> if (err == -EBUSY) {
>>>> if (time_after(jiffies, timeout))
>>>> break;
>>>>
>>>> hmm_range.notifier_seq =
>>>> mmu_interval_read_begin(notifier);
>>>> continue;
>>>> }
>>>> break;
>>>> }
>>>> ...
>>>> for (i = 0, j = 0; i < npages; ++j) {
>>>> ...
>>>> dma_map_page(...)
>>>> ...
>>>> grab_notifier_lock();
>>>> if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq))
>>>> goto retry;
>>>> restart_queues();
>>>> drop_notifier_lock();
>>>> ...
>>>>
>>>> Now hmm_range.notifier_seq indicates if your DMA addresses are still valid or not after you grabbed the notifier lock.
>>>>
>>>> The problem is that hmm_range works only on a single range/sequence combination, so when you do multiple calls to hmm_range_fault() for scattered VA is can easily be that one call invalidates the ranges of another call.
>>>>
>>>> So as long as you only have a few hundred hmm_ranges for your userptrs that kind of works, but it doesn't scale up into the thousands of different VA addresses you get for scattered handling.
>>>>
>>>> That's why hmm_range_fault needs to be modified to handle an array of VA addresses instead of just a A..B range.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>
>>>>>
>>>>> On 2026/2/9 20:59, Christian König wrote:
>>>>>> On 2/9/26 13:52, Honglei Huang wrote:
>>>>>>> DRM GPU SVM does use hmm_range_fault(), see drm_gpusvm_get_pages()
>>>>>>
>>>>>> I'm not sure what you are talking about, drm_gpusvm_get_pages() only supports a single range as well and not scatter gather of VA addresses.
>>>>>>
>>>>>> As far as I can see that doesn't help the slightest.
>>>>>>
>>>>>>> My implementation follows the same pattern. The detailed comparison
>>>>>>> of invalidation path was provided in the second half of my previous mail.
>>>>>>
>>>>>> Yeah and as I said that is not very valuable because it doesn't solves the sequence problem.
>>>>>>
>>>>>> As far as I can see the approach you try here is a clear NAK from my side.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>> On 2026/2/9 18:16, Christian König wrote:
>>>>>>>> On 2/9/26 07:14, Honglei Huang wrote:
>>>>>>>>>
>>>>>>>>> I've reworked the implementation in v4. The fix is actually inspired
>>>>>>>>> by the DRM GPU SVM framework (drivers/gpu/drm/drm_gpusvm.c).
>>>>>>>>>
>>>>>>>>> DRM GPU SVM uses wide notifiers (recommended 512M or larger) to track
>>>>>>>>> multiple user virtual address ranges under a single mmu_interval_notifier,
>>>>>>>>> and these ranges can be non-contiguous which is essentially the same
>>>>>>>>> problem that batch userptr needs to solve: one BO backed by multiple
>>>>>>>>> non-contiguous CPU VA ranges sharing one notifier.
>>>>>>>>
>>>>>>>> That still doesn't solve the sequencing problem.
>>>>>>>>
>>>>>>>> As far as I can see you can't use hmm_range_fault with this approach or it would just not be very valuable.
>>>>>>>>
>>>>>>>> So how should that work with your patch set?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> The wide notifier is created in drm_gpusvm_notifier_alloc:
>>>>>>>>> notifier->itree.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
>>>>>>>>> notifier->itree.last = ALIGN(fault_addr + 1, gpusvm->notifier_size) - 1;
>>>>>>>>> The Xe driver passes
>>>>>>>>> xe_modparam.svm_notifier_size * SZ_1M in xe_svm_init
>>>>>>>>> as the notifier_size, so one notifier can cover many of MB of VA space
>>>>>>>>> containing multiple non-contiguous ranges.
>>>>>>>>>
>>>>>>>>> And DRM GPU SVM solves the per-range validity problem with flag-based
>>>>>>>>> validation instead of seq-based validation in:
>>>>>>>>> - drm_gpusvm_pages_valid() checks
>>>>>>>>> flags.has_dma_mapping
>>>>>>>>> not notifier_seq. The comment explicitly states:
>>>>>>>>> "This is akin to a notifier seqno check in the HMM documentation
>>>>>>>>> but due to wider notifiers (i.e., notifiers which span multiple
>>>>>>>>> ranges) this function is required for finer grained checking"
>>>>>>>>> - __drm_gpusvm_unmap_pages() clears
>>>>>>>>> flags.has_dma_mapping = false under notifier_lock
>>>>>>>>> - drm_gpusvm_get_pages() sets
>>>>>>>>> flags.has_dma_mapping = true under notifier_lock
>>>>>>>>> I adopted the same approach.
>>>>>>>>>
>>>>>>>>> DRM GPU SVM:
>>>>>>>>> drm_gpusvm_notifier_invalidate()
>>>>>>>>> down_write(&gpusvm->notifier_lock);
>>>>>>>>> mmu_interval_set_seq(mni, cur_seq);
>>>>>>>>> gpusvm->ops->invalidate()
>>>>>>>>> -> xe_svm_invalidate()
>>>>>>>>> drm_gpusvm_for_each_range()
>>>>>>>>> -> __drm_gpusvm_unmap_pages()
>>>>>>>>> WRITE_ONCE(flags.has_dma_mapping = false); // clear flag
>>>>>>>>> up_write(&gpusvm->notifier_lock);
>>>>>>>>>
>>>>>>>>> KFD batch userptr:
>>>>>>>>> amdgpu_amdkfd_evict_userptr_batch()
>>>>>>>>> mutex_lock(&process_info->notifier_lock);
>>>>>>>>> mmu_interval_set_seq(mni, cur_seq);
>>>>>>>>> discard_invalid_ranges()
>>>>>>>>> interval_tree_iter_first/next()
>>>>>>>>> range_info->valid = false; // clear flag
>>>>>>>>> mutex_unlock(&process_info->notifier_lock);
>>>>>>>>>
>>>>>>>>> Both implementations:
>>>>>>>>> - Acquire notifier_lock FIRST, before any flag changes
>>>>>>>>> - Call mmu_interval_set_seq() under the lock
>>>>>>>>> - Use interval tree to find affected ranges within the wide notifier
>>>>>>>>> - Mark per-range flag as invalid/valid under the lock
>>>>>>>>>
>>>>>>>>> The page fault path and final validation path also follow the same
>>>>>>>>> pattern as DRM GPU SVM: fault outside the lock, set/check per-range
>>>>>>>>> flag under the lock.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Honglei
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2026/2/6 21:56, Christian König wrote:
>>>>>>>>>> On 2/6/26 07:25, Honglei Huang wrote:
>>>>>>>>>>> From: Honglei Huang <honghuan@amd.com>
>>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> This is v3 of the patch series to support allocating multiple non-contiguous
>>>>>>>>>>> CPU virtual address ranges that map to a single contiguous GPU virtual address.
>>>>>>>>>>>
>>>>>>>>>>> v3:
>>>>>>>>>>> 1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>>>>>>>>>>> - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>>>>>>>>>>
>>>>>>>>>> That is most likely not the best approach, but Felix or Philip need to comment here since I don't know such IOCTLs well either.
>>>>>>>>>>
>>>>>>>>>>> - When flag is set, mmap_offset field points to range array
>>>>>>>>>>> - Minimal API surface change
>>>>>>>>>>
>>>>>>>>>> Why range of VA space for each entry?
>>>>>>>>>>
>>>>>>>>>>> 2. Improved MMU notifier handling:
>>>>>>>>>>> - Single mmu_interval_notifier covering the VA span [va_min, va_max]
>>>>>>>>>>> - Interval tree for efficient lookup of affected ranges during invalidation
>>>>>>>>>>> - Avoids per-range notifier overhead mentioned in v2 review
>>>>>>>>>>
>>>>>>>>>> That won't work unless you also modify hmm_range_fault() to take multiple VA addresses (or ranges) at the same time.
>>>>>>>>>>
>>>>>>>>>> The problem is that we must rely on hmm_range.notifier_seq to detect changes to the page tables in question, but that in turn works only if you have one hmm_range structure and not multiple.
>>>>>>>>>>
>>>>>>>>>> What might work is doing an XOR or CRC over all hmm_range.notifier_seq you have, but that is a bit flaky.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 3. Better code organization: Split into 8 focused patches for easier review
>>>>>>>>>>>
>>>>>>>>>>> v2:
>>>>>>>>>>> - Each CPU VA range gets its own mmu_interval_notifier for invalidation
>>>>>>>>>>> - All ranges validated together and mapped to contiguous GPU VA
>>>>>>>>>>> - Single kgd_mem object with array of user_range_info structures
>>>>>>>>>>> - Unified eviction/restore path for all ranges in a batch
>>>>>>>>>>>
>>>>>>>>>>> Current Implementation Approach
>>>>>>>>>>> ===============================
>>>>>>>>>>>
>>>>>>>>>>> This series implements a practical solution within existing kernel constraints:
>>>>>>>>>>>
>>>>>>>>>>> 1. Single MMU notifier for VA span: Register one notifier covering the
>>>>>>>>>>> entire range from lowest to highest address in the batch
>>>>>>>>>>>
>>>>>>>>>>> 2. Interval tree filtering: Use interval tree to efficiently identify
>>>>>>>>>>> which specific ranges are affected during invalidation callbacks,
>>>>>>>>>>> avoiding unnecessary processing for unrelated address changes
>>>>>>>>>>>
>>>>>>>>>>> 3. Unified eviction/restore: All ranges in a batch share eviction and
>>>>>>>>>>> restore paths, maintaining consistency with existing userptr handling
>>>>>>>>>>>
>>>>>>>>>>> Patch Series Overview
>>>>>>>>>>> =====================
>>>>>>>>>>>
>>>>>>>>>>> Patch 1/8: Add userptr batch allocation UAPI structures
>>>>>>>>>>> - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
>>>>>>>>>>> - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data structures
>>>>>>>>>>>
>>>>>>>>>>> Patch 2/8: Add user_range_info infrastructure to kgd_mem
>>>>>>>>>>> - user_range_info structure for per-range tracking
>>>>>>>>>>> - Fields for batch allocation in kgd_mem
>>>>>>>>>>>
>>>>>>>>>>> Patch 3/8: Implement interval tree for userptr ranges
>>>>>>>>>>> - Interval tree for efficient range lookup during invalidation
>>>>>>>>>>> - mark_invalid_ranges() function
>>>>>>>>>>>
>>>>>>>>>>> Patch 4/8: Add batch MMU notifier support
>>>>>>>>>>> - Single notifier for entire VA span
>>>>>>>>>>> - Invalidation callback using interval tree filtering
>>>>>>>>>>>
>>>>>>>>>>> Patch 5/8: Implement batch userptr page management
>>>>>>>>>>> - get_user_pages_batch() and set_user_pages_batch()
>>>>>>>>>>> - Per-range page array management
>>>>>>>>>>>
>>>>>>>>>>> Patch 6/8: Add batch allocation function and export API
>>>>>>>>>>> - init_user_pages_batch() main initialization
>>>>>>>>>>> - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point
>>>>>>>>>>>
>>>>>>>>>>> Patch 7/8: Unify userptr cleanup and update paths
>>>>>>>>>>> - Shared eviction/restore handling for batch allocations
>>>>>>>>>>> - Integration with existing userptr validation flows
>>>>>>>>>>>
>>>>>>>>>>> Patch 8/8: Wire up batch allocation in ioctl handler
>>>>>>>>>>> - Input validation and range array parsing
>>>>>>>>>>> - Integration with existing alloc_memory_of_gpu path
>>>>>>>>>>>
>>>>>>>>>>> Testing
>>>>>>>>>>> =======
>>>>>>>>>>>
>>>>>>>>>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>>>>>>>>>> - Various allocation sizes (4KB to 1G+ per range)
>>>>>>>>>>> - Memory pressure scenarios and eviction/restore cycles
>>>>>>>>>>> - OpenCL CTS and HIP catch tests in KVM guest environment
>>>>>>>>>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>>>>>>>>>>> - Small LLM inference (3B-7B models)
>>>>>>>>>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>>>>>>>>>> - Performance improvement: 2x-2.4x faster than userspace approach
>>>>>>>>>>>
>>>>>>>>>>> Thank you for your review and feedback.
>>>>>>>>>>>
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Honglei Huang
>>>>>>>>>>>
>>>>>>>>>>> Honglei Huang (8):
>>>>>>>>>>> drm/amdkfd: Add userptr batch allocation UAPI structures
>>>>>>>>>>> drm/amdkfd: Add user_range_info infrastructure to kgd_mem
>>>>>>>>>>> drm/amdkfd: Implement interval tree for userptr ranges
>>>>>>>>>>> drm/amdkfd: Add batch MMU notifier support
>>>>>>>>>>> drm/amdkfd: Implement batch userptr page management
>>>>>>>>>>> drm/amdkfd: Add batch allocation function and export API
>>>>>>>>>>> drm/amdkfd: Unify userptr cleanup and update paths
>>>>>>>>>>> drm/amdkfd: Wire up batch allocation in ioctl handler
>>>>>>>>>>>
>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 23 +
>>>>>>>>>>> .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 539 +++++++++++++++++-
>>>>>>>>>>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 128 ++++-
>>>>>>>>>>> include/uapi/linux/kfd_ioctl.h | 31 +-
>>>>>>>>>>> 4 files changed, 697 insertions(+), 24 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support
2026-02-09 15:07 ` Christian König
@ 2026-02-09 15:46 ` Honglei Huang
2026-02-09 17:37 ` Christian König
0 siblings, 1 reply; 22+ messages in thread
From: Honglei Huang @ 2026-02-09 15:46 UTC (permalink / raw)
To: Christian König
Cc: Felix.Kuehling, Philip.Yang, Ray.Huang, alexander.deucher,
dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm
I agree with you that with many ranges, the probability of
cross-invalidation during sequential hmm_range_fault() calls
increases, and in an extreme scenario this could lead to excessive
retries. I had been focused on proving correctness and missed the
scalability issue.
I propose the following plan:
Add a retry limit similar to what DRM GPU SVM does with
DRM_GPUSVM_MAX_RETRIES. This bounds the worst case and may be
enough to make the current batch userptr implementation usable.
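
A minimal sketch of what I have in mind, with placeholder names
(fault_and_commit_batch() here stands in for the existing
update_invalid_user_pages() / confirm step discussed earlier, and the
constant is just an example value), not the actual code of this series:

  #define KFD_USERPTR_BATCH_MAX_RETRIES	3	/* placeholder value */

  static int validate_userptr_batch_bounded(struct kgd_mem *mem)
  {
  	int retries = 0;
  	int ret;

  	do {
  		/*
  		 * Fault all ranges, then re-check mem->invalid under the
  		 * notifier lock; -EAGAIN means another fault invalidated
  		 * pages that were already acquired.
  		 */
  		ret = fault_and_commit_batch(mem);
  	} while (ret == -EAGAIN &&
  		 ++retries < KFD_USERPTR_BATCH_MAX_RETRIES);

  	if (ret == -EAGAIN)
  		ret = -EBUSY;	/* bounded: give up instead of spinning */

  	return ret;
  }
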
And I agree that teaching walk_page_range() to handle
non-contiguous VA sets in a single walk would be the proper
long-term solution. That work would benefit more than just KFD
batch userptr. I will keep digging for a better solution.
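
As a starting point in that direction, something along these lines
might work: one walk over the whole [va_start, va_end) span where
test_walk skips sub-ranges the batch is not interested in and the pte
handler records pfns only for wanted addresses. This is only a rough
sketch: struct batch_walk_data, in_batch_range() and batch_index() are
placeholders, and the hard part (deciding where to fault non-present
entries, plus the notifier seq handling) is not shown:

  static int batch_test_walk(unsigned long start, unsigned long end,
  			   struct mm_walk *walk)
  {
  	struct batch_walk_data *data = walk->private;

  	/* a non-zero return skips this sub-range entirely */
  	return in_batch_range(data, start, end) ? 0 : 1;
  }

  static int batch_pte_entry(pte_t *pte, unsigned long addr,
  			   unsigned long next, struct mm_walk *walk)
  {
  	struct batch_walk_data *data = walk->private;

  	if (in_batch_range(data, addr, addr + PAGE_SIZE))
  		data->pfns[batch_index(data, addr)] = pte_pfn(*pte);

  	return 0;
  }

  static const struct mm_walk_ops batch_walk_ops = {
  	.test_walk	= batch_test_walk,
  	.pte_entry	= batch_pte_entry,
  };

  	/* one walk over the whole span instead of one
  	 * hmm_range_fault() call per scattered range
  	 */
  	mmap_read_lock(mm);
  	ret = walk_page_range(mm, va_start, va_end, &batch_walk_ops, &data);
  	mmap_read_unlock(mm);
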
Regards,
Honglei
On 2026/2/9 23:07, Christian König wrote:
> On 2/9/26 15:44, Honglei Huang wrote:
>> you said that DRM GPU SVM has the same pattern, but argued
>> that it is not designed for "batch userptr". However, this distinction
>> has no technical significance. The core problem is "multiple ranges
>> under one wide notifier doing per-range hmm_range_fault". Whether
>> these ranges are dynamically created by GPU page faults or
>> batch-specified via ioctl, the concurrency safety mechanism is
>> same.
>>
>> You said "each hmm_range_fault() can invalidate the other ranges
>> while faulting them in". Yes, this can happen but this is precisely
>> the scenario that mem->invalid catches:
>>
>> 1. hmm_range_fault(A) succeeds
>> 2. hmm_range_fault(B) triggers reclaim → A's pages swapped out
>> → MMU notifier callback:
>> mutex_lock(notifier_lock)
>> range_A->valid = false
>> mem->invalid++
>> mutex_unlock(notifier_lock)
>> 3. hmm_range_fault(B) completes
>> 4. Commit phase:
>> mutex_lock(notifier_lock)
>> mem->invalid != saved_invalid
>> → return -EAGAIN, retry entire batch
>> mutex_unlock(notifier_lock)
>>
>> invalid pages are never committed.
>
> Once more that is not the problem. I completely agree that this is all correctly handled.
>
> The problem is that the more hmm_ranges you get the more likely it is that getting another pfn invalidates a pfn you previously acquired.
>
> So this can end up in an endless loop, and that's why the GPUSVM code also has a timeout on the retry.
>
>
> What you need to figure out is how to teach hmm_range_fault() and the underlying walk_page_range() how to skip entries which you are not interested in.
>
> Just a trivial example, assuming you have the following VAs you want your userptr to be filled in with: 3, 1, 5, 8, 7, 2
>
> To handle this case you need to build a data structure which tells you what is the smalest, largest and where each VA in the middle comes in. So you need something like: 1->1, 2->5, 3->0, 5->2, 7->4, 8->3
>
> Then you would call walk_page_range(mm, 1, 8, ops, data), the pud walk decides if it needs to go into pmd or eventually fault, the pmd walk decides if ptes needs to be filled in etc...
>
> The final pte handler then fills in the pfns linearly for the addresses you need.
>
> And yeah I perfectly know that this is horrible complicated, but as far as I can see everything else will just not scale.
>
> Creating hundreds of separate userptrs only scales up to a few megabyte and then falls apart.
>
> Regards,
> Christian.
>
>>
>> Regards,
>> Honglei
>>
>>
>> On 2026/2/9 22:25, Christian König wrote:
>>> On 2/9/26 15:16, Honglei Huang wrote:
>>>> The case you described: one hmm_range_fault() invalidating another's
>>>> seq under the same notifier, is already handled in the implementation.
>>>>
>>>> example: suppose ranges A, B, C share one notifier:
>>>>
>>>> 1. hmm_range_fault(A) succeeds, seq_A recorded
>>>> 2. External invalidation occurs, triggers callback:
>>>> mutex_lock(notifier_lock)
>>>> → mmu_interval_set_seq()
>>>> → range_A->valid = false
>>>> → mem->invalid++
>>>> mutex_unlock(notifier_lock)
>>>> 3. hmm_range_fault(B) succeeds
>>>> 4. Commit phase:
>>>> mutex_lock(notifier_lock)
>>>> → check mem->invalid != saved_invalid
>>>> → return -EAGAIN, retry the entire batch
>>>> mutex_unlock(notifier_lock)
>>>>
>>>> All concurrent invalidations are caught by the mem->invalid counter.
>>>> Additionally, amdgpu_ttm_tt_get_user_pages_done() in confirm_valid_user_pages_locked
>>>> performs a per-range mmu_interval_read_retry() as a final safety check.
>>>>
>>>> DRM GPU SVM uses the same approach: drm_gpusvm_get_pages() also calls
>>>> hmm_range_fault() per-range independently there is no array version
>>>> of hmm_range_fault in DRM GPU SVM either. If you consider this approach
>>>> unworkable, then DRM GPU SVM would be unworkable too, yet it has been
>>>> accepted upstream.
>>>>
>>>> The number of batch ranges is controllable. And even if it
>>>> scales to thousands, DRM GPU SVM faces exactly the same situation:
>>>> it does not need an array version of hmm_range_fault either, which
>>>> shows this is a correctness question, not a performance one. For
>>>> correctness, I believe DRM GPU SVM already demonstrates the approach
>>>> is ok.
>>>
>>> Well yes, GPU SVM would have exactly the same problems. But that also doesn't have a create bulk userptr interface.
>>>
>>> The implementation is simply not made for this use case, and as far as I know no current upstream implementation is.
>>>
>>>> For performance, I have tested with thousands of ranges present:
>>>> performance reaches 80%~95% of the native driver, and all OpenCL
>>>> and ROCr test suites pass with no correctness issues.
>>>
>>> Testing can only falsify a system and not verify it.
>>>
>>>> Here is how DRM GPU SVM handles correctness with multiple ranges
>>>> under one wide notifier doing per-range hmm_range_fault:
>>>>
>>>> Invalidation: drm_gpusvm_notifier_invalidate()
>>>> - Acquires notifier_lock
>>>> - Calls mmu_interval_set_seq()
>>>> - Iterates affected ranges via driver callback (xe_svm_invalidate)
>>>> - Clears has_dma_mapping = false for each affected range (under lock)
>>>> - Releases notifier_lock
>>>>
>>>> Fault: drm_gpusvm_get_pages() (called per-range independently)
>>>> - mmu_interval_read_begin() to get seq
>>>> - hmm_range_fault() outside lock
>>>> - Acquires notifier_lock
>>>> - mmu_interval_read_retry() → if stale, release lock and retry
>>>> - DMA map pages + set has_dma_mapping = true (under lock)
>>>> - Releases notifier_lock
>>>>
>>>> Validation: drm_gpusvm_pages_valid()
>>>> - Checks has_dma_mapping flag (under lock), NOT seq
>>>>
>>>> If invalidation occurs between two per-range faults, the flag is
>>>> cleared under lock, and either mmu_interval_read_retry catches it
>>>> in the current fault, or drm_gpusvm_pages_valid() catches it at
>>>> validation time. No stale pages are ever committed.
>>>>
>>>> KFD batch userptr uses the same three-step pattern:
>>>>
>>>> Invalidation: amdgpu_amdkfd_evict_userptr_batch()
>>>> - Acquires notifier_lock
>>>> - Calls mmu_interval_set_seq()
>>>> - Iterates affected ranges via interval_tree
>>>> - Sets range->valid = false for each affected range (under lock)
>>>> - Increments mem->invalid (under lock)
>>>> - Releases notifier_lock
>>>>
>>>> Fault: update_invalid_user_pages()
>>>> - Per-range hmm_range_fault() outside lock
>>>
>>> And here the idea falls apart. Each hmm_range_fault() can invalidate the other ranges while faulting them in.
>>>
>>> That is not fundamentally solveable, but by moving the handling further into hmm_range_fault it makes it much less likely that something goes wrong.
>>>
>>> So once more as long as this still uses this hacky approach I will clearly reject this implementation.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> - Acquires notifier_lock
>>>> - Checks mem->invalid != saved_invalid → if changed, -EAGAIN retry
>>>> - Sets range->valid = true for faulted ranges (under lock)
>>>> - Releases notifier_lock
>>>>
>>>> Validation: valid_user_pages_batch()
>>>> - Checks range->valid flag
>>>> - Calls amdgpu_ttm_tt_get_user_pages_done() (mmu_interval_read_retry)
>>>>
>>>> The logic is equivalent as far as I can see.
>>>>
>>>> Regards,
>>>> Honglei
>>>>
>>>>
>>>>
>>>> On 2026/2/9 21:27, Christian König wrote:
>>>>> On 2/9/26 14:11, Honglei Huang wrote:
>>>>>>
>>>>>> So the drm svm is also a NAK?
>>>>>>
>>>>>> These codes have passed local testing, opencl and rocr, I also provided a detailed code path and analysis.
>>>>>> You only said the conclusion without providing any reasons or evidence. Your statement has no justifiable reasons and is difficult to convince
>>>>>> so far.
>>>>>
>>>>> That sounds like you don't understand what the issue here is, I will try to explain this once more on pseudo-code.
>>>>>
>>>>> Page tables are updated without holding a lock, so when you want to grab physical addresses from the then you need to use an opportunistically retry based approach to make sure that the data you got is still valid.
>>>>>
>>>>> In other words something like this here is needed:
>>>>>
>>>>> retry:
>>>>> hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
>>>>> hmm_range.hmm_pfns = kvmalloc_array(npages, ...);
>>>>> ...
>>>>> while (true) {
>>>>> mmap_read_lock(mm);
>>>>> err = hmm_range_fault(&hmm_range);
>>>>> mmap_read_unlock(mm);
>>>>>
>>>>> if (err == -EBUSY) {
>>>>> if (time_after(jiffies, timeout))
>>>>> break;
>>>>>
>>>>> hmm_range.notifier_seq =
>>>>> mmu_interval_read_begin(notifier);
>>>>> continue;
>>>>> }
>>>>> break;
>>>>> }
>>>>> ...
>>>>> for (i = 0, j = 0; i < npages; ++j) {
>>>>> ...
>>>>> dma_map_page(...)
>>>>> ...
>>>>> grab_notifier_lock();
>>>>> if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq))
>>>>> goto retry;
>>>>> restart_queues();
>>>>> drop_notifier_lock();
>>>>> ...
>>>>>
>>>>> Now hmm_range.notifier_seq indicates if your DMA addresses are still valid or not after you grabbed the notifier lock.
>>>>>
>>>>> The problem is that hmm_range works only on a single range/sequence combination, so when you do multiple calls to hmm_range_fault() for scattered VA is can easily be that one call invalidates the ranges of another call.
>>>>>
>>>>> So as long as you only have a few hundred hmm_ranges for your userptrs that kind of works, but it doesn't scale up into the thousands of different VA addresses you get for scattered handling.
>>>>>
>>>>> That's why hmm_range_fault needs to be modified to handle an array of VA addresses instead of just a A..B range.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>
>>>>>>
>>>>>> On 2026/2/9 20:59, Christian König wrote:
>>>>>>> On 2/9/26 13:52, Honglei Huang wrote:
>>>>>>>> DRM GPU SVM does use hmm_range_fault(), see drm_gpusvm_get_pages()
>>>>>>>
>>>>>>> I'm not sure what you are talking about, drm_gpusvm_get_pages() only supports a single range as well and not scatter gather of VA addresses.
>>>>>>>
>>>>>>> As far as I can see that doesn't help the slightest.
>>>>>>>
>>>>>>>> My implementation follows the same pattern. The detailed comparison
>>>>>>>> of invalidation path was provided in the second half of my previous mail.
>>>>>>>
>>>>>>> Yeah and as I said that is not very valuable because it doesn't solves the sequence problem.
>>>>>>>
>>>>>>> As far as I can see the approach you try here is a clear NAK from my side.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>> On 2026/2/9 18:16, Christian König wrote:
>>>>>>>>> On 2/9/26 07:14, Honglei Huang wrote:
>>>>>>>>>>
>>>>>>>>>> I've reworked the implementation in v4. The fix is actually inspired
>>>>>>>>>> by the DRM GPU SVM framework (drivers/gpu/drm/drm_gpusvm.c).
>>>>>>>>>>
>>>>>>>>>> DRM GPU SVM uses wide notifiers (recommended 512M or larger) to track
>>>>>>>>>> multiple user virtual address ranges under a single mmu_interval_notifier,
>>>>>>>>>> and these ranges can be non-contiguous which is essentially the same
>>>>>>>>>> problem that batch userptr needs to solve: one BO backed by multiple
>>>>>>>>>> non-contiguous CPU VA ranges sharing one notifier.
>>>>>>>>>
>>>>>>>>> That still doesn't solve the sequencing problem.
>>>>>>>>>
>>>>>>>>> As far as I can see you can't use hmm_range_fault with this approach or it would just not be very valuable.
>>>>>>>>>
>>>>>>>>> So how should that work with your patch set?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The wide notifier is created in drm_gpusvm_notifier_alloc:
>>>>>>>>>> notifier->itree.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
>>>>>>>>>> notifier->itree.last = ALIGN(fault_addr + 1, gpusvm->notifier_size) - 1;
>>>>>>>>>> The Xe driver passes
>>>>>>>>>> xe_modparam.svm_notifier_size * SZ_1M in xe_svm_init
>>>>>>>>>> as the notifier_size, so one notifier can cover many of MB of VA space
>>>>>>>>>> containing multiple non-contiguous ranges.
>>>>>>>>>>
>>>>>>>>>> And DRM GPU SVM solves the per-range validity problem with flag-based
>>>>>>>>>> validation instead of seq-based validation in:
>>>>>>>>>> - drm_gpusvm_pages_valid() checks
>>>>>>>>>> flags.has_dma_mapping
>>>>>>>>>> not notifier_seq. The comment explicitly states:
>>>>>>>>>> "This is akin to a notifier seqno check in the HMM documentation
>>>>>>>>>> but due to wider notifiers (i.e., notifiers which span multiple
>>>>>>>>>> ranges) this function is required for finer grained checking"
>>>>>>>>>> - __drm_gpusvm_unmap_pages() clears
>>>>>>>>>> flags.has_dma_mapping = false under notifier_lock
>>>>>>>>>> - drm_gpusvm_get_pages() sets
>>>>>>>>>> flags.has_dma_mapping = true under notifier_lock
>>>>>>>>>> I adopted the same approach.
>>>>>>>>>>
>>>>>>>>>> DRM GPU SVM:
>>>>>>>>>> drm_gpusvm_notifier_invalidate()
>>>>>>>>>> down_write(&gpusvm->notifier_lock);
>>>>>>>>>> mmu_interval_set_seq(mni, cur_seq);
>>>>>>>>>> gpusvm->ops->invalidate()
>>>>>>>>>> -> xe_svm_invalidate()
>>>>>>>>>> drm_gpusvm_for_each_range()
>>>>>>>>>> -> __drm_gpusvm_unmap_pages()
>>>>>>>>>> WRITE_ONCE(flags.has_dma_mapping = false); // clear flag
>>>>>>>>>> up_write(&gpusvm->notifier_lock);
>>>>>>>>>>
>>>>>>>>>> KFD batch userptr:
>>>>>>>>>> amdgpu_amdkfd_evict_userptr_batch()
>>>>>>>>>> mutex_lock(&process_info->notifier_lock);
>>>>>>>>>> mmu_interval_set_seq(mni, cur_seq);
>>>>>>>>>> discard_invalid_ranges()
>>>>>>>>>> interval_tree_iter_first/next()
>>>>>>>>>> range_info->valid = false; // clear flag
>>>>>>>>>> mutex_unlock(&process_info->notifier_lock);
>>>>>>>>>>
>>>>>>>>>> Both implementations:
>>>>>>>>>> - Acquire notifier_lock FIRST, before any flag changes
>>>>>>>>>> - Call mmu_interval_set_seq() under the lock
>>>>>>>>>> - Use interval tree to find affected ranges within the wide notifier
>>>>>>>>>> - Mark per-range flag as invalid/valid under the lock
>>>>>>>>>>
>>>>>>>>>> The page fault path and final validation path also follow the same
>>>>>>>>>> pattern as DRM GPU SVM: fault outside the lock, set/check per-range
>>>>>>>>>> flag under the lock.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Honglei
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2026/2/6 21:56, Christian König wrote:
>>>>>>>>>>> On 2/6/26 07:25, Honglei Huang wrote:
>>>>>>>>>>>> From: Honglei Huang <honghuan@amd.com>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> This is v3 of the patch series to support allocating multiple non-contiguous
>>>>>>>>>>>> CPU virtual address ranges that map to a single contiguous GPU virtual address.
>>>>>>>>>>>>
>>>>>>>>>>>> v3:
>>>>>>>>>>>> 1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>>>>>>>>>>>> - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>>>>>>>>>>>
>>>>>>>>>>> That is most likely not the best approach, but Felix or Philip need to comment here since I don't know such IOCTLs well either.
>>>>>>>>>>>
>>>>>>>>>>>> - When flag is set, mmap_offset field points to range array
>>>>>>>>>>>> - Minimal API surface change
>>>>>>>>>>>
>>>>>>>>>>> Why range of VA space for each entry?
>>>>>>>>>>>
>>>>>>>>>>>> 2. Improved MMU notifier handling:
>>>>>>>>>>>> - Single mmu_interval_notifier covering the VA span [va_min, va_max]
>>>>>>>>>>>> - Interval tree for efficient lookup of affected ranges during invalidation
>>>>>>>>>>>> - Avoids per-range notifier overhead mentioned in v2 review
>>>>>>>>>>>
>>>>>>>>>>> That won't work unless you also modify hmm_range_fault() to take multiple VA addresses (or ranges) at the same time.
>>>>>>>>>>>
>>>>>>>>>>> The problem is that we must rely on hmm_range.notifier_seq to detect changes to the page tables in question, but that in turn works only if you have one hmm_range structure and not multiple.
>>>>>>>>>>>
>>>>>>>>>>> What might work is doing an XOR or CRC over all hmm_range.notifier_seq you have, but that is a bit flaky.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 3. Better code organization: Split into 8 focused patches for easier review
>>>>>>>>>>>>
>>>>>>>>>>>> v2:
>>>>>>>>>>>> - Each CPU VA range gets its own mmu_interval_notifier for invalidation
>>>>>>>>>>>> - All ranges validated together and mapped to contiguous GPU VA
>>>>>>>>>>>> - Single kgd_mem object with array of user_range_info structures
>>>>>>>>>>>> - Unified eviction/restore path for all ranges in a batch
>>>>>>>>>>>>
>>>>>>>>>>>> Current Implementation Approach
>>>>>>>>>>>> ===============================
>>>>>>>>>>>>
>>>>>>>>>>>> This series implements a practical solution within existing kernel constraints:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Single MMU notifier for VA span: Register one notifier covering the
>>>>>>>>>>>> entire range from lowest to highest address in the batch
>>>>>>>>>>>>
>>>>>>>>>>>> 2. Interval tree filtering: Use interval tree to efficiently identify
>>>>>>>>>>>> which specific ranges are affected during invalidation callbacks,
>>>>>>>>>>>> avoiding unnecessary processing for unrelated address changes
>>>>>>>>>>>>
>>>>>>>>>>>> 3. Unified eviction/restore: All ranges in a batch share eviction and
>>>>>>>>>>>> restore paths, maintaining consistency with existing userptr handling
>>>>>>>>>>>>
>>>>>>>>>>>> Patch Series Overview
>>>>>>>>>>>> =====================
>>>>>>>>>>>>
>>>>>>>>>>>> Patch 1/8: Add userptr batch allocation UAPI structures
>>>>>>>>>>>> - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
>>>>>>>>>>>> - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data structures
>>>>>>>>>>>>
>>>>>>>>>>>> Patch 2/8: Add user_range_info infrastructure to kgd_mem
>>>>>>>>>>>> - user_range_info structure for per-range tracking
>>>>>>>>>>>> - Fields for batch allocation in kgd_mem
>>>>>>>>>>>>
>>>>>>>>>>>> Patch 3/8: Implement interval tree for userptr ranges
>>>>>>>>>>>> - Interval tree for efficient range lookup during invalidation
>>>>>>>>>>>> - mark_invalid_ranges() function
>>>>>>>>>>>>
>>>>>>>>>>>> Patch 4/8: Add batch MMU notifier support
>>>>>>>>>>>> - Single notifier for entire VA span
>>>>>>>>>>>> - Invalidation callback using interval tree filtering
>>>>>>>>>>>>
>>>>>>>>>>>> Patch 5/8: Implement batch userptr page management
>>>>>>>>>>>> - get_user_pages_batch() and set_user_pages_batch()
>>>>>>>>>>>> - Per-range page array management
>>>>>>>>>>>>
>>>>>>>>>>>> Patch 6/8: Add batch allocation function and export API
>>>>>>>>>>>> - init_user_pages_batch() main initialization
>>>>>>>>>>>> - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point
>>>>>>>>>>>>
>>>>>>>>>>>> Patch 7/8: Unify userptr cleanup and update paths
>>>>>>>>>>>> - Shared eviction/restore handling for batch allocations
>>>>>>>>>>>> - Integration with existing userptr validation flows
>>>>>>>>>>>>
>>>>>>>>>>>> Patch 8/8: Wire up batch allocation in ioctl handler
>>>>>>>>>>>> - Input validation and range array parsing
>>>>>>>>>>>> - Integration with existing alloc_memory_of_gpu path
>>>>>>>>>>>>
>>>>>>>>>>>> Testing
>>>>>>>>>>>> =======
>>>>>>>>>>>>
>>>>>>>>>>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>>>>>>>>>>> - Various allocation sizes (4KB to 1G+ per range)
>>>>>>>>>>>> - Memory pressure scenarios and eviction/restore cycles
>>>>>>>>>>>> - OpenCL CTS and HIP catch tests in KVM guest environment
>>>>>>>>>>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>>>>>>>>>>>> - Small LLM inference (3B-7B models)
>>>>>>>>>>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>>>>>>>>>>> - Performance improvement: 2x-2.4x faster than userspace approach
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you for your review and feedback.
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Honglei Huang
>>>>>>>>>>>>
>>>>>>>>>>>> Honglei Huang (8):
>>>>>>>>>>>> drm/amdkfd: Add userptr batch allocation UAPI structures
>>>>>>>>>>>> drm/amdkfd: Add user_range_info infrastructure to kgd_mem
>>>>>>>>>>>> drm/amdkfd: Implement interval tree for userptr ranges
>>>>>>>>>>>> drm/amdkfd: Add batch MMU notifier support
>>>>>>>>>>>> drm/amdkfd: Implement batch userptr page management
>>>>>>>>>>>> drm/amdkfd: Add batch allocation function and export API
>>>>>>>>>>>> drm/amdkfd: Unify userptr cleanup and update paths
>>>>>>>>>>>> drm/amdkfd: Wire up batch allocation in ioctl handler
>>>>>>>>>>>>
>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 23 +
>>>>>>>>>>>> .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 539 +++++++++++++++++-
>>>>>>>>>>>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 128 ++++-
>>>>>>>>>>>> include/uapi/linux/kfd_ioctl.h | 31 +-
>>>>>>>>>>>> 4 files changed, 697 insertions(+), 24 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support
2026-02-09 15:46 ` Honglei Huang
@ 2026-02-09 17:37 ` Christian König
0 siblings, 0 replies; 22+ messages in thread
From: Christian König @ 2026-02-09 17:37 UTC (permalink / raw)
To: Honglei Huang
Cc: Felix.Kuehling, Philip.Yang, Ray.Huang, alexander.deucher,
dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm
On 2/9/26 16:46, Honglei Huang wrote:
> Agreed with you that with many ranges, the probability of
> cross-invalidation during sequential hmm_range_fault() calls
> increases, and in a extreme scenario this could lead to excessive
> retries. I had been focused on proving correctness and missed the
> scalability.
>
> I propose the further plan:
>
> Will add a retry limit similar to what DRM GPU SVM does with
> DRM_GPUSVM_MAX_RETRIES. This bounds the worst case.
> This maybe ok to make the current batch userptr usable.
Rather make that a wall clock timeout.
You will also need a limit on the amount of memory mapped through this; my educated guess is that something around 4MiB (1024 userptrs) should be the limit.
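
Roughly like this; the numbers and the fault_and_commit_batch()
helper are only placeholders for illustration:

  	unsigned long timeout;

  	/* cap the batch size up front, exact numbers to be decided */
  	if (num_ranges > 1024 || total_size > SZ_4M)
  		return -EINVAL;

  	/* bound the whole retry loop by wall clock, not a retry counter */
  	timeout = jiffies + msecs_to_jiffies(1000);
  	while (fault_and_commit_batch(mem) == -EAGAIN) {
  		if (time_after(jiffies, timeout))
  			return -EBUSY;
  	}
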
> And I agree that teaching walk_page_range() to handle
> non-contiguous VA sets in a single walk would be the proper
> long-term solution. That work would benefit not only KFD batch
> userptr. Will keep digging out the better solution.
We have already had issues with applications which, unaware of the overhead, tried to use userptrs for all memory allocations.
If I'm not completely mistaken, we also have a limit of roughly 256MiB that you can map through this without running into massive problems on the graphics side.
So that will clearly not work even in the short term.
Regards,
Christian.
>
> Regards,
> Honglei
>
> On 2026/2/9 23:07, Christian König wrote:
>> On 2/9/26 15:44, Honglei Huang wrote:
>>> you said that DRM GPU SVM has the same pattern, but argued
>>> that it is not designed for "batch userptr". However, this distinction
>>> has no technical significance. The core problem is "multiple ranges
>>> under one wide notifier doing per-range hmm_range_fault". Whether
>>> these ranges are dynamically created by GPU page faults or
>>> batch-specified via ioctl, the concurrency safety mechanism is
>>> same.
>>>
>>> You said "each hmm_range_fault() can invalidate the other ranges
>>> while faulting them in". Yes, this can happen but this is precisely
>>> the scenario that mem->invalid catches:
>>>
>>> 1. hmm_range_fault(A) succeeds
>>> 2. hmm_range_fault(B) triggers reclaim → A's pages swapped out
>>> → MMU notifier callback:
>>> mutex_lock(notifier_lock)
>>> range_A->valid = false
>>> mem->invalid++
>>> mutex_unlock(notifier_lock)
>>> 3. hmm_range_fault(B) completes
>>> 4. Commit phase:
>>> mutex_lock(notifier_lock)
>>> mem->invalid != saved_invalid
>>> → return -EAGAIN, retry entire batch
>>> mutex_unlock(notifier_lock)
>>>
>>> invalid pages are never committed.
>>
>> Once more that is not the problem. I completely agree that this is all correctly handled.
>>
>> The problem is that the more hmm_ranges you get the more likely it is that getting another pfn invalidates a pfn you previously acquired.
>>
>> So this can end up in an endless loop, and that's why the GPUSVM code also has a timeout on the retry.
>>
>>
>> What you need to figure out is how to teach hmm_range_fault() and the underlying walk_page_range() how to skip entries which you are not interested in.
>>
>> Just a trivial example, assuming you have the following VAs you want your userptr to be filled in with: 3, 1, 5, 8, 7, 2
>>
>> To handle this case you need to build a data structure which tells you what is the smalest, largest and where each VA in the middle comes in. So you need something like: 1->1, 2->5, 3->0, 5->2, 7->4, 8->3
>>
>> Then you would call walk_page_range(mm, 1, 8, ops, data), the pud walk decides if it needs to go into pmd or eventually fault, the pmd walk decides if ptes needs to be filled in etc...
>>
>> The final pte handler then fills in the pfns linearly for the addresses you need.
>>
>> And yeah I perfectly know that this is horrible complicated, but as far as I can see everything else will just not scale.
>>
>> Creating hundreds of separate userptrs only scales up to a few megabyte and then falls apart.
>>
>> Regards,
>> Christian.
>>
>>>
>>> Regards,
>>> Honglei
>>>
>>>
>>> On 2026/2/9 22:25, Christian König wrote:
>>>> On 2/9/26 15:16, Honglei Huang wrote:
>>>>> The case you described: one hmm_range_fault() invalidating another's
>>>>> seq under the same notifier, is already handled in the implementation.
>>>>>
>>>>> example: suppose ranges A, B, C share one notifier:
>>>>>
>>>>> 1. hmm_range_fault(A) succeeds, seq_A recorded
>>>>> 2. External invalidation occurs, triggers callback:
>>>>> mutex_lock(notifier_lock)
>>>>> → mmu_interval_set_seq()
>>>>> → range_A->valid = false
>>>>> → mem->invalid++
>>>>> mutex_unlock(notifier_lock)
>>>>> 3. hmm_range_fault(B) succeeds
>>>>> 4. Commit phase:
>>>>> mutex_lock(notifier_lock)
>>>>> → check mem->invalid != saved_invalid
>>>>> → return -EAGAIN, retry the entire batch
>>>>> mutex_unlock(notifier_lock)
>>>>>
>>>>> All concurrent invalidations are caught by the mem->invalid counter.
>>>>> Additionally, amdgpu_ttm_tt_get_user_pages_done() in confirm_valid_user_pages_locked
>>>>> performs a per-range mmu_interval_read_retry() as a final safety check.
>>>>>
>>>>> DRM GPU SVM uses the same approach: drm_gpusvm_get_pages() also calls
>>>>> hmm_range_fault() per-range independently there is no array version
>>>>> of hmm_range_fault in DRM GPU SVM either. If you consider this approach
>>>>> unworkable, then DRM GPU SVM would be unworkable too, yet it has been
>>>>> accepted upstream.
>>>>>
>>>>> The number of batch ranges is controllable. And even if it
>>>>> scales to thousands, DRM GPU SVM faces exactly the same situation:
>>>>> it does not need an array version of hmm_range_fault either, which
>>>>> shows this is a correctness question, not a performance one. For
>>>>> correctness, I believe DRM GPU SVM already demonstrates the approach
>>>>> is ok.
>>>>
>>>> Well yes, GPU SVM would have exactly the same problems. But that also doesn't have a create bulk userptr interface.
>>>>
>>>> The implementation is simply not made for this use case, and as far as I know no current upstream implementation is.
>>>>
>>>>> For performance, I have tested with thousands of ranges present:
>>>>> performance reaches 80%~95% of the native driver, and all OpenCL
>>>>> and ROCr test suites pass with no correctness issues.
>>>>
>>>> Testing can only falsify a system and not verify it.
>>>>
>>>>> Here is how DRM GPU SVM handles correctness with multiple ranges
>>>>> under one wide notifier doing per-range hmm_range_fault:
>>>>>
>>>>> Invalidation: drm_gpusvm_notifier_invalidate()
>>>>> - Acquires notifier_lock
>>>>> - Calls mmu_interval_set_seq()
>>>>> - Iterates affected ranges via driver callback (xe_svm_invalidate)
>>>>> - Clears has_dma_mapping = false for each affected range (under lock)
>>>>> - Releases notifier_lock
>>>>>
>>>>> Fault: drm_gpusvm_get_pages() (called per-range independently)
>>>>> - mmu_interval_read_begin() to get seq
>>>>> - hmm_range_fault() outside lock
>>>>> - Acquires notifier_lock
>>>>> - mmu_interval_read_retry() → if stale, release lock and retry
>>>>> - DMA map pages + set has_dma_mapping = true (under lock)
>>>>> - Releases notifier_lock
>>>>>
>>>>> Validation: drm_gpusvm_pages_valid()
>>>>> - Checks has_dma_mapping flag (under lock), NOT seq
>>>>>
>>>>> If invalidation occurs between two per-range faults, the flag is
>>>>> cleared under lock, and either mmu_interval_read_retry catches it
>>>>> in the current fault, or drm_gpusvm_pages_valid() catches it at
>>>>> validation time. No stale pages are ever committed.
>>>>>
>>>>> KFD batch userptr uses the same three-step pattern:
>>>>>
>>>>> Invalidation: amdgpu_amdkfd_evict_userptr_batch()
>>>>> - Acquires notifier_lock
>>>>> - Calls mmu_interval_set_seq()
>>>>> - Iterates affected ranges via interval_tree
>>>>> - Sets range->valid = false for each affected range (under lock)
>>>>> - Increments mem->invalid (under lock)
>>>>> - Releases notifier_lock
>>>>>
>>>>> Fault: update_invalid_user_pages()
>>>>> - Per-range hmm_range_fault() outside lock
>>>>
>>>> And here the idea falls apart. Each hmm_range_fault() can invalidate the other ranges while faulting them in.
>>>>
>>>> That is not fundamentally solveable, but by moving the handling further into hmm_range_fault it makes it much less likely that something goes wrong.
>>>>
>>>> So once more as long as this still uses this hacky approach I will clearly reject this implementation.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> - Acquires notifier_lock
>>>>> - Checks mem->invalid != saved_invalid → if changed, -EAGAIN retry
>>>>> - Sets range->valid = true for faulted ranges (under lock)
>>>>> - Releases notifier_lock
>>>>>
>>>>> Validation: valid_user_pages_batch()
>>>>> - Checks range->valid flag
>>>>> - Calls amdgpu_ttm_tt_get_user_pages_done() (mmu_interval_read_retry)
>>>>>
>>>>> The logic is equivalent as far as I can see.
>>>>>
>>>>> Regards,
>>>>> Honglei
>>>>>
>>>>>
>>>>>
>>>>> On 2026/2/9 21:27, Christian König wrote:
>>>>>> On 2/9/26 14:11, Honglei Huang wrote:
>>>>>>>
>>>>>>> So the drm svm is also a NAK?
>>>>>>>
>>>>>>> These codes have passed local testing, opencl and rocr, I also provided a detailed code path and analysis.
>>>>>>> You only said the conclusion without providing any reasons or evidence. Your statement has no justifiable reasons and is difficult to convince
>>>>>>> so far.
>>>>>>
>>>>>> That sounds like you don't understand what the issue here is, I will try to explain this once more on pseudo-code.
>>>>>>
>>>>>> Page tables are updated without holding a lock, so when you want to grab physical addresses from the then you need to use an opportunistically retry based approach to make sure that the data you got is still valid.
>>>>>>
>>>>>> In other words something like this here is needed:
>>>>>>
>>>>>> retry:
>>>>>> hmm_range.notifier_seq = mmu_interval_read_begin(notifier);
>>>>>> hmm_range.hmm_pfns = kvmalloc_array(npages, ...);
>>>>>> ...
>>>>>> while (true) {
>>>>>> mmap_read_lock(mm);
>>>>>> err = hmm_range_fault(&hmm_range);
>>>>>> mmap_read_unlock(mm);
>>>>>>
>>>>>> if (err == -EBUSY) {
>>>>>> if (time_after(jiffies, timeout))
>>>>>> break;
>>>>>>
>>>>>> hmm_range.notifier_seq =
>>>>>> mmu_interval_read_begin(notifier);
>>>>>> continue;
>>>>>> }
>>>>>> break;
>>>>>> }
>>>>>> ...
>>>>>> for (i = 0, j = 0; i < npages; ++j) {
>>>>>> ...
>>>>>> dma_map_page(...)
>>>>>> ...
>>>>>> grab_notifier_lock();
>>>>>> if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq))
>>>>>> goto retry;
>>>>>> restart_queues();
>>>>>> drop_notifier_lock();
>>>>>> ...
>>>>>>
>>>>>> Now hmm_range.notifier_seq indicates if your DMA addresses are still valid or not after you grabbed the notifier lock.
>>>>>>
>>>>>> The problem is that hmm_range works only on a single range/sequence combination, so when you do multiple calls to hmm_range_fault() for scattered VA is can easily be that one call invalidates the ranges of another call.
>>>>>>
>>>>>> So as long as you only have a few hundred hmm_ranges for your userptrs that kind of works, but it doesn't scale up into the thousands of different VA addresses you get for scattered handling.
>>>>>>
>>>>>> That's why hmm_range_fault needs to be modified to handle an array of VA addresses instead of just a A..B range.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> On 2026/2/9 20:59, Christian König wrote:
>>>>>>>> On 2/9/26 13:52, Honglei Huang wrote:
>>>>>>>>> DRM GPU SVM does use hmm_range_fault(), see drm_gpusvm_get_pages()
>>>>>>>>
>>>>>>>> I'm not sure what you are talking about, drm_gpusvm_get_pages() only supports a single range as well and not scatter gather of VA addresses.
>>>>>>>>
>>>>>>>> As far as I can see that doesn't help the slightest.
>>>>>>>>
>>>>>>>>> My implementation follows the same pattern. The detailed comparison
>>>>>>>>> of invalidation path was provided in the second half of my previous mail.
>>>>>>>>
>>>>>>>> Yeah and as I said that is not very valuable because it doesn't solves the sequence problem.
>>>>>>>>
>>>>>>>> As far as I can see the approach you try here is a clear NAK from my side.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2026/2/9 18:16, Christian König wrote:
>>>>>>>>>> On 2/9/26 07:14, Honglei Huang wrote:
>>>>>>>>>>>
>>>>>>>>>>> I've reworked the implementation in v4. The fix is actually inspired
>>>>>>>>>>> by the DRM GPU SVM framework (drivers/gpu/drm/drm_gpusvm.c).
>>>>>>>>>>>
>>>>>>>>>>> DRM GPU SVM uses wide notifiers (recommended 512M or larger) to track
>>>>>>>>>>> multiple user virtual address ranges under a single mmu_interval_notifier,
>>>>>>>>>>> and these ranges can be non-contiguous which is essentially the same
>>>>>>>>>>> problem that batch userptr needs to solve: one BO backed by multiple
>>>>>>>>>>> non-contiguous CPU VA ranges sharing one notifier.
>>>>>>>>>>
>>>>>>>>>> That still doesn't solve the sequencing problem.
>>>>>>>>>>
>>>>>>>>>> As far as I can see you can't use hmm_range_fault with this approach or it would just not be very valuable.
>>>>>>>>>>
>>>>>>>>>> So how should that work with your patch set?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The wide notifier is created in drm_gpusvm_notifier_alloc:
>>>>>>>>>>> notifier->itree.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
>>>>>>>>>>> notifier->itree.last = ALIGN(fault_addr + 1, gpusvm->notifier_size) - 1;
>>>>>>>>>>> The Xe driver passes
>>>>>>>>>>> xe_modparam.svm_notifier_size * SZ_1M in xe_svm_init
>>>>>>>>>>> as the notifier_size, so one notifier can cover many of MB of VA space
>>>>>>>>>>> containing multiple non-contiguous ranges.
>>>>>>>>>>>
>>>>>>>>>>> And DRM GPU SVM solves the per-range validity problem with flag-based
>>>>>>>>>>> validation instead of seq-based validation in:
>>>>>>>>>>> - drm_gpusvm_pages_valid() checks
>>>>>>>>>>> flags.has_dma_mapping
>>>>>>>>>>> not notifier_seq. The comment explicitly states:
>>>>>>>>>>> "This is akin to a notifier seqno check in the HMM documentation
>>>>>>>>>>> but due to wider notifiers (i.e., notifiers which span multiple
>>>>>>>>>>> ranges) this function is required for finer grained checking"
>>>>>>>>>>> - __drm_gpusvm_unmap_pages() clears
>>>>>>>>>>> flags.has_dma_mapping = false under notifier_lock
>>>>>>>>>>> - drm_gpusvm_get_pages() sets
>>>>>>>>>>> flags.has_dma_mapping = true under notifier_lock
>>>>>>>>>>> I adopted the same approach.
>>>>>>>>>>>
>>>>>>>>>>> DRM GPU SVM:
>>>>>>>>>>> drm_gpusvm_notifier_invalidate()
>>>>>>>>>>> down_write(&gpusvm->notifier_lock);
>>>>>>>>>>> mmu_interval_set_seq(mni, cur_seq);
>>>>>>>>>>> gpusvm->ops->invalidate()
>>>>>>>>>>> -> xe_svm_invalidate()
>>>>>>>>>>> drm_gpusvm_for_each_range()
>>>>>>>>>>> -> __drm_gpusvm_unmap_pages()
>>>>>>>>>>> WRITE_ONCE(flags.has_dma_mapping = false); // clear flag
>>>>>>>>>>> up_write(&gpusvm->notifier_lock);
>>>>>>>>>>>
>>>>>>>>>>> KFD batch userptr:
>>>>>>>>>>> amdgpu_amdkfd_evict_userptr_batch()
>>>>>>>>>>> mutex_lock(&process_info->notifier_lock);
>>>>>>>>>>> mmu_interval_set_seq(mni, cur_seq);
>>>>>>>>>>> discard_invalid_ranges()
>>>>>>>>>>> interval_tree_iter_first/next()
>>>>>>>>>>> range_info->valid = false; // clear flag
>>>>>>>>>>> mutex_unlock(&process_info->notifier_lock);
>>>>>>>>>>>
>>>>>>>>>>> Both implementations:
>>>>>>>>>>> - Acquire notifier_lock FIRST, before any flag changes
>>>>>>>>>>> - Call mmu_interval_set_seq() under the lock
>>>>>>>>>>> - Use interval tree to find affected ranges within the wide notifier
>>>>>>>>>>> - Mark per-range flag as invalid/valid under the lock
>>>>>>>>>>>
>>>>>>>>>>> The page fault path and final validation path also follow the same
>>>>>>>>>>> pattern as DRM GPU SVM: fault outside the lock, set/check per-range
>>>>>>>>>>> flag under the lock.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Honglei
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 2026/2/6 21:56, Christian König wrote:
>>>>>>>>>>>> On 2/6/26 07:25, Honglei Huang wrote:
>>>>>>>>>>>>> From: Honglei Huang <honghuan@amd.com>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is v3 of the patch series to support allocating multiple non-contiguous
>>>>>>>>>>>>> CPU virtual address ranges that map to a single contiguous GPU virtual address.
>>>>>>>>>>>>>
>>>>>>>>>>>>> v3:
>>>>>>>>>>>>> 1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>>>>>>>>>>>>> - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>>>>>>>>>>>>
>>>>>>>>>>>> That is most likely not the best approach, but Felix or Philip need to comment here since I don't know such IOCTLs well either.
>>>>>>>>>>>>
>>>>>>>>>>>>> - When flag is set, mmap_offset field points to range array
>>>>>>>>>>>>> - Minimal API surface change
>>>>>>>>>>>>
>>>>>>>>>>>> Why range of VA space for each entry?
>>>>>>>>>>>>
>>>>>>>>>>>>> 2. Improved MMU notifier handling:
>>>>>>>>>>>>> - Single mmu_interval_notifier covering the VA span [va_min, va_max]
>>>>>>>>>>>>> - Interval tree for efficient lookup of affected ranges during invalidation
>>>>>>>>>>>>> - Avoids per-range notifier overhead mentioned in v2 review
>>>>>>>>>>>>
>>>>>>>>>>>> That won't work unless you also modify hmm_range_fault() to take multiple VA addresses (or ranges) at the same time.
>>>>>>>>>>>>
>>>>>>>>>>>> The problem is that we must rely on hmm_range.notifier_seq to detect changes to the page tables in question, but that in turn works only if you have one hmm_range structure and not multiple.
>>>>>>>>>>>>
>>>>>>>>>>>> What might work is doing an XOR or CRC over all hmm_range.notifier_seq you have, but that is a bit flaky.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Christian.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 3. Better code organization: Split into 8 focused patches for easier review
>>>>>>>>>>>>>
>>>>>>>>>>>>> v2:
>>>>>>>>>>>>> - Each CPU VA range gets its own mmu_interval_notifier for invalidation
>>>>>>>>>>>>> - All ranges validated together and mapped to contiguous GPU VA
>>>>>>>>>>>>> - Single kgd_mem object with array of user_range_info structures
>>>>>>>>>>>>> - Unified eviction/restore path for all ranges in a batch
>>>>>>>>>>>>>
>>>>>>>>>>>>> Current Implementation Approach
>>>>>>>>>>>>> ===============================
>>>>>>>>>>>>>
>>>>>>>>>>>>> This series implements a practical solution within existing kernel constraints:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Single MMU notifier for VA span: Register one notifier covering the
>>>>>>>>>>>>> entire range from lowest to highest address in the batch
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. Interval tree filtering: Use interval tree to efficiently identify
>>>>>>>>>>>>> which specific ranges are affected during invalidation callbacks,
>>>>>>>>>>>>> avoiding unnecessary processing for unrelated address changes
>>>>>>>>>>>>>
>>>>>>>>>>>>> 3. Unified eviction/restore: All ranges in a batch share eviction and
>>>>>>>>>>>>> restore paths, maintaining consistency with existing userptr handling
>>>>>>>>>>>>>
>>>>>>>>>>>>> Patch Series Overview
>>>>>>>>>>>>> =====================
>>>>>>>>>>>>>
>>>>>>>>>>>>> Patch 1/8: Add userptr batch allocation UAPI structures
>>>>>>>>>>>>> - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
>>>>>>>>>>>>> - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data structures
>>>>>>>>>>>>>
>>>>>>>>>>>>> Patch 2/8: Add user_range_info infrastructure to kgd_mem
>>>>>>>>>>>>> - user_range_info structure for per-range tracking
>>>>>>>>>>>>> - Fields for batch allocation in kgd_mem
>>>>>>>>>>>>>
>>>>>>>>>>>>> Patch 3/8: Implement interval tree for userptr ranges
>>>>>>>>>>>>> - Interval tree for efficient range lookup during invalidation
>>>>>>>>>>>>> - mark_invalid_ranges() function
>>>>>>>>>>>>>
>>>>>>>>>>>>> Patch 4/8: Add batch MMU notifier support
>>>>>>>>>>>>> - Single notifier for entire VA span
>>>>>>>>>>>>> - Invalidation callback using interval tree filtering
>>>>>>>>>>>>>
>>>>>>>>>>>>> Patch 5/8: Implement batch userptr page management
>>>>>>>>>>>>> - get_user_pages_batch() and set_user_pages_batch()
>>>>>>>>>>>>> - Per-range page array management
>>>>>>>>>>>>>
>>>>>>>>>>>>> Patch 6/8: Add batch allocation function and export API
>>>>>>>>>>>>> - init_user_pages_batch() main initialization
>>>>>>>>>>>>> - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point
>>>>>>>>>>>>>
>>>>>>>>>>>>> Patch 7/8: Unify userptr cleanup and update paths
>>>>>>>>>>>>> - Shared eviction/restore handling for batch allocations
>>>>>>>>>>>>> - Integration with existing userptr validation flows
>>>>>>>>>>>>>
>>>>>>>>>>>>> Patch 8/8: Wire up batch allocation in ioctl handler
>>>>>>>>>>>>> - Input validation and range array parsing
>>>>>>>>>>>>> - Integration with existing alloc_memory_of_gpu path
>>>>>>>>>>>>>
>>>>>>>>>>>>> Testing
>>>>>>>>>>>>> =======
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>>>>>>>>>>>> - Various allocation sizes (4KB to 1G+ per range)
>>>>>>>>>>>>> - Memory pressure scenarios and eviction/restore cycles
>>>>>>>>>>>>> - OpenCL CTS and HIP catch tests in KVM guest environment
>>>>>>>>>>>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>>>>>>>>>>>>> - Small LLM inference (3B-7B models)
>>>>>>>>>>>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>>>>>>>>>>>> - Performance improvement: 2x-2.4x faster than userspace approach
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you for your review and feedback.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Honglei Huang
>>>>>>>>>>>>>
>>>>>>>>>>>>> Honglei Huang (8):
>>>>>>>>>>>>> drm/amdkfd: Add userptr batch allocation UAPI structures
>>>>>>>>>>>>> drm/amdkfd: Add user_range_info infrastructure to kgd_mem
>>>>>>>>>>>>> drm/amdkfd: Implement interval tree for userptr ranges
>>>>>>>>>>>>> drm/amdkfd: Add batch MMU notifier support
>>>>>>>>>>>>> drm/amdkfd: Implement batch userptr page management
>>>>>>>>>>>>> drm/amdkfd: Add batch allocation function and export API
>>>>>>>>>>>>> drm/amdkfd: Unify userptr cleanup and update paths
>>>>>>>>>>>>> drm/amdkfd: Wire up batch allocation in ioctl handler
>>>>>>>>>>>>>
>>>>>>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 23 +
>>>>>>>>>>>>> .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 539 +++++++++++++++++-
>>>>>>>>>>>>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 128 ++++-
>>>>>>>>>>>>> include/uapi/linux/kfd_ioctl.h | 31 +-
>>>>>>>>>>>>> 4 files changed, 697 insertions(+), 24 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
^ permalink raw reply [flat|nested] 22+ messages in thread