* [RFC PATCH 0/5] drm/amdkfd: Add batch SVM range registration support
@ 2025-11-12 7:35 Honglei Huang
0 siblings, 0 replies; 5+ messages in thread
From: Honglei Huang @ 2025-11-12 7:35 UTC (permalink / raw)
To: Felix.Kuehling, alexander.deucher, christian.koenig, Ray.Huang
Cc: dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm, honghuan, Honglei Huang
From: Honglei Huang <Honglei1.Huang@amd.com>
Hi all,
This RFC patch series introduces a new mechanism for batch registration of
multiple non-contiguous SVM (Shared Virtual Memory) ranges in a single ioctl
call. The primary goal of this series is to start a discussion about the best
approach to handle scattered user memory allocations in GPU workloads.
Background and Motivation
==========================
Current applications using ROCm/HSA often need to register many scattered
memory buffers (e.g., multiple malloc() allocations) for GPU access. With the
existing AMDKFD_IOC_SVM ioctl, each range must be registered individually,
leading to:
- Blocking issue in some special use cases with many memory ranges
- High system call overhead when dealing with dozens or hundreds of ranges
- Inefficient resource management
- Complexity in userspace applications
Use Case Example
================
Consider a typical ML/HPC workload that allocates 100+ small buffers across
different parts of the address space. Currently, this requires 100+ separate
ioctl calls. The proposed batch interface reduces this to a single call.
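To make the call-count argument concrete, here is a minimal userspace sketch
in C. It does not touch the real KFD interface: demo_range,
demo_register_one() and demo_register_batch() are hypothetical stand-ins that
only count simulated kernel entries. The actual argument layout of the
proposed AMDKFD_IOC_SVM_RANGES ioctl is not shown in this cover letter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one user range; NOT the kfd_ioctl.h layout. */
struct demo_range {
	uint64_t addr;
	uint64_t size;
};

/* Counts simulated kernel entries so both strategies can be compared. */
static unsigned long demo_kernel_entries;

/* Stand-in for one per-range registration: one kernel entry per range. */
static int demo_register_one(const struct demo_range *r)
{
	demo_kernel_entries++;
	return r->size ? 0 : -1;
}

/* Stand-in for a batch registration: one kernel entry for all ranges. */
static int demo_register_batch(const struct demo_range *r, size_t n)
{
	demo_kernel_entries++;
	for (size_t i = 0; i < n; i++)
		if (!r[i].size)
			return -1;
	return 0;
}

/* Registers ranges one by one and returns the number of kernel entries. */
static unsigned long demo_entries_per_range(const struct demo_range *r, size_t n)
{
	assert(r != NULL || n == 0);
	demo_kernel_entries = 0;
	for (size_t i = 0; i < n; i++)
		if (demo_register_one(&r[i]))
			break;
	return demo_kernel_entries;
}

/* Registers all ranges at once and returns the number of kernel entries. */
static unsigned long demo_entries_batched(const struct demo_range *r, size_t n)
{
	assert(r != NULL || n == 0);
	demo_kernel_entries = 0;
	(void)demo_register_batch(r, n);
	return demo_kernel_entries;
}
```

For 100 valid ranges, demo_entries_per_range() reports 100 entries and
demo_entries_batched() reports 1. The model ignores the per-range work that
still happens inside the kernel; it only shows the user/kernel transitions.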
Paravirtualized environments exacerbate this issue, as KVM's memory backing
is often non-contiguous at the host level. In virtualized environments, guest
physical memory appears contiguous to the VM but is actually scattered across
host memory pages. This fragmentation means that what appears as a single
large allocation in the guest may require multiple discrete SVM registrations
to properly handle the underlying host memory layout, further multiplying the
number of required ioctl calls.
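As a rough illustration of why one guest allocation can turn into several
registrations, the sketch below coalesces a list of host page frame numbers
into contiguous runs. It assumes one registration per host-contiguous run, as
the paragraph above describes. The names demo_run and demo_coalesce_runs() are
hypothetical, and how the real stack learns the host layout is not covered
here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One run of consecutive host page frames (hypothetical type). */
struct demo_run {
	uint64_t first_pfn;
	uint64_t npages;
};

/*
 * Walk pfns[0..n) in guest order and merge neighbours whose host frame
 * numbers are consecutive. Returns the number of runs written to out[].
 * If more than max_runs runs are needed, the result is truncated.
 */
static size_t demo_coalesce_runs(const uint64_t *pfns, size_t n,
				 struct demo_run *out, size_t max_runs)
{
	size_t nruns = 0;

	assert(pfns != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		/* Extend the current run if this frame directly follows it. */
		if (nruns &&
		    out[nruns - 1].first_pfn + out[nruns - 1].npages == pfns[i]) {
			out[nruns - 1].npages++;
			continue;
		}
		if (nruns == max_runs)
			break;
		/* Otherwise start a new run at this frame. */
		out[nruns].first_pfn = pfns[i];
		out[nruns].npages = 1;
		nruns++;
	}
	return nruns;
}
```

Six guest-contiguous pages backed by host frames 100, 101, 102, 500, 501 and 7
produce three runs, so three separate ranges under the assumption above.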
Current Implementation - A Workaround Approach
===============================================
This patch series implements a WORKAROUND solution that pins user pages in
memory to enable batch registration. While functional, this approach has
several significant limitations:
**Major Concern: Memory Pinning**
- The implementation uses pin_user_pages_fast() to lock pages in RAM
- This defeats the purpose of SVM's on-demand paging mechanism
- Prevents memory oversubscription and dynamic migration
- May cause memory pressure on systems with limited RAM
- Goes against the fundamental design philosophy of HMM-based SVM
**Known Limitations:**
1. Increased memory footprint due to pinned pages
2. Potential for memory fragmentation
3. No support for transparent huge pages in pinned regions
4. Limited interaction with memory cgroups and resource controls
5. Complexity in handling VMA operations and lifecycle management
6. May interfere with NUMA optimization and page migration
Why Submit This RFC?
====================
Despite the limitations above, I am submitting this series to:
1. **Start the Discussion**: I want community feedback on whether batch
registration is a useful feature worth pursuing.
2. **Explore Better Alternatives**: Is there a way to achieve batch
registration without pinning? Could I extend HMM to better support
this use case?
3. **Understand Trade-offs**: For some workloads, the performance benefit
of batch registration might outweigh the drawbacks of pinning. I'd
like to understand where the balance lies.
Questions for the Community
============================
1. Are there existing mechanisms in HMM or mm that could support batch
operations without pinning?
2. Would a different approach (e.g., async registration, delayed validation)
be more acceptable?
Alternative Approaches Considered
==================================
I've considered several alternatives:
A) **Pure HMM approach**: Register ranges without pinning, rely entirely on
B) **Userspace batching library**: Hide multiple ioctls behind a library.
Patch Series Overview
=====================
Patch 1: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute type
Patch 2: Define data structures for batch SVM range registration
Patch 3: Add new AMDKFD_IOC_SVM_RANGES ioctl command
Patch 4: Implement page pinning mechanism for scattered ranges
Patch 5: Wire up the ioctl handler and attribute processing
Testing
=======
The series has been tested with:
- Multiple scattered malloc() allocations (2-2000+ ranges)
- Various allocation sizes (4KB to 1G+)
- GPU compute workloads using the registered ranges
- Memory pressure scenarios
- OpenCL CTS in a KVM guest environment
- HIP catch tests in a KVM guest environment
- Some AI applications like Stable Diffusion, ComfyUI, 3B LLM models based
on HuggingFace transformers
I understand this approach is not ideal and am committed to working on a
better solution based on community feedback. This RFC is the starting point
for that discussion.
Thank you for your time and consideration.
Best regards,
Honglei Huang
Honglei Huang (5):
drm/amdkfd: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute
drm/amdkfd: Add SVM ranges data structures
drm/amdkfd: Add AMDKFD_IOC_SVM_RANGES ioctl command
drm/amdkfd: Add support for pinned user pages in SVM ranges
drm/amdkfd: Wire up SVM ranges ioctl handler
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 67 +++++++
drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 232 ++++++++++++++++++++++-
drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 3 +
include/uapi/linux/kfd_ioctl.h | 52 ++++-
4 files changed, 345 insertions(+), 9 deletions(-)
--
2.34.1
* Re: [RFC PATCH 0/5] drm/amdkfd: Add batch SVM range registration support
2025-11-12 12:10 ` Honglei1.Huang@amd.com
@ 2025-11-12 12:50 ` Christian König
0 siblings, 0 replies; 5+ messages in thread
From: Christian König @ 2025-11-12 12:50 UTC (permalink / raw)
To: Honglei1.Huang@amd.com
Cc: Felix.Kuehling, alexander.deucher, Ray.Huang, dmitry.osipenko,
Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel, linux-kernel,
linux-mm, akpm, Honglei Huang
Hi Honglei,
On 11/12/25 13:10, Honglei1.Huang@amd.com wrote:
>>> Paravirtualized environments exacerbate this issue, as KVM's memory backing
>>> is often non-contiguous at the host level. In virtualized environments, guest
>>> physical memory appears contiguous to the VM but is actually scattered across
>>> host memory pages. This fragmentation means that what appears as a single
>>> large allocation in the guest may require multiple discrete SVM registrations
>>> to properly handle the underlying host memory layout, further multiplying the
>>> number of required ioctl calls.
>> SVM with dynamic migration under KVM is most likely a dead end to begin with.
>>
>> The only possibility to implement it is with memory pinning which is basically userptr.
>>
>> Or a rather slow client side IOMMU emulation to catch concurrent DMA transfers to get the necessary information onto the host side.
>>
>> Intel calls this approach coIOMMU: https://www.usenix.org/system/files/atc20-paper236-slides-tian.pdf
>>
>
> This is very helpful context. Your confirmation that memory pinning (userptr-style) is the practical approach helps me understand that what I initially saw as a "workaround" is actually the intended solution for this use case.
Well, "intended" is maybe not the right term; I would rather say "possible" with the current SW/HW stack design in virtualization.
In general, fault-based SVM/HMM would still be nice to have even in a virtualized environment; it's just not really feasible at the moment.
> For coIOMMU, I'll study it to better understand the alternatives and their trade-offs.
I haven't looked into it in detail either. It was mostly developed with the pass-through use case in mind, but it avoids pinning memory on the host side, which is one of many prerequisites for getting some HMM-based migration working as well.
...
>>> Why Submit This RFC?
>>> ====================
>>>
>>> Despite the limitations above, I am submitting this series to:
>>>
>>> 1. **Start the Discussion**: I want community feedback on whether batch
>>> registration is a useful feature worth pursuing.
>>>
>>> 2. **Explore Better Alternatives**: Is there a way to achieve batch
>>> registration without pinning? Could I extend HMM to better support
>>> this use case?
>>
>> There is an ongoing unification project between KFD and KGD, we are currently looking into the SVM part on a weekly basis.
>>
>> Saying that we probably need a really good justification to add new features to the KFD interfaces cause this is going to delay the unification.
>>
>> Regards,
>> Christian.
>
> Thank you for sharing this critical information. Is there a public discussion forum or mailing list for the KFD/KGD unification where I could follow progress and understand the design direction?
Alex is driving this. There is no mailing list, but IIRC Alex has organized a lot of topics on a Confluence page; I can't find it offhand.
> Regarding the use case justification: I need to be honest here - the
> primary driver for this feature is indeed KVM/virtualized environments.
> The scattered allocation problem exists in native environments too, but
> the overhead is tolerable there. However, I do want to raise one consideration for the unified interface design:
>
> GPU computing in virtualized/cloud environments is growing rapidly: major cloud providers (AWS, Azure) now offer GPU instances, and ROCm in containers/VMs is becoming more common. So while my current use case is specific to KVM, the virtualized GPU workload pattern may become more prevalent.
>
> So during the unified interface design, please keep the door open for batch-style operations if they don't complicate the core design.
Oh, yes! That's definitely valuable information to have and more or less a new requirement for the SVM userspace API.
I already expected that we would run into such things sooner or later, but having it definitely confirmed is really good.
Regards,
Christian.
>
> I really appreciate your time and guidance on this.
>
> Regards,
> Honglei
>
>
>
>>
>>>
>>> 3. **Understand Trade-offs**: For some workloads, the performance benefit
>>> of batch registration might outweigh the drawbacks of pinning. I'd
>>> like to understand where the balance lies.
>>>
>>> Questions for the Community
>>> ============================
>>>
>>> 1. Are there existing mechanisms in HMM or mm that could support batch
>>> operations without pinning?
>>>
>>> 2. Would a different approach (e.g., async registration, delayed validation)
>>> be more acceptable?
>>>
>>> Alternative Approaches Considered
>>> ==================================
>>>
>>> I've considered several alternatives:
>>>
>>> A) **Pure HMM approach**: Register ranges without pinning, rely entirely on
>>>
>>> B) **Userspace batching library**: Hide multiple ioctls behind a library.
>>>
>>> Patch Series Overview
>>> =====================
>>>
>>> Patch 1: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute type
>>> Patch 2: Define data structures for batch SVM range registration
>>> Patch 3: Add new AMDKFD_IOC_SVM_RANGES ioctl command
>>> Patch 4: Implement page pinning mechanism for scattered ranges
>>> Patch 5: Wire up the ioctl handler and attribute processing
>>>
>>> Testing
>>> =======
>>>
>>> The series has been tested with:
>>> - Multiple scattered malloc() allocations (2-2000+ ranges)
>>> - Various allocation sizes (4KB to 1G+)
>>> - GPU compute workloads using the registered ranges
>>> - Memory pressure scenarios
>>> - OpenCL CTS in a KVM guest environment
>>> - HIP catch tests in a KVM guest environment
>>> - Some AI applications like Stable Diffusion, ComfyUI, 3B LLM models based
>>> on HuggingFace transformers
>>>
>>> I understand this approach is not ideal and am committed to working on a
>>> better solution based on community feedback. This RFC is the starting point
>>> for that discussion.
>>>
>>> Thank you for your time and consideration.
>>>
>>> Best regards,
>>> Honglei Huang
>>>
>>> ---
>>>
>>> Honglei Huang (5):
>>> drm/amdkfd: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute
>>> drm/amdkfd: Add SVM ranges data structures
>>> drm/amdkfd: Add AMDKFD_IOC_SVM_RANGES ioctl command
>>> drm/amdkfd: Add support for pinned user pages in SVM ranges
>>> drm/amdkfd: Wire up SVM ranges ioctl handler
>>>
>>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 67 +++++++++++
>>> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 232 +++++++++++++++++++++++++++++--
>>> drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 3 +
>>> include/uapi/linux/kfd_ioctl.h | 52 +++++++-
>>> 4 files changed, 348 insertions(+), 6 deletions(-)
>>
>
* Re: [RFC PATCH 0/5] drm/amdkfd: Add batch SVM range registration support
2025-11-12 8:34 ` Christian König
@ 2025-11-12 12:10 ` Honglei1.Huang@amd.com
2025-11-12 12:50 ` Christian König
0 siblings, 1 reply; 5+ messages in thread
From: Honglei1.Huang@amd.com @ 2025-11-12 12:10 UTC (permalink / raw)
To: Christian König
Cc: Felix.Kuehling, alexander.deucher, Ray.Huang, dmitry.osipenko,
Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel, linux-kernel,
linux-mm, akpm, Honglei Huang
Hi Christian,
Thank you very much for the detailed feedback and insights. Your comments are
incredibly helpful and clear.
On 2025/11/12 16:34, Christian König wrote:
> Hi,
>
> On 11/12/25 08:29, Honglei Huang wrote:
>> Hi all,
>>
>> This RFC patch series introduces a new mechanism for batch registration of
>> multiple non-contiguous SVM (Shared Virtual Memory) ranges in a single ioctl
>> call. The primary goal of this series is to start a discussion about the best
>> approach to handle scattered user memory allocations in GPU workloads.
>>
>> Background and Motivation
>> ==========================
>>
>> Current applications using ROCm/HSA often need to register many scattered
>> memory buffers (e.g., multiple malloc() allocations) for GPU access. With the
>> existing AMDKFD_IOC_SVM ioctl, each range must be registered individually,
>> leading to:
>> - Blocking issue in some special use cases with many memory ranges
>> - High system call overhead when dealing with dozens or hundreds of ranges
>> - Inefficient resource management
>> - Complexity in userspace applications
>>
>> Use Case Example
>> ================
>>
>> Consider a typical ML/HPC workload that allocates 100+ small buffers across
>> different parts of the address space. Currently, this requires 100+ separate
>> ioctl calls. The proposed batch interface reduces this to a single call.
>
> Yeah, that's an intentional limitation.
>
> In an IOCTL interface you usually need to guarantee that the operation either completes or fails in a transactional manner.
>
> It is possible to implement this, but usually rather tricky if you do multiple operations in a single IOCTL. So you really need a good use case to justify the added complexity.
>
You're absolutely right about the transactional complexity. This
operation indeed requires proper rollback mechanisms and error handling
to maintain atomicity.
>> Paravirtualized environments exacerbate this issue, as KVM's memory backing
>> is often non-contiguous at the host level. In virtualized environments, guest
>> physical memory appears contiguous to the VM but is actually scattered across
>> host memory pages. This fragmentation means that what appears as a single
>> large allocation in the guest may require multiple discrete SVM registrations
>> to properly handle the underlying host memory layout, further multiplying the
>> number of required ioctl calls.
> SVM with dynamic migration under KVM is most likely a dead end to begin with.
>
> The only possibility to implement it is with memory pinning which is basically userptr.
>
> Or a rather slow client side IOMMU emulation to catch concurrent DMA transfers to get the necessary information onto the host side.
>
> Intel calls this approach coIOMMU: https://www.usenix.org/system/files/atc20-paper236-slides-tian.pdf
>
This is very helpful context. Your confirmation that memory pinning
(userptr-style) is the practical approach helps me understand that what
I initially saw as a "workaround" is actually the intended solution for
this use case.
For coIOMMU, I'll study it to better understand the alternatives and
their trade-offs.
>> Current Implementation - A Workaround Approach
>> ===============================================
>>
>> This patch series implements a WORKAROUND solution that pins user pages in
>> memory to enable batch registration. While functional, this approach has
>> several significant limitations:
>>
>> **Major Concern: Memory Pinning**
>> - The implementation uses pin_user_pages_fast() to lock pages in RAM
>> - This defeats the purpose of SVM's on-demand paging mechanism
>> - Prevents memory oversubscription and dynamic migration
>> - May cause memory pressure on systems with limited RAM
>> - Goes against the fundamental design philosophy of HMM-based SVM
>
> That again is perfectly intentional. Any other mode doesn't really make sense with KVM.
>
>> **Known Limitations:**
>> 1. Increased memory footprint due to pinned pages
>> 2. Potential for memory fragmentation
>> 3. No support for transparent huge pages in pinned regions
>> 4. Limited interaction with memory cgroups and resource controls
>> 5. Complexity in handling VMA operations and lifecycle management
>> 6. May interfere with NUMA optimization and page migration
>>
>> Why Submit This RFC?
>> ====================
>>
>> Despite the limitations above, I am submitting this series to:
>>
>> 1. **Start the Discussion**: I want community feedback on whether batch
>> registration is a useful feature worth pursuing.
>>
>> 2. **Explore Better Alternatives**: Is there a way to achieve batch
>> registration without pinning? Could I extend HMM to better support
>> this use case?
>
> There is an ongoing unification project between KFD and KGD, we are currently looking into the SVM part on a weekly basis.
>
> Saying that we probably need a really good justification to add new features to the KFD interfaces cause this is going to delay the unification.
>
> Regards,
> Christian.
Thank you for sharing this critical information. Is there a public
discussion forum or mailing list for the KFD/KGD unification where I
could follow progress and understand the design direction?
Regarding the use case justification: I need to be honest here - the
primary driver for this feature is indeed KVM/virtualized environments.
The scattered allocation problem exists in native environments too, but
the overhead is tolerable there. However, I do want to raise one
consideration for the unified interface design:
GPU computing in virtualized/cloud environments is growing rapidly:
major cloud providers (AWS, Azure) now offer GPU instances, and ROCm in
containers/VMs is becoming more common. So while my current use case is
specific to KVM, the virtualized GPU workload pattern may become more
prevalent.
So during the unified interface design, please keep the door open for
batch-style operations if they don't complicate the core design.
I really appreciate your time and guidance on this.
Regards,
Honglei
>
>>
>> 3. **Understand Trade-offs**: For some workloads, the performance benefit
>> of batch registration might outweigh the drawbacks of pinning. I'd
>> like to understand where the balance lies.
>>
>> Questions for the Community
>> ============================
>>
>> 1. Are there existing mechanisms in HMM or mm that could support batch
>> operations without pinning?
>>
>> 2. Would a different approach (e.g., async registration, delayed validation)
>> be more acceptable?
>>
>> Alternative Approaches Considered
>> ==================================
>>
>> I've considered several alternatives:
>>
>> A) **Pure HMM approach**: Register ranges without pinning, rely entirely on
>>
>> B) **Userspace batching library**: Hide multiple ioctls behind a library.
>>
>> Patch Series Overview
>> =====================
>>
>> Patch 1: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute type
>> Patch 2: Define data structures for batch SVM range registration
>> Patch 3: Add new AMDKFD_IOC_SVM_RANGES ioctl command
>> Patch 4: Implement page pinning mechanism for scattered ranges
>> Patch 5: Wire up the ioctl handler and attribute processing
>>
>> Testing
>> =======
>>
>> The series has been tested with:
>> - Multiple scattered malloc() allocations (2-2000+ ranges)
>> - Various allocation sizes (4KB to 1G+)
>> - GPU compute workloads using the registered ranges
>> - Memory pressure scenarios
>> - OpenCL CTS in a KVM guest environment
>> - HIP catch tests in a KVM guest environment
>> - Some AI applications like Stable Diffusion, ComfyUI, 3B LLM models based
>> on HuggingFace transformers
>>
>> I understand this approach is not ideal and am committed to working on a
>> better solution based on community feedback. This RFC is the starting point
>> for that discussion.
>>
>> Thank you for your time and consideration.
>>
>> Best regards,
>> Honglei Huang
>>
>> ---
>>
>> Honglei Huang (5):
>> drm/amdkfd: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute
>> drm/amdkfd: Add SVM ranges data structures
>> drm/amdkfd: Add AMDKFD_IOC_SVM_RANGES ioctl command
>> drm/amdkfd: Add support for pinned user pages in SVM ranges
>> drm/amdkfd: Wire up SVM ranges ioctl handler
>>
>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 67 +++++++++++
>> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 232 +++++++++++++++++++++++++++++--
>> drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 3 +
>> include/uapi/linux/kfd_ioctl.h | 52 +++++++-
>> 4 files changed, 348 insertions(+), 6 deletions(-)
>
* Re: [RFC PATCH 0/5] drm/amdkfd: Add batch SVM range registration support
2025-11-12 7:29 Honglei Huang
@ 2025-11-12 8:34 ` Christian König
2025-11-12 12:10 ` Honglei1.Huang@amd.com
0 siblings, 1 reply; 5+ messages in thread
From: Christian König @ 2025-11-12 8:34 UTC (permalink / raw)
To: Honglei Huang, Felix.Kuehling, alexander.deucher, Ray.Huang
Cc: dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm, honghuang
Hi,
On 11/12/25 08:29, Honglei Huang wrote:
> Hi all,
>
> This RFC patch series introduces a new mechanism for batch registration of
> multiple non-contiguous SVM (Shared Virtual Memory) ranges in a single ioctl
> call. The primary goal of this series is to start a discussion about the best
> approach to handle scattered user memory allocations in GPU workloads.
>
> Background and Motivation
> ==========================
>
> Current applications using ROCm/HSA often need to register many scattered
> memory buffers (e.g., multiple malloc() allocations) for GPU access. With the
> existing AMDKFD_IOC_SVM ioctl, each range must be registered individually,
> leading to:
> - Blocking issue in some special use cases with many memory ranges
> - High system call overhead when dealing with dozens or hundreds of ranges
> - Inefficient resource management
> - Complexity in userspace applications
>
> Use Case Example
> ================
>
> Consider a typical ML/HPC workload that allocates 100+ small buffers across
> different parts of the address space. Currently, this requires 100+ separate
> ioctl calls. The proposed batch interface reduces this to a single call.
Yeah, that's an intentional limitation.
In an IOCTL interface you usually need to guarantee that the operation either completes or fails in a transactional manner.
It is possible to implement this, but usually rather tricky if you do multiple operations in a single IOCTL. So you really need a good use case to justify the added complexity.
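The all-or-nothing behavior described above can be sketched as a loop that
undoes earlier steps when a later one fails. This is only an illustration in
plain userspace C, not the KFD ioctl code: the txn_* names are hypothetical,
and the failure condition (a zero-sized range) is made up to trigger the
rollback path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range with a flag recording whether it is registered. */
struct txn_range {
	uint64_t addr;
	uint64_t size;
	bool registered;
};

/* Simulated single registration; fails for an (invented) invalid size. */
static int txn_register_one(struct txn_range *r)
{
	if (r->size == 0)
		return -1;
	r->registered = true;
	return 0;
}

/* Simulated undo of a single registration. */
static void txn_unregister_one(struct txn_range *r)
{
	r->registered = false;
}

/*
 * Register every range or none: on the first failure, unregister the
 * ranges that were already processed, in reverse order, and return the
 * error to the caller.
 */
static int txn_register_all(struct txn_range *r, size_t n)
{
	assert(r != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		int ret = txn_register_one(&r[i]);

		if (ret) {
			while (i-- > 0)
				txn_unregister_one(&r[i]);
			return ret;
		}
	}
	return 0;
}
```

With three valid ranges the call returns 0 and all three are registered. If
the third range is invalid, the call returns -1 and the first two are
unregistered again, so the caller never sees a partially applied batch. Real
rollback is harder than this, since undoing a step can itself fail or race
with other users of the address space.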
> Paravirtualized environments exacerbate this issue, as KVM's memory backing
> is often non-contiguous at the host level. In virtualized environments, guest
> physical memory appears contiguous to the VM but is actually scattered across
> host memory pages. This fragmentation means that what appears as a single
> large allocation in the guest may require multiple discrete SVM registrations
> to properly handle the underlying host memory layout, further multiplying the
> number of required ioctl calls.
SVM with dynamic migration under KVM is most likely a dead end to begin with.
The only way to implement it is with memory pinning, which is basically userptr.
Or a rather slow client-side IOMMU emulation that catches concurrent DMA transfers to get the necessary information onto the host side.
Intel calls this approach coIOMMU: https://www.usenix.org/system/files/atc20-paper236-slides-tian.pdf
> Current Implementation - A Workaround Approach
> ===============================================
>
> This patch series implements a WORKAROUND solution that pins user pages in
> memory to enable batch registration. While functional, this approach has
> several significant limitations:
>
> **Major Concern: Memory Pinning**
> - The implementation uses pin_user_pages_fast() to lock pages in RAM
> - This defeats the purpose of SVM's on-demand paging mechanism
> - Prevents memory oversubscription and dynamic migration
> - May cause memory pressure on systems with limited RAM
> - Goes against the fundamental design philosophy of HMM-based SVM
That again is perfectly intentional. Any other mode doesn't really make sense with KVM.
> **Known Limitations:**
> 1. Increased memory footprint due to pinned pages
> 2. Potential for memory fragmentation
> 3. No support for transparent huge pages in pinned regions
> 4. Limited interaction with memory cgroups and resource controls
> 5. Complexity in handling VMA operations and lifecycle management
> 6. May interfere with NUMA optimization and page migration
>
> Why Submit This RFC?
> ====================
>
> Despite the limitations above, I am submitting this series to:
>
> 1. **Start the Discussion**: I want community feedback on whether batch
> registration is a useful feature worth pursuing.
>
> 2. **Explore Better Alternatives**: Is there a way to achieve batch
> registration without pinning? Could I extend HMM to better support
> this use case?
There is an ongoing unification project between KFD and KGD; we are currently looking into the SVM part on a weekly basis.
That said, we probably need a really good justification to add new features to the KFD interfaces, because this is going to delay the unification.
Regards,
Christian.
>
> 3. **Understand Trade-offs**: For some workloads, the performance benefit
> of batch registration might outweigh the drawbacks of pinning. I'd
> like to understand where the balance lies.
>
> Questions for the Community
> ============================
>
> 1. Are there existing mechanisms in HMM or mm that could support batch
> operations without pinning?
>
> 2. Would a different approach (e.g., async registration, delayed validation)
> be more acceptable?
>
> Alternative Approaches Considered
> ==================================
>
> I've considered several alternatives:
>
> A) **Pure HMM approach**: Register ranges without pinning, rely entirely on
>
> B) **Userspace batching library**: Hide multiple ioctls behind a library.
>
> Patch Series Overview
> =====================
>
> Patch 1: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute type
> Patch 2: Define data structures for batch SVM range registration
> Patch 3: Add new AMDKFD_IOC_SVM_RANGES ioctl command
> Patch 4: Implement page pinning mechanism for scattered ranges
> Patch 5: Wire up the ioctl handler and attribute processing
>
> Testing
> =======
>
> The series has been tested with:
> - Multiple scattered malloc() allocations (2-2000+ ranges)
> - Various allocation sizes (4KB to 1G+)
> - GPU compute workloads using the registered ranges
> - Memory pressure scenarios
> - OpenCL CTS in a KVM guest environment
> - HIP catch tests in a KVM guest environment
> - Some AI applications like Stable Diffusion, ComfyUI, 3B LLM models based
> on HuggingFace transformers
>
> I understand this approach is not ideal and am committed to working on a
> better solution based on community feedback. This RFC is the starting point
> for that discussion.
>
> Thank you for your time and consideration.
>
> Best regards,
> Honglei Huang
>
> ---
>
> Honglei Huang (5):
> drm/amdkfd: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute
> drm/amdkfd: Add SVM ranges data structures
> drm/amdkfd: Add AMDKFD_IOC_SVM_RANGES ioctl command
> drm/amdkfd: Add support for pinned user pages in SVM ranges
> drm/amdkfd: Wire up SVM ranges ioctl handler
>
> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 67 +++++++++++
> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 232 +++++++++++++++++++++++++++++--
> drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 3 +
> include/uapi/linux/kfd_ioctl.h | 52 +++++++-
> 4 files changed, 348 insertions(+), 6 deletions(-)
* [RFC PATCH 0/5] drm/amdkfd: Add batch SVM range registration support
@ 2025-11-12 7:29 Honglei Huang
2025-11-12 8:34 ` Christian König
0 siblings, 1 reply; 5+ messages in thread
From: Honglei Huang @ 2025-11-12 7:29 UTC (permalink / raw)
To: Felix.Kuehling, alexander.deucher, christian.koenig, Ray.Huang
Cc: dmitry.osipenko, Xinhui.Pan, airlied, daniel, amd-gfx, dri-devel,
linux-kernel, linux-mm, akpm, honghuang
Hi all,
This RFC patch series introduces a new mechanism for batch registration of
multiple non-contiguous SVM (Shared Virtual Memory) ranges in a single ioctl
call. The primary goal of this series is to start a discussion about the best
approach to handle scattered user memory allocations in GPU workloads.
Background and Motivation
==========================
Current applications using ROCm/HSA often need to register many scattered
memory buffers (e.g., multiple malloc() allocations) for GPU access. With the
existing AMDKFD_IOC_SVM ioctl, each range must be registered individually,
leading to:
- Blocking issue in some special use cases with many memory ranges
- High system call overhead when dealing with dozens or hundreds of ranges
- Inefficient resource management
- Complexity in userspace applications
Use Case Example
================
Consider a typical ML/HPC workload that allocates 100+ small buffers across
different parts of the address space. Currently, this requires 100+ separate
ioctl calls. The proposed batch interface reduces this to a single call.
Paravirtualized environments exacerbate this issue, as KVM's memory backing
is often non-contiguous at the host level. In virtualized environments, guest
physical memory appears contiguous to the VM but is actually scattered across
host memory pages. This fragmentation means that what appears as a single
large allocation in the guest may require multiple discrete SVM registrations
to properly handle the underlying host memory layout, further multiplying the
number of required ioctl calls.
Current Implementation - A Workaround Approach
===============================================
This patch series implements a WORKAROUND solution that pins user pages in
memory to enable batch registration. While functional, this approach has
several significant limitations:
**Major Concern: Memory Pinning**
- The implementation uses pin_user_pages_fast() to lock pages in RAM
- This defeats the purpose of SVM's on-demand paging mechanism
- Prevents memory oversubscription and dynamic migration
- May cause memory pressure on systems with limited RAM
- Goes against the fundamental design philosophy of HMM-based SVM
**Known Limitations:**
1. Increased memory footprint due to pinned pages
2. Potential for memory fragmentation
3. No support for transparent huge pages in pinned regions
4. Limited interaction with memory cgroups and resource controls
5. Complexity in handling VMA operations and lifecycle management
6. May interfere with NUMA optimization and page migration
Why Submit This RFC?
====================
Despite the limitations above, I am submitting this series to:
1. **Start the Discussion**: I want community feedback on whether batch
registration is a useful feature worth pursuing.
2. **Explore Better Alternatives**: Is there a way to achieve batch
registration without pinning? Could I extend HMM to better support
this use case?
3. **Understand Trade-offs**: For some workloads, the performance benefit
of batch registration might outweigh the drawbacks of pinning. I'd
like to understand where the balance lies.
Questions for the Community
============================
1. Are there existing mechanisms in HMM or mm that could support batch
operations without pinning?
2. Would a different approach (e.g., async registration, delayed validation)
be more acceptable?
Alternative Approaches Considered
==================================
I've considered several alternatives:
A) **Pure HMM approach**: Register ranges without pinning, rely entirely on
B) **Userspace batching library**: Hide multiple ioctls behind a library.
Patch Series Overview
=====================
Patch 1: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute type
Patch 2: Define data structures for batch SVM range registration
Patch 3: Add new AMDKFD_IOC_SVM_RANGES ioctl command
Patch 4: Implement page pinning mechanism for scattered ranges
Patch 5: Wire up the ioctl handler and attribute processing
Testing
=======
The series has been tested with:
- Multiple scattered malloc() allocations (2-2000+ ranges)
- Various allocation sizes (4KB to 1G+)
- GPU compute workloads using the registered ranges
- Memory pressure scenarios
- OpenCL CTS in a KVM guest environment
- HIP catch tests in a KVM guest environment
- Some AI applications like Stable Diffusion, ComfyUI, 3B LLM models based
on HuggingFace transformers
I understand this approach is not ideal and am committed to working on a
better solution based on community feedback. This RFC is the starting point
for that discussion.
Thank you for your time and consideration.
Best regards,
Honglei Huang
---
Honglei Huang (5):
drm/amdkfd: Add KFD_IOCTL_SVM_ATTR_MAPPED attribute
drm/amdkfd: Add SVM ranges data structures
drm/amdkfd: Add AMDKFD_IOC_SVM_RANGES ioctl command
drm/amdkfd: Add support for pinned user pages in SVM ranges
drm/amdkfd: Wire up SVM ranges ioctl handler
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 67 +++++++++++
drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 232 +++++++++++++++++++++++++++++--
drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 3 +
include/uapi/linux/kfd_ioctl.h | 52 +++++++-
4 files changed, 348 insertions(+), 6 deletions(-)