From: Honglei Huang <honghuan@amd.com>
To: "Kuehling, Felix" <felix.kuehling@amd.com>,
	"Christian König" <christian.koenig@amd.com>
Cc: dmitry.osipenko@collabora.com, Xinhui.Pan@amd.com,
	airlied@gmail.com, daniel@ffwll.ch,
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	akpm@linux-foundation.org, Honglei Huang <honglei1.huang@amd.com>,
	alexander.deucher@amd.com, Ray.Huang@amd.com
Subject: Re: [PATCH v2 0/4] drm/amdkfd: Add batch userptr allocation support
Date: Sat, 10 Jan 2026 10:28:49 +0800
Message-ID: <dc1f5de7-40c4-4649-8f2f-0fee4b540783@amd.com>
In-Reply-To: <ab5d1bb7-7896-49fd-a9ea-19294f4f57ca@amd.com>


Hi Felix,

You're right - I understand now that the render node transition is already
underway, especially on the memory management side. Appreciate the
clarification.

Regards,
Honglei


On 2026/1/10 05:14, Kuehling, Felix wrote:
> FWIW, ROCr already uses rendernode APIs for our implementation of the 
> CUDA VM API (DMABuf imports into rendernode contexts that share the VA 
> space with KFD and VA mappings with more flexibility than what we have 
> in the KFD API). So the transition to render node APIs has already 
> started, especially in the memory management area. It's not some far-off 
> future thing.
> 
> Regards,
>    Felix
> 
> On 2026-01-09 04:07, Christian König wrote:
>> Hi Honglei,
>>
>> I have to agree with Felix. Adding such complexity to the KFD API is a 
>> clear no-go from my side.
>>
>> Just skimming over the patch it's obvious that this isn't correctly 
>> implemented. You simply can't use MMU notifier ranges like this.
>>
>> Regards,
>> Christian.
>>
>> On 1/9/26 08:55, Honglei Huang wrote:
>>> Hi Felix,
>>>
>>> Thank you for the feedback. I understand your concern about API 
>>> maintenance.
>>>
>>> From what I can see, KFD is still the core driver for all GPU 
>>> compute workloads. The entire compute ecosystem is built on KFD's 
>>> infrastructure and continues to rely on it. While the unification 
>>> work is ongoing, any transition to DRM render node APIs would 
>>> naturally take considerable time, and KFD is expected to remain the 
>>> primary interface for compute for the foreseeable future. The lack of
>>> batch allocation causes severe performance loss in virtualized compute
>>> scenarios, as quantified in the cover letter below.
>>>
>>> You're absolutely right about the API proliferation concern. Based on 
>>> your feedback, I'd like to revise the approach for v3 to minimize 
>>> impact by reusing the existing ioctl instead of adding a new API:
>>>
>>> - Reuse existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU ioctl
>>> - Add one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>>> - When flag is set, mmap_offset field points to range array
>>> - No new ioctl command, no new structure
>>>
>>> This changes the API surface from adding a new ioctl to adding just 
>>> one flag.
>>>
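>>> To make the reuse concrete, here is a rough userspace sketch. The
>>> batch flag and the range-array struct are part of this proposal, not
>>> existing UAPI; only the args struct and ioctl shown below exist today:
>>>
>>>     /* hypothetical: one entry per scattered CPU VA range */
>>>     struct kfd_ioctl_userptr_range ranges[2] = {
>>>         { .start = (uintptr_t)buf0, .size = size0 },
>>>         { .start = (uintptr_t)buf1, .size = size1 },
>>>     };
>>>
>>>     struct kfd_ioctl_alloc_memory_of_gpu_args args = {
>>>         .va_addr     = gpu_va,            /* contiguous GPU VA */
>>>         .size        = size0 + size1,
>>>         .gpu_id      = gpu_id,
>>>         .flags       = KFD_IOC_ALLOC_MEM_FLAGS_USERPTR |
>>>                        KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH,
>>>         /* with the batch flag set, mmap_offset is reinterpreted
>>>          * as a pointer to the range array, not an mmap offset */
>>>         .mmap_offset = (__u64)(uintptr_t)ranges,
>>>     };
>>>
>>>     ioctl(kfd_fd, AMDKFD_IOC_ALLOC_MEMORY_OF_GPU, &args);
>>>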
>>> Actually, the implementation modifies DRM's GPU memory management
>>> infrastructure in amdgpu_amdkfd_gpuvm.c. If the DRM render node path
>>> needs similar functionality later, these functions could be reused
>>> directly.
>>>
>>> Would you be willing to review v3 with this approach?
>>>
>>> Regards,
>>> Honglei Huang
>>>
>>> On 2026/1/9 03:46, Felix Kuehling wrote:
>>>> I don't have time to review this in detail right now. I am concerned 
>>>> about adding new KFD API, when the trend is moving towards DRM 
>>>> render node APIs. This creates additional burden for ongoing support 
>>>> of these APIs in addition to the inevitable DRM render node 
>>>> duplicates we'll have in the future. Would it be possible to 
>>>> implement this batch userptr allocation in a render node API from 
>>>> the start?
>>>>
>>>> Regards,
>>>>     Felix
>>>>
>>>>
>>>> On 2026-01-04 02:21, Honglei Huang wrote:
>>>>> From: Honglei Huang <honghuan@amd.com>
>>>>>
>>>>> Hi all,
>>>>>
>>>>> This is v2 of the patch series to support allocating multiple
>>>>> non-contiguous CPU virtual address ranges that map to a single
>>>>> contiguous GPU virtual address.
>>>>>
>>>>> **Key improvements over v1:**
>>>>> - NO memory pinning: uses HMM for page tracking, pages can be
>>>>>   swapped/migrated
>>>>> - NO impact on SVM subsystem: avoids complexity during KFD/KGD 
>>>>> unification
>>>>> - Better approach: userptr's VA remapping design is ideal for 
>>>>> scattered VA registration
>>>>>
>>>>> Based on community feedback, v2 takes a completely different
>>>>> implementation approach, leveraging the existing userptr
>>>>> infrastructure rather than introducing new SVM-based mechanisms
>>>>> that required memory pinning.
>>>>>
>>>>> Changes from v1
>>>>> ===============
>>>>>
>>>>> v1 attempted to solve this problem through the SVM subsystem by:
>>>>> - Adding a new AMDKFD_IOC_SVM_RANGES ioctl for batch SVM range 
>>>>> registration
>>>>> - Introducing KFD_IOCTL_SVM_ATTR_MAPPED attribute for special VMA 
>>>>> handling
>>>>> - Using pin_user_pages_fast() to pin scattered memory ranges
>>>>> - Registering multiple SVM ranges with pinned pages
>>>>>
>>>>> This approach had significant drawbacks:
>>>>> 1. Memory pinning defeated the purpose of HMM-based SVM's on-demand 
>>>>> paging
>>>>> 2. Added complexity to the SVM subsystem
>>>>> 3. Prevented memory oversubscription and dynamic migration
>>>>> 4. Could cause memory pressure due to locked pages
>>>>> 5. Interfered with NUMA optimization and page migration
>>>>>
>>>>> v2 Implementation Approach
>>>>> ==========================
>>>>>
>>>>> 1. **No memory pinning required**
>>>>>      - Uses HMM (Heterogeneous Memory Management) for page tracking
>>>>>      - Pages are NOT pinned, can be swapped/migrated when not in use
>>>>>      - Supports dynamic page eviction and on-demand restore like 
>>>>> standard userptr
>>>>>
>>>>> 2. **Zero impact on KFD SVM subsystem**
>>>>>      - Extends ALLOC_MEMORY_OF_GPU path, not SVM
>>>>>      - New ioctl: AMDKFD_IOC_ALLOC_MEMORY_OF_GPU_BATCH
>>>>>      - Zero changes to SVM code, limited scope of changes
>>>>>
>>>>> 3. **Perfect fit for non-contiguous VA registration**
>>>>>      - Userptr design naturally supports GPU VA != CPU VA mapping
>>>>>      - Multiple non-contiguous CPU VA ranges -> single contiguous
>>>>>        GPU VA
>>>>>      - Unlike KFD SVM which maintains VA identity, userptr allows
>>>>>        remapping. This VA remapping capability makes userptr ideal
>>>>>        for scattered allocations
>>>>>
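>>>>> As a rough illustration of the remapping (the range struct layout
>>>>> and batch args fields below are hypothetical; patch 1/4 defines
>>>>> the actual UAPI):
>>>>>
>>>>>     /* two scattered CPU allocations... */
>>>>>     void *a = malloc(SZ_A);
>>>>>     void *b = malloc(SZ_B);
>>>>>
>>>>>     struct kfd_ioctl_userptr_range ranges[] = {
>>>>>         { .start = (uintptr_t)a, .size = SZ_A },
>>>>>         { .start = (uintptr_t)b, .size = SZ_B },
>>>>>     };
>>>>>
>>>>>     /* ...registered as one contiguous GPU VA window at gpu_va */
>>>>>     batch_args.va_addr    = gpu_va;
>>>>>     batch_args.num_ranges = 2;
>>>>>     batch_args.ranges_ptr = (__u64)(uintptr_t)ranges;
>>>>>     ioctl(kfd_fd, AMDKFD_IOC_ALLOC_MEMORY_OF_GPU_BATCH, &batch_args);
>>>>>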
>>>>> **Implementation Details:**
>>>>>      - Each CPU VA range gets its own mmu_interval_notifier for 
>>>>> invalidation
>>>>>      - All ranges validated together and mapped to contiguous GPU VA
>>>>>      - Single kgd_mem object with array of user_range_info structures
>>>>>      - Unified eviction/restore path for all ranges in a batch
>>>>>
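>>>>> A minimal kernel-side sketch of the per-range notifier setup (the
>>>>> mem->user_ranges field names are illustrative, patch 2/4 has the
>>>>> real layout; mmu_interval_notifier_insert() is the existing API):
>>>>>
>>>>>     /* one interval notifier per scattered CPU VA range */
>>>>>     for (i = 0; i < mem->num_user_ranges; i++) {
>>>>>         struct user_range_info *range = &mem->user_ranges[i];
>>>>>
>>>>>         ret = mmu_interval_notifier_insert(&range->notifier,
>>>>>                         current->mm, range->start, range->size,
>>>>>                         &batch_userptr_notifier_ops);
>>>>>         if (ret)
>>>>>             goto unwind; /* remove notifiers inserted so far */
>>>>>     }
>>>>>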
>>>>> Patch Series Overview
>>>>> =====================
>>>>>
>>>>> Patch 1/4: Add AMDKFD_IOC_ALLOC_MEMORY_OF_GPU_BATCH ioctl and data 
>>>>> structures
>>>>>       - New ioctl command and kfd_ioctl_userptr_range structure
>>>>>       - UAPI for userspace to request batch userptr allocation
>>>>>
>>>>> Patch 2/4: Extend kgd_mem for batch userptr support
>>>>>       - Add user_range_info and associated fields to kgd_mem
>>>>>       - Data structures for tracking multiple ranges per allocation
>>>>>
>>>>> Patch 3/4: Implement batch userptr allocation and management
>>>>>       - Core functions: init_user_pages_batch(), 
>>>>> get_user_pages_batch()
>>>>>       - Per-range eviction/restore handlers with unified management
>>>>>       - Integration with existing userptr eviction/validation flows
>>>>>
>>>>> Patch 4/4: Wire up batch userptr ioctl handler
>>>>>       - Ioctl handler with input validation
>>>>>       - SVM conflict checking for GPU VA and CPU VA ranges
>>>>>       - Integration with kfd_process and process_device infrastructure
>>>>>
>>>>> Performance Comparison
>>>>> ======================
>>>>>
>>>>> Before implementing this series, we attempted a userspace solution
>>>>> that made multiple calls to the existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>>>>> ioctl to register non-contiguous VA ranges individually. That approach
>>>>> resulted in severe performance degradation:
>>>>>
>>>>> **Userspace Multiple ioctl Approach:**
>>>>> - Benchmark score: ~80,000 (down from 200,000 on bare metal)
>>>>> - Performance loss: 60% degradation
>>>>>
>>>>> **This Kernel Batch ioctl Approach:**
>>>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>>>> - Performance improvement: 2x-2.4x faster than userspace approach
>>>>> - Achieves near-native performance in virtualized environments
>>>>>
>>>>> Batch registration in the kernel avoids the repeated syscall overhead
>>>>> and enables efficient unified management of scattered VA ranges,
>>>>> recovering most of the performance lost to virtualization.
>>>>>
>>>>> Testing Results
>>>>> ===============
>>>>>
>>>>> The series has been tested with:
>>>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>>>> - Various allocation sizes (4 KB to 1 GB+ per range)
>>>>> - GPU compute workloads using the batch-allocated ranges
>>>>> - Memory pressure scenarios and eviction/restore cycles
>>>>> - OpenCL CTS in KVM guest environment
>>>>> - HIP catch tests in KVM guest environment
>>>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>>>>> - Small LLM inference (3B-7B models) using HuggingFace transformers
>>>>>
>>>>> Corresponding userspace patch
>>>>> =============================
>>>>> Userspace ROCm changes for the new ioctl:
>>>>> - libhsakmt: https://github.com/ROCm/rocm-systems/commit/ac21716e5d6f68ec524e50eeef10d1d6ad7eae86
>>>>>
>>>>> Thank you for your review; I look forward to your feedback.
>>>>>
>>>>> Best regards,
>>>>> Honglei Huang
>>>>>
>>>>> Honglei Huang (4):
>>>>>     drm/amdkfd: Add batch userptr allocation UAPI
>>>>>     drm/amdkfd: Extend kgd_mem for batch userptr support
>>>>>     drm/amdkfd: Implement batch userptr allocation and management
>>>>>     drm/amdkfd: Wire up batch userptr ioctl handler
>>>>>
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |  21 +
>>>>>    .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 543 +++++++++++++++++-
>>>>>    drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 159 +++++
>>>>>    include/uapi/linux/kfd_ioctl.h                |  37 +-
>>>>>    4 files changed, 740 insertions(+), 20 deletions(-)
>>>>>


