Re: [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support

From: "Christian König" <christian.koenig@amd.com>
To: Honglei Huang <honghuan@amd.com>
Cc: Felix.Kuehling@amd.com, Philip.Yang@amd.com, Ray.Huang@amd.com,
	alexander.deucher@amd.com, dmitry.osipenko@collabora.com,
	Xinhui.Pan@amd.com, airlied@gmail.com, daniel@ffwll.ch,
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	akpm@linux-foundation.org
Subject: Re: [PATCH v3 0/8] drm/amdkfd: Add batch userptr allocation support
Date: Mon, 9 Feb 2026 13:59:25 +0100
Message-ID: <451400e6-bbe0-4186-bae6-1bf64181c378@amd.com>
In-Reply-To: <38264429-a256-4c2f-bcfd-8a021d9603b2@amd.com>

On 2/9/26 13:52, Honglei Huang wrote:
> DRM GPU SVM does use hmm_range_fault(), see drm_gpusvm_get_pages()

I'm not sure what you are talking about; drm_gpusvm_get_pages() also supports only a single range, not a scatter-gather list of VA ranges.

As far as I can see, that doesn't help in the slightest.

> My implementation follows the same pattern. The detailed comparison
> of invalidation path was provided in the second half of my previous mail.

Yeah, and as I said, that is not very valuable because it doesn't solve the sequencing problem.

As far as I can see, the approach you are trying here is a clear NAK from my side.

Regards,
Christian.

> 
> On 2026/2/9 18:16, Christian König wrote:
>> On 2/9/26 07:14, Honglei Huang wrote:
>>>
>>> I've reworked the implementation in v4. The fix is actually inspired
>>> by the DRM GPU SVM framework (drivers/gpu/drm/drm_gpusvm.c).
>>>
>>> DRM GPU SVM uses wide notifiers (recommended 512M or larger) to track
>>> multiple user virtual address ranges under a single mmu_interval_notifier,
>>> and these ranges can be non-contiguous, which is essentially the same
>>> problem that batch userptr needs to solve: one BO backed by multiple
>>> non-contiguous CPU VA ranges sharing one notifier.
>>
>> That still doesn't solve the sequencing problem.
>>
>> As far as I can see, you can't use hmm_range_fault() with this approach, or it would just not be very valuable.
>>
>> So how should that work with your patch set?
>>
>> Regards,
>> Christian.
>>
>>>
>>> The wide notifier is created in drm_gpusvm_notifier_alloc:
>>>    notifier->itree.start = ALIGN_DOWN(fault_addr, gpusvm->notifier_size);
>>>    notifier->itree.last = ALIGN(fault_addr + 1, gpusvm->notifier_size) - 1;
>>> The Xe driver passes
>>>    xe_modparam.svm_notifier_size * SZ_1M in xe_svm_init
>>> as the notifier_size, so one notifier can cover many of MB of VA space
>>> containing multiple non-contiguous ranges.
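
To illustrate the alignment arithmetic quoted above, here is a minimal userspace sketch. It is not code from drm_gpusvm.c or from this series: align_down(), align_up(), struct notifier_window and notifier_window_for() are hypothetical stand-ins for the kernel's ALIGN_DOWN()/ALIGN() macros and the itree.start/itree.last fields, and it assumes notifier_size is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's ALIGN_DOWN()/ALIGN() macros.
 * Both assume that size is a power of two. */
static uint64_t align_down(uint64_t x, uint64_t size)
{
	return x & ~(size - 1);
}

static uint64_t align_up(uint64_t x, uint64_t size)
{
	return (x + size - 1) & ~(size - 1);
}

/* Hypothetical container for the window covered by one wide notifier. */
struct notifier_window {
	uint64_t start; /* first byte covered */
	uint64_t last;  /* last byte covered, inclusive */
};

/* Compute the notifier window that contains fault_addr, mirroring the
 * two quoted assignments to itree.start and itree.last. */
static struct notifier_window notifier_window_for(uint64_t fault_addr,
						  uint64_t notifier_size)
{
	struct notifier_window w;

	assert(notifier_size && (notifier_size & (notifier_size - 1)) == 0);
	w.start = align_down(fault_addr, notifier_size);
	w.last = align_up(fault_addr + 1, notifier_size) - 1;
	return w;
}
```

With a 512M notifier_size, every fault address inside the same 512M-aligned window yields the same start and last, which is how one notifier ends up covering several unrelated ranges.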
>>>
>>> And DRM GPU SVM solves the per-range validity problem with flag-based
>>> validation instead of seq-based validation in:
>>>    - drm_gpusvm_pages_valid() checks
>>>        flags.has_dma_mapping
>>>      not notifier_seq. The comment explicitly states:
>>>        "This is akin to a notifier seqno check in the HMM documentation
>>>         but due to wider notifiers (i.e., notifiers which span multiple
>>>         ranges) this function is required for finer grained checking"
>>>    - __drm_gpusvm_unmap_pages() clears
>>>        flags.has_dma_mapping = false  under notifier_lock
>>>    - drm_gpusvm_get_pages() sets
>>>        flags.has_dma_mapping = true  under notifier_lock
>>> I adopted the same approach.
>>>
>>> DRM GPU SVM:
>>>    drm_gpusvm_notifier_invalidate()
>>>      down_write(&gpusvm->notifier_lock);
>>>      mmu_interval_set_seq(mni, cur_seq);
>>>      gpusvm->ops->invalidate()
>>>        -> xe_svm_invalidate()
>>>           drm_gpusvm_for_each_range()
>>>             -> __drm_gpusvm_unmap_pages()
>>>                WRITE_ONCE(flags.has_dma_mapping = false);  // clear flag
>>>      up_write(&gpusvm->notifier_lock);
>>>
>>> KFD batch userptr:
>>>    amdgpu_amdkfd_evict_userptr_batch()
>>>      mutex_lock(&process_info->notifier_lock);
>>>      mmu_interval_set_seq(mni, cur_seq);
>>>      discard_invalid_ranges()
>>>        interval_tree_iter_first/next()
>>>          range_info->valid = false;          // clear flag
>>>      mutex_unlock(&process_info->notifier_lock);
>>>
>>> Both implementations:
>>>    - Acquire notifier_lock FIRST, before any flag changes
>>>    - Call mmu_interval_set_seq() under the lock
>>>    - Use interval tree to find affected ranges within the wide notifier
>>>    - Mark per-range flag as invalid/valid under the lock
>>>
>>> The page fault path and final validation path also follow the same
>>> pattern as DRM GPU SVM: fault outside the lock, set/check per-range
>>> flag under the lock.
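
The per-range flag scheme described in the quoted text can be modelled in userspace roughly as follows. This is a single-threaded sketch under stated assumptions, not the code from the patch set: struct user_range, struct batch, batch_invalidate() and batch_all_valid() are hypothetical names, a linear scan stands in for the interval tree, and the comments mark where the kernel code would hold the notifier lock. It only shows the mechanics; it does not settle whether the scheme addresses the sequencing concern raised in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-range bookkeeping. */
struct user_range {
	uint64_t start;
	uint64_t last;  /* inclusive */
	bool valid;     /* cleared on invalidation, set after a successful fault */
};

/* Hypothetical batch: several ranges sharing one notifier. */
struct batch {
	struct user_range *ranges;
	size_t nr;
	unsigned long seq; /* stands in for the notifier sequence number */
};

/* Invalidation side. In the kernel this would run with the notifier
 * lock held. Returns how many ranges overlap [start, last]. */
static size_t batch_invalidate(struct batch *b, uint64_t start, uint64_t last,
			       unsigned long cur_seq)
{
	size_t hit = 0;

	assert(b && start <= last);
	b->seq = cur_seq; /* models mmu_interval_set_seq() */
	for (size_t i = 0; i < b->nr; i++) {
		struct user_range *r = &b->ranges[i];

		/* Closed-interval overlap test, as an interval tree would do. */
		if (r->start <= last && start <= r->last) {
			r->valid = false;
			hit++;
		}
	}
	return hit;
}

/* Final check before the pages are used. Also under the lock in the kernel. */
static bool batch_all_valid(const struct batch *b)
{
	assert(b);
	for (size_t i = 0; i < b->nr; i++)
		if (!b->ranges[i].valid)
			return false;
	return true;
}
```

An invalidation that lands in a gap between ranges still updates seq but leaves every flag untouched, which is the behavioural difference from a purely sequence-based check.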
>>>
>>> Regards,
>>> Honglei
>>>
>>>
>>> On 2026/2/6 21:56, Christian König wrote:
>>>> On 2/6/26 07:25, Honglei Huang wrote:
>>>>> From: Honglei Huang <honghuan@amd.com>
>>>>>
>>>>> Hi all,
>>>>>
>>>>> This is v3 of the patch series to support allocating multiple non-contiguous
>>>>> CPU virtual address ranges that map to a single contiguous GPU virtual address.
>>>>>
>>>>> v3:
>>>>> 1. No new ioctl: Reuses existing AMDKFD_IOC_ALLOC_MEMORY_OF_GPU
>>>>>      - Adds only one flag: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH
>>>>
>>>> That is most likely not the best approach, but Felix or Philip need to comment here since I don't know these IOCTLs well myself.
>>>>
>>>>>      - When flag is set, mmap_offset field points to range array
>>>>>      - Minimal API surface change
>>>>
>>>> Why a range of VA space for each entry?
>>>>
>>>>> 2. Improved MMU notifier handling:
>>>>>      - Single mmu_interval_notifier covering the VA span [va_min, va_max]
>>>>>      - Interval tree for efficient lookup of affected ranges during invalidation
>>>>>      - Avoids per-range notifier overhead mentioned in v2 review
>>>>
>>>> That won't work unless you also modify hmm_range_fault() to take multiple VA addresses (or ranges) at the same time.
>>>>
>>>> The problem is that we must rely on hmm_range.notifier_seq to detect changes to the page tables in question, but that in turn works only if you have one hmm_range structure and not multiple.
>>>>
>>>> What might work is doing an XOR or CRC over all hmm_range.notifier_seq you have, but that is a bit flaky.
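
One way an XOR over several sequence numbers can go wrong is a collision: two different sets of values fold to the same result, so a change goes unnoticed. The sketch below shows this with made-up values. xor_seqs() is a hypothetical helper, and the numbers are illustrative only, not real notifier_seq values.

```c
#include <assert.h>
#include <stddef.h>

/* Fold a set of per-range sequence numbers into one value with XOR. */
static unsigned long xor_seqs(const unsigned long *seqs, size_t n)
{
	unsigned long acc = 0;

	assert(seqs || n == 0);
	for (size_t i = 0; i < n; i++)
		acc ^= seqs[i];
	return acc;
}
```

For example, the pair {4, 6} and the pair {8, 10} both fold to 2, so a check that compares only the folded value would treat the two states as identical even though both sequence numbers advanced.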
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> 3. Better code organization: Split into 8 focused patches for easier review
>>>>>
>>>>> v2:
>>>>>      - Each CPU VA range gets its own mmu_interval_notifier for invalidation
>>>>>      - All ranges validated together and mapped to contiguous GPU VA
>>>>>      - Single kgd_mem object with array of user_range_info structures
>>>>>      - Unified eviction/restore path for all ranges in a batch
>>>>>
>>>>> Current Implementation Approach
>>>>> ===============================
>>>>>
>>>>> This series implements a practical solution within existing kernel constraints:
>>>>>
>>>>> 1. Single MMU notifier for VA span: Register one notifier covering the
>>>>>      entire range from lowest to highest address in the batch
>>>>>
>>>>> 2. Interval tree filtering: Use interval tree to efficiently identify
>>>>>      which specific ranges are affected during invalidation callbacks,
>>>>>      avoiding unnecessary processing for unrelated address changes
>>>>>
>>>>> 3. Unified eviction/restore: All ranges in a batch share eviction and
>>>>>      restore paths, maintaining consistency with existing userptr handling
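
As an illustration of step 1 above, the span covered by the single notifier could be derived from the range array along these lines. This is a userspace sketch, not the code from the series: struct va_range and batch_va_span() are hypothetical, and va_max is assumed to be the inclusive last byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one CPU VA range in a batch. */
struct va_range {
	uint64_t start; /* first byte of the range */
	uint64_t size;  /* length in bytes, must be non-zero */
};

/* Compute the span [va_min, va_max] covering all ranges, with va_max
 * being the inclusive last byte. Returns 0 on success and -1 for an
 * empty batch, a zero-sized range, or a range that wraps around. */
static int batch_va_span(const struct va_range *r, size_t n,
			 uint64_t *va_min, uint64_t *va_max)
{
	uint64_t lo = UINT64_MAX, hi = 0;

	assert(va_min && va_max);
	if (!r || n == 0)
		return -1;

	for (size_t i = 0; i < n; i++) {
		uint64_t last;

		if (r[i].size == 0 || r[i].start > UINT64_MAX - (r[i].size - 1))
			return -1;
		last = r[i].start + r[i].size - 1;
		if (r[i].start < lo)
			lo = r[i].start;
		if (last > hi)
			hi = last;
	}
	*va_min = lo;
	*va_max = hi;
	return 0;
}
```

The resulting span also contains every gap between the ranges, which is why step 2 needs a per-range lookup to ignore invalidations that touch only unrelated addresses.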
>>>>>
>>>>> Patch Series Overview
>>>>> =====================
>>>>>
>>>>> Patch 1/8: Add userptr batch allocation UAPI structures
>>>>>       - KFD_IOC_ALLOC_MEM_FLAGS_USERPTR_BATCH flag
>>>>>       - kfd_ioctl_userptr_range and kfd_ioctl_userptr_ranges_data structures
>>>>>
>>>>> Patch 2/8: Add user_range_info infrastructure to kgd_mem
>>>>>       - user_range_info structure for per-range tracking
>>>>>       - Fields for batch allocation in kgd_mem
>>>>>
>>>>> Patch 3/8: Implement interval tree for userptr ranges
>>>>>       - Interval tree for efficient range lookup during invalidation
>>>>>       - mark_invalid_ranges() function
>>>>>
>>>>> Patch 4/8: Add batch MMU notifier support
>>>>>       - Single notifier for entire VA span
>>>>>       - Invalidation callback using interval tree filtering
>>>>>
>>>>> Patch 5/8: Implement batch userptr page management
>>>>>       - get_user_pages_batch() and set_user_pages_batch()
>>>>>       - Per-range page array management
>>>>>
>>>>> Patch 6/8: Add batch allocation function and export API
>>>>>       - init_user_pages_batch() main initialization
>>>>>       - amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu_batch() entry point
>>>>>
>>>>> Patch 7/8: Unify userptr cleanup and update paths
>>>>>       - Shared eviction/restore handling for batch allocations
>>>>>       - Integration with existing userptr validation flows
>>>>>
>>>>> Patch 8/8: Wire up batch allocation in ioctl handler
>>>>>       - Input validation and range array parsing
>>>>>       - Integration with existing alloc_memory_of_gpu path
>>>>>
>>>>> Testing
>>>>> =======
>>>>>
>>>>> - Multiple scattered malloc() allocations (2-4000+ ranges)
>>>>> - Various allocation sizes (4KB to 1G+ per range)
>>>>> - Memory pressure scenarios and eviction/restore cycles
>>>>> - OpenCL CTS and HIP catch tests in KVM guest environment
>>>>> - AI workloads: Stable Diffusion, ComfyUI in virtualized environments
>>>>> - Small LLM inference (3B-7B models)
>>>>> - Benchmark score: 160,000 - 190,000 (80%-95% of bare metal)
>>>>> - Performance improvement: 2x-2.4x faster than userspace approach
>>>>>
>>>>> Thank you for your review and feedback.
>>>>>
>>>>> Best regards,
>>>>> Honglei Huang
>>>>>
>>>>> Honglei Huang (8):
>>>>>     drm/amdkfd: Add userptr batch allocation UAPI structures
>>>>>     drm/amdkfd: Add user_range_info infrastructure to kgd_mem
>>>>>     drm/amdkfd: Implement interval tree for userptr ranges
>>>>>     drm/amdkfd: Add batch MMU notifier support
>>>>>     drm/amdkfd: Implement batch userptr page management
>>>>>     drm/amdkfd: Add batch allocation function and export API
>>>>>     drm/amdkfd: Unify userptr cleanup and update paths
>>>>>     drm/amdkfd: Wire up batch allocation in ioctl handler
>>>>>
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |  23 +
>>>>>    .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  | 539 +++++++++++++++++-
>>>>>    drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 128 ++++-
>>>>>    include/uapi/linux/kfd_ioctl.h                |  31 +-
>>>>>    4 files changed, 697 insertions(+), 24 deletions(-)
>>>>>
>>>>
>>>
>>
>
