* Re: [PATCH] drm/amdkfd: Fix svm_bo and vram page refcount
       [not found] ` <aa910171-bc96-d8b1-1bee-65f3ef5d1f46@amd.com>
@ 2025-10-03 22:16   ` Felix Kuehling
  2025-10-06 12:55     ` Philip Yang
  2025-10-06 13:21     ` Jason Gunthorpe
  0 siblings, 2 replies; 5+ messages in thread
From: Felix Kuehling @ 2025-10-03 22:16 UTC (permalink / raw)
  To: Philip Yang, Philip Yang, amd-gfx, Linux MM, Jason Gunthorpe,
	Leon Romanovsky

[+Linux MM and HMM maintainers]

Please see below my question about the safety of using zone_device_page_init.

On 2025-10-03 18:02, Philip Yang wrote:
>
> On 2025-10-03 17:46, Felix Kuehling wrote:
>>
>> On 2025-10-03 17:18, Philip Yang wrote:
>>>
>>> On 2025-10-03 17:05, Felix Kuehling wrote:
>>>> On 2025-09-26 17:03, Philip Yang wrote:
>>>>> zone_device_page_init uses set_page_count to set the vram page
>>>>> refcount to 1. There is a race if step 2 happens between steps 1
>>>>> and 3.
>>>>>
>>>>> 1. CPU page fault handler gets the vram page and migrates the vram
>>>>>    page to a system page
>>>>> 2. GPU page fault migrates to the vram page and sets the page
>>>>>    refcount to 1
>>>>> 3. CPU page fault handler puts the vram page; the vram page
>>>>>    refcount drops to 0 and drops the vram_bo refcount
>>>>> 4. The vram_bo refcount is 1 off because the vram page is still in
>>>>>    use.
>>>>>
>>>>> Afterwards, this causes a use-after-free bug and a page refcount
>>>>> warning.
>>>>
>>>> This implies that migration to RAM and to VRAM of the same range
>>>> are happening at the same time. Isn't that a bigger problem? It
>>>> means someone doing a migration is not holding the
>>>> prange->migrate_mutex.
>>>
>>> Migration holds prange->migrate_mutex, so we don't have migration to
>>> RAM and to VRAM of the same range at the same time. The issue is in
>>> step 3: the CPU page fault handler do_swap_page calls put_page after
>>> pgmap->ops->migrate_to_ram() returns, and that can happen during
>>> migrate_to_vram.
>>
>> That's the part I don't understand. The CPU page fault handler
>> (svm_migrate_to_ram) is holding prange->migrate_mutex until the very
>> end. Where do we have a put_page for a zone_device page outside the
>> prange->migrate_mutex? Do you have a backtrace?
>
> do_swap_page() {
> .......
> 	} else if (is_device_private_entry(entry)) {
> ........
>
> 		/*
> 		 * Get a page reference while we know the page can't be
> 		 * freed.
> 		 */
> 		if (trylock_page(vmf->page)) {
> 			struct dev_pagemap *pgmap;
>
> 			get_page(vmf->page);
> 			pte_unmap_unlock(vmf->pte, vmf->ptl);
> 			pgmap = page_pgmap(vmf->page);
> 			ret = pgmap->ops->migrate_to_ram(vmf);
> 			unlock_page(vmf->page);
> 			put_page(vmf->page);
>
> This put_page reduces the vram page refcount to zero if migrate_to_vram
> -> svm_migrate_get_vram_page has already called zone_device_page_init,
> which sets the page refcount to 1.
>
> put_page must come after unlock_page because put_page may free the
> page, so svm_migrate_get_vram_page can lock the page, but the page
> refcount becomes 0.

OK. Then you must have hit the
WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref)) in that
function.

It sounds like zone_device_page_init is just unsafe to use in general.
It assumes that pages have a 0 refcount. But I don't see a good way for
drivers to guarantee that, because they are not in control of when the
page refcounts for their zone-device pages get decremented.

Regards,
  Felix

>
> Regards,
>
> Philip
>
>>
>> Regards,
>>   Felix
>>
>>
>>>
>>> Regards,
>>>
>>> Philip
>>>
>>>>
>>>> Regards,
>>>>   Felix
>>>>
>>>>
>>>>>
>>>>> zone_device_page_init should not be used in page migration; change
>>>>> to get_page to fix the race bug.
>>>>>
>>>>> Add WARN_ONCE to report this issue early because the refcount bug
>>>>> is hard to investigate.
>>>>>
>>>>> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
>>>>> ---
>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 14 +++++++++++++-
>>>>>  1 file changed, 13 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>>>> index d10c6673f4de..15ab2db4af1d 100644
>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>>>> @@ -217,7 +217,8 @@ svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn)
>>>>>  	page = pfn_to_page(pfn);
>>>>>  	svm_range_bo_ref(prange->svm_bo);
>>>>>  	page->zone_device_data = prange->svm_bo;
>>>>> -	zone_device_page_init(page);
>>>>> +	get_page(page);
>>>>> +	lock_page(page);
>>>>>  }
>>>>>
>>>>>  static void
>>>>> @@ -552,6 +553,17 @@ svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc,
>>>>>  	if (mpages) {
>>>>>  		prange->actual_loc = best_loc;
>>>>>  		prange->vram_pages += mpages;
>>>>> +		/*
>>>>> +		 * Guarantee that we hold the correct page refcount for all
>>>>> +		 * prange vram pages and the correct svm_bo refcount.
>>>>> +		 * After the prange is migrated to VRAM, each vram page refcount
>>>>> +		 * holds one svm_bo refcount, and the vram node holds one refcount.
>>>>> +		 * After a page is migrated to system memory, its vram page refcount
>>>>> +		 * drops to 0 and svm_migrate_page_free drops the svm_bo refcount.
>>>>> +		 * svm_range_vram_node_free will free the svm_bo.
>>>>> +		 */
>>>>> +		WARN_ONCE(prange->vram_pages == kref_read(&prange->svm_bo->kref),
>>>>> +			  "svm_bo refcount leaking\n");
>>>>>  	} else if (!prange->actual_loc) {
>>>>>  		/* if no page migrated and all pages from prange are at
>>>>>  		 * sys ram drop svm_bo got from svm_range_vram_node_new

^ permalink raw reply	[flat|nested] 5+ messages in thread
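To make the interleaving described above concrete, here is a simplified
two-thread timeline reconstructed from the discussion. It is a sketch only:
the ordering and the set_page_count behavior are taken from the messages
above, and the refcount values are illustrative rather than from a trace.

  Thread A: CPU fault (do_swap_page)        Thread B: GPU fault (migrate to VRAM)
  ----------------------------------        -------------------------------------
  trylock_page(page)
  get_page(page)
  pgmap->ops->migrate_to_ram(vmf)
    (data copied out; the driver no
     longer tracks the page as allocated)
  unlock_page(page)
                                            svm_migrate_get_vram_page(prange, pfn)
                                              zone_device_page_init(page)
                                                set_page_count(page, 1)
                                                (ignores the reference that
                                                 thread A still holds)
  put_page(page)
    refcount 1 -> 0, page_free() runs and
    drops the svm_bo reference while the
    page is now in use by thread B

This is the window the patch above is trying to close.
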
* Re: [PATCH] drm/amdkfd: Fix svm_bo and vram page refcount
  2025-10-03 22:16 ` [PATCH] drm/amdkfd: Fix svm_bo and vram page refcount Felix Kuehling
@ 2025-10-06 12:55   ` Philip Yang
  2025-10-06 13:21   ` Jason Gunthorpe
  1 sibling, 0 replies; 5+ messages in thread
From: Philip Yang @ 2025-10-06 12:55 UTC (permalink / raw)
  To: Felix Kuehling, Philip Yang, amd-gfx, Linux MM, Jason Gunthorpe,
	Leon Romanovsky

On 2025-10-03 18:16, Felix Kuehling wrote:
> [+Linux MM and HMM maintainers]
>
> Please see below my question about the safety of using
> zone_device_page_init.
>
> On 2025-10-03 18:02, Philip Yang wrote:
>>
>> On 2025-10-03 17:46, Felix Kuehling wrote:
>>>
>>> On 2025-10-03 17:18, Philip Yang wrote:
>>>>
>>>> On 2025-10-03 17:05, Felix Kuehling wrote:
>>>>> On 2025-09-26 17:03, Philip Yang wrote:
>>>>>> zone_device_page_init uses set_page_count to set the vram page
>>>>>> refcount to 1. There is a race if step 2 happens between steps 1
>>>>>> and 3.
>>>>>>
>>>>>> 1. CPU page fault handler gets the vram page and migrates the vram
>>>>>>    page to a system page
>>>>>> 2. GPU page fault migrates to the vram page and sets the page
>>>>>>    refcount to 1
>>>>>> 3. CPU page fault handler puts the vram page; the vram page
>>>>>>    refcount drops to 0 and drops the vram_bo refcount
>>>>>> 4. The vram_bo refcount is 1 off because the vram page is still in
>>>>>>    use.
>>>>>>
>>>>>> Afterwards, this causes a use-after-free bug and a page refcount
>>>>>> warning.
>>>>>
>>>>> This implies that migration to RAM and to VRAM of the same range
>>>>> are happening at the same time. Isn't that a bigger problem? It
>>>>> means someone doing a migration is not holding the
>>>>> prange->migrate_mutex.
>>>>
>>>> Migration holds prange->migrate_mutex, so we don't have migration to
>>>> RAM and to VRAM of the same range at the same time. The issue is in
>>>> step 3: the CPU page fault handler do_swap_page calls put_page after
>>>> pgmap->ops->migrate_to_ram() returns, and that can happen during
>>>> migrate_to_vram.
>>>
>>> That's the part I don't understand. The CPU page fault handler
>>> (svm_migrate_to_ram) is holding prange->migrate_mutex until the very
>>> end. Where do we have a put_page for a zone_device page outside the
>>> prange->migrate_mutex? Do you have a backtrace?
>>
>> do_swap_page() {
>> .......
>> 	} else if (is_device_private_entry(entry)) {
>> ........
>>
>> 		/*
>> 		 * Get a page reference while we know the page can't be
>> 		 * freed.
>> 		 */
>> 		if (trylock_page(vmf->page)) {
>> 			struct dev_pagemap *pgmap;
>>
>> 			get_page(vmf->page);
>> 			pte_unmap_unlock(vmf->pte, vmf->ptl);
>> 			pgmap = page_pgmap(vmf->page);
>> 			ret = pgmap->ops->migrate_to_ram(vmf);
>> 			unlock_page(vmf->page);
>> 			put_page(vmf->page);
>>
>> This put_page reduces the vram page refcount to zero if migrate_to_vram
>> -> svm_migrate_get_vram_page has already called zone_device_page_init,
>> which sets the page refcount to 1.
>>
>> put_page must come after unlock_page because put_page may free the
>> page, so svm_migrate_get_vram_page can lock the page, but the page
>> refcount becomes 0.
>
> OK. Then you must have hit the
> WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref)) in that
> function.

That warning is for the pgmap percpu refcount, not for the page
refcount; I didn't see this warning.

>
> It sounds like zone_device_page_init is just unsafe to use in general.
> It assumes that pages have a 0 refcount. But I don't see a good way for
> drivers to guarantee that, because they are not in control of when the
> page refcounts for their zone-device pages get decremented.

This issue seems to be caused by the change in commit 1afaeb8293c9
("mm/migrate: Trylock device page in do_swap_page"). I am not sure
whether the same fix is needed in the other drivers that call
zone_device_page_init.

Regards,

Philip

>
> Regards,
>   Felix
>
>
>>
>> Regards,
>>
>> Philip
>>
>>>
>>> Regards,
>>>   Felix
>>>
>>>
>>>>
>>>> Regards,
>>>>
>>>> Philip
>>>>
>>>>>
>>>>> Regards,
>>>>>   Felix
>>>>>
>>>>>
>>>>>>
>>>>>> zone_device_page_init should not be used in page migration; change
>>>>>> to get_page to fix the race bug.
>>>>>>
>>>>>> Add WARN_ONCE to report this issue early because the refcount bug
>>>>>> is hard to investigate.
>>>>>>
>>>>>> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
>>>>>> ---
>>>>>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 14 +++++++++++++-
>>>>>>  1 file changed, 13 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>>>>> index d10c6673f4de..15ab2db4af1d 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>>>>> @@ -217,7 +217,8 @@ svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn)
>>>>>>  	page = pfn_to_page(pfn);
>>>>>>  	svm_range_bo_ref(prange->svm_bo);
>>>>>>  	page->zone_device_data = prange->svm_bo;
>>>>>> -	zone_device_page_init(page);
>>>>>> +	get_page(page);
>>>>>> +	lock_page(page);
>>>>>>  }
>>>>>>
>>>>>>  static void
>>>>>> @@ -552,6 +553,17 @@ svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc,
>>>>>>  	if (mpages) {
>>>>>>  		prange->actual_loc = best_loc;
>>>>>>  		prange->vram_pages += mpages;
>>>>>> +		/*
>>>>>> +		 * Guarantee that we hold the correct page refcount for all
>>>>>> +		 * prange vram pages and the correct svm_bo refcount.
>>>>>> +		 * After the prange is migrated to VRAM, each vram page refcount
>>>>>> +		 * holds one svm_bo refcount, and the vram node holds one refcount.
>>>>>> +		 * After a page is migrated to system memory, its vram page refcount
>>>>>> +		 * drops to 0 and svm_migrate_page_free drops the svm_bo refcount.
>>>>>> +		 * svm_range_vram_node_free will free the svm_bo.
>>>>>> +		 */
>>>>>> +		WARN_ONCE(prange->vram_pages == kref_read(&prange->svm_bo->kref),
>>>>>> +			  "svm_bo refcount leaking\n");
>>>>>>  	} else if (!prange->actual_loc) {
>>>>>>  		/* if no page migrated and all pages from prange are at
>>>>>>  		 * sys ram drop svm_bo got from svm_range_vram_node_new

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: [PATCH] drm/amdkfd: Fix svm_bo and vram page refcount
  2025-10-03 22:16 ` [PATCH] drm/amdkfd: Fix svm_bo and vram page refcount Felix Kuehling
  2025-10-06 12:55   ` Philip Yang
@ 2025-10-06 13:21   ` Jason Gunthorpe
  2025-10-06 17:51     ` Felix Kuehling
  1 sibling, 1 reply; 5+ messages in thread
From: Jason Gunthorpe @ 2025-10-06 13:21 UTC (permalink / raw)
  To: Felix Kuehling, Alistair Popple
  Cc: Philip Yang, Philip Yang, amd-gfx, Linux MM, Leon Romanovsky

On Fri, Oct 03, 2025 at 06:16:14PM -0400, Felix Kuehling wrote:

> It sounds like zone_device_page_init is just unsafe to use in
> general.

It can only be used if you know the page is freed.

> It assumes that pages have a 0 refcount.

Yes

> But I don't see a good way for drivers to guarantee that, because
> they are not in control of when the page refcounts for their
> zone-device pages get decremented.

?? Drivers are supposed to hook pgmap->ops->page_free() and keep
track.

There is no way to write a driver without calling
zone_device_page_init(), as there is no other defined way to reuse a
page that has been returned through page_free().

It is completely wrong to call get_page() on a 0-refcount folio; we
don't have a debugging crash for this, but we really should. If you
think the refcount could be 0, you have to use a try_get().

So this patch looks wrong to me. I see a page_free() implementation,
and this is the only call to zone_device_page_init(). If you remove
it, the driver is absolutely broken.

I would expect migration to be writing to freed memory, and
zone_device_page_init() is the correct and only way to make freed
memory usable again.

Therefore, I expect the refcount to be 0 when
svm_migrate_ram_to_vram() picks a dst.

If it is not true, and you are trying to migrate to already allocated
VRAM, then WTF? And if you really want to do that, then yes, you need
to use get_page, but you need a different path to handle already
allocated vs page_free() called. get_page() MUST NOT be used to unfree
page_free'd memory.

The explanation in the commit doesn't really have enough detail:

> 1. CPU page fault handler gets the vram page and migrates the vram
>    page to a system page
> 2. GPU page fault migrates to the vram page and sets the page
>    refcount to 1

So why is the same vram page being used for both? For #1 the VRAM page
is installed in a swap entry, so it has an elevated refcount.

The implication is that #2 is targeting already allocated VRAM memory
that is NOT FREE.

Jason

^ permalink raw reply	[flat|nested] 5+ messages in thread
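For readers following along, the lifecycle Jason describes can be sketched
roughly as below. This is a hypothetical illustration, not the amdkfd
implementation: the names (my_vram, free_bitmap, my_page_free,
my_alloc_device_page) are made up, and the real driver tracks its pages
through svm_bo objects rather than a bitmap. The point it illustrates is
that zone_device_page_init() is only ever called on pages that page_free()
has already reported as free.

#include <linux/memremap.h>
#include <linux/mm.h>
#include <linux/bitmap.h>
#include <linux/spinlock.h>

struct my_vram {
	unsigned long base_pfn;
	unsigned long nr_pages;
	unsigned long *free_bitmap;	/* bit set == page has been freed */
	spinlock_t lock;
};

/* Core mm calls this when a device-private page's refcount drops to 0. */
static void my_page_free(struct page *page)
{
	struct my_vram *vram = page->zone_device_data;

	spin_lock(&vram->lock);
	set_bit(page_to_pfn(page) - vram->base_pfn, vram->free_bitmap);
	spin_unlock(&vram->lock);
}

/* Migration destinations come only from pages page_free() handed back. */
static struct page *my_alloc_device_page(struct my_vram *vram)
{
	struct page *page = NULL;
	unsigned long idx;

	spin_lock(&vram->lock);
	idx = find_first_bit(vram->free_bitmap, vram->nr_pages);
	if (idx < vram->nr_pages) {
		clear_bit(idx, vram->free_bitmap);
		page = pfn_to_page(vram->base_pfn + idx);
	}
	spin_unlock(&vram->lock);

	if (page) {
		/* Safe: a freed page is known to have a 0 refcount here. */
		page->zone_device_data = vram;
		zone_device_page_init(page);
	}
	return page;
}

static const struct dev_pagemap_ops my_pgmap_ops = {
	.page_free	= my_page_free,
	/* .migrate_to_ram is also required for device-private memory */
};

In a scheme like this the driver never has to guess whether some other
kernel path still holds a reference: a page only becomes a migration
destination after its refcount has actually reached zero and page_free()
has run.
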
* Re: [PATCH] drm/amdkfd: Fix svm_bo and vram page refcount
  2025-10-06 13:21   ` Jason Gunthorpe
@ 2025-10-06 17:51     ` Felix Kuehling
  2025-10-06 18:35       ` Jason Gunthorpe
  0 siblings, 1 reply; 5+ messages in thread
From: Felix Kuehling @ 2025-10-06 17:51 UTC (permalink / raw)
  To: Jason Gunthorpe, Alistair Popple
  Cc: Philip Yang, Philip Yang, amd-gfx, Linux MM, Leon Romanovsky

On 2025-10-06 09:21, Jason Gunthorpe wrote:
> On Fri, Oct 03, 2025 at 06:16:14PM -0400, Felix Kuehling wrote:
>
>> It sounds like zone_device_page_init is just unsafe to use in
>> general.
>
> It can only be used if you know the page is freed.
>
>> It assumes that pages have a 0 refcount.
>
> Yes
>
>> But I don't see a good way for drivers to guarantee that, because
>> they are not in control of when the page refcounts for their
>> zone-device pages get decremented.
>
> ?? Drivers are supposed to hook pgmap->ops->page_free() and keep
> track.

Right, we have that callback; it's currently only used to track
references to the buffer objects that back our device pages.

>
> There is no way to write a driver without calling
> zone_device_page_init(), as there is no other defined way to reuse a
> page that has been returned through page_free().
>
> It is completely wrong to call get_page() on a 0-refcount folio; we
> don't have a debugging crash for this, but we really should. If you
> think the refcount could be 0, you have to use a try_get().
>
> So this patch looks wrong to me. I see a page_free() implementation,
> and this is the only call to zone_device_page_init(). If you remove
> it, the driver is absolutely broken.
>
> I would expect migration to be writing to freed memory, and
> zone_device_page_init() is the correct and only way to make freed
> memory usable again.

OK. We made an incorrect assumption that we can reuse a page if the
driver isn't tracking it as allocated to any of our SVM ranges (i.e.,
after dev_pagemap_ops.migrate_to_ram() migrated all data out of the
page). However, we neglected that other parts of the kernel can still
hold references to a page even after that.

Would something like this work:

static void
svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn)
{
	...
	if (!try_get_page(page)) {
		page->zone_device_data = prange->svm_bo;
		zone_device_page_init(page);
	}
}

>
> Therefore, I expect the refcount to be 0 when
> svm_migrate_ram_to_vram() picks a dst.
>
> If it is not true, and you are trying to migrate to already allocated
> VRAM, then WTF?

As I understand it, it's a race condition. The driver is done with the
page and its migrate_to_ram() call has completed. But do_swap_page
hasn't called put_page yet. At the same time, another thread is trying
to reuse the page, migrating data back to VRAM.

Regards,
  Felix

>
> And if you really want to do that, then yes, you need to use get_page,
> but you need a different path to handle already allocated vs
> page_free() called. get_page() MUST NOT be used to unfree page_free'd
> memory.
>
> The explanation in the commit doesn't really have enough detail:
>
>> 1. CPU page fault handler gets the vram page and migrates the vram
>>    page to a system page
>> 2. GPU page fault migrates to the vram page and sets the page
>>    refcount to 1
>
> So why is the same vram page being used for both? For #1 the VRAM page
> is installed in a swap entry, so it has an elevated refcount.
>
> The implication is that #2 is targeting already allocated VRAM memory
> that is NOT FREE.
>
> Jason

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: [PATCH] drm/amdkfd: Fix svm_bo and vram page refcount
  2025-10-06 17:51     ` Felix Kuehling
@ 2025-10-06 18:35       ` Jason Gunthorpe
  0 siblings, 0 replies; 5+ messages in thread
From: Jason Gunthorpe @ 2025-10-06 18:35 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Alistair Popple, Philip Yang, Philip Yang, amd-gfx, Linux MM,
	Leon Romanovsky

On Mon, Oct 06, 2025 at 01:51:37PM -0400, Felix Kuehling wrote:

> OK. We made an incorrect assumption that we can reuse a page if the
> driver isn't tracking it as allocated to any of our SVM ranges (i.e.,
> after dev_pagemap_ops.migrate_to_ram() migrated all data out of the
> page). However, we neglected that other parts of the kernel can still
> hold references to a page even after that.

Yes, that sounds completely incorrect.

> As I understand it, it's a race condition. The driver is done with the
> page and its migrate_to_ram() call has completed. But do_swap_page
> hasn't called put_page yet. At the same time, another thread is trying
> to reuse the page, migrating data back to VRAM.

Which means the driver is not properly tracking freed pages.

I don't think the code you showed makes a lot of sense. If someone else
has a reference on the page, it could be for many reasons. If you take
a non-free page and treat it as free and safe to use, you are probably
adding a security bug.

Jason

^ permalink raw reply	[flat|nested] 5+ messages in thread
end of thread

Thread overview: 5+ messages
[not found] <20250926210331.17401-1-Philip.Yang@amd.com>
[not found] ` <87ae1017-5990-4d6e-b42c-7a15f5663281@amd.com>
[not found] ` <f3349a43-446f-f712-ac61-fa867cd74242@amd.com>
[not found] ` <674f455e-434d-43d2-8f4f-18f577479ac9@amd.com>
[not found] ` <aa910171-bc96-d8b1-1bee-65f3ef5d1f46@amd.com>
2025-10-03 22:16 ` [PATCH] drm/amdkfd: Fix svm_bo and vram page refcount Felix Kuehling
2025-10-06 12:55 ` Philip Yang
2025-10-06 13:21 ` Jason Gunthorpe
2025-10-06 17:51 ` Felix Kuehling
2025-10-06 18:35 ` Jason Gunthorpe