* [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? @ 2025-01-23 10:20 David Hildenbrand 2025-01-23 15:08 ` Simona Vetter 0 siblings, 1 reply; 16+ messages in thread From: David Hildenbrand @ 2025-01-23 10:20 UTC (permalink / raw) To: linux-mm Cc: John Hubbard, nouveau, Jason Gunthorpe, Alistair Popple, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich Hi, I keep finding issues in our implementation of "device exclusive non-swap entries", and the way it messes with mapcounts is disgusting. As a reminder, what we do here is to replace a PTE pointing to an anonymous page by a "device exclusive non-swap entry". As long as the original PTE is in place, only the CPU can access it; as soon as the "device exclusive non-swap entry" is in place, only the device can access it. Conversion back and forth is triggered by CPU / device faults. I have fixes/reworks/simplifications for most things, but as there is only a single "real" in-tree user of make_device_exclusive(): drivers/gpu/drm/nouveau/nouveau_svm.c to "support SVM atomics in Nouveau [1]" naturally I am wondering: is this still a thing on actual hardware, or is it already stale on recent hardware and not really required anymore? [1] https://lore.kernel.org/linux-kernel//6621654.gmDyfcmpjF@nvdebian/T/ -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-23 10:20 [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? David Hildenbrand @ 2025-01-23 15:08 ` Simona Vetter 2025-01-24 10:44 ` David Hildenbrand 0 siblings, 1 reply; 16+ messages in thread From: Simona Vetter @ 2025-01-23 15:08 UTC (permalink / raw) To: David Hildenbrand Cc: linux-mm, John Hubbard, nouveau, Jason Gunthorpe, Alistair Popple, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On Thu, Jan 23, 2025 at 11:20:37AM +0100, David Hildenbrand wrote: > Hi, > > I keep finding issues in our implementation of "device exclusive non-swap > entries", and the way it messes with mapcounts is disgusting. > > As a reminder, what we do here is to replace a PTE pointing to an anonymous > page by a "device exclusive non-swap entry". > > As long as the original PTE is in place, only the CPU can access it; as soon > as the "device exclusive non-swap entry" is in place, only the device can > access it. Conversion back and forth is triggered by CPU / device faults. > > I have fixes/reworks/simplifications for most things, but as there is only a > single "real" in-tree user of make_device_exclusive(): > > drivers/gpu/drm/nouveau/nouveau_svm.c > > to "support SVM atomics in Nouveau [1]" > > naturally I am wondering: is this still a thing on actual hardware, or is it > already stale on recent hardware and not really required anymore? > > > [1] https://lore.kernel.org/linux-kernel//6621654.gmDyfcmpjF@nvdebian/T/ As long as you don't have a coherent interconnect it's needed. On intel discrete, device atomics require device memory, so they need full hmm migration (and hence won't use this function even once we land intel gpu svm code in upstream). On integrated the gpu is tied into the coherency fabric, so there it's not needed. 
I think the more fundamental question with both this function here and with forced migration to device memory is that there's no guarantee it will work out. At least that's my understanding. And for this gpu device atomics without coherent interconnect idea to work, we'd need to be able to guarantee that we can make any page device exclusive. So from my side I have some pretty big question marks on this entire thing overall. Cheers, Sima -- Simona Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-23 15:08 ` Simona Vetter @ 2025-01-24 10:44 ` David Hildenbrand 2025-01-24 14:11 ` Jason Gunthorpe 2025-01-24 15:28 ` Simona Vetter 0 siblings, 2 replies; 16+ messages in thread From: David Hildenbrand @ 2025-01-24 10:44 UTC (permalink / raw) To: linux-mm, John Hubbard, nouveau, Jason Gunthorpe, Alistair Popple, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On 23.01.25 16:08, Simona Vetter wrote: > On Thu, Jan 23, 2025 at 11:20:37AM +0100, David Hildenbrand wrote: >> Hi, >> >> I keep finding issues in our implementation of "device exclusive non-swap >> entries", and the way it messes with mapcounts is disgusting. >> >> As a reminder, what we do here is to replace a PTE pointing to an anonymous >> page by a "device exclusive non-swap entry". >> >> As long as the original PTE is in place, only the CPU can access it; as soon >> as the "device exclusive non-swap entry" is in place, only the device can >> access it. Conversion back and forth is triggered by CPU / device faults. >> >> I have fixes/reworks/simplifications for most things, but as there is only a >> single "real" in-tree user of make_device_exclusive(): >> >> drivers/gpu/drm/nouveau/nouveau_svm.c >> >> to "support SVM atomics in Nouveau [1]" >> >> naturally I am wondering: is this still a thing on actual hardware, or is it >> already stale on recent hardware and not really required anymore? >> >> >> [1] https://lore.kernel.org/linux-kernel//6621654.gmDyfcmpjF@nvdebian/T/ > Thanks for your answer! Nvidia folks told me on a different channel that it's still getting used. > As long as you don't have a coherent interconnect it's needed. On intel > discrete, device atomics require device memory, so they need full hmm > migration (and hence won't use this function even once we land intel gpu > svm code in upstream). Makes sense. 
> On integrated the gpu is tied into the coherency > fabric, so there it's not needed. > > I think the more fundamental question with both this function here and > with forced migration to device memory is that there's no guarantee it > will work out. Yes, in particular with device-exclusive, it doesn't really work with THP and is only limited to anonymous memory. I have patches to at least make it work reliably with THP. Then, we seem to give up too easily if we cannot lock the folio when wanting to convert to device-exclusive, which also looks rather odd. But well, maybe it just works well enough in the common case, or there is some other retry logic that makes it fly. > At least that's my understanding. And for this gpu device > atomics without coherent interconnect idea to work, we'd need to be able > to guarantee that we can make any page device exclusive. So from my side I > have some pretty big question marks on this entire thing overall. I don't think other memory (shmem/file/...) is really feasible as soon as other processes (not the current process) map/write/read file pages. We could really only handle it if we converted a single PTE and that PTE is getting converted back again. There are other concerns I have (what if the page is pinned and accessed outside of the user space page tables?). Maybe there was no need to handle these cases so far. So best I can do is make anonymous memory more reliable with device-exclusive and fixup some of the problematic parts that I see (e.g., broken page reclaim, page migration, ...). But before starting to cleanup+improve the existing handling of anonymous memory, I was wondering if this whole thing is getting used at all. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-24 10:44 ` David Hildenbrand @ 2025-01-24 14:11 ` Jason Gunthorpe 2025-01-24 14:39 ` David Hildenbrand 2025-01-24 15:28 ` Simona Vetter 1 sibling, 1 reply; 16+ messages in thread From: Jason Gunthorpe @ 2025-01-24 14:11 UTC (permalink / raw) To: David Hildenbrand Cc: linux-mm, John Hubbard, nouveau, Alistair Popple, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On Fri, Jan 24, 2025 at 11:44:28AM +0100, David Hildenbrand wrote: > There are other concerns I have (what if the page is pinned and accessed > outside of the user space page tables?). Maybe there was no need to handle > these cases so far. I think a lot of this depends on userspace following some restrictions so that the pages are always convertible. Presumably if userspace breaks things then their atomic-using GPU kernels will fault. So, from a kernel perspective, I'd suggest that creating a reasonable set of conditions that userspace can follow to have it work reliably is a reasonable goal. Jason ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-24 14:11 ` Jason Gunthorpe @ 2025-01-24 14:39 ` David Hildenbrand 0 siblings, 0 replies; 16+ messages in thread From: David Hildenbrand @ 2025-01-24 14:39 UTC (permalink / raw) To: Jason Gunthorpe Cc: linux-mm, John Hubbard, nouveau, Alistair Popple, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On 24.01.25 15:11, Jason Gunthorpe wrote: > On Fri, Jan 24, 2025 at 11:44:28AM +0100, David Hildenbrand wrote: > >> There are other concerns I have (what if the page is pinned and accessed >> outside of the user space page tables?). Maybe there was no need to handle >> these cases so far. > > I think a lot of this depends on userspace following some restrictions > so that the pages are always convertible. Presumably if userspace > breaks things then their atomic-using GPU kernels will fault. > > So, from a kernel perspective, I'd suggest that creating a reasonable > set of conditions that userspace can follow to have it work reliably > is a reasonable goal. Yes, that's my assumption as well. If you register a page using io_uring as a fixed buffer and then trigger "device_exclusive" access, the page can still be read/written using io_uring and atomics might not work as expected. "Not supported". -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-24 10:44 ` David Hildenbrand 2025-01-24 14:11 ` Jason Gunthorpe @ 2025-01-24 15:28 ` Simona Vetter 2025-01-24 17:54 ` David Hildenbrand 1 sibling, 1 reply; 16+ messages in thread From: Simona Vetter @ 2025-01-24 15:28 UTC (permalink / raw) To: David Hildenbrand Cc: linux-mm, John Hubbard, nouveau, Jason Gunthorpe, Alistair Popple, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On Fri, Jan 24, 2025 at 11:44:28AM +0100, David Hildenbrand wrote: > On 23.01.25 16:08, Simona Vetter wrote: > > On Thu, Jan 23, 2025 at 11:20:37AM +0100, David Hildenbrand wrote: > > > Hi, > > > > > > I keep finding issues in our implementation of "device exclusive non-swap > > > entries", and the way it messes with mapcounts is disgusting. > > > > > > As a reminder, what we do here is to replace a PTE pointing to an anonymous > > > page by a "device exclusive non-swap entry". > > > > > > As long as the original PTE is in place, only the CPU can access it; as soon > > > as the "device exclusive non-swap entry" is in place, only the device can > > > access it. Conversion back and forth is triggered by CPU / device faults. > > > > > > I have fixes/reworks/simplifications for most things, but as there is only a > > > single "real" in-tree user of make_device_exclusive(): > > > > > > drivers/gpu/drm/nouveau/nouveau_svm.c > > > > > > to "support SVM atomics in Nouveau [1]" > > > > > > naturally I am wondering: is this still a thing on actual hardware, or is it > > > already stale on recent hardware and not really required anymore? > > > > > > > > > [1] https://lore.kernel.org/linux-kernel//6621654.gmDyfcmpjF@nvdebian/T/ > > > > Thanks for your answer! > > Nvidia folks told me on a different channel that it's still getting used. > > > As long as you don't have a coherent interconnect it's needed. 
On intel > > discrete, device atomics require device memory, so they need full hmm > > migration (and hence won't use this function even once we land intel gpu > > svm code in upstream). > > Makes sense. > > > On integrated the gpu is tied into the coherency > > fabric, so there it's not needed. > > > > I think the more fundamental question with both this function here and > > with forced migration to device memory is that there's no guarantee it > > will work out. > > Yes, in particular with device-exclusive, it doesn't really work with THP > and is only limited to anonymous memory. I have patches to at least make it > work reliably with THP. I should have crawled through the implementation first before replying. Since it only looks at folio_mapcount() make_device_exclusive() should at least in theory work reliably on anon memory, and not be impacted by elevated refcounts due to migration/ksm/thp/whatever. This is unlike device atomics that require migration to device memory, which is just fundamentally not a reliable thing. > Then, we seem to give up too easily if we cannot lock the folio when wanting > to convert to device-exclusive, which also looks rather odd. But well, maybe > it just works well enough in the common case, or there is some other retry > logic that makes it fly. I've crawled through the path to migrate pages from device memory back to system memory a few months ago, and found some livelock issues in there. Wouldn't be surprised if m_d_e has some of the same, but I didn't dig through it (not least because intel can't use it due to not-so-great hw design). > > At least that's my understanding. And for this gpu device > > atomics without coherent interconnect idea to work, we'd need to be able > > to guarantee that we can make any page device exclusive. So from my side I > > have some pretty big question marks on this entire thing overall. > > I don't think other memory (shmem/file/...) 
is really feasible as soon as > other processes (not the current process) map/write/read file pages. Yeah none of the apis that use this internally in their implementations make any promises beyond memory acquired with libc's malloc() or one of the variants. So this limitation is fine. > We could really only handle it if we converted a single PTE and that PTE is > getting converted back again. > > There are other concerns I have (what if the page is pinned and accessed > outside of the user space page tables?). Maybe there was no need to handle > these cases so far. I think that's also ok, but it might be good to document clearly that concurrent direct i/o or rdma-registered buffers or whatever will mess with this. The promise is only between the gpu and the cpu, not anything else, in current apis. At least to my knowledge. > So best I can do is make anonymous memory more reliable with > device-exclusive and fixup some of the problematic parts that I see (e.g., > broken page reclaim, page migration, ...). > > But before starting to cleanup+improve the existing handling of anonymous > memory, I was wondering if this whole thing is getting used at all. Yeah if this can be made reliable (under the limitation of only anon memory and only excluding userspace access) then I expect we'll need this for a very long time. I just had no idea whether even that is possible. What isn't good is if it's only mostly reliable, like the current pgmap->ops->migrate_to_ram() path in do_swap_page() still is. But that one is fixable; the patches should be floating around somewhere. -Sima -- Simona Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-24 15:28 ` Simona Vetter @ 2025-01-24 17:54 ` David Hildenbrand 2025-01-28 0:09 ` Alistair Popple 0 siblings, 1 reply; 16+ messages in thread From: David Hildenbrand @ 2025-01-24 17:54 UTC (permalink / raw) To: linux-mm, John Hubbard, nouveau, Jason Gunthorpe, Alistair Popple, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich >>> On integrated the gpu is tied into the coherency >>> fabric, so there it's not needed. >>> >>> I think the more fundamental question with both this function here and >>> with forced migration to device memory is that there's no guarantee it >>> will work out. >> >> Yes, in particular with device-exclusive, it doesn't really work with THP >> and is only limited to anonymous memory. I have patches to at least make it >> work reliably with THP. > > I should have crawled through the implementation first before replying. > Since it only looks at folio_mapcount() make_device_exclusive() should at > least in theory work reliably on anon memory, and not be impacted by > elevated refcounts due to migration/ksm/thp/whatever. Yes, there is -- in theory -- nothing blocking the conversion except the folio lock. That's different than page migration. [...] > >> Then, we seem to give up too easily if we cannot lock the folio when wanting >> to convert to device-exclusive, which also looks rather odd. But well, maybe >> it just works well enough in the common case, or there is some other retry >> logic that makes it fly. > > I've crawled through the path to migrate pages from device memory back to > system memory a few months ago, and found some livelock issues in there. > Wouldn't be surprised if m_d_e has some of the same, but I didn't dig > through it (not least because intel can't use it due to not-so-great hw > design). Yes, that might be possible. 
Maybe something keeps spinning while the folio is locked instead of properly waiting for the lock. ... or someone is trying to convert a tail page of a THP, in which case we will also keep failing the conversion right now. >> There are other concerns I have (what if the page is pinned and accessed >> outside of the user space page tables?). Maybe there was no need to handle >> these cases so far. > > I think that's also ok, but it might be good to document clearly that > concurrent direct i/o or rdma-registered buffers or whatever will mess with > this. The promise is only between the gpu and the cpu, not anything else, > in current apis. At least to my knowledge. Well, the issue is that, e.g., io_uring fixed buffers can be accessed from the CPU using the direct map, not necessarily using DMA from a device. In any case, I'm planning on adding some code-level documentation for that and look into extending the high-level doc while at it. > >> So best I can do is make anonymous memory more reliable with >> device-exclusive and fixup some of the problematic parts that I see (e.g., >> broken page reclaim, page migration, ...). >> >> But before starting to cleanup+improve the existing handling of anonymous >> memory, I was wondering if this whole thing is getting used at all. > > Yeah if this can be made reliable (under the limitation of only anon > memory and only excluding userspace access) then I expect we'll need this > for a very long time. I just had no idea whether even that is possible. > > What isn't good is if it's only mostly reliable, like the current > pgmap->ops->migrate_to_ram() path in do_swap_page() still is. I'll cc you on patches once I figure out some details on how to fix some page table walkers that really don't expect these non-swap entries. Fortunately, the hmm test device is in place to trigger some shaky scenarios. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-24 17:54 ` David Hildenbrand @ 2025-01-28 0:09 ` Alistair Popple 2025-01-28 20:14 ` Simona Vetter 0 siblings, 1 reply; 16+ messages in thread From: Alistair Popple @ 2025-01-28 0:09 UTC (permalink / raw) To: David Hildenbrand Cc: linux-mm, John Hubbard, nouveau, Jason Gunthorpe, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On Fri, Jan 24, 2025 at 06:54:02PM +0100, David Hildenbrand wrote: > > > > On integrated the gpu is tied into the coherency > > > > fabric, so there it's not needed. > > > > > > > > I think the more fundamental question with both this function here and > > > > with forced migration to device memory is that there's no guarantee it > > > > will work out. > > > > > > Yes, in particular with device-exclusive, it doesn't really work with THP > > > and is only limited to anonymous memory. I have patches to at least make it > > > work reliably with THP. > > > > I should have crawled through the implementation first before replying. > > Since it only looks at folio_mapcount() make_device_exclusive() should at > > least in theory work reliably on anon memory, and not be impacted by > > elevated refcounts due to migration/ksm/thp/whatever. > > Yes, there is -- in theory -- nothing blocking the conversion except the > folio lock. That's different than page migration. Indeed - this was the entire motivation for make_device_exclusive() - that we needed a way to reliably exclude CPU access that couldn't be blocked in the same way page migration can (otherwise we could have just migrated to a device page, even if that may have added unwanted overhead). > [...] > > > > > > Then, we seem to give up too easily if we cannot lock the folio when wanting > > > to convert to device-exclusive, which also looks rather odd. 
But well, maybe > > > it just works well enough in the common case, or there is some other retry > > > logic that makes it fly. > > > > I've crawled through the path to migrate pages from device memory back to > > system memory a few months ago, and found some livelock issues in there. > > Wouldn't be surprised if m_d_e has some of the same, but I didn't dig > > through it (not least because intel can't use it due to not-so-great hw > > design). > > Yes, that might be possible. Maybe something keeps spinning while the folio > is locked instead of properly waiting for the lock. > > ... or someone is trying to convert a tail page of a THP, in which case we > will also keep failing the conversion right now. > > > There are other concerns I have (what if the page is pinned and accessed > > > outside of the user space page tables?). Maybe there was no need to handle > > > these cases so far. > > > > I think that's also ok, but it might be good to document clearly that > > concurrent direct i/o or rdma-registered buffers or whatever will mess with > > this. The promise is only between the gpu and the cpu, not anything else, > > in current apis. At least to my knowledge. That is correct - we never came up with a good solution for direct i/o or rdma, so the promise is only between cpu and gpu. That said, direct i/o probably implies a shared filebacked mapping. We weren't convinced there was a useful use case there because anything could potentially map a file and use non-atomic CPU instructions to access the shared mapping anyway. > Well, the issue is that, e.g., io_uring fixed buffers can be accessed from the > CPU using the direct map, not necessarily using DMA from a > device. Right. This was all targeted at private anonymous memory and it's assumed that both cpu and gpu are using atomic operations. 
As soon as you're talking shared non-anonymous mappings it seems impossible for userspace to guarantee nothing is accessing that memory with non-atomic ops anyway, even with a coherent interconnect. > In any case, I'm planning on adding some code-level documentation for that > and look into extending the high-level doc while at it. Thanks! Please CC me in. > > > > > So best I can do is make anonymous memory more reliable with > > > device-exclusive and fixup some of the problematic parts that I see (e.g., > > > broken page reclaim, page migration, ...). > > > > > > But before starting to cleanup+improve the existing handling of anonymous > > > memory, I was wondering if this whole thing is getting used at all. > > > > Yeah if this can be made reliable (under the limitation of only anon > > memory and only excluding userspace access) then I expect we'll need this > > for a very long time. I just had no idea whether even that is possible. > > > > What isn't good is if it's only mostly reliable, like the current > > pgmap->ops->migrate_to_ram() path in do_swap_page() still is. Right. This is _supposed_ to be 100% reliable under those limitations although I will admit I didn't consider THP deeply as that is not supported for DEVICE_PRIVATE anyway. > I'll cc you on patches once I figure out some details on how to fix some > page table walkers that really don't expect these non-swap entries. > > Fortunately, the hmm test device is in place to trigger some shaky > scenarios. > > -- > Cheers, > > David / dhildenb > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-28 0:09 ` Alistair Popple @ 2025-01-28 20:14 ` Simona Vetter 2025-01-28 20:24 ` David Hildenbrand 0 siblings, 1 reply; 16+ messages in thread From: Simona Vetter @ 2025-01-28 20:14 UTC (permalink / raw) To: Alistair Popple Cc: David Hildenbrand, linux-mm, John Hubbard, nouveau, Jason Gunthorpe, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On Tue, Jan 28, 2025 at 11:09:24AM +1100, Alistair Popple wrote: > On Fri, Jan 24, 2025 at 06:54:02PM +0100, David Hildenbrand wrote: > > > > > On integrated the gpu is tied into the coherency > > > > > fabric, so there it's not needed. > > > > > > > > > > I think the more fundamental question with both this function here and > > > > > with forced migration to device memory is that there's no guarantee it > > > > > will work out. > > > > > > > > Yes, in particular with device-exclusive, it doesn't really work with THP > > > > and is only limited to anonymous memory. I have patches to at least make it > > > > work reliably with THP. > > > > > > I should have crawled through the implementation first before replying. > > > Since it only looks at folio_mapcount() make_device_exclusive() should at > > > least in theory work reliably on anon memory, and not be impacted by > > > elevated refcounts due to migration/ksm/thp/whatever. > > > > Yes, there is -- in theory -- nothing blocking the conversion except the > > folio lock. That's different than page migration. > > Indeed - this was the entire motivation for make_device_exclusive() - that we > needed a way to reliably exclude CPU access that couldn't be blocked in the same > way page migration can (otherwise we could have just migrated to a device page, > even if that may have added unwanted overhead). The folio_trylock worries me a bit. 
I guess this is to avoid deadlocks when locking multiple folios, but I think at least on the first one we need an unconditional folio_lock to guarantee forward progress. Since atomics can't cross 4k boundaries (or the hw is just really broken) this should be enough to avoid being stuck in a livelock. I'm also not seeing any other reason why a folio_lock shouldn't work here, but then my understanding of mm/ stuff is really just scratching the surface. I did crawl through all the other code and it looks like everything else is unconditional locks. So looks all good and I didn't spot anything else that seemed problematic. Somewhat aside, I do wonder whether we really want to require callers to hold the mmap lock, or whether with all the work towards lockless fastpath that shouldn't instead just be an implementation detail. At least for the gpu hmm code I've seen I've tried to push hard towards a world where the gpu side does not rely on mmap_read_lock being held at all, to future-proof this all. And currently we only have one caller of make_device_exclusive_range() so it would be simple to do. -Sima -- Simona Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-28 20:14 ` Simona Vetter @ 2025-01-28 20:24 ` David Hildenbrand 2025-01-29 10:48 ` Simona Vetter 0 siblings, 1 reply; 16+ messages in thread From: David Hildenbrand @ 2025-01-28 20:24 UTC (permalink / raw) To: Alistair Popple, linux-mm, John Hubbard, nouveau, Jason Gunthorpe, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On 28.01.25 21:14, Simona Vetter wrote: > On Tue, Jan 28, 2025 at 11:09:24AM +1100, Alistair Popple wrote: >> On Fri, Jan 24, 2025 at 06:54:02PM +0100, David Hildenbrand wrote: >>>>>> On integrated the gpu is tied into the coherency >>>>>> fabric, so there it's not needed. >>>>>> >>>>>> I think the more fundamental question with both this function here and >>>>>> with forced migration to device memory is that there's no guarantee it >>>>>> will work out. >>>>> >>>>> Yes, in particular with device-exclusive, it doesn't really work with THP >>>>> and is only limited to anonymous memory. I have patches to at least make it >>>>> work reliably with THP. >>>> >>>> I should have crawled through the implementation first before replying. >>>> Since it only looks at folio_mapcount() make_device_exclusive() should at >>>> least in theory work reliably on anon memory, and not be impacted by >>>> elevated refcounts due to migration/ksm/thp/whatever. >>> >>> Yes, there is -- in theory -- nothing blocking the conversion except the >>> folio lock. That's different than page migration. >> >> Indeed - this was the entire motivation for make_device_exclusive() - that we >> needed a way to reliably exclude CPU access that couldn't be blocked in the same >> way page migration can (otherwise we could have just migrated to a device page, >> even if that may have added unwanted overhead). > > The folio_trylock worries me a bit. 
I guess this is to avoid deadlocks > when locking multiple folios, but I think at least on the first one we > need an unconditional folio_lock to guarantee forward progress. At least on the hmm path I was able to trigger the EBUSY a couple of times due to concurrent swapout. But the hmm-tests selftest fails immediately instead of retrying. > Since > atomics can't cross 4k boundaries (or the hw is just really broken) this > should be enough to avoid being stuck in a livelock. I'm also not seeing > any other reason why a folio_lock shouldn't work here, but then my > understanding of mm/ stuff is really just scratching the surface. > > I did crawl through all the other code and it looks like everything else > is unconditional locks. So looks all good and I didn't spot anything else > that seemed problematic. > > Somewhat aside, I do wonder whether we really want to require callers to > hold the mmap lock, or whether with all the work towards lockless fastpath > that shouldn't instead just be an implementation detail. We might be able to use the VMA lock in the future, but that will require GUP support and a bunch more. Until then, the mm_lock in read mode is required. I was not able to convince myself that we'll really need the folio lock, but that's also a separate discussion. > At least for the > gpu hmm code I've seen I've tried to push hard towards a world were the > gpu side does not rely on mmap_read_lock being held at all, to future > proof this all. And currently we only have one caller of > make_device_exclusive_range() so would be simple to do. We could likely move the mmap_lock into that function, but avoiding it is more effort. In any case, I'll send something out probably tomorrow to fix page migration/swapout of pages with device-exclusive entries and a bunch of other things (THP, interaction with hugetlb, ...). -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-28 20:24 ` David Hildenbrand @ 2025-01-29 10:48 ` Simona Vetter 2025-01-29 11:28 ` Simona Vetter 0 siblings, 1 reply; 16+ messages in thread From: Simona Vetter @ 2025-01-29 10:48 UTC (permalink / raw) To: David Hildenbrand Cc: Alistair Popple, linux-mm, John Hubbard, nouveau, Jason Gunthorpe, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On Tue, Jan 28, 2025 at 09:24:33PM +0100, David Hildenbrand wrote: > On 28.01.25 21:14, Simona Vetter wrote: > > On Tue, Jan 28, 2025 at 11:09:24AM +1100, Alistair Popple wrote: > > > On Fri, Jan 24, 2025 at 06:54:02PM +0100, David Hildenbrand wrote: > > > > > > > On integrated the gpu is tied into the coherency > > > > > > > fabric, so there it's not needed. > > > > > > > > > > > > > > I think the more fundamental question with both this function here and > > > > > > > with forced migration to device memory is that there's no guarantee it > > > > > > > will work out. > > > > > > > > > > > > Yes, in particular with device-exclusive, it doesn't really work with THP > > > > > > and is only limited to anonymous memory. I have patches to at least make it > > > > > > work reliably with THP. > > > > > > > > > > I should have crawled through the implementation first before replying. > > > > > Since it only looks at folio_mapcount() make_device_exclusive() should at > > > > > least in theory work reliably on anon memory, and not be impacted by > > > > > elevated refcounts due to migration/ksm/thp/whatever. > > > > > > > > Yes, there is -- in theory -- nothing blocking the conversion except the > > > > folio lock. That's different than page migration. 
> > > > > > Indeed - this was the entire motivation for make_device_exclusive() - that we > > > needed a way to reliably exclude CPU access that couldn't be blocked in the same > > > way page migration can (otherwise we could have just migrated to a device page, > > > even if that may have added unwanted overhead). > > > > The folio_trylock worries me a bit. I guess this is to avoid deadlocks > > when locking multiple folios, but I think at least on the first one we > > need an unconditional folio_lock to guarantee forward progress. > > At least on the hmm path I was able to trigger the EBUSY a couple of times > due to concurrent swapout. But the hmm-tests selftest fails immediately > instead of retrying. My worry with just retrying is that it's very hard to assess whether there's a livelock or whether the retry has a good chance of success. As an example, the ->migrate_to_ram path has some trylocks, and the window where all other threads get halfway and then fail the trylock is big enough that once you pile up enough threads that spin through there, you're stuck forever. Which isn't great. So if we could convert at least the first folio_trylock into a plain lock then forward progress is obviously assured and there's no need to crawl through large chunks of mm/ code to hunt for corner cases where we could be too unlucky to ever win the race. > > Since > > atomics can't cross 4k boundaries (or the hw is just really broken) this > > should be enough to avoid being stuck in a livelock. I'm also not seeing > > any other reason why a folio_lock shouldn't work here, but then my > > understanding of mm/ stuff is really just scratching the surface. > > > > I did crawl through all the other code and it looks like everything else > > is unconditional locks. So looks all good and I didn't spot anything else > > that seemed problematic. 
> > > > Somewhat aside, I do wonder whether we really want to require callers to > > hold the mmap lock, or whether with all the work towards lockless fastpath > > that shouldn't instead just be an implementation detail. > > We might be able to use the VMA lock in the future, but that will require > GUP support and a bunch more. Until then, the mm_lock in read mode is > required. Yup. I also don't think we should try to improve before benchmarks show an actual need. It's more about future proofing and making sure mmap_lock doesn't leak into driver data structures that I'm worried about. Because I've seen some hmm/gpu rfc patches that heavily relied on mmap_lock to keep everything correct on the driver side, which is not a clean design. > I was not able to convince myself that we'll really need the folio lock, but > that's also a separate discussion. This is way above my pay understanding of mm/ unfortunately. > > At least for the > > gpu hmm code I've seen I've tried to push hard towards a world were the > > gpu side does not rely on mmap_read_lock being held at all, to future > > proof this all. And currently we only have one caller of > > make_device_exclusive_range() so would be simple to do. > > We could likely move the mmap_lock into that function, but avoiding it is > more effort. I didn't mean more than just that, which would make sure drivers at least do not rely on mmap_lock being held. That then allows us to switch over to vma lock or anything else entirely within mm/ code. If we leave it as-is then more drivers accidentally or intentionally will rely on this, like I think is the case for ->migrate_to_ram for hmm already. And then it's more pain to untangle. > In any case, I'll send something out probably tomorrow to fix page > migration/swapout of pages with device-exclusive entries and a bunch of > other things (THP, interaction with hugetlb, ...). Thanks a lot! 
Cheers, Sima > > -- > Cheers, > > David / dhildenb > -- Simona Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-29 10:48 ` Simona Vetter @ 2025-01-29 11:28 ` Simona Vetter 2025-01-29 11:31 ` David Hildenbrand 0 siblings, 1 reply; 16+ messages in thread From: Simona Vetter @ 2025-01-29 11:28 UTC (permalink / raw) To: David Hildenbrand, Alistair Popple, linux-mm, John Hubbard, nouveau, Jason Gunthorpe, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On Wed, Jan 29, 2025 at 11:48:03AM +0100, Simona Vetter wrote: > On Tue, Jan 28, 2025 at 09:24:33PM +0100, David Hildenbrand wrote: > > On 28.01.25 21:14, Simona Vetter wrote: > > > On Tue, Jan 28, 2025 at 11:09:24AM +1100, Alistair Popple wrote: > > > > On Fri, Jan 24, 2025 at 06:54:02PM +0100, David Hildenbrand wrote: > > > > > > > > On integrated the gpu is tied into the coherency > > > > > > > > fabric, so there it's not needed. > > > > > > > > > > > > > > > > I think the more fundamental question with both this function here and > > > > > > > > with forced migration to device memory is that there's no guarantee it > > > > > > > > will work out. > > > > > > > > > > > > > > Yes, in particular with device-exclusive, it doesn't really work with THP > > > > > > > and is only limited to anonymous memory. I have patches to at least make it > > > > > > > work reliably with THP. > > > > > > > > > > > > I should have crawled through the implementation first before replying. > > > > > > Since it only looks at folio_mapcount() make_device_exclusive() should at > > > > > > least in theory work reliably on anon memory, and not be impacted by > > > > > > elevated refcounts due to migration/ksm/thp/whatever. > > > > > > > > > > Yes, there is -- in theory -- nothing blocking the conversion except the > > > > > folio lock. That's different than page migration. 
> > > > > > > > Indeed - this was the entire motivation for make_device_exclusive() - that we > > > > needed a way to reliably exclude CPU access that couldn't be blocked in the same > > > > way page migration can (otherwise we could have just migrated to a device page, > > > > even if that may have added unwanted overhead). > > > > > > The folio_trylock worries me a bit. I guess this is to avoid deadlocks > > > when locking multiple folios, but I think at least on the first one we > > > need an unconditional folio_lock to guarantee forward progress. > > > > At least on the hmm path I was able to trigger the EBUSY a couple of times > > due to concurrent swapout. But the hmm-tests selftest fails immediately > > instead of retrying. > > My worries with just retrying is that it's very hard to assess whether > there's a livelock or whether the retry has a good chance of success. As > an example the ->migrate_to_ram path has some trylocks, and the window > where all other threads got halfway and then fail the trylock is big > enough that once you pile up enough threads that spin through there, > you're stuck forever. Which isn't great. > > So if we could convert at least the first folio_trylock into a plain lock > then forward progress is obviously assured and there's no need to crawl > through large chunks of mm/ code to hunt for corner cases where we could > be too unlucky to ever win the race. > > > > Since > > > atomics can't cross 4k boundaries (or the hw is just really broken) this > > > should be enough to avoid being stuck in a livelock. I'm also not seeing > > > any other reason why a folio_lock shouldn't work here, but then my > > > understanding of mm/ stuff is really just scratching the surface. > > > > > > I did crawl through all the other code and it looks like everything else > > > is unconditional locks. So looks all good and I didn't spot anything else > > > that seemed problematic. 
> > > > > > Somewhat aside, I do wonder whether we really want to require callers to > > > hold the mmap lock, or whether with all the work towards lockless fastpath > > > that shouldn't instead just be an implementation detail. > > > > We might be able to use the VMA lock in the future, but that will require > > GUP support and a bunch more. Until then, the mm_lock in read mode is > > required. > > Yup. I also don't think we should try to improve before benchmarks show an > actual need. It's more about future proofing and making sure mmap_lock > doesn't leak into driver data structures that I'm worried about. Because > I've seen some hmm/gpu rfc patches that heavily relied on mmap_lock to > keep everything correct on the driver side, which is not a clean design. > > > I was not able to convince myself that we'll really need the folio lock, but > > that's also a separate discussion. > > This is way above my pay understanding of mm/ unfortunately. I pondered this some more, and I think it's to make sure we get a stable reading of folio_mapcount() and are not racing with new rmaps being established. But I also got lost a few times in the maze ... -Sima > > > > At least for the > > > gpu hmm code I've seen I've tried to push hard towards a world were the > > > gpu side does not rely on mmap_read_lock being held at all, to future > > > proof this all. And currently we only have one caller of > > > make_device_exclusive_range() so would be simple to do. > > > > We could likely move the mmap_lock into that function, but avoiding it is > > more effort. > > I didn't mean more than just that, which would make sure drivers at least > do not rely on mmap_lock being held. That then allows us to switch over to > vma lock or anything else entirely within mm/ code. > > If we leave it as-is then more drivers accidentally or intentionally will > rely on this, like I think is the case for ->migrate_to_ram for hmm > already. And then it's more pain to untangle. 
> > > In any case, I'll send something out probably tomorrow to fix page > > migration/swapout of pages with device-exclusive entries and a bunch of > > other things (THP, interaction with hugetlb, ...). > > Thanks a lot! > > Cheer, Sima > > > > -- > > Cheers, > > > > David / dhildenb > > > > -- > Simona Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch -- Simona Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-29 11:28 ` Simona Vetter @ 2025-01-29 11:31 ` David Hildenbrand 2025-01-29 14:05 ` Simona Vetter 0 siblings, 1 reply; 16+ messages in thread From: David Hildenbrand @ 2025-01-29 11:31 UTC (permalink / raw) To: Alistair Popple, linux-mm, John Hubbard, nouveau, Jason Gunthorpe, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On 29.01.25 12:28, Simona Vetter wrote: > On Wed, Jan 29, 2025 at 11:48:03AM +0100, Simona Vetter wrote: >> On Tue, Jan 28, 2025 at 09:24:33PM +0100, David Hildenbrand wrote: >>> On 28.01.25 21:14, Simona Vetter wrote: >>>> On Tue, Jan 28, 2025 at 11:09:24AM +1100, Alistair Popple wrote: >>>>> On Fri, Jan 24, 2025 at 06:54:02PM +0100, David Hildenbrand wrote: >>>>>>>>> On integrated the gpu is tied into the coherency >>>>>>>>> fabric, so there it's not needed. >>>>>>>>> >>>>>>>>> I think the more fundamental question with both this function here and >>>>>>>>> with forced migration to device memory is that there's no guarantee it >>>>>>>>> will work out. >>>>>>>> >>>>>>>> Yes, in particular with device-exclusive, it doesn't really work with THP >>>>>>>> and is only limited to anonymous memory. I have patches to at least make it >>>>>>>> work reliably with THP. >>>>>>> >>>>>>> I should have crawled through the implementation first before replying. >>>>>>> Since it only looks at folio_mapcount() make_device_exclusive() should at >>>>>>> least in theory work reliably on anon memory, and not be impacted by >>>>>>> elevated refcounts due to migration/ksm/thp/whatever. >>>>>> >>>>>> Yes, there is -- in theory -- nothing blocking the conversion except the >>>>>> folio lock. That's different than page migration. 
>>>>> >>>>> Indeed - this was the entire motivation for make_device_exclusive() - that we >>>>> needed a way to reliably exclude CPU access that couldn't be blocked in the same >>>>> way page migration can (otherwise we could have just migrated to a device page, >>>>> even if that may have added unwanted overhead). >>>> >>>> The folio_trylock worries me a bit. I guess this is to avoid deadlocks >>>> when locking multiple folios, but I think at least on the first one we >>>> need an unconditional folio_lock to guarantee forward progress. >>> >>> At least on the hmm path I was able to trigger the EBUSY a couple of times >>> due to concurrent swapout. But the hmm-tests selftest fails immediately >>> instead of retrying. >> >> My worries with just retrying is that it's very hard to assess whether >> there's a livelock or whether the retry has a good chance of success. As >> an example the ->migrate_to_ram path has some trylocks, and the window >> where all other threads got halfway and then fail the trylock is big >> enough that once you pile up enough threads that spin through there, >> you're stuck forever. Which isn't great. >> >> So if we could convert at least the first folio_trylock into a plain lock >> then forward progress is obviously assured and there's no need to crawl >> through large chunks of mm/ code to hunt for corner cases where we could >> be too unlucky to ever win the race. >> >>>> Since >>>> atomics can't cross 4k boundaries (or the hw is just really broken) this >>>> should be enough to avoid being stuck in a livelock. I'm also not seeing >>>> any other reason why a folio_lock shouldn't work here, but then my >>>> understanding of mm/ stuff is really just scratching the surface. >>>> >>>> I did crawl through all the other code and it looks like everything else >>>> is unconditional locks. So looks all good and I didn't spot anything else >>>> that seemed problematic. 
>>>> >>>> Somewhat aside, I do wonder whether we really want to require callers to >>>> hold the mmap lock, or whether with all the work towards lockless fastpath >>>> that shouldn't instead just be an implementation detail. >>> >>> We might be able to use the VMA lock in the future, but that will require >>> GUP support and a bunch more. Until then, the mm_lock in read mode is >>> required. >> >> Yup. I also don't think we should try to improve before benchmarks show an >> actual need. It's more about future proofing and making sure mmap_lock >> doesn't leak into driver data structures that I'm worried about. Because >> I've seen some hmm/gpu rfc patches that heavily relied on mmap_lock to >> keep everything correct on the driver side, which is not a clean design. >> >>> I was not able to convince myself that we'll really need the folio lock, but >>> that's also a separate discussion. >> >> This is way above my pay understanding of mm/ unfortunately. > > I pondered this some more, and I think it's to make sure we get a stable > reading of folio_mapcount() and are not racing with new rmaps being > established. But I also got lost a few times in the maze ... That mapcount handling is all messed up and I'll remove that along with the rmap walk. Also, the folio lock does not stabilize the mapcount at all ... Here is my understanding: commit e2dca6db09186534c7e6082b77be6e17d8920f10 Author: David Hildenbrand <david@redhat.com> Date: Tue Jan 28 15:25:37 2025 +0100 mm/memory: document restore_exclusive_pte() Let's document how this function is to be used, and why the requirement for the folio lock might maybe be dropped in the future. 
Signed-off-by: David Hildenbrand <david@redhat.com>

diff --git a/mm/memory.c b/mm/memory.c
index 46956994aaff..caaae8df11a9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -718,6 +718,31 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
 }
 #endif
+/**
+ * restore_exclusive_pte - Restore a device-exclusive entry
+ * @vma: VMA covering @address
+ * @folio: the mapped folio
+ * @page: the mapped folio page
+ * @address: the virtual address
+ * @ptep: PTE pointer into the locked page table mapping the folio page
+ * @orig_pte: PTE value at @ptep
+ *
+ * Restore a device-exclusive non-swap entry to an ordinary present PTE.
+ *
+ * The folio and the page table must be locked, and MMU notifiers must have
+ * been called to invalidate any (exclusive) device mappings. In case of
+ * fork(), MMU_NOTIFY_PROTECTION_PAGE is triggered, and in case of a page
+ * fault MMU_NOTIFY_EXCLUSIVE is triggered.
+ *
+ * Locking the folio makes sure that anybody who just converted the PTE to
+ * a device-exclusive entry can map it into the device, before unlocking it; so
+ * the folio lock prevents concurrent conversion to device-exclusive.
+ *
+ * TODO: the folio lock does not protect against all cases of concurrent
+ * page table modifications (e.g., MADV_DONTNEED, mprotect), so device drivers
+ * must already use MMU notifiers to sync against any concurrent changes.
+ * Maybe the requirement for the folio lock can be dropped in the future.
+ */

--
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-29 11:31 ` David Hildenbrand @ 2025-01-29 14:05 ` Simona Vetter 2025-01-29 16:13 ` David Hildenbrand 0 siblings, 1 reply; 16+ messages in thread From: Simona Vetter @ 2025-01-29 14:05 UTC (permalink / raw) To: David Hildenbrand Cc: Alistair Popple, linux-mm, John Hubbard, nouveau, Jason Gunthorpe, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On Wed, Jan 29, 2025 at 12:31:14PM +0100, David Hildenbrand wrote: > On 29.01.25 12:28, Simona Vetter wrote: > > On Wed, Jan 29, 2025 at 11:48:03AM +0100, Simona Vetter wrote: > > > On Tue, Jan 28, 2025 at 09:24:33PM +0100, David Hildenbrand wrote: > > > > On 28.01.25 21:14, Simona Vetter wrote: > > > > > On Tue, Jan 28, 2025 at 11:09:24AM +1100, Alistair Popple wrote: > > > > > > On Fri, Jan 24, 2025 at 06:54:02PM +0100, David Hildenbrand wrote: > > > > > > > > > > On integrated the gpu is tied into the coherency > > > > > > > > > > fabric, so there it's not needed. > > > > > > > > > > > > > > > > > > > > I think the more fundamental question with both this function here and > > > > > > > > > > with forced migration to device memory is that there's no guarantee it > > > > > > > > > > will work out. > > > > > > > > > > > > > > > > > > Yes, in particular with device-exclusive, it doesn't really work with THP > > > > > > > > > and is only limited to anonymous memory. I have patches to at least make it > > > > > > > > > work reliably with THP. > > > > > > > > > > > > > > > > I should have crawled through the implementation first before replying. > > > > > > > > Since it only looks at folio_mapcount() make_device_exclusive() should at > > > > > > > > least in theory work reliably on anon memory, and not be impacted by > > > > > > > > elevated refcounts due to migration/ksm/thp/whatever. 
> > > > > > > > > > > > > > Yes, there is -- in theory -- nothing blocking the conversion except the > > > > > > > folio lock. That's different than page migration. > > > > > > > > > > > > Indeed - this was the entire motivation for make_device_exclusive() - that we > > > > > > needed a way to reliably exclude CPU access that couldn't be blocked in the same > > > > > > way page migration can (otherwise we could have just migrated to a device page, > > > > > > even if that may have added unwanted overhead). > > > > > > > > > > The folio_trylock worries me a bit. I guess this is to avoid deadlocks > > > > > when locking multiple folios, but I think at least on the first one we > > > > > need an unconditional folio_lock to guarantee forward progress. > > > > > > > > At least on the hmm path I was able to trigger the EBUSY a couple of times > > > > due to concurrent swapout. But the hmm-tests selftest fails immediately > > > > instead of retrying. > > > > > > My worries with just retrying is that it's very hard to assess whether > > > there's a livelock or whether the retry has a good chance of success. As > > > an example the ->migrate_to_ram path has some trylocks, and the window > > > where all other threads got halfway and then fail the trylock is big > > > enough that once you pile up enough threads that spin through there, > > > you're stuck forever. Which isn't great. > > > > > > So if we could convert at least the first folio_trylock into a plain lock > > > then forward progress is obviously assured and there's no need to crawl > > > through large chunks of mm/ code to hunt for corner cases where we could > > > be too unlucky to ever win the race. > > > > > > > > Since > > > > > atomics can't cross 4k boundaries (or the hw is just really broken) this > > > > > should be enough to avoid being stuck in a livelock. 
I'm also not seeing > > > > > any other reason why a folio_lock shouldn't work here, but then my > > > > > understanding of mm/ stuff is really just scratching the surface. > > > > > > > > > > I did crawl through all the other code and it looks like everything else > > > > > is unconditional locks. So looks all good and I didn't spot anything else > > > > > that seemed problematic. > > > > > > > > > > Somewhat aside, I do wonder whether we really want to require callers to > > > > > hold the mmap lock, or whether with all the work towards lockless fastpath > > > > > that shouldn't instead just be an implementation detail. > > > > > > > > We might be able to use the VMA lock in the future, but that will require > > > > GUP support and a bunch more. Until then, the mm_lock in read mode is > > > > required. > > > > > > Yup. I also don't think we should try to improve before benchmarks show an > > > actual need. It's more about future proofing and making sure mmap_lock > > > doesn't leak into driver data structures that I'm worried about. Because > > > I've seen some hmm/gpu rfc patches that heavily relied on mmap_lock to > > > keep everything correct on the driver side, which is not a clean design. > > > > > > > I was not able to convince myself that we'll really need the folio lock, but > > > > that's also a separate discussion. > > > > > > This is way above my pay understanding of mm/ unfortunately. > > > > I pondered this some more, and I think it's to make sure we get a stable > > reading of folio_mapcount() and are not racing with new rmaps being > > established. But I also got lost a few times in the maze ... > > That mapcount handling is all messed up and I'll remove that along with > the rmap walk. Also, the folio lock does not stabilize the mapcount at all ... Hm ... also rethinking this all, we don't need a lot of guarantees here. 
Anything userspace does that re-elevates the mapcount or otherwise could starve out make_device_exclusive is I think a case of "don't do that". I think the only guarantee we need is that make_device_exclusive succeeds against other kernel stuff like thp/migration/ksm and all those things, at least reliably when you retry. And maybe also that concurrent make_device_exclusive calls don't starve each another but eventually get the job done (but only if it's the same owner). > Here is my understanding: > > commit e2dca6db09186534c7e6082b77be6e17d8920f10 > Author: David Hildenbrand <david@redhat.com> > Date: Tue Jan 28 15:25:37 2025 +0100 > > mm/memory: document restore_exclusive_pte() > Let's document how this function is to be used, and why the requirement > for the folio lock might maybe be dropped in the future. > Signed-off-by: David Hildenbrand <david@redhat.com> > > diff --git a/mm/memory.c b/mm/memory.c > index 46956994aaff..caaae8df11a9 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -718,6 +718,31 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma, > } > #endif > +/** > + * restore_exclusive_pte - Restore a device-exclusive entry > + * @vma: VMA covering @address > + * @folio: the mapped folio > + * @page: the mapped folio page > + * @address: the virtual address > + * @ptep: PTE pointer into the locked page table mapping the folio page > + * @orig_pte: PTE value at @ptep > + * > + * Restore a device-exclusive non-swap entry to an ordinary present PTE. > + * > + * The folio and the page table must be locked, and MMU notifiers must have > + * been called to invalidate any (exclusive) device mappings. In case of > + * fork(), MMU_NOTIFY_PROTECTION_PAGE is triggered, and in case of a page > + * fault MMU_NOTIFY_EXCLUSIVE is triggered. 
> + * > + * Locking the folio makes sure that anybody who just converted the PTE to > + * a device-private entry can map it into the device, before unlocking it; so > + * the folio lock prevents concurrent conversion to device-exclusive. > + * > + * TODO: the folio lock does not protect against all cases of concurrent > + * page table modifications (e.g., MADV_DONTNEED, mprotect), so device drivers > + * must already use MMU notifiers to sync against any concurrent changes > + * Maybe the requirement for the folio lock can be dropped in the future. Hm yeah I was a bit confused why this would work at first. But since we break cow with FOLL_WRITE there shouldn't be any other mappings around. Therefore relying on the mmu notifier for that mm_struct is enough, and we don't need to hold the folio_lock in callers. I think pushing both the folio_unlock and the mmap_read_lock/unlock into make_device_exclusive would be a good clarification of what's going on here. Cheers, Sima > + */ > > > -- > Cheers, > > David / dhildenb > -- Simona Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-29 14:05 ` Simona Vetter @ 2025-01-29 16:13 ` David Hildenbrand 2025-01-30 8:55 ` Simona Vetter 0 siblings, 1 reply; 16+ messages in thread From: David Hildenbrand @ 2025-01-29 16:13 UTC (permalink / raw) To: Alistair Popple, linux-mm, John Hubbard, nouveau, Jason Gunthorpe, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On 29.01.25 15:05, Simona Vetter wrote: > On Wed, Jan 29, 2025 at 12:31:14PM +0100, David Hildenbrand wrote: >> On 29.01.25 12:28, Simona Vetter wrote: >>> On Wed, Jan 29, 2025 at 11:48:03AM +0100, Simona Vetter wrote: >>>> On Tue, Jan 28, 2025 at 09:24:33PM +0100, David Hildenbrand wrote: >>>>> On 28.01.25 21:14, Simona Vetter wrote: >>>>>> On Tue, Jan 28, 2025 at 11:09:24AM +1100, Alistair Popple wrote: >>>>>>> On Fri, Jan 24, 2025 at 06:54:02PM +0100, David Hildenbrand wrote: >>>>>>>>>>> On integrated the gpu is tied into the coherency >>>>>>>>>>> fabric, so there it's not needed. >>>>>>>>>>> >>>>>>>>>>> I think the more fundamental question with both this function here and >>>>>>>>>>> with forced migration to device memory is that there's no guarantee it >>>>>>>>>>> will work out. >>>>>>>>>> >>>>>>>>>> Yes, in particular with device-exclusive, it doesn't really work with THP >>>>>>>>>> and is only limited to anonymous memory. I have patches to at least make it >>>>>>>>>> work reliably with THP. >>>>>>>>> >>>>>>>>> I should have crawled through the implementation first before replying. >>>>>>>>> Since it only looks at folio_mapcount() make_device_exclusive() should at >>>>>>>>> least in theory work reliably on anon memory, and not be impacted by >>>>>>>>> elevated refcounts due to migration/ksm/thp/whatever. >>>>>>>> >>>>>>>> Yes, there is -- in theory -- nothing blocking the conversion except the >>>>>>>> folio lock. That's different than page migration. 
>>>>>>> >>>>>>> Indeed - this was the entire motivation for make_device_exclusive() - that we >>>>>>> needed a way to reliably exclude CPU access that couldn't be blocked in the same >>>>>>> way page migration can (otherwise we could have just migrated to a device page, >>>>>>> even if that may have added unwanted overhead). >>>>>> >>>>>> The folio_trylock worries me a bit. I guess this is to avoid deadlocks >>>>>> when locking multiple folios, but I think at least on the first one we >>>>>> need an unconditional folio_lock to guarantee forward progress. >>>>> >>>>> At least on the hmm path I was able to trigger the EBUSY a couple of times >>>>> due to concurrent swapout. But the hmm-tests selftest fails immediately >>>>> instead of retrying. >>>> >>>> My worries with just retrying is that it's very hard to assess whether >>>> there's a livelock or whether the retry has a good chance of success. As >>>> an example the ->migrate_to_ram path has some trylocks, and the window >>>> where all other threads got halfway and then fail the trylock is big >>>> enough that once you pile up enough threads that spin through there, >>>> you're stuck forever. Which isn't great. >>>> >>>> So if we could convert at least the first folio_trylock into a plain lock >>>> then forward progress is obviously assured and there's no need to crawl >>>> through large chunks of mm/ code to hunt for corner cases where we could >>>> be too unlucky to ever win the race. >>>> >>>>>> Since >>>>>> atomics can't cross 4k boundaries (or the hw is just really broken) this >>>>>> should be enough to avoid being stuck in a livelock. I'm also not seeing >>>>>> any other reason why a folio_lock shouldn't work here, but then my >>>>>> understanding of mm/ stuff is really just scratching the surface. >>>>>> >>>>>> I did crawl through all the other code and it looks like everything else >>>>>> is unconditional locks. So looks all good and I didn't spot anything else >>>>>> that seemed problematic. 
>>>>>>
>>>>>> Somewhat aside, I do wonder whether we really want to require callers to
>>>>>> hold the mmap lock, or whether with all the work towards lockless fastpath
>>>>>> that shouldn't instead just be an implementation detail.
>>>>>
>>>>> We might be able to use the VMA lock in the future, but that will require
>>>>> GUP support and a bunch more. Until then, the mm_lock in read mode is
>>>>> required.
>>>>
>>>> Yup. I also don't think we should try to improve before benchmarks show an
>>>> actual need. It's more about future proofing and making sure mmap_lock
>>>> doesn't leak into driver data structures that I'm worried about. Because
>>>> I've seen some hmm/gpu rfc patches that heavily relied on mmap_lock to
>>>> keep everything correct on the driver side, which is not a clean design.
>>>>
>>>>> I was not able to convince myself that we'll really need the folio lock, but
>>>>> that's also a separate discussion.
>>>>
>>>> This is way above my pay understanding of mm/ unfortunately.
>>>
>>> I pondered this some more, and I think it's to make sure we get a stable
>>> reading of folio_mapcount() and are not racing with new rmaps being
>>> established. But I also got lost a few times in the maze ...
>>
>> That mapcount handling is all messed up and I'll remove that along with
>> the rmap walk. Also, the folio lock does not stabilize the mapcount at all ...
>
> Hm ... also rethinking this all, we don't need a lot of guarantees here.
> Anything userspace does that re-elevates the mapcount or otherwise could
> starve out make_device_exclusive is I think a case of "don't do that".
>
> I think the only guarantee we need is that make_device_exclusive succeeds
> against other kernel stuff like thp/migration/ksm and all those things, at
> least reliably when you retry. And maybe also that concurrent
> make_device_exclusive calls don't starve each other but eventually get
> the job done (but only if it's the same owner).
>
>> Here is my understanding:
>>
>> commit e2dca6db09186534c7e6082b77be6e17d8920f10
>> Author: David Hildenbrand <david@redhat.com>
>> Date:   Tue Jan 28 15:25:37 2025 +0100
>>
>>     mm/memory: document restore_exclusive_pte()
>>
>>     Let's document how this function is to be used, and why the requirement
>>     for the folio lock might maybe be dropped in the future.
>>
>>     Signed-off-by: David Hildenbrand <david@redhat.com>
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 46956994aaff..caaae8df11a9 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -718,6 +718,31 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
>>   }
>>   #endif
>> +/**
>> + * restore_exclusive_pte - Restore a device-exclusive entry
>> + * @vma: VMA covering @address
>> + * @folio: the mapped folio
>> + * @page: the mapped folio page
>> + * @address: the virtual address
>> + * @ptep: PTE pointer into the locked page table mapping the folio page
>> + * @orig_pte: PTE value at @ptep
>> + *
>> + * Restore a device-exclusive non-swap entry to an ordinary present PTE.
>> + *
>> + * The folio and the page table must be locked, and MMU notifiers must have
>> + * been called to invalidate any (exclusive) device mappings. In case of
>> + * fork(), MMU_NOTIFY_PROTECTION_PAGE is triggered, and in case of a page
>> + * fault MMU_NOTIFY_EXCLUSIVE is triggered.
>> + *
>> + * Locking the folio makes sure that anybody who just converted the PTE to
>> + * a device-private entry can map it into the device, before unlocking it; so
>> + * the folio lock prevents concurrent conversion to device-exclusive.
>> + *
>> + * TODO: the folio lock does not protect against all cases of concurrent
>> + * page table modifications (e.g., MADV_DONTNEED, mprotect), so device drivers
>> + * must already use MMU notifiers to sync against any concurrent changes.
>> + * Maybe the requirement for the folio lock can be dropped in the future.
>
> Hm yeah I was a bit confused why this would work at first. But since we
> break CoW with FOLL_WRITE there shouldn't be any other mappings around.
> Therefore relying on the mmu notifier for that mm_struct is enough, and we
> don't need to hold the folio_lock in callers.

Right; the devil is in the detail here; I looked at the SVM code and it
registers an MMU notifier around that code to catch any invalidations that
are happening while mapping the page. And it has another one to handle
invalidations in general. I did not completely digest why two notifiers are
required ... there likely is a good reason.

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 16+ messages in thread
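[The "MMU notifier around that code" scheme discussed above boils down to a sequence-count retry loop. Below is a userspace sketch of just that idea — all names are invented for illustration; the kernel's real interface is the mmu_interval_notifier read_begin/read_retry machinery:]

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Model of the notifier retry pattern: sample a sequence count before
 * converting the PTE and programming the device, bump it on every
 * invalidation, and retry the fault if it changed in between.
 */
atomic_ulong invalidate_seq;

unsigned long interval_read_begin(void)
{
	return atomic_load(&invalidate_seq);
}

/* Stands in for the invalidation callback firing. */
void interval_invalidate(void)
{
	atomic_fetch_add(&invalidate_seq, 1);
}

/* True means an invalidation raced with us and the fault must retry. */
bool interval_read_retry(unsigned long seq)
{
	return atomic_load(&invalidate_seq) != seq;
}
```

[The fault path is then: seq = interval_read_begin(); convert the PTE and map the page into the device; if interval_read_retry(seq), throw the result away and start over.]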
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice?
  2025-01-29 16:13                             ` David Hildenbrand
@ 2025-01-30  8:55                               ` Simona Vetter
  0 siblings, 0 replies; 16+ messages in thread
From: Simona Vetter @ 2025-01-30 8:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alistair Popple, linux-mm, John Hubbard, nouveau, Jason Gunthorpe,
	DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich

On Wed, Jan 29, 2025 at 05:13:41PM +0100, David Hildenbrand wrote:
> On 29.01.25 15:05, Simona Vetter wrote:
> > On Wed, Jan 29, 2025 at 12:31:14PM +0100, David Hildenbrand wrote:
> > > On 29.01.25 12:28, Simona Vetter wrote:
> > > > On Wed, Jan 29, 2025 at 11:48:03AM +0100, Simona Vetter wrote:
> > > > > On Tue, Jan 28, 2025 at 09:24:33PM +0100, David Hildenbrand wrote:
> > > > > > On 28.01.25 21:14, Simona Vetter wrote:
> > > > > > > On Tue, Jan 28, 2025 at 11:09:24AM +1100, Alistair Popple wrote:
> > > > > > > > On Fri, Jan 24, 2025 at 06:54:02PM +0100, David Hildenbrand wrote:
> > > > > > > > > > > > On integrated the gpu is tied into the coherency
> > > > > > > > > > > > fabric, so there it's not needed.
> > > > > > > > > > > >
> > > > > > > > > > > > I think the more fundamental question with both this function here and
> > > > > > > > > > > > with forced migration to device memory is that there's no guarantee it
> > > > > > > > > > > > will work out.
> > > > > > > > > > >
> > > > > > > > > > > Yes, in particular with device-exclusive, it doesn't really work with THP
> > > > > > > > > > > and is only limited to anonymous memory. I have patches to at least make it
> > > > > > > > > > > work reliably with THP.
> > > > > > > > > >
> > > > > > > > > > I should have crawled through the implementation first before replying.
> > > > > > > > > > Since it only looks at folio_mapcount() make_device_exclusive() should at
> > > > > > > > > > least in theory work reliably on anon memory, and not be impacted by
> > > > > > > > > > elevated refcounts due to migration/ksm/thp/whatever.
> > > > > > > > >
> > > > > > > > > Yes, there is -- in theory -- nothing blocking the conversion except the
> > > > > > > > > folio lock. That's different than page migration.
> > > > > > > >
> > > > > > > > Indeed - this was the entire motivation for make_device_exclusive() - that we
> > > > > > > > needed a way to reliably exclude CPU access that couldn't be blocked in the same
> > > > > > > > way page migration can (otherwise we could have just migrated to a device page,
> > > > > > > > even if that may have added unwanted overhead).
> > > > > > >
> > > > > > > The folio_trylock worries me a bit. I guess this is to avoid deadlocks
> > > > > > > when locking multiple folios, but I think at least on the first one we
> > > > > > > need an unconditional folio_lock to guarantee forward progress.
> > > > > >
> > > > > > At least on the hmm path I was able to trigger the EBUSY a couple of times
> > > > > > due to concurrent swapout. But the hmm-tests selftest fails immediately
> > > > > > instead of retrying.
> > > > >
> > > > > My worries with just retrying is that it's very hard to assess whether
> > > > > there's a livelock or whether the retry has a good chance of success. As
> > > > > an example the ->migrate_to_ram path has some trylocks, and the window
> > > > > where all other threads got halfway and then fail the trylock is big
> > > > > enough that once you pile up enough threads that spin through there,
> > > > > you're stuck forever. Which isn't great.
> > > > >
> > > > > So if we could convert at least the first folio_trylock into a plain lock
> > > > > then forward progress is obviously assured and there's no need to crawl
> > > > > through large chunks of mm/ code to hunt for corner cases where we could
> > > > > be too unlucky to ever win the race.
> > > > >
> > > > > > > Since atomics can't cross 4k boundaries (or the hw is just really broken)
> > > > > > > this should be enough to avoid being stuck in a livelock. I'm also not
> > > > > > > seeing any other reason why a folio_lock shouldn't work here, but then my
> > > > > > > understanding of mm/ stuff is really just scratching the surface.
> > > > > > >
> > > > > > > I did crawl through all the other code and it looks like everything else
> > > > > > > is unconditional locks. So looks all good and I didn't spot anything else
> > > > > > > that seemed problematic.
> > > > > > >
> > > > > > > Somewhat aside, I do wonder whether we really want to require callers to
> > > > > > > hold the mmap lock, or whether with all the work towards lockless fastpath
> > > > > > > that shouldn't instead just be an implementation detail.
> > > > > >
> > > > > > We might be able to use the VMA lock in the future, but that will require
> > > > > > GUP support and a bunch more. Until then, the mm_lock in read mode is
> > > > > > required.
> > > > >
> > > > > Yup. I also don't think we should try to improve before benchmarks show an
> > > > > actual need. It's more about future proofing and making sure mmap_lock
> > > > > doesn't leak into driver data structures that I'm worried about. Because
> > > > > I've seen some hmm/gpu rfc patches that heavily relied on mmap_lock to
> > > > > keep everything correct on the driver side, which is not a clean design.
> > > > >
> > > > > > I was not able to convince myself that we'll really need the folio lock, but
> > > > > > that's also a separate discussion.
> > > > >
> > > > > This is way above my pay understanding of mm/ unfortunately.
> > > >
> > > > I pondered this some more, and I think it's to make sure we get a stable
> > > > reading of folio_mapcount() and are not racing with new rmaps being
> > > > established. But I also got lost a few times in the maze ...
> > >
> > > That mapcount handling is all messed up and I'll remove that along with
> > > the rmap walk. Also, the folio lock does not stabilize the mapcount at all ...
> >
> > Hm ... also rethinking this all, we don't need a lot of guarantees here.
> > Anything userspace does that re-elevates the mapcount or otherwise could
> > starve out make_device_exclusive is I think a case of "don't do that".
> >
> > I think the only guarantee we need is that make_device_exclusive succeeds
> > against other kernel stuff like thp/migration/ksm and all those things, at
> > least reliably when you retry. And maybe also that concurrent
> > make_device_exclusive calls don't starve each other but eventually get
> > the job done (but only if it's the same owner).
> >
> > > Here is my understanding:
> > >
> > > commit e2dca6db09186534c7e6082b77be6e17d8920f10
> > > Author: David Hildenbrand <david@redhat.com>
> > > Date:   Tue Jan 28 15:25:37 2025 +0100
> > >
> > >     mm/memory: document restore_exclusive_pte()
> > >
> > >     Let's document how this function is to be used, and why the requirement
> > >     for the folio lock might maybe be dropped in the future.
> > >
> > >     Signed-off-by: David Hildenbrand <david@redhat.com>
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 46956994aaff..caaae8df11a9 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -718,6 +718,31 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
> > >   }
> > >   #endif
> > > +/**
> > > + * restore_exclusive_pte - Restore a device-exclusive entry
> > > + * @vma: VMA covering @address
> > > + * @folio: the mapped folio
> > > + * @page: the mapped folio page
> > > + * @address: the virtual address
> > > + * @ptep: PTE pointer into the locked page table mapping the folio page
> > > + * @orig_pte: PTE value at @ptep
> > > + *
> > > + * Restore a device-exclusive non-swap entry to an ordinary present PTE.
> > > + *
> > > + * The folio and the page table must be locked, and MMU notifiers must have
> > > + * been called to invalidate any (exclusive) device mappings. In case of
> > > + * fork(), MMU_NOTIFY_PROTECTION_PAGE is triggered, and in case of a page
> > > + * fault MMU_NOTIFY_EXCLUSIVE is triggered.
> > > + *
> > > + * Locking the folio makes sure that anybody who just converted the PTE to
> > > + * a device-private entry can map it into the device, before unlocking it; so
> > > + * the folio lock prevents concurrent conversion to device-exclusive.
> > > + *
> > > + * TODO: the folio lock does not protect against all cases of concurrent
> > > + * page table modifications (e.g., MADV_DONTNEED, mprotect), so device drivers
> > > + * must already use MMU notifiers to sync against any concurrent changes.
> > > + * Maybe the requirement for the folio lock can be dropped in the future.
> >
> > Hm yeah I was a bit confused why this would work at first. But since we
> > break CoW with FOLL_WRITE there shouldn't be any other mappings around.
> > Therefore relying on the mmu notifier for that mm_struct is enough, and we
> > don't need to hold the folio_lock in callers.
>
> Right; the devil is in the detail here; I looked at the SVM code and it
> registers an MMU notifier around that code to catch any invalidations that
> are happening while mapping the page. And it has another one to handle
> invalidations in general. I did not completely digest why two notifiers are
> required ... there likely is a good reason.

At least as a fallback you need a notifier for exactly only the fault, to
make sure you don't retry due to spurious invalidations. Otherwise if
userspace is hammering an adjacent page from the CPU we would not make
forward progress, and that's a correctness issue.

Now always inserting a new notifier for every fault in all cases isn't
great because it incurs a stall. But I discussed this issue with Thomas
Hellstrom a while back and there's a bunch of tricks we could deploy if
this ever becomes a performance issue.

So this all looks ok to me.
-Sima

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 16+ messages in thread
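[Simona's point about a notifier "for exactly only the fault" amounts to a range-overlap filter: only invalidations that actually intersect the faulting range force a retry, so CPU activity on adjacent pages cannot starve the fault. A hypothetical helper sketching that rule, using half-open [start, end) ranges as mmu_notifier ranges do; the name is invented for illustration:]

```c
#include <stdbool.h>

/*
 * An invalidation of [inv_start, inv_end) forces a retry of a fault on
 * [fault_start, fault_end) only if the two ranges overlap. Spurious
 * invalidations of neighbouring pages are filtered out, which is what
 * guarantees forward progress for the fault.
 */
bool invalidation_requires_retry(unsigned long fault_start,
				 unsigned long fault_end,
				 unsigned long inv_start,
				 unsigned long inv_end)
{
	return inv_start < fault_end && fault_start < inv_end;
}
```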
end of thread, other threads:[~2025-01-30  8:55 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-23 10:20 [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? David Hildenbrand
2025-01-23 15:08 ` Simona Vetter
2025-01-24 10:44   ` David Hildenbrand
2025-01-24 14:11     ` Jason Gunthorpe
2025-01-24 14:39       ` David Hildenbrand
2025-01-24 15:28         ` Simona Vetter
2025-01-24 17:54           ` David Hildenbrand
2025-01-28  0:09             ` Alistair Popple
2025-01-28 20:14               ` Simona Vetter
2025-01-28 20:24                 ` David Hildenbrand
2025-01-29 10:48                   ` Simona Vetter
2025-01-29 11:28                     ` Simona Vetter
2025-01-29 11:31                       ` David Hildenbrand
2025-01-29 14:05                         ` Simona Vetter
2025-01-29 16:13                           ` David Hildenbrand
2025-01-30  8:55                             ` Simona Vetter