* [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? @ 2025-01-23 10:20 David Hildenbrand 2025-01-23 15:08 ` Simona Vetter 0 siblings, 1 reply; 16+ messages in thread From: David Hildenbrand @ 2025-01-23 10:20 UTC (permalink / raw) To: linux-mm Cc: John Hubbard, nouveau, Jason Gunthorpe, Alistair Popple, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich Hi, I keep finding issues in our implementation of "device exclusive non-swap entries", and the way it messes with mapcounts is disgusting. As a reminder, what we do here is to replace a PTE pointing to an anonymous page by a "device exclusive non-swap entry". As long as the original PTE is in place, only the CPU can access it; as soon as the "device exclusive non-swap entry" is in place, only the device can access it. Conversion back and forth is triggered by CPU / device faults. I have fixes/reworks/simplifications for most things, but as there is only a single "real" in-tree user of make_device_exclusive(): drivers/gpu/drm/nouveau/nouveau_svm.c to "support SVM atomics in Nouveau [1]" naturally I am wondering: is this still a thing on actual hardware, or is it already stale on recent hardware and not really required anymore? [1] https://lore.kernel.org/linux-kernel//6621654.gmDyfcmpjF@nvdebian/T/ -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-23 10:20 [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? David Hildenbrand @ 2025-01-23 15:08 ` Simona Vetter 2025-01-24 10:44 ` David Hildenbrand 0 siblings, 1 reply; 16+ messages in thread From: Simona Vetter @ 2025-01-23 15:08 UTC (permalink / raw) To: David Hildenbrand Cc: linux-mm, John Hubbard, nouveau, Jason Gunthorpe, Alistair Popple, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On Thu, Jan 23, 2025 at 11:20:37AM +0100, David Hildenbrand wrote: > Hi, > > I keep finding issues in our implementation of "device exclusive non-swap > entries", and the way it messes with mapcounts is disgusting. > > As a reminder, what we do here is to replace a PTE pointing to an anonymous > page by a "device exclusive non-swap entry". > > As long as the original PTE is in place, only the CPU can access it; as soon > as the "device exclusive non-swap entry" is in place, only the device can > access it. Conversion back and forth is triggered by CPU / device faults. > > I have fixes/reworks/simplifications for most things, but as there is only a > single "real" in-tree user of make_device_exclusive(): > > drivers/gpu/drm/nouveau/nouveau_svm.c > > to "support SVM atomics in Nouveau [1]" > > naturally I am wondering: is this still a thing on actual hardware, or is it > already stale on recent hardware and not really required anymore? > > > [1] https://lore.kernel.org/linux-kernel//6621654.gmDyfcmpjF@nvdebian/T/ As long as you don't have a coherent interconnect it's needed. On intel discrete, device atomics require device memory, so they need full hmm migration (and hence won't use this function even once we land intel gpu svm code in upstream). On integrated the gpu is tied into the coherency fabric, so there it's not needed. 
I think the more fundamental question with both this function here and with forced migration to device memory is that there's no guarantee it will work out. At least that's my understanding. And for this gpu device atomics without coherent interconnect idea to work, we'd need to be able to guarantee that we can make any page device exclusive. So from my side I have some pretty big question marks on this entire thing overall. Cheers, Sima -- Simona Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-23 15:08 ` Simona Vetter @ 2025-01-24 10:44 ` David Hildenbrand 2025-01-24 14:11 ` Jason Gunthorpe 2025-01-24 15:28 ` Simona Vetter 0 siblings, 2 replies; 16+ messages in thread From: David Hildenbrand @ 2025-01-24 10:44 UTC (permalink / raw) To: linux-mm, John Hubbard, nouveau, Jason Gunthorpe, Alistair Popple, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On 23.01.25 16:08, Simona Vetter wrote: > On Thu, Jan 23, 2025 at 11:20:37AM +0100, David Hildenbrand wrote: >> Hi, >> >> I keep finding issues in our implementation of "device exclusive non-swap >> entries", and the way it messes with mapcounts is disgusting. >> >> As a reminder, what we do here is to replace a PTE pointing to an anonymous >> page by a "device exclusive non-swap entry". >> >> As long as the original PTE is in place, only the CPU can access it; as soon >> as the "device exclusive non-swap entry" is in place, only the device can >> access it. Conversion back and forth is triggered by CPU / device faults. >> >> I have fixes/reworks/simplifications for most things, but as there is only a >> single "real" in-tree user of make_device_exclusive(): >> >> drivers/gpu/drm/nouveau/nouveau_svm.c >> >> to "support SVM atomics in Nouveau [1]" >> >> naturally I am wondering: is this still a thing on actual hardware, or is it >> already stale on recent hardware and not really required anymore? >> >> >> [1] https://lore.kernel.org/linux-kernel//6621654.gmDyfcmpjF@nvdebian/T/ > Thanks for your answer! Nvidia folks told me on a different channel that it's still getting used. > As long as you don't have a coherent interconnect it's needed. On intel > discrete, device atomics require device memory, so they need full hmm > migration (and hence won't use this function even once we land intel gpu > svm code in upstream). Makes sense. 
> On integrated the gpu is tied into the coherency > fabric, so there it's not needed. > > I think the more fundamental question with both this function here and > with forced migration to device memory is that there's no guarantee it > will work out. Yes, in particular with device-exclusive, it doesn't really work with THP and is only limited to anonymous memory. I have patches to at least make it work reliably with THP. Then, we seem to give up too easily if we cannot lock the folio when wanting to convert to device-exclusive, which also looks rather odd. But well, maybe it just works well enough in the common case, or there is some other retry logic that makes it fly. > At least that's my understanding. And for this gpu device > atomics without coherent interconnect idea to work, we'd need to be able > to guarantee that we can make any page device exclusive. So from my side I > have some pretty big question marks on this entire thing overall. I don't think other memory (shmem/file/...) is really feasible as soon as other processes (not the current process) map/write/read file pages. We could really only handle it if we converted a single PTE and that PTE is getting converted back again. There are other concerns I have (what if the page is pinned and accessed outside of the user space page tables?). Maybe there was no need to handle these cases so far. So best I can do is make anonymous memory more reliable with device-exclusive and fixup some of the problematic parts that I see (e.g., broken page reclaim, page migration, ...). But before starting to cleanup+improve the existing handling of anonymous memory, I was wondering if this whole thing is getting used at all. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-24 10:44 ` David Hildenbrand @ 2025-01-24 14:11 ` Jason Gunthorpe 2025-01-24 14:39 ` David Hildenbrand 2025-01-24 15:28 ` Simona Vetter 1 sibling, 1 reply; 16+ messages in thread From: Jason Gunthorpe @ 2025-01-24 14:11 UTC (permalink / raw) To: David Hildenbrand Cc: linux-mm, John Hubbard, nouveau, Alistair Popple, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On Fri, Jan 24, 2025 at 11:44:28AM +0100, David Hildenbrand wrote: > There are other concerns I have (what if the page is pinned and accessed > outside of the user space page tables?). Maybe there was no need to handle > these cases so far. I think a lot of this depends on userspace following some restrictions so that the pages are always convertible. Presumably if userspace breaks things then their atomic-using GPU kernels will fault. So, from a kernel perspective, I'd suggest that creating a reasonable set of conditions that userspace can follow to have it work reliably is a reasonable goal. Jason ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-24 14:11 ` Jason Gunthorpe @ 2025-01-24 14:39 ` David Hildenbrand 0 siblings, 0 replies; 16+ messages in thread From: David Hildenbrand @ 2025-01-24 14:39 UTC (permalink / raw) To: Jason Gunthorpe Cc: linux-mm, John Hubbard, nouveau, Alistair Popple, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On 24.01.25 15:11, Jason Gunthorpe wrote: > On Fri, Jan 24, 2025 at 11:44:28AM +0100, David Hildenbrand wrote: > >> There are other concerns I have (what if the page is pinned and accessed >> outside of the user space page tables?). Maybe there was no need to handle >> these cases so far. > > I think a lot of this depends on userspace following some restrictions > so that the pages are always convertible. Presumably if userspace > breaks things then their atomic-using GPU kernels will fault. > > So, from a kernel perspective, I'd suggest that creating a reasonable > set of conditions that userspace can follow to have it work reliably > is a reasonable goal. Yes, that's my assumption as well. If you register a page using io_uring as a fixed buffer and then trigger "device_exclusive" access, the page can still be read/written using io_uring and atomics might not work as expected. "Not supported". -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-24 10:44 ` David Hildenbrand 2025-01-24 14:11 ` Jason Gunthorpe @ 2025-01-24 15:28 ` Simona Vetter 2025-01-24 17:54 ` David Hildenbrand 1 sibling, 1 reply; 16+ messages in thread From: Simona Vetter @ 2025-01-24 15:28 UTC (permalink / raw) To: David Hildenbrand Cc: linux-mm, John Hubbard, nouveau, Jason Gunthorpe, Alistair Popple, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On Fri, Jan 24, 2025 at 11:44:28AM +0100, David Hildenbrand wrote: > On 23.01.25 16:08, Simona Vetter wrote: > > On Thu, Jan 23, 2025 at 11:20:37AM +0100, David Hildenbrand wrote: > > > Hi, > > > > > > I keep finding issues in our implementation of "device exclusive non-swap > > > entries", and the way it messes with mapcounts is disgusting. > > > > > > As a reminder, what we do here is to replace a PTE pointing to an anonymous > > > page by a "device exclusive non-swap entry". > > > > > > As long as the original PTE is in place, only the CPU can access it; as soon > > > as the "device exclusive non-swap entry" is in place, only the device can > > > access it. Conversion back and forth is triggered by CPU / device faults. > > > > > > I have fixes/reworks/simplifications for most things, but as there is only a > > > single "real" in-tree user of make_device_exclusive(): > > > > > > drivers/gpu/drm/nouveau/nouveau_svm.c > > > > > > to "support SVM atomics in Nouveau [1]" > > > > > > naturally I am wondering: is this still a thing on actual hardware, or is it > > > already stale on recent hardware and not really required anymore? > > > > > > > > > [1] https://lore.kernel.org/linux-kernel//6621654.gmDyfcmpjF@nvdebian/T/ > > > > Thanks for your answer! > > Nvidia folks told me on a different channel that it's still getting used. > > > As long as you don't have a coherent interconnect it's needed. 
On intel > > discrete, device atomics require device memory, so they need full hmm > > migration (and hence won't use this function even once we land intel gpu > > svm code in upstream). > > Makes sense. > > > On integrated the gpu is tied into the coherency > > fabric, so there it's not needed. > > > > I think the more fundamental question with both this function here and > > with forced migration to device memory is that there's no guarantee it > > will work out. > > Yes, in particular with device-exclusive, it doesn't really work with THP > and is only limited to anonymous memory. I have patches to at least make it > work reliably with THP. I should have crawled through the implementation first before replying. Since it only looks at folio_mapcount() make_device_exclusive() should at least in theory work reliably on anon memory, and not be impacted by elevated refcounts due to migration/ksm/thp/whatever. This is unlike device atomics that require migration to device memory, which is just fundamentally not a reliable thing. > Then, we seem to give up too easily if we cannot lock the folio when wanting > to convert to device-exclusive, which also looks rather odd. But well, maybe > it just works well enough in the common case, or there is some other retry > logic that makes it fly. I've crawled through the path to migrate pages from device memory back to system memory a few months ago, and found some livelock issues in there. Wouldn't be surprised if m_d_e has some of the same, but I didn't dig through it (not least because intel can't use it due to not-so-great hw design). > > At least that's my understanding. And for this gpu device > > atomics without coherent interconnect idea to work, we'd need to be able > > to guarantee that we can make any page device exclusive. So from my side I > > have some pretty big question marks on this entire thing overall. > > I don't think other memory (shmem/file/...) 
is really feasible as soon as > other processes (not the current process) map/write/read file pages. Yeah none of the apis that use this internally in their implementations make any promises beyond memory acquired with libc's malloc() or one of the variants. So this limitation is fine. > We could really only handle it if we converted a single PTE and that PTE is > getting converted back again. > > There are other concerns I have (what if the page is pinned and accessed > outside of the user space page tables?). Maybe there was no need to handle > these cases so far. I think that's also ok, but it might be good to document clearly that concurrent direct i/o or rdma-registered buffers or whatever will mess with this. The promise is only between the gpu and the cpu, not anything else, in current apis. At least to my knowledge. > So best I can do is make anonymous memory more reliable with > device-exclusive and fixup some of the problematic parts that I see (e.g., > broken page reclaim, page migration, ...). > > But before starting to cleanup+improve the existing handling of anonymous > memory, I was wondering if this whole thing is getting used at all. Yeah if this can be made reliable (under the limitation of only anon memory and only excluding userspace access) then I expect we'll need this for a very long time. I just had no idea whether even that is possible. What isn't good is if it's only mostly reliable, like the current pgmap->ops->migrate_to_ram() path in do_swap_page() still is. But that one is fixable; the patches should be floating around somewhere. -Sima -- Simona Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-24 15:28 ` Simona Vetter @ 2025-01-24 17:54 ` David Hildenbrand 2025-01-28 0:09 ` Alistair Popple 0 siblings, 1 reply; 16+ messages in thread From: David Hildenbrand @ 2025-01-24 17:54 UTC (permalink / raw) To: linux-mm, John Hubbard, nouveau, Jason Gunthorpe, Alistair Popple, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich >>> On integrated the gpu is tied into the coherency >>> fabric, so there it's not needed. >>> >>> I think the more fundamental question with both this function here and >>> with forced migration to device memory is that there's no guarantee it >>> will work out. >> >> Yes, in particular with device-exclusive, it doesn't really work with THP >> and is only limited to anonymous memory. I have patches to at least make it >> work reliably with THP. > > I should have crawled through the implementation first before replying. > Since it only looks at folio_mapcount() make_device_exclusive() should at > least in theory work reliably on anon memory, and not be impacted by > elevated refcounts due to migration/ksm/thp/whatever. Yes, there is -- in theory -- nothing blocking the conversion except the folio lock. That's different than page migration. [...] > >> Then, we seem to give up too easily if we cannot lock the folio when wanting >> to convert to device-exclusive, which also looks rather odd. But well, maybe >> it just works well enough in the common case, or there is some other retry >> logic that makes it fly. > > I've crawled through the path to migrate pages from device memory back to > system memory a few months ago, and found some livelock issues in there. > Wouldn't be surprised if m_d_e has some of the same, but I didn't dig > through it (not least because intel can't use it due to not-so-great hw > design). Yes, that might be possible. 
Maybe something keeps spinning while the folio is locked instead of properly waiting for the lock. ... or someone is trying to convert a tail page of a THP, in which case we will also keep failing the conversion right now. >> There are other concerns I have (what if the page is pinned and accessed >> outside of the user space page tables?). Maybe there was no need to handle >> these cases so far. > > I think that's also ok, but it might be good to document clearly that > concurrent direct i/o or rdma-registered buffers or whatever will mess with > this. The promise is only between the gpu and the cpu, not anything else, > in current apis. At least to my knowledge. Well, the issue is that, e.g., io_uring fixed buffers can be accessed from the CPU using the direct map, not necessarily using DMA from a device. In any case, I'm planning on adding some code-level documentation for that and look into extending the high-level doc while at it. > >> So best I can do is make anonymous memory more reliable with >> device-exclusive and fixup some of the problematic parts that I see (e.g., >> broken page reclaim, page migration, ...). >> >> But before starting to cleanup+improve the existing handling of anonymous >> memory, I was wondering if this whole thing is getting used at all. > > Yeah if this can be made reliable (under the limitation of only anon > memory and only excluding userspace access) then I expect we'll need this > for a very long time. I just had no idea whether even that is possible. > > What isn't good is if it's only mostly reliable, like the current > pgmap->ops->migrate_to_ram() path in do_swap_page() still is. I'll cc you on patches once I figure out some details on how to fix some page table walkers that really don't expect these non-swap entries. Fortunately, the hmm test device is in place to trigger some shaky scenarios. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-24 17:54 ` David Hildenbrand @ 2025-01-28 0:09 ` Alistair Popple 2025-01-28 20:14 ` Simona Vetter 0 siblings, 1 reply; 16+ messages in thread From: Alistair Popple @ 2025-01-28 0:09 UTC (permalink / raw) To: David Hildenbrand Cc: linux-mm, John Hubbard, nouveau, Jason Gunthorpe, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On Fri, Jan 24, 2025 at 06:54:02PM +0100, David Hildenbrand wrote: > > > > On integrated the gpu is tied into the coherency > > > > fabric, so there it's not needed. > > > > > > > > I think the more fundamental question with both this function here and > > > > with forced migration to device memory is that there's no guarantee it > > > > will work out. > > > > > > Yes, in particular with device-exclusive, it doesn't really work with THP > > > and is only limited to anonymous memory. I have patches to at least make it > > > work reliably with THP. > > > > I should have crawled through the implementation first before replying. > > Since it only looks at folio_mapcount() make_device_exclusive() should at > > least in theory work reliably on anon memory, and not be impacted by > > elevated refcounts due to migration/ksm/thp/whatever. > > Yes, there is -- in theory -- nothing blocking the conversion except the > folio lock. That's different than page migration. Indeed - this was the entire motivation for make_device_exclusive() - that we needed a way to reliably exclude CPU access that couldn't be blocked in the same way page migration can (otherwise we could have just migrated to a device page, even if that may have added unwanted overhead). > [...] > > > > > > Then, we seem to give up too easily if we cannot lock the folio when wanting > > > to convert to device-exclusive, which also looks rather odd. 
But well, maybe > > > it just works well enough in the common case, or there is some other retry > > > logic that makes it fly. > > > > I've crawled through the path to migrate pages from device memory back to > > system memory a few months ago, and found some livelock issues in there. > > Wouldn't be surprised if m_d_e has some of the same, but I didn't dig > > through it (not least because intel can't use it due to not-so-great hw > > design). > > Yes, that might be possible. Maybe something keeps spinning while the folio > is locked instead of properly waiting for the lock. > > ... or someone is trying to convert a tail page of a THP, in which case we > will also keep failing the conversion right now. > > > There are other concerns I have (what if the page is pinned and accessed > > > outside of the user space page tables?). Maybe there was no need to handle > > > these cases so far. > > > > I think that's also ok, but it might be good to document clearly that > > concurrent direct i/o or rdma-registered buffers or whatever will mess with > > this. The promise is only between the gpu and the cpu, not anything else, > > in current apis. At least to my knowledge. That is correct - we never came up with a good solution for direct i/o or rdma, so the promise is only between cpu and gpu. That said, direct i/o probably implies a shared filebacked mapping. We weren't convinced there was a useful use case there because anything could potentially map a file and use non-atomic CPU instructions to access the shared mapping anyway. > Well, the issue is that, e.g., io_uring fixed buffers can be accessed from the > CPU using the direct map, not necessarily using DMA from a > device. Right. This was all targeted at private anonymous memory and it's assumed that both cpu and gpu are using atomic operations. 
As soon as you're talking shared non-anonymous mappings it seems impossible for userspace to guarantee nothing is accessing that memory with non-atomic ops anyway, even with a coherent interconnect. > In any case, I'm planning on adding some code-level documentation for that > and look into extending the high-level doc while at it. Thanks! Please CC me in. > > > > > So best I can do is make anonymous memory more reliable with > > > device-exclusive and fixup some of the problematic parts that I see (e.g., > > > broken page reclaim, page migration, ...). > > > > > > But before starting to cleanup+improve the existing handling of anonymous > > > memory, I was wondering if this whole thing is getting used at all. > > > > Yeah if this can be made reliable (under the limitation of only anon > > memory and only excluding userspace access) then I expect we'll need this > > for a very long time. I just had no idea whether even that is possible. > > > > What isn't good is if it's only mostly reliable, like the current > > pgmap->ops->migrate_to_ram() path in do_swap_page() still is. Right. This is _supposed_ to be 100% reliable under those limitations although I will admit I didn't consider THP deeply as that is not supported for DEVICE_PRIVATE anyway. > I'll cc you on patches once I figure out some details on how to fix some > page table walkers that really don't expect these non-swap entries. > > Fortunately, the hmm test device is in place to trigger some shaky > scenarios. > > -- > Cheers, > > David / dhildenb > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-28 0:09 ` Alistair Popple @ 2025-01-28 20:14 ` Simona Vetter 2025-01-28 20:24 ` David Hildenbrand 0 siblings, 1 reply; 16+ messages in thread From: Simona Vetter @ 2025-01-28 20:14 UTC (permalink / raw) To: Alistair Popple Cc: David Hildenbrand, linux-mm, John Hubbard, nouveau, Jason Gunthorpe, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On Tue, Jan 28, 2025 at 11:09:24AM +1100, Alistair Popple wrote: > On Fri, Jan 24, 2025 at 06:54:02PM +0100, David Hildenbrand wrote: > > > > > On integrated the gpu is tied into the coherency > > > > > fabric, so there it's not needed. > > > > > > > > > > I think the more fundamental question with both this function here and > > > > > with forced migration to device memory is that there's no guarantee it > > > > > will work out. > > > > > > > > Yes, in particular with device-exclusive, it doesn't really work with THP > > > > and is only limited to anonymous memory. I have patches to at least make it > > > > work reliably with THP. > > > > > > I should have crawled through the implementation first before replying. > > > Since it only looks at folio_mapcount() make_device_exclusive() should at > > > least in theory work reliably on anon memory, and not be impacted by > > > elevated refcounts due to migration/ksm/thp/whatever. > > > > Yes, there is -- in theory -- nothing blocking the conversion except the > > folio lock. That's different than page migration. > > Indeed - this was the entire motivation for make_device_exclusive() - that we > needed a way to reliably exclude CPU access that couldn't be blocked in the same > way page migration can (otherwise we could have just migrated to a device page, > even if that may have added unwanted overhead). The folio_trylock worries me a bit. 
I guess this is to avoid deadlocks when locking multiple folios, but I think at least on the first one we need an unconditional folio_lock to guarantee forward progress. Since atomics can't cross 4k boundaries (or the hw is just really broken) this should be enough to avoid being stuck in a livelock. I'm also not seeing any other reason why a folio_lock shouldn't work here, but then my understanding of mm/ stuff is really just scratching the surface. I did crawl through all the other code and it looks like everything else is unconditional locks. So looks all good and I didn't spot anything else that seemed problematic. Somewhat aside, I do wonder whether we really want to require callers to hold the mmap lock, or whether with all the work towards lockless fastpath that shouldn't instead just be an implementation detail. At least for the gpu hmm code I've seen I've tried to push hard towards a world where the gpu side does not rely on mmap_read_lock being held at all, to future-proof this all. And currently we only have one caller of make_device_exclusive_range() so it would be simple to do. -Sima -- Simona Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-28 20:14 ` Simona Vetter @ 2025-01-28 20:24 ` David Hildenbrand 2025-01-29 10:48 ` Simona Vetter 0 siblings, 1 reply; 16+ messages in thread From: David Hildenbrand @ 2025-01-28 20:24 UTC (permalink / raw) To: Alistair Popple, linux-mm, John Hubbard, nouveau, Jason Gunthorpe, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On 28.01.25 21:14, Simona Vetter wrote: > On Tue, Jan 28, 2025 at 11:09:24AM +1100, Alistair Popple wrote: >> On Fri, Jan 24, 2025 at 06:54:02PM +0100, David Hildenbrand wrote: >>>>>> On integrated the gpu is tied into the coherency >>>>>> fabric, so there it's not needed. >>>>>> >>>>>> I think the more fundamental question with both this function here and >>>>>> with forced migration to device memory is that there's no guarantee it >>>>>> will work out. >>>>> >>>>> Yes, in particular with device-exclusive, it doesn't really work with THP >>>>> and is only limited to anonymous memory. I have patches to at least make it >>>>> work reliably with THP. >>>> >>>> I should have crawled through the implementation first before replying. >>>> Since it only looks at folio_mapcount() make_device_exclusive() should at >>>> least in theory work reliably on anon memory, and not be impacted by >>>> elevated refcounts due to migration/ksm/thp/whatever. >>> >>> Yes, there is -- in theory -- nothing blocking the conversion except the >>> folio lock. That's different than page migration. >> >> Indeed - this was the entire motivation for make_device_exclusive() - that we >> needed a way to reliably exclude CPU access that couldn't be blocked in the same >> way page migration can (otherwise we could have just migrated to a device page, >> even if that may have added unwanted overhead). > > The folio_trylock worries me a bit. 
I guess this is to avoid deadlocks > when locking multiple folios, but I think at least on the first one we > need an unconditional folio_lock to guarantee forward progress. At least on the hmm path I was able to trigger the EBUSY a couple of times due to concurrent swapout. But the hmm-tests selftest fails immediately instead of retrying. > Since > atomics can't cross 4k boundaries (or the hw is just really broken) this > should be enough to avoid being stuck in a livelock. I'm also not seeing > any other reason why a folio_lock shouldn't work here, but then my > understanding of mm/ stuff is really just scratching the surface. > > I did crawl through all the other code and it looks like everything else > is unconditional locks. So looks all good and I didn't spot anything else > that seemed problematic. > > Somewhat aside, I do wonder whether we really want to require callers to > hold the mmap lock, or whether with all the work towards lockless fastpath > that shouldn't instead just be an implementation detail. We might be able to use the VMA lock in the future, but that will require GUP support and a bunch more. Until then, the mm_lock in read mode is required. I was not able to convince myself that we'll really need the folio lock, but that's also a separate discussion. > At least for the > gpu hmm code I've seen I've tried to push hard towards a world were the > gpu side does not rely on mmap_read_lock being held at all, to future > proof this all. And currently we only have one caller of > make_device_exclusive_range() so would be simple to do. We could likely move the mmap_lock into that function, but avoiding it is more effort. In any case, I'll send something out probably tomorrow to fix page migration/swapout of pages with device-exclusive entries and a bunch of other things (THP, interaction with hugetlb, ...). -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-28 20:24 ` David Hildenbrand @ 2025-01-29 10:48 ` Simona Vetter 2025-01-29 11:28 ` Simona Vetter 0 siblings, 1 reply; 16+ messages in thread From: Simona Vetter @ 2025-01-29 10:48 UTC (permalink / raw) To: David Hildenbrand Cc: Alistair Popple, linux-mm, John Hubbard, nouveau, Jason Gunthorpe, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On Tue, Jan 28, 2025 at 09:24:33PM +0100, David Hildenbrand wrote: > On 28.01.25 21:14, Simona Vetter wrote: > > On Tue, Jan 28, 2025 at 11:09:24AM +1100, Alistair Popple wrote: > > > On Fri, Jan 24, 2025 at 06:54:02PM +0100, David Hildenbrand wrote: > > > > > > > On integrated the gpu is tied into the coherency > > > > > > > fabric, so there it's not needed. > > > > > > > > > > > > > > I think the more fundamental question with both this function here and > > > > > > > with forced migration to device memory is that there's no guarantee it > > > > > > > will work out. > > > > > > > > > > > > Yes, in particular with device-exclusive, it doesn't really work with THP > > > > > > and is only limited to anonymous memory. I have patches to at least make it > > > > > > work reliably with THP. > > > > > > > > > > I should have crawled through the implementation first before replying. > > > > > Since it only looks at folio_mapcount() make_device_exclusive() should at > > > > > least in theory work reliably on anon memory, and not be impacted by > > > > > elevated refcounts due to migration/ksm/thp/whatever. > > > > > > > > Yes, there is -- in theory -- nothing blocking the conversion except the > > > > folio lock. That's different than page migration. 
> > > > > > Indeed - this was the entire motivation for make_device_exclusive() - that we > > > needed a way to reliably exclude CPU access that couldn't be blocked in the same > > > way page migration can (otherwise we could have just migrated to a device page, > > > even if that may have added unwanted overhead). > > > > The folio_trylock worries me a bit. I guess this is to avoid deadlocks > > when locking multiple folios, but I think at least on the first one we > > need an unconditional folio_lock to guarantee forward progress. > > At least on the hmm path I was able to trigger the EBUSY a couple of times > due to concurrent swapout. But the hmm-tests selftest fails immediately > instead of retrying. My worry with just retrying is that it's very hard to assess whether there's a livelock or whether the retry has a good chance of success. As an example, the ->migrate_to_ram path has some trylocks, and the window where all other threads get halfway and then fail the trylock is big enough that once you pile up enough threads that spin through there, you're stuck forever. Which isn't great. So if we could convert at least the first folio_trylock into a plain lock then forward progress is obviously assured and there's no need to crawl through large chunks of mm/ code to hunt for corner cases where we could be too unlucky to ever win the race. > > Since > > atomics can't cross 4k boundaries (or the hw is just really broken) this > > should be enough to avoid being stuck in a livelock. I'm also not seeing > > any other reason why a folio_lock shouldn't work here, but then my > > understanding of mm/ stuff is really just scratching the surface. > > > > I did crawl through all the other code and it looks like everything else > > is unconditional locks. So looks all good and I didn't spot anything else > > that seemed problematic. 
> > > > Somewhat aside, I do wonder whether we really want to require callers to > > hold the mmap lock, or whether with all the work towards lockless fastpath > > that shouldn't instead just be an implementation detail. > > We might be able to use the VMA lock in the future, but that will require > GUP support and a bunch more. Until then, the mm_lock in read mode is > required. Yup. I also don't think we should try to improve before benchmarks show an actual need. It's more about future proofing and making sure mmap_lock doesn't leak into driver data structures that I'm worried about. Because I've seen some hmm/gpu rfc patches that heavily relied on mmap_lock to keep everything correct on the driver side, which is not a clean design. > I was not able to convince myself that we'll really need the folio lock, but > that's also a separate discussion. This is way above my pay understanding of mm/ unfortunately. > > At least for the > > gpu hmm code I've seen I've tried to push hard towards a world were the > > gpu side does not rely on mmap_read_lock being held at all, to future > > proof this all. And currently we only have one caller of > > make_device_exclusive_range() so would be simple to do. > > We could likely move the mmap_lock into that function, but avoiding it is > more effort. I didn't mean more than just that, which would make sure drivers at least do not rely on mmap_lock being held. That then allows us to switch over to vma lock or anything else entirely within mm/ code. If we leave it as-is then more drivers accidentally or intentionally will rely on this, like I think is the case for ->migrate_to_ram for hmm already. And then it's more pain to untangle. > In any case, I'll send something out probably tomorrow to fix page > migration/swapout of pages with device-exclusive entries and a bunch of > other things (THP, interaction with hugetlb, ...). Thanks a lot! 
Cheers, Sima > > -- > Cheers, > > David / dhildenb > -- Simona Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-29 10:48 ` Simona Vetter @ 2025-01-29 11:28 ` Simona Vetter 2025-01-29 11:31 ` David Hildenbrand 0 siblings, 1 reply; 16+ messages in thread From: Simona Vetter @ 2025-01-29 11:28 UTC (permalink / raw) To: David Hildenbrand, Alistair Popple, linux-mm, John Hubbard, nouveau, Jason Gunthorpe, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On Wed, Jan 29, 2025 at 11:48:03AM +0100, Simona Vetter wrote: > On Tue, Jan 28, 2025 at 09:24:33PM +0100, David Hildenbrand wrote: > > On 28.01.25 21:14, Simona Vetter wrote: > > > On Tue, Jan 28, 2025 at 11:09:24AM +1100, Alistair Popple wrote: > > > > On Fri, Jan 24, 2025 at 06:54:02PM +0100, David Hildenbrand wrote: > > > > > > > > On integrated the gpu is tied into the coherency > > > > > > > > fabric, so there it's not needed. > > > > > > > > > > > > > > > > I think the more fundamental question with both this function here and > > > > > > > > with forced migration to device memory is that there's no guarantee it > > > > > > > > will work out. > > > > > > > > > > > > > > Yes, in particular with device-exclusive, it doesn't really work with THP > > > > > > > and is only limited to anonymous memory. I have patches to at least make it > > > > > > > work reliably with THP. > > > > > > > > > > > > I should have crawled through the implementation first before replying. > > > > > > Since it only looks at folio_mapcount() make_device_exclusive() should at > > > > > > least in theory work reliably on anon memory, and not be impacted by > > > > > > elevated refcounts due to migration/ksm/thp/whatever. > > > > > > > > > > Yes, there is -- in theory -- nothing blocking the conversion except the > > > > > folio lock. That's different than page migration. 
> > > > > > > > Indeed - this was the entire motivation for make_device_exclusive() - that we > > > > needed a way to reliably exclude CPU access that couldn't be blocked in the same > > > > way page migration can (otherwise we could have just migrated to a device page, > > > > even if that may have added unwanted overhead). > > > > > > The folio_trylock worries me a bit. I guess this is to avoid deadlocks > > > when locking multiple folios, but I think at least on the first one we > > > need an unconditional folio_lock to guarantee forward progress. > > > > At least on the hmm path I was able to trigger the EBUSY a couple of times > > due to concurrent swapout. But the hmm-tests selftest fails immediately > > instead of retrying. > > My worries with just retrying is that it's very hard to assess whether > there's a livelock or whether the retry has a good chance of success. As > an example the ->migrate_to_ram path has some trylocks, and the window > where all other threads got halfway and then fail the trylock is big > enough that once you pile up enough threads that spin through there, > you're stuck forever. Which isn't great. > > So if we could convert at least the first folio_trylock into a plain lock > then forward progress is obviously assured and there's no need to crawl > through large chunks of mm/ code to hunt for corner cases where we could > be too unlucky to ever win the race. > > > > Since > > > atomics can't cross 4k boundaries (or the hw is just really broken) this > > > should be enough to avoid being stuck in a livelock. I'm also not seeing > > > any other reason why a folio_lock shouldn't work here, but then my > > > understanding of mm/ stuff is really just scratching the surface. > > > > > > I did crawl through all the other code and it looks like everything else > > > is unconditional locks. So looks all good and I didn't spot anything else > > > that seemed problematic. 
> > > > > > Somewhat aside, I do wonder whether we really want to require callers to > > > hold the mmap lock, or whether with all the work towards lockless fastpath > > > that shouldn't instead just be an implementation detail. > > > > We might be able to use the VMA lock in the future, but that will require > > GUP support and a bunch more. Until then, the mm_lock in read mode is > > required. > > Yup. I also don't think we should try to improve before benchmarks show an > actual need. It's more about future proofing and making sure mmap_lock > doesn't leak into driver data structures that I'm worried about. Because > I've seen some hmm/gpu rfc patches that heavily relied on mmap_lock to > keep everything correct on the driver side, which is not a clean design. > > > I was not able to convince myself that we'll really need the folio lock, but > > that's also a separate discussion. > > This is way above my pay understanding of mm/ unfortunately. I pondered this some more, and I think it's to make sure we get a stable reading of folio_mapcount() and are not racing with new rmaps being established. But I also got lost a few times in the maze ... -Sima > > > > At least for the > > > gpu hmm code I've seen I've tried to push hard towards a world were the > > > gpu side does not rely on mmap_read_lock being held at all, to future > > > proof this all. And currently we only have one caller of > > > make_device_exclusive_range() so would be simple to do. > > > > We could likely move the mmap_lock into that function, but avoiding it is > > more effort. > > I didn't mean more than just that, which would make sure drivers at least > do not rely on mmap_lock being held. That then allows us to switch over to > vma lock or anything else entirely within mm/ code. > > If we leave it as-is then more drivers accidentally or intentionally will > rely on this, like I think is the case for ->migrate_to_ram for hmm > already. And then it's more pain to untangle. 
> > > In any case, I'll send something out probably tomorrow to fix page > > migration/swapout of pages with device-exclusive entries and a bunch of > > other things (THP, interaction with hugetlb, ...). > > Thanks a lot! > > Cheer, Sima > > > > -- > > Cheers, > > > > David / dhildenb > > > > -- > Simona Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch -- Simona Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-29 11:28 ` Simona Vetter @ 2025-01-29 11:31 ` David Hildenbrand 2025-01-29 14:05 ` Simona Vetter 0 siblings, 1 reply; 16+ messages in thread From: David Hildenbrand @ 2025-01-29 11:31 UTC (permalink / raw) To: Alistair Popple, linux-mm, John Hubbard, nouveau, Jason Gunthorpe, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On 29.01.25 12:28, Simona Vetter wrote: > On Wed, Jan 29, 2025 at 11:48:03AM +0100, Simona Vetter wrote: >> On Tue, Jan 28, 2025 at 09:24:33PM +0100, David Hildenbrand wrote: >>> On 28.01.25 21:14, Simona Vetter wrote: >>>> On Tue, Jan 28, 2025 at 11:09:24AM +1100, Alistair Popple wrote: >>>>> On Fri, Jan 24, 2025 at 06:54:02PM +0100, David Hildenbrand wrote: >>>>>>>>> On integrated the gpu is tied into the coherency >>>>>>>>> fabric, so there it's not needed. >>>>>>>>> >>>>>>>>> I think the more fundamental question with both this function here and >>>>>>>>> with forced migration to device memory is that there's no guarantee it >>>>>>>>> will work out. >>>>>>>> >>>>>>>> Yes, in particular with device-exclusive, it doesn't really work with THP >>>>>>>> and is only limited to anonymous memory. I have patches to at least make it >>>>>>>> work reliably with THP. >>>>>>> >>>>>>> I should have crawled through the implementation first before replying. >>>>>>> Since it only looks at folio_mapcount() make_device_exclusive() should at >>>>>>> least in theory work reliably on anon memory, and not be impacted by >>>>>>> elevated refcounts due to migration/ksm/thp/whatever. >>>>>> >>>>>> Yes, there is -- in theory -- nothing blocking the conversion except the >>>>>> folio lock. That's different than page migration. 
>>>>> >>>>> Indeed - this was the entire motivation for make_device_exclusive() - that we >>>>> needed a way to reliably exclude CPU access that couldn't be blocked in the same >>>>> way page migration can (otherwise we could have just migrated to a device page, >>>>> even if that may have added unwanted overhead). >>>> >>>> The folio_trylock worries me a bit. I guess this is to avoid deadlocks >>>> when locking multiple folios, but I think at least on the first one we >>>> need an unconditional folio_lock to guarantee forward progress. >>> >>> At least on the hmm path I was able to trigger the EBUSY a couple of times >>> due to concurrent swapout. But the hmm-tests selftest fails immediately >>> instead of retrying. >> >> My worries with just retrying is that it's very hard to assess whether >> there's a livelock or whether the retry has a good chance of success. As >> an example the ->migrate_to_ram path has some trylocks, and the window >> where all other threads got halfway and then fail the trylock is big >> enough that once you pile up enough threads that spin through there, >> you're stuck forever. Which isn't great. >> >> So if we could convert at least the first folio_trylock into a plain lock >> then forward progress is obviously assured and there's no need to crawl >> through large chunks of mm/ code to hunt for corner cases where we could >> be too unlucky to ever win the race. >> >>>> Since >>>> atomics can't cross 4k boundaries (or the hw is just really broken) this >>>> should be enough to avoid being stuck in a livelock. I'm also not seeing >>>> any other reason why a folio_lock shouldn't work here, but then my >>>> understanding of mm/ stuff is really just scratching the surface. >>>> >>>> I did crawl through all the other code and it looks like everything else >>>> is unconditional locks. So looks all good and I didn't spot anything else >>>> that seemed problematic. 
>>>> >>>> Somewhat aside, I do wonder whether we really want to require callers to >>>> hold the mmap lock, or whether with all the work towards lockless fastpath >>>> that shouldn't instead just be an implementation detail. >>> >>> We might be able to use the VMA lock in the future, but that will require >>> GUP support and a bunch more. Until then, the mm_lock in read mode is >>> required. >> >> Yup. I also don't think we should try to improve before benchmarks show an >> actual need. It's more about future proofing and making sure mmap_lock >> doesn't leak into driver data structures that I'm worried about. Because >> I've seen some hmm/gpu rfc patches that heavily relied on mmap_lock to >> keep everything correct on the driver side, which is not a clean design. >> >>> I was not able to convince myself that we'll really need the folio lock, but >>> that's also a separate discussion. >> >> This is way above my pay understanding of mm/ unfortunately. > > I pondered this some more, and I think it's to make sure we get a stable > reading of folio_mapcount() and are not racing with new rmaps being > established. But I also got lost a few times in the maze ... That mapcount handling is all messed up and I'll remove that along with the rmap walk. Also, the folio lock does not stabilize the mapcount at all ... Here is my understanding: commit e2dca6db09186534c7e6082b77be6e17d8920f10 Author: David Hildenbrand <david@redhat.com> Date: Tue Jan 28 15:25:37 2025 +0100 mm/memory: document restore_exclusive_pte() Let's document how this function is to be used, and why the requirement for the folio lock might maybe be dropped in the future. 
Signed-off-by: David Hildenbrand <david@redhat.com>

diff --git a/mm/memory.c b/mm/memory.c
index 46956994aaff..caaae8df11a9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -718,6 +718,31 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
 }
 #endif
+/**
+ * restore_exclusive_pte - Restore a device-exclusive entry
+ * @vma: VMA covering @address
+ * @folio: the mapped folio
+ * @page: the mapped folio page
+ * @address: the virtual address
+ * @ptep: PTE pointer into the locked page table mapping the folio page
+ * @orig_pte: PTE value at @ptep
+ *
+ * Restore a device-exclusive non-swap entry to an ordinary present PTE.
+ *
+ * The folio and the page table must be locked, and MMU notifiers must have
+ * been called to invalidate any (exclusive) device mappings. In case of
+ * fork(), MMU_NOTIFY_PROTECTION_PAGE is triggered, and in case of a page
+ * fault MMU_NOTIFY_EXCLUSIVE is triggered.
+ *
+ * Locking the folio makes sure that anybody who just converted the PTE to
+ * a device-exclusive entry can map it into the device, before unlocking it; so
+ * the folio lock prevents concurrent conversion to device-exclusive.
+ *
+ * TODO: the folio lock does not protect against all cases of concurrent
+ * page table modifications (e.g., MADV_DONTNEED, mprotect), so device drivers
+ * must already use MMU notifiers to sync against any concurrent changes.
+ * Maybe the requirement for the folio lock can be dropped in the future.
+ */

--
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-29 11:31 ` David Hildenbrand @ 2025-01-29 14:05 ` Simona Vetter 2025-01-29 16:13 ` David Hildenbrand 0 siblings, 1 reply; 16+ messages in thread From: Simona Vetter @ 2025-01-29 14:05 UTC (permalink / raw) To: David Hildenbrand Cc: Alistair Popple, linux-mm, John Hubbard, nouveau, Jason Gunthorpe, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On Wed, Jan 29, 2025 at 12:31:14PM +0100, David Hildenbrand wrote: > On 29.01.25 12:28, Simona Vetter wrote: > > On Wed, Jan 29, 2025 at 11:48:03AM +0100, Simona Vetter wrote: > > > On Tue, Jan 28, 2025 at 09:24:33PM +0100, David Hildenbrand wrote: > > > > On 28.01.25 21:14, Simona Vetter wrote: > > > > > On Tue, Jan 28, 2025 at 11:09:24AM +1100, Alistair Popple wrote: > > > > > > On Fri, Jan 24, 2025 at 06:54:02PM +0100, David Hildenbrand wrote: > > > > > > > > > > On integrated the gpu is tied into the coherency > > > > > > > > > > fabric, so there it's not needed. > > > > > > > > > > > > > > > > > > > > I think the more fundamental question with both this function here and > > > > > > > > > > with forced migration to device memory is that there's no guarantee it > > > > > > > > > > will work out. > > > > > > > > > > > > > > > > > > Yes, in particular with device-exclusive, it doesn't really work with THP > > > > > > > > > and is only limited to anonymous memory. I have patches to at least make it > > > > > > > > > work reliably with THP. > > > > > > > > > > > > > > > > I should have crawled through the implementation first before replying. > > > > > > > > Since it only looks at folio_mapcount() make_device_exclusive() should at > > > > > > > > least in theory work reliably on anon memory, and not be impacted by > > > > > > > > elevated refcounts due to migration/ksm/thp/whatever. 
> > > > > > > > > > > > > > Yes, there is -- in theory -- nothing blocking the conversion except the > > > > > > > folio lock. That's different than page migration. > > > > > > > > > > > > Indeed - this was the entire motivation for make_device_exclusive() - that we > > > > > > needed a way to reliably exclude CPU access that couldn't be blocked in the same > > > > > > way page migration can (otherwise we could have just migrated to a device page, > > > > > > even if that may have added unwanted overhead). > > > > > > > > > > The folio_trylock worries me a bit. I guess this is to avoid deadlocks > > > > > when locking multiple folios, but I think at least on the first one we > > > > > need an unconditional folio_lock to guarantee forward progress. > > > > > > > > At least on the hmm path I was able to trigger the EBUSY a couple of times > > > > due to concurrent swapout. But the hmm-tests selftest fails immediately > > > > instead of retrying. > > > > > > My worries with just retrying is that it's very hard to assess whether > > > there's a livelock or whether the retry has a good chance of success. As > > > an example the ->migrate_to_ram path has some trylocks, and the window > > > where all other threads got halfway and then fail the trylock is big > > > enough that once you pile up enough threads that spin through there, > > > you're stuck forever. Which isn't great. > > > > > > So if we could convert at least the first folio_trylock into a plain lock > > > then forward progress is obviously assured and there's no need to crawl > > > through large chunks of mm/ code to hunt for corner cases where we could > > > be too unlucky to ever win the race. > > > > > > > > Since > > > > > atomics can't cross 4k boundaries (or the hw is just really broken) this > > > > > should be enough to avoid being stuck in a livelock. 
I'm also not seeing > > > > > any other reason why a folio_lock shouldn't work here, but then my > > > > > understanding of mm/ stuff is really just scratching the surface. > > > > > > > > > > I did crawl through all the other code and it looks like everything else > > > > > is unconditional locks. So looks all good and I didn't spot anything else > > > > > that seemed problematic. > > > > > > > > > > Somewhat aside, I do wonder whether we really want to require callers to > > > > > hold the mmap lock, or whether with all the work towards lockless fastpath > > > > > that shouldn't instead just be an implementation detail. > > > > > > > > We might be able to use the VMA lock in the future, but that will require > > > > GUP support and a bunch more. Until then, the mm_lock in read mode is > > > > required. > > > > > > Yup. I also don't think we should try to improve before benchmarks show an > > > actual need. It's more about future proofing and making sure mmap_lock > > > doesn't leak into driver data structures that I'm worried about. Because > > > I've seen some hmm/gpu rfc patches that heavily relied on mmap_lock to > > > keep everything correct on the driver side, which is not a clean design. > > > > > > > I was not able to convince myself that we'll really need the folio lock, but > > > > that's also a separate discussion. > > > > > > This is way above my pay understanding of mm/ unfortunately. > > > > I pondered this some more, and I think it's to make sure we get a stable > > reading of folio_mapcount() and are not racing with new rmaps being > > established. But I also got lost a few times in the maze ... > > That mapcount handling is all messed up and I'll remove that along with > the rmap walk. Also, the folio lock does not stabilize the mapcount at all ... Hm ... also rethinking this all, we don't need a lot of guarantees here. 
Anything userspace does that re-elevates the mapcount or otherwise could starve out make_device_exclusive is I think a case of "don't do that". I think the only guarantee we need is that make_device_exclusive succeeds against other kernel stuff like thp/migration/ksm and all those things, at least reliably when you retry. And maybe also that concurrent make_device_exclusive calls don't starve each another but eventually get the job done (but only if it's the same owner). > Here is my understanding: > > commit e2dca6db09186534c7e6082b77be6e17d8920f10 > Author: David Hildenbrand <david@redhat.com> > Date: Tue Jan 28 15:25:37 2025 +0100 > > mm/memory: document restore_exclusive_pte() > Let's document how this function is to be used, and why the requirement > for the folio lock might maybe be dropped in the future. > Signed-off-by: David Hildenbrand <david@redhat.com> > > diff --git a/mm/memory.c b/mm/memory.c > index 46956994aaff..caaae8df11a9 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -718,6 +718,31 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma, > } > #endif > +/** > + * restore_exclusive_pte - Restore a device-exclusive entry > + * @vma: VMA covering @address > + * @folio: the mapped folio > + * @page: the mapped folio page > + * @address: the virtual address > + * @ptep: PTE pointer into the locked page table mapping the folio page > + * @orig_pte: PTE value at @ptep > + * > + * Restore a device-exclusive non-swap entry to an ordinary present PTE. > + * > + * The folio and the page table must be locked, and MMU notifiers must have > + * been called to invalidate any (exclusive) device mappings. In case of > + * fork(), MMU_NOTIFY_PROTECTION_PAGE is triggered, and in case of a page > + * fault MMU_NOTIFY_EXCLUSIVE is triggered. 
> + * > + * Locking the folio makes sure that anybody who just converted the PTE to > + * a device-private entry can map it into the device, before unlocking it; so > + * the folio lock prevents concurrent conversion to device-exclusive. > + * > + * TODO: the folio lock does not protect against all cases of concurrent > + * page table modifications (e.g., MADV_DONTNEED, mprotect), so device drivers > + * must already use MMU notifiers to sync against any concurrent changes > + * Maybe the requirement for the folio lock can be dropped in the future. Hm yeah I was a bit confused why this would work at first. But since we break cow with FOLL_WRITE there shouldn't be any other mappings around. Therefore relying on the mmu notifier for that mm_struct is enough, and we don't need to hold the folio_lock in callers. I think pushing both the folio_unlock and the mmap_read_lock/unlock into make_device_exclusive would be a good clarification of what's going on here. Cheers, Sima > + */ > > > -- > Cheers, > > David / dhildenb > -- Simona Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? 2025-01-29 14:05 ` Simona Vetter @ 2025-01-29 16:13 ` David Hildenbrand 2025-01-30 8:55 ` Simona Vetter 0 siblings, 1 reply; 16+ messages in thread From: David Hildenbrand @ 2025-01-29 16:13 UTC (permalink / raw) To: Alistair Popple, linux-mm, John Hubbard, nouveau, Jason Gunthorpe, DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich On 29.01.25 15:05, Simona Vetter wrote: > On Wed, Jan 29, 2025 at 12:31:14PM +0100, David Hildenbrand wrote: >> On 29.01.25 12:28, Simona Vetter wrote: >>> On Wed, Jan 29, 2025 at 11:48:03AM +0100, Simona Vetter wrote: >>>> On Tue, Jan 28, 2025 at 09:24:33PM +0100, David Hildenbrand wrote: >>>>> On 28.01.25 21:14, Simona Vetter wrote: >>>>>> On Tue, Jan 28, 2025 at 11:09:24AM +1100, Alistair Popple wrote: >>>>>>> On Fri, Jan 24, 2025 at 06:54:02PM +0100, David Hildenbrand wrote: >>>>>>>>>>> On integrated the gpu is tied into the coherency >>>>>>>>>>> fabric, so there it's not needed. >>>>>>>>>>> >>>>>>>>>>> I think the more fundamental question with both this function here and >>>>>>>>>>> with forced migration to device memory is that there's no guarantee it >>>>>>>>>>> will work out. >>>>>>>>>> >>>>>>>>>> Yes, in particular with device-exclusive, it doesn't really work with THP >>>>>>>>>> and is only limited to anonymous memory. I have patches to at least make it >>>>>>>>>> work reliably with THP. >>>>>>>>> >>>>>>>>> I should have crawled through the implementation first before replying. >>>>>>>>> Since it only looks at folio_mapcount() make_device_exclusive() should at >>>>>>>>> least in theory work reliably on anon memory, and not be impacted by >>>>>>>>> elevated refcounts due to migration/ksm/thp/whatever. >>>>>>>> >>>>>>>> Yes, there is -- in theory -- nothing blocking the conversion except the >>>>>>>> folio lock. That's different than page migration. 
>>>>>>> >>>>>>> Indeed - this was the entire motivation for make_device_exclusive() - that we >>>>>>> needed a way to reliably exclude CPU access that couldn't be blocked in the same >>>>>>> way page migration can (otherwise we could have just migrated to a device page, >>>>>>> even if that may have added unwanted overhead). >>>>>> >>>>>> The folio_trylock worries me a bit. I guess this is to avoid deadlocks >>>>>> when locking multiple folios, but I think at least on the first one we >>>>>> need an unconditional folio_lock to guarantee forward progress. >>>>> >>>>> At least on the hmm path I was able to trigger the EBUSY a couple of times >>>>> due to concurrent swapout. But the hmm-tests selftest fails immediately >>>>> instead of retrying. >>>> >>>> My worries with just retrying is that it's very hard to assess whether >>>> there's a livelock or whether the retry has a good chance of success. As >>>> an example the ->migrate_to_ram path has some trylocks, and the window >>>> where all other threads got halfway and then fail the trylock is big >>>> enough that once you pile up enough threads that spin through there, >>>> you're stuck forever. Which isn't great. >>>> >>>> So if we could convert at least the first folio_trylock into a plain lock >>>> then forward progress is obviously assured and there's no need to crawl >>>> through large chunks of mm/ code to hunt for corner cases where we could >>>> be too unlucky to ever win the race. >>>> >>>>>> Since >>>>>> atomics can't cross 4k boundaries (or the hw is just really broken) this >>>>>> should be enough to avoid being stuck in a livelock. I'm also not seeing >>>>>> any other reason why a folio_lock shouldn't work here, but then my >>>>>> understanding of mm/ stuff is really just scratching the surface. >>>>>> >>>>>> I did crawl through all the other code and it looks like everything else >>>>>> is unconditional locks. So looks all good and I didn't spot anything else >>>>>> that seemed problematic. 
>>>>>>
>>>>>> Somewhat aside, I do wonder whether we really want to require callers to
>>>>>> hold the mmap lock, or whether with all the work towards lockless fastpath
>>>>>> that shouldn't instead just be an implementation detail.
>>>>>
>>>>> We might be able to use the VMA lock in the future, but that will require
>>>>> GUP support and a bunch more. Until then, the mm_lock in read mode is
>>>>> required.
>>>>
>>>> Yup. I also don't think we should try to improve before benchmarks show an
>>>> actual need. It's more about future proofing and making sure mmap_lock
>>>> doesn't leak into driver data structures that I'm worried about. Because
>>>> I've seen some hmm/gpu rfc patches that heavily relied on mmap_lock to
>>>> keep everything correct on the driver side, which is not a clean design.
>>>>
>>>>> I was not able to convince myself that we'll really need the folio lock, but
>>>>> that's also a separate discussion.
>>>>
>>>> This is way above my pay understanding of mm/ unfortunately.
>>>
>>> I pondered this some more, and I think it's to make sure we get a stable
>>> reading of folio_mapcount() and are not racing with new rmaps being
>>> established. But I also got lost a few times in the maze ...
>>
>> That mapcount handling is all messed up and I'll remove that along with
>> the rmap walk. Also, the folio lock does not stabilize the mapcount at all ...
>
> Hm ... also rethinking this all, we don't need a lot of guarantees here.
> Anything userspace does that re-elevates the mapcount or otherwise could
> starve out make_device_exclusive is I think a case of "don't do that".
>
> I think the only guarantee we need is that make_device_exclusive succeeds
> against other kernel stuff like thp/migration/ksm and all those things, at
> least reliably when you retry. And maybe also that concurrent
> make_device_exclusive calls don't starve each other but eventually get
> the job done (but only if it's the same owner).
>
>> Here is my understanding:
>>
>> commit e2dca6db09186534c7e6082b77be6e17d8920f10
>> Author: David Hildenbrand <david@redhat.com>
>> Date:   Tue Jan 28 15:25:37 2025 +0100
>>
>>     mm/memory: document restore_exclusive_pte()
>>
>>     Let's document how this function is to be used, and why the requirement
>>     for the folio lock might maybe be dropped in the future.
>>
>>     Signed-off-by: David Hildenbrand <david@redhat.com>
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 46956994aaff..caaae8df11a9 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -718,6 +718,31 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
>>   }
>>   #endif
>> +/**
>> + * restore_exclusive_pte - Restore a device-exclusive entry
>> + * @vma: VMA covering @address
>> + * @folio: the mapped folio
>> + * @page: the mapped folio page
>> + * @address: the virtual address
>> + * @ptep: PTE pointer into the locked page table mapping the folio page
>> + * @orig_pte: PTE value at @ptep
>> + *
>> + * Restore a device-exclusive non-swap entry to an ordinary present PTE.
>> + *
>> + * The folio and the page table must be locked, and MMU notifiers must have
>> + * been called to invalidate any (exclusive) device mappings. In case of
>> + * fork(), MMU_NOTIFY_PROTECTION_PAGE is triggered, and in case of a page
>> + * fault MMU_NOTIFY_EXCLUSIVE is triggered.
>> + *
>> + * Locking the folio makes sure that anybody who just converted the PTE to
>> + * a device-private entry can map it into the device, before unlocking it; so
>> + * the folio lock prevents concurrent conversion to device-exclusive.
>> + *
>> + * TODO: the folio lock does not protect against all cases of concurrent
>> + * page table modifications (e.g., MADV_DONTNEED, mprotect), so device drivers
>> + * must already use MMU notifiers to sync against any concurrent changes.
>> + * Maybe the requirement for the folio lock can be dropped in the future.
>
> Hm yeah I was a bit confused why this would work at first. But since we
> break CoW with FOLL_WRITE there shouldn't be any other mappings around.
> Therefore relying on the mmu notifier for that mm_struct is enough, and we
> don't need to hold the folio_lock in callers.

Right; the devil is in the detail here; I looked at the SVM code and it
registers an MMU notifier around that code to catch any invalidations that
are happening while mapping the page. And it has another one to handle
invalidations in general. I did not completely digest why two notifiers are
required ... there likely is a good reason.

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 16+ messages in thread
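[The "MMU notifier around that code" scheme discussed above boils down to a sequence-count retry loop. Below is a userspace sketch of just that idea — all names are invented for illustration; the kernel's real interface is the mmu_interval_notifier read_begin/read_retry machinery:]

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Model of the notifier retry pattern: sample a sequence count before
 * converting the PTE and programming the device, bump it on every
 * invalidation, and retry the fault if it changed in between.
 */
atomic_ulong invalidate_seq;

unsigned long interval_read_begin(void)
{
	return atomic_load(&invalidate_seq);
}

/* Stands in for the invalidation callback firing. */
void interval_invalidate(void)
{
	atomic_fetch_add(&invalidate_seq, 1);
}

/* True means an invalidation raced with us and the fault must retry. */
bool interval_read_retry(unsigned long seq)
{
	return atomic_load(&invalidate_seq) != seq;
}
```

[The fault path is then: seq = interval_read_begin(); convert the PTE and map the page into the device; if interval_read_retry(seq), throw the result away and start over.]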
* Re: [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice?
  2025-01-29 16:13                             ` David Hildenbrand
@ 2025-01-30  8:55                               ` Simona Vetter
  0 siblings, 0 replies; 16+ messages in thread
From: Simona Vetter @ 2025-01-30 8:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alistair Popple, linux-mm, John Hubbard, nouveau, Jason Gunthorpe,
	DRI Development, Karol Herbst, Lyude Paul, Danilo Krummrich

On Wed, Jan 29, 2025 at 05:13:41PM +0100, David Hildenbrand wrote:
> On 29.01.25 15:05, Simona Vetter wrote:
> > On Wed, Jan 29, 2025 at 12:31:14PM +0100, David Hildenbrand wrote:
> > > On 29.01.25 12:28, Simona Vetter wrote:
> > > > On Wed, Jan 29, 2025 at 11:48:03AM +0100, Simona Vetter wrote:
> > > > > On Tue, Jan 28, 2025 at 09:24:33PM +0100, David Hildenbrand wrote:
> > > > > > On 28.01.25 21:14, Simona Vetter wrote:
> > > > > > > On Tue, Jan 28, 2025 at 11:09:24AM +1100, Alistair Popple wrote:
> > > > > > > > On Fri, Jan 24, 2025 at 06:54:02PM +0100, David Hildenbrand wrote:
> > > > > > > > > > > > On integrated the gpu is tied into the coherency
> > > > > > > > > > > > fabric, so there it's not needed.
> > > > > > > > > > > >
> > > > > > > > > > > > I think the more fundamental question with both this function here and
> > > > > > > > > > > > with forced migration to device memory is that there's no guarantee it
> > > > > > > > > > > > will work out.
> > > > > > > > > > >
> > > > > > > > > > > Yes, in particular with device-exclusive, it doesn't really work with THP
> > > > > > > > > > > and is only limited to anonymous memory. I have patches to at least make it
> > > > > > > > > > > work reliably with THP.
> > > > > > > > > >
> > > > > > > > > > I should have crawled through the implementation first before replying.
> > > > > > > > > > Since it only looks at folio_mapcount() make_device_exclusive() should at
> > > > > > > > > > least in theory work reliably on anon memory, and not be impacted by
> > > > > > > > > > elevated refcounts due to migration/ksm/thp/whatever.
> > > > > > > > >
> > > > > > > > > Yes, there is -- in theory -- nothing blocking the conversion except the
> > > > > > > > > folio lock. That's different than page migration.
> > > > > > > >
> > > > > > > > Indeed - this was the entire motivation for make_device_exclusive() - that we
> > > > > > > > needed a way to reliably exclude CPU access that couldn't be blocked in the same
> > > > > > > > way page migration can (otherwise we could have just migrated to a device page,
> > > > > > > > even if that may have added unwanted overhead).
> > > > > > >
> > > > > > > The folio_trylock worries me a bit. I guess this is to avoid deadlocks
> > > > > > > when locking multiple folios, but I think at least on the first one we
> > > > > > > need an unconditional folio_lock to guarantee forward progress.
> > > > > >
> > > > > > At least on the hmm path I was able to trigger the EBUSY a couple of times
> > > > > > due to concurrent swapout. But the hmm-tests selftest fails immediately
> > > > > > instead of retrying.
> > > > >
> > > > > My worries with just retrying is that it's very hard to assess whether
> > > > > there's a livelock or whether the retry has a good chance of success. As
> > > > > an example the ->migrate_to_ram path has some trylocks, and the window
> > > > > where all other threads got halfway and then fail the trylock is big
> > > > > enough that once you pile up enough threads that spin through there,
> > > > > you're stuck forever. Which isn't great.
> > > > >
> > > > > So if we could convert at least the first folio_trylock into a plain lock
> > > > > then forward progress is obviously assured and there's no need to crawl
> > > > > through large chunks of mm/ code to hunt for corner cases where we could
> > > > > be too unlucky to ever win the race.
> > > > >
> > > > > > > Since atomics can't cross 4k boundaries (or the hw is just really broken)
> > > > > > > this should be enough to avoid being stuck in a livelock. I'm also not
> > > > > > > seeing any other reason why a folio_lock shouldn't work here, but then my
> > > > > > > understanding of mm/ stuff is really just scratching the surface.
> > > > > > >
> > > > > > > I did crawl through all the other code and it looks like everything else
> > > > > > > is unconditional locks. So looks all good and I didn't spot anything else
> > > > > > > that seemed problematic.
> > > > > > >
> > > > > > > Somewhat aside, I do wonder whether we really want to require callers to
> > > > > > > hold the mmap lock, or whether with all the work towards lockless fastpath
> > > > > > > that shouldn't instead just be an implementation detail.
> > > > > >
> > > > > > We might be able to use the VMA lock in the future, but that will require
> > > > > > GUP support and a bunch more. Until then, the mm_lock in read mode is
> > > > > > required.
> > > > >
> > > > > Yup. I also don't think we should try to improve before benchmarks show an
> > > > > actual need. It's more about future proofing and making sure mmap_lock
> > > > > doesn't leak into driver data structures that I'm worried about. Because
> > > > > I've seen some hmm/gpu rfc patches that heavily relied on mmap_lock to
> > > > > keep everything correct on the driver side, which is not a clean design.
> > > > >
> > > > > > I was not able to convince myself that we'll really need the folio lock, but
> > > > > > that's also a separate discussion.
> > > > >
> > > > > This is way above my pay understanding of mm/ unfortunately.
> > > >
> > > > I pondered this some more, and I think it's to make sure we get a stable
> > > > reading of folio_mapcount() and are not racing with new rmaps being
> > > > established. But I also got lost a few times in the maze ...
> > >
> > > That mapcount handling is all messed up and I'll remove that along with
> > > the rmap walk. Also, the folio lock does not stabilize the mapcount at all ...
> >
> > Hm ... also rethinking this all, we don't need a lot of guarantees here.
> > Anything userspace does that re-elevates the mapcount or otherwise could
> > starve out make_device_exclusive is I think a case of "don't do that".
> >
> > I think the only guarantee we need is that make_device_exclusive succeeds
> > against other kernel stuff like thp/migration/ksm and all those things, at
> > least reliably when you retry. And maybe also that concurrent
> > make_device_exclusive calls don't starve each other but eventually get
> > the job done (but only if it's the same owner).
> >
> > > Here is my understanding:
> > >
> > > commit e2dca6db09186534c7e6082b77be6e17d8920f10
> > > Author: David Hildenbrand <david@redhat.com>
> > > Date:   Tue Jan 28 15:25:37 2025 +0100
> > >
> > >     mm/memory: document restore_exclusive_pte()
> > >
> > >     Let's document how this function is to be used, and why the requirement
> > >     for the folio lock might maybe be dropped in the future.
> > >
> > >     Signed-off-by: David Hildenbrand <david@redhat.com>
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 46956994aaff..caaae8df11a9 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -718,6 +718,31 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
> > >   }
> > >   #endif
> > > +/**
> > > + * restore_exclusive_pte - Restore a device-exclusive entry
> > > + * @vma: VMA covering @address
> > > + * @folio: the mapped folio
> > > + * @page: the mapped folio page
> > > + * @address: the virtual address
> > > + * @ptep: PTE pointer into the locked page table mapping the folio page
> > > + * @orig_pte: PTE value at @ptep
> > > + *
> > > + * Restore a device-exclusive non-swap entry to an ordinary present PTE.
> > > + *
> > > + * The folio and the page table must be locked, and MMU notifiers must have
> > > + * been called to invalidate any (exclusive) device mappings. In case of
> > > + * fork(), MMU_NOTIFY_PROTECTION_PAGE is triggered, and in case of a page
> > > + * fault MMU_NOTIFY_EXCLUSIVE is triggered.
> > > + *
> > > + * Locking the folio makes sure that anybody who just converted the PTE to
> > > + * a device-private entry can map it into the device, before unlocking it; so
> > > + * the folio lock prevents concurrent conversion to device-exclusive.
> > > + *
> > > + * TODO: the folio lock does not protect against all cases of concurrent
> > > + * page table modifications (e.g., MADV_DONTNEED, mprotect), so device drivers
> > > + * must already use MMU notifiers to sync against any concurrent changes.
> > > + * Maybe the requirement for the folio lock can be dropped in the future.
> >
> > Hm yeah I was a bit confused why this would work at first. But since we
> > break CoW with FOLL_WRITE there shouldn't be any other mappings around.
> > Therefore relying on the mmu notifier for that mm_struct is enough, and we
> > don't need to hold the folio_lock in callers.
>
> Right; the devil is in the detail here; I looked at the SVM code and it
> registers an MMU notifier around that code to catch any invalidations that
> are happening while mapping the page. And it has another one to handle
> invalidations in general. I did not completely digest why two notifiers are
> required ... there likely is a good reason.

At least as a fallback you need a notifier for exactly only the fault, to
make sure you don't retry due to spurious invalidations. Otherwise if
userspace is hammering an adjacent page from the CPU we would not make
forward progress, and that's a correctness issue.

Now always inserting a new notifier for every fault in all cases isn't
great because it incurs a stall. But I discussed this issue with Thomas
Hellstrom a while back and there's a bunch of tricks we could deploy if
this ever becomes a performance issue.

So this all looks ok to me.
-Sima

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 16+ messages in thread
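[Simona's point about a notifier "for exactly only the fault" amounts to a range-overlap filter: only invalidations that actually intersect the faulting range force a retry, so CPU activity on adjacent pages cannot starve the fault. A hypothetical helper sketching that rule, using half-open [start, end) ranges as mmu_notifier ranges do; the name is invented for illustration:]

```c
#include <stdbool.h>

/*
 * An invalidation of [inv_start, inv_end) forces a retry of a fault on
 * [fault_start, fault_end) only if the two ranges overlap. Spurious
 * invalidations of neighbouring pages are filtered out, which is what
 * guarantees forward progress for the fault.
 */
bool invalidation_requires_retry(unsigned long fault_start,
				 unsigned long fault_end,
				 unsigned long inv_start,
				 unsigned long inv_end)
{
	return inv_start < fault_end && fault_start < inv_end;
}
```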
end of thread, other threads:[~2025-01-30  8:55 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-23 10:20 [Question] Are "device exclusive non-swap entries" / "SVM atomics in Nouveau" still getting used in practice? David Hildenbrand
2025-01-23 15:08 ` Simona Vetter
2025-01-24 10:44   ` David Hildenbrand
2025-01-24 14:11     ` Jason Gunthorpe
2025-01-24 14:39       ` David Hildenbrand
2025-01-24 15:28         ` Simona Vetter
2025-01-24 17:54           ` David Hildenbrand
2025-01-28  0:09             ` Alistair Popple
2025-01-28 20:14               ` Simona Vetter
2025-01-28 20:24                 ` David Hildenbrand
2025-01-29 10:48                   ` Simona Vetter
2025-01-29 11:28                     ` Simona Vetter
2025-01-29 11:31                       ` David Hildenbrand
2025-01-29 14:05                         ` Simona Vetter
2025-01-29 16:13                           ` David Hildenbrand
2025-01-30  8:55                             ` Simona Vetter