* Re: [LSF/MM/BPF TOPIC] The future of ZONE_DEVICE pages
From: Balbir Singh @ 2025-01-31 3:29 UTC
To: Alistair Popple, linux-mm; +Cc: lsf-pc, david, willy, ziy, jhubbard, jgg
On 1/31/25 13:59, Alistair Popple wrote:
> I have a few topics that I would like to discuss around ZONE_DEVICE pages
> and their current and future usage in the kernel. Generally these pages are
> used to represent various forms of device memory (PCIe BAR space, coherent
> accelerator memory, persistent memory, unaddressable device memory). All
> of these require special treatment by the core MM so many features must be
> implemented specifically for ZONE_DEVICE pages.
>
> I would like to get feedback on several ideas I've had for a while:
>
> Large page migration for ZONE_DEVICE pages
> ==========================================
>
> Currently large ZONE_DEVICE pages only exist for persistent memory use cases
> (DAX, FS DAX). This involves a special reference counting scheme which I hope to
> have fixed[1] by the time of the LSF/MM/BPF. Fixing this allows for other higher
> order ZONE_DEVICE folios.
>
> Specifically I would like to introduce the possibility of migrating large CPU
> folios to unaddressable (DEVICE_PRIVATE) or coherent (DEVICE_COHERENT) memory.
> The current interfaces (migrate_vma) don't allow that as they require all folios
> to be split.
>
> Some of the issues are:
>
> 1. What should the interface look like?
>
> These are non-lru pages, so likely there is overlap with "non-lru page migration
> in a memdesc world"[2]
>
> 2. How do we allow merging/splitting of pages during migration?
>
> This is necessary because when migrating back from device memory there may not
> be enough large CPU pages available.
>
> 3. Any other issues?
I'd definitely be interested in the above topic. In general, I see a lot of
overlap between the folio and struct page code. I think folio is a good
abstraction, and I am hoping that some day we will just have folios to abstract
the size of the page.
>
> [1] - https://lore.kernel.org/linux-mm/cover.11189864684e31260d1408779fac9db80122047b.1736488799.git-series.apopple@nvidia.com/
> [2] - https://lore.kernel.org/linux-mm/2612ac8a-d0a9-452b-a53d-75ffc6166224@redhat.com/
>
> File-backed DEVICE_PRIVATE/COHERENT pages
> =========================================
>
> Currently DEVICE_PRIVATE and DEVICE_COHERENT pages are only supported for
> private anonymous memory. This prevents devices from having local access to
> shared or file-backed mappings instead relying on remote DMA access which limits
> performance.
>
> I have been prototyping allowing ZONE_DEVICE pages in the page cache with
> a callback when the CPU requires access. This approach seems promising and
> relatively straight-forward but I would like some early feedback on either this
> or alternate approaches that I should investigate.
>
I assume this is for mapped page cache pages?
> Combining P2PDMA and DEVICE_PRIVATE pages
> =========================================
>
> Currently device memory that cannot be directly accessed via the CPU can be
> represented by DEVICE_PRIVATE pages allowing it to be mapped and treated like
> a normal virtual page by userspace. Many devices also support accessing device
> memory directly from the CPU via a PCIe BAR.
>
> This access requires a P2PDMA page, meaning there are potentially two pages
> tracking the same piece of physical memory. This not only seems wasteful but
> fraught - for example device drivers need to keep page lifetimes in sync. I
> would like to discuss ways of solving this.
+1 for the topics
Balbir Singh
* Re: [LSF/MM/BPF TOPIC] The future of ZONE_DEVICE pages
From: Zi Yan @ 2025-01-31 3:58 UTC
To: Alistair Popple
Cc: linux-mm, lsf-pc, david, willy, jhubbard, jgg, balbirs, christian.koenig
On 30 Jan 2025, at 21:59, Alistair Popple wrote:
> I have a few topics that I would like to discuss around ZONE_DEVICE pages
> and their current and future usage in the kernel. Generally these pages are
> used to represent various forms of device memory (PCIe BAR space, coherent
> accelerator memory, persistent memory, unaddressable device memory). All
> of these require special treatment by the core MM so many features must be
> implemented specifically for ZONE_DEVICE pages.
>
> I would like to get feedback on several ideas I've had for a while:
>
> Large page migration for ZONE_DEVICE pages
> ==========================================
>
> Currently large ZONE_DEVICE pages only exist for persistent memory use cases
> (DAX, FS DAX). This involves a special reference counting scheme which I hope to
> have fixed[1] by the time of the LSF/MM/BPF. Fixing this allows for other higher
> order ZONE_DEVICE folios.
>
> Specifically I would like to introduce the possibility of migrating large CPU
> folios to unaddressable (DEVICE_PRIVATE) or coherent (DEVICE_COHERENT) memory.
> The current interfaces (migrate_vma) don't allow that as they require all folios
> to be split.
>
> Some of the issues are:
>
> 1. What should the interface look like?
>
> These are non-lru pages, so likely there is overlap with "non-lru page migration
> in a memdesc world"[2]
It seems to me that unaddressable (DEVICE_PRIVATE) and coherent (DEVICE_COHERENT)
pages should be treated differently, since the CPU cannot access the former but
can access the latter. Am I getting that right?
>
> 2. How do we allow merging/splitting of pages during migration?
>
> This is necessary because when migrating back from device memory there may not
> be enough large CPU pages available.
It is similar to THP swap-out and swap-in: we swap out a whole THP but swap in
individual base pages. However, there is a discussion on large folio swap-in[1]
that might change this.
[1] https://lore.kernel.org/linux-mm/58716200-fd10-4487-aed3-607a10e9fdd0@gmail.com/
>
> 3. Any other issues?
Once a large folio has been migrated to the device and the CPU wants to access
the data, we might not want to migrate the entire large folio back, even if there
is enough CPU memory, since perhaps only a single base page is shared between the
CPU and the device. Bouncing a large folio for data shared within a base page
would be wasteful.
I am thinking about doing something like PCIe atomics from the device. Does that
make sense?
>
> [1] - https://lore.kernel.org/linux-mm/cover.11189864684e31260d1408779fac9db80122047b.1736488799.git-series.apopple@nvidia.com/
> [2] - https://lore.kernel.org/linux-mm/2612ac8a-d0a9-452b-a53d-75ffc6166224@redhat.com/
>
> File-backed DEVICE_PRIVATE/COHERENT pages
> =========================================
>
> Currently DEVICE_PRIVATE and DEVICE_COHERENT pages are only supported for
> private anonymous memory. This prevents devices from having local access to
> shared or file-backed mappings instead relying on remote DMA access which limits
> performance.
>
> I have been prototyping allowing ZONE_DEVICE pages in the page cache with
> a callback when the CPU requires access. This approach seems promising and
> relatively straight-forward but I would like some early feedback on either this
> or alternate approaches that I should investigate.
>
> Combining P2PDMA and DEVICE_PRIVATE pages
> =========================================
>
> Currently device memory that cannot be directly accessed via the CPU can be
> represented by DEVICE_PRIVATE pages allowing it to be mapped and treated like
> a normal virtual page by userspace. Many devices also support accessing device
> memory directly from the CPU via a PCIe BAR.
>
> This access requires a P2PDMA page, meaning there are potentially two pages
> tracking the same piece of physical memory. This not only seems wasteful but
> fraught - for example device drivers need to keep page lifetimes in sync. I
> would like to discuss ways of solving this.
>
> DEVICE_PRIVATE pages, the linear map and the memdesc world
> ==========================================================
>
> DEVICE_PRIVATE pages currently reside in the linear map such that pfn_to_page()
> and page_to_pfn() work "as expected". However this implies a contiguous range
> of unused physical addresses needs to be both available and allocated for device
> memory. This isn't always available, particularly on ARM[1] where the vmemmap
> region may not be large enough to accommodate the amount of device memory.
>
> However it occurs to me that (almost?) all code paths that deal with
> DEVICE_PRIVATE pages are already aware of this - in the case of page_to_pfn()
> the page can be directly queried with is_device_private_page() and in the case
> of pfn_to_page() the pfn has (almost?) always been obtained from a special swap
> entry indicating such.
>
> So does page_to_pfn()/pfn_to_page() really need to work for DEVICE_PRIVATE
> pages? If not could we allocate the struct pages in a vmalloc array instead? Do
> we even need ZONE_DEVICE pages/folios in a memdesc world?
The same occurred to me while reading your migration proposal above:
struct page is not used for DEVICE_PRIVATE, so maybe it is OK to get rid of it.
What about DEVICE_COHERENT? Is its struct page currently used? I see the AMD kfd
driver is using DEVICE_COHERENT (Christian König cc'd).
--
Best Regards,
Yan, Zi
* Re: [LSF/MM/BPF TOPIC] The future of ZONE_DEVICE pages
From: Alistair Popple @ 2025-01-31 5:50 UTC
To: Zi Yan
Cc: linux-mm, lsf-pc, david, willy, jhubbard, jgg, balbirs, christian.koenig
On Thu, Jan 30, 2025 at 10:58:22PM -0500, Zi Yan wrote:
> On 30 Jan 2025, at 21:59, Alistair Popple wrote:
>
> > I have a few topics that I would like to discuss around ZONE_DEVICE pages
> > and their current and future usage in the kernel. Generally these pages are
> > used to represent various forms of device memory (PCIe BAR space, coherent
> > accelerator memory, persistent memory, unaddressable device memory). All
> > of these require special treatment by the core MM so many features must be
> > implemented specifically for ZONE_DEVICE pages.
> >
> > I would like to get feedback on several ideas I've had for a while:
> >
> > Large page migration for ZONE_DEVICE pages
> > ==========================================
> >
> > Currently large ZONE_DEVICE pages only exist for persistent memory use cases
> > (DAX, FS DAX). This involves a special reference counting scheme which I hope to
> > have fixed[1] by the time of the LSF/MM/BPF. Fixing this allows for other higher
> > order ZONE_DEVICE folios.
> >
> > Specifically I would like to introduce the possibility of migrating large CPU
> > folios to unaddressable (DEVICE_PRIVATE) or coherent (DEVICE_COHERENT) memory.
> > The current interfaces (migrate_vma) don't allow that as they require all folios
> > to be split.
> >
> > Some of the issues are:
> >
> > 1. What should the interface look like?
> >
> > These are non-lru pages, so likely there is overlap with "non-lru page migration
> > in a memdesc world"[2]
>
> It seems to me that unaddressable (DEVICE_PRIVATE) and coherent (DEVICE_COHERENT)
> pages should be treated differently, since the CPU cannot access the former but
> can access the latter. Am I getting that right?
In some ways they are similar (they are non-LRU pages, and the core MM doesn't
in general touch them for e.g. reclaim) but, as you say, they are also different
in that the latter can be accessed directly from the CPU.
The key thing they have in common though is that they only get mapped into
userspace via a device driver explicitly migrating them there, which is why I
have included them both here.
> >
> > 2. How do we allow merging/splitting of pages during migration?
> >
> > This is necessary because when migrating back from device memory there may not
> > be enough large CPU pages available.
>
> It is similar to THP swap-out and swap-in: we swap out a whole THP but swap in
> individual base pages. However, there is a discussion on large folio swap-in[1]
> that might change this.
>
> [1] https://lore.kernel.org/linux-mm/58716200-fd10-4487-aed3-607a10e9fdd0@gmail.com/
>
> >
> > 3. Any other issues?
>
> Once a large folio has been migrated to the device and the CPU wants to access
> the data, we might not want to migrate the entire large folio back, even if
> there is enough CPU memory, since perhaps only a single base page is shared
> between the CPU and the device. Bouncing a large folio for data shared within
> a base page would be wasteful.
Indeed. This bouncing normally happens via a migrate_to_ram() callback, so I was
thinking this would be one instance where a driver might want to split a page
when migrating back with e.g. migrate_vma_*().
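For concreteness, a driver-side sketch of that flow using the existing
migrate_vma_*() helpers, loosely following lib/test_hmm.c (a rough sketch only:
my_pgmap_owner and my_copy_from_device() are placeholders, and the
split-before-migrating-back step is exactly the missing interface):

    static vm_fault_t my_migrate_to_ram(struct vm_fault *vmf)
    {
            unsigned long src_pfn = 0, dst_pfn = 0;
            struct migrate_vma args = {
                    .vma         = vmf->vma,
                    .start       = vmf->address,
                    .end         = vmf->address + PAGE_SIZE,
                    .src         = &src_pfn,
                    .dst         = &dst_pfn,
                    .pgmap_owner = my_pgmap_owner,
                    .flags       = MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
                    .fault_page  = vmf->page,
            };
            struct page *dpage;

            /*
             * A large device folio would need to be split around here
             * before being migrated back piecemeal - that is the
             * interface gap under discussion.
             */
            if (migrate_vma_setup(&args))
                    return VM_FAULT_SIGBUS;

            if (src_pfn & MIGRATE_PFN_MIGRATE) {
                    /* Allocate a CPU page and copy the device data back. */
                    dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma,
                                           vmf->address);
                    if (dpage) {
                            lock_page(dpage); /* dst pages must be locked */
                            my_copy_from_device(dpage,
                                                migrate_pfn_to_page(src_pfn));
                            dst_pfn = migrate_pfn(page_to_pfn(dpage));
                    }
            }

            migrate_vma_pages(&args);
            migrate_vma_finalize(&args);
            return 0;
    }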
> I am thinking about doing something like PCIe atomics from the device. Does
> that make sense?
I'm not sure I follow where exactly PCIe atomics fit in here? If a page has been
migrated to a GPU we wouldn't need PCIe atomics. Or are you saying that avoiding
PCIe atomics might be another reason a page might need to be split? (i.e. the CPU
is doing atomic access to one subpage and the GPU to another)
> >
> > [1] - https://lore.kernel.org/linux-mm/cover.11189864684e31260d1408779fac9db80122047b.1736488799.git-series.apopple@nvidia.com/
> > [2] - https://lore.kernel.org/linux-mm/2612ac8a-d0a9-452b-a53d-75ffc6166224@redhat.com/
> >
> > File-backed DEVICE_PRIVATE/COHERENT pages
> > =========================================
> >
> > Currently DEVICE_PRIVATE and DEVICE_COHERENT pages are only supported for
> > private anonymous memory. This prevents devices from having local access to
> > shared or file-backed mappings instead relying on remote DMA access which limits
> > performance.
> >
> > I have been prototyping allowing ZONE_DEVICE pages in the page cache with
> > a callback when the CPU requires access. This approach seems promising and
> > relatively straight-forward but I would like some early feedback on either this
> > or alternate approaches that I should investigate.
> >
> > Combining P2PDMA and DEVICE_PRIVATE pages
> > =========================================
> >
> > Currently device memory that cannot be directly accessed via the CPU can be
> > represented by DEVICE_PRIVATE pages allowing it to be mapped and treated like
> > a normal virtual page by userspace. Many devices also support accessing device
> > memory directly from the CPU via a PCIe BAR.
> >
> > This access requires a P2PDMA page, meaning there are potentially two pages
> > tracking the same piece of physical memory. This not only seems wasteful but
> > fraught - for example device drivers need to keep page lifetimes in sync. I
> > would like to discuss ways of solving this.
> >
> > DEVICE_PRIVATE pages, the linear map and the memdesc world
> > ==========================================================
> >
> > DEVICE_PRIVATE pages currently reside in the linear map such that pfn_to_page()
> > and page_to_pfn() work "as expected". However this implies a contiguous range
> > of unused physical addresses needs to be both available and allocated for device
> > memory. This isn't always available, particularly on ARM[1] where the vmemmap
> > region may not be large enough to accommodate the amount of device memory.
> >
> > However it occurs to me that (almost?) all code paths that deal with
> > DEVICE_PRIVATE pages are already aware of this - in the case of page_to_pfn()
> > the page can be directly queried with is_device_private_page() and in the case
> > of pfn_to_page() the pfn has (almost?) always been obtained from a special swap
> > entry indicating such.
> >
> > So does page_to_pfn()/pfn_to_page() really need to work for DEVICE_PRIVATE
> > pages? If not could we allocate the struct pages in a vmalloc array instead? Do
> > we even need ZONE_DEVICE pages/folios in a memdesc world?
>
> The same occurred to me while reading your migration proposal above:
> struct page is not used for DEVICE_PRIVATE, so maybe it is OK to get rid of it.
> What about DEVICE_COHERENT? Is its struct page currently used? I see the AMD
> kfd driver is using DEVICE_COHERENT (Christian König cc'd).
I'm not sure removing struct page for DEVICE_COHERENT would be so
straightforward. Unlike DEVICE_PRIVATE pages, these are mapped by normal present
PTEs, so we can't rely on having a special PTE to figure out which variant of
pfn_to_{page|memdesc|thing}() to call.
On the other hand, DEVICE_COHERENT pages are real memory in the physical address
space, and so should probably be covered by the linear map anyway and have their
own reserved region of physical address space. This is unlike DEVICE_PRIVATE
entries, which effectively need to steal some physical address space.
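(For reference, the special-PTE path is roughly what do_swap_page() already
does for DEVICE_PRIVATE today - the page comes out of the swap entry rather
than a generic pfn_to_page() walk; simplified from mm/memory.c:)

    entry = pte_to_swp_entry(vmf->orig_pte);
    if (is_device_private_entry(entry)) {
            /*
             * The non-present PTE itself identifies the page as
             * DEVICE_PRIVATE; no generic pfn_to_page() lookup is needed.
             */
            vmf->page = pfn_swap_entry_to_page(entry);
            ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
    }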
> --
> Best Regards,
> Yan, Zi
* Re: [LSF/MM/BPF TOPIC] The future of ZONE_DEVICE pages
From: Zi Yan @ 2025-01-31 15:34 UTC
To: Alistair Popple
Cc: linux-mm, lsf-pc, david, willy, jhubbard, jgg, balbirs, christian.koenig
On 31 Jan 2025, at 0:50, Alistair Popple wrote:
> On Thu, Jan 30, 2025 at 10:58:22PM -0500, Zi Yan wrote:
>> On 30 Jan 2025, at 21:59, Alistair Popple wrote:
>>
>>> I have a few topics that I would like to discuss around ZONE_DEVICE pages
>>> and their current and future usage in the kernel. Generally these pages are
>>> used to represent various forms of device memory (PCIe BAR space, coherent
>>> accelerator memory, persistent memory, unaddressable device memory). All
>>> of these require special treatment by the core MM so many features must be
>>> implemented specifically for ZONE_DEVICE pages.
>>>
>>> I would like to get feedback on several ideas I've had for a while:
>>>
>>> Large page migration for ZONE_DEVICE pages
>>> ==========================================
>>>
>>> Currently large ZONE_DEVICE pages only exist for persistent memory use cases
>>> (DAX, FS DAX). This involves a special reference counting scheme which I hope to
>>> have fixed[1] by the time of the LSF/MM/BPF. Fixing this allows for other higher
>>> order ZONE_DEVICE folios.
>>>
>>> Specifically I would like to introduce the possibility of migrating large CPU
>>> folios to unaddressable (DEVICE_PRIVATE) or coherent (DEVICE_COHERENT) memory.
>>> The current interfaces (migrate_vma) don't allow that as they require all folios
>>> to be split.
>>>
>>> Some of the issues are:
>>>
>>> 1. What should the interface look like?
>>>
>>> These are non-lru pages, so likely there is overlap with "non-lru page migration
>>> in a memdesc world"[2]
>>
>> It seems to me that unaddressable (DEVICE_PRIVATE) and coherent (DEVICE_COHERENT)
>> pages should be treated differently, since the CPU cannot access the former but
>> can access the latter. Am I getting that right?
>
> In some ways they are similar (they are non-LRU pages, and the core MM doesn't
> in general touch them for e.g. reclaim) but, as you say, they are also different
> in that the latter can be accessed directly from the CPU.
>
> The key thing they have in common though is that they only get mapped into
> userspace via a device driver explicitly migrating them there, which is why I
> have included them both here.
>
>>>
>>> 2. How do we allow merging/splitting of pages during migration?
>>>
>>> This is necessary because when migrating back from device memory there may not
>>> be enough large CPU pages available.
>>
>> It is similar to THP swap-out and swap-in: we swap out a whole THP but swap in
>> individual base pages. However, there is a discussion on large folio swap-in[1]
>> that might change this.
>>
>> [1] https://lore.kernel.org/linux-mm/58716200-fd10-4487-aed3-607a10e9fdd0@gmail.com/
>>
>>>
>>> 3. Any other issues?
>>
>> Once a large folio has been migrated to the device and the CPU wants to access
>> the data, we might not want to migrate the entire large folio back, even if
>> there is enough CPU memory, since perhaps only a single base page is shared
>> between the CPU and the device. Bouncing a large folio for data shared within
>> a base page would be wasteful.
>
> Indeed. This bouncing normally happens via a migrate_to_ram() callback, so I was
> thinking this would be one instance where a driver might want to split a page
> when migrating back with e.g. migrate_vma_*().
>
>> I am thinking about doing something like PCIe atomics from the device. Does
>> that make sense?
>
> I'm not sure I follow where exactly PCIe atomics fit in here? If a page has been
> migrated to a GPU we wouldn't need PCIe atomics. Or are you saying that avoiding
> PCIe atomics might be another reason a page might need to be split? (i.e. the CPU
> is doing atomic access to one subpage and the GPU to another)
Oh, I got PCIe atomics wrong - I thought migration was needed even for PCIe
atomics. Please disregard my comment about them.
>
>>>
>>> [1] - https://lore.kernel.org/linux-mm/cover.11189864684e31260d1408779fac9db80122047b.1736488799.git-series.apopple@nvidia.com/
>>> [2] - https://lore.kernel.org/linux-mm/2612ac8a-d0a9-452b-a53d-75ffc6166224@redhat.com/
>>>
>>> File-backed DEVICE_PRIVATE/COHERENT pages
>>> =========================================
>>>
>>> Currently DEVICE_PRIVATE and DEVICE_COHERENT pages are only supported for
>>> private anonymous memory. This prevents devices from having local access to
>>> shared or file-backed mappings instead relying on remote DMA access which limits
>>> performance.
>>>
>>> I have been prototyping allowing ZONE_DEVICE pages in the page cache with
>>> a callback when the CPU requires access. This approach seems promising and
>>> relatively straight-forward but I would like some early feedback on either this
>>> or alternate approaches that I should investigate.
>>>
>>> Combining P2PDMA and DEVICE_PRIVATE pages
>>> =========================================
>>>
>>> Currently device memory that cannot be directly accessed via the CPU can be
>>> represented by DEVICE_PRIVATE pages allowing it to be mapped and treated like
>>> a normal virtual page by userspace. Many devices also support accessing device
>>> memory directly from the CPU via a PCIe BAR.
>>>
>>> This access requires a P2PDMA page, meaning there are potentially two pages
>>> tracking the same piece of physical memory. This not only seems wasteful but
>>> fraught - for example device drivers need to keep page lifetimes in sync. I
>>> would like to discuss ways of solving this.
>>>
>>> DEVICE_PRIVATE pages, the linear map and the memdesc world
>>> ==========================================================
>>>
>>> DEVICE_PRIVATE pages currently reside in the linear map such that pfn_to_page()
>>> and page_to_pfn() work "as expected". However this implies a contiguous range
>>> of unused physical addresses needs to be both available and allocated for device
>>> memory. This isn't always available, particularly on ARM[1] where the vmemmap
>>> region may not be large enough to accommodate the amount of device memory.
>>>
>>> However it occurs to me that (almost?) all code paths that deal with
>>> DEVICE_PRIVATE pages are already aware of this - in the case of page_to_pfn()
>>> the page can be directly queried with is_device_private_page() and in the case
>>> of pfn_to_page() the pfn has (almost?) always been obtained from a special swap
>>> entry indicating such.
>>>
>>> So does page_to_pfn()/pfn_to_page() really need to work for DEVICE_PRIVATE
>>> pages? If not could we allocate the struct pages in a vmalloc array instead? Do
>>> we even need ZONE_DEVICE pages/folios in a memdesc world?
>>
>> The same occurred to me while reading your migration proposal above:
>> struct page is not used for DEVICE_PRIVATE, so maybe it is OK to get rid of it.
>> What about DEVICE_COHERENT? Is its struct page currently used? I see the AMD
>> kfd driver is using DEVICE_COHERENT (Christian König cc'd).
>
> I'm not sure removing struct page for DEVICE_COHERENT would be so
> straightforward. Unlike DEVICE_PRIVATE pages, these are mapped by normal present
> PTEs, so we can't rely on having a special PTE to figure out which variant of
> pfn_to_{page|memdesc|thing}() to call.
>
> On the other hand, DEVICE_COHERENT pages are real memory in the physical address
> space, and so should probably be covered by the linear map anyway and have their
> own reserved region of physical address space. This is unlike DEVICE_PRIVATE
> entries, which effectively need to steal some physical address space.
Got it. As you said above, DEVICE_PRIVATE and DEVICE_COHERENT are both non-LRU
pages, but only DEVICE_COHERENT can be accessed by the CPU. We probably want to
categorize them differently, based on DavidH's email[1]:
DEVICE_PRIVATE: non-folio migration
DEVICE_COHERENT: non-LRU folio migration
[1] https://lore.kernel.org/linux-mm/bb0f813e-7c1b-4257-baa5-5afe18be8552@redhat.com/
Best Regards,
Yan, Zi
* Re: [LSF/MM/BPF TOPIC] The future of ZONE_DEVICE pages
From: David Hildenbrand @ 2025-01-31 8:47 UTC
To: Alistair Popple, linux-mm; +Cc: lsf-pc, willy, ziy, jhubbard, jgg, balbirs
On 31.01.25 03:59, Alistair Popple wrote:
> I have a few topics that I would like to discuss around ZONE_DEVICE pages
> and their current and future usage in the kernel. Generally these pages are
> used to represent various forms of device memory (PCIe BAR space, coherent
> accelerator memory, persistent memory, unaddressable device memory). All
> of these require special treatment by the core MM so many features must be
> implemented specifically for ZONE_DEVICE pages.
>
> I would like to get feedback on several ideas I've had for a while:
>
> Large page migration for ZONE_DEVICE pages
> ==========================================
>
> Currently large ZONE_DEVICE pages only exist for persistent memory use cases
> (DAX, FS DAX). This involves a special reference counting scheme which I hope to
> have fixed[1] by the time of the LSF/MM/BPF. Fixing this allows for other higher
> order ZONE_DEVICE folios.
>
> Specifically I would like to introduce the possibility of migrating large CPU
> folios to unaddressable (DEVICE_PRIVATE) or coherent (DEVICE_COHERENT) memory.
> The current interfaces (migrate_vma) don't allow that as they require all folios
> to be split.
>
Hi,
> Some of the issues are:
>
> 1. What should the interface look like?
>
> These are non-lru pages, so likely there is overlap with "non-lru page migration
> in a memdesc world"[2]
Yes, although these (what we called "non-lru migration" before
ZONE_DEVICE popped up) are currently all order-0. Likely this will
change at some point, but I'm not sure there is currently a real demand
for it.
Agreed that there is quite some overlap. E.g., no page->lru field, and
the problem of splitting large allocations etc.
For example, balloon-inflated pages are currently all order-0. If we'd
want to support something larger but still allow for reliable balloon
compaction under memory fragmentation, we'd want an option to
split-before-migration (similar to what you describe below).
Alternatively, we can just split right at the start: if the balloon
allocated a 2MiB compound page, it can just split it into 512 order-0
pages and allow for migration of the individual pieces. Both approaches
have their pros and cons.
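A sketch of the split-at-the-start variant (hypothetical: it assumes the 2MiB
chunk is allocated non-compound so that split_page() is legal, and that
b_dev_info is the driver's balloon_dev_info - virtio-balloon does nothing
like this today):

    /* Order 9 = 2MiB with 4KiB base pages. */
    struct page *page = alloc_pages(GFP_KERNEL, 9);
    unsigned int i;

    if (page) {
            split_page(page, 9);    /* now 512 independent order-0 pages */
            for (i = 0; i < (1 << 9); i++)
                    balloon_page_enqueue(&b_dev_info, page + i);
    }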
Anyway: "non-lru migration" is not quite expressive. It's likely going
to be:
(1) LRU folio migration
(2) non-LRU folio migration (->ZONE_DEVICE)
(3) non-folio migration (balloon, zsmalloc, ...)
(1) and (2) have things in common (e.g., rmap, folio handling) and (2)
and (3) have things in common (e.g., no ->lru field).
Would there be something ZONE_DEVICE-based that we want to migrate and
that will not be a folio (IOW, not mapped into user page tables etc.)?
>
> 2. How do we allow merging/splitting of pages during migration?
>
> This is necessary because when migrating back from device memory there may not
> be enough large CPU pages available.
>
> 3. Any other issues?
>
> [1] - https://lore.kernel.org/linux-mm/cover.11189864684e31260d1408779fac9db80122047b.1736488799.git-series.apopple@nvidia.com/
> [2] - https://lore.kernel.org/linux-mm/2612ac8a-d0a9-452b-a53d-75ffc6166224@redhat.com/
>
> File-backed DEVICE_PRIVATE/COHERENT pages
> =========================================
>
> Currently DEVICE_PRIVATE and DEVICE_COHERENT pages are only supported for
> private anonymous memory. This prevents devices from having local access to
> shared or file-backed mappings instead relying on remote DMA access which limits
> performance.
>
> I have been prototyping allowing ZONE_DEVICE pages in the page cache with
> a callback when the CPU requires access.
Hmm, things like read/write/writeback get more tricky. How would you
write back content from a ZONE_DEVICE folio? Likely that's not possible.
So I'm not sure if we want to go down that path; it will be great to
learn about your approach and your findings.
[...]
There is a lot of interesting stuff in there; I assume too much for a
single session :)
--
Cheers,
David / dhildenb
* Re: [LSF/MM/BPF TOPIC] The future of ZONE_DEVICE pages
From: Alistair Popple @ 2025-02-05 10:12 UTC
To: David Hildenbrand; +Cc: linux-mm, lsf-pc, willy, ziy, jhubbard, jgg, balbirs
On Fri, Jan 31, 2025 at 09:47:39AM +0100, David Hildenbrand wrote:
> On 31.01.25 03:59, Alistair Popple wrote:
> > I have a few topics that I would like to discuss around ZONE_DEVICE pages
> > and their current and future usage in the kernel. Generally these pages are
> > used to represent various forms of device memory (PCIe BAR space, coherent
> > accelerator memory, persistent memory, unaddressable device memory). All
> > of these require special treatment by the core MM so many features must be
> > implemented specifically for ZONE_DEVICE pages.
> >
> > I would like to get feedback on several ideas I've had for a while:
> >
> > Large page migration for ZONE_DEVICE pages
> > ==========================================
> >
> > Currently large ZONE_DEVICE pages only exist for persistent memory use cases
> > (DAX, FS DAX). This involves a special reference counting scheme which I hope to
> > have fixed[1] by the time of the LSF/MM/BPF. Fixing this allows for other higher
> > order ZONE_DEVICE folios.
> >
> > Specifically I would like to introduce the possibility of migrating large CPU
> > folios to unaddressable (DEVICE_PRIVATE) or coherent (DEVICE_COHERENT) memory.
> > The current interfaces (migrate_vma) don't allow that as they require all folios
> > to be split.
> >
>
> Hi,
>
> > Some of the issues are:
> >
> > 1. What should the interface look like?
> >
> > These are non-lru pages, so likely there is overlap with "non-lru page migration
> > in a memdesc world"[2]
>
> Yes, although these (what we called "non-lru migration" before ZONE_DEVICE
> popped up) are currently all order-0. Likely this will change at some point,
> but I'm not sure there is currently a real demand for it.
>
> Agreed that there is quite some overlap. E.g., no page->lru field, and the
> problem of splitting large allocations etc.
>
> For example, balloon-inflated pages are currently all order-0. If we'd want
> to support something larger but still allow for reliable balloon compaction
> under memory fragmentation, we'd want an option to split-before-migration
> (similar to what you describe below).
>
> Alternatively, we can just split right at the start: if the balloon
> allocated a 2MiB compound page, it can just split it into 512 order-0 pages
> and allow for migration of the individual pieces. Both approaches have their
> pros and cons.
>
> Anyway: "non-lru migration" is not quite expressive. It's likely going to
> be:
>
> (1) LRU folio migration
> (2) non-LRU folio migration (->ZONE_DEVICE)
> (3) non-folio migration (balloon, zsmalloc, ...)
>
> (1) and (2) have things in common (e.g., rmap, folio handling) and (2) and
> (3) have things in common (e.g., no ->lru field).
>
> Would there be something ZONE_DEVICE-based that we want to migrate and that
> will not be a folio (IOW, not mapped into user page tables etc.)?
I'm not aware of any such use cases. Your case (2) above is what I was thinking
about.
> >
> > 2. How do we allow merging/splitting of pages during migration?
> >
> > This is necessary because when migrating back from device memory there may not
> > be enough large CPU pages available.
> >
> > 3. Any other issues?
> >
> > [1] - https://lore.kernel.org/linux-mm/cover.11189864684e31260d1408779fac9db80122047b.1736488799.git-series.apopple@nvidia.com/
> > [2] - https://lore.kernel.org/linux-mm/2612ac8a-d0a9-452b-a53d-75ffc6166224@redhat.com/
> >
> > File-backed DEVICE_PRIVATE/COHERENT pages
> > =========================================
> >
> > Currently DEVICE_PRIVATE and DEVICE_COHERENT pages are only supported for
> > private anonymous memory. This prevents devices from having local access to
> > shared or file-backed mappings instead relying on remote DMA access which limits
> > performance.
> >
> > I have been prototyping allowing ZONE_DEVICE pages in the page cache with
> > a callback when the CPU requires access.
>
> Hmm, things like read/write/writeback get more tricky. How would you
> write back content from a ZONE_DEVICE folio? Likely that's not possible.
The general gist is somewhat analogous to what happens when the CPU faults on
a DEVICE_PRIVATE page. Except that obviously it wouldn't be a fault; rather,
whenever something looked up the page-cache entry and found a DEVICE_PRIVATE
page we would have a driver callback somewhat similar to migrate_to_ram() that
would copy the data back to normal system memory. IOW the CPU would always own
the page and could always get it back.
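To sketch the idea (entirely hypothetical - the hook name and signature below
are made up; nothing like this exists upstream yet):

    /*
     * Hypothetical dev_pagemap_ops addition: called when a page-cache
     * lookup finds a DEVICE_PRIVATE folio. The driver copies the data
     * from device memory into the supplied CPU folio, after which the
     * core MM replaces the page-cache entry with the CPU folio.
     */
    int (*pagecache_to_ram)(struct folio *device_folio,
                            struct folio *cpu_folio);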
It has been a while since I last looked at this problem though (FS DAX refcount
cleanups took way longer than expected!), but I recall having this at least
somewhat working. I will see if I can get it cleaned up and posted as an RFC
soon.
> So I'm not sure if we want to go down that path; it will be great to learn
> about your approach and your findings.
>
> [...]
>
>
> There is a lot of interesting stuff in there; I assume too much for a single
> session :)
And probably way more than I can get done in a year :-)
>
> --
> Cheers,
>
> David / dhildenb
>
* Re: [LSF/MM/BPF TOPIC] The future of ZONE_DEVICE pages
From: Jason Gunthorpe @ 2025-01-31 14:52 UTC
To: Alistair Popple; +Cc: linux-mm, lsf-pc, david, willy, ziy, jhubbard, balbirs
On Fri, Jan 31, 2025 at 01:59:09PM +1100, Alistair Popple wrote:
> Combining P2PDMA and DEVICE_PRIVATE pages
> =========================================
>
> Currently device memory that cannot be directly accessed via the CPU can be
> represented by DEVICE_PRIVATE pages allowing it to be mapped and treated like
> a normal virtual page by userspace. Many devices also support accessing device
> memory directly from the CPU via a PCIe BAR.
>
> This access requires a P2PDMA page, meaning there are potentially two pages
> tracking the same piece of physical memory. This not only seems wasteful but
> fraught - for example device drivers need to keep page lifetimes in sync. I
> would like to discuss ways of solving this.
My general plan for this has been to teach the DMA API how to do P2P
without struct page. Leon's topic is the first step on this journey.
https://lore.kernel.org/linux-mm/97f385db-42c9-4c04-8fba-9b1ba8ffc525@nvidia.com/
When we can DMA map P2P memory without struct page then we can talk
about what API changes would be needed to take advantage of that.
However, merging P2P and DEVICE_PRIVATE seems very tricky to me: you'd
have to make the type of page dependent on how it was read out of the
PTE - a swap entry is PRIVATE, a normal PTE is P2P.
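That is, roughly (a sketch of the distinction using the existing helpers):

    if (pte_present(pte)) {
            /* P2PDMA / DEVICE_COHERENT: an ordinary present PTE */
            page = vm_normal_page(vma, addr, pte);
    } else if (is_device_private_entry(pte_to_swp_entry(pte))) {
            /* DEVICE_PRIVATE: hidden behind a non-present entry */
            page = pfn_swap_entry_to_page(pte_to_swp_entry(pte));
    }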
Further, a struct page is necessary if any P2P pages are being placed
into VMAs.
Jason
* Re: [LSF/MM/BPF TOPIC] The future of ZONE_DEVICE pages
From: Leon Romanovsky @ 2025-02-02 8:22 UTC
To: Jason Gunthorpe
Cc: Alistair Popple, linux-mm, lsf-pc, david, willy, ziy, jhubbard, balbirs
On Fri, Jan 31, 2025 at 10:52:37AM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 31, 2025 at 01:59:09PM +1100, Alistair Popple wrote:
> > Combining P2PDMA and DEVICE_PRIVATE pages
> > =========================================
> >
> > Currently device memory that cannot be directly accessed via the CPU can be
> > represented by DEVICE_PRIVATE pages allowing it to be mapped and treated like
> > a normal virtual page by userspace. Many devices also support accessing device
> > memory directly from the CPU via a PCIe BAR.
> >
> > This access requires a P2PDMA page, meaning there are potentially two pages
> > tracking the same piece of physical memory. This not only seems wasteful but
> > fraught - for example device drivers need to keep page lifetimes in sync. I
> > would like to discuss ways of solving this.
>
> My general plan for this has been to teach the DMA API how to do P2P
> without struct page. Leon's topic is the first step on this journey.
> https://lore.kernel.org/linux-mm/97f385db-42c9-4c04-8fba-9b1ba8ffc525@nvidia.com/
The latest proposal for LSF/MM 2025 is here:
[LSF/MM/BPF TOPIC] DMA mapping API in complex scenarios
https://lore.kernel.org/linux-rdma/20250122071600.GC10702@unreal/T/#u
Thanks