From: Zi Yan <ziy@nvidia.com>
To: Alistair Popple <apopple@nvidia.com>
Cc: linux-mm@kvack.org, lsf-pc@lists.linux-foundation.org,
	david@redhat.com, willy@infradead.org, jhubbard@nvidia.com,
	jgg@nvidia.com, balbirs@nvidia.com, christian.koenig@amd.com
Subject: Re: [LSF/MM/BPF TOPIC] The future of ZONE_DEVICE pages
Date: Fri, 31 Jan 2025 10:34:32 -0500
Message-ID: <B008EF24-98EF-4EF0-8EBD-0282D475390C@nvidia.com>
In-Reply-To: <4ciym2rrwkttnlym77ebywwn4ppesycjwm2dcoffs74eslu6uf@gvvnthaxkdoc>

On 31 Jan 2025, at 0:50, Alistair Popple wrote:

> On Thu, Jan 30, 2025 at 10:58:22PM -0500, Zi Yan wrote:
>> On 30 Jan 2025, at 21:59, Alistair Popple wrote:
>>
>>> I have a few topics that I would like to discuss around ZONE_DEVICE pages
>>> and their current and future usage in the kernel. Generally these pages are
>>> used to represent various forms of device memory (PCIe BAR space, coherent
>>> accelerator memory, persistent memory, unaddressable device memory). All
>>> of these require special treatment by the core MM so many features must be
>>> implemented specifically for ZONE_DEVICE pages.
>>>
>>> I would like to get feedback on several ideas I've had for a while:
>>>
>>> Large page migration for ZONE_DEVICE pages
>>> ==========================================
>>>
>>> Currently large ZONE_DEVICE pages only exist for persistent memory use cases
>>> (DAX, FS DAX). This involves a special reference counting scheme which I hope to
>>> have fixed[1] by the time of LSF/MM/BPF. Fixing this will allow other higher-order
>>> ZONE_DEVICE folios.
>>>
>>> Specifically I would like to introduce the possibility of migrating large CPU
>>> folios to unaddressable (DEVICE_PRIVATE) or coherent (DEVICE_COHERENT) memory.
>>> The current interfaces (migrate_vma) don't allow that as they require all folios
>>> to be split.
>>>
>>> Some of the issues are:
>>>
>>> 1. What should the interface look like?
>>>
>>> These are non-lru pages, so likely there is overlap with "non-lru page migration
>>> in a memdesc world"[2]
>>
>> It seems to me that unaddressable (DEVICE_PRIVATE) and coherent (DEVICE_COHERENT)
>> should be treated differently, since the CPU cannot access the former but can access
>> the latter. Am I getting it right?
>
> In some ways they are similar (they are non-LRU pages, core-MM doesn't in
> general touch them for e.g. reclaim, etc.) but as you say they are also different
> in that the latter can be accessed directly from the CPU.
>
> The key thing they have in common, though, is that they only get mapped into userspace
> via a device driver explicitly migrating them there, which is why I have included
> them here.
>
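For anyone not deep in this interface: the current migrate_vma flow is strictly
per base page - src[]/dst[] carry one PFN entry for each PAGE_SIZE page in the
range - which is why a large CPU folio has to be split before it can be expressed
at all. A minimal sketch of the CPU-to-device direction (the my_* helpers,
my_driver_owner and the fixed 16-page range are hypothetical placeholders, not
real APIs; error handling is omitted):

#include <linux/migrate.h>
#include <linux/mm.h>

static int my_migrate_range_to_device(struct vm_area_struct *vma,
				      unsigned long start, unsigned long end)
{
	unsigned long npages = (end - start) >> PAGE_SHIFT;
	/* one entry per base page; assume <= 16 pages for brevity */
	unsigned long src[16] = { 0 }, dst[16] = { 0 };
	struct migrate_vma args = {
		.vma		= vma,
		.start		= start,
		.end		= end,
		.src		= src,
		.dst		= dst,
		.pgmap_owner	= my_driver_owner,	/* hypothetical */
		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
	};
	unsigned long i;

	if (migrate_vma_setup(&args))
		return -EFAULT;

	for (i = 0; i < npages; i++) {
		struct page *spage = migrate_pfn_to_page(src[i]);
		struct page *dpage;

		if (!(src[i] & MIGRATE_PFN_MIGRATE) || !spage)
			continue;
		dpage = my_alloc_device_page();		/* hypothetical */
		lock_page(dpage);
		my_copy_to_device(dpage, spage);	/* hypothetical */
		dst[i] = migrate_pfn(page_to_pfn(dpage));
	}

	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);
	return 0;
}

Whatever the new interface ends up looking like, it presumably needs some way to
express a whole folio in src[]/dst[] rather than one entry per base page.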
>>>
>>> 2. How do we allow merging/splitting of pages during migration?
>>>
>>> This is necessary because when migrating back from device memory there may not
>>> be enough large CPU pages available.
>>
>> It is similar to THP swap-out and swap-in: we swap out a whole THP
>> but swap in individual base pages. However, there is a discussion on large folio
>> swap-in[1] that might change this.
>>
>> [1] https://lore.kernel.org/linux-mm/58716200-fd10-4487-aed3-607a10e9fdd0@gmail.com/
>>
>>>
>>> 3. Any other issues?
>>
>> Once a large folio is migrated to the device, when the CPU wants to access the data,
>> even if there is enough CPU memory available, we might not want to migrate back the
>> entire large folio, since maybe only a single base page is shared between the CPU and
>> the device.
>> Bouncing a large folio for data shared within a base page would be wasteful.
>
> Indeed. This bouncing normally happens via a migrate_to_ram() callback, so I was
> thinking this would be one instance where a driver might want to split a page
> when migrating back with e.g. migrate_vma_*().
>
>> I was thinking about something like PCIe atomics from a device. Does that make sense?
>
> I'm not sure I follow where exactly PCIe atomics fit in here? If a page has been
> migrated to a GPU we wouldn't need PCIe atomics. Or are you saying avoiding PCIe
> atomics might be another reason a page might need to be split? (ie. CPU is doing
> atomic access to one subpage, GPU to another)

Oh, I got PCIe atomics wrong. I thought migration was needed even for PCIe
atomics, so please disregard my comment about them.
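On the splitting point above: a minimal sketch of what a migrate_to_ram() handler
looks like today when it only brings back the single faulting base page, using the
same migrate_vma_setup()/migrate_vma_pages()/migrate_vma_finalize() sequence as the
forward direction (the my_* names are hypothetical placeholders). With large device
folios, this is presumably the spot where a driver would choose between splitting
the folio and bouncing it whole:

static vm_fault_t my_devmem_migrate_to_ram(struct vm_fault *vmf)
{
	unsigned long src = 0, dst = 0;
	unsigned long addr = vmf->address & PAGE_MASK;
	struct migrate_vma args = {
		.vma		= vmf->vma,
		.start		= addr,
		.end		= addr + PAGE_SIZE,	/* only the faulting base page */
		.src		= &src,
		.dst		= &dst,
		.pgmap_owner	= my_driver_owner,	/* hypothetical */
		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
		.fault_page	= vmf->page,
	};

	if (migrate_vma_setup(&args))
		return VM_FAULT_SIGBUS;

	if (src & MIGRATE_PFN_MIGRATE) {
		struct page *dpage = alloc_page(GFP_HIGHUSER);

		if (dpage) {
			lock_page(dpage);
			/* hypothetical device-to-system copy */
			my_copy_from_device(dpage, migrate_pfn_to_page(src));
			dst = migrate_pfn(page_to_pfn(dpage));
		}
	}

	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);
	return 0;
}

static const struct dev_pagemap_ops my_devmem_pagemap_ops = {
	.page_free	= my_devmem_page_free,	/* hypothetical */
	.migrate_to_ram	= my_devmem_migrate_to_ram,
};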

>
>>>
>>> [1] - https://lore.kernel.org/linux-mm/cover.11189864684e31260d1408779fac9db80122047b.1736488799.git-series.apopple@nvidia.com/
>>> [2] - https://lore.kernel.org/linux-mm/2612ac8a-d0a9-452b-a53d-75ffc6166224@redhat.com/
>>>
>>> File-backed DEVICE_PRIVATE/COHERENT pages
>>> =========================================
>>>
>>> Currently DEVICE_PRIVATE and DEVICE_COHERENT pages are only supported for
>>> private anonymous memory. This prevents devices from having local access to
>>> shared or file-backed mappings, forcing them to rely instead on remote DMA access,
>>> which limits performance.
>>>
>>> I have been prototyping allowing ZONE_DEVICE pages in the page cache with
>>> a callback when the CPU requires access. This approach seems promising and
>>> relatively straightforward, but I would like some early feedback on either this
>>> or alternate approaches that I should investigate.
>>>
>>> Combining P2PDMA and DEVICE_PRIVATE pages
>>> =========================================
>>>
>>> Currently device memory that cannot be directly accessed via the CPU can be
>>> represented by DEVICE_PRIVATE pages allowing it to be mapped and treated like
>>> a normal virtual page by userspace. Many devices also support accessing device
>>> memory directly from the CPU via a PCIe BAR.
>>>
>>> This access requires a P2PDMA page, meaning there are potentially two pages
>>> tracking the same piece of physical memory. This seems not only wasteful but
>>> fraught - for example, device drivers need to keep the two page lifetimes in sync. I
>>> would like to discuss ways of solving this.
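On the duplication point: roughly speaking, a driver today ends up creating struct
pages for the same device memory twice - once as DEVICE_PRIVATE pages (carved out
of stolen physical address space) for migrate_vma, and once as P2PDMA pages backed
by the BAR for peer-to-peer DMA. A rough sketch of the two registrations (the BAR
number, MY_DEVMEM_SIZE and the my_* names are illustrative only):

static int my_register_device_memory(struct pci_dev *pdev,
				     struct dev_pagemap *pgmap)
{
	struct resource *res;

	/* 1) DEVICE_PRIVATE: steal unused physical address space and create
	 *    device-private pages that migrate_vma can map into userspace. */
	res = request_free_mem_region(&iomem_resource, MY_DEVMEM_SIZE,
				      "my-devmem");
	if (IS_ERR(res))
		return PTR_ERR(res);

	pgmap->type = MEMORY_DEVICE_PRIVATE;
	pgmap->range.start = res->start;
	pgmap->range.end = res->end;
	pgmap->nr_range = 1;
	pgmap->ops = &my_devmem_pagemap_ops;	/* hypothetical */
	pgmap->owner = my_driver_owner;		/* hypothetical */
	if (IS_ERR(devm_memremap_pages(&pdev->dev, pgmap)))
		return -ENOMEM;

	/* 2) P2PDMA: a second set of struct pages for the same device memory,
	 *    this time backed by a BAR so peers can DMA to it directly. */
	return pci_p2pdma_add_resource(pdev, 2 /* BAR, illustrative */,
				       MY_DEVMEM_SIZE, 0);
}

The driver is then left keeping the lifetimes of the two page sets in sync by hand,
which is the fragility being described above.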
>>>
>>> DEVICE_PRIVATE pages, the linear map and the memdesc world
>>> ==========================================================
>>>
>>> DEVICE_PRIVATE pages currently reside in the linear map such that pfn_to_page()
>>> and page_to_pfn() work "as expected". However, this implies a contiguous range
>>> of unused physical addresses needs to be both available and allocated for device
>>> memory. Such a range isn't always available, particularly on ARM[1] where the vmemmap
>>> region may not be large enough to accommodate the amount of device memory.
>>>
>>> However it occurs to me that (almost?) all code paths that deal with
>>> DEVICE_PRIVATE pages are already aware of this - in the case of page_to_pfn()
>>> the page can be directly queried with is_device_private_page() and in the case
>>> of pfn_to_page() the pfn has (almost?) always been obtained from a special swap
>>> entry indicating such.
>>>
>>> So does page_to_pfn()/pfn_to_page() really need to work for DEVICE_PRIVATE
>>> pages? If not, could we allocate the struct pages in a vmalloc array instead? Do
>>> we even need ZONE_DEVICE pages/folios in a memdesc world?
>>
>> It occurred to me as well while reading your migration proposal above.
>> struct page is not used for DEVICE_PRIVATE, so maybe it is OK to get rid of it.
>> How about DEVICE_COHERENT? Is its struct page used currently? I see the AMD kfd
>> driver is using DEVICE_COHERENT (Christian König cc'd).
>
> I'm not sure removing struct page for DEVICE_COHERENT would be so
> straightforward. Unlike DEVICE_PRIVATE pages, these are mapped by normal present
> PTEs, so we can't rely on having a special PTE to figure out which variant of
> pfn_to_{page|memdesc|thing}() to call.
>
> On the other hand, these pages are real memory in the physical address space, and so
> should probably be covered by the linear map anyway and have their own reserved
> region of physical address space. This is unlike DEVICE_PRIVATE entries, which
> effectively need to steal some physical address space.

Got it. Like you said above, DEVICE_PRIVATE and DEVICE_COHERENT are both non-LRU
pages, but only DEVICE_COHERENT can be accessed by the CPU. We probably want to
categorize them differently based on DavidH’s email[1]:

DEVICE_PRIVATE: non-folio migration
DEVICE_COHERENT: non-LRU folio migration

[1] https://lore.kernel.org/linux-mm/bb0f813e-7c1b-4257-baa5-5afe18be8552@redhat.com/
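To illustrate that asymmetry: DEVICE_PRIVATE pages are only ever reached through
a device-private swap entry, so the fault path already knows what it is dealing
with before it does any pfn-to-page lookup, whereas DEVICE_COHERENT pages sit
behind normal present PTEs and can only be identified from the struct page itself.
A minimal sketch condensed from logic that already exists in the fault path (the
helper name is just illustrative, not an existing function):

static bool my_pte_points_to_device_memory(pte_t pte)
{
	if (!pte_present(pte)) {
		swp_entry_t entry = pte_to_swp_entry(pte);

		/* DEVICE_PRIVATE: the non-present entry itself tells us what
		 * kind of page this is, before any pfn_to_page(). */
		if (is_device_private_entry(entry))
			return is_device_private_page(pfn_swap_entry_to_page(entry));

		return false;
	}

	/* DEVICE_COHERENT: a normal present mapping, so only the struct page
	 * can tell us this is device memory. */
	return is_device_coherent_page(pte_page(pte));
}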

Best Regards,
Yan, Zi

