* [RFC] Huge remap_pfn_range for vfio-pci
From: Axel Rasmussen @ 2024-05-24 20:54 UTC
To: Jason Gunthorpe, Peter Xu, David Hildenbrand, Sean Christopherson
Cc: Linux MM
Hi,
I'm interested in extending remap_pfn_range to allow it to map the
range hugely (using PUDs or PMDs). The initial user I have in mind is
vfio-pci; I'm thinking that when we're mapping large ranges for GPUs, we
can get both a performance win and a host memory overhead win by mapping
hugely.
Another thing I have in the back of my mind is adding something KVM
can re-use to simplify its whole host_pfn_mapping_level /
hva_to_pfn_remapped / get_user_page_fast_only thing.
I know Peter and David are working on some related things (hugetlbfs
unification and follow_pte et al improvements, respectively). Although
I have a hacky proof of concept that works, I thought it best to get
some consensus on the design before I post something, so I don't
conflict with this existing / upcoming work.
Changing remap_pfn_range to install PUDs or PMDs is straightforward.
The hairy part is the fault / follow side of things:
1. follow_pte clearly doesn't work for this, since the leaf might be a
PUD or PMD instead of a PTE. Most callers don't care about the PTE
itself; they care about the pgprot or flags it has set, so my idea was
to add a new interface which yields just those bits instead of the
actual PTE (rough sketch at the end of this mail).
Peter, I think hugetlbfs unification may run into similar issues; do
you already have a plan to deal with PUD/PMD/PTE being different
types?
2. vfio-pci relies on vm_ops->fault. This is a problem because the
normal fault handler path doesn't call this until after it has walked
down to the PTE level, installing PUDs/PMDs along the way. I have only
gross ideas for how to deal with this:
- Add a VM_HUGEPFNMAP VMA flag indicating vm_ops->fault should be
called earlier in __handle_mm_fault
- Add a vm_ops->hugepfn_fault (name not important) which should be
called earlier in __handle_mm_fault
- Go ahead and let remap_pfn_range overwrite existing PUDs/PMDs
I wonder which of these options folks would find least offensive? Or is
there a better way I haven't thought of?
3. That's also an issue for CoW faults, but I don't know of any real
use case for CoW huge pfn mappings, so I thought we could just keep the
existing small mapping behavior for CoW VMAs. Any objections?
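To make (1) concrete, here is a minimal sketch of the interface shape I
have in mind. All of the names and fields below are hypothetical, purely
for discussion:

  /*
   * Hypothetical follow_pte() replacement: instead of handing back a
   * pte_t (which can't represent a PUD or PMD leaf), report only the
   * bits most callers actually want, regardless of the leaf level.
   */
  struct pfnmap_args {
          unsigned long pfn;      /* pfn of the leaf covering addr */
          pgprot_t pgprot;        /* protection / cacheability bits */
          bool writable;
  };

  int follow_pfnmap(struct vm_area_struct *vma, unsigned long addr,
                    struct pfnmap_args *args);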
Thanks!
* Re: [RFC] Huge remap_pfn_range for vfio-pci
From: Peter Xu @ 2024-05-24 23:31 UTC
To: Axel Rasmussen
Cc: Jason Gunthorpe, David Hildenbrand, Sean Christopherson,
Linux MM, Alex Williamson
On Fri, May 24, 2024 at 01:54:20PM -0700, Axel Rasmussen wrote:
> Hi,
Hi, Axel,
>
> I'm interested in extending remap_pfn_range to allow it to map the
> range hugely (using PUDs or PMDs). The initial user I have in mind is
> vfio-pci; I'm thinking that when we're mapping large ranges for GPUs, we
> can get both a performance win and a host memory overhead win by mapping
> hugely.
>
> Another thing I have in the back of my mind is adding something KVM
> can re-use to simplify its whole host_pfn_mapping_level /
> hva_to_pfn_remapped / get_user_page_fast_only thing.
IIUC KVM should already be prepared for it, as host_pfn_mapping_level() can
detect any huge mappings using the *_leaf() APIs.
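Roughly, the detection boils down to something like this (x86 naming, and
heavily simplified: the real code does a careful lockless walk with
READ_ONCE(), which I've omitted here):

  /* Simplified: find the leaf level mapping @addr; mmap_lock held. */
  static int mapping_level(struct mm_struct *mm, unsigned long addr)
  {
          pgd_t *pgd = pgd_offset(mm, addr);
          p4d_t *p4d;
          pud_t *pud;
          pmd_t *pmd;

          if (pgd_none(*pgd))
                  return PG_LEVEL_NONE;
          p4d = p4d_offset(pgd, addr);
          if (p4d_none(*p4d))
                  return PG_LEVEL_NONE;
          pud = pud_offset(p4d, addr);
          if (pud_leaf(*pud))     /* 1G leaf mapping */
                  return PG_LEVEL_1G;
          if (pud_none(*pud))
                  return PG_LEVEL_NONE;
          pmd = pmd_offset(pud, addr);
          if (pmd_leaf(*pmd))     /* 2M leaf mapping */
                  return PG_LEVEL_2M;
          if (pmd_none(*pmd))
                  return PG_LEVEL_NONE;
          return PG_LEVEL_4K;
  }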
>
> I know Peter and David are working on some related things (hugetlbfs
> unification and follow_pte et al improvements, respectively). Although
> I have a hacky proof of concept that works, I thought it best to get
> some consensus on the design before I post something, so I don't
> conflict with this existing / upcoming work.
Yes, we're working on that, mostly with Alex. There's a testing branch, but
it's only half baked so far:
https://github.com/xzpeter/linux/commits/huge-pfnmap/
>
> Changing remap_pfn_range to install PUDs or PMDs is straightforward.
> The hairy part is the fault / follow side of things:
I'm surprised you've already thought about the fault() path, given that
Alex only officially proposed it yesterday. Maybe you followed the
previous discussions. It's here:
https://lore.kernel.org/r/20240523195629.218043-1-alex.williamson@redhat.com
>
> 1. follow_pte clearly doesn't work for this, since the leaf might be a
> PUD or PMD instead of a PTE. Most callers don't care about the PTE
> itself; they care about the pgprot or flags it has set, so my idea was
> to add a new interface which yields just those bits instead of the
> actual PTE.
See:
https://github.com/xzpeter/linux/commit/2cb4702418a1b740129fc7b379b52e16e57032e1
>
> Peter, I think hugetlbfs unification may run into similar issues; do
> you already have a plan to deal with PUD/PMD/PTE being different
> types?
Exactly. There'll be some shared work between the two projects on fork(),
mprotect(), etc. And yes, I plan to cover them all, but I'll start with the
pfnmap work, paving the way for hugetlb, while Oscar (from the SUSE kernel
team) works concurrently on other hugetlb paths.
>
> 2. vfio-pci relies on vm_ops->fault. This is a problem because the
> normal fault handler path doesn't call this until after it has walked
> down to the PTE level, installing PUDs/PMDs along the way. I have only
> gross ideas for how to deal with this:
>
> - Add a VM_HUGEPFNMAP VMA flag indicating vm_ops->fault should be
> called earlier in __handle_mm_fault
> - Add a vm_ops->hugepfn_fault (name not important) which should be
> called earlier in __handle_mm_fault
> - Go ahead and let remap_pfn_range overwrite existing PUDs/PMDs
I'm not sure exactly what you meant here, but Alex has already worked
on that with huge_fault(). See:
https://github.com/awilliam/linux-vfio/commit/ec6c970f8374f91df0ebfe180cd388ba31187942
So far I don't see why we'd need a new VMA flag.
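The shape of it is roughly the below. This is my paraphrase, not Alex's
exact code, and the pfn math just assumes the usual vm_pgoff-based pfnmap
convention:

  /*
   * Sketch of a vm_ops->huge_fault handler (paraphrased, not the
   * actual vfio-pci patch).  The core mm calls this before walking
   * down to the pte level, so the driver can install a pmd leaf
   * directly and fall back to ptes when alignment doesn't allow it.
   */
  static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
                                             unsigned int order)
  {
          struct vm_area_struct *vma = vmf->vma;
          unsigned long pfn = vma->vm_pgoff +
                  ((vmf->address - vma->vm_start) >> PAGE_SHIFT);

          if (order != PMD_ORDER)
                  return VM_FAULT_FALLBACK;

          /* Both the VA and the pfn must be 2M-aligned for a pmd leaf. */
          if (!IS_ALIGNED(vmf->address, PMD_SIZE) ||
              !IS_ALIGNED(pfn, PMD_SIZE >> PAGE_SHIFT))
                  return VM_FAULT_FALLBACK;

          return vmf_insert_pfn_pmd(vmf, __pfn_to_pfn_t(pfn, PFN_DEV),
                                    vmf->flags & FAULT_FLAG_WRITE);
  }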
>
> I wonder which of these options folks would find least offensive? Or is
> there a better way I haven't thought of?
>
> 3. That's also an issue for CoW faults, but I don't know of any real
> use case for CoW huge pfn mappings, so I thought we could just keep the
> existing small mapping behavior for CoW VMAs. Any objections?
I think we should keep the pud/pmd mappings transparent, so the old pte
behavior needs to be maintained. E.g., we'll need to be able to split a
pud/pmd mapping if mprotect() is applied to only part of it.
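For example, nothing stops userspace from doing something like this
(illustrative snippet; region_fd stands for whatever vfio region got
mmap()ed, and error handling is omitted):

  /* Userspace view of why a partial mprotect() forces a split. */
  #include <sys/mman.h>

  void demo(int region_fd, size_t len)    /* len: multiple of 2M */
  {
          /* Suppose the kernel backed this with 2M pmd leaves. */
          char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, region_fd, 0);

          /*
           * Changing protection on a single 4K page means the pmd
           * leaf covering it can no longer describe all 2M with one
           * pgprot; the kernel has to split it back into ptes.
           */
          mprotect(p, 4096, PROT_READ);
  }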
Thanks,
--
Peter Xu
* Re: [RFC] Huge remap_pfn_range for vfio-pci
From: Axel Rasmussen @ 2024-05-30 16:59 UTC
To: Peter Xu
Cc: Jason Gunthorpe, David Hildenbrand, Sean Christopherson,
Linux MM, Alex Williamson
On Fri, May 24, 2024 at 4:31 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, May 24, 2024 at 01:54:20PM -0700, Axel Rasmussen wrote:
> > Hi,
>
> Hi, Axel,
>
> >
> > I'm interested in extending remap_pfn_range to allow it to map the
> > range hugely (using PUDs or PMDs). The initial user I have in mind is
> > vfio-pci; I'm thinking that when we're mapping large ranges for GPUs, we
> > can get both a performance win and a host memory overhead win by mapping
> > hugely.
> >
> > Another thing I have in the back of my mind is adding something KVM
> > can re-use to simplify its whole host_pfn_mapping_level /
> > hva_to_pfn_remapped / get_user_page_fast_only thing.
>
> IIUC KVM should already be prepared for it, as host_pfn_mapping_level() can
> detect any huge mappings using the *_leaf() APIs.
Right, the KVM code works as-is. Sean had suggested, though, that if
follow_pte() (or its replacement) returned the mapping level and had an
option to work locklessly, KVM could just reuse it and delete some code.
That would also avoid doing two page table walks (one for follow_pte(),
and one to determine the level).
Then again, I think it is somewhat debatable what exactly such an API
would look like, or whether it would be too KVM-specific to expose
generally.
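Just to have something concrete to poke at, the shape might be roughly
the below (all names hypothetical):

  /*
   * Hypothetical KVM-friendly variant: also report the mapping level,
   * and let the caller request a lockless walk (the caller then has
   * to tolerate the result being immediately stale).
   */
  #define FOLLOW_PFNMAP_LOCKLESS  (1 << 0)

  int follow_pfnmap_level(struct vm_area_struct *vma, unsigned long addr,
                          unsigned int flags, unsigned long *pfn,
                          pgprot_t *pgprot, int *level);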
>
> >
> > I know Peter and David are working on some related things (hugetlbfs
> > unification and follow_pte et al improvements, respectively). Although
> > I have a hacky proof of concept that works, I thought it best to get
> > some consensus on the design before I post something, so I don't
> > conflict with this existing / upcoming work.
>
> Yes, we're working on that, mostly with Alex. There's a testing branch, but
> it's only half baked so far:
>
> https://github.com/xzpeter/linux/commits/huge-pfnmap/
Ah, I hadn't been aware of this; it looks like you're already well on
your way to implementing exactly what I was thinking of. :) In that
case I'll mostly plan on trying out this branch and offering any
feedback / fixes I find; it would be counterproductive to spend time
building my own implementation.
>
> >
> > Changing remap_pfn_range to install PUDs or PMDs is straightforward.
> > The hairy part is the fault / follow side of things:
>
> I'm surprised you've already thought about the fault() path, given that
> Alex only officially proposed it yesterday. Maybe you followed the
> previous discussions. It's here:
>
> https://lore.kernel.org/r/20240523195629.218043-1-alex.williamson@redhat.com
>
> >
> > 1. follow_pte clearly doesn't work for this, since the leaf might be a
> > PUD or PMD instead of a PTE. Most callers don't care about the PTE
> > itself; they care about the pgprot or flags it has set, so my idea was
> > to add a new interface which yields just those bits instead of the
> > actual PTE.
>
> See:
>
> https://github.com/xzpeter/linux/commit/2cb4702418a1b740129fc7b379b52e16e57032e1
Ah! Thanks for the pointer. This is relatively close to what I had in mind.
>
> >
> > Peter, I think hugetlbfs unification may run into similar issues; do
> > you already have a plan to deal with PUD/PMD/PTE being different
> > types?
>
> Exactly. There'll be some shared work between the two projects on fork(),
> mprotect(), etc. And yes, I plan to cover them all, but I'll start with the
> pfnmap work, paving the way for hugetlb, while Oscar (from the SUSE kernel
> team) works concurrently on other hugetlb paths.
>
> >
> > 2. vfio-pci relies on vm_ops->fault. This is a problem because the
> > normal fault handler path doesn't call this until after it has walked
> > down to the PTE level, installing PUDs/PMDs along the way. I have only
> > gross ideas for how to deal with this:
> >
> > - Add a VM_HUGEPFNMAP VMA flag indicating vm_ops->fault should be
> > called earlier in __handle_mm_fault
> > - Add a vm_ops->hugepfn_fault (name not important) which should be
> > called earlier in __handle_mm_fault
> > - Go ahead and let remap_pfn_range overwrite existing PUDs/PMDs
>
> I'm not sure exactly what you meant here, but Alex has already worked
> on that with huge_fault(). See:
>
> https://github.com/awilliam/linux-vfio/commit/ec6c970f8374f91df0ebfe180cd388ba31187942
>
> So far I don't see why we'd need a new VMA flag.
Ah, I had discounted huge_fault(), thinking it was specific to
hugetlbfs or THPs. I should have spent more time reading that code; I
agree it looks like it avoids all of what I'm talking about here. :)
>
> >
> > I wonder which of these options folks would find least offensive? Or is
> > there a better way I haven't thought of?
> >
> > 3. That's also an issue for CoW faults, but I don't know of any real
> > use case for CoW huge pfn mappings, so I thought we could just keep the
> > existing small mapping behavior for CoW VMAs. Any objections?
>
> I think we should keep the pud/pmd mappings transparent, so the old pte
> behavior needs to be maintained. E.g., we'll need to be able to split a
> pud/pmd mapping if mprotect() is applied to only part of it.
I had been thinking of ensuring we never have pud/pmds in CoW mappings,
but using huge_fault() might make my worry go away entirely.
I completely agree we should allow vfio mappings to be mixed-size,
though: in case things aren't quite aligned (due to an mprotect() split
or any other reason), we can still have a mostly-huge mapping with
some ptes on the end(s).
>
> Thanks,
>
> --
> Peter Xu
>