linux-mm.kvack.org archive mirror
From: Axel Rasmussen <axelrasmussen@google.com>
To: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>,
	David Hildenbrand <david@redhat.com>,
	 Sean Christopherson <seanjc@google.com>,
	Linux MM <linux-mm@kvack.org>,
	 Alex Williamson <alex.williamson@redhat.com>
Subject: Re: [RFC] Huge remap_pfn_range for vfio-pci
Date: Thu, 30 May 2024 09:59:44 -0700
Message-ID: <CAJHvVcjiZZTsma5HWamJ-N48f20X8ejJT0yFC+Z=Br=r1KpNzw@mail.gmail.com>
In-Reply-To: <ZlEjYZB78SVjFXVy@x1n>

On Fri, May 24, 2024 at 4:31 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, May 24, 2024 at 01:54:20PM -0700, Axel Rasmussen wrote:
> > Hi,
>
> Hi, Axel,
>
> >
> > I'm interested in extending remap_pfn_range to allow it to map the
> > range hugely (using PUDs or PMDs). The initial user I have in mind is
> > vfio-pci; I'm thinking when we're mapping large ranges for GPUs, we
> > can get both a performance and host overhead win by doing this hugely.
> >
> > Another thing I have in the back of my mind is adding something KVM
> > can re-use to simplify its whole host_pfn_mapping_level /
> > hva_to_pfn_remapped / get_user_page_fast_only thing.
>
> IIUC kvm should be prepared for it, as host_pfn_mapping_level() can detect
> any huge mappings using the *_leaf() apis.

Right, the KVM code works as is. Sean had been suggesting though that
if follow_pte() (or its replacement) returned the level, and had an
option to work locklessly, KVM could just re-use it and delete some
code. I think we could also avoid doing two page table walks (once for
follow_pte, and once to determine the level).

Then again, I think it is somewhat debatable what exactly such an API
would look like, or whether it would be too KVM-specific to expose
generally.
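
To make the "return the level from a single walk" idea concrete, here is
a minimal userspace sketch. It is a toy model only: every toy_* name,
the three-level layout, and the 9-bit indices are assumptions for
illustration, not the kernel's follow_pte() or any proposed replacement,
and locking (or lockless walking) is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model: none of these names are kernel APIs. */
enum toy_level { TOY_LEVEL_PTE = 0, TOY_LEVEL_PMD = 1, TOY_LEVEL_PUD = 2 };

#define TOY_PAGE_SHIFT 12u
#define TOY_IDX_BITS   9u
#define TOY_ENTRIES    (1u << TOY_IDX_BITS)

struct toy_entry {
	bool present;
	bool leaf;              /* maps memory directly at this level */
	uint64_t pfn;           /* first pfn of the mapping, valid when leaf */
	struct toy_entry *next; /* lower-level table, valid when !leaf */
};

struct toy_walk_result {
	uint64_t pfn;           /* pfn backing the exact address looked up */
	enum toy_level level;   /* level at which the leaf was found */
};

/* Address bits covered by one entry at the given level. */
static unsigned toy_shift(enum toy_level level)
{
	return TOY_PAGE_SHIFT + TOY_IDX_BITS * (unsigned)level;
}

/*
 * Walk once from the PUD table downwards. On success, report both the
 * pfn and the level of the leaf, so the caller needs no second walk.
 */
static bool toy_follow_pfn(struct toy_entry *pud_table, uint64_t addr,
			   struct toy_walk_result *out)
{
	struct toy_entry *table = pud_table;

	for (int level = TOY_LEVEL_PUD; level >= TOY_LEVEL_PTE; level--) {
		unsigned shift = toy_shift((enum toy_level)level);
		struct toy_entry *e =
			&table[(addr >> shift) & (TOY_ENTRIES - 1)];

		if (!e->present)
			return false;
		if (e->leaf) {
			/* Offset, in base pages, inside the huge mapping. */
			uint64_t off = (addr & ((1ull << shift) - 1))
					>> TOY_PAGE_SHIFT;

			out->pfn = e->pfn + off;
			out->level = (enum toy_level)level;
			return true;
		}
		if (!e->next)
			return false; /* malformed toy table */
		table = e->next;
	}
	return false;
}
```

A caller that only wants "pfn plus mapping size" would read both fields
of struct toy_walk_result after one call, which is the property being
discussed above.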

>
> >
> > I know Peter and David are working on some related things (hugetlbfs
> > unification and follow_pte et al improvements, respectively). Although
> > I have a hacky proof of concept that works, I thought it best to get
> > some consensus on the design before I post something, so I don't
> > conflict with this existing / upcoming work.
>
> Yes we're working on that, mostly with Alex.  There's a testing branch but
> half baked so far:
>
> https://github.com/xzpeter/linux/commits/huge-pfnmap/

Ah, I hadn't been aware of this. It looks like you're already well on
your way to implementing exactly what I was thinking of. :) In that
case I'll mostly plan on trying out this branch and offering any
feedback / fixes I find; it would be counterproductive to spend time
building my own implementation.

>
> >
> > Changing remap_pfn_range to install PUDs or PMDs is straightforward.
> > The hairy part is the fault / follow side of things:
>
> I'm surprised you thought about the fault() path, even if Alex just
> officially proposed it yesterday.  Maybe you followed the previous
> discussions.  It's here:
>
> https://lore.kernel.org/r/20240523195629.218043-1-alex.williamson@redhat.com
>
> >
> > 1. follow_pte clearly doesn't work for this, since the leaf might be a
> > PUD or PMD instead. Most callers don't care about the PTE itself, they
> > care about the pgprot or flags it has set, so my idea was to add a new
> > interface which just yields those bits, instead of the actual PTE.
>
> See:
>
> https://github.com/xzpeter/linux/commit/2cb4702418a1b740129fc7b379b52e16e57032e1

Ah! Thanks for the pointer. This is relatively close to what I had in mind.

>
> >
> > Peter, I think hugetlbfs unification may run into similar issues, do
> > you have some plan already to deal with PUD/PMD/PTE being different
> > types?
>
> Exactly.  There'll be some shared work between the two projects on fork(),
> mprotect, etc.  And yes I plan to cover them all but I'll start with the
> pfnmap thing, paving way for hugetlb, while we have Oscar (from SUSE kernel
> team) working concurrently on other paths of hugetlb.
>
> >
> > 2. vfio-pci relies on vm_ops->fault. This is a problem because the
> > normal fault handler path doesn't call this until after it has walked
> > down to the PTE level, installing PUDs/PMDs along the way. I have only
> > gross ideas for how to deal with this:
> >
> > - Add a VM_HUGEPFNMAP VMA flag indicating vm_ops->fault should be
> > called earlier in __handle_mm_fault
> > - Add a vm_ops->hugepfn_fault (name not important) which should be
> > called earlier in __handle_mm_fault
> > - Go ahead and let remap_pfn_range overwrite existing PUDs/PMDs
>
> I actually don't know what exactly you meant here, but Alex already worked
> on that with huge_fault().  See:
>
> https://github.com/awilliam/linux-vfio/commit/ec6c970f8374f91df0ebfe180cd388ba31187942
>
> So far I don't yet understand why we need a new vma flag.

Ah, I had discounted huge_fault(), thinking it was specific to
hugetlbfs or THPs. I should have spent more time reading that code; I
agree it looks like it avoids all of what I'm talking about here. :)
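
For anyone else who made the same assumption, here is a rough userspace
model of the ordering as I understand it: the core fault path offers the
fault to a huge handler at the largest size first, and only falls back to
smaller sizes when the handler declines. Everything named toy_* is
hypothetical, the orders assume 4K base pages with 2M / 1G huge sizes,
and the real huge_fault() signature and the exact conditions under which
it is called are deliberately not reproduced. Physical alignment of the
backing region is also ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model: none of these names are kernel APIs. */
#define TOY_PAGE_SHIFT 12u
#define TOY_PMD_ORDER  9u   /* 2M with 4K pages (assumption) */
#define TOY_PUD_ORDER  18u  /* 1G with 4K pages (assumption) */

struct toy_vma {
	uint64_t start; /* inclusive, page aligned */
	uint64_t end;   /* exclusive, page aligned */
};

enum toy_fault_ret { TOY_FAULT_NOPAGE, TOY_FAULT_FALLBACK };

/*
 * Driver-side handler: accept the fault at this order only if the
 * naturally aligned block containing addr lies entirely in the VMA.
 */
static enum toy_fault_ret toy_huge_fault(const struct toy_vma *vma,
					 uint64_t addr, unsigned order)
{
	uint64_t size = 1ull << (TOY_PAGE_SHIFT + order);
	uint64_t block = addr & ~(size - 1);

	if (block < vma->start || block + size > vma->end)
		return TOY_FAULT_FALLBACK;
	/* A real handler would install the mapping here. */
	return TOY_FAULT_NOPAGE;
}

/*
 * Core-side dispatch: try the largest order first and return the
 * order that was finally used to satisfy the fault.
 */
static unsigned toy_handle_fault(const struct toy_vma *vma, uint64_t addr)
{
	static const unsigned orders[] = { TOY_PUD_ORDER, TOY_PMD_ORDER, 0 };

	assert(addr >= vma->start && addr < vma->end);
	for (size_t i = 0; i < sizeof(orders) / sizeof(orders[0]); i++) {
		if (toy_huge_fault(vma, addr, orders[i]) == TOY_FAULT_NOPAGE)
			return orders[i];
	}
	return 0; /* not reached for a page-aligned VMA */
}
```

The point of the model is just that no PTE-level table has to exist
before the handler gets a chance to map the range hugely.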

>
> >
> > I wonder which of these folks find least offensive? Or is there a
> > better way I haven't thought of?
> >
> > 3. That's also an issue for CoW faults, but I don't know of any real
> > use case for CoW huge pfn mappings, so I thought we can just keep the
> > existing small mapping behavior for CoW VMAs. Any objections?
>
> I think we should keep the pud/pmd transparent, so that the old pte
> behavior needs to be maintained.  E.g., I think we'll need to be able to
> split a pud/pmd mapping if mprotect() is applied partially.

I had been thinking of ensuring we never had pud/pmds in CoW mappings,
but using huge_fault() might make my worry go away entirely.

I completely agree we should allow vfio mappings to be mixed size,
though. If things aren't quite aligned (due to an mprotect split
or any other reason), we can still have a mostly-huge mapping with
some ptes on the end(s).
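
As a small illustration of "mostly-huge with some ptes on the end(s)",
the sketch below counts how an unaligned range would break down into a
pte head, a run of PMD-sized mappings, and a pte tail. The toy_* names
and the 4K / 2M sizes are assumptions, PUD-sized mappings are left out
for brevity, and it assumes virtual and physical addresses share the
same alignment.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: none of these names are kernel APIs. */
#define TOY_PAGE_SHIFT 12u
#define TOY_PMD_SHIFT  21u
#define TOY_PAGE_SIZE  (1ull << TOY_PAGE_SHIFT)
#define TOY_PMD_SIZE   (1ull << TOY_PMD_SHIFT)

struct toy_split {
	uint64_t head_ptes; /* base pages before the first aligned PMD */
	uint64_t pmds;      /* PMD-sized mappings in the middle */
	uint64_t tail_ptes; /* base pages after the last aligned PMD */
};

/* Split the page-aligned range [start, end) into head / huge / tail. */
static struct toy_split toy_split_range(uint64_t start, uint64_t end)
{
	struct toy_split s = { 0, 0, 0 };
	uint64_t first, last;

	assert(start < end);
	assert((start & (TOY_PAGE_SIZE - 1)) == 0);
	assert((end & (TOY_PAGE_SIZE - 1)) == 0);

	/* First and last PMD boundaries inside the range, if any. */
	first = (start + TOY_PMD_SIZE - 1) & ~(TOY_PMD_SIZE - 1);
	last = end & ~(TOY_PMD_SIZE - 1);

	if (first > last) {
		/* The range sits inside one PMD block: ptes only. */
		s.head_ptes = (end - start) >> TOY_PAGE_SHIFT;
		return s;
	}

	s.head_ptes = (first - start) >> TOY_PAGE_SHIFT;
	s.pmds = (last - first) >> TOY_PMD_SHIFT;
	s.tail_ptes = (end - last) >> TOY_PAGE_SHIFT;
	return s;
}
```

For example, a range starting one page below 2M and ending one page
above 6M comes out as one head pte, two PMDs, and one tail pte.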

>
> Thanks,
>
> --
> Peter Xu
>


Thread overview:
2024-05-24 20:54 Axel Rasmussen
2024-05-24 23:31 ` Peter Xu
2024-05-30 16:59   ` Axel Rasmussen [this message]
