From: Peter Xu <peterx@redhat.com>
To: Axel Rasmussen <axelrasmussen@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>,
	David Hildenbrand <david@redhat.com>,
	Sean Christopherson <seanjc@google.com>,
	Linux MM <linux-mm@kvack.org>,
	Alex Williamson <alex.williamson@redhat.com>
Subject: Re: [RFC] Huge remap_pfn_range for vfio-pci
Date: Fri, 24 May 2024 19:31:45 -0400
Message-ID: <ZlEjYZB78SVjFXVy@x1n>
In-Reply-To: <CAJHvVcge-1JhHd4HQKXye9F-WrrMQZ66oU6XF2ie3PCKXaihKw@mail.gmail.com>

On Fri, May 24, 2024 at 01:54:20PM -0700, Axel Rasmussen wrote:
> Hi,

Hi, Axel,

> 
> I'm interested in extending remap_pfn_range to allow it to map the
> range hugely (using PUDs or PMDs). The initial user I have in mind is
> vfio-pci; I'm thinking when we're mapping large ranges for GPUs, we
> can get both a performance and host overhead win by doing this hugely.
> 
> Another thing I have in the back of my mind is adding something KVM
> can re-use to simplify its whole host_pfn_mapping_level /
> hva_to_pfn_remapped / get_user_page_fast_only thing.

IIUC KVM should already be prepared for it, as host_pfn_mapping_level() can
detect any huge mapping using the *_leaf() APIs.
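
To illustrate what "detecting the level via leaf checks" means, here is a
toy userspace model of such a walk.  It is only a sketch: none of the
toy_* names exist in the kernel, the fixed 30/21/12 shifts assume
x86-64-style 4K pages, and it is not how host_pfn_mapping_level() is
actually written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy userspace model; every toy_* name is hypothetical, not a kernel API. */
#define TOY_ENTRIES 512

enum toy_level { TOY_NONE = 0, TOY_4K = 1, TOY_2M = 2, TOY_1G = 3 };

struct toy_entry {
	bool present;           /* entry is populated */
	bool leaf;              /* maps memory directly (PUD/PMD levels only) */
	struct toy_entry *next; /* lower-level table when present && !leaf */
};

/* Index into a 512-entry table for the given address and level shift. */
static size_t toy_index(uint64_t addr, unsigned int shift)
{
	return (size_t)((addr >> shift) & (TOY_ENTRIES - 1));
}

/*
 * Walk from the PUD table downwards and stop at the first leaf.  The
 * caller never sees a PUD/PMD/PTE value, only the level it was found at.
 */
static enum toy_level toy_mapping_level(const struct toy_entry *pud_table,
					uint64_t addr)
{
	const struct toy_entry *pud, *pmd, *pte;

	assert(pud_table != NULL);

	pud = &pud_table[toy_index(addr, 30)];
	if (!pud->present)
		return TOY_NONE;
	if (pud->leaf)
		return TOY_1G;

	pmd = &pud->next[toy_index(addr, 21)];
	if (!pmd->present)
		return TOY_NONE;
	if (pmd->leaf)
		return TOY_2M;

	pte = &pmd->next[toy_index(addr, 12)];
	return pte->present ? TOY_4K : TOY_NONE;
}
```

The point is only that a walker which checks for a leaf at each level does
not care whether the huge entry came from THP, hugetlb or a pfnmap.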

> 
> I know Peter and David are working on some related things (hugetlbfs
> unification and follow_pte et al improvements, respectively). Although
> I have a hacky proof of concept that works, I thought it best to get
> some consensus on the design before I post something, so I don't
> conflict with this existing / upcoming work.

Yes, we're working on that, mostly with Alex.  There's a testing branch, but
it's half-baked so far:

https://github.com/xzpeter/linux/commits/huge-pfnmap/

> 
> Changing remap_pfn_range to install PUDs or PMDs is straightforward.
> The hairy part is the fault / follow side of things:

I'm surprised you thought about the fault() path, given that Alex only
officially proposed it yesterday.  Maybe you followed the previous
discussions.  His proposal is here:

https://lore.kernel.org/r/20240523195629.218043-1-alex.williamson@redhat.com

> 
> 1. follow_pte clearly doesn't work for this, since the leaf might be a
> PUD or PMD instead. Most callers don't care about the PTE itself, they
> care about the pgprot or flags it has set, so my idea was to add a new
> interface which just yields those bits, instead of the actual PTE.

See:

https://github.com/xzpeter/linux/commit/2cb4702418a1b740129fc7b379b52e16e57032e1

> 
> Peter, I think hugetlbfs unification may run into similar issues, do
> you have some plan already to deal with PUD/PMD/PTE being different
> types?

Exactly.  There'll be some shared work between the two projects on fork(),
mprotect(), etc.  And yes, I plan to cover them all, but I'll start with the
pfnmap thing, paving the way for hugetlb, while Oscar (from the SUSE kernel
team) works concurrently on other hugetlb paths.

> 
> 2. vfio-pci relies on vm_ops->fault. This is a problem because the
> normal fault handler path doesn't call this until after it has walked
> down to the PTE level, installing PUDs/PMDs along the way. I have only
> gross ideas for how to deal with this:
> 
> - Add a VM_HUGEPFNMAP VMA flag indicating vm_ops->fault should be
> called earlier in __handle_mm_fault
> - Add a vm_ops->hugepfn_fault (name not important) which should be
> called earlier in __handle_mm_fault
> - Go ahead and let remap_pfn_range overwrite existing PUDs/PMDs

I'm not sure exactly what you meant here, but Alex has already worked on
that with huge_fault().  See:

https://github.com/awilliam/linux-vfio/commit/ec6c970f8374f91df0ebfe180cd388ba31187942

So far I don't see why we need a new VMA flag.
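
The ordering I have in mind can be sketched in userspace as below.  This
is a toy model under loud assumptions: the toy_* names, the 5 MiB region
and the order constants are all made up for illustration, and the real
dispatch in __handle_mm_fault() has many more cases.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy userspace model; every toy_* name is hypothetical. */
enum toy_result { TOY_DONE, TOY_FALLBACK };

#define TOY_PMD_ORDER 9   /* 2 MiB with 4 KiB base pages */
#define TOY_PUD_ORDER 18  /* 1 GiB with 4 KiB base pages */

struct toy_vm_ops {
	enum toy_result (*fault)(uint64_t addr);
	/* Optional; may be NULL if the driver never maps hugely. */
	enum toy_result (*huge_fault)(uint64_t addr, unsigned int order);
};

/* Hypothetical device region: 5 MiB starting at offset 0. */
#define TOY_BAR_SIZE (5ULL << 20)

static enum toy_result toy_bar_huge_fault(uint64_t addr, unsigned int order)
{
	uint64_t size = 4096ULL << order;
	uint64_t start = addr & ~(size - 1);

	/* Only map hugely if the whole aligned range lies in the region. */
	return (start + size <= TOY_BAR_SIZE) ? TOY_DONE : TOY_FALLBACK;
}

static enum toy_result toy_bar_fault(uint64_t addr)
{
	return addr < TOY_BAR_SIZE ? TOY_DONE : TOY_FALLBACK;
}

/*
 * Try the largest mapping first, then fall back level by level.
 * Returns the order that was installed, or -1 if nothing could be mapped.
 */
static int toy_handle_fault(const struct toy_vm_ops *ops, uint64_t addr)
{
	static const unsigned int orders[] = { TOY_PUD_ORDER, TOY_PMD_ORDER };
	size_t i;

	assert(ops != NULL && ops->fault != NULL);

	if (ops->huge_fault) {
		for (i = 0; i < sizeof(orders) / sizeof(orders[0]); i++)
			if (ops->huge_fault(addr, orders[i]) == TOY_DONE)
				return (int)orders[i];
	}
	return ops->fault(addr) == TOY_DONE ? 0 : -1;
}
```

With this shape the driver opts in simply by providing the huge callback,
and returning "fallback" keeps the small-page path intact, which is why a
separate flag looks unnecessary to me in this model.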

> 
> I wonder which of these folks find least offensive? Or is there a
> better way I haven't thought of?
> 
> 3. That's also an issue for CoW faults, but I don't know of any real
> use case for CoW huge pfn mappings, so I thought we can just keep the
> existing small mapping behavior for CoW VMAs. Any objections?

I think we should keep the PUD/PMD mappings transparent, which means the old
PTE behavior needs to be maintained.  E.g., I think we'll need to be able to
split a PUD/PMD mapping if mprotect() covers only part of it.
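
As a rough sketch of the splitting I mean, in a toy userspace model (all
toy_* names and the flat "prot" integer are assumptions for illustration,
not kernel structures):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy userspace model; every toy_* name is hypothetical. */
#define TOY_PTRS_PER_PMD 512

struct toy_pte {
	uint64_t pfn;
	unsigned int prot;
};

struct toy_pmd {
	int is_leaf;        /* 1: one huge mapping, 0: split into ptes[] */
	uint64_t pfn;       /* first pfn of the huge mapping */
	unsigned int prot;  /* protection of the huge mapping */
	struct toy_pte ptes[TOY_PTRS_PER_PMD]; /* valid when !is_leaf */
};

/* Replace one huge mapping with 512 small ones covering the same pfns. */
static void toy_split_pmd(struct toy_pmd *pmd)
{
	size_t i;

	assert(pmd->is_leaf);
	for (i = 0; i < TOY_PTRS_PER_PMD; i++) {
		pmd->ptes[i].pfn = pmd->pfn + i;
		pmd->ptes[i].prot = pmd->prot;
	}
	pmd->is_leaf = 0;
}

/* Change protection of the small pages [first, last) under this PMD. */
static void toy_change_prot(struct toy_pmd *pmd, size_t first, size_t last,
			    unsigned int prot)
{
	size_t i;

	assert(first <= last && last <= TOY_PTRS_PER_PMD);

	if (pmd->is_leaf) {
		/* Whole range covered: keep the huge mapping. */
		if (first == 0 && last == TOY_PTRS_PER_PMD) {
			pmd->prot = prot;
			return;
		}
		/* Partial range: split first, then fall through. */
		toy_split_pmd(pmd);
	}
	for (i = first; i < last; i++)
		pmd->ptes[i].prot = prot;
}
```

Userspace sees the same result either way; only the partial case pays the
cost of going back to small mappings.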

Thanks,

-- 
Peter Xu


