Re: [PATCH v2 4/4] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Peter Xu <peterx@redhat.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: kvm@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Nico Pache <npache@redhat.com>,
	Zi Yan <ziy@nvidia.com>, Alex Mastro <amastro@fb.com>,
	David Hildenbrand <david@redhat.com>,
	Alex Williamson <alex@shazbot.org>, Zhi Wang <zhiw@nvidia.com>,
	David Laight <david.laight.linux@gmail.com>,
	Yi Liu <yi.l.liu@intel.com>, Ankit Agrawal <ankita@nvidia.com>,
	Kevin Tian <kevin.tian@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH v2 4/4] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
Date: Wed, 10 Dec 2025 15:43:43 -0500	[thread overview]
Message-ID: <aTnbf_dtwOo_gaVM@x1.local> (raw)
In-Reply-To: <aTWqvfYHWWMgKHPQ@nvidia.com>

On Sun, Dec 07, 2025 at 12:26:37PM -0400, Jason Gunthorpe wrote:
> On Thu, Dec 04, 2025 at 10:10:03AM -0500, Peter Xu wrote:
> 
> > +/*
> > + * Hint function for mmap() about the size of mapping to be carried out.
> > + * This helps to enable huge pfnmaps as much as possible on BAR mappings.
> > + *
> > + * This function does the minimum check on mmap() parameters to make the
> > + * hint valid only. The majority of mmap() sanity check will be done later
> > + * in mmap().
> > + */
> > +int vfio_pci_core_get_mapping_order(struct vfio_device *device,
> > +				    unsigned long pgoff, size_t len)
> > +{
> > +	struct vfio_pci_core_device *vdev =
> > +	    container_of(device, struct vfio_pci_core_device, vdev);
> > +	struct pci_dev *pdev = vdev->pdev;
> > +	unsigned int index = pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> > +	unsigned long req_start;
> > +	size_t phys_len;
> > +
> > +	/* Currently, only bars 0-5 supports huge pfnmap */
> > +	if (index >= VFIO_PCI_ROM_REGION_INDEX)
> > +		return 0;
> > +
> > +	/*
> > +	 * NOTE: we're keeping things simple as of now, assuming the
> > +	 * physical address of BARs (aka, pci_resource_start(pdev, index))
> > +	 * should always be aligned with pgoff in vfio-pci's address space.
> > +	 */
> > +	req_start = (pgoff << PAGE_SHIFT) & ((1UL << VFIO_PCI_OFFSET_SHIFT) - 1);
> > +	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
> > +
> > +	/*
> > +	 * If this happens, it will probably fail mmap() later.. mapping
> > +	 * hint isn't important anymore.
> > +	 */
> > +	if (req_start >= phys_len)
> > +		return 0;
> > +
> > +	phys_len = MIN(phys_len - req_start, len);
> > +
> > +	if (IS_ENABLED(CONFIG_ARCH_SUPPORTS_PUD_PFNMAP) && phys_len >= PUD_SIZE)
> > +		return PUD_ORDER;
> > +
> > +	if (IS_ENABLED(CONFIG_ARCH_SUPPORTS_PMD_PFNMAP) && phys_len >= PMD_SIZE)
> > +		return PMD_ORDER;
> > +
> 
> This seems a bit weird, the vma length is already known, it is len,
> why do we go to all this trouble to recalculate len in terms of phys?
> 
> If the length is wrong the mmap will fail, so there is no issue with
> returning a larger order here.
> 
> I feel this should just return the order based on pci_resource_len()?

IIUC there's a trivial difference when partial of a huge bar is mapped.

Example: 1G bar, map range (pgoff=2M, size=1G-2M).

If we return bar size order, we'd say 1G, however then it means we'll do
the alignment with 1G. __thp_get_unmapped_area() will think it's not
proper, because:

	loff_t off_end = off + len;
	loff_t off_align = round_up(off, size);

	if (off_end <= off_align || (off_end - off_align) < size)
		return 0;

Here what we really want is to map (2M, 1G-2M) with 2M huge, not 1G, nor
4K.

> 
> And shouldn't the mm be the one aligning it to what the arch can do
> not drives?

Note that here checking CONFIG_ARCH_SUPPORTS_P*D_PFNMAP is a vfio behavior,
pairing with the huge_fault() of vfio-pci driver.  It implies if vfio-pci's
huge pfnmap is enabled or not.  If it's not enabled, we don't need to
report larger orders here.

Said that, this is still a valid point, that core mm should likely also
check against the configs when the kernel was built, though it should not
check against CONFIG_ARCH_SUPPORTS_PMD_PFNMAP.. Instead, it should check
HAVE_ARCH_TRANSPARENT_HUGEPAGE*.

But then... I really want to avoid adding more dependencies to THPs in core
mm on pfnmaps.  I used to decouple THP and huge mappings, that series
wasn't going anywhere, but adding these checks will add more dependencies..

Shall I keep it simple to leave it to drivers, until we have something more
solid (I think we need HAVE_ARCH_HUGE_P*D_LEAVES here)?

Even with that config ready, drivers should always still do proper check on
its own (drivers need to support huge pfnmaps here first before reporting
high orders).  So what I can add into core mm to check arch support would
only be an extra layer of safety net, not much real help but burn some cpu
cycles, IMHO...

Thanks,

-- 
Peter Xu

next prev parent reply	other threads:[~2025-12-10 20:43 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-04 15:09 [PATCH v2 0/4] mm/vfio: " Peter Xu
2025-12-04 15:10 ` [PATCH v2 1/4] mm/thp: Allow thp_get_unmapped_area_vmflags() to take alignment Peter Xu
2025-12-04 15:10 ` [PATCH v2 2/4] mm: Add file_operations.get_mapping_order() Peter Xu
2025-12-04 15:19   ` Peter Xu
2025-12-08  9:21     ` Matthew Wilcox
2025-12-10 20:24       ` Peter Xu
2025-12-07 16:21   ` Jason Gunthorpe
2025-12-10 20:23     ` Peter Xu
2025-12-04 15:10 ` [PATCH v2 3/4] vfio: Introduce vfio_device_ops.get_mapping_order hook Peter Xu
2025-12-04 15:10 ` [PATCH v2 4/4] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings Peter Xu
2025-12-05  4:33   ` kernel test robot
2025-12-05  7:45   ` kernel test robot
2025-12-07 16:26   ` Jason Gunthorpe
2025-12-10 20:43     ` Peter Xu [this message]
2025-12-08  3:11   ` Alex Mastro
2025-12-04 18:16 ` [PATCH v2 0/4] mm/vfio: " Cédric Le Goater
2025-12-07  9:13 ` Alex Mastro

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aTnbf_dtwOo_gaVM@x1.local \
    --to=peterx@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex@shazbot.org \
    --cc=amastro@fb.com \
    --cc=ankita@nvidia.com \
    --cc=david.laight.linux@gmail.com \
    --cc=david@redhat.com \
    --cc=jgg@nvidia.com \
    --cc=kevin.tian@intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=npache@redhat.com \
    --cc=yi.l.liu@intel.com \
    --cc=zhiw@nvidia.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox