* [PATCH v2 0/6] vfio: Improve DMA mapping performance for huge pfnmaps
@ 2025-02-18 22:22 Alex Williamson
2025-02-18 22:22 ` [PATCH v2 5/6] mm: Provide address mask in struct follow_pfnmap_args Alex Williamson
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Alex Williamson @ 2025-02-18 22:22 UTC (permalink / raw)
To: alex.williamson
Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, jgg, akpm,
linux-mm, david, willy
v2:
- Rewrapped comment block in 3/6
- Added 4/6 to use consistent types (Jason)
- Renamed s/pgmask/addr_mask/ (David)
- Updated 6/6 with proposed epfn algorithm (Jason)
- Applied and retained sign-offs for all but 6/6 where the epfn
calculation changed
v1: https://lore.kernel.org/all/20250205231728.2527186-1-alex.williamson@redhat.com/
As GPU BAR sizes increase, DMA mapping pfnmap ranges has become a
significant overhead for VMs making use of device assignment. Not
only can each mapping take several seconds, but BARs are mapped in
and out of the VM address space multiple times during guest boot.
Factor in that multi-GPU configurations are increasingly commonplace
and that BAR sizes continue to grow, and guest boot today can already
be delayed by minutes.
We've taken steps to make Linux a better guest by batching PCI BAR
sizing operations[1], but that only provides an incremental improvement.
This series attempts to fully address the issue by leveraging the huge
pfnmap support added in v6.12. When we insert pfnmaps using pud and pmd
mappings, we can later take advantage of the mapping level page mask to
iterate at the relevant mapping stride. In the commonly achieved
optimal case, this reduces pfn lookups by a factor of 256K. On a local
test system, an overhead of ~1s for DMA mapping a 32GB PCI BAR is
reduced to sub-millisecond (8M page-sized operations reduced to 32
pud-sized operations).
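To illustrate the idea (a minimal sketch only, not the actual
vfio_iommu_type1 code; the real conversion is in patch 6/6, and vma,
vaddr, and end are assumed locals), a caller that previously stepped
through a pfnmap one PAGE_SIZE at a time can use the new addr_mask
output to step by the mapping level stride instead:

	/* Sketch; assumes mmap read lock held, error handling elided */
	struct follow_pfnmap_args args = { .vma = vma, .address = vaddr };

	while (vaddr < end) {
		if (follow_pfnmap_start(&args))
			break;	/* address not mapped as a pfnmap */
		/* pfns are contiguous across the level covered by addr_mask */
		vaddr = (vaddr & args.addr_mask) + (~args.addr_mask + 1);
		follow_pfnmap_end(&args);
		args.address = vaddr;
	}

Each iteration advances by PAGE_SIZE, PMD_SIZE, or PUD_SIZE depending
on the level at which the pfnmap was inserted, which is where the 256K
factor above comes from.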
Please review, test, and provide feedback. I hope that mm folks can
ack the trivial follow_pfnmap_args update to provide the mapping level
page mask. Naming is hard, so any preference other than addr_mask is
welcome. Thanks,
Alex
[1] https://lore.kernel.org/all/20250120182202.1878581-1-alex.williamson@redhat.com/
Alex Williamson (6):
vfio/type1: Catch zero from pin_user_pages_remote()
vfio/type1: Convert all vaddr_get_pfns() callers to use vfio_batch
vfio/type1: Use vfio_batch for vaddr_get_pfns()
vfio/type1: Use consistent types for page counts
mm: Provide address mask in struct follow_pfnmap_args
vfio/type1: Use mapping page mask for pfnmaps
drivers/vfio/vfio_iommu_type1.c | 123 ++++++++++++++++++++------------
include/linux/mm.h | 2 +
mm/memory.c | 1 +
3 files changed, 80 insertions(+), 46 deletions(-)
--
2.48.1
* [PATCH v2 5/6] mm: Provide address mask in struct follow_pfnmap_args
2025-02-18 22:22 [PATCH v2 0/6] vfio: Improve DMA mapping performance for huge pfnmaps Alex Williamson
@ 2025-02-18 22:22 ` Alex Williamson
2025-02-19 8:31 ` David Hildenbrand
2025-02-19 2:37 ` [PATCH v2 0/6] vfio: Improve DMA mapping performance for huge pfnmaps Mitchell Augustin
2025-02-28 16:32 ` Alex Williamson
2 siblings, 1 reply; 7+ messages in thread
From: Alex Williamson @ 2025-02-18 22:22 UTC (permalink / raw)
To: alex.williamson
Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, jgg, akpm,
linux-mm, david
follow_pfnmap_start() walks the page table for a given address and
fills out the struct follow_pfnmap_args in pfnmap_args_setup().
The address mask of the page table level is already provided to this
latter function for calculating the pfn. This address mask can also
be useful for the caller to determine the extent of the contiguous
mapping.
For example, vfio-pci now supports huge_fault for pfnmaps and is able
to insert pud and pmd mappings. When we DMA map these pfnmaps, e.g.
PCI MMIO BARs, we iterate follow_pfnmap_start() on each pfn to test
for a contiguous pfn range. Providing the mapping address mask allows
us to skip ahead by the extent of the mapping level. Assuming a 1GB
pud level and 4KB page size, iterations are reduced by a factor of
256K. In wall clock time, mapping a 32GB PCI BAR is reduced from ~1s
to <1ms.
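As a rough illustration (a hypothetical helper, not part of this patch;
the actual vfio user is patch 6/6), the number of base pages covered by
the mapping level follows directly from the new output:

	/* Hypothetical: base pages spanned by the level backing args->address */
	static unsigned long pfnmap_level_pages(const struct follow_pfnmap_args *args)
	{
		return (~args->addr_mask + 1) >> PAGE_SHIFT;
	}

On x86-64 with 4KB pages this yields 1 for a pte, 512 for a pmd, and
262144 (the 256K factor above) for a pud mapping.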
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: linux-mm@kvack.org
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
Tested-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
include/linux/mm.h | 2 ++
mm/memory.c | 1 +
2 files changed, 3 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7b1068ddcbb7..92b30dba7e38 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2417,11 +2417,13 @@ struct follow_pfnmap_args {
* Outputs:
*
* @pfn: the PFN of the address
+ * @addr_mask: address mask covering pfn
* @pgprot: the pgprot_t of the mapping
* @writable: whether the mapping is writable
* @special: whether the mapping is a special mapping (real PFN maps)
*/
unsigned long pfn;
+ unsigned long addr_mask;
pgprot_t pgprot;
bool writable;
bool special;
diff --git a/mm/memory.c b/mm/memory.c
index 539c0f7c6d54..8f0969f132fe 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6477,6 +6477,7 @@ static inline void pfnmap_args_setup(struct follow_pfnmap_args *args,
args->lock = lock;
args->ptep = ptep;
args->pfn = pfn_base + ((args->address & ~addr_mask) >> PAGE_SHIFT);
+ args->addr_mask = addr_mask;
args->pgprot = pgprot;
args->writable = writable;
args->special = special;
--
2.48.1
* Re: [PATCH v2 0/6] vfio: Improve DMA mapping performance for huge pfnmaps
2025-02-18 22:22 [PATCH v2 0/6] vfio: Improve DMA mapping performance for huge pfnmaps Alex Williamson
2025-02-18 22:22 ` [PATCH v2 5/6] mm: Provide address mask in struct follow_pfnmap_args Alex Williamson
@ 2025-02-19 2:37 ` Mitchell Augustin
2025-02-28 16:32 ` Alex Williamson
2 siblings, 0 replies; 7+ messages in thread
From: Mitchell Augustin @ 2025-02-19 2:37 UTC (permalink / raw)
To: Alex Williamson
Cc: kvm, linux-kernel, peterx, clg, jgg, akpm, linux-mm, david, willy
No change in behavior observed from v1 on my config (DGX H100). Thanks!
Reviewed-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
Tested-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
On Tue, Feb 18, 2025 at 4:22 PM Alex Williamson
<alex.williamson@redhat.com> wrote:
> [...]
--
Mitchell Augustin
Software Engineer - Ubuntu Partner Engineering
Email: mitchell.augustin@canonical.com
Location: United States of America
canonical.com
ubuntu.com
* Re: [PATCH v2 5/6] mm: Provide address mask in struct follow_pfnmap_args
2025-02-18 22:22 ` [PATCH v2 5/6] mm: Provide address mask in struct follow_pfnmap_args Alex Williamson
@ 2025-02-19 8:31 ` David Hildenbrand
2025-02-26 19:54 ` Alex Williamson
0 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand @ 2025-02-19 8:31 UTC (permalink / raw)
To: Alex Williamson
Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, jgg, akpm, linux-mm
On 18.02.25 23:22, Alex Williamson wrote:
> [...]
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
* Re: [PATCH v2 5/6] mm: Provide address mask in struct follow_pfnmap_args
2025-02-19 8:31 ` David Hildenbrand
@ 2025-02-26 19:54 ` Alex Williamson
2025-02-26 20:05 ` David Hildenbrand
0 siblings, 1 reply; 7+ messages in thread
From: Alex Williamson @ 2025-02-26 19:54 UTC (permalink / raw)
To: David Hildenbrand
Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, jgg, akpm, linux-mm
On Wed, 19 Feb 2025 09:31:48 +0100
David Hildenbrand <david@redhat.com> wrote:
> [...]
>
> Acked-by: David Hildenbrand <david@redhat.com>
Thanks, David!
Is there any objection from mm folks to bring this in through the vfio
tree?
Patch: https://lore.kernel.org/all/20250218222209.1382449-6-alex.williamson@redhat.com/
Series: https://lore.kernel.org/all/20250218222209.1382449-1-alex.williamson@redhat.com/
Thanks,
Alex
* Re: [PATCH v2 5/6] mm: Provide address mask in struct follow_pfnmap_args
2025-02-26 19:54 ` Alex Williamson
@ 2025-02-26 20:05 ` David Hildenbrand
0 siblings, 0 replies; 7+ messages in thread
From: David Hildenbrand @ 2025-02-26 20:05 UTC (permalink / raw)
To: Alex Williamson
Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, jgg, akpm, linux-mm
On 26.02.25 20:54, Alex Williamson wrote:
> On Wed, 19 Feb 2025 09:31:48 +0100
> David Hildenbrand <david@redhat.com> wrote:
>
>> [...]
>>
>> Acked-by: David Hildenbrand <david@redhat.com>
>
> Thanks, David!
>
> Is there any objection from mm folks to bring this in through the vfio
> tree?
I assume it's fine. Andrew is on CC, so he should be aware of it. I'm
not aware of possible clashes.
--
Cheers,
David / dhildenb
* Re: [PATCH v2 0/6] vfio: Improve DMA mapping performance for huge pfnmaps
2025-02-18 22:22 [PATCH v2 0/6] vfio: Improve DMA mapping performance for huge pfnmaps Alex Williamson
2025-02-18 22:22 ` [PATCH v2 5/6] mm: Provide address mask in struct follow_pfnmap_args Alex Williamson
2025-02-19 2:37 ` [PATCH v2 0/6] vfio: Improve DMA mapping performance for huge pfnmaps Mitchell Augustin
@ 2025-02-28 16:32 ` Alex Williamson
2 siblings, 0 replies; 7+ messages in thread
From: Alex Williamson @ 2025-02-28 16:32 UTC (permalink / raw)
To: alex.williamson
Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, jgg, akpm,
linux-mm, david, willy
On Tue, 18 Feb 2025 15:22:00 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:
> [...]
With David's blessing on the mm side, applied to the vfio next branch
for v6.15. Thanks all for the reviews and testing!
Alex