* [PATCH v11 0/9] Provide a new two step DMA mapping API
@ 2025-05-05 7:01 ` Leon Romanovsky
2025-05-05 7:01 ` [PATCH v11 1/9] PCI/P2PDMA: Refactor the p2pdma mapping helpers Leon Romanovsky
From: Leon Romanovsky @ 2025-05-05 7:01 UTC (permalink / raw)
To: Marek Szyprowski
Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Jake Edge,
Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun, Robin Murphy,
Joerg Roedel, Will Deacon, Sagi Grimberg, Bjorn Helgaas,
Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
Alex Williamson, Jérôme Glisse, Andrew Morton,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
Kanchan Joshi, Chaitanya Kulkarni
Hi Marek,
These are the DMA/IOMMU patches only, which have not seen functional
changes for a while. They are tested, reviewed and ready to merge.
We will work with the relevant subsystems to merge the rest of the
conversion patches. At least some of them will be done in the next cycle
to reduce merge conflicts.
Thanks
=========================================================================
Following the recent on-site LSF/MM 2025 [1] discussion, the overall
response was extremely positive, with many people expressing their
desire to see this series merged so they can base their work on it.
This includes, but is not limited to:
* Luis's "nvme-pci: breaking the 512 KiB max IO boundary":
https://lore.kernel.org/all/20250320111328.2841690-1-mcgrof@kernel.org/
* Chuck's NFS conversion to use one structure (bio_vec) for all types
of RPC transports:
https://lore.kernel.org/all/913df4b4-fc4a-409d-9007-088a3e2c8291@oracle.com
* Matthew's vision for the world without struct page:
https://lore.kernel.org/all/Z-WRQOYEvOWlI34w@casper.infradead.org/
* Confidential computing roadmap from Dan:
https://lore.kernel.org/all/6801a8e3968da_71fe29411@dwillia2-xfh.jf.intel.com.notmuch
This series is a combination of the efforts of many people who contributed
ideas, code and testing, and I'm grateful to all of them.
[1] https://lore.kernel.org/linux-rdma/20250122071600.GC10702@unreal/
-----------------------------------------------------------------------
Changelog:
v11:
* Left only DMA/IOMMU patches to allow merge
* Added Baolu's Reviewed-by tags
* Fixed commit messages
v10: https://lore.kernel.org/all/cover.1745831017.git.leon@kernel.org
* Rebased on top v6.15-rc3
* Added Luis's tags
* Addressed review comments from Luis about DMA patches
* Removed segment size check from single-segment SGL optimization code
* Changed the NVMe unmap data code as suggested by Christoph
v9: https://lore.kernel.org/all/cover.1745394536.git.leon@kernel.org/
* Added tested-by from Jens.
* Replaced the is_pci_p2pdma_page(bv.bv_page) check with
"if (IS_ENABLED(CONFIG_PCI_P2PDMA) && (req->cmd_flags & REQ_P2PDMA))",
which is more aligned with the goal (do not access struct page) and
more efficient. This is the only line that was changed in Jens's
performance testing flow, so I kept his tags as is.
* Restored single-segment optimization for SGL path.
* Added the forgotten unmap of metadata SGL in the multi-segment flow.
* Split and squashed optimization patch from Kanchan.
* Converted "bool aborted" flag to use newly introduced flag variable.
v8: https://lore.kernel.org/all/cover.1744825142.git.leon@kernel.org/
* Rebased to v6.15-rc1
* Added the NVMe patches, which are now regular patches and not RFCs. They
were in the RFC stage because the block iterator caused a performance
regression in a very extreme scenario (~100M IOPS), but after Kanchan
fixed it, the code became ready for merging.
* @Niklas, I didn't change the naming in this series as it follows the
iommu naming format.
v7: https://lore.kernel.org/all/cover.1738765879.git.leonro@nvidia.com/
* Rebased to v6.14-rc1
v6: https://lore.kernel.org/all/cover.1737106761.git.leon@kernel.org
* Changed the internal __size variable to u64 to properly set the private
flag in the most significant bit.
* Added a comment about why we check DMA_IOVA_USE_SWIOTLB.
* Break the unlink loop if phys is NULL, a condition which we shouldn't hit.
v5: https://lore.kernel.org/all/cover.1734436840.git.leon@kernel.org
* Trimmed long lines in all patches.
* Squashed "dma-mapping: Add check if IOVA can be used" into
"dma: Provide an interface to allow allocate IOVA" patch.
* Added tags from Christoph and Will.
* Fixed spelling/grammar errors.
* Change title from "dma: Provide an ..." to be "dma-mapping: Provide
an ...".
* Slightly changed hmm patch to set sticky flags in one place.
v4: https://lore.kernel.org/all/cover.1733398913.git.leon@kernel.org
* Added extra patch to add kernel-doc for iommu_unmap and iommu_unmap_fast
* Rebased to v6.13-rc1
* Added Will's tags
v3: https://lore.kernel.org/all/cover.1731244445.git.leon@kernel.org
* Added DMA_ATTR_SKIP_CPU_SYNC to p2p pages in HMM.
* Fixed error unwind if dma_iova_sync fails in HMM.
* Clear all PFN flags which were set in map to make the code cleaner;
the callers cleaned them anyway.
* Generalize sticky PFN flags logic in HMM.
* Removed not-needed #ifdef-#endif section.
v2: https://lore.kernel.org/all/cover.1730892663.git.leon@kernel.org
* Fixed docs file as Randy suggested
* Fixed releases of memory in HMM path. It was allocated with kv..
variants but released with kfree instead of kvfree.
* Slightly changed commit message in VFIO patch.
v1: https://lore.kernel.org/all/cover.1730298502.git.leon@kernel.org
* Squashed two VFIO patches into one
* Added Acked-by/Reviewed-by tags
* Fix docs spelling errors
* Simplified dma_iova_sync() API
* Added an extra check in dma_iova_destroy() for the mapped size to make the code clearer
* Fixed checkpatch warnings in p2p patch
* Changed implementation of VFIO mlx5 mlx5vf_add_migration_pages() to
be more general
* Reduced the number of changes in VFIO patch
v0: https://lore.kernel.org/all/cover.1730037276.git.leon@kernel.org
----------------------------------------------------------------------------
LWN coverage:
Dancing the DMA two-step - https://lwn.net/Articles/997563/
----------------------------------------------------------------------------
Currently the only efficient way to map a complex memory description through
the DMA API is by using the scatterlist APIs. The SG APIs are unique in that
they efficiently combine the two fundamental operations of sizing and allocating
a large IOVA window from the IOMMU and processing all the per-address
swiotlb/flushing/p2p/map details.
This uniqueness has been a long-standing pain point, as the scatterlist API
is mandatory but expensive to use. It prevents any kind of optimization or
feature improvement (such as avoiding struct page for P2P) because the
scatterlist itself cannot be improved.
Several approaches have been explored to expand the DMA API with additional
scatterlist-like structures (BIO, rlist); this series instead splits up the
DMA API to allow callers to bring their own data structure.
The API is split into two parts:
- Allocate IOVA space:
To do any pre-allocation required. This is done based on the caller
supplying some details about how much IOMMU address space it would need
in the worst case.
- Map and unmap relevant structures to the pre-allocated IOVA space:
Perform the actual mapping into the pre-allocated IOVA. This is very
similar to dma_map_page(). A rough usage sketch is shown below.
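As a minimal sketch (illustrative only, not code from the series), assuming
nr page-aligned chunks with hypothetical bookkeeping arrays phys[], pages[]
and dma_addr[], and with error unwinding elided:

	struct dma_iova_state state;
	int i;

	if (dma_iova_try_alloc(dev, &state, phys[0], nr * PAGE_SIZE)) {
		/* step 1: IOVA space reserved, state.addr is the base address */
		for (i = 0; i < nr; i++)
			if (dma_iova_link(dev, &state, phys[i], i * PAGE_SIZE,
					  PAGE_SIZE, DMA_TO_DEVICE, 0))
				break;	/* unlink/free on error elided */

		/* step 2: a single IOTLB sync covers the whole linked range */
		dma_iova_sync(dev, &state, 0, nr * PAGE_SIZE);
		/* chunk i is visible to the device at state.addr + i * PAGE_SIZE */
	} else {
		/* fall back to the classic per-page API */
		for (i = 0; i < nr; i++)
			dma_addr[i] = dma_map_page(dev, pages[i], 0, PAGE_SIZE,
						   DMA_TO_DEVICE);
	}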
The whole series can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git dma-split-May-5
There you can find examples of three different users converted to the new
API to show its benefits and versatility. Each user has a unique flow:
1. RDMA ODP is an example of "SVA mirroring" using HMM that needs to
dynamically map/unmap large numbers of single pages. This becomes
significantly faster in the IOMMU case as the map/unmap is now just
a page table walk; the IOVA allocation is pre-computed once. Significant
amounts of memory are saved as there is no longer a need to store the
dma_addr_t of each page.
2. The VFIO PCI live migration code builds a very large "page list"
for the device. Instead of allocating a scatterlist entry per allocated
page, it can just allocate an array of 'struct page *', saving a large
amount of memory.
3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter
list without having to allocate then populate an intermediate SG table.
To make the use of the new API easier, the HMM and block subsystems are
extended to hide the optimization details from the caller. Among these
optimizations:
* Memory reduction, as in most real use cases there is no need to store
mapped DMA addresses and unmap them.
* Reduced function call overhead by removing the need to call function
pointers and using direct calls instead.
This step is the first along a path to provide alternatives to scatterlist
and solve some of the abuses and design mistakes.
Thanks
Christoph Hellwig (6):
PCI/P2PDMA: Refactor the p2pdma mapping helpers
dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h
iommu: generalize the batched sync after map interface
iommu/dma: Factor out a iommu_dma_map_swiotlb helper
dma-mapping: add a dma_need_unmap helper
docs: core-api: document the IOVA-based API
Leon Romanovsky (3):
iommu: add kernel-doc for iommu_unmap_fast
dma-mapping: Provide an interface to allow allocate IOVA
dma-mapping: Implement link/unlink ranges API
Documentation/core-api/dma-api.rst | 71 +++++
drivers/iommu/dma-iommu.c | 482 +++++++++++++++++++++++++----
drivers/iommu/iommu.c | 84 ++---
drivers/pci/p2pdma.c | 38 +--
include/linux/dma-map-ops.h | 54 ----
include/linux/dma-mapping.h | 85 +++++
include/linux/iommu.h | 4 +
include/linux/pci-p2pdma.h | 85 +++++
kernel/dma/direct.c | 44 +--
kernel/dma/mapping.c | 18 ++
10 files changed, 764 insertions(+), 201 deletions(-)
--
2.49.0
* [PATCH v11 1/9] PCI/P2PDMA: Refactor the p2pdma mapping helpers
2025-05-05 7:01 ` [PATCH v11 0/9] Provide a new two step DMA mapping API Leon Romanovsky
@ 2025-05-05 7:01 ` Leon Romanovsky
2025-05-05 7:01 ` [PATCH v11 2/9] dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h Leon Romanovsky
From: Leon Romanovsky @ 2025-05-05 7:01 UTC (permalink / raw)
Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Jake Edge,
Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun, Robin Murphy,
Joerg Roedel, Will Deacon, Sagi Grimberg, Bjorn Helgaas,
Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
Alex Williamson, Jérôme Glisse, Andrew Morton,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
Kanchan Joshi, Chaitanya Kulkarni, Lu Baolu, Leon Romanovsky
From: Christoph Hellwig <hch@lst.de>
The current scheme with a single helper to determine the P2P status
and map a scatterlist segment forces users to always use the map_sg
helper to DMA map, which we're trying to get away from because
scatterlists are very cache inefficient.
Refactor the code so that there is a single helper that checks the P2P
state for a page, including the result that it is not a P2P page to
simplify the callers, and a second one to perform the address translation
for a bus mapped P2P transfer that does not depend on the scatterlist
structure.
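For illustration only (not part of this patch), a non-scatterlist caller
holding a plain page and its physical address would use the two helpers
roughly like this:

	struct pci_p2pdma_map_state p2pdma_state = {};
	dma_addr_t dma_addr;

	switch (pci_p2pdma_state(&p2pdma_state, dev, page)) {
	case PCI_P2PDMA_MAP_NONE:
	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
		/* map through the regular DMA path (dma-direct or IOMMU) */
		break;
	case PCI_P2PDMA_MAP_BUS_ADDR:
		/* program the device with the PCI bus address directly */
		dma_addr = pci_p2pdma_bus_addr_map(&p2pdma_state,
						   page_to_phys(page));
		break;
	default:
		return -EREMOTEIO;
	}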
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/iommu/dma-iommu.c | 47 +++++++++++++++++-----------------
drivers/pci/p2pdma.c | 38 ++++-----------------------
include/linux/dma-map-ops.h | 51 +++++++++++++++++++++++++++++--------
kernel/dma/direct.c | 43 +++++++++++++++----------------
4 files changed, 91 insertions(+), 88 deletions(-)
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index a775e4dbe06f..8a89e63c5973 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1359,7 +1359,6 @@ int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
struct scatterlist *s, *prev = NULL;
int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
struct pci_p2pdma_map_state p2pdma_state = {};
- enum pci_p2pdma_map_type map;
dma_addr_t iova;
size_t iova_len = 0;
unsigned long mask = dma_get_seg_boundary(dev);
@@ -1389,28 +1388,30 @@ int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
size_t s_length = s->length;
size_t pad_len = (mask - iova_len + 1) & mask;
- if (is_pci_p2pdma_page(sg_page(s))) {
- map = pci_p2pdma_map_segment(&p2pdma_state, dev, s);
- switch (map) {
- case PCI_P2PDMA_MAP_BUS_ADDR:
- /*
- * iommu_map_sg() will skip this segment as
- * it is marked as a bus address,
- * __finalise_sg() will copy the dma address
- * into the output segment.
- */
- continue;
- case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
- /*
- * Mapping through host bridge should be
- * mapped with regular IOVAs, thus we
- * do nothing here and continue below.
- */
- break;
- default:
- ret = -EREMOTEIO;
- goto out_restore_sg;
- }
+ switch (pci_p2pdma_state(&p2pdma_state, dev, sg_page(s))) {
+ case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+ /*
+ * Mapping through host bridge should be mapped with
+ * regular IOVAs, thus we do nothing here and continue
+ * below.
+ */
+ break;
+ case PCI_P2PDMA_MAP_NONE:
+ break;
+ case PCI_P2PDMA_MAP_BUS_ADDR:
+ /*
+ * iommu_map_sg() will skip this segment as it is marked
+ * as a bus address, __finalise_sg() will copy the dma
+ * address into the output segment.
+ */
+ s->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
+ sg_phys(s));
+ sg_dma_len(s) = sg->length;
+ sg_dma_mark_bus_address(s);
+ continue;
+ default:
+ ret = -EREMOTEIO;
+ goto out_restore_sg;
}
sg_dma_address(s) = s_iova_off;
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 19214ec81fbb..8d955c25aed3 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -1004,40 +1004,12 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
return type;
}
-/**
- * pci_p2pdma_map_segment - map an sg segment determining the mapping type
- * @state: State structure that should be declared outside of the for_each_sg()
- * loop and initialized to zero.
- * @dev: DMA device that's doing the mapping operation
- * @sg: scatterlist segment to map
- *
- * This is a helper to be used by non-IOMMU dma_map_sg() implementations where
- * the sg segment is the same for the page_link and the dma_address.
- *
- * Attempt to map a single segment in an SGL with the PCI bus address.
- * The segment must point to a PCI P2PDMA page and thus must be
- * wrapped in a is_pci_p2pdma_page(sg_page(sg)) check.
- *
- * Returns the type of mapping used and maps the page if the type is
- * PCI_P2PDMA_MAP_BUS_ADDR.
- */
-enum pci_p2pdma_map_type
-pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
- struct scatterlist *sg)
+void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
+ struct device *dev, struct page *page)
{
- if (state->pgmap != page_pgmap(sg_page(sg))) {
- state->pgmap = page_pgmap(sg_page(sg));
- state->map = pci_p2pdma_map_type(state->pgmap, dev);
- state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
- }
-
- if (state->map == PCI_P2PDMA_MAP_BUS_ADDR) {
- sg->dma_address = sg_phys(sg) + state->bus_off;
- sg_dma_len(sg) = sg->length;
- sg_dma_mark_bus_address(sg);
- }
-
- return state->map;
+ state->pgmap = page_pgmap(page);
+ state->map = pci_p2pdma_map_type(state->pgmap, dev);
+ state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
}
/**
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index e172522cd936..c3086edeccc6 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -443,6 +443,11 @@ enum pci_p2pdma_map_type {
*/
PCI_P2PDMA_MAP_UNKNOWN = 0,
+ /*
+ * Not a PCI P2PDMA transfer.
+ */
+ PCI_P2PDMA_MAP_NONE,
+
/*
* PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
* traverse the host bridge and the host bridge is not in the
@@ -471,21 +476,47 @@ enum pci_p2pdma_map_type {
struct pci_p2pdma_map_state {
struct dev_pagemap *pgmap;
- int map;
+ enum pci_p2pdma_map_type map;
u64 bus_off;
};
-#ifdef CONFIG_PCI_P2PDMA
-enum pci_p2pdma_map_type
-pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
- struct scatterlist *sg);
-#else /* CONFIG_PCI_P2PDMA */
+/* helper for pci_p2pdma_state(), do not use directly */
+void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
+ struct device *dev, struct page *page);
+
+/**
+ * pci_p2pdma_state - check the P2P transfer state of a page
+ * @state: P2P state structure
+ * @dev: device to transfer to/from
+ * @page: page to map
+ *
+ * Check if @page is a PCI P2PDMA page, and if yes of what kind. Returns the
+ * map type, and updates @state with all information needed for a P2P transfer.
+ */
static inline enum pci_p2pdma_map_type
-pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
- struct scatterlist *sg)
+pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev,
+ struct page *page)
+{
+ if (IS_ENABLED(CONFIG_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
+ if (state->pgmap != page_pgmap(page))
+ __pci_p2pdma_update_state(state, dev, page);
+ return state->map;
+ }
+ return PCI_P2PDMA_MAP_NONE;
+}
+
+/**
+ * pci_p2pdma_bus_addr_map - map a PCI_P2PDMA_MAP_BUS_ADDR P2P transfer
+ * @state: P2P state structure
+ * @paddr: physical address to map
+ *
+ * Map a physically contiguous PCI_P2PDMA_MAP_BUS_ADDR transfer.
+ */
+static inline dma_addr_t
+pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr)
{
- return PCI_P2PDMA_MAP_NOT_SUPPORTED;
+ WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR);
+ return paddr + state->bus_off;
}
-#endif /* CONFIG_PCI_P2PDMA */
#endif /* _LINUX_DMA_MAP_OPS_H */
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index b8fe0b3d0ffb..cec43cd5ed62 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -462,34 +462,33 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
enum dma_data_direction dir, unsigned long attrs)
{
struct pci_p2pdma_map_state p2pdma_state = {};
- enum pci_p2pdma_map_type map;
struct scatterlist *sg;
int i, ret;
for_each_sg(sgl, sg, nents, i) {
- if (is_pci_p2pdma_page(sg_page(sg))) {
- map = pci_p2pdma_map_segment(&p2pdma_state, dev, sg);
- switch (map) {
- case PCI_P2PDMA_MAP_BUS_ADDR:
- continue;
- case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
- /*
- * Any P2P mapping that traverses the PCI
- * host bridge must be mapped with CPU physical
- * address and not PCI bus addresses. This is
- * done with dma_direct_map_page() below.
- */
- break;
- default:
- ret = -EREMOTEIO;
+ switch (pci_p2pdma_state(&p2pdma_state, dev, sg_page(sg))) {
+ case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+ /*
+ * Any P2P mapping that traverses the PCI host bridge
+ * must be mapped with CPU physical address and not PCI
+ * bus addresses.
+ */
+ break;
+ case PCI_P2PDMA_MAP_NONE:
+ sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
+ sg->offset, sg->length, dir, attrs);
+ if (sg->dma_address == DMA_MAPPING_ERROR) {
+ ret = -EIO;
goto out_unmap;
}
- }
-
- sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
- sg->offset, sg->length, dir, attrs);
- if (sg->dma_address == DMA_MAPPING_ERROR) {
- ret = -EIO;
+ break;
+ case PCI_P2PDMA_MAP_BUS_ADDR:
+ sg->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
+ sg_phys(sg));
+ sg_dma_mark_bus_address(sg);
+ continue;
+ default:
+ ret = -EREMOTEIO;
goto out_unmap;
}
sg_dma_len(sg) = sg->length;
--
2.49.0
* [PATCH v11 2/9] dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h
2025-05-05 7:01 ` [PATCH v11 0/9] Provide a new two step DMA mapping API Leon Romanovsky
2025-05-05 7:01 ` [PATCH v11 1/9] PCI/P2PDMA: Refactor the p2pdma mapping helpers Leon Romanovsky
@ 2025-05-05 7:01 ` Leon Romanovsky
2025-05-05 7:01 ` [PATCH v11 3/9] iommu: generalize the batched sync after map interface Leon Romanovsky
From: Leon Romanovsky @ 2025-05-05 7:01 UTC (permalink / raw)
Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Jake Edge,
Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun, Robin Murphy,
Joerg Roedel, Will Deacon, Sagi Grimberg, Bjorn Helgaas,
Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
Alex Williamson, Jérôme Glisse, Andrew Morton,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
Kanchan Joshi, Chaitanya Kulkarni, Lu Baolu, Leon Romanovsky
From: Christoph Hellwig <hch@lst.de>
To support the upcoming non-scatterlist mapping helpers, we need to go
back to having them callable outside of the DMA API. Thus move them out of
dma-map-ops.h, which is only for DMA API implementations, to pci-p2pdma.h,
which is for driver use.
Note that the core helper is still not exported, as the mapping is
expected to be done only by very high-level subsystem code, at least for
now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/iommu/dma-iommu.c | 1 +
include/linux/dma-map-ops.h | 85 -------------------------------------
include/linux/pci-p2pdma.h | 85 +++++++++++++++++++++++++++++++++++++
kernel/dma/direct.c | 1 +
4 files changed, 87 insertions(+), 85 deletions(-)
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 8a89e63c5973..9ba8d8bc0ce9 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -27,6 +27,7 @@
#include <linux/msi.h>
#include <linux/of_iommu.h>
#include <linux/pci.h>
+#include <linux/pci-p2pdma.h>
#include <linux/scatterlist.h>
#include <linux/spinlock.h>
#include <linux/swiotlb.h>
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index c3086edeccc6..f48e5fb88bd5 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -434,89 +434,4 @@ static inline void debug_dma_dump_mappings(struct device *dev)
#endif /* CONFIG_DMA_API_DEBUG */
extern const struct dma_map_ops dma_dummy_ops;
-
-enum pci_p2pdma_map_type {
- /*
- * PCI_P2PDMA_MAP_UNKNOWN: Used internally for indicating the mapping
- * type hasn't been calculated yet. Functions that return this enum
- * never return this value.
- */
- PCI_P2PDMA_MAP_UNKNOWN = 0,
-
- /*
- * Not a PCI P2PDMA transfer.
- */
- PCI_P2PDMA_MAP_NONE,
-
- /*
- * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
- * traverse the host bridge and the host bridge is not in the
- * allowlist. DMA Mapping routines should return an error when
- * this is returned.
- */
- PCI_P2PDMA_MAP_NOT_SUPPORTED,
-
- /*
- * PCI_P2PDMA_BUS_ADDR: Indicates that two devices can talk to
- * each other directly through a PCI switch and the transaction will
- * not traverse the host bridge. Such a mapping should program
- * the DMA engine with PCI bus addresses.
- */
- PCI_P2PDMA_MAP_BUS_ADDR,
-
- /*
- * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
- * to each other, but the transaction traverses a host bridge on the
- * allowlist. In this case, a normal mapping either with CPU physical
- * addresses (in the case of dma-direct) or IOVA addresses (in the
- * case of IOMMUs) should be used to program the DMA engine.
- */
- PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
-};
-
-struct pci_p2pdma_map_state {
- struct dev_pagemap *pgmap;
- enum pci_p2pdma_map_type map;
- u64 bus_off;
-};
-
-/* helper for pci_p2pdma_state(), do not use directly */
-void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
- struct device *dev, struct page *page);
-
-/**
- * pci_p2pdma_state - check the P2P transfer state of a page
- * @state: P2P state structure
- * @dev: device to transfer to/from
- * @page: page to map
- *
- * Check if @page is a PCI P2PDMA page, and if yes of what kind. Returns the
- * map type, and updates @state with all information needed for a P2P transfer.
- */
-static inline enum pci_p2pdma_map_type
-pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev,
- struct page *page)
-{
- if (IS_ENABLED(CONFIG_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
- if (state->pgmap != page_pgmap(page))
- __pci_p2pdma_update_state(state, dev, page);
- return state->map;
- }
- return PCI_P2PDMA_MAP_NONE;
-}
-
-/**
- * pci_p2pdma_bus_addr_map - map a PCI_P2PDMA_MAP_BUS_ADDR P2P transfer
- * @state: P2P state structure
- * @paddr: physical address to map
- *
- * Map a physically contiguous PCI_P2PDMA_MAP_BUS_ADDR transfer.
- */
-static inline dma_addr_t
-pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr)
-{
- WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR);
- return paddr + state->bus_off;
-}
-
#endif /* _LINUX_DMA_MAP_OPS_H */
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 2c07aa6b7665..075c20b161d9 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -104,4 +104,89 @@ static inline struct pci_dev *pci_p2pmem_find(struct device *client)
return pci_p2pmem_find_many(&client, 1);
}
+enum pci_p2pdma_map_type {
+ /*
+ * PCI_P2PDMA_MAP_UNKNOWN: Used internally as an initial state before
+ * the mapping type has been calculated. Exported routines for the API
+ * will never return this value.
+ */
+ PCI_P2PDMA_MAP_UNKNOWN = 0,
+
+ /*
+ * Not a PCI P2PDMA transfer.
+ */
+ PCI_P2PDMA_MAP_NONE,
+
+ /*
+ * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
+ * traverse the host bridge and the host bridge is not in the
+ * allowlist. DMA Mapping routines should return an error when
+ * this is returned.
+ */
+ PCI_P2PDMA_MAP_NOT_SUPPORTED,
+
+ /*
+ * PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to
+ * each other directly through a PCI switch and the transaction will
+ * not traverse the host bridge. Such a mapping should program
+ * the DMA engine with PCI bus addresses.
+ */
+ PCI_P2PDMA_MAP_BUS_ADDR,
+
+ /*
+ * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
+ * to each other, but the transaction traverses a host bridge on the
+ * allowlist. In this case, a normal mapping either with CPU physical
+ * addresses (in the case of dma-direct) or IOVA addresses (in the
+ * case of IOMMUs) should be used to program the DMA engine.
+ */
+ PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
+};
+
+struct pci_p2pdma_map_state {
+ struct dev_pagemap *pgmap;
+ enum pci_p2pdma_map_type map;
+ u64 bus_off;
+};
+
+/* helper for pci_p2pdma_state(), do not use directly */
+void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
+ struct device *dev, struct page *page);
+
+/**
+ * pci_p2pdma_state - check the P2P transfer state of a page
+ * @state: P2P state structure
+ * @dev: device to transfer to/from
+ * @page: page to map
+ *
+ * Check if @page is a PCI P2PDMA page, and if yes of what kind. Returns the
+ * map type, and updates @state with all information needed for a P2P transfer.
+ */
+static inline enum pci_p2pdma_map_type
+pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev,
+ struct page *page)
+{
+ if (IS_ENABLED(CONFIG_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
+ if (state->pgmap != page_pgmap(page))
+ __pci_p2pdma_update_state(state, dev, page);
+ return state->map;
+ }
+ return PCI_P2PDMA_MAP_NONE;
+}
+
+/**
+ * pci_p2pdma_bus_addr_map - Translate a physical address to a bus address
+ * for a PCI_P2PDMA_MAP_BUS_ADDR transfer.
+ * @state: P2P state structure
+ * @paddr: physical address to map
+ *
+ * Map a physically contiguous PCI_P2PDMA_MAP_BUS_ADDR transfer.
+ */
+static inline dma_addr_t
+pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr)
+{
+ WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR);
+ return paddr + state->bus_off;
+}
+
#endif /* _LINUX_PCI_P2P_H */
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index cec43cd5ed62..24c359d9c879 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -13,6 +13,7 @@
#include <linux/vmalloc.h>
#include <linux/set_memory.h>
#include <linux/slab.h>
+#include <linux/pci-p2pdma.h>
#include "direct.h"
/*
--
2.49.0
* [PATCH v11 3/9] iommu: generalize the batched sync after map interface
2025-05-05 7:01 ` [PATCH v11 0/9] Provide a new two step DMA mapping API Leon Romanovsky
2025-05-05 7:01 ` [PATCH v11 1/9] PCI/P2PDMA: Refactor the p2pdma mapping helpers Leon Romanovsky
2025-05-05 7:01 ` [PATCH v11 2/9] dma-mapping: move the PCI P2PDMA mapping helpers to pci-p2pdma.h Leon Romanovsky
@ 2025-05-05 7:01 ` Leon Romanovsky
2025-05-05 7:01 ` [PATCH v11 4/9] iommu: add kernel-doc for iommu_unmap_fast Leon Romanovsky
From: Leon Romanovsky @ 2025-05-05 7:01 UTC (permalink / raw)
Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Jake Edge,
Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun, Robin Murphy,
Joerg Roedel, Will Deacon, Sagi Grimberg, Bjorn Helgaas,
Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
Alex Williamson, Jérôme Glisse, Andrew Morton,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
Kanchan Joshi, Chaitanya Kulkarni, Jason Gunthorpe,
Leon Romanovsky
From: Christoph Hellwig <hch@lst.de>
For the upcoming IOVA-based DMA API we want to batch the
ops->iotlb_sync_map() call after mapping multiple IOVAs from
dma-iommu without having a scatterlist. Improve the API.
Add a wrapper for the iotlb_sync_map method as iommu_sync_map() so that
callers don't need to poke into the methods directly.
Formalize __iommu_map() into iommu_map_nosync() which requires the
caller to call iommu_sync_map() after all maps are completed.
Refactor the existing sanity checks from all the different layers
into iommu_map_nosync().
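As an illustration of the intended calling convention (a sketch, not code
from this patch; map_ranges_batched, phys[], len[] and nr are hypothetical),
a caller mapping several discontiguous ranges into one IOVA window would do:

	static int map_ranges_batched(struct iommu_domain *domain,
				      unsigned long iova, phys_addr_t *phys,
				      size_t *len, int nr, int prot)
	{
		size_t mapped = 0;
		int i, ret;

		for (i = 0; i < nr; i++) {
			ret = iommu_map_nosync(domain, iova + mapped, phys[i],
					       len[i], prot, GFP_KERNEL);
			if (ret)
				goto out_unmap;
			mapped += len[i];
		}

		/* a single IOTLB sync covers everything mapped above */
		ret = iommu_sync_map(domain, iova, mapped);
		if (!ret)
			return 0;

	out_unmap:
		iommu_unmap(domain, iova, mapped);
		return ret;
	}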
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Will Deacon <will@kernel.org>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/iommu/iommu.c | 65 +++++++++++++++++++------------------------
include/linux/iommu.h | 4 +++
2 files changed, 33 insertions(+), 36 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 4f91a740c15f..02960585b8d4 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2443,8 +2443,8 @@ static size_t iommu_pgsize(struct iommu_domain *domain, unsigned long iova,
return pgsize;
}
-static int __iommu_map(struct iommu_domain *domain, unsigned long iova,
- phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
+int iommu_map_nosync(struct iommu_domain *domain, unsigned long iova,
+ phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
{
const struct iommu_domain_ops *ops = domain->ops;
unsigned long orig_iova = iova;
@@ -2453,12 +2453,19 @@ static int __iommu_map(struct iommu_domain *domain, unsigned long iova,
phys_addr_t orig_paddr = paddr;
int ret = 0;
+ might_sleep_if(gfpflags_allow_blocking(gfp));
+
if (unlikely(!(domain->type & __IOMMU_DOMAIN_PAGING)))
return -EINVAL;
if (WARN_ON(!ops->map_pages || domain->pgsize_bitmap == 0UL))
return -ENODEV;
+ /* Discourage passing strange GFP flags */
+ if (WARN_ON_ONCE(gfp & (__GFP_COMP | __GFP_DMA | __GFP_DMA32 |
+ __GFP_HIGHMEM)))
+ return -EINVAL;
+
/* find out the minimum page size supported */
min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
@@ -2506,31 +2513,27 @@ static int __iommu_map(struct iommu_domain *domain, unsigned long iova,
return ret;
}
-int iommu_map(struct iommu_domain *domain, unsigned long iova,
- phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
+int iommu_sync_map(struct iommu_domain *domain, unsigned long iova, size_t size)
{
const struct iommu_domain_ops *ops = domain->ops;
- int ret;
-
- might_sleep_if(gfpflags_allow_blocking(gfp));
- /* Discourage passing strange GFP flags */
- if (WARN_ON_ONCE(gfp & (__GFP_COMP | __GFP_DMA | __GFP_DMA32 |
- __GFP_HIGHMEM)))
- return -EINVAL;
+ if (!ops->iotlb_sync_map)
+ return 0;
+ return ops->iotlb_sync_map(domain, iova, size);
+}
- ret = __iommu_map(domain, iova, paddr, size, prot, gfp);
- if (ret == 0 && ops->iotlb_sync_map) {
- ret = ops->iotlb_sync_map(domain, iova, size);
- if (ret)
- goto out_err;
- }
+int iommu_map(struct iommu_domain *domain, unsigned long iova,
+ phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
+{
+ int ret;
- return ret;
+ ret = iommu_map_nosync(domain, iova, paddr, size, prot, gfp);
+ if (ret)
+ return ret;
-out_err:
- /* undo mappings already done */
- iommu_unmap(domain, iova, size);
+ ret = iommu_sync_map(domain, iova, size);
+ if (ret)
+ iommu_unmap(domain, iova, size);
return ret;
}
@@ -2630,26 +2633,17 @@ ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
struct scatterlist *sg, unsigned int nents, int prot,
gfp_t gfp)
{
- const struct iommu_domain_ops *ops = domain->ops;
size_t len = 0, mapped = 0;
phys_addr_t start;
unsigned int i = 0;
int ret;
- might_sleep_if(gfpflags_allow_blocking(gfp));
-
- /* Discourage passing strange GFP flags */
- if (WARN_ON_ONCE(gfp & (__GFP_COMP | __GFP_DMA | __GFP_DMA32 |
- __GFP_HIGHMEM)))
- return -EINVAL;
-
while (i <= nents) {
phys_addr_t s_phys = sg_phys(sg);
if (len && s_phys != start + len) {
- ret = __iommu_map(domain, iova + mapped, start,
+ ret = iommu_map_nosync(domain, iova + mapped, start,
len, prot, gfp);
-
if (ret)
goto out_err;
@@ -2672,11 +2666,10 @@ ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
sg = sg_next(sg);
}
- if (ops->iotlb_sync_map) {
- ret = ops->iotlb_sync_map(domain, iova, mapped);
- if (ret)
- goto out_err;
- }
+ ret = iommu_sync_map(domain, iova, mapped);
+ if (ret)
+ goto out_err;
+
return mapped;
out_err:
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index ccce8a751e2a..ce472af8e9c3 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -872,6 +872,10 @@ extern struct iommu_domain *iommu_get_domain_for_dev(struct device *dev);
extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
phys_addr_t paddr, size_t size, int prot, gfp_t gfp);
+int iommu_map_nosync(struct iommu_domain *domain, unsigned long iova,
+ phys_addr_t paddr, size_t size, int prot, gfp_t gfp);
+int iommu_sync_map(struct iommu_domain *domain, unsigned long iova,
+ size_t size);
extern size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova,
size_t size);
extern size_t iommu_unmap_fast(struct iommu_domain *domain,
--
2.49.0
* [PATCH v11 4/9] iommu: add kernel-doc for iommu_unmap_fast
2025-05-05 7:01 ` [PATCH v11 0/9] Provide a new two step DMA mapping API Leon Romanovsky
2025-05-05 7:01 ` [PATCH v11 3/9] iommu: generalize the batched sync after map interface Leon Romanovsky
@ 2025-05-05 7:01 ` Leon Romanovsky
2025-05-05 7:01 ` [PATCH v11 5/9] dma-mapping: Provide an interface to allow allocate IOVA Leon Romanovsky
From: Leon Romanovsky @ 2025-05-05 7:01 UTC (permalink / raw)
Cc: Leon Romanovsky, Jens Axboe, Christoph Hellwig, Keith Busch,
Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
Kevin Tian, Alex Williamson, Jérôme Glisse,
Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
Kanchan Joshi, Chaitanya Kulkarni, Jason Gunthorpe, Lu Baolu
From: Leon Romanovsky <leonro@nvidia.com>
Add a kernel-doc section for iommu_unmap_fast() to document an existing
limitation of the underlying functions, which can't split individual ranges.
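For reference, a minimal sketch of the batched-unmap pattern the kernel-doc
describes (illustrative only, not code from this patch):

	struct iommu_iotlb_gather gather;

	iommu_iotlb_gather_init(&gather);
	/* ranges passed here must match (or aggregate) prior iommu_map() calls */
	iommu_unmap_fast(domain, iova, size, &gather);
	/* the caller performs the batched IOTLB flush itself */
	iommu_iotlb_sync(domain, &gather);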
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/iommu/iommu.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 02960585b8d4..8619c355ef9c 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2621,6 +2621,25 @@ size_t iommu_unmap(struct iommu_domain *domain,
}
EXPORT_SYMBOL_GPL(iommu_unmap);
+/**
+ * iommu_unmap_fast() - Remove mappings from a range of IOVA without IOTLB sync
+ * @domain: Domain to manipulate
+ * @iova: IO virtual address to start
+ * @size: Length of the range starting from @iova
+ * @iotlb_gather: range information for a pending IOTLB flush
+ *
+ * iommu_unmap_fast() will remove a translation created by iommu_map().
+ * It can't subdivide a mapping created by iommu_map(), so it should be
+ * called with IOVA ranges that match what was passed to iommu_map(). The
+ * range can aggregate contiguous iommu_map() calls so long as no individual
+ * range is split.
+ *
+ * Basically iommu_unmap_fast() is the same as iommu_unmap() but for callers
+ * which manage the IOTLB flushing externally to perform a batched sync.
+ *
+ * Returns: Number of bytes of IOVA unmapped. iova + res will be the point
+ * unmapping stopped.
+ */
size_t iommu_unmap_fast(struct iommu_domain *domain,
unsigned long iova, size_t size,
struct iommu_iotlb_gather *iotlb_gather)
--
2.49.0
* [PATCH v11 5/9] dma-mapping: Provide an interface to allow allocate IOVA
2025-05-05 7:01 ` [PATCH v11 0/9] Provide a new two step DMA mapping API Leon Romanovsky
2025-05-05 7:01 ` [PATCH v11 4/9] iommu: add kernel-doc for iommu_unmap_fast Leon Romanovsky
@ 2025-05-05 7:01 ` Leon Romanovsky
2025-05-05 7:01 ` [PATCH v11 6/9] iommu/dma: Factor out a iommu_dma_map_swiotlb helper Leon Romanovsky
From: Leon Romanovsky @ 2025-05-05 7:01 UTC (permalink / raw)
Cc: Leon Romanovsky, Jens Axboe, Christoph Hellwig, Keith Busch,
Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
Kevin Tian, Alex Williamson, Jérôme Glisse,
Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
Kanchan Joshi, Chaitanya Kulkarni
From: Leon Romanovsky <leonro@nvidia.com>
The existing .map_pages() callback provides both allocation of IOVA
and linking of DMA pages. That combination works well for most of the
callers who use it in control paths, but is less effective in fast
paths where there may be multiple calls to map_page().
These advanced callers already manage their data in some sort of
database and can perform IOVA allocation in advance, leaving only the
range linkage operation in the fast path.
Provide an interface to allocate/deallocate IOVA; the next patch adds
link/unlink of DMA ranges to that specific IOVA.
In the new API a DMA mapping transaction is identified by a
struct dma_iova_state, which holds some precomputed information
for the transaction that does not change for each page being
mapped, so add a check whether IOVA can be used for the specific
transaction.
The API is exported from dma-iommu as it is the only supported
implementation; the namespace is clearly different from the iommu_*
functions, which are not allowed to be used directly. This code layout
allows us to save a function call per API call used in the datapath,
as well as a lot of boilerplate code.
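As a rough usage sketch (not from this patch), a subsystem would try the
IOVA path in its setup code, stash the state in its own per-transaction
structure, and free it on teardown; struct my_req and its fields are
hypothetical:

	struct my_req {
		struct dma_iova_state state;
		phys_addr_t first_phys;
		size_t total_len;
		/* ... */
	};

	static bool my_req_alloc_iova(struct device *dev, struct my_req *req)
	{
		/* false means: fall back to the regular dma_map_page() path */
		return dma_iova_try_alloc(dev, &req->state, req->first_phys,
					  req->total_len);
	}

	static void my_req_free_iova(struct device *dev, struct my_req *req)
	{
		if (dma_use_iova(&req->state))
			dma_iova_free(dev, &req->state); /* all links undone first */
	}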
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/iommu/dma-iommu.c | 86 +++++++++++++++++++++++++++++++++++++
include/linux/dma-mapping.h | 48 +++++++++++++++++++++
2 files changed, 134 insertions(+)
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 9ba8d8bc0ce9..d3211a8d755e 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1723,6 +1723,92 @@ size_t iommu_dma_max_mapping_size(struct device *dev)
return SIZE_MAX;
}
+/**
+ * dma_iova_try_alloc - Try to allocate an IOVA space
+ * @dev: Device to allocate the IOVA space for
+ * @state: IOVA state
+ * @phys: physical address
+ * @size: IOVA size
+ *
+ * Check if @dev supports the IOVA-based DMA API, and if yes allocate IOVA space
+ * for the given base address and size.
+ *
+ * Note: @phys is only used to calculate the IOVA alignment. Callers that always
+ * do PAGE_SIZE aligned transfers can safely pass 0 here.
+ *
+ * Returns %true if the IOVA-based DMA API can be used and IOVA space has been
+ * allocated, or %false if the regular DMA API should be used.
+ */
+bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
+ phys_addr_t phys, size_t size)
+{
+ struct iommu_dma_cookie *cookie;
+ struct iommu_domain *domain;
+ struct iova_domain *iovad;
+ size_t iova_off;
+ dma_addr_t addr;
+
+ memset(state, 0, sizeof(*state));
+ if (!use_dma_iommu(dev))
+ return false;
+
+ domain = iommu_get_dma_domain(dev);
+ cookie = domain->iova_cookie;
+ iovad = &cookie->iovad;
+ iova_off = iova_offset(iovad, phys);
+
+ if (static_branch_unlikely(&iommu_deferred_attach_enabled) &&
+ iommu_deferred_attach(dev, iommu_get_domain_for_dev(dev)))
+ return false;
+
+ if (WARN_ON_ONCE(!size))
+ return false;
+
+ /*
+ * DMA_IOVA_USE_SWIOTLB is flag which is set by dma-iommu
+ * internals, make sure that caller didn't set it and/or
+ * didn't use this interface to map SIZE_MAX.
+ */
+ if (WARN_ON_ONCE((u64)size & DMA_IOVA_USE_SWIOTLB))
+ return false;
+
+ addr = iommu_dma_alloc_iova(domain,
+ iova_align(iovad, size + iova_off),
+ dma_get_mask(dev), dev);
+ if (!addr)
+ return false;
+
+ state->addr = addr + iova_off;
+ state->__size = size;
+ return true;
+}
+EXPORT_SYMBOL_GPL(dma_iova_try_alloc);
+
+/**
+ * dma_iova_free - Free an IOVA space
+ * @dev: Device to free the IOVA space for
+ * @state: IOVA state
+ *
+ * Undoes a successful dma_try_iova_alloc().
+ *
+ * Note that all dma_iova_link() calls need to be undone first. For callers
+ * that never call dma_iova_unlink(), dma_iova_destroy() can be used instead
+ * which unlinks all ranges and frees the IOVA space in a single efficient
+ * operation.
+ */
+void dma_iova_free(struct device *dev, struct dma_iova_state *state)
+{
+ struct iommu_domain *domain = iommu_get_dma_domain(dev);
+ struct iommu_dma_cookie *cookie = domain->iova_cookie;
+ struct iova_domain *iovad = &cookie->iovad;
+ size_t iova_start_pad = iova_offset(iovad, state->addr);
+ size_t size = dma_iova_size(state);
+
+ iommu_dma_free_iova(domain, state->addr - iova_start_pad,
+ iova_align(iovad, size + iova_start_pad), NULL);
+}
+EXPORT_SYMBOL_GPL(dma_iova_free);
+
void iommu_setup_dma_ops(struct device *dev)
{
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index b79925b1c433..de7f73810d54 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -72,6 +72,22 @@
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
+struct dma_iova_state {
+ dma_addr_t addr;
+ u64 __size;
+};
+
+/*
+ * Use the high bit to mark if we used swiotlb for one or more ranges.
+ */
+#define DMA_IOVA_USE_SWIOTLB (1ULL << 63)
+
+static inline size_t dma_iova_size(struct dma_iova_state *state)
+{
+ /* Casting is needed for 32-bits systems */
+ return (size_t)(state->__size & ~DMA_IOVA_USE_SWIOTLB);
+}
+
#ifdef CONFIG_DMA_API_DEBUG
void debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr);
void debug_dma_map_single(struct device *dev, const void *addr,
@@ -277,6 +293,38 @@ static inline int dma_mmap_noncontiguous(struct device *dev,
}
#endif /* CONFIG_HAS_DMA */
+#ifdef CONFIG_IOMMU_DMA
+/**
+ * dma_use_iova - check if the IOVA API is used for this state
+ * @state: IOVA state
+ *
+ * Return %true if the DMA transfers uses the dma_iova_*() calls or %false if
+ * they can't be used.
+ */
+static inline bool dma_use_iova(struct dma_iova_state *state)
+{
+ return state->__size != 0;
+}
+
+bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
+ phys_addr_t phys, size_t size);
+void dma_iova_free(struct device *dev, struct dma_iova_state *state);
+#else /* CONFIG_IOMMU_DMA */
+static inline bool dma_use_iova(struct dma_iova_state *state)
+{
+ return false;
+}
+static inline bool dma_iova_try_alloc(struct device *dev,
+ struct dma_iova_state *state, phys_addr_t phys, size_t size)
+{
+ return false;
+}
+static inline void dma_iova_free(struct device *dev,
+ struct dma_iova_state *state)
+{
+}
+#endif /* CONFIG_IOMMU_DMA */
+
#if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
void __dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr, size_t size,
enum dma_data_direction dir);
--
2.49.0
* [PATCH v11 6/9] iommu/dma: Factor out a iommu_dma_map_swiotlb helper
2025-05-05 7:01 ` [PATCH v11 0/9] Provide a new two step DMA mapping API Leon Romanovsky
2025-05-05 7:01 ` [PATCH v11 5/9] dma-mapping: Provide an interface to allow allocate IOVA Leon Romanovsky
@ 2025-05-05 7:01 ` Leon Romanovsky
2025-05-05 7:01 ` [PATCH v11 7/9] dma-mapping: Implement link/unlink ranges API Leon Romanovsky
From: Leon Romanovsky @ 2025-05-05 7:01 UTC (permalink / raw)
Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Jake Edge,
Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun, Robin Murphy,
Joerg Roedel, Will Deacon, Sagi Grimberg, Bjorn Helgaas,
Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
Alex Williamson, Jérôme Glisse, Andrew Morton,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
Kanchan Joshi, Chaitanya Kulkarni, Lu Baolu, Leon Romanovsky
From: Christoph Hellwig <hch@lst.de>
Split the iommu logic from iommu_dma_map_page into a separate helper.
This not only keeps the code neatly separated, but will also allow for
reuse in another caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/iommu/dma-iommu.c | 73 ++++++++++++++++++++++-----------------
1 file changed, 41 insertions(+), 32 deletions(-)
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d3211a8d755e..d7684024c439 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1138,6 +1138,43 @@ void iommu_dma_sync_sg_for_device(struct device *dev, struct scatterlist *sgl,
arch_sync_dma_for_device(sg_phys(sg), sg->length, dir);
}
+static phys_addr_t iommu_dma_map_swiotlb(struct device *dev, phys_addr_t phys,
+ size_t size, enum dma_data_direction dir, unsigned long attrs)
+{
+ struct iommu_domain *domain = iommu_get_dma_domain(dev);
+ struct iova_domain *iovad = &domain->iova_cookie->iovad;
+
+ if (!is_swiotlb_active(dev)) {
+ dev_warn_once(dev, "DMA bounce buffers are inactive, unable to map unaligned transaction.\n");
+ return (phys_addr_t)DMA_MAPPING_ERROR;
+ }
+
+ trace_swiotlb_bounced(dev, phys, size);
+
+ phys = swiotlb_tbl_map_single(dev, phys, size, iova_mask(iovad), dir,
+ attrs);
+
+ /*
+ * Untrusted devices should not see padding areas with random leftover
+ * kernel data, so zero the pre- and post-padding.
+ * swiotlb_tbl_map_single() has initialized the bounce buffer proper to
+ * the contents of the original memory buffer.
+ */
+ if (phys != (phys_addr_t)DMA_MAPPING_ERROR && dev_is_untrusted(dev)) {
+ size_t start, virt = (size_t)phys_to_virt(phys);
+
+ /* Pre-padding */
+ start = iova_align_down(iovad, virt);
+ memset((void *)start, 0, virt - start);
+
+ /* Post-padding */
+ start = virt + size;
+ memset((void *)start, 0, iova_align(iovad, start) - start);
+ }
+
+ return phys;
+}
+
dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
unsigned long offset, size_t size, enum dma_data_direction dir,
unsigned long attrs)
@@ -1151,42 +1188,14 @@ dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
dma_addr_t iova, dma_mask = dma_get_mask(dev);
/*
- * If both the physical buffer start address and size are
- * page aligned, we don't need to use a bounce page.
+ * If both the physical buffer start address and size are page aligned,
+ * we don't need to use a bounce page.
*/
if (dev_use_swiotlb(dev, size, dir) &&
iova_offset(iovad, phys | size)) {
- if (!is_swiotlb_active(dev)) {
- dev_warn_once(dev, "DMA bounce buffers are inactive, unable to map unaligned transaction.\n");
- return DMA_MAPPING_ERROR;
- }
-
- trace_swiotlb_bounced(dev, phys, size);
-
- phys = swiotlb_tbl_map_single(dev, phys, size,
- iova_mask(iovad), dir, attrs);
-
- if (phys == DMA_MAPPING_ERROR)
+ phys = iommu_dma_map_swiotlb(dev, phys, size, dir, attrs);
+ if (phys == (phys_addr_t)DMA_MAPPING_ERROR)
return DMA_MAPPING_ERROR;
-
- /*
- * Untrusted devices should not see padding areas with random
- * leftover kernel data, so zero the pre- and post-padding.
- * swiotlb_tbl_map_single() has initialized the bounce buffer
- * proper to the contents of the original memory buffer.
- */
- if (dev_is_untrusted(dev)) {
- size_t start, virt = (size_t)phys_to_virt(phys);
-
- /* Pre-padding */
- start = iova_align_down(iovad, virt);
- memset((void *)start, 0, virt - start);
-
- /* Post-padding */
- start = virt + size;
- memset((void *)start, 0,
- iova_align(iovad, start) - start);
- }
}
if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
--
2.49.0
* [PATCH v11 7/9] dma-mapping: Implement link/unlink ranges API
2025-05-05 7:01 ` [PATCH v11 0/9] Provide a new two step DMA mapping API Leon Romanovsky
2025-05-05 7:01 ` [PATCH v11 6/9] iommu/dma: Factor out a iommu_dma_map_swiotlb helper Leon Romanovsky
@ 2025-05-05 7:01 ` Leon Romanovsky
2025-05-05 7:01 ` [PATCH v11 8/9] dma-mapping: add a dma_need_unmap helper Leon Romanovsky
From: Leon Romanovsky @ 2025-05-05 7:01 UTC (permalink / raw)
Cc: Leon Romanovsky, Jens Axboe, Christoph Hellwig, Keith Busch,
Jake Edge, Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun,
Robin Murphy, Joerg Roedel, Will Deacon, Sagi Grimberg,
Bjorn Helgaas, Logan Gunthorpe, Yishai Hadas, Shameer Kolothum,
Kevin Tian, Alex Williamson, Jérôme Glisse,
Andrew Morton, linux-doc, linux-kernel, linux-block, linux-rdma,
iommu, linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
Kanchan Joshi, Chaitanya Kulkarni
From: Leon Romanovsky <leonro@nvidia.com>
Introduce new DMA APIs to perform DMA linkage of buffers
in layers higher than DMA.
In the proposed API, the callers will perform the following steps.
In map path:
if (dma_can_use_iova(...))
dma_iova_alloc()
for (page in range)
dma_iova_link_next(...)
dma_iova_sync(...)
else
/* Fallback to legacy map pages */
for (all pages)
dma_map_page(...)
In unmap path:
if (dma_can_use_iova(...))
dma_iova_destroy()
else
for (all pages)
dma_unmap_page(...)
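Expressed with the concrete helpers added here and in the IOVA allocation
patch (a sketch only; mapped_len, dma_addr[], nr, dir and attrs are assumed
to be tracked by the caller), the unmap path looks roughly like:

	if (dma_use_iova(&state)) {
		/* undo the links, then release the IOVA space */
		dma_iova_unlink(dev, &state, 0, mapped_len, dir, attrs);
		dma_iova_free(dev, &state);
	} else {
		for (i = 0; i < nr; i++)
			dma_unmap_page(dev, dma_addr[i], PAGE_SIZE, dir);
	}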
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/iommu/dma-iommu.c | 275 +++++++++++++++++++++++++++++++++++-
include/linux/dma-mapping.h | 32 +++++
2 files changed, 306 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d7684024c439..98f7205ec8fb 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1175,6 +1175,17 @@ static phys_addr_t iommu_dma_map_swiotlb(struct device *dev, phys_addr_t phys,
return phys;
}
+/*
+ * Checks if a physical buffer has unaligned boundaries with respect to
+ * the IOMMU granule. Returns non-zero if either the start or end
+ * address is not aligned to the granule boundary.
+ */
+static inline size_t iova_unaligned(struct iova_domain *iovad, phys_addr_t phys,
+ size_t size)
+{
+ return iova_offset(iovad, phys | size);
+}
+
dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
unsigned long offset, size_t size, enum dma_data_direction dir,
unsigned long attrs)
@@ -1192,7 +1203,7 @@ dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
* we don't need to use a bounce page.
*/
if (dev_use_swiotlb(dev, size, dir) &&
- iova_offset(iovad, phys | size)) {
+ iova_unaligned(iovad, phys, size)) {
phys = iommu_dma_map_swiotlb(dev, phys, size, dir, attrs);
if (phys == (phys_addr_t)DMA_MAPPING_ERROR)
return DMA_MAPPING_ERROR;
@@ -1818,6 +1829,268 @@ void dma_iova_free(struct device *dev, struct dma_iova_state *state)
}
EXPORT_SYMBOL_GPL(dma_iova_free);
+static int __dma_iova_link(struct device *dev, dma_addr_t addr,
+ phys_addr_t phys, size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ bool coherent = dev_is_dma_coherent(dev);
+
+ if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+ arch_sync_dma_for_device(phys, size, dir);
+
+ return iommu_map_nosync(iommu_get_dma_domain(dev), addr, phys, size,
+ dma_info_to_prot(dir, coherent, attrs), GFP_ATOMIC);
+}
+
+static int iommu_dma_iova_bounce_and_link(struct device *dev, dma_addr_t addr,
+ phys_addr_t phys, size_t bounce_len,
+ enum dma_data_direction dir, unsigned long attrs,
+ size_t iova_start_pad)
+{
+ struct iommu_domain *domain = iommu_get_dma_domain(dev);
+ struct iova_domain *iovad = &domain->iova_cookie->iovad;
+ phys_addr_t bounce_phys;
+ int error;
+
+ bounce_phys = iommu_dma_map_swiotlb(dev, phys, bounce_len, dir, attrs);
+ if (bounce_phys == DMA_MAPPING_ERROR)
+ return -ENOMEM;
+
+ error = __dma_iova_link(dev, addr - iova_start_pad,
+ bounce_phys - iova_start_pad,
+ iova_align(iovad, bounce_len), dir, attrs);
+ if (error)
+ swiotlb_tbl_unmap_single(dev, bounce_phys, bounce_len, dir,
+ attrs);
+ return error;
+}
+
+static int iommu_dma_iova_link_swiotlb(struct device *dev,
+ struct dma_iova_state *state, phys_addr_t phys, size_t offset,
+ size_t size, enum dma_data_direction dir, unsigned long attrs)
+{
+ struct iommu_domain *domain = iommu_get_dma_domain(dev);
+ struct iommu_dma_cookie *cookie = domain->iova_cookie;
+ struct iova_domain *iovad = &cookie->iovad;
+ size_t iova_start_pad = iova_offset(iovad, phys);
+ size_t iova_end_pad = iova_offset(iovad, phys + size);
+ dma_addr_t addr = state->addr + offset;
+ size_t mapped = 0;
+ int error;
+
+ if (iova_start_pad) {
+ size_t bounce_len = min(size, iovad->granule - iova_start_pad);
+
+ error = iommu_dma_iova_bounce_and_link(dev, addr, phys,
+ bounce_len, dir, attrs, iova_start_pad);
+ if (error)
+ return error;
+ state->__size |= DMA_IOVA_USE_SWIOTLB;
+
+ mapped += bounce_len;
+ size -= bounce_len;
+ if (!size)
+ return 0;
+ }
+
+ size -= iova_end_pad;
+ error = __dma_iova_link(dev, addr + mapped, phys + mapped, size, dir,
+ attrs);
+ if (error)
+ goto out_unmap;
+ mapped += size;
+
+ if (iova_end_pad) {
+ error = iommu_dma_iova_bounce_and_link(dev, addr + mapped,
+ phys + mapped, iova_end_pad, dir, attrs, 0);
+ if (error)
+ goto out_unmap;
+ state->__size |= DMA_IOVA_USE_SWIOTLB;
+ }
+
+ return 0;
+
+out_unmap:
+ dma_iova_unlink(dev, state, 0, mapped, dir, attrs);
+ return error;
+}
+
+/**
+ * dma_iova_link - Link a range of IOVA space
+ * @dev: DMA device
+ * @state: IOVA state
+ * @phys: physical address to link
+ * @offset: offset into the IOVA state to map into
+ * @size: size of the buffer
+ * @dir: DMA direction
+ * @attrs: attributes of mapping properties
+ *
+ * Link a range of IOVA space for the given IOVA state without an IOTLB sync.
+ * This function is used to link multiple physical addresses in contiguous
+ * IOVA space without performing costly IOTLB syncs.
+ *
+ * The caller is responsible for calling dma_iova_sync() to sync the IOTLB at
+ * the end of linkage.
+ */
+int dma_iova_link(struct device *dev, struct dma_iova_state *state,
+ phys_addr_t phys, size_t offset, size_t size,
+ enum dma_data_direction dir, unsigned long attrs)
+{
+ struct iommu_domain *domain = iommu_get_dma_domain(dev);
+ struct iommu_dma_cookie *cookie = domain->iova_cookie;
+ struct iova_domain *iovad = &cookie->iovad;
+ size_t iova_start_pad = iova_offset(iovad, phys);
+
+ if (WARN_ON_ONCE(iova_start_pad && offset > 0))
+ return -EIO;
+
+ if (dev_use_swiotlb(dev, size, dir) &&
+ iova_unaligned(iovad, phys, size))
+ return iommu_dma_iova_link_swiotlb(dev, state, phys, offset,
+ size, dir, attrs);
+
+ return __dma_iova_link(dev, state->addr + offset - iova_start_pad,
+ phys - iova_start_pad,
+ iova_align(iovad, size + iova_start_pad), dir, attrs);
+}
+EXPORT_SYMBOL_GPL(dma_iova_link);
+
+/**
+ * dma_iova_sync - Sync IOTLB
+ * @dev: DMA device
+ * @state: IOVA state
+ * @offset: offset into the IOVA state to sync
+ * @size: size of the buffer
+ *
+ * Sync the IOTLB for the given IOVA state. This function should be called on
+ * the IOVA-contiguous range created by one or more dma_iova_link() calls
+ * to sync the IOTLB.
+ */
+int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
+ size_t offset, size_t size)
+{
+ struct iommu_domain *domain = iommu_get_dma_domain(dev);
+ struct iommu_dma_cookie *cookie = domain->iova_cookie;
+ struct iova_domain *iovad = &cookie->iovad;
+ dma_addr_t addr = state->addr + offset;
+ size_t iova_start_pad = iova_offset(iovad, addr);
+
+ return iommu_sync_map(domain, addr - iova_start_pad,
+ iova_align(iovad, size + iova_start_pad));
+}
+EXPORT_SYMBOL_GPL(dma_iova_sync);
+
+static void iommu_dma_iova_unlink_range_slow(struct device *dev,
+ dma_addr_t addr, size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ struct iommu_domain *domain = iommu_get_dma_domain(dev);
+ struct iommu_dma_cookie *cookie = domain->iova_cookie;
+ struct iova_domain *iovad = &cookie->iovad;
+ size_t iova_start_pad = iova_offset(iovad, addr);
+ dma_addr_t end = addr + size;
+
+ do {
+ phys_addr_t phys;
+ size_t len;
+
+ phys = iommu_iova_to_phys(domain, addr);
+ if (WARN_ON(!phys))
+ /* Something went horribly wrong here */
+ return;
+
+ len = min_t(size_t,
+ end - addr, iovad->granule - iova_start_pad);
+
+ if (!dev_is_dma_coherent(dev) &&
+ !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+ arch_sync_dma_for_cpu(phys, len, dir);
+
+ swiotlb_tbl_unmap_single(dev, phys, len, dir, attrs);
+
+ addr += len;
+ iova_start_pad = 0;
+ } while (addr < end);
+}
+
+static void __iommu_dma_iova_unlink(struct device *dev,
+ struct dma_iova_state *state, size_t offset, size_t size,
+ enum dma_data_direction dir, unsigned long attrs,
+ bool free_iova)
+{
+ struct iommu_domain *domain = iommu_get_dma_domain(dev);
+ struct iommu_dma_cookie *cookie = domain->iova_cookie;
+ struct iova_domain *iovad = &cookie->iovad;
+ dma_addr_t addr = state->addr + offset;
+ size_t iova_start_pad = iova_offset(iovad, addr);
+ struct iommu_iotlb_gather iotlb_gather;
+ size_t unmapped;
+
+ if ((state->__size & DMA_IOVA_USE_SWIOTLB) ||
+ (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)))
+ iommu_dma_iova_unlink_range_slow(dev, addr, size, dir, attrs);
+
+ iommu_iotlb_gather_init(&iotlb_gather);
+ iotlb_gather.queued = free_iova && READ_ONCE(cookie->fq_domain);
+
+ size = iova_align(iovad, size + iova_start_pad);
+ addr -= iova_start_pad;
+ unmapped = iommu_unmap_fast(domain, addr, size, &iotlb_gather);
+ WARN_ON(unmapped != size);
+
+ if (!iotlb_gather.queued)
+ iommu_iotlb_sync(domain, &iotlb_gather);
+ if (free_iova)
+ iommu_dma_free_iova(domain, addr, size, &iotlb_gather);
+}
+
+/**
+ * dma_iova_unlink - Unlink a range of IOVA space
+ * @dev: DMA device
+ * @state: IOVA state
+ * @offset: offset into the IOVA state to unlink
+ * @size: size of the buffer
+ * @dir: DMA direction
+ * @attrs: attributes of mapping properties
+ *
+ * Unlink a range of IOVA space for the given IOVA state.
+ */
+void dma_iova_unlink(struct device *dev, struct dma_iova_state *state,
+ size_t offset, size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ __iommu_dma_iova_unlink(dev, state, offset, size, dir, attrs, false);
+}
+EXPORT_SYMBOL_GPL(dma_iova_unlink);
+
+/**
+ * dma_iova_destroy - Finish a DMA mapping transaction
+ * @dev: DMA device
+ * @state: IOVA state
+ * @mapped_len: number of bytes to unmap
+ * @dir: DMA direction
+ * @attrs: attributes of mapping properties
+ *
+ * Unlink the IOVA range up to @mapped_len and free the entire IOVA space. The
+ * IOVA range from the DMA address up to @mapped_len must be fully linked, and
+ * must be the only linked range in @state.
+ */
+void dma_iova_destroy(struct device *dev, struct dma_iova_state *state,
+ size_t mapped_len, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ if (mapped_len)
+ __iommu_dma_iova_unlink(dev, state, 0, mapped_len, dir, attrs,
+ true);
+ else
+ /*
+ * We can get here if the first call to dma_iova_link() failed and
+ * there is nothing to unlink, so let's be explicit.
+ dma_iova_free(dev, state);
+}
+EXPORT_SYMBOL_GPL(dma_iova_destroy);
+
void iommu_setup_dma_ops(struct device *dev)
{
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index de7f73810d54..a71e110f1e9d 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -309,6 +309,17 @@ static inline bool dma_use_iova(struct dma_iova_state *state)
bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
phys_addr_t phys, size_t size);
void dma_iova_free(struct device *dev, struct dma_iova_state *state);
+void dma_iova_destroy(struct device *dev, struct dma_iova_state *state,
+ size_t mapped_len, enum dma_data_direction dir,
+ unsigned long attrs);
+int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
+ size_t offset, size_t size);
+int dma_iova_link(struct device *dev, struct dma_iova_state *state,
+ phys_addr_t phys, size_t offset, size_t size,
+ enum dma_data_direction dir, unsigned long attrs);
+void dma_iova_unlink(struct device *dev, struct dma_iova_state *state,
+ size_t offset, size_t size, enum dma_data_direction dir,
+ unsigned long attrs);
#else /* CONFIG_IOMMU_DMA */
static inline bool dma_use_iova(struct dma_iova_state *state)
{
@@ -323,6 +334,27 @@ static inline void dma_iova_free(struct device *dev,
struct dma_iova_state *state)
{
}
+static inline void dma_iova_destroy(struct device *dev,
+ struct dma_iova_state *state, size_t mapped_len,
+ enum dma_data_direction dir, unsigned long attrs)
+{
+}
+static inline int dma_iova_sync(struct device *dev,
+ struct dma_iova_state *state, size_t offset, size_t size)
+{
+ return -EOPNOTSUPP;
+}
+static inline int dma_iova_link(struct device *dev,
+ struct dma_iova_state *state, phys_addr_t phys, size_t offset,
+ size_t size, enum dma_data_direction dir, unsigned long attrs)
+{
+ return -EOPNOTSUPP;
+}
+static inline void dma_iova_unlink(struct device *dev,
+ struct dma_iova_state *state, size_t offset, size_t size,
+ enum dma_data_direction dir, unsigned long attrs)
+{
+}
#endif /* CONFIG_IOMMU_DMA */
#if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
--
2.49.0
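A minimal sketch of the one-shot flow these helpers enable, for readers who want
to see the call order in one place. This is not part of the patch;
my_map_buffer(), the DMA direction and the assumption of a physically
contiguous buffer are all illustrative.

	/*
	 * Hypothetical driver fragment: map a physically contiguous buffer
	 * with the two-step API. Returns -EOPNOTSUPP if no IOVA space could
	 * be allocated, in which case the caller is expected to fall back to
	 * the streaming API (dma_map_page()/dma_map_sg()).
	 */
	static int my_map_buffer(struct device *dev, struct dma_iova_state *state,
				 phys_addr_t phys, size_t len, dma_addr_t *dma)
	{
		int ret;

		if (!dma_iova_try_alloc(dev, state, phys, len))
			return -EOPNOTSUPP;

		/* Link the whole range at offset 0 of the allocated IOVA space. */
		ret = dma_iova_link(dev, state, phys, 0, len, DMA_TO_DEVICE, 0);
		if (ret)
			goto out_free;

		/* One IOTLB sync covers all previous dma_iova_link() calls. */
		ret = dma_iova_sync(dev, state, 0, len);
		if (ret)
			goto out_destroy;

		*dma = state->addr;
		return 0;

	out_destroy:
		dma_iova_destroy(dev, state, len, DMA_TO_DEVICE, 0);
		return ret;
	out_free:
		dma_iova_free(dev, state);
		return ret;
	}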
* [PATCH v11 8/9] dma-mapping: add a dma_need_unmap helper
2025-05-05 7:01 ` [PATCH v11 0/9] Provide a new two step DMA mapping API Leon Romanovsky
` (6 preceding siblings ...)
2025-05-05 7:01 ` [PATCH v11 7/9] dma-mapping: Implement link/unlink ranges API Leon Romanovsky
@ 2025-05-05 7:01 ` Leon Romanovsky
2025-05-05 7:01 ` [PATCH v11 9/9] docs: core-api: document the IOVA-based API Leon Romanovsky
` (2 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Leon Romanovsky @ 2025-05-05 7:01 UTC (permalink / raw)
Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Jake Edge,
Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun, Robin Murphy,
Joerg Roedel, Will Deacon, Sagi Grimberg, Bjorn Helgaas,
Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
Alex Williamson, Jérôme Glisse, Andrew Morton,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
Kanchan Joshi, Chaitanya Kulkarni, Leon Romanovsky
From: Christoph Hellwig <hch@lst.de>
Add a helper that allows a driver to skip calling dma_unmap_*
if the DMA layer can guarantee that they are no-ops.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
include/linux/dma-mapping.h | 5 +++++
kernel/dma/mapping.c | 18 ++++++++++++++++++
2 files changed, 23 insertions(+)
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index a71e110f1e9d..d2f358c5a25d 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -406,6 +406,7 @@ static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
{
return dma_dev_need_sync(dev) ? __dma_need_sync(dev, dma_addr) : false;
}
+bool dma_need_unmap(struct device *dev);
#else /* !CONFIG_HAS_DMA || !CONFIG_DMA_NEED_SYNC */
static inline bool dma_dev_need_sync(const struct device *dev)
{
@@ -431,6 +432,10 @@ static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
{
return false;
}
+static inline bool dma_need_unmap(struct device *dev)
+{
+ return false;
+}
#endif /* !CONFIG_HAS_DMA || !CONFIG_DMA_NEED_SYNC */
struct page *dma_alloc_pages(struct device *dev, size_t size,
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index cda127027e48..3c3204ad2839 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -443,6 +443,24 @@ bool __dma_need_sync(struct device *dev, dma_addr_t dma_addr)
}
EXPORT_SYMBOL_GPL(__dma_need_sync);
+/**
+ * dma_need_unmap - does this device need dma_unmap_* operations
+ * @dev: device to check
+ *
+ * If this function returns %false, drivers can skip calling dma_unmap_* after
+ * finishing an I/O. This function must be called after all mappings that might
+ * need to be unmapped have been performed.
+ */
+bool dma_need_unmap(struct device *dev)
+{
+ if (!dma_map_direct(dev, get_dma_ops(dev)))
+ return true;
+ if (!dev->dma_skip_sync)
+ return true;
+ return IS_ENABLED(CONFIG_DMA_API_DEBUG);
+}
+EXPORT_SYMBOL_GPL(dma_need_unmap);
+
static void dma_setup_need_sync(struct device *dev)
{
const struct dma_map_ops *ops = get_dma_ops(dev);
--
2.49.0
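A short hypothetical illustration of the intended use (the request structure
and function names are made up for this sketch): the answer is recorded once,
after all mappings for the I/O have been made, and the unmap work is skipped
entirely when it would be a no-op.

	struct my_request {
		struct device *dev;
		dma_addr_t dma_addr;
		size_t len;
		bool need_unmap;
	};

	static void my_request_mapped(struct my_request *req)
	{
		/* Must run after all mappings for this request are set up. */
		req->need_unmap = dma_need_unmap(req->dev);
	}

	static void my_request_complete(struct my_request *req)
	{
		/* Skip dma_unmap_*() when the DMA layer guarantees a no-op. */
		if (req->need_unmap)
			dma_unmap_page(req->dev, req->dma_addr, req->len,
				       DMA_FROM_DEVICE);
	}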
* [PATCH v11 9/9] docs: core-api: document the IOVA-based API
2025-05-05 7:01 ` [PATCH v11 0/9] Provide a new two step DMA mapping API Leon Romanovsky
` (7 preceding siblings ...)
2025-05-05 7:01 ` [PATCH v11 8/9] dma-mapping: add a dma_need_unmap helper Leon Romanovsky
@ 2025-05-05 7:01 ` Leon Romanovsky
2025-05-05 12:29 ` [PATCH v11 0/9] Provide a new two step DMA mapping API Jason Gunthorpe
2025-05-06 8:55 ` Marek Szyprowski
10 siblings, 0 replies; 12+ messages in thread
From: Leon Romanovsky @ 2025-05-05 7:01 UTC (permalink / raw)
Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Jake Edge,
Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun, Robin Murphy,
Joerg Roedel, Will Deacon, Sagi Grimberg, Bjorn Helgaas,
Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
Alex Williamson, Jérôme Glisse, Andrew Morton,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
Kanchan Joshi, Chaitanya Kulkarni, Leon Romanovsky
From: Christoph Hellwig <hch@lst.de>
Add an explanation of the newly added IOVA-based mapping API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
Documentation/core-api/dma-api.rst | 71 ++++++++++++++++++++++++++++++
1 file changed, 71 insertions(+)
diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst
index 8e3cce3d0a23..2ad08517e626 100644
--- a/Documentation/core-api/dma-api.rst
+++ b/Documentation/core-api/dma-api.rst
@@ -530,6 +530,77 @@ routines, e.g.:::
....
}
+Part Ie - IOVA-based DMA mappings
+---------------------------------
+
+These APIs allow a very efficient mapping when using an IOMMU. They are an
+optional path that requires extra code and are only recommended for drivers
+where DMA mapping performance or the space used to store the DMA addresses
+matters. All the considerations from the previous section apply here as well.
+
+::
+
+ bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
+ phys_addr_t phys, size_t size);
+
+Is used to try to allocate IOVA space for the mapping operation. If it returns
+false this API can't be used for the given device and the normal streaming
+DMA mapping API should be used. The ``struct dma_iova_state`` is allocated
+by the driver and must be kept around until unmap time.
+
+::
+
+ static inline bool dma_use_iova(struct dma_iova_state *state)
+
+Can be used by the driver to check if the IOVA-based API is used after a
+call to dma_iova_try_alloc(). This can be useful in the unmap path.
+
+::
+
+ int dma_iova_link(struct device *dev, struct dma_iova_state *state,
+ phys_addr_t phys, size_t offset, size_t size,
+ enum dma_data_direction dir, unsigned long attrs);
+
+Is used to link ranges to the IOVA previously allocated. The start of all
+but the first call to dma_iova_link() for a given state must be aligned
+to the DMA merge boundary returned by ``dma_get_merge_boundary()``, and
+the size of all but the last range must be aligned to the DMA merge boundary
+as well.
+
+::
+
+ int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
+ size_t offset, size_t size);
+
+Must be called to sync the IOMMU page tables for the IOVA range mapped by one or
+more calls to ``dma_iova_link()``.
+
+For drivers that use a one-shot mapping, all ranges can be unmapped and the
+IOVA freed by calling:
+
+::
+
+ void dma_iova_destroy(struct device *dev, struct dma_iova_state *state,
+ size_t mapped_len, enum dma_data_direction dir,
+ unsigned long attrs);
+
+Alternatively, drivers can dynamically manage the IOVA space by unmapping
+and mapping individual regions. In that case
+
+::
+
+ void dma_iova_unlink(struct device *dev, struct dma_iova_state *state,
+ size_t offset, size_t size, enum dma_data_direction dir,
+ unsigned long attrs);
+
+is used to unmap a range previously mapped, and
+
+::
+
+ void dma_iova_free(struct device *dev, struct dma_iova_state *state);
+
+is used to free the IOVA space. All regions must have been unmapped using
+``dma_iova_unlink()`` before calling ``dma_iova_free()``.
Part II - Non-coherent DMA allocations
--------------------------------------
--
2.49.0
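To make the two teardown models described above concrete, here is a brief
hypothetical sketch (names invented for illustration; offsets and sizes are
assumed to respect the merge-boundary alignment rules quoted above):

	/* One-shot mapping: unlink everything that was linked, free the IOVA. */
	static void my_teardown_oneshot(struct device *dev,
					struct dma_iova_state *state,
					size_t mapped_len)
	{
		if (dma_use_iova(state))
			dma_iova_destroy(dev, state, mapped_len,
					 DMA_BIDIRECTIONAL, 0);
	}

	/* Dynamic management: replace one region, keep the IOVA space alive. */
	static int my_replace_region(struct device *dev,
				     struct dma_iova_state *state,
				     size_t offset, size_t size,
				     phys_addr_t new_phys)
	{
		int ret;

		dma_iova_unlink(dev, state, offset, size, DMA_BIDIRECTIONAL, 0);
		ret = dma_iova_link(dev, state, new_phys, offset, size,
				    DMA_BIDIRECTIONAL, 0);
		if (ret)
			return ret;
		return dma_iova_sync(dev, state, offset, size);
	}

	/* Final teardown for the dynamic case: all regions already unlinked. */
	static void my_teardown_dynamic(struct device *dev,
					struct dma_iova_state *state)
	{
		dma_iova_free(dev, state);
	}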
* Re: [PATCH v11 0/9] Provide a new two step DMA mapping API
2025-05-05 7:01 ` [PATCH v11 0/9] Provide a new two step DMA mapping API Leon Romanovsky
` (8 preceding siblings ...)
2025-05-05 7:01 ` [PATCH v11 9/9] docs: core-api: document the IOVA-based API Leon Romanovsky
@ 2025-05-05 12:29 ` Jason Gunthorpe
2025-05-06 8:55 ` Marek Szyprowski
10 siblings, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2025-05-05 12:29 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Marek Szyprowski, Jens Axboe, Christoph Hellwig, Keith Busch,
Jake Edge, Jonathan Corbet, Zhu Yanjun, Robin Murphy,
Joerg Roedel, Will Deacon, Sagi Grimberg, Bjorn Helgaas,
Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
Alex Williamson, Jérôme Glisse, Andrew Morton,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
Kanchan Joshi, Chaitanya Kulkarni
On Mon, May 05, 2025 at 10:01:37AM +0300, Leon Romanovsky wrote:
> Hi Marek,
>
> These are the DMA/IOMMU patches only, which have not seen functional
> changes for a while. They are tested and reviewed and ready to merge.
>
> We will work with relevant subsystems to merge rest of the conversion
> patches. At least some of them will be done in next cycle to reduce
> merge conflicts.
Please let's have this on a branch so I can do the rdma parts this
cycle.
Thanks,
Jason
* Re: [PATCH v11 0/9] Provide a new two step DMA mapping API
2025-05-05 7:01 ` [PATCH v11 0/9] Provide a new two step DMA mapping API Leon Romanovsky
` (9 preceding siblings ...)
2025-05-05 12:29 ` [PATCH v11 0/9] Provide a new two step DMA mapping API Jason Gunthorpe
@ 2025-05-06 8:55 ` Marek Szyprowski
10 siblings, 0 replies; 12+ messages in thread
From: Marek Szyprowski @ 2025-05-06 8:55 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Jens Axboe, Christoph Hellwig, Keith Busch, Jake Edge,
Jonathan Corbet, Jason Gunthorpe, Zhu Yanjun, Robin Murphy,
Joerg Roedel, Will Deacon, Sagi Grimberg, Bjorn Helgaas,
Logan Gunthorpe, Yishai Hadas, Shameer Kolothum, Kevin Tian,
Alex Williamson, Jérôme Glisse, Andrew Morton,
linux-doc, linux-kernel, linux-block, linux-rdma, iommu,
linux-nvme, linux-pci, kvm, linux-mm, Niklas Schnelle,
Chuck Lever, Luis Chamberlain, Matthew Wilcox, Dan Williams,
Kanchan Joshi, Chaitanya Kulkarni
On 05.05.2025 09:01, Leon Romanovsky wrote:
> Hi Marek,
>
> These are the DMA/IOMMU patches only, which have not seen functional
> changes for a while. They are tested and reviewed and ready to merge.
>
> We will work with relevant subsystems to merge rest of the conversion
> patches. At least some of them will be done in next cycle to reduce
> merge conflicts.
>
> Thanks
>
> =========================================================================
> Following recent on site LSF/MM 2025 [1] discussion, the overall
> response was extremely positive with many people expressed their
> desire to see this series merged, so they can base their work on it.
>
> It includes, but not limited:
> * Luis's "nvme-pci: breaking the 512 KiB max IO boundary":
> https://lore.kernel.org/all/20250320111328.2841690-1-mcgrof@kernel.org/
> * Chuck's NFS conversion to use one structure (bio_vec) for all types
> of RPC transports:
> https://lore.kernel.org/all/913df4b4-fc4a-409d-9007-088a3e2c8291@oracle.com
> * Matthew's vision for the world without struct page:
> https://lore.kernel.org/all/Z-WRQOYEvOWlI34w@casper.infradead.org/
> * Confidential computing roadmap from Dan:
> https://lore.kernel.org/all/6801a8e3968da_71fe29411@dwillia2-xfh.jf.intel.com.notmuch
>
> This series is combination of effort of many people who contributed ideas,
> code and testing and I'm gratefully thankful for them.
Thanks to everyone involved in this contribution. I appreciate the effort
of showing that such a new API is really needed and will be used by other
subsystems. I see benefits from this approach and I hope that any
pending issues can be resolved incrementally.
I've applied this patchset to the dma-mapping-next branch and it will
also be available as the dma-mapping-for-6.16-two-step-api [1] stable branch
for those who want to base their pending work on it.
[1]
https://web.git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux.git/log/?h=dma-mapping-for-6.16-two-step-api
> ...
Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland