linux-mm.kvack.org archive mirror
* [PATCH net-next v2 0/3] Fix late DMA unmap crash for page pool
@ 2025-03-25 15:45 Toke Høiland-Jørgensen
  2025-03-25 15:45 ` [PATCH net-next v2 1/3] page_pool: Move pp_magic check into helper functions Toke Høiland-Jørgensen
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-25 15:45 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Andrew Lunn,
	Eric Dumazet, Paolo Abeni, Ilias Apalodimas, Simon Horman,
	Andrew Morton, Mina Almasry, Yonglong Liu, Yunsheng Lin,
	Pavel Begunkov, Matthew Wilcox
  Cc: netdev, bpf, linux-rdma, linux-mm,
	Toke Høiland-Jørgensen, Qiuling Ren, Yuying Ma

This series fixes the late dma_unmap crash for page pool first reported
by Yonglong Liu in [0]. It is an alternative approach to the one
submitted by Yunsheng Lin, most recently in [1]. The first two commits
are small refactors of the page pool code, in preparation for the main
change in patch 3. See the commit message of patch 3 for the details.

-Toke

[0] https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
[1] https://lore.kernel.org/r/20250307092356.638242-1-linyunsheng@huawei.com

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
Changes in v2:
- Always leave two bits at the top of pp_magic as zero, instead of one

- Add an rcu_read_lock() around __page_pool_dma_sync_for_device()

- Add a comment in poison.h with a reference to the bitmask definition

- Add a longer description of the logic of the bitmask definitions to
  the comment in types.h, and a summary of the security implications of
  using the pp_magic field to the commit message of patch 3

- Collect Mina's Reviewed-by and Yonglong's Tested-by tags

- Link to v1: https://lore.kernel.org/r/20250314-page-pool-track-dma-v1-0-c212e57a74c2@redhat.com

---
Toke Høiland-Jørgensen (3):
      page_pool: Move pp_magic check into helper functions
      page_pool: Turn dma_sync and dma_sync_cpu fields into a bitmap
      page_pool: Track DMA-mapped pages and unmap them when destroying the pool

 drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c |  4 +-
 include/linux/poison.h                           |  4 ++
 include/net/page_pool/helpers.h                  |  6 +-
 include/net/page_pool/types.h                    | 67 +++++++++++++++++-
 mm/page_alloc.c                                  |  9 +--
 net/core/devmem.c                                |  3 +-
 net/core/netmem_priv.h                           | 33 ++++++++-
 net/core/page_pool.c                             | 87 ++++++++++++++++++++----
 net/core/skbuff.c                                | 16 +----
 net/core/xdp.c                                   |  4 +-
 10 files changed, 186 insertions(+), 47 deletions(-)
---
base-commit: 45e36a8e3c17c4d50ecbc863893f253fb46ac070
change-id: 20250310-page-pool-track-dma-0332343a460e




* [PATCH net-next v2 1/3] page_pool: Move pp_magic check into helper functions
  2025-03-25 15:45 [PATCH net-next v2 0/3] Fix late DMA unmap crash for page pool Toke Høiland-Jørgensen
@ 2025-03-25 15:45 ` Toke Høiland-Jørgensen
  2025-03-25 15:45 ` [PATCH net-next v2 2/3] page_pool: Turn dma_sync and dma_sync_cpu fields into a bitmap Toke Høiland-Jørgensen
  2025-03-25 15:45 ` [PATCH net-next v2 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool Toke Høiland-Jørgensen
  2 siblings, 0 replies; 17+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-25 15:45 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Andrew Lunn,
	Eric Dumazet, Paolo Abeni, Ilias Apalodimas, Simon Horman,
	Andrew Morton, Mina Almasry, Yonglong Liu, Yunsheng Lin,
	Pavel Begunkov, Matthew Wilcox
  Cc: netdev, bpf, linux-rdma, linux-mm, Toke Høiland-Jørgensen

Since we are about to stash some more information into the pp_magic
field, let's move the magic signature checks into a pair of helper
functions so the check only needs to be changed in one place.

Reviewed-by: Mina Almasry <almasrymina@google.com>
Tested-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c |  4 ++--
 include/net/page_pool/types.h                    | 18 ++++++++++++++++++
 mm/page_alloc.c                                  |  9 +++------
 net/core/netmem_priv.h                           |  5 +++++
 net/core/skbuff.c                                | 16 ++--------------
 net/core/xdp.c                                   |  4 ++--
 6 files changed, 32 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 6f3094a479e1ec61854bb48a6a0c812167487173..70c6f0b2abb921778c98fbd428594ebd7986a302 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -706,8 +706,8 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
 				xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo);
 				page = xdpi.page.page;
 
-				/* No need to check ((page->pp_magic & ~0x3UL) == PP_SIGNATURE)
-				 * as we know this is a page_pool page.
+				/* No need to check page_pool_page_is_pp() as we
+				 * know this is a page_pool page.
 				 */
 				page_pool_recycle_direct(page->pp, page);
 			} while (++n < num);
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 36eb57d73abc6cfc601e700ca08be20fb8281055..df0d3c1608929605224feb26173135ff37951ef8 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -54,6 +54,14 @@ struct pp_alloc_cache {
 	netmem_ref cache[PP_ALLOC_CACHE_SIZE];
 };
 
+/* Mask used for checking in page_pool_page_is_pp() below. page->pp_magic is
+ * OR'ed with PP_SIGNATURE after the allocation in order to preserve bit 0 for
+ * the head page of compound page and bit 1 for pfmemalloc page.
+ * page_is_pfmemalloc() is checked in __page_pool_put_page() to avoid recycling
+ * the pfmemalloc page.
+ */
+#define PP_MAGIC_MASK ~0x3UL
+
 /**
  * struct page_pool_params - page pool parameters
  * @fast:	params accessed frequently on hotpath
@@ -264,6 +272,11 @@ void page_pool_destroy(struct page_pool *pool);
 void page_pool_use_xdp_mem(struct page_pool *pool, void (*disconnect)(void *),
 			   const struct xdp_mem_info *mem);
 void page_pool_put_netmem_bulk(netmem_ref *data, u32 count);
+
+static inline bool page_pool_page_is_pp(struct page *page)
+{
+	return (page->pp_magic & PP_MAGIC_MASK) == PP_SIGNATURE;
+}
 #else
 static inline void page_pool_destroy(struct page_pool *pool)
 {
@@ -278,6 +291,11 @@ static inline void page_pool_use_xdp_mem(struct page_pool *pool,
 static inline void page_pool_put_netmem_bulk(netmem_ref *data, u32 count)
 {
 }
+
+static inline bool page_pool_page_is_pp(struct page *page)
+{
+	return false;
+}
 #endif
 
 void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 542d25f77be80304b731411ffd29b276ee13be0c..3535ee76afe946cbb042ecbce603bdbedc9233b9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -55,6 +55,7 @@
 #include <linux/delayacct.h>
 #include <linux/cacheinfo.h>
 #include <linux/pgalloc_tag.h>
+#include <net/page_pool/types.h>
 #include <asm/div64.h>
 #include "internal.h"
 #include "shuffle.h"
@@ -872,9 +873,7 @@ static inline bool page_expected_state(struct page *page,
 #ifdef CONFIG_MEMCG
 			page->memcg_data |
 #endif
-#ifdef CONFIG_PAGE_POOL
-			((page->pp_magic & ~0x3UL) == PP_SIGNATURE) |
-#endif
+			page_pool_page_is_pp(page) |
 			(page->flags & check_flags)))
 		return false;
 
@@ -901,10 +900,8 @@ static const char *page_bad_reason(struct page *page, unsigned long flags)
 	if (unlikely(page->memcg_data))
 		bad_reason = "page still charged to cgroup";
 #endif
-#ifdef CONFIG_PAGE_POOL
-	if (unlikely((page->pp_magic & ~0x3UL) == PP_SIGNATURE))
+	if (unlikely(page_pool_page_is_pp(page)))
 		bad_reason = "page_pool leak";
-#endif
 	return bad_reason;
 }
 
diff --git a/net/core/netmem_priv.h b/net/core/netmem_priv.h
index 7eadb8393e002fd1cc2cef8a313d2ea7df76f301..f33162fd281c23e109273ba09950c5d0a2829bc9 100644
--- a/net/core/netmem_priv.h
+++ b/net/core/netmem_priv.h
@@ -18,6 +18,11 @@ static inline void netmem_clear_pp_magic(netmem_ref netmem)
 	__netmem_clear_lsb(netmem)->pp_magic = 0;
 }
 
+static inline bool netmem_is_pp(netmem_ref netmem)
+{
+	return (netmem_get_pp_magic(netmem) & PP_MAGIC_MASK) == PP_SIGNATURE;
+}
+
 static inline void netmem_set_pp(netmem_ref netmem, struct page_pool *pool)
 {
 	__netmem_clear_lsb(netmem)->pp = pool;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ab8acb737b93299f503e5c298b87e18edd59d555..a64d777488e403d5fdef83ae42ae9e4924c1a0dc 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -893,11 +893,6 @@ static void skb_clone_fraglist(struct sk_buff *skb)
 		skb_get(list);
 }
 
-static bool is_pp_netmem(netmem_ref netmem)
-{
-	return (netmem_get_pp_magic(netmem) & ~0x3UL) == PP_SIGNATURE;
-}
-
 int skb_pp_cow_data(struct page_pool *pool, struct sk_buff **pskb,
 		    unsigned int headroom)
 {
@@ -995,14 +990,7 @@ bool napi_pp_put_page(netmem_ref netmem)
 {
 	netmem = netmem_compound_head(netmem);
 
-	/* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
-	 * in order to preserve any existing bits, such as bit 0 for the
-	 * head page of compound page and bit 1 for pfmemalloc page, so
-	 * mask those bits for freeing side when doing below checking,
-	 * and page_is_pfmemalloc() is checked in __page_pool_put_page()
-	 * to avoid recycling the pfmemalloc page.
-	 */
-	if (unlikely(!is_pp_netmem(netmem)))
+	if (unlikely(!netmem_is_pp(netmem)))
 		return false;
 
 	page_pool_put_full_netmem(netmem_get_pp(netmem), netmem, false);
@@ -1042,7 +1030,7 @@ static int skb_pp_frag_ref(struct sk_buff *skb)
 
 	for (i = 0; i < shinfo->nr_frags; i++) {
 		head_netmem = netmem_compound_head(shinfo->frags[i].netmem);
-		if (likely(is_pp_netmem(head_netmem)))
+		if (likely(netmem_is_pp(head_netmem)))
 			page_pool_ref_netmem(head_netmem);
 		else
 			page_ref_inc(netmem_to_page(head_netmem));
diff --git a/net/core/xdp.c b/net/core/xdp.c
index f86eedad586a77eb63a96a85aa6d068d3e94f077..0ba73943c6eed873b3d1c681b3b9a802b590f2d9 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -437,8 +437,8 @@ void __xdp_return(netmem_ref netmem, enum xdp_mem_type mem_type,
 		netmem = netmem_compound_head(netmem);
 		if (napi_direct && xdp_return_frame_no_direct())
 			napi_direct = false;
-		/* No need to check ((page->pp_magic & ~0x3UL) == PP_SIGNATURE)
-		 * as mem->type knows this a page_pool page
+		/* No need to check netmem_is_pp() as mem->type knows this a
+		 * page_pool page
 		 */
 		page_pool_put_full_netmem(netmem_get_pp(netmem), netmem,
 					  napi_direct);

-- 
2.48.1




* [PATCH net-next v2 2/3] page_pool: Turn dma_sync and dma_sync_cpu fields into a bitmap
  2025-03-25 15:45 [PATCH net-next v2 0/3] Fix late DMA unmap crash for page pool Toke Høiland-Jørgensen
  2025-03-25 15:45 ` [PATCH net-next v2 1/3] page_pool: Move pp_magic check into helper functions Toke Høiland-Jørgensen
@ 2025-03-25 15:45 ` Toke Høiland-Jørgensen
  2025-03-25 22:17   ` Jakub Kicinski
  2025-03-26 18:00   ` Saeed Mahameed
  2025-03-25 15:45 ` [PATCH net-next v2 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool Toke Høiland-Jørgensen
  2 siblings, 2 replies; 17+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-25 15:45 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Andrew Lunn,
	Eric Dumazet, Paolo Abeni, Ilias Apalodimas, Simon Horman,
	Andrew Morton, Mina Almasry, Yonglong Liu, Yunsheng Lin,
	Pavel Begunkov, Matthew Wilcox
  Cc: netdev, bpf, linux-rdma, linux-mm, Toke Høiland-Jørgensen

Change the single-bit booleans for dma_sync into an unsigned long with
BIT() definitions so that a subsequent patch can write them both with a
single WRITE_ONCE() on teardown. Also move the check for the sync_cpu
side into __page_pool_dma_sync_for_cpu() so it can be disabled for
non-netmem providers as well.

Reviewed-by: Mina Almasry <almasrymina@google.com>
Tested-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 include/net/page_pool/helpers.h | 6 +++---
 include/net/page_pool/types.h   | 8 ++++++--
 net/core/devmem.c               | 3 +--
 net/core/page_pool.c            | 9 +++++----
 4 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index 582a3d00cbe2315edeb92850b6a42ab21e509e45..7ed32bde4b8944deb7fb22e291e95b8487be681a 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -443,6 +443,9 @@ static inline void __page_pool_dma_sync_for_cpu(const struct page_pool *pool,
 						const dma_addr_t dma_addr,
 						u32 offset, u32 dma_sync_size)
 {
+	if (!(READ_ONCE(pool->dma_sync) & PP_DMA_SYNC_CPU))
+		return;
+
 	dma_sync_single_range_for_cpu(pool->p.dev, dma_addr,
 				      offset + pool->p.offset, dma_sync_size,
 				      page_pool_get_dma_dir(pool));
@@ -473,9 +476,6 @@ page_pool_dma_sync_netmem_for_cpu(const struct page_pool *pool,
 				  const netmem_ref netmem, u32 offset,
 				  u32 dma_sync_size)
 {
-	if (!pool->dma_sync_for_cpu)
-		return;
-
 	__page_pool_dma_sync_for_cpu(pool,
 				     page_pool_get_dma_addr_netmem(netmem),
 				     offset, dma_sync_size);
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index df0d3c1608929605224feb26173135ff37951ef8..fbe34024b20061e8bcd1d4474f6ebfc70992f1eb 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -33,6 +33,10 @@
 #define PP_FLAG_ALL		(PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV | \
 				 PP_FLAG_SYSTEM_POOL | PP_FLAG_ALLOW_UNREADABLE_NETMEM)
 
+/* bit values used in pp->dma_sync */
+#define PP_DMA_SYNC_DEV	BIT(0)
+#define PP_DMA_SYNC_CPU	BIT(1)
+
 /*
  * Fast allocation side cache array/stack
  *
@@ -175,12 +179,12 @@ struct page_pool {
 
 	bool has_init_callback:1;	/* slow::init_callback is set */
 	bool dma_map:1;			/* Perform DMA mapping */
-	bool dma_sync:1;		/* Perform DMA sync for device */
-	bool dma_sync_for_cpu:1;	/* Perform DMA sync for cpu */
 #ifdef CONFIG_PAGE_POOL_STATS
 	bool system:1;			/* This is a global percpu pool */
 #endif
 
+	unsigned long dma_sync;
+
 	__cacheline_group_begin_aligned(frag, PAGE_POOL_FRAG_GROUP_ALIGN);
 	long frag_users;
 	netmem_ref frag_page;
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 6802e82a4d03b6030f6df50ae3661f81e40bc101..955d392d707b12fe784747aa2040ce1a882a64db 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -340,8 +340,7 @@ int mp_dmabuf_devmem_init(struct page_pool *pool)
 	/* dma-buf dma addresses do not need and should not be used with
 	 * dma_sync_for_cpu/device. Force disable dma_sync.
 	 */
-	pool->dma_sync = false;
-	pool->dma_sync_for_cpu = false;
+	pool->dma_sync = 0;
 
 	if (pool->p.order != 0)
 		return -E2BIG;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index acef1fcd8ddcfd1853a6f2055c1f1820ab248e8d..d51ca4389dd62d8bc266a9a2b792838257173535 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -203,7 +203,7 @@ static int page_pool_init(struct page_pool *pool,
 	memcpy(&pool->slow, &params->slow, sizeof(pool->slow));
 
 	pool->cpuid = cpuid;
-	pool->dma_sync_for_cpu = true;
+	pool->dma_sync = PP_DMA_SYNC_CPU;
 
 	/* Validate only known flags were used */
 	if (pool->slow.flags & ~PP_FLAG_ALL)
@@ -238,7 +238,7 @@ static int page_pool_init(struct page_pool *pool,
 		if (!pool->p.max_len)
 			return -EINVAL;
 
-		pool->dma_sync = true;
+		pool->dma_sync |= PP_DMA_SYNC_DEV;
 
 		/* pool->p.offset has to be set according to the address
 		 * offset used by the DMA engine to start copying rx data
@@ -291,7 +291,7 @@ static int page_pool_init(struct page_pool *pool,
 	}
 
 	if (pool->mp_ops) {
-		if (!pool->dma_map || !pool->dma_sync)
+		if (!pool->dma_map || !(pool->dma_sync & PP_DMA_SYNC_DEV))
 			return -EOPNOTSUPP;
 
 		if (WARN_ON(!is_kernel_rodata((unsigned long)pool->mp_ops))) {
@@ -466,7 +466,8 @@ page_pool_dma_sync_for_device(const struct page_pool *pool,
 			      netmem_ref netmem,
 			      u32 dma_sync_size)
 {
-	if (pool->dma_sync && dma_dev_need_sync(pool->p.dev))
+	if ((READ_ONCE(pool->dma_sync) & PP_DMA_SYNC_DEV) &&
+	    dma_dev_need_sync(pool->p.dev))
 		__page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
 }
 

-- 
2.48.1




* [PATCH net-next v2 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
  2025-03-25 15:45 [PATCH net-next v2 0/3] Fix late DMA unmap crash for page pool Toke Høiland-Jørgensen
  2025-03-25 15:45 ` [PATCH net-next v2 1/3] page_pool: Move pp_magic check into helper functions Toke Høiland-Jørgensen
  2025-03-25 15:45 ` [PATCH net-next v2 2/3] page_pool: Turn dma_sync and dma_sync_cpu fields into a bitmap Toke Høiland-Jørgensen
@ 2025-03-25 15:45 ` Toke Høiland-Jørgensen
  2025-03-26 13:54   ` Jesper Dangaard Brouer
  2025-03-26 18:22   ` Saeed Mahameed
  2 siblings, 2 replies; 17+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-25 15:45 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Andrew Lunn,
	Eric Dumazet, Paolo Abeni, Ilias Apalodimas, Simon Horman,
	Andrew Morton, Mina Almasry, Yonglong Liu, Yunsheng Lin,
	Pavel Begunkov, Matthew Wilcox
  Cc: netdev, bpf, linux-rdma, linux-mm,
	Toke Høiland-Jørgensen, Qiuling Ren, Yuying Ma

When enabling DMA mapping in page_pool, pages are kept DMA mapped until
they are released from the pool, to avoid the overhead of re-mapping the
pages every time they are used. This causes resource leaks and/or
crashes when there are pages still outstanding while the device is torn
down, because page_pool will attempt an unmap through a non-existent DMA
device on the subsequent page return.

To fix this, implement a simple tracking of outstanding DMA-mapped pages
in page pool using an xarray. This was first suggested by Mina[0], and
turns out to be fairly straightforward: We simply store pointers to
pages directly in the xarray with xa_alloc() when they are first DMA
mapped, and remove them from the array on unmap. Then, when a page pool
is torn down, it can simply walk the xarray and unmap all pages still
present there before returning, which also allows us to get rid of the
get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional
synchronisation is needed, as a page will only ever be unmapped once.
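
For illustration, a condensed sketch of this flow (the real implementation is
in the diff below; the helper names here are made up):

static bool pp_track_dma_map(struct page_pool *pool, netmem_ref netmem,
			     gfp_t gfp)
{
	u32 id;

	/* remember the page for as long as it stays DMA-mapped */
	if (xa_alloc(&pool->dma_mapped, &id, netmem_to_page(netmem),
		     PP_DMA_INDEX_LIMIT, gfp))
		return false;

	netmem_set_dma_index(netmem, id);	/* id stashed in pp_magic */
	return true;
}

static void pp_untrack_dma_unmap(struct page_pool *pool, netmem_ref netmem)
{
	unsigned long id = netmem_get_dma_index(netmem);

	/* whoever wins the cmpxchg is the one that gets to dma_unmap */
	if (xa_cmpxchg(&pool->dma_mapped, id, netmem_to_page(netmem),
		       NULL, 0) == netmem_to_page(netmem))
		netmem_set_dma_index(netmem, 0);
}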

To avoid having to walk the entire xarray on unmap to find the page
reference, we stash the ID assigned by xa_alloc() into the page
structure itself, using the upper bits of the pp_magic field. This
requires a couple of defines to avoid conflicting with the
POISON_POINTER_DELTA define, but this is all evaluated at compile-time,
so does not affect run-time performance. The bitmap calculations in this
patch give the following number of bits for different architectures:

- 23 bits on 32-bit architectures
- 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
- 32 bits on other 64-bit architectures
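
As a worked example of how these numbers fall out of the macros added below
(assuming the usual 0xdead000000000000 value of POISON_POINTER_DELTA on
x86-64):

  PP_DMA_INDEX_SHIFT = 1 + __fls(PP_SIGNATURE - POISON_POINTER_DELTA)
                     = 1 + __fls(0x40) = 7

  x86-64:  PP_DMA_INDEX_BITS = MIN(32, __ffs(POISON_POINTER_DELTA) - 7)
                             = MIN(32, 48 - 7) = 32

  32-bit (POISON_POINTER_DELTA == 0):
           PP_DMA_INDEX_BITS = MIN(32, BITS_PER_LONG - 7 - 2)
                             = MIN(32, 32 - 9) = 23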

Stashing a value into the unused bits of pp_magic does have the effect
that it can make the value stored there lie outside the unmappable
range (as governed by the mmap_min_addr sysctl), for architectures that
don't define ILLEGAL_POINTER_VALUE. This means that if one of the
pointers that is aliased to the pp_magic field (such as page->lru.next)
is dereferenced while the page is owned by page_pool, that could lead to
a dereference into userspace, which is a security concern. The risk of
this is mitigated by the fact that (a) we always clear pp_magic before
releasing a page from page_pool, and (b) this would need a
use-after-free bug for struct page, which can have many other risks
since page->lru.next is used as a generic list pointer in multiple
places in the kernel. As such, with this patch we take the position that
this risk is negligible in practice. For more discussion, see[1].

Since all the tracking added in this patch is performed on DMA
map/unmap, no additional code is needed in the fast path, meaning the
performance overhead of this tracking is negligible there. A
micro-benchmark shows that the total overhead of the tracking itself is
about 400 ns (39 cycles(tsc) 395.218 ns; sum for both map and unmap[2]).
Since this cost is only paid on DMA map and unmap, it seems like an
acceptable cost to fix the late unmap issue. Further optimisation can
narrow the cases where this cost is paid (for instance by eliding the
tracking when DMA map/unmap is a no-op).

The extra memory needed to track the pages is neatly encapsulated inside
xarray, which uses the 'struct xa_node' structure to track items. This
structure is 576 bytes long, with slots for 64 items, meaning that a
full node incurs only 9 bytes of overhead per slot it tracks (in
practice, it probably won't be this efficient, but in any case it should
be an acceptable overhead).

[0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
[1] https://lore.kernel.org/r/20250320023202.GA25514@openwall.com
[2] https://lore.kernel.org/r/ae07144c-9295-4c9d-a400-153bb689fe9e@huawei.com

Reported-by: Yonglong Liu <liuyonglong@huawei.com>
Closes: https://lore.kernel.org/r/8743264a-9700-4227-a556-5f931c720211@huawei.com
Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code")
Suggested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Qiuling Ren <qren@redhat.com>
Tested-by: Yuying Ma <yuma@redhat.com>
Tested-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 include/linux/poison.h        |  4 +++
 include/net/page_pool/types.h | 49 +++++++++++++++++++++++---
 net/core/netmem_priv.h        | 28 ++++++++++++++-
 net/core/page_pool.c          | 82 ++++++++++++++++++++++++++++++++++++-------
 4 files changed, 145 insertions(+), 18 deletions(-)

diff --git a/include/linux/poison.h b/include/linux/poison.h
index 331a9a996fa8746626afa63ea462b85ca3e5938b..5351efd710d5e21cc341f7bb533b1aeea4a0808a 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -70,6 +70,10 @@
 #define KEY_DESTROY		0xbd
 
 /********** net/core/page_pool.c **********/
+/*
+ * page_pool uses additional free bits within this value to store data, see the
+ * definition of PP_DMA_INDEX_MASK in include/net/page_pool/types.h
+ */
 #define PP_SIGNATURE		(0x40 + POISON_POINTER_DELTA)
 
 /********** net/core/skbuff.c **********/
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index fbe34024b20061e8bcd1d4474f6ebfc70992f1eb..8f9ed92a4b2af6c136406c41b1a829c8e52ba3e2 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -6,6 +6,7 @@
 #include <linux/dma-direction.h>
 #include <linux/ptr_ring.h>
 #include <linux/types.h>
+#include <linux/xarray.h>
 #include <net/netmem.h>
 
 #define PP_FLAG_DMA_MAP		BIT(0) /* Should page_pool do the DMA
@@ -58,13 +59,51 @@ struct pp_alloc_cache {
 	netmem_ref cache[PP_ALLOC_CACHE_SIZE];
 };
 
+/*
+ * DMA mapping IDs
+ *
+ * When DMA-mapping a page, we allocate an ID (from an xarray) and stash this in
+ * the upper bits of page->pp_magic. We always want to be able to unambiguously
+ * identify page pool pages (using page_pool_page_is_pp()). Non-PP pages can
+ * have arbitrary kernel pointers stored in the same field as pp_magic (since it
+ * overlaps with page->lru.next), so we must ensure that we cannot mistake a
+ * valid kernel pointer with any of the values we write into this field.
+ *
+ * On architectures that set POISON_POINTER_DELTA, this is already ensured,
+ * since this value becomes part of PP_SIGNATURE; meaning we can just use the
+ * space between the PP_SIGNATURE value (without POISON_POINTER_DELTA), and the
+ * lowest bits of POISON_POINTER_DELTA. On arches where POISON_POINTER_DELTA is
+ * 0, we make sure that we leave the two topmost bits empty, as that guarantees
+ * we won't mistake a valid kernel pointer for a value we set, regardless of the
+ * VMSPLIT setting.
+ *
+ * Altogether, this means that the number of bits available is constrained by
+ * the size of an unsigned long (at the upper end, subtracting two bits per the
+ * above), and the definition of PP_SIGNATURE (with or without
+ * POISON_POINTER_DELTA).
+ */
+#define PP_DMA_INDEX_SHIFT (1 + __fls(PP_SIGNATURE - POISON_POINTER_DELTA))
+#if POISON_POINTER_DELTA > 0
+/* PP_SIGNATURE includes POISON_POINTER_DELTA, so limit the size of the DMA
+ * index to not overlap with that if set
+ */
+#define PP_DMA_INDEX_BITS MIN(32, __ffs(POISON_POINTER_DELTA) - PP_DMA_INDEX_SHIFT)
+#else
+/* Always leave out the topmost two; see above. */
+#define PP_DMA_INDEX_BITS MIN(32, BITS_PER_LONG - PP_DMA_INDEX_SHIFT - 2)
+#endif
+
+#define PP_DMA_INDEX_MASK GENMASK(PP_DMA_INDEX_BITS + PP_DMA_INDEX_SHIFT - 1, \
+				  PP_DMA_INDEX_SHIFT)
+#define PP_DMA_INDEX_LIMIT XA_LIMIT(1, BIT(PP_DMA_INDEX_BITS) - 1)
+
 /* Mask used for checking in page_pool_page_is_pp() below. page->pp_magic is
  * OR'ed with PP_SIGNATURE after the allocation in order to preserve bit 0 for
- * the head page of compound page and bit 1 for pfmemalloc page.
- * page_is_pfmemalloc() is checked in __page_pool_put_page() to avoid recycling
- * the pfmemalloc page.
+ * the head page of compound page and bit 1 for pfmemalloc page, as well as the
+ * bits used for the DMA index. page_is_pfmemalloc() is checked in
+ * __page_pool_put_page() to avoid recycling the pfmemalloc page.
  */
-#define PP_MAGIC_MASK ~0x3UL
+#define PP_MAGIC_MASK ~(PP_DMA_INDEX_MASK | 0x3UL)
 
 /**
  * struct page_pool_params - page pool parameters
@@ -233,6 +272,8 @@ struct page_pool {
 	void *mp_priv;
 	const struct memory_provider_ops *mp_ops;
 
+	struct xarray dma_mapped;
+
 #ifdef CONFIG_PAGE_POOL_STATS
 	/* recycle stats are per-cpu to avoid locking */
 	struct page_pool_recycle_stats __percpu *recycle_stats;
diff --git a/net/core/netmem_priv.h b/net/core/netmem_priv.h
index f33162fd281c23e109273ba09950c5d0a2829bc9..cd95394399b40c3604934ba7898eeeeacb8aee99 100644
--- a/net/core/netmem_priv.h
+++ b/net/core/netmem_priv.h
@@ -5,7 +5,7 @@
 
 static inline unsigned long netmem_get_pp_magic(netmem_ref netmem)
 {
-	return __netmem_clear_lsb(netmem)->pp_magic;
+	return __netmem_clear_lsb(netmem)->pp_magic & ~PP_DMA_INDEX_MASK;
 }
 
 static inline void netmem_or_pp_magic(netmem_ref netmem, unsigned long pp_magic)
@@ -15,6 +15,8 @@ static inline void netmem_or_pp_magic(netmem_ref netmem, unsigned long pp_magic)
 
 static inline void netmem_clear_pp_magic(netmem_ref netmem)
 {
+	WARN_ON_ONCE(__netmem_clear_lsb(netmem)->pp_magic & PP_DMA_INDEX_MASK);
+
 	__netmem_clear_lsb(netmem)->pp_magic = 0;
 }
 
@@ -33,4 +35,28 @@ static inline void netmem_set_dma_addr(netmem_ref netmem,
 {
 	__netmem_clear_lsb(netmem)->dma_addr = dma_addr;
 }
+
+static inline unsigned long netmem_get_dma_index(netmem_ref netmem)
+{
+	unsigned long magic;
+
+	if (WARN_ON_ONCE(netmem_is_net_iov(netmem)))
+		return 0;
+
+	magic = __netmem_clear_lsb(netmem)->pp_magic;
+
+	return (magic & PP_DMA_INDEX_MASK) >> PP_DMA_INDEX_SHIFT;
+}
+
+static inline void netmem_set_dma_index(netmem_ref netmem,
+					unsigned long id)
+{
+	unsigned long magic;
+
+	if (WARN_ON_ONCE(netmem_is_net_iov(netmem)))
+		return;
+
+	magic = netmem_get_pp_magic(netmem) | (id << PP_DMA_INDEX_SHIFT);
+	__netmem_clear_lsb(netmem)->pp_magic = magic;
+}
 #endif
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index d51ca4389dd62d8bc266a9a2b792838257173535..55acb4bc2c57893486f222e9f39b48a09b0d78d0 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -226,6 +226,8 @@ static int page_pool_init(struct page_pool *pool,
 			return -EINVAL;
 
 		pool->dma_map = true;
+
+		xa_init_flags(&pool->dma_mapped, XA_FLAGS_ALLOC1);
 	}
 
 	if (pool->slow.flags & PP_FLAG_DMA_SYNC_DEV) {
@@ -275,9 +277,6 @@ static int page_pool_init(struct page_pool *pool,
 	/* Driver calling page_pool_create() also call page_pool_destroy() */
 	refcount_set(&pool->user_cnt, 1);
 
-	if (pool->dma_map)
-		get_device(pool->p.dev);
-
 	if (pool->slow.flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM) {
 		/* We rely on rtnl_lock()ing to make sure netdev_rx_queue
 		 * configuration doesn't change while we're initializing
@@ -325,7 +324,7 @@ static void page_pool_uninit(struct page_pool *pool)
 	ptr_ring_cleanup(&pool->ring, NULL);
 
 	if (pool->dma_map)
-		put_device(pool->p.dev);
+		xa_destroy(&pool->dma_mapped);
 
 #ifdef CONFIG_PAGE_POOL_STATS
 	if (!pool->system)
@@ -466,14 +465,20 @@ page_pool_dma_sync_for_device(const struct page_pool *pool,
 			      netmem_ref netmem,
 			      u32 dma_sync_size)
 {
-	if ((READ_ONCE(pool->dma_sync) & PP_DMA_SYNC_DEV) &&
-	    dma_dev_need_sync(pool->p.dev))
-		__page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
+	if (dma_dev_need_sync(pool->p.dev)) {
+		rcu_read_lock();
+		if (READ_ONCE(pool->dma_sync) & PP_DMA_SYNC_DEV)
+			__page_pool_dma_sync_for_device(pool, netmem,
+							dma_sync_size);
+		rcu_read_unlock();
+	}
 }
 
-static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
+static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem, gfp_t gfp)
 {
 	dma_addr_t dma;
+	int err;
+	u32 id;
 
 	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
 	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
@@ -487,15 +492,28 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
 	if (dma_mapping_error(pool->p.dev, dma))
 		return false;
 
-	if (page_pool_set_dma_addr_netmem(netmem, dma))
+	if (in_softirq())
+		err = xa_alloc(&pool->dma_mapped, &id, netmem_to_page(netmem),
+			       PP_DMA_INDEX_LIMIT, gfp);
+	else
+		err = xa_alloc_bh(&pool->dma_mapped, &id, netmem_to_page(netmem),
+				  PP_DMA_INDEX_LIMIT, gfp);
+	if (err) {
+		WARN_ONCE(1, "couldn't track DMA mapping, please report to netdev@");
 		goto unmap_failed;
+	}
 
+	if (page_pool_set_dma_addr_netmem(netmem, dma)) {
+		WARN_ONCE(1, "unexpected DMA address, please report to netdev@");
+		goto unmap_failed;
+	}
+
+	netmem_set_dma_index(netmem, id);
 	page_pool_dma_sync_for_device(pool, netmem, pool->p.max_len);
 
 	return true;
 
 unmap_failed:
-	WARN_ONCE(1, "unexpected DMA address, please report to netdev@");
 	dma_unmap_page_attrs(pool->p.dev, dma,
 			     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
 			     DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
@@ -512,7 +530,7 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
 	if (unlikely(!page))
 		return NULL;
 
-	if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page)))) {
+	if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page), gfp))) {
 		put_page(page);
 		return NULL;
 	}
@@ -558,7 +576,7 @@ static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool,
 	 */
 	for (i = 0; i < nr_pages; i++) {
 		netmem = pool->alloc.cache[i];
-		if (dma_map && unlikely(!page_pool_dma_map(pool, netmem))) {
+		if (dma_map && unlikely(!page_pool_dma_map(pool, netmem, gfp))) {
 			put_page(netmem_to_page(netmem));
 			continue;
 		}
@@ -660,6 +678,8 @@ void page_pool_clear_pp_info(netmem_ref netmem)
 static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
 							 netmem_ref netmem)
 {
+	struct page *old, *page = netmem_to_page(netmem);
+	unsigned long id;
 	dma_addr_t dma;
 
 	if (!pool->dma_map)
@@ -668,6 +688,17 @@ static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
 		 */
 		return;
 
+	id = netmem_get_dma_index(netmem);
+	if (!id)
+		return;
+
+	if (in_softirq())
+		old = xa_cmpxchg(&pool->dma_mapped, id, page, NULL, 0);
+	else
+		old = xa_cmpxchg_bh(&pool->dma_mapped, id, page, NULL, 0);
+	if (old != page)
+		return;
+
 	dma = page_pool_get_dma_addr_netmem(netmem);
 
 	/* When page is unmapped, it cannot be returned to our pool */
@@ -675,6 +706,7 @@ static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
 			     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
 			     DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
 	page_pool_set_dma_addr_netmem(netmem, 0);
+	netmem_set_dma_index(netmem, 0);
 }
 
 /* Disconnects a page (from a page_pool).  API users can have a need
@@ -1084,8 +1116,32 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
 
 static void page_pool_scrub(struct page_pool *pool)
 {
+	unsigned long id;
+	void *ptr;
+
 	page_pool_empty_alloc_cache_once(pool);
-	pool->destroy_cnt++;
+	if (!pool->destroy_cnt++ && pool->dma_map) {
+		if (pool->dma_sync) {
+			/* paired with READ_ONCE in
+			 * page_pool_dma_sync_for_device() and
+			 * __page_pool_dma_sync_for_cpu()
+			 */
+			WRITE_ONCE(pool->dma_sync, false);
+
+			/* Make sure all concurrent returns that may see the old
+			 * value of dma_sync (and thus perform a sync) have
+			 * finished before doing the unmapping below. Skip the
+			 * wait if the device doesn't actually need syncing, or
+			 * if there are no outstanding mapped pages.
+			 */
+			if (dma_dev_need_sync(pool->p.dev) &&
+			    !xa_empty(&pool->dma_mapped))
+				synchronize_net();
+		}
+
+		xa_for_each(&pool->dma_mapped, id, ptr)
+			__page_pool_release_page_dma(pool, page_to_netmem(ptr));
+	}
 
 	/* No more consumers should exist, but producers could still
 	 * be in-flight.

-- 
2.48.1




* Re: [PATCH net-next v2 2/3] page_pool: Turn dma_sync and dma_sync_cpu fields into a bitmap
  2025-03-25 15:45 ` [PATCH net-next v2 2/3] page_pool: Turn dma_sync and dma_sync_cpu fields into a bitmap Toke Høiland-Jørgensen
@ 2025-03-25 22:17   ` Jakub Kicinski
  2025-03-26  8:12     ` Toke Høiland-Jørgensen
  2025-03-26 18:00   ` Saeed Mahameed
  1 sibling, 1 reply; 17+ messages in thread
From: Jakub Kicinski @ 2025-03-25 22:17 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: David S. Miller, Jesper Dangaard Brouer, Saeed Mahameed,
	Leon Romanovsky, Tariq Toukan, Andrew Lunn, Eric Dumazet,
	Paolo Abeni, Ilias Apalodimas, Simon Horman, Andrew Morton,
	Mina Almasry, Yonglong Liu, Yunsheng Lin, Pavel Begunkov,
	Matthew Wilcox, netdev, bpf, linux-rdma, linux-mm

On Tue, 25 Mar 2025 16:45:43 +0100 Toke Høiland-Jørgensen wrote:
> Change the single-bit booleans for dma_sync into an unsigned long with
> BIT() definitions so that a subsequent patch can write them both with a
> single WRITE_ONCE() on teardown. Also move the check for the sync_cpu
> side into __page_pool_dma_sync_for_cpu() so it can be disabled for
> non-netmem providers as well.

Can we make them just bools without the bit width?
Less churn and actually fewer bytes.
I don't see why we'd need to wipe them atomically.
In fact I don't see why we're touching dma_sync_cpu at all;
it's driver-facing, and the driver is gone in the problematic
scenario.
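
For illustration, "just bools" in struct page_pool would look roughly like
this (a sketch of the suggestion above, not code from any posted revision):

	bool has_init_callback:1;	/* slow::init_callback is set */
	bool dma_map:1;			/* Perform DMA mapping */
	/* full-width so teardown can clear them with a plain WRITE_ONCE() */
	bool dma_sync;			/* Perform DMA sync for device */
	bool dma_sync_for_cpu;		/* Perform DMA sync for cpu */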



* Re: [PATCH net-next v2 2/3] page_pool: Turn dma_sync and dma_sync_cpu fields into a bitmap
  2025-03-25 22:17   ` Jakub Kicinski
@ 2025-03-26  8:12     ` Toke Høiland-Jørgensen
  2025-03-26 11:23       ` Jakub Kicinski
  0 siblings, 1 reply; 17+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-26  8:12 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David S. Miller, Jesper Dangaard Brouer, Saeed Mahameed,
	Leon Romanovsky, Tariq Toukan, Andrew Lunn, Eric Dumazet,
	Paolo Abeni, Ilias Apalodimas, Simon Horman, Andrew Morton,
	Mina Almasry, Yonglong Liu, Yunsheng Lin, Pavel Begunkov,
	Matthew Wilcox, netdev, bpf, linux-rdma, linux-mm

Jakub Kicinski <kuba@kernel.org> writes:

> On Tue, 25 Mar 2025 16:45:43 +0100 Toke Høiland-Jørgensen wrote:
>> Change the single-bit booleans for dma_sync into an unsigned long with
>> BIT() definitions so that a subsequent patch can write them both with a
>> single WRITE_ONCE() on teardown. Also move the check for the sync_cpu
>> side into __page_pool_dma_sync_for_cpu() so it can be disabled for
>> non-netmem providers as well.
>
> Can we make them just bools without the bit width?
> Less churn and actually fewer bytes.

Ah! Didn't realise that was possible, excellent solution :)

> I don't see why we'd need to wipe them atomically.
> In fact I don't see why we're touching dma_sync_cpu, at all,
> it's driver-facing and the driver is gone in the problematic
> scenario.

No, you're right, but it felt weird to change just one of them, so I
figured I'd go with both. But keeping them both as bool, and just making
dma_sync a full-width bool, works, so I'll respin with that and leave
dma_sync_cpu as-is.




* Re: [PATCH net-next v2 2/3] page_pool: Turn dma_sync and dma_sync_cpu fields into a bitmap
  2025-03-26  8:12     ` Toke Høiland-Jørgensen
@ 2025-03-26 11:23       ` Jakub Kicinski
  2025-03-26 11:29         ` Jakub Kicinski
  0 siblings, 1 reply; 17+ messages in thread
From: Jakub Kicinski @ 2025-03-26 11:23 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: David S. Miller, Jesper Dangaard Brouer, Saeed Mahameed,
	Leon Romanovsky, Tariq Toukan, Andrew Lunn, Eric Dumazet,
	Paolo Abeni, Ilias Apalodimas, Simon Horman, Andrew Morton,
	Mina Almasry, Yonglong Liu, Yunsheng Lin, Pavel Begunkov,
	Matthew Wilcox, netdev, bpf, linux-rdma, linux-mm

On Wed, 26 Mar 2025 09:12:34 +0100 Toke Høiland-Jørgensen wrote:
> > I don't see why we'd need to wipe them atomically.
> > In fact I don't see why we're touching dma_sync_cpu, at all,
> > it's driver-facing and the driver is gone in the problematic
> > scenario.  
> 
> No you're right, but it felt weird to change just one of them, so
> figured I'd go with both. But keeping them both as bool, and just making
> dma_sync a full-width bool works, so I'll respin with that and leave
> dma_sync_cpu as-is.

Opinion on dma_sync_cpu clearing probably depends on mental model.
No strong feelings but perhaps add a comment next to clearing it
for the likes of myself saying that this technically shouldn't be
needed as we only expect drivers to ask for CPU sync?



* Re: [PATCH net-next v2 2/3] page_pool: Turn dma_sync and dma_sync_cpu fields into a bitmap
  2025-03-26 11:23       ` Jakub Kicinski
@ 2025-03-26 11:29         ` Jakub Kicinski
  0 siblings, 0 replies; 17+ messages in thread
From: Jakub Kicinski @ 2025-03-26 11:29 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: David S. Miller, Jesper Dangaard Brouer, Saeed Mahameed,
	Leon Romanovsky, Tariq Toukan, Andrew Lunn, Eric Dumazet,
	Paolo Abeni, Ilias Apalodimas, Simon Horman, Andrew Morton,
	Mina Almasry, Yonglong Liu, Yunsheng Lin, Pavel Begunkov,
	Matthew Wilcox, netdev, bpf, linux-rdma, linux-mm

On Wed, 26 Mar 2025 04:23:47 -0700 Jakub Kicinski wrote:
> On Wed, 26 Mar 2025 09:12:34 +0100 Toke Høiland-Jørgensen wrote:
> > > I don't see why we'd need to wipe them atomically.
> > > In fact I don't see why we're touching dma_sync_cpu, at all,
> > > it's driver-facing and the driver is gone in the problematic
> > > scenario.    
> > 
> > No you're right, but it felt weird to change just one of them, so
> > figured I'd go with both. But keeping them both as bool, and just making
> > dma_sync a full-width bool works, so I'll respin with that and leave
> > dma_sync_cpu as-is.  
> 
> Opinion on dma_sync_cpu clearing probably depends on mental model.
> No strong feelings but perhaps add a comment next to clearing it
> for the likes of myself saying that this technically shouldn't be
> needed as we only expect drivers to ask for CPU sync?

Ah, misread, I thought you meant "as-is" == "as is in this series".
Thanks!



* Re: [PATCH net-next v2 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
  2025-03-25 15:45 ` [PATCH net-next v2 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool Toke Høiland-Jørgensen
@ 2025-03-26 13:54   ` Jesper Dangaard Brouer
  2025-03-26 18:22   ` Saeed Mahameed
  1 sibling, 0 replies; 17+ messages in thread
From: Jesper Dangaard Brouer @ 2025-03-26 13:54 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, David S. Miller,
	Jakub Kicinski, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
	Andrew Lunn, Eric Dumazet, Paolo Abeni, Ilias Apalodimas,
	Simon Horman, Andrew Morton, Mina Almasry, Yonglong Liu,
	Yunsheng Lin, Pavel Begunkov, Matthew Wilcox
  Cc: netdev, bpf, linux-rdma, linux-mm, Qiuling Ren, Yuying Ma



On 25/03/2025 16.45, Toke Høiland-Jørgensen wrote:
> When enabling DMA mapping in page_pool, pages are kept DMA mapped until
> they are released from the pool, to avoid the overhead of re-mapping the
> pages every time they are used. This causes resource leaks and/or
> crashes when there are pages still outstanding while the device is torn
> down, because page_pool will attempt an unmap through a non-existent DMA
> device on the subsequent page return.
> 
> To fix this, implement a simple tracking of outstanding DMA-mapped pages
> in page pool using an xarray. This was first suggested by Mina[0], and
> turns out to be fairly straightforward: We simply store pointers to
> pages directly in the xarray with xa_alloc() when they are first DMA
> mapped, and remove them from the array on unmap. Then, when a page pool
> is torn down, it can simply walk the xarray and unmap all pages still
> present there before returning, which also allows us to get rid of the
> get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional
> synchronisation is needed, as a page will only ever be unmapped once.
> 
> To avoid having to walk the entire xarray on unmap to find the page
> reference, we stash the ID assigned by xa_alloc() into the page
> structure itself, using the upper bits of the pp_magic field. This
> requires a couple of defines to avoid conflicting with the
> POISON_POINTER_DELTA define, but this is all evaluated at compile-time,
> so does not affect run-time performance. The bitmap calculations in this
> patch give the following number of bits for different architectures:
> 
> - 23 bits on 32-bit architectures
> - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
> - 32 bits on other 64-bit architectures
> 
> Stashing a value into the unused bits of pp_magic does have the effect
> that it can make the value stored there lie outside the unmappable
> range (as governed by the mmap_min_addr sysctl), for architectures that
> don't define ILLEGAL_POINTER_VALUE. This means that if one of the
> pointers that is aliased to the pp_magic field (such as page->lru.next)
> is dereferenced while the page is owned by page_pool, that could lead to
> a dereference into userspace, which is a security concern. The risk of
> this is mitigated by the fact that (a) we always clear pp_magic before
> releasing a page from page_pool, and (b) this would need a
> use-after-free bug for struct page, which can have many other risks
> since page->lru.next is used as a generic list pointer in multiple
> places in the kernel. As such, with this patch we take the position that
> this risk is negligible in practice. For more discussion, see[1].
> 
> Since all the tracking added in this patch is performed on DMA
> map/unmap, no additional code is needed in the fast path, meaning the
> performance overhead of this tracking is negligible there. A
> micro-benchmark shows that the total overhead of the tracking itself is
> about 400 ns (39 cycles(tsc) 395.218 ns; sum for both map and unmap[2]).
> Since this cost is only paid on DMA map and unmap, it seems like an
> acceptable cost to fix the late unmap issue. Further optimisation can
> narrow the cases where this cost is paid (for instance by eliding the
> tracking when DMA map/unmap is a no-op).
> 
> The extra memory needed to track the pages is neatly encapsulated inside
> xarray, which uses the 'struct xa_node' structure to track items. This
> structure is 576 bytes long, with slots for 64 items, meaning that a
> full node incurs only 9 bytes of overhead per slot it tracks (in
> practice, it probably won't be this efficient, but in any case it should
> be an acceptable overhead).
> 
> [0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
> [1] https://lore.kernel.org/r/20250320023202.GA25514@openwall.com
> [2] https://lore.kernel.org/r/ae07144c-9295-4c9d-a400-153bb689fe9e@huawei.com
> 
> Reported-by: Yonglong Liu <liuyonglong@huawei.com>
> Closes: https://lore.kernel.org/r/8743264a-9700-4227-a556-5f931c720211@huawei.com
> Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code")
> Suggested-by: Mina Almasry <almasrymina@google.com>
> Reviewed-by: Mina Almasry <almasrymina@google.com>
> Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Tested-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Tested-by: Qiuling Ren <qren@redhat.com>
> Tested-by: Yuying Ma <yuma@redhat.com>
> Tested-by: Yonglong Liu <liuyonglong@huawei.com>
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> ---
>   include/linux/poison.h        |  4 +++
>   include/net/page_pool/types.h | 49 +++++++++++++++++++++++---
>   net/core/netmem_priv.h        | 28 ++++++++++++++-
>   net/core/page_pool.c          | 82 ++++++++++++++++++++++++++++++++++++-------
>   4 files changed, 145 insertions(+), 18 deletions(-)


Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>




* Re: [PATCH net-next v2 2/3] page_pool: Turn dma_sync and dma_sync_cpu fields into a bitmap
  2025-03-25 15:45 ` [PATCH net-next v2 2/3] page_pool: Turn dma_sync and dma_sync_cpu fields into a bitmap Toke Høiland-Jørgensen
  2025-03-25 22:17   ` Jakub Kicinski
@ 2025-03-26 18:00   ` Saeed Mahameed
  1 sibling, 0 replies; 17+ messages in thread
From: Saeed Mahameed @ 2025-03-26 18:00 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Leon Romanovsky, Tariq Toukan, Andrew Lunn, Eric Dumazet,
	Paolo Abeni, Ilias Apalodimas, Simon Horman, Andrew Morton,
	Mina Almasry, Yonglong Liu, Yunsheng Lin, Pavel Begunkov,
	Matthew Wilcox, netdev, bpf, linux-rdma, linux-mm

On 25 Mar 16:45, Toke Høiland-Jørgensen wrote:
>Change the single-bit booleans for dma_sync into an unsigned long with
>BIT() definitions so that a subsequent patch can write them both with a
>single WRITE_ONCE() on teardown. Also move the check for the sync_cpu
>side into __page_pool_dma_sync_for_cpu() so it can be disabled for
>non-netmem providers as well.
>
>Reviewed-by: Mina Almasry <almasrymina@google.com>
>Tested-by: Yonglong Liu <liuyonglong@huawei.com>
>Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>---
> include/net/page_pool/helpers.h | 6 +++---
> include/net/page_pool/types.h   | 8 ++++++--
> net/core/devmem.c               | 3 +--
> net/core/page_pool.c            | 9 +++++----
> 4 files changed, 15 insertions(+), 11 deletions(-)
>
>diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
>index 582a3d00cbe2315edeb92850b6a42ab21e509e45..7ed32bde4b8944deb7fb22e291e95b8487be681a 100644
>--- a/include/net/page_pool/helpers.h
>+++ b/include/net/page_pool/helpers.h
>@@ -443,6 +443,9 @@ static inline void __page_pool_dma_sync_for_cpu(const struct page_pool *pool,
> 						const dma_addr_t dma_addr,
> 						u32 offset, u32 dma_sync_size)
> {
>+	if (!(READ_ONCE(pool->dma_sync) & PP_DMA_SYNC_CPU))
>+		return;
>+

page_pool_dma_sync_for_cpu() is a wrapper for this function, and it assumes
pages were created with the DMA flag, so you are adding an unnecessary check
on that path.

Just change page_pool_dma_sync_for_cpu() to directly call 
dma_sync_single_range_for_cpu(...) as part of this patch.
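
A rough sketch of that suggestion (assuming the existing
page_pool_dma_sync_for_cpu() signature; not code from a posted patch):

static inline void page_pool_dma_sync_for_cpu(const struct page_pool *pool,
					      const struct page *page,
					      u32 offset, u32 dma_sync_size)
{
	/* the page comes from a pool created with PP_FLAG_DMA_MAP, so sync
	 * unconditionally instead of going through the checked helper
	 */
	dma_sync_single_range_for_cpu(pool->p.dev,
				      page_pool_get_dma_addr(page),
				      offset + pool->p.offset, dma_sync_size,
				      page_pool_get_dma_dir(pool));
}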

> 	dma_sync_single_range_for_cpu(pool->p.dev, dma_addr,
> 				      offset + pool->p.offset, dma_sync_size,
> 				      page_pool_get_dma_dir(pool));
>@@ -473,9 +476,6 @@ page_pool_dma_sync_netmem_for_cpu(const struct page_pool *pool,
> 				  const netmem_ref netmem, u32 offset,
> 				  u32 dma_sync_size)
> {
>-	if (!pool->dma_sync_for_cpu)
>-		return;
>-
> 	__page_pool_dma_sync_for_cpu(pool,
> 				     page_pool_get_dma_addr_netmem(netmem),
> 				     offset, dma_sync_size);
>diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
>index df0d3c1608929605224feb26173135ff37951ef8..fbe34024b20061e8bcd1d4474f6ebfc70992f1eb 100644
>--- a/include/net/page_pool/types.h
>+++ b/include/net/page_pool/types.h
>@@ -33,6 +33,10 @@
> #define PP_FLAG_ALL		(PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV | \
> 				 PP_FLAG_SYSTEM_POOL | PP_FLAG_ALLOW_UNREADABLE_NETMEM)
>
>+/* bit values used in pp->dma_sync */
>+#define PP_DMA_SYNC_DEV	BIT(0)
>+#define PP_DMA_SYNC_CPU	BIT(1)
>+
> /*
>  * Fast allocation side cache array/stack
>  *
>@@ -175,12 +179,12 @@ struct page_pool {
>
> 	bool has_init_callback:1;	/* slow::init_callback is set */
> 	bool dma_map:1;			/* Perform DMA mapping */
>-	bool dma_sync:1;		/* Perform DMA sync for device */
>-	bool dma_sync_for_cpu:1;	/* Perform DMA sync for cpu */
> #ifdef CONFIG_PAGE_POOL_STATS
> 	bool system:1;			/* This is a global percpu pool */
> #endif
>
>+	unsigned long dma_sync;
>+
> 	__cacheline_group_begin_aligned(frag, PAGE_POOL_FRAG_GROUP_ALIGN);
> 	long frag_users;
> 	netmem_ref frag_page;
>diff --git a/net/core/devmem.c b/net/core/devmem.c
>index 6802e82a4d03b6030f6df50ae3661f81e40bc101..955d392d707b12fe784747aa2040ce1a882a64db 100644
>--- a/net/core/devmem.c
>+++ b/net/core/devmem.c
>@@ -340,8 +340,7 @@ int mp_dmabuf_devmem_init(struct page_pool *pool)
> 	/* dma-buf dma addresses do not need and should not be used with
> 	 * dma_sync_for_cpu/device. Force disable dma_sync.
> 	 */
>-	pool->dma_sync = false;
>-	pool->dma_sync_for_cpu = false;
>+	pool->dma_sync = 0;
>
> 	if (pool->p.order != 0)
> 		return -E2BIG;
>diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>index acef1fcd8ddcfd1853a6f2055c1f1820ab248e8d..d51ca4389dd62d8bc266a9a2b792838257173535 100644
>--- a/net/core/page_pool.c
>+++ b/net/core/page_pool.c
>@@ -203,7 +203,7 @@ static int page_pool_init(struct page_pool *pool,
> 	memcpy(&pool->slow, &params->slow, sizeof(pool->slow));
>
> 	pool->cpuid = cpuid;
>-	pool->dma_sync_for_cpu = true;
>+	pool->dma_sync = PP_DMA_SYNC_CPU;
>
> 	/* Validate only known flags were used */
> 	if (pool->slow.flags & ~PP_FLAG_ALL)
>@@ -238,7 +238,7 @@ static int page_pool_init(struct page_pool *pool,
> 		if (!pool->p.max_len)
> 			return -EINVAL;
>
>-		pool->dma_sync = true;
>+		pool->dma_sync |= PP_DMA_SYNC_DEV;
>
> 		/* pool->p.offset has to be set according to the address
> 		 * offset used by the DMA engine to start copying rx data
>@@ -291,7 +291,7 @@ static int page_pool_init(struct page_pool *pool,
> 	}
>
> 	if (pool->mp_ops) {
>-		if (!pool->dma_map || !pool->dma_sync)
>+		if (!pool->dma_map || !(pool->dma_sync & PP_DMA_SYNC_DEV))
> 			return -EOPNOTSUPP;
>
> 		if (WARN_ON(!is_kernel_rodata((unsigned long)pool->mp_ops))) {
>@@ -466,7 +466,8 @@ page_pool_dma_sync_for_device(const struct page_pool *pool,
> 			      netmem_ref netmem,
> 			      u32 dma_sync_size)
> {
>-	if (pool->dma_sync && dma_dev_need_sync(pool->p.dev))
>+	if ((READ_ONCE(pool->dma_sync) & PP_DMA_SYNC_DEV) &&
>+	    dma_dev_need_sync(pool->p.dev))
> 		__page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
> }
>
>
>-- 
>2.48.1
>



* Re: [PATCH net-next v2 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
  2025-03-25 15:45 ` [PATCH net-next v2 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool Toke Høiland-Jørgensen
  2025-03-26 13:54   ` Jesper Dangaard Brouer
@ 2025-03-26 18:22   ` Saeed Mahameed
  2025-03-26 20:02     ` Mina Almasry
  2025-03-27  3:53     ` Yunsheng Lin
  1 sibling, 2 replies; 17+ messages in thread
From: Saeed Mahameed @ 2025-03-26 18:22 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Leon Romanovsky, Tariq Toukan, Andrew Lunn, Eric Dumazet,
	Paolo Abeni, Ilias Apalodimas, Simon Horman, Andrew Morton,
	Mina Almasry, Yonglong Liu, Yunsheng Lin, Pavel Begunkov,
	Matthew Wilcox, netdev, bpf, linux-rdma, linux-mm, Qiuling Ren,
	Yuying Ma

On 25 Mar 16:45, Toke Høiland-Jørgensen wrote:
>When enabling DMA mapping in page_pool, pages are kept DMA mapped until
>they are released from the pool, to avoid the overhead of re-mapping the
>pages every time they are used. This causes resource leaks and/or
>crashes when there are pages still outstanding while the device is torn
>down, because page_pool will attempt an unmap through a non-existent DMA
>device on the subsequent page return.
>

Why dynamically track when it is guaranteed that the page_pool consumer (driver)
will return all outstanding pages before disabling the DMA device?
When a page pool is destroyed by the driver, just mark it as "DMA-inactive",
and on page_pool_return_page(), if DMA-inactive, don't recycle those pages
but immediately DMA unmap and release them. We can also track drivers that
still have page_pools with outstanding DMA, and print a warning when
DMA is disabled.
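
A rough sketch of that alternative, purely for illustration (the dma_inactive
field and the helper name are hypothetical, not from any posted patch):

static void pp_return_or_release(struct page_pool *pool, netmem_ref netmem)
{
	if (READ_ONCE(pool->dma_inactive)) {
		/* pool already destroyed by the driver: don't recycle,
		 * unmap and release the page immediately instead
		 */
		__page_pool_release_page_dma(pool, netmem);
		page_pool_clear_pp_info(netmem);
		put_page(netmem_to_page(netmem));
		return;
	}

	/* normal path: hand the page back to the pool for recycling */
	page_pool_put_full_netmem(pool, netmem, false);
}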

I know there's an extra check on the fast path, but looking below it seems
like the slow path is becoming unnecessarily complicated, and we are sacrificing
slow-path performance significantly for the sake of not touching the fast path
at all. The page_pool slow path should _not_ be significantly slower
than using the page allocator directly! In many production environments and
some workloads, page pool recycling can happen at a very low rate, resulting
in performance relying solely on the slow path.

>To fix this, implement a simple tracking of outstanding DMA-mapped pages
>in page pool using an xarray. This was first suggested by Mina[0], and
>turns out to be fairly straightforward: We simply store pointers to
>pages directly in the xarray with xa_alloc() when they are first DMA
>mapped, and remove them from the array on unmap. Then, when a page pool
>is torn down, it can simply walk the xarray and unmap all pages still
>present there before returning, which also allows us to get rid of the
>get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional
>synchronisation is needed, as a page will only ever be unmapped once.
>
>To avoid having to walk the entire xarray on unmap to find the page
>reference, we stash the ID assigned by xa_alloc() into the page
>structure itself, using the upper bits of the pp_magic field. This
>requires a couple of defines to avoid conflicting with the
>POISON_POINTER_DELTA define, but this is all evaluated at compile-time,
>so does not affect run-time performance. The bitmap calculations in this
>patch give the following number of bits for different architectures:
>
>- 23 bits on 32-bit architectures
>- 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
>- 32 bits on other 64-bit architectures
>
>Stashing a value into the unused bits of pp_magic does have the effect
>that it can make the value stored there lie outside the unmappable
>range (as governed by the mmap_min_addr sysctl), for architectures that
>don't define ILLEGAL_POINTER_VALUE. This means that if one of the
>pointers that is aliased to the pp_magic field (such as page->lru.next)
>is dereferenced while the page is owned by page_pool, that could lead to
>a dereference into userspace, which is a security concern. The risk of
>this is mitigated by the fact that (a) we always clear pp_magic before
>releasing a page from page_pool, and (b) this would need a
>use-after-free bug for struct page, which can have many other risks
>since page->lru.next is used as a generic list pointer in multiple
>places in the kernel. As such, with this patch we take the position that
>this risk is negligible in practice. For more discussion, see[1].
>
>Since all the tracking added in this patch is performed on DMA
>map/unmap, no additional code is needed in the fast path, meaning the
>performance overhead of this tracking is negligible there. A
>micro-benchmark shows that the total overhead of the tracking itself is
>about 400 ns (39 cycles(tsc) 395.218 ns; sum for both map and unmap[2]).
>Since this cost is only paid on DMA map and unmap, it seems like an
>acceptable cost to fix the late unmap issue. Further optimisation can
>narrow the cases where this cost is paid (for instance by eliding the
>tracking when DMA map/unmap is a no-op).
>
What I am missing here is: what is the added cost of those extra operations on
the slow path compared to before this patch? The total overhead being
acceptable doesn't justify the change; we need the diff before and after.

>The extra memory needed to track the pages is neatly encapsulated inside
>xarray, which uses the 'struct xa_node' structure to track items. This
>structure is 576 bytes long, with slots for 64 items, meaning that a
>full node occurs only 9 bytes of overhead per slot it tracks (in
>practice, it probably won't be this efficient, but in any case it should
>be an acceptable overhead).
>
>[0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
>[1] https://lore.kernel.org/r/20250320023202.GA25514@openwall.com
>[2] https://lore.kernel.org/r/ae07144c-9295-4c9d-a400-153bb689fe9e@huawei.com
>
>Reported-by: Yonglong Liu <liuyonglong@huawei.com>
>Closes: https://lore.kernel.org/r/8743264a-9700-4227-a556-5f931c720211@huawei.com
>Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code")
>Suggested-by: Mina Almasry <almasrymina@google.com>
>Reviewed-by: Mina Almasry <almasrymina@google.com>
>Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
>Tested-by: Jesper Dangaard Brouer <hawk@kernel.org>
>Tested-by: Qiuling Ren <qren@redhat.com>
>Tested-by: Yuying Ma <yuma@redhat.com>
>Tested-by: Yonglong Liu <liuyonglong@huawei.com>
>Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>---
> include/linux/poison.h        |  4 +++
> include/net/page_pool/types.h | 49 +++++++++++++++++++++++---
> net/core/netmem_priv.h        | 28 ++++++++++++++-
> net/core/page_pool.c          | 82 ++++++++++++++++++++++++++++++++++++-------
> 4 files changed, 145 insertions(+), 18 deletions(-)
>
>diff --git a/include/linux/poison.h b/include/linux/poison.h
>index 331a9a996fa8746626afa63ea462b85ca3e5938b..5351efd710d5e21cc341f7bb533b1aeea4a0808a 100644
>--- a/include/linux/poison.h
>+++ b/include/linux/poison.h
>@@ -70,6 +70,10 @@
> #define KEY_DESTROY		0xbd
>
> /********** net/core/page_pool.c **********/
>+/*
>+ * page_pool uses additional free bits within this value to store data, see the
>+ * definition of PP_DMA_INDEX_MASK in include/net/page_pool/types.h
>+ */
> #define PP_SIGNATURE		(0x40 + POISON_POINTER_DELTA)
>
> /********** net/core/skbuff.c **********/
>diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
>index fbe34024b20061e8bcd1d4474f6ebfc70992f1eb..8f9ed92a4b2af6c136406c41b1a829c8e52ba3e2 100644
>--- a/include/net/page_pool/types.h
>+++ b/include/net/page_pool/types.h
>@@ -6,6 +6,7 @@
> #include <linux/dma-direction.h>
> #include <linux/ptr_ring.h>
> #include <linux/types.h>
>+#include <linux/xarray.h>
> #include <net/netmem.h>
>
> #define PP_FLAG_DMA_MAP		BIT(0) /* Should page_pool do the DMA
>@@ -58,13 +59,51 @@ struct pp_alloc_cache {
> 	netmem_ref cache[PP_ALLOC_CACHE_SIZE];
> };
>
>+/*
>+ * DMA mapping IDs
>+ *
>+ * When DMA-mapping a page, we allocate an ID (from an xarray) and stash this in
>+ * the upper bits of page->pp_magic. We always want to be able to unambiguously
>+ * identify page pool pages (using page_pool_page_is_pp()). Non-PP pages can
>+ * have arbitrary kernel pointers stored in the same field as pp_magic (since it
>+ * overlaps with page->lru.next), so we must ensure that we cannot mistake a
>+ * valid kernel pointer with any of the values we write into this field.
>+ *
>+ * On architectures that set POISON_POINTER_DELTA, this is already ensured,
>+ * since this value becomes part of PP_SIGNATURE; meaning we can just use the
>+ * space between the PP_SIGNATURE value (without POISON_POINTER_DELTA), and the
>+ * lowest bits of POISON_POINTER_DELTA. On arches where POISON_POINTER_DELTA is
>+ * 0, we make sure that we leave the two topmost bits empty, as that guarantees
>+ * we won't mistake a valid kernel pointer for a value we set, regardless of the
>+ * VMSPLIT setting.
>+ *
>+ * Altogether, this means that the number of bits available is constrained by
>+ * the size of an unsigned long (at the upper end, subtracting two bits per the
>+ * above), and the definition of PP_SIGNATURE (with or without
>+ * POISON_POINTER_DELTA).
>+ */
>+#define PP_DMA_INDEX_SHIFT (1 + __fls(PP_SIGNATURE - POISON_POINTER_DELTA))
>+#if POISON_POINTER_DELTA > 0
>+/* PP_SIGNATURE includes POISON_POINTER_DELTA, so limit the size of the DMA
>+ * index to not overlap with that if set
>+ */
>+#define PP_DMA_INDEX_BITS MIN(32, __ffs(POISON_POINTER_DELTA) - PP_DMA_INDEX_SHIFT)
>+#else
>+/* Always leave out the topmost two; see above. */
>+#define PP_DMA_INDEX_BITS MIN(32, BITS_PER_LONG - PP_DMA_INDEX_SHIFT - 2)
>+#endif
>+
>+#define PP_DMA_INDEX_MASK GENMASK(PP_DMA_INDEX_BITS + PP_DMA_INDEX_SHIFT - 1, \
>+				  PP_DMA_INDEX_SHIFT)
>+#define PP_DMA_INDEX_LIMIT XA_LIMIT(1, BIT(PP_DMA_INDEX_BITS) - 1)
>+
> /* Mask used for checking in page_pool_page_is_pp() below. page->pp_magic is
>  * OR'ed with PP_SIGNATURE after the allocation in order to preserve bit 0 for
>- * the head page of compound page and bit 1 for pfmemalloc page.
>- * page_is_pfmemalloc() is checked in __page_pool_put_page() to avoid recycling
>- * the pfmemalloc page.
>+ * the head page of compound page and bit 1 for pfmemalloc page, as well as the
>+ * bits used for the DMA index. page_is_pfmemalloc() is checked in
>+ * __page_pool_put_page() to avoid recycling the pfmemalloc page.
>  */
>-#define PP_MAGIC_MASK ~0x3UL
>+#define PP_MAGIC_MASK ~(PP_DMA_INDEX_MASK | 0x3UL)
>
> /**
>  * struct page_pool_params - page pool parameters
>@@ -233,6 +272,8 @@ struct page_pool {
> 	void *mp_priv;
> 	const struct memory_provider_ops *mp_ops;
>
>+	struct xarray dma_mapped;
>+
> #ifdef CONFIG_PAGE_POOL_STATS
> 	/* recycle stats are per-cpu to avoid locking */
> 	struct page_pool_recycle_stats __percpu *recycle_stats;
>diff --git a/net/core/netmem_priv.h b/net/core/netmem_priv.h
>index f33162fd281c23e109273ba09950c5d0a2829bc9..cd95394399b40c3604934ba7898eeeeacb8aee99 100644
>--- a/net/core/netmem_priv.h
>+++ b/net/core/netmem_priv.h
>@@ -5,7 +5,7 @@
>
> static inline unsigned long netmem_get_pp_magic(netmem_ref netmem)
> {
>-	return __netmem_clear_lsb(netmem)->pp_magic;
>+	return __netmem_clear_lsb(netmem)->pp_magic & ~PP_DMA_INDEX_MASK;
> }
>
> static inline void netmem_or_pp_magic(netmem_ref netmem, unsigned long pp_magic)
>@@ -15,6 +15,8 @@ static inline void netmem_or_pp_magic(netmem_ref netmem, unsigned long pp_magic)
>
> static inline void netmem_clear_pp_magic(netmem_ref netmem)
> {
>+	WARN_ON_ONCE(__netmem_clear_lsb(netmem)->pp_magic & PP_DMA_INDEX_MASK);
>+
> 	__netmem_clear_lsb(netmem)->pp_magic = 0;
> }
>
>@@ -33,4 +35,28 @@ static inline void netmem_set_dma_addr(netmem_ref netmem,
> {
> 	__netmem_clear_lsb(netmem)->dma_addr = dma_addr;
> }
>+
>+static inline unsigned long netmem_get_dma_index(netmem_ref netmem)
>+{
>+	unsigned long magic;
>+
>+	if (WARN_ON_ONCE(netmem_is_net_iov(netmem)))
>+		return 0;
>+
>+	magic = __netmem_clear_lsb(netmem)->pp_magic;
>+
>+	return (magic & PP_DMA_INDEX_MASK) >> PP_DMA_INDEX_SHIFT;
>+}
>+
>+static inline void netmem_set_dma_index(netmem_ref netmem,
>+					unsigned long id)
>+{
>+	unsigned long magic;
>+
>+	if (WARN_ON_ONCE(netmem_is_net_iov(netmem)))
>+		return;
>+
>+	magic = netmem_get_pp_magic(netmem) | (id << PP_DMA_INDEX_SHIFT);
>+	__netmem_clear_lsb(netmem)->pp_magic = magic;
>+}
> #endif
>diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>index d51ca4389dd62d8bc266a9a2b792838257173535..55acb4bc2c57893486f222e9f39b48a09b0d78d0 100644
>--- a/net/core/page_pool.c
>+++ b/net/core/page_pool.c
>@@ -226,6 +226,8 @@ static int page_pool_init(struct page_pool *pool,
> 			return -EINVAL;
>
> 		pool->dma_map = true;
>+
>+		xa_init_flags(&pool->dma_mapped, XA_FLAGS_ALLOC1);
> 	}
>
> 	if (pool->slow.flags & PP_FLAG_DMA_SYNC_DEV) {
>@@ -275,9 +277,6 @@ static int page_pool_init(struct page_pool *pool,
> 	/* Driver calling page_pool_create() also call page_pool_destroy() */
> 	refcount_set(&pool->user_cnt, 1);
>
>-	if (pool->dma_map)
>-		get_device(pool->p.dev);
>-
> 	if (pool->slow.flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM) {
> 		/* We rely on rtnl_lock()ing to make sure netdev_rx_queue
> 		 * configuration doesn't change while we're initializing
>@@ -325,7 +324,7 @@ static void page_pool_uninit(struct page_pool *pool)
> 	ptr_ring_cleanup(&pool->ring, NULL);
>
> 	if (pool->dma_map)
>-		put_device(pool->p.dev);
>+		xa_destroy(&pool->dma_mapped);
>
> #ifdef CONFIG_PAGE_POOL_STATS
> 	if (!pool->system)
>@@ -466,14 +465,20 @@ page_pool_dma_sync_for_device(const struct page_pool *pool,
> 			      netmem_ref netmem,
> 			      u32 dma_sync_size)
> {
>-	if ((READ_ONCE(pool->dma_sync) & PP_DMA_SYNC_DEV) &&
>-	    dma_dev_need_sync(pool->p.dev))
>-		__page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
>+	if (dma_dev_need_sync(pool->p.dev)) {
>+		rcu_read_lock();
>+		if (READ_ONCE(pool->dma_sync) & PP_DMA_SYNC_DEV)
>+			__page_pool_dma_sync_for_device(pool, netmem,
>+							dma_sync_size);
>+		rcu_read_unlock();
>+	}
> }
>
>-static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
>+static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem, gfp_t gfp)
> {
> 	dma_addr_t dma;
>+	int err;
>+	u32 id;
>
> 	/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> 	 * since dma_addr_t can be either 32 or 64 bits and does not always fit
>@@ -487,15 +492,28 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
> 	if (dma_mapping_error(pool->p.dev, dma))
> 		return false;
>
>-	if (page_pool_set_dma_addr_netmem(netmem, dma))
>+	if (in_softirq())
>+		err = xa_alloc(&pool->dma_mapped, &id, netmem_to_page(netmem),
>+			       PP_DMA_INDEX_LIMIT, gfp);
>+	else
>+		err = xa_alloc_bh(&pool->dma_mapped, &id, netmem_to_page(netmem),
>+				  PP_DMA_INDEX_LIMIT, gfp);
>+	if (err) {
>+		WARN_ONCE(1, "couldn't track DMA mapping, please report to netdev@");
> 		goto unmap_failed;
>+	}
>
>+	if (page_pool_set_dma_addr_netmem(netmem, dma)) {
>+		WARN_ONCE(1, "unexpected DMA address, please report to netdev@");
>+		goto unmap_failed;
>+	}
>+
>+	netmem_set_dma_index(netmem, id);
> 	page_pool_dma_sync_for_device(pool, netmem, pool->p.max_len);
>
> 	return true;
>
> unmap_failed:
>-	WARN_ONCE(1, "unexpected DMA address, please report to netdev@");
> 	dma_unmap_page_attrs(pool->p.dev, dma,
> 			     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
> 			     DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
>@@ -512,7 +530,7 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
> 	if (unlikely(!page))
> 		return NULL;
>
>-	if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page)))) {
>+	if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page), gfp))) {
> 		put_page(page);
> 		return NULL;
> 	}
>@@ -558,7 +576,7 @@ static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool,
> 	 */
> 	for (i = 0; i < nr_pages; i++) {
> 		netmem = pool->alloc.cache[i];
>-		if (dma_map && unlikely(!page_pool_dma_map(pool, netmem))) {
>+		if (dma_map && unlikely(!page_pool_dma_map(pool, netmem, gfp))) {
> 			put_page(netmem_to_page(netmem));
> 			continue;
> 		}
>@@ -660,6 +678,8 @@ void page_pool_clear_pp_info(netmem_ref netmem)
> static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
> 							 netmem_ref netmem)
> {
>+	struct page *old, *page = netmem_to_page(netmem);
>+	unsigned long id;
> 	dma_addr_t dma;
>
> 	if (!pool->dma_map)
>@@ -668,6 +688,17 @@ static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
> 		 */
> 		return;
>
>+	id = netmem_get_dma_index(netmem);
>+	if (!id)
>+		return;
>+
>+	if (in_softirq())
>+		old = xa_cmpxchg(&pool->dma_mapped, id, page, NULL, 0);
>+	else
>+		old = xa_cmpxchg_bh(&pool->dma_mapped, id, page, NULL, 0);
>+	if (old != page)
>+		return;
>+
> 	dma = page_pool_get_dma_addr_netmem(netmem);
>
> 	/* When page is unmapped, it cannot be returned to our pool */
>@@ -675,6 +706,7 @@ static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
> 			     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
> 			     DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
> 	page_pool_set_dma_addr_netmem(netmem, 0);
>+	netmem_set_dma_index(netmem, 0);
> }
>
> /* Disconnects a page (from a page_pool).  API users can have a need
>@@ -1084,8 +1116,32 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
>
> static void page_pool_scrub(struct page_pool *pool)
> {
>+	unsigned long id;
>+	void *ptr;
>+
> 	page_pool_empty_alloc_cache_once(pool);
>-	pool->destroy_cnt++;
>+	if (!pool->destroy_cnt++ && pool->dma_map) {
>+		if (pool->dma_sync) {
>+			/* paired with READ_ONCE in
>+			 * page_pool_dma_sync_for_device() and
>+			 * __page_pool_dma_sync_for_cpu()
>+			 */
>+			WRITE_ONCE(pool->dma_sync, false);
>+
>+			/* Make sure all concurrent returns that may see the old
>+			 * value of dma_sync (and thus perform a sync) have
>+			 * finished before doing the unmapping below. Skip the
>+			 * wait if the device doesn't actually need syncing, or
>+			 * if there are no outstanding mapped pages.
>+			 */
>+			if (dma_dev_need_sync(pool->p.dev) &&
>+			    !xa_empty(&pool->dma_mapped))
>+				synchronize_net();
>+		}
>+
>+		xa_for_each(&pool->dma_mapped, id, ptr)
>+			__page_pool_release_page_dma(pool, page_to_netmem(ptr));
>+	}
>
> 	/* No more consumers should exist, but producers could still
> 	 * be in-flight.
>
>-- 
>2.48.1
>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v2 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
  2025-03-26 18:22   ` Saeed Mahameed
@ 2025-03-26 20:02     ` Mina Almasry
  2025-03-27  0:29       ` Saeed Mahameed
  2025-03-27  3:53     ` Yunsheng Lin
  1 sibling, 1 reply; 17+ messages in thread
From: Mina Almasry @ 2025-03-26 20:02 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Toke Høiland-Jørgensen, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, Leon Romanovsky,
	Tariq Toukan, Andrew Lunn, Eric Dumazet, Paolo Abeni,
	Ilias Apalodimas, Simon Horman, Andrew Morton, Yonglong Liu,
	Yunsheng Lin, Pavel Begunkov, Matthew Wilcox, netdev, bpf,
	linux-rdma, linux-mm, Qiuling Ren, Yuying Ma

On Wed, Mar 26, 2025 at 11:22 AM Saeed Mahameed <saeedm@nvidia.com> wrote:
>
> On 25 Mar 16:45, Toke Høiland-Jørgensen wrote:
> >When enabling DMA mapping in page_pool, pages are kept DMA mapped until
> >they are released from the pool, to avoid the overhead of re-mapping the
> >pages every time they are used. This causes resource leaks and/or
> >crashes when there are pages still outstanding while the device is torn
> >down, because page_pool will attempt an unmap through a non-existent DMA
> >device on the subsequent page return.
> >
>
> Why dynamically track when it is guaranteed the page_pool consumer (driver)
> will return all outstanding pages before disabling the DMA device.
> When a page pool is destroyed by the driver, just mark it as "DMA-inactive",
> and on page_pool_return_page() if DMA-inactive don't recycle those pages
> and immediately DMA unmap and release them.

That doesn't work, AFAIU. DMA unmapping after page_pool_destroy has
been called is what's causing the very bug this series is trying to
fix. What happens is:

1. Driver calls page_pool_destroy,
2. Driver removes the net_device (and I guess the associated iommu
structs go away with it).
3. Page-pool tries to unmap after page_pool_destroy is called, trying
to fetch iommu resources that have been freed due to the netdevice
gone away = bad stuff.

(but maybe I misunderstood your suggestion)

-- 
Thanks,
Mina


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v2 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
  2025-03-26 20:02     ` Mina Almasry
@ 2025-03-27  0:29       ` Saeed Mahameed
  2025-03-27  1:37         ` Mina Almasry
  0 siblings, 1 reply; 17+ messages in thread
From: Saeed Mahameed @ 2025-03-27  0:29 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Toke Høiland-Jørgensen, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, Leon Romanovsky,
	Tariq Toukan, Andrew Lunn, Eric Dumazet, Paolo Abeni,
	Ilias Apalodimas, Simon Horman, Andrew Morton, Yonglong Liu,
	Yunsheng Lin, Pavel Begunkov, Matthew Wilcox, netdev, bpf,
	linux-rdma, linux-mm, Qiuling Ren, Yuying Ma

On 26 Mar 13:02, Mina Almasry wrote:
>On Wed, Mar 26, 2025 at 11:22 AM Saeed Mahameed <saeedm@nvidia.com> wrote:
>>
>> On 25 Mar 16:45, Toke Høiland-Jørgensen wrote:
>> >When enabling DMA mapping in page_pool, pages are kept DMA mapped until
>> >they are released from the pool, to avoid the overhead of re-mapping the
>> >pages every time they are used. This causes resource leaks and/or
>> >crashes when there are pages still outstanding while the device is torn
>> >down, because page_pool will attempt an unmap through a non-existent DMA
>> >device on the subsequent page return.
>> >
>>
>> Why dynamically track when it is guaranteed the page_pool consumer (driver)
>> will return all outstanding pages before disabling the DMA device.
>> When a page pool is destroyed by the driver, just mark it as "DMA-inactive",
>> and on page_pool_return_page() if DMA-inactive don't recycle those pages
>> and immediately DMA unmap and release them.
>
>That doesn't work, AFAIU. DMA unmaping after page_pool_destroy has
>been called in what's causing the very bug this series is trying to
>fix. What happens is:
>
>1. Driver calls page_pool_destroy,

Here the driver should already have returned all inflight pages to the
pool; the problem is that those returned pages are recycled back instead
of being DMA unmapped. All we need to do is mark the page_pool as "don't
recycle" after the driver has destroyed it.

>2. Driver removes the net_device (and I guess the associated iommu
>structs go away with it).

If the driver had not yet released those pages at this point, then there is a
more serious leak than just a DMA leak. If the pool has those pages, which is
probably the case, then they were already released by the driver; the problem
is that they were recycled back.

>3. Page-pool tries to unmap after page_pool_destroy is called, trying
>to fetch iommu resources that have been freed due to the netdevice
>gone away = bad stuff.
>
>(but maybe I misunderstood your suggestion)

Yes, see above, but I just double-checked the code. I thought that
page_pool_destroy would wait for all inflight pages to be released before
returning to the caller, but apparently I was mistaken: if the pages are
still being held by the stack/user-space, then they will still be considered
"inflight" even though the driver is done with them :/.

So yes, tracking "ALL" pages is one way to solve it, but I still think that
the correct way to deal with this is to hold the driver/netdev while there
are inflight pages. No strong opinion, though; if we expect pages to remain
in userspace for "too" long, then I admit tracking is the only way.

>
>-- 
>Thanks,
>Mina


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v2 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
  2025-03-27  0:29       ` Saeed Mahameed
@ 2025-03-27  1:37         ` Mina Almasry
  0 siblings, 0 replies; 17+ messages in thread
From: Mina Almasry @ 2025-03-27  1:37 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Toke Høiland-Jørgensen, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, Leon Romanovsky,
	Tariq Toukan, Andrew Lunn, Eric Dumazet, Paolo Abeni,
	Ilias Apalodimas, Simon Horman, Andrew Morton, Yonglong Liu,
	Yunsheng Lin, Pavel Begunkov, Matthew Wilcox, netdev, bpf,
	linux-rdma, linux-mm, Qiuling Ren, Yuying Ma

On Wed, Mar 26, 2025 at 5:29 PM Saeed Mahameed <saeedm@nvidia.com> wrote:
>
> On 26 Mar 13:02, Mina Almasry wrote:
> >On Wed, Mar 26, 2025 at 11:22 AM Saeed Mahameed <saeedm@nvidia.com> wrote:
> >>
> >> On 25 Mar 16:45, Toke Høiland-Jørgensen wrote:
> >> >When enabling DMA mapping in page_pool, pages are kept DMA mapped until
> >> >they are released from the pool, to avoid the overhead of re-mapping the
> >> >pages every time they are used. This causes resource leaks and/or
> >> >crashes when there are pages still outstanding while the device is torn
> >> >down, because page_pool will attempt an unmap through a non-existent DMA
> >> >device on the subsequent page return.
> >> >
> >>
> >> Why dynamically track when it is guaranteed the page_pool consumer (driver)
> >> will return all outstanding pages before disabling the DMA device.
> >> When a page pool is destroyed by the driver, just mark it as "DMA-inactive",
> >> and on page_pool_return_page() if DMA-inactive don't recycle those pages
> >> and immediately DMA unmap and release them.
> >
> >That doesn't work, AFAIU. DMA unmaping after page_pool_destroy has
> >been called in what's causing the very bug this series is trying to
> >fix. What happens is:
> >
> >1. Driver calls page_pool_destroy,
>
> Here the driver should already have returned all inflight pages to the
> pool, the problem is that those returned pages are recycled back, instead
> of being dma unmapped, all we need to do is to mark page_pool as "don't
> recycle" after the driver destroyed it.
>
> >2. Driver removes the net_device (and I guess the associated iommu
> >structs go away with it).
>
> if the driver had not yet released those pages at this point then there is a
> more series leak than just dma leak. If the pool has those pages, which is
> probably the case, then they were already release by the driver, the problem
> is that they were recycled back.
>
> >3. Page-pool tries to unmap after page_pool_destroy is called, trying
> >to fetch iommu resources that have been freed due to the netdevice
> >gone away = bad stuff.
> >
> >(but maybe I misunderstood your suggestion)
>
> Yes, see above, but I just double checked the code and I though that the
> page_pool_destroy would wait for all inflight pages to be release before it
> returns back to the caller, but apparently I was mistaken, if the pages are
> still being held by stack/user-space then they will still be considered
> "inflight" although the driver is done with them :/.
>

Right, I think we're on the same page now. page_pool_destroy doesn't
(currently) wait for all the inflight pages to be returned before it
returns to the driver calling it, even if the driver is fully done
with all the pages. There could be pages still held by the
userspace/net stack.

> So yes tracking "ALL" pages is one way to solve it, but I still think that
> the correct way to deal with this is to hold the driver/netdev while there
> are inflight pages, but no strong opinion if we expect pages to remain in
> userspace for "too" long, then I admit, tracking is the only way.
>

AFAICT, the amount of time the userspace can hold onto an inflight
page is actually indefinite. The pathological edge case is the
userspace opens a receive socket, never closes it, and never
recvmsg()'s it. In that case skbs will wait in the socket's receive
queue forever.

FWIW Jakub did suggest a fix where the page_pool will stall the
netdevice removal while there are inflight pages. I never provided
Reviewed-by because I was nervous about GVE failing to soft reset or
unload or something because some userspace is holding onto a page_pool
page, but maybe you like it better:

https://lore.kernel.org/netdev/20240809205717.0c966bad@kernel.org/T/#mbca7f9391ba44840444e747c9038ef6617142048

My personal feeling is that adding overhead to the slow page_alloc +
dma mapping path is preferable to blocking netdev unregister.

-- 
Thanks,
Mina


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v2 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
  2025-03-26 18:22   ` Saeed Mahameed
  2025-03-26 20:02     ` Mina Almasry
@ 2025-03-27  3:53     ` Yunsheng Lin
  2025-03-27  4:59       ` Mina Almasry
  1 sibling, 1 reply; 17+ messages in thread
From: Yunsheng Lin @ 2025-03-27  3:53 UTC (permalink / raw)
  To: Saeed Mahameed, Toke Høiland-Jørgensen
  Cc: David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Leon Romanovsky, Tariq Toukan, Andrew Lunn, Eric Dumazet,
	Paolo Abeni, Ilias Apalodimas, Simon Horman, Andrew Morton,
	Mina Almasry, Yonglong Liu, Pavel Begunkov, Matthew Wilcox,
	netdev, bpf, linux-rdma, linux-mm, Qiuling Ren, Yuying Ma

On 2025/3/27 2:22, Saeed Mahameed wrote:

...

> 
> I know there's an extra check on fast path, but looking below it seems
> like slow path is becoming unnecessarily complicated, and we are sacrificing
> slow path performance significantly for the sake of not touching fast path
> at all. page_pool slow path should _not_ be significantly slower
> than using the page allocator directly! In many production environments and
> some workloads, page pool recycling can happen at a very low rate resulting
> on performance relying solely on slow path..

+1. We also rely on the above 'slow path' to flush old nid-mismatched pages
and refill new pages from the correct nid when an irq migrates between
different NUMA nodes; see the nid checking and updating in
page_pool_refill_alloc_cache() and page_pool_update_nid(). This matters because
it seems common for Linux distributions to use irqbalance to tune performance.
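
For reference, a simplified paraphrase of that refill-time nid check, as if
inside net/core/page_pool.c (not the verbatim kernel code; the
pp_refill_nid_sketch name is made up, and locking, stats and net_iov handling
are ignored):

static netmem_ref pp_refill_nid_sketch(struct page_pool *pool)
{
	netmem_ref netmem;
	int pref_nid;

	/* NUMA node is stable in softirq context; prefer the local node
	 * unless the pool was created with an explicit nid.
	 */
	pref_nid = (pool->p.nid == NUMA_NO_NODE) ? numa_mem_id() : pool->p.nid;

	do {
		netmem = (__force netmem_ref)__ptr_ring_consume(&pool->ring);
		if (!netmem)
			break;

		if (page_to_nid(netmem_to_page(netmem)) == pref_nid) {
			pool->alloc.cache[pool->alloc.count++] = netmem;
		} else {
			/* nid mismatch: release the page back to the page
			 * allocator instead of recycling it, so the cache
			 * refills from the preferred node via the slow path.
			 */
			page_pool_return_page(pool, netmem);
			break;
		}
	} while (pool->alloc.count < PP_ALLOC_CACHE_REFILL);

	return pool->alloc.count ?
		pool->alloc.cache[--pool->alloc.count] : 0;
}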

>> To fix this, implement a simple tracking of outstanding DMA-mapped pages
>> in page pool using an xarray. This was first suggested by Mina[0], and
>> turns out to be fairly straight forward: We simply store pointers to
>> pages directly in the xarray with xa_alloc() when they are first DMA
>> mapped, and remove them from the array on unmap. Then, when a page pool
>> is torn down, it can simply walk the xarray and unmap all pages still
>> present there before returning, which also allows us to get rid of the
>> get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional
>> synchronisation is needed, as a page will only ever be unmapped once.
>>
>> To avoid having to walk the entire xarray on unmap to find the page
>> reference, we stash the ID assigned by xa_alloc() into the page
>> structure itself, using the upper bits of the pp_magic field. This
>> requires a couple of defines to avoid conflicting with the
>> POINTER_POISON_DELTA define, but this is all evaluated at compile-time,
>> so does not affect run-time performance. The bitmap calculations in this
>> patch gives the following number of bits for different architectures:
>>
>> - 23 bits on 32-bit architectures
>> - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
>> - 32 bits on other 64-bit architectures
>>
>> Stashing a value into the unused bits of pp_magic does have the effect
>> that it can make the value stored there lie outside the unmappable
>> range (as governed by the mmap_min_addr sysctl), for architectures that
>> don't define ILLEGAL_POINTER_VALUE. This means that if one of the
>> pointers that is aliased to the pp_magic field (such as page->lru.next)
>> is dereferenced while the page is owned by page_pool, that could lead to
>> a dereference into userspace, which is a security concern. The risk of
>> this is mitigated by the fact that (a) we always clear pp_magic before
>> releasing a page from page_pool, and (b) this would need a
>> use-after-free bug for struct page, which can have many other risks
>> since page->lru.next is used as a generic list pointer in multiple
>> places in the kernel. As such, with this patch we take the position that
>> this risk is negligible in practice. For more discussion, see[1].

Below is a recent UAF involving a page; I guess it can be said that this
patch seems to basically make the mmap_min_addr mechanism useless for the above
arches when a driver with page_pool DMA mapping support is loaded:
https://lore.kernel.org/all/Z878K7JQ93LqBdCB@casper.infradead.org/

So how about logging a warning so that users can tell if their system may
be in a security-compromised state for the above case?

>>
>> Since all the tracking added in this patch is performed on DMA
>> map/unmap, no additional code is needed in the fast path, meaning the
>> performance overhead of this tracking is negligible there. A
>> micro-benchmark shows that the total overhead of the tracking itself is
>> about 400 ns (39 cycles(tsc) 395.218 ns; sum for both map and unmap[2]).
>> Since this cost is only paid on DMA map and unmap, it seems like an
>> acceptable cost to fix the late unmap issue. Further optimisation can
>> narrow the cases where this cost is paid (for instance by eliding the
>> tracking when DMA map/unmap is a no-op).
>>
> What I am missing here, what is the added cost of those extra operations on
> the slow path compared to before this patch? Total overhead being
> acceptable doesn't justify the change, we need diff before and after.

Toke used my data in [2] below:
The above 400ns is the added cost of those extra operations on the slow path.
Before this patch the slow path only cost about 170ns, so this is more than a
200% performance degradation from the page tracking in this patch, and I fail
to see why that is acceptable :(

> 
>> The extra memory needed to track the pages is neatly encapsulated inside
>> xarray, which uses the 'struct xa_node' structure to track items. This
>> structure is 576 bytes long, with slots for 64 items, meaning that a
>> full node occurs only 9 bytes of overhead per slot it tracks (in
>> practice, it probably won't be this efficient, but in any case it should
>> be an acceptable overhead).
>>
>> [0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
>> [1] https://lore.kernel.org/r/20250320023202.GA25514@openwall.com
>> [2] https://lore.kernel.org/r/ae07144c-9295-4c9d-a400-153bb689fe9e@huawei.com



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v2 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
  2025-03-27  3:53     ` Yunsheng Lin
@ 2025-03-27  4:59       ` Mina Almasry
  2025-03-27  7:21         ` Yunsheng Lin
  0 siblings, 1 reply; 17+ messages in thread
From: Mina Almasry @ 2025-03-27  4:59 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: Saeed Mahameed, Toke Høiland-Jørgensen,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Leon Romanovsky, Tariq Toukan, Andrew Lunn, Eric Dumazet,
	Paolo Abeni, Ilias Apalodimas, Simon Horman, Andrew Morton,
	Yonglong Liu, Pavel Begunkov, Matthew Wilcox, netdev, bpf,
	linux-rdma, linux-mm, Qiuling Ren, Yuying Ma

On Wed, Mar 26, 2025 at 8:54 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >> Since all the tracking added in this patch is performed on DMA
> >> map/unmap, no additional code is needed in the fast path, meaning the
> >> performance overhead of this tracking is negligible there. A
> >> micro-benchmark shows that the total overhead of the tracking itself is
> >> about 400 ns (39 cycles(tsc) 395.218 ns; sum for both map and unmap[2]).
> >> Since this cost is only paid on DMA map and unmap, it seems like an
> >> acceptable cost to fix the late unmap issue. Further optimisation can
> >> narrow the cases where this cost is paid (for instance by eliding the
> >> tracking when DMA map/unmap is a no-op).
> >>
> > What I am missing here, what is the added cost of those extra operations on
> > the slow path compared to before this patch? Total overhead being
> > acceptable doesn't justify the change, we need diff before and after.
>
> Toke used my data in [2] below:
> The above 400ns is the added cost of those extra operations on the slow path,
> before this patch the slow path only cost about 170ns, so there is more than
> 200% performance degradation for the page tracking in this patch, which I
> failed to see why it is acceptable:(
>

You may be correct about the absolute value of the overhead added
(400ns), but I'm not sure it's a 200% regression.

What time_bench_page_pool03_slow actually does each iteration:
- Allocates a page *from the fast path*
- Frees a page through the slow path (recycling disabled).

Notably, it doesn't do anything in the slow path that I imagine is
actually expensive: alloc_page, dma_map_page, & dma_unmap_page.

We do not have an existing benchmark case that actually tests the full
cost of the slow path (i.e., the full cost of page_pool_alloc from the slow
path with dma-mapping and page_pool_put_page to the slow path with
dma-unmapping). That test case would have given us the full picture in
terms of % regression.

This is partly why I want to upstream the benchmark. Such cases can be
added after it is upstreamed.
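
To make that concrete, here is a rough sketch of the kind of case I mean,
using ktime directly rather than the time_bench framework (the
bench_pp_map_unmap_cycle name and the parameter choices are made up, not a
real patch to the bench module). Every allocation below misses both the cache
and the ring, so it pays alloc_pages() + dma_map_page() (plus the xarray
insert with this patch applied), and page_pool_destroy() then pays
dma_unmap_page() (plus the xarray removal) for every page still in the ring:

#include <linux/ktime.h>
#include <linux/math64.h>
#include <linux/slab.h>
#include <net/page_pool/helpers.h>
#include <net/page_pool/types.h>

/* Measures the combined cost of DMA map on allocation and DMA unmap on
 * pool teardown, amortised per page. n_pages must fit the ptr_ring
 * limit (32768).
 */
static u64 bench_pp_map_unmap_cycle(struct device *dev, int n_pages)
{
	struct page_pool_params params = {
		.order		= 0,
		.pool_size	= n_pages,
		.nid		= NUMA_NO_NODE,
		.dev		= dev,
		.dma_dir	= DMA_FROM_DEVICE,
		.flags		= PP_FLAG_DMA_MAP,
	};
	struct page_pool *pp;
	struct page **pages;
	u64 start, stop;
	int i;

	if (n_pages <= 0)
		return 0;

	pages = kcalloc(n_pages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return 0;

	pp = page_pool_create(&params);
	if (IS_ERR(pp)) {
		kfree(pages);
		return 0;
	}

	start = ktime_get_ns();

	/* Cold allocations: every page goes through the slow path and is
	 * DMA-mapped (and, with this patch, inserted into the xarray).
	 */
	for (i = 0; i < n_pages; i++)
		pages[i] = page_pool_alloc_pages(pp, GFP_KERNEL);

	for (i = 0; i < n_pages; i++)
		if (pages[i])
			page_pool_put_full_page(pp, pages[i], false);

	/* Teardown DMA-unmaps (and, with this patch, removes from the
	 * xarray) every page held in the ring.
	 */
	page_pool_destroy(pp);

	stop = ktime_get_ns();
	kfree(pages);

	return div_u64(stop - start, n_pages);
}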

--
Thanks,
Mina


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v2 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
  2025-03-27  4:59       ` Mina Almasry
@ 2025-03-27  7:21         ` Yunsheng Lin
  0 siblings, 0 replies; 17+ messages in thread
From: Yunsheng Lin @ 2025-03-27  7:21 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Saeed Mahameed, Toke Høiland-Jørgensen,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Leon Romanovsky, Tariq Toukan, Andrew Lunn, Eric Dumazet,
	Paolo Abeni, Ilias Apalodimas, Simon Horman, Andrew Morton,
	Yonglong Liu, Pavel Begunkov, Matthew Wilcox, netdev, bpf,
	linux-rdma, linux-mm, Qiuling Ren, Yuying Ma

On 2025/3/27 12:59, Mina Almasry wrote:
> On Wed, Mar 26, 2025 at 8:54 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>>>
>>>> Since all the tracking added in this patch is performed on DMA
>>>> map/unmap, no additional code is needed in the fast path, meaning the
>>>> performance overhead of this tracking is negligible there. A
>>>> micro-benchmark shows that the total overhead of the tracking itself is
>>>> about 400 ns (39 cycles(tsc) 395.218 ns; sum for both map and unmap[2]).
>>>> Since this cost is only paid on DMA map and unmap, it seems like an
>>>> acceptable cost to fix the late unmap issue. Further optimisation can
>>>> narrow the cases where this cost is paid (for instance by eliding the
>>>> tracking when DMA map/unmap is a no-op).

See the above statement, and note that the above optimisation was also
discussed before and seemed infeasible too.

> 
> what time_bench_page_pool03_slow actually does each iteration:
> - Allocates a page *from the fast path*
> - Frees a page to through the slow path (recycling disabled).
> 
> Notably it doesn't do anything in the slow path that I imagine is
> actually expensive: alloc_page, dma_map_page, & dma_unmap_page.

As above, for most arches, the DMA map/unmap seems to be almost a no-op when
a page_pool is created with the PP_FLAG_DMA_MAP flag and there is no
IOMMU/swiotlb behind the DMA mapping.

> 
> We do not have an existing benchmark case that actually tests the full
> cost of the slow path (i.e full cost of page_pool_alloc from slow path
> with dma-mapping and page_pool_put_page to the slow path with
> dma-unmapping). That test case would have given us the full picture in
> terms of % regression.
> 
> This is partly why I want to upstream the benchmark. Such cases can be
> added after it is upstreamed.

Why not add it now, when you seem to be arguing that exercising the
dma_map_page() and dma_unmap_page() code paths may change the full picture
here?


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-03-27  7:21 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-03-25 15:45 [PATCH net-next v2 0/3] Fix late DMA unmap crash for page pool Toke Høiland-Jørgensen
2025-03-25 15:45 ` [PATCH net-next v2 1/3] page_pool: Move pp_magic check into helper functions Toke Høiland-Jørgensen
2025-03-25 15:45 ` [PATCH net-next v2 2/3] page_pool: Turn dma_sync and dma_sync_cpu fields into a bitmap Toke Høiland-Jørgensen
2025-03-25 22:17   ` Jakub Kicinski
2025-03-26  8:12     ` Toke Høiland-Jørgensen
2025-03-26 11:23       ` Jakub Kicinski
2025-03-26 11:29         ` Jakub Kicinski
2025-03-26 18:00   ` Saeed Mahameed
2025-03-25 15:45 ` [PATCH net-next v2 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool Toke Høiland-Jørgensen
2025-03-26 13:54   ` Jesper Dangaard Brouer
2025-03-26 18:22   ` Saeed Mahameed
2025-03-26 20:02     ` Mina Almasry
2025-03-27  0:29       ` Saeed Mahameed
2025-03-27  1:37         ` Mina Almasry
2025-03-27  3:53     ` Yunsheng Lin
2025-03-27  4:59       ` Mina Almasry
2025-03-27  7:21         ` Yunsheng Lin
