* [PATCH net-next 1/3] page_pool: Move pp_magic check into helper functions
2025-03-14 10:10 [PATCH net-next 0/3] Fix late DMA unmap crash for page pool Toke Høiland-Jørgensen
@ 2025-03-14 10:10 ` Toke Høiland-Jørgensen
2025-03-21 17:16 ` Mina Almasry
2025-03-14 10:10 ` [PATCH net-next 2/3] page_pool: Turn dma_sync and dma_sync_cpu fields into a bitmap Toke Høiland-Jørgensen
` (2 subsequent siblings)
3 siblings, 1 reply; 19+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-14 10:10 UTC (permalink / raw)
To: David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Andrew Lunn,
Eric Dumazet, Paolo Abeni, Ilias Apalodimas, Simon Horman,
Andrew Morton, Mina Almasry, Yonglong Liu, Yunsheng Lin,
Pavel Begunkov, Matthew Wilcox
Cc: netdev, bpf, linux-rdma, linux-mm, Toke Høiland-Jørgensen
Since we are about to stash some more information into the pp_magic
field, let's move the magic signature checks into a pair of helper
functions so they can be changed in one place.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c | 4 ++--
include/net/page_pool/types.h | 18 ++++++++++++++++++
mm/page_alloc.c | 9 +++------
net/core/netmem_priv.h | 5 +++++
net/core/skbuff.c | 16 ++--------------
net/core/xdp.c | 4 ++--
6 files changed, 32 insertions(+), 24 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 6f3094a479e1ec61854bb48a6a0c812167487173..70c6f0b2abb921778c98fbd428594ebd7986a302 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -706,8 +706,8 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo);
page = xdpi.page.page;
- /* No need to check ((page->pp_magic & ~0x3UL) == PP_SIGNATURE)
- * as we know this is a page_pool page.
+ /* No need to check page_pool_page_is_pp() as we
+ * know this is a page_pool page.
*/
page_pool_recycle_direct(page->pp, page);
} while (++n < num);
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 36eb57d73abc6cfc601e700ca08be20fb8281055..df0d3c1608929605224feb26173135ff37951ef8 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -54,6 +54,14 @@ struct pp_alloc_cache {
netmem_ref cache[PP_ALLOC_CACHE_SIZE];
};
+/* Mask used for checking in page_pool_page_is_pp() below. page->pp_magic is
+ * OR'ed with PP_SIGNATURE after the allocation in order to preserve bit 0 for
+ * the head page of compound page and bit 1 for pfmemalloc page.
+ * page_is_pfmemalloc() is checked in __page_pool_put_page() to avoid recycling
+ * the pfmemalloc page.
+ */
+#define PP_MAGIC_MASK ~0x3UL
+
/**
* struct page_pool_params - page pool parameters
* @fast: params accessed frequently on hotpath
@@ -264,6 +272,11 @@ void page_pool_destroy(struct page_pool *pool);
void page_pool_use_xdp_mem(struct page_pool *pool, void (*disconnect)(void *),
const struct xdp_mem_info *mem);
void page_pool_put_netmem_bulk(netmem_ref *data, u32 count);
+
+static inline bool page_pool_page_is_pp(struct page *page)
+{
+ return (page->pp_magic & PP_MAGIC_MASK) == PP_SIGNATURE;
+}
#else
static inline void page_pool_destroy(struct page_pool *pool)
{
@@ -278,6 +291,11 @@ static inline void page_pool_use_xdp_mem(struct page_pool *pool,
static inline void page_pool_put_netmem_bulk(netmem_ref *data, u32 count)
{
}
+
+static inline bool page_pool_page_is_pp(struct page *page)
+{
+ return false;
+}
#endif
void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 579789600a3c7bfb7b0d847d51af702a9d4b139a..0268b68935ceb27d9781e59a474a234a6a61ea74 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -55,6 +55,7 @@
#include <linux/delayacct.h>
#include <linux/cacheinfo.h>
#include <linux/pgalloc_tag.h>
+#include <net/page_pool/types.h>
#include <asm/div64.h>
#include "internal.h"
#include "shuffle.h"
@@ -872,9 +873,7 @@ static inline bool page_expected_state(struct page *page,
#ifdef CONFIG_MEMCG
page->memcg_data |
#endif
-#ifdef CONFIG_PAGE_POOL
- ((page->pp_magic & ~0x3UL) == PP_SIGNATURE) |
-#endif
+ page_pool_page_is_pp(page) |
(page->flags & check_flags)))
return false;
@@ -901,10 +900,8 @@ static const char *page_bad_reason(struct page *page, unsigned long flags)
if (unlikely(page->memcg_data))
bad_reason = "page still charged to cgroup";
#endif
-#ifdef CONFIG_PAGE_POOL
- if (unlikely((page->pp_magic & ~0x3UL) == PP_SIGNATURE))
+ if (unlikely(page_pool_page_is_pp(page)))
bad_reason = "page_pool leak";
-#endif
return bad_reason;
}
diff --git a/net/core/netmem_priv.h b/net/core/netmem_priv.h
index 7eadb8393e002fd1cc2cef8a313d2ea7df76f301..f33162fd281c23e109273ba09950c5d0a2829bc9 100644
--- a/net/core/netmem_priv.h
+++ b/net/core/netmem_priv.h
@@ -18,6 +18,11 @@ static inline void netmem_clear_pp_magic(netmem_ref netmem)
__netmem_clear_lsb(netmem)->pp_magic = 0;
}
+static inline bool netmem_is_pp(netmem_ref netmem)
+{
+ return (netmem_get_pp_magic(netmem) & PP_MAGIC_MASK) == PP_SIGNATURE;
+}
+
static inline void netmem_set_pp(netmem_ref netmem, struct page_pool *pool)
{
__netmem_clear_lsb(netmem)->pp = pool;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ab8acb737b93299f503e5c298b87e18edd59d555..a64d777488e403d5fdef83ae42ae9e4924c1a0dc 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -893,11 +893,6 @@ static void skb_clone_fraglist(struct sk_buff *skb)
skb_get(list);
}
-static bool is_pp_netmem(netmem_ref netmem)
-{
- return (netmem_get_pp_magic(netmem) & ~0x3UL) == PP_SIGNATURE;
-}
-
int skb_pp_cow_data(struct page_pool *pool, struct sk_buff **pskb,
unsigned int headroom)
{
@@ -995,14 +990,7 @@ bool napi_pp_put_page(netmem_ref netmem)
{
netmem = netmem_compound_head(netmem);
- /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation
- * in order to preserve any existing bits, such as bit 0 for the
- * head page of compound page and bit 1 for pfmemalloc page, so
- * mask those bits for freeing side when doing below checking,
- * and page_is_pfmemalloc() is checked in __page_pool_put_page()
- * to avoid recycling the pfmemalloc page.
- */
- if (unlikely(!is_pp_netmem(netmem)))
+ if (unlikely(!netmem_is_pp(netmem)))
return false;
page_pool_put_full_netmem(netmem_get_pp(netmem), netmem, false);
@@ -1042,7 +1030,7 @@ static int skb_pp_frag_ref(struct sk_buff *skb)
for (i = 0; i < shinfo->nr_frags; i++) {
head_netmem = netmem_compound_head(shinfo->frags[i].netmem);
- if (likely(is_pp_netmem(head_netmem)))
+ if (likely(netmem_is_pp(head_netmem)))
page_pool_ref_netmem(head_netmem);
else
page_ref_inc(netmem_to_page(head_netmem));
diff --git a/net/core/xdp.c b/net/core/xdp.c
index f86eedad586a77eb63a96a85aa6d068d3e94f077..0ba73943c6eed873b3d1c681b3b9a802b590f2d9 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -437,8 +437,8 @@ void __xdp_return(netmem_ref netmem, enum xdp_mem_type mem_type,
netmem = netmem_compound_head(netmem);
if (napi_direct && xdp_return_frame_no_direct())
napi_direct = false;
- /* No need to check ((page->pp_magic & ~0x3UL) == PP_SIGNATURE)
- * as mem->type knows this a page_pool page
+ /* No need to check netmem_is_pp() as mem->type knows this a
+ * page_pool page
*/
page_pool_put_full_netmem(netmem_get_pp(netmem), netmem,
napi_direct);
--
2.48.1
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH net-next 1/3] page_pool: Move pp_magic check into helper functions
2025-03-14 10:10 ` [PATCH net-next 1/3] page_pool: Move pp_magic check into helper functions Toke Høiland-Jørgensen
@ 2025-03-21 17:16 ` Mina Almasry
0 siblings, 0 replies; 19+ messages in thread
From: Mina Almasry @ 2025-03-21 17:16 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Andrew Lunn,
Eric Dumazet, Paolo Abeni, Ilias Apalodimas, Simon Horman,
Andrew Morton, Yonglong Liu, Yunsheng Lin, Pavel Begunkov,
Matthew Wilcox, netdev, bpf, linux-rdma, linux-mm
On Fri, Mar 14, 2025 at 3:12 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Since we are about to stash some more information into the pp_magic
> field, let's move the magic signature checks into a pair of helper
> functions so it can be changed in one place.
>
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Straightforward conversion.
Reviewed-by: Mina Almasry <almasrymina@google.com>
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH net-next 2/3] page_pool: Turn dma_sync and dma_sync_cpu fields into a bitmap
2025-03-14 10:10 [PATCH net-next 0/3] Fix late DMA unmap crash for page pool Toke Høiland-Jørgensen
2025-03-14 10:10 ` [PATCH net-next 1/3] page_pool: Move pp_magic check into helper functions Toke Høiland-Jørgensen
@ 2025-03-14 10:10 ` Toke Høiland-Jørgensen
2025-03-15 9:31 ` Yunsheng Lin
2025-03-21 23:13 ` Mina Almasry
2025-03-14 10:10 ` [PATCH net-next 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool Toke Høiland-Jørgensen
2025-03-17 2:02 ` [PATCH net-next 0/3] Fix late DMA unmap crash for page pool Yonglong Liu
3 siblings, 2 replies; 19+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-14 10:10 UTC (permalink / raw)
To: David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Andrew Lunn,
Eric Dumazet, Paolo Abeni, Ilias Apalodimas, Simon Horman,
Andrew Morton, Mina Almasry, Yonglong Liu, Yunsheng Lin,
Pavel Begunkov, Matthew Wilcox
Cc: netdev, bpf, linux-rdma, linux-mm, Toke Høiland-Jørgensen
Change the single-bit booleans for dma_sync into an unsigned long with
BIT() definitions so that a subsequent patch can write them both with a
single WRITE_ONCE() on teardown. Also move the check for the sync_cpu
side into __page_pool_dma_sync_for_cpu() so it can be disabled for
non-netmem providers as well.
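As a rough sketch of the end state this prepares for (condensed from this
patch and the next one; illustrative only, not a compilable unit):

  /* both sync directions live in one word */
  #define PP_DMA_SYNC_DEV	BIT(0)
  #define PP_DMA_SYNC_CPU	BIT(1)

  /* init: CPU sync is enabled unconditionally, device sync on request */
  pool->dma_sync = PP_DMA_SYNC_CPU;
  if (pool->slow.flags & PP_FLAG_DMA_SYNC_DEV)
	pool->dma_sync |= PP_DMA_SYNC_DEV;

  /* teardown (done in the next patch): clear both with a single store */
  WRITE_ONCE(pool->dma_sync, 0);

  /* readers test the individual bits */
  if (READ_ONCE(pool->dma_sync) & PP_DMA_SYNC_DEV)
	__page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);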
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
include/net/page_pool/helpers.h | 6 +++---
include/net/page_pool/types.h | 8 ++++++--
net/core/devmem.c | 3 +--
net/core/page_pool.c | 9 +++++----
4 files changed, 15 insertions(+), 11 deletions(-)
diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index 582a3d00cbe2315edeb92850b6a42ab21e509e45..7ed32bde4b8944deb7fb22e291e95b8487be681a 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -443,6 +443,9 @@ static inline void __page_pool_dma_sync_for_cpu(const struct page_pool *pool,
const dma_addr_t dma_addr,
u32 offset, u32 dma_sync_size)
{
+ if (!(READ_ONCE(pool->dma_sync) & PP_DMA_SYNC_CPU))
+ return;
+
dma_sync_single_range_for_cpu(pool->p.dev, dma_addr,
offset + pool->p.offset, dma_sync_size,
page_pool_get_dma_dir(pool));
@@ -473,9 +476,6 @@ page_pool_dma_sync_netmem_for_cpu(const struct page_pool *pool,
const netmem_ref netmem, u32 offset,
u32 dma_sync_size)
{
- if (!pool->dma_sync_for_cpu)
- return;
-
__page_pool_dma_sync_for_cpu(pool,
page_pool_get_dma_addr_netmem(netmem),
offset, dma_sync_size);
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index df0d3c1608929605224feb26173135ff37951ef8..fbe34024b20061e8bcd1d4474f6ebfc70992f1eb 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -33,6 +33,10 @@
#define PP_FLAG_ALL (PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV | \
PP_FLAG_SYSTEM_POOL | PP_FLAG_ALLOW_UNREADABLE_NETMEM)
+/* bit values used in pp->dma_sync */
+#define PP_DMA_SYNC_DEV BIT(0)
+#define PP_DMA_SYNC_CPU BIT(1)
+
/*
* Fast allocation side cache array/stack
*
@@ -175,12 +179,12 @@ struct page_pool {
bool has_init_callback:1; /* slow::init_callback is set */
bool dma_map:1; /* Perform DMA mapping */
- bool dma_sync:1; /* Perform DMA sync for device */
- bool dma_sync_for_cpu:1; /* Perform DMA sync for cpu */
#ifdef CONFIG_PAGE_POOL_STATS
bool system:1; /* This is a global percpu pool */
#endif
+ unsigned long dma_sync;
+
__cacheline_group_begin_aligned(frag, PAGE_POOL_FRAG_GROUP_ALIGN);
long frag_users;
netmem_ref frag_page;
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 7c6e0b5b6acb55f376ec725dfb71d1f70a4320c3..16e43752566feb510b3e47fbec2d8da0f26a6adc 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -337,8 +337,7 @@ int mp_dmabuf_devmem_init(struct page_pool *pool)
/* dma-buf dma addresses do not need and should not be used with
* dma_sync_for_cpu/device. Force disable dma_sync.
*/
- pool->dma_sync = false;
- pool->dma_sync_for_cpu = false;
+ pool->dma_sync = 0;
if (pool->p.order != 0)
return -E2BIG;
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index acef1fcd8ddcfd1853a6f2055c1f1820ab248e8d..d51ca4389dd62d8bc266a9a2b792838257173535 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -203,7 +203,7 @@ static int page_pool_init(struct page_pool *pool,
memcpy(&pool->slow, &params->slow, sizeof(pool->slow));
pool->cpuid = cpuid;
- pool->dma_sync_for_cpu = true;
+ pool->dma_sync = PP_DMA_SYNC_CPU;
/* Validate only known flags were used */
if (pool->slow.flags & ~PP_FLAG_ALL)
@@ -238,7 +238,7 @@ static int page_pool_init(struct page_pool *pool,
if (!pool->p.max_len)
return -EINVAL;
- pool->dma_sync = true;
+ pool->dma_sync |= PP_DMA_SYNC_DEV;
/* pool->p.offset has to be set according to the address
* offset used by the DMA engine to start copying rx data
@@ -291,7 +291,7 @@ static int page_pool_init(struct page_pool *pool,
}
if (pool->mp_ops) {
- if (!pool->dma_map || !pool->dma_sync)
+ if (!pool->dma_map || !(pool->dma_sync & PP_DMA_SYNC_DEV))
return -EOPNOTSUPP;
if (WARN_ON(!is_kernel_rodata((unsigned long)pool->mp_ops))) {
@@ -466,7 +466,8 @@ page_pool_dma_sync_for_device(const struct page_pool *pool,
netmem_ref netmem,
u32 dma_sync_size)
{
- if (pool->dma_sync && dma_dev_need_sync(pool->p.dev))
+ if ((READ_ONCE(pool->dma_sync) & PP_DMA_SYNC_DEV) &&
+ dma_dev_need_sync(pool->p.dev))
__page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
}
--
2.48.1
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH net-next 2/3] page_pool: Turn dma_sync and dma_sync_cpu fields into a bitmap
2025-03-14 10:10 ` [PATCH net-next 2/3] page_pool: Turn dma_sync and dma_sync_cpu fields into a bitmap Toke Høiland-Jørgensen
@ 2025-03-15 9:31 ` Yunsheng Lin
2025-03-17 14:55 ` Toke Høiland-Jørgensen
2025-03-21 23:13 ` Mina Almasry
1 sibling, 1 reply; 19+ messages in thread
From: Yunsheng Lin @ 2025-03-15 9:31 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, David S. Miller,
Jakub Kicinski, Jesper Dangaard Brouer, Saeed Mahameed,
Leon Romanovsky, Tariq Toukan, Andrew Lunn, Eric Dumazet,
Paolo Abeni, Ilias Apalodimas, Simon Horman, Andrew Morton,
Mina Almasry, Yonglong Liu, Yunsheng Lin, Pavel Begunkov,
Matthew Wilcox
Cc: netdev, bpf, linux-rdma, linux-mm
On 3/14/2025 6:10 PM, Toke Høiland-Jørgensen wrote:
> Change the single-bit booleans for dma_sync into an unsigned long with
> BIT() definitions so that a subsequent patch can write them both with a
> single WRITE_ONCE() on teardown. Also move the check for the sync_cpu
> side into __page_pool_dma_sync_for_cpu() so it can be disabled for
> non-netmem providers as well.
I guess this patch is in preparation for disabling the
page_pool_dma_sync_for_cpu() related API on teardown?
It seems unnecessary for the page_pool_dma_sync_for_cpu() related API
to be disabled on teardown, as page_pool_dma_sync_for_cpu() has the same
calling assumption as the alloc API: it is not supposed to be called
by drivers once page_pool_destroy() has been called.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH net-next 2/3] page_pool: Turn dma_sync and dma_sync_cpu fields into a bitmap
2025-03-15 9:31 ` Yunsheng Lin
@ 2025-03-17 14:55 ` Toke Høiland-Jørgensen
0 siblings, 0 replies; 19+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-17 14:55 UTC (permalink / raw)
To: Yunsheng Lin, David S. Miller, Jakub Kicinski,
Jesper Dangaard Brouer, Saeed Mahameed, Leon Romanovsky,
Tariq Toukan, Andrew Lunn, Eric Dumazet, Paolo Abeni,
Ilias Apalodimas, Simon Horman, Andrew Morton, Mina Almasry,
Yonglong Liu, Yunsheng Lin, Pavel Begunkov, Matthew Wilcox
Cc: netdev, bpf, linux-rdma, linux-mm
Yunsheng Lin <yunshenglin0825@gmail.com> writes:
> On 3/14/2025 6:10 PM, Toke Høiland-Jørgensen wrote:
>> Change the single-bit booleans for dma_sync into an unsigned long with
>> BIT() definitions so that a subsequent patch can write them both with a
>> single WRITE_ONCE() on teardown. Also move the check for the sync_cpu
>> side into __page_pool_dma_sync_for_cpu() so it can be disabled for
>> non-netmem providers as well.
>
> I guess this patch is for the preparation of disabling the
> page_pool_dma_sync_for_cpu() related API on teardown?
>
> It seems unnecessary that page_pool_dma_sync_for_cpu() related API need
> to be disabled on teardown as page_pool_dma_sync_for_cpu() has the same
> calling assumption as the alloc API, which is not supposed to be called
> by the drivers when page_pool_destroy() is called.
Sure, we could keep it to the dma_sync_for_dev() direction only, but
making both directions use the same variable to store the state, and
just resetting it at once, seemed simpler and more consistent.
-Toke
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH net-next 2/3] page_pool: Turn dma_sync and dma_sync_cpu fields into a bitmap
2025-03-14 10:10 ` [PATCH net-next 2/3] page_pool: Turn dma_sync and dma_sync_cpu fields into a bitmap Toke Høiland-Jørgensen
2025-03-15 9:31 ` Yunsheng Lin
@ 2025-03-21 23:13 ` Mina Almasry
1 sibling, 0 replies; 19+ messages in thread
From: Mina Almasry @ 2025-03-21 23:13 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Andrew Lunn,
Eric Dumazet, Paolo Abeni, Ilias Apalodimas, Simon Horman,
Andrew Morton, Yonglong Liu, Yunsheng Lin, Pavel Begunkov,
Matthew Wilcox, netdev, bpf, linux-rdma, linux-mm
On Fri, Mar 14, 2025 at 3:12 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Change the single-bit booleans for dma_sync into an unsigned long with
> BIT() definitions so that a subsequent patch can write them both with a
> single WRITE_ONCE() on teardown. Also move the check for the sync_cpu
> side into __page_pool_dma_sync_for_cpu() so it can be disabled for
> non-netmem providers as well.
>
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
> ---
> include/net/page_pool/helpers.h | 6 +++---
> include/net/page_pool/types.h | 8 ++++++--
> net/core/devmem.c | 3 +--
> net/core/page_pool.c | 9 +++++----
> 4 files changed, 15 insertions(+), 11 deletions(-)
>
> diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
> index 582a3d00cbe2315edeb92850b6a42ab21e509e45..7ed32bde4b8944deb7fb22e291e95b8487be681a 100644
> --- a/include/net/page_pool/helpers.h
> +++ b/include/net/page_pool/helpers.h
> @@ -443,6 +443,9 @@ static inline void __page_pool_dma_sync_for_cpu(const struct page_pool *pool,
> const dma_addr_t dma_addr,
> u32 offset, u32 dma_sync_size)
> {
> + if (!(READ_ONCE(pool->dma_sync) & PP_DMA_SYNC_CPU))
> + return;
> +
> dma_sync_single_range_for_cpu(pool->p.dev, dma_addr,
> offset + pool->p.offset, dma_sync_size,
> page_pool_get_dma_dir(pool));
> @@ -473,9 +476,6 @@ page_pool_dma_sync_netmem_for_cpu(const struct page_pool *pool,
> const netmem_ref netmem, u32 offset,
> u32 dma_sync_size)
> {
> - if (!pool->dma_sync_for_cpu)
> - return;
> -
> __page_pool_dma_sync_for_cpu(pool,
> page_pool_get_dma_addr_netmem(netmem),
> offset, dma_sync_size);
I think moving the check to __page_pool_dma_sync_for_cpu is fine, but
I would have preferred to keep it as-is actually.
I think if we're syncing netmem we should check dma_sync_for_cpu,
because the netmem may not be dma-syncable. But for pages, they will
likely always be dma-syncable. Some driver may have opted to do a perf
optimization by calling __page_pool_dma_sync_for_cpu() on a dma-addr
that it knows came from a page to save some cycles of netmem checking.
> diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
> index df0d3c1608929605224feb26173135ff37951ef8..fbe34024b20061e8bcd1d4474f6ebfc70992f1eb 100644
> --- a/include/net/page_pool/types.h
> +++ b/include/net/page_pool/types.h
> @@ -33,6 +33,10 @@
> #define PP_FLAG_ALL (PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV | \
> PP_FLAG_SYSTEM_POOL | PP_FLAG_ALLOW_UNREADABLE_NETMEM)
>
> +/* bit values used in pp->dma_sync */
> +#define PP_DMA_SYNC_DEV BIT(0)
> +#define PP_DMA_SYNC_CPU BIT(1)
> +
> /*
> * Fast allocation side cache array/stack
> *
> @@ -175,12 +179,12 @@ struct page_pool {
>
> bool has_init_callback:1; /* slow::init_callback is set */
> bool dma_map:1; /* Perform DMA mapping */
> - bool dma_sync:1; /* Perform DMA sync for device */
> - bool dma_sync_for_cpu:1; /* Perform DMA sync for cpu */
> #ifdef CONFIG_PAGE_POOL_STATS
> bool system:1; /* This is a global percpu pool */
> #endif
>
> + unsigned long dma_sync;
> +
> __cacheline_group_begin_aligned(frag, PAGE_POOL_FRAG_GROUP_ALIGN);
> long frag_users;
> netmem_ref frag_page;
> diff --git a/net/core/devmem.c b/net/core/devmem.c
> index 7c6e0b5b6acb55f376ec725dfb71d1f70a4320c3..16e43752566feb510b3e47fbec2d8da0f26a6adc 100644
> --- a/net/core/devmem.c
> +++ b/net/core/devmem.c
> @@ -337,8 +337,7 @@ int mp_dmabuf_devmem_init(struct page_pool *pool)
> /* dma-buf dma addresses do not need and should not be used with
> * dma_sync_for_cpu/device. Force disable dma_sync.
> */
> - pool->dma_sync = false;
> - pool->dma_sync_for_cpu = false;
> + pool->dma_sync = 0;
>
> if (pool->p.order != 0)
> return -E2BIG;
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index acef1fcd8ddcfd1853a6f2055c1f1820ab248e8d..d51ca4389dd62d8bc266a9a2b792838257173535 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -203,7 +203,7 @@ static int page_pool_init(struct page_pool *pool,
> memcpy(&pool->slow, &params->slow, sizeof(pool->slow));
>
> pool->cpuid = cpuid;
> - pool->dma_sync_for_cpu = true;
> + pool->dma_sync = PP_DMA_SYNC_CPU;
>
More pedantically this should have been pool->dma_sync |=
PP_DMA_SYNC_CPU, but it doesn't matter since this variable is 0
initialized I think.
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH net-next 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-14 10:10 [PATCH net-next 0/3] Fix late DMA unmap crash for page pool Toke Høiland-Jørgensen
2025-03-14 10:10 ` [PATCH net-next 1/3] page_pool: Move pp_magic check into helper functions Toke Høiland-Jørgensen
2025-03-14 10:10 ` [PATCH net-next 2/3] page_pool: Turn dma_sync and dma_sync_cpu fields into a bitmap Toke Høiland-Jørgensen
@ 2025-03-14 10:10 ` Toke Høiland-Jørgensen
2025-03-15 9:46 ` Yunsheng Lin
2025-03-17 2:02 ` [PATCH net-next 0/3] Fix late DMA unmap crash for page pool Yonglong Liu
3 siblings, 1 reply; 19+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-14 10:10 UTC (permalink / raw)
To: David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Andrew Lunn,
Eric Dumazet, Paolo Abeni, Ilias Apalodimas, Simon Horman,
Andrew Morton, Mina Almasry, Yonglong Liu, Yunsheng Lin,
Pavel Begunkov, Matthew Wilcox
Cc: netdev, bpf, linux-rdma, linux-mm,
Toke Høiland-Jørgensen, Qiuling Ren, Yuying Ma
When enabling DMA mapping in page_pool, pages are kept DMA mapped until
they are released from the pool, to avoid the overhead of re-mapping the
pages every time they are used. This causes resource leaks and/or
crashes when there are pages still outstanding while the device is torn
down, because page_pool will attempt an unmap through a non-existent DMA
device on the subsequent page return.
To fix this, implement a simple tracking of outstanding DMA-mapped pages
in page pool using an xarray. This was first suggested by Mina[0], and
turns out to be fairly straightforward: We simply store pointers to
pages directly in the xarray with xa_alloc() when they are first DMA
mapped, and remove them from the array on unmap. Then, when a page pool
is torn down, it can simply walk the xarray and unmap all pages still
present there before returning, which also allows us to get rid of the
get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional
synchronisation is needed, as a page will only ever be unmapped once.
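For reference, the xarray calls involved reduce to roughly the following
sketch (error handling, the softirq/bh variants and the netmem details are
omitted; see the diff below for the real thing):

  xa_init_flags(&pool->dma_mapped, XA_FLAGS_ALLOC1);

  /* on DMA map: store the page and get an ID back */
  err = xa_alloc(&pool->dma_mapped, &id, page, PP_DMA_INDEX_LIMIT, gfp);

  /* on DMA unmap: only the caller that wins the cmpxchg unmaps, so a
   * page return racing with pool teardown still unmaps exactly once
   */
  if (xa_cmpxchg(&pool->dma_mapped, id, page, NULL, 0) == page)
	dma_unmap_page_attrs(pool->p.dev, dma,
			     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
			     DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);

  /* on pool teardown: unmap whatever is still outstanding */
  xa_for_each(&pool->dma_mapped, id, ptr)
	__page_pool_release_page_dma(pool, page_to_netmem(ptr));
  xa_destroy(&pool->dma_mapped);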
To avoid having to walk the entire xarray on unmap to find the page
reference, we stash the ID assigned by xa_alloc() into the page
structure itself, using the upper bits of the pp_magic field. This
requires a couple of defines to avoid conflicting with the
POISON_POINTER_DELTA define, but this is all evaluated at compile-time,
so does not affect run-time performance. The bitmap calculations in this
patch give the following number of bits for different architectures:
- 24 bits on 32-bit architectures
- 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
- 32 bits on other 64-bit architectures
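For the curious, the arithmetic behind those numbers, assuming the usual
CONFIG_ILLEGAL_POINTER_VALUE settings (the concrete poison constants below
are the arch defaults, not something defined by this patch):

  /* x86_64: POISON_POINTER_DELTA = 0xdead000000000000,
   * PP_SIGNATURE = 0x40 + POISON_POINTER_DELTA
   *
   *   PP_DMA_INDEX_SHIFT = 1 + __fls(0x40)             =  7
   *   _PP_DMA_INDEX_BITS = MIN(32, 64 - 7 - 1)         = 32
   *   PP_DMA_INDEX_BITS  = MIN(32, __ffs(delta) - 7)
   *                      = MIN(32, 48 - 7)             = 32
   *
   * 32-bit (POISON_POINTER_DELTA == 0):
   *   PP_DMA_INDEX_BITS  = MIN(32, 32 - 7 - 1)         = 24
   *
   * PPC64 (ILLEGAL_POINTER_VALUE = 0x5deadbeef0000000, lowest set bit 28):
   *   PP_DMA_INDEX_BITS  = MIN(32, 28 - 7)             = 21
   */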
Since all the tracking is performed on DMA map/unmap, no additional code
is needed in the fast path, meaning the performance overhead of this
tracking is negligible. A micro-benchmark shows that the total overhead
of using xarray for this purpose is about 400 ns (39 cycles(tsc) 395.218
ns; sum for both map and unmap[1]). Since this cost is only paid on DMA
map and unmap, it seems like an acceptable cost to fix the late unmap
issue. Further optimisation can narrow the cases where this cost is
paid (for instance by eliding the tracking when DMA map/unmap is a
no-op).
The extra memory needed to track the pages is neatly encapsulated inside
xarray, which uses the 'struct xa_node' structure to track items. This
structure is 576 bytes long, with slots for 64 items, meaning that a
full node incurs only 9 bytes of overhead per slot it tracks (in
practice, it probably won't be this efficient, but in any case it should
be an acceptable overhead).
[0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
[1] https://lore.kernel.org/r/ae07144c-9295-4c9d-a400-153bb689fe9e@huawei.com
Reported-by: Yonglong Liu <liuyonglong@huawei.com>
Closes: https://lore.kernel.org/r/8743264a-9700-4227-a556-5f931c720211@huawei.com
Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code")
Suggested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Qiuling Ren <qren@redhat.com>
Tested-by: Yuying Ma <yuma@redhat.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
include/net/page_pool/types.h | 36 +++++++++++++++++++---
net/core/netmem_priv.h | 28 ++++++++++++++++-
net/core/page_pool.c | 72 +++++++++++++++++++++++++++++++++++++------
3 files changed, 121 insertions(+), 15 deletions(-)
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index fbe34024b20061e8bcd1d4474f6ebfc70992f1eb..1e187489d392757c8dd2960870b0d875c1dde01b 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -6,6 +6,7 @@
#include <linux/dma-direction.h>
#include <linux/ptr_ring.h>
#include <linux/types.h>
+#include <linux/xarray.h>
#include <net/netmem.h>
#define PP_FLAG_DMA_MAP BIT(0) /* Should page_pool do the DMA
@@ -58,13 +59,38 @@ struct pp_alloc_cache {
netmem_ref cache[PP_ALLOC_CACHE_SIZE];
};
+/*
+ * DMA mapping IDs
+ *
+ * When DMA-mapping a page, we allocate an ID (from an xarray) and stash this in
+ * the upper bits of page->pp_magic. The number of bits available here is
+ * constrained by the size of an unsigned long, and the definition of
+ * PP_SIGNATURE.
+ */
+#define PP_DMA_INDEX_SHIFT (1 + __fls(PP_SIGNATURE - POISON_POINTER_DELTA))
+#define _PP_DMA_INDEX_BITS MIN(32, BITS_PER_LONG - PP_DMA_INDEX_SHIFT - 1)
+
+/* PP_SIGNATURE includes POISON_POINTER_DELTA, so limit the size of the DMA
+ * index to not overlap with that if set
+ */
+#if POISON_POINTER_DELTA > 0
+#define PP_DMA_INDEX_BITS MIN(_PP_DMA_INDEX_BITS, \
+ __ffs(POISON_POINTER_DELTA) - PP_DMA_INDEX_SHIFT)
+#else
+#define PP_DMA_INDEX_BITS _PP_DMA_INDEX_BITS
+#endif
+
+#define PP_DMA_INDEX_MASK GENMASK(PP_DMA_INDEX_BITS + PP_DMA_INDEX_SHIFT - 1, \
+ PP_DMA_INDEX_SHIFT)
+#define PP_DMA_INDEX_LIMIT XA_LIMIT(1, BIT(PP_DMA_INDEX_BITS) - 1)
+
/* Mask used for checking in page_pool_page_is_pp() below. page->pp_magic is
* OR'ed with PP_SIGNATURE after the allocation in order to preserve bit 0 for
- * the head page of compound page and bit 1 for pfmemalloc page.
- * page_is_pfmemalloc() is checked in __page_pool_put_page() to avoid recycling
- * the pfmemalloc page.
+ * the head page of compound page and bit 1 for pfmemalloc page, as well as the
+ * bits used for the DMA index. page_is_pfmemalloc() is checked in
+ * __page_pool_put_page() to avoid recycling the pfmemalloc page.
*/
-#define PP_MAGIC_MASK ~0x3UL
+#define PP_MAGIC_MASK ~(PP_DMA_INDEX_MASK | 0x3UL)
/**
* struct page_pool_params - page pool parameters
@@ -233,6 +259,8 @@ struct page_pool {
void *mp_priv;
const struct memory_provider_ops *mp_ops;
+ struct xarray dma_mapped;
+
#ifdef CONFIG_PAGE_POOL_STATS
/* recycle stats are per-cpu to avoid locking */
struct page_pool_recycle_stats __percpu *recycle_stats;
diff --git a/net/core/netmem_priv.h b/net/core/netmem_priv.h
index f33162fd281c23e109273ba09950c5d0a2829bc9..cd95394399b40c3604934ba7898eeeeacb8aee99 100644
--- a/net/core/netmem_priv.h
+++ b/net/core/netmem_priv.h
@@ -5,7 +5,7 @@
static inline unsigned long netmem_get_pp_magic(netmem_ref netmem)
{
- return __netmem_clear_lsb(netmem)->pp_magic;
+ return __netmem_clear_lsb(netmem)->pp_magic & ~PP_DMA_INDEX_MASK;
}
static inline void netmem_or_pp_magic(netmem_ref netmem, unsigned long pp_magic)
@@ -15,6 +15,8 @@ static inline void netmem_or_pp_magic(netmem_ref netmem, unsigned long pp_magic)
static inline void netmem_clear_pp_magic(netmem_ref netmem)
{
+ WARN_ON_ONCE(__netmem_clear_lsb(netmem)->pp_magic & PP_DMA_INDEX_MASK);
+
__netmem_clear_lsb(netmem)->pp_magic = 0;
}
@@ -33,4 +35,28 @@ static inline void netmem_set_dma_addr(netmem_ref netmem,
{
__netmem_clear_lsb(netmem)->dma_addr = dma_addr;
}
+
+static inline unsigned long netmem_get_dma_index(netmem_ref netmem)
+{
+ unsigned long magic;
+
+ if (WARN_ON_ONCE(netmem_is_net_iov(netmem)))
+ return 0;
+
+ magic = __netmem_clear_lsb(netmem)->pp_magic;
+
+ return (magic & PP_DMA_INDEX_MASK) >> PP_DMA_INDEX_SHIFT;
+}
+
+static inline void netmem_set_dma_index(netmem_ref netmem,
+ unsigned long id)
+{
+ unsigned long magic;
+
+ if (WARN_ON_ONCE(netmem_is_net_iov(netmem)))
+ return;
+
+ magic = netmem_get_pp_magic(netmem) | (id << PP_DMA_INDEX_SHIFT);
+ __netmem_clear_lsb(netmem)->pp_magic = magic;
+}
#endif
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index d51ca4389dd62d8bc266a9a2b792838257173535..5612d72d483cad8c24d2f703bb48aad185cfe59a 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -226,6 +226,8 @@ static int page_pool_init(struct page_pool *pool,
return -EINVAL;
pool->dma_map = true;
+
+ xa_init_flags(&pool->dma_mapped, XA_FLAGS_ALLOC1);
}
if (pool->slow.flags & PP_FLAG_DMA_SYNC_DEV) {
@@ -275,9 +277,6 @@ static int page_pool_init(struct page_pool *pool,
/* Driver calling page_pool_create() also call page_pool_destroy() */
refcount_set(&pool->user_cnt, 1);
- if (pool->dma_map)
- get_device(pool->p.dev);
-
if (pool->slow.flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM) {
/* We rely on rtnl_lock()ing to make sure netdev_rx_queue
* configuration doesn't change while we're initializing
@@ -325,7 +324,7 @@ static void page_pool_uninit(struct page_pool *pool)
ptr_ring_cleanup(&pool->ring, NULL);
if (pool->dma_map)
- put_device(pool->p.dev);
+ xa_destroy(&pool->dma_mapped);
#ifdef CONFIG_PAGE_POOL_STATS
if (!pool->system)
@@ -471,9 +470,11 @@ page_pool_dma_sync_for_device(const struct page_pool *pool,
__page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
}
-static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
+static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem, gfp_t gfp)
{
dma_addr_t dma;
+ int err;
+ u32 id;
/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
* since dma_addr_t can be either 32 or 64 bits and does not always fit
@@ -487,15 +488,28 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
if (dma_mapping_error(pool->p.dev, dma))
return false;
- if (page_pool_set_dma_addr_netmem(netmem, dma))
+ if (in_softirq())
+ err = xa_alloc(&pool->dma_mapped, &id, netmem_to_page(netmem),
+ PP_DMA_INDEX_LIMIT, gfp);
+ else
+ err = xa_alloc_bh(&pool->dma_mapped, &id, netmem_to_page(netmem),
+ PP_DMA_INDEX_LIMIT, gfp);
+ if (err) {
+ WARN_ONCE(1, "couldn't track DMA mapping, please report to netdev@");
goto unmap_failed;
+ }
+ if (page_pool_set_dma_addr_netmem(netmem, dma)) {
+ WARN_ONCE(1, "unexpected DMA address, please report to netdev@");
+ goto unmap_failed;
+ }
+
+ netmem_set_dma_index(netmem, id);
page_pool_dma_sync_for_device(pool, netmem, pool->p.max_len);
return true;
unmap_failed:
- WARN_ONCE(1, "unexpected DMA address, please report to netdev@");
dma_unmap_page_attrs(pool->p.dev, dma,
PAGE_SIZE << pool->p.order, pool->p.dma_dir,
DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
@@ -512,7 +526,7 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
if (unlikely(!page))
return NULL;
- if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page)))) {
+ if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page), gfp))) {
put_page(page);
return NULL;
}
@@ -558,7 +572,7 @@ static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool,
*/
for (i = 0; i < nr_pages; i++) {
netmem = pool->alloc.cache[i];
- if (dma_map && unlikely(!page_pool_dma_map(pool, netmem))) {
+ if (dma_map && unlikely(!page_pool_dma_map(pool, netmem, gfp))) {
put_page(netmem_to_page(netmem));
continue;
}
@@ -660,6 +674,8 @@ void page_pool_clear_pp_info(netmem_ref netmem)
static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
netmem_ref netmem)
{
+ struct page *old, *page = netmem_to_page(netmem);
+ unsigned long id;
dma_addr_t dma;
if (!pool->dma_map)
@@ -668,6 +684,17 @@ static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
*/
return;
+ id = netmem_get_dma_index(netmem);
+ if (!id)
+ return;
+
+ if (in_softirq())
+ old = xa_cmpxchg(&pool->dma_mapped, id, page, NULL, 0);
+ else
+ old = xa_cmpxchg_bh(&pool->dma_mapped, id, page, NULL, 0);
+ if (old != page)
+ return;
+
dma = page_pool_get_dma_addr_netmem(netmem);
/* When page is unmapped, it cannot be returned to our pool */
@@ -675,6 +702,7 @@ static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
PAGE_SIZE << pool->p.order, pool->p.dma_dir,
DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
page_pool_set_dma_addr_netmem(netmem, 0);
+ netmem_set_dma_index(netmem, 0);
}
/* Disconnects a page (from a page_pool). API users can have a need
@@ -1084,8 +1112,32 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
static void page_pool_scrub(struct page_pool *pool)
{
+ unsigned long id;
+ void *ptr;
+
page_pool_empty_alloc_cache_once(pool);
- pool->destroy_cnt++;
+ if (!pool->destroy_cnt++ && pool->dma_map) {
+ if (pool->dma_sync) {
+ /* paired with READ_ONCE in
+ * page_pool_dma_sync_for_device() and
+ * __page_pool_dma_sync_for_cpu()
+ */
+ WRITE_ONCE(pool->dma_sync, false);
+
+ /* Make sure all concurrent returns that may see the old
+ * value of dma_sync (and thus perform a sync) have
+ * finished before doing the unmapping below. Skip the
+ * wait if the device doesn't actually need syncing, or
+ * if there are no outstanding mapped pages.
+ */
+ if (dma_dev_need_sync(pool->p.dev) &&
+ !xa_empty(&pool->dma_mapped))
+ synchronize_net();
+ }
+
+ xa_for_each(&pool->dma_mapped, id, ptr)
+ __page_pool_release_page_dma(pool, page_to_netmem(ptr));
+ }
/* No more consumers should exist, but producers could still
* be in-flight.
--
2.48.1
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH net-next 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-14 10:10 ` [PATCH net-next 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool Toke Høiland-Jørgensen
@ 2025-03-15 9:46 ` Yunsheng Lin
2025-03-17 15:16 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 19+ messages in thread
From: Yunsheng Lin @ 2025-03-15 9:46 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, David S. Miller,
Jakub Kicinski, Jesper Dangaard Brouer, Saeed Mahameed,
Leon Romanovsky, Tariq Toukan, Andrew Lunn, Eric Dumazet,
Paolo Abeni, Ilias Apalodimas, Simon Horman, Andrew Morton,
Mina Almasry, Yonglong Liu, Yunsheng Lin, Pavel Begunkov,
Matthew Wilcox, Robin Murphy, IOMMU
Cc: netdev, bpf, linux-rdma, linux-mm, Qiuling Ren, Yuying Ma
On 3/14/2025 6:10 PM, Toke Høiland-Jørgensen wrote:
...
>
> To avoid having to walk the entire xarray on unmap to find the page
> reference, we stash the ID assigned by xa_alloc() into the page
> structure itself, using the upper bits of the pp_magic field. This
> requires a couple of defines to avoid conflicting with the
> POINTER_POISON_DELTA define, but this is all evaluated at compile-time,
> so does not affect run-time performance. The bitmap calculations in this
> patch gives the following number of bits for different architectures:
>
> - 24 bits on 32-bit architectures
> - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
> - 32 bits on other 64-bit architectures
From commit c07aea3ef4d4 ("mm: add a signature in struct page"):
"The page->signature field is aliased to page->lru.next and
page->compound_head, but it can't be set by mistake because the
signature value is a bad pointer, and can't trigger a false positive
in PageTail() because the last bit is 0."
And commit 8a5e5e02fc83 ("include/linux/poison.h: fix LIST_POISON{1,2}
offset"):
"Poison pointer values should be small enough to find a room in
non-mmap'able/hardly-mmap'able space."
So the question seems to be:
1. Is stashing the ID causing page->pp_magic to be in the mmap'able/
easier-mmap'able space? If yes, how can we make sure this will not
cause any security problem?
2. Is masking page->pp_magic causing a valid pointer for
page->lru.next or page->compound_head to be treated as a valid
PP_SIGNATURE, which might cause page_pool to recycle a page not
allocated via page_pool?
>
> Since all the tracking is performed on DMA map/unmap, no additional code
> is needed in the fast path, meaning the performance overhead of this
> tracking is negligible. A micro-benchmark shows that the total overhead
> of using xarray for this purpose is about 400 ns (39 cycles(tsc) 395.218
> ns; sum for both map and unmap[1]). Since this cost is only paid on DMA
> map and unmap, it seems like an acceptable cost to fix the late unmap
For most use cases when PP_FLAG_DMA_MAP is set and IOMMU is off, the
DMA map and unmap operation is almost negligible as said below, so the
cost is about a 200% performance degradation, which doesn't seem like an
acceptable cost.
> issue. Further optimisation can narrow the cases where this cost is
> paid (for instance by eliding the tracking when DMA map/unmap is a
> no-op).
The above was discussed in [1] and brought up again in [2], so cc'ing
Robin for any clarification on whether he still views the above as a
misuse of the DMA API.
1. https://lore.kernel.org/all/9a4d1357-f30d-420d-a575-7ae305ca6dda@huawei.com/
2. https://lore.kernel.org/all/caf31b5e-0e8f-4844-b7ba-ef59ed13b74e@arm.com/
>
> The extra memory needed to track the pages is neatly encapsulated inside
> xarray, which uses the 'struct xa_node' structure to track items. This
> structure is 576 bytes long, with slots for 64 items, meaning that a
> full node occurs only 9 bytes of overhead per slot it tracks (in
> practice, it probably won't be this efficient, but in any case it should
Is there any debug infrastructure to know if it is not this efficient,
as there may be a 576-byte overhead for a page in the worst case?
> be an acceptable overhead).
>
> [0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
> [1] https://lore.kernel.org/r/ae07144c-9295-4c9d-a400-153bb689fe9e@huawei.com
>
> Reported-by: Yonglong Liu <liuyonglong@huawei.com>
> Closes: https://lore.kernel.org/r/8743264a-9700-4227-a556-5f931c720211@huawei.com
> Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code")
> Suggested-by: Mina Almasry <almasrymina@google.com>
> Reviewed-by: Mina Almasry <almasrymina@google.com>
> Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Tested-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Tested-by: Qiuling Ren <qren@redhat.com>
> Tested-by: Yuying Ma <yuma@redhat.com>
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
...
> @@ -1084,8 +1112,32 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
>
> static void page_pool_scrub(struct page_pool *pool)
> {
> + unsigned long id;
> + void *ptr;
> +
> page_pool_empty_alloc_cache_once(pool);
> - pool->destroy_cnt++;
> + if (!pool->destroy_cnt++ && pool->dma_map) {
> + if (pool->dma_sync) {
> + /* paired with READ_ONCE in
> + * page_pool_dma_sync_for_device() and
> + * __page_pool_dma_sync_for_cpu()
> + */
> + WRITE_ONCE(pool->dma_sync, false);
> +
> + /* Make sure all concurrent returns that may see the old
> + * value of dma_sync (and thus perform a sync) have
> + * finished before doing the unmapping below. Skip the
> + * wait if the device doesn't actually need syncing, or
> + * if there are no outstanding mapped pages.
> + */
> + if (dma_dev_need_sync(pool->p.dev) &&
> + !xa_empty(&pool->dma_mapped))
> + synchronize_net();
I guess the above synchronize_net() is assuming that the above dma sync
API is always called in the softirq context, as it seems there is no
rcu read lock added in this patch to be paired with that.
Can't page_pool_put_page() be called in non-softirq context when
allow_direct is false and in_softirq() returns false?
> + }
> +
> + xa_for_each(&pool->dma_mapped, id, ptr)
> + __page_pool_release_page_dma(pool, page_to_netmem(ptr));
> + }
>
> /* No more consumers should exist, but producers could still
> * be in-flight.
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH net-next 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-15 9:46 ` Yunsheng Lin
@ 2025-03-17 15:16 ` Toke Høiland-Jørgensen
2025-03-18 12:39 ` Yunsheng Lin
0 siblings, 1 reply; 19+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-17 15:16 UTC (permalink / raw)
To: Yunsheng Lin, David S. Miller, Jakub Kicinski,
Jesper Dangaard Brouer, Saeed Mahameed, Leon Romanovsky,
Tariq Toukan, Andrew Lunn, Eric Dumazet, Paolo Abeni,
Ilias Apalodimas, Simon Horman, Andrew Morton, Mina Almasry,
Yonglong Liu, Yunsheng Lin, Pavel Begunkov, Matthew Wilcox,
Robin Murphy, IOMMU
Cc: netdev, bpf, linux-rdma, linux-mm, Qiuling Ren, Yuying Ma
Yunsheng Lin <yunshenglin0825@gmail.com> writes:
> On 3/14/2025 6:10 PM, Toke Høiland-Jørgensen wrote:
>
> ...
>
>>
>> To avoid having to walk the entire xarray on unmap to find the page
>> reference, we stash the ID assigned by xa_alloc() into the page
>> structure itself, using the upper bits of the pp_magic field. This
>> requires a couple of defines to avoid conflicting with the
>> POINTER_POISON_DELTA define, but this is all evaluated at compile-time,
>> so does not affect run-time performance. The bitmap calculations in this
>> patch gives the following number of bits for different architectures:
>>
>> - 24 bits on 32-bit architectures
>> - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
>> - 32 bits on other 64-bit architectures
>
> From commit c07aea3ef4d4 ("mm: add a signature in struct page"):
> "The page->signature field is aliased to page->lru.next and
> page->compound_head, but it can't be set by mistake because the
> signature value is a bad pointer, and can't trigger a false positive
> in PageTail() because the last bit is 0."
>
> And commit 8a5e5e02fc83 ("include/linux/poison.h: fix LIST_POISON{1,2}
> offset"):
> "Poison pointer values should be small enough to find a room in
> non-mmap'able/hardly-mmap'able space."
>
> So the question seems to be:
> 1. Is stashing the ID causing page->pp_magic to be in the mmap'able/
> easier-mmap'able space? If yes, how can we make sure this will not
> cause any security problem?
> 2. Is the masking the page->pp_magic causing a valid pionter for
> page->lru.next or page->compound_head to be treated as a vaild
> PP_SIGNATURE? which might cause page_pool to recycle a page not
> allocated via page_pool.
Right, so my reasoning for why the defines in this patch work for this
is as follows: in both cases we need to make sure that the ID stashed in
that field never looks like a valid kernel pointer. For 64-bit arches
(where CONFIG_ILLEGAL_POINTER_VALUE is defined), we make sure of this by never
writing to any bits that overlap with the illegal value (so that the
PP_SIGNATURE written to the field keeps it as an illegal pointer value).
For 32-bit arches, we make sure of this by making sure the top-most bit
is always 0 (the -1 in the define for _PP_DMA_INDEX_BITS) in the patch,
which puts it outside the range used for kernel pointers (AFAICT).
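To make that concrete with the x86_64 values as an example (derived from
the defines in the patch; the poison constant is the arch default, not
something this patch introduces):

  /* x86_64: PP_SIGNATURE = 0xdead000000000040
   *
   *   PP_DMA_INDEX_SHIFT = 7
   *   PP_DMA_INDEX_BITS  = 32
   *   PP_DMA_INDEX_MASK  = GENMASK(38, 7)   -> bits 7..38
   *   PP_MAGIC_MASK      = ~(PP_DMA_INDEX_MASK | 0x3UL)
   *
   * Any ID written into bits 7..38 leaves the 0xdead... poison prefix in
   * bits 48..63 untouched, so page->pp_magic always remains an illegal,
   * non-dereferenceable pointer value, and
   * (pp_magic & PP_MAGIC_MASK) == PP_SIGNATURE can still only match pages
   * that page_pool itself initialised.
   */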
>> Since all the tracking is performed on DMA map/unmap, no additional code
>> is needed in the fast path, meaning the performance overhead of this
>> tracking is negligible. A micro-benchmark shows that the total overhead
>> of using xarray for this purpose is about 400 ns (39 cycles(tsc) 395.218
>> ns; sum for both map and unmap[1]). Since this cost is only paid on DMA
>> map and unmap, it seems like an acceptable cost to fix the late unmap
>
> For most use cases when PP_FLAG_DMA_MAP is set and IOMMU is off, the
> DMA map and unmap operation is almost negligible as said below, so the
> cost is about 200% performance degradation, which doesn't seems like an
> acceptable cost.
I disagree. This only impacts the slow path, as long as pages are
recycled there is no additional cost. While your patch series has
demonstrated that it is *possible* to reduce the cost even in the slow
path, I don't think the complexity cost of this is worth it.
[...]
>> The extra memory needed to track the pages is neatly encapsulated inside
>> xarray, which uses the 'struct xa_node' structure to track items. This
>> structure is 576 bytes long, with slots for 64 items, meaning that a
>> full node occurs only 9 bytes of overhead per slot it tracks (in
>> practice, it probably won't be this efficient, but in any case it should
>
> Is there any debug infrastructure to know if it is not this efficient?
> as there may be 576 byte overhead for a page for the worst case.
There's an XA_DEBUG define which enables some dump functions, but I
don't think there's any API to inspect the memory usage. I guess you
could attach a BPF program and walk the structure, or something?
>> + /* Make sure all concurrent returns that may see the old
>> + * value of dma_sync (and thus perform a sync) have
>> + * finished before doing the unmapping below. Skip the
>> + * wait if the device doesn't actually need syncing, or
>> + * if there are no outstanding mapped pages.
>> + */
>> + if (dma_dev_need_sync(pool->p.dev) &&
>> + !xa_empty(&pool->dma_mapped))
>> + synchronize_net();
>
> I guess the above synchronize_net() is assuming that the above dma sync
> API is always called in the softirq context, as it seems there is no
> rcu read lock added in this patch to be paired with that.
Yup, that was my assumption.
> Doesn't page_pool_put_page() might be called in non-softirq context when
> allow_direct is false and in_softirq() returns false?
I am not sure if this happens in practice in any of the delayed return
paths we are worried about for this patch. If it does we could apply
something like the diff below (on top of this patch). I can respin with
this if needed, but I'll wait a bit and give others a chance to chime in.
-Toke
@@ -465,9 +465,13 @@ page_pool_dma_sync_for_device(const struct page_pool *pool,
netmem_ref netmem,
u32 dma_sync_size)
{
- if ((READ_ONCE(pool->dma_sync) & PP_DMA_SYNC_DEV) &&
- dma_dev_need_sync(pool->p.dev))
- __page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
+ if (dma_dev_need_sync(pool->p.dev)) {
+ rcu_read_lock();
+ if (READ_ONCE(pool->dma_sync) & PP_DMA_SYNC_DEV)
+ __page_pool_dma_sync_for_device(pool, netmem,
+ dma_sync_size);
+ rcu_read_unlock();
+ }
}
static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem, gfp_t gfp)
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH net-next 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-17 15:16 ` Toke Høiland-Jørgensen
@ 2025-03-18 12:39 ` Yunsheng Lin
2025-03-18 20:55 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 19+ messages in thread
From: Yunsheng Lin @ 2025-03-18 12:39 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, Yunsheng Lin, David S. Miller,
Jakub Kicinski, Jesper Dangaard Brouer, Saeed Mahameed,
Leon Romanovsky, Tariq Toukan, Andrew Lunn, Eric Dumazet,
Paolo Abeni, Ilias Apalodimas, Simon Horman, Andrew Morton,
Mina Almasry, Yonglong Liu, Pavel Begunkov, Matthew Wilcox,
Robin Murphy, IOMMU, segoon, solar
Cc: netdev, bpf, linux-rdma, linux-mm, Qiuling Ren, Yuying Ma
On 2025/3/17 23:16, Toke Høiland-Jørgensen wrote:
> Yunsheng Lin <yunshenglin0825@gmail.com> writes:
>
>> On 3/14/2025 6:10 PM, Toke Høiland-Jørgensen wrote:
>>
>> ...
>>
>>>
>>> To avoid having to walk the entire xarray on unmap to find the page
>>> reference, we stash the ID assigned by xa_alloc() into the page
>>> structure itself, using the upper bits of the pp_magic field. This
>>> requires a couple of defines to avoid conflicting with the
>>> POINTER_POISON_DELTA define, but this is all evaluated at compile-time,
>>> so does not affect run-time performance. The bitmap calculations in this
>>> patch gives the following number of bits for different architectures:
>>>
>>> - 24 bits on 32-bit architectures
>>> - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
>>> - 32 bits on other 64-bit architectures
>>
>> From commit c07aea3ef4d4 ("mm: add a signature in struct page"):
>> "The page->signature field is aliased to page->lru.next and
>> page->compound_head, but it can't be set by mistake because the
>> signature value is a bad pointer, and can't trigger a false positive
>> in PageTail() because the last bit is 0."
>>
>> And commit 8a5e5e02fc83 ("include/linux/poison.h: fix LIST_POISON{1,2}
>> offset"):
>> "Poison pointer values should be small enough to find a room in
>> non-mmap'able/hardly-mmap'able space."
>>
>> So the question seems to be:
>> 1. Is stashing the ID causing page->pp_magic to be in the mmap'able/
>> easier-mmap'able space? If yes, how can we make sure this will not
>> cause any security problem?
>> 2. Is the masking the page->pp_magic causing a valid pionter for
>> page->lru.next or page->compound_head to be treated as a vaild
>> PP_SIGNATURE? which might cause page_pool to recycle a page not
>> allocated via page_pool.
>
> Right, so my reasoning for why the defines in this patch works for this
> is as follows: in both cases we need to make sure that the ID stashed in
> that field never looks like a valid kernel pointer. For 64-bit arches
> (where CONFIG_ILLEGAL_POINTER_VALUE), we make sure of this by never
> writing to any bits that overlap with the illegal value (so that the
> PP_SIGNATURE written to the field keeps it as an illegal pointer value).
> For 32-bit arches, we make sure of this by making sure the top-most bit
> is always 0 (the -1 in the define for _PP_DMA_INDEX_BITS) in the patch,
> which puts it outside the range used for kernel pointers (AFAICT).
Is there any reason you think only kernel pointers are relevant here?
It seems it is not really only about kernel pointers, as round_hint_to_min()
in mm/mmap.c and the commit log of the above commit 8a5e5e02fc83 suggest,
if I understand it correctly:
"Given unprivileged users cannot mmap anything below mmap_min_addr, it
should be safe to use poison pointers lower than mmap_min_addr."
And the above "making sure the top-most bit is always 0" doesn't seems to
ensure page->signature to be outside the range used for kernel pointers
for 32-bit arches with VMSPLIT_1G defined, see arch/arm/Kconfig, there
is a similar config for x86 too:
prompt "Memory split"
depends on MMU
default VMSPLIT_3G
help
Select the desired split between kernel and user memory.
If you are not absolutely sure what you are doing, leave this
option alone!
config VMSPLIT_3G
bool "3G/1G user/kernel split"
config VMSPLIT_3G_OPT
depends on !ARM_LPAE
bool "3G/1G user/kernel split (for full 1G low memory)"
config VMSPLIT_2G
bool "2G/2G user/kernel split"
config VMSPLIT_1G
bool "1G/3G user/kernel split"
IMHO, even if some trick like the above is really feasible, it may add
some limitation or complexity to the arch and MM subsystems; for
example, stashing the ID in page->signature may make the 0x*40 signature
unusable for other poison-pointer purposes. It would make more sense to
make this obvious by doing the above trick in some MM header file like
poison.h instead of in the page_pool subsystem.
>
>>> Since all the tracking is performed on DMA map/unmap, no additional code
>>> is needed in the fast path, meaning the performance overhead of this
>>> tracking is negligible. A micro-benchmark shows that the total overhead
>>> of using xarray for this purpose is about 400 ns (39 cycles(tsc) 395.218
>>> ns; sum for both map and unmap[1]). Since this cost is only paid on DMA
>>> map and unmap, it seems like an acceptable cost to fix the late unmap
>>
>> For most use cases when PP_FLAG_DMA_MAP is set and IOMMU is off, the
>> DMA map and unmap operation is almost negligible as said below, so the
>> cost is about 200% performance degradation, which doesn't seems like an
>> acceptable cost.
>
> I disagree. This only impacts the slow path, as long as pages are
> recycled there is no additional cost. While your patch series has
> demonstrated that it is *possible* to reduce the cost even in the slow
> path, I don't think the complexity cost of this is worth it.
It is still the datapath, otherwise there wouldn't be specific testing
for that use case; more than 200% performance degradation is too much
IMHO.
Leaving aside the above severe performance degradation, reusing space in
page->signature seems to be a tradeoff between adding complexity in the
page_pool subsystem and in the MM/arch subsystems, as mentioned above.
>
> [...]
>
>>> The extra memory needed to track the pages is neatly encapsulated inside
>>> xarray, which uses the 'struct xa_node' structure to track items. This
>>> structure is 576 bytes long, with slots for 64 items, meaning that a
>>> full node occurs only 9 bytes of overhead per slot it tracks (in
>>> practice, it probably won't be this efficient, but in any case it should
>>
>> Is there any debug infrastructure to know if it is not this efficient?
>> as there may be 576 byte overhead for a page for the worst case.
>
> There's an XA_DEBUG define which enables some dump functions, but I
> don't think there's any API to inspect the memory usage. I guess you
> could attach a BPF program and walk the structure, or something?
>
It seems XA_DEBUG is not defined in a production environment.
And I am not familiar enough with BPF programs to understand if the
BPF way is feasible in a production environment.
Any example of the above BPF program and how to attach it?
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH net-next 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-18 12:39 ` Yunsheng Lin
@ 2025-03-18 20:55 ` Toke Høiland-Jørgensen
2025-03-19 11:06 ` Yunsheng Lin
0 siblings, 1 reply; 19+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-18 20:55 UTC (permalink / raw)
To: Yunsheng Lin, Yunsheng Lin, David S. Miller, Jakub Kicinski,
Jesper Dangaard Brouer, Saeed Mahameed, Leon Romanovsky,
Tariq Toukan, Andrew Lunn, Eric Dumazet, Paolo Abeni,
Ilias Apalodimas, Simon Horman, Andrew Morton, Mina Almasry,
Yonglong Liu, Pavel Begunkov, Matthew Wilcox, Robin Murphy,
IOMMU, segoon, solar
Cc: netdev, bpf, linux-rdma, linux-mm, Qiuling Ren, Yuying Ma
Yunsheng Lin <linyunsheng@huawei.com> writes:
> On 2025/3/17 23:16, Toke Høiland-Jørgensen wrote:
>> Yunsheng Lin <yunshenglin0825@gmail.com> writes:
>>
>>> On 3/14/2025 6:10 PM, Toke Høiland-Jørgensen wrote:
>>>
>>> ...
>>>
>>>>
>>>> To avoid having to walk the entire xarray on unmap to find the page
>>>> reference, we stash the ID assigned by xa_alloc() into the page
>>>> structure itself, using the upper bits of the pp_magic field. This
>>>> requires a couple of defines to avoid conflicting with the
>>>> POINTER_POISON_DELTA define, but this is all evaluated at compile-time,
>>>> so does not affect run-time performance. The bitmap calculations in this
>>>> patch gives the following number of bits for different architectures:
>>>>
>>>> - 24 bits on 32-bit architectures
>>>> - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
>>>> - 32 bits on other 64-bit architectures
>>>
>>> From commit c07aea3ef4d4 ("mm: add a signature in struct page"):
>>> "The page->signature field is aliased to page->lru.next and
>>> page->compound_head, but it can't be set by mistake because the
>>> signature value is a bad pointer, and can't trigger a false positive
>>> in PageTail() because the last bit is 0."
>>>
>>> And commit 8a5e5e02fc83 ("include/linux/poison.h: fix LIST_POISON{1,2}
>>> offset"):
>>> "Poison pointer values should be small enough to find a room in
>>> non-mmap'able/hardly-mmap'able space."
>>>
>>> So the question seems to be:
>>> 1. Is stashing the ID causing page->pp_magic to be in the mmap'able/
>>> easier-mmap'able space? If yes, how can we make sure this will not
>>> cause any security problem?
>>> 2. Is the masking the page->pp_magic causing a valid pointer for
>>> page->lru.next or page->compound_head to be treated as a valid
>>> PP_SIGNATURE? which might cause page_pool to recycle a page not
>>> allocated via page_pool.
>>
>> Right, so my reasoning for why the defines in this patch works for this
>> is as follows: in both cases we need to make sure that the ID stashed in
>> that field never looks like a valid kernel pointer. For 64-bit arches
>> (where CONFIG_ILLEGAL_POINTER_VALUE), we make sure of this by never
>> writing to any bits that overlap with the illegal value (so that the
>> PP_SIGNATURE written to the field keeps it as an illegal pointer value).
>> For 32-bit arches, we make sure of this by making sure the top-most bit
>> is always 0 (the -1 in the define for _PP_DMA_INDEX_BITS) in the patch,
>> which puts it outside the range used for kernel pointers (AFAICT).
>
> Is there any reason you think only kernel pointer is relevant here?
Yes. Any pointer stored in the same space as pp_magic by other users of
the page will be kernel pointers (as they come from page->lru.next). The
goal of PP_SIGNATURE is to be able to distinguish pages allocated by
page_pool, so we don't accidentally recycle a page from somewhere else.
That's the goal of the check in page_pool_page_is_pp():
(page->pp_magic & PP_MAGIC_MASK) == PP_SIGNATURE
To achieve this, we must ensure that the check above never returns true
for any value another page user could have written into the same field
(i.e., into page->lru.next). For 64-bit arches, POISON_POINTER_DELTA
serves this purpose. For 32-bit arches, we can leave the top-most bits
out of PP_MAGIC_MASK, to make sure that any valid pointer value will
fail the check above.
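To make this concrete, here is a rough sketch of how the mask could be
laid out once an ID is stashed in the upper bits (illustrative only;
_PP_DMA_INDEX_SHIFT and _PP_DMA_INDEX_BITS stand in for the per-arch
values the patch computes at compile time):

/* Carve an ID field out of pp_magic just above the signature bits, and
 * exclude it (plus the low two page-flag bits) from the signature
 * comparison. GENMASK() is from linux/bits.h.
 */
#define _PP_DMA_INDEX_MASK	GENMASK(_PP_DMA_INDEX_SHIFT + _PP_DMA_INDEX_BITS - 1, \
					_PP_DMA_INDEX_SHIFT)
#define PP_MAGIC_MASK		~(_PP_DMA_INDEX_MASK | 0x3UL)

With that layout the check above simply ignores the ID bits, while
everything outside the ID field still has to match PP_SIGNATURE exactly.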
> It seems it is not really only about kernel pointers as round_hint_to_min()
> in mm/mmap.c suggests and the commit log in the above commit 8a5e5e02fc83
> if I understand it correctly:
> "Given unprivileged users cannot mmap anything below mmap_min_addr, it
> should be safe to use poison pointers lower than mmap_min_addr."
>
> And the above "making sure the top-most bit is always 0" doesn't seem to
> ensure page->signature to be outside the range used for kernel pointers
> for 32-bit arches with VMSPLIT_1G defined, see arch/arm/Kconfig, there
> is a similar config for x86 too:
> prompt "Memory split"
> depends on MMU
> default VMSPLIT_3G
> help
> Select the desired split between kernel and user memory.
>
> If you are not absolutely sure what you are doing, leave this
> option alone!
>
> config VMSPLIT_3G
> bool "3G/1G user/kernel split"
> config VMSPLIT_3G_OPT
> depends on !ARM_LPAE
> bool "3G/1G user/kernel split (for full 1G low memory)"
> config VMSPLIT_2G
> bool "2G/2G user/kernel split"
> config VMSPLIT_1G
> bool "1G/3G user/kernel split"
Ah, interesting, didn't know this was configurable. Okay, so AFAICT, the
lowest value of PAGE_OFFSET is 0x40000000 (for VMSPLIT_1G), so we need
to leave two bits off at the top instead of just one. Will update this,
and try to explain the logic better in the comment.
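Something like this for the 32-bit case, just to spell out the reasoning
(a sketch, not the actual define):

/* With VMSPLIT_1G the kernel mapping starts at PAGE_OFFSET == 0x40000000,
 * so any valid kernel pointer has bit 30 and/or bit 31 set. Capping the
 * ID width so that the top two bits of pp_magic always stay zero, e.g.
 *
 *	#define _PP_DMA_INDEX_BITS	(BITS_PER_LONG - _PP_DMA_INDEX_SHIFT - 2)
 *
 * keeps the stored value out of the range a valid kernel pointer can
 * occupy under any of the VMSPLIT options.
 */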
> IMHO, even if some trick like above is really feasible, it may be
> adding some limitation or complexity to the ARCH and MM subsystem, for
> example, stashing the ID in page->signature may cause 0x*40 signature
> to be unusable for other poison pointer purpose, it makes more sense to
> make it obvious by doing the above trick in some MM header file like
> poison.h instead of in the page_pool subsystem.
AFAIU, PP_SIGNATURE is used for page_pool to be able to distinguish its
own pages from those allocated elsewhere (cf the description above).
Which means that these definitions are logically page_pool-internal, and
thus it makes the most sense to keep them in the page pool headers. The
only bits the mm subsystem cares about in that field are the bottom two
(for pfmemalloc pages and compound pages).
>>>> Since all the tracking is performed on DMA map/unmap, no additional code
>>>> is needed in the fast path, meaning the performance overhead of this
>>>> tracking is negligible. A micro-benchmark shows that the total overhead
>>>> of using xarray for this purpose is about 400 ns (39 cycles(tsc) 395.218
>>>> ns; sum for both map and unmap[1]). Since this cost is only paid on DMA
>>>> map and unmap, it seems like an acceptable cost to fix the late unmap
>>>
>>> For most use cases when PP_FLAG_DMA_MAP is set and IOMMU is off, the
>>> DMA map and unmap operation is almost negligible as said below, so the
>>> cost is about 200% performance degradation, which doesn't seem like an
>>> acceptable cost.
>>
>> I disagree. This only impacts the slow path, as long as pages are
>> recycled there is no additional cost. While your patch series has
>> demonstrated that it is *possible* to reduce the cost even in the slow
>> path, I don't think the complexity cost of this is worth it.
>
> It is still the datapath otherwise there isn't a specific testing
> for that use case, more than 200% performance degradation is too much
> IMHO.
Do you have a real-world use case (i.e., a networking benchmark, not a
micro-benchmark of the allocation function) where this change has a
measurable impact on performance? If so, please do share the numbers!
I very much doubt it will be anything close to that magnitude, but I'm
always willing to be persuaded by data :)
> Leaving aside the above severe performance degradation, reusing space in
> page->signature seems to be a tradeoff between adding complexity in
> page_pool subsystem and in VM/ARCH subsystem as mentioned above.
I think you are overstating the impact on other MM users; this is all
mostly page_pool-internal logic, cf the explanation above.
>>
>> [...]
>>
>>>> The extra memory needed to track the pages is neatly encapsulated inside
>>>> xarray, which uses the 'struct xa_node' structure to track items. This
>>>> structure is 576 bytes long, with slots for 64 items, meaning that a
>>>> full node incurs only 9 bytes of overhead per slot it tracks (in
>>>> practice, it probably won't be this efficient, but in any case it should
>>>
>>> Is there any debug infrastructure to know if it is not this efficient?
>>> as there may be 576 byte overhead for a page for the worst case.
>>
>> There's an XA_DEBUG define which enables some dump functions, but I
>> don't think there's any API to inspect the memory usage. I guess you
>> could attach a BPF program and walk the structure, or something?
>>
>
> It seems the XA_DEBUG is not defined in production environment.
> And I am not familiar enough with BPF program to understand if the
> BPF way is feasible in production environment.
> Any example for the above BPF program and how to attach it?
Hmm, no, not really, sorry :(
I *think* it should be possible to write a bpftrace script that walks
the internal xarray tree and counts the nodes along the way, but it's
not trivial to do, and I haven't tried.
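For what it's worth, a very rough starting point could be a bpftrace
sketch like the one below (untested; it assumes a kernel with BTF and
these allocator entry points, and it only counts system-wide churn of the
radix_tree_node slab that xarray nodes are allocated from, not the nodes
of one specific xarray):

bpftrace -e '
kprobe:kmem_cache_alloc_lru
/str(((struct kmem_cache *)arg0)->name) == "radix_tree_node"/
{ @alloc = count(); }

kprobe:kmem_cache_free
/str(((struct kmem_cache *)arg0)->name) == "radix_tree_node"/
{ @free = count(); }

interval:s:10 { print(@alloc); print(@free); }'

Simply watching the radix_tree_node line in /proc/slabinfo before and
after creating and destroying the pool would give a cruder, but much
simpler, view of the same thing.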
-Toke
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH net-next 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-18 20:55 ` Toke Høiland-Jørgensen
@ 2025-03-19 11:06 ` Yunsheng Lin
2025-03-19 12:18 ` Toke Høiland-Jørgensen
2025-03-20 2:32 ` Solar Designer
0 siblings, 2 replies; 19+ messages in thread
From: Yunsheng Lin @ 2025-03-19 11:06 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, Yunsheng Lin, David S. Miller,
Jakub Kicinski, Jesper Dangaard Brouer, Saeed Mahameed,
Leon Romanovsky, Tariq Toukan, Andrew Lunn, Eric Dumazet,
Paolo Abeni, Ilias Apalodimas, Simon Horman, Andrew Morton,
Mina Almasry, Yonglong Liu, Pavel Begunkov, Matthew Wilcox,
Robin Murphy, IOMMU, segoon, solar, oss-security,
kernel-hardening
Cc: netdev, bpf, linux-rdma, linux-mm, Qiuling Ren, Yuying Ma
On 2025/3/19 4:55, Toke Høiland-Jørgensen wrote:
> Yunsheng Lin <linyunsheng@huawei.com> writes:
>
>> On 2025/3/17 23:16, Toke Høiland-Jørgensen wrote:
>>> Yunsheng Lin <yunshenglin0825@gmail.com> writes:
>>>
>>>> On 3/14/2025 6:10 PM, Toke Høiland-Jørgensen wrote:
>>>>
>>>> ...
>>>>
>>>>>
>>>>> To avoid having to walk the entire xarray on unmap to find the page
>>>>> reference, we stash the ID assigned by xa_alloc() into the page
>>>>> structure itself, using the upper bits of the pp_magic field. This
>>>>> requires a couple of defines to avoid conflicting with the
>>>>> POINTER_POISON_DELTA define, but this is all evaluated at compile-time,
>>>>> so does not affect run-time performance. The bitmap calculations in this
>>>>> patch gives the following number of bits for different architectures:
>>>>>
>>>>> - 24 bits on 32-bit architectures
>>>>> - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
>>>>> - 32 bits on other 64-bit architectures
>>>>
>>>> From commit c07aea3ef4d4 ("mm: add a signature in struct page"):
>>>> "The page->signature field is aliased to page->lru.next and
>>>> page->compound_head, but it can't be set by mistake because the
>>>> signature value is a bad pointer, and can't trigger a false positive
>>>> in PageTail() because the last bit is 0."
>>>>
>>>> And commit 8a5e5e02fc83 ("include/linux/poison.h: fix LIST_POISON{1,2}
>>>> offset"):
>>>> "Poison pointer values should be small enough to find a room in
>>>> non-mmap'able/hardly-mmap'able space."
>>>>
>>>> So the question seems to be:
>>>> 1. Is stashing the ID causing page->pp_magic to be in the mmap'able/
>>>> easier-mmap'able space? If yes, how can we make sure this will not
>>>> cause any security problem?
>>>> 2. Is the masking the page->pp_magic causing a valid pointer for
>>>> page->lru.next or page->compound_head to be treated as a valid
>>>> PP_SIGNATURE? which might cause page_pool to recycle a page not
>>>> allocated via page_pool.
>>>
>>> Right, so my reasoning for why the defines in this patch works for this
>>> is as follows: in both cases we need to make sure that the ID stashed in
>>> that field never looks like a valid kernel pointer. For 64-bit arches
>>> (where CONFIG_ILLEGAL_POINTER_VALUE), we make sure of this by never
>>> writing to any bits that overlap with the illegal value (so that the
>>> PP_SIGNATURE written to the field keeps it as an illegal pointer value).
>>> For 32-bit arches, we make sure of this by making sure the top-most bit
>>> is always 0 (the -1 in the define for _PP_DMA_INDEX_BITS) in the patch,
>>> which puts it outside the range used for kernel pointers (AFAICT).
>>
>> Is there any reason you think only kernel pointer is relevant here?
>
> Yes. Any pointer stored in the same space as pp_magic by other users of
> the page will be kernel pointers (as they come from page->lru.next). The
> goal of PP_SIGNATURE is to be able to distinguish pages allocated by
> page_pool, so we don't accidentally recycle a page from somewhere else.
> That's the goal of the check in page_pool_page_is_pp():
>
> (page->pp_magic & PP_MAGIC_MASK) == PP_SIGNATURE
>
> To achieve this, we must ensure that the check above never returns true
> for any value another page user could have written into the same field
> (i.e., into page->lru.next). For 64-bit arches, POISON_POINTER_DELTA
POISON_POINTER_DELTA is defined according to CONFIG_ILLEGAL_POINTER_VALUE;
if CONFIG_ILLEGAL_POINTER_VALUE is not defined, POISON_POINTER_DELTA is
defined to zero (see the poison.h excerpt quoted below).
From grepping, it seems only the below 64-bit arches define
CONFIG_ILLEGAL_POINTER_VALUE:
a29815a333c6 core, x86: make LIST_POISON less deadly
5c178472af24 riscv: define ILLEGAL_POINTER_VALUE for 64bit
f6853eb561fb powerpc/64: Define ILLEGAL_POINTER_VALUE for 64-bit
bf0c4e047324 arm64: kconfig: Move LIST_POISON to a safe value
The below 64-bit arches don't seem to define the above config yet:
MIPS64, SPARC64, System z (S390X), LoongArch
Does ID stashing cause a problem for the above arches?
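For reference, include/linux/poison.h derives the delta roughly like
this (quoting from memory, so please double-check the actual file):

#ifdef CONFIG_ILLEGAL_POINTER_VALUE
# define POISON_POINTER_DELTA _AC(CONFIG_ILLEGAL_POINTER_VALUE, UL)
#else
# define POISON_POINTER_DELTA 0
#endif

/* The list poison values are then offset by the delta: */
#define LIST_POISON1  ((void *) 0x100 + POISON_POINTER_DELTA)
#define LIST_POISON2  ((void *) 0x122 + POISON_POINTER_DELTA)

So on the arches above, the delta is zero and PP_SIGNATURE is just the
bare 0x40 value.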
> serves this purpose. For 32-bit arches, we can leave the top-most bits
> out of PP_MAGIC_MASK, to make sure that any valid pointer value will
> fail the check above.
The above mainly explains how to ensure page_pool_page_is_pp() will
not return a false positive result from the page_pool perspective.
From the MM/security perspective, most of the commits quoted above seem
to suggest that poison pointers should be in non-mmap'able or
hardly-mmap'able space, otherwise userspace can arrange for those
pointers to actually be dereferenceable, potentially turning an oops
into an exploit. A more detailed example is in the paper below, which
explains how to exploit a vulnerability that was hardened by the
8a5e5e02fc83 commit:
https://www.usenix.org/system/files/conference/woot15/woot15-paper-xu.pdf
ID stashing seems to cause page->lru.next (aliased to page->pp_magic) to
be in the mmap'able space for some arches.
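As a toy illustration of the concern (not specific to this patch): if a
poison-like value ends up inside the mmap'able range, an unprivileged
process can try to back that address with memory it controls. The address
below is made up, and mmap_min_addr or existing mappings may well make
the call fail, which is exactly the protection being discussed:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	/* hypothetical "weakened poison" value inside the user range */
	void *target = (void *)0x00200000;
	void *p = mmap(target, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

	if (p == MAP_FAILED)
		perror("mmap");
	else
		printf("mapped %p; a stray kernel dereference of this address now hits our data\n", p);
	return 0;
}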
>
>> It seems it is not really only about kernel pointers as round_hint_to_min()
>> in mm/mmap.c suggests and the commit log in the above commit 8a5e5e02fc83
>> if I understand it correctly:
>> "Given unprivileged users cannot mmap anything below mmap_min_addr, it
>> should be safe to use poison pointers lower than mmap_min_addr."
>>
>> And the above "making sure the top-most bit is always 0" doesn't seem to
>> ensure page->signature to be outside the range used for kernel pointers
>> for 32-bit arches with VMSPLIT_1G defined, see arch/arm/Kconfig, there
>> is a similar config for x86 too:
...
>
> Ah, interesting, didn't know this was configurable. Okay, so AFAICT, the
> lowest value of PAGE_OFFSET is 0x40000000 (for VMSPLIT_1G), so we need
> to leave two bits off at the top instead of just one. Will update this,
> and try to explain the logic better in the comment.
It seems there was an attempt at doing a 4G/4G split too, and that is the
kind of limitation or complexity added to the ARCH and MM subsystems by
the ID stashing I mentioned earlier:
https://lore.kernel.org/lkml/Pine.LNX.4.44.0307082332450.17252-100000@localhost.localdomain/
>
>> IMHO, even if some trick like above is really feasible, it may be
>> adding some limitation or complexity to the ARCH and MM subsystem, for
>> example, stashing the ID in page->signature may cause 0x*40 signature
>> to be unusable for other poison pointer purpose, it makes more sense to
>> make it obvious by doing the above trick in some MM header file like
>> poison.h instead of in the page_pool subsystem.
>
> AFAIU, PP_SIGNATURE is used for page_pool to be able to distinguish its
> own pages from those allocated elsewhere (cf the description above).
> Which means that these definitions are logically page_pool-internal, and
> thus it makes the most sense to keep them in the page pool headers. The
> only bits the mm subsystem cares about in that field are the bottom two
> (for pfmemalloc pages and compound pages).
All I asked about is moving the PP_MAGIC_MASK macro into poison.h if you
still want to proceed with reusing page->pp_magic, as the mask and the
signature being masked seem reasonable to keep in the same file.
>
>>>>> Since all the tracking is performed on DMA map/unmap, no additional code
>>>>> is needed in the fast path, meaning the performance overhead of this
>>>>> tracking is negligible. A micro-benchmark shows that the total overhead
>>>>> of using xarray for this purpose is about 400 ns (39 cycles(tsc) 395.218
>>>>> ns; sum for both map and unmap[1]). Since this cost is only paid on DMA
>>>>> map and unmap, it seems like an acceptable cost to fix the late unmap
>>>>
>>>> For most use cases when PP_FLAG_DMA_MAP is set and IOMMU is off, the
>>>> DMA map and unmap operation is almost negligible as said below, so the
>>>> cost is about 200% performance degradation, which doesn't seem like an
>>>> acceptable cost.
>>>
>>> I disagree. This only impacts the slow path, as long as pages are
>>> recycled there is no additional cost. While your patch series has
>>> demonstrated that it is *possible* to reduce the cost even in the slow
>>> path, I don't think the complexity cost of this is worth it.
>>
>> It is still the datapath otherwise there isn't a specific testing
>> for that use case, more than 200% performance degradation is too much
>> IMHO.
>
> Do you have a real-world use case (i.e., a networking benchmark, not a
> micro-benchmark of the allocation function) where this change has a
> measurable impact on performance? If so, please do share the numbers!
I don't have one yet.
>
> I very much doubt it will be anything close to that magnitude, but I'm
> always willing to be persuaded by data :)
>
>> Leaving aside the above severe performance degradation, reusing space in
>> page->signature seems to be a tradeoff between adding complexity in
>> page_pool subsystem and in VM/ARCH subsystem as mentioned above.
>
> I think you are overstating the impact on other MM users; this is all
> mostly page_pool-internal logic, cf the explanation above.
To be honest, I am not that familiar with the pointer poisoning mechanism.
But from doing some research and analysis, it seems better to err on the
side of overstating it a little, as it appears to be security-related.
Cc'ed some security experts and mailing lists to see if they can offer
some clarification.
>
>>>
>>> [...]
>>>
>>>>> The extra memory needed to track the pages is neatly encapsulated inside
>>>>> xarray, which uses the 'struct xa_node' structure to track items. This
>>>>> structure is 576 bytes long, with slots for 64 items, meaning that a
>>>>> full node incurs only 9 bytes of overhead per slot it tracks (in
>>>>> practice, it probably won't be this efficient, but in any case it should
>>>>
>>>> Is there any debug infrastructure to know if it is not this efficient?
>>>> as there may be 576 byte overhead for a page for the worst case.
>>>
>>> There's an XA_DEBUG define which enables some dump functions, but I
>>> don't think there's any API to inspect the memory usage. I guess you
>>> could attach a BPF program and walk the structure, or something?
>>>
>>
>> It seems the XA_DEBUG is not defined in production environment.
>> And I am not familiar enough with BPF program to understand if the
>> BPF way is feasible in production environment.
>> Any example for the above BPF program and how to attach it?
>
> Hmm, no, not really, sorry :(
>
> I *think* it should be possible to write a bpftrace script that walks
> the internal xarray tree and counts the nodes along the way, but it's
> not trivial to do, and I haven't tried.
>
> -Toke
>
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH net-next 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-19 11:06 ` Yunsheng Lin
@ 2025-03-19 12:18 ` Toke Høiland-Jørgensen
2025-03-20 11:17 ` Yunsheng Lin
2025-03-20 2:32 ` Solar Designer
1 sibling, 1 reply; 19+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-19 12:18 UTC (permalink / raw)
To: Yunsheng Lin, Yunsheng Lin, David S. Miller, Jakub Kicinski,
Jesper Dangaard Brouer, Saeed Mahameed, Leon Romanovsky,
Tariq Toukan, Andrew Lunn, Eric Dumazet, Paolo Abeni,
Ilias Apalodimas, Simon Horman, Andrew Morton, Mina Almasry,
Yonglong Liu, Pavel Begunkov, Matthew Wilcox, Robin Murphy,
IOMMU, segoon, solar, oss-security, kernel-hardening
Cc: netdev, bpf, linux-rdma, linux-mm, Qiuling Ren, Yuying Ma
Yunsheng Lin <linyunsheng@huawei.com> writes:
> On 2025/3/19 4:55, Toke Høiland-Jørgensen wrote:
>> Yunsheng Lin <linyunsheng@huawei.com> writes:
>>
>>> On 2025/3/17 23:16, Toke Høiland-Jørgensen wrote:
>>>> Yunsheng Lin <yunshenglin0825@gmail.com> writes:
>>>>
>>>>> On 3/14/2025 6:10 PM, Toke Høiland-Jørgensen wrote:
>>>>>
>>>>> ...
>>>>>
>>>>>>
>>>>>> To avoid having to walk the entire xarray on unmap to find the page
>>>>>> reference, we stash the ID assigned by xa_alloc() into the page
>>>>>> structure itself, using the upper bits of the pp_magic field. This
>>>>>> requires a couple of defines to avoid conflicting with the
>>>>>> POINTER_POISON_DELTA define, but this is all evaluated at compile-time,
>>>>>> so does not affect run-time performance. The bitmap calculations in this
>>>>>> patch gives the following number of bits for different architectures:
>>>>>>
>>>>>> - 24 bits on 32-bit architectures
>>>>>> - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
>>>>>> - 32 bits on other 64-bit architectures
>>>>>
>>>>> From commit c07aea3ef4d4 ("mm: add a signature in struct page"):
>>>>> "The page->signature field is aliased to page->lru.next and
>>>>> page->compound_head, but it can't be set by mistake because the
>>>>> signature value is a bad pointer, and can't trigger a false positive
>>>>> in PageTail() because the last bit is 0."
>>>>>
>>>>> And commit 8a5e5e02fc83 ("include/linux/poison.h: fix LIST_POISON{1,2}
>>>>> offset"):
>>>>> "Poison pointer values should be small enough to find a room in
>>>>> non-mmap'able/hardly-mmap'able space."
>>>>>
>>>>> So the question seems to be:
>>>>> 1. Is stashing the ID causing page->pp_magic to be in the mmap'able/
>>>>> easier-mmap'able space? If yes, how can we make sure this will not
>>>>> cause any security problem?
>>>>> 2. Is the masking the page->pp_magic causing a valid pointer for
>>>>> page->lru.next or page->compound_head to be treated as a valid
>>>>> PP_SIGNATURE? which might cause page_pool to recycle a page not
>>>>> allocated via page_pool.
>>>>
>>>> Right, so my reasoning for why the defines in this patch works for this
>>>> is as follows: in both cases we need to make sure that the ID stashed in
>>>> that field never looks like a valid kernel pointer. For 64-bit arches
>>>> (where CONFIG_ILLEGAL_POINTER_VALUE), we make sure of this by never
>>>> writing to any bits that overlap with the illegal value (so that the
>>>> PP_SIGNATURE written to the field keeps it as an illegal pointer value).
>>>> For 32-bit arches, we make sure of this by making sure the top-most bit
>>>> is always 0 (the -1 in the define for _PP_DMA_INDEX_BITS) in the patch,
>>>> which puts it outside the range used for kernel pointers (AFAICT).
>>>
>>> Is there any reason you think only kernel pointer is relevant here?
>>
>> Yes. Any pointer stored in the same space as pp_magic by other users of
>> the page will be kernel pointers (as they come from page->lru.next). The
>> goal of PP_SIGNATURE is to be able to distinguish pages allocated by
>> page_pool, so we don't accidentally recycle a page from somewhere else.
>> That's the goal of the check in page_pool_page_is_pp():
>>
>> (page->pp_magic & PP_MAGIC_MASK) == PP_SIGNATURE
>>
>> To achieve this, we must ensure that the check above never returns true
>> for any value another page user could have written into the same field
>> (i.e., into page->lru.next). For 64-bit arches, POISON_POINTER_DELTA
>
> POISON_POINTER_DELTA is defined according to CONFIG_ILLEGAL_POINTER_VALUE,
> if CONFIG_ILLEGAL_POINTER_VALUE is not defined yet, POISON_POINTER_DELTA
> is defined to zero.
>
> It seems only the below 64-bit arches define CONFIG_ILLEGAL_POINTER_VALUE
> through grepping:
> a29815a333c6 core, x86: make LIST_POISON less deadly
> 5c178472af24 riscv: define ILLEGAL_POINTER_VALUE for 64bit
> f6853eb561fb powerpc/64: Define ILLEGAL_POINTER_VALUE for 64-bit
> bf0c4e047324 arm64: kconfig: Move LIST_POISON to a safe value
>
> The below 64-bit arches don't seem to define the above config yet:
> MIPS64, SPARC64, System z(S390X),loongarch
>
> Does ID stashing cause problem for the above arches?
Well, not from a "number of bits available" perspective. In that case we
have plenty of bits, but we limit the size of the ID we stash to 32 bits
in all cases, so we will just end up with the 21 leading bits being
all-zero. So for those arches we basically end up in the same situation
as on 32-bit (see below).
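To be concrete about what ends up in the field, the stash and lookup are
just a shift and mask on pp_magic, something along these lines
(illustrative helpers, not the v1 patch):

/* Store/retrieve the xa_alloc() ID in the reserved bit range; the
 * _PP_DMA_INDEX_* defines are the placeholders sketched earlier in the
 * thread.
 */
static void pp_set_dma_index(struct page *page, unsigned long id)
{
	page->pp_magic |= (id << _PP_DMA_INDEX_SHIFT) & _PP_DMA_INDEX_MASK;
}

static unsigned long pp_get_dma_index(const struct page *page)
{
	return (page->pp_magic & _PP_DMA_INDEX_MASK) >> _PP_DMA_INDEX_SHIFT;
}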
>> serves this purpose. For 32-bit arches, we can leave the top-most bits
>> out of PP_MAGIC_MASK, to make sure that any valid pointer value will
>> fail the check above.
>
> The above mainly explained how to ensure page_pool_page_is_pp() will
> not return false positive result from the page_pool perspective.
>
> From MM/security perspective, most of the commits quoted above seem
> to suggest that poison pointer should be in the non-mmap'able or
> hardly-mmap'able space, otherwise userspace can arrange for those
> pointers to actually be dereferencable, potentially turning an oops
> to an exploit, more detailed example in the below paper, which explains
> how to exploit a vulnerability which hardened by the 8a5e5e02fc83 commit:
> https://www.usenix.org/system/files/conference/woot15/woot15-paper-xu.pdf
>
> ID stashing seems to cause page->lru.next (aliased to page->pp_magic) to
> be in the mmap'able space for some arches.
Right, okay, I see what you mean. So the risk is basically the
following:
If some other part of the kernel ends up dereferencing the
page->lru.next pointer of a page that is owned by page_pool, and which
has an ID stashed into page->pp_magic, that dereference can end up being
to a valid userspace mapping, which can lead to Bad Things(tm), cf the
paper above.
This is mitigated by the fact that it can only happen on architectures
that don't set ILLEGAL_POINTER_VALUE (which includes 32-bit arches, and
the ones you listed above). In addition, this has to happen while the
page is owned by page_pool, and while it is DMA-mapped - we already
clear the pp_magic field when releasing the page from page_pool.
I am not sure to what extent the above is a risk we should take pains to
avoid, TBH. It seems to me that for this to become a real problem, lots
of other things will already have gone wrong. But happy to defer to the
mm/security folks here.
>>> It seems it is not really only about kernel pointers as round_hint_to_min()
>>> in mm/mmap.c suggests and the commit log in the above commit 8a5e5e02fc83
>>> if I understand it correctly:
>>> "Given unprivileged users cannot mmap anything below mmap_min_addr, it
>>> should be safe to use poison pointers lower than mmap_min_addr."
>>>
>>> And the above "making sure the top-most bit is always 0" doesn't seem to
>>> ensure page->signature to be outside the range used for kernel pointers
>>> for 32-bit arches with VMSPLIT_1G defined, see arch/arm/Kconfig, there
>>> is a similar config for x86 too:
>
> ...
>
>>
>> Ah, interesting, didn't know this was configurable. Okay, so AFAICT, the
>> lowest value of PAGE_OFFSET is 0x40000000 (for VMSPLIT_1G), so we need
>> to leave two bits off at the top instead of just one. Will update this,
>> and try to explain the logic better in the comment.
>
> It seems there was attempt of doing 4G/4G split too, and that is the kind
> of limitation or complexity added to the ARCH and MM subsystem by doing the
> ID stashing I mentioned earlier.
> https://lore.kernel.org/lkml/Pine.LNX.4.44.0307082332450.17252-100000@localhost.localdomain/
Given that this is all temporary until the folio rework Matthew alluded
to is completed, I think that particular concern is somewhat theoretical :)
>>> IMHO, even if some trick like above is really feasible, it may be
>>> adding some limitation or complexity to the ARCH and MM subsystem, for
>>> example, stashing the ID in page->signature may cause 0x*40 signature
>>> to be unusable for other poison pointer purpose, it makes more sense to
>>> make it obvious by doing the above trick in some MM header file like
>>> poison.h instead of in the page_pool subsystem.
>>
>> AFAIU, PP_SIGNATURE is used for page_pool to be able to distinguish its
>> own pages from those allocated elsewhere (cf the description above).
>> Which means that these definitions are logically page_pool-internal, and
>> thus it makes the most sense to keep them in the page pool headers. The
>> only bits the mm subsystem cares about in that field are the bottom two
>> (for pfmemalloc pages and compound pages).
>
> All I asked is about moving PP_MAGIC_MASK macro into poison.h if you
> still want to proceed with reusing the page->pp_magic as the masking and
> the signature to be masked seems reasonable to be in the same file.
Hmm, my thinking was that this would be a lot of irrelevant stuff to put
into poison.h, but I suppose we could do so if the mm folks don't object :)
-Toke
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH net-next 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-19 12:18 ` Toke Høiland-Jørgensen
@ 2025-03-20 11:17 ` Yunsheng Lin
2025-03-20 14:34 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 19+ messages in thread
From: Yunsheng Lin @ 2025-03-20 11:17 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, Yunsheng Lin, David S. Miller,
Jakub Kicinski, Jesper Dangaard Brouer, Saeed Mahameed,
Leon Romanovsky, Tariq Toukan, Andrew Lunn, Eric Dumazet,
Paolo Abeni, Ilias Apalodimas, Simon Horman, Andrew Morton,
Mina Almasry, Yonglong Liu, Pavel Begunkov, Matthew Wilcox,
Robin Murphy, IOMMU, segoon, solar, oss-security,
kernel-hardening
Cc: netdev, bpf, linux-rdma, linux-mm, Qiuling Ren, Yuying Ma
On 2025/3/19 20:18, Toke Høiland-Jørgensen wrote:
>>
>> All I asked is about moving PP_MAGIC_MASK macro into poison.h if you
>> still want to proceed with reusing the page->pp_magic as the masking and
>> the signature to be masked seems reasonable to be in the same file.
>
> Hmm, my thinking was that this would be a lot of irrelevant stuff to put
> into poison.h, but I suppose we could do so if the mm folks don't object :)
The mask and the signature being masked are correlated; I am not sure
what you meant by 'irrelevant stuff' here.
As you seem to have understood most of my concern about reusing
page->pp_magic, I am not going to argue with you again about the
uncertainty of the security implications and the complexity of the
different address layouts across arches.
But having done some homework on the 'POISON_POINTER' mechanism, I still
think that reusing page->pp_magic is not the way forward.
If you still think my idea is too complex and want to proceed with
reusing the space of page->pp_magic, go ahead and let the maintainers
decide if it is worth the security risk and performance degradation.
>
> -Toke
>
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH net-next 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-20 11:17 ` Yunsheng Lin
@ 2025-03-20 14:34 ` Toke Høiland-Jørgensen
0 siblings, 0 replies; 19+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-20 14:34 UTC (permalink / raw)
To: Yunsheng Lin, Yunsheng Lin, David S. Miller, Jakub Kicinski,
Jesper Dangaard Brouer, Saeed Mahameed, Leon Romanovsky,
Tariq Toukan, Andrew Lunn, Eric Dumazet, Paolo Abeni,
Ilias Apalodimas, Simon Horman, Andrew Morton, Mina Almasry,
Yonglong Liu, Pavel Begunkov, Matthew Wilcox, Robin Murphy,
IOMMU, segoon, solar, kernel-hardening
Cc: netdev, bpf, linux-rdma, linux-mm, Qiuling Ren, Yuying Ma
Yunsheng Lin <linyunsheng@huawei.com> writes:
> On 2025/3/19 20:18, Toke Høiland-Jørgensen wrote:
>>>
>>> All I asked is about moving PP_MAGIC_MASK macro into poison.h if you
>>> still want to proceed with reusing the page->pp_magic as the masking and
>>> the signature to be masked seems reasonable to be in the same file.
>>
>> Hmm, my thinking was that this would be a lot of irrelevant stuff to put
>> into poison.h, but I suppose we could do so if the mm folks don't object :)
>
> The masking and the signature to be masked is correlated, I am not sure
> what you meant by 'irrelevant stuff' here.
Well, looking at it again, mostly the XA_LIMIT define, I guess. But I
can just leave that in the PP header.
> As you seemed to have understood most of my concern about reusing
> page->pp_magic, I am not going to argue with you about the uncertainty
> of security and complexity of different address layout for different
> arches again.
>
> But I am still think it is not the way forward with the reusing of
> page->pp_magic through doing some homework about the 'POISON_POINTER'.
> If you still think my idea is complex and still want to proceed with
> reusing the space of page->pp_magic, go ahead and let the maintainers
> decide if it is worth the security risk and performance degradation.
Yeah, thanks for taking the time to go through the implications. On
balance, I still believe reusing the bits is a better solution, but it
will of course ultimately be up to the maintainers to decide.
I will post a v2 of this series with the adjustments we've discussed,
and try to outline the tradeoffs and risks involved in the description,
and then leave it to the maintainers to decide which approach they want
to move forward with.
-Toke
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH net-next 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-19 11:06 ` Yunsheng Lin
2025-03-19 12:18 ` Toke Høiland-Jørgensen
@ 2025-03-20 2:32 ` Solar Designer
2025-03-20 14:37 ` Toke Høiland-Jørgensen
1 sibling, 1 reply; 19+ messages in thread
From: Solar Designer @ 2025-03-20 2:32 UTC (permalink / raw)
To: Yunsheng Lin
Cc: Toke Høiland-Jørgensen, Yunsheng Lin, David S. Miller,
Jakub Kicinski, Jesper Dangaard Brouer, Saeed Mahameed,
Leon Romanovsky, Tariq Toukan, Andrew Lunn, Eric Dumazet,
Paolo Abeni, Ilias Apalodimas, Simon Horman, Andrew Morton,
Mina Almasry, Yonglong Liu, Pavel Begunkov, Matthew Wilcox,
Robin Murphy, IOMMU, segoon, oss-security, kernel-hardening,
netdev, bpf, linux-rdma, linux-mm, Qiuling Ren, Yuying Ma,
sultan
On Wed, Mar 19, 2025 at 07:06:57PM +0800, Yunsheng Lin wrote:
> On 2025/3/19 4:55, Toke Høiland-Jørgensen wrote:
> > Yunsheng Lin <linyunsheng@huawei.com> writes:
> >> On 2025/3/17 23:16, Toke Høiland-Jørgensen wrote:
> >>> Yunsheng Lin <yunshenglin0825@gmail.com> writes:
> >>>> On 3/14/2025 6:10 PM, Toke Høiland-Jørgensen wrote:
> >>>>
> >>>> ...
> >>>>
> >>>>> To avoid having to walk the entire xarray on unmap to find the page
> >>>>> reference, we stash the ID assigned by xa_alloc() into the page
> >>>>> structure itself, using the upper bits of the pp_magic field. This
> >>>>> requires a couple of defines to avoid conflicting with the
> >>>>> POINTER_POISON_DELTA define, but this is all evaluated at compile-time,
> >>>>> so does not affect run-time performance. The bitmap calculations in this
> >>>>> patch gives the following number of bits for different architectures:
> >>>>>
> >>>>> - 24 bits on 32-bit architectures
> >>>>> - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
> >>>>> - 32 bits on other 64-bit architectures
> >>>>
> >>>> From commit c07aea3ef4d4 ("mm: add a signature in struct page"):
> >>>> "The page->signature field is aliased to page->lru.next and
> >>>> page->compound_head, but it can't be set by mistake because the
> >>>> signature value is a bad pointer, and can't trigger a false positive
> >>>> in PageTail() because the last bit is 0."
> >>>>
> >>>> And commit 8a5e5e02fc83 ("include/linux/poison.h: fix LIST_POISON{1,2}
> >>>> offset"):
> >>>> "Poison pointer values should be small enough to find a room in
> >>>> non-mmap'able/hardly-mmap'able space."
> >>>>
> >>>> So the question seems to be:
> >>>> 1. Is stashing the ID causing page->pp_magic to be in the mmap'able/
> >>>> easier-mmap'able space? If yes, how can we make sure this will not
> >>>> cause any security problem?
> >>>> 2. Is the masking the page->pp_magic causing a valid pointer for
> >>>> page->lru.next or page->compound_head to be treated as a valid
> >>>> PP_SIGNATURE? which might cause page_pool to recycle a page not
> >>>> allocated via page_pool.
> >>>
> >>> Right, so my reasoning for why the defines in this patch works for this
> >>> is as follows: in both cases we need to make sure that the ID stashed in
> >>> that field never looks like a valid kernel pointer. For 64-bit arches
> >>> (where CONFIG_ILLEGAL_POINTER_VALUE), we make sure of this by never
> >>> writing to any bits that overlap with the illegal value (so that the
> >>> PP_SIGNATURE written to the field keeps it as an illegal pointer value).
> >>> For 32-bit arches, we make sure of this by making sure the top-most bit
> >>> is always 0 (the -1 in the define for _PP_DMA_INDEX_BITS) in the patch,
> >>> which puts it outside the range used for kernel pointers (AFAICT).
> >>
> >> Is there any reason you think only kernel pointer is relevant here?
> >
> > Yes. Any pointer stored in the same space as pp_magic by other users of
> > the page will be kernel pointers (as they come from page->lru.next). The
> > goal of PP_SIGNATURE is to be able to distinguish pages allocated by
> > page_pool, so we don't accidentally recycle a page from somewhere else.
> > That's the goal of the check in page_pool_page_is_pp():
> >
> > (page->pp_magic & PP_MAGIC_MASK) == PP_SIGNATURE
> >
> > To achieve this, we must ensure that the check above never returns true
> > for any value another page user could have written into the same field
> > (i.e., into page->lru.next). For 64-bit arches, POISON_POINTER_DELTA
>
> POISON_POINTER_DELTA is defined according to CONFIG_ILLEGAL_POINTER_VALUE,
> if CONFIG_ILLEGAL_POINTER_VALUE is not defined yet, POISON_POINTER_DELTA
> is defined to zero.
>
> It seems only the below 64-bit arches define CONFIG_ILLEGAL_POINTER_VALUE
> through grepping:
> a29815a333c6 core, x86: make LIST_POISON less deadly
> 5c178472af24 riscv: define ILLEGAL_POINTER_VALUE for 64bit
> f6853eb561fb powerpc/64: Define ILLEGAL_POINTER_VALUE for 64-bit
> bf0c4e047324 arm64: kconfig: Move LIST_POISON to a safe value
>
> The below 64-bit arches don't seem to define the above config yet:
> MIPS64, SPARC64, System z(S390X),loongarch
>
> Does ID stashing cause problem for the above arches?
>
> > serves this purpose. For 32-bit arches, we can leave the top-most bits
> > out of PP_MAGIC_MASK, to make sure that any valid pointer value will
> > fail the check above.
>
> The above mainly explained how to ensure page_pool_page_is_pp() will
> not return false positive result from the page_pool perspective.
>
> From MM/security perspective, most of the commits quoted above seem
> to suggest that poison pointer should be in the non-mmap'able or
> hardly-mmap'able space, otherwise userspace can arrange for those
> pointers to actually be dereferencable, potentially turning an oops
> to an exploit, more detailed example in the below paper, which explains
> how to exploit a vulnerability which hardened by the 8a5e5e02fc83 commit:
> https://www.usenix.org/system/files/conference/woot15/woot15-paper-xu.pdf
>
> ID stashing seems to cause page->lru.next (aliased to page->pp_magic) to
> be in the mmap'able space for some arches.
...
> To be honest, I am not that familiar with the pointer poison mechanism.
> But through some researching and analyzing, it makes sense to overstate
> it a little as it seems to be security-related.
> Cc'ed some security-related experts and ML to see if there is some
> clarifying from them.
You're correct that the pointer poison values should be in areas not
mmap'able by userspace (at least with reasonable mmap_min_addr values).
Looking at the union inside "struct page", I see pp_magic is aliased
against multiple pointers in the union'ed anonymous structs.
I'm not familiar with the uses of page->pp_magic and how likely or not
we are to have a bug where its aliasing with pointers would be exposed
as an attack vector, but this does look like a serious security concern.
It looks like we would be seriously weakening the poisoning, except on
archs where the new values with ID stashing are still not mmap'able.
I just discussed the matter with my colleague at CIQ, Sultan Alsawaf,
and he thinks the added risk is not that bad. He wrote:
> Toke's response here is fair:
>
> > Right, okay, I see what you mean. So the risk is basically the
> > following:
> >
> > If some other part of the kernel ends up dereferencing the
> > page->lru.next pointer of a page that is owned by page_pool, and which
> > has an ID stashed into page->pp_magic, that dereference can end up being
> > to a valid userspace mapping, which can lead to Bad Things(tm), cf the
> > paper above.
> >
> > This is mitigated by the fact that it can only happen on architectures
> > that don't set ILLEGAL_POINTER_VALUE (which includes 32-bit arches, and
> > the ones you listed above). In addition, this has to happen while the
> > page is owned by page_pool, and while it is DMA-mapped - we already
> > clear the pp_magic field when releasing the page from page_pool.
> >
> > I am not sure to what extent the above is a risk we should take pains to
> > avoid, TBH. It seems to me that for this to become a real problem, lots
> > of other things will already have gone wrong. But happy to defer to the
> > mm/security folks here.
>
> For this to be a problem, there already needs to be a use-after-free on
> a page, which arguably creates many other vectors for attack.
>
> The lru field of struct page is already used as a generic list pointer
> in several places in the kernel once ownership of the page is obtained.
> Any risk of dereferencing lru.next in a use-after-free scenario would
> technically apply to a bunch of other places in the kernel (grep for
> page->lru).
We also tried searching for existing exploitation techniques for "struct
page" use-after-free. We couldn't find any. The closest (non-)match I
found is this fine research (the same project presented differently):
https://i.blackhat.com/BH-US-24/Presentations/US24-Qian-PageJack-A-Powerful-Exploit-Technique-With-Page-Level-UAF-Thursday.pdf page 33+
https://arxiv.org/html/2401.17618v2#S4
https://phrack.org/issues/71/13
The arxiv paper includes this sentence: "To create a page-level UAF, the
key is to cause a UAF of the struct page objects." However, we do not
see them actually do that, and this statement is not found in the slides
nor in the Phrack article. Confused.
Thank you for CC'ing me and the kernel-hardening list. However, please
do not CC the oss-security list like that, where it's against content
guidelines. Only properly focused new postings/threads are acceptable
there (not CC'ing from/to other lists where only part of the content is
on-topic, and follow-ups might not be on-topic at all). See:
https://oss-security.openwall.org/wiki/mailing-lists/oss-security#list-content-guidelines
As a moderator for oss-security, I'm going to remove these messages from
the queue now. Please drop Cc: oss-security from any further replies.
If desired, we may bring these topics to oss-security separately, with a
proper Subject line and clear description of what we're talking about.
Alexander
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH net-next 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-20 2:32 ` Solar Designer
@ 2025-03-20 14:37 ` Toke Høiland-Jørgensen
0 siblings, 0 replies; 19+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-20 14:37 UTC (permalink / raw)
To: Solar Designer, Yunsheng Lin
Cc: Yunsheng Lin, David S. Miller, Jakub Kicinski,
Jesper Dangaard Brouer, Saeed Mahameed, Leon Romanovsky,
Tariq Toukan, Andrew Lunn, Eric Dumazet, Paolo Abeni,
Ilias Apalodimas, Simon Horman, Andrew Morton, Mina Almasry,
Yonglong Liu, Pavel Begunkov, Matthew Wilcox, Robin Murphy,
IOMMU, segoon, kernel-hardening, netdev, bpf, linux-rdma,
linux-mm, Qiuling Ren, Yuying Ma, sultan
Solar Designer <solar@openwall.com> writes:
> On Wed, Mar 19, 2025 at 07:06:57PM +0800, Yunsheng Lin wrote:
>> On 2025/3/19 4:55, Toke Høiland-Jørgensen wrote:
>> > Yunsheng Lin <linyunsheng@huawei.com> writes:
>> >> On 2025/3/17 23:16, Toke Høiland-Jørgensen wrote:
>> >>> Yunsheng Lin <yunshenglin0825@gmail.com> writes:
>> >>>> On 3/14/2025 6:10 PM, Toke Høiland-Jørgensen wrote:
>> >>>>
>> >>>> ...
>> >>>>
>> >>>>> To avoid having to walk the entire xarray on unmap to find the page
>> >>>>> reference, we stash the ID assigned by xa_alloc() into the page
>> >>>>> structure itself, using the upper bits of the pp_magic field. This
>> >>>>> requires a couple of defines to avoid conflicting with the
>> >>>>> POINTER_POISON_DELTA define, but this is all evaluated at compile-time,
>> >>>>> so does not affect run-time performance. The bitmap calculations in this
>> >>>>> patch gives the following number of bits for different architectures:
>> >>>>>
>> >>>>> - 24 bits on 32-bit architectures
>> >>>>> - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
>> >>>>> - 32 bits on other 64-bit architectures
>> >>>>
>> >>>> From commit c07aea3ef4d4 ("mm: add a signature in struct page"):
>> >>>> "The page->signature field is aliased to page->lru.next and
>> >>>> page->compound_head, but it can't be set by mistake because the
>> >>>> signature value is a bad pointer, and can't trigger a false positive
>> >>>> in PageTail() because the last bit is 0."
>> >>>>
>> >>>> And commit 8a5e5e02fc83 ("include/linux/poison.h: fix LIST_POISON{1,2}
>> >>>> offset"):
>> >>>> "Poison pointer values should be small enough to find a room in
>> >>>> non-mmap'able/hardly-mmap'able space."
>> >>>>
>> >>>> So the question seems to be:
>> >>>> 1. Is stashing the ID causing page->pp_magic to be in the mmap'able/
>> >>>> easier-mmap'able space? If yes, how can we make sure this will not
>> >>>> cause any security problem?
>> >>>> 2. Is the masking the page->pp_magic causing a valid pointer for
>> >>>> page->lru.next or page->compound_head to be treated as a valid
>> >>>> PP_SIGNATURE? which might cause page_pool to recycle a page not
>> >>>> allocated via page_pool.
>> >>>
>> >>> Right, so my reasoning for why the defines in this patch works for this
>> >>> is as follows: in both cases we need to make sure that the ID stashed in
>> >>> that field never looks like a valid kernel pointer. For 64-bit arches
>> >>> (where CONFIG_ILLEGAL_POINTER_VALUE), we make sure of this by never
>> >>> writing to any bits that overlap with the illegal value (so that the
>> >>> PP_SIGNATURE written to the field keeps it as an illegal pointer value).
>> >>> For 32-bit arches, we make sure of this by making sure the top-most bit
>> >>> is always 0 (the -1 in the define for _PP_DMA_INDEX_BITS) in the patch,
>> >>> which puts it outside the range used for kernel pointers (AFAICT).
>> >>
>> >> Is there any reason you think only kernel pointer is relevant here?
>> >
>> > Yes. Any pointer stored in the same space as pp_magic by other users of
>> > the page will be kernel pointers (as they come from page->lru.next). The
>> > goal of PP_SIGNATURE is to be able to distinguish pages allocated by
>> > page_pool, so we don't accidentally recycle a page from somewhere else.
>> > That's the goal of the check in page_pool_page_is_pp():
>> >
>> > (page->pp_magic & PP_MAGIC_MASK) == PP_SIGNATURE
>> >
>> > To achieve this, we must ensure that the check above never returns true
>> > for any value another page user could have written into the same field
>> > (i.e., into page->lru.next). For 64-bit arches, POISON_POINTER_DELTA
>>
>> POISON_POINTER_DELTA is defined according to CONFIG_ILLEGAL_POINTER_VALUE,
>> if CONFIG_ILLEGAL_POINTER_VALUE is not defined yet, POISON_POINTER_DELTA
>> is defined to zero.
>>
>> It seems only the below 64-bit arches define CONFIG_ILLEGAL_POINTER_VALUE
>> through grepping:
>> a29815a333c6 core, x86: make LIST_POISON less deadly
>> 5c178472af24 riscv: define ILLEGAL_POINTER_VALUE for 64bit
>> f6853eb561fb powerpc/64: Define ILLEGAL_POINTER_VALUE for 64-bit
>> bf0c4e047324 arm64: kconfig: Move LIST_POISON to a safe value
>>
>> The below 64-bit arches don't seem to define the above config yet:
>> MIPS64, SPARC64, System z(S390X),loongarch
>>
>> Does ID stashing cause problem for the above arches?
>>
>> > serves this purpose. For 32-bit arches, we can leave the top-most bits
>> > out of PP_MAGIC_MASK, to make sure that any valid pointer value will
>> > fail the check above.
>>
>> The above mainly explained how to ensure page_pool_page_is_pp() will
>> not return false positive result from the page_pool perspective.
>>
>> From MM/security perspective, most of the commits quoted above seem
>> to suggest that poison pointer should be in the non-mmap'able or
>> hardly-mmap'able space, otherwise userspace can arrange for those
>> pointers to actually be dereferencable, potentially turning an oops
>> to an exploit, more detailed example in the below paper, which explains
>> how to exploit a vulnerability which hardened by the 8a5e5e02fc83 commit:
>> https://www.usenix.org/system/files/conference/woot15/woot15-paper-xu.pdf
>>
>> ID stashing seems to cause page->lru.next (aliased to page->pp_magic) to
>> be in the mmap'able space for some arches.
>
> ...
>
>> To be honest, I am not that familiar with the pointer poison mechanism.
>> But through some researching and analyzing, it makes sense to overstate
>> it a little as it seems to be security-related.
>> Cc'ed some security-related experts and ML to see if there is some
>> clarifying from them.
>
> You're correct that the pointer poison values should be in areas not
> mmap'able by userspace (at least with reasonable mmap_min_addr values).
>
> Looking at the union inside "struct page", I see pp_magic is aliased
> against multiple pointers in the union'ed anonymous structs.
>
> I'm not familiar with the uses of page->pp_magic and how likely or not
> we are to have a bug where its aliasing with pointers would be exposed
> as an attack vector, but this does look like a serious security concern.
> It looks like we would be seriously weakening the poisoning, except on
> archs where the new values with ID stashing are still not mmap'able.
>
> I just discussed the matter with my colleague at CIQ, Sultan Alsawaf,
> and he thinks the added risk is not that bad. He wrote:
>
>> Toke's response here is fair:
>>
>> > Right, okay, I see what you mean. So the risk is basically the
>> > following:
>> >
>> > If some other part of the kernel ends up dereferencing the
>> > page->lru.next pointer of a page that is owned by page_pool, and which
>> > has an ID stashed into page->pp_magic, that dereference can end up being
>> > to a valid userspace mapping, which can lead to Bad Things(tm), cf the
>> > paper above.
>> >
>> > This is mitigated by the fact that it can only happen on architectures
>> > that don't set ILLEGAL_POINTER_VALUE (which includes 32-bit arches, and
>> > the ones you listed above). In addition, this has to happen while the
>> > page is owned by page_pool, and while it is DMA-mapped - we already
>> > clear the pp_magic field when releasing the page from page_pool.
>> >
>> > I am not sure to what extent the above is a risk we should take pains to
>> > avoid, TBH. It seems to me that for this to become a real problem, lots
>> > of other things will already have gone wrong. But happy to defer to the
>> > mm/security folks here.
>>
>> For this to be a problem, there already needs to be a use-after-free on
>> a page, which arguably creates many other vectors for attack.
>>
>> The lru field of struct page is already used as a generic list pointer
>> in several places in the kernel once ownership of the page is obtained.
>> Any risk of dereferencing lru.next in a use-after-free scenario would
>> technically apply to a bunch of other places in the kernel (grep for
>> page->lru).
>
> We also tried searching for existing exploitation techniques for "struct
> page" use-after-free. We couldn't find any. The closest (non-)match I
> found is this fine research (the same project presented differently):
>
> https://i.blackhat.com/BH-US-24/Presentations/US24-Qian-PageJack-A-Powerful-Exploit-Technique-With-Page-Level-UAF-Thursday.pdf page 33+
> https://arxiv.org/html/2401.17618v2#S4
> https://phrack.org/issues/71/13
>
> The arxiv paper includes this sentence: "To create a page-level UAF, the
> key is to cause a UAF of the struct page objects." However, we do not
> see them actually do that, and this statement is not found in the slides
> nor in the Phrack article. Confused.
Thank you for weighing in! I will post an updated version of this patch
with a reference to this discussion and try to summarise the above.
> Thank you for CC'ing me and the kernel-hardening list. However, please
> do not CC the oss-security list like that, where it's against content
> guidelines. Only properly focused new postings/threads are acceptable
> there (not CC'ing from/to other lists where only part of the content is
> on-topic, and follow-ups might not be on-topic at all). See:
>
> https://oss-security.openwall.org/wiki/mailing-lists/oss-security#list-content-guidelines
>
> As a moderator for oss-security, I'm going to remove these messages from
> the queue now. Please drop Cc: oss-security from any further replies.
>
> If desired, we may bring these topics to oss-security separately, with a
> proper Subject line and clear description of what we're talking about.
Noted, thanks!
-Toke
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH net-next 0/3] Fix late DMA unmap crash for page pool
2025-03-14 10:10 [PATCH net-next 0/3] Fix late DMA unmap crash for page pool Toke Høiland-Jørgensen
` (2 preceding siblings ...)
2025-03-14 10:10 ` [PATCH net-next 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool Toke Høiland-Jørgensen
@ 2025-03-17 2:02 ` Yonglong Liu
3 siblings, 0 replies; 19+ messages in thread
From: Yonglong Liu @ 2025-03-17 2:02 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, David S. Miller,
Jakub Kicinski, Jesper Dangaard Brouer, Saeed Mahameed,
Leon Romanovsky, Tariq Toukan, Andrew Lunn, Eric Dumazet,
Paolo Abeni, Ilias Apalodimas, Simon Horman, Andrew Morton,
Mina Almasry, Yunsheng Lin, Pavel Begunkov, Matthew Wilcox
Cc: netdev, bpf, linux-rdma, linux-mm, Qiuling Ren, Yuying Ma
I know this solution is still under discussion, but anyway, I tested the
scenarios I reported in:
[0] https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
It seems the problem has been solved. Thanks!
Tested-by: Yonglong Liu <liuyonglong@huawei.com>
On 2025/3/14 18:10, Toke Høiland-Jørgensen wrote:
> This series fixes the late dma_unmap crash for page pool first reported
> by Yonglong Liu in [0]. It is an alternative approach to the one
> submitted by Yunsheng Lin, most recently in [1]. The first two commits
> are small refactors of the page pool code, in preparation of the main
> change in patch 3. See the commit message of patch 3 for the details.
>
> -Toke
>
> [0] https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/
> [1] https://lore.kernel.org/r/20250307092356.638242-1-linyunsheng@huawei.com
>
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> ---
> Toke Høiland-Jørgensen (3):
> page_pool: Move pp_magic check into helper functions
> page_pool: Turn dma_sync and dma_sync_cpu fields into a bitmap
> page_pool: Track DMA-mapped pages and unmap them when destroying the pool
>
> drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c | 4 +-
> include/net/page_pool/helpers.h | 6 +-
> include/net/page_pool/types.h | 54 +++++++++++++++-
> mm/page_alloc.c | 9 +--
> net/core/devmem.c | 3 +-
> net/core/netmem_priv.h | 33 +++++++++-
> net/core/page_pool.c | 81 ++++++++++++++++++++----
> net/core/skbuff.c | 16 +----
> net/core/xdp.c | 4 +-
> 9 files changed, 164 insertions(+), 46 deletions(-)
> ---
> base-commit: 8ef890df4031121a94407c84659125cbccd3fdbe
> change-id: 20250310-page-pool-track-dma-0332343a460e
>
>
^ permalink raw reply [flat|nested] 19+ messages in thread