* [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
@ 2025-03-08 14:54 Toke Høiland-Jørgensen
2025-03-08 19:22 ` Mina Almasry
2025-03-09 13:27 ` Yunsheng Lin
0 siblings, 2 replies; 32+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-08 14:54 UTC (permalink / raw)
To: Andrew Morton, Jesper Dangaard Brouer, Ilias Apalodimas, David S. Miller
Cc: Yunsheng Lin, Toke Høiland-Jørgensen, Yonglong Liu,
Mina Almasry, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, linux-mm, netdev
When enabling DMA mapping in page_pool, pages are kept DMA mapped until
they are released from the pool, to avoid the overhead of re-mapping the
pages every time they are used. This causes problems when a device is
torn down, because the page pool can't unmap the pages until they are
returned to the pool. This causes resource leaks and/or crashes when
there are pages still outstanding while the device is torn down, because
page_pool will attempt an unmap of a non-existent DMA device on the
subsequent page return.
To fix this, implement a simple tracking of outstanding dma-mapped pages
in page pool using an xarray. This was first suggested by Mina[0], and
turns out to be fairly straightforward: We simply store pointers to
pages directly in the xarray with xa_alloc() when they are first DMA
mapped, and remove them from the array on unmap. Then, when a page pool
is torn down, it can simply walk the xarray and unmap all pages still
present there before returning, which also allows us to get rid of the
get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional
synchronisation is needed, as a page will only ever be unmapped once.
To avoid having to walk the entire xarray on unmap to find the page
reference, we stash the ID assigned by xa_alloc() into the page
structure itself, in the field previously called '_pp_mapping_pad' in
the page_pool struct inside struct page. This field overlaps with the
page->mapping pointer, which may turn out to be problematic, so an
alternative is probably needed. Sticking the ID into some of the upper
bits of page->pp_magic may work as an alternative, but that requires
further investigation. Using the 'mapping' field works well enough as
a demonstration for this RFC, though.
Since all the tracking is performed on DMA map/unmap, no additional code
is needed in the fast path, meaning the performance overhead of this
tracking is negligible. The extra memory needed to track the pages is
neatly encapsulated inside xarray, which uses the 'struct xa_node'
structure to track items. This structure is 576 bytes long, with slots
for 64 items, meaning that a full node incurs only 9 bytes of overhead
per slot it tracks (in practice, it probably won't be this efficient,
but in any case it should be an acceptable overhead).
[0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code")
Reported-by: Yonglong Liu <liuyonglong@huawei.com>
Suggested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Tested-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
This is an alternative to Yunsheng's series. Yunsheng requested I send
this as an RFC to better be able to discuss the different approaches; see
some initial discussion in[1], also regarding where to store the ID as
alluded to above.
-Toke
[1] https://lore.kernel.org/r/40b33879-509a-4c4a-873b-b5d3573b6e14@gmail.com
include/linux/mm_types.h | 2 +-
include/net/page_pool/types.h | 2 ++
net/core/netmem_priv.h | 17 +++++++++++++
net/core/page_pool.c | 46 +++++++++++++++++++++++++++++------
4 files changed, 58 insertions(+), 9 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0234f14f2aa6..d2c7a7b04bea 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -121,7 +121,7 @@ struct page {
*/
unsigned long pp_magic;
struct page_pool *pp;
- unsigned long _pp_mapping_pad;
+ unsigned long pp_dma_index;
unsigned long dma_addr;
atomic_long_t pp_ref_count;
};
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 36eb57d73abc..13597a77aa36 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -221,6 +221,8 @@ struct page_pool {
void *mp_priv;
const struct memory_provider_ops *mp_ops;
+ struct xarray dma_mapped;
+
#ifdef CONFIG_PAGE_POOL_STATS
/* recycle stats are per-cpu to avoid locking */
struct page_pool_recycle_stats __percpu *recycle_stats;
diff --git a/net/core/netmem_priv.h b/net/core/netmem_priv.h
index 7eadb8393e00..59679406a7b7 100644
--- a/net/core/netmem_priv.h
+++ b/net/core/netmem_priv.h
@@ -28,4 +28,21 @@ static inline void netmem_set_dma_addr(netmem_ref netmem,
{
__netmem_clear_lsb(netmem)->dma_addr = dma_addr;
}
+
+static inline unsigned long netmem_get_dma_index(netmem_ref netmem)
+{
+ if (WARN_ON_ONCE(netmem_is_net_iov(netmem)))
+ return 0;
+
+ return netmem_to_page(netmem)->pp_dma_index;
+}
+
+static inline void netmem_set_dma_index(netmem_ref netmem,
+ unsigned long id)
+{
+ if (WARN_ON_ONCE(netmem_is_net_iov(netmem)))
+ return;
+
+ netmem_to_page(netmem)->pp_dma_index = id;
+}
#endif
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index acef1fcd8ddc..d5530f29bf62 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -226,6 +226,8 @@ static int page_pool_init(struct page_pool *pool,
return -EINVAL;
pool->dma_map = true;
+
+ xa_init_flags(&pool->dma_mapped, XA_FLAGS_ALLOC1);
}
if (pool->slow.flags & PP_FLAG_DMA_SYNC_DEV) {
@@ -275,9 +277,6 @@ static int page_pool_init(struct page_pool *pool,
/* Driver calling page_pool_create() also call page_pool_destroy() */
refcount_set(&pool->user_cnt, 1);
- if (pool->dma_map)
- get_device(pool->p.dev);
-
if (pool->slow.flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM) {
/* We rely on rtnl_lock()ing to make sure netdev_rx_queue
* configuration doesn't change while we're initializing
@@ -325,7 +324,7 @@ static void page_pool_uninit(struct page_pool *pool)
ptr_ring_cleanup(&pool->ring, NULL);
if (pool->dma_map)
- put_device(pool->p.dev);
+ xa_destroy(&pool->dma_mapped);
#ifdef CONFIG_PAGE_POOL_STATS
if (!pool->system)
@@ -470,9 +469,11 @@ page_pool_dma_sync_for_device(const struct page_pool *pool,
__page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
}
-static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
+static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem, gfp_t gfp)
{
dma_addr_t dma;
+ int err;
+ u32 id;
/* Setup DMA mapping: use 'struct page' area for storing DMA-addr
* since dma_addr_t can be either 32 or 64 bits and does not always fit
@@ -486,9 +487,19 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
if (dma_mapping_error(pool->p.dev, dma))
return false;
+ if (in_softirq())
+ err = xa_alloc(&pool->dma_mapped, &id, netmem_to_page(netmem),
+ XA_LIMIT(1, UINT_MAX), gfp);
+ else
+ err = xa_alloc_bh(&pool->dma_mapped, &id, netmem_to_page(netmem),
+ XA_LIMIT(1, UINT_MAX), gfp);
+ if (err)
+ goto unmap_failed;
+
if (page_pool_set_dma_addr_netmem(netmem, dma))
goto unmap_failed;
+ netmem_set_dma_index(netmem, id);
page_pool_dma_sync_for_device(pool, netmem, pool->p.max_len);
return true;
@@ -511,7 +522,7 @@ static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
if (unlikely(!page))
return NULL;
- if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page)))) {
+ if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page), gfp))) {
put_page(page);
return NULL;
}
@@ -557,7 +568,7 @@ static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool,
*/
for (i = 0; i < nr_pages; i++) {
netmem = pool->alloc.cache[i];
- if (dma_map && unlikely(!page_pool_dma_map(pool, netmem))) {
+ if (dma_map && unlikely(!page_pool_dma_map(pool, netmem, gfp))) {
put_page(netmem_to_page(netmem));
continue;
}
@@ -659,6 +670,8 @@ void page_pool_clear_pp_info(netmem_ref netmem)
static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
netmem_ref netmem)
{
+ struct page *old, *page = netmem_to_page(netmem);
+ unsigned long id;
dma_addr_t dma;
if (!pool->dma_map)
@@ -667,6 +680,17 @@ static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
*/
return;
+ id = netmem_get_dma_index(netmem);
+ if (!id)
+ return;
+
+ if (in_softirq())
+ old = xa_cmpxchg(&pool->dma_mapped, id, page, NULL, 0);
+ else
+ old = xa_cmpxchg_bh(&pool->dma_mapped, id, page, NULL, 0);
+ if (old != page)
+ return;
+
dma = page_pool_get_dma_addr_netmem(netmem);
/* When page is unmapped, it cannot be returned to our pool */
@@ -674,6 +698,7 @@ static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
PAGE_SIZE << pool->p.order, pool->p.dma_dir,
DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
page_pool_set_dma_addr_netmem(netmem, 0);
+ netmem_set_dma_index(netmem, 0);
}
/* Disconnects a page (from a page_pool). API users can have a need
@@ -1083,8 +1108,13 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
static void page_pool_scrub(struct page_pool *pool)
{
+ unsigned long id;
+ void *ptr;
+
page_pool_empty_alloc_cache_once(pool);
- pool->destroy_cnt++;
+ if (!pool->destroy_cnt++)
+ xa_for_each(&pool->dma_mapped, id, ptr)
+ __page_pool_release_page_dma(pool, page_to_netmem(ptr));
/* No more consumers should exist, but producers could still
* be in-flight.
--
2.48.1
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-08 14:54 [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool Toke Høiland-Jørgensen
@ 2025-03-08 19:22 ` Mina Almasry
2025-03-09 12:42 ` Toke Høiland-Jørgensen
2025-03-10 7:17 ` Pavel Begunkov
2025-03-09 13:27 ` Yunsheng Lin
1 sibling, 2 replies; 32+ messages in thread
From: Mina Almasry @ 2025-03-08 19:22 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, Pavel Begunkov, David Wei
Cc: Andrew Morton, Jesper Dangaard Brouer, Ilias Apalodimas,
David S. Miller, Yunsheng Lin, Yonglong Liu, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, linux-mm, netdev
On Sat, Mar 8, 2025 at 6:55 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> When enabling DMA mapping in page_pool, pages are kept DMA mapped until
> they are released from the pool, to avoid the overhead of re-mapping the
> pages every time they are used. This causes problems when a device is
> torn down, because the page pool can't unmap the pages until they are
> returned to the pool. This causes resource leaks and/or crashes when
> there are pages still outstanding while the device is torn down, because
> page_pool will attempt an unmap of a non-existent DMA device on the
> subsequent page return.
>
> To fix this, implement a simple tracking of outstanding dma-mapped pages
> in page pool using an xarray. This was first suggested by Mina[0], and
> turns out to be fairly straight forward: We simply store pointers to
> pages directly in the xarray with xa_alloc() when they are first DMA
> mapped, and remove them from the array on unmap. Then, when a page pool
> is torn down, it can simply walk the xarray and unmap all pages still
> present there before returning, which also allows us to get rid of the
> get/put_device() calls in page_pool.
THANK YOU!! I had been looking at the other proposals to fix this here
and there and I had similar feelings to you. They add lots of code
changes and the code changes themselves were hard for me to
understand. I hope we can make this simpler approach work.
> Using xa_cmpxchg(), no additional
> synchronisation is needed, as a page will only ever be unmapped once.
>
Very clever. I had been wondering how to handle the concurrency. I
also think this works.
> To avoid having to walk the entire xarray on unmap to find the page
> reference, we stash the ID assigned by xa_alloc() into the page
> structure itself, in the field previously called '_pp_mapping_pad' in
> the page_pool struct inside struct page. This field overlaps with the
> page->mapping pointer, which may turn out to be problematic, so an
> alternative is probably needed. Sticking the ID into some of the upper
> bits of page->pp_magic may work as an alternative, but that requires
> further investigation. Using the 'mapping' field works well enough as
> a demonstration for this RFC, though.
>
I'm unsure about this. I think page->mapping may be used when we map
the page to the userspace in TCP zerocopy, but I'm really not sure.
Yes, finding somewhere else to put the id would be ideal. Do we really
need a full unsigned long for the pp_magic?
> Since all the tracking is performed on DMA map/unmap, no additional code
> is needed in the fast path, meaning the performance overhead of this
> tracking is negligible. The extra memory needed to track the pages is
> neatly encapsulated inside xarray, which uses the 'struct xa_node'
> structure to track items. This structure is 576 bytes long, with slots
> for 64 items, meaning that a full node occurs only 9 bytes of overhead
> per slot it tracks (in practice, it probably won't be this efficient,
> but in any case it should be an acceptable overhead).
>
Yes, I think I also saw in another thread that this version actually
produces better perf results than the more complicated version,
because the page_pool benchmark actually does no mapping and this
version doesn't add overhead when there is no mapping.
As an aside I have it in my todo list to put the page_pool benchmark
in the upstream kernel so we don't have to run out-of-tree versions.
> [0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
>
> Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code")
> Reported-by: Yonglong Liu <liuyonglong@huawei.com>
> Suggested-by: Mina Almasry <almasrymina@google.com>
> Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Tested-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
I'm only holding back my Reviewed-by for the page->mapping issue.
Other than that, I think this is great. Thank you very much for
looking.
Pavel, David, as an aside, I think we need to propagate this fix to
memory providers as a follow up. We probably need a new op in the
provider to unmap. Then, in page_pool_scrub, where this patch does an
xa_for_each, we need to call that unmap op.
For the io_uring memory provider, I think it needs to unmap its memory
(and record it did so, so it doesn't re-unmap it later).
For the dmabuf memory provider, I think maybe we need to call
dma_buf_unmap_attachment (and record we did that, so we don't re-do it
later).
I don't think propagating this fix to memory providers should block
merging the fix for pages, IMO.
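To make that concrete, roughly what I have in mind (the op name below is
an assumption on my part, not an existing API; this is only a sketch of
the follow-up, not part of this fix):

	/* Hypothetical follow-up: a provider-level unmap hook, called from
	 * page_pool_scrub() after the xarray walk. "unmap_all" does not
	 * exist in memory_provider_ops today.
	 */
	if (!pool->destroy_cnt++) {
		xa_for_each(&pool->dma_mapped, id, ptr)
			__page_pool_release_page_dma(pool, page_to_netmem(ptr));

		if (pool->mp_ops && pool->mp_ops->unmap_all)
			pool->mp_ops->unmap_all(pool);
	}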
> ---
> This is an alternative to Yunsheng's series. Yunsheng requested I send
> this as an RFC to better be able to discuss the different approaches; see
> some initial discussion in[1], also regarding where to store the ID as
> alluded to above.
>
> -Toke
>
> [1] https://lore.kernel.org/r/40b33879-509a-4c4a-873b-b5d3573b6e14@gmail.com
>
> include/linux/mm_types.h | 2 +-
> include/net/page_pool/types.h | 2 ++
> net/core/netmem_priv.h | 17 +++++++++++++
> net/core/page_pool.c | 46 +++++++++++++++++++++++++++++------
> 4 files changed, 58 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 0234f14f2aa6..d2c7a7b04bea 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -121,7 +121,7 @@ struct page {
> */
> unsigned long pp_magic;
> struct page_pool *pp;
> - unsigned long _pp_mapping_pad;
> + unsigned long pp_dma_index;
> unsigned long dma_addr;
> atomic_long_t pp_ref_count;
> };
> diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
> index 36eb57d73abc..13597a77aa36 100644
> --- a/include/net/page_pool/types.h
> +++ b/include/net/page_pool/types.h
> @@ -221,6 +221,8 @@ struct page_pool {
> void *mp_priv;
> const struct memory_provider_ops *mp_ops;
>
> + struct xarray dma_mapped;
> +
> #ifdef CONFIG_PAGE_POOL_STATS
> /* recycle stats are per-cpu to avoid locking */
> struct page_pool_recycle_stats __percpu *recycle_stats;
> diff --git a/net/core/netmem_priv.h b/net/core/netmem_priv.h
> index 7eadb8393e00..59679406a7b7 100644
> --- a/net/core/netmem_priv.h
> +++ b/net/core/netmem_priv.h
> @@ -28,4 +28,21 @@ static inline void netmem_set_dma_addr(netmem_ref netmem,
> {
> __netmem_clear_lsb(netmem)->dma_addr = dma_addr;
> }
> +
> +static inline unsigned long netmem_get_dma_index(netmem_ref netmem)
> +{
> + if (WARN_ON_ONCE(netmem_is_net_iov(netmem)))
> + return 0;
> +
> + return netmem_to_page(netmem)->pp_dma_index;
> +}
> +
> +static inline void netmem_set_dma_index(netmem_ref netmem,
> + unsigned long id)
> +{
> + if (WARN_ON_ONCE(netmem_is_net_iov(netmem)))
> + return;
> +
> + netmem_to_page(netmem)->pp_dma_index = id;
> +}
> #endif
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index acef1fcd8ddc..d5530f29bf62 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -226,6 +226,8 @@ static int page_pool_init(struct page_pool *pool,
> return -EINVAL;
>
> pool->dma_map = true;
> +
> + xa_init_flags(&pool->dma_mapped, XA_FLAGS_ALLOC1);
> }
>
> if (pool->slow.flags & PP_FLAG_DMA_SYNC_DEV) {
> @@ -275,9 +277,6 @@ static int page_pool_init(struct page_pool *pool,
> /* Driver calling page_pool_create() also call page_pool_destroy() */
> refcount_set(&pool->user_cnt, 1);
>
> - if (pool->dma_map)
> - get_device(pool->p.dev);
> -
> if (pool->slow.flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM) {
> /* We rely on rtnl_lock()ing to make sure netdev_rx_queue
> * configuration doesn't change while we're initializing
> @@ -325,7 +324,7 @@ static void page_pool_uninit(struct page_pool *pool)
> ptr_ring_cleanup(&pool->ring, NULL);
>
> if (pool->dma_map)
> - put_device(pool->p.dev);
> + xa_destroy(&pool->dma_mapped);
>
> #ifdef CONFIG_PAGE_POOL_STATS
> if (!pool->system)
> @@ -470,9 +469,11 @@ page_pool_dma_sync_for_device(const struct page_pool *pool,
> __page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
> }
>
> -static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
> +static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem, gfp_t gfp)
> {
> dma_addr_t dma;
> + int err;
> + u32 id;
>
> /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> * since dma_addr_t can be either 32 or 64 bits and does not always fit
> @@ -486,9 +487,19 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
> if (dma_mapping_error(pool->p.dev, dma))
> return false;
>
> + if (in_softirq())
> + err = xa_alloc(&pool->dma_mapped, &id, netmem_to_page(netmem),
> + XA_LIMIT(1, UINT_MAX), gfp);
> + else
> + err = xa_alloc_bh(&pool->dma_mapped, &id, netmem_to_page(netmem),
> + XA_LIMIT(1, UINT_MAX), gfp);
nit: I think xa_limit_32b works here, because XA_FLAGS_ALLOC1 skips 0 anyway.
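I.e. something like this (untested, just to illustrate the nit):

	if (in_softirq())
		err = xa_alloc(&pool->dma_mapped, &id, netmem_to_page(netmem),
			       xa_limit_32b, gfp);
	else
		err = xa_alloc_bh(&pool->dma_mapped, &id, netmem_to_page(netmem),
				  xa_limit_32b, gfp);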
--
Thanks,
Mina
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-08 19:22 ` Mina Almasry
@ 2025-03-09 12:42 ` Toke Høiland-Jørgensen
2025-03-11 13:25 ` Pavel Begunkov
2025-03-10 7:17 ` Pavel Begunkov
1 sibling, 1 reply; 32+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-09 12:42 UTC (permalink / raw)
To: Mina Almasry, Pavel Begunkov, David Wei
Cc: Andrew Morton, Jesper Dangaard Brouer, Ilias Apalodimas,
David S. Miller, Yunsheng Lin, Yonglong Liu, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, linux-mm, netdev
Mina Almasry <almasrymina@google.com> writes:
> On Sat, Mar 8, 2025 at 6:55 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> When enabling DMA mapping in page_pool, pages are kept DMA mapped until
>> they are released from the pool, to avoid the overhead of re-mapping the
>> pages every time they are used. This causes problems when a device is
>> torn down, because the page pool can't unmap the pages until they are
>> returned to the pool. This causes resource leaks and/or crashes when
>> there are pages still outstanding while the device is torn down, because
>> page_pool will attempt an unmap of a non-existent DMA device on the
>> subsequent page return.
>>
>> To fix this, implement a simple tracking of outstanding dma-mapped pages
>> in page pool using an xarray. This was first suggested by Mina[0], and
>> turns out to be fairly straight forward: We simply store pointers to
>> pages directly in the xarray with xa_alloc() when they are first DMA
>> mapped, and remove them from the array on unmap. Then, when a page pool
>> is torn down, it can simply walk the xarray and unmap all pages still
>> present there before returning, which also allows us to get rid of the
>> get/put_device() calls in page_pool.
>
> THANK YOU!! I had been looking at the other proposals to fix this here
> and there and I had similar feelings to you. They add lots of code
> changes and the code changes themselves were hard for me to
> understand. I hope we can make this simpler approach work.
You're welcome :)
And yeah, me too!
>> Using xa_cmpxchg(), no additional
>> synchronisation is needed, as a page will only ever be unmapped once.
>>
>
> Very clever. I had been wondering how to handle the concurrency. I
> also think this works.
Thanks!
>> To avoid having to walk the entire xarray on unmap to find the page
>> reference, we stash the ID assigned by xa_alloc() into the page
>> structure itself, in the field previously called '_pp_mapping_pad' in
>> the page_pool struct inside struct page. This field overlaps with the
>> page->mapping pointer, which may turn out to be problematic, so an
>> alternative is probably needed. Sticking the ID into some of the upper
>> bits of page->pp_magic may work as an alternative, but that requires
>> further investigation. Using the 'mapping' field works well enough as
>> a demonstration for this RFC, though.
>>
>
> I'm unsure about this. I think page->mapping may be used when we map
> the page to the userspace in TCP zerocopy, but I'm really not sure.
> Yes, finding somewhere else to put the id would be ideal. Do we really
> need a full unsigned long for the pp_magic?
No, pp_magic was also my backup plan (see the other thread). Tried
actually doing that now; there's a bit of complication due to the
varying definitions of POISON_POINTER_DELTA across architectures, but it
seems that this can be handled at compile time. I'll send a v2
RFC with this change.
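Roughly the shape I have in mind (sketch only; the shift and width
constants below are made up for illustration, v2 derives the actual
values from PP_SIGNATURE and the per-architecture poison values):

/* Stash the xarray ID in the upper bits of page->pp_magic, above the
 * PP_SIGNATURE/poison bits. 16 bits here is just an example width;
 * v2 uses as many bits as the architecture leaves free.
 */
#define PP_DMA_INDEX_SHIFT	16
#define PP_DMA_INDEX_BITS	16
#define PP_DMA_INDEX_MASK	GENMASK(PP_DMA_INDEX_SHIFT + PP_DMA_INDEX_BITS - 1, PP_DMA_INDEX_SHIFT)

static void page_pool_set_dma_index(struct page *page, unsigned long id)
{
	page->pp_magic &= ~PP_DMA_INDEX_MASK;
	page->pp_magic |= id << PP_DMA_INDEX_SHIFT;
}

static unsigned long page_pool_get_dma_index(struct page *page)
{
	return (page->pp_magic & PP_DMA_INDEX_MASK) >> PP_DMA_INDEX_SHIFT;
}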
-Toke
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-08 14:54 [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool Toke Høiland-Jørgensen
2025-03-08 19:22 ` Mina Almasry
@ 2025-03-09 13:27 ` Yunsheng Lin
2025-03-10 9:13 ` Toke Høiland-Jørgensen
1 sibling, 1 reply; 32+ messages in thread
From: Yunsheng Lin @ 2025-03-09 13:27 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, Andrew Morton,
Jesper Dangaard Brouer, Ilias Apalodimas, David S. Miller
Cc: Yunsheng Lin, Yonglong Liu, Mina Almasry, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, linux-mm, netdev
On 3/8/2025 10:54 PM, Toke Høiland-Jørgensen wrote:
> When enabling DMA mapping in page_pool, pages are kept DMA mapped until
> they are released from the pool, to avoid the overhead of re-mapping the
> pages every time they are used. This causes problems when a device is
> torn down, because the page pool can't unmap the pages until they are
> returned to the pool. This causes resource leaks and/or crashes when
> there are pages still outstanding while the device is torn down, because
> page_pool will attempt an unmap of a non-existent DMA device on the
> subsequent page return.
>
> To fix this, implement a simple tracking of outstanding dma-mapped pages
> in page pool using an xarray. This was first suggested by Mina[0], and
> turns out to be fairly straight forward: We simply store pointers to
> pages directly in the xarray with xa_alloc() when they are first DMA
> mapped, and remove them from the array on unmap. Then, when a page pool
> is torn down, it can simply walk the xarray and unmap all pages still
> present there before returning, which also allows us to get rid of the
> get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional
> synchronisation is needed, as a page will only ever be unmapped once.
The implementation of xa_cmpxchg() seems to take the xa_lock, which
seems to be a per-Xarray spin_lock.
Yes, if we were to take a per-Xarray lock unconditionally, additional
synchronisation like RCU doesn't seem to be needed. But isn't that an
excessive overhead for normal packet processing, when page_pool_destroy()
has not been called yet?
Also, we might need a similar locking or synchronisation for the dma
sync API in order to skip the dma sync API when page_pool_destroy() is
called too.
>
> To avoid having to walk the entire xarray on unmap to find the page
> reference, we stash the ID assigned by xa_alloc() into the page
> structure itself, in the field previously called '_pp_mapping_pad' in
> the page_pool struct inside struct page. This field overlaps with the
> page->mapping pointer, which may turn out to be problematic, so an
> alternative is probably needed. Sticking the ID into some of the upper
> bits of page->pp_magic may work as an alternative, but that requires
> further investigation. Using the 'mapping' field works well enough as
> a demonstration for this RFC, though.
page->pp_magic seems to have only 16 bits of space available at most when
trying to reuse some unused bits in it, as BPF_PTR_POISON and
STACK_DEPOT_POISON seem to already use 16 bits, so:
1. For 32-bit systems, it seems there are only 16 bits left even if
ILLEGAL_POINTER_VALUE is defined as 0x0.
2. For 64-bit systems, it seems there are only 12 bits left for powerpc,
as ILLEGAL_POINTER_VALUE is defined as 0x5deadbeef0000000, see
arch/powerpc/Kconfig.
So it seems we might need to limit the dma mapping count to 4096 or
65536?
And to be honest, I am not sure whether those 'unused' 12/16 bits can
really be reused or not; I guess we might need suggestions from mm and
arch experts here.
>
> Since all the tracking is performed on DMA map/unmap, no additional code
> is needed in the fast path, meaning the performance overhead of this
> tracking is negligible. The extra memory needed to track the pages is
> neatly encapsulated inside xarray, which uses the 'struct xa_node'
> structure to track items. This structure is 576 bytes long, with slots
> for 64 items, meaning that a full node occurs only 9 bytes of overhead
> per slot it tracks (in practice, it probably won't be this efficient,
> but in any case it should be an acceptable overhead).
Even if items are stored sequentially in an xa_node at first, is it possible
that there may be fragmentation in those xa_nodes when pages are released
and allocated many times during packet processing? If yes, is there any
fragmentation info about those xa_nodes?
>
> [0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
>
> Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code")
> Reported-by: Yonglong Liu <liuyonglong@huawei.com>
> Suggested-by: Mina Almasry <almasrymina@google.com>
> Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Tested-by: Jesper Dangaard Brouer <hawk@kernel.org>
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> ---
> This is an alternative to Yunsheng's series. Yunsheng requested I send
> this as an RFC to better be able to discuss the different approaches; see
> some initial discussion in[1], also regarding where to store the ID as
> alluded to above.
As mentioned before, I am not really convinced there is still any
space left in 'struct page' yet, otherwise we might already have used that
space to fix the DMA address > 32 bits problem on 32-bit systems, see
page_pool_set_dma_addr_netmem().
Also, using more space in 'struct page' for page_pool seems to
make page_pool more coupled to the mm subsystem, which does not seem to
align with the folios work that is trying to decouple non-mm subsystems
from the mm subsystem by avoiding other subsystems using more of 'struct
page' as metadata, from a long-term point of view.
>
> -Toke
>
> [1] https://lore.kernel.org/r/40b33879-509a-4c4a-873b-b5d3573b6e14@gmail.com
>
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-08 19:22 ` Mina Almasry
2025-03-09 12:42 ` Toke Høiland-Jørgensen
@ 2025-03-10 7:17 ` Pavel Begunkov
1 sibling, 0 replies; 32+ messages in thread
From: Pavel Begunkov @ 2025-03-10 7:17 UTC (permalink / raw)
To: Mina Almasry, Toke Høiland-Jørgensen, David Wei
Cc: Andrew Morton, Jesper Dangaard Brouer, Ilias Apalodimas,
David S. Miller, Yunsheng Lin, Yonglong Liu, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, linux-mm, netdev
On 3/8/25 19:22, Mina Almasry wrote:
> On Sat, Mar 8, 2025 at 6:55 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> When enabling DMA mapping in page_pool, pages are kept DMA mapped until
>> they are released from the pool, to avoid the overhead of re-mapping the
>> pages every time they are used. This causes problems when a device is
>> torn down, because the page pool can't unmap the pages until they are
>> returned to the pool. This causes resource leaks and/or crashes when
>> there are pages still outstanding while the device is torn down, because
>> page_pool will attempt an unmap of a non-existent DMA device on the
>> subsequent page return.
>>
>> To fix this, implement a simple tracking of outstanding dma-mapped pages
>> in page pool using an xarray. This was first suggested by Mina[0], and
>> turns out to be fairly straight forward: We simply store pointers to
>> pages directly in the xarray with xa_alloc() when they are first DMA
>> mapped, and remove them from the array on unmap. Then, when a page pool
>> is torn down, it can simply walk the xarray and unmap all pages still
>> present there before returning, which also allows us to get rid of the
>> get/put_device() calls in page_pool.
>
>> Using xa_cmpxchg(), no additional
>> synchronisation is needed, as a page will only ever be unmapped once.
>>
>> To avoid having to walk the entire xarray on unmap to find the page
>> reference, we stash the ID assigned by xa_alloc() into the page
>> structure itself, in the field previously called '_pp_mapping_pad' in
>> the page_pool struct inside struct page. This field overlaps with the
>> page->mapping pointer, which may turn out to be problematic, so an
>> alternative is probably needed. Sticking the ID into some of the upper
>> bits of page->pp_magic may work as an alternative, but that requires
>> further investigation. Using the 'mapping' field works well enough as
>> a demonstration for this RFC, though.
>>
>> Since all the tracking is performed on DMA map/unmap, no additional code
>> is needed in the fast path, meaning the performance overhead of this
>> tracking is negligible. The extra memory needed to track the pages is
>> neatly encapsulated inside xarray, which uses the 'struct xa_node'
>> structure to track items. This structure is 576 bytes long, with slots
>> for 64 items, meaning that a full node occurs only 9 bytes of overhead
>> per slot it tracks (in practice, it probably won't be this efficient,
>> but in any case it should be an acceptable overhead).
...
>
> Pavel, David, as an aside, I think we need to propagate this fix to
> memory providers as a follow up. We probably need a new op in the
> provider to unmap. Then, in page_pool_scrub, where this patch does an
> xa_for_each, we need to call that unmap op.
Sounds like it, which is the easy part since mps already hold the
full list of pages available. We just need to be careful unmapping
all netmems in the presence of multiple pools, but that should be fine.
--
Pavel Begunkov
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-09 13:27 ` Yunsheng Lin
@ 2025-03-10 9:13 ` Toke Høiland-Jørgensen
2025-03-10 12:35 ` Yunsheng Lin
2025-03-10 15:42 ` Matthew Wilcox
0 siblings, 2 replies; 32+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-10 9:13 UTC (permalink / raw)
To: Yunsheng Lin, Andrew Morton, Jesper Dangaard Brouer,
Ilias Apalodimas, David S. Miller
Cc: Yunsheng Lin, Yonglong Liu, Mina Almasry, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, linux-mm, netdev
Yunsheng Lin <yunshenglin0825@gmail.com> writes:
> On 3/8/2025 10:54 PM, Toke Høiland-Jørgensen wrote:
>> When enabling DMA mapping in page_pool, pages are kept DMA mapped until
>> they are released from the pool, to avoid the overhead of re-mapping the
>> pages every time they are used. This causes problems when a device is
>> torn down, because the page pool can't unmap the pages until they are
>> returned to the pool. This causes resource leaks and/or crashes when
>> there are pages still outstanding while the device is torn down, because
>> page_pool will attempt an unmap of a non-existent DMA device on the
>> subsequent page return.
>>
>> To fix this, implement a simple tracking of outstanding dma-mapped pages
>> in page pool using an xarray. This was first suggested by Mina[0], and
>> turns out to be fairly straight forward: We simply store pointers to
>> pages directly in the xarray with xa_alloc() when they are first DMA
>> mapped, and remove them from the array on unmap. Then, when a page pool
>> is torn down, it can simply walk the xarray and unmap all pages still
>> present there before returning, which also allows us to get rid of the
>> get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional
>> synchronisation is needed, as a page will only ever be unmapped once.
>
> The implementation of xa_cmpxchg() seems to take the xa_lock, which
> seems to be a per-Xarray spin_lock.
> Yes, if if we were to take a per-Xarray lock unconditionaly, additional
> synchronisation like rcu doesn't seems to be needed. But it seems an
> excessive overhead for normal packet processing when page_pool_destroy()
> is not called yet?
I dunno, maybe? It's only done on DMA map/unmap, so it may be
acceptable? We should get some benchmark results to assess :)
> Also, we might need a similar locking or synchronisation for the dma
> sync API in order to skip the dma sync API when page_pool_destroy() is
> called too.
Good point, but that seems a separate issue? And simpler to solve (just
set pool->dma_sync to false when destroying?).
>> To avoid having to walk the entire xarray on unmap to find the page
>> reference, we stash the ID assigned by xa_alloc() into the page
>> structure itself, in the field previously called '_pp_mapping_pad' in
>> the page_pool struct inside struct page. This field overlaps with the
>> page->mapping pointer, which may turn out to be problematic, so an
>> alternative is probably needed. Sticking the ID into some of the upper
>> bits of page->pp_magic may work as an alternative, but that requires
>> further investigation. Using the 'mapping' field works well enough as
>> a demonstration for this RFC, though.
> page->pp_magic seems to only have 16 bits space available at most when
> trying to reuse some unused bits in page->pp_magic, as BPF_PTR_POISON
> and STACK_DEPOT_POISON seems to already use 16 bits, so:
> 1. For 32 bits system, it seems there is only 16 bits left even if the
> ILLEGAL_POINTER_VALUE is defined as 0x0.
> 2. For 64 bits system, it seems there is only 12 bits left for powerpc
> as ILLEGAL_POINTER_VALUE is defined as 0x5deadbeef0000000, see
> arch/powerpc/Kconfig.
>
> So it seems we might need to limit the dma mapping count to 4096 or
> 65536?
>
> And to be honest, I am not sure if those 'unused' 12/16 bits can really
> be reused or not, I guess we might need suggestion from mm and arch
> experts here.
Why do we need to care about BPF_PTR_POISON and STACK_DEPOT_POISON?
AFAICT, we only need to make sure we preserve the PP_SIGNATURE value.
See v2 of the RFC patch, the bit arithmetic there gives me:
- 24 bits on 32-bit architectures
- 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
- 32 bits on other 64-bit architectures
Which seems to be plenty?
>> Since all the tracking is performed on DMA map/unmap, no additional code
>> is needed in the fast path, meaning the performance overhead of this
>> tracking is negligible. The extra memory needed to track the pages is
>> neatly encapsulated inside xarray, which uses the 'struct xa_node'
>> structure to track items. This structure is 576 bytes long, with slots
>> for 64 items, meaning that a full node occurs only 9 bytes of overhead
>> per slot it tracks (in practice, it probably won't be this efficient,
>> but in any case it should be an acceptable overhead).
>
> Even if items is stored sequentially in xa_node at first, is it possible
> that there may be fragmentation in those xa_node when pages are released
> and allocated many times during packet processing? If yes, is there any
> fragmentation info about those xa_node?
Some (that's what I meant by "not as efficient"). AFAICT, the xarray
does some rebalancing of the underlying radix tree, freeing nodes when
they are no longer used. However, I am not too familiar with the xarray
code, so I don't know exactly how efficient this is in practice.
>> [0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
>>
>> Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code")
>> Reported-by: Yonglong Liu <liuyonglong@huawei.com>
>> Suggested-by: Mina Almasry <almasrymina@google.com>
>> Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> Tested-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>> ---
>> This is an alternative to Yunsheng's series. Yunsheng requested I send
>> this as an RFC to better be able to discuss the different approaches; see
>> some initial discussion in[1], also regarding where to store the ID as
>> alluded to above.
>
> As mentioned before, I am not really convinced there is still any
> space left in 'struct page' yet, otherwise we might already use that
> space to fix the DMA address > 32 bits problem in 32 bits system, see
> page_pool_set_dma_addr_netmem().
See above.
> Also, Using the more space in 'struct page' for the page_pool seems to
> make page_pool more coupled to the mm subsystem, which seems to not
> align with the folios work that is trying to decouple non-mm subsystem
> from the mm subsystem by avoid other subsystem using more of the 'struct
> page' as metadata from the long term point of view.
This seems a bit theoretical; any future changes of struct page would
have to shuffle things around so we still have the ID available,
obviously :)
-Toke
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-10 9:13 ` Toke Høiland-Jørgensen
@ 2025-03-10 12:35 ` Yunsheng Lin
2025-03-10 15:24 ` Toke Høiland-Jørgensen
2025-03-10 15:42 ` Matthew Wilcox
1 sibling, 1 reply; 32+ messages in thread
From: Yunsheng Lin @ 2025-03-10 12:35 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, Yunsheng Lin, Andrew Morton,
Jesper Dangaard Brouer, Ilias Apalodimas, David S. Miller
Cc: Yonglong Liu, Mina Almasry, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, linux-mm, netdev
On 2025/3/10 17:13, Toke Høiland-Jørgensen wrote:
...
>
>> Also, we might need a similar locking or synchronisation for the dma
>> sync API in order to skip the dma sync API when page_pool_destroy() is
>> called too.
>
> Good point, but that seems a separate issue? And simpler to solve (just
If I understand the comment from DMA experts correctly, the dma_sync API
should not be allowed to be called after the dma_unmap API.
> set pool->dma_sync to false when destroying?).
Without locking or synchronisation, there is a small window between
checking pool->dma_sync and calling the dma sync API, during which the
driver might already have been unbound.
>
>>> To avoid having to walk the entire xarray on unmap to find the page
>>> reference, we stash the ID assigned by xa_alloc() into the page
>>> structure itself, in the field previously called '_pp_mapping_pad' in
>>> the page_pool struct inside struct page. This field overlaps with the
>>> page->mapping pointer, which may turn out to be problematic, so an
>>> alternative is probably needed. Sticking the ID into some of the upper
>>> bits of page->pp_magic may work as an alternative, but that requires
>>> further investigation. Using the 'mapping' field works well enough as
>>> a demonstration for this RFC, though.
>> page->pp_magic seems to only have 16 bits space available at most when
>> trying to reuse some unused bits in page->pp_magic, as BPF_PTR_POISON
>> and STACK_DEPOT_POISON seems to already use 16 bits, so:
>> 1. For 32 bits system, it seems there is only 16 bits left even if the
>> ILLEGAL_POINTER_VALUE is defined as 0x0.
>> 2. For 64 bits system, it seems there is only 12 bits left for powerpc
>> as ILLEGAL_POINTER_VALUE is defined as 0x5deadbeef0000000, see
>> arch/powerpc/Kconfig.
>>
>> So it seems we might need to limit the dma mapping count to 4096 or
>> 65536?
>>
>> And to be honest, I am not sure if those 'unused' 12/16 bits can really
>> be reused or not, I guess we might need suggestion from mm and arch
>> experts here.
>
> Why do we need to care about BPF_PTR_POISON and STACK_DEPOT_POISON?
> AFAICT, we only need to make sure we preserve the PP_SIGNATURE value.
> See v2 of the RFC patch, the bit arithmetic there gives me:
>
> - 24 bits on 32-bit architectures
> - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
> - 32 bits on other 64-bit architectures
>
> Which seems to be plenty?
I am really doubtful it is that simple, but we can always hear from the
experts if it isn't :)
>
>>> Since all the tracking is performed on DMA map/unmap, no additional code
>>> is needed in the fast path, meaning the performance overhead of this
>>> tracking is negligible. The extra memory needed to track the pages is
>>> neatly encapsulated inside xarray, which uses the 'struct xa_node'
>>> structure to track items. This structure is 576 bytes long, with slots
>>> for 64 items, meaning that a full node occurs only 9 bytes of overhead
>>> per slot it tracks (in practice, it probably won't be this efficient,
>>> but in any case it should be an acceptable overhead).
>>
>> Even if items is stored sequentially in xa_node at first, is it possible
>> that there may be fragmentation in those xa_node when pages are released
>> and allocated many times during packet processing? If yes, is there any
>> fragmentation info about those xa_node?
>
> Some (that's what I mean with "not as efficient"). AFAICT, xa_array does
> do some rebalancing of the underlying radix tree, freeing nodes when
> they are no longer used. However, I am not too familiar with the xarray
> code, so I don't know exactly how efficient this is in practice.
I guess that is one of the disadvantages of using an advanced struct like
xarray :(
>
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-10 12:35 ` Yunsheng Lin
@ 2025-03-10 15:24 ` Toke Høiland-Jørgensen
2025-03-11 12:19 ` Yunsheng Lin
0 siblings, 1 reply; 32+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-10 15:24 UTC (permalink / raw)
To: Yunsheng Lin, Yunsheng Lin, Andrew Morton,
Jesper Dangaard Brouer, Ilias Apalodimas, David S. Miller
Cc: Yonglong Liu, Mina Almasry, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, linux-mm, netdev
Yunsheng Lin <linyunsheng@huawei.com> writes:
> On 2025/3/10 17:13, Toke Høiland-Jørgensen wrote:
>
> ...
>
>>
>>> Also, we might need a similar locking or synchronisation for the dma
>>> sync API in order to skip the dma sync API when page_pool_destroy() is
>>> called too.
>>
>> Good point, but that seems a separate issue? And simpler to solve (just
>
> If I understand the comment from DMA experts correctly, the dma_sync API
> should not be allowed to be called after the dma_unmap API.
Ah, right, I see what you mean; will add a check for this.
>> set pool->dma_sync to false when destroying?).
>
> Without locking or synchronisation, there is some small window between
> pool->dma_sync checking and dma sync API calling, during which the driver
> might have already unbound.
>
>>
>>>> To avoid having to walk the entire xarray on unmap to find the page
>>>> reference, we stash the ID assigned by xa_alloc() into the page
>>>> structure itself, in the field previously called '_pp_mapping_pad' in
>>>> the page_pool struct inside struct page. This field overlaps with the
>>>> page->mapping pointer, which may turn out to be problematic, so an
>>>> alternative is probably needed. Sticking the ID into some of the upper
>>>> bits of page->pp_magic may work as an alternative, but that requires
>>>> further investigation. Using the 'mapping' field works well enough as
>>>> a demonstration for this RFC, though.
>>> page->pp_magic seems to only have 16 bits space available at most when
>>> trying to reuse some unused bits in page->pp_magic, as BPF_PTR_POISON
>>> and STACK_DEPOT_POISON seems to already use 16 bits, so:
>>> 1. For 32 bits system, it seems there is only 16 bits left even if the
>>> ILLEGAL_POINTER_VALUE is defined as 0x0.
>>> 2. For 64 bits system, it seems there is only 12 bits left for powerpc
>>> as ILLEGAL_POINTER_VALUE is defined as 0x5deadbeef0000000, see
>>> arch/powerpc/Kconfig.
>>>
>>> So it seems we might need to limit the dma mapping count to 4096 or
>>> 65536?
>>>
>>> And to be honest, I am not sure if those 'unused' 12/16 bits can really
>>> be reused or not, I guess we might need suggestion from mm and arch
>>> experts here.
>>
>> Why do we need to care about BPF_PTR_POISON and STACK_DEPOT_POISON?
>> AFAICT, we only need to make sure we preserve the PP_SIGNATURE value.
>> See v2 of the RFC patch, the bit arithmetic there gives me:
>>
>> - 24 bits on 32-bit architectures
>> - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
>> - 32 bits on other 64-bit architectures
>>
>> Which seems to be plenty?
>
> I am really doubtful it is that simple, but we always can hear from the
> experts if it isn't:)
Do you have any specific reason to think so? :)
>>>> Since all the tracking is performed on DMA map/unmap, no additional code
>>>> is needed in the fast path, meaning the performance overhead of this
>>>> tracking is negligible. The extra memory needed to track the pages is
>>>> neatly encapsulated inside xarray, which uses the 'struct xa_node'
>>>> structure to track items. This structure is 576 bytes long, with slots
>>>> for 64 items, meaning that a full node occurs only 9 bytes of overhead
>>>> per slot it tracks (in practice, it probably won't be this efficient,
>>>> but in any case it should be an acceptable overhead).
>>>
>>> Even if items is stored sequentially in xa_node at first, is it possible
>>> that there may be fragmentation in those xa_node when pages are released
>>> and allocated many times during packet processing? If yes, is there any
>>> fragmentation info about those xa_node?
>>
>> Some (that's what I mean with "not as efficient"). AFAICT, xa_array does
>> do some rebalancing of the underlying radix tree, freeing nodes when
>> they are no longer used. However, I am not too familiar with the xarray
>> code, so I don't know exactly how efficient this is in practice.
>
> I guess that is one of the disadvantages that an advanced struct like
> Xarray is used:(
Sure, there will be some overhead from using xarray, but I think the
simplicity makes up for it; especially since we can limit this to the
cases where it's absolutely needed.
-Toke
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-10 9:13 ` Toke Høiland-Jørgensen
2025-03-10 12:35 ` Yunsheng Lin
@ 2025-03-10 15:42 ` Matthew Wilcox
2025-03-10 17:26 ` Toke Høiland-Jørgensen
2025-03-11 12:25 ` Yunsheng Lin
1 sibling, 2 replies; 32+ messages in thread
From: Matthew Wilcox @ 2025-03-10 15:42 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: Yunsheng Lin, Andrew Morton, Jesper Dangaard Brouer,
Ilias Apalodimas, David S. Miller, Yunsheng Lin, Yonglong Liu,
Mina Almasry, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, linux-mm, netdev
On Mon, Mar 10, 2025 at 10:13:32AM +0100, Toke Høiland-Jørgensen wrote:
> Yunsheng Lin <yunshenglin0825@gmail.com> writes:
> > Also, Using the more space in 'struct page' for the page_pool seems to
> > make page_pool more coupled to the mm subsystem, which seems to not
> > align with the folios work that is trying to decouple non-mm subsystem
> > from the mm subsystem by avoid other subsystem using more of the 'struct
> > page' as metadata from the long term point of view.
>
> This seems a bit theoretical; any future changes of struct page would
> have to shuffle things around so we still have the ID available,
> obviously :)
See https://kernelnewbies.org/MatthewWilcox/Memdescs
and more immediately
https://kernelnewbies.org/MatthewWilcox/Memdescs/Path
pagepool is going to be renamed "bump" because it's a bump allocator and
"pagepool" is a nonsense name. I haven't looked into it in a lot of
detail yet, but in the not-too-distant future, struct page will look
like this (from your point of view):
struct page {
unsigned long flags;
unsigned long memdesc;
int _refcount; // 0 for bump
union {
unsigned long private;
atomic_t _mapcount; // maybe used by bump? not sure
};
};
'memdesc' will be a pointer to struct bump with the bottom four bits of
that pointer indicating that it's a struct bump pointer (and not, say, a
folio or a slab).
So if you allocate a multi-page bump, you'll get N of these pages,
and they'll all point to the same struct bump where you'll maintain
your actual refcount. And you'll be able to grow struct bump to your
heart's content. I don't know exactly what struct bump looks like,
but the core mm will have no requirements on you.
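To make the tagging concrete, decoding might look something like this
(the tag value and helper name are made up for illustration; the real
encoding is still being worked out):

/* Hypothetical sketch: bottom four bits of page->memdesc say what kind
 * of memdesc this is; the remaining bits are the pointer itself.
 */
#define MEMDESC_TYPE_MASK	0xfUL
#define MEMDESC_TYPE_BUMP	0x3UL	/* assumed value */

static inline struct bump *page_bump(const struct page *page)
{
	unsigned long md = page->memdesc;

	if ((md & MEMDESC_TYPE_MASK) != MEMDESC_TYPE_BUMP)
		return NULL;

	return (struct bump *)(md & ~MEMDESC_TYPE_MASK);
}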
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-10 15:42 ` Matthew Wilcox
@ 2025-03-10 17:26 ` Toke Høiland-Jørgensen
2025-03-11 15:03 ` Matthew Wilcox
2025-03-11 12:25 ` Yunsheng Lin
1 sibling, 1 reply; 32+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-10 17:26 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Yunsheng Lin, Andrew Morton, Jesper Dangaard Brouer,
Ilias Apalodimas, David S. Miller, Yunsheng Lin, Yonglong Liu,
Mina Almasry, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, linux-mm, netdev
Matthew Wilcox <willy@infradead.org> writes:
> On Mon, Mar 10, 2025 at 10:13:32AM +0100, Toke Høiland-Jørgensen wrote:
>> Yunsheng Lin <yunshenglin0825@gmail.com> writes:
>> > Also, Using the more space in 'struct page' for the page_pool seems to
>> > make page_pool more coupled to the mm subsystem, which seems to not
>> > align with the folios work that is trying to decouple non-mm subsystem
>> > from the mm subsystem by avoid other subsystem using more of the 'struct
>> > page' as metadata from the long term point of view.
>>
>> This seems a bit theoretical; any future changes of struct page would
>> have to shuffle things around so we still have the ID available,
>> obviously :)
>
> See https://kernelnewbies.org/MatthewWilcox/Memdescs
> and more immediately
> https://kernelnewbies.org/MatthewWilcox/Memdescs/Path
>
> pagepool is going to be renamed "bump" because it's a bump allocator and
> "pagepool" is a nonsense name. I haven't looked into it in a lot of
> detail yet, but in the not-too-distant future, struct page will look
> like this (from your point of view):
>
> struct page {
> unsigned long flags;
> unsigned long memdesc;
> int _refcount; // 0 for bump
> union {
> unsigned long private;
> atomic_t _mapcount; // maybe used by bump? not sure
> };
> };
>
> 'memdesc' will be a pointer to struct bump with the bottom four bits of
> that pointer indicating that it's a struct bump pointer (and not, say, a
> folio or a slab).
>
> So if you allocate a multi-page bump, you'll get N of these pages,
> and they'll all point to the same struct bump where you'll maintain
> your actual refcount. And you'll be able to grow struct bump to your
> heart's content. I don't know exactly what struct bump looks like,
> but the core mm will have no requirements on you.
Ah, excellent, thanks for the pointer!
Out of curiosity, why "bump"? Is that a term of art somewhere?
And in the meantime (until those patches land), do you see any reason
why we can't squat on the middle bits of page->pp_magic (AKA page->lru)
like I'm doing in v2[0] of this patch?
-Toke
[0] https://lore.kernel.org/r/20250309124719.21285-1-toke@redhat.com
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-10 15:24 ` Toke Høiland-Jørgensen
@ 2025-03-11 12:19 ` Yunsheng Lin
2025-03-11 13:26 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 32+ messages in thread
From: Yunsheng Lin @ 2025-03-11 12:19 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, Yunsheng Lin, Andrew Morton,
Jesper Dangaard Brouer, Ilias Apalodimas, David S. Miller
Cc: Yonglong Liu, Mina Almasry, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, linux-mm, netdev
On 2025/3/10 23:24, Toke Høiland-Jørgensen wrote:
>>
>> I guess that is one of the disadvantages that an advanced struct like
>> Xarray is used:(
>
> Sure, there will be some overhead from using xarray, but I think the
> simplicity makes up for it; especially since we can limit this to the
As I understand it, it is more complicated; it is just that the
complexity is hidden behind the xarray now.
Even if there is no space in 'struct page' to store the id, the
'struct page' pointer itself could be used as the id, if the xarray can
use a pointer as an id (see the sketch below). But that might mean the
memory utilization is not as efficient as it should be, and performance
hurts too if there is more memory to be allocated and freed.
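For example, something like the below (a rough sketch only; error
handling and the _bh variants are omitted):

/* Sketch: key the xarray on the page address itself instead of an
 * allocated ID, so nothing needs to be stored in 'struct page'.
 */
static bool page_pool_dma_track(struct page_pool *pool, struct page *page)
{
	return !xa_err(xa_store(&pool->dma_mapped, (unsigned long)page,
				page, GFP_ATOMIC));
}

static void page_pool_dma_untrack(struct page_pool *pool, struct page *page)
{
	xa_erase(&pool->dma_mapped, (unsigned long)page);
}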
It seems it is just a matter of choices between using tailor-made
page_pool specific optimization and using some generic advanced
struct like xarray.
I chose the tailor-made one because it ensures the least overhead
possible from a performance and memory utilization perspective; for
example, the 'single producer, multiple consumer' guarantee offered by
the NAPI context can avoid some locking and atomic operations.
> cases where it's absolutely needed.
The above can also be done when using page_pool_item, as the lower 2 bits
can be used to indicate whether the pointer in 'struct page' is a
'page_pool_item' or a 'page_pool'; I just don't think it is
necessary yet, as it might add more checking in the fast path.
>
> -Toke
>
>
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-10 15:42 ` Matthew Wilcox
2025-03-10 17:26 ` Toke Høiland-Jørgensen
@ 2025-03-11 12:25 ` Yunsheng Lin
2025-03-11 15:11 ` Matthew Wilcox
1 sibling, 1 reply; 32+ messages in thread
From: Yunsheng Lin @ 2025-03-11 12:25 UTC (permalink / raw)
To: Matthew Wilcox, Toke Høiland-Jørgensen
Cc: Yunsheng Lin, Andrew Morton, Jesper Dangaard Brouer,
Ilias Apalodimas, David S. Miller, Yonglong Liu, Mina Almasry,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
linux-mm, netdev
On 2025/3/10 23:42, Matthew Wilcox wrote:
> On Mon, Mar 10, 2025 at 10:13:32AM +0100, Toke Høiland-Jørgensen wrote:
>> Yunsheng Lin <yunshenglin0825@gmail.com> writes:
>>> Also, Using the more space in 'struct page' for the page_pool seems to
>>> make page_pool more coupled to the mm subsystem, which seems to not
>>> align with the folios work that is trying to decouple non-mm subsystem
>>> from the mm subsystem by avoid other subsystem using more of the 'struct
>>> page' as metadata from the long term point of view.
>>
>> This seems a bit theoretical; any future changes of struct page would
>> have to shuffle things around so we still have the ID available,
>> obviously :)
>
> See https://kernelnewbies.org/MatthewWilcox/Memdescs
> and more immediately
> https://kernelnewbies.org/MatthewWilcox/Memdescs/Path
>
> pagepool is going to be renamed "bump" because it's a bump allocator and
> "pagepool" is a nonsense name. I haven't looked into it in a lot of
> detail yet, but in the not-too-distant future, struct page will look
> like this (from your point of view):
>
> struct page {
> unsigned long flags;
> unsigned long memdesc;
It seems there may be memory behind the above 'memdesc', with a
different size and layout for different subsystems?
I am not sure I understand the case where the same page might be handled
in two subsystems concurrently, or where a page is allocated in one
subsystem and then passed to be handled in another subsystem. For
example: a page_pool-owned page is mmap'ed into user space through TCP
zero-copy, see tcp_zerocopy_vm_insert_batch(); it seems the same page is
handled in both the networking/page_pool and vm subsystems?
And page->mapping seems to have been moved into 'memdesc', as there is
no 'mapping' field in the 'struct page' you list here? Do we need a
similar 'mapping'-like field in the 'memdesc' for the page_pool
subsystem to support TCP zero-copy?
> int _refcount; // 0 for bump
> union {
> unsigned long private;
> atomic_t _mapcount; // maybe used by bump? not sure
> };
> };
>
> 'memdesc' will be a pointer to struct bump with the bottom four bits of
> that pointer indicating that it's a struct bump pointer (and not, say, a
> folio or a slab).
The above seems similar to what I was doing; the difference seems to be
that the memory behind the above pointer is managed by page_pool itself,
instead of the mm subsystem allocating the 'memdesc' memory from a slab
cache?
>
> So if you allocate a multi-page bump, you'll get N of these pages,
> and they'll all point to the same struct bump where you'll maintain
> your actual refcount. And you'll be able to grow struct bump to your
> heart's content. I don't know exactly what struct bump looks like,
> but the core mm will have no requirements on you.
>
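As a rough illustration of the tag-bits scheme described above: the low
bits of page->memdesc would say what kind of descriptor the rest of the
word points to. The enum values and helper names below are assumptions
for illustration only, not the actual memdesc design:

#define MEMDESC_TYPE_MASK	0xfUL	/* bottom four bits carry the descriptor type */

enum memdesc_type {			/* hypothetical values */
	MEMDESC_FOLIO	= 1,
	MEMDESC_SLAB	= 2,
	MEMDESC_BUMP	= 3,		/* what a page_pool/"bump" page would use */
};

static inline enum memdesc_type memdesc_type(unsigned long memdesc)
{
	return memdesc & MEMDESC_TYPE_MASK;
}

static inline struct bump *memdesc_to_bump(unsigned long memdesc)
{
	/* only valid when memdesc_type() == MEMDESC_BUMP */
	return (struct bump *)(memdesc & ~MEMDESC_TYPE_MASK);
}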
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-09 12:42 ` Toke Høiland-Jørgensen
@ 2025-03-11 13:25 ` Pavel Begunkov
2025-03-11 13:44 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 32+ messages in thread
From: Pavel Begunkov @ 2025-03-11 13:25 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, Mina Almasry, David Wei
Cc: Andrew Morton, Jesper Dangaard Brouer, Ilias Apalodimas,
David S. Miller, Yunsheng Lin, Yonglong Liu, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, linux-mm, netdev
On 3/9/25 12:42, Toke Høiland-Jørgensen wrote:
> Mina Almasry <almasrymina@google.com> writes:
>
>> On Sat, Mar 8, 2025 at 6:55 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>
>>> When enabling DMA mapping in page_pool, pages are kept DMA mapped until
>>> they are released from the pool, to avoid the overhead of re-mapping the
>>> pages every time they are used. This causes problems when a device is
>>> torn down, because the page pool can't unmap the pages until they are
>>> returned to the pool. This causes resource leaks and/or crashes when
>>> there are pages still outstanding while the device is torn down, because
>>> page_pool will attempt an unmap of a non-existent DMA device on the
>>> subsequent page return.
>>>
>>> To fix this, implement a simple tracking of outstanding dma-mapped pages
>>> in page pool using an xarray. This was first suggested by Mina[0], and
>>> turns out to be fairly straight forward: We simply store pointers to
>>> pages directly in the xarray with xa_alloc() when they are first DMA
>>> mapped, and remove them from the array on unmap. Then, when a page pool
>>> is torn down, it can simply walk the xarray and unmap all pages still
>>> present there before returning, which also allows us to get rid of the
>>> get/put_device() calls in page_pool.
>>
>> THANK YOU!! I had been looking at the other proposals to fix this here
>> and there and I had similar feelings to you. They add lots of code
>> changes and the code changes themselves were hard for me to
>> understand. I hope we can make this simpler approach work.
>
> You're welcome :)
> And yeah, me too!
>
>>> Using xa_cmpxchg(), no additional
>>> synchronisation is needed, as a page will only ever be unmapped once.
>>>
>>
>> Very clever. I had been wondering how to handle the concurrency. I
>> also think this works.
>
> Thanks!
>
>>> To avoid having to walk the entire xarray on unmap to find the page
>>> reference, we stash the ID assigned by xa_alloc() into the page
>>> structure itself, in the field previously called '_pp_mapping_pad' in
>>> the page_pool struct inside struct page. This field overlaps with the
>>> page->mapping pointer, which may turn out to be problematic, so an
>>> alternative is probably needed. Sticking the ID into some of the upper
>>> bits of page->pp_magic may work as an alternative, but that requires
>>> further investigation. Using the 'mapping' field works well enough as
>>> a demonstration for this RFC, though.
>>>
>>
>> I'm unsure about this. I think page->mapping may be used when we map
>> the page to the userspace in TCP zerocopy, but I'm really not sure.
>> Yes, finding somewhere else to put the id would be ideal. Do we really
>> need a full unsigned long for the pp_magic?
>
> No, pp_magic was also my backup plan (see the other thread). Tried
> actually doing that now, and while there's a bit of complication due to
> the varying definitions of POISON_POINTER_DELTA across architectures,
> but it seems that this can be defined at compile time. I'll send a v2
> RFC with this change.
FWIW, personally I like this one much more than an extra indirection
to pp.
If we're out of space in the page, why can't we use struct page *
as indices into the xarray? Ala
struct page *p = ...;
xa_store(xarray, index=(unsigned long)p, p);
Indices wouldn't be nicely packed, but it's still a map. Is there
a problem with that I didn't consider?
--
Pavel Begunkov
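A minimal sketch of that idea, assuming a 'dma_mapped' xarray embedded
in struct page_pool; the helper names are made up, and this illustrates
keying the xarray by the page pointer rather than the actual patch:

static int page_pool_track_dma(struct page_pool *pool, struct page *page)
{
	/* key the entry by the page pointer itself; no id has to be
	 * stashed in the page */
	void *old = xa_store(&pool->dma_mapped, (unsigned long)page, page,
			     GFP_ATOMIC);

	return xa_is_err(old) ? xa_err(old) : 0;
}

static void page_pool_untrack_dma(struct page_pool *pool, struct page *page)
{
	xa_erase(&pool->dma_mapped, (unsigned long)page);
}

The trade-off is that page addresses make very sparse indices, which is
presumably why Matthew calls this wasteful further down the thread.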
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-11 12:19 ` Yunsheng Lin
@ 2025-03-11 13:26 ` Toke Høiland-Jørgensen
2025-03-12 12:04 ` Yunsheng Lin
0 siblings, 1 reply; 32+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-11 13:26 UTC (permalink / raw)
To: Yunsheng Lin, Yunsheng Lin, Andrew Morton,
Jesper Dangaard Brouer, Ilias Apalodimas, David S. Miller
Cc: Yonglong Liu, Mina Almasry, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, linux-mm, netdev
Yunsheng Lin <linyunsheng@huawei.com> writes:
> On 2025/3/10 23:24, Toke Høiland-Jørgensen wrote:
>
>>>
>>> I guess that is one of the disadvantages that an advanced struct like
>>> Xarray is used:(
>>
>> Sure, there will be some overhead from using xarray, but I think the
>> simplicity makes up for it; especially since we can limit this to the
>
> As my understanding, it is more complicated, it is just that
> complexity is hidden before xarray now.
Yes, which encapsulates the complexity into a shared abstraction that is
widely used in the kernel, so it does not add new complexity to the
kernel as a whole. Whereas your series adds a whole bunch of new
complexity to the kernel in the form of a new slab allocator.
> Even if there is no space in 'struct page' to store the id, the
> 'struct page' pointer itself can be used as id if the xarray can
> use pointer as id. But it might mean the memory utilization might
> not be as efficient as it should be, and performance hurts too if
> there is more memory to be allocated and freed.
I don't think it can. But sure, there can be other ways around this.
FWIW, I don't think your idea of allocating page_pool_items to use as an
indirection is totally crazy, but all the complexity around it (the
custom slab allocator etc) is way too much. And if we can avoid the item
indirection that is obviously better.
> It seems it is just a matter of choices between using tailor-made
> page_pool specific optimization and using some generic advanced
> struct like xarray.
Yup, basically.
> I chose the tailor-made one because it ensure least overhead as
> much as possibe from performance and memory utilization perspective,
> for example, the 'single producer, multiple consumer' guarantee
> offered by NAPI context can avoid some lock and atomic operation.
Right, and my main point is that the complexity of this is not worth it :)
>> cases where it's absolutely needed.
>
> The above can also be done for using page_pool_item too as the
> lower 2 bits can be used to indicate the pointer in 'struct page'
> is 'page_pool_item' or 'page_pool', I just don't think it is
> necessary yet as it might add more checking in the fast path.
Yup, did think about using the lower bits to distinguish if it does turn
out that we can't avoid an indirection. See above; it's not actually the
page_pool_item concept that is my main issue with your series.
-Toke
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-11 13:25 ` Pavel Begunkov
@ 2025-03-11 13:44 ` Toke Høiland-Jørgensen
2025-03-11 13:56 ` Pavel Begunkov
2025-03-11 16:46 ` Matthew Wilcox
0 siblings, 2 replies; 32+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-11 13:44 UTC (permalink / raw)
To: Pavel Begunkov, Mina Almasry, David Wei
Cc: Andrew Morton, Jesper Dangaard Brouer, Ilias Apalodimas,
David S. Miller, Yunsheng Lin, Yonglong Liu, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, linux-mm, netdev
Pavel Begunkov <asml.silence@gmail.com> writes:
> On 3/9/25 12:42, Toke Høiland-Jørgensen wrote:
>> Mina Almasry <almasrymina@google.com> writes:
>>
>>> On Sat, Mar 8, 2025 at 6:55 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>>
>>>> When enabling DMA mapping in page_pool, pages are kept DMA mapped until
>>>> they are released from the pool, to avoid the overhead of re-mapping the
>>>> pages every time they are used. This causes problems when a device is
>>>> torn down, because the page pool can't unmap the pages until they are
>>>> returned to the pool. This causes resource leaks and/or crashes when
>>>> there are pages still outstanding while the device is torn down, because
>>>> page_pool will attempt an unmap of a non-existent DMA device on the
>>>> subsequent page return.
>>>>
>>>> To fix this, implement a simple tracking of outstanding dma-mapped pages
>>>> in page pool using an xarray. This was first suggested by Mina[0], and
>>>> turns out to be fairly straight forward: We simply store pointers to
>>>> pages directly in the xarray with xa_alloc() when they are first DMA
>>>> mapped, and remove them from the array on unmap. Then, when a page pool
>>>> is torn down, it can simply walk the xarray and unmap all pages still
>>>> present there before returning, which also allows us to get rid of the
>>>> get/put_device() calls in page_pool.
>>>
>>> THANK YOU!! I had been looking at the other proposals to fix this here
>>> and there and I had similar feelings to you. They add lots of code
>>> changes and the code changes themselves were hard for me to
>>> understand. I hope we can make this simpler approach work.
>>
>> You're welcome :)
>> And yeah, me too!
>>
>>>> Using xa_cmpxchg(), no additional
>>>> synchronisation is needed, as a page will only ever be unmapped once.
>>>>
>>>
>>> Very clever. I had been wondering how to handle the concurrency. I
>>> also think this works.
>>
>> Thanks!
>>
>>>> To avoid having to walk the entire xarray on unmap to find the page
>>>> reference, we stash the ID assigned by xa_alloc() into the page
>>>> structure itself, in the field previously called '_pp_mapping_pad' in
>>>> the page_pool struct inside struct page. This field overlaps with the
>>>> page->mapping pointer, which may turn out to be problematic, so an
>>>> alternative is probably needed. Sticking the ID into some of the upper
>>>> bits of page->pp_magic may work as an alternative, but that requires
>>>> further investigation. Using the 'mapping' field works well enough as
>>>> a demonstration for this RFC, though.
>>>>
>>>
>>> I'm unsure about this. I think page->mapping may be used when we map
>>> the page to the userspace in TCP zerocopy, but I'm really not sure.
>>> Yes, finding somewhere else to put the id would be ideal. Do we really
>>> need a full unsigned long for the pp_magic?
>>
>> No, pp_magic was also my backup plan (see the other thread). Tried
>> actually doing that now, and while there's a bit of complication due to
>> the varying definitions of POISON_POINTER_DELTA across architectures,
>> but it seems that this can be defined at compile time. I'll send a v2
>> RFC with this change.
>
> FWIW, personally I like this one much more than an extra indirection
> to pp.
>
> If we're out of space in the page, why can't we use struct page *
> as indices into the xarray? Ala
>
> struct page *p = ...;
> xa_store(xarray, index=(unsigned long)p, p);
>
> Indices wouldn't be nicely packed, but it's still a map. Is there
> a problem with that I didn't consider?
Huh. As I just replied to Yunsheng, I was under the impression that this
was not supported. But since you're now the second person to suggest
this, I looked again, and it looks like I was wrong. There do indeed
seem to be other places in the kernel that do this.
As you say the indices won't be as densely packed, though. So I'm
wondering if using the bits in pp_magic would be better in any case to
get the better packing? I guess we can try benchmarking both approaches
and see if there's a measurable difference.
-Toke
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-11 13:44 ` Toke Høiland-Jørgensen
@ 2025-03-11 13:56 ` Pavel Begunkov
2025-03-11 15:49 ` Toke Høiland-Jørgensen
2025-03-11 16:46 ` Matthew Wilcox
1 sibling, 1 reply; 32+ messages in thread
From: Pavel Begunkov @ 2025-03-11 13:56 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, Mina Almasry, David Wei
Cc: Andrew Morton, Jesper Dangaard Brouer, Ilias Apalodimas,
David S. Miller, Yunsheng Lin, Yonglong Liu, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, linux-mm, netdev
On 3/11/25 13:44, Toke Høiland-Jørgensen wrote:
> Pavel Begunkov <asml.silence@gmail.com> writes:
>
>> On 3/9/25 12:42, Toke Høiland-Jørgensen wrote:
>>> Mina Almasry <almasrymina@google.com> writes:
...
>>> No, pp_magic was also my backup plan (see the other thread). Tried
>>> actually doing that now, and while there's a bit of complication due to
>>> the varying definitions of POISON_POINTER_DELTA across architectures,
>>> but it seems that this can be defined at compile time. I'll send a v2
>>> RFC with this change.
>>
>> FWIW, personally I like this one much more than an extra indirection
>> to pp.
>>
>> If we're out of space in the page, why can't we use struct page *
>> as indices into the xarray? Ala
>>
>> struct page *p = ...;
>> xa_store(xarray, index=(unsigned long)p, p);
>>
>> Indices wouldn't be nicely packed, but it's still a map. Is there
>> a problem with that I didn't consider?
>
> Huh. As I just replied to Yunsheng, I was under the impression that this
> was not supported. But since you're now the second person to suggest
> this, I looked again, and it looks like I was wrong. There does indeed
> seem to be other places in the kernel that does this.
And I just noticed there is an entire discussion my email
client didn't pull :)
At least that's likely the easiest solution. Depends on how
complicated it is to fit the index in, but there is an option
to just go with it and continue the discussion on how to
improve it on top.
> As you say the indices won't be as densely packed, though. So I'm
> wondering if using the bits in pp_magic would be better in any case to
> get the better packing? I guess we can try benchmarking both approaches
> and see if there's a measurable difference.
--
Pavel Begunkov
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-10 17:26 ` Toke Høiland-Jørgensen
@ 2025-03-11 15:03 ` Matthew Wilcox
2025-03-11 15:32 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 32+ messages in thread
From: Matthew Wilcox @ 2025-03-11 15:03 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: Yunsheng Lin, Andrew Morton, Jesper Dangaard Brouer,
Ilias Apalodimas, David S. Miller, Yunsheng Lin, Yonglong Liu,
Mina Almasry, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, linux-mm, netdev
On Mon, Mar 10, 2025 at 06:26:23PM +0100, Toke Høiland-Jørgensen wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> > See https://kernelnewbies.org/MatthewWilcox/Memdescs
> > and more immediately
> > https://kernelnewbies.org/MatthewWilcox/Memdescs/Path
> >
> > pagepool is going to be renamed "bump" because it's a bump allocator and
> > "pagepool" is a nonsense name. I haven't looked into it in a lot of
> > detail yet, but in the not-too-distant future, struct page will look
> > like this (from your point of view):
> >
> > struct page {
> > unsigned long flags;
> > unsigned long memdesc;
> > int _refcount; // 0 for bump
> > union {
> > unsigned long private;
> > atomic_t _mapcount; // maybe used by bump? not sure
> > };
> > };
> >
> > 'memdesc' will be a pointer to struct bump with the bottom four bits of
> > that pointer indicating that it's a struct bump pointer (and not, say, a
> > folio or a slab).
> >
> > So if you allocate a multi-page bump, you'll get N of these pages,
> > and they'll all point to the same struct bump where you'll maintain
> > your actual refcount. And you'll be able to grow struct bump to your
> > heart's content. I don't know exactly what struct bump looks like,
> > but the core mm will have no requirements on you.
>
> Ah, excellent, thanks for the pointer!
>
> Out of curiosity, why "bump"? Is that a term of art somewhere?
https://en.wikipedia.org/wiki/Region-based_memory_management
(and the term "bump allocator" has a number of hits in your favourite
search engine)
> And in the meantime (until those patches land), do you see any reason
> why we can't squat on the middle bits of page->pp_magic (AKA page->lru)
> like I'm doing in v2[0] of this patch?
I haven't had time to dig into this series. I'm trying to get a bunch
of things finished before LSFMM.
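For reference, the "bump" name refers to the allocation strategy
sketched below. This is a generic illustration of the concept under no
particular API, not kernel code:

struct bump_region {
	char	*base;		/* start of the region */
	size_t	size;		/* total size of the region */
	size_t	used;		/* current allocation offset */
};

static void *bump_alloc(struct bump_region *r, size_t n)
{
	/* allocation just "bumps" the offset forward; there is no
	 * per-object free, the whole region is released at once when
	 * the last user is done with it */
	if (r->used + n > r->size)
		return NULL;
	r->used += n;
	return r->base + r->used - n;
}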
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-11 12:25 ` Yunsheng Lin
@ 2025-03-11 15:11 ` Matthew Wilcox
2025-03-12 12:05 ` Yunsheng Lin
0 siblings, 1 reply; 32+ messages in thread
From: Matthew Wilcox @ 2025-03-11 15:11 UTC (permalink / raw)
To: Yunsheng Lin
Cc: Toke Høiland-Jørgensen, Yunsheng Lin, Andrew Morton,
Jesper Dangaard Brouer, Ilias Apalodimas, David S. Miller,
Yonglong Liu, Mina Almasry, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, linux-mm, netdev
On Tue, Mar 11, 2025 at 08:25:25PM +0800, Yunsheng Lin wrote:
> > struct page {
> > unsigned long flags;
> > unsigned long memdesc;
>
> It seems there may be memory behind the above 'memdesc' with different size
> and layout for different subsystem?
Yes.
> I am not sure if I understand the case of the same page might be handle in
> two subsystems concurrently or a page is allocated in one subsystem and
> then passed to be handled in other subsystem, for examlpe:
> page_pool owned page is mmap'ed into user space through tcp zero copy,
> see tcp_zerocopy_vm_insert_batch(), it seems the same page is handled in
> both networking/page_pool and vm subsystem?
It's not that arbitrary. I mean, you could read all the documentation
I've written about this concept, listen to the talks I've given. But
sure, you're a special fucking snowflake and deserve your own unique
explanation.
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-11 15:03 ` Matthew Wilcox
@ 2025-03-11 15:32 ` Toke Høiland-Jørgensen
0 siblings, 0 replies; 32+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-11 15:32 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Yunsheng Lin, Andrew Morton, Jesper Dangaard Brouer,
Ilias Apalodimas, David S. Miller, Yunsheng Lin, Yonglong Liu,
Mina Almasry, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, linux-mm, netdev
Matthew Wilcox <willy@infradead.org> writes:
> On Mon, Mar 10, 2025 at 06:26:23PM +0100, Toke Høiland-Jørgensen wrote:
>> Matthew Wilcox <willy@infradead.org> writes:
>> > See https://kernelnewbies.org/MatthewWilcox/Memdescs
>> > and more immediately
>> > https://kernelnewbies.org/MatthewWilcox/Memdescs/Path
>> >
>> > pagepool is going to be renamed "bump" because it's a bump allocator and
>> > "pagepool" is a nonsense name. I haven't looked into it in a lot of
>> > detail yet, but in the not-too-distant future, struct page will look
>> > like this (from your point of view):
>> >
>> > struct page {
>> > unsigned long flags;
>> > unsigned long memdesc;
>> > int _refcount; // 0 for bump
>> > union {
>> > unsigned long private;
>> > atomic_t _mapcount; // maybe used by bump? not sure
>> > };
>> > };
>> >
>> > 'memdesc' will be a pointer to struct bump with the bottom four bits of
>> > that pointer indicating that it's a struct bump pointer (and not, say, a
>> > folio or a slab).
>> >
>> > So if you allocate a multi-page bump, you'll get N of these pages,
>> > and they'll all point to the same struct bump where you'll maintain
>> > your actual refcount. And you'll be able to grow struct bump to your
>> > heart's content. I don't know exactly what struct bump looks like,
>> > but the core mm will have no requirements on you.
>>
>> Ah, excellent, thanks for the pointer!
>>
>> Out of curiosity, why "bump"? Is that a term of art somewhere?
>
> https://en.wikipedia.org/wiki/Region-based_memory_management
>
> (and the term "bump allocator" has a number of hits in your favourite
> search engine)
Right, fair point, I was being lazy by just asking. Thanks for taking
the time to provide a link :)
>> And in the meantime (until those patches land), do you see any reason
>> why we can't squat on the middle bits of page->pp_magic (AKA page->lru)
>> like I'm doing in v2[0] of this patch?
>
> I haven't had time to dig into this series. I'm trying to get a bunch
> of things finished before LSFMM.
Alright, well if/when you get a chance (before or after LSFMM), I'd
appreciate it if you could take a look!
-Toke
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-11 13:56 ` Pavel Begunkov
@ 2025-03-11 15:49 ` Toke Høiland-Jørgensen
0 siblings, 0 replies; 32+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-11 15:49 UTC (permalink / raw)
To: Pavel Begunkov, Mina Almasry, David Wei
Cc: Andrew Morton, Jesper Dangaard Brouer, Ilias Apalodimas,
David S. Miller, Yunsheng Lin, Yonglong Liu, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, linux-mm, netdev
Pavel Begunkov <asml.silence@gmail.com> writes:
> On 3/11/25 13:44, Toke Høiland-Jørgensen wrote:
>> Pavel Begunkov <asml.silence@gmail.com> writes:
>>
>>> On 3/9/25 12:42, Toke Høiland-Jørgensen wrote:
>>>> Mina Almasry <almasrymina@google.com> writes:
> ...
>>>> No, pp_magic was also my backup plan (see the other thread). Tried
>>>> actually doing that now, and while there's a bit of complication due to
>>>> the varying definitions of POISON_POINTER_DELTA across architectures,
>>>> but it seems that this can be defined at compile time. I'll send a v2
>>>> RFC with this change.
>>>
>>> FWIW, personally I like this one much more than an extra indirection
>>> to pp.
>>>
>>> If we're out of space in the page, why can't we use struct page *
>>> as indices into the xarray? Ala
>>>
>>> struct page *p = ...;
>>> xa_store(xarray, index=(unsigned long)p, p);
>>>
>>> Indices wouldn't be nicely packed, but it's still a map. Is there
>>> a problem with that I didn't consider?
>>
>> Huh. As I just replied to Yunsheng, I was under the impression that this
>> was not supported. But since you're now the second person to suggest
>> this, I looked again, and it looks like I was wrong. There does indeed
>> seem to be other places in the kernel that does this.
>
> And I just noticed there is an entire discussion my email
> client didn't pull :)
>
> At least that's likely the easiest solution. Depends on how
> complicated it is to fit the index in, but there is an option
> to just go with it and continue the discussion on how to
> improve it on top.
Didn't seem to be too complicated, assuming no problems appear with
using the middle bits of the pp_magic field. See v2 of the RFC, or here
for the latest version:
https://git.kernel.org/toke/c/df6248a71f85
-Toke
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-11 13:44 ` Toke Høiland-Jørgensen
2025-03-11 13:56 ` Pavel Begunkov
@ 2025-03-11 16:46 ` Matthew Wilcox
2025-03-12 10:56 ` Toke Høiland-Jørgensen
2025-03-12 12:55 ` Pavel Begunkov
1 sibling, 2 replies; 32+ messages in thread
From: Matthew Wilcox @ 2025-03-11 16:46 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: Pavel Begunkov, Mina Almasry, David Wei, Andrew Morton,
Jesper Dangaard Brouer, Ilias Apalodimas, David S. Miller,
Yunsheng Lin, Yonglong Liu, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, linux-mm, netdev
On Tue, Mar 11, 2025 at 02:44:15PM +0100, Toke Høiland-Jørgensen wrote:
> Pavel Begunkov <asml.silence@gmail.com> writes:
> > If we're out of space in the page, why can't we use struct page *
> > as indices into the xarray? Ala
> >
> > struct page *p = ...;
> > xa_store(xarray, index=(unsigned long)p, p);
> >
> > Indices wouldn't be nicely packed, but it's still a map. Is there
> > a problem with that I didn't consider?
>
> Huh. As I just replied to Yunsheng, I was under the impression that this
> was not supported. But since you're now the second person to suggest
> this, I looked again, and it looks like I was wrong. There does indeed
> seem to be other places in the kernel that does this.
>
> As you say the indices won't be as densely packed, though. So I'm
> wondering if using the bits in pp_magic would be better in any case to
> get the better packing? I guess we can try benchmarking both approaches
> and see if there's a measurable difference.
This is an absolutely terrible idea, only proposed by those who have no
understanding of how the XArray works. It could not be more wasteful.
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-11 16:46 ` Matthew Wilcox
@ 2025-03-12 10:56 ` Toke Høiland-Jørgensen
2025-03-12 12:55 ` Pavel Begunkov
1 sibling, 0 replies; 32+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-12 10:56 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Pavel Begunkov, Mina Almasry, David Wei, Andrew Morton,
Jesper Dangaard Brouer, Ilias Apalodimas, David S. Miller,
Yunsheng Lin, Yonglong Liu, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, linux-mm, netdev
Matthew Wilcox <willy@infradead.org> writes:
> On Tue, Mar 11, 2025 at 02:44:15PM +0100, Toke Høiland-Jørgensen wrote:
>> Pavel Begunkov <asml.silence@gmail.com> writes:
>> > If we're out of space in the page, why can't we use struct page *
>> > as indices into the xarray? Ala
>> >
>> > struct page *p = ...;
>> > xa_store(xarray, index=(unsigned long)p, p);
>> >
>> > Indices wouldn't be nicely packed, but it's still a map. Is there
>> > a problem with that I didn't consider?
>>
>> Huh. As I just replied to Yunsheng, I was under the impression that this
>> was not supported. But since you're now the second person to suggest
>> this, I looked again, and it looks like I was wrong. There does indeed
>> seem to be other places in the kernel that does this.
>>
>> As you say the indices won't be as densely packed, though. So I'm
>> wondering if using the bits in pp_magic would be better in any case to
>> get the better packing? I guess we can try benchmarking both approaches
>> and see if there's a measurable difference.
>
> This is an absolutely terrible idea, only proposed by those who have no
> understanding of how the XArray works. It could not be more wasteful.
Alright, ACK, will stay with the xa_alloc() + stashing the id in
pp_magic (unless you come back and tell us there's some reason we can't
use those bits).
Do you mind if I send a patch to the xarray docs to explicitly spell out
that using pointers as keys is a bad idea? It's a bit implicit the way
it's phrased now, IMO.
-Toke
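Roughly, the approach ACKed above could look like the sketch below. The
bit layout is a placeholder (the real patch has to derive the usable
bits from PP_SIGNATURE and POISON_POINTER_DELTA), and the field and
helper names are assumptions for illustration, not the actual code:

#define PP_DMA_INDEX_SHIFT	8	/* placeholder: assumed free bits above the signature */
#define PP_DMA_INDEX_BITS	24
#define PP_DMA_INDEX_MASK	(((1UL << PP_DMA_INDEX_BITS) - 1) << PP_DMA_INDEX_SHIFT)

static int page_pool_set_dma_index(struct page_pool *pool,
				   struct page *page, gfp_t gfp)
{
	u32 id;
	int err;

	/* allocate an id for this page and remember the page under that id */
	err = xa_alloc(&pool->dma_mapped, &id, page,
		       XA_LIMIT(1, PP_DMA_INDEX_MASK >> PP_DMA_INDEX_SHIFT),
		       gfp);
	if (err)
		return err;

	/* stash the id in the spare bits of pp_magic */
	page->pp_magic |= (unsigned long)id << PP_DMA_INDEX_SHIFT;
	return 0;
}

static void page_pool_clear_dma_index(struct page_pool *pool, struct page *page)
{
	u32 id = (page->pp_magic & PP_DMA_INDEX_MASK) >> PP_DMA_INDEX_SHIFT;

	/* xa_cmpxchg() ensures the page is only ever unmapped once */
	if (xa_cmpxchg(&pool->dma_mapped, id, page, NULL, 0) == page)
		page->pp_magic &= ~PP_DMA_INDEX_MASK;
}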
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-11 13:26 ` Toke Høiland-Jørgensen
@ 2025-03-12 12:04 ` Yunsheng Lin
2025-03-12 12:27 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 32+ messages in thread
From: Yunsheng Lin @ 2025-03-12 12:04 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, Yunsheng Lin, Andrew Morton,
Jesper Dangaard Brouer, Ilias Apalodimas, David S. Miller
Cc: Yonglong Liu, Mina Almasry, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, linux-mm, netdev
On 2025/3/11 21:26, Toke Høiland-Jørgensen wrote:
> Yunsheng Lin <linyunsheng@huawei.com> writes:
>
>> On 2025/3/10 23:24, Toke Høiland-Jørgensen wrote:
>>
>>>>
>>>> I guess that is one of the disadvantages that an advanced struct like
>>>> Xarray is used:(
>>>
>>> Sure, there will be some overhead from using xarray, but I think the
>>> simplicity makes up for it; especially since we can limit this to the
>>
>> As my understanding, it is more complicated, it is just that
>> complexity is hidden before xarray now.
>
> Yes, which encapsulates the complexity into a shared abstraction that is
> widely used in the kernel, so it does not add new complexity to the
> kernel as a whole. Whereas your series adds a whole bunch of new
> complexity to the kernel in the form of a new slab allocator.
>
>> Even if there is no space in 'struct page' to store the id, the
>> 'struct page' pointer itself can be used as id if the xarray can
>> use pointer as id. But it might mean the memory utilization might
>> not be as efficient as it should be, and performance hurts too if
>> there is more memory to be allocated and freed.
>
> I don't think it can. But sure, there can be other ways around this.
>
> FWIW, I don't think your idea of allocating page_pool_items to use as an
> indirection is totally crazy, but all the complexity around it (the
> custom slab allocator etc) is way too much. And if we can avoid the item
> indirection that is obviously better.
>
>> It seems it is just a matter of choices between using tailor-made
>> page_pool specific optimization and using some generic advanced
>> struct like xarray.
>
> Yup, basically.
>
>> I chose the tailor-made one because it ensure least overhead as
>> much as possibe from performance and memory utilization perspective,
>> for example, the 'single producer, multiple consumer' guarantee
>> offered by NAPI context can avoid some lock and atomic operation.
>
> Right, and my main point is that the complexity of this is not worth it :)
Without that complexity, there is about 400 ns of performance overhead
for the xarray, compared to about 10 ns for the tailor-made one, which
means more than a 200% performance degradation for the page_pool03_slow
test case:
[ 9190.217338] bench_page_pool_simple: Loaded
[ 9190.298495] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.770 ns (step:0) - (measurement period time:0.077040570 sec time_interval:77040570) - (invoke count:100000000 tsc_interval:7704049)
[ 9190.893294] time_bench: Type:atomic_inc Per elem: 0 cycles(tsc) 5.775 ns (step:0) - (measurement period time:0.577582060 sec time_interval:577582060) - (invoke count:100000000 tsc_interval:57758202)
[ 9191.061026] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.017 ns (step:0) - (measurement period time:0.150170250 sec time_interval:150170250) - (invoke count:10000000 tsc_interval:15017020)
[ 9191.771113] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.930 ns (step:0) - (measurement period time:0.693045930 sec time_interval:693045930) - (invoke count:100000000 tsc_interval:69304585)
[ 9231.309907] time_bench: Type:xarray Per elem: 39 cycles(tsc) 395.218 ns (step:0) - (measurement period time:39.521827650 sec time_interval:39521827650) - (invoke count:100000000 tsc_interval:3952182703)
[ 9231.327827] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[ 9231.640260] time_bench: Type:no-softirq-page_pool01 Per elem: 3 cycles(tsc) 30.316 ns (step:0) - (measurement period time:0.303162870 sec time_interval:303162870) - (invoke count:10000000 tsc_interval:30316281)
[ 9231.658866] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[ 9232.244955] time_bench: Type:no-softirq-page_pool02 Per elem: 5 cycles(tsc) 57.691 ns (step:0) - (measurement period time:0.576910280 sec time_interval:576910280) - (invoke count:10000000 tsc_interval:57691015)
[ 9232.263567] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[ 9233.990491] time_bench: Type:no-softirq-page_pool03 Per elem: 17 cycles(tsc) 171.809 ns (step:0) - (measurement period time:1.718090950 sec time_interval:1718090950) - (invoke count:10000000 tsc_interval:171809088)
[ 9234.011402] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[ 9234.019286] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[ 9234.328952] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 3 cycles(tsc) 30.057 ns (step:0) - (measurement period time:0.300574060 sec time_interval:300574060) - (invoke count:10000000 tsc_interval:30057400)
[ 9234.348155] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[ 9234.898627] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 5 cycles(tsc) 54.146 ns (step:0) - (measurement period time:0.541466850 sec time_interval:541466850) - (invoke count:10000000 tsc_interval:54146680)
[ 9234.917742] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[ 9236.691076] time_bench: Type:tasklet_page_pool03_slow Per elem: 17 cycles(tsc) 176.467 ns (step:0) - (measurement period time:1.764675290 sec time_interval:1764675290) - (invoke count:10000000 tsc_interval:176467523)
Tested using the below diff:
+++ b/kernel/lib/bench_page_pool_simple.c
@@ -149,6 +149,48 @@ static int time_bench_rcu(
return loops_cnt;
}
+static int time_bench_xarray(
+ struct time_bench_record *rec, void *data)
+{
+ uint64_t loops_cnt = 0;
+ struct xarray dma_mapped;
+ int i, err;
+ u32 id;
+ void *old;
+
+ xa_init_flags(&dma_mapped, XA_FLAGS_ALLOC1);
+
+ time_bench_start(rec);
+ /** Loop to measure **/
+ for (i = 0; i < rec->loops; i++) {
+
+ if (in_softirq())
+ err = xa_alloc(&dma_mapped, &id, &dma_mapped,
+ XA_LIMIT(1, UINT_MAX), GFP_ATOMIC);
+ else
+ err = xa_alloc_bh(&dma_mapped, &id, &dma_mapped,
+ XA_LIMIT(1, UINT_MAX), GFP_KERNEL);
+
+ if (err)
+ break;
+
+ loops_cnt++;
+ barrier(); /* avoid compiler to optimize this loop */
+
+ if (in_softirq())
+ old = xa_cmpxchg(&dma_mapped, id, &dma_mapped, NULL, 0);
+ else
+ old = xa_cmpxchg_bh(&dma_mapped, id, &dma_mapped, NULL, 0);
+
+ if (old != &dma_mapped)
+ break;
+ }
+ time_bench_stop(rec, loops_cnt);
+
+ xa_destroy(&dma_mapped);
+ return loops_cnt;
+}
+
/* Helper for filling some page's into ptr_ring */
static void pp_fill_ptr_ring(struct page_pool *pp, int elems)
{
@@ -334,6 +376,8 @@ static int run_benchmark_tests(void)
"lock", NULL, time_bench_lock);
time_bench_loop(nr_loops*10, 0,
"rcu", NULL, time_bench_rcu);
+ time_bench_loop(nr_loops*10, 0,
+ "xarray", NULL, time_bench_xarray);
}
/* This test cannot activate correct code path, due to no-softirq ctx */
>
>>> cases where it's absolutely needed.
>>
>> The above can also be done for using page_pool_item too as the
>> lower 2 bits can be used to indicate the pointer in 'struct page'
>> is 'page_pool_item' or 'page_pool', I just don't think it is
>> necessary yet as it might add more checking in the fast path.
>
> Yup, did think about using the lower bits to distinguish if it does turn
> out that we can't avoid an indirection. See above; it's not actually the
The 'memdesc' seems like an indirection to me when using that to shrink
'struct page' to a smaller size.
> page_pool_item concept that is my main issue with your series.
>
> -Toke
>
>
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-11 15:11 ` Matthew Wilcox
@ 2025-03-12 12:05 ` Yunsheng Lin
2025-03-12 18:35 ` Shuah
0 siblings, 1 reply; 32+ messages in thread
From: Yunsheng Lin @ 2025-03-12 12:05 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Toke Høiland-Jørgensen, Yunsheng Lin, Andrew Morton,
Jesper Dangaard Brouer, Ilias Apalodimas, David S. Miller,
Yonglong Liu, Mina Almasry, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, linux-mm, netdev, conduct
On 2025/3/11 23:11, Matthew Wilcox wrote:
> On Tue, Mar 11, 2025 at 08:25:25PM +0800, Yunsheng Lin wrote:
>>> struct page {
>>> unsigned long flags;
>>> unsigned long memdesc;
>>
>> It seems there may be memory behind the above 'memdesc' with different size
>> and layout for different subsystem?
>
> Yes.
>
>> I am not sure if I understand the case of the same page might be handle in
>> two subsystems concurrently or a page is allocated in one subsystem and
>> then passed to be handled in other subsystem, for examlpe:
>> page_pool owned page is mmap'ed into user space through tcp zero copy,
>> see tcp_zerocopy_vm_insert_batch(), it seems the same page is handled in
>> both networking/page_pool and vm subsystem?
>
> It's not that arbitrary. I mean, you could read all the documentation
> I've written about this concept, listen to the talks I've given. But
> sure, you're a special fucking snowflake and deserve your own unique
> explanation.
If you don't like responding to the above question/comment, I would
rather you strip it out like the other questions/comments, or just
ignore it :(
I am not sure how to interpret the comment, but I am sure it is not a
kind one, so I am CC'ing the 'Code of Conduct Committee' in case there
is more coming.
>
>
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-12 12:04 ` Yunsheng Lin
@ 2025-03-12 12:27 ` Toke Høiland-Jørgensen
2025-03-12 12:53 ` Pavel Begunkov
0 siblings, 1 reply; 32+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-03-12 12:27 UTC (permalink / raw)
To: Yunsheng Lin, Yunsheng Lin, Andrew Morton,
Jesper Dangaard Brouer, Ilias Apalodimas, David S. Miller
Cc: Yonglong Liu, Mina Almasry, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, linux-mm, netdev
Yunsheng Lin <linyunsheng@huawei.com> writes:
> On 2025/3/11 21:26, Toke Høiland-Jørgensen wrote:
>> Yunsheng Lin <linyunsheng@huawei.com> writes:
>>
>>> On 2025/3/10 23:24, Toke Høiland-Jørgensen wrote:
>>>
>>>>>
>>>>> I guess that is one of the disadvantages that an advanced struct like
>>>>> Xarray is used:(
>>>>
>>>> Sure, there will be some overhead from using xarray, but I think the
>>>> simplicity makes up for it; especially since we can limit this to the
>>>
>>> As my understanding, it is more complicated, it is just that
>>> complexity is hidden before xarray now.
>>
>> Yes, which encapsulates the complexity into a shared abstraction that is
>> widely used in the kernel, so it does not add new complexity to the
>> kernel as a whole. Whereas your series adds a whole bunch of new
>> complexity to the kernel in the form of a new slab allocator.
>>
>>> Even if there is no space in 'struct page' to store the id, the
>>> 'struct page' pointer itself can be used as id if the xarray can
>>> use pointer as id. But it might mean the memory utilization might
>>> not be as efficient as it should be, and performance hurts too if
>>> there is more memory to be allocated and freed.
>>
>> I don't think it can. But sure, there can be other ways around this.
>>
>> FWIW, I don't think your idea of allocating page_pool_items to use as an
>> indirection is totally crazy, but all the complexity around it (the
>> custom slab allocator etc) is way too much. And if we can avoid the item
>> indirection that is obviously better.
>>
>>> It seems it is just a matter of choices between using tailor-made
>>> page_pool specific optimization and using some generic advanced
>>> struct like xarray.
>>
>> Yup, basically.
>>
>>> I chose the tailor-made one because it ensure least overhead as
>>> much as possibe from performance and memory utilization perspective,
>>> for example, the 'single producer, multiple consumer' guarantee
>>> offered by NAPI context can avoid some lock and atomic operation.
>>
>> Right, and my main point is that the complexity of this is not worth it :)
>
> Without the complexity, there is about 400ns performance overhead
> for Xarray compared to about 10ns performance overhead for the
> tailor-made one, which means there is about more than 200% performance
> degradation for page_pool03_slow test case:
Great, thanks for testing! I will include this in the non-RFC posting of
this series :)
>>>> cases where it's absolutely needed.
>>>
>>> The above can also be done for using page_pool_item too as the
>>> lower 2 bits can be used to indicate the pointer in 'struct page'
>>> is 'page_pool_item' or 'page_pool', I just don't think it is
>>> necessary yet as it might add more checking in the fast path.
>>
>> Yup, did think about using the lower bits to distinguish if it does turn
>> out that we can't avoid an indirection. See above; it's not actually the
>
> The 'memdesc' seems like an indirection to me when using that to shrink
> 'struct page' to a smaller size.
Yes, it does seem like we'll end up with an indirection of some kind
eventually. But let's cross that bridge when we get to it...
-Toke
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-12 12:27 ` Toke Høiland-Jørgensen
@ 2025-03-12 12:53 ` Pavel Begunkov
0 siblings, 0 replies; 32+ messages in thread
From: Pavel Begunkov @ 2025-03-12 12:53 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, Yunsheng Lin, Yunsheng Lin,
Andrew Morton, Jesper Dangaard Brouer, Ilias Apalodimas,
David S. Miller
Cc: Yonglong Liu, Mina Almasry, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, linux-mm, netdev
On 3/12/25 12:27, Toke Høiland-Jørgensen wrote:
> Yunsheng Lin <linyunsheng@huawei.com> writes:
...
>>>>> cases where it's absolutely needed.
>>>>
>>>> The above can also be done for using page_pool_item too as the
>>>> lower 2 bits can be used to indicate the pointer in 'struct page'
>>>> is 'page_pool_item' or 'page_pool', I just don't think it is
>>>> necessary yet as it might add more checking in the fast path.
>>>
>>> Yup, did think about using the lower bits to distinguish if it does turn
>>> out that we can't avoid an indirection. See above; it's not actually the
>>
>> The 'memdesc' seems like an indirection to me when using that to shrink
>> 'struct page' to a smaller size.
>
> Yes, it does seem like we'll end up with an indirection of some kind
> eventually. But let's cross that bridge when we get to it...
At which point it might be easier to avoid all the "bump" business,
fully embrace the net_iov / netmem format, wrap all pp pages into a
structure on the page pool side, and pass that around. That would
remove the indirection for most of the accesses, and the allocation
can easily be cached.
--
Pavel Begunkov
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-11 16:46 ` Matthew Wilcox
2025-03-12 10:56 ` Toke Høiland-Jørgensen
@ 2025-03-12 12:55 ` Pavel Begunkov
1 sibling, 0 replies; 32+ messages in thread
From: Pavel Begunkov @ 2025-03-12 12:55 UTC (permalink / raw)
To: Matthew Wilcox, Toke Høiland-Jørgensen
Cc: Mina Almasry, David Wei, Andrew Morton, Jesper Dangaard Brouer,
Ilias Apalodimas, David S. Miller, Yunsheng Lin, Yonglong Liu,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
linux-mm, netdev
On 3/11/25 16:46, Matthew Wilcox wrote:
> On Tue, Mar 11, 2025 at 02:44:15PM +0100, Toke Høiland-Jørgensen wrote:
>> Pavel Begunkov <asml.silence@gmail.com> writes:
>>> If we're out of space in the page, why can't we use struct page *
>>> as indices into the xarray? Ala
>>>
>>> struct page *p = ...;
>>> xa_store(xarray, index=(unsigned long)p, p);
>>>
>>> Indices wouldn't be nicely packed, but it's still a map. Is there
>>> a problem with that I didn't consider?
>>
>> Huh. As I just replied to Yunsheng, I was under the impression that this
>> was not supported. But since you're now the second person to suggest
>> this, I looked again, and it looks like I was wrong. There does indeed
>> seem to be other places in the kernel that does this.
>>
>> As you say the indices won't be as densely packed, though. So I'm
>> wondering if using the bits in pp_magic would be better in any case to
>> get the better packing? I guess we can try benchmarking both approaches
>> and see if there's a measurable difference.
>
> This is an absolutely terrible idea, only proposed by those who have no
> understanding of how the XArray works. It could not be more wasteful.
Which is why it's so great we have you here; not everyone is
developing the XArray. So maybe it is useless for this case, then.
--
Pavel Begunkov
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-12 12:05 ` Yunsheng Lin
@ 2025-03-12 18:35 ` Shuah
2025-03-12 18:48 ` shuah
0 siblings, 1 reply; 32+ messages in thread
From: Shuah @ 2025-03-12 18:35 UTC (permalink / raw)
To: Yunsheng Lin, Matthew Wilcox
Cc: Toke Høiland-Jørgensen, Yunsheng Lin, Andrew Morton,
Jesper Dangaard Brouer, Ilias Apalodimas, David S. Miller,
Yonglong Liu, Mina Almasry, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, linux-mm, netdev, conduct
On 3/12/25 06:05, Yunsheng Lin wrote:
> On 2025/3/11 23:11, Matthew Wilcox wrote:
>> On Tue, Mar 11, 2025 at 08:25:25PM +0800, Yunsheng Lin wrote:
>>>> struct page {
>>>> unsigned long flags;
>>>> unsigned long memdesc;
>>>
>>> It seems there may be memory behind the above 'memdesc' with different size
>>> and layout for different subsystem?
>>
>> Yes.
>>
>>> I am not sure if I understand the case of the same page might be handle in
>>> two subsystems concurrently or a page is allocated in one subsystem and
>>> then passed to be handled in other subsystem, for examlpe:
>>> page_pool owned page is mmap'ed into user space through tcp zero copy,
>>> see tcp_zerocopy_vm_insert_batch(), it seems the same page is handled in
>>> both networking/page_pool and vm subsystem?
>>
>> It's not that arbitrary. I mean, you could read all the documentation
>> I've written about this concept, listen to the talks I've given.
You can't just point to talks given on the concept - people don't have
to go find your talks to understand it. You are expected to answer the
question and explain it to us here in this thread.
>> But
>> sure, you're a special fucking snowflake and deserve your own unique
>> explanation.
Yunsheng Lin, This message is a rude personal attack. This isn't the
way to treat your peers in the community. Apology is warranted.
>
> If you don't like responding to the above question/comment, I would rather
> you strip out them like the other question/comment or just ignore it:(
>
> I am not sure how to interpret the comment, but I am sure it is not a kind
> one, so CC 'Code of Conduct Committee' in case there is more coming.
>
Thank you Matthew for letting us know about this.
thanks,
-- Shuah (on behalf of the CoC committee)
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-12 18:35 ` Shuah
@ 2025-03-12 18:48 ` shuah
2025-03-12 18:56 ` Matthew Wilcox
2025-03-14 18:09 ` Matthew Wilcox
0 siblings, 2 replies; 32+ messages in thread
From: shuah @ 2025-03-12 18:48 UTC (permalink / raw)
To: Yunsheng Lin, Matthew Wilcox
Cc: Toke Høiland-Jørgensen, Yunsheng Lin, Andrew Morton,
Jesper Dangaard Brouer, Ilias Apalodimas, David S. Miller,
Yonglong Liu, Mina Almasry, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, linux-mm, netdev, conduct
On 3/12/25 12:35, Shuah wrote:
> On 3/12/25 06:05, Yunsheng Lin wrote:
>> On 2025/3/11 23:11, Matthew Wilcox wrote:
>>> On Tue, Mar 11, 2025 at 08:25:25PM +0800, Yunsheng Lin wrote:
>>>>> struct page {
>>>>> unsigned long flags;
>>>>> unsigned long memdesc;
>>>>
>>>> It seems there may be memory behind the above 'memdesc' with different size
>>>> and layout for different subsystem?
>>>
>>> Yes.
>>>
>>>> I am not sure if I understand the case of the same page might be handle in
>>>> two subsystems concurrently or a page is allocated in one subsystem and
>>>> then passed to be handled in other subsystem, for examlpe:
>>>> page_pool owned page is mmap'ed into user space through tcp zero copy,
>>>> see tcp_zerocopy_vm_insert_batch(), it seems the same page is handled in
>>>> both networking/page_pool and vm subsystem?
>>>
>>> It's not that arbitrary. I mean, you could read all the documentation
>>> I've written about this concept, listen to the talks I've given.
>
> You can't point to talk given on the concept - people don't have to go
> find your talks to understand the concept. You are expected to answer
> the question and explain it to us here in this thread.
>
>>> But
>>> sure, you're a special fucking snowflake and deserve your own unique
>>> explanation.
Matthew,
This message is a rude personal attack. This isn't the way to treat your
peers in the community. Apology is warranted.
>
> Yunsheng Lin, This message is a rude personal attack. This isn't the
> way to treat your peers in the community. Apology is warranted.
>
Yunsheng Lin, I am so sorry I got it wrong. Apologies for the mistake.
thanks,
-- Shuah (on behalf of the CoC committee)
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-12 18:48 ` shuah
@ 2025-03-12 18:56 ` Matthew Wilcox
2025-03-12 22:25 ` Shuah Khan
2025-03-14 18:09 ` Matthew Wilcox
1 sibling, 1 reply; 32+ messages in thread
From: Matthew Wilcox @ 2025-03-12 18:56 UTC (permalink / raw)
To: shuah
Cc: Yunsheng Lin, Toke Høiland-Jørgensen, Yunsheng Lin,
Andrew Morton, Jesper Dangaard Brouer, Ilias Apalodimas,
David S. Miller, Yonglong Liu, Mina Almasry, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, linux-mm, netdev,
conduct
On Wed, Mar 12, 2025 at 12:48:56PM -0600, shuah wrote:
> This message is a rude personal attack. This isn't the way to treat your
> peers in the community. Apology is warranted.
I apologise for using the word "fucking".
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-12 18:56 ` Matthew Wilcox
@ 2025-03-12 22:25 ` Shuah Khan
0 siblings, 0 replies; 32+ messages in thread
From: Shuah Khan @ 2025-03-12 22:25 UTC (permalink / raw)
To: Matthew Wilcox, shuah
Cc: Yunsheng Lin, Toke Høiland-Jørgensen, Yunsheng Lin,
Andrew Morton, Jesper Dangaard Brouer, Ilias Apalodimas,
David S. Miller, Yonglong Liu, Mina Almasry, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, linux-mm, netdev,
conduct
On 3/12/25 12:56, Matthew Wilcox wrote:
> On Wed, Mar 12, 2025 at 12:48:56PM -0600, shuah wrote:
>> This message is a rude personal attack. This isn't the way to treat your
>> peers in the community. Apology is warranted.
>
> I apologise for using the word "fucking".
Sorry. This response just isn't sufficient.
Matthew, you know better than to repeat the offensive word
in your response.
It is not okay to insult a fellow developer by calling them a
"special *** snowflake".
Please find a better way to say you don't like these patches.
thanks,
-- Shuah (on behalf of the CoC committee)
* Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
2025-03-12 18:48 ` shuah
2025-03-12 18:56 ` Matthew Wilcox
@ 2025-03-14 18:09 ` Matthew Wilcox
1 sibling, 0 replies; 32+ messages in thread
From: Matthew Wilcox @ 2025-03-14 18:09 UTC (permalink / raw)
To: shuah
Cc: Yunsheng Lin, Toke Høiland-Jørgensen, Yunsheng Lin,
Andrew Morton, Jesper Dangaard Brouer, Ilias Apalodimas,
David S. Miller, Yonglong Liu, Mina Almasry, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, linux-mm, netdev,
conduct
On Wed, Mar 12, 2025 at 12:48:56PM -0600, shuah wrote:
> This message is a rude personal attack. This isn't the way to treat your
> peers in the community. Apology is warranted.
I apologise for the insult; that was unnecessary.
end of thread [~2025-03-14 18:09 UTC | newest]
Thread overview: 32+ messages
2025-03-08 14:54 [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool Toke Høiland-Jørgensen
2025-03-08 19:22 ` Mina Almasry
2025-03-09 12:42 ` Toke Høiland-Jørgensen
2025-03-11 13:25 ` Pavel Begunkov
2025-03-11 13:44 ` Toke Høiland-Jørgensen
2025-03-11 13:56 ` Pavel Begunkov
2025-03-11 15:49 ` Toke Høiland-Jørgensen
2025-03-11 16:46 ` Matthew Wilcox
2025-03-12 10:56 ` Toke Høiland-Jørgensen
2025-03-12 12:55 ` Pavel Begunkov
2025-03-10 7:17 ` Pavel Begunkov
2025-03-09 13:27 ` Yunsheng Lin
2025-03-10 9:13 ` Toke Høiland-Jørgensen
2025-03-10 12:35 ` Yunsheng Lin
2025-03-10 15:24 ` Toke Høiland-Jørgensen
2025-03-11 12:19 ` Yunsheng Lin
2025-03-11 13:26 ` Toke Høiland-Jørgensen
2025-03-12 12:04 ` Yunsheng Lin
2025-03-12 12:27 ` Toke Høiland-Jørgensen
2025-03-12 12:53 ` Pavel Begunkov
2025-03-10 15:42 ` Matthew Wilcox
2025-03-10 17:26 ` Toke Høiland-Jørgensen
2025-03-11 15:03 ` Matthew Wilcox
2025-03-11 15:32 ` Toke Høiland-Jørgensen
2025-03-11 12:25 ` Yunsheng Lin
2025-03-11 15:11 ` Matthew Wilcox
2025-03-12 12:05 ` Yunsheng Lin
2025-03-12 18:35 ` Shuah
2025-03-12 18:48 ` shuah
2025-03-12 18:56 ` Matthew Wilcox
2025-03-12 22:25 ` Shuah Khan
2025-03-14 18:09 ` Matthew Wilcox