* [RFC mm v6] mm: introduce a new page type for page pool in page type
From: Byungchul Park @ 2025-11-17 5:20 UTC
To: linux-mm, netdev
Cc: linux-kernel, kernel_team, harry.yoo, ast, daniel, davem, kuba,
hawk, john.fastabend, sdf, saeedm, leon, tariqt, mbloch,
andrew+netdev, edumazet, pabeni, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, horms, jackmanb,
hannes, ziy, ilias.apalodimas, willy, brauner, kas, yuzhao,
usamaarif642, baolin.wang, almasrymina, toke, asml.silence, bpf,
linux-rdma, sfr, dw, ap420073, dtatulea
Currently, the condition 'page->pp_magic == PP_SIGNATURE' is used to
determine if a page belongs to a page pool. However, with the planned
removal of @pp_magic, we should instead leverage the page_type in struct
page, such as PGTY_netpp, for this purpose.
Introduce and use the page type APIs e.g. PageNetpp(), __SetPageNetpp(),
and __ClearPageNetpp() instead, and remove the existing APIs accessing
@pp_magic e.g. page_pool_page_is_pp(), netmem_or_pp_magic(), and
netmem_clear_pp_magic().
Plus, add @page_type to struct net_iov at the same offset as in struct
page so as to use the page_type APIs for struct net_iov as well. While
at it, reorder @type and @owner in struct net_iov to avoid a hole and
an increase in the struct size.
This work was inspired by the following link:
https://lore.kernel.org/all/582f41c0-2742-4400-9c81-0d46bf4e8314@gmail.com/
While at it, move the page pool sanity check to the free path.
Suggested-by: David Hildenbrand <david@redhat.com>
Co-developed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Byungchul Park <byungchul@sk.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
---
I dropped all the Reviewed-by and Acked-by tags given for the network
changes since I changed how that part is implemented at Jakub's request.
Can I keep your tags? Jakub, are you okay with this change?
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
This set is supposed to go via the mm tree, but it currently also
depends on patches in the net-next tree. For now, this set is based on
linux-next, but it will apply cleanly (or get rebased) after the mm tree
is rebased.
Changes from v5:
1. Add @page_type to struct net_iov so as to use the page_type
APIs even for struct net_iov. (per feedback from Jakub)
Changes from v4:
1. Rebase on the latest version of linux-next as of Nov 3.
2. Improve commit messages. (per feedback from Jakub and Mina)
3. Add Acked-by and Reviewed-by. Thanks to Mina.
Changes from v3:
1. Rebase on next-20251023 of linux-next.
2. Split into two, mm changes and network changes.
3. Improve the comments. (per feedback from Jakub)
Changes from v2:
1. Rebase on linux-next as of Jul 29.
2. Skip 'niov->pp = NULL' when it's allocated using __GFP_ZERO.
3. Change trivial coding style. (per feedback from Mina)
4. Add Co-developed-by, Acked-by, and Reviewed-by properly.
Thanks to all.
Changes from v1:
1. Rebase on linux-next.
2. Initialize net_iov->pp = NULL when allocating net_iov in
net_devmem_bind_dmabuf() and io_zcrx_create_area().
3. Use ->pp for net_iov to identify if it's pp rather than
always considering net_iov as pp.
4. Add Suggested-by: David Hildenbrand <david@redhat.com>.
---
.../net/ethernet/mellanox/mlx5/core/en/xdp.c | 2 +-
include/linux/mm.h | 27 +++----------------
include/linux/page-flags.h | 6 +++++
include/net/netmem.h | 15 +++++++++--
mm/page_alloc.c | 8 +++---
net/core/netmem_priv.h | 20 +++++---------
net/core/page_pool.c | 18 +++++++++++--
7 files changed, 50 insertions(+), 46 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 5d51600935a6..def274f5c1ca 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -707,7 +707,7 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo);
page = xdpi.page.page;
- /* No need to check page_pool_page_is_pp() as we
+ /* No need to check PageNetpp() as we
* know this is a page_pool page.
*/
page_pool_recycle_direct(pp_page_to_nmdesc(page)->pp,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a3f97c551ad8..081e365caa1a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4252,10 +4252,9 @@ int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
* DMA mapping IDs for page_pool
*
* When DMA-mapping a page, page_pool allocates an ID (from an xarray) and
- * stashes it in the upper bits of page->pp_magic. We always want to be able to
- * unambiguously identify page pool pages (using page_pool_page_is_pp()). Non-PP
- * pages can have arbitrary kernel pointers stored in the same field as pp_magic
- * (since it overlaps with page->lru.next), so we must ensure that we cannot
+ * stashes it in the upper bits of page->pp_magic. Non-PP pages can have
+ * arbitrary kernel pointers stored in the same field as pp_magic (since
+ * it overlaps with page->lru.next), so we must ensure that we cannot
* mistake a valid kernel pointer with any of the values we write into this
* field.
*
@@ -4290,26 +4289,6 @@ int arch_lock_shadow_stack_status(struct task_struct *t, unsigned long status);
#define PP_DMA_INDEX_MASK GENMASK(PP_DMA_INDEX_BITS + PP_DMA_INDEX_SHIFT - 1, \
PP_DMA_INDEX_SHIFT)
-/* Mask used for checking in page_pool_page_is_pp() below. page->pp_magic is
- * OR'ed with PP_SIGNATURE after the allocation in order to preserve bit 0 for
- * the head page of compound page and bit 1 for pfmemalloc page, as well as the
- * bits used for the DMA index. page_is_pfmemalloc() is checked in
- * __page_pool_put_page() to avoid recycling the pfmemalloc page.
- */
-#define PP_MAGIC_MASK ~(PP_DMA_INDEX_MASK | 0x3UL)
-
-#ifdef CONFIG_PAGE_POOL
-static inline bool page_pool_page_is_pp(const struct page *page)
-{
- return (page->pp_magic & PP_MAGIC_MASK) == PP_SIGNATURE;
-}
-#else
-static inline bool page_pool_page_is_pp(const struct page *page)
-{
- return false;
-}
-#endif
-
#define PAGE_SNAPSHOT_FAITHFUL (1 << 0)
#define PAGE_SNAPSHOT_PG_BUDDY (1 << 1)
#define PAGE_SNAPSHOT_PG_IDLE (1 << 2)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 0091ad1986bf..edf5418c91dd 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -934,6 +934,7 @@ enum pagetype {
PGTY_zsmalloc = 0xf6,
PGTY_unaccepted = 0xf7,
PGTY_large_kmalloc = 0xf8,
+ PGTY_netpp = 0xf9,
PGTY_mapcount_underflow = 0xff
};
@@ -1078,6 +1079,11 @@ PAGE_TYPE_OPS(Zsmalloc, zsmalloc, zsmalloc)
PAGE_TYPE_OPS(Unaccepted, unaccepted, unaccepted)
FOLIO_TYPE_OPS(large_kmalloc, large_kmalloc)
+/*
+ * Marks page_pool allocated pages.
+ */
+PAGE_TYPE_OPS(Netpp, netpp, netpp)
+
/**
* PageHuge - Determine if the page belongs to hugetlbfs
* @page: The page to test.
diff --git a/include/net/netmem.h b/include/net/netmem.h
index 651e2c62d1dd..ab107b05341e 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -114,10 +114,21 @@ struct net_iov {
atomic_long_t pp_ref_count;
};
};
- struct net_iov_area *owner;
+
+ unsigned int page_type;
enum net_iov_type type;
+ struct net_iov_area *owner;
};
+/* Make sure 'the offset of page_type in struct page == the offset of
* page_type in struct net_iov'.
*/
+#define NET_IOV_ASSERT_OFFSET(pg, iov) \
+ static_assert(offsetof(struct page, pg) == \
+ offsetof(struct net_iov, iov))
+NET_IOV_ASSERT_OFFSET(page_type, page_type);
+#undef NET_IOV_ASSERT_OFFSET
+
struct net_iov_area {
/* Array of net_iovs for this area. */
struct net_iov *niovs;
@@ -260,7 +271,7 @@ static inline unsigned long netmem_pfn_trace(netmem_ref netmem)
*/
#define pp_page_to_nmdesc(p) \
({ \
- DEBUG_NET_WARN_ON_ONCE(!page_pool_page_is_pp(p)); \
+ DEBUG_NET_WARN_ON_ONCE(!PageNetpp(p)); \
__pp_page_to_nmdesc(p); \
})
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 600d9e981c23..01dd14123065 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1041,7 +1041,6 @@ static inline bool page_expected_state(struct page *page,
#ifdef CONFIG_MEMCG
page->memcg_data |
#endif
- page_pool_page_is_pp(page) |
(page->flags.f & check_flags)))
return false;
@@ -1068,8 +1067,6 @@ static const char *page_bad_reason(struct page *page, unsigned long flags)
if (unlikely(page->memcg_data))
bad_reason = "page still charged to cgroup";
#endif
- if (unlikely(page_pool_page_is_pp(page)))
- bad_reason = "page_pool leak";
return bad_reason;
}
@@ -1378,9 +1375,12 @@ __always_inline bool free_pages_prepare(struct page *page,
mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
folio->mapping = NULL;
}
- if (unlikely(page_has_type(page)))
+ if (unlikely(page_has_type(page))) {
+ /* networking expects to clear its page type before releasing */
+ WARN_ON_ONCE(PageNetpp(page));
/* Reset the page_type (which overlays _mapcount) */
page->page_type = UINT_MAX;
+ }
if (is_check_pages_enabled()) {
if (free_page_is_bad(page))
diff --git a/net/core/netmem_priv.h b/net/core/netmem_priv.h
index 23175cb2bd86..6b7d90b8ebb9 100644
--- a/net/core/netmem_priv.h
+++ b/net/core/netmem_priv.h
@@ -8,21 +8,15 @@ static inline unsigned long netmem_get_pp_magic(netmem_ref netmem)
return netmem_to_nmdesc(netmem)->pp_magic & ~PP_DMA_INDEX_MASK;
}
-static inline void netmem_or_pp_magic(netmem_ref netmem, unsigned long pp_magic)
-{
- netmem_to_nmdesc(netmem)->pp_magic |= pp_magic;
-}
-
-static inline void netmem_clear_pp_magic(netmem_ref netmem)
-{
- WARN_ON_ONCE(netmem_to_nmdesc(netmem)->pp_magic & PP_DMA_INDEX_MASK);
-
- netmem_to_nmdesc(netmem)->pp_magic = 0;
-}
-
static inline bool netmem_is_pp(netmem_ref netmem)
{
- return (netmem_get_pp_magic(netmem) & PP_MAGIC_MASK) == PP_SIGNATURE;
+ /* XXX: Now that page_type sits at the same offset in struct
* page and struct net_iov, clear the NET_IOV bit (if any) and
* cast the netmem to struct page unconditionally, no matter
* whether it is backed by a struct net_iov or a struct page.
* This should be adjusted once the offsets no longer match.
*/
+ return PageNetpp(__netmem_to_page((__force unsigned long)netmem & ~NET_IOV));
}
static inline void netmem_set_pp(netmem_ref netmem, struct page_pool *pool)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 1a5edec485f1..27650ca789d1 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -699,7 +699,14 @@ s32 page_pool_inflight(const struct page_pool *pool, bool strict)
void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
{
netmem_set_pp(netmem, pool);
- netmem_or_pp_magic(netmem, PP_SIGNATURE);
+
+ /* XXX: Now that page_type sits at the same offset in struct
* page and struct net_iov, clear the NET_IOV bit (if any) and
* cast the netmem to struct page unconditionally, no matter
* whether it is backed by a struct net_iov or a struct page.
* This should be adjusted once the offsets no longer match.
*/
+ __SetPageNetpp(__netmem_to_page((__force unsigned long)netmem & ~NET_IOV));
/* Ensuring all pages have been split into one fragment initially:
* page_pool_set_pp_info() is only called once for every page when it
@@ -714,7 +721,14 @@ void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
void page_pool_clear_pp_info(netmem_ref netmem)
{
- netmem_clear_pp_magic(netmem);
+ /* XXX: Now that page_type sits at the same offset in struct
* page and struct net_iov, clear the NET_IOV bit (if any) and
* cast the netmem to struct page unconditionally, no matter
* whether it is backed by a struct net_iov or a struct page.
* This should be adjusted once the offsets no longer match.
*/
+ __ClearPageNetpp(__netmem_to_page((__force unsigned long)netmem & ~NET_IOV));
+
netmem_set_pp(netmem, NULL);
}
base-commit: e1f5bb196f0b0eee197e06d361f8ac5f091c2963
--
2.17.1
* Re: [RFC mm v6] mm: introduce a new page type for page pool in page type
From: Toke Høiland-Jørgensen @ 2025-11-17 11:20 UTC
To: Byungchul Park, linux-mm, netdev
Cc: linux-kernel, kernel_team, harry.yoo, ast, daniel, davem, kuba,
hawk, john.fastabend, sdf, saeedm, leon, tariqt, mbloch,
andrew+netdev, edumazet, pabeni, akpm, david, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, horms, jackmanb,
hannes, ziy, ilias.apalodimas, willy, brauner, kas, yuzhao,
usamaarif642, baolin.wang, almasrymina, asml.silence, bpf,
linux-rdma, sfr, dw, ap420073, dtatulea
Byungchul Park <byungchul@sk.com> writes:
> Currently, the condition 'page->pp_magic == PP_SIGNATURE' is used to
> determine if a page belongs to a page pool. However, with the planned
> removal of @pp_magic, we should instead leverage the page_type in struct
> page, such as PGTY_netpp, for this purpose.
>
> Introduce and use the page type APIs e.g. PageNetpp(), __SetPageNetpp(),
> and __ClearPageNetpp() instead, and remove the existing APIs accessing
> @pp_magic e.g. page_pool_page_is_pp(), netmem_or_pp_magic(), and
> netmem_clear_pp_magic().
>
> Plus, add @page_type to struct net_iov at the same offset as in struct
> page so as to use the page_type APIs for struct net_iov as well. While
> at it, reorder @type and @owner in struct net_iov to avoid a hole and
> an increase in the struct size.
>
> This work was inspired by the following link:
>
> https://lore.kernel.org/all/582f41c0-2742-4400-9c81-0d46bf4e8314@gmail.com/
>
> While at it, move the page pool sanity check to the free path.
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Co-developed-by: Pavel Begunkov <asml.silence@gmail.com>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> Signed-off-by: Byungchul Park <byungchul@sk.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> Acked-by: Zi Yan <ziy@nvidia.com>
> ---
> I dropped all the Reviewed-by and Acked-by tags given for the network
> changes since I changed how that part is implemented at Jakub's request.
> Can I keep your tags? Jakub, are you okay with this change?
LGTM, you can keep mine :)
-Toke
* Re: [RFC mm v6] mm: introduce a new page type for page pool in page type
From: Jesper Dangaard Brouer @ 2025-11-17 16:02 UTC
To: Byungchul Park, linux-mm, netdev
Cc: linux-kernel, kernel_team, harry.yoo, ast, daniel, davem, kuba,
john.fastabend, sdf, saeedm, leon, tariqt, mbloch, andrew+netdev,
edumazet, pabeni, akpm, david, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, horms, jackmanb, hannes, ziy,
ilias.apalodimas, willy, brauner, kas, yuzhao, usamaarif642,
baolin.wang, almasrymina, toke, asml.silence, bpf, linux-rdma,
sfr, dw, ap420073, dtatulea
On 17/11/2025 06.20, Byungchul Park wrote:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 600d9e981c23..01dd14123065 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1041,7 +1041,6 @@ static inline bool page_expected_state(struct page *page,
> #ifdef CONFIG_MEMCG
> page->memcg_data |
> #endif
> - page_pool_page_is_pp(page) |
> (page->flags.f & check_flags)))
> return false;
>
> @@ -1068,8 +1067,6 @@ static const char *page_bad_reason(struct page *page, unsigned long flags)
> if (unlikely(page->memcg_data))
> bad_reason = "page still charged to cgroup";
> #endif
> - if (unlikely(page_pool_page_is_pp(page)))
> - bad_reason = "page_pool leak";
> return bad_reason;
> }
This code has helped us catch leaks in the past.
When this happens, the result is that the page is marked as a bad page.
>
> @@ -1378,9 +1375,12 @@ __always_inline bool free_pages_prepare(struct page *page,
> mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
> folio->mapping = NULL;
> }
> - if (unlikely(page_has_type(page)))
> + if (unlikely(page_has_type(page))) {
> + /* networking expects to clear its page type before releasing */
> + WARN_ON_ONCE(PageNetpp(page));
> /* Reset the page_type (which overlays _mapcount) */
> page->page_type = UINT_MAX;
> + }
>
> if (is_check_pages_enabled()) {
> if (free_page_is_bad(page))
What happens to the page? ... when it gets marked with:
page->page_type = UINT_MAX
Will it get freed and allowed to be used by others?
- if so it can result in other hard-to-catch bugs
--Jesper
* Re: [RFC mm v6] mm: introduce a new page type for page pool in page type
From: David Hildenbrand (Red Hat) @ 2025-11-17 16:47 UTC
To: Jesper Dangaard Brouer, Byungchul Park, linux-mm, netdev
Cc: linux-kernel, kernel_team, harry.yoo, ast, daniel, davem, kuba,
john.fastabend, sdf, saeedm, leon, tariqt, mbloch, andrew+netdev,
edumazet, pabeni, akpm, lorenzo.stoakes, Liam.Howlett, vbabka,
rppt, surenb, mhocko, horms, jackmanb, hannes, ziy,
ilias.apalodimas, willy, brauner, kas, yuzhao, usamaarif642,
baolin.wang, almasrymina, toke, asml.silence, bpf, linux-rdma,
sfr, dw, ap420073, dtatulea
On 17.11.25 17:02, Jesper Dangaard Brouer wrote:
>
> On 17/11/2025 06.20, Byungchul Park wrote:
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 600d9e981c23..01dd14123065 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1041,7 +1041,6 @@ static inline bool page_expected_state(struct page *page,
>> #ifdef CONFIG_MEMCG
>> page->memcg_data |
>> #endif
>> - page_pool_page_is_pp(page) |
>> (page->flags.f & check_flags)))
>> return false;
>>
>> @@ -1068,8 +1067,6 @@ static const char *page_bad_reason(struct page *page, unsigned long flags)
>> if (unlikely(page->memcg_data))
>> bad_reason = "page still charged to cgroup";
>> #endif
>> - if (unlikely(page_pool_page_is_pp(page)))
>> - bad_reason = "page_pool leak";
>> return bad_reason;
>> }
>
> This code has helped us catch leaks in the past.
> When this happens, the result is that the page is marked as a bad page.
>
>>
>> @@ -1378,9 +1375,12 @@ __always_inline bool free_pages_prepare(struct page *page,
>> mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
>> folio->mapping = NULL;
>> }
>> - if (unlikely(page_has_type(page)))
>> + if (unlikely(page_has_type(page))) {
>> + /* networking expects to clear its page type before releasing */
>> + WARN_ON_ONCE(PageNetpp(page));
>> /* Reset the page_type (which overlays _mapcount) */
>> page->page_type = UINT_MAX;
>> + }
>>
>> if (is_check_pages_enabled()) {
>> if (free_page_is_bad(page))
>
> What happens to the page? ... when it gets marked with:
> page->page_type = UINT_MAX
>
> Will it get freed and allowed to be used by others?
> - if so it can result in other hard-to-catch bugs
Yes, just like most other use-after-free bugs from any other subsystem in the
kernel :)
The expectation is that such BUGs are found early during testing
(triggering a WARN) such that they can be fixed early.
But we could also report a bad page here and just stop (return false).
--
Cheers
David
* Re: [RFC mm v6] mm: introduce a new page type for page pool in page type
From: Byungchul Park @ 2025-11-18 1:07 UTC
To: David Hildenbrand (Red Hat)
Cc: Jesper Dangaard Brouer, linux-mm, netdev, linux-kernel,
kernel_team, harry.yoo, ast, daniel, davem, kuba, john.fastabend,
sdf, saeedm, leon, tariqt, mbloch, andrew+netdev, edumazet,
pabeni, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
surenb, mhocko, horms, jackmanb, hannes, ziy, ilias.apalodimas,
willy, brauner, kas, yuzhao, usamaarif642, baolin.wang,
almasrymina, toke, asml.silence, bpf, linux-rdma, sfr, dw,
ap420073, dtatulea
On Mon, Nov 17, 2025 at 05:47:05PM +0100, David Hildenbrand (Red Hat) wrote:
> On 17.11.25 17:02, Jesper Dangaard Brouer wrote:
> >
> > On 17/11/2025 06.20, Byungchul Park wrote:
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 600d9e981c23..01dd14123065 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1041,7 +1041,6 @@ static inline bool page_expected_state(struct page *page,
> > > #ifdef CONFIG_MEMCG
> > > page->memcg_data |
> > > #endif
> > > - page_pool_page_is_pp(page) |
> > > (page->flags.f & check_flags)))
> > > return false;
> > >
> > > @@ -1068,8 +1067,6 @@ static const char *page_bad_reason(struct page *page, unsigned long flags)
> > > if (unlikely(page->memcg_data))
> > > bad_reason = "page still charged to cgroup";
> > > #endif
> > > - if (unlikely(page_pool_page_is_pp(page)))
> > > - bad_reason = "page_pool leak";
> > > return bad_reason;
> > > }
> >
> > This code has helped us catch leaks in the past.
> > When this happens, the result is that the page is marked as a bad page.
> >
> > >
> > > @@ -1378,9 +1375,12 @@ __always_inline bool free_pages_prepare(struct page *page,
> > > mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
> > > folio->mapping = NULL;
> > > }
> > > - if (unlikely(page_has_type(page)))
> > > + if (unlikely(page_has_type(page))) {
> > > + /* networking expects to clear its page type before releasing */
> > > + WARN_ON_ONCE(PageNetpp(page));
> > > /* Reset the page_type (which overlays _mapcount) */
> > > page->page_type = UINT_MAX;
> > > + }
> > >
> > > if (is_check_pages_enabled()) {
> > > if (free_page_is_bad(page))
> >
> > What happens to the page? ... when it gets marked with:
> > page->page_type = UINT_MAX
> >
> > Will it get freed and allowed to be used by others?
> > - if so it can result in other hard-to-catch bugs
>
> Yes, just like most other use-after-free bugs from any other subsystem in the
> kernel :)
>
> The expectation is that such BUGs are found early during testing
> (triggering a WARN) such that they can be fixed early.
>
> But we could also report a bad page here and just stop (return false).
I think the WARN_ON_ONCE() makes the problematic situation detectable.
However, if we should prevent the page from being used upon detection,
sure, I can update the patch.
Thanks,
Byungchul
>
> --
> Cheers
>
> David
* Re: [RFC mm v6] mm: introduce a new page type for page pool in page type
From: Byungchul Park @ 2025-11-18 1:18 UTC
To: David Hildenbrand (Red Hat)
Cc: Jesper Dangaard Brouer, linux-mm, netdev, linux-kernel,
kernel_team, harry.yoo, ast, daniel, davem, kuba, john.fastabend,
sdf, saeedm, leon, tariqt, mbloch, andrew+netdev, edumazet,
pabeni, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
surenb, mhocko, horms, jackmanb, hannes, ziy, ilias.apalodimas,
willy, brauner, kas, yuzhao, usamaarif642, baolin.wang,
almasrymina, toke, asml.silence, bpf, linux-rdma, sfr, dw,
ap420073, dtatulea
On Tue, Nov 18, 2025 at 10:07:35AM +0900, Byungchul Park wrote:
> On Mon, Nov 17, 2025 at 05:47:05PM +0100, David Hildenbrand (Red Hat) wrote:
> > On 17.11.25 17:02, Jesper Dangaard Brouer wrote:
> > >
> > > On 17/11/2025 06.20, Byungchul Park wrote:
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index 600d9e981c23..01dd14123065 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -1041,7 +1041,6 @@ static inline bool page_expected_state(struct page *page,
> > > > #ifdef CONFIG_MEMCG
> > > > page->memcg_data |
> > > > #endif
> > > > - page_pool_page_is_pp(page) |
> > > > (page->flags.f & check_flags)))
> > > > return false;
> > > >
> > > > @@ -1068,8 +1067,6 @@ static const char *page_bad_reason(struct page *page, unsigned long flags)
> > > > if (unlikely(page->memcg_data))
> > > > bad_reason = "page still charged to cgroup";
> > > > #endif
> > > > - if (unlikely(page_pool_page_is_pp(page)))
> > > > - bad_reason = "page_pool leak";
> > > > return bad_reason;
> > > > }
> > >
> > > This code has helped us catch leaks in the past.
> > > When this happens, the result is that the page is marked as a bad page.
> > >
> > > >
> > > > @@ -1378,9 +1375,12 @@ __always_inline bool free_pages_prepare(struct page *page,
> > > > mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
> > > > folio->mapping = NULL;
> > > > }
> > > > - if (unlikely(page_has_type(page)))
> > > > + if (unlikely(page_has_type(page))) {
> > > > + /* networking expects to clear its page type before releasing */
> > > > + WARN_ON_ONCE(PageNetpp(page));
> > > > /* Reset the page_type (which overlays _mapcount) */
> > > > page->page_type = UINT_MAX;
> > > > + }
> > > >
> > > > if (is_check_pages_enabled()) {
> > > > if (free_page_is_bad(page))
> > >
> > > What happens to the page? ... when it gets marked with:
> > > page->page_type = UINT_MAX
> > >
> > > Will it get freed and allowed to be used by others?
> > > - if so it can result in other hard-to-catch bugs
> >
> > Yes, just like most other use-after-free bugs from any other subsystem in the
> > kernel :)
> >
> > The expectation is that such BUGs are found early during testing
> > (triggering a WARN) such that they can be fixed early.
> >
> > But we could also report a bad page here and just stop (return false).
>
> I think the WARN_ON_ONCE() makes the problematic situation detectable.
> However, if we should prevent the page from being used upon detection,
> sure, I can update the patch.
I will respin with the following diff folded in on top.
Byungchul
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 01dd14123065..5ae55a5d7b5d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1377,7 +1377,10 @@ __always_inline bool free_pages_prepare(struct page *page,
}
if (unlikely(page_has_type(page))) {
/* networking expects to clear its page type before releasing */
- WARN_ON_ONCE(PageNetpp(page));
+ if (unlikely(PageNetpp(page))) {
+ bad_page(page, "page_pool leak");
+ return false;
+ }
/* Reset the page_type (which overlays _mapcount) */
page->page_type = UINT_MAX;
}
>
> Thanks,
> Byungchul
>
> >
> > --
> > Cheers
> >
> > David
* Re: [RFC mm v6] mm: introduce a new page type for page pool in page type
From: Jesper Dangaard Brouer @ 2025-11-18 9:41 UTC
To: Byungchul Park, David Hildenbrand (Red Hat)
Cc: linux-mm, netdev, linux-kernel, kernel_team, harry.yoo, ast,
daniel, davem, kuba, john.fastabend, sdf, saeedm, leon, tariqt,
mbloch, andrew+netdev, edumazet, pabeni, akpm, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, horms, jackmanb,
hannes, ziy, ilias.apalodimas, willy, brauner, kas, yuzhao,
usamaarif642, baolin.wang, almasrymina, toke, asml.silence, bpf,
linux-rdma, sfr, dw, ap420073, dtatulea
On 18/11/2025 02.18, Byungchul Park wrote:
> On Tue, Nov 18, 2025 at 10:07:35AM +0900, Byungchul Park wrote:
>> On Mon, Nov 17, 2025 at 05:47:05PM +0100, David Hildenbrand (Red Hat) wrote:
>>> On 17.11.25 17:02, Jesper Dangaard Brouer wrote:
>>>>
>>>> On 17/11/2025 06.20, Byungchul Park wrote:
>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>> index 600d9e981c23..01dd14123065 100644
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -1041,7 +1041,6 @@ static inline bool page_expected_state(struct page *page,
>>>>> #ifdef CONFIG_MEMCG
>>>>> page->memcg_data |
>>>>> #endif
>>>>> - page_pool_page_is_pp(page) |
>>>>> (page->flags.f & check_flags)))
>>>>> return false;
>>>>>
>>>>> @@ -1068,8 +1067,6 @@ static const char *page_bad_reason(struct page *page, unsigned long flags)
>>>>> if (unlikely(page->memcg_data))
>>>>> bad_reason = "page still charged to cgroup";
>>>>> #endif
>>>>> - if (unlikely(page_pool_page_is_pp(page)))
>>>>> - bad_reason = "page_pool leak";
>>>>> return bad_reason;
>>>>> }
>>>>
>>>> This code has helped us catch leaks in the past.
>>>> When this happens, the result is that the page is marked as a bad page.
>>>>
>>>>>
>>>>> @@ -1378,9 +1375,12 @@ __always_inline bool free_pages_prepare(struct page *page,
>>>>> mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
>>>>> folio->mapping = NULL;
>>>>> }
>>>>> - if (unlikely(page_has_type(page)))
>>>>> + if (unlikely(page_has_type(page))) {
>>>>> + /* networking expects to clear its page type before releasing */
>>>>> + WARN_ON_ONCE(PageNetpp(page));
>>>>> /* Reset the page_type (which overlays _mapcount) */
>>>>> page->page_type = UINT_MAX;
>>>>> + }
>>>>>
>>>>> if (is_check_pages_enabled()) {
>>>>> if (free_page_is_bad(page))
>>>>
>>>> What happens to the page? ... when it gets marked with:
>>>> page->page_type = UINT_MAX
>>>>
>>>> Will it get freed and allowed to be used by others?
>>>> - if so it can result in other hard-to-catch bugs
>>>
>>> Yes, just like most other use-after-free bugs from any other subsystem in the
>>> kernel :)
>>>
>>> The expectation is that such BUGs are found early during testing
>>> (triggering a WARN) such that they can be fixed early.
>>>
>>> But we could also report a bad page here and just stop (return false).
I agree that we want to catch these bugs early by triggering a WARN.
The bad_page() call also triggers dump_stack() and has a burst limiter,
which I like. We are running with CONFIG_DEBUG_VM=y in production (as
the measured overhead was minimal) to monitor these kinds of leaks.

For the case with page_pool, we *could* recover more gracefully, by
returning the page to the page_pool (page->pp) instance. But I'm
reluctant to take this path, as that puts less pressure on fixing the
leak once we have "recovered"; it becomes a warning and not a bug.
Opinions are welcome: should we recover or do bad_page()?
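For concreteness, such a recovery could look roughly like the sketch
below. To be clear, this is hypothetical and not part of the posted
patch: it glosses over the layering problem that mm/page_alloc.c does
not include the page_pool headers providing pp_page_to_nmdesc() and
page_pool_put_full_page(), and it assumes the owning pool is still
alive at this point, which is not guaranteed.

	/* Hypothetical sketch only -- not part of the posted patch. */
	if (unlikely(PageNetpp(page))) {
		/* Look up the owning pool; assumes it has not been
		 * destroyed, which free_pages_prepare() cannot verify.
		 */
		struct page_pool *pool = pp_page_to_nmdesc(page)->pp;

		/* Hand the leaked page back to its pool instead of
		 * releasing it to the buddy allocator.
		 */
		page_pool_put_full_page(pool, page, false);
		return false;	/* page kept, not freed to buddy */
	}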
>>
>> I think the WARN_ON_ONCE() makes the problematic situation detectable.
>> However, if we should prevent the page from being used upon detection,
>> sure, I can update the patch.
>
> I will respin with the following diff folded in on top.
LGTM
> Byungchul
> ---
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 01dd14123065..5ae55a5d7b5d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1377,7 +1377,10 @@ __always_inline bool free_pages_prepare(struct page *page,
> }
> if (unlikely(page_has_type(page))) {
> /* networking expects to clear its page type before releasing */
> - WARN_ON_ONCE(PageNetpp(page));
> + if (unlikely(PageNetpp(page))) {
> + bad_page(page, "page_pool leak");
> + return false;
> + }
> /* Reset the page_type (which overlays _mapcount) */
> page->page_type = UINT_MAX;
> }
>
>>
>> Thanks,
>> Byungchul
>>
>>>
>>> --
>>> Cheers
>>>
>>> David
--Jesper
* Re: [RFC mm v6] mm: introduce a new page type for page pool in page type
From: Toke Høiland-Jørgensen @ 2025-11-18 10:20 UTC
To: Jesper Dangaard Brouer, Byungchul Park, David Hildenbrand (Red Hat)
Cc: linux-mm, netdev, linux-kernel, kernel_team, harry.yoo, ast,
daniel, davem, kuba, john.fastabend, sdf, saeedm, leon, tariqt,
mbloch, andrew+netdev, edumazet, pabeni, akpm, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, horms, jackmanb,
hannes, ziy, ilias.apalodimas, willy, brauner, kas, yuzhao,
usamaarif642, baolin.wang, almasrymina, asml.silence, bpf,
linux-rdma, sfr, dw, ap420073, dtatulea
Jesper Dangaard Brouer <hawk@kernel.org> writes:
> On 18/11/2025 02.18, Byungchul Park wrote:
>> On Tue, Nov 18, 2025 at 10:07:35AM +0900, Byungchul Park wrote:
>>> On Mon, Nov 17, 2025 at 05:47:05PM +0100, David Hildenbrand (Red Hat) wrote:
>>>> On 17.11.25 17:02, Jesper Dangaard Brouer wrote:
>>>>>
>>>>> On 17/11/2025 06.20, Byungchul Park wrote:
>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>> index 600d9e981c23..01dd14123065 100644
>>>>>> --- a/mm/page_alloc.c
>>>>>> +++ b/mm/page_alloc.c
>>>>>> @@ -1041,7 +1041,6 @@ static inline bool page_expected_state(struct page *page,
>>>>>> #ifdef CONFIG_MEMCG
>>>>>> page->memcg_data |
>>>>>> #endif
>>>>>> - page_pool_page_is_pp(page) |
>>>>>> (page->flags.f & check_flags)))
>>>>>> return false;
>>>>>>
>>>>>> @@ -1068,8 +1067,6 @@ static const char *page_bad_reason(struct page *page, unsigned long flags)
>>>>>> if (unlikely(page->memcg_data))
>>>>>> bad_reason = "page still charged to cgroup";
>>>>>> #endif
>>>>>> - if (unlikely(page_pool_page_is_pp(page)))
>>>>>> - bad_reason = "page_pool leak";
>>>>>> return bad_reason;
>>>>>> }
>>>>>
>>>>> This code has helped us catch leaks in the past.
>>>>> When this happens, the result is that the page is marked as a bad page.
>>>>>
>>>>>>
>>>>>> @@ -1378,9 +1375,12 @@ __always_inline bool free_pages_prepare(struct page *page,
>>>>>> mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
>>>>>> folio->mapping = NULL;
>>>>>> }
>>>>>> - if (unlikely(page_has_type(page)))
>>>>>> + if (unlikely(page_has_type(page))) {
>>>>>> + /* networking expects to clear its page type before releasing */
>>>>>> + WARN_ON_ONCE(PageNetpp(page));
>>>>>> /* Reset the page_type (which overlays _mapcount) */
>>>>>> page->page_type = UINT_MAX;
>>>>>> + }
>>>>>>
>>>>>> if (is_check_pages_enabled()) {
>>>>>> if (free_page_is_bad(page))
>>>>>
>>>>> What happens to the page? ... when it gets marked with:
>>>>> page->page_type = UINT_MAX
>>>>>
>>>>> Will it get freed and allowed to be used by others?
>>>>> - if so it can result in other hard-to-catch bugs
>>>>
>>>> Yes, just like most other use-after-free bugs from any other subsystem in the
>>>> kernel :)
>>>>
>>>> The expectation is that such BUGs are found early during testing
>>>> (triggering a WARN) such that they can be fixed early.
>>>>
>>>> But we could also report a bad page here and just stop (return false).
>
> I agree that we want to catch these bugs early by triggering a WARN.
> The bad_page() call also triggers dump_stack() and has a burst limiter,
> which I like. We are running with CONFIG_DEBUG_VM=y in production (as
> the measured overhead was minimal) to monitor these kinds of leaks.
>
> For the case with page_pool, we *could* recover more gracefully, by
> returning the page to the page_pool (page->pp) instance. But I'm
> reluctant to take this path, as that puts less pressure on fixing the
> leak once we have "recovered"; it becomes a warning and not a bug.
> Opinions are welcome: should we recover or do bad_page()?
I think we should do bad_page() to get the bugs fixed :)
-Toke