* [RFC PATCH 1/7] mm: Use frozen pages for page tables
2025-10-20 0:16 [RFC PATCH 0/7] Separate ptdesc from struct page Matthew Wilcox (Oracle)
@ 2025-10-20 0:16 ` Matthew Wilcox (Oracle)
2025-10-20 0:16 ` [RFC PATCH 2/7] mm: Account pagetable memory when allocated Matthew Wilcox (Oracle)
` (6 subsequent siblings)
7 siblings, 0 replies; 17+ messages in thread
From: Matthew Wilcox (Oracle) @ 2025-10-20 0:16 UTC (permalink / raw)
To: linux-mm; +Cc: Matthew Wilcox (Oracle), Vishal Moola, Johannes Weiner
Page tables do not use the reference count. That means we can avoid
two atomic operations (one on alloc, one on free) by allocating frozen
pages here. This does not interfere with compaction as page tables are
non-movable allocations.
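As a rough sketch of what that buys us: today the refcounted alloc/free
paths are essentially thin wrappers around the frozen ones (simplified
here from mm/page_alloc.c; the real code handles more cases):

struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
		int preferred_nid, nodemask_t *nodemask)
{
	struct page *page;

	page = __alloc_frozen_pages_noprof(gfp, order, preferred_nid, nodemask);
	if (page)
		set_page_refcounted(page);	/* atomic op #1, skipped for page tables */
	return page;
}

void __free_pages(struct page *page, unsigned int order)
{
	/* atomic op #2; compound/tail-page handling elided from this sketch */
	if (put_page_testzero(page))
		free_frozen_pages(page, order);
}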
pagetable_alloc() and pagetable_free() need to move out of line to make
this work, as alloc_frozen_pages() and free_frozen_pages() are not exported
outside the mm for now. We'll want them out of line anyway soon.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
include/linux/mm.h | 32 ++------------------------------
mm/memory.c | 31 +++++++++++++++++++++++++++++++
2 files changed, 33 insertions(+), 30 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d16b33bacc32..ec9365375d9c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2965,37 +2965,9 @@ static inline bool pagetable_is_reserved(struct ptdesc *pt)
return test_bit(PT_reserved, &pt->pt_flags.f);
}
-/**
- * pagetable_alloc - Allocate pagetables
- * @gfp: GFP flags
- * @order: desired pagetable order
- *
- * pagetable_alloc allocates memory for page tables as well as a page table
- * descriptor to describe that memory.
- *
- * Return: The ptdesc describing the allocated page tables.
- */
-static inline struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int order)
-{
- struct page *page = alloc_pages_noprof(gfp | __GFP_COMP, order);
-
- return page_ptdesc(page);
-}
+struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int order);
#define pagetable_alloc(...) alloc_hooks(pagetable_alloc_noprof(__VA_ARGS__))
-
-/**
- * pagetable_free - Free pagetables
- * @pt: The page table descriptor
- *
- * pagetable_free frees the memory of all page tables described by a page
- * table descriptor and the memory for the descriptor itself.
- */
-static inline void pagetable_free(struct ptdesc *pt)
-{
- struct page *page = ptdesc_page(pt);
-
- __free_pages(page, compound_order(page));
-}
+void pagetable_free(struct ptdesc *pt);
#if defined(CONFIG_SPLIT_PTE_PTLOCKS)
#if ALLOC_SPLIT_PTLOCKS
diff --git a/mm/memory.c b/mm/memory.c
index 74b45e258323..de9f999ffcf6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7267,6 +7267,37 @@ long copy_folio_from_user(struct folio *dst_folio,
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
+/**
+ * pagetable_alloc - Allocate pagetables
+ * @gfp: GFP flags
+ * @order: desired pagetable order
+ *
+ * pagetable_alloc allocates memory for page tables as well as a page table
+ * descriptor to describe that memory.
+ *
+ * Return: The ptdesc describing the allocated page tables.
+ */
+struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int order)
+{
+ struct page *page = alloc_frozen_pages_noprof(gfp | __GFP_COMP, order);
+
+ return page_ptdesc(page);
+}
+
+/**
+ * pagetable_free - Free pagetables
+ * @pt: The page table descriptor
+ *
+ * pagetable_free frees the memory of all page tables described by a page
+ * table descriptor and the memory for the descriptor itself.
+ */
+void pagetable_free(struct ptdesc *pt)
+{
+ struct page *page = ptdesc_page(pt);
+
+ free_frozen_pages(page, compound_order(page));
+}
+
#if defined(CONFIG_SPLIT_PTE_PTLOCKS) && ALLOC_SPLIT_PTLOCKS
static struct kmem_cache *page_ptl_cachep;
--
2.47.2
* [RFC PATCH 2/7] mm: Account pagetable memory when allocated
2025-10-20 0:16 [RFC PATCH 0/7] Separate ptdesc from struct page Matthew Wilcox (Oracle)
2025-10-20 0:16 ` [RFC PATCH 1/7] mm: Use frozen pages for page tables Matthew Wilcox (Oracle)
@ 2025-10-20 0:16 ` Matthew Wilcox (Oracle)
2025-11-11 7:09 ` Anshuman Khandual
2025-10-20 0:16 ` [RFC PATCH 3/7] mm: Mark " Matthew Wilcox (Oracle)
` (5 subsequent siblings)
7 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox (Oracle) @ 2025-10-20 0:16 UTC (permalink / raw)
To: linux-mm; +Cc: Matthew Wilcox (Oracle), Vishal Moola, Johannes Weiner
Move the accounting from the constructor to the allocation site.
Some of the architecture code is a little complex to reason about,
but I think this is all correct (and slightly more efficient due
to having 'order' as an argument instead of having to retrieve it
from struct page again).
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
include/linux/mm.h | 11 -----------
mm/memory.c | 11 ++++++++++-
2 files changed, 10 insertions(+), 12 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ec9365375d9c..ade37df5ea2b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3044,26 +3044,15 @@ static inline bool ptlock_init(struct ptdesc *ptdesc) { return true; }
static inline void ptlock_free(struct ptdesc *ptdesc) {}
#endif /* defined(CONFIG_SPLIT_PTE_PTLOCKS) */
-static inline unsigned long ptdesc_nr_pages(const struct ptdesc *ptdesc)
-{
- return compound_nr(ptdesc_page(ptdesc));
-}
-
static inline void __pagetable_ctor(struct ptdesc *ptdesc)
{
- pg_data_t *pgdat = NODE_DATA(memdesc_nid(ptdesc->pt_flags));
-
__SetPageTable(ptdesc_page(ptdesc));
- mod_node_page_state(pgdat, NR_PAGETABLE, ptdesc_nr_pages(ptdesc));
}
static inline void pagetable_dtor(struct ptdesc *ptdesc)
{
- pg_data_t *pgdat = NODE_DATA(memdesc_nid(ptdesc->pt_flags));
-
ptlock_free(ptdesc);
__ClearPageTable(ptdesc_page(ptdesc));
- mod_node_page_state(pgdat, NR_PAGETABLE, -ptdesc_nr_pages(ptdesc));
}
static inline void pagetable_dtor_free(struct ptdesc *ptdesc)
diff --git a/mm/memory.c b/mm/memory.c
index de9f999ffcf6..033fc4e29232 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7280,7 +7280,13 @@ long copy_folio_from_user(struct folio *dst_folio,
struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int order)
{
struct page *page = alloc_frozen_pages_noprof(gfp | __GFP_COMP, order);
+ pg_data_t *pgdat;
+ if (!page)
+ return NULL;
+
+ pgdat = NODE_DATA(page_to_nid(page));
+ mod_node_page_state(pgdat, NR_PAGETABLE, 1 << order);
return page_ptdesc(page);
}
@@ -7293,9 +7299,12 @@ struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int order)
*/
void pagetable_free(struct ptdesc *pt)
{
+ pg_data_t *pgdat = NODE_DATA(memdesc_nid(pt->pt_flags));
struct page *page = ptdesc_page(pt);
+ unsigned int order = compound_order(page);
- free_frozen_pages(page, compound_order(page));
+ mod_node_page_state(pgdat, NR_PAGETABLE, -(1L << order));
+ free_frozen_pages(page, order);
}
#if defined(CONFIG_SPLIT_PTE_PTLOCKS) && ALLOC_SPLIT_PTLOCKS
--
2.47.2
* Re: [RFC PATCH 2/7] mm: Account pagetable memory when allocated
2025-10-20 0:16 ` [RFC PATCH 2/7] mm: Account pagetable memory when allocated Matthew Wilcox (Oracle)
@ 2025-11-11 7:09 ` Anshuman Khandual
2025-11-11 16:43 ` Matthew Wilcox
0 siblings, 1 reply; 17+ messages in thread
From: Anshuman Khandual @ 2025-11-11 7:09 UTC (permalink / raw)
To: Matthew Wilcox (Oracle), linux-mm; +Cc: Vishal Moola, Johannes Weiner
On 20/10/25 5:46 AM, Matthew Wilcox (Oracle) wrote:
> Move the accounting from the constructor to the allocation site.
> Some of the architecture code is a little complex to reason about,
> but I think this is all correct (and slightly more efficient due
> to having 'order' as an argument instead of having to retrieve it
> from struct page again).
Agreed.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
> include/linux/mm.h | 11 -----------
> mm/memory.c | 11 ++++++++++-
> 2 files changed, 10 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ec9365375d9c..ade37df5ea2b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3044,26 +3044,15 @@ static inline bool ptlock_init(struct ptdesc *ptdesc) { return true; }
> static inline void ptlock_free(struct ptdesc *ptdesc) {}
> #endif /* defined(CONFIG_SPLIT_PTE_PTLOCKS) */
>
> -static inline unsigned long ptdesc_nr_pages(const struct ptdesc *ptdesc)
> -{
> - return compound_nr(ptdesc_page(ptdesc));
> -}
Right - ptdesc_nr_pages() will no longer be required.
> -
> static inline void __pagetable_ctor(struct ptdesc *ptdesc)
> {
> - pg_data_t *pgdat = NODE_DATA(memdesc_nid(ptdesc->pt_flags));
> -
> __SetPageTable(ptdesc_page(ptdesc));
> - mod_node_page_state(pgdat, NR_PAGETABLE, ptdesc_nr_pages(ptdesc));
> }
>
> static inline void pagetable_dtor(struct ptdesc *ptdesc)
> {
> - pg_data_t *pgdat = NODE_DATA(memdesc_nid(ptdesc->pt_flags));
> -
> ptlock_free(ptdesc);
> __ClearPageTable(ptdesc_page(ptdesc));
> - mod_node_page_state(pgdat, NR_PAGETABLE, -ptdesc_nr_pages(ptdesc));
> }
>
> static inline void pagetable_dtor_free(struct ptdesc *ptdesc)
> diff --git a/mm/memory.c b/mm/memory.c
> index de9f999ffcf6..033fc4e29232 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -7280,7 +7280,13 @@ long copy_folio_from_user(struct folio *dst_folio,
> struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int order)
> {
> struct page *page = alloc_frozen_pages_noprof(gfp | __GFP_COMP, order);
> + pg_data_t *pgdat;
>
> + if (!page)
> + return NULL;
Should not this check be moved to the earlier patch instead, during
the page allocation?
> +
> + pgdat = NODE_DATA(page_to_nid(page));
> + mod_node_page_state(pgdat, NR_PAGETABLE, 1 << order);
> return page_ptdesc(page);
> }
>
> @@ -7293,9 +7299,12 @@ struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int order)
> */
> void pagetable_free(struct ptdesc *pt)
> {
> + pg_data_t *pgdat = NODE_DATA(memdesc_nid(pt->pt_flags));
> struct page *page = ptdesc_page(pt);
> + unsigned int order = compound_order(page);
>
> - free_frozen_pages(page, compound_order(page));
> + mod_node_page_state(pgdat, NR_PAGETABLE, -(1L << order));
> + free_frozen_pages(page, order);
> }
>
> #if defined(CONFIG_SPLIT_PTE_PTLOCKS) && ALLOC_SPLIT_PTLOCKS
Otherwise LGTM.
* Re: [RFC PATCH 2/7] mm: Account pagetable memory when allocated
2025-11-11 7:09 ` Anshuman Khandual
@ 2025-11-11 16:43 ` Matthew Wilcox
0 siblings, 0 replies; 17+ messages in thread
From: Matthew Wilcox @ 2025-11-11 16:43 UTC (permalink / raw)
To: Anshuman Khandual; +Cc: linux-mm, Vishal Moola, Johannes Weiner
On Tue, Nov 11, 2025 at 12:39:19PM +0530, Anshuman Khandual wrote:
> > +++ b/mm/memory.c
> > @@ -7280,7 +7280,13 @@ long copy_folio_from_user(struct folio *dst_folio,
> > struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int order)
> > {
> > struct page *page = alloc_frozen_pages_noprof(gfp | __GFP_COMP, order);
> > + pg_data_t *pgdat;
> >
> > + if (!page)
> > + return NULL;
>
> Should not this check be moved to the earlier patch instead, during
> the page allocation?
We could, but since page_ptdesc() is only a cast, there's no need to.
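For reference, the definition being relied on here (until patch 7 replaces
it) is just a cast, so a NULL page simply becomes a NULL ptdesc:

#define page_ptdesc(p)		(_Generic((p),				\
	const struct page *:	(const struct ptdesc *)(p),		\
	struct page *:		(struct ptdesc *)(p)))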
* [RFC PATCH 3/7] mm: Mark pagetable memory when allocated
2025-10-20 0:16 [RFC PATCH 0/7] Separate ptdesc from struct page Matthew Wilcox (Oracle)
2025-10-20 0:16 ` [RFC PATCH 1/7] mm: Use frozen pages for page tables Matthew Wilcox (Oracle)
2025-10-20 0:16 ` [RFC PATCH 2/7] mm: Account pagetable memory when allocated Matthew Wilcox (Oracle)
@ 2025-10-20 0:16 ` Matthew Wilcox (Oracle)
2025-10-20 0:16 ` [RFC PATCH 4/7] pgtable: Remove uses of page->lru Matthew Wilcox (Oracle)
` (4 subsequent siblings)
7 siblings, 0 replies; 17+ messages in thread
From: Matthew Wilcox (Oracle) @ 2025-10-20 0:16 UTC (permalink / raw)
To: linux-mm; +Cc: Matthew Wilcox (Oracle), Vishal Moola, Johannes Weiner
Move the page type setting from the constructor to the allocation site.
Some of the architecture code is a little complex to reason about, but
I think this is all correct. This makes __pagetable_ctor() empty, so
remove it. While pagetable_pud_ctor() and higher levels are now empty,
leave them alone as there may be call to have them do something in future.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
include/linux/mm.h | 11 -----------
mm/memory.c | 2 ++
2 files changed, 2 insertions(+), 11 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ade37df5ea2b..edcb7d75542f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3044,15 +3044,9 @@ static inline bool ptlock_init(struct ptdesc *ptdesc) { return true; }
static inline void ptlock_free(struct ptdesc *ptdesc) {}
#endif /* defined(CONFIG_SPLIT_PTE_PTLOCKS) */
-static inline void __pagetable_ctor(struct ptdesc *ptdesc)
-{
- __SetPageTable(ptdesc_page(ptdesc));
-}
-
static inline void pagetable_dtor(struct ptdesc *ptdesc)
{
ptlock_free(ptdesc);
- __ClearPageTable(ptdesc_page(ptdesc));
}
static inline void pagetable_dtor_free(struct ptdesc *ptdesc)
@@ -3066,7 +3060,6 @@ static inline bool pagetable_pte_ctor(struct mm_struct *mm,
{
if (mm != &init_mm && !ptlock_init(ptdesc))
return false;
- __pagetable_ctor(ptdesc);
return true;
}
@@ -3174,7 +3167,6 @@ static inline bool pagetable_pmd_ctor(struct mm_struct *mm,
if (mm != &init_mm && !pmd_ptlock_init(ptdesc))
return false;
ptdesc_pmd_pts_init(ptdesc);
- __pagetable_ctor(ptdesc);
return true;
}
@@ -3199,17 +3191,14 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
static inline void pagetable_pud_ctor(struct ptdesc *ptdesc)
{
- __pagetable_ctor(ptdesc);
}
static inline void pagetable_p4d_ctor(struct ptdesc *ptdesc)
{
- __pagetable_ctor(ptdesc);
}
static inline void pagetable_pgd_ctor(struct ptdesc *ptdesc)
{
- __pagetable_ctor(ptdesc);
}
extern void __init pagecache_init(void);
diff --git a/mm/memory.c b/mm/memory.c
index 033fc4e29232..47eb5834db23 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7287,6 +7287,7 @@ struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int order)
pgdat = NODE_DATA(page_to_nid(page));
mod_node_page_state(pgdat, NR_PAGETABLE, 1 << order);
+ __SetPageTable(page);
return page_ptdesc(page);
}
@@ -7304,6 +7305,7 @@ void pagetable_free(struct ptdesc *pt)
unsigned int order = compound_order(page);
mod_node_page_state(pgdat, NR_PAGETABLE, -(1L << order));
+ __ClearPageTable(page);
free_frozen_pages(page, order);
}
--
2.47.2
* [RFC PATCH 4/7] pgtable: Remove uses of page->lru
2025-10-20 0:16 [RFC PATCH 0/7] Separate ptdesc from struct page Matthew Wilcox (Oracle)
` (2 preceding siblings ...)
2025-10-20 0:16 ` [RFC PATCH 3/7] mm: Mark " Matthew Wilcox (Oracle)
@ 2025-10-20 0:16 ` Matthew Wilcox (Oracle)
2025-11-11 7:53 ` Anshuman Khandual
2025-10-20 0:16 ` [RFC PATCH 5/7] x86: Call preallocate_vmalloc_pages() later Matthew Wilcox (Oracle)
` (3 subsequent siblings)
7 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox (Oracle) @ 2025-10-20 0:16 UTC (permalink / raw)
To: linux-mm; +Cc: Matthew Wilcox (Oracle), Vishal Moola, Johannes Weiner
Use ptdesc->pt_list instead of page->lru. These are the same bits for
now, but will be different when ptdesc is allocated separately.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
mm/pgtable-generic.c | 24 +++++++++++++++---------
1 file changed, 15 insertions(+), 9 deletions(-)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 567e2d084071..f40bab9e6c46 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -164,13 +164,14 @@ pud_t pudp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
pgtable_t pgtable)
{
+ struct ptdesc *ptdesc = page_ptdesc(pgtable);
assert_spin_locked(pmd_lockptr(mm, pmdp));
/* FIFO */
if (!pmd_huge_pte(mm, pmdp))
- INIT_LIST_HEAD(&pgtable->lru);
+ INIT_LIST_HEAD(&ptdesc->pt_list);
else
- list_add(&pgtable->lru, &pmd_huge_pte(mm, pmdp)->lru);
+ list_add(&ptdesc->pt_list, &page_ptdesc(pmd_huge_pte(mm, pmdp))->pt_list);
pmd_huge_pte(mm, pmdp) = pgtable;
}
#endif
@@ -179,17 +180,22 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
/* no "address" argument so destroys page coloring of some arch */
pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
{
- pgtable_t pgtable;
+ struct ptdesc *ptdesc, *next;
+ struct page *page;
assert_spin_locked(pmd_lockptr(mm, pmdp));
/* FIFO */
- pgtable = pmd_huge_pte(mm, pmdp);
- pmd_huge_pte(mm, pmdp) = list_first_entry_or_null(&pgtable->lru,
- struct page, lru);
- if (pmd_huge_pte(mm, pmdp))
- list_del(&pgtable->lru);
- return pgtable;
+ page = pmd_huge_pte(mm, pmdp);
+ ptdesc = page_ptdesc(page);
+ next = list_first_entry_or_null(&ptdesc->pt_list, struct ptdesc, pt_list);
+ if (next) {
+ pmd_huge_pte(mm, pmdp) = ptdesc_page(next);
+ list_del(&ptdesc->pt_list);
+ } else {
+ pmd_huge_pte(mm, pmdp) = NULL;
+ }
+ return page;
}
#endif
--
2.47.2
* Re: [RFC PATCH 4/7] pgtable: Remove uses of page->lru
2025-10-20 0:16 ` [RFC PATCH 4/7] pgtable: Remove uses of page->lru Matthew Wilcox (Oracle)
@ 2025-11-11 7:53 ` Anshuman Khandual
2025-11-11 18:46 ` Matthew Wilcox
0 siblings, 1 reply; 17+ messages in thread
From: Anshuman Khandual @ 2025-11-11 7:53 UTC (permalink / raw)
To: Matthew Wilcox (Oracle), linux-mm; +Cc: Vishal Moola, Johannes Weiner
On 20/10/25 5:46 AM, Matthew Wilcox (Oracle) wrote:
> Use ptdesc->pt_list instead of page->lru. These are the same bits for
> now, but will be different when ptdesc is allocated separately.
Just wondering - should a corresponding TABLE_MATCH() be added for as
long as the above bits in page and ptdesc remain the same?
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -584,6 +584,7 @@ struct ptdesc {
static_assert(offsetof(struct page, pg) == offsetof(struct ptdesc, pt))
TABLE_MATCH(flags, pt_flags);
TABLE_MATCH(compound_head, pt_list);
+TABLE_MATCH(lru, pt_list);
TABLE_MATCH(compound_head, _pt_pad_1);
TABLE_MATCH(mapping, __page_mapping);
TABLE_MATCH(__folio_index, pt_index);
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
> mm/pgtable-generic.c | 24 +++++++++++++++---------
> 1 file changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 567e2d084071..f40bab9e6c46 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -164,13 +164,14 @@ pud_t pudp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
> void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
> pgtable_t pgtable)
> {
> + struct ptdesc *ptdesc = page_ptdesc(pgtable);
> assert_spin_locked(pmd_lockptr(mm, pmdp));
>
> /* FIFO */
> if (!pmd_huge_pte(mm, pmdp))
> - INIT_LIST_HEAD(&pgtable->lru);
> + INIT_LIST_HEAD(&ptdesc->pt_list);
> else
> - list_add(&pgtable->lru, &pmd_huge_pte(mm, pmdp)->lru);
> + list_add(&ptdesc->pt_list, &page_ptdesc(pmd_huge_pte(mm, pmdp))->pt_list);
> pmd_huge_pte(mm, pmdp) = pgtable;
> }
> #endif
> @@ -179,17 +180,22 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
> /* no "address" argument so destroys page coloring of some arch */
> pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
> {
> - pgtable_t pgtable;
> + struct ptdesc *ptdesc, *next;
> + struct page *page;
>
> assert_spin_locked(pmd_lockptr(mm, pmdp));
>
> /* FIFO */
> - pgtable = pmd_huge_pte(mm, pmdp);
> - pmd_huge_pte(mm, pmdp) = list_first_entry_or_null(&pgtable->lru,
> - struct page, lru);
> - if (pmd_huge_pte(mm, pmdp))
> - list_del(&pgtable->lru);
> - return pgtable;
> + page = pmd_huge_pte(mm, pmdp);
> + ptdesc = page_ptdesc(page);
> + next = list_first_entry_or_null(&ptdesc->pt_list, struct ptdesc, pt_list);
> + if (next) {
> + pmd_huge_pte(mm, pmdp) = ptdesc_page(next);
> + list_del(&ptdesc->pt_list);
> + } else {
> + pmd_huge_pte(mm, pmdp) = NULL;
> + }
> + return page;
> }
> #endif
>
Otherwise LGTM.
* Re: [RFC PATCH 4/7] pgtable: Remove uses of page->lru
2025-11-11 7:53 ` Anshuman Khandual
@ 2025-11-11 18:46 ` Matthew Wilcox
0 siblings, 0 replies; 17+ messages in thread
From: Matthew Wilcox @ 2025-11-11 18:46 UTC (permalink / raw)
To: Anshuman Khandual; +Cc: linux-mm, Vishal Moola, Johannes Weiner
On Tue, Nov 11, 2025 at 01:23:53PM +0530, Anshuman Khandual wrote:
> On 20/10/25 5:46 AM, Matthew Wilcox (Oracle) wrote:
> > Use ptdesc->pt_list instead of page->lru. These are the same bits for
> > now, but will be different when ptdesc is allocated separately.
>
> Just wondering - should a corresponding TABLE_MATCH() be added for as
> long as the above bits in page and ptdesc remain the same?
Huh, I hadn't noticed that was missing. I'll put that in as a separate
patch.
* [RFC PATCH 5/7] x86: Call preallocate_vmalloc_pages() later
2025-10-20 0:16 [RFC PATCH 0/7] Separate ptdesc from struct page Matthew Wilcox (Oracle)
` (3 preceding siblings ...)
2025-10-20 0:16 ` [RFC PATCH 4/7] pgtable: Remove uses of page->lru Matthew Wilcox (Oracle)
@ 2025-10-20 0:16 ` Matthew Wilcox (Oracle)
2025-11-11 8:59 ` Anshuman Khandual
2025-10-20 0:16 ` [RFC PATCH 6/7] mm: Add alloc_pages_memdesc family of APIs Matthew Wilcox (Oracle)
` (2 subsequent siblings)
7 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox (Oracle) @ 2025-10-20 0:16 UTC (permalink / raw)
To: linux-mm; +Cc: Matthew Wilcox (Oracle), Vishal Moola, Johannes Weiner
When separately allocating ptdesc from struct page, calling
preallocate_vmalloc_pages() from mem_init() is too early as the slab
allocator hasn't been set up yet. Move preallocate_vmalloc_pages() to
vmalloc_init() which is called after the slab allocator has been set up.
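As a rough sketch of the relevant ordering in mm_core_init() (mm/mm_init.c,
heavily abridged):

void __init mm_core_init(void)
{
	/* earlier setup elided */
	mem_init();		/* old call site: slab is not up yet */
	kmem_cache_init();	/* slab becomes usable from here on */
	/* page_ext, kmemleak, ptlock/pgtable cache init, ... */
	vmalloc_init();		/* new call site: late enough for a slab-backed ptdesc */
	/* remaining setup elided */
}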
Honestly, this patch is a bit bobbins and I'm sure it'll be reworked
before it goes upstream.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
arch/x86/mm/init_64.c | 4 +---
include/linux/mm.h | 33 +++++++++++++++++++++++++++++++--
mm/vmalloc.c | 2 ++
3 files changed, 34 insertions(+), 5 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0e4270e20fad..5270fc24f6f6 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1322,7 +1322,7 @@ static void __init register_page_bootmem_info(void)
* Only the level which needs to be synchronized between all page-tables is
* allocated because the synchronization can be expensive.
*/
-static void __init preallocate_vmalloc_pages(void)
+void __init preallocate_vmalloc_pages(void)
{
unsigned long addr;
const char *lvl;
@@ -1390,8 +1390,6 @@ void __init mem_init(void)
/* Register memory areas for /proc/kcore */
if (get_gate_vma(&init_mm))
kclist_add(&kcore_vsyscall, (void *)VSYSCALL_ADDR, PAGE_SIZE, KCORE_USER);
-
- preallocate_vmalloc_pages();
}
int kernel_set_to_readonly;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index edcb7d75542f..e60b181da3df 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1160,6 +1160,12 @@ static inline int is_vmalloc_or_module_addr(const void *x)
}
#endif
+#ifdef CONFIG_X86
+void __init preallocate_vmalloc_pages(void);
+#else
+static inline void preallocate_vmalloc_pages(void) { }
+#endif
+
/*
* How many times the entire folio is mapped as a single unit (eg by a
* PMD or PUD entry). This is probably not what you want, except for
@@ -2939,9 +2945,32 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
}
#endif /* CONFIG_MMU */
+static inline struct page *ptdesc_page(const struct ptdesc *pt)
+{
+ return pt->pt_page;
+}
+
+static inline struct ptdesc *page_ptdesc(const struct page *page)
+{
+ memdesc_t memdesc = READ_ONCE(page->memdesc);
+
+ if (memdesc_type(memdesc) != MEMDESC_TYPE_PAGE_TABLE) {
+ printk(KERN_EMERG "memdesc %lx index %lx\n", memdesc.v, page->__folio_index);
+ VM_BUG_ON_PAGE(1, page);
+ return NULL;
+ }
+ return (void *)(memdesc.v - MEMDESC_TYPE_PAGE_TABLE);
+}
+
+/**
+ * enum pt_flags = How the ptdesc flags bits are used.
+ * @PT_reserved: Used by PowerPC
+ *
+ * The pt flags are stored in a memdesc_flags_t.
+ * The high bits are used for information like zone/node/section.
+ */
enum pt_flags {
PT_reserved = PG_reserved,
- /* High bits are used for zone/node/section */
};
static inline struct ptdesc *virt_to_ptdesc(const void *x)
@@ -2957,7 +2986,7 @@ static inline struct ptdesc *virt_to_ptdesc(const void *x)
*/
static inline void *ptdesc_address(const struct ptdesc *pt)
{
- return folio_address(ptdesc_folio(pt));
+ return page_address(pt->pt_page);
}
static inline bool pagetable_is_reserved(struct ptdesc *pt)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 798b2ed21e46..9b349051a83a 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -5264,6 +5264,8 @@ void __init vmalloc_init(void)
struct vm_struct *tmp;
int i;
+ preallocate_vmalloc_pages();
+
/*
* Create the cache for vmap_area objects.
*/
--
2.47.2
* Re: [RFC PATCH 5/7] x86: Call preallocate_vmalloc_pages() later
2025-10-20 0:16 ` [RFC PATCH 5/7] x86: Call preallocate_vmalloc_pages() later Matthew Wilcox (Oracle)
@ 2025-11-11 8:59 ` Anshuman Khandual
2025-11-12 18:36 ` Vishal Moola (Oracle)
2025-11-13 13:53 ` Matthew Wilcox
0 siblings, 2 replies; 17+ messages in thread
From: Anshuman Khandual @ 2025-11-11 8:59 UTC (permalink / raw)
To: Matthew Wilcox (Oracle), linux-mm; +Cc: Vishal Moola, Johannes Weiner
On 20/10/25 5:46 AM, Matthew Wilcox (Oracle) wrote:
> When separately allocating ptdesc from struct page, calling
> preallocate_vmalloc_pages() from mem_init() is too early as the slab
> allocator hasn't been set up yet. Move preallocate_vmalloc_pages() to
> vmalloc_init() which is called after the slab allocator has been set up.
>
> Honestly, this patch is a bit bobbins and I'm sure it'll be reworked
> before it goes upstream.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
> arch/x86/mm/init_64.c | 4 +---
> include/linux/mm.h | 33 +++++++++++++++++++++++++++++++--
> mm/vmalloc.c | 2 ++
> 3 files changed, 34 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 0e4270e20fad..5270fc24f6f6 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1322,7 +1322,7 @@ static void __init register_page_bootmem_info(void)
> * Only the level which needs to be synchronized between all page-tables is
> * allocated because the synchronization can be expensive.
> */
> -static void __init preallocate_vmalloc_pages(void)
> +void __init preallocate_vmalloc_pages(void)
> {
> unsigned long addr;
> const char *lvl;
> @@ -1390,8 +1390,6 @@ void __init mem_init(void)
> /* Register memory areas for /proc/kcore */
> if (get_gate_vma(&init_mm))
> kclist_add(&kcore_vsyscall, (void *)VSYSCALL_ADDR, PAGE_SIZE, KCORE_USER);
> -
> - preallocate_vmalloc_pages();
> }
>
> int kernel_set_to_readonly;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index edcb7d75542f..e60b181da3df 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1160,6 +1160,12 @@ static inline int is_vmalloc_or_module_addr(const void *x)
> }
> #endif
>
> +#ifdef CONFIG_X86
> +void __init preallocate_vmalloc_pages(void);
> +#else
> +static inline void preallocate_vmalloc_pages(void) { }
> +#endif
> +
> /*
> * How many times the entire folio is mapped as a single unit (eg by a
> * PMD or PUD entry). This is probably not what you want, except for
> @@ -2939,9 +2945,32 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
> }
> #endif /* CONFIG_MMU */
>
> +static inline struct page *ptdesc_page(const struct ptdesc *pt)
> +{
> + return pt->pt_page;
> +}
pt_page has not been added as an element of ptdesc, and hence
the build fails up to this patch.
> +
> +static inline struct ptdesc *page_ptdesc(const struct page *page)
> +{
> + memdesc_t memdesc = READ_ONCE(page->memdesc);
> +
> + if (memdesc_type(memdesc) != MEMDESC_TYPE_PAGE_TABLE) {
> + printk(KERN_EMERG "memdesc %lx index %lx\n", memdesc.v, page->__folio_index);
> + VM_BUG_ON_PAGE(1, page);
> + return NULL;
> + }
> + return (void *)(memdesc.v - MEMDESC_TYPE_PAGE_TABLE);
> +}
Ditto - these elements have not been introduced.
> +
> +/**
> + * enum pt_flags = How the ptdesc flags bits are used.
> + * @PT_reserved: Used by PowerPC
> + *
> + * The pt flags are stored in a memdesc_flags_t.
> + * The high bits are used for information like zone/node/section.
> + */
> enum pt_flags {
> PT_reserved = PG_reserved,
> - /* High bits are used for zone/node/section */
> };
>
> static inline struct ptdesc *virt_to_ptdesc(const void *x)
> @@ -2957,7 +2986,7 @@ static inline struct ptdesc *virt_to_ptdesc(const void *x)
> */
> static inline void *ptdesc_address(const struct ptdesc *pt)
> {
> - return folio_address(ptdesc_folio(pt));
> + return page_address(pt->pt_page);
> }
>
> static inline bool pagetable_is_reserved(struct ptdesc *pt)
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 798b2ed21e46..9b349051a83a 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -5264,6 +5264,8 @@ void __init vmalloc_init(void)
> struct vm_struct *tmp;
> int i;
>
> + preallocate_vmalloc_pages();
> +
> /*
> * Create the cache for vmap_area objects.
> */
* Re: [RFC PATCH 5/7] x86: Call preallocate_vmalloc_pages() later
2025-11-11 8:59 ` Anshuman Khandual
@ 2025-11-12 18:36 ` Vishal Moola (Oracle)
2025-11-12 19:31 ` Vishal Moola (Oracle)
2025-11-13 13:53 ` Matthew Wilcox
1 sibling, 1 reply; 17+ messages in thread
From: Vishal Moola (Oracle) @ 2025-11-12 18:36 UTC (permalink / raw)
To: Anshuman Khandual; +Cc: Matthew Wilcox (Oracle), linux-mm, Johannes Weiner
On Tue, Nov 11, 2025 at 02:29:44PM +0530, Anshuman Khandual wrote:
> On 20/10/25 5:46 AM, Matthew Wilcox (Oracle) wrote:
> > When separately allocating ptdesc from struct page, calling
> > preallocate_vmalloc_pages() from mem_init() is too early as the slab
> > allocator hasn't been set up yet. Move preallocate_vmalloc_pages() to
> > vmalloc_init() which is called after the slab allocator has been set up.
> >
> > Honestly, this patch is a bit bobbins and I'm sure it'll be reworked
> > before it goes upstream.
> >
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > ---
> > arch/x86/mm/init_64.c | 4 +---
> > include/linux/mm.h | 33 +++++++++++++++++++++++++++++++--
> > mm/vmalloc.c | 2 ++
> > 3 files changed, 34 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > index 0e4270e20fad..5270fc24f6f6 100644
> > --- a/arch/x86/mm/init_64.c
> > +++ b/arch/x86/mm/init_64.c
> > @@ -1322,7 +1322,7 @@ static void __init register_page_bootmem_info(void)
> > * Only the level which needs to be synchronized between all page-tables is
> > * allocated because the synchronization can be expensive.
> > */
> > -static void __init preallocate_vmalloc_pages(void)
> > +void __init preallocate_vmalloc_pages(void)
> > {
> > unsigned long addr;
> > const char *lvl;
> > @@ -1390,8 +1390,6 @@ void __init mem_init(void)
> > /* Register memory areas for /proc/kcore */
> > if (get_gate_vma(&init_mm))
> > kclist_add(&kcore_vsyscall, (void *)VSYSCALL_ADDR, PAGE_SIZE, KCORE_USER);
> > -
> > - preallocate_vmalloc_pages();
> > }
> >
> > int kernel_set_to_readonly;
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index edcb7d75542f..e60b181da3df 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1160,6 +1160,12 @@ static inline int is_vmalloc_or_module_addr(const void *x)
> > }
> > #endif
> >
> > +#ifdef CONFIG_X86
> > +void __init preallocate_vmalloc_pages(void);
> > +#else
> > +static inline void preallocate_vmalloc_pages(void) { }
> > +#endif
> > +
> > /*
> > * How many times the entire folio is mapped as a single unit (eg by a
> > * PMD or PUD entry). This is probably not what you want, except for
> > @@ -2939,9 +2945,32 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
> > }
> > #endif /* CONFIG_MMU */
> >
> > +static inline struct page *ptdesc_page(const struct ptdesc *pt)
> > +{
> > + return pt->pt_page;
> > +}
>
> pt_page has not been added as an element of ptdesc, and hence
> the build fails up to this patch.
Which commit are you basing this on? It builds and boots fine for me on
6.18-rc5. The commits don't apply to mm-new anymore but that sounds
different...
> > +
> > +static inline struct ptdesc *page_ptdesc(const struct page *page)
> > +{
> > + memdesc_t memdesc = READ_ONCE(page->memdesc);
> > +
> > + if (memdesc_type(memdesc) != MEMDESC_TYPE_PAGE_TABLE) {
> > + printk(KERN_EMERG "memdesc %lx index %lx\n", memdesc.v, page->__folio_index);
> > + VM_BUG_ON_PAGE(1, page);
> > + return NULL;
> > + }
> > + return (void *)(memdesc.v - MEMDESC_TYPE_PAGE_TABLE);
> > +}
>
> Ditto - these elements have not been introduced.
* Re: [RFC PATCH 5/7] x86: Call preallocate_vmalloc_pages() later
2025-11-12 18:36 ` Vishal Moola (Oracle)
@ 2025-11-12 19:31 ` Vishal Moola (Oracle)
0 siblings, 0 replies; 17+ messages in thread
From: Vishal Moola (Oracle) @ 2025-11-12 19:31 UTC (permalink / raw)
To: Anshuman Khandual; +Cc: Matthew Wilcox (Oracle), linux-mm, Johannes Weiner
On Wed, Nov 12, 2025 at 10:36:36AM -0800, Vishal Moola (Oracle) wrote:
> On Tue, Nov 11, 2025 at 02:29:44PM +0530, Anshuman Khandual wrote:
> > On 20/10/25 5:46 AM, Matthew Wilcox (Oracle) wrote:
> > > When separately allocating ptdesc from struct page, calling
> > > preallocate_vmalloc_pages() from mem_init() is too early as the slab
> > > allocator hasn't been set up yet. Move preallocate_vmalloc_pages() to
> > > vmalloc_init() which is called after the slab allocator has been set up.
> > >
> > > Honestly, this patch is a bit bobbins and I'm sure it'll be reworked
> > > before it goes upstream.
> > >
> > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > > ---
> > > arch/x86/mm/init_64.c | 4 +---
> > > include/linux/mm.h | 33 +++++++++++++++++++++++++++++++--
> > > mm/vmalloc.c | 2 ++
> > > 3 files changed, 34 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > > index 0e4270e20fad..5270fc24f6f6 100644
> > > --- a/arch/x86/mm/init_64.c
> > > +++ b/arch/x86/mm/init_64.c
> > > @@ -1322,7 +1322,7 @@ static void __init register_page_bootmem_info(void)
> > > * Only the level which needs to be synchronized between all page-tables is
> > > * allocated because the synchronization can be expensive.
> > > */
> > > -static void __init preallocate_vmalloc_pages(void)
> > > +void __init preallocate_vmalloc_pages(void)
> > > {
> > > unsigned long addr;
> > > const char *lvl;
> > > @@ -1390,8 +1390,6 @@ void __init mem_init(void)
> > > /* Register memory areas for /proc/kcore */
> > > if (get_gate_vma(&init_mm))
> > > kclist_add(&kcore_vsyscall, (void *)VSYSCALL_ADDR, PAGE_SIZE, KCORE_USER);
> > > -
> > > - preallocate_vmalloc_pages();
> > > }
> > >
> > > int kernel_set_to_readonly;
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index edcb7d75542f..e60b181da3df 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -1160,6 +1160,12 @@ static inline int is_vmalloc_or_module_addr(const void *x)
> > > }
> > > #endif
> > >
> > > +#ifdef CONFIG_X86
> > > +void __init preallocate_vmalloc_pages(void);
> > > +#else
> > > +static inline void preallocate_vmalloc_pages(void) { }
> > > +#endif
> > > +
> > > /*
> > > * How many times the entire folio is mapped as a single unit (eg by a
> > > * PMD or PUD entry). This is probably not what you want, except for
> > > @@ -2939,9 +2945,32 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
> > > }
> > > #endif /* CONFIG_MMU */
> > >
> > > +static inline struct page *ptdesc_page(const struct ptdesc *pt)
> > > +{
> > > + return pt->pt_page;
> > > +}
> >
> > pt_page has not been added as an element of ptdesc, and hence
> > the build fails up to this patch.
>
> Which commit are you basing this on? It builds and boots fine for me on
> 6.18-rc5. The commits don't apply to mm-new anymore but that sounds
> different...
Ah, never mind, I see what you're saying. This and patch 6 don't compile.
> > > +
> > > +static inline struct ptdesc *page_ptdesc(const struct page *page)
> > > +{
> > > + memdesc_t memdesc = READ_ONCE(page->memdesc);
> > > +
> > > + if (memdesc_type(memdesc) != MEMDESC_TYPE_PAGE_TABLE) {
> > > + printk(KERN_EMERG "memdesc %lx index %lx\n", memdesc.v, page->__folio_index);
> > > + VM_BUG_ON_PAGE(1, page);
> > > + return NULL;
> > > + }
> > > + return (void *)(memdesc.v - MEMDESC_TYPE_PAGE_TABLE);
> > > +}
> >
> > Ditto - these elements have not been introduced.
* Re: [RFC PATCH 5/7] x86: Call preallocate_vmalloc_pages() later
2025-11-11 8:59 ` Anshuman Khandual
2025-11-12 18:36 ` Vishal Moola (Oracle)
@ 2025-11-13 13:53 ` Matthew Wilcox
1 sibling, 0 replies; 17+ messages in thread
From: Matthew Wilcox @ 2025-11-13 13:53 UTC (permalink / raw)
To: Anshuman Khandual; +Cc: linux-mm, Vishal Moola, Johannes Weiner, Joerg Roedel
On Tue, Nov 11, 2025 at 02:29:44PM +0530, Anshuman Khandual wrote:
> pt_page has not been added as an element into ptdesc and hence
> the build fails upto this patch.
Right, sorry, looks like the commit went astray. Those lines should
have been in a later patch, and this patch should have been as below.
Adding Joerg to the cc to see if this invocation of
preallocate_vmalloc_pages() is early enough. Also I'd like feedback on
whether we want to expose preallocate_vmalloc_pages() like this or if we
want something more bland as a name like arch_vmalloc_init().
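One possible shape for the blander variant, purely as a sketch (the hook
does not exist today; vmalloc_init() would call it instead of reaching
into x86 code directly):

/* generic side, e.g. mm/vmalloc.c: default does nothing */
void __init __weak arch_vmalloc_init(void)
{
}

/* arch/x86/mm/init_64.c: override to do the preallocation */
void __init arch_vmalloc_init(void)
{
	preallocate_vmalloc_pages();
}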
commit b34c228df965
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date: Sun Oct 19 18:34:36 2025 -0400
x86: Call preallocate_vmalloc_pages() later
When separately allocating ptdesc from struct page, calling
preallocate_vmalloc_pages() from mem_init() is too early as the slab
allocator hasn't been set up yet. Move preallocate_vmalloc_pages() to
vmalloc_init() which is called after the slab allocator has been set up.
Honestly, this patch is a bit bobbins and I'm sure it'll be reworked
before it goes upstream.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3d9a5e4ccaa4..0006b2880bc0 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1322,7 +1322,7 @@ static void __init register_page_bootmem_info(void)
* Only the level which needs to be synchronized between all page-tables is
* allocated because the synchronization can be expensive.
*/
-static void __init preallocate_vmalloc_pages(void)
+void __init preallocate_vmalloc_pages(void)
{
unsigned long addr;
const char *lvl;
@@ -1390,8 +1390,6 @@ void __init mem_init(void)
/* Register memory areas for /proc/kcore */
if (get_gate_vma(&init_mm))
kclist_add(&kcore_vsyscall, (void *)VSYSCALL_ADDR, PAGE_SIZE, KCORE_USER);
-
- preallocate_vmalloc_pages();
}
int kernel_set_to_readonly;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3111344b8d05..00ea79340c0c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1177,6 +1177,12 @@ static inline int is_vmalloc_or_module_addr(const void *x)
}
#endif
+#ifdef CONFIG_X86
+void __init preallocate_vmalloc_pages(void);
+#else
+static inline void preallocate_vmalloc_pages(void) { }
+#endif
+
/*
* How many times the entire folio is mapped as a single unit (eg by a
* PMD or PUD entry). This is probably not what you want, except for
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 0832f944544c..8af051136211 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -5385,6 +5385,8 @@ void __init vmalloc_init(void)
struct vm_struct *tmp;
int i;
+ preallocate_vmalloc_pages();
+
/*
* Create the cache for vmap_area objects.
*/
* [RFC PATCH 6/7] mm: Add alloc_pages_memdesc family of APIs
2025-10-20 0:16 [RFC PATCH 0/7] Separate ptdesc from struct page Matthew Wilcox (Oracle)
` (4 preceding siblings ...)
2025-10-20 0:16 ` [RFC PATCH 5/7] x86: Call preallocate_vmalloc_pages() later Matthew Wilcox (Oracle)
@ 2025-10-20 0:16 ` Matthew Wilcox (Oracle)
2025-10-20 0:16 ` [RFC PATCH 7/7] mm: Allocate ptdesc from slab Matthew Wilcox (Oracle)
2025-10-20 6:43 ` [syzbot ci] Re: Separate ptdesc from struct page syzbot ci
7 siblings, 0 replies; 17+ messages in thread
From: Matthew Wilcox (Oracle) @ 2025-10-20 0:16 UTC (permalink / raw)
To: linux-mm; +Cc: Matthew Wilcox (Oracle), Vishal Moola, Johannes Weiner
Convert the alloc_frozen_pages implementations into alloc_pages_memdesc
and add wrappers to keep the frozen pages users working.
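For cross-reference, the consumer side elsewhere in this series (currently
in patch 5) decodes the typed pointer back out roughly like so, which is
why memdesc_create() insists on 16-byte alignment - the low four bits have
to stay free for the type (simplified, without the debugging checks):

static inline struct ptdesc *page_ptdesc(const struct page *page)
{
	memdesc_t memdesc = READ_ONCE(page->memdesc);

	if (memdesc_type(memdesc) != MEMDESC_TYPE_PAGE_TABLE)
		return NULL;
	return (void *)(memdesc.v - MEMDESC_TYPE_PAGE_TABLE);
}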
This hasn't been widely tested; I bet the build bots will find something
I missed.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
include/linux/gfp.h | 13 +++++++++++++
include/linux/mm_types.h | 25 +++++++++++++++++++++++++
mm/internal.h | 13 ++++++++++---
mm/mempolicy.c | 28 ++++++++++++++++------------
mm/page_alloc.c | 12 +++++++++---
5 files changed, 73 insertions(+), 18 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0ceb4e09306c..6e13e0b829f8 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -225,6 +225,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, int preferred_n
nodemask_t *nodemask);
#define __alloc_pages(...) alloc_hooks(__alloc_pages_noprof(__VA_ARGS__))
+struct page *__alloc_pages_memdesc_noprof(gfp_t gfp, unsigned int order,
+ memdesc_t memdesc, int preferred_nid, nodemask_t *nodemask);
+#define __alloc_pages_memdesc(...) alloc_hooks(__alloc_pages_memdesc_noprof(__VA_ARGS__))
+
struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
#define __folio_alloc(...) alloc_hooks(__folio_alloc_noprof(__VA_ARGS__))
@@ -315,6 +319,8 @@ static inline struct page *alloc_pages_node_noprof(int nid, gfp_t gfp_mask,
#ifdef CONFIG_NUMA
struct page *alloc_pages_noprof(gfp_t gfp, unsigned int order);
+struct page *alloc_pages_memdesc_noprof(gfp_t gfp, unsigned int order,
+ memdesc_t memdesc);
struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order);
struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
struct mempolicy *mpol, pgoff_t ilx, int nid);
@@ -325,6 +331,12 @@ static inline struct page *alloc_pages_noprof(gfp_t gfp_mask, unsigned int order
{
return alloc_pages_node_noprof(numa_node_id(), gfp_mask, order);
}
+static inline struct page *alloc_pages_memdesc_noprof(gfp_t gfp,
+ unsigned int order, memdesc_t memdesc)
+{
+ return __alloc_pages_memdesc_noprof(gfp, order, memdesc,
+ numa_node_id(), NULL);
+}
static inline struct folio *folio_alloc_noprof(gfp_t gfp, unsigned int order)
{
return __folio_alloc_node_noprof(gfp, order, numa_node_id());
@@ -339,6 +351,7 @@ static inline struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int orde
#endif
#define alloc_pages(...) alloc_hooks(alloc_pages_noprof(__VA_ARGS__))
+#define alloc_pages_memdesc(...) alloc_hooks(alloc_pages_memdesc_noprof(__VA_ARGS__))
#define folio_alloc(...) alloc_hooks(folio_alloc_noprof(__VA_ARGS__))
#define folio_alloc_mpol(...) alloc_hooks(folio_alloc_mpol_noprof(__VA_ARGS__))
#define vma_alloc_folio(...) alloc_hooks(vma_alloc_folio_noprof(__VA_ARGS__))
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 90e5790c318f..f5d9e0afe0fa 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -38,6 +38,30 @@ typedef struct {
unsigned long f;
} memdesc_flags_t;
+/**
+ * typedef memdesc_t - A typed memory descriptor.
+ *
+ * The bottom few bits of this encoded pointer determine the type
+ * of the memdesc.
+ */
+typedef struct {
+ unsigned long v;
+} memdesc_t;
+
+#define MEMDESC_TYPE_PAGE_TABLE 15
+
+static inline memdesc_t memdesc_create(void *p, unsigned long type)
+{
+ VM_BUG_ON((unsigned long)p & 15);
+ VM_BUG_ON(type > 15);
+ return (memdesc_t) { .v = type | (unsigned long)p };
+}
+
+static inline unsigned long memdesc_type(memdesc_t memdesc)
+{
+ return memdesc.v & 15;
+}
+
/*
* Each physical page in the system has a struct page associated with
* it to keep track of whatever it is we are using the page for at the
@@ -126,6 +150,7 @@ struct page {
};
struct { /* Tail pages of compound page */
unsigned long compound_head; /* Bit zero is set */
+ memdesc_t memdesc; /* All pages, not just tail */
};
struct { /* ZONE_DEVICE pages */
/*
diff --git a/mm/internal.h b/mm/internal.h
index 1561fc2ff5b8..15d64601289b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -824,15 +824,22 @@ extern bool free_pages_prepare(struct page *page, unsigned int order);
extern int user_min_free_kbytes;
-struct page *__alloc_frozen_pages_noprof(gfp_t, unsigned int order, int nid,
- nodemask_t *);
+static inline struct page *__alloc_frozen_pages_noprof(gfp_t gfp,
+ unsigned int order, int nid, nodemask_t *mask)
+{
+ return __alloc_pages_memdesc_noprof(gfp, order,
+ memdesc_create(NULL, 0), nid, mask);
+}
#define __alloc_frozen_pages(...) \
alloc_hooks(__alloc_frozen_pages_noprof(__VA_ARGS__))
void free_frozen_pages(struct page *page, unsigned int order);
void free_unref_folios(struct folio_batch *fbatch);
#ifdef CONFIG_NUMA
-struct page *alloc_frozen_pages_noprof(gfp_t, unsigned int order);
+static inline struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order)
+{
+ return alloc_pages_memdesc_noprof(gfp, order, memdesc_create(NULL, 0));
+}
#else
static inline struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order)
{
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index eb83cff7db8c..866d6609a758 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2338,7 +2338,7 @@ bool mempolicy_in_oom_domain(struct task_struct *tsk,
}
static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
- int nid, nodemask_t *nodemask)
+ memdesc_t memdesc, int nid, nodemask_t *nodemask)
{
struct page *page;
gfp_t preferred_gfp;
@@ -2351,9 +2351,11 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
*/
preferred_gfp = gfp | __GFP_NOWARN;
preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
- page = __alloc_frozen_pages_noprof(preferred_gfp, order, nid, nodemask);
+ page = __alloc_pages_memdesc_noprof(preferred_gfp, order, memdesc,
+ nid, nodemask);
if (!page)
- page = __alloc_frozen_pages_noprof(gfp, order, nid, NULL);
+ page = __alloc_pages_memdesc_noprof(gfp, order, memdesc,
+ nid, NULL);
return page;
}
@@ -2362,6 +2364,7 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
* alloc_pages_mpol - Allocate pages according to NUMA mempolicy.
* @gfp: GFP flags.
* @order: Order of the page allocation.
+ * @memdesc: Memory descriptor.
* @pol: Pointer to the NUMA mempolicy.
* @ilx: Index for interleave mempolicy (also distinguishes alloc_pages()).
* @nid: Preferred node (usually numa_node_id() but @mpol may override it).
@@ -2369,7 +2372,7 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
* Return: The page on success or NULL if allocation fails.
*/
static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
- struct mempolicy *pol, pgoff_t ilx, int nid)
+ memdesc_t memdesc, struct mempolicy *pol, pgoff_t ilx, int nid)
{
nodemask_t *nodemask;
struct page *page;
@@ -2377,7 +2380,7 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
nodemask = policy_nodemask(gfp, pol, ilx, &nid);
if (pol->mode == MPOL_PREFERRED_MANY)
- return alloc_pages_preferred_many(gfp, order, nid, nodemask);
+ return alloc_pages_preferred_many(gfp, order, memdesc, nid, nodemask);
if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
/* filter "hugepage" allocation, unless from alloc_pages() */
@@ -2399,9 +2402,9 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
* First, try to allocate THP only on local node, but
* don't reclaim unnecessarily, just compact.
*/
- page = __alloc_frozen_pages_noprof(
+ page = __alloc_pages_memdesc_noprof(
gfp | __GFP_THISNODE | __GFP_NORETRY, order,
- nid, NULL);
+ memdesc, nid, NULL);
if (page || !(gfp & __GFP_DIRECT_RECLAIM))
return page;
/*
@@ -2413,7 +2416,7 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
}
}
- page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask);
+ page = __alloc_pages_memdesc_noprof(gfp, order, memdesc, nid, nodemask);
if (unlikely(pol->mode == MPOL_INTERLEAVE ||
pol->mode == MPOL_WEIGHTED_INTERLEAVE) && page) {
@@ -2432,8 +2435,8 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
struct folio *folio_alloc_mpol_noprof(gfp_t gfp, unsigned int order,
struct mempolicy *pol, pgoff_t ilx, int nid)
{
- struct page *page = alloc_pages_mpol(gfp | __GFP_COMP, order, pol,
- ilx, nid);
+ struct page *page = alloc_pages_mpol(gfp | __GFP_COMP, order,
+ memdesc_create(NULL, 0), pol, ilx, nid);
if (!page)
return NULL;
@@ -2473,7 +2476,8 @@ struct folio *vma_alloc_folio_noprof(gfp_t gfp, int order, struct vm_area_struct
}
EXPORT_SYMBOL(vma_alloc_folio_noprof);
-struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned order)
+struct page *alloc_pages_memdesc_noprof(gfp_t gfp, unsigned order,
+ memdesc_t memdesc)
{
struct mempolicy *pol = &default_policy;
@@ -2484,7 +2488,7 @@ struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned order)
if (!in_interrupt() && !(gfp & __GFP_THISNODE))
pol = get_task_policy(current);
- return alloc_pages_mpol(gfp, order, pol, NO_INTERLEAVE_INDEX,
+ return alloc_pages_mpol(gfp, order, memdesc, pol, NO_INTERLEAVE_INDEX,
numa_node_id());
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 600d9e981c23..c1451ca0acc1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5144,8 +5144,8 @@ EXPORT_SYMBOL_GPL(alloc_pages_bulk_noprof);
/*
* This is the 'heart' of the zoned buddy allocator.
*/
-struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
- int preferred_nid, nodemask_t *nodemask)
+struct page *__alloc_pages_memdesc_noprof(gfp_t gfp, unsigned int order,
+ memdesc_t memdesc, int preferred_nid, nodemask_t *nodemask)
{
struct page *page;
unsigned int alloc_flags = ALLOC_WMARK_LOW;
@@ -5205,9 +5205,15 @@ struct page *__alloc_frozen_pages_noprof(gfp_t gfp, unsigned int order,
trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
kmsan_alloc_page(page, order, alloc_gfp);
+ if (page && memdesc.v) {
+ unsigned long i, max = 1UL << order;
+
+ for (i = 0; i < max; i++)
+ page->memdesc = memdesc;
+ }
return page;
}
-EXPORT_SYMBOL(__alloc_frozen_pages_noprof);
+EXPORT_SYMBOL(__alloc_pages_memdesc_noprof);
struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
int preferred_nid, nodemask_t *nodemask)
--
2.47.2
* [RFC PATCH 7/7] mm: Allocate ptdesc from slab
2025-10-20 0:16 [RFC PATCH 0/7] Separate ptdesc from struct page Matthew Wilcox (Oracle)
` (5 preceding siblings ...)
2025-10-20 0:16 ` [RFC PATCH 6/7] mm: Add alloc_pages_memdesc family of APIs Matthew Wilcox (Oracle)
@ 2025-10-20 0:16 ` Matthew Wilcox (Oracle)
2025-10-20 6:43 ` [syzbot ci] Re: Separate ptdesc from struct page syzbot ci
7 siblings, 0 replies; 17+ messages in thread
From: Matthew Wilcox (Oracle) @ 2025-10-20 0:16 UTC (permalink / raw)
To: linux-mm; +Cc: Matthew Wilcox (Oracle), Vishal Moola, Johannes Weiner
Create a slab cache for ptdescs and point to the struct page from the
ptdesc. Remove all the padding from ptdesc that makes it line up with
struct page.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
include/linux/mm.h | 1 +
include/linux/mm_types.h | 50 ++++------------------------------------
mm/internal.h | 1 +
mm/memory.c | 35 ++++++++++++++++++++++++----
mm/mm_init.c | 1 +
5 files changed, 37 insertions(+), 51 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e60b181da3df..e8bb52061b0c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2970,6 +2970,7 @@ static inline struct ptdesc *page_ptdesc(const struct page *page)
* The high bits are used for information like zone/node/section.
*/
enum pt_flags {
+ /* Bits 0-3 used for pt_order */
PT_reserved = PG_reserved,
};
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index f5d9e0afe0fa..efdf29b8b478 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -548,38 +548,30 @@ FOLIO_MATCH(compound_head, _head_3);
/**
* struct ptdesc - Memory descriptor for page tables.
* @pt_flags: enum pt_flags plus zone/node/section.
+ * @pt_page: page allocated to store page table entries.
* @pt_rcu_head: For freeing page table pages.
* @pt_list: List of used page tables. Used for s390 gmap shadow pages
* (which are not linked into the user page tables) and x86
* pgds.
- * @_pt_pad_1: Padding that aliases with page's compound head.
* @pmd_huge_pte: Protected by ptdesc->ptl, used for THPs.
- * @__page_mapping: Aliases with page->mapping. Unused for page tables.
* @pt_index: Used for s390 gmap.
* @pt_mm: Used for x86 pgds.
* @pt_frag_refcount: For fragmented page table tracking. Powerpc only.
* @pt_share_count: Used for HugeTLB PMD page table share count.
- * @_pt_pad_2: Padding to ensure proper alignment.
* @ptl: Lock for the page table.
- * @__page_type: Same as page->page_type. Unused for page tables.
- * @__page_refcount: Same as page refcount.
- * @pt_memcg_data: Memcg data. Tracked for page tables here.
*
* This struct overlays struct page for now. Do not modify without a good
* understanding of the issues.
*/
struct ptdesc {
memdesc_flags_t pt_flags;
+ struct page *pt_page;
union {
struct rcu_head pt_rcu_head;
struct list_head pt_list;
- struct {
- unsigned long _pt_pad_1;
- pgtable_t pmd_huge_pte;
- };
+ pgtable_t pmd_huge_pte;
};
- unsigned long __page_mapping;
union {
pgoff_t pt_index;
@@ -591,47 +583,13 @@ struct ptdesc {
};
union {
- unsigned long _pt_pad_2;
#if ALLOC_SPLIT_PTLOCKS
spinlock_t *ptl;
#else
spinlock_t ptl;
#endif
};
- unsigned int __page_type;
- atomic_t __page_refcount;
-#ifdef CONFIG_MEMCG
- unsigned long pt_memcg_data;
-#endif
-};
-
-#define TABLE_MATCH(pg, pt) \
- static_assert(offsetof(struct page, pg) == offsetof(struct ptdesc, pt))
-TABLE_MATCH(flags, pt_flags);
-TABLE_MATCH(compound_head, pt_list);
-TABLE_MATCH(compound_head, _pt_pad_1);
-TABLE_MATCH(mapping, __page_mapping);
-TABLE_MATCH(__folio_index, pt_index);
-TABLE_MATCH(rcu_head, pt_rcu_head);
-TABLE_MATCH(page_type, __page_type);
-TABLE_MATCH(_refcount, __page_refcount);
-#ifdef CONFIG_MEMCG
-TABLE_MATCH(memcg_data, pt_memcg_data);
-#endif
-#undef TABLE_MATCH
-static_assert(sizeof(struct ptdesc) <= sizeof(struct page));
-
-#define ptdesc_page(pt) (_Generic((pt), \
- const struct ptdesc *: (const struct page *)(pt), \
- struct ptdesc *: (struct page *)(pt)))
-
-#define ptdesc_folio(pt) (_Generic((pt), \
- const struct ptdesc *: (const struct folio *)(pt), \
- struct ptdesc *: (struct folio *)(pt)))
-
-#define page_ptdesc(p) (_Generic((p), \
- const struct page *: (const struct ptdesc *)(p), \
- struct page *: (struct ptdesc *)(p)))
+} __aligned(16);
#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
static inline void ptdesc_pmd_pts_init(struct ptdesc *ptdesc)
diff --git a/mm/internal.h b/mm/internal.h
index 15d64601289b..d57487ba443d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -100,6 +100,7 @@ struct pagetable_move_control {
unlikely(__ret_warn_once); \
})
+void __init ptcache_init(void);
void page_writeback_init(void);
/*
diff --git a/mm/memory.c b/mm/memory.c
index 47eb5834db23..331582bec495 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7267,10 +7267,17 @@ long copy_folio_from_user(struct folio *dst_folio,
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
+static struct kmem_cache *ptcache;
+
+void __init ptcache_init(void)
+{
+ ptcache = KMEM_CACHE(ptdesc, 0);
+}
+
/**
* pagetable_alloc - Allocate pagetables
* @gfp: GFP flags
- * @order: desired pagetable order
+ * @order: pagetable order
*
* pagetable_alloc allocates memory for page tables as well as a page table
* descriptor to describe that memory.
@@ -7279,16 +7286,34 @@ long copy_folio_from_user(struct folio *dst_folio,
*/
struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int order)
{
- struct page *page = alloc_frozen_pages_noprof(gfp | __GFP_COMP, order);
+ struct page *page;
pg_data_t *pgdat;
+ struct ptdesc *ptdesc;
+
+ BUG_ON(!ptcache);
- if (!page)
+ ptdesc = kmem_cache_alloc(ptcache, gfp);
+ if (!ptdesc)
return NULL;
+ page = alloc_pages_memdesc(gfp, order,
+ memdesc_create(ptdesc, MEMDESC_TYPE_PAGE_TABLE));
+ if (!page) {
+ kmem_cache_free(ptcache, ptdesc);
+ return NULL;
+ }
+
+ VM_BUG_ON_PAGE(memdesc_type(page->memdesc) != MEMDESC_TYPE_PAGE_TABLE, page);
pgdat = NODE_DATA(page_to_nid(page));
mod_node_page_state(pgdat, NR_PAGETABLE, 1 << order);
__SetPageTable(page);
- return page_ptdesc(page);
+ page->__folio_index = (unsigned long)ptdesc;
+
+ ptdesc->pt_flags = page->flags;
+ ptdesc->pt_flags.f |= order;
+ ptdesc->pt_page = page;
+
+ return ptdesc;
}
/**
@@ -7302,7 +7327,7 @@ void pagetable_free(struct ptdesc *pt)
{
pg_data_t *pgdat = NODE_DATA(memdesc_nid(pt->pt_flags));
struct page *page = ptdesc_page(pt);
- unsigned int order = compound_order(page);
+ unsigned int order = pt->pt_flags.f & 0xf;
mod_node_page_state(pgdat, NR_PAGETABLE, -(1L << order));
__ClearPageTable(page);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 3db2dea7db4c..dc6d2f81b692 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2712,6 +2712,7 @@ void __init mm_core_init(void)
*/
page_ext_init_flatmem_late();
kmemleak_init();
+ ptcache_init();
ptlock_cache_init();
pgtable_cache_init();
debug_objects_mem_init();
--
2.47.2
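
For readers who want to see the new API in one place: below is a minimal, illustrative sketch (not part of the series) of how a caller might pair the slab-backed ptdesc allocator with the matching free path. The example_* helper names, the extra out-parameter and the GFP flags are assumptions made for the example; only pagetable_alloc(), pagetable_free() and the ptdesc->pt_page link come from the patch above. Constructor/ptlock setup that real arch pte_alloc paths perform is omitted to keep the focus on the allocation pairing.

	/*
	 * Illustrative sketch only -- not part of the series.  Helper names
	 * and GFP flags are assumptions; pagetable_alloc(), pagetable_free()
	 * and ptdesc->pt_page are taken from the patch above.
	 */
	#include <linux/mm.h>
	#include <linux/gfp.h>

	static pte_t *example_pte_alloc_kernel(struct ptdesc **ptdescp)
	{
		/* Order-0: one page of PTEs plus a slab-allocated descriptor. */
		struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL | __GFP_ZERO, 0);

		if (!ptdesc)
			return NULL;

		*ptdescp = ptdesc;
		/* With this patch, the descriptor points at its page via pt_page. */
		return (pte_t *)page_address(ptdesc->pt_page);
	}

	static void example_pte_free_kernel(struct ptdesc *ptdesc)
	{
		/* Undoes the NR_PAGETABLE accounting, frees page and descriptor. */
		pagetable_free(ptdesc);
	}
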
* [syzbot ci] Re: Separate ptdesc from struct page
2025-10-20 0:16 [RFC PATCH 0/7] Separate ptdesc from struct page Matthew Wilcox (Oracle)
` (6 preceding siblings ...)
2025-10-20 0:16 ` [RFC PATCH 7/7] mm: Allocate ptdesc from slab Matthew Wilcox (Oracle)
@ 2025-10-20 6:43 ` syzbot ci
7 siblings, 0 replies; 17+ messages in thread
From: syzbot ci @ 2025-10-20 6:43 UTC (permalink / raw)
To: hannes, linux-mm, vishal.moola, willy; +Cc: syzbot, syzkaller-bugs
syzbot ci has tested the following series
[v1] Separate ptdesc from struct page
https://lore.kernel.org/all/20251020001652.2116669-1-willy@infradead.org
* [RFC PATCH 1/7] mm: Use frozen pages for page tables
* [RFC PATCH 2/7] mm: Account pagetable memory when allocated
* [RFC PATCH 3/7] mm: Mark pagetable memory when allocated
* [RFC PATCH 4/7] pgtable: Remove uses of page->lru
* [RFC PATCH 5/7] x86: Call preallocate_vmalloc_pages() later
* [RFC PATCH 6/7] mm: Add alloc_pages_memdesc family of APIs
* [RFC PATCH 7/7] mm: Allocate ptdesc from slab
and found the following issue:
kernel BUG in pte_free_now
Full report is available here:
https://ci.syzbot.org/series/034eb8df-67a6-4c07-8ba5-3ac0df045a40
***
kernel BUG in pte_free_now
tree: torvalds
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base: 3a8660878839faadb4f1a6dd72c3179c1df56787
arch: amd64
compiler: Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config: https://ci.syzbot.org/builds/2cf2c360-080f-4864-9be9-e277d29fc594/config
syz repro: https://ci.syzbot.org/findings/1f85642a-695a-4b8d-a5ee-b95a5105cc53/syz_repro
__do_softirq kernel/softirq.c:656 [inline]
invoke_softirq kernel/softirq.c:496 [inline]
__irq_exit_rcu+0xca/0x1f0 kernel/softirq.c:723
irq_exit_rcu+0x9/0x30 kernel/softirq.c:739
instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1052 [inline]
sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1052
asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:697
------------[ cut here ]------------
kernel BUG at ./include/linux/mm.h:2959!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 0 UID: 0 PID: 6093 Comm: syz.1.54 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:page_ptdesc include/linux/mm.h:2959 [inline]
RIP: 0010:pte_free include/asm-generic/pgalloc.h:115 [inline]
RIP: 0010:pte_free_now+0xc2/0xd0 mm/pgtable-generic.c:249
Code: 48 89 df e8 20 d1 13 00 48 8b 13 48 c7 c7 c0 89 76 8b 4c 89 f6 e8 6e 66 15 ff 4c 89 ff 48 c7 c6 00 8a 76 8b e8 1f c3 15 ff 90 <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
RSP: 0018:ffffc90000007bb0 EFLAGS: 00010246
RAX: de62a9f92d629d00 RBX: ffffea0005ca91e0 RCX: de62a9f92d629d00
RDX: 0000000000000002 RSI: ffffffff8d9d1d2d RDI: ffff8881127ad700
RBP: ffffc90000007e30 R08: ffffc90000007567 R09: 1ffff92000000eac
R10: dffffc0000000000 R11: fffff52000000ead R12: dffffc0000000000
R13: ffffffff81a82b37 R14: 0000000000000000 R15: ffffea0005ca91c0
FS: 00007f66c98976c0(0000) GS:ffff88818e70c000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f66c9896fc8 CR3: 0000000112a72000 CR4: 00000000000006f0
Call Trace:
<IRQ>
rcu_do_batch kernel/rcu/tree.c:2605 [inline]
rcu_core+0xcab/0x1770 kernel/rcu/tree.c:2861
handle_softirqs+0x286/0x870 kernel/softirq.c:622
__do_softirq kernel/softirq.c:656 [inline]
invoke_softirq kernel/softirq.c:496 [inline]
__irq_exit_rcu+0xca/0x1f0 kernel/softirq.c:723
irq_exit_rcu+0x9/0x30 kernel/softirq.c:739
instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1052 [inline]
sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1052
</IRQ>
<TASK>
asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:697
RIP: 0010:clear_page_rep+0xe/0x20 arch/x86/lib/clear_page_64.S:23
Code: cc cc cc cc cc cc cc cc cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa b9 00 02 00 00 31 c0 f3 48 ab <e9> 8d 46 04 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 0018:ffffc90003cb76d0 EFLAGS: 00000246
RAX: 0000000000000000 RBX: fffa800000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880123a2000
RBP: fffa8000123a1000 R08: ffffea000048e847 R09: 1ffffd4000091d08
R10: dffffc0000000000 R11: fffff94000091d09 R12: 0000000000000000
R13: 1ffffffff1b404a6 R14: 1ffffffff1b404a4 R15: 0000000000000000
clear_page arch/x86/include/asm/page_64.h:54 [inline]
clear_highpage include/linux/highmem.h:240 [inline]
shmem_get_folio_gfp+0xaad/0x1660 mm/shmem.c:2583
shmem_fault+0x179/0x390 mm/shmem.c:2734
__do_fault+0x138/0x390 mm/memory.c:5280
do_read_fault mm/memory.c:5698 [inline]
do_fault mm/memory.c:5832 [inline]
do_pte_missing mm/memory.c:4361 [inline]
handle_pte_fault mm/memory.c:6177 [inline]
__handle_mm_fault+0x35e3/0x5400 mm/memory.c:6318
handle_mm_fault+0x40a/0x8e0 mm/memory.c:6487
faultin_page mm/gup.c:1126 [inline]
__get_user_pages+0x165c/0x2a00 mm/gup.c:1428
populate_vma_page_range+0x29f/0x3a0 mm/gup.c:1860
__mm_populate+0x24c/0x380 mm/gup.c:1963
mm_populate include/linux/mm.h:3451 [inline]
vm_mmap_pgoff+0x387/0x4d0 mm/util.c:586
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f66c898efc9
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f66c9897038 EFLAGS: 00000246 ORIG_RAX: 0000000000000009
RAX: ffffffffffffffda RBX: 00007f66c8be6090 RCX: 00007f66c898efc9
RDX: b635773f06ebbeef RSI: 0000000000b36000 RDI: 0000200000000000
RBP: 00007f66c8a11f91 R08: ffffffffffffffff R09: 00000000e9ba1000
R10: 0000000000008031 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f66c8be6128 R14: 00007f66c8be6090 R15: 00007ffcfaf828b8
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:page_ptdesc include/linux/mm.h:2959 [inline]
RIP: 0010:pte_free include/asm-generic/pgalloc.h:115 [inline]
RIP: 0010:pte_free_now+0xc2/0xd0 mm/pgtable-generic.c:249
Code: 48 89 df e8 20 d1 13 00 48 8b 13 48 c7 c7 c0 89 76 8b 4c 89 f6 e8 6e 66 15 ff 4c 89 ff 48 c7 c6 00 8a 76 8b e8 1f c3 15 ff 90 <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
RSP: 0018:ffffc90000007bb0 EFLAGS: 00010246
RAX: de62a9f92d629d00 RBX: ffffea0005ca91e0 RCX: de62a9f92d629d00
RDX: 0000000000000002 RSI: ffffffff8d9d1d2d RDI: ffff8881127ad700
RBP: ffffc90000007e30 R08: ffffc90000007567 R09: 1ffff92000000eac
R10: dffffc0000000000 R11: fffff52000000ead R12: dffffc0000000000
R13: ffffffff81a82b37 R14: 0000000000000000 R15: ffffea0005ca91c0
FS: 00007f66c98976c0(0000) GS:ffff88818e70c000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f66c9896fc8 CR3: 0000000112a72000 CR4: 00000000000006f0
----------------
Code disassembly (best guess):
0: cc int3
1: cc int3
2: cc int3
3: cc int3
4: cc int3
5: cc int3
6: cc int3
7: cc int3
8: cc int3
9: cc int3
a: cc int3
b: cc int3
c: 90 nop
d: 90 nop
e: 90 nop
f: 90 nop
10: 90 nop
11: 90 nop
12: 90 nop
13: 90 nop
14: 90 nop
15: 90 nop
16: 90 nop
17: 90 nop
18: 90 nop
19: 90 nop
1a: 90 nop
1b: 90 nop
1c: f3 0f 1e fa endbr64
20: b9 00 02 00 00 mov $0x200,%ecx
25: 31 c0 xor %eax,%eax
27: f3 48 ab rep stos %rax,%es:(%rdi)
* 2a: e9 8d 46 04 00 jmp 0x446bc <-- trapping instruction
2f: 90 nop
30: 90 nop
31: 90 nop
32: 90 nop
33: 90 nop
34: 90 nop
35: 90 nop
36: 90 nop
37: 90 nop
38: 90 nop
39: 90 nop
3a: 90 nop
3b: 90 nop
3c: 90 nop
3d: 90 nop
3e: 90 nop
3f: 90 nop
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.