linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/45] hugetlb pagewalk unification
@ 2024-07-04  4:30 Oscar Salvador
  2024-07-04  4:30 ` [PATCH 01/45] arch/x86: Drop own definition of pgd,p4d_leaf Oscar Salvador
                   ` (46 more replies)
  0 siblings, 47 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Hi all,

During Peter's talk at the LSFMM, it was agreed that one of the things
that need to be done in order to further integrate hugetlb into mm core,
is to unify generic and hugetlb pagewalkers.
I started with this one, which is unifying hugetlb into generic
pagewalk, instead of having its hugetlb_entry entries.
Which means that pmd_entry/pte_entry(for cont-pte) entries will also deal with
hugetlb vmas as well, and so will new pud_entry entries since hugetlb can be
pud mapped (devm pages as well but we seem not to care about those with
the exception of hmm code).

The outcome is this RFC.

Before you continue, let me clarify certain points:

This patchset is not yet finished, as there are things that 1) need more thought,
2) are still broken (like the hmm bits as I am clueless about that) 3)
some paths have not been tested at all.

The things I tested were:

 - memory-failure
 - smaps/numa_maps/pagemap (the latter only for pud/pmd, not
   cont-{pmd,ptes}
 - mempolicy

on arm64 (for 64KB and 32M hugetlb pages) and on x86_64 (for 2MB and 1GB
hugetlb pages).
More tests need to be conducted, and I plan to borrow a pp64le machine
to also carry out some tests there, but for now this is what my bandwith
allowed me to do.

I am well aware that there are two things that might scare people, one
being the number of patches, and the other being the amount of code
added.

For the former, I will by no means ask anyone to review 45 patches, but
since this patchset touches isolated paths (damon, mincore, hmm,
task_mmu, memory-failure, mempolicy), I will point out some people
that might be able to help me out with those different bits:

 - Miaohe for memory-failure bits
 - David for task_mmu bits
 - SeongJae Park for damon bits
 - Jerome for hmm bits
 - feel freel to join for the rest

I think that that might be a good approach, and instead of having
to review 45 patches, one has only to review at most 5 or 6.

For the latter, there is an explanation: hugetlb operates on ptes
(although it allocates puds/pmds and the operations work on that level too),
which means that now that we will handle PUD/PMD-mapped hugetlb with
{pud,pmd}_* operations, we need to introduce quite a few functions that
do not exist yet and we need from now onwards.

Although I am sending this out, this is not a "rfc ready material",
as I said there are still things that need to be improved/fixed/tested,
but I wanted to make it public nevertheless so we can gather some constructive
feedback that helps us moving in the right direction and to also widen the discussions.

So take this more of a "Hey, let me show what I am doing and call me out on
things you consider wrong".

Thanks in advance

Oscar Salvador (45):
  arch/x86: Drop own definition of pgd,p4d_leaf
  mm: Add {pmd,pud}_huge_lock helper
  mm/pagewalk: Move vma_pgtable_walk_begin and vma_pgtable_walk_end
    upfront
  mm/pagewalk: Only call pud_entry when we have a pud leaf
  mm/pagewalk: Enable walk_pmd_range to handle cont-pmds
  mm/pagewalk: Do not try to split non-thp pud or pmd leafs
  arch/s390: Enable __s390_enable_skey_pmd to handle hugetlb vmas
  fs/proc: Enable smaps_pmd_entry to handle PMD-mapped hugetlb vmas
  mm: Implement pud-version functions for swap and vm_normal_page_pud
  fs/proc: Create smaps_pud_range to handle PUD-mapped hugetlb vmas
  fs/proc: Enable smaps_pte_entry to handle cont-pte mapped hugetlb vmas
  fs/proc: Enable pagemap_pmd_range to handle hugetlb vmas
  mm: Implement pud-version uffd functions
  fs/proc: Create pagemap_pud_range to handle PUD-mapped hugetlb vmas
  fs/proc: Adjust pte_to_pagemap_entry for hugetlb vmas
  fs/proc: Enable pagemap_scan_pmd_entry to handle hugetlb vmas
  mm: Implement pud-version for pud_mkinvalid and pudp_establish
  fs/proc: Create pagemap_scan_pud_entry to handle PUD-mapped hugetlb
    vmas
  fs/proc: Enable gather_pte_stats to handle hugetlb vmas
  fs/proc: Enable gather_pte_stats to handle cont-pte mapped hugetlb
    vmas
  fs/proc: Create gather_pud_stats to handle PUD-mapped hugetlb pages
  mm/mempolicy: Enable queue_folios_pmd to handle hugetlb vmas
  mm/mempolicy: Create queue_folios_pud to handle PUD-mapped hugetlb
    vmas
  mm/memory_failure: Enable check_hwpoisoned_pmd_entry to handle hugetlb
    vmas
  mm/memory-failure: Create check_hwpoisoned_pud_entry to handle
    PUD-mapped hugetlb vmas
  mm/damon: Enable damon_young_pmd_entry to handle hugetlb vmas
  mm/damon: Create damon_young_pud_entry to handle PUD-mapped hugetlb
    vmas
  mm/damon: Enable damon_mkold_pmd_entry to handle hugetlb vmas
  mm/damon: Create damon_mkold_pud_entry to handle PUD-mapped hugetlb
    vmas
  mm,mincore: Enable mincore_pte_range to handle hugetlb vmas
  mm/mincore: Create mincore_pud_range to handle PUD-mapped hugetlb vmas
  mm/hmm: Enable hmm_vma_walk_pmd, to handle hugetlb vmas
  mm/hmm: Enable hmm_vma_walk_pud to handle PUD-mapped hugetlb vmas
  arch/powerpc: Skip hugetlb vmas in subpage_mark_vma_nohuge
  arch/s390: Skip hugetlb vmas in thp_split_mm
  fs/proc: Make clear_refs_test_walk skip hugetlb vmas
  mm/lock: Make mlock_test_walk skip hugetlb vmas
  mm/madvise: Make swapin_test_walk skip hugetlb vmas
  mm/madvise: Make madvise_cold_test_walk skip hugetlb vmas
  mm/madvise: Make madvise_free_test_walk skip hugetlb vmas
  mm/migrate_device: Make migrate_vma_test_walk skip hugetlb vmas
  mm/memcontrol: Make mem_cgroup_move_test_walk skip hugetlb vmas
  mm/memcontrol: Make mem_cgroup_count_test_walk skip hugetlb vmas
  mm/hugetlb_vmemmap: Make vmemmap_test_walk skip hugetlb vmas
  mm: Delete all hugetlb_entry entries

 arch/arm64/include/asm/pgtable.h             |  19 +
 arch/loongarch/include/asm/pgtable.h         |   8 +
 arch/mips/include/asm/pgtable.h              |   7 +
 arch/powerpc/include/asm/book3s/64/pgtable.h |   8 +-
 arch/powerpc/mm/book3s64/pgtable.c           |  15 +-
 arch/powerpc/mm/book3s64/subpage_prot.c      |   2 +
 arch/riscv/include/asm/pgtable.h             |  15 +
 arch/s390/mm/gmap.c                          |  37 +-
 arch/x86/include/asm/pgtable.h               | 199 +++++----
 fs/proc/task_mmu.c                           | 434 ++++++++++++-------
 include/asm-generic/pgtable_uffd.h           |  30 ++
 include/linux/mm.h                           |   4 +
 include/linux/mm_inline.h                    |  34 ++
 include/linux/pagewalk.h                     |  10 -
 include/linux/pgtable.h                      |  77 +++-
 include/linux/swapops.h                      |  27 ++
 mm/damon/ops-common.c                        |  21 +-
 mm/damon/vaddr.c                             | 173 ++++----
 mm/hmm.c                                     |  69 +--
 mm/hugetlb_vmemmap.c                         |  12 +
 mm/madvise.c                                 |  36 ++
 mm/memcontrol-v1.c                           |  24 +
 mm/memory-failure.c                          |  99 +++--
 mm/memory.c                                  |  51 +++
 mm/mempolicy.c                               | 121 +++---
 mm/migrate_device.c                          |  12 +
 mm/mincore.c                                 |  46 +-
 mm/mlock.c                                   |  12 +
 mm/mprotect.c                                |  10 -
 mm/pagewalk.c                                |  73 +---
 mm/pgtable-generic.c                         |  21 +
 31 files changed, 1089 insertions(+), 617 deletions(-)

-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 01/45] arch/x86: Drop own definition of pgd,p4d_leaf
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
@ 2024-07-04  4:30 ` Oscar Salvador
  2024-07-04  4:30 ` [PATCH 02/45] mm: Add {pmd,pud}_huge_lock helper Oscar Salvador
                   ` (45 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

We provide generic definitions of pXd_leaf in pgtable.h when the arch
do not define their own, where the generic pXd_leaf always return false.

Although x86 defines {pgd,p4d}_leaf, they end up being a no-op, so drop them
and make them fallback to the generic one.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 arch/x86/include/asm/pgtable.h | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 65b8e5bb902c..772f778bac06 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -252,13 +252,6 @@ static inline unsigned long pgd_pfn(pgd_t pgd)
 	return (pgd_val(pgd) & PTE_PFN_MASK) >> PAGE_SHIFT;
 }
 
-#define p4d_leaf p4d_leaf
-static inline bool p4d_leaf(p4d_t p4d)
-{
-	/* No 512 GiB pages yet */
-	return 0;
-}
-
 #define pte_page(pte)	pfn_to_page(pte_pfn(pte))
 
 #define pmd_leaf pmd_leaf
@@ -1396,9 +1389,6 @@ static inline bool pgdp_maps_userspace(void *__ptr)
 	return (((ptr & ~PAGE_MASK) / sizeof(pgd_t)) < PGD_KERNEL_START);
 }
 
-#define pgd_leaf	pgd_leaf
-static inline bool pgd_leaf(pgd_t pgd) { return false; }
-
 #ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
 /*
  * All top-level MITIGATION_PAGE_TABLE_ISOLATION page tables are order-1 pages
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 02/45] mm: Add {pmd,pud}_huge_lock helper
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
  2024-07-04  4:30 ` [PATCH 01/45] arch/x86: Drop own definition of pgd,p4d_leaf Oscar Salvador
@ 2024-07-04  4:30 ` Oscar Salvador
  2024-07-04 15:02   ` Peter Xu
  2024-07-04  4:30 ` [PATCH 03/45] mm/pagewalk: Move vma_pgtable_walk_begin and vma_pgtable_walk_end upfront Oscar Salvador
                   ` (44 subsequent siblings)
  46 siblings, 1 reply; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Deep down hugetlb and thp use the same lock for pud and pmd.
Create two helpers that can be directly used by both of them,
as they will be used in the generic pagewalkers.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 include/linux/mm_inline.h | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index f4fe593c1400..93e3eb86ef4e 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -9,6 +9,7 @@
 #include <linux/string.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/swapops.h>
+#include <linux/hugetlb.h>
 
 /**
  * folio_is_file_lru - Should the folio be on a file LRU or anon LRU?
@@ -590,4 +591,30 @@ static inline bool vma_has_recency(struct vm_area_struct *vma)
 	return true;
 }
 
+static inline spinlock_t *pmd_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
+{
+	spinlock_t *ptl;
+
+	if (pmd_leaf(*pmd)) {
+		ptl = pmd_lock(vma->vm_mm, pmd);
+		if (pmd_leaf(*pmd))
+			return ptl;
+		spin_unlock(ptl);
+	}
+	return NULL;
+}
+
+static inline spinlock_t *pud_huge_lock(pud_t *pud, struct vm_area_struct *vma)
+{
+	spinlock_t *ptl = pud_lock(vma->vm_mm, pud);
+
+	if (pud_leaf(*pud)) {
+		ptl = pud_lock(vma->vm_mm, pud);
+		if (pud_leaf(*pud))
+			return ptl;
+		spin_unlock(ptl);
+	}
+	return NULL;
+}
+
 #endif
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 03/45] mm/pagewalk: Move vma_pgtable_walk_begin and vma_pgtable_walk_end upfront
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
  2024-07-04  4:30 ` [PATCH 01/45] arch/x86: Drop own definition of pgd,p4d_leaf Oscar Salvador
  2024-07-04  4:30 ` [PATCH 02/45] mm: Add {pmd,pud}_huge_lock helper Oscar Salvador
@ 2024-07-04  4:30 ` Oscar Salvador
  2024-07-04  4:30 ` [PATCH 04/45] mm/pagewalk: Only call pud_entry when we have a pud leaf Oscar Salvador
                   ` (43 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

In order to prepare walk_pgd_range for handling hugetlb pages, move
the hugetlb vma locking into __walk_page_range.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/pagewalk.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index ae2f08ce991b..eba705def9a0 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -269,7 +269,6 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
 	const struct mm_walk_ops *ops = walk->ops;
 	int err = 0;
 
-	hugetlb_vma_lock_read(vma);
 	do {
 		next = hugetlb_entry_end(h, addr, end);
 		pte = hugetlb_walk(vma, addr & hmask, sz);
@@ -280,7 +279,6 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
 		if (err)
 			break;
 	} while (addr = next, addr != end);
-	hugetlb_vma_unlock_read(vma);
 
 	return err;
 }
@@ -339,11 +337,13 @@ static int __walk_page_range(unsigned long start, unsigned long end,
 			return err;
 	}
 
+	vma_pgtable_walk_begin(vma);
 	if (is_vm_hugetlb_page(vma)) {
 		if (ops->hugetlb_entry)
 			err = walk_hugetlb_range(start, end, walk);
 	} else
 		err = walk_pgd_range(start, end, walk);
+	vma_pgtable_walk_end(vma);
 
 	if (ops->post_vma)
 		ops->post_vma(walk);
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 04/45] mm/pagewalk: Only call pud_entry when we have a pud leaf
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (2 preceding siblings ...)
  2024-07-04  4:30 ` [PATCH 03/45] mm/pagewalk: Move vma_pgtable_walk_begin and vma_pgtable_walk_end upfront Oscar Salvador
@ 2024-07-04  4:30 ` Oscar Salvador
  2024-07-04  4:30 ` [PATCH 05/45] mm/pagewalk: Enable walk_pmd_range to handle cont-pmds Oscar Salvador
                   ` (42 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Check first whether the pud is leaf one before trying to call pud_entry.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/pagewalk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index eba705def9a0..d93e77411482 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -155,7 +155,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 
 		walk->action = ACTION_SUBTREE;
 
-		if (ops->pud_entry)
+		if (ops->pud_entry && pud_leaf(*pud))
 			err = ops->pud_entry(pud, addr, next, walk);
 		if (err)
 			break;
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 05/45] mm/pagewalk: Enable walk_pmd_range to handle cont-pmds
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (3 preceding siblings ...)
  2024-07-04  4:30 ` [PATCH 04/45] mm/pagewalk: Only call pud_entry when we have a pud leaf Oscar Salvador
@ 2024-07-04  4:30 ` Oscar Salvador
  2024-07-04 15:41   ` David Hildenbrand
  2024-07-05 16:56   ` kernel test robot
  2024-07-04  4:30 ` [PATCH 06/45] mm/pagewalk: Do not try to split non-thp pud or pmd leafs Oscar Salvador
                   ` (41 subsequent siblings)
  46 siblings, 2 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

HugeTLB pages can be cont-pmd mapped, so teach walk_pmd_range to
handle those.
This will save us some cycles as we do it in one-shot instead of
calling in multiple times.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 include/linux/pgtable.h | 12 ++++++++++++
 mm/pagewalk.c           | 12 +++++++++---
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 2a6a3cccfc36..3a7b8751747e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1914,6 +1914,18 @@ typedef unsigned int pgtbl_mod_mask;
 #define __pte_leaf_size(x,y) pte_leaf_size(y)
 #endif
 
+#ifndef pmd_cont
+#define pmd_cont(x) false
+#endif
+
+#ifndef CONT_PMD_SIZE
+#define CONT_PMD_SIZE 0
+#endif
+
+#ifndef CONT_PMDS
+#define CONT_PMDS 0
+#endif
+
 /*
  * We always define pmd_pfn for all archs as it's used in lots of generic
  * code.  Now it happens too for pud_pfn (and can happen for larger
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index d93e77411482..a9c36f9e9820 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -81,11 +81,18 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	const struct mm_walk_ops *ops = walk->ops;
 	int err = 0;
 	int depth = real_depth(3);
+	int cont_pmds;
 
 	pmd = pmd_offset(pud, addr);
 	do {
 again:
-		next = pmd_addr_end(addr, end);
+		if (pmd_cont(*pmd)) {
+			cont_pmds = CONT_PMDS;
+			next = pmd_cont_addr_end(addr, end);
+		} else {
+			cont_pmds = 1;
+			next = pmd_addr_end(addr, end);
+		}
 		if (pmd_none(*pmd)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -126,8 +133,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 
 		if (walk->action == ACTION_AGAIN)
 			goto again;
-
-	} while (pmd++, addr = next, addr != end);
+	} while (pmd += cont_pmds, addr = next, addr != end);
 
 	return err;
 }
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 06/45] mm/pagewalk: Do not try to split non-thp pud or pmd leafs
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (4 preceding siblings ...)
  2024-07-04  4:30 ` [PATCH 05/45] mm/pagewalk: Enable walk_pmd_range to handle cont-pmds Oscar Salvador
@ 2024-07-04  4:30 ` Oscar Salvador
  2024-07-04  4:30 ` [PATCH 07/45] arch/s390: Enable __s390_enable_skey_pmd to handle hugetlb vmas Oscar Salvador
                   ` (40 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Hugetlb pages will be handled in the generic path, so do not try to split
pud or pmd leafs if they are not thp.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/pagewalk.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index a9c36f9e9820..78d45f1450aa 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -119,7 +119,8 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 		 * Check this here so we only break down trans_huge
 		 * pages when we _need_ to
 		 */
-		if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
+		if (((!walk->vma || is_vm_hugetlb_page(walk->vma)) &&
+		    (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
 		    walk->action == ACTION_CONTINUE ||
 		    !(ops->pte_entry))
 			continue;
@@ -169,7 +170,8 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 		if (walk->action == ACTION_AGAIN)
 			goto again;
 
-		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
+		if (((!walk->vma || is_vm_hugetlb_page(walk->vma)) &&
+		    (pud_leaf(*pud) || !pud_present(*pud))) ||
 		    walk->action == ACTION_CONTINUE ||
 		    !(ops->pmd_entry || ops->pte_entry))
 			continue;
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 07/45] arch/s390: Enable __s390_enable_skey_pmd to handle hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (5 preceding siblings ...)
  2024-07-04  4:30 ` [PATCH 06/45] mm/pagewalk: Do not try to split non-thp pud or pmd leafs Oscar Salvador
@ 2024-07-04  4:30 ` Oscar Salvador
  2024-07-04  4:30 ` [PATCH 08/45] fs/proc: Enable smaps_pmd_entry to handle PMD-mapped " Oscar Salvador
                   ` (39 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

__s390_enable_skey_pmd does nothing for THP, but will handle pmd-leafs
hugetlb vmas from now onwards, so teach it how to handle those.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 arch/s390/mm/gmap.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 474a25ca5c48..e1d098dc7f07 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2723,6 +2723,20 @@ static int __s390_enable_skey_pte(pte_t *pte, unsigned long addr,
 static int __s390_enable_skey_pmd(pmd_t *pmd, unsigned long addr,
 				  unsigned long next, struct mm_walk *walk)
 {
+	if (pmd_leaf(*pmd) && is_vm_hugetlb_page(vma))
+		unsigned long start, end;
+		struct page *page = pmd_page(*pmd);
+
+		if (pmd_val(*pmd) & _SEGMENT_ENTRY_INVALID ||
+		    !(pmd_val(*pmd) & _SEGMENT_ENTRY_WRITE))
+			return 0;
+
+		start = pmd_val(*pmd) & HPAGE_MASK;
+		end = start + HPAGE_SIZE;
+		__storage_key_init_range(start, end);
+		set_bit(PG_arch_1, &page->flags);
+	}
+
 	cond_resched();
 	return 0;
 }
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 08/45] fs/proc: Enable smaps_pmd_entry to handle PMD-mapped hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (6 preceding siblings ...)
  2024-07-04  4:30 ` [PATCH 07/45] arch/s390: Enable __s390_enable_skey_pmd to handle hugetlb vmas Oscar Salvador
@ 2024-07-04  4:30 ` Oscar Salvador
  2024-07-04  4:30 ` [PATCH 09/45] mm: Implement pud-version functions for swap and vm_normal_page_pud Oscar Salvador
                   ` (38 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

PMD-mapped hugetlb vmas will also reach smaps_pmd_entry.
Add the required code so it knows how to handle those there.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 fs/proc/task_mmu.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 775a2e8d600c..78d84d7e2353 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -804,6 +804,18 @@ static void smaps_pte_hole_lookup(unsigned long addr, struct mm_walk *walk)
 #endif
 }
 
+#ifdef CONFIG_HUGETLB_PAGE
+static void mss_hugetlb_update(struct mem_size_stats *mss, struct folio *folio,
+			       struct vm_area_struct *vma, pte_t *pte)
+{
+	if (folio_likely_mapped_shared(folio) ||
+	    hugetlb_pmd_shared(pte))
+		mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
+	else
+		mss->private_hugetlb += huge_page_size(hstate_vma(vma));
+}
+#endif
+
 static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 		struct mm_walk *walk)
 {
@@ -851,12 +863,13 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 	smaps_account(mss, page, false, young, dirty, locked, present);
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
 static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 		struct mm_walk *walk)
 {
 	struct mem_size_stats *mss = walk->private;
 	struct vm_area_struct *vma = walk->vma;
+	bool hugetlb = is_vm_hugetlb_page(vma);
 	bool locked = !!(vma->vm_flags & VM_LOCKED);
 	struct page *page = NULL;
 	bool present = false;
@@ -865,7 +878,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 	if (pmd_present(*pmd)) {
 		page = vm_normal_page_pmd(vma, addr, *pmd);
 		present = true;
-	} else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
+	} else if (is_swap_pmd(*pmd)) {
 		swp_entry_t entry = pmd_to_swp_entry(*pmd);
 
 		if (is_pfn_swap_entry(entry))
@@ -874,6 +887,12 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 	if (IS_ERR_OR_NULL(page))
 		return;
 	folio = page_folio(page);
+
+	if (hugetlb) {
+		mss_hugetlb_update(mss, folio, vma, (pte_t *)pmd);
+		return;
+	}
+
 	if (folio_test_anon(folio))
 		mss->anonymous_thp += HPAGE_PMD_SIZE;
 	else if (folio_test_swapbacked(folio))
@@ -900,7 +919,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	ptl = pmd_trans_huge_lock(pmd, vma);
+	ptl = pmd_huge_lock(pmd, vma);
 	if (ptl) {
 		smaps_pmd_entry(pmd, addr, walk);
 		spin_unlock(ptl);
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 09/45] mm: Implement pud-version functions for swap and vm_normal_page_pud
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (7 preceding siblings ...)
  2024-07-04  4:30 ` [PATCH 08/45] fs/proc: Enable smaps_pmd_entry to handle PMD-mapped " Oscar Salvador
@ 2024-07-04  4:30 ` Oscar Salvador
  2024-07-04  4:30 ` [PATCH 10/45] fs/proc: Create smaps_pud_range to handle PUD-mapped hugetlb vmas Oscar Salvador
                   ` (37 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

HugeTLB pages will be handled on pud level as well, so we need to
implement pud-version of vm_normal_page_pud and swp-pud functions.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h |  1 +
 include/linux/mm.h                           |  4 ++
 include/linux/pgtable.h                      |  6 +++
 include/linux/swapops.h                      | 15 ++++++
 mm/memory.c                                  | 51 ++++++++++++++++++++
 5 files changed, 77 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 519b1743a0f4..fa4bb8d6356f 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -687,6 +687,7 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 #define __pte_to_swp_entry(pte)	((swp_entry_t) { pte_val((pte)) & ~_PAGE_PTE })
 #define __swp_entry_to_pte(x)	__pte((x).val | _PAGE_PTE)
 #define __pmd_to_swp_entry(pmd)	(__pte_to_swp_entry(pmd_pte(pmd)))
+#define __pud_to_swp_entry(pud)	(__pte_to_swp_entry(pud_pte(pud)))
 #define __swp_entry_to_pmd(x)	(pte_pmd(__swp_entry_to_pte(x)))
 
 #ifdef CONFIG_MEM_SOFT_DIRTY
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5f1075d19600..baade06b159b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2371,6 +2371,10 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
 				  unsigned long addr, pmd_t pmd);
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 				pmd_t pmd);
+struct folio *vm_normal_folio_pud(struct vm_area_struct *vma,
+				  unsigned long addr, pud_t pud);
+struct page *vm_normal_page_pud(struct vm_area_struct *vma, unsigned long addr,
+				pud_t pud);
 
 void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 		  unsigned long size);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 3a7b8751747e..a9edeb86b7fe 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1990,4 +1990,10 @@ pgprot_t vm_get_page_prot(unsigned long vm_flags)			\
 }									\
 EXPORT_SYMBOL(vm_get_page_prot);
 
+#ifdef CONFIG_HUGETLB_PAGE
+#ifndef __pud_to_swp_entry
+#define __pud_to_swp_entry(pud) ((swp_entry_t) { pud_val(pud) })
+#endif
+#endif
+
 #endif /* _LINUX_PGTABLE_H */
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index cb468e418ea1..182957f0d013 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -126,6 +126,21 @@ static inline int is_swap_pte(pte_t pte)
 	return !pte_none(pte) && !pte_present(pte);
 }
 
+#ifdef CONFIG_HUGETLB_PAGE
+static inline int is_swap_pud(pud_t pud)
+{
+	return !pud_none(pud) && !pud_present(pud);
+}
+
+static inline swp_entry_t pud_to_swp_entry(pud_t pud)
+{
+	swp_entry_t arch_entry;
+
+	arch_entry = __pud_to_swp_entry(pud);
+	return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry));
+}
+#endif
+
 /*
  * Convert the arch-dependent pte representation of a swp_entry_t into an
  * arch-independent swp_entry_t.
diff --git a/mm/memory.c b/mm/memory.c
index 0a769f34bbb2..90c5dfac35c6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -718,6 +718,57 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
 }
 #endif
 
+#ifdef CONFIG_HUGETLB_PAGE
+struct page *vm_normal_page_pud(struct vm_area_struct *vma, unsigned long addr,
+				pud_t pud)
+{
+	unsigned long pfn = pud_pfn(pud);
+
+	/*
+	 * There is no pmd_special() but there may be special pmds, e.g.
+	 * in a direct-access (dax) mapping, so let's just replicate the
+	 * !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_page() here.
+	 */
+	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
+		if (vma->vm_flags & VM_MIXEDMAP) {
+			if (!pfn_valid(pfn))
+				return NULL;
+			goto out;
+		} else {
+			unsigned long off;
+
+			off = (addr - vma->vm_start) >> PAGE_SHIFT;
+			if (pfn == vma->vm_pgoff + off)
+				return NULL;
+			if (!is_cow_mapping(vma->vm_flags))
+				return NULL;
+		}
+	}
+
+	if (pud_devmap(pud))
+		return NULL;
+	if (unlikely(pfn > highest_memmap_pfn))
+		return NULL;
+
+	/*
+	 * NOTE! We still have PageReserved() pages in the page tables.
+	 * eg. VDSO mappings can cause them to exist.
+	 */
+out:
+	return pfn_to_page(pfn);
+}
+
+struct folio *vm_normal_folio_pud(struct vm_area_struct *vma,
+				  unsigned long addr, pud_t pud)
+{
+	struct page *page = vm_normal_page_pud(vma, addr, pud);
+
+	if (page)
+		return page_folio(page);
+	return NULL;
+}
+#endif
+
 static void restore_exclusive_pte(struct vm_area_struct *vma,
 				  struct page *page, unsigned long address,
 				  pte_t *ptep)
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 10/45] fs/proc: Create smaps_pud_range to handle PUD-mapped hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (8 preceding siblings ...)
  2024-07-04  4:30 ` [PATCH 09/45] mm: Implement pud-version functions for swap and vm_normal_page_pud Oscar Salvador
@ 2024-07-04  4:30 ` Oscar Salvador
  2024-07-04  4:30 ` [PATCH 11/45] fs/proc: Enable smaps_pte_entry to handle cont-pte mapped " Oscar Salvador
                   ` (36 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Normal THP cannot be PUD-mapped (besides devmap), but hugetlb can, so create
smaps_pud_range in order to handle PUD-mapped hugetlb vmas.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 fs/proc/task_mmu.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 78d84d7e2353..3f3460ff03b0 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -912,6 +912,40 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 }
 #endif
 
+#ifdef CONFIG_HUGETLB_PAGE
+static int smaps_pud_range(pud_t *pudp, unsigned long addr, unsigned long end,
+			   struct mm_walk *walk)
+{
+	spinlock_t *ptl;
+	struct folio *folio;
+	struct vm_area_struct *vma = walk->vma;
+	struct mem_size_stats *mss = walk->private;
+
+	ptl = pud_huge_lock(pudp, vma);
+	if (!ptl)
+		return 0;
+
+	if (pud_present(*pudp)) {
+		folio = vm_normal_folio_pud(vma, addr, *pudp);
+	} else if (is_swap_pud(*pudp)) {
+		/* PUD-hugetlbs can have swap entries */
+		swp_entry_t swpent = pud_to_swp_entry(*pudp);
+
+		if (is_pfn_swap_entry(swpent))
+			folio = pfn_swap_entry_folio(swpent);
+	}
+
+	if (folio)
+		/* Only for now hugetlb pages can end up here */
+		mss_hugetlb_update(mss, folio, vma, (pte_t *)pudp);
+
+	spin_unlock(ptl);
+	return 0;
+}
+#else
+#define smaps_pud_range NULL
+#endif
+
 static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			   struct mm_walk *walk)
 {
@@ -1061,12 +1095,14 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
 #endif /* HUGETLB_PAGE */
 
 static const struct mm_walk_ops smaps_walk_ops = {
+	.pud_entry              = smaps_pud_range,
 	.pmd_entry		= smaps_pte_range,
 	.hugetlb_entry		= smaps_hugetlb_range,
 	.walk_lock		= PGWALK_RDLOCK,
 };
 
 static const struct mm_walk_ops smaps_shmem_walk_ops = {
+	.pud_entry              = smaps_pud_range,
 	.pmd_entry		= smaps_pte_range,
 	.hugetlb_entry		= smaps_hugetlb_range,
 	.pte_hole		= smaps_pte_hole,
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 11/45] fs/proc: Enable smaps_pte_entry to handle cont-pte mapped hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (9 preceding siblings ...)
  2024-07-04  4:30 ` [PATCH 10/45] fs/proc: Create smaps_pud_range to handle PUD-mapped hugetlb vmas Oscar Salvador
@ 2024-07-04  4:30 ` Oscar Salvador
  2024-07-04 10:30   ` David Hildenbrand
  2024-07-04  4:30 ` [PATCH 12/45] fs/proc: Enable pagemap_pmd_range to handle " Oscar Salvador
                   ` (35 subsequent siblings)
  46 siblings, 1 reply; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

HugeTLB pages can be cont-pte mapped, so teach smaps_pte_entry to handle
them.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 fs/proc/task_mmu.c      | 19 +++++++++++++------
 include/linux/pgtable.h | 12 ++++++++++++
 2 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3f3460ff03b0..4d94b6ce58dd 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -825,6 +825,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 	struct page *page = NULL;
 	bool present = false, young = false, dirty = false;
 	pte_t ptent = ptep_get(pte);
+	unsigned long size = pte_cont(ptent) ? PAGE_SIZE * CONT_PTES : PAGE_SIZE;
 
 	if (pte_present(ptent)) {
 		page = vm_normal_page(vma, addr, ptent);
@@ -834,18 +835,18 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 	} else if (is_swap_pte(ptent)) {
 		swp_entry_t swpent = pte_to_swp_entry(ptent);
 
-		if (!non_swap_entry(swpent)) {
+		if (!is_vm_hugetlb_page(vma) && !non_swap_entry(swpent)) {
 			int mapcount;
 
-			mss->swap += PAGE_SIZE;
+			mss->swap += size;
 			mapcount = swp_swapcount(swpent);
 			if (mapcount >= 2) {
-				u64 pss_delta = (u64)PAGE_SIZE << PSS_SHIFT;
+				u64 pss_delta = (u64)size << PSS_SHIFT;
 
 				do_div(pss_delta, mapcount);
 				mss->swap_pss += pss_delta;
 			} else {
-				mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
+				mss->swap_pss += (u64)size << PSS_SHIFT;
 			}
 		} else if (is_pfn_swap_entry(swpent)) {
 			if (is_device_private_entry(swpent))
@@ -860,7 +861,10 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 	if (!page)
 		return;
 
-	smaps_account(mss, page, false, young, dirty, locked, present);
+	if (is_vm_hugetlb_page(vma))
+		mss_hugetlb_update(mss, page_folio(page), vma, pte);
+	else
+		smaps_account(mss, page, false, young, dirty, locked, present);
 }
 
 #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
@@ -952,6 +956,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	struct vm_area_struct *vma = walk->vma;
 	pte_t *pte;
 	spinlock_t *ptl;
+	unsigned long size, cont_ptes;
 
 	ptl = pmd_huge_lock(pmd, vma);
 	if (ptl) {
@@ -965,7 +970,9 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		walk->action = ACTION_AGAIN;
 		return 0;
 	}
-	for (; addr != end; pte++, addr += PAGE_SIZE)
+	size = pte_cont(ptep_get(pte)) ? PAGE_SIZE * CONT_PTES : PAGE_SIZE;
+	cont_ptes = pte_cont(ptep_get(pte)) ? CONT_PTES : 1;
+	for (; addr != end; pte += cont_ptes, addr += size)
 		smaps_pte_entry(pte, addr, walk);
 	pte_unmap_unlock(pte - 1, ptl);
 out:
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a9edeb86b7fe..991137dab87e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1926,6 +1926,18 @@ typedef unsigned int pgtbl_mod_mask;
 #define CONT_PMDS 0
 #endif
 
+#ifndef pte_cont
+#define pte_cont(x) false
+#endif
+
+#ifndef CONT_PTE_SIZE
+#define CONT_PTE_SIZE 0
+#endif
+
+#ifndef CONT_PTES
+#define CONT_PTES 0
+#endif
+
 /*
  * We always define pmd_pfn for all archs as it's used in lots of generic
  * code.  Now it happens too for pud_pfn (and can happen for larger
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 12/45] fs/proc: Enable pagemap_pmd_range to handle hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (10 preceding siblings ...)
  2024-07-04  4:30 ` [PATCH 11/45] fs/proc: Enable smaps_pte_entry to handle cont-pte mapped " Oscar Salvador
@ 2024-07-04  4:30 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 13/45] mm: Implement pud-version uffd functions Oscar Salvador
                   ` (34 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

PMD-mapped hugetlb vmas will also reach pagemap_pmd_range.
Add the required code so it knows how to handle those there.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 fs/proc/task_mmu.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4d94b6ce58dd..ec429d82b921 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1824,9 +1824,9 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 	spinlock_t *ptl;
 	pte_t *pte, *orig_pte;
 	int err = 0;
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
 
-	ptl = pmd_trans_huge_lock(pmdp, vma);
+	ptl = pmd_huge_lock(pmdp, vma);
 	if (ptl) {
 		unsigned int idx = (addr & ~PMD_MASK) >> PAGE_SHIFT;
 		u64 flags = 0, frame = 0;
@@ -1848,7 +1848,6 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 			if (pm->show_pfn)
 				frame = pmd_pfn(pmd) + idx;
 		}
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 		else if (is_swap_pmd(pmd)) {
 			swp_entry_t entry = pmd_to_swp_entry(pmd);
 			unsigned long offset;
@@ -1861,7 +1860,8 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 				frame = swp_type(entry) |
 					(offset << MAX_SWAPFILES_SHIFT);
 			}
-			flags |= PM_SWAP;
+			if (!is_vm_hugetlb_page(vma))
+				flags |= PM_SWAP;
 			if (pmd_swp_soft_dirty(pmd))
 				flags |= PM_SOFT_DIRTY;
 			if (pmd_swp_uffd_wp(pmd))
@@ -1869,7 +1869,6 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 			VM_BUG_ON(!is_pmd_migration_entry(pmd));
 			page = pfn_swap_entry_to_page(entry);
 		}
-#endif
 
 		if (page) {
 			folio = page_folio(page);
@@ -1899,7 +1898,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 		spin_unlock(ptl);
 		return err;
 	}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
 
 	/*
 	 * We can assume that @vma always points to a valid one and @end never
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 13/45] mm: Implement pud-version uffd functions
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (11 preceding siblings ...)
  2024-07-04  4:30 ` [PATCH 12/45] fs/proc: Enable pagemap_pmd_range to handle " Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-05 15:48   ` kernel test robot
  2024-07-05 15:48   ` kernel test robot
  2024-07-04  4:31 ` [PATCH 14/45] fs/proc: Create pagemap_pud_range to handle PUD-mapped hugetlb vmas Oscar Salvador
                   ` (33 subsequent siblings)
  46 siblings, 2 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

HugeTLB pages will be handled on pud level as well, so we need to
implement pud-versions of uffd functions in order to properly handle
it.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 arch/arm64/include/asm/pgtable.h   |   7 ++
 arch/x86/include/asm/pgtable.h     | 158 +++++++++++++++++------------
 include/asm-generic/pgtable_uffd.h |  30 ++++++
 include/linux/pgtable.h            |   5 +
 4 files changed, 136 insertions(+), 64 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index f8efbc128446..936ed3a915a3 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -556,6 +556,9 @@ static inline int pmd_trans_huge(pmd_t pmd)
 #define pmd_mkyoung(pmd)	pte_pmd(pte_mkyoung(pmd_pte(pmd)))
 #define pmd_mkinvalid(pmd)	pte_pmd(pte_mkinvalid(pmd_pte(pmd)))
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+#define pud_uffd_wp(pud)	pte_uffd_wp(pud_pte(pud))
+#define pud_mkuffd_wp(pud)	pte_pud(pte_mkuffd_wp(pud_pte(pud)))
+#define pud_clear_uffd_wp(pud)	pte_pud(pte_clear_uffd_wp(pud_pte(pud)))
 #define pmd_uffd_wp(pmd)	pte_uffd_wp(pmd_pte(pmd))
 #define pmd_mkuffd_wp(pmd)	pte_pmd(pte_mkuffd_wp(pmd_pte(pmd)))
 #define pmd_clear_uffd_wp(pmd)	pte_pmd(pte_clear_uffd_wp(pmd_pte(pmd)))
@@ -563,6 +566,10 @@ static inline int pmd_trans_huge(pmd_t pmd)
 #define pmd_swp_mkuffd_wp(pmd)	pte_pmd(pte_swp_mkuffd_wp(pmd_pte(pmd)))
 #define pmd_swp_clear_uffd_wp(pmd) \
 				pte_pmd(pte_swp_clear_uffd_wp(pmd_pte(pmd)))
+#define pud_swp_uffd_wp(pud)	pte_swp_uffd_wp(pud_pte(pud))
+#define pud_swp_mkuffd_wp(pud)	pte_pud(pte_swp_mkuffd_wp(pud_pte(pud)))
+#define pud_swp_clear_uffd_wp(pud) \
+				pte_pud(pte_swp_clear_uffd_wp(pud_pte(pud)))
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
 #define pmd_write(pmd)		pte_write(pmd_pte(pmd))
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 772f778bac06..640edc31962f 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -512,70 +512,6 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)
 	return pmd_mksaveddirty(pmd);
 }
 
-#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-static inline int pmd_uffd_wp(pmd_t pmd)
-{
-	return pmd_flags(pmd) & _PAGE_UFFD_WP;
-}
-
-static inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
-{
-	return pmd_wrprotect(pmd_set_flags(pmd, _PAGE_UFFD_WP));
-}
-
-static inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
-{
-	return pmd_clear_flags(pmd, _PAGE_UFFD_WP);
-}
-#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
-
-static inline pmd_t pmd_mkold(pmd_t pmd)
-{
-	return pmd_clear_flags(pmd, _PAGE_ACCESSED);
-}
-
-static inline pmd_t pmd_mkclean(pmd_t pmd)
-{
-	return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS);
-}
-
-static inline pmd_t pmd_mkdirty(pmd_t pmd)
-{
-	pmd = pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
-
-	return pmd_mksaveddirty(pmd);
-}
-
-static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
-{
-	pmd = pmd_clear_flags(pmd, _PAGE_RW);
-
-	return pmd_set_flags(pmd, _PAGE_DIRTY);
-}
-
-static inline pmd_t pmd_mkdevmap(pmd_t pmd)
-{
-	return pmd_set_flags(pmd, _PAGE_DEVMAP);
-}
-
-static inline pmd_t pmd_mkhuge(pmd_t pmd)
-{
-	return pmd_set_flags(pmd, _PAGE_PSE);
-}
-
-static inline pmd_t pmd_mkyoung(pmd_t pmd)
-{
-	return pmd_set_flags(pmd, _PAGE_ACCESSED);
-}
-
-static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
-{
-	return pmd_set_flags(pmd, _PAGE_RW);
-}
-
-pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
-#define pmd_mkwrite pmd_mkwrite
-
 static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
 {
 	pudval_t v = native_pud_val(pud);
@@ -659,6 +595,85 @@ static inline pud_t pud_mkwrite(pud_t pud)
 	return pud_clear_saveddirty(pud);
 }
 
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+static inline int pud_uffd_wp(pud_t pud)
+{
+	return pud_flags(pud) & _PAGE_UFFD_WP;
+}
+
+static inline pud_t pud_mkuffd_wp(pud_t pud)
+{
+	return pud_wrprotect(pud_set_flags(pud, _PAGE_UFFD_WP));
+}
+
+static inline pud_t pud_clear_uffd_wp(pud_t pud)
+{
+	return pud_clear_flags(pud, _PAGE_UFFD_WP);
+}
+
+static inline int pmd_uffd_wp(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_UFFD_WP;
+}
+
+static inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
+{
+	return pmd_wrprotect(pmd_set_flags(pmd, _PAGE_UFFD_WP));
+}
+
+static inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_UFFD_WP);
+}
+#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
+
+static inline pmd_t pmd_mkold(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_mkclean(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS);
+}
+
+static inline pmd_t pmd_mkdirty(pmd_t pmd)
+{
+	pmd = pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+
+	return pmd_mksaveddirty(pmd);
+}
+
+static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
+{
+	pmd = pmd_clear_flags(pmd, _PAGE_RW);
+
+	return pmd_set_flags(pmd, _PAGE_DIRTY);
+}
+
+static inline pmd_t pmd_mkdevmap(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_DEVMAP);
+}
+
+static inline pmd_t pmd_mkhuge(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_PSE);
+}
+
+static inline pmd_t pmd_mkyoung(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_mkwrite_novma(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_RW);
+}
+
+pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
+#define pmd_mkwrite pmd_mkwrite
+
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 static inline int pte_soft_dirty(pte_t pte)
 {
@@ -1574,6 +1589,21 @@ static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
 {
 	return pmd_clear_flags(pmd, _PAGE_SWP_UFFD_WP);
 }
+
+static inline pud_t pud_swp_mkuffd_wp(pud_t pud)
+{
+	return pud_set_flags(pud, _PAGE_SWP_UFFD_WP);
+}
+
+static inline int pud_swp_uffd_wp(pud_t pud)
+{
+	return pud_flags(pud) & _PAGE_SWP_UFFD_WP;
+}
+
+static inline pud_t pud_swp_clear_uffd_wp(pud_t pud)
+{
+	return pud_clear_flags(pud, _PAGE_SWP_UFFD_WP);
+}
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
 static inline u16 pte_flags_pkey(unsigned long pte_flags)
diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
index 828966d4c281..453118eb2161 100644
--- a/include/asm-generic/pgtable_uffd.h
+++ b/include/asm-generic/pgtable_uffd.h
@@ -12,6 +12,11 @@ static __always_inline int pmd_uffd_wp(pmd_t pmd)
 	return 0;
 }
 
+static __always_inline int pud_uffd_wp(pud_t pud)
+{
+	return 0;
+}
+
 static __always_inline pte_t pte_mkuffd_wp(pte_t pte)
 {
 	return pte;
@@ -22,6 +27,11 @@ static __always_inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
 	return pmd;
 }
 
+static __always_inline pud_t pud_mkuffd_wp(pud_t pud)
+{
+	return pmd;
+}
+
 static __always_inline pte_t pte_clear_uffd_wp(pte_t pte)
 {
 	return pte;
@@ -32,6 +42,11 @@ static __always_inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
 	return pmd;
 }
 
+static __always_inline pud_t pud_clear_uffd_wp(pud_t pud)
+{
+	return pmd;
+}
+
 static __always_inline pte_t pte_swp_mkuffd_wp(pte_t pte)
 {
 	return pte;
@@ -61,6 +76,21 @@ static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
 {
 	return pmd;
 }
+
+static inline pud_t pud_swp_mkuffd_wp(pud_t pud)
+{
+	return pud;
+}
+
+static inline int pud_swp_uffd_wp(pud_t pud)
+{
+	return 0;
+}
+
+static inline pud_t pud_swp_clear_uffd_wp(pud_t pud)
+{
+	return pud;
+}
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
 #endif /* _ASM_GENERIC_PGTABLE_UFFD_H */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 991137dab87e..a2e2ebb93f21 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1433,6 +1433,11 @@ static inline int pmd_soft_dirty(pmd_t pmd)
 	return 0;
 }
 
+static inline int pud_soft_dirty(pud_t pud)
+{
+	return 0;
+}
+
 static inline pte_t pte_mksoft_dirty(pte_t pte)
 {
 	return pte;
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 14/45] fs/proc: Create pagemap_pud_range to handle PUD-mapped hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (12 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 13/45] mm: Implement pud-version uffd functions Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 15/45] fs/proc: Adjust pte_to_pagemap_entry for " Oscar Salvador
                   ` (32 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Normal THP cannot be PUD-mapped (besides devmap) but hugetlb can,
so create pagemap_pud_range in order to handle PUD-mapped hugetlb vmas

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 fs/proc/task_mmu.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ec429d82b921..5965a074467e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1924,6 +1924,65 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 	return err;
 }
 
+#ifdef CONFIG_HUGETLB_PAGE
+static int pagemap_pud_range(pud_t *pudp, unsigned long addr, unsigned long end,
+			     struct mm_walk *walk)
+{
+	pud_t pud;
+	int err = 0;
+	spinlock_t *ptl;
+	u64 flags = 0, frame = 0;
+	struct pagemapread *pm = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+
+	ptl = pud_huge_lock(pudp, vma);
+	if (!ptl)
+		return err;
+
+	pud = *pudp;
+
+	if (vma->vm_flags & VM_SOFTDIRTY)
+		flags |= PM_SOFT_DIRTY;
+	if (pud_present(pud)) {
+		struct folio *folio = page_folio(pud_page(pud));
+
+		flags |= PM_PRESENT;
+		if (!folio_test_anon(folio))
+			flags |= PM_FILE;
+
+		if (!folio_likely_mapped_shared(folio))
+			flags |= PM_MMAP_EXCLUSIVE;
+
+		if (pud_soft_dirty(pud))
+			flags |= PM_SOFT_DIRTY;
+		if (pud_uffd_wp(pud))
+			flags |= PM_UFFD_WP;
+		if (pm->show_pfn)
+			frame = pud_pfn(pud) +
+				((addr & ~PUD_MASK) >> PAGE_SHIFT);
+	} else if (pud_swp_uffd_wp(pud)) {
+		/* Only hugetlb can have swap entries at PUD level*/
+		flags |= PM_UFFD_WP;
+	}
+
+	for (; addr != end; addr += PAGE_SIZE) {
+		pagemap_entry_t pme = make_pme(frame, flags);
+
+		err = add_to_pagemap(&pme, pm);
+		if (err)
+			return err;
+		if (pm->show_pfn && (flags & PM_PRESENT))
+			frame++;
+	}
+	spin_unlock(ptl);
+
+	cond_resched();
+	return err;
+}
+#else
+#define pagemap_pud_range NULL
+#endif
+
 #ifdef CONFIG_HUGETLB_PAGE
 /* This function walks within one hugetlb entry in the single call */
 static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
@@ -1980,6 +2039,7 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
 #endif /* HUGETLB_PAGE */
 
 static const struct mm_walk_ops pagemap_ops = {
+	.pud_entry      = pagemap_pud_range,
 	.pmd_entry	= pagemap_pmd_range,
 	.pte_hole	= pagemap_pte_hole,
 	.hugetlb_entry	= pagemap_hugetlb_range,
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 15/45] fs/proc: Adjust pte_to_pagemap_entry for hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (13 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 14/45] fs/proc: Create pagemap_pud_range to handle PUD-mapped hugetlb vmas Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 16/45] fs/proc: Enable pagemap_scan_pmd_entry to handle " Oscar Salvador
                   ` (31 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

HugeTLB pages cannot be swapped , so do not mark them as that.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 fs/proc/task_mmu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5965a074467e..22200018371d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1795,7 +1795,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 			frame = swp_type(entry) |
 			    (offset << MAX_SWAPFILES_SHIFT);
 		}
-		flags |= PM_SWAP;
+		if (!is_vm_hugetlb_page(vma))
+			flags |= PM_SWAP;
 		if (is_pfn_swap_entry(entry))
 			page = pfn_swap_entry_to_page(entry);
 		if (pte_marker_entry_uffd_wp(entry))
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 16/45] fs/proc: Enable pagemap_scan_pmd_entry to handle hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (14 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 15/45] fs/proc: Adjust pte_to_pagemap_entry for " Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 17/45] mm: Implement pud-version for pud_mkinvalid and pudp_establish Oscar Salvador
                   ` (30 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

PMD-mapped hugetlb vmas will also reach pagemap_scan_pmd_entry.
Add the required code so it knows how to handle those there.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 fs/proc/task_mmu.c | 41 ++++++++++++++++++++++++++---------------
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 22200018371d..df649f69ea2c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2263,8 +2263,8 @@ static void make_uffd_wp_pte(struct vm_area_struct *vma,
 	}
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static unsigned long pagemap_thp_category(struct pagemap_scan_private *p,
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
+static unsigned long pagemap_pmd_category(struct pagemap_scan_private *p,
 					  struct vm_area_struct *vma,
 					  unsigned long addr, pmd_t pmd)
 {
@@ -2296,7 +2296,8 @@ static unsigned long pagemap_thp_category(struct pagemap_scan_private *p,
 		if (pmd_swp_soft_dirty(pmd))
 			categories |= PAGE_IS_SOFT_DIRTY;
 
-		if (p->masks_of_interest & PAGE_IS_FILE) {
+		if ((p->masks_of_interest & PAGE_IS_FILE) &&
+		    !is_vm_hugetlb_page(vma)) {
 			swp = pmd_to_swp_entry(pmd);
 			if (is_pfn_swap_entry(swp) &&
 			    !folio_test_anon(pfn_swap_entry_folio(swp)))
@@ -2321,7 +2322,7 @@ static void make_uffd_wp_pmd(struct vm_area_struct *vma,
 		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
 	}
 }
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
 
 #ifdef CONFIG_HUGETLB_PAGE
 static unsigned long pagemap_hugetlb_category(pte_t pte)
@@ -2522,22 +2523,22 @@ static int pagemap_scan_output(unsigned long categories,
 	return ret;
 }
 
-static int pagemap_scan_thp_entry(pmd_t *pmd, unsigned long start,
+static int pagemap_scan_huge_entry(pmd_t *pmd, unsigned long start,
 				  unsigned long end, struct mm_walk *walk)
 {
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
 	struct pagemap_scan_private *p = walk->private;
 	struct vm_area_struct *vma = walk->vma;
 	unsigned long categories;
 	spinlock_t *ptl;
 	int ret = 0;
 
-	ptl = pmd_trans_huge_lock(pmd, vma);
+	ptl = pmd_huge_lock(pmd, vma);
 	if (!ptl)
 		return -ENOENT;
 
 	categories = p->cur_vma_category |
-		     pagemap_thp_category(p, vma, start, *pmd);
+		     pagemap_pmd_category(p, vma, start, *pmd);
 
 	if (!pagemap_scan_is_interesting_page(categories, p))
 		goto out_unlock;
@@ -2556,19 +2557,29 @@ static int pagemap_scan_thp_entry(pmd_t *pmd, unsigned long start,
 	 * needs to be performed on a portion of the huge page.
 	 */
 	if (end != start + HPAGE_SIZE) {
-		spin_unlock(ptl);
-		split_huge_pmd(vma, pmd, start);
 		pagemap_scan_backout_range(p, start, end);
-		/* Report as if there was no THP */
-		return -ENOENT;
+		if (!is_vm_hugetlb_page(vma)) {
+			/* Report as if there was no THP */
+			spin_unlock(ptl);
+			split_huge_pmd(vma, pmd, start);
+			ret = -ENOENT;
+			goto out;
+		}
+		ret = 0;
+		p->arg.walk_end = start;
+		goto out_unlock;
 	}
 
 	make_uffd_wp_pmd(vma, start, pmd);
-	flush_tlb_range(vma, start, end);
+	if (is_vm_hugetlb_page(vma))
+		flush_hugetlb_tlb_range(vma, start, end);
+	else
+		flush_tlb_range(vma, start, end);
 out_unlock:
 	spin_unlock(ptl);
+out:
 	return ret;
-#else /* !CONFIG_TRANSPARENT_HUGEPAGE */
+#else /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
 	return -ENOENT;
 #endif
 }
@@ -2585,7 +2596,7 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
 
 	arch_enter_lazy_mmu_mode();
 
-	ret = pagemap_scan_thp_entry(pmd, start, end, walk);
+	ret = pagemap_scan_huge_entry(pmd, start, end, walk);
 	if (ret != -ENOENT) {
 		arch_leave_lazy_mmu_mode();
 		return ret;
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 17/45] mm: Implement pud-version for pud_mkinvalid and pudp_establish
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (15 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 16/45] fs/proc: Enable pagemap_scan_pmd_entry to handle " Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 18/45] fs/proc: Create pagemap_scan_pud_entry to handle PUD-mapped hugetlb vmas Oscar Salvador
                   ` (29 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

HugeTLB pages will be handled on pud level as well, so we need to
implement pud-versions of pud_mkinvalid and pudp_establish.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 arch/arm64/include/asm/pgtable.h             | 11 ++++++
 arch/loongarch/include/asm/pgtable.h         |  8 ++++
 arch/mips/include/asm/pgtable.h              |  7 ++++
 arch/powerpc/include/asm/book3s/64/pgtable.h |  7 +++-
 arch/powerpc/mm/book3s64/pgtable.c           | 15 ++++++-
 arch/riscv/include/asm/pgtable.h             | 15 +++++++
 arch/x86/include/asm/pgtable.h               | 31 ++++++++++++++-
 include/linux/pgtable.h                      | 41 +++++++++++++++++++-
 mm/pgtable-generic.c                         | 21 ++++++++++
 9 files changed, 150 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 936ed3a915a3..5e26e63b1012 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -595,6 +595,7 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
 #define pud_write(pud)		pte_write(pud_pte(pud))
 
 #define pud_mkhuge(pud)		(__pud(pud_val(pud) & ~PUD_TABLE_BIT))
+#define pud_mkinvalid(pud)	pte_pud(pte_mkinvalid(pud_pte(pud)))
 
 #define __pud_to_phys(pud)	__pte_to_phys(pud_pte(pud))
 #define __phys_to_pud_val(phys)	__phys_to_pte_val(phys)
@@ -1344,6 +1345,16 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 }
 #endif
 
+#ifdef CONFIG_HUGETLB_PAGE
+#define pudp_establish pudp_establish
+static inline pud_t pudp_establish(struct vm_area_struct *vma,
+		unsigned long address, pud_t *pudp, pud_t pud)
+{
+	page_table_check_pud_set(vma->vm_mm, pudp, pud);
+	return __pud(xchg_relaxed(&pud_val(*pudp), pud_val(pud)));
+}
+#endif
+
 /*
  * Encode and decode a swap entry:
  *	bits 0-1:	present (must be zero)
diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h
index 161dd6e10479..cf73c2f2da2c 100644
--- a/arch/loongarch/include/asm/pgtable.h
+++ b/arch/loongarch/include/asm/pgtable.h
@@ -581,6 +581,14 @@ static inline pmd_t pmd_mkinvalid(pmd_t pmd)
 	return pmd;
 }
 
+static inline pud_t pud_mkinvalid(pud_t pud)
+{
+	pud_val(pud) |= _PAGE_PRESENT_INVALID;
+	pud_val(pud) &= ~(_PAGE_PRESENT | _PAGE_VALID | _PAGE_DIRTY | _PAGE_PROTNONE);
+
+	return pud;
+}
+
 /*
  * The generic version pmdp_huge_get_and_clear uses a version of pmd_clear() with a
  * different prototype.
diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
index c29a551eb0ca..390a2f022147 100644
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -736,6 +736,13 @@ static inline pmd_t pmd_mkinvalid(pmd_t pmd)
 	return pmd;
 }
 
+static inline pud_t pud_mkinvalid(pud_t pud)
+{
+	pud_val(pud) &= ~(_PAGE_PRESENT | _PAGE_VALID | _PAGE_DIRTY);
+
+	return pud;
+}
+
 /*
  * The generic version pmdp_huge_get_and_clear uses a version of pmd_clear() with a
  * different prototype.
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index fa4bb8d6356f..f95ac2a87548 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1085,7 +1085,8 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
 #define pmd_mksoft_dirty(pmd)  pte_pmd(pte_mksoft_dirty(pmd_pte(pmd)))
 #define pmd_clear_soft_dirty(pmd) pte_pmd(pte_clear_soft_dirty(pmd_pte(pmd)))
 
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#if defined(CONFIG_ARCH_ENABLE_THP_MIGRATION) || defined(CONFIG_HUGETLB_PAGE)
+#define pud_swp_soft_dirty(pud)		pte_swp_soft_dirty(pud_pte(pud))
 #define pmd_swp_mksoft_dirty(pmd)	pte_pmd(pte_swp_mksoft_dirty(pmd_pte(pmd)))
 #define pmd_swp_soft_dirty(pmd)		pte_swp_soft_dirty(pmd_pte(pmd))
 #define pmd_swp_clear_soft_dirty(pmd)	pte_pmd(pte_swp_clear_soft_dirty(pmd_pte(pmd)))
@@ -1386,6 +1387,10 @@ static inline pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm,
 extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 			     pmd_t *pmdp);
 
+#define __HAVE_ARCH_PUDP_INVALIDATE
+extern pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+			     pud_t *pudp);
+
 #define pmd_move_must_withdraw pmd_move_must_withdraw
 struct spinlock;
 extern int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index f4d8d3c40e5c..1b6ae7898f99 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -37,7 +37,7 @@ EXPORT_SYMBOL(__pmd_frag_nr);
 unsigned long __pmd_frag_size_shift;
 EXPORT_SYMBOL(__pmd_frag_size_shift);
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
 /*
  * This is called when relaxing access to a hugepage. It's also called in the page
  * fault path when we don't hit any of the major fault cases, ie, a minor
@@ -259,7 +259,18 @@ pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 	pmdv &= _HPAGE_CHG_MASK;
 	return pmd_set_protbits(__pmd(pmdv), newprot);
 }
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+		     pud_t *pudp)
+{
+	unsigned long old_pud;
+
+	VM_WARN_ON_ONCE(!pud_present(*pudp));
+	old_pud = pud_hugepage_update(vma->vm_mm, address, pudp, _PAGE_PRESENT, _PAGE_INVALID);
+	flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
+	return __pud(old_pud);
+}
+#endif /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
 
 /* For use by kexec, called with MMU off */
 notrace void mmu_cleanup_all(void)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index ebfe8faafb79..51600afa203c 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -657,6 +657,11 @@ static inline unsigned long pud_pfn(pud_t pud)
 	return ((__pud_to_phys(pud) & PUD_MASK) >> PAGE_SHIFT);
 }
 
+static inline pud_t pud_mkinvalid(pud_t pud)
+{
+	return __pud(pud_val(pud) & ~(_PAGE_PRESENT|_PAGE_PROT_NONE));
+}
+
 static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 {
 	return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
@@ -804,6 +809,16 @@ extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
 				 unsigned long address, pmd_t *pmdp);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#ifdef CONFIG_HUGETLB_PAGE
+#define pudp_establish pudp_establish
+static inline pud_t pudp_establish(struct vm_area_struct *vma,
+				unsigned long address, pud_t *pudp, pud_t pud)
+{
+	page_table_check_pud_set(vma->vm_mm, pudp, pud);
+	return __pud(atomic_long_xchg((atomic_long_t *)pudp, pud_val(pud)));
+}
+#endif
+
 /*
  * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
  * are !pte_none() && !pte_present().
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 640edc31962f..572458a106e9 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -783,6 +783,12 @@ static inline pmd_t pmd_mkinvalid(pmd_t pmd)
 		      __pgprot(pmd_flags(pmd) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
 }
 
+static inline pud_t pud_mkinvalid(pud_t pud)
+{
+	return pfn_pud(pud_pfn(pud),
+		      __pgprot(pud_flags(pud) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
+}
+
 static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
 
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
@@ -1353,6 +1359,23 @@ static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
 	return pud;
 }
 
+#ifndef pudp_establish
+#define pudp_establish pudp_establish
+static inline pud_t pudp_establish(struct vm_area_struct *vma,
+		unsigned long address, pud_t *pudp, pud_t pud)
+{
+	page_table_check_pud_set(vma->vm_mm, pudp, pud);
+	if (IS_ENABLED(CONFIG_SMP)) {
+		return xchg(pudp, pud);
+	} else {
+		pud_t old = *pudp;
+
+		WRITE_ONCE(*pudp, pud);
+		return old;
+	}
+}
+#endif
+
 #define __HAVE_ARCH_PMDP_SET_WRPROTECT
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pmd_t *pmdp)
@@ -1389,7 +1412,6 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 #define __HAVE_ARCH_PMDP_INVALIDATE_AD
 extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
 				unsigned long address, pmd_t *pmdp);
-
 /*
  * Page table pages are page-aligned.  The lower half of the top
  * level is used for userspace and the top half for the kernel.
@@ -1541,7 +1563,12 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
 	return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
 }
 
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#if defined(CONFIG_ARCH_ENABLE_THP_MIGRATION) || defined(CONFIG_HUGETLB_PAGE)
+static inline int pud_swp_soft_dirty(pud_t pud)
+{
+	return pud_flags(pud) & _PAGE_SWP_SOFT_DIRTY;
+}
+
 static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
 {
 	return pmd_set_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a2e2ebb93f21..458e3cbc96b2 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -956,6 +956,11 @@ extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 			    pmd_t *pmdp);
 #endif
 
+#ifndef __HAVE_ARCH_PUDP_INVALIDATE
+extern pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+			    pud_t *pudp);
+#endif
+
 #ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
 
 /*
@@ -976,6 +981,26 @@ extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
 				unsigned long address, pmd_t *pmdp);
 #endif
 
+#ifndef __HAVE_ARCH_PUDP_INVALIDATE_AD
+
+/*
+ * pudp_invalidate_ad() invalidates the PMD while changing a hugetlb mapping in
+ * the page tables. This function is similar to pudp_invalidate(), but should
+ * only be used if the access and dirty bits would  not be cleared by the software
+ * in the new PUD value. The function ensures  that hardware changes of the access
+ * and dirty bits updates would not be lost.
+ *
+ * Doing so can allow in certain architectures to avoid a TLB flush in most
+ * cases. Yet, another TLB flush might be necessary later if the PUD update
+ * itself requires such flush (e.g., if protection was set to be stricter). Yet,
+ * even when a TLB flush is needed because of the update, the caller may be able
+ * to batch these TLB flushing operations, so fewer TLB flush operations are
+ * needed.
+ */
+extern pud_t pudp_invalidate_ad(struct vm_area_struct *vma,
+				unsigned long address, pud_t *pudp);
+#endif
+
 #ifndef __HAVE_ARCH_PTE_SAME
 static inline int pte_same(pte_t pte_a, pte_t pte_b)
 {
@@ -1406,7 +1431,16 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 #endif
 
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
-#ifndef CONFIG_ARCH_ENABLE_THP_MIGRATION
+static inline int pud_soft_dirty(pud_t pud)
+{
+	return 0;
+}
+#if !defined(CONFIG_ARCH_ENABLE_THP_MIGRATION) && !defined(CONFIG_HUGETLB_PAGE)
+static inline int pud_swp_soft_dirty(pud_t pud)
+{
+	return 0;
+}
+
 static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
 {
 	return pmd;
@@ -1487,6 +1521,11 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
 {
 	return pmd;
 }
+
+static inline int pud_swp_soft_dirty(pud_t pud)
+{
+	return 0;
+}
 #endif
 
 #ifndef __HAVE_PFNMAP_TRACKING
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index a78a4adf711a..e11ad8663903 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -194,6 +194,27 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 }
 #endif
 
+#ifndef __HAVE_ARCH_PUDP_INVALIDATE
+pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+		     pud_t *pudp)
+{
+	VM_WARN_ON_ONCE(!pud_present(*pudp));
+	pud_t old = pudp_establish(vma, address, pudp, pud_mkinvalid(*pudp));
+
+	flush_pud_tlb_range(vma, address, address + PUD_SIZE);
+	return old;
+}
+#endif
+
+#ifndef __HAVE_ARCH_PUDP_INVALIDATE_AD
+pud_t pudp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
+			 pud_t *pudp)
+{
+	VM_WARN_ON_ONCE(!pud_present(*pudp));
+	return pudp_invalidate(vma, address, pudp);
+}
+#endif
+
 #ifndef __HAVE_ARCH_PMDP_INVALIDATE
 pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 		     pmd_t *pmdp)
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 18/45] fs/proc: Create pagemap_scan_pud_entry to handle PUD-mapped hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (16 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 17/45] mm: Implement pud-version for pud_mkinvalid and pudp_establish Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 19/45] fs/proc: Enable gather_pte_stats to handle " Oscar Salvador
                   ` (28 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Normal THP cannot be PUD-mapped (besides devmap) but hugetlb can,
so create pagemap_scan_pud_entry in order to handle PUD-mapped hugetlb vmas.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 fs/proc/task_mmu.c | 104 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 103 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index df649f69ea2c..3785a44b97fa 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1925,7 +1925,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 	return err;
 }
 
-#ifdef CONFIG_HUGETLB_PAGE
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
 static int pagemap_pud_range(pud_t *pudp, unsigned long addr, unsigned long end,
 			     struct mm_walk *walk)
 {
@@ -2324,6 +2324,59 @@ static void make_uffd_wp_pmd(struct vm_area_struct *vma,
 }
 #endif /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
 
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
+static unsigned long pagemap_pud_category(struct pagemap_scan_private *p,
+					  struct vm_area_struct *vma,
+					  unsigned long addr, pud_t pud)
+{
+	unsigned long categories = PAGE_IS_HUGE;
+
+	if (pud_present(pud)) {
+		struct page *page;
+
+		categories |= PAGE_IS_PRESENT;
+		if (!pud_uffd_wp(pud))
+			categories |= PAGE_IS_WRITTEN;
+
+		if (p->masks_of_interest & PAGE_IS_FILE) {
+			page = vm_normal_page_pud(vma, addr, pud);
+			if (page && !PageAnon(page))
+				categories |= PAGE_IS_FILE;
+		}
+
+		if (is_zero_pfn(pud_pfn(pud)))
+			categories |= PAGE_IS_PFNZERO;
+		if (pud_soft_dirty(pud))
+			categories |= PAGE_IS_SOFT_DIRTY;
+	} else if (is_swap_pud(pud)) {
+		swp_entry_t swp;
+
+		categories |= PAGE_IS_SWAPPED;
+		if (!pud_swp_uffd_wp(pud))
+			categories |= PAGE_IS_WRITTEN;
+		if (pud_swp_soft_dirty(pud))
+			categories |= PAGE_IS_SOFT_DIRTY;
+	}
+
+	return categories;
+}
+
+static void make_uffd_wp_pud(struct vm_area_struct *vma,
+			     unsigned long addr, pud_t *pudp)
+{
+	pud_t old, pud = *pudp;
+
+	if (pud_present(pud)) {
+		old = pudp_invalidate_ad(vma, addr, pudp);
+		pud = pud_mkuffd_wp(old);
+		set_pud_at(vma->vm_mm, addr, pudp, pud);
+	} else if (is_migration_entry(pud_to_swp_entry(pud))) {
+		pud = pud_swp_mkuffd_wp(pud);
+		set_pud_at(vma->vm_mm, addr, pudp, pud);
+	}
+}
+#endif /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
+
 #ifdef CONFIG_HUGETLB_PAGE
 static unsigned long pagemap_hugetlb_category(pte_t pte)
 {
@@ -2685,6 +2738,54 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
 	return ret;
 }
 
+#ifdef CONFIG_HUGETLB_PAGE
+static int pagemap_scan_pud_entry(pud_t *pud, unsigned long start,
+				  unsigned long end, struct mm_walk *walk)
+{
+	int ret = 0;
+	spinlock_t *ptl;
+	unsigned long categories;
+	struct vm_area_struct *vma = walk->vma;
+	struct pagemap_scan_private *p = walk->private;
+
+	/* Only PUD-mapped hugetlb can reach here at this moment */
+	ptl = pud_huge_lock(pud, vma);
+	if (!ptl)
+		return 0;
+
+	categories = p->cur_vma_category |
+		     pagemap_pud_category(p, vma, start, *pud);
+
+	if (!pagemap_scan_is_interesting_page(categories, p))
+		goto out_unlock;
+
+	ret = pagemap_scan_output(categories, p, start, &end);
+	if (start == end)
+		goto out_unlock;
+
+	if (~p->arg.flags & PM_SCAN_WP_MATCHING)
+		goto out_unlock;
+	if (~categories & PAGE_IS_WRITTEN)
+		goto out_unlock;
+
+	if (end != start + PUD_SIZE) {
+		ret = 0;
+		pagemap_scan_backout_range(p, start, end);
+		p->arg.walk_end = start;
+		goto out_unlock;
+	}
+
+	make_uffd_wp_pud(vma, start, pud);
+	flush_hugetlb_tlb_range(vma, start, end);
+
+out_unlock:
+	spin_unlock(ptl);
+	return ret;
+}
+#else
+#define pagemap_scan_pud_entry	NULL
+#endif
+
 #ifdef CONFIG_HUGETLB_PAGE
 static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
 				      unsigned long start, unsigned long end,
@@ -2772,6 +2873,7 @@ static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end,
 
 static const struct mm_walk_ops pagemap_scan_ops = {
 	.test_walk = pagemap_scan_test_walk,
+	.pud_entry = pagemap_scan_pud_entry,
 	.pmd_entry = pagemap_scan_pmd_entry,
 	.pte_hole = pagemap_scan_pte_hole,
 	.hugetlb_entry = pagemap_scan_hugetlb_entry,
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 19/45] fs/proc: Enable gather_pte_stats to handle hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (17 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 18/45] fs/proc: Create pagemap_scan_pud_entry to handle PUD-mapped hugetlb vmas Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 20/45] fs/proc: Enable gather_pte_stats to handle cont-pte mapped " Oscar Salvador
                   ` (27 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

PMD-mapped hugetlb vmas will also reach gather_pte_stats.
Add the required code so it knows how to handle those there.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 fs/proc/task_mmu.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3785a44b97fa..e13754d3246e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -3141,7 +3141,7 @@ static struct page *can_gather_numa_stats(pte_t pte, struct vm_area_struct *vma,
 	return page;
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
 static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
 					      struct vm_area_struct *vma,
 					      unsigned long addr)
@@ -3176,15 +3176,21 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	pte_t *orig_pte;
 	pte_t *pte;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	ptl = pmd_trans_huge_lock(pmd, vma);
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
+	ptl = pmd_huge_lock(pmd, vma);
 	if (ptl) {
+		unsigned long nr_pages;
 		struct page *page;
 
+		if (is_vm_hugetlb_page(vma))
+			nr_pages = 1;
+		else
+			nr_pages = HPAGE_PMD_SIZE / PAGE_SIZE;
+
 		page = can_gather_numa_stats_pmd(*pmd, vma, addr);
 		if (page)
 			gather_stats(page, md, pmd_dirty(*pmd),
-				     HPAGE_PMD_SIZE/PAGE_SIZE);
+				     nr_pages);
 		spin_unlock(ptl);
 		return 0;
 	}
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 20/45] fs/proc: Enable gather_pte_stats to handle cont-pte mapped hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (18 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 19/45] fs/proc: Enable gather_pte_stats to handle " Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 21/45] fs/proc: Create gather_pud_stats to handle PUD-mapped hugetlb pages Oscar Salvador
                   ` (26 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

HugeTLB pages can be cont-pte mapped, so teach smaps_pte_entry to handle
them.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 fs/proc/task_mmu.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e13754d3246e..98dd03c26e68 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -3175,6 +3175,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	spinlock_t *ptl;
 	pte_t *orig_pte;
 	pte_t *pte;
+	unsigned long size = PAGE_SIZE, cont_ptes = 1;
 
 #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
 	ptl = pmd_huge_lock(pmd, vma);
@@ -3200,6 +3201,10 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 		walk->action = ACTION_AGAIN;
 		return 0;
 	}
+	if (pte_cont(ptep_get(pte))) {
+		size = PAGE_SIZE * CONT_PTES;
+		cont_ptes = CONT_PTES;
+	}
 	do {
 		pte_t ptent = ptep_get(pte);
 		struct page *page = can_gather_numa_stats(ptent, vma, addr);
@@ -3207,7 +3212,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 			continue;
 		gather_stats(page, md, pte_dirty(ptent), 1);
 
-	} while (pte++, addr += PAGE_SIZE, addr != end);
+	} while (pte += cont_ptes, addr += size, addr != end);
 	pte_unmap_unlock(orig_pte, ptl);
 	cond_resched();
 	return 0;
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 21/45] fs/proc: Create gather_pud_stats to handle PUD-mapped hugetlb pages
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (19 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 20/45] fs/proc: Enable gather_pte_stats to handle cont-pte mapped " Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 22/45] mm/mempolicy: Enable queue_folios_pmd to handle hugetlb vmas Oscar Salvador
                   ` (25 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Normal THP cannot be PUD-mapped (besides devmap), but hugetlb can, so create
gather_pud_stats in order to handle PUD-mapped hugetlb vmas.
Also implement can_gather_numa_stats_pud which is the pud version of
can_gather_numa_stats_pmd.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 arch/arm64/include/asm/pgtable.h |  1 +
 fs/proc/task_mmu.c               | 56 ++++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 5e26e63b1012..1a6b8be2f0d0 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -590,6 +590,7 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
 #define pfn_pmd(pfn,prot)	__pmd(__phys_to_pmd_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))
 #define mk_pmd(page,prot)	pfn_pmd(page_to_pfn(page),prot)
 
+#define pud_dirty(pud)		pte_dirty(pud_pte(pud))
 #define pud_young(pud)		pte_young(pud_pte(pud))
 #define pud_mkyoung(pud)	pte_pud(pte_mkyoung(pud_pte(pud)))
 #define pud_write(pud)		pte_write(pud_pte(pud))
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 98dd03c26e68..5df17b7cfe6c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -3141,6 +3141,61 @@ static struct page *can_gather_numa_stats(pte_t pte, struct vm_area_struct *vma,
 	return page;
 }
 
+#ifdef CONFIG_HUGETLB_PAGE
+static struct page *can_gather_numa_stats_pud(pud_t pud,
+					      struct vm_area_struct *vma,
+					      unsigned long addr)
+{
+	struct page *page;
+	int nid;
+
+	if (!pud_present(pud))
+		return NULL;
+
+	page = pud_page(pud);
+	if (!page)
+		return NULL;
+
+	if (PageReserved(page))
+		return NULL;
+
+	nid = page_to_nid(page);
+	if (!node_isset(nid, node_states[N_MEMORY]))
+		return NULL;
+
+	return page;
+}
+
+static int gather_pud_stats(pud_t *pud, unsigned long addr,
+			    unsigned long end, struct mm_walk *walk)
+{
+	spinlock_t *ptl;
+	struct page *page;
+	unsigned long nr_pages;
+	struct numa_maps *md = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+
+	ptl = pud_huge_lock(pud, vma);
+	if (!ptl)
+		return 0;
+
+	if (is_vm_hugetlb_page(vma))
+		nr_pages = 1;
+	else
+		nr_pages = HPAGE_PUD_SIZE / PAGE_SIZE;
+
+	page = can_gather_numa_stats_pud(*pud, vma, addr);
+	if (page)
+		gather_stats(page, md, pud_dirty(*pud),
+			     nr_pages);
+
+	spin_unlock(ptl);
+	return 0;
+}
+#else
+#define gather_pud_stats	NULL
+#endif
+
 #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
 static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
 					      struct vm_area_struct *vma,
@@ -3245,6 +3300,7 @@ static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
 
 static const struct mm_walk_ops show_numa_ops = {
 	.hugetlb_entry = gather_hugetlb_stats,
+	.pud_entry = gather_pud_stats,
 	.pmd_entry = gather_pte_stats,
 	.walk_lock = PGWALK_RDLOCK,
 };
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 22/45] mm/mempolicy: Enable queue_folios_pmd to handle hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (20 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 21/45] fs/proc: Create gather_pud_stats to handle PUD-mapped hugetlb pages Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 23/45] mm/mempolicy: Create queue_folios_pud to handle PUD-mapped " Oscar Salvador
                   ` (24 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

PMD-mapped hugetlb vmas will also reach smaps_pmd_entry.
Add the required code so it knows how to handle those there.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 include/linux/mm_inline.h |  7 +++++++
 mm/mempolicy.c            | 42 ++++++++++++++++++++++++---------------
 2 files changed, 33 insertions(+), 16 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 93e3eb86ef4e..521a001429d2 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -591,6 +591,13 @@ static inline bool vma_has_recency(struct vm_area_struct *vma)
 	return true;
 }
 
+static inline bool is_shared_pmd(pmd_t *pmd, struct vm_area_struct *vma)
+{
+	if (!is_vm_hugetlb_page(vma))
+		return false;
+	return hugetlb_pmd_shared((pte_t *)pmd);
+}
+
 static inline spinlock_t *pmd_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
 {
 	spinlock_t *ptl;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f8703feb68b7..5baf29da198c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -455,7 +455,8 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 };
 
 static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
-				unsigned long flags);
+				unsigned long flags, struct vm_area_struct *vma,
+				bool shared);
 static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
 				pgoff_t ilx, int *nid);
 
@@ -518,7 +519,8 @@ static void queue_folios_pmd(pmd_t *pmd, struct mm_walk *walk)
 		return;
 	if (!(qp->flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) ||
 	    !vma_migratable(walk->vma) ||
-	    !migrate_folio_add(folio, qp->pagelist, qp->flags))
+	    !migrate_folio_add(folio, qp->pagelist, qp->flags, walk->vma,
+			       is_shared_pmd(pmd, walk->vma)))
 		qp->nr_failed++;
 }
 
@@ -543,7 +545,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
 	pte_t ptent;
 	spinlock_t *ptl;
 
-	ptl = pmd_trans_huge_lock(pmd, vma);
+	ptl = pmd_huge_lock(pmd, vma);
 	if (ptl) {
 		queue_folios_pmd(pmd, walk);
 		spin_unlock(ptl);
@@ -598,7 +600,7 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
 		}
 		if (!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) ||
 		    !vma_migratable(vma) ||
-		    !migrate_folio_add(folio, qp->pagelist, flags)) {
+		    !migrate_folio_add(folio, qp->pagelist, flags, vma, false)) {
 			qp->nr_failed++;
 			if (strictly_unmovable(flags))
 				break;
@@ -1025,8 +1027,11 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 
 #ifdef CONFIG_MIGRATION
 static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
-				unsigned long flags)
+			      unsigned long flags, struct vm_area_struct *vma,
+			      bool shared)
 {
+	bool ret = true;
+	bool is_hugetlb = is_vm_hugetlb_page(vma);
 	/*
 	 * Unless MPOL_MF_MOVE_ALL, we try to avoid migrating a shared folio.
 	 * Choosing not to migrate a shared folio is not counted as a failure.
@@ -1034,23 +1039,27 @@ static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
 	 * See folio_likely_mapped_shared() on possible imprecision when we
 	 * cannot easily detect if a folio is shared.
 	 */
-	if ((flags & MPOL_MF_MOVE_ALL) || !folio_likely_mapped_shared(folio)) {
-		if (folio_isolate_lru(folio)) {
-			list_add_tail(&folio->lru, foliolist);
-			node_stat_mod_folio(folio,
-				NR_ISOLATED_ANON + folio_is_file_lru(folio),
-				folio_nr_pages(folio));
-		} else {
+	if ((flags & MPOL_MF_MOVE_ALL) ||
+	    (!folio_likely_mapped_shared(folio) && !shared)) {
+		if (is_hugetlb)
+			return isolate_hugetlb(folio, foliolist);
+
+		ret = folio_isolate_lru(folio);
+		if (!ret)
 			/*
 			 * Non-movable folio may reach here.  And, there may be
 			 * temporary off LRU folios or non-LRU movable folios.
 			 * Treat them as unmovable folios since they can't be
 			 * isolated, so they can't be moved at the moment.
 			 */
-			return false;
-		}
+			return ret;
+
+		list_add_tail(&folio->lru, foliolist);
+		node_stat_mod_folio(folio,
+			NR_ISOLATED_ANON + folio_is_file_lru(folio),
+			folio_nr_pages(folio));
 	}
-	return true;
+	return ret;
 }
 
 /*
@@ -1239,7 +1248,8 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
 #else
 
 static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
-				unsigned long flags)
+				unsigned long flags, struct vm_area_struct *vma,
+				bool shared)
 {
 	return false;
 }
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 23/45] mm/mempolicy: Create queue_folios_pud to handle PUD-mapped hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (21 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 22/45] mm/mempolicy: Enable queue_folios_pmd to handle hugetlb vmas Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 24/45] mm/memory_failure: Enable check_hwpoisoned_pmd_entry to handle " Oscar Salvador
                   ` (23 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Normal THP cannot be PUD-mapped (besides devmap), but hugetlb can, so create
queue_folios_pud in order to handle PUD-mapped hugetlb vmas.
Also implement is_pud_migration_entry and pud_folio, as they will be used in this patch.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 include/linux/pgtable.h |  1 +
 include/linux/swapops.h | 12 ++++++++++++
 mm/mempolicy.c          | 32 ++++++++++++++++++++++++++++++++
 3 files changed, 45 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 458e3cbc96b2..23d51fec81ac 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -51,6 +51,7 @@
 #endif
 
 #define pmd_folio(pmd) page_folio(pmd_page(pmd))
+#define pud_folio(pud) page_folio(pud_page(pud))
 
 /*
  * A page table page can be thought of an array like this: pXd_t[PTRS_PER_PxD]
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 182957f0d013..a23900961d11 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -542,6 +542,18 @@ static inline bool is_pfn_swap_entry(swp_entry_t entry)
 
 struct page_vma_mapped_walk;
 
+#ifdef CONFIG_HUGETLB_PAGE
+static inline int is_pud_migration_entry(pud_t pud)
+{
+	return is_swap_pud(pud) && is_migration_entry(pud_to_swp_entry(pud));
+}
+#else
+static inline int is_pud_migration_entry(pud_t pud)
+{
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 extern int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 		struct page *page);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 5baf29da198c..93b14090d484 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -501,6 +501,37 @@ static inline bool queue_folio_required(struct folio *folio,
 	return node_isset(nid, *qp->nmask) == !(flags & MPOL_MF_INVERT);
 }
 
+static int queue_folios_pud(pud_t *pud, unsigned long addr, unsigned long end,
+			     struct mm_walk *walk)
+{
+	spinlock_t *ptl;
+	struct folio *folio;
+	struct vm_area_struct *vma = walk->vma;
+	struct queue_pages *qp = walk->private;
+
+	ptl = pud_huge_lock(pud, vma);
+	if (!ptl)
+		return 0;
+
+	if (unlikely(is_pud_migration_entry(*pud))) {
+		qp->nr_failed++;
+		goto out;
+	}
+	folio = pud_folio(*pud);
+	if (!queue_folio_required(folio, qp))
+		goto out;
+	if (!(qp->flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) ||
+	    !vma_migratable(walk->vma) ||
+	    !migrate_folio_add(folio, qp->pagelist, qp->flags, walk->vma, false))
+		qp->nr_failed++;
+
+	spin_unlock(ptl);
+out:
+	if (qp->nr_failed && strictly_unmovable(qp->flags))
+		return -EIO;
+	return 0;
+}
+
 static void queue_folios_pmd(pmd_t *pmd, struct mm_walk *walk)
 {
 	struct folio *folio;
@@ -730,6 +761,7 @@ static int queue_pages_test_walk(unsigned long start, unsigned long end,
 
 static const struct mm_walk_ops queue_pages_walk_ops = {
 	.hugetlb_entry		= queue_folios_hugetlb,
+	.pud_entry		= queue_folios_pud,
 	.pmd_entry		= queue_folios_pte_range,
 	.test_walk		= queue_pages_test_walk,
 	.walk_lock		= PGWALK_RDLOCK,
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 24/45] mm/memory_failure: Enable check_hwpoisoned_pmd_entry to handle hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (22 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 23/45] mm/mempolicy: Create queue_folios_pud to handle PUD-mapped " Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 25/45] mm/memory-failure: Create check_hwpoisoned_pud_entry to handle PUD-mapped " Oscar Salvador
                   ` (22 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

PMD-mapped hugetlb vmas will also check_hwpoisoned_pmd_entry.
Add the required code so it knows how to handle those there.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/memory-failure.c | 44 ++++++++++++++++++++++++++++++--------------
 1 file changed, 30 insertions(+), 14 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 0cb1b7bea9a5..8cae95e36365 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -771,27 +771,43 @@ static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
 	return 1;
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
 static int check_hwpoisoned_pmd_entry(pmd_t *pmdp, unsigned long addr,
-				      struct hwpoison_walk *hwp)
+				      struct hwpoison_walk *hwp,
+				      struct vm_area_struct *vma)
 {
 	pmd_t pmd = *pmdp;
 	unsigned long pfn;
-	unsigned long hwpoison_vaddr;
+	unsigned short shift;
+	unsigned long hwpoison_vaddr = addr;
 
-	if (!pmd_present(pmd))
-		return 0;
-	pfn = pmd_pfn(pmd);
-	if (pfn <= hwp->pfn && hwp->pfn < pfn + HPAGE_PMD_NR) {
-		hwpoison_vaddr = addr + ((hwp->pfn - pfn) << PAGE_SHIFT);
-		set_to_kill(&hwp->tk, hwpoison_vaddr, PAGE_SHIFT);
-		return 1;
+	if (pmd_present(pmd)) {
+		pfn = pmd_pfn(pmd);
+	} else {
+		swp_entry_t swp = pmd_to_swp_entry(pmd);
+
+		if (!is_hwpoison_entry(swp))
+			return 0;
+		pfn = swp_offset_pfn(swp);
 	}
-	return 0;
+
+	shift = is_vm_hugetlb_page(vma) ? huge_page_shift(hstate_vma(vma))
+					: PAGE_SHIFT;
+
+	if (pfn > hwp->pfn || hwp->pfn > pfn + HPAGE_PMD_NR)
+		return 0;
+
+	if (!is_vm_hugetlb_page(vma))
+		hwpoison_vaddr += (hwp->pfn - pfn) << PAGE_SHIFT;
+
+	set_to_kill(&hwp->tk, hwpoison_vaddr, shift);
+
+	return 1;
 }
 #else
 static int check_hwpoisoned_pmd_entry(pmd_t *pmdp, unsigned long addr,
-				      struct hwpoison_walk *hwp)
+				      struct hwpoison_walk *hwp,
+				      struct vm_area_struct *vma)
 {
 	return 0;
 }
@@ -805,9 +821,9 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
 	pte_t *ptep, *mapped_pte;
 	spinlock_t *ptl;
 
-	ptl = pmd_trans_huge_lock(pmdp, walk->vma);
+	ptl = pmd_huge_lock(pmdp, walk->vma);
 	if (ptl) {
-		ret = check_hwpoisoned_pmd_entry(pmdp, addr, hwp);
+		ret = check_hwpoisoned_pmd_entry(pmdp, addr, hwp, walk->vma);
 		spin_unlock(ptl);
 		goto out;
 	}
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 25/45] mm/memory-failure: Create check_hwpoisoned_pud_entry to handle PUD-mapped hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (23 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 24/45] mm/memory_failure: Enable check_hwpoisoned_pmd_entry to handle " Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 26/45] mm/damon: Enable damon_young_pmd_entry to handle " Oscar Salvador
                   ` (21 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Normal THP cannot be PUD-mapped (besides devmap), but hugetlb can, so create
check_hwpoisoned_pud_entry in order to handle PUD-mapped hugetlb vmas.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/memory-failure.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 8cae95e36365..622862c4c300 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -771,6 +771,43 @@ static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
 	return 1;
 }
 
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
+static int hwpoison_pud_range(pud_t *pudp, unsigned long addr,
+			      unsigned long end, struct mm_walk *walk)
+{
+	int ret = 0;
+	spinlock_t *ptl;
+	pud_t pud = *pudp;
+	unsigned long pfn;
+	struct hwpoison_walk *hwp = walk->private;
+
+	ptl = pud_huge_lock(pudp, walk->vma);
+	if (!ptl)
+		return 0;
+
+	if (pud_present(pud)) {
+		pfn = pud_pfn(pud);
+	} else {
+		swp_entry_t swp = pud_to_swp_entry(pud);
+
+		if (!is_hwpoison_entry(swp))
+			goto out_unlock;
+		pfn = swp_offset_pfn(swp);
+	}
+
+	if (!pfn || pfn != hwp->pfn)
+		goto out_unlock;
+
+	set_to_kill(&hwp->tk, addr, huge_page_shift(hstate_vma(walk->vma)));
+	ret = 1;
+out_unlock:
+	spin_unlock(ptl);
+	return ret;
+}
+#else
+hwpoison_pud_range NULL
+#endif
+
 #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
 static int check_hwpoisoned_pmd_entry(pmd_t *pmdp, unsigned long addr,
 				      struct hwpoison_walk *hwp,
@@ -862,6 +899,7 @@ static int hwpoison_hugetlb_range(pte_t *ptep, unsigned long hmask,
 #endif
 
 static const struct mm_walk_ops hwpoison_walk_ops = {
+	.pud_entry = hwpoison_pud_range,
 	.pmd_entry = hwpoison_pte_range,
 	.hugetlb_entry = hwpoison_hugetlb_range,
 	.walk_lock = PGWALK_RDLOCK,
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 26/45] mm/damon: Enable damon_young_pmd_entry to handle hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (24 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 25/45] mm/memory-failure: Create check_hwpoisoned_pud_entry to handle PUD-mapped " Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 27/45] mm/damon: Create damon_young_pud_entry to handle PUD-mapped " Oscar Salvador
                   ` (20 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

PMD-mapped hugetlb vmas will also reach damon_young_pmd_entry.
Add the required code so it knows how to handle those there.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/damon/vaddr.c | 31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 58829baf8b5d..00d32beffe38 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -443,30 +443,35 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
 	struct folio *folio;
 	struct damon_young_walk_private *priv = walk->private;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (pmd_trans_huge(pmdp_get(pmd))) {
-		pmd_t pmde;
-
-		ptl = pmd_lock(walk->mm, pmd);
-		pmde = pmdp_get(pmd);
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
+	ptl = pmd_huge_lock(vma, pmd);
+	if (ptl) {
+		unsigned long pfn;
 
-		if (!pmd_present(pmde)) {
+		if (!pmd_present(*pmd)) {
 			spin_unlock(ptl);
 			return 0;
 		}
 
-		if (!pmd_trans_huge(pmde)) {
-			spin_unlock(ptl);
-			goto regular_page;
+		pfn = pmd_pfn(*pmd);
+		if (is_vm_hugetlb_page(walk->vma)) {
+			folio = pfn_folio(pfn);
+			if (folio)
+				folio_get(folio);
+		} else {
+			folio = damon_get_folio(pfn);
 		}
-		folio = damon_get_folio(pmd_pfn(pmde));
 		if (!folio)
 			goto huge_out;
 		if (pmd_young(pmde) || !folio_test_idle(folio) ||
 					mmu_notifier_test_young(walk->mm,
 						addr))
 			priv->young = true;
-		*priv->folio_sz = HPAGE_PMD_SIZE;
+
+		if (is_vm_hugetlb_page(walk->vma))
+			*priv->folio_sz = huge_page_size(h);
+		else
+			*priv->folio_sz = HPAGE_PMD_SIZE;
 		folio_put(folio);
 huge_out:
 		spin_unlock(ptl);
@@ -474,7 +479,7 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
 	}
 
 regular_page:
-#endif	/* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif	/* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
 
 	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
 	if (!pte) {
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 27/45] mm/damon: Create damon_young_pud_entry to handle PUD-mapped hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (25 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 26/45] mm/damon: Enable damon_young_pmd_entry to handle " Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 28/45] mm/damon: Enable damon_mkold_pmd_entry to handle " Oscar Salvador
                   ` (19 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Normal THP cannot be PUD-mapped (besides devmap), but hugetlb can, so create
damon_young_pud_entry in order to handle PUD-mapped hugetlb vmas.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/damon/vaddr.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 00d32beffe38..2d5ad47b9dae 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -434,6 +434,39 @@ struct damon_young_walk_private {
 	bool young;
 };
 
+static int damon_young_pud_entry(pmd_t *pud, unsigned long addr,
+				 unsigned long next, struct mm_walk *walk)
+{
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
+	spinlock_t *ptl;
+	struct folio *folio;
+	struct damon_young_walk_private *priv = walk->private;
+
+	ptl = pud_huge_lock(vma, pud);
+	if (!ptl)
+		return 0;
+
+	if (!pud_present(*pud))
+		goto out;
+
+	folio = pfn_folio(pud_pfn(*pud));
+	if (!folio)
+		goto out;
+	folio_get(folio);
+
+	if (pud_young(pmde) || !folio_test_idle(folio) ||
+	    mmu_notifier_test_young(walk->mm, addr))
+		priv->young = true;
+
+	*priv->folio_sz = huge_page_size(h);
+	folio_put(folio);
+out:
+	spin_unlock(ptl);
+#endif
+	return 0;
+}
+
+
 static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
 		unsigned long next, struct mm_walk *walk)
 {
@@ -537,6 +570,7 @@ static int damon_young_hugetlb_entry(pte_t *pte, unsigned long hmask,
 #endif /* CONFIG_HUGETLB_PAGE */
 
 static const struct mm_walk_ops damon_young_ops = {
+	.pud_entry = damon_young_pud_entry,
 	.pmd_entry = damon_young_pmd_entry,
 	.hugetlb_entry = damon_young_hugetlb_entry,
 	.walk_lock = PGWALK_RDLOCK,
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 28/45] mm/damon: Enable damon_mkold_pmd_entry to handle hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (26 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 27/45] mm/damon: Create damon_young_pud_entry to handle PUD-mapped " Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04 11:03   ` David Hildenbrand
  2024-07-04  4:31 ` [PATCH 29/45] mm/damon: Create damon_mkold_pud_entry to handle PUD-mapped " Oscar Salvador
                   ` (18 subsequent siblings)
  46 siblings, 1 reply; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

PMD-mapped hugetlb vmas will also reach damon_mkold_pmd_entry.
Add the required code so it knows how to handle those there.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/damon/ops-common.c | 21 ++++++++++++++++-----
 mm/damon/vaddr.c      | 15 +++++----------
 2 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c
index d25d99cb5f2b..6727658a3ef5 100644
--- a/mm/damon/ops-common.c
+++ b/mm/damon/ops-common.c
@@ -53,18 +53,29 @@ void damon_ptep_mkold(pte_t *pte, struct vm_area_struct *vma, unsigned long addr
 
 void damon_pmdp_mkold(pmd_t *pmd, struct vm_area_struct *vma, unsigned long addr)
 {
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	struct folio *folio = damon_get_folio(pmd_pfn(pmdp_get(pmd)));
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
+	struct folio *folio;
+	unsigned long size;
+
+	if (is_vm_hugetlb_page(vma)) {
+		folio = pfn_folio(pdm_pfn(*pmd))
+		folio_get(folio);
+		size = huge_page_size(hstate_vma(vma));
+	} else {
+		folio = damon_get_folio(pmd_pfn(*pmd));
+		size = PMD_SIZE;
+	}
 
 	if (!folio)
-		return;
+		return 0;
 
-	if (pmdp_clear_young_notify(vma, addr, pmd))
+	if (pmdp_test_and_clear_young(vma, addr, pmd) ||
+	    mmu_notifier_clear_young(mm, addr, addr + size))
 		folio_set_young(folio);
 
 	folio_set_idle(folio);
 	folio_put(folio);
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /*CONFIG_PGTABLE_HAS_HUGE_LEAVES */
 }
 
 #define DAMON_MAX_SUBSCORE	(100)
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 2d5ad47b9dae..47c84cdda32c 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -304,21 +304,16 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
 	pmd_t pmde;
 	spinlock_t *ptl;
 
-	if (pmd_trans_huge(pmdp_get(pmd))) {
-		ptl = pmd_lock(walk->mm, pmd);
-		pmde = pmdp_get(pmd);
-
-		if (!pmd_present(pmde)) {
+	ptl = pmd_huge_lock(walk->vma, pmd);
+	if (ptl) {
+		if (!pmd_present(*pmd)) {
 			spin_unlock(ptl);
 			return 0;
 		}
 
-		if (pmd_trans_huge(pmde)) {
-			damon_pmdp_mkold(pmd, walk->vma, addr);
-			spin_unlock(ptl);
-			return 0;
-		}
+		damon_pmdp_mkold(pmd, walk->vma, addr);
 		spin_unlock(ptl);
+		return 0;
 	}
 
 	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 29/45] mm/damon: Create damon_mkold_pud_entry to handle PUD-mapped hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (27 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 28/45] mm/damon: Enable damon_mkold_pmd_entry to handle " Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 30/45] mm,mincore: Enable mincore_pte_range to handle " Oscar Salvador
                   ` (17 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Normal THP cannot be PUD-mapped (besides devmap), but hugetlb can, so create
damon_mkold_pud_entry in order to handle PUD-mapped hugetlb vmas.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/damon/vaddr.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 47c84cdda32c..6a383ce5a775 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -329,6 +329,37 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
 	return 0;
 }
 
+static int damon_mkold_pud_entry(pmd_t *pud, unsigned long addr,
+				 unsigned long next, struct mm_walk *walk)
+{
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
+	spinlock_t *ptl;
+	struct folio *folio;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long size = huge_page_size(hstate_vma(vma));
+
+	ptl = pud_huge_lock(vma, pud);
+	if (!ptl)
+		return 0;
+
+	if (!pud_present(*pud))
+		goto out;
+
+	folio = pfn_folio(pud_pfn(*pud));
+	folio_get(folio);
+
+	if (pudp_test_and_clear_young(vma, addr, pud) ||
+	    mmu_notifier_clear_young(mm, addr, addr + size))
+		folio_set_young(folio);
+
+	folio_set_idle(folio);
+	folio_put(folio);
+out:
+	spin_unlock(ptl);
+#endif
+	return 0;
+}
+
 #ifdef CONFIG_HUGETLB_PAGE
 static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
 				struct vm_area_struct *vma, unsigned long addr)
@@ -383,6 +414,7 @@ static int damon_mkold_hugetlb_entry(pte_t *pte, unsigned long hmask,
 #endif /* CONFIG_HUGETLB_PAGE */
 
 static const struct mm_walk_ops damon_mkold_ops = {
+	.pud_entry = damon_mkold_pud_entry,
 	.pmd_entry = damon_mkold_pmd_entry,
 	.hugetlb_entry = damon_mkold_hugetlb_entry,
 	.walk_lock = PGWALK_RDLOCK,
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 30/45] mm,mincore: Enable mincore_pte_range to handle hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (28 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 29/45] mm/damon: Create damon_mkold_pud_entry to handle PUD-mapped " Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 31/45] mm/mincore: Create mincore_pud_range to handle PUD-mapped " Oscar Salvador
                   ` (16 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

PMD-mapped hugetlb vmas will also mincore_pte_range.
Add the required code so it knows how to handle those there.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/mincore.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/mincore.c b/mm/mincore.c
index d6bd19e520fc..5154bc705f60 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -18,6 +18,7 @@
 #include <linux/shmem_fs.h>
 #include <linux/hugetlb.h>
 #include <linux/pgtable.h>
+#include <linux/mm_inline.h>
 
 #include <linux/uaccess.h>
 #include "swap.h"
@@ -106,8 +107,9 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	unsigned char *vec = walk->private;
 	int nr = (end - addr) >> PAGE_SHIFT;
 
-	ptl = pmd_trans_huge_lock(pmd, vma);
+	ptl = pmd_huge_lock(pmd, vma);
 	if (ptl) {
+		/* Better handling of hugetlb is required (pte marker etc.) */
 		memset(vec, 1, nr);
 		spin_unlock(ptl);
 		goto out;
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 31/45] mm/mincore: Create mincore_pud_range to handle PUD-mapped hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (29 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 30/45] mm,mincore: Enable mincore_pte_range to handle " Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 32/45] mm/hmm: Enable hmm_vma_walk_pmd, to handle " Oscar Salvador
                   ` (15 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Normal THP cannot be PUD-mapped (besides devmap), but hugetlb can, so create
mincore_pud_range in order to handle PUD-mapped hugetlb vmas.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/mincore.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/mm/mincore.c b/mm/mincore.c
index 5154bc705f60..786df7246899 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -98,6 +98,25 @@ static int mincore_unmapped_range(unsigned long addr, unsigned long end,
 	return 0;
 }
 
+static int mincore_pud_range(pud_t *pud, unsigned long addr, unsigned long end,
+			     struct mm_walk *walk)
+{
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
+	spinlock_t *ptl;
+	unsigned char *vec = walk->private;
+	int nr = (end - addr) >> PAGE_SHIFT;
+	struct vm_area_struct *vma = walk->vma;
+
+	ptl = pud_huge_lock(pud, vma);
+	if (!ptl)
+		return 0;
+
+	memset(vec, 1, nr);
+	spin_unlock(ptl);
+#endif
+	return 0;
+}
+
 static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			struct mm_walk *walk)
 {
@@ -175,6 +194,7 @@ static inline bool can_do_mincore(struct vm_area_struct *vma)
 }
 
 static const struct mm_walk_ops mincore_walk_ops = {
+	.pud_entry		= mincore_pud_range,
 	.pmd_entry		= mincore_pte_range,
 	.pte_hole		= mincore_unmapped_range,
 	.hugetlb_entry		= mincore_hugetlb,
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 32/45] mm/hmm: Enable hmm_vma_walk_pmd, to handle hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (30 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 31/45] mm/mincore: Create mincore_pud_range to handle PUD-mapped " Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 33/45] mm/hmm: Enable hmm_vma_walk_pud to handle PUD-mapped " Oscar Salvador
                   ` (14 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

PMD-mapped hugetlb vmas will also reach hmm_vma_walk_pmd.
Add the required code so it knows how to handle those there.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/hmm.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 7e0229ae4a5a..fbee08973544 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -183,7 +183,7 @@ static inline unsigned long pmd_to_hmm_pfn_flags(struct hmm_range *range,
 	       hmm_pfn_flags_order(PMD_SHIFT - PAGE_SHIFT);
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
 static int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long addr,
 			      unsigned long end, unsigned long hmm_pfns[],
 			      pmd_t pmd)
@@ -206,11 +206,11 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long addr,
 		hmm_pfns[i] = pfn | cpu_flags;
 	return 0;
 }
-#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#else /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
 /* stub to allow the code below to compile */
 int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long addr,
 		unsigned long end, unsigned long hmm_pfns[], pmd_t pmd);
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
 
 static inline unsigned long pte_to_hmm_pfn_flags(struct hmm_range *range,
 						 pte_t pte)
@@ -336,7 +336,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	if (pmd_none(pmd))
 		return hmm_vma_walk_hole(start, end, -1, walk);
 
-	if (thp_migration_supported() && is_pmd_migration_entry(pmd)) {
+	if (is_pmd_migration_entry(pmd)) {
 		if (hmm_range_need_fault(hmm_vma_walk, hmm_pfns, npages, 0)) {
 			hmm_vma_walk->last = addr;
 			pmd_migration_entry_wait(walk->mm, pmdp);
@@ -351,7 +351,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
 	}
 
-	if (pmd_devmap(pmd) || pmd_trans_huge(pmd)) {
+	if (pmd_devmap(pmd) || pmd_leaf(pmd)) {
 		/*
 		 * No need to take pmd_lock here, even if some other thread
 		 * is splitting the huge pmd we will get that event through
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 33/45] mm/hmm: Enable hmm_vma_walk_pud to handle PUD-mapped hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (31 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 32/45] mm/hmm: Enable hmm_vma_walk_pmd, to handle " Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 34/45] arch/powerpc: Skip hugetlb vmas in subpage_mark_vma_nohuge Oscar Salvador
                   ` (13 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Normal THP cannot be PUD-mapped (besides devmap), but hugetlb can,
so enable hmm_vma_walk_pud to handle PUD-mapped hugetlb vmas.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/hmm.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index fbee08973544..2b752f703b6d 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -396,8 +396,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	return 0;
 }
 
-#if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && \
-    defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
+#if (defined(CONFIG_ARCH_HAS_PTE_DEVMAP) || defined (CONFIG_PGTABLE_HAS_HUGE_LEAVES))
 static inline unsigned long pud_to_hmm_pfn_flags(struct hmm_range *range,
 						 pud_t pud)
 {
@@ -429,7 +428,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
 		return hmm_vma_walk_hole(start, end, -1, walk);
 	}
 
-	if (pud_leaf(pud) && pud_devmap(pud)) {
+	if (pud_leaf(pud)) {
 		unsigned long i, npages, pfn;
 		unsigned int required_fault;
 		unsigned long *hmm_pfns;
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 34/45] arch/powerpc: Skip hugetlb vmas in subpage_mark_vma_nohuge
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (32 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 33/45] mm/hmm: Enable hmm_vma_walk_pud to handle PUD-mapped " Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 35/45] arch/s390: Skip hugetlb vmas in thp_split_mm Oscar Salvador
                   ` (12 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Skip hugetlb vmas as we are not interested in those.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 arch/powerpc/mm/book3s64/subpage_prot.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c b/arch/powerpc/mm/book3s64/subpage_prot.c
index ec98e526167e..dd529adda87f 100644
--- a/arch/powerpc/mm/book3s64/subpage_prot.c
+++ b/arch/powerpc/mm/book3s64/subpage_prot.c
@@ -159,6 +159,8 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr,
 	 * VM_NOHUGEPAGE and split them.
 	 */
 	for_each_vma_range(vmi, vma, addr + len) {
+		if (is_vm_hugetlb_page(vma))
+			continue;
 		vm_flags_set(vma, VM_NOHUGEPAGE);
 		walk_page_vma(vma, &subpage_walk_ops, NULL);
 	}
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 35/45] arch/s390: Skip hugetlb vmas in thp_split_mm
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (33 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 34/45] arch/powerpc: Skip hugetlb vmas in subpage_mark_vma_nohuge Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 36/45] fs/proc: Make clear_refs_test_walk skip hugetlb vmas Oscar Salvador
                   ` (11 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Skip hugetlb vmas as we are not interested in those.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 arch/s390/mm/gmap.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index e1d098dc7f07..580e4ab6f018 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2538,6 +2538,8 @@ static inline void thp_split_mm(struct mm_struct *mm)
 	VMA_ITERATOR(vmi, mm, 0);
 
 	for_each_vma(vmi, vma) {
+		if (is_vm_hugetlb_page(vma))
+			continue;
 		vm_flags_mod(vma, VM_NOHUGEPAGE, VM_HUGEPAGE);
 		walk_page_vma(vma, &thp_split_walk_ops, NULL);
 	}
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 36/45] fs/proc: Make clear_refs_test_walk skip hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (34 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 35/45] arch/s390: Skip hugetlb vmas in thp_split_mm Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 37/45] mm/lock: Make mlock_test_walk " Oscar Salvador
                   ` (10 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Skip hugetlb vmas as we are not interested in those.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 fs/proc/task_mmu.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5df17b7cfe6c..df94f2093588 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1579,6 +1579,9 @@ static int clear_refs_test_walk(unsigned long start, unsigned long end,
 	struct clear_refs_private *cp = walk->private;
 	struct vm_area_struct *vma = walk->vma;
 
+	if (is_vm_hugetlb_page(vma))
+		return 1;
+
 	if (vma->vm_flags & VM_PFNMAP)
 		return 1;
 
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 37/45] mm/lock: Make mlock_test_walk skip hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (35 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 36/45] fs/proc: Make clear_refs_test_walk skip hugetlb vmas Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 38/45] mm/madvise: Make swapin_test_walk " Oscar Salvador
                   ` (9 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Skip hugetlb vmas as we are not interested in those.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/mlock.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/mm/mlock.c b/mm/mlock.c
index 52d6e401ad67..b37079b3505f 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -409,6 +409,17 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
 	return 0;
 }
 
+static int mlock_test_walk(unsigned long start, unsigned long end,
+			   struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+
+	if (is_vm_hugetlb_page(vma))
+		return 1;
+
+	return 0;
+}
+
 /*
  * mlock_vma_pages_range() - mlock any pages already in the range,
  *                           or munlock all pages in the range.
@@ -425,6 +436,7 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
 {
 	static const struct mm_walk_ops mlock_walk_ops = {
 		.pmd_entry = mlock_pte_range,
+		.test_walk = mlock_test_walk,
 		.walk_lock = PGWALK_WRLOCK_VERIFY,
 	};
 
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 38/45] mm/madvise: Make swapin_test_walk skip hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (36 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 37/45] mm/lock: Make mlock_test_walk " Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 39/45] mm/madvise: Make madvise_cold_test_walk " Oscar Salvador
                   ` (8 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Skip hugetlb vmas as we are not interested in those.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/madvise.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index 96c026fe0c99..4c7c409e8b4a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -212,8 +212,20 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 	return 0;
 }
 
+static int swapin_test_walk(unsigned long start, unsigned long end,
+			    struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+
+	if (is_vm_hugetlb_page(vma))
+		return 1;
+
+	return 0;
+}
+
 static const struct mm_walk_ops swapin_walk_ops = {
 	.pmd_entry		= swapin_walk_pmd_entry,
+	.test_walk		= swapin_test_walk,
 	.walk_lock		= PGWALK_RDLOCK,
 };
 
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 39/45] mm/madvise: Make madvise_cold_test_walk skip hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (37 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 38/45] mm/madvise: Make swapin_test_walk " Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 40/45] mm/madvise: Make madvise_free_test_walk " Oscar Salvador
                   ` (7 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Skip hugetlb vmas as we are not interested in those.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/madvise.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index 4c7c409e8b4a..e60311636c4c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -565,8 +565,20 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	return 0;
 }
 
+static int madvise_cold_test_walk(unsigned long start, unsigned long end,
+			   struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+
+	if (is_vm_hugetlb_page(vma))
+		return 1;
+
+	return 0;
+}
+
 static const struct mm_walk_ops cold_walk_ops = {
 	.pmd_entry = madvise_cold_or_pageout_pte_range,
+	.test_walk = madvise_cold_test_walk,
 	.walk_lock = PGWALK_RDLOCK,
 };
 
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 40/45] mm/madvise: Make madvise_free_test_walk skip hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (38 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 39/45] mm/madvise: Make madvise_cold_test_walk " Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 41/45] mm/migrate_device: Make migrate_vma_test_walk " Oscar Salvador
                   ` (6 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Skip hugetlb vmas as we are not interested in those.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/madvise.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index e60311636c4c..08f72622913f 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -805,8 +805,20 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	return 0;
 }
 
+static int madvise_free_test_walk(unsigned long start, unsigned long end,
+				  struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+
+	if (is_vm_hugetlb_page(vma))
+		return 1;
+
+	return 0;
+}
+
 static const struct mm_walk_ops madvise_free_walk_ops = {
 	.pmd_entry		= madvise_free_pte_range,
+	.test_walk		= madvise_free_test_walk,
 	.walk_lock		= PGWALK_RDLOCK,
 };
 
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 41/45] mm/migrate_device: Make migrate_vma_test_walk skip hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (39 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 40/45] mm/madvise: Make madvise_free_test_walk " Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 42/45] mm/memcontrol: Make mem_cgroup_move_test_walk " Oscar Salvador
                   ` (5 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Skip hugetlb vmas as we are not interested in those.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/migrate_device.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 6d66dc1c6ffa..c44ac45b207d 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -279,8 +279,20 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 	return 0;
 }
 
+static int migrate_vma_test_walk(unsigned long start, unsigned long end,
+				 struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+
+	if (is_vm_hugetlb_page(vma))
+		return 1;
+
+	return 0;
+}
+
 static const struct mm_walk_ops migrate_vma_walk_ops = {
 	.pmd_entry		= migrate_vma_collect_pmd,
+	.test_walk		= migrate_vma_test_walk,
 	.pte_hole		= migrate_vma_collect_hole,
 	.walk_lock		= PGWALK_RDLOCK,
 };
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 42/45] mm/memcontrol: Make mem_cgroup_move_test_walk skip hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (40 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 41/45] mm/migrate_device: Make migrate_vma_test_walk " Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 43/45] mm/memcontrol: Make mem_cgroup_count_test_walk " Oscar Salvador
                   ` (4 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Skip hugetlb vmas as we are not interested in those.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/memcontrol-v1.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 99cc9501eec1..542922562cf9 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -1319,8 +1319,20 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	return ret;
 }
 
+static int mem_cgroup_move_test_walk(unsigned long start, unsigned long end,
+				     struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+
+	if (is_vm_hugetlb_page(vma))
+		return 1;
+
+	return 0;
+}
+
 static const struct mm_walk_ops charge_walk_ops = {
 	.pmd_entry	= mem_cgroup_move_charge_pte_range,
+	.test_walk      = mem_cgroup_move_test_walk,
 	.walk_lock	= PGWALK_RDLOCK,
 };
 
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 43/45] mm/memcontrol: Make mem_cgroup_count_test_walk skip hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (41 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 42/45] mm/memcontrol: Make mem_cgroup_move_test_walk " Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 44/45] mm/hugetlb_vmemmap: Make vmemmap_test_walk " Oscar Salvador
                   ` (3 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Skip hugetlb vmas as we are not interested in those.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/memcontrol-v1.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 542922562cf9..cd8ad1a7f170 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -1039,8 +1039,20 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 	return 0;
 }
 
+static int mem_cgroup_count_test_walk(unsigned long start, unsigned long end,
+				     struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+
+	if (is_vm_hugetlb_page(vma))
+		return 1;
+
+	return 0;
+}
+
 static const struct mm_walk_ops precharge_walk_ops = {
 	.pmd_entry	= mem_cgroup_count_precharge_pte_range,
+	.test_walk      = mem_cgroup_count_test_walk,
 	.walk_lock	= PGWALK_RDLOCK,
 };
 
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 44/45] mm/hugetlb_vmemmap: Make vmemmap_test_walk skip hugetlb vmas
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (42 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 43/45] mm/memcontrol: Make mem_cgroup_count_test_walk " Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04  4:31 ` [PATCH 45/45] mm: Delete all hugetlb_entry entries Oscar Salvador
                   ` (2 subsequent siblings)
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Skip hugetlb vmas as we are not interested in those.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/hugetlb_vmemmap.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 829112b0a914..3e6fd5ae27bd 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -151,9 +151,21 @@ static int vmemmap_pte_entry(pte_t *pte, unsigned long addr,
 	return 0;
 }
 
+static int vmemmap_test_walk(unsigned long start, unsigned long end,
+			     struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+
+	if (is_vm_hugetlb_page(vma))
+		return 1;
+
+	return 0;
+}
+
 static const struct mm_walk_ops vmemmap_remap_ops = {
 	.pmd_entry	= vmemmap_pmd_entry,
 	.pte_entry	= vmemmap_pte_entry,
+	.test_walk	= vmemmap_test_walk,
 };
 
 static int vmemmap_remap_range(unsigned long start, unsigned long end,
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 45/45] mm: Delete all hugetlb_entry entries
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (43 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 44/45] mm/hugetlb_vmemmap: Make vmemmap_test_walk " Oscar Salvador
@ 2024-07-04  4:31 ` Oscar Salvador
  2024-07-04 10:13 ` [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
  2024-07-04 10:44 ` David Hildenbrand
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04  4:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Oscar Salvador

Generic pagewalker can now deal with hugetlb pages as well, so there is
no need for specific hugetlb_entry functions.
Drop them all.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 arch/s390/mm/gmap.c      |  27 ------
 fs/proc/task_mmu.c       | 181 ---------------------------------------
 include/linux/pagewalk.h |  10 ---
 mm/damon/vaddr.c         |  89 -------------------
 mm/hmm.c                 |  54 ------------
 mm/memory-failure.c      |  17 ----
 mm/mempolicy.c           |  47 ----------
 mm/mincore.c             |  22 -----
 mm/mprotect.c            |  10 ---
 mm/pagewalk.c            |  49 +----------
 10 files changed, 1 insertion(+), 505 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 580e4ab6f018..3307f0ec505c 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2743,34 +2743,7 @@ static int __s390_enable_skey_pmd(pmd_t *pmd, unsigned long addr,
 	return 0;
 }
 
-static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
-				      unsigned long hmask, unsigned long next,
-				      struct mm_walk *walk)
-{
-	pmd_t *pmd = (pmd_t *)pte;
-	unsigned long start, end;
-	struct page *page = pmd_page(*pmd);
-
-	/*
-	 * The write check makes sure we do not set a key on shared
-	 * memory. This is needed as the walker does not differentiate
-	 * between actual guest memory and the process executable or
-	 * shared libraries.
-	 */
-	if (pmd_val(*pmd) & _SEGMENT_ENTRY_INVALID ||
-	    !(pmd_val(*pmd) & _SEGMENT_ENTRY_WRITE))
-		return 0;
-
-	start = pmd_val(*pmd) & HPAGE_MASK;
-	end = start + HPAGE_SIZE;
-	__storage_key_init_range(start, end);
-	set_bit(PG_arch_1, &page->flags);
-	cond_resched();
-	return 0;
-}
-
 static const struct mm_walk_ops enable_skey_walk_ops = {
-	.hugetlb_entry		= __s390_enable_skey_hugetlb,
 	.pte_entry		= __s390_enable_skey_pte,
 	.pmd_entry		= __s390_enable_skey_pmd,
 	.walk_lock		= PGWALK_WRLOCK,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index df94f2093588..52fa82336825 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1066,52 +1066,15 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 	seq_putc(m, '\n');
 }
 
-#ifdef CONFIG_HUGETLB_PAGE
-static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
-				 unsigned long addr, unsigned long end,
-				 struct mm_walk *walk)
-{
-	struct mem_size_stats *mss = walk->private;
-	struct vm_area_struct *vma = walk->vma;
-	pte_t ptent = huge_ptep_get(walk->mm, addr, pte);
-	struct folio *folio = NULL;
-	bool present = false;
-
-	if (pte_present(ptent)) {
-		folio = page_folio(pte_page(ptent));
-		present = true;
-	} else if (is_swap_pte(ptent)) {
-		swp_entry_t swpent = pte_to_swp_entry(ptent);
-
-		if (is_pfn_swap_entry(swpent))
-			folio = pfn_swap_entry_folio(swpent);
-	}
-
-	if (folio) {
-		/* We treat non-present entries as "maybe shared". */
-		if (!present || folio_likely_mapped_shared(folio) ||
-		    hugetlb_pmd_shared(pte))
-			mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
-		else
-			mss->private_hugetlb += huge_page_size(hstate_vma(vma));
-	}
-	return 0;
-}
-#else
-#define smaps_hugetlb_range	NULL
-#endif /* HUGETLB_PAGE */
-
 static const struct mm_walk_ops smaps_walk_ops = {
 	.pud_entry              = smaps_pud_range,
 	.pmd_entry		= smaps_pte_range,
-	.hugetlb_entry		= smaps_hugetlb_range,
 	.walk_lock		= PGWALK_RDLOCK,
 };
 
 static const struct mm_walk_ops smaps_shmem_walk_ops = {
 	.pud_entry              = smaps_pud_range,
 	.pmd_entry		= smaps_pte_range,
-	.hugetlb_entry		= smaps_hugetlb_range,
 	.pte_hole		= smaps_pte_hole,
 	.walk_lock		= PGWALK_RDLOCK,
 };
@@ -1987,66 +1950,10 @@ static int pagemap_pud_range(pud_t *pudp, unsigned long addr, unsigned long end,
 #define pagemap_pud_range NULL
 #endif
 
-#ifdef CONFIG_HUGETLB_PAGE
-/* This function walks within one hugetlb entry in the single call */
-static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
-				 unsigned long addr, unsigned long end,
-				 struct mm_walk *walk)
-{
-	struct pagemapread *pm = walk->private;
-	struct vm_area_struct *vma = walk->vma;
-	u64 flags = 0, frame = 0;
-	int err = 0;
-	pte_t pte;
-
-	if (vma->vm_flags & VM_SOFTDIRTY)
-		flags |= PM_SOFT_DIRTY;
-
-	pte = huge_ptep_get(walk->mm, addr, ptep);
-	if (pte_present(pte)) {
-		struct folio *folio = page_folio(pte_page(pte));
-
-		if (!folio_test_anon(folio))
-			flags |= PM_FILE;
-
-		if (!folio_likely_mapped_shared(folio) &&
-		    !hugetlb_pmd_shared(ptep))
-			flags |= PM_MMAP_EXCLUSIVE;
-
-		if (huge_pte_uffd_wp(pte))
-			flags |= PM_UFFD_WP;
-
-		flags |= PM_PRESENT;
-		if (pm->show_pfn)
-			frame = pte_pfn(pte) +
-				((addr & ~hmask) >> PAGE_SHIFT);
-	} else if (pte_swp_uffd_wp_any(pte)) {
-		flags |= PM_UFFD_WP;
-	}
-
-	for (; addr != end; addr += PAGE_SIZE) {
-		pagemap_entry_t pme = make_pme(frame, flags);
-
-		err = add_to_pagemap(&pme, pm);
-		if (err)
-			return err;
-		if (pm->show_pfn && (flags & PM_PRESENT))
-			frame++;
-	}
-
-	cond_resched();
-
-	return err;
-}
-#else
-#define pagemap_hugetlb_range	NULL
-#endif /* HUGETLB_PAGE */
-
 static const struct mm_walk_ops pagemap_ops = {
 	.pud_entry      = pagemap_pud_range,
 	.pmd_entry	= pagemap_pmd_range,
 	.pte_hole	= pagemap_pte_hole,
-	.hugetlb_entry	= pagemap_hugetlb_range,
 	.walk_lock	= PGWALK_RDLOCK,
 };
 
@@ -2789,67 +2696,6 @@ static int pagemap_scan_pud_entry(pud_t *pud, unsigned long start,
 #define pagemap_scan_pud_entry	NULL
 #endif
 
-#ifdef CONFIG_HUGETLB_PAGE
-static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
-				      unsigned long start, unsigned long end,
-				      struct mm_walk *walk)
-{
-	struct pagemap_scan_private *p = walk->private;
-	struct vm_area_struct *vma = walk->vma;
-	unsigned long categories;
-	spinlock_t *ptl;
-	int ret = 0;
-	pte_t pte;
-
-	if (~p->arg.flags & PM_SCAN_WP_MATCHING) {
-		/* Go the short route when not write-protecting pages. */
-
-		pte = huge_ptep_get(walk->mm, start, ptep);
-		categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
-
-		if (!pagemap_scan_is_interesting_page(categories, p))
-			return 0;
-
-		return pagemap_scan_output(categories, p, start, &end);
-	}
-
-	i_mmap_lock_write(vma->vm_file->f_mapping);
-	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);
-
-	pte = huge_ptep_get(walk->mm, start, ptep);
-	categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
-
-	if (!pagemap_scan_is_interesting_page(categories, p))
-		goto out_unlock;
-
-	ret = pagemap_scan_output(categories, p, start, &end);
-	if (start == end)
-		goto out_unlock;
-
-	if (~categories & PAGE_IS_WRITTEN)
-		goto out_unlock;
-
-	if (end != start + HPAGE_SIZE) {
-		/* Partial HugeTLB page WP isn't possible. */
-		pagemap_scan_backout_range(p, start, end);
-		p->arg.walk_end = start;
-		ret = 0;
-		goto out_unlock;
-	}
-
-	make_uffd_wp_huge_pte(vma, start, ptep, pte);
-	flush_hugetlb_tlb_range(vma, start, end);
-
-out_unlock:
-	spin_unlock(ptl);
-	i_mmap_unlock_write(vma->vm_file->f_mapping);
-
-	return ret;
-}
-#else
-#define pagemap_scan_hugetlb_entry NULL
-#endif
-
 static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end,
 				 int depth, struct mm_walk *walk)
 {
@@ -2879,7 +2725,6 @@ static const struct mm_walk_ops pagemap_scan_ops = {
 	.pud_entry = pagemap_scan_pud_entry,
 	.pmd_entry = pagemap_scan_pmd_entry,
 	.pte_hole = pagemap_scan_pte_hole,
-	.hugetlb_entry = pagemap_scan_hugetlb_entry,
 };
 
 static int pagemap_scan_get_args(struct pm_scan_arg *arg,
@@ -3275,34 +3120,8 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	cond_resched();
 	return 0;
 }
-#ifdef CONFIG_HUGETLB_PAGE
-static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
-		unsigned long addr, unsigned long end, struct mm_walk *walk)
-{
-	pte_t huge_pte = huge_ptep_get(walk->mm, addr, pte);
-	struct numa_maps *md;
-	struct page *page;
-
-	if (!pte_present(huge_pte))
-		return 0;
-
-	page = pte_page(huge_pte);
-
-	md = walk->private;
-	gather_stats(page, md, pte_dirty(huge_pte), 1);
-	return 0;
-}
-
-#else
-static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
-		unsigned long addr, unsigned long end, struct mm_walk *walk)
-{
-	return 0;
-}
-#endif
 
 static const struct mm_walk_ops show_numa_ops = {
-	.hugetlb_entry = gather_hugetlb_stats,
 	.pud_entry = gather_pud_stats,
 	.pmd_entry = gather_pte_stats,
 	.walk_lock = PGWALK_RDLOCK,
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 27cd1e59ccf7..6df0726eecb6 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -31,16 +31,6 @@ enum page_walk_lock {
  *			depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD.
  *			Any folded depths (where PTRS_PER_P?D is equal to 1)
  *			are skipped.
- * @hugetlb_entry:	if set, called for each hugetlb entry. This hook
- *			function is called with the vma lock held, in order to
- *			protect against a concurrent freeing of the pte_t* or
- *			the ptl. In some cases, the hook function needs to drop
- *			and retake the vma lock in order to avoid deadlocks
- *			while calling other functions. In such cases the hook
- *			function must either refrain from accessing the pte or
- *			ptl after dropping the vma lock, or else revalidate
- *			those items after re-acquiring the vma lock and before
- *			accessing them.
  * @test_walk:		caller specific callback function to determine whether
  *			we walk over the current vma or not. Returning 0 means
  *			"do page table walk over the current vma", returning
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 6a383ce5a775..82a8d3146f05 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -360,63 +360,9 @@ static int damon_mkold_pud_entry(pmd_t *pud, unsigned long addr,
 	return 0;
 }
 
-#ifdef CONFIG_HUGETLB_PAGE
-static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
-				struct vm_area_struct *vma, unsigned long addr)
-{
-	bool referenced = false;
-	pte_t entry = huge_ptep_get(mm, addr, pte);
-	struct folio *folio = pfn_folio(pte_pfn(entry));
-	unsigned long psize = huge_page_size(hstate_vma(vma));
-
-	folio_get(folio);
-
-	if (pte_young(entry)) {
-		referenced = true;
-		entry = pte_mkold(entry);
-		set_huge_pte_at(mm, addr, pte, entry, psize);
-	}
-
-#ifdef CONFIG_MMU_NOTIFIER
-	if (mmu_notifier_clear_young(mm, addr,
-				     addr + huge_page_size(hstate_vma(vma))))
-		referenced = true;
-#endif /* CONFIG_MMU_NOTIFIER */
-
-	if (referenced)
-		folio_set_young(folio);
-
-	folio_set_idle(folio);
-	folio_put(folio);
-}
-
-static int damon_mkold_hugetlb_entry(pte_t *pte, unsigned long hmask,
-				     unsigned long addr, unsigned long end,
-				     struct mm_walk *walk)
-{
-	struct hstate *h = hstate_vma(walk->vma);
-	spinlock_t *ptl;
-	pte_t entry;
-
-	ptl = huge_pte_lock(h, walk->mm, pte);
-	entry = huge_ptep_get(walk->mm, addr, pte);
-	if (!pte_present(entry))
-		goto out;
-
-	damon_hugetlb_mkold(pte, walk->mm, walk->vma, addr);
-
-out:
-	spin_unlock(ptl);
-	return 0;
-}
-#else
-#define damon_mkold_hugetlb_entry NULL
-#endif /* CONFIG_HUGETLB_PAGE */
-
 static const struct mm_walk_ops damon_mkold_ops = {
 	.pud_entry = damon_mkold_pud_entry,
 	.pmd_entry = damon_mkold_pmd_entry,
-	.hugetlb_entry = damon_mkold_hugetlb_entry,
 	.walk_lock = PGWALK_RDLOCK,
 };
 
@@ -562,44 +508,9 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
 	return 0;
 }
 
-#ifdef CONFIG_HUGETLB_PAGE
-static int damon_young_hugetlb_entry(pte_t *pte, unsigned long hmask,
-				     unsigned long addr, unsigned long end,
-				     struct mm_walk *walk)
-{
-	struct damon_young_walk_private *priv = walk->private;
-	struct hstate *h = hstate_vma(walk->vma);
-	struct folio *folio;
-	spinlock_t *ptl;
-	pte_t entry;
-
-	ptl = huge_pte_lock(h, walk->mm, pte);
-	entry = huge_ptep_get(walk->mm, addr, pte);
-	if (!pte_present(entry))
-		goto out;
-
-	folio = pfn_folio(pte_pfn(entry));
-	folio_get(folio);
-
-	if (pte_young(entry) || !folio_test_idle(folio) ||
-	    mmu_notifier_test_young(walk->mm, addr))
-		priv->young = true;
-	*priv->folio_sz = huge_page_size(h);
-
-	folio_put(folio);
-
-out:
-	spin_unlock(ptl);
-	return 0;
-}
-#else
-#define damon_young_hugetlb_entry NULL
-#endif /* CONFIG_HUGETLB_PAGE */
-
 static const struct mm_walk_ops damon_young_ops = {
 	.pud_entry = damon_young_pud_entry,
 	.pmd_entry = damon_young_pmd_entry,
-	.hugetlb_entry = damon_young_hugetlb_entry,
 	.walk_lock = PGWALK_RDLOCK,
 };
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 2b752f703b6d..fccde5dae818 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -463,59 +463,6 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
 #define hmm_vma_walk_pud	NULL
 #endif
 
-#ifdef CONFIG_HUGETLB_PAGE
-static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
-				      unsigned long start, unsigned long end,
-				      struct mm_walk *walk)
-{
-	unsigned long addr = start, i, pfn;
-	struct hmm_vma_walk *hmm_vma_walk = walk->private;
-	struct hmm_range *range = hmm_vma_walk->range;
-	struct vm_area_struct *vma = walk->vma;
-	unsigned int required_fault;
-	unsigned long pfn_req_flags;
-	unsigned long cpu_flags;
-	spinlock_t *ptl;
-	pte_t entry;
-
-	ptl = huge_pte_lock(hstate_vma(vma), walk->mm, pte);
-	entry = huge_ptep_get(walk->mm, addr, pte);
-
-	i = (start - range->start) >> PAGE_SHIFT;
-	pfn_req_flags = range->hmm_pfns[i];
-	cpu_flags = pte_to_hmm_pfn_flags(range, entry) |
-		    hmm_pfn_flags_order(huge_page_order(hstate_vma(vma)));
-	required_fault =
-		hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, cpu_flags);
-	if (required_fault) {
-		int ret;
-
-		spin_unlock(ptl);
-		hugetlb_vma_unlock_read(vma);
-		/*
-		 * Avoid deadlock: drop the vma lock before calling
-		 * hmm_vma_fault(), which will itself potentially take and
-		 * drop the vma lock. This is also correct from a
-		 * protection point of view, because there is no further
-		 * use here of either pte or ptl after dropping the vma
-		 * lock.
-		 */
-		ret = hmm_vma_fault(addr, end, required_fault, walk);
-		hugetlb_vma_lock_read(vma);
-		return ret;
-	}
-
-	pfn = pte_pfn(entry) + ((start & ~hmask) >> PAGE_SHIFT);
-	for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
-		range->hmm_pfns[i] = pfn | cpu_flags;
-
-	spin_unlock(ptl);
-	return 0;
-}
-#else
-#define hmm_vma_walk_hugetlb_entry NULL
-#endif /* CONFIG_HUGETLB_PAGE */
-
 static int hmm_vma_walk_test(unsigned long start, unsigned long end,
 			     struct mm_walk *walk)
 {
@@ -554,7 +501,6 @@ static const struct mm_walk_ops hmm_walk_ops = {
 	.pud_entry	= hmm_vma_walk_pud,
 	.pmd_entry	= hmm_vma_walk_pmd,
 	.pte_hole	= hmm_vma_walk_hole,
-	.hugetlb_entry	= hmm_vma_walk_hugetlb_entry,
 	.test_walk	= hmm_vma_walk_test,
 	.walk_lock	= PGWALK_RDLOCK,
 };
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 622862c4c300..c4ce4cf16651 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -882,26 +882,9 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
 	return ret;
 }
 
-#ifdef CONFIG_HUGETLB_PAGE
-static int hwpoison_hugetlb_range(pte_t *ptep, unsigned long hmask,
-			    unsigned long addr, unsigned long end,
-			    struct mm_walk *walk)
-{
-	struct hwpoison_walk *hwp = walk->private;
-	pte_t pte = huge_ptep_get(walk->mm, addr, ptep);
-	struct hstate *h = hstate_vma(walk->vma);
-
-	return check_hwpoisoned_entry(pte, addr, huge_page_shift(h),
-				      hwp->pfn, &hwp->tk);
-}
-#else
-#define hwpoison_hugetlb_range	NULL
-#endif
-
 static const struct mm_walk_ops hwpoison_walk_ops = {
 	.pud_entry = hwpoison_pud_range,
 	.pmd_entry = hwpoison_pte_range,
-	.hugetlb_entry = hwpoison_hugetlb_range,
 	.walk_lock = PGWALK_RDLOCK,
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 93b14090d484..8b5ca719193c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -645,51 +645,6 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
 	return 0;
 }
 
-static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
-			       unsigned long addr, unsigned long end,
-			       struct mm_walk *walk)
-{
-#ifdef CONFIG_HUGETLB_PAGE
-	struct queue_pages *qp = walk->private;
-	unsigned long flags = qp->flags;
-	struct folio *folio;
-	spinlock_t *ptl;
-	pte_t entry;
-
-	ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);
-	entry = huge_ptep_get(walk->mm, addr, pte);
-	if (!pte_present(entry)) {
-		if (unlikely(is_hugetlb_entry_migration(entry)))
-			qp->nr_failed++;
-		goto unlock;
-	}
-	folio = pfn_folio(pte_pfn(entry));
-	if (!queue_folio_required(folio, qp))
-		goto unlock;
-	if (!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) ||
-	    !vma_migratable(walk->vma)) {
-		qp->nr_failed++;
-		goto unlock;
-	}
-	/*
-	 * Unless MPOL_MF_MOVE_ALL, we try to avoid migrating a shared folio.
-	 * Choosing not to migrate a shared folio is not counted as a failure.
-	 *
-	 * See folio_likely_mapped_shared() on possible imprecision when we
-	 * cannot easily detect if a folio is shared.
-	 */
-	if ((flags & MPOL_MF_MOVE_ALL) ||
-	    (!folio_likely_mapped_shared(folio) && !hugetlb_pmd_shared(pte)))
-		if (!isolate_hugetlb(folio, qp->pagelist))
-			qp->nr_failed++;
-unlock:
-	spin_unlock(ptl);
-	if (qp->nr_failed && strictly_unmovable(flags))
-		return -EIO;
-#endif
-	return 0;
-}
-
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * This is used to mark a range of virtual addresses to be inaccessible.
@@ -760,7 +715,6 @@ static int queue_pages_test_walk(unsigned long start, unsigned long end,
 }
 
 static const struct mm_walk_ops queue_pages_walk_ops = {
-	.hugetlb_entry		= queue_folios_hugetlb,
 	.pud_entry		= queue_folios_pud,
 	.pmd_entry		= queue_folios_pte_range,
 	.test_walk		= queue_pages_test_walk,
@@ -768,7 +722,6 @@ static const struct mm_walk_ops queue_pages_walk_ops = {
 };
 
 static const struct mm_walk_ops queue_pages_lock_vma_walk_ops = {
-	.hugetlb_entry		= queue_folios_hugetlb,
 	.pmd_entry		= queue_folios_pte_range,
 	.test_walk		= queue_pages_test_walk,
 	.walk_lock		= PGWALK_WRLOCK,
diff --git a/mm/mincore.c b/mm/mincore.c
index 786df7246899..26f699a47371 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -23,27 +23,6 @@
 #include <linux/uaccess.h>
 #include "swap.h"
 
-static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
-			unsigned long end, struct mm_walk *walk)
-{
-#ifdef CONFIG_HUGETLB_PAGE
-	unsigned char present;
-	unsigned char *vec = walk->private;
-
-	/*
-	 * Hugepages under user process are always in RAM and never
-	 * swapped out, but theoretically it needs to be checked.
-	 */
-	present = pte && !huge_pte_none_mostly(huge_ptep_get(walk->mm, addr, pte));
-	for (; addr != end; vec++, addr += PAGE_SIZE)
-		*vec = present;
-	walk->private = vec;
-#else
-	BUG();
-#endif
-	return 0;
-}
-
 /*
  * Later we can get more picky about what "in core" means precisely.
  * For now, simply check to see if the page is in the page cache,
@@ -197,7 +176,6 @@ static const struct mm_walk_ops mincore_walk_ops = {
 	.pud_entry		= mincore_pud_range,
 	.pmd_entry		= mincore_pte_range,
 	.pte_hole		= mincore_unmapped_range,
-	.hugetlb_entry		= mincore_hugetlb,
 	.walk_lock		= PGWALK_RDLOCK,
 };
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 222ab434da54..ca1962d5cb95 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -555,15 +555,6 @@ static int prot_none_pte_entry(pte_t *pte, unsigned long addr,
 		0 : -EACCES;
 }
 
-static int prot_none_hugetlb_entry(pte_t *pte, unsigned long hmask,
-				   unsigned long addr, unsigned long next,
-				   struct mm_walk *walk)
-{
-	return pfn_modify_allowed(pte_pfn(ptep_get(pte)),
-				  *(pgprot_t *)(walk->private)) ?
-		0 : -EACCES;
-}
-
 static int prot_none_test(unsigned long addr, unsigned long next,
 			  struct mm_walk *walk)
 {
@@ -572,7 +563,6 @@ static int prot_none_test(unsigned long addr, unsigned long next,
 
 static const struct mm_walk_ops prot_none_walk_ops = {
 	.pte_entry		= prot_none_pte_entry,
-	.hugetlb_entry		= prot_none_hugetlb_entry,
 	.test_walk		= prot_none_test,
 	.walk_lock		= PGWALK_WRLOCK,
 };
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 78d45f1450aa..7e2721f49e68 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -257,49 +257,6 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
 	return err;
 }
 
-#ifdef CONFIG_HUGETLB_PAGE
-static unsigned long hugetlb_entry_end(struct hstate *h, unsigned long addr,
-				       unsigned long end)
-{
-	unsigned long boundary = (addr & huge_page_mask(h)) + huge_page_size(h);
-	return boundary < end ? boundary : end;
-}
-
-static int walk_hugetlb_range(unsigned long addr, unsigned long end,
-			      struct mm_walk *walk)
-{
-	struct vm_area_struct *vma = walk->vma;
-	struct hstate *h = hstate_vma(vma);
-	unsigned long next;
-	unsigned long hmask = huge_page_mask(h);
-	unsigned long sz = huge_page_size(h);
-	pte_t *pte;
-	const struct mm_walk_ops *ops = walk->ops;
-	int err = 0;
-
-	do {
-		next = hugetlb_entry_end(h, addr, end);
-		pte = hugetlb_walk(vma, addr & hmask, sz);
-		if (pte)
-			err = ops->hugetlb_entry(pte, hmask, addr, next, walk);
-		else if (ops->pte_hole)
-			err = ops->pte_hole(addr, next, -1, walk);
-		if (err)
-			break;
-	} while (addr = next, addr != end);
-
-	return err;
-}
-
-#else /* CONFIG_HUGETLB_PAGE */
-static int walk_hugetlb_range(unsigned long addr, unsigned long end,
-			      struct mm_walk *walk)
-{
-	return 0;
-}
-
-#endif /* CONFIG_HUGETLB_PAGE */
-
 /*
  * Decide whether we really walk over the current vma on [@start, @end)
  * or skip it via the returned value. Return 0 if we do walk over the
@@ -346,11 +303,7 @@ static int __walk_page_range(unsigned long start, unsigned long end,
 	}
 
 	vma_pgtable_walk_begin(vma);
-	if (is_vm_hugetlb_page(vma)) {
-		if (ops->hugetlb_entry)
-			err = walk_hugetlb_range(start, end, walk);
-	} else
-		err = walk_pgd_range(start, end, walk);
+	err = walk_pgd_range(start, end, walk);
 	vma_pgtable_walk_end(vma);
 
 	if (ops->post_vma)
-- 
2.26.2



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 00/45] hugetlb pagewalk unification
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (44 preceding siblings ...)
  2024-07-04  4:31 ` [PATCH 45/45] mm: Delete all hugetlb_entry entries Oscar Salvador
@ 2024-07-04 10:13 ` Oscar Salvador
  2024-07-04 10:44 ` David Hildenbrand
  46 siblings, 0 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-04 10:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, David Hildenbrand,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy

On Thu, Jul 04, 2024 at 06:30:47AM +0200, Oscar Salvador wrote:
> Hi all,
> 
> During Peter's talk at the LSFMM, it was agreed that one of the things
> that need to be done in order to further integrate hugetlb into mm core,
> is to unify generic and hugetlb pagewalkers.
> I started with this one, which is unifying hugetlb into generic
> pagewalk, instead of having its hugetlb_entry entries.
> Which means that pmd_entry/pte_entry(for cont-pte) entries will also deal with
> hugetlb vmas as well, and so will new pud_entry entries since hugetlb can be
> pud mapped (devm pages as well but we seem not to care about those with
> the exception of hmm code).
> 
> The outcome is this RFC.

Just dropping the git tree, in case someone finds it more suitable:

https://github.com/leberus/linux/ hugetlb-unification-pagewalk 


-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 11/45] fs/proc: Enable smaps_pte_entry to handle cont-pte mapped hugetlb vmas
  2024-07-04  4:30 ` [PATCH 11/45] fs/proc: Enable smaps_pte_entry to handle cont-pte mapped " Oscar Salvador
@ 2024-07-04 10:30   ` David Hildenbrand
  0 siblings, 0 replies; 66+ messages in thread
From: David Hildenbrand @ 2024-07-04 10:30 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, SeongJae Park,
	Miaohe Lin, Michal Hocko, Matthew Wilcox, Christophe Leroy

>   
>   #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
> @@ -952,6 +956,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>   	struct vm_area_struct *vma = walk->vma;
>   	pte_t *pte;
>   	spinlock_t *ptl;
> +	unsigned long size, cont_ptes;
>   
>   	ptl = pmd_huge_lock(pmd, vma);
>   	if (ptl) {
> @@ -965,7 +970,9 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>   		walk->action = ACTION_AGAIN;
>   		return 0;
>   	}
> -	for (; addr != end; pte++, addr += PAGE_SIZE)
> +	size = pte_cont(ptep_get(pte)) ? PAGE_SIZE * CONT_PTES : PAGE_SIZE;
> +	cont_ptes = pte_cont(ptep_get(pte)) ? CONT_PTES : 1;
> +	for (; addr != end; pte += cont_ptes, addr += size)

The better way to do this is to actually batch PTEs also when cont-pte 
is not around (e.g., x86). folio_pte_batch() does that and optimized 
automatically for the cont-pte bit -- which should only apply if we have 
a present folio.

So this code might need some slight reshuffling (lookup the folio first, 
if it's large use folio_pte_batch(), otherwise (small/no normal folio) 
process individual PTEs).

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 00/45] hugetlb pagewalk unification
  2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
                   ` (45 preceding siblings ...)
  2024-07-04 10:13 ` [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
@ 2024-07-04 10:44 ` David Hildenbrand
  2024-07-04 14:30   ` Peter Xu
  46 siblings, 1 reply; 66+ messages in thread
From: David Hildenbrand @ 2024-07-04 10:44 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, SeongJae Park,
	Miaohe Lin, Michal Hocko, Matthew Wilcox, Christophe Leroy

On 04.07.24 06:30, Oscar Salvador wrote:
> Hi all,
> 
> During Peter's talk at the LSFMM, it was agreed that one of the things
> that need to be done in order to further integrate hugetlb into mm core,
> is to unify generic and hugetlb pagewalkers.
> I started with this one, which is unifying hugetlb into generic
> pagewalk, instead of having its hugetlb_entry entries.
> Which means that pmd_entry/pte_entry(for cont-pte) entries will also deal with
> hugetlb vmas as well, and so will new pud_entry entries since hugetlb can be
> pud mapped (devm pages as well but we seem not to care about those with
> the exception of hmm code).
> 
> The outcome is this RFC.

First of all, a good step into the right direction, but maybe not what 
we want long-term. So I'm questioning whether we want this intermediate 
approach. walk_page_range() and friends are simply not a good design 
(e.g., indirect function calls).


There are roughly two categories of page table walkers we have:

1) We actually only want to walk present folios (to be precise, page
    ranges of folios). We should look into moving away from the walk the
    page walker API where possible, and have something better that
    directly gives us the folio (page ranges). Any PTE batching would be
    done internally.

2) We want to deal with non-present folios as well (swp entries and all
    kinds of other stuff). We should maybe implement our custom page
    table walker and move away from walk_page_range(). We are not walking
    "pages" after all but everything else included :)

Then, there is a subset of 1) where we only want to walk to a single 
address (a single folio). I'm working on that right now to get rid of 
follow_page() and some (IIRC 3: KSM an daemon) walk_page_range() users. 
Hugetlb will still remain a bit special, but I'm afraid we cannot hide 
that completely.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 28/45] mm/damon: Enable damon_mkold_pmd_entry to handle hugetlb vmas
  2024-07-04  4:31 ` [PATCH 28/45] mm/damon: Enable damon_mkold_pmd_entry to handle " Oscar Salvador
@ 2024-07-04 11:03   ` David Hildenbrand
  0 siblings, 0 replies; 66+ messages in thread
From: David Hildenbrand @ 2024-07-04 11:03 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, SeongJae Park,
	Miaohe Lin, Michal Hocko, Matthew Wilcox, Christophe Leroy

On 04.07.24 06:31, Oscar Salvador wrote:
> PMD-mapped hugetlb vmas will also reach damon_mkold_pmd_entry.
> Add the required code so it knows how to handle those there.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>   mm/damon/ops-common.c | 21 ++++++++++++++++-----
>   mm/damon/vaddr.c      | 15 +++++----------
>   2 files changed, 21 insertions(+), 15 deletions(-)
> 

(besides a lot of this code needing cleanups and likely fixes)

> diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c
> index d25d99cb5f2b..6727658a3ef5 100644
> --- a/mm/damon/ops-common.c
> +++ b/mm/damon/ops-common.c
> @@ -53,18 +53,29 @@ void damon_ptep_mkold(pte_t *pte, struct vm_area_struct *vma, unsigned long addr
>   
>   void damon_pmdp_mkold(pmd_t *pmd, struct vm_area_struct *vma, unsigned long addr)
>   {
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	struct folio *folio = damon_get_folio(pmd_pfn(pmdp_get(pmd)));
> +#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
> +	struct folio *folio;
> +	unsigned long size;
> +
> +	if (is_vm_hugetlb_page(vma)) {
> +		folio = pfn_folio(pdm_pfn(*pmd))
> +		folio_get(folio);
> +		size = huge_page_size(hstate_vma(vma));
> +	} else {
> +		folio = damon_get_folio(pmd_pfn(*pmd));
> +		size = PMD_SIZE;
> +	}
>   
>   	if (!folio)
> -		return;
> +		return 0;
>   
> -	if (pmdp_clear_young_notify(vma, addr, pmd))
> +	if (pmdp_test_and_clear_young(vma, addr, pmd) ||
> +	    mmu_notifier_clear_young(mm, addr, addr + size))
>   		folio_set_young(folio);

And I think here is the issue for both the cont-PMD and cont-PTE case:

For hugetlb we *absolutely must* use the set_huge_pte_at()-style 
functions, otherwise we might suddenly lose the cont-pte/cont-pmd bit. 
We cannot arbitrarily replace these "huge_pte" functions by others that 
work on individual PTEs/PMDs.

(noting that the hugetlb code in damon_hugetlb_mkold() is likely not 
correct, because we could be losing concurrently set dirty bits I believe)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 00/45] hugetlb pagewalk unification
  2024-07-04 10:44 ` David Hildenbrand
@ 2024-07-04 14:30   ` Peter Xu
  2024-07-04 15:23     ` David Hildenbrand
  2024-07-08 14:35     ` Jason Gunthorpe
  0 siblings, 2 replies; 66+ messages in thread
From: Peter Xu @ 2024-07-04 14:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Oscar Salvador, Andrew Morton, linux-kernel, linux-mm,
	Muchun Song, SeongJae Park, Miaohe Lin, Michal Hocko,
	Matthew Wilcox, Christophe Leroy, Jason Gunthorpe

Hey, David,

On Thu, Jul 04, 2024 at 12:44:38PM +0200, David Hildenbrand wrote:
> There are roughly two categories of page table walkers we have:
> 
> 1) We actually only want to walk present folios (to be precise, page
>    ranges of folios). We should look into moving away from the walk the
>    page walker API where possible, and have something better that
>    directly gives us the folio (page ranges). Any PTE batching would be
>    done internally.
> 
> 2) We want to deal with non-present folios as well (swp entries and all
>    kinds of other stuff). We should maybe implement our custom page
>    table walker and move away from walk_page_range(). We are not walking
>    "pages" after all but everything else included :)
> 
> Then, there is a subset of 1) where we only want to walk to a single address
> (a single folio). I'm working on that right now to get rid of follow_page()
> and some (IIRC 3: KSM an daemon) walk_page_range() users. Hugetlb will still
> remain a bit special, but I'm afraid we cannot hide that completely.

Maybe you are talking about the generic concept of "page table walker", not
walk_page_range() explicitly?

I'd agree if it's about the generic concept. For example, follow_page()
definitely is tailored for getting the page/folio.  But just to mention
Oscar's series is only working on the page_walk API itself.  What I see so
far is most of the walk_page API users aren't described above - most of
them do not fall into category 1) at all, if any. And they either need to
fetch something from the pgtable where having the folio isn't enough, or
modify the pgtable for different reasons.

A generic pgtable walker looks still wanted at some point, but it can be
too involved to be introduced together with this "remove hugetlb_entry"
effort.

To me, that future work is not yet about "get the folio, ignore the
pgtable", but about how to abstract different layers of pgtables, so the
caller may get a generic concept of "one pgtable entry" with the level/size
information attached, and process it at a single place / hook, and perhaps
hopefully even work with a device pgtable, as long as it's a radix tree.

[Adding Jason into the loop too. PS: Oscar, please consider copying Jason
 for the works too; Jason provided great lots of useful discussions in the
 past on relevant topics]

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 02/45] mm: Add {pmd,pud}_huge_lock helper
  2024-07-04  4:30 ` [PATCH 02/45] mm: Add {pmd,pud}_huge_lock helper Oscar Salvador
@ 2024-07-04 15:02   ` Peter Xu
  0 siblings, 0 replies; 66+ messages in thread
From: Peter Xu @ 2024-07-04 15:02 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, linux-kernel, linux-mm, Muchun Song,
	David Hildenbrand, SeongJae Park, Miaohe Lin, Michal Hocko,
	Matthew Wilcox, Christophe Leroy

On Thu, Jul 04, 2024 at 06:30:49AM +0200, Oscar Salvador wrote:
> Deep down hugetlb and thp use the same lock for pud and pmd.
> Create two helpers that can be directly used by both of them,
> as they will be used in the generic pagewalkers.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  include/linux/mm_inline.h | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index f4fe593c1400..93e3eb86ef4e 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -9,6 +9,7 @@
>  #include <linux/string.h>
>  #include <linux/userfaultfd_k.h>
>  #include <linux/swapops.h>
> +#include <linux/hugetlb.h>
>  
>  /**
>   * folio_is_file_lru - Should the folio be on a file LRU or anon LRU?
> @@ -590,4 +591,30 @@ static inline bool vma_has_recency(struct vm_area_struct *vma)
>  	return true;
>  }
>  
> +static inline spinlock_t *pmd_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
> +{
> +	spinlock_t *ptl;
> +
> +	if (pmd_leaf(*pmd)) {
> +		ptl = pmd_lock(vma->vm_mm, pmd);
> +		if (pmd_leaf(*pmd))
> +			return ptl;
> +		spin_unlock(ptl);
> +	}
> +	return NULL;
> +}
> +
> +static inline spinlock_t *pud_huge_lock(pud_t *pud, struct vm_area_struct *vma)
> +{
> +	spinlock_t *ptl = pud_lock(vma->vm_mm, pud);
> +
> +	if (pud_leaf(*pud)) {
> +		ptl = pud_lock(vma->vm_mm, pud);
> +		if (pud_leaf(*pud))
> +			return ptl;
> +		spin_unlock(ptl);
> +	}
> +	return NULL;
> +}

IIRC I left similar comment before somewhere when we're discussing.. but we
may need to consider swap entries too.

I think it might be easier we stick with pxd_trans_huge_lock(), but some
slight modification on top: (1) rename them, perhaps s/trans_//g? (2) need
to also handle swap entry for puds (hugetlb migration entries, right now
pud_trans_huge_lock() didn't consider that).

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 00/45] hugetlb pagewalk unification
  2024-07-04 14:30   ` Peter Xu
@ 2024-07-04 15:23     ` David Hildenbrand
  2024-07-04 16:43       ` Peter Xu
  2024-07-08  8:18       ` Oscar Salvador
  2024-07-08 14:35     ` Jason Gunthorpe
  1 sibling, 2 replies; 66+ messages in thread
From: David Hildenbrand @ 2024-07-04 15:23 UTC (permalink / raw)
  To: Peter Xu
  Cc: Oscar Salvador, Andrew Morton, linux-kernel, linux-mm,
	Muchun Song, SeongJae Park, Miaohe Lin, Michal Hocko,
	Matthew Wilcox, Christophe Leroy, Jason Gunthorpe

On 04.07.24 16:30, Peter Xu wrote:
> Hey, David,
> 

Hi!

> On Thu, Jul 04, 2024 at 12:44:38PM +0200, David Hildenbrand wrote:
>> There are roughly two categories of page table walkers we have:
>>
>> 1) We actually only want to walk present folios (to be precise, page
>>     ranges of folios). We should look into moving away from the walk the
>>     page walker API where possible, and have something better that
>>     directly gives us the folio (page ranges). Any PTE batching would be
>>     done internally.
>>
>> 2) We want to deal with non-present folios as well (swp entries and all
>>     kinds of other stuff). We should maybe implement our custom page
>>     table walker and move away from walk_page_range(). We are not walking
>>     "pages" after all but everything else included :)
>>
>> Then, there is a subset of 1) where we only want to walk to a single address
>> (a single folio). I'm working on that right now to get rid of follow_page()
>> and some (IIRC 3: KSM an daemon) walk_page_range() users. Hugetlb will still
>> remain a bit special, but I'm afraid we cannot hide that completely.
> 
> Maybe you are talking about the generic concept of "page table walker", not
> walk_page_range() explicitly?
> 
> I'd agree if it's about the generic concept. For example, follow_page()
> definitely is tailored for getting the page/folio.  But just to mention
> Oscar's series is only working on the page_walk API itself.  What I see so
> far is most of the walk_page API users aren't described above - most of
> them do not fall into category 1) at all, if any. And they either need to
> fetch something from the pgtable where having the folio isn't enough, or
> modify the pgtable for different reasons.

Right, but having 1) does not imply that we won't be having access to 
the page table entry in an abstracted form, the folio is simply the 
primary source of information that these users care about. 2) is an 
extension of 1), but walking+exposing all (or most) other page table 
entries as well in some form, which is certainly harder to get right.

Taking a look at some examples:

* madvise_cold_or_pageout_pte_range() only cares about present folios.
* madvise_free_pte_range() only cares about present folios.
* break_ksm_ops() only cares about present folios.
* mlock_walk_ops() only cares about present folios.
* damon_mkold_ops() only cares about present folios.
* damon_young_ops() only cares about present folios.

There are certainly other page_walk API users that are more involved and 
need to do way more magic, which fall into category 2). In particular 
things like swapin_walk_ops(), hmm_walk_ops() and most 
fs/proc/task_mmu.c. Likely there are plenty of them.


Taking a look at vmscan.c/walk_mm(), I'm not sure how much benefit there 
even is left in using walk_page_range() :)

> 
> A generic pgtable walker looks still wanted at some point, but it can be
> too involved to be introduced together with this "remove hugetlb_entry"
> effort.

My thinking was if "remove hugetlb_entry" cannot wait for "remove 
page_walk", because we found a reasonable way to do it better and 
convert the individual users. Maybe it can't.

I've not given up hope that we can end up with something better and 
clearer than the current page_walk API :)

> 
> To me, that future work is not yet about "get the folio, ignore the
> pgtable", but about how to abstract different layers of pgtables, so the
> caller may get a generic concept of "one pgtable entry" with the level/size
> information attached, and process it at a single place / hook, and perhaps
> hopefully even work with a device pgtable, as long as it's a radix tree.

To me 2) is an extension of 1). My thinking is that we can start with 1) 
without having to are about all details of 2). If we have to make it as 
generic that we can walk any page table layout out there in this world, 
I'm not so sure.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 05/45] mm/pagewalk: Enable walk_pmd_range to handle cont-pmds
  2024-07-04  4:30 ` [PATCH 05/45] mm/pagewalk: Enable walk_pmd_range to handle cont-pmds Oscar Salvador
@ 2024-07-04 15:41   ` David Hildenbrand
  2024-07-05 16:56   ` kernel test robot
  1 sibling, 0 replies; 66+ messages in thread
From: David Hildenbrand @ 2024-07-04 15:41 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: linux-kernel, linux-mm, Peter Xu, Muchun Song, SeongJae Park,
	Miaohe Lin, Michal Hocko, Matthew Wilcox, Christophe Leroy

On 04.07.24 06:30, Oscar Salvador wrote:
> HugeTLB pages can be cont-pmd mapped, so teach walk_pmd_range to
> handle those.
> This will save us some cycles as we do it in one-shot instead of
> calling in multiple times.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>   include/linux/pgtable.h | 12 ++++++++++++
>   mm/pagewalk.c           | 12 +++++++++---
>   2 files changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 2a6a3cccfc36..3a7b8751747e 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1914,6 +1914,18 @@ typedef unsigned int pgtbl_mod_mask;
>   #define __pte_leaf_size(x,y) pte_leaf_size(y)
>   #endif
>   
> +#ifndef pmd_cont
> +#define pmd_cont(x) false
> +#endif
> +
> +#ifndef CONT_PMD_SIZE
> +#define CONT_PMD_SIZE 0
> +#endif
> +
> +#ifndef CONT_PMDS
> +#define CONT_PMDS 0
> +#endif
> +
>   /*
>    * We always define pmd_pfn for all archs as it's used in lots of generic
>    * code.  Now it happens too for pud_pfn (and can happen for larger
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index d93e77411482..a9c36f9e9820 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -81,11 +81,18 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>   	const struct mm_walk_ops *ops = walk->ops;
>   	int err = 0;
>   	int depth = real_depth(3);
> +	int cont_pmds;
>   
>   	pmd = pmd_offset(pud, addr);
>   	do {
>   again:
> -		next = pmd_addr_end(addr, end);
> +		if (pmd_cont(*pmd)) {
> +			cont_pmds = CONT_PMDS;
> +			next = pmd_cont_addr_end(addr, end);
> +		} else {
> +			cont_pmds = 1;
> +			next = pmd_addr_end(addr, end);
> +		}
>   		if (pmd_none(*pmd)) {
>   			if (ops->pte_hole)
>   				err = ops->pte_hole(addr, next, depth, walk);
> @@ -126,8 +133,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
>   
>   		if (walk->action == ACTION_AGAIN)
>   			goto again;
> -
> -	} while (pmd++, addr = next, addr != end);
> +	} while (pmd += cont_pmds, addr = next, addr != end);

Similar to my other comment regarding PTE batching, this is very 
specific to architectures that support cont-pmds.

Yes, right now we only have that on architectures that support 
cont-pmd-sized hugetlb, but Willy is interested in us supporting+mapping 
folios > PMD_SIZE, whereby we'd want to batch even without arch-specific 
cont-pmd bits.

Similar to the other (pte) case, having a way to generically patch 
folios will me more beneficial. Note that cont-pmd/cont-pte is only 
relevant for present entries (-> mapping folios).

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 00/45] hugetlb pagewalk unification
  2024-07-04 15:23     ` David Hildenbrand
@ 2024-07-04 16:43       ` Peter Xu
  2024-07-08  8:18       ` Oscar Salvador
  1 sibling, 0 replies; 66+ messages in thread
From: Peter Xu @ 2024-07-04 16:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Oscar Salvador, Andrew Morton, linux-kernel, linux-mm,
	Muchun Song, SeongJae Park, Miaohe Lin, Michal Hocko,
	Matthew Wilcox, Christophe Leroy, Jason Gunthorpe

On Thu, Jul 04, 2024 at 05:23:30PM +0200, David Hildenbrand wrote:
> On 04.07.24 16:30, Peter Xu wrote:
> > Hey, David,
> > 
> 
> Hi!
> 
> > On Thu, Jul 04, 2024 at 12:44:38PM +0200, David Hildenbrand wrote:
> > > There are roughly two categories of page table walkers we have:
> > > 
> > > 1) We actually only want to walk present folios (to be precise, page
> > >     ranges of folios). We should look into moving away from the walk the
> > >     page walker API where possible, and have something better that
> > >     directly gives us the folio (page ranges). Any PTE batching would be
> > >     done internally.
> > > 
> > > 2) We want to deal with non-present folios as well (swp entries and all
> > >     kinds of other stuff). We should maybe implement our custom page
> > >     table walker and move away from walk_page_range(). We are not walking
> > >     "pages" after all but everything else included :)
> > > 
> > > Then, there is a subset of 1) where we only want to walk to a single address
> > > (a single folio). I'm working on that right now to get rid of follow_page()
> > > and some (IIRC 3: KSM an daemon) walk_page_range() users. Hugetlb will still
> > > remain a bit special, but I'm afraid we cannot hide that completely.
> > 
> > Maybe you are talking about the generic concept of "page table walker", not
> > walk_page_range() explicitly?
> > 
> > I'd agree if it's about the generic concept. For example, follow_page()
> > definitely is tailored for getting the page/folio.  But just to mention
> > Oscar's series is only working on the page_walk API itself.  What I see so
> > far is most of the walk_page API users aren't described above - most of
> > them do not fall into category 1) at all, if any. And they either need to
> > fetch something from the pgtable where having the folio isn't enough, or
> > modify the pgtable for different reasons.
> 
> Right, but having 1) does not imply that we won't be having access to the
> page table entry in an abstracted form, the folio is simply the primary
> source of information that these users care about. 2) is an extension of 1),
> but walking+exposing all (or most) other page table entries as well in some
> form, which is certainly harder to get right.
> 
> Taking a look at some examples:
> 
> * madvise_cold_or_pageout_pte_range() only cares about present folios.
> * madvise_free_pte_range() only cares about present folios.
> * break_ksm_ops() only cares about present folios.
> * mlock_walk_ops() only cares about present folios.
> * damon_mkold_ops() only cares about present folios.
> * damon_young_ops() only cares about present folios.
> 
> There are certainly other page_walk API users that are more involved and
> need to do way more magic, which fall into category 2). In particular things
> like swapin_walk_ops(), hmm_walk_ops() and most fs/proc/task_mmu.c. Likely
> there are plenty of them.
> 
> 
> Taking a look at vmscan.c/walk_mm(), I'm not sure how much benefit there
> even is left in using walk_page_range() :)

Hmm, I need to confess from a quick look I didn't yet see why the current
page_walk API won't work under p4d there.. it could be that I missed some
details.

> 
> > 
> > A generic pgtable walker looks still wanted at some point, but it can be
> > too involved to be introduced together with this "remove hugetlb_entry"
> > effort.
> 
> My thinking was if "remove hugetlb_entry" cannot wait for "remove
> page_walk", because we found a reasonable way to do it better and convert
> the individual users. Maybe it can't.
> 
> I've not given up hope that we can end up with something better and clearer
> than the current page_walk API :)

Oh so you meant you have plan to rewrite some of the page_walk API users to
use the new API you plan to propose?

It looks fine by me. I assume anything new will already taking hugetlb
folios into account, so it'll "just work" and actually reduce number of
patches here, am I right?

If it still needs time to land, I think it's also fine that it's done on
top of Oscar's.  So it may boil down to the schedule in that case, and we
may also want to know how Oscar sees this.

> 
> > 
> > To me, that future work is not yet about "get the folio, ignore the
> > pgtable", but about how to abstract different layers of pgtables, so the
> > caller may get a generic concept of "one pgtable entry" with the level/size
> > information attached, and process it at a single place / hook, and perhaps
> > hopefully even work with a device pgtable, as long as it's a radix tree.
> 
> To me 2) is an extension of 1). My thinking is that we can start with 1)
> without having to are about all details of 2). If we have to make it as
> generic that we can walk any page table layout out there in this world, I'm
> not so sure.

I still see a hope there, after all the radix pgtable is indeed a common
abstraction and it looks to me a lot of things share that structure. IIUC
one challenge of it is being fast.  So.. I don't know. But I'll be more
than happy to see it come if someone can work it out, and it just sounds
very nice too if some chunk of code can be run the same for mm/, kvm/ and
iommu/.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 13/45] mm: Implement pud-version uffd functions
  2024-07-04  4:31 ` [PATCH 13/45] mm: Implement pud-version uffd functions Oscar Salvador
@ 2024-07-05 15:48   ` kernel test robot
  2024-07-05 15:48   ` kernel test robot
  1 sibling, 0 replies; 66+ messages in thread
From: kernel test robot @ 2024-07-05 15:48 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: llvm, oe-kbuild-all, Linux Memory Management List, linux-kernel,
	Peter Xu, Muchun Song, David Hildenbrand, SeongJae Park,
	Miaohe Lin, Michal Hocko, Matthew Wilcox, Christophe Leroy,
	Oscar Salvador

Hi Oscar,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on sj/damon/next next-20240703]
[cannot apply to powerpc/next powerpc/fixes linus/master v6.10-rc6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Oscar-Salvador/arch-x86-Drop-own-definition-of-pgd-p4d_leaf/20240705-042640
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20240704043132.28501-14-osalvador%40suse.de
patch subject: [PATCH 13/45] mm: Implement pud-version uffd functions
config: s390-allnoconfig (https://download.01.org/0day-ci/archive/20240705/202407052314.JxgKIfN1-lkp@intel.com/config)
compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project a0c6b8aef853eedaa0980f07c0a502a5a8a9740e)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240705/202407052314.JxgKIfN1-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202407052314.JxgKIfN1-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from arch/s390/kernel/asm-offsets.c:11:
   In file included from include/linux/kvm_host.h:16:
   In file included from include/linux/mm.h:30:
   In file included from include/linux/pgtable.h:17:
>> include/asm-generic/pgtable_uffd.h:32:9: error: use of undeclared identifier 'pmd'; did you mean 'pud'?
      32 |         return pmd;
         |                ^~~
         |                pud
   include/asm-generic/pgtable_uffd.h:30:50: note: 'pud' declared here
      30 | static __always_inline pud_t pud_mkuffd_wp(pud_t pud)
         |                                                  ^
   include/asm-generic/pgtable_uffd.h:47:9: error: use of undeclared identifier 'pmd'; did you mean 'pud'?
      47 |         return pmd;
         |                ^~~
         |                pud
   include/asm-generic/pgtable_uffd.h:45:54: note: 'pud' declared here
      45 | static __always_inline pud_t pud_clear_uffd_wp(pud_t pud)
         |                                                      ^
   In file included from arch/s390/kernel/asm-offsets.c:11:
   In file included from include/linux/kvm_host.h:16:
   In file included from include/linux/mm.h:2221:
   include/linux/vmstat.h:514:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     514 |         return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
         |                               ~~~~~~~~~~~ ^ ~~~
   In file included from arch/s390/kernel/asm-offsets.c:11:
   In file included from include/linux/kvm_host.h:19:
   In file included from include/linux/msi.h:27:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:14:
   In file included from arch/s390/include/asm/io.h:93:
   include/asm-generic/io.h:548:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     548 |         val = __raw_readb(PCI_IOBASE + addr);
         |                           ~~~~~~~~~~ ^
   include/asm-generic/io.h:561:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     561 |         val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/big_endian.h:37:59: note: expanded from macro '__le16_to_cpu'
      37 | #define __le16_to_cpu(x) __swab16((__force __u16)(__le16)(x))
         |                                                           ^
   include/uapi/linux/swab.h:102:54: note: expanded from macro '__swab16'
     102 | #define __swab16(x) (__u16)__builtin_bswap16((__u16)(x))
         |                                                      ^
   In file included from arch/s390/kernel/asm-offsets.c:11:
   In file included from include/linux/kvm_host.h:19:
   In file included from include/linux/msi.h:27:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:14:
   In file included from arch/s390/include/asm/io.h:93:
   include/asm-generic/io.h:574:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     574 |         val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/big_endian.h:35:59: note: expanded from macro '__le32_to_cpu'
      35 | #define __le32_to_cpu(x) __swab32((__force __u32)(__le32)(x))
         |                                                           ^
   include/uapi/linux/swab.h:115:54: note: expanded from macro '__swab32'
     115 | #define __swab32(x) (__u32)__builtin_bswap32((__u32)(x))
         |                                                      ^
   In file included from arch/s390/kernel/asm-offsets.c:11:
   In file included from include/linux/kvm_host.h:19:
   In file included from include/linux/msi.h:27:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:14:
   In file included from arch/s390/include/asm/io.h:93:
   include/asm-generic/io.h:585:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     585 |         __raw_writeb(value, PCI_IOBASE + addr);
         |                             ~~~~~~~~~~ ^
   include/asm-generic/io.h:595:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     595 |         __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   include/asm-generic/io.h:605:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     605 |         __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   include/asm-generic/io.h:693:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     693 |         readsb(PCI_IOBASE + addr, buffer, count);
         |                ~~~~~~~~~~ ^
   include/asm-generic/io.h:701:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     701 |         readsw(PCI_IOBASE + addr, buffer, count);
         |                ~~~~~~~~~~ ^
   include/asm-generic/io.h:709:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     709 |         readsl(PCI_IOBASE + addr, buffer, count);
         |                ~~~~~~~~~~ ^
   include/asm-generic/io.h:718:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     718 |         writesb(PCI_IOBASE + addr, buffer, count);
         |                 ~~~~~~~~~~ ^
   include/asm-generic/io.h:727:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     727 |         writesw(PCI_IOBASE + addr, buffer, count);
         |                 ~~~~~~~~~~ ^
   include/asm-generic/io.h:736:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     736 |         writesl(PCI_IOBASE + addr, buffer, count);
         |                 ~~~~~~~~~~ ^
   In file included from arch/s390/kernel/asm-offsets.c:11:
   In file included from include/linux/kvm_host.h:19:
   In file included from include/linux/msi.h:27:
   In file included from include/linux/irq.h:591:
   In file included from arch/s390/include/asm/hw_irq.h:6:
   In file included from include/linux/pci.h:37:
   In file included from include/linux/device.h:32:
   In file included from include/linux/device/driver.h:21:
   In file included from include/linux/module.h:19:
   In file included from include/linux/elf.h:6:
   In file included from arch/s390/include/asm/elf.h:160:
   include/linux/compat.h:454:22: warning: array index 3 is past the end of the array (that has type 'const unsigned long[1]') [-Warray-bounds]
     454 |         case 4: v.sig[7] = (set->sig[3] >> 32); v.sig[6] = set->sig[3];
         |                             ^        ~
   arch/s390/include/asm/signal.h:22:9: note: array 'sig' declared here


vim +32 include/asm-generic/pgtable_uffd.h

    29	
    30	static __always_inline pud_t pud_mkuffd_wp(pud_t pud)
    31	{
  > 32		return pmd;
    33	}
    34	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 13/45] mm: Implement pud-version uffd functions
  2024-07-04  4:31 ` [PATCH 13/45] mm: Implement pud-version uffd functions Oscar Salvador
  2024-07-05 15:48   ` kernel test robot
@ 2024-07-05 15:48   ` kernel test robot
  1 sibling, 0 replies; 66+ messages in thread
From: kernel test robot @ 2024-07-05 15:48 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: oe-kbuild-all, Linux Memory Management List, linux-kernel,
	Peter Xu, Muchun Song, David Hildenbrand, SeongJae Park,
	Miaohe Lin, Michal Hocko, Matthew Wilcox, Christophe Leroy,
	Oscar Salvador

Hi Oscar,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on sj/damon/next next-20240703]
[cannot apply to powerpc/next powerpc/fixes linus/master v6.10-rc6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Oscar-Salvador/arch-x86-Drop-own-definition-of-pgd-p4d_leaf/20240705-042640
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20240704043132.28501-14-osalvador%40suse.de
patch subject: [PATCH 13/45] mm: Implement pud-version uffd functions
config: openrisc-allnoconfig (https://download.01.org/0day-ci/archive/20240705/202407052337.jk13ShDm-lkp@intel.com/config)
compiler: or1k-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240705/202407052337.jk13ShDm-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202407052337.jk13ShDm-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from include/linux/pgtable.h:17,
                    from include/linux/mm.h:30,
                    from include/linux/pid_namespace.h:7,
                    from include/linux/ptrace.h:10,
                    from arch/openrisc/kernel/asm-offsets.c:28:
   include/asm-generic/pgtable_uffd.h: In function 'pud_mkuffd_wp':
>> include/asm-generic/pgtable_uffd.h:32:16: error: 'pmd' undeclared (first use in this function); did you mean 'pud'?
      32 |         return pmd;
         |                ^~~
         |                pud
   include/asm-generic/pgtable_uffd.h:32:16: note: each undeclared identifier is reported only once for each function it appears in
   include/asm-generic/pgtable_uffd.h: In function 'pud_clear_uffd_wp':
   include/asm-generic/pgtable_uffd.h:47:16: error: 'pmd' undeclared (first use in this function); did you mean 'pud'?
      47 |         return pmd;
         |                ^~~
         |                pud
   make[3]: *** [scripts/Makefile.build:117: arch/openrisc/kernel/asm-offsets.s] Error 1
   make[3]: Target 'prepare' not remade because of errors.
   make[2]: *** [Makefile:1208: prepare0] Error 2
   make[2]: Target 'prepare' not remade because of errors.
   make[1]: *** [Makefile:240: __sub-make] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [Makefile:240: __sub-make] Error 2
   make: Target 'prepare' not remade because of errors.


vim +32 include/asm-generic/pgtable_uffd.h

    29	
    30	static __always_inline pud_t pud_mkuffd_wp(pud_t pud)
    31	{
  > 32		return pmd;
    33	}
    34	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 05/45] mm/pagewalk: Enable walk_pmd_range to handle cont-pmds
  2024-07-04  4:30 ` [PATCH 05/45] mm/pagewalk: Enable walk_pmd_range to handle cont-pmds Oscar Salvador
  2024-07-04 15:41   ` David Hildenbrand
@ 2024-07-05 16:56   ` kernel test robot
  1 sibling, 0 replies; 66+ messages in thread
From: kernel test robot @ 2024-07-05 16:56 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: llvm, oe-kbuild-all, Linux Memory Management List, linux-kernel,
	Peter Xu, Muchun Song, David Hildenbrand, SeongJae Park,
	Miaohe Lin, Michal Hocko, Matthew Wilcox, Christophe Leroy,
	Oscar Salvador

Hi Oscar,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on sj/damon/next powerpc/next powerpc/fixes linus/master v6.10-rc6 next-20240703]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Oscar-Salvador/arch-x86-Drop-own-definition-of-pgd-p4d_leaf/20240705-042640
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20240704043132.28501-6-osalvador%40suse.de
patch subject: [PATCH 05/45] mm/pagewalk: Enable walk_pmd_range to handle cont-pmds
config: um-allnoconfig (https://download.01.org/0day-ci/archive/20240706/202407060025.WIFWw7WY-lkp@intel.com/config)
compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240706/202407060025.WIFWw7WY-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202407060025.WIFWw7WY-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from mm/pagewalk.c:3:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:14:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:548:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     548 |         val = __raw_readb(PCI_IOBASE + addr);
         |                           ~~~~~~~~~~ ^
   include/asm-generic/io.h:561:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     561 |         val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
      37 | #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
         |                                                   ^
   In file included from mm/pagewalk.c:3:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:14:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:574:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     574 |         val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu'
      35 | #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
         |                                                   ^
   In file included from mm/pagewalk.c:3:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:14:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:585:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     585 |         __raw_writeb(value, PCI_IOBASE + addr);
         |                             ~~~~~~~~~~ ^
   include/asm-generic/io.h:595:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     595 |         __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   include/asm-generic/io.h:605:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     605 |         __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   include/asm-generic/io.h:693:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     693 |         readsb(PCI_IOBASE + addr, buffer, count);
         |                ~~~~~~~~~~ ^
   include/asm-generic/io.h:701:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     701 |         readsw(PCI_IOBASE + addr, buffer, count);
         |                ~~~~~~~~~~ ^
   include/asm-generic/io.h:709:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     709 |         readsl(PCI_IOBASE + addr, buffer, count);
         |                ~~~~~~~~~~ ^
   include/asm-generic/io.h:718:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     718 |         writesb(PCI_IOBASE + addr, buffer, count);
         |                 ~~~~~~~~~~ ^
   include/asm-generic/io.h:727:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     727 |         writesw(PCI_IOBASE + addr, buffer, count);
         |                 ~~~~~~~~~~ ^
   include/asm-generic/io.h:736:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     736 |         writesl(PCI_IOBASE + addr, buffer, count);
         |                 ~~~~~~~~~~ ^
>> mm/pagewalk.c:91:11: error: call to undeclared function 'pmd_cont_addr_end'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
      91 |                         next = pmd_cont_addr_end(addr, end);
         |                                ^
   12 warnings and 1 error generated.


vim +/pmd_cont_addr_end +91 mm/pagewalk.c

    75	
    76	static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
    77				  struct mm_walk *walk)
    78	{
    79		pmd_t *pmd;
    80		unsigned long next;
    81		const struct mm_walk_ops *ops = walk->ops;
    82		int err = 0;
    83		int depth = real_depth(3);
    84		int cont_pmds;
    85	
    86		pmd = pmd_offset(pud, addr);
    87		do {
    88	again:
    89			if (pmd_cont(*pmd)) {
    90				cont_pmds = CONT_PMDS;
  > 91				next = pmd_cont_addr_end(addr, end);
    92			} else {
    93				cont_pmds = 1;
    94				next = pmd_addr_end(addr, end);
    95			}
    96			if (pmd_none(*pmd)) {
    97				if (ops->pte_hole)
    98					err = ops->pte_hole(addr, next, depth, walk);
    99				if (err)
   100					break;
   101				continue;
   102			}
   103	
   104			walk->action = ACTION_SUBTREE;
   105	
   106			/*
   107			 * This implies that each ->pmd_entry() handler
   108			 * needs to know about pmd_trans_huge() pmds
   109			 */
   110			if (ops->pmd_entry)
   111				err = ops->pmd_entry(pmd, addr, next, walk);
   112			if (err)
   113				break;
   114	
   115			if (walk->action == ACTION_AGAIN)
   116				goto again;
   117	
   118			/*
   119			 * Check this here so we only break down trans_huge
   120			 * pages when we _need_ to
   121			 */
   122			if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) ||
   123			    walk->action == ACTION_CONTINUE ||
   124			    !(ops->pte_entry))
   125				continue;
   126	
   127			if (walk->vma)
   128				split_huge_pmd(walk->vma, pmd, addr);
   129	
   130			err = walk_pte_range(pmd, addr, next, walk);
   131			if (err)
   132				break;
   133	
   134			if (walk->action == ACTION_AGAIN)
   135				goto again;
   136		} while (pmd += cont_pmds, addr = next, addr != end);
   137	
   138		return err;
   139	}
   140	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 00/45] hugetlb pagewalk unification
  2024-07-04 15:23     ` David Hildenbrand
  2024-07-04 16:43       ` Peter Xu
@ 2024-07-08  8:18       ` Oscar Salvador
  2024-07-08 14:28         ` Jason Gunthorpe
  2024-07-10  3:52         ` David Hildenbrand
  1 sibling, 2 replies; 66+ messages in thread
From: Oscar Salvador @ 2024-07-08  8:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, Andrew Morton, linux-kernel, linux-mm, Muchun Song,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Jason Gunthorpe

On Thu, Jul 04, 2024 at 05:23:30PM +0200, David Hildenbrand wrote:
> My thinking was if "remove hugetlb_entry" cannot wait for "remove
> page_walk", because we found a reasonable way to do it better and convert
> the individual users. Maybe it can't.
> 
> I've not given up hope that we can end up with something better and clearer
> than the current page_walk API :)

Hi David,

I agree that the current page_walk might be a bit convoluted, and that the
indirect functions approach is a bit of a hassle.
Having said that, let me clarify something.

Although this patchset touches the page_walk API wrt. getting rid of
hugetlb special casing all over the place, my goal was not as focused on
the page_walk as it was on the hugetlb code to gain hability to be
interpreted on PUD/PMD level.

One of the things, among other things, that helped in creating this
mess/duplication we have wrt. hugetlb code vs mm core is that hugetlb
__always__ operates on ptes, which means that we cannot rely on the mm
core to do the right thing, and we need a bunch of hugetlb-pte functions
that knows about their thing, so we lean on that.

IMHO, that was a mistake to start with, but I was not around when it was
introduced and maybe there were good reasons to deal with that the way
it is done.
But, the thing is that my ultimate goal, is for hugetlb code to be able
to deal with PUD/PMD (pte and cont-pte is already dealt with) just like
mm core does for THP (PUD is not supported by THP, but you get me), and
that is not that difficult to do, as this patchset tries to prove.

Of course, for hugetlb to gain the hability to operate on PUD/PMD, this
means we need to add a fairly amount of code. e.g: for operating
hugepages on PUD level, code for markers on PUD/PMD level for
uffd/poison stuff (only dealt
on pmd/pte atm AFAIK), swap functions for PUD (is_swap_pud for PUD), etc.
Basically, almost all we did for PMD-* stuff we need it for PUD as well,
and that will be around when THP gains support for PUD if it ever does
(I guess that in a few years if memory capacity keeps increasing).

E.g: pud_to_swp_entry to detect that a swp entry is hwpoison with
     is_hwpoison_entry

Yes, it is a hassle to have more code around, but IMO, this new code
will help us in 1) move away from __always__ operate on ptes 2) ease
integrate hugetlb code into mm core.

I will keep working on this patchset not because of pagewalk savings,
but because I think it will help us in have hugetlb more mm-core ready,
since the current pagewalk has to test that a hugetlb page can be
properly read on PUD/PMD/PTE level no matter what: uffd for hugetlb on PUD/PMD,
hwpoison entries for swp on PUD/PMD, pud invalidating, etc.

If that gets accomplished, I think that a fair amount of code that lives
in hugetlb.c can be deleted/converted as less special casing will be needed.

I might be wrong and maybe I will hit a brick wall, but hopefully not.



-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 00/45] hugetlb pagewalk unification
  2024-07-08  8:18       ` Oscar Salvador
@ 2024-07-08 14:28         ` Jason Gunthorpe
  2024-07-10  3:52         ` David Hildenbrand
  1 sibling, 0 replies; 66+ messages in thread
From: Jason Gunthorpe @ 2024-07-08 14:28 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: David Hildenbrand, Peter Xu, Andrew Morton, linux-kernel,
	linux-mm, Muchun Song, SeongJae Park, Miaohe Lin, Michal Hocko,
	Matthew Wilcox, Christophe Leroy

On Mon, Jul 08, 2024 at 10:18:30AM +0200, Oscar Salvador wrote:

> IMHO, that was a mistake to start with, but I was not around when it was
> introduced and maybe there were good reasons to deal with that the way
> it is done.

It is a trade off, either you have to write out a lot of duplicated
code for every level or you have this sort of level agnostic design.

> But, the thing is that my ultimate goal, is for hugetlb code to be able
> to deal with PUD/PMD (pte and cont-pte is already dealt with) just like
> mm core does for THP (PUD is not supported by THP, but you get me), and
> that is not that difficult to do, as this patchset tries to prove.

IMHO we need to get to an API that can understand everything in a page
table. Having two APIs that are both disjoint is the problematic bit.

Improving the pud/pmd/etc API is a good direction

Nobody has explored it, but generalizing to a 'non-level' API could
also be a direction. 'non-level' means it works more like the huge API
where the level is not part of the function names but somehow the
level is encoded by the values/state/something.

This is appealing for things like page_walk where we have all these
per-level ops which are kind of pointless code duplication.

I've been doing some experiments on the iommu page table side on both
these directions and so far I haven't come to something that is really
great :\

> Of course, for hugetlb to gain the hability to operate on PUD/PMD, this
> means we need to add a fairly amount of code. e.g: for operating
> hugepages on PUD level, code for markers on PUD/PMD level for
> uffd/poison stuff (only dealt
> on pmd/pte atm AFAIK), swap functions for PUD (is_swap_pud for PUD), etc.
> Basically, almost all we did for PMD-* stuff we need it for PUD as well,
> and that will be around when THP gains support for PUD if it ever does
> (I guess that in a few years if memory capacity keeps increasing).

Right, this is the general pain of the mm's design is we have to
duplicate so much stuff N-wise for each level, even though in alot of
cases it isn't different for each level.

> I will keep working on this patchset not because of pagewalk savings,
> but because I think it will help us in have hugetlb more mm-core ready,
> since the current pagewalk has to test that a hugetlb page can be
> properly read on PUD/PMD/PTE level no matter what: uffd for hugetlb on PUD/PMD,
> hwpoison entries for swp on PUD/PMD, pud invalidating, etc.

Right, it would be nice if the page walk ops didn't have to touch huge
stuff at all. pagewalk ops, as they are today, should just work with
pud/pmd/pte normal functions in all cases.

Jason


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 00/45] hugetlb pagewalk unification
  2024-07-04 14:30   ` Peter Xu
  2024-07-04 15:23     ` David Hildenbrand
@ 2024-07-08 14:35     ` Jason Gunthorpe
  1 sibling, 0 replies; 66+ messages in thread
From: Jason Gunthorpe @ 2024-07-08 14:35 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, Oscar Salvador, Andrew Morton, linux-kernel,
	linux-mm, Muchun Song, SeongJae Park, Miaohe Lin, Michal Hocko,
	Matthew Wilcox, Christophe Leroy

On Thu, Jul 04, 2024 at 10:30:14AM -0400, Peter Xu wrote:
> Hey, David,
> 
> On Thu, Jul 04, 2024 at 12:44:38PM +0200, David Hildenbrand wrote:
> > There are roughly two categories of page table walkers we have:
> > 
> > 1) We actually only want to walk present folios (to be precise, page
> >    ranges of folios). We should look into moving away from the walk the
> >    page walker API where possible, and have something better that
> >    directly gives us the folio (page ranges). Any PTE batching would be
> >    done internally.

This seems like a good direction for some users as well to me.

If we can reduce the number of places touching the pud/pmd/pte APIs
that is a nice abstraction to reach toward.

It naturally would remove hugepte users too.

Jason


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 00/45] hugetlb pagewalk unification
  2024-07-08  8:18       ` Oscar Salvador
  2024-07-08 14:28         ` Jason Gunthorpe
@ 2024-07-10  3:52         ` David Hildenbrand
  2024-07-10 11:26           ` Oscar Salvador
  1 sibling, 1 reply; 66+ messages in thread
From: David Hildenbrand @ 2024-07-10  3:52 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Peter Xu, Andrew Morton, linux-kernel, linux-mm, Muchun Song,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Jason Gunthorpe

On 08.07.24 10:18, Oscar Salvador wrote:
> On Thu, Jul 04, 2024 at 05:23:30PM +0200, David Hildenbrand wrote:
>> My thinking was if "remove hugetlb_entry" cannot wait for "remove
>> page_walk", because we found a reasonable way to do it better and convert
>> the individual users. Maybe it can't.
>>
>> I've not given up hope that we can end up with something better and clearer
>> than the current page_walk API :)
> 
> Hi David,
> 

Hi!

> I agree that the current page_walk might be a bit convoluted, and that the
> indirect functions approach is a bit of a hassle.
> Having said that, let me clarify something.
> 
> Although this patchset touches the page_walk API wrt. getting rid of
> hugetlb special casing all over the place, my goal was not as focused on
> the page_walk as it was on the hugetlb code to gain hability to be
> interpreted on PUD/PMD level.

I understand that. And it would all be easier+more straight forward if 
we wouldn't have that hugetlb CONT-PTE / CONT-PMD stuff in there that 
works similar, but different to "ordinary" cont-pte for thp.

I'm sure you stumbled over the set_huge_pte_at() on arm64 for example. 
If we, at one point *don't* use these hugetlb functions right now to 
modify hugetlb entries, we might be in trouble.

That's why I think we should maybe invest our time and effort in having 
a new pagewalker that will just batch such things naturally, and users 
that can operate on that naturally. For example: a hugetlb 
cont-pte-mapped folio will just naturally be reported as a "fully mapped 
folio", just like a THP would be if mapped in a compatible way.

Yes, this requires more work, but as raised in some patches here, 
working on individual PTEs/PMDs for hugetlb is problematic.

You have to batch every operation, to essentially teach ordinary code to 
do what the hugetlb_* special code would have done on cont-pte/cont-pmd 
things.


(as a side note, cont-pte/cont-pmd should primarily be a hint from arch 
code on how many entries we can batch, like we do in folio_pte_batch(); 
point is that we want to batch also on architectures where we don't have 
such bits, and prepare for architectures that implement various sizes of 
batching; IMHO, having cont-pte/cont-pmd checks in common code is likely 
the wrong approach. Again, folio_pte_batch() is where we tackled the 
problem differently from the THP perspective)

> 
> One of the things, among other things, that helped in creating this
> mess/duplication we have wrt. hugetlb code vs mm core is that hugetlb
> __always__ operates on ptes, which means that we cannot rely on the mm
> core to do the right thing, and we need a bunch of hugetlb-pte functions
> that knows about their thing, so we lean on that.
> 
> IMHO, that was a mistake to start with, but I was not around when it was
> introduced and maybe there were good reasons to deal with that the way
> it is done.
> But, the thing is that my ultimate goal, is for hugetlb code to be able
> to deal with PUD/PMD (pte and cont-pte is already dealt with) just like
> mm core does for THP (PUD is not supported by THP, but you get me), and
> that is not that difficult to do, as this patchset tries to prove.
> 
> Of course, for hugetlb to gain the hability to operate on PUD/PMD, this
> means we need to add a fairly amount of code. e.g: for operating
> hugepages on PUD level, code for markers on PUD/PMD level for
> uffd/poison stuff (only dealt
> on pmd/pte atm AFAIK), swap functions for PUD (is_swap_pud for PUD), etc.
> Basically, almost all we did for PMD-* stuff we need it for PUD as well,
> and that will be around when THP gains support for PUD if it ever does
> (I guess that in a few years if memory capacity keeps increasing).
> 
> E.g: pud_to_swp_entry to detect that a swp entry is hwpoison with
>       is_hwpoison_entry
> 
> Yes, it is a hassle to have more code around, but IMO, this new code
> will help us in 1) move away from __always__ operate on ptes 2) ease
> integrate hugetlb code into mm core.
> 
> I will keep working on this patchset not because of pagewalk savings,
> but because I think it will help us in have hugetlb more mm-core ready,
> since the current pagewalk has to test that a hugetlb page can be
> properly read on PUD/PMD/PTE level no matter what: uffd for hugetlb on PUD/PMD,
> hwpoison entries for swp on PUD/PMD, pud invalidating, etc.
> 
> If that gets accomplished, I think that a fair amount of code that lives
> in hugetlb.c can be deleted/converted as less special casing will be needed.
> 
> I might be wrong and maybe I will hit a brick wall, but hopefully not.

I have an idea for a better page table walker API that would try 
batching most entries (under one PTL), and walkers can just register for 
the types they want. Hoping I will find some time to at least scetch the 
user interface soon.

That doesn't mean that this should block your work, but the 
cont-pte/cont/pmd hugetlb stuff is really nasty to handle here, and I 
don't particularly like where this is going.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 00/45] hugetlb pagewalk unification
  2024-07-10  3:52         ` David Hildenbrand
@ 2024-07-10 11:26           ` Oscar Salvador
  2024-07-11  0:15             ` David Hildenbrand
  0 siblings, 1 reply; 66+ messages in thread
From: Oscar Salvador @ 2024-07-10 11:26 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, Andrew Morton, linux-kernel, linux-mm, Muchun Song,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Jason Gunthorpe

On Wed, Jul 10, 2024 at 05:52:43AM +0200, David Hildenbrand wrote:
> I understand that. And it would all be easier+more straight forward if we
> wouldn't have that hugetlb CONT-PTE / CONT-PMD stuff in there that works
> similar, but different to "ordinary" cont-pte for thp.
> 
> I'm sure you stumbled over the set_huge_pte_at() on arm64 for example. If
> we, at one point *don't* use these hugetlb functions right now to modify
> hugetlb entries, we might be in trouble.
> 
> That's why I think we should maybe invest our time and effort in having a
> new pagewalker that will just batch such things naturally, and users that
> can operate on that naturally. For example: a hugetlb cont-pte-mapped folio
> will just naturally be reported as a "fully mapped folio", just like a THP
> would be if mapped in a compatible way.
> 
> Yes, this requires more work, but as raised in some patches here, working on
> individual PTEs/PMDs for hugetlb is problematic.
> 
> You have to batch every operation, to essentially teach ordinary code to do
> what the hugetlb_* special code would have done on cont-pte/cont-pmd things.
> 
> 
> (as a side note, cont-pte/cont-pmd should primarily be a hint from arch code
> on how many entries we can batch, like we do in folio_pte_batch(); point is
> that we want to batch also on architectures where we don't have such bits,
> and prepare for architectures that implement various sizes of batching;
> IMHO, having cont-pte/cont-pmd checks in common code is likely the wrong
> approach. Again, folio_pte_batch() is where we tackled the problem
> differently from the THP perspective)

I must say I did not check folio_pte_batch() and I am totally ignorant
of what/how it does things.
I will have a look.

> I have an idea for a better page table walker API that would try batching
> most entries (under one PTL), and walkers can just register for the types
> they want. Hoping I will find some time to at least scetch the user
> interface soon.
> 
> That doesn't mean that this should block your work, but the
> cont-pte/cont/pmd hugetlb stuff is really nasty to handle here, and I don't
> particularly like where this is going.

Ok, let me take a step back then.
Previous versions of that RFC did not handle cont-{pte-pmd} wide in the
open, so let me go back to the drawing board and come up with something
that does not fiddle with cont- stuff in that way.

I might post here a small diff just to see if we are on the same page.

As usual, thanks a lot for your comments David!


-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 00/45] hugetlb pagewalk unification
  2024-07-10 11:26           ` Oscar Salvador
@ 2024-07-11  0:15             ` David Hildenbrand
  2024-07-11  4:48               ` Oscar Salvador
  0 siblings, 1 reply; 66+ messages in thread
From: David Hildenbrand @ 2024-07-11  0:15 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Peter Xu, Andrew Morton, linux-kernel, linux-mm, Muchun Song,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Jason Gunthorpe, Ryan Roberts

>> (as a side note, cont-pte/cont-pmd should primarily be a hint from arch code
>> on how many entries we can batch, like we do in folio_pte_batch(); point is
>> that we want to batch also on architectures where we don't have such bits,
>> and prepare for architectures that implement various sizes of batching;
>> IMHO, having cont-pte/cont-pmd checks in common code is likely the wrong
>> approach. Again, folio_pte_batch() is where we tackled the problem
>> differently from the THP perspective)
> 
> I must say I did not check folio_pte_batch() and I am totally ignorant
> of what/how it does things.
> I will have a look.
> 
>> I have an idea for a better page table walker API that would try batching
>> most entries (under one PTL), and walkers can just register for the types
>> they want. Hoping I will find some time to at least scetch the user
>> interface soon.
>>
>> That doesn't mean that this should block your work, but the
>> cont-pte/cont/pmd hugetlb stuff is really nasty to handle here, and I don't
>> particularly like where this is going.
> 
> Ok, let me take a step back then.
> Previous versions of that RFC did not handle cont-{pte-pmd} wide in the
> open, so let me go back to the drawing board and come up with something
> that does not fiddle with cont- stuff in that way.
> 
> I might post here a small diff just to see if we are on the same page.
> 
> As usual, thanks a lot for your comments David!

Feel free to reach out to discuss ways forward. I think we should

(a) move to the automatic cont-pte setting as done for THPs via
     set_ptes().
(b) Batching PTE updates at all relevant places, so we get no change in
     behavior: cont-pte bit will remain set.
(c) Likely remove the use of cont-pte bits in hugetlb code for anything
     that is not a present folio (i.e., where automatic cont-pte bit
     setting would never set it). Migration entries might require
     thought (we can easily batch to achieve the same thing, but the
     behavior of hugetlb likely differs to the generic way of handling
     migration entries on multiple ptes: reference the folio vs.
     the respective subpages of the folio).

Then we are essentially replacing what the current hugetlb_ functions do 
by the common way it is being done for THP (which does not exist for 
cont-pmd yet ... ). The only real alternative is special casing hugetlb 
all over the place to still call the hugetlb_* functions.

:( it would all be easier without that hugetlb cont-pte/cont-pmd usage.

CCing Ryan so he's aware.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 00/45] hugetlb pagewalk unification
  2024-07-11  0:15             ` David Hildenbrand
@ 2024-07-11  4:48               ` Oscar Salvador
  2024-07-11  4:53                 ` David Hildenbrand
  0 siblings, 1 reply; 66+ messages in thread
From: Oscar Salvador @ 2024-07-11  4:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, Andrew Morton, linux-kernel, linux-mm, Muchun Song,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Jason Gunthorpe, Ryan Roberts

On Thu, Jul 11, 2024 at 02:15:38AM +0200, David Hildenbrand wrote:
> > > (as a side note, cont-pte/cont-pmd should primarily be a hint from arch code
> > > on how many entries we can batch, like we do in folio_pte_batch(); point is
> > > that we want to batch also on architectures where we don't have such bits,
> > > and prepare for architectures that implement various sizes of batching;
> > > IMHO, having cont-pte/cont-pmd checks in common code is likely the wrong
> > > approach. Again, folio_pte_batch() is where we tackled the problem
> > > differently from the THP perspective)
> > 
> > I must say I did not check folio_pte_batch() and I am totally ignorant
> > of what/how it does things.
> > I will have a look.
> > 
> > > I have an idea for a better page table walker API that would try batching
> > > most entries (under one PTL), and walkers can just register for the types
> > > they want. Hoping I will find some time to at least scetch the user
> > > interface soon.
> > > 
> > > That doesn't mean that this should block your work, but the
> > > cont-pte/cont/pmd hugetlb stuff is really nasty to handle here, and I don't
> > > particularly like where this is going.
> > 
> > Ok, let me take a step back then.
> > Previous versions of that RFC did not handle cont-{pte-pmd} wide in the
> > open, so let me go back to the drawing board and come up with something
> > that does not fiddle with cont- stuff in that way.
> > 
> > I might post here a small diff just to see if we are on the same page.
> > 
> > As usual, thanks a lot for your comments David!
> 
> Feel free to reach out to discuss ways forward. I think we should
> 
> (a) move to the automatic cont-pte setting as done for THPs via
>     set_ptes().
> (b) Batching PTE updates at all relevant places, so we get no change in
>     behavior: cont-pte bit will remain set.
> (c) Likely remove the use of cont-pte bits in hugetlb code for anything
>     that is not a present folio (i.e., where automatic cont-pte bit
>     setting would never set it). Migration entries might require
>     thought (we can easily batch to achieve the same thing, but the
>     behavior of hugetlb likely differs to the generic way of handling
>     migration entries on multiple ptes: reference the folio vs.
>     the respective subpages of the folio).

Uhm, I see, but I am bit confused.
Although related, this seems orthogonal to this series and more like for
a next-thing to do, right?

It is true that this series tries to handle cont-{pmd,pte} in the
pagewalk api for hugetlb vmas, but in order to raise less eye brows I
can come up with a way not to do that for now, so we do not fiddle with
cont-stuff in this series.


Or am I misunderstanding you?


-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 00/45] hugetlb pagewalk unification
  2024-07-11  4:48               ` Oscar Salvador
@ 2024-07-11  4:53                 ` David Hildenbrand
  0 siblings, 0 replies; 66+ messages in thread
From: David Hildenbrand @ 2024-07-11  4:53 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Peter Xu, Andrew Morton, linux-kernel, linux-mm, Muchun Song,
	SeongJae Park, Miaohe Lin, Michal Hocko, Matthew Wilcox,
	Christophe Leroy, Jason Gunthorpe, Ryan Roberts

On 11.07.24 06:48, Oscar Salvador wrote:
> On Thu, Jul 11, 2024 at 02:15:38AM +0200, David Hildenbrand wrote:
>>>> (as a side note, cont-pte/cont-pmd should primarily be a hint from arch code
>>>> on how many entries we can batch, like we do in folio_pte_batch(); point is
>>>> that we want to batch also on architectures where we don't have such bits,
>>>> and prepare for architectures that implement various sizes of batching;
>>>> IMHO, having cont-pte/cont-pmd checks in common code is likely the wrong
>>>> approach. Again, folio_pte_batch() is where we tackled the problem
>>>> differently from the THP perspective)
>>>
>>> I must say I did not check folio_pte_batch() and I am totally ignorant
>>> of what/how it does things.
>>> I will have a look.
>>>
>>>> I have an idea for a better page table walker API that would try batching
>>>> most entries (under one PTL), and walkers can just register for the types
>>>> they want. Hoping I will find some time to at least scetch the user
>>>> interface soon.
>>>>
>>>> That doesn't mean that this should block your work, but the
>>>> cont-pte/cont/pmd hugetlb stuff is really nasty to handle here, and I don't
>>>> particularly like where this is going.
>>>
>>> Ok, let me take a step back then.
>>> Previous versions of that RFC did not handle cont-{pte-pmd} wide in the
>>> open, so let me go back to the drawing board and come up with something
>>> that does not fiddle with cont- stuff in that way.
>>>
>>> I might post here a small diff just to see if we are on the same page.
>>>
>>> As usual, thanks a lot for your comments David!
>>
>> Feel free to reach out to discuss ways forward. I think we should
>>
>> (a) move to the automatic cont-pte setting as done for THPs via
>>      set_ptes().
>> (b) Batching PTE updates at all relevant places, so we get no change in
>>      behavior: cont-pte bit will remain set.
>> (c) Likely remove the use of cont-pte bits in hugetlb code for anything
>>      that is not a present folio (i.e., where automatic cont-pte bit
>>      setting would never set it). Migration entries might require
>>      thought (we can easily batch to achieve the same thing, but the
>>      behavior of hugetlb likely differs to the generic way of handling
>>      migration entries on multiple ptes: reference the folio vs.
>>      the respective subpages of the folio).
> 
> Uhm, I see, but I am bit confused.
> Although related, this seems orthogonal to this series and more like for
> a next-thing to do, right?

Well, yes and no. The thing is, that the cont-pte/cont-pmd stuff is not 
as easy to handle like the PMD/PUD stuff, and sorting that out sounds 
like some "pain". That's the ugly part of hugetlb, where it's simply ... 
quite different :(

> 
> It is true that this series tries to handle cont-{pmd,pte} in the
> pagewalk api for hugetlb vmas, but in order to raise less eye brows I
> can come up with a way not to do that for now, so we do not fiddle with
> cont-stuff in this series.
> 
> 
> Or am I misunderstanding you?

I can answer once I know more details about the approach you have in mind :)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2024-07-11  4:53 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-04  4:30 [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
2024-07-04  4:30 ` [PATCH 01/45] arch/x86: Drop own definition of pgd,p4d_leaf Oscar Salvador
2024-07-04  4:30 ` [PATCH 02/45] mm: Add {pmd,pud}_huge_lock helper Oscar Salvador
2024-07-04 15:02   ` Peter Xu
2024-07-04  4:30 ` [PATCH 03/45] mm/pagewalk: Move vma_pgtable_walk_begin and vma_pgtable_walk_end upfront Oscar Salvador
2024-07-04  4:30 ` [PATCH 04/45] mm/pagewalk: Only call pud_entry when we have a pud leaf Oscar Salvador
2024-07-04  4:30 ` [PATCH 05/45] mm/pagewalk: Enable walk_pmd_range to handle cont-pmds Oscar Salvador
2024-07-04 15:41   ` David Hildenbrand
2024-07-05 16:56   ` kernel test robot
2024-07-04  4:30 ` [PATCH 06/45] mm/pagewalk: Do not try to split non-thp pud or pmd leafs Oscar Salvador
2024-07-04  4:30 ` [PATCH 07/45] arch/s390: Enable __s390_enable_skey_pmd to handle hugetlb vmas Oscar Salvador
2024-07-04  4:30 ` [PATCH 08/45] fs/proc: Enable smaps_pmd_entry to handle PMD-mapped " Oscar Salvador
2024-07-04  4:30 ` [PATCH 09/45] mm: Implement pud-version functions for swap and vm_normal_page_pud Oscar Salvador
2024-07-04  4:30 ` [PATCH 10/45] fs/proc: Create smaps_pud_range to handle PUD-mapped hugetlb vmas Oscar Salvador
2024-07-04  4:30 ` [PATCH 11/45] fs/proc: Enable smaps_pte_entry to handle cont-pte mapped " Oscar Salvador
2024-07-04 10:30   ` David Hildenbrand
2024-07-04  4:30 ` [PATCH 12/45] fs/proc: Enable pagemap_pmd_range to handle " Oscar Salvador
2024-07-04  4:31 ` [PATCH 13/45] mm: Implement pud-version uffd functions Oscar Salvador
2024-07-05 15:48   ` kernel test robot
2024-07-05 15:48   ` kernel test robot
2024-07-04  4:31 ` [PATCH 14/45] fs/proc: Create pagemap_pud_range to handle PUD-mapped hugetlb vmas Oscar Salvador
2024-07-04  4:31 ` [PATCH 15/45] fs/proc: Adjust pte_to_pagemap_entry for " Oscar Salvador
2024-07-04  4:31 ` [PATCH 16/45] fs/proc: Enable pagemap_scan_pmd_entry to handle " Oscar Salvador
2024-07-04  4:31 ` [PATCH 17/45] mm: Implement pud-version for pud_mkinvalid and pudp_establish Oscar Salvador
2024-07-04  4:31 ` [PATCH 18/45] fs/proc: Create pagemap_scan_pud_entry to handle PUD-mapped hugetlb vmas Oscar Salvador
2024-07-04  4:31 ` [PATCH 19/45] fs/proc: Enable gather_pte_stats to handle " Oscar Salvador
2024-07-04  4:31 ` [PATCH 20/45] fs/proc: Enable gather_pte_stats to handle cont-pte mapped " Oscar Salvador
2024-07-04  4:31 ` [PATCH 21/45] fs/proc: Create gather_pud_stats to handle PUD-mapped hugetlb pages Oscar Salvador
2024-07-04  4:31 ` [PATCH 22/45] mm/mempolicy: Enable queue_folios_pmd to handle hugetlb vmas Oscar Salvador
2024-07-04  4:31 ` [PATCH 23/45] mm/mempolicy: Create queue_folios_pud to handle PUD-mapped " Oscar Salvador
2024-07-04  4:31 ` [PATCH 24/45] mm/memory_failure: Enable check_hwpoisoned_pmd_entry to handle " Oscar Salvador
2024-07-04  4:31 ` [PATCH 25/45] mm/memory-failure: Create check_hwpoisoned_pud_entry to handle PUD-mapped " Oscar Salvador
2024-07-04  4:31 ` [PATCH 26/45] mm/damon: Enable damon_young_pmd_entry to handle " Oscar Salvador
2024-07-04  4:31 ` [PATCH 27/45] mm/damon: Create damon_young_pud_entry to handle PUD-mapped " Oscar Salvador
2024-07-04  4:31 ` [PATCH 28/45] mm/damon: Enable damon_mkold_pmd_entry to handle " Oscar Salvador
2024-07-04 11:03   ` David Hildenbrand
2024-07-04  4:31 ` [PATCH 29/45] mm/damon: Create damon_mkold_pud_entry to handle PUD-mapped " Oscar Salvador
2024-07-04  4:31 ` [PATCH 30/45] mm,mincore: Enable mincore_pte_range to handle " Oscar Salvador
2024-07-04  4:31 ` [PATCH 31/45] mm/mincore: Create mincore_pud_range to handle PUD-mapped " Oscar Salvador
2024-07-04  4:31 ` [PATCH 32/45] mm/hmm: Enable hmm_vma_walk_pmd, to handle " Oscar Salvador
2024-07-04  4:31 ` [PATCH 33/45] mm/hmm: Enable hmm_vma_walk_pud to handle PUD-mapped " Oscar Salvador
2024-07-04  4:31 ` [PATCH 34/45] arch/powerpc: Skip hugetlb vmas in subpage_mark_vma_nohuge Oscar Salvador
2024-07-04  4:31 ` [PATCH 35/45] arch/s390: Skip hugetlb vmas in thp_split_mm Oscar Salvador
2024-07-04  4:31 ` [PATCH 36/45] fs/proc: Make clear_refs_test_walk skip hugetlb vmas Oscar Salvador
2024-07-04  4:31 ` [PATCH 37/45] mm/lock: Make mlock_test_walk " Oscar Salvador
2024-07-04  4:31 ` [PATCH 38/45] mm/madvise: Make swapin_test_walk " Oscar Salvador
2024-07-04  4:31 ` [PATCH 39/45] mm/madvise: Make madvise_cold_test_walk " Oscar Salvador
2024-07-04  4:31 ` [PATCH 40/45] mm/madvise: Make madvise_free_test_walk " Oscar Salvador
2024-07-04  4:31 ` [PATCH 41/45] mm/migrate_device: Make migrate_vma_test_walk " Oscar Salvador
2024-07-04  4:31 ` [PATCH 42/45] mm/memcontrol: Make mem_cgroup_move_test_walk " Oscar Salvador
2024-07-04  4:31 ` [PATCH 43/45] mm/memcontrol: Make mem_cgroup_count_test_walk " Oscar Salvador
2024-07-04  4:31 ` [PATCH 44/45] mm/hugetlb_vmemmap: Make vmemmap_test_walk " Oscar Salvador
2024-07-04  4:31 ` [PATCH 45/45] mm: Delete all hugetlb_entry entries Oscar Salvador
2024-07-04 10:13 ` [PATCH 00/45] hugetlb pagewalk unification Oscar Salvador
2024-07-04 10:44 ` David Hildenbrand
2024-07-04 14:30   ` Peter Xu
2024-07-04 15:23     ` David Hildenbrand
2024-07-04 16:43       ` Peter Xu
2024-07-08  8:18       ` Oscar Salvador
2024-07-08 14:28         ` Jason Gunthorpe
2024-07-10  3:52         ` David Hildenbrand
2024-07-10 11:26           ` Oscar Salvador
2024-07-11  0:15             ` David Hildenbrand
2024-07-11  4:48               ` Oscar Salvador
2024-07-11  4:53                 ` David Hildenbrand
2024-07-08 14:35     ` Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox