linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64)
@ 2024-05-17 18:59 Christophe Leroy
  2024-05-17 18:59 ` [RFC PATCH v2 01/20] mm: Provide pagesize to pmd_populate() Christophe Leroy
                   ` (21 more replies)
  0 siblings, 22 replies; 60+ messages in thread
From: Christophe Leroy @ 2024-05-17 18:59 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu, Oscar Salvador,
	Michael Ellerman, Nicholas Piggin
  Cc: Christophe Leroy, linux-kernel, linux-mm, linuxppc-dev

This is the continuation of the RFC v1 series "Reimplement huge pages
without hugepd on powerpc 8xx". It now get rid of hugepd completely
after handling also e500 and book3s/64

Unlike most architectures, powerpc 8xx HW requires a two-level
pagetable topology for all page sizes. So a leaf PMD-contig approach
is not feasible as such.

Possible sizes are 4k, 16k, 512k and 8M.

First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
must point to a single entry level-2 page table. Until now that was
done using hugepd. This series changes it to use standard page tables
where the entry is replicated 1024 times on each of the two pagetables
refered by the two associated PMD entries for that 8M page.

At the moment it has to look into each helper to know if the
hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
a lower size. I hope this can me handled by core-mm in the future.

For e500 and book3s/64 there are less constraints because it is not
tied to the HW assisted tablewalk like on 8xx, so it is easier to use
leaf PMDs (and PUDs).

On e500 the supported page sizes are 4M, 16M, 64M, 256M and 1G. All at
PMD level on e500/32 and mix of PMD and PUD for e500/64. We encode page
size with 4 available bits in PTE entries. On e300/32 PGD entries size
is increases to 64 bits in order to allow leaf-PMD entries because PTE
are 64 bits on e500.

On book3s/64 only the hash-4k mode is concerned. It supports 16M pages
as cont-PMD and 16G pages as cont-PUD. In other modes (radix-4k, radix-6k
and hash-64k) the sizes match with PMD and PUD sizes so that's just leaf
entries.

Christophe Leroy (20):
  mm: Provide pagesize to pmd_populate()
  mm: Provide page size to pte_alloc_huge()
  mm: Provide pmd to pte_leaf_size()
  mm: Provide mm_struct and address to huge_ptep_get()
  powerpc/mm: Allow hugepages without hugepd
  powerpc/8xx: Fix size given to set_huge_pte_at()
  powerpc/8xx: Rework support for 8M pages using contiguous PTE entries
  powerpc/8xx: Simplify struct mmu_psize_def
  powerpc/mm: Remove _PAGE_PSIZE
  powerpc/mm: Fix __find_linux_pte() on 32 bits with PMD leaf entries
  powerpc/mm: Complement huge_pte_alloc() for all non HUGEPD setups
  powerpc/64e: Remove unneeded #ifdef CONFIG_PPC_E500
  powerpc/64e: Clean up impossible setups
  powerpc/e500: Remove enc field from struct mmu_psize_def
  powerpc/85xx: Switch to 64 bits PGD
  powerpc/e500: Encode hugepage size in PTE bits
  powerpc/e500: Use contiguous PMD instead of hugepd
  powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD
  powerpc/mm: Remove hugepd leftovers
  mm: Remove CONFIG_ARCH_HAS_HUGEPD

 arch/arm/include/asm/hugetlb-3level.h         |   2 +-
 arch/arm64/include/asm/hugetlb.h              |   2 +-
 arch/arm64/include/asm/pgtable.h              |   2 +-
 arch/arm64/mm/hugetlbpage.c                   |   4 +-
 arch/parisc/mm/hugetlbpage.c                  |   2 +-
 arch/powerpc/Kconfig                          |   1 -
 arch/powerpc/include/asm/book3s/32/pgalloc.h  |   2 -
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |  15 -
 arch/powerpc/include/asm/book3s/64/hash.h     |  38 +-
 arch/powerpc/include/asm/book3s/64/hugetlb.h  |  38 --
 .../include/asm/book3s/64/pgtable-4k.h        |  34 --
 .../include/asm/book3s/64/pgtable-64k.h       |  20 -
 arch/powerpc/include/asm/hugetlb.h            |  26 +-
 .../include/asm/nohash/32/hugetlb-8xx.h       |  58 +--
 arch/powerpc/include/asm/nohash/32/mmu-8xx.h  |   9 +-
 arch/powerpc/include/asm/nohash/32/pgalloc.h  |   2 +
 arch/powerpc/include/asm/nohash/32/pte-40x.h  |   3 -
 arch/powerpc/include/asm/nohash/32/pte-44x.h  |   3 -
 arch/powerpc/include/asm/nohash/32/pte-85xx.h |   3 -
 arch/powerpc/include/asm/nohash/32/pte-8xx.h  |  64 ++-
 .../powerpc/include/asm/nohash/hugetlb-e500.h |  36 +-
 arch/powerpc/include/asm/nohash/mmu-e500.h    |   4 -
 arch/powerpc/include/asm/nohash/pgalloc.h     |   2 -
 arch/powerpc/include/asm/nohash/pgtable.h     |  45 +-
 arch/powerpc/include/asm/nohash/pte-e500.h    |  22 +-
 arch/powerpc/include/asm/page.h               |  32 --
 arch/powerpc/include/asm/pgtable-be-types.h   |  10 -
 arch/powerpc/include/asm/pgtable-types.h      |  13 +-
 arch/powerpc/include/asm/pgtable.h            |   3 +
 arch/powerpc/kernel/head_85xx.S               |  33 +-
 arch/powerpc/kernel/head_8xx.S                |  10 +-
 arch/powerpc/mm/book3s64/hash_utils.c         |  11 +-
 arch/powerpc/mm/book3s64/pgtable.c            |  12 -
 arch/powerpc/mm/hugetlbpage.c                 | 450 ++----------------
 arch/powerpc/mm/init-common.c                 |   8 +-
 arch/powerpc/mm/kasan/8xx.c                   |  15 +-
 arch/powerpc/mm/nohash/8xx.c                  |  46 +-
 arch/powerpc/mm/nohash/book3e_pgtable.c       |   4 +-
 arch/powerpc/mm/nohash/tlb.c                  | 172 ++-----
 arch/powerpc/mm/nohash/tlb_low_64e.S          | 257 ++--------
 arch/powerpc/mm/pgtable.c                     |  94 ++--
 arch/powerpc/mm/pgtable_32.c                  |   2 +-
 arch/riscv/include/asm/hugetlb.h              |   2 +-
 arch/riscv/include/asm/pgtable.h              |   2 +-
 arch/riscv/mm/hugetlbpage.c                   |   4 +-
 arch/s390/include/asm/hugetlb.h               |   2 +-
 arch/s390/mm/hugetlbpage.c                    |   2 +-
 arch/sh/mm/hugetlbpage.c                      |   2 +-
 arch/sparc/include/asm/pgtable_64.h           |   2 +-
 arch/sparc/mm/hugetlbpage.c                   |   4 +-
 fs/hugetlbfs/inode.c                          |   2 +-
 fs/proc/task_mmu.c                            |   8 +-
 fs/userfaultfd.c                              |   2 +-
 include/asm-generic/hugetlb.h                 |   2 +-
 include/linux/hugetlb.h                       |  10 +-
 include/linux/mm.h                            |  12 +-
 include/linux/pgtable.h                       |   2 +-
 include/linux/swapops.h                       |   2 +-
 kernel/events/core.c                          |   2 +-
 mm/Kconfig                                    |  10 -
 mm/damon/vaddr.c                              |   6 +-
 mm/filemap.c                                  |   2 +-
 mm/gup.c                                      | 105 +---
 mm/hmm.c                                      |   2 +-
 mm/hugetlb.c                                  |  46 +-
 mm/internal.h                                 |   2 +-
 mm/memory-failure.c                           |   2 +-
 mm/memory.c                                   |  19 +-
 mm/mempolicy.c                                |   2 +-
 mm/migrate.c                                  |   4 +-
 mm/mincore.c                                  |   2 +-
 mm/pagewalk.c                                 |  57 +--
 mm/pgalloc-track.h                            |   2 +-
 mm/userfaultfd.c                              |   6 +-
 74 files changed, 494 insertions(+), 1444 deletions(-)

-- 
2.44.0



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [RFC PATCH v2 01/20] mm: Provide pagesize to pmd_populate()
  2024-05-17 18:59 [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Christophe Leroy
@ 2024-05-17 18:59 ` Christophe Leroy
  2024-05-20  9:01   ` Oscar Salvador
  2024-05-17 18:59 ` [RFC PATCH v2 02/20] mm: Provide page size to pte_alloc_huge() Christophe Leroy
                   ` (20 subsequent siblings)
  21 siblings, 1 reply; 60+ messages in thread
From: Christophe Leroy @ 2024-05-17 18:59 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu, Oscar Salvador,
	Michael Ellerman, Nicholas Piggin
  Cc: Christophe Leroy, linux-kernel, linux-mm, linuxppc-dev

Unlike many architectures, powerpc 8xx hardware tablewalk requires
a two level process for all page sizes, allthough second level only
has one entry when pagesize is 8M.

To fit with Linux page table topology and without requiring special
page directory layout like hugepd, the page entry will be replicated
1024 times in the standard page table. However for large pages it is
necessary to set bits in the level-1 (PMD) entry. At the time being,
for 512k pages the flag is kept in the PTE and inserted in the PMD
entry at TLB miss exception, that is necessary because we can have
pages of different sizes in a page table. However the 12 PTE bits are
fully used and there is no room for an additional bit for page size.

For 8M pages, there will be only one page per PMD entry, it is
therefore possible to flag the pagesize in the PMD entry, with the
advantage that the information will already be at the right place for
the hardware.

To do so, add a new helper called pmd_populate_size() which takes the
page size as an additional argument, and modify __pte_alloc() to also
take that argument. pte_alloc() is left unmodified in order to
reduce churn on callers, and a pte_alloc_size() is added for use by
pte_alloc_huge().

When an architecture doesn't provide pmd_populate_size(),
pmd_populate() is used as a fallback.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 include/linux/mm.h | 12 +++++++-----
 mm/filemap.c       |  2 +-
 mm/internal.h      |  2 +-
 mm/memory.c        | 19 ++++++++++++-------
 mm/pgalloc-track.h |  2 +-
 mm/userfaultfd.c   |  4 ++--
 6 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b6bdaa18b9e9..158cb87bc604 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2803,8 +2803,8 @@ static inline void mm_inc_nr_ptes(struct mm_struct *mm) {}
 static inline void mm_dec_nr_ptes(struct mm_struct *mm) {}
 #endif
 
-int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);
-int __pte_alloc_kernel(pmd_t *pmd);
+int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long sz);
+int __pte_alloc_kernel(pmd_t *pmd, unsigned long sz);
 
 #if defined(CONFIG_MMU)
 
@@ -2989,7 +2989,8 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 	pte_unmap(pte);					\
 } while (0)
 
-#define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
+#define pte_alloc_size(mm, pmd, sz) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd, sz))
+#define pte_alloc(mm, pmd) pte_alloc_size(mm, pmd, PAGE_SIZE)
 
 #define pte_alloc_map(mm, pmd, address)			\
 	(pte_alloc(mm, pmd) ? NULL : pte_offset_map(pmd, address))
@@ -2998,9 +2999,10 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 	(pte_alloc(mm, pmd) ?			\
 		 NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
 
-#define pte_alloc_kernel(pmd, address)			\
-	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
+#define pte_alloc_kernel_size(pmd, address, sz)			\
+	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd, sz))? \
 		NULL: pte_offset_kernel(pmd, address))
+#define pte_alloc_kernel(pmd, address)	pte_alloc_kernel_size(pmd, address, PAGE_SIZE)
 
 #if USE_SPLIT_PMD_PTLOCKS
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 30de18c4fd28..5a783063d1f6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3428,7 +3428,7 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct folio *folio,
 	}
 
 	if (pmd_none(*vmf->pmd) && vmf->prealloc_pte)
-		pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
+		pmd_install(mm, vmf->pmd, &vmf->prealloc_pte, PAGE_SIZE);
 
 	return false;
 }
diff --git a/mm/internal.h b/mm/internal.h
index 07ad2675a88b..4a01bbf55264 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -206,7 +206,7 @@ void folio_activate(struct folio *folio);
 void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 		   struct vm_area_struct *start_vma, unsigned long floor,
 		   unsigned long ceiling, bool mm_wr_locked);
-void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
+void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte, unsigned long sz);
 
 struct zap_details;
 void unmap_page_range(struct mmu_gather *tlb,
diff --git a/mm/memory.c b/mm/memory.c
index d2155ced45f8..2a9eba13a95f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -409,7 +409,12 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
 	} while (vma);
 }
 
-void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
+#ifndef pmd_populate_size
+#define pmd_populate_size(mm, pmdp, pte, sz) pmd_populate(mm, pmdp, pte)
+#define pmd_populate_kernel_size(mm, pmdp, pte, sz) pmd_populate_kernel(mm, pmdp, pte)
+#endif
+
+void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte, unsigned long sz)
 {
 	spinlock_t *ptl = pmd_lock(mm, pmd);
 
@@ -429,25 +434,25 @@ void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
 		 * smp_rmb() barriers in page table walking code.
 		 */
 		smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
-		pmd_populate(mm, pmd, *pte);
+		pmd_populate_size(mm, pmd, *pte, sz);
 		*pte = NULL;
 	}
 	spin_unlock(ptl);
 }
 
-int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
+int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long sz)
 {
 	pgtable_t new = pte_alloc_one(mm);
 	if (!new)
 		return -ENOMEM;
 
-	pmd_install(mm, pmd, &new);
+	pmd_install(mm, pmd, &new, sz);
 	if (new)
 		pte_free(mm, new);
 	return 0;
 }
 
-int __pte_alloc_kernel(pmd_t *pmd)
+int __pte_alloc_kernel(pmd_t *pmd, unsigned long sz)
 {
 	pte_t *new = pte_alloc_one_kernel(&init_mm);
 	if (!new)
@@ -456,7 +461,7 @@ int __pte_alloc_kernel(pmd_t *pmd)
 	spin_lock(&init_mm.page_table_lock);
 	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		smp_wmb(); /* See comment in pmd_install() */
-		pmd_populate_kernel(&init_mm, pmd, new);
+		pmd_populate_kernel_size(&init_mm, pmd, new, sz);
 		new = NULL;
 	}
 	spin_unlock(&init_mm.page_table_lock);
@@ -4740,7 +4745,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 		}
 
 		if (vmf->prealloc_pte)
-			pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);
+			pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte, PAGE_SIZE);
 		else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
 			return VM_FAULT_OOM;
 	}
diff --git a/mm/pgalloc-track.h b/mm/pgalloc-track.h
index e9e879de8649..90e37de7ab77 100644
--- a/mm/pgalloc-track.h
+++ b/mm/pgalloc-track.h
@@ -45,7 +45,7 @@ static inline pmd_t *pmd_alloc_track(struct mm_struct *mm, pud_t *pud,
 
 #define pte_alloc_kernel_track(pmd, address, mask)			\
 	((unlikely(pmd_none(*(pmd))) &&					\
-	  (__pte_alloc_kernel(pmd) || ({*(mask)|=PGTBL_PMD_MODIFIED;0;})))?\
+	  (__pte_alloc_kernel(pmd, PAGE_SIZE) || ({*(mask)|=PGTBL_PMD_MODIFIED;0;})))?\
 		NULL: pte_offset_kernel(pmd, address))
 
 #endif /* _LINUX_PGALLOC_TRACK_H */
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 3c3539c573e7..0f129d5c5aa2 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -764,7 +764,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 			break;
 		}
 		if (unlikely(pmd_none(dst_pmdval)) &&
-		    unlikely(__pte_alloc(dst_mm, dst_pmd))) {
+		    unlikely(__pte_alloc(dst_mm, dst_pmd, PAGE_SIZE))) {
 			err = -ENOMEM;
 			break;
 		}
@@ -1687,7 +1687,7 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
 					err = -ENOENT;
 					break;
 				}
-				if (unlikely(__pte_alloc(mm, src_pmd))) {
+				if (unlikely(__pte_alloc(mm, src_pmd, PAGE_SIZE))) {
 					err = -ENOMEM;
 					break;
 				}
-- 
2.44.0



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [RFC PATCH v2 02/20] mm: Provide page size to pte_alloc_huge()
  2024-05-17 18:59 [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Christophe Leroy
  2024-05-17 18:59 ` [RFC PATCH v2 01/20] mm: Provide pagesize to pmd_populate() Christophe Leroy
@ 2024-05-17 18:59 ` Christophe Leroy
  2024-05-17 18:59 ` [RFC PATCH v2 03/20] mm: Provide pmd to pte_leaf_size() Christophe Leroy
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 60+ messages in thread
From: Christophe Leroy @ 2024-05-17 18:59 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu, Oscar Salvador,
	Michael Ellerman, Nicholas Piggin
  Cc: Christophe Leroy, linux-kernel, linux-mm, linuxppc-dev

In order to be able to flag the PMD entry with _PMD_HUGE_8M on
powerpc 8xx, provide page size to pte_alloc_huge() and use it
through the newly introduced pte_alloc_size().

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/arm64/mm/hugetlbpage.c   | 2 +-
 arch/parisc/mm/hugetlbpage.c  | 2 +-
 arch/powerpc/mm/hugetlbpage.c | 2 +-
 arch/riscv/mm/hugetlbpage.c   | 2 +-
 arch/sh/mm/hugetlbpage.c      | 2 +-
 arch/sparc/mm/hugetlbpage.c   | 2 +-
 include/linux/hugetlb.h       | 4 ++--
 7 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index b872b003a55f..aa7ded49f8cf 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -292,7 +292,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			return NULL;
 
 		WARN_ON(addr & (sz - 1));
-		ptep = pte_alloc_huge(mm, pmdp, addr);
+		ptep = pte_alloc_huge(mm, pmdp, addr, sz);
 	} else if (sz == PMD_SIZE) {
 		if (want_pmd_share(vma, addr) && pud_none(READ_ONCE(*pudp)))
 			ptep = huge_pmd_share(mm, vma, addr, pudp);
diff --git a/arch/parisc/mm/hugetlbpage.c b/arch/parisc/mm/hugetlbpage.c
index a9f7e21f6656..2f4c6b440710 100644
--- a/arch/parisc/mm/hugetlbpage.c
+++ b/arch/parisc/mm/hugetlbpage.c
@@ -66,7 +66,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (pud) {
 		pmd = pmd_alloc(mm, pud, addr);
 		if (pmd)
-			pte = pte_alloc_huge(mm, pmd, addr);
+			pte = pte_alloc_huge(mm, pmd, addr, sz);
 	}
 	return pte;
 }
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 594a4b7b2ca2..66ac56b26007 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -183,7 +183,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 		return NULL;
 
 	if (IS_ENABLED(CONFIG_PPC_8xx) && pshift < PMD_SHIFT)
-		return pte_alloc_huge(mm, (pmd_t *)hpdp, addr);
+		return pte_alloc_huge(mm, (pmd_t *)hpdp, addr, sz);
 
 	BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp));
 
diff --git a/arch/riscv/mm/hugetlbpage.c b/arch/riscv/mm/hugetlbpage.c
index 5ef2a6891158..dc77a58c6321 100644
--- a/arch/riscv/mm/hugetlbpage.c
+++ b/arch/riscv/mm/hugetlbpage.c
@@ -67,7 +67,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 
 	for_each_napot_order(order) {
 		if (napot_cont_size(order) == sz) {
-			pte = pte_alloc_huge(mm, pmd, addr & napot_cont_mask(order));
+			pte = pte_alloc_huge(mm, pmd, addr & napot_cont_mask(order), sz);
 			break;
 		}
 	}
diff --git a/arch/sh/mm/hugetlbpage.c b/arch/sh/mm/hugetlbpage.c
index 6cb0ad73dbb9..26579429e5ed 100644
--- a/arch/sh/mm/hugetlbpage.c
+++ b/arch/sh/mm/hugetlbpage.c
@@ -38,7 +38,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			if (pud) {
 				pmd = pmd_alloc(mm, pud, addr);
 				if (pmd)
-					pte = pte_alloc_huge(mm, pmd, addr);
+					pte = pte_alloc_huge(mm, pmd, addr, sz);
 			}
 		}
 	}
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index b432500c13a5..5a342199e837 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -298,7 +298,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 		return NULL;
 	if (sz >= PMD_SIZE)
 		return (pte_t *)pmd;
-	return pte_alloc_huge(mm, pmd, addr);
+	return pte_alloc_huge(mm, pmd, addr, sz);
 }
 
 pte_t *huge_pte_offset(struct mm_struct *mm,
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 77b30a8c6076..d9c5d9daadc5 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -193,9 +193,9 @@ static inline pte_t *pte_offset_huge(pmd_t *pmd, unsigned long address)
 	return pte_offset_kernel(pmd, address);
 }
 static inline pte_t *pte_alloc_huge(struct mm_struct *mm, pmd_t *pmd,
-				    unsigned long address)
+				    unsigned long address, unsigned long sz)
 {
-	return pte_alloc(mm, pmd) ? NULL : pte_offset_huge(pmd, address);
+	return pte_alloc_size(mm, pmd, sz) ? NULL : pte_offset_huge(pmd, address);
 }
 #endif
 
-- 
2.44.0



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [RFC PATCH v2 03/20] mm: Provide pmd to pte_leaf_size()
  2024-05-17 18:59 [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Christophe Leroy
  2024-05-17 18:59 ` [RFC PATCH v2 01/20] mm: Provide pagesize to pmd_populate() Christophe Leroy
  2024-05-17 18:59 ` [RFC PATCH v2 02/20] mm: Provide page size to pte_alloc_huge() Christophe Leroy
@ 2024-05-17 18:59 ` Christophe Leroy
  2024-05-21  9:39   ` Oscar Salvador
  2024-05-17 18:59 ` [RFC PATCH v2 04/20] mm: Provide mm_struct and address to huge_ptep_get() Christophe Leroy
                   ` (18 subsequent siblings)
  21 siblings, 1 reply; 60+ messages in thread
From: Christophe Leroy @ 2024-05-17 18:59 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu, Oscar Salvador,
	Michael Ellerman, Nicholas Piggin
  Cc: Christophe Leroy, linux-kernel, linux-mm, linuxppc-dev

On powerpc 8xx, when a page is 8M size, the information is in the PMD
entry. So provide it to pte_leaf_size().

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/arm64/include/asm/pgtable.h             | 2 +-
 arch/powerpc/include/asm/nohash/32/pte-8xx.h | 2 +-
 arch/riscv/include/asm/pgtable.h             | 2 +-
 arch/sparc/include/asm/pgtable_64.h          | 2 +-
 arch/sparc/mm/hugetlbpage.c                  | 2 +-
 include/linux/pgtable.h                      | 2 +-
 kernel/events/core.c                         | 2 +-
 7 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index afdd56d26ad7..57c40f2498ab 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -624,7 +624,7 @@ extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
 #define pmd_bad(pmd)		(!pmd_table(pmd))
 
 #define pmd_leaf_size(pmd)	(pmd_cont(pmd) ? CONT_PMD_SIZE : PMD_SIZE)
-#define pte_leaf_size(pte)	(pte_cont(pte) ? CONT_PTE_SIZE : PAGE_SIZE)
+#define pte_leaf_size(pmd, pte)	(pte_cont(pte) ? CONT_PTE_SIZE : PAGE_SIZE)
 
 #if defined(CONFIG_ARM64_64K_PAGES) || CONFIG_PGTABLE_LEVELS < 3
 static inline bool pud_sect(pud_t pud) { return false; }
diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
index 137dc3c84e45..07df6b664861 100644
--- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
@@ -151,7 +151,7 @@ static inline unsigned long pgd_leaf_size(pgd_t pgd)
 
 #define pgd_leaf_size pgd_leaf_size
 
-static inline unsigned long pte_leaf_size(pte_t pte)
+static inline unsigned long pte_leaf_size(pmd_t pmd, pte_t pte)
 {
 	pte_basic_t val = pte_val(pte);
 
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 6afd6bb4882e..9d9abe161a89 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -434,7 +434,7 @@ static inline pte_t pte_mkhuge(pte_t pte)
 }
 
 #ifdef CONFIG_RISCV_ISA_SVNAPOT
-#define pte_leaf_size(pte)	(pte_napot(pte) ?				\
+#define pte_leaf_size(pmd, pte)	(pte_napot(pte) ?				\
 					napot_cont_size(napot_cont_order(pte)) :\
 					PAGE_SIZE)
 #endif
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 4d1bafaba942..67063af2ff8f 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -1175,7 +1175,7 @@ extern unsigned long pud_leaf_size(pud_t pud);
 extern unsigned long pmd_leaf_size(pmd_t pmd);
 
 #define pte_leaf_size pte_leaf_size
-extern unsigned long pte_leaf_size(pte_t pte);
+extern unsigned long pte_leaf_size(pmd_t pmd, pte_t pte);
 
 #endif /* CONFIG_HUGETLB_PAGE */
 
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index 5a342199e837..60c845a15bee 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -276,7 +276,7 @@ static unsigned long huge_tte_to_size(pte_t pte)
 
 unsigned long pud_leaf_size(pud_t pud) { return 1UL << tte_to_shift(*(pte_t *)&pud); }
 unsigned long pmd_leaf_size(pmd_t pmd) { return 1UL << tte_to_shift(*(pte_t *)&pmd); }
-unsigned long pte_leaf_size(pte_t pte) { return 1UL << tte_to_shift(pte); }
+unsigned long pte_leaf_size(pmd_t pmd, pte_t pte) { return 1UL << tte_to_shift(pte); }
 
 pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long addr, unsigned long sz)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 85fc7554cd52..e605a4149fc7 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1802,7 +1802,7 @@ typedef unsigned int pgtbl_mod_mask;
 #define pmd_leaf_size(x) PMD_SIZE
 #endif
 #ifndef pte_leaf_size
-#define pte_leaf_size(x) PAGE_SIZE
+#define pte_leaf_size(x, y) PAGE_SIZE
 #endif
 
 /*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 724e6d7e128f..5c1c083222b2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7585,7 +7585,7 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
 
 	pte = ptep_get_lockless(ptep);
 	if (pte_present(pte))
-		size = pte_leaf_size(pte);
+		size = pte_leaf_size(pmd, pte);
 	pte_unmap(ptep);
 #endif /* CONFIG_HAVE_FAST_GUP */
 
-- 
2.44.0



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [RFC PATCH v2 04/20] mm: Provide mm_struct and address to huge_ptep_get()
  2024-05-17 18:59 [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Christophe Leroy
                   ` (2 preceding siblings ...)
  2024-05-17 18:59 ` [RFC PATCH v2 03/20] mm: Provide pmd to pte_leaf_size() Christophe Leroy
@ 2024-05-17 18:59 ` Christophe Leroy
  2024-05-17 18:59 ` [RFC PATCH v2 05/20] powerpc/mm: Allow hugepages without hugepd Christophe Leroy
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 60+ messages in thread
From: Christophe Leroy @ 2024-05-17 18:59 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu, Oscar Salvador,
	Michael Ellerman, Nicholas Piggin
  Cc: Christophe Leroy, linux-kernel, linux-mm, linuxppc-dev

On powerpc 8xx huge_ptep_get() will need to know whether the given
ptep is a PTE entry or a PMD entry. This cannot be known with the
PMD entry itself because there is no easy way to know it from the
content of the entry.

So huge_ptep_get() will need to know either the size of the page
or get the pmd.

In order to be consistent with huge_ptep_get_and_clear(), give
mm and address to huge_ptep_get().

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
v2: Add missing changes in arch implementations
---
 arch/arm/include/asm/hugetlb-3level.h |  2 +-
 arch/arm64/include/asm/hugetlb.h      |  2 +-
 arch/arm64/mm/hugetlbpage.c           |  2 +-
 arch/riscv/include/asm/hugetlb.h      |  2 +-
 arch/riscv/mm/hugetlbpage.c           |  2 +-
 arch/s390/include/asm/hugetlb.h       |  2 +-
 arch/s390/mm/hugetlbpage.c            |  2 +-
 fs/hugetlbfs/inode.c                  |  2 +-
 fs/proc/task_mmu.c                    |  8 ++---
 fs/userfaultfd.c                      |  2 +-
 include/asm-generic/hugetlb.h         |  2 +-
 include/linux/swapops.h               |  2 +-
 mm/damon/vaddr.c                      |  6 ++--
 mm/gup.c                              |  2 +-
 mm/hmm.c                              |  2 +-
 mm/hugetlb.c                          | 46 +++++++++++++--------------
 mm/memory-failure.c                   |  2 +-
 mm/mempolicy.c                        |  2 +-
 mm/migrate.c                          |  4 +--
 mm/mincore.c                          |  2 +-
 mm/userfaultfd.c                      |  2 +-
 21 files changed, 49 insertions(+), 49 deletions(-)

diff --git a/arch/arm/include/asm/hugetlb-3level.h b/arch/arm/include/asm/hugetlb-3level.h
index a30be5505793..470c45c22e80 100644
--- a/arch/arm/include/asm/hugetlb-3level.h
+++ b/arch/arm/include/asm/hugetlb-3level.h
@@ -18,7 +18,7 @@
  * (The valid bit is automatically cleared by set_pte_at for PROT_NONE ptes).
  */
 #define __HAVE_ARCH_HUGE_PTEP_GET
-static inline pte_t huge_ptep_get(pte_t *ptep)
+static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
 	pte_t retval = *ptep;
 	if (pte_val(retval))
diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
index 2ddc33d93b13..1af39a74e791 100644
--- a/arch/arm64/include/asm/hugetlb.h
+++ b/arch/arm64/include/asm/hugetlb.h
@@ -46,7 +46,7 @@ extern pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
 extern void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
 			   pte_t *ptep, unsigned long sz);
 #define __HAVE_ARCH_HUGE_PTEP_GET
-extern pte_t huge_ptep_get(pte_t *ptep);
+extern pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
 
 void __init arm64_hugetlb_cma_reserve(void);
 
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index aa7ded49f8cf..7c6a24d29b3f 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -141,7 +141,7 @@ static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
 	return contig_ptes;
 }
 
-pte_t huge_ptep_get(pte_t *ptep)
+pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
 	int ncontig, i;
 	size_t pgsize;
diff --git a/arch/riscv/include/asm/hugetlb.h b/arch/riscv/include/asm/hugetlb.h
index 22deb7a2a6ec..6321bca08740 100644
--- a/arch/riscv/include/asm/hugetlb.h
+++ b/arch/riscv/include/asm/hugetlb.h
@@ -44,7 +44,7 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 			       pte_t pte, int dirty);
 
 #define __HAVE_ARCH_HUGE_PTEP_GET
-pte_t huge_ptep_get(pte_t *ptep);
+pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
 
 pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags);
 #define arch_make_huge_pte arch_make_huge_pte
diff --git a/arch/riscv/mm/hugetlbpage.c b/arch/riscv/mm/hugetlbpage.c
index dc77a58c6321..56abd6213ca1 100644
--- a/arch/riscv/mm/hugetlbpage.c
+++ b/arch/riscv/mm/hugetlbpage.c
@@ -3,7 +3,7 @@
 #include <linux/err.h>
 
 #ifdef CONFIG_RISCV_ISA_SVNAPOT
-pte_t huge_ptep_get(pte_t *ptep)
+pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
 	unsigned long pte_num;
 	int i;
diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index deb198a61039..caabc01c1812 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -19,7 +19,7 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 		     pte_t *ptep, pte_t pte, unsigned long sz);
 void __set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 		     pte_t *ptep, pte_t pte);
-pte_t huge_ptep_get(pte_t *ptep);
+pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
 pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 			      unsigned long addr, pte_t *ptep);
 
diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
index dc3db86e13ff..ee7da593f36c 100644
--- a/arch/s390/mm/hugetlbpage.c
+++ b/arch/s390/mm/hugetlbpage.c
@@ -169,7 +169,7 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 	__set_huge_pte_at(mm, addr, ptep, pte);
 }
 
-pte_t huge_ptep_get(pte_t *ptep)
+pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
 	return __rste_to_pte(pte_val(*ptep));
 }
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 6502c7e776d1..ec3ec87d29e7 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -425,7 +425,7 @@ static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
 	if (!ptep)
 		return false;
 
-	pte = huge_ptep_get(ptep);
+	pte = huge_ptep_get(vma->vm_mm, addr, ptep);
 	if (huge_pte_none(pte) || !pte_present(pte))
 		return false;
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 102f48668c35..332ade5ae788 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1572,7 +1572,7 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
 	if (vma->vm_flags & VM_SOFTDIRTY)
 		flags |= PM_SOFT_DIRTY;
 
-	pte = huge_ptep_get(ptep);
+	pte = huge_ptep_get(walk->mm, addr, ptep);
 	if (pte_present(pte)) {
 		struct page *page = pte_page(pte);
 
@@ -2260,7 +2260,7 @@ static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
 	if (~p->arg.flags & PM_SCAN_WP_MATCHING) {
 		/* Go the short route when not write-protecting pages. */
 
-		pte = huge_ptep_get(ptep);
+		pte = huge_ptep_get(walk->mm, start, ptep);
 		categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
 
 		if (!pagemap_scan_is_interesting_page(categories, p))
@@ -2272,7 +2272,7 @@ static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);
 
-	pte = huge_ptep_get(ptep);
+	pte = huge_ptep_get(walk->mm, start, ptep);
 	categories = p->cur_vma_category | pagemap_hugetlb_category(pte);
 
 	if (!pagemap_scan_is_interesting_page(categories, p))
@@ -2667,7 +2667,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
 		unsigned long addr, unsigned long end, struct mm_walk *walk)
 {
-	pte_t huge_pte = huge_ptep_get(pte);
+	pte_t huge_pte = huge_ptep_get(walk->mm, addr, pte);
 	struct numa_maps *md;
 	struct page *page;
 
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 292f5fd50104..fa58e0b2820f 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -256,7 +256,7 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
 		goto out;
 
 	ret = false;
-	pte = huge_ptep_get(ptep);
+	pte = huge_ptep_get(vma->vm_mm, vmf->address, ptep);
 
 	/*
 	 * Lockless access: we're in a wait_event so it's ok if it
diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index 6dcf4d576970..594d5905f615 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -144,7 +144,7 @@ static inline int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 #endif
 
 #ifndef __HAVE_ARCH_HUGE_PTEP_GET
-static inline pte_t huge_ptep_get(pte_t *ptep)
+static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
 	return ptep_get(ptep);
 }
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index a5c560a2f8c2..44a9f786ee41 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -334,7 +334,7 @@ static inline bool is_migration_entry_dirty(swp_entry_t entry)
 
 extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 					unsigned long address);
-extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte);
+extern void migration_entry_wait_huge(struct vm_area_struct *vma, unsigned long addr, pte_t *pte);
 #else  /* CONFIG_MIGRATION */
 static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
 {
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 381559e4a1fa..58829baf8b5d 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -339,7 +339,7 @@ static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
 				struct vm_area_struct *vma, unsigned long addr)
 {
 	bool referenced = false;
-	pte_t entry = huge_ptep_get(pte);
+	pte_t entry = huge_ptep_get(mm, addr, pte);
 	struct folio *folio = pfn_folio(pte_pfn(entry));
 	unsigned long psize = huge_page_size(hstate_vma(vma));
 
@@ -373,7 +373,7 @@ static int damon_mkold_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	pte_t entry;
 
 	ptl = huge_pte_lock(h, walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	entry = huge_ptep_get(walk->mm, addr, pte);
 	if (!pte_present(entry))
 		goto out;
 
@@ -509,7 +509,7 @@ static int damon_young_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	pte_t entry;
 
 	ptl = huge_pte_lock(h, walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	entry = huge_ptep_get(walk->mm, addr, pte);
 	if (!pte_present(entry))
 		goto out;
 
diff --git a/mm/gup.c b/mm/gup.c
index 1611e73b1121..86b5105b82a1 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2812,7 +2812,7 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 	if (pte_end < end)
 		end = pte_end;
 
-	pte = huge_ptep_get(ptep);
+	pte = huge_ptep_get(NULL, addr, ptep);
 
 	if (!pte_access_permitted(pte, flags & FOLL_WRITE))
 		return 0;
diff --git a/mm/hmm.c b/mm/hmm.c
index 277ddcab4947..91a0b57fcb2e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -485,7 +485,7 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	pte_t entry;
 
 	ptl = huge_pte_lock(hstate_vma(vma), walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	entry = huge_ptep_get(walk->mm, addr, pte);
 
 	i = (start - range->start) >> PAGE_SHIFT;
 	pfn_req_flags = range->hmm_pfns[i];
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ce7be5c24442..e6196c7455d0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5321,7 +5321,7 @@ static void set_huge_ptep_writable(struct vm_area_struct *vma,
 {
 	pte_t entry;
 
-	entry = huge_pte_mkwrite(huge_pte_mkdirty(huge_ptep_get(ptep)));
+	entry = huge_pte_mkwrite(huge_pte_mkdirty(huge_ptep_get(vma->vm_mm, address, ptep)));
 	if (huge_ptep_set_access_flags(vma, address, ptep, entry, 1))
 		update_mmu_cache(vma, address, ptep);
 }
@@ -5429,7 +5429,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		dst_ptl = huge_pte_lock(h, dst, dst_pte);
 		src_ptl = huge_pte_lockptr(h, src, src_pte);
 		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
-		entry = huge_ptep_get(src_pte);
+		entry = huge_ptep_get(src_vma->vm_mm, addr, src_pte);
 again:
 		if (huge_pte_none(entry)) {
 			/*
@@ -5467,7 +5467,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				set_huge_pte_at(dst, addr, dst_pte,
 						make_pte_marker(marker), sz);
 		} else {
-			entry = huge_ptep_get(src_pte);
+			entry = huge_ptep_get(src_vma->vm_mm, addr, src_pte);
 			pte_folio = page_folio(pte_page(entry));
 			folio_get(pte_folio);
 
@@ -5509,7 +5509,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				dst_ptl = huge_pte_lock(h, dst, dst_pte);
 				src_ptl = huge_pte_lockptr(h, src, src_pte);
 				spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
-				entry = huge_ptep_get(src_pte);
+				entry = huge_ptep_get(src_vma->vm_mm, addr, src_pte);
 				if (!pte_same(src_pte_old, entry)) {
 					restore_reserve_on_error(h, dst_vma, addr,
 								new_folio);
@@ -5619,7 +5619,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 			new_addr |= last_addr_mask;
 			continue;
 		}
-		if (huge_pte_none(huge_ptep_get(src_pte)))
+		if (huge_pte_none(huge_ptep_get(mm, old_addr, src_pte)))
 			continue;
 
 		if (huge_pmd_unshare(mm, vma, old_addr, src_pte)) {
@@ -5692,7 +5692,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			continue;
 		}
 
-		pte = huge_ptep_get(ptep);
+		pte = huge_ptep_get(mm, address, ptep);
 		if (huge_pte_none(pte)) {
 			spin_unlock(ptl);
 			continue;
@@ -5929,7 +5929,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 		       struct vm_fault *vmf)
 {
 	const bool unshare = flags & FAULT_FLAG_UNSHARE;
-	pte_t pte = huge_ptep_get(ptep);
+	pte_t pte = huge_ptep_get(mm, address, ptep);
 	struct hstate *h = hstate_vma(vma);
 	struct folio *old_folio;
 	struct folio *new_folio;
@@ -6042,7 +6042,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 			spin_lock(ptl);
 			ptep = hugetlb_walk(vma, haddr, huge_page_size(h));
 			if (likely(ptep &&
-				   pte_same(huge_ptep_get(ptep), pte)))
+				   pte_same(huge_ptep_get(mm, haddr, ptep), pte)))
 				goto retry_avoidcopy;
 			/*
 			 * race occurs while re-acquiring page table
@@ -6080,7 +6080,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 	 */
 	spin_lock(ptl);
 	ptep = hugetlb_walk(vma, haddr, huge_page_size(h));
-	if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
+	if (likely(ptep && pte_same(huge_ptep_get(mm, haddr, ptep), pte))) {
 		pte_t newpte = make_huge_pte(vma, &new_folio->page, !unshare);
 
 		/* Break COW or unshare */
@@ -6180,14 +6180,14 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_fault *vmf,
  * Recheck pte with pgtable lock.  Returns true if pte didn't change, or
  * false if pte changed or is changing.
  */
-static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm,
+static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm, unsigned long addr,
 			       pte_t *ptep, pte_t old_pte)
 {
 	spinlock_t *ptl;
 	bool same;
 
 	ptl = huge_pte_lock(h, mm, ptep);
-	same = pte_same(huge_ptep_get(ptep), old_pte);
+	same = pte_same(huge_ptep_get(mm, addr, ptep), old_pte);
 	spin_unlock(ptl);
 
 	return same;
@@ -6252,7 +6252,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			 * never happen on the page after UFFDIO_COPY has
 			 * correctly installed the page and returned.
 			 */
-			if (!hugetlb_pte_stable(h, mm, ptep, old_pte)) {
+			if (!hugetlb_pte_stable(h, mm, haddr, ptep, old_pte)) {
 				ret = 0;
 				goto out;
 			}
@@ -6281,7 +6281,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			 * here.  Before returning error, get ptl and make
 			 * sure there really is no pte entry.
 			 */
-			if (hugetlb_pte_stable(h, mm, ptep, old_pte))
+			if (hugetlb_pte_stable(h, mm, haddr, ptep, old_pte))
 				ret = vmf_error(PTR_ERR(folio));
 			else
 				ret = 0;
@@ -6328,7 +6328,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			folio_unlock(folio);
 			folio_put(folio);
 			/* See comment in userfaultfd_missing() block above */
-			if (!hugetlb_pte_stable(h, mm, ptep, old_pte)) {
+			if (!hugetlb_pte_stable(h, mm, haddr, ptep, old_pte)) {
 				ret = 0;
 				goto out;
 			}
@@ -6355,7 +6355,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	ptl = huge_pte_lock(h, mm, ptep);
 	ret = 0;
 	/* If pte changed from under us, retry */
-	if (!pte_same(huge_ptep_get(ptep), old_pte))
+	if (!pte_same(huge_ptep_get(mm, address, ptep), old_pte))
 		goto backout;
 
 	if (anon_rmap)
@@ -6478,7 +6478,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		return VM_FAULT_OOM;
 	}
 
-	entry = huge_ptep_get(ptep);
+	entry = huge_ptep_get(mm, address, ptep);
 	if (huge_pte_none_mostly(entry)) {
 		if (is_pte_marker(entry)) {
 			pte_marker marker =
@@ -6519,7 +6519,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			 * be released there.
 			 */
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-			migration_entry_wait_huge(vma, ptep);
+			migration_entry_wait_huge(vma, haddr, ptep);
 			return 0;
 		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			ret = VM_FAULT_HWPOISON_LARGE |
@@ -6552,11 +6552,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	ptl = huge_pte_lock(h, mm, ptep);
 
 	/* Check for a racing update before calling hugetlb_wp() */
-	if (unlikely(!pte_same(entry, huge_ptep_get(ptep))))
+	if (unlikely(!pte_same(entry, huge_ptep_get(mm, address, ptep))))
 		goto out_ptl;
 
 	/* Handle userfault-wp first, before trying to lock more pages */
-	if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(ptep)) &&
+	if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(mm, address, ptep)) &&
 	    (flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
 		if (!userfaultfd_wp_async(vma)) {
 			spin_unlock(ptl);
@@ -6679,7 +6679,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
 		ptl = huge_pte_lock(h, dst_mm, dst_pte);
 
 		/* Don't overwrite any existing PTEs (even markers) */
-		if (!huge_pte_none(huge_ptep_get(dst_pte))) {
+		if (!huge_pte_none(huge_ptep_get(mm, dst_addr, dst_pte))) {
 			spin_unlock(ptl);
 			return -EEXIST;
 		}
@@ -6816,7 +6816,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
 	 * page backing it, then access the page.
 	 */
 	ret = -EEXIST;
-	if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
+	if (!huge_pte_none_mostly(huge_ptep_get(mm, dst_addr, dst_pte)))
 		goto out_release_unlock;
 
 	if (folio_in_pagecache)
@@ -6891,7 +6891,7 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
 		goto out_unlock;
 
 	ptl = huge_pte_lock(h, mm, pte);
-	entry = huge_ptep_get(pte);
+	entry = huge_ptep_get(mm, address, pte);
 	if (pte_present(entry)) {
 		page = pte_page(entry);
 
@@ -7008,7 +7008,7 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
 			address |= last_addr_mask;
 			continue;
 		}
-		pte = huge_ptep_get(ptep);
+		pte = huge_ptep_get(mm, address, ptep);
 		if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
 			/* Nothing to do. */
 		} else if (unlikely(is_hugetlb_entry_migration(pte))) {
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 9e62a00b46dd..629db978fca5 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -832,7 +832,7 @@ static int hwpoison_hugetlb_range(pte_t *ptep, unsigned long hmask,
 			    struct mm_walk *walk)
 {
 	struct hwpoison_walk *hwp = walk->private;
-	pte_t pte = huge_ptep_get(ptep);
+	pte_t pte = huge_ptep_get(walk->mm, addr, ptep);
 	struct hstate *h = hstate_vma(walk->vma);
 
 	return check_hwpoisoned_entry(pte, addr, huge_page_shift(h),
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0fe77738d971..50a79700f496 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -624,7 +624,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
 	pte_t entry;
 
 	ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	entry = huge_ptep_get(walk->mm, addr, pte);
 	if (!pte_present(entry)) {
 		if (unlikely(is_hugetlb_entry_migration(entry)))
 			qp->nr_failed++;
diff --git a/mm/migrate.c b/mm/migrate.c
index 73a052a382f1..87f7aedb8ee2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -338,14 +338,14 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
  *
  * This function will release the vma lock before returning.
  */
-void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *ptep)
+void migration_entry_wait_huge(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
 {
 	spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), vma->vm_mm, ptep);
 	pte_t pte;
 
 	hugetlb_vma_assert_locked(vma);
 	spin_lock(ptl);
-	pte = huge_ptep_get(ptep);
+	pte = huge_ptep_get(vma->vm_mm, addr, ptep);
 
 	if (unlikely(!is_hugetlb_entry_migration(pte))) {
 		spin_unlock(ptl);
diff --git a/mm/mincore.c b/mm/mincore.c
index dad3622cc963..b5735a4aaa7d 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -33,7 +33,7 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
 	 * Hugepages under user process are always in RAM and never
 	 * swapped out, but theoretically it needs to be checked.
 	 */
-	present = pte && !huge_pte_none_mostly(huge_ptep_get(pte));
+	present = pte && !huge_pte_none_mostly(huge_ptep_get(walk->mm, addr, pte));
 	for (; addr != end; vec++, addr += PAGE_SIZE)
 		*vec = present;
 	walk->private = vec;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0f129d5c5aa2..87526cf18830 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -555,7 +555,7 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 		}
 
 		if (!uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE) &&
-		    !huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
+		    !huge_pte_none_mostly(huge_ptep_get(dst_mm, dst_addr, dst_pte))) {
 			err = -EEXIST;
 			hugetlb_vma_unlock_read(dst_vma);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-- 
2.44.0



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [RFC PATCH v2 05/20] powerpc/mm: Allow hugepages without hugepd
  2024-05-17 18:59 [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Christophe Leroy
                   ` (3 preceding siblings ...)
  2024-05-17 18:59 ` [RFC PATCH v2 04/20] mm: Provide mm_struct and address to huge_ptep_get() Christophe Leroy
@ 2024-05-17 18:59 ` Christophe Leroy
  2024-05-17 19:00 ` [RFC PATCH v2 06/20] powerpc/8xx: Fix size given to set_huge_pte_at() Christophe Leroy
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 60+ messages in thread
From: Christophe Leroy @ 2024-05-17 18:59 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu, Oscar Salvador,
	Michael Ellerman, Nicholas Piggin
  Cc: Christophe Leroy, linux-kernel, linux-mm, linuxppc-dev

In preparation of implementing huge pages on powerpc 8xx
without hugepd, enclose hugepd related code inside an
ifdef CONFIG_ARCH_HAS_HUGEPD

This also allows removing some stubs.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/include/asm/book3s/32/pgalloc.h |  2 --
 arch/powerpc/include/asm/hugetlb.h           | 10 ++--------
 arch/powerpc/include/asm/nohash/pgtable.h    |  8 +++++---
 arch/powerpc/mm/hugetlbpage.c                | 13 +++++++++++++
 arch/powerpc/mm/pgtable.c                    |  2 ++
 5 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h b/arch/powerpc/include/asm/book3s/32/pgalloc.h
index dc5c039eb28e..dd4eb3063175 100644
--- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
@@ -47,8 +47,6 @@ static inline void pgtable_free(void *table, unsigned index_size)
 	}
 }
 
-#define get_hugepd_cache_index(x)  (x)
-
 static inline void pgtable_free_tlb(struct mmu_gather *tlb,
 				    void *table, int shift)
 {
diff --git a/arch/powerpc/include/asm/hugetlb.h b/arch/powerpc/include/asm/hugetlb.h
index ea71f7245a63..79176a499763 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -30,10 +30,12 @@ static inline int is_hugepage_only_range(struct mm_struct *mm,
 }
 #define is_hugepage_only_range is_hugepage_only_range
 
+#ifdef CONFIG_ARCH_HAS_HUGEPD
 #define __HAVE_ARCH_HUGETLB_FREE_PGD_RANGE
 void hugetlb_free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 			    unsigned long end, unsigned long floor,
 			    unsigned long ceiling);
+#endif
 
 #define __HAVE_ARCH_HUGE_PTEP_GET_AND_CLEAR
 static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
@@ -67,14 +69,6 @@ static inline void flush_hugetlb_page(struct vm_area_struct *vma,
 {
 }
 
-#define hugepd_shift(x) 0
-static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned long addr,
-				    unsigned pdshift)
-{
-	return NULL;
-}
-
-
 static inline void __init gigantic_hugetlb_cma_reserve(void)
 {
 }
diff --git a/arch/powerpc/include/asm/nohash/pgtable.h b/arch/powerpc/include/asm/nohash/pgtable.h
index 427db14292c9..ac3353f7f2ac 100644
--- a/arch/powerpc/include/asm/nohash/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/pgtable.h
@@ -340,7 +340,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
 
 #define pgprot_writecombine pgprot_noncached_wc
 
-#ifdef CONFIG_HUGETLB_PAGE
+#ifdef CONFIG_ARCH_HAS_HUGEPD
 static inline int hugepd_ok(hugepd_t hpd)
 {
 #ifdef CONFIG_PPC_8xx
@@ -351,6 +351,10 @@ static inline int hugepd_ok(hugepd_t hpd)
 #endif
 }
 
+#define is_hugepd(hpd)		(hugepd_ok(hpd))
+#endif
+
+#ifdef CONFIG_HUGETLB_PAGE
 static inline int pmd_huge(pmd_t pmd)
 {
 	return 0;
@@ -360,8 +364,6 @@ static inline int pud_huge(pud_t pud)
 {
 	return 0;
 }
-
-#define is_hugepd(hpd)		(hugepd_ok(hpd))
 #endif
 
 int map_kernel_page(unsigned long va, phys_addr_t pa, pgprot_t prot);
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 66ac56b26007..82495b8ea793 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -42,6 +42,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr, unsigned long s
 	return __find_linux_pte(mm->pgd, addr, NULL, NULL);
 }
 
+#ifdef CONFIG_ARCH_HAS_HUGEPD
 static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 			   unsigned long address, unsigned int pdshift,
 			   unsigned int pshift, spinlock_t *ptl)
@@ -193,6 +194,16 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	return hugepte_offset(*hpdp, addr, pdshift);
 }
+#else
+pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
+		      unsigned long addr, unsigned long sz)
+{
+	if (sz < PMD_SIZE)
+		return pte_alloc_huge(mm, pmd_off(mm, addr), addr, sz);
+
+	return NULL;
+}
+#endif
 
 #ifdef CONFIG_PPC_BOOK3S_64
 /*
@@ -248,6 +259,7 @@ int __init alloc_bootmem_huge_page(struct hstate *h, int nid)
 	return __alloc_bootmem_huge_page(h, nid);
 }
 
+#ifdef CONFIG_ARCH_HAS_HUGEPD
 #ifndef CONFIG_PPC_BOOK3S_64
 #define HUGEPD_FREELIST_SIZE \
 	((PAGE_SIZE - sizeof(struct hugepd_freelist)) / sizeof(pte_t))
@@ -505,6 +517,7 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 		}
 	} while (addr = next, addr != end);
 }
+#endif
 
 bool __init arch_hugetlb_valid_size(unsigned long size)
 {
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 9e7ba9c3851f..acdf64c9b93e 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -487,8 +487,10 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
 	if (!hpdp)
 		return NULL;
 
+#ifdef CONFIG_ARCH_HAS_HUGEPD
 	ret_pte = hugepte_offset(*hpdp, ea, pdshift);
 	pdshift = hugepd_shift(*hpdp);
+#endif
 out:
 	if (hpage_shift)
 		*hpage_shift = pdshift;
-- 
2.44.0



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [RFC PATCH v2 06/20] powerpc/8xx: Fix size given to set_huge_pte_at()
  2024-05-17 18:59 [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Christophe Leroy
                   ` (4 preceding siblings ...)
  2024-05-17 18:59 ` [RFC PATCH v2 05/20] powerpc/mm: Allow hugepages without hugepd Christophe Leroy
@ 2024-05-17 19:00 ` Christophe Leroy
  2024-05-20  9:14   ` Oscar Salvador
  2024-05-17 19:00 ` [RFC PATCH v2 07/20] powerpc/8xx: Rework support for 8M pages using contiguous PTE entries Christophe Leroy
                   ` (15 subsequent siblings)
  21 siblings, 1 reply; 60+ messages in thread
From: Christophe Leroy @ 2024-05-17 19:00 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu, Oscar Salvador,
	Michael Ellerman, Nicholas Piggin
  Cc: Christophe Leroy, linux-kernel, linux-mm, linuxppc-dev

set_huge_pte_at() expects the real page size, not the psize which is
the index of the page definition in table mmu_psize_defs[]

Fixes: 935d4f0c6dc8 ("mm: hugetlb: add huge page size param to set_huge_pte_at()")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/mm/nohash/8xx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c
index 43d4842bb1c7..d93433e26ded 100644
--- a/arch/powerpc/mm/nohash/8xx.c
+++ b/arch/powerpc/mm/nohash/8xx.c
@@ -94,7 +94,8 @@ static int __ref __early_map_kernel_hugepage(unsigned long va, phys_addr_t pa,
 		return -EINVAL;
 
 	set_huge_pte_at(&init_mm, va, ptep,
-			pte_mkhuge(pfn_pte(pa >> PAGE_SHIFT, prot)), psize);
+			pte_mkhuge(pfn_pte(pa >> PAGE_SHIFT, prot)),
+			1UL << mmu_psize_to_shift(psize));
 
 	return 0;
 }
-- 
2.44.0



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [RFC PATCH v2 07/20] powerpc/8xx: Rework support for 8M pages using contiguous PTE entries
  2024-05-17 18:59 [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Christophe Leroy
                   ` (5 preceding siblings ...)
  2024-05-17 19:00 ` [RFC PATCH v2 06/20] powerpc/8xx: Fix size given to set_huge_pte_at() Christophe Leroy
@ 2024-05-17 19:00 ` Christophe Leroy
  2024-05-24 10:02   ` Oscar Salvador
  2024-05-17 19:00 ` [RFC PATCH v2 08/20] powerpc/8xx: Simplify struct mmu_psize_def Christophe Leroy
                   ` (14 subsequent siblings)
  21 siblings, 1 reply; 60+ messages in thread
From: Christophe Leroy @ 2024-05-17 19:00 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu, Oscar Salvador,
	Michael Ellerman, Nicholas Piggin
  Cc: Christophe Leroy, linux-kernel, linux-mm, linuxppc-dev

In order to fit better with standard Linux page tables layout, add
support for 8M pages using contiguous PTE entries in a standard
page table. Page tables will then be populated with 1024 similar
entries and two PMD entries will point to that page table.

The PMD entries also get a flag to tell it is addressing an 8M page,
this is required for the HW tablewalk assistance.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/Kconfig                          |  1 -
 arch/powerpc/include/asm/hugetlb.h            | 11 +++-
 .../include/asm/nohash/32/hugetlb-8xx.h       | 54 ++++++++----------
 arch/powerpc/include/asm/nohash/32/pgalloc.h  |  2 +
 arch/powerpc/include/asm/nohash/32/pte-8xx.h  | 57 +++++++++++++------
 arch/powerpc/include/asm/nohash/pgtable.h     |  4 --
 arch/powerpc/include/asm/page.h               |  5 --
 arch/powerpc/include/asm/pgtable.h            |  3 +
 arch/powerpc/kernel/head_8xx.S                | 10 +---
 arch/powerpc/mm/hugetlbpage.c                 | 18 +++---
 arch/powerpc/mm/kasan/8xx.c                   | 15 +++--
 arch/powerpc/mm/nohash/8xx.c                  | 43 +++++++-------
 arch/powerpc/mm/pgtable.c                     | 24 +++++---
 arch/powerpc/mm/pgtable_32.c                  |  2 +-
 arch/powerpc/platforms/Kconfig.cputype        |  2 +
 15 files changed, 139 insertions(+), 112 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index a1a3b3363008..6a4ea7dad23f 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -135,7 +135,6 @@ config PPC
 	select ARCH_HAS_DMA_MAP_DIRECT 		if PPC_PSERIES
 	select ARCH_HAS_FORTIFY_SOURCE
 	select ARCH_HAS_GCOV_PROFILE_ALL
-	select ARCH_HAS_HUGEPD			if HUGETLB_PAGE
 	select ARCH_HAS_KCOV
 	select ARCH_HAS_MEMBARRIER_CALLBACKS
 	select ARCH_HAS_MEMBARRIER_SYNC_CORE
diff --git a/arch/powerpc/include/asm/hugetlb.h b/arch/powerpc/include/asm/hugetlb.h
index 79176a499763..36ed6d976cf9 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -41,7 +41,16 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 					    unsigned long addr, pte_t *ptep)
 {
-	return __pte(pte_update(mm, addr, ptep, ~0UL, 0, 1));
+	pmd_t *pmdp = (pmd_t *)ptep;
+	pte_t pte;
+
+	if (IS_ENABLED(CONFIG_PPC_8xx) && pmdp == pmd_off(mm, ALIGN_DOWN(addr, SZ_8M))) {
+		pte = __pte(pte_update(mm, addr, pte_offset_kernel(pmdp, 0), ~0UL, 0, 1));
+		pte_update(mm, addr, pte_offset_kernel(pmdp + 1, 0), ~0UL, 0, 1);
+	} else {
+		pte = __pte(pte_update(mm, addr, ptep, ~0UL, 0, 1));
+	}
+	return pte;
 }
 
 #define __HAVE_ARCH_HUGE_PTEP_CLEAR_FLUSH
diff --git a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
index 92df40c6cc6b..1414cfd28987 100644
--- a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
@@ -4,45 +4,25 @@
 
 #define PAGE_SHIFT_8M		23
 
-static inline pte_t *hugepd_page(hugepd_t hpd)
-{
-	BUG_ON(!hugepd_ok(hpd));
-
-	return (pte_t *)__va(hpd_val(hpd) & ~HUGEPD_SHIFT_MASK);
-}
-
-static inline unsigned int hugepd_shift(hugepd_t hpd)
-{
-	return PAGE_SHIFT_8M;
-}
-
-static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned long addr,
-				    unsigned int pdshift)
-{
-	unsigned long idx = (addr & (SZ_4M - 1)) >> PAGE_SHIFT;
-
-	return hugepd_page(hpd) + idx;
-}
-
 static inline void flush_hugetlb_page(struct vm_area_struct *vma,
 				      unsigned long vmaddr)
 {
 	flush_tlb_page(vma, vmaddr);
 }
 
-static inline void hugepd_populate(hugepd_t *hpdp, pte_t *new, unsigned int pshift)
+static inline int check_and_get_huge_psize(int shift)
 {
-	*hpdp = __hugepd(__pa(new) | _PMD_USER | _PMD_PRESENT | _PMD_PAGE_8M);
+	return shift_to_mmu_psize(shift);
 }
 
-static inline void hugepd_populate_kernel(hugepd_t *hpdp, pte_t *new, unsigned int pshift)
+#define __HAVE_ARCH_HUGE_PTEP_GET
+static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
-	*hpdp = __hugepd(__pa(new) | _PMD_PRESENT | _PMD_PAGE_8M);
-}
+	pmd_t *pmdp = (pmd_t *)ptep;
 
-static inline int check_and_get_huge_psize(int shift)
-{
-	return shift_to_mmu_psize(shift);
+	if (pmdp == pmd_off(mm, ALIGN_DOWN(addr, SZ_8M)))
+		ptep = pte_offset_kernel(pmdp, 0);
+	return ptep_get(ptep);
 }
 
 #define __HAVE_ARCH_HUGE_SET_HUGE_PTE_AT
@@ -53,7 +33,14 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
 static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
 				  pte_t *ptep, unsigned long sz)
 {
-	pte_update(mm, addr, ptep, ~0UL, 0, 1);
+	pmd_t *pmdp = (pmd_t *)ptep;
+
+	if (pmdp == pmd_off(mm, ALIGN_DOWN(addr, SZ_8M))) {
+		pte_update(mm, addr, pte_offset_kernel(pmdp, 0), ~0UL, 0, 1);
+		pte_update(mm, addr, pte_offset_kernel(pmdp + 1, 0), ~0UL, 0, 1);
+	} else {
+		pte_update(mm, addr, ptep, ~0UL, 0, 1);
+	}
 }
 
 #define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
@@ -63,7 +50,14 @@ static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
 	unsigned long clr = ~pte_val(pte_wrprotect(__pte(~0)));
 	unsigned long set = pte_val(pte_wrprotect(__pte(0)));
 
-	pte_update(mm, addr, ptep, clr, set, 1);
+	pmd_t *pmdp = (pmd_t *)ptep;
+
+	if (pmdp == pmd_off(mm, ALIGN_DOWN(addr, SZ_8M))) {
+		pte_update(mm, addr, pte_offset_kernel(pmdp, 0), clr, set, 1);
+		pte_update(mm, addr, pte_offset_kernel(pmdp + 1, 0), clr, set, 1);
+	} else {
+		pte_update(mm, addr, ptep, clr, set, 1);
+	}
 }
 
 #ifdef CONFIG_PPC_4K_PAGES
diff --git a/arch/powerpc/include/asm/nohash/32/pgalloc.h b/arch/powerpc/include/asm/nohash/32/pgalloc.h
index 11eac371e7e0..ff4f90cfb461 100644
--- a/arch/powerpc/include/asm/nohash/32/pgalloc.h
+++ b/arch/powerpc/include/asm/nohash/32/pgalloc.h
@@ -14,6 +14,7 @@
 #define __pmd_free_tlb(tlb,x,a)		do { } while (0)
 /* #define pgd_populate(mm, pmd, pte)      BUG() */
 
+#ifndef CONFIG_PPC_8xx
 static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmdp,
 				       pte_t *pte)
 {
@@ -31,5 +32,6 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmdp,
 	else
 		*pmdp = __pmd(__pa(pte_page) | _PMD_USER | _PMD_PRESENT);
 }
+#endif
 
 #endif /* _ASM_POWERPC_PGALLOC_32_H */
diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
index 07df6b664861..b05cc4f87713 100644
--- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
@@ -129,32 +129,34 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr,
 }
 #define ptep_set_wrprotect ptep_set_wrprotect
 
+static pmd_t *pmd_off(struct mm_struct *mm, unsigned long addr);
+static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address);
+
 static inline void __ptep_set_access_flags(struct vm_area_struct *vma, pte_t *ptep,
 					   pte_t entry, unsigned long address, int psize)
 {
 	unsigned long set = pte_val(entry) & (_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_EXEC);
 	unsigned long clr = ~pte_val(entry) & _PAGE_RO;
 	int huge = psize > mmu_virtual_psize ? 1 : 0;
+	pmd_t *pmdp = (pmd_t *)ptep;
 
-	pte_update(vma->vm_mm, address, ptep, clr, set, huge);
+	if (pmdp == pmd_off(vma->vm_mm, ALIGN_DOWN(address, SZ_8M))) {
+		pte_update(vma->vm_mm, address, pte_offset_kernel(pmdp, 0), clr, set, huge);
+		pte_update(vma->vm_mm, address, pte_offset_kernel(pmdp + 1, 0), clr, set, huge);
+	} else {
+		pte_update(vma->vm_mm, address, ptep, clr, set, huge);
+	}
 
 	flush_tlb_page(vma, address);
 }
 #define __ptep_set_access_flags __ptep_set_access_flags
 
-static inline unsigned long pgd_leaf_size(pgd_t pgd)
-{
-	if (pgd_val(pgd) & _PMD_PAGE_8M)
-		return SZ_8M;
-	return SZ_4M;
-}
-
-#define pgd_leaf_size pgd_leaf_size
-
 static inline unsigned long pte_leaf_size(pmd_t pmd, pte_t pte)
 {
 	pte_basic_t val = pte_val(pte);
 
+	if (pmd_val(pmd) & _PMD_PAGE_8M)
+		return SZ_8M;
 	if (val & _PAGE_HUGE)
 		return SZ_512K;
 	if (val & _PAGE_SPS)
@@ -168,17 +170,16 @@ static inline unsigned long pte_leaf_size(pmd_t pmd, pte_t pte)
  * On the 8xx, the page tables are a bit special. For 16k pages, we have
  * 4 identical entries. For 512k pages, we have 128 entries as if it was
  * 4k pages, but they are flagged as 512k pages for the hardware.
- * For other page sizes, we have a single entry in the table.
+ * For 8M pages, we have 1024 entries as if it was
+ * 4M pages, but they are flagged as 8M pages for the hardware.
+ * For 4k pages, we have a single entry in the table.
  */
-static pmd_t *pmd_off(struct mm_struct *mm, unsigned long addr);
-static int hugepd_ok(hugepd_t hpd);
-
 static inline int number_of_cells_per_pte(pmd_t *pmd, pte_basic_t val, int huge)
 {
 	if (!huge)
 		return PAGE_SIZE / SZ_4K;
-	else if (hugepd_ok(*((hugepd_t *)pmd)))
-		return 1;
+	else if ((pmd_val(*pmd) & _PMD_PAGE_MASK) == _PMD_PAGE_8M)
+		return SZ_4M / SZ_4K;
 	else if (IS_ENABLED(CONFIG_PPC_4K_PAGES) && !(val & _PAGE_HUGE))
 		return SZ_16K / SZ_4K;
 	else
@@ -198,7 +199,7 @@ static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, p
 
 	for (i = 0; i < num; i += PAGE_SIZE / SZ_4K, new += PAGE_SIZE) {
 		*entry++ = new;
-		if (IS_ENABLED(CONFIG_PPC_16K_PAGES) && num != 1) {
+		if (IS_ENABLED(CONFIG_PPC_16K_PAGES)) {
 			*entry++ = new;
 			*entry++ = new;
 			*entry++ = new;
@@ -221,6 +222,28 @@ static inline pte_t ptep_get(pte_t *ptep)
 }
 #endif /* CONFIG_PPC_16K_PAGES */
 
+static inline void pmd_populate_kernel_size(struct mm_struct *mm, pmd_t *pmdp,
+					    pte_t *pte, unsigned long sz)
+{
+	if (sz == SZ_8M)
+		*pmdp = __pmd(__pa(pte) | _PMD_PRESENT | _PMD_PAGE_8M);
+	else
+		*pmdp = __pmd(__pa(pte) | _PMD_PRESENT);
+}
+
+static inline void pmd_populate_size(struct mm_struct *mm, pmd_t *pmdp,
+				     pgtable_t pte_page, unsigned long sz)
+{
+	if (sz == SZ_8M)
+		*pmdp = __pmd(__pa(pte_page) | _PMD_USER | _PMD_PRESENT | _PMD_PAGE_8M);
+	else
+		*pmdp = __pmd(__pa(pte_page) | _PMD_USER | _PMD_PRESENT);
+}
+#define pmd_populate_size pmd_populate_size
+
+#define pmd_populate(mm, pmdp, pte) pmd_populate_size(mm, pmdp, pte, PAGE_SIZE)
+#define pmd_populate_kernel(mm, pmdp, pte) pmd_populate_kernel_size(mm, pmdp, pte, PAGE_SIZE)
+
 #endif
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/include/asm/nohash/pgtable.h b/arch/powerpc/include/asm/nohash/pgtable.h
index ac3353f7f2ac..c4be7754e96f 100644
--- a/arch/powerpc/include/asm/nohash/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/pgtable.h
@@ -343,12 +343,8 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static inline int hugepd_ok(hugepd_t hpd)
 {
-#ifdef CONFIG_PPC_8xx
-	return ((hpd_val(hpd) & _PMD_PAGE_MASK) == _PMD_PAGE_8M);
-#else
 	/* We clear the top bit to indicate hugepd */
 	return (hpd_val(hpd) && (hpd_val(hpd) & PD_HUGE) == 0);
-#endif
 }
 
 #define is_hugepd(hpd)		(hugepd_ok(hpd))
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index e411e5a70ea3..018c3d55232c 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -293,13 +293,8 @@ static inline const void *pfn_to_kaddr(unsigned long pfn)
 /*
  * Some number of bits at the level of the page table that points to
  * a hugepte are used to encode the size.  This masks those bits.
- * On 8xx, HW assistance requires 4k alignment for the hugepte.
  */
-#ifdef CONFIG_PPC_8xx
-#define HUGEPD_SHIFT_MASK     0xfff
-#else
 #define HUGEPD_SHIFT_MASK     0x3f
-#endif
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 239709a2f68e..264a6c09517a 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -106,6 +106,9 @@ unsigned long vmalloc_to_phys(void *vmalloc_addr);
 
 void pgtable_cache_add(unsigned int shift);
 
+#ifdef CONFIG_PPC32
+void __init *early_alloc_pgtable(unsigned long size);
+#endif
 pte_t *early_pte_alloc_kernel(pmd_t *pmdp, unsigned long va);
 
 #if defined(CONFIG_STRICT_KERNEL_RWX) || defined(CONFIG_PPC32)
diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 647b0b445e89..43919ae0bd11 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -415,14 +415,13 @@ FixupDAR:/* Entry point for dcbx workaround. */
 	oris	r11, r11, (swapper_pg_dir - PAGE_OFFSET)@ha
 3:
 	lwz	r11, (swapper_pg_dir-PAGE_OFFSET)@l(r11)	/* Get the level 1 entry */
+	rlwinm	r11, r11, 0, ~_PMD_PAGE_8M
 	mtspr	SPRN_MD_TWC, r11
-	mtcrf	0x01, r11
 	mfspr	r11, SPRN_MD_TWC
 	lwz	r11, 0(r11)	/* Get the pte */
-	bt	28,200f		/* bit 28 = Large page (8M) */
 	/* concat physical page address(r11) and page offset(r10) */
 	rlwimi	r11, r10, 0, 32 - PAGE_SHIFT, 31
-201:	lwz	r11,0(r11)
+	lwz	r11,0(r11)
 /* Check if it really is a dcbx instruction. */
 /* dcbt and dcbtst does not generate DTLB Misses/Errors,
  * no need to include them here */
@@ -441,11 +440,6 @@ FixupDAR:/* Entry point for dcbx workaround. */
 141:	mfspr	r10,SPRN_M_TW
 	b	DARFixed	/* Nope, go back to normal TLB processing */
 
-200:
-	/* concat physical page address(r11) and page offset(r10) */
-	rlwimi	r11, r10, 0, 32 - PAGE_SHIFT_8M, 31
-	b	201b
-
 144:	mfspr	r10, SPRN_DSISR
 	rlwinm	r10, r10,0,7,5	/* Clear store bit for buggy dcbst insn */
 	mtspr	SPRN_DSISR, r10
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 82495b8ea793..42b12e1ec851 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -183,9 +183,6 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!hpdp)
 		return NULL;
 
-	if (IS_ENABLED(CONFIG_PPC_8xx) && pshift < PMD_SHIFT)
-		return pte_alloc_huge(mm, (pmd_t *)hpdp, addr, sz);
-
 	BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp));
 
 	if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr,
@@ -198,10 +195,18 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long addr, unsigned long sz)
 {
+	pmd_t *pmd = pmd_off(mm, addr);
+
 	if (sz < PMD_SIZE)
-		return pte_alloc_huge(mm, pmd_off(mm, addr), addr, sz);
+		return pte_alloc_huge(mm, pmd, addr, sz);
 
-	return NULL;
+	if (sz != SZ_8M)
+		return NULL;
+	if (!pte_alloc_huge(mm, pmd, addr, sz))
+		return NULL;
+	if (!pte_alloc_huge(mm, pmd + 1, addr, sz))
+		return NULL;
+	return (pte_t *)pmd;
 }
 #endif
 
@@ -599,8 +604,7 @@ static int __init hugetlbpage_init(void)
 		if (pdshift > shift) {
 			if (!IS_ENABLED(CONFIG_PPC_8xx))
 				pgtable_cache_add(pdshift - shift);
-		} else if (IS_ENABLED(CONFIG_PPC_E500) ||
-			   IS_ENABLED(CONFIG_PPC_8xx)) {
+		} else if (IS_ENABLED(CONFIG_PPC_E500)) {
 			pgtable_cache_add(PTE_T_ORDER);
 		}
 
diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
index 2784224054f8..a4f33508cb6e 100644
--- a/arch/powerpc/mm/kasan/8xx.c
+++ b/arch/powerpc/mm/kasan/8xx.c
@@ -12,22 +12,25 @@ kasan_init_shadow_8M(unsigned long k_start, unsigned long k_end, void *block)
 	pmd_t *pmd = pmd_off_k(k_start);
 	unsigned long k_cur, k_next;
 
-	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) {
-		pte_basic_t *new;
+	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++, block += SZ_4M) {
+		pte_t *ptep;
+		int i;
 
 		k_next = pgd_addr_end(k_cur, k_end);
 		k_next = pgd_addr_end(k_next, k_end);
 		if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
 			continue;
 
-		new = memblock_alloc(sizeof(pte_basic_t), SZ_4K);
+		ptep = memblock_alloc(PTE_FRAG_SIZE, PTE_FRAG_SIZE);
 		if (!new)
 			return -ENOMEM;
 
-		*new = pte_val(pte_mkhuge(pfn_pte(PHYS_PFN(__pa(block)), PAGE_KERNEL)));
+		for (i = 0; i < PTRS_PER_PTE; i++) {
+			pte_t pte = pte_mkhuge(pfn_pte(PHYS_PFN(__pa(block + i * PAGE_SIZE)), PAGE_KERNEL));
 
-		hugepd_populate_kernel((hugepd_t *)pmd, (pte_t *)new, PAGE_SHIFT_8M);
-		hugepd_populate_kernel((hugepd_t *)pmd + 1, (pte_t *)new, PAGE_SHIFT_8M);
+			__set_pte_at(&init_mm, k_cur, ptep + i, pte, 1);
+		}
+		pmd_populate_kernel_size(&init_mm, pmd, ptep, SZ_8M);
 	}
 	return 0;
 }
diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c
index d93433e26ded..99f656b3f9f3 100644
--- a/arch/powerpc/mm/nohash/8xx.c
+++ b/arch/powerpc/mm/nohash/8xx.c
@@ -48,20 +48,6 @@ unsigned long p_block_mapped(phys_addr_t pa)
 	return 0;
 }
 
-static pte_t __init *early_hugepd_alloc_kernel(hugepd_t *pmdp, unsigned long va)
-{
-	if (hpd_val(*pmdp) == 0) {
-		pte_t *ptep = memblock_alloc(sizeof(pte_basic_t), SZ_4K);
-
-		if (!ptep)
-			return NULL;
-
-		hugepd_populate_kernel((hugepd_t *)pmdp, ptep, PAGE_SHIFT_8M);
-		hugepd_populate_kernel((hugepd_t *)pmdp + 1, ptep, PAGE_SHIFT_8M);
-	}
-	return hugepte_offset(*(hugepd_t *)pmdp, va, PGDIR_SHIFT);
-}
-
 static int __ref __early_map_kernel_hugepage(unsigned long va, phys_addr_t pa,
 					     pgprot_t prot, int psize, bool new)
 {
@@ -75,24 +61,33 @@ static int __ref __early_map_kernel_hugepage(unsigned long va, phys_addr_t pa,
 		if (WARN_ON(slab_is_available()))
 			return -EINVAL;
 
-		if (psize == MMU_PAGE_512K)
+		if (psize == MMU_PAGE_8M) {
+			if (WARN_ON(!pmd_none(*pmdp) || !pmd_none(*(pmdp + 1))))
+				return -EINVAL;
+
+			ptep = early_alloc_pgtable(PTE_FRAG_SIZE);
+			pmd_populate_kernel_size(&init_mm, pmdp, ptep, SZ_8M);
+
+			ptep = early_alloc_pgtable(PTE_FRAG_SIZE);
+			pmd_populate_kernel_size(&init_mm, pmdp + 1, ptep, SZ_8M);
+
+			ptep = (pte_t *)pmdp;
+		} else {
 			ptep = early_pte_alloc_kernel(pmdp, va);
-		else
-			ptep = early_hugepd_alloc_kernel((hugepd_t *)pmdp, va);
+			/* The PTE should never be already present */
+			if (WARN_ON(pte_present(*ptep) && pgprot_val(prot)))
+				return -EINVAL;
+		}
 	} else {
-		if (psize == MMU_PAGE_512K)
-			ptep = pte_offset_kernel(pmdp, va);
+		if (psize == MMU_PAGE_8M)
+			ptep = (pte_t *)pmdp;
 		else
-			ptep = hugepte_offset(*(hugepd_t *)pmdp, va, PGDIR_SHIFT);
+			ptep = pte_offset_kernel(pmdp, va);
 	}
 
 	if (WARN_ON(!ptep))
 		return -ENOMEM;
 
-	/* The PTE should never be already present */
-	if (new && WARN_ON(pte_present(*ptep) && pgprot_val(prot)))
-		return -EINVAL;
-
 	set_huge_pte_at(&init_mm, va, ptep,
 			pte_mkhuge(pfn_pte(pa >> PAGE_SHIFT, prot)),
 			1UL << mmu_psize_to_shift(psize));
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index acdf64c9b93e..59f0d7706d2f 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -297,11 +297,8 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
 }
 
 #if defined(CONFIG_PPC_8xx)
-void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
-		     pte_t pte, unsigned long sz)
+static void __set_huge_pte_at(pmd_t *pmd, pte_t *ptep, pte_basic_t val)
 {
-	pmd_t *pmd = pmd_off(mm, addr);
-	pte_basic_t val;
 	pte_basic_t *entry = (pte_basic_t *)ptep;
 	int num, i;
 
@@ -311,15 +308,26 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
 	 */
 	VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep));
 
-	pte = set_pte_filter(pte, addr);
-
-	val = pte_val(pte);
-
 	num = number_of_cells_per_pte(pmd, val, 1);
 
 	for (i = 0; i < num; i++, entry++, val += SZ_4K)
 		*entry = val;
 }
+
+void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
+		     pte_t pte, unsigned long sz)
+{
+	pmd_t *pmdp = pmd_off(mm, addr);
+
+	pte = set_pte_filter(pte, addr);
+
+	if (sz == SZ_8M) {
+		__set_huge_pte_at(pmdp, pte_offset_kernel(pmdp, 0), pte_val(pte));
+		__set_huge_pte_at(pmdp, pte_offset_kernel(pmdp + 1, 0), pte_val(pte) + SZ_4M);
+	} else {
+		__set_huge_pte_at(pmdp, ptep, pte_val(pte));
+	}
+}
 #endif
 #endif /* CONFIG_HUGETLB_PAGE */
 
diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
index cfd622ebf774..787b22206386 100644
--- a/arch/powerpc/mm/pgtable_32.c
+++ b/arch/powerpc/mm/pgtable_32.c
@@ -48,7 +48,7 @@ notrace void __init early_ioremap_init(void)
 	early_ioremap_setup();
 }
 
-static void __init *early_alloc_pgtable(unsigned long size)
+void __init *early_alloc_pgtable(unsigned long size)
 {
 	void *ptr = memblock_alloc(size, size);
 
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index b2d8c0da2ad9..fa4bb096b3ae 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -98,6 +98,7 @@ config PPC_BOOK3S_64
 	select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
 	select ARCH_ENABLE_SPLIT_PMD_PTLOCK
 	select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
+	select ARCH_HAS_HUGEPD if HUGETLB_PAGE
 	select ARCH_SUPPORTS_HUGETLBFS
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select HAVE_MOVE_PMD
@@ -290,6 +291,7 @@ config PPC_BOOK3S
 config PPC_E500
 	select FSL_EMB_PERFMON
 	bool
+	select ARCH_HAS_HUGEPD if HUGETLB_PAGE
 	select ARCH_SUPPORTS_HUGETLBFS if PHYS_64BIT || PPC64
 	select PPC_SMP_MUXED_IPI
 	select PPC_DOORBELL
-- 
2.44.0



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [RFC PATCH v2 08/20] powerpc/8xx: Simplify struct mmu_psize_def
  2024-05-17 18:59 [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Christophe Leroy
                   ` (6 preceding siblings ...)
  2024-05-17 19:00 ` [RFC PATCH v2 07/20] powerpc/8xx: Rework support for 8M pages using contiguous PTE entries Christophe Leroy
@ 2024-05-17 19:00 ` Christophe Leroy
  2024-05-25  3:36   ` Oscar Salvador
  2024-05-17 19:00 ` [RFC PATCH v2 09/20] powerpc/mm: Remove _PAGE_PSIZE Christophe Leroy
                   ` (13 subsequent siblings)
  21 siblings, 1 reply; 60+ messages in thread
From: Christophe Leroy @ 2024-05-17 19:00 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu, Oscar Salvador,
	Michael Ellerman, Nicholas Piggin
  Cc: Christophe Leroy, linux-kernel, linux-mm, linuxppc-dev

On 8xx, only the shift field is used in struct mmu_psize_def

Remove other fields and related macros.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/include/asm/nohash/32/mmu-8xx.h | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h
index 141d82e249a8..a756a1e59c54 100644
--- a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h
@@ -189,19 +189,14 @@ typedef struct {
 
 #define PHYS_IMMR_BASE (mfspr(SPRN_IMMR) & 0xfff80000)
 
-/* Page size definitions, common between 32 and 64-bit
+/*
+ * Page size definitions for 8xx
  *
  *    shift : is the "PAGE_SHIFT" value for that page size
- *    penc  : is the pte encoding mask
  *
  */
 struct mmu_psize_def {
 	unsigned int	shift;	/* number of bits */
-	unsigned int	enc;	/* PTE encoding */
-	unsigned int    ind;    /* Corresponding indirect page size shift */
-	unsigned int	flags;
-#define MMU_PAGE_SIZE_DIRECT	0x1	/* Supported as a direct size */
-#define MMU_PAGE_SIZE_INDIRECT	0x2	/* Supported as an indirect size */
 };
 
 extern struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT];
-- 
2.44.0



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [RFC PATCH v2 09/20] powerpc/mm: Remove _PAGE_PSIZE
  2024-05-17 18:59 [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Christophe Leroy
                   ` (7 preceding siblings ...)
  2024-05-17 19:00 ` [RFC PATCH v2 08/20] powerpc/8xx: Simplify struct mmu_psize_def Christophe Leroy
@ 2024-05-17 19:00 ` Christophe Leroy
  2024-05-25  3:40   ` Oscar Salvador
  2024-05-17 19:00 ` [RFC PATCH v2 10/20] powerpc/mm: Fix __find_linux_pte() on 32 bits with PMD leaf entries Christophe Leroy
                   ` (12 subsequent siblings)
  21 siblings, 1 reply; 60+ messages in thread
From: Christophe Leroy @ 2024-05-17 19:00 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu, Oscar Salvador,
	Michael Ellerman, Nicholas Piggin
  Cc: Christophe Leroy, linux-kernel, linux-mm, linuxppc-dev

_PAGE_PSIZE macro is never used outside the place it is defined
and is used only on 8xx and e500.

Remove indirection, remove it and use its content directly.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/include/asm/nohash/32/pte-40x.h  | 3 ---
 arch/powerpc/include/asm/nohash/32/pte-44x.h  | 3 ---
 arch/powerpc/include/asm/nohash/32/pte-85xx.h | 3 ---
 arch/powerpc/include/asm/nohash/32/pte-8xx.h  | 5 ++---
 arch/powerpc/include/asm/nohash/pte-e500.h    | 4 +---
 5 files changed, 3 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/32/pte-40x.h b/arch/powerpc/include/asm/nohash/32/pte-40x.h
index d759cfd74754..52ed58516fa4 100644
--- a/arch/powerpc/include/asm/nohash/32/pte-40x.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-40x.h
@@ -49,9 +49,6 @@
 #define _PAGE_EXEC	0x200	/* hardware: EX permission */
 #define _PAGE_ACCESSED	0x400	/* software: R: page referenced */
 
-/* No page size encoding in the linux PTE */
-#define _PAGE_PSIZE		0
-
 /* cache related flags non existing on 40x */
 #define _PAGE_COHERENT	0
 
diff --git a/arch/powerpc/include/asm/nohash/32/pte-44x.h b/arch/powerpc/include/asm/nohash/32/pte-44x.h
index 851813725237..da0469928273 100644
--- a/arch/powerpc/include/asm/nohash/32/pte-44x.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-44x.h
@@ -75,9 +75,6 @@
 #define _PAGE_NO_CACHE	0x00000400		/* H: I bit */
 #define _PAGE_WRITETHRU	0x00000800		/* H: W bit */
 
-/* No page size encoding in the linux PTE */
-#define _PAGE_PSIZE		0
-
 /* TODO: Add large page lowmem mapping support */
 #define _PMD_PRESENT	0
 #define _PMD_PRESENT_MASK (PAGE_MASK)
diff --git a/arch/powerpc/include/asm/nohash/32/pte-85xx.h b/arch/powerpc/include/asm/nohash/32/pte-85xx.h
index 653a342d3b25..14d64b4f3f14 100644
--- a/arch/powerpc/include/asm/nohash/32/pte-85xx.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-85xx.h
@@ -31,9 +31,6 @@
 #define _PAGE_WRITETHRU	0x00400	/* H: W bit */
 #define _PAGE_SPECIAL	0x00800 /* S: Special page */
 
-/* No page size encoding in the linux PTE */
-#define _PAGE_PSIZE		0
-
 #define _PMD_PRESENT	0
 #define _PMD_PRESENT_MASK (PAGE_MASK)
 #define _PMD_BAD	(~PAGE_MASK)
diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
index b05cc4f87713..e5bf0d29c7db 100644
--- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
@@ -74,12 +74,11 @@
 #define _PTE_NONE_MASK	0
 
 #ifdef CONFIG_PPC_16K_PAGES
-#define _PAGE_PSIZE	_PAGE_SPS
+#define _PAGE_BASE_NC	(_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_SPS)
 #else
-#define _PAGE_PSIZE		0
+#define _PAGE_BASE_NC	(_PAGE_PRESENT | _PAGE_ACCESSED)
 #endif
 
-#define _PAGE_BASE_NC	(_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_PSIZE)
 #define _PAGE_BASE	(_PAGE_BASE_NC)
 
 #include <asm/pgtable-masks.h>
diff --git a/arch/powerpc/include/asm/nohash/pte-e500.h b/arch/powerpc/include/asm/nohash/pte-e500.h
index f516f0b5b7a8..975facc7e38e 100644
--- a/arch/powerpc/include/asm/nohash/pte-e500.h
+++ b/arch/powerpc/include/asm/nohash/pte-e500.h
@@ -65,8 +65,6 @@
 
 #define _PAGE_SPECIAL	_PAGE_SW0
 
-/* Base page size */
-#define _PAGE_PSIZE	_PAGE_PSIZE_4K
 #define	PTE_RPN_SHIFT	(24)
 
 #define PTE_WIMGE_SHIFT (19)
@@ -89,7 +87,7 @@
  * pages. We always set _PAGE_COHERENT when SMP is enabled or
  * the processor might need it for DMA coherency.
  */
-#define _PAGE_BASE_NC	(_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_PSIZE)
+#define _PAGE_BASE_NC	(_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_PSIZE_4K)
 #if defined(CONFIG_SMP)
 #define _PAGE_BASE	(_PAGE_BASE_NC | _PAGE_COHERENT)
 #else
-- 
2.44.0



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [RFC PATCH v2 10/20] powerpc/mm: Fix __find_linux_pte() on 32 bits with PMD leaf entries
  2024-05-17 18:59 [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Christophe Leroy
                   ` (8 preceding siblings ...)
  2024-05-17 19:00 ` [RFC PATCH v2 09/20] powerpc/mm: Remove _PAGE_PSIZE Christophe Leroy
@ 2024-05-17 19:00 ` Christophe Leroy
  2024-05-25  4:12   ` Oscar Salvador
  2024-05-17 19:00 ` [RFC PATCH v2 11/20] powerpc/mm: Complement huge_pte_alloc() for all non HUGEPD setups Christophe Leroy
                   ` (11 subsequent siblings)
  21 siblings, 1 reply; 60+ messages in thread
From: Christophe Leroy @ 2024-05-17 19:00 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu, Oscar Salvador,
	Michael Ellerman, Nicholas Piggin
  Cc: Christophe Leroy, linux-kernel, linux-mm, linuxppc-dev

Building on 32 bits with pmd_leaf() not returning always false leads
to the following error:

  CC      arch/powerpc/mm/pgtable.o
arch/powerpc/mm/pgtable.c: In function '__find_linux_pte':
arch/powerpc/mm/pgtable.c:506:1: error: function may return address of local variable [-Werror=return-local-addr]
  506 | }
      | ^
arch/powerpc/mm/pgtable.c:394:15: note: declared here
  394 |         pud_t pud, *pudp;
      |               ^~~
arch/powerpc/mm/pgtable.c:394:15: note: declared here

This is due to pmd_offset() being a no-op in that case.

So rework it for powerpc/32 so that pXd_offset() are used on real
pointers and not on on-stack copies.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/mm/pgtable.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 59f0d7706d2f..51ee508eeb5b 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -390,8 +390,12 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
 			bool *is_thp, unsigned *hpage_shift)
 {
 	pgd_t *pgdp;
-	p4d_t p4d, *p4dp;
-	pud_t pud, *pudp;
+	p4d_t *p4dp;
+	pud_t *pudp;
+#ifdef CONFIG_PPC64
+	p4d_t p4d;
+	pud_t pud;
+#endif
 	pmd_t pmd, *pmdp;
 	pte_t *ret_pte;
 	hugepd_t *hpdp = NULL;
@@ -412,6 +416,7 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
 	 */
 	pgdp = pgdir + pgd_index(ea);
 	p4dp = p4d_offset(pgdp, ea);
+#ifdef CONFIG_PPC64
 	p4d  = READ_ONCE(*p4dp);
 	pdshift = P4D_SHIFT;
 
@@ -452,6 +457,11 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
 
 	pdshift = PMD_SHIFT;
 	pmdp = pmd_offset(&pud, ea);
+#else
+	p4dp = p4d_offset(pgdp, ea);
+	pudp = pud_offset(p4dp, ea);
+	pmdp = pmd_offset(pudp, ea);
+#endif
 	pmd  = READ_ONCE(*pmdp);
 
 	/*
-- 
2.44.0



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [RFC PATCH v2 11/20] powerpc/mm: Complement huge_pte_alloc() for all non HUGEPD setups
  2024-05-17 18:59 [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Christophe Leroy
                   ` (9 preceding siblings ...)
  2024-05-17 19:00 ` [RFC PATCH v2 10/20] powerpc/mm: Fix __find_linux_pte() on 32 bits with PMD leaf entries Christophe Leroy
@ 2024-05-17 19:00 ` Christophe Leroy
  2024-05-25  4:29   ` Oscar Salvador
  2024-05-17 19:00 ` [RFC PATCH v2 12/20] powerpc/64e: Remove unneeded #ifdef CONFIG_PPC_E500 Christophe Leroy
                   ` (10 subsequent siblings)
  21 siblings, 1 reply; 60+ messages in thread
From: Christophe Leroy @ 2024-05-17 19:00 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu, Oscar Salvador,
	Michael Ellerman, Nicholas Piggin
  Cc: Christophe Leroy, linux-kernel, linux-mm, linuxppc-dev

huge_pte_alloc() for non-HUGEPD targets is reserved for 8xx at the
moment. In order to convert other targets for non-HUGEPD, complement
huge_pte_alloc() to support any standard cont-PxD setup.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/mm/hugetlbpage.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 42b12e1ec851..f8aefa1e7363 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -195,11 +195,34 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long addr, unsigned long sz)
 {
-	pmd_t *pmd = pmd_off(mm, addr);
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	addr &= ~(sz - 1);
+	pgd = pgd_offset(mm, addr);
+
+	p4d = p4d_offset(pgd, addr);
+	if (sz >= PGDIR_SIZE)
+		return (pte_t *)p4d;
+
+	pud = pud_alloc(mm, p4d, addr);
+	if (!pud)
+		return NULL;
+	if (sz >= PUD_SIZE)
+		return (pte_t *)pud;
+
+	pmd = pmd_alloc(mm, pud, addr);
+	if (!pmd)
+		return NULL;
 
 	if (sz < PMD_SIZE)
 		return pte_alloc_huge(mm, pmd, addr, sz);
 
+	if (!IS_ENABLED(CONFIG_PPC_8xx))
+		return (pte_t *)pmd;
+
 	if (sz != SZ_8M)
 		return NULL;
 	if (!pte_alloc_huge(mm, pmd, addr, sz))
-- 
2.44.0



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [RFC PATCH v2 12/20] powerpc/64e: Remove unneeded #ifdef CONFIG_PPC_E500
  2024-05-17 18:59 [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Christophe Leroy
                   ` (10 preceding siblings ...)
  2024-05-17 19:00 ` [RFC PATCH v2 11/20] powerpc/mm: Complement huge_pte_alloc() for all non HUGEPD setups Christophe Leroy
@ 2024-05-17 19:00 ` Christophe Leroy
  2024-05-24  7:31   ` Michael Ellerman
  2024-05-17 19:00 ` [RFC PATCH v2 13/20] powerpc/64e: Clean up impossible setups Christophe Leroy
                   ` (9 subsequent siblings)
  21 siblings, 1 reply; 60+ messages in thread
From: Christophe Leroy @ 2024-05-17 19:00 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu, Oscar Salvador,
	Michael Ellerman, Nicholas Piggin
  Cc: Christophe Leroy, linux-kernel, linux-mm, linuxppc-dev

When it is a nohash/64 it can't be anything else than
CONFIG_PPC_E500 so remove the #ifdef as they are always true.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/mm/nohash/tlb.c | 13 -------------
 1 file changed, 13 deletions(-)

diff --git a/arch/powerpc/mm/nohash/tlb.c b/arch/powerpc/mm/nohash/tlb.c
index 5ffa0af4328a..d16f1ef7516c 100644
--- a/arch/powerpc/mm/nohash/tlb.c
+++ b/arch/powerpc/mm/nohash/tlb.c
@@ -403,8 +403,6 @@ static void __init setup_page_sizes(void)
 	unsigned int tlb0ps;
 	unsigned int eptcfg;
 	int i, psize;
-
-#ifdef CONFIG_PPC_E500
 	unsigned int mmucfg = mfspr(SPRN_MMUCFG);
 	int fsl_mmu = mmu_has_feature(MMU_FTR_TYPE_FSL_E);
 
@@ -470,7 +468,6 @@ static void __init setup_page_sizes(void)
 
 		goto out;
 	}
-#endif
 
 	tlb0cfg = mfspr(SPRN_TLB0CFG);
 	tlb0ps = mfspr(SPRN_TLB0PS);
@@ -547,13 +544,11 @@ static void __init setup_mmu_htw(void)
 		patch_exception(0x1c0, exc_data_tlb_miss_htw_book3e);
 		patch_exception(0x1e0, exc_instruction_tlb_miss_htw_book3e);
 		break;
-#ifdef CONFIG_PPC_E500
 	case PPC_HTW_E6500:
 		extlb_level_exc = EX_TLB_SIZE;
 		patch_exception(0x1c0, exc_data_tlb_miss_e6500_book3e);
 		patch_exception(0x1e0, exc_instruction_tlb_miss_e6500_book3e);
 		break;
-#endif
 	}
 	pr_info("MMU: Book3E HW tablewalk %s\n",
 		book3e_htw_mode != PPC_HTW_NONE ? "enabled" : "not supported");
@@ -590,7 +585,6 @@ static void early_init_this_mmu(void)
 	}
 	mtspr(SPRN_MAS4, mas4);
 
-#ifdef CONFIG_PPC_E500
 	if (mmu_has_feature(MMU_FTR_TYPE_FSL_E)) {
 		unsigned int num_cams;
 		bool map = true;
@@ -611,7 +605,6 @@ static void early_init_this_mmu(void)
 			linear_map_top = map_mem_in_cams(linear_map_top,
 							 num_cams, false, true);
 	}
-#endif
 
 	/* A sync won't hurt us after mucking around with
 	 * the MMU configuration
@@ -643,7 +636,6 @@ static void __init early_init_mmu_global(void)
 	/* Look for HW tablewalk support */
 	setup_mmu_htw();
 
-#ifdef CONFIG_PPC_E500
 	if (mmu_has_feature(MMU_FTR_TYPE_FSL_E)) {
 		if (book3e_htw_mode == PPC_HTW_NONE) {
 			extlb_level_exc = EX_TLB_SIZE;
@@ -652,7 +644,6 @@ static void __init early_init_mmu_global(void)
 				exc_instruction_tlb_miss_bolted_book3e);
 		}
 	}
-#endif
 
 	/* Set the global containing the top of the linear mapping
 	 * for use by the TLB miss code
@@ -664,7 +655,6 @@ static void __init early_init_mmu_global(void)
 
 static void __init early_mmu_set_memory_limit(void)
 {
-#ifdef CONFIG_PPC_E500
 	if (mmu_has_feature(MMU_FTR_TYPE_FSL_E)) {
 		/*
 		 * Limit memory so we dont have linear faults.
@@ -675,7 +665,6 @@ static void __init early_mmu_set_memory_limit(void)
 		 */
 		memblock_enforce_memory_limit(linear_map_top);
 	}
-#endif
 
 	memblock_set_current_limit(linear_map_top);
 }
@@ -713,7 +702,6 @@ void setup_initial_memory_limit(phys_addr_t first_memblock_base,
 	 * We crop it to the size of the first MEMBLOCK to
 	 * avoid going over total available memory just in case...
 	 */
-#ifdef CONFIG_PPC_E500
 	if (early_mmu_has_feature(MMU_FTR_TYPE_FSL_E)) {
 		unsigned long linear_sz;
 		unsigned int num_cams;
@@ -726,7 +714,6 @@ void setup_initial_memory_limit(phys_addr_t first_memblock_base,
 
 		ppc64_rma_size = min_t(u64, linear_sz, 0x40000000);
 	} else
-#endif
 		ppc64_rma_size = min_t(u64, first_memblock_size, 0x40000000);
 
 	/* Finally limit subsequent allocations */
-- 
2.44.0



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [RFC PATCH v2 13/20] powerpc/64e: Clean up impossible setups
  2024-05-17 18:59 [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Christophe Leroy
                   ` (11 preceding siblings ...)
  2024-05-17 19:00 ` [RFC PATCH v2 12/20] powerpc/64e: Remove unneeded #ifdef CONFIG_PPC_E500 Christophe Leroy
@ 2024-05-17 19:00 ` Christophe Leroy
  2024-05-17 19:00 ` [RFC PATCH v2 14/20] powerpc/e500: Remove enc field from struct mmu_psize_def Christophe Leroy
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 60+ messages in thread
From: Christophe Leroy @ 2024-05-17 19:00 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu, Oscar Salvador,
	Michael Ellerman, Nicholas Piggin
  Cc: Christophe Leroy, linux-kernel, linux-mm, linuxppc-dev

All E500 have MMU_FTR_TYPE_FSL_E.

So remove all impossible combinations.

This leads to removing PPC_HTW_IBM and related exceptions.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/include/asm/nohash/mmu-e500.h |   1 -
 arch/powerpc/mm/nohash/tlb.c               | 148 ++++------------
 arch/powerpc/mm/nohash/tlb_low_64e.S       | 194 ---------------------
 3 files changed, 36 insertions(+), 307 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/mmu-e500.h b/arch/powerpc/include/asm/nohash/mmu-e500.h
index 6ddced0415cb..792bfaafd70b 100644
--- a/arch/powerpc/include/asm/nohash/mmu-e500.h
+++ b/arch/powerpc/include/asm/nohash/mmu-e500.h
@@ -303,7 +303,6 @@ extern unsigned long linear_map_top;
 extern int book3e_htw_mode;
 
 #define PPC_HTW_NONE	0
-#define PPC_HTW_IBM	1
 #define PPC_HTW_E6500	2
 
 /*
diff --git a/arch/powerpc/mm/nohash/tlb.c b/arch/powerpc/mm/nohash/tlb.c
index d16f1ef7516c..1caccbf4c138 100644
--- a/arch/powerpc/mm/nohash/tlb.c
+++ b/arch/powerpc/mm/nohash/tlb.c
@@ -400,13 +400,11 @@ void tlb_flush_pgtable(struct mmu_gather *tlb, unsigned long address)
 static void __init setup_page_sizes(void)
 {
 	unsigned int tlb0cfg;
-	unsigned int tlb0ps;
 	unsigned int eptcfg;
-	int i, psize;
+	int psize;
 	unsigned int mmucfg = mfspr(SPRN_MMUCFG);
-	int fsl_mmu = mmu_has_feature(MMU_FTR_TYPE_FSL_E);
 
-	if (fsl_mmu && (mmucfg & MMUCFG_MAVN) == MMUCFG_MAVN_V1) {
+	if ((mmucfg & MMUCFG_MAVN) == MMUCFG_MAVN_V1) {
 		unsigned int tlb1cfg = mfspr(SPRN_TLB1CFG);
 		unsigned int min_pg, max_pg;
 
@@ -429,11 +427,7 @@ static void __init setup_page_sizes(void)
 			if ((shift >= min_pg) && (shift <= max_pg))
 				def->flags |= MMU_PAGE_SIZE_DIRECT;
 		}
-
-		goto out;
-	}
-
-	if (fsl_mmu && (mmucfg & MMUCFG_MAVN) == MMUCFG_MAVN_V2) {
+	} else if ((mmucfg & MMUCFG_MAVN) == MMUCFG_MAVN_V2) {
 		u32 tlb1cfg, tlb1ps;
 
 		tlb0cfg = mfspr(SPRN_TLB0CFG);
@@ -465,54 +459,8 @@ static void __init setup_page_sizes(void)
 					def->flags |= MMU_PAGE_SIZE_INDIRECT;
 			}
 		}
-
-		goto out;
-	}
-
-	tlb0cfg = mfspr(SPRN_TLB0CFG);
-	tlb0ps = mfspr(SPRN_TLB0PS);
-	eptcfg = mfspr(SPRN_EPTCFG);
-
-	/* Look for supported direct sizes */
-	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
-		struct mmu_psize_def *def = &mmu_psize_defs[psize];
-
-		if (tlb0ps & (1U << (def->shift - 10)))
-			def->flags |= MMU_PAGE_SIZE_DIRECT;
-	}
-
-	/* Indirect page sizes supported ? */
-	if ((tlb0cfg & TLBnCFG_IND) == 0 ||
-	    (tlb0cfg & TLBnCFG_PT) == 0)
-		goto out;
-
-	book3e_htw_mode = PPC_HTW_IBM;
-
-	/* Now, we only deal with one IND page size for each
-	 * direct size. Hopefully all implementations today are
-	 * unambiguous, but we might want to be careful in the
-	 * future.
-	 */
-	for (i = 0; i < 3; i++) {
-		unsigned int ps, sps;
-
-		sps = eptcfg & 0x1f;
-		eptcfg >>= 5;
-		ps = eptcfg & 0x1f;
-		eptcfg >>= 5;
-		if (!ps || !sps)
-			continue;
-		for (psize = 0; psize < MMU_PAGE_COUNT; psize++) {
-			struct mmu_psize_def *def = &mmu_psize_defs[psize];
-
-			if (ps == (def->shift - 10))
-				def->flags |= MMU_PAGE_SIZE_INDIRECT;
-			if (sps == (def->shift - 10))
-				def->ind = ps + 10;
-		}
 	}
 
-out:
 	/* Cleanup array and print summary */
 	pr_info("MMU: Supported page sizes\n");
 	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
@@ -540,10 +488,6 @@ static void __init setup_mmu_htw(void)
 	 */
 
 	switch (book3e_htw_mode) {
-	case PPC_HTW_IBM:
-		patch_exception(0x1c0, exc_data_tlb_miss_htw_book3e);
-		patch_exception(0x1e0, exc_instruction_tlb_miss_htw_book3e);
-		break;
 	case PPC_HTW_E6500:
 		extlb_level_exc = EX_TLB_SIZE;
 		patch_exception(0x1c0, exc_data_tlb_miss_e6500_book3e);
@@ -560,6 +504,8 @@ static void __init setup_mmu_htw(void)
 static void early_init_this_mmu(void)
 {
 	unsigned int mas4;
+	unsigned int num_cams;
+	bool map = true;
 
 	/* Set MAS4 based on page table setting */
 
@@ -572,12 +518,6 @@ static void early_init_this_mmu(void)
 		mmu_pte_psize = MMU_PAGE_2M;
 		break;
 
-	case PPC_HTW_IBM:
-		mas4 |= MAS4_INDD;
-		mas4 |=	BOOK3E_PAGESZ_1M << MAS4_TSIZED_SHIFT;
-		mmu_pte_psize = MMU_PAGE_1M;
-		break;
-
 	case PPC_HTW_NONE:
 		mas4 |=	BOOK3E_PAGESZ_4K << MAS4_TSIZED_SHIFT;
 		mmu_pte_psize = mmu_virtual_psize;
@@ -585,26 +525,21 @@ static void early_init_this_mmu(void)
 	}
 	mtspr(SPRN_MAS4, mas4);
 
-	if (mmu_has_feature(MMU_FTR_TYPE_FSL_E)) {
-		unsigned int num_cams;
-		bool map = true;
-
-		/* use a quarter of the TLBCAM for bolted linear map */
-		num_cams = (mfspr(SPRN_TLB1CFG) & TLBnCFG_N_ENTRY) / 4;
+	/* use a quarter of the TLBCAM for bolted linear map */
+	num_cams = (mfspr(SPRN_TLB1CFG) & TLBnCFG_N_ENTRY) / 4;
 
-		/*
-		 * Only do the mapping once per core, or else the
-		 * transient mapping would cause problems.
-		 */
+	/*
+	 * Only do the mapping once per core, or else the
+	 * transient mapping would cause problems.
+	 */
 #ifdef CONFIG_SMP
-		if (hweight32(get_tensr()) > 1)
-			map = false;
+	if (hweight32(get_tensr()) > 1)
+		map = false;
 #endif
 
-		if (map)
-			linear_map_top = map_mem_in_cams(linear_map_top,
-							 num_cams, false, true);
-	}
+	if (map)
+		linear_map_top = map_mem_in_cams(linear_map_top,
+						 num_cams, false, true);
 
 	/* A sync won't hurt us after mucking around with
 	 * the MMU configuration
@@ -620,10 +555,7 @@ static void __init early_init_mmu_global(void)
 	 *
 	 * Freescale booke only supports 4K pages in TLB0, so use that.
 	 */
-	if (mmu_has_feature(MMU_FTR_TYPE_FSL_E))
-		mmu_vmemmap_psize = MMU_PAGE_4K;
-	else
-		mmu_vmemmap_psize = MMU_PAGE_16M;
+	mmu_vmemmap_psize = MMU_PAGE_4K;
 
 	/* XXX This code only checks for TLB 0 capabilities and doesn't
 	 *     check what page size combos are supported by the HW. It
@@ -636,13 +568,10 @@ static void __init early_init_mmu_global(void)
 	/* Look for HW tablewalk support */
 	setup_mmu_htw();
 
-	if (mmu_has_feature(MMU_FTR_TYPE_FSL_E)) {
-		if (book3e_htw_mode == PPC_HTW_NONE) {
-			extlb_level_exc = EX_TLB_SIZE;
-			patch_exception(0x1c0, exc_data_tlb_miss_bolted_book3e);
-			patch_exception(0x1e0,
-				exc_instruction_tlb_miss_bolted_book3e);
-		}
+	if (book3e_htw_mode == PPC_HTW_NONE) {
+		extlb_level_exc = EX_TLB_SIZE;
+		patch_exception(0x1c0, exc_data_tlb_miss_bolted_book3e);
+		patch_exception(0x1e0, exc_instruction_tlb_miss_bolted_book3e);
 	}
 
 	/* Set the global containing the top of the linear mapping
@@ -655,16 +584,14 @@ static void __init early_init_mmu_global(void)
 
 static void __init early_mmu_set_memory_limit(void)
 {
-	if (mmu_has_feature(MMU_FTR_TYPE_FSL_E)) {
-		/*
-		 * Limit memory so we dont have linear faults.
-		 * Unlike memblock_set_current_limit, which limits
-		 * memory available during early boot, this permanently
-		 * reduces the memory available to Linux.  We need to
-		 * do this because highmem is not supported on 64-bit.
-		 */
-		memblock_enforce_memory_limit(linear_map_top);
-	}
+	/*
+	 * Limit memory so we dont have linear faults.
+	 * Unlike memblock_set_current_limit, which limits
+	 * memory available during early boot, this permanently
+	 * reduces the memory available to Linux.  We need to
+	 * do this because highmem is not supported on 64-bit.
+	 */
+	memblock_enforce_memory_limit(linear_map_top);
 
 	memblock_set_current_limit(linear_map_top);
 }
@@ -702,19 +629,16 @@ void setup_initial_memory_limit(phys_addr_t first_memblock_base,
 	 * We crop it to the size of the first MEMBLOCK to
 	 * avoid going over total available memory just in case...
 	 */
-	if (early_mmu_has_feature(MMU_FTR_TYPE_FSL_E)) {
-		unsigned long linear_sz;
-		unsigned int num_cams;
+	unsigned long linear_sz;
+	unsigned int num_cams;
 
-		/* use a quarter of the TLBCAM for bolted linear map */
-		num_cams = (mfspr(SPRN_TLB1CFG) & TLBnCFG_N_ENTRY) / 4;
+	/* use a quarter of the TLBCAM for bolted linear map */
+	num_cams = (mfspr(SPRN_TLB1CFG) & TLBnCFG_N_ENTRY) / 4;
 
-		linear_sz = map_mem_in_cams(first_memblock_size, num_cams,
-					    true, true);
+	linear_sz = map_mem_in_cams(first_memblock_size, num_cams,
+				    true, true);
 
-		ppc64_rma_size = min_t(u64, linear_sz, 0x40000000);
-	} else
-		ppc64_rma_size = min_t(u64, first_memblock_size, 0x40000000);
+	ppc64_rma_size = min_t(u64, linear_sz, 0x40000000);
 
 	/* Finally limit subsequent allocations */
 	memblock_set_current_limit(first_memblock_base + ppc64_rma_size);
diff --git a/arch/powerpc/mm/nohash/tlb_low_64e.S b/arch/powerpc/mm/nohash/tlb_low_64e.S
index 7e0b8fe1c279..93ecb8ec82b0 100644
--- a/arch/powerpc/mm/nohash/tlb_low_64e.S
+++ b/arch/powerpc/mm/nohash/tlb_low_64e.S
@@ -894,200 +894,6 @@ virt_page_table_tlb_miss_whacko_fault:
 	b	exc_data_storage_book3e
 
 
-/**************************************************************
- *                                                            *
- * TLB miss handling for Book3E with hw page table support    *
- *                                                            *
- **************************************************************/
-
-
-/* Data TLB miss */
-	START_EXCEPTION(data_tlb_miss_htw)
-	TLB_MISS_PROLOG
-
-	/* Now we handle the fault proper. We only save DEAR in normal
-	 * fault case since that's the only interesting values here.
-	 * We could probably also optimize by not saving SRR0/1 in the
-	 * linear mapping case but I'll leave that for later
-	 */
-	mfspr	r14,SPRN_ESR
-	mfspr	r16,SPRN_DEAR		/* get faulting address */
-	srdi	r11,r16,44		/* get region */
-	xoris	r11,r11,0xc
-	cmpldi	cr0,r11,0		/* linear mapping ? */
-	beq	tlb_load_linear		/* yes -> go to linear map load */
-	cmpldi	cr1,r11,1		/* vmalloc mapping ? */
-
-	/* We do the user/kernel test for the PID here along with the RW test
-	 */
-	srdi.	r11,r16,60		/* Check for user region */
-	ld	r15,PACAPGD(r13)	/* Load user pgdir */
-	beq	htw_tlb_miss
-
-	/* XXX replace the RMW cycles with immediate loads + writes */
-1:	mfspr	r10,SPRN_MAS1
-	rlwinm	r10,r10,0,16,1		/* Clear TID */
-	mtspr	SPRN_MAS1,r10
-	ld	r15,PACA_KERNELPGD(r13)	/* Load kernel pgdir */
-	beq+	cr1,htw_tlb_miss
-
-	/* We got a crappy address, just fault with whatever DEAR and ESR
-	 * are here
-	 */
-	TLB_MISS_EPILOG_ERROR
-	b	exc_data_storage_book3e
-
-/* Instruction TLB miss */
-	START_EXCEPTION(instruction_tlb_miss_htw)
-	TLB_MISS_PROLOG
-
-	/* If we take a recursive fault, the second level handler may need
-	 * to know whether we are handling a data or instruction fault in
-	 * order to get to the right store fault handler. We provide that
-	 * info by keeping a crazy value for ESR in r14
-	 */
-	li	r14,-1	/* store to exception frame is done later */
-
-	/* Now we handle the fault proper. We only save DEAR in the non
-	 * linear mapping case since we know the linear mapping case will
-	 * not re-enter. We could indeed optimize and also not save SRR0/1
-	 * in the linear mapping case but I'll leave that for later
-	 *
-	 * Faulting address is SRR0 which is already in r16
-	 */
-	srdi	r11,r16,44		/* get region */
-	xoris	r11,r11,0xc
-	cmpldi	cr0,r11,0		/* linear mapping ? */
-	beq	tlb_load_linear		/* yes -> go to linear map load */
-	cmpldi	cr1,r11,1		/* vmalloc mapping ? */
-
-	/* We do the user/kernel test for the PID here along with the RW test
-	 */
-	srdi.	r11,r16,60		/* Check for user region */
-	ld	r15,PACAPGD(r13)		/* Load user pgdir */
-	beq	htw_tlb_miss
-
-	/* XXX replace the RMW cycles with immediate loads + writes */
-1:	mfspr	r10,SPRN_MAS1
-	rlwinm	r10,r10,0,16,1			/* Clear TID */
-	mtspr	SPRN_MAS1,r10
-	ld	r15,PACA_KERNELPGD(r13)		/* Load kernel pgdir */
-	beq+	htw_tlb_miss
-
-	/* We got a crappy address, just fault */
-	TLB_MISS_EPILOG_ERROR
-	b	exc_instruction_storage_book3e
-
-
-/*
- * This is the guts of the second-level TLB miss handler for direct
- * misses. We are entered with:
- *
- * r16 = virtual page table faulting address
- * r15 = PGD pointer
- * r14 = ESR
- * r13 = PACA
- * r12 = TLB exception frame in PACA
- * r11 = crap (free to use)
- * r10 = crap (free to use)
- *
- * It can be re-entered by the linear mapping miss handler. However, to
- * avoid too much complication, it will save/restore things for us
- */
-htw_tlb_miss:
-#ifdef CONFIG_PPC_KUAP
-	mfspr	r10,SPRN_MAS1
-	rlwinm.	r10,r10,0,0x3fff0000
-	beq-	htw_tlb_miss_fault /* KUAP fault */
-#endif
-	/* Search if we already have a TLB entry for that virtual address, and
-	 * if we do, bail out.
-	 *
-	 * MAS1:IND should be already set based on MAS4
-	 */
-	PPC_TLBSRX_DOT(0,R16)
-	beq	htw_tlb_miss_done
-
-	/* Now, we need to walk the page tables. First check if we are in
-	 * range.
-	 */
-	rldicl.	r10,r16,64-PGTABLE_EADDR_SIZE,PGTABLE_EADDR_SIZE+4
-	bne-	htw_tlb_miss_fault
-
-	/* Get the PGD pointer */
-	cmpldi	cr0,r15,0
-	beq-	htw_tlb_miss_fault
-
-	/* Get to PGD entry */
-	rldicl	r11,r16,64-(PGDIR_SHIFT-3),64-PGD_INDEX_SIZE-3
-	clrrdi	r10,r11,3
-	ldx	r15,r10,r15
-	cmpdi	cr0,r15,0
-	bge	htw_tlb_miss_fault
-
-	/* Get to PUD entry */
-	rldicl	r11,r16,64-(PUD_SHIFT-3),64-PUD_INDEX_SIZE-3
-	clrrdi	r10,r11,3
-	ldx	r15,r10,r15
-	cmpdi	cr0,r15,0
-	bge	htw_tlb_miss_fault
-
-	/* Get to PMD entry */
-	rldicl	r11,r16,64-(PMD_SHIFT-3),64-PMD_INDEX_SIZE-3
-	clrrdi	r10,r11,3
-	ldx	r15,r10,r15
-	cmpdi	cr0,r15,0
-	bge	htw_tlb_miss_fault
-
-	/* Ok, we're all right, we can now create an indirect entry for
-	 * a 1M or 256M page.
-	 *
-	 * The last trick is now that because we use "half" pages for
-	 * the HTW (1M IND is 2K and 256M IND is 32K) we need to account
-	 * for an added LSB bit to the RPN. For 64K pages, there is no
-	 * problem as we already use 32K arrays (half PTE pages), but for
-	 * 4K page we need to extract a bit from the virtual address and
-	 * insert it into the "PA52" bit of the RPN.
-	 */
-	rlwimi	r15,r16,32-9,20,20
-	/* Now we build the MAS:
-	 *
-	 * MAS 0   :	Fully setup with defaults in MAS4 and TLBnCFG
-	 * MAS 1   :	Almost fully setup
-	 *               - PID already updated by caller if necessary
-	 *               - TSIZE for now is base ind page size always
-	 * MAS 2   :	Use defaults
-	 * MAS 3+7 :	Needs to be done
-	 */
-	ori	r10,r15,(BOOK3E_PAGESZ_4K << MAS3_SPSIZE_SHIFT)
-
-	srdi	r16,r10,32
-	mtspr	SPRN_MAS3,r10
-	mtspr	SPRN_MAS7,r16
-
-	tlbwe
-
-htw_tlb_miss_done:
-	/* We don't bother with restoring DEAR or ESR since we know we are
-	 * level 0 and just going back to userland. They are only needed
-	 * if you are going to take an access fault
-	 */
-	TLB_MISS_EPILOG_SUCCESS
-	rfi
-
-htw_tlb_miss_fault:
-	/* We need to check if it was an instruction miss. We know this
-	 * though because r14 would contain -1
-	 */
-	cmpdi	cr0,r14,-1
-	beq	1f
-	mtspr	SPRN_DEAR,r16
-	mtspr	SPRN_ESR,r14
-	TLB_MISS_EPILOG_ERROR
-	b	exc_data_storage_book3e
-1:	TLB_MISS_EPILOG_ERROR
-	b	exc_instruction_storage_book3e
-
 /*
  * This is the guts of "any" level TLB miss handler for kernel linear
  * mapping misses. We are entered with:
-- 
2.44.0



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [RFC PATCH v2 14/20] powerpc/e500: Remove enc field from struct mmu_psize_def
  2024-05-17 18:59 [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Christophe Leroy
                   ` (12 preceding siblings ...)
  2024-05-17 19:00 ` [RFC PATCH v2 13/20] powerpc/64e: Clean up impossible setups Christophe Leroy
@ 2024-05-17 19:00 ` Christophe Leroy
  2024-05-25  4:35   ` Oscar Salvador
  2024-05-17 19:00 ` [RFC PATCH v2 15/20] powerpc/85xx: Switch to 64 bits PGD Christophe Leroy
                   ` (7 subsequent siblings)
  21 siblings, 1 reply; 60+ messages in thread
From: Christophe Leroy @ 2024-05-17 19:00 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu, Oscar Salvador,
	Michael Ellerman, Nicholas Piggin
  Cc: Christophe Leroy, linux-kernel, linux-mm, linuxppc-dev

enc field is hidden behind BOOK3E_PAGESZ_XX macros, and when you look
closer you realise that this field is nothing else than the value of
shift minus ten.

So remove enc field and calculate tsize from shift field.

Also remove inc filed which is unused.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/include/asm/nohash/mmu-e500.h |  3 ---
 arch/powerpc/mm/nohash/book3e_pgtable.c    |  4 ++--
 arch/powerpc/mm/nohash/tlb.c               | 11 ++---------
 3 files changed, 4 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/mmu-e500.h b/arch/powerpc/include/asm/nohash/mmu-e500.h
index 792bfaafd70b..4167da0c0241 100644
--- a/arch/powerpc/include/asm/nohash/mmu-e500.h
+++ b/arch/powerpc/include/asm/nohash/mmu-e500.h
@@ -244,14 +244,11 @@ typedef struct {
 /* Page size definitions, common between 32 and 64-bit
  *
  *    shift : is the "PAGE_SHIFT" value for that page size
- *    penc  : is the pte encoding mask
  *
  */
 struct mmu_psize_def
 {
 	unsigned int	shift;	/* number of bits */
-	unsigned int	enc;	/* PTE encoding */
-	unsigned int    ind;    /* Corresponding indirect page size shift */
 	unsigned int	flags;
 #define MMU_PAGE_SIZE_DIRECT	0x1	/* Supported as a direct size */
 #define MMU_PAGE_SIZE_INDIRECT	0x2	/* Supported as an indirect size */
diff --git a/arch/powerpc/mm/nohash/book3e_pgtable.c b/arch/powerpc/mm/nohash/book3e_pgtable.c
index 1c5e4ecbebeb..ad2a7c26f2a0 100644
--- a/arch/powerpc/mm/nohash/book3e_pgtable.c
+++ b/arch/powerpc/mm/nohash/book3e_pgtable.c
@@ -29,10 +29,10 @@ int __meminit vmemmap_create_mapping(unsigned long start,
 		_PAGE_KERNEL_RW;
 
 	/* PTEs only contain page size encodings up to 32M */
-	BUG_ON(mmu_psize_defs[mmu_vmemmap_psize].enc > 0xf);
+	BUG_ON(mmu_psize_defs[mmu_vmemmap_psize].shift - 10 > 0xf);
 
 	/* Encode the size in the PTE */
-	flags |= mmu_psize_defs[mmu_vmemmap_psize].enc << 8;
+	flags |= (mmu_psize_defs[mmu_vmemmap_psize].shift - 10) << 8;
 
 	/* For each PTE for that area, map things. Note that we don't
 	 * increment phys because all PTEs are of the large size and
diff --git a/arch/powerpc/mm/nohash/tlb.c b/arch/powerpc/mm/nohash/tlb.c
index 1caccbf4c138..10b5a6b60450 100644
--- a/arch/powerpc/mm/nohash/tlb.c
+++ b/arch/powerpc/mm/nohash/tlb.c
@@ -53,37 +53,30 @@
 struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT] = {
 	[MMU_PAGE_4K] = {
 		.shift	= 12,
-		.enc	= BOOK3E_PAGESZ_4K,
 	},
 	[MMU_PAGE_2M] = {
 		.shift	= 21,
-		.enc	= BOOK3E_PAGESZ_2M,
 	},
 	[MMU_PAGE_4M] = {
 		.shift	= 22,
-		.enc	= BOOK3E_PAGESZ_4M,
 	},
 	[MMU_PAGE_16M] = {
 		.shift	= 24,
-		.enc	= BOOK3E_PAGESZ_16M,
 	},
 	[MMU_PAGE_64M] = {
 		.shift	= 26,
-		.enc	= BOOK3E_PAGESZ_64M,
 	},
 	[MMU_PAGE_256M] = {
 		.shift	= 28,
-		.enc	= BOOK3E_PAGESZ_256M,
 	},
 	[MMU_PAGE_1G] = {
 		.shift	= 30,
-		.enc	= BOOK3E_PAGESZ_1GB,
 	},
 };
 
 static inline int mmu_get_tsize(int psize)
 {
-	return mmu_psize_defs[psize].enc;
+	return mmu_psize_defs[psize].shift - 10;
 }
 #else
 static inline int mmu_get_tsize(int psize)
@@ -371,7 +364,7 @@ void tlb_flush(struct mmu_gather *tlb)
  */
 void tlb_flush_pgtable(struct mmu_gather *tlb, unsigned long address)
 {
-	int tsize = mmu_psize_defs[mmu_pte_psize].enc;
+	int tsize = mmu_get_tsize(mmu_pte_psize);
 
 	if (book3e_htw_mode != PPC_HTW_NONE) {
 		unsigned long start = address & PMD_MASK;
-- 
2.44.0



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [RFC PATCH v2 15/20] powerpc/85xx: Switch to 64 bits PGD
  2024-05-17 18:59 [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Christophe Leroy
                   ` (13 preceding siblings ...)
  2024-05-17 19:00 ` [RFC PATCH v2 14/20] powerpc/e500: Remove enc field from struct mmu_psize_def Christophe Leroy
@ 2024-05-17 19:00 ` Christophe Leroy
  2024-05-25  4:54   ` Oscar Salvador
  2024-05-17 19:00 ` [RFC PATCH v2 16/20] powerpc/e500: Encode hugepage size in PTE bits Christophe Leroy
                   ` (6 subsequent siblings)
  21 siblings, 1 reply; 60+ messages in thread
From: Christophe Leroy @ 2024-05-17 19:00 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu, Oscar Salvador,
	Michael Ellerman, Nicholas Piggin
  Cc: Christophe Leroy, linux-kernel, linux-mm, linuxppc-dev

In order to allow leaf PMD entries, switch the PGD to 64 bits entries.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/include/asm/pgtable-types.h |  4 ++++
 arch/powerpc/kernel/head_85xx.S          | 10 ++++++----
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-types.h b/arch/powerpc/include/asm/pgtable-types.h
index 082c85cc09b1..db965d98e0ae 100644
--- a/arch/powerpc/include/asm/pgtable-types.h
+++ b/arch/powerpc/include/asm/pgtable-types.h
@@ -49,7 +49,11 @@ static inline unsigned long pud_val(pud_t x)
 #endif /* CONFIG_PPC64 */
 
 /* PGD level */
+#if defined(CONFIG_PPC_E500) && defined(CONFIG_PTE_64BIT)
+typedef struct { unsigned long long pgd; } pgd_t;
+#else
 typedef struct { unsigned long pgd; } pgd_t;
+#endif
 #define __pgd(x)	((pgd_t) { (x) })
 static inline unsigned long pgd_val(pgd_t x)
 {
diff --git a/arch/powerpc/kernel/head_85xx.S b/arch/powerpc/kernel/head_85xx.S
index 39724ff5ae1f..a305244afc9f 100644
--- a/arch/powerpc/kernel/head_85xx.S
+++ b/arch/powerpc/kernel/head_85xx.S
@@ -307,8 +307,9 @@ set_ivor:
 #ifdef CONFIG_PTE_64BIT
 #ifdef CONFIG_HUGETLB_PAGE
 #define FIND_PTE	\
-	rlwinm	r12, r10, 13, 19, 29;	/* Compute pgdir/pmd offset */	\
-	lwzx	r11, r12, r11;		/* Get pgd/pmd entry */		\
+	rlwinm	r12, r10, 14, 18, 28;	/* Compute pgdir/pmd offset */	\
+	add	r12, r11, r12;						\
+	lwz	r11, 4(r12);		/* Get pgd/pmd entry */		\
 	rlwinm.	r12, r11, 0, 0, 20;	/* Extract pt base address */	\
 	blt	1000f;			/* Normal non-huge page */	\
 	beq	2f;			/* Bail if no table */		\
@@ -321,8 +322,9 @@ set_ivor:
 1001:	lwz	r11, 4(r12);		/* Get pte entry */
 #else
 #define FIND_PTE	\
-	rlwinm	r12, r10, 13, 19, 29;	/* Compute pgdir/pmd offset */	\
-	lwzx	r11, r12, r11;		/* Get pgd/pmd entry */		\
+	rlwinm	r12, r10, 14, 18, 28;	/* Compute pgdir/pmd offset */	\
+	add	r12, r11, r12;						\
+	lwz	r11, 4(r12);		/* Get pgd/pmd entry */		\
 	rlwinm.	r12, r11, 0, 0, 20;	/* Extract pt base address */	\
 	beq	2f;			/* Bail if no table */		\
 	rlwimi	r12, r10, 23, 20, 28;	/* Compute pte address */	\
-- 
2.44.0



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [RFC PATCH v2 16/20] powerpc/e500: Encode hugepage size in PTE bits
  2024-05-17 18:59 [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Christophe Leroy
                   ` (14 preceding siblings ...)
  2024-05-17 19:00 ` [RFC PATCH v2 15/20] powerpc/85xx: Switch to 64 bits PGD Christophe Leroy
@ 2024-05-17 19:00 ` Christophe Leroy
  2024-05-17 19:00 ` [RFC PATCH v2 17/20] powerpc/e500: Use contiguous PMD instead of hugepd Christophe Leroy
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 60+ messages in thread
From: Christophe Leroy @ 2024-05-17 19:00 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu, Oscar Salvador,
	Michael Ellerman, Nicholas Piggin
  Cc: Christophe Leroy, linux-kernel, linux-mm, linuxppc-dev

Use U0-U3 bits to encode hugepage size, more exactly page shift.

As we start using hugepages at shift 21 (2Mbytes), substract 20
so that it fits into 4 bits. That may change in the future if
we want to use smaller hugepages.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/include/asm/nohash/hugetlb-e500.h | 6 ++++++
 arch/powerpc/include/asm/nohash/pte-e500.h     | 3 +++
 2 files changed, 9 insertions(+)

diff --git a/arch/powerpc/include/asm/nohash/hugetlb-e500.h b/arch/powerpc/include/asm/nohash/hugetlb-e500.h
index 8f04ad20e040..d8e51a3f8557 100644
--- a/arch/powerpc/include/asm/nohash/hugetlb-e500.h
+++ b/arch/powerpc/include/asm/nohash/hugetlb-e500.h
@@ -42,4 +42,10 @@ static inline int check_and_get_huge_psize(int shift)
 	return shift_to_mmu_psize(shift);
 }
 
+static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags)
+{
+	return __pte(pte_val(entry) | (_PAGE_U3 * (shift - 20)));
+}
+#define arch_make_huge_pte arch_make_huge_pte
+
 #endif /* _ASM_POWERPC_NOHASH_HUGETLB_E500_H */
diff --git a/arch/powerpc/include/asm/nohash/pte-e500.h b/arch/powerpc/include/asm/nohash/pte-e500.h
index 975facc7e38e..091e4bff1fba 100644
--- a/arch/powerpc/include/asm/nohash/pte-e500.h
+++ b/arch/powerpc/include/asm/nohash/pte-e500.h
@@ -46,6 +46,9 @@
 #define _PAGE_NO_CACHE	0x400000 /* I: cache inhibit */
 #define _PAGE_WRITETHRU	0x800000 /* W: cache write-through */
 
+#define _PAGE_HSIZE_MSK (_PAGE_U0 | _PAGE_U1 | _PAGE_U2 | _PAGE_U3)
+#define _PAGE_HSIZE_SHIFT	14
+
 /* "Higher level" linux bit combinations */
 #define _PAGE_EXEC		(_PAGE_BAP_SX | _PAGE_BAP_UX) /* .. and was cache cleaned */
 #define _PAGE_READ		(_PAGE_BAP_SR | _PAGE_BAP_UR) /* User read permission */
-- 
2.44.0



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [RFC PATCH v2 17/20] powerpc/e500: Use contiguous PMD instead of hugepd
  2024-05-17 18:59 [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Christophe Leroy
                   ` (15 preceding siblings ...)
  2024-05-17 19:00 ` [RFC PATCH v2 16/20] powerpc/e500: Encode hugepage size in PTE bits Christophe Leroy
@ 2024-05-17 19:00 ` Christophe Leroy
  2024-05-17 19:00 ` [RFC PATCH v2 18/20] powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD Christophe Leroy
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 60+ messages in thread
From: Christophe Leroy @ 2024-05-17 19:00 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu, Oscar Salvador,
	Michael Ellerman, Nicholas Piggin
  Cc: Christophe Leroy, linux-kernel, linux-mm, linuxppc-dev

e500 supports many page sizes among which the following size are
implemented in the kernel at the time being: 4M, 16M, 64M, 256M, 1G.

On e500, TLB miss for hugepages is exclusively handled by SW even
on e6500 which has HW assistance for 4k pages, so there are no
constraints like on the 8xx.

On e500/32, all are at PGD/PMD level and can be handled as
cont-PMD.

On e500/64, smaller ones are on PMD while bigger ones are on PUD.
Again, they can easily be handled as cont-PMD and cont-PUD instead
of hugepd.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 .../powerpc/include/asm/nohash/hugetlb-e500.h | 32 +---------
 arch/powerpc/include/asm/nohash/pgalloc.h     |  2 -
 arch/powerpc/include/asm/nohash/pgtable.h     | 43 +++++++++----
 arch/powerpc/include/asm/nohash/pte-e500.h    | 15 +++++
 arch/powerpc/include/asm/page.h               | 15 +----
 arch/powerpc/kernel/head_85xx.S               | 23 +++----
 arch/powerpc/mm/hugetlbpage.c                 |  2 -
 arch/powerpc/mm/nohash/tlb_low_64e.S          | 63 +++++++++++--------
 arch/powerpc/mm/pgtable.c                     | 31 +++++++++
 arch/powerpc/platforms/Kconfig.cputype        |  1 -
 10 files changed, 131 insertions(+), 96 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/hugetlb-e500.h b/arch/powerpc/include/asm/nohash/hugetlb-e500.h
index d8e51a3f8557..d30e2a3f129d 100644
--- a/arch/powerpc/include/asm/nohash/hugetlb-e500.h
+++ b/arch/powerpc/include/asm/nohash/hugetlb-e500.h
@@ -2,38 +2,12 @@
 #ifndef _ASM_POWERPC_NOHASH_HUGETLB_E500_H
 #define _ASM_POWERPC_NOHASH_HUGETLB_E500_H
 
-static inline pte_t *hugepd_page(hugepd_t hpd)
-{
-	if (WARN_ON(!hugepd_ok(hpd)))
-		return NULL;
-
-	return (pte_t *)((hpd_val(hpd) & ~HUGEPD_SHIFT_MASK) | PD_HUGE);
-}
-
-static inline unsigned int hugepd_shift(hugepd_t hpd)
-{
-	return hpd_val(hpd) & HUGEPD_SHIFT_MASK;
-}
-
-static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned long addr,
-				    unsigned int pdshift)
-{
-	/*
-	 * On FSL BookE, we have multiple higher-level table entries that
-	 * point to the same hugepte.  Just use the first one since they're all
-	 * identical.  So for that case, idx=0.
-	 */
-	return hugepd_page(hpd);
-}
+#define __HAVE_ARCH_HUGE_SET_HUGE_PTE_AT
+void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
+		     pte_t pte, unsigned long sz);
 
 void flush_hugetlb_page(struct vm_area_struct *vma, unsigned long vmaddr);
 
-static inline void hugepd_populate(hugepd_t *hpdp, pte_t *new, unsigned int pshift)
-{
-	/* We use the old format for PPC_E500 */
-	*hpdp = __hugepd(((unsigned long)new & ~PD_HUGE) | pshift);
-}
-
 static inline int check_and_get_huge_psize(int shift)
 {
 	if (shift & 1)	/* Not a power of 4 */
diff --git a/arch/powerpc/include/asm/nohash/pgalloc.h b/arch/powerpc/include/asm/nohash/pgalloc.h
index 4b62376318e1..d06efac6d7aa 100644
--- a/arch/powerpc/include/asm/nohash/pgalloc.h
+++ b/arch/powerpc/include/asm/nohash/pgalloc.h
@@ -44,8 +44,6 @@ static inline void pgtable_free(void *table, int shift)
 	}
 }
 
-#define get_hugepd_cache_index(x)	(x)
-
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift)
 {
 	unsigned long pgf = (unsigned long)table;
diff --git a/arch/powerpc/include/asm/nohash/pgtable.h b/arch/powerpc/include/asm/nohash/pgtable.h
index c4be7754e96f..28ecb2c8b433 100644
--- a/arch/powerpc/include/asm/nohash/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/pgtable.h
@@ -52,11 +52,36 @@ static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, p
 {
 	pte_basic_t old = pte_val(*p);
 	pte_basic_t new = (old & ~(pte_basic_t)clr) | set;
+	unsigned long sz;
+	unsigned long pdsize;
+	int i;
 
 	if (new == old)
 		return old;
 
-	*p = __pte(new);
+#ifdef CONFIG_PPC_E500
+	if (huge)
+		sz = 1UL << (((old & _PAGE_HSIZE_MSK) >> _PAGE_HSIZE_SHIFT) + 20);
+	else
+#endif
+		sz = PAGE_SIZE;
+
+	if (!huge || sz < PMD_SIZE)
+		pdsize = PAGE_SIZE;
+	else if (sz < PUD_SIZE)
+		pdsize = PMD_SIZE;
+	else if (sz < P4D_SIZE)
+		pdsize = PUD_SIZE;
+	else if (sz < PGDIR_SIZE)
+		pdsize = P4D_SIZE;
+	else
+		pdsize = PGDIR_SIZE;
+
+	for (i = 0; i < sz / pdsize; i++, p++) {
+		*p = __pte(new);
+		if (new)
+			new += (unsigned long long)(pdsize / PAGE_SIZE) << PTE_RPN_SHIFT;
+	}
 
 	if (IS_ENABLED(CONFIG_44x) && !is_kernel_addr(addr) && (old & _PAGE_EXEC))
 		icache_44x_need_flush = 1;
@@ -340,25 +365,19 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
 
 #define pgprot_writecombine pgprot_noncached_wc
 
-#ifdef CONFIG_ARCH_HAS_HUGEPD
-static inline int hugepd_ok(hugepd_t hpd)
-{
-	/* We clear the top bit to indicate hugepd */
-	return (hpd_val(hpd) && (hpd_val(hpd) & PD_HUGE) == 0);
-}
-
-#define is_hugepd(hpd)		(hugepd_ok(hpd))
-#endif
-
 #ifdef CONFIG_HUGETLB_PAGE
 static inline int pmd_huge(pmd_t pmd)
 {
+#ifdef pmd_leaf
+	return pmd_leaf(pmd);
+#else
 	return 0;
+#endif
 }
 
 static inline int pud_huge(pud_t pud)
 {
-	return 0;
+	return pud_leaf(pud);
 }
 #endif
 
diff --git a/arch/powerpc/include/asm/nohash/pte-e500.h b/arch/powerpc/include/asm/nohash/pte-e500.h
index 091e4bff1fba..178378cdaabb 100644
--- a/arch/powerpc/include/asm/nohash/pte-e500.h
+++ b/arch/powerpc/include/asm/nohash/pte-e500.h
@@ -67,6 +67,7 @@
 #define _PAGE_RWX	(_PAGE_READ | _PAGE_WRITE | _PAGE_BAP_UX)
 
 #define _PAGE_SPECIAL	_PAGE_SW0
+#define _PAGE_PTE	_PAGE_PSIZE_4K
 
 #define	PTE_RPN_SHIFT	(24)
 
@@ -106,6 +107,20 @@ static inline pte_t pte_mkexec(pte_t pte)
 }
 #define pte_mkexec pte_mkexec
 
+static inline int pmd_leaf(pmd_t pmd)
+{
+	return pmd_val(pmd) & _PAGE_PTE;
+}
+#define pmd_leaf pmd_leaf
+
+#ifdef CONFIG_PPC64
+static inline int pud_leaf(pud_t pud)
+{
+	return pud_val(pud) & _PAGE_PTE;
+}
+#define pud_leaf pud_leaf
+#endif
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index 018c3d55232c..7d3c3bc40e6a 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -269,20 +269,7 @@ static inline const void *pfn_to_kaddr(unsigned long pfn)
 #define is_kernel_addr(x)	((x) >= TASK_SIZE)
 #endif
 
-#ifndef CONFIG_PPC_BOOK3S_64
-/*
- * Use the top bit of the higher-level page table entries to indicate whether
- * the entries we point to contain hugepages.  This works because we know that
- * the page tables live in kernel space.  If we ever decide to support having
- * page tables at arbitrary addresses, this breaks and will have to change.
- */
-#ifdef CONFIG_PPC64
-#define PD_HUGE 0x8000000000000000UL
-#else
-#define PD_HUGE 0x80000000
-#endif
-
-#else	/* CONFIG_PPC_BOOK3S_64 */
+#ifdef CONFIG_PPC_BOOK3S_64
 /*
  * Book3S 64 stores real addresses in the hugepd entries to
  * avoid overlaps with _PAGE_PRESENT and _PAGE_PTE.
diff --git a/arch/powerpc/kernel/head_85xx.S b/arch/powerpc/kernel/head_85xx.S
index a305244afc9f..96479a2230ac 100644
--- a/arch/powerpc/kernel/head_85xx.S
+++ b/arch/powerpc/kernel/head_85xx.S
@@ -310,16 +310,17 @@ set_ivor:
 	rlwinm	r12, r10, 14, 18, 28;	/* Compute pgdir/pmd offset */	\
 	add	r12, r11, r12;						\
 	lwz	r11, 4(r12);		/* Get pgd/pmd entry */		\
-	rlwinm.	r12, r11, 0, 0, 20;	/* Extract pt base address */	\
-	blt	1000f;			/* Normal non-huge page */	\
-	beq	2f;			/* Bail if no table */		\
-	oris	r11, r11, PD_HUGE@h;	/* Put back address bit */	\
-	andi.	r10, r11, HUGEPD_SHIFT_MASK@l; /* extract size field */	\
-	xor	r12, r10, r11;		/* drop size bits from pointer */ \
+	rotlwi.	r11, r11, 22;		/* Leaf entry (_PAGE_PTE set) */\
+	bge	1000f;			/* Normal non-huge page */	\
+	rlwinm	r10, r11, 64 - _PAGE_HSIZE_SHIFT - 22, 0xf;		\
+	rotrwi	r11, r11, 22;		/* Restore entry */		\
 	b	1001f;							\
-1000:	rlwimi	r12, r10, 23, 20, 28;	/* Compute pte address */	\
+1000:	rlwinm.	r12, r11, 32 - 22, 0, 20; /* Extract pt base address */	\
+	beq	2f;			/* Bail if no table */		\
+	rlwimi	r12, r10, 23, 20, 28;	/* Compute pte address */	\
 	li	r10, 0;			/* clear r10 */			\
-1001:	lwz	r11, 4(r12);		/* Get pte entry */
+	lwz	r11, 4(r12);		/* Get pte entry */		\
+1001:
 #else
 #define FIND_PTE	\
 	rlwinm	r12, r10, 14, 18, 28;	/* Compute pgdir/pmd offset */	\
@@ -749,16 +750,16 @@ finish_tlb_load:
 100:	stw	r15, 0(r17)
 
 	/*
-	 * Calc MAS1_TSIZE from r10 (which has pshift encoded)
+	 * Calc MAS1_TSIZE from r10 (which has pshift - 20 encoded)
 	 * tlb_enc = (pshift - 10).
 	 */
-	subi	r15, r10, 10
+	addi	r15, r10, 10
 	mfspr	r16, SPRN_MAS1
 	rlwimi	r16, r15, 7, 20, 24
 	mtspr	SPRN_MAS1, r16
 
 	/* copy the pshift for use later */
-	mr	r14, r10
+	addi	r14, r10, 20
 
 	/* fall through */
 
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index f8aefa1e7363..1401587578fc 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -627,8 +627,6 @@ static int __init hugetlbpage_init(void)
 		if (pdshift > shift) {
 			if (!IS_ENABLED(CONFIG_PPC_8xx))
 				pgtable_cache_add(pdshift - shift);
-		} else if (IS_ENABLED(CONFIG_PPC_E500)) {
-			pgtable_cache_add(PTE_T_ORDER);
 		}
 
 		configured = true;
diff --git a/arch/powerpc/mm/nohash/tlb_low_64e.S b/arch/powerpc/mm/nohash/tlb_low_64e.S
index 93ecb8ec82b0..3e4d23a562c3 100644
--- a/arch/powerpc/mm/nohash/tlb_low_64e.S
+++ b/arch/powerpc/mm/nohash/tlb_low_64e.S
@@ -152,20 +152,26 @@ tlb_miss_common_bolted:
 
 	rldicl	r15,r16,64-PUD_SHIFT+3,64-PUD_INDEX_SIZE-3
 	clrrdi	r15,r15,3
-	cmpdi	cr0,r14,0
-	bge	tlb_miss_fault_bolted	/* Bad pgd entry or hugepage; bail */
+	cmpdi	cr3,r14,0
+	andi.	r10,r14,_PAGE_PTE
+	beq-	cr3,tlb_miss_fault_bolted /* No entry, bail */
+	bne	tlb_miss_fault_bolted	/* Hugepage; bail */
 	ldx	r14,r14,r15		/* grab pud entry */
 
 	rldicl	r15,r16,64-PMD_SHIFT+3,64-PMD_INDEX_SIZE-3
 	clrrdi	r15,r15,3
-	cmpdi	cr0,r14,0
-	bge	tlb_miss_fault_bolted
+	cmpdi	cr3,r14,0
+	andi.	r10,r14,_PAGE_PTE
+	beq-	cr3,tlb_miss_fault_bolted /* No entry, bail */
+	bne	tlb_miss_fault_bolted	/* Hugepage; bail */
 	ldx	r14,r14,r15		/* Grab pmd entry */
 
 	rldicl	r15,r16,64-PAGE_SHIFT+3,64-PTE_INDEX_SIZE-3
 	clrrdi	r15,r15,3
-	cmpdi	cr0,r14,0
-	bge	tlb_miss_fault_bolted
+	cmpdi	cr3,r14,0
+	andi.	r10,r14,_PAGE_PTE
+	beq-	cr3,tlb_miss_fault_bolted /* No entry, bail */
+	bne	tlb_miss_fault_bolted	/* Hugepage; bail */
 	ldx	r14,r14,r15		/* Grab PTE, normal (!huge) page */
 
 	/* Check if required permissions are met */
@@ -390,19 +396,25 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_SMT)
 
 	rldicl	r15,r16,64-PUD_SHIFT+3,64-PUD_INDEX_SIZE-3
 	clrrdi	r15,r15,3
-	cmpdi	cr0,r14,0
-	bge	tlb_miss_huge_e6500	/* Bad pgd entry or hugepage; bail */
+	cmpdi	cr3,r14,0
+	andi.	r10,r14,_PAGE_PTE
+	beq-	cr3,tlb_miss_fault_e6500 /* No entry, bail */
+	bne	tlb_miss_huge_e6500	/* Hugepage; bail */
 	ldx	r14,r14,r15		/* grab pud entry */
 
 	rldicl	r15,r16,64-PMD_SHIFT+3,64-PMD_INDEX_SIZE-3
 	clrrdi	r15,r15,3
-	cmpdi	cr0,r14,0
-	bge	tlb_miss_huge_e6500
+	cmpdi	cr3,r14,0
+	andi.	r10,r14,_PAGE_PTE
+	beq-	cr3,tlb_miss_fault_e6500 /* No entry, bail */
+	bne	tlb_miss_huge_e6500	/* Hugepage; bail */
 	ldx	r14,r14,r15		/* Grab pmd entry */
 
 	mfspr	r10,SPRN_MAS0
-	cmpdi	cr0,r14,0
-	bge	tlb_miss_huge_e6500
+	cmpdi	cr3,r14,0
+	andi.	r15,r14,_PAGE_PTE
+	beq-	cr3,tlb_miss_fault_e6500 /* No entry, bail */
+	bne	tlb_miss_huge_e6500	/* Hugepage; bail */
 
 	/* Now we build the MAS for a 2M indirect page:
 	 *
@@ -449,12 +461,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_SMT)
 	rfi
 
 tlb_miss_huge_e6500:
-	beq	tlb_miss_fault_e6500
-	li	r10,1
-	andi.	r15,r14,HUGEPD_SHIFT_MASK@l /* r15 = psize */
-	rldimi	r14,r10,63,0		/* Set PD_HUGE */
-	xor	r14,r14,r15		/* Clear size bits */
-	ldx	r14,0,r14
+	rlwinm	r15,r14,32-_PAGE_HSIZE_SHIFT,0xf
 
 	/*
 	 * Now we build the MAS for a huge page.
@@ -465,7 +472,7 @@ tlb_miss_huge_e6500:
 	 * MAS 2,3+7:	Needs to be redone similar to non-tablewalk handler
 	 */
 
-	subi	r15,r15,10		/* Convert psize to tsize */
+	addi	r15,r15,10		/* Convert hsize to tsize */
 	mfspr	r10,SPRN_MAS1
 	rlwinm	r10,r10,0,~MAS1_IND
 	rlwimi	r10,r15,MAS1_TSIZE_SHIFT,MAS1_TSIZE_MASK
@@ -805,22 +812,28 @@ virt_page_table_tlb_miss:
 	rldicl	r11,r16,64-VPTE_PGD_SHIFT,64-PGD_INDEX_SIZE-3
 	clrrdi	r10,r11,3
 	ldx	r15,r10,r15
-	cmpdi	cr0,r15,0
-	bge	virt_page_table_tlb_miss_fault
+	cmpdi	cr3,r15,0
+	andi.	r10,r15,_PAGE_PTE
+	beq-	cr3,virt_page_table_tlb_miss_fault /* No entry, bail */
+	bne	virt_page_table_tlb_miss_fault	/* Hugepage; bail */
 
 	/* Get to PUD entry */
 	rldicl	r11,r16,64-VPTE_PUD_SHIFT,64-PUD_INDEX_SIZE-3
 	clrrdi	r10,r11,3
 	ldx	r15,r10,r15
-	cmpdi	cr0,r15,0
-	bge	virt_page_table_tlb_miss_fault
+	cmpdi	cr3,r15,0
+	andi.	r10,r15,_PAGE_PTE
+	beq-	cr3,virt_page_table_tlb_miss_fault /* No entry, bail */
+	bne	virt_page_table_tlb_miss_fault	/* Hugepage; bail */
 
 	/* Get to PMD entry */
 	rldicl	r11,r16,64-VPTE_PMD_SHIFT,64-PMD_INDEX_SIZE-3
 	clrrdi	r10,r11,3
 	ldx	r15,r10,r15
-	cmpdi	cr0,r15,0
-	bge	virt_page_table_tlb_miss_fault
+	cmpdi	cr3,r15,0
+	andi.	r10,r15,_PAGE_PTE
+	beq-	cr3,virt_page_table_tlb_miss_fault /* No entry, bail */
+	bne	virt_page_table_tlb_miss_fault	/* Hugepage; bail */
 
 	/* Ok, we're all right, we can now create a kernel translation for
 	 * a 4K or 64K page from r16 -> r15.
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 51ee508eeb5b..d68c0fcffe80 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -328,6 +328,37 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
 		__set_huge_pte_at(pmdp, ptep, pte_val(pte));
 	}
 }
+#elif defined(CONFIG_PPC_E500)
+void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
+		     pte_t pte, unsigned long sz)
+{
+	unsigned long pdsize;
+	int i;
+
+	pte = set_pte_filter(pte, addr);
+
+	/*
+	 * Make sure hardware valid bit is not set. We don't do
+	 * tlb flush for this update.
+	 */
+	VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep));
+
+	if (sz < PMD_SIZE)
+		pdsize = PAGE_SIZE;
+	else if (sz < PUD_SIZE)
+		pdsize = PMD_SIZE;
+	else if (sz < P4D_SIZE)
+		pdsize = PUD_SIZE;
+	else if (sz < PGDIR_SIZE)
+		pdsize = P4D_SIZE;
+	else
+		pdsize = PGDIR_SIZE;
+
+	for (i = 0; i < sz / pdsize; i++, ptep++, addr += pdsize) {
+		__set_pte_at(mm, addr, ptep, pte, 0);
+		pte = __pte(pte_val(pte) + ((unsigned long long)pdsize / PAGE_SIZE << PFN_PTE_SHIFT));
+	}
+}
 #endif
 #endif /* CONFIG_HUGETLB_PAGE */
 
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index fa4bb096b3ae..30a78e99663e 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -291,7 +291,6 @@ config PPC_BOOK3S
 config PPC_E500
 	select FSL_EMB_PERFMON
 	bool
-	select ARCH_HAS_HUGEPD if HUGETLB_PAGE
 	select ARCH_SUPPORTS_HUGETLBFS if PHYS_64BIT || PPC64
 	select PPC_SMP_MUXED_IPI
 	select PPC_DOORBELL
-- 
2.44.0



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [RFC PATCH v2 18/20] powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD
  2024-05-17 18:59 [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Christophe Leroy
                   ` (16 preceding siblings ...)
  2024-05-17 19:00 ` [RFC PATCH v2 17/20] powerpc/e500: Use contiguous PMD instead of hugepd Christophe Leroy
@ 2024-05-17 19:00 ` Christophe Leroy
  2024-05-20 12:54   ` Nicholas Piggin
  2024-05-17 19:00 ` [RFC PATCH v2 19/20] powerpc/mm: Remove hugepd leftovers Christophe Leroy
                   ` (3 subsequent siblings)
  21 siblings, 1 reply; 60+ messages in thread
From: Christophe Leroy @ 2024-05-17 19:00 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu, Oscar Salvador,
	Michael Ellerman, Nicholas Piggin
  Cc: Christophe Leroy, linux-kernel, linux-mm, linuxppc-dev

On book3s/64, the only user of hugepd is hash in 4k mode.

All other setups (hash-64, radix-4, radix-64) use leaf PMD/PUD.

Rework hash-4k to use contiguous PMD and PUD instead.

In that setup there are only two huge page sizes: 16M and 16G.

16M sits at PMD level and 16G at PUD level.

pte_update doesn't know page size, lets use the same trick as
hpte_need_flush() to get page size from segment properties. That's
not the most efficient way but let's do that until callers of
pte_update() provide page size instead of just a huge flag.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/include/asm/book3s/64/hash-4k.h  | 15 --------
 arch/powerpc/include/asm/book3s/64/hash.h     | 38 +++++++++++++++----
 arch/powerpc/include/asm/book3s/64/hugetlb.h  | 38 -------------------
 .../include/asm/book3s/64/pgtable-4k.h        | 34 -----------------
 .../include/asm/book3s/64/pgtable-64k.h       | 20 ----------
 arch/powerpc/include/asm/hugetlb.h            |  4 ++
 .../include/asm/nohash/32/hugetlb-8xx.h       |  4 --
 .../powerpc/include/asm/nohash/hugetlb-e500.h |  4 --
 arch/powerpc/include/asm/page.h               |  8 ----
 arch/powerpc/mm/book3s64/hash_utils.c         | 11 ++++--
 arch/powerpc/mm/book3s64/pgtable.c            | 12 ------
 arch/powerpc/mm/hugetlbpage.c                 | 19 ----------
 arch/powerpc/mm/pgtable.c                     |  2 +-
 arch/powerpc/platforms/Kconfig.cputype        |  1 -
 14 files changed, 43 insertions(+), 167 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index 6472b08fa1b0..c654c376ef8b 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -74,21 +74,6 @@
 #define remap_4k_pfn(vma, addr, pfn, prot)	\
 	remap_pfn_range((vma), (addr), (pfn), PAGE_SIZE, (prot))
 
-#ifdef CONFIG_HUGETLB_PAGE
-static inline int hash__hugepd_ok(hugepd_t hpd)
-{
-	unsigned long hpdval = hpd_val(hpd);
-	/*
-	 * if it is not a pte and have hugepd shift mask
-	 * set, then it is a hugepd directory pointer
-	 */
-	if (!(hpdval & _PAGE_PTE) && (hpdval & _PAGE_PRESENT) &&
-	    ((hpdval & HUGEPD_SHIFT_MASK) != 0))
-		return true;
-	return false;
-}
-#endif
-
 /*
  * 4K PTE format is different from 64K PTE format. Saving the hash_slot is just
  * a matter of returning the PTE bits that need to be modified. On 64K PTE,
diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
index faf3e3b4e4b2..509811ca7695 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -4,6 +4,7 @@
 #ifdef __KERNEL__
 
 #include <asm/asm-const.h>
+#include <asm/book3s/64/slice.h>
 
 /*
  * Common bits between 4K and 64K pages in a linux-style PTE.
@@ -161,14 +162,10 @@ extern void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
 			    pte_t *ptep, unsigned long pte, int huge);
 unsigned long htab_convert_pte_flags(unsigned long pteflags, unsigned long flags);
 /* Atomic PTE updates */
-static inline unsigned long hash__pte_update(struct mm_struct *mm,
-					 unsigned long addr,
-					 pte_t *ptep, unsigned long clr,
-					 unsigned long set,
-					 int huge)
+static inline unsigned long hash__pte_update_one(pte_t *ptep, unsigned long clr,
+						 unsigned long set)
 {
 	__be64 old_be, tmp_be;
-	unsigned long old;
 
 	__asm__ __volatile__(
 	"1:	ldarx	%0,0,%3		# pte_update\n\
@@ -182,11 +179,38 @@ static inline unsigned long hash__pte_update(struct mm_struct *mm,
 	: "r" (ptep), "r" (cpu_to_be64(clr)), "m" (*ptep),
 	  "r" (cpu_to_be64(H_PAGE_BUSY)), "r" (cpu_to_be64(set))
 	: "cc" );
+
+	return be64_to_cpu(old_be);
+}
+
+static inline unsigned long hash__pte_update(struct mm_struct *mm,
+					 unsigned long addr,
+					 pte_t *ptep, unsigned long clr,
+					 unsigned long set,
+					 int huge)
+{
+	unsigned long old;
+
+	old = hash__pte_update_one(ptep, clr, set);
+
+	if (huge && IS_ENABLED(CONFIG_PPC_4K_PAGES)) {
+		unsigned int psize = get_slice_psize(mm, addr);
+		int nb, i;
+
+		if (psize == MMU_PAGE_16M)
+			nb = SZ_16M / PMD_SIZE;
+		else if (psize == MMU_PAGE_16G)
+			nb = SZ_16G / PUD_SIZE;
+		else
+			nb = 1;
+
+		for (i = 1; i < nb; i++)
+			hash__pte_update_one(ptep + i, clr, set);
+	}
 	/* huge pages use the old page table lock */
 	if (!huge)
 		assert_pte_locked(mm, addr);
 
-	old = be64_to_cpu(old_be);
 	if (old & H_PAGE_HASHPTE)
 		hpte_need_flush(mm, addr, ptep, old, huge);
 
diff --git a/arch/powerpc/include/asm/book3s/64/hugetlb.h b/arch/powerpc/include/asm/book3s/64/hugetlb.h
index aa1c67c8bfc8..f0bba9c5f9c3 100644
--- a/arch/powerpc/include/asm/book3s/64/hugetlb.h
+++ b/arch/powerpc/include/asm/book3s/64/hugetlb.h
@@ -49,9 +49,6 @@ static inline bool gigantic_page_runtime_supported(void)
 	return true;
 }
 
-/* hugepd entry valid bit */
-#define HUGEPD_VAL_BITS		(0x8000000000000000UL)
-
 #define huge_ptep_modify_prot_start huge_ptep_modify_prot_start
 extern pte_t huge_ptep_modify_prot_start(struct vm_area_struct *vma,
 					 unsigned long addr, pte_t *ptep);
@@ -60,29 +57,7 @@ extern pte_t huge_ptep_modify_prot_start(struct vm_area_struct *vma,
 extern void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
 					 unsigned long addr, pte_t *ptep,
 					 pte_t old_pte, pte_t new_pte);
-/*
- * This should work for other subarchs too. But right now we use the
- * new format only for 64bit book3s
- */
-static inline pte_t *hugepd_page(hugepd_t hpd)
-{
-	BUG_ON(!hugepd_ok(hpd));
-	/*
-	 * We have only four bits to encode, MMU page size
-	 */
-	BUILD_BUG_ON((MMU_PAGE_COUNT - 1) > 0xf);
-	return __va(hpd_val(hpd) & HUGEPD_ADDR_MASK);
-}
-
-static inline unsigned int hugepd_mmu_psize(hugepd_t hpd)
-{
-	return (hpd_val(hpd) & HUGEPD_SHIFT_MASK) >> 2;
-}
 
-static inline unsigned int hugepd_shift(hugepd_t hpd)
-{
-	return mmu_psize_to_shift(hugepd_mmu_psize(hpd));
-}
 static inline void flush_hugetlb_page(struct vm_area_struct *vma,
 				      unsigned long vmaddr)
 {
@@ -90,19 +65,6 @@ static inline void flush_hugetlb_page(struct vm_area_struct *vma,
 		return radix__flush_hugetlb_page(vma, vmaddr);
 }
 
-static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned long addr,
-				    unsigned int pdshift)
-{
-	unsigned long idx = (addr & ((1UL << pdshift) - 1)) >> hugepd_shift(hpd);
-
-	return hugepd_page(hpd) + idx;
-}
-
-static inline void hugepd_populate(hugepd_t *hpdp, pte_t *new, unsigned int pshift)
-{
-	*hpdp = __hugepd(__pa(new) | HUGEPD_VAL_BITS | (shift_to_mmu_psize(pshift) << 2));
-}
-
 void flush_hugetlb_page(struct vm_area_struct *vma, unsigned long vmaddr);
 
 static inline int check_and_get_huge_psize(int shift)
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable-4k.h b/arch/powerpc/include/asm/book3s/64/pgtable-4k.h
index 48f21820afe2..2b985bfbe863 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable-4k.h
@@ -26,40 +26,6 @@ static inline int pud_huge(pud_t pud)
 	return 0;
 }
 
-/*
- * With radix , we have hugepage ptes in the pud and pmd entries. We don't
- * need to setup hugepage directory for them. Our pte and page directory format
- * enable us to have this enabled.
- */
-static inline int hugepd_ok(hugepd_t hpd)
-{
-	if (radix_enabled())
-		return 0;
-	return hash__hugepd_ok(hpd);
-}
-#define is_hugepd(hpd)		(hugepd_ok(hpd))
-
-/*
- * 16M and 16G huge page directory tables are allocated from slab cache
- *
- */
-#define H_16M_CACHE_INDEX (PAGE_SHIFT + H_PTE_INDEX_SIZE + H_PMD_INDEX_SIZE - 24)
-#define H_16G_CACHE_INDEX                                                      \
-	(PAGE_SHIFT + H_PTE_INDEX_SIZE + H_PMD_INDEX_SIZE + H_PUD_INDEX_SIZE - 34)
-
-static inline int get_hugepd_cache_index(int index)
-{
-	switch (index) {
-	case H_16M_CACHE_INDEX:
-		return HTLB_16M_INDEX;
-	case H_16G_CACHE_INDEX:
-		return HTLB_16G_INDEX;
-	default:
-		BUG();
-	}
-	/* should not reach */
-}
-
 #endif /* CONFIG_HUGETLB_PAGE */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable-64k.h b/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
index ced7ee8b42fc..02a1e3ec7cbe 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
@@ -30,26 +30,6 @@ static inline int pud_huge(pud_t pud)
 	return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
 }
 
-/*
- * With 64k page size, we have hugepage ptes in the pgd and pmd entries. We don't
- * need to setup hugepage directory for them. Our pte and page directory format
- * enable us to have this enabled.
- */
-static inline int hugepd_ok(hugepd_t hpd)
-{
-	return 0;
-}
-
-#define is_hugepd(pdep)			0
-
-/*
- * This should never get called
- */
-static __always_inline int get_hugepd_cache_index(int index)
-{
-	BUILD_BUG();
-}
-
 #endif /* CONFIG_HUGETLB_PAGE */
 
 static inline int remap_4k_pfn(struct vm_area_struct *vma, unsigned long addr,
diff --git a/arch/powerpc/include/asm/hugetlb.h b/arch/powerpc/include/asm/hugetlb.h
index 36ed6d976cf9..d022722e6530 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -37,6 +37,10 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 			    unsigned long ceiling);
 #endif
 
+#define __HAVE_ARCH_HUGE_SET_HUGE_PTE_AT
+void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
+		     pte_t pte, unsigned long sz);
+
 #define __HAVE_ARCH_HUGE_PTEP_GET_AND_CLEAR
 static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 					    unsigned long addr, pte_t *ptep)
diff --git a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
index 1414cfd28987..4cba84776a7d 100644
--- a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
@@ -25,10 +25,6 @@ static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_
 	return ptep_get(ptep);
 }
 
-#define __HAVE_ARCH_HUGE_SET_HUGE_PTE_AT
-void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
-		     pte_t pte, unsigned long sz);
-
 #define __HAVE_ARCH_HUGE_PTE_CLEAR
 static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
 				  pte_t *ptep, unsigned long sz)
diff --git a/arch/powerpc/include/asm/nohash/hugetlb-e500.h b/arch/powerpc/include/asm/nohash/hugetlb-e500.h
index d30e2a3f129d..aea4c462e494 100644
--- a/arch/powerpc/include/asm/nohash/hugetlb-e500.h
+++ b/arch/powerpc/include/asm/nohash/hugetlb-e500.h
@@ -2,10 +2,6 @@
 #ifndef _ASM_POWERPC_NOHASH_HUGETLB_E500_H
 #define _ASM_POWERPC_NOHASH_HUGETLB_E500_H
 
-#define __HAVE_ARCH_HUGE_SET_HUGE_PTE_AT
-void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
-		     pte_t pte, unsigned long sz);
-
 void flush_hugetlb_page(struct vm_area_struct *vma, unsigned long vmaddr);
 
 static inline int check_and_get_huge_psize(int shift)
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index 7d3c3bc40e6a..c0af246a64ff 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -269,14 +269,6 @@ static inline const void *pfn_to_kaddr(unsigned long pfn)
 #define is_kernel_addr(x)	((x) >= TASK_SIZE)
 #endif
 
-#ifdef CONFIG_PPC_BOOK3S_64
-/*
- * Book3S 64 stores real addresses in the hugepd entries to
- * avoid overlaps with _PAGE_PRESENT and _PAGE_PTE.
- */
-#define HUGEPD_ADDR_MASK	(0x0ffffffffffffffful & ~HUGEPD_SHIFT_MASK)
-#endif /* CONFIG_PPC_BOOK3S_64 */
-
 /*
  * Some number of bits at the level of the page table that points to
  * a hugepte are used to encode the size.  This masks those bits.
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index 01c3b4b65241..6727a15ab94f 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1233,10 +1233,6 @@ void __init hash__early_init_mmu(void)
 	__pmd_table_size = H_PMD_TABLE_SIZE;
 	__pud_table_size = H_PUD_TABLE_SIZE;
 	__pgd_table_size = H_PGD_TABLE_SIZE;
-	/*
-	 * 4k use hugepd format, so for hash set then to
-	 * zero
-	 */
 	__pmd_val_bits = HASH_PMD_VAL_BITS;
 	__pud_val_bits = HASH_PUD_VAL_BITS;
 	__pgd_val_bits = HASH_PGD_VAL_BITS;
@@ -1546,6 +1542,13 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
 		goto bail;
 	}
 
+	if (IS_ENABLED(CONFIG_PPC_4K_PAGES) && !radix_enabled()) {
+		if (hugeshift == PMD_SHIFT && psize == MMU_PAGE_16M)
+			hugeshift = mmu_psize_defs[MMU_PAGE_16M].shift;
+		if (hugeshift == PUD_SHIFT && psize == MMU_PAGE_16G)
+			hugeshift = mmu_psize_defs[MMU_PAGE_16G].shift;
+	}
+
 	/*
 	 * Add _PAGE_PRESENT to the required access perm. If there are parallel
 	 * updates to the pte that can possibly clear _PAGE_PTE, catch that too.
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 83823db3488b..e4a1e3feefce 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -460,18 +460,6 @@ static inline void pgtable_free(void *table, int index)
 	case PUD_INDEX:
 		__pud_free(table);
 		break;
-#if defined(CONFIG_PPC_4K_PAGES) && defined(CONFIG_HUGETLB_PAGE)
-		/* 16M hugepd directory at pud level */
-	case HTLB_16M_INDEX:
-		BUILD_BUG_ON(H_16M_CACHE_INDEX <= 0);
-		kmem_cache_free(PGT_CACHE(H_16M_CACHE_INDEX), table);
-		break;
-		/* 16G hugepd directory at the pgd level */
-	case HTLB_16G_INDEX:
-		BUILD_BUG_ON(H_16G_CACHE_INDEX <= 0);
-		kmem_cache_free(PGT_CACHE(H_16G_CACHE_INDEX), table);
-		break;
-#endif
 		/* We don't free pgd table via RCU callback */
 	default:
 		BUG();
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 1401587578fc..64b9029d86de 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -600,34 +600,15 @@ static int __init hugetlbpage_init(void)
 
 		shift = mmu_psize_to_shift(psize);
 
-#ifdef CONFIG_PPC_BOOK3S_64
-		if (shift > PGDIR_SHIFT)
-			continue;
-		else if (shift > PUD_SHIFT)
-			pdshift = PGDIR_SHIFT;
-		else if (shift > PMD_SHIFT)
-			pdshift = PUD_SHIFT;
-		else
-			pdshift = PMD_SHIFT;
-#else
 		if (shift < PUD_SHIFT)
 			pdshift = PMD_SHIFT;
 		else if (shift < PGDIR_SHIFT)
 			pdshift = PUD_SHIFT;
 		else
 			pdshift = PGDIR_SHIFT;
-#endif
 
 		if (add_huge_page_size(1ULL << shift) < 0)
 			continue;
-		/*
-		 * if we have pdshift and shift value same, we don't
-		 * use pgt cache for hugepd.
-		 */
-		if (pdshift > shift) {
-			if (!IS_ENABLED(CONFIG_PPC_8xx))
-				pgtable_cache_add(pdshift - shift);
-		}
 
 		configured = true;
 	}
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index d68c0fcffe80..7d4c004cbc75 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -328,7 +328,7 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
 		__set_huge_pte_at(pmdp, ptep, pte_val(pte));
 	}
 }
-#elif defined(CONFIG_PPC_E500)
+#else
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
 		     pte_t pte, unsigned long sz)
 {
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index 30a78e99663e..b2d8c0da2ad9 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -98,7 +98,6 @@ config PPC_BOOK3S_64
 	select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
 	select ARCH_ENABLE_SPLIT_PMD_PTLOCK
 	select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
-	select ARCH_HAS_HUGEPD if HUGETLB_PAGE
 	select ARCH_SUPPORTS_HUGETLBFS
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select HAVE_MOVE_PMD
-- 
2.44.0



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [RFC PATCH v2 19/20] powerpc/mm: Remove hugepd leftovers
  2024-05-17 18:59 [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Christophe Leroy
                   ` (17 preceding siblings ...)
  2024-05-17 19:00 ` [RFC PATCH v2 18/20] powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD Christophe Leroy
@ 2024-05-17 19:00 ` Christophe Leroy
  2024-05-17 19:00 ` [RFC PATCH v2 20/20] mm: Remove CONFIG_ARCH_HAS_HUGEPD Christophe Leroy
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 60+ messages in thread
From: Christophe Leroy @ 2024-05-17 19:00 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu, Oscar Salvador,
	Michael Ellerman, Nicholas Piggin
  Cc: Christophe Leroy, linux-kernel, linux-mm, linuxppc-dev

All targets have now opted out of CONFIG_ARCH_HAS_HUGEPD so
remove left over code.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/include/asm/hugetlb.h          |   7 -
 arch/powerpc/include/asm/page.h             |   6 -
 arch/powerpc/include/asm/pgtable-be-types.h |  10 -
 arch/powerpc/include/asm/pgtable-types.h    |   9 -
 arch/powerpc/mm/hugetlbpage.c               | 412 --------------------
 arch/powerpc/mm/init-common.c               |   8 +-
 arch/powerpc/mm/pgtable.c                   |  27 +-
 7 files changed, 3 insertions(+), 476 deletions(-)

diff --git a/arch/powerpc/include/asm/hugetlb.h b/arch/powerpc/include/asm/hugetlb.h
index d022722e6530..00327aef2dec 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -30,13 +30,6 @@ static inline int is_hugepage_only_range(struct mm_struct *mm,
 }
 #define is_hugepage_only_range is_hugepage_only_range
 
-#ifdef CONFIG_ARCH_HAS_HUGEPD
-#define __HAVE_ARCH_HUGETLB_FREE_PGD_RANGE
-void hugetlb_free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
-			    unsigned long end, unsigned long floor,
-			    unsigned long ceiling);
-#endif
-
 #define __HAVE_ARCH_HUGE_SET_HUGE_PTE_AT
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
 		     pte_t pte, unsigned long sz);
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index c0af246a64ff..83d0a4fc5f75 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -269,12 +269,6 @@ static inline const void *pfn_to_kaddr(unsigned long pfn)
 #define is_kernel_addr(x)	((x) >= TASK_SIZE)
 #endif
 
-/*
- * Some number of bits at the level of the page table that points to
- * a hugepte are used to encode the size.  This masks those bits.
- */
-#define HUGEPD_SHIFT_MASK     0x3f
-
 #ifndef __ASSEMBLY__
 
 #ifdef CONFIG_PPC_BOOK3S_64
diff --git a/arch/powerpc/include/asm/pgtable-be-types.h b/arch/powerpc/include/asm/pgtable-be-types.h
index 82633200b500..6bd8f89b25dc 100644
--- a/arch/powerpc/include/asm/pgtable-be-types.h
+++ b/arch/powerpc/include/asm/pgtable-be-types.h
@@ -101,14 +101,4 @@ static inline bool pmd_xchg(pmd_t *pmdp, pmd_t old, pmd_t new)
 	return pmd_raw(old) == prev;
 }
 
-#ifdef CONFIG_ARCH_HAS_HUGEPD
-typedef struct { __be64 pdbe; } hugepd_t;
-#define __hugepd(x) ((hugepd_t) { cpu_to_be64(x) })
-
-static inline unsigned long hpd_val(hugepd_t x)
-{
-	return be64_to_cpu(x.pdbe);
-}
-#endif
-
 #endif /* _ASM_POWERPC_PGTABLE_BE_TYPES_H */
diff --git a/arch/powerpc/include/asm/pgtable-types.h b/arch/powerpc/include/asm/pgtable-types.h
index db965d98e0ae..7b3d4c592a10 100644
--- a/arch/powerpc/include/asm/pgtable-types.h
+++ b/arch/powerpc/include/asm/pgtable-types.h
@@ -87,13 +87,4 @@ static inline bool pte_xchg(pte_t *ptep, pte_t old, pte_t new)
 }
 #endif
 
-#ifdef CONFIG_ARCH_HAS_HUGEPD
-typedef struct { unsigned long pd; } hugepd_t;
-#define __hugepd(x) ((hugepd_t) { (x) })
-static inline unsigned long hpd_val(hugepd_t x)
-{
-	return x.pd;
-}
-#endif
-
 #endif /* _ASM_POWERPC_PGTABLE_TYPES_H */
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 64b9029d86de..6fad89d7bff3 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -28,8 +28,6 @@
 
 bool hugetlb_disabled = false;
 
-#define hugepd_none(hpd)	(hpd_val(hpd) == 0)
-
 #define PTE_T_ORDER	(__builtin_ffs(sizeof(pte_basic_t)) - \
 			 __builtin_ffs(sizeof(void *)))
 
@@ -42,156 +40,6 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr, unsigned long s
 	return __find_linux_pte(mm->pgd, addr, NULL, NULL);
 }
 
-#ifdef CONFIG_ARCH_HAS_HUGEPD
-static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
-			   unsigned long address, unsigned int pdshift,
-			   unsigned int pshift, spinlock_t *ptl)
-{
-	struct kmem_cache *cachep;
-	pte_t *new;
-	int i;
-	int num_hugepd;
-
-	if (pshift >= pdshift) {
-		cachep = PGT_CACHE(PTE_T_ORDER);
-		num_hugepd = 1 << (pshift - pdshift);
-	} else {
-		cachep = PGT_CACHE(pdshift - pshift);
-		num_hugepd = 1;
-	}
-
-	if (!cachep) {
-		WARN_ONCE(1, "No page table cache created for hugetlb tables");
-		return -ENOMEM;
-	}
-
-	new = kmem_cache_alloc(cachep, pgtable_gfp_flags(mm, GFP_KERNEL));
-
-	BUG_ON(pshift > HUGEPD_SHIFT_MASK);
-	BUG_ON((unsigned long)new & HUGEPD_SHIFT_MASK);
-
-	if (!new)
-		return -ENOMEM;
-
-	/*
-	 * Make sure other cpus find the hugepd set only after a
-	 * properly initialized page table is visible to them.
-	 * For more details look for comment in __pte_alloc().
-	 */
-	smp_wmb();
-
-	spin_lock(ptl);
-	/*
-	 * We have multiple higher-level entries that point to the same
-	 * actual pte location.  Fill in each as we go and backtrack on error.
-	 * We need all of these so the DTLB pgtable walk code can find the
-	 * right higher-level entry without knowing if it's a hugepage or not.
-	 */
-	for (i = 0; i < num_hugepd; i++, hpdp++) {
-		if (unlikely(!hugepd_none(*hpdp)))
-			break;
-		hugepd_populate(hpdp, new, pshift);
-	}
-	/* If we bailed from the for loop early, an error occurred, clean up */
-	if (i < num_hugepd) {
-		for (i = i - 1 ; i >= 0; i--, hpdp--)
-			*hpdp = __hugepd(0);
-		kmem_cache_free(cachep, new);
-	} else {
-		kmemleak_ignore(new);
-	}
-	spin_unlock(ptl);
-	return 0;
-}
-
-/*
- * At this point we do the placement change only for BOOK3S 64. This would
- * possibly work on other subarchs.
- */
-pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
-		      unsigned long addr, unsigned long sz)
-{
-	pgd_t *pg;
-	p4d_t *p4;
-	pud_t *pu;
-	pmd_t *pm;
-	hugepd_t *hpdp = NULL;
-	unsigned pshift = __ffs(sz);
-	unsigned pdshift = PGDIR_SHIFT;
-	spinlock_t *ptl;
-
-	addr &= ~(sz-1);
-	pg = pgd_offset(mm, addr);
-	p4 = p4d_offset(pg, addr);
-
-#ifdef CONFIG_PPC_BOOK3S_64
-	if (pshift == PGDIR_SHIFT)
-		/* 16GB huge page */
-		return (pte_t *) p4;
-	else if (pshift > PUD_SHIFT) {
-		/*
-		 * We need to use hugepd table
-		 */
-		ptl = &mm->page_table_lock;
-		hpdp = (hugepd_t *)p4;
-	} else {
-		pdshift = PUD_SHIFT;
-		pu = pud_alloc(mm, p4, addr);
-		if (!pu)
-			return NULL;
-		if (pshift == PUD_SHIFT)
-			return (pte_t *)pu;
-		else if (pshift > PMD_SHIFT) {
-			ptl = pud_lockptr(mm, pu);
-			hpdp = (hugepd_t *)pu;
-		} else {
-			pdshift = PMD_SHIFT;
-			pm = pmd_alloc(mm, pu, addr);
-			if (!pm)
-				return NULL;
-			if (pshift == PMD_SHIFT)
-				/* 16MB hugepage */
-				return (pte_t *)pm;
-			else {
-				ptl = pmd_lockptr(mm, pm);
-				hpdp = (hugepd_t *)pm;
-			}
-		}
-	}
-#else
-	if (pshift >= PGDIR_SHIFT) {
-		ptl = &mm->page_table_lock;
-		hpdp = (hugepd_t *)p4;
-	} else {
-		pdshift = PUD_SHIFT;
-		pu = pud_alloc(mm, p4, addr);
-		if (!pu)
-			return NULL;
-		if (pshift >= PUD_SHIFT) {
-			ptl = pud_lockptr(mm, pu);
-			hpdp = (hugepd_t *)pu;
-		} else {
-			pdshift = PMD_SHIFT;
-			pm = pmd_alloc(mm, pu, addr);
-			if (!pm)
-				return NULL;
-			ptl = pmd_lockptr(mm, pm);
-			hpdp = (hugepd_t *)pm;
-		}
-	}
-#endif
-	if (!hpdp)
-		return NULL;
-
-	BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp));
-
-	if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr,
-						  pdshift, pshift, ptl))
-		return NULL;
-
-	return hugepte_offset(*hpdp, addr, pdshift);
-}
-#else
 pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long addr, unsigned long sz)
 {
@@ -287,266 +135,6 @@ int __init alloc_bootmem_huge_page(struct hstate *h, int nid)
 	return __alloc_bootmem_huge_page(h, nid);
 }
 
-#ifdef CONFIG_ARCH_HAS_HUGEPD
-#ifndef CONFIG_PPC_BOOK3S_64
-#define HUGEPD_FREELIST_SIZE \
-	((PAGE_SIZE - sizeof(struct hugepd_freelist)) / sizeof(pte_t))
-
-struct hugepd_freelist {
-	struct rcu_head	rcu;
-	unsigned int index;
-	void *ptes[];
-};
-
-static DEFINE_PER_CPU(struct hugepd_freelist *, hugepd_freelist_cur);
-
-static void hugepd_free_rcu_callback(struct rcu_head *head)
-{
-	struct hugepd_freelist *batch =
-		container_of(head, struct hugepd_freelist, rcu);
-	unsigned int i;
-
-	for (i = 0; i < batch->index; i++)
-		kmem_cache_free(PGT_CACHE(PTE_T_ORDER), batch->ptes[i]);
-
-	free_page((unsigned long)batch);
-}
-
-static void hugepd_free(struct mmu_gather *tlb, void *hugepte)
-{
-	struct hugepd_freelist **batchp;
-
-	batchp = &get_cpu_var(hugepd_freelist_cur);
-
-	if (atomic_read(&tlb->mm->mm_users) < 2 ||
-	    mm_is_thread_local(tlb->mm)) {
-		kmem_cache_free(PGT_CACHE(PTE_T_ORDER), hugepte);
-		put_cpu_var(hugepd_freelist_cur);
-		return;
-	}
-
-	if (*batchp == NULL) {
-		*batchp = (struct hugepd_freelist *)__get_free_page(GFP_ATOMIC);
-		(*batchp)->index = 0;
-	}
-
-	(*batchp)->ptes[(*batchp)->index++] = hugepte;
-	if ((*batchp)->index == HUGEPD_FREELIST_SIZE) {
-		call_rcu(&(*batchp)->rcu, hugepd_free_rcu_callback);
-		*batchp = NULL;
-	}
-	put_cpu_var(hugepd_freelist_cur);
-}
-#else
-static inline void hugepd_free(struct mmu_gather *tlb, void *hugepte) {}
-#endif
-
-/* Return true when the entry to be freed maps more than the area being freed */
-static bool range_is_outside_limits(unsigned long start, unsigned long end,
-				    unsigned long floor, unsigned long ceiling,
-				    unsigned long mask)
-{
-	if ((start & mask) < floor)
-		return true;
-	if (ceiling) {
-		ceiling &= mask;
-		if (!ceiling)
-			return true;
-	}
-	return end - 1 > ceiling - 1;
-}
-
-static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshift,
-			      unsigned long start, unsigned long end,
-			      unsigned long floor, unsigned long ceiling)
-{
-	pte_t *hugepte = hugepd_page(*hpdp);
-	int i;
-
-	unsigned long pdmask = ~((1UL << pdshift) - 1);
-	unsigned int num_hugepd = 1;
-	unsigned int shift = hugepd_shift(*hpdp);
-
-	/* Note: On fsl the hpdp may be the first of several */
-	if (shift > pdshift)
-		num_hugepd = 1 << (shift - pdshift);
-
-	if (range_is_outside_limits(start, end, floor, ceiling, pdmask))
-		return;
-
-	for (i = 0; i < num_hugepd; i++, hpdp++)
-		*hpdp = __hugepd(0);
-
-	if (shift >= pdshift)
-		hugepd_free(tlb, hugepte);
-	else
-		pgtable_free_tlb(tlb, hugepte,
-				 get_hugepd_cache_index(pdshift - shift));
-}
-
-static void hugetlb_free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
-				   unsigned long addr, unsigned long end,
-				   unsigned long floor, unsigned long ceiling)
-{
-	pgtable_t token = pmd_pgtable(*pmd);
-
-	if (range_is_outside_limits(addr, end, floor, ceiling, PMD_MASK))
-		return;
-
-	pmd_clear(pmd);
-	pte_free_tlb(tlb, token, addr);
-	mm_dec_nr_ptes(tlb->mm);
-}
-
-static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
-				   unsigned long addr, unsigned long end,
-				   unsigned long floor, unsigned long ceiling)
-{
-	pmd_t *pmd;
-	unsigned long next;
-	unsigned long start;
-
-	start = addr;
-	do {
-		unsigned long more;
-
-		pmd = pmd_offset(pud, addr);
-		next = pmd_addr_end(addr, end);
-		if (!is_hugepd(__hugepd(pmd_val(*pmd)))) {
-			if (pmd_none_or_clear_bad(pmd))
-				continue;
-
-			/*
-			 * if it is not hugepd pointer, we should already find
-			 * it cleared.
-			 */
-			WARN_ON(!IS_ENABLED(CONFIG_PPC_8xx));
-
-			hugetlb_free_pte_range(tlb, pmd, addr, end, floor, ceiling);
-
-			continue;
-		}
-		/*
-		 * Increment next by the size of the huge mapping since
-		 * there may be more than one entry at this level for a
-		 * single hugepage, but all of them point to
-		 * the same kmem cache that holds the hugepte.
-		 */
-		more = addr + (1UL << hugepd_shift(*(hugepd_t *)pmd));
-		if (more > next)
-			next = more;
-
-		free_hugepd_range(tlb, (hugepd_t *)pmd, PMD_SHIFT,
-				  addr, next, floor, ceiling);
-	} while (addr = next, addr != end);
-
-	if (range_is_outside_limits(start, end, floor, ceiling, PUD_MASK))
-		return;
-
-	pmd = pmd_offset(pud, start & PUD_MASK);
-	pud_clear(pud);
-	pmd_free_tlb(tlb, pmd, start & PUD_MASK);
-	mm_dec_nr_pmds(tlb->mm);
-}
-
-static void hugetlb_free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
-				   unsigned long addr, unsigned long end,
-				   unsigned long floor, unsigned long ceiling)
-{
-	pud_t *pud;
-	unsigned long next;
-	unsigned long start;
-
-	start = addr;
-	do {
-		pud = pud_offset(p4d, addr);
-		next = pud_addr_end(addr, end);
-		if (!is_hugepd(__hugepd(pud_val(*pud)))) {
-			if (pud_none_or_clear_bad(pud))
-				continue;
-			hugetlb_free_pmd_range(tlb, pud, addr, next, floor,
-					       ceiling);
-		} else {
-			unsigned long more;
-			/*
-			 * Increment next by the size of the huge mapping since
-			 * there may be more than one entry at this level for a
-			 * single hugepage, but all of them point to
-			 * the same kmem cache that holds the hugepte.
-			 */
-			more = addr + (1UL << hugepd_shift(*(hugepd_t *)pud));
-			if (more > next)
-				next = more;
-
-			free_hugepd_range(tlb, (hugepd_t *)pud, PUD_SHIFT,
-					  addr, next, floor, ceiling);
-		}
-	} while (addr = next, addr != end);
-
-	if (range_is_outside_limits(start, end, floor, ceiling, PGDIR_MASK))
-		return;
-
-	pud = pud_offset(p4d, start & PGDIR_MASK);
-	p4d_clear(p4d);
-	pud_free_tlb(tlb, pud, start & PGDIR_MASK);
-	mm_dec_nr_puds(tlb->mm);
-}
-
-/*
- * This function frees user-level page tables of a process.
- */
-void hugetlb_free_pgd_range(struct mmu_gather *tlb,
-			    unsigned long addr, unsigned long end,
-			    unsigned long floor, unsigned long ceiling)
-{
-	pgd_t *pgd;
-	p4d_t *p4d;
-	unsigned long next;
-
-	/*
-	 * Because there are a number of different possible pagetable
-	 * layouts for hugepage ranges, we limit knowledge of how
-	 * things should be laid out to the allocation path
-	 * (huge_pte_alloc(), above).  Everything else works out the
-	 * structure as it goes from information in the hugepd
-	 * pointers.  That means that we can't here use the
-	 * optimization used in the normal page free_pgd_range(), of
-	 * checking whether we're actually covering a large enough
-	 * range to have to do anything at the top level of the walk
-	 * instead of at the bottom.
-	 *
-	 * To make sense of this, you should probably go read the big
-	 * block comment at the top of the normal free_pgd_range(),
-	 * too.
-	 */
-
-	do {
-		next = pgd_addr_end(addr, end);
-		pgd = pgd_offset(tlb->mm, addr);
-		p4d = p4d_offset(pgd, addr);
-		if (!is_hugepd(__hugepd(pgd_val(*pgd)))) {
-			if (p4d_none_or_clear_bad(p4d))
-				continue;
-			hugetlb_free_pud_range(tlb, p4d, addr, next, floor, ceiling);
-		} else {
-			unsigned long more;
-			/*
-			 * Increment next by the size of the huge mapping since
-			 * there may be more than one entry at the pgd level
-			 * for a single hugepage, but all of them point to the
-			 * same kmem cache that holds the hugepte.
-			 */
-			more = addr + (1UL << hugepd_shift(*(hugepd_t *)pgd));
-			if (more > next)
-				next = more;
-
-			free_hugepd_range(tlb, (hugepd_t *)p4d, PGDIR_SHIFT,
-					  addr, next, floor, ceiling);
-		}
-	} while (addr = next, addr != end);
-}
-#endif
-
 bool __init arch_hugetlb_valid_size(unsigned long size)
 {
 	int shift = __ffs(size);
diff --git a/arch/powerpc/mm/init-common.c b/arch/powerpc/mm/init-common.c
index d3a7726ecf51..024e95c62a2d 100644
--- a/arch/powerpc/mm/init-common.c
+++ b/arch/powerpc/mm/init-common.c
@@ -120,12 +120,8 @@ void pgtable_cache_add(unsigned int shift)
 	/* When batching pgtable pointers for RCU freeing, we store
 	 * the index size in the low bits.  Table alignment must be
 	 * big enough to fit it.
-	 *
-	 * Likewise, hugeapge pagetable pointers contain a (different)
-	 * shift value in the low bits.  All tables must be aligned so
-	 * as to leave enough 0 bits in the address to contain it. */
-	unsigned long minalign = max(MAX_PGTABLE_INDEX_SIZE + 1,
-				     HUGEPD_SHIFT_MASK + 1);
+	 */
+	unsigned long minalign = MAX_PGTABLE_INDEX_SIZE + 1;
 	struct kmem_cache *new = NULL;
 
 	/* It would be nice if this was a BUILD_BUG_ON(), but at the
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 7d4c004cbc75..e1ddfe0174d6 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -406,11 +406,10 @@ unsigned long vmalloc_to_phys(void *va)
 EXPORT_SYMBOL_GPL(vmalloc_to_phys);
 
 /*
- * We have 4 cases for pgds and pmds:
+ * We have 3 cases for pgds and pmds:
  * (1) invalid (all zeroes)
  * (2) pointer to next table, as normal; bottom 6 bits == 0
  * (3) leaf pte for huge page _PAGE_PTE set
- * (4) hugepd pointer, _PAGE_PTE = 0 and bits [2..6] indicate size of table
  *
  * So long as we atomically load page table pointers we are safe against teardown,
  * we can follow the address down to the page and take a ref on it.
@@ -429,7 +428,6 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
 #endif
 	pmd_t pmd, *pmdp;
 	pte_t *ret_pte;
-	hugepd_t *hpdp = NULL;
 	unsigned pdshift;
 
 	if (hpage_shift)
@@ -459,11 +457,6 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
 		goto out;
 	}
 
-	if (is_hugepd(__hugepd(p4d_val(p4d)))) {
-		hpdp = (hugepd_t *)&p4d;
-		goto out_huge;
-	}
-
 	/*
 	 * Even if we end up with an unmap, the pgtable will not
 	 * be freed, because we do an rcu free and here we are
@@ -481,11 +474,6 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
 		goto out;
 	}
 
-	if (is_hugepd(__hugepd(pud_val(pud)))) {
-		hpdp = (hugepd_t *)&pud;
-		goto out_huge;
-	}
-
 	pdshift = PMD_SHIFT;
 	pmdp = pmd_offset(&pud, ea);
 #else
@@ -525,21 +513,8 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
 		goto out;
 	}
 
-	if (is_hugepd(__hugepd(pmd_val(pmd)))) {
-		hpdp = (hugepd_t *)&pmd;
-		goto out_huge;
-	}
-
 	return pte_offset_kernel(&pmd, ea);
 
-out_huge:
-	if (!hpdp)
-		return NULL;
-
-#ifdef CONFIG_ARCH_HAS_HUGEPD
-	ret_pte = hugepte_offset(*hpdp, ea, pdshift);
-	pdshift = hugepd_shift(*hpdp);
-#endif
 out:
 	if (hpage_shift)
 		*hpage_shift = pdshift;
-- 
2.44.0



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [RFC PATCH v2 20/20] mm: Remove CONFIG_ARCH_HAS_HUGEPD
  2024-05-17 18:59 [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Christophe Leroy
                   ` (18 preceding siblings ...)
  2024-05-17 19:00 ` [RFC PATCH v2 19/20] powerpc/mm: Remove hugepd leftovers Christophe Leroy
@ 2024-05-17 19:00 ` Christophe Leroy
  2024-05-17 19:06 ` [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Jason Gunthorpe
  2024-05-23 19:40 ` Peter Xu
  21 siblings, 0 replies; 60+ messages in thread
From: Christophe Leroy @ 2024-05-17 19:00 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, Peter Xu, Oscar Salvador,
	Michael Ellerman, Nicholas Piggin
  Cc: Christophe Leroy, linux-kernel, linux-mm, linuxppc-dev

powerpc was the only user of CONFIG_ARCH_HAS_HUGEPD and doesn't
use it anymore, so remove all related code.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/mm/hugetlbpage.c |   1 -
 include/linux/hugetlb.h       |   6 --
 mm/Kconfig                    |  10 ----
 mm/gup.c                      | 105 +---------------------------------
 mm/pagewalk.c                 |  57 ++----------------
 5 files changed, 5 insertions(+), 174 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 6fad89d7bff3..1df9e4fa1001 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -79,7 +79,6 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 		return NULL;
 	return (pte_t *)pmd;
 }
-#endif
 
 #ifdef CONFIG_PPC_BOOK3S_64
 /*
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index d9c5d9daadc5..c020e3bdf62b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -20,12 +20,6 @@ struct user_struct;
 struct mmu_gather;
 struct node;
 
-#ifndef CONFIG_ARCH_HAS_HUGEPD
-typedef struct { unsigned long pd; } hugepd_t;
-#define is_hugepd(hugepd) (0)
-#define __hugepd(x) ((hugepd_t) { (x) })
-#endif
-
 void free_huge_folio(struct folio *folio);
 
 #ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/Kconfig b/mm/Kconfig
index b1448aa81e15..a52f8e3224fb 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1114,16 +1114,6 @@ config DMAPOOL_TEST
 config ARCH_HAS_PTE_SPECIAL
 	bool
 
-#
-# Some architectures require a special hugepage directory format that is
-# required to support multiple hugepage sizes. For example a4fe3ce76
-# "powerpc/mm: Allow more flexible layouts for hugepage pagetables"
-# introduced it on powerpc.  This allows for a more flexible hugepage
-# pagetable layouts.
-#
-config ARCH_HAS_HUGEPD
-	bool
-
 config MAPPING_DIRTY_HELPERS
         bool
 
diff --git a/mm/gup.c b/mm/gup.c
index 86b5105b82a1..95f121223f04 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2790,89 +2790,6 @@ static int record_subpages(struct page *page, unsigned long addr,
 	return nr;
 }
 
-#ifdef CONFIG_ARCH_HAS_HUGEPD
-static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
-				      unsigned long sz)
-{
-	unsigned long __boundary = (addr + sz) & ~(sz-1);
-	return (__boundary - 1 < end - 1) ? __boundary : end;
-}
-
-static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
-		       unsigned long end, unsigned int flags,
-		       struct page **pages, int *nr)
-{
-	unsigned long pte_end;
-	struct page *page;
-	struct folio *folio;
-	pte_t pte;
-	int refs;
-
-	pte_end = (addr + sz) & ~(sz-1);
-	if (pte_end < end)
-		end = pte_end;
-
-	pte = huge_ptep_get(NULL, addr, ptep);
-
-	if (!pte_access_permitted(pte, flags & FOLL_WRITE))
-		return 0;
-
-	/* hugepages are never "special" */
-	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
-
-	page = nth_page(pte_page(pte), (addr & (sz - 1)) >> PAGE_SHIFT);
-	refs = record_subpages(page, addr, end, pages + *nr);
-
-	folio = try_grab_folio(page, refs, flags);
-	if (!folio)
-		return 0;
-
-	if (unlikely(pte_val(pte) != pte_val(ptep_get(ptep)))) {
-		gup_put_folio(folio, refs, flags);
-		return 0;
-	}
-
-	if (!folio_fast_pin_allowed(folio, flags)) {
-		gup_put_folio(folio, refs, flags);
-		return 0;
-	}
-
-	if (!pte_write(pte) && gup_must_unshare(NULL, flags, &folio->page)) {
-		gup_put_folio(folio, refs, flags);
-		return 0;
-	}
-
-	*nr += refs;
-	folio_set_referenced(folio);
-	return 1;
-}
-
-static int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
-		unsigned int pdshift, unsigned long end, unsigned int flags,
-		struct page **pages, int *nr)
-{
-	pte_t *ptep;
-	unsigned long sz = 1UL << hugepd_shift(hugepd);
-	unsigned long next;
-
-	ptep = hugepte_offset(hugepd, addr, pdshift);
-	do {
-		next = hugepte_addr_end(addr, end, sz);
-		if (!gup_hugepte(ptep, sz, addr, end, flags, pages, nr))
-			return 0;
-	} while (ptep++, addr = next, addr != end);
-
-	return 1;
-}
-#else
-static inline int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
-		unsigned int pdshift, unsigned long end, unsigned int flags,
-		struct page **pages, int *nr)
-{
-	return 0;
-}
-#endif /* CONFIG_ARCH_HAS_HUGEPD */
-
 static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 			unsigned long end, unsigned int flags,
 			struct page **pages, int *nr)
@@ -3026,14 +2943,6 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo
 				pages, nr))
 				return 0;
 
-		} else if (unlikely(is_hugepd(__hugepd(pmd_val(pmd))))) {
-			/*
-			 * architecture have different format for hugetlbfs
-			 * pmd format and THP pmd format
-			 */
-			if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr,
-					 PMD_SHIFT, next, flags, pages, nr))
-				return 0;
 		} else if (!gup_pte_range(pmd, pmdp, addr, next, flags, pages, nr))
 			return 0;
 	} while (pmdp++, addr = next, addr != end);
@@ -3058,10 +2967,6 @@ static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned lo
 			if (!gup_huge_pud(pud, pudp, addr, next, flags,
 					  pages, nr))
 				return 0;
-		} else if (unlikely(is_hugepd(__hugepd(pud_val(pud))))) {
-			if (!gup_huge_pd(__hugepd(pud_val(pud)), addr,
-					 PUD_SHIFT, next, flags, pages, nr))
-				return 0;
 		} else if (!gup_pmd_range(pudp, pud, addr, next, flags, pages, nr))
 			return 0;
 	} while (pudp++, addr = next, addr != end);
@@ -3083,11 +2988,7 @@ static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned lo
 		if (p4d_none(p4d))
 			return 0;
 		BUILD_BUG_ON(p4d_huge(p4d));
-		if (unlikely(is_hugepd(__hugepd(p4d_val(p4d))))) {
-			if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr,
-					 P4D_SHIFT, next, flags, pages, nr))
-				return 0;
-		} else if (!gup_pud_range(p4dp, p4d, addr, next, flags, pages, nr))
+		if (!gup_pud_range(p4dp, p4d, addr, next, flags, pages, nr))
 			return 0;
 	} while (p4dp++, addr = next, addr != end);
 
@@ -3111,10 +3012,6 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
 			if (!gup_huge_pgd(pgd, pgdp, addr, next, flags,
 					  pages, nr))
 				return;
-		} else if (unlikely(is_hugepd(__hugepd(pgd_val(pgd))))) {
-			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
-					 PGDIR_SHIFT, next, flags, pages, nr))
-				return;
 		} else if (!gup_p4d_range(pgdp, pgd, addr, next, flags, pages, nr))
 			return;
 	} while (pgdp++, addr = next, addr != end);
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index f46c80b18ce4..ae2f08ce991b 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -73,45 +73,6 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	return err;
 }
 
-#ifdef CONFIG_ARCH_HAS_HUGEPD
-static int walk_hugepd_range(hugepd_t *phpd, unsigned long addr,
-			     unsigned long end, struct mm_walk *walk, int pdshift)
-{
-	int err = 0;
-	const struct mm_walk_ops *ops = walk->ops;
-	int shift = hugepd_shift(*phpd);
-	int page_size = 1 << shift;
-
-	if (!ops->pte_entry)
-		return 0;
-
-	if (addr & (page_size - 1))
-		return 0;
-
-	for (;;) {
-		pte_t *pte;
-
-		spin_lock(&walk->mm->page_table_lock);
-		pte = hugepte_offset(*phpd, addr, pdshift);
-		err = ops->pte_entry(pte, addr, addr + page_size, walk);
-		spin_unlock(&walk->mm->page_table_lock);
-
-		if (err)
-			break;
-		if (addr >= end - page_size)
-			break;
-		addr += page_size;
-	}
-	return err;
-}
-#else
-static int walk_hugepd_range(hugepd_t *phpd, unsigned long addr,
-			     unsigned long end, struct mm_walk *walk, int pdshift)
-{
-	return 0;
-}
-#endif
-
 static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 			  struct mm_walk *walk)
 {
@@ -159,10 +120,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 		if (walk->vma)
 			split_huge_pmd(walk->vma, pmd, addr);
 
-		if (is_hugepd(__hugepd(pmd_val(*pmd))))
-			err = walk_hugepd_range((hugepd_t *)pmd, addr, next, walk, PMD_SHIFT);
-		else
-			err = walk_pte_range(pmd, addr, next, walk);
+		err = walk_pte_range(pmd, addr, next, walk);
 		if (err)
 			break;
 
@@ -215,10 +173,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 		if (pud_none(*pud))
 			goto again;
 
-		if (is_hugepd(__hugepd(pud_val(*pud))))
-			err = walk_hugepd_range((hugepd_t *)pud, addr, next, walk, PUD_SHIFT);
-		else
-			err = walk_pmd_range(pud, addr, next, walk);
+		err = walk_pmd_range(pud, addr, next, walk);
 		if (err)
 			break;
 	} while (pud++, addr = next, addr != end);
@@ -250,9 +205,7 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 			if (err)
 				break;
 		}
-		if (is_hugepd(__hugepd(p4d_val(*p4d))))
-			err = walk_hugepd_range((hugepd_t *)p4d, addr, next, walk, P4D_SHIFT);
-		else if (ops->pud_entry || ops->pmd_entry || ops->pte_entry)
+		if (ops->pud_entry || ops->pmd_entry || ops->pte_entry)
 			err = walk_pud_range(p4d, addr, next, walk);
 		if (err)
 			break;
@@ -287,9 +240,7 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
 			if (err)
 				break;
 		}
-		if (is_hugepd(__hugepd(pgd_val(*pgd))))
-			err = walk_hugepd_range((hugepd_t *)pgd, addr, next, walk, PGDIR_SHIFT);
-		else if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry || ops->pte_entry)
+		if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry || ops->pte_entry)
 			err = walk_p4d_range(pgd, addr, next, walk);
 		if (err)
 			break;
-- 
2.44.0



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64)
  2024-05-17 18:59 [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Christophe Leroy
                   ` (19 preceding siblings ...)
  2024-05-17 19:00 ` [RFC PATCH v2 20/20] mm: Remove CONFIG_ARCH_HAS_HUGEPD Christophe Leroy
@ 2024-05-17 19:06 ` Jason Gunthorpe
  2024-05-18  6:28   ` Christophe Leroy
  2024-05-23 19:40 ` Peter Xu
  21 siblings, 1 reply; 60+ messages in thread
From: Jason Gunthorpe @ 2024-05-17 19:06 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Andrew Morton, Peter Xu, Oscar Salvador, Michael Ellerman,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev

On Fri, May 17, 2024 at 08:59:54PM +0200, Christophe Leroy wrote:
> This is the continuation of the RFC v1 series "Reimplement huge pages
> without hugepd on powerpc 8xx". It now get rid of hugepd completely
> after handling also e500 and book3s/64

This is really amazing, thank you for doing it!

Jason


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64)
  2024-05-17 19:06 ` [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Jason Gunthorpe
@ 2024-05-18  6:28   ` Christophe Leroy
  0 siblings, 0 replies; 60+ messages in thread
From: Christophe Leroy @ 2024-05-18  6:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Peter Xu, Oscar Salvador, Michael Ellerman,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev



Le 17/05/2024 à 21:06, Jason Gunthorpe a écrit :
> On Fri, May 17, 2024 at 08:59:54PM +0200, Christophe Leroy wrote:
>> This is the continuation of the RFC v1 series "Reimplement huge pages
>> without hugepd on powerpc 8xx". It now get rid of hugepd completely
>> after handling also e500 and book3s/64
> 
> This is really amazing, thank you for doing it!
> 

You are welcome.

I have not yet taken into account your review comments on v1. I first 
wanted to have a global picture.

Christophe

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 01/20] mm: Provide pagesize to pmd_populate()
  2024-05-17 18:59 ` [RFC PATCH v2 01/20] mm: Provide pagesize to pmd_populate() Christophe Leroy
@ 2024-05-20  9:01   ` Oscar Salvador
  2024-05-20 16:24     ` Christophe Leroy
  0 siblings, 1 reply; 60+ messages in thread
From: Oscar Salvador @ 2024-05-20  9:01 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Andrew Morton, Jason Gunthorpe, Peter Xu, Michael Ellerman,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev

On Fri, May 17, 2024 at 08:59:55PM +0200, Christophe Leroy wrote:
> Unlike many architectures, powerpc 8xx hardware tablewalk requires
> a two level process for all page sizes, allthough second level only
> has one entry when pagesize is 8M.

So, I went on a quick reading on

https://www.nxp.com/docs/en/application-note-software/AN3066.pdf

to get more insight, and I realized that some of the questions I made
in v1 were quite dump.

> 
> To fit with Linux page table topology and without requiring special
> page directory layout like hugepd, the page entry will be replicated
> 1024 times in the standard page table. However for large pages it is

You only have to replicate 1024 times in case the page size is 4KB, and you
will have to replicate that twice and have 2 PMDs pointing to it, right?

For 16KB, you will have the PMD containing 512 entries of 16KB.

> necessary to set bits in the level-1 (PMD) entry. At the time being,
> for 512k pages the flag is kept in the PTE and inserted in the PMD
> entry at TLB miss exception, that is necessary because we can have

 rlwimi  r11, r10, 32 - 9, _PMD_PAGE_512K
 mtspr   SPRN_MI_TWC, r11

So we shift the value and compare it to _PMD_PAGE_512K to see if the PTE
is a 512K page, and then we set it to SPRN_MI_TWC which I guess is some
CPU special register?

> pages of different sizes in a page table. However the 12 PTE bits are
> fully used and there is no room for an additional bit for page size.

You are referring to the bits in
arch/powerpc/include/asm/nohash/32/pte-8xx.h ?

> For 8M pages, there will be only one page per PMD entry, it is
> therefore possible to flag the pagesize in the PMD entry, with the

I am confused, and it might be just terminology, or I am getting wrong
the design.
You say that for 8MB pages, there will one page per PMD entry, but
based on the above, you will have 1024 entries (replicated)?
So, maybe this wanted to be read as "there will be only one page size per PMD
entry".

> advantage that the information will already be at the right place for
> the hardware.
> 
> To do so, add a new helper called pmd_populate_size() which takes the
> page size as an additional argument, and modify __pte_alloc() to also

"page size" makes me thing of the standart page size the kernel is
operating on (aka PAGE_SIZE), but it is actually the size of the huge
page, so I think we should clarify it.

> take that argument. pte_alloc() is left unmodified in order to
> reduce churn on callers, and a pte_alloc_size() is added for use by
> pte_alloc_huge().
> 
> When an architecture doesn't provide pmd_populate_size(),
> pmd_populate() is used as a fallback.

It is a bit unfortunate that we have to touch the code for other
architectures (in patch#2)

> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>

So far I only looked at this patch and patch#2, and code-wise looks good and
makes sense,  but I find it a bit unfortunate that we have to touch general
definitons and arch code (done in patch#2 and patch#3), and I hoped we could
somehow isolate this, but I could not think of a way.

I will give it some more though.

> ---
>  include/linux/mm.h | 12 +++++++-----
>  mm/filemap.c       |  2 +-
>  mm/internal.h      |  2 +-
>  mm/memory.c        | 19 ++++++++++++-------
>  mm/pgalloc-track.h |  2 +-
>  mm/userfaultfd.c   |  4 ++--
>  6 files changed, 24 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b6bdaa18b9e9..158cb87bc604 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2803,8 +2803,8 @@ static inline void mm_inc_nr_ptes(struct mm_struct *mm) {}
>  static inline void mm_dec_nr_ptes(struct mm_struct *mm) {}
>  #endif
>  
> -int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);
> -int __pte_alloc_kernel(pmd_t *pmd);
> +int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long sz);
> +int __pte_alloc_kernel(pmd_t *pmd, unsigned long sz);
>  
>  #if defined(CONFIG_MMU)
>  
> @@ -2989,7 +2989,8 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>  	pte_unmap(pte);					\
>  } while (0)
>  
> -#define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
> +#define pte_alloc_size(mm, pmd, sz) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd, sz))
> +#define pte_alloc(mm, pmd) pte_alloc_size(mm, pmd, PAGE_SIZE)
>  
>  #define pte_alloc_map(mm, pmd, address)			\
>  	(pte_alloc(mm, pmd) ? NULL : pte_offset_map(pmd, address))
> @@ -2998,9 +2999,10 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>  	(pte_alloc(mm, pmd) ?			\
>  		 NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
>  
> -#define pte_alloc_kernel(pmd, address)			\
> -	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
> +#define pte_alloc_kernel_size(pmd, address, sz)			\
> +	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd, sz))? \
>  		NULL: pte_offset_kernel(pmd, address))
> +#define pte_alloc_kernel(pmd, address)	pte_alloc_kernel_size(pmd, address, PAGE_SIZE)
>  
>  #if USE_SPLIT_PMD_PTLOCKS
>  
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 30de18c4fd28..5a783063d1f6 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3428,7 +3428,7 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct folio *folio,
>  	}
>  
>  	if (pmd_none(*vmf->pmd) && vmf->prealloc_pte)
> -		pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
> +		pmd_install(mm, vmf->pmd, &vmf->prealloc_pte, PAGE_SIZE);
>  
>  	return false;
>  }
> diff --git a/mm/internal.h b/mm/internal.h
> index 07ad2675a88b..4a01bbf55264 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -206,7 +206,7 @@ void folio_activate(struct folio *folio);
>  void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
>  		   struct vm_area_struct *start_vma, unsigned long floor,
>  		   unsigned long ceiling, bool mm_wr_locked);
> -void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
> +void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte, unsigned long sz);
>  
>  struct zap_details;
>  void unmap_page_range(struct mmu_gather *tlb,
> diff --git a/mm/memory.c b/mm/memory.c
> index d2155ced45f8..2a9eba13a95f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -409,7 +409,12 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
>  	} while (vma);
>  }
>  
> -void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
> +#ifndef pmd_populate_size
> +#define pmd_populate_size(mm, pmdp, pte, sz) pmd_populate(mm, pmdp, pte)
> +#define pmd_populate_kernel_size(mm, pmdp, pte, sz) pmd_populate_kernel(mm, pmdp, pte)
> +#endif
> +
> +void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte, unsigned long sz)
>  {
>  	spinlock_t *ptl = pmd_lock(mm, pmd);
>  
> @@ -429,25 +434,25 @@ void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
>  		 * smp_rmb() barriers in page table walking code.
>  		 */
>  		smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
> -		pmd_populate(mm, pmd, *pte);
> +		pmd_populate_size(mm, pmd, *pte, sz);
>  		*pte = NULL;
>  	}
>  	spin_unlock(ptl);
>  }
>  
> -int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
> +int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long sz)
>  {
>  	pgtable_t new = pte_alloc_one(mm);
>  	if (!new)
>  		return -ENOMEM;
>  
> -	pmd_install(mm, pmd, &new);
> +	pmd_install(mm, pmd, &new, sz);
>  	if (new)
>  		pte_free(mm, new);
>  	return 0;
>  }
>  
> -int __pte_alloc_kernel(pmd_t *pmd)
> +int __pte_alloc_kernel(pmd_t *pmd, unsigned long sz)
>  {
>  	pte_t *new = pte_alloc_one_kernel(&init_mm);
>  	if (!new)
> @@ -456,7 +461,7 @@ int __pte_alloc_kernel(pmd_t *pmd)
>  	spin_lock(&init_mm.page_table_lock);
>  	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
>  		smp_wmb(); /* See comment in pmd_install() */
> -		pmd_populate_kernel(&init_mm, pmd, new);
> +		pmd_populate_kernel_size(&init_mm, pmd, new, sz);
>  		new = NULL;
>  	}
>  	spin_unlock(&init_mm.page_table_lock);
> @@ -4740,7 +4745,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>  		}
>  
>  		if (vmf->prealloc_pte)
> -			pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);
> +			pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte, PAGE_SIZE);
>  		else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
>  			return VM_FAULT_OOM;
>  	}
> diff --git a/mm/pgalloc-track.h b/mm/pgalloc-track.h
> index e9e879de8649..90e37de7ab77 100644
> --- a/mm/pgalloc-track.h
> +++ b/mm/pgalloc-track.h
> @@ -45,7 +45,7 @@ static inline pmd_t *pmd_alloc_track(struct mm_struct *mm, pud_t *pud,
>  
>  #define pte_alloc_kernel_track(pmd, address, mask)			\
>  	((unlikely(pmd_none(*(pmd))) &&					\
> -	  (__pte_alloc_kernel(pmd) || ({*(mask)|=PGTBL_PMD_MODIFIED;0;})))?\
> +	  (__pte_alloc_kernel(pmd, PAGE_SIZE) || ({*(mask)|=PGTBL_PMD_MODIFIED;0;})))?\
>  		NULL: pte_offset_kernel(pmd, address))
>  
>  #endif /* _LINUX_PGALLOC_TRACK_H */
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 3c3539c573e7..0f129d5c5aa2 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -764,7 +764,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
>  			break;
>  		}
>  		if (unlikely(pmd_none(dst_pmdval)) &&
> -		    unlikely(__pte_alloc(dst_mm, dst_pmd))) {
> +		    unlikely(__pte_alloc(dst_mm, dst_pmd, PAGE_SIZE))) {
>  			err = -ENOMEM;
>  			break;
>  		}
> @@ -1687,7 +1687,7 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
>  					err = -ENOENT;
>  					break;
>  				}
> -				if (unlikely(__pte_alloc(mm, src_pmd))) {
> +				if (unlikely(__pte_alloc(mm, src_pmd, PAGE_SIZE))) {
>  					err = -ENOMEM;
>  					break;
>  				}
> -- 
> 2.44.0
> 

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 06/20] powerpc/8xx: Fix size given to set_huge_pte_at()
  2024-05-17 19:00 ` [RFC PATCH v2 06/20] powerpc/8xx: Fix size given to set_huge_pte_at() Christophe Leroy
@ 2024-05-20  9:14   ` Oscar Salvador
  2024-05-20 16:31     ` Christophe Leroy
  0 siblings, 1 reply; 60+ messages in thread
From: Oscar Salvador @ 2024-05-20  9:14 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Andrew Morton, Jason Gunthorpe, Peter Xu, Michael Ellerman,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev

On Fri, May 17, 2024 at 09:00:00PM +0200, Christophe Leroy wrote:
> set_huge_pte_at() expects the real page size, not the psize which is

"expects the size of the huge page" sounds bettter? 

> the index of the page definition in table mmu_psize_defs[]
> 
> Fixes: 935d4f0c6dc8 ("mm: hugetlb: add huge page size param to set_huge_pte_at()")
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

AFAICS, this fixup is not related to the series, right? (yes, you will
the parameter later)
I would have it at the very beginning of the series.


> ---
>  arch/powerpc/mm/nohash/8xx.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c
> index 43d4842bb1c7..d93433e26ded 100644
> --- a/arch/powerpc/mm/nohash/8xx.c
> +++ b/arch/powerpc/mm/nohash/8xx.c
> @@ -94,7 +94,8 @@ static int __ref __early_map_kernel_hugepage(unsigned long va, phys_addr_t pa,
>  		return -EINVAL;
>  
>  	set_huge_pte_at(&init_mm, va, ptep,
> -			pte_mkhuge(pfn_pte(pa >> PAGE_SHIFT, prot)), psize);
> +			pte_mkhuge(pfn_pte(pa >> PAGE_SHIFT, prot)),
> +			1UL << mmu_psize_to_shift(psize));
>  
>  	return 0;
>  }
> -- 
> 2.44.0
> 

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 18/20] powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD
  2024-05-17 19:00 ` [RFC PATCH v2 18/20] powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD Christophe Leroy
@ 2024-05-20 12:54   ` Nicholas Piggin
  2024-05-20 16:43     ` Christophe Leroy
  0 siblings, 1 reply; 60+ messages in thread
From: Nicholas Piggin @ 2024-05-20 12:54 UTC (permalink / raw)
  To: Christophe Leroy, Andrew Morton, Jason Gunthorpe, Peter Xu,
	Oscar Salvador, Michael Ellerman
  Cc: linux-kernel, linux-mm, linuxppc-dev

On Sat May 18, 2024 at 5:00 AM AEST, Christophe Leroy wrote:
> On book3s/64, the only user of hugepd is hash in 4k mode.
>
> All other setups (hash-64, radix-4, radix-64) use leaf PMD/PUD.
>
> Rework hash-4k to use contiguous PMD and PUD instead.
>
> In that setup there are only two huge page sizes: 16M and 16G.
>
> 16M sits at PMD level and 16G at PUD level.
>
> pte_update doesn't know page size, lets use the same trick as
> hpte_need_flush() to get page size from segment properties. That's
> not the most efficient way but let's do that until callers of
> pte_update() provide page size instead of just a huge flag.
>
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> ---
>  arch/powerpc/include/asm/book3s/64/hash-4k.h  | 15 --------
>  arch/powerpc/include/asm/book3s/64/hash.h     | 38 +++++++++++++++----
>  arch/powerpc/include/asm/book3s/64/hugetlb.h  | 38 -------------------
>  .../include/asm/book3s/64/pgtable-4k.h        | 34 -----------------
>  .../include/asm/book3s/64/pgtable-64k.h       | 20 ----------
>  arch/powerpc/include/asm/hugetlb.h            |  4 ++
>  .../include/asm/nohash/32/hugetlb-8xx.h       |  4 --
>  .../powerpc/include/asm/nohash/hugetlb-e500.h |  4 --
>  arch/powerpc/include/asm/page.h               |  8 ----
>  arch/powerpc/mm/book3s64/hash_utils.c         | 11 ++++--
>  arch/powerpc/mm/book3s64/pgtable.c            | 12 ------
>  arch/powerpc/mm/hugetlbpage.c                 | 19 ----------
>  arch/powerpc/mm/pgtable.c                     |  2 +-
>  arch/powerpc/platforms/Kconfig.cputype        |  1 -
>  14 files changed, 43 insertions(+), 167 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> index 6472b08fa1b0..c654c376ef8b 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> @@ -74,21 +74,6 @@
>  #define remap_4k_pfn(vma, addr, pfn, prot)	\
>  	remap_pfn_range((vma), (addr), (pfn), PAGE_SIZE, (prot))
>  
> -#ifdef CONFIG_HUGETLB_PAGE
> -static inline int hash__hugepd_ok(hugepd_t hpd)
> -{
> -	unsigned long hpdval = hpd_val(hpd);
> -	/*
> -	 * if it is not a pte and have hugepd shift mask
> -	 * set, then it is a hugepd directory pointer
> -	 */
> -	if (!(hpdval & _PAGE_PTE) && (hpdval & _PAGE_PRESENT) &&
> -	    ((hpdval & HUGEPD_SHIFT_MASK) != 0))
> -		return true;
> -	return false;
> -}
> -#endif
> -
>  /*
>   * 4K PTE format is different from 64K PTE format. Saving the hash_slot is just
>   * a matter of returning the PTE bits that need to be modified. On 64K PTE,
> diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
> index faf3e3b4e4b2..509811ca7695 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash.h
> @@ -4,6 +4,7 @@
>  #ifdef __KERNEL__
>  
>  #include <asm/asm-const.h>
> +#include <asm/book3s/64/slice.h>
>  
>  /*
>   * Common bits between 4K and 64K pages in a linux-style PTE.
> @@ -161,14 +162,10 @@ extern void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
>  			    pte_t *ptep, unsigned long pte, int huge);
>  unsigned long htab_convert_pte_flags(unsigned long pteflags, unsigned long flags);
>  /* Atomic PTE updates */
> -static inline unsigned long hash__pte_update(struct mm_struct *mm,
> -					 unsigned long addr,
> -					 pte_t *ptep, unsigned long clr,
> -					 unsigned long set,
> -					 int huge)
> +static inline unsigned long hash__pte_update_one(pte_t *ptep, unsigned long clr,
> +						 unsigned long set)
>  {
>  	__be64 old_be, tmp_be;
> -	unsigned long old;
>  
>  	__asm__ __volatile__(
>  	"1:	ldarx	%0,0,%3		# pte_update\n\
> @@ -182,11 +179,38 @@ static inline unsigned long hash__pte_update(struct mm_struct *mm,
>  	: "r" (ptep), "r" (cpu_to_be64(clr)), "m" (*ptep),
>  	  "r" (cpu_to_be64(H_PAGE_BUSY)), "r" (cpu_to_be64(set))
>  	: "cc" );
> +
> +	return be64_to_cpu(old_be);
> +}
> +
> +static inline unsigned long hash__pte_update(struct mm_struct *mm,
> +					 unsigned long addr,
> +					 pte_t *ptep, unsigned long clr,
> +					 unsigned long set,
> +					 int huge)
> +{
> +	unsigned long old;
> +
> +	old = hash__pte_update_one(ptep, clr, set);
> +
> +	if (huge && IS_ENABLED(CONFIG_PPC_4K_PAGES)) {
> +		unsigned int psize = get_slice_psize(mm, addr);
> +		int nb, i;
> +
> +		if (psize == MMU_PAGE_16M)
> +			nb = SZ_16M / PMD_SIZE;
> +		else if (psize == MMU_PAGE_16G)
> +			nb = SZ_16G / PUD_SIZE;
> +		else
> +			nb = 1;
> +
> +		for (i = 1; i < nb; i++)
> +			hash__pte_update_one(ptep + i, clr, set);
> +	}
>  	/* huge pages use the old page table lock */
>  	if (!huge)
>  		assert_pte_locked(mm, addr);
>  
> -	old = be64_to_cpu(old_be);
>  	if (old & H_PAGE_HASHPTE)
>  		hpte_need_flush(mm, addr, ptep, old, huge);
>  

Nice series, I don't know this hugepd code very well but I'll try.
Why do you have to replicate the PTE entry here? The hash table refill
should always be working on the first PTE of the page otherwise we have
bigger problems.

What paths look at the N > 0 PTEs of a contiguous page entry?

Thanks,
Nick


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 01/20] mm: Provide pagesize to pmd_populate()
  2024-05-20  9:01   ` Oscar Salvador
@ 2024-05-20 16:24     ` Christophe Leroy
  2024-05-21 11:57       ` Oscar Salvador
  0 siblings, 1 reply; 60+ messages in thread
From: Christophe Leroy @ 2024-05-20 16:24 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Jason Gunthorpe, Peter Xu, Michael Ellerman,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev



Le 20/05/2024 à 11:01, Oscar Salvador a écrit :
> On Fri, May 17, 2024 at 08:59:55PM +0200, Christophe Leroy wrote:
>> Unlike many architectures, powerpc 8xx hardware tablewalk requires
>> a two level process for all page sizes, allthough second level only
>> has one entry when pagesize is 8M.
> 
> So, I went on a quick reading on
> 
> https://www.nxp.com/docs/en/application-note-software/AN3066.pdf
> 
> to get more insight, and I realized that some of the questions I made
> in v1 were quite dump.

I had a quick look at that document and it seems to provide a good 
summary of MMU features and principles. However there are some 
theoritical information which is not fully right in practice. For 
instance when they say "Segment attributes. These fields define 
attributes common to all pages in this segment.". This is right in 
theory if you consider it from Linux page table topology point of view, 
hence what they call a segment is a PMD entry for Linux. However, in 
practice each page has its own L1 and L2 attributes and there is not 
requirement at HW level to have all L1 attributes of all pages of a 
segment the same.

> 
>>
>> To fit with Linux page table topology and without requiring special
>> page directory layout like hugepd, the page entry will be replicated
>> 1024 times in the standard page table. However for large pages it is
> 
> You only have to replicate 1024 times in case the page size is 4KB, and you
> will have to replicate that twice and have 2 PMDs pointing to it, right?

Indeed.

> 
> For 16KB, you will have the PMD containing 512 entries of 16KB.

Exactly.

> 
>> necessary to set bits in the level-1 (PMD) entry. At the time being,
>> for 512k pages the flag is kept in the PTE and inserted in the PMD
>> entry at TLB miss exception, that is necessary because we can have
> 
>   rlwimi  r11, r10, 32 - 9, _PMD_PAGE_512K

rlwimi = Rotate Left Word Immediate then Mask Insert. Here it rotates 
r10 by 23 bits to the left (or 9 to the right) then masks with 
_PMD_PAGE_512K and inserts it into r11.

It means _PAGE_HUGE bit is copied into lower bit of PS attribute.

PS takes the following values:

PS = 00 ==> Small page (4k or 16k)
PS = 01 ==> 512k page
PS = 10 ==> Undefined
PS = 11 ==> 8M page

>   mtspr   SPRN_MI_TWC, r11
> 
> So we shift the value and compare it to _PMD_PAGE_512K to see if the PTE
> is a 512K page, and then we set it to SPRN_MI_TWC which I guess is some
> CPU special register?

TWC is where you store the Level 1 attributes, see figure 3 in the 
document you mentioned.

> 
>> pages of different sizes in a page table. However the 12 PTE bits are
>> fully used and there is no room for an additional bit for page size.
> 
> You are referring to the bits in
> arch/powerpc/include/asm/nohash/32/pte-8xx.h ?

Yes, page are 4k so only the 12 lower bits are available to encode PTE 
bits and all are used.

> 
>> For 8M pages, there will be only one page per PMD entry, it is
>> therefore possible to flag the pagesize in the PMD entry, with the
> 
> I am confused, and it might be just terminology, or I am getting wrong
> the design.
> You say that for 8MB pages, there will one page per PMD entry, but
> based on the above, you will have 1024 entries (replicated)?
> So, maybe this wanted to be read as "there will be only one page size per PMD
> entry".

You have 1024 entries in the PTE table. The PMD entry points to that 
table were all 1024 entries are the same because they all define the 
same (half) of a 8M page.

So you are also right, there is only one page size because there is only 
one 8M page.

> 
>> advantage that the information will already be at the right place for
>> the hardware.
>>
>> To do so, add a new helper called pmd_populate_size() which takes the
>> page size as an additional argument, and modify __pte_alloc() to also
> 
> "page size" makes me thing of the standart page size the kernel is
> operating on (aka PAGE_SIZE), but it is actually the size of the huge
> page, so I think we should clarify it.

Page size means "size of the page".

> 
>> take that argument. pte_alloc() is left unmodified in order to
>> reduce churn on callers, and a pte_alloc_size() is added for use by
>> pte_alloc_huge().
>>
>> When an architecture doesn't provide pmd_populate_size(),
>> pmd_populate() is used as a fallback.
> 
> It is a bit unfortunate that we have to touch the code for other
> architectures (in patch#2)

That's a RFC, all ideas are welcome, I needed something to replace 
hugepd_populate()

> 
>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> 
> So far I only looked at this patch and patch#2, and code-wise looks good and
> makes sense,  but I find it a bit unfortunate that we have to touch general
> definitons and arch code (done in patch#2 and patch#3), and I hoped we could
> somehow isolate this, but I could not think of a way.
> 
> I will give it some more though.
> 
>> ---
>>   include/linux/mm.h | 12 +++++++-----
>>   mm/filemap.c       |  2 +-
>>   mm/internal.h      |  2 +-
>>   mm/memory.c        | 19 ++++++++++++-------
>>   mm/pgalloc-track.h |  2 +-
>>   mm/userfaultfd.c   |  4 ++--
>>   6 files changed, 24 insertions(+), 17 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index b6bdaa18b9e9..158cb87bc604 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2803,8 +2803,8 @@ static inline void mm_inc_nr_ptes(struct mm_struct *mm) {}
>>   static inline void mm_dec_nr_ptes(struct mm_struct *mm) {}
>>   #endif
>>   
>> -int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);
>> -int __pte_alloc_kernel(pmd_t *pmd);
>> +int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long sz);
>> +int __pte_alloc_kernel(pmd_t *pmd, unsigned long sz);
>>   
>>   #if defined(CONFIG_MMU)
>>   
>> @@ -2989,7 +2989,8 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>>   	pte_unmap(pte);					\
>>   } while (0)
>>   
>> -#define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
>> +#define pte_alloc_size(mm, pmd, sz) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd, sz))
>> +#define pte_alloc(mm, pmd) pte_alloc_size(mm, pmd, PAGE_SIZE)
>>   
>>   #define pte_alloc_map(mm, pmd, address)			\
>>   	(pte_alloc(mm, pmd) ? NULL : pte_offset_map(pmd, address))
>> @@ -2998,9 +2999,10 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>>   	(pte_alloc(mm, pmd) ?			\
>>   		 NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
>>   
>> -#define pte_alloc_kernel(pmd, address)			\
>> -	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
>> +#define pte_alloc_kernel_size(pmd, address, sz)			\
>> +	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd, sz))? \
>>   		NULL: pte_offset_kernel(pmd, address))
>> +#define pte_alloc_kernel(pmd, address)	pte_alloc_kernel_size(pmd, address, PAGE_SIZE)
>>   
>>   #if USE_SPLIT_PMD_PTLOCKS
>>   
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 30de18c4fd28..5a783063d1f6 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -3428,7 +3428,7 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct folio *folio,
>>   	}
>>   
>>   	if (pmd_none(*vmf->pmd) && vmf->prealloc_pte)
>> -		pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
>> +		pmd_install(mm, vmf->pmd, &vmf->prealloc_pte, PAGE_SIZE);
>>   
>>   	return false;
>>   }
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 07ad2675a88b..4a01bbf55264 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -206,7 +206,7 @@ void folio_activate(struct folio *folio);
>>   void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
>>   		   struct vm_area_struct *start_vma, unsigned long floor,
>>   		   unsigned long ceiling, bool mm_wr_locked);
>> -void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
>> +void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte, unsigned long sz);
>>   
>>   struct zap_details;
>>   void unmap_page_range(struct mmu_gather *tlb,
>> diff --git a/mm/memory.c b/mm/memory.c
>> index d2155ced45f8..2a9eba13a95f 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -409,7 +409,12 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
>>   	} while (vma);
>>   }
>>   
>> -void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
>> +#ifndef pmd_populate_size
>> +#define pmd_populate_size(mm, pmdp, pte, sz) pmd_populate(mm, pmdp, pte)
>> +#define pmd_populate_kernel_size(mm, pmdp, pte, sz) pmd_populate_kernel(mm, pmdp, pte)
>> +#endif
>> +
>> +void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte, unsigned long sz)
>>   {
>>   	spinlock_t *ptl = pmd_lock(mm, pmd);
>>   
>> @@ -429,25 +434,25 @@ void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
>>   		 * smp_rmb() barriers in page table walking code.
>>   		 */
>>   		smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
>> -		pmd_populate(mm, pmd, *pte);
>> +		pmd_populate_size(mm, pmd, *pte, sz);
>>   		*pte = NULL;
>>   	}
>>   	spin_unlock(ptl);
>>   }
>>   
>> -int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
>> +int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long sz)
>>   {
>>   	pgtable_t new = pte_alloc_one(mm);
>>   	if (!new)
>>   		return -ENOMEM;
>>   
>> -	pmd_install(mm, pmd, &new);
>> +	pmd_install(mm, pmd, &new, sz);
>>   	if (new)
>>   		pte_free(mm, new);
>>   	return 0;
>>   }
>>   
>> -int __pte_alloc_kernel(pmd_t *pmd)
>> +int __pte_alloc_kernel(pmd_t *pmd, unsigned long sz)
>>   {
>>   	pte_t *new = pte_alloc_one_kernel(&init_mm);
>>   	if (!new)
>> @@ -456,7 +461,7 @@ int __pte_alloc_kernel(pmd_t *pmd)
>>   	spin_lock(&init_mm.page_table_lock);
>>   	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
>>   		smp_wmb(); /* See comment in pmd_install() */
>> -		pmd_populate_kernel(&init_mm, pmd, new);
>> +		pmd_populate_kernel_size(&init_mm, pmd, new, sz);
>>   		new = NULL;
>>   	}
>>   	spin_unlock(&init_mm.page_table_lock);
>> @@ -4740,7 +4745,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>   		}
>>   
>>   		if (vmf->prealloc_pte)
>> -			pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);
>> +			pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte, PAGE_SIZE);
>>   		else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
>>   			return VM_FAULT_OOM;
>>   	}
>> diff --git a/mm/pgalloc-track.h b/mm/pgalloc-track.h
>> index e9e879de8649..90e37de7ab77 100644
>> --- a/mm/pgalloc-track.h
>> +++ b/mm/pgalloc-track.h
>> @@ -45,7 +45,7 @@ static inline pmd_t *pmd_alloc_track(struct mm_struct *mm, pud_t *pud,
>>   
>>   #define pte_alloc_kernel_track(pmd, address, mask)			\
>>   	((unlikely(pmd_none(*(pmd))) &&					\
>> -	  (__pte_alloc_kernel(pmd) || ({*(mask)|=PGTBL_PMD_MODIFIED;0;})))?\
>> +	  (__pte_alloc_kernel(pmd, PAGE_SIZE) || ({*(mask)|=PGTBL_PMD_MODIFIED;0;})))?\
>>   		NULL: pte_offset_kernel(pmd, address))
>>   
>>   #endif /* _LINUX_PGALLOC_TRACK_H */
>> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
>> index 3c3539c573e7..0f129d5c5aa2 100644
>> --- a/mm/userfaultfd.c
>> +++ b/mm/userfaultfd.c
>> @@ -764,7 +764,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
>>   			break;
>>   		}
>>   		if (unlikely(pmd_none(dst_pmdval)) &&
>> -		    unlikely(__pte_alloc(dst_mm, dst_pmd))) {
>> +		    unlikely(__pte_alloc(dst_mm, dst_pmd, PAGE_SIZE))) {
>>   			err = -ENOMEM;
>>   			break;
>>   		}
>> @@ -1687,7 +1687,7 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
>>   					err = -ENOENT;
>>   					break;
>>   				}
>> -				if (unlikely(__pte_alloc(mm, src_pmd))) {
>> +				if (unlikely(__pte_alloc(mm, src_pmd, PAGE_SIZE))) {
>>   					err = -ENOMEM;
>>   					break;
>>   				}
>> -- 
>> 2.44.0
>>
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 06/20] powerpc/8xx: Fix size given to set_huge_pte_at()
  2024-05-20  9:14   ` Oscar Salvador
@ 2024-05-20 16:31     ` Christophe Leroy
  2024-05-20 17:42       ` Oscar Salvador
  2024-05-21  0:48       ` Michael Ellerman
  0 siblings, 2 replies; 60+ messages in thread
From: Christophe Leroy @ 2024-05-20 16:31 UTC (permalink / raw)
  To: Oscar Salvador, Michael Ellerman
  Cc: Andrew Morton, Jason Gunthorpe, Peter Xu, Nicholas Piggin,
	linux-kernel, linux-mm, linuxppc-dev

Hi Oscar, hi Michael,

Le 20/05/2024 à 11:14, Oscar Salvador a écrit :
> On Fri, May 17, 2024 at 09:00:00PM +0200, Christophe Leroy wrote:
>> set_huge_pte_at() expects the real page size, not the psize which is
> 
> "expects the size of the huge page" sounds bettter?

Parameter 'pzize' already provides the size of the hugepage, but not in 
the way set_huge_pte_at() expects it.

psize has one of the values defined by MMU_PAGE_XXX macros defined in 
arch/powerpc/include/asm/mmu.h while set_huge_pte_at() expects the size 
as a value.


> 
>> the index of the page definition in table mmu_psize_defs[]
>>
>> Fixes: 935d4f0c6dc8 ("mm: hugetlb: add huge page size param to set_huge_pte_at()")
>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> 
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> 
> AFAICS, this fixup is not related to the series, right? (yes, you will
> the parameter later)
> I would have it at the very beginning of the series.

You are right, I should have submitted it separately.

Michael can you take it as a fix for 6.10 ?

> 
> 
>> ---
>>   arch/powerpc/mm/nohash/8xx.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c
>> index 43d4842bb1c7..d93433e26ded 100644
>> --- a/arch/powerpc/mm/nohash/8xx.c
>> +++ b/arch/powerpc/mm/nohash/8xx.c
>> @@ -94,7 +94,8 @@ static int __ref __early_map_kernel_hugepage(unsigned long va, phys_addr_t pa,
>>   		return -EINVAL;
>>   
>>   	set_huge_pte_at(&init_mm, va, ptep,
>> -			pte_mkhuge(pfn_pte(pa >> PAGE_SHIFT, prot)), psize);
>> +			pte_mkhuge(pfn_pte(pa >> PAGE_SHIFT, prot)),
>> +			1UL << mmu_psize_to_shift(psize));
>>   
>>   	return 0;
>>   }
>> -- 
>> 2.44.0
>>
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 18/20] powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD
  2024-05-20 12:54   ` Nicholas Piggin
@ 2024-05-20 16:43     ` Christophe Leroy
  2024-05-22  1:13       ` Nicholas Piggin
  0 siblings, 1 reply; 60+ messages in thread
From: Christophe Leroy @ 2024-05-20 16:43 UTC (permalink / raw)
  To: Nicholas Piggin, Andrew Morton, Jason Gunthorpe, Peter Xu,
	Oscar Salvador, Michael Ellerman
  Cc: linux-kernel, linux-mm, linuxppc-dev



Le 20/05/2024 à 14:54, Nicholas Piggin a écrit :
> On Sat May 18, 2024 at 5:00 AM AEST, Christophe Leroy wrote:
>> On book3s/64, the only user of hugepd is hash in 4k mode.
>>
>> All other setups (hash-64, radix-4, radix-64) use leaf PMD/PUD.
>>
>> Rework hash-4k to use contiguous PMD and PUD instead.
>>
>> In that setup there are only two huge page sizes: 16M and 16G.
>>
>> 16M sits at PMD level and 16G at PUD level.
>>
>> pte_update doesn't know page size, lets use the same trick as
>> hpte_need_flush() to get page size from segment properties. That's
>> not the most efficient way but let's do that until callers of
>> pte_update() provide page size instead of just a huge flag.
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
>> ---
>>   arch/powerpc/include/asm/book3s/64/hash-4k.h  | 15 --------
>>   arch/powerpc/include/asm/book3s/64/hash.h     | 38 +++++++++++++++----
>>   arch/powerpc/include/asm/book3s/64/hugetlb.h  | 38 -------------------
>>   .../include/asm/book3s/64/pgtable-4k.h        | 34 -----------------
>>   .../include/asm/book3s/64/pgtable-64k.h       | 20 ----------
>>   arch/powerpc/include/asm/hugetlb.h            |  4 ++
>>   .../include/asm/nohash/32/hugetlb-8xx.h       |  4 --
>>   .../powerpc/include/asm/nohash/hugetlb-e500.h |  4 --
>>   arch/powerpc/include/asm/page.h               |  8 ----
>>   arch/powerpc/mm/book3s64/hash_utils.c         | 11 ++++--
>>   arch/powerpc/mm/book3s64/pgtable.c            | 12 ------
>>   arch/powerpc/mm/hugetlbpage.c                 | 19 ----------
>>   arch/powerpc/mm/pgtable.c                     |  2 +-
>>   arch/powerpc/platforms/Kconfig.cputype        |  1 -
>>   14 files changed, 43 insertions(+), 167 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
>> index 6472b08fa1b0..c654c376ef8b 100644
>> --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
>> +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
>> @@ -74,21 +74,6 @@
>>   #define remap_4k_pfn(vma, addr, pfn, prot)	\
>>   	remap_pfn_range((vma), (addr), (pfn), PAGE_SIZE, (prot))
>>   
>> -#ifdef CONFIG_HUGETLB_PAGE
>> -static inline int hash__hugepd_ok(hugepd_t hpd)
>> -{
>> -	unsigned long hpdval = hpd_val(hpd);
>> -	/*
>> -	 * if it is not a pte and have hugepd shift mask
>> -	 * set, then it is a hugepd directory pointer
>> -	 */
>> -	if (!(hpdval & _PAGE_PTE) && (hpdval & _PAGE_PRESENT) &&
>> -	    ((hpdval & HUGEPD_SHIFT_MASK) != 0))
>> -		return true;
>> -	return false;
>> -}
>> -#endif
>> -
>>   /*
>>    * 4K PTE format is different from 64K PTE format. Saving the hash_slot is just
>>    * a matter of returning the PTE bits that need to be modified. On 64K PTE,
>> diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
>> index faf3e3b4e4b2..509811ca7695 100644
>> --- a/arch/powerpc/include/asm/book3s/64/hash.h
>> +++ b/arch/powerpc/include/asm/book3s/64/hash.h
>> @@ -4,6 +4,7 @@
>>   #ifdef __KERNEL__
>>   
>>   #include <asm/asm-const.h>
>> +#include <asm/book3s/64/slice.h>
>>   
>>   /*
>>    * Common bits between 4K and 64K pages in a linux-style PTE.
>> @@ -161,14 +162,10 @@ extern void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
>>   			    pte_t *ptep, unsigned long pte, int huge);
>>   unsigned long htab_convert_pte_flags(unsigned long pteflags, unsigned long flags);
>>   /* Atomic PTE updates */
>> -static inline unsigned long hash__pte_update(struct mm_struct *mm,
>> -					 unsigned long addr,
>> -					 pte_t *ptep, unsigned long clr,
>> -					 unsigned long set,
>> -					 int huge)
>> +static inline unsigned long hash__pte_update_one(pte_t *ptep, unsigned long clr,
>> +						 unsigned long set)
>>   {
>>   	__be64 old_be, tmp_be;
>> -	unsigned long old;
>>   
>>   	__asm__ __volatile__(
>>   	"1:	ldarx	%0,0,%3		# pte_update\n\
>> @@ -182,11 +179,38 @@ static inline unsigned long hash__pte_update(struct mm_struct *mm,
>>   	: "r" (ptep), "r" (cpu_to_be64(clr)), "m" (*ptep),
>>   	  "r" (cpu_to_be64(H_PAGE_BUSY)), "r" (cpu_to_be64(set))
>>   	: "cc" );
>> +
>> +	return be64_to_cpu(old_be);
>> +}
>> +
>> +static inline unsigned long hash__pte_update(struct mm_struct *mm,
>> +					 unsigned long addr,
>> +					 pte_t *ptep, unsigned long clr,
>> +					 unsigned long set,
>> +					 int huge)
>> +{
>> +	unsigned long old;
>> +
>> +	old = hash__pte_update_one(ptep, clr, set);
>> +
>> +	if (huge && IS_ENABLED(CONFIG_PPC_4K_PAGES)) {
>> +		unsigned int psize = get_slice_psize(mm, addr);
>> +		int nb, i;
>> +
>> +		if (psize == MMU_PAGE_16M)
>> +			nb = SZ_16M / PMD_SIZE;
>> +		else if (psize == MMU_PAGE_16G)
>> +			nb = SZ_16G / PUD_SIZE;
>> +		else
>> +			nb = 1;
>> +
>> +		for (i = 1; i < nb; i++)
>> +			hash__pte_update_one(ptep + i, clr, set);
>> +	}
>>   	/* huge pages use the old page table lock */
>>   	if (!huge)
>>   		assert_pte_locked(mm, addr);
>>   
>> -	old = be64_to_cpu(old_be);
>>   	if (old & H_PAGE_HASHPTE)
>>   		hpte_need_flush(mm, addr, ptep, old, huge);
>>   
> 
> Nice series, I don't know this hugepd code very well but I'll try.
> Why do you have to replicate the PTE entry here? The hash table refill
> should always be working on the first PTE of the page otherwise we have
> bigger problems.

I don't know how book3s/64 works exactly, but on nohash, when you get a 
TLB miss exception the only thing you have is the address and you don't 
know yes it is a hugepage so you get the PTE as if it was a 4k page and 
it is only when you read that PTE that you know it is a hugepage.

Ok, on book3s/64 the page size seems to be encoded inside the segment so 
maybe it is a bit different but anyway the TLB miss exception (or DSI ?) 
can happen at any address.

> 
> What paths look at the N > 0 PTEs of a contiguous page entry?
> 

pte_offset_kernel() or pte_offset_map_lock() will land on any contiguous 
PTE based on the address handed to pte_index(), as if it was a standard 
(4k or 64k) page.

pte_index() doesn't know it is a hugepage, that's the reason why we need 
to duplicate the entry.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 06/20] powerpc/8xx: Fix size given to set_huge_pte_at()
  2024-05-20 16:31     ` Christophe Leroy
@ 2024-05-20 17:42       ` Oscar Salvador
  2024-05-22  8:45         ` Christophe Leroy
  2024-05-21  0:48       ` Michael Ellerman
  1 sibling, 1 reply; 60+ messages in thread
From: Oscar Salvador @ 2024-05-20 17:42 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Michael Ellerman, Andrew Morton, Jason Gunthorpe, Peter Xu,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev

On Mon, May 20, 2024 at 04:31:39PM +0000, Christophe Leroy wrote:
> Hi Oscar, hi Michael,
> 
> Le 20/05/2024 à 11:14, Oscar Salvador a écrit :
> > On Fri, May 17, 2024 at 09:00:00PM +0200, Christophe Leroy wrote:
> >> set_huge_pte_at() expects the real page size, not the psize which is
> > 
> > "expects the size of the huge page" sounds bettter?
> 
> Parameter 'pzize' already provides the size of the hugepage, but not in 
> the way set_huge_pte_at() expects it.
> 
> psize has one of the values defined by MMU_PAGE_XXX macros defined in 
> arch/powerpc/include/asm/mmu.h while set_huge_pte_at() expects the size 
> as a value.

Yes, psize is an index, which is not a size by itself but used to get
mmu_psize_def.shift to see the actual size, I guess.
This is why I thought that being explicit about "expects the size of the
huge page" was better.

But no strong feelings here.


-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 06/20] powerpc/8xx: Fix size given to set_huge_pte_at()
  2024-05-20 16:31     ` Christophe Leroy
  2024-05-20 17:42       ` Oscar Salvador
@ 2024-05-21  0:48       ` Michael Ellerman
  2024-05-21  9:26         ` Oscar Salvador
  1 sibling, 1 reply; 60+ messages in thread
From: Michael Ellerman @ 2024-05-21  0:48 UTC (permalink / raw)
  To: Christophe Leroy, Oscar Salvador
  Cc: Andrew Morton, Jason Gunthorpe, Peter Xu, Nicholas Piggin,
	linux-kernel, linux-mm, linuxppc-dev

Christophe Leroy <christophe.leroy@csgroup.eu> writes:
> Hi Oscar, hi Michael,
>
> Le 20/05/2024 à 11:14, Oscar Salvador a écrit :
>> On Fri, May 17, 2024 at 09:00:00PM +0200, Christophe Leroy wrote:
>>> set_huge_pte_at() expects the real page size, not the psize which is
>> 
>> "expects the size of the huge page" sounds bettter?
>
> Parameter 'pzize' already provides the size of the hugepage, but not in 
> the way set_huge_pte_at() expects it.
>
> psize has one of the values defined by MMU_PAGE_XXX macros defined in 
> arch/powerpc/include/asm/mmu.h while set_huge_pte_at() expects the size 
> as a value.
>
>> 
>>> the index of the page definition in table mmu_psize_defs[]
>>>
>>> Fixes: 935d4f0c6dc8 ("mm: hugetlb: add huge page size param to set_huge_pte_at()")
>>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
>> 
>> Reviewed-by: Oscar Salvador <osalvador@suse.de>
>> 
>> AFAICS, this fixup is not related to the series, right? (yes, you will
>> the parameter later)
>> I would have it at the very beginning of the series.
>
> You are right, I should have submitted it separately.
>
> Michael can you take it as a fix for 6.10 ?

Yeah I can. Does it actually cause a bug at runtime (I assume so)?

cheers


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 06/20] powerpc/8xx: Fix size given to set_huge_pte_at()
  2024-05-21  0:48       ` Michael Ellerman
@ 2024-05-21  9:26         ` Oscar Salvador
  2024-05-22  8:32           ` Christophe Leroy
  0 siblings, 1 reply; 60+ messages in thread
From: Oscar Salvador @ 2024-05-21  9:26 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Christophe Leroy, Andrew Morton, Jason Gunthorpe, Peter Xu,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev

On Tue, May 21, 2024 at 10:48:21AM +1000, Michael Ellerman wrote:
> Yeah I can. Does it actually cause a bug at runtime (I assume so)?

No, currently set_huge_pte_at() from 8xx ignores the 'sz' parameter.
But it will be used after this series.

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 03/20] mm: Provide pmd to pte_leaf_size()
  2024-05-17 18:59 ` [RFC PATCH v2 03/20] mm: Provide pmd to pte_leaf_size() Christophe Leroy
@ 2024-05-21  9:39   ` Oscar Salvador
  2024-05-22 10:22     ` Christophe Leroy
  0 siblings, 1 reply; 60+ messages in thread
From: Oscar Salvador @ 2024-05-21  9:39 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Andrew Morton, Jason Gunthorpe, Peter Xu, Michael Ellerman,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev

On Fri, May 17, 2024 at 08:59:57PM +0200, Christophe Leroy wrote:
> On powerpc 8xx, when a page is 8M size, the information is in the PMD
> entry. So provide it to pte_leaf_size().
> 
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>

Overall looks good to me.

Would be nicer if we could left the arch code untouched.
I wanted to see how this would be if we go down that road and focus only 
on 8xx at the risk of being more esoteric.
pmd_pte_leaf_size() is a name of hell, but could be replaced
with __pte_leaf_size for example.

Worth it? Maybe not, anyway, just wanted to give it a go:


 diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
 index 137dc3c84e45..9e3fe6e1083f 100644
 --- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
 +++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
 @@ -151,7 +151,7 @@ static inline unsigned long pgd_leaf_size(pgd_t pgd)
 
  #define pgd_leaf_size pgd_leaf_size
 
 -static inline unsigned long pte_leaf_size(pte_t pte)
 +static inline unsigned long pmd_pte_leaf_size(pte_t pte)
  {
         pte_basic_t val = pte_val(pte);
 
 @@ -162,7 +162,7 @@ static inline unsigned long pte_leaf_size(pte_t pte)
         return SZ_4K;
  }
 
 -#define pte_leaf_size pte_leaf_size
 +#define pmd_pte_leaf_size pmd_pte_leaf_size
 
  /*
   * On the 8xx, the page tables are a bit special. For 16k pages, we have
 diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
 index 18019f037bae..2bc2fe3b2b53 100644
 --- a/include/linux/pgtable.h
 +++ b/include/linux/pgtable.h
 @@ -1891,6 +1891,9 @@ typedef unsigned int pgtbl_mod_mask;
  #ifndef pte_leaf_size
  #define pte_leaf_size(x) PAGE_SIZE
  #endif
 +#ifndef pmd_pte_leaf_size
 +#define pmd_pte_leaf_size(x, y) pte_leaf_size(y)
 +#endif
 
  /*
   * We always define pmd_pfn for all archs as it's used in lots of generic
 diff --git a/kernel/events/core.c b/kernel/events/core.c
 index f0128c5ff278..e90a547d2fb2 100644
 --- a/kernel/events/core.c
 +++ b/kernel/events/core.c
 @@ -7596,7 +7596,7 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
 
         pte = ptep_get_lockless(ptep);
         if (pte_present(pte))
 -               size = pte_leaf_size(pte);
 +               size = pmd_pte_leaf_size(pmd, pte);
         pte_unmap(ptep);
  #endif /* CONFIG_HAVE_GUP_FAST */

 

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 01/20] mm: Provide pagesize to pmd_populate()
  2024-05-20 16:24     ` Christophe Leroy
@ 2024-05-21 11:57       ` Oscar Salvador
  2024-05-22  8:37         ` Christophe Leroy
  0 siblings, 1 reply; 60+ messages in thread
From: Oscar Salvador @ 2024-05-21 11:57 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Andrew Morton, Jason Gunthorpe, Peter Xu, Michael Ellerman,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev

On Mon, May 20, 2024 at 04:24:51PM +0000, Christophe Leroy wrote:
> I had a quick look at that document and it seems to provide a good 
> summary of MMU features and principles. However there are some 
> theoritical information which is not fully right in practice. For 
> instance when they say "Segment attributes. These fields define 
> attributes common to all pages in this segment.". This is right in 
> theory if you consider it from Linux page table topology point of view, 
> hence what they call a segment is a PMD entry for Linux. However, in 
> practice each page has its own L1 and L2 attributes and there is not 
> requirement at HW level to have all L1 attributes of all pages of a 
> segment the same.

Thanks for taking the time Christophe, highly appreciated.

 
> rlwimi = Rotate Left Word Immediate then Mask Insert. Here it rotates 
> r10 by 23 bits to the left (or 9 to the right) then masks with 
> _PMD_PAGE_512K and inserts it into r11.
> 
> It means _PAGE_HUGE bit is copied into lower bit of PS attribute.
> 
> PS takes the following values:
> 
> PS = 00 ==> Small page (4k or 16k)
> PS = 01 ==> 512k page
> PS = 10 ==> Undefined
> PS = 11 ==> 8M page

I see, thanks for the explanation.

> That's a RFC, all ideas are welcome, I needed something to replace 
> hugepd_populate()

The only user interested in pmd_populate() having a sz parameter
is 8xx because it will toggle _PMD_PAGE_8M in case of a 8MB mapping.

Would it be possible for 8xx to encode the 'sz' in the *pmd pointer
prior to calling down the chain? (something like as we do for PTR_ERR()).
Then pmd_populate_{kernel_}size() from 8xx, would extract it like:

 unsigned long sz = PTR_SIZE(pmd)

Then we would not need all these 'sz' parameters scattered.

Can that work?


PD: Do you know a way to emulate a 8xx VM? qemu seems to not have
support support.

Thanks


-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 18/20] powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD
  2024-05-20 16:43     ` Christophe Leroy
@ 2024-05-22  1:13       ` Nicholas Piggin
  2024-05-22  9:32         ` Christophe Leroy
  2024-05-22 12:23         ` Jason Gunthorpe
  0 siblings, 2 replies; 60+ messages in thread
From: Nicholas Piggin @ 2024-05-22  1:13 UTC (permalink / raw)
  To: Christophe Leroy, Andrew Morton, Jason Gunthorpe, Peter Xu,
	Oscar Salvador, Michael Ellerman
  Cc: linux-kernel, linux-mm, linuxppc-dev

On Tue May 21, 2024 at 2:43 AM AEST, Christophe Leroy wrote:
>
>
> Le 20/05/2024 à 14:54, Nicholas Piggin a écrit :
> > On Sat May 18, 2024 at 5:00 AM AEST, Christophe Leroy wrote:
> >> On book3s/64, the only user of hugepd is hash in 4k mode.
> >>
> >> All other setups (hash-64, radix-4, radix-64) use leaf PMD/PUD.
> >>
> >> Rework hash-4k to use contiguous PMD and PUD instead.
> >>
> >> In that setup there are only two huge page sizes: 16M and 16G.
> >>
> >> 16M sits at PMD level and 16G at PUD level.
> >>
> >> pte_update doesn't know page size, lets use the same trick as
> >> hpte_need_flush() to get page size from segment properties. That's
> >> not the most efficient way but let's do that until callers of
> >> pte_update() provide page size instead of just a huge flag.
> >>
> >> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> >> ---
> >>   arch/powerpc/include/asm/book3s/64/hash-4k.h  | 15 --------
> >>   arch/powerpc/include/asm/book3s/64/hash.h     | 38 +++++++++++++++----
> >>   arch/powerpc/include/asm/book3s/64/hugetlb.h  | 38 -------------------
> >>   .../include/asm/book3s/64/pgtable-4k.h        | 34 -----------------
> >>   .../include/asm/book3s/64/pgtable-64k.h       | 20 ----------
> >>   arch/powerpc/include/asm/hugetlb.h            |  4 ++
> >>   .../include/asm/nohash/32/hugetlb-8xx.h       |  4 --
> >>   .../powerpc/include/asm/nohash/hugetlb-e500.h |  4 --
> >>   arch/powerpc/include/asm/page.h               |  8 ----
> >>   arch/powerpc/mm/book3s64/hash_utils.c         | 11 ++++--
> >>   arch/powerpc/mm/book3s64/pgtable.c            | 12 ------
> >>   arch/powerpc/mm/hugetlbpage.c                 | 19 ----------
> >>   arch/powerpc/mm/pgtable.c                     |  2 +-
> >>   arch/powerpc/platforms/Kconfig.cputype        |  1 -
> >>   14 files changed, 43 insertions(+), 167 deletions(-)
> >>
> >> diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> >> index 6472b08fa1b0..c654c376ef8b 100644
> >> --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> >> +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> >> @@ -74,21 +74,6 @@
> >>   #define remap_4k_pfn(vma, addr, pfn, prot)	\
> >>   	remap_pfn_range((vma), (addr), (pfn), PAGE_SIZE, (prot))
> >>   
> >> -#ifdef CONFIG_HUGETLB_PAGE
> >> -static inline int hash__hugepd_ok(hugepd_t hpd)
> >> -{
> >> -	unsigned long hpdval = hpd_val(hpd);
> >> -	/*
> >> -	 * if it is not a pte and have hugepd shift mask
> >> -	 * set, then it is a hugepd directory pointer
> >> -	 */
> >> -	if (!(hpdval & _PAGE_PTE) && (hpdval & _PAGE_PRESENT) &&
> >> -	    ((hpdval & HUGEPD_SHIFT_MASK) != 0))
> >> -		return true;
> >> -	return false;
> >> -}
> >> -#endif
> >> -
> >>   /*
> >>    * 4K PTE format is different from 64K PTE format. Saving the hash_slot is just
> >>    * a matter of returning the PTE bits that need to be modified. On 64K PTE,
> >> diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
> >> index faf3e3b4e4b2..509811ca7695 100644
> >> --- a/arch/powerpc/include/asm/book3s/64/hash.h
> >> +++ b/arch/powerpc/include/asm/book3s/64/hash.h
> >> @@ -4,6 +4,7 @@
> >>   #ifdef __KERNEL__
> >>   
> >>   #include <asm/asm-const.h>
> >> +#include <asm/book3s/64/slice.h>
> >>   
> >>   /*
> >>    * Common bits between 4K and 64K pages in a linux-style PTE.
> >> @@ -161,14 +162,10 @@ extern void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
> >>   			    pte_t *ptep, unsigned long pte, int huge);
> >>   unsigned long htab_convert_pte_flags(unsigned long pteflags, unsigned long flags);
> >>   /* Atomic PTE updates */
> >> -static inline unsigned long hash__pte_update(struct mm_struct *mm,
> >> -					 unsigned long addr,
> >> -					 pte_t *ptep, unsigned long clr,
> >> -					 unsigned long set,
> >> -					 int huge)
> >> +static inline unsigned long hash__pte_update_one(pte_t *ptep, unsigned long clr,
> >> +						 unsigned long set)
> >>   {
> >>   	__be64 old_be, tmp_be;
> >> -	unsigned long old;
> >>   
> >>   	__asm__ __volatile__(
> >>   	"1:	ldarx	%0,0,%3		# pte_update\n\
> >> @@ -182,11 +179,38 @@ static inline unsigned long hash__pte_update(struct mm_struct *mm,
> >>   	: "r" (ptep), "r" (cpu_to_be64(clr)), "m" (*ptep),
> >>   	  "r" (cpu_to_be64(H_PAGE_BUSY)), "r" (cpu_to_be64(set))
> >>   	: "cc" );
> >> +
> >> +	return be64_to_cpu(old_be);
> >> +}
> >> +
> >> +static inline unsigned long hash__pte_update(struct mm_struct *mm,
> >> +					 unsigned long addr,
> >> +					 pte_t *ptep, unsigned long clr,
> >> +					 unsigned long set,
> >> +					 int huge)
> >> +{
> >> +	unsigned long old;
> >> +
> >> +	old = hash__pte_update_one(ptep, clr, set);
> >> +
> >> +	if (huge && IS_ENABLED(CONFIG_PPC_4K_PAGES)) {
> >> +		unsigned int psize = get_slice_psize(mm, addr);
> >> +		int nb, i;
> >> +
> >> +		if (psize == MMU_PAGE_16M)
> >> +			nb = SZ_16M / PMD_SIZE;
> >> +		else if (psize == MMU_PAGE_16G)
> >> +			nb = SZ_16G / PUD_SIZE;
> >> +		else
> >> +			nb = 1;
> >> +
> >> +		for (i = 1; i < nb; i++)
> >> +			hash__pte_update_one(ptep + i, clr, set);
> >> +	}
> >>   	/* huge pages use the old page table lock */
> >>   	if (!huge)
> >>   		assert_pte_locked(mm, addr);
> >>   
> >> -	old = be64_to_cpu(old_be);
> >>   	if (old & H_PAGE_HASHPTE)
> >>   		hpte_need_flush(mm, addr, ptep, old, huge);
> >>   
> > 
> > Nice series, I don't know this hugepd code very well but I'll try.
> > Why do you have to replicate the PTE entry here? The hash table refill
> > should always be working on the first PTE of the page otherwise we have
> > bigger problems.
>
> I don't know how book3s/64 works exactly, but on nohash, when you get a 
> TLB miss exception the only thing you have is the address and you don't 
> know yes it is a hugepage so you get the PTE as if it was a 4k page and 
> it is only when you read that PTE that you know it is a hugepage.
>
> Ok, on book3s/64 the page size seems to be encoded inside the segment so 
> maybe it is a bit different but anyway the TLB miss exception (or DSI ?) 
> can happen at any address.

Right.

If you think of the hash page table as a software loaded TLB (which
is how Linux kind of thinks of it), then DSI is a TLB miss. hash_page_x
calls find the Linux pte and load that translation into hash page table.

One of the hard parts is keeping them coherent with low overhead. This
requires pte bits H_PAGE_BUSY as a lock and H_PAGE_HASHPTE which means
it might be in the hash table. So Linux PTE and hash PTE have to be
1:1 in general.

There are probably cases where we could get away from 1:1, but I would
much prefer not to. Maybe read-only access would be okay though. But
the hash_page will have to always operate on the 0th pte, which I think
we get via segment size masking, same for any set / update / clear of
the pte.

> > 
> > What paths look at the N > 0 PTEs of a contiguous page entry?
> > 
>
> pte_offset_kernel() or pte_offset_map_lock() will land on any contiguous 
> PTE based on the address handed to pte_index(), as if it was a standard 
> (4k or 64k) page.
>
> pte_index() doesn't know it is a hugepage, that's the reason why we need 
> to duplicate the entry.

From the mm/ side of things, hugetlb page tables are always walked via
the huge vma which knows the page size and could align address... I
guess except for fast gup? Which should be read-only. So okay you do
need to replicate huge ptes for fast gup at least. Any others?

There's going to need to be a little more to it. __hash_page_huge sets
PTE accessed and dirty for example, so if we allow any PTE readers to
check the non-0th pte we would have to do something about that.

How do you deal with dirty/accessed bits for other subarchs?

We could just remove the hash_page setting of those bits and just cause
a fault and require Linux mm to set them. At least for hugepages we
could do that probably without any real performance worry.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 06/20] powerpc/8xx: Fix size given to set_huge_pte_at()
  2024-05-21  9:26         ` Oscar Salvador
@ 2024-05-22  8:32           ` Christophe Leroy
  2024-05-22 12:18             ` Christophe Leroy
  0 siblings, 1 reply; 60+ messages in thread
From: Christophe Leroy @ 2024-05-22  8:32 UTC (permalink / raw)
  To: Oscar Salvador, Michael Ellerman
  Cc: Andrew Morton, Jason Gunthorpe, Peter Xu, Nicholas Piggin,
	linux-kernel, linux-mm, linuxppc-dev



Le 21/05/2024 à 11:26, Oscar Salvador a écrit :
> On Tue, May 21, 2024 at 10:48:21AM +1000, Michael Ellerman wrote:
>> Yeah I can. Does it actually cause a bug at runtime (I assume so)?
> 
> No, currently set_huge_pte_at() from 8xx ignores the 'sz' parameter.
> But it will be used after this series.
> 

Ah yes, I mixed things up with something else in my mind.

So this patch doesn't qualify as a fix and doesn't need to be handled 
separately from the series and doesn't really need to go on top of the 
series either, I think it is better to keep it grouped with other 8xx 
changes.

Christophe

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 01/20] mm: Provide pagesize to pmd_populate()
  2024-05-21 11:57       ` Oscar Salvador
@ 2024-05-22  8:37         ` Christophe Leroy
  0 siblings, 0 replies; 60+ messages in thread
From: Christophe Leroy @ 2024-05-22  8:37 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Jason Gunthorpe, Peter Xu, Michael Ellerman,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev



Le 21/05/2024 à 13:57, Oscar Salvador a écrit :
> On Mon, May 20, 2024 at 04:24:51PM +0000, Christophe Leroy wrote:
>> I had a quick look at that document and it seems to provide a good
>> summary of MMU features and principles. However there are some
>> theoritical information which is not fully right in practice. For
>> instance when they say "Segment attributes. These fields define
>> attributes common to all pages in this segment.". This is right in
>> theory if you consider it from Linux page table topology point of view,
>> hence what they call a segment is a PMD entry for Linux. However, in
>> practice each page has its own L1 and L2 attributes and there is not
>> requirement at HW level to have all L1 attributes of all pages of a
>> segment the same.
> 
> Thanks for taking the time Christophe, highly appreciated.
> 
>   
>> rlwimi = Rotate Left Word Immediate then Mask Insert. Here it rotates
>> r10 by 23 bits to the left (or 9 to the right) then masks with
>> _PMD_PAGE_512K and inserts it into r11.
>>
>> It means _PAGE_HUGE bit is copied into lower bit of PS attribute.
>>
>> PS takes the following values:
>>
>> PS = 00 ==> Small page (4k or 16k)
>> PS = 01 ==> 512k page
>> PS = 10 ==> Undefined
>> PS = 11 ==> 8M page
> 
> I see, thanks for the explanation.
> 
>> That's a RFC, all ideas are welcome, I needed something to replace
>> hugepd_populate()
> 
> The only user interested in pmd_populate() having a sz parameter
> is 8xx because it will toggle _PMD_PAGE_8M in case of a 8MB mapping.
> 
> Would it be possible for 8xx to encode the 'sz' in the *pmd pointer
> prior to calling down the chain? (something like as we do for PTR_ERR()).
> Then pmd_populate_{kernel_}size() from 8xx, would extract it like:
> 
>   unsigned long sz = PTR_SIZE(pmd)
> 
> Then we would not need all these 'sz' parameters scattered.
> 
> Can that work?

Indeed _PMD_PAGE_8M can be set in set_huge_pte_at(), no need to do it 
atomically as part of pmd_populate, so I'll drop patches 1 and 2.

> 
> 
> PD: Do you know a way to emulate a 8xx VM? qemu seems to not have
> support support.
> 

I don't know any way. You are right that 8xx is not supported by QEMU 
unfortunately. I don't know how difficult it would be to add it to QEMU.

Christophe

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 06/20] powerpc/8xx: Fix size given to set_huge_pte_at()
  2024-05-20 17:42       ` Oscar Salvador
@ 2024-05-22  8:45         ` Christophe Leroy
  0 siblings, 0 replies; 60+ messages in thread
From: Christophe Leroy @ 2024-05-22  8:45 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Michael Ellerman, Andrew Morton, Jason Gunthorpe, Peter Xu,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev



Le 20/05/2024 à 19:42, Oscar Salvador a écrit :
> On Mon, May 20, 2024 at 04:31:39PM +0000, Christophe Leroy wrote:
>> Hi Oscar, hi Michael,
>>
>> Le 20/05/2024 à 11:14, Oscar Salvador a écrit :
>>> On Fri, May 17, 2024 at 09:00:00PM +0200, Christophe Leroy wrote:
>>>> set_huge_pte_at() expects the real page size, not the psize which is
>>>
>>> "expects the size of the huge page" sounds bettter?
>>
>> Parameter 'pzize' already provides the size of the hugepage, but not in
>> the way set_huge_pte_at() expects it.
>>
>> psize has one of the values defined by MMU_PAGE_XXX macros defined in
>> arch/powerpc/include/asm/mmu.h while set_huge_pte_at() expects the size
>> as a value.
> 
> Yes, psize is an index, which is not a size by itself but used to get
> mmu_psize_def.shift to see the actual size, I guess.
> This is why I thought that being explicit about "expects the size of the
> huge page" was better.
> 
> But no strong feelings here.
> 

Thanks, I'll try a rephrase.

Christophe

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 18/20] powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD
  2024-05-22  1:13       ` Nicholas Piggin
@ 2024-05-22  9:32         ` Christophe Leroy
  2024-05-22 12:23         ` Jason Gunthorpe
  1 sibling, 0 replies; 60+ messages in thread
From: Christophe Leroy @ 2024-05-22  9:32 UTC (permalink / raw)
  To: Nicholas Piggin, Andrew Morton, Jason Gunthorpe, Peter Xu,
	Oscar Salvador, Michael Ellerman
  Cc: linux-kernel, linux-mm, linuxppc-dev



Le 22/05/2024 à 03:13, Nicholas Piggin a écrit :
> On Tue May 21, 2024 at 2:43 AM AEST, Christophe Leroy wrote:
>>
>>
>> Le 20/05/2024 à 14:54, Nicholas Piggin a écrit :
>>> On Sat May 18, 2024 at 5:00 AM AEST, Christophe Leroy wrote:
>>>> On book3s/64, the only user of hugepd is hash in 4k mode.
>>>>
>>>> All other setups (hash-64, radix-4, radix-64) use leaf PMD/PUD.
>>>>
>>>> Rework hash-4k to use contiguous PMD and PUD instead.
>>>>
>>>> In that setup there are only two huge page sizes: 16M and 16G.
>>>>
>>>> 16M sits at PMD level and 16G at PUD level.
>>>>
>>>> pte_update doesn't know page size, lets use the same trick as
>>>> hpte_need_flush() to get page size from segment properties. That's
>>>> not the most efficient way but let's do that until callers of
>>>> pte_update() provide page size instead of just a huge flag.
>>>>
>>>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
>>>> ---
>>>>    arch/powerpc/include/asm/book3s/64/hash-4k.h  | 15 --------
>>>>    arch/powerpc/include/asm/book3s/64/hash.h     | 38 +++++++++++++++----
>>>>    arch/powerpc/include/asm/book3s/64/hugetlb.h  | 38 -------------------
>>>>    .../include/asm/book3s/64/pgtable-4k.h        | 34 -----------------
>>>>    .../include/asm/book3s/64/pgtable-64k.h       | 20 ----------
>>>>    arch/powerpc/include/asm/hugetlb.h            |  4 ++
>>>>    .../include/asm/nohash/32/hugetlb-8xx.h       |  4 --
>>>>    .../powerpc/include/asm/nohash/hugetlb-e500.h |  4 --
>>>>    arch/powerpc/include/asm/page.h               |  8 ----
>>>>    arch/powerpc/mm/book3s64/hash_utils.c         | 11 ++++--
>>>>    arch/powerpc/mm/book3s64/pgtable.c            | 12 ------
>>>>    arch/powerpc/mm/hugetlbpage.c                 | 19 ----------
>>>>    arch/powerpc/mm/pgtable.c                     |  2 +-
>>>>    arch/powerpc/platforms/Kconfig.cputype        |  1 -
>>>>    14 files changed, 43 insertions(+), 167 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
>>>> index 6472b08fa1b0..c654c376ef8b 100644
>>>> --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
>>>> +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
>>>> @@ -74,21 +74,6 @@
>>>>    #define remap_4k_pfn(vma, addr, pfn, prot)	\
>>>>    	remap_pfn_range((vma), (addr), (pfn), PAGE_SIZE, (prot))
>>>>    
>>>> -#ifdef CONFIG_HUGETLB_PAGE
>>>> -static inline int hash__hugepd_ok(hugepd_t hpd)
>>>> -{
>>>> -	unsigned long hpdval = hpd_val(hpd);
>>>> -	/*
>>>> -	 * if it is not a pte and have hugepd shift mask
>>>> -	 * set, then it is a hugepd directory pointer
>>>> -	 */
>>>> -	if (!(hpdval & _PAGE_PTE) && (hpdval & _PAGE_PRESENT) &&
>>>> -	    ((hpdval & HUGEPD_SHIFT_MASK) != 0))
>>>> -		return true;
>>>> -	return false;
>>>> -}
>>>> -#endif
>>>> -
>>>>    /*
>>>>     * 4K PTE format is different from 64K PTE format. Saving the hash_slot is just
>>>>     * a matter of returning the PTE bits that need to be modified. On 64K PTE,
>>>> diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
>>>> index faf3e3b4e4b2..509811ca7695 100644
>>>> --- a/arch/powerpc/include/asm/book3s/64/hash.h
>>>> +++ b/arch/powerpc/include/asm/book3s/64/hash.h
>>>> @@ -4,6 +4,7 @@
>>>>    #ifdef __KERNEL__
>>>>    
>>>>    #include <asm/asm-const.h>
>>>> +#include <asm/book3s/64/slice.h>
>>>>    
>>>>    /*
>>>>     * Common bits between 4K and 64K pages in a linux-style PTE.
>>>> @@ -161,14 +162,10 @@ extern void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
>>>>    			    pte_t *ptep, unsigned long pte, int huge);
>>>>    unsigned long htab_convert_pte_flags(unsigned long pteflags, unsigned long flags);
>>>>    /* Atomic PTE updates */
>>>> -static inline unsigned long hash__pte_update(struct mm_struct *mm,
>>>> -					 unsigned long addr,
>>>> -					 pte_t *ptep, unsigned long clr,
>>>> -					 unsigned long set,
>>>> -					 int huge)
>>>> +static inline unsigned long hash__pte_update_one(pte_t *ptep, unsigned long clr,
>>>> +						 unsigned long set)
>>>>    {
>>>>    	__be64 old_be, tmp_be;
>>>> -	unsigned long old;
>>>>    
>>>>    	__asm__ __volatile__(
>>>>    	"1:	ldarx	%0,0,%3		# pte_update\n\
>>>> @@ -182,11 +179,38 @@ static inline unsigned long hash__pte_update(struct mm_struct *mm,
>>>>    	: "r" (ptep), "r" (cpu_to_be64(clr)), "m" (*ptep),
>>>>    	  "r" (cpu_to_be64(H_PAGE_BUSY)), "r" (cpu_to_be64(set))
>>>>    	: "cc" );
>>>> +
>>>> +	return be64_to_cpu(old_be);
>>>> +}
>>>> +
>>>> +static inline unsigned long hash__pte_update(struct mm_struct *mm,
>>>> +					 unsigned long addr,
>>>> +					 pte_t *ptep, unsigned long clr,
>>>> +					 unsigned long set,
>>>> +					 int huge)
>>>> +{
>>>> +	unsigned long old;
>>>> +
>>>> +	old = hash__pte_update_one(ptep, clr, set);
>>>> +
>>>> +	if (huge && IS_ENABLED(CONFIG_PPC_4K_PAGES)) {
>>>> +		unsigned int psize = get_slice_psize(mm, addr);
>>>> +		int nb, i;
>>>> +
>>>> +		if (psize == MMU_PAGE_16M)
>>>> +			nb = SZ_16M / PMD_SIZE;
>>>> +		else if (psize == MMU_PAGE_16G)
>>>> +			nb = SZ_16G / PUD_SIZE;
>>>> +		else
>>>> +			nb = 1;
>>>> +
>>>> +		for (i = 1; i < nb; i++)
>>>> +			hash__pte_update_one(ptep + i, clr, set);
>>>> +	}
>>>>    	/* huge pages use the old page table lock */
>>>>    	if (!huge)
>>>>    		assert_pte_locked(mm, addr);
>>>>    
>>>> -	old = be64_to_cpu(old_be);
>>>>    	if (old & H_PAGE_HASHPTE)
>>>>    		hpte_need_flush(mm, addr, ptep, old, huge);
>>>>    
>>>
>>> Nice series, I don't know this hugepd code very well but I'll try.
>>> Why do you have to replicate the PTE entry here? The hash table refill
>>> should always be working on the first PTE of the page otherwise we have
>>> bigger problems.
>>
>> I don't know how book3s/64 works exactly, but on nohash, when you get a
>> TLB miss exception the only thing you have is the address and you don't
>> know yes it is a hugepage so you get the PTE as if it was a 4k page and
>> it is only when you read that PTE that you know it is a hugepage.
>>
>> Ok, on book3s/64 the page size seems to be encoded inside the segment so
>> maybe it is a bit different but anyway the TLB miss exception (or DSI ?)
>> can happen at any address.
> 
> Right.
> 
> If you think of the hash page table as a software loaded TLB (which
> is how Linux kind of thinks of it), then DSI is a TLB miss. hash_page_x
> calls find the Linux pte and load that translation into hash page table.
> 
> One of the hard parts is keeping them coherent with low overhead. This
> requires pte bits H_PAGE_BUSY as a lock and H_PAGE_HASHPTE which means
> it might be in the hash table. So Linux PTE and hash PTE have to be
> 1:1 in general.
> 
> There are probably cases where we could get away from 1:1, but I would
> much prefer not to. Maybe read-only access would be okay though. But
> the hash_page will have to always operate on the 0th pte, which I think
> we get via segment size masking, same for any set / update / clear of
> the pte.
> 
>>>
>>> What paths look at the N > 0 PTEs of a contiguous page entry?
>>>
>>
>> pte_offset_kernel() or pte_offset_map_lock() will land on any contiguous
>> PTE based on the address handed to pte_index(), as if it was a standard
>> (4k or 64k) page.
>>
>> pte_index() doesn't know it is a hugepage, that's the reason why we need
>> to duplicate the entry.
> 
>  From the mm/ side of things, hugetlb page tables are always walked via
> the huge vma which knows the page size and could align address... I
> guess except for fast gup? Which should be read-only. So okay you do
> need to replicate huge ptes for fast gup at least. Any others?
> 
> There's going to need to be a little more to it. __hash_page_huge sets
> PTE accessed and dirty for example, so if we allow any PTE readers to
> check the non-0th pte we would have to do something about that.
> 
> How do you deal with dirty/accessed bits for other subarchs?

All nohash bail out of TLB miss handler when accessing a page which 
doesn have the ACCESSED bit or writing a page which doesn't have DIRTY 
bit, see commit 2c74e2586bb9 ("powerpc/40x: Rework 40x PTE access and 
TLB miss") and other commits it refers to.

Same for the 603 which is the nohash version of book3s/32, see commits 
f8b58c64eaef ("powerpc/603: let's handle PAGE_DIRTY directly") and 
84de6ab0e904 ("powerpc/603: don't handle PAGE_ACCESSED in TLB miss 
handlers.").

Only the hash version of book3s/32 still updated PTE in miss handler, 
see 
https://elixir.bootlin.com/linux/v6.9/source/arch/powerpc/mm/book3s32/hash_low.S#L146 
but there are no hugepages on book3s/32


> 
> We could just remove the hash_page setting of those bits and just cause
> a fault and require Linux mm to set them. At least for hugepages we
> could do that probably without any real performance worry.
> 
> Thanks,
> Nick

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 03/20] mm: Provide pmd to pte_leaf_size()
  2024-05-21  9:39   ` Oscar Salvador
@ 2024-05-22 10:22     ` Christophe Leroy
  0 siblings, 0 replies; 60+ messages in thread
From: Christophe Leroy @ 2024-05-22 10:22 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Jason Gunthorpe, Peter Xu, Michael Ellerman,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev



Le 21/05/2024 à 11:39, Oscar Salvador a écrit :
> On Fri, May 17, 2024 at 08:59:57PM +0200, Christophe Leroy wrote:
>> On powerpc 8xx, when a page is 8M size, the information is in the PMD
>> entry. So provide it to pte_leaf_size().
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> 
> Overall looks good to me.
> 
> Would be nicer if we could left the arch code untouched.
> I wanted to see how this would be if we go down that road and focus only
> on 8xx at the risk of being more esoteric.
> pmd_pte_leaf_size() is a name of hell, but could be replaced
> with __pte_leaf_size for example.
> 
> Worth it? Maybe not, anyway, just wanted to give it a go:

I like the idea, it doesn't look that bad after all, it avoids changes 
to other arches.

> 
> 
>   diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
>   index 137dc3c84e45..9e3fe6e1083f 100644
>   --- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
>   +++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
>   @@ -151,7 +151,7 @@ static inline unsigned long pgd_leaf_size(pgd_t pgd)
>   
>    #define pgd_leaf_size pgd_leaf_size
>   
>   -static inline unsigned long pte_leaf_size(pte_t pte)
>   +static inline unsigned long pmd_pte_leaf_size(pte_t pte)
>    {
>           pte_basic_t val = pte_val(pte);
>   
>   @@ -162,7 +162,7 @@ static inline unsigned long pte_leaf_size(pte_t pte)
>           return SZ_4K;
>    }
>   
>   -#define pte_leaf_size pte_leaf_size
>   +#define pmd_pte_leaf_size pmd_pte_leaf_size
>   
>    /*
>     * On the 8xx, the page tables are a bit special. For 16k pages, we have
>   diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>   index 18019f037bae..2bc2fe3b2b53 100644
>   --- a/include/linux/pgtable.h
>   +++ b/include/linux/pgtable.h
>   @@ -1891,6 +1891,9 @@ typedef unsigned int pgtbl_mod_mask;
>    #ifndef pte_leaf_size
>    #define pte_leaf_size(x) PAGE_SIZE
>    #endif
>   +#ifndef pmd_pte_leaf_size
>   +#define pmd_pte_leaf_size(x, y) pte_leaf_size(y)
>   +#endif
>   
>    /*
>     * We always define pmd_pfn for all archs as it's used in lots of generic
>   diff --git a/kernel/events/core.c b/kernel/events/core.c
>   index f0128c5ff278..e90a547d2fb2 100644
>   --- a/kernel/events/core.c
>   +++ b/kernel/events/core.c
>   @@ -7596,7 +7596,7 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
>   
>           pte = ptep_get_lockless(ptep);
>           if (pte_present(pte))
>   -               size = pte_leaf_size(pte);
>   +               size = pmd_pte_leaf_size(pmd, pte);
>           pte_unmap(ptep);
>    #endif /* CONFIG_HAVE_GUP_FAST */
> 
>   
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 06/20] powerpc/8xx: Fix size given to set_huge_pte_at()
  2024-05-22  8:32           ` Christophe Leroy
@ 2024-05-22 12:18             ` Christophe Leroy
  0 siblings, 0 replies; 60+ messages in thread
From: Christophe Leroy @ 2024-05-22 12:18 UTC (permalink / raw)
  To: Oscar Salvador, Michael Ellerman, Peter Zijlstra
  Cc: Andrew Morton, Jason Gunthorpe, Peter Xu, Nicholas Piggin,
	linux-kernel, linux-mm, linuxppc-dev

+Peter Z. who added that commit.

Le 22/05/2024 à 10:32, Christophe Leroy a écrit :
> 
> 
> Le 21/05/2024 à 11:26, Oscar Salvador a écrit :
>> On Tue, May 21, 2024 at 10:48:21AM +1000, Michael Ellerman wrote:
>>> Yeah I can. Does it actually cause a bug at runtime (I assume so)?
>>
>> No, currently set_huge_pte_at() from 8xx ignores the 'sz' parameter.
>> But it will be used after this series.
>>
> 
> Ah yes, I mixed things up with something else in my mind.
> 
> So this patch doesn't qualify as a fix and doesn't need to be handled 
> separately from the series and doesn't really need to go on top of the 
> series either, I think it is better to keep it grouped with other 8xx 
> changes.
> 

I remember now, what I had in mind was commit c5eecbb58f65 
("powerpc/8xx: Implement pXX_leaf_size() support")

That commit is buggy, because pgd_leaf() will always return false on 
8xx. First of all pgd_leaf() could only return true on a target with 
P4Ds. Without P4Ds it should just return 0 like pgd_none(), pgd_bad(), 
... as defined in include/asm-generic/pgtable-nop4d.h

So it is pmd_leaf_size() that could eventually return something for 8xx.
But as 8xx is using hugepd, at the best case it will return crap, worst 
case the read will go in the weed.

To be correct we should had support of hugepd in perf_get_pgtable_size() 
but that's not trivial and this series is aiming at removing hugepd 
completely so there is no point in fixing stuff here, except maybe for 
stable ?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 18/20] powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD
  2024-05-22  1:13       ` Nicholas Piggin
  2024-05-22  9:32         ` Christophe Leroy
@ 2024-05-22 12:23         ` Jason Gunthorpe
  1 sibling, 0 replies; 60+ messages in thread
From: Jason Gunthorpe @ 2024-05-22 12:23 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Christophe Leroy, Andrew Morton, Peter Xu, Oscar Salvador,
	Michael Ellerman, linux-kernel, linux-mm, linuxppc-dev

On Wed, May 22, 2024 at 11:13:53AM +1000, Nicholas Piggin wrote:

> From the mm/ side of things, hugetlb page tables are always walked via
> the huge vma which knows the page size and could align address... I
> guess except for fast gup? Which should be read-only. So okay you do
> need to replicate huge ptes for fast gup at least. Any others?

We are trying to get away from this. We want all content in the page
table to be walkable via the normal pud/pmd/pte/etc functions and the
special huge VMA limited to only weird hugetlbfs internals. It should
not leak into the arch.

> There's going to need to be a little more to it. __hash_page_huge sets
> PTE accessed and dirty for example, so if we allow any PTE readers to
> check the non-0th pte we would have to do something about that.

Ryan added a special function to get the access and dirty flags from a
CONTIG PTE, the arch can do the right thing here. The case where there
was a CONTIG PTE that spanned two PMD entries might be some trouble
though.

> How do you deal with dirty/accessed bits for other subarchs?

ARM and RISCV verions will combine the access flags from every sub
pte. Their HW is allowed to set dirty/access bits on any PTE in a
contiguos set.

Jason


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64)
  2024-05-17 18:59 [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Christophe Leroy
                   ` (20 preceding siblings ...)
  2024-05-17 19:06 ` [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Jason Gunthorpe
@ 2024-05-23 19:40 ` Peter Xu
  2024-05-24  4:46   ` Michael Ellerman
  2024-05-24  6:31   ` Oscar Salvador
  21 siblings, 2 replies; 60+ messages in thread
From: Peter Xu @ 2024-05-23 19:40 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Andrew Morton, Jason Gunthorpe, Oscar Salvador, Michael Ellerman,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev

On Fri, May 17, 2024 at 08:59:54PM +0200, Christophe Leroy wrote:
> This is the continuation of the RFC v1 series "Reimplement huge pages
> without hugepd on powerpc 8xx". It now get rid of hugepd completely
> after handling also e500 and book3s/64
> 
> Unlike most architectures, powerpc 8xx HW requires a two-level
> pagetable topology for all page sizes. So a leaf PMD-contig approach
> is not feasible as such.
> 
> Possible sizes are 4k, 16k, 512k and 8M.
> 
> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
> must point to a single entry level-2 page table. Until now that was
> done using hugepd. This series changes it to use standard page tables
> where the entry is replicated 1024 times on each of the two pagetables
> refered by the two associated PMD entries for that 8M page.
> 
> At the moment it has to look into each helper to know if the
> hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
> a lower size. I hope this can me handled by core-mm in the future.
> 
> For e500 and book3s/64 there are less constraints because it is not
> tied to the HW assisted tablewalk like on 8xx, so it is easier to use
> leaf PMDs (and PUDs).
> 
> On e500 the supported page sizes are 4M, 16M, 64M, 256M and 1G. All at
> PMD level on e500/32 and mix of PMD and PUD for e500/64. We encode page
> size with 4 available bits in PTE entries. On e300/32 PGD entries size
> is increases to 64 bits in order to allow leaf-PMD entries because PTE
> are 64 bits on e500.
> 
> On book3s/64 only the hash-4k mode is concerned. It supports 16M pages
> as cont-PMD and 16G pages as cont-PUD. In other modes (radix-4k, radix-6k
> and hash-64k) the sizes match with PMD and PUD sizes so that's just leaf
> entries.
> 
> Christophe Leroy (20):
>   mm: Provide pagesize to pmd_populate()
>   mm: Provide page size to pte_alloc_huge()
>   mm: Provide pmd to pte_leaf_size()
>   mm: Provide mm_struct and address to huge_ptep_get()
>   powerpc/mm: Allow hugepages without hugepd
>   powerpc/8xx: Fix size given to set_huge_pte_at()
>   powerpc/8xx: Rework support for 8M pages using contiguous PTE entries
>   powerpc/8xx: Simplify struct mmu_psize_def
>   powerpc/mm: Remove _PAGE_PSIZE
>   powerpc/mm: Fix __find_linux_pte() on 32 bits with PMD leaf entries
>   powerpc/mm: Complement huge_pte_alloc() for all non HUGEPD setups
>   powerpc/64e: Remove unneeded #ifdef CONFIG_PPC_E500
>   powerpc/64e: Clean up impossible setups
>   powerpc/e500: Remove enc field from struct mmu_psize_def
>   powerpc/85xx: Switch to 64 bits PGD
>   powerpc/e500: Encode hugepage size in PTE bits
>   powerpc/e500: Use contiguous PMD instead of hugepd
>   powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD
>   powerpc/mm: Remove hugepd leftovers
>   mm: Remove CONFIG_ARCH_HAS_HUGEPD

Great to see this series, thanks again Christophe.

I requested for help on the lsfmm hugetlb unification session, but
unfortunately I don't think there were Power people around.. I'd like to
request help from Power developers again here on the list: it will be very
appreciated if you can help have a look at this series.

It's a direct dependent work to the hugetlb refactoring that we'll be
working on, while it looks like the hugetlb refactoring is something the
community as a whole would like to see in the near future.

We don't want to add more Power-only CONFIG_ARCH_HAS_HUGEPD checks for
hugetlb in any new code.

Currently Oscar offered help on that hugetlb project, and Oscar will start
to work on page_walk API refactoring.  I guess currently the simple way is
we'll work on top of Christophe's series.  Some proper review on this
series will definitely make it clearer on what we should do next.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64)
  2024-05-23 19:40 ` Peter Xu
@ 2024-05-24  4:46   ` Michael Ellerman
  2024-05-27 14:14     ` Peter Xu
  2024-05-24  6:31   ` Oscar Salvador
  1 sibling, 1 reply; 60+ messages in thread
From: Michael Ellerman @ 2024-05-24  4:46 UTC (permalink / raw)
  To: Peter Xu, Christophe Leroy
  Cc: Andrew Morton, Jason Gunthorpe, Oscar Salvador, Nicholas Piggin,
	linux-kernel, linux-mm, linuxppc-dev

Hi Peter,

Peter Xu <peterx@redhat.com> writes:
> On Fri, May 17, 2024 at 08:59:54PM +0200, Christophe Leroy wrote:
>> This is the continuation of the RFC v1 series "Reimplement huge pages
>> without hugepd on powerpc 8xx". It now get rid of hugepd completely
>> after handling also e500 and book3s/64
>> 
>> Unlike most architectures, powerpc 8xx HW requires a two-level
>> pagetable topology for all page sizes. So a leaf PMD-contig approach
>> is not feasible as such.
....
>
> Great to see this series, thanks again Christophe.
>
> I requested for help on the lsfmm hugetlb unification session, but
> unfortunately I don't think there were Power people around.. I'd like to
> request help from Power developers again here on the list: it will be very
> appreciated if you can help have a look at this series.

Christophe is a powerpc developer :)

I'll help where I can, but I don't know the hugepd code that well, I've
never really worked on it before. Nick will hopefully also be able to
help, he at least knows mm better than me, but he also has other work.

Hopefully we can make this series work, and replace hugepd. But if we
can't make that work then there is the possibility of just dropping
support for 16M/16G pages with HPT/4K pages.

> It's a direct dependent work to the hugetlb refactoring that we'll be
> working on, while it looks like the hugetlb refactoring is something the
> community as a whole would like to see in the near future.
>
> We don't want to add more Power-only CONFIG_ARCH_HAS_HUGEPD checks for
> hugetlb in any new code.

Yes I understand.

cheers


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64)
  2024-05-23 19:40 ` Peter Xu
  2024-05-24  4:46   ` Michael Ellerman
@ 2024-05-24  6:31   ` Oscar Salvador
  1 sibling, 0 replies; 60+ messages in thread
From: Oscar Salvador @ 2024-05-24  6:31 UTC (permalink / raw)
  To: Peter Xu
  Cc: Christophe Leroy, Andrew Morton, Jason Gunthorpe,
	Michael Ellerman, Nicholas Piggin, linux-kernel, linux-mm,
	linuxppc-dev

On Thu, May 23, 2024 at 03:40:20PM -0400, Peter Xu wrote:
> I requested for help on the lsfmm hugetlb unification session, but
> unfortunately I don't think there were Power people around.. I'd like to
> request help from Power developers again here on the list: it will be very
> appreciated if you can help have a look at this series.

I am not a powerpc developer but I plan on keep on reviewing this series
today and next week.

thanks


-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 12/20] powerpc/64e: Remove unneeded #ifdef CONFIG_PPC_E500
  2024-05-17 19:00 ` [RFC PATCH v2 12/20] powerpc/64e: Remove unneeded #ifdef CONFIG_PPC_E500 Christophe Leroy
@ 2024-05-24  7:31   ` Michael Ellerman
  2024-05-24  8:45     ` Christophe Leroy
  0 siblings, 1 reply; 60+ messages in thread
From: Michael Ellerman @ 2024-05-24  7:31 UTC (permalink / raw)
  To: Christophe Leroy, Andrew Morton, Jason Gunthorpe, Peter Xu,
	Oscar Salvador, Nicholas Piggin
  Cc: Christophe Leroy, linux-kernel, linux-mm, linuxppc-dev

Christophe Leroy <christophe.leroy@csgroup.eu> writes:
> When it is a nohash/64 it can't be anything else than
> CONFIG_PPC_E500 so remove the #ifdef as they are always true.

I have a series doing some similar cleanups, I'll post it. We can decide
whether to merge it before your series or combine them or whatever.

cheers


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 12/20] powerpc/64e: Remove unneeded #ifdef CONFIG_PPC_E500
  2024-05-24  7:31   ` Michael Ellerman
@ 2024-05-24  8:45     ` Christophe Leroy
  0 siblings, 0 replies; 60+ messages in thread
From: Christophe Leroy @ 2024-05-24  8:45 UTC (permalink / raw)
  To: Michael Ellerman, Andrew Morton, Jason Gunthorpe, Peter Xu,
	Oscar Salvador, Nicholas Piggin
  Cc: linux-kernel, linux-mm, linuxppc-dev



Le 24/05/2024 à 09:31, Michael Ellerman a écrit :
> Christophe Leroy <christophe.leroy@csgroup.eu> writes:
>> When it is a nohash/64 it can't be anything else than
>> CONFIG_PPC_E500 so remove the #ifdef as they are always true.
> 
> I have a series doing some similar cleanups, I'll post it. We can decide
> whether to merge it before your series or combine them or whatever.
> 

Great. I'll apply my series on top.

Note that it doesn't apply cleanly on merge branch (47279113c5d0), a 
3-way merge is needed:

$ LANG= git am -3 
~/Téléchargements/1-6-powerpc-64e-Remove-unused-IBM-HTW-code.patch
Applying: powerpc/64e: Remove unused IBM HTW code
Applying: powerpc/64e: Split out nohash Book3E 64-bit code
Using index info to reconstruct a base tree...
M	arch/powerpc/mm/nohash/Makefile
.git/rebase-apply/patch:554: trailing whitespace.
			def->shift = 0;	
warning: 1 line adds whitespace errors.
Falling back to patching base and 3-way merge...
Auto-merging arch/powerpc/mm/nohash/Makefile
Applying: powerpc/64e: Drop E500 ifdefs in 64-bit code
Applying: powerpc/64e: Drop MMU_FTR_TYPE_FSL_E checks in 64-bit code
Applying: powerpc/64e: Consolidate TLB miss handler patching
Applying: powerpc/64e: Drop unused TLB miss handlers

Thanks
Christophe

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 07/20] powerpc/8xx: Rework support for 8M pages using contiguous PTE entries
  2024-05-17 19:00 ` [RFC PATCH v2 07/20] powerpc/8xx: Rework support for 8M pages using contiguous PTE entries Christophe Leroy
@ 2024-05-24 10:02   ` Oscar Salvador
  2024-05-24 11:47     ` Christophe Leroy
  0 siblings, 1 reply; 60+ messages in thread
From: Oscar Salvador @ 2024-05-24 10:02 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Andrew Morton, Jason Gunthorpe, Peter Xu, Michael Ellerman,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev

On Fri, May 17, 2024 at 09:00:01PM +0200, Christophe Leroy wrote:
> In order to fit better with standard Linux page tables layout, add
> support for 8M pages using contiguous PTE entries in a standard
> page table. Page tables will then be populated with 1024 similar
> entries and two PMD entries will point to that page table.
> 
> The PMD entries also get a flag to tell it is addressing an 8M page,
> this is required for the HW tablewalk assistance.
> 
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>

I guess that this will slightly change if you remove patch#1 and patch#2
as you said you will.
So I will not comment on the overall design because I do not know how it will
look afterwards, but just some things that caught my eye

> --- a/arch/powerpc/include/asm/hugetlb.h
> +++ b/arch/powerpc/include/asm/hugetlb.h
> @@ -41,7 +41,16 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
>  static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
>  					    unsigned long addr, pte_t *ptep)
>  {
> -	return __pte(pte_update(mm, addr, ptep, ~0UL, 0, 1));
> +	pmd_t *pmdp = (pmd_t *)ptep;
> +	pte_t pte;
> +
> +	if (IS_ENABLED(CONFIG_PPC_8xx) && pmdp == pmd_off(mm, ALIGN_DOWN(addr, SZ_8M))) {

There are quite some places where you do the "pmd_off" to check whether that
is a 8MB entry.
I think it would make somse sense to have some kind of macro/function to make
more clear what we are checking against.
e.g:

 #define pmd_is_SZ_8M(mm, addr, pmdp) (pmdp == pmd_off(mm, ALIGN_DOWN(addr, SZ_8M)))
 (or whatever name you see fit)
 
then you would just need

 if (IS_ENABLED(CONFIG_PPC_8xx && pmd_is_SZ_8M(mm, addr, pdmp))

Because I see that is also scaterred in 8xx code.


> +		pte = __pte(pte_update(mm, addr, pte_offset_kernel(pmdp, 0), ~0UL, 0, 1));
> +		pte_update(mm, addr, pte_offset_kernel(pmdp + 1, 0), ~0UL, 0, 1);

I have this fresh one because I recently read about 8xx pagetables, but not sure
how my memory will survive this, so maybe throw a little comment in there that
we are pointing the two pmds to the area.

Also, the way we pass the parameters here to pte_update() is a bit awkward.
Ideally we should be using some meaningful names?

 clr_all_bits = ~0UL
 set_bits = 0
 bool is_huge = true

 pte_update(mm, addr, pte_offset_kernel(pmdp + 1, 0), clr_all_bits, set_bits, is_huge)

or something along those lines

> -static inline int check_and_get_huge_psize(int shift)
> -{
> -	return shift_to_mmu_psize(shift);
> +	if (pmdp == pmd_off(mm, ALIGN_DOWN(addr, SZ_8M)))

Here you could also use the pmd_is_SZ_8M()

> +		ptep = pte_offset_kernel(pmdp, 0);
> +	return ptep_get(ptep);
>  }
>  
>  #define __HAVE_ARCH_HUGE_SET_HUGE_PTE_AT
> @@ -53,7 +33,14 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
>  static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
>  				  pte_t *ptep, unsigned long sz)
>  {
> -	pte_update(mm, addr, ptep, ~0UL, 0, 1);
> +	pmd_t *pmdp = (pmd_t *)ptep;
> +
> +	if (pmdp == pmd_off(mm, ALIGN_DOWN(addr, SZ_8M))) {
> +		pte_update(mm, addr, pte_offset_kernel(pmdp, 0), ~0UL, 0, 1);
> +		pte_update(mm, addr, pte_offset_kernel(pmdp + 1, 0), ~0UL, 0, 1);
> +	} else {
> +		pte_update(mm, addr, ptep, ~0UL, 0, 1);
> +	}

Could we not leverage this in huge_ptep_get_and_clear()?
AFAICS,

 huge_pet_get_and_clear(mm, addr, pte_t *p) 
 {
      pte_t pte = pte_val(*p);

      huge_pte_clear(mm, addr, p);
      return pte;
 }

Or maybe it is not that easy if different powerpc platforms provide their own.
It might be worth checking though.

>  }
>  
>  #define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
> @@ -63,7 +50,14 @@ static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>  	unsigned long clr = ~pte_val(pte_wrprotect(__pte(~0)));
>  	unsigned long set = pte_val(pte_wrprotect(__pte(0)));
>  
> -	pte_update(mm, addr, ptep, clr, set, 1);
> +	pmd_t *pmdp = (pmd_t *)ptep;
> +
> +	if (pmdp == pmd_off(mm, ALIGN_DOWN(addr, SZ_8M))) {
> +		pte_update(mm, addr, pte_offset_kernel(pmdp, 0), clr, set, 1);
> +		pte_update(mm, addr, pte_offset_kernel(pmdp + 1, 0), clr, set, 1);
> +	} else {
> +		pte_update(mm, addr, ptep, clr, set, 1);

I would replace the "1" with "is_huge" or "huge", as being done in
__ptep_set_access_flags , something that makes it more clear without the need
to check pte_update().

  
>  #endif /* _ASM_POWERPC_PGALLOC_32_H */
> diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
> index 07df6b664861..b05cc4f87713 100644
> --- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
> +++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
...
> - * For other page sizes, we have a single entry in the table.
> + * For 8M pages, we have 1024 entries as if it was
> + * 4M pages, but they are flagged as 8M pages for the hardware.

Maybe drop a comment that a single PMD entry is worth 4MB, so

> + * For 4k pages, we have a single entry in the table.
>   */
> -static pmd_t *pmd_off(struct mm_struct *mm, unsigned long addr);
> -static int hugepd_ok(hugepd_t hpd);
> -
>  static inline int number_of_cells_per_pte(pmd_t *pmd, pte_basic_t val, int huge)
>  {
>  	if (!huge)
>  		return PAGE_SIZE / SZ_4K;
> -	else if (hugepd_ok(*((hugepd_t *)pmd)))
> -		return 1;
> +	else if ((pmd_val(*pmd) & _PMD_PAGE_MASK) == _PMD_PAGE_8M)
> +		return SZ_4M / SZ_4K;

this becomes more intuitive.

  
> +static inline void pmd_populate_kernel_size(struct mm_struct *mm, pmd_t *pmdp,
> +					    pte_t *pte, unsigned long sz)
> +{
> +	if (sz == SZ_8M)
> +		*pmdp = __pmd(__pa(pte) | _PMD_PRESENT | _PMD_PAGE_8M);
> +	else
> +		*pmdp = __pmd(__pa(pte) | _PMD_PRESENT);
> +}
> +
> +static inline void pmd_populate_size(struct mm_struct *mm, pmd_t *pmdp,
> +				     pgtable_t pte_page, unsigned long sz)
> +{
> +	if (sz == SZ_8M)
> +		*pmdp = __pmd(__pa(pte_page) | _PMD_USER | _PMD_PRESENT | _PMD_PAGE_8M);
> +	else
> +		*pmdp = __pmd(__pa(pte_page) | _PMD_USER | _PMD_PRESENT);
> +}

In patch#1 you mentioned this will change with the removal of patch#1
and patch#2.

> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -183,9 +183,6 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (!hpdp)
>  		return NULL;
>  
> -	if (IS_ENABLED(CONFIG_PPC_8xx) && pshift < PMD_SHIFT)
> -		return pte_alloc_huge(mm, (pmd_t *)hpdp, addr, sz);
> -
>  	BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp));
>  
>  	if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr,
> @@ -198,10 +195,18 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>  pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>  		      unsigned long addr, unsigned long sz)
>  {
> +	pmd_t *pmd = pmd_off(mm, addr);
> +
>  	if (sz < PMD_SIZE)
> -		return pte_alloc_huge(mm, pmd_off(mm, addr), addr, sz);
> +		return pte_alloc_huge(mm, pmd, addr, sz);
>  
> -	return NULL;
> +	if (sz != SZ_8M)
> +		return NULL;
> +	if (!pte_alloc_huge(mm, pmd, addr, sz))
> +		return NULL;
> +	if (!pte_alloc_huge(mm, pmd + 1, addr, sz))
> +		return NULL;
> +	return (pte_t *)pmd;

I think that having the check for invalid huge page sizes upfront would
make more sense, maybe just a matter of taste.

 /* Unsupported size */
 if (sz > PMD_SIZE && sz = SZ_8M)
     return NULL;

 if (sz < PMD_SIZE)
    ...
 /* 8MB huge pages */
 ...

 return (pte_t *) pmd;

Also, I am not a big fan of the two separate pte_alloc_huge() for pmd#0+pmd#1,
and I am thinking we might want to hide that within a function and drop a
comment in there explaining why we are updatng both pmds.
 
 

> diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c
> index d93433e26ded..99f656b3f9f3 100644
> --- a/arch/powerpc/mm/nohash/8xx.c
> +++ b/arch/powerpc/mm/nohash/8xx.c
> @@ -48,20 +48,6 @@ unsigned long p_block_mapped(phys_addr_t pa)
>  	return 0;
>  }
>  
> -static pte_t __init *early_hugepd_alloc_kernel(hugepd_t *pmdp, unsigned long va)
> -{
> -	if (hpd_val(*pmdp) == 0) {
> -		pte_t *ptep = memblock_alloc(sizeof(pte_basic_t), SZ_4K);
> -
> -		if (!ptep)
> -			return NULL;
> -
> -		hugepd_populate_kernel((hugepd_t *)pmdp, ptep, PAGE_SHIFT_8M);
> -		hugepd_populate_kernel((hugepd_t *)pmdp + 1, ptep, PAGE_SHIFT_8M);
> -	}
> -	return hugepte_offset(*(hugepd_t *)pmdp, va, PGDIR_SHIFT);
> -}
> -
>  static int __ref __early_map_kernel_hugepage(unsigned long va, phys_addr_t pa,
>  					     pgprot_t prot, int psize, bool new)

Am I blind or do we never use the 'new' parameter?
I checked the tree and it seems we always pass it 'true'.

arch/powerpc/mm/nohash/8xx.c:		err = __early_map_kernel_hugepage(v, p, prot, MMU_PAGE_512K, new);
arch/powerpc/mm/nohash/8xx.c:		err = __early_map_kernel_hugepage(v, p, prot, MMU_PAGE_8M, new);
arch/powerpc/mm/nohash/8xx.c:		err = __early_map_kernel_hugepage(v, p, prot, MMU_PAGE_512K, new);
arch/powerpc/mm/nohash/8xx.c:
__early_map_kernel_hugepage(VIRT_IMMR_BASE, PHYS_IMMR_BASE, PAGE_KERNEL_NCG, MMU_PAGE_512K, true);

I think we can drop the 'new' and the block code that tries to handle
it?

> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> index acdf64c9b93e..59f0d7706d2f 100644
> --- a/arch/powerpc/mm/pgtable.c
> +++ b/arch/powerpc/mm/pgtable.c

> +void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
> +		     pte_t pte, unsigned long sz)
> +{
> +	pmd_t *pmdp = pmd_off(mm, addr);
> +
> +	pte = set_pte_filter(pte, addr);
> +
> +	if (sz == SZ_8M) {
> +		__set_huge_pte_at(pmdp, pte_offset_kernel(pmdp, 0), pte_val(pte));
> +		__set_huge_pte_at(pmdp, pte_offset_kernel(pmdp + 1, 0), pte_val(pte) + SZ_4M);

You also mentioned that this would slightly change after you drop
patch#0 and patch#1.
The only comment I have right know would be to add a little comment
explaining the layout (the replication of 1024 entries), or just
something like "see comment from number_of_cells_per_pte".

 

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 07/20] powerpc/8xx: Rework support for 8M pages using contiguous PTE entries
  2024-05-24 10:02   ` Oscar Salvador
@ 2024-05-24 11:47     ` Christophe Leroy
  0 siblings, 0 replies; 60+ messages in thread
From: Christophe Leroy @ 2024-05-24 11:47 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Jason Gunthorpe, Peter Xu, Michael Ellerman,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev



Le 24/05/2024 à 12:02, Oscar Salvador a écrit :
> On Fri, May 17, 2024 at 09:00:01PM +0200, Christophe Leroy wrote:
>> In order to fit better with standard Linux page tables layout, add
>> support for 8M pages using contiguous PTE entries in a standard
>> page table. Page tables will then be populated with 1024 similar
>> entries and two PMD entries will point to that page table.
>>
>> The PMD entries also get a flag to tell it is addressing an 8M page,
>> this is required for the HW tablewalk assistance.
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> 
> I guess that this will slightly change if you remove patch#1 and patch#2
> as you said you will.
> So I will not comment on the overall design because I do not know how it will
> look afterwards, but just some things that caught my eye

Sure. I should send-out a v3 today or tomorrow, once I've done a few 
more tests.


> 
>> --- a/arch/powerpc/include/asm/hugetlb.h
>> +++ b/arch/powerpc/include/asm/hugetlb.h
>> @@ -41,7 +41,16 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
>>   static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
>>   					    unsigned long addr, pte_t *ptep)
>>   {
>> -	return __pte(pte_update(mm, addr, ptep, ~0UL, 0, 1));
>> +	pmd_t *pmdp = (pmd_t *)ptep;
>> +	pte_t pte;
>> +
>> +	if (IS_ENABLED(CONFIG_PPC_8xx) && pmdp == pmd_off(mm, ALIGN_DOWN(addr, SZ_8M))) {
> 
> There are quite some places where you do the "pmd_off" to check whether that
> is a 8MB entry.

I refactored the code, now I have only two places with it: pte_update() 
and huge_ptep_get()

By the way it doesn't check that PMD is 8M, it checks that the ptep 
points to the first PMD entry matching the said address.

> I think it would make somse sense to have some kind of macro/function to make
> more clear what we are checking against.
> e.g:
> 
>   #define pmd_is_SZ_8M(mm, addr, pmdp) (pmdp == pmd_off(mm, ALIGN_DOWN(addr, SZ_8M)))
>   (or whatever name you see fit)
>   
> then you would just need
> 
>   if (IS_ENABLED(CONFIG_PPC_8xx && pmd_is_SZ_8M(mm, addr, pdmp))
> 
> Because I see that is also scaterred in 8xx code.
> 
> 
>> +		pte = __pte(pte_update(mm, addr, pte_offset_kernel(pmdp, 0), ~0UL, 0, 1));
>> +		pte_update(mm, addr, pte_offset_kernel(pmdp + 1, 0), ~0UL, 0, 1);
> 
> I have this fresh one because I recently read about 8xx pagetables, but not sure
> how my memory will survive this, so maybe throw a little comment in there that
> we are pointing the two pmds to the area.

The two PMD are now pointing to there own areas, we are not anymore in 
the hugepd case where the PMD was pointing to a single HUGEPD containing 
a single HUGEPTE.

> 
> Also, the way we pass the parameters here to pte_update() is a bit awkward.
> Ideally we should be using some meaningful names?
> 
>   clr_all_bits = ~0UL
>   set_bits = 0
>   bool is_huge = true
> 
>   pte_update(mm, addr, pte_offset_kernel(pmdp + 1, 0), clr_all_bits, set_bits, is_huge)
> 
> or something along those lines

Well, with my refactoring those functions are not modified anymore so I 
won't change them.

> 
>> -static inline int check_and_get_huge_psize(int shift)
>> -{
>> -	return shift_to_mmu_psize(shift);
>> +	if (pmdp == pmd_off(mm, ALIGN_DOWN(addr, SZ_8M)))
> 
> Here you could also use the pmd_is_SZ_8M()

Yes, may do that.

> 
>> +		ptep = pte_offset_kernel(pmdp, 0);
>> +	return ptep_get(ptep);
>>   }
>>   
>>   #define __HAVE_ARCH_HUGE_SET_HUGE_PTE_AT
>> @@ -53,7 +33,14 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
>>   static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
>>   				  pte_t *ptep, unsigned long sz)
>>   {
>> -	pte_update(mm, addr, ptep, ~0UL, 0, 1);
>> +	pmd_t *pmdp = (pmd_t *)ptep;
>> +
>> +	if (pmdp == pmd_off(mm, ALIGN_DOWN(addr, SZ_8M))) {
>> +		pte_update(mm, addr, pte_offset_kernel(pmdp, 0), ~0UL, 0, 1);
>> +		pte_update(mm, addr, pte_offset_kernel(pmdp + 1, 0), ~0UL, 0, 1);
>> +	} else {
>> +		pte_update(mm, addr, ptep, ~0UL, 0, 1);
>> +	}
> 
> Could we not leverage this in huge_ptep_get_and_clear()?

I'm not modifying that anymore

> AFAICS,
> 
>   huge_pet_get_and_clear(mm, addr, pte_t *p)
>   {
>        pte_t pte = pte_val(*p);
> 
>        huge_pte_clear(mm, addr, p);
>        return pte;
>   }
> 
> Or maybe it is not that easy if different powerpc platforms provide their own.
> It might be worth checking though.
> 
>>   }
>>   
>>   #define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
>> @@ -63,7 +50,14 @@ static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>   	unsigned long clr = ~pte_val(pte_wrprotect(__pte(~0)));
>>   	unsigned long set = pte_val(pte_wrprotect(__pte(0)));
>>   
>> -	pte_update(mm, addr, ptep, clr, set, 1);
>> +	pmd_t *pmdp = (pmd_t *)ptep;
>> +
>> +	if (pmdp == pmd_off(mm, ALIGN_DOWN(addr, SZ_8M))) {
>> +		pte_update(mm, addr, pte_offset_kernel(pmdp, 0), clr, set, 1);
>> +		pte_update(mm, addr, pte_offset_kernel(pmdp + 1, 0), clr, set, 1);
>> +	} else {
>> +		pte_update(mm, addr, ptep, clr, set, 1);
> 
> I would replace the "1" with "is_huge" or "huge", as being done in
> __ptep_set_access_flags , something that makes it more clear without the need
> to check pte_update().

It's not modified anymore

> 
>    
>>   #endif /* _ASM_POWERPC_PGALLOC_32_H */
>> diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
>> index 07df6b664861..b05cc4f87713 100644
>> --- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h
>> +++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h
> ...
>> - * For other page sizes, we have a single entry in the table.
>> + * For 8M pages, we have 1024 entries as if it was
>> + * 4M pages, but they are flagged as 8M pages for the hardware.
> 
> Maybe drop a comment that a single PMD entry is worth 4MB, so

Ok, added that the 4M is indeed PMD_SIZE

> 
>> + * For 4k pages, we have a single entry in the table.
>>    */
>> -static pmd_t *pmd_off(struct mm_struct *mm, unsigned long addr);
>> -static int hugepd_ok(hugepd_t hpd);
>> -
>>   static inline int number_of_cells_per_pte(pmd_t *pmd, pte_basic_t val, int huge)
>>   {
>>   	if (!huge)
>>   		return PAGE_SIZE / SZ_4K;
>> -	else if (hugepd_ok(*((hugepd_t *)pmd)))
>> -		return 1;
>> +	else if ((pmd_val(*pmd) & _PMD_PAGE_MASK) == _PMD_PAGE_8M)
>> +		return SZ_4M / SZ_4K;
> 
> this becomes more intuitive.
> 
>    
>> +static inline void pmd_populate_kernel_size(struct mm_struct *mm, pmd_t *pmdp,
>> +					    pte_t *pte, unsigned long sz)
>> +{
>> +	if (sz == SZ_8M)
>> +		*pmdp = __pmd(__pa(pte) | _PMD_PRESENT | _PMD_PAGE_8M);
>> +	else
>> +		*pmdp = __pmd(__pa(pte) | _PMD_PRESENT);
>> +}
>> +
>> +static inline void pmd_populate_size(struct mm_struct *mm, pmd_t *pmdp,
>> +				     pgtable_t pte_page, unsigned long sz)
>> +{
>> +	if (sz == SZ_8M)
>> +		*pmdp = __pmd(__pa(pte_page) | _PMD_USER | _PMD_PRESENT | _PMD_PAGE_8M);
>> +	else
>> +		*pmdp = __pmd(__pa(pte_page) | _PMD_USER | _PMD_PRESENT);
>> +}
> 
> In patch#1 you mentioned this will change with the removal of patch#1
> and patch#2.

Yes this goes away.

> 
>> --- a/arch/powerpc/mm/hugetlbpage.c
>> +++ b/arch/powerpc/mm/hugetlbpage.c
>> @@ -183,9 +183,6 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>>   	if (!hpdp)
>>   		return NULL;
>>   
>> -	if (IS_ENABLED(CONFIG_PPC_8xx) && pshift < PMD_SHIFT)
>> -		return pte_alloc_huge(mm, (pmd_t *)hpdp, addr, sz);
>> -
>>   	BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp));
>>   
>>   	if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr,
>> @@ -198,10 +195,18 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>>   pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>>   		      unsigned long addr, unsigned long sz)
>>   {
>> +	pmd_t *pmd = pmd_off(mm, addr);
>> +
>>   	if (sz < PMD_SIZE)
>> -		return pte_alloc_huge(mm, pmd_off(mm, addr), addr, sz);
>> +		return pte_alloc_huge(mm, pmd, addr, sz);
>>   
>> -	return NULL;
>> +	if (sz != SZ_8M)
>> +		return NULL;
>> +	if (!pte_alloc_huge(mm, pmd, addr, sz))
>> +		return NULL;
>> +	if (!pte_alloc_huge(mm, pmd + 1, addr, sz))
>> +		return NULL;
>> +	return (pte_t *)pmd;
> 
> I think that having the check for invalid huge page sizes upfront would
> make more sense, maybe just a matter of taste.

Well, it would make it less easy when we go one step further to support 
e500 and book3s/64. I prefer to do it that way to keep it as flat as 
possible and avoid a deep if ... if ... if

By the way I have now squashed patch 11 into patch 5.

> 
>   /* Unsupported size */
>   if (sz > PMD_SIZE && sz = SZ_8M)
>       return NULL;
> 
>   if (sz < PMD_SIZE)
>      ...
>   /* 8MB huge pages */
>   ...
> 
>   return (pte_t *) pmd;
> 
> Also, I am not a big fan of the two separate pte_alloc_huge() for pmd#0+pmd#1,
> and I am thinking we might want to hide that within a function and drop a
> comment in there explaining why we are updatng both pmds.

Now changed to:

+       for (i = 0; i < sz / PMD_SIZE; i++) {
+               if (!pte_alloc_huge(mm, pmd + i, addr))
+                       return NULL;
+       }

>   
>   
> 
>> diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c
>> index d93433e26ded..99f656b3f9f3 100644
>> --- a/arch/powerpc/mm/nohash/8xx.c
>> +++ b/arch/powerpc/mm/nohash/8xx.c
>> @@ -48,20 +48,6 @@ unsigned long p_block_mapped(phys_addr_t pa)
>>   	return 0;
>>   }
>>   
>> -static pte_t __init *early_hugepd_alloc_kernel(hugepd_t *pmdp, unsigned long va)
>> -{
>> -	if (hpd_val(*pmdp) == 0) {
>> -		pte_t *ptep = memblock_alloc(sizeof(pte_basic_t), SZ_4K);
>> -
>> -		if (!ptep)
>> -			return NULL;
>> -
>> -		hugepd_populate_kernel((hugepd_t *)pmdp, ptep, PAGE_SHIFT_8M);
>> -		hugepd_populate_kernel((hugepd_t *)pmdp + 1, ptep, PAGE_SHIFT_8M);
>> -	}
>> -	return hugepte_offset(*(hugepd_t *)pmdp, va, PGDIR_SHIFT);
>> -}
>> -
>>   static int __ref __early_map_kernel_hugepage(unsigned long va, phys_addr_t pa,
>>   					     pgprot_t prot, int psize, bool new)
> 
> Am I blind or do we never use the 'new' parameter?
> I checked the tree and it seems we always pass it 'true'.

You must be blind :)

$ git grep mmu_mapin_ram_chunk
arch/powerpc/mm/nohash/8xx.c:static int mmu_mapin_ram_chunk(unsigned 
long offset, unsigned long top,
arch/powerpc/mm/nohash/8xx.c:   mmu_mapin_ram_chunk(0, boundary, 
PAGE_KERNEL_TEXT, true);
arch/powerpc/mm/nohash/8xx.c:           mmu_mapin_ram_chunk(boundary, 
einittext8, PAGE_KERNEL_TEXT, true);
arch/powerpc/mm/nohash/8xx.c:           mmu_mapin_ram_chunk(einittext8, 
top, PAGE_KERNEL, true);
arch/powerpc/mm/nohash/8xx.c:           err = 
mmu_mapin_ram_chunk(boundary, einittext8, PAGE_KERNEL, false);
arch/powerpc/mm/nohash/8xx.c:   err = mmu_mapin_ram_chunk(0, sinittext, 
PAGE_KERNEL_ROX, false);



> 
> arch/powerpc/mm/nohash/8xx.c:		err = __early_map_kernel_hugepage(v, p, prot, MMU_PAGE_512K, new);
> arch/powerpc/mm/nohash/8xx.c:		err = __early_map_kernel_hugepage(v, p, prot, MMU_PAGE_8M, new);
> arch/powerpc/mm/nohash/8xx.c:		err = __early_map_kernel_hugepage(v, p, prot, MMU_PAGE_512K, new);
> arch/powerpc/mm/nohash/8xx.c:
> __early_map_kernel_hugepage(VIRT_IMMR_BASE, PHYS_IMMR_BASE, PAGE_KERNEL_NCG, MMU_PAGE_512K, true);
> 
> I think we can drop the 'new' and the block code that tries to handle
> it?
> 
>> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
>> index acdf64c9b93e..59f0d7706d2f 100644
>> --- a/arch/powerpc/mm/pgtable.c
>> +++ b/arch/powerpc/mm/pgtable.c
> 
>> +void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
>> +		     pte_t pte, unsigned long sz)
>> +{
>> +	pmd_t *pmdp = pmd_off(mm, addr);
>> +
>> +	pte = set_pte_filter(pte, addr);
>> +
>> +	if (sz == SZ_8M) {
>> +		__set_huge_pte_at(pmdp, pte_offset_kernel(pmdp, 0), pte_val(pte));
>> +		__set_huge_pte_at(pmdp, pte_offset_kernel(pmdp + 1, 0), pte_val(pte) + SZ_4M);
> 
> You also mentioned that this would slightly change after you drop
> patch#0 and patch#1.
> The only comment I have right know would be to add a little comment
> explaining the layout (the replication of 1024 entries), or just
> something like "see comment from number_of_cells_per_pte".
> 


Done.

Thanks for the rewiew.
Christophe

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 08/20] powerpc/8xx: Simplify struct mmu_psize_def
  2024-05-17 19:00 ` [RFC PATCH v2 08/20] powerpc/8xx: Simplify struct mmu_psize_def Christophe Leroy
@ 2024-05-25  3:36   ` Oscar Salvador
  0 siblings, 0 replies; 60+ messages in thread
From: Oscar Salvador @ 2024-05-25  3:36 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Andrew Morton, Jason Gunthorpe, Peter Xu, Michael Ellerman,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev

On Fri, May 17, 2024 at 09:00:02PM +0200, Christophe Leroy wrote:
> On 8xx, only the shift field is used in struct mmu_psize_def
> 
> Remove other fields and related macros.
> 
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>

Reviewed-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 09/20] powerpc/mm: Remove _PAGE_PSIZE
  2024-05-17 19:00 ` [RFC PATCH v2 09/20] powerpc/mm: Remove _PAGE_PSIZE Christophe Leroy
@ 2024-05-25  3:40   ` Oscar Salvador
  0 siblings, 0 replies; 60+ messages in thread
From: Oscar Salvador @ 2024-05-25  3:40 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Andrew Morton, Jason Gunthorpe, Peter Xu, Michael Ellerman,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev

On Fri, May 17, 2024 at 09:00:03PM +0200, Christophe Leroy wrote:
> _PAGE_PSIZE macro is never used outside the place it is defined
> and is used only on 8xx and e500.
> 
> Remove indirection, remove it and use its content directly.
> 
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>

Reviewed-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 10/20] powerpc/mm: Fix __find_linux_pte() on 32 bits with PMD leaf entries
  2024-05-17 19:00 ` [RFC PATCH v2 10/20] powerpc/mm: Fix __find_linux_pte() on 32 bits with PMD leaf entries Christophe Leroy
@ 2024-05-25  4:12   ` Oscar Salvador
  2024-05-25  6:41     ` Christophe Leroy
  0 siblings, 1 reply; 60+ messages in thread
From: Oscar Salvador @ 2024-05-25  4:12 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Andrew Morton, Jason Gunthorpe, Peter Xu, Michael Ellerman,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev

On Fri, May 17, 2024 at 09:00:04PM +0200, Christophe Leroy wrote:
> Building on 32 bits with pmd_leaf() not returning always false leads
> to the following error:

I am curious though.
pmd_leaf is only defined in include/linux/pgtable.h for 32bits, and is hardcoded
to false.
I do not see where we change it in previous patches, so is this artificial?

> 
>   CC      arch/powerpc/mm/pgtable.o
> arch/powerpc/mm/pgtable.c: In function '__find_linux_pte':
> arch/powerpc/mm/pgtable.c:506:1: error: function may return address of local variable [-Werror=return-local-addr]
>   506 | }
>       | ^
> arch/powerpc/mm/pgtable.c:394:15: note: declared here
>   394 |         pud_t pud, *pudp;
>       |               ^~~
> arch/powerpc/mm/pgtable.c:394:15: note: declared here
> 
> This is due to pmd_offset() being a no-op in that case.

This is because 32bits powerpc include pgtable-nopmd.h?

> So rework it for powerpc/32 so that pXd_offset() are used on real
> pointers and not on on-stack copies.
> 
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> ---
>  arch/powerpc/mm/pgtable.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> index 59f0d7706d2f..51ee508eeb5b 100644
> --- a/arch/powerpc/mm/pgtable.c
> +++ b/arch/powerpc/mm/pgtable.c
> @@ -390,8 +390,12 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
>  			bool *is_thp, unsigned *hpage_shift)
>  {
>  	pgd_t *pgdp;
> -	p4d_t p4d, *p4dp;
> -	pud_t pud, *pudp;
> +	p4d_t *p4dp;
> +	pud_t *pudp;
> +#ifdef CONFIG_PPC64
> +	p4d_t p4d;
> +	pud_t pud;
> +#endif
>  	pmd_t pmd, *pmdp;
>  	pte_t *ret_pte;
>  	hugepd_t *hpdp = NULL;
> @@ -412,6 +416,7 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
>  	 */
>  	pgdp = pgdir + pgd_index(ea);
>  	p4dp = p4d_offset(pgdp, ea);
> +#ifdef CONFIG_PPC64
>  	p4d  = READ_ONCE(*p4dp);
>  	pdshift = P4D_SHIFT;
>  
> @@ -452,6 +457,11 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
>  
>  	pdshift = PMD_SHIFT;
>  	pmdp = pmd_offset(&pud, ea);
> +#else
> +	p4dp = p4d_offset(pgdp, ea);
> +	pudp = pud_offset(p4dp, ea);
> +	pmdp = pmd_offset(pudp, ea);

I would drop a comment on top explaining that these are no-op for 32bits,
otherwise it might not be obvious to people as why this distiction between 64 and
32bits.

Other than that looks good to me

 

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 11/20] powerpc/mm: Complement huge_pte_alloc() for all non HUGEPD setups
  2024-05-17 19:00 ` [RFC PATCH v2 11/20] powerpc/mm: Complement huge_pte_alloc() for all non HUGEPD setups Christophe Leroy
@ 2024-05-25  4:29   ` Oscar Salvador
  2024-05-25  6:44     ` Christophe Leroy
  0 siblings, 1 reply; 60+ messages in thread
From: Oscar Salvador @ 2024-05-25  4:29 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Andrew Morton, Jason Gunthorpe, Peter Xu, Michael Ellerman,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev

On Fri, May 17, 2024 at 09:00:05PM +0200, Christophe Leroy wrote:
> huge_pte_alloc() for non-HUGEPD targets is reserved for 8xx at the
> moment. In order to convert other targets for non-HUGEPD, complement
> huge_pte_alloc() to support any standard cont-PxD setup.
> 
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> ---
>  arch/powerpc/mm/hugetlbpage.c | 25 ++++++++++++++++++++++++-
>  1 file changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index 42b12e1ec851..f8aefa1e7363 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -195,11 +195,34 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>  pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>  		      unsigned long addr, unsigned long sz)
>  {
> -	pmd_t *pmd = pmd_off(mm, addr);
> +	pgd_t *pgd;
> +	p4d_t *p4d;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +
> +	addr &= ~(sz - 1);
> +	pgd = pgd_offset(mm, addr);
> +
> +	p4d = p4d_offset(pgd, addr);
> +	if (sz >= PGDIR_SIZE)
> +		return (pte_t *)p4d;
> +
> +	pud = pud_alloc(mm, p4d, addr);
> +	if (!pud)
> +		return NULL;
> +	if (sz >= PUD_SIZE)
> +		return (pte_t *)pud;
> +
> +	pmd = pmd_alloc(mm, pud, addr);
> +	if (!pmd)
> +		return NULL;
>  
>  	if (sz < PMD_SIZE)
>  		return pte_alloc_huge(mm, pmd, addr, sz);
>  
> +	if (!IS_ENABLED(CONFIG_PPC_8xx))
> +		return (pte_t *)pmd;

So only 8xx has cont-PMD for hugepages?

> +
>  	if (sz != SZ_8M)
>  		return NULL;

Since this function is the core for allocation huge pages, I think it would
benefit from a comment at the top explaining the possible layouts.
e.g: Who can have cont-{P4d,PUD,PMD} etc.
A brief explanation of the possible scheme for all powerpc platforms.

That would help people looking into this in a future.

 

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 14/20] powerpc/e500: Remove enc field from struct mmu_psize_def
  2024-05-17 19:00 ` [RFC PATCH v2 14/20] powerpc/e500: Remove enc field from struct mmu_psize_def Christophe Leroy
@ 2024-05-25  4:35   ` Oscar Salvador
  0 siblings, 0 replies; 60+ messages in thread
From: Oscar Salvador @ 2024-05-25  4:35 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Andrew Morton, Jason Gunthorpe, Peter Xu, Michael Ellerman,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev

On Fri, May 17, 2024 at 09:00:08PM +0200, Christophe Leroy wrote:
> enc field is hidden behind BOOK3E_PAGESZ_XX macros, and when you look
> closer you realise that this field is nothing else than the value of
> shift minus ten.
> 
> So remove enc field and calculate tsize from shift field.
> 
> Also remove inc filed which is unused.
> 
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 15/20] powerpc/85xx: Switch to 64 bits PGD
  2024-05-17 19:00 ` [RFC PATCH v2 15/20] powerpc/85xx: Switch to 64 bits PGD Christophe Leroy
@ 2024-05-25  4:54   ` Oscar Salvador
  2024-05-25  9:02     ` Christophe Leroy
  0 siblings, 1 reply; 60+ messages in thread
From: Oscar Salvador @ 2024-05-25  4:54 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Andrew Morton, Jason Gunthorpe, Peter Xu, Michael Ellerman,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev

On Fri, May 17, 2024 at 09:00:09PM +0200, Christophe Leroy wrote:
> In order to allow leaf PMD entries, switch the PGD to 64 bits entries.
> 
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>

I do not quite understand this change.
Are not powerE500 and power85xx two different things?
You are changing making it 64 for PPC_E500_64bits, but you are updating head_85xx.
Are they sharing this code?

Also, we would benefit from a slightly bigger changelog, explaining why
do we need this change in some more detail.

 
> diff --git a/arch/powerpc/include/asm/pgtable-types.h b/arch/powerpc/include/asm/pgtable-types.h
> index 082c85cc09b1..db965d98e0ae 100644
> --- a/arch/powerpc/include/asm/pgtable-types.h
> +++ b/arch/powerpc/include/asm/pgtable-types.h
> @@ -49,7 +49,11 @@ static inline unsigned long pud_val(pud_t x)
>  #endif /* CONFIG_PPC64 */
>  
>  /* PGD level */
> +#if defined(CONFIG_PPC_E500) && defined(CONFIG_PTE_64BIT)
> +typedef struct { unsigned long long pgd; } pgd_t;
> +#else
>  typedef struct { unsigned long pgd; } pgd_t;
> +#endif
>  #define __pgd(x)	((pgd_t) { (x) })
>  static inline unsigned long pgd_val(pgd_t x)
>  {
> diff --git a/arch/powerpc/kernel/head_85xx.S b/arch/powerpc/kernel/head_85xx.S
> index 39724ff5ae1f..a305244afc9f 100644
> --- a/arch/powerpc/kernel/head_85xx.S
> +++ b/arch/powerpc/kernel/head_85xx.S
> @@ -307,8 +307,9 @@ set_ivor:
>  #ifdef CONFIG_PTE_64BIT
>  #ifdef CONFIG_HUGETLB_PAGE
>  #define FIND_PTE	\
> -	rlwinm	r12, r10, 13, 19, 29;	/* Compute pgdir/pmd offset */	\
> -	lwzx	r11, r12, r11;		/* Get pgd/pmd entry */		\
> +	rlwinm	r12, r10, 14, 18, 28;	/* Compute pgdir/pmd offset */	\
> +	add	r12, r11, r12;

You add the offset to pgdir? 

> +	lwz	r11, 4(r12);		/* Get pgd/pmd entry */		\

What is i offset 4?


-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 10/20] powerpc/mm: Fix __find_linux_pte() on 32 bits with PMD leaf entries
  2024-05-25  4:12   ` Oscar Salvador
@ 2024-05-25  6:41     ` Christophe Leroy
  0 siblings, 0 replies; 60+ messages in thread
From: Christophe Leroy @ 2024-05-25  6:41 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Jason Gunthorpe, Peter Xu, Michael Ellerman,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev



Le 25/05/2024 à 06:12, Oscar Salvador a écrit :
> On Fri, May 17, 2024 at 09:00:04PM +0200, Christophe Leroy wrote:
>> Building on 32 bits with pmd_leaf() not returning always false leads
>> to the following error:
> 
> I am curious though.
> pmd_leaf is only defined in include/linux/pgtable.h for 32bits, and is hardcoded
> to false.
> I do not see where we change it in previous patches, so is this artificial?

Patch 17 brings pmd_leaf()

> 
>>
>>    CC      arch/powerpc/mm/pgtable.o
>> arch/powerpc/mm/pgtable.c: In function '__find_linux_pte':
>> arch/powerpc/mm/pgtable.c:506:1: error: function may return address of local variable [-Werror=return-local-addr]
>>    506 | }
>>        | ^
>> arch/powerpc/mm/pgtable.c:394:15: note: declared here
>>    394 |         pud_t pud, *pudp;
>>        |               ^~~
>> arch/powerpc/mm/pgtable.c:394:15: note: declared here
>>
>> This is due to pmd_offset() being a no-op in that case.
> 
> This is because 32bits powerpc include pgtable-nopmd.h?
> 
>> So rework it for powerpc/32 so that pXd_offset() are used on real
>> pointers and not on on-stack copies.
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
>> ---
>>   arch/powerpc/mm/pgtable.c | 14 ++++++++++++--
>>   1 file changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
>> index 59f0d7706d2f..51ee508eeb5b 100644
>> --- a/arch/powerpc/mm/pgtable.c
>> +++ b/arch/powerpc/mm/pgtable.c
>> @@ -390,8 +390,12 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
>>   			bool *is_thp, unsigned *hpage_shift)
>>   {
>>   	pgd_t *pgdp;
>> -	p4d_t p4d, *p4dp;
>> -	pud_t pud, *pudp;
>> +	p4d_t *p4dp;
>> +	pud_t *pudp;
>> +#ifdef CONFIG_PPC64
>> +	p4d_t p4d;
>> +	pud_t pud;
>> +#endif
>>   	pmd_t pmd, *pmdp;
>>   	pte_t *ret_pte;
>>   	hugepd_t *hpdp = NULL;
>> @@ -412,6 +416,7 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
>>   	 */
>>   	pgdp = pgdir + pgd_index(ea);
>>   	p4dp = p4d_offset(pgdp, ea);
>> +#ifdef CONFIG_PPC64
>>   	p4d  = READ_ONCE(*p4dp);
>>   	pdshift = P4D_SHIFT;
>>   
>> @@ -452,6 +457,11 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
>>   
>>   	pdshift = PMD_SHIFT;
>>   	pmdp = pmd_offset(&pud, ea);
>> +#else
>> +	p4dp = p4d_offset(pgdp, ea);
>> +	pudp = pud_offset(p4dp, ea);
>> +	pmdp = pmd_offset(pudp, ea);
> 
> I would drop a comment on top explaining that these are no-op for 32bits,
> otherwise it might not be obvious to people as why this distiction between 64 and
> 32bits.

Ok

> 
> Other than that looks good to me
> 
>   
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 11/20] powerpc/mm: Complement huge_pte_alloc() for all non HUGEPD setups
  2024-05-25  4:29   ` Oscar Salvador
@ 2024-05-25  6:44     ` Christophe Leroy
  2024-05-25 10:33       ` Oscar Salvador
  0 siblings, 1 reply; 60+ messages in thread
From: Christophe Leroy @ 2024-05-25  6:44 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Jason Gunthorpe, Peter Xu, Michael Ellerman,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev



Le 25/05/2024 à 06:29, Oscar Salvador a écrit :
> On Fri, May 17, 2024 at 09:00:05PM +0200, Christophe Leroy wrote:
>> huge_pte_alloc() for non-HUGEPD targets is reserved for 8xx at the
>> moment. In order to convert other targets for non-HUGEPD, complement
>> huge_pte_alloc() to support any standard cont-PxD setup.
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
>> ---
>>   arch/powerpc/mm/hugetlbpage.c | 25 ++++++++++++++++++++++++-
>>   1 file changed, 24 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
>> index 42b12e1ec851..f8aefa1e7363 100644
>> --- a/arch/powerpc/mm/hugetlbpage.c
>> +++ b/arch/powerpc/mm/hugetlbpage.c
>> @@ -195,11 +195,34 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>>   pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>>   		      unsigned long addr, unsigned long sz)
>>   {
>> -	pmd_t *pmd = pmd_off(mm, addr);
>> +	pgd_t *pgd;
>> +	p4d_t *p4d;
>> +	pud_t *pud;
>> +	pmd_t *pmd;
>> +
>> +	addr &= ~(sz - 1);
>> +	pgd = pgd_offset(mm, addr);
>> +
>> +	p4d = p4d_offset(pgd, addr);
>> +	if (sz >= PGDIR_SIZE)
>> +		return (pte_t *)p4d;
>> +
>> +	pud = pud_alloc(mm, p4d, addr);
>> +	if (!pud)
>> +		return NULL;
>> +	if (sz >= PUD_SIZE)
>> +		return (pte_t *)pud;
>> +
>> +	pmd = pmd_alloc(mm, pud, addr);
>> +	if (!pmd)
>> +		return NULL;
>>   
>>   	if (sz < PMD_SIZE)
>>   		return pte_alloc_huge(mm, pmd, addr, sz);
>>   
>> +	if (!IS_ENABLED(CONFIG_PPC_8xx))
>> +		return (pte_t *)pmd;
> 
> So only 8xx has cont-PMD for hugepages?

No, all have cont-PMD but only 8xx handles pages greater than PMD_SIZE 
as cont-PTE instead of cont-PMD.

> 
>> +
>>   	if (sz != SZ_8M)
>>   		return NULL;
> 
> Since this function is the core for allocation huge pages, I think it would
> benefit from a comment at the top explaining the possible layouts.
> e.g: Who can have cont-{P4d,PUD,PMD} etc.
> A brief explanation of the possible scheme for all powerpc platforms.

All is standard except 8xx, let's just have a comment for 8xx.

> 
> That would help people looking into this in a future.
> 
>   
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 15/20] powerpc/85xx: Switch to 64 bits PGD
  2024-05-25  4:54   ` Oscar Salvador
@ 2024-05-25  9:02     ` Christophe Leroy
  0 siblings, 0 replies; 60+ messages in thread
From: Christophe Leroy @ 2024-05-25  9:02 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Jason Gunthorpe, Peter Xu, Michael Ellerman,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev



Le 25/05/2024 à 06:54, Oscar Salvador a écrit :
> On Fri, May 17, 2024 at 09:00:09PM +0200, Christophe Leroy wrote:
>> In order to allow leaf PMD entries, switch the PGD to 64 bits entries.
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> 
> I do not quite understand this change.
> Are not powerE500 and power85xx two different things?

Yes they are two different things, but one contains the other

e500 is the processor-core which is included inside the MPC85xx micro 
controller.

But CONFIG_PPC_E500 is a bit more than e500 core, it also includes e5500 
and e6500 which are evolutions of e500.

mpc85xx is 32 bits
e5500 and e6500 are 64 bits



> You are changing making it 64 for PPC_E500_64bits, but you are updating head_85xx.
> Are they sharing this code?

Not exactly. mpc85xx can be built with 32 bits PTE or 64 bits PTE, based 
on CONFIG_PTE_64BIT

When CONFIG_PTE_64BIT is selected it uses the same PTE layout on 32-bits 
and 64-bits. But on 32-bits the PGD is still 32-bits, so it is not 
possible to use leaf entries at PGD level hence the change.

When CONFIG_PTE_64BIT is not selected, huge pages are not supported.

> 
> Also, we would benefit from a slightly bigger changelog, explaining why
> do we need this change in some more detail.

Yes I can write this is because PTEs are 64-bits allthought I thought it 
was obvious.

> 
>   
>> diff --git a/arch/powerpc/include/asm/pgtable-types.h b/arch/powerpc/include/asm/pgtable-types.h
>> index 082c85cc09b1..db965d98e0ae 100644
>> --- a/arch/powerpc/include/asm/pgtable-types.h
>> +++ b/arch/powerpc/include/asm/pgtable-types.h
>> @@ -49,7 +49,11 @@ static inline unsigned long pud_val(pud_t x)
>>   #endif /* CONFIG_PPC64 */
>>   
>>   /* PGD level */
>> +#if defined(CONFIG_PPC_E500) && defined(CONFIG_PTE_64BIT)
>> +typedef struct { unsigned long long pgd; } pgd_t;
>> +#else
>>   typedef struct { unsigned long pgd; } pgd_t;
>> +#endif
>>   #define __pgd(x)	((pgd_t) { (x) })
>>   static inline unsigned long pgd_val(pgd_t x)
>>   {
>> diff --git a/arch/powerpc/kernel/head_85xx.S b/arch/powerpc/kernel/head_85xx.S
>> index 39724ff5ae1f..a305244afc9f 100644
>> --- a/arch/powerpc/kernel/head_85xx.S
>> +++ b/arch/powerpc/kernel/head_85xx.S
>> @@ -307,8 +307,9 @@ set_ivor:
>>   #ifdef CONFIG_PTE_64BIT
>>   #ifdef CONFIG_HUGETLB_PAGE
>>   #define FIND_PTE	\
>> -	rlwinm	r12, r10, 13, 19, 29;	/* Compute pgdir/pmd offset */	\
>> -	lwzx	r11, r12, r11;		/* Get pgd/pmd entry */		\
>> +	rlwinm	r12, r10, 14, 18, 28;	/* Compute pgdir/pmd offset */	\
>> +	add	r12, r11, r12;
> 
> You add the offset to pgdir?

Yes because later r12 points to the PTE so when it is a leaf PGD entry 
we need r12 to point to that entry.

> 
>> +	lwz	r11, 4(r12);		/* Get pgd/pmd entry */		\
> 
> What is i offset 4?

It is big endian, the entry is now 64 bits but the real content of the 
entry is still 32 bits so it is in the lower word.

> 
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 11/20] powerpc/mm: Complement huge_pte_alloc() for all non HUGEPD setups
  2024-05-25  6:44     ` Christophe Leroy
@ 2024-05-25 10:33       ` Oscar Salvador
  0 siblings, 0 replies; 60+ messages in thread
From: Oscar Salvador @ 2024-05-25 10:33 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Andrew Morton, Jason Gunthorpe, Peter Xu, Michael Ellerman,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev

On Sat, May 25, 2024 at 06:44:06AM +0000, Christophe Leroy wrote:
> No, all have cont-PMD but only 8xx handles pages greater than PMD_SIZE 
> as cont-PTE instead of cont-PMD.

Yes, sorry, I managed to confuse myself. It is obvious from the code.

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64)
  2024-05-24  4:46   ` Michael Ellerman
@ 2024-05-27 14:14     ` Peter Xu
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2024-05-27 14:14 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Christophe Leroy, Andrew Morton, Jason Gunthorpe, Oscar Salvador,
	Nicholas Piggin, linux-kernel, linux-mm, linuxppc-dev

On Fri, May 24, 2024 at 02:46:58PM +1000, Michael Ellerman wrote:
> Christophe is a powerpc developer :)

Yes, definitely. :)

> 
> I'll help where I can, but I don't know the hugepd code that well, I've
> never really worked on it before. Nick will hopefully also be able to
> help, he at least knows mm better than me, but he also has other work.
> 
> Hopefully we can make this series work, and replace hugepd. But if we
> can't make that work then there is the possibility of just dropping
> support for 16M/16G pages with HPT/4K pages.

Great, thank you!

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2024-05-27 14:14 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-17 18:59 [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Christophe Leroy
2024-05-17 18:59 ` [RFC PATCH v2 01/20] mm: Provide pagesize to pmd_populate() Christophe Leroy
2024-05-20  9:01   ` Oscar Salvador
2024-05-20 16:24     ` Christophe Leroy
2024-05-21 11:57       ` Oscar Salvador
2024-05-22  8:37         ` Christophe Leroy
2024-05-17 18:59 ` [RFC PATCH v2 02/20] mm: Provide page size to pte_alloc_huge() Christophe Leroy
2024-05-17 18:59 ` [RFC PATCH v2 03/20] mm: Provide pmd to pte_leaf_size() Christophe Leroy
2024-05-21  9:39   ` Oscar Salvador
2024-05-22 10:22     ` Christophe Leroy
2024-05-17 18:59 ` [RFC PATCH v2 04/20] mm: Provide mm_struct and address to huge_ptep_get() Christophe Leroy
2024-05-17 18:59 ` [RFC PATCH v2 05/20] powerpc/mm: Allow hugepages without hugepd Christophe Leroy
2024-05-17 19:00 ` [RFC PATCH v2 06/20] powerpc/8xx: Fix size given to set_huge_pte_at() Christophe Leroy
2024-05-20  9:14   ` Oscar Salvador
2024-05-20 16:31     ` Christophe Leroy
2024-05-20 17:42       ` Oscar Salvador
2024-05-22  8:45         ` Christophe Leroy
2024-05-21  0:48       ` Michael Ellerman
2024-05-21  9:26         ` Oscar Salvador
2024-05-22  8:32           ` Christophe Leroy
2024-05-22 12:18             ` Christophe Leroy
2024-05-17 19:00 ` [RFC PATCH v2 07/20] powerpc/8xx: Rework support for 8M pages using contiguous PTE entries Christophe Leroy
2024-05-24 10:02   ` Oscar Salvador
2024-05-24 11:47     ` Christophe Leroy
2024-05-17 19:00 ` [RFC PATCH v2 08/20] powerpc/8xx: Simplify struct mmu_psize_def Christophe Leroy
2024-05-25  3:36   ` Oscar Salvador
2024-05-17 19:00 ` [RFC PATCH v2 09/20] powerpc/mm: Remove _PAGE_PSIZE Christophe Leroy
2024-05-25  3:40   ` Oscar Salvador
2024-05-17 19:00 ` [RFC PATCH v2 10/20] powerpc/mm: Fix __find_linux_pte() on 32 bits with PMD leaf entries Christophe Leroy
2024-05-25  4:12   ` Oscar Salvador
2024-05-25  6:41     ` Christophe Leroy
2024-05-17 19:00 ` [RFC PATCH v2 11/20] powerpc/mm: Complement huge_pte_alloc() for all non HUGEPD setups Christophe Leroy
2024-05-25  4:29   ` Oscar Salvador
2024-05-25  6:44     ` Christophe Leroy
2024-05-25 10:33       ` Oscar Salvador
2024-05-17 19:00 ` [RFC PATCH v2 12/20] powerpc/64e: Remove unneeded #ifdef CONFIG_PPC_E500 Christophe Leroy
2024-05-24  7:31   ` Michael Ellerman
2024-05-24  8:45     ` Christophe Leroy
2024-05-17 19:00 ` [RFC PATCH v2 13/20] powerpc/64e: Clean up impossible setups Christophe Leroy
2024-05-17 19:00 ` [RFC PATCH v2 14/20] powerpc/e500: Remove enc field from struct mmu_psize_def Christophe Leroy
2024-05-25  4:35   ` Oscar Salvador
2024-05-17 19:00 ` [RFC PATCH v2 15/20] powerpc/85xx: Switch to 64 bits PGD Christophe Leroy
2024-05-25  4:54   ` Oscar Salvador
2024-05-25  9:02     ` Christophe Leroy
2024-05-17 19:00 ` [RFC PATCH v2 16/20] powerpc/e500: Encode hugepage size in PTE bits Christophe Leroy
2024-05-17 19:00 ` [RFC PATCH v2 17/20] powerpc/e500: Use contiguous PMD instead of hugepd Christophe Leroy
2024-05-17 19:00 ` [RFC PATCH v2 18/20] powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD Christophe Leroy
2024-05-20 12:54   ` Nicholas Piggin
2024-05-20 16:43     ` Christophe Leroy
2024-05-22  1:13       ` Nicholas Piggin
2024-05-22  9:32         ` Christophe Leroy
2024-05-22 12:23         ` Jason Gunthorpe
2024-05-17 19:00 ` [RFC PATCH v2 19/20] powerpc/mm: Remove hugepd leftovers Christophe Leroy
2024-05-17 19:00 ` [RFC PATCH v2 20/20] mm: Remove CONFIG_ARCH_HAS_HUGEPD Christophe Leroy
2024-05-17 19:06 ` [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64) Jason Gunthorpe
2024-05-18  6:28   ` Christophe Leroy
2024-05-23 19:40 ` Peter Xu
2024-05-24  4:46   ` Michael Ellerman
2024-05-27 14:14     ` Peter Xu
2024-05-24  6:31   ` Oscar Salvador

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox