linux-mm.kvack.org archive mirror
* [PATCH v5 0/7] mm/mprotect: Fix dax puds
@ 2024-08-12 18:12 Peter Xu
  2024-08-12 18:12 ` [PATCH v5 1/7] mm/dax: Dump start address in fault handler Peter Xu
                   ` (8 more replies)
  0 siblings, 9 replies; 15+ messages in thread
From: Peter Xu @ 2024-08-12 18:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Kirill A . Shutemov, Nicholas Piggin, David Hildenbrand,
	Matthew Wilcox, Andrew Morton, James Houghton, Huang Ying,
	Aneesh Kumar K . V, peterx, Vlastimil Babka, Rick P Edgecombe,
	Hugh Dickins, Borislav Petkov, Christophe Leroy,
	Michael Ellerman, Rik van Riel, Dan Williams, Mel Gorman, x86,
	Ingo Molnar, linuxppc-dev, Dave Hansen, Dave Jiang,
	Oscar Salvador, Thomas Gleixner

[Based on mm-unstable, commit 98808d08fc0f, Aug 7th. NOTE: it is
 intentional to not have rebased to latest mm-unstable, as this is to
 replace the queued v4]

v5 Changelog:
- Rename patch subject "mm/x86: arch_check_zapped_pud()", add "Implement" [tglx]
- Mostly rewrote commit messages for the x86 patches, follow -tip rules [tglx]
- Line wrap fixes (to mostly avoid newlines when unnecessary) [tglx]
- English fixes [tglx]
- Fix a build issue that only happens with i386 PAE + clang
  https://lore.kernel.org/r/202408111850.Y7rbVXOo-lkp@intel.com

v1: https://lore.kernel.org/r/20240621142504.1940209-1-peterx@redhat.com
v2: https://lore.kernel.org/r/20240703212918.2417843-1-peterx@redhat.com
v3: https://lore.kernel.org/r/20240715192142.3241557-1-peterx@redhat.com
v4: https://lore.kernel.org/r/20240807194812.819412-1-peterx@redhat.com

Dax has supported pud pages for a while, but mprotect() on puds has been
missing since the start.  This series tries to fix that by providing pud
handling in mprotect().  The goal is to add more types of pud mappings
like hugetlb or pfnmaps; this series paves the way by fixing the known
pud entries first.

Considering nobody reported this until I looked at those other types of
pud mappings, I am thinking maybe it doesn't need to be a fix for stable
and may not need to be backported.  I would guess whoever cares about
mprotect() doesn't care about 1G dax puds yet, and vice versa.  I hope
fixing it only in new kernels is fine, but I'm open to suggestions.

A few small things needed to change to teach mprotect() to work on PUDs.
E.g., it needs to start with dropping NUMA_HUGE_PTE_UPDATES, which may
stop making sense when there can be more than one type of huge pte.
OTOH, we'll also need to push the mmu notifiers from the pmd to the pud
layer, which might need some attention but so far I think it's safe.
For such details, please refer to each patch's commit message.

The mprotect() pud processing should be straightforward, as I kept it as
simple as possible.  There's no NUMA handling, as dax simply doesn't
support that.  There's also no userfault involvement, as file memory
(even when working with userfault-wp async mode) needs to split a pud,
so pud entries don't yet need to know about userfault's existence (but
hugetlb entries will; that's also for later).

Tests
=====

What I did test:

- cross-build tests that I normally cover [1]

- smoke tested on x86_64 with the simplest program [2] doing mprotect()
  on a dev_dax 1G PUD, using QEMU's nvdimm emulation [3] and ndctl to
  create namespaces with proper alignments.  This used to throw "bad
  pud" but now runs through fine.  I also checked that SIGBUS happens on
  illegal access to protected puds.  (A rough sketch of such a program
  is included after this list.)

- vmtests.
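
Below is a rough sketch (not the actual program in [2]) of the kind of
smoke test mentioned above; the device path and the 1G size are only
assumptions about the test setup, and error handling is mostly omitted:

  /*
   * Minimal sketch: mmap a dev_dax device backed by a 1G-aligned
   * namespace, fault in the huge PUD, then mprotect() it.  The path
   * /dev/dax0.0 is an assumed example.
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>

  int main(void)
  {
          size_t len = 1UL << 30;                 /* one 1G PUD */
          int fd = open("/dev/dax0.0", O_RDWR);
          char *p;

          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (p == MAP_FAILED) {
                  perror("mmap");
                  return 1;
          }
          p[0] = 1;                               /* populate the 1G PUD */
          if (mprotect(p, len, PROT_READ))        /* used to hit "bad pud" */
                  perror("mprotect");
          p[0] = 2;      /* illegal write; expected to fault (SIGBUS) now */
          return 0;
  }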

What I didn't test:

- fsdax: I wanted to give it a shot too, but only then noticed it
  doesn't seem to be supported (according to dax_iomap_fault(), which
  always falls back at PUD_ORDER).  I do remember it was supported
  before, so I could be missing something important there.. please shoot
  if so.

- userfault wp-async: I also wanted to test that userfault-wp async can
  split huge puds (here it's simply a clear_pud.. though), but it won't
  work for devdax anyway, since faults smaller than 1G aren't allowed in
  this case.  So skipped too.

- Power, as no hardware on hand.

Thanks,

[1] https://gitlab.com/peterx/lkb-harness/-/blob/main/config.json
[2] https://github.com/xzpeter/clibs/blob/master/misc/dax.c
[3] https://github.com/qemu/qemu/blob/master/docs/nvdimm.txt

Peter Xu (7):
  mm/dax: Dump start address in fault handler
  mm/mprotect: Push mmu notifier to PUDs
  mm/powerpc: Add missing pud helpers
  mm/x86: Make pud_leaf() only care about PSE bit
  mm/x86: Implement arch_check_zapped_pud()
  mm/x86: Add missing pud helpers
  mm/mprotect: fix dax pud handlings

 arch/powerpc/include/asm/book3s/64/pgtable.h |  3 +
 arch/powerpc/mm/book3s64/pgtable.c           | 20 ++++++
 arch/x86/include/asm/pgtable.h               | 70 ++++++++++++++++---
 arch/x86/mm/pgtable.c                        | 18 +++++
 drivers/dax/device.c                         |  6 +-
 include/linux/huge_mm.h                      | 24 +++++++
 include/linux/pgtable.h                      |  6 ++
 mm/huge_memory.c                             | 56 ++++++++++++++-
 mm/mprotect.c                                | 71 +++++++++++++-------
 9 files changed, 236 insertions(+), 38 deletions(-)

-- 
2.45.0




* [PATCH v5 1/7] mm/dax: Dump start address in fault handler
  2024-08-12 18:12 [PATCH v5 0/7] mm/mprotect: Fix dax puds Peter Xu
@ 2024-08-12 18:12 ` Peter Xu
  2024-08-12 18:12 ` [PATCH v5 2/7] mm/mprotect: Push mmu notifier to PUDs Peter Xu
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2024-08-12 18:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Kirill A . Shutemov, Nicholas Piggin, David Hildenbrand,
	Matthew Wilcox, Andrew Morton, James Houghton, Huang Ying,
	Aneesh Kumar K . V, peterx, Vlastimil Babka, Rick P Edgecombe,
	Hugh Dickins, Borislav Petkov, Christophe Leroy,
	Michael Ellerman, Rik van Riel, Dan Williams, Mel Gorman, x86,
	Ingo Molnar, linuxppc-dev, Dave Hansen, Dave Jiang,
	Oscar Salvador, Thomas Gleixner

Currently the dax fault handler dumps the vma range when dynamic debugging
is enabled.  That's mostly not useful.  Dump the (aligned) fault address
instead, along with the order info.

Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 drivers/dax/device.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 2051e4f73c8a..9c1a729cd77e 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -235,9 +235,9 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf, unsigned int order)
 	int id;
 	struct dev_dax *dev_dax = filp->private_data;
 
-	dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) order:%d\n", current->comm,
-			(vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read",
-			vmf->vma->vm_start, vmf->vma->vm_end, order);
+	dev_dbg(&dev_dax->dev, "%s: op=%s addr=%#lx order=%d\n", current->comm,
+		(vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read",
+		vmf->address & ~((1UL << (order + PAGE_SHIFT)) - 1), order);
 
 	id = dax_read_lock();
 	if (order == 0)
-- 
2.45.0




* [PATCH v5 2/7] mm/mprotect: Push mmu notifier to PUDs
  2024-08-12 18:12 [PATCH v5 0/7] mm/mprotect: Fix dax puds Peter Xu
  2024-08-12 18:12 ` [PATCH v5 1/7] mm/dax: Dump start address in fault handler Peter Xu
@ 2024-08-12 18:12 ` Peter Xu
  2024-08-12 18:12 ` [PATCH v5 3/7] mm/powerpc: Add missing pud helpers Peter Xu
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2024-08-12 18:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Kirill A . Shutemov, Nicholas Piggin, David Hildenbrand,
	Matthew Wilcox, Andrew Morton, James Houghton, Huang Ying,
	Aneesh Kumar K . V, peterx, Vlastimil Babka, Rick P Edgecombe,
	Hugh Dickins, Borislav Petkov, Christophe Leroy,
	Michael Ellerman, Rik van Riel, Dan Williams, Mel Gorman, x86,
	Ingo Molnar, linuxppc-dev, Dave Hansen, Dave Jiang,
	Oscar Salvador, Thomas Gleixner, kvm, Sean Christopherson,
	Paolo Bonzini, David Rientjes

mprotect() invokes mmu notifiers at the PMD level.  It has been that way
since 2014, commit a5338093bfb4 ("mm: move mmu notifier call from
change_protection to change_pmd_range").

At that time, the issue was that NUMA balancing could be applied on a huge
range of VM memory, even if nothing was populated.  The notification can be
avoided in this case if no valid pmd is detected, where "valid" means
either a THP or a PTE pgtable page.

Now, to pave the way for PUD handling, this isn't enough.  We need to
generate mmu notifications properly on PUD entries as well.  mprotect() is
currently broken on PUDs (e.g., one can already easily trigger a kernel
error with dax 1G mappings); this is the start of fixing it.

To fix that, this patch proposes to push such notifications to the PUD
layer.

There is a risk of regressing the problem Rik wanted to resolve before,
but I think it shouldn't really happen, and I still chose this solution
for a few reasons:

  1) Consider a large VM that definitely contains many GBs of memory: when
  nothing is populated, it's highly likely that the PUDs are none as well,
  in which case there will be no regression.

  2) KVM has evolved a lot over the years to get rid of rmap walks, which
  might have been the major cause of the previous soft-lockup.  At least
  the TDP MMU already got rid of rmap as long as not nested (which should
  be the major use case, IIUC), so the TDP MMU pgtable walker will simply
  see an empty VM pgtable (e.g. EPT on x86); the invalidation of a fully
  empty region should in most cases be pretty fast now, compared to 2014.

  3) KVM now has explicit code paths to give way to mmu notifiers just
  like this one, e.g. commit d02c357e5bfa ("KVM: x86/mmu: Retry fault
  before acquiring mmu_lock if mapping is changing").  That also avoids
  contention that could otherwise contribute to a soft-lockup.

  4) Sticking with the PMD layer simply doesn't work when PUDs are
  there...  We need one way or another to fix PUD mappings in mprotect().

Pushing it to the PUD level should be the safest approach as of now, e.g.
there's no sign yet of huge P4Ds coming on any known arch.

Cc: kvm@vger.kernel.org
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/mprotect.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 37cf8d249405..d423080e6509 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -363,9 +363,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 	unsigned long next;
 	long pages = 0;
 	unsigned long nr_huge_updates = 0;
-	struct mmu_notifier_range range;
-
-	range.start = 0;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -383,14 +380,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 		if (pmd_none(*pmd))
 			goto next;
 
-		/* invoke the mmu notifier if the pmd is populated */
-		if (!range.start) {
-			mmu_notifier_range_init(&range,
-				MMU_NOTIFY_PROTECTION_VMA, 0,
-				vma->vm_mm, addr, end);
-			mmu_notifier_invalidate_range_start(&range);
-		}
-
 		_pmd = pmdp_get_lockless(pmd);
 		if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
 			if ((next - addr != HPAGE_PMD_SIZE) ||
@@ -431,9 +420,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 		cond_resched();
 	} while (pmd++, addr = next, addr != end);
 
-	if (range.start)
-		mmu_notifier_invalidate_range_end(&range);
-
 	if (nr_huge_updates)
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
 	return pages;
@@ -443,22 +429,36 @@ static inline long change_pud_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, p4d_t *p4d, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
+	struct mmu_notifier_range range;
 	pud_t *pud;
 	unsigned long next;
 	long pages = 0, ret;
 
+	range.start = 0;
+
 	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
 		ret = change_prepare(vma, pud, pmd, addr, cp_flags);
-		if (ret)
-			return ret;
+		if (ret) {
+			pages = ret;
+			break;
+		}
 		if (pud_none_or_clear_bad(pud))
 			continue;
+		if (!range.start) {
+			mmu_notifier_range_init(&range,
+						MMU_NOTIFY_PROTECTION_VMA, 0,
+						vma->vm_mm, addr, end);
+			mmu_notifier_invalidate_range_start(&range);
+		}
 		pages += change_pmd_range(tlb, vma, pud, addr, next, newprot,
 					  cp_flags);
 	} while (pud++, addr = next, addr != end);
 
+	if (range.start)
+		mmu_notifier_invalidate_range_end(&range);
+
 	return pages;
 }
 
-- 
2.45.0




* [PATCH v5 3/7] mm/powerpc: Add missing pud helpers
  2024-08-12 18:12 [PATCH v5 0/7] mm/mprotect: Fix dax puds Peter Xu
  2024-08-12 18:12 ` [PATCH v5 1/7] mm/dax: Dump start address in fault handler Peter Xu
  2024-08-12 18:12 ` [PATCH v5 2/7] mm/mprotect: Push mmu notifier to PUDs Peter Xu
@ 2024-08-12 18:12 ` Peter Xu
  2024-08-12 18:12 ` [PATCH v5 4/7] mm/x86: Make pud_leaf() only care about PSE bit Peter Xu
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2024-08-12 18:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Kirill A . Shutemov, Nicholas Piggin, David Hildenbrand,
	Matthew Wilcox, Andrew Morton, James Houghton, Huang Ying,
	Aneesh Kumar K . V, peterx, Vlastimil Babka, Rick P Edgecombe,
	Hugh Dickins, Borislav Petkov, Christophe Leroy,
	Michael Ellerman, Rik van Riel, Dan Williams, Mel Gorman, x86,
	Ingo Molnar, linuxppc-dev, Dave Hansen, Dave Jiang,
	Oscar Salvador, Thomas Gleixner

Some new helpers will be needed for pud entry updates soon.  Introduce
these helpers by referencing the pmd ones.  Namely:

  - pudp_invalidate(): this helper invalidates a huge pud before a split
  happens, so that the invalidated pud entry makes sure no race can
  happen (either with software, like a concurrent zap, or hardware, like
  an a/d bit being lost).

  - pud_modify(): this helper applies a new pgprot to an existing huge pud
  mapping.

For more information on why we need these two helpers, please refer to the
corresponding pmd helpers in the mprotect() code path.
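
As a rough sketch (not part of this patch; the actual caller is added
later in the series, and locking and flag checks are omitted here), the
expected usage pattern in the pud change-protection path is:

        oldpud = pudp_invalidate(vma, addr, pudp);   /* mark the leaf invalid */
        entry = pud_modify(oldpud, newprot);         /* apply the new protection */
        set_pud_at(vma->vm_mm, addr, pudp, entry);   /* re-install the huge pud */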

Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h |  3 +++
 arch/powerpc/mm/book3s64/pgtable.c           | 20 ++++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 519b1743a0f4..5da92ba68a45 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1124,6 +1124,7 @@ extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
 extern pud_t pfn_pud(unsigned long pfn, pgprot_t pgprot);
 extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
 extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot);
+extern pud_t pud_modify(pud_t pud, pgprot_t newprot);
 extern void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 		       pmd_t *pmdp, pmd_t pmd);
 extern void set_pud_at(struct mm_struct *mm, unsigned long addr,
@@ -1384,6 +1385,8 @@ static inline pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm,
 #define __HAVE_ARCH_PMDP_INVALIDATE
 extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 			     pmd_t *pmdp);
+extern pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+			     pud_t *pudp);
 
 #define pmd_move_must_withdraw pmd_move_must_withdraw
 struct spinlock;
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index f4d8d3c40e5c..5a4a75369043 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -176,6 +176,17 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 	return __pmd(old_pmd);
 }
 
+pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+		      pud_t *pudp)
+{
+	unsigned long old_pud;
+
+	VM_WARN_ON_ONCE(!pud_present(*pudp));
+	old_pud = pud_hugepage_update(vma->vm_mm, address, pudp, _PAGE_PRESENT, _PAGE_INVALID);
+	flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
+	return __pud(old_pud);
+}
+
 pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma,
 				   unsigned long addr, pmd_t *pmdp, int full)
 {
@@ -259,6 +270,15 @@ pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 	pmdv &= _HPAGE_CHG_MASK;
 	return pmd_set_protbits(__pmd(pmdv), newprot);
 }
+
+pud_t pud_modify(pud_t pud, pgprot_t newprot)
+{
+	unsigned long pudv;
+
+	pudv = pud_val(pud);
+	pudv &= _HPAGE_CHG_MASK;
+	return pud_set_protbits(__pud(pudv), newprot);
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 /* For use by kexec, called with MMU off */
-- 
2.45.0




* [PATCH v5 4/7] mm/x86: Make pud_leaf() only care about PSE bit
  2024-08-12 18:12 [PATCH v5 0/7] mm/mprotect: Fix dax puds Peter Xu
                   ` (2 preceding siblings ...)
  2024-08-12 18:12 ` [PATCH v5 3/7] mm/powerpc: Add missing pud helpers Peter Xu
@ 2024-08-12 18:12 ` Peter Xu
  2024-08-12 18:12 ` [PATCH v5 5/7] mm/x86: Implement arch_check_zapped_pud() Peter Xu
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2024-08-12 18:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Kirill A . Shutemov, Nicholas Piggin, David Hildenbrand,
	Matthew Wilcox, Andrew Morton, James Houghton, Huang Ying,
	Aneesh Kumar K . V, peterx, Vlastimil Babka, Rick P Edgecombe,
	Hugh Dickins, Borislav Petkov, Christophe Leroy,
	Michael Ellerman, Rik van Riel, Dan Williams, Mel Gorman, x86,
	Ingo Molnar, linuxppc-dev, Dave Hansen, Dave Jiang,
	Oscar Salvador, Thomas Gleixner

While working on mprotect() for 1G dax entries, I hit a zap "bad pud"
error when zapping a huge pud that has PROT_NONE permission.

The problem is that x86's pud_leaf() requires both the PRESENT and PSE
bits to be set to report a pud entry as a leaf.  That doesn't look right:
it doesn't follow the pXd_leaf() definition we have stuck with so far,
where PROT_NONE entries should be reported as leaves (such an entry keeps
_PAGE_PSE but has _PAGE_PRESENT cleared).

To fix it, change x86's pud_leaf() implementation to only check the PSE
bit to report a leaf, regardless of whether the PRESENT bit is set.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/pgtable.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e39311a89bf4..a2a3bd4c1bda 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1078,8 +1078,7 @@ static inline pmd_t *pud_pgtable(pud_t pud)
 #define pud_leaf pud_leaf
 static inline bool pud_leaf(pud_t pud)
 {
-	return (pud_val(pud) & (_PAGE_PSE | _PAGE_PRESENT)) ==
-		(_PAGE_PSE | _PAGE_PRESENT);
+	return pud_val(pud) & _PAGE_PSE;
 }
 
 static inline int pud_bad(pud_t pud)
-- 
2.45.0




* [PATCH v5 5/7] mm/x86: Implement arch_check_zapped_pud()
  2024-08-12 18:12 [PATCH v5 0/7] mm/mprotect: Fix dax puds Peter Xu
                   ` (3 preceding siblings ...)
  2024-08-12 18:12 ` [PATCH v5 4/7] mm/x86: Make pud_leaf() only care about PSE bit Peter Xu
@ 2024-08-12 18:12 ` Peter Xu
  2024-08-12 18:12 ` [PATCH v5 6/7] mm/x86: Add missing pud helpers Peter Xu
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2024-08-12 18:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Kirill A . Shutemov, Nicholas Piggin, David Hildenbrand,
	Matthew Wilcox, Andrew Morton, James Houghton, Huang Ying,
	Aneesh Kumar K . V, peterx, Vlastimil Babka, Rick P Edgecombe,
	Hugh Dickins, Borislav Petkov, Christophe Leroy,
	Michael Ellerman, Rik van Riel, Dan Williams, Mel Gorman, x86,
	Ingo Molnar, linuxppc-dev, Dave Hansen, Dave Jiang,
	Oscar Salvador, Thomas Gleixner

Introduce arch_check_zapped_pud() to sanity check shadow stack on PUD zaps.
It has the same logic as the PMD helper.

One thing to mention: it might be a good idea to use page_table_check in
the future for trapping wrong setups of shadow stack pgtable entries [1].
That is left as a separate future effort.

[1] https://lore.kernel.org/all/59d518698f664e07c036a5098833d7b56b953305.camel@intel.com

Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/pgtable.h | 10 ++++++++++
 arch/x86/mm/pgtable.c          |  6 ++++++
 include/linux/pgtable.h        |  6 ++++++
 mm/huge_memory.c               |  4 +++-
 4 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a2a3bd4c1bda..fdb8ac9e7030 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -174,6 +174,13 @@ static inline int pud_young(pud_t pud)
 	return pud_flags(pud) & _PAGE_ACCESSED;
 }
 
+static inline bool pud_shstk(pud_t pud)
+{
+	return cpu_feature_enabled(X86_FEATURE_SHSTK) &&
+	       (pud_flags(pud) & (_PAGE_RW | _PAGE_DIRTY | _PAGE_PSE)) ==
+	       (_PAGE_DIRTY | _PAGE_PSE);
+}
+
 static inline int pte_write(pte_t pte)
 {
 	/*
@@ -1667,6 +1674,9 @@ void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte);
 #define arch_check_zapped_pmd arch_check_zapped_pmd
 void arch_check_zapped_pmd(struct vm_area_struct *vma, pmd_t pmd);
 
+#define arch_check_zapped_pud arch_check_zapped_pud
+void arch_check_zapped_pud(struct vm_area_struct *vma, pud_t pud);
+
 #ifdef CONFIG_XEN_PV
 #define arch_has_hw_nonleaf_pmd_young arch_has_hw_nonleaf_pmd_young
 static inline bool arch_has_hw_nonleaf_pmd_young(void)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index f5931499c2d6..36e7139a61d9 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -926,3 +926,9 @@ void arch_check_zapped_pmd(struct vm_area_struct *vma, pmd_t pmd)
 	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
 			pmd_shstk(pmd));
 }
+
+void arch_check_zapped_pud(struct vm_area_struct *vma, pud_t pud)
+{
+	/* See note in arch_check_zapped_pte() */
+	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) && pud_shstk(pud));
+}
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 2a6a3cccfc36..780f3b439d98 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -447,6 +447,12 @@ static inline void arch_check_zapped_pmd(struct vm_area_struct *vma,
 }
 #endif
 
+#ifndef arch_check_zapped_pud
+static inline void arch_check_zapped_pud(struct vm_area_struct *vma, pud_t pud)
+{
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long address,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0024266dea0a..81c5da0708ed 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2293,12 +2293,14 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pud_t *pud, unsigned long addr)
 {
 	spinlock_t *ptl;
+	pud_t orig_pud;
 
 	ptl = __pud_trans_huge_lock(pud, vma);
 	if (!ptl)
 		return 0;
 
-	pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
+	orig_pud = pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
+	arch_check_zapped_pud(vma, orig_pud);
 	tlb_remove_pud_tlb_entry(tlb, pud, addr);
 	if (vma_is_special_huge(vma)) {
 		spin_unlock(ptl);
-- 
2.45.0




* [PATCH v5 6/7] mm/x86: Add missing pud helpers
  2024-08-12 18:12 [PATCH v5 0/7] mm/mprotect: Fix dax puds Peter Xu
                   ` (4 preceding siblings ...)
  2024-08-12 18:12 ` [PATCH v5 5/7] mm/x86: Implement arch_check_zapped_pud() Peter Xu
@ 2024-08-12 18:12 ` Peter Xu
  2024-08-12 18:12 ` [PATCH v5 7/7] mm/mprotect: fix dax pud handlings Peter Xu
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2024-08-12 18:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Kirill A . Shutemov, Nicholas Piggin, David Hildenbrand,
	Matthew Wilcox, Andrew Morton, James Houghton, Huang Ying,
	Aneesh Kumar K . V, peterx, Vlastimil Babka, Rick P Edgecombe,
	Hugh Dickins, Borislav Petkov, Christophe Leroy,
	Michael Ellerman, Rik van Riel, Dan Williams, Mel Gorman, x86,
	Ingo Molnar, linuxppc-dev, Dave Hansen, Dave Jiang,
	Oscar Salvador, Thomas Gleixner

Some new helpers will be needed for pud entry updates soon.  Introduce
these helpers by referencing the pmd ones.  Namely:

  - pudp_invalidate(): this helper invalidates a huge pud before a split
  happens, so that the invalidated pud entry makes sure no race can
  happen (either with software, like a concurrent zap, or hardware, like
  an a/d bit being lost).

  - pud_modify(): this helper applies a new pgprot to an existing huge pud
  mapping.

For more information on why we need these two helpers, please refer to the
corresponding pmd helpers in the mprotect() code path.

While at it, simplify the pud_modify()/pmd_modify() comments on shadow
stack pgtable entries to reference pte_modify(), to avoid duplicating the
whole paragraph three times.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/pgtable.h | 57 +++++++++++++++++++++++++++++-----
 arch/x86/mm/pgtable.c          | 12 +++++++
 2 files changed, 61 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index fdb8ac9e7030..8d12bfad6a1d 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -787,6 +787,12 @@ static inline pmd_t pmd_mkinvalid(pmd_t pmd)
 		      __pgprot(pmd_flags(pmd) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
 }
 
+static inline pud_t pud_mkinvalid(pud_t pud)
+{
+	return pfn_pud(pud_pfn(pud),
+		       __pgprot(pud_flags(pud) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
+}
+
 static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
 
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
@@ -834,14 +840,8 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 	pmd_result = __pmd(val);
 
 	/*
-	 * To avoid creating Write=0,Dirty=1 PMDs, pte_modify() needs to avoid:
-	 *  1. Marking Write=0 PMDs Dirty=1
-	 *  2. Marking Dirty=1 PMDs Write=0
-	 *
-	 * The first case cannot happen because the _PAGE_CHG_MASK will filter
-	 * out any Dirty bit passed in newprot. Handle the second case by
-	 * going through the mksaveddirty exercise. Only do this if the old
-	 * value was Write=1 to avoid doing this on Shadow Stack PTEs.
+	 * Avoid creating shadow stack PMD by accident.  See comment in
+	 * pte_modify().
 	 */
 	if (oldval & _PAGE_RW)
 		pmd_result = pmd_mksaveddirty(pmd_result);
@@ -851,6 +851,29 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 	return pmd_result;
 }
 
+static inline pud_t pud_modify(pud_t pud, pgprot_t newprot)
+{
+	pudval_t val = pud_val(pud), oldval = val;
+	pud_t pud_result;
+
+	val &= _HPAGE_CHG_MASK;
+	val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
+	val = flip_protnone_guard(oldval, val, PHYSICAL_PUD_PAGE_MASK);
+
+	pud_result = __pud(val);
+
+	/*
+	 * Avoid creating shadow stack PUD by accident.  See comment in
+	 * pte_modify().
+	 */
+	if (oldval & _PAGE_RW)
+		pud_result = pud_mksaveddirty(pud_result);
+	else
+		pud_result = pud_clear_saveddirty(pud_result);
+
+	return pud_result;
+}
+
 /*
  * mprotect needs to preserve PAT and encryption bits when updating
  * vm_page_prot
@@ -1389,10 +1412,28 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 }
 #endif
 
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static inline pud_t pudp_establish(struct vm_area_struct *vma,
+		unsigned long address, pud_t *pudp, pud_t pud)
+{
+	page_table_check_pud_set(vma->vm_mm, pudp, pud);
+	if (IS_ENABLED(CONFIG_SMP)) {
+		return xchg(pudp, pud);
+	} else {
+		pud_t old = *pudp;
+		WRITE_ONCE(*pudp, pud);
+		return old;
+	}
+}
+#endif
+
 #define __HAVE_ARCH_PMDP_INVALIDATE_AD
 extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
 				unsigned long address, pmd_t *pmdp);
 
+pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+		      pud_t *pudp);
+
 /*
  * Page table pages are page-aligned.  The lower half of the top
  * level is used for userspace and the top half for the kernel.
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 36e7139a61d9..5745a354a241 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -641,6 +641,18 @@ pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
 }
 #endif
 
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
+	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
+pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+		     pud_t *pudp)
+{
+	VM_WARN_ON_ONCE(!pud_present(*pudp));
+	pud_t old = pudp_establish(vma, address, pudp, pud_mkinvalid(*pudp));
+	flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
+	return old;
+}
+#endif
+
 /**
  * reserve_top_address - reserves a hole in the top of kernel address space
  * @reserve - size of hole to reserve
-- 
2.45.0




* [PATCH v5 7/7] mm/mprotect: fix dax pud handlings
  2024-08-12 18:12 [PATCH v5 0/7] mm/mprotect: Fix dax puds Peter Xu
                   ` (5 preceding siblings ...)
  2024-08-12 18:12 ` [PATCH v5 6/7] mm/x86: Add missing pud helpers Peter Xu
@ 2024-08-12 18:12 ` Peter Xu
  2024-08-13 12:50 ` [PATCH v5 0/7] mm/mprotect: Fix dax puds Michael Ellerman
  2024-11-11 21:20 ` Jann Horn
  8 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2024-08-12 18:12 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Kirill A . Shutemov, Nicholas Piggin, David Hildenbrand,
	Matthew Wilcox, Andrew Morton, James Houghton, Huang Ying,
	Aneesh Kumar K . V, peterx, Vlastimil Babka, Rick P Edgecombe,
	Hugh Dickins, Borislav Petkov, Christophe Leroy,
	Michael Ellerman, Rik van Riel, Dan Williams, Mel Gorman, x86,
	Ingo Molnar, linuxppc-dev, Dave Hansen, Dave Jiang,
	Oscar Salvador, Thomas Gleixner

This is only relevant to the two archs that support PUD dax, namely x86_64
and ppc64.  PUD THPs do not yet exist elsewhere, and hugetlb PUDs do not
count in this case.

DAX has had PUD mappings for years, but the change protection path never
worked.  When the path is triggered in any form (a simple test program
would be: call mprotect() on a 1G dev_dax mapping), the kernel reports
"bad pud".  This patch should fix that.

The new change_huge_pud() tries to keep everything simple.  For example,
it doesn't optimize the write bit, as that would need even more PUD
helpers.  It's not too bad anyway to have one more write fault in the
worst case, once per 1G range; it would be a bigger deal if it happened
per PAGE_SIZE, though.  Neither does it support userfault-wp bits, as
there are no such PUD mappings supported; file mappings always need a
split there.

The same goes for TLB shootdown: the pmd path (which was x86 only) has
the trick of using the _ad() version of pmdp_invalidate*(), which can
avoid one redundant TLB flush, but let's leave that for later too.  Again,
the larger the mapping, the smaller that effect.

There's some difference in handling "retry" for change_huge_pud() (where
it can return 0): it isn't like change_huge_pmd(), as the pmd version is
safe with all conditions handled later in change_pte_range(), thanks to
Hugh's new pte_offset_map_lock().  In short, change_pte_range() is simply
smarter.  Because of that, change_pud_range() needs a proper retry when it
races with something else and a huge PUD changed from under us.

The last thing to mention is that the PUD path currently ignores the huge
pte NUMA counter (NUMA_HUGE_PTE_UPDATES), not only because DAX is not
applicable to NUMA, but also because it's ambiguous on its own how a pud
should be accounted in this case.  In an earlier version of this patchset
I proposed removing the counter, as the accounting doesn't even look right
as of now [1], but a further discussion [2] suggested we can leave that
for later, as it doesn't block this series if we choose to ignore the
counter.  That's what this patch does: ignore it.

While at it, touch up the comment in pgtable_split_needed() to make it
generic to both pmd and pud file THPs.

[1] https://lore.kernel.org/all/20240715192142.3241557-3-peterx@redhat.com/
[2] https://lore.kernel.org/r/added2d0-b8be-4108-82ca-1367a388d0b1@redhat.com

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: x86@kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages")
Fixes: 27af67f35631 ("powerpc/book3s64/mm: enable transparent pud hugepage")
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/huge_mm.h | 24 +++++++++++++++++++
 mm/huge_memory.c        | 52 +++++++++++++++++++++++++++++++++++++++++
 mm/mprotect.c           | 39 ++++++++++++++++++++++++-------
 3 files changed, 107 insertions(+), 8 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ce44caa40eed..6370026689e0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -342,6 +342,17 @@ void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
 void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 		unsigned long address);
 
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		    pud_t *pudp, unsigned long addr, pgprot_t newprot,
+		    unsigned long cp_flags);
+#else
+static inline int
+change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		pud_t *pudp, unsigned long addr, pgprot_t newprot,
+		unsigned long cp_flags) { return 0; }
+#endif
+
 #define split_huge_pud(__vma, __pud, __address)				\
 	do {								\
 		pud_t *____pud = (__pud);				\
@@ -585,6 +596,19 @@ static inline int next_order(unsigned long *orders, int prev)
 {
 	return 0;
 }
+
+static inline void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
+				    unsigned long address)
+{
+}
+
+static inline int change_huge_pud(struct mmu_gather *tlb,
+				  struct vm_area_struct *vma, pud_t *pudp,
+				  unsigned long addr, pgprot_t newprot,
+				  unsigned long cp_flags)
+{
+	return 0;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 static inline int split_folio_to_list_to_order(struct folio *folio,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 81c5da0708ed..0aafd26d7a53 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2114,6 +2114,53 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	return ret;
 }
 
+/*
+ * Returns:
+ *
+ * - 0: if pud leaf changed from under us
+ * - 1: if pud can be skipped
+ * - HPAGE_PUD_NR: if pud was successfully processed
+ */
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		    pud_t *pudp, unsigned long addr, pgprot_t newprot,
+		    unsigned long cp_flags)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pud_t oldpud, entry;
+	spinlock_t *ptl;
+
+	tlb_change_page_size(tlb, HPAGE_PUD_SIZE);
+
+	/* NUMA balancing doesn't apply to dax */
+	if (cp_flags & MM_CP_PROT_NUMA)
+		return 1;
+
+	/*
+	 * Huge entries on userfault-wp only works with anonymous, while we
+	 * don't have anonymous PUDs yet.
+	 */
+	if (WARN_ON_ONCE(cp_flags & MM_CP_UFFD_WP_ALL))
+		return 1;
+
+	ptl = __pud_trans_huge_lock(pudp, vma);
+	if (!ptl)
+		return 0;
+
+	/*
+	 * Can't clear PUD or it can race with concurrent zapping.  See
+	 * change_huge_pmd().
+	 */
+	oldpud = pudp_invalidate(vma, addr, pudp);
+	entry = pud_modify(oldpud, newprot);
+	set_pud_at(mm, addr, pudp, entry);
+	tlb_flush_pud_range(tlb, addr, HPAGE_PUD_SIZE);
+
+	spin_unlock(ptl);
+	return HPAGE_PUD_NR;
+}
+#endif
+
 #ifdef CONFIG_USERFAULTFD
 /*
  * The PT lock for src_pmd and dst_vma/src_vma (for reading) are locked by
@@ -2344,6 +2391,11 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(&range);
 }
+#else
+void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
+		unsigned long address)
+{
+}
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
 static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index d423080e6509..446f8e5f10d9 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -302,8 +302,9 @@ pgtable_split_needed(struct vm_area_struct *vma, unsigned long cp_flags)
 {
 	/*
 	 * pte markers only resides in pte level, if we need pte markers,
-	 * we need to split.  We cannot wr-protect shmem thp because file
-	 * thp is handled differently when split by erasing the pmd so far.
+	 * we need to split.  For example, we cannot wr-protect a file thp
+	 * (e.g. 2M shmem) because file thp is handled differently when
+	 * split by erasing the pmd so far.
 	 */
 	return (cp_flags & MM_CP_UFFD_WP) && !vma_is_anonymous(vma);
 }
@@ -430,31 +431,53 @@ static inline long change_pud_range(struct mmu_gather *tlb,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
 	struct mmu_notifier_range range;
-	pud_t *pud;
+	pud_t *pudp, pud;
 	unsigned long next;
 	long pages = 0, ret;
 
 	range.start = 0;
 
-	pud = pud_offset(p4d, addr);
+	pudp = pud_offset(p4d, addr);
 	do {
+again:
 		next = pud_addr_end(addr, end);
-		ret = change_prepare(vma, pud, pmd, addr, cp_flags);
+		ret = change_prepare(vma, pudp, pmd, addr, cp_flags);
 		if (ret) {
 			pages = ret;
 			break;
 		}
-		if (pud_none_or_clear_bad(pud))
+
+		pud = READ_ONCE(*pudp);
+		if (pud_none(pud))
 			continue;
+
 		if (!range.start) {
 			mmu_notifier_range_init(&range,
 						MMU_NOTIFY_PROTECTION_VMA, 0,
 						vma->vm_mm, addr, end);
 			mmu_notifier_invalidate_range_start(&range);
 		}
-		pages += change_pmd_range(tlb, vma, pud, addr, next, newprot,
+
+		if (pud_leaf(pud)) {
+			if ((next - addr != PUD_SIZE) ||
+			    pgtable_split_needed(vma, cp_flags)) {
+				__split_huge_pud(vma, pudp, addr);
+				goto again;
+			} else {
+				ret = change_huge_pud(tlb, vma, pudp,
+						      addr, newprot, cp_flags);
+				if (ret == 0)
+					goto again;
+				/* huge pud was handled */
+				if (ret == HPAGE_PUD_NR)
+					pages += HPAGE_PUD_NR;
+				continue;
+			}
+		}
+
+		pages += change_pmd_range(tlb, vma, pudp, addr, next, newprot,
 					  cp_flags);
-	} while (pud++, addr = next, addr != end);
+	} while (pudp++, addr = next, addr != end);
 
 	if (range.start)
 		mmu_notifier_invalidate_range_end(&range);
-- 
2.45.0




* Re: [PATCH v5 0/7] mm/mprotect: Fix dax puds
  2024-08-12 18:12 [PATCH v5 0/7] mm/mprotect: Fix dax puds Peter Xu
                   ` (6 preceding siblings ...)
  2024-08-12 18:12 ` [PATCH v5 7/7] mm/mprotect: fix dax pud handlings Peter Xu
@ 2024-08-13 12:50 ` Michael Ellerman
  2024-08-13 16:06   ` Peter Xu
  2024-11-11 21:20 ` Jann Horn
  8 siblings, 1 reply; 15+ messages in thread
From: Michael Ellerman @ 2024-08-13 12:50 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: Kirill A . Shutemov, Nicholas Piggin, David Hildenbrand,
	Matthew Wilcox, Andrew Morton, James Houghton, Huang Ying,
	Aneesh Kumar K . V, peterx, Vlastimil Babka, Rick P Edgecombe,
	Hugh Dickins, Borislav Petkov, Christophe Leroy, Rik van Riel,
	Dan Williams, Mel Gorman, x86, Ingo Molnar, linuxppc-dev,
	Dave Hansen, Dave Jiang, Oscar Salvador, Thomas Gleixner

Peter Xu <peterx@redhat.com> writes:
[...]
> What I didn't test:
[...]
> - Power, as no hardware on hand.

Does it need some specific configuration, or will just any Power machine do?

cheers



* Re: [PATCH v5 0/7] mm/mprotect: Fix dax puds
  2024-08-13 12:50 ` [PATCH v5 0/7] mm/mprotect: Fix dax puds Michael Ellerman
@ 2024-08-13 16:06   ` Peter Xu
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2024-08-13 16:06 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: linux-kernel, linux-mm, Kirill A . Shutemov, Nicholas Piggin,
	David Hildenbrand, Matthew Wilcox, Andrew Morton, James Houghton,
	Huang Ying, Aneesh Kumar K . V, Vlastimil Babka,
	Rick P Edgecombe, Hugh Dickins, Borislav Petkov,
	Christophe Leroy, Rik van Riel, Dan Williams, Mel Gorman, x86,
	Ingo Molnar, linuxppc-dev, Dave Hansen, Dave Jiang,
	Oscar Salvador, Thomas Gleixner

On Tue, Aug 13, 2024 at 10:50:04PM +1000, Michael Ellerman wrote:
> > - Power, as no hardware on hand.
> 
> Does it need some specific configuration, or just any Power machine will do?

I am not familiar enough with either dax or power to be sure about the
hardware implications here, sorry.  I think as long as one can create a
/dev/dax device with 1G mapping alignment (I suppose normally with ndctl),
it should be possible to hit the mprotect() path.

One can first verify, without this series, that it triggers a bad pud;
that will be gone after this series is applied.  Meanwhile, mprotect()
should apply to the 1G dax range properly.

Thanks,

-- 
Peter Xu




* Re: [PATCH v5 0/7] mm/mprotect: Fix dax puds
  2024-08-12 18:12 [PATCH v5 0/7] mm/mprotect: Fix dax puds Peter Xu
                   ` (7 preceding siblings ...)
  2024-08-13 12:50 ` [PATCH v5 0/7] mm/mprotect: Fix dax puds Michael Ellerman
@ 2024-11-11 21:20 ` Jann Horn
  2024-11-13 16:39   ` Peter Xu
  8 siblings, 1 reply; 15+ messages in thread
From: Jann Horn @ 2024-11-11 21:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Kirill A . Shutemov, Nicholas Piggin,
	David Hildenbrand, Matthew Wilcox, Andrew Morton, James Houghton,
	Huang Ying, Aneesh Kumar K . V, Vlastimil Babka,
	Rick P Edgecombe, Hugh Dickins, Borislav Petkov,
	Christophe Leroy, Michael Ellerman, Rik van Riel, Dan Williams,
	Mel Gorman, x86, Ingo Molnar, linuxppc-dev, Dave Hansen,
	Dave Jiang, Oscar Salvador, Thomas Gleixner

On Mon, Aug 12, 2024 at 8:12 PM Peter Xu <peterx@redhat.com> wrote:
> Dax supports pud pages for a while, but mprotect on puds was missing since
> the start.  This series tries to fix that by providing pud handling in
> mprotect().  The goal is to add more types of pud mappings like hugetlb or
> pfnmaps.  This series paves way for it by fixing known pud entries.

Do people actually use hardware where they can use PUD THP mappings
for DAX? I thought that was just some esoteric feature that isn't
actually usable on almost any system.
Was I wrong about that?

I think another example that probably doesn't play entirely nice with
PUD THP mappings is mremap()'s move_page_tables(). If
dax_get_unmapped_area() allows creating a VMA at an unaligned start
address (which I think it does?), move_page_tables() can probably end
up copying from an aligned address mapped with a huge PUD entry to an
unaligned address that needs to be mapped at the PTE level, and I
think that will probably cause it to call into get_old_pmd() while a
huge PUD entry is still present, which will probably get us a
pud_bad() error or such?
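
In case it helps, a rough userspace sketch of the sequence I mean
(untested and purely hypothetical; the device path, sizes and address
arithmetic are only placeholders, and error handling is omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>

#define SZ_1G   (1UL << 30)
#define SZ_2M   (2UL << 20)

int main(void)
{
        int fd = open("/dev/dax0.0", O_RDWR);   /* assumed 1G-aligned devdax */
        char *src, *area, *target, *dst;

        src = mmap(NULL, SZ_1G, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        src[0] = 1;                             /* populate the huge PUD entry */

        /* Reserve a window, then pick a 2M-aligned but not 1G-aligned target. */
        area = mmap(NULL, 3 * SZ_1G, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        target = (char *)(((unsigned long)area + SZ_1G - 1) & ~(SZ_1G - 1));
        target += SZ_2M;

        dst = mremap(src, SZ_1G, SZ_1G, MREMAP_MAYMOVE | MREMAP_FIXED, target);
        if (dst == MAP_FAILED)
                perror("mremap");
        return 0;
}

Whether this actually reaches get_old_pmd() with the huge PUD still
present is exactly the question above; I haven't tried it.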



* Re: [PATCH v5 0/7] mm/mprotect: Fix dax puds
  2024-11-11 21:20 ` Jann Horn
@ 2024-11-13 16:39   ` Peter Xu
  2024-11-13 16:42     ` David Hildenbrand
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Xu @ 2024-11-13 16:39 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-kernel, linux-mm, Kirill A . Shutemov, Nicholas Piggin,
	David Hildenbrand, Matthew Wilcox, Andrew Morton, James Houghton,
	Huang Ying, Aneesh Kumar K . V, Vlastimil Babka,
	Rick P Edgecombe, Hugh Dickins, Borislav Petkov,
	Christophe Leroy, Michael Ellerman, Rik van Riel, Dan Williams,
	Mel Gorman, x86, Ingo Molnar, linuxppc-dev, Dave Hansen,
	Dave Jiang, Oscar Salvador, Thomas Gleixner

On Mon, Nov 11, 2024 at 10:20:59PM +0100, Jann Horn wrote:
> On Mon, Aug 12, 2024 at 8:12 PM Peter Xu <peterx@redhat.com> wrote:
> > Dax supports pud pages for a while, but mprotect on puds was missing since
> > the start.  This series tries to fix that by providing pud handling in
> > mprotect().  The goal is to add more types of pud mappings like hugetlb or
> > pfnmaps.  This series paves way for it by fixing known pud entries.
> 
> Do people actually use hardware where they can use PUD THP mappings
> for DAX? I thought that was just some esoteric feature that isn't
> actually usable on almost any system.
> Was I wrong about that?

I did run it with a qemu emulated nvdimm device.  Though in reality I've no
idea on how many people are using it..

> 
> I think another example that probably doesn't play entirely nice with
> PUD THP mappings is mremap()'s move_page_tables(). If
> dax_get_unmapped_area() allows creating a VMA at an unaligned start
> address (which I think it does?), move_page_tables() can probably end
> up copying from an aligned address mapped with a huge PUD entry to an
> unaligned address that needs to be mapped at the PTE level, and I
> think that will probably cause it to call into get_old_pmd() while a
> huge PUD entry is still present, which will probably get us a
> pud_bad() error or such?

I think you're probably right, that we have other places that may not work
well with pud mappings.

I also wonder whether dax_get_unmapped_area() needs to properly handle
MAP_FIXED, even for PMD mappings.

It looks like it always falls back to the default mm_get_unmapped_area()
with FIXED, which has no idea about dax->alignment, so it'll always allow
it..  The issue is I'm not sure a dax pmd can be split at all, while I
think being splittable is needed when mremap()ing from a pmd-aligned
address to a !pmd-aligned address.

Thanks,

-- 
Peter Xu




* Re: [PATCH v5 0/7] mm/mprotect: Fix dax puds
  2024-11-13 16:39   ` Peter Xu
@ 2024-11-13 16:42     ` David Hildenbrand
  2024-11-13 17:56       ` Peter Xu
  0 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2024-11-13 16:42 UTC (permalink / raw)
  To: Peter Xu, Jann Horn
  Cc: linux-kernel, linux-mm, Kirill A . Shutemov, Nicholas Piggin,
	Matthew Wilcox, Andrew Morton, James Houghton, Huang Ying,
	Aneesh Kumar K . V, Vlastimil Babka, Rick P Edgecombe,
	Hugh Dickins, Borislav Petkov, Christophe Leroy,
	Michael Ellerman, Rik van Riel, Dan Williams, Mel Gorman, x86,
	Ingo Molnar, linuxppc-dev, Dave Hansen, Dave Jiang,
	Oscar Salvador, Thomas Gleixner

On 13.11.24 17:39, Peter Xu wrote:
> On Mon, Nov 11, 2024 at 10:20:59PM +0100, Jann Horn wrote:
>> On Mon, Aug 12, 2024 at 8:12 PM Peter Xu <peterx@redhat.com> wrote:
>>> Dax supports pud pages for a while, but mprotect on puds was missing since
>>> the start.  This series tries to fix that by providing pud handling in
>>> mprotect().  The goal is to add more types of pud mappings like hugetlb or
>>> pfnmaps.  This series paves way for it by fixing known pud entries.
>>
>> Do people actually use hardware where they can use PUD THP mappings
>> for DAX? I thought that was just some esoteric feature that isn't
>> actually usable on almost any system.
>> Was I wrong about that?
> 
> I did run it with a qemu emulated nvdimm device.  Though in reality I've no
> idea on how many people are using it..

I wonder if we still have to support it ... or could disable it+rip it out.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v5 0/7] mm/mprotect: Fix dax puds
  2024-11-13 16:42     ` David Hildenbrand
@ 2024-11-13 17:56       ` Peter Xu
  2024-11-13 18:45         ` David Hildenbrand
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Xu @ 2024-11-13 17:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jann Horn, linux-kernel, linux-mm, Kirill A . Shutemov,
	Nicholas Piggin, Matthew Wilcox, Andrew Morton, James Houghton,
	Huang Ying, Aneesh Kumar K . V, Vlastimil Babka,
	Rick P Edgecombe, Hugh Dickins, Borislav Petkov,
	Christophe Leroy, Michael Ellerman, Rik van Riel, Dan Williams,
	Mel Gorman, x86, Ingo Molnar, linuxppc-dev, Dave Hansen,
	Dave Jiang, Oscar Salvador, Thomas Gleixner

On Wed, Nov 13, 2024 at 05:42:15PM +0100, David Hildenbrand wrote:
> On 13.11.24 17:39, Peter Xu wrote:
> > On Mon, Nov 11, 2024 at 10:20:59PM +0100, Jann Horn wrote:
> > > On Mon, Aug 12, 2024 at 8:12 PM Peter Xu <peterx@redhat.com> wrote:
> > > > Dax supports pud pages for a while, but mprotect on puds was missing since
> > > > the start.  This series tries to fix that by providing pud handling in
> > > > mprotect().  The goal is to add more types of pud mappings like hugetlb or
> > > > pfnmaps.  This series paves way for it by fixing known pud entries.
> > > 
> > > Do people actually use hardware where they can use PUD THP mappings
> > > for DAX? I thought that was just some esoteric feature that isn't
> > > actually usable on almost any system.
> > > Was I wrong about that?
> > 
> > I did run it with a qemu emulated nvdimm device.  Though in reality I've no
> > idea on how many people are using it..
> 
> I wonder if we still have to support it ... or could disable it+rip it out.

Note that in my previous email, I also mentioned mremap() for PMDs on dax.
If that's a real problem, it won't be fixed even by dropping dax PUD
support.

And we'd definitely want to understand whether there are still users of
pud dax before considering dropping anything.. it could still be that both
mprotect() and mremap() are simply not yet used in the current use cases.

Thanks,

-- 
Peter Xu




* Re: [PATCH v5 0/7] mm/mprotect: Fix dax puds
  2024-11-13 17:56       ` Peter Xu
@ 2024-11-13 18:45         ` David Hildenbrand
  0 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2024-11-13 18:45 UTC (permalink / raw)
  To: Peter Xu
  Cc: Jann Horn, linux-kernel, linux-mm, Kirill A . Shutemov,
	Nicholas Piggin, Matthew Wilcox, Andrew Morton, James Houghton,
	Huang Ying, Aneesh Kumar K . V, Vlastimil Babka,
	Rick P Edgecombe, Hugh Dickins, Borislav Petkov,
	Christophe Leroy, Michael Ellerman, Rik van Riel, Dan Williams,
	Mel Gorman, x86, Ingo Molnar, linuxppc-dev, Dave Hansen,
	Dave Jiang, Oscar Salvador, Thomas Gleixner

On 13.11.24 18:56, Peter Xu wrote:
> On Wed, Nov 13, 2024 at 05:42:15PM +0100, David Hildenbrand wrote:
>> On 13.11.24 17:39, Peter Xu wrote:
>>> On Mon, Nov 11, 2024 at 10:20:59PM +0100, Jann Horn wrote:
>>>> On Mon, Aug 12, 2024 at 8:12 PM Peter Xu <peterx@redhat.com> wrote:
>>>>> Dax supports pud pages for a while, but mprotect on puds was missing since
>>>>> the start.  This series tries to fix that by providing pud handling in
>>>>> mprotect().  The goal is to add more types of pud mappings like hugetlb or
>>>>> pfnmaps.  This series paves way for it by fixing known pud entries.
>>>>
>>>> Do people actually use hardware where they can use PUD THP mappings
>>>> for DAX? I thought that was just some esoteric feature that isn't
>>>> actually usable on almost any system.
>>>> Was I wrong about that?
>>>
>>> I did run it with a qemu emulated nvdimm device.  Though in reality I've no
>>> idea on how many people are using it..
>>
>> I wonder if we still have to support it ... or could disable it+rip it out.
> 
> Note that in my previous email, I also mentioned mremap() for PMD on dax
> too.  If that's a real problem, it won't be fixed even if dropping dax PUD
> support.
>

Very true.


> And we definitely want to understand whether there're still users on pud
> dax to consider dropping anything.. it could still be that both mprotect()
> and mremap() are not yet used in the current use cases.

Right, but at least NVDIMMs are getting less important.  Just a thought
on whether this is really still worth having.

-- 
Cheers,

David / dhildenb



