* [PATCH 0/7] mm/mprotect: Fix dax puds
@ 2024-06-21 14:24 Peter Xu
2024-06-21 14:24 ` [PATCH 1/7] mm/dax: Dump start address in fault handler Peter Xu
` (6 more replies)
0 siblings, 7 replies; 17+ messages in thread
From: Peter Xu @ 2024-06-21 14:24 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: x86, Borislav Petkov, Dave Jiang, Kirill A . Shutemov,
Ingo Molnar, Oscar Salvador, peterx, Matthew Wilcox,
Vlastimil Babka, Dan Williams, Andrew Morton, Hugh Dickins,
Michael Ellerman, Dave Hansen, Thomas Gleixner, linuxppc-dev,
Christophe Leroy, Rik van Riel, Mel Gorman, Aneesh Kumar K . V,
Nicholas Piggin, Huang Ying
[Based on mm-unstable, commit a53138cdbe3e]
Dax has supported pud pages for a while, but mprotect() on puds has been
missing since the start. This series fixes that by providing pud handling
in mprotect(). My real goal is to add more types of pud mappings, like
hugetlb or pfnmaps; it's just that we probably want dax puds to work first
and build the rest on top.
Considering that nobody reported this until I started looking at those
other types of pud mappings, I'm thinking it may not need to be a stable
fix or be backported. I would guess whoever cares about mprotect() doesn't
care about 1G dax puds yet, and vice versa. I hope fixing it only in new
kernels is fine, but I'm open to suggestions.
Quite a few small things are changed here and there to teach mprotect()
to work on PUDs. E.g. the series starts by dropping NUMA_HUGE_PTE_UPDATES,
which stops making much sense once there can be more than one type of huge
pte (meanwhile it doesn't seem right to account non-NUMA operations there
either.. more in the commit message of the relevant patch). OTOH, we'll
also need to push the mmu notifiers from the pmd to the pud layer, which
might need some attention, but so far I think it's safe. For these small
details, please refer to each patch's commit message.
The mprotect() pud path is hopefully straightforward enough, as I kept it
as simple as possible. There's no NUMA handling, as dax simply doesn't
support that. There's also no userfault involvement, as file memory (even
when used with userfault-wp async mode) will need to split a pud, so pud
entries don't yet need to know about userfault's existence (hugetlb
entries will; that's also for later).
Tests
=====
What I did test:
- cross-build tests that I normally cover [1], except for a known issue
elsewhere on hugetlb [2]
- smoke tested on x86_64 with the simplest program [3], doing mprotect()
on a dev_dax 1G PUD mapping, using QEMU's nvdimm emulation [4] and ndctl
to create namespaces with proper alignment; this used to throw "bad pud"
but now runs through fine. I also checked that SIGBUS happens on illegal
access to the protected puds (see the sketch at the end of this section).
What I didn't test:
- fsdax: I wanted to give it a shot too, but only then noticed it doesn't
seem to be supported (according to dax_iomap_fault(), which always falls
back for PUD_ORDER). I do remember it being supported before, so I could
be missing something important there.. please shoot if so.
- userfault wp-async: I also wanted to test that userfault-wp async can
split huge puds (here it's simply a clear_pud.. though), but it won't
work for devdax anyway, as faults smaller than 1G aren't allowed in this
case. So skipped too.
- Power, as no hardware on hand.
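The smoke test mentioned above boils down to roughly the following (a
sketch only, not the exact program in [3]; the /dev/dax0.0 path is an
assumption and depends on the devdax namespace ndctl created):

/*
 * Sketch: mprotect() a 1G dev_dax PUD mapping.  Assumes a 1G-aligned
 * devdax namespace exposed as /dev/dax0.0.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define LEN	(1UL << 30)	/* one 1G PUD */

int main(void)
{
	char *p;
	int fd = open("/dev/dax0.0", O_RDWR);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* devdax wants a shared, PUD-aligned mapping to install a 1G PUD */
	p = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	p[0] = 1;	/* fault in the 1G PUD */

	/* This used to hit "bad pud"; with the series it should succeed */
	if (mprotect(p, LEN, PROT_READ))
		perror("mprotect");

	/* p[0] = 2; */	/* an illegal write here should now raise a signal */

	munmap(p, LEN);
	close(fd);
	return 0;
}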
Thanks,
[1] https://gitlab.com/peterx/lkb-harness/-/blob/main/config.json
[2] https://lore.kernel.org/all/202406190956.9j1UCIe5-lkp@intel.com
[3] https://github.com/xzpeter/clibs/blob/master/misc/dax.c
[4] https://github.com/qemu/qemu/blob/master/docs/nvdimm.txt
Peter Xu (7):
mm/dax: Dump start address in fault handler
mm/mprotect: Remove NUMA_HUGE_PTE_UPDATES
mm/mprotect: Push mmu notifier to PUDs
mm/powerpc: Add missing pud helpers
mm/x86: Make pud_leaf() only cares about PSE bit
mm/x86: Add missing pud helpers
mm/mprotect: fix dax pud handlings
arch/powerpc/include/asm/book3s/64/pgtable.h | 3 +
arch/powerpc/mm/book3s64/pgtable.c | 20 ++++++
arch/x86/include/asm/pgtable.h | 39 ++++++++++-
arch/x86/mm/pgtable.c | 11 +++
drivers/dax/device.c | 6 +-
include/linux/huge_mm.h | 24 +++++++
include/linux/vm_event_item.h | 1 -
mm/huge_memory.c | 52 ++++++++++++++
mm/mprotect.c | 74 ++++++++++++--------
mm/vmstat.c | 1 -
10 files changed, 195 insertions(+), 36 deletions(-)
--
2.45.0
* [PATCH 1/7] mm/dax: Dump start address in fault handler
2024-06-21 14:24 [PATCH 0/7] mm/mprotect: Fix dax puds Peter Xu
@ 2024-06-21 14:24 ` Peter Xu
2024-06-21 14:24 ` [PATCH 2/7] mm/mprotect: Remove NUMA_HUGE_PTE_UPDATES Peter Xu
` (5 subsequent siblings)
6 siblings, 0 replies; 17+ messages in thread
From: Peter Xu @ 2024-06-21 14:24 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: x86, Borislav Petkov, Dave Jiang, Kirill A . Shutemov,
Ingo Molnar, Oscar Salvador, peterx, Matthew Wilcox,
Vlastimil Babka, Dan Williams, Andrew Morton, Hugh Dickins,
Michael Ellerman, Dave Hansen, Thomas Gleixner, linuxppc-dev,
Christophe Leroy, Rik van Riel, Mel Gorman, Aneesh Kumar K . V,
Nicholas Piggin, Huang Ying
Currently the dax fault handler dumps the vma range when dynamic debugging
is enabled. That's mostly not useful. Dump the (aligned) faulting address
instead, together with the order info.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
drivers/dax/device.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index eb61598247a9..714174844ca5 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -235,9 +235,9 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf, unsigned int order)
int id;
struct dev_dax *dev_dax = filp->private_data;
- dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) order:%d\n", current->comm,
- (vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read",
- vmf->vma->vm_start, vmf->vma->vm_end, order);
+ dev_dbg(&dev_dax->dev, "%s: op=%s addr=%#lx order=%d\n", current->comm,
+ (vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read",
+ vmf->address & ~((1UL << (order + PAGE_SHIFT)) - 1), order);
id = dax_read_lock();
if (order == 0)
--
2.45.0
* [PATCH 2/7] mm/mprotect: Remove NUMA_HUGE_PTE_UPDATES
2024-06-21 14:24 [PATCH 0/7] mm/mprotect: Fix dax puds Peter Xu
2024-06-21 14:24 ` [PATCH 1/7] mm/dax: Dump start address in fault handler Peter Xu
@ 2024-06-21 14:24 ` Peter Xu
2024-06-21 14:25 ` [PATCH 3/7] mm/mprotect: Push mmu notifier to PUDs Peter Xu
` (4 subsequent siblings)
6 siblings, 0 replies; 17+ messages in thread
From: Peter Xu @ 2024-06-21 14:24 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: x86, Borislav Petkov, Dave Jiang, Kirill A . Shutemov,
Ingo Molnar, Oscar Salvador, peterx, Matthew Wilcox,
Vlastimil Babka, Dan Williams, Andrew Morton, Hugh Dickins,
Michael Ellerman, Dave Hansen, Thomas Gleixner, linuxppc-dev,
Christophe Leroy, Rik van Riel, Mel Gorman, Aneesh Kumar K . V,
Nicholas Piggin, Huang Ying, Alex Thorlton
In 2013, commit 72403b4a0fbd ("mm: numa: return the number of base pages
altered by protection changes") introduced "numa_huge_pte_updates" vmstat
entry, trying to capture how many huge ptes (in reality, PMD thps at that
time) are marked by NUMA balancing.
This patch proposes to remove it, for a few reasons.
Firstly, the name is misleading. Nowadays there is more than one way to
have a "huge pte", and that's also the major goal of this series, which
paves the way for PUD handling in the change-protection code paths. PUDs
are coming not only for dax (which is already here, just broken..), but
also for pfnmaps and hugetlb pages. The name simply stops making sense
once PUDs start to be involved in the mprotect() world.
It also wouldn't be reasonable to bump the counter for both pmds and puds.
In short, the current accounting won't be right once PUDs arrive, so the
scheme was only suitable back when PUDs weren't even possible.
Secondly, the accounting was simply not right from the start, as it was
also affected by call sites other than NUMA balancing. mprotect() is one,
and userfaultfd-wp also leverages the change-protection path to modify
pgtables. To do it right, the code would need to check the caller, but it
never did; at least mprotect() was already there in 2013.
This gives me the impression that nobody is seriously using this field,
and that it's also impossible to use it seriously given the above.
We may want to do it right if any NUMA developer would like it to exist,
but then we should do so with all of the above resolved, both taking PUDs
into account and with correct accounting. That can be done on top when
there's a real need for it.
Cc: Huang Ying <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
include/linux/vm_event_item.h | 1 -
mm/mprotect.c | 8 +-------
mm/vmstat.c | 1 -
3 files changed, 1 insertion(+), 9 deletions(-)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 73fa5fbf33a3..a76b75d6a8f4 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -59,7 +59,6 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
OOM_KILL,
#ifdef CONFIG_NUMA_BALANCING
NUMA_PTE_UPDATES,
- NUMA_HUGE_PTE_UPDATES,
NUMA_HINT_FAULTS,
NUMA_HINT_FAULTS_LOCAL,
NUMA_PAGE_MIGRATE,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 222ab434da54..21172272695e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -363,7 +363,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
pmd_t *pmd;
unsigned long next;
long pages = 0;
- unsigned long nr_huge_updates = 0;
struct mmu_notifier_range range;
range.start = 0;
@@ -411,11 +410,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
ret = change_huge_pmd(tlb, vma, pmd,
addr, newprot, cp_flags);
if (ret) {
- if (ret == HPAGE_PMD_NR) {
+ if (ret == HPAGE_PMD_NR)
pages += HPAGE_PMD_NR;
- nr_huge_updates++;
- }
-
/* huge pmd was handled */
goto next;
}
@@ -435,8 +431,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
if (range.start)
mmu_notifier_invalidate_range_end(&range);
- if (nr_huge_updates)
- count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
return pages;
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6e3347789eb2..be774893f69e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1313,7 +1313,6 @@ const char * const vmstat_text[] = {
#ifdef CONFIG_NUMA_BALANCING
"numa_pte_updates",
- "numa_huge_pte_updates",
"numa_hint_faults",
"numa_hint_faults_local",
"numa_pages_migrated",
--
2.45.0
* [PATCH 3/7] mm/mprotect: Push mmu notifier to PUDs
2024-06-21 14:24 [PATCH 0/7] mm/mprotect: Fix dax puds Peter Xu
2024-06-21 14:24 ` [PATCH 1/7] mm/dax: Dump start address in fault handler Peter Xu
2024-06-21 14:24 ` [PATCH 2/7] mm/mprotect: Remove NUMA_HUGE_PTE_UPDATES Peter Xu
@ 2024-06-21 14:25 ` Peter Xu
2024-06-21 14:25 ` [PATCH 4/7] mm/powerpc: Add missing pud helpers Peter Xu
` (3 subsequent siblings)
6 siblings, 0 replies; 17+ messages in thread
From: Peter Xu @ 2024-06-21 14:25 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: x86, Borislav Petkov, Dave Jiang, Kirill A . Shutemov,
Ingo Molnar, Oscar Salvador, peterx, Matthew Wilcox,
Vlastimil Babka, Dan Williams, Andrew Morton, Hugh Dickins,
Michael Ellerman, Dave Hansen, Thomas Gleixner, linuxppc-dev,
Christophe Leroy, Rik van Riel, Mel Gorman, Aneesh Kumar K . V,
Nicholas Piggin, Huang Ying, kvm, Sean Christopherson,
Paolo Bonzini, David Rientjes
mprotect() invokes mmu notifiers at the PMD level. It has been that way
since 2014, commit a5338093bfb4 ("mm: move mmu notifier call from
change_protection to change_pmd_range").
At that time, the issue was that NUMA balancing could be applied to a huge
range of VM memory even if nothing was populated. The notification can be
avoided in that case if no valid pmd is detected, where "valid" covers
both a THP and a PTE pgtable page.
Now, to pave the way for PUD handling, this isn't enough. We need to
generate mmu notifications properly for PUD entries too. mprotect() is
currently broken on PUDs (e.g., one can easily trigger a kernel error with
dax 1G mappings already); this is the start of fixing it.
To fix that, this patch proposes to push such notifications to the PUD
layer.
There is a risk of regressing the problem Rik wanted to resolve before,
but I don't think it should really happen, and I still chose this solution
for a few reasons:
1) Consider a large VM that definitely contains many GBs of memory: if
nothing is populated, it's highly likely that the PUDs are also none, so
in that case there will be no regression.
2) KVM has evolved a lot over the years to get rid of rmap walks, which
might have been the major cause of the previous soft lockup. At least the
TDP MMU already got rid of rmap as long as it's not nested (which should
be the major use case, IIUC), so the TDP MMU pgtable walker will simply
see an empty VM pgtable (e.g. EPT on x86); invalidating a fully empty
region should be pretty fast now in most cases, compared to 2014.
3) KVM now has explicit code paths to give way to mmu notifiers just like
this one, e.g. commit d02c357e5bfa ("KVM: x86/mmu: Retry fault before
acquiring mmu_lock if mapping is changing"). That also avoids contention
that may contribute to a soft lockup.
4) Sticking with the PMD layer simply doesn't work when PUDs are there...
We need one way or another to fix PUD mappings in mprotect().
Pushing it to the PUD level should be the safest approach as of now, e.g.
there's no sign yet of huge P4Ds coming on any known arch.
Cc: kvm@vger.kernel.org
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/mprotect.c | 26 ++++++++++++--------------
1 file changed, 12 insertions(+), 14 deletions(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 21172272695e..fb8bf3ff7cd9 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -363,9 +363,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
pmd_t *pmd;
unsigned long next;
long pages = 0;
- struct mmu_notifier_range range;
-
- range.start = 0;
pmd = pmd_offset(pud, addr);
do {
@@ -383,14 +380,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
if (pmd_none(*pmd))
goto next;
- /* invoke the mmu notifier if the pmd is populated */
- if (!range.start) {
- mmu_notifier_range_init(&range,
- MMU_NOTIFY_PROTECTION_VMA, 0,
- vma->vm_mm, addr, end);
- mmu_notifier_invalidate_range_start(&range);
- }
-
_pmd = pmdp_get_lockless(pmd);
if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
if ((next - addr != HPAGE_PMD_SIZE) ||
@@ -428,9 +417,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
cond_resched();
} while (pmd++, addr = next, addr != end);
- if (range.start)
- mmu_notifier_invalidate_range_end(&range);
-
return pages;
}
@@ -438,10 +424,13 @@ static inline long change_pud_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, p4d_t *p4d, unsigned long addr,
unsigned long end, pgprot_t newprot, unsigned long cp_flags)
{
+ struct mmu_notifier_range range;
pud_t *pud;
unsigned long next;
long pages = 0, ret;
+ range.start = 0;
+
pud = pud_offset(p4d, addr);
do {
next = pud_addr_end(addr, end);
@@ -450,10 +439,19 @@ static inline long change_pud_range(struct mmu_gather *tlb,
return ret;
if (pud_none_or_clear_bad(pud))
continue;
+ if (!range.start) {
+ mmu_notifier_range_init(&range,
+ MMU_NOTIFY_PROTECTION_VMA, 0,
+ vma->vm_mm, addr, end);
+ mmu_notifier_invalidate_range_start(&range);
+ }
pages += change_pmd_range(tlb, vma, pud, addr, next, newprot,
cp_flags);
} while (pud++, addr = next, addr != end);
+ if (range.start)
+ mmu_notifier_invalidate_range_end(&range);
+
return pages;
}
--
2.45.0
* [PATCH 4/7] mm/powerpc: Add missing pud helpers
2024-06-21 14:24 [PATCH 0/7] mm/mprotect: Fix dax puds Peter Xu
` (2 preceding siblings ...)
2024-06-21 14:25 ` [PATCH 3/7] mm/mprotect: Push mmu notifier to PUDs Peter Xu
@ 2024-06-21 14:25 ` Peter Xu
2024-06-21 15:01 ` Peter Xu
2024-06-21 14:25 ` [PATCH 5/7] mm/x86: Make pud_leaf() only cares about PSE bit Peter Xu
` (2 subsequent siblings)
6 siblings, 1 reply; 17+ messages in thread
From: Peter Xu @ 2024-06-21 14:25 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: x86, Borislav Petkov, Dave Jiang, Kirill A . Shutemov,
Ingo Molnar, Oscar Salvador, peterx, Matthew Wilcox,
Vlastimil Babka, Dan Williams, Andrew Morton, Hugh Dickins,
Michael Ellerman, Dave Hansen, Thomas Gleixner, linuxppc-dev,
Christophe Leroy, Rik van Riel, Mel Gorman, Aneesh Kumar K . V,
Nicholas Piggin, Huang Ying
These new helpers will be needed for pud entry updates soon. Namely:
- pudp_invalidate()
- pud_modify()
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
arch/powerpc/include/asm/book3s/64/pgtable.h | 3 +++
arch/powerpc/mm/book3s64/pgtable.c | 20 ++++++++++++++++++++
2 files changed, 23 insertions(+)
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 8f9432e3855a..fc628a984669 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1108,6 +1108,7 @@ extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
extern pud_t pfn_pud(unsigned long pfn, pgprot_t pgprot);
extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot);
+extern pud_t pud_modify(pud_t pud, pgprot_t newprot);
extern void set_pmd_at(struct mm_struct *mm, unsigned long addr,
pmd_t *pmdp, pmd_t pmd);
extern void set_pud_at(struct mm_struct *mm, unsigned long addr,
@@ -1368,6 +1369,8 @@ static inline pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm,
#define __HAVE_ARCH_PMDP_INVALIDATE
extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp);
+extern pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+ pud_t *pudp);
#define pmd_move_must_withdraw pmd_move_must_withdraw
struct spinlock;
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 2975ea0841ba..c6ae969020e0 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -176,6 +176,17 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
return __pmd(old_pmd);
}
+pmd_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+ pud_t *pudp)
+{
+ unsigned long old_pud;
+
+ VM_WARN_ON_ONCE(!pmd_present(*pmdp));
+ old_pmd = pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, _PAGE_INVALID);
+ flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+ return __pmd(old_pmd);
+}
+
pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmdp, int full)
{
@@ -259,6 +270,15 @@ pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
pmdv &= _HPAGE_CHG_MASK;
return pmd_set_protbits(__pmd(pmdv), newprot);
}
+
+pud_t pud_modify(pud_t pud, pgprot_t newprot)
+{
+ unsigned long pudv;
+
+ pudv = pud_val(pud);
+ pudv &= _HPAGE_CHG_MASK;
+ return pud_set_protbits(__pud(pudv), newprot);
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
/* For use by kexec, called with MMU off */
--
2.45.0
* [PATCH 5/7] mm/x86: Make pud_leaf() only cares about PSE bit
2024-06-21 14:24 [PATCH 0/7] mm/mprotect: Fix dax puds Peter Xu
` (3 preceding siblings ...)
2024-06-21 14:25 ` [PATCH 4/7] mm/powerpc: Add missing pud helpers Peter Xu
@ 2024-06-21 14:25 ` Peter Xu
2024-06-21 14:32 ` Dave Hansen
2024-06-21 14:25 ` [PATCH 6/7] mm/x86: Add missing pud helpers Peter Xu
2024-06-21 14:25 ` [PATCH 7/7] mm/mprotect: fix dax pud handlings Peter Xu
6 siblings, 1 reply; 17+ messages in thread
From: Peter Xu @ 2024-06-21 14:25 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: x86, Borislav Petkov, Dave Jiang, Kirill A . Shutemov,
Ingo Molnar, Oscar Salvador, peterx, Matthew Wilcox,
Vlastimil Babka, Dan Williams, Andrew Morton, Hugh Dickins,
Michael Ellerman, Dave Hansen, Thomas Gleixner, linuxppc-dev,
Christophe Leroy, Rik van Riel, Mel Gorman, Aneesh Kumar K . V,
Nicholas Piggin, Huang Ying
An entry should be reported as a PUD leaf even if it's PROT_NONE, in which
case the PRESENT bit isn't set. Without this I hit "bad pud" when testing
dax 1G mappings and zapping a PROT_NONE PUD.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
arch/x86/include/asm/pgtable.h | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 65b8e5bb902c..25fc6d809572 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1073,8 +1073,7 @@ static inline pmd_t *pud_pgtable(pud_t pud)
#define pud_leaf pud_leaf
static inline bool pud_leaf(pud_t pud)
{
- return (pud_val(pud) & (_PAGE_PSE | _PAGE_PRESENT)) ==
- (_PAGE_PSE | _PAGE_PRESENT);
+ return pud_val(pud) & _PAGE_PSE;
}
static inline int pud_bad(pud_t pud)
--
2.45.0
* [PATCH 6/7] mm/x86: Add missing pud helpers
2024-06-21 14:24 [PATCH 0/7] mm/mprotect: Fix dax puds Peter Xu
` (4 preceding siblings ...)
2024-06-21 14:25 ` [PATCH 5/7] mm/x86: Make pud_leaf() only cares about PSE bit Peter Xu
@ 2024-06-21 14:25 ` Peter Xu
2024-06-21 14:51 ` Dave Hansen
2024-06-21 14:25 ` [PATCH 7/7] mm/mprotect: fix dax pud handlings Peter Xu
6 siblings, 1 reply; 17+ messages in thread
From: Peter Xu @ 2024-06-21 14:25 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: x86, Borislav Petkov, Dave Jiang, Kirill A . Shutemov,
Ingo Molnar, Oscar Salvador, peterx, Matthew Wilcox,
Vlastimil Babka, Dan Williams, Andrew Morton, Hugh Dickins,
Michael Ellerman, Dave Hansen, Thomas Gleixner, linuxppc-dev,
Christophe Leroy, Rik van Riel, Mel Gorman, Aneesh Kumar K . V,
Nicholas Piggin, Huang Ying
These new helpers will be needed for pud entry updates soon. Namely:
- pudp_invalidate()
- pud_modify()
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
arch/x86/include/asm/pgtable.h | 36 ++++++++++++++++++++++++++++++++++
arch/x86/mm/pgtable.c | 11 +++++++++++
2 files changed, 47 insertions(+)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 25fc6d809572..3c23077adca6 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -775,6 +775,12 @@ static inline pmd_t pmd_mkinvalid(pmd_t pmd)
__pgprot(pmd_flags(pmd) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
}
+static inline pud_t pud_mkinvalid(pud_t pud)
+{
+ return pfn_pud(pud_pfn(pud),
+ __pgprot(pud_flags(pud) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
+}
+
static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
@@ -839,6 +845,21 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
return pmd_result;
}
+static inline pud_t pud_modify(pud_t pud, pgprot_t newprot)
+{
+ pudval_t val = pud_val(pud), oldval = val;
+
+ /*
+ * NOTE: no need to consider shadow stack complexities because it
+ * doesn't support 1G mappings.
+ */
+ val &= _HPAGE_CHG_MASK;
+ val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
+ val = flip_protnone_guard(oldval, val, PHYSICAL_PUD_PAGE_MASK);
+
+ return __pud(val);
+}
+
/*
* mprotect needs to preserve PAT and encryption bits when updating
* vm_page_prot
@@ -1377,10 +1398,25 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
}
#endif
+static inline pud_t pudp_establish(struct vm_area_struct *vma,
+ unsigned long address, pud_t *pudp, pud_t pud)
+{
+ if (IS_ENABLED(CONFIG_SMP)) {
+ return xchg(pudp, pud);
+ } else {
+ pud_t old = *pudp;
+ WRITE_ONCE(*pudp, pud);
+ return old;
+ }
+}
+
#define __HAVE_ARCH_PMDP_INVALIDATE_AD
extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp);
+pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+ pud_t *pudp);
+
/*
* Page table pages are page-aligned. The lower half of the top
* level is used for userspace and the top half for the kernel.
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 93e54ba91fbf..4e245a1526ad 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -641,6 +641,17 @@ pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
}
#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+ pud_t *pudp)
+{
+ VM_WARN_ON_ONCE(!pud_present(*pudp));
+ pud_t old = pudp_establish(vma, address, pudp, pud_mkinvalid(*pudp));
+ flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
+ return old;
+}
+#endif
+
/**
* reserve_top_address - reserves a hole in the top of kernel address space
* @reserve - size of hole to reserve
--
2.45.0
* [PATCH 7/7] mm/mprotect: fix dax pud handlings
2024-06-21 14:24 [PATCH 0/7] mm/mprotect: Fix dax puds Peter Xu
` (5 preceding siblings ...)
2024-06-21 14:25 ` [PATCH 6/7] mm/x86: Add missing pud helpers Peter Xu
@ 2024-06-21 14:25 ` Peter Xu
6 siblings, 0 replies; 17+ messages in thread
From: Peter Xu @ 2024-06-21 14:25 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: x86, Borislav Petkov, Dave Jiang, Kirill A . Shutemov,
Ingo Molnar, Oscar Salvador, peterx, Matthew Wilcox,
Vlastimil Babka, Dan Williams, Andrew Morton, Hugh Dickins,
Michael Ellerman, Dave Hansen, Thomas Gleixner, linuxppc-dev,
Christophe Leroy, Rik van Riel, Mel Gorman, Aneesh Kumar K . V,
Nicholas Piggin, Huang Ying
This is only relevant to the two archs that support PUD dax, i.e. x86_64
and ppc64. PUD THPs do not yet exist elsewhere, and hugetlb PUDs do not
count in this case.
DAX has had PUD mappings for years, but the change protection path never
worked. When the path is triggered in any form (a simple test program
would be: call mprotect() on a 1G dev_dax mapping), the kernel reports
"bad pud". This patch should fix that.
The new change_huge_pud() tries to keep everything simple. For example,
it doesn't optimize the write bit, as that would need even more PUD
helpers. It's not too bad anyway to take one more write fault in the
worst case once per 1G range; it would be a bigger deal if it were per
PAGE_SIZE, though. Neither does it support userfault-wp bits, as no such
PUD mappings are supported; file mappings always need a split there.
The same goes for TLB shootdown: the pmd path (which was x86 only) has the
trick of using the _ad() version of pmdp_invalidate*(), which can avoid
one redundant TLB flush, but let's also leave that for later. Again, the
larger the mapping, the smaller the effect.
Another thing worth mentioning is that this path needs to be careful about
handling the "retry" case for change_huge_pud() (where it can return 0):
it isn't like change_huge_pmd(), as the pmd version is safe with all
conditions handled in change_pte_range() later, thanks to Hugh's new
pte_offset_map_lock(). In short, change_pte_range() is simply smarter
than change_pmd_range() after the shmem thp collapse rework. For that
reason, change_pud_range() needs a proper retry if it races with something
else and a huge PUD changes from under us.
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: x86@kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages")
Fixes: 27af67f35631 ("powerpc/book3s64/mm: enable transparent pud hugepage")
Signed-off-by: Peter Xu <peterx@redhat.com>
---
include/linux/huge_mm.h | 24 +++++++++++++++++++
mm/huge_memory.c | 52 +++++++++++++++++++++++++++++++++++++++++
mm/mprotect.c | 40 ++++++++++++++++++++++++-------
3 files changed, 108 insertions(+), 8 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 212cca384d7e..b46673689f19 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -345,6 +345,17 @@ void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
unsigned long address);
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pud_t *pudp, unsigned long addr, pgprot_t newprot,
+ unsigned long cp_flags);
+#else
+static inline int
+change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pud_t *pudp, unsigned long addr, pgprot_t newprot,
+ unsigned long cp_flags) { return 0; }
+#endif
+
#define split_huge_pud(__vma, __pud, __address) \
do { \
pud_t *____pud = (__pud); \
@@ -588,6 +599,19 @@ static inline int next_order(unsigned long *orders, int prev)
{
return 0;
}
+
+static inline void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
+ unsigned long address)
+{
+}
+
+static inline int change_huge_pud(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pud_t *pudp,
+ unsigned long addr, pgprot_t newprot,
+ unsigned long cp_flags)
+{
+ return 0;
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
static inline int split_folio_to_list_to_order(struct folio *folio,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0fffaa58a47a..af8d83f4ce02 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2091,6 +2091,53 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
return ret;
}
+/*
+ * Returns:
+ *
+ * - 0: if pud leaf changed from under us
+ * - 1: if pud can be skipped
+ * - HPAGE_PUD_NR: if pud was successfully processed
+ */
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pud_t *pudp, unsigned long addr, pgprot_t newprot,
+ unsigned long cp_flags)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pud_t oldpud, entry;
+ spinlock_t *ptl;
+
+ tlb_change_page_size(tlb, HPAGE_PUD_SIZE);
+
+ /* NUMA balancing doesn't apply to dax */
+ if (cp_flags & MM_CP_PROT_NUMA)
+ return 1;
+
+ /*
+ * Huge entries on userfault-wp only works with anonymous, while we
+ * don't have anonymous PUDs yet.
+ */
+ if (WARN_ON_ONCE(cp_flags & MM_CP_UFFD_WP_ALL))
+ return 1;
+
+ ptl = __pud_trans_huge_lock(pudp, vma);
+ if (!ptl)
+ return 0;
+
+ /*
+ * Can't clear PUD or it can race with concurrent zapping. See
+ * change_huge_pmd().
+ */
+ oldpud = pudp_invalidate(vma, addr, pudp);
+ entry = pud_modify(oldpud, newprot);
+ set_pud_at(mm, addr, pudp, entry);
+ tlb_flush_pud_range(tlb, addr, HPAGE_PUD_SIZE);
+
+ spin_unlock(ptl);
+ return HPAGE_PUD_NR;
+}
+#endif
+
#ifdef CONFIG_USERFAULTFD
/*
* The PT lock for src_pmd and dst_vma/src_vma (for reading) are locked by
@@ -2319,6 +2366,11 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
spin_unlock(ptl);
mmu_notifier_invalidate_range_end(&range);
}
+#else
+void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
+ unsigned long address)
+{
+}
#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index fb8bf3ff7cd9..694f13b83864 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -425,29 +425,53 @@ static inline long change_pud_range(struct mmu_gather *tlb,
unsigned long end, pgprot_t newprot, unsigned long cp_flags)
{
struct mmu_notifier_range range;
- pud_t *pud;
+ pud_t *pudp, pud;
unsigned long next;
long pages = 0, ret;
range.start = 0;
- pud = pud_offset(p4d, addr);
+ pudp = pud_offset(p4d, addr);
do {
+again:
next = pud_addr_end(addr, end);
- ret = change_prepare(vma, pud, pmd, addr, cp_flags);
- if (ret)
- return ret;
- if (pud_none_or_clear_bad(pud))
+ ret = change_prepare(vma, pudp, pmd, addr, cp_flags);
+ if (ret) {
+ pages = ret;
+ break;
+ }
+
+ pud = READ_ONCE(*pudp);
+ if (pud_none(pud))
continue;
+
if (!range.start) {
mmu_notifier_range_init(&range,
MMU_NOTIFY_PROTECTION_VMA, 0,
vma->vm_mm, addr, end);
mmu_notifier_invalidate_range_start(&range);
}
- pages += change_pmd_range(tlb, vma, pud, addr, next, newprot,
+
+ if (pud_leaf(pud)) {
+ if ((next - addr != PUD_SIZE) ||
+ pgtable_split_needed(vma, cp_flags)) {
+ __split_huge_pud(vma, pudp, addr);
+ goto again;
+ } else {
+ ret = change_huge_pud(tlb, vma, pudp,
+ addr, newprot, cp_flags);
+ if (ret == 0)
+ goto again;
+ /* huge pud was handled */
+ if (ret == HPAGE_PUD_NR)
+ pages += HPAGE_PUD_NR;
+ continue;
+ }
+ }
+
+ pages += change_pmd_range(tlb, vma, pudp, addr, next, newprot,
cp_flags);
- } while (pud++, addr = next, addr != end);
+ } while (pudp++, addr = next, addr != end);
if (range.start)
mmu_notifier_invalidate_range_end(&range);
--
2.45.0
* Re: [PATCH 5/7] mm/x86: Make pud_leaf() only cares about PSE bit
2024-06-21 14:25 ` [PATCH 5/7] mm/x86: Make pud_leaf() only cares about PSE bit Peter Xu
@ 2024-06-21 14:32 ` Dave Hansen
0 siblings, 0 replies; 17+ messages in thread
From: Dave Hansen @ 2024-06-21 14:32 UTC (permalink / raw)
To: Peter Xu, linux-kernel, linux-mm
Cc: x86, Borislav Petkov, Dave Jiang, Kirill A . Shutemov,
Ingo Molnar, Oscar Salvador, Matthew Wilcox, Vlastimil Babka,
Dan Williams, Andrew Morton, Hugh Dickins, Michael Ellerman,
Dave Hansen, Thomas Gleixner, linuxppc-dev, Christophe Leroy,
Rik van Riel, Mel Gorman, Aneesh Kumar K . V, Nicholas Piggin,
Huang Ying
On 6/21/24 07:25, Peter Xu wrote:
> An entry should be reported as a PUD leaf even if it's PROT_NONE, in which
> case the PRESENT bit isn't set. Without this I hit "bad pud" when testing
> dax 1G mappings and zapping a PROT_NONE PUD.
Yeah, looks like pmd_leaf() got fixed up because of its involvement in
THP, but PUDs got missed. This patch also realigns pmd_leaf() and
pud_leaf() behavior, which is important.
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
* Re: [PATCH 6/7] mm/x86: Add missing pud helpers
2024-06-21 14:25 ` [PATCH 6/7] mm/x86: Add missing pud helpers Peter Xu
@ 2024-06-21 14:51 ` Dave Hansen
2024-06-21 15:45 ` Peter Xu
2024-06-21 19:36 ` Edgecombe, Rick P
0 siblings, 2 replies; 17+ messages in thread
From: Dave Hansen @ 2024-06-21 14:51 UTC (permalink / raw)
To: Peter Xu, linux-kernel, linux-mm
Cc: x86, Borislav Petkov, Dave Jiang, Kirill A . Shutemov,
Ingo Molnar, Oscar Salvador, Matthew Wilcox, Vlastimil Babka,
Dan Williams, Andrew Morton, Hugh Dickins, Michael Ellerman,
Dave Hansen, Thomas Gleixner, linuxppc-dev, Christophe Leroy,
Rik van Riel, Mel Gorman, Aneesh Kumar K . V, Nicholas Piggin,
Huang Ying, Edgecombe, Rick P
On 6/21/24 07:25, Peter Xu wrote:
> These new helpers will be needed for pud entry updates soon. Namely:
>
> - pudp_invalidate()
> - pud_modify()
I think it's also definitely worth noting where you got this code from.
Presumably you copied, pasted and modified the PMD code. That's fine,
but it should be called out.
...
> +static inline pud_t pud_modify(pud_t pud, pgprot_t newprot)
> +{
> + pudval_t val = pud_val(pud), oldval = val;
> +
> + /*
> + * NOTE: no need to consider shadow stack complexities because it
> + * doesn't support 1G mappings.
> + */
> + val &= _HPAGE_CHG_MASK;
> + val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
> + val = flip_protnone_guard(oldval, val, PHYSICAL_PUD_PAGE_MASK);
> +
> + return __pud(val);
> +}
First of all, the comment to explain what you didn't do here is as many
lines as the code to _actually_ implement it.
Second, I believe this might have missed the purpose of the "shadow
stack complexities". The pmd/pte code is there not to support modifying
shadow stack mappings, it's there to avoid inadvertent shadow stack
mapping creation.
That "NOTE:" is ambiguous as to whether the shadow stacks aren't
supported on 1G mappings in Linux or the hardware (I just checked the
hardware docs and don't see anything making 1G mappings special, btw).
But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it
to make it Dirty=1,Write=0? What prevents that from being
misinterpreted by the hardware as being a valid 1G shadow stack mapping?
> /*
> * mprotect needs to preserve PAT and encryption bits when updating
> * vm_page_prot
> @@ -1377,10 +1398,25 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> }
> #endif
>
> +static inline pud_t pudp_establish(struct vm_area_struct *vma,
> + unsigned long address, pud_t *pudp, pud_t pud)
> +{
> + if (IS_ENABLED(CONFIG_SMP)) {
> + return xchg(pudp, pud);
> + } else {
> + pud_t old = *pudp;
> + WRITE_ONCE(*pudp, pud);
> + return old;
> + }
> +}
Why is there no:
page_table_check_pud_set(vma->vm_mm, pudp, pud);
? Sure, it doesn't _do_ anything today. But the PMD code has it today.
So leaving it out creates a divergence that honestly can only serve to
bite us in the future and will create a head-scratching delta for anyone
that is comparing PUD and PMD implementations in the future.
* Re: [PATCH 4/7] mm/powerpc: Add missing pud helpers
2024-06-21 14:25 ` [PATCH 4/7] mm/powerpc: Add missing pud helpers Peter Xu
@ 2024-06-21 15:01 ` Peter Xu
0 siblings, 0 replies; 17+ messages in thread
From: Peter Xu @ 2024-06-21 15:01 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: x86, Borislav Petkov, Dave Jiang, Kirill A . Shutemov,
Ingo Molnar, Oscar Salvador, Matthew Wilcox, Vlastimil Babka,
Dan Williams, Andrew Morton, Hugh Dickins, Michael Ellerman,
Dave Hansen, Thomas Gleixner, linuxppc-dev, Christophe Leroy,
Rik van Riel, Mel Gorman, Aneesh Kumar K . V, Nicholas Piggin,
Huang Ying
On Fri, Jun 21, 2024 at 10:25:01AM -0400, Peter Xu wrote:
> +pmd_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
> + pud_t *pudp)
> +{
> + unsigned long old_pud;
> +
> + VM_WARN_ON_ONCE(!pmd_present(*pmdp));
> + old_pmd = pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, _PAGE_INVALID);
> + flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> + return __pmd(old_pmd);
> +}
I'll need to amend at least this, and my test harness won't catch it even
though it's about as silly a mistake as it could be.. My apologies for the
noise. Below is what I plan to squash into this patch when I repost v2.
===8<===
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index c6ae969020e0..ea2c83634434 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -176,15 +176,15 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
return __pmd(old_pmd);
}
-pmd_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
pud_t *pudp)
{
unsigned long old_pud;
- VM_WARN_ON_ONCE(!pmd_present(*pmdp));
- old_pmd = pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, _PAGE_INVALID);
- flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
- return __pmd(old_pmd);
+ VM_WARN_ON_ONCE(!pud_present(*pudp));
+ old_pud = pud_hugepage_update(vma->vm_mm, address, pudp, _PAGE_PRESENT, _PAGE_INVALID);
+ flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
+ return __pud(old_pud);
}
pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma,
===8<===
--
Peter Xu
* Re: [PATCH 6/7] mm/x86: Add missing pud helpers
2024-06-21 14:51 ` Dave Hansen
@ 2024-06-21 15:45 ` Peter Xu
2024-06-21 16:11 ` Dave Hansen
2024-06-21 19:36 ` Edgecombe, Rick P
1 sibling, 1 reply; 17+ messages in thread
From: Peter Xu @ 2024-06-21 15:45 UTC (permalink / raw)
To: Dave Hansen
Cc: linux-kernel, linux-mm, x86, Borislav Petkov, Dave Jiang,
Kirill A . Shutemov, Ingo Molnar, Oscar Salvador, Matthew Wilcox,
Vlastimil Babka, Dan Williams, Andrew Morton, Hugh Dickins,
Michael Ellerman, Dave Hansen, Thomas Gleixner, linuxppc-dev,
Christophe Leroy, Rik van Riel, Mel Gorman, Aneesh Kumar K . V,
Nicholas Piggin, Huang Ying, Edgecombe, Rick P
On Fri, Jun 21, 2024 at 07:51:26AM -0700, Dave Hansen wrote:
> On 6/21/24 07:25, Peter Xu wrote:
> > These new helpers will be needed for pud entry updates soon. Namely:
> >
> > - pudp_invalidate()
> > - pud_modify()
>
> I think it's also definitely worth noting where you got this code from.
> Presumably you copied, pasted and modified the PMD code. That's fine,
> but it should be called out.
Yes, that's from the PMD ones. Sure, I will add that.
>
> ...
> > +static inline pud_t pud_modify(pud_t pud, pgprot_t newprot)
> > +{
> > + pudval_t val = pud_val(pud), oldval = val;
> > +
> > + /*
> > + * NOTE: no need to consider shadow stack complexities because it
> > + * doesn't support 1G mappings.
> > + */
> > + val &= _HPAGE_CHG_MASK;
> > + val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
> > + val = flip_protnone_guard(oldval, val, PHYSICAL_PUD_PAGE_MASK);
> > +
> > + return __pud(val);
> > +}
>
> First of all, the comment to explain what you didn't do here is as many
> lines as the code to _actually_ implement it.
>
> Second, I believe this might have missed the purpose of the "shadow
> stack complexities". The pmd/pte code is there not to support modifying
> shadow stack mappings, it's there to avoid inadvertent shadow stack
> mapping creation.
>
> That "NOTE:" is ambiguous as to whether the shadow stacks aren't
> supported on 1G mappings in Linux or the hardware (I just checked the
> hardware docs and don't see anything making 1G mappings special, btw).
Right, this could indeed be ambiguous; I was trying to refer to the fact
that shadow stack is only supported on anonymous memory, and anon doesn't
support 1G. But it looks like I'm more wrong than that..
>
> But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it
> to make it Dirty=1,Write=0? What prevents that from being
> misinterpreted by the hardware as being a valid 1G shadow stack mapping?
Thanks for pointing that out. I was thinking it would only take effect on
VM_SHADOW_STACK vmas, so is that not the case?
I was indeed trying to find more information on shadow stack at the time,
but I couldn't find much on the pgtable implications, e.g. whether "D=1 +
W=0" is globally recognized as shadow stack. At least the March 2024 SDM,
Vol 3 Chapter 4 on pgtable entries, still doesn't explain these details,
or maybe I missed it. Please let me know if there's a suggestion on what
I can read before I post a v2.
So if it takes effect globally, then indeed we'll need to handle it in
PUDs too.
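If so, one way to mirror what pmd_modify() does for v2 could be something
like the following sketch (pud_mksaveddirty() and pud_clear_saveddirty()
don't exist yet; they're assumed to mirror the existing pmd helpers):

static inline pud_t pud_modify(pud_t pud, pgprot_t newprot)
{
	pudval_t val = pud_val(pud), oldval = val;
	pud_t pud_result;

	val &= _HPAGE_CHG_MASK;
	val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
	val = flip_protnone_guard(oldval, val, PHYSICAL_PUD_PAGE_MASK);

	pud_result = __pud(val);

	/*
	 * Avoid accidentally creating a shadow-stack PUD (Dirty=1,Write=0),
	 * following the same pattern as pte_modify()/pmd_modify().  The
	 * pud_mksaveddirty()/pud_clear_saveddirty() helpers are hypothetical
	 * here and would need to be added alongside the pmd variants.
	 */
	if (pud_dirty(pud_result))
		pud_result = pud_mksaveddirty(pud_result);
	else
		pud_result = pud_clear_saveddirty(pud_result);

	return pud_result;
}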
As an aside, not sure whether it's off-topic to ask here, but... why
didn't shadow stack reuse an old software bit to explicitly mark "this is
a shadow stack pte" when the spec was designed? Bit 58 got consumed
anyway for caching the dirty bit. IIUC we could have avoided all of this
"move back and forth" of the dirty bit if so.
>
> > /*
> > * mprotect needs to preserve PAT and encryption bits when updating
> > * vm_page_prot
> > @@ -1377,10 +1398,25 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> > }
> > #endif
> >
> > +static inline pud_t pudp_establish(struct vm_area_struct *vma,
> > + unsigned long address, pud_t *pudp, pud_t pud)
> > +{
> > + if (IS_ENABLED(CONFIG_SMP)) {
> > + return xchg(pudp, pud);
> > + } else {
> > + pud_t old = *pudp;
> > + WRITE_ONCE(*pudp, pud);
> > + return old;
> > + }
> > +}
>
> Why is there no:
>
> page_table_check_pud_set(vma->vm_mm, pudp, pud);
>
> ? Sure, it doesn't _do_ anything today. But the PMD code has it today.
> So leaving it out creates a divergence that honestly can only serve to
> bite us in the future and will create a head-scratching delta for anyone
> that is comparing PUD and PMD implementations in the future.
Good question; I really don't remember why I didn't have that, since I
should have referenced the pmd helper. I'll add it and see whether I hit
anything otherwise.
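Just to capture the direction, v2 would presumably end up roughly like
this (a sketch only; the final form may differ):

static inline pud_t pudp_establish(struct vm_area_struct *vma,
		unsigned long address, pud_t *pudp, pud_t pud)
{
	/* Match pmdp_establish(): run the page table checks on set */
	page_table_check_pud_set(vma->vm_mm, pudp, pud);
	if (IS_ENABLED(CONFIG_SMP)) {
		return xchg(pudp, pud);
	} else {
		pud_t old = *pudp;
		WRITE_ONCE(*pudp, pud);
		return old;
	}
}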
Thanks for the review.
--
Peter Xu
* Re: [PATCH 6/7] mm/x86: Add missing pud helpers
2024-06-21 15:45 ` Peter Xu
@ 2024-06-21 16:11 ` Dave Hansen
2024-06-23 15:42 ` Peter Xu
0 siblings, 1 reply; 17+ messages in thread
From: Dave Hansen @ 2024-06-21 16:11 UTC (permalink / raw)
To: Peter Xu
Cc: linux-kernel, linux-mm, x86, Borislav Petkov, Dave Jiang,
Kirill A . Shutemov, Ingo Molnar, Oscar Salvador, Matthew Wilcox,
Vlastimil Babka, Dan Williams, Andrew Morton, Hugh Dickins,
Michael Ellerman, Dave Hansen, Thomas Gleixner, linuxppc-dev,
Christophe Leroy, Rik van Riel, Mel Gorman, Aneesh Kumar K . V,
Nicholas Piggin, Huang Ying, Edgecombe, Rick P
On 6/21/24 08:45, Peter Xu wrote:
> On Fri, Jun 21, 2024 at 07:51:26AM -0700, Dave Hansen wrote:
...
>> But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it
>> to make it Dirty=1,Write=0? What prevents that from being
>> misinterpreted by the hardware as being a valid 1G shadow stack mapping?
>
> Thanks for pointing that out. I think I was thinking it will only take
> effect on VM_SHADOW_STACK first, so it's not?
>
> I was indeed trying to find more information on shadow stack at that time
> but I can't find as much on the pgtable implications, on e.g. whether "D=1
> + W=0" globally will be recognized as shadow stack. At least on SDM March
> 2024 version Vol3 Chap4 pgtable entries still don't explain these details,
> or maybe I missed it. Please let me know if there's suggestion on what I
> can read before I post a v2.
It's in the "Determination of Access Rights" section.
A linear address is a shadow-stack address if the following are
true of the translation of the linear address: (1) the R/W flag
(bit 1) is 0 and the dirty flag (bit 6) is 1 in the paging-
structure entry that maps the page containing the linear
address; and (2) the R/W flag is 1 in every other paging-
structure entry controlling the translation of the linear
address.
> So if it's globally taking effect, indeed we'll need to handle them in PUDs
> too.
>
> Asides, not sure whether it's off-topic to ask here, but... why shadow
> stack doesn't reuse an old soft-bit to explicitly mark "this is shadow
> stack ptes" when designing the spec? Now it consumed bit 58 anyway for
> caching dirty. IIUC we can avoid all these "move back and forth" issue on
> dirty bit if so.
The design accommodates "other" OSes that are using all the software
bits for other things.
For Linux, you're right, we just ended up consuming a software bit
_anyway_ so we got all the complexity of the goofy permissions *AND*
lost a bit in the end. Lose, lose.
>>> /*
>>> * mprotect needs to preserve PAT and encryption bits when updating
>>> * vm_page_prot
>>> @@ -1377,10 +1398,25 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>>> }
>>> #endif
>>>
>>> +static inline pud_t pudp_establish(struct vm_area_struct *vma,
>>> + unsigned long address, pud_t *pudp, pud_t pud)
>>> +{
>>> + if (IS_ENABLED(CONFIG_SMP)) {
>>> + return xchg(pudp, pud);
>>> + } else {
>>> + pud_t old = *pudp;
>>> + WRITE_ONCE(*pudp, pud);
>>> + return old;
>>> + }
>>> +}
>>
>> Why is there no:
>>
>> page_table_check_pud_set(vma->vm_mm, pudp, pud);
>>
>> ? Sure, it doesn't _do_ anything today. But the PMD code has it today.
>> So leaving it out creates a divergence that honestly can only serve to
>> bite us in the future and will create a head-scratching delta for anyone
>> that is comparing PUD and PMD implementations in the future.
>
> Good question, I really don't remember why I didn't have that, since I
> should have referenced the pmd helper. I'll add them and see whether I'll
> hit something otherwise.
>
> Thanks for the review.
One big thing I did in this review was make sure that the PMD and PUD
helpers were doing the same thing. Would you mind circling back and
double-checking the same before you repost this?
* Re: [PATCH 6/7] mm/x86: Add missing pud helpers
2024-06-21 14:51 ` Dave Hansen
2024-06-21 15:45 ` Peter Xu
@ 2024-06-21 19:36 ` Edgecombe, Rick P
2024-06-21 20:27 ` Peter Xu
1 sibling, 1 reply; 17+ messages in thread
From: Edgecombe, Rick P @ 2024-06-21 19:36 UTC (permalink / raw)
To: linux-mm, peterx, Hansen, Dave, linux-kernel
Cc: hughd, willy, dave.hansen, vbabka, mpe, akpm, mingo, kirill,
christophe.leroy, Jiang, Dave, aneesh.kumar, tglx, riel, npiggin,
osalvador, linuxppc-dev, bp, mgorman, Huang, Ying, x86, Williams,
Dan J
On Fri, 2024-06-21 at 07:51 -0700, Dave Hansen wrote:
>
> But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it
> to make it Dirty=1,Write=0? What prevents that from being
> misinterpreted by the hardware as being a valid 1G shadow stack mapping?
Hmm, it looks like we could use an arch_check_zapped_pud() that does a
warning like arch_check_zapped_pte/pmd() do. Not that we had a use for
one before this.
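Modeled on the existing arch_check_zapped_pmd(), such a helper could look
roughly like this (a sketch; pud_shstk() is hypothetical here and would
need to be added next to pmd_shstk()):

void arch_check_zapped_pud(struct vm_area_struct *vma, pud_t pud)
{
	/* See note in arch_check_zapped_pte() */
	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
			pud_shstk(pud));
}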
* Re: [PATCH 6/7] mm/x86: Add missing pud helpers
2024-06-21 19:36 ` Edgecombe, Rick P
@ 2024-06-21 20:27 ` Peter Xu
2024-06-21 22:06 ` Edgecombe, Rick P
0 siblings, 1 reply; 17+ messages in thread
From: Peter Xu @ 2024-06-21 20:27 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: linux-mm, Hansen, Dave, linux-kernel, hughd, willy, dave.hansen,
vbabka, mpe, akpm, mingo, kirill, christophe.leroy, Jiang, Dave,
aneesh.kumar, tglx, riel, npiggin, osalvador, linuxppc-dev, bp,
mgorman, Huang, Ying, x86, Williams, Dan J
On Fri, Jun 21, 2024 at 07:36:30PM +0000, Edgecombe, Rick P wrote:
> On Fri, 2024-06-21 at 07:51 -0700, Dave Hansen wrote:
> >
> > But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it
> > to make it Dirty=1,Write=0? What prevents that from being
> > misinterpreted by the hardware as being a valid 1G shadow stack mapping?
>
> Hmm, it looks like we could use an arch_check_zapped_pud() that does a warning
> like arch_check_zapped_pte/pmd() too. Not that we had no use for one before
> this.
I can definitely look into that, but this check only happens when zapping,
and IIUC that means there can still be outliers floating around. I wonder
whether it should rely on page_table_check_pxx_set() in that regard.
Thanks,
--
Peter Xu
* Re: [PATCH 6/7] mm/x86: Add missing pud helpers
2024-06-21 20:27 ` Peter Xu
@ 2024-06-21 22:06 ` Edgecombe, Rick P
0 siblings, 0 replies; 17+ messages in thread
From: Edgecombe, Rick P @ 2024-06-21 22:06 UTC (permalink / raw)
To: peterx
Cc: hughd, willy, Hansen, Dave, dave.hansen, vbabka, riel, akpm,
linux-kernel, mingo, kirill, christophe.leroy, aneesh.kumar,
Jiang, Dave, tglx, mpe, npiggin, linux-mm, osalvador,
linuxppc-dev, bp, mgorman, Huang, Ying, x86, Williams, Dan J
On Fri, 2024-06-21 at 16:27 -0400, Peter Xu wrote:
> On Fri, Jun 21, 2024 at 07:36:30PM +0000, Edgecombe, Rick P wrote:
> > On Fri, 2024-06-21 at 07:51 -0700, Dave Hansen wrote:
> > >
> > > But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it
> > > to make it Dirty=1,Write=0? What prevents that from being
> > > misinterpreted by the hardware as being a valid 1G shadow stack mapping?
> >
> > Hmm, it looks like we could use an arch_check_zapped_pud() that does a
> > warning
> > like arch_check_zapped_pte/pmd() too. Not that we had no use for one before
> > this.
>
> I can definitely look into that, but this check only happens when zapping,
> and IIUC it means there can still be outliers floating around. I wonder
> whether it should rely on page_table_check_pxx_set() from that regard.
Yes, it's not perfect. Hmm, it looks like the page_table_check would catch a lot
more cases, but it would have to be changed to plumb the vma in.
* Re: [PATCH 6/7] mm/x86: Add missing pud helpers
2024-06-21 16:11 ` Dave Hansen
@ 2024-06-23 15:42 ` Peter Xu
0 siblings, 0 replies; 17+ messages in thread
From: Peter Xu @ 2024-06-23 15:42 UTC (permalink / raw)
To: Dave Hansen
Cc: linux-kernel, linux-mm, x86, Borislav Petkov, Dave Jiang,
Kirill A . Shutemov, Ingo Molnar, Oscar Salvador, Matthew Wilcox,
Vlastimil Babka, Dan Williams, Andrew Morton, Hugh Dickins,
Michael Ellerman, Dave Hansen, Thomas Gleixner, linuxppc-dev,
Christophe Leroy, Rik van Riel, Mel Gorman, Aneesh Kumar K . V,
Nicholas Piggin, Huang Ying, Edgecombe, Rick P
On Fri, Jun 21, 2024 at 09:11:56AM -0700, Dave Hansen wrote:
> It's in the "Determination of Access Rights" section.
>
> A linear address is a shadow-stack address if the following are
> true of the translation of the linear address: (1) the R/W flag
> (bit 1) is 0 and the dirty flag (bit 6) is 1 in the paging-
> structure entry that maps the page containing the linear
> address; and (2) the R/W flag is 1 in every other paging-
> structure entry controlling the translation of the linear
> address.
Thanks. It would be nice if this could be referenced in the pgtable
definitions too, in some way.
[...]
> One big thing I did in this review was make sure that the PMD and PUD
> helpers were doing the same thing. Would you mind circling back and
> double-checking the same before you repost this?
Sure, I'll make sure to at least add a comment explaining why if something
doesn't match. I actually did that already, e.g. in the _modify path for
shadow stack, but I missed this spot.
The page table check thing is a rare one that I overlooked; it could be
related to a bug I used to hit and fixed around this area, so I forgot to
add it back, but I really can't remember. I'll keep an extra eye on that.
Thanks,
--
Peter Xu