From: Peter Xu <peterx@redhat.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: x86@kernel.org, Borislav Petkov <bp@alien8.de>,
Dave Jiang <dave.jiang@intel.com>,
"Kirill A . Shutemov" <kirill@shutemov.name>,
Ingo Molnar <mingo@redhat.com>,
Oscar Salvador <osalvador@suse.de>,
peterx@redhat.com, Matthew Wilcox <willy@infradead.org>,
Vlastimil Babka <vbabka@suse.cz>,
Dan Williams <dan.j.williams@intel.com>,
Andrew Morton <akpm@linux-foundation.org>,
Hugh Dickins <hughd@google.com>,
Michael Ellerman <mpe@ellerman.id.au>,
Dave Hansen <dave.hansen@linux.intel.com>,
Thomas Gleixner <tglx@linutronix.de>,
linuxppc-dev@lists.ozlabs.org,
Christophe Leroy <christophe.leroy@csgroup.eu>,
Rik van Riel <riel@surriel.com>,
Mel Gorman <mgorman@techsingularity.net>,
"Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>,
Nicholas Piggin <npiggin@gmail.com>,
Huang Ying <ying.huang@intel.com>,
kvm@vger.kernel.org, Sean Christopherson <seanjc@google.com>,
Paolo Bonzini <pbonzini@redhat.com>,
David Rientjes <rientjes@google.com>
Subject: [PATCH 3/7] mm/mprotect: Push mmu notifier to PUDs
Date: Fri, 21 Jun 2024 10:25:00 -0400
Message-ID: <20240621142504.1940209-4-peterx@redhat.com>
In-Reply-To: <20240621142504.1940209-1-peterx@redhat.com>
mprotect() invokes mmu notifiers at the PMD level. It has been that way
since the 2014 commit a5338093bfb4 ("mm: move mmu notifier call from
change_protection to change_pmd_range").
At that time, the issue was that NUMA balancing could be applied to a huge
range of VM memory even when nothing was populated. The notification can be
skipped in that case if no valid pmd is detected, where "valid" means either
a THP or a PTE pgtable page.
Now, to pave the way for PUD handling, this isn't enough: we need to
generate mmu notifications properly on PUD entries too. mprotect() is
currently broken on PUDs (e.g., one can already trigger a kernel error with
dax 1G mappings); this is the start of fixing it.
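For reference, a rough userspace sketch of the kind of trigger meant here,
assuming a hypothetical devdax device at /dev/dax0.0 configured with 1G
alignment (error handling elided; the device path and alignment are
assumptions, not part of this patch):

	/* Hypothetical reproducer sketch, not part of this patch. */
	#include <fcntl.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 1UL << 30;		/* 1G, so dax can map a pud */
		int fd = open("/dev/dax0.0", O_RDWR);
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);

		p[0] = 1;			/* fault in the 1G pud mapping */
		mprotect(p, len, PROT_READ);	/* walks change_pud_range() */
		return 0;
	}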
To fix that, this patch pushes such notifications down to the PUD layer.
There is a risk of reintroducing the problem Rik resolved before, but I
don't think it will really happen, and I still chose this solution for a
few reasons:
1) Consider a large VM that definitely contains many GBs of memory: if
nothing is populated, it is highly likely that the PUDs are none as well.
In that case there will be no regression.
2) KVM has evolved a lot over the years to get rid of rmap walks, which
were probably the major cause of the previous soft-lockup. The TDP MMU has
already gotten rid of rmap as long as it is not nested (which should be the
major use case, IIUC), so the TDP MMU pgtable walker will simply see an
empty VM pgtable (e.g., EPT on x86); invalidating a fully empty region
should in most cases be pretty fast now compared to 2014.
3) KVM now has explicit code paths that give way to mmu notifiers exactly
like this one, e.g., commit d02c357e5bfa ("KVM: x86/mmu: Retry fault before
acquiring mmu_lock if mapping is changing"). That also avoids the kind of
contention that may contribute to a soft-lockup.
4) Sticking with the PMD layer simply doesn't work when PUDs are present...
We need one way or another to fix PUD mappings in mprotect().
Pushing the notifications to the PUD layer should be the safest approach
for now; e.g., there is still no sign of huge P4Ds coming on any known
arch.
Cc: kvm@vger.kernel.org
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/mprotect.c | 26 ++++++++++++--------------
1 file changed, 12 insertions(+), 14 deletions(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 21172272695e..fb8bf3ff7cd9 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -363,9 +363,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 	pmd_t *pmd;
 	unsigned long next;
 	long pages = 0;
-	struct mmu_notifier_range range;
-
-	range.start = 0;
 
 	pmd = pmd_offset(pud, addr);
 	do {
@@ -383,14 +380,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 		if (pmd_none(*pmd))
 			goto next;
 
-		/* invoke the mmu notifier if the pmd is populated */
-		if (!range.start) {
-			mmu_notifier_range_init(&range,
-						MMU_NOTIFY_PROTECTION_VMA, 0,
-						vma->vm_mm, addr, end);
-			mmu_notifier_invalidate_range_start(&range);
-		}
-
 		_pmd = pmdp_get_lockless(pmd);
 		if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
 			if ((next - addr != HPAGE_PMD_SIZE) ||
@@ -428,9 +417,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 		cond_resched();
 	} while (pmd++, addr = next, addr != end);
 
-	if (range.start)
-		mmu_notifier_invalidate_range_end(&range);
-
 	return pages;
 }
 
@@ -438,10 +424,13 @@ static inline long change_pud_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, p4d_t *p4d, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
+	struct mmu_notifier_range range;
 	pud_t *pud;
 	unsigned long next;
 	long pages = 0, ret;
 
+	range.start = 0;
+
 	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
@@ -450,10 +439,19 @@ static inline long change_pud_range(struct mmu_gather *tlb,
 			return ret;
 		if (pud_none_or_clear_bad(pud))
 			continue;
+		if (!range.start) {
+			mmu_notifier_range_init(&range,
+					MMU_NOTIFY_PROTECTION_VMA, 0,
+					vma->vm_mm, addr, end);
+			mmu_notifier_invalidate_range_start(&range);
+		}
+
 		pages += change_pmd_range(tlb, vma, pud, addr, next, newprot,
 					  cp_flags);
 	} while (pud++, addr = next, addr != end);
 
+	if (range.start)
+		mmu_notifier_invalidate_range_end(&range);
+
 	return pages;
 }
--
2.45.0