From: Ryan Roberts <ryan.roberts@arm.com>
To: Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>, Ard Biesheuvel <ardb@kernel.org>,
Marc Zyngier <maz@kernel.org>,
Oliver Upton <oliver.upton@linux.dev>,
James Morse <james.morse@arm.com>,
Suzuki K Poulose <suzuki.poulose@arm.com>,
Zenghui Yu <yuzenghui@huawei.com>,
Andrey Ryabinin <ryabinin.a.a@gmail.com>,
Alexander Potapenko <glider@google.com>,
Andrey Konovalov <andreyknvl@gmail.com>,
Dmitry Vyukov <dvyukov@google.com>,
Vincenzo Frascino <vincenzo.frascino@arm.com>,
Andrew Morton <akpm@linux-foundation.org>,
Anshuman Khandual <anshuman.khandual@arm.com>,
Matthew Wilcox <willy@infradead.org>, Yu Zhao <yuzhao@google.com>,
Mark Rutland <mark.rutland@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH v1 13/14] mm: Batch-copy PTE ranges during fork()
Date: Thu, 22 Jun 2023 15:42:08 +0100 [thread overview]
Message-ID: <20230622144210.2623299-14-ryan.roberts@arm.com> (raw)
In-Reply-To: <20230622144210.2623299-1-ryan.roberts@arm.com>
Convert copy_pte_range() to copy a set of ptes that map a physically
contiguous block of memory in a batch. This will likely improve
performance by a tiny amount due to batching the folio reference count
management and calling set_ptes() rather than making individual calls to
set_pte_at().
However, the primary motivation for this change is to reduce the number
of tlb maintenance operations that the arm64 backend has to perform
during fork, now that it transparently supports the "contiguous bit" in
its ptes. By write-protecting the parent using the new
ptep_set_wrprotects() (note the 's' at the end) function, the backend
can avoid having to unfold contig ranges of PTEs, which is expensive,
when all ptes in the range are being write-protected. Similarly, by
using set_ptes() rather than set_pte_at() to set up ptes in the child,
the backend does not need to fold a contiguous range once they are all
populated - they can be initially populated as a contiguous range in the
first place.
This change addresses the core-mm refactoring only, and introduces
ptep_set_wrprotects() with a default implementation that calls
ptep_set_wrprotect() for each pte in the range. A separate change will
implement ptep_set_wrprotects() in the arm64 backend to realize the
performance improvement.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
include/linux/pgtable.h | 13 ++++
mm/memory.c | 149 +++++++++++++++++++++++++++++++---------
2 files changed, 128 insertions(+), 34 deletions(-)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a661a17173fa..6a7b28d520de 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -547,6 +547,19 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
}
#endif
+#ifndef ptep_set_wrprotects
+struct mm_struct;
+static inline void ptep_set_wrprotects(struct mm_struct *mm,
+ unsigned long address, pte_t *ptep,
+ unsigned int nr)
+{
+ unsigned int i;
+
+ for (i = 0; i < nr; i++, address += PAGE_SIZE, ptep++)
+ ptep_set_wrprotect(mm, address, ptep);
+}
+#endif
+
/*
* On some architectures hardware does not set page access bit when accessing
* memory page, it is responsibility of software setting this bit. It brings
diff --git a/mm/memory.c b/mm/memory.c
index fb30f7523550..9a041cc31c74 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -911,57 +911,126 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
/* Uffd-wp needs to be delivered to dest pte as well */
pte = pte_mkuffd_wp(pte);
set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
- return 0;
+ return 1;
+}
+
+static inline unsigned long page_addr(struct page *page,
+ struct page *anchor, unsigned long anchor_addr)
+{
+ unsigned long offset;
+ unsigned long addr;
+
+ offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
+ addr = anchor_addr + offset;
+
+ if (anchor > page) {
+ if (addr > anchor_addr)
+ return 0;
+ } else {
+ if (addr < anchor_addr)
+ return ULONG_MAX;
+ }
+
+ return addr;
+}
+
+static int calc_anon_folio_map_pgcount(struct folio *folio,
+ struct page *page, pte_t *pte,
+ unsigned long addr, unsigned long end)
+{
+ pte_t ptent;
+ int floops;
+ int i;
+ unsigned long pfn;
+
+ end = min(page_addr(&folio->page + folio_nr_pages(folio), page, addr),
+ end);
+ floops = (end - addr) >> PAGE_SHIFT;
+ pfn = page_to_pfn(page);
+ pfn++;
+ pte++;
+
+ for (i = 1; i < floops; i++) {
+ ptent = ptep_get(pte);
+
+ if (!pte_present(ptent) ||
+ pte_pfn(ptent) != pfn) {
+ return i;
+ }
+
+ pfn++;
+ pte++;
+ }
+
+ return floops;
}
/*
- * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
- * is required to copy this pte.
+ * Copy set of contiguous ptes. Returns number of ptes copied if succeeded
+ * (always gte 1), or -EAGAIN if one preallocated page is required to copy the
+ * first pte.
*/
static inline int
-copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
- pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
- struct folio **prealloc)
+copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
+ pte_t *dst_pte, pte_t *src_pte,
+ unsigned long addr, unsigned long end,
+ int *rss, struct folio **prealloc)
{
struct mm_struct *src_mm = src_vma->vm_mm;
unsigned long vm_flags = src_vma->vm_flags;
pte_t pte = ptep_get(src_pte);
struct page *page;
struct folio *folio;
+ bool anon;
+ int nr;
+ int i;
page = vm_normal_page(src_vma, addr, pte);
- if (page)
+ if (page) {
folio = page_folio(page);
- if (page && folio_test_anon(folio)) {
- /*
- * If this page may have been pinned by the parent process,
- * copy the page immediately for the child so that we'll always
- * guarantee the pinned page won't be randomly replaced in the
- * future.
- */
- folio_get(folio);
- if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
- /* Page may be pinned, we have to copy. */
- folio_put(folio);
- return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
- addr, rss, prealloc, page);
+ anon = folio_test_anon(folio);
+ nr = calc_anon_folio_map_pgcount(folio, page, src_pte, addr, end);
+
+ for (i = 0; i < nr; i++, page++) {
+ if (anon) {
+ /*
+ * If this page may have been pinned by the
+ * parent process, copy the page immediately for
+ * the child so that we'll always guarantee the
+ * pinned page won't be randomly replaced in the
+ * future.
+ */
+ if (unlikely(page_try_dup_anon_rmap(
+ page, false, src_vma))) {
+ if (i != 0)
+ break;
+ /* Page may be pinned, we have to copy. */
+ return copy_present_page(dst_vma, src_vma,
+ dst_pte, src_pte,
+ addr, rss,
+ prealloc, page);
+ }
+ rss[MM_ANONPAGES]++;
+ VM_BUG_ON(PageAnonExclusive(page));
+ } else {
+ page_dup_file_rmap(page, false);
+ rss[mm_counter_file(page)]++;
+ }
}
- rss[MM_ANONPAGES]++;
- } else if (page) {
- folio_get(folio);
- page_dup_file_rmap(page, false);
- rss[mm_counter_file(page)]++;
- }
+
+ nr = i;
+ folio_ref_add(folio, nr);
+ } else
+ nr = 1;
/*
* If it's a COW mapping, write protect it both
* in the parent and the child
*/
if (is_cow_mapping(vm_flags) && pte_write(pte)) {
- ptep_set_wrprotect(src_mm, addr, src_pte);
+ ptep_set_wrprotects(src_mm, addr, src_pte, nr);
pte = pte_wrprotect(pte);
}
- VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page));
/*
* If it's a shared mapping, mark it clean in
@@ -974,8 +1043,8 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
if (!userfaultfd_wp(dst_vma))
pte = pte_clear_uffd_wp(pte);
- set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
- return 0;
+ set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
+ return nr;
}
static inline struct folio *page_copy_prealloc(struct mm_struct *src_mm,
@@ -1065,15 +1134,28 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
*/
WARN_ON_ONCE(ret != -ENOENT);
}
- /* copy_present_pte() will clear `*prealloc' if consumed */
- ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
- addr, rss, &prealloc);
+ /* copy_present_ptes() will clear `*prealloc' if consumed */
+ ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
+ addr, end, rss, &prealloc);
+
/*
* If we need a pre-allocated page for this pte, drop the
* locks, allocate, and try again.
*/
if (unlikely(ret == -EAGAIN))
break;
+
+ /*
+ * Positive return value is the number of ptes copied.
+ */
+ VM_WARN_ON_ONCE(ret < 1);
+ progress += 8 * ret;
+ ret--;
+ dst_pte += ret;
+ src_pte += ret;
+ addr += ret << PAGE_SHIFT;
+ ret = 0;
+
if (unlikely(prealloc)) {
/*
* pre-alloc page cannot be reused by next time so as
@@ -1084,7 +1166,6 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
folio_put(prealloc);
prealloc = NULL;
}
- progress += 8;
} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
--
2.25.1
next prev parent reply other threads:[~2023-06-22 14:43 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-06-22 14:41 [PATCH v1 00/14] Transparent Contiguous PTEs for User Mappings Ryan Roberts
2023-06-22 14:41 ` [PATCH v1 01/14] arm64/mm: set_pte(): New layer to manage contig bit Ryan Roberts
2023-06-22 14:41 ` [PATCH v1 02/14] arm64/mm: set_ptes()/set_pte_at(): " Ryan Roberts
2023-06-22 14:41 ` [PATCH v1 03/14] arm64/mm: pte_clear(): " Ryan Roberts
2023-06-22 14:41 ` [PATCH v1 04/14] arm64/mm: ptep_get_and_clear(): " Ryan Roberts
2023-06-22 14:42 ` [PATCH v1 05/14] arm64/mm: ptep_test_and_clear_young(): " Ryan Roberts
2023-06-22 14:42 ` [PATCH v1 06/14] arm64/mm: ptep_clear_flush_young(): " Ryan Roberts
2023-06-22 14:42 ` [PATCH v1 07/14] arm64/mm: ptep_set_wrprotect(): " Ryan Roberts
2023-06-22 14:42 ` [PATCH v1 08/14] arm64/mm: ptep_set_access_flags(): " Ryan Roberts
2023-06-22 14:42 ` [PATCH v1 09/14] arm64/mm: ptep_get(): " Ryan Roberts
2023-06-22 14:42 ` [PATCH v1 10/14] arm64/mm: Split __flush_tlb_range() to elide trailing DSB Ryan Roberts
2023-06-22 14:42 ` [PATCH v1 11/14] arm64/mm: Wire up PTE_CONT for user mappings Ryan Roberts
2023-06-30 1:54 ` John Hubbard
2023-07-03 9:48 ` Ryan Roberts
2023-07-03 15:17 ` Catalin Marinas
2023-07-04 11:09 ` Ryan Roberts
2023-07-05 13:13 ` Ryan Roberts
2023-07-16 15:09 ` Catalin Marinas
2023-06-22 14:42 ` [PATCH v1 12/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown Ryan Roberts
2023-06-22 14:42 ` Ryan Roberts [this message]
2023-06-22 14:42 ` [PATCH v1 14/14] arm64/mm: Implement ptep_set_wrprotects() to optimize fork() Ryan Roberts
2023-07-10 12:05 ` [PATCH v1 00/14] Transparent Contiguous PTEs for User Mappings Barry Song
2023-07-10 13:28 ` Ryan Roberts
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230622144210.2623299-14-ryan.roberts@arm.com \
--to=ryan.roberts@arm.com \
--cc=akpm@linux-foundation.org \
--cc=andreyknvl@gmail.com \
--cc=anshuman.khandual@arm.com \
--cc=ardb@kernel.org \
--cc=catalin.marinas@arm.com \
--cc=dvyukov@google.com \
--cc=glider@google.com \
--cc=james.morse@arm.com \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mark.rutland@arm.com \
--cc=maz@kernel.org \
--cc=oliver.upton@linux.dev \
--cc=ryabinin.a.a@gmail.com \
--cc=suzuki.poulose@arm.com \
--cc=vincenzo.frascino@arm.com \
--cc=will@kernel.org \
--cc=willy@infradead.org \
--cc=yuzenghui@huawei.com \
--cc=yuzhao@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox