From: Ryan Roberts <ryan.roberts@arm.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Yu Zhao <yuzhao@google.com>,
	"Yin, Fengwei" <fengwei.yin@intel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>,
	linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org
Subject: [RFC v2 PATCH 10/17] mm: Reuse large folios for anonymous memory
Date: Fri, 14 Apr 2023 14:02:56 +0100	[thread overview]
Message-ID: <20230414130303.2345383-11-ryan.roberts@arm.com> (raw)
In-Reply-To: <20230414130303.2345383-1-ryan.roberts@arm.com>

When taking a write fault on an anonymous page, attempt to reuse as much
of the folio as possible if it is exclusive to the process.

This avoids a problem where an exclusive, PTE-mapped THP would
previously have all of its pages except the last CoWed, with only the
last page reused, causing the whole original folio to hang around as
well as all the CoWed pages. This problem is exacerbated now that we
are allocating variable-order folios for anonymous memory. The reason
for this behaviour is that a PTE-mapped THP has a reference for each
PTE, and the old code took that to mean the page was not exclusively
mapped and therefore could not be reused.
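
For illustration (not part of the patch; it just quotes the old and new
checks from the diff below): a fully mapped, exclusive order-4 anon
folio holds one reference per PTE, so the old single-page test

	if (folio_ref_count(folio) > 1 + folio_test_swapcache(folio))
		goto copy;

sees a refcount of at least 16 and always falls back to copying. The
range-based test used by this patch instead compares against the number
of PTEs that actually map the folio within the identified range:

	if (folio_ref_count(folio) > range.nr + swaprefs)
		goto copy;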

We now take care to find the region that is the intersection of the
underlying folio, the VMA and the PMD entry, and treat the presence of
exactly that number of references as indicating exclusivity. Note that
this region is not guaranteed to cover the whole folio due to munmap
and mremap.
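
Conceptually, the range is clamped as follows (a sketch only, using
max3()/min3() in place of the successive max()/min() calls made by
calc_anon_folio_range_reuse() below; folio_start/folio_end are stand-in
names for the folio's first and one-past-last virtual addresses):

	start = max3(folio_start, vma->vm_start,
		     ALIGN_DOWN(vmf->address, PMD_SIZE));
	end   = min3(folio_end, vma->vm_end,
		     ALIGN_DOWN(vmf->address, PMD_SIZE) + PMD_SIZE);

The range is then trimmed further so that all covered ptes are present,
physically contiguous and read-only.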

The aim is to reuse as much as possible in one go in order to:
- reduce memory consumption
- reduce number of CoWs
- reduce time spent in fault handler

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/memory.c | 169 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 160 insertions(+), 9 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 83835ff5a818..7e2af54fe2e0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3038,6 +3038,26 @@ struct anon_folio_range {
 	bool exclusive;
 };

+static inline unsigned long page_addr(struct page *page,
+				struct page *anchor, unsigned long anchor_addr)
+{
+	unsigned long offset;
+	unsigned long addr;
+
+	offset = (page_to_pfn(page) - page_to_pfn(anchor)) << PAGE_SHIFT;
+	addr = anchor_addr + offset;
+
+	if (anchor > page) {
+		if (addr > anchor_addr)
+			return 0;
+	} else {
+		if (addr < anchor_addr)
+			return ULONG_MAX;
+	}
+
+	return addr;
+}
+
 /*
  * Returns index of first pte that is not none, or nr if all are none.
  */
@@ -3122,6 +3142,122 @@ static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
 	return order;
 }

+static void calc_anon_folio_range_reuse(struct vm_fault *vmf,
+					struct folio *folio,
+					struct anon_folio_range *range_out)
+{
+	/*
+	 * The aim here is to determine the biggest range of pages that can be
+	 * reused for this CoW fault if the identified range is responsible for
+	 * all the references on the folio (i.e. it is exclusive) such that:
+	 * - All pages are contained within folio
+	 * - All pages are within VMA
+	 * - All pages are within the same pmd entry as vmf->address
+	 * - vmf->page is contained within the range
+	 * - All covered ptes must be present, physically contiguous and RO
+	 *
+	 * Note that the folio itself may not be naturally aligned in VA space
+	 * due to mremap. We take the largest range we can in order to increase
+	 * our chances of being the exclusive user of the folio, and therefore
+	 * of being able to reuse it. It's possible that the folio crosses a pmd
+	 * boundary, in which case we don't follow it into the next pmd because
+	 * this complicates the locking.
+	 *
+	 * Note that the caller may or may not choose to lock the pte. If
+	 * unlocked, the calculation should be considered an estimate that will
+	 * need to be validated under the lock.
+	 */
+
+	struct vm_area_struct *vma = vmf->vma;
+	struct page *page;
+	pte_t *ptep;
+	pte_t pte;
+	bool excl = true;
+	unsigned long start, end;
+	int bloops, floops;
+	int i;
+	unsigned long pfn;
+
+	/*
+	 * Iterate backwards, starting with the page immediately before the
+	 * anchor page. On exit from the loop, start is the inclusive start
+	 * virtual address of the range.
+	 */
+
+	start = page_addr(&folio->page, vmf->page, vmf->address);
+	start = max(start, vma->vm_start);
+	start = max(start, ALIGN_DOWN(vmf->address, PMD_SIZE));
+	bloops = (vmf->address - start) >> PAGE_SHIFT;
+
+	page = vmf->page - 1;
+	ptep = vmf->pte - 1;
+	pfn = page_to_pfn(vmf->page) - 1;
+
+	for (i = 0; i < bloops; i++) {
+		pte = *ptep;
+
+		if (!pte_present(pte) ||
+		    pte_write(pte) ||
+		    pte_protnone(pte) ||
+		    pte_pfn(pte) != pfn) {
+			start = vmf->address - (i << PAGE_SHIFT);
+			break;
+		}
+
+		if (excl && !PageAnonExclusive(page))
+			excl = false;
+
+		pfn--;
+		ptep--;
+		page--;
+	}
+
+	/*
+	 * Iterate forward, starting with the anchor page. On exit from the
+	 * loop, end is the exclusive end virtual address of the range.
+	 */
+
+	end = page_addr(&folio->page + folio_nr_pages(folio),
+			vmf->page, vmf->address);
+	end = min(end, vma->vm_end);
+	end = min(end, ALIGN_DOWN(vmf->address, PMD_SIZE) + PMD_SIZE);
+	floops = (end - vmf->address) >> PAGE_SHIFT;
+
+	page = vmf->page;
+	ptep = vmf->pte;
+	pfn = page_to_pfn(vmf->page);
+
+	for (i = 0; i < floops; i++) {
+		pte = *ptep;
+
+		if (!pte_present(pte) ||
+		    pte_write(pte) ||
+		    pte_protnone(pte) ||
+		    pte_pfn(pte) != pfn) {
+			end = vmf->address + (i << PAGE_SHIFT);
+			break;
+		}
+
+		if (excl && !PageAnonExclusive(page))
+			excl = false;
+
+		pfn++;
+		ptep++;
+		page++;
+	}
+
+	/*
+	 * Return the computed range via range_out: its start address, first
+	 * page and pte, number of pages, and whether it is exclusive.
+	 */
+
+	range_out->va_start = start;
+	range_out->pg_start = vmf->page - ((vmf->address - start) >> PAGE_SHIFT);
+	range_out->pte_start = vmf->pte - ((vmf->address - start) >> PAGE_SHIFT);
+	range_out->nr = (end - start) >> PAGE_SHIFT;
+	range_out->exclusive = excl;
+}
+
 /*
  * Handle write page faults for pages that can be reused in the current vma
  *
@@ -3528,13 +3664,23 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 	/*
 	 * Private mapping: create an exclusive anonymous page copy if reuse
 	 * is impossible. We might miss VM_WRITE for FOLL_FORCE handling.
+	 * For anonymous memory, we attempt to copy/reuse in folios rather than
+	 * page-by-page. We always prefer reuse over copy, even if we can only
+	 * reuse a subset of the folio. Note that when reusing pages in a folio,
+	 * due to munmap, mremap and friends, the folio isn't guaranteed to be
+	 * naturally aligned in virtual memory space.
 	 */
 	if (folio && folio_test_anon(folio)) {
+		struct anon_folio_range range;
+		int swaprefs;
+
+		calc_anon_folio_range_reuse(vmf, folio, &range);
+
 		/*
-		 * If the page is exclusive to this process we must reuse the
-		 * page without further checks.
+		 * If the pages have already been proven to be exclusive to this
+		 * process we must reuse the pages without further checks.
 		 */
-		if (PageAnonExclusive(vmf->page))
+		if (range.exclusive)
 			goto reuse;

 		/*
@@ -3544,7 +3690,10 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 		 *
 		 * KSM doesn't necessarily raise the folio refcount.
 		 */
-		if (folio_test_ksm(folio) || folio_ref_count(folio) > 3)
+		swaprefs = folio_test_swapcache(folio) ?
+				folio_nr_pages(folio) : 0;
+		if (folio_test_ksm(folio) ||
+		    folio_ref_count(folio) > range.nr + swaprefs + 1)
 			goto copy;
 		if (!folio_test_lru(folio))
 			/*
@@ -3552,29 +3701,31 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 			 * remote LRU pagevecs or references to LRU folios.
 			 */
 			lru_add_drain();
-		if (folio_ref_count(folio) > 1 + folio_test_swapcache(folio))
+		if (folio_ref_count(folio) > range.nr + swaprefs)
 			goto copy;
 		if (!folio_trylock(folio))
 			goto copy;
 		if (folio_test_swapcache(folio))
 			folio_free_swap(folio);
-		if (folio_test_ksm(folio) || folio_ref_count(folio) != 1) {
+		if (folio_test_ksm(folio) ||
+		    folio_ref_count(folio) != range.nr) {
 			folio_unlock(folio);
 			goto copy;
 		}
 		/*
-		 * Ok, we've got the only folio reference from our mapping
+		 * Ok, we've got the only folio references from our mapping
 		 * and the folio is locked, it's dark out, and we're wearing
 		 * sunglasses. Hit it.
 		 */
-		page_move_anon_rmap(vmf->page, vma);
+		folio_move_anon_rmap_range(folio, range.pg_start,
+							range.nr, vma);
 		folio_unlock(folio);
 reuse:
 		if (unlikely(unshare)) {
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
 			return 0;
 		}
-		wp_page_reuse(vmf, NULL);
+		wp_page_reuse(vmf, &range);
 		return 0;
 	}
 copy:
--
2.25.1



Thread overview: 44+ messages
2023-04-14 13:02 [RFC v2 PATCH 00/17] variable-order, " Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 01/17] mm: Expose clear_huge_page() unconditionally Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 02/17] mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio() Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 03/17] mm: Introduce try_vma_alloc_movable_folio() Ryan Roberts
2023-04-17  8:49   ` Yin, Fengwei
2023-04-17 10:11     ` Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 04/17] mm: Implement folio_add_new_anon_rmap_range() Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 05/17] mm: Routines to determine max anon folio allocation order Ryan Roberts
2023-04-14 14:09   ` Kirill A. Shutemov
2023-04-14 14:38     ` Ryan Roberts
2023-04-14 15:37       ` Kirill A. Shutemov
2023-04-14 16:06         ` Ryan Roberts
2023-04-14 16:18           ` Matthew Wilcox
2023-04-14 16:31             ` Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 06/17] mm: Allocate large folios for anonymous memory Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 07/17] mm: Allow deferred splitting of arbitrary large anon folios Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 08/17] mm: Implement folio_move_anon_rmap_range() Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 09/17] mm: Update wp_page_reuse() to operate on range of pages Ryan Roberts
2023-04-14 13:02 ` Ryan Roberts [this message]
2023-04-14 13:02 ` [RFC v2 PATCH 11/17] mm: Split __wp_page_copy_user() into 2 variants Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 12/17] mm: ptep_clear_flush_range_notify() macro for batch operation Ryan Roberts
2023-04-14 13:02 ` [RFC v2 PATCH 13/17] mm: Implement folio_remove_rmap_range() Ryan Roberts
2023-04-14 13:03 ` [RFC v2 PATCH 14/17] mm: Copy large folios for anonymous memory Ryan Roberts
2023-04-14 13:03 ` [RFC v2 PATCH 15/17] mm: Convert zero page to large folios on write Ryan Roberts
2023-04-14 13:03 ` [RFC v2 PATCH 16/17] mm: mmap: Align unhinted maps to highest anon folio order Ryan Roberts
2023-04-17  8:25   ` Yin, Fengwei
2023-04-17 10:13     ` Ryan Roberts
2023-04-14 13:03 ` [RFC v2 PATCH 17/17] mm: Batch-zap large anonymous folio PTE mappings Ryan Roberts
2023-04-17  8:04 ` [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory Yin, Fengwei
2023-04-17 10:19   ` Ryan Roberts
2023-04-17  8:19 ` Yin, Fengwei
2023-04-17 10:28   ` Ryan Roberts
2023-04-17 10:54 ` David Hildenbrand
2023-04-17 11:43   ` Ryan Roberts
2023-04-17 14:05     ` David Hildenbrand
2023-04-17 15:38       ` Ryan Roberts
2023-04-17 15:44         ` David Hildenbrand
2023-04-17 16:15           ` Ryan Roberts
2023-04-26 10:41           ` Ryan Roberts
2023-05-17 13:58             ` David Hildenbrand
2023-05-18 11:23               ` Ryan Roberts
2023-04-19 10:12       ` Ryan Roberts
2023-04-19 10:51         ` David Hildenbrand
2023-04-19 11:13           ` Ryan Roberts
