[RFC PATCH v2 0/5] folio based filemap_map

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH v2 0/5] folio based filemap_map_pages()
@ 2023-02-01  8:17 Yin Fengwei
  2023-02-01  8:17 ` [RFC PATCH v2 1/5] mm: Enable fault around for shared file page fault Yin Fengwei
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Yin Fengwei @ 2023-02-01  8:17 UTC (permalink / raw)
  To: willy, david, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang, fengwei.yin

Current filemap_map_pages() uses page granularity even when
underneath folio is large folio. Making it use folio based
granularity allows batched refcount, rmap and mm counter
update. Which brings performance gain.

This sereis tries to bring batched refcount, rmap and
mm counter for filemap_map_pages(). Testing with a micro
benchmark like will-it-scale.pagefault on a 48C/96T IceLake
tbox showed:
   - batched rmap brings around 15% performance gain
   - batched refcount brings around 2% performance gain

Patch 1 enabled the fault around for share file page write
        fault. As David suggested here: [1]

Patch 2 update filemap_map_pages() to do map based on folio
        granularity and batched refcount update

Patch 3,4,5 enable batched rmap and mm counter

[1] https://lore.kernel.org/linux-mm/e14b4e9a-612d-fc02-edc0-8f3b6bcf4148@redhat.com/

Change from v1:
  - Update the struct page * parameter of *_range() to index
    in the folio as Matthew suggested
  - Fix indentations problem as Matthew pointed out
  - Add back the function comment as Matthew pointed out
  - Use nr_pages than len as Matthew pointed out
  - Add do_set_pte_range() as Matthew suggested
  - Add function comment as Ying suggested
  - Add performance test result to patch 1/2/5 commit message

  Patch 1:
    - Adapt commit message as Matthew suggested
    - Add Reviweed-by from Matthew
  Patch 3:
    - Restore general logic of page_add_file_rmap_range() to
      make patch review easier as Matthew suggested
  Patch 5:
    - Add perf data collected to understand the reason of
      performance gain

Yin Fengwei (5):
  mm: Enable fault around for shared file page fault
  filemap: add function filemap_map_folio_range()
  rmap: add page_add_file_rmap_range()
  mm: add do_set_pte_range()
  filemap: batched update mm counter,rmap when map file folio

 include/linux/mm.h   |   2 +
 include/linux/rmap.h |   2 +
 mm/filemap.c         | 119 ++++++++++++++++++++++++++++++-------------
 mm/memory.c          |  71 +++++++++++++++++++++-----
 mm/rmap.c            |  66 +++++++++++++++++++-----
 5 files changed, 199 insertions(+), 61 deletions(-)

-- 
2.30.2



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH v2 1/5] mm: Enable fault around for shared file page fault
  2023-02-01  8:17 [RFC PATCH v2 0/5] folio based filemap_map_pages() Yin Fengwei
@ 2023-02-01  8:17 ` Yin Fengwei
  2023-02-01 14:34   ` Kirill A. Shutemov
  2023-02-01  8:17 ` [RFC PATCH v2 2/5] filemap: add function filemap_map_folio_range() Yin Fengwei
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Yin Fengwei @ 2023-02-01  8:17 UTC (permalink / raw)
  To: willy, david, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang, fengwei.yin

The shared fault handler can also benefit from fault-around.  While it
is uncommon to write to MAP_SHARED files, applications that do will see
a huge benefit with will-it-scale:page_fault3 (shared file write fault)
improving by 375%.

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/memory.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 7a04a1130ec1..51c04bb60724 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4546,6 +4546,17 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	vm_fault_t ret, tmp;
 
+	/*
+	 * Let's call ->map_pages() first and use ->fault() as fallback
+	 * if page by the offset is not ready to be mapped (cold cache or
+	 * something).
+	 */
+	if (should_fault_around(vmf)) {
+		ret = do_fault_around(vmf);
+		if (ret)
+			return ret;
+	}
+
 	ret = __do_fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
-- 
2.30.2



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH v2 2/5] filemap: add function filemap_map_folio_range()
  2023-02-01  8:17 [RFC PATCH v2 0/5] folio based filemap_map_pages() Yin Fengwei
  2023-02-01  8:17 ` [RFC PATCH v2 1/5] mm: Enable fault around for shared file page fault Yin Fengwei
@ 2023-02-01  8:17 ` Yin Fengwei
  2023-02-01  8:17 ` [RFC PATCH v2 3/5] rmap: add page_add_file_rmap_range() Yin Fengwei
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Yin Fengwei @ 2023-02-01  8:17 UTC (permalink / raw)
  To: willy, david, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang, fengwei.yin

filemap_map_folio_range() maps partial/full folio. Comparing to
original filemap_map_pages(), it batched updates refcount and
get minor performance improvement for large folio.

a self cooked will-it-scale.page_fault3 like app (change file
write fault to read fault) with xfs filesystem got 2% performance
gain.

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 mm/filemap.c | 91 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 57 insertions(+), 34 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 992554c18f1f..9cc5edd8f998 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3351,6 +3351,56 @@ static inline struct folio *next_map_page(struct address_space *mapping,
 				  mapping, xas, end_pgoff);
 }
 
+/*
+ * Map sub-pages range [start_page, start_page + nr_pages) of folio.
+ * start_page is gotten from start by folio_page(folio, start)
+ */
+static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
+			struct folio *folio, unsigned long start,
+			unsigned long addr, unsigned int nr_pages)
+{
+	vm_fault_t ret = 0;
+	struct vm_area_struct *vma = vmf->vma;
+	struct file *file = vma->vm_file;
+	struct page *page = folio_page(folio, start);
+	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
+	unsigned int ref_count = 0, count = 0;
+
+	do {
+		if (PageHWPoison(page))
+			continue;
+
+		if (mmap_miss > 0)
+			mmap_miss--;
+
+		/*
+		 * NOTE: If there're PTE markers, we'll leave them to be
+		 * handled in the specific fault path, and it'll prohibit the
+		 * fault-around logic.
+		 */
+		if (!pte_none(*vmf->pte))
+			continue;
+
+		if (vmf->address == addr)
+			ret = VM_FAULT_NOPAGE;
+
+		ref_count++;
+		do_set_pte(vmf, page, addr);
+		update_mmu_cache(vma, addr, vmf->pte);
+	} while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
+
+	/*
+	 * Restore the vmf->pte. Otherwise, it's possible vmf->pte point
+	 * to next page table entry if the last sub page in the range is
+	 * mapped to the last entry of page table.
+	 */
+	vmf->pte -= nr_pages;
+	folio_ref_add(folio, ref_count);
+	WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
+
+	return ret;
+}
+
 vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 			     pgoff_t start_pgoff, pgoff_t end_pgoff)
 {
@@ -3361,9 +3411,9 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	unsigned long addr;
 	XA_STATE(xas, &mapping->i_pages, start_pgoff);
 	struct folio *folio;
-	struct page *page;
 	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
 	vm_fault_t ret = 0;
+	int nr_pages = 0;
 
 	rcu_read_lock();
 	folio = first_map_page(mapping, &xas, end_pgoff);
@@ -3378,45 +3428,18 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
 	do {
-again:
-		page = folio_file_page(folio, xas.xa_index);
-		if (PageHWPoison(page))
-			goto unlock;
-
-		if (mmap_miss > 0)
-			mmap_miss--;
+		unsigned long end;
 
 		addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
 		vmf->pte += xas.xa_index - last_pgoff;
 		last_pgoff = xas.xa_index;
+		end = folio->index + folio_nr_pages(folio) - 1;
+		nr_pages = min(end, end_pgoff) - xas.xa_index + 1;
 
-		/*
-		 * NOTE: If there're PTE markers, we'll leave them to be
-		 * handled in the specific fault path, and it'll prohibit the
-		 * fault-around logic.
-		 */
-		if (!pte_none(*vmf->pte))
-			goto unlock;
-
-		/* We're about to handle the fault */
-		if (vmf->address == addr)
-			ret = VM_FAULT_NOPAGE;
+		ret |=	filemap_map_folio_range(vmf, folio,
+				xas.xa_index - folio->index, addr, nr_pages);
+		xas.xa_index = end;
 
-		do_set_pte(vmf, page, addr);
-		/* no need to invalidate: a not-present page won't be cached */
-		update_mmu_cache(vma, addr, vmf->pte);
-		if (folio_more_pages(folio, xas.xa_index, end_pgoff)) {
-			xas.xa_index++;
-			folio_ref_inc(folio);
-			goto again;
-		}
-		folio_unlock(folio);
-		continue;
-unlock:
-		if (folio_more_pages(folio, xas.xa_index, end_pgoff)) {
-			xas.xa_index++;
-			goto again;
-		}
 		folio_unlock(folio);
 		folio_put(folio);
 	} while ((folio = next_map_page(mapping, &xas, end_pgoff)) != NULL);
-- 
2.30.2



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH v2 3/5] rmap: add page_add_file_rmap_range()
  2023-02-01  8:17 [RFC PATCH v2 0/5] folio based filemap_map_pages() Yin Fengwei
  2023-02-01  8:17 ` [RFC PATCH v2 1/5] mm: Enable fault around for shared file page fault Yin Fengwei
  2023-02-01  8:17 ` [RFC PATCH v2 2/5] filemap: add function filemap_map_folio_range() Yin Fengwei
@ 2023-02-01  8:17 ` Yin Fengwei
  2023-02-01 17:32   ` Matthew Wilcox
  2023-02-01  8:17 ` [RFC PATCH v2 4/5] mm: add do_set_pte_range() Yin Fengwei
  2023-02-01  8:17 ` [RFC PATCH v2 5/5] filemap: batched update mm counter,rmap when map file folio Yin Fengwei
  4 siblings, 1 reply; 19+ messages in thread
From: Yin Fengwei @ 2023-02-01  8:17 UTC (permalink / raw)
  To: willy, david, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang, fengwei.yin

page_add_file_rmap_range() allows to add pte mapping to a specific
range of file folio. Comparing to original page_add_file_rmap(),
it batched updates __lruvec_stat for large folio.

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 include/linux/rmap.h |  2 ++
 mm/rmap.c            | 66 ++++++++++++++++++++++++++++++++++----------
 2 files changed, 54 insertions(+), 14 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index a4570da03e58..9631a3701504 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -198,6 +198,8 @@ void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
 		unsigned long address);
 void page_add_file_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
+void page_add_file_rmap_range(struct folio *, unsigned long start,
+		unsigned int nr_pages, struct vm_area_struct *, bool compound);
 void page_remove_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 15ae24585fc4..090de52e1a9a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1303,31 +1303,44 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 }
 
 /**
- * page_add_file_rmap - add pte mapping to a file page
- * @page:	the page to add the mapping to
+ * page_add_file_rmap_range - add pte mapping to a sub page range of a folio
+ * @folio:	The filio to add the mapping to
+ * @start:	The first sub page index in folio
+ * @nr_pages:	The number of sub pages from the first page
  * @vma:	the vm area in which the mapping is added
  * @compound:	charge the page as compound or small page
  *
+ * The sub page range of folio is defined by
+ * 	[first_sub_page, first_sub_page + nr_pages)
+ *
  * The caller needs to hold the pte lock.
  */
-void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
-		bool compound)
+void page_add_file_rmap_range(struct folio *folio, unsigned long start,
+			unsigned int nr_pages, struct vm_area_struct *vma,
+			bool compound)
 {
-	struct folio *folio = page_folio(page);
 	atomic_t *mapped = &folio->_nr_pages_mapped;
-	int nr = 0, nr_pmdmapped = 0;
-	bool first;
+	unsigned int nr = 0, nr_pmdmapped = 0, first;
 
-	VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
+	VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio);
 
 	/* Is page being mapped by PTE? Is this its first map to be added? */
 	if (likely(!compound)) {
-		first = atomic_inc_and_test(&page->_mapcount);
-		nr = first;
-		if (first && folio_test_large(folio)) {
-			nr = atomic_inc_return_relaxed(mapped);
-			nr = (nr < COMPOUND_MAPPED);
-		}
+		struct page *page = folio_page(folio, start);
+
+		nr_pages = min_t(unsigned int, nr_pages,
+					folio_nr_pages(folio) - start);
+
+		do {
+			first = atomic_inc_and_test(&page->_mapcount);
+			if (first && folio_test_large(folio)) {
+				first = atomic_inc_return_relaxed(mapped);
+				first = (nr < COMPOUND_MAPPED);
+			}
+
+			if (first)
+				nr++;
+		} while (page++, --nr_pages > 0);
 	} else if (folio_test_pmd_mappable(folio)) {
 		/* That test is redundant: it's for safety or to optimize out */
 
@@ -1356,6 +1369,31 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
 	mlock_vma_folio(folio, vma, compound);
 }
 
+/**
+ * page_add_file_rmap - add pte mapping to a file page
+ * @page:	the page to add the mapping to
+ * @vma:	the vm area in which the mapping is added
+ * @compound:	charge the page as compound or small page
+ *
+ * The caller needs to hold the pte lock.
+ */
+void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
+		bool compound)
+{
+	struct folio *folio = page_folio(page);
+	unsigned int nr_pages;
+
+	VM_WARN_ON_ONCE_PAGE(compound && !PageTransHuge(page), page);
+
+	if (likely(!compound))
+		nr_pages = 1;
+	else
+		nr_pages = folio_nr_pages(folio);
+
+	page_add_file_rmap_range(folio, folio_page_idx(folio, page),
+			nr_pages, vma, compound);
+}
+
 /**
  * page_remove_rmap - take down pte mapping from a page
  * @page:	page to remove mapping from
-- 
2.30.2



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH v2 4/5] mm: add do_set_pte_range()
  2023-02-01  8:17 [RFC PATCH v2 0/5] folio based filemap_map_pages() Yin Fengwei
                   ` (2 preceding siblings ...)
  2023-02-01  8:17 ` [RFC PATCH v2 3/5] rmap: add page_add_file_rmap_range() Yin Fengwei
@ 2023-02-01  8:17 ` Yin Fengwei
  2023-02-01  9:09   ` David Hildenbrand
  2023-02-01 17:38   ` Matthew Wilcox
  2023-02-01  8:17 ` [RFC PATCH v2 5/5] filemap: batched update mm counter,rmap when map file folio Yin Fengwei
  4 siblings, 2 replies; 19+ messages in thread
From: Yin Fengwei @ 2023-02-01  8:17 UTC (permalink / raw)
  To: willy, david, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang, fengwei.yin

do_set_pte_range() allows to setup page table entries for a
specific range. It calls page_add_file_rmap_range() to take
advantage of batched rmap update for large folio.

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 include/linux/mm.h |  2 ++
 mm/filemap.c       |  1 -
 mm/memory.c        | 60 ++++++++++++++++++++++++++++++++++++----------
 3 files changed, 49 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d6f8f41514cc..96e08fcdce24 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1162,6 +1162,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 
 vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
 void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
+void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
+		unsigned long start, unsigned long addr, unsigned int nr);
 
 vm_fault_t finish_fault(struct vm_fault *vmf);
 vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
diff --git a/mm/filemap.c b/mm/filemap.c
index 9cc5edd8f998..95f634d11581 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3386,7 +3386,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 
 		ref_count++;
 		do_set_pte(vmf, page, addr);
-		update_mmu_cache(vma, addr, vmf->pte);
 	} while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
 
 	/*
diff --git a/mm/memory.c b/mm/memory.c
index 51c04bb60724..7e41142e1e4f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4257,7 +4257,8 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 }
 #endif
 
-void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
+static void do_set_pte_entry(struct vm_fault *vmf, struct page *page,
+		unsigned long addr)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
@@ -4277,16 +4278,52 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 	if (unlikely(uffd_wp))
 		entry = pte_mkuffd_wp(entry);
-	/* copy-on-write page */
-	if (write && !(vma->vm_flags & VM_SHARED)) {
-		inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
-		page_add_new_anon_rmap(page, vma, addr);
-		lru_cache_add_inactive_or_unevictable(page, vma);
-	} else {
-		inc_mm_counter(vma->vm_mm, mm_counter_file(page));
-		page_add_file_rmap(page, vma, false);
-	}
 	set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
+
+	/* no need to invalidate: a not-present page won't be cached */
+	update_mmu_cache(vma, addr, vmf->pte);
+}
+
+void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
+		unsigned long start, unsigned long addr, unsigned int nr)
+{
+	unsigned int i = 0;
+	struct page *page = folio_page(folio, start);
+	struct vm_area_struct *vma = vmf->vma;
+	bool cow = (vmf->flags & FAULT_FLAG_WRITE) &&
+			!(vma->vm_flags & VM_SHARED);
+
+	/*
+	 * file page: batched update rmap, mm counter.
+	 * copy-on-write page: batched update mm counter.
+	 */
+	if (!cow) {
+		page_add_file_rmap_range(folio, start, nr, vma, false);
+		add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
+	} else
+		add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr);
+
+	do {
+		if (cow) {
+			page_add_new_anon_rmap(page, vma, addr);
+			lru_cache_add_inactive_or_unevictable(page, vma);
+		}
+
+		do_set_pte_entry(vmf, page, addr);
+	} while (vmf->pte++, page++, addr += PAGE_SIZE, ++i < nr);
+}
+
+void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
+{
+	struct folio *folio = page_folio(page);
+
+	do_set_pte_range(vmf, folio, folio_page_idx(folio, page), addr, 1);
+
+	/*
+	 * do_set_pte_range changes vmf->pte. Restore it back as
+	 * do_set_pte doesn't expect the change of vmf->pte.
+	 */
+	vmf->pte--;
 }
 
 static bool vmf_pte_changed(struct vm_fault *vmf)
@@ -4361,9 +4398,6 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 	if (likely(!vmf_pte_changed(vmf))) {
 		do_set_pte(vmf, page, vmf->address);
 
-		/* no need to invalidate: a not-present page won't be cached */
-		update_mmu_cache(vma, vmf->address, vmf->pte);
-
 		ret = 0;
 	} else {
 		update_mmu_tlb(vma, vmf->address, vmf->pte);
-- 
2.30.2



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH v2 5/5] filemap: batched update mm counter,rmap when map file folio
  2023-02-01  8:17 [RFC PATCH v2 0/5] folio based filemap_map_pages() Yin Fengwei
                   ` (3 preceding siblings ...)
  2023-02-01  8:17 ` [RFC PATCH v2 4/5] mm: add do_set_pte_range() Yin Fengwei
@ 2023-02-01  8:17 ` Yin Fengwei
  2023-02-01 15:50   ` Matthew Wilcox
  4 siblings, 1 reply; 19+ messages in thread
From: Yin Fengwei @ 2023-02-01  8:17 UTC (permalink / raw)
  To: willy, david, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang, fengwei.yin

Use do_set_pte_range() in filemap_map_folio_range(). Which
batched updates mm counter, rmap.

With a self cooked will-it-scale.page_fault3 like app (change
file write fault to read fault) got 15% performance gain.

Perf data collected before/after the change:
  18.73%--page_add_file_rmap
          |
           --11.60%--__mod_lruvec_page_state
                     |
                     |--7.40%--__mod_memcg_lruvec_state
                     |          |
                     |           --5.58%--cgroup_rstat_updated
                     |
                      --2.53%--__mod_lruvec_state
                                |
                                 --1.48%--__mod_node_page_state

  9.93%--page_add_file_rmap_range
         |
          --2.67%--__mod_lruvec_page_state
                    |
                    |--1.95%--__mod_memcg_lruvec_state
                    |          |
                    |           --1.57%--cgroup_rstat_updated
                    |
                     --0.61%--__mod_lruvec_state
                               |
                                --0.54%--__mod_node_page_state

The running time of __mode_lruvec_page_state() is reduced a lot.

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 mm/filemap.c | 37 +++++++++++++++++++++++++++++++++----
 1 file changed, 33 insertions(+), 4 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 95f634d11581..b14f077d1c55 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3364,11 +3364,22 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 	struct file *file = vma->vm_file;
 	struct page *page = folio_page(folio, start);
 	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
-	unsigned int ref_count = 0, count = 0;
+	unsigned int ref_count = 0, count = 0, nr_mapped = 0;
 
 	do {
-		if (PageHWPoison(page))
+		if (PageHWPoison(page)) {
+			if (nr_mapped) {
+				vmf->pte -= nr_mapped;
+				do_set_pte_range(vmf, folio,
+						start + count - nr_mapped,
+						addr - nr_mapped * PAGE_SIZE,
+						nr_mapped);
+
+			}
+
+			nr_mapped = 0;
 			continue;
+		}
 
 		if (mmap_miss > 0)
 			mmap_miss--;
@@ -3378,16 +3389,34 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 		 * handled in the specific fault path, and it'll prohibit the
 		 * fault-around logic.
 		 */
-		if (!pte_none(*vmf->pte))
+		if (!pte_none(*vmf->pte)) {
+			if (nr_mapped) {
+				vmf->pte -= nr_mapped;
+				do_set_pte_range(vmf, folio,
+						start + count - nr_mapped,
+						addr - nr_mapped * PAGE_SIZE,
+						nr_mapped);
+
+			}
+
+			nr_mapped = 0;
+
 			continue;
+		}
 
 		if (vmf->address == addr)
 			ret = VM_FAULT_NOPAGE;
 
 		ref_count++;
-		do_set_pte(vmf, page, addr);
+		nr_mapped++;
 	} while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
 
+	if (nr_mapped) {
+		vmf->pte -= nr_mapped;
+		do_set_pte_range(vmf, folio, start + count - nr_mapped,
+				addr - nr_mapped * PAGE_SIZE, nr_mapped);
+	}
+
 	/*
 	 * Restore the vmf->pte. Otherwise, it's possible vmf->pte point
 	 * to next page table entry if the last sub page in the range is
-- 
2.30.2



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 4/5] mm: add do_set_pte_range()
  2023-02-01  8:17 ` [RFC PATCH v2 4/5] mm: add do_set_pte_range() Yin Fengwei
@ 2023-02-01  9:09   ` David Hildenbrand
  2023-02-01 10:04     ` Yin, Fengwei
  2023-02-01 17:38   ` Matthew Wilcox
  1 sibling, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2023-02-01  9:09 UTC (permalink / raw)
  To: Yin Fengwei, willy, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang

On 01.02.23 09:17, Yin Fengwei wrote:
> do_set_pte_range() allows to setup page table entries for a
> specific range. It calls page_add_file_rmap_range() to take
> advantage of batched rmap update for large folio.
> 
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> ---
>   include/linux/mm.h |  2 ++
>   mm/filemap.c       |  1 -
>   mm/memory.c        | 60 ++++++++++++++++++++++++++++++++++++----------
>   3 files changed, 49 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d6f8f41514cc..96e08fcdce24 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1162,6 +1162,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   
>   vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
>   void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
> +		unsigned long start, unsigned long addr, unsigned int nr);
>   
>   vm_fault_t finish_fault(struct vm_fault *vmf);
>   vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 9cc5edd8f998..95f634d11581 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3386,7 +3386,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>   
>   		ref_count++;
>   		do_set_pte(vmf, page, addr);
> -		update_mmu_cache(vma, addr, vmf->pte);
>   	} while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
>   
>   	/*
> diff --git a/mm/memory.c b/mm/memory.c
> index 51c04bb60724..7e41142e1e4f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4257,7 +4257,8 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>   }
>   #endif
>   
> -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
> +static void do_set_pte_entry(struct vm_fault *vmf, struct page *page,
> +		unsigned long addr)
>   {
>   	struct vm_area_struct *vma = vmf->vma;
>   	bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
> @@ -4277,16 +4278,52 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>   		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>   	if (unlikely(uffd_wp))
>   		entry = pte_mkuffd_wp(entry);
> -	/* copy-on-write page */
> -	if (write && !(vma->vm_flags & VM_SHARED)) {
> -		inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> -		page_add_new_anon_rmap(page, vma, addr);
> -		lru_cache_add_inactive_or_unevictable(page, vma);
> -	} else {
> -		inc_mm_counter(vma->vm_mm, mm_counter_file(page));
> -		page_add_file_rmap(page, vma, false);
> -	}
>   	set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
> +
> +	/* no need to invalidate: a not-present page won't be cached */
> +	update_mmu_cache(vma, addr, vmf->pte);
> +}
> +
> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
> +		unsigned long start, unsigned long addr, unsigned int nr)
> +{
> +	unsigned int i = 0;
> +	struct page *page = folio_page(folio, start);
> +	struct vm_area_struct *vma = vmf->vma;
> +	bool cow = (vmf->flags & FAULT_FLAG_WRITE) &&
> +			!(vma->vm_flags & VM_SHARED);
> +
> +	/*
> +	 * file page: batched update rmap, mm counter.
> +	 * copy-on-write page: batched update mm counter.
> +	 */
> +	if (!cow) {
> +		page_add_file_rmap_range(folio, start, nr, vma, false);
> +		add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
> +	} else
> +		add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr);
> +
> +	do {
> +		if (cow) {
> +			page_add_new_anon_rmap(page, vma, addr);

This doesn't work with anon pages the way you intended in this patch.

Please leave anon pages out of the picture for now and make 
do_set_pte_range() only deal with !cow.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 4/5] mm: add do_set_pte_range()
  2023-02-01  9:09   ` David Hildenbrand
@ 2023-02-01 10:04     ` Yin, Fengwei
  0 siblings, 0 replies; 19+ messages in thread
From: Yin, Fengwei @ 2023-02-01 10:04 UTC (permalink / raw)
  To: David Hildenbrand, willy, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang



On 2/1/2023 5:09 PM, David Hildenbrand wrote:
> On 01.02.23 09:17, Yin Fengwei wrote:
>> do_set_pte_range() allows to setup page table entries for a
>> specific range. It calls page_add_file_rmap_range() to take
>> advantage of batched rmap update for large folio.
>>
>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>> ---
>>   include/linux/mm.h |  2 ++
>>   mm/filemap.c       |  1 -
>>   mm/memory.c        | 60 ++++++++++++++++++++++++++++++++++++----------
>>   3 files changed, 49 insertions(+), 14 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index d6f8f41514cc..96e08fcdce24 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1162,6 +1162,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>>     vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
>>   void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
>> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
>> +        unsigned long start, unsigned long addr, unsigned int nr);
>>     vm_fault_t finish_fault(struct vm_fault *vmf);
>>   vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 9cc5edd8f998..95f634d11581 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -3386,7 +3386,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>>             ref_count++;
>>           do_set_pte(vmf, page, addr);
>> -        update_mmu_cache(vma, addr, vmf->pte);
>>       } while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
>>         /*
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 51c04bb60724..7e41142e1e4f 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4257,7 +4257,8 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>>   }
>>   #endif
>>   -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>> +static void do_set_pte_entry(struct vm_fault *vmf, struct page *page,
>> +        unsigned long addr)
>>   {
>>       struct vm_area_struct *vma = vmf->vma;
>>       bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
>> @@ -4277,16 +4278,52 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>>           entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>>       if (unlikely(uffd_wp))
>>           entry = pte_mkuffd_wp(entry);
>> -    /* copy-on-write page */
>> -    if (write && !(vma->vm_flags & VM_SHARED)) {
>> -        inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>> -        page_add_new_anon_rmap(page, vma, addr);
>> -        lru_cache_add_inactive_or_unevictable(page, vma);
>> -    } else {
>> -        inc_mm_counter(vma->vm_mm, mm_counter_file(page));
>> -        page_add_file_rmap(page, vma, false);
>> -    }
>>       set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
>> +
>> +    /* no need to invalidate: a not-present page won't be cached */
>> +    update_mmu_cache(vma, addr, vmf->pte);
>> +}
>> +
>> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
>> +        unsigned long start, unsigned long addr, unsigned int nr)
>> +{
>> +    unsigned int i = 0;
>> +    struct page *page = folio_page(folio, start);
>> +    struct vm_area_struct *vma = vmf->vma;
>> +    bool cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>> +            !(vma->vm_flags & VM_SHARED);
>> +
>> +    /*
>> +     * file page: batched update rmap, mm counter.
>> +     * copy-on-write page: batched update mm counter.
>> +     */
>> +    if (!cow) {
>> +        page_add_file_rmap_range(folio, start, nr, vma, false);
>> +        add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
>> +    } else
>> +        add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr);
>> +
>> +    do {
>> +        if (cow) {
>> +            page_add_new_anon_rmap(page, vma, addr);
> 
> This doesn't work with anon pages the way you intended in this patch.
> 
> Please leave anon pages out of the picture for now and make do_set_pte_range() only deal with !cow.
OK. I will move the check to do_set_pte() and only call to
do_set_pte_range() when it's !cow. Thanks.

Regards
Yin, Fengwei

> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 1/5] mm: Enable fault around for shared file page fault
  2023-02-01  8:17 ` [RFC PATCH v2 1/5] mm: Enable fault around for shared file page fault Yin Fengwei
@ 2023-02-01 14:34   ` Kirill A. Shutemov
  2023-02-02  1:54     ` Yin, Fengwei
  0 siblings, 1 reply; 19+ messages in thread
From: Kirill A. Shutemov @ 2023-02-01 14:34 UTC (permalink / raw)
  To: Yin Fengwei; +Cc: willy, david, linux-mm, dave.hansen, tim.c.chen, ying.huang

On Wed, Feb 01, 2023 at 04:17:33PM +0800, Yin Fengwei wrote:
> The shared fault handler can also benefit from fault-around.  While it
> is uncommon to write to MAP_SHARED files, applications that do will see
> a huge benefit with will-it-scale:page_fault3 (shared file write fault)
> improving by 375%.
> 
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  mm/memory.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 7a04a1130ec1..51c04bb60724 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4546,6 +4546,17 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
>  	struct vm_area_struct *vma = vmf->vma;
>  	vm_fault_t ret, tmp;
>  
> +	/*
> +	 * Let's call ->map_pages() first and use ->fault() as fallback
> +	 * if page by the offset is not ready to be mapped (cold cache or
> +	 * something).
> +	 */
> +	if (should_fault_around(vmf)) {
> +		ret = do_fault_around(vmf);
> +		if (ret)
> +			return ret;
> +	}
> +

I believe it bypasses ->page_mkwrite() completely, no?

So you get a writable PTEs without notifying the filesystem. Smells like a
data loss.

>  	ret = __do_fault(vmf);
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>  		return ret;
> -- 
> 2.30.2
> 
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 5/5] filemap: batched update mm counter,rmap when map file folio
  2023-02-01  8:17 ` [RFC PATCH v2 5/5] filemap: batched update mm counter,rmap when map file folio Yin Fengwei
@ 2023-02-01 15:50   ` Matthew Wilcox
  2023-02-02  3:31     ` Yin, Fengwei
  2023-02-03 13:15     ` Yin, Fengwei
  0 siblings, 2 replies; 19+ messages in thread
From: Matthew Wilcox @ 2023-02-01 15:50 UTC (permalink / raw)
  To: Yin Fengwei; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang

On Wed, Feb 01, 2023 at 04:17:37PM +0800, Yin Fengwei wrote:
> Use do_set_pte_range() in filemap_map_folio_range(). Which
> batched updates mm counter, rmap.
> 
> With a self cooked will-it-scale.page_fault3 like app (change
> file write fault to read fault) got 15% performance gain.

I'd suggest that you create a will-it-scale.page_fault4.  Anton
is quite open to adding new variations of tests.

> Perf data collected before/after the change:
>   18.73%--page_add_file_rmap
>           |
>            --11.60%--__mod_lruvec_page_state
>                      |
>                      |--7.40%--__mod_memcg_lruvec_state
>                      |          |
>                      |           --5.58%--cgroup_rstat_updated
>                      |
>                       --2.53%--__mod_lruvec_state
>                                 |
>                                  --1.48%--__mod_node_page_state
> 
>   9.93%--page_add_file_rmap_range
>          |
>           --2.67%--__mod_lruvec_page_state
>                     |
>                     |--1.95%--__mod_memcg_lruvec_state
>                     |          |
>                     |           --1.57%--cgroup_rstat_updated
>                     |
>                      --0.61%--__mod_lruvec_state
>                                |
>                                 --0.54%--__mod_node_page_state
> 
> The running time of __mode_lruvec_page_state() is reduced a lot.

Nice.

> +++ b/mm/filemap.c
> @@ -3364,11 +3364,22 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>  	struct file *file = vma->vm_file;
>  	struct page *page = folio_page(folio, start);
>  	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
> -	unsigned int ref_count = 0, count = 0;
> +	unsigned int ref_count = 0, count = 0, nr_mapped = 0;
>  
>  	do {
> -		if (PageHWPoison(page))
> +		if (PageHWPoison(page)) {
> +			if (nr_mapped) {
> +				vmf->pte -= nr_mapped;
> +				do_set_pte_range(vmf, folio,
> +						start + count - nr_mapped,
> +						addr - nr_mapped * PAGE_SIZE,
> +						nr_mapped);
> +
> +			}
> +
> +			nr_mapped = 0;
>  			continue;
> +		}

Having subtracted nr_mapped from vmf->pte, we then need to add it again.

But this is all too complicated.  What if we don't update vmf->pte
each time around the loop?  ie something like this:

	do {
		if (PageHWPoisoned(page))
			goto map;
 		if (mmap_miss > 0)
 			mmap_miss--;
		/*
		 * If there're PTE markers, we'll leave them to be handled
		 * in the specific fault path, and it'll prohibit the
		 * fault-around logic.
		 */
		if (!pte_none(vmf->pte[count]))
			goto map;
 		if (vmf->address == addr + count * PAGE_SIZE)
 			ret = VM_FAULT_NOPAGE;
		continue;
map:
		if (count > 1) {
			do_set_pte_range(vmf, folio, start, addr, count - 1);
			folio_ref_add(folio, count - 1);
		}
		start += count;
		vmf->pte += count;
		addr += count * PAGE_SIZE;
		nr_pages -= count;
		count = 0;
	} while (page++, ++count < nr_pages);

	if (count > 0) {
		do_set_pte_range(vmf, folio, start, addr, count);
		folio_ref_add(folio, count);
	} else {
		/* Make sure the PTE points to the correct page table */
		vmf->pte--;
	}


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 3/5] rmap: add page_add_file_rmap_range()
  2023-02-01  8:17 ` [RFC PATCH v2 3/5] rmap: add page_add_file_rmap_range() Yin Fengwei
@ 2023-02-01 17:32   ` Matthew Wilcox
  2023-02-02  2:00     ` Yin, Fengwei
  0 siblings, 1 reply; 19+ messages in thread
From: Matthew Wilcox @ 2023-02-01 17:32 UTC (permalink / raw)
  To: Yin Fengwei; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang

On Wed, Feb 01, 2023 at 04:17:35PM +0800, Yin Fengwei wrote:
>  /**
> - * page_add_file_rmap - add pte mapping to a file page
> - * @page:	the page to add the mapping to
> + * page_add_file_rmap_range - add pte mapping to a sub page range of a folio
> + * @folio:	The filio to add the mapping to
> + * @start:	The first sub page index in folio
> + * @nr_pages:	The number of sub pages from the first page
>   * @vma:	the vm area in which the mapping is added
>   * @compound:	charge the page as compound or small page
>   *
> + * The sub page range of folio is defined by
> + * 	[first_sub_page, first_sub_page + nr_pages)

Lose the "sub" from all of this.  That's legacy thinking; pages are
pages and folios are folios.  "subpages" was from when we were trying
to use the word "page" for both "the allocation" and "the PAGE_SIZE
range of bytes".

> + *
>   * The caller needs to hold the pte lock.
>   */
> -void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
> -		bool compound)
> +void page_add_file_rmap_range(struct folio *folio, unsigned long start,
> +			unsigned int nr_pages, struct vm_area_struct *vma,
> +			bool compound)

I think this function needs to be called folio_add_file_rmap()

I'd like to lose the 'compound' parameter, and base it on nr_pages ==
folio_nr_pages(), but that may be a step far just now.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 4/5] mm: add do_set_pte_range()
  2023-02-01  8:17 ` [RFC PATCH v2 4/5] mm: add do_set_pte_range() Yin Fengwei
  2023-02-01  9:09   ` David Hildenbrand
@ 2023-02-01 17:38   ` Matthew Wilcox
  2023-02-01 21:59     ` Matthew Wilcox
                       ` (2 more replies)
  1 sibling, 3 replies; 19+ messages in thread
From: Matthew Wilcox @ 2023-02-01 17:38 UTC (permalink / raw)
  To: Yin Fengwei; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang

On Wed, Feb 01, 2023 at 04:17:36PM +0800, Yin Fengwei wrote:
> do_set_pte_range() allows to setup page table entries for a
> specific range. It calls page_add_file_rmap_range() to take
> advantage of batched rmap update for large folio.

How about something more like this?  Yes, we need to define
flush_icache_pages() and PTE_STRIDE.

(we could also do for (i = 0; i < nr; i++) flush_icache_page(...) but
given that some architectures already implement flush_icache_range(),
I think they may appreciate being given one large range to flush)

+++ b/mm/memory.c
@@ -4277,15 +4277,19 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page 
*page)
 }
 #endif
 
-void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
+void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
+               unsigned int start, unsigned int nr,
+               unsigned long addr)
 {
+       struct page *page = folio_page(page, start);
        struct vm_area_struct *vma = vmf->vma;
        bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
        bool write = vmf->flags & FAULT_FLAG_WRITE;
        bool prefault = vmf->address != addr;
        pte_t entry;
+       unsigned int i;
 
-       flush_icache_page(vma, page);
+       flush_icache_pages(vma, page, nr);
        entry = mk_pte(page, vma->vm_page_prot);
 
        if (prefault && arch_wants_old_prefaulted_pte())
@@ -4299,14 +4303,23 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
                entry = pte_mkuffd_wp(pte_wrprotect(entry));
        /* copy-on-write page */
        if (write && !(vma->vm_flags & VM_SHARED)) {
-               inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
-               page_add_new_anon_rmap(page, vma, addr);
-               lru_cache_add_inactive_or_unevictable(page, vma);
+               add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr);
+               for (i = 0; i < nr; i++) {
+                       page_add_new_anon_rmap(page + i, vma, addr);
+                       lru_cache_add_inactive_or_unevictable(page + i, vma);
+               }
        } else {
-               inc_mm_counter(vma->vm_mm, mm_counter_file(page));
-               page_add_file_rmap(page, vma, false);
+               add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
+               folio_add_file_rmap(folio, start, n, vma);
+       }
+
+       for (i = 0; i < nr; i++) {
+               set_pte_at(vma->vm_mm, addr, vmf->pte + i, entry);
+                /* no need to invalidate: a not-present page won't be cached */
+                update_mmu_cache(vma, addr, vmf->pte + i);
+               addr += PAGE_SIZE;
+               entry += PTE_STRIDE;
        }
-       set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
 }
 
 static bool vmf_pte_changed(struct vm_fault *vmf)



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 4/5] mm: add do_set_pte_range()
  2023-02-01 17:38   ` Matthew Wilcox
@ 2023-02-01 21:59     ` Matthew Wilcox
  2023-02-02  3:18     ` Yin, Fengwei
  2023-02-03  8:54     ` Yin Fengwei
  2 siblings, 0 replies; 19+ messages in thread
From: Matthew Wilcox @ 2023-02-01 21:59 UTC (permalink / raw)
  To: Yin Fengwei; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang

On Wed, Feb 01, 2023 at 05:38:39PM +0000, Matthew Wilcox wrote:
> On Wed, Feb 01, 2023 at 04:17:36PM +0800, Yin Fengwei wrote:
> > do_set_pte_range() allows to setup page table entries for a
> > specific range. It calls page_add_file_rmap_range() to take
> > advantage of batched rmap update for large folio.
> 
> How about something more like this?  Yes, we need to define
> flush_icache_pages() and PTE_STRIDE.

Never mind about PTE_STRIDE.  I forgot that pte_t isn't an integer
type.  Instead, we'll want each architecture to define

/* This should be right for x86 */
static inline pte_next(pte_t pte)
{
	return __pte(pte_val(pte) + PAGE_SIZE);
}

> +       for (i = 0; i < nr; i++) {
> +               set_pte_at(vma->vm_mm, addr, vmf->pte + i, entry);
> +                /* no need to invalidate: a not-present page won't be cached */
> +                update_mmu_cache(vma, addr, vmf->pte + i);
> +               addr += PAGE_SIZE;
> +               entry += PTE_STRIDE;

		entry = pte_next(entry);



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 1/5] mm: Enable fault around for shared file page fault
  2023-02-01 14:34   ` Kirill A. Shutemov
@ 2023-02-02  1:54     ` Yin, Fengwei
  0 siblings, 0 replies; 19+ messages in thread
From: Yin, Fengwei @ 2023-02-02  1:54 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: willy, david, linux-mm, dave.hansen, tim.c.chen, ying.huang



On 2/1/2023 10:34 PM, Kirill A. Shutemov wrote:
> On Wed, Feb 01, 2023 at 04:17:33PM +0800, Yin Fengwei wrote:
>> The shared fault handler can also benefit from fault-around.  While it
>> is uncommon to write to MAP_SHARED files, applications that do will see
>> a huge benefit with will-it-scale:page_fault3 (shared file write fault)
>> improving by 375%.
>>
>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>> ---
>>  mm/memory.c | 11 +++++++++++
>>  1 file changed, 11 insertions(+)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 7a04a1130ec1..51c04bb60724 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4546,6 +4546,17 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
>>  	struct vm_area_struct *vma = vmf->vma;
>>  	vm_fault_t ret, tmp;
>>  
>> +	/*
>> +	 * Let's call ->map_pages() first and use ->fault() as fallback
>> +	 * if page by the offset is not ready to be mapped (cold cache or
>> +	 * something).
>> +	 */
>> +	if (should_fault_around(vmf)) {
>> +		ret = do_fault_around(vmf);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
> 
> I believe it bypasses ->page_mkwrite() completely, no?
> 
> So you get a writable PTEs without notifying the filesystem. Smells like a
> data loss.
Yes. You are right. This may be the reason why fault around is not enabled
for shared file write fault? I will drop this patch. Thanks.

Regards
Yin, Fengwei

> 
>>  	ret = __do_fault(vmf);
>>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>>  		return ret;
>> -- 
>> 2.30.2
>>
>>
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 3/5] rmap: add page_add_file_rmap_range()
  2023-02-01 17:32   ` Matthew Wilcox
@ 2023-02-02  2:00     ` Yin, Fengwei
  0 siblings, 0 replies; 19+ messages in thread
From: Yin, Fengwei @ 2023-02-02  2:00 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang



On 2/2/2023 1:32 AM, Matthew Wilcox wrote:
> On Wed, Feb 01, 2023 at 04:17:35PM +0800, Yin Fengwei wrote:
>>  /**
>> - * page_add_file_rmap - add pte mapping to a file page
>> - * @page:	the page to add the mapping to
>> + * page_add_file_rmap_range - add pte mapping to a sub page range of a folio
>> + * @folio:	The filio to add the mapping to
>> + * @start:	The first sub page index in folio
>> + * @nr_pages:	The number of sub pages from the first page
>>   * @vma:	the vm area in which the mapping is added
>>   * @compound:	charge the page as compound or small page
>>   *
>> + * The sub page range of folio is defined by
>> + * 	[first_sub_page, first_sub_page + nr_pages)
> 
> Lose the "sub" from all of this.  That's legacy thinking; pages are
> pages and folios are folios.  "subpages" was from when we were trying
> to use the word "page" for both "the allocation" and "the PAGE_SIZE
> range of bytes".
OK. Will remove sub in next version.

> 
>> + *
>>   * The caller needs to hold the pte lock.
>>   */
>> -void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
>> -		bool compound)
>> +void page_add_file_rmap_range(struct folio *folio, unsigned long start,
>> +			unsigned int nr_pages, struct vm_area_struct *vma,
>> +			bool compound)
> 
> I think this function needs to be called folio_add_file_rmap()
Yes. Maybe a followup patch after this series? Let me know if you want
this change in this series.

> 
> I'd like to lose the 'compound' parameter, and base it on nr_pages ==
> folio_nr_pages(), but that may be a step far just now.
Yes. I had a local change to remove if (folio_test_pmd_mappable(folio))
test (It's very close to removing 'compound'). I didn't include it in
this series. I prefer a follow up patch. Let me know if you want the
change in this series. Thanks.

Regards
Yin, Fengwei

> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 4/5] mm: add do_set_pte_range()
  2023-02-01 17:38   ` Matthew Wilcox
  2023-02-01 21:59     ` Matthew Wilcox
@ 2023-02-02  3:18     ` Yin, Fengwei
  2023-02-03  8:54     ` Yin Fengwei
  2 siblings, 0 replies; 19+ messages in thread
From: Yin, Fengwei @ 2023-02-02  3:18 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang



On 2/2/2023 1:38 AM, Matthew Wilcox wrote:
> On Wed, Feb 01, 2023 at 04:17:36PM +0800, Yin Fengwei wrote:
>> do_set_pte_range() allows to setup page table entries for a
>> specific range. It calls page_add_file_rmap_range() to take
>> advantage of batched rmap update for large folio.
> 
> How about something more like this?  Yes, we need to define
> flush_icache_pages() and PTE_STRIDE.
Yes. This could remove more duplicated operations here. Let me
try to add flush_icache_pages() in next version.

But for pte_next(), it needs be added to handle all architectures.
I suppose it will take long journey to make it. What about a
followup patch for pte_next()?


Regards
Yin, Fengwei

> 
> (we could also do for (i = 0; i < nr; i++) flush_icache_page(...) but
> given that some architectures already implement flush_icache_range(),
> I think they may appreciate being given one large range to flush)
> 
> +++ b/mm/memory.c
> @@ -4277,15 +4277,19 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page 
> *page)
>  }
>  #endif
>  
> -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
> +               unsigned int start, unsigned int nr,
> +               unsigned long addr)
>  {
> +       struct page *page = folio_page(page, start);
>         struct vm_area_struct *vma = vmf->vma;
>         bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
>         bool write = vmf->flags & FAULT_FLAG_WRITE;
>         bool prefault = vmf->address != addr;
>         pte_t entry;
> +       unsigned int i;
>  
> -       flush_icache_page(vma, page);
> +       flush_icache_pages(vma, page, nr);
>         entry = mk_pte(page, vma->vm_page_prot);
>  
>         if (prefault && arch_wants_old_prefaulted_pte())
> @@ -4299,14 +4303,23 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>                 entry = pte_mkuffd_wp(pte_wrprotect(entry));
>         /* copy-on-write page */
>         if (write && !(vma->vm_flags & VM_SHARED)) {
> -               inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> -               page_add_new_anon_rmap(page, vma, addr);
> -               lru_cache_add_inactive_or_unevictable(page, vma);
> +               add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr);
> +               for (i = 0; i < nr; i++) {
> +                       page_add_new_anon_rmap(page + i, vma, addr);
> +                       lru_cache_add_inactive_or_unevictable(page + i, vma);
> +               }
>         } else {
> -               inc_mm_counter(vma->vm_mm, mm_counter_file(page));
> -               page_add_file_rmap(page, vma, false);
> +               add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
> +               folio_add_file_rmap(folio, start, n, vma);
> +       }
> +
> +       for (i = 0; i < nr; i++) {
> +               set_pte_at(vma->vm_mm, addr, vmf->pte + i, entry);
> +                /* no need to invalidate: a not-present page won't be cached */
> +                update_mmu_cache(vma, addr, vmf->pte + i);
> +               addr += PAGE_SIZE;
> +               entry += PTE_STRIDE;
>         }
> -       set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
>  }
>  
>  static bool vmf_pte_changed(struct vm_fault *vmf)
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 5/5] filemap: batched update mm counter,rmap when map file folio
  2023-02-01 15:50   ` Matthew Wilcox
@ 2023-02-02  3:31     ` Yin, Fengwei
  2023-02-03 13:15     ` Yin, Fengwei
  1 sibling, 0 replies; 19+ messages in thread
From: Yin, Fengwei @ 2023-02-02  3:31 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang



On 2/1/2023 11:50 PM, Matthew Wilcox wrote:
> On Wed, Feb 01, 2023 at 04:17:37PM +0800, Yin Fengwei wrote:
>> Use do_set_pte_range() in filemap_map_folio_range(). Which
>> batched updates mm counter, rmap.
>>
>> With a self cooked will-it-scale.page_fault3 like app (change
>> file write fault to read fault) got 15% performance gain.
> 
> I'd suggest that you create a will-it-scale.page_fault4.  Anton
> is quite open to adding new variations of tests.
OK. I will do this.

> 
>> Perf data collected before/after the change:
>>   18.73%--page_add_file_rmap
>>           |
>>            --11.60%--__mod_lruvec_page_state
>>                      |
>>                      |--7.40%--__mod_memcg_lruvec_state
>>                      |          |
>>                      |           --5.58%--cgroup_rstat_updated
>>                      |
>>                       --2.53%--__mod_lruvec_state
>>                                 |
>>                                  --1.48%--__mod_node_page_state
>>
>>   9.93%--page_add_file_rmap_range
>>          |
>>           --2.67%--__mod_lruvec_page_state
>>                     |
>>                     |--1.95%--__mod_memcg_lruvec_state
>>                     |          |
>>                     |           --1.57%--cgroup_rstat_updated
>>                     |
>>                      --0.61%--__mod_lruvec_state
>>                                |
>>                                 --0.54%--__mod_node_page_state
>>
>> The running time of __mode_lruvec_page_state() is reduced a lot.
> 
> Nice.
> 
>> +++ b/mm/filemap.c
>> @@ -3364,11 +3364,22 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>>  	struct file *file = vma->vm_file;
>>  	struct page *page = folio_page(folio, start);
>>  	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
>> -	unsigned int ref_count = 0, count = 0;
>> +	unsigned int ref_count = 0, count = 0, nr_mapped = 0;
>>  
>>  	do {
>> -		if (PageHWPoison(page))
>> +		if (PageHWPoison(page)) {
>> +			if (nr_mapped) {
>> +				vmf->pte -= nr_mapped;
>> +				do_set_pte_range(vmf, folio,
>> +						start + count - nr_mapped,
>> +						addr - nr_mapped * PAGE_SIZE,
>> +						nr_mapped);
>> +
>> +			}
>> +
>> +			nr_mapped = 0;
>>  			continue;
>> +		}
> 
> Having subtracted nr_mapped from vmf->pte, we then need to add it again.
> 
> But this is all too complicated.  What if we don't update vmf->pte
> each time around the loop?  ie something like this:
Thanks a lot for the suggestion. I like vmf->pte[count] a lot. Will
update the code accordingly in next version.


Regards
Yin, Fengwei

> 
> 	do {
> 		if (PageHWPoisoned(page))
> 			goto map;
>  		if (mmap_miss > 0)
>  			mmap_miss--;
> 		/*
> 		 * If there're PTE markers, we'll leave them to be handled
> 		 * in the specific fault path, and it'll prohibit the
> 		 * fault-around logic.
> 		 */
> 		if (!pte_none(vmf->pte[count]))
> 			goto map;
>  		if (vmf->address == addr + count * PAGE_SIZE)
>  			ret = VM_FAULT_NOPAGE;
> 		continue;
> map:
> 		if (count > 1) {
> 			do_set_pte_range(vmf, folio, start, addr, count - 1);
> 			folio_ref_add(folio, count - 1);
> 		}
> 		start += count;
> 		vmf->pte += count;
> 		addr += count * PAGE_SIZE;
> 		nr_pages -= count;
> 		count = 0;
> 	} while (page++, ++count < nr_pages);
> 
> 	if (count > 0) {
> 		do_set_pte_range(vmf, folio, start, addr, count);
> 		folio_ref_add(folio, count);
> 	} else {
> 		/* Make sure the PTE points to the correct page table */
> 		vmf->pte--;
> 	}


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 4/5] mm: add do_set_pte_range()
  2023-02-01 17:38   ` Matthew Wilcox
  2023-02-01 21:59     ` Matthew Wilcox
  2023-02-02  3:18     ` Yin, Fengwei
@ 2023-02-03  8:54     ` Yin Fengwei
  2 siblings, 0 replies; 19+ messages in thread
From: Yin Fengwei @ 2023-02-03  8:54 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang

On 2/2/23 01:38, Matthew Wilcox wrote:
> On Wed, Feb 01, 2023 at 04:17:36PM +0800, Yin Fengwei wrote:
>> do_set_pte_range() allows to setup page table entries for a
>> specific range. It calls page_add_file_rmap_range() to take
>> advantage of batched rmap update for large folio.
> 
> How about something more like this?  Yes, we need to define
> flush_icache_pages() and PTE_STRIDE.
> 
> (we could also do for (i = 0; i < nr; i++) flush_icache_page(...) but
> given that some architectures already implement flush_icache_range(),
> I think they may appreciate being given one large range to flush)
For flush_icache_range() and flush_icache_page(), the implementation
on riscv could be exception.

According to arch/riscv/include/asm/cacheflush.h
   #define flush_icache_range(start, end) flush_icache_all()

There is no definition for flush_icache_page(). I suppose
flush_icache_page() does nothing on riscv.

Using flush_icache_range() may not be good choice for riscv.


Regards
Yin, Fengwei

> 
> +++ b/mm/memory.c
> @@ -4277,15 +4277,19 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page
> *page)
>   }
>   #endif
>   
> -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
> +               unsigned int start, unsigned int nr,
> +               unsigned long addr)
>   {
> +       struct page *page = folio_page(page, start);
>          struct vm_area_struct *vma = vmf->vma;
>          bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
>          bool write = vmf->flags & FAULT_FLAG_WRITE;
>          bool prefault = vmf->address != addr;
>          pte_t entry;
> +       unsigned int i;
>   
> -       flush_icache_page(vma, page);
> +       flush_icache_pages(vma, page, nr);
>          entry = mk_pte(page, vma->vm_page_prot);
>   
>          if (prefault && arch_wants_old_prefaulted_pte())
> @@ -4299,14 +4303,23 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>                  entry = pte_mkuffd_wp(pte_wrprotect(entry));
>          /* copy-on-write page */
>          if (write && !(vma->vm_flags & VM_SHARED)) {
> -               inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> -               page_add_new_anon_rmap(page, vma, addr);
> -               lru_cache_add_inactive_or_unevictable(page, vma);
> +               add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr);
> +               for (i = 0; i < nr; i++) {
> +                       page_add_new_anon_rmap(page + i, vma, addr);
> +                       lru_cache_add_inactive_or_unevictable(page + i, vma);
> +               }
>          } else {
> -               inc_mm_counter(vma->vm_mm, mm_counter_file(page));
> -               page_add_file_rmap(page, vma, false);
> +               add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
> +               folio_add_file_rmap(folio, start, n, vma);
> +       }
> +
> +       for (i = 0; i < nr; i++) {
> +               set_pte_at(vma->vm_mm, addr, vmf->pte + i, entry);
> +                /* no need to invalidate: a not-present page won't be cached */
> +                update_mmu_cache(vma, addr, vmf->pte + i);
> +               addr += PAGE_SIZE;
> +               entry += PTE_STRIDE;
>          }
> -       set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
>   }
>   
>   static bool vmf_pte_changed(struct vm_fault *vmf)
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH v2 5/5] filemap: batched update mm counter,rmap when map file folio
  2023-02-01 15:50   ` Matthew Wilcox
  2023-02-02  3:31     ` Yin, Fengwei
@ 2023-02-03 13:15     ` Yin, Fengwei
  1 sibling, 0 replies; 19+ messages in thread
From: Yin, Fengwei @ 2023-02-03 13:15 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang



On 2/1/2023 11:50 PM, Matthew Wilcox wrote:
> On Wed, Feb 01, 2023 at 04:17:37PM +0800, Yin Fengwei wrote:
>> Use do_set_pte_range() in filemap_map_folio_range(). Which
>> batched updates mm counter, rmap.
>>
>> With a self cooked will-it-scale.page_fault3 like app (change
>> file write fault to read fault) got 15% performance gain.
> 
> I'd suggest that you create a will-it-scale.page_fault4.  Anton
> is quite open to adding new variations of tests.
> 
>> Perf data collected before/after the change:
>>   18.73%--page_add_file_rmap
>>           |
>>            --11.60%--__mod_lruvec_page_state
>>                      |
>>                      |--7.40%--__mod_memcg_lruvec_state
>>                      |          |
>>                      |           --5.58%--cgroup_rstat_updated
>>                      |
>>                       --2.53%--__mod_lruvec_state
>>                                 |
>>                                  --1.48%--__mod_node_page_state
>>
>>   9.93%--page_add_file_rmap_range
>>          |
>>           --2.67%--__mod_lruvec_page_state
>>                     |
>>                     |--1.95%--__mod_memcg_lruvec_state
>>                     |          |
>>                     |           --1.57%--cgroup_rstat_updated
>>                     |
>>                      --0.61%--__mod_lruvec_state
>>                                |
>>                                 --0.54%--__mod_node_page_state
>>
>> The running time of __mode_lruvec_page_state() is reduced a lot.
> 
> Nice.
> 
>> +++ b/mm/filemap.c
>> @@ -3364,11 +3364,22 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>>  	struct file *file = vma->vm_file;
>>  	struct page *page = folio_page(folio, start);
>>  	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
>> -	unsigned int ref_count = 0, count = 0;
>> +	unsigned int ref_count = 0, count = 0, nr_mapped = 0;
>>  
>>  	do {
>> -		if (PageHWPoison(page))
>> +		if (PageHWPoison(page)) {
>> +			if (nr_mapped) {
>> +				vmf->pte -= nr_mapped;
>> +				do_set_pte_range(vmf, folio,
>> +						start + count - nr_mapped,
>> +						addr - nr_mapped * PAGE_SIZE,
>> +						nr_mapped);
>> +
>> +			}
>> +
>> +			nr_mapped = 0;
>>  			continue;
>> +		}
> 
> Having subtracted nr_mapped from vmf->pte, we then need to add it again.
> 
> But this is all too complicated.  What if we don't update vmf->pte
> each time around the loop?  ie something like this:
> 
> 	do {
> 		if (PageHWPoisoned(page))
> 			goto map;
>  		if (mmap_miss > 0)
>  			mmap_miss--;
> 		/*
> 		 * If there're PTE markers, we'll leave them to be handled
> 		 * in the specific fault path, and it'll prohibit the
> 		 * fault-around logic.
> 		 */
> 		if (!pte_none(vmf->pte[count]))
> 			goto map;
>  		if (vmf->address == addr + count * PAGE_SIZE)
>  			ret = VM_FAULT_NOPAGE;
> 		continue;
> map:
> 		if (count > 1) {
> 			do_set_pte_range(vmf, folio, start, addr, count - 1);
> 			folio_ref_add(folio, count - 1);
> 		}
> 		start += count;
> 		vmf->pte += count;
> 		addr += count * PAGE_SIZE;
> 		nr_pages -= count;
> 		count = 0;
I found it's not easy to make this part correct. So I tried to simplify the 
logic here a little bit.


Regards
Yin, Fengwei

> 	} while (page++, ++count < nr_pages);
> 
> 	if (count > 0) {
> 		do_set_pte_range(vmf, folio, start, addr, count);
> 		folio_ref_add(folio, count);
> 	} else {
> 		/* Make sure the PTE points to the correct page table */
> 		vmf->pte--;
> 	}


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2023-02-03 13:16 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-01  8:17 [RFC PATCH v2 0/5] folio based filemap_map_pages() Yin Fengwei
2023-02-01  8:17 ` [RFC PATCH v2 1/5] mm: Enable fault around for shared file page fault Yin Fengwei
2023-02-01 14:34   ` Kirill A. Shutemov
2023-02-02  1:54     ` Yin, Fengwei
2023-02-01  8:17 ` [RFC PATCH v2 2/5] filemap: add function filemap_map_folio_range() Yin Fengwei
2023-02-01  8:17 ` [RFC PATCH v2 3/5] rmap: add page_add_file_rmap_range() Yin Fengwei
2023-02-01 17:32   ` Matthew Wilcox
2023-02-02  2:00     ` Yin, Fengwei
2023-02-01  8:17 ` [RFC PATCH v2 4/5] mm: add do_set_pte_range() Yin Fengwei
2023-02-01  9:09   ` David Hildenbrand
2023-02-01 10:04     ` Yin, Fengwei
2023-02-01 17:38   ` Matthew Wilcox
2023-02-01 21:59     ` Matthew Wilcox
2023-02-02  3:18     ` Yin, Fengwei
2023-02-03  8:54     ` Yin Fengwei
2023-02-01  8:17 ` [RFC PATCH v2 5/5] filemap: batched update mm counter,rmap when map file folio Yin Fengwei
2023-02-01 15:50   ` Matthew Wilcox
2023-02-02  3:31     ` Yin, Fengwei
2023-02-03 13:15     ` Yin, Fengwei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox