linux-mm.kvack.org archive mirror
* [RFC PATCH v4 0/4] folio based filemap_map_pages()
@ 2023-02-06 14:06 Yin Fengwei
  2023-02-06 14:06 ` [RFC PATCH v4 1/4] filemap: add function filemap_map_folio_range() Yin Fengwei
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Yin Fengwei @ 2023-02-06 14:06 UTC (permalink / raw)
  To: willy, david, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang, fengwei.yin

The current filemap_map_pages() works at page granularity even when
the underlying folio is a large folio. Making it work at folio
granularity allows batched refcount, rmap and mm counter updates,
which brings a performance gain.
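
As a simplified illustration only (not code from the patches;
map_one_page() is a placeholder for the per-page PTE setup), the
batching turns one refcount update per page into one update per
mapped range:

	/* page granularity: one atomic refcount bump per page */
	for (i = 0; i < nr; i++) {
		folio_ref_inc(folio);
		map_one_page(folio_page(folio, i));
	}

	/* folio granularity: map the range, then batch the refcount */
	for (i = 0; i < nr; i++)
		map_one_page(folio_page(folio, i));
	folio_ref_add(folio, nr);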

This series brings batched refcount, rmap and mm counter updates to
filemap_map_pages(). Testing with a micro-benchmark like
will-it-scale.pagefault on a 48C/96T Ice Lake box showed:
   - batched rmap brings around a 15% performance gain
   - batched refcount brings around a 2% performance gain

Patch 1 updates filemap_map_pages() to map at folio granularity
        with batched refcount updates.

Patches 2, 3 and 4 enable batched rmap and mm counter updates.

Changes from v3:
  Patch 1:
    - Delete folio_more_pages() as nothing calls it any more, as
      suggested by Matthew
    - Make xas.xa_index point to end_pgoff when the loop ends, as
      Kirill pointed out
    - Style fixes as pointed out by Matthew and Kirill

  Patch 2:
    - Typo fixes and better comments, as Matthew and Kirill suggested

  Patch 3:
    - Warn on the COW case in do_set_pte_range() and add a comment
      noting that COW cannot handle large folios yet


Changes from v2:
  - Drop patch 1 because it misses ->page_mkwrite(), as Kirill
    pointed out.

  Patch 2:
    - Change page_add_file_rmap_range() to folio_add_file_rmap_range()
      as Matthew suggested
    - Remove "sub" as Matthew suggested

  Patch 3:
    - Only handle the !cow case in do_set_pte_range(), as David pointed out
    - Add a pte parameter and avoid changing vmf->pte in do_set_pte_range()
    - Drop do_set_pte_entry()

  Patch 4:
    - Adopt Matthew's suggestion to avoid subtracting from/adding to
      vmf->pte. filemap_map_folio_range() no longer updates vmf->pte,
      which makes it fit filemap_map_pages() more easily and leaves no
      way to point at the wrong page table entry.


Changes from v1:
  - Update the struct page * parameter of the *_range() functions to an
    index into the folio, as Matthew suggested
  - Fix indentation problems, as Matthew pointed out
  - Add back the function comment, as Matthew pointed out
  - Use nr_pages rather than len, as Matthew pointed out
  - Add do_set_pte_range(), as Matthew suggested
  - Add function comments, as Ying suggested
  - Add performance test results to the commit messages of patches 1/2/5

  Patch 1:
    - Adapt the commit message as Matthew suggested
    - Add Reviewed-by from Matthew
  Patch 3:
    - Restore the general logic of page_add_file_rmap_range() to
      make patch review easier, as Matthew suggested
  Patch 5:
    - Add collected perf data to explain the source of the
      performance gain


Yin Fengwei (4):
  filemap: add function filemap_map_folio_range()
  rmap: add folio_add_file_rmap_range()
  mm: add do_set_pte_range()
  filemap: batched update mm counter,rmap when map file folio

 include/linux/mm.h   |   3 ++
 include/linux/rmap.h |   2 +
 mm/filemap.c         | 112 ++++++++++++++++++++++++++-----------------
 mm/memory.c          |  66 +++++++++++++++++--------
 mm/rmap.c            |  66 +++++++++++++++++++------
 5 files changed, 171 insertions(+), 78 deletions(-)

-- 
2.30.2




* [RFC PATCH v4 1/4] filemap: add function filemap_map_folio_range()
  2023-02-06 14:06 [RFC PATCH v4 0/4] folio based filemap_map_pages() Yin Fengwei
@ 2023-02-06 14:06 ` Yin Fengwei
  2023-02-06 14:06 ` [RFC PATCH v4 2/4] rmap: add folio_add_file_rmap_range() Yin Fengwei
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Yin Fengwei @ 2023-02-06 14:06 UTC (permalink / raw)
  To: willy, david, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang, fengwei.yin

filemap_map_folio_range() maps a partial or full folio. Compared to
the original filemap_map_pages(), it batches the refcount updates and
gets a minor performance improvement for large folios.

With a will-it-scale.page_fault3-like app (file write fault testing
changed to read fault testing; being upstreamed to will-it-scale
at [1]), a 2% performance gain was observed on a 48C/96T Cascade Lake
test box with 96 processes running against xfs.

[1]: https://github.com/antonblanchard/will-it-scale/pull/37

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 mm/filemap.c | 98 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 54 insertions(+), 44 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 992554c18f1f..1c37376fc8d5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2200,16 +2200,6 @@ unsigned filemap_get_folios(struct address_space *mapping, pgoff_t *start,
 }
 EXPORT_SYMBOL(filemap_get_folios);
 
-static inline
-bool folio_more_pages(struct folio *folio, pgoff_t index, pgoff_t max)
-{
-	if (!folio_test_large(folio) || folio_test_hugetlb(folio))
-		return false;
-	if (index >= max)
-		return false;
-	return index < folio->index + folio_nr_pages(folio) - 1;
-}
-
 /**
  * filemap_get_folios_contig - Get a batch of contiguous folios
  * @mapping:	The address_space to search
@@ -3351,6 +3341,53 @@ static inline struct folio *next_map_page(struct address_space *mapping,
 				  mapping, xas, end_pgoff);
 }
 
+/*
+ * Map page range [start_page, start_page + nr_pages) of folio.
+ * start_page is gotten from start by folio_page(folio, start)
+ */
+static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
+			struct folio *folio, unsigned long start,
+			unsigned long addr, unsigned int nr_pages)
+{
+	vm_fault_t ret = 0;
+	struct vm_area_struct *vma = vmf->vma;
+	struct file *file = vma->vm_file;
+	struct page *page = folio_page(folio, start);
+	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
+	unsigned int ref_count = 0, count = 0;
+
+	do {
+		if (PageHWPoison(page))
+			continue;
+
+		if (mmap_miss > 0)
+			mmap_miss--;
+
+		/*
+		 * NOTE: If there're PTE markers, we'll leave them to be
+		 * handled in the specific fault path, and it'll prohibit the
+		 * fault-around logic.
+		 */
+		if (!pte_none(*vmf->pte))
+			continue;
+
+		if (vmf->address == addr)
+			ret = VM_FAULT_NOPAGE;
+
+		ref_count++;
+		do_set_pte(vmf, page, addr);
+		update_mmu_cache(vma, addr, vmf->pte);
+	} while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
+
+	/* Restore the vmf->pte */
+	vmf->pte -= nr_pages;
+
+	folio_ref_add(folio, ref_count);
+	WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
+
+	return ret;
+}
+
 vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 			     pgoff_t start_pgoff, pgoff_t end_pgoff)
 {
@@ -3361,9 +3398,9 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	unsigned long addr;
 	XA_STATE(xas, &mapping->i_pages, start_pgoff);
 	struct folio *folio;
-	struct page *page;
 	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
 	vm_fault_t ret = 0;
+	int nr_pages = 0;
 
 	rcu_read_lock();
 	folio = first_map_page(mapping, &xas, end_pgoff);
@@ -3378,45 +3415,18 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
 	do {
-again:
-		page = folio_file_page(folio, xas.xa_index);
-		if (PageHWPoison(page))
-			goto unlock;
-
-		if (mmap_miss > 0)
-			mmap_miss--;
+		unsigned long end;
 
 		addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
 		vmf->pte += xas.xa_index - last_pgoff;
 		last_pgoff = xas.xa_index;
+		end = folio->index + folio_nr_pages(folio) - 1;
+		nr_pages = min(end, end_pgoff) - xas.xa_index + 1;
 
-		/*
-		 * NOTE: If there're PTE markers, we'll leave them to be
-		 * handled in the specific fault path, and it'll prohibit the
-		 * fault-around logic.
-		 */
-		if (!pte_none(*vmf->pte))
-			goto unlock;
-
-		/* We're about to handle the fault */
-		if (vmf->address == addr)
-			ret = VM_FAULT_NOPAGE;
+		ret |= filemap_map_folio_range(vmf, folio,
+				xas.xa_index - folio->index, addr, nr_pages);
+		xas.xa_index += nr_pages;
 
-		do_set_pte(vmf, page, addr);
-		/* no need to invalidate: a not-present page won't be cached */
-		update_mmu_cache(vma, addr, vmf->pte);
-		if (folio_more_pages(folio, xas.xa_index, end_pgoff)) {
-			xas.xa_index++;
-			folio_ref_inc(folio);
-			goto again;
-		}
-		folio_unlock(folio);
-		continue;
-unlock:
-		if (folio_more_pages(folio, xas.xa_index, end_pgoff)) {
-			xas.xa_index++;
-			goto again;
-		}
 		folio_unlock(folio);
 		folio_put(folio);
 	} while ((folio = next_map_page(mapping, &xas, end_pgoff)) != NULL);
-- 
2.30.2




* [RFC PATCH v4 2/4] rmap: add folio_add_file_rmap_range()
  2023-02-06 14:06 [RFC PATCH v4 0/4] folio based filemap_map_pages() Yin Fengwei
  2023-02-06 14:06 ` [RFC PATCH v4 1/4] filemap: add function filemap_map_folio_range() Yin Fengwei
@ 2023-02-06 14:06 ` Yin Fengwei
  2023-02-06 14:06 ` [RFC PATCH v4 3/4] mm: add do_set_pte_range() Yin Fengwei
  2023-02-06 14:06 ` [RFC PATCH v4 4/4] filemap: batched update mm counter,rmap when map file folio Yin Fengwei
  3 siblings, 0 replies; 17+ messages in thread
From: Yin Fengwei @ 2023-02-06 14:06 UTC (permalink / raw)
  To: willy, david, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang, fengwei.yin

folio_add_file_rmap_range() adds a pte mapping for a specific range
of pages of a file folio. Compared to page_add_file_rmap(), it batches
the __lruvec_stat updates for large folios.
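
A minimal usage sketch (mirroring how patch 3 of this series calls it;
folio, start, nr and vma stand for the caller's context, and the pte
lock is held):

	/* Map pages [start, start + nr) of a file-backed folio as
	 * small (non-compound) ptes and account them in one go.
	 */
	folio_add_file_rmap_range(folio, start, nr, vma, false);
	add_mm_counter(vma->vm_mm,
		       mm_counter_file(folio_page(folio, start)), nr);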

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 include/linux/rmap.h |  2 ++
 mm/rmap.c            | 66 ++++++++++++++++++++++++++++++++++----------
 2 files changed, 54 insertions(+), 14 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index a4570da03e58..974124b41fee 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -198,6 +198,8 @@ void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
 		unsigned long address);
 void page_add_file_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
+void folio_add_file_rmap_range(struct folio *, unsigned long start,
+		unsigned int nr_pages, struct vm_area_struct *, bool compound);
 void page_remove_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 8287f2cc327d..c07c4eef3df2 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1303,31 +1303,44 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 }
 
 /**
- * page_add_file_rmap - add pte mapping to a file page
- * @page:	the page to add the mapping to
+ * folio_add_file_rmap_range - add pte mapping to page range of a folio
+ * @folio:	The folio to add the mapping to
+ * @start:	The first page number in folio
+ * @nr_pages:	The number of pages which will be mapped
  * @vma:	the vm area in which the mapping is added
  * @compound:	charge the page as compound or small page
  *
+ * The page range of folio is defined by [first_page, first_page + nr_pages)
+ *
  * The caller needs to hold the pte lock.
  */
-void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
-		bool compound)
+void folio_add_file_rmap_range(struct folio *folio, unsigned long start,
+			unsigned int nr_pages, struct vm_area_struct *vma,
+			bool compound)
 {
-	struct folio *folio = page_folio(page);
 	atomic_t *mapped = &folio->_nr_pages_mapped;
-	int nr = 0, nr_pmdmapped = 0;
-	bool first;
+	unsigned int nr_pmdmapped = 0, first;
+	int nr = 0;
 
-	VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
+	VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio);
 
 	/* Is page being mapped by PTE? Is this its first map to be added? */
 	if (likely(!compound)) {
-		first = atomic_inc_and_test(&page->_mapcount);
-		nr = first;
-		if (first && folio_test_large(folio)) {
-			nr = atomic_inc_return_relaxed(mapped);
-			nr = (nr < COMPOUND_MAPPED);
-		}
+		struct page *page = folio_page(folio, start);
+
+		nr_pages = min_t(unsigned int, nr_pages,
+					folio_nr_pages(folio) - start);
+
+		do {
+			first = atomic_inc_and_test(&page->_mapcount);
+			if (first && folio_test_large(folio)) {
+				first = atomic_inc_return_relaxed(mapped);
+				first = (nr < COMPOUND_MAPPED);
+			}
+
+			if (first)
+				nr++;
+		} while (page++, --nr_pages > 0);
 	} else if (folio_test_pmd_mappable(folio)) {
 		/* That test is redundant: it's for safety or to optimize out */
 
@@ -1356,6 +1369,31 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
 	mlock_vma_folio(folio, vma, compound);
 }
 
+/**
+ * page_add_file_rmap - add pte mapping to a file page
+ * @page:	the page to add the mapping to
+ * @vma:	the vm area in which the mapping is added
+ * @compound:	charge the page as compound or small page
+ *
+ * The caller needs to hold the pte lock.
+ */
+void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
+		bool compound)
+{
+	struct folio *folio = page_folio(page);
+	unsigned int nr_pages;
+
+	VM_WARN_ON_ONCE_PAGE(compound && !PageTransHuge(page), page);
+
+	if (likely(!compound))
+		nr_pages = 1;
+	else
+		nr_pages = folio_nr_pages(folio);
+
+	folio_add_file_rmap_range(folio, folio_page_idx(folio, page),
+			nr_pages, vma, compound);
+}
+
 /**
  * page_remove_rmap - take down pte mapping from a page
  * @page:	page to remove mapping from
-- 
2.30.2




* [RFC PATCH v4 3/4] mm: add do_set_pte_range()
  2023-02-06 14:06 [RFC PATCH v4 0/4] folio based filemap_map_pages() Yin Fengwei
  2023-02-06 14:06 ` [RFC PATCH v4 1/4] filemap: add function filemap_map_folio_range() Yin Fengwei
  2023-02-06 14:06 ` [RFC PATCH v4 2/4] rmap: add folio_add_file_rmap_range() Yin Fengwei
@ 2023-02-06 14:06 ` Yin Fengwei
  2023-02-06 14:44   ` Matthew Wilcox
  2023-02-06 14:06 ` [RFC PATCH v4 4/4] filemap: batched update mm counter,rmap when map file folio Yin Fengwei
  3 siblings, 1 reply; 17+ messages in thread
From: Yin Fengwei @ 2023-02-06 14:06 UTC (permalink / raw)
  To: willy, david, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang, fengwei.yin

do_set_pte_range() sets up page table entries for a specific range
of pages of a folio. It calls folio_add_file_rmap_range() to take
advantage of the batched rmap update for large folios.
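
A short sketch of the calling convention, taken from the diff below
and from patch 4 (pte/addr point at the first entry/address of the
range, and the pte lock is held):

	/* Set up nr ptes starting at pte/addr, backed by pages
	 * [start, start + nr) of folio.
	 */
	do_set_pte_range(vmf, folio, addr, pte, start, nr);

	/* do_set_pte() becomes a wrapper mapping a single page: */
	do_set_pte_range(vmf, folio, addr, vmf->pte,
			folio_page_idx(folio, page), 1);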

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 include/linux/mm.h |  3 +++
 mm/filemap.c       |  1 -
 mm/memory.c        | 66 ++++++++++++++++++++++++++++++++--------------
 3 files changed, 49 insertions(+), 21 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d6f8f41514cc..93192f04b276 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1162,6 +1162,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 
 vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
 void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
+void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
+		unsigned long addr, pte_t *pte,
+		unsigned long start, unsigned int nr);
 
 vm_fault_t finish_fault(struct vm_fault *vmf);
 vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
diff --git a/mm/filemap.c b/mm/filemap.c
index 1c37376fc8d5..6f110b9e5d27 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3376,7 +3376,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 
 		ref_count++;
 		do_set_pte(vmf, page, addr);
-		update_mmu_cache(vma, addr, vmf->pte);
 	} while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
 
 	/* Restore the vmf->pte */
diff --git a/mm/memory.c b/mm/memory.c
index 7a04a1130ec1..51f8bd91d9f0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4257,36 +4257,65 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 }
 #endif
 
-void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
+void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
+		unsigned long addr, pte_t *pte,
+		unsigned long start, unsigned int nr)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
+	bool cow = write && !(vma->vm_flags & VM_SHARED);
 	bool prefault = vmf->address != addr;
+	struct page *page = folio_page(folio, start);
 	pte_t entry;
 
-	flush_icache_page(vma, page);
-	entry = mk_pte(page, vma->vm_page_prot);
+	if (!cow) {
+		folio_add_file_rmap_range(folio, start, nr, vma, false);
+		add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
+	} else {
+		/*
+		 * rmap code is not ready to handle COW with anonymous
+		 * large folio yet. Capture and warn if large folio
+		 * is given.
+		 */
+		VM_WARN_ON_FOLIO(folio_test_large(folio), folio);
+	}
 
-	if (prefault && arch_wants_old_prefaulted_pte())
-		entry = pte_mkold(entry);
-	else
-		entry = pte_sw_mkyoung(entry);
+	do {
+		flush_icache_page(vma, page);
+		entry = mk_pte(page, vma->vm_page_prot);
 
-	if (write)
-		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-	if (unlikely(uffd_wp))
-		entry = pte_mkuffd_wp(entry);
-	/* copy-on-write page */
-	if (write && !(vma->vm_flags & VM_SHARED)) {
+		if (prefault && arch_wants_old_prefaulted_pte())
+			entry = pte_mkold(entry);
+		else
+			entry = pte_sw_mkyoung(entry);
+
+		if (write)
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		if (unlikely(uffd_wp))
+			entry = pte_mkuffd_wp(entry);
+		set_pte_at(vma->vm_mm, addr, pte, entry);
+
+		/* no need to invalidate: a not-present page won't be cached */
+		update_mmu_cache(vma, addr, pte);
+	} while (pte++, page++, addr += PAGE_SIZE, --nr > 0);
+}
+
+void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
+{
+	struct folio *folio = page_folio(page);
+	struct vm_area_struct *vma = vmf->vma;
+	bool cow = (vmf->flags & FAULT_FLAG_WRITE) &&
+			!(vma->vm_flags & VM_SHARED);
+
+	if (cow) {
 		inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
 		page_add_new_anon_rmap(page, vma, addr);
 		lru_cache_add_inactive_or_unevictable(page, vma);
-	} else {
-		inc_mm_counter(vma->vm_mm, mm_counter_file(page));
-		page_add_file_rmap(page, vma, false);
 	}
-	set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
+
+	do_set_pte_range(vmf, folio, addr, vmf->pte,
+			folio_page_idx(folio, page), 1);
 }
 
 static bool vmf_pte_changed(struct vm_fault *vmf)
@@ -4361,9 +4390,6 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 	if (likely(!vmf_pte_changed(vmf))) {
 		do_set_pte(vmf, page, vmf->address);
 
-		/* no need to invalidate: a not-present page won't be cached */
-		update_mmu_cache(vma, vmf->address, vmf->pte);
-
 		ret = 0;
 	} else {
 		update_mmu_tlb(vma, vmf->address, vmf->pte);
-- 
2.30.2




* [RFC PATCH v4 4/4] filemap: batched update mm counter,rmap when map file folio
  2023-02-06 14:06 [RFC PATCH v4 0/4] folio based filemap_map_pages() Yin Fengwei
                   ` (2 preceding siblings ...)
  2023-02-06 14:06 ` [RFC PATCH v4 3/4] mm: add do_set_pte_range() Yin Fengwei
@ 2023-02-06 14:06 ` Yin Fengwei
  2023-02-06 14:34   ` Matthew Wilcox
  3 siblings, 1 reply; 17+ messages in thread
From: Yin Fengwei @ 2023-02-06 14:06 UTC (permalink / raw)
  To: willy, david, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang, fengwei.yin

Use do_set_pte_range() in filemap_map_folio_range(), which batches
the mm counter and rmap updates for large folios.

With a will-it-scale.page_fault3-like app (file write fault testing
changed to read fault testing; being upstreamed to will-it-scale
at [1]), a 15% performance gain was observed on a 48C/96T Cascade Lake
test box with 96 processes running against xfs.

Perf data collected before/after the change:
  18.73%--page_add_file_rmap
          |
           --11.60%--__mod_lruvec_page_state
                     |
                     |--7.40%--__mod_memcg_lruvec_state
                     |          |
                     |           --5.58%--cgroup_rstat_updated
                     |
                      --2.53%--__mod_lruvec_state
                                |
                                 --1.48%--__mod_node_page_state

  9.93%--page_add_file_rmap_range
         |
          --2.67%--__mod_lruvec_page_state
                    |
                    |--1.95%--__mod_memcg_lruvec_state
                    |          |
                    |           --1.57%--cgroup_rstat_updated
                    |
                     --0.61%--__mod_lruvec_state
                               |
                                --0.54%--__mod_node_page_state

The time spent in __mod_lruvec_page_state() is reduced by about 9
percentage points (from 11.60% to 2.67% in the profiles above).

[1]: https://github.com/antonblanchard/will-it-scale/pull/37

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 mm/filemap.c | 35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 6f110b9e5d27..4452361e8858 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3354,11 +3354,12 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 	struct file *file = vma->vm_file;
 	struct page *page = folio_page(folio, start);
 	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
-	unsigned int ref_count = 0, count = 0;
+	unsigned int mapped = 0;
+	pte_t *pte = vmf->pte;
 
 	do {
 		if (PageHWPoison(page))
-			continue;
+			goto map;
 
 		if (mmap_miss > 0)
 			mmap_miss--;
@@ -3368,20 +3369,34 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 		 * handled in the specific fault path, and it'll prohibit the
 		 * fault-around logic.
 		 */
-		if (!pte_none(*vmf->pte))
-			continue;
+		if (!pte_none(pte[mapped]))
+			goto map;
 
 		if (vmf->address == addr)
 			ret = VM_FAULT_NOPAGE;
 
-		ref_count++;
-		do_set_pte(vmf, page, addr);
-	} while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
+		mapped++;
+		continue;
 
-	/* Restore the vmf->pte */
-	vmf->pte -= nr_pages;
+map:
+		if (mapped) {
+			do_set_pte_range(vmf, folio, addr, pte, start, mapped);
+			folio_ref_add(folio, mapped);
+		}
+
+		/* advance 1 to jump over the HWPoison or !pte_none entry */
+		mapped++;
+		start += mapped;
+		pte += mapped;
+		addr += mapped * PAGE_SIZE;
+		mapped = 0;
+	} while (page++, --nr_pages > 0);
+
+	if (mapped) {
+		do_set_pte_range(vmf, folio, addr, pte, start, mapped);
+		folio_ref_add(folio, mapped);
+	}
 
-	folio_ref_add(folio, ref_count);
 	WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
 
 	return ret;
-- 
2.30.2




* Re: [RFC PATCH v4 4/4] filemap: batched update mm counter,rmap when map file folio
  2023-02-06 14:06 ` [RFC PATCH v4 4/4] filemap: batched update mm counter,rmap when map file folio Yin Fengwei
@ 2023-02-06 14:34   ` Matthew Wilcox
  2023-02-06 15:03     ` Yin, Fengwei
  0 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2023-02-06 14:34 UTC (permalink / raw)
  To: Yin Fengwei; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang

On Mon, Feb 06, 2023 at 10:06:39PM +0800, Yin Fengwei wrote:
> @@ -3354,11 +3354,12 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>  	struct file *file = vma->vm_file;
>  	struct page *page = folio_page(folio, start);
>  	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
> -	unsigned int ref_count = 0, count = 0;
> +	unsigned int mapped = 0;
> +	pte_t *pte = vmf->pte;
>  
>  	do {
>  		if (PageHWPoison(page))
> -			continue;
> +			goto map;
>  
>  		if (mmap_miss > 0)
>  			mmap_miss--;
> @@ -3368,20 +3369,34 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>  		 * handled in the specific fault path, and it'll prohibit the
>  		 * fault-around logic.
>  		 */
> -		if (!pte_none(*vmf->pte))
> -			continue;
> +		if (!pte_none(pte[mapped]))
> +			goto map;

I see what you're trying to do here, but do_set_pte_range() uses the
pte from vmf->pte.  Perhaps best to save it at the beginning of the
function and restore it at the end.  ie:

	pte_t *old_ptep = vmf->pte;

	} while (vmf->pte++);

	vmf->pte = old_ptep;

The only other thing that bugs me about this patch is the use of
'mapped' as the variable name.  It's in the past tense, not the future
tense, so it gives me the wrong impression about what it's counting.
While we could use 'tomap', that's kind of clunky.  'count' or even just
'i' would be better.



* Re: [RFC PATCH v4 3/4] mm: add do_set_pte_range()
  2023-02-06 14:06 ` [RFC PATCH v4 3/4] mm: add do_set_pte_range() Yin Fengwei
@ 2023-02-06 14:44   ` Matthew Wilcox
  2023-02-06 14:58     ` Yin, Fengwei
  0 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2023-02-06 14:44 UTC (permalink / raw)
  To: Yin Fengwei; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang

On Mon, Feb 06, 2023 at 10:06:38PM +0800, Yin Fengwei wrote:
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d6f8f41514cc..93192f04b276 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1162,6 +1162,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>  
>  vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
>  void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
> +		unsigned long addr, pte_t *pte,
> +		unsigned long start, unsigned int nr);

There are only two callers of do_set_pte(), and they're both in mm.
I don't think we should retain do_set_pte() as a wrapper, but rather
change both callers to call 'set_pte_range()'.  The 'do' doesn't add
any value, so let's drop that word.

> +	if (!cow) {
> +		folio_add_file_rmap_range(folio, start, nr, vma, false);
> +		add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
> +	} else {
> +		/*
> +		 * rmap code is not ready to handle COW with anonymous
> +		 * large folio yet. Capture and warn if large folio
> +		 * is given.
> +		 */
> +		VM_WARN_ON_FOLIO(folio_test_large(folio), folio);
> +	}

The handling of cow pages is still very clunky.
folio_add_new_anon_rmap() handles anonymous large folios just fine.  I
think David was looking at current code, not the code in mm-next.

> +		set_pte_at(vma->vm_mm, addr, pte, entry);
> +
> +		/* no need to invalidate: a not-present page won't be cached */
> +		update_mmu_cache(vma, addr, pte);
> +	} while (pte++, page++, addr += PAGE_SIZE, --nr > 0);

There's no need to speed-run this.  Let's do it properly and get the
arch interface right.  This code isn't going to hit linux-next for four
more weeks, which is plenty of time.



* Re: [RFC PATCH v4 3/4] mm: add do_set_pte_range()
  2023-02-06 14:44   ` Matthew Wilcox
@ 2023-02-06 14:58     ` Yin, Fengwei
  2023-02-06 15:13       ` David Hildenbrand
  0 siblings, 1 reply; 17+ messages in thread
From: Yin, Fengwei @ 2023-02-06 14:58 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang



On 2/6/2023 10:44 PM, Matthew Wilcox wrote:
> On Mon, Feb 06, 2023 at 10:06:38PM +0800, Yin Fengwei wrote:
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index d6f8f41514cc..93192f04b276 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1162,6 +1162,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>>  
>>  vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
>>  void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
>> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
>> +		unsigned long addr, pte_t *pte,
>> +		unsigned long start, unsigned int nr);
> 
> There are only two callers of do_set_pte(), and they're both in mm.
> I don't think we should retain do_set_pte() as a wrapper, but rather
> change both callers to call 'set_pte_range()'.  The 'do' doesn't add
> any value, so let's drop that word.
OK.

> 
>> +	if (!cow) {
>> +		folio_add_file_rmap_range(folio, start, nr, vma, false);
>> +		add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
>> +	} else {
>> +		/*
>> +		 * rmap code is not ready to handle COW with anonymous
>> +		 * large folio yet. Capture and warn if large folio
>> +		 * is given.
>> +		 */
>> +		VM_WARN_ON_FOLIO(folio_test_large(folio), folio);
>> +	}
> 
> The handling of cow pages is still very clunky.
> folio_add_new_anon_rmap() handles anonymous large folios just fine.  I
> think David was looking at current code, not the code in mm-next.
OK. Let's wait for further comment from David.

> 
>> +		set_pte_at(vma->vm_mm, addr, pte, entry);
>> +
>> +		/* no need to invalidate: a not-present page won't be cached */
>> +		update_mmu_cache(vma, addr, pte);
>> +	} while (pte++, page++, addr += PAGE_SIZE, --nr > 0);
> 
> There's no need to speed-run this.  Let's do it properly and get the
Sorry. I don't get your point here. Do you mean the pte_next()? Or other
things?


Regards
Yin, Fengwei

> arch interface right.  This code isn't going to hit linux-next for four
> more weeks, which is plenty of time.



* Re: [RFC PATCH v4 4/4] filemap: batched update mm counter,rmap when map file folio
  2023-02-06 14:34   ` Matthew Wilcox
@ 2023-02-06 15:03     ` Yin, Fengwei
  0 siblings, 0 replies; 17+ messages in thread
From: Yin, Fengwei @ 2023-02-06 15:03 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang



On 2/6/2023 10:34 PM, Matthew Wilcox wrote:
> On Mon, Feb 06, 2023 at 10:06:39PM +0800, Yin Fengwei wrote:
>> @@ -3354,11 +3354,12 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>>  	struct file *file = vma->vm_file;
>>  	struct page *page = folio_page(folio, start);
>>  	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
>> -	unsigned int ref_count = 0, count = 0;
>> +	unsigned int mapped = 0;
>> +	pte_t *pte = vmf->pte;
>>  
>>  	do {
>>  		if (PageHWPoison(page))
>> -			continue;
>> +			goto map;
>>  
>>  		if (mmap_miss > 0)
>>  			mmap_miss--;
>> @@ -3368,20 +3369,34 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>>  		 * handled in the specific fault path, and it'll prohibit the
>>  		 * fault-around logic.
>>  		 */
>> -		if (!pte_none(*vmf->pte))
>> -			continue;
>> +		if (!pte_none(pte[mapped]))
>> +			goto map;
> 
> I see what you're trying to do here, but do_set_pte_range() uses the
> pte from vmf->pte.  Perhaps best to save it at the beginning of the
> function and restore it at the end.  ie:
> 
> 	pte_t *old_ptep = vmf->pte;
> 
> 	} while (vmf->pte++);
> 
> 	vmf->pte = old_ptep;
Yes. This also works.

> 
> The only other thing that bugs me about this patch is the use of
> 'mapped' as the variable name.  It's in the past tense, not the future
> tense, so it gives me the wrong impression about what it's counting.
> While we could use 'tomap', that's kind of clunky.  'count' or even just
> 'i' would be better.
OK. Will use count in next version.


Regards
Yin, Fengwei



* Re: [RFC PATCH v4 3/4] mm: add do_set_pte_range()
  2023-02-06 14:58     ` Yin, Fengwei
@ 2023-02-06 15:13       ` David Hildenbrand
  2023-02-06 16:33         ` Matthew Wilcox
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2023-02-06 15:13 UTC (permalink / raw)
  To: Yin, Fengwei, Matthew Wilcox
  Cc: linux-mm, dave.hansen, tim.c.chen, ying.huang

On 06.02.23 15:58, Yin, Fengwei wrote:
> 
> 
> On 2/6/2023 10:44 PM, Matthew Wilcox wrote:
>> On Mon, Feb 06, 2023 at 10:06:38PM +0800, Yin Fengwei wrote:
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index d6f8f41514cc..93192f04b276 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -1162,6 +1162,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>>>   
>>>   vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
>>>   void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
>>> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
>>> +		unsigned long addr, pte_t *pte,
>>> +		unsigned long start, unsigned int nr);
>>
>> There are only two callers of do_set_pte(), and they're both in mm.
>> I don't think we should retain do_set_pte() as a wrapper, but rather
>> change both callers to call 'set_pte_range()'.  The 'do' doesn't add
>> any value, so let's drop that word.
> OK.
> 
>>
>>> +	if (!cow) {
>>> +		folio_add_file_rmap_range(folio, start, nr, vma, false);
>>> +		add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
>>> +	} else {
>>> +		/*
>>> +		 * rmap code is not ready to handle COW with anonymous
>>> +		 * large folio yet. Capture and warn if large folio
>>> +		 * is given.
>>> +		 */
>>> +		VM_WARN_ON_FOLIO(folio_test_large(folio), folio);
>>> +	}
>>
>> The handling of cow pages is still very clunky.
>> folio_add_new_anon_rmap() handles anonymous large folios just fine.  I
>> think David was looking at current code, not the code in mm-next.
> OK. Let's wait for further comment from David.

As I raised, page_add_new_anon_rmap() -> folio_add_new_anon_rmap() can 
be used to add a fresh (a) PMD-mapped THP or (b) order-0 folio.

folio_add_new_anon_rmap() is not suitable for PTE-mapping a large folio. 
Which is what we are intending to do here unless I am completely off.

PTE-mapping a large folio requires different accounting, different 
mapcount handling and different PG_anon_exclusive handling.

Which is all not there yet.

-- 
Thanks,

David / dhildenb




* Re: [RFC PATCH v4 3/4] mm: add do_set_pte_range()
  2023-02-06 15:13       ` David Hildenbrand
@ 2023-02-06 16:33         ` Matthew Wilcox
  2023-02-06 16:35           ` David Hildenbrand
  0 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2023-02-06 16:33 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Yin, Fengwei, linux-mm, dave.hansen, tim.c.chen, ying.huang

On Mon, Feb 06, 2023 at 04:13:44PM +0100, David Hildenbrand wrote:
> > > The handling of cow pages is still very clunky.
> > > folio_add_new_anon_rmap() handles anonymous large folios just fine.  I
> > > think David was looking at current code, not the code in mm-next.
> > OK. Let's wait for further comment from David.
> 
> As I raised, page_add_new_anon_rmap() -> folio_add_new_anon_rmap() can be
> used to add a fresh (a) PMD-mapped THP or (b) order-0 folio.
>
> folio_add_new_anon_rmap() is not suitable for PTE-mapping a large folio.
> Which is what we are intending to do here unless I am completely off.

I think you are.  While the infrastructure here handles large folios
which are not PMDs, there's nobody who will allocate such a thing, so
there is no problem.  Right?

> PTE-mapping a large folio requires different accounting, different mapcount
> handling and different PG_anon_exclusive handling.
> 
> Which is all not there yet.

Assuming you mean "a large folio which is not PMD sized", I agree.  But we
haven't even started discussing how we should decide to allocate folios
which are not 0 or PMD order yet, so this worry seems premature.  When we
have figured that out, it'll be time to look at folio_add_new_anon_rmap()
to decide how to support it, but hopefully before then we'll have a
better way of handling mapcount.



* Re: [RFC PATCH v4 3/4] mm: add do_set_pte_range()
  2023-02-06 16:33         ` Matthew Wilcox
@ 2023-02-06 16:35           ` David Hildenbrand
  2023-02-06 16:43             ` Matthew Wilcox
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2023-02-06 16:35 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Yin, Fengwei, linux-mm, dave.hansen, tim.c.chen, ying.huang

On 06.02.23 17:33, Matthew Wilcox wrote:
> On Mon, Feb 06, 2023 at 04:13:44PM +0100, David Hildenbrand wrote:
>>>> The handling of cow pages is still very clunky.
>>>> folio_add_new_anon_rmap() handles anonymous large folios just fine.  I
>>>> think David was looking at current code, not the code in mm-next.
>>> OK. Let's wait for further comment from David.
>>
>> As I raised, page_add_new_anon_rmap() -> folio_add_new_anon_rmap() can be
>> used to add a fresh (a) PMD-mapped THP or (b) order-0 folio.
>>
>> folio_add_new_anon_rmap() is not suitable for PTE-mapping a large folio.
>> Which is what we are intending to do here unless I am completely off.
> 
> I think you are.  While the infrastructure here handles large folios
> which are not PMDs, there's nobody who will allocate such a thing, so
> there is no problem.  Right?

And that's precisely what I want to have fenced off here. I want that 
function to complain instead of silently doing the wrong thing.

-- 
Thanks,

David / dhildenb




* Re: [RFC PATCH v4 3/4] mm: add do_set_pte_range()
  2023-02-06 16:35           ` David Hildenbrand
@ 2023-02-06 16:43             ` Matthew Wilcox
  2023-02-06 16:49               ` David Hildenbrand
  0 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2023-02-06 16:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Yin, Fengwei, linux-mm, dave.hansen, tim.c.chen, ying.huang

On Mon, Feb 06, 2023 at 05:35:55PM +0100, David Hildenbrand wrote:
> On 06.02.23 17:33, Matthew Wilcox wrote:
> > On Mon, Feb 06, 2023 at 04:13:44PM +0100, David Hildenbrand wrote:
> > > > > The handling of cow pages is still very clunky.
> > > > > folio_add_new_anon_rmap() handles anonymous large folios just fine.  I
> > > > > think David was looking at current code, not the code in mm-next.
> > > > OK. Let's wait for further comment from David.
> > > 
> > > As I raised, page_add_new_anon_rmap() -> folio_add_new_anon_rmap() can be
> > > used to add a fresh (a) PMD-mapped THP or (b) order-0 folio.
> > > 
> > > folio_add_new_anon_rmap() is not suitable for PTE-mapping a large folio.
> > > Which is what we are intending to do here unless I am completely off.
> > 
> > I think you are.  While the infrastructure here handles large folios
> > which are not PMDs, there's nobody who will allocate such a thing, so
> > there is no problem.  Right>
> 
> And that's precisely what I want to have fenced off here. I want that
> function to complain instead of silently doing the wrong thing.

If this were a widely called function, I'd agree.  But there are two
callers of do_set_pte; one in filemap.c and one in memory.c.  It's
overcautious and has created huge churn in this patchset.  If you're
really that concerned, stick a VM_BUG_ON() in folio_add_new_anon_rmap()
instead of making weird stuff happen in set_pte_range().



* Re: [RFC PATCH v4 3/4] mm: add do_set_pte_range()
  2023-02-06 16:43             ` Matthew Wilcox
@ 2023-02-06 16:49               ` David Hildenbrand
  2023-02-06 17:10                 ` Matthew Wilcox
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2023-02-06 16:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Yin, Fengwei, linux-mm, dave.hansen, tim.c.chen, ying.huang

On 06.02.23 17:43, Matthew Wilcox wrote:
> On Mon, Feb 06, 2023 at 05:35:55PM +0100, David Hildenbrand wrote:
>> On 06.02.23 17:33, Matthew Wilcox wrote:
>>> On Mon, Feb 06, 2023 at 04:13:44PM +0100, David Hildenbrand wrote:
>>>>>> The handling of cow pages is still very clunky.
>>>>>> folio_add_new_anon_rmap() handles anonymous large folios just fine.  I
>>>>>> think David was looking at current code, not the code in mm-next.
>>>>> OK. Let's wait for further comment from David.
>>>>
>>>> As I raised, page_add_new_anon_rmap() -> folio_add_new_anon_rmap() can be
>>>> used to add a fresh (a) PMD-mapped THP or (b) order-0 folio.
>>>>
>>>> folio_add_new_anon_rmap() is not suitable for PTE-mapping a large folio.
>>>> Which is what we are intending to do here unless I am completely off.
>>>
>>> I think you are.  While the infrastructure here handles large folios
>>> which are not PMDs, there's nobody who will allocate such a thing, so
>>> there is no problem.  Right>
>>
>> And that's precisely what I want to have fenced off here. I want that
>> function to complain instead of silently doing the wrong thing.
> 
> If this were a widely called function, I'd agree.  But there are two
> callers of do_set_pte; one in filemap.c and one in memory.c.  It's
> overcautious and has created huge churn in this patchset.  If you're
> really that concerned, stick a VM_BUG_ON() in folio_add_new_anon_rmap()
> instead of making weird stuff happen in set_pte_range().

We have

+	if (!cow) {
+		folio_add_file_rmap_range(folio, start, nr, vma, false);
+		add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
+	} else {
+		/*
+		 * rmap code is not ready to handle COW with anonymous
+		 * large folio yet. Capture and warn if large folio
+		 * is given.
+		 */
+		VM_WARN_ON_FOLIO(folio_test_large(folio), folio);
+	}

now.

What are we supposed to add on the else branch instead that
would be correct in the future? Or not look weird?

Go on, scream louder at me, I don't care.

-- 
Thanks,

David / dhildenb




* Re: [RFC PATCH v4 3/4] mm: add do_set_pte_range()
  2023-02-06 16:49               ` David Hildenbrand
@ 2023-02-06 17:10                 ` Matthew Wilcox
  2023-02-06 17:35                   ` David Hildenbrand
  0 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2023-02-06 17:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Yin, Fengwei, linux-mm, dave.hansen, tim.c.chen, ying.huang

On Mon, Feb 06, 2023 at 05:49:20PM +0100, David Hildenbrand wrote:
> We have
> 
> +	if (!cow) {
> +		folio_add_file_rmap_range(folio, start, nr, vma, false);
> +		add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
> +	} else {
> +		/*
> +		 * rmap code is not ready to handle COW with anonymous
> +		 * large folio yet. Capture and warn if large folio
> +		 * is given.
> +		 */
> +		VM_WARN_ON_FOLIO(folio_test_large(folio), folio);
> +	}
> 
> now.
> 
> What are we supposed to add on the else branch instead that would be
> correct in the future? Or not look weird?

Right now, I think this patch should look something like this.

diff --git a/mm/memory.c b/mm/memory.c
index 7a04a1130ec1..2f6173f83d8b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4257,15 +4257,18 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 }
 #endif
 
-void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
+void set_pte_range(struct vm_fault *vmf, struct folio *folio,
+		struct page *page, unsigned int nr, unsigned long addr)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
 	bool prefault = vmf->address != addr;
 	pte_t entry;
+	unsigned int i;
 
-	flush_icache_page(vma, page);
+	for (i = 0; i < nr; i++)
+		flush_icache_page(vma, page + i);
 	entry = mk_pte(page, vma->vm_page_prot);
 
 	if (prefault && arch_wants_old_prefaulted_pte())
@@ -4279,14 +4282,15 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
 		entry = pte_mkuffd_wp(entry);
 	/* copy-on-write page */
 	if (write && !(vma->vm_flags & VM_SHARED)) {
-		inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
-		page_add_new_anon_rmap(page, vma, addr);
-		lru_cache_add_inactive_or_unevictable(page, vma);
+		add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr);
+		VM_BUG_ON_FOLIO(folio, nr != 1);
+		folio_add_new_anon_rmap(folio, vma, addr);
+		folio_add_lru_vma(folio, vma);
 	} else {
-		inc_mm_counter(vma->vm_mm, mm_counter_file(page));
-		page_add_file_rmap(page, vma, false);
+		add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
+		folio_add_file_rmap_range(folio, page, nr, vma, false);
 	}
-	set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
+	set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr);
 }
 
 static bool vmf_pte_changed(struct vm_fault *vmf)
@@ -4359,7 +4363,9 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 
 	/* Re-check under ptl */
 	if (likely(!vmf_pte_changed(vmf))) {
-		do_set_pte(vmf, page, vmf->address);
+		struct folio *folio = page_folio(page);
+
+		set_pte_range(vmf, folio, page, 1, vmf->address);
 
 		/* no need to invalidate: a not-present page won't be cached */
 		update_mmu_cache(vma, vmf->address, vmf->pte);

> Go on, scream louder at me, I don't care.

I'm not even shouting.  I just think you're wrong ;-)



* Re: [RFC PATCH v4 3/4] mm: add do_set_pte_range()
  2023-02-06 17:10                 ` Matthew Wilcox
@ 2023-02-06 17:35                   ` David Hildenbrand
  2023-02-07  6:05                     ` Yin, Fengwei
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2023-02-06 17:35 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Yin, Fengwei, linux-mm, dave.hansen, tim.c.chen, ying.huang

On 06.02.23 18:10, Matthew Wilcox wrote:
> On Mon, Feb 06, 2023 at 05:49:20PM +0100, David Hildenbrand wrote:
>> We have
>>
>> +	if (!cow) {
>> +		folio_add_file_rmap_range(folio, start, nr, vma, false);
>> +		add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
>> +	} else {
>> +		/*
>> +		 * rmap code is not ready to handle COW with anonymous
>> +		 * large folio yet. Capture and warn if large folio
>> +		 * is given.
>> +		 */
>> +		VM_WARN_ON_FOLIO(folio_test_large(folio), folio);
>> +	}
>>
>> now.
>>
>> What are we supposed to add on the else branch instead that would be
>> correct in the future? Or not look weird?
> 
> Right now, I think this patch should look something like this.
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 7a04a1130ec1..2f6173f83d8b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4257,15 +4257,18 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>   }
>   #endif
>   
> -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
> +void set_pte_range(struct vm_fault *vmf, struct folio *folio,
> +		struct page *page, unsigned int nr, unsigned long addr)
>   {
>   	struct vm_area_struct *vma = vmf->vma;
>   	bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
>   	bool write = vmf->flags & FAULT_FLAG_WRITE;
>   	bool prefault = vmf->address != addr;
>   	pte_t entry;
> +	unsigned int i;
>   
> -	flush_icache_page(vma, page);
> +	for (i = 0; i < nr; i++)
> +		flush_icache_page(vma, page + i);
>   	entry = mk_pte(page, vma->vm_page_prot);
>   
>   	if (prefault && arch_wants_old_prefaulted_pte())
> @@ -4279,14 +4282,15 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>   		entry = pte_mkuffd_wp(entry);
>   	/* copy-on-write page */
>   	if (write && !(vma->vm_flags & VM_SHARED)) {
> -		inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> -		page_add_new_anon_rmap(page, vma, addr);
> -		lru_cache_add_inactive_or_unevictable(page, vma);
> +		add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr);
> +		VM_BUG_ON_FOLIO(folio, nr != 1);

^ what I asked for (WARN would be sufficient for IMHO). I don't 
precisely care how precisely we tell the educated reader that this 
function only handles this special case (I could even be convinced that 
a comment is good enough ;) ).

> +		folio_add_new_anon_rmap(folio, vma, addr);
> +		folio_add_lru_vma(folio, vma);
>   	} else {
> -		inc_mm_counter(vma->vm_mm, mm_counter_file(page));
> -		page_add_file_rmap(page, vma, false);
> +		add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
> +		folio_add_file_rmap_range(folio, page, nr, vma, false);
>   	}
> -	set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
> +	set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr);
>   }
>   
>   static bool vmf_pte_changed(struct vm_fault *vmf)
> @@ -4359,7 +4363,9 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>   
>   	/* Re-check under ptl */
>   	if (likely(!vmf_pte_changed(vmf))) {
> -		do_set_pte(vmf, page, vmf->address);
> +		struct folio *folio = page_folio(page);
> +
> +		set_pte_range(vmf, folio, page, 1, vmf->address);
>   
>   		/* no need to invalidate: a not-present page won't be cached */
>   		update_mmu_cache(vma, vmf->address, vmf->pte);
> 
>> Go on, scream louder at me, I don't care.
> 
> I'm not even shouting.  I just think you're wrong ;-)
> 

Good ;)

-- 
Thanks,

David / dhildenb




* Re: [RFC PATCH v4 3/4] mm: add do_set_pte_range()
  2023-02-06 17:35                   ` David Hildenbrand
@ 2023-02-07  6:05                     ` Yin, Fengwei
  0 siblings, 0 replies; 17+ messages in thread
From: Yin, Fengwei @ 2023-02-07  6:05 UTC (permalink / raw)
  To: David Hildenbrand, Matthew Wilcox
  Cc: linux-mm, dave.hansen, tim.c.chen, ying.huang



On 2/7/2023 1:35 AM, David Hildenbrand wrote:
> On 06.02.23 18:10, Matthew Wilcox wrote:
>> On Mon, Feb 06, 2023 at 05:49:20PM +0100, David Hildenbrand wrote:
>>> We have
>>>
>>> +    if (!cow) {
>>> +        folio_add_file_rmap_range(folio, start, nr, vma, false);
>>> +        add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
>>> +    } else {
>>> +        /*
>>> +         * rmap code is not ready to handle COW with anonymous
>>> +         * large folio yet. Capture and warn if large folio
>>> +         * is given.
>>> +         */
>>> +        VM_WARN_ON_FOLIO(folio_test_large(folio), folio);
>>> +    }
>>>
>>> now.
>>>
>>> What are we supposed to add on the else branch instead that would be
>>> correct in the future? Or not look weird?
>>
>> Right now, I think this patch should look something like this.
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 7a04a1130ec1..2f6173f83d8b 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4257,15 +4257,18 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>>   }
>>   #endif
>>   -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>> +void set_pte_range(struct vm_fault *vmf, struct folio *folio,
>> +        struct page *page, unsigned int nr, unsigned long addr)
>>   {
>>       struct vm_area_struct *vma = vmf->vma;
>>       bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
>>       bool write = vmf->flags & FAULT_FLAG_WRITE;
>>       bool prefault = vmf->address != addr;
>>       pte_t entry;
>> +    unsigned int i;
>>   -    flush_icache_page(vma, page);
>> +    for (i = 0; i < nr; i++)
>> +        flush_icache_page(vma, page + i);
>>       entry = mk_pte(page, vma->vm_page_prot);
>>         if (prefault && arch_wants_old_prefaulted_pte())
>> @@ -4279,14 +4282,15 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>>           entry = pte_mkuffd_wp(entry);
>>       /* copy-on-write page */
>>       if (write && !(vma->vm_flags & VM_SHARED)) {
>> -        inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>> -        page_add_new_anon_rmap(page, vma, addr);
>> -        lru_cache_add_inactive_or_unevictable(page, vma);
>> +        add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr);
>> +        VM_BUG_ON_FOLIO(folio, nr != 1);
> 
> ^ what I asked for (WARN would be sufficient for IMHO). I don't precisely care how precisely we tell the educated reader that this function only handles this special case (I could even be convinced that a comment is good enough ;) ).
David, thanks. I am going to move the cow and !cow cases to
set_pte_range() and WARN if a large folio is passed in.

> 
>> +        folio_add_new_anon_rmap(folio, vma, addr);
>> +        folio_add_lru_vma(folio, vma);
>>       } else {
>> -        inc_mm_counter(vma->vm_mm, mm_counter_file(page));
>> -        page_add_file_rmap(page, vma, false);
>> +        add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
>> +        folio_add_file_rmap_range(folio, page, nr, vma, false);
>>       }
>> -    set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
>> +    set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr);
Matthew, I think we can't use set_ptes() here until pte_next() is ready.
I will keep it as a loop that generates the correct entry for each page,
and we can replace the loop with set_ptes() once pte_next() is ready.

Let me know if I get something wrong. Thanks.

Regards
Yin, Fengwei

>>   }
>>     static bool vmf_pte_changed(struct vm_fault *vmf)
>> @@ -4359,7 +4363,9 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>         /* Re-check under ptl */
>>       if (likely(!vmf_pte_changed(vmf))) {
>> -        do_set_pte(vmf, page, vmf->address);
>> +        struct folio *folio = page_folio(page);
>> +
>> +        set_pte_range(vmf, folio, page, 1, vmf->address);
>>             /* no need to invalidate: a not-present page won't be cached */
>>           update_mmu_cache(vma, vmf->address, vmf->pte);
>>
>>> Go on, scream louder at me, I don't care.
>>
>> I'm not even shouting.  I just think you're wrong ;-)
>>
> 
> Good ;)
> 


