* [RFC PATCH v2 1/5] mm: Enable fault around for shared file page fault
From: Yin Fengwei @ 2023-02-01 8:17 UTC (permalink / raw)
To: willy, david, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang, fengwei.yin
The shared fault handler can also benefit from fault-around. While it
is uncommon to write to MAP_SHARED files, applications that do will see
a huge benefit with will-it-scale:page_fault3 (shared file write fault)
improving by 375%.
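For reference, the page_fault3 workload is essentially a tight store loop
over a MAP_SHARED file mapping. A minimal userspace sketch of that kind of
workload (not the actual will-it-scale harness; the file path, mapping size
and single pass here are arbitrary):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	size_t len = 128UL << 20;	/* arbitrary mapping size */
	int fd = open("/tmp/pf3-sketch", O_RDWR | O_CREAT | O_TRUNC, 0600);
	char *p;
	size_t off;

	if (fd < 0 || ftruncate(fd, len) < 0)
		return 1;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	/* each store to a not-yet-mapped page takes a shared write fault */
	for (off = 0; off < len; off += psize)
		p[off] = 1;

	munmap(p, len);
	close(fd);
	unlink("/tmp/pf3-sketch");
	return 0;
}

The real benchmark repeats the map/write/unmap cycle per thread, which is
what makes the fault path the bottleneck.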
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
mm/memory.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index 7a04a1130ec1..51c04bb60724 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4546,6 +4546,17 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
struct vm_area_struct *vma = vmf->vma;
vm_fault_t ret, tmp;
+ /*
+ * Let's call ->map_pages() first and use ->fault() as fallback
+ * if page by the offset is not ready to be mapped (cold cache or
+ * something).
+ */
+ if (should_fault_around(vmf)) {
+ ret = do_fault_around(vmf);
+ if (ret)
+ return ret;
+ }
+
ret = __do_fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
--
2.30.2
* Re: [RFC PATCH v2 1/5] mm: Enable fault around for shared file page fault
From: Kirill A. Shutemov @ 2023-02-01 14:34 UTC (permalink / raw)
To: Yin Fengwei; +Cc: willy, david, linux-mm, dave.hansen, tim.c.chen, ying.huang
On Wed, Feb 01, 2023 at 04:17:33PM +0800, Yin Fengwei wrote:
> The shared fault handler can also benefit from fault-around. While it
> is uncommon to write to MAP_SHARED files, applications that do will see
> a huge benefit with will-it-scale:page_fault3 (shared file write fault)
> improving by 375%.
>
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
> mm/memory.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 7a04a1130ec1..51c04bb60724 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4546,6 +4546,17 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
> struct vm_area_struct *vma = vmf->vma;
> vm_fault_t ret, tmp;
>
> + /*
> + * Let's call ->map_pages() first and use ->fault() as fallback
> + * if page by the offset is not ready to be mapped (cold cache or
> + * something).
> + */
> + if (should_fault_around(vmf)) {
> + ret = do_fault_around(vmf);
> + if (ret)
> + return ret;
> + }
> +
I believe it bypasses ->page_mkwrite() completely, no?

So you get writable PTEs without notifying the filesystem. Smells like
data loss.
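For context, the normal shared write-fault path notifies the filesystem
before a writable PTE is installed. A simplified, paraphrased outline
(error handling trimmed; not the verbatim mm/memory.c code):

static vm_fault_t do_shared_fault_outline(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	vm_fault_t ret, tmp;

	/* read the page into the page cache and lock it */
	ret = __do_fault(vmf);
	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
		return ret;

	/* let the filesystem prepare for the page becoming writable */
	if (vma->vm_ops->page_mkwrite) {
		unlock_page(vmf->page);
		tmp = do_page_mkwrite(vmf);
		if (unlikely(!tmp || (tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
			put_page(vmf->page);
			return tmp;
		}
	}

	/* only now is the (possibly writable) PTE installed */
	ret |= finish_fault(vmf);

	/* dirty accounting / writeback throttling */
	ret |= fault_dirty_shared_page(vmf);
	return ret;
}

Fault-around installs PTEs through ->map_pages() and do_set_pte() before
any of this runs, so the ->page_mkwrite() notification above is skipped
for those entries.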
> ret = __do_fault(vmf);
> if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
> return ret;
> --
> 2.30.2
>
>
--
Kiryl Shutsemau / Kirill A. Shutemov
* Re: [RFC PATCH v2 1/5] mm: Enable fault around for shared file page fault
From: Yin, Fengwei @ 2023-02-02 1:54 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: willy, david, linux-mm, dave.hansen, tim.c.chen, ying.huang
On 2/1/2023 10:34 PM, Kirill A. Shutemov wrote:
> On Wed, Feb 01, 2023 at 04:17:33PM +0800, Yin Fengwei wrote:
>> The shared fault handler can also benefit from fault-around. While it
>> is uncommon to write to MAP_SHARED files, applications that do will see
>> a huge benefit with will-it-scale:page_fault3 (shared file write fault)
>> improving by 375%.
>>
>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>> ---
>> mm/memory.c | 11 +++++++++++
>> 1 file changed, 11 insertions(+)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 7a04a1130ec1..51c04bb60724 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4546,6 +4546,17 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
>> struct vm_area_struct *vma = vmf->vma;
>> vm_fault_t ret, tmp;
>>
>> + /*
>> + * Let's call ->map_pages() first and use ->fault() as fallback
>> + * if page by the offset is not ready to be mapped (cold cache or
>> + * something).
>> + */
>> + if (should_fault_around(vmf)) {
>> + ret = do_fault_around(vmf);
>> + if (ret)
>> + return ret;
>> + }
>> +
>
> I believe it bypasses ->page_mkwrite() completely, no?
>
> So you get a writable PTEs without notifying the filesystem. Smells like a
> data loss.
Yes, you are right. This may be the reason why fault-around is not enabled
for the shared file write fault. I will drop this patch. Thanks.
Regards
Yin, Fengwei
>
>> ret = __do_fault(vmf);
>> if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>> return ret;
>> --
>> 2.30.2
>>
>>
>
* [RFC PATCH v2 2/5] filemap: add function filemap_map_folio_range()
From: Yin Fengwei @ 2023-02-01 8:17 UTC (permalink / raw)
To: willy, david, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang, fengwei.yin
filemap_map_folio_range() maps a partial or full folio. Compared to the
original filemap_map_pages(), it batches the refcount updates and gets a
minor performance improvement for large folios.

A self-cooked will-it-scale.page_fault3-like app (with the file write fault
changed to a read fault) on an xfs filesystem got a 2% performance gain.
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
mm/filemap.c | 91 ++++++++++++++++++++++++++++++++--------------------
1 file changed, 57 insertions(+), 34 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 992554c18f1f..9cc5edd8f998 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3351,6 +3351,56 @@ static inline struct folio *next_map_page(struct address_space *mapping,
mapping, xas, end_pgoff);
}
+/*
+ * Map sub-pages range [start_page, start_page + nr_pages) of folio.
+ * start_page is gotten from start by folio_page(folio, start)
+ */
+static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
+ struct folio *folio, unsigned long start,
+ unsigned long addr, unsigned int nr_pages)
+{
+ vm_fault_t ret = 0;
+ struct vm_area_struct *vma = vmf->vma;
+ struct file *file = vma->vm_file;
+ struct page *page = folio_page(folio, start);
+ unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
+ unsigned int ref_count = 0, count = 0;
+
+ do {
+ if (PageHWPoison(page))
+ continue;
+
+ if (mmap_miss > 0)
+ mmap_miss--;
+
+ /*
+ * NOTE: If there're PTE markers, we'll leave them to be
+ * handled in the specific fault path, and it'll prohibit the
+ * fault-around logic.
+ */
+ if (!pte_none(*vmf->pte))
+ continue;
+
+ if (vmf->address == addr)
+ ret = VM_FAULT_NOPAGE;
+
+ ref_count++;
+ do_set_pte(vmf, page, addr);
+ update_mmu_cache(vma, addr, vmf->pte);
+ } while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
+
+ /*
+ * Restore the vmf->pte. Otherwise, it's possible vmf->pte point
+ * to next page table entry if the last sub page in the range is
+ * mapped to the last entry of page table.
+ */
+ vmf->pte -= nr_pages;
+ folio_ref_add(folio, ref_count);
+ WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
+
+ return ret;
+}
+
vm_fault_t filemap_map_pages(struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff)
{
@@ -3361,9 +3411,9 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
unsigned long addr;
XA_STATE(xas, &mapping->i_pages, start_pgoff);
struct folio *folio;
- struct page *page;
unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
vm_fault_t ret = 0;
+ int nr_pages = 0;
rcu_read_lock();
folio = first_map_page(mapping, &xas, end_pgoff);
@@ -3378,45 +3428,18 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
do {
-again:
- page = folio_file_page(folio, xas.xa_index);
- if (PageHWPoison(page))
- goto unlock;
-
- if (mmap_miss > 0)
- mmap_miss--;
+ unsigned long end;
addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
vmf->pte += xas.xa_index - last_pgoff;
last_pgoff = xas.xa_index;
+ end = folio->index + folio_nr_pages(folio) - 1;
+ nr_pages = min(end, end_pgoff) - xas.xa_index + 1;
- /*
- * NOTE: If there're PTE markers, we'll leave them to be
- * handled in the specific fault path, and it'll prohibit the
- * fault-around logic.
- */
- if (!pte_none(*vmf->pte))
- goto unlock;
-
- /* We're about to handle the fault */
- if (vmf->address == addr)
- ret = VM_FAULT_NOPAGE;
+ ret |= filemap_map_folio_range(vmf, folio,
+ xas.xa_index - folio->index, addr, nr_pages);
+ xas.xa_index = end;
- do_set_pte(vmf, page, addr);
- /* no need to invalidate: a not-present page won't be cached */
- update_mmu_cache(vma, addr, vmf->pte);
- if (folio_more_pages(folio, xas.xa_index, end_pgoff)) {
- xas.xa_index++;
- folio_ref_inc(folio);
- goto again;
- }
- folio_unlock(folio);
- continue;
-unlock:
- if (folio_more_pages(folio, xas.xa_index, end_pgoff)) {
- xas.xa_index++;
- goto again;
- }
folio_unlock(folio);
folio_put(folio);
} while ((folio = next_map_page(mapping, &xas, end_pgoff)) != NULL);
--
2.30.2
* [RFC PATCH v2 3/5] rmap: add page_add_file_rmap_range()
From: Yin Fengwei @ 2023-02-01 8:17 UTC (permalink / raw)
To: willy, david, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang, fengwei.yin
page_add_file_rmap_range() allows adding pte mappings to a specific range
of a file folio. Compared to the original page_add_file_rmap(), it batches
the __lruvec_stat updates for large folios.
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
include/linux/rmap.h | 2 ++
mm/rmap.c | 66 ++++++++++++++++++++++++++++++++++----------
2 files changed, 54 insertions(+), 14 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index a4570da03e58..9631a3701504 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -198,6 +198,8 @@ void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
unsigned long address);
void page_add_file_rmap(struct page *, struct vm_area_struct *,
bool compound);
+void page_add_file_rmap_range(struct folio *, unsigned long start,
+ unsigned int nr_pages, struct vm_area_struct *, bool compound);
void page_remove_rmap(struct page *, struct vm_area_struct *,
bool compound);
diff --git a/mm/rmap.c b/mm/rmap.c
index 15ae24585fc4..090de52e1a9a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1303,31 +1303,44 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
}
/**
- * page_add_file_rmap - add pte mapping to a file page
- * @page: the page to add the mapping to
+ * page_add_file_rmap_range - add pte mapping to a sub page range of a folio
+ * @folio: The filio to add the mapping to
+ * @start: The first sub page index in folio
+ * @nr_pages: The number of sub pages from the first page
* @vma: the vm area in which the mapping is added
* @compound: charge the page as compound or small page
*
+ * The sub page range of folio is defined by
+ * [first_sub_page, first_sub_page + nr_pages)
+ *
* The caller needs to hold the pte lock.
*/
-void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
- bool compound)
+void page_add_file_rmap_range(struct folio *folio, unsigned long start,
+ unsigned int nr_pages, struct vm_area_struct *vma,
+ bool compound)
{
- struct folio *folio = page_folio(page);
atomic_t *mapped = &folio->_nr_pages_mapped;
- int nr = 0, nr_pmdmapped = 0;
- bool first;
+ unsigned int nr = 0, nr_pmdmapped = 0, first;
- VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
+ VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio);
/* Is page being mapped by PTE? Is this its first map to be added? */
if (likely(!compound)) {
- first = atomic_inc_and_test(&page->_mapcount);
- nr = first;
- if (first && folio_test_large(folio)) {
- nr = atomic_inc_return_relaxed(mapped);
- nr = (nr < COMPOUND_MAPPED);
- }
+ struct page *page = folio_page(folio, start);
+
+ nr_pages = min_t(unsigned int, nr_pages,
+ folio_nr_pages(folio) - start);
+
+ do {
+ first = atomic_inc_and_test(&page->_mapcount);
+ if (first && folio_test_large(folio)) {
+ first = atomic_inc_return_relaxed(mapped);
+ first = (nr < COMPOUND_MAPPED);
+ }
+
+ if (first)
+ nr++;
+ } while (page++, --nr_pages > 0);
} else if (folio_test_pmd_mappable(folio)) {
/* That test is redundant: it's for safety or to optimize out */
@@ -1356,6 +1369,31 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
mlock_vma_folio(folio, vma, compound);
}
+/**
+ * page_add_file_rmap - add pte mapping to a file page
+ * @page: the page to add the mapping to
+ * @vma: the vm area in which the mapping is added
+ * @compound: charge the page as compound or small page
+ *
+ * The caller needs to hold the pte lock.
+ */
+void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
+ bool compound)
+{
+ struct folio *folio = page_folio(page);
+ unsigned int nr_pages;
+
+ VM_WARN_ON_ONCE_PAGE(compound && !PageTransHuge(page), page);
+
+ if (likely(!compound))
+ nr_pages = 1;
+ else
+ nr_pages = folio_nr_pages(folio);
+
+ page_add_file_rmap_range(folio, folio_page_idx(folio, page),
+ nr_pages, vma, compound);
+}
+
/**
* page_remove_rmap - take down pte mapping from a page
* @page: page to remove mapping from
--
2.30.2
* Re: [RFC PATCH v2 3/5] rmap: add page_add_file_rmap_range()
From: Matthew Wilcox @ 2023-02-01 17:32 UTC (permalink / raw)
To: Yin Fengwei; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang
On Wed, Feb 01, 2023 at 04:17:35PM +0800, Yin Fengwei wrote:
> /**
> - * page_add_file_rmap - add pte mapping to a file page
> - * @page: the page to add the mapping to
> + * page_add_file_rmap_range - add pte mapping to a sub page range of a folio
> + * @folio: The filio to add the mapping to
> + * @start: The first sub page index in folio
> + * @nr_pages: The number of sub pages from the first page
> * @vma: the vm area in which the mapping is added
> * @compound: charge the page as compound or small page
> *
> + * The sub page range of folio is defined by
> + * [first_sub_page, first_sub_page + nr_pages)
Lose the "sub" from all of this. That's legacy thinking; pages are
pages and folios are folios. "subpages" was from when we were trying
to use the word "page" for both "the allocation" and "the PAGE_SIZE
range of bytes".
> + *
> * The caller needs to hold the pte lock.
> */
> -void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
> - bool compound)
> +void page_add_file_rmap_range(struct folio *folio, unsigned long start,
> + unsigned int nr_pages, struct vm_area_struct *vma,
> + bool compound)
I think this function needs to be called folio_add_file_rmap()
I'd like to lose the 'compound' parameter, and base it on nr_pages ==
folio_nr_pages(), but that may be a step too far just now.
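Something along these lines, presumably (folio_add_file_rmap() does not
exist in this series; the compound derivation is only the suggestion above):

static inline void folio_add_file_rmap(struct folio *folio, unsigned long start,
		unsigned int nr_pages, struct vm_area_struct *vma)
{
	/*
	 * Per the suggestion: a call covering the whole folio is treated as
	 * a compound (PMD-level) mapping, anything smaller as PTE mappings.
	 * A fully PTE-mapped folio would be misclassified here, which may
	 * be why this is "a step too far just now".
	 */
	bool compound = nr_pages == folio_nr_pages(folio);

	page_add_file_rmap_range(folio, start, nr_pages, vma, compound);
}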
* Re: [RFC PATCH v2 3/5] rmap: add page_add_file_rmap_range()
From: Yin, Fengwei @ 2023-02-02 2:00 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang
On 2/2/2023 1:32 AM, Matthew Wilcox wrote:
> On Wed, Feb 01, 2023 at 04:17:35PM +0800, Yin Fengwei wrote:
>> /**
>> - * page_add_file_rmap - add pte mapping to a file page
>> - * @page: the page to add the mapping to
>> + * page_add_file_rmap_range - add pte mapping to a sub page range of a folio
>> + * @folio: The filio to add the mapping to
>> + * @start: The first sub page index in folio
>> + * @nr_pages: The number of sub pages from the first page
>> * @vma: the vm area in which the mapping is added
>> * @compound: charge the page as compound or small page
>> *
>> + * The sub page range of folio is defined by
>> + * [first_sub_page, first_sub_page + nr_pages)
>
> Lose the "sub" from all of this. That's legacy thinking; pages are
> pages and folios are folios. "subpages" was from when we were trying
> to use the word "page" for both "the allocation" and "the PAGE_SIZE
> range of bytes".
OK. Will remove 'sub' in the next version.
>
>> + *
>> * The caller needs to hold the pte lock.
>> */
>> -void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
>> - bool compound)
>> +void page_add_file_rmap_range(struct folio *folio, unsigned long start,
>> + unsigned int nr_pages, struct vm_area_struct *vma,
>> + bool compound)
>
> I think this function needs to be called folio_add_file_rmap()
Yes. Maybe a followup patch after this series? Let me know if you want
this change in this series.
>
> I'd like to lose the 'compound' parameter, and base it on nr_pages ==
> folio_nr_pages(), but that may be a step far just now.
Yes. I had a local change to remove the if (folio_test_pmd_mappable(folio))
test (it's very close to removing 'compound'), but I didn't include it in
this series. I'd prefer a follow-up patch. Let me know if you want the
change in this series. Thanks.
Regards
Yin, Fengwei
>
* [RFC PATCH v2 4/5] mm: add do_set_pte_range()
From: Yin Fengwei @ 2023-02-01 8:17 UTC (permalink / raw)
To: willy, david, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang, fengwei.yin
do_set_pte_range() allows setting up page table entries for a specific
range. It calls page_add_file_rmap_range() to take advantage of the
batched rmap update for large folios.
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
include/linux/mm.h | 2 ++
mm/filemap.c | 1 -
mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++----------
3 files changed, 49 insertions(+), 14 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d6f8f41514cc..96e08fcdce24 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1162,6 +1162,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
+void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
+ unsigned long start, unsigned long addr, unsigned int nr);
vm_fault_t finish_fault(struct vm_fault *vmf);
vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
diff --git a/mm/filemap.c b/mm/filemap.c
index 9cc5edd8f998..95f634d11581 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3386,7 +3386,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
ref_count++;
do_set_pte(vmf, page, addr);
- update_mmu_cache(vma, addr, vmf->pte);
} while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
/*
diff --git a/mm/memory.c b/mm/memory.c
index 51c04bb60724..7e41142e1e4f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4257,7 +4257,8 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
}
#endif
-void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
+static void do_set_pte_entry(struct vm_fault *vmf, struct page *page,
+ unsigned long addr)
{
struct vm_area_struct *vma = vmf->vma;
bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
@@ -4277,16 +4278,52 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
if (unlikely(uffd_wp))
entry = pte_mkuffd_wp(entry);
- /* copy-on-write page */
- if (write && !(vma->vm_flags & VM_SHARED)) {
- inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, addr);
- lru_cache_add_inactive_or_unevictable(page, vma);
- } else {
- inc_mm_counter(vma->vm_mm, mm_counter_file(page));
- page_add_file_rmap(page, vma, false);
- }
set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
+
+ /* no need to invalidate: a not-present page won't be cached */
+ update_mmu_cache(vma, addr, vmf->pte);
+}
+
+void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
+ unsigned long start, unsigned long addr, unsigned int nr)
+{
+ unsigned int i = 0;
+ struct page *page = folio_page(folio, start);
+ struct vm_area_struct *vma = vmf->vma;
+ bool cow = (vmf->flags & FAULT_FLAG_WRITE) &&
+ !(vma->vm_flags & VM_SHARED);
+
+ /*
+ * file page: batched update rmap, mm counter.
+ * copy-on-write page: batched update mm counter.
+ */
+ if (!cow) {
+ page_add_file_rmap_range(folio, start, nr, vma, false);
+ add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
+ } else
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr);
+
+ do {
+ if (cow) {
+ page_add_new_anon_rmap(page, vma, addr);
+ lru_cache_add_inactive_or_unevictable(page, vma);
+ }
+
+ do_set_pte_entry(vmf, page, addr);
+ } while (vmf->pte++, page++, addr += PAGE_SIZE, ++i < nr);
+}
+
+void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
+{
+ struct folio *folio = page_folio(page);
+
+ do_set_pte_range(vmf, folio, folio_page_idx(folio, page), addr, 1);
+
+ /*
+ * do_set_pte_range changes vmf->pte. Restore it back as
+ * do_set_pte doesn't expect the change of vmf->pte.
+ */
+ vmf->pte--;
}
static bool vmf_pte_changed(struct vm_fault *vmf)
@@ -4361,9 +4398,6 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
if (likely(!vmf_pte_changed(vmf))) {
do_set_pte(vmf, page, vmf->address);
- /* no need to invalidate: a not-present page won't be cached */
- update_mmu_cache(vma, vmf->address, vmf->pte);
-
ret = 0;
} else {
update_mmu_tlb(vma, vmf->address, vmf->pte);
--
2.30.2
* Re: [RFC PATCH v2 4/5] mm: add do_set_pte_range()
From: David Hildenbrand @ 2023-02-01 9:09 UTC (permalink / raw)
To: Yin Fengwei, willy, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang
On 01.02.23 09:17, Yin Fengwei wrote:
> do_set_pte_range() allows to setup page table entries for a
> specific range. It calls page_add_file_rmap_range() to take
> advantage of batched rmap update for large folio.
>
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> ---
> include/linux/mm.h | 2 ++
> mm/filemap.c | 1 -
> mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++----------
> 3 files changed, 49 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d6f8f41514cc..96e08fcdce24 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1162,6 +1162,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>
> vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
> void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
> + unsigned long start, unsigned long addr, unsigned int nr);
>
> vm_fault_t finish_fault(struct vm_fault *vmf);
> vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 9cc5edd8f998..95f634d11581 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3386,7 +3386,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>
> ref_count++;
> do_set_pte(vmf, page, addr);
> - update_mmu_cache(vma, addr, vmf->pte);
> } while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
>
> /*
> diff --git a/mm/memory.c b/mm/memory.c
> index 51c04bb60724..7e41142e1e4f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4257,7 +4257,8 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
> }
> #endif
>
> -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
> +static void do_set_pte_entry(struct vm_fault *vmf, struct page *page,
> + unsigned long addr)
> {
> struct vm_area_struct *vma = vmf->vma;
> bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
> @@ -4277,16 +4278,52 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
> entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> if (unlikely(uffd_wp))
> entry = pte_mkuffd_wp(entry);
> - /* copy-on-write page */
> - if (write && !(vma->vm_flags & VM_SHARED)) {
> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> - page_add_new_anon_rmap(page, vma, addr);
> - lru_cache_add_inactive_or_unevictable(page, vma);
> - } else {
> - inc_mm_counter(vma->vm_mm, mm_counter_file(page));
> - page_add_file_rmap(page, vma, false);
> - }
> set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
> +
> + /* no need to invalidate: a not-present page won't be cached */
> + update_mmu_cache(vma, addr, vmf->pte);
> +}
> +
> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
> + unsigned long start, unsigned long addr, unsigned int nr)
> +{
> + unsigned int i = 0;
> + struct page *page = folio_page(folio, start);
> + struct vm_area_struct *vma = vmf->vma;
> + bool cow = (vmf->flags & FAULT_FLAG_WRITE) &&
> + !(vma->vm_flags & VM_SHARED);
> +
> + /*
> + * file page: batched update rmap, mm counter.
> + * copy-on-write page: batched update mm counter.
> + */
> + if (!cow) {
> + page_add_file_rmap_range(folio, start, nr, vma, false);
> + add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
> + } else
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr);
> +
> + do {
> + if (cow) {
> + page_add_new_anon_rmap(page, vma, addr);
This doesn't work with anon pages the way you intended in this patch.
Please leave anon pages out of the picture for now and make
do_set_pte_range() only deal with !cow.
--
Thanks,
David / dhildenb
* Re: [RFC PATCH v2 4/5] mm: add do_set_pte_range()
From: Yin, Fengwei @ 2023-02-01 10:04 UTC (permalink / raw)
To: David Hildenbrand, willy, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang
On 2/1/2023 5:09 PM, David Hildenbrand wrote:
> On 01.02.23 09:17, Yin Fengwei wrote:
>> do_set_pte_range() allows to setup page table entries for a
>> specific range. It calls page_add_file_rmap_range() to take
>> advantage of batched rmap update for large folio.
>>
>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>> ---
>> include/linux/mm.h | 2 ++
>> mm/filemap.c | 1 -
>> mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++----------
>> 3 files changed, 49 insertions(+), 14 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index d6f8f41514cc..96e08fcdce24 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1162,6 +1162,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>> vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
>> void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
>> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
>> + unsigned long start, unsigned long addr, unsigned int nr);
>> vm_fault_t finish_fault(struct vm_fault *vmf);
>> vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 9cc5edd8f998..95f634d11581 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -3386,7 +3386,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>> ref_count++;
>> do_set_pte(vmf, page, addr);
>> - update_mmu_cache(vma, addr, vmf->pte);
>> } while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
>> /*
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 51c04bb60724..7e41142e1e4f 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4257,7 +4257,8 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>> }
>> #endif
>> -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>> +static void do_set_pte_entry(struct vm_fault *vmf, struct page *page,
>> + unsigned long addr)
>> {
>> struct vm_area_struct *vma = vmf->vma;
>> bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
>> @@ -4277,16 +4278,52 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>> entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>> if (unlikely(uffd_wp))
>> entry = pte_mkuffd_wp(entry);
>> - /* copy-on-write page */
>> - if (write && !(vma->vm_flags & VM_SHARED)) {
>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>> - page_add_new_anon_rmap(page, vma, addr);
>> - lru_cache_add_inactive_or_unevictable(page, vma);
>> - } else {
>> - inc_mm_counter(vma->vm_mm, mm_counter_file(page));
>> - page_add_file_rmap(page, vma, false);
>> - }
>> set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
>> +
>> + /* no need to invalidate: a not-present page won't be cached */
>> + update_mmu_cache(vma, addr, vmf->pte);
>> +}
>> +
>> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
>> + unsigned long start, unsigned long addr, unsigned int nr)
>> +{
>> + unsigned int i = 0;
>> + struct page *page = folio_page(folio, start);
>> + struct vm_area_struct *vma = vmf->vma;
>> + bool cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>> + !(vma->vm_flags & VM_SHARED);
>> +
>> + /*
>> + * file page: batched update rmap, mm counter.
>> + * copy-on-write page: batched update mm counter.
>> + */
>> + if (!cow) {
>> + page_add_file_rmap_range(folio, start, nr, vma, false);
>> + add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
>> + } else
>> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr);
>> +
>> + do {
>> + if (cow) {
>> + page_add_new_anon_rmap(page, vma, addr);
>
> This doesn't work with anon pages the way you intended in this patch.
>
> Please leave anon pages out of the picture for now and make do_set_pte_range() only deal with !cow.
OK. I will move the check to do_set_pte() and only call
do_set_pte_range() when it's !cow. Thanks.
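A sketch of that direction (illustrative only, not the posted patch;
set_one_anon_pte() is a hypothetical stand-in for the existing single-page
anon handling):

void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
{
	struct vm_area_struct *vma = vmf->vma;
	bool cow = (vmf->flags & FAULT_FLAG_WRITE) &&
		   !(vma->vm_flags & VM_SHARED);

	if (cow) {
		/* anon/cow fault: keep the current single-page path */
		set_one_anon_pte(vmf, page, addr);	/* hypothetical */
	} else {
		struct folio *folio = page_folio(page);

		/* file-backed page: safe to go through the batched helper */
		do_set_pte_range(vmf, folio, folio_page_idx(folio, page),
				 addr, 1);
		/* do_set_pte_range() advanced vmf->pte; restore it */
		vmf->pte--;
	}
}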
Regards
Yin, Fengwei
>
* Re: [RFC PATCH v2 4/5] mm: add do_set_pte_range()
From: Matthew Wilcox @ 2023-02-01 17:38 UTC (permalink / raw)
To: Yin Fengwei; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang
On Wed, Feb 01, 2023 at 04:17:36PM +0800, Yin Fengwei wrote:
> do_set_pte_range() allows to setup page table entries for a
> specific range. It calls page_add_file_rmap_range() to take
> advantage of batched rmap update for large folio.
How about something more like this? Yes, we need to define
flush_icache_pages() and PTE_STRIDE.
(we could also do for (i = 0; i < nr; i++) flush_icache_page(...) but
given that some architectures already implement flush_icache_range(),
I think they may appreciate being given one large range to flush)
+++ b/mm/memory.c
@@ -4277,15 +4277,19 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page
*page)
}
#endif
-void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
+void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
+ unsigned int start, unsigned int nr,
+ unsigned long addr)
{
+ struct page *page = folio_page(page, start);
struct vm_area_struct *vma = vmf->vma;
bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
bool write = vmf->flags & FAULT_FLAG_WRITE;
bool prefault = vmf->address != addr;
pte_t entry;
+ unsigned int i;
- flush_icache_page(vma, page);
+ flush_icache_pages(vma, page, nr);
entry = mk_pte(page, vma->vm_page_prot);
if (prefault && arch_wants_old_prefaulted_pte())
@@ -4299,14 +4303,23 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
entry = pte_mkuffd_wp(pte_wrprotect(entry));
/* copy-on-write page */
if (write && !(vma->vm_flags & VM_SHARED)) {
- inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, addr);
- lru_cache_add_inactive_or_unevictable(page, vma);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr);
+ for (i = 0; i < nr; i++) {
+ page_add_new_anon_rmap(page + i, vma, addr);
+ lru_cache_add_inactive_or_unevictable(page + i, vma);
+ }
} else {
- inc_mm_counter(vma->vm_mm, mm_counter_file(page));
- page_add_file_rmap(page, vma, false);
+ add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
+ folio_add_file_rmap(folio, start, n, vma);
+ }
+
+ for (i = 0; i < nr; i++) {
+ set_pte_at(vma->vm_mm, addr, vmf->pte + i, entry);
+ /* no need to invalidate: a not-present page won't be cached */
+ update_mmu_cache(vma, addr, vmf->pte + i);
+ addr += PAGE_SIZE;
+ entry += PTE_STRIDE;
}
- set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
}
static bool vmf_pte_changed(struct vm_fault *vmf)
* Re: [RFC PATCH v2 4/5] mm: add do_set_pte_range()
From: Matthew Wilcox @ 2023-02-01 21:59 UTC (permalink / raw)
To: Yin Fengwei; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang
On Wed, Feb 01, 2023 at 05:38:39PM +0000, Matthew Wilcox wrote:
> On Wed, Feb 01, 2023 at 04:17:36PM +0800, Yin Fengwei wrote:
> > do_set_pte_range() allows to setup page table entries for a
> > specific range. It calls page_add_file_rmap_range() to take
> > advantage of batched rmap update for large folio.
>
> How about something more like this? Yes, we need to define
> flush_icache_pages() and PTE_STRIDE.
Never mind about PTE_STRIDE. I forgot that pte_t isn't an integer
type. Instead, we'll want each architecture to define
/* This should be right for x86 */
static inline pte_t pte_next(pte_t pte)
{
return __pte(pte_val(pte) + PAGE_SIZE);
}
> + for (i = 0; i < nr; i++) {
> + set_pte_at(vma->vm_mm, addr, vmf->pte + i, entry);
> + /* no need to invalidate: a not-present page won't be cached */
> + update_mmu_cache(vma, addr, vmf->pte + i);
> + addr += PAGE_SIZE;
> + entry += PTE_STRIDE;
entry = pte_next(entry);
* Re: [RFC PATCH v2 4/5] mm: add do_set_pte_range()
From: Yin, Fengwei @ 2023-02-02 3:18 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang
On 2/2/2023 1:38 AM, Matthew Wilcox wrote:
> On Wed, Feb 01, 2023 at 04:17:36PM +0800, Yin Fengwei wrote:
>> do_set_pte_range() allows to setup page table entries for a
>> specific range. It calls page_add_file_rmap_range() to take
>> advantage of batched rmap update for large folio.
>
> How about something more like this? Yes, we need to define
> flush_icache_pages() and PTE_STRIDE.
Yes. This could remove more duplicated operations here. Let me try to add
flush_icache_pages() in the next version.

But pte_next() needs to be added for all architectures, and I suppose that
will be a long journey. What about a follow-up patch for pte_next()?
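One possible generic fallback, assuming the architecture provides
pte_pfn(), pfn_pte() and pte_pgprot() (only a sketch; whether that holds
for every architecture is exactly the open question):

#ifndef pte_next
static inline pte_t pte_next(pte_t pte)
{
	/* rebuild the PTE for the next pfn with the same protection bits */
	return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
}
#endif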
Regards
Yin, Fengwei
>
> (we could also do for (i = 0; i < nr; i++) flush_icache_page(...) but
> given that some architectures already implement flush_icache_range(),
> I think they may appreciate being given one large range to flush)
>
> +++ b/mm/memory.c
> @@ -4277,15 +4277,19 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page
> *page)
> }
> #endif
>
> -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
> + unsigned int start, unsigned int nr,
> + unsigned long addr)
> {
> + struct page *page = folio_page(page, start);
> struct vm_area_struct *vma = vmf->vma;
> bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
> bool write = vmf->flags & FAULT_FLAG_WRITE;
> bool prefault = vmf->address != addr;
> pte_t entry;
> + unsigned int i;
>
> - flush_icache_page(vma, page);
> + flush_icache_pages(vma, page, nr);
> entry = mk_pte(page, vma->vm_page_prot);
>
> if (prefault && arch_wants_old_prefaulted_pte())
> @@ -4299,14 +4303,23 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
> entry = pte_mkuffd_wp(pte_wrprotect(entry));
> /* copy-on-write page */
> if (write && !(vma->vm_flags & VM_SHARED)) {
> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> - page_add_new_anon_rmap(page, vma, addr);
> - lru_cache_add_inactive_or_unevictable(page, vma);
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr);
> + for (i = 0; i < nr; i++) {
> + page_add_new_anon_rmap(page + i, vma, addr);
> + lru_cache_add_inactive_or_unevictable(page + i, vma);
> + }
> } else {
> - inc_mm_counter(vma->vm_mm, mm_counter_file(page));
> - page_add_file_rmap(page, vma, false);
> + add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
> + folio_add_file_rmap(folio, start, n, vma);
> + }
> +
> + for (i = 0; i < nr; i++) {
> + set_pte_at(vma->vm_mm, addr, vmf->pte + i, entry);
> + /* no need to invalidate: a not-present page won't be cached */
> + update_mmu_cache(vma, addr, vmf->pte + i);
> + addr += PAGE_SIZE;
> + entry += PTE_STRIDE;
> }
> - set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
> }
>
> static bool vmf_pte_changed(struct vm_fault *vmf)
>
* Re: [RFC PATCH v2 4/5] mm: add do_set_pte_range()
From: Yin Fengwei @ 2023-02-03 8:54 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang
On 2/2/23 01:38, Matthew Wilcox wrote:
> On Wed, Feb 01, 2023 at 04:17:36PM +0800, Yin Fengwei wrote:
>> do_set_pte_range() allows to setup page table entries for a
>> specific range. It calls page_add_file_rmap_range() to take
>> advantage of batched rmap update for large folio.
>
> How about something more like this? Yes, we need to define
> flush_icache_pages() and PTE_STRIDE.
>
> (we could also do for (i = 0; i < nr; i++) flush_icache_page(...) but
> given that some architectures already implement flush_icache_range(),
> I think they may appreciate being given one large range to flush)
For flush_icache_range() and flush_icache_page(), the riscv implementation
could be an exception. According to arch/riscv/include/asm/cacheflush.h:

#define flush_icache_range(start, end) flush_icache_all()

There is no definition of flush_icache_page(), so I suppose
flush_icache_page() does nothing on riscv.

Using flush_icache_range() may therefore not be a good choice for riscv.
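Given that, a generic flush_icache_pages() fallback that just loops over
flush_icache_page(), as mentioned earlier in the thread, might be the safer
default (a sketch only, assuming the usual asm-generic override pattern):

#ifndef flush_icache_pages
static inline void flush_icache_pages(struct vm_area_struct *vma,
				      struct page *page, unsigned int nr)
{
	unsigned int i;

	/* architectures with a cheap ranged flush can override this */
	for (i = 0; i < nr; i++)
		flush_icache_page(vma, page + i);
}
#endif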
Regards
Yin, Fengwei
>
> +++ b/mm/memory.c
> @@ -4277,15 +4277,19 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page
> *page)
> }
> #endif
>
> -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
> + unsigned int start, unsigned int nr,
> + unsigned long addr)
> {
> + struct page *page = folio_page(page, start);
> struct vm_area_struct *vma = vmf->vma;
> bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
> bool write = vmf->flags & FAULT_FLAG_WRITE;
> bool prefault = vmf->address != addr;
> pte_t entry;
> + unsigned int i;
>
> - flush_icache_page(vma, page);
> + flush_icache_pages(vma, page, nr);
> entry = mk_pte(page, vma->vm_page_prot);
>
> if (prefault && arch_wants_old_prefaulted_pte())
> @@ -4299,14 +4303,23 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
> entry = pte_mkuffd_wp(pte_wrprotect(entry));
> /* copy-on-write page */
> if (write && !(vma->vm_flags & VM_SHARED)) {
> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> - page_add_new_anon_rmap(page, vma, addr);
> - lru_cache_add_inactive_or_unevictable(page, vma);
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr);
> + for (i = 0; i < nr; i++) {
> + page_add_new_anon_rmap(page + i, vma, addr);
> + lru_cache_add_inactive_or_unevictable(page + i, vma);
> + }
> } else {
> - inc_mm_counter(vma->vm_mm, mm_counter_file(page));
> - page_add_file_rmap(page, vma, false);
> + add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
> + folio_add_file_rmap(folio, start, n, vma);
> + }
> +
> + for (i = 0; i < nr; i++) {
> + set_pte_at(vma->vm_mm, addr, vmf->pte + i, entry);
> + /* no need to invalidate: a not-present page won't be cached */
> + update_mmu_cache(vma, addr, vmf->pte + i);
> + addr += PAGE_SIZE;
> + entry += PTE_STRIDE;
> }
> - set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
> }
>
> static bool vmf_pte_changed(struct vm_fault *vmf)
>
* [RFC PATCH v2 5/5] filemap: batched update mm counter,rmap when map file folio
From: Yin Fengwei @ 2023-02-01 8:17 UTC (permalink / raw)
To: willy, david, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang, fengwei.yin
Use do_set_pte_range() in filemap_map_folio_range(), which batches the mm
counter and rmap updates.

A self-cooked will-it-scale.page_fault3-like app (with the file write fault
changed to a read fault) got a 15% performance gain.
Perf data collected before/after the change:
    18.73%--page_add_file_rmap
            |
             --11.60%--__mod_lruvec_page_state
                       |
                       |--7.40%--__mod_memcg_lruvec_state
                       |          |
                       |           --5.58%--cgroup_rstat_updated
                       |
                        --2.53%--__mod_lruvec_state
                                  |
                                   --1.48%--__mod_node_page_state

     9.93%--page_add_file_rmap_range
            |
             --2.67%--__mod_lruvec_page_state
                      |
                      |--1.95%--__mod_memcg_lruvec_state
                      |          |
                      |           --1.57%--cgroup_rstat_updated
                      |
                       --0.61%--__mod_lruvec_state
                                 |
                                  --0.54%--__mod_node_page_state
The running time of __mod_lruvec_page_state() is reduced a lot.
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
mm/filemap.c | 37 +++++++++++++++++++++++++++++++++----
1 file changed, 33 insertions(+), 4 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 95f634d11581..b14f077d1c55 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3364,11 +3364,22 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
struct file *file = vma->vm_file;
struct page *page = folio_page(folio, start);
unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
- unsigned int ref_count = 0, count = 0;
+ unsigned int ref_count = 0, count = 0, nr_mapped = 0;
do {
- if (PageHWPoison(page))
+ if (PageHWPoison(page)) {
+ if (nr_mapped) {
+ vmf->pte -= nr_mapped;
+ do_set_pte_range(vmf, folio,
+ start + count - nr_mapped,
+ addr - nr_mapped * PAGE_SIZE,
+ nr_mapped);
+
+ }
+
+ nr_mapped = 0;
continue;
+ }
if (mmap_miss > 0)
mmap_miss--;
@@ -3378,16 +3389,34 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
* handled in the specific fault path, and it'll prohibit the
* fault-around logic.
*/
- if (!pte_none(*vmf->pte))
+ if (!pte_none(*vmf->pte)) {
+ if (nr_mapped) {
+ vmf->pte -= nr_mapped;
+ do_set_pte_range(vmf, folio,
+ start + count - nr_mapped,
+ addr - nr_mapped * PAGE_SIZE,
+ nr_mapped);
+
+ }
+
+ nr_mapped = 0;
+
continue;
+ }
if (vmf->address == addr)
ret = VM_FAULT_NOPAGE;
ref_count++;
- do_set_pte(vmf, page, addr);
+ nr_mapped++;
} while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
+ if (nr_mapped) {
+ vmf->pte -= nr_mapped;
+ do_set_pte_range(vmf, folio, start + count - nr_mapped,
+ addr - nr_mapped * PAGE_SIZE, nr_mapped);
+ }
+
/*
* Restore the vmf->pte. Otherwise, it's possible vmf->pte point
* to next page table entry if the last sub page in the range is
--
2.30.2
* Re: [RFC PATCH v2 5/5] filemap: batched update mm counter,rmap when map file folio
From: Matthew Wilcox @ 2023-02-01 15:50 UTC (permalink / raw)
To: Yin Fengwei; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang
On Wed, Feb 01, 2023 at 04:17:37PM +0800, Yin Fengwei wrote:
> Use do_set_pte_range() in filemap_map_folio_range(). Which
> batched updates mm counter, rmap.
>
> With a self cooked will-it-scale.page_fault3 like app (change
> file write fault to read fault) got 15% performance gain.
I'd suggest that you create a will-it-scale.page_fault4. Anton
is quite open to adding new variations of tests.
> Perf data collected before/after the change:
> 18.73%--page_add_file_rmap
> |
> --11.60%--__mod_lruvec_page_state
> |
> |--7.40%--__mod_memcg_lruvec_state
> | |
> | --5.58%--cgroup_rstat_updated
> |
> --2.53%--__mod_lruvec_state
> |
> --1.48%--__mod_node_page_state
>
> 9.93%--page_add_file_rmap_range
> |
> --2.67%--__mod_lruvec_page_state
> |
> |--1.95%--__mod_memcg_lruvec_state
> | |
> | --1.57%--cgroup_rstat_updated
> |
> --0.61%--__mod_lruvec_state
> |
> --0.54%--__mod_node_page_state
>
> The running time of __mode_lruvec_page_state() is reduced a lot.
Nice.
> +++ b/mm/filemap.c
> @@ -3364,11 +3364,22 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
> struct file *file = vma->vm_file;
> struct page *page = folio_page(folio, start);
> unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
> - unsigned int ref_count = 0, count = 0;
> + unsigned int ref_count = 0, count = 0, nr_mapped = 0;
>
> do {
> - if (PageHWPoison(page))
> + if (PageHWPoison(page)) {
> + if (nr_mapped) {
> + vmf->pte -= nr_mapped;
> + do_set_pte_range(vmf, folio,
> + start + count - nr_mapped,
> + addr - nr_mapped * PAGE_SIZE,
> + nr_mapped);
> +
> + }
> +
> + nr_mapped = 0;
> continue;
> + }
Having subtracted nr_mapped from vmf->pte, we then need to add it again.
But this is all too complicated. What if we don't update vmf->pte
each time around the loop? ie something like this:
do {
if (PageHWPoisoned(page))
goto map;
if (mmap_miss > 0)
mmap_miss--;
/*
* If there're PTE markers, we'll leave them to be handled
* in the specific fault path, and it'll prohibit the
* fault-around logic.
*/
if (!pte_none(vmf->pte[count]))
goto map;
if (vmf->address == addr + count * PAGE_SIZE)
ret = VM_FAULT_NOPAGE;
continue;
map:
if (count > 1) {
do_set_pte_range(vmf, folio, start, addr, count - 1);
folio_ref_add(folio, count - 1);
}
start += count;
vmf->pte += count;
addr += count * PAGE_SIZE;
nr_pages -= count;
count = 0;
} while (page++, ++count < nr_pages);
if (count > 0) {
do_set_pte_range(vmf, folio, start, addr, count);
folio_ref_add(folio, count);
} else {
/* Make sure the PTE points to the correct page table */
vmf->pte--;
}
* Re: [RFC PATCH v2 5/5] filemap: batched update mm counter,rmap when map file folio
From: Yin, Fengwei @ 2023-02-02 3:31 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang
On 2/1/2023 11:50 PM, Matthew Wilcox wrote:
> On Wed, Feb 01, 2023 at 04:17:37PM +0800, Yin Fengwei wrote:
>> Use do_set_pte_range() in filemap_map_folio_range(). Which
>> batched updates mm counter, rmap.
>>
>> With a self cooked will-it-scale.page_fault3 like app (change
>> file write fault to read fault) got 15% performance gain.
>
> I'd suggest that you create a will-it-scale.page_fault4. Anton
> is quite open to adding new variations of tests.
OK. I will do this.
>
>> Perf data collected before/after the change:
>> 18.73%--page_add_file_rmap
>> |
>> --11.60%--__mod_lruvec_page_state
>> |
>> |--7.40%--__mod_memcg_lruvec_state
>> | |
>> | --5.58%--cgroup_rstat_updated
>> |
>> --2.53%--__mod_lruvec_state
>> |
>> --1.48%--__mod_node_page_state
>>
>> 9.93%--page_add_file_rmap_range
>> |
>> --2.67%--__mod_lruvec_page_state
>> |
>> |--1.95%--__mod_memcg_lruvec_state
>> | |
>> | --1.57%--cgroup_rstat_updated
>> |
>> --0.61%--__mod_lruvec_state
>> |
>> --0.54%--__mod_node_page_state
>>
>> The running time of __mode_lruvec_page_state() is reduced a lot.
>
> Nice.
>
>> +++ b/mm/filemap.c
>> @@ -3364,11 +3364,22 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>> struct file *file = vma->vm_file;
>> struct page *page = folio_page(folio, start);
>> unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
>> - unsigned int ref_count = 0, count = 0;
>> + unsigned int ref_count = 0, count = 0, nr_mapped = 0;
>>
>> do {
>> - if (PageHWPoison(page))
>> + if (PageHWPoison(page)) {
>> + if (nr_mapped) {
>> + vmf->pte -= nr_mapped;
>> + do_set_pte_range(vmf, folio,
>> + start + count - nr_mapped,
>> + addr - nr_mapped * PAGE_SIZE,
>> + nr_mapped);
>> +
>> + }
>> +
>> + nr_mapped = 0;
>> continue;
>> + }
>
> Having subtracted nr_mapped from vmf->pte, we then need to add it again.
>
> But this is all too complicated. What if we don't update vmf->pte
> each time around the loop? ie something like this:
Thanks a lot for the suggestion. I like vmf->pte[count] a lot. Will
update the code accordingly in next version.
Regards
Yin, Fengwei
>
> do {
> if (PageHWPoisoned(page))
> goto map;
> if (mmap_miss > 0)
> mmap_miss--;
> /*
> * If there're PTE markers, we'll leave them to be handled
> * in the specific fault path, and it'll prohibit the
> * fault-around logic.
> */
> if (!pte_none(vmf->pte[count]))
> goto map;
> if (vmf->address == addr + count * PAGE_SIZE)
> ret = VM_FAULT_NOPAGE;
> continue;
> map:
> if (count > 1) {
> do_set_pte_range(vmf, folio, start, addr, count - 1);
> folio_ref_add(folio, count - 1);
> }
> start += count;
> vmf->pte += count;
> addr += count * PAGE_SIZE;
> nr_pages -= count;
> count = 0;
> } while (page++, ++count < nr_pages);
>
> if (count > 0) {
> do_set_pte_range(vmf, folio, start, addr, count);
> folio_ref_add(folio, count);
> } else {
> /* Make sure the PTE points to the correct page table */
> vmf->pte--;
> }
* Re: [RFC PATCH v2 5/5] filemap: batched update mm counter,rmap when map file folio
From: Yin, Fengwei @ 2023-02-03 13:15 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang
On 2/1/2023 11:50 PM, Matthew Wilcox wrote:
> On Wed, Feb 01, 2023 at 04:17:37PM +0800, Yin Fengwei wrote:
>> Use do_set_pte_range() in filemap_map_folio_range(). Which
>> batched updates mm counter, rmap.
>>
>> With a self cooked will-it-scale.page_fault3 like app (change
>> file write fault to read fault) got 15% performance gain.
>
> I'd suggest that you create a will-it-scale.page_fault4. Anton
> is quite open to adding new variations of tests.
>
>> Perf data collected before/after the change:
>> 18.73%--page_add_file_rmap
>> |
>> --11.60%--__mod_lruvec_page_state
>> |
>> |--7.40%--__mod_memcg_lruvec_state
>> | |
>> | --5.58%--cgroup_rstat_updated
>> |
>> --2.53%--__mod_lruvec_state
>> |
>> --1.48%--__mod_node_page_state
>>
>> 9.93%--page_add_file_rmap_range
>> |
>> --2.67%--__mod_lruvec_page_state
>> |
>> |--1.95%--__mod_memcg_lruvec_state
>> | |
>> | --1.57%--cgroup_rstat_updated
>> |
>> --0.61%--__mod_lruvec_state
>> |
>> --0.54%--__mod_node_page_state
>>
>> The running time of __mode_lruvec_page_state() is reduced a lot.
>
> Nice.
>
>> +++ b/mm/filemap.c
>> @@ -3364,11 +3364,22 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>> struct file *file = vma->vm_file;
>> struct page *page = folio_page(folio, start);
>> unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
>> - unsigned int ref_count = 0, count = 0;
>> + unsigned int ref_count = 0, count = 0, nr_mapped = 0;
>>
>> do {
>> - if (PageHWPoison(page))
>> + if (PageHWPoison(page)) {
>> + if (nr_mapped) {
>> + vmf->pte -= nr_mapped;
>> + do_set_pte_range(vmf, folio,
>> + start + count - nr_mapped,
>> + addr - nr_mapped * PAGE_SIZE,
>> + nr_mapped);
>> +
>> + }
>> +
>> + nr_mapped = 0;
>> continue;
>> + }
>
> Having subtracted nr_mapped from vmf->pte, we then need to add it again.
>
> But this is all too complicated. What if we don't update vmf->pte
> each time around the loop? ie something like this:
>
> do {
> if (PageHWPoisoned(page))
> goto map;
> if (mmap_miss > 0)
> mmap_miss--;
> /*
> * If there're PTE markers, we'll leave them to be handled
> * in the specific fault path, and it'll prohibit the
> * fault-around logic.
> */
> if (!pte_none(vmf->pte[count]))
> goto map;
> if (vmf->address == addr + count * PAGE_SIZE)
> ret = VM_FAULT_NOPAGE;
> continue;
> map:
> if (count > 1) {
> do_set_pte_range(vmf, folio, start, addr, count - 1);
> folio_ref_add(folio, count - 1);
> }
> start += count;
> vmf->pte += count;
> addr += count * PAGE_SIZE;
> nr_pages -= count;
> count = 0;
I found it's not easy to make this part correct. So I tried to simplify the
logic here a little bit.
Regards
Yin, Fengwei
> } while (page++, ++count < nr_pages);
>
> if (count > 0) {
> do_set_pte_range(vmf, folio, start, addr, count);
> folio_ref_add(folio, count);
> } else {
> /* Make sure the PTE points to the correct page table */
> vmf->pte--;
> }