linux-mm.kvack.org archive mirror
* [PATCH] s390: Remove PageDirty check inside mk_pte()
@ 2025-01-16 21:23 Matthew Wilcox (Oracle)
  2025-01-20 16:39 ` Alexander Gordeev
  0 siblings, 1 reply; 8+ messages in thread
From: Matthew Wilcox (Oracle) @ 2025-01-16 21:23 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Matthew Wilcox (Oracle),
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik, Claudio Imbrenda,
	Christian Borntraeger, linux-mm

In commit abf09bed3cce ("s390/mm: implement software dirty bits"),
the rationale for adding the PageDirty() test was:

    To avoid an excessive number of additional faults the
    mk_pte primitive checks for PageDirty if the pgprot value
    allows for writes and pre-dirties the pte. That avoids all
    additional faults for tmpfs and shmem pages until these
    pages are added to the swap cache.

shmem has not marked newly allocated folios as dirty since 2016 (commit
75edd345e8ed), so this test has been ineffective since then.  It still
triggers for some calls to mk_pte(), but those calls are usually
followed by other calls to pte_mkdirty(), making the ones inside
s390's mk_pte() redundant.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 arch/s390/include/asm/pgtable.h | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

Please check that this doesn't cause any real-world performance issues.
Alexander and I briefly discussed this last year:
https://lore.kernel.org/linux-mm/20240814154427.162475-5-willy@infradead.org/
but then I missed his follow-up on August 22nd, and thought it best to
reopen the conversation with a fresh patch to test.

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 48268095b0a3..4c7e2fc2b5ff 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1443,11 +1443,8 @@ static inline pte_t mk_pte_phys(unsigned long physpage, pgprot_t pgprot)
 static inline pte_t mk_pte(struct page *page, pgprot_t pgprot)
 {
 	unsigned long physpage = page_to_phys(page);
-	pte_t __pte = mk_pte_phys(physpage, pgprot);
 
-	if (pte_write(__pte) && PageDirty(page))
-		__pte = pte_mkdirty(__pte);
-	return __pte;
+	return mk_pte_phys(physpage, pgprot);
 }
 
 #define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
-- 
2.45.2



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] s390: Remove PageDirty check inside mk_pte()
  2025-01-16 21:23 [PATCH] s390: Remove PageDirty check inside mk_pte() Matthew Wilcox (Oracle)
@ 2025-01-20 16:39 ` Alexander Gordeev
  2025-01-20 19:01   ` David Hildenbrand
  2025-01-30 22:12   ` Matthew Wilcox
  0 siblings, 2 replies; 8+ messages in thread
From: Alexander Gordeev @ 2025-01-20 16:39 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: Gerald Schaefer, Heiko Carstens, Vasily Gorbik, Claudio Imbrenda,
	Christian Borntraeger, linux-mm

On Thu, Jan 16, 2025 at 09:23:36PM +0000, Matthew Wilcox (Oracle) wrote:

Hi Matthew,

> In commit abf09bed3cce ("s390/mm: implement software dirty bits"),
> the rationale for adding the PageDirty() test was:
> 
>     To avoid an excessive number of additional faults the
>     mk_pte primitive checks for PageDirty if the pgprot value
>     allows for writes and pre-dirties the pte. That avoids all
>     additional faults for tmpfs and shmem pages until these
>     pages are added to the swap cache.
> 
> shmem has not marked newly allocated folios as dirty since 2016 (commit
> 75edd345e8ed), so this test has been ineffective since then.  It still
> triggers for some calls to mk_pte(), but those calls are usually
> followed by other calls to pte_mkdirty(), making the ones inside
> s390's mk_pte() redundant.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  arch/s390/include/asm/pgtable.h | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> Please check this doesn't cause any real world performance issues.
> Alexander and I briefly discussed this last year:
> https://lore.kernel.org/linux-mm/20240814154427.162475-5-willy@infradead.org/
> but then I missed his followup on August 22nd, and thought it best to
> reopen the conversation with a fresh patch to test.

I tried to take exactly this approach last year, but dropped the ball.

Despite the claim in commit 75edd345e8ed above, I still observed that a read
access to a shmem page could end up creating a dirty PTE:

	handle_pte_fault() -> do_pte_missing() -> do_fault() ->
	do_read_fault() -> finish_fault() -> set_pte_range() -> mk_pte()

Our internal discussion (among other things) concluded that we really need
to understand where the faults are coming from.

Further, a cursory LTP test showed an increase of ~18K page faults, which
I have not yet confirmed. That is the first thing I will re-do.

Is this change a prerequisite for something, or what is your aim with
this patch?

> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 48268095b0a3..4c7e2fc2b5ff 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1443,11 +1443,8 @@ static inline pte_t mk_pte_phys(unsigned long physpage, pgprot_t pgprot)
>  static inline pte_t mk_pte(struct page *page, pgprot_t pgprot)
>  {
>  	unsigned long physpage = page_to_phys(page);
> -	pte_t __pte = mk_pte_phys(physpage, pgprot);
>  
> -	if (pte_write(__pte) && PageDirty(page))
> -		__pte = pte_mkdirty(__pte);
> -	return __pte;
> +	return mk_pte_phys(physpage, pgprot);
>  }
>  
>  #define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))

Thanks!



* Re: [PATCH] s390: Remove PageDirty check inside mk_pte()
  2025-01-20 16:39 ` Alexander Gordeev
@ 2025-01-20 19:01   ` David Hildenbrand
  2025-01-30 22:12   ` Matthew Wilcox
  1 sibling, 0 replies; 8+ messages in thread
From: David Hildenbrand @ 2025-01-20 19:01 UTC (permalink / raw)
  To: Alexander Gordeev, Matthew Wilcox (Oracle)
  Cc: Gerald Schaefer, Heiko Carstens, Vasily Gorbik, Claudio Imbrenda,
	Christian Borntraeger, linux-mm

On 20.01.25 17:39, Alexander Gordeev wrote:
> On Thu, Jan 16, 2025 at 09:23:36PM +0000, Matthew Wilcox (Oracle) wrote:
> 
> Hi Matthew,
> 
>> In commit abf09bed3cce ("s390/mm: implement software dirty bits"),
>> the rationale for adding the PageDirty() test was:
>>
>>      To avoid an excessive number of additional faults the
>>      mk_pte primitive checks for PageDirty if the pgprot value
>>      allows for writes and pre-dirties the pte. That avoids all
>>      additional faults for tmpfs and shmem pages until these
>>      pages are added to the swap cache.
>>
>> shmem has not marked newly allocated folios as dirty since 2016 (commit
>> 75edd345e8ed), so this test has been ineffective since then.  It still
>> triggers for some calls to mk_pte(), but those calls are usually
>> followed by other calls to pte_mkdirty(), making the ones inside
>> s390's mk_pte() redundant.
>>
>> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>> ---
>>   arch/s390/include/asm/pgtable.h | 5 +----
>>   1 file changed, 1 insertion(+), 4 deletions(-)
>>
>> Please check this doesn't cause any real world performance issues.
>> Alexander and I briefly discussed this last year:
>> https://lore.kernel.org/linux-mm/20240814154427.162475-5-willy@infradead.org/
>> but then I missed his followup on August 22nd, and thought it best to
>> reopen the conversation with a fresh patch to test.
> 
> I tried to take exactly this approach last year, but dropped the ball.
> 
> Despite the claim in commit 75edd345e8ed above, I still observed that a read
> access to a shmem page could end up creating a dirty PTE:
> 
> 	handle_pte_fault() -> do_pte_missing() -> do_fault() ->
> 	do_read_fault() -> finish_fault() -> set_pte_range() -> mk_pte()

Could we perform the dirty-setting at that level, in the caller of mk_pte()?

-- 
Cheers,

David / dhildenb




* Re: [PATCH] s390: Remove PageDirty check inside mk_pte()
  2025-01-20 16:39 ` Alexander Gordeev
  2025-01-20 19:01   ` David Hildenbrand
@ 2025-01-30 22:12   ` Matthew Wilcox
  2025-02-03 13:03     ` Alexander Gordeev
  1 sibling, 1 reply; 8+ messages in thread
From: Matthew Wilcox @ 2025-01-30 22:12 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Gerald Schaefer, Heiko Carstens, Vasily Gorbik, Claudio Imbrenda,
	Christian Borntraeger, linux-mm

On Mon, Jan 20, 2025 at 05:39:43PM +0100, Alexander Gordeev wrote:
> Despite the claim in commit 75edd345e8ed above, I still observed that a read
> access to a shmem page could end up creating a dirty PTE:
> 
> 	handle_pte_fault() -> do_pte_missing() -> do_fault() ->
> 	do_read_fault() -> finish_fault() -> set_pte_range() -> mk_pte()
> 
> Our internal discussion (among other things) concluded that we really need
> to understand where the faults are coming from.
> 
> Further, a cursory LTP test showed an increase of ~18K page faults, which
> I have not yet confirmed. That is the first thing I will re-do.
> 
> Is this change a prerequisite for something, or what is your aim with
> this patch?

One of the things that needs to happen for the splitting of struct page
and struct folio is the removal of all '&folio->page'.  Most are in
compatibility code or code that is being transitioned and can safely be
ignored.  One of the places that needs to be changed is:

	entry = mk_pte(&folio->page, vma->vm_page_prot);
(there are a few places like this).

I believe we need a folio_mk_pte(), and I don't want to define it for
each architecture.  I don't even want to have a default implementation
and allow arches to override.

So the question becomes whether to:

(a) Get rid of the conditional pte_mkdirty() entirely (this trial
    balloon)
(b) Put it in folio_mk_pte() for everybody, not just s390
(c) Put it in set_pte_range() as David suggested.

It's feeling like (c) is the best idea.



* Re: [PATCH] s390: Remove PageDirty check inside mk_pte()
  2025-01-30 22:12   ` Matthew Wilcox
@ 2025-02-03 13:03     ` Alexander Gordeev
  2025-02-12 12:44       ` Alexander Gordeev
  0 siblings, 1 reply; 8+ messages in thread
From: Alexander Gordeev @ 2025-02-03 13:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Gerald Schaefer, Heiko Carstens, Vasily Gorbik, Claudio Imbrenda,
	Christian Borntraeger, linux-mm

On Thu, Jan 30, 2025 at 10:12:48PM +0000, Matthew Wilcox wrote:
> On Mon, Jan 20, 2025 at 05:39:43PM +0100, Alexander Gordeev wrote:
> > Despite the claim in commit 75edd345e8ed above, I still observed that a read
> > access to a shmem page could end up creating a dirty PTE:
> > 
> > 	handle_pte_fault() -> do_pte_missing() -> do_fault() ->
> > 	do_read_fault() -> finish_fault() -> set_pte_range() -> mk_pte()
> > 
> > Our internal discussion (among other things) concluded that we really need
> > to understand where the faults are coming from.
> > 
> > Further, a cursory LTP test showed an increase of ~18K page faults, which
> > I have not yet confirmed. That is the first thing I will re-do.
> > 
> > Is this change a prerequisite for something, or what is your aim with
> > this patch?
> 
> One of the things that needs to happen for the splitting of struct page
> and struct folio is the removal of all '&folio->page'.  Most are in
> compatibility code or code that is being transitioned and can safely be
> ignored.  One of the places that needs to be changed is:
> 
> 	entry = mk_pte(&folio->page, vma->vm_page_prot);
> (there's a few places like this).
> 
> I believe we need a folio_mk_pte(), and I don't want to define it for
> each architecture.  I don't even want to have a default implementation
> and allow arches to override.

I guess folio_mk_pte() would then still call the arch-specific mk_pte()?

> So the question becomes whether to:
> 
> (a) Get rid of the conditional pte_mkdirty() entirely (this trial
>     balloon)
> (b) Put it in folio_mk_pte() for everybody, not just s390
> (c) Put it in set_pte_range() as David suggested.
> 
> It's feeling like (c) is the best idea.

I will check option (c).

Thanks!



* Re: [PATCH] s390: Remove PageDirty check inside mk_pte()
  2025-02-03 13:03     ` Alexander Gordeev
@ 2025-02-12 12:44       ` Alexander Gordeev
  2025-02-12 13:08         ` Claudio Imbrenda
  0 siblings, 1 reply; 8+ messages in thread
From: Alexander Gordeev @ 2025-02-12 12:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Gerald Schaefer, Heiko Carstens, Vasily Gorbik, Claudio Imbrenda,
	Christian Borntraeger, linux-mm

On Mon, Feb 03, 2025 at 02:03:05PM +0100, Alexander Gordeev wrote:
> > So the question becomes whether to:
> > 
> > (a) Get rid of the conditional pte_mkdirty() entirely (this trial
> >     balloon)
> > (b) Put it in folio_mk_pte() for everybody, not just s390
> > (c) Put it in set_pte_range() as David suggested.
> > 
> > It's feeling like (c) is the best idea.
> 
> I will check option (c)

These call stacks end up in set_pte_range():

mk_pte()
set_pte_range()
finish_fault()
do_read_fault()
do_pte_missing()
__handle_mm_fault()
handle_mm_fault()

mk_pte()
set_pte_range()
finish_fault()
do_shared_fault()
do_pte_missing()
__handle_mm_fault()
handle_mm_fault()

mk_pte()
set_pte_range()
filemap_map_pages()
do_read_fault()
do_pte_missing()
__handle_mm_fault()
handle_mm_fault()

Moving the PageDirty() check to generic code works (in the sense that the
page-fault volume does not notably increase):

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 3ca5af4cfe43..b5d88f2b5214 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1454,8 +1454,6 @@ static inline pte_t mk_pte(struct page *page, pgprot_t pgprot)
 	unsigned long physpage = page_to_phys(page);
 	pte_t __pte = mk_pte_phys(physpage, pgprot);
 
-	if (pte_write(__pte) && PageDirty(page))
-		__pte = pte_mkdirty(__pte);
 	return __pte;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 539c0f7c6d54..4b04325db2ee 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5116,6 +5116,8 @@ void set_pte_range(struct vm_fault *vmf, struct folio *folio,
 
 	flush_icache_pages(vma, page, nr);
 	entry = mk_pte(page, vma->vm_page_prot);
+	if (pte_write(entry) && PageDirty(page))
+		entry = pte_mkdirty(entry);
 
 	if (prefault && arch_wants_old_prefaulted_pte())
 		entry = pte_mkold(entry);

The above is, however, not exactly the same, since set_pte_range() -> set_ptes()
dirties all PTEs in a folio - unlike the current s390 implementation, which
dirties a single PTE based on its struct page flag.

remove_migration_ptes() probably needs to be updated as well, unless we are
fine with stating that a PTE is not always required to be pre-dirtied. That
could also be true for mk_pte() call paths I did not manage to find or that
will get added in the future.

mk_pte()
remove_migration_ptes()
migrate_pages_batch()
migrate_pages_sync()
migrate_pages()
compact_zone()
compact_node()
kcompactd()
kthread()
__ret_from_fork()
ret_from_fork()

Also, hugetlb PTEs are affected by the above change, since they are created
via mk_pte() but do not go through set_pte_range():

mk_pte()
mk_huge_pte()
make_huge_pte()
hugetlb_no_page()
hugetlb_fault()
handle_mm_fault()

Thus, we need to consider pre-dirtying hugetlb PTEs as well, which I think
would be:

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 65068671e460..1c890b3c9453 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5168,7 +5168,7 @@ static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
 	pte_t entry;
 	unsigned int shift = huge_page_shift(hstate_vma(vma));
 
-	if (try_mkwrite && (vma->vm_flags & VM_WRITE)) {
+	if ((vma->vm_flags & VM_WRITE) && (try_mkwrite || PageDirty(page))) {
 		entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_huge_pte(page,
 					 vma->vm_page_prot)));
 	} else {


Thanks!



* Re: [PATCH] s390: Remove PageDirty check inside mk_pte()
  2025-02-12 12:44       ` Alexander Gordeev
@ 2025-02-12 13:08         ` Claudio Imbrenda
  2025-02-12 16:43           ` Matthew Wilcox
  0 siblings, 1 reply; 8+ messages in thread
From: Claudio Imbrenda @ 2025-02-12 13:08 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Matthew Wilcox, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, linux-mm

On Wed, 12 Feb 2025 13:44:58 +0100
Alexander Gordeev <agordeev@linux.ibm.com> wrote:

[...]

> The above is, however, not exactly the same, since set_pte_range() -> set_ptes()
> dirties all PTEs in a folio - unlike the current s390 implementation, which
> dirties a single PTE based on its struct page flag.

I have not looked enough into this specific matter to actually have an
opinion, but I just want to quickly point out that in the next couple
of years struct page as we know it will go away, and flags will only be
per-folio anyway.
For more details, see this presentation by Matthew Wilcox: 
https://fosdem.org/2025/schedule/event/fosdem-2025-4860-shrinking-memmap/

So starting to do things per-folio instead of per-page is a lot more
future-proof.

[...]



* Re: [PATCH] s390: Remove PageDirty check inside mk_pte()
  2025-02-12 13:08         ` Claudio Imbrenda
@ 2025-02-12 16:43           ` Matthew Wilcox
  0 siblings, 0 replies; 8+ messages in thread
From: Matthew Wilcox @ 2025-02-12 16:43 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, linux-mm

On Wed, Feb 12, 2025 at 02:08:06PM +0100, Claudio Imbrenda wrote:
> On Wed, 12 Feb 2025 13:44:58 +0100
> Alexander Gordeev <agordeev@linux.ibm.com> wrote:
> 
> [...]
> 
> > The above is, however, not exactly the same, since set_pte_range() -> set_ptes()
> > dirties all PTEs in a folio - unlike the current s390 implementation, which
> > dirties a single PTE based on its struct page flag.
> 
> I have not looked enough into this specific matter to actually have an
> opinion, but I just want to quickly point out that in the next couple
> of years struct page as we know it will go away, and flags will only be
> per-folio anyway.

Most flags have been per-allocation rather than per-page for many years.
Specifically, the dirty flag has been per-allocation since January 2016,
in commit df8c94d13c7e.



