Re: [RFC PATCH] mm: swap: reuse exclusive folio directly instead of wp page faults

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: David Hildenbrand <david@redhat.com>
To: Barry Song <21cnbao@gmail.com>,
	akpm@linux-foundation.org, linux-mm@kvack.org
Cc: chrisl@kernel.org, surenb@google.com, kasong@tencent.com,
	minchan@kernel.org, willy@infradead.org, ryan.roberts@arm.com,
	linux-kernel@vger.kernel.org, Barry Song <v-songbaohua@oppo.com>
Subject: Re: [RFC PATCH] mm: swap: reuse exclusive folio directly instead of wp page faults
Date: Fri, 31 May 2024 13:08:31 +0200	[thread overview]
Message-ID: <87ac9610-5650-451f-aa54-e634a6310af4@redhat.com> (raw)
In-Reply-To: <20240531104819.140218-1-21cnbao@gmail.com>

On 31.05.24 12:48, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> After swapping out, we perform a swap-in operation. If we first read
> and then write, we encounter a major fault in do_swap_page for reading,
> along with additional minor faults in do_wp_page for writing. However,
> the latter appears to be unnecessary and inefficient. Instead, we can
> directly reuse in do_swap_page and completely eliminate the need for
> do_wp_page.
> 
> This patch achieves that optimization specifically for exclusive folios.
> The following microbenchmark demonstrates the significant reduction in
> minor faults.
> 
>   #define DATA_SIZE (2UL * 1024 * 1024)
>   #define PAGE_SIZE (4UL * 1024)
> 
>   static void *read_write_data(char *addr)
>   {
>           char tmp;
> 
>           for (int i = 0; i < DATA_SIZE; i += PAGE_SIZE) {
>                   tmp = *(volatile char *)(addr + i);
>                   *(volatile char *)(addr + i) = tmp;
>           }
>   }
> 
>   int main(int argc, char **argv)
>   {
>           struct rusage ru;
> 
>           char *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
>                           MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>           memset(addr, 0x11, DATA_SIZE);
> 
>           do {
>                   long old_ru_minflt, old_ru_majflt;
>                   long new_ru_minflt, new_ru_majflt;
> 
>                   madvise(addr, DATA_SIZE, MADV_PAGEOUT);
> 
>                   getrusage(RUSAGE_SELF, &ru);
>                   old_ru_minflt = ru.ru_minflt;
>                   old_ru_majflt = ru.ru_majflt;
> 
>                   read_write_data(addr);
>                   getrusage(RUSAGE_SELF, &ru);
>                   new_ru_minflt = ru.ru_minflt;
>                   new_ru_majflt = ru.ru_majflt;
> 
>                   printf("minor faults:%ld major faults:%ld\n",
>                           new_ru_minflt - old_ru_minflt,
>                           new_ru_majflt - old_ru_majflt);
>           } while(0);
> 
>           return 0;
>   }
> 
> w/o patch,
> / # ~/a.out
> minor faults:512 major faults:512
> 
> w/ patch,
> / # ~/a.out
> minor faults:0 major faults:512
> 
> Minor faults decrease to 0!
> 
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>   mm/memory.c | 7 ++++---
>   1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index eef4e482c0c2..e1d2e339958e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4325,9 +4325,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   	 */
>   	if (!folio_test_ksm(folio) &&
>   	    (exclusive || folio_ref_count(folio) == 1)) {
> -		if (vmf->flags & FAULT_FLAG_WRITE) {
> -			pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> -			vmf->flags &= ~FAULT_FLAG_WRITE;
> +		if (vma->vm_flags & VM_WRITE) {
> +			pte = pte_mkwrite(pte_mkdirty(pte), vma);
> +			if (vmf->flags & FAULT_FLAG_WRITE)
> +				vmf->flags &= ~FAULT_FLAG_WRITE;

This implies, that even on a read fault, you would mark the pte dirty 
and it would have to be written back to swap if still in the swap cache 
and only read.

That is controversial.

What is less controversial is doing what mprotect() via 
change_pte_range()/can_change_pte_writable() would do: mark the PTE 
writable but not dirty.

I suggest setting the pte only dirty if FAULT_FLAG_WRITE is set.

-- 
Cheers,

David / dhildenb

next prev parent reply	other threads:[~2024-05-31 11:08 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-31 10:48 Barry Song
2024-05-31 11:08 ` David Hildenbrand [this message]
2024-05-31 11:55   ` Barry Song
2024-05-31 12:10     ` David Hildenbrand
2024-05-31 12:20       ` Barry Song
2024-05-31 12:30         ` Barry Song
2024-05-31 12:34           ` David Hildenbrand
2024-06-01  0:48             ` Barry Song
2024-06-01  6:44               ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87ac9610-5650-451f-aa54-e634a6310af4@redhat.com \
    --to=david@redhat.com \
    --cc=21cnbao@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=chrisl@kernel.org \
    --cc=kasong@tencent.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=minchan@kernel.org \
    --cc=ryan.roberts@arm.com \
    --cc=surenb@google.com \
    --cc=v-songbaohua@oppo.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox