linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Andrew Morton <akpm@linux-foundation.org>, linux-mm@kvack.org
Subject: Re: [PATCH v2] mm: memory: check userfaultfd_wp() in vmf_orig_pte_uffd_wp()
Date: Fri, 19 Apr 2024 11:17:27 -0400	[thread overview]
Message-ID: <ZiKLBxS8D_jzD6F9@x1n> (raw)
In-Reply-To: <fe4c5e73-7242-4d35-ad76-1ee55cae6155@huawei.com>

On Fri, Apr 19, 2024 at 11:00:46AM +0800, Kefeng Wang wrote:
> 
> 
> On 2024/4/19 0:32, Peter Xu wrote:
> > Hi, Kefeng,
> > 
> > On Thu, Apr 18, 2024 at 08:06:41PM +0800, Kefeng Wang wrote:
> > > Add userfaultfd_wp() check in vmf_orig_pte_uffd_wp() to avoid the
> > > unnecessary pte_marker_entry_uffd_wp() in most pagefault, difference
> > > as shows below from perf data of lat_pagefault, note, the function
> > > vmf_orig_pte_uffd_wp() is not inlined in the two kernel versions.
> > > 
> > >    perf report -i perf.data.before | grep vmf
> > >       0.17%     0.13%  lat_pagefault  [kernel.kallsyms]      [k] vmf_orig_pte_uffd_wp.part.0.isra.0
> > >    perf report -i perf.data.after  | grep vmf
> > 
> > Any real number to share too besides the perf greps?  I meant, even if perf
> > report will not report such function anymore, it doesn't mean it'll be
> > faster, and how much it improves?
> 
> dd if=/dev/zero of=/tmp/XXX bs=512M count=1
> ./lat_pagefault -W 5 -N 5 /tmp/XXX
> 
> 	before		after	
> 1	0.2623		0.2605	
> 2	0.2622		0.2598	
> 3	0.2621		0.2595	
> 4	0.2622		0.2600	
> 5	0.2651		0.2598	
> 6	0.2624		0.2594	
> 7	0.2624		0.2605	
> 8	0.2627		0.2608	
> average	0.262675	0.2600375	-0.0026375
> 
> The lat_pagefault does show some improvement(also I reboot and retest,
> the results are same).

Thanks. Could you replace the perf report with these real data?  Or just
append to it.

I had a look at the asm and indeed the current code will generate two
jumps when without this patch, and I don't know why..

   0x0000000000006ac4 <+52>:    test   $0x8,%ah                 <---- check FAULT_FLAG_ORIG_PTE_VALID
   0x0000000000006ac7 <+55>:    jne    0x6bcf <set_pte_range+319>
   0x0000000000006acd <+61>:    mov    0x18(%rbp),%rsi

   ...

   0x0000000000006bcf <+319>:   mov    0x40(%rdi),%rdi
   0x0000000000006bd3 <+323>:   test   $0xffffffffffffff9f,%rdi <---- pte_none() check
   0x0000000000006bda <+330>:   je     0x6acd <set_pte_range+61>
   0x0000000000006be0 <+336>:   test   $0x101,%edi              <---- pte_present() check
   0x0000000000006be6 <+342>:   jne    0x6acd <set_pte_range+61>
   0x0000000000006bec <+348>:   call   0x1c50 <pte_to_swp_entry>
   0x0000000000006bf1 <+353>:   mov    0x0(%rip),%rdx        # 0x6bf8 <set_pte_range+360>
   0x0000000000006bf8 <+360>:   mov    %rax,%r15
   0x0000000000006bfb <+363>:   shr    $0x3a,%rax
   0x0000000000006bff <+367>:   cmp    $0x1f,%rax
   0x0000000000006c03 <+371>:   mov    $0x0,%eax
   0x0000000000006c08 <+376>:   cmovne %rax,%r15
   0x0000000000006c0c <+380>:   mov    0x28(%rbx),%eax
   0x0000000000006c0f <+383>:   and    $0x1,%r15d
   0x0000000000006c13 <+387>:   jmp    0x6acd <set_pte_range+61>

I also don't know why the compiler cannot already merge the none+present
check into one shot, I thought it could.  Also surprised me that
pte_to_swp_entry() is a function call.. but not involved in this context.

So I think I was right it should bypass this when seeing it pte_none,
however that includes two jumps.

And with your patch applied the two jumps are not there:

   0x0000000000006b0c <+124>:   testb  $0x8,0x29(%r14)           <--- FAULT_FLAG_ORIG_PTE_VALID
   0x0000000000006b11 <+129>:   je     0x6b6a <set_pte_range+218>
   0x0000000000006b13 <+131>:   mov    (%r14),%rax
   0x0000000000006b16 <+134>:   testb  $0x10,0x21(%rax)          <--- userfaultfd_wp(vmf->vma) check
   0x0000000000006b1a <+138>:   je     0x6b6a <set_pte_range+218>

Maybe that's what contributes to that 0.x% extra time of a fault.

So if we do care about this 0.x% and we're doing this anyway, perhaps move
the vma check upper?  Because afaict FAULT_FLAG_ORIG_PTE_VALID should
always hit in set_pte_range(), so we can avoid two more insts in the common
paths.

I'll leave that to you too if you want to mention some details in above and
add that into the commit log.

Thanks,

-- 
Peter Xu



  reply	other threads:[~2024-04-19 15:17 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-18 12:06 Kefeng Wang
2024-04-18 16:32 ` Peter Xu
2024-04-19  3:00   ` Kefeng Wang
2024-04-19 15:17     ` Peter Xu [this message]
2024-04-20  4:05       ` Kefeng Wang
2024-04-21 13:53         ` Peter Xu
2024-04-22  2:13           ` Kefeng Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZiKLBxS8D_jzD6F9@x1n \
    --to=peterx@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=linux-mm@kvack.org \
    --cc=wangkefeng.wang@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox