From: David Hildenbrand <david@redhat.com>
To: Nadav Amit <namit@vmware.com>, Peter Xu <peterx@redhat.com>
Cc: Linux MM <linux-mm@kvack.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Hugh Dickins <hughd@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Mike Rapoport <rppt@linux.ibm.com>,
	Dave Hansen <dave.hansen@linux.intel.com>
Subject: Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
Date: Tue, 28 Jun 2022 12:55:07 +0200
Message-ID: <a9aea770-3836-d716-d502-a0035f68985c@redhat.com>
In-Reply-To: <A7C4D8BC-4F7B-43EA-B44B-D573DAF59B8C@vmware.com>

On 28.06.22 01:37, Nadav Amit wrote:
> [ +Dave Hansen to say how wrong I am ] 
> 
>> On Jun 27, 2022, at 6:12 AM, Peter Xu <peterx@redhat.com> wrote:
>>
>> On Sat, Jun 25, 2022 at 07:49:54AM +0000, Nadav Amit wrote:
>>>
>>>
>>>> On Jun 24, 2022, at 3:17 PM, Peter Xu <peterx@redhat.com> wrote:
>>>>
>>>> On Fri, Jun 24, 2022 at 05:58:17PM -0400, Peter Xu wrote:
>>>>> [Sorry for replying late]
>>>>>
> >>>>> That said, I don't think it necessarily needs to be that complex,
> >>>>> since make_huge_pte() already sets the dirty bit when "writable=1", so IIUC
> >>>>> all you need to do is make sure the dirty bit is set when write_hint=1.
> >>>>>
> >>>>> Does that sound correct to you?
>>>>
>>>> Hmm, hold on...  I failed to figure out how that write-likely hint could
>>>> help us for either huge or non-huge pages, since:
>>>>
>>>> (1) Old code always set dirty, so no perf degrade anyway with/without the
>>>>     hint
>>>>
>>>> (2) If we want to rework dirty bit (which I'm totally fine with..), then
>>>>     we don't apply it when we shouldn't, and afaict we should set D bit
>>>>     whenever we should...  if the user assumes this page is likely to be
>>>>     written but made it read-only, say, with UFFDIO_COPY(wp_mode=1),
>>>>     setting D bit will not help, instead, the user should simply use an
>>>>     UFFDIO_COPY(wp_mode=0) then the dirty will be set with write=1..
>>>>
> >>>> It would help, but only for UFFDIO_ZEROPAGE, because it avoids one
> >>>> COW.  But that seems to be it.
>>>>
>>>> In short: I'm wondering whether we only really need the ACCESS_LIKELY hint
>>>> as you proposed earlier.  We may want UFFDIO_ZEROPAGE_MODE_ALLOCATE
>>>> separately, but keep that only for zeropage op (and it shouldn't really be
>>>> called WRITE_LIKELY)?  Or did I miss something?
>>>
>>> Let’s see if I get you correctly. I am not sure whether we had this
>>> discussion before.
>>>
>>> We are talking about a scenario in which WP=0. You argue that if the page
>>> is already set as dirty, what is the benefit of not setting the dirty-bit,
>>> right?
>>>
>>> So first, IIUC, there are cases in which the page would not be set as
>>> dirty, e.g., UFFDIO_CONTINUE. [ I am admittedly not too familiar with this
>>> use-case, so I say it based on the comments. ]
>>>
>>> Second, even if the page is dirty (e.g., following UFFDIO_COPY), but it
> >>> is not written by the user after UFFDIO_COPY, marking the PTE as dirty
>>> when it is mapped would induce overhead, as we discussed before, since
>>> if/when the PTE is unmapped, TLB flush batching might not be possible.
>>
> >> I'd hope we don't design an interface just to serve the write=0, dirty=1
> >> use case, which is internal to the kernel so far, and I still think it's
> >> the tlb flush code that should change.. or do we have another use case
> >> for this WRITE_LIKELY hint?
>>
> >> For UFFDIO_CONTINUE, if we want to make things clear on the dirty bit, then
> >> IMHO the right place to handle dirtying is where the user writes to the
> >> page through the other mapping, at which point PageDirty() will already be
> >> true even if the pte to be CONTINUEd has dirty=0 in the pte entry.  From
> >> that pov I still don't see why we need to grant the user control over the
> >> dirty bit, whether as a hint only or explicitly.
>>
>>>
>>> So I don’t think there is a problem in having WRITE_LIKELY hint. Moreover,
>>> I would reiterate my position (which you guys convinced me in!)
>>
>> David convinced you I think :)
>>
>>> that having hints that indicate what the user does (WRITE_LIKELY) is a
>>> better API than something that indicates directly what the kernel should
>>> do (e.g., UFFDIO_ZEROPAGE_MODE_ALLOCATE).
>>
>> The hint idea sounds good to me, it's just that we actually have two steps
>> here:
>>
>>  (1) We think providing user the control of dirty bit makes sense, then,
>>  (2) We think the flag should be a hint not explicit "set dirty bit"
>>
>> I agree with (2) in this case if (1) is applicable.  And now I think I'm
>> questioning myself on (1).
>>
>> Fundamentally, access bit has more meaningful context (0 means cold, 1
>> means hot), for dirty it's really more a perf thing to me (when clear,
>> it'll take extra cycles to set it when memory write happens to it; being
>> clear _may_ help only for the tlb flush example you mentioned but I'm not
>> fully convinced that's correct).
> 
> I am not sure we understand each other. I think the benefit of not setting
> a dirty-bit when a page is not actually written is fundamental, and has
> an inherent performance benefit.
> 
> When I did x86’s pte_flags_need_flush(), I was defensive, but there is a
> basic optimization that is possible to avoid a TLB flush on non-dirty
> writable PTEs.
> 
> In x86, consider a situation in which you use ptep_modify_prot_start()
> to remove a PTE and load its old value using xchg. (A similar case happens
> on reclaim). Assume you want to write-protect the entry.
> 
> If the PTE is non-dirty then you should be able to avoid a flush, even if
> the PTE is writable. In x86, a write and the setting of the dirty-bit are
> performed together atomically. Therefore, if the dirty-bit on the old PTE
> was clear, you can avoid a TLB flush.
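
To make the argument above concrete, here is a minimal userspace sketch of
the decision as I read it. It is a toy model, not kernel code: the TOY_*
names and toy_wrprotect_needs_flush() are made up for illustration, and it
only covers the write-protect case discussed here, not other permission
changes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy PTE bits for this model only (hypothetical names, not kernel macros). */
#define TOY_PTE_PRESENT (UINT64_C(1) << 0)
#define TOY_PTE_RW      (UINT64_C(1) << 1)
#define TOY_PTE_DIRTY   (UINT64_C(1) << 6)

/*
 * Given the old PTE value that was atomically exchanged out of the page
 * table, decide whether write-protecting it would need a TLB flush under
 * the reasoning above: a write and the setting of the dirty bit happen
 * together, so a clear dirty bit means no write went through yet.
 */
bool toy_wrprotect_needs_flush(uint64_t old_pte)
{
	if (!(old_pte & TOY_PTE_PRESENT))
		return false;	/* not present: nothing cached to revoke */
	if (!(old_pte & TOY_PTE_RW))
		return false;	/* already read-only: no permission change */
	/* Writable: flush only if a write has already marked it dirty. */
	return (old_pte & TOY_PTE_DIRTY) != 0;
}
```

With this model, a present, writable, clean PTE yields "no flush", while
the same PTE with the dirty bit set yields "flush".
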
> 
> Besides the benefit of avoiding a TLB flush, there is also the benefit
> of having more precise dirty tracking. You assume UFFDIO_CONTINUE will be
> preceded by memory write to the shared memory, but that does not have to
> be the case. Similarly, if in the future userfaultfd would also support
> memory-backed private mappings, that does not have to be the case either.
> 
> Putting all of the above aside, there is a bug in my code, but this
> bug also points to why dirty should not be set unconditionally. If someone
> uses SOFT_DIRTY with userfaultfd, then marking the PTE as dirty (and
> soft-dirty) might be misleading, causing unnecessary userspace writeback
> of memory.
> 
> So I do need to fix my code so it would not write-unprotect memory if
> soft-dirty is enabled and UFFD_FLAGS_WRITE_LIKELY is not provided. But
> I think it emphasizes the benefit of having UFFD_FLAGS_WRITE_LIKELY.
> 
>>
>> Maybe with the to be proposed RFC patch for tlb flush we can know whether
>> that should be something we can rely on.  It'll add more dependency on this
>> work which I'm sorry to say.  It's just that IMHO we should think carefully
>> for the write-hint because this is a solid new uABI we're talking about.
>>
>> The other option is we can introduce the access hint first and think more
>> on the dirty one (we can always add it when proper).  What do you think?
> >> Also, David please chime in anytime if I missed the whole point when you
>> proposed the idea.
>>
>>>
>>> But this discussion made me think that there are two somewhat related
>>> matters that we may want to address:
>>>
>>> 1. mwriteprotect_range() should use MM_CP_TRY_CHANGE_WRITABLE when !wp
>>> to proactively make entries writable and save .
>>
>> I'm not sure I'm right here, but I think David's patch should have covered
>> that case?  The new helper only checks pte_uffd_wp() based on my memory,
>> and when resolving page faults uffd-wp bit should have been gone, so it
>> should be treated the same as normal ptes.
> 
> Let’s see if we can get on the same page:
> 
> mwriteprotect_range() does:
> 
>         change_protection(&tlb, dst_vma, start, start + len, newprot,
>                           enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE);
> 
> As you can see, there is no use of MM_CP_TRY_CHANGE_WRITABLE.
> 
> And then change_pte_range() does:
> 
>                         if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>                             !pte_write(ptent) &&
>                             can_change_pte_writable(vma, addr, ptent))
>                                 ptent = pte_mkwrite(ptent);

Right, I think in a previous version of my patch (before you guys
convinced me to introduce MM_CP_TRY_CHANGE_WRITABLE :P ) it would have
done it automatically (for private mappings). We might now have to add it
manually to some callers, so that it covers more than just mprotect.
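
To illustrate why the flag has to be passed by the caller, here is a rough
userspace model of the opt-in check quoted above. Everything in it is
hypothetical: the TOY_* flag values, struct toy_pte and the helpers are
made up for this sketch, and toy_can_change_writable() is only a stand-in
for can_change_pte_writable(), which checks more than this.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for this model; the kernel defines its own. */
#define TOY_CP_UFFD_WP_RESOLVE     (1UL << 0)
#define TOY_CP_TRY_CHANGE_WRITABLE (1UL << 1)

/* Minimal stand-in for a PTE: only the two bits this sketch cares about. */
struct toy_pte {
	bool write;
	bool uffd_wp;
};

/* Stand-in for can_change_pte_writable(); "exclusive" is an assumed input. */
bool toy_can_change_writable(struct toy_pte pte, bool exclusive)
{
	return !pte.uffd_wp && exclusive;
}

/* Model of one PTE update: resolve uffd-wp first, then the opt-in upgrade. */
struct toy_pte toy_change_pte(struct toy_pte pte, unsigned long cp_flags,
			      bool exclusive)
{
	if (cp_flags & TOY_CP_UFFD_WP_RESOLVE)
		pte.uffd_wp = false;

	/* Without the opt-in flag the PTE stays read-only until a fault. */
	if ((cp_flags & TOY_CP_TRY_CHANGE_WRITABLE) && !pte.write &&
	    toy_can_change_writable(pte, exclusive))
		pte.write = true;

	return pte;
}
```

In this model, a caller that passes only TOY_CP_UFFD_WP_RESOLVE gets the
uffd-wp bit cleared but the PTE left read-only, whereas OR-ing in
TOY_CP_TRY_CHANGE_WRITABLE makes it writable right away when the stand-in
check allows it.
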


-- 
Thanks,

David / dhildenb


