From: Nadav Amit <nadav.amit@gmail.com>
To: Peter Xu <peterx@redhat.com>
Cc: Linux MM <linux-mm@kvack.org>,
Mike Kravetz <mike.kravetz@oracle.com>,
Hugh Dickins <hughd@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
Axel Rasmussen <axelrasmussen@google.com>,
David Hildenbrand <david@redhat.com>,
Mike Rapoport <rppt@linux.ibm.com>
Subject: Re: [RFC PATCH v2 3/5] userfaultfd: introduce write-likely mode for copy/wp operations
Date: Tue, 21 Jun 2022 11:30:52 -0700 [thread overview]
Message-ID: <58E53A1D-466C-4A2B-8E10-2993EFF60856@gmail.com> (raw)
In-Reply-To: <YrIJm6kRhcHtupwZ@xz-m1.local>
On Jun 21, 2022, at 11:10 AM, Peter Xu <peterx@redhat.com> wrote:
> On Tue, Jun 21, 2022 at 10:14:00AM -0700, Nadav Amit wrote:
>> On Jun 21, 2022, at 9:38 AM, Peter Xu <peterx@redhat.com> wrote:
>>
>>> Hi, Nadav,
>>>
>>> On Sun, Jun 19, 2022 at 04:34:47PM -0700, Nadav Amit wrote:
>>>> From: Nadav Amit <namit@vmware.com>
>>>>
>>>> Commit 9ae0f87d009ca ("mm/shmem: unconditionally set pte dirty in
>>>> mfill_atomic_install_pte") has set PTEs as dirty as its title indicates.
>>>> However, setting read-only PTEs as dirty can have several undesired
>>>> implications.
>>>>
>>>> First, setting read-only PTEs as dirty, can cause these PTEs to become
>>>> writable during mprotect() syscall. See in change_pte_range():
>>>>
>>>> /* Avoid taking write faults for known dirty pages */
>>>> if (dirty_accountable && pte_dirty(ptent) &&
>>>> (pte_soft_dirty(ptent) ||
>>>> !(vma->vm_flags & VM_SOFTDIRTY))) {
>>>> ptent = pte_mkwrite(ptent);
>>>> }
>>>
>>> IMHO this is not really the direct reason to add the feature to "allow user
>>> to specify whether dirty bit be set for UFFDIO_COPY/... ioctl", because
>>> IIUC what's missing is the pte_uffd_wp() check that I should have added in
>>> the shmem+hugetlb uffd-wp series in change_pte_range() but I missed..
>>>
>>> But since this is fixed by David's patch to optimize mprotect() altogether
>>> which checks pte_uffd_wp() (and afaict that's only needed after the
>>> shmem+hugetlb patchset, not before), so I think we're safe now with all the
>>> branches.
>>>
>>> So IMHO we don't need to mention this as it's kind of misleading. It'll be
>>> welcomed if you want to recover the pte_dirty behavior in
>>> mfill_atomic_install_pte() but probably this is not the right patch for it?
>>
>> Sorry, I will remove it from the commit log.
>>
>> I am not sure whether there should be an additional patch to revert (or
>> partially revert) 9ae0f87d009ca and to be backported. What do you say?
>
> I think we discussed it offlist, I'd not bother but I don't have a strong
> opinion. Please feel free to go with your preference.
>
> Just to mention that it might not be a clean revert since at that time we
> don't have continue and uffd-wp on files, so we may need to add the
> complete check back if we're going to make it. Please also do proper swap
> tests for both anon and shmem with things like memcg, and when you posted
> the patches I can do some tests too.
That’s what I was afraid of. It moves it down in my priority list...
>>>> Second, unmapping read-only dirty PTEs often prevents TLB flush batching.
>>>> See try_to_unmap_one():
>>>>
>>>> /*
>>>> * Page is dirty. Flush the TLB if a writable entry
>>>> * potentially exists to avoid CPU writes after IO
>>>> * starts and then write it out here.
>>>> */
>>>> try_to_unmap_flush_dirty();
>>>>
>>>> Similarly batching TLB flushed might be prevented in zap_pte_range():
>>>>
>>>> if (!PageAnon(page)) {
>>>> if (pte_dirty(ptent)) {
>>>> force_flush = 1;
>>>> set_page_dirty(page);
>>>> }
>>>> ...
>>>
>>> I still keep the pure question here (which I asked in the private reply to
>>> you) on why we'd like only pte_dirty() but not pte_write() && pte_dirty()
>>> here. I'll rewrite what I have in the private email to you here again that
>>> I think should be the write thing to do here..
>>>
>>> if (!PageAnon(page)) {
>>> if (pte_dirty(ptent)) {
>>> set_page_dirty(page);
>>> if (pte_write(ptent))
>>> force_flush = 1;
>>> }
>>> }
>>
>> The “Second” part regards a perf issue, not a correctness issue. As I told
>> you offline, I think if makes sense to look both on pte_dirty() and
>> pte_write(), but I am afraid of other architectures than x86, which I know.
>
> I don't really see why this logic is arch-specific. I meant, as long as
> pte_write() returns a value that means "this page is writable", I still
> think it's making sense.
>
>> After our discussion, I looked at other architectures, and it does look
>> safe. So I will add an additional patch for that (with your suggested-by).
>
> Only if you agree with what I thought, thanks. And that could be a
> separate patch for sure out of this one even if to come.
I do agree. I did not quite understand your second sentence. I guess you
meant not to combine it into this patchset (or at least that’s what I
thought and I attribute it to you).
>>
>>
>> I think that perhaps I did not communicate well enough the reason it makes
>> sense to set the dirty-bit.
>>
>> Setting the access-bit and dirty-bit takes quite some time. So regardless of
>> whether you set the PageDirty, if you didn’t see access-bit and dirty-bit
>> and later you access the page - you are going to pay ~550 cycles for
>> nothing.
>>
>> For reference: https://marc.info/?l=linux-kernel&m=146582237922378&w=2
>>
>> If you want me to allow userspace to provide a hint, let me know.
>
> You did communicate well, so probably it's me that didn't. :)
>
> Yes I think it'll be nicer to allow userspace provide the same hint for
> UFFDIO_CONTINUE, the reason as I explained in the other thread on the young
> bit (or say ACCESS_LIKELY), that UFFDIO_CONTINUE can (at least in my
> current qemu impl [1]) proactively be used to install pgtables even if
> they're code pages and they may not be accessed in the near future. So
> IMHO we should treat UFFDIO_CONTINUE the same as UFFDIO_COPY, IMHO, from
> this point of view.
>
> There's just still some differences on young/dirty here so I'm not sure
> whether the idea applies to both, but I think at least it applies to young
> bit (ACCESS_LIKELY).
>
> [1] https://github.com/xzpeter/qemu/commit/b7758ad55af42b5364796362e96f4a06b51d9582
We have a saying in Hebrew: “When you explain slowly, I understand quickly”.
Thanks for explaining. I will add hints for UFFDIO_CONTINUE as well.
next prev parent reply other threads:[~2022-06-21 18:31 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-06-19 23:34 [RFC PATCH v2 0/5] userfaultfd: support access/write hints Nadav Amit
2022-06-19 23:34 ` [RFC PATCH v2 1/5] userfaultfd: introduce uffd_flags Nadav Amit
2022-06-21 8:41 ` David Hildenbrand
2022-06-21 15:31 ` Peter Xu
2022-06-21 15:29 ` Peter Xu
2022-06-21 17:41 ` Nadav Amit
2022-06-19 23:34 ` [RFC PATCH v2 2/5] userfaultfd: introduce access-likely mode for copy/wp operations Nadav Amit
2022-06-21 8:48 ` David Hildenbrand
2022-06-21 15:42 ` Peter Xu
2022-06-21 17:27 ` Nadav Amit
2022-06-19 23:34 ` [RFC PATCH v2 3/5] userfaultfd: introduce write-likely " Nadav Amit
2022-06-21 16:38 ` Peter Xu
2022-06-21 17:14 ` Nadav Amit
2022-06-21 18:10 ` Peter Xu
2022-06-21 18:30 ` Nadav Amit [this message]
2022-06-21 18:43 ` Peter Xu
2022-06-19 23:34 ` [RFC PATCH v2 4/5] userfaultfd: zero access/write hints Nadav Amit
2022-06-21 17:04 ` Peter Xu
2022-06-21 17:17 ` Nadav Amit
2022-06-21 17:56 ` Peter Xu
2022-06-21 17:58 ` Nadav Amit
2022-06-19 23:34 ` [RFC PATCH v2 5/5] selftest/userfaultfd: test read/write hints Nadav Amit
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=58E53A1D-466C-4A2B-8E10-2993EFF60856@gmail.com \
--to=nadav.amit@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=axelrasmussen@google.com \
--cc=david@redhat.com \
--cc=hughd@google.com \
--cc=linux-mm@kvack.org \
--cc=mike.kravetz@oracle.com \
--cc=peterx@redhat.com \
--cc=rppt@linux.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox