From: Nadav Amit <nadav.amit@gmail.com>
To: Peter Xu <peterx@redhat.com>
Cc: Linux MM <linux-mm@kvack.org>,
Mike Kravetz <mike.kravetz@oracle.com>,
Hugh Dickins <hughd@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
Axel Rasmussen <axelrasmussen@google.com>,
David Hildenbrand <david@redhat.com>,
Mike Rapoport <rppt@linux.ibm.com>
Subject: Re: [RFC PATCH v2 3/5] userfaultfd: introduce write-likely mode for copy/wp operations
Date: Tue, 21 Jun 2022 10:14:00 -0700 [thread overview]
Message-ID: <C2C87385-20DB-44FC-8785-09DADF731253@gmail.com> (raw)
In-Reply-To: <YrH0DR2VqrJQskU5@xz-m1.local>
On Jun 21, 2022, at 9:38 AM, Peter Xu <peterx@redhat.com> wrote:
> Hi, Nadav,
>
> On Sun, Jun 19, 2022 at 04:34:47PM -0700, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>>
>> Commit 9ae0f87d009ca ("mm/shmem: unconditionally set pte dirty in
>> mfill_atomic_install_pte") has set PTEs as dirty as its title indicates.
>> However, setting read-only PTEs as dirty can have several undesired
>> implications.
>>
>> First, setting read-only PTEs as dirty, can cause these PTEs to become
>> writable during mprotect() syscall. See in change_pte_range():
>>
>> /* Avoid taking write faults for known dirty pages */
>> if (dirty_accountable && pte_dirty(ptent) &&
>> (pte_soft_dirty(ptent) ||
>> !(vma->vm_flags & VM_SOFTDIRTY))) {
>> ptent = pte_mkwrite(ptent);
>> }
>
> IMHO this is not really the direct reason to add the feature to "allow user
> to specify whether dirty bit be set for UFFDIO_COPY/... ioctl", because
> IIUC what's missing is the pte_uffd_wp() check that I should have added in
> the shmem+hugetlb uffd-wp series in change_pte_range() but I missed..
>
> But since this is fixed by David's patch to optimize mprotect() altogether
> which checks pte_uffd_wp() (and afaict that's only needed after the
> shmem+hugetlb patchset, not before), so I think we're safe now with all the
> branches.
>
> So IMHO we don't need to mention this as it's kind of misleading. It'll be
> welcomed if you want to recover the pte_dirty behavior in
> mfill_atomic_install_pte() but probably this is not the right patch for it?
Sorry, I will remove it from the commit log.
I am not sure whether there should be an additional patch to revert (or
partially revert) 9ae0f87d009ca and to be backported. What do you say?
>> Second, unmapping read-only dirty PTEs often prevents TLB flush batching.
>> See try_to_unmap_one():
>>
>> /*
>> * Page is dirty. Flush the TLB if a writable entry
>> * potentially exists to avoid CPU writes after IO
>> * starts and then write it out here.
>> */
>> try_to_unmap_flush_dirty();
>>
>> Similarly batching TLB flushed might be prevented in zap_pte_range():
>>
>> if (!PageAnon(page)) {
>> if (pte_dirty(ptent)) {
>> force_flush = 1;
>> set_page_dirty(page);
>> }
>> ...
>
> I still keep the pure question here (which I asked in the private reply to
> you) on why we'd like only pte_dirty() but not pte_write() && pte_dirty()
> here. I'll rewrite what I have in the private email to you here again that
> I think should be the write thing to do here..
>
> if (!PageAnon(page)) {
> if (pte_dirty(ptent)) {
> set_page_dirty(page);
> if (pte_write(ptent))
> force_flush = 1;
> }
> }
The “Second” part regards a perf issue, not a correctness issue. As I told
you offline, I think if makes sense to look both on pte_dirty() and
pte_write(), but I am afraid of other architectures than x86, which I know.
After our discussion, I looked at other architectures, and it does look
safe. So I will add an additional patch for that (with your suggested-by).
> I also mentioned the other example of doing mprotect(PROT_READ) upon a
> dirty pte can also create a pte with dirty=1 and write=0 which should be
> the same condition we have here with uffd, afaict. So it's at least not a
> solo problem for uffd, and I still think if we treat this tlb flush issue
> as a perf bug we should consider fixing up the tlb flush code instead
> because otherwise the "mprotect(PROT_READ) upon dirty pte" will also be
> able to hit it.
>
> Meanwhile, I'll have the same comment that this is not helping any reviewer
> to understand why we need to grant user ability to conditionally set dirty
> bit from uABI, so I think it's better to drop this paragraph too for this
> patch alone.
Ok. Point taken.
>
>> In general, setting a PTE as dirty seems for read-only entries might be
>> dangerous. It should be reminded the dirty-COW vulnerability mitigation
>> also relies on the dirty bit being set only after COW (although it does
>> not appear to apply to userfaultfd).
>>
>> To summarize, setting the dirty bit for read-only PTEs is dangerous. But
>> even if we only consider writable pages, always setting the dirty bit or
>> always leaving it clear, does not seem as the best policy. Leaving the
>> bit clear introduces overhead on the first write-access to set the bit.
>> Setting the bit for pages the are eventually not written to can require
>> more TLB flushes.
>
> And IMHO only this paragraph is the real and proper reasoning for this
> patch..
Will do.
>
>> @@ -1902,10 +1908,10 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
>> uffdio_continue.range.start) {
>> goto out;
>> }
>> - if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
>> + if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE))
>> goto out;
>>
>> - uffd_flags = UFFD_FLAGS_ACCESS_LIKELY;
>> + uffd_flags = UFFD_FLAGS_ACCESS_LIKELY | UFFD_FLAGS_WRITE_LIKELY;
>
> Setting dirty by default for CONTINUE may make some sense (unlike young),
> at least to keep the old behavior of the code where the pte was
> unconditionally set.
>
> But there's another thought a real CONTINUE user modifies the page cache
> elsewhere so logically the PageDirty should be set elsewhere already
> anyways, in most cases when the userapp is updating the page cache with
> another mapping before installing pgtables for this mapping. Then this
> dirty is not required, it seems.
>
> If we're going to export this to uABI, I'm wondering maybe it'll be nicer
> to not apply either young or dirty bit for CONTINUE, because fundamentally
> losing dirty bit doesn't sound risky, meanwhile the user app should have
> the best knowledge of what to do (whether page was requested or it's just
> during a pre-faulting in the background).
>
> Axel may have some better thoughts.
I think that perhaps I did not communicate well enough the reason it makes
sense to set the dirty-bit.
Setting the access-bit and dirty-bit takes quite some time. So regardless of
whether you set the PageDirty, if you didn’t see access-bit and dirty-bit
and later you access the page - you are going to pay ~550 cycles for
nothing.
For reference: https://marc.info/?l=linux-kernel&m=146582237922378&w=2
If you want me to allow userspace to provide a hint, let me know.
>> @@ -85,13 +84,18 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>>
>> if (writable)
>> _dst_pte = pte_mkwrite(_dst_pte);
>> - else
>> + else {
>> /*
>> * We need this to make sure write bit removed; as mk_pte()
>> * could return a pte with write bit set.
>> */
>> _dst_pte = pte_wrprotect(_dst_pte);
>>
>> + /* Marking RO entries as dirty can mess with other code */
>> + if (uffd_flags & UFFD_FLAGS_WRITE_LIKELY)
>> + _dst_pte = pte_mkdirty(_dst_pte);
>
> Hmm.. what about "writable=true" ones?
Oh boy, I reverted the condition… :(
Thanks for noticing. I’ll fix it.
next prev parent reply other threads:[~2022-06-21 17:14 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-06-19 23:34 [RFC PATCH v2 0/5] userfaultfd: support access/write hints Nadav Amit
2022-06-19 23:34 ` [RFC PATCH v2 1/5] userfaultfd: introduce uffd_flags Nadav Amit
2022-06-21 8:41 ` David Hildenbrand
2022-06-21 15:31 ` Peter Xu
2022-06-21 15:29 ` Peter Xu
2022-06-21 17:41 ` Nadav Amit
2022-06-19 23:34 ` [RFC PATCH v2 2/5] userfaultfd: introduce access-likely mode for copy/wp operations Nadav Amit
2022-06-21 8:48 ` David Hildenbrand
2022-06-21 15:42 ` Peter Xu
2022-06-21 17:27 ` Nadav Amit
2022-06-19 23:34 ` [RFC PATCH v2 3/5] userfaultfd: introduce write-likely " Nadav Amit
2022-06-21 16:38 ` Peter Xu
2022-06-21 17:14 ` Nadav Amit [this message]
2022-06-21 18:10 ` Peter Xu
2022-06-21 18:30 ` Nadav Amit
2022-06-21 18:43 ` Peter Xu
2022-06-19 23:34 ` [RFC PATCH v2 4/5] userfaultfd: zero access/write hints Nadav Amit
2022-06-21 17:04 ` Peter Xu
2022-06-21 17:17 ` Nadav Amit
2022-06-21 17:56 ` Peter Xu
2022-06-21 17:58 ` Nadav Amit
2022-06-19 23:34 ` [RFC PATCH v2 5/5] selftest/userfaultfd: test read/write hints Nadav Amit
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=C2C87385-20DB-44FC-8785-09DADF731253@gmail.com \
--to=nadav.amit@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=axelrasmussen@google.com \
--cc=david@redhat.com \
--cc=hughd@google.com \
--cc=linux-mm@kvack.org \
--cc=mike.kravetz@oracle.com \
--cc=peterx@redhat.com \
--cc=rppt@linux.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox