linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Nadav Amit <nadav.amit@gmail.com>
To: Peter Xu <peterx@redhat.com>
Cc: Linux MM <linux-mm@kvack.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Hugh Dickins <hughd@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Axel Rasmussen <axelrasmussen@google.com>,
	David Hildenbrand <david@redhat.com>,
	Mike Rapoport <rppt@linux.ibm.com>
Subject: Re: [RFC PATCH v2 3/5] userfaultfd: introduce write-likely mode for copy/wp operations
Date: Tue, 21 Jun 2022 10:14:00 -0700	[thread overview]
Message-ID: <C2C87385-20DB-44FC-8785-09DADF731253@gmail.com> (raw)
In-Reply-To: <YrH0DR2VqrJQskU5@xz-m1.local>

On Jun 21, 2022, at 9:38 AM, Peter Xu <peterx@redhat.com> wrote:

> Hi, Nadav,
> 
> On Sun, Jun 19, 2022 at 04:34:47PM -0700, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Commit 9ae0f87d009ca ("mm/shmem: unconditionally set pte dirty in
>> mfill_atomic_install_pte") has set PTEs as dirty as its title indicates.
>> However, setting read-only PTEs as dirty can have several undesired
>> implications.
>> 
>> First, setting read-only PTEs as dirty, can cause these PTEs to become
>> writable during mprotect() syscall. See in change_pte_range():
>> 
>> 	/* Avoid taking write faults for known dirty pages */
>> 	if (dirty_accountable && pte_dirty(ptent) &&
>> 			(pte_soft_dirty(ptent) ||
>> 			 !(vma->vm_flags & VM_SOFTDIRTY))) {
>> 		ptent = pte_mkwrite(ptent);
>> 	}
> 
> IMHO this is not really the direct reason to add the feature to "allow user
> to specify whether dirty bit be set for UFFDIO_COPY/... ioctl", because
> IIUC what's missing is the pte_uffd_wp() check that I should have added in
> the shmem+hugetlb uffd-wp series in change_pte_range() but I missed..
> 
> But since this is fixed by David's patch to optimize mprotect() altogether
> which checks pte_uffd_wp() (and afaict that's only needed after the
> shmem+hugetlb patchset, not before), so I think we're safe now with all the
> branches.
> 
> So IMHO we don't need to mention this as it's kind of misleading. It'll be
> welcomed if you want to recover the pte_dirty behavior in
> mfill_atomic_install_pte() but probably this is not the right patch for it?

Sorry, I will remove it from the commit log.

I am not sure whether there should be an additional patch to revert (or
partially revert) 9ae0f87d009ca and to be backported. What do you say?

>> Second, unmapping read-only dirty PTEs often prevents TLB flush batching.
>> See try_to_unmap_one():
>> 
>> 	/*
>> 	 * Page is dirty. Flush the TLB if a writable entry
>> 	 * potentially exists to avoid CPU writes after IO
>> 	 * starts and then write it out here.
>> 	 */
>> 	try_to_unmap_flush_dirty();
>> 
>> Similarly batching TLB flushed might be prevented in zap_pte_range():
>> 
>> 	if (!PageAnon(page)) {
>> 		if (pte_dirty(ptent)) {
>> 			force_flush = 1;
>> 			set_page_dirty(page);
>> 		}
>> 	...
> 
> I still keep the pure question here (which I asked in the private reply to
> you) on why we'd like only pte_dirty() but not pte_write() && pte_dirty()
> here. I'll rewrite what I have in the private email to you here again that
> I think should be the write thing to do here..
> 
> if (!PageAnon(page)) {
> if (pte_dirty(ptent)) {
> set_page_dirty(page);
> if (pte_write(ptent))
> force_flush = 1;
> }
> }

The “Second” part regards a perf issue, not a correctness issue. As I told
you offline, I think if makes sense to look both on pte_dirty() and
pte_write(), but I am afraid of other architectures than x86, which I know.
After our discussion, I looked at other architectures, and it does look
safe. So I will add an additional patch for that (with your suggested-by).

> I also mentioned the other example of doing mprotect(PROT_READ) upon a
> dirty pte can also create a pte with dirty=1 and write=0 which should be
> the same condition we have here with uffd, afaict. So it's at least not a
> solo problem for uffd, and I still think if we treat this tlb flush issue
> as a perf bug we should consider fixing up the tlb flush code instead
> because otherwise the "mprotect(PROT_READ) upon dirty pte" will also be
> able to hit it.
> 
> Meanwhile, I'll have the same comment that this is not helping any reviewer
> to understand why we need to grant user ability to conditionally set dirty
> bit from uABI, so I think it's better to drop this paragraph too for this
> patch alone.

Ok. Point taken.

> 
>> In general, setting a PTE as dirty seems for read-only entries might be
>> dangerous. It should be reminded the dirty-COW vulnerability mitigation
>> also relies on the dirty bit being set only after COW (although it does
>> not appear to apply to userfaultfd).
>> 
>> To summarize, setting the dirty bit for read-only PTEs is dangerous. But
>> even if we only consider writable pages, always setting the dirty bit or
>> always leaving it clear, does not seem as the best policy. Leaving the
>> bit clear introduces overhead on the first write-access to set the bit.
>> Setting the bit for pages the are eventually not written to can require
>> more TLB flushes.
> 
> And IMHO only this paragraph is the real and proper reasoning for this
> patch..

Will do.

> 
>> @@ -1902,10 +1908,10 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
>> 	 uffdio_continue.range.start) {
>> 		goto out;
>> 	}
>> -	if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
>> +	if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE))
>> 		goto out;
>> 
>> -	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY;
>> +	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY | UFFD_FLAGS_WRITE_LIKELY;
> 
> Setting dirty by default for CONTINUE may make some sense (unlike young),
> at least to keep the old behavior of the code where the pte was
> unconditionally set.
> 
> But there's another thought a real CONTINUE user modifies the page cache
> elsewhere so logically the PageDirty should be set elsewhere already
> anyways, in most cases when the userapp is updating the page cache with
> another mapping before installing pgtables for this mapping. Then this
> dirty is not required, it seems.
> 
> If we're going to export this to uABI, I'm wondering maybe it'll be nicer
> to not apply either young or dirty bit for CONTINUE, because fundamentally
> losing dirty bit doesn't sound risky, meanwhile the user app should have
> the best knowledge of what to do (whether page was requested or it's just
> during a pre-faulting in the background).
> 
> Axel may have some better thoughts.

I think that perhaps I did not communicate well enough the reason it makes
sense to set the dirty-bit.

Setting the access-bit and dirty-bit takes quite some time. So regardless of
whether you set the PageDirty, if you didn’t see access-bit and dirty-bit
and later you access the page - you are going to pay ~550 cycles for
nothing.

For reference: https://marc.info/?l=linux-kernel&m=146582237922378&w=2

If you want me to allow userspace to provide a hint, let me know.

>> @@ -85,13 +84,18 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>> 
>> 	if (writable)
>> 		_dst_pte = pte_mkwrite(_dst_pte);
>> -	else
>> +	else {
>> 		/*
>> 		 * We need this to make sure write bit removed; as mk_pte()
>> 		 * could return a pte with write bit set.
>> 		 */
>> 		_dst_pte = pte_wrprotect(_dst_pte);
>> 
>> +		/* Marking RO entries as dirty can mess with other code */
>> +		if (uffd_flags & UFFD_FLAGS_WRITE_LIKELY)
>> +			_dst_pte = pte_mkdirty(_dst_pte);
> 
> Hmm.. what about "writable=true" ones?

Oh boy, I reverted the condition… :(

Thanks for noticing. I’ll fix it.



  reply	other threads:[~2022-06-21 17:14 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-19 23:34 [RFC PATCH v2 0/5] userfaultfd: support access/write hints Nadav Amit
2022-06-19 23:34 ` [RFC PATCH v2 1/5] userfaultfd: introduce uffd_flags Nadav Amit
2022-06-21  8:41   ` David Hildenbrand
2022-06-21 15:31     ` Peter Xu
2022-06-21 15:29   ` Peter Xu
2022-06-21 17:41     ` Nadav Amit
2022-06-19 23:34 ` [RFC PATCH v2 2/5] userfaultfd: introduce access-likely mode for copy/wp operations Nadav Amit
2022-06-21  8:48   ` David Hildenbrand
2022-06-21 15:42     ` Peter Xu
2022-06-21 17:27     ` Nadav Amit
2022-06-19 23:34 ` [RFC PATCH v2 3/5] userfaultfd: introduce write-likely " Nadav Amit
2022-06-21 16:38   ` Peter Xu
2022-06-21 17:14     ` Nadav Amit [this message]
2022-06-21 18:10       ` Peter Xu
2022-06-21 18:30         ` Nadav Amit
2022-06-21 18:43           ` Peter Xu
2022-06-19 23:34 ` [RFC PATCH v2 4/5] userfaultfd: zero access/write hints Nadav Amit
2022-06-21 17:04   ` Peter Xu
2022-06-21 17:17     ` Nadav Amit
2022-06-21 17:56       ` Peter Xu
2022-06-21 17:58         ` Nadav Amit
2022-06-19 23:34 ` [RFC PATCH v2 5/5] selftest/userfaultfd: test read/write hints Nadav Amit

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=C2C87385-20DB-44FC-8785-09DADF731253@gmail.com \
    --to=nadav.amit@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=axelrasmussen@google.com \
    --cc=david@redhat.com \
    --cc=hughd@google.com \
    --cc=linux-mm@kvack.org \
    --cc=mike.kravetz@oracle.com \
    --cc=peterx@redhat.com \
    --cc=rppt@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox