From: David Hildenbrand <david@redhat.com>
To: Nadav Amit <nadav.amit@gmail.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
linux-mm <linux-mm@kvack.org>,
Linus Torvalds <torvalds@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
Dave Hansen <dave.hansen@intel.com>,
Andrea Arcangeli <aarcange@redhat.com>,
Peter Xu <peterx@redhat.com>, Yang Shi <shy828301@gmail.com>,
Hugh Dickins <hughd@google.com>,
Mel Gorman <mgorman@techsingularity.net>,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Subject: Re: [PATCH v1 mmotm] mm/mprotect: try avoiding write faults for exclusive anonynmous pages when changing protection
Date: Mon, 4 Apr 2022 15:04:28 +0200 [thread overview]
Message-ID: <99cf9e14-7608-8e72-0c8e-0dd9b0047319@redhat.com> (raw)
In-Reply-To: <8A6AF878-D5D7-4D88-A736-0FEF71439D44@gmail.com>
On 01.04.22 21:15, Nadav Amit wrote:
> [ +Rick ]
>
>> On Apr 1, 2022, at 3:13 AM, David Hildenbrand <david@redhat.com> wrote:
>>
>> Similar to our MM_CP_DIRTY_ACCT handling for shared, writable mappings, we
>> can try mapping anonymous pages writable if they are exclusive,
>> the PTE is already dirty, and no special handling applies. Mapping the
>> PTE writable is essentially the same thing the write fault handler would do
>> in this case.
>
> In general I am all supportive for such a change.
>
> I do have some mostly-minor concerns.
Hi Nadav,
thanks a lot for your review!
>
>>
>> +static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t pte,
>> + unsigned long cp_flags)
>> +{
>> + struct page *page;
>> +
>> + if ((vma->vm_flags & VM_SHARED) && !(cp_flags & MM_CP_DIRTY_ACCT))
>> + /*
>> + * MM_CP_DIRTY_ACCT is only expressive for shared mappings;
>> + * without MM_CP_DIRTY_ACCT, there is nothing to do.
>> + */
>> + return false;
>> +
>> + if (!(vma->vm_flags & VM_WRITE))
>> + return false;
>> +
>> + if (pte_write(pte) || pte_protnone(pte) || !pte_dirty(pte))
>> + return false;
>
> If pte_write() is already try then return false? I understand you want
> to do so because the page is already writable, but it is confusing.
I thought about just doing outside of the function
if ((vma->vm_flags & VM_WRITE) && !pte_write(pte) &&
can_change_pte_writable()...
I refrained from doing so because the sequence of checks might be
sub-optimal. But most probably we don't really care about that and it
might make the code easier to grasp.
Would that make it clearer?
>
> In addition, I am not sure about the pte_dirty() check is really robust.
> I mean I think it is ok, but is there any issue with shadow-stack?
Judging that it's already used that way for VMAs with dirty tracking, I
assume it's ok. Without checking that the PTE is dirty, we'd have to do a:
pte_mkwrite(pte_mkwrite(ptent));
Which would set the pte and consequently the page dirty, although there
might not even be a write access. That's what we want to avoid here.
>
> And this also assumes the kernel does not clear the dirty bit without
> clearing the present, as otherwise the note in Intel SDM section 4.8
> ("Accessed and Dirty Flags”) will be relevant and dirty bit might be
> set unnecessarily. I think it is ok.
Yeah, I think so as well.
>
>> +
>> + /* Do we need write faults for softdirty tracking? */
>> + if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) && !pte_soft_dirty(pte) &&
>> + (vma->vm_flags & VM_SOFTDIRTY))
>
> If !IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) then VM_SOFTDIRTY == 0. So I do not
> think the IS_ENABLED() is necessary (unless you think it is clearer this
> way).
Right, we can just do
if ((vma->vm_flags & VM_SOFTDIRTY) && !pte_soft_dirty(pte))
and it should get fully optimized out. Thanks!
>
>> + return false;
>> +
>> + /* Do we need write faults for uffd-wp tracking? */
>> + if (userfaultfd_pte_wp(vma, pte))
>> + return false;
>> +
>> + if (!(vma->vm_flags & VM_SHARED)) {
>> + /*
>> + * We can only special-case on exclusive anonymous pages,
>> + * because we know that our write-fault handler similarly would
>> + * map them writable without any additional checks while holding
>> + * the PT lock.
>> + */
>> + page = vm_normal_page(vma, addr, pte);
>
> I guess we cannot call vm_normal_page() twice, once for prot_numa and once
> here, in practice...
I guess we could, but it doesn't necessarily make the code easier to
read :) And we want to skip protnone either way.
>
>> + if (!page || !PageAnon(page) || !PageAnonExclusive(page))
>> + return false;
>> + }
>> +
>> + return true;
>> +}
>
> Note that there is a small downside to all of that. Assume you mprotect()
> a single page from RO to RW and you have many threads.
>
> With my pending patch you would avoid the TLB shootdown (and get a PF).
> With this patch you would get a TLB shootdown and save the PF. IOW, I
> think it is worthy to skip the shootdown as well in such a case and
> instead flush the TLB on spurious page-faults. But I guess that’s for
> another patch.
Just so I understand correctly: your optimization avoids the flush when
effectively, nothing changed (R/O -> R/O).
And the optimization for this case here would be, to avoid the TLB flush
when similarly not required (R/O -> R/W).
Correct?
--
Thanks,
David / dhildenb
prev parent reply other threads:[~2022-04-04 13:05 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-04-01 10:13 David Hildenbrand
2022-04-01 19:15 ` Nadav Amit
2022-04-04 13:04 ` David Hildenbrand [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=99cf9e14-7608-8e72-0c8e-0dd9b0047319@redhat.com \
--to=david@redhat.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=dave.hansen@intel.com \
--cc=hughd@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@techsingularity.net \
--cc=nadav.amit@gmail.com \
--cc=peterx@redhat.com \
--cc=rick.p.edgecombe@intel.com \
--cc=shy828301@gmail.com \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox