From: Peter Feiner <pfeiner@google.com>
To: Ashish Srivastava <ashish0srivastava0@gmail.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>,
Andrew Morton <akpm@linux-foundation.org>,
bugzilla-daemon@bugzilla.kernel.org, linux-mm@kvack.org
Subject: Re: [Bug 117731] New: Doing mprotect for PROT_NONE and then for PROT_READ|PROT_WRITE reduces CPU write B/W on buffer
Date: Tue, 17 May 2016 08:51:55 -0700
Message-ID: <CAM3pwhExGJUsQ_JoC0d1oaCyKtOgkwBbHrB3D05YJpfEdRVHbA@mail.gmail.com>
In-Reply-To: <CAGoWJG8mEwscwkUW31ejFyHR63Jm4eQKtUDpeADB2nUinrL59w@mail.gmail.com>
On Tue, May 17, 2016 at 4:26 AM, Ashish Srivastava
<ashish0srivastava0@gmail.com> wrote:
> Yes, the original repro was using a custom allocator, but I was seeing the
> issue with malloc'd memory as well on my (ARMv7) platform.
> I agree that the repro code won't reliably work, so I have modified the repro
> code attached to the bug to use file-backed memory.
Ah, I was going to ask if you were doing this on some platform other
than x86. I followed your reasoning, but when I tested the unpatched
kernel, I couldn't reproduce the problem. I used perf to count page
faults and still didn't see a difference.
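
For reference, a minimal sketch of that kind of measurement, assuming a
file-backed MAP_SHARED mapping and using getrusage() minor-fault counts in
place of perf. The helper name measure_mprotect_faults(), the scratch path
under /tmp, and the page count are illustrative assumptions and are not taken
from the repro attached to the bug:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Minor faults taken by this process so far, or -1 on error. */
static long minor_faults(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt;
}

/* Write one byte to every page; return the minor faults taken meanwhile. */
static long touch_pages(volatile char *buf, size_t npages, size_t pagesz)
{
    long before = minor_faults();

    for (size_t i = 0; i < npages; i++)
        buf[i * pagesz] = (char)i;
    return minor_faults() - before;
}

/*
 * Hypothetical helper: maps a scratch file, dirties every page, then
 * reports the faults taken by a write pass before and after an
 * mprotect(PROT_NONE) / mprotect(PROT_READ|PROT_WRITE) round trip.
 * Returns 0 on success, -1 on any setup failure.
 */
int measure_mprotect_faults(size_t npages, long *faults_before,
                            long *faults_after)
{
    char path[] = "/tmp/mprotect-repro-XXXXXX"; /* assumed scratch location */
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * pagesz;
    char *buf;
    int fd;

    assert(npages > 0);

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (buf == MAP_FAILED)
        return -1;

    touch_pages(buf, npages, pagesz);                  /* populate and dirty */
    *faults_before = touch_pages(buf, npages, pagesz); /* warm write pass */

    if (mprotect(buf, len, PROT_NONE) != 0 ||
        mprotect(buf, len, PROT_READ | PROT_WRITE) != 0) {
        munmap(buf, len);
        return -1;
    }

    *faults_after = touch_pages(buf, npages, pagesz);  /* pass under test */
    munmap(buf, len);
    return 0;
}
```

Calling measure_mprotect_faults(64, &before, &after) and comparing the two
counts is the whole experiment. If /tmp is tmpfs, the mapping is probably not
dirty-accounted and the counts may not differ, so a scratch file on a
disk-backed filesystem is the more interesting case.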
> That really is the root cause of the problem. I can make the following
> change in the kernel that can make the slow writes problem go away.
> This makes vma_set_page_prot return the value of vma_wants_writenotify to
> the caller after setting vma->vm_page_prot.
>
> In vma_set_page_prot:
> -void vma_set_page_prot(struct vm_area_struct *vma)
> +bool vma_set_page_prot(struct vm_area_struct *vma)
>  {
>      unsigned long vm_flags = vma->vm_flags;
>
>      vma->vm_page_prot = vm_pgprot_modify(vma->vm_page_prot, vm_flags);
>      if (vma_wants_writenotify(vma)) {
>          vm_flags &= ~VM_SHARED;
>          vma->vm_page_prot = vm_pgprot_modify(vma->vm_page_prot,
>                                               vm_flags);
> +        return 1;
>      }
> +    return 0;
>  }
>
> In mprotect_fixup:
>
>       * held in write mode.
>       */
>      vma->vm_flags = newflags;
> -    dirty_accountable = vma_wants_writenotify(vma);
> -    vma_set_page_prot(vma);
> +    dirty_accountable = vma_set_page_prot(vma);
>
>      change_protection(vma, start, end, vma->vm_page_prot,
>                        dirty_accountable, 0)
>
> Thanks!
> Ashish
>
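
The ordering argument behind this patch can be checked with a toy userspace
model. This is only a sketch under stated assumptions and is not kernel code:
the toy_* names, the F_* and P_* constants, and the simplified protection
mapping are all hypothetical stand-ins, and the model assumes the backing file
always accounts dirty pages. It compares three orderings for the
PROT_NONE to PROT_READ|PROT_WRITE transition on a shared mapping:

```c
#include <assert.h>
#include <stdbool.h>

enum { F_READ = 1, F_WRITE = 2, F_SHARED = 4 }; /* stand-ins for VM_* flags */
enum { P_READ = 1, P_WRITE = 2 };               /* stand-ins for pgprot bits */

struct toy_vma {
    unsigned flags;
    unsigned page_prot;
};

/* Toy protection mapping: only shared writable mappings get P_WRITE. */
static unsigned toy_prot_for(unsigned flags)
{
    unsigned prot = 0;

    if (flags & F_READ)
        prot |= P_READ;
    if ((flags & F_WRITE) && (flags & F_SHARED))
        prot |= P_WRITE;
    return prot;
}

/* Mirrors the two checks discussed in the thread, nothing more. */
static bool toy_wants_writenotify(const struct toy_vma *vma)
{
    if ((vma->flags & (F_WRITE | F_SHARED)) != (F_WRITE | F_SHARED))
        return false;
    /* "The open routine did something to the protections ..." check. */
    if (vma->page_prot != toy_prot_for(vma->flags))
        return false;
    return true; /* assumption: the file accounts dirty pages */
}

/* Patched shape: report whether write notification was enabled. */
static bool toy_set_page_prot(struct toy_vma *vma)
{
    vma->page_prot = toy_prot_for(vma->flags);
    if (toy_wants_writenotify(vma)) {
        vma->page_prot = toy_prot_for(vma->flags & ~F_SHARED);
        return true;
    }
    return false;
}

/* Current order: query first, while page_prot is still stale. */
bool order_query_first(struct toy_vma vma, unsigned newflags)
{
    bool dirty_accountable;

    vma.flags = newflags;
    dirty_accountable = toy_wants_writenotify(&vma);
    toy_set_page_prot(&vma);
    return dirty_accountable;
}

/* Swapped order: query after page_prot had the shared bit dropped. */
bool order_query_last(struct toy_vma vma, unsigned newflags)
{
    vma.flags = newflags;
    toy_set_page_prot(&vma);
    return toy_wants_writenotify(&vma);
}

/* Proposed order: use the value computed inside toy_set_page_prot(). */
bool order_return_value(struct toy_vma vma, unsigned newflags)
{
    vma.flags = newflags;
    return toy_set_page_prot(&vma);
}
```

Starting from a VMA left at { F_SHARED, 0 } by the PROT_NONE step and moving to
F_READ | F_WRITE | F_SHARED, the first two orderings yield false in this model
and only the third yields true, which matches the reasoning in the bug report.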
> On Mon, May 16, 2016 at 7:05 PM, Kirill A. Shutemov <kirill@shutemov.name>
> wrote:
>>
>> On Fri, May 06, 2016 at 03:01:12PM -0700, Andrew Morton wrote:
>> >
>> > (switched to email. Please respond via emailed reply-to-all, not via
>> > the
>> > bugzilla web interface).
>> >
>> > Great bug report, thanks.
>> >
>> > I assume the breakage was caused by
>> >
>> > commit 64e455079e1bd7787cc47be30b7f601ce682a5f6
>> > Author: Peter Feiner <pfeiner@google.com>
>> > AuthorDate: Mon Oct 13 15:55:46 2014 -0700
>> > Commit: Linus Torvalds <torvalds@linux-foundation.org>
>> > CommitDate: Tue Oct 14 02:18:28 2014 +0200
>> >
>> > mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY
>> > cleared
>> >
>> >
>> > Could someone (Peter, Kirill?) please take a look?
>> >
>> > On Fri, 06 May 2016 13:15:19 +0000 bugzilla-daemon@bugzilla.kernel.org
>> > wrote:
>> >
>> > > https://bugzilla.kernel.org/show_bug.cgi?id=117731
>> > >
>> > > Bug ID: 117731
>> > > Summary: Doing mprotect for PROT_NONE and then for
>> > > PROT_READ|PROT_WRITE reduces CPU write B/W on
>> > > buffer
>> > > Product: Memory Management
>> > > Version: 2.5
>> > > Kernel Version: 3.18 and beyond
>> > > Hardware: All
>> > > OS: Linux
>> > > Tree: Mainline
>> > > Status: NEW
>> > > Severity: high
>> > > Priority: P1
>> > > Component: Other
>> > > Assignee: akpm@linux-foundation.org
>> > > Reporter: ashish0srivastava0@gmail.com
>> > > Regression: No
>> > >
>> > > Created attachment 215401
>> > > --> https://bugzilla.kernel.org/attachment.cgi?id=215401&action=edit
>> > > Repro code
>>
>> The code is somewhat broken: malloc doesn't guarantee to return a
>> page-aligned pointer. And in my case it leads to -EINVAL from mprotect().
>>
>> Do you have a custom malloc()?
>>
>> > > This is a regression that is present in kernel 3.18 and beyond and not
>> > > in
>> > > previous ones.
>> > > Attached is a simple repro case. It measures the time taken to write
>> > > and then
>> > > read all pages in a buffer, then it does mprotect for PROT_NONE and
>> > > then
>> > > mprotect for PROT_READ|PROT_WRITE, then it again measures time taken
>> > > to write
>> > > and then read all pages in a buffer. The 2nd time taken is much larger
>> > > (20 to
>> > > 30 times) than the first one.
>> > >
>> > > I have looked at the code in the kernel tree that is causing this and
>> > > it is
>> > > because writes are causing faults, as pte_mkwrite is not being done
>> > > during
>> > > mprotect_fixup for PROT_READ|PROT_WRITE.
>> > >
>> > > This is the code inside mprotect_fixup in a tree v3.16.35 or older:
>> > > /*
>> > > * vm_flags and vm_page_prot are protected by the mmap_sem
>> > > * held in write mode.
>> > > */
>> > > vma->vm_flags = newflags;
>> > > vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
>> > > vm_get_page_prot(newflags));
>> > >
>> > > if (vma_wants_writenotify(vma)) {
>> > > vma->vm_page_prot = vm_get_page_prot(newflags & ~VM_SHARED);
>> > > dirty_accountable = 1;
>> > > }
>> > > This is the code in the same region inside mprotect_fixup in a recent
>> > > tree:
>> > > /*
>> > > * vm_flags and vm_page_prot are protected by the mmap_sem
>> > > * held in write mode.
>> > > */
>> > > vma->vm_flags = newflags;
>> > > dirty_accountable = vma_wants_writenotify(vma);
>> > > vma_set_page_prot(vma);
>> > >
>> > > The difference is the setting of dirty_accountable. The result of
>> > > vma_wants_writenotify does not depend on vma->vm_flags alone but also
>> > > depends
>> > > on vma->vm_page_prot and following code will make it return 0 because
>> > > in newer
>> > > code we are setting dirty_accountable before setting
>> > > vma->vm_page_prot.
>> > > /* The open routine did something to the protections that
>> > > pgprot_modify
>> > > * won't preserve? */
>> > > if (pgprot_val(vma->vm_page_prot) !=
>> > > pgprot_val(vm_pgprot_modify(vma->vm_page_prot, vm_flags)))
>> > > return 0;
>>
>> The test-case will never hit this, as normal malloc() returns anonymous
>> memory, which is handled by the first check in vma_wants_writenotify().
>>
>> The only case where this can change anything for you is if your
>> malloc() returns file-backed memory. Which is possible, I guess, with a
>> custom malloc().
>>
>> > > Now, suppose we change code by calling vma_set_page_prot before
>> > > setting
>> > > dirty_accountable:
>> > > vma->vm_flags = newflags;
>> > > vma_set_page_prot(vma);
>> > > dirty_accountable = vma_wants_writenotify(vma);
>> > > Still, dirty_accountable will be 0. This is because following code in
>> > > vma_set_page_prot modifies vma->vm_page_prot without modifying
>> > > vma->vm_flags:
>> > > if (vma_wants_writenotify(vma)) {
>> > > vm_flags &= ~VM_SHARED;
>> > > vma->vm_page_prot = vm_pgprot_modify(vma->vm_page_prot,
>> > > vm_flags);
>> > > }
>> > > so this check in vma_wants_writenotify will again return 0:
>> > > /* The open routine did something to the protections that
>> > > pgprot_modify
>> > > * won't preserve? */
>> > > if (pgprot_val(vma->vm_page_prot) !=
>> > > pgprot_val(vm_pgprot_modify(vma->vm_page_prot, vm_flags)))
>> > > return 0;
>> > > So dirty_accountable is still 0.
>> > >
>> > > This code in change_pte_range decides whether to call pte_mkwrite or
>> > > not:
>> > > /* Avoid taking write faults for known dirty pages */
>> > > if (dirty_accountable && pte_dirty(ptent) &&
>> > > (pte_soft_dirty(ptent) ||
>> > > !(vma->vm_flags & VM_SOFTDIRTY))) {
>> > > ptent = pte_mkwrite(ptent);
>> > > }
>> > > If dirty_accountable is 0 even though the pte was dirty already,
>> > > pte_mkwrite
>> > > will not be done.
>> > >
>> > > I think the correct solution should be that dirty_accountable be set
>> > > with the
>> > > value of vma_wants_writenotify queried before vma->vm_page_prot is set
>> > > with
>> > > VM_SHARED removed from flags. One way to do so could be to have
>> > > vma_set_page_prot return the value of dirty_accountable that it can
>> > > set right
>> > > after vma_wants_writenotify check. Another way could be to do
>> > > vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
>> > > vm_get_page_prot(newflags));
>> > > and then set dirty_accountable based on vma_wants_writenotify and then
>> > > call
>> > > vma_set_page_prot.
>>
>> Looks like a good catch, but I'm not sure if it's the root cause of your
>> problem.
>>
>> --
>> Kirill A. Shutemov
>
>
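
The change_pte_range condition quoted in the bug report reduces to a small
predicate. The sketch below restates it in userspace C to show how
dirty_accountable gates pte_mkwrite; should_mkwrite() and its boolean
parameters are hypothetical names standing in for the kernel's PTE and VMA
tests:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy restatement of the quoted condition:
 *   dirty_accountable && pte_dirty(ptent) &&
 *   (pte_soft_dirty(ptent) || !(vma->vm_flags & VM_SOFTDIRTY))
 * Returns true when the PTE would be made writable up front, so that the
 * next write to an already-dirty page does not fault.
 */
bool should_mkwrite(bool dirty_accountable, bool pte_is_dirty,
                    bool pte_is_soft_dirty, bool vma_has_softdirty)
{
    return dirty_accountable && pte_is_dirty &&
           (pte_is_soft_dirty || !vma_has_softdirty);
}
```

With dirty_accountable false the predicate is false for every PTE, dirty or
not, so each page takes a write fault after the mprotect round trip.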