The code is somewhat broken: malloc doesn't guarantee to returnOn Fri, May 06, 2016 at 03:01:12PM -0700, Andrew Morton wrote:
>
> (switched to email. Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> Great bug report, thanks.
>
> I assume the breakage was caused by
>
> commit 64e455079e1bd7787cc47be30b7f601ce682a5f6
> Author: Peter Feiner <pfeiner@google.com>
> AuthorDate: Mon Oct 13 15:55:46 2014 -0700
> Commit: Linus Torvalds <torvalds@linux-foundation.org>
> CommitDate: Tue Oct 14 02:18:28 2014 +0200
>
> mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
>
>
> Could someone (Peter, Kirill?) please take a look?
>
> On Fri, 06 May 2016 13:15:19 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:
>
> > https://bugzilla.kernel.org/show_bug.cgi?id=117731
> >
> > Bug ID: 117731
> > Summary: Doing mprotect for PROT_NONE and then for
> > PROT_READ|PROT_WRITE reduces CPU write B/W on buffer
> > Product: Memory Management
> > Version: 2.5
> > Kernel Version: 3.18 and beyond
> > Hardware: All
> > OS: Linux
> > Tree: Mainline
> > Status: NEW
> > Severity: high
> > Priority: P1
> > Component: Other
> > Assignee: akpm@linux-foundation.org
> > Reporter: ashish0srivastava0@gmail.com
> > Regression: No
> >
> > Created attachment 215401
> > --> https://bugzilla.kernel.org/attachment.cgi?id=215401&action=edit
> > Repro code
page-aligned pointer. And in my case it leads -EINVAL from mprotect().
Do you have a custom malloc()?
The test-case will never hit this, as normal malloc() returns anonymous
> > This is a regression that is present in kernel 3.18 and beyond and not in
> > previous ones.
> > Attached is a simple repro case. It measures the time taken to write and then
> > read all pages in a buffer, then it does mprotect for PROT_NONE and then
> > mprotect for PROT_READ|PROT_WRITE, then it again measures time taken to write
> > and then read all pages in a buffer. The 2nd time taken is much larger (20 to
> > 30 times) than the first one.
> >
> > I have looked at the code in the kernel tree that is causing this and it is
> > because writes are causing faults, as pte_mkwrite is not being done during
> > mprotect_fixup for PROT_READ|PROT_WRITE.
> >
> > This is the code inside mprotect_fixup in a tree v3.16.35 or older:
> > /*
> > * vm_flags and vm_page_prot are protected by the mmap_sem
> > * held in write mode.
> > */
> > vma->vm_flags = newflags;
> > vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
> > vm_get_page_prot(newflags));
> >
> > if (vma_wants_writenotify(vma)) {
> > vma->vm_page_prot = vm_get_page_prot(newflags & ~VM_SHARED);
> > dirty_accountable = 1;
> > }
> > This is the code in the same region inside mprotect_fixup in a recent tree:
> > /*
> > * vm_flags and vm_page_prot are protected by the mmap_sem
> > * held in write mode.
> > */
> > vma->vm_flags = newflags;
> > dirty_accountable = vma_wants_writenotify(vma);
> > vma_set_page_prot(vma);
> >
> > The difference is the setting of dirty_accountable. result of
> > vma_wants_writenotify does not depend on vma->vm_flags alone but also depends
> > on vma->vm_page_prot and following code will make it return 0 because in newer
> > code we are setting dirty_accountable before setting vma->vm_page_prot.
> > /* The open routine did something to the protections that pgprot_modify
> > * won't preserve? */
> > if (pgprot_val(vma->vm_page_prot) !=
> > pgprot_val(vm_pgprot_modify(vma->vm_page_prot, vm_flags)))
> > return 0;
memory, which is handled by the first check in vma_wants_writenotify().
The only case when the case can change anything for you is if your
malloc() return file-backed memory. Which is possible, I guess, with
custom malloc().
Looks like a good catch, but I'm not sure if it's the root cause of your
> > Now, suppose we change code by calling vma_set_page_prot before setting
> > dirty_accountable:
> > vma->vm_flags = newflags;
> > vma_set_page_prot(vma);
> > dirty_accountable = vma_wants_writenotify(vma);
> > Still, dirty_accountable will be 0. This is because following code in
> > vma_set_page_prot modifies vma->vm_page_prot without modifying vma->vm_flags:
> > if (vma_wants_writenotify(vma)) {
> > vm_flags &= ~VM_SHARED;
> > vma->vm_page_prot = vm_pgprot_modify(vma->vm_page_prot,
> > vm_flags);
> > }
> > so this check in vma_wants_writenotify will again return 0:
> > /* The open routine did something to the protections that pgprot_modify
> > * won't preserve? */
> > if (pgprot_val(vma->vm_page_prot) !=
> > pgprot_val(vm_pgprot_modify(vma->vm_page_prot, vm_flags)))
> > return 0;
> > So dirty_accountable is still 0.
> >
> > This code in change_pte_range decides whether to call pte_mkwrite or not:
> > /* Avoid taking write faults for known dirty pages */
> > if (dirty_accountable && pte_dirty(ptent) &&
> > (pte_soft_dirty(ptent) ||
> > !(vma->vm_flags & VM_SOFTDIRTY))) {
> > ptent = pte_mkwrite(ptent);
> > }
> > If dirty_accountable is 0 even though the pte was dirty already, pte_mkwrite
> > will not be done.
> >
> > I think the correct solution should be that dirty_accountable be set with the
> > value of vma_wants_writenotify queried before vma->vm_page_prot is set with
> > VM_SHARED removed from flags. One way to do so could be to have
> > vma_set_page_prot return the value of dirty_accountable that it can set right
> > after vma_wants_writenotify check. Another way could be to do
> > vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
> > vm_get_page_prot(newflags));
> > and then set dirty_accountable based on vma_wants_writenotify and then call
> > vma_set_page_prot.
problem.
--
Kirill A. Shutemov