From: Nick Piggin <nickpiggin@yahoo.com.au>
To: Robin Holt <holt@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>, Hugh Dickins <hugh@veritas.com>,
Christoph Lameter <clameter@sgi.com>,
Jack Steiner <steiner@sgi.com>,
linux-mm@kvack.org
Subject: Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
Date: Thu, 19 Jun 2008 03:29:30 +1000
Message-ID: <200806190329.30622.nickpiggin@yahoo.com.au>
In-Reply-To: <20080618164158.GC10062@sgi.com>
On Thursday 19 June 2008 02:41, Robin Holt wrote:
> I am running into a problem where I think a call to get_user_pages(...,
> write=1, force=1,...) is returning a readable pte and a page ref count
> of 2. I have not yet trapped the event, but I think I see one place
> where this _may_ be happening.
>
> In the sles10 kernel source:
> int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>                    unsigned long start, int len, int write, int force,
>                    struct page **pages, struct vm_area_struct **vmas)
> {
>     ...
> retry:
>     cond_resched();
>     while (!(page = follow_page(vma, start, foll_flags))) {
>         int ret;
>         ret = __handle_mm_fault(mm, vma, start,
>                                 foll_flags & FOLL_WRITE);
>         ...
>         /*
>          * The VM_FAULT_WRITE bit tells us that do_wp_page has
>          * broken COW when necessary, even if maybe_mkwrite
>          * decided not to set pte_write. We can thus safely do
>          * subsequent page lookups as if they were reads.
>          */
>         if (ret & VM_FAULT_WRITE)
>             foll_flags &= ~FOLL_WRITE;
>
>         cond_resched();
>     }
>
> The case I am seeing is under heavy memory pressure.
>
> I think the first pass at follow_page has failed and we called
> __handle_mm_fault(). At the time in __handle_mm_fault where the page table
> is unlocked, there is a writable pte in the process's page table, and a
> struct page with a reference count of 1. ret will have VM_FAULT_WRITE
> set so the get_user_pages code will clear FOLL_WRITE from foll_flags.
>
> Between the time above and the second attempt at follow_page, the
> page gets swapped out. The second attempt at follow_page, now without
> FOLL_WRITE (and FOLL_GET is set) will result in a read-only pte with a
> reference count of 2.

There would not be a writeable pte in the page table, otherwise
VM_FAULT_WRITE should not get returned. But it can be returned via
other paths...

However, assuming it was returned, then mmap_sem is still held, so
the vma should not get changed from a writeable to a readonly one,
so I can't see the problem you're describing with that sequence.

Swap pages, for one, could return with VM_FAULT_WRITE, then
subsequently have the page swapped out, then have a readonly pte set
up by the __handle_mm_fault with write access cleared. *I think*.

But although that feels a bit unclean, I don't think it would cause
a problem because the previous VM_FAULT_WRITE (while under mmap_sem)
ensures our swap page should still be valid to write into via
get_user_pages (and a subsequent write access should cause do_wp_page
to go through the proper reuse logic and not COW).
> Any subsequent write fault by the process will
> result in a COW break and the process pointing at a different page than
> the get_user_pages() returned page.
>
> Is this sequence plausible or am I missing something key?
>
> If this sequence is plausible, I need to know how to either work around
> this problem or if it should really be fixed in the kernel.

I'd be interested to know the situation that leads to this problem.
If possible a test case would be ideal.

But, with force=1, it is possible to create private "COW" copies of pages
that have readonly ptes in the process page table, and that the process
never has permission to write into (these are "Linus pages").

This situation should not cause the process to be able to write into the
address and cause a further COW, but in the case of shared vmas, it will
cause the page to become disconnected from the file...