From: Nick Piggin <npiggin@suse.de>
To: Andrea Arcangeli <andrea@qumranet.com>
Cc: linux-mm@kvack.org, Avi Kivity <avi@qumranet.com>,
Izik Eidus <izike@qumranet.com>
Subject: Re: RFC: race between o_direct and fork (harder to fix with get_user_page_fast)
Date: Wed, 29 Oct 2008 01:43:09 +0100
Message-ID: <20081029004308.GH15599@wotan.suse.de>
In-Reply-To: <20080925183846.GA6877@duo.random>
On Thu, Sep 25, 2008 at 08:38:46PM +0200, Andrea Arcangeli wrote:
> Hi Nick,
>
> with Izik and Avi, we've been discussing how best to make ksm work
> with O_DIRECT. I don't think it's an immediate priority, but we'll
> have to fix this eventually, as KVM is close to being able to DMA
> from disk directly into guest physical memory without intermediate
> copies.
>
> Checking whether pages have O_DIRECT (or other similar physical I/O)
> in flight is fairly easy: comparing page count with page mapcount
> should do the trick. The source of the problem is that this page
> count == page mapcount check has to happen under some lock that
> blocks get_user_pages and get_user_pages_fast. If the page is already
> shared, we only need to block the get_user_pages running on the 'mm'
> of the 'pte' that we're overwriting. So for get_user_pages it'd be
> enough to check page count == page mapcount under the PT lock.
>
> 1)
> PT lock
> if (page_count != page_mapcount)
> goto fail
> make pte readonly
> PT unlock
>
> then in the final stage:
>
> 2)
> PT lock
> if (!pte_same)
> goto fail
> change pte to point to ksm page
> PT unlock
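>
> A minimal C sketch of stage 1 (the surrounding helpers are only
> illustrative, this is not a real patch):
>
> 	pte_t *ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> 	struct page *page = vm_normal_page(vma, addr, *ptep);
>
> 	/* any extra reference may be physical I/O in flight */
> 	if (!page || page_count(page) != page_mapcount(page)) {
> 		pte_unmap_unlock(ptep, ptl);
> 		goto fail;
> 	}
> 	/* wrprotect so later writers COW away from the shared page */
> 	set_pte_at(mm, addr, ptep,
> 		   pte_wrprotect(ptep_clear_flush(vma, addr, ptep)));
> 	pte_unmap_unlock(ptep, ptl);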
>
> If other tasks start new O_DIRECT on the page in the meantime we
> don't care: they will have to copy-on-write it before starting the
> O_DIRECT anyway, so there will still be no in-flight I/O on the
> physical page we're working on. All we care about is that
> get_user_pages doesn't run on our mm/PT lock between the
> page_count != page_mapcount check and marking the pte read-only.
> Otherwise it won't trigger the COW it should have (which we would
> later catch with the pte_same check!).
>
> Now with get_user_pages_fast the above PT lock is no longer enough
> to make this safe.
>
> While thinking about get_user_pages_fast I found another, worse way
> things can go wrong with ksm and O_DIRECT: imagine one thread
> constantly writing to the last 512 bytes of a page while another
> thread does O_DIRECT reads and writes to/from the first 512 bytes of
> the same page. We can lose O_DIRECT reads the very moment we mark
> any pte wrprotected. Avi then immediately pointed out that this
> means fork is affected by the same bug that ksm would have.
>
> So Avi just found a very longstanding bug in fork. Fork has the very
> same problem as ksm: it marks read-only ptes that may point to pages
> with O_DIRECT in flight.
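>
> To make the fork race concrete, one possible interleaving (assuming
> the O_DIRECT read targets an anonymous page P in the parent):
>
> 	parent thread A                parent thread B
> 	---------------                ---------------
> 	                               get_user_pages() pins P
> 	                               submits O_DIRECT read into P
> 	fork(): wrprotects pte of P
> 	writes to P -> COW: parent
> 	now maps a fresh copy P'
> 	                               DMA completes into P, not P'
> 	                               B's buffer (now P') misses the data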
>
> So this isn't a KSM problem anymore. We have to fix this longstanding
> bug in fork first. Then we'll turn to KSM and use the same locking
> technique to make KSM safe against O_DIRECT too.
>
> The best I can think of is to re-introduce the brlock (possibly
> after making it fair). We can't use RCU as far as I can tell. No
> idea why brlock was removed; perhaps somebody thought RCU was an
> equivalent replacement? RCU/SRCU can't block anything, and we have
> to block get_user_pages_fast during the critical section at point 1
> to be safe. There's a practical limit to how long things can be
> delayed, and page faults (at least in practice) can't be delayed
> at all.
>
> ksm
>
> br_write_lock()
> if (page_count != page_mapcount)
> goto fail
> make pte readonly
> br_write_unlock()
>
> fork
>
> br_write_lock()
> if (page_count != page_mapcount)
> copy_page()
> else
> make pte readonly
> br_write_unlock()
>
> get_user_pages_fast
>
> br_read_lock()
> walk ptes out of order w/o mmap_sem
> br_read_unlock()
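>
> A sketch of the get_user_pages_fast side (br_read_lock assumes a
> re-introduced brlock API, and walk_pagetables_fast stands in for the
> lockless walk; both names are hypothetical):
>
> 	int get_user_pages_fast(unsigned long start, int nr_pages,
> 				int write, struct page **pages)
> 	{
> 		int nr;
>
> 		br_read_lock(mm_brlock); /* blocks ksm/fork stage 1 */
> 		local_irq_disable();
> 		nr = walk_pagetables_fast(start, nr_pages, write, pages);
> 		local_irq_enable();
> 		br_read_unlock(mm_brlock);
> 		return nr;
> 	}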
>
> Another way, of course, is to take the mmap_sem in read mode around
> the out-of-order part of get_user_pages_fast, but that would destroy
> the 'thread vs thread' SMP scalability of get_user_pages_fast.
>
> If it were just for KSM, I suggested we could fix it by SIGSTOPping
> (or descheduling by some other more reliable means) all threads
> sharing the 'mm' that ksm is working on. That would take care of the
> get_user_pages_fast fast path, and the PT lock would take care of
> the get_user_pages_fast slow path. But this stop_machine-like
> scheduling trick surely isn't workable for fork(), for performance
> reasons.
>
> Yet another way is, as usual, to use a page bitflag to serialize
> things at the page level. That would prevent multiple simultaneous
> O_DIRECT reads on the same page, but it would allow fork to wait for
> I/O completion and avoid the copy_page(). Ages ago I always wanted
> to keep PG_locked held for pages under O_DIRECT... Instead we relied
> solely on page pinning, which has a few advantages but makes things
> like fork more complicated and harder to fix.
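>
> With a hypothetical PG_pinned_io bitflag the serialization could
> look roughly like this (the flag name and placement are made up):
>
> 	/* O_DIRECT submit path: claim the page for physical I/O */
> 	while (test_and_set_bit(PG_pinned_io, &page->flags))
> 		wait_on_page_bit(page, PG_pinned_io);
>
> 	/* fork: wait for the I/O to drain instead of copying */
> 	wait_on_page_bit(page, PG_pinned_io);
> 	/* ...then mark the pte read-only as in case 1 above */
>
> 	/* I/O completion path */
> 	clear_bit(PG_pinned_io, &page->flags);
> 	smp_mb__after_clear_bit();
> 	wake_up_page(page, PG_pinned_io);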
>
> I'm very interested to hear your ideas on how best to fix fork vs
> O_DIRECT!
Hi Andrea,
Sorry I missed this. Thanks for pinging me again. Great set of bugs
you've found :)
We also have the related problem that any existing COWs need to be broken
by get_user_pages...
At the moment I'm just hacking around (I haven't touched fast_gup yet).
But if we follow the rule that for PageAnon pages the pte must be
writable (pte_write), then I'm hoping that gives us enough
synchronisation to get around the problem. I've attached a really raw
hack of what I'm trying to do. get_user_pages_fast, I think, should
be able to do a similar check without adding locks.
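
A rough sketch of the kind of check I mean for the fast path (this is
not in the patch below; gup_get_pte is the x86 fast_gup helper):

	pte_t pte = gup_get_pte(ptep);
	struct page *page;

	if (!pte_present(pte) || (write && !pte_write(pte)))
		return 0;	/* fall back to the slow path */
	page = pte_page(pte);
	/*
	 * An anon page mapped read-only may have a COW pending:
	 * bail out to slow gup, which will break the COW for us.
	 */
	if (PageAnon(page) && !pte_write(pte))
		return 0;
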
I do really like the idea of locking pages before they go under direct
IO... it also closes a class of real invalidate_mapping_pages bugs where
the page is going to be dirtied by the direct-IO but is still allowed
to be invalidated from pagecache... As a solution to this problem,
though, I'm not sure it would be entirely trivial. We could wait on
get_user_pages pins in fork, but would we actually want to, rather
than just COW the pages?
---
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -488,7 +488,7 @@ out:
* covered by this vma.
*/
-static inline void
+static inline int
copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
unsigned long addr, int *rss)
@@ -496,6 +496,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
unsigned long vm_flags = vma->vm_flags;
pte_t pte = *src_pte;
struct page *page;
+ int ret = 0;
/* pte contains position in swap or file, so copy. */
if (unlikely(!pte_present(pte))) {
@@ -546,11 +547,19 @@ copy_one_pte(struct mm_struct *dst_mm, s
if (page) {
get_page(page);
page_dup_rmap(page, vma, addr);
+ if (unlikely(page_count(page) != page_mapcount(page))) { /* XXX: also have to check swapcount?! */
+ if (is_cow_mapping(vm_flags) && PageAnon(page)) {
+ printk("forcecow!\n");
+ ret = 1;
+ }
+ }
rss[!!PageAnon(page)]++;
}
out_set_pte:
set_pte_at(dst_mm, addr, dst_pte, pte);
+
+ return ret;
}
static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -561,8 +570,10 @@ static int copy_pte_range(struct mm_stru
spinlock_t *src_ptl, *dst_ptl;
int progress = 0;
int rss[2];
+ int forcecow;
again:
+ forcecow = 0;
rss[1] = rss[0] = 0;
dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
if (!dst_pte)
@@ -573,6 +584,9 @@ again:
arch_enter_lazy_mmu_mode();
do {
+ if (forcecow)
+ break;
+
/*
* We are holding two locks at this point - either of them
* could generate latencies in another task on another CPU.
@@ -587,7 +601,7 @@ again:
progress++;
continue;
}
- copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
+ forcecow = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
progress += 8;
} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
@@ -597,6 +611,10 @@ again:
add_mm_rss(dst_mm, rss[0], rss[1]);
pte_unmap_unlock(dst_pte - 1, dst_ptl);
cond_resched();
+ if (forcecow) {
+ if (handle_mm_fault(dst_mm, vma, addr - PAGE_SIZE, 1) & VM_FAULT_ERROR) /* XXX: should really just do a page copy? */
+ return -ENOMEM;
+ }
if (addr != end)
goto again;
return 0;
@@ -1216,6 +1234,7 @@ int __get_user_pages(struct task_struct
do {
struct page *page;
+ int cow = 0;
/*
* If tsk is ooming, cut off its access to large memory
@@ -1229,8 +1248,24 @@ int __get_user_pages(struct task_struct
foll_flags |= FOLL_WRITE;
cond_resched();
- while (!(page = follow_page(vma, start, foll_flags))) {
+
+ printk("get_user_pages address=%p\n", (void *)start);
+ for (;;) {
int ret;
+
+ page = follow_page(vma, start, foll_flags);
+ if (page) {
+ printk("found page is_cow_mapping=%d PageAnon=%d write=%d cow=%d\n", is_cow_mapping(vma->vm_flags), PageAnon(page), write, cow);
+
+ if (is_cow_mapping(vma->vm_flags) &&
+ PageAnon(page) && !write && !cow) {
+ foll_flags |= FOLL_WRITE;
+ printk("gup break cow\n");
+ cow = 1;
+ } else
+ break;
+ }
+
ret = handle_mm_fault(mm, vma, start,
foll_flags & FOLL_WRITE);
if (ret & VM_FAULT_ERROR) {
@@ -1252,8 +1287,10 @@ int __get_user_pages(struct task_struct
* pte_write. We can thus safely do subsequent
* page lookups as if they were reads.
*/
- if (ret & VM_FAULT_WRITE)
+ if (ret & VM_FAULT_WRITE) {
foll_flags &= ~FOLL_WRITE;
+ cow = 1;
+ }
cond_resched();
}
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -346,7 +346,7 @@ static int dup_mmap(struct mm_struct *mm
rb_parent = &tmp->vm_rb;
mm->map_count++;
- retval = copy_page_range(mm, oldmm, mpnt);
+ retval = copy_page_range(mm, oldmm, tmp);
if (tmp->vm_ops && tmp->vm_ops->open)
tmp->vm_ops->open(tmp);