* RFC: race between o_direct and fork (harder to fix with get_user_page_fast)
From: Andrea Arcangeli @ 2008-09-25 18:38 UTC
To: Nick Piggin; +Cc: linux-mm, Avi Kivity, Izik Eidus
Hi Nick,
with Izik and Avi, we've been discussing how best to make ksm work
with O_DIRECT. I don't think it's an immediate priority, but eventually
we'll have to fix this, as KVM is close to being able to DMA from disk
directly into guest physical memory without intermediate copies.
Checking whether pages have O_DIRECT (or similar other physical I/O)
in flight is fairly easy: comparing the page count with the page
mapcount should do the trick. The source of the problem is that this
page count == page mapcount check has to happen under some lock that
blocks get_user_pages and get_user_pages_fast. If the page is already
shared, we only need to block the get_user_pages running on the 'mm'
of the 'pte' that we're overwriting. So for get_user_pages it'd be
enough to do the page count == page mapcount check under the PT lock:
1)

	PT lock
	if (page_count != page_mapcount)
		goto fail
	make pte readonly
	PT unlock

then in the final stage:

2)

	PT lock
	if (!pte_same)
		goto fail
	change pte to point to ksm page
	PT unlock
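
To make the two stages concrete, roughly something like this in
kernel-style C (just an illustrative sketch: 'ksm_page' and the 'fail'
label are made up, and a real implementation would also need TLB
flushing and error handling):

	/* stage 1: wrprotect the pte, but only if no I/O is in flight */
	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (page_count(page) != page_mapcount(page))
		goto fail;		/* O_DIRECT (or other pin) in flight */
	ptep_set_wrprotect(mm, addr, ptep);
	orig_pte = *ptep;		/* remember it for the pte_same check */
	pte_unmap_unlock(ptep, ptl);

	/* ... compare contents, find/build the shared ksm page ... */

	/* stage 2: switch the pte over, unless a write COWed it meanwhile */
	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!pte_same(*ptep, orig_pte))
		goto fail;		/* pte changed under us: bail out */
	set_pte_at(mm, addr, ptep,
		   pte_wrprotect(mk_pte(ksm_page, vma->vm_page_prot)));
	pte_unmap_unlock(ptep, ptl);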
If other tasks are starting in-flight O_DIRECT on the page we don't
care: those will have to copy-on-write it before starting the O_DIRECT
anyway, so there will still be no in-flight I/O on the physical page
we're working on. All we care about is that get_user_pages doesn't run
on our mm/PT lock between the page_count != page_mapcount check and
marking the pte readonly. Otherwise it won't trigger the COW when it
should (and it's that COW we later notice via pte_same!).
Now with get_user_pages_fast the above PT lock isn't enough anymore to
make it safe.
While thinking about get_user_pages_fast I found another, worse way
things can go wrong with ksm and O_DIRECT: think of a thread writing
constantly to the last 512 bytes of a page, while another thread reads
and writes to/from the first 512 bytes of that page. We can lose
O_DIRECT reads the very moment we mark any pte wrprotected. Avi then
immediately pointed out that this means fork is affected by the same
bug that ksm would have.
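
To make the fork variant of the interleaving concrete, here is one
possible schedule (just a sketch, not the only bad ordering):

	thread A (O_DIRECT read)           thread B (same mm)
	------------------------           ------------------
	get_user_pages() pins the page
	submit read DMA into the page
	                                   fork(): ptep_set_wrprotect()
	writes last 512 bytes of the page
	 -> COW fault: A's mm now maps
	    a fresh copy of the page
	DMA completes into the OLD page
	 -> the data read from disk never
	    shows up in A's address space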
So Avi just found a very longstanding bug in fork. Fork has the very
same problem as ksm: it marks readonly ptes that could point to pages
with O_DIRECT in flight.

So this isn't a KSM problem anymore. We have to fix this longstanding
bug in fork first. Then we'll think about KSM, and we'll use the same
locking technique to make KSM safe against O_DIRECT too.
The best I can think of is to re-introduce the brlock (possibly after
making it fair). We can't use RCU as far as I can tell. No idea why
brlock was removed; perhaps somebody thought RCU was an equivalent
replacement? RCU/SRCU can't block anything, and we have to block
get_user_pages_fast in the critical section at point 1 to be safe.
There's a practical limit to how long things can be delayed, and page
faults (at least in practice) can't be delayed at all.
ksm

	br_write_lock()
	if (page_count != page_mapcount)
		goto fail
	make pte readonly
	br_write_unlock()

fork

	br_write_lock()
	if (page_count != page_mapcount)
		copy_page()
	else
		make pte readonly
	br_write_unlock()

get_user_pages_fast

	br_read_lock()
	walk ptes out of order w/o mmap_sem
	br_read_unlock()
Another way, of course, is to take the mmap_sem in read mode around
the out-of-order part of get_user_pages_fast, but that would
invalidate the 'thread vs thread' SMP scalability of
get_user_pages_fast.
If it were just for KSM, I suggested we could fix it by sigstopping
(or getting them out of the scheduler by some other more reliable
means) all threads that share the 'mm' that ksm is working on. That
would take care of the fast path of get_user_pages_fast, and the PT
lock would take care of the get_user_pages_fast slow path. But this
scheduler technique a la stop_machine surely isn't workable for fork()
for performance reasons.
Yet another way, as usual, is to use a page bitflag to serialize
things at the page level. That would prevent multiple O_DIRECT reads
to the same page simultaneously, but it would allow fork to wait for
I/O completion and avoid the copy_page(). Ages ago I always wanted to
keep the PG_lock held for pages under O_DIRECT... We instead relied
solely on page pinning, which has a few advantages, but it makes
things like fork more complicated and harder to fix.
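
A minimal sketch of that bitflag idea, assuming we reused PG_locked
(the placement of the calls is hypothetical, it glosses over pins
taken after fork drops the lock, and the fork side couldn't literally
sleep under the PT lock like this):

	/* O_DIRECT submission path, per pinned page: */
	lock_page(page);		/* held across the whole I/O */
	/* ... submit the bio against the pinned page ... */
	/* O_DIRECT completion: */
	unlock_page(page);

	/* fork side, before wrprotecting a pte that maps 'page': */
	lock_page(page);		/* waits for any in-flight I/O */
	ptep_set_wrprotect(src_mm, addr, src_pte);
	unlock_page(page);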
I'd be very interested to hear your ideas on how to best fix fork vs
O_DIRECT!
Thanks,
Andrea
* Re: RFC: race between o_direct and fork (harder to fix with get_user_page_fast)
From: Nick Piggin @ 2008-10-29 0:43 UTC
To: Andrea Arcangeli; +Cc: linux-mm, Avi Kivity, Izik Eidus
On Thu, Sep 25, 2008 at 08:38:46PM +0200, Andrea Arcangeli wrote:
> [Andrea's original message quoted in full; snipped for brevity]
Hi Andrea,
Sorry I missed this. Thanks for pinging me again. Great set of bugs
you've found :)
We also have the related problem that any existing COWs need to be broken
by get_user_pages...
At the moment I'm just hacking around (I haven't touched fast_gup
yet). But if we follow the rule that for PageAnon pages the pte must
be pte_write, then I'm hoping that will give us enough synchronisation
to get around the problem. I've attached a really raw hack of what I'm
trying to do. get_user_pages_fast, I think, should be able to do a
similar check without adding locks.
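
Something like the following inside gup_fast's lockless pte walk is
what I have in mind (a sketch of the rule above, not tested code; the
fallback path is regular get_user_pages, which takes mmap_sem and the
PT lock and can break the COW properly):

	pte_t pte = *ptep;
	struct page *page;

	if (!pte_present(pte))
		return 0;		/* fall back to slow gup */
	page = pte_page(pte);
	if (PageAnon(page) && !pte_write(pte))
		return 0;		/* possibly COW-shared anon page:
					   fall back to the slow path */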
I do really like the idea of locking pages before they go under
direct IO... it also closes a class of real invalidate_mapping_pages
bugs where the page is going to be dirtied by the direct-IO but is
still allowed to be invalidated from pagecache... As a solution to
this particular problem, though, I'm not sure it would be entirely
trivial. We could wait on get_user_pages in fork, but would we
actually want to, rather than just COWing the pages?
---

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -488,7 +488,7 @@ out:
  * covered by this vma.
  */
-static inline void
+static inline int
 copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
 		unsigned long addr, int *rss)
@@ -496,6 +496,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
+	int ret = 0;

 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
@@ -546,11 +547,19 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	if (page) {
 		get_page(page);
 		page_dup_rmap(page, vma, addr);
+		if (unlikely(page_count(page) != page_mapcount(page))) { /* XXX: also have to check swapcount?! */
+			if (is_cow_mapping(vm_flags) && PageAnon(page)) {
+				printk("forcecow!\n");
+				ret = 1;
+			}
+		}
 		rss[!!PageAnon(page)]++;
 	}

 out_set_pte:
 	set_pte_at(dst_mm, addr, dst_pte, pte);
+
+	return ret;
 }

 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -561,8 +570,10 @@ static int copy_pte_range(struct mm_stru
 	spinlock_t *src_ptl, *dst_ptl;
 	int progress = 0;
 	int rss[2];
+	int forcecow;

 again:
+	forcecow = 0;
 	rss[1] = rss[0] = 0;
 	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
 	if (!dst_pte)
@@ -573,6 +584,9 @@ again:
 	arch_enter_lazy_mmu_mode();

 	do {
+		if (forcecow)
+			break;
+
 		/*
 		 * We are holding two locks at this point - either of them
 		 * could generate latencies in another task on another CPU.
@@ -587,7 +601,7 @@ again:
 			progress++;
 			continue;
 		}
-		copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
+		forcecow = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
 		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);

@@ -597,6 +611,10 @@ again:
 	add_mm_rss(dst_mm, rss[0], rss[1]);
 	pte_unmap_unlock(dst_pte - 1, dst_ptl);
 	cond_resched();
+	if (forcecow) {
+		if (handle_mm_fault(dst_mm, vma, addr - PAGE_SIZE, 1) & VM_FAULT_ERROR) /* XXX: should really just do a page copy? */
+			return -ENOMEM;
+	}
 	if (addr != end)
 		goto again;
 	return 0;
@@ -1216,6 +1234,7 @@ int __get_user_pages(struct task_struct

 	do {
 		struct page *page;
+		int cow = 0;

 		/*
 		 * If tsk is ooming, cut off its access to large memory
@@ -1229,8 +1248,24 @@ int __get_user_pages(struct task_struct
 			foll_flags |= FOLL_WRITE;

 		cond_resched();
-		while (!(page = follow_page(vma, start, foll_flags))) {
+
+		printk("get_user_pages address=%p\n", (void *)start);
+		for (;;) {
 			int ret;
+
+			page = follow_page(vma, start, foll_flags);
+			if (page) {
+				printk("found page is_cow_mapping=%d PageAnon=%d write=%d cow=%d\n", is_cow_mapping(vma->vm_flags), PageAnon(page), write, cow);
+
+				if (is_cow_mapping(vma->vm_flags) &&
+				    PageAnon(page) && !write && !cow) {
+					foll_flags |= FOLL_WRITE;
+					printk("gup break cow\n");
+					cow = 1;
+				} else
+					break;
+			}
+
 			ret = handle_mm_fault(mm, vma, start,
 					foll_flags & FOLL_WRITE);
 			if (ret & VM_FAULT_ERROR) {
@@ -1252,8 +1287,10 @@ int __get_user_pages(struct task_struct
 			 * pte_write. We can thus safely do subsequent
 			 * page lookups as if they were reads.
 			 */
-			if (ret & VM_FAULT_WRITE)
+			if (ret & VM_FAULT_WRITE) {
 				foll_flags &= ~FOLL_WRITE;
+				cow = 1;
+			}

 			cond_resched();
 		}
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -346,7 +346,7 @@ static int dup_mmap(struct mm_struct *mm
 		rb_parent = &tmp->vm_rb;
 		mm->map_count++;

-		retval = copy_page_range(mm, oldmm, mpnt);
+		retval = copy_page_range(mm, oldmm, tmp);

 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
* Re: RFC: race between o_direct and fork (harder to fix with get_user_page_fast)
From: Andrea Arcangeli @ 2008-11-04 15:04 UTC
To: Nick Piggin; +Cc: linux-mm, Avi Kivity, Izik Eidus
Hi Nick,
Thanks for looking into this. As far as I can tell, gup_fast is the
major problem here, as you'll see below.
On Wed, Oct 29, 2008 at 01:43:09AM +0100, Nick Piggin wrote:
> I do really like the idea of locking pages before they go under direct
> IO... it also closes a class of real invalidate_mapping_pages bugs where
From a practical standpoint I always liked it too. It is less clean,
but it allows totally simple serialization of the VM side against the
direct-DMA, without having to rely on out-of-order checks of count vs
mapcount along with stuff that can change under you without any way to
block against it.
The only limitation it has is that it prevents parallel direct reads
from the same page (while writing to disk), and it also limits write
parallelism with sub-page buffer sizes (while reading from disk).
Given that for its must-have usages O_DIRECT is generally cached
(either by tmpfs/IPC or by the guest OS cache in a VM), or it's a
guaranteed cache-polluting workload that's never going to hit on the
same page twice, it's hard to see how this would ever be a practical
problem.
Still, the current way is a more 'powerful' design (even if probably
not worth it from a practical standpoint), so if there's a way to fix
it, I'd also like to keep it this way as a theoretical exercise.
> the page is going to be dirtied by the direct-IO, but it is still allowed
> to be invalidated from pagecache... As a solution to this problem... I'm not
> sure if it would be entirely trivial still. We could wait on get_user_pages
Invalidate would wait on the PG_lock, so it'd be entirely trivial. In
the current out-of-order model w/o PG_lock, the usual way to deal with
those events would be to let the invalidate go ahead and have O_DIRECT
keep working on a page outside the pagecache. It shouldn't care,
should it? If it does, it can be taught not to care ;). So invalidate
vs O_DIRECT I/O doesn't worry me that much. We had those kinds of
issues in the past with certain fs metadata, before we had O_DIRECT
(I/O completing on a truncated buffer, or stuff like that).
> in fork, but would we actually want to, rather than just COW them?
I think waiting in fork would also be OK if it were simpler; it's
just that we can't wait, as there's nothing to wait on. Whatever is
simpler should be preferred here; if the cow is simpler, that's fine
too. Either way it won't make a difference: this is just about not
corrupting data once in a while, and it will never be measurable in
performance terms.
> @@ -546,11 +547,19 @@ copy_one_pte(struct mm_struct *dst_mm, s
>  	if (page) {
>  		get_page(page);
>  		page_dup_rmap(page, vma, addr);
> +		if (unlikely(page_count(page) != page_mapcount(page))) { /* XXX: also have to check swapcount?! */
> +			if (is_cow_mapping(vm_flags) && PageAnon(page)) {
> +				printk("forcecow!\n");
> +				ret = 1;
> +			}
> +		}
Here we're under the PT lock, so the follow_page in get_user_pages
will block. Furthermore we're under mmap_sem write on the fork side,
and get_user_pages will block on mmap_sem read/write.

If follow_page has already completed and O_DIRECT is in flight, we'd
reliably see the page count boosted (here ignoring gup_fast for
simplicity): the get_page run by follow_page runs inside the PT lock.
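
So, still ignoring gup_fast, only two orderings are possible here
(just a sketch of the two cases):

	follow_page first:  PT lock, get_page()      -> count boosted
	                    fork's copy_one_pte      -> sees count != mapcount,
	                                                forces the cow
	fork first:         PT lock, wrprotect pte
	                    follow_page runs later   -> !pte_write, so
	                                                handle_mm_fault cows
	                                                before the I/O starts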
> -		copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
> +		forcecow = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
>  		progress += 8;
>  	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
>
> @@ -597,6 +611,10 @@ again:
>  	add_mm_rss(dst_mm, rss[0], rss[1]);
>  	pte_unmap_unlock(dst_pte - 1, dst_ptl);
>  	cond_resched();
> +	if (forcecow) {
> +		if (handle_mm_fault(dst_mm, vma, addr - PAGE_SIZE, 1) & VM_FAULT_ERROR) /* XXX: should really just do a page copy? */
> +			return -ENOMEM;
> +	}
This isn't enough; we at least also need to change the wrprotect logic
to something like:

	if (is_cow_mapping(vm_flags)) {
		if (!forcecow)
			/* parent pte stays read-write: O_DIRECT in flight */
			ptep_set_wrprotect(src_mm, addr, src_pte);
		pte = pte_wrprotect(pte);	/* child pte goes readonly */
	}
Otherwise the parent context would still break. Setting the child pte
readonly and forcing a cow should still be OK for the child, even if
the parent pte remained read-write the whole time (it has to, or
O_DIRECT reads will be lost). Even if the content of the parent page
is changing, things should be OK (the content is definitely changing).
If anybody forks while memory contents are changing in the parent, the
data that was changing during the fork can't have a deterministic
value in the child anyway.

With the additional avoidance of ptep_set_wrprotect in the forcecow
case, I think things should be fixed for the parent if follow_page
runs before fork.
If follow_page instead runs after fork (after fork releases mmap_sem,
still ignoring gup_fast for simplicity), things are fine already, as
get_user_pages will cow before starting any I/O.
The fundamental problem is converting a pte from read-write to
readonly while O_DIRECT is in flight, so the fundamental part of the
fix is the removal of that ptep_set_wrprotect; the forcecow in the
child is just a consequence of the parent having to stay read-write
the whole time.
So your patch + my removal of ptep_set_wrprotect should just fix
fork. If it weren't for gup_fast, this is the exact same way we would
have implemented it in KSM too: 1) take the PT lock, 2) compare count
vs mapcount, 3) mark the pte wrprotected, 4) bail out if they weren't
the same, 5) release the PT lock.
So now that we agree that the mapcount vs count check under the PT
lock is enough for get_user_pages, you have to decide how to make
gup_fast block. It has to block, or we can't possibly convert any
writeable pte to readonly ;). Or at least I don't see how you're going
to use the pte_write info to avoid making gup_fast block...
> @@ -1216,6 +1234,7 @@ int __get_user_pages(struct task_struct
>
>  	do {
>  		struct page *page;
> +		int cow = 0;
>
>  		/*
>  		 * If tsk is ooming, cut off its access to large memory
> @@ -1229,8 +1248,24 @@ int __get_user_pages(struct task_struct
>  			foll_flags |= FOLL_WRITE;
>
>  		cond_resched();
> -		while (!(page = follow_page(vma, start, foll_flags))) {
> +
> +		printk("get_user_pages address=%p\n", (void *)start);
> +		for (;;) {
>  			int ret;
> +
> +			page = follow_page(vma, start, foll_flags);
> +			if (page) {
> +				printk("found page is_cow_mapping=%d PageAnon=%d write=%d cow=%d\n", is_cow_mapping(vma->vm_flags), PageAnon(page), write, cow);
> +
> +				if (is_cow_mapping(vma->vm_flags) &&
> +				    PageAnon(page) && !write && !cow) {
> +					foll_flags |= FOLL_WRITE;
> +					printk("gup break cow\n");
> +					cow = 1;
> +				} else
> +					break;
> +			}
> +
>  			ret = handle_mm_fault(mm, vma, start,
>  					foll_flags & FOLL_WRITE);
>  			if (ret & VM_FAULT_ERROR) {
> @@ -1252,8 +1287,10 @@ int __get_user_pages(struct task_struct
>  			 * pte_write. We can thus safely do subsequent
>  			 * page lookups as if they were reads.
>  			 */
> -			if (ret & VM_FAULT_WRITE)
> +			if (ret & VM_FAULT_WRITE) {
>  				foll_flags &= ~FOLL_WRITE;
> +				cow = 1;
> +			}
>
The only purpose of this one seems to be to break cows for memory
reads (disk writes). It's not needed and can be dropped, I think. We
never care about disk writes; we only care about disk reads. The trick
is that if anything modifies a buffer while it is in flight, the
result isn't deterministic anyway, so for disk writes we can always
assume the I/O completed before the cpu triggered the cow.
Thanks a lot for the help!!