From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Tue, 4 Nov 2008 16:04:49 +0100 From: Andrea Arcangeli Subject: Re: RFC: race between o_direct and fork (harder to fix with get_user_page_fast) Message-ID: <20081104150449.GA31975@random.random> References: <20080925183846.GA6877@duo.random> <20081029004308.GH15599@wotan.suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20081029004308.GH15599@wotan.suse.de> Sender: owner-linux-mm@kvack.org Return-Path: To: Nick Piggin Cc: linux-mm@kvack.org, Avi Kivity , Izik Eidus List-ID: Hi Nick, Thanks for looking into this, as far as I can tell gup_fast is the major problem here as you'll see below. On Wed, Oct 29, 2008 at 01:43:09AM +0100, Nick Piggin wrote: > I do really like the idea of locking pages before they go under direct > IO... it also closes a class of real invalidate_mapping_pages bugs where >>From a practical standpoint I always liked it too. It is less clean, but it allows totally simple serialization of the VM side against the direct-DMA without having to rely on out of order checks of count vs mapcount along with stuff that can change under you without any way to block against it. The only limitation it has is that it prevents direct reads from the same page (while writing to disk) and it also limits write parallelism with sub-page buffer sizes (while reading from disk). Given that for its must-have usages O_DIRECT is generally cached (either by tmpfs/IPC or by VM guest OS cache) or it's a guaranteed cache-polluting workload where it's never going to hit on the same page, it's hard to see how it would ever be practical problem. But still current way is a more 'powerful' design (even if probably not worth it from practical standpoint), so if there's a way to fix it, I also like to keeping this way as a theoretical exercise. > the page is going to be dirtied by the direct-IO, but it is still allowed > to be invalidated from pagecache... As a solution to this problem... I'm not > sure if it would be entirely trivial still. We could wait on get_user_pages Invalidate would wait on PG_lock so it'd be entirely trivial. In the current out of order model w/o PG_lock the usual way to deal with those events would be to let the invalidate go ahead, and have O_DIRECT keep working on a page out of pagecache. It shouldn't care, does it? If it does then it can be taught not to care ;). So the invalidate vs o-direct I/O doesn't worry me that much. We had those kind of issues in the past too with certain fs metadata before we had o-direct (I/O completing on truncated buffer or stuff like that). > in fork, but would we actually want to, rather than just COW them? I think waiting in fork would also be ok if it would be simpler, it's just that we can't wait as there's nothing to wait upon. Whatever is simpler should be preferred here, if cow is simpler that's fine too. Either ways it won't make a difference, this is just about not corrupting data once in a while, it won't ever be measurable in performance terms. > @@ -546,11 +547,19 @@ copy_one_pte(struct mm_struct *dst_mm, s > if (page) { > get_page(page); > page_dup_rmap(page, vma, addr); > + if (unlikely(page_count(page) != page_mapcount(page))) { /* XXX: also have to check swapcount?! */ > + if (is_cow_mapping(vm_flags) && PageAnon(page)) { > + printk("forcecow!\n"); > + ret = 1; > + } > + } Here we're under PT lock so the follow_page in get_user_pages should block. Furthermore we're under mmap_sem write on fork side and get_user_pages will block on mmap_sem read/write. If follow_page has already completed and o-direct is in flight we'd see reliably the page count boosted (here ignoring gup_fast for simplicity), the get_page run by follow_page is run inside the PT lock. > - copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss); > + forcecow = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss); > progress += 8; > } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end); > > @@ -597,6 +611,10 @@ again: > add_mm_rss(dst_mm, rss[0], rss[1]); > pte_unmap_unlock(dst_pte - 1, dst_ptl); > cond_resched(); > + if (forcecow) { > + if (handle_mm_fault(dst_mm, vma, addr - PAGE_SIZE, 1) & VM_FAULT_ERROR) /* XXX: should really just do a page copy? */ > + return -ENOMEM; > + } This isn't enough, we at least need to add something like: if (is_cow_mapping(vm_flags)) { if (forcecow) ptep_set_wrprotect(src_mm, addr, src_pte); pte = pte_wrprotect(pte); } Otherwise the parent context would still break. Setting the child pte readonly and forcing a cow should still be ok with the child, even if the parent pte remained read-write the whole time (it has to or o-direct reads will be lost). Even if the content of the parent page is changing things should be ok (content is definitely changing). If anybody forks out of order with memory data contents in the parent, it means the data that was changing in the parent during fork, can't have a deterministic value in the child. With the additional avoidance of ptep_set_wrprotect in the forcecow case, I think things should be fixed for the parent if follow_page runs before fork. If follow_page instead runs after fork (after fork releases mmap_sem, still ignoring gup_fast for simplicity), things are fine already as get_user_pages will cow before starting any I/O. The fundamental problem is in converting any pte from read-write to readonly while O_DIRECT is in flight, so the fundamental part of the fix is the removal of that ptep_set_wrprotect, the forcecow in the child is just a consequence of the parent having to stay read-write the whole tim. So your path + my removal of ptep_set_wrprotect should just fix fork. If it wasn't for gup_fast this is the exact same way that we would have implemented in KSM too: 1) take PT lock, 2) compare count vs mapcount 3) mark the pte wrprotected 4) bailout if they aren't the same, 5) release PT lock. So now that we agree that the mapcount vs count check under PT lock is enough for get_user_pages, you've to decide how to make gup_fast block. It has to block or we can't possibly convert any writeable pte to readonly ;). Or at least I don't see how you're going to use the pte_write info to avoid gup_fast block... > @@ -1216,6 +1234,7 @@ int __get_user_pages(struct task_struct > > do { > struct page *page; > + int cow = 0; > > /* > * If tsk is ooming, cut off its access to large memory > @@ -1229,8 +1248,24 @@ int __get_user_pages(struct task_struct > foll_flags |= FOLL_WRITE; > > cond_resched(); > - while (!(page = follow_page(vma, start, foll_flags))) { > + > + printk("get_user_pages address=%p\n", (void *)start); > + for (;;) { > int ret; > + > + page = follow_page(vma, start, foll_flags); > + if (page) { > + printk("found page is_cow_mapping=%d PageAnon=%d write=%d cow=%d\n", is_cow_mapping(vma->vm_flags), PageAnon(page), write, cow); > + > + if (is_cow_mapping(vma->vm_flags) && > + PageAnon(page) && !write && !cow) { > + foll_flags |= FOLL_WRITE; > + printk("gup break cow\n"); > + cow = 1; > + } else > + break; > + } > + > ret = handle_mm_fault(mm, vma, start, > foll_flags & FOLL_WRITE); > if (ret & VM_FAULT_ERROR) { > @@ -1252,8 +1287,10 @@ int __get_user_pages(struct task_struct > * pte_write. We can thus safely do subsequent > * page lookups as if they were reads. > */ > - if (ret & VM_FAULT_WRITE) > + if (ret & VM_FAULT_WRITE) { > foll_flags &= ~FOLL_WRITE; > + cow = 1; > + } > The only purpose of this one seems to be to break cows for memory reads (disk writes). It's not needed and it can be dropped I think. We don't care about disk writes ever. We only care about disk reads. The trick is that if anything tries to modify a buffer that is in-flight, the result isn't deterministic, so for disk writes we can always assume that the I/O completed before the cpu triggered the cow. Thanks a lot for the help!! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org