Date: Wed, 29 Oct 2008 01:43:09 +0100
From: Nick Piggin
Subject: Re: RFC: race between o_direct and fork (harder to fix with get_user_page_fast)
Message-ID: <20081029004308.GH15599@wotan.suse.de>
References: <20080925183846.GA6877@duo.random>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20080925183846.GA6877@duo.random>
Sender: owner-linux-mm@kvack.org
To: Andrea Arcangeli
Cc: linux-mm@kvack.org, Avi Kivity, Izik Eidus

On Thu, Sep 25, 2008 at 08:38:46PM +0200, Andrea Arcangeli wrote:
> Hi Nick,
>
> With Izik and Avi, we've been discussing how best to make KSM work
> with O_DIRECT. I don't think it's an immediate priority, but
> eventually we have to fix this, as KVM is close to being able to DMA
> from disk directly into guest physical memory without intermediate
> copies.
>
> Checking whether pages have O_DIRECT (or other similar physical I/O)
> in flight is fairly easy: comparing the page count with the page
> mapcount should do the trick. The source of the problem is that this
> page_count == page_mapcount check has to happen under some lock that
> blocks get_user_pages and get_user_pages_fast. If the page is
> already shared, we only need to block the get_user_pages running on
> the 'mm' of the pte that we're overwriting. So for get_user_pages it
> would be enough to do the page_count == page_mapcount check under
> the PT lock:
>
> 1)
>	PT lock
>	if (page_count != page_mapcount)
>		goto fail
>	make pte readonly
>	PT unlock
>
> then in the final stage:
>
> 2)
>	PT lock
>	if (!pte_same)
>		goto fail
>	change pte to point to ksm page
>	PT unlock
>
> If other tasks are starting in-flight O_DIRECT on the page, we don't
> care: they will have to copy-on-write it before starting the
> O_DIRECT anyway, so there will still be no in-flight I/O on the
> physical page we're working on. All we care about is that
> get_user_pages doesn't run on our mm/PT lock between the
> page_count != page_mapcount check and marking the pte readonly.
> Otherwise it won't trigger the COW, and it should (so that we later
> notice it in the pte_same check!).
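[Illustrative sketch, not code from this thread: the two stages above
expressed with the usual page-table primitives. The helper names and
arguments are made up, and TLB flushing, rmap and reference-count
handling are omitted; this only shows where the page_count/page_mapcount
and pte_same checks sit relative to the PT lock.]

static int ksm_wrprotect_if_unpinned(struct mm_struct *mm, pmd_t *pmd,
				     unsigned long addr, struct page *page)
{
	spinlock_t *ptl;
	pte_t *ptep;
	int ret = -EAGAIN;

	/* stage 1: only wrprotect if no O_DIRECT (or other pin) is in flight */
	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (page_count(page) == page_mapcount(page)) {
		ptep_set_wrprotect(mm, addr, ptep);
		ret = 0;
	}
	pte_unmap_unlock(ptep, ptl);
	return ret;
}

static int ksm_replace_with_shared_page(struct mm_struct *mm, pmd_t *pmd,
					unsigned long addr, pte_t orig_pte,
					struct page *ksm_page, pgprot_t prot)
{
	spinlock_t *ptl;
	pte_t *ptep;
	int ret = -EAGAIN;

	/*
	 * stage 2: if nobody faulted the page writable in between (the pte
	 * is unchanged), switch the pte over to the shared, read-only page.
	 */
	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (pte_same(*ptep, orig_pte)) {
		set_pte_at(mm, addr, ptep, mk_pte(ksm_page, prot));
		ret = 0;
	}
	pte_unmap_unlock(ptep, ptl);
	return ret;
}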
> Now with get_user_pages_fast, the above PT lock isn't enough any
> more to make it safe.
>
> While thinking about get_user_pages_fast, I found another, worse way
> things can go wrong with KSM and O_DIRECT: think of a thread
> constantly writing to the last 512 bytes of a page, while another
> thread does O_DIRECT reads and writes to/from the first 512 bytes of
> the same page. We can lose O_DIRECT reads the very moment we mark
> any pte wrprotected. Avi then immediately pointed out that this
> means fork is affected by the same bug that KSM would have.
>
> So Avi just found a very longstanding bug in fork. Fork has the very
> same problem as KSM: it marks readonly ptes that could point to
> pages with O_DIRECT in flight.
>
> So this isn't a KSM problem any more. We have to fix this
> longstanding bug in fork first. Then we'll think about KSM, and
> we'll use the same locking technique to make KSM safe against
> O_DIRECT too.
>
> The best I can think of is to reintroduce the brlock (possibly after
> making it fair). We can't use RCU as far as I can tell. No idea why
> brlock was removed; perhaps somebody thought RCU was an equivalent
> replacement? RCU/SRCU can't block anything, and we have to block
> get_user_pages_fast in the critical section at point 1 to be safe.
> There's a practical limit to how long things can be delayed, and
> page faults (at least in practice) can't be delayed.
>
> ksm
>
>	br_write_lock()
>	if (page_count != page_mapcount)
>		goto fail
>	make pte readonly
>	br_write_unlock()
>
> fork
>
>	br_write_lock()
>	if (page_count != page_mapcount)
>		copy_page()
>	else
>		make pte readonly
>	br_write_unlock()
>
> get_user_pages_fast
>
>	br_read_lock()
>	walk ptes out of order w/o mmap_sem
>	br_read_unlock()
>
> Another way, of course, is to take mmap_sem in read mode around the
> out-of-order part of get_user_pages_fast, but that would invalidate
> the 'thread vs thread' SMP scalability of get_user_pages_fast.
>
> If it was just for KSM, I suggested we could fix it by sigstopping
> (or getting out of the scheduler by some other, more reliable means)
> all threads that share the 'mm' that KSM is working on. That would
> take care of the fast path of get_user_pages_fast, and the PT lock
> would take care of the get_user_pages_fast slow path. But this
> scheduler technique a la stop_machine surely isn't workable for
> fork(), for performance reasons.
>
> Yet another way is, as usual, to use a page bitflag to serialize
> things at the page level. That would prevent multiple simultaneous
> O_DIRECT reads on the same page, but it would allow fork to wait for
> I/O completion and avoid the copy_page(). Ages ago I always wanted
> to keep PG_locked held for pages under O_DIRECT... We instead relied
> solely on page pinning, which has a few advantages, but it makes
> things like fork more complicated and harder to fix.
>
> I'm very interested to know your ideas on how to best fix fork vs
> o_direct!

Hi Andrea,

Sorry I missed this. Thanks for pinging me again. Great set of bugs
you've found :) We also have the related problem that any existing
COWs need to be broken by get_user_pages...

At the moment I'm just hacking around (I haven't touched fast_gup
yet). But if we follow the rule that a PageAnon page handed out by
get_user_pages must be mapped with a writable pte (pte_write), then
I'm hoping that is going to give us enough synchronisation to get
around the problem. I've attached a really raw hack of what I'm
trying to do. get_user_pages_fast, I think, should be able to do a
similar check without adding locks.

I do really like the idea of locking pages before they go under
direct IO... it also closes a class of real invalidate_mapping_pages
bugs where the page is going to be dirtied by the direct IO, but is
still allowed to be invalidated from pagecache... As a solution to
this problem, though, I'm not sure it would be entirely trivial. We
could wait on get_user_pages pins in fork, but would we actually want
to, rather than just COWing the pages?
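[Illustrative sketch, not part of the patch below: one way the lockless
fast-gup pte walk could enforce the "anonymous pages must be mapped
writable before gup hands them out" rule without taking new locks.
Loosely modeled on the x86 gup_pte_range() of that era; the function
name, arguments and the exact checks shown here are assumptions.]

static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
		unsigned long end, int write, struct page **pages, int *nr)
{
	pte_t *ptep = (pte_t *)pmd_page_vaddr(pmd) + pte_index(addr);

	do {
		pte_t pte = *ptep;	/* a real walk needs an atomic pte read here */
		struct page *page;

		/* anything unusual: bail out and let the slow path sort it out */
		if (!pte_present(pte) || pte_special(pte))
			return 0;
		if (write && !pte_write(pte))
			return 0;

		page = pte_page(pte);

		/*
		 * The proposed rule: never return a PageAnon page from the
		 * lockless walk unless the pte is writable, i.e. COW has
		 * already been broken for this mm.  A read-only anon pte
		 * falls back to the slow path, which can break COW under
		 * the proper locks.
		 */
		if (!write && PageAnon(page) && !pte_write(pte))
			return 0;

		get_page(page);
		pages[*nr] = page;
		(*nr)++;
	} while (ptep++, addr += PAGE_SIZE, addr != end);

	return 1;
}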
---
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -488,7 +488,7 @@ out:
  * covered by this vma.
  */
 
-static inline void
+static inline int
 copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
 		unsigned long addr, int *rss)
@@ -496,6 +496,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
+	int ret = 0;
 
 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
@@ -546,11 +547,19 @@ copy_one_pte(struct mm_struct *dst_mm, s
 	if (page) {
 		get_page(page);
 		page_dup_rmap(page, vma, addr);
+		if (unlikely(page_count(page) != page_mapcount(page))) { /* XXX: also have to check swapcount?! */
+			if (is_cow_mapping(vm_flags) && PageAnon(page)) {
+				printk("forcecow!\n");
+				ret = 1;
+			}
+		}
 		rss[!!PageAnon(page)]++;
 	}
 
 out_set_pte:
 	set_pte_at(dst_mm, addr, dst_pte, pte);
+
+	return ret;
 }
 
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -561,8 +570,10 @@ static int copy_pte_range(struct mm_stru
 	spinlock_t *src_ptl, *dst_ptl;
 	int progress = 0;
 	int rss[2];
+	int forcecow;
 
 again:
+	forcecow = 0;
 	rss[1] = rss[0] = 0;
 	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
 	if (!dst_pte)
@@ -573,6 +584,9 @@ again:
 	arch_enter_lazy_mmu_mode();
 
 	do {
+		if (forcecow)
+			break;
+
 		/*
 		 * We are holding two locks at this point - either of them
 		 * could generate latencies in another task on another CPU.
@@ -587,7 +601,7 @@ again:
 			progress++;
 			continue;
 		}
-		copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
+		forcecow = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
 		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
@@ -597,6 +611,10 @@ again:
 	add_mm_rss(dst_mm, rss[0], rss[1]);
 	pte_unmap_unlock(dst_pte - 1, dst_ptl);
 	cond_resched();
+	if (forcecow) {
+		if (handle_mm_fault(dst_mm, vma, addr - PAGE_SIZE, 1) & VM_FAULT_ERROR) /* XXX: should really just do a page copy? */
+			return -ENOMEM;
+	}
 	if (addr != end)
 		goto again;
 	return 0;
@@ -1216,6 +1234,7 @@ int __get_user_pages(struct task_struct
 
 		do {
 			struct page *page;
+			int cow = 0;
 
 			/*
 			 * If tsk is ooming, cut off its access to large memory
@@ -1229,8 +1248,24 @@ int __get_user_pages(struct task_struct
 				foll_flags |= FOLL_WRITE;
 
 			cond_resched();
-			while (!(page = follow_page(vma, start, foll_flags))) {
+
+			printk("get_user_pages address=%p\n", (void *)start);
+			for (;;) {
 				int ret;
+
+				page = follow_page(vma, start, foll_flags);
+				if (page) {
+					printk("found page is_cow_mapping=%d PageAnon=%d write=%d cow=%d\n", is_cow_mapping(vma->vm_flags), PageAnon(page), write, cow);
+
+					if (is_cow_mapping(vma->vm_flags) &&
+					    PageAnon(page) && !write && !cow) {
+						foll_flags |= FOLL_WRITE;
+						printk("gup break cow\n");
+						cow = 1;
+					} else
+						break;
+				}
+
 				ret = handle_mm_fault(mm, vma, start,
 						foll_flags & FOLL_WRITE);
 				if (ret & VM_FAULT_ERROR) {
@@ -1252,8 +1287,10 @@ int __get_user_pages(struct task_struct
 				 * pte_write. We can thus safely do subsequent
 				 * page lookups as if they were reads.
 				 */
-				if (ret & VM_FAULT_WRITE)
+				if (ret & VM_FAULT_WRITE) {
 					foll_flags &= ~FOLL_WRITE;
+					cow = 1;
+				}
 
 				cond_resched();
 			}
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -346,7 +346,7 @@ static int dup_mmap(struct mm_struct *mm
 		rb_parent = &tmp->vm_rb;
 		mm->map_count++;
 
-		retval = copy_page_range(mm, oldmm, mpnt);
+		retval = copy_page_range(mm, oldmm, tmp);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org