From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id 8B3C56B003D for ; Wed, 11 Mar 2009 16:00:01 -0400 (EDT) Date: Wed, 11 Mar 2009 20:59:35 +0100 From: Andrea Arcangeli Subject: Re: [aarcange@redhat.com: [PATCH] fork vs gup(-fast) fix] Message-ID: <20090311195935.GO27823@random.random> References: <20090311170611.GA2079@elte.hu> <20090311174103.GA11979@elte.hu> <20090311183748.GK27823@random.random> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Ingo Molnar , Nick Piggin , Hugh Dickins , KOSAKI Motohiro , KAMEZAWA Hiroyuki , linux-mm@kvack.org List-ID: On Wed, Mar 11, 2009 at 12:01:56PM -0700, Linus Torvalds wrote: > Btw, I think your patch has a race. Admittedly a really small one. > > When you look up the page in gup.c, and then set the GUP flag on the > "struct page", in between the lookup and the setting of the flag, another > thread can come in and do that same fork+write thing. > > CPU0: CPU1 > > gup: fork: > - look up page > - it's read-write > ... > set_wr_protect > test GUP bit - not set, good > done > > - Mark it GUP > tlb_flush > > write to it from user space - COW Did you notice the check after 'mark it gup' that will run in CPU0? + if (PageAnon(page)) { + if (!PageGUP(page)) + SetPageGUP(page); + smp_mb(); + /* + * Fork doesn't want to flush the smp-tlb for + * every pte that it marks readonly but newly + * created shared anon pages cannot have + * direct-io going to them, so check if fork + * made the page shared before we taken the + * page pin. + */ + if ((pte_flags(gup_get_pte(ptep)) & + (mask | _PAGE_SPECIAL)) != mask) { + put_page(page); + pte_unmap(ptep); + return 0; + } + } gup-fast will _not_ succeed because of the set_wr_protect that just happened on CPU1. That's why I added the above check after setpagegup/get_page. > since there is no lockng on the GUP side (there's the TLB flush that will > wait for interrupts being enabled again on CPU0, but that's later in the > fork sequence). Right, I preferred to 'recheck' the wrprotect bit before allowing gup-fast to succeed to avoid sending a flood of IPI in the fork fast path. So I leave the tlb flush at the end of the fork sequence and a single IPI in the common case. Only exception is the forcecow path where the copy has to happen atomically per-page, so I have to flush the smp-tlb before the copy after marking the parent wrprotected temporarly (later the parent pte is marked read-write again by fork_pre_cow after the copy), or NPTL will never have a chance to fix its bug as its glibc-parent data structures that could be modified by threads won't be copied atomically to the child. But that's a slow path so it's ok to flush tlb there. > Also, having to set the PG_GUP bit means that the "fast" gup is likely not > much faster than the slow one. It now has two atomics per page it looks > up, afaik, which sounds like it would delete any advantage it had over the > slow version that needed locking. gup-fast has already to get_page, so I don't see it. gup-fast will always dirty that cacheline and take over it regardless of PG_gup, gup-fast will never be able to run without running get_page. Furthermore starting from the second access GUP is already set and it's only a read from l1 from a cacheline that was already dirtied and taken over a few instructions before. So I think it can't be slowing down gup-fast in any measurable way, given how close mark-gup is set after get_page. > What we _could_ try to do is to always make the COW breaking be a > _directed_ event - we'd make sure that we always break COW in the > direction of the first owner (going to the rmap chains). That might solve > everything, and be purely local to the logic in mm/memory.c (do_wp_page). That's a really interesting idea and frankly I didn't think about it. Probably one reason is that it can't work for ksm where we take two random anon pages and create one out of them so each one could already have O_DIRECT in progress on them and we've to prevent to merge pages that have in-flight O_DIRECT to be merged no matter what (ordering is irrelevant for ksm, page contents must be stable or ksm will break). I was thinking of using the same logic for both ksm and fork. But theoretically, ksm can keep doing the page_count check to truly ensure no in-flight I/O is going on, and fork could fix it in whatever way it wants (I wonder if it'd be ok for fork to map a 'changing' page in the child because of the not-defined behavior of forking while a read is in progress, at least at the first write the page would stop changing contents). In fact ksm doesn't even require the above change to gup-fast because it does ptep_clear_flush_notify when it tries to wrprotect a not-shared anon page. > I dunno. I have not looked at how horrible that would be. For fork I think it would work, not sure if the current data structures would be enough, but at first glance I think besides how horrible that would be, I think from a practical standpoint the main problem is the slowdown it'd generate in the do_wp_page fast path. The anon_vma list can be huge in some weird case, which we normally cannot care less as swap algorithms and disk I/O (even on no-seeking SSD) is even slower than that. The coolness of rmap w/o pte_chains is that rmap is zerocost for all page faults (a check on vma->anon_vma being not null is the only cost) and I'd like to keep it that way. The cost of my fix to fork is not measurable with fork microbenchmark, while the cost of finding who owns the original shared page in do_wp_page would be potentially be much bigger. The only slowdown to fork is in the O_DIRECT slow path which we don't care about and in the worst case is limited to the total amount of in-flight I/O. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org