From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <41B92567.8070809@yahoo.com.au>
Date: Fri, 10 Dec 2004 15:26:15 +1100
From: Nick Piggin
MIME-Version: 1.0
Subject: Re: page fault scalability patch V12 [0/7]: Overview and performance tests
References:
In-Reply-To:
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
Return-Path:
To: Hugh Dickins
Cc: Christoph Lameter, Linus Torvalds, Andrew Morton,
 Benjamin Herrenschmidt, linux-mm@kvack.org,
 linux-ia64@vger.kernel.org, linux-kernel@vger.kernel.org
List-ID:

Hugh Dickins wrote:
> On Wed, 1 Dec 2004, Christoph Lameter wrote:
>
>> Changes from V11->V12 of this patch:
>> - dump sloppy_rss in favor of list_rss (Linus' proposal)
>> - keep up against current Linus tree (patch is based on 2.6.10-rc2-bk14)
>>
>> This is a series of patches that increases the scalability of
>> the page fault handler for SMP. Here are some performance results
>> on a machine with 512 processors allocating 32 GB with an increasing
>> number of threads (that are assigned a processor each).
>
> Your V12 patches would apply well to 2.6.10-rc3, except that (as noted
> before) your mailer or whatever is eating trailing whitespace: trivial
> patch attached to apply before yours, removing that whitespace so yours
> apply. But what your patches need to apply to would be 2.6.10-mm.
>
> Your i386 HIGHMEM64G 3level ptep_cmpxchg forgets to use cmpxchg8b, would
> have tested out okay up to 4GB but not above: trivial patch attached.

That looks obviously correct. Probably the reason why Martin was getting
crashes.

[snip]

> Moving to the main patch, 1/7, the major issue I see there is the way
> do_anonymous_page does update_mmu_cache after setting the pte, without
> any page_table_lock to bracket them together. Obviously no problem on
> architectures where update_mmu_cache is a no-op! But although there's
> been plenty of discussion, particularly with Ben and Nick, I've not
> noticed anything to guarantee that as safe on all architectures. I do
> think it's fine for you to post your patches before completing hooks in
> all the arches, but isn't this a significant issue which needs to be
> sorted before your patches go into -mm? You hazily refer to such issues
> in 0/7, but now you need to work with arch maintainers to settle them
> and show the patches.

Yep, the update_mmu_cache issue is real. There is a parallel problem:
update_mmu_cache can be called on a pte whose page has since been
evicted and reused. Again, that looks safe on IA64, but maybe not on
other architectures.

It can be solved by moving lru_cache_add to after update_mmu_cache in
all cases but the "update accessed bit" type fault. I solved that by
simply defining it out for architectures that don't need it - a raced
fault will simply get repeated if need be.

> A lesser issue with the reordering in do_anonymous_page: don't you need
> to move the lru_cache_add_active after the page_add_anon_rmap, to avoid
> the very slight chance that vmscan will pick the page off the LRU and
> unmap it before you've counted it in, hitting page_remove_rmap's
> BUG_ON(page_mapcount(page) < 0)?

That's what I had been doing too. Seems to be the right way to go.
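Something like the following is what I have in mind for the write-fault
side - a sketch only, written against 2.6.10-ish mainline names.
ptep_cmpxchg() comes from Christoph's series (its exact signature may
differ there), and rss accounting plus the ZERO_PAGE read-fault case
are trimmed:

#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/rmap.h>
#include <linux/highmem.h>

static int do_anonymous_page_write(struct mm_struct *mm,
		struct vm_area_struct *vma, pte_t *page_table,
		unsigned long addr, pte_t orig_entry)
{
	struct page *page;
	pte_t entry;

	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
	if (!page)
		return VM_FAULT_OOM;
	clear_user_highpage(page, addr);

	entry = maybe_mkwrite(pte_mkdirty(mk_pte(page,
					vma->vm_page_prot)), vma);

	/*
	 * Install the pte atomically against the value we faulted on
	 * without page_table_lock; if another cpu got in first, back
	 * out and let the fault be retried.
	 */
	if (!ptep_cmpxchg(vma, addr, page_table, orig_entry, entry)) {
		page_cache_release(page);
		return VM_FAULT_MINOR;
	}

	/* Count the mapping before vmscan can possibly see the page... */
	page_add_anon_rmap(page, vma, addr);
	/* ...run the arch hook while the page still cannot be evicted... */
	update_mmu_cache(vma, addr, entry);
	/* ...and only then expose it to page reclaim. */
	lru_cache_add_active(page);

	return VM_FAULT_MINOR;
}

The point being that until lru_cache_add_active() runs, vmscan cannot
find the page at all, so both the mapcount and the update_mmu_cache
call are safely done first.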
> (I do wonder why do_anonymous_page calls mark_page_accessed as well as
> lru_cache_add_active. The other instances of lru_cache_add_active for
> an anonymous page don't mark_page_accessed i.e. SetPageReferenced too,
> why here? But that's nothing new with your patch, and although you've
> reordered the calls, the final page state is the same as before.)
>
> Where handle_pte_fault does "entry = *pte" without page_table_lock:
> you're quite right to pass down precisely that entry to the fault
> handlers below, but there's still a problem on the 32bit architectures
> supporting 64bit ptes (i386, mips, ppc), that the upper and lower ints
> of entry may be out of synch. Not a problem for do_anonymous_page, or
> anything else relying on ptep_cmpxchg to check; but a problem for
> do_wp_page (which could find !pfn_valid and kill the process) and
> probably others (harder to think through). Your 4/7 patch for i386 has
> an unused atomic get_64bit function from Nick, I think you'll have to
> define a get_pte_atomic macro and use get_64bit in its 64-on-32 cases.

Indeed. This was a real problem for my patch, definitely.

> Hmm, that will only work if you're using atomic set_64bit rather than
> relying on page_table_lock in the complementary places which matter.
> Which I believe you are indeed doing in your 3level set_pte. Shouldn't
> __set_64bit be using LOCK_PREFIX like __get_64bit, instead of lock?

That's what I was wondering. It could be that the actual 64-bit store
is still atomic without the lock prefix (just not the entire
read-modify-write), which I think would be sufficient. In that case,
get_64bit may be able to drop the lock prefix as well.
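For reference, the idiom in question reads 64 bits atomically by
(ab)using cmpxchg8b: load edx:eax from ecx:ebx first, so the compare
normally fails and the instruction degenerates into an atomic 64-bit
load; if it happens to match, it stores back the identical value, so
memory is unchanged either way. Here is a sketch of that plus the
get_pte_atomic wrapper Hugh suggests - the names and the __pte()
construction are approximate, not lifted from the actual 4/7 patch.
Note LOCK_PREFIX compiles away on !CONFIG_SMP, which is presumably
Hugh's point about __set_64bit's hardcoded lock:

#include <asm/system.h>		/* LOCK_PREFIX */

static inline unsigned long long get_64bit(volatile unsigned long long *ptr)
{
	unsigned long long ret;

	/*
	 * cmpxchg8b compares edx:eax with *ptr: on mismatch it loads
	 * *ptr into edx:eax; on match it stores ecx:ebx (== edx:eax
	 * == *ptr) back unchanged. Either way edx:eax ends up holding
	 * an atomic snapshot of *ptr.
	 */
	__asm__ __volatile__(
		"movl %%ebx, %%eax\n\t"
		"movl %%ecx, %%edx\n\t"
		LOCK_PREFIX "cmpxchg8b %1"
		: "=&A" (ret), "+m" (*ptr)
		:
		: "cc");
	return ret;
}

#ifdef CONFIG_X86_PAE
/* A 64-on-32 pte must be fetched in one shot... */
#define get_pte_atomic(ptep) \
	__pte(get_64bit((volatile unsigned long long *)(ptep)))
#else
/* ...whereas a one-word pte read is naturally atomic. */
#define get_pte_atomic(ptep)	(*(ptep))
#endif

Then handle_pte_fault would do "entry = get_pte_atomic(pte)" rather
than "entry = *pte" on the 64-on-32 configurations.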