From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH/RFC 1/14] Reclaim Scalability: Convert anon_vma lock to read/write lock
From: Lee Schermerhorn
In-Reply-To: <20070918114142.abbd5421.kamezawa.hiroyu@jp.fujitsu.com>
References: <20070914205359.6536.98017.sendpatchset@localhost>
	<20070914205405.6536.37532.sendpatchset@localhost>
	<20070917110234.GF25706@skynet.ie>
	<20070918114142.abbd5421.kamezawa.hiroyu@jp.fujitsu.com>
Content-Type: text/plain
Date: Tue, 18 Sep 2007 11:37:23 -0400
Message-Id: <1190129844.5035.26.camel@localhost>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
Return-Path: 
To: KAMEZAWA Hiroyuki
Cc: Mel Gorman, linux-mm@kvack.org, akpm@linux-foundation.org,
	clameter@sgi.com, riel@redhat.com, balbir@linux.vnet.ibm.com,
	andrea@suse.de, a.p.zijlstra@chello.nl, eric.whitney@hp.com,
	npiggin@suse.de
List-ID: 

On Tue, 2007-09-18 at 11:41 +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 17 Sep 2007 12:02:35 +0100
> mel@skynet.ie (Mel Gorman) wrote:
> 
> > On (14/09/07 16:54), Lee Schermerhorn didst pronounce:
> > > [PATCH/RFC] 01/14 Reclaim Scalability: Convert anon_vma list lock to a read/write lock
> > >
> > > Against 2.6.23-rc4-mm1
> > >
> > > Make the anon_vma list lock a read/write lock.  Heaviest use of this
> > > lock is in the page_referenced()/try_to_unmap() calls from vmscan
> > > [shrink_page_list()].  These functions can use a read lock to allow
> > > some parallelism for different cpus trying to reclaim pages mapped
> > > via the same set of vmas.
> >
> > In light of what Peter and Linus said about rw-locks being more expensive
> > than spinlocks, we'll need to measure this with some benchmark.  The plus
> > side is that this patch can be handled in isolation because it's either a
> > scalability fix or it isn't.  It's worth investigating because you say it
> > fixed a real problem where under load the job was able to complete with
> > this patch and live-locked without it.
> >
> > When you decide on a test case, I can test just this patch and see what
> > results I find.
> 
> One of the cases I can imagine is:
> ==
> 1. Use NUMA.
> 2. Create a *large* anon_vma and use it with MPOL_INTERLEAVE.
> 3. When memory is exhausted (on several nodes), all the kswapds on
>    those nodes will see one anon_vma->lock.
> ==
> Maybe the worst case.

Actually, if you only have one mm/vma mapping the area, it won't be
that bad.  You'll still have contention on the spinlock, but with only
one vma mapping it, page_referenced_anon() and try_to_unmap_anon()
will be relatively fast.

The problem we've seen is when you have lots [10s of thousands] of
vmas referencing the same anon_vma.  This occurs when the tasks are
all descendants of a single original parent w/o exec()ing.  I've only
seen this with the AIM7 benchmark.  This is the workload that was able
to make progress with this patch, but the system hung indefinitely
without it.

But AIM7 is not necessarily something we want to optimize for.
However, I've been told that Apache can exhibit similar behavior with
thousands of incoming connections.  Anyone know if this is true?

I HAVE seen similar behavior in a high-user-count Oracle OLTP
workload, but on the file rmap--the i_mmap_lock--in
page_referenced_file(), etc.  That's probably worth fixing.
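For anyone reading along without the patches handy, the shape of the
conversion is roughly the following.  It's a simplified sketch against
the 2.6.23-era structures, not the actual patch; the *_sketch names
are mine, standing in for struct anon_vma and struct vm_area_struct:

#include <linux/spinlock.h>	/* rwlock_t, read_lock(), write_lock() */
#include <linux/list.h>

struct anon_vma_sketch {
	rwlock_t lock;			/* was: spinlock_t lock;
					 * set up with rwlock_init() */
	struct list_head head;		/* vmas mapping this anon region */
};

struct vma_sketch {
	struct list_head anon_vma_node;	/* same link field the real vma has */
};

/*
 * Reader side -- the page_referenced_anon()/try_to_unmap_anon() style
 * walk.  Multiple CPUs reclaiming pages mapped by the same anon_vma
 * can now make this walk concurrently.
 */
static int anon_vma_walk(struct anon_vma_sketch *av)
{
	struct vma_sketch *vma;
	int referenced = 0;

	read_lock(&av->lock);
	list_for_each_entry(vma, &av->head, anon_vma_node)
		referenced++;		/* stand-in for page_referenced_one() */
	read_unlock(&av->lock);
	return referenced;
}

/*
 * Writer side -- linking/unlinking a vma [fork(), mmap(), exit()]
 * still excludes readers and other writers.
 */
static void anon_vma_link_sketch(struct anon_vma_sketch *av,
				 struct vma_sketch *vma)
{
	write_lock(&av->lock);
	list_add_tail(&vma->anon_vma_node, &av->head);
	write_unlock(&av->lock);
}

The read side doesn't make the walk over 10s of thousands of vmas any
shorter, but it does let the kswapds and direct reclaimers make that
walk in parallel instead of serializing on the lock.  The i_mmap_lock
case has the same shape, modulo the truncate_count issue below.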
I'm in the process of running a series of parallel kernel builds on
kernels without the rmap rw_lock patches, and then with each one
individually.  This load doesn't exhibit the problem these patches are
intended to address, but kernel builds do fork a lot of children,
which should result in a lot of vma linking/unlinking [but maybe not,
if they use vfork()].

Similar for the i_mmap_lock.  That lock is also used to protect the
truncate_count, so it must be taken for write in that context--mostly
in unmap_mapping_range*().

I'll post the results for both ia64 and x86_64 as soon as I have
them.  Later today.

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org