From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Subject: Re: [PATCH v2] vmscan: limit concurrent reclaimers in shrink_zone
Date: Mon, 14 Dec 2009 21:22:24 +0900 (JST)
Message-Id: <20091214210823.BBAE.A69D9226@jp.fujitsu.com>
In-Reply-To: <20091211164651.036f5340@annuminas.surriel.com>
References: <20091211164651.036f5340@annuminas.surriel.com>
To: Rik van Riel
Cc: kosaki.motohiro@jp.fujitsu.com, lwoodman@redhat.com, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, minchan.kim@gmail.com

> Under very heavy multi-process workloads, like AIM7, the VM can
> get into trouble in a variety of ways. The trouble starts when
> there are hundreds, or even thousands of processes active in the
> page reclaim code.
>
> Not only can the system suffer enormous slowdowns because of
> lock contention (and conditional reschedules) between thousands
> of processes in the page reclaim code, but each process will try
> to free up to SWAP_CLUSTER_MAX pages, even when the system already
> has lots of memory free.
>
> It should be possible to avoid both of those issues at once, by
> simply limiting how many processes are active in the page reclaim
> code simultaneously.
>
> If too many processes are active doing page reclaim in one zone,
> simply go to sleep in shrink_zone().
>
> On wakeup, check whether enough memory has been freed already
> before jumping into the page reclaim code ourselves. We want
> to use the same threshold here that is used in the page allocator
> for deciding whether or not to call the page reclaim code in the
> first place, otherwise some unlucky processes could end up freeing
> memory for the rest of the system.

This patch seems very similar to my old effort. AFAIK, there are two
additional benefits:

1. Improved resource guarantee
   If thousands of tasks start vmscan at the same time, they can eat up
   all memory through PF_MEMALLOC allocations, which might cause another
   dangerous problem: some filesystems and I/O devices don't handle
   allocation failure properly.

2. Improved behavior under OOM conditions
   Currently, vmscan doesn't check for SIGKILL at all, so under heavy
   memory pressure the OOM killer can't kill its target process quickly.
   That can make the OOM killer go on to kill the next, innocent process.
   This patch can fix that. (A rough sketch of the kind of check I mean
   follows below.)
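For illustration only, here is a minimal sketch of the SIGKILL check that
point 2 argues for. It is not part of the posted patch; the helper name and
its placement in the reclaim path are my assumptions.

#include <linux/sched.h>

/*
 * Hypothetical sketch, not from the posted patch: a task that has
 * already received SIGKILL (e.g. from the OOM killer) should leave
 * the reclaim path early, so it can exit and free its memory instead
 * of continuing to scan.
 */
static inline bool reclaimer_was_killed(void)
{
	/* fatal_signal_pending() becomes true once SIGKILL is queued. */
	return fatal_signal_pending(current);
}

Such a check would sit near the top of shrink_zone() (or of the wait loop the
patch adds), with the caller returning early when it is true.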
> @@ -1600,6 +1612,31 @@ static void shrink_zone(int priority, struct zone *zone,
>  	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
>  	int noswap = 0;
>
> +	if (!current_is_kswapd() && atomic_read(&zone->concurrent_reclaimers) >
> +				max_zone_concurrent_reclaimers &&
> +			(sc->gfp_mask & (__GFP_IO|__GFP_FS)) ==
> +				(__GFP_IO|__GFP_FS)) {
> +		/*
> +		 * Do not add to the lock contention if this zone has
> +		 * enough processes doing page reclaim already, since
> +		 * we would just make things slower.
> +		 */
> +		sleep_on(&zone->reclaim_wait);

Oops, this is a bug: sleep_on() is not SMP safe (a wakeup issued between
testing the condition and going to sleep is simply lost). I made a few fixup
patches today; I'll post them soon. A sketch of the usual race-free waiting
idiom is appended at the end of this mail.

BTW, the following is a measurement result from hackbench.

================================
unit: sec

parameter                 old        new
130 (5200 processes)      5.463      4.442
140 (5600 processes)    479.357      7.792
150 (6000 processes)    729.640     20.529
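For reference, a minimal sketch of the standard prepare_to_wait()/finish_wait()
pattern that avoids the lost-wakeup race of sleep_on(). This is NOT the fixup
patch mentioned above, only an illustration of the idiom: zone->reclaim_wait,
zone->concurrent_reclaimers and max_zone_concurrent_reclaimers are taken from
the quoted patch (they are not in mainline), and reclaim_done() is a
hypothetical helper standing in for the "enough memory was already freed"
watermark check done on wakeup.

#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/mmzone.h>

extern int max_zone_concurrent_reclaimers;	/* from the quoted patch */
static bool reclaim_done(struct zone *zone);	/* hypothetical watermark check */

static void wait_for_reclaim_slot(struct zone *zone)
{
	DEFINE_WAIT(wait);

	for (;;) {
		/*
		 * Get on the waitqueue *before* testing the condition, so a
		 * wake_up() issued in between cannot be lost -- this is the
		 * race that plain sleep_on() has.
		 */
		prepare_to_wait(&zone->reclaim_wait, &wait, TASK_UNINTERRUPTIBLE);

		if (atomic_read(&zone->concurrent_reclaimers) <=
					max_zone_concurrent_reclaimers ||
		    reclaim_done(zone))
			break;

		schedule();
	}
	finish_wait(&zone->reclaim_wait, &wait);
}

A reclaimer leaving shrink_zone() would then decrement
zone->concurrent_reclaimers and wake_up(&zone->reclaim_wait), so waiters
recheck the condition before proceeding.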