Date: Thu, 8 Jan 2009 13:21:41 +0900
From: KAMEZAWA Hiroyuki
Subject: Re: [RFC][PATCH 0/4] Memory controller soft limit patches
Message-Id: <20090108132141.30bc3ce2.kamezawa.hiroyu@jp.fujitsu.com>
In-Reply-To: <20090108035930.GB7294@balbir.in.ibm.com>
References: <20090107184110.18062.41459.sendpatchset@localhost.localdomain>
 <20090108093040.22d5f281.kamezawa.hiroyu@jp.fujitsu.com>
 <20090108035930.GB7294@balbir.in.ibm.com>
Sender: owner-linux-mm@kvack.org
To: balbir@linux.vnet.ibm.com
Cc: Andrew Morton, Sudhir Kumar, YAMAMOTO Takashi, Paul Menage,
 lizf@cn.fujitsu.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 David Rientjes, Pavel Emelianov, riel@redhat.com,
 "kosaki.motohiro@jp.fujitsu.com"

On Thu, 8 Jan 2009 09:29:30 +0530
Balbir Singh wrote:

> * KAMEZAWA
Hiroyuki [2009-01-08 09:30:40]:
>
> > On Thu, 08 Jan 2009 00:11:10 +0530
> > Balbir Singh wrote:
> >
> > >
> > > Here is v1 of the new soft limit implementation. Soft limits is a new feature
> > > for the memory resource controller, something similar has existed in the
> > > group scheduler in the form of shares. We'll compare shares and soft limits
> > > below. I've had soft limit implementations earlier, but I've discarded those
> > > approaches in favour of this one.
> > >
> > > Soft limits are the most useful feature to have for environments where
> > > the administrator wants to overcommit the system, such that only on memory
> > > contention do the limits become active. The current soft limits implementation
> > > provides a soft_limit_in_bytes interface for the memory controller and not
> > > for memory+swap controller. The implementation maintains an RB-Tree of groups
> > > that exceed their soft limit and starts reclaiming from the group that
> > > exceeds this limit by the maximum amount.
> > >
> > > This is an RFC implementation and is not meant for inclusion
> > >
> > Core implemantation seems simple and the feature sounds good.
>
> Thanks!
>
> > But, before reviewing into details, 3 points.
> >
> > 1. please fix current bugs on hierarchy management, before new feature.
> > AFAIK, OOM-Kill under hierarchy is broken. (I have patches but waits for
> > merge window close.)
>
> I've not hit the OOM-kill issue under hierarchy so far, is the OOM
> killer selecting a bad task to kill? I'll debug/reproduce the issue.
> I am not posting these patches for inclusion, fixing bugs is
> definitely the highest priority.
>

Assume the following hierarchy.

  group_A/       limit=100M  usage=1M
    group_01/    no limit    usage=1M
    group_02/    no limit    usage=98M (leaks memory)

  Q. What happens when a task in group_02 causes OOM?
  A. A task in group_A dies.

That is my problem. (As I said, I'll post a patch.)
This is my homework for a month. (I'll use CSS_ID to fix this.)
And this allows my check "Is this OOM from memcg?" to be skipped,
which makes the system panic if vm.panic_on_oom==1.

> > I wonder there will be some others. Lockdep error which Nishimura reported
> > are all fixed now ?
>
> I run all my kernels and tests with lockdep enabled, I did not see any
> lockdep errors showing up.
>
ok.

> >
> > 2. You inserts reclaim-by-soft-limit into alloc_pages(). But, to do this,
> > you have to pass zonelist to try_to_free_mem_cgroup_pages() and have to modify
> > try_to_free_mem_cgroup_pages().
> > 2-a) If not, when the memory request is for gfp_mask==GFP_DMA or allocation
> > is under a cpuset, memory reclaim will not work correctlly.
>
> The idea behind adding the code in alloc_pages() is to detect
> contention and trim mem cgroups down, if they have grown beyond their
> soft limit
>
Letting the usual direct reclaim go on and just waking up
"balance_soft_limit_daemon()" would be enough.

> > 2-b) try_to_free_mem_cgroup_pages() cannot do good work for order > 1 allocation.
> >
> > Please try fake-numa (or real NUMA machine) and cpuset.
>
> Yes, order > 1 is documented in the patch and you can see the code as
> well. Your suggestion is to look at the gfp_mask as well, I'll do
> that.
>
And at zonelist/nodemask. The generic try_to_free_pages() doesn't take a
nodemask as an argument, but it does check the cpuset, in shrink_zones():
==
		/*
		 * Take care memory controller reclaiming has small influence
		 * to global LRU.
		 */
		if (scan_global_lru(sc)) {
			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
				continue;
			note_zone_scanning_priority(zone, priority);

			if (zone_is_all_unreclaimable(zone) &&
						priority != DEF_PRIORITY)
				continue;	/* Let kswapd poll it */
			sc->all_unreclaimable = 0;
		} else {
			/*
			 * Ignore cpuset limitation here. We just want to reduce
			 * # of used pages by us regardless of memory shortage.
			 */
			sc->all_unreclaimable = 0;
			mem_cgroup_note_reclaim_priority(sc->mem_cgroup,
							priority);
		}
==
This is because "reclaim by memcg" can happen even when there is enough
memory: try_to_free_mem_cgroup_pages() is called when a group hits its limit.

So, some things will need to be improved if you want to use
try_to_free_mem_cgroup_pages() to recover from "memory shortage".
I think the above is one such issue. Some other assumptions will break as well.

-Kame
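
The group_A / group_01 / group_02 case from point 1 can be sketched as a
small userspace model. This is only an illustration, not kernel code: the
struct and function names below are hypothetical, and it assumes the problem
is that the victim is chosen only among tasks attached directly to the group
whose limit was hit, instead of among all tasks under that group. The mail
does not spell out the mechanism, so treat that as an assumption.

```c
#include <stddef.h>

/* Toy model of a memcg hierarchy. None of these names are kernel APIs. */
struct group {
	const char *name;
	const struct group *parent;	/* NULL for the top of the hierarchy */
};

struct task {
	const char *comm;
	const struct group *grp;	/* group the task is attached to */
	unsigned long usage_mb;		/* memory charged by this task, in MB */
};

/* Return 1 if g is root itself or one of its descendants, else 0. */
int group_under(const struct group *g, const struct group *root)
{
	for (; g != NULL; g = g->parent)
		if (g == root)
			return 1;
	return 0;
}

/*
 * Pick the candidate task with the largest usage (first one wins on a tie).
 *
 * hierarchical == 0: only tasks attached directly to 'over' are candidates
 *                    (the assumed broken behaviour).
 * hierarchical != 0: every task anywhere under 'over' is a candidate.
 *
 * Returns NULL if there is no candidate.
 */
const struct task *select_victim(const struct task *tasks, size_t n,
				 const struct group *over, int hierarchical)
{
	const struct task *victim = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		const struct task *t = &tasks[i];
		int candidate = hierarchical ? group_under(t->grp, over)
					     : (t->grp == over);

		if (!candidate)
			continue;
		if (victim == NULL || t->usage_mb > victim->usage_mb)
			victim = t;
	}
	return victim;
}
```

With one task per group using 1M, 1M and 98M as in the example, calling
select_victim() on group_A with hierarchical == 0 returns the 1M task in
group_A, while hierarchical != 0 returns the leaking task in group_02.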