Message-ID: <99f2a13990d68c34c76c33581949aefd.squirrel@webmail-b.css.fujitsu.com>
In-Reply-To: <20090808060531.GL9686@balbir.in.ibm.com>
References: <20090807221238.GJ9686@balbir.in.ibm.com>
	<39eafe409b85053081e9c6826005bb06.squirrel@webmail-b.css.fujitsu.com>
	<20090808060531.GL9686@balbir.in.ibm.com>
Date: Sat, 8 Aug 2009 16:38:46 +0900 (JST)
Subject: Re: Help Resource Counters Scale Better (v2)
From: "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>
To: balbir@linux.vnet.ibm.com
Cc: KAMEZAWA Hiroyuki, Andrew Morton, andi.kleen@intel.com,
	Prarit Bhargava, KOSAKI Motohiro, lizf@cn.fujitsu.com,
	menage@google.com, Pavel Emelianov, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org

Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-08-08
> 10:11:40]:
>
>> Balbir Singh wrote:
>> > static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
>> > {
>> > -	if (cnt->usage < cnt->limit)
>> > +	unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
>> > +	if (usage < cnt->limit)
>> > 		return true;
>> >
>>
>> Hmm. In memcg, this function is not used on a busy path; it is used on
>> an important path, to check that usage is under the limit (and whether
>> to continue reclaim).
>>
>> Can't we add res_counter_check_locked_exact(), which uses
>> percpu_counter_sum(), later?
>
> We can, but I want to do it in parts, once I add the policy for
> strict/non-strict checking. It is on my mind, but I want to work on
> the overhead, since I've heard from many people that we need to
> resolve this first.
>
ok.

>> > 	spin_lock_irqsave(&cnt->lock, flags);
>> > -	if (cnt->usage <= limit) {
>> > +	if (usage <= limit) {
>> > 		cnt->limit = limit;
>> > 		ret = 0;
>> > 	}
>>
>> For the same reason as with the limit check, I want the exact number
>> here. percpu_counter_sum() is better.
>
> I'll add that when we do strict accounting. Are you suggesting that
> resource_counter_set_limit should use strict accounting?

Yes, I think so.

...and I'd like to add "mem_cgroup_reduce_usage" or some such call to
do reclaim-on-demand, later. I think it's ok to add error tolerance to
memcg, but I want some interface to do a "sync", especially when we
measure the size of the working set. I like your current direction of
achieving better performance, but I wonder how users can see
synchronous numbers without tolerance; high-end users will need that.
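For reference, the "exact" variant I have in mind would look something
like the sketch below. This is only an illustration, not code from the
patch; the name is arbitrary, and it assumes cnt->usage is the
percpu_counter from your patch. percpu_counter_sum() folds in every
cpu's local delta under the counter's internal lock, so it returns the
precise total that percpu_counter_read_positive() may under-report:

	static inline bool
	res_counter_limit_check_locked_exact(struct res_counter *cnt)
	{
		/* fold in all per-cpu deltas for an exact total */
		u64 usage = percpu_counter_sum(&cnt->usage);

		return usage < cnt->limit;
	}

The same call could give the exact number in resource_counter_set_limit.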
>> > 		goto undo;
>> > @@ -68,9 +76,7 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
>> > 	goto done;
>> > undo:
>> > 	for (u = counter; u != c; u = u->parent) {
>> > -		spin_lock(&u->lock);
>> > 		res_counter_uncharge_locked(u, val);
>> > -		spin_unlock(&u->lock);
>> > 	}
>> > done:
>>
>> When using the hierarchy, the tolerance at the root node will be
>> bigger. Please note this in the documentation.
>
> No.. I don't think so..
>
> Irrespective of hierarchy, we do the following
>
> 1. Add; if the sum reaches the batch count, we sum and save
>
> I don't think hierarchy should affect it.. no?
>
Hmm, maybe I'm misunderstanding. Let me brainstorm... In the following
hierarchy,

	A/01
	 /02
	 /03/X
	    /Y
	    /Z

the sum of the tolerances of X+Y+Z is limited by the tolerance of 03,
and the sum of the tolerances of 01+02+03 is limited by the tolerance
of A. Ah, ok. I'm wrong. Hmm...

>> > 	local_irq_restore(flags);
>> > @@ -79,10 +85,13 @@ done:
>> >
>> > void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
>> > {
>> > -	if (WARN_ON(counter->usage < val))
>> > -		val = counter->usage;
>> > +	unsigned long long usage;
>> > +
>> > +	usage = percpu_counter_read_positive(&counter->usage);
>> > +	if (WARN_ON((usage + counter->usage_tolerance * nr_cpu_ids) < val))
>> > +		val = usage;
>>
>> Is this correct? (Or do we need this WARN_ON at all?)
>> Hmm. percpu_counter is cpu-hotplug aware, so nr_cpu_ids is not the
>> correct bound, but num_online_cpus() is heavy.. hmm.
>
> OK.. so the deal is, even though it is aware, the batch count is a
> heuristic and I don't want to do heavy math in it. nr_cpu_ids is
> larger, but also lightweight in terms of computation.
>
Yes... I wonder whether there is a _variable_ that gives the number of
online cpus without a bitmap scan...

>> > +/*
>> > + * To help resource counters scale, we take a step back
>> > + * and allow the counters to be scalable and set a
>> > + * batch value such that every addition does not cause
>> > + * global synchronization. The side-effect will be visible
>> > + * on limit enforcement, where due to this fuzziness,
>> > + * we will lose out on enforcing a limit when the usage
>> > + * exceeds the limit. The plan however in the long run
>> > + * is to allow this value to be controlled. We will
>> > + * probably add a new control file for it.
>> > + */
>> > +#define MEM_CGROUP_RES_ERR_TOLERANCE (4 * PAGE_SIZE)
>>
>> Considering the percpu counter's extra overhead, this number is too
>> small, IMO.
>
> OK.. the reason I kept it that way is because on ppc64 PAGE_SIZE is
> now 64k. Maybe we should pick a standard size like 64k and stick with
> it. What do you think?
>
I think 64k is reasonable as long as there is no monster machine with
4096 cpus... but even with 4096 cpus, 64k * 4096 = 256M... which is
still a small amount for a monster machine.

Hmm... I think you can add CONFIG_MEMCG_PCPU_TOLERANCE and set the
default value to 64k. (Of course, you can do this in another patch.)

	On a laptop/desktop with 4 cpus:	  4 * 64k = 256k
	On a volume-zone server with 8-32 cpus:	 32 * 64k = 2M
	On a high-end 64-256 cpu machine today:	256 * 64k = 16M

Maybe not so bad. I'm not sure how many 1024-cpu machines will be in
use over the next ten years..

I want a percpu counter with flexible batching to minimize the
tolerance. That will be my homework.

Thanks,
-Kame
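P.S. By "flexible batching" I mean something like the sketch below.
This is a rough idea of mine, not part of your patch:
__percpu_counter_add() already takes an explicit batch argument, so
each res_counter could derive its batch from its own usage_tolerance
(assuming that field fits in an s32) instead of using the global
percpu_counter_batch. The function name is hypothetical.

	static void res_counter_charge_batched(struct res_counter *cnt,
					       s64 amount)
	{
		/* per-cpu drift stays bounded by the configured
		 * tolerance rather than by percpu_counter_batch */
		s32 batch = cnt->usage_tolerance;

		__percpu_counter_add(&cnt->usage, amount, batch);
	}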