From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail6.bemta8.messagelabs.com (mail6.bemta8.messagelabs.com [216.82.243.55])
    by kanga.kvack.org (Postfix) with ESMTP id CD3806B002C
    for ; Mon, 17 Oct 2011 04:56:41 -0400 (EDT)
Message-ID: <4E9BEDA9.6000908@parallels.com>
Date: Mon, 17 Oct 2011 12:56:09 +0400
From: Glauber Costa
MIME-Version: 1.0
Subject: Re: [RFC] [PATCH 4/4] memcg: Document kernel memory accounting.
References: <1318639110-27714-1-git-send-email-ssouhlal@FreeBSD.org>
    <1318639110-27714-2-git-send-email-ssouhlal@FreeBSD.org>
    <1318639110-27714-3-git-send-email-ssouhlal@FreeBSD.org>
    <1318639110-27714-4-git-send-email-ssouhlal@FreeBSD.org>
    <1318639110-27714-5-git-send-email-ssouhlal@FreeBSD.org>
In-Reply-To: <1318639110-27714-5-git-send-email-ssouhlal@FreeBSD.org>
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Suleiman Souhlal
Cc: gthelen@google.com, yinghan@google.com, kamezawa.hiroyu@jp.fujitsu.com,
    jbottomley@parallels.com, suleiman@google.com, linux-mm@kvack.org

On 10/15/2011 04:38 AM, Suleiman Souhlal wrote:
> Signed-off-by: Suleiman Souhlal
> ---
>  Documentation/cgroups/memory.txt | 33 ++++++++++++++++++++++++++++++++-
>  1 files changed, 32 insertions(+), 1 deletions(-)
>
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index 06eb6d9..277cf25 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -220,7 +220,37 @@ caches are dropped. But as mentioned above, global LRU can do swapout memory
>  from it for sanity of the system's memory management state. You can't forbid
>  it by cgroup.
>
> -2.5 Reclaim
> +2.5 Kernel Memory
> +
> +A cgroup's kernel memory is accounted into its memory.usage_in_bytes and
> +is also shown in memory.stat as kernel_memory. Kernel memory does not get
> +counted towards the root cgroup's memory.usage_in_bytes, but still
> +appears in its kernel_memory.
> +
> +Upon cgroup deletion, all the remaining kernel memory gets moved to the
> +root cgroup.
> +
> +An accounted kernel memory allocation may trigger reclaim in that cgroup,
> +and may also OOM.
> +
> +Currently only slab memory allocated without __GFP_NOACCOUNT and
> +__GFP_NOFAIL gets accounted to the current process' cgroup.
> +
> +2.5.1 Slab
> +
> +Slab gets accounted on a per-page basis, which is done by using per-cgroup
> +kmem_caches. These per-cgroup kmem_caches get created on-demand, the first
> +time a specific kmem_cache gets used by a cgroup.

Well, let me first start with some general comments:

I think the approach I've taken, which is to allow the cache creators to
register themselves for cgroup usage, is better than scanning the list
of existing caches. A couple of key reasons:

1) We then don't need another flag: __GFP_NOACCOUNT becomes "do
   nothing" (a cache that never registers is never accounted).
2) Less pollution in the slab structure itself, which gives it a higher
   chance of inclusion, and less duplicated work in SLUB.
3) Easier to do per-cache tuning if we ever want to.

About on-demand creation, I think it is a nice idea. But it may impact
allocation latency on caches that we are sure will be used, like the
dentry cache. So that gives us:

4) If the cache creator is registering itself, it can specify which
   behavior it wants: on-demand creation vs. straight creation.

> +Slab memory that cannot be attributed to a cgroup gets charged to the root
> +cgroup.
> +
> +A per-cgroup kmem_cache is named like the original, with the cgroup's name
> +in parethesis.

I used the address for simplicity, but I like names better. Agree here.
Extending it: if a task resides in the cgroup itself, I think it should
see only its own caches in /proc/slabinfo (selectable; take a look at
https://lkml.org/lkml/2011/10/6/132 for more details).

> +When a kmem_cache gets migrated to the root cgroup, "dead" is appended to
> +its name, to indicated that it is not going to be used for new allocations.

Why not just remove it?

> +2.6 Reclaim
>
>  Each cgroup maintains a per cgroup LRU which has the same structure as
>  global VM. When a cgroup goes over its limit, we first try
> @@ -396,6 +426,7 @@ active_anon - # of bytes of anonymous and swap cache memory on active
>  inactive_file - # of bytes of file-backed memory on inactive LRU list.
>  active_file - # of bytes of file-backed memory on active LRU list.
>  unevictable - # of bytes of memory that cannot be reclaimed (mlocked etc).
> +kernel_memory - # of bytes of kernel memory.
>
>  # status considering hierarchy (see memory.use_hierarchy settings)
>

A few other points:

* I think using res_counters is better than relying on slab fields to
  impose limits.
* We still need the ability to restrict kernel memory usage separately
  from user memory, dependent on a selectable, as we already discussed
  here.
* I think we should do everything in our power to reduce overhead for
  the special case in which only the root cgroup exists. Take a look at
  what happened in the following thread:
  https://lkml.org/lkml/2011/10/13/201.
  To be honest, I think there is an idea we should at least consider:
  not accounting *anything* to the root cgroup, neither user memory nor
  kernel memory (make it selectable if we want to preserve the current
  behaviour). Then we can keep native performance for non-cgroup users.
  (But that's another discussion anyway.)

All in all, this is a good start. Both our approaches have a lot in
common (which is not strange, given that we discussed them a lot over
the past month =p, and I did like some of the concepts).
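To make points 1) and 4) above a bit more concrete, here is a rough userspace sketch of what "cache creators register themselves and pick a creation policy" could look like. It is only an illustration under stated assumptions: every name in it (sketch_register_cache, sketch_cgroup_created, sketch_cache_get, the policy enum, the fixed-size tables) is hypothetical and comes from neither patch set nor from the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical creation policies a cache creator could ask for. */
enum sketch_cache_policy {
    SKETCH_CACHE_ON_DEMAND, /* per-cgroup copy created on first use */
    SKETCH_CACHE_STRAIGHT   /* per-cgroup copy created with the cgroup */
};

#define SKETCH_MAX_CACHES  16
#define SKETCH_MAX_CGROUPS 4

/* One entry per cache whose creator opted in to cgroup accounting. */
struct sketch_cache {
    const char *name;
    enum sketch_cache_policy policy;
    int created[SKETCH_MAX_CGROUPS]; /* 1 once cgroup i has its own copy */
};

static struct sketch_cache sketch_registry[SKETCH_MAX_CACHES];
static size_t sketch_nr_caches;

/*
 * Called by the cache creator. A cache that never calls this is simply
 * never accounted, so no "do not account" flag is needed.
 * Returns NULL when the (assumed, fixed-size) registry is full.
 */
struct sketch_cache *sketch_register_cache(const char *name,
                                           enum sketch_cache_policy policy)
{
    struct sketch_cache *c;

    if (sketch_nr_caches == SKETCH_MAX_CACHES)
        return NULL;
    c = &sketch_registry[sketch_nr_caches++];
    c->name = name;
    c->policy = policy;
    memset(c->created, 0, sizeof(c->created));
    return c;
}

/* Called when cgroup 'cg' is set up: only "straight" caches are created. */
void sketch_cgroup_created(int cg)
{
    size_t i;

    assert(cg >= 0 && cg < SKETCH_MAX_CGROUPS);
    for (i = 0; i < sketch_nr_caches; i++)
        if (sketch_registry[i].policy == SKETCH_CACHE_STRAIGHT)
            sketch_registry[i].created[cg] = 1;
}

/*
 * Called on the allocation path. Returns 1 if the per-cgroup copy had
 * to be created by this call (the latency cost of on-demand creation),
 * 0 if it already existed.
 */
int sketch_cache_get(struct sketch_cache *c, int cg)
{
    assert(c != NULL);
    assert(cg >= 0 && cg < SKETCH_MAX_CGROUPS);
    if (c->created[cg])
        return 0;
    c->created[cg] = 1;
    return 1;
}
```

In this model a cache we know will be used (dentry, say) would register as "straight" and never pay the creation cost at allocation time, while a rarely used cache would register as "on-demand" and pay it once per cgroup, on first use.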