linux-mm.kvack.org archive mirror
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
To: Glauber Costa <glommer@parallels.com>
Cc: Suleiman Souhlal <ssouhlal@FreeBSD.org>,
	cgroups@vger.kernel.org, suleiman@google.com, penberg@kernel.org,
	cl@linux.com, yinghan@google.com, hughd@google.com,
	gthelen@google.com, peterz@infradead.org,
	dan.magenheimer@oracle.com, hannes@cmpxchg.org, mgorman@suse.de,
	James.Bottomley@HansenPartnership.com, linux-mm@kvack.org,
	devel@openvz.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 02/13] memcg: Kernel memory accounting infrastructure.
Date: Wed, 14 Mar 2012 09:15:26 +0900
Message-ID: <20120314091526.3c079693.kamezawa.hiroyu@jp.fujitsu.com>
In-Reply-To: <4F5F236A.1070609@parallels.com>

On Tue, 13 Mar 2012 14:37:30 +0400
Glauber Costa <glommer@parallels.com> wrote:

> > After looking at the code, I think we need to consider
> > whether independent_kmem_limit is a good idea or not....
> >
> > How about adding a MEMCG_KMEM_ACCOUNT flag instead of this and using only
> > memcg->res/memcg->memsw rather than adding a new counter, memcg->kmem ?
> >
> > if MEMCG_KMEM_ACCOUNT is set     ->  slab is accounted to mem->res/memsw.
> > if MEMCG_KMEM_ACCOUNT is not set ->  slab is never accounted.
> >
> > (I think an on/off switch is required..)
> >
> > Thanks,
> > -Kame
> >
> 
> This has been discussed before, I can probably find it in the archives 
> if you want to go back and see it.
> 

Yes. IIUC, we agreed to have an independent kmem limit. I just want to think
about it again because there are so many proposals that I'm getting confused.

As far as I can see, the ongoing work includes:
 - kmem limit (by 2 people)
 - hugetlb limit
 - per-lru locking (by 2 people)
 - page cgroup diet (by me, but stalled now)
 - dirty-ratio and writeback
 - Tejun's proposal to remove pre_destroy()
 - moving shared resources

I'm trying to work out what a simple plan and implementation would be.
Most of these series consist of 10+ patches...

Thank you for your help in clarifying.



> But in a nutshell:
> 
> 1) Supposing the independent knob disappears (I will explain in item 2 why I
> don't want it to), I don't think a flag makes sense either. *If* we are
> planning to enable/disable this, it might make more sense to put some 
> work on it, and allow particular slabs to be enabled/disabled by writing 
> to memory.kmem.slabinfo (-* would disable all, +* enable all, +kmalloc* 
> enable all kmalloc, etc).
> 
seems interesting.
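
A minimal user-space sketch of how the "+glob" / "-glob" rules described
above might be interpreted. This is only an illustration: the function name
apply_slab_rules, the cache names, and the "later rules win" ordering are
assumptions, not anything taken from the patch series.

```python
from fnmatch import fnmatchcase


def apply_slab_rules(cache_names, rules, enabled=None):
    """Return the set of cache names accounted after applying rules in order.

    Each rule is '+<glob>' (enable) or '-<glob>' (disable). Rules are applied
    left to right, so a later rule overrides an earlier one (an assumption).
    """
    enabled = set(enabled or ())
    for rule in rules:
        if len(rule) < 2 or rule[0] not in "+-":
            raise ValueError(f"bad rule: {rule!r}")
        # Collect every cache whose name matches the glob part of the rule.
        matched = {name for name in cache_names if fnmatchcase(name, rule[1:])}
        if rule[0] == "+":
            enabled |= matched
        else:
            enabled -= matched
    return enabled


# Hypothetical cache names, for illustration only.
caches = ["kmalloc-64", "kmalloc-128", "dentry", "inode_cache"]
print(sorted(apply_slab_rules(caches, ["-*", "+kmalloc*"])))
```

Here "-*" followed by "+kmalloc*" leaves only the kmalloc caches accounted.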

> Alternatively, what we could do instead is something similar to what
> ended up being done for tcp, by request of the network people: if you
> never touch the limit file, don't bother with it at all, and simply do
> not account. With Suleiman's lazy allocation infrastructure, that should
> actually be trivial. And then again, a flag is not necessary, because
> writing to the limit file does the job, and also conveys the meaning well
> enough.
> 

Hm.
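
A toy model of the "writing the limit file turns accounting on" idea quoted
above, as a sketch only. The class KmemAccounting and its methods are
hypothetical names; the real behaviour would depend on the lazy allocation
infrastructure mentioned in the quote.

```python
class KmemAccounting:
    """Toy model: kernel memory is accounted only after a limit is written."""

    def __init__(self):
        self.limit = None  # None means the limit file was never touched
        self.usage = 0

    def write_limit(self, limit_bytes):
        # Writing the limit implicitly switches accounting on; no flag needed.
        self.limit = limit_bytes

    def charge(self, nbytes):
        """Return True if the allocation may proceed."""
        if self.limit is None:
            return True  # not accounted at all, so nothing can fail
        if self.usage + nbytes > self.limit:
            return False  # over the limit: refuse the charge
        self.usage += nbytes
        return True


group = KmemAccounting()
group.charge(4096)       # ignored: limit never written
group.write_limit(8192)  # accounting starts here
group.charge(4096)       # now counted against the limit
```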

> 2) For the kernel itself, we are mostly concerned that a malicious
> container may pin large amounts of kernel memory, which is
> ultimately unreclaimable.

Yes. This is a big problem both for memcg and for the whole system.

In my experience, 2000 processes sharing a 10GB shared memory segment eat up
a lot of memory ;(



> In particular, with overcommit allowed 
> scenarios, you can fill the whole physical memory (or at least a 
> significant part) with those objects, well beyond your softlimit 
> allowance, making the creation of further containers impossible.
> With user memory, you can reclaim the cgroup back to its place. With 
> kernel memory, you can't.
> 
Agreed.

> In the particular example of 32-bit boxes, you can easily fill up a 
> large part of the available 1gb kernel memory with pinned memory and 
> render the whole system unresponsive.
> 
> Never allowing the kernel memory to go beyond the soft limit was one of 
> the proposed alternatives. However, it may force you to establish a soft
> limit where one was not previously needed. Or, establish a low soft 
> limit when you really need a bigger one.
> 
> All that said, while reading your message, thinking a bit, the following 
> crossed my mind:
> 
> - We can account the slabs to memcg->res normally, and just store the
>    information that this is kernel memory into a percpu counter, as
>    I proposed recently.

OK, then the user can see the amount of kernel memory.


> - The knob goes away, and becomes implicit: if you ever write anything
>    to memory.kmem.limit_in_bytes, we transfer that memory to a separate
>    kmem res_counter, and proceed from there. We can keep accounting to
>    memcg->res anyway, just that kernel memory will now have a separate
>    limit.

Okay, then,

	kmem_limit < memory.limit < memsw.limit

...seems reasonable to me.
This means the user can specify the 'ratio' of kmem within memory.limit.
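
A rough sketch of the scheme as I read it, under stated assumptions: kernel
memory is always charged to the main counter, and a separate kmem counter
only appears once a kmem limit is written, inheriting the usage tracked so
far. The names Counter, Memcg, kmem_stat and kmem_ratio are hypothetical,
memsw is left out for brevity, and the strict "<" check simply mirrors the
ordering above.

```python
class Counter:
    """Minimal stand-in for a res_counter: a usage value and a limit."""

    def __init__(self, limit):
        self.limit = limit
        self.usage = 0

    def try_charge(self, n):
        if self.usage + n > self.limit:
            return False
        self.usage += n
        return True

    def uncharge(self, n):
        self.usage -= n


class Memcg:
    def __init__(self, mem_limit):
        self.res = Counter(mem_limit)
        self.kmem = None    # created on the first write to the kmem limit
        self.kmem_stat = 0  # kernel bytes already charged to res

    def set_kmem_limit(self, limit):
        if limit >= self.res.limit:
            raise ValueError("kmem limit must be below the memory limit")
        if self.kmem is None:
            self.kmem = Counter(limit)
            self.kmem.usage = self.kmem_stat  # transfer existing usage
        else:
            self.kmem.limit = limit

    def charge_kmem(self, n):
        # Charge the separate kmem counter first, if it exists.
        if self.kmem is not None and not self.kmem.try_charge(n):
            return False
        # Kernel memory is still accounted to the main counter.
        if not self.res.try_charge(n):
            if self.kmem is not None:
                self.kmem.uncharge(n)  # roll back the partial charge
            return False
        self.kmem_stat += n
        return True

    def kmem_ratio(self):
        """Fraction of memory.limit usable by kmem, or None if unlimited."""
        return None if self.kmem is None else self.kmem.limit / self.res.limit
```

With a memory limit of 100 and a kmem limit of 40, kmem_ratio() gives 0.4,
which is the 'ratio' reading of the two limits.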

Some points are worth further consideration:

 - Can we show the amount of reclaimable kmem by some means ?
 - What happens when a new cgroup is created ?
 - Should we have a 'ratio' interface at the kernel level ?
 - What happens when a task is moved ?
 - Should we allow a per-slab accounting knob in /sys/kernel/slab/xxx
   or somewhere ?
 - Should we show per-memcg usage in /sys/kernel/slab/xxx ?
 - Should we have force_empty for kmem (as a last resort) ?

With any implementation, my concerns are
 - overhead/performance
 - unreclaimable kmem
 - kmem shared between cgroups


> - With this scheme, it may not be necessary to ever have a file
>    memory.kmem.soft_limit_in_bytes. Reclaim is always part of the normal
>    memcg reclaim.
> 
Good.

> What is outlined above would work for us, and make the whole scheme simpler,
> I believe.
> 
> What do you think ?

It sounds interesting to me.

Thanks,
-Kame
