From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
To: balbir@linux.vnet.ibm.com
Cc: "nishimura@mxp.nes.nec.co.jp" <nishimura@mxp.nes.nec.co.jp>,
"hugh@veritas.com" <hugh@veritas.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"menage@google.com" <menage@google.com>,
Andrew Morton <akpm@linux-foundation.org>
Subject: Re: memo: mem+swap controller
Date: Fri, 1 Aug 2008 12:45:24 +0900
Message-ID: <20080801124524.7dc947e7.kamezawa.hiroyu@jp.fujitsu.com>
In-Reply-To: <489280FE.2090203@linux.vnet.ibm.com>
On Fri, 01 Aug 2008 08:50:30 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> KAMEZAWA Hiroyuki wrote:
> > Hi, mem+swap controller is suggested by Hugh Dickins and I think it's a great
> > idea. Its concept is having 2 limits. (please point out if I misunderstand.)
> >
> > - memory.limit_in_bytes .... limit memory usage.
> > - memory.total_limit_in_bytes .... limit memory+swap usage.
> >
> > By this, we can avoid excessive use of swap under a cgroup without any bad effect
> > to global LRU. (in page selection algorithm...overhead will be added, of course)
> >
> > Following is state transition and counter handling design memo.
> > This uses "3" counters to handle above correctly. If you have other logic,
> > please teach me. (and blame me if my diagram is broken.)
> >
> > A point is how to handle swap-cache, I think.
> > (Maybe we need a _big_ change in memcg.)
> >
>
> Could you please describe the big change? What do you have in mind?
>
Replace res_counter with a new counter which handles
- 2 or 3 counters and
- 2 limits
at once.
> > ==
> >
> > state definition
> > new alloc .... an object is newly allocated
> > no_swap .... an object with page without swp_entry
> > swap_cache .... an object with page with swp_entry
> > disk_swap .... an object without page with swp_entry
> > freed .... an object is freed (by munmap)
> >
> > (*) an object is an entity which is accounted, page or swap.
> >
> >   new alloc -> no_swap  <=>  swap_cache  <=>  disk_swap
> >                   |              |                |
> >   freed. <---------<--------------<----------------
> >
> > use 3 counters, no_swap, swap_cache, disk_swap.
> >
> > on_memory = no_swap + swap_cache.
> > total = no_swap + swap_cache + disk_swap
> >
> > on_memory is limited by memory.limit_in_bytes
> > total is limited by memory.total_limit_in_bytes.
> >
> >                         no_swap  swap_cache  disk_swap  on_memory  total
> > new alloc->no_swap        +1          -          -         +1       +1
> > no_swap->swap_cache       -1         +1          -          -        -
> > swap_cache->no_swap       +1         -1          -          -        -
> > swap_cache->disk_swap      -         -1         +1         -1        -
> > disk_swap->swap_cache      -         +1         -1         +1        -
> > no_swap->freed            -1          -          -         -1       -1
> > swap_cache->freed          -         -1          -         -1       -1
> > disk_swap->freed           -          -         -1          -       -1
> >
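The table above can be modeled in a few lines of userspace C. This is
only a sketch under assumptions: the names (memsw_counters,
memsw_transition() and so on) are hypothetical and are not existing memcg
code, counting is per object rather than in bytes, and locking is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the memo's states; not kernel API. */
enum memsw_state { NEW_ALLOC, NO_SWAP, SWAP_CACHE, DISK_SWAP, FREED };

struct memsw_counters {
	long no_swap;		/* page without swp_entry */
	long swap_cache;	/* page with swp_entry */
	long disk_swap;		/* swp_entry without page */
};

/* on_memory = no_swap + swap_cache */
long memsw_on_memory(const struct memsw_counters *c)
{
	return c->no_swap + c->swap_cache;
}

/* total = no_swap + swap_cache + disk_swap */
long memsw_total(const struct memsw_counters *c)
{
	return c->no_swap + c->swap_cache + c->disk_swap;
}

/* Counter that backs a state; NULL for "new alloc" and "freed". */
static long *memsw_slot(struct memsw_counters *c, enum memsw_state s)
{
	switch (s) {
	case NO_SWAP:		return &c->no_swap;
	case SWAP_CACHE:	return &c->swap_cache;
	case DISK_SWAP:		return &c->disk_swap;
	default:		return NULL;
	}
}

/* Only the arrows drawn in the state diagram are allowed. */
static int memsw_allowed(enum memsw_state from, enum memsw_state to)
{
	switch (from) {
	case NEW_ALLOC:		return to == NO_SWAP;
	case NO_SWAP:		return to == SWAP_CACHE || to == FREED;
	case SWAP_CACHE:	return to == NO_SWAP || to == DISK_SWAP ||
				       to == FREED;
	case DISK_SWAP:		return to == SWAP_CACHE || to == FREED;
	default:		return 0;
	}
}

/* Move one object between states; 0 on success, -1 if not possible. */
int memsw_transition(struct memsw_counters *c,
		     enum memsw_state from, enum memsw_state to)
{
	long *src = memsw_slot(c, from);
	long *dst = memsw_slot(c, to);

	if (!memsw_allowed(from, to))
		return -1;
	if (src && *src == 0)
		return -1;	/* nothing accounted in the source state */
	if (src)
		(*src)--;
	if (dst)
		(*dst)++;
	assert(memsw_on_memory(c) <= memsw_total(c));
	return 0;
}
```

In this model, walking one object through new alloc -> no_swap ->
swap_cache -> disk_swap -> freed keeps total at 1 while the object lives,
drops on_memory to 0 once it is only on disk, and leaves every counter at
0 at the end.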
> >
> > any comments are welcome.
>
> What is the expected behaviour when we exceed memory.total_limit_in_bytes?
Just call try_to_free_mem_cgroup_pages(), as we do now.
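As a rough sketch of that behaviour, assuming a combined counter with two
limits: the charge path checks both limits at once and calls reclaim when
either one is exceeded. Everything here is hypothetical userspace C.
memsw_group, memsw_charge_page() and the two callbacks are made-up names,
the callback only stands in for try_to_free_mem_cgroup_pages(), units are
pages rather than bytes, and the retry count is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical combined counter: two usages, two limits. */
struct memsw_group {
	long on_memory;		/* no_swap + swap_cache */
	long total;		/* on_memory + disk_swap */
	long limit;		/* models memory.limit_in_bytes */
	long total_limit;	/* models memory.total_limit_in_bytes */
};

/*
 * Stand-in for try_to_free_mem_cgroup_pages(): returns the number of
 * pages of progress, 0 when nothing could be freed.
 */
typedef long (*memsw_reclaim_fn)(struct memsw_group *g);

#define MEMSW_RECLAIM_RETRIES 5	/* arbitrary value for this sketch */

/* Example callback: drops one in-memory page without swapping it out. */
long memsw_reclaim_drop_one(struct memsw_group *g)
{
	if (g->on_memory == 0)
		return 0;
	g->on_memory--;
	g->total--;
	return 1;
}

/* Example callback: reclaim makes no progress at all. */
long memsw_reclaim_none(struct memsw_group *g)
{
	(void)g;
	return 0;
}

/* Charge one new page; 0 on success, -1 if still over a limit. */
int memsw_charge_page(struct memsw_group *g, memsw_reclaim_fn reclaim)
{
	int retries = MEMSW_RECLAIM_RETRIES;

	for (;;) {
		/* Both limits are checked in one step. */
		if (g->on_memory + 1 <= g->limit &&
		    g->total + 1 <= g->total_limit) {
			g->on_memory++;
			g->total++;
			assert(g->on_memory <= g->total);
			return 0;
		}
		/* Over a limit: reclaim and retry, give up on no progress. */
		if (retries-- == 0 || reclaim == NULL || reclaim(g) == 0)
			return -1;
	}
}
```

Note that swapping a page out lowers on_memory but not total, so in this
model only freeing a page or a swap entry helps against the total limit.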
> Can't the memrlimit controller do what you ask for?
>
Never.
Example 1) Assume an HPC program which handles a very sparse, big matrix
and is designed so that each process handles just a part of it.
Example 2) When an admin tried to use vm.overcommit_memory, he asked the
owners of 30+ applications on his server, "Please let me know what amount
of (mmap) memory you'll use."
In the end, he couldn't get a good answer because of tons of Java and
other applications.
"Really used" pages/swap can be shown by accounting, and he can limit them
based on his experience. Only "really used" numbers can tell resource usage.
Anyway, one of the purposes we achieve with cgroups is server integration.
When integrating servers, the system admin cannot know how much mmap the
applications will use, because proprietary application software tends to
state a "very safe" value and cannot handle the -ENOMEM returned by mmap()
well.
(*) The size of mmap() can vary with the application's configuration
and workload.
"Real resource usage" tends to be set from an estimated value which the
system admin has measured. So it's easier to use than address space size
when we integrate several servers at once.
But for another purpose, limiting a user program which the user wrote
himself, memrlimit has enough meaning, I think. He can handle -ENOMEM.
Thanks,
-Kame