* memo: mem+swap controller
@ 2008-07-31 1:15 KAMEZAWA Hiroyuki
2008-07-31 6:18 ` KOSAKI Motohiro
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-07-31 1:15 UTC (permalink / raw)
To: nishimura; +Cc: kamezawa.hiroyu, balbir, hugh, linux-mm, menage, Andrew Morton
Hi, the mem+swap controller was suggested by Hugh Dickins, and I think it's a great
idea. Its concept is to have 2 limits (please point out if I misunderstand):
- memory.limit_in_bytes .... limit memory usage.
- memory.total_limit_in_bytes .... limit memory+swap usage.
By this, we can avoid excessive use of swap under a cgroup without any bad effect
on the global LRU. (Some overhead will be added to the page selection algorithm, of course.)
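Just to illustrate the intended semantics, a rough sketch (hypothetical helper name, not
actual memcg code): memory usage alone must stay under limit_in_bytes, while memory plus
not-resident swap must stay under total_limit_in_bytes.

/*
 * Sketch only -- hypothetical helper, not the real implementation.
 * swap_usage here means swap that is no longer resident in memory
 * (swp_entries without a page), so swap cache is not counted twice.
 */
static bool memsw_within_limits(unsigned long mem_usage, unsigned long swap_usage,
				unsigned long limit, unsigned long total_limit)
{
	return mem_usage <= limit &&
	       mem_usage + swap_usage <= total_limit;
}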
The following is a design memo for the state transitions and counter handling.
It uses 3 counters to handle the above correctly. If you have other logic,
please teach me. (And blame me if my diagram is broken.)
The key point, I think, is how to handle the swap cache.
(Maybe we need a _big_ change in memcg.)
==
state definition
new alloc .... an object is newly allocated
no_swap .... an object with page without swp_entry
swap_cache .... an object with page with swp_entry
disk_swap .... an object without page with swp_entry
freed .... an object is freed (by munmap)
(*) An object is an entity which is accounted: a page or a swap entry.
new alloc -> no_swap <=> swap_cache <=> disk_swap
                |            |              |
     freed <----+------------+--------------+
Use 3 counters: no_swap, swap_cache, disk_swap.

on_memory = no_swap + swap_cache
total     = no_swap + swap_cache + disk_swap

on_memory is limited by memory.limit_in_bytes.
total is limited by memory.total_limit_in_bytes.
                         no_swap  swap_cache  disk_swap  on_memory  total
new alloc->no_swap         +1         -           -          +1       +1
no_swap->swap_cache        -1        +1           -           -        -
swap_cache->no_swap        +1        -1           -           -        -
swap_cache->disk_swap       -        -1          +1          -1        -
disk_swap->swap_cache       -        +1          -1          +1        -
no_swap->freed             -1         -           -          -1       -1
swap_cache->freed           -        -1           -          -1       -1
disk_swap->freed            -         -          -1           -       -1
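To make the table concrete, here is a minimal sketch of the counters and a few of the
transitions (illustrative only; the struct, helpers, and error codes are hypothetical,
and locking is omitted):

/*
 * Illustrative sketch of the 3-counter scheme above -- not the existing
 * memcg implementation.  Counts are in pages for simplicity.
 */
struct memsw_counters {
	unsigned long no_swap;		/* pages without a swp_entry */
	unsigned long swap_cache;	/* pages that also have a swp_entry */
	unsigned long disk_swap;	/* swp_entries with no page in memory */
	unsigned long limit;		/* memory.limit_in_bytes, in pages */
	unsigned long total_limit;	/* memory.total_limit_in_bytes, in pages */
};

static unsigned long memsw_on_memory(struct memsw_counters *c)
{
	return c->no_swap + c->swap_cache;
}

static unsigned long memsw_total(struct memsw_counters *c)
{
	return c->no_swap + c->swap_cache + c->disk_swap;
}

/* "new alloc -> no_swap": the only transition charged against both limits. */
static int memsw_charge_new_page(struct memsw_counters *c)
{
	if (memsw_on_memory(c) + 1 > c->limit)
		return -ENOMEM;	/* reclaim/swap out within this group first */
	if (memsw_total(c) + 1 > c->total_limit)
		return -ENOMEM;	/* swapping out cannot help; free pages or OOM */
	c->no_swap++;		/* on_memory and total both grow by 1 */
	return 0;
}

/* "no_swap -> swap_cache": swp_entry assigned; on_memory and total unchanged. */
static void memsw_add_to_swap_cache(struct memsw_counters *c)
{
	c->no_swap--;
	c->swap_cache++;
}

/* "swap_cache -> disk_swap": page written out; only on_memory shrinks. */
static void memsw_swap_out(struct memsw_counters *c)
{
	c->swap_cache--;
	c->disk_swap++;
}

/* "disk_swap -> freed": swp_entry released; only total shrinks. */
static void memsw_free_swap(struct memsw_counters *c)
{
	c->disk_swap--;
}

Because no_swap -> swap_cache -> disk_swap never increases total, these limits never
block global reclaim itself; only new charges are blocked.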
Any comments are welcome.
Regards,
-Kame
* Re: memo: mem+swap controller
  2008-07-31  1:15 memo: mem+swap controller KAMEZAWA Hiroyuki
@ 2008-07-31  6:18 ` KOSAKI Motohiro
  2008-07-31  6:25 ` Daisuke Nishimura
  2008-08-01  3:20 ` memo: mem+swap controller Balbir Singh
  2 siblings, 0 replies; 19+ messages in thread
From: KOSAKI Motohiro @ 2008-07-31 6:18 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: kosaki.motohiro, nishimura, balbir, hugh, linux-mm, menage, Andrew Morton

> on_memory is limited by memory.limit_in_bytes.
> total is limited by memory.total_limit_in_bytes.

"total" is a slightly ambiguous word. IMHO, coining a new term would be better.
* Re: memo: mem+swap controller 2008-07-31 1:15 memo: mem+swap controller KAMEZAWA Hiroyuki 2008-07-31 6:18 ` KOSAKI Motohiro @ 2008-07-31 6:25 ` Daisuke Nishimura 2008-07-31 6:51 ` KAMEZAWA Hiroyuki 2008-08-01 3:28 ` Balbir Singh 2008-08-01 3:20 ` memo: mem+swap controller Balbir Singh 2 siblings, 2 replies; 19+ messages in thread From: Daisuke Nishimura @ 2008-07-31 6:25 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: nishimura, balbir, hugh, linux-mm, menage, Andrew Morton Hi, Kamezawa-san. On Thu, 31 Jul 2008 10:15:33 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > Hi, mem+swap controller is suggested by Hugh Dickins and I think it's a great > idea. Its concept is having 2 limits. (please point out if I misunderstand.) > > - memory.limit_in_bytes .... limit memory usage. > - memory.total_limit_in_bytes .... limit memory+swap usage. > When I've considered more, I wonder how we can accomplish "do not use swap in this group". Setting "limit_in_bytes == total_limit_in_bytes" doesn't meet it, I think. "limit_in_bytes = total_limit_in_bytes = 1G" cannot avoid "memory.usage = 700M swap.usage = 300M" under memory pressure outside of the group(and I think this behavior is the diffrence of "memory controller + swap controller" and "mem+swap controller"). I think total_limit_in_bytes and swappiness(or some flag to indicate "do not swap out"?) for each group would make more sense. > By this, we can avoid excessive use of swap under a cgroup without any bad effect > to global LRU. (in page selection algorithm...overhead will be added, of course) > Sorry, I cannot understand this part. > Following is state transition and counter handling design memo. > This uses "3" counters to handle above conrrectly. If you have other logic, > please teach me. (and blame me if my diagram is broken.) > I don't think counting "disk swap" is good idea(global linux dosen't count it). Instead, I prefer counting "total swap"(that is swap entry). > A point is how to handle swap-cache, I think. > (Maybe we need a _big_ change in memcg.) > I think swap cache should be counted as both memory and swap, as global linux does. Thanks, Daisuke Nishimura. > == > > state definition > new alloc .... an object is newly allocated > no_swap .... an object with page without swp_entry > swap_cache .... an object with page with swp_entry > disk_swap .... an object without page with swp_entry > freed .... an object is freed (by munmap) > > (*) an object is an enitity which is accoutned, page or swap. > > new alloc -> no_swap <=> swap_cache <=> disk_swap > | | | > freed. <-----------<-------------<----------- > > use 3 counters, no_swap, swap_cache, disk_swap. > > on_memory = no_swap + swap_cache. > total = no_swap + swap_cache + disk_swap > > on_memory is limited by memory.limit_in_bytes > total is limtied by memory.total_limit_in_bytes. > > no_swap swap_cache disk_swap on_memory total > new alloc->no_swap +1 - - +1 +1 > no_swap->swap_cache -1 +1 - - - > swap_cache->no_swap +1 -1 - - - > swap_cache->disk_swap - -1 +1 -1 - > disk_swap->swap_cache - +1 -1 +1 - > no_swap->freed -1 - - -1 -1 > swap_cache->freed - -1 - -1 -1 > disk_swap->freed - - -1 - -1 > > > any comments are welcome. > > Regards, > -Kame > > > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: memo: mem+swap controller 2008-07-31 6:25 ` Daisuke Nishimura @ 2008-07-31 6:51 ` KAMEZAWA Hiroyuki 2008-07-31 13:03 ` Daisuke Nishimura 2008-07-31 16:31 ` kamezawa.hiroyu 2008-08-01 3:28 ` Balbir Singh 1 sibling, 2 replies; 19+ messages in thread From: KAMEZAWA Hiroyuki @ 2008-07-31 6:51 UTC (permalink / raw) To: Daisuke Nishimura; +Cc: balbir, hugh, linux-mm, menage, Andrew Morton On Thu, 31 Jul 2008 15:25:33 +0900 Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote: > Hi, Kamezawa-san. > > On Thu, 31 Jul 2008 10:15:33 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > > Hi, mem+swap controller is suggested by Hugh Dickins and I think it's a great > > idea. Its concept is having 2 limits. (please point out if I misunderstand.) > > > > - memory.limit_in_bytes .... limit memory usage. > > - memory.total_limit_in_bytes .... limit memory+swap usage. > > > When I've considered more, I wonder how we can accomplish > "do not use swap in this group". > you can't do that. This direction is to overcommit swap usage by counting pages which is kicked out. > Setting "limit_in_bytes == total_limit_in_bytes" doesn't meet it, I think. > "limit_in_bytes = total_limit_in_bytes = 1G" cannot > avoid "memory.usage = 700M swap.usage = 300M" under memory pressure > outside of the group(and I think this behavior is the diffrence > of "memory controller + swap controller" and "mem+swap controller"). > > I think total_limit_in_bytes and swappiness(or some flag to indicate > "do not swap out"?) for each group would make more sense. > "do not swap out" is bad direction. please use mlock(). (*) I have no objection to add a control file to move a page from swap to on-memory. > > By this, we can avoid excessive use of swap under a cgroup without any bad effect > > to global LRU. (in page selection algorithm...overhead will be added, of course) > > > Sorry, I cannot understand this part. > >From global LRU's view, anonymous page can be swapped out everytime. Because it never hits limit. == no_swap swap_cache disk_swap on_memory total no_swap->swap_cache -1 +1 - - - == no changes in total. > > Following is state transition and counter handling design memo. > > This uses "3" counters to handle above conrrectly. If you have other logic, > > please teach me. (and blame me if my diagram is broken.) > > > I don't think counting "disk swap" is good idea(global linux > dosen't count it). > Instead, I prefer counting "total swap"(that is swap entry). > Maybe my illustration is bad. total_swap = swap_cache + disk_swap. Yes, I count swp_entry. But just divides it to on-memory or not. This is just a state transition problem. When we counting only total_swap, we cannot avoid double counting of a swap_cache as memory and as swap. > > A point is how to handle swap-cache, I think. > > (Maybe we need a _big_ change in memcg.) > > > I think swap cache should be counted as both memory and swap, > as global linux does. No. If we allow double counting, we'll see OOM-Killer very soon. If what you say is == no_swap swap on_memory total no_swap->swap_cache +1 - +1 == What happens when global lru's swap_out hits limit ? If what you say is == no_swap swap on_memory total no_swap->swap_cache -1 +1 - - == What happens when SwapCache is mapped ? Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: memo: mem+swap controller 2008-07-31 6:51 ` KAMEZAWA Hiroyuki @ 2008-07-31 13:03 ` Daisuke Nishimura 2008-07-31 16:31 ` kamezawa.hiroyu 1 sibling, 0 replies; 19+ messages in thread From: Daisuke Nishimura @ 2008-07-31 13:03 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: nishimura, balbir, hugh, linux-mm, menage, Andrew Morton > > > By this, we can avoid excessive use of swap under a cgroup without any bad effect > > > to global LRU. (in page selection algorithm...overhead will be added, of course) > > > > > Sorry, I cannot understand this part. > > > From global LRU's view, anonymous page can be swapped out everytime. > Because it never hits limit. > == > no_swap swap_cache disk_swap on_memory total > no_swap->swap_cache -1 +1 - - - > == > no changes in total. > I see. Thanks. > > > Following is state transition and counter handling design memo. > > > This uses "3" counters to handle above conrrectly. If you have other logic, > > > please teach me. (and blame me if my diagram is broken.) > > > > > I don't think counting "disk swap" is good idea(global linux > > dosen't count it). > > Instead, I prefer counting "total swap"(that is swap entry). > > > Maybe my illustration is bad. > > total_swap = swap_cache + disk_swap. Yes, I count swp_entry. > But just divides it to on-memory or not. > > This is just a state transition problem. When we counting only total_swap, > we cannot avoid double counting of a swap_cache as memory and as swap. > I agree. My intention was not counting only total_swap, but counting both total_swap and swap_cache. > > > A point is how to handle swap-cache, I think. > > > (Maybe we need a _big_ change in memcg.) > > > > > I think swap cache should be counted as both memory and swap, > > as global linux does. > > No. If we allow double counting, we'll see OOM-Killer very soon. > on_memory = no_swap + swap_cache (your difinition) swap = swap_cache + disk_swap total = no_swap + swap_cache + disk_swap (your difinition) total is NOT "on_memory + swap"(swap_cache is not doble counted). so, in the sense of limitting, this is the same as yours. > If what you say is > == > no_swap swap on_memory total > no_swap->swap_cache +1 - +1 > == > What happens when global lru's swap_out hits limit ? > > If what you say is > == > no_swap swap on_memory total > no_swap->swap_cache -1 +1 - - > == > What happens when SwapCache is mapped ? > The latter, and I think there would be no defference about no_swap/swap_cache/disk_swap counters even when the swap cache is mapped. (current memory controller charges swap caches only when it is mapped, but I'm not talking about it.) Thanks, Daisuke Nishimura. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Re: memo: mem+swap controller 2008-07-31 6:51 ` KAMEZAWA Hiroyuki 2008-07-31 13:03 ` Daisuke Nishimura @ 2008-07-31 16:31 ` kamezawa.hiroyu 2008-08-01 3:05 ` Daisuke Nishimura 1 sibling, 1 reply; 19+ messages in thread From: kamezawa.hiroyu @ 2008-07-31 16:31 UTC (permalink / raw) To: Daisuke Nishimura Cc: KAMEZAWA Hiroyuki, balbir, hugh, linux-mm, menage, Andrew Morton >> > > Following is state transition and counter handling design memo. >> > > This uses "3" counters to handle above conrrectly. If you have other lo gic, >> > > please teach me. (and blame me if my diagram is broken.) >> > > >> > I don't think counting "disk swap" is good idea(global linux >> > dosen't count it). >> > Instead, I prefer counting "total swap"(that is swap entry). >> > >> Maybe my illustration is bad. >> >> total_swap = swap_cache + disk_swap. Yes, I count swp_entry. >> But just divides it to on-memory or not. >> >> This is just a state transition problem. When we counting only total_swap, >> we cannot avoid double counting of a swap_cache as memory and as swap. >> >I agree. >My intention was not counting only total_swap, but counting both >total_swap and swap_cache. > At early stage of diaglam, I just added total_swap counter. (total_swap here means # of used swp_entry.) And failed to write diaglam ;( Maybe selection of counters was bad. If just 2 counters are enough, it's better. Hmm.. - on_memory .... # of pages used - disk_swap .... # of swp_entry without SwapCache limit_in_bytes ... limits on_memory total_limit ... limits on_mempry + disk_swap. can work ? I'd like to write a sample patch in the next week. Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: memo: mem+swap controller 2008-07-31 16:31 ` kamezawa.hiroyu @ 2008-08-01 3:05 ` Daisuke Nishimura 0 siblings, 0 replies; 19+ messages in thread From: Daisuke Nishimura @ 2008-08-01 3:05 UTC (permalink / raw) To: kamezawa.hiroyu; +Cc: nishimura, balbir, hugh, linux-mm, menage, Andrew Morton On Fri, 1 Aug 2008 01:31:41 +0900 (JST), kamezawa.hiroyu@jp.fujitsu.com wrote: > >> > > Following is state transition and counter handling design memo. > >> > > This uses "3" counters to handle above conrrectly. If you have other lo > gic, > >> > > please teach me. (and blame me if my diagram is broken.) > >> > > > >> > I don't think counting "disk swap" is good idea(global linux > >> > dosen't count it). > >> > Instead, I prefer counting "total swap"(that is swap entry). > >> > > >> Maybe my illustration is bad. > >> > >> total_swap = swap_cache + disk_swap. Yes, I count swp_entry. > >> But just divides it to on-memory or not. > >> > >> This is just a state transition problem. When we counting only total_swap, > >> we cannot avoid double counting of a swap_cache as memory and as swap. > >> > >I agree. > >My intention was not counting only total_swap, but counting both > >total_swap and swap_cache. > > > At early stage of diaglam, I just added total_swap counter. > (total_swap here means # of used swp_entry.) > And failed to write diaglam ;( Maybe selection of counters was bad. > > If just 2 counters are enough, it's better. > I think so. > Hmm.. > - on_memory .... # of pages used > - disk_swap .... # of swp_entry without SwapCache > > limit_in_bytes ... limits on_memory > total_limit ... limits on_mempry + disk_swap. > > can work ? > Theoretically it works, I think. But, one thing I'm wondering is how to distinguish "swap_cache -> freed" and "disk_swap -> freed". Anyway, I need more consideration. Thanks, Daisuke Nishimura. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: memo: mem+swap controller 2008-07-31 6:25 ` Daisuke Nishimura 2008-07-31 6:51 ` KAMEZAWA Hiroyuki @ 2008-08-01 3:28 ` Balbir Singh 2008-08-01 4:02 ` Daisuke Nishimura 1 sibling, 1 reply; 19+ messages in thread From: Balbir Singh @ 2008-08-01 3:28 UTC (permalink / raw) To: Daisuke Nishimura Cc: KAMEZAWA Hiroyuki, hugh, linux-mm, menage, Andrew Morton Daisuke Nishimura wrote: > Hi, Kamezawa-san. > > On Thu, 31 Jul 2008 10:15:33 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: >> Hi, mem+swap controller is suggested by Hugh Dickins and I think it's a great >> idea. Its concept is having 2 limits. (please point out if I misunderstand.) >> >> - memory.limit_in_bytes .... limit memory usage. >> - memory.total_limit_in_bytes .... limit memory+swap usage. >> > When I've considered more, I wonder how we can accomplish > "do not use swap in this group". > It's easy use the memrlimit controller and set virtual address limit <= memory.limit_in_bytes. I use that to make sure I never swap out. > Setting "limit_in_bytes == total_limit_in_bytes" doesn't meet it, I think. > "limit_in_bytes = total_limit_in_bytes = 1G" cannot > avoid "memory.usage = 700M swap.usage = 300M" under memory pressure > outside of the group(and I think this behavior is the diffrence > of "memory controller + swap controller" and "mem+swap controller"). > > I think total_limit_in_bytes and swappiness(or some flag to indicate > "do not swap out"?) for each group would make more sense. I do intend to add the swappiness feature soon for control groups. > >> By this, we can avoid excessive use of swap under a cgroup without any bad effect >> to global LRU. (in page selection algorithm...overhead will be added, of course) >> > Sorry, I cannot understand this part. > >> Following is state transition and counter handling design memo. >> This uses "3" counters to handle above conrrectly. If you have other logic, >> please teach me. (and blame me if my diagram is broken.) >> > I don't think counting "disk swap" is good idea(global linux > dosen't count it). > Instead, I prefer counting "total swap"(that is swap entry). > >> A point is how to handle swap-cache, I think. >> (Maybe we need a _big_ change in memcg.) >> > I think swap cache should be counted as both memory and swap, > as global linux does. > [snip] -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: memo: mem+swap controller 2008-08-01 3:28 ` Balbir Singh @ 2008-08-01 4:02 ` Daisuke Nishimura 2008-08-01 4:13 ` Balbir Singh 0 siblings, 1 reply; 19+ messages in thread From: Daisuke Nishimura @ 2008-08-01 4:02 UTC (permalink / raw) To: balbir Cc: nishimura, KAMEZAWA Hiroyuki, hugh, linux-mm, menage, Andrew Morton On Fri, 01 Aug 2008 08:58:07 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > Daisuke Nishimura wrote: > > Hi, Kamezawa-san. > > > > On Thu, 31 Jul 2008 10:15:33 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > >> Hi, mem+swap controller is suggested by Hugh Dickins and I think it's a great > >> idea. Its concept is having 2 limits. (please point out if I misunderstand.) > >> > >> - memory.limit_in_bytes .... limit memory usage. > >> - memory.total_limit_in_bytes .... limit memory+swap usage. > >> > > When I've considered more, I wonder how we can accomplish > > "do not use swap in this group". > > > > It's easy use the memrlimit controller and set virtual address limit <= > memory.limit_in_bytes. I use that to make sure I never swap out. > I don't think it works under memory pressure *outside* of the group, that is, global memory reclaim. (I think "limit_in_bytes == total_limit_in_bytes" also works *inside* memory pressure.) > > Setting "limit_in_bytes == total_limit_in_bytes" doesn't meet it, I think. > > "limit_in_bytes = total_limit_in_bytes = 1G" cannot > > avoid "memory.usage = 700M swap.usage = 300M" under memory pressure > > outside of the group(and I think this behavior is the diffrence > > of "memory controller + swap controller" and "mem+swap controller"). > > > > I think total_limit_in_bytes and swappiness(or some flag to indicate > > "do not swap out"?) for each group would make more sense. > > I do intend to add the swappiness feature soon for control groups. > How does it work? Does it affect global page reclaim? Thanks, Daisuke Nishimura. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: memo: mem+swap controller 2008-08-01 4:02 ` Daisuke Nishimura @ 2008-08-01 4:13 ` Balbir Singh 2008-08-01 4:57 ` Daisuke Nishimura 2008-08-01 5:07 ` memcg swappiness (Re: memo: mem+swap controller) YAMAMOTO Takashi 0 siblings, 2 replies; 19+ messages in thread From: Balbir Singh @ 2008-08-01 4:13 UTC (permalink / raw) To: Daisuke Nishimura Cc: KAMEZAWA Hiroyuki, hugh, linux-mm, menage, Andrew Morton Daisuke Nishimura wrote: > On Fri, 01 Aug 2008 08:58:07 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: >> Daisuke Nishimura wrote: >>> Hi, Kamezawa-san. >>> >>> On Thu, 31 Jul 2008 10:15:33 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: >>>> Hi, mem+swap controller is suggested by Hugh Dickins and I think it's a great >>>> idea. Its concept is having 2 limits. (please point out if I misunderstand.) >>>> >>>> - memory.limit_in_bytes .... limit memory usage. >>>> - memory.total_limit_in_bytes .... limit memory+swap usage. >>>> >>> When I've considered more, I wonder how we can accomplish >>> "do not use swap in this group". >>> >> It's easy use the memrlimit controller and set virtual address limit <= >> memory.limit_in_bytes. I use that to make sure I never swap out. >> > I don't think it works under memory pressure *outside* of the group, > that is, global memory reclaim. > (I think "limit_in_bytes == total_limit_in_bytes" also works *inside* memory > pressure.) > Yes, but it ensures that the current cgroup does not add to global swap pressure. Trying to control global pressure from inside a cgroup might be hard, but I see your intention of isolating the cgroup from global pressure. With swappiness, it should be possible to purge page cache ahead of creating swap pressure. >>> Setting "limit_in_bytes == total_limit_in_bytes" doesn't meet it, I think. >>> "limit_in_bytes = total_limit_in_bytes = 1G" cannot >>> avoid "memory.usage = 700M swap.usage = 300M" under memory pressure >>> outside of the group(and I think this behavior is the diffrence >>> of "memory controller + swap controller" and "mem+swap controller"). >>> >>> I think total_limit_in_bytes and swappiness(or some flag to indicate >>> "do not swap out"?) for each group would make more sense. >> I do intend to add the swappiness feature soon for control groups. >> > How does it work? > Does it affect global page reclaim? > We have a swappiness parameter in scan_control. Each control group indicates what it wants it swappiness to be when the control group is over it's limit and reclaim kicks in. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: memo: mem+swap controller 2008-08-01 4:13 ` Balbir Singh @ 2008-08-01 4:57 ` Daisuke Nishimura 2008-08-01 5:07 ` memcg swappiness (Re: memo: mem+swap controller) YAMAMOTO Takashi 1 sibling, 0 replies; 19+ messages in thread From: Daisuke Nishimura @ 2008-08-01 4:57 UTC (permalink / raw) To: balbir Cc: nishimura, KAMEZAWA Hiroyuki, hugh, linux-mm, menage, Andrew Morton On Fri, 01 Aug 2008 09:43:43 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > Daisuke Nishimura wrote: > > On Fri, 01 Aug 2008 08:58:07 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > >> Daisuke Nishimura wrote: > >>> Hi, Kamezawa-san. > >>> > >>> On Thu, 31 Jul 2008 10:15:33 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > >>>> Hi, mem+swap controller is suggested by Hugh Dickins and I think it's a great > >>>> idea. Its concept is having 2 limits. (please point out if I misunderstand.) > >>>> > >>>> - memory.limit_in_bytes .... limit memory usage. > >>>> - memory.total_limit_in_bytes .... limit memory+swap usage. > >>>> > >>> When I've considered more, I wonder how we can accomplish > >>> "do not use swap in this group". > >>> > >> It's easy use the memrlimit controller and set virtual address limit <= > >> memory.limit_in_bytes. I use that to make sure I never swap out. > >> > > I don't think it works under memory pressure *outside* of the group, > > that is, global memory reclaim. > > (I think "limit_in_bytes == total_limit_in_bytes" also works *inside* memory > > pressure.) > > > > Yes, but it ensures that the current cgroup does not add to global swap > pressure. Trying to control global pressure from inside a cgroup might be hard, > but I see your intention of isolating the cgroup from global pressure. With > swappiness, it should be possible to purge page cache ahead of creating swap > pressure. > Ah, I don't have strong requirement of "do not use swap in this group" under global memory pressure(and it would be bad policy). I asked just for curiosity. Sorry for confusing you. > >>> Setting "limit_in_bytes == total_limit_in_bytes" doesn't meet it, I think. > >>> "limit_in_bytes = total_limit_in_bytes = 1G" cannot > >>> avoid "memory.usage = 700M swap.usage = 300M" under memory pressure > >>> outside of the group(and I think this behavior is the diffrence > >>> of "memory controller + swap controller" and "mem+swap controller"). > >>> > >>> I think total_limit_in_bytes and swappiness(or some flag to indicate > >>> "do not swap out"?) for each group would make more sense. > >> I do intend to add the swappiness feature soon for control groups. > >> > > How does it work? > > Does it affect global page reclaim? > > > > We have a swappiness parameter in scan_control. Each control group indicates > what it wants it swappiness to be when the control group is over it's limit and > reclaim kicks in. > I see. Thank you for your explanation. Thanks, Daisuke Nishimura. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* memcg swappiness (Re: memo: mem+swap controller) 2008-08-01 4:13 ` Balbir Singh 2008-08-01 4:57 ` Daisuke Nishimura @ 2008-08-01 5:07 ` YAMAMOTO Takashi 2008-08-01 5:25 ` Balbir Singh 1 sibling, 1 reply; 19+ messages in thread From: YAMAMOTO Takashi @ 2008-08-01 5:07 UTC (permalink / raw) To: balbir; +Cc: nishimura, kamezawa.hiroyu, hugh, linux-mm, menage, akpm hi, > >> I do intend to add the swappiness feature soon for control groups. > >> > > How does it work? > > Does it affect global page reclaim? > > > > We have a swappiness parameter in scan_control. Each control group indicates > what it wants it swappiness to be when the control group is over it's limit and > reclaim kicks in. the following is an untested work-in-progress patch i happen to have. i'd appreciate it if you take care of it. YAMAMOTO Takashi Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp> --- diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index ee1b2fc..7618944 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -47,6 +47,7 @@ extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan, int active, int file); extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask); int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem); +extern int mem_cgroup_swappiness(struct mem_cgroup *mem); extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index fcfa8b4..e1eeb09 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -129,6 +129,7 @@ struct mem_cgroup { struct mem_cgroup_lru_info info; int prev_priority; /* for recording reclaim priority */ + unsigned int swappiness; /* swappiness */ /* * statistics. */ @@ -437,14 +438,11 @@ void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem, int priority) long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone, int priority, enum lru_list lru) { - long nr_pages; int nid = zone->zone_pgdat->node_id; int zid = zone_idx(zone); struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid); - nr_pages = MEM_CGROUP_ZSTAT(mz, lru); - - return (nr_pages >> priority); + return MEM_CGROUP_ZSTAT(mz, lru); } unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan, @@ -963,6 +961,24 @@ static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft, return 0; } +static int mem_cgroup_swappiness_write(struct cgroup *cont, struct cftype *cft, + u64 val) +{ + struct mem_cgroup *mem = mem_cgroup_from_cont(cont); + + if (val > 100) + return -EINVAL; + mem->swappiness = val; + return 0; +} + +static u64 mem_cgroup_swappiness_read(struct cgroup *cont, struct cftype *cft) +{ + struct mem_cgroup *mem = mem_cgroup_from_cont(cont); + + return mem->swappiness; +} + static struct cftype mem_cgroup_files[] = { { .name = "usage_in_bytes", @@ -995,8 +1011,21 @@ static struct cftype mem_cgroup_files[] = { .name = "stat", .read_map = mem_control_stat_show, }, + { + .name = "swappiness", + .write_u64 = mem_cgroup_swappiness_write, + .read_u64 = mem_cgroup_swappiness_read, + }, }; +/* XXX probably it's better to move try_to_free_mem_cgroup_pages to + memcontrol.c and kill this */ +int mem_cgroup_swappiness(struct mem_cgroup *mem) +{ + + return mem->swappiness; +} + static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node) { struct mem_cgroup_per_node *pn; @@ -1072,6 +1101,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) return ERR_PTR(-ENOMEM); } + mem->swappiness = 
60; /* XXX probably should inherit a value from + either parent cgroup or global vm_swappiness */ res_counter_init(&mem->res); for_each_node_state(node, N_POSSIBLE) diff --git a/mm/vmscan.c b/mm/vmscan.c index 6527e04..9f2ddbc 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1364,15 +1364,10 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan, static void get_scan_ratio(struct zone *zone, struct scan_control * sc, unsigned long *percent) { - unsigned long anon, file, free; unsigned long anon_prio, file_prio; unsigned long ap, fp; - - anon = zone_page_state(zone, NR_ACTIVE_ANON) + - zone_page_state(zone, NR_INACTIVE_ANON); - file = zone_page_state(zone, NR_ACTIVE_FILE) + - zone_page_state(zone, NR_INACTIVE_FILE); - free = zone_page_state(zone, NR_FREE_PAGES); + unsigned long recent_scanned[2]; + unsigned long recent_rotated[2]; /* If we have no swap space, do not bother scanning anon pages. */ if (nr_swap_pages <= 0) { @@ -1381,36 +1376,59 @@ static void get_scan_ratio(struct zone *zone, struct scan_control * sc, return; } - /* If we have very few page cache pages, force-scan anon pages. */ - if (unlikely(file + free <= zone->pages_high)) { - percent[0] = 100; - percent[1] = 0; - return; - } + if (scan_global_lru(sc)) { + unsigned long anon, file, free; - /* - * OK, so we have swap space and a fair amount of page cache - * pages. We use the recently rotated / recently scanned - * ratios to determine how valuable each cache is. - * - * Because workloads change over time (and to avoid overflow) - * we keep these statistics as a floating average, which ends - * up weighing recent references more than old ones. - * - * anon in [0], file in [1] - */ - if (unlikely(zone->recent_scanned[0] > anon / 4)) { - spin_lock_irq(&zone->lru_lock); - zone->recent_scanned[0] /= 2; - zone->recent_rotated[0] /= 2; - spin_unlock_irq(&zone->lru_lock); - } + anon = zone_page_state(zone, NR_ACTIVE_ANON) + + zone_page_state(zone, NR_INACTIVE_ANON); + file = zone_page_state(zone, NR_ACTIVE_FILE) + + zone_page_state(zone, NR_INACTIVE_FILE); + free = zone_page_state(zone, NR_FREE_PAGES); - if (unlikely(zone->recent_scanned[1] > file / 4)) { - spin_lock_irq(&zone->lru_lock); - zone->recent_scanned[1] /= 2; - zone->recent_rotated[1] /= 2; - spin_unlock_irq(&zone->lru_lock); + /* + * If we have very few page cache pages, force-scan anon pages. + */ + if (unlikely(file + free <= zone->pages_high)) { + percent[0] = 100; + percent[1] = 0; + return; + } + + /* + * OK, so we have swap space and a fair amount of page cache + * pages. We use the recently rotated / recently scanned + * ratios to determine how valuable each cache is. + * + * Because workloads change over time (and to avoid overflow) + * we keep these statistics as a floating average, which ends + * up weighing recent references more than old ones. 
+ * + * anon in [0], file in [1] + */ + if (unlikely(zone->recent_scanned[0] > anon / 4)) { + spin_lock_irq(&zone->lru_lock); + zone->recent_scanned[0] /= 2; + zone->recent_rotated[0] /= 2; + spin_unlock_irq(&zone->lru_lock); + } + + if (unlikely(zone->recent_scanned[1] > file / 4)) { + spin_lock_irq(&zone->lru_lock); + zone->recent_scanned[1] /= 2; + zone->recent_rotated[1] /= 2; + spin_unlock_irq(&zone->lru_lock); + } + + recent_scanned[0] = zone->recent_scanned[0]; + recent_scanned[1] = zone->recent_scanned[1]; + recent_rotated[0] = zone->recent_rotated[0]; + recent_rotated[1] = zone->recent_rotated[1]; + } else { + /* XXX */ + recent_scanned[0] = 0; + recent_scanned[1] = 0; + recent_rotated[0] = 0; + recent_rotated[1] = 0; } /* @@ -1425,11 +1443,11 @@ static void get_scan_ratio(struct zone *zone, struct scan_control * sc, * %anon = 100 * ----------- / ----------------- * IO cost * anon + file rotate_sum */ - ap = (anon_prio + 1) * (zone->recent_scanned[0] + 1); - ap /= zone->recent_rotated[0] + 1; + ap = (anon_prio + 1) * (recent_scanned[0] + 1); + ap /= recent_rotated[0] + 1; - fp = (file_prio + 1) * (zone->recent_scanned[1] + 1); - fp /= zone->recent_rotated[1] + 1; + fp = (file_prio + 1) * (recent_scanned[1] + 1); + fp /= recent_rotated[1] + 1; /* Normalize to percentages */ percent[0] = 100 * ap / (ap + fp + 1); @@ -1452,9 +1470,10 @@ static unsigned long shrink_zone(int priority, struct zone *zone, get_scan_ratio(zone, sc, percent); for_each_evictable_lru(l) { + int file = is_file_lru(l); + unsigned long scan; + if (scan_global_lru(sc)) { - int file = is_file_lru(l); - int scan; /* * Add one to nr_to_scan just to make sure that the * kernel will slowly sift through each list. @@ -1476,8 +1495,13 @@ static unsigned long shrink_zone(int priority, struct zone *zone, * but because memory controller hits its limit. * Don't modify zone reclaim related data. */ - nr[l] = mem_cgroup_calc_reclaim(sc->mem_cgroup, zone, + scan = mem_cgroup_calc_reclaim(sc->mem_cgroup, zone, priority, l); + if (priority) { + scan >>= priority; + scan = (scan * percent[file]) / 100; + } + nr[l] = scan + 1; } } @@ -1704,7 +1728,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont, .may_writepage = !laptop_mode, .may_swap = 1, .swap_cluster_max = SWAP_CLUSTER_MAX, - .swappiness = vm_swappiness, + .swappiness = mem_cgroup_swappiness(mem_cont), .order = 0, .mem_cgroup = mem_cont, .isolate_pages = mem_cgroup_isolate_pages, -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: memcg swappiness (Re: memo: mem+swap controller) 2008-08-01 5:07 ` memcg swappiness (Re: memo: mem+swap controller) YAMAMOTO Takashi @ 2008-08-01 5:25 ` Balbir Singh 2008-08-01 6:37 ` YAMAMOTO Takashi 0 siblings, 1 reply; 19+ messages in thread From: Balbir Singh @ 2008-08-01 5:25 UTC (permalink / raw) To: YAMAMOTO Takashi; +Cc: nishimura, kamezawa.hiroyu, hugh, linux-mm, menage, akpm YAMAMOTO Takashi wrote: > hi, > >>>> I do intend to add the swappiness feature soon for control groups. >>>> >>> How does it work? >>> Does it affect global page reclaim? >>> >> We have a swappiness parameter in scan_control. Each control group indicates >> what it wants it swappiness to be when the control group is over it's limit and >> reclaim kicks in. > > the following is an untested work-in-progress patch i happen to have. > i'd appreciate it if you take care of it. > Looks very similar to the patch I have. You seemed to have made much more progress than me, I am yet to look at the recent_* statistics. How are the test results? Are they close to what you expect? Some comments below > > Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp> > --- > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index ee1b2fc..7618944 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -47,6 +47,7 @@ extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan, > int active, int file); > extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask); > int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem); > +extern int mem_cgroup_swappiness(struct mem_cgroup *mem); > > extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p); > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index fcfa8b4..e1eeb09 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -129,6 +129,7 @@ struct mem_cgroup { > struct mem_cgroup_lru_info info; > > int prev_priority; /* for recording reclaim priority */ > + unsigned int swappiness; /* swappiness */ > /* > * statistics. 
> */ > @@ -437,14 +438,11 @@ void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem, int priority) > long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone, > int priority, enum lru_list lru) > { > - long nr_pages; > int nid = zone->zone_pgdat->node_id; > int zid = zone_idx(zone); > struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid); > > - nr_pages = MEM_CGROUP_ZSTAT(mz, lru); > - > - return (nr_pages >> priority); > + return MEM_CGROUP_ZSTAT(mz, lru); > } > > unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan, > @@ -963,6 +961,24 @@ static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft, > return 0; > } > > +static int mem_cgroup_swappiness_write(struct cgroup *cont, struct cftype *cft, > + u64 val) > +{ > + struct mem_cgroup *mem = mem_cgroup_from_cont(cont); > + > + if (val > 100) > + return -EINVAL; > + mem->swappiness = val; > + return 0; > +} > + > +static u64 mem_cgroup_swappiness_read(struct cgroup *cont, struct cftype *cft) > +{ > + struct mem_cgroup *mem = mem_cgroup_from_cont(cont); > + > + return mem->swappiness; > +} > + > static struct cftype mem_cgroup_files[] = { > { > .name = "usage_in_bytes", > @@ -995,8 +1011,21 @@ static struct cftype mem_cgroup_files[] = { > .name = "stat", > .read_map = mem_control_stat_show, > }, > + { > + .name = "swappiness", > + .write_u64 = mem_cgroup_swappiness_write, > + .read_u64 = mem_cgroup_swappiness_read, > + }, > }; > > +/* XXX probably it's better to move try_to_free_mem_cgroup_pages to > + memcontrol.c and kill this */ > +int mem_cgroup_swappiness(struct mem_cgroup *mem) > +{ > + > + return mem->swappiness; > +} > + > static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node) > { > struct mem_cgroup_per_node *pn; > @@ -1072,6 +1101,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) > return ERR_PTR(-ENOMEM); > } > > + mem->swappiness = 60; /* XXX probably should inherit a value from > + either parent cgroup or global vm_swappiness */ Precisely, we need hierarchy support to inherit from parent. vm_swappiness is exported, so we can use it here. > res_counter_init(&mem->res); > > for_each_node_state(node, N_POSSIBLE) > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 6527e04..9f2ddbc 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1364,15 +1364,10 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan, > static void get_scan_ratio(struct zone *zone, struct scan_control * sc, > unsigned long *percent) > { > - unsigned long anon, file, free; > unsigned long anon_prio, file_prio; > unsigned long ap, fp; > - > - anon = zone_page_state(zone, NR_ACTIVE_ANON) + > - zone_page_state(zone, NR_INACTIVE_ANON); > - file = zone_page_state(zone, NR_ACTIVE_FILE) + > - zone_page_state(zone, NR_INACTIVE_FILE); > - free = zone_page_state(zone, NR_FREE_PAGES); > + unsigned long recent_scanned[2]; > + unsigned long recent_rotated[2]; > > /* If we have no swap space, do not bother scanning anon pages. */ > if (nr_swap_pages <= 0) { > @@ -1381,36 +1376,59 @@ static void get_scan_ratio(struct zone *zone, struct scan_control * sc, > return; > } > > - /* If we have very few page cache pages, force-scan anon pages. */ > - if (unlikely(file + free <= zone->pages_high)) { > - percent[0] = 100; > - percent[1] = 0; > - return; > - } > + if (scan_global_lru(sc)) { > + unsigned long anon, file, free; > > - /* > - * OK, so we have swap space and a fair amount of page cache > - * pages. 
We use the recently rotated / recently scanned > - * ratios to determine how valuable each cache is. > - * > - * Because workloads change over time (and to avoid overflow) > - * we keep these statistics as a floating average, which ends > - * up weighing recent references more than old ones. > - * > - * anon in [0], file in [1] > - */ > - if (unlikely(zone->recent_scanned[0] > anon / 4)) { > - spin_lock_irq(&zone->lru_lock); > - zone->recent_scanned[0] /= 2; > - zone->recent_rotated[0] /= 2; > - spin_unlock_irq(&zone->lru_lock); > - } > + anon = zone_page_state(zone, NR_ACTIVE_ANON) + > + zone_page_state(zone, NR_INACTIVE_ANON); > + file = zone_page_state(zone, NR_ACTIVE_FILE) + > + zone_page_state(zone, NR_INACTIVE_FILE); > + free = zone_page_state(zone, NR_FREE_PAGES); > > - if (unlikely(zone->recent_scanned[1] > file / 4)) { > - spin_lock_irq(&zone->lru_lock); > - zone->recent_scanned[1] /= 2; > - zone->recent_rotated[1] /= 2; > - spin_unlock_irq(&zone->lru_lock); > + /* > + * If we have very few page cache pages, force-scan anon pages. > + */ > + if (unlikely(file + free <= zone->pages_high)) { > + percent[0] = 100; > + percent[1] = 0; > + return; > + } > + > + /* > + * OK, so we have swap space and a fair amount of page cache > + * pages. We use the recently rotated / recently scanned > + * ratios to determine how valuable each cache is. > + * > + * Because workloads change over time (and to avoid overflow) > + * we keep these statistics as a floating average, which ends > + * up weighing recent references more than old ones. > + * > + * anon in [0], file in [1] > + */ > + if (unlikely(zone->recent_scanned[0] > anon / 4)) { > + spin_lock_irq(&zone->lru_lock); > + zone->recent_scanned[0] /= 2; > + zone->recent_rotated[0] /= 2; > + spin_unlock_irq(&zone->lru_lock); > + } > + > + if (unlikely(zone->recent_scanned[1] > file / 4)) { > + spin_lock_irq(&zone->lru_lock); > + zone->recent_scanned[1] /= 2; > + zone->recent_rotated[1] /= 2; > + spin_unlock_irq(&zone->lru_lock); > + } > + > + recent_scanned[0] = zone->recent_scanned[0]; > + recent_scanned[1] = zone->recent_scanned[1]; > + recent_rotated[0] = zone->recent_rotated[0]; > + recent_rotated[1] = zone->recent_rotated[1]; > + } else { > + /* XXX */ OK, so this needs to be completed for swappiness to work. > + recent_scanned[0] = 0; > + recent_scanned[1] = 0; > + recent_rotated[0] = 0; > + recent_rotated[1] = 0; > } > > /* > @@ -1425,11 +1443,11 @@ static void get_scan_ratio(struct zone *zone, struct scan_control * sc, > * %anon = 100 * ----------- / ----------------- * IO cost > * anon + file rotate_sum > */ > - ap = (anon_prio + 1) * (zone->recent_scanned[0] + 1); > - ap /= zone->recent_rotated[0] + 1; > + ap = (anon_prio + 1) * (recent_scanned[0] + 1); > + ap /= recent_rotated[0] + 1; > > - fp = (file_prio + 1) * (zone->recent_scanned[1] + 1); > - fp /= zone->recent_rotated[1] + 1; > + fp = (file_prio + 1) * (recent_scanned[1] + 1); > + fp /= recent_rotated[1] + 1; > > /* Normalize to percentages */ > percent[0] = 100 * ap / (ap + fp + 1); > @@ -1452,9 +1470,10 @@ static unsigned long shrink_zone(int priority, struct zone *zone, > get_scan_ratio(zone, sc, percent); > > for_each_evictable_lru(l) { > + int file = is_file_lru(l); > + unsigned long scan; > + > if (scan_global_lru(sc)) { > - int file = is_file_lru(l); > - int scan; > /* > * Add one to nr_to_scan just to make sure that the > * kernel will slowly sift through each list. 
> @@ -1476,8 +1495,13 @@ static unsigned long shrink_zone(int priority, struct zone *zone, > * but because memory controller hits its limit. > * Don't modify zone reclaim related data. > */ > - nr[l] = mem_cgroup_calc_reclaim(sc->mem_cgroup, zone, > + scan = mem_cgroup_calc_reclaim(sc->mem_cgroup, zone, > priority, l); > + if (priority) { > + scan >>= priority; > + scan = (scan * percent[file]) / 100; > + } > + nr[l] = scan + 1; > } > } > > @@ -1704,7 +1728,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont, > .may_writepage = !laptop_mode, > .may_swap = 1, > .swap_cluster_max = SWAP_CLUSTER_MAX, > - .swappiness = vm_swappiness, > + .swappiness = mem_cgroup_swappiness(mem_cont), > .order = 0, > .mem_cgroup = mem_cont, > .isolate_pages = mem_cgroup_isolate_pages, -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: memcg swappiness (Re: memo: mem+swap controller) 2008-08-01 5:25 ` Balbir Singh @ 2008-08-01 6:37 ` YAMAMOTO Takashi 2008-08-01 6:46 ` Balbir Singh 0 siblings, 1 reply; 19+ messages in thread From: YAMAMOTO Takashi @ 2008-08-01 6:37 UTC (permalink / raw) To: balbir; +Cc: nishimura, kamezawa.hiroyu, hugh, linux-mm, menage, akpm > YAMAMOTO Takashi wrote: > > hi, > > > >>>> I do intend to add the swappiness feature soon for control groups. > >>>> > >>> How does it work? > >>> Does it affect global page reclaim? > >>> > >> We have a swappiness parameter in scan_control. Each control group indicates > >> what it wants it swappiness to be when the control group is over it's limit and > >> reclaim kicks in. > > > > the following is an untested work-in-progress patch i happen to have. > > i'd appreciate it if you take care of it. > > > > Looks very similar to the patch I have. You seemed to have made much more > progress than me, I am yet to look at the recent_* statistics. How are the test > results? Are they close to what you expect? Some comments below it's mostly untested as i said above. i'm wondering how to test it. YAMAMOTO Takashi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: memcg swappiness (Re: memo: mem+swap controller) 2008-08-01 6:37 ` YAMAMOTO Takashi @ 2008-08-01 6:46 ` Balbir Singh 2008-09-09 9:17 ` YAMAMOTO Takashi 0 siblings, 1 reply; 19+ messages in thread From: Balbir Singh @ 2008-08-01 6:46 UTC (permalink / raw) To: YAMAMOTO Takashi; +Cc: nishimura, kamezawa.hiroyu, hugh, linux-mm, menage, akpm YAMAMOTO Takashi wrote: >> YAMAMOTO Takashi wrote: >>> hi, >>> >>>>>> I do intend to add the swappiness feature soon for control groups. >>>>>> >>>>> How does it work? >>>>> Does it affect global page reclaim? >>>>> >>>> We have a swappiness parameter in scan_control. Each control group indicates >>>> what it wants it swappiness to be when the control group is over it's limit and >>>> reclaim kicks in. >>> the following is an untested work-in-progress patch i happen to have. >>> i'd appreciate it if you take care of it. >>> >> Looks very similar to the patch I have. You seemed to have made much more >> progress than me, I am yet to look at the recent_* statistics. How are the test >> results? Are they close to what you expect? Some comments below > > it's mostly untested as i said above. i'm wondering how to test it. > I did a simple test 1. Run a RSS hungry (touch malloc'ed pages) and dd in the same control group 2. Tweak swappiness to see if the result is desirable Not a complex test, but a good starting point :) -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: memcg swappiness (Re: memo: mem+swap controller) 2008-08-01 6:46 ` Balbir Singh @ 2008-09-09 9:17 ` YAMAMOTO Takashi 2008-09-09 14:07 ` Balbir Singh 0 siblings, 1 reply; 19+ messages in thread From: YAMAMOTO Takashi @ 2008-09-09 9:17 UTC (permalink / raw) To: balbir; +Cc: nishimura, kamezawa.hiroyu, hugh, linux-mm, menage, akpm hi, here's the updated patch. changes from the previous one: - adapt to v2.6.27-rc1-mm1. - implement per-cgroup per-zone recent_scanned and recent_rotated. - when creating a cgroup, inherit the swappiness value from its parent. - fix build w/o memcg. any comments? YAMAMOTO Takashi Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp> --- diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index ee1b2fc..7b1b4e6 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -47,6 +47,7 @@ extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan, int active, int file); extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask); int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem); +extern int mem_cgroup_swappiness(struct mem_cgroup *mem); extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p); @@ -72,6 +73,19 @@ extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem, extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone, int priority, enum lru_list lru); +extern void mem_cgroup_update_scanned(struct mem_cgroup *mem, struct zone *zone, + unsigned long idx, + unsigned long scanned); +extern void mem_cgroup_update_rotated(struct mem_cgroup *mem, struct zone *zone, + unsigned long idx, + unsigned long rotated); +extern void mem_cgroup_get_scan_param(struct mem_cgroup *mem, struct zone *zone, + unsigned long *anon, + unsigned long *file, + unsigned long **recent_scanned, + unsigned long **recent_rotated, + spinlock_t **lock); + #else /* CONFIG_CGROUP_MEM_RES_CTLR */ static inline void page_reset_bad_cgroup(struct page *page) { @@ -163,6 +177,31 @@ static inline long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, { return 0; } + +static inline void mem_cgroup_update_scanned(struct mem_cgroup *mem, + struct zone *zone, + unsigned long idx, + unsigned long scanned) +{ +} + +static inline void mem_cgroup_update_rotated(struct mem_cgroup *mem, + struct zone *zone, + unsigned long idx, + unsigned long rotated) +{ +} + +static inline void mem_cgroup_get_scan_param(struct mem_cgroup *mem, + struct zone *zone, + unsigned long *anon, + unsigned long *file, + unsigned long **recent_scanned, + unsigned long **recent_rotated, + spinlock_t **lock) +{ +} + #endif /* CONFIG_CGROUP_MEM_CONT */ #endif /* _LINUX_MEMCONTROL_H */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 344a477..d4ac8c6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -93,6 +93,8 @@ struct mem_cgroup_per_zone { spinlock_t lru_lock; struct list_head lists[NR_LRU_LISTS]; unsigned long count[NR_LRU_LISTS]; + unsigned long recent_scanned[2]; + unsigned long recent_rotated[2]; }; /* Macro for accessing counter */ #define MEM_CGROUP_ZSTAT(mz, idx) ((mz)->count[(idx)]) @@ -129,6 +131,7 @@ struct mem_cgroup { struct mem_cgroup_lru_info info; int prev_priority; /* for recording reclaim priority */ + unsigned int swappiness; /* swappiness */ /* * statistics. 
*/ @@ -437,14 +440,57 @@ void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem, int priority) long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone, int priority, enum lru_list lru) { - long nr_pages; int nid = zone->zone_pgdat->node_id; int zid = zone_idx(zone); struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid); - nr_pages = MEM_CGROUP_ZSTAT(mz, lru); + return MEM_CGROUP_ZSTAT(mz, lru); +} + +void mem_cgroup_update_scanned(struct mem_cgroup *mem, struct zone *zone, + unsigned long idx, + unsigned long scanned) +{ + int nid = zone->zone_pgdat->node_id; + int zid = zone_idx(zone); + struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid); + + spin_lock(&mz->lru_lock); + mz->recent_scanned[idx] += scanned; + spin_unlock(&mz->lru_lock); +} + +void mem_cgroup_update_rotated(struct mem_cgroup *mem, struct zone *zone, + unsigned long idx, + unsigned long rotated) +{ + int nid = zone->zone_pgdat->node_id; + int zid = zone_idx(zone); + struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid); + + spin_lock(&mz->lru_lock); + mz->recent_rotated[idx] += rotated; + spin_unlock(&mz->lru_lock); +} + +void mem_cgroup_get_scan_param(struct mem_cgroup *mem, struct zone *zone, + unsigned long *anon, + unsigned long *file, + unsigned long **recent_scanned, + unsigned long **recent_rotated, + spinlock_t **lock) +{ + int nid = zone->zone_pgdat->node_id; + int zid = zone_idx(zone); + struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid); - return (nr_pages >> priority); + *anon = MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON) + + MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON); + *file = MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_FILE) + + MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_FILE); + *recent_scanned = mz->recent_scanned; + *recent_rotated = mz->recent_rotated; + *lock = &mz->lru_lock; } unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan, @@ -1002,6 +1048,24 @@ static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft, return 0; } +static int mem_cgroup_swappiness_write(struct cgroup *cont, struct cftype *cft, + u64 val) +{ + struct mem_cgroup *mem = mem_cgroup_from_cont(cont); + + if (val > 100) + return -EINVAL; + mem->swappiness = val; + return 0; +} + +static u64 mem_cgroup_swappiness_read(struct cgroup *cont, struct cftype *cft) +{ + struct mem_cgroup *mem = mem_cgroup_from_cont(cont); + + return mem->swappiness; +} + static struct cftype mem_cgroup_files[] = { { .name = "usage_in_bytes", @@ -1034,8 +1098,21 @@ static struct cftype mem_cgroup_files[] = { .name = "stat", .read_map = mem_control_stat_show, }, + { + .name = "swappiness", + .write_u64 = mem_cgroup_swappiness_write, + .read_u64 = mem_cgroup_swappiness_read, + }, }; +/* XXX probably it's better to move try_to_free_mem_cgroup_pages to + memcontrol.c and kill this */ +int mem_cgroup_swappiness(struct mem_cgroup *mem) +{ + + return mem->swappiness; +} + static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node) { struct mem_cgroup_per_node *pn; @@ -1105,10 +1182,14 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) if (unlikely((cont->parent) == NULL)) { mem = &init_mem_cgroup; page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC); + mem->swappiness = vm_swappiness; } else { mem = mem_cgroup_alloc(); if (!mem) return ERR_PTR(-ENOMEM); + /* XXX hierarchy */ + mem->swappiness = + mem_cgroup_from_cont(cont->parent)->swappiness; } res_counter_init(&mem->res); diff --git a/mm/vmscan.c b/mm/vmscan.c index 85ce427..b42e730 100644 --- 
a/mm/vmscan.c +++ b/mm/vmscan.c @@ -975,6 +975,30 @@ static unsigned long clear_active_flags(struct list_head *page_list, return nr_active; } +static void +update_scanned(struct scan_control *sc, struct zone *zone, int idx, + unsigned long scanned) +{ + + if (scan_global_lru(sc)) { + zone->recent_scanned[idx] += scanned; + } else { + mem_cgroup_update_scanned(sc->mem_cgroup, zone, idx, scanned); + } +} + +static void +update_rotated(struct scan_control *sc, struct zone *zone, int idx, + unsigned long rotated) +{ + + if (scan_global_lru(sc)) { + zone->recent_rotated[idx] += rotated; + } else { + mem_cgroup_update_rotated(sc->mem_cgroup, zone, idx, rotated); + } +} + /** * isolate_lru_page - tries to isolate a page from its LRU list * @page: page to isolate from its LRU list @@ -1075,11 +1099,11 @@ static unsigned long shrink_inactive_list(unsigned long max_scan, if (scan_global_lru(sc)) { zone->pages_scanned += nr_scan; - zone->recent_scanned[0] += count[LRU_INACTIVE_ANON]; - zone->recent_scanned[0] += count[LRU_ACTIVE_ANON]; - zone->recent_scanned[1] += count[LRU_INACTIVE_FILE]; - zone->recent_scanned[1] += count[LRU_ACTIVE_FILE]; } + update_scanned(sc, zone, 0, + count[LRU_INACTIVE_ANON] + count[LRU_ACTIVE_ANON]); + update_scanned(sc, zone, 1, + count[LRU_INACTIVE_FILE] + count[LRU_ACTIVE_FILE]); spin_unlock_irq(&zone->lru_lock); nr_scanned += nr_scan; @@ -1138,9 +1162,10 @@ static unsigned long shrink_inactive_list(unsigned long max_scan, lru = page_lru(page); add_page_to_lru_list(zone, page, lru); mem_cgroup_move_lists(page, lru); - if (PageActive(page) && scan_global_lru(sc)) { + if (PageActive(page)) { int file = !!page_is_file_cache(page); - zone->recent_rotated[file]++; + + update_rotated(sc, zone, file, 1); } if (!pagevec_add(&pvec, page)) { spin_unlock_irq(&zone->lru_lock); @@ -1218,8 +1243,8 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, */ if (scan_global_lru(sc)) { zone->pages_scanned += pgscanned; - zone->recent_scanned[!!file] += pgmoved; } + update_scanned(sc, zone, !!file, pgmoved); if (file) __mod_zone_page_state(zone, NR_ACTIVE_FILE, -pgmoved); @@ -1264,7 +1289,8 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, * to the inactive list. This helps balance scan pressure between * file and anonymous pages in get_scan_ratio. */ - zone->recent_rotated[!!file] += pgmoved; + spin_lock_irq(&zone->lru_lock); + update_rotated(sc, zone, !!file, pgmoved); /* * Now put the pages back on the appropriate [file or anon] inactive @@ -1273,7 +1299,6 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, pagevec_init(&pvec, 1); pgmoved = 0; lru = LRU_BASE + file * LRU_FILE; - spin_lock_irq(&zone->lru_lock); while (!list_empty(&l_inactive)) { page = lru_to_page(&l_inactive); prefetchw_prev_lru_page(page, &l_inactive, flags); @@ -1367,15 +1392,12 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan, static void get_scan_ratio(struct zone *zone, struct scan_control * sc, unsigned long *percent) { - unsigned long anon, file, free; unsigned long anon_prio, file_prio; unsigned long ap, fp; - - anon = zone_page_state(zone, NR_ACTIVE_ANON) + - zone_page_state(zone, NR_INACTIVE_ANON); - file = zone_page_state(zone, NR_ACTIVE_FILE) + - zone_page_state(zone, NR_INACTIVE_FILE); - free = zone_page_state(zone, NR_FREE_PAGES); + unsigned long *recent_scanned; + unsigned long *recent_rotated; + unsigned long anon, file; + spinlock_t *lock; /* If we have no swap space, do not bother scanning anon pages. 
*/ if (nr_swap_pages <= 0) { @@ -1384,36 +1406,56 @@ static void get_scan_ratio(struct zone *zone, struct scan_control * sc, return; } - /* If we have very few page cache pages, force-scan anon pages. */ - if (unlikely(file + free <= zone->pages_high)) { - percent[0] = 100; - percent[1] = 0; - return; + if (scan_global_lru(sc)) { + unsigned long free; + + anon = zone_page_state(zone, NR_ACTIVE_ANON) + + zone_page_state(zone, NR_INACTIVE_ANON); + file = zone_page_state(zone, NR_ACTIVE_FILE) + + zone_page_state(zone, NR_INACTIVE_FILE); + free = zone_page_state(zone, NR_FREE_PAGES); + + /* + * If we have very few page cache pages, force-scan anon pages. + */ + if (unlikely(file + free <= zone->pages_high)) { + percent[0] = 100; + percent[1] = 0; + return; + } + + /* + * OK, so we have swap space and a fair amount of page cache + * pages. We use the recently rotated / recently scanned + * ratios to determine how valuable each cache is. + */ + recent_scanned = zone->recent_scanned; + recent_rotated = zone->recent_rotated; + lock = &zone->lru_lock; + } else { + mem_cgroup_get_scan_param(sc->mem_cgroup, zone, &anon, &file, + &recent_scanned, &recent_rotated, &lock); } /* - * OK, so we have swap space and a fair amount of page cache - * pages. We use the recently rotated / recently scanned - * ratios to determine how valuable each cache is. - * - * Because workloads change over time (and to avoid overflow) - * we keep these statistics as a floating average, which ends - * up weighing recent references more than old ones. - * - * anon in [0], file in [1] - */ - if (unlikely(zone->recent_scanned[0] > anon / 4)) { - spin_lock_irq(&zone->lru_lock); - zone->recent_scanned[0] /= 2; - zone->recent_rotated[0] /= 2; - spin_unlock_irq(&zone->lru_lock); + * Because workloads change over time (and to avoid overflow) + * we keep these statistics as a floating average, which ends + * up weighing recent references more than old ones. 
+ * + * anon in [0], file in [1] + */ + if (unlikely(recent_scanned[0] > anon / 4)) { + spin_lock_irq(lock); + recent_scanned[0] /= 2; + recent_rotated[0] /= 2; + spin_unlock_irq(lock); } - if (unlikely(zone->recent_scanned[1] > file / 4)) { - spin_lock_irq(&zone->lru_lock); - zone->recent_scanned[1] /= 2; - zone->recent_rotated[1] /= 2; - spin_unlock_irq(&zone->lru_lock); + if (unlikely(recent_scanned[1] > file / 4)) { + spin_lock_irq(lock); + recent_scanned[1] /= 2; + recent_rotated[1] /= 2; + spin_unlock_irq(lock); } /* @@ -1428,11 +1470,11 @@ static void get_scan_ratio(struct zone *zone, struct scan_control * sc, * %anon = 100 * ----------- / ----------------- * IO cost * anon + file rotate_sum */ - ap = (anon_prio + 1) * (zone->recent_scanned[0] + 1); - ap /= zone->recent_rotated[0] + 1; + ap = (anon_prio + 1) * (recent_scanned[0] + 1); + ap /= recent_rotated[0] + 1; - fp = (file_prio + 1) * (zone->recent_scanned[1] + 1); - fp /= zone->recent_rotated[1] + 1; + fp = (file_prio + 1) * (recent_scanned[1] + 1); + fp /= recent_rotated[1] + 1; /* Normalize to percentages */ percent[0] = 100 * ap / (ap + fp + 1); @@ -1455,10 +1497,10 @@ static unsigned long shrink_zone(int priority, struct zone *zone, get_scan_ratio(zone, sc, percent); for_each_evictable_lru(l) { - if (scan_global_lru(sc)) { - int file = is_file_lru(l); - int scan; + int file = is_file_lru(l); + unsigned long scan; + if (scan_global_lru(sc)) { scan = zone_page_state(zone, NR_LRU_BASE + l); if (priority) { scan >>= priority; @@ -1476,8 +1518,13 @@ static unsigned long shrink_zone(int priority, struct zone *zone, * but because memory controller hits its limit. * Don't modify zone reclaim related data. */ - nr[l] = mem_cgroup_calc_reclaim(sc->mem_cgroup, zone, + scan = mem_cgroup_calc_reclaim(sc->mem_cgroup, zone, priority, l); + if (priority) { + scan >>= priority; + scan = (scan * percent[file]) / 100; + } + nr[l] = scan; } } @@ -1706,7 +1753,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont, .may_writepage = !laptop_mode, .may_swap = 1, .swap_cluster_max = SWAP_CLUSTER_MAX, - .swappiness = vm_swappiness, + .swappiness = mem_cgroup_swappiness(mem_cont), .order = 0, .mem_cgroup = mem_cont, .isolate_pages = mem_cgroup_isolate_pages, -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
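The control file added by this patch shows up as memory.swappiness in each memory cgroup directory, accepts values from 0 to 100, and a newly created cgroup inherits its parent's value. A minimal user-space sketch of driving it follows; the mount point /cgroups and the group name grp0 are assumptions for illustration only, not anything this patch defines.

/*
 * Minimal usage sketch (not part of the patch): write a swappiness value
 * into the proposed per-cgroup control file.  The path assumes the memory
 * cgroup is mounted at /cgroups and a group "grp0" exists; only the file
 * name and the 0..100 range come from the patch above.
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
        const char *path = "/cgroups/grp0/memory.swappiness";  /* assumed layout */
        unsigned long val = (argc > 1) ? strtoul(argv[1], NULL, 10) : 10;
        FILE *f;

        if (val > 100) {
                /* mem_cgroup_swappiness_write() rejects values above 100 */
                fprintf(stderr, "swappiness must be 0..100\n");
                return 1;
        }
        f = fopen(path, "w");
        if (!f) {
                perror(path);
                return 1;
        }
        fprintf(f, "%lu\n", val);
        return fclose(f) ? 1 : 0;
}

Lowering a group's value this way is the kind of setup used for the "dd" plus RSS-intensive test described in the follow-up below.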
* Re: memcg swappiness (Re: memo: mem+swap controller) 2008-09-09 9:17 ` YAMAMOTO Takashi @ 2008-09-09 14:07 ` Balbir Singh 0 siblings, 0 replies; 19+ messages in thread From: Balbir Singh @ 2008-09-09 14:07 UTC (permalink / raw) To: YAMAMOTO Takashi; +Cc: nishimura, kamezawa.hiroyu, hugh, linux-mm, menage, akpm YAMAMOTO Takashi wrote: > hi, > > here's the updated patch. > > changes from the previous one: > - adapt to v2.6.27-rc1-mm1. I just applied it to 2.6.27-rc5-mm1 and I am beginning to test it. > - implement per-cgroup per-zone recent_scanned and recent_rotated. > - when creating a cgroup, inherit the swappiness value from its parent. Looks good. > - fix build w/o memcg. > > any comments? > I'll review the patches later, but I just tested them by running "dd" and an RSS-intensive application in the same control group. I am not seeing the desired result. Let me debug that further. -- Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: memo: mem+swap controller 2008-07-31 1:15 memo: mem+swap controller KAMEZAWA Hiroyuki 2008-07-31 6:18 ` KOSAKI Motohiro 2008-07-31 6:25 ` Daisuke Nishimura @ 2008-08-01 3:20 ` Balbir Singh 2008-08-01 3:45 ` KAMEZAWA Hiroyuki 2 siblings, 1 reply; 19+ messages in thread From: Balbir Singh @ 2008-08-01 3:20 UTC (permalink / raw) To: KAMEZAWA Hiroyuki; +Cc: nishimura, hugh, linux-mm, menage, Andrew Morton KAMEZAWA Hiroyuki wrote: > Hi, mem+swap controller is suggested by Hugh Dickins and I think it's a great > idea. Its concept is having 2 limits. (please point out if I misunderstand.) > > - memory.limit_in_bytes .... limit memory usage. > - memory.total_limit_in_bytes .... limit memory+swap usage. > > By this, we can avoid excessive use of swap under a cgroup without any bad effect > to global LRU. (in page selection algorithm...overhead will be added, of course) > > Following is state transition and counter handling design memo. > This uses "3" counters to handle above conrrectly. If you have other logic, > please teach me. (and blame me if my diagram is broken.) > > A point is how to handle swap-cache, I think. > (Maybe we need a _big_ change in memcg.) > Could you please describe the big change? What do you have in mind? > == > > state definition > new alloc .... an object is newly allocated > no_swap .... an object with page without swp_entry > swap_cache .... an object with page with swp_entry > disk_swap .... an object without page with swp_entry > freed .... an object is freed (by munmap) > > (*) an object is an enitity which is accoutned, page or swap. > > new alloc -> no_swap <=> swap_cache <=> disk_swap > | | | > freed. <-----------<-------------<----------- > > use 3 counters, no_swap, swap_cache, disk_swap. > > on_memory = no_swap + swap_cache. > total = no_swap + swap_cache + disk_swap > > on_memory is limited by memory.limit_in_bytes > total is limtied by memory.total_limit_in_bytes. > > no_swap swap_cache disk_swap on_memory total > new alloc->no_swap +1 - - +1 +1 > no_swap->swap_cache -1 +1 - - - > swap_cache->no_swap +1 -1 - - - > swap_cache->disk_swap - -1 +1 -1 - > disk_swap->swap_cache - +1 -1 +1 - > no_swap->freed -1 - - -1 -1 > swap_cache->freed - -1 - -1 -1 > disk_swap->freed - - -1 - -1 > > > any comments are welcome. What is the expected behaviour when we exceed memory.total_limit_in_bytes? Can't the memrlimit controller do what you ask for? -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: memo: mem+swap controller 2008-08-01 3:20 ` memo: mem+swap controller Balbir Singh @ 2008-08-01 3:45 ` KAMEZAWA Hiroyuki 0 siblings, 0 replies; 19+ messages in thread From: KAMEZAWA Hiroyuki @ 2008-08-01 3:45 UTC (permalink / raw) To: balbir; +Cc: nishimura, hugh, linux-mm, menage, Andrew Morton On Fri, 01 Aug 2008 08:50:30 +0530 Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > KAMEZAWA Hiroyuki wrote: > > Hi, mem+swap controller is suggested by Hugh Dickins and I think it's a great > > idea. Its concept is having 2 limits. (please point out if I misunderstand.) > > > > - memory.limit_in_bytes .... limit memory usage. > > - memory.total_limit_in_bytes .... limit memory+swap usage. > > > > By this, we can avoid excessive use of swap under a cgroup without any bad effect > > to global LRU. (in page selection algorithm...overhead will be added, of course) > > > > Following is state transition and counter handling design memo. > > This uses "3" counters to handle above conrrectly. If you have other logic, > > please teach me. (and blame me if my diagram is broken.) > > > > A point is how to handle swap-cache, I think. > > (Maybe we need a _big_ change in memcg.) > > > > Could you please describe the big change? What do you have in mind? > Replace res_counter with new counter to handle - 2 or 3 counters and - 2 limits at once. > > == > > > > state definition > > new alloc .... an object is newly allocated > > no_swap .... an object with page without swp_entry > > swap_cache .... an object with page with swp_entry > > disk_swap .... an object without page with swp_entry > > freed .... an object is freed (by munmap) > > > > (*) an object is an enitity which is accoutned, page or swap. > > > > new alloc -> no_swap <=> swap_cache <=> disk_swap > > | | | > > freed. <-----------<-------------<----------- > > > > use 3 counters, no_swap, swap_cache, disk_swap. > > > > on_memory = no_swap + swap_cache. > > total = no_swap + swap_cache + disk_swap > > > > on_memory is limited by memory.limit_in_bytes > > total is limtied by memory.total_limit_in_bytes. > > > > no_swap swap_cache disk_swap on_memory total > > new alloc->no_swap +1 - - +1 +1 > > no_swap->swap_cache -1 +1 - - - > > swap_cache->no_swap +1 -1 - - - > > swap_cache->disk_swap - -1 +1 -1 - > > disk_swap->swap_cache - +1 -1 +1 - > > no_swap->freed -1 - - -1 -1 > > swap_cache->freed - -1 - -1 -1 > > disk_swap->freed - - -1 - -1 > > > > > > any comments are welcome. > > What is the expected behaviour when we exceed memory.total_limit_in_bytes? Just call try_to_free_mem_cgroup_pages() as now. > Can't the memrlimit controller do what you ask for? > Never. Example 1). assume a HPC program which treats very-sparse big matrix and designed to be a process just handles part of it. Example 2) When an Admin tried to use vm.overcommit_memory, he asked 30+ applications on his server "Please let me know what amount of (mmap) memory you'll use." Finally, He couldn't get good answer because of tons of Java and applications. "Really used pages/swaps" can be shown by accounting and he can limit it by his experience. And only "really used" numbers can tell resource usage. Anyway, one of purposes, we archive by cgroup, is sever integlation. To integrate servers, System Admin cannot know what amounts of mmap will he use because proprietaty application software tends to say "very safe" value and cannot handle -ENOMEM well which returns by mmap(). (*) size of mmap() can have vairable numbers by application's configration and workload. 
"Real resource usage" is tend to be set by estimated value which system admin measured. So, it's easy to use than address space size when we integlate several servers at once. But for other purpose, for limiting a user program which is created by himself. memrlimit has enough meaning, I think. He can handle -ENOMEM. Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 19+ messages in thread
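To make the counter replacement discussed above concrete: the suggestion is a single object that tracks the three accounting states quoted earlier in the thread (no_swap, swap_cache, disk_swap) and checks both limits in one place, rather than chaining independent res_counters. The sketch below only illustrates that idea; the structure and function names are invented, no such code exists in the kernel, and counts are kept in pages rather than bytes for brevity.

/*
 * Illustration only: one counter object covering the three accounting
 * states, with the memory limit and the memory+swap limit checked
 * together.  mem_swap_counter, msc_charge_new and msc_pageout are
 * made-up names for this sketch.
 */
#include <linux/spinlock.h>
#include <linux/errno.h>

struct mem_swap_counter {
        spinlock_t      lock;
        unsigned long   no_swap;        /* page on memory, no swp_entry  */
        unsigned long   swap_cache;     /* page on memory with swp_entry */
        unsigned long   disk_swap;      /* swp_entry without a page      */
        unsigned long   mem_limit;      /* caps no_swap + swap_cache     */
        unsigned long   total_limit;    /* caps the sum of all three     */
};

/* "new alloc -> no_swap": charge one page against both limits */
static int msc_charge_new(struct mem_swap_counter *c)
{
        unsigned long on_memory, total;
        int ret = 0;

        spin_lock(&c->lock);
        on_memory = c->no_swap + c->swap_cache;
        total = on_memory + c->disk_swap;
        if (on_memory + 1 > c->mem_limit || total + 1 > c->total_limit)
                ret = -ENOMEM;  /* caller would run reclaim and retry */
        else
                c->no_swap++;
        spin_unlock(&c->lock);
        return ret;
}

/*
 * "swap_cache -> disk_swap": the page leaves memory but its swp_entry
 * stays charged, so on_memory drops while the total is unchanged and no
 * limit check is needed.
 */
static void msc_pageout(struct mem_swap_counter *c)
{
        spin_lock(&c->lock);
        c->swap_cache--;
        c->disk_swap++;
        spin_unlock(&c->lock);
}

In this sketch a swap-cache page is charged against both limits, matching the on_memory and total definitions quoted earlier in the thread.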