linux-mm.kvack.org archive mirror
* memo: mem+swap controller
@ 2008-07-31  1:15 KAMEZAWA Hiroyuki
  2008-07-31  6:18 ` KOSAKI Motohiro
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-07-31  1:15 UTC (permalink / raw)
  To: nishimura; +Cc: kamezawa.hiroyu, balbir, hugh, linux-mm, menage, Andrew Morton

Hi, a mem+swap controller was suggested by Hugh Dickins and I think it's a great
idea. Its concept is to have 2 limits. (please point out if I misunderstand.)

 - memory.limit_in_bytes       .... limit memory usage.
 - memory.total_limit_in_bytes .... limit memory+swap usage.

With this, we can avoid excessive use of swap under a cgroup without any bad effect
on the global LRU. (some overhead will be added to the page selection algorithm, of course)

The following is a design memo for the state transitions and counter handling.
It uses "3" counters to handle the above correctly. If you have other logic,
please teach me. (and blame me if my diagram is broken.)

The key point is how to handle swap cache, I think.
(Maybe we need a _big_ change in memcg.)

==

state definition
  new alloc  .... an object is newly allocated
  no_swap    .... an object with page without swp_entry
  swap_cache .... an object with page with swp_entry
  disk_swap  .... an object without page with swp_entry
  freed      .... an object is freed (by munmap)

(*) an object is an entity which is accounted, a page or a swap entry.

 new alloc ->  no_swap  <=>  swap_cache  <=>  disk_swap
                 |             |                 |
  freed.   <-----------<-------------<-----------

use 3 counters, no_swap, swap_cache, disk_swap.

    on_memory = no_swap + swap_cache.
    total     = no_swap + swap_cache + disk_swap

on_memory is limited by memory.limit_in_bytes
total     is limited by memory.total_limit_in_bytes.

                     no_swap  swap_cache  disk_swap  on_memory  total
new alloc->no_swap     +1         -           -         +1        +1
no_swap->swap_cache    -1        +1           -         -         -
swap_cache->no_swap    +1        -1           -         -         -
swap_cache->disk_swap  -         -1           +1        -1        -
disk_swap->swap_cache  -         +1           -1        +1        -
no_swap->freed         -1        -            -         -1        -1
swap_cache->freed      -         -1           -         -1        -1
disk_swap->freed       -         -            -1        -         -1
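
As a rough illustration of the table, here is a minimal userspace C model of the
3 counters and the transitions (a sketch only; all names below are hypothetical
and this is not the memcg/res_counter code):

==
/* toy model of the 3-counter accounting above; counts are in "pages" */
#include <assert.h>
#include <stdio.h>

struct memsw {
	long no_swap;     /* page mapped, no swp_entry          */
	long swap_cache;  /* page present and has a swp_entry   */
	long disk_swap;   /* swp_entry only, no page in memory  */
	long limit;       /* memory.limit_in_bytes              */
	long total_limit; /* memory.total_limit_in_bytes        */
};

static long on_memory(struct memsw *c) { return c->no_swap + c->swap_cache; }
static long total(struct memsw *c)     { return on_memory(c) + c->disk_swap; }

/* "new alloc -> no_swap": the only transition that raises both sums */
static int charge_new_page(struct memsw *c)
{
	if (on_memory(c) + 1 > c->limit || total(c) + 1 > c->total_limit)
		return -1;              /* caller would reclaim or fail */
	c->no_swap++;
	return 0;
}

/* "no_swap -> swap_cache": neither on_memory nor total changes */
static void add_to_swap_cache(struct memsw *c) { c->no_swap--; c->swap_cache++; }

/* "swap_cache -> disk_swap": on_memory shrinks, total is unchanged */
static void swap_out(struct memsw *c) { c->swap_cache--; c->disk_swap++; }

int main(void)
{
	struct memsw c = { .limit = 2, .total_limit = 4 };

	assert(charge_new_page(&c) == 0);
	assert(charge_new_page(&c) == 0);
	assert(charge_new_page(&c) != 0);   /* hits memory.limit_in_bytes */
	add_to_swap_cache(&c);
	swap_out(&c);                       /* the global LRU can do this freely */
	assert(charge_new_page(&c) == 0);   /* room on memory again */
	printf("on_memory=%ld total=%ld\n", on_memory(&c), total(&c));
	return 0;
}
==

Note that the swap-out transitions only decrease on_memory and never change
total, so the global LRU never hits a limit while pushing pages out.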


any comments are welcome.

Regards,
-Kame





^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: memo: mem+swap controller
  2008-07-31  1:15 memo: mem+swap controller KAMEZAWA Hiroyuki
@ 2008-07-31  6:18 ` KOSAKI Motohiro
  2008-07-31  6:25 ` Daisuke Nishimura
  2008-08-01  3:20 ` memo: mem+swap controller Balbir Singh
  2 siblings, 0 replies; 19+ messages in thread
From: KOSAKI Motohiro @ 2008-07-31  6:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro, nishimura, balbir, hugh, linux-mm, menage,
	Andrew Morton

> on_memory is limited by memory.limit_in_bytes
> total     is limtied by memory.total_limit_in_bytes.

"total" is slightly ambiguity word.
IMHO, new term creation is better.




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: memo: mem+swap controller
  2008-07-31  1:15 memo: mem+swap controller KAMEZAWA Hiroyuki
  2008-07-31  6:18 ` KOSAKI Motohiro
@ 2008-07-31  6:25 ` Daisuke Nishimura
  2008-07-31  6:51   ` KAMEZAWA Hiroyuki
  2008-08-01  3:28   ` Balbir Singh
  2008-08-01  3:20 ` memo: mem+swap controller Balbir Singh
  2 siblings, 2 replies; 19+ messages in thread
From: Daisuke Nishimura @ 2008-07-31  6:25 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, balbir, hugh, linux-mm, menage, Andrew Morton

Hi, Kamezawa-san.

On Thu, 31 Jul 2008 10:15:33 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Hi, mem+swap controller is suggested by Hugh Dickins and I think it's a great
> idea. Its concept is having 2 limits. (please point out if I misunderstand.)
> 
>  - memory.limit_in_bytes       .... limit memory usage.
>  - memory.total_limit_in_bytes .... limit memory+swap usage.
> 
Having thought about it more, I wonder how we can accomplish
"do not use swap in this group".

Setting "limit_in_bytes == total_limit_in_bytes" doesn't achieve it, I think.
"limit_in_bytes = total_limit_in_bytes = 1G" cannot
avoid "memory.usage = 700M, swap.usage = 300M" under memory pressure
from outside the group (and I think this behavior is the difference
between "memory controller + swap controller" and "mem+swap controller").

I think total_limit_in_bytes plus a per-group swappiness (or some flag to
indicate "do not swap out"?) would make more sense.

> By this, we can avoid excessive use of swap under a cgroup without any bad effect
> to global LRU. (in page selection algorithm...overhead will be added, of course)
> 
Sorry, I cannot understand this part.

> Following is state transition and counter handling design memo.
> This uses "3" counters to handle above conrrectly. If you have other logic,
> please teach me. (and blame me if my diagram is broken.)
> 
I don't think counting "disk swap" is a good idea (global Linux
doesn't count it).
Instead, I prefer counting "total swap" (that is, swap entries).

> A point is how to handle swap-cache, I think.
> (Maybe we need a _big_ change in memcg.)
> 
I think swap cache should be counted as both memory and swap,
as global linux does.


Thanks,
Daisuke Nishimura.

> ==
> 
> state definition
>   new alloc  .... an object is newly allocated
>   no_swap    .... an object with page without swp_entry
>   swap_cache .... an object with page with swp_entry
>   disk_swap  .... an object without page with swp_entry
>   freed      .... an object is freed (by munmap)
> 
> (*) an object is an enitity which is accoutned, page or swap.
> 
>  new alloc ->  no_swap  <=>  swap_cache  <=>  disk_swap
>                  |             |                 |
>   freed.   <-----------<-------------<-----------
> 
> use 3 counters, no_swap, swap_cache, disk_swap.
> 
>     on_memory = no_swap + swap_cache.
>     total     = no_swap + swap_cache + disk_swap
> 
> on_memory is limited by memory.limit_in_bytes
> total     is limtied by memory.total_limit_in_bytes.
> 
>                      no_swap  swap_cache  disk_swap  on_memory  total
> new alloc->no_swap     +1         -           -         +1        +1
> no_swap->swap_cache    -1        +1           -         -         -
> swap_cache->no_swap    +1        -1           -         -         -
> swap_cache->disk_swap  -         -1           +1        -1        -
> disk_swap->swap_cache  -         +1           -1        +1        -
> no_swap->freed         -1        -            -         -1        -1
> swap_cache->freed      -         -1           -         -1        -1
> disk_swap->freed       -         -            -1        -         -1
> 
> 
> any comments are welcome.
> 
> Regards,
> -Kame
> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: memo: mem+swap controller
  2008-07-31  6:25 ` Daisuke Nishimura
@ 2008-07-31  6:51   ` KAMEZAWA Hiroyuki
  2008-07-31 13:03     ` Daisuke Nishimura
  2008-07-31 16:31     ` kamezawa.hiroyu
  2008-08-01  3:28   ` Balbir Singh
  1 sibling, 2 replies; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-07-31  6:51 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: balbir, hugh, linux-mm, menage, Andrew Morton

On Thu, 31 Jul 2008 15:25:33 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> Hi, Kamezawa-san.
> 
> On Thu, 31 Jul 2008 10:15:33 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > Hi, mem+swap controller is suggested by Hugh Dickins and I think it's a great
> > idea. Its concept is having 2 limits. (please point out if I misunderstand.)
> > 
> >  - memory.limit_in_bytes       .... limit memory usage.
> >  - memory.total_limit_in_bytes .... limit memory+swap usage.
> > 
> When I've considered more, I wonder how we can accomplish
> "do not use swap in this group".
> 
You can't do that. This direction is to allow overcommit to swap by
counting the pages which are kicked out.


> Setting "limit_in_bytes == total_limit_in_bytes" doesn't meet it, I think.
> "limit_in_bytes = total_limit_in_bytes = 1G" cannot
> avoid "memory.usage = 700M swap.usage = 300M" under memory pressure
> outside of the group(and I think this behavior is the diffrence
> of "memory controller + swap controller" and "mem+swap controller").
> 
> I think total_limit_in_bytes and swappiness(or some flag to indicate
> "do not swap out"?) for each group would make more sense.
> 
"do not swap out" is bad direction. please use mlock().

(*) I have no objection to add a control file to move a page from swap
    to on-memory.

> > By this, we can avoid excessive use of swap under a cgroup without any bad effect
> > to global LRU. (in page selection algorithm...overhead will be added, of course)
> > 
> Sorry, I cannot understand this part.
> 
From the global LRU's view, an anonymous page can be swapped out at any time,
because it never hits the limit.
==
                      no_swap  swap_cache  disk_swap  on_memory  total
no_swap->swap_cache    -1        +1           -         -         -
==
no changes in total.

> > Following is state transition and counter handling design memo.
> > This uses "3" counters to handle above conrrectly. If you have other logic,
> > please teach me. (and blame me if my diagram is broken.)
> > 
> I don't think counting "disk swap" is good idea(global linux
> dosen't count it).
> Instead, I prefer counting "total swap"(that is swap entry).
> 
Maybe my illustration was bad.

total_swap = swap_cache + disk_swap. Yes, I count swp_entries,
but I just divide them into on-memory or not.

This is just a state transition problem. When we count only total_swap,
we cannot avoid double-counting a swap cache page as both memory and swap.


> > A point is how to handle swap-cache, I think.
> > (Maybe we need a _big_ change in memcg.)
> > 
> I think swap cache should be counted as both memory and swap,
> as global linux does.

No. If we allow double counting, we'll see the OOM killer very soon. 

If what you say is
==
                      no_swap  swap   on_memory  total
no_swap->swap_cache             +1       -         +1
==
What happens when the global LRU's swap-out hits the limit?

If what you say is
==
                      no_swap  swap   on_memory  total
no_swap->swap_cache      -1      +1       -        -
==
What happens when the SwapCache is mapped?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: memo: mem+swap controller
  2008-07-31  6:51   ` KAMEZAWA Hiroyuki
@ 2008-07-31 13:03     ` Daisuke Nishimura
  2008-07-31 16:31     ` kamezawa.hiroyu
  1 sibling, 0 replies; 19+ messages in thread
From: Daisuke Nishimura @ 2008-07-31 13:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: nishimura, balbir, hugh, linux-mm, menage, Andrew Morton

> > > By this, we can avoid excessive use of swap under a cgroup without any bad effect
> > > to global LRU. (in page selection algorithm...overhead will be added, of course)
> > > 
> > Sorry, I cannot understand this part.
> > 
> From global LRU's view, anonymous page can be swapped out everytime.
> Because it never hits limit.
> ==
>                       no_swap  swap_cache  disk_swap  on_memory  total
> no_swap->swap_cache    -1        +1           -         -         -
> ==
> no changes in total.
> 
I see. Thanks.

> > > Following is state transition and counter handling design memo.
> > > This uses "3" counters to handle above conrrectly. If you have other logic,
> > > please teach me. (and blame me if my diagram is broken.)
> > > 
> > I don't think counting "disk swap" is good idea(global linux
> > dosen't count it).
> > Instead, I prefer counting "total swap"(that is swap entry).
> > 
> Maybe my illustration is bad. 
> 
> total_swap = swap_cache + disk_swap. Yes, I count swp_entry.
> But just divides it to on-memory or not.
> 
> This is just a state transition problem. When we counting only total_swap,
> we cannot avoid double counting of a swap_cache as memory and as swap.
> 
I agree.
My intention was not to count only total_swap, but to count both
total_swap and swap_cache.

> > > A point is how to handle swap-cache, I think.
> > > (Maybe we need a _big_ change in memcg.)
> > > 
> > I think swap cache should be counted as both memory and swap,
> > as global linux does.
> 
> No. If we allow double counting, we'll see OOM-Killer very soon. 
> 
on_memory = no_swap + swap_cache (your definition)
swap      = swap_cache + disk_swap
total     = no_swap + swap_cache + disk_swap (your definition)

total is NOT "on_memory + swap" (swap_cache is not double counted);
it is on_memory + swap - swap_cache.

So, in the sense of limiting, this is the same as yours.

> If what you say is
> ==
>                       no_swap  swap   on_memory  total
> no_swap->swap_cache             +1       -         +1
> ==
> What happens when global lru's swap_out hits limit ?
> 
> If what you say is
> ==
>                       no_swap  swap   on_memory  total
> no_swap->swap_cache      -1      +1       -        -
> ==
> What happens when SwapCache is mapped ? 
> 
The latter, and I think there would be no difference in the
no_swap/swap_cache/disk_swap counters even when the swap cache is mapped.

(The current memory controller charges a swap cache page only when it is
mapped, but I'm not talking about that here.)


Thanks,
Daisuke Nishimura.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Re: memo: mem+swap controller
  2008-07-31  6:51   ` KAMEZAWA Hiroyuki
  2008-07-31 13:03     ` Daisuke Nishimura
@ 2008-07-31 16:31     ` kamezawa.hiroyu
  2008-08-01  3:05       ` Daisuke Nishimura
  1 sibling, 1 reply; 19+ messages in thread
From: kamezawa.hiroyu @ 2008-07-31 16:31 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: KAMEZAWA Hiroyuki, balbir, hugh, linux-mm, menage, Andrew Morton

>> > > Following is state transition and counter handling design memo.
>> > > This uses "3" counters to handle above conrrectly. If you have other lo
gic,
>> > > please teach me. (and blame me if my diagram is broken.)
>> > > 
>> > I don't think counting "disk swap" is good idea(global linux
>> > dosen't count it).
>> > Instead, I prefer counting "total swap"(that is swap entry).
>> > 
>> Maybe my illustration is bad. 
>> 
>> total_swap = swap_cache + disk_swap. Yes, I count swp_entry.
>> But just divides it to on-memory or not.
>> 
>> This is just a state transition problem. When we counting only total_swap,
>> we cannot avoid double counting of a swap_cache as memory and as swap.
>> 
>I agree.
>My intention was not counting only total_swap, but counting both
>total_swap and swap_cache.
>
At an early stage of the diagram, I just added a total_swap counter.
(total_swap here means # of used swp_entries.)
And I failed to write the diagram ;( Maybe my selection of counters was bad.

If just 2 counters are enough, that's better.

Hmm..
- on_memory  .... # of pages used
- disk_swap  .... # of swp_entries without SwapCache

limit_in_bytes ... limits on_memory
total_limit    ... limits on_memory + disk_swap.

Can this work?
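
If that direction is taken, the charge-side checks could look roughly like this
(a sketch only, with made-up names; the real code would live in
res_counter/memcontrol):

==
/* sketch of the 2-counter variant: on_memory and disk_swap only */
struct memsw2 {
	long on_memory;    /* pages charged, including swap cache pages */
	long disk_swap;    /* swp_entries that have no page in memory   */
	long limit;        /* limit_in_bytes                            */
	long total_limit;  /* total_limit                               */
};

/* page fault / new charge: both limits can be hit here */
static int memsw2_charge_page(struct memsw2 *c)
{
	if (c->on_memory + 1 > c->limit ||
	    c->on_memory + 1 + c->disk_swap > c->total_limit)
		return -1;
	c->on_memory++;
	return 0;
}

/* the page for a swp_entry leaves memory: total stays the same */
static void memsw2_swap_out(struct memsw2 *c)
{
	c->on_memory--;
	c->disk_swap++;
}

/* swap-in: only the memory limit matters, total does not grow */
static int memsw2_swap_in(struct memsw2 *c)
{
	if (c->on_memory + 1 > c->limit)
		return -1;
	c->on_memory++;
	c->disk_swap--;
	return 0;
}

int main(void)
{
	struct memsw2 c = { .limit = 2, .total_limit = 3 };

	memsw2_charge_page(&c);     /* on_memory=1 */
	memsw2_charge_page(&c);     /* on_memory=2, at the memory limit */
	memsw2_swap_out(&c);        /* on_memory=1, disk_swap=1, total=2 */
	return memsw2_swap_in(&c);  /* 0: refaults back within both limits */
}
==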

I'd like to write a sample patch next week.

Thanks,
-Kame





^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: memo: mem+swap controller
  2008-07-31 16:31     ` kamezawa.hiroyu
@ 2008-08-01  3:05       ` Daisuke Nishimura
  0 siblings, 0 replies; 19+ messages in thread
From: Daisuke Nishimura @ 2008-08-01  3:05 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: nishimura, balbir, hugh, linux-mm, menage, Andrew Morton

On Fri, 1 Aug 2008 01:31:41 +0900 (JST), kamezawa.hiroyu@jp.fujitsu.com wrote:
> >> > > Following is state transition and counter handling design memo.
> >> > > This uses "3" counters to handle above conrrectly. If you have other lo
> gic,
> >> > > please teach me. (and blame me if my diagram is broken.)
> >> > > 
> >> > I don't think counting "disk swap" is good idea(global linux
> >> > dosen't count it).
> >> > Instead, I prefer counting "total swap"(that is swap entry).
> >> > 
> >> Maybe my illustration is bad. 
> >> 
> >> total_swap = swap_cache + disk_swap. Yes, I count swp_entry.
> >> But just divides it to on-memory or not.
> >> 
> >> This is just a state transition problem. When we counting only total_swap,
> >> we cannot avoid double counting of a swap_cache as memory and as swap.
> >> 
> >I agree.
> >My intention was not counting only total_swap, but counting both
> >total_swap and swap_cache.
> >
> At early stage of diaglam, I just added total_swap counter.
> (total_swap here means # of used swp_entry.)
> And failed to write diaglam ;( Maybe selection of counters was bad.
> 
> If just 2 counters are enough, it's better.
> 
I think so.

> Hmm..
> - on_memory .... # of pages used
> - disk_swap      .... # of swp_entry without SwapCache
> 
> limit_in_bytes ... limits on_memory
> total_limit    ... limits on_mempry + disk_swap.
> 
> can work ?
> 
Theoretically it works, I think.

But one thing I'm wondering about is how to distinguish
"swap_cache -> freed" from "disk_swap -> freed".

Anyway, this needs more consideration.


Thanks,
Daisuke Nishimura.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: memo: mem+swap controller
  2008-07-31  1:15 memo: mem+swap controller KAMEZAWA Hiroyuki
  2008-07-31  6:18 ` KOSAKI Motohiro
  2008-07-31  6:25 ` Daisuke Nishimura
@ 2008-08-01  3:20 ` Balbir Singh
  2008-08-01  3:45   ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 19+ messages in thread
From: Balbir Singh @ 2008-08-01  3:20 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, hugh, linux-mm, menage, Andrew Morton

KAMEZAWA Hiroyuki wrote:
> Hi, mem+swap controller is suggested by Hugh Dickins and I think it's a great
> idea. Its concept is having 2 limits. (please point out if I misunderstand.)
> 
>  - memory.limit_in_bytes       .... limit memory usage.
>  - memory.total_limit_in_bytes .... limit memory+swap usage.
> 
> By this, we can avoid excessive use of swap under a cgroup without any bad effect
> to global LRU. (in page selection algorithm...overhead will be added, of course)
> 
> Following is state transition and counter handling design memo.
> This uses "3" counters to handle above conrrectly. If you have other logic,
> please teach me. (and blame me if my diagram is broken.)
> 
> A point is how to handle swap-cache, I think.
> (Maybe we need a _big_ change in memcg.)
> 

Could you please describe the big change? What do you have in mind?

> ==
> 
> state definition
>   new alloc  .... an object is newly allocated
>   no_swap    .... an object with page without swp_entry
>   swap_cache .... an object with page with swp_entry
>   disk_swap  .... an object without page with swp_entry
>   freed      .... an object is freed (by munmap)
> 
> (*) an object is an enitity which is accoutned, page or swap.
> 
>  new alloc ->  no_swap  <=>  swap_cache  <=>  disk_swap
>                  |             |                 |
>   freed.   <-----------<-------------<-----------
> 
> use 3 counters, no_swap, swap_cache, disk_swap.
> 
>     on_memory = no_swap + swap_cache.
>     total     = no_swap + swap_cache + disk_swap
> 
> on_memory is limited by memory.limit_in_bytes
> total     is limtied by memory.total_limit_in_bytes.
> 
>                      no_swap  swap_cache  disk_swap  on_memory  total
> new alloc->no_swap     +1         -           -         +1        +1
> no_swap->swap_cache    -1        +1           -         -         -
> swap_cache->no_swap    +1        -1           -         -         -
> swap_cache->disk_swap  -         -1           +1        -1        -
> disk_swap->swap_cache  -         +1           -1        +1        -
> no_swap->freed         -1        -            -         -1        -1
> swap_cache->freed      -         -1           -         -1        -1
> disk_swap->freed       -         -            -1        -         -1
> 
> 
> any comments are welcome.

What is the expected behaviour when we exceed memory.total_limit_in_bytes? Can't
the memrlimit controller do what you ask for?



-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: memo: mem+swap controller
  2008-07-31  6:25 ` Daisuke Nishimura
  2008-07-31  6:51   ` KAMEZAWA Hiroyuki
@ 2008-08-01  3:28   ` Balbir Singh
  2008-08-01  4:02     ` Daisuke Nishimura
  1 sibling, 1 reply; 19+ messages in thread
From: Balbir Singh @ 2008-08-01  3:28 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: KAMEZAWA Hiroyuki, hugh, linux-mm, menage, Andrew Morton

Daisuke Nishimura wrote:
> Hi, Kamezawa-san.
> 
> On Thu, 31 Jul 2008 10:15:33 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> Hi, mem+swap controller is suggested by Hugh Dickins and I think it's a great
>> idea. Its concept is having 2 limits. (please point out if I misunderstand.)
>>
>>  - memory.limit_in_bytes       .... limit memory usage.
>>  - memory.total_limit_in_bytes .... limit memory+swap usage.
>>
> When I've considered more, I wonder how we can accomplish
> "do not use swap in this group".
> 

It's easy: use the memrlimit controller and set the virtual address limit <=
memory.limit_in_bytes. I use that to make sure I never swap out.

> Setting "limit_in_bytes == total_limit_in_bytes" doesn't meet it, I think.
> "limit_in_bytes = total_limit_in_bytes = 1G" cannot
> avoid "memory.usage = 700M swap.usage = 300M" under memory pressure
> outside of the group(and I think this behavior is the diffrence
> of "memory controller + swap controller" and "mem+swap controller").
> 
> I think total_limit_in_bytes and swappiness(or some flag to indicate
> "do not swap out"?) for each group would make more sense.

I do intend to add the swappiness feature soon for control groups.

> 
>> By this, we can avoid excessive use of swap under a cgroup without any bad effect
>> to global LRU. (in page selection algorithm...overhead will be added, of course)
>>
> Sorry, I cannot understand this part.
> 
>> Following is state transition and counter handling design memo.
>> This uses "3" counters to handle above conrrectly. If you have other logic,
>> please teach me. (and blame me if my diagram is broken.)
>>
> I don't think counting "disk swap" is good idea(global linux
> dosen't count it).
> Instead, I prefer counting "total swap"(that is swap entry).
> 
>> A point is how to handle swap-cache, I think.
>> (Maybe we need a _big_ change in memcg.)
>>
> I think swap cache should be counted as both memory and swap,
> as global linux does.
> 
[snip]
-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: memo: mem+swap controller
  2008-08-01  3:20 ` memo: mem+swap controller Balbir Singh
@ 2008-08-01  3:45   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 19+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-01  3:45 UTC (permalink / raw)
  To: balbir; +Cc: nishimura, hugh, linux-mm, menage, Andrew Morton

On Fri, 01 Aug 2008 08:50:30 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> KAMEZAWA Hiroyuki wrote:
> > Hi, mem+swap controller is suggested by Hugh Dickins and I think it's a great
> > idea. Its concept is having 2 limits. (please point out if I misunderstand.)
> > 
> >  - memory.limit_in_bytes       .... limit memory usage.
> >  - memory.total_limit_in_bytes .... limit memory+swap usage.
> > 
> > By this, we can avoid excessive use of swap under a cgroup without any bad effect
> > to global LRU. (in page selection algorithm...overhead will be added, of course)
> > 
> > Following is state transition and counter handling design memo.
> > This uses "3" counters to handle above conrrectly. If you have other logic,
> > please teach me. (and blame me if my diagram is broken.)
> > 
> > A point is how to handle swap-cache, I think.
> > (Maybe we need a _big_ change in memcg.)
> > 
> 
> Could you please describe the big change? What do you have in mind?
> 
Replace res_counter with a new counter which handles
  - 2 or 3 counters and
  - 2 limits
at once (a rough sketch of what I mean is below).
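
Something like the following sketch could be a starting point (names are
hypothetical and pthread locking merely stands in for the res_counter
spinlock; this is not a proposed implementation):

==
/* sketch: a res_counter-like object with several usages and two limits,
 * all checked under one lock so a charge sees a consistent snapshot */
#include <pthread.h>

enum memsw_idx { MEMSW_NO_SWAP, MEMSW_SWAP_CACHE, MEMSW_DISK_SWAP, MEMSW_NR };

struct memsw_counter {
	pthread_mutex_t lock;            /* kernel version would use a spinlock */
	unsigned long long usage[MEMSW_NR];
	unsigned long long limit;        /* memory.limit_in_bytes        */
	unsigned long long total_limit;  /* memory.total_limit_in_bytes  */
};

/* charge 'val' bytes to usage[idx]; -1 means the caller must reclaim */
static int memsw_counter_charge(struct memsw_counter *c, enum memsw_idx idx,
				unsigned long long val)
{
	unsigned long long on_memory, total;
	int ret = 0;

	pthread_mutex_lock(&c->lock);
	on_memory = c->usage[MEMSW_NO_SWAP] + c->usage[MEMSW_SWAP_CACHE];
	total = on_memory + c->usage[MEMSW_DISK_SWAP];

	if ((idx != MEMSW_DISK_SWAP && on_memory + val > c->limit) ||
	    total + val > c->total_limit)
		ret = -1;
	else
		c->usage[idx] += val;
	pthread_mutex_unlock(&c->lock);
	return ret;
}

int main(void)
{
	static struct memsw_counter c = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.limit = 4096, .total_limit = 8192,
	};
	return memsw_counter_charge(&c, MEMSW_NO_SWAP, 4096); /* 0: fits both */
}
==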

> > ==
> > 
> > state definition
> >   new alloc  .... an object is newly allocated
> >   no_swap    .... an object with page without swp_entry
> >   swap_cache .... an object with page with swp_entry
> >   disk_swap  .... an object without page with swp_entry
> >   freed      .... an object is freed (by munmap)
> > 
> > (*) an object is an enitity which is accoutned, page or swap.
> > 
> >  new alloc ->  no_swap  <=>  swap_cache  <=>  disk_swap
> >                  |             |                 |
> >   freed.   <-----------<-------------<-----------
> > 
> > use 3 counters, no_swap, swap_cache, disk_swap.
> > 
> >     on_memory = no_swap + swap_cache.
> >     total     = no_swap + swap_cache + disk_swap
> > 
> > on_memory is limited by memory.limit_in_bytes
> > total     is limtied by memory.total_limit_in_bytes.
> > 
> >                      no_swap  swap_cache  disk_swap  on_memory  total
> > new alloc->no_swap     +1         -           -         +1        +1
> > no_swap->swap_cache    -1        +1           -         -         -
> > swap_cache->no_swap    +1        -1           -         -         -
> > swap_cache->disk_swap  -         -1           +1        -1        -
> > disk_swap->swap_cache  -         +1           -1        +1        -
> > no_swap->freed         -1        -            -         -1        -1
> > swap_cache->freed      -         -1           -         -1        -1
> > disk_swap->freed       -         -            -1        -         -1
> > 
> > 
> > any comments are welcome.
> 
> What is the expected behaviour when we exceed memory.total_limit_in_bytes?
Just call try_to_free_mem_cgroup_pages(), as we do now.

> Can't the memrlimit controller do what you ask for?
> 
Never.

Example 1) Assume an HPC program which handles a very sparse, big matrix
and is designed so that each process handles only part of it.

Example 2) When an admin tried to use vm.overcommit_memory, he asked the
30+ applications on his server, "Please let me know what amount of (mmap)
memory you'll use."
In the end, he couldn't get a good answer because of tons of Java and other applications.

"Really used pages/swap" can be shown by accounting, and he can set the limit
from his own experience. Only "really used" numbers can tell the real resource usage.

Anyway, one of the purposes we achieve with cgroups is server integration.

To integrate servers, a system admin cannot know what amount of mmap each
application will use, because proprietary application software tends to report
a "very safe" value and cannot handle the -ENOMEM returned by mmap() well.
(*) The size of mmap() can vary with the application's configuration
and workload.

"Real resource usage" tends to be set from estimated values which the system admin
has measured, so it is easier to use than address space size when we integrate
several servers at once.

But for another purpose, limiting a user program written by the user himself,
memrlimit has enough meaning, I think. He can handle -ENOMEM.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: memo: mem+swap controller
  2008-08-01  3:28   ` Balbir Singh
@ 2008-08-01  4:02     ` Daisuke Nishimura
  2008-08-01  4:13       ` Balbir Singh
  0 siblings, 1 reply; 19+ messages in thread
From: Daisuke Nishimura @ 2008-08-01  4:02 UTC (permalink / raw)
  To: balbir
  Cc: nishimura, KAMEZAWA Hiroyuki, hugh, linux-mm, menage, Andrew Morton

On Fri, 01 Aug 2008 08:58:07 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> Daisuke Nishimura wrote:
> > Hi, Kamezawa-san.
> > 
> > On Thu, 31 Jul 2008 10:15:33 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> Hi, mem+swap controller is suggested by Hugh Dickins and I think it's a great
> >> idea. Its concept is having 2 limits. (please point out if I misunderstand.)
> >>
> >>  - memory.limit_in_bytes       .... limit memory usage.
> >>  - memory.total_limit_in_bytes .... limit memory+swap usage.
> >>
> > When I've considered more, I wonder how we can accomplish
> > "do not use swap in this group".
> > 
> 
> It's easy use the memrlimit controller and set virtual address limit <=
> memory.limit_in_bytes. I use that to make sure I never swap out.
> 
I don't think it works under memory pressure *outside* of the group,
that is, global memory reclaim.
(I think "limit_in_bytes == total_limit_in_bytes" likewise only works against
memory pressure from *inside* the group.)

> > Setting "limit_in_bytes == total_limit_in_bytes" doesn't meet it, I think.
> > "limit_in_bytes = total_limit_in_bytes = 1G" cannot
> > avoid "memory.usage = 700M swap.usage = 300M" under memory pressure
> > outside of the group(and I think this behavior is the diffrence
> > of "memory controller + swap controller" and "mem+swap controller").
> > 
> > I think total_limit_in_bytes and swappiness(or some flag to indicate
> > "do not swap out"?) for each group would make more sense.
> 
> I do intend to add the swappiness feature soon for control groups.
> 
How does it work?
Does it affect global page reclaim?


Thanks,
Daisuke Nishimura.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: memo: mem+swap controller
  2008-08-01  4:02     ` Daisuke Nishimura
@ 2008-08-01  4:13       ` Balbir Singh
  2008-08-01  4:57         ` Daisuke Nishimura
  2008-08-01  5:07         ` memcg swappiness (Re: memo: mem+swap controller) YAMAMOTO Takashi
  0 siblings, 2 replies; 19+ messages in thread
From: Balbir Singh @ 2008-08-01  4:13 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: KAMEZAWA Hiroyuki, hugh, linux-mm, menage, Andrew Morton

Daisuke Nishimura wrote:
> On Fri, 01 Aug 2008 08:58:07 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> Daisuke Nishimura wrote:
>>> Hi, Kamezawa-san.
>>>
>>> On Thu, 31 Jul 2008 10:15:33 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>>> Hi, mem+swap controller is suggested by Hugh Dickins and I think it's a great
>>>> idea. Its concept is having 2 limits. (please point out if I misunderstand.)
>>>>
>>>>  - memory.limit_in_bytes       .... limit memory usage.
>>>>  - memory.total_limit_in_bytes .... limit memory+swap usage.
>>>>
>>> When I've considered more, I wonder how we can accomplish
>>> "do not use swap in this group".
>>>
>> It's easy use the memrlimit controller and set virtual address limit <=
>> memory.limit_in_bytes. I use that to make sure I never swap out.
>>
> I don't think it works under memory pressure *outside* of the group,
> that is, global memory reclaim.
> (I think "limit_in_bytes == total_limit_in_bytes" also works *inside* memory 
> pressure.)
> 

Yes, but it ensures that the current cgroup does not add to global swap
pressure. Trying to control global pressure from inside a cgroup might be hard,
but I see your intention of isolating the cgroup from global pressure. With
swappiness, it should be possible to purge page cache ahead of creating swap
pressure.

>>> Setting "limit_in_bytes == total_limit_in_bytes" doesn't meet it, I think.
>>> "limit_in_bytes = total_limit_in_bytes = 1G" cannot
>>> avoid "memory.usage = 700M swap.usage = 300M" under memory pressure
>>> outside of the group(and I think this behavior is the diffrence
>>> of "memory controller + swap controller" and "mem+swap controller").
>>>
>>> I think total_limit_in_bytes and swappiness(or some flag to indicate
>>> "do not swap out"?) for each group would make more sense.
>> I do intend to add the swappiness feature soon for control groups.
>>
> How does it work?
> Does it affect global page reclaim?
> 

We have a swappiness parameter in scan_control. Each control group indicates
what it wants its swappiness to be when the control group is over its limit and
reclaim kicks in.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: memo: mem+swap controller
  2008-08-01  4:13       ` Balbir Singh
@ 2008-08-01  4:57         ` Daisuke Nishimura
  2008-08-01  5:07         ` memcg swappiness (Re: memo: mem+swap controller) YAMAMOTO Takashi
  1 sibling, 0 replies; 19+ messages in thread
From: Daisuke Nishimura @ 2008-08-01  4:57 UTC (permalink / raw)
  To: balbir
  Cc: nishimura, KAMEZAWA Hiroyuki, hugh, linux-mm, menage, Andrew Morton

On Fri, 01 Aug 2008 09:43:43 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> Daisuke Nishimura wrote:
> > On Fri, 01 Aug 2008 08:58:07 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> >> Daisuke Nishimura wrote:
> >>> Hi, Kamezawa-san.
> >>>
> >>> On Thu, 31 Jul 2008 10:15:33 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >>>> Hi, mem+swap controller is suggested by Hugh Dickins and I think it's a great
> >>>> idea. Its concept is having 2 limits. (please point out if I misunderstand.)
> >>>>
> >>>>  - memory.limit_in_bytes       .... limit memory usage.
> >>>>  - memory.total_limit_in_bytes .... limit memory+swap usage.
> >>>>
> >>> When I've considered more, I wonder how we can accomplish
> >>> "do not use swap in this group".
> >>>
> >> It's easy use the memrlimit controller and set virtual address limit <=
> >> memory.limit_in_bytes. I use that to make sure I never swap out.
> >>
> > I don't think it works under memory pressure *outside* of the group,
> > that is, global memory reclaim.
> > (I think "limit_in_bytes == total_limit_in_bytes" also works *inside* memory 
> > pressure.)
> > 
> 
> Yes, but it ensures that the current cgroup does not add to global swap
> pressure. Trying to control global pressure from inside a cgroup might be hard,
> but I see your intention of isolating the cgroup from global pressure. With
> swappiness, it should be possible to purge page cache ahead of creating swap
> pressure.
> 
Ah, I don't have a strong requirement for "do not use swap in this group"
under global memory pressure (and it would be a bad policy).
I asked just out of curiosity. Sorry for confusing you.

> >>> Setting "limit_in_bytes == total_limit_in_bytes" doesn't meet it, I think.
> >>> "limit_in_bytes = total_limit_in_bytes = 1G" cannot
> >>> avoid "memory.usage = 700M swap.usage = 300M" under memory pressure
> >>> outside of the group(and I think this behavior is the diffrence
> >>> of "memory controller + swap controller" and "mem+swap controller").
> >>>
> >>> I think total_limit_in_bytes and swappiness(or some flag to indicate
> >>> "do not swap out"?) for each group would make more sense.
> >> I do intend to add the swappiness feature soon for control groups.
> >>
> > How does it work?
> > Does it affect global page reclaim?
> > 
> 
> We have a swappiness parameter in scan_control. Each control group indicates
> what it wants it swappiness to be when the control group is over it's limit and
> reclaim kicks in.
> 
I see. Thank you for your explanation.


Thanks,
Daisuke Nishimura.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* memcg swappiness (Re: memo: mem+swap controller)
  2008-08-01  4:13       ` Balbir Singh
  2008-08-01  4:57         ` Daisuke Nishimura
@ 2008-08-01  5:07         ` YAMAMOTO Takashi
  2008-08-01  5:25           ` Balbir Singh
  1 sibling, 1 reply; 19+ messages in thread
From: YAMAMOTO Takashi @ 2008-08-01  5:07 UTC (permalink / raw)
  To: balbir; +Cc: nishimura, kamezawa.hiroyu, hugh, linux-mm, menage, akpm

hi,

> >> I do intend to add the swappiness feature soon for control groups.
> >>
> > How does it work?
> > Does it affect global page reclaim?
> > 
> 
> We have a swappiness parameter in scan_control. Each control group indicates
> what it wants it swappiness to be when the control group is over it's limit and
> reclaim kicks in.

the following is an untested work-in-progress patch i happen to have.
i'd appreciate it if you take care of it.

YAMAMOTO Takashi


Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
---

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ee1b2fc..7618944 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -47,6 +47,7 @@ extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					int active, int file);
 extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
+extern int mem_cgroup_swappiness(struct mem_cgroup *mem);
 
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fcfa8b4..e1eeb09 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -129,6 +129,7 @@ struct mem_cgroup {
 	struct mem_cgroup_lru_info info;
 
 	int	prev_priority;	/* for recording reclaim priority */
+	unsigned int swappiness;	/* swappiness */
 	/*
 	 * statistics.
 	 */
@@ -437,14 +438,11 @@ void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem, int priority)
 long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
 					int priority, enum lru_list lru)
 {
-	long nr_pages;
 	int nid = zone->zone_pgdat->node_id;
 	int zid = zone_idx(zone);
 	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid);
 
-	nr_pages = MEM_CGROUP_ZSTAT(mz, lru);
-
-	return (nr_pages >> priority);
+	return MEM_CGROUP_ZSTAT(mz, lru);
 }
 
 unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
@@ -963,6 +961,24 @@ static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft,
 	return 0;
 }
 
+static int mem_cgroup_swappiness_write(struct cgroup *cont, struct cftype *cft,
+				       u64 val)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+
+	if (val > 100)
+		return -EINVAL;
+	mem->swappiness = val;
+	return 0;
+}
+
+static u64 mem_cgroup_swappiness_read(struct cgroup *cont, struct cftype *cft)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+  
+	return mem->swappiness;
+}
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -995,8 +1011,21 @@ static struct cftype mem_cgroup_files[] = {
 		.name = "stat",
 		.read_map = mem_control_stat_show,
 	},
+	{
+		.name = "swappiness",
+		.write_u64 = mem_cgroup_swappiness_write,
+		.read_u64 = mem_cgroup_swappiness_read,
+	},
 };
 
+/* XXX probably it's better to move try_to_free_mem_cgroup_pages to
+  memcontrol.c and kill this */
+int mem_cgroup_swappiness(struct mem_cgroup *mem)
+{
+
+	return mem->swappiness;
+}
+
 static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 {
 	struct mem_cgroup_per_node *pn;
@@ -1072,6 +1101,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 			return ERR_PTR(-ENOMEM);
 	}
 
+	mem->swappiness = 60; /* XXX probably should inherit a value from
+				either parent cgroup or global vm_swappiness */
 	res_counter_init(&mem->res);
 
 	for_each_node_state(node, N_POSSIBLE)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6527e04..9f2ddbc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1364,15 +1364,10 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 static void get_scan_ratio(struct zone *zone, struct scan_control * sc,
 					unsigned long *percent)
 {
-	unsigned long anon, file, free;
 	unsigned long anon_prio, file_prio;
 	unsigned long ap, fp;
-
-	anon  = zone_page_state(zone, NR_ACTIVE_ANON) +
-		zone_page_state(zone, NR_INACTIVE_ANON);
-	file  = zone_page_state(zone, NR_ACTIVE_FILE) +
-		zone_page_state(zone, NR_INACTIVE_FILE);
-	free  = zone_page_state(zone, NR_FREE_PAGES);
+	unsigned long recent_scanned[2];
+	unsigned long recent_rotated[2];
 
 	/* If we have no swap space, do not bother scanning anon pages. */
 	if (nr_swap_pages <= 0) {
@@ -1381,36 +1376,59 @@ static void get_scan_ratio(struct zone *zone, struct scan_control * sc,
 		return;
 	}
 
-	/* If we have very few page cache pages, force-scan anon pages. */
-	if (unlikely(file + free <= zone->pages_high)) {
-		percent[0] = 100;
-		percent[1] = 0;
-		return;
-	}
+	if (scan_global_lru(sc)) {
+		unsigned long anon, file, free;
 
-	/*
-         * OK, so we have swap space and a fair amount of page cache
-         * pages.  We use the recently rotated / recently scanned
-         * ratios to determine how valuable each cache is.
-         *
-         * Because workloads change over time (and to avoid overflow)
-         * we keep these statistics as a floating average, which ends
-         * up weighing recent references more than old ones.
-         *
-         * anon in [0], file in [1]
-         */
-	if (unlikely(zone->recent_scanned[0] > anon / 4)) {
-		spin_lock_irq(&zone->lru_lock);
-		zone->recent_scanned[0] /= 2;
-		zone->recent_rotated[0] /= 2;
-		spin_unlock_irq(&zone->lru_lock);
-	}
+		anon  = zone_page_state(zone, NR_ACTIVE_ANON) +
+			zone_page_state(zone, NR_INACTIVE_ANON);
+		file  = zone_page_state(zone, NR_ACTIVE_FILE) +
+			zone_page_state(zone, NR_INACTIVE_FILE);
+		free  = zone_page_state(zone, NR_FREE_PAGES);
 
-	if (unlikely(zone->recent_scanned[1] > file / 4)) {
-		spin_lock_irq(&zone->lru_lock);
-		zone->recent_scanned[1] /= 2;
-		zone->recent_rotated[1] /= 2;
-		spin_unlock_irq(&zone->lru_lock);
+		/*
+		 * If we have very few page cache pages, force-scan anon pages.
+		 */
+		if (unlikely(file + free <= zone->pages_high)) {
+			percent[0] = 100;
+			percent[1] = 0;
+			return;
+		}
+
+		/*
+		 * OK, so we have swap space and a fair amount of page cache
+		 * pages.  We use the recently rotated / recently scanned
+		 * ratios to determine how valuable each cache is.
+		 *
+		 * Because workloads change over time (and to avoid overflow)
+		 * we keep these statistics as a floating average, which ends
+		 * up weighing recent references more than old ones.
+		 *
+		 * anon in [0], file in [1]
+		 */
+		if (unlikely(zone->recent_scanned[0] > anon / 4)) {
+			spin_lock_irq(&zone->lru_lock);
+			zone->recent_scanned[0] /= 2;
+			zone->recent_rotated[0] /= 2;
+			spin_unlock_irq(&zone->lru_lock);
+		}
+
+		if (unlikely(zone->recent_scanned[1] > file / 4)) {
+			spin_lock_irq(&zone->lru_lock);
+			zone->recent_scanned[1] /= 2;
+			zone->recent_rotated[1] /= 2;
+			spin_unlock_irq(&zone->lru_lock);
+		}
+
+		recent_scanned[0] = zone->recent_scanned[0];
+		recent_scanned[1] = zone->recent_scanned[1];
+		recent_rotated[0] = zone->recent_rotated[0];
+		recent_rotated[1] = zone->recent_rotated[1];
+	} else {
+		/* XXX */
+		recent_scanned[0] = 0;
+		recent_scanned[1] = 0;
+		recent_rotated[0] = 0;
+		recent_rotated[1] = 0;
 	}
 
 	/*
@@ -1425,11 +1443,11 @@ static void get_scan_ratio(struct zone *zone, struct scan_control * sc,
 	 * %anon = 100 * ----------- / ----------------- * IO cost
 	 *               anon + file      rotate_sum
 	 */
-	ap = (anon_prio + 1) * (zone->recent_scanned[0] + 1);
-	ap /= zone->recent_rotated[0] + 1;
+	ap = (anon_prio + 1) * (recent_scanned[0] + 1);
+	ap /= recent_rotated[0] + 1;
 
-	fp = (file_prio + 1) * (zone->recent_scanned[1] + 1);
-	fp /= zone->recent_rotated[1] + 1;
+	fp = (file_prio + 1) * (recent_scanned[1] + 1);
+	fp /= recent_rotated[1] + 1;
 
 	/* Normalize to percentages */
 	percent[0] = 100 * ap / (ap + fp + 1);
@@ -1452,9 +1470,10 @@ static unsigned long shrink_zone(int priority, struct zone *zone,
 	get_scan_ratio(zone, sc, percent);
 
 	for_each_evictable_lru(l) {
+		int file = is_file_lru(l);
+		unsigned long scan;
+
 		if (scan_global_lru(sc)) {
-			int file = is_file_lru(l);
-			int scan;
 			/*
 			 * Add one to nr_to_scan just to make sure that the
 			 * kernel will slowly sift through each list.
@@ -1476,8 +1495,13 @@ static unsigned long shrink_zone(int priority, struct zone *zone,
 			 * but because memory controller hits its limit.
 			 * Don't modify zone reclaim related data.
 			 */
-			nr[l] = mem_cgroup_calc_reclaim(sc->mem_cgroup, zone,
+			scan = mem_cgroup_calc_reclaim(sc->mem_cgroup, zone,
 								priority, l);
+			if (priority) {
+				scan >>= priority;
+				scan = (scan * percent[file]) / 100;
+			}
+			nr[l] = scan + 1;
 		}
 	}
 
@@ -1704,7 +1728,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 		.may_writepage = !laptop_mode,
 		.may_swap = 1,
 		.swap_cluster_max = SWAP_CLUSTER_MAX,
-		.swappiness = vm_swappiness,
+		.swappiness = mem_cgroup_swappiness(mem_cont),
 		.order = 0,
 		.mem_cgroup = mem_cont,
 		.isolate_pages = mem_cgroup_isolate_pages,


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: memcg swappiness (Re: memo: mem+swap controller)
  2008-08-01  5:07         ` memcg swappiness (Re: memo: mem+swap controller) YAMAMOTO Takashi
@ 2008-08-01  5:25           ` Balbir Singh
  2008-08-01  6:37             ` YAMAMOTO Takashi
  0 siblings, 1 reply; 19+ messages in thread
From: Balbir Singh @ 2008-08-01  5:25 UTC (permalink / raw)
  To: YAMAMOTO Takashi; +Cc: nishimura, kamezawa.hiroyu, hugh, linux-mm, menage, akpm

YAMAMOTO Takashi wrote:
> hi,
> 
>>>> I do intend to add the swappiness feature soon for control groups.
>>>>
>>> How does it work?
>>> Does it affect global page reclaim?
>>>
>> We have a swappiness parameter in scan_control. Each control group indicates
>> what it wants it swappiness to be when the control group is over it's limit and
>> reclaim kicks in.
> 
> the following is an untested work-in-progress patch i happen to have.
> i'd appreciate it if you take care of it.
> 

Looks very similar to the patch I have. You seem to have made much more
progress than me; I have yet to look at the recent_* statistics. How are the test
results? Are they close to what you expect? Some comments below.

> 
> Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
> ---
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index ee1b2fc..7618944 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -47,6 +47,7 @@ extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>  					int active, int file);
>  extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
>  int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
> +extern int mem_cgroup_swappiness(struct mem_cgroup *mem);
> 
>  extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index fcfa8b4..e1eeb09 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -129,6 +129,7 @@ struct mem_cgroup {
>  	struct mem_cgroup_lru_info info;
> 
>  	int	prev_priority;	/* for recording reclaim priority */
> +	unsigned int swappiness;	/* swappiness */
>  	/*
>  	 * statistics.
>  	 */
> @@ -437,14 +438,11 @@ void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem, int priority)
>  long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
>  					int priority, enum lru_list lru)
>  {
> -	long nr_pages;
>  	int nid = zone->zone_pgdat->node_id;
>  	int zid = zone_idx(zone);
>  	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid);
> 
> -	nr_pages = MEM_CGROUP_ZSTAT(mz, lru);
> -
> -	return (nr_pages >> priority);
> +	return MEM_CGROUP_ZSTAT(mz, lru);
>  }
> 
>  unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> @@ -963,6 +961,24 @@ static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft,
>  	return 0;
>  }
> 
> +static int mem_cgroup_swappiness_write(struct cgroup *cont, struct cftype *cft,
> +				       u64 val)
> +{
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> +
> +	if (val > 100)
> +		return -EINVAL;
> +	mem->swappiness = val;
> +	return 0;
> +}
> +
> +static u64 mem_cgroup_swappiness_read(struct cgroup *cont, struct cftype *cft)
> +{
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> +  
> +	return mem->swappiness;
> +}
> +
>  static struct cftype mem_cgroup_files[] = {
>  	{
>  		.name = "usage_in_bytes",
> @@ -995,8 +1011,21 @@ static struct cftype mem_cgroup_files[] = {
>  		.name = "stat",
>  		.read_map = mem_control_stat_show,
>  	},
> +	{
> +		.name = "swappiness",
> +		.write_u64 = mem_cgroup_swappiness_write,
> +		.read_u64 = mem_cgroup_swappiness_read,
> +	},
>  };
> 
> +/* XXX probably it's better to move try_to_free_mem_cgroup_pages to
> +  memcontrol.c and kill this */
> +int mem_cgroup_swappiness(struct mem_cgroup *mem)
> +{
> +
> +	return mem->swappiness;
> +}
> +
>  static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
>  {
>  	struct mem_cgroup_per_node *pn;
> @@ -1072,6 +1101,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  			return ERR_PTR(-ENOMEM);
>  	}
> 
> +	mem->swappiness = 60; /* XXX probably should inherit a value from
> +				either parent cgroup or global vm_swappiness */

Precisely, we need hierarchy support to inherit from parent. vm_swappiness is
exported, so we can use it here.

>  	res_counter_init(&mem->res);
> 
>  	for_each_node_state(node, N_POSSIBLE)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6527e04..9f2ddbc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1364,15 +1364,10 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
>  static void get_scan_ratio(struct zone *zone, struct scan_control * sc,
>  					unsigned long *percent)
>  {
> -	unsigned long anon, file, free;
>  	unsigned long anon_prio, file_prio;
>  	unsigned long ap, fp;
> -
> -	anon  = zone_page_state(zone, NR_ACTIVE_ANON) +
> -		zone_page_state(zone, NR_INACTIVE_ANON);
> -	file  = zone_page_state(zone, NR_ACTIVE_FILE) +
> -		zone_page_state(zone, NR_INACTIVE_FILE);
> -	free  = zone_page_state(zone, NR_FREE_PAGES);
> +	unsigned long recent_scanned[2];
> +	unsigned long recent_rotated[2];
> 
>  	/* If we have no swap space, do not bother scanning anon pages. */
>  	if (nr_swap_pages <= 0) {
> @@ -1381,36 +1376,59 @@ static void get_scan_ratio(struct zone *zone, struct scan_control * sc,
>  		return;
>  	}
> 
> -	/* If we have very few page cache pages, force-scan anon pages. */
> -	if (unlikely(file + free <= zone->pages_high)) {
> -		percent[0] = 100;
> -		percent[1] = 0;
> -		return;
> -	}
> +	if (scan_global_lru(sc)) {
> +		unsigned long anon, file, free;
> 
> -	/*
> -         * OK, so we have swap space and a fair amount of page cache
> -         * pages.  We use the recently rotated / recently scanned
> -         * ratios to determine how valuable each cache is.
> -         *
> -         * Because workloads change over time (and to avoid overflow)
> -         * we keep these statistics as a floating average, which ends
> -         * up weighing recent references more than old ones.
> -         *
> -         * anon in [0], file in [1]
> -         */
> -	if (unlikely(zone->recent_scanned[0] > anon / 4)) {
> -		spin_lock_irq(&zone->lru_lock);
> -		zone->recent_scanned[0] /= 2;
> -		zone->recent_rotated[0] /= 2;
> -		spin_unlock_irq(&zone->lru_lock);
> -	}
> +		anon  = zone_page_state(zone, NR_ACTIVE_ANON) +
> +			zone_page_state(zone, NR_INACTIVE_ANON);
> +		file  = zone_page_state(zone, NR_ACTIVE_FILE) +
> +			zone_page_state(zone, NR_INACTIVE_FILE);
> +		free  = zone_page_state(zone, NR_FREE_PAGES);
> 
> -	if (unlikely(zone->recent_scanned[1] > file / 4)) {
> -		spin_lock_irq(&zone->lru_lock);
> -		zone->recent_scanned[1] /= 2;
> -		zone->recent_rotated[1] /= 2;
> -		spin_unlock_irq(&zone->lru_lock);
> +		/*
> +		 * If we have very few page cache pages, force-scan anon pages.
> +		 */
> +		if (unlikely(file + free <= zone->pages_high)) {
> +			percent[0] = 100;
> +			percent[1] = 0;
> +			return;
> +		}
> +
> +		/*
> +		 * OK, so we have swap space and a fair amount of page cache
> +		 * pages.  We use the recently rotated / recently scanned
> +		 * ratios to determine how valuable each cache is.
> +		 *
> +		 * Because workloads change over time (and to avoid overflow)
> +		 * we keep these statistics as a floating average, which ends
> +		 * up weighing recent references more than old ones.
> +		 *
> +		 * anon in [0], file in [1]
> +		 */
> +		if (unlikely(zone->recent_scanned[0] > anon / 4)) {
> +			spin_lock_irq(&zone->lru_lock);
> +			zone->recent_scanned[0] /= 2;
> +			zone->recent_rotated[0] /= 2;
> +			spin_unlock_irq(&zone->lru_lock);
> +		}
> +
> +		if (unlikely(zone->recent_scanned[1] > file / 4)) {
> +			spin_lock_irq(&zone->lru_lock);
> +			zone->recent_scanned[1] /= 2;
> +			zone->recent_rotated[1] /= 2;
> +			spin_unlock_irq(&zone->lru_lock);
> +		}
> +
> +		recent_scanned[0] = zone->recent_scanned[0];
> +		recent_scanned[1] = zone->recent_scanned[1];
> +		recent_rotated[0] = zone->recent_rotated[0];
> +		recent_rotated[1] = zone->recent_rotated[1];
> +	} else {
> +		/* XXX */

OK, so this needs to be completed for swappiness to work.

> +		recent_scanned[0] = 0;
> +		recent_scanned[1] = 0;
> +		recent_rotated[0] = 0;
> +		recent_rotated[1] = 0;
>  	}
> 
>  	/*
> @@ -1425,11 +1443,11 @@ static void get_scan_ratio(struct zone *zone, struct scan_control * sc,
>  	 * %anon = 100 * ----------- / ----------------- * IO cost
>  	 *               anon + file      rotate_sum
>  	 */
> -	ap = (anon_prio + 1) * (zone->recent_scanned[0] + 1);
> -	ap /= zone->recent_rotated[0] + 1;
> +	ap = (anon_prio + 1) * (recent_scanned[0] + 1);
> +	ap /= recent_rotated[0] + 1;
> 
> -	fp = (file_prio + 1) * (zone->recent_scanned[1] + 1);
> -	fp /= zone->recent_rotated[1] + 1;
> +	fp = (file_prio + 1) * (recent_scanned[1] + 1);
> +	fp /= recent_rotated[1] + 1;
> 
>  	/* Normalize to percentages */
>  	percent[0] = 100 * ap / (ap + fp + 1);
> @@ -1452,9 +1470,10 @@ static unsigned long shrink_zone(int priority, struct zone *zone,
>  	get_scan_ratio(zone, sc, percent);
> 
>  	for_each_evictable_lru(l) {
> +		int file = is_file_lru(l);
> +		unsigned long scan;
> +
>  		if (scan_global_lru(sc)) {
> -			int file = is_file_lru(l);
> -			int scan;
>  			/*
>  			 * Add one to nr_to_scan just to make sure that the
>  			 * kernel will slowly sift through each list.
> @@ -1476,8 +1495,13 @@ static unsigned long shrink_zone(int priority, struct zone *zone,
>  			 * but because memory controller hits its limit.
>  			 * Don't modify zone reclaim related data.
>  			 */
> -			nr[l] = mem_cgroup_calc_reclaim(sc->mem_cgroup, zone,
> +			scan = mem_cgroup_calc_reclaim(sc->mem_cgroup, zone,
>  								priority, l);
> +			if (priority) {
> +				scan >>= priority;
> +				scan = (scan * percent[file]) / 100;
> +			}
> +			nr[l] = scan + 1;
>  		}
>  	}
> 
> @@ -1704,7 +1728,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>  		.may_writepage = !laptop_mode,
>  		.may_swap = 1,
>  		.swap_cluster_max = SWAP_CLUSTER_MAX,
> -		.swappiness = vm_swappiness,
> +		.swappiness = mem_cgroup_swappiness(mem_cont),
>  		.order = 0,
>  		.mem_cgroup = mem_cont,
>  		.isolate_pages = mem_cgroup_isolate_pages,


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: memcg swappiness (Re: memo: mem+swap controller)
  2008-08-01  5:25           ` Balbir Singh
@ 2008-08-01  6:37             ` YAMAMOTO Takashi
  2008-08-01  6:46               ` Balbir Singh
  0 siblings, 1 reply; 19+ messages in thread
From: YAMAMOTO Takashi @ 2008-08-01  6:37 UTC (permalink / raw)
  To: balbir; +Cc: nishimura, kamezawa.hiroyu, hugh, linux-mm, menage, akpm

> YAMAMOTO Takashi wrote:
> > hi,
> > 
> >>>> I do intend to add the swappiness feature soon for control groups.
> >>>>
> >>> How does it work?
> >>> Does it affect global page reclaim?
> >>>
> >> We have a swappiness parameter in scan_control. Each control group indicates
> >> what it wants its swappiness to be when the control group is over its limit
> >> and reclaim kicks in.
> > 
> > The following is an untested work-in-progress patch I happen to have.
> > I'd appreciate it if you take care of it.
> > 
> 
> Looks very similar to the patch I have. You seem to have made much more
> progress than I have; I have yet to look at the recent_* statistics. How are
> the test results? Are they close to what you expect? Some comments below

It's mostly untested, as I said above. I'm wondering how to test it.

YAMAMOTO Takashi


* Re: memcg swappiness (Re: memo: mem+swap controller)
  2008-08-01  6:37             ` YAMAMOTO Takashi
@ 2008-08-01  6:46               ` Balbir Singh
  2008-09-09  9:17                 ` YAMAMOTO Takashi
  0 siblings, 1 reply; 19+ messages in thread
From: Balbir Singh @ 2008-08-01  6:46 UTC (permalink / raw)
  To: YAMAMOTO Takashi; +Cc: nishimura, kamezawa.hiroyu, hugh, linux-mm, menage, akpm

YAMAMOTO Takashi wrote:
>> YAMAMOTO Takashi wrote:
>>> hi,
>>>
>>>>>> I do intend to add the swappiness feature soon for control groups.
>>>>>>
>>>>> How does it work?
>>>>> Does it affect global page reclaim?
>>>>>
>>>> We have a swappiness parameter in scan_control. Each control group indicates
>>>> what it wants its swappiness to be when the control group is over its limit
>>>> and reclaim kicks in.
>>> The following is an untested work-in-progress patch I happen to have.
>>> I'd appreciate it if you take care of it.
>>>
>> Looks very similar to the patch I have. You seem to have made much more
>> progress than I have; I have yet to look at the recent_* statistics. How are
>> the test results? Are they close to what you expect? Some comments below
> 
> It's mostly untested, as I said above. I'm wondering how to test it.
> 

I did a simple test:

1. Run an RSS-hungry task (touch malloc'ed pages) and dd in the same control group
2. Tweak swappiness to see if the result is desirable

Not a complex test, but a good starting point :)
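
For what it's worth, the RSS-hungry side of such a test can be a toy program
along these lines (purely illustrative, not part of any patch): it malloc()s a
buffer and keeps dirtying one byte per page so its anon pages stay referenced
while dd fills up the page cache.

#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	size_t mb = (argc > 1) ? strtoul(argv[1], NULL, 0) : 512;
	size_t len = mb << 20;
	long pagesize = sysconf(_SC_PAGESIZE);
	char *buf = malloc(len);
	size_t off;

	if (!buf)
		return 1;

	for (;;) {
		/* touch one byte per page so the anon RSS stays hot */
		for (off = 0; off < len; off += pagesize)
			buf[off]++;
	}
	/* not reached; kill the program when the test is done */
	return 0;
}

Run it together with dd inside the same control group and vary the group's
swappiness value to see which of the two workloads gets reclaimed first.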

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: memcg swappiness (Re: memo: mem+swap controller)
  2008-08-01  6:46               ` Balbir Singh
@ 2008-09-09  9:17                 ` YAMAMOTO Takashi
  2008-09-09 14:07                   ` Balbir Singh
  0 siblings, 1 reply; 19+ messages in thread
From: YAMAMOTO Takashi @ 2008-09-09  9:17 UTC (permalink / raw)
  To: balbir; +Cc: nishimura, kamezawa.hiroyu, hugh, linux-mm, menage, akpm

hi,

here's the updated patch.

changes from the previous one:
	- adapt to v2.6.27-rc1-mm1.
	- implement per-cgroup per-zone recent_scanned and recent_rotated.
	- when creating a cgroup, inherit the swappiness value from its parent.
	- fix build w/o memcg.

any comments?

YAMAMOTO Takashi


Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
---

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ee1b2fc..7b1b4e6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -47,6 +47,7 @@ extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					int active, int file);
 extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
+extern int mem_cgroup_swappiness(struct mem_cgroup *mem);
 
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
@@ -72,6 +73,19 @@ extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
 extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
 					int priority, enum lru_list lru);
 
+extern void mem_cgroup_update_scanned(struct mem_cgroup *mem, struct zone *zone,
+					unsigned long idx,
+					unsigned long scanned);
+extern void mem_cgroup_update_rotated(struct mem_cgroup *mem, struct zone *zone,
+					unsigned long idx,
+					unsigned long rotated);
+extern void mem_cgroup_get_scan_param(struct mem_cgroup *mem, struct zone *zone,
+					unsigned long *anon,
+					unsigned long *file,
+					unsigned long **recent_scanned,
+					unsigned long **recent_rotated,
+					spinlock_t **lock);
+
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 static inline void page_reset_bad_cgroup(struct page *page)
 {
@@ -163,6 +177,31 @@ static inline long mem_cgroup_calc_reclaim(struct mem_cgroup *mem,
 {
 	return 0;
 }
+
+static inline void mem_cgroup_update_scanned(struct mem_cgroup *mem,
+					struct zone *zone,
+					unsigned long idx,
+					unsigned long scanned)
+{
+}
+
+static inline void mem_cgroup_update_rotated(struct mem_cgroup *mem,
+					struct zone *zone,
+					unsigned long idx,
+					unsigned long rotated)
+{
+}
+
+static inline void mem_cgroup_get_scan_param(struct mem_cgroup *mem,
+					struct zone *zone,
+					unsigned long *anon,
+					unsigned long *file,
+					unsigned long **recent_scanned,
+					unsigned long **recent_rotated,
+					spinlock_t **lock)
+{
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 344a477..d4ac8c6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -93,6 +93,8 @@ struct mem_cgroup_per_zone {
 	spinlock_t		lru_lock;
 	struct list_head	lists[NR_LRU_LISTS];
 	unsigned long		count[NR_LRU_LISTS];
+	unsigned long		recent_scanned[2];
+	unsigned long		recent_rotated[2];
 };
 /* Macro for accessing counter */
 #define MEM_CGROUP_ZSTAT(mz, idx)	((mz)->count[(idx)])
@@ -129,6 +131,7 @@ struct mem_cgroup {
 	struct mem_cgroup_lru_info info;
 
 	int	prev_priority;	/* for recording reclaim priority */
+	unsigned int swappiness;	/* swappiness */
 	/*
 	 * statistics.
 	 */
@@ -437,14 +440,57 @@ void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem, int priority)
 long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
 					int priority, enum lru_list lru)
 {
-	long nr_pages;
 	int nid = zone->zone_pgdat->node_id;
 	int zid = zone_idx(zone);
 	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid);
 
-	nr_pages = MEM_CGROUP_ZSTAT(mz, lru);
+	return MEM_CGROUP_ZSTAT(mz, lru);
+}
+
+void mem_cgroup_update_scanned(struct mem_cgroup *mem, struct zone *zone,
+					unsigned long idx,
+					unsigned long scanned)
+{
+	int nid = zone->zone_pgdat->node_id;
+	int zid = zone_idx(zone);
+	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid);
+
+	spin_lock(&mz->lru_lock);
+	mz->recent_scanned[idx] += scanned;
+	spin_unlock(&mz->lru_lock);
+}
+
+void mem_cgroup_update_rotated(struct mem_cgroup *mem, struct zone *zone,
+					unsigned long idx,
+					unsigned long rotated)
+{
+	int nid = zone->zone_pgdat->node_id;
+	int zid = zone_idx(zone);
+	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid);
+
+	spin_lock(&mz->lru_lock);
+	mz->recent_rotated[idx] += rotated;
+	spin_unlock(&mz->lru_lock);
+}
+
+void mem_cgroup_get_scan_param(struct mem_cgroup *mem, struct zone *zone,
+					unsigned long *anon,
+					unsigned long *file,
+					unsigned long **recent_scanned,
+					unsigned long **recent_rotated,
+					spinlock_t **lock)
+{
+	int nid = zone->zone_pgdat->node_id;
+	int zid = zone_idx(zone);
+	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(mem, nid, zid);
 
-	return (nr_pages >> priority);
+	*anon = MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON)
+	    + MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
+	*file = MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_FILE)
+	    + MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_FILE);
+	*recent_scanned = mz->recent_scanned;
+	*recent_rotated = mz->recent_rotated;
+	*lock = &mz->lru_lock;
 }
 
 unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
@@ -1002,6 +1048,24 @@ static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft,
 	return 0;
 }
 
+static int mem_cgroup_swappiness_write(struct cgroup *cont, struct cftype *cft,
+				       u64 val)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+
+	if (val > 100)
+		return -EINVAL;
+	mem->swappiness = val;
+	return 0;
+}
+
+static u64 mem_cgroup_swappiness_read(struct cgroup *cont, struct cftype *cft)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+
+	return mem->swappiness;
+}
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -1034,8 +1098,21 @@ static struct cftype mem_cgroup_files[] = {
 		.name = "stat",
 		.read_map = mem_control_stat_show,
 	},
+	{
+		.name = "swappiness",
+		.write_u64 = mem_cgroup_swappiness_write,
+		.read_u64 = mem_cgroup_swappiness_read,
+	},
 };
 
+/* XXX probably it's better to move try_to_free_mem_cgroup_pages to
+  memcontrol.c and kill this */
+int mem_cgroup_swappiness(struct mem_cgroup *mem)
+{
+
+	return mem->swappiness;
+}
+
 static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
 {
 	struct mem_cgroup_per_node *pn;
@@ -1105,10 +1182,14 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	if (unlikely((cont->parent) == NULL)) {
 		mem = &init_mem_cgroup;
 		page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
+		mem->swappiness = vm_swappiness;
 	} else {
 		mem = mem_cgroup_alloc();
 		if (!mem)
 			return ERR_PTR(-ENOMEM);
+		/* XXX hierarchy */
+		mem->swappiness =
+		    mem_cgroup_from_cont(cont->parent)->swappiness;
 	}
 
 	res_counter_init(&mem->res);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 85ce427..b42e730 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -975,6 +975,30 @@ static unsigned long clear_active_flags(struct list_head *page_list,
 	return nr_active;
 }
 
+static void
+update_scanned(struct scan_control *sc, struct zone *zone, int idx,
+    unsigned long scanned)
+{
+
+	if (scan_global_lru(sc)) {
+		zone->recent_scanned[idx] += scanned;
+	} else {
+		mem_cgroup_update_scanned(sc->mem_cgroup, zone, idx, scanned);
+	}
+}
+
+static void
+update_rotated(struct scan_control *sc, struct zone *zone, int idx,
+    unsigned long rotated)
+{
+
+	if (scan_global_lru(sc)) {
+		zone->recent_rotated[idx] += rotated;
+	} else {
+		mem_cgroup_update_rotated(sc->mem_cgroup, zone, idx, rotated);
+	}
+}
+
 /**
  * isolate_lru_page - tries to isolate a page from its LRU list
  * @page: page to isolate from its LRU list
@@ -1075,11 +1099,11 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 
 		if (scan_global_lru(sc)) {
 			zone->pages_scanned += nr_scan;
-			zone->recent_scanned[0] += count[LRU_INACTIVE_ANON];
-			zone->recent_scanned[0] += count[LRU_ACTIVE_ANON];
-			zone->recent_scanned[1] += count[LRU_INACTIVE_FILE];
-			zone->recent_scanned[1] += count[LRU_ACTIVE_FILE];
 		}
+		update_scanned(sc, zone, 0,
+		    count[LRU_INACTIVE_ANON] + count[LRU_ACTIVE_ANON]);
+		update_scanned(sc, zone, 1,
+		    count[LRU_INACTIVE_FILE] + count[LRU_ACTIVE_FILE]);
 		spin_unlock_irq(&zone->lru_lock);
 
 		nr_scanned += nr_scan;
@@ -1138,9 +1162,10 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 			lru = page_lru(page);
 			add_page_to_lru_list(zone, page, lru);
 			mem_cgroup_move_lists(page, lru);
-			if (PageActive(page) && scan_global_lru(sc)) {
+			if (PageActive(page)) {
 				int file = !!page_is_file_cache(page);
-				zone->recent_rotated[file]++;
+
+				update_rotated(sc, zone, file, 1);
 			}
 			if (!pagevec_add(&pvec, page)) {
 				spin_unlock_irq(&zone->lru_lock);
@@ -1218,8 +1243,8 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 	 */
 	if (scan_global_lru(sc)) {
 		zone->pages_scanned += pgscanned;
-		zone->recent_scanned[!!file] += pgmoved;
 	}
+	update_scanned(sc, zone, !!file, pgmoved);
 
 	if (file)
 		__mod_zone_page_state(zone, NR_ACTIVE_FILE, -pgmoved);
@@ -1264,7 +1289,8 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 	 * to the inactive list.  This helps balance scan pressure between
 	 * file and anonymous pages in get_scan_ratio.
 	 */
-	zone->recent_rotated[!!file] += pgmoved;
+	spin_lock_irq(&zone->lru_lock);
+	update_rotated(sc, zone, !!file, pgmoved);
 
 	/*
 	 * Now put the pages back on the appropriate [file or anon] inactive
@@ -1273,7 +1299,6 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 	pagevec_init(&pvec, 1);
 	pgmoved = 0;
 	lru = LRU_BASE + file * LRU_FILE;
-	spin_lock_irq(&zone->lru_lock);
 	while (!list_empty(&l_inactive)) {
 		page = lru_to_page(&l_inactive);
 		prefetchw_prev_lru_page(page, &l_inactive, flags);
@@ -1367,15 +1392,12 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 static void get_scan_ratio(struct zone *zone, struct scan_control * sc,
 					unsigned long *percent)
 {
-	unsigned long anon, file, free;
 	unsigned long anon_prio, file_prio;
 	unsigned long ap, fp;
-
-	anon  = zone_page_state(zone, NR_ACTIVE_ANON) +
-		zone_page_state(zone, NR_INACTIVE_ANON);
-	file  = zone_page_state(zone, NR_ACTIVE_FILE) +
-		zone_page_state(zone, NR_INACTIVE_FILE);
-	free  = zone_page_state(zone, NR_FREE_PAGES);
+	unsigned long *recent_scanned;
+	unsigned long *recent_rotated;
+	unsigned long anon, file;
+	spinlock_t *lock;
 
 	/* If we have no swap space, do not bother scanning anon pages. */
 	if (nr_swap_pages <= 0) {
@@ -1384,36 +1406,56 @@ static void get_scan_ratio(struct zone *zone, struct scan_control * sc,
 		return;
 	}
 
-	/* If we have very few page cache pages, force-scan anon pages. */
-	if (unlikely(file + free <= zone->pages_high)) {
-		percent[0] = 100;
-		percent[1] = 0;
-		return;
+	if (scan_global_lru(sc)) {
+		unsigned long free;
+
+		anon  = zone_page_state(zone, NR_ACTIVE_ANON) +
+			zone_page_state(zone, NR_INACTIVE_ANON);
+		file  = zone_page_state(zone, NR_ACTIVE_FILE) +
+			zone_page_state(zone, NR_INACTIVE_FILE);
+		free  = zone_page_state(zone, NR_FREE_PAGES);
+
+		/*
+		 * If we have very few page cache pages, force-scan anon pages.
+		 */
+		if (unlikely(file + free <= zone->pages_high)) {
+			percent[0] = 100;
+			percent[1] = 0;
+			return;
+		}
+
+		/*
+		 * OK, so we have swap space and a fair amount of page cache
+		 * pages.  We use the recently rotated / recently scanned
+		 * ratios to determine how valuable each cache is.
+		 */
+		recent_scanned = zone->recent_scanned;
+		recent_rotated = zone->recent_rotated;
+		lock = &zone->lru_lock;
+	} else {
+		mem_cgroup_get_scan_param(sc->mem_cgroup, zone, &anon, &file,
+		    &recent_scanned, &recent_rotated, &lock);
 	}
 
 	/*
-         * OK, so we have swap space and a fair amount of page cache
-         * pages.  We use the recently rotated / recently scanned
-         * ratios to determine how valuable each cache is.
-         *
-         * Because workloads change over time (and to avoid overflow)
-         * we keep these statistics as a floating average, which ends
-         * up weighing recent references more than old ones.
-         *
-         * anon in [0], file in [1]
-         */
-	if (unlikely(zone->recent_scanned[0] > anon / 4)) {
-		spin_lock_irq(&zone->lru_lock);
-		zone->recent_scanned[0] /= 2;
-		zone->recent_rotated[0] /= 2;
-		spin_unlock_irq(&zone->lru_lock);
+	 * Because workloads change over time (and to avoid overflow)
+	 * we keep these statistics as a floating average, which ends
+	 * up weighing recent references more than old ones.
+	 *
+	 * anon in [0], file in [1]
+	 */
+	if (unlikely(recent_scanned[0] > anon / 4)) {
+		spin_lock_irq(lock);
+		recent_scanned[0] /= 2;
+		recent_rotated[0] /= 2;
+		spin_unlock_irq(lock);
 	}
 
-	if (unlikely(zone->recent_scanned[1] > file / 4)) {
-		spin_lock_irq(&zone->lru_lock);
-		zone->recent_scanned[1] /= 2;
-		zone->recent_rotated[1] /= 2;
-		spin_unlock_irq(&zone->lru_lock);
+	if (unlikely(recent_scanned[1] > file / 4)) {
+		spin_lock_irq(lock);
+		recent_scanned[1] /= 2;
+		recent_rotated[1] /= 2;
+		spin_unlock_irq(lock);
 	}
 
 	/*
@@ -1428,11 +1470,11 @@ static void get_scan_ratio(struct zone *zone, struct scan_control * sc,
 	 * %anon = 100 * ----------- / ----------------- * IO cost
 	 *               anon + file      rotate_sum
 	 */
-	ap = (anon_prio + 1) * (zone->recent_scanned[0] + 1);
-	ap /= zone->recent_rotated[0] + 1;
+	ap = (anon_prio + 1) * (recent_scanned[0] + 1);
+	ap /= recent_rotated[0] + 1;
 
-	fp = (file_prio + 1) * (zone->recent_scanned[1] + 1);
-	fp /= zone->recent_rotated[1] + 1;
+	fp = (file_prio + 1) * (recent_scanned[1] + 1);
+	fp /= recent_rotated[1] + 1;
 
 	/* Normalize to percentages */
 	percent[0] = 100 * ap / (ap + fp + 1);
@@ -1455,10 +1497,10 @@ static unsigned long shrink_zone(int priority, struct zone *zone,
 	get_scan_ratio(zone, sc, percent);
 
 	for_each_evictable_lru(l) {
-		if (scan_global_lru(sc)) {
-			int file = is_file_lru(l);
-			int scan;
+		int file = is_file_lru(l);
+		unsigned long scan;
 
+		if (scan_global_lru(sc)) {
 			scan = zone_page_state(zone, NR_LRU_BASE + l);
 			if (priority) {
 				scan >>= priority;
@@ -1476,8 +1518,13 @@ static unsigned long shrink_zone(int priority, struct zone *zone,
 			 * but because memory controller hits its limit.
 			 * Don't modify zone reclaim related data.
 			 */
-			nr[l] = mem_cgroup_calc_reclaim(sc->mem_cgroup, zone,
+			scan = mem_cgroup_calc_reclaim(sc->mem_cgroup, zone,
 								priority, l);
+			if (priority) {
+				scan >>= priority;
+				scan = (scan * percent[file]) / 100;
+			}
+			nr[l] = scan;
 		}
 	}
 
@@ -1706,7 +1753,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 		.may_writepage = !laptop_mode,
 		.may_swap = 1,
 		.swap_cluster_max = SWAP_CLUSTER_MAX,
-		.swappiness = vm_swappiness,
+		.swappiness = mem_cgroup_swappiness(mem_cont),
 		.order = 0,
 		.mem_cgroup = mem_cont,
 		.isolate_pages = mem_cgroup_isolate_pages,


* Re: memcg swappiness (Re: memo: mem+swap controller)
  2008-09-09  9:17                 ` YAMAMOTO Takashi
@ 2008-09-09 14:07                   ` Balbir Singh
  0 siblings, 0 replies; 19+ messages in thread
From: Balbir Singh @ 2008-09-09 14:07 UTC (permalink / raw)
  To: YAMAMOTO Takashi; +Cc: nishimura, kamezawa.hiroyu, hugh, linux-mm, menage, akpm

YAMAMOTO Takashi wrote:
> hi,
> 
> here's the updated patch.
> 
> changes from the previous one:
> 	- adapt to v2.6.27-rc1-mm1.

I just applied it to 2.6.27-rc5-mm1 and I am beginning to test it.

> 	- implement per-cgroup per-zone recent_scanned and recent_rotated.
> 	- when creating a cgroup, inherit the swappiness value from its parent.

Looks good

> 	- fix build w/o memcg.
> 

> any comments?
> 

I'll review the patches later, but I just tested them by running "dd" and an
RSS-intensive application in the same control group. I am not seeing the desired
result. Let me debug that further.

-- 
	Balbir


end of thread, other threads:[~2008-09-09 14:07 UTC | newest]

Thread overview: 19+ messages
2008-07-31  1:15 memo: mem+swap controller KAMEZAWA Hiroyuki
2008-07-31  6:18 ` KOSAKI Motohiro
2008-07-31  6:25 ` Daisuke Nishimura
2008-07-31  6:51   ` KAMEZAWA Hiroyuki
2008-07-31 13:03     ` Daisuke Nishimura
2008-07-31 16:31     ` kamezawa.hiroyu
2008-08-01  3:05       ` Daisuke Nishimura
2008-08-01  3:28   ` Balbir Singh
2008-08-01  4:02     ` Daisuke Nishimura
2008-08-01  4:13       ` Balbir Singh
2008-08-01  4:57         ` Daisuke Nishimura
2008-08-01  5:07         ` memcg swappiness (Re: memo: mem+swap controller) YAMAMOTO Takashi
2008-08-01  5:25           ` Balbir Singh
2008-08-01  6:37             ` YAMAMOTO Takashi
2008-08-01  6:46               ` Balbir Singh
2008-09-09  9:17                 ` YAMAMOTO Takashi
2008-09-09 14:07                   ` Balbir Singh
2008-08-01  3:20 ` memo: mem+swap controller Balbir Singh
2008-08-01  3:45   ` KAMEZAWA Hiroyuki
