* memcg: slab control
@ 2009-11-25 23:08 David Rientjes
2009-11-26 1:14 ` KAMEZAWA Hiroyuki
` (2 more replies)
0 siblings, 3 replies; 33+ messages in thread
From: David Rientjes @ 2009-11-25 23:08 UTC (permalink / raw)
To: Balbir Singh, Pavel Emelyanov, KAMEZAWA Hiroyuki
Cc: Suleiman Souhlal, Ying Han, linux-mm
Hi,
I wanted to see what the current ideas are concerning kernel memory
accounting as it relates to the memory controller. Eventually we'll want
the ability to restrict cgroups to a hard slab limit. That'll require
accounting to map slab allocations back to user tasks so that we can
enforce a policy based on the cgroup's aggregated slab usage similar to
how the memory controller currently does for user memory.
Is this currently being thought about within the memcg community? We'd
like to start a discussion and get everybody's requirements and interests
on the table and then become actively involved in the development of such
a feature.
Thanks.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: memcg: slab control
From: KAMEZAWA Hiroyuki @ 2009-11-26 1:14 UTC (permalink / raw)
To: David Rientjes
Cc: Balbir Singh, Pavel Emelyanov, Suleiman Souhlal, Ying Han, linux-mm

On Wed, 25 Nov 2009 15:08:00 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> Hi,
>
> I wanted to see what the current ideas are concerning kernel memory
> accounting as it relates to the memory controller. Eventually we'll want
> the ability to restrict cgroups to a hard slab limit. That'll require
> accounting to map slab allocations back to user tasks so that we can
> enforce a policy based on the cgroup's aggregated slab usage similar to
> how the memory controller currently does for user memory.
>
> Is this currently being thought about within the memcg community?

Not yet. But I always recommend people to implement another memcg (slabcg)
for kernel memory, because:

- It must have a much lower cost than memcg, with good performance and
  scalability. A system-wide shared counter is nonsense.

- Slab is not based on LRU, so another used-memory maintenance scheme
  should be used.

- You can reuse page_cgroup even if slabcg is independent from memcg.

But, considering the user side, not everyone will welcome dividing memcg
and slabcg. So, tying it to the current memcg is ok for me. Like...
==
struct mem_cgroup {
        ....
        ....
        struct slab_cgroup slabcg; (or struct slab_cgroup *slabcg)
}
==

But we have to use another counter and another scheme, another
implementation than memcg, one with good scalability and more fuzzy/lazy
controls. (For example, trigger slab-shrink when usage exceeds a high
watermark, not at the limit.)

Scalable accounting is the first wall in front of us. The second one will
be how-to-shrink. As for information recording, we can reuse page_cgroup
and we'll not have much difficulty.

I hope that, in implementing slabcg, we'll not meet such complicated racy
cases as we met in memcg.

Thanks,
-Kame
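The fuzzy/lazy control described above - charging that never fails at a
hard limit, with slab shrinking triggered once usage crosses a high
watermark - could be sketched roughly as follows. This is a userspace
simulation only; `slab_cgroup` fields, `slabcg_charge`, and
`slabcg_uncharge` are illustrative names, not actual kernel symbols.

```c
/* Userspace sketch of watermark-triggered lazy shrinking, as opposed
 * to failing allocations at a hard limit.  All names are illustrative. */
#include <assert.h>
#include <stddef.h>

struct slab_cgroup {
	size_t usage;
	size_t hiwatermark;
	int shrink_requested;	/* stands in for queuing async shrink work */
};

/* Charging always succeeds; crossing the watermark only schedules
 * reclaim instead of returning a failure to the allocator. */
static void slabcg_charge(struct slab_cgroup *cg, size_t bytes)
{
	cg->usage += bytes;
	if (cg->usage > cg->hiwatermark)
		cg->shrink_requested = 1;
}

static void slabcg_uncharge(struct slab_cgroup *cg, size_t bytes)
{
	cg->usage -= bytes;
}
```

The point of the design choice is that `kmalloc()` callers never see NULL
because of the cgroup; pressure is relieved asynchronously instead.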
* Re: memcg: slab control
From: Balbir Singh @ 2009-11-26 8:50 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: David Rientjes, Pavel Emelyanov, Suleiman Souhlal, Ying Han, linux-mm

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-11-26 10:14:14]:

> Not yet. But I always recommend people to implement another memcg (slabcg)
> for kernel memory, because:
>
> - It must have a much lower cost than memcg, with good performance and
>   scalability. A system-wide shared counter is nonsense.

We've solved those issues mostly! Anyway, I agree that we need another
slabcg. Pavel did some work in that area and posted patches, but they were
mostly based on and limited to SLUB (IIRC).

> - Slab is not based on LRU, so another used-memory maintenance scheme
>   should be used.
>
> - You can reuse page_cgroup even if slabcg is independent from memcg.
>
> But, considering the user side, not everyone will welcome dividing memcg
> and slabcg. So, tying it to the current memcg is ok for me. Like...
> ==
> struct mem_cgroup {
>         ....
>         ....
>         struct slab_cgroup slabcg; (or struct slab_cgroup *slabcg)
> }
> ==
>
> But we have to use another counter and another scheme, another
> implementation than memcg, one with good scalability and more fuzzy/lazy
> controls. (For example, trigger slab-shrink when usage exceeds a high
> watermark, not at the limit.)

That depends on requirements; a high watermark is more like a soft limit
than a hard limit, and there might be a need for hard limits.

> Scalable accounting is the first wall in front of us. The second one will
> be how-to-shrink. As for information recording, we can reuse page_cgroup
> and we'll not have much difficulty.
>
> I hope that, in implementing slabcg, we'll not meet such complicated racy
> cases as we met in memcg.

I think it will be easier because there is no swapping involved, no OOM,
and fewer rare race conditions. There is limited slab reclaim possible,
but otherwise I think it is easier to write a slab controller IMHO.

-- 
Balbir
* Re: memcg: slab control
From: KAMEZAWA Hiroyuki @ 2009-11-26 8:56 UTC (permalink / raw)
To: balbir
Cc: David Rientjes, Pavel Emelyanov, Suleiman Souhlal, Ying Han, linux-mm

On Thu, 26 Nov 2009 14:20:31 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> > - It must have a much lower cost than memcg, with good performance and
> >   scalability. A system-wide shared counter is nonsense.
>
> We've solved those issues mostly!

Yes, but our solution is for page faults. The resolution of slab
allocation is much more fine-grained and frequent.

> Anyway, I agree that we need another slabcg. Pavel did some work in that
> area and posted patches, but they were mostly based on and limited to
> SLUB (IIRC).
>
> That depends on requirements; a high watermark is more like a soft limit
> than a hard limit, and there might be a need for hard limits.

My point is that most of the kernel code cannot work well when
kmalloc(small area) returns NULL.

> I think it will be easier because there is no swapping involved, no OOM,
> and fewer rare race conditions. There is limited slab reclaim possible,
> but otherwise I think it is easier to write a slab controller IMHO.

Yes ;)

Thanks,
-Kame
* Re: memcg: slab control
From: Pavel Emelyanov @ 2009-11-26 9:10 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki, balbir, David Rientjes
Cc: Suleiman Souhlal, Ying Han, linux-mm

>> Anyway, I agree that we need another slabcg. Pavel did some work in that
>> area and posted patches, but they were mostly based on and limited to
>> SLUB (IIRC).

I'm ready to resurrect the patches and port them to slab. But before doing
that we should answer one question.

Consider we have two kmalloc-s in kernel code - one is user-space
triggerable and the other one is not. From my POV we should account for
the former, but not for the latter.

If so - how should we patch the kernel to achieve that goal?

> My point is that most of the kernel code cannot work well when
> kmalloc(small area) returns NULL.

:) That's not so, actually. As our experience shows, the kernel lives fine
when kmalloc returns NULL (this doesn't include drivers, though).
* Re: memcg: slab control
From: KAMEZAWA Hiroyuki @ 2009-11-26 9:33 UTC (permalink / raw)
To: Pavel Emelyanov
Cc: balbir, David Rientjes, Suleiman Souhlal, Ying Han, linux-mm

On Thu, 26 Nov 2009 12:10:52 +0300
Pavel Emelyanov <xemul@parallels.com> wrote:

> :) That's not so, actually. As our experience shows, the kernel lives fine
> when kmalloc returns NULL (this doesn't include drivers, though).

One issue that comes to my mind is that a file system can return -EIO
because kmalloc() returns NULL. The kernel may work fine, but that's
terrible for users ;)

Thanks,
-Kame
* Re: memcg: slab control
From: Pavel Emelyanov @ 2009-11-26 9:56 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: balbir, David Rientjes, Suleiman Souhlal, Ying Han, linux-mm

KAMEZAWA Hiroyuki wrote:
> One issue that comes to my mind is that a file system can return -EIO
> because kmalloc() returns NULL. The kernel may work fine, but that's
> terrible for users ;)

That relates to my question above - we should not account for all
kmalloc-s. In particular - we don't account for bio-s and buffer-head-s,
since their amount is not under direct user control. Yes, you can request
heavy IO, but first, the kernel sends your task to sleep under certain
conditions, and second, bio-s are destroyed as soon as they are finished,
and thus bio-s and buffer-head-s cannot be used to eat all the kernel
memory.
* Re: memcg: slab control
From: Suleiman Souhlal @ 2009-11-26 10:24 UTC (permalink / raw)
To: Pavel Emelyanov
Cc: KAMEZAWA Hiroyuki, balbir, David Rientjes, Ying Han, linux-mm

On 11/26/09, Pavel Emelyanov <xemul@parallels.com> wrote:
> That relates to my question above - we should not account for all
> kmalloc-s. In particular - we don't account for bio-s and buffer-head-s,
> since their amount is not under direct user control. Yes, you can request
> heavy IO, but first, the kernel sends your task to sleep under certain
> conditions, and second, bio-s are destroyed as soon as they are finished,
> and thus bio-s and buffer-head-s cannot be used to eat all the kernel
> memory.

Aren't there patches to make the kernel track which cgroup caused which
disk I/O? If so, it should be possible to charge the bios to the right
cgroup.

Maybe one way to decide which kernel allocations should be accounted
would be to look at the calling context: if the allocation is done in
user context (syscall), then it could be counted towards that user, while
if the allocation is done in interrupt or kthread context, it shouldn't
be accounted.

Of course, this wouldn't be perfect, but it might be a good enough
approximation.

-- Suleiman
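The calling-context heuristic suggested above can be illustrated with a
small userspace simulation. The `in_user_context` flag is a stand-in for
kernel-side checks such as not being in interrupt context and not running
in a kernel thread; all names here are invented for illustration.

```c
/* Sketch: charge an allocation to the caller's cgroup only when it is
 * made from user (syscall) context; interrupt/kthread allocations are
 * left unaccounted.  Illustrative names only. */
#include <assert.h>

struct mem_cgroup {
	long kmem_usage;
};

static void account_kmem(struct mem_cgroup *cg, long bytes,
			 int in_user_context)
{
	if (in_user_context)
		cg->kmem_usage += bytes;
	/* else: interrupt or kthread context - skip accounting */
}
```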
* Re: memcg: slab control
From: Pavel Emelyanov @ 2009-11-26 12:31 UTC (permalink / raw)
To: Suleiman Souhlal
Cc: KAMEZAWA Hiroyuki, balbir, David Rientjes, Ying Han, linux-mm

> Aren't there patches to make the kernel track which cgroup caused which
> disk I/O? If so, it should be possible to charge the bios to the right
> cgroup.
>
> Maybe one way to decide which kernel allocations should be accounted
> would be to look at the calling context: if the allocation is done in
> user context (syscall), then it could be counted towards that user, while
> if the allocation is done in interrupt or kthread context, it shouldn't
> be accounted.
>
> Of course, this wouldn't be perfect, but it might be a good enough
> approximation.

I disagree. Bio-s are allocated in user context for all typical reads
(unless we requested aio) and are allocated either in pdflush context or
(!) in arbitrary task context for writes (e.g. via try_to_free_pages), and
thus such bio/buffer_head accounting would be completely random.

One way to achieve the goal I can propose is the following (it's not
perfect, but just something to start the discussion from).

We implement support for accounting based on a bit on the kmem_cache
structure and mark all kmalloc caches as not-accountable. Then we grep the
kernel to find all kmalloc-s and think - if a kmalloc is to be accounted,
we turn it into kmem_cache_alloc() with a dedicated kmem_cache and mark
that cache as accountable.
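The per-cache bit proposed here might look roughly like the following
userspace sketch. `SLAB_ACCOUNTABLE` and the single global counter are
invented for illustration and are not symbols from any posted patches
(mainline much later gained a similar `SLAB_ACCOUNT` flag, but that
postdates this discussion).

```c
/* Sketch of a per-kmem_cache "accountable" bit: only caches explicitly
 * marked are charged; plain kmalloc caches stay unaccounted. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SLAB_ACCOUNTABLE 0x1u	/* invented flag name */

struct kmem_cache {
	unsigned int flags;
	size_t obj_size;
};

static size_t charged;	/* stands in for the cgroup's slab counter */

static void *cache_alloc(struct kmem_cache *c)
{
	if (c->flags & SLAB_ACCOUNTABLE)
		charged += c->obj_size;
	return malloc(c->obj_size);
}
```

Converting an accountable kmalloc() site would then mean giving it a
dedicated cache created with the flag set, while all other sites keep
hitting the unflagged kmalloc caches.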
* Re: memcg: slab control
From: Suleiman Souhlal @ 2009-11-26 12:52 UTC (permalink / raw)
To: Pavel Emelyanov
Cc: KAMEZAWA Hiroyuki, balbir, David Rientjes, Ying Han, linux-mm

On 11/26/09, Pavel Emelyanov <xemul@parallels.com> wrote:
> I disagree. Bio-s are allocated in user context for all typical reads
> (unless we requested aio) and are allocated either in pdflush context or
> (!) in arbitrary task context for writes (e.g. via try_to_free_pages),
> and thus such bio/buffer_head accounting would be completely random.

Yes, that's why I pointed out that you can account to the right cgroup if
you track who caused the I/O (which, I imagine, should already be done by
the block I/O bandwidth controller, or similar).

For most other allocations, on the other hand, accounting to the current
context should be fine.

> One way to achieve the goal I can propose is the following (it's not
> perfect, but just something to start the discussion from).
>
> We implement support for accounting based on a bit on the kmem_cache
> structure and mark all kmalloc caches as not-accountable. Then we grep
> the kernel to find all kmalloc-s and think - if a kmalloc is to be
> accounted, we turn it into kmem_cache_alloc() with a dedicated
> kmem_cache and mark that cache as accountable.

That sounds like a lot of work. :-)

-- Suleiman
* Re: memcg: slab control
From: Balbir Singh @ 2009-12-01 7:40 UTC (permalink / raw)
To: Suleiman Souhlal
Cc: Pavel Emelyanov, KAMEZAWA Hiroyuki, David Rientjes, Ying Han, linux-mm

* Suleiman Souhlal <suleiman@google.com> [2009-11-26 04:52:00]:

> Yes, that's why I pointed out that you can account to the right cgroup if
> you track who caused the I/O (which, I imagine, should already be done by
> the block I/O bandwidth controller, or similar).

We can do so; we do that for task I/O accounting today, and it works quite
well for the applications I've applied it to.

> For most other allocations, on the other hand, accounting to the current
> context should be fine.

Absolutely! Except when the context is a kernel thread like pdflush/ksm,
etc.

> > We implement support for accounting based on a bit on the kmem_cache
> > structure and mark all kmalloc caches as not-accountable. Then we grep
> > the kernel to find all kmalloc-s and think - if a kmalloc is to be
> > accounted, we turn it into kmem_cache_alloc() with a dedicated
> > kmem_cache and mark that cache as accountable.
>
> That sounds like a lot of work. :-)

Hmm.. yes, it does, but I wonder if there are better alternatives.

-- 
Balbir
* Re: memcg: slab control
From: Ying Han @ 2009-11-27 7:15 UTC (permalink / raw)
To: Pavel Emelyanov
Cc: Suleiman Souhlal, KAMEZAWA Hiroyuki, balbir, David Rientjes, linux-mm

On Thu, Nov 26, 2009 at 4:31 AM, Pavel Emelyanov <xemul@parallels.com> wrote:
> We implement support for accounting based on a bit on the kmem_cache
> structure and mark all kmalloc caches as not-accountable. Then we grep
> the kernel to find all kmalloc-s and think - if a kmalloc is to be
> accounted, we turn it into kmem_cache_alloc() with a dedicated
> kmem_cache and mark that cache as accountable.

Well, it would be nice to count all kernel memory allocations
trigger-able by user programs; kernel memory includes kernel slabs as
well as the pages directly allocated by get_free_pages(). However, some
of the allocations happen asynchronously, in kernel thread or interrupt
context. We cannot charge those to the random process that happens to be
running at the time.

We can either not count those allocations, or do some special treatment
to remember who owns them. In our networking-intensive workload, this
caused us lots of trouble miscounting the networking slabs for incoming
packets. So we made changes in the networking stack which record the
owner of the socket and then charge the slab later using that recorded
information.

--Ying
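The socket-owner scheme described above - record the owning cgroup when
the socket is created in process context, then charge receive-path
allocations to that recorded owner rather than to whatever task happens
to be running in softirq context - might be sketched like this. This is a
userspace simulation with invented names, not code from the patches
mentioned in the thread.

```c
/* Sketch: remember the creating cgroup on the socket, and charge later
 * incoming-packet allocations to that recorded owner. */
#include <assert.h>

struct mem_cgroup {
	long kmem_usage;
};

struct sock {
	struct mem_cgroup *owner;	/* recorded at socket creation */
};

/* Called in process context when the application opens the socket. */
static void sock_set_owner(struct sock *sk, struct mem_cgroup *cg)
{
	sk->owner = cg;
}

/* Called from the receive path (softirq context): charge the recorded
 * owner, not whichever task happens to be current. */
static void charge_rx_alloc(struct sock *sk, long bytes)
{
	sk->owner->kmem_usage += bytes;
}
```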
* Re: memcg: slab control
From: Pavel Emelyanov @ 2009-11-27 9:45 UTC (permalink / raw)
To: Ying Han
Cc: Suleiman Souhlal, KAMEZAWA Hiroyuki, balbir, David Rientjes, linux-mm

Ying Han wrote:
> We can either not count those allocations, or do some special treatment
> to remember who owns them. In our networking-intensive workload, this
> caused us lots of trouble miscounting the networking slabs for incoming
> packets. So we made changes in the networking stack which record the
> owner of the socket and then charge the slab later using that recorded
> information.

That's the same as what we do, but note that simple accounting is not
enough for socket buffers (a.k.a. skb-s). In a perfect world we should
implement memory management similar to what already exists in the
networking code. In particular - sockets should not report errors in case
of kmem shortage, but instead go to a waiting state. Besides, TCP sockets
should adjust the TCP window according to the current kmem controller
state, and this task is quite complex.
* Re: memcg: slab control
From: KOSAKI Motohiro @ 2009-12-01 5:14 UTC (permalink / raw)
To: Ying Han
Cc: kosaki.motohiro, Pavel Emelyanov, Suleiman Souhlal, KAMEZAWA Hiroyuki, balbir, David Rientjes, linux-mm

> We can either not count those allocations, or do some special treatment
> to remember who owns them. In our networking-intensive workload, this
> caused us lots of trouble miscounting the networking slabs for incoming
> packets. So we made changes in the networking stack which record the
> owner of the socket and then charge the slab later using that recorded
> information.

I agree, currently a network-intensive workload is problematic. But I
don't think the network memory management improvement needs to change
generic slab management. Why can't we improve the current tcp/udp memory
accounting? It is a better user interface than "amount of slab memory".
* Re: memcg: slab control 2009-11-26 12:31 ` Pavel Emelyanov 2009-11-26 12:52 ` Suleiman Souhlal 2009-11-27 7:15 ` Ying Han @ 2009-11-30 22:57 ` David Rientjes 2009-12-01 10:31 ` Pavel Emelyanov 2 siblings, 1 reply; 33+ messages in thread From: David Rientjes @ 2009-11-30 22:57 UTC (permalink / raw) To: Pavel Emelyanov Cc: Suleiman Souhlal, KAMEZAWA Hiroyuki, balbir, Ying Han, linux-mm On Thu, 26 Nov 2009, Pavel Emelyanov wrote: > I disagree. Bio-s are allocated in user context for all typical reads > (unless we requested aio) and are allocated either in pdflush context > or (!) in arbitrary task context for writes (e.g. via try_to_free_pages) > and thus such bio/buffer_head accounting will be completely random. > pdflush has been removed, they should all be allocated in process context. > We implement support for accounting based on a bit on a kmem_cache > structure and mark all kmalloc caches as not-accountable. Then we grep > the kernel to find all kmalloc-s and think - if a kmalloc is to be > accounted we turn this into kmem_cache_alloc() with dedicated > kmem_cache and mark it as accountable. > That doesn't work with slab cache merging done in slub. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: memcg: slab control 2009-11-30 22:57 ` David Rientjes @ 2009-12-01 10:31 ` Pavel Emelyanov 2009-12-01 22:29 ` David Rientjes 0 siblings, 1 reply; 33+ messages in thread From: Pavel Emelyanov @ 2009-12-01 10:31 UTC (permalink / raw) To: David Rientjes Cc: Suleiman Souhlal, KAMEZAWA Hiroyuki, balbir, Ying Han, linux-mm David Rientjes wrote: > On Thu, 26 Nov 2009, Pavel Emelyanov wrote: > >> I disagree. Bio-s are allocated in user context for all typical reads >> (unless we requested aio) and are allocated either in pdflush context >> or (!) in arbitrary task context for writes (e.g. via try_to_free_pages) >> and thus such bio/buffer_head accounting will be completely random. >> > > pdflush has been removed, they should all be allocated in process context. OK, but the try_to_free_pages() concern still stands. >> We implement support for accounting based on a bit on a kmem_cache >> structure and mark all kmalloc caches as not-accountable. Then we grep >> the kernel to find all kmalloc-s and think - if a kmalloc is to be >> accounted we turn this into kmem_cache_alloc() with dedicated >> kmem_cache and mark it as accountable. >> > > That doesn't work with slab cache merging done in slub. Surely we'll have to change it a bit. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: memcg: slab control 2009-12-01 10:31 ` Pavel Emelyanov @ 2009-12-01 22:29 ` David Rientjes 0 siblings, 0 replies; 33+ messages in thread From: David Rientjes @ 2009-12-01 22:29 UTC (permalink / raw) To: Pavel Emelyanov Cc: Suleiman Souhlal, KAMEZAWA Hiroyuki, balbir, Ying Han, linux-mm On Tue, 1 Dec 2009, Pavel Emelyanov wrote: > > pdflush has been removed, they should all be allocated in process context. > > OK, but the try_to_free_pages() concern still stands. > Yes, we lack mappings between the per-bdi flusher kthreads back to the user cgroup that initiated the writeback. Since all of these kthreads are descendents of kthreadd, they'll be accounted for within that thread's cgroup unless we pass along the current context. > >> We implement support for accounting based on a bit on a kmem_cache > >> structure and mark all kmalloc caches as not-accountable. Then we grep > >> the kernel to find all kmalloc-s and think - if a kmalloc is to be > >> accounted we turn this into kmem_cache_alloc() with dedicated > >> kmem_cache and mark it as accountable. > >> > > > > That doesn't work with slab cache merging done in slub. > > Surely we'll have to change it a bit. > We can't add a cache flag passed to kmem_cache_create() to identify caches that should be accounted versus those that shouldn't, there are allocs done in both process context and irq context from the same caches and we don't want to inhibit accounting with an additional flag passed to kmem_cache_alloc() if that cache has accounting enabled. A vast majority of slab caches get merged into each other based on object size and alignment with slub; we could prevent that merging by checking the accounting bit for a cache, but that would come at a performance cost (nullifying many hot object allocs), increased fragmentation, and increased memory consumption. 
In other words, we don't want to make it an attribute of the cache
itself; we need to make it an attribute of the context in which the
allocation is done. There are many more cases where we'll want to have
accounting enabled by default, so we'll need to add a bit passed on
alloc to inhibit accounting for those objects.

^ permalink raw reply	[flat|nested] 33+ messages in thread
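A minimal model of what David is arguing for — accountability decided per call site via a flag, with accounting on by default — might look like the following. `MODEL_NOACCOUNT` and `model_alloc` are hypothetical names, not proposed kernel interfaces, and the uncharge-on-free side is deliberately omitted since mapping an object back to its cgroup is a separate problem discussed elsewhere in the thread:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-call flag: accounting defaults to ON, and individual
 * call sites opt out.  Because the decision is carried by the call, not
 * stored in the kmem_cache, it survives slub-style cache merging. */
#define MODEL_NOACCOUNT	0x1	/* invented flag, not a real GFP bit */

struct model_memcg { long kmem_usage; };

static void *model_alloc(struct model_memcg *cg, size_t size, int flags)
{
	void *obj = malloc(size);

	/* charge the caller's cgroup unless this call site opted out */
	if (obj && !(flags & MODEL_NOACCOUNT))
		cg->kmem_usage += size;
	return obj;
}
```

The point of the sketch is only where the bit lives: two allocations from the very same (possibly merged) cache can differ in accountability because the context, not the cache, decides.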
* Re: memcg: slab control 2009-11-26 9:56 ` Pavel Emelyanov 2009-11-26 10:24 ` Suleiman Souhlal @ 2009-12-01 7:36 ` Balbir Singh 2009-12-01 10:40 ` Pavel Emelyanov 1 sibling, 1 reply; 33+ messages in thread From: Balbir Singh @ 2009-12-01 7:36 UTC (permalink / raw) To: Pavel Emelyanov Cc: KAMEZAWA Hiroyuki, David Rientjes, Suleiman Souhlal, Ying Han, linux-mm * Pavel Emelyanov <xemul@parallels.com> [2009-11-26 12:56:01]: > KAMEZAWA Hiroyuki wrote: > > On Thu, 26 Nov 2009 12:10:52 +0300 > > Pavel Emelyanov <xemul@parallels.com> wrote: > > > >>>> Anyway, I agree that we need another > >>>> slabcg, Pavel did some work in that area and posted patches, but they > >>>> were mostly based and limited to SLUB (IIRC). > >> I'm ready to resurrect the patches and port them for slab. > >> But before doing it we should answer one question. > >> > >> Consider we have two kmalloc-s in a kernel code - one is > >> user-space triggerable and the other one is not. From my > >> POV we should account for the former one, but should not > >> for the latter. > >> > >> If so - how should we patch the kernel to achieve that goal? > >> > >>> My point is that most of the kernel codes cannot work well when kmalloc(small area) > >>> returns NULL. > >> :) That's not so actually. As our experience shows kernel lives fine > >> when kmalloc returns NULL (this doesn't include drivers though). > >> > > One issue it comes to my mind is that file system can return -EIO because > > kmalloc() returns NULL. the kernel may work fine but terrible to users ;) > > That relates to my question above - we should not account for all > kmalloc-s. In particular - we don't account for bio-s and buffer-head-s > since their amount is not under direct user control. Yes, you can > request for heavy IO, but first, kernel sends your task to sleep under > certain conditions and second, bio-s are destroyed as soon as they are > finished and thus bio-s and buffer-head-s cannot be used to eat all the > kernel memory. 
Just to understand the context better, is this really a problem? It can
occur only when we really do run out of memory. The idea of using
slabcg + memcg together is good, except for the accounting. I can
repost the percpu counter patches that add fuzziness, along with the
other batch-accounting tricks that Kame uses; we will need to make sure
we can do the same for slab allocations as well.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: memcg: slab control 2009-12-01 7:36 ` Balbir Singh @ 2009-12-01 10:40 ` Pavel Emelyanov 2009-12-01 15:14 ` Balbir Singh 0 siblings, 1 reply; 33+ messages in thread From: Pavel Emelyanov @ 2009-12-01 10:40 UTC (permalink / raw) To: balbir Cc: KAMEZAWA Hiroyuki, David Rientjes, Suleiman Souhlal, Ying Han, linux-mm > Just to understand the context better, is this really a problem. This > can occur when we do really run out of memory. The idea of using > slabcg + memcg together is good, except for our accounting process. I > can repost percpu counter patches that adds fuzziness along with other > tricks that Kame has to do batch accounting, that we will need to > make sure we are able to do with slab allocations as well. > I'm not sure I understand you concern. Can you elaborate, please? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: memcg: slab control 2009-12-01 10:40 ` Pavel Emelyanov @ 2009-12-01 15:14 ` Balbir Singh 2009-12-02 10:14 ` Pavel Emelyanov 0 siblings, 1 reply; 33+ messages in thread From: Balbir Singh @ 2009-12-01 15:14 UTC (permalink / raw) To: Pavel Emelyanov Cc: KAMEZAWA Hiroyuki, David Rientjes, Suleiman Souhlal, Ying Han, linux-mm * Pavel Emelyanov <xemul@parallels.com> [2009-12-01 13:40:30]: > > Just to understand the context better, is this really a problem. This > > can occur when we do really run out of memory. The idea of using > > slabcg + memcg together is good, except for our accounting process. I > > can repost percpu counter patches that adds fuzziness along with other > > tricks that Kame has to do batch accounting, that we will need to > > make sure we are able to do with slab allocations as well. > > > > I'm not sure I understand you concern. Can you elaborate, please? > The concern was mostly accounting when memcg + slabcg are integrated into the same framework. res_counters will need new scalability primitives. -- Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 33+ messages in thread
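The kind of scalability primitive being discussed — percpu batching that trades precision for fewer touches of the shared counter, in the spirit of Kame's batch accounting — can be modeled roughly as follows. All names here are invented for illustration and are not the real res_counter API:

```c
#include <assert.h>

/* Hypothetical model of "fuzzy" batched charging: each CPU prepays a
 * batch from the shared counter and then charges locally, so the
 * contended shared counter is touched once per CHARGE_BATCH bytes
 * instead of once per allocation.  The price is imprecision: up to
 * nr_cpus * CHARGE_BATCH bytes can sit unused in the local stocks. */
#define NR_CPUS_MODEL	4
#define CHARGE_BATCH	4096L

struct res_counter_model {
	long usage;			/* shared, "expensive" counter */
	long cached[NR_CPUS_MODEL];	/* per-cpu prepaid stock */
};

static void model_charge(struct res_counter_model *rc, int cpu, long bytes)
{
	if (rc->cached[cpu] < bytes) {
		/* refill: one global update covers a whole batch */
		rc->usage += CHARGE_BATCH;
		rc->cached[cpu] += CHARGE_BATCH;
	}
	rc->cached[cpu] -= bytes;	/* fast path: per-cpu only */
}
```

A limit check against `usage` is then conservative by at most the cached slack, which is why the thread talks about hiwatermark-style "fuzzy/lazy" controls rather than an exact limit.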
* Re: memcg: slab control 2009-12-01 15:14 ` Balbir Singh @ 2009-12-02 10:14 ` Pavel Emelyanov 2009-12-02 10:19 ` Balbir Singh 0 siblings, 1 reply; 33+ messages in thread From: Pavel Emelyanov @ 2009-12-02 10:14 UTC (permalink / raw) To: balbir Cc: KAMEZAWA Hiroyuki, David Rientjes, Suleiman Souhlal, Ying Han, linux-mm Balbir Singh wrote: > * Pavel Emelyanov <xemul@parallels.com> [2009-12-01 13:40:30]: > >>> Just to understand the context better, is this really a problem. This >>> can occur when we do really run out of memory. The idea of using >>> slabcg + memcg together is good, except for our accounting process. I >>> can repost percpu counter patches that adds fuzziness along with other >>> tricks that Kame has to do batch accounting, that we will need to >>> make sure we are able to do with slab allocations as well. >>> >> I'm not sure I understand you concern. Can you elaborate, please? >> > > The concern was mostly accounting when memcg + slabcg are integrated > into the same framework. res_counters will need new scalability > primitives. > I see. I think the best we can do here is start with a separate controller. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: memcg: slab control 2009-12-02 10:14 ` Pavel Emelyanov @ 2009-12-02 10:19 ` Balbir Singh 2009-12-02 10:51 ` Pavel Emelyanov 0 siblings, 1 reply; 33+ messages in thread From: Balbir Singh @ 2009-12-02 10:19 UTC (permalink / raw) To: Pavel Emelyanov Cc: KAMEZAWA Hiroyuki, David Rientjes, Suleiman Souhlal, Ying Han, linux-mm * Pavel Emelyanov <xemul@parallels.com> [2009-12-02 13:14:15]: > Balbir Singh wrote: > > * Pavel Emelyanov <xemul@parallels.com> [2009-12-01 13:40:30]: > > > >>> Just to understand the context better, is this really a problem. This > >>> can occur when we do really run out of memory. The idea of using > >>> slabcg + memcg together is good, except for our accounting process. I > >>> can repost percpu counter patches that adds fuzziness along with other > >>> tricks that Kame has to do batch accounting, that we will need to > >>> make sure we are able to do with slab allocations as well. > >>> > >> I'm not sure I understand you concern. Can you elaborate, please? > >> > > > > The concern was mostly accounting when memcg + slabcg are integrated > > into the same framework. res_counters will need new scalability > > primitives. > > > > I see. I think the best we can do here is start with a separate controller. > I would think so as well, but setting up independent limits might be a challenge, how does the user really estimate the amount of kernel memory needed? This is the same problem that David posted sometime back. -- Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: memcg: slab control
  2009-12-02 10:19 ` Balbir Singh
@ 2009-12-02 10:51   ` Pavel Emelyanov
  0 siblings, 0 replies; 33+ messages in thread
From: Pavel Emelyanov @ 2009-12-02 10:51 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, David Rientjes, Suleiman Souhlal, Ying Han, linux-mm

Balbir Singh wrote:
> * Pavel Emelyanov <xemul@parallels.com> [2009-12-02 13:14:15]:
>
>> Balbir Singh wrote:
>>> * Pavel Emelyanov <xemul@parallels.com> [2009-12-01 13:40:30]:
>>>
>>>>> Just to understand the context better, is this really a problem. This
>>>>> can occur when we do really run out of memory. The idea of using
>>>>> slabcg + memcg together is good, except for our accounting process. I
>>>>> can repost percpu counter patches that adds fuzziness along with other
>>>>> tricks that Kame has to do batch accounting, that we will need to
>>>>> make sure we are able to do with slab allocations as well.
>>>>>
>>>> I'm not sure I understand you concern. Can you elaborate, please?
>>>>
>>> The concern was mostly accounting when memcg + slabcg are integrated
>>> into the same framework. res_counters will need new scalability
>>> primitives.
>>>
>> I see. I think the best we can do here is start with a separate controller.
>>
>
> I would think so as well, but setting up independent limits might be a
> challenge, how does the user really estimate the amount of kernel
> memory needed? This is the same problem that David posted sometime
> back.

I agree with you, but note that memcg consists of several parts, and
the question "where to account the bytes" is quite independent from
"which allocations to account" and "where to get the memcg context
from on kfree" ;)

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: memcg: slab control 2009-11-26 9:10 ` Pavel Emelyanov 2009-11-26 9:33 ` KAMEZAWA Hiroyuki @ 2009-11-30 22:55 ` David Rientjes 2009-12-01 10:39 ` Pavel Emelyanov 1 sibling, 1 reply; 33+ messages in thread From: David Rientjes @ 2009-11-30 22:55 UTC (permalink / raw) To: Pavel Emelyanov Cc: KAMEZAWA Hiroyuki, balbir, Suleiman Souhlal, Ying Han, linux-mm On Thu, 26 Nov 2009, Pavel Emelyanov wrote: > I'm ready to resurrect the patches and port them for slab. > But before doing it we should answer one question. > Do you have a pointer to your latest implementation that you proposed for slab? > Consider we have two kmalloc-s in a kernel code - one is > user-space triggerable and the other one is not. From my > POV we should account for the former one, but should not > for the latter. > > If so - how should we patch the kernel to achieve that goal? > I think all slab allocations should be accounted for based on current's memcg other than those done in hardirq context, annotating slab allocations doesn't seem scalable. Whether the accounting is done on a task level or cgroup level isn't really a problem for us since we don't move tasks amongst cgroups. I imagine there've been previous restrictions on that put into place with the memcg so this doesn't seem like a slabcg-specific requirement anyway. The problem on the freeing side is mapping the object back to the cgroup that allocated it. We'd also need to map the object to the context in which it was allocated to determine whether we should decrement the counter or not. How do you propose doing that without a considerable overhead in memory consumption, fastpath branch, and cache cold slabcg lookups? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: memcg: slab control 2009-11-30 22:55 ` David Rientjes @ 2009-12-01 10:39 ` Pavel Emelyanov 0 siblings, 0 replies; 33+ messages in thread From: Pavel Emelyanov @ 2009-12-01 10:39 UTC (permalink / raw) To: David Rientjes Cc: KAMEZAWA Hiroyuki, balbir, Suleiman Souhlal, Ying Han, linux-mm David Rientjes wrote: > On Thu, 26 Nov 2009, Pavel Emelyanov wrote: > >> I'm ready to resurrect the patches and port them for slab. >> But before doing it we should answer one question. >> > > Do you have a pointer to your latest implementation that you proposed for > slab? I believe this is the one: https://lists.linux-foundation.org/pipermail/containers/2007-September/007481.html >> Consider we have two kmalloc-s in a kernel code - one is >> user-space triggerable and the other one is not. From my >> POV we should account for the former one, but should not >> for the latter. >> >> If so - how should we patch the kernel to achieve that goal? >> > > I think all slab allocations should be accounted for based on current's > memcg other than those done in hardirq context, annotating slab > allocations doesn't seem scalable. Whether the accounting is done on a > task level or cgroup level isn't really a problem for us since we don't > move tasks amongst cgroups. I imagine there've been previous restrictions > on that put into place with the memcg so this doesn't seem like a > slabcg-specific requirement anyway. > > The problem on the freeing side is mapping the object back to the cgroup > that allocated it. We'd also need to map the object to the context in > which it was allocated to determine whether we should decrement the > counter or not. How do you propose doing that without a considerable > overhead in memory consumption, fastpath branch, and cache cold slabcg > lookups? That's the biggest problem. Generally speaking - no other way rather than store additional pointer. 
In some situations you can rely on the cgroup of the task in whose
context an object is being freed, but in that case, once you move a
task to another cgroup, your accounting is screwed.

^ permalink raw reply	[flat|nested] 33+ messages in thread
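Pavel's "store an additional pointer" answer can be modeled by hiding an owner header in front of each accounted object, so the uncharge on free never depends on current's cgroup. The header-per-object layout below is purely illustrative; a real implementation would more likely keep the pointer per slab page, page_cgroup-style, to avoid per-object overhead:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model: each accounted object carries a hidden header
 * recording who was charged, so the free path can uncharge the right
 * cgroup even when it runs in another task's (or no task's) context. */
struct model_memcg { long kmem_usage; };

struct obj_header {
	struct model_memcg *owner;
	size_t size;
};

static void *model_kmalloc(struct model_memcg *cg, size_t size)
{
	struct obj_header *h = malloc(sizeof(*h) + size);

	if (!h)
		return NULL;
	h->owner = cg;
	h->size = size;
	cg->kmem_usage += size;
	return h + 1;			/* caller sees only the payload */
}

static void model_kfree(void *obj)
{
	struct obj_header *h = (struct obj_header *)obj - 1;

	h->owner->kmem_usage -= h->size;  /* uncharge the recorded owner */
	free(h);
}
```

Because the owner travels with the object, moving the task between cgroups after the allocation does not corrupt the accounting — which is exactly the failure mode of the rely-on-current alternative.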
* Re: memcg: slab control 2009-11-26 8:50 ` Balbir Singh 2009-11-26 8:56 ` KAMEZAWA Hiroyuki @ 2009-11-26 10:13 ` Suleiman Souhlal 2009-11-30 9:17 ` Balbir Singh 1 sibling, 1 reply; 33+ messages in thread From: Suleiman Souhlal @ 2009-11-26 10:13 UTC (permalink / raw) To: balbir Cc: KAMEZAWA Hiroyuki, David Rientjes, Pavel Emelyanov, Ying Han, linux-mm On 11/26/09, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > I think it is easier to write a slab controller IMHO. One potential problem I can think of with writing a slab controller would be that the user would have to estimate what fraction of the amount of memory slab should be allowed to use, which might not be ideal. If you wanted to limit a cgroup to a total of 1GB of memory, you might not care if the job wants to use 0.9 GB of user memory and 0.1GB of slab or if it wants to use 0.9GB of slab and 0.1GB of user memory.. Because of this, it might be more practical to integrate the slab accounting in memcg. -- Suleiman -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: memcg: slab control 2009-11-26 10:13 ` Suleiman Souhlal @ 2009-11-30 9:17 ` Balbir Singh 0 siblings, 0 replies; 33+ messages in thread From: Balbir Singh @ 2009-11-30 9:17 UTC (permalink / raw) To: Suleiman Souhlal Cc: KAMEZAWA Hiroyuki, David Rientjes, Pavel Emelyanov, Ying Han, linux-mm * Suleiman Souhlal <suleiman@google.com> [2009-11-26 02:13:17]: > On 11/26/09, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > I think it is easier to write a slab controller IMHO. > > One potential problem I can think of with writing a slab controller > would be that the user would have to estimate what fraction of the > amount of memory slab should be allowed to use, which might not be > ideal. > > If you wanted to limit a cgroup to a total of 1GB of memory, you might > not care if the job wants to use 0.9 GB of user memory and 0.1GB of > slab or if it wants to use 0.9GB of slab and 0.1GB of user memory.. > Hmm.. true, yes not caring about how memory usage is partitioned is nice (we have memsw for very similar reasons). > Because of this, it might be more practical to integrate the slab > accounting in memcg. > I tend to agree, but I would like to see the early design and thoughts. Like Kame pointed, integrating their accounting can be an issue. -- Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: memcg: slab control
  2009-11-26  1:14 ` KAMEZAWA Hiroyuki
  2009-11-26  8:50   ` Balbir Singh
@ 2009-11-30 22:45   ` David Rientjes
  1 sibling, 0 replies; 33+ messages in thread
From: David Rientjes @ 2009-11-30 22:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Pavel Emelyanov, Suleiman Souhlal, Ying Han, linux-mm

On Thu, 26 Nov 2009, KAMEZAWA Hiroyuki wrote:

> But, considering user-side, all people will not welcome dividing memcg and slabcg.
> So, tieing it to current memcg is ok for me.

Agreed.

> like...
> ==
> struct mem_cgroup {
> 	....
> 	....
> 	struct slab_cgroup slabcg; (or struct slab_cgroup *slabcg)
> }
> ==
>
> But we have to use another counter and another scheme, another implemenation
> than memcg, which has good scalability and more fuzzy/lazy controls.
> (For example, trigger slab-shrink when usage exceeds hiwatermark, not limit.)
>

We're only really interested in using memcg and slabcg together for
accounting all memory allotted to a particular cgroup. I'm trying to
imagine a scenario where someone would want to account and enforce hard
slab limits without using memcg as well. If there are none (and one of
the reasons we're trying to elicit discussion is to determine
everyone's requirements for such a feature), we can probably tie them
together without worrying about incurring unnecessary overhead from the
parts of the memcg framework that aren't related to slab accounting.

I think the ideal userspace API would be simply to add slab accounting
to the memcg's limit_in_bytes if a memcg option were enabled for a
cgroup. I don't think it would be helpful to add a ratio of that limit
for slab, though, since it's very difficult to predict the usage for a
particular workload.

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: memcg: slab control 2009-11-25 23:08 memcg: slab control David Rientjes 2009-11-26 1:14 ` KAMEZAWA Hiroyuki @ 2009-11-26 1:17 ` KAMEZAWA Hiroyuki 2009-11-26 10:01 ` Suleiman Souhlal 2009-11-26 2:35 ` KOSAKI Motohiro 2 siblings, 1 reply; 33+ messages in thread From: KAMEZAWA Hiroyuki @ 2009-11-26 1:17 UTC (permalink / raw) To: David Rientjes Cc: Balbir Singh, Pavel Emelyanov, Suleiman Souhlal, Ying Han, linux-mm On Wed, 25 Nov 2009 15:08:00 -0800 (PST) David Rientjes <rientjes@google.com> wrote: > Hi, > > I wanted to see what the current ideas are concerning kernel memory > accounting as it relates to the memory controller. Eventually we'll want > the ability to restrict cgroups to a hard slab limit. That'll require > accounting to map slab allocations back to user tasks so that we can > enforce a policy based on the cgroup's aggregated slab usage similiar to > how the memory controller currently does for user memory. > > Is this currently being thought about within the memcg community? We'd > like to start a discussion and get everybody's requirements and interests > on the table and then become actively involved in the development of such > a feature. > BTW, how much percent of pages are used for slab in Google system ? Because memory size is going bigger and bigger, ratio of slab usage is going smaller, I think. Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: memcg: slab control 2009-11-26 1:17 ` KAMEZAWA Hiroyuki @ 2009-11-26 10:01 ` Suleiman Souhlal 0 siblings, 0 replies; 33+ messages in thread From: Suleiman Souhlal @ 2009-11-26 10:01 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: David Rientjes, Balbir Singh, Pavel Emelyanov, Ying Han, linux-mm Hello, On 11/25/09, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > BTW, how much percent of pages are used for slab in Google system ? > Because memory size is going bigger and bigger, ratio of slab usage is going > smaller, I think. It varies. The amount of slab on systems can go from negligible to being a significant portion of the total memory (in network intensive workloads, for example). -- Suleiman -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: memcg: slab control
  2009-11-25 23:08 memcg: slab control David Rientjes
  2009-11-26  1:14 ` KAMEZAWA Hiroyuki
  2009-11-26  1:17 ` KAMEZAWA Hiroyuki
@ 2009-11-26  2:35 ` KOSAKI Motohiro
  2009-11-27  7:01   ` Ying Han
  2 siblings, 1 reply; 33+ messages in thread
From: KOSAKI Motohiro @ 2009-11-26  2:35 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Balbir Singh, Pavel Emelyanov,
	KAMEZAWA Hiroyuki, Suleiman Souhlal, Ying Han, linux-mm

Hi

> Hi,
>
> I wanted to see what the current ideas are concerning kernel memory
> accounting as it relates to the memory controller. Eventually we'll want
> the ability to restrict cgroups to a hard slab limit. That'll require
> accounting to map slab allocations back to user tasks so that we can
> enforce a policy based on the cgroup's aggregated slab usage similiar to
> how the memory controller currently does for user memory.
>
> Is this currently being thought about within the memcg community? We'd
> like to start a discussion and get everybody's requirements and interests
> on the table and then become actively involved in the development of such
> a feature.

I don't think memory hard isolation is a bad idea. However, restricting
only slab is too strange. Some code uses slab frequently; other code
uses get_free_pages() directly. Restricting slab alone will not produce
the result an admin expects.

Probably we need to implement a generic memory reservation framework.
It might also help implement rt-task memory reservation and a userland
oom manager.

It is only my personal opinion...

Thanks.

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: memcg: slab control
  2009-11-26  2:35 ` KOSAKI Motohiro
@ 2009-11-27  7:01   ` Ying Han
  2009-11-27  9:48     ` Pavel Emelyanov
  0 siblings, 1 reply; 33+ messages in thread
From: Ying Han @ 2009-11-27  7:01 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: David Rientjes, Balbir Singh, Pavel Emelyanov,
	KAMEZAWA Hiroyuki, Suleiman Souhlal, linux-mm

On Wed, Nov 25, 2009 at 6:35 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> Hi
>
>> Hi,
>>
>> I wanted to see what the current ideas are concerning kernel memory
>> accounting as it relates to the memory controller. Eventually we'll want
>> the ability to restrict cgroups to a hard slab limit. That'll require
>> accounting to map slab allocations back to user tasks so that we can
>> enforce a policy based on the cgroup's aggregated slab usage similiar to
>> how the memory controller currently does for user memory.
>>
>> Is this currently being thought about within the memcg community? We'd
>> like to start a discussion and get everybody's requirements and interests
>> on the table and then become actively involved in the development of such
>> a feature.
>
> I don't think memory hard isolation is bad idea. however, slab restriction
> is too strange. some device use slab frequently, another someone use get_free_pages()
> directly. only slab restriction will not make expected result from admin view.
>
> Probably, we need to implement generic memory reservation framework. it mihgt help
> implemnt rt-task memory reservation and userland oom manager.
>
> It is only my personal opinion...

Looks like the beancounters implementation counts both the kernel slab
objects and the pages from get_free_pages(). But it relies on the
caller to pass down a GFP flag indicating whether the page or slab is
accountable or not. I am looking at the beancounters v5 at:
http://lkml.indiana.edu/hypermail/linux/kernel/0610.0/1719.html

I kind of like the idea of having a kernel memory controller instead
of a kernel slab controller.
If we only count kernel slabs, do we need another mechanism to count
kernel allocations made directly through get_free_pages()?

--Ying

>
>
> Thanks.
>
>

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: memcg: slab control 2009-11-27 7:01 ` Ying Han @ 2009-11-27 9:48 ` Pavel Emelyanov 0 siblings, 0 replies; 33+ messages in thread From: Pavel Emelyanov @ 2009-11-27 9:48 UTC (permalink / raw) To: Ying Han Cc: KOSAKI Motohiro, David Rientjes, Balbir Singh, Pavel Emelyanov, KAMEZAWA Hiroyuki, Suleiman Souhlal, linux-mm Ying Han wrote: > I kind of like the idea to have a kernel memory controller instead of > kernel slab controller. > If we only count kernel slabs, do we need another mechanism to count > kernel allocations > directly from get_free_pages() ? We do. Look at what get_free_pages we mark with GFP_UBC in beancounters and see, that if we don't count them this creates way to DoS the kernel. > --Ying >> >> Thanks. >> >> > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2009-12-02 10:52 UTC | newest] Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2009-11-25 23:08 memcg: slab control David Rientjes 2009-11-26 1:14 ` KAMEZAWA Hiroyuki 2009-11-26 8:50 ` Balbir Singh 2009-11-26 8:56 ` KAMEZAWA Hiroyuki 2009-11-26 9:10 ` Pavel Emelyanov 2009-11-26 9:33 ` KAMEZAWA Hiroyuki 2009-11-26 9:56 ` Pavel Emelyanov 2009-11-26 10:24 ` Suleiman Souhlal 2009-11-26 12:31 ` Pavel Emelyanov 2009-11-26 12:52 ` Suleiman Souhlal 2009-12-01 7:40 ` Balbir Singh 2009-11-27 7:15 ` Ying Han 2009-11-27 9:45 ` Pavel Emelyanov 2009-12-01 5:14 ` KOSAKI Motohiro 2009-11-30 22:57 ` David Rientjes 2009-12-01 10:31 ` Pavel Emelyanov 2009-12-01 22:29 ` David Rientjes 2009-12-01 7:36 ` Balbir Singh 2009-12-01 10:40 ` Pavel Emelyanov 2009-12-01 15:14 ` Balbir Singh 2009-12-02 10:14 ` Pavel Emelyanov 2009-12-02 10:19 ` Balbir Singh 2009-12-02 10:51 ` Pavel Emelyanov 2009-11-30 22:55 ` David Rientjes 2009-12-01 10:39 ` Pavel Emelyanov 2009-11-26 10:13 ` Suleiman Souhlal 2009-11-30 9:17 ` Balbir Singh 2009-11-30 22:45 ` David Rientjes 2009-11-26 1:17 ` KAMEZAWA Hiroyuki 2009-11-26 10:01 ` Suleiman Souhlal 2009-11-26 2:35 ` KOSAKI Motohiro 2009-11-27 7:01 ` Ying Han 2009-11-27 9:48 ` Pavel Emelyanov