* memcg: slab control
@ 2009-11-25 23:08 David Rientjes
2009-11-26 1:14 ` KAMEZAWA Hiroyuki
` (2 more replies)
0 siblings, 3 replies; 33+ messages in thread
From: David Rientjes @ 2009-11-25 23:08 UTC (permalink / raw)
To: Balbir Singh, Pavel Emelyanov, KAMEZAWA Hiroyuki
Cc: Suleiman Souhlal, Ying Han, linux-mm
Hi,
I wanted to see what the current ideas are concerning kernel memory
accounting as it relates to the memory controller. Eventually we'll want
the ability to restrict cgroups to a hard slab limit. That'll require
accounting to map slab allocations back to user tasks so that we can
enforce a policy based on the cgroup's aggregated slab usage similar to
how the memory controller currently does for user memory.
Is this currently being thought about within the memcg community? We'd
like to start a discussion and get everybody's requirements and interests
on the table and then become actively involved in the development of such
a feature.
Thanks.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: memcg: slab control
From: KAMEZAWA Hiroyuki @ 2009-11-26 1:14 UTC (permalink / raw)
To: David Rientjes
Cc: Balbir Singh, Pavel Emelyanov, Suleiman Souhlal, Ying Han, linux-mm

On Wed, 25 Nov 2009 15:08:00 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> Hi,
>
> I wanted to see what the current ideas are concerning kernel memory
> accounting as it relates to the memory controller. Eventually we'll want
> the ability to restrict cgroups to a hard slab limit. That'll require
> accounting to map slab allocations back to user tasks so that we can
> enforce a policy based on the cgroup's aggregated slab usage similar to
> how the memory controller currently does for user memory.
>
> Is this currently being thought about within the memcg community?

Not yet. But I always recommend people to implement another memcg (slabcg)
for kernel memory, because:

- It must have a much lower cost than memcg, with good performance and
  scalability. A system-wide shared counter is nonsense.

- Slab is not based on LRU, so another used-memory maintenance scheme
  should be used.

- You can reuse page_cgroup even if slabcg is independent from memcg.

But, considering the user side, not everyone will welcome dividing memcg
and slabcg. So, tying it to the current memcg is ok for me. Like...
==
struct mem_cgroup {
        ....
        ....
        struct slab_cgroup slabcg; (or struct slab_cgroup *slabcg)
}
==

But we have to use another counter and another scheme, another
implementation than memcg, one with good scalability and more fuzzy/lazy
controls. (For example, trigger slab-shrink when usage exceeds a high
watermark, not at the limit.)

Scalable accounting is the first wall in front of us. The second one will
be how-to-shrink. As for information recording, we can reuse page_cgroup
and we'll not have much difficulty.

I hope that, in implementing slabcg, we'll not meet such complicated racy
cases as we met in memcg.

Thanks,
-Kame
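The fuzzy/lazy control described above - charging that never fails at a
hard limit, with slab shrinking triggered once usage crosses a high
watermark - could be sketched roughly as follows. This is a userspace
simulation only; `slab_cgroup` fields, `slabcg_charge`, and
`slabcg_uncharge` are illustrative names, not actual kernel symbols.

```c
/* Userspace sketch of watermark-triggered lazy shrinking, as opposed
 * to failing allocations at a hard limit.  All names are illustrative. */
#include <assert.h>
#include <stddef.h>

struct slab_cgroup {
	size_t usage;
	size_t hiwatermark;
	int shrink_requested;	/* stands in for queuing async shrink work */
};

/* Charging always succeeds; crossing the watermark only schedules
 * reclaim instead of returning a failure to the allocator. */
static void slabcg_charge(struct slab_cgroup *cg, size_t bytes)
{
	cg->usage += bytes;
	if (cg->usage > cg->hiwatermark)
		cg->shrink_requested = 1;
}

static void slabcg_uncharge(struct slab_cgroup *cg, size_t bytes)
{
	cg->usage -= bytes;
}
```

The point of the design choice is that `kmalloc()` callers never see NULL
because of the cgroup; pressure is relieved asynchronously instead.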
* Re: memcg: slab control
From: Balbir Singh @ 2009-11-26 8:50 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: David Rientjes, Pavel Emelyanov, Suleiman Souhlal, Ying Han, linux-mm

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-11-26 10:14:14]:

> Not yet. But I always recommend people to implement another memcg (slabcg)
> for kernel memory, because:
>
> - It must have a much lower cost than memcg, with good performance and
>   scalability. A system-wide shared counter is nonsense.

We've solved those issues mostly! Anyway, I agree that we need another
slabcg. Pavel did some work in that area and posted patches, but they were
mostly based on and limited to SLUB (IIRC).

> - Slab is not based on LRU, so another used-memory maintenance scheme
>   should be used.
>
> - You can reuse page_cgroup even if slabcg is independent from memcg.
>
> But, considering the user side, not everyone will welcome dividing memcg
> and slabcg. So, tying it to the current memcg is ok for me. Like...
> ==
> struct mem_cgroup {
>         ....
>         ....
>         struct slab_cgroup slabcg; (or struct slab_cgroup *slabcg)
> }
> ==
>
> But we have to use another counter and another scheme, another
> implementation than memcg, one with good scalability and more fuzzy/lazy
> controls. (For example, trigger slab-shrink when usage exceeds a high
> watermark, not at the limit.)

That depends on requirements; a high watermark is more like a soft limit
than a hard limit, and there might be a need for hard limits.

> Scalable accounting is the first wall in front of us. The second one will
> be how-to-shrink. As for information recording, we can reuse page_cgroup
> and we'll not have much difficulty.
>
> I hope that, in implementing slabcg, we'll not meet such complicated racy
> cases as we met in memcg.

I think it will be easier because there is no swapping involved, no OOM,
and fewer rare race conditions. There is limited slab reclaim possible,
but otherwise I think it is easier to write a slab controller IMHO.

-- 
Balbir
* Re: memcg: slab control
From: KAMEZAWA Hiroyuki @ 2009-11-26 8:56 UTC (permalink / raw)
To: balbir
Cc: David Rientjes, Pavel Emelyanov, Suleiman Souhlal, Ying Han, linux-mm

On Thu, 26 Nov 2009 14:20:31 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> > - It must have a much lower cost than memcg, with good performance and
> >   scalability. A system-wide shared counter is nonsense.
>
> We've solved those issues mostly!

Yes, but our solution is for page faults. The resolution of slab
allocation is much more fine-grained and frequent.

> Anyway, I agree that we need another slabcg. Pavel did some work in that
> area and posted patches, but they were mostly based on and limited to
> SLUB (IIRC).
>
> That depends on requirements; a high watermark is more like a soft limit
> than a hard limit, and there might be a need for hard limits.

My point is that most of the kernel code cannot work well when
kmalloc(small area) returns NULL.

> I think it will be easier because there is no swapping involved, no OOM,
> and fewer rare race conditions. There is limited slab reclaim possible,
> but otherwise I think it is easier to write a slab controller IMHO.

Yes ;)

Thanks,
-Kame
* Re: memcg: slab control
From: Pavel Emelyanov @ 2009-11-26 9:10 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki, balbir, David Rientjes
Cc: Suleiman Souhlal, Ying Han, linux-mm

>> Anyway, I agree that we need another slabcg. Pavel did some work in that
>> area and posted patches, but they were mostly based on and limited to
>> SLUB (IIRC).

I'm ready to resurrect the patches and port them to slab. But before doing
that we should answer one question.

Consider we have two kmalloc-s in kernel code - one is user-space
triggerable and the other one is not. From my POV we should account for
the former, but not for the latter.

If so - how should we patch the kernel to achieve that goal?

> My point is that most of the kernel code cannot work well when
> kmalloc(small area) returns NULL.

:) That's not so, actually. As our experience shows, the kernel lives fine
when kmalloc returns NULL (this doesn't include drivers, though).
* Re: memcg: slab control
From: KAMEZAWA Hiroyuki @ 2009-11-26 9:33 UTC (permalink / raw)
To: Pavel Emelyanov
Cc: balbir, David Rientjes, Suleiman Souhlal, Ying Han, linux-mm

On Thu, 26 Nov 2009 12:10:52 +0300
Pavel Emelyanov <xemul@parallels.com> wrote:

> :) That's not so, actually. As our experience shows, the kernel lives fine
> when kmalloc returns NULL (this doesn't include drivers, though).

One issue that comes to my mind is that a file system can return -EIO
because kmalloc() returns NULL. The kernel may work fine, but that's
terrible for users ;)

Thanks,
-Kame
* Re: memcg: slab control
From: Pavel Emelyanov @ 2009-11-26 9:56 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: balbir, David Rientjes, Suleiman Souhlal, Ying Han, linux-mm

KAMEZAWA Hiroyuki wrote:
> One issue that comes to my mind is that a file system can return -EIO
> because kmalloc() returns NULL. The kernel may work fine, but that's
> terrible for users ;)

That relates to my question above - we should not account for all
kmalloc-s. In particular - we don't account for bio-s and buffer-head-s,
since their amount is not under direct user control. Yes, you can request
heavy IO, but first, the kernel sends your task to sleep under certain
conditions, and second, bio-s are destroyed as soon as they are finished,
and thus bio-s and buffer-head-s cannot be used to eat all the kernel
memory.
* Re: memcg: slab control
From: Suleiman Souhlal @ 2009-11-26 10:24 UTC (permalink / raw)
To: Pavel Emelyanov
Cc: KAMEZAWA Hiroyuki, balbir, David Rientjes, Ying Han, linux-mm

On 11/26/09, Pavel Emelyanov <xemul@parallels.com> wrote:
> That relates to my question above - we should not account for all
> kmalloc-s. In particular - we don't account for bio-s and buffer-head-s,
> since their amount is not under direct user control. Yes, you can request
> heavy IO, but first, the kernel sends your task to sleep under certain
> conditions, and second, bio-s are destroyed as soon as they are finished,
> and thus bio-s and buffer-head-s cannot be used to eat all the kernel
> memory.

Aren't there patches to make the kernel track which cgroup caused which
disk I/O? If so, it should be possible to charge the bios to the right
cgroup.

Maybe one way to decide which kernel allocations should be accounted
would be to look at the calling context: if the allocation is done in
user context (syscall), then it could be counted towards that user, while
if the allocation is done in interrupt or kthread context, it shouldn't
be accounted.

Of course, this wouldn't be perfect, but it might be a good enough
approximation.

-- Suleiman
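The calling-context heuristic suggested above can be illustrated with a
small userspace simulation. The `in_user_context` flag is a stand-in for
kernel-side checks such as not being in interrupt context and not running
in a kernel thread; all names here are invented for illustration.

```c
/* Sketch: charge an allocation to the caller's cgroup only when it is
 * made from user (syscall) context; interrupt/kthread allocations are
 * left unaccounted.  Illustrative names only. */
#include <assert.h>

struct mem_cgroup {
	long kmem_usage;
};

static void account_kmem(struct mem_cgroup *cg, long bytes,
			 int in_user_context)
{
	if (in_user_context)
		cg->kmem_usage += bytes;
	/* else: interrupt or kthread context - skip accounting */
}
```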
* Re: memcg: slab control
From: Pavel Emelyanov @ 2009-11-26 12:31 UTC (permalink / raw)
To: Suleiman Souhlal
Cc: KAMEZAWA Hiroyuki, balbir, David Rientjes, Ying Han, linux-mm

> Aren't there patches to make the kernel track which cgroup caused which
> disk I/O? If so, it should be possible to charge the bios to the right
> cgroup.
>
> Maybe one way to decide which kernel allocations should be accounted
> would be to look at the calling context: if the allocation is done in
> user context (syscall), then it could be counted towards that user, while
> if the allocation is done in interrupt or kthread context, it shouldn't
> be accounted.
>
> Of course, this wouldn't be perfect, but it might be a good enough
> approximation.

I disagree. Bio-s are allocated in user context for all typical reads
(unless we requested aio) and are allocated either in pdflush context or
(!) in arbitrary task context for writes (e.g. via try_to_free_pages), and
thus such bio/buffer_head accounting would be completely random.

One way to achieve the goal I can propose is the following (it's not
perfect, but just something to start the discussion from).

We implement support for accounting based on a bit on the kmem_cache
structure and mark all kmalloc caches as not-accountable. Then we grep the
kernel to find all kmalloc-s and think - if a kmalloc is to be accounted,
we turn it into kmem_cache_alloc() with a dedicated kmem_cache and mark
that cache as accountable.
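The per-cache bit proposed here might look roughly like the following
userspace sketch. `SLAB_ACCOUNTABLE` and the single global counter are
invented for illustration and are not symbols from any posted patches
(mainline much later gained a similar `SLAB_ACCOUNT` flag, but that
postdates this discussion).

```c
/* Sketch of a per-kmem_cache "accountable" bit: only caches explicitly
 * marked are charged; plain kmalloc caches stay unaccounted. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SLAB_ACCOUNTABLE 0x1u	/* invented flag name */

struct kmem_cache {
	unsigned int flags;
	size_t obj_size;
};

static size_t charged;	/* stands in for the cgroup's slab counter */

static void *cache_alloc(struct kmem_cache *c)
{
	if (c->flags & SLAB_ACCOUNTABLE)
		charged += c->obj_size;
	return malloc(c->obj_size);
}
```

Converting an accountable kmalloc() site would then mean giving it a
dedicated cache created with the flag set, while all other sites keep
hitting the unflagged kmalloc caches.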
* Re: memcg: slab control
From: Suleiman Souhlal @ 2009-11-26 12:52 UTC (permalink / raw)
To: Pavel Emelyanov
Cc: KAMEZAWA Hiroyuki, balbir, David Rientjes, Ying Han, linux-mm

On 11/26/09, Pavel Emelyanov <xemul@parallels.com> wrote:
> I disagree. Bio-s are allocated in user context for all typical reads
> (unless we requested aio) and are allocated either in pdflush context or
> (!) in arbitrary task context for writes (e.g. via try_to_free_pages),
> and thus such bio/buffer_head accounting would be completely random.

Yes, that's why I pointed out that you can account to the right cgroup if
you track who caused the I/O (which, I imagine, should already be done by
the block I/O bandwidth controller, or similar).

For most other allocations, on the other hand, accounting to the current
context should be fine.

> One way to achieve the goal I can propose is the following (it's not
> perfect, but just something to start the discussion from).
>
> We implement support for accounting based on a bit on the kmem_cache
> structure and mark all kmalloc caches as not-accountable. Then we grep
> the kernel to find all kmalloc-s and think - if a kmalloc is to be
> accounted, we turn it into kmem_cache_alloc() with a dedicated
> kmem_cache and mark that cache as accountable.

That sounds like a lot of work. :-)

-- Suleiman
* Re: memcg: slab control
From: Balbir Singh @ 2009-12-01 7:40 UTC (permalink / raw)
To: Suleiman Souhlal
Cc: Pavel Emelyanov, KAMEZAWA Hiroyuki, David Rientjes, Ying Han, linux-mm

* Suleiman Souhlal <suleiman@google.com> [2009-11-26 04:52:00]:

> Yes, that's why I pointed out that you can account to the right cgroup if
> you track who caused the I/O (which, I imagine, should already be done by
> the block I/O bandwidth controller, or similar).

We can do so; we do that for task I/O accounting today, and it works quite
well for the applications I've applied it to.

> For most other allocations, on the other hand, accounting to the current
> context should be fine.

Absolutely! Except when the context is a kernel thread like pdflush/ksm,
etc.

> > We implement support for accounting based on a bit on the kmem_cache
> > structure and mark all kmalloc caches as not-accountable. Then we grep
> > the kernel to find all kmalloc-s and think - if a kmalloc is to be
> > accounted, we turn it into kmem_cache_alloc() with a dedicated
> > kmem_cache and mark that cache as accountable.
>
> That sounds like a lot of work. :-)

Hmm.. yes, it does, but I wonder if there are better alternatives.

-- 
Balbir
* Re: memcg: slab control
From: Ying Han @ 2009-11-27 7:15 UTC (permalink / raw)
To: Pavel Emelyanov
Cc: Suleiman Souhlal, KAMEZAWA Hiroyuki, balbir, David Rientjes, linux-mm

On Thu, Nov 26, 2009 at 4:31 AM, Pavel Emelyanov <xemul@parallels.com> wrote:
> We implement support for accounting based on a bit on the kmem_cache
> structure and mark all kmalloc caches as not-accountable. Then we grep
> the kernel to find all kmalloc-s and think - if a kmalloc is to be
> accounted, we turn it into kmem_cache_alloc() with a dedicated
> kmem_cache and mark that cache as accountable.

Well, it would be nice to count all kernel memory allocations
trigger-able by user programs; kernel memory includes kernel slabs as
well as the pages directly allocated by get_free_pages(). However, some
of the allocations happen asynchronously, in kernel thread or interrupt
context. We cannot charge those to the random process that happens to be
running at the time.

We can either not count those allocations, or do some special treatment
to remember who owns them. In our networking-intensive workload, this
caused us lots of trouble miscounting the networking slabs for incoming
packets. So we made changes in the networking stack which record the
owner of the socket and then charge the slab later using that recorded
information.

--Ying
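The socket-owner scheme described above - record the owning cgroup when
the socket is created in process context, then charge receive-path
allocations to that recorded owner rather than to whatever task happens
to be running in softirq context - might be sketched like this. This is a
userspace simulation with invented names, not code from the patches
mentioned in the thread.

```c
/* Sketch: remember the creating cgroup on the socket, and charge later
 * incoming-packet allocations to that recorded owner. */
#include <assert.h>

struct mem_cgroup {
	long kmem_usage;
};

struct sock {
	struct mem_cgroup *owner;	/* recorded at socket creation */
};

/* Called in process context when the application opens the socket. */
static void sock_set_owner(struct sock *sk, struct mem_cgroup *cg)
{
	sk->owner = cg;
}

/* Called from the receive path (softirq context): charge the recorded
 * owner, not whichever task happens to be current. */
static void charge_rx_alloc(struct sock *sk, long bytes)
{
	sk->owner->kmem_usage += bytes;
}
```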
* Re: memcg: slab control
From: Pavel Emelyanov @ 2009-11-27 9:45 UTC (permalink / raw)
To: Ying Han
Cc: Suleiman Souhlal, KAMEZAWA Hiroyuki, balbir, David Rientjes, linux-mm

Ying Han wrote:
> We can either not count those allocations, or do some special treatment
> to remember who owns them. In our networking-intensive workload, this
> caused us lots of trouble miscounting the networking slabs for incoming
> packets. So we made changes in the networking stack which record the
> owner of the socket and then charge the slab later using that recorded
> information.

That's the same as what we do, but note that simple accounting is not
enough for socket buffers (a.k.a. skb-s). In a perfect world we should
implement memory management similar to what already exists in the
networking code. In particular - sockets should not report errors in case
of kmem shortage, but instead go to a waiting state. Besides, TCP sockets
should adjust the TCP window according to the current kmem controller
state, and this task is quite complex.
* Re: memcg: slab control
From: KOSAKI Motohiro @ 2009-12-01 5:14 UTC (permalink / raw)
To: Ying Han
Cc: kosaki.motohiro, Pavel Emelyanov, Suleiman Souhlal, KAMEZAWA Hiroyuki, balbir, David Rientjes, linux-mm

> We can either not count those allocations, or do some special treatment
> to remember who owns them. In our networking-intensive workload, this
> caused us lots of trouble miscounting the networking slabs for incoming
> packets. So we made changes in the networking stack which record the
> owner of the socket and then charge the slab later using that recorded
> information.

I agree, currently a network-intensive workload is problematic. But I
don't think the network memory management improvement needs to change
generic slab management. Why can't we improve the current tcp/udp memory
accounting? It is a better user interface than "amount of slab memory".
* Re: memcg: slab control 2009-11-26 12:31 ` Pavel Emelyanov 2009-11-26 12:52 ` Suleiman Souhlal 2009-11-27 7:15 ` Ying Han @ 2009-11-30 22:57 ` David Rientjes 2009-12-01 10:31 ` Pavel Emelyanov 2 siblings, 1 reply; 33+ messages in thread From: David Rientjes @ 2009-11-30 22:57 UTC (permalink / raw) To: Pavel Emelyanov Cc: Suleiman Souhlal, KAMEZAWA Hiroyuki, balbir, Ying Han, linux-mm On Thu, 26 Nov 2009, Pavel Emelyanov wrote: > I disagree. Bio-s are allocated in user context for all typical reads > (unless we requested aio) and are allocated either in pdflush context > or (!) in arbitrary task context for writes (e.g. via try_to_free_pages) > and thus such bio/buffer_head accounting will be completely random. > pdflush has been removed, they should all be allocated in process context. > We implement support for accounting based on a bit on a kmem_cache > structure and mark all kmalloc caches as not-accountable. Then we grep > the kernel to find all kmalloc-s and think - if a kmalloc is to be > accounted we turn this into kmem_cache_alloc() with dedicated > kmem_cache and mark it as accountable. > That doesn't work with slab cache merging done in slub. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: memcg: slab control 2009-11-30 22:57 ` David Rientjes @ 2009-12-01 10:31 ` Pavel Emelyanov 2009-12-01 22:29 ` David Rientjes 0 siblings, 1 reply; 33+ messages in thread From: Pavel Emelyanov @ 2009-12-01 10:31 UTC (permalink / raw) To: David Rientjes Cc: Suleiman Souhlal, KAMEZAWA Hiroyuki, balbir, Ying Han, linux-mm David Rientjes wrote: > On Thu, 26 Nov 2009, Pavel Emelyanov wrote: > >> I disagree. Bio-s are allocated in user context for all typical reads >> (unless we requested aio) and are allocated either in pdflush context >> or (!) in arbitrary task context for writes (e.g. via try_to_free_pages) >> and thus such bio/buffer_head accounting will be completely random. >> > > pdflush has been removed, they should all be allocated in process context. OK, but the try_to_free_pages() concern still stands. >> We implement support for accounting based on a bit on a kmem_cache >> structure and mark all kmalloc caches as not-accountable. Then we grep >> the kernel to find all kmalloc-s and think - if a kmalloc is to be >> accounted we turn this into kmem_cache_alloc() with dedicated >> kmem_cache and mark it as accountable. >> > > That doesn't work with slab cache merging done in slub. Surely we'll have to change it a bit. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: memcg: slab control 2009-12-01 10:31 ` Pavel Emelyanov @ 2009-12-01 22:29 ` David Rientjes 0 siblings, 0 replies; 33+ messages in thread From: David Rientjes @ 2009-12-01 22:29 UTC (permalink / raw) To: Pavel Emelyanov Cc: Suleiman Souhlal, KAMEZAWA Hiroyuki, balbir, Ying Han, linux-mm On Tue, 1 Dec 2009, Pavel Emelyanov wrote: > > pdflush has been removed, they should all be allocated in process context. > > OK, but the try_to_free_pages() concern still stands. > Yes, we lack mappings between the per-bdi flusher kthreads back to the user cgroup that initiated the writeback. Since all of these kthreads are descendents of kthreadd, they'll be accounted for within that thread's cgroup unless we pass along the current context. > >> We implement support for accounting based on a bit on a kmem_cache > >> structure and mark all kmalloc caches as not-accountable. Then we grep > >> the kernel to find all kmalloc-s and think - if a kmalloc is to be > >> accounted we turn this into kmem_cache_alloc() with dedicated > >> kmem_cache and mark it as accountable. > >> > > > > That doesn't work with slab cache merging done in slub. > > Surely we'll have to change it a bit. > We can't add a cache flag passed to kmem_cache_create() to identify caches that should be accounted versus those that shouldn't, there are allocs done in both process context and irq context from the same caches and we don't want to inhibit accounting with an additional flag passed to kmem_cache_alloc() if that cache has accounting enabled. A vast majority of slab caches get merged into each other based on object size and alignment with slub; we could prevent that merging by checking the accounting bit for a cache, but that would come at a performance cost (nullifying many hot object allocs), increased fragmentation, and increased memory consumption. 
In other words, we don't want to make it an attribute of the cache
itself; we need to make it an attribute of the context in which the
allocation is done. There are many more cases where we'll want to have
accounting enabled by default, so we'll need to add a bit passed on
alloc to inhibit accounting for those objects.

^ permalink raw reply	[flat|nested] 33+ messages in thread
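A minimal model of what David is arguing for — accountability decided per call site via a flag, with accounting on by default — might look like the following. `MODEL_NOACCOUNT` and `model_alloc` are hypothetical names, not proposed kernel interfaces, and the uncharge-on-free side is deliberately omitted since mapping an object back to its cgroup is a separate problem discussed elsewhere in the thread:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-call flag: accounting defaults to ON, and individual
 * call sites opt out.  Because the decision is carried by the call, not
 * stored in the kmem_cache, it survives slub-style cache merging. */
#define MODEL_NOACCOUNT	0x1	/* invented flag, not a real GFP bit */

struct model_memcg { long kmem_usage; };

static void *model_alloc(struct model_memcg *cg, size_t size, int flags)
{
	void *obj = malloc(size);

	/* charge the caller's cgroup unless this call site opted out */
	if (obj && !(flags & MODEL_NOACCOUNT))
		cg->kmem_usage += size;
	return obj;
}
```

The point of the sketch is only where the bit lives: two allocations from the very same (possibly merged) cache can differ in accountability because the context, not the cache, decides.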
* Re: memcg: slab control 2009-11-26 9:56 ` Pavel Emelyanov 2009-11-26 10:24 ` Suleiman Souhlal @ 2009-12-01 7:36 ` Balbir Singh 2009-12-01 10:40 ` Pavel Emelyanov 1 sibling, 1 reply; 33+ messages in thread From: Balbir Singh @ 2009-12-01 7:36 UTC (permalink / raw) To: Pavel Emelyanov Cc: KAMEZAWA Hiroyuki, David Rientjes, Suleiman Souhlal, Ying Han, linux-mm * Pavel Emelyanov <xemul@parallels.com> [2009-11-26 12:56:01]: > KAMEZAWA Hiroyuki wrote: > > On Thu, 26 Nov 2009 12:10:52 +0300 > > Pavel Emelyanov <xemul@parallels.com> wrote: > > > >>>> Anyway, I agree that we need another > >>>> slabcg, Pavel did some work in that area and posted patches, but they > >>>> were mostly based and limited to SLUB (IIRC). > >> I'm ready to resurrect the patches and port them for slab. > >> But before doing it we should answer one question. > >> > >> Consider we have two kmalloc-s in a kernel code - one is > >> user-space triggerable and the other one is not. From my > >> POV we should account for the former one, but should not > >> for the latter. > >> > >> If so - how should we patch the kernel to achieve that goal? > >> > >>> My point is that most of the kernel codes cannot work well when kmalloc(small area) > >>> returns NULL. > >> :) That's not so actually. As our experience shows kernel lives fine > >> when kmalloc returns NULL (this doesn't include drivers though). > >> > > One issue it comes to my mind is that file system can return -EIO because > > kmalloc() returns NULL. the kernel may work fine but terrible to users ;) > > That relates to my question above - we should not account for all > kmalloc-s. In particular - we don't account for bio-s and buffer-head-s > since their amount is not under direct user control. Yes, you can > request for heavy IO, but first, kernel sends your task to sleep under > certain conditions and second, bio-s are destroyed as soon as they are > finished and thus bio-s and buffer-head-s cannot be used to eat all the > kernel memory. 
Just to understand the context better, is this really a problem? It can
occur only when we really do run out of memory. The idea of using
slabcg + memcg together is good, except for the accounting. I can
repost the percpu counter patches that add fuzziness, along with the
other batch-accounting tricks that Kame uses; we will need to make sure
we can do the same for slab allocations as well.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: memcg: slab control 2009-12-01 7:36 ` Balbir Singh @ 2009-12-01 10:40 ` Pavel Emelyanov 2009-12-01 15:14 ` Balbir Singh 0 siblings, 1 reply; 33+ messages in thread From: Pavel Emelyanov @ 2009-12-01 10:40 UTC (permalink / raw) To: balbir Cc: KAMEZAWA Hiroyuki, David Rientjes, Suleiman Souhlal, Ying Han, linux-mm > Just to understand the context better, is this really a problem. This > can occur when we do really run out of memory. The idea of using > slabcg + memcg together is good, except for our accounting process. I > can repost percpu counter patches that adds fuzziness along with other > tricks that Kame has to do batch accounting, that we will need to > make sure we are able to do with slab allocations as well. > I'm not sure I understand you concern. Can you elaborate, please? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: memcg: slab control 2009-12-01 10:40 ` Pavel Emelyanov @ 2009-12-01 15:14 ` Balbir Singh 2009-12-02 10:14 ` Pavel Emelyanov 0 siblings, 1 reply; 33+ messages in thread From: Balbir Singh @ 2009-12-01 15:14 UTC (permalink / raw) To: Pavel Emelyanov Cc: KAMEZAWA Hiroyuki, David Rientjes, Suleiman Souhlal, Ying Han, linux-mm * Pavel Emelyanov <xemul@parallels.com> [2009-12-01 13:40:30]: > > Just to understand the context better, is this really a problem. This > > can occur when we do really run out of memory. The idea of using > > slabcg + memcg together is good, except for our accounting process. I > > can repost percpu counter patches that adds fuzziness along with other > > tricks that Kame has to do batch accounting, that we will need to > > make sure we are able to do with slab allocations as well. > > > > I'm not sure I understand you concern. Can you elaborate, please? > The concern was mostly accounting when memcg + slabcg are integrated into the same framework. res_counters will need new scalability primitives. -- Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 33+ messages in thread
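The kind of scalability primitive being discussed — percpu batching that trades precision for fewer touches of the shared counter, in the spirit of Kame's batch accounting — can be modeled roughly as follows. All names here are invented for illustration and are not the real res_counter API:

```c
#include <assert.h>

/* Hypothetical model of "fuzzy" batched charging: each CPU prepays a
 * batch from the shared counter and then charges locally, so the
 * contended shared counter is touched once per CHARGE_BATCH bytes
 * instead of once per allocation.  The price is imprecision: up to
 * nr_cpus * CHARGE_BATCH bytes can sit unused in the local stocks. */
#define NR_CPUS_MODEL	4
#define CHARGE_BATCH	4096L

struct res_counter_model {
	long usage;			/* shared, "expensive" counter */
	long cached[NR_CPUS_MODEL];	/* per-cpu prepaid stock */
};

static void model_charge(struct res_counter_model *rc, int cpu, long bytes)
{
	if (rc->cached[cpu] < bytes) {
		/* refill: one global update covers a whole batch */
		rc->usage += CHARGE_BATCH;
		rc->cached[cpu] += CHARGE_BATCH;
	}
	rc->cached[cpu] -= bytes;	/* fast path: per-cpu only */
}
```

A limit check against `usage` is then conservative by at most the cached slack, which is why the thread talks about hiwatermark-style "fuzzy/lazy" controls rather than an exact limit.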
* Re: memcg: slab control 2009-12-01 15:14 ` Balbir Singh @ 2009-12-02 10:14 ` Pavel Emelyanov 2009-12-02 10:19 ` Balbir Singh 0 siblings, 1 reply; 33+ messages in thread From: Pavel Emelyanov @ 2009-12-02 10:14 UTC (permalink / raw) To: balbir Cc: KAMEZAWA Hiroyuki, David Rientjes, Suleiman Souhlal, Ying Han, linux-mm Balbir Singh wrote: > * Pavel Emelyanov <xemul@parallels.com> [2009-12-01 13:40:30]: > >>> Just to understand the context better, is this really a problem. This >>> can occur when we do really run out of memory. The idea of using >>> slabcg + memcg together is good, except for our accounting process. I >>> can repost percpu counter patches that adds fuzziness along with other >>> tricks that Kame has to do batch accounting, that we will need to >>> make sure we are able to do with slab allocations as well. >>> >> I'm not sure I understand you concern. Can you elaborate, please? >> > > The concern was mostly accounting when memcg + slabcg are integrated > into the same framework. res_counters will need new scalability > primitives. > I see. I think the best we can do here is start with a separate controller. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: memcg: slab control 2009-12-02 10:14 ` Pavel Emelyanov @ 2009-12-02 10:19 ` Balbir Singh 2009-12-02 10:51 ` Pavel Emelyanov 0 siblings, 1 reply; 33+ messages in thread From: Balbir Singh @ 2009-12-02 10:19 UTC (permalink / raw) To: Pavel Emelyanov Cc: KAMEZAWA Hiroyuki, David Rientjes, Suleiman Souhlal, Ying Han, linux-mm * Pavel Emelyanov <xemul@parallels.com> [2009-12-02 13:14:15]: > Balbir Singh wrote: > > * Pavel Emelyanov <xemul@parallels.com> [2009-12-01 13:40:30]: > > > >>> Just to understand the context better, is this really a problem. This > >>> can occur when we do really run out of memory. The idea of using > >>> slabcg + memcg together is good, except for our accounting process. I > >>> can repost percpu counter patches that adds fuzziness along with other > >>> tricks that Kame has to do batch accounting, that we will need to > >>> make sure we are able to do with slab allocations as well. > >>> > >> I'm not sure I understand you concern. Can you elaborate, please? > >> > > > > The concern was mostly accounting when memcg + slabcg are integrated > > into the same framework. res_counters will need new scalability > > primitives. > > > > I see. I think the best we can do here is start with a separate controller. > I would think so as well, but setting up independent limits might be a challenge, how does the user really estimate the amount of kernel memory needed? This is the same problem that David posted sometime back. -- Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: memcg: slab control
  2009-12-02 10:19 ` Balbir Singh
@ 2009-12-02 10:51   ` Pavel Emelyanov
  0 siblings, 0 replies; 33+ messages in thread
From: Pavel Emelyanov @ 2009-12-02 10:51 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, David Rientjes, Suleiman Souhlal, Ying Han, linux-mm

Balbir Singh wrote:
> * Pavel Emelyanov <xemul@parallels.com> [2009-12-02 13:14:15]:
>
>> Balbir Singh wrote:
>>> * Pavel Emelyanov <xemul@parallels.com> [2009-12-01 13:40:30]:
>>>
>>>>> Just to understand the context better, is this really a problem. This
>>>>> can occur when we do really run out of memory. The idea of using
>>>>> slabcg + memcg together is good, except for our accounting process. I
>>>>> can repost percpu counter patches that adds fuzziness along with other
>>>>> tricks that Kame has to do batch accounting, that we will need to
>>>>> make sure we are able to do with slab allocations as well.
>>>>>
>>>> I'm not sure I understand you concern. Can you elaborate, please?
>>>>
>>> The concern was mostly accounting when memcg + slabcg are integrated
>>> into the same framework. res_counters will need new scalability
>>> primitives.
>>>
>> I see. I think the best we can do here is start with a separate controller.
>>
>
> I would think so as well, but setting up independent limits might be a
> challenge, how does the user really estimate the amount of kernel
> memory needed? This is the same problem that David posted sometime
> back.

I agree with you, but note that memcg consists of several parts, and
the question "where to account the bytes" is quite independent from
"which allocations to account" and "where to get the memcg context
from on kfree" ;)

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: memcg: slab control 2009-11-26 9:10 ` Pavel Emelyanov 2009-11-26 9:33 ` KAMEZAWA Hiroyuki @ 2009-11-30 22:55 ` David Rientjes 2009-12-01 10:39 ` Pavel Emelyanov 1 sibling, 1 reply; 33+ messages in thread From: David Rientjes @ 2009-11-30 22:55 UTC (permalink / raw) To: Pavel Emelyanov Cc: KAMEZAWA Hiroyuki, balbir, Suleiman Souhlal, Ying Han, linux-mm On Thu, 26 Nov 2009, Pavel Emelyanov wrote: > I'm ready to resurrect the patches and port them for slab. > But before doing it we should answer one question. > Do you have a pointer to your latest implementation that you proposed for slab? > Consider we have two kmalloc-s in a kernel code - one is > user-space triggerable and the other one is not. From my > POV we should account for the former one, but should not > for the latter. > > If so - how should we patch the kernel to achieve that goal? > I think all slab allocations should be accounted for based on current's memcg other than those done in hardirq context, annotating slab allocations doesn't seem scalable. Whether the accounting is done on a task level or cgroup level isn't really a problem for us since we don't move tasks amongst cgroups. I imagine there've been previous restrictions on that put into place with the memcg so this doesn't seem like a slabcg-specific requirement anyway. The problem on the freeing side is mapping the object back to the cgroup that allocated it. We'd also need to map the object to the context in which it was allocated to determine whether we should decrement the counter or not. How do you propose doing that without a considerable overhead in memory consumption, fastpath branch, and cache cold slabcg lookups? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: memcg: slab control 2009-11-30 22:55 ` David Rientjes @ 2009-12-01 10:39 ` Pavel Emelyanov 0 siblings, 0 replies; 33+ messages in thread From: Pavel Emelyanov @ 2009-12-01 10:39 UTC (permalink / raw) To: David Rientjes Cc: KAMEZAWA Hiroyuki, balbir, Suleiman Souhlal, Ying Han, linux-mm David Rientjes wrote: > On Thu, 26 Nov 2009, Pavel Emelyanov wrote: > >> I'm ready to resurrect the patches and port them for slab. >> But before doing it we should answer one question. >> > > Do you have a pointer to your latest implementation that you proposed for > slab? I believe this is the one: https://lists.linux-foundation.org/pipermail/containers/2007-September/007481.html >> Consider we have two kmalloc-s in a kernel code - one is >> user-space triggerable and the other one is not. From my >> POV we should account for the former one, but should not >> for the latter. >> >> If so - how should we patch the kernel to achieve that goal? >> > > I think all slab allocations should be accounted for based on current's > memcg other than those done in hardirq context, annotating slab > allocations doesn't seem scalable. Whether the accounting is done on a > task level or cgroup level isn't really a problem for us since we don't > move tasks amongst cgroups. I imagine there've been previous restrictions > on that put into place with the memcg so this doesn't seem like a > slabcg-specific requirement anyway. > > The problem on the freeing side is mapping the object back to the cgroup > that allocated it. We'd also need to map the object to the context in > which it was allocated to determine whether we should decrement the > counter or not. How do you propose doing that without a considerable > overhead in memory consumption, fastpath branch, and cache cold slabcg > lookups? That's the biggest problem. Generally speaking - no other way rather than store additional pointer. 
In some situations you can rely on the cgroup of the task in whose
context an object is being freed, but in that case, once you move a
task to another cgroup, your accounting is screwed.

^ permalink raw reply	[flat|nested] 33+ messages in thread
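Pavel's "store an additional pointer" answer can be modeled by hiding an owner header in front of each accounted object, so the uncharge on free never depends on current's cgroup. The header-per-object layout below is purely illustrative; a real implementation would more likely keep the pointer per slab page, page_cgroup-style, to avoid per-object overhead:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model: each accounted object carries a hidden header
 * recording who was charged, so the free path can uncharge the right
 * cgroup even when it runs in another task's (or no task's) context. */
struct model_memcg { long kmem_usage; };

struct obj_header {
	struct model_memcg *owner;
	size_t size;
};

static void *model_kmalloc(struct model_memcg *cg, size_t size)
{
	struct obj_header *h = malloc(sizeof(*h) + size);

	if (!h)
		return NULL;
	h->owner = cg;
	h->size = size;
	cg->kmem_usage += size;
	return h + 1;			/* caller sees only the payload */
}

static void model_kfree(void *obj)
{
	struct obj_header *h = (struct obj_header *)obj - 1;

	h->owner->kmem_usage -= h->size;  /* uncharge the recorded owner */
	free(h);
}
```

Because the owner travels with the object, moving the task between cgroups after the allocation does not corrupt the accounting — which is exactly the failure mode of the rely-on-current alternative.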
* Re: memcg: slab control 2009-11-26 8:50 ` Balbir Singh 2009-11-26 8:56 ` KAMEZAWA Hiroyuki @ 2009-11-26 10:13 ` Suleiman Souhlal 2009-11-30 9:17 ` Balbir Singh 1 sibling, 1 reply; 33+ messages in thread From: Suleiman Souhlal @ 2009-11-26 10:13 UTC (permalink / raw) To: balbir Cc: KAMEZAWA Hiroyuki, David Rientjes, Pavel Emelyanov, Ying Han, linux-mm On 11/26/09, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > I think it is easier to write a slab controller IMHO. One potential problem I can think of with writing a slab controller would be that the user would have to estimate what fraction of the amount of memory slab should be allowed to use, which might not be ideal. If you wanted to limit a cgroup to a total of 1GB of memory, you might not care if the job wants to use 0.9 GB of user memory and 0.1GB of slab or if it wants to use 0.9GB of slab and 0.1GB of user memory.. Because of this, it might be more practical to integrate the slab accounting in memcg. -- Suleiman -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: memcg: slab control 2009-11-26 10:13 ` Suleiman Souhlal @ 2009-11-30 9:17 ` Balbir Singh 0 siblings, 0 replies; 33+ messages in thread From: Balbir Singh @ 2009-11-30 9:17 UTC (permalink / raw) To: Suleiman Souhlal Cc: KAMEZAWA Hiroyuki, David Rientjes, Pavel Emelyanov, Ying Han, linux-mm * Suleiman Souhlal <suleiman@google.com> [2009-11-26 02:13:17]: > On 11/26/09, Balbir Singh <balbir@linux.vnet.ibm.com> wrote: > > I think it is easier to write a slab controller IMHO. > > One potential problem I can think of with writing a slab controller > would be that the user would have to estimate what fraction of the > amount of memory slab should be allowed to use, which might not be > ideal. > > If you wanted to limit a cgroup to a total of 1GB of memory, you might > not care if the job wants to use 0.9 GB of user memory and 0.1GB of > slab or if it wants to use 0.9GB of slab and 0.1GB of user memory.. > Hmm.. true, yes not caring about how memory usage is partitioned is nice (we have memsw for very similar reasons). > Because of this, it might be more practical to integrate the slab > accounting in memcg. > I tend to agree, but I would like to see the early design and thoughts. Like Kame pointed, integrating their accounting can be an issue. -- Balbir -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: memcg: slab control
  2009-11-26  1:14 ` KAMEZAWA Hiroyuki
  2009-11-26  8:50   ` Balbir Singh
@ 2009-11-30 22:45   ` David Rientjes
  1 sibling, 0 replies; 33+ messages in thread
From: David Rientjes @ 2009-11-30 22:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Pavel Emelyanov, Suleiman Souhlal, Ying Han, linux-mm

On Thu, 26 Nov 2009, KAMEZAWA Hiroyuki wrote:

> But, considering user-side, all people will not welcome dividing memcg and slabcg.
> So, tieing it to current memcg is ok for me.

Agreed.

> like...
> ==
> struct mem_cgroup {
> 	....
> 	....
> 	struct slab_cgroup slabcg; (or struct slab_cgroup *slabcg)
> }
> ==
>
> But we have to use another counter and another scheme, another implemenation
> than memcg, which has good scalability and more fuzzy/lazy controls.
> (For example, trigger slab-shrink when usage exceeds hiwatermark, not limit.)
>

We're only really interested in using memcg and slabcg together for
accounting all memory allotted to a particular cgroup. I'm trying to
imagine a scenario where someone would want to account and enforce hard
slab limits without using memcg as well. If there are none (and one of
the reasons we're trying to elicit discussion is to determine
everyone's requirements for such a feature), we can probably tie them
together without worrying about incurring unnecessary overhead from the
parts of the memcg framework that aren't related to slab accounting.

I think the ideal userspace API would be simply to add slab accounting
to the memcg's limit_in_bytes if a memcg option were enabled for a
cgroup. I don't think it would be helpful to add a ratio of that limit
for slab, though, since it's very difficult to predict the usage for a
particular workload.

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: memcg: slab control 2009-11-25 23:08 memcg: slab control David Rientjes 2009-11-26 1:14 ` KAMEZAWA Hiroyuki @ 2009-11-26 1:17 ` KAMEZAWA Hiroyuki 2009-11-26 10:01 ` Suleiman Souhlal 2009-11-26 2:35 ` KOSAKI Motohiro 2 siblings, 1 reply; 33+ messages in thread From: KAMEZAWA Hiroyuki @ 2009-11-26 1:17 UTC (permalink / raw) To: David Rientjes Cc: Balbir Singh, Pavel Emelyanov, Suleiman Souhlal, Ying Han, linux-mm On Wed, 25 Nov 2009 15:08:00 -0800 (PST) David Rientjes <rientjes@google.com> wrote: > Hi, > > I wanted to see what the current ideas are concerning kernel memory > accounting as it relates to the memory controller. Eventually we'll want > the ability to restrict cgroups to a hard slab limit. That'll require > accounting to map slab allocations back to user tasks so that we can > enforce a policy based on the cgroup's aggregated slab usage similiar to > how the memory controller currently does for user memory. > > Is this currently being thought about within the memcg community? We'd > like to start a discussion and get everybody's requirements and interests > on the table and then become actively involved in the development of such > a feature. > BTW, how much percent of pages are used for slab in Google system ? Because memory size is going bigger and bigger, ratio of slab usage is going smaller, I think. Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: memcg: slab control 2009-11-26 1:17 ` KAMEZAWA Hiroyuki @ 2009-11-26 10:01 ` Suleiman Souhlal 0 siblings, 0 replies; 33+ messages in thread From: Suleiman Souhlal @ 2009-11-26 10:01 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: David Rientjes, Balbir Singh, Pavel Emelyanov, Ying Han, linux-mm Hello, On 11/25/09, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > BTW, how much percent of pages are used for slab in Google system ? > Because memory size is going bigger and bigger, ratio of slab usage is going > smaller, I think. It varies. The amount of slab on systems can go from negligible to being a significant portion of the total memory (in network intensive workloads, for example). -- Suleiman -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: memcg: slab control
  2009-11-25 23:08 memcg: slab control David Rientjes
  2009-11-26  1:14 ` KAMEZAWA Hiroyuki
  2009-11-26  1:17 ` KAMEZAWA Hiroyuki
@ 2009-11-26  2:35 ` KOSAKI Motohiro
  2009-11-27  7:01   ` Ying Han
  2 siblings, 1 reply; 33+ messages in thread
From: KOSAKI Motohiro @ 2009-11-26  2:35 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Balbir Singh, Pavel Emelyanov,
	KAMEZAWA Hiroyuki, Suleiman Souhlal, Ying Han, linux-mm

Hi

> Hi,
>
> I wanted to see what the current ideas are concerning kernel memory
> accounting as it relates to the memory controller. Eventually we'll want
> the ability to restrict cgroups to a hard slab limit. That'll require
> accounting to map slab allocations back to user tasks so that we can
> enforce a policy based on the cgroup's aggregated slab usage similiar to
> how the memory controller currently does for user memory.
>
> Is this currently being thought about within the memcg community? We'd
> like to start a discussion and get everybody's requirements and interests
> on the table and then become actively involved in the development of such
> a feature.

I don't think memory hard isolation is a bad idea. However, restricting
only slab is too strange. Some code uses slab frequently; other code
uses get_free_pages() directly. Restricting slab alone will not produce
the result an admin expects.

Probably we need to implement a generic memory reservation framework.
It might also help implement rt-task memory reservation and a userland
oom manager.

It is only my personal opinion...

Thanks.

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: memcg: slab control
  2009-11-26  2:35 ` KOSAKI Motohiro
@ 2009-11-27  7:01   ` Ying Han
  2009-11-27  9:48     ` Pavel Emelyanov
  0 siblings, 1 reply; 33+ messages in thread
From: Ying Han @ 2009-11-27  7:01 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: David Rientjes, Balbir Singh, Pavel Emelyanov,
	KAMEZAWA Hiroyuki, Suleiman Souhlal, linux-mm

On Wed, Nov 25, 2009 at 6:35 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> Hi
>
>> Hi,
>>
>> I wanted to see what the current ideas are concerning kernel memory
>> accounting as it relates to the memory controller. Eventually we'll want
>> the ability to restrict cgroups to a hard slab limit. That'll require
>> accounting to map slab allocations back to user tasks so that we can
>> enforce a policy based on the cgroup's aggregated slab usage similiar to
>> how the memory controller currently does for user memory.
>>
>> Is this currently being thought about within the memcg community? We'd
>> like to start a discussion and get everybody's requirements and interests
>> on the table and then become actively involved in the development of such
>> a feature.
>
> I don't think memory hard isolation is bad idea. however, slab restriction
> is too strange. some device use slab frequently, another someone use get_free_pages()
> directly. only slab restriction will not make expected result from admin view.
>
> Probably, we need to implement generic memory reservation framework. it mihgt help
> implemnt rt-task memory reservation and userland oom manager.
>
> It is only my personal opinion...

Looks like the beancounters implementation counts both the kernel slab
objects and the pages from get_free_pages(). But it relies on the
caller to pass down a GFP flag indicating whether the page or slab is
accountable or not. I am looking at the beancounters v5 at:
http://lkml.indiana.edu/hypermail/linux/kernel/0610.0/1719.html

I kind of like the idea of having a kernel memory controller instead
of a kernel slab controller.
If we only count kernel slabs, do we need another mechanism to count
kernel allocations made directly through get_free_pages()?

--Ying

>
>
> Thanks.
>
>

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: memcg: slab control 2009-11-27 7:01 ` Ying Han @ 2009-11-27 9:48 ` Pavel Emelyanov 0 siblings, 0 replies; 33+ messages in thread From: Pavel Emelyanov @ 2009-11-27 9:48 UTC (permalink / raw) To: Ying Han Cc: KOSAKI Motohiro, David Rientjes, Balbir Singh, Pavel Emelyanov, KAMEZAWA Hiroyuki, Suleiman Souhlal, linux-mm Ying Han wrote: > I kind of like the idea to have a kernel memory controller instead of > kernel slab controller. > If we only count kernel slabs, do we need another mechanism to count > kernel allocations > directly from get_free_pages() ? We do. Look at what get_free_pages we mark with GFP_UBC in beancounters and see, that if we don't count them this creates way to DoS the kernel. > --Ying >> >> Thanks. >> >> > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2009-12-02 10:52 UTC | newest] Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2009-11-25 23:08 memcg: slab control David Rientjes 2009-11-26 1:14 ` KAMEZAWA Hiroyuki 2009-11-26 8:50 ` Balbir Singh 2009-11-26 8:56 ` KAMEZAWA Hiroyuki 2009-11-26 9:10 ` Pavel Emelyanov 2009-11-26 9:33 ` KAMEZAWA Hiroyuki 2009-11-26 9:56 ` Pavel Emelyanov 2009-11-26 10:24 ` Suleiman Souhlal 2009-11-26 12:31 ` Pavel Emelyanov 2009-11-26 12:52 ` Suleiman Souhlal 2009-12-01 7:40 ` Balbir Singh 2009-11-27 7:15 ` Ying Han 2009-11-27 9:45 ` Pavel Emelyanov 2009-12-01 5:14 ` KOSAKI Motohiro 2009-11-30 22:57 ` David Rientjes 2009-12-01 10:31 ` Pavel Emelyanov 2009-12-01 22:29 ` David Rientjes 2009-12-01 7:36 ` Balbir Singh 2009-12-01 10:40 ` Pavel Emelyanov 2009-12-01 15:14 ` Balbir Singh 2009-12-02 10:14 ` Pavel Emelyanov 2009-12-02 10:19 ` Balbir Singh 2009-12-02 10:51 ` Pavel Emelyanov 2009-11-30 22:55 ` David Rientjes 2009-12-01 10:39 ` Pavel Emelyanov 2009-11-26 10:13 ` Suleiman Souhlal 2009-11-30 9:17 ` Balbir Singh 2009-11-30 22:45 ` David Rientjes 2009-11-26 1:17 ` KAMEZAWA Hiroyuki 2009-11-26 10:01 ` Suleiman Souhlal 2009-11-26 2:35 ` KOSAKI Motohiro 2009-11-27 7:01 ` Ying Han 2009-11-27 9:48 ` Pavel Emelyanov