* [RFC PATCH] memcg: export knobs for the default cgroup hierarchy
@ 2014-07-16 14:39 Michal Hocko
  2014-07-16 15:58 ` Johannes Weiner
  0 siblings, 1 reply; 13+ messages in thread

From: Michal Hocko @ 2014-07-16 14:39 UTC (permalink / raw)
To: linux-mm
Cc: LKML, Johannes Weiner, Tejun Heo, Hugh Dickins, Greg Thelen,
    Vladimir Davydov, Glauber Costa, Andrew Morton, KAMEZAWA Hiroyuki,
    KOSAKI Motohiro

Starting with 8f9ac36d2cbb ("cgroup: distinguish the default and legacy
hierarchies when handling cftypes") the memory cgroup controller doesn't
export any knobs because all of them are marked as legacy. The idea is
that only selected knobs are exported for the new cgroup API.

This patch exports the core knobs for the memory controller. The
following knobs are not and won't be available in the default (aka
unified) hierarchy:

- use_hierarchy - one of the biggest mistakes made when the memory
  controller was introduced. It allows for creating a hierarchical
  cgroup structure without any hierarchical accounting. This leads to
  really strange configurations where other co-mounted controllers
  behave hierarchically while the memory controller doesn't. All
  controllers have to be hierarchical with the new cgroup API, so this
  knob doesn't make any sense there.

- force_empty - was introduced primarily to drop memory before it gets
  reparented on group removal. This alone doesn't sound fully justified
  because reparented pages which are not in use can also be reclaimed
  later, when there is memory pressure at the parent level. Another use
  case would be something like a per-memcg /proc/sys/vm/drop_caches,
  which doesn't sound like a great idea either. We are trying to get
  away from using it at the global level, so we shouldn't allow it at
  the per-memcg level either.

- soft_limit_in_bytes - was originally introduced to help recover from
  overcommit situations where the overall hard limits on the system are
  higher than the available memory.
  The group which has the largest excess over its soft limit is
  reclaimed during global memory pressure, to help reduce that pressure.
  The primary problem with this tunable is that every memcg is
  soft-unlimited by default, which is the reverse of what would be
  expected from such a knob. Another problem is that the soft limit is
  considered only during global memory pressure rather than under
  external memory pressure in general (e.g. triggered by a limit hit on
  a parent up the hierarchy). There are other issues tied to the
  implementation (e.g. the priority-0 reclaim used for soft limit
  reclaim) which are really hard to fix without breaking potential
  users. There will be a replacement for the soft limit in the unified
  hierarchy and users will be encouraged to switch their configuration
  to the new scheme. Until that is available, users are advised to stay
  with the legacy cgroup API.

The TCP kmem sub-controller is not exported at this stage because it has
seen basically no traction since it was merged and it is not entirely
clear why the kmem controller cannot be used for the same purpose.
Having two controllers for tracking kernel memory allocations sounds
like too much. If there are use cases and reasons for not merging it
into kmem, we can reconsider and allow it in the new cgroup API later.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 Documentation/cgroups/memory.txt |  19 ++++---
 mm/memcontrol.c                  | 105 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 115 insertions(+), 9 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 02ab997a1ed2..a8f01497c5de 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -62,10 +62,10 @@ Brief summary of control files.
 memory.memsw.failcnt		 # show the number of memory+Swap hits limits
 memory.max_usage_in_bytes	 # show max memory usage recorded
 memory.memsw.max_usage_in_bytes # show max memory+Swap usage recorded
- memory.soft_limit_in_bytes	 # set/show soft limit of memory usage
+[D] memory.soft_limit_in_bytes	 # set/show soft limit of memory usage
 memory.stat			 # show various statistics
- memory.use_hierarchy		 # set/show hierarchical account enabled
- memory.force_empty		 # trigger forced move charge to parent
+[D] memory.use_hierarchy	 # set/show hierarchical account enabled
+[D] memory.force_empty		 # trigger forced move charge to parent
 memory.pressure_level		 # set memory pressure notifications
 memory.swappiness		 # set/show swappiness parameter of vmscan
				 (See sysctl's vm.swappiness)
@@ -78,10 +78,15 @@ Brief summary of control files.
 memory.kmem.failcnt		 # show the number of kernel memory usage hits limits
 memory.kmem.max_usage_in_bytes	 # show max kernel memory usage recorded
- memory.kmem.tcp.limit_in_bytes	 # set/show hard limit for tcp buf memory
- memory.kmem.tcp.usage_in_bytes	 # show current tcp buf memory allocation
- memory.kmem.tcp.failcnt	 # show the number of tcp buf memory usage hits limits
- memory.kmem.tcp.max_usage_in_bytes # show max tcp buf memory usage recorded
+[D] memory.kmem.tcp.limit_in_bytes # set/show hard limit for tcp buf memory
+[D] memory.kmem.tcp.usage_in_bytes # show current tcp buf memory allocation
+[D] memory.kmem.tcp.failcnt	 # show the number of tcp buf memory usage hits limits
+[D] memory.kmem.tcp.max_usage_in_bytes # show max tcp buf memory usage recorded
+
+Knobs marked as [D] are considered deprecated and they won't be available in
+the new cgroup Unified hierarchy API (see
+Documentation/cgroups/unified-hierarchy.txt for more information). They are
+still available with the legacy hierarchy though.

 1.
History

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fa99a3e2e427..9ed40a045d27 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5226,7 +5226,11 @@ out_kfree:
 	return ret;
 }

-static struct cftype mem_cgroup_files[] = {
+/*
+ * memcg knobs for the legacy cgroup API. No new files should be
+ * added here.
+ */
+static struct cftype legacy_mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
 		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
@@ -5334,6 +5338,100 @@ static struct cftype mem_cgroup_files[] = {
 	{ },	/* terminate */
 };

+/* memcg knobs for new cgroups API (default aka unified hierarchy) */
+static struct cftype dfl_mem_cgroup_files[] = {
+	{
+		.name = "usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "max_usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
+		.write = mem_cgroup_reset,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
+		.write = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "failcnt",
+		.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
+		.write = mem_cgroup_reset,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "stat",
+		.seq_show = memcg_stat_show,
+	},
+	{
+		.name = "cgroup.event_control",	/* XXX: for compat */
+		.write = memcg_write_event_control,
+		.flags = CFTYPE_NO_PREFIX,
+		.mode = S_IWUGO,
+	},
+	{
+		.name = "swappiness",
+		.read_u64 = mem_cgroup_swappiness_read,
+		.write_u64 = mem_cgroup_swappiness_write,
+	},
+	{
+		.name = "move_charge_at_immigrate",
+		.read_u64 = mem_cgroup_move_charge_read,
+		.write_u64 = mem_cgroup_move_charge_write,
+	},
+	{
+		.name = "oom_control",
+		.seq_show = mem_cgroup_oom_control_read,
+		.write_u64 = mem_cgroup_oom_control_write,
+		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
+	},
+	{
+		.name = "pressure_level",
+	},
+#ifdef CONFIG_NUMA
+	{
+		.name = "numa_stat",
+		.seq_show = memcg_numa_stat_show,
+	},
+#endif
+#ifdef CONFIG_MEMCG_KMEM
+	{
+		.name = "kmem.limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
+		.write = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "kmem.max_usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_MAX_USAGE),
+		.write = mem_cgroup_reset,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "kmem.usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "kmem.failcnt",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_FAILCNT),
+		.write = mem_cgroup_reset,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+#ifdef CONFIG_SLABINFO
+	{
+		.name = "kmem.slabinfo",
+		.seq_show = mem_cgroup_slabinfo_read,
+	},
+#endif
+#endif
+	{ },	/* terminate */
+};
+
 #ifdef CONFIG_MEMCG_SWAP
 static struct cftype memsw_cgroup_files[] = {
 	{
@@ -6266,7 +6364,8 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.cancel_attach = mem_cgroup_cancel_attach,
 	.attach = mem_cgroup_move_task,
 	.bind = mem_cgroup_bind,
-	.legacy_cftypes = mem_cgroup_files,
+	.legacy_cftypes = legacy_mem_cgroup_files,
+	.dfl_cftypes = dfl_mem_cgroup_files,
 	.early_init = 0,
 };

@@ -6285,6 +6384,8 @@ static void __init memsw_file_init(void)
 {
 	WARN_ON(cgroup_add_legacy_cftypes(&memory_cgrp_subsys,
 					  memsw_cgroup_files));
+	WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys,
+				       memsw_cgroup_files));
 }

 static void __init enable_swap_cgroup(void)
--
2.0.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [RFC PATCH] memcg: export knobs for the defaul cgroup hierarchy 2014-07-16 14:39 [RFC PATCH] memcg: export knobs for the defaul cgroup hierarchy Michal Hocko @ 2014-07-16 15:58 ` Johannes Weiner 2014-07-17 13:45 ` Michal Hocko ` (2 more replies) 0 siblings, 3 replies; 13+ messages in thread From: Johannes Weiner @ 2014-07-16 15:58 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, LKML, Tejun Heo, Hugh Dickins, Greg Thelen, Vladimir Davydov, Glauber Costa, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Wed, Jul 16, 2014 at 04:39:38PM +0200, Michal Hocko wrote: > Starting with 8f9ac36d2cbb (cgroup: distinguish the default and legacy > hierarchies when handling cftypes) memory cgroup controller doesn't > export any knobs because all of them are marked as legacy. The idea is > that only selected knobs are exported for the new cgroup API. > > This patch exports the core knobs for the memory controller. The > following knobs are not and won't be available in the default (aka > unified) hierarchy: > - use_hierarchy - was one of the biggest mistakes when memory controller > was introduced. It allows for creating hierarchical cgroups structure > which doesn't have any hierarchical accounting. This leads to really > strange configurations where other co-mounted controllers behave > hierarchically while memory controller doesn't. > All controllers have to be hierarchical with the new cgroups API so > this knob doesn't make any sense here. > - force_empty - has been introduced primarily to drop memory before it > gets reparented on the group removal. This alone doesn't sound > fully justified because reparented pages which are not in use can be > reclaimed also later when there is a memory pressure on the parent > level. > Another use-case would be something like per-memcg /proc/sys/vm/drop_caches > which doesn't sound like a great idea either. We are trying to get > away from using it on the global level so we shouldn't allow that on > per-memcg level as well. 
> - soft_limit_in_bytes - has been originally introduced to help to > recover from the overcommit situations where the overall hard limits > on the system are higher than the available memory. A group which has > the largest excess on the soft limit is reclaimed to help to reduce > memory pressure during the global memory pressure. > The primary problem with this tunable is that every memcg is soft > unlimited by default which is reverse to what would be expected from > such a knob. > Another problem is that soft limit is considered only during the > global memory pressure rather than on an external memory pressure in > general (e.g. triggered by the limit hit on a parent up the > hierarchy). > There are other issues which are tight to the implementation (e.g. > priority-0 reclaim used for the soft limit reclaim etc.) which are > really hard to fix without breaking potential users. > There will be a replacement for the soft limit in the unified > hierarchy and users will be encouraged to switch their configuration > to the new scheme. Until this is available users are suggested to stay > with the legacy cgroup API. > > TCP kmem sub-controller is not exported at this stage because this one has > seen basically no traction since it was merged and it is not entirely > clear why kmem controller cannot be used for the same purpose. Having 2 > controllers for tracking kernel memory allocations sounds like too much. > If there are use-cases and reasons for not merging it into kmem then we > can reconsider and allow it for the new cgroups API later. There is a reason why we start out empty on the default hierarchy: the old interface is a complete cesspool. We're not blindly carrying over any of it. Everything that is added in .dfl_ctypes is a new interface and it needs to be treated as such: it needs a valid usecase to back it up, and it needs to be evaluated whether the exported information or control knob is a good way to support that usecase. 
> @@ -5226,7 +5226,11 @@ out_kfree: > return ret; > } > > -static struct cftype mem_cgroup_files[] = { > +/* > + * memcg knobs for the legacy cgroup API. No new files should be > + * added here. > + */ > +static struct cftype legacy_mem_cgroup_files[] = { > { > .name = "usage_in_bytes", > .private = MEMFILE_PRIVATE(_MEM, RES_USAGE), > @@ -5334,6 +5338,100 @@ static struct cftype mem_cgroup_files[] = { > { }, /* terminate */ > }; > > +/* memcg knobs for new cgroups API (default aka unified hierarchy) */ > +static struct cftype dfl_mem_cgroup_files[] = { > + { > + .name = "usage_in_bytes", > + .private = MEMFILE_PRIVATE(_MEM, RES_USAGE), > + .read_u64 = mem_cgroup_read_u64, > + }, The _in_bytes suffix is pointless, we just need to document that everything in memcg is in bytes, unless noted otherwise. How about "memory.current"? > + { > + .name = "max_usage_in_bytes", > + .private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE), > + .write = mem_cgroup_reset, > + .read_u64 = mem_cgroup_read_u64, What is this actually good for? We have lazy cache shrinking, which means that the high watermark of most workloads depends on the group max limit. > + }, > + { > + .name = "limit_in_bytes", > + .private = MEMFILE_PRIVATE(_MEM, RES_LIMIT), > + .write = mem_cgroup_write, > + .read_u64 = mem_cgroup_read_u64, > + }, We already agreed that there will be a max, a high, a min, and a low limit, why would you want to reintroduce the max limit as "limit"? How about "memory.max"? memory.min memory.low memory.high memory.max memory.current > + { > + .name = "failcnt", > + .private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT), > + .write = mem_cgroup_reset, > + .read_u64 = mem_cgroup_read_u64, > + }, This indicates the number of times the function res_counter_charge() was called and didn't succeed. Let that sink in. This is way too dependent on the current implementation. If you want a measure of how much pressure the group is under, look at vmpressure, or add scanned/reclaimed reclaim statistics. NAK. 
> + { > + .name = "stat", > + .seq_show = memcg_stat_show, > + }, This file is a complete mess. pgpgin/pgpgout originally refers to IO, here it means pages charged and uncharged. It's questionable whether we even need counters for charges and uncharges. mapped_file vs. NR_FILE_MAPPED is another eyesore. We should generally try harder to be consistent with /proc/vmstat. And rename it to "memory.vmstat" while we are at it. Also, having local counters no longer makes sense as there won't be tasks in intermediate nodes anymore and we are getting rid of reparenting. All counters should appear once, in their hierarchical form. The exception is the root_mem_cgroup, which should probably have a separate file for the local counters. > + { > + .name = "cgroup.event_control", /* XXX: for compat */ > + .write = memcg_write_event_control, > + .flags = CFTYPE_NO_PREFIX, > + .mode = S_IWUGO, Why? > + }, > + { > + .name = "swappiness", > + .read_u64 = mem_cgroup_swappiness_read, > + .write_u64 = mem_cgroup_swappiness_write, Do we actually need this? > + }, > + { > + .name = "move_charge_at_immigrate", > + .read_u64 = mem_cgroup_move_charge_read, > + .write_u64 = mem_cgroup_move_charge_write, This creates significant pain because pc->mem_cgroup becomes a moving target during the lifetime of the page. When this feature was added, there was no justification - except that "some users feel [charges staying in the group] to be strange" - and it was lacking the necessary synchronization to make this work properly, so the cost of this feature was anything but obvious during the initial submission. I generally don't see why tasks should be moving between groups aside from the initial setup phase. And then they shouldn't have consumed any amounts of memory that they couldn't afford to leave behind in the root/parent. So what is the usecase for this that would justify the immense cost in terms of both performance and code complexity? 
> + }, > + { > + .name = "oom_control", > + .seq_show = mem_cgroup_oom_control_read, > + .write_u64 = mem_cgroup_oom_control_write, > + .private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL), This needs a usecase description as well. > + }, > + { > + .name = "pressure_level", "memory.pressure"? > + }, > +#ifdef CONFIG_NUMA > + { > + .name = "numa_stat", > + .seq_show = memcg_numa_stat_show, > + }, This would also be a chance to clean up this file, which is suddenly specifying memory size in pages rather than bytes. > +#ifdef CONFIG_MEMCG_KMEM > + { > + .name = "kmem.limit_in_bytes", > + .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT), > + .write = mem_cgroup_write, > + .read_u64 = mem_cgroup_read_u64, > + }, Does it really make sense to have a separate limit for kmem only? IIRC, the reason we introduced this was that this memory is not reclaimable and so we need to limit it. But the opposite effect happened: because it's not reclaimable, the separate kmem limit is actually unusable for any values smaller than the overall memory limit: because there is no reclaim mechanism for that limit, once you hit it, it's over, there is nothing you can do anymore. The problem isn't so much unreclaimable memory, the problem is unreclaimable limits. If the global case produces memory pressure through kernel memory allocations, we reclaim page cache, anonymous pages, inodes, dentries etc. I think the same should happen for kmem: kmem should just be accounted and limited in the overall memory limit of a group, and when pressure arises, we go after anything that's reclaimable. > + { > + .name = "kmem.max_usage_in_bytes", > + .private = MEMFILE_PRIVATE(_KMEM, RES_MAX_USAGE), > + .write = mem_cgroup_reset, > + .read_u64 = mem_cgroup_read_u64, As per above, I don't see that a high watermark is meaningful with lazy cache shrinking. 
> + { > + .name = "kmem.usage_in_bytes", > + .private = MEMFILE_PRIVATE(_KMEM, RES_USAGE), > + .read_u64 = mem_cgroup_read_u64, > + }, We could just include slab counters, kernel stack pages etc. in the statistics file, like /proc/vmstat does. > + { > + .name = "kmem.failcnt", > + .private = MEMFILE_PRIVATE(_KMEM, RES_FAILCNT), > + .write = mem_cgroup_reset, > + .read_u64 = mem_cgroup_read_u64, NAK as per above. > +#ifdef CONFIG_SLABINFO > + { > + .name = "kmem.slabinfo", > + .seq_show = mem_cgroup_slabinfo_read, > + }, > +#endif > +#endif > + { }, /* terminate */ > +}; > + > #ifdef CONFIG_MEMCG_SWAP > static struct cftype memsw_cgroup_files[] = { > { > @@ -6266,7 +6364,8 @@ struct cgroup_subsys memory_cgrp_subsys = { > .cancel_attach = mem_cgroup_cancel_attach, > .attach = mem_cgroup_move_task, > .bind = mem_cgroup_bind, > - .legacy_cftypes = mem_cgroup_files, > + .legacy_cftypes = legacy_mem_cgroup_files, > + .dfl_cftypes = dfl_mem_cgroup_files, -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH] memcg: export knobs for the defaul cgroup hierarchy 2014-07-16 15:58 ` Johannes Weiner @ 2014-07-17 13:45 ` Michal Hocko 2014-07-18 15:44 ` Vladimir Davydov 2014-07-21 13:22 ` Michal Hocko 2 siblings, 0 replies; 13+ messages in thread From: Michal Hocko @ 2014-07-17 13:45 UTC (permalink / raw) To: Johannes Weiner Cc: linux-mm, LKML, Tejun Heo, Hugh Dickins, Greg Thelen, Vladimir Davydov, Glauber Costa, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Wed 16-07-14 11:58:14, Johannes Weiner wrote: > On Wed, Jul 16, 2014 at 04:39:38PM +0200, Michal Hocko wrote: [...] > > +/* memcg knobs for new cgroups API (default aka unified hierarchy) */ > > +static struct cftype dfl_mem_cgroup_files[] = { > > + { > > + .name = "usage_in_bytes", > > + .private = MEMFILE_PRIVATE(_MEM, RES_USAGE), > > + .read_u64 = mem_cgroup_read_u64, > > + }, > > The _in_bytes suffix is pointless, we just need to document that > everything in memcg is in bytes, unless noted otherwise. > > How about "memory.current"? I wanted old users to change the minimum possible when moving to unified hierarchy so I didn't touch the old names. Why should we make the end users life harder? If there is general agreement I have no problem with renaming I just do not think it is really necessary because there is no real reason why configurations which do not use any of the deprecated or unified-hierarchy-only features shouldn't run in both unified and legacy hierarchies without any changes. I do realize that this is a _new_ API so we can do such radical changes but I am also aware that some people have to maintain their stacks on top of different kernels and it really sucks to maintain two different configurations. In such a case it would be easier for those users to stay with the legacy mode which is a fair option but I would much rather see them move to the new API sooner rather than later. memory.usage would be much better fit IMO. 
> > + { > > + .name = "max_usage_in_bytes", > > + .private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE), > > + .write = mem_cgroup_reset, > > + .read_u64 = mem_cgroup_read_u64, > > What is this actually good for? We have lazy cache shrinking, which > means that the high watermark of most workloads depends on the group > max limit. Well that is a good questions. I was going back and forth disabling this and failcnt before posting this RFC and ended up adding them as they never were controversial. I have no problem ditching them, though, because the usefulness is quite dubious. If someone wants to see whether the hard limit can be decreased without putting too much reclaim pressure then we have a notification mechanism for it. > > + }, > > + { > > + .name = "limit_in_bytes", > > + .private = MEMFILE_PRIVATE(_MEM, RES_LIMIT), > > + .write = mem_cgroup_write, > > + .read_u64 = mem_cgroup_read_u64, > > + }, > > We already agreed that there will be a max, a high, a min, and a low > limit, why would you want to reintroduce the max limit as "limit"? Same as above. I didn't rename knobs for easier transition. On the other hand it is true that the name doesn't fit so nicely with the new upcoming limits scheme. Is this reason sufficient to make users lives harder? > How about "memory.max"? > > memory.min > memory.low > memory.high > memory.max > memory.current > > > + { > > + .name = "failcnt", > > + .private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT), > > + .write = mem_cgroup_reset, > > + .read_u64 = mem_cgroup_read_u64, > > + }, > > This indicates the number of times the function res_counter_charge() > was called and didn't succeed. > > Let that sink in. > > This is way too dependent on the current implementation. Well not really. It doesn't depend on the res_counter directly. We just happen to use it this way. Whatever we will use in future there will still be a moment when the hard (or whatever we call it) limit is hit. Whether somebody depends on it is a question. 
I wouldn't be surprised if somebody does but I also think that making any decisions based on the counter are dubious at best. But hey, users are really creative... > If you want a measure of how much pressure the group is under, look at > vmpressure, or add scanned/reclaimed reclaim statistics. NAK. OK, I will not miss this one either. As you say there is a better way to measure the pressure and make decisions based on that. > > + { > > + .name = "stat", > > + .seq_show = memcg_stat_show, > > + }, > > This file is a complete mess. It is! > pgpgin/pgpgout originally refers to IO, here it means pages charged > and uncharged. It's questionable whether we even need counters for > charges and uncharges. > > mapped_file vs. NR_FILE_MAPPED is another eyesore. > > We should generally try harder to be consistent with /proc/vmstat. > And rename it to "memory.vmstat" while we are at it. I definitely agree that we should converge to vmstat-like counters. We should also fix the counters which do not make much sense. I just do not want to end up with two sets of stats depending on whether we are in default or legacy hierarchy. > Also, having local counters no longer makes sense as there won't be > tasks in intermediate nodes anymore and we are getting rid of > reparenting. I am not sure we should get rid of reparenting. It makes sense to have pages from the gone memcgs in the parent so they are the first candidate to reclaim. > All counters should appear once, in their hierarchical form. Each memcg has its local state (e.g. LRUs) so we should reflect that in the stat file IMO. Or are there any plans to use different mem_cgroup structure for the intermediate nodes? I haven't heard anything like that. > The exception is the root_mem_cgroup, which should probably > have a separate file for the local counters. 
> > > + { > > + .name = "cgroup.event_control", /* XXX: for compat */ > > + .write = memcg_write_event_control, > > + .flags = CFTYPE_NO_PREFIX, > > + .mode = S_IWUGO, > > Why? For the oom, thresholds and vmpressure notifications, obviously. Or do we have any other means to do the same? Does it really make sense to push all the current users to use something else? I understand that cgroup core didn't want to support such a generic tool. But I think it serves its purpose for memcg and it works reasonably well. I am surely open to discuss alternatives. > > + }, > > + { > > + .name = "swappiness", > > + .read_u64 = mem_cgroup_swappiness_read, > > + .write_u64 = mem_cgroup_swappiness_write, > > Do we actually need this? Swappiness is a natural property of LRU (anon/file) scanning and LRUs belong to memcg so they should have a way to tell their preference. Consider container setups for example. There will never be one-swappiness-suits-all of them. > > + }, > > + { > > + .name = "move_charge_at_immigrate", > > + .read_u64 = mem_cgroup_move_charge_read, > > + .write_u64 = mem_cgroup_move_charge_write, > > This creates significant pain because pc->mem_cgroup becomes a moving > target during the lifetime of the page. When this feature was added, > there was no justification - except that "some users feel [charges > staying in the group] to be strange" - and it was lacking the > necessary synchronization to make this work properly, so the cost of > this feature was anything but obvious during the initial submission. Actually I think that move charge with tasks should be enabled by default. If the task moving between groups should be supported (and I think it should be) then leaving the charges and pages behind is more than strange. Why does the task move in the first place? Just to get rid of the responsibility for its previous memory consumption? So I am OK with the knob removing but I think we should move charge by default in the unified hierarchy. 
> I generally don't see why tasks should be moving between groups aside > from the initial setup phase. And then they shouldn't have consumed > any amounts of memory that they couldn't afford to leave behind in the > root/parent. > > So what is the usecase for this that would justify the immense cost in > terms of both performance and code complexity? One of them would be cooperation with other controllers where moving task has its own meaning (e.g. to a cpu group with a different share or a cpuset with a slightly different cpu/node set etc...). Memory controller shouldn't disallow task moving. Another one would be memory "load" balancing. Say you have 2 (high and low priority) sets of tasks running on your system/container. High priority tasks shouldn't be reclaimed much because that would increase their latencies or whatever. Low prio tasks can get reclaimed or even OOM killed. Load balancer, admin, user putting task to/from the background or any other mechanism can decide to change the "priority" of the task simply by moving it to the appropriate memcg. > > > + }, > > + { > > + .name = "oom_control", > > + .seq_show = mem_cgroup_oom_control_read, > > + .write_u64 = mem_cgroup_oom_control_write, > > + .private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL), > > This needs a usecase description as well. There are more of them I think. Btw. I have seen users who started using memory cgroups just because they were able to control OOM from userspace. The strongest use case is to handle OOM conditions gracefully. This includes 1) allow proper shutdown of the service 2) allow killing OOM victim's related processes which do not make any sense on their own 3) make a more workload aware victim selection. I am fully aware that there has been a lot of abuse in the past and users pushing the feature to its limits but that doesn't qualify to removing otherwise very useful feature. > > > + }, > > + { > > + .name = "pressure_level", > > "memory.pressure"? 
> > > + }, > > +#ifdef CONFIG_NUMA > > + { > > + .name = "numa_stat", > > + .seq_show = memcg_numa_stat_show, > > + }, > > This would also be a chance to clean up this file, which is suddenly > specifying memory size in pages rather than bytes. OK, we should merge it with stat file. > > +#ifdef CONFIG_MEMCG_KMEM > > + { > > + .name = "kmem.limit_in_bytes", > > + .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT), > > + .write = mem_cgroup_write, > > + .read_u64 = mem_cgroup_read_u64, > > + }, > > Does it really make sense to have a separate limit for kmem only? > IIRC, the reason we introduced this was that this memory is not > reclaimable and so we need to limit it. My recollection is different. Basically there are users who really do not care about kmem accounting and do not want to pay runtime overhead. Documentation/cgroups/memory.txt then describes different usecases in chapter 2.7.3. I am not user of kmem myself and considering the current state of the extension I have never encourage anybody to use it for anything but playing so I cannot tell you which of the scenario is used most widespread. I do not have any objections to leave the kmem extension in the legacy mode for now and add it later when it matures. In fact my original patch did that but then I've decided to keep it because the current limitations seem to be more implementation than semantic specific. And Vladimir is doing a really great job to fill the gaps. > But the opposite effect happened: because it's not reclaimable, the > separate kmem limit is actually unusable for any values smaller than > the overall memory limit: because there is no reclaim mechanism for > that limit, once you hit it, it's over, there is nothing you can do > anymore. The problem isn't so much unreclaimable memory, the problem > is unreclaimable limits. This is the limitation of the current implementation. 
> If the global case produces memory pressure through kernel memory > allocations, we reclaim page cache, anonymous pages, inodes, dentries > etc. I think the same should happen for kmem: kmem should just be > accounted and limited in the overall memory limit of a group, and when > pressure arises, we go after anything that's reclaimable. > > > + { > > + .name = "kmem.max_usage_in_bytes", > > + .private = MEMFILE_PRIVATE(_KMEM, RES_MAX_USAGE), > > + .write = mem_cgroup_reset, > > + .read_u64 = mem_cgroup_read_u64, > > As per above, I don't see that a high watermark is meaningful with > lazy cache shrinking. Sure, if other max_usage_in_bytes goes away this one will go as well. > > + { > > + .name = "kmem.usage_in_bytes", > > + .private = MEMFILE_PRIVATE(_KMEM, RES_USAGE), > > + .read_u64 = mem_cgroup_read_u64, > > + }, > > We could just include slab counters, kernel stack pages etc. in the > statistics file, like /proc/vmstat does. Agreed. > > + { > > + .name = "kmem.failcnt", > > + .private = MEMFILE_PRIVATE(_KMEM, RES_FAILCNT), > > + .write = mem_cgroup_reset, > > + .read_u64 = mem_cgroup_read_u64, > > NAK as per above. > > > +#ifdef CONFIG_SLABINFO > > + { > > + .name = "kmem.slabinfo", > > + .seq_show = mem_cgroup_slabinfo_read, > > + }, > > +#endif > > +#endif > > + { }, /* terminate */ > > +}; > > + > > #ifdef CONFIG_MEMCG_SWAP > > static struct cftype memsw_cgroup_files[] = { > > { > > @@ -6266,7 +6364,8 @@ struct cgroup_subsys memory_cgrp_subsys = { > > .cancel_attach = mem_cgroup_cancel_attach, > > .attach = mem_cgroup_move_task, > > .bind = mem_cgroup_bind, > > - .legacy_cftypes = mem_cgroup_files, > > + .legacy_cftypes = legacy_mem_cgroup_files, > > + .dfl_cftypes = dfl_mem_cgroup_files, -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
* Re: [RFC PATCH] memcg: export knobs for the defaul cgroup hierarchy 2014-07-16 15:58 ` Johannes Weiner 2014-07-17 13:45 ` Michal Hocko @ 2014-07-18 15:44 ` Vladimir Davydov 2014-07-18 16:13 ` Johannes Weiner 2014-07-21 9:07 ` Michal Hocko 2014-07-21 13:22 ` Michal Hocko 2 siblings, 2 replies; 13+ messages in thread From: Vladimir Davydov @ 2014-07-18 15:44 UTC (permalink / raw) To: Johannes Weiner Cc: Michal Hocko, linux-mm, LKML, Tejun Heo, Hugh Dickins, Greg Thelen, Glauber Costa, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Wed, Jul 16, 2014 at 11:58:14AM -0400, Johannes Weiner wrote: > On Wed, Jul 16, 2014 at 04:39:38PM +0200, Michal Hocko wrote: > > +#ifdef CONFIG_MEMCG_KMEM > > + { > > + .name = "kmem.limit_in_bytes", > > + .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT), > > + .write = mem_cgroup_write, > > + .read_u64 = mem_cgroup_read_u64, > > + }, > > Does it really make sense to have a separate limit for kmem only? > IIRC, the reason we introduced this was that this memory is not > reclaimable and so we need to limit it. > > But the opposite effect happened: because it's not reclaimable, the > separate kmem limit is actually unusable for any values smaller than > the overall memory limit: because there is no reclaim mechanism for > that limit, once you hit it, it's over, there is nothing you can do > anymore. The problem isn't so much unreclaimable memory, the problem > is unreclaimable limits. > > If the global case produces memory pressure through kernel memory > allocations, we reclaim page cache, anonymous pages, inodes, dentries > etc. I think the same should happen for kmem: kmem should just be > accounted and limited in the overall memory limit of a group, and when > pressure arises, we go after anything that's reclaimable. Personally, I don't think there's much sense in having a separate knob for kmem limit either. Until we have a user with a sane use case for it, let's not propagate it to the new interface. 
Furthermore, even when we introduce kmem shrinking, the kmem-only limit alone won't be very useful, because there are plenty of GFP_NOFS kmem allocations, which make most slab shrinkers useless. To avoid ENOMEMs in such situations, we would have to introduce either a soft kmem limit (watermark) or a kind of kmem precharge. This means that if we decided to introduce a kmem-only limit, we'd eventually have to add more knobs and write more code to make it usable without even knowing if anyone would really benefit from it. However, there might be users that only want user memory limiting and don't want to pay the price of kmem accounting, which is pretty expensive. Even if we implement percpu stocks for kmem, there will still be noticeable overhead due to touching more cache lines on kmalloc/kfree. So I guess there should be a tunable which allows toggling memcg features - maybe a bitmask for future extensibility. 
* Re: [RFC PATCH] memcg: export knobs for the defaul cgroup hierarchy 2014-07-18 15:44 ` Vladimir Davydov @ 2014-07-18 16:13 ` Johannes Weiner 2014-07-21 9:07 ` Michal Hocko 1 sibling, 0 replies; 13+ messages in thread From: Johannes Weiner @ 2014-07-18 16:13 UTC (permalink / raw) To: Vladimir Davydov Cc: Michal Hocko, linux-mm, LKML, Tejun Heo, Hugh Dickins, Greg Thelen, Glauber Costa, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Fri, Jul 18, 2014 at 07:44:43PM +0400, Vladimir Davydov wrote: > On Wed, Jul 16, 2014 at 11:58:14AM -0400, Johannes Weiner wrote: > > On Wed, Jul 16, 2014 at 04:39:38PM +0200, Michal Hocko wrote: > > > +#ifdef CONFIG_MEMCG_KMEM > > > + { > > > + .name = "kmem.limit_in_bytes", > > > + .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT), > > > + .write = mem_cgroup_write, > > > + .read_u64 = mem_cgroup_read_u64, > > > + }, > > > > Does it really make sense to have a separate limit for kmem only? > > IIRC, the reason we introduced this was that this memory is not > > reclaimable and so we need to limit it. > > > > But the opposite effect happened: because it's not reclaimable, the > > separate kmem limit is actually unusable for any values smaller than > > the overall memory limit: because there is no reclaim mechanism for > > that limit, once you hit it, it's over, there is nothing you can do > > anymore. The problem isn't so much unreclaimable memory, the problem > > is unreclaimable limits. > > > > If the global case produces memory pressure through kernel memory > > allocations, we reclaim page cache, anonymous pages, inodes, dentries > > etc. I think the same should happen for kmem: kmem should just be > > accounted and limited in the overall memory limit of a group, and when > > pressure arises, we go after anything that's reclaimable. > > Personally, I don't think there's much sense in having a separate knob > for kmem limit either. Until we have a user with a sane use case for it, > let's not propagate it to the new interface. 
> > Furthermore, even when we introduce kmem shrinking, the kmem-only limit > alone won't be very useful, because there are plenty of GFP_NOFS kmem > allocations, which make most of slab shrinkers useless. To avoid > ENOMEM's in such situation, we would have to introduce either a soft > kmem limit (watermark) or a kind of kmem precharges. This means if we > decided to introduce kmem-only limit, we'd eventually have to add more > knobs and write more code to make it usable w/o even knowing if anyone > would really benefit from it. > > However, there might be users that only want user memory limiting and > don't want to pay the price of kmem accounting, which is pretty > expensive. Even if we implement percpu stocks for kmem, there still will > be noticeable overhead due to touching more cache lines on > kmalloc/kfree. Yes, we should not force everybody to take that cost in general, but once you're using it, how much overhead is it really? Charging already happens in the slow path and we can batch it as you said. I wonder if it would be enough to have the same granularity as the swap controller; a config option and a global runtime toggle. 
* Re: [RFC PATCH] memcg: export knobs for the defaul cgroup hierarchy 2014-07-18 15:44 ` Vladimir Davydov 2014-07-18 16:13 ` Johannes Weiner @ 2014-07-21 9:07 ` Michal Hocko 2014-07-21 11:46 ` Michal Hocko 2014-07-21 11:48 ` Vladimir Davydov 1 sibling, 2 replies; 13+ messages in thread From: Michal Hocko @ 2014-07-21 9:07 UTC (permalink / raw) To: Vladimir Davydov Cc: Johannes Weiner, linux-mm, LKML, Tejun Heo, Hugh Dickins, Greg Thelen, Glauber Costa, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Fri 18-07-14 19:44:43, Vladimir Davydov wrote: > On Wed, Jul 16, 2014 at 11:58:14AM -0400, Johannes Weiner wrote: > > On Wed, Jul 16, 2014 at 04:39:38PM +0200, Michal Hocko wrote: > > > +#ifdef CONFIG_MEMCG_KMEM > > > + { > > > + .name = "kmem.limit_in_bytes", > > > + .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT), > > > + .write = mem_cgroup_write, > > > + .read_u64 = mem_cgroup_read_u64, > > > + }, > > > > Does it really make sense to have a separate limit for kmem only? > > IIRC, the reason we introduced this was that this memory is not > > reclaimable and so we need to limit it. > > > > But the opposite effect happened: because it's not reclaimable, the > > separate kmem limit is actually unusable for any values smaller than > > the overall memory limit: because there is no reclaim mechanism for > > that limit, once you hit it, it's over, there is nothing you can do > > anymore. The problem isn't so much unreclaimable memory, the problem > > is unreclaimable limits. > > > > If the global case produces memory pressure through kernel memory > > allocations, we reclaim page cache, anonymous pages, inodes, dentries > > etc. I think the same should happen for kmem: kmem should just be > > accounted and limited in the overall memory limit of a group, and when > > pressure arises, we go after anything that's reclaimable. > > Personally, I don't think there's much sense in having a separate knob > for kmem limit either. 
Until we have a user with a sane use case for it, > let's not propagate it to the new interface. What about fork-bomb protection? I thought that was the primary usecase for K < U. Or how can we handle that use case with a single limit? A special gfp flag to not trigger the OOM path when called from some kmem charge paths? What about task_count, or whatever the name of the controller was which was dropped and suggested to be replaced by kmem accounting? I can imagine that being implemented by a separate K limit which would be roughly stack_size * task_count + a pillow for slab. -- Michal Hocko SUSE Labs 
* Re: [RFC PATCH] memcg: export knobs for the defaul cgroup hierarchy 2014-07-21 9:07 ` Michal Hocko @ 2014-07-21 11:46 ` Michal Hocko 2014-07-21 12:02 ` Tejun Heo 2014-07-21 12:03 ` Vladimir Davydov 2014-07-21 11:48 ` Vladimir Davydov 1 sibling, 2 replies; 13+ messages in thread From: Michal Hocko @ 2014-07-21 11:46 UTC (permalink / raw) To: Vladimir Davydov Cc: Johannes Weiner, linux-mm, LKML, Tejun Heo, Hugh Dickins, Greg Thelen, Glauber Costa, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Mon 21-07-14 11:07:24, Michal Hocko wrote: > On Fri 18-07-14 19:44:43, Vladimir Davydov wrote: > > On Wed, Jul 16, 2014 at 11:58:14AM -0400, Johannes Weiner wrote: > > > On Wed, Jul 16, 2014 at 04:39:38PM +0200, Michal Hocko wrote: > > > > +#ifdef CONFIG_MEMCG_KMEM > > > > + { > > > > + .name = "kmem.limit_in_bytes", > > > > + .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT), > > > > + .write = mem_cgroup_write, > > > > + .read_u64 = mem_cgroup_read_u64, > > > > + }, > > > > > > Does it really make sense to have a separate limit for kmem only? > > > IIRC, the reason we introduced this was that this memory is not > > > reclaimable and so we need to limit it. > > > > > > But the opposite effect happened: because it's not reclaimable, the > > > separate kmem limit is actually unusable for any values smaller than > > > the overall memory limit: because there is no reclaim mechanism for > > > that limit, once you hit it, it's over, there is nothing you can do > > > anymore. The problem isn't so much unreclaimable memory, the problem > > > is unreclaimable limits. > > > > > > If the global case produces memory pressure through kernel memory > > > allocations, we reclaim page cache, anonymous pages, inodes, dentries > > > etc. I think the same should happen for kmem: kmem should just be > > > accounted and limited in the overall memory limit of a group, and when > > > pressure arises, we go after anything that's reclaimable. 
> > > > Personally, I don't think there's much sense in having a separate knob > > for kmem limit either. Until we have a user with a sane use case for it, > > let's not propagate it to the new interface. > > What about fork-bomb forks protection? I thought that was the primary usecase > for K < U? Or how can we handle that use case with a single limit? A > special gfp flag to not trigger OOM path when called from some kmem > charge paths? Even then, I do not see how this fork-bomb prevention would work without causing OOMs and killing other processes within the group. The danger would still be contained in the group, preventing system-wide disruption. Do we really want only such a narrow usecase? -- Michal Hocko SUSE Labs 
* Re: [RFC PATCH] memcg: export knobs for the defaul cgroup hierarchy 2014-07-21 11:46 ` Michal Hocko @ 2014-07-21 12:02 ` Tejun Heo 0 siblings, 0 replies; 13+ messages in thread From: Tejun Heo @ 2014-07-21 12:02 UTC (permalink / raw) To: Michal Hocko Cc: Vladimir Davydov, Johannes Weiner, linux-mm, LKML, Hugh Dickins, Greg Thelen, Glauber Costa, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro Hello, On Mon, Jul 21, 2014 at 01:46:55PM +0200, Michal Hocko wrote: > Even then, I do not see how would this fork-bomb prevention work without > causing OOMs and killing other processes within the group. The danger > would be still contained in the group and prevent from the system wide > disruption. Do we really want only such a narrow usecase? Does that really matter? I don't buy the usefulness of the various suggested partial failure modes. For example, is a fork-bomb actually something isolatable by not granting more forks? Doing so is likely to cripple the cgroup anyway, which apparently needed forking to operate. Such a partial failure mode would only be useful iff the culprit is mostly isolated even within the cgroup, stops forking once it starts to fail, the already forked excess processes can be identified and killed somehow without requiring forking in the cgroup, and fork failures in other parts of the cgroup hopefully haven't broken the service provided by the cgroup yet. In the long term, we should have per-cgroup OOM killing and terminate the cgroups which fail to behave. I think the value is in the ability to contain such failures, not in the partial failure modes that may or may not be salvageable without any way to systematically determine which way the situation will go. Short of being able to detect which specific processes are fork bombing and take them out, which I don't think we can or should do, I believe that fork-bomb protection should be dealt with as an integral part of generic memcg operation. Thanks. 
-- tejun 
* Re: [RFC PATCH] memcg: export knobs for the defaul cgroup hierarchy 2014-07-21 11:46 ` Michal Hocko 2014-07-21 12:02 ` Tejun Heo @ 2014-07-21 12:03 ` Vladimir Davydov 2014-07-21 12:49 ` Tejun Heo 1 sibling, 1 reply; 13+ messages in thread From: Vladimir Davydov @ 2014-07-21 12:03 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, linux-mm, LKML, Tejun Heo, Hugh Dickins, Greg Thelen, Glauber Costa, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Mon, Jul 21, 2014 at 01:46:55PM +0200, Michal Hocko wrote: > On Mon 21-07-14 11:07:24, Michal Hocko wrote: > > On Fri 18-07-14 19:44:43, Vladimir Davydov wrote: > > > On Wed, Jul 16, 2014 at 11:58:14AM -0400, Johannes Weiner wrote: > > > > On Wed, Jul 16, 2014 at 04:39:38PM +0200, Michal Hocko wrote: > > > > > +#ifdef CONFIG_MEMCG_KMEM > > > > > + { > > > > > + .name = "kmem.limit_in_bytes", > > > > > + .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT), > > > > > + .write = mem_cgroup_write, > > > > > + .read_u64 = mem_cgroup_read_u64, > > > > > + }, > > > > > > > > Does it really make sense to have a separate limit for kmem only? > > > > IIRC, the reason we introduced this was that this memory is not > > > > reclaimable and so we need to limit it. > > > > > > > > But the opposite effect happened: because it's not reclaimable, the > > > > separate kmem limit is actually unusable for any values smaller than > > > > the overall memory limit: because there is no reclaim mechanism for > > > > that limit, once you hit it, it's over, there is nothing you can do > > > > anymore. The problem isn't so much unreclaimable memory, the problem > > > > is unreclaimable limits. > > > > > > > > If the global case produces memory pressure through kernel memory > > > > allocations, we reclaim page cache, anonymous pages, inodes, dentries > > > > etc. 
I think the same should happen for kmem: kmem should just be > > > > accounted and limited in the overall memory limit of a group, and when > > > > pressure arises, we go after anything that's reclaimable. > > > > > > Personally, I don't think there's much sense in having a separate knob > > > for kmem limit either. Until we have a user with a sane use case for it, > > > let's not propagate it to the new interface. > > > > What about fork-bomb forks protection? I thought that was the primary usecase > > for K < U? Or how can we handle that use case with a single limit? A > > special gfp flag to not trigger OOM path when called from some kmem > > charge paths? > > Even then, I do not see how would this fork-bomb prevention work without > causing OOMs and killing other processes within the group. The danger > would be still contained in the group and prevent from the system wide > disruption. Do we really want only such a narrow usecase? I think it's all about how we're going to use memory cgroups. If we're going to use them for application containers, there's simply no such problem, because we only want to isolate a potentially dangerous process group from the rest of the system. If we want to start a fully virtualized OS inside a container, then we certainly need a kind of numproc and/or kmem limiter to prevent processes inside a cgroup from being OOM killed by a fork-bomb. IMHO, the latter will always be better done by VMs, so it isn't a must-have for cgroups. I may be mistaken though. Thanks. 
* Re: [RFC PATCH] memcg: export knobs for the defaul cgroup hierarchy 2014-07-21 12:03 ` Vladimir Davydov @ 2014-07-21 12:49 ` Tejun Heo 0 siblings, 0 replies; 13+ messages in thread From: Tejun Heo @ 2014-07-21 12:49 UTC (permalink / raw) To: Vladimir Davydov Cc: Michal Hocko, Johannes Weiner, linux-mm, LKML, Hugh Dickins, Greg Thelen, Glauber Costa, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Mon, Jul 21, 2014 at 04:03:32PM +0400, Vladimir Davydov wrote: > I think it's all about how we're going to use memory cgroups. If we're > going to use them for application containers, there's simply no such > problem, because we only want to isolate a potentially dangerous process > group from the rest of the system. If we want to start a fully > virtualized OS inside a container, then we certainly need a kind of For shell environments, ulimit is a much better specific protection mechanism against fork bombs, and process-granular OOM killers would behave mostly equivalently during fork bombing to the way they'd behave in the host environment without cgroups. I'm having a hard time seeing why this would need any special treatment from cgroups. Thanks. -- tejun 
* Re: [RFC PATCH] memcg: export knobs for the defaul cgroup hierarchy 2014-07-21 9:07 ` Michal Hocko 2014-07-21 11:46 ` Michal Hocko @ 2014-07-21 11:48 ` Vladimir Davydov 2014-07-21 12:09 ` Michal Hocko 1 sibling, 1 reply; 13+ messages in thread From: Vladimir Davydov @ 2014-07-21 11:48 UTC (permalink / raw) To: Michal Hocko Cc: Johannes Weiner, linux-mm, LKML, Tejun Heo, Hugh Dickins, Greg Thelen, Glauber Costa, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Mon, Jul 21, 2014 at 11:07:24AM +0200, Michal Hocko wrote: > On Fri 18-07-14 19:44:43, Vladimir Davydov wrote: > > On Wed, Jul 16, 2014 at 11:58:14AM -0400, Johannes Weiner wrote: > > > On Wed, Jul 16, 2014 at 04:39:38PM +0200, Michal Hocko wrote: > > > > +#ifdef CONFIG_MEMCG_KMEM > > > > + { > > > > + .name = "kmem.limit_in_bytes", > > > > + .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT), > > > > + .write = mem_cgroup_write, > > > > + .read_u64 = mem_cgroup_read_u64, > > > > + }, > > > > > > Does it really make sense to have a separate limit for kmem only? > > > IIRC, the reason we introduced this was that this memory is not > > > reclaimable and so we need to limit it. > > > > > > But the opposite effect happened: because it's not reclaimable, the > > > separate kmem limit is actually unusable for any values smaller than > > > the overall memory limit: because there is no reclaim mechanism for > > > that limit, once you hit it, it's over, there is nothing you can do > > > anymore. The problem isn't so much unreclaimable memory, the problem > > > is unreclaimable limits. > > > > > > If the global case produces memory pressure through kernel memory > > > allocations, we reclaim page cache, anonymous pages, inodes, dentries > > > etc. I think the same should happen for kmem: kmem should just be > > > accounted and limited in the overall memory limit of a group, and when > > > pressure arises, we go after anything that's reclaimable. 
> > > > Personally, I don't think there's much sense in having a separate knob > > for kmem limit either. Until we have a user with a sane use case for it, > > let's not propagate it to the new interface. > > What about fork-bomb forks protection? I thought that was the primary usecase > for K < U? Or how can we handle that use case with a single limit? A > special gfp flag to not trigger OOM path when called from some kmem > charge paths? Hmm, for a moment I thought that putting a fork-bomb inside a memory cgroup with kmem accounting enabled and K=U would isolate it from the rest of the system, and therefore there's no need for K<U, but now I realize that's not quite right. In contrast to user memory, thread stack allocations have costly order, they cannot be swapped out, and on 32-bit systems they consume a limited resource of low mem. Although the latter two don't look like much of a concern, the costly order of stack pages certainly does, I think. Is this what you mean by saying we have to disable OOM from some kmem charge paths? To prevent OOM on the global level that might trigger due to the lack of high-order pages for task stacks? > What about task_count or what was the name of the controller which was > dropped and suggested to be replaced by kmem accounting? I can imagine > that to be implemented by a separate K limit which would be roughtly > stack_size * task_count + pillow for slab. I wonder how big this pillow for slab should be... Thanks. 
* Re: [RFC PATCH] memcg: export knobs for the defaul cgroup hierarchy 2014-07-21 11:48 ` Vladimir Davydov @ 2014-07-21 12:09 ` Michal Hocko 0 siblings, 0 replies; 13+ messages in thread From: Michal Hocko @ 2014-07-21 12:09 UTC (permalink / raw) To: Vladimir Davydov Cc: Johannes Weiner, linux-mm, LKML, Tejun Heo, Hugh Dickins, Greg Thelen, Glauber Costa, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Mon 21-07-14 15:48:39, Vladimir Davydov wrote: > On Mon, Jul 21, 2014 at 11:07:24AM +0200, Michal Hocko wrote: > > On Fri 18-07-14 19:44:43, Vladimir Davydov wrote: > > > On Wed, Jul 16, 2014 at 11:58:14AM -0400, Johannes Weiner wrote: > > > > On Wed, Jul 16, 2014 at 04:39:38PM +0200, Michal Hocko wrote: > > > > > +#ifdef CONFIG_MEMCG_KMEM > > > > > + { > > > > > + .name = "kmem.limit_in_bytes", > > > > > + .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT), > > > > > + .write = mem_cgroup_write, > > > > > + .read_u64 = mem_cgroup_read_u64, > > > > > + }, > > > > > > > > Does it really make sense to have a separate limit for kmem only? > > > > IIRC, the reason we introduced this was that this memory is not > > > > reclaimable and so we need to limit it. > > > > > > > > But the opposite effect happened: because it's not reclaimable, the > > > > separate kmem limit is actually unusable for any values smaller than > > > > the overall memory limit: because there is no reclaim mechanism for > > > > that limit, once you hit it, it's over, there is nothing you can do > > > > anymore. The problem isn't so much unreclaimable memory, the problem > > > > is unreclaimable limits. > > > > > > > > If the global case produces memory pressure through kernel memory > > > > allocations, we reclaim page cache, anonymous pages, inodes, dentries > > > > etc. I think the same should happen for kmem: kmem should just be > > > > accounted and limited in the overall memory limit of a group, and when > > > > pressure arises, we go after anything that's reclaimable. 
> > > > > > Personally, I don't think there's much sense in having a separate knob > > > for kmem limit either. Until we have a user with a sane use case for it, > > > let's not propagate it to the new interface. > > > > What about fork-bomb forks protection? I thought that was the primary usecase > > for K < U? Or how can we handle that use case with a single limit? A > > special gfp flag to not trigger OOM path when called from some kmem > > charge paths? > > Hmm, for a moment I thought that putting a fork-bomb inside a memory > cgroup with kmem accounting enabled and K=U will isolate it from the > rest of the system and therefore there's no need in K<U, but now I > realize it's not quite right. > > In contrast to user memory, thread stack allocations have costly order, > they cannot be swapped out, and on 32-bit systems they will consume a > limited resource of low mem. Although the latter two doesn't look like > being of much concern, costly order of stack pages certainly does I > think. > > Is this what you mean by saying we have to disable OOM from some kmem > charge paths? To prevent OOM on the global level that might trigger due > to lack of high order pages for task stack? No, I meant it for a different reason. If you simply trigger the OOM killer from e.g. the stack charge path, then you DoS your cgroup before you effectively stop the fork-bomb, because the fork-bomb will usually have a much smaller RSS than anything else in the group. So this is a case where you really want to fail the allocation. Maybe I just didn't understand what the single-limit proposal meant... > > What about task_count or what was the name of the controller which was > > dropped and suggested to be replaced by kmem accounting? I can imagine > > that to be implemented by a separate K limit which would be roughtly > > stack_size * task_count + pillow for slab. > > I wonder how big this pillow for slab should be... Well, it obviously depends on the load running in the group. 
It depends on the amount of unreclaimable slab plus the amount of reclaimable-but-not-yet-thrashing slab. So the pillow should be quite large, but that shouldn't be a big deal, as kernel allocations are usually a small part of U. -- Michal Hocko SUSE Labs 
* Re: [RFC PATCH] memcg: export knobs for the defaul cgroup hierarchy 2014-07-16 15:58 ` Johannes Weiner 2014-07-17 13:45 ` Michal Hocko 2014-07-18 15:44 ` Vladimir Davydov @ 2014-07-21 13:22 ` Michal Hocko 2 siblings, 0 replies; 13+ messages in thread From: Michal Hocko @ 2014-07-21 13:22 UTC (permalink / raw) To: Johannes Weiner Cc: linux-mm, LKML, Tejun Heo, Hugh Dickins, Greg Thelen, Vladimir Davydov, Glauber Costa, Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro On Wed 16-07-14 11:58:14, Johannes Weiner wrote: > On Wed, Jul 16, 2014 at 04:39:38PM +0200, Michal Hocko wrote: [...] > > +#ifdef CONFIG_MEMCG_KMEM > > + { > > + .name = "kmem.limit_in_bytes", > > + .private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT), > > + .write = mem_cgroup_write, > > + .read_u64 = mem_cgroup_read_u64, > > + }, > > Does it really make sense to have a separate limit for kmem only? It seems that this needs further discussion, so I will drop it in the next version of the patch and we can enable it or move to a single knob for U+K later. -- Michal Hocko SUSE Labs 
Thread overview: 13+ messages 2014-07-16 14:39 [RFC PATCH] memcg: export knobs for the defaul cgroup hierarchy Michal Hocko 2014-07-16 15:58 ` Johannes Weiner 2014-07-17 13:45 ` Michal Hocko 2014-07-18 15:44 ` Vladimir Davydov 2014-07-18 16:13 ` Johannes Weiner 2014-07-21 9:07 ` Michal Hocko 2014-07-21 11:46 ` Michal Hocko 2014-07-21 12:02 ` Tejun Heo 2014-07-21 12:03 ` Vladimir Davydov 2014-07-21 12:49 ` Tejun Heo 2014-07-21 11:48 ` Vladimir Davydov 2014-07-21 12:09 ` Michal Hocko 2014-07-21 13:22 ` Michal Hocko