On Tue, Apr 9, 2013 at 5:13 AM, Michal Hocko wrote:

> Hi all,
> It's been a long time since I promised my take on the $subject but I got
> permanently preempted by other tasks. I finally got to it, fortunately.

Hi Michal,

This has been on my list for a while, but I never got a chance to get to
it. Per-memcg softlimit reclaim is one of the key features Google uses
today, so thank you for putting in the effort to move this forward. I
haven't read the patches in detail yet, but since we have discussed this
over a few iterations, they should look familiar.

> This is just a first attempt. There are still some todos but I wanted to
> post it soon to get feedback.
>
> The basic idea is quite simple. Pull soft reclaim into shrink_zone in
> the first step and get rid of the previous soft reclaim infrastructure.
> shrink_zone is done in two passes now. First it tries to do the soft
> limit reclaim and it falls back to reclaim-all-mode if no group is over
> the limit or no pages have been scanned. The second pass happens at the
> same priority so the only time we waste is the memcg tree walk which
> shouldn't be a big deal. There is certainly room for improvements in
> that direction. But let's keep it simple for now.
> As a bonus we will get rid of a _lot_ of code by this and soft reclaim
> will not stand out like before.

Yes, that part alone should have given us enough motivation to merge
this effort a long time ago. However, we had difficulty agreeing on 5%
of the code (mainly the softlimit policy), which prevented us from
cleaning up the other 95%. I take the blame.

> The second step is somewhat more controversial. I am redefining meaning
> of the default soft limit value. I've not chosen 0 as we discussed
> previously because I want to preserve hierarchical property of the soft
> limit (if a parent up the hierarchy is over its limit then children are
> over as well)

This is the 5% we keep disagreeing on.
The internal patch I am carrying has a different interpretation of
"hierarchical softlimit reclaim". However, I am more inclined to accept
that difference this time. At least that will get us moving forward to
clean up the code first. Then we can revisit the exact policy of that 5%
if it doesn't fit other use cases (besides Google). I am happy to
backport this part into our kernel later and then only carry that 5% of
change internally.

To give more background on what I mean by a different interpretation of
"hierarchical", I did a write-up some time back, which is attached in
this thread. This is purely a note for later; as I mentioned, I will go
ahead and review the patches and set that difference aside at this step.

> so I have kept the default untouched - unlimited - but I
> have slightly changed the meaning of this value. I interpret it as "user
> doesn't care about soft limit". More precisely the value is ignored
> unless it has been specified by user so such groups are eligible for
> soft reclaim even though they do not reach the limit. Such groups
> do not force their children to be reclaimed of course.
>
> I guess the only possible use case where this wouldn't work as
> expected is when somebody creates a group and sets its soft limit to
> a small value (e.g. 0) just to protect all other groups from being
> reclaimed. With the new scheme all groups would be reclaimed while the
> previous implementation could end up reclaiming only the "special"
> group. This configuration can be achieved by the new scheme trivially
> so I think we should be safe. Or does this sound like a big problem?
>
> Finally the third step is soft limit reclaim integration into targeted
> reclaim. The patch is a trivial one-liner.

I will go through the patches in detail in the next day or so.

Thanks

--Ying

>
> I haven't gotten to test it properly yet.
> I've tested only 2 workloads:
> 1) 1GB RAM + 128MB swap in a kvm (host 4 GB RAM)
>    - 2 memcgs (directly under root)
>      - A has soft limit 500MB and hard unlimited
>      - B both hard and soft unlimited (default values)
>    - One dd if=/dev/zero of=storage/$file bs=1024 count=1228800 per group
> 2) same setup
>    - tar -xf linux source tree + make -j2 vmlinux
>
> Results
> 1) I've checked memory.usage_in_bytes
> Base (-mm tree)
>                 Group A         Group B
> median          446498816       448659456
>
> Patches applied
> median          524314624       377921536
>
> So as expected, A got more room at the expense of B and it is nicely
> over its soft limit. I wanted to compare the reclaim performance as well
> but we do not account scanned and reclaimed pages during the old soft
> reclaim (global_reclaim prevents that). But I am planning to look at it.
> Anyway it doesn't look like we are scanning/reclaiming more with the
> patched kernel:
> Base:    pgscan_kswapd_dma32 394382   pgsteal_kswapd_dma32 394372
> Patched: pgscan_kswapd_dma32 394501   pgsteal_kswapd_dma32 394491
>
> So I would assume that the soft limit reclaim scanned more in the end.
>
> Total runtime was slightly smaller for the patched version:
> Base
>                 Group A         Group B
> total time      480.087 s       480.067 s
>
> Patches applied
> total time      474.853 s       474.736 s
>
> But this could be an artifact of the guest scheduling or related to the
> host activity so I wouldn't draw any conclusions from here.
>
> 2) kbuild test showed more or less the same results
> usage_in_bytes
> Base
>                 Group A         Group B
> median          394817536       395634688
>
> Patches applied
> median          483481600       302131200
>
> A is kept closer to the soft limit again. There is some fluctuation
> around the limit because kbuild creates a lot of short lived processes.
> Base:    pgscan_kswapd_dma32 1648718   pgsteal_kswapd_dma32 1510749
> Patched: pgscan_kswapd_dma32 2042065   pgsteal_kswapd_dma32 1667745
>
> The differences are much bigger now so it would be interesting how much
> has been scanned/reclaimed during soft reclaim in the base kernel.
>
> I haven't included total runtime statistics here because they seemed
> even more random due to guest/host interaction.
>
> Any comments are welcome, of course.
>
> Michal Hocko (3):
>   memcg: integrate soft reclaim tighter with zone shrinking code
>   memcg: Ignore soft limit until it is explicitly specified
>   vmscan, memcg: Do softlimit reclaim also for targeted reclaim
>
> Incomplete diffstat (without node-zone soft limit tree removal etc...)
> so more deletions to come.
>  include/linux/memcontrol.h |   10 +--
>  mm/memcontrol.c            |  175 +++++++++-----------------------------------
>  mm/vmscan.c                |   67 ++++++++++-------
>  3 files changed, 78 insertions(+), 174 deletions(-)
>
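
For anyone following the thread, the eligibility rule and the two-pass
walk described in the quoted cover letter can be modeled in a few lines.
This is only an illustrative Python sketch under stated assumptions, not
the kernel code from the patches: the names Group, eligible, shrink_zone
and the scan callback are hypothetical, and a soft limit of None stands
in for the untouched "unlimited" default.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Group:
    """Hypothetical stand-in for a memcg; not a kernel structure."""
    name: str
    usage: int
    soft_limit: Optional[int] = None   # None: user never set a soft limit
    parent: Optional["Group"] = None


def over_soft_limit(g: Group) -> bool:
    # Only an explicitly set soft limit can be exceeded.
    return g.soft_limit is not None and g.usage > g.soft_limit


def eligible(g: Group) -> bool:
    """Soft-reclaim eligibility as described in the cover letter."""
    if g.soft_limit is None:
        # "User doesn't care about soft limit": always a candidate.
        return True
    node: Optional[Group] = g
    while node is not None:
        # The group itself, or any ancestor with an explicit limit,
        # being over that limit makes the group eligible. Ancestors
        # with the default value never force their children.
        if over_soft_limit(node):
            return True
        node = node.parent
    return False


def shrink_zone(groups: List[Group],
                scan: Callable[[Group], int]) -> Tuple[str, int]:
    """Two passes: soft-limit reclaim first, then reclaim-all fallback."""
    scanned = sum(scan(g) for g in groups if eligible(g))
    if scanned > 0:
        return "soft", scanned
    # Nothing eligible or nothing scanned: walk every group again
    # (in the real code this happens at the same priority).
    return "all", sum(scan(g) for g in groups)
```

With the first test setup above, group A (soft limit 500MB) stays out of
the first pass while it is under its limit and group B (default values)
is always a candidate, which is consistent with A gaining room at B's
expense in the reported medians.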