On Sun, May 22, 2011 at 5:25 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 20 May 2011 18:26:40 -0700
> Andrew Morton wrote:
>
> > On Sat, 21 May 2011 09:41:50 +0900 Hiroyuki Kamezawa
> > <kamezawa.hiroyuki@gmail.com> wrote:
> >
> > > 2011/5/21 Andrew Morton :
> > > > On Fri, 20 May 2011 12:48:37 +0900
> > > > KAMEZAWA Hiroyuki wrote:
> > > >
> > > >> workqueue for memory cgroup asynchronous memory shrinker.
> > > >>
> > > >> This patch implements the workqueue of the async shrinker routine.
> > > >> Each memcg has a work and only one work can be scheduled at the
> > > >> same time.
> > > >>
> > > >> If shrinking memory doesn't go well, a delay will be added to the
> > > >> work.
> > > >>
> > > >
> > > > When this code explodes (as it surely will), users will see large
> > > > amounts of CPU consumption in the work queue thread.  We want to
> > > > make this as easy to debug as possible, so we should try to make
> > > > the workqueue's names mappable back onto their memcg's.  And
> > > > anything else we can think of to help?

When we debug kswapd issues in a memory isolation environment, the first
step is to identify which cgroup the kswapd thread is working on. We need
an easy way to make that direct mapping, either by reading an API or just
by looking at "top". So making the kworkers' names map back to their memcg
helps here. Also, we need an easy way to track the amount of cpu time
consumed by kswapd on a per-memcg basis. We can probably export that
number in the per-memcg memory.stat. Kame has the patch from the last
post.

> > >
> > > I had a patch for showing per-memcg reclaim latency stats. It will
> > > help. I'll add it again to this set. I just dropped it because there
> > > are many patches to memory.stat in flight..
> >
> > Will that patch help us when users report the memcg equivalent of
> > "kswapd uses 99% of CPU"?
>
> I think so. Each memcg shows what amount of cpu is used.
> But maybe it's not an easy interface. I have several ideas.
>
> An idea I have is to rename task->comm by overwriting it from
> kworker/u:%d to memcg/%d when the work is scheduled. I think this can be
> implemented with a very simple interface and flags for the workqueue.
> Then, ps -elf can show what is going on.
> If necessary, I'll add a hard limit on cpu usage for a work item, or
> I'll limit the number of threads for the memcg workqueue.

Does it make sense to use the memcg's css->id as the %d in that name, if
that is not already the plan? Otherwise it is hard to link a kworker/%d
(or memcg/%d later) back to the memcg it is working on. In the last post
of the per-memcg-per-kswapd implementation, I named the thread
"memcg-css_id" and also added a per-memcg API to export its css_id, so we
can easily map the kernel thread back to its owner.
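Something along these lines is what I have in mind. This is a completely
untested sketch; the worker function name is made up here, and whether
get_task_comm()/set_task_comm() are the right helpers to call from a work
item is an open question:

/*
 * Untested sketch: give the kworker a recognizable name while it
 * reclaims on behalf of a memcg, and restore the generic name when the
 * work item finishes, so "ps -elf" / "top" show which memcg the cpu
 * time is going to.  The function name and helpers are illustrative.
 */
static void mem_cgroup_async_shrink_worker(struct work_struct *work)
{
        struct mem_cgroup *mem = container_of(to_delayed_work(work),
                                              struct mem_cgroup, async_work);
        char oldcomm[TASK_COMM_LEN], newcomm[TASK_COMM_LEN];

        get_task_comm(oldcomm, current);
        snprintf(newcomm, sizeof(newcomm), "memcg/%d", css_id(&mem->css));
        set_task_comm(current, newcomm);

        /* ... the existing async shrink work runs here ... */

        set_task_comm(current, oldcomm);
}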
> Considering there are users who use 2000+ memcgs on a system, a thread
> per memcg was not a choice for me.

So that is only about 2000 * 8K = 16M worth of memory, and a machine
running 2000+ memcgs probably has a very large amount of memory to begin
with. We've run systems with 1000+ kswapd threads without noticing any
trouble from that. What that buys us is better visibility (including the
ability to limit cpu per memcg kswapd) and debuggability. Sorry, I know we
have discussed this before on another thread, but I cannot stop myself
from repeating it here :( I just want to provide a data point where we
have lots of kswapd threads (> 1000) per host and that has not caused us
any of the issues you are concerned about. :)

--Ying

> Another idea was a thread pool or a workqueue. Because a thread pool can
> be a poor reimplementation of a workqueue, I used a workqueue.
>
> I'll implement some of the ideas above in the next version.
>
> > > >> +        limit = res_counter_read_u64(&mem->res, RES_LIMIT);
> > > >> +        shrink_to = limit - MEMCG_ASYNC_MARGIN - PAGE_SIZE;
> > > >> +        usage = res_counter_read_u64(&mem->res, RES_USAGE);
> > > >> +        if (shrink_to <= usage) {
> > > >> +                required = usage - shrink_to;
> > > >> +                required = (required >> PAGE_SHIFT) + 1;
> > > >> +                /*
> > > >> +                 * This scans some number of pages and returns whether
> > > >> +                 * memory reclaim was slow or not. If slow, we add a
> > > >> +                 * delay as congestion_wait() in vmscan.c
> > > >> +                 */
> > > >> +                congested = mem_cgroup_shrink_static_scan(mem, (long)required);
> > > >> +        }
> > > >> +        if (test_bit(ASYNC_NORESCHED, &mem->async_flags)
> > > >> +            || mem_cgroup_async_should_stop(mem))
> > > >> +                goto finish_scan;
> > > >> +        /* If memory reclaim couldn't go well, add delay */
> > > >> +        if (congested)
> > > >> +                delay = HZ/10;
> > > >
> > > > Another magic number.
> > > >
> > > > If Moore's law holds, we need to reduce this number by 1.4 each
> > > > year.  Is this good?
> > > >
> > >
> > > Not good. I just used the same magic number now used with
> > > wait_iff_congested. Other than a timer, I can use pagein/pageout
> > > event counters. If we have dirty_ratio, I may be able to link this
> > > to dirty_ratio and wait until dirty_ratio is low enough. Or, wake up
> > > again when the limit is hit.
> > >
> > > Do you have a suggestion?
> > >
> >
> > mm.. It would be pretty easy to generate an estimate of "pages scanned
> > per second" from the contents of (and changes in) the scan_control.
>
> Hmm.
>
> > Knowing that datum and knowing the number of pages in the memcg, we
> > should be able to come up with a delay period which scales
> > appropriately with CPU speed and with memory size?
> >
> > Such a thing could be used to rationalise magic delays in other
> > places, hopefully.
>
> OK, I'll consider that. Thank you for the nice idea.
>
> > > >> +        queue_delayed_work(memcg_async_shrinker, &mem->async_work, delay);
> > > >> +        return;
> > > >> +finish_scan:
> > > >> +        cgroup_release_and_wakeup_rmdir(&mem->css);
> > > >> +        clear_bit(ASYNC_RUNNING, &mem->async_flags);
> > > >> +        return;
> > > >> +}
> > > >> +
> > > >> +static void run_mem_cgroup_async_shrinker(struct mem_cgroup *mem)
> > > >> +{
> > > >> +        if (test_bit(ASYNC_NORESCHED, &mem->async_flags))
> > > >> +                return;
> > > >
> > > > I can't work out what ASYNC_NORESCHED does.  Is its name
> > > > well-chosen?
> > > >
> > > How about BLOCK/STOP_ASYNC_RECLAIM?
> >
> > I can't say - I don't know what it does! Or maybe I did, and
> > immediately forgot ;)
>
> I'll find a better name ;)
>
> Thanks,
> -Kame
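For what it's worth, below is a rough sketch of what Andrew's scan-rate
idea could look like. This is untested; the function name, the way
"scanned" and "elapsed_jiffies" would be tracked between reclaim passes,
and the 1% target are all made up for illustration. The point is only
that the delay would scale with both the scan speed (hence CPU speed) and
the memcg size, instead of being a fixed HZ/10:

/*
 * Illustrative, untested sketch: derive the requeue delay from the
 * observed reclaim scan rate instead of a hard-coded HZ/10.  The caller
 * is assumed to pass the number of pages scanned in the last pass and
 * how many jiffies that pass took.
 */
static unsigned long mem_cgroup_async_delay(struct mem_cgroup *mem,
                                            unsigned long scanned,
                                            unsigned long elapsed_jiffies)
{
        unsigned long total_pages, delay;

        /* pages currently charged to this memcg */
        total_pages = res_counter_read_u64(&mem->res, RES_USAGE) >> PAGE_SHIFT;

        if (!scanned || !elapsed_jiffies)
                return HZ / 10; /* fall back to the old magic number */

        /*
         * Wait roughly as long as it took to scan 1% of the memcg: a
         * faster CPU scans more pages per jiffy and waits less, a
         * bigger memcg waits proportionally longer.
         */
        delay = (total_pages / 100) * elapsed_jiffies / scanned;

        return clamp(delay, 1UL, (unsigned long)HZ);
}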