On Sun, May 22, 2011 at 5:25 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 20 May 2011 18:26:40 -0700
> Andrew Morton wrote:
>
> > On Sat, 21 May 2011 09:41:50 +0900 Hiroyuki Kamezawa
> > <kamezawa.hiroyuki@gmail.com> wrote:
> >
> > > 2011/5/21 Andrew Morton :
> > > > On Fri, 20 May 2011 12:48:37 +0900
> > > > KAMEZAWA Hiroyuki wrote:
> > > >
> > > >> workqueue for memory cgroup asynchronous memory shrinker.
> > > >>
> > > >> This patch implements the workqueue of the async shrinker routine.
> > > >> Each memcg has a work and only one work can be scheduled at the
> > > >> same time.
> > > >>
> > > >> If shrinking memory doesn't go well, a delay will be added to the
> > > >> work.
> > > >>
> > > >
> > > > When this code explodes (as it surely will), users will see large
> > > > amounts of CPU consumption in the work queue thread.  We want to
> > > > make this as easy to debug as possible, so we should try to make
> > > > the workqueue's names mappable back onto their memcg's.  And
> > > > anything else we can think of to help?

When we debug kswapd issues in a memory isolation environment, the first
step is to identify which cgroup the kswapd thread is working on. We need
an easy way to make that direct mapping, either by reading an API or just
by looking at "top". So making the kworkers' names map back to their memcg
helps here. Also, we need an easy way to track the amount of cpu time
consumed by kswapd on a per-memcg basis. We can probably export that
number in the per-memcg memory.stat. Kame has the patch from the last
post.

> > >
> > > I had a patch for showing per-memcg reclaim latency stats. It will
> > > help. I'll add it again to this set. I just dropped it because there
> > > are many patches to memory.stat in flight..
> >
> > Will that patch help us when users report the memcg equivalent of
> > "kswapd uses 99% of CPU"?
>
> I think so. Each memcg shows what amount of cpu is used.
> But maybe it's not an easy interface. I have several ideas.
>
> An idea I have is to rename task->comm by overwriting it from
> kworker/u:%d to memcg/%d when the work is scheduled. I think this can be
> implemented with a very simple interface and flags for the workqueue.
> Then, ps -elf can show what is going on.
> If necessary, I'll add a hard limit on cpu usage for a work item, or
> I'll limit the number of threads for the memcg workqueue.

Does it make sense to use the memcg's css->id as the %d in that name, if
that is not already the plan? Otherwise it is hard to link a kworker/%d
(or memcg/%d later) back to the memcg it is working on. In the last post
of the per-memcg-per-kswapd implementation, I named the thread
"memcg-css_id" and also added a per-memcg API to export its css_id, so we
can easily map the kernel thread back to its owner.
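Something along these lines is what I have in mind. This is a completely
untested sketch; the worker function name is made up here, and whether
get_task_comm()/set_task_comm() are the right helpers to call from a work
item is an open question:

/*
 * Untested sketch: give the kworker a recognizable name while it
 * reclaims on behalf of a memcg, and restore the generic name when the
 * work item finishes, so "ps -elf" / "top" show which memcg the cpu
 * time is going to.  The function name and helpers are illustrative.
 */
static void mem_cgroup_async_shrink_worker(struct work_struct *work)
{
        struct mem_cgroup *mem = container_of(to_delayed_work(work),
                                              struct mem_cgroup, async_work);
        char oldcomm[TASK_COMM_LEN], newcomm[TASK_COMM_LEN];

        get_task_comm(oldcomm, current);
        snprintf(newcomm, sizeof(newcomm), "memcg/%d", css_id(&mem->css));
        set_task_comm(current, newcomm);

        /* ... the existing async shrink work runs here ... */

        set_task_comm(current, oldcomm);
}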
> Considering there are users who use 2000+ memcgs on a system, a thread
> per memcg was not a choice for me.

So that is only about 2000 * 8K = 16M worth of memory, and a machine
running 2000+ memcgs probably has a very large amount of memory to begin
with. We've run systems with 1000+ kswapd threads without noticing any
trouble from that. What that buys us is better visibility (including the
ability to limit cpu per memcg kswapd) and debuggability. Sorry, I know we
have discussed this before on another thread, but I cannot stop myself
from repeating it here :( I just want to provide a data point where we
have lots of kswapd threads (> 1000) per host and that has not caused us
any of the issues you are concerned about. :)

--Ying

> Another idea was a thread pool or a workqueue. Because a thread pool can
> be a poor reimplementation of a workqueue, I used a workqueue.
>
> I'll implement some of the ideas above in the next version.
>
> > > >> +        limit = res_counter_read_u64(&mem->res, RES_LIMIT);
> > > >> +        shrink_to = limit - MEMCG_ASYNC_MARGIN - PAGE_SIZE;
> > > >> +        usage = res_counter_read_u64(&mem->res, RES_USAGE);
> > > >> +        if (shrink_to <= usage) {
> > > >> +                required = usage - shrink_to;
> > > >> +                required = (required >> PAGE_SHIFT) + 1;
> > > >> +                /*
> > > >> +                 * This scans some number of pages and returns whether
> > > >> +                 * memory reclaim was slow or not. If slow, we add a
> > > >> +                 * delay as congestion_wait() in vmscan.c
> > > >> +                 */
> > > >> +                congested = mem_cgroup_shrink_static_scan(mem, (long)required);
> > > >> +        }
> > > >> +        if (test_bit(ASYNC_NORESCHED, &mem->async_flags)
> > > >> +            || mem_cgroup_async_should_stop(mem))
> > > >> +                goto finish_scan;
> > > >> +        /* If memory reclaim couldn't go well, add delay */
> > > >> +        if (congested)
> > > >> +                delay = HZ/10;
> > > >
> > > > Another magic number.
> > > >
> > > > If Moore's law holds, we need to reduce this number by 1.4 each
> > > > year.  Is this good?
> > > >
> > >
> > > Not good. I just used the same magic number now used with
> > > wait_iff_congested. Other than a timer, I can use pagein/pageout
> > > event counters. If we have dirty_ratio, I may be able to link this
> > > to dirty_ratio and wait until dirty_ratio is low enough. Or, wake up
> > > again when the limit is hit.
> > >
> > > Do you have a suggestion?
> > >
> >
> > mm.. It would be pretty easy to generate an estimate of "pages scanned
> > per second" from the contents of (and changes in) the scan_control.
>
> Hmm.
>
> > Knowing that datum and knowing the number of pages in the memcg, we
> > should be able to come up with a delay period which scales
> > appropriately with CPU speed and with memory size?
> >
> > Such a thing could be used to rationalise magic delays in other
> > places, hopefully.
>
> OK, I'll consider that. Thank you for the nice idea.
>
> > > >> +        queue_delayed_work(memcg_async_shrinker, &mem->async_work, delay);
> > > >> +        return;
> > > >> +finish_scan:
> > > >> +        cgroup_release_and_wakeup_rmdir(&mem->css);
> > > >> +        clear_bit(ASYNC_RUNNING, &mem->async_flags);
> > > >> +        return;
> > > >> +}
> > > >> +
> > > >> +static void run_mem_cgroup_async_shrinker(struct mem_cgroup *mem)
> > > >> +{
> > > >> +        if (test_bit(ASYNC_NORESCHED, &mem->async_flags))
> > > >> +                return;
> > > >
> > > > I can't work out what ASYNC_NORESCHED does.  Is its name
> > > > well-chosen?
> > > >
> > > How about BLOCK/STOP_ASYNC_RECLAIM?
> >
> > I can't say - I don't know what it does! Or maybe I did, and
> > immediately forgot ;)
>
> I'll find a better name ;)
>
> Thanks,
> -Kame
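For what it's worth, below is a rough sketch of what Andrew's scan-rate
idea could look like. This is untested; the function name, the way
"scanned" and "elapsed_jiffies" would be tracked between reclaim passes,
and the 1% target are all made up for illustration. The point is only
that the delay would scale with both the scan speed (hence CPU speed) and
the memcg size, instead of being a fixed HZ/10:

/*
 * Illustrative, untested sketch: derive the requeue delay from the
 * observed reclaim scan rate instead of a hard-coded HZ/10.  The caller
 * is assumed to pass the number of pages scanned in the last pass and
 * how many jiffies that pass took.
 */
static unsigned long mem_cgroup_async_delay(struct mem_cgroup *mem,
                                            unsigned long scanned,
                                            unsigned long elapsed_jiffies)
{
        unsigned long total_pages, delay;

        /* pages currently charged to this memcg */
        total_pages = res_counter_read_u64(&mem->res, RES_USAGE) >> PAGE_SHIFT;

        if (!scanned || !elapsed_jiffies)
                return HZ / 10; /* fall back to the old magic number */

        /*
         * Wait roughly as long as it took to scan 1% of the memcg: a
         * faster CPU scans more pages per jiffy and waits less, a
         * bigger memcg waits proportionally longer.
         */
        delay = (total_pages / 100) * elapsed_jiffies / scanned;

        return clamp(delay, 1UL, (unsigned long)HZ);
}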