Subject: Re: [PATCH] memcg: add pgfault latency histograms
From: Ying Han
Date: Thu, 26 May 2011 18:40:44 -0700
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
 Tejun Heo, Pavel Emelyanov, Andrew Morton, Li Zefan, Mel Gorman,
 Christoph Lameter, Johannes Weiner, Rik van Riel, Hugh Dickins,
 Michal Hocko, Dave Hansen, Zhu Yanhai, "linux-mm@kvack.org"

On Thu, May 26, 2011 at 5:31 PM, KAMEZAWA Hiroyuki wrote:
> On Thu, 26 May 2011 17:23:20 -0700
> Ying Han wrote:
>
>> On Thu, May 26, 2011 at 5:05 PM, KAMEZAWA Hiroyuki <
>> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>
>> > On Thu, 26 May 2011 14:07:49 -0700
>> > Ying Han wrote:
>> >
>> > > This adds a histogram to capture page fault latencies on a per-memcg
>> > > basis. I used this patch on the memcg background reclaim test, and
>> > > figured there could be more use cases to monitor/debug application
>> > > performance.
>> > >
>> > > The histogram is composed of 8 buckets in ns units. The last one is
>> > > infinite (inf), which counts everything beyond the previous bucket.
>> > > To be more flexible, the buckets can be reset and each bucket
>> > > boundary is configurable at runtime.
>> > >
>> > > memory.pgfault_histogram: exports the histogram on a per-memcg basis
>> > > and can also be reset by echoing "reset". Meanwhile, all the bucket
>> > > boundaries are writable by echoing the ranges into the API. See the
>> > > example below.
>> > >
>> > > /proc/sys/vm/pgfault_histogram: the global sysctl tunable can be used
>> > > to turn recording of the histogram on/off.
>> > >
>> > > Functional Test:
>> > > Create a memcg with a 10g hard limit, run dd, and allocate 8g of anon
>> > > pages. Measure the anon page allocation latency.
>> > >
>> > > $ mkdir /dev/cgroup/memory/B
>> > > $ echo 10g >/dev/cgroup/memory/B/memory.limit_in_bytes
>> > > $ echo $$ >/dev/cgroup/memory/B/tasks
>> > > $ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520 &
>> > > $ allocate 8g anon pages
>> > >
>> > > $ echo 1 >/proc/sys/vm/pgfault_histogram
>> > >
>> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>> > > pgfault latency histogram (ns):
>> > > < 600            2051273
>> > > < 1200           40859
>> > > < 2400           4004
>> > > < 4800           1605
>> > > < 9600           170
>> > > < 19200          82
>> > > < 38400          6
>> > > < inf            0
>> > >
>> > > $ echo reset >/dev/cgroup/memory/B/memory.pgfault_histogram
>> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>> > > pgfault latency histogram (ns):
>> > > < 600            0
>> > > < 1200           0
>> > > < 2400           0
>> > > < 4800           0
>> > > < 9600           0
>> > > < 19200          0
>> > > < 38400          0
>> > > < inf            0
>> > >
>> > > $ echo 500 520 540 580 600 1000 5000 >/dev/cgroup/memory/B/memory.pgfault_histogram
>> > > $ cat /dev/cgroup/memory/B/memory.pgfault_histogram
>> > > pgfault latency histogram (ns):
>> > > < 500            50
>> > > < 520            151
>> > > < 540            3715
>> > > < 580            1859812
>> > > < 600            202241
>> > > < 1000           25394
>> > > < 5000           5875
>> > > < inf            186
>> > >
>> > > Performance Test:
>> > > I ran the PageFaultTest (pft) benchmark to measure the overhead of
>> > > recording the histogram. No overhead is observed on either "flt/cpu/s"
>> > > or "fault/wsec".
>> > >
>> > > $ mkdir /dev/cgroup/memory/A
>> > > $ echo 16g >/dev/cgroup/memory/A/memory.limit_in_bytes
>> > > $ echo $$ >/dev/cgroup/memory/A/tasks
>> > > $ ./pft -m 15g -t 8 -T a
>> > >
>> > > Result:
>> > > "fault/wsec"
>> > >
>> > > $ ./ministat no_histogram histogram
>> > > x no_histogram
>> > > + histogram
>> > > +--------------------------------------------------------------------------+
>> > >     N           Min           Max        Median           Avg        Stddev
>> > > x   5     813404.51     824574.98      821661.3     820470.83     4202.0758
>> > > +   5     821228.91     825894.66     822874.65     823374.15     1787.9355
>> > >
>> > > "flt/cpu/s"
>> > >
>> > > $ ./ministat no_histogram histogram
>> > > x no_histogram
>> > > + histogram
>> > > +--------------------------------------------------------------------------+
>> > >     N           Min           Max        Median           Avg        Stddev
>> > > x   5     104951.93     106173.13     105142.73      105349.2     513.78158
>> > > +   5     104697.67      105416.1     104943.52     104973.77     269.24781
>> > > No difference proven at 95.0% confidence
>> > >
>> > > Signed-off-by: Ying Han
>> >
>> > Hmm, interesting... but isn't it a very, very complicated interface?
>> > Could you make this work with 'perf'? Then everyone (including people
>> > who don't use memcg) will be happy.
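To make the quoted mechanism concrete: the recording side is nothing more
than a fixed set of per-bucket counters. Below is a rough userspace sketch
of the bucketing logic, illustrative only -- the thresholds are taken from
the example above, but the function names, the plain (non-atomic) counter
increment, and the whole userspace framing are simplifications, not the
patch's actual kernel code.

/*
 * Userspace model of the bucketing described above -- not the code from
 * the patch itself.  Eight ascending thresholds in nanoseconds; the last
 * slot acts as the "< inf" catch-all.
 */
#include <stdint.h>
#include <stdio.h>

#define NR_BUCKETS 8

/* Default boundaries from the example above (ns); the last is "inf". */
static uint64_t bucket_limit[NR_BUCKETS] = {
	600, 1200, 2400, 4800, 9600, 19200, 38400, UINT64_MAX
};
static uint64_t bucket_count[NR_BUCKETS];

static void record_fault_latency(uint64_t ns)
{
	int i;

	/* Find the first bucket whose upper bound exceeds the latency. */
	for (i = 0; i < NR_BUCKETS - 1; i++)
		if (ns < bucket_limit[i])
			break;
	bucket_count[i]++;	/* the real patch would use a per-cpu/atomic add */
}

static void show_histogram(void)
{
	int i;

	printf("pgfault latency histogram (ns):\n");
	for (i = 0; i < NR_BUCKETS; i++) {
		if (i == NR_BUCKETS - 1)
			printf("< inf\t\t%llu\n",
			       (unsigned long long)bucket_count[i]);
		else
			printf("< %llu\t\t%llu\n",
			       (unsigned long long)bucket_limit[i],
			       (unsigned long long)bucket_count[i]);
	}
}

int main(void)
{
	/* A few made-up samples just to exercise the bucketing. */
	uint64_t samples[] = { 450, 800, 3000, 50000 };
	unsigned i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		record_fault_latency(samples[i]);
	show_histogram();
	return 0;
}

The point of the sketch is that recording one fault costs a short
comparison walk plus a single increment, which is consistent with the pft
numbers above showing no measurable overhead.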
>> >
>>
>> Thank you for looking at it.
>>
>> There is only one per-memcg API added, which basically exports the
>> histogram. The "reset" and bucket reconfiguration are not a "must", but
>> they make it more flexible. Also, the /proc/sys/vm tunable can be dropped
>> if necessary, since there is no overhead observed from always leaving it
>> on anyway.
>>
>> I am not familiar with perf; any suggestions on how it is supposed to
>> look?
>>
>> Thanks
>>
>
> IIUC, you can record "all" the latency information with perf record. Then
> the latency information can be dumped out to a file.
>
> You can add a python(?) script for perf, such as:
>
>   # perf report memory-reclaim-latency-histogram -f perf.data
>                 -o 500,1000,1500,2000.....
>     ...show the histogram in text, or report the histogram graphically.
>
> The good points are:
>  - you can reuse perf.data and show the histogram from another point of
>    view.
>
>  - you can show another cut of the data; for example, I think you can
>    write a parser to show "changes in the histogram over time" easily.
>    You may even be able to generate a movie ;)
>
>  - Now that perf cgroup is supported:
>    - you can see a per-task histogram
>    - you can see a per-cgroup histogram
>    - you can see a system-wide histogram
>      (if you record the latency of the usual kswapd/alloc_pages paths)
>
>  - If you record the latency within shrink_zone(), you can show a per-zone
>    reclaim latency histogram. Record parsers can gather them and show the
>    histogram. This will be beneficial to cpuset users.
>
> I'm sorry if I missed something.

After studying perf a bit, it is not feasible in this case. The CPU and
memory overhead of perf is overwhelming: each page fault generates a record
in the buffer, which limits how much data we can record and how much can be
processed later. Most of the data recorded by the general perf framework is
not needed here.

On the other hand, the memory consumption of this patch is very small. We
only need to keep a counter per bucket, and the recording can go on for as
long as the machine is up. As measured above, there is no overhead in the
data collection :)

So perf is not an option for this purpose.

--Ying

> Thanks,
> -Kame
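For reference, the offline flow sketched in Kame's reply (record every
fault, then bucket the dump afterwards with user-chosen boundaries such as
"-o 500,1000,1500,2000") reduces to something like the model below. This is
only a rough illustration under assumptions: it presumes the raw latencies
have already been extracted from perf.data as one value in ns per line
(e.g. via perf script plus some filtering), and it is not a real perf
plugin or script interface.

/*
 * Rough model of the post-processing step: read one latency value (ns)
 * per line from stdin and bucket it into the ascending boundaries given
 * on the command line, e.g.:  ./histo 500 1000 1500 2000 < latencies.txt
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	unsigned long long *limit, *count, ns;
	int i, nr = argc - 1;

	/* nr boundaries plus one trailing "< inf" overflow bucket. */
	limit = calloc(nr + 1, sizeof(*limit));
	count = calloc(nr + 1, sizeof(*count));
	for (i = 0; i < nr; i++)
		limit[i] = strtoull(argv[i + 1], NULL, 10);

	while (scanf("%llu", &ns) == 1) {
		for (i = 0; i < nr; i++)
			if (ns < limit[i])
				break;
		count[i]++;	/* i == nr means the "< inf" overflow bucket */
	}

	for (i = 0; i < nr; i++)
		printf("< %llu\t%llu\n", limit[i], count[i]);
	printf("< inf\t%llu\n", count[nr]);

	free(limit);
	free(count);
	return 0;
}

Either way the final bucketing is the same; the difference argued above is
whether every fault has to be logged first (perf) or only a handful of
counters ever exist (this patch).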