From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail6.bemta12.messagelabs.com (mail6.bemta12.messagelabs.com [216.82.250.247]) by kanga.kvack.org (Postfix) with ESMTP id 723176B00E7 for ; Fri, 15 Jul 2011 16:28:21 -0400 (EDT)
Received: from kpbe18.cbf.corp.google.com (kpbe18.cbf.corp.google.com [172.25.105.82]) by smtp-out.google.com with ESMTP id p6FKSEjH016906 for ; Fri, 15 Jul 2011 13:28:15 -0700
Received: from gyf1 (gyf1.prod.google.com [10.243.50.65]) by kpbe18.cbf.corp.google.com with ESMTP id p6FKREMA023450 (version=TLSv1/SSLv3 cipher=RC4-SHA bits=128 verify=NOT) for ; Fri, 15 Jul 2011 13:28:13 -0700
Received: by gyf1 with SMTP id 1so830346gyf.23 for ; Fri, 15 Jul 2011 13:28:09 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To:
References: <20110711193036.5a03858d.kamezawa.hiroyu@jp.fujitsu.com> <20110714090221.1ead26d5.kamezawa.hiroyu@jp.fujitsu.com>
Date: Fri, 15 Jul 2011 13:28:08 -0700
Message-ID:
Subject: Re: [PATCH v2] memcg: add vmscan_stat
From: Andrew Bresticker
Content-Type: multipart/alternative; boundary=001636d33caddae8e404a8217ca0
Sender: owner-linux-mm@kvack.org
List-ID:
To: KAMEZAWA Hiroyuki
Cc: "linux-mm@kvack.org" , "akpm@linux-foundation.org" , "nishimura@mxp.nes.nec.co.jp" , "bsingharora@gmail.com" , Michal Hocko , Ying Han

--001636d33caddae8e404a8217ca0
Content-Type: text/plain; charset=ISO-8859-1

And this one tracks the number of pages unmapped:
--

From: Andrew Bresticker
Date: Fri, 15 Jul 2011 11:46:40 -0700
Subject: [PATCH] vmscan: Track pages unmapped during page reclaim.

Record the number of pages unmapped during page reclaim in
memory.vmscan_stat. Counters are broken down by type and context like
the other stats in memory.vmscan_stat.

Sample output:
$ mkdir /dev/cgroup/memory/1
$ echo 512m > /dev/cgroup/memory/1/memory.limit_in_bytes
$ echo $$ > /dev/cgroup/memory/1/tasks
$ pft -m 512m
$ cat /dev/cgroup/memory/1/memory.vmscan_stat
...
unmapped_pages_by_limit 67
unmapped_anon_pages_by_limit 0
unmapped_file_pages_by_limit 67
...
unmapped_pages_by_limit_under_hierarchy 67
unmapped_anon_pages_by_limit_under_hierarchy 0
unmapped_file_pages_by_limit_under_hierarchy 67

Signed-off-by: Andrew Bresticker
---
 include/linux/memcontrol.h |    1 +
 mm/memcontrol.c            |   12 ++++++++++++
 mm/vmscan.c                |    8 ++++++--
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4be907e..8d65b55 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -47,6 +47,7 @@ struct memcg_scanrecord {
 	unsigned long nr_rotated[2]; /* the number of rotated pages */
 	unsigned long nr_freed[2]; /* the number of freed pages */
 	unsigned long nr_written[2]; /* the number of pages written back */
+	unsigned long nr_unmapped[2]; /* the number of pages unmapped */
 	unsigned long elapsed; /* nsec of time elapsed while scanning */
 };

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5ec2aa3..6b4fbbd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -224,6 +224,9 @@ enum {
 	WRITTEN,
 	WRITTEN_ANON,
 	WRITTEN_FILE,
+	UNMAPPED,
+	UNMAPPED_ANON,
+	UNMAPPED_FILE,
 	ELAPSED,
 	NR_SCANSTATS,
 };
@@ -247,6 +250,9 @@ const char *scanstat_string[NR_SCANSTATS] = {
 	"written_pages",
 	"written_anon_pages",
 	"written_file_pages",
+	"unmapped_pages",
+	"unmapped_anon_pages",
+	"unmapped_file_pages",
 	"elapsed_ns",
 };
 #define SCANSTAT_WORD_LIMIT	"_by_limit"
@@ -1692,6 +1698,10 @@ static void __mem_cgroup_record_scanstat(unsigned long *stats,
 	stats[WRITTEN_ANON] += rec->nr_written[0];
 	stats[WRITTEN_FILE] += rec->nr_written[1];

+	stats[UNMAPPED] += rec->nr_unmapped[0] + rec->nr_unmapped[1];
+	stats[UNMAPPED_ANON] += rec->nr_unmapped[0];
+	stats[UNMAPPED_FILE] += rec->nr_unmapped[1];
+
 	stats[ELAPSED] += rec->elapsed;
 }

@@ -1806,6 +1816,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 	rec.nr_freed[1] = 0;
 	rec.nr_written[0] = 0;
 	rec.nr_written[1] = 0;
+	rec.nr_unmapped[0] = 0;
+	rec.nr_unmapped[1] = 0;
 	rec.elapsed = 0;
 	/* we use swappiness of local cgroup */
 	if (check_soft) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f73b96e..2d2bc99 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -728,6 +728,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 	unsigned long nr_congested = 0;
 	unsigned long nr_reclaimed = 0;
 	unsigned long nr_written = 0;
+	unsigned long nr_unmapped = 0;

 	cond_resched();

@@ -819,7 +820,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		case SWAP_MLOCK:
 			goto cull_mlocked;
 		case SWAP_SUCCESS:
-			; /* try to free the page below */
+			/* try to free the page below */
+			nr_unmapped++;
 		}
 	}

@@ -960,8 +962,10 @@ keep_lumpy:
 	free_page_list(&free_pages);

 	list_splice(&ret_pages, page_list);
-	if (!scanning_global_lru(sc))
+	if (!scanning_global_lru(sc)) {
 		sc->memcg_record->nr_written[file] += nr_written;
+		sc->memcg_record->nr_unmapped[file] += nr_unmapped;
+	}
 	count_vm_events(PGACTIVATE, pgactivate);
 	return nr_reclaimed;
 }
--
1.7.3.1

Thanks,
Andrew

On Fri, Jul 15, 2011 at 11:34 AM, Andrew Bresticker wrote:
> I've extended your patch to track write-back during page reclaim:
> ---
>
> From: Andrew Bresticker
> Date: Thu, 14 Jul 2011 17:56:48 -0700
> Subject: [PATCH] vmscan: Track number of pages written back during page reclaim.
>
> This tracks pages written out during page reclaim in memory.vmscan_stat
> and breaks it down by file vs. anon and context (like "scanned_pages",
> "rotated_pages", etc.).
>
> Example output:
> $ mkdir /dev/cgroup/memory/1
> $ echo 8m > /dev/cgroup/memory/1/memory.limit_in_bytes
> $ echo $$ > /dev/cgroup/memory/1/tasks
> $ dd if=/dev/urandom of=file_20g bs=4096 count=524288
> $ cat /dev/cgroup/memory/1/memory.vmscan_stat
> ...
> written_pages_by_limit 36
> written_anon_pages_by_limit 0
> written_file_pages_by_limit 36
> ...
> written_pages_by_limit_under_hierarchy 28
> written_anon_pages_by_limit_under_hierarchy 0
> written_file_pages_by_limit_under_hierarchy 28
>
> Signed-off-by: Andrew Bresticker
> ---
>  include/linux/memcontrol.h |    1 +
>  mm/memcontrol.c            |   12 ++++++++++++
>  mm/vmscan.c                |   10 +++++++---
>  3 files changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4b49edf..4be907e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -46,6 +46,7 @@ struct memcg_scanrecord {
>  	unsigned long nr_scanned[2]; /* the number of scanned pages */
>  	unsigned long nr_rotated[2]; /* the number of rotated pages */
>  	unsigned long nr_freed[2]; /* the number of freed pages */
> +	unsigned long nr_written[2]; /* the number of pages written back */
>  	unsigned long elapsed; /* nsec of time elapsed while scanning */
>  };
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9bb6e93..5ec2aa3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -221,6 +221,9 @@ enum {
>  	FREED,
>  	FREED_ANON,
>  	FREED_FILE,
> +	WRITTEN,
> +	WRITTEN_ANON,
> +	WRITTEN_FILE,
>  	ELAPSED,
>  	NR_SCANSTATS,
>  };
> @@ -241,6 +244,9 @@ const char *scanstat_string[NR_SCANSTATS] = {
>  	"freed_pages",
>  	"freed_anon_pages",
>  	"freed_file_pages",
> +	"written_pages",
> +	"written_anon_pages",
> +	"written_file_pages",
>  	"elapsed_ns",
>  };
>  #define SCANSTAT_WORD_LIMIT	"_by_limit"
> @@ -1682,6 +1688,10 @@ static void __mem_cgroup_record_scanstat(unsigned long *stats,
>  	stats[FREED_ANON] += rec->nr_freed[0];
>  	stats[FREED_FILE] += rec->nr_freed[1];
>
> +	stats[WRITTEN] += rec->nr_written[0] + rec->nr_written[1];
> +	stats[WRITTEN_ANON] += rec->nr_written[0];
> +	stats[WRITTEN_FILE] += rec->nr_written[1];
> +
>  	stats[ELAPSED] += rec->elapsed;
>  }
>
> @@ -1794,6 +1804,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  	rec.nr_rotated[1] = 0;
>  	rec.nr_freed[0] = 0;
>  	rec.nr_freed[1] = 0;
> +	rec.nr_written[0] = 0;
> +	rec.nr_written[1] = 0;
>  	rec.elapsed = 0;
>  	/* we use swappiness of local cgroup */
>  	if (check_soft) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8fb1abd..f73b96e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -719,7 +719,7 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
>   */
>  static unsigned long shrink_page_list(struct list_head *page_list,
>  					struct zone *zone,
> -					struct scan_control *sc)
> +					struct scan_control *sc, int file)
>  {
>  	LIST_HEAD(ret_pages);
>  	LIST_HEAD(free_pages);
> @@ -727,6 +727,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  	unsigned long nr_dirty = 0;
>  	unsigned long nr_congested = 0;
>  	unsigned long nr_reclaimed = 0;
> +	unsigned long nr_written = 0;
>
>  	cond_resched();
>
> @@ -840,6 +841,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		case PAGE_ACTIVATE:
>  			goto activate_locked;
>  		case PAGE_SUCCESS:
> +			nr_written++;
>  			if (PageWriteback(page))
>  				goto keep_lumpy;
>  			if (PageDirty(page))
> @@ -958,6 +960,8 @@ keep_lumpy:
>  	free_page_list(&free_pages);
>
>  	list_splice(&ret_pages, page_list);
> +	if (!scanning_global_lru(sc))
> +		sc->memcg_record->nr_written[file] += nr_written;
>  	count_vm_events(PGACTIVATE, pgactivate);
>  	return nr_reclaimed;
>  }
> @@ -1463,7 +1467,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>
>  	spin_unlock_irq(&zone->lru_lock);
>
> -	nr_reclaimed = shrink_page_list(&page_list, zone, sc);
> +	nr_reclaimed = shrink_page_list(&page_list, zone, sc, file);
>
>  	if (!scanning_global_lru(sc))
>  		sc->memcg_record->nr_freed[file] += nr_reclaimed;
> @@ -1471,7 +1475,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  	/* Check if we should syncronously wait for writeback */
>  	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
>  		set_reclaim_mode(priority, sc, true);
> -		nr_reclaimed += shrink_page_list(&page_list, zone, sc);
> +		nr_reclaimed += shrink_page_list(&page_list, zone, sc, file);
>  	}
>
>  	local_irq_disable();
> --
> 1.7.3.1
>
> Thanks,
> Andrew
>
> On Wed, Jul 13, 2011 at 5:02 PM, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> On Tue, 12 Jul 2011 16:02:02 -0700
>> Andrew Bresticker wrote:
>>
>> > On Mon, Jul 11, 2011 at 3:30 AM, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> >
>> > >
>> > > This patch is onto mmotm-0710... got bigger than expected ;(
>> > > ==
>> > > [PATCH] add memory.vmscan_stat
>> > >
>> > > The commit log of commit 0ae5e89 "memcg: count the soft_limit reclaim in..."
>> > > says it adds scanning stats to the memory.stat file. But it doesn't, because
>> > > we considered we needed to reach a consensus for such new APIs.
>> > >
>> > > This patch is a trial to add memory.scan_stat. This shows
>> > > - the number of scanned pages (total, anon, file)
>> > > - the number of rotated pages (total, anon, file)
>> > > - the number of freed pages (total, anon, file)
>> > > - the elapsed time (including sleep/pause time)
>> > >
>> > > for both direct and soft reclaim.
>> > >
>> > > The biggest difference from Ying's original version is that this file
>> > > can be reset by a write, as
>> > >
>> > > # echo 0 > ...../memory.scan_stat
>> > >
>> > > Example output is below. This is a result after a make -j 6 kernel build
>> > > under a 300M limit.
>> > >
>> > > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
>> > > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
>> > > scanned_pages_by_limit 9471864
>> > > scanned_anon_pages_by_limit 6640629
>> > > scanned_file_pages_by_limit 2831235
>> > > rotated_pages_by_limit 4243974
>> > > rotated_anon_pages_by_limit 3971968
>> > > rotated_file_pages_by_limit 272006
>> > > freed_pages_by_limit 2318492
>> > > freed_anon_pages_by_limit 962052
>> > > freed_file_pages_by_limit 1356440
>> > > elapsed_ns_by_limit 351386416101
>> > > scanned_pages_by_system 0
>> > > scanned_anon_pages_by_system 0
>> > > scanned_file_pages_by_system 0
>> > > rotated_pages_by_system 0
>> > > rotated_anon_pages_by_system 0
>> > > rotated_file_pages_by_system 0
>> > > freed_pages_by_system 0
>> > > freed_anon_pages_by_system 0
>> > > freed_file_pages_by_system 0
>> > > elapsed_ns_by_system 0
>> > > scanned_pages_by_limit_under_hierarchy 9471864
>> > > scanned_anon_pages_by_limit_under_hierarchy 6640629
>> > > scanned_file_pages_by_limit_under_hierarchy 2831235
>> > > rotated_pages_by_limit_under_hierarchy 4243974
>> > > rotated_anon_pages_by_limit_under_hierarchy 3971968
>> > > rotated_file_pages_by_limit_under_hierarchy 272006
>> > > freed_pages_by_limit_under_hierarchy 2318492
>> > > freed_anon_pages_by_limit_under_hierarchy 962052
>> > > freed_file_pages_by_limit_under_hierarchy 1356440
>> > > elapsed_ns_by_limit_under_hierarchy 351386416101
>> > > scanned_pages_by_system_under_hierarchy 0
>> > > scanned_anon_pages_by_system_under_hierarchy 0
>> > > scanned_file_pages_by_system_under_hierarchy 0
>> > > rotated_pages_by_system_under_hierarchy 0
>> > > rotated_anon_pages_by_system_under_hierarchy 0
>> > > rotated_file_pages_by_system_under_hierarchy 0
>> > > freed_pages_by_system_under_hierarchy 0
>> > > freed_anon_pages_by_system_under_hierarchy 0
>> > > freed_file_pages_by_system_under_hierarchy 0
>> > > elapsed_ns_by_system_under_hierarchy 0
>> > >
>> > >
>> > > total_xxxx is for hierarchy management.
>> > >
>> > > This will be useful for further memcg development and needs to be
>> > > developed before we do some complicated rework on LRU/softlimit
>> > > management.
>> > >
>> > > This patch adds a new struct memcg_scanrecord into the scan_control
>> > > struct. sc->nr_scanned et al. are not designed for exporting
>> > > information. For example, nr_scanned is reset frequently and
>> > > incremented by 2 when scanning mapped pages.
>> > >
>> > > To avoid complexity, I added a new param in scan_control which is for
>> > > exporting the scanning score.
>> > >
>> > > Signed-off-by: KAMEZAWA Hiroyuki
>> > >
>> > > Changelog:
>> > > - renamed as vmscan_stat
>> > > - handle file/anon
>> > > - added "rotated"
>> > > - changed names of param in vmscan_stat.
>> > > ---
>> > >  Documentation/cgroups/memory.txt |   85 +++++++++++++++++++
>> > >  include/linux/memcontrol.h       |   19 ++++
>> > >  include/linux/swap.h             |    6 -
>> > >  mm/memcontrol.c                  |  172 +++++++++++++++++++++++++++++++++++++--
>> > >  mm/vmscan.c                      |   39 +++++++-
>> > >  5 files changed, 303 insertions(+), 18 deletions(-)
>> > >
>> > > Index: mmotm-0710/Documentation/cgroups/memory.txt
>> > > ===================================================================
>> > > --- mmotm-0710.orig/Documentation/cgroups/memory.txt
>> > > +++ mmotm-0710/Documentation/cgroups/memory.txt
>> > > @@ -380,7 +380,7 @@ will be charged as a new owner of it.
>> > >
>> > >  5.2 stat file
>> > >
>> > > -memory.stat file includes following statistics
>> > > +5.2.1 memory.stat file includes following statistics
>> > >
>> > >  # per-memory cgroup local status
>> > >  cache - # of bytes of page cache memory.
>> > > @@ -438,6 +438,89 @@ Note:
>> > >  	file_mapped is accounted only when the memory cgroup is owner of page
>> > >  	cache.)
>> > >
>> > > +5.2.2 memory.vmscan_stat
>> > > +
>> > > +memory.vmscan_stat includes statistics for memory scanning, freeing,
>> > > +and reclaiming. The statistics show memory scanning information since
>> > > +memory cgroup creation and can be reset to 0 by writing 0 as
>> > > +
>> > > + #echo 0 > ../memory.vmscan_stat
>> > > +
>> > > +This file contains the following statistics.
>> > > +
>> > > +[param]_[file_or_anon]_pages_by_[reason]_[under_hierarchy]
>> > > +[param]_elapsed_ns_by_[reason]_[under_hierarchy]
>> > > +
>> > > +For example,
>> > > +
>> > > +  scanned_file_pages_by_limit indicates the number of scanned
>> > > +  file pages at vmscan.
>> > > +
>> > > +Now, 3 parameters are supported
>> > > +
>> > > +  scanned - the number of pages scanned by vmscan
>> > > +  rotated - the number of pages activated at vmscan
>> > > +  freed   - the number of pages freed by vmscan
>> > > +
>> > > +If "rotated" is high relative to scanned/freed, the memcg seems busy.
>> > > +
>> > > +Now, 2 reasons are supported
>> > > +
>> > > +  limit  - the memory cgroup's limit
>> > > +  system - global memory pressure + softlimit
>> > > +           (global memory pressure not under softlimit is not handled now)
>> > > +
>> > > +When under_hierarchy is added at the tail, the number indicates the
>> > > +total memcg scan of its children and itself.
>> > > +
>> > > +elapsed_ns is the elapsed time in nanoseconds. This may include sleep
>> > > +time and does not indicate CPU usage, so please take it as just showing
>> > > +latency.
>> > > +
>> > > +Here is an example.
>> > > +
>> > > +# cat /cgroup/memory/A/memory.vmscan_stat
>> > > +scanned_pages_by_limit 9471864
>> > > +scanned_anon_pages_by_limit 6640629
>> > > +scanned_file_pages_by_limit 2831235
>> > > +rotated_pages_by_limit 4243974
>> > > +rotated_anon_pages_by_limit 3971968
>> > > +rotated_file_pages_by_limit 272006
>> > > +freed_pages_by_limit 2318492
>> > > +freed_anon_pages_by_limit 962052
>> > > +freed_file_pages_by_limit 1356440
>> > > +elapsed_ns_by_limit 351386416101
>> > > +scanned_pages_by_system 0
>> > > +scanned_anon_pages_by_system 0
>> > > +scanned_file_pages_by_system 0
>> > > +rotated_pages_by_system 0
>> > > +rotated_anon_pages_by_system 0
>> > > +rotated_file_pages_by_system 0
>> > > +freed_pages_by_system 0
>> > > +freed_anon_pages_by_system 0
>> > > +freed_file_pages_by_system 0
>> > > +elapsed_ns_by_system 0
>> > > +scanned_pages_by_limit_under_hierarchy 9471864
>> > > +scanned_anon_pages_by_limit_under_hierarchy 6640629
>> > > +scanned_file_pages_by_limit_under_hierarchy 2831235
>> > > +rotated_pages_by_limit_under_hierarchy 4243974
>> > > +rotated_anon_pages_by_limit_under_hierarchy 3971968
>> > > +rotated_file_pages_by_limit_under_hierarchy 272006
>> > > +freed_pages_by_limit_under_hierarchy 2318492
>> > > +freed_anon_pages_by_limit_under_hierarchy 962052
>> > > +freed_file_pages_by_limit_under_hierarchy 1356440
>> > > +elapsed_ns_by_limit_under_hierarchy 351386416101
>> > > +scanned_pages_by_system_under_hierarchy 0
>> > > +scanned_anon_pages_by_system_under_hierarchy 0
>> > > +scanned_file_pages_by_system_under_hierarchy 0
>> > > +rotated_pages_by_system_under_hierarchy 0
>> > > +rotated_anon_pages_by_system_under_hierarchy 0
>> > > +rotated_file_pages_by_system_under_hierarchy 0
>> > > +freed_pages_by_system_under_hierarchy 0
>> > > +freed_anon_pages_by_system_under_hierarchy 0
>> > > +freed_file_pages_by_system_under_hierarchy 0
>> > > +elapsed_ns_by_system_under_hierarchy 0
>> > > +
>> > >  5.3 swappiness
>> > >
>> > >  Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups
>> > >  only.
>> > > Index: mmotm-0710/include/linux/memcontrol.h
>> > > ===================================================================
>> > > --- mmotm-0710.orig/include/linux/memcontrol.h
>> > > +++ mmotm-0710/include/linux/memcontrol.h
>> > > @@ -39,6 +39,16 @@ extern unsigned long mem_cgroup_isolate_
>> > >  					struct mem_cgroup *mem_cont,
>> > >  					int active, int file);
>> > >
>> > > +struct memcg_scanrecord {
>> > > +	struct mem_cgroup *mem; /* scanned memory cgroup */
>> > > +	struct mem_cgroup *root; /* scan target hierarchy root */
>> > > +	int context; /* scanning context (see memcontrol.c) */
>> > > +	unsigned long nr_scanned[2]; /* the number of scanned pages */
>> > > +	unsigned long nr_rotated[2]; /* the number of rotated pages */
>> > > +	unsigned long nr_freed[2]; /* the number of freed pages */
>> > > +	unsigned long elapsed; /* nsec of time elapsed while scanning */
>> > > +};
>> > > +
>> > >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>> > >  /*
>> > >   * All "charge" functions with gfp_mask should use GFP_KERNEL or
>> > > @@ -117,6 +127,15 @@ mem_cgroup_get_reclaim_stat_from_page(st
>> > >  extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>> > >  					struct task_struct *p);
>> > >
>> > > +extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
>> > > +						  gfp_t gfp_mask, bool noswap,
>> > > +						  struct memcg_scanrecord *rec);
>> > > +extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
>> > > +						  gfp_t gfp_mask, bool noswap,
>> > > +						  struct zone *zone,
>> > > +						  struct memcg_scanrecord *rec,
>> > > +						  unsigned long *nr_scanned);
>> > > +
>> > >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
>> > >  extern int do_swap_account;
>> > >  #endif
>> > > Index: mmotm-0710/include/linux/swap.h
>> > > ===================================================================
>> > > --- mmotm-0710.orig/include/linux/swap.h
>> > > +++ mmotm-0710/include/linux/swap.h
>> > > @@ -253,12 +253,6 @@ static inline void lru_cache_add_file(st
>> > >  /* linux/mm/vmscan.c */
>> > >  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>> > >  					gfp_t gfp_mask, nodemask_t *mask);
>> > > -extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
>> > > -						  gfp_t gfp_mask, bool noswap);
>> > > -extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
>> > > -						  gfp_t gfp_mask, bool noswap,
>> > > -						  struct zone *zone,
>> > > -						  unsigned long *nr_scanned);
>> > >  extern int __isolate_lru_page(struct page *page, int mode, int file);
>> > >  extern unsigned long shrink_all_memory(unsigned long nr_pages);
>> > >  extern int vm_swappiness;
>> > > Index: mmotm-0710/mm/memcontrol.c
>> > > ===================================================================
>> > > --- mmotm-0710.orig/mm/memcontrol.c
>> > > +++ mmotm-0710/mm/memcontrol.c
>> > > @@ -204,6 +204,50 @@ struct mem_cgroup_eventfd_list {
>> > >  static void mem_cgroup_threshold(struct mem_cgroup *mem);
>> > >  static void mem_cgroup_oom_notify(struct mem_cgroup *mem);
>> > >
>> > > +enum {
>> > > +	SCAN_BY_LIMIT,
>> > > +	SCAN_BY_SYSTEM,
>> > > +	NR_SCAN_CONTEXT,
>> > > +	SCAN_BY_SHRINK, /* not recorded now */
>> > > +};
>> > > +
>> > > +enum {
>> > > +	SCAN,
>> > > +	SCAN_ANON,
>> > > +	SCAN_FILE,
>> > > +	ROTATE,
>> > > +	ROTATE_ANON,
>> > > +	ROTATE_FILE,
>> > > +	FREED,
>> > > +	FREED_ANON,
>> > > +	FREED_FILE,
>> > > +	ELAPSED,
>> > > +	NR_SCANSTATS,
>> > > +};
>> > > +
>> > > +struct scanstat {
>> > > +	spinlock_t lock;
>> > > +	unsigned long stats[NR_SCAN_CONTEXT][NR_SCANSTATS];
>> > > +	unsigned long rootstats[NR_SCAN_CONTEXT][NR_SCANSTATS];
>> > > +};
>> >
>> > I'm working on a similar effort with Ying here at Google, and so far we've
>> > been using per-cpu counters for these statistics instead of spin-lock
>> > protected counters.
>> > Clearly the spin-lock protected counters have less memory overhead and
>> > make reading the stat file faster, but our concern is that this method
>> > is inconsistent with the other memory stat files such as /proc/vmstat
>> > and /dev/cgroup/memory/.../memory.stat. Is there any particular reason
>> > you chose to use spin-lock protected counters instead of per-cpu
>> > counters?
>> >
>>
>> In my experience, if we batch enough, it always works better than a
>> percpu counter. A percpu counter is effective when batching is difficult.
>> This patch's implementation does enough batching and is much more
>> coarse-grained than a percpu counter. So this patch is better than percpu.
>>
>>
>> > I've also modified your patch to use per-cpu counters instead of
>> > spin-lock protected counters. I tested it by doing streaming I/O from
>> > a ramdisk:
>> >
>> > $ mke2fs /dev/ram1
>> > $ mkdir /tmp/swapram
>> > $ mkdir /tmp/swapram/ram1
>> > $ mount -t ext2 /dev/ram1 /tmp/swapram/ram1
>> > $ dd if=/dev/urandom of=/tmp/swapram/ram1/file_16m bs=4096 count=4096
>> > $ mkdir /dev/cgroup/memory/1
>> > $ echo 8m > /dev/cgroup/memory/1/memory.limit_in_bytes
>> > $ ./ramdisk_load.sh 7
>> > $ echo $$ > /dev/cgroup/memory/1/tasks
>> > $ time for ((z=0; z<=2000; z++)); do cat /tmp/swapram/ram1/file_16m > /dev/zero; done
>> >
>> > Where ramdisk_load.sh is:
>> > for ((i=0; i<=$1; i++))
>> > do
>> > echo $$ > /dev/cgroup/memory/1/tasks
>> > for ((z=0; z<=2000; z++)); do cat /tmp/swapram/ram1/file_16m > /dev/zero; done &
>> > done
>> >
>> > Surprisingly, the per-cpu counters perform worse than the spin-lock
>> > protected counters. Over 10 runs of the test above, the per-cpu
>> > counters were 1.60% slower in both real time and sys time. I'm
>> > wondering if you have any insight as to why this is. I can provide my
>> > diff against your patch if necessary.
>> >
>>
>> A percpu counter works effectively only when we use +1/-1 at each change
>> of the counter. It uses "batch" to merge the per-cpu value into the
>> counter. I think you used the default "batch" value, but the
>> scan/rotate/free/elapsed values are always larger than "batch", so you
>> just added memory overhead and an "if" to pure spinlock counters.
>>
>> Determining this "batch" threshold for a percpu counter is difficult.
>>
>> Thanks,
>> -Kame
--

From: Andrew Bresticker <abrestic@google.com>
Date: Fri, 15 Jul 2011 11:46:40 -= 0700
Subject: [PATCH] vmscan: Track pages unmapped during page reclaim.

Record the number of pages unmapped during page recla= im in
memory.vmscan_stat. =A0Counters are broken down by type and=
context like the other stats in memory.vmscan_stat.

Sample output:
$ mkdir /dev/cgroup/memory/1
$ ec= ho 512m > /dev/cgroup/memory/1
$ echo $$ > /dev/cgroup/memo= ry/1
$ pft -m 512m
$ cat /dev/cgroup/memory/1/memory.vmscan_stat<= /div>
...
unmapped_pages_by_limit 67
unmapped_anon_= pages_by_limit 0
unmapped_file_pages_by_limit 67
...
unmapped_pages_by_limit_under_hierarchy 67
unmapped_anon_pag= es_by_limit_under_hierarchy 0
unmapped_file_pages_by_limit_under_= hierarchy 67

Signed-off-by: Andrew Bresticker <= abrestic@google.com>
---
=A0include/linux/memcontrol.h | =A0 =A01 +
=A0= mm/memcontrol.c =A0 =A0 =A0 =A0 =A0 =A0| =A0 12 ++++++++++++
=A0m= m/vmscan.c =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0| =A0 =A08 ++++++--
=A0= 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux= /memcontrol.h
index 4be907e..8d65b55 100644
--- a/inclu= de/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -47,6 +47,7 @@ struct memcg_scanrecord {
=A0 unsigned long nr_rotated[2];= /* the number of rotated pages */
=A0 unsigned long nr_freed[2]; /* the num= ber of freed pages */
=A0 un= signed long nr_written[2]; /* the number of pages written back */
+ unsigned= long nr_unmapped[2]; /* the number of pages unmapped */
=A0 un= signed long elapsed; /* nsec of time elapsed while scanning */
= =A0};
=A0
diff --git a/mm/memcontrol.c b/mm/memcontrol.= c
index 5ec2aa3..6b4fbbd 100644
--- a/mm/memcontrol.c
+++= b/mm/memcontrol.c
@@ -224,6 +224,9 @@ enum {
=A0 WRITTEN,
=A0 WRITTEN= _ANON,
=A0 WRITTEN_FILE,
+ UNMAPPED,
+ UNMA= PPED_ANON,
+ UNMAPPED_FILE,
=A0 ELAPSED,
=A0 NR= _SCANSTATS,
=A0};
@@ -247,6 +250,9 @@ const char *scans= tat_string[NR_SCANSTATS] =3D {
=A0 "written_pages",
=A0 &q= uot;written_anon_pages",
=A0 "written_file_pages",
= + "unm= apped_pages",
+ &quo= t;unmapped_anon_pages",
+ "unmapped_file_pages",
= =A0 "e= lapsed_ns",
=A0};
=A0#define SCANSTAT_WORD_LIMIT =A0 =A0"_by_limit&= quot;
@@ -1692,6 +1698,10 @@ static void __mem_cgroup_record_scan= stat(unsigned long *stats,
=A0 stats[WRITTEN_ANON] +=3D rec->nr_written[0= ];
=A0 st= ats[WRITTEN_FILE] +=3D rec->nr_written[1];
=A0
+ stats[UNMAPPED= ] +=3D rec->nr_unmapped[0] + rec->nr_unmapped[1];
+ stat= s[UNMAPPED_ANON] +=3D rec->nr_unmapped[0];
+ stats[UNMAPPED_FILE] +=3D re= c->nr_unmapped[1];
+
=A0 =A0 =A0 =A0 =A0stats[ELAPSED] +=3D rec->elapsed;
=A0}
=A0
@@ -1806,6 +1816,8 @@ static int mem_c= group_hierarchical_reclaim(struct mem_cgroup *root_mem,
=A0 =A0 = =A0 =A0 =A0 =A0 =A0 =A0 =A0rec.nr_freed[1] =3D 0;
=A0 r= ec.nr_written[0] =3D 0;
=A0 rec.nr_written[1] =3D 0;
+ rec.nr_unmapped[0]= =3D 0;
+ rec= .nr_unmapped[1] =3D 0;
=A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0rec.ela= psed =3D 0;
=A0 /* we use swappiness of local cgroup */
=A0 i= f (check_soft) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f73b96e..2d2bc99 100644
--- a/mm/vmscan.c
+++ b/= mm/vmscan.c
@@ -728,6 +728,7 @@ static unsigned long shrink_page_list(struct list_= head *page_list,
=A0 unsigned long nr_congested =3D 0;
=A0 unsigned long nr_r= eclaimed =3D 0;
=A0 un= signed long nr_written =3D 0;
+ unsigned long nr_unmapped =3D 0;
= =A0
=A0 cond_re= sched();
=A0
@@ -819,7 +820,8 @@ static unsigned long s= hrink_page_list(struct list_head *page_list,
=A0 case SWAP_MLOCK:
=A0 goto cull_mlocked;
=A0 case SWAP_SUCCESS:
- ; /* try to free the page b= elow */
+ /= * try to free the page below */
+ nr_unmapped++;
=A0 }
=A0 }=
=A0
@@ -960,8 +962,10 @@ keep_lumpy:
=A0 free_page_list= (&free_pages);
=A0
=A0 list_splice(&ret_pages, page_list);
- if (!scanning_global_= lru(sc))
+ if (= !scanning_global_lru(sc)) {
=A0 sc->memcg_record->nr_written[file] += =3D nr_written;
+ sc-= >memcg_record->nr_unmapped[file] +=3D nr_unmapped;
+ }
=A0 count_vm_eve= nts(PGACTIVATE, pgactivate);
=A0 re= turn nr_reclaimed;
=A0}
--=A0
1.7.3.1

Thanks,
Andrew

On Fri, Jul 15, 2011 at 11:34 AM, Andrew Bresticker &= lt;abrestic@google.com> wrote:
I've extended your patch to track write= -back during page reclaim:
---

From: Andr= ew Bresticker <= abrestic@google.com>
Date: Thu, 14 Jul 2011 17:56:48 -0700
Subject: [PATCH] vmscan: Track number of pages written back during pag= e reclaim.

This tracks pages written out during pa= ge reclaim in memory.vmscan_stat
and breaks it down by file vs. a= non and context (like "scanned_pages",
"rotated_pages", etc.).

Example out= put:
$ mkdir /dev/cgroup/memory/1
$ echo 8m > /dev/cgroup/memory/1/memory.limit_in_bytes
$ echo $$ > /dev/cgroup/memory/1/tasks
$ dd if=3D/dev/urandom of=3Dfile_20g bs=3D4096 count=3D524288
$ cat /dev/cgroup/memory/1/memory.vmscan_stat
...
written_pages_by_limit 36
written_anon_pages_by_limit 0
<= div> written_file_pages_by_limit 36
...
written_pages_by_limit_under_hierarchy 28
writ= ten_anon_pages_by_limit_under_hierarchy 0
written_file_pages_by_l= imit_under_hierarchy 28

Signed-off-by: Andrew Bres= ticker <abresti= c@google.com>
---
=A0include/linux/memcontrol.h | =A0 =A01 +
=A0= mm/memcontrol.c =A0 =A0 =A0 =A0 =A0 =A0| =A0 12 ++++++++++++
=A0m= m/vmscan.c =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0| =A0 10 +++++++---
=A0= 3 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux= /memcontrol.h
index 4b49edf..4be907e 100644
--- a/inclu= de/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -46,6 +46,7 @@ struct memcg_scanrecord {
=A0= unsigned long nr_scanned[2]; /= * the number of scanned pages */
=A0 unsigned long nr_rotated[2]; /* the number of rotated pages= */
=A0 unsigned long nr_freed= [2]; /* the number of freed pages */
+ unsigned long nr_written[2]; /* the number of pages= written back */
=A0 unsigned long elapsed;= /* nsec of time elapsed while scanning */
=A0};
= =A0
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9bb6e93..5ec2aa3 100644
--- a/mm/memcontrol.c
+++= b/mm/memcontrol.c
@@ -221,6 +221,9 @@ enum {
=A0 FREED,
= =A0 FREED_ANON,
=A0 FREED_FILE,
+ WRITTEN,
+ WRITTEN_ANON,
+ WRITTEN_FILE,
= =A0 ELAPSED,
=A0 NR_SCANSTATS,
=A0};
@@ -241,6 +244,9 @@ const char *scanstat_string[NR_SCA= NSTATS] =3D {
=A0 "freed_pages",
=A0 "freed_anon_pages",
=A0 "freed_file_pages= ",
+ "= ;written_pages",
+ "written_anon_pages",
+ "written_file_pages= ",
=A0 "elapsed_ns",
=A0};
=A0#define SCANS= TAT_WORD_LIMIT =A0 =A0"_by_limit"
@@ -1682,6 +1688,10 @@ static void __mem_cgroup_record_scanstat(= unsigned long *stats,
=A0 =A0 =A0 =A0 =A0stats[= FREED_ANON] +=3D rec->nr_freed[0];
=A0 =A0 =A0 =A0 =A0stats[FR= EED_FILE] +=3D rec->nr_freed[1];
=A0
+ stats[WRI= TTEN] +=3D rec->nr_written[0] + rec->nr_written[1];
+ stats[WRITTEN_ANON] +=3D rec->nr_= written[0];
+ stats[WRITTEN_FILE] +=3D= rec->nr_written[1];
+
=A0 =A0 =A0 =A0 =A0stats[ELAP= SED] +=3D rec->elapsed;
=A0}
=A0
@@ -1794,6 +1804,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 		rec.nr_rotated[1] = 0;
 		rec.nr_freed[0] = 0;
 		rec.nr_freed[1] = 0;
+		rec.nr_written[0] = 0;
+		rec.nr_written[1] = 0;
 		rec.elapsed = 0;
 		/* we use swappiness of local cgroup */
 		if (check_soft) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8fb1abd..f73b96e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -719,7 +719,7 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
 				      struct zone *zone,
-				      struct scan_control *sc)
+				      struct scan_control *sc, int file)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
@@ -727,6 +727,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 	unsigned long nr_dirty = 0;
 	unsigned long nr_congested = 0;
 	unsigned long nr_reclaimed = 0;
+	unsigned long nr_written = 0;

 	cond_resched();

@@ -840,6 +841,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		case PAGE_ACTIVATE:
 			goto activate_locked;
 		case PAGE_SUCCESS:
+			nr_written++;
 			if (PageWriteback(page))
 				goto keep_lumpy;
 			if (PageDirty(page))
@@ -958,6 +960,8 @@ keep_lumpy:
 	free_page_list(&free_pages);

 	list_splice(&ret_pages, page_list);
+	if (!scanning_global_lru(sc))
+		sc->memcg_record->nr_written[file] += nr_written;
 	count_vm_events(PGACTIVATE, pgactivate);
 	return nr_reclaimed;
 }
@@ -1463,7 +1467,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,

 	spin_unlock_irq(&zone->lru_lock);

-	nr_reclaimed = shrink_page_list(&page_list, zone, sc);
+	nr_reclaimed = shrink_page_list(&page_list, zone, sc, file);

 	if (!scanning_global_lru(sc))
 		sc->memcg_record->nr_freed[file] += nr_reclaimed;
@@ -1471,7 +1475,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
 		set_reclaim_mode(priority, sc, true);
-		nr_reclaimed += shrink_page_list(&page_list, zone, sc);
+		nr_reclaimed += shrink_page_list(&page_list, zone, sc, file);
 	}

 	local_irq_disable();
--
1.7.3.1

Thanks,
Andrew

On Wed, Jul 13, 2011 at 5:02 PM, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
On Tue, 12 Jul 2011 16:02:02 -0700
Andrew Bresticker <abrestic@google.com> wrote:

> On Mon, Jul 11, 2011 at 3:30 AM, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> >
> > This patch is onto mmotm-0710... got bigger than expected ;(
> > ==
> > [PATCH] add memory.vmscan_stat
> >
> > The commit log of commit 0ae5e89 "memcg: count the soft_limit reclaim in..."
> > says it adds scanning stats to the memory.stat file. But it doesn't, because
> > we considered we needed to reach a consensus on such new APIs.
> >
> > This patch is a trial to add memory.scan_stat. This shows
> >  - the number of scanned pages (total, anon, file)
> >  - the number of rotated pages (total, anon, file)
> >  - the number of freed pages (total, anon, file)
> >  - the elapsed time (including sleep/pause time)
> >
> > for both direct and soft reclaim.
> >
> > The biggest difference from Ying's original one is that this file
> > can be reset by a write, as
> >
> >  # echo 0 > ...../memory.scan_stat
> >
> > An example of the output is here. This is a result after make -j 6 kernel
> > under a 300M limit.
> >
> > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
> > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
> > scanned_pages_by_limit 9471864
> > scanned_anon_pages_by_limit 6640629
> > scanned_file_pages_by_limit 2831235
> > rotated_pages_by_limit 4243974
> > rotated_anon_pages_by_limit 3971968
> > rotated_file_pages_by_limit 272006
> > freed_pages_by_limit 2318492
> > freed_anon_pages_by_limit 962052
> > freed_file_pages_by_limit 1356440
> > elapsed_ns_by_limit 351386416101
> > scanned_pages_by_system 0
> > scanned_anon_pages_by_system 0
> > scanned_file_pages_by_system 0
> > rotated_pages_by_system 0
> > rotated_anon_pages_by_system 0
> > rotated_file_pages_by_system 0
> > freed_pages_by_system 0
> > freed_anon_pages_by_system 0
> > freed_file_pages_by_system 0
> > elapsed_ns_by_system 0
> > scanned_pages_by_limit_under_hierarchy 9471864
> > scanned_anon_pages_by_limit_under_hierarchy 6640629
> > scanned_file_pages_by_limit_under_hierarchy 2831235
> > rotated_pages_by_limit_under_hierarchy 4243974
> > rotated_anon_pages_by_limit_under_hierarchy 3971968
> > rotated_file_pages_by_limit_under_hierarchy 272006
> > freed_pages_by_limit_under_hierarchy 2318492
> > freed_anon_pages_by_limit_under_hierarchy 962052
> > freed_file_pages_by_limit_under_hierarchy 1356440
> > elapsed_ns_by_limit_under_hierarchy 351386416101
> > scanned_pages_by_system_under_hierarchy 0
> > scanned_anon_pages_by_system_under_hierarchy 0
> > scanned_file_pages_by_system_under_hierarchy 0
> > rotated_pages_by_system_under_hierarchy 0
> > rotated_anon_pages_by_system_under_hierarchy 0
> > rotated_file_pages_by_system_under_hierarchy 0
> > freed_pages_by_system_under_hierarchy 0
> > freed_anon_pages_by_system_under_hierarchy 0
> > freed_file_pages_by_system_under_hierarchy 0
> > elapsed_ns_by_system_under_hierarchy 0
> >
> >
> > total_xxxx is for hierarchy management.
> >
> > This will be useful for further memcg development and needs to be
> > developed before we do some complicated rework on LRU/softlimit
> > management.
> >
> > This patch adds a new struct memcg_scanrecord into the scan_control struct.
> > sc->nr_scanned et al. are not designed for exporting information. For
> > example, nr_scanned is reset frequently and incremented by +2 when scanning
> > mapped pages.
> >
> > To avoid complexity, I added a new param in scan_control which is for
> > exporting the scanning score.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > Changelog:
> >  - renamed as vmscan_stat
> >  - handle file/anon
> >  - added "rotated"
> >  - changed names of params in vmscan_stat.
> > ---
> >  Documentation/cgroups/memory.txt |   85 +++++++++++++++++++
> >  include/linux/memcontrol.h       |   19 ++++
> >  include/linux/swap.h             |    6 -
> >  mm/memcontrol.c                  |  172 +++++++++++++++++++++++++++++++++++++--
> >  mm/vmscan.c                      |   39 +++++++-
> >  5 files changed, 303 insertions(+), 18 deletions(-)
> >
> > Index: mmotm-0710/Documentation/cgroups/memory.txt
> > ===================================================================
> > --- mmotm-0710.orig/Documentation/cgroups/memory.txt
> > +++ mmotm-0710/Documentation/cgroups/memory.txt
> > @@ -380,7 +380,7 @@ will be charged as a new owner of it.
> >
> >  5.2 stat file
> >
> > -memory.stat file includes following statistics
> > +5.2.1 memory.stat file includes following statistics
> >
> >  # per-memory cgroup local status
> >  cache       - # of bytes of page cache memory.
> > @@ -438,6 +438,89 @@ Note:
> >  	file_mapped is accounted only when the memory cgroup is owner of page
> >  	cache.)
> >
> > +5.2.2 memory.vmscan_stat
> > +
> > +memory.vmscan_stat includes statistics about memory scanning,
> > +freeing and reclaiming. The statistics show memory scanning information
> > +since memory cgroup creation and can be reset to 0 by writing 0 as
> > +
> > + #echo 0 > ../memory.vmscan_stat
> > +
> > +This file contains the following statistics.
> > +
> > +[param]_[file_or_anon]_pages_by_[reason]_[under_hierarchy]
> > +[param]_elapsed_ns_by_[reason]_[under_hierarchy]
> > +
> > +For example,
> > +
> > +  scanned_file_pages_by_limit indicates the number of scanned
> > +  file pages at vmscan.
> > +
> > +Now, 3 parameters are supported
> > +
> > +  scanned - the number of pages scanned by vmscan
> > +  rotated - the number of pages activated at vmscan
> > +  freed   - the number of pages freed by vmscan
> > +
> > +If "rotated" is high relative to scanned/freed, the memcg seems busy.
> > +
> > +Now, 2 reasons are supported
> > +
> > +  limit  - the memory cgroup's limit
> > +  system - global memory pressure + softlimit
> > +           (global memory pressure not under softlimit is not handled now)
> > +
> > +When under_hierarchy is added at the tail, the number indicates the
> > +total memcg scan of its children and itself.
> > +
> > +elapsed_ns is the elapsed time in nanoseconds. This may include sleep time
> > +and does not indicate CPU usage. So, please take this as just showing
> > +latency.
> > +
> > +Here is an example.
> > +
> > +# cat /cgroup/memory/A/memory.vmscan_stat
> > +scanned_pages_by_limit 9471864
> > +scanned_anon_pages_by_limit 6640629
> > +scanned_file_pages_by_limit 2831235
> > +rotated_pages_by_limit 4243974
> > +rotated_anon_pages_by_limit 3971968
> > +rotated_file_pages_by_limit 272006
> > +freed_pages_by_limit 2318492
> > +freed_anon_pages_by_limit 962052
> > +freed_file_pages_by_limit 1356440
> > +elapsed_ns_by_limit 351386416101
> > +scanned_pages_by_system 0
> > +scanned_anon_pages_by_system 0
> > +scanned_file_pages_by_system 0
> > +rotated_pages_by_system 0
> > +rotated_anon_pages_by_system 0
> > +rotated_file_pages_by_system 0
> > +freed_pages_by_system 0
> > +freed_anon_pages_by_system 0
> > +freed_file_pages_by_system 0
> > +elapsed_ns_by_system 0
> > +scanned_pages_by_limit_under_hierarchy 9471864
> > +scanned_anon_pages_by_limit_under_hierarchy 6640629
> > +scanned_file_pages_by_limit_under_hierarchy 2831235
> > +rotated_pages_by_limit_under_hierarchy 4243974
> > +rotated_anon_pages_by_limit_under_hierarchy 3971968
> > +rotated_file_pages_by_limit_under_hierarchy 272006
> > +freed_pages_by_limit_under_hierarchy 2318492
> > +freed_anon_pages_by_limit_under_hierarchy 962052
> > +freed_file_pages_by_limit_under_hierarchy 1356440
> > +elapsed_ns_by_limit_under_hierarchy 351386416101
> > +scanned_pages_by_system_under_hierarchy 0
> > +scanned_anon_pages_by_system_under_hierarchy 0
> > +scanned_file_pages_by_system_under_hierarchy 0
> > +rotated_pages_by_system_under_hierarchy 0
> > +rotated_anon_pages_by_system_under_hierarchy 0
> > +rotated_file_pages_by_system_under_hierarchy 0
> > +freed_pages_by_system_under_hierarchy 0
> > +freed_anon_pages_by_system_under_hierarchy 0
> > +freed_file_pages_by_system_under_hierarchy 0
> > +elapsed_ns_by_system_under_hierarchy 0
> > +
> >  5.3 swappiness
> >
> >  Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups
> > only.
> > Index: mmotm-0710/include/linux/memcontrol.h
> > ===================================================================
> > --- mmotm-0710.orig/include/linux/memcontrol.h
> > +++ mmotm-0710/include/linux/memcontrol.h
> > @@ -39,6 +39,16 @@ extern unsigned long mem_cgroup_isolate_
> >                     struct mem_cgroup *mem_cont,
> >                     int active, int file);
> >
> > +struct memcg_scanrecord {
> > +       struct mem_cgroup *mem; /* scanned memory cgroup */
> > +       struct mem_cgroup *root; /* scan target hierarchy root */
> > +       int context;            /* scanning context (see memcontrol.c) */
> > +       unsigned long nr_scanned[2]; /* the number of scanned pages */
> > +       unsigned long nr_rotated[2]; /* the number of rotated pages */
> > +       unsigned long nr_freed[2]; /* the number of freed pages */
> > +       unsigned long elapsed; /* nsec of time elapsed while scanning */
> > +};
> > +
> > +
> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> >  /*
> >   * All "charge" functions with gfp_mask should use GFP_KERNEL or
> > @@ -117,6 +127,15 @@ mem_cgroup_get_reclaim_stat_from_page(st
> >  extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> >                     struct task_struct *p);
> >
> > +extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> > +                                                 gfp_t gfp_mask, bool noswap,
> > +                                                 struct memcg_scanrecord *rec);
> > +extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> > +                                                gfp_t gfp_mask, bool noswap,
> > +                                                struct zone *zone,
> > +                                                struct memcg_scanrecord *rec,
> > +                                                unsigned long *nr_scanned);
> > +
> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> >  extern int do_swap_account;
> >  #endif
> > Index: mmotm-0710/include/linux/swap.h
> > ===================================================================
> > --- mmotm-0710.orig/include/linux/swap.h
> > +++ mmotm-0710/include/linux/swap.h
> > @@ -253,12 +253,6 @@ static inline void lru_cache_add_file(st
> >  /* linux/mm/vmscan.c */
> >  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> >                     gfp_t gfp_mask, nodemask_t *mask);
> > -extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> > -                                                 gfp_t gfp_mask, bool noswap);
> > -extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> > -                                                gfp_t gfp_mask, bool noswap,
> > -                                                struct zone *zone,
> > -                                                unsigned long *nr_scanned);
> >  extern int __isolate_lru_page(struct page *page, int mode, int file);
> >  extern unsigned long shrink_all_memory(unsigned long nr_pages);
> >  extern int vm_swappiness;
> > Index: mmotm-0710/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-0710.orig/mm/memcontrol.c
> > +++ mmotm-0710/mm/memcontrol.c
> > @@ -204,6 +204,50 @@ struct mem_cgroup_eventfd_list {
> >  static void mem_cgroup_threshold(struct mem_cgroup *mem);
> >  static void mem_cgroup_oom_notify(struct mem_cgroup *mem);
> >
> > +enum {
> > +       SCAN_BY_LIMIT,
> > +       SCAN_BY_SYSTEM,
> > +       NR_SCAN_CONTEXT,
> > +       SCAN_BY_SHRINK, /* not recorded now */
> > +};
> > +
> > +enum {
> > +       SCAN,
> > +       SCAN_ANON,
> > +       SCAN_FILE,
> > +       ROTATE,
> > +       ROTATE_ANON,
> > +       ROTATE_FILE,
> > +       FREED,
> > +       FREED_ANON,
> > +       FREED_FILE,
> > +       ELAPSED,
> > +       NR_SCANSTATS,
> > +};
> > +
> > +struct scanstat {
> > +       spinlock_t      lock;
> > +       unsigned long   stats[NR_SCAN_CONTEXT][NR_SCANSTATS];
> > +       unsigned long   rootstats[NR_SCAN_CONTEXT][NR_SCANSTATS];
> > +};
> >
>
> I'm working on a similar effort with Ying here at Google, and so far we've
> been using per-cpu counters for these statistics instead of spin-lock-
> protected counters.  Clearly the spin-lock-protected counters have less
> memory overhead and make reading the stat file faster, but our concern is
> that this method is inconsistent with the other memory stat files such as
> /proc/vmstat and /dev/cgroup/memory/.../memory.stat.  Is there any
> particular reason you chose to use spin-lock-protected counters instead of
> per-cpu counters?
>

In my experience, if we do "batch" enough, it always works better than a
percpu counter. A percpu counter is effective when batching is difficult.
This patch's implementation does enough batching and is much coarser
grained than a percpu counter. So this patch is better than percpu.


> I've also modified your patch to use per-cpu counters instead of spin-lock-
> protected counters.  I tested it by doing streaming I/O from a ramdisk:
>
> $ mke2fs /dev/ram1
> $ mkdir /tmp/swapram
> $ mkdir /tmp/swapram/ram1
> $ mount -t ext2 /dev/ram1 /tmp/swapram/ram1
> $ dd if=/dev/urandom of=/tmp/swapram/ram1/file_16m bs=4096 count=4096
> $ mkdir /dev/cgroup/memory/1
> $ echo 8m > /dev/cgroup/memory/1
> $ ./ramdisk_load.sh 7
> $ echo $$ > /dev/cgroup/memory/1/tasks
> $ time for ((z=0; z<=2000; z++)); do cat /tmp/swapram/ram1/file_16m > /dev/zero; done
>
> Where ramdisk_load.sh is:
> for ((i=0; i<=$1; i++))
> do
>   echo $$ > /dev/cgroup/memory/1/tasks
>   for ((z=0; z<=2000; z++)); do cat /tmp/swapram/ram1/file_16m > /dev/zero; done &
> done
>
> Surprisingly, the per-cpu counters perform worse than the spin-lock-
> protected counters.  Over 10 runs of the test above, the per-cpu counters
> were 1.60% slower in both real time and sys time.  I'm wondering if you have
> any insight as to why this is.  I can provide my diff against your patch if
> necessary.
>

The percpu counter works effectively only when we use +1/-1 at each change of
the counters. It uses "batch" to merge the per-cpu value into the counter.
I think you use the default "batch" value, but the scan/rotate/free/elapsed
values are always larger than "batch", so you just added memory overhead and
an "if" to pure spinlock counters.

Determining this "batch" threshold for a percpu counter is difficult.

Thanks,
-Kame



