* [PATCH v2] memcg: add vmscan_stat
@ 2011-07-11 10:30 KAMEZAWA Hiroyuki
2011-07-12 23:02 ` Andrew Bresticker
2011-07-18 21:00 ` Andrew Bresticker
0 siblings, 2 replies; 9+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-11 10:30 UTC (permalink / raw)
To: linux-mm; +Cc: akpm, nishimura, bsingharora, Michal Hocko, Ying Han
This patch is onto mmotm-0710... got bigger than expected ;(
==
[PATCH] add memory.vmscan_stat
The commit log of commit 0ae5e89 ("memcg: count the soft_limit reclaim in...")
says it adds scanning stats to the memory.stat file. But it doesn't, because
we decided we first needed to reach a consensus on such new APIs.
This patch is a trial to add memory.vmscan_stat. This shows
 - the number of scanned pages (total, anon, file)
 - the number of rotated pages (total, anon, file)
 - the number of freed pages (total, anon, file)
 - the elapsed time (including sleep/pause time)
for both direct and soft reclaim.
The biggest difference from Ying's original version is that this file
can be reset by a write, as
# echo 0 > ...../memory.vmscan_stat
Example of output is here. This is a result after make -j 6 kernel
under 300M limit.
[kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
scanned_pages_by_limit 9471864
scanned_anon_pages_by_limit 6640629
scanned_file_pages_by_limit 2831235
rotated_pages_by_limit 4243974
rotated_anon_pages_by_limit 3971968
rotated_file_pages_by_limit 272006
freed_pages_by_limit 2318492
freed_anon_pages_by_limit 962052
freed_file_pages_by_limit 1356440
elapsed_ns_by_limit 351386416101
scanned_pages_by_system 0
scanned_anon_pages_by_system 0
scanned_file_pages_by_system 0
rotated_pages_by_system 0
rotated_anon_pages_by_system 0
rotated_file_pages_by_system 0
freed_pages_by_system 0
freed_anon_pages_by_system 0
freed_file_pages_by_system 0
elapsed_ns_by_system 0
scanned_pages_by_limit_under_hierarchy 9471864
scanned_anon_pages_by_limit_under_hierarchy 6640629
scanned_file_pages_by_limit_under_hierarchy 2831235
rotated_pages_by_limit_under_hierarchy 4243974
rotated_anon_pages_by_limit_under_hierarchy 3971968
rotated_file_pages_by_limit_under_hierarchy 272006
freed_pages_by_limit_under_hierarchy 2318492
freed_anon_pages_by_limit_under_hierarchy 962052
freed_file_pages_by_limit_under_hierarchy 1356440
elapsed_ns_by_limit_under_hierarchy 351386416101
scanned_pages_by_system_under_hierarchy 0
scanned_anon_pages_by_system_under_hierarchy 0
scanned_file_pages_by_system_under_hierarchy 0
rotated_pages_by_system_under_hierarchy 0
rotated_anon_pages_by_system_under_hierarchy 0
rotated_file_pages_by_system_under_hierarchy 0
freed_pages_by_system_under_hierarchy 0
freed_anon_pages_by_system_under_hierarchy 0
freed_file_pages_by_system_under_hierarchy 0
elapsed_ns_by_system_under_hierarchy 0
The xxxx_under_hierarchy entries are for hierarchy management.
This will be useful for further memcg development and needs to be in
place before we do any complicated rework of LRU/softlimit management.
This patch adds a new struct memcg_scanrecord to the scan_control struct.
sc->nr_scanned et al. are not designed for exporting information; for
example, nr_scanned is reset frequently and is incremented by 2 when
scanning mapped pages.
To avoid complexity, I added a new member to scan_control which is
dedicated to exporting scanning statistics.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Changelog:
- renamed as vmscan_stat
- handle file/anon
- added "rotated"
- changed names of param in vmscan_stat.
---
Documentation/cgroups/memory.txt | 85 +++++++++++++++++++
include/linux/memcontrol.h | 19 ++++
include/linux/swap.h | 6 -
mm/memcontrol.c | 172 +++++++++++++++++++++++++++++++++++++--
mm/vmscan.c | 39 +++++++-
5 files changed, 303 insertions(+), 18 deletions(-)
Index: mmotm-0710/Documentation/cgroups/memory.txt
===================================================================
--- mmotm-0710.orig/Documentation/cgroups/memory.txt
+++ mmotm-0710/Documentation/cgroups/memory.txt
@@ -380,7 +380,7 @@ will be charged as a new owner of it.
5.2 stat file
-memory.stat file includes following statistics
+5.2.1 memory.stat file includes following statistics
# per-memory cgroup local status
cache - # of bytes of page cache memory.
@@ -438,6 +438,89 @@ Note:
file_mapped is accounted only when the memory cgroup is owner of page
cache.)
+5.2.2 memory.vmscan_stat
+
+memory.vmscan_stat includes statistics about memory scanning, freeing,
+and reclaiming. It shows memory scanning information accumulated since
+memory cgroup creation, and can be reset to 0 by writing 0 as
+
+ #echo 0 > ../memory.vmscan_stat
+
+This file contains the following statistics.
+
+[param]_[file_or_anon]_pages_by_[reason]_[under_hierarchy]
+elapsed_ns_by_[reason]_[under_hierarchy]
+
+For example,
+
+ scanned_file_pages_by_limit indicates the number of file
+ pages scanned by vmscan.
+
+Now, 3 parameters are supported
+
+ scanned - the number of pages scanned by vmscan
+ rotated - the number of pages activated at vmscan
+ freed - the number of pages freed by vmscan
+
+If "rotated" is high relative to scanned/freed, the memcg seems busy.
+
+Now, 2 reasons are supported
+
+ limit - the memory cgroup's limit
+ system - global memory pressure + softlimit
+ (global memory pressure not under softlimit is not handled now)
+
+When under_hierarchy is appended at the tail, the number indicates the
+total scanning done by the memcg and all of its children.
+
+elapsed_ns is the elapsed time in nanoseconds. This may include sleep time
+and does not indicate CPU usage. So, please take this as just showing
+latency.
+
+Here is an example.
+
+# cat /cgroup/memory/A/memory.vmscan_stat
+scanned_pages_by_limit 9471864
+scanned_anon_pages_by_limit 6640629
+scanned_file_pages_by_limit 2831235
+rotated_pages_by_limit 4243974
+rotated_anon_pages_by_limit 3971968
+rotated_file_pages_by_limit 272006
+freed_pages_by_limit 2318492
+freed_anon_pages_by_limit 962052
+freed_file_pages_by_limit 1356440
+elapsed_ns_by_limit 351386416101
+scanned_pages_by_system 0
+scanned_anon_pages_by_system 0
+scanned_file_pages_by_system 0
+rotated_pages_by_system 0
+rotated_anon_pages_by_system 0
+rotated_file_pages_by_system 0
+freed_pages_by_system 0
+freed_anon_pages_by_system 0
+freed_file_pages_by_system 0
+elapsed_ns_by_system 0
+scanned_pages_by_limit_under_hierarchy 9471864
+scanned_anon_pages_by_limit_under_hierarchy 6640629
+scanned_file_pages_by_limit_under_hierarchy 2831235
+rotated_pages_by_limit_under_hierarchy 4243974
+rotated_anon_pages_by_limit_under_hierarchy 3971968
+rotated_file_pages_by_limit_under_hierarchy 272006
+freed_pages_by_limit_under_hierarchy 2318492
+freed_anon_pages_by_limit_under_hierarchy 962052
+freed_file_pages_by_limit_under_hierarchy 1356440
+elapsed_ns_by_limit_under_hierarchy 351386416101
+scanned_pages_by_system_under_hierarchy 0
+scanned_anon_pages_by_system_under_hierarchy 0
+scanned_file_pages_by_system_under_hierarchy 0
+rotated_pages_by_system_under_hierarchy 0
+rotated_anon_pages_by_system_under_hierarchy 0
+rotated_file_pages_by_system_under_hierarchy 0
+freed_pages_by_system_under_hierarchy 0
+freed_anon_pages_by_system_under_hierarchy 0
+freed_file_pages_by_system_under_hierarchy 0
+elapsed_ns_by_system_under_hierarchy 0
+
5.3 swappiness
Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
Index: mmotm-0710/include/linux/memcontrol.h
===================================================================
--- mmotm-0710.orig/include/linux/memcontrol.h
+++ mmotm-0710/include/linux/memcontrol.h
@@ -39,6 +39,16 @@ extern unsigned long mem_cgroup_isolate_
struct mem_cgroup *mem_cont,
int active, int file);
+struct memcg_scanrecord {
+ struct mem_cgroup *mem; /* scanned memory cgroup */
+ struct mem_cgroup *root; /* scan target hierarchy root */
+ int context; /* scanning context (see memcontrol.c) */
+ unsigned long nr_scanned[2]; /* the number of scanned pages */
+ unsigned long nr_rotated[2]; /* the number of rotated pages */
+ unsigned long nr_freed[2]; /* the number of freed pages */
+ unsigned long elapsed; /* nsec of time elapsed while scanning */
+};
+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
/*
* All "charge" functions with gfp_mask should use GFP_KERNEL or
@@ -117,6 +127,15 @@ mem_cgroup_get_reclaim_stat_from_page(st
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);
+extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
+ gfp_t gfp_mask, bool noswap,
+ struct memcg_scanrecord *rec);
+extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
+ gfp_t gfp_mask, bool noswap,
+ struct zone *zone,
+ struct memcg_scanrecord *rec,
+ unsigned long *nr_scanned);
+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
#endif
Index: mmotm-0710/include/linux/swap.h
===================================================================
--- mmotm-0710.orig/include/linux/swap.h
+++ mmotm-0710/include/linux/swap.h
@@ -253,12 +253,6 @@ static inline void lru_cache_add_file(st
/* linux/mm/vmscan.c */
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
-extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
- gfp_t gfp_mask, bool noswap);
-extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
- gfp_t gfp_mask, bool noswap,
- struct zone *zone,
- unsigned long *nr_scanned);
extern int __isolate_lru_page(struct page *page, int mode, int file);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
Index: mmotm-0710/mm/memcontrol.c
===================================================================
--- mmotm-0710.orig/mm/memcontrol.c
+++ mmotm-0710/mm/memcontrol.c
@@ -204,6 +204,50 @@ struct mem_cgroup_eventfd_list {
static void mem_cgroup_threshold(struct mem_cgroup *mem);
static void mem_cgroup_oom_notify(struct mem_cgroup *mem);
+enum {
+ SCAN_BY_LIMIT,
+ SCAN_BY_SYSTEM,
+ NR_SCAN_CONTEXT,
+ SCAN_BY_SHRINK, /* not recorded now */
+};
+
+enum {
+ SCAN,
+ SCAN_ANON,
+ SCAN_FILE,
+ ROTATE,
+ ROTATE_ANON,
+ ROTATE_FILE,
+ FREED,
+ FREED_ANON,
+ FREED_FILE,
+ ELAPSED,
+ NR_SCANSTATS,
+};
+
+struct scanstat {
+ spinlock_t lock;
+ unsigned long stats[NR_SCAN_CONTEXT][NR_SCANSTATS];
+ unsigned long rootstats[NR_SCAN_CONTEXT][NR_SCANSTATS];
+};
+
+const char *scanstat_string[NR_SCANSTATS] = {
+ "scanned_pages",
+ "scanned_anon_pages",
+ "scanned_file_pages",
+ "rotated_pages",
+ "rotated_anon_pages",
+ "rotated_file_pages",
+ "freed_pages",
+ "freed_anon_pages",
+ "freed_file_pages",
+ "elapsed_ns",
+};
+#define SCANSTAT_WORD_LIMIT "_by_limit"
+#define SCANSTAT_WORD_SYSTEM "_by_system"
+#define SCANSTAT_WORD_HIERARCHY "_under_hierarchy"
+
+
/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
@@ -266,7 +310,8 @@ struct mem_cgroup {
/* For oom notifier event fd */
struct list_head oom_notify;
-
+ /* For recording LRU-scan statistics */
+ struct scanstat scanstat;
/*
* Should we move charges of a task when a task is moved into this
* mem_cgroup ? And what type of charges should we move ?
@@ -1619,6 +1664,44 @@ bool mem_cgroup_reclaimable(struct mem_c
}
#endif
+static void __mem_cgroup_record_scanstat(unsigned long *stats,
+ struct memcg_scanrecord *rec)
+{
+
+ stats[SCAN] += rec->nr_scanned[0] + rec->nr_scanned[1];
+ stats[SCAN_ANON] += rec->nr_scanned[0];
+ stats[SCAN_FILE] += rec->nr_scanned[1];
+
+ stats[ROTATE] += rec->nr_rotated[0] + rec->nr_rotated[1];
+ stats[ROTATE_ANON] += rec->nr_rotated[0];
+ stats[ROTATE_FILE] += rec->nr_rotated[1];
+
+ stats[FREED] += rec->nr_freed[0] + rec->nr_freed[1];
+ stats[FREED_ANON] += rec->nr_freed[0];
+ stats[FREED_FILE] += rec->nr_freed[1];
+
+ stats[ELAPSED] += rec->elapsed;
+}
+
+static void mem_cgroup_record_scanstat(struct memcg_scanrecord *rec)
+{
+ struct mem_cgroup *mem;
+ int context = rec->context;
+
+ if (context >= NR_SCAN_CONTEXT)
+ return;
+
+ mem = rec->mem;
+ spin_lock(&mem->scanstat.lock);
+ __mem_cgroup_record_scanstat(mem->scanstat.stats[context], rec);
+ spin_unlock(&mem->scanstat.lock);
+
+ mem = rec->root;
+ spin_lock(&mem->scanstat.lock);
+ __mem_cgroup_record_scanstat(mem->scanstat.rootstats[context], rec);
+ spin_unlock(&mem->scanstat.lock);
+}
+
/*
* Scan the hierarchy if needed to reclaim memory. We remember the last child
* we reclaimed from, so that we don't end up penalizing one child extensively
@@ -1643,8 +1726,9 @@ static int mem_cgroup_hierarchical_recla
bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
+ struct memcg_scanrecord rec;
unsigned long excess;
- unsigned long nr_scanned;
+ unsigned long scanned;
excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
@@ -1652,6 +1736,15 @@ static int mem_cgroup_hierarchical_recla
if (!check_soft && root_mem->memsw_is_minimum)
noswap = true;
+ if (shrink)
+ rec.context = SCAN_BY_SHRINK;
+ else if (check_soft)
+ rec.context = SCAN_BY_SYSTEM;
+ else
+ rec.context = SCAN_BY_LIMIT;
+
+ rec.root = root_mem;
+
while (1) {
victim = mem_cgroup_select_victim(root_mem);
if (victim == root_mem) {
@@ -1692,14 +1785,23 @@ static int mem_cgroup_hierarchical_recla
css_put(&victim->css);
continue;
}
+ rec.mem = victim;
+ rec.nr_scanned[0] = 0;
+ rec.nr_scanned[1] = 0;
+ rec.nr_rotated[0] = 0;
+ rec.nr_rotated[1] = 0;
+ rec.nr_freed[0] = 0;
+ rec.nr_freed[1] = 0;
+ rec.elapsed = 0;
/* we use swappiness of local cgroup */
if (check_soft) {
ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
- noswap, zone, &nr_scanned);
- *total_scanned += nr_scanned;
+ noswap, zone, &rec, &scanned);
+ *total_scanned += scanned;
} else
ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
- noswap);
+ noswap, &rec);
+ mem_cgroup_record_scanstat(&rec);
css_put(&victim->css);
/*
* At shrinking usage, we can't check we should stop here or
@@ -3688,14 +3790,18 @@ try_to_free:
/* try to free all pages in this cgroup */
shrink = 1;
while (nr_retries && mem->res.usage > 0) {
+ struct memcg_scanrecord rec;
int progress;
if (signal_pending(current)) {
ret = -EINTR;
goto out;
}
+ rec.context = SCAN_BY_SHRINK;
+ rec.mem = mem;
+ rec.root = mem;
progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
- false);
+ false, &rec);
if (!progress) {
nr_retries--;
/* maybe some writeback is necessary */
@@ -4539,6 +4645,54 @@ static int mem_control_numa_stat_open(st
}
#endif /* CONFIG_NUMA */
+static int mem_cgroup_vmscan_stat_read(struct cgroup *cgrp,
+ struct cftype *cft,
+ struct cgroup_map_cb *cb)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+ char string[64];
+ int i;
+
+ for (i = 0; i < NR_SCANSTATS; i++) {
+ strcpy(string, scanstat_string[i]);
+ strcat(string, SCANSTAT_WORD_LIMIT);
+ cb->fill(cb, string, mem->scanstat.stats[SCAN_BY_LIMIT][i]);
+ }
+
+ for (i = 0; i < NR_SCANSTATS; i++) {
+ strcpy(string, scanstat_string[i]);
+ strcat(string, SCANSTAT_WORD_SYSTEM);
+ cb->fill(cb, string, mem->scanstat.stats[SCAN_BY_SYSTEM][i]);
+ }
+
+ for (i = 0; i < NR_SCANSTATS; i++) {
+ strcpy(string, scanstat_string[i]);
+ strcat(string, SCANSTAT_WORD_LIMIT);
+ strcat(string, SCANSTAT_WORD_HIERARCHY);
+ cb->fill(cb, string, mem->scanstat.rootstats[SCAN_BY_LIMIT][i]);
+ }
+ for (i = 0; i < NR_SCANSTATS; i++) {
+ strcpy(string, scanstat_string[i]);
+ strcat(string, SCANSTAT_WORD_SYSTEM);
+ strcat(string, SCANSTAT_WORD_HIERARCHY);
+ cb->fill(cb, string, mem->scanstat.rootstats[SCAN_BY_SYSTEM][i]);
+ }
+ return 0;
+}
+
+static int mem_cgroup_reset_vmscan_stat(struct cgroup *cgrp,
+ unsigned int event)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+
+ spin_lock(&mem->scanstat.lock);
+ memset(&mem->scanstat.stats, 0, sizeof(mem->scanstat.stats));
+ memset(&mem->scanstat.rootstats, 0, sizeof(mem->scanstat.rootstats));
+ spin_unlock(&mem->scanstat.lock);
+ return 0;
+}
+
+
static struct cftype mem_cgroup_files[] = {
{
.name = "usage_in_bytes",
@@ -4609,6 +4763,11 @@ static struct cftype mem_cgroup_files[]
.mode = S_IRUGO,
},
#endif
+ {
+ .name = "vmscan_stat",
+ .read_map = mem_cgroup_vmscan_stat_read,
+ .trigger = mem_cgroup_reset_vmscan_stat,
+ },
};
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
@@ -4872,6 +5031,7 @@ mem_cgroup_create(struct cgroup_subsys *
atomic_set(&mem->refcnt, 1);
mem->move_charge_at_immigrate = 0;
mutex_init(&mem->thresholds_lock);
+ spin_lock_init(&mem->scanstat.lock);
return &mem->css;
free_out:
__mem_cgroup_free(mem);
Index: mmotm-0710/mm/vmscan.c
===================================================================
--- mmotm-0710.orig/mm/vmscan.c
+++ mmotm-0710/mm/vmscan.c
@@ -105,6 +105,7 @@ struct scan_control {
/* Which cgroup do we reclaim from */
struct mem_cgroup *mem_cgroup;
+ struct memcg_scanrecord *memcg_record;
/*
* Nodemask of nodes allowed by the caller. If NULL, all nodes
@@ -1307,6 +1308,8 @@ putback_lru_pages(struct zone *zone, str
int file = is_file_lru(lru);
int numpages = hpage_nr_pages(page);
reclaim_stat->recent_rotated[file] += numpages;
+ if (!scanning_global_lru(sc))
+ sc->memcg_record->nr_rotated[file] += numpages;
}
if (!pagevec_add(&pvec, page)) {
spin_unlock_irq(&zone->lru_lock);
@@ -1350,6 +1353,10 @@ static noinline_for_stack void update_is
reclaim_stat->recent_scanned[0] += *nr_anon;
reclaim_stat->recent_scanned[1] += *nr_file;
+ if (!scanning_global_lru(sc)) {
+ sc->memcg_record->nr_scanned[0] += *nr_anon;
+ sc->memcg_record->nr_scanned[1] += *nr_file;
+ }
}
/*
@@ -1457,6 +1464,9 @@ shrink_inactive_list(unsigned long nr_to
nr_reclaimed = shrink_page_list(&page_list, zone, sc);
+ if (!scanning_global_lru(sc))
+ sc->memcg_record->nr_freed[file] += nr_reclaimed;
+
/* Check if we should syncronously wait for writeback */
if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
set_reclaim_mode(priority, sc, true);
@@ -1562,6 +1572,8 @@ static void shrink_active_list(unsigned
}
reclaim_stat->recent_scanned[file] += nr_taken;
+ if (!scanning_global_lru(sc))
+ sc->memcg_record->nr_scanned[file] += nr_taken;
__count_zone_vm_events(PGREFILL, zone, pgscanned);
if (file)
@@ -1613,6 +1625,8 @@ static void shrink_active_list(unsigned
* get_scan_ratio.
*/
reclaim_stat->recent_rotated[file] += nr_rotated;
+ if (!scanning_global_lru(sc))
+ sc->memcg_record->nr_rotated[file] += nr_rotated;
move_active_pages_to_lru(zone, &l_active,
LRU_ACTIVE + file * LRU_FILE);
@@ -2207,9 +2221,10 @@ unsigned long try_to_free_pages(struct z
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
- gfp_t gfp_mask, bool noswap,
- struct zone *zone,
- unsigned long *nr_scanned)
+ gfp_t gfp_mask, bool noswap,
+ struct zone *zone,
+ struct memcg_scanrecord *rec,
+ unsigned long *scanned)
{
struct scan_control sc = {
.nr_scanned = 0,
@@ -2219,7 +2234,9 @@ unsigned long mem_cgroup_shrink_node_zon
.may_swap = !noswap,
.order = 0,
.mem_cgroup = mem,
+ .memcg_record = rec,
};
+ unsigned long start, end;
sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
@@ -2228,6 +2245,7 @@ unsigned long mem_cgroup_shrink_node_zon
sc.may_writepage,
sc.gfp_mask);
+ start = sched_clock();
/*
* NOTE: Although we can get the priority field, using it
* here is not a good idea, since it limits the pages we can scan.
@@ -2236,19 +2254,25 @@ unsigned long mem_cgroup_shrink_node_zon
* the priority and make it zero.
*/
shrink_zone(0, zone, &sc);
+ end = sched_clock();
+
+ if (rec)
+ rec->elapsed += end - start;
+ *scanned = sc.nr_scanned;
trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
- *nr_scanned = sc.nr_scanned;
return sc.nr_reclaimed;
}
unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
gfp_t gfp_mask,
- bool noswap)
+ bool noswap,
+ struct memcg_scanrecord *rec)
{
struct zonelist *zonelist;
unsigned long nr_reclaimed;
+ unsigned long start, end;
int nid;
struct scan_control sc = {
.may_writepage = !laptop_mode,
@@ -2257,6 +2281,7 @@ unsigned long try_to_free_mem_cgroup_pag
.nr_to_reclaim = SWAP_CLUSTER_MAX,
.order = 0,
.mem_cgroup = mem_cont,
+ .memcg_record = rec,
.nodemask = NULL, /* we don't care the placement */
.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
@@ -2265,6 +2290,7 @@ unsigned long try_to_free_mem_cgroup_pag
.gfp_mask = sc.gfp_mask,
};
+ start = sched_clock();
/*
* Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't
* take care of from where we get pages. So the node where we start the
@@ -2279,6 +2305,9 @@ unsigned long try_to_free_mem_cgroup_pag
sc.gfp_mask);
nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink);
+ end = sched_clock();
+ if (rec)
+ rec->elapsed += end - start;
trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
--
* Re: [PATCH v2] memcg: add vmscan_stat
2011-07-11 10:30 [PATCH v2] memcg: add vmscan_stat KAMEZAWA Hiroyuki
@ 2011-07-12 23:02 ` Andrew Bresticker
2011-07-14 0:02 ` KAMEZAWA Hiroyuki
2011-07-18 21:00 ` Andrew Bresticker
1 sibling, 1 reply; 9+ messages in thread
From: Andrew Bresticker @ 2011-07-12 23:02 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, akpm, nishimura, bsingharora, Michal Hocko, Ying Han
On Mon, Jul 11, 2011 at 3:30 AM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:
> [snip: full quoted patch trimmed; see the original posting above]
>
> +struct scanstat {
> + spinlock_t lock;
> + unsigned long stats[NR_SCAN_CONTEXT][NR_SCANSTATS];
> + unsigned long rootstats[NR_SCAN_CONTEXT][NR_SCANSTATS];
> +};
>
I'm working on a similar effort with Ying here at Google, and so far we've
been using per-cpu counters for these statistics instead of spin-lock-protected
counters. Clearly the spin-lock-protected counters have less memory overhead
and make reading the stat file faster, but our concern is that this method is
inconsistent with the other memory stat files such as /proc/vmstat and
/dev/cgroup/memory/.../memory.stat. Is there any particular reason you chose
spin-lock-protected counters instead of per-cpu counters?
I've also modified your patch to use per-cpu counters instead of spin-lock
protected counters. I tested it by doing streaming I/O from a ramdisk:
$ mke2fs /dev/ram1
$ mkdir /tmp/swapram
$ mkdir /tmp/swapram/ram1
$ mount -t ext2 /dev/ram1 /tmp/swapram/ram1
$ dd if=/dev/urandom of=/tmp/swapram/ram1/file_16m bs=4096 count=4096
$ mkdir /dev/cgroup/memory/1
$ echo 8m > /dev/cgroup/memory/1/memory.limit_in_bytes
$ ./ramdisk_load.sh 7
$ echo $$ > /dev/cgroup/memory/1/tasks
$ time for ((z=0; z<=2000; z++)); do cat /tmp/swapram/ram1/file_16m > /dev/zero; done
Where ramdisk_load.sh is:
for ((i=0; i<=$1; i++))
do
echo $$ >/dev/cgroup/memory/1/tasks
for ((z=0; z<=2000; z++)); do cat /tmp/swapram/ram1/file_16m > /dev/zero; done &
done
Surprisingly, the per-cpu counters performed worse than the spin-lock-protected
counters: over 10 runs of the test above, the per-cpu counters were 1.60%
slower in both real time and sys time. I'm wondering if you have any insight
as to why this is. I can provide my diff against your patch if necessary.
Thanks,
Andrew
> +
> +const char *scanstat_string[NR_SCANSTATS] = {
> + "scanned_pages",
> + "scanned_anon_pages",
> + "scanned_file_pages",
> + "rotated_pages",
> + "rotated_anon_pages",
> + "rotated_file_pages",
> + "freed_pages",
> + "freed_anon_pages",
> + "freed_file_pages",
> + "elapsed_ns",
> +};
> +#define SCANSTAT_WORD_LIMIT "_by_limit"
> +#define SCANSTAT_WORD_SYSTEM "_by_system"
> +#define SCANSTAT_WORD_HIERARCHY "_under_hierarchy"
> +
> +
> /*
> * The memory controller data structure. The memory controller controls
> both
> * page cache and RSS per cgroup. We would eventually like to provide
> @@ -266,7 +310,8 @@ struct mem_cgroup {
>
> /* For oom notifier event fd */
> struct list_head oom_notify;
> -
> + /* For recording LRU-scan statistics */
> + struct scanstat scanstat;
> /*
> * Should we move charges of a task when a task is moved into this
> * mem_cgroup ? And what type of charges should we move ?
> @@ -1619,6 +1664,44 @@ bool mem_cgroup_reclaimable(struct mem_c
> }
> #endif
>
> +static void __mem_cgroup_record_scanstat(unsigned long *stats,
> + struct memcg_scanrecord *rec)
> +{
> +
> + stats[SCAN] += rec->nr_scanned[0] + rec->nr_scanned[1];
> + stats[SCAN_ANON] += rec->nr_scanned[0];
> + stats[SCAN_FILE] += rec->nr_scanned[1];
> +
> + stats[ROTATE] += rec->nr_rotated[0] + rec->nr_rotated[1];
> + stats[ROTATE_ANON] += rec->nr_rotated[0];
> + stats[ROTATE_FILE] += rec->nr_rotated[1];
> +
> + stats[FREED] += rec->nr_freed[0] + rec->nr_freed[1];
> + stats[FREED_ANON] += rec->nr_freed[0];
> + stats[FREED_FILE] += rec->nr_freed[1];
> +
> + stats[ELAPSED] += rec->elapsed;
> +}
> +
> +static void mem_cgroup_record_scanstat(struct memcg_scanrecord *rec)
> +{
> + struct mem_cgroup *mem;
> + int context = rec->context;
> +
> + if (context >= NR_SCAN_CONTEXT)
> + return;
> +
> + mem = rec->mem;
> + spin_lock(&mem->scanstat.lock);
> + __mem_cgroup_record_scanstat(mem->scanstat.stats[context], rec);
> + spin_unlock(&mem->scanstat.lock);
> +
> + mem = rec->root;
> + spin_lock(&mem->scanstat.lock);
> + __mem_cgroup_record_scanstat(mem->scanstat.rootstats[context],
> rec);
> + spin_unlock(&mem->scanstat.lock);
> +}
> +
> /*
> * Scan the hierarchy if needed to reclaim memory. We remember the last
> child
> * we reclaimed from, so that we don't end up penalizing one child
> extensively
> @@ -1643,8 +1726,9 @@ static int mem_cgroup_hierarchical_recla
> bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
> bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
> bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
> + struct memcg_scanrecord rec;
> unsigned long excess;
> - unsigned long nr_scanned;
> + unsigned long scanned;
>
> excess = res_counter_soft_limit_excess(&root_mem->res) >>
> PAGE_SHIFT;
>
> @@ -1652,6 +1736,15 @@ static int mem_cgroup_hierarchical_recla
> if (!check_soft && root_mem->memsw_is_minimum)
> noswap = true;
>
> + if (shrink)
> + rec.context = SCAN_BY_SHRINK;
> + else if (check_soft)
> + rec.context = SCAN_BY_SYSTEM;
> + else
> + rec.context = SCAN_BY_LIMIT;
> +
> + rec.root = root_mem;
> +
> while (1) {
> victim = mem_cgroup_select_victim(root_mem);
> if (victim == root_mem) {
> @@ -1692,14 +1785,23 @@ static int mem_cgroup_hierarchical_recla
> css_put(&victim->css);
> continue;
> }
> + rec.mem = victim;
> + rec.nr_scanned[0] = 0;
> + rec.nr_scanned[1] = 0;
> + rec.nr_rotated[0] = 0;
> + rec.nr_rotated[1] = 0;
> + rec.nr_freed[0] = 0;
> + rec.nr_freed[1] = 0;
> + rec.elapsed = 0;
> /* we use swappiness of local cgroup */
> if (check_soft) {
> ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
> - noswap, zone, &nr_scanned);
> - *total_scanned += nr_scanned;
> + noswap, zone, &rec, &scanned);
> + *total_scanned += scanned;
> } else
> ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> - noswap);
> + noswap, &rec);
> + mem_cgroup_record_scanstat(&rec);
> css_put(&victim->css);
> /*
> * At shrinking usage, we can't check we should stop here or
> @@ -3688,14 +3790,18 @@ try_to_free:
> /* try to free all pages in this cgroup */
> shrink = 1;
> while (nr_retries && mem->res.usage > 0) {
> + struct memcg_scanrecord rec;
> int progress;
>
> if (signal_pending(current)) {
> ret = -EINTR;
> goto out;
> }
> + rec.context = SCAN_BY_SHRINK;
> + rec.mem = mem;
> + rec.root = mem;
> progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
> - false);
> + false, &rec);
> if (!progress) {
> nr_retries--;
> /* maybe some writeback is necessary */
> @@ -4539,6 +4645,54 @@ static int mem_control_numa_stat_open(st
> }
> #endif /* CONFIG_NUMA */
>
> +static int mem_cgroup_vmscan_stat_read(struct cgroup *cgrp,
> + struct cftype *cft,
> + struct cgroup_map_cb *cb)
> +{
> + struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> + char string[64];
> + int i;
> +
> + for (i = 0; i < NR_SCANSTATS; i++) {
> + strcpy(string, scanstat_string[i]);
> + strcat(string, SCANSTAT_WORD_LIMIT);
> + cb->fill(cb, string,
> mem->scanstat.stats[SCAN_BY_LIMIT][i]);
> + }
> +
> + for (i = 0; i < NR_SCANSTATS; i++) {
> + strcpy(string, scanstat_string[i]);
> + strcat(string, SCANSTAT_WORD_SYSTEM);
> + cb->fill(cb, string,
> mem->scanstat.stats[SCAN_BY_SYSTEM][i]);
> + }
> +
> + for (i = 0; i < NR_SCANSTATS; i++) {
> + strcpy(string, scanstat_string[i]);
> + strcat(string, SCANSTAT_WORD_LIMIT);
> + strcat(string, SCANSTAT_WORD_HIERARCHY);
> + cb->fill(cb, string,
> mem->scanstat.rootstats[SCAN_BY_LIMIT][i]);
> + }
> + for (i = 0; i < NR_SCANSTATS; i++) {
> + strcpy(string, scanstat_string[i]);
> + strcat(string, SCANSTAT_WORD_SYSTEM);
> + strcat(string, SCANSTAT_WORD_HIERARCHY);
> + cb->fill(cb, string,
> mem->scanstat.rootstats[SCAN_BY_SYSTEM][i]);
> + }
> + return 0;
> +}
> +
> +static int mem_cgroup_reset_vmscan_stat(struct cgroup *cgrp,
> + unsigned int event)
> +{
> + struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> +
> + spin_lock(&mem->scanstat.lock);
> + memset(&mem->scanstat.stats, 0, sizeof(mem->scanstat.stats));
> + memset(&mem->scanstat.rootstats, 0,
> sizeof(mem->scanstat.rootstats));
> + spin_unlock(&mem->scanstat.lock);
> + return 0;
> +}
> +
> +
> static struct cftype mem_cgroup_files[] = {
> {
> .name = "usage_in_bytes",
> @@ -4609,6 +4763,11 @@ static struct cftype mem_cgroup_files[]
> .mode = S_IRUGO,
> },
> #endif
> + {
> + .name = "vmscan_stat",
> + .read_map = mem_cgroup_vmscan_stat_read,
> + .trigger = mem_cgroup_reset_vmscan_stat,
> + },
> };
>
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> @@ -4872,6 +5031,7 @@ mem_cgroup_create(struct cgroup_subsys *
> atomic_set(&mem->refcnt, 1);
> mem->move_charge_at_immigrate = 0;
> mutex_init(&mem->thresholds_lock);
> + spin_lock_init(&mem->scanstat.lock);
> return &mem->css;
> free_out:
> __mem_cgroup_free(mem);
> Index: mmotm-0710/mm/vmscan.c
> ===================================================================
> --- mmotm-0710.orig/mm/vmscan.c
> +++ mmotm-0710/mm/vmscan.c
> @@ -105,6 +105,7 @@ struct scan_control {
>
> /* Which cgroup do we reclaim from */
> struct mem_cgroup *mem_cgroup;
> + struct memcg_scanrecord *memcg_record;
>
> /*
> * Nodemask of nodes allowed by the caller. If NULL, all nodes
> @@ -1307,6 +1308,8 @@ putback_lru_pages(struct zone *zone, str
> int file = is_file_lru(lru);
> int numpages = hpage_nr_pages(page);
> reclaim_stat->recent_rotated[file] += numpages;
> + if (!scanning_global_lru(sc))
> + sc->memcg_record->nr_rotated[file] +=
> numpages;
> }
> if (!pagevec_add(&pvec, page)) {
> spin_unlock_irq(&zone->lru_lock);
> @@ -1350,6 +1353,10 @@ static noinline_for_stack void update_is
>
> reclaim_stat->recent_scanned[0] += *nr_anon;
> reclaim_stat->recent_scanned[1] += *nr_file;
> + if (!scanning_global_lru(sc)) {
> + sc->memcg_record->nr_scanned[0] += *nr_anon;
> + sc->memcg_record->nr_scanned[1] += *nr_file;
> + }
> }
>
> /*
> @@ -1457,6 +1464,9 @@ shrink_inactive_list(unsigned long nr_to
>
> nr_reclaimed = shrink_page_list(&page_list, zone, sc);
>
> + if (!scanning_global_lru(sc))
> + sc->memcg_record->nr_freed[file] += nr_reclaimed;
> +
> /* Check if we should syncronously wait for writeback */
> if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
> set_reclaim_mode(priority, sc, true);
> @@ -1562,6 +1572,8 @@ static void shrink_active_list(unsigned
> }
>
> reclaim_stat->recent_scanned[file] += nr_taken;
> + if (!scanning_global_lru(sc))
> + sc->memcg_record->nr_scanned[file] += nr_taken;
>
> __count_zone_vm_events(PGREFILL, zone, pgscanned);
> if (file)
> @@ -1613,6 +1625,8 @@ static void shrink_active_list(unsigned
> * get_scan_ratio.
> */
> reclaim_stat->recent_rotated[file] += nr_rotated;
> + if (!scanning_global_lru(sc))
> + sc->memcg_record->nr_rotated[file] += nr_rotated;
>
> move_active_pages_to_lru(zone, &l_active,
> LRU_ACTIVE + file *
> LRU_FILE);
> @@ -2207,9 +2221,10 @@ unsigned long try_to_free_pages(struct z
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>
> unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> - gfp_t gfp_mask, bool
> noswap,
> - struct zone *zone,
> - unsigned long *nr_scanned)
> + gfp_t gfp_mask, bool noswap,
> + struct zone *zone,
> + struct memcg_scanrecord *rec,
> + unsigned long *scanned)
> {
> struct scan_control sc = {
> .nr_scanned = 0,
> @@ -2219,7 +2234,9 @@ unsigned long mem_cgroup_shrink_node_zon
> .may_swap = !noswap,
> .order = 0,
> .mem_cgroup = mem,
> + .memcg_record = rec,
> };
> + unsigned long start, end;
>
> sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
> @@ -2228,6 +2245,7 @@ unsigned long mem_cgroup_shrink_node_zon
> sc.may_writepage,
> sc.gfp_mask);
>
> + start = sched_clock();
> /*
> * NOTE: Although we can get the priority field, using it
> * here is not a good idea, since it limits the pages we can scan.
> @@ -2236,19 +2254,25 @@ unsigned long mem_cgroup_shrink_node_zon
> * the priority and make it zero.
> */
> shrink_zone(0, zone, &sc);
> + end = sched_clock();
> +
> + if (rec)
> + rec->elapsed += end - start;
> + *scanned = sc.nr_scanned;
>
> trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
>
> - *nr_scanned = sc.nr_scanned;
> return sc.nr_reclaimed;
> }
>
> unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> gfp_t gfp_mask,
> - bool noswap)
> + bool noswap,
> + struct memcg_scanrecord *rec)
> {
> struct zonelist *zonelist;
> unsigned long nr_reclaimed;
> + unsigned long start, end;
> int nid;
> struct scan_control sc = {
> .may_writepage = !laptop_mode,
> @@ -2257,6 +2281,7 @@ unsigned long try_to_free_mem_cgroup_pag
> .nr_to_reclaim = SWAP_CLUSTER_MAX,
> .order = 0,
> .mem_cgroup = mem_cont,
> + .memcg_record = rec,
> .nodemask = NULL, /* we don't care the placement */
> .gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
> @@ -2265,6 +2290,7 @@ unsigned long try_to_free_mem_cgroup_pag
> .gfp_mask = sc.gfp_mask,
> };
>
> + start = sched_clock();
> /*
> * Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't
> * take care of from where we get pages. So the node where we start
> the
> @@ -2279,6 +2305,9 @@ unsigned long try_to_free_mem_cgroup_pag
> sc.gfp_mask);
>
> nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink);
> + end = sched_clock();
> + if (rec)
> + rec->elapsed += end - start;
>
> trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign
> http://stopthemeter.ca/
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v2] memcg: add vmscan_stat
2011-07-12 23:02 ` Andrew Bresticker
@ 2011-07-14 0:02 ` KAMEZAWA Hiroyuki
2011-07-15 18:34 ` Andrew Bresticker
0 siblings, 1 reply; 9+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-14 0:02 UTC (permalink / raw)
To: Andrew Bresticker
Cc: linux-mm, akpm, nishimura, bsingharora, Michal Hocko, Ying Han
On Tue, 12 Jul 2011 16:02:02 -0700
Andrew Bresticker <abrestic@google.com> wrote:
> On Mon, Jul 11, 2011 at 3:30 AM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> >
> > This patch is onto mmotm-0710... got bigger than expected ;(
> > ==
> > [PATCH] add memory.vmscan_stat
> >
> > commit log of commit 0ae5e89 " memcg: count the soft_limit reclaim in..."
> > says it adds scanning stats to memory.stat file. But it doesn't because
> > we considered we needed to make a concensus for such new APIs.
> >
> > This patch is a trial to add memory.scan_stat. This shows
> > - the number of scanned pages(total, anon, file)
> > - the number of rotated pages(total, anon, file)
> > - the number of freed pages(total, anon, file)
> > - the number of elaplsed time (including sleep/pause time)
> >
> > for both of direct/soft reclaim.
> >
> > The biggest difference with oringinal Ying's one is that this file
> > can be reset by some write, as
> >
> > # echo 0 ...../memory.scan_stat
> >
> > Example of output is here. This is a result after make -j 6 kernel
> > under 300M limit.
> >
> > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
> > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
> > scanned_pages_by_limit 9471864
> > scanned_anon_pages_by_limit 6640629
> > scanned_file_pages_by_limit 2831235
> > rotated_pages_by_limit 4243974
> > rotated_anon_pages_by_limit 3971968
> > rotated_file_pages_by_limit 272006
> > freed_pages_by_limit 2318492
> > freed_anon_pages_by_limit 962052
> > freed_file_pages_by_limit 1356440
> > elapsed_ns_by_limit 351386416101
> > scanned_pages_by_system 0
> > scanned_anon_pages_by_system 0
> > scanned_file_pages_by_system 0
> > rotated_pages_by_system 0
> > rotated_anon_pages_by_system 0
> > rotated_file_pages_by_system 0
> > freed_pages_by_system 0
> > freed_anon_pages_by_system 0
> > freed_file_pages_by_system 0
> > elapsed_ns_by_system 0
> > scanned_pages_by_limit_under_hierarchy 9471864
> > scanned_anon_pages_by_limit_under_hierarchy 6640629
> > scanned_file_pages_by_limit_under_hierarchy 2831235
> > rotated_pages_by_limit_under_hierarchy 4243974
> > rotated_anon_pages_by_limit_under_hierarchy 3971968
> > rotated_file_pages_by_limit_under_hierarchy 272006
> > freed_pages_by_limit_under_hierarchy 2318492
> > freed_anon_pages_by_limit_under_hierarchy 962052
> > freed_file_pages_by_limit_under_hierarchy 1356440
> > elapsed_ns_by_limit_under_hierarchy 351386416101
> > scanned_pages_by_system_under_hierarchy 0
> > scanned_anon_pages_by_system_under_hierarchy 0
> > scanned_file_pages_by_system_under_hierarchy 0
> > rotated_pages_by_system_under_hierarchy 0
> > rotated_anon_pages_by_system_under_hierarchy 0
> > rotated_file_pages_by_system_under_hierarchy 0
> > freed_pages_by_system_under_hierarchy 0
> > freed_anon_pages_by_system_under_hierarchy 0
> > freed_file_pages_by_system_under_hierarchy 0
> > elapsed_ns_by_system_under_hierarchy 0
> >
> >
> > total_xxxx is for hierarchy management.
> >
> > This will be useful for further memcg developments and need to be
> > developped before we do some complicated rework on LRU/softlimit
> > management.
> >
> > This patch adds a new struct memcg_scanrecord into scan_control struct.
> > sc->nr_scanned at el is not designed for exporting information. For
> > example,
> > nr_scanned is reset frequentrly and incremented +2 at scanning mapped
> > pages.
> >
> > For avoiding complexity, I added a new param in scan_control which is for
> > exporting scanning score.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > Changelog:
> > - renamed as vmscan_stat
> > - handle file/anon
> > - added "rotated"
> > - changed names of param in vmscan_stat.
> > ---
> > Documentation/cgroups/memory.txt | 85 +++++++++++++++++++
> > include/linux/memcontrol.h | 19 ++++
> > include/linux/swap.h | 6 -
> > mm/memcontrol.c | 172
> > +++++++++++++++++++++++++++++++++++++--
> > mm/vmscan.c | 39 +++++++-
> > 5 files changed, 303 insertions(+), 18 deletions(-)
> >
> > Index: mmotm-0710/Documentation/cgroups/memory.txt
> > ===================================================================
> > --- mmotm-0710.orig/Documentation/cgroups/memory.txt
> > +++ mmotm-0710/Documentation/cgroups/memory.txt
> > @@ -380,7 +380,7 @@ will be charged as a new owner of it.
> >
> > 5.2 stat file
> >
> > -memory.stat file includes following statistics
> > +5.2.1 memory.stat file includes following statistics
> >
> > # per-memory cgroup local status
> > cache - # of bytes of page cache memory.
> > @@ -438,6 +438,89 @@ Note:
> > file_mapped is accounted only when the memory cgroup is owner of
> > page
> > cache.)
> >
> > +5.2.2 memory.vmscan_stat
> > +
> > +memory.vmscan_stat includes statistics information for memory scanning and
> > +freeing, reclaiming. The statistics shows memory scanning information
> > since
> > +memory cgroup creation and can be reset to 0 by writing 0 as
> > +
> > + #echo 0 > ../memory.vmscan_stat
> > +
> > +This file contains following statistics.
> > +
> > +[param]_[file_or_anon]_pages_by_[reason]_[under_heararchy]
> > +[param]_elapsed_ns_by_[reason]_[under_hierarchy]
> > +
> > +For example,
> > +
> > + scanned_file_pages_by_limit indicates the number of scanned
> > + file pages at vmscan.
> > +
> > +Now, 3 parameters are supported
> > +
> > + scanned - the number of pages scanned by vmscan
> > + rotated - the number of pages activated at vmscan
> > + freed - the number of pages freed by vmscan
> > +
> > +If "rotated" is high against scanned/freed, the memcg seems busy.
> > +
> > +Now, 2 reason are supported
> > +
> > + limit - the memory cgroup's limit
> > + system - global memory pressure + softlimit
> > + (global memory pressure not under softlimit is not handled now)
> > +
> > +When under_hierarchy is added in the tail, the number indicates the
> > +total memcg scan of its children and itself.
> > +
> > +elapsed_ns is a elapsed time in nanosecond. This may include sleep time
> > +and not indicates CPU usage. So, please take this as just showing
> > +latency.
> > +
> > +Here is an example.
> > +
> > +# cat /cgroup/memory/A/memory.vmscan_stat
> > +scanned_pages_by_limit 9471864
> > +scanned_anon_pages_by_limit 6640629
> > +scanned_file_pages_by_limit 2831235
> > +rotated_pages_by_limit 4243974
> > +rotated_anon_pages_by_limit 3971968
> > +rotated_file_pages_by_limit 272006
> > +freed_pages_by_limit 2318492
> > +freed_anon_pages_by_limit 962052
> > +freed_file_pages_by_limit 1356440
> > +elapsed_ns_by_limit 351386416101
> > +scanned_pages_by_system 0
> > +scanned_anon_pages_by_system 0
> > +scanned_file_pages_by_system 0
> > +rotated_pages_by_system 0
> > +rotated_anon_pages_by_system 0
> > +rotated_file_pages_by_system 0
> > +freed_pages_by_system 0
> > +freed_anon_pages_by_system 0
> > +freed_file_pages_by_system 0
> > +elapsed_ns_by_system 0
> > +scanned_pages_by_limit_under_hierarchy 9471864
> > +scanned_anon_pages_by_limit_under_hierarchy 6640629
> > +scanned_file_pages_by_limit_under_hierarchy 2831235
> > +rotated_pages_by_limit_under_hierarchy 4243974
> > +rotated_anon_pages_by_limit_under_hierarchy 3971968
> > +rotated_file_pages_by_limit_under_hierarchy 272006
> > +freed_pages_by_limit_under_hierarchy 2318492
> > +freed_anon_pages_by_limit_under_hierarchy 962052
> > +freed_file_pages_by_limit_under_hierarchy 1356440
> > +elapsed_ns_by_limit_under_hierarchy 351386416101
> > +scanned_pages_by_system_under_hierarchy 0
> > +scanned_anon_pages_by_system_under_hierarchy 0
> > +scanned_file_pages_by_system_under_hierarchy 0
> > +rotated_pages_by_system_under_hierarchy 0
> > +rotated_anon_pages_by_system_under_hierarchy 0
> > +rotated_file_pages_by_system_under_hierarchy 0
> > +freed_pages_by_system_under_hierarchy 0
> > +freed_anon_pages_by_system_under_hierarchy 0
> > +freed_file_pages_by_system_under_hierarchy 0
> > +elapsed_ns_by_system_under_hierarchy 0
> > +
> > 5.3 swappiness
> >
> > Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups
> > only.
> > Index: mmotm-0710/include/linux/memcontrol.h
> > ===================================================================
> > --- mmotm-0710.orig/include/linux/memcontrol.h
> > +++ mmotm-0710/include/linux/memcontrol.h
> > @@ -39,6 +39,16 @@ extern unsigned long mem_cgroup_isolate_
> > struct mem_cgroup *mem_cont,
> > int active, int file);
> >
> > +struct memcg_scanrecord {
> > + struct mem_cgroup *mem; /* scanend memory cgroup */
> > + struct mem_cgroup *root; /* scan target hierarchy root */
> > + int context; /* scanning context (see memcontrol.c) */
> > + unsigned long nr_scanned[2]; /* the number of scanned pages */
> > + unsigned long nr_rotated[2]; /* the number of rotated pages */
> > + unsigned long nr_freed[2]; /* the number of freed pages */
> > + unsigned long elapsed; /* nsec of time elapsed while scanning */
> > +};
> > +
> > #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > /*
> > * All "charge" functions with gfp_mask should use GFP_KERNEL or
> > @@ -117,6 +127,15 @@ mem_cgroup_get_reclaim_stat_from_page(st
> > extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> > struct task_struct *p);
> >
> > +extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> > + gfp_t gfp_mask, bool
> > noswap,
> > + struct memcg_scanrecord
> > *rec);
> > +extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> > + gfp_t gfp_mask, bool
> > noswap,
> > + struct zone *zone,
> > + struct memcg_scanrecord
> > *rec,
> > + unsigned long *nr_scanned);
> > +
> > #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > extern int do_swap_account;
> > #endif
> > Index: mmotm-0710/include/linux/swap.h
> > ===================================================================
> > --- mmotm-0710.orig/include/linux/swap.h
> > +++ mmotm-0710/include/linux/swap.h
> > @@ -253,12 +253,6 @@ static inline void lru_cache_add_file(st
> > /* linux/mm/vmscan.c */
> > extern unsigned long try_to_free_pages(struct zonelist *zonelist, int
> > order,
> > gfp_t gfp_mask, nodemask_t *mask);
> > -extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> > - gfp_t gfp_mask, bool
> > noswap);
> > -extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> > - gfp_t gfp_mask, bool
> > noswap,
> > - struct zone *zone,
> > - unsigned long *nr_scanned);
> > extern int __isolate_lru_page(struct page *page, int mode, int file);
> > extern unsigned long shrink_all_memory(unsigned long nr_pages);
> > extern int vm_swappiness;
> > Index: mmotm-0710/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-0710.orig/mm/memcontrol.c
> > +++ mmotm-0710/mm/memcontrol.c
> > @@ -204,6 +204,50 @@ struct mem_cgroup_eventfd_list {
> > static void mem_cgroup_threshold(struct mem_cgroup *mem);
> > static void mem_cgroup_oom_notify(struct mem_cgroup *mem);
> >
> > +enum {
> > + SCAN_BY_LIMIT,
> > + SCAN_BY_SYSTEM,
> > + NR_SCAN_CONTEXT,
> > + SCAN_BY_SHRINK, /* not recorded now */
> > +};
> > +
> > +enum {
> > + SCAN,
> > + SCAN_ANON,
> > + SCAN_FILE,
> > + ROTATE,
> > + ROTATE_ANON,
> > + ROTATE_FILE,
> > + FREED,
> > + FREED_ANON,
> > + FREED_FILE,
> > + ELAPSED,
> > + NR_SCANSTATS,
> > +};
> > +
> > +struct scanstat {
> > + spinlock_t lock;
> > + unsigned long stats[NR_SCAN_CONTEXT][NR_SCANSTATS];
> > + unsigned long rootstats[NR_SCAN_CONTEXT][NR_SCANSTATS];
> > +};
> >
>
> I'm working on a similar effort with Ying here at Google and so far we've
> been using per-cpu counters for these statistics instead of spin-lock
> protected counters. Clearly the spin-lock protected counters have less
> memory overhead and make reading the stat file faster, but our concern is
> that this method is inconsistent with the other memory stat files such
> /proc/vmstat and /dev/cgroup/memory/.../memory.stat. Is there any
> particular reason you chose to use spin-lock protected counters instead of
> per-cpu counters?
>
In my experience, if we "batch" enough, it always works better than a percpu
counter; a percpu counter is effective when batching is difficult. This
patch's implementation does enough batching, and it is much more
coarse-grained than a percpu counter's, so the spinlock-protected counters
come out ahead here.
> I've also modified your patch to use per-cpu counters instead of spin-lock
> protected counters. I tested it by doing streaming I/O from a ramdisk:
>
> $ mke2fs /dev/ram1
> $ mkdir /tmp/swapram
> $ mkdir /tmp/swapram/ram1
> $ mount -t ext2 /dev/ram1 /tmp/swapram/ram1
> $ dd if=/dev/urandom of=/tmp/swapram/ram1/file_16m bs=4096 count=4096
> $ mkdir /dev/cgroup/memory/1
> $ echo 8m > /dev/cgroup/memory/1
> $ ./ramdisk_load.sh 7
> $ echo $$ > /dev/cgroup/memory/1/tasks
> $ time for ((z=0; z<=2000; z++)); do cat /tmp/swapram/ram1/file_16m >
> /dev/zero; done
>
> Where ramdisk_load.sh is:
> for ((i=0; i<=$1; i++))
> do
> echo $$ >/dev/cgroup/memory/1/tasks
> for ((z=0; z<=2000; z++)); do cat /tmp/swapram/ram1/file_16m > /dev/zero;
> done &
> done
>
> Surprisingly, the per-cpu counters perform worse than the spin-lock
> protected counters. Over 10 runs of the test above, the per-cpu counters
> were 1.60% slower in both real time and sys time. I'm wondering if you have
> any insight as to why this is. I can provide my diff against your patch if
> necessary.
>
The percpu counter works effectively only when each update is +1/-1; it uses
a "batch" threshold to decide when to merge the per-cpu value into the shared
counter. I think you used the default "batch" value, but the
scan/rotate/free/elapsed deltas are almost always larger than that batch, so
every update still hits the shared counter, and you just added memory overhead
and an extra branch on top of plain spinlock-protected counters.
Determining the right "batch" threshold for a percpu counter is difficult.
Thanks,
-Kame
* Re: [PATCH v2] memcg: add vmscan_stat
2011-07-14 0:02 ` KAMEZAWA Hiroyuki
@ 2011-07-15 18:34 ` Andrew Bresticker
2011-07-15 20:28 ` Andrew Bresticker
2011-07-20 5:58 ` KAMEZAWA Hiroyuki
0 siblings, 2 replies; 9+ messages in thread
From: Andrew Bresticker @ 2011-07-15 18:34 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, akpm, nishimura, bsingharora, Michal Hocko, Ying Han
I've extended your patch to track write-back during page reclaim:
---
From: Andrew Bresticker <abrestic@google.com>
Date: Thu, 14 Jul 2011 17:56:48 -0700
Subject: [PATCH] vmscan: Track number of pages written back during page reclaim.
This tracks pages written out during page reclaim in memory.vmscan_stat
and breaks it down by file vs. anon and context (like "scanned_pages",
"rotated_pages", etc.).
Example output:
$ mkdir /dev/cgroup/memory/1
$ echo 8m > /dev/cgroup/memory/1/memory.limit_in_bytes
$ echo $$ > /dev/cgroup/memory/1/tasks
$ dd if=/dev/urandom of=file_20g bs=4096 count=524288
$ cat /dev/cgroup/memory/1/memory.vmscan_stat
...
written_pages_by_limit 36
written_anon_pages_by_limit 0
written_file_pages_by_limit 36
...
written_pages_by_limit_under_hierarchy 28
written_anon_pages_by_limit_under_hierarchy 0
written_file_pages_by_limit_under_hierarchy 28
Signed-off-by: Andrew Bresticker <abrestic@google.com>
---
include/linux/memcontrol.h | 1 +
mm/memcontrol.c | 12 ++++++++++++
mm/vmscan.c | 10 +++++++---
3 files changed, 20 insertions(+), 3 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4b49edf..4be907e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -46,6 +46,7 @@ struct memcg_scanrecord {
unsigned long nr_scanned[2]; /* the number of scanned pages */
unsigned long nr_rotated[2]; /* the number of rotated pages */
unsigned long nr_freed[2]; /* the number of freed pages */
+ unsigned long nr_written[2]; /* the number of pages written back */
unsigned long elapsed; /* nsec of time elapsed while scanning */
};
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9bb6e93..5ec2aa3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -221,6 +221,9 @@ enum {
FREED,
FREED_ANON,
FREED_FILE,
+ WRITTEN,
+ WRITTEN_ANON,
+ WRITTEN_FILE,
ELAPSED,
NR_SCANSTATS,
};
@@ -241,6 +244,9 @@ const char *scanstat_string[NR_SCANSTATS] = {
"freed_pages",
"freed_anon_pages",
"freed_file_pages",
+ "written_pages",
+ "written_anon_pages",
+ "written_file_pages",
"elapsed_ns",
};
#define SCANSTAT_WORD_LIMIT "_by_limit"
@@ -1682,6 +1688,10 @@ static void __mem_cgroup_record_scanstat(unsigned long *stats,
stats[FREED_ANON] += rec->nr_freed[0];
stats[FREED_FILE] += rec->nr_freed[1];
+ stats[WRITTEN] += rec->nr_written[0] + rec->nr_written[1];
+ stats[WRITTEN_ANON] += rec->nr_written[0];
+ stats[WRITTEN_FILE] += rec->nr_written[1];
+
stats[ELAPSED] += rec->elapsed;
}
@@ -1794,6 +1804,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
rec.nr_rotated[1] = 0;
rec.nr_freed[0] = 0;
rec.nr_freed[1] = 0;
+ rec.nr_written[0] = 0;
+ rec.nr_written[1] = 0;
rec.elapsed = 0;
/* we use swappiness of local cgroup */
if (check_soft) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8fb1abd..f73b96e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -719,7 +719,7 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
*/
static unsigned long shrink_page_list(struct list_head *page_list,
struct zone *zone,
- struct scan_control *sc)
+ struct scan_control *sc, int file)
{
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
@@ -727,6 +727,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
unsigned long nr_dirty = 0;
unsigned long nr_congested = 0;
unsigned long nr_reclaimed = 0;
+ unsigned long nr_written = 0;
cond_resched();
@@ -840,6 +841,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
case PAGE_ACTIVATE:
goto activate_locked;
case PAGE_SUCCESS:
+ nr_written++;
if (PageWriteback(page))
goto keep_lumpy;
if (PageDirty(page))
@@ -958,6 +960,8 @@ keep_lumpy:
free_page_list(&free_pages);
list_splice(&ret_pages, page_list);
+ if (!scanning_global_lru(sc))
+ sc->memcg_record->nr_written[file] += nr_written;
count_vm_events(PGACTIVATE, pgactivate);
return nr_reclaimed;
}
@@ -1463,7 +1467,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct
zone *zone,
spin_unlock_irq(&zone->lru_lock);
- nr_reclaimed = shrink_page_list(&page_list, zone, sc);
+ nr_reclaimed = shrink_page_list(&page_list, zone, sc, file);
if (!scanning_global_lru(sc))
sc->memcg_record->nr_freed[file] += nr_reclaimed;
@@ -1471,7 +1475,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct
zone *zone,
/* Check if we should syncronously wait for writeback */
if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
set_reclaim_mode(priority, sc, true);
- nr_reclaimed += shrink_page_list(&page_list, zone, sc);
+ nr_reclaimed += shrink_page_list(&page_list, zone, sc, file);
}
local_irq_disable();
--
1.7.3.1
Thanks,
Andrew
On Wed, Jul 13, 2011 at 5:02 PM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 12 Jul 2011 16:02:02 -0700
> Andrew Bresticker <abrestic@google.com> wrote:
>
> > On Mon, Jul 11, 2011 at 3:30 AM, KAMEZAWA Hiroyuki <
> > kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > >
> > > This patch is onto mmotm-0710... got bigger than expected ;(
> > > ==
> > > [PATCH] add memory.vmscan_stat
> > >
> > > The commit log of commit 0ae5e89 "memcg: count the soft_limit reclaim
> > > in..." says it adds scanning stats to the memory.stat file. But it
> > > doesn't, because we considered we needed to reach a consensus for such
> > > new APIs.
> > >
> > > This patch is a trial to add memory.scan_stat. This shows
> > > - the number of scanned pages(total, anon, file)
> > > - the number of rotated pages(total, anon, file)
> > > - the number of freed pages(total, anon, file)
> > > - the elapsed time (including sleep/pause time)
> > >
> > > for both of direct/soft reclaim.
> > >
> > > The biggest difference from Ying's original one is that this file
> > > can be reset by a write, as
> > >
> > > # echo 0 ...../memory.scan_stat
> > >
> > > Example of output is here. This is a result after make -j 6 kernel
> > > under 300M limit.
> > >
> > > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
> > > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
> > > scanned_pages_by_limit 9471864
> > > scanned_anon_pages_by_limit 6640629
> > > scanned_file_pages_by_limit 2831235
> > > rotated_pages_by_limit 4243974
> > > rotated_anon_pages_by_limit 3971968
> > > rotated_file_pages_by_limit 272006
> > > freed_pages_by_limit 2318492
> > > freed_anon_pages_by_limit 962052
> > > freed_file_pages_by_limit 1356440
> > > elapsed_ns_by_limit 351386416101
> > > scanned_pages_by_system 0
> > > scanned_anon_pages_by_system 0
> > > scanned_file_pages_by_system 0
> > > rotated_pages_by_system 0
> > > rotated_anon_pages_by_system 0
> > > rotated_file_pages_by_system 0
> > > freed_pages_by_system 0
> > > freed_anon_pages_by_system 0
> > > freed_file_pages_by_system 0
> > > elapsed_ns_by_system 0
> > > scanned_pages_by_limit_under_hierarchy 9471864
> > > scanned_anon_pages_by_limit_under_hierarchy 6640629
> > > scanned_file_pages_by_limit_under_hierarchy 2831235
> > > rotated_pages_by_limit_under_hierarchy 4243974
> > > rotated_anon_pages_by_limit_under_hierarchy 3971968
> > > rotated_file_pages_by_limit_under_hierarchy 272006
> > > freed_pages_by_limit_under_hierarchy 2318492
> > > freed_anon_pages_by_limit_under_hierarchy 962052
> > > freed_file_pages_by_limit_under_hierarchy 1356440
> > > elapsed_ns_by_limit_under_hierarchy 351386416101
> > > scanned_pages_by_system_under_hierarchy 0
> > > scanned_anon_pages_by_system_under_hierarchy 0
> > > scanned_file_pages_by_system_under_hierarchy 0
> > > rotated_pages_by_system_under_hierarchy 0
> > > rotated_anon_pages_by_system_under_hierarchy 0
> > > rotated_file_pages_by_system_under_hierarchy 0
> > > freed_pages_by_system_under_hierarchy 0
> > > freed_anon_pages_by_system_under_hierarchy 0
> > > freed_file_pages_by_system_under_hierarchy 0
> > > elapsed_ns_by_system_under_hierarchy 0
> > >
> > >
> > > total_xxxx is for hierarchy management.
> > >
> > > This will be useful for further memcg development and needs to be
> > > developed before we do some complicated rework on LRU/softlimit
> > > management.
> > >
> > > This patch adds a new struct memcg_scanrecord into the scan_control
> > > struct. sc->nr_scanned et al. are not designed for exporting
> > > information. For example, nr_scanned is reset frequently and
> > > incremented by +2 when scanning mapped pages.
> > >
> > > To avoid complexity, I added a new param in scan_control which is for
> > > exporting scanning scores.
> > >
> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > >
> > > Changelog:
> > > - renamed as vmscan_stat
> > > - handle file/anon
> > > - added "rotated"
> > > - changed names of param in vmscan_stat.
> > > ---
> > > Documentation/cgroups/memory.txt | 85 +++++++++++++++++++
> > > include/linux/memcontrol.h | 19 ++++
> > > include/linux/swap.h | 6 -
> > > mm/memcontrol.c | 172
> > > +++++++++++++++++++++++++++++++++++++--
> > > mm/vmscan.c | 39 +++++++-
> > > 5 files changed, 303 insertions(+), 18 deletions(-)
> > >
> > > Index: mmotm-0710/Documentation/cgroups/memory.txt
> > > ===================================================================
> > > --- mmotm-0710.orig/Documentation/cgroups/memory.txt
> > > +++ mmotm-0710/Documentation/cgroups/memory.txt
> > > @@ -380,7 +380,7 @@ will be charged as a new owner of it.
> > >
> > > 5.2 stat file
> > >
> > > -memory.stat file includes following statistics
> > > +5.2.1 memory.stat file includes following statistics
> > >
> > > # per-memory cgroup local status
> > > cache - # of bytes of page cache memory.
> > > @@ -438,6 +438,89 @@ Note:
> > > file_mapped is accounted only when the memory cgroup is owner
> of
> > > page
> > > cache.)
> > >
> > > +5.2.2 memory.vmscan_stat
> > > +
> > > +memory.vmscan_stat includes statistics for memory scanning, freeing,
> > > +and reclaiming. The statistics show memory scanning information since
> > > +memory cgroup creation and can be reset to 0 by writing 0 as
> > > +
> > > + #echo 0 > ../memory.vmscan_stat
> > > +
> > > +This file contains following statistics.
> > > +
> > > +[param]_[file_or_anon]_pages_by_[reason]_[under_hierarchy]
> > > +[param]_elapsed_ns_by_[reason]_[under_hierarchy]
> > > +
> > > +For example,
> > > +
> > > + scanned_file_pages_by_limit indicates the number of scanned
> > > + file pages at vmscan.
> > > +
> > > +Now, 3 parameters are supported
> > > +
> > > + scanned - the number of pages scanned by vmscan
> > > + rotated - the number of pages activated at vmscan
> > > + freed - the number of pages freed by vmscan
> > > +
> > > +If "rotated" is high relative to scanned/freed, the memcg seems busy.
> > > +
> > > +Now, 2 reasons are supported
> > > +
> > > + limit - the memory cgroup's limit
> > > + system - global memory pressure + softlimit
> > > + (global memory pressure not under softlimit is not handled
> now)
> > > +
> > > +When under_hierarchy is appended at the tail, the number indicates the
> > > +total memcg scans of its children and itself.
> > > +
> > > +elapsed_ns is the elapsed time in nanoseconds. This may include sleep
> > > +time and does not indicate CPU usage. So, please take this as just
> > > +showing latency.
> > > +
> > > +Here is an example.
> > > +
> > > +# cat /cgroup/memory/A/memory.vmscan_stat
> > > +scanned_pages_by_limit 9471864
> > > +scanned_anon_pages_by_limit 6640629
> > > +scanned_file_pages_by_limit 2831235
> > > +rotated_pages_by_limit 4243974
> > > +rotated_anon_pages_by_limit 3971968
> > > +rotated_file_pages_by_limit 272006
> > > +freed_pages_by_limit 2318492
> > > +freed_anon_pages_by_limit 962052
> > > +freed_file_pages_by_limit 1356440
> > > +elapsed_ns_by_limit 351386416101
> > > +scanned_pages_by_system 0
> > > +scanned_anon_pages_by_system 0
> > > +scanned_file_pages_by_system 0
> > > +rotated_pages_by_system 0
> > > +rotated_anon_pages_by_system 0
> > > +rotated_file_pages_by_system 0
> > > +freed_pages_by_system 0
> > > +freed_anon_pages_by_system 0
> > > +freed_file_pages_by_system 0
> > > +elapsed_ns_by_system 0
> > > +scanned_pages_by_limit_under_hierarchy 9471864
> > > +scanned_anon_pages_by_limit_under_hierarchy 6640629
> > > +scanned_file_pages_by_limit_under_hierarchy 2831235
> > > +rotated_pages_by_limit_under_hierarchy 4243974
> > > +rotated_anon_pages_by_limit_under_hierarchy 3971968
> > > +rotated_file_pages_by_limit_under_hierarchy 272006
> > > +freed_pages_by_limit_under_hierarchy 2318492
> > > +freed_anon_pages_by_limit_under_hierarchy 962052
> > > +freed_file_pages_by_limit_under_hierarchy 1356440
> > > +elapsed_ns_by_limit_under_hierarchy 351386416101
> > > +scanned_pages_by_system_under_hierarchy 0
> > > +scanned_anon_pages_by_system_under_hierarchy 0
> > > +scanned_file_pages_by_system_under_hierarchy 0
> > > +rotated_pages_by_system_under_hierarchy 0
> > > +rotated_anon_pages_by_system_under_hierarchy 0
> > > +rotated_file_pages_by_system_under_hierarchy 0
> > > +freed_pages_by_system_under_hierarchy 0
> > > +freed_anon_pages_by_system_under_hierarchy 0
> > > +freed_file_pages_by_system_under_hierarchy 0
> > > +elapsed_ns_by_system_under_hierarchy 0
> > > +
> > > 5.3 swappiness
> > >
> > > Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of
> groups
> > > only.
> > > Index: mmotm-0710/include/linux/memcontrol.h
> > > ===================================================================
> > > --- mmotm-0710.orig/include/linux/memcontrol.h
> > > +++ mmotm-0710/include/linux/memcontrol.h
> > > @@ -39,6 +39,16 @@ extern unsigned long mem_cgroup_isolate_
> > > struct mem_cgroup *mem_cont,
> > > int active, int file);
> > >
> > > +struct memcg_scanrecord {
> > > + struct mem_cgroup *mem; /* scanned memory cgroup */
> > > + struct mem_cgroup *root; /* scan target hierarchy root */
> > > + int context; /* scanning context (see memcontrol.c)
> */
> > > + unsigned long nr_scanned[2]; /* the number of scanned pages */
> > > + unsigned long nr_rotated[2]; /* the number of rotated pages */
> > > + unsigned long nr_freed[2]; /* the number of freed pages */
> > > + unsigned long elapsed; /* nsec of time elapsed while scanning
> */
> > > +};
> > > +
> > > #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > > /*
> > > * All "charge" functions with gfp_mask should use GFP_KERNEL or
> > > @@ -117,6 +127,15 @@ mem_cgroup_get_reclaim_stat_from_page(st
> > > extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> > > struct task_struct *p);
> > >
> > > +extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup
> *mem,
> > > + gfp_t gfp_mask, bool
> > > noswap,
> > > + struct
> memcg_scanrecord
> > > *rec);
> > > +extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup
> *mem,
> > > + gfp_t gfp_mask, bool
> > > noswap,
> > > + struct zone *zone,
> > > + struct memcg_scanrecord
> > > *rec,
> > > + unsigned long
> *nr_scanned);
> > > +
> > > #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > > extern int do_swap_account;
> > > #endif
> > > Index: mmotm-0710/include/linux/swap.h
> > > ===================================================================
> > > --- mmotm-0710.orig/include/linux/swap.h
> > > +++ mmotm-0710/include/linux/swap.h
> > > @@ -253,12 +253,6 @@ static inline void lru_cache_add_file(st
> > > /* linux/mm/vmscan.c */
> > > extern unsigned long try_to_free_pages(struct zonelist *zonelist, int
> > > order,
> > > gfp_t gfp_mask, nodemask_t
> *mask);
> > > -extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup
> *mem,
> > > - gfp_t gfp_mask, bool
> > > noswap);
> > > -extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup
> *mem,
> > > - gfp_t gfp_mask, bool
> > > noswap,
> > > - struct zone *zone,
> > > - unsigned long
> *nr_scanned);
> > > extern int __isolate_lru_page(struct page *page, int mode, int file);
> > > extern unsigned long shrink_all_memory(unsigned long nr_pages);
> > > extern int vm_swappiness;
> > > Index: mmotm-0710/mm/memcontrol.c
> > > ===================================================================
> > > --- mmotm-0710.orig/mm/memcontrol.c
> > > +++ mmotm-0710/mm/memcontrol.c
> > > @@ -204,6 +204,50 @@ struct mem_cgroup_eventfd_list {
> > > static void mem_cgroup_threshold(struct mem_cgroup *mem);
> > > static void mem_cgroup_oom_notify(struct mem_cgroup *mem);
> > >
> > > +enum {
> > > + SCAN_BY_LIMIT,
> > > + SCAN_BY_SYSTEM,
> > > + NR_SCAN_CONTEXT,
> > > + SCAN_BY_SHRINK, /* not recorded now */
> > > +};
> > > +
> > > +enum {
> > > + SCAN,
> > > + SCAN_ANON,
> > > + SCAN_FILE,
> > > + ROTATE,
> > > + ROTATE_ANON,
> > > + ROTATE_FILE,
> > > + FREED,
> > > + FREED_ANON,
> > > + FREED_FILE,
> > > + ELAPSED,
> > > + NR_SCANSTATS,
> > > +};
> > > +
> > > +struct scanstat {
> > > + spinlock_t lock;
> > > + unsigned long stats[NR_SCAN_CONTEXT][NR_SCANSTATS];
> > > + unsigned long rootstats[NR_SCAN_CONTEXT][NR_SCANSTATS];
> > > +};
> > >
> >
> > I'm working on a similar effort with Ying here at Google and so far we've
> > been using per-cpu counters for these statistics instead of spin-lock
> > protected counters. Clearly the spin-lock protected counters have less
> > memory overhead and make reading the stat file faster, but our concern is
> > that this method is inconsistent with the other memory stat files such
> > /proc/vmstat and /dev/cgroup/memory/.../memory.stat. Is there any
> > particular reason you chose to use spin-lock protected counters instead
> of
> > per-cpu counters?
> >
>
> In my experience, if we batch enough, it always works better than a
> percpu counter. A percpu counter is effective when batching is difficult.
> This patch's implementation does enough batching and is much more
> coarse-grained than a percpu counter. So, this patch is better than
> percpu.
>
>
> > I've also modified your patch to use per-cpu counters instead of
> spin-lock
> > protected counters. I tested it by doing streaming I/O from a ramdisk:
> >
> > $ mke2fs /dev/ram1
> > $ mkdir /tmp/swapram
> > $ mkdir /tmp/swapram/ram1
> > $ mount -t ext2 /dev/ram1 /tmp/swapram/ram1
> > $ dd if=/dev/urandom of=/tmp/swapram/ram1/file_16m bs=4096 count=4096
> > $ mkdir /dev/cgroup/memory/1
> > $ echo 8m > /dev/cgroup/memory/1/memory.limit_in_bytes
> > $ ./ramdisk_load.sh 7
> > $ echo $$ > /dev/cgroup/memory/1/tasks
> > $ time for ((z=0; z<=2000; z++)); do cat /tmp/swapram/ram1/file_16m >
> > /dev/zero; done
> >
> > Where ramdisk_load.sh is:
> > for ((i=0; i<=$1; i++))
> > do
> > echo $$ >/dev/cgroup/memory/1/tasks
> > for ((z=0; z<=2000; z++)); do cat /tmp/swapram/ram1/file_16m >
> /dev/zero;
> > done &
> > done
> >
> > Surprisingly, the per-cpu counters perform worse than the spin-lock
> > protected counters. Over 10 runs of the test above, the per-cpu counters
> > were 1.60% slower in both real time and sys time. I'm wondering if you
> have
> > any insight as to why this is. I can provide my diff against your patch
> if
> > necessary.
> >
>
> The percpu counter works effectively only when we use +1/-1 at each
> change of the counters. It uses "batch" to merge the per-cpu value into
> the counter. I think you use the default "batch" value, but the
> scan/rotate/free/elapsed values are always larger than "batch", so you
> just added memory overhead and an "if" to pure spinlock counters.
>
> Determining this "batch" threshold for a percpu counter is difficult.
>
> Thanks,
> -Kame
>
>
>
[-- Attachment #2: Type: text/html, Size: 29647 bytes --]
* Re: [PATCH v2] memcg: add vmscan_stat
2011-07-15 18:34 ` Andrew Bresticker
@ 2011-07-15 20:28 ` Andrew Bresticker
2011-07-20 6:00 ` KAMEZAWA Hiroyuki
2011-07-20 5:58 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 9+ messages in thread
From: Andrew Bresticker @ 2011-07-15 20:28 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, akpm, nishimura, bsingharora, Michal Hocko, Ying Han
[-- Attachment #1: Type: text/plain, Size: 24929 bytes --]
And this one tracks the number of pages unmapped:
--
From: Andrew Bresticker <abrestic@google.com>
Date: Fri, 15 Jul 2011 11:46:40 -0700
Subject: [PATCH] vmscan: Track pages unmapped during page reclaim.
Record the number of pages unmapped during page reclaim in
memory.vmscan_stat. Counters are broken down by type and
context like the other stats in memory.vmscan_stat.
Sample output:
$ mkdir /dev/cgroup/memory/1
$ echo 512m > /dev/cgroup/memory/1/memory.limit_in_bytes
$ echo $$ > /dev/cgroup/memory/1/tasks
$ pft -m 512m
$ cat /dev/cgroup/memory/1/memory.vmscan_stat
...
unmapped_pages_by_limit 67
unmapped_anon_pages_by_limit 0
unmapped_file_pages_by_limit 67
...
unmapped_pages_by_limit_under_hierarchy 67
unmapped_anon_pages_by_limit_under_hierarchy 0
unmapped_file_pages_by_limit_under_hierarchy 67
Signed-off-by: Andrew Bresticker <abrestic@google.com>
---
include/linux/memcontrol.h | 1 +
mm/memcontrol.c | 12 ++++++++++++
mm/vmscan.c | 8 ++++++--
3 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4be907e..8d65b55 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -47,6 +47,7 @@ struct memcg_scanrecord {
unsigned long nr_rotated[2]; /* the number of rotated pages */
unsigned long nr_freed[2]; /* the number of freed pages */
unsigned long nr_written[2]; /* the number of pages written back */
+ unsigned long nr_unmapped[2]; /* the number of pages unmapped */
unsigned long elapsed; /* nsec of time elapsed while scanning */
};
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5ec2aa3..6b4fbbd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -224,6 +224,9 @@ enum {
WRITTEN,
WRITTEN_ANON,
WRITTEN_FILE,
+ UNMAPPED,
+ UNMAPPED_ANON,
+ UNMAPPED_FILE,
ELAPSED,
NR_SCANSTATS,
};
@@ -247,6 +250,9 @@ const char *scanstat_string[NR_SCANSTATS] = {
"written_pages",
"written_anon_pages",
"written_file_pages",
+ "unmapped_pages",
+ "unmapped_anon_pages",
+ "unmapped_file_pages",
"elapsed_ns",
};
#define SCANSTAT_WORD_LIMIT "_by_limit"
@@ -1692,6 +1698,10 @@ static void __mem_cgroup_record_scanstat(unsigned
long *stats,
stats[WRITTEN_ANON] += rec->nr_written[0];
stats[WRITTEN_FILE] += rec->nr_written[1];
+ stats[UNMAPPED] += rec->nr_unmapped[0] + rec->nr_unmapped[1];
+ stats[UNMAPPED_ANON] += rec->nr_unmapped[0];
+ stats[UNMAPPED_FILE] += rec->nr_unmapped[1];
+
stats[ELAPSED] += rec->elapsed;
}
@@ -1806,6 +1816,8 @@ static int mem_cgroup_hierarchical_reclaim(struct
mem_cgroup *root_mem,
rec.nr_freed[1] = 0;
rec.nr_written[0] = 0;
rec.nr_written[1] = 0;
+ rec.nr_unmapped[0] = 0;
+ rec.nr_unmapped[1] = 0;
rec.elapsed = 0;
/* we use swappiness of local cgroup */
if (check_soft) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f73b96e..2d2bc99 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -728,6 +728,7 @@ static unsigned long shrink_page_list(struct list_head
*page_list,
unsigned long nr_congested = 0;
unsigned long nr_reclaimed = 0;
unsigned long nr_written = 0;
+ unsigned long nr_unmapped = 0;
cond_resched();
@@ -819,7 +820,8 @@ static unsigned long shrink_page_list(struct list_head
*page_list,
case SWAP_MLOCK:
goto cull_mlocked;
case SWAP_SUCCESS:
- ; /* try to free the page below */
+ /* try to free the page below */
+ nr_unmapped++;
}
}
@@ -960,8 +962,10 @@ keep_lumpy:
free_page_list(&free_pages);
list_splice(&ret_pages, page_list);
- if (!scanning_global_lru(sc))
+ if (!scanning_global_lru(sc)) {
sc->memcg_record->nr_written[file] += nr_written;
+ sc->memcg_record->nr_unmapped[file] += nr_unmapped;
+ }
count_vm_events(PGACTIVATE, pgactivate);
return nr_reclaimed;
}
--
1.7.3.1
Thanks,
Andrew
On Fri, Jul 15, 2011 at 11:34 AM, Andrew Bresticker <abrestic@google.com>wrote:
> I've extended your patch to track write-back during page reclaim:
> ---
>
> From: Andrew Bresticker <abrestic@google.com>
> Date: Thu, 14 Jul 2011 17:56:48 -0700
> Subject: [PATCH] vmscan: Track number of pages written back during page
> reclaim.
>
> This tracks pages written out during page reclaim in memory.vmscan_stat
> and breaks it down by file vs. anon and context (like "scanned_pages",
> "rotated_pages", etc.).
>
> Example output:
> $ mkdir /dev/cgroup/memory/1
> $ echo 8m > /dev/cgroup/memory/1/memory.limit_in_bytes
> $ echo $$ > /dev/cgroup/memory/1/tasks
> $ dd if=/dev/urandom of=file_20g bs=4096 count=524288
> $ cat /dev/cgroup/memory/1/memory.vmscan_stat
> ...
> written_pages_by_limit 36
> written_anon_pages_by_limit 0
> written_file_pages_by_limit 36
> ...
> written_pages_by_limit_under_hierarchy 28
> written_anon_pages_by_limit_under_hierarchy 0
> written_file_pages_by_limit_under_hierarchy 28
>
> Signed-off-by: Andrew Bresticker <abrestic@google.com>
> ---
> include/linux/memcontrol.h | 1 +
> mm/memcontrol.c | 12 ++++++++++++
> mm/vmscan.c | 10 +++++++---
> 3 files changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4b49edf..4be907e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -46,6 +46,7 @@ struct memcg_scanrecord {
> unsigned long nr_scanned[2]; /* the number of scanned pages */
> unsigned long nr_rotated[2]; /* the number of rotated pages */
> unsigned long nr_freed[2]; /* the number of freed pages */
> + unsigned long nr_written[2]; /* the number of pages written back */
> unsigned long elapsed; /* nsec of time elapsed while scanning */
> };
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9bb6e93..5ec2aa3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -221,6 +221,9 @@ enum {
> FREED,
> FREED_ANON,
> FREED_FILE,
> + WRITTEN,
> + WRITTEN_ANON,
> + WRITTEN_FILE,
> ELAPSED,
> NR_SCANSTATS,
> };
> @@ -241,6 +244,9 @@ const char *scanstat_string[NR_SCANSTATS] = {
> "freed_pages",
> "freed_anon_pages",
> "freed_file_pages",
> + "written_pages",
> + "written_anon_pages",
> + "written_file_pages",
> "elapsed_ns",
> };
> #define SCANSTAT_WORD_LIMIT "_by_limit"
> @@ -1682,6 +1688,10 @@ static void __mem_cgroup_record_scanstat(unsigned
> long *stats,
> stats[FREED_ANON] += rec->nr_freed[0];
> stats[FREED_FILE] += rec->nr_freed[1];
>
> + stats[WRITTEN] += rec->nr_written[0] + rec->nr_written[1];
> + stats[WRITTEN_ANON] += rec->nr_written[0];
> + stats[WRITTEN_FILE] += rec->nr_written[1];
> +
> stats[ELAPSED] += rec->elapsed;
> }
>
> @@ -1794,6 +1804,8 @@ static int mem_cgroup_hierarchical_reclaim(struct
> mem_cgroup *root_mem,
> rec.nr_rotated[1] = 0;
> rec.nr_freed[0] = 0;
> rec.nr_freed[1] = 0;
> + rec.nr_written[0] = 0;
> + rec.nr_written[1] = 0;
> rec.elapsed = 0;
> /* we use swappiness of local cgroup */
> if (check_soft) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8fb1abd..f73b96e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -719,7 +719,7 @@ static noinline_for_stack void free_page_list(struct
> list_head *free_pages)
> */
> static unsigned long shrink_page_list(struct list_head *page_list,
> struct zone *zone,
> - struct scan_control *sc)
> + struct scan_control *sc, int file)
> {
> LIST_HEAD(ret_pages);
> LIST_HEAD(free_pages);
> @@ -727,6 +727,7 @@ static unsigned long shrink_page_list(struct list_head
> *page_list,
> unsigned long nr_dirty = 0;
> unsigned long nr_congested = 0;
> unsigned long nr_reclaimed = 0;
> + unsigned long nr_written = 0;
>
> cond_resched();
>
> @@ -840,6 +841,7 @@ static unsigned long shrink_page_list(struct list_head
> *page_list,
> case PAGE_ACTIVATE:
> goto activate_locked;
> case PAGE_SUCCESS:
> + nr_written++;
> if (PageWriteback(page))
> goto keep_lumpy;
> if (PageDirty(page))
> @@ -958,6 +960,8 @@ keep_lumpy:
> free_page_list(&free_pages);
>
> list_splice(&ret_pages, page_list);
> + if (!scanning_global_lru(sc))
> + sc->memcg_record->nr_written[file] += nr_written;
> count_vm_events(PGACTIVATE, pgactivate);
> return nr_reclaimed;
> }
> @@ -1463,7 +1467,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct
> zone *zone,
>
> spin_unlock_irq(&zone->lru_lock);
>
> - nr_reclaimed = shrink_page_list(&page_list, zone, sc);
> + nr_reclaimed = shrink_page_list(&page_list, zone, sc, file);
>
> if (!scanning_global_lru(sc))
> sc->memcg_record->nr_freed[file] += nr_reclaimed;
> @@ -1471,7 +1475,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct
> zone *zone,
> /* Check if we should syncronously wait for writeback */
> if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
> set_reclaim_mode(priority, sc, true);
> - nr_reclaimed += shrink_page_list(&page_list, zone, sc);
> + nr_reclaimed += shrink_page_list(&page_list, zone, sc, file);
> }
>
> local_irq_disable();
> --
> 1.7.3.1
>
> Thanks,
> Andrew
>
> On Wed, Jul 13, 2011 at 5:02 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>> On Tue, 12 Jul 2011 16:02:02 -0700
>> Andrew Bresticker <abrestic@google.com> wrote:
>>
>> > On Mon, Jul 11, 2011 at 3:30 AM, KAMEZAWA Hiroyuki <
>> > kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> >
>> > >
>> > > This patch is onto mmotm-0710... got bigger than expected ;(
>> > > ==
>> > > [PATCH] add memory.vmscan_stat
>> > >
>> > > The commit log of commit 0ae5e89 "memcg: count the soft_limit reclaim
>> > > in..." says it adds scanning stats to the memory.stat file. But it
>> > > doesn't, because we considered we needed to reach a consensus for
>> > > such new APIs.
>> > >
>> > > This patch is a trial to add memory.scan_stat. This shows
>> > > - the number of scanned pages(total, anon, file)
>> > > - the number of rotated pages(total, anon, file)
>> > > - the number of freed pages(total, anon, file)
>> > > - the elapsed time (including sleep/pause time)
>> > >
>> > > for both of direct/soft reclaim.
>> > >
>> > > The biggest difference from Ying's original one is that this file
>> > > can be reset by a write, as
>> > >
>> > > # echo 0 ...../memory.scan_stat
>> > >
>> > > Example of output is here. This is a result after make -j 6 kernel
>> > > under 300M limit.
>> > >
>> > > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
>> > > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
>> > > scanned_pages_by_limit 9471864
>> > > scanned_anon_pages_by_limit 6640629
>> > > scanned_file_pages_by_limit 2831235
>> > > rotated_pages_by_limit 4243974
>> > > rotated_anon_pages_by_limit 3971968
>> > > rotated_file_pages_by_limit 272006
>> > > freed_pages_by_limit 2318492
>> > > freed_anon_pages_by_limit 962052
>> > > freed_file_pages_by_limit 1356440
>> > > elapsed_ns_by_limit 351386416101
>> > > scanned_pages_by_system 0
>> > > scanned_anon_pages_by_system 0
>> > > scanned_file_pages_by_system 0
>> > > rotated_pages_by_system 0
>> > > rotated_anon_pages_by_system 0
>> > > rotated_file_pages_by_system 0
>> > > freed_pages_by_system 0
>> > > freed_anon_pages_by_system 0
>> > > freed_file_pages_by_system 0
>> > > elapsed_ns_by_system 0
>> > > scanned_pages_by_limit_under_hierarchy 9471864
>> > > scanned_anon_pages_by_limit_under_hierarchy 6640629
>> > > scanned_file_pages_by_limit_under_hierarchy 2831235
>> > > rotated_pages_by_limit_under_hierarchy 4243974
>> > > rotated_anon_pages_by_limit_under_hierarchy 3971968
>> > > rotated_file_pages_by_limit_under_hierarchy 272006
>> > > freed_pages_by_limit_under_hierarchy 2318492
>> > > freed_anon_pages_by_limit_under_hierarchy 962052
>> > > freed_file_pages_by_limit_under_hierarchy 1356440
>> > > elapsed_ns_by_limit_under_hierarchy 351386416101
>> > > scanned_pages_by_system_under_hierarchy 0
>> > > scanned_anon_pages_by_system_under_hierarchy 0
>> > > scanned_file_pages_by_system_under_hierarchy 0
>> > > rotated_pages_by_system_under_hierarchy 0
>> > > rotated_anon_pages_by_system_under_hierarchy 0
>> > > rotated_file_pages_by_system_under_hierarchy 0
>> > > freed_pages_by_system_under_hierarchy 0
>> > > freed_anon_pages_by_system_under_hierarchy 0
>> > > freed_file_pages_by_system_under_hierarchy 0
>> > > elapsed_ns_by_system_under_hierarchy 0
>> > >
>> > >
>> > > total_xxxx is for hierarchy management.
>> > >
>> > > This will be useful for further memcg development and needs to be
>> > > developed before we do some complicated rework on LRU/softlimit
>> > > management.
>> > >
>> > > This patch adds a new struct memcg_scanrecord into the scan_control
>> > > struct. sc->nr_scanned et al. are not designed for exporting
>> > > information. For example, nr_scanned is reset frequently and
>> > > incremented by +2 when scanning mapped pages.
>> > >
>> > > To avoid complexity, I added a new param in scan_control which is
>> > > for exporting scanning scores.
>> > >
>> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> > >
>> > > Changelog:
>> > > - renamed as vmscan_stat
>> > > - handle file/anon
>> > > - added "rotated"
>> > > - changed names of param in vmscan_stat.
>> > > ---
>> > > Documentation/cgroups/memory.txt | 85 +++++++++++++++++++
>> > > include/linux/memcontrol.h | 19 ++++
>> > > include/linux/swap.h | 6 -
>> > > mm/memcontrol.c | 172
>> > > +++++++++++++++++++++++++++++++++++++--
>> > > mm/vmscan.c | 39 +++++++-
>> > > 5 files changed, 303 insertions(+), 18 deletions(-)
>> > >
>> > > Index: mmotm-0710/Documentation/cgroups/memory.txt
>> > > ===================================================================
>> > > --- mmotm-0710.orig/Documentation/cgroups/memory.txt
>> > > +++ mmotm-0710/Documentation/cgroups/memory.txt
>> > > @@ -380,7 +380,7 @@ will be charged as a new owner of it.
>> > >
>> > > 5.2 stat file
>> > >
>> > > -memory.stat file includes following statistics
>> > > +5.2.1 memory.stat file includes following statistics
>> > >
>> > > # per-memory cgroup local status
>> > > cache - # of bytes of page cache memory.
>> > > @@ -438,6 +438,89 @@ Note:
>> > > file_mapped is accounted only when the memory cgroup is owner
>> of
>> > > page
>> > > cache.)
>> > >
>> > > +5.2.2 memory.vmscan_stat
>> > > +
>> > > +memory.vmscan_stat includes statistics for memory scanning, freeing,
>> > > +and reclaiming. The statistics show memory scanning information since
>> > > +memory cgroup creation and can be reset to 0 by writing 0 as
>> > > +
>> > > + #echo 0 > ../memory.vmscan_stat
>> > > +
>> > > +This file contains following statistics.
>> > > +
>> > > +[param]_[file_or_anon]_pages_by_[reason]_[under_hierarchy]
>> > > +[param]_elapsed_ns_by_[reason]_[under_hierarchy]
>> > > +
>> > > +For example,
>> > > +
>> > > + scanned_file_pages_by_limit indicates the number of scanned
>> > > + file pages at vmscan.
>> > > +
>> > > +Now, 3 parameters are supported
>> > > +
>> > > + scanned - the number of pages scanned by vmscan
>> > > + rotated - the number of pages activated at vmscan
>> > > + freed - the number of pages freed by vmscan
>> > > +
>> > > +If "rotated" is high relative to scanned/freed, the memcg seems busy.
>> > > +
>> > > +Now, 2 reasons are supported
>> > > +
>> > > + limit - the memory cgroup's limit
>> > > + system - global memory pressure + softlimit
>> > > + (global memory pressure not under softlimit is not handled
>> now)
>> > > +
>> > > +When under_hierarchy is appended at the tail, the number indicates the
>> > > +total memcg scans of its children and itself.
>> > > +
>> > > +elapsed_ns is the elapsed time in nanoseconds. This may include
>> > > +sleep time and does not indicate CPU usage. So, please take this as
>> > > +just showing latency.
>> > > +
>> > > +Here is an example.
>> > > +
>> > > +# cat /cgroup/memory/A/memory.vmscan_stat
>> > > +scanned_pages_by_limit 9471864
>> > > +scanned_anon_pages_by_limit 6640629
>> > > +scanned_file_pages_by_limit 2831235
>> > > +rotated_pages_by_limit 4243974
>> > > +rotated_anon_pages_by_limit 3971968
>> > > +rotated_file_pages_by_limit 272006
>> > > +freed_pages_by_limit 2318492
>> > > +freed_anon_pages_by_limit 962052
>> > > +freed_file_pages_by_limit 1356440
>> > > +elapsed_ns_by_limit 351386416101
>> > > +scanned_pages_by_system 0
>> > > +scanned_anon_pages_by_system 0
>> > > +scanned_file_pages_by_system 0
>> > > +rotated_pages_by_system 0
>> > > +rotated_anon_pages_by_system 0
>> > > +rotated_file_pages_by_system 0
>> > > +freed_pages_by_system 0
>> > > +freed_anon_pages_by_system 0
>> > > +freed_file_pages_by_system 0
>> > > +elapsed_ns_by_system 0
>> > > +scanned_pages_by_limit_under_hierarchy 9471864
>> > > +scanned_anon_pages_by_limit_under_hierarchy 6640629
>> > > +scanned_file_pages_by_limit_under_hierarchy 2831235
>> > > +rotated_pages_by_limit_under_hierarchy 4243974
>> > > +rotated_anon_pages_by_limit_under_hierarchy 3971968
>> > > +rotated_file_pages_by_limit_under_hierarchy 272006
>> > > +freed_pages_by_limit_under_hierarchy 2318492
>> > > +freed_anon_pages_by_limit_under_hierarchy 962052
>> > > +freed_file_pages_by_limit_under_hierarchy 1356440
>> > > +elapsed_ns_by_limit_under_hierarchy 351386416101
>> > > +scanned_pages_by_system_under_hierarchy 0
>> > > +scanned_anon_pages_by_system_under_hierarchy 0
>> > > +scanned_file_pages_by_system_under_hierarchy 0
>> > > +rotated_pages_by_system_under_hierarchy 0
>> > > +rotated_anon_pages_by_system_under_hierarchy 0
>> > > +rotated_file_pages_by_system_under_hierarchy 0
>> > > +freed_pages_by_system_under_hierarchy 0
>> > > +freed_anon_pages_by_system_under_hierarchy 0
>> > > +freed_file_pages_by_system_under_hierarchy 0
>> > > +elapsed_ns_by_system_under_hierarchy 0
>> > > +
>> > > 5.3 swappiness
>> > >
>> > > Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of
>> groups
>> > > only.
>> > > Index: mmotm-0710/include/linux/memcontrol.h
>> > > ===================================================================
>> > > --- mmotm-0710.orig/include/linux/memcontrol.h
>> > > +++ mmotm-0710/include/linux/memcontrol.h
>> > > @@ -39,6 +39,16 @@ extern unsigned long mem_cgroup_isolate_
>> > > struct mem_cgroup *mem_cont,
>> > > int active, int file);
>> > >
>> > > +struct memcg_scanrecord {
>> > > +	struct mem_cgroup *mem; /* scanned memory cgroup */
>> > > +	struct mem_cgroup *root; /* scan target hierarchy root */
>> > > +	int context; /* scanning context (see memcontrol.c) */
>> > > +	unsigned long nr_scanned[2]; /* the number of scanned pages */
>> > > +	unsigned long nr_rotated[2]; /* the number of rotated pages */
>> > > +	unsigned long nr_freed[2]; /* the number of freed pages */
>> > > +	unsigned long elapsed; /* nsec of time elapsed while scanning */
>> > > +};
>> > > +
>> > > #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>> > > /*
>> > > * All "charge" functions with gfp_mask should use GFP_KERNEL or
>> > > @@ -117,6 +127,15 @@ mem_cgroup_get_reclaim_stat_from_page(st
>> > > extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>> > > struct task_struct *p);
>> > >
>> > > +extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
>> > > +					gfp_t gfp_mask, bool noswap,
>> > > +					struct memcg_scanrecord *rec);
>> > > +extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
>> > > +					gfp_t gfp_mask, bool noswap,
>> > > +					struct zone *zone,
>> > > +					struct memcg_scanrecord *rec,
>> > > +					unsigned long *nr_scanned);
>> > > +
>> > > #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
>> > > extern int do_swap_account;
>> > > #endif
>> > > Index: mmotm-0710/include/linux/swap.h
>> > > ===================================================================
>> > > --- mmotm-0710.orig/include/linux/swap.h
>> > > +++ mmotm-0710/include/linux/swap.h
>> > > @@ -253,12 +253,6 @@ static inline void lru_cache_add_file(st
>> > > /* linux/mm/vmscan.c */
>> > >  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>> > >  					gfp_t gfp_mask, nodemask_t *mask);
>> > > -extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
>> > > -					gfp_t gfp_mask, bool noswap);
>> > > -extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
>> > > -					gfp_t gfp_mask, bool noswap,
>> > > -					struct zone *zone,
>> > > -					unsigned long *nr_scanned);
>> > > extern int __isolate_lru_page(struct page *page, int mode, int file);
>> > > extern unsigned long shrink_all_memory(unsigned long nr_pages);
>> > > extern int vm_swappiness;
>> > > Index: mmotm-0710/mm/memcontrol.c
>> > > ===================================================================
>> > > --- mmotm-0710.orig/mm/memcontrol.c
>> > > +++ mmotm-0710/mm/memcontrol.c
>> > > @@ -204,6 +204,50 @@ struct mem_cgroup_eventfd_list {
>> > > static void mem_cgroup_threshold(struct mem_cgroup *mem);
>> > > static void mem_cgroup_oom_notify(struct mem_cgroup *mem);
>> > >
>> > > +enum {
>> > > + SCAN_BY_LIMIT,
>> > > + SCAN_BY_SYSTEM,
>> > > + NR_SCAN_CONTEXT,
>> > > + SCAN_BY_SHRINK, /* not recorded now */
>> > > +};
>> > > +
>> > > +enum {
>> > > + SCAN,
>> > > + SCAN_ANON,
>> > > + SCAN_FILE,
>> > > + ROTATE,
>> > > + ROTATE_ANON,
>> > > + ROTATE_FILE,
>> > > + FREED,
>> > > + FREED_ANON,
>> > > + FREED_FILE,
>> > > + ELAPSED,
>> > > + NR_SCANSTATS,
>> > > +};
>> > > +
>> > > +struct scanstat {
>> > > + spinlock_t lock;
>> > > + unsigned long stats[NR_SCAN_CONTEXT][NR_SCANSTATS];
>> > > + unsigned long rootstats[NR_SCAN_CONTEXT][NR_SCANSTATS];
>> > > +};
>> > >
>> >
>> > I'm working on a similar effort with Ying here at Google, and so far
>> > we've been using per-cpu counters for these statistics instead of
>> > spin-lock protected counters. Clearly the spin-lock protected counters
>> > have less memory overhead and make reading the stat file faster, but
>> > our concern is that this method is inconsistent with the other memory
>> > stat files such as /proc/vmstat and /dev/cgroup/memory/.../memory.stat.
>> > Is there any particular reason you chose to use spin-lock protected
>> > counters instead of per-cpu counters?
>> >
>>
>> In my experience, if we "batch" enough, it always works better than a
>> percpu counter. A percpu counter is effective when batching is difficult.
>> This patch's implementation does enough batching and is much more
>> coarse-grained than a percpu counter, so this patch is better than percpu.
>>
>>
>> > I've also modified your patch to use per-cpu counters instead of
>> > spin-lock protected counters. I tested it by doing streaming I/O from
>> > a ramdisk:
>> >
>> > $ mke2fs /dev/ram1
>> > $ mkdir /tmp/swapram
>> > $ mkdir /tmp/swapram/ram1
>> > $ mount -t ext2 /dev/ram1 /tmp/swapram/ram1
>> > $ dd if=/dev/urandom of=/tmp/swapram/ram1/file_16m bs=4096 count=4096
>> > $ mkdir /dev/cgroup/memory/1
>> > $ echo 8m > /dev/cgroup/memory/1/memory.limit_in_bytes
>> > $ ./ramdisk_load.sh 7
>> > $ echo $$ > /dev/cgroup/memory/1/tasks
>> > $ time for ((z=0; z<=2000; z++)); do cat /tmp/swapram/ram1/file_16m > /dev/zero; done
>> >
>> > Where ramdisk_load.sh is:
>> > for ((i=0; i<=$1; i++))
>> > do
>> > echo $$ >/dev/cgroup/memory/1/tasks
>> > for ((z=0; z<=2000; z++)); do cat /tmp/swapram/ram1/file_16m > /dev/zero; done &
>> > done
>> >
>> > Surprisingly, the per-cpu counters perform worse than the spin-lock
>> > protected counters. Over 10 runs of the test above, the per-cpu
>> > counters were 1.60% slower in both real time and sys time. I'm
>> > wondering if you have any insight as to why this is. I can provide my
>> > diff against your patch if necessary.
>> >
>>
>> The percpu counter works effectively only when we use +1/-1 at each
>> change of the counters. It uses "batch" to merge the per-cpu value into
>> the counter. I think you used the default "batch" value, but the
>> scan/rotate/free/elapsed values are always larger than "batch", so you
>> just added memory overhead and an "if" to pure spinlock counters.
>>
>> Determining this "batch" threshold for percpu counter is difficult.
>>
>> Thanks,
>> -Kame
>>
>>
>>
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v2] memcg: add vmscan_stat
2011-07-11 10:30 [PATCH v2] memcg: add vmscan_stat KAMEZAWA Hiroyuki
2011-07-12 23:02 ` Andrew Bresticker
@ 2011-07-18 21:00 ` Andrew Bresticker
2011-07-20 6:03 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 9+ messages in thread
From: Andrew Bresticker @ 2011-07-18 21:00 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, akpm, nishimura, bsingharora, Michal Hocko, Ying Han
[-- Attachment #1: Type: text/plain, Size: 26783 bytes --]
On Mon, Jul 11, 2011 at 3:30 AM, KAMEZAWA Hiroyuki <
kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> This patch is onto mmotm-0710... got bigger than expected ;(
> ==
> [PATCH] add memory.vmscan_stat
>
> The commit log of commit 0ae5e89 ("memcg: count the soft_limit reclaim
> in...") says it adds scanning stats to the memory.stat file. But it
> doesn't, because we considered we needed to reach a consensus for such
> new APIs.
>
> This patch is a trial to add memory.scan_stat. This shows
> - the number of scanned pages(total, anon, file)
> - the number of rotated pages(total, anon, file)
> - the number of freed pages(total, anon, file)
> - the elapsed time (including sleep/pause time)
>
> for both direct and soft reclaim.
>
> The biggest difference from Ying's original version is that this file
> can be reset by a write, as
>
> # echo 0 ...../memory.scan_stat
>
> Example of output is here. This is a result after make -j 6 kernel
> under 300M limit.
>
> [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
> [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
> scanned_pages_by_limit 9471864
> scanned_anon_pages_by_limit 6640629
> scanned_file_pages_by_limit 2831235
> rotated_pages_by_limit 4243974
> rotated_anon_pages_by_limit 3971968
> rotated_file_pages_by_limit 272006
> freed_pages_by_limit 2318492
> freed_anon_pages_by_limit 962052
> freed_file_pages_by_limit 1356440
> elapsed_ns_by_limit 351386416101
> scanned_pages_by_system 0
> scanned_anon_pages_by_system 0
> scanned_file_pages_by_system 0
> rotated_pages_by_system 0
> rotated_anon_pages_by_system 0
> rotated_file_pages_by_system 0
> freed_pages_by_system 0
> freed_anon_pages_by_system 0
> freed_file_pages_by_system 0
> elapsed_ns_by_system 0
> scanned_pages_by_limit_under_hierarchy 9471864
> scanned_anon_pages_by_limit_under_hierarchy 6640629
> scanned_file_pages_by_limit_under_hierarchy 2831235
> rotated_pages_by_limit_under_hierarchy 4243974
> rotated_anon_pages_by_limit_under_hierarchy 3971968
> rotated_file_pages_by_limit_under_hierarchy 272006
> freed_pages_by_limit_under_hierarchy 2318492
> freed_anon_pages_by_limit_under_hierarchy 962052
> freed_file_pages_by_limit_under_hierarchy 1356440
> elapsed_ns_by_limit_under_hierarchy 351386416101
> scanned_pages_by_system_under_hierarchy 0
> scanned_anon_pages_by_system_under_hierarchy 0
> scanned_file_pages_by_system_under_hierarchy 0
> rotated_pages_by_system_under_hierarchy 0
> rotated_anon_pages_by_system_under_hierarchy 0
> rotated_file_pages_by_system_under_hierarchy 0
> freed_pages_by_system_under_hierarchy 0
> freed_anon_pages_by_system_under_hierarchy 0
> freed_file_pages_by_system_under_hierarchy 0
> elapsed_ns_by_system_under_hierarchy 0
>
>
> total_xxxx is for hierarchy management.
>
> This will be useful for further memcg development and needs to be
> developed before we do some complicated rework on LRU/softlimit
> management.
>
> This patch adds a new struct memcg_scanrecord into scan_control struct.
> sc->nr_scanned et al. are not designed for exporting information. For
> example, nr_scanned is reset frequently and incremented by +2 when
> scanning mapped pages.
>
> For avoiding complexity, I added a new param in scan_control which is for
> exporting scanning score.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> Changelog:
> - renamed as vmscan_stat
> - handle file/anon
> - added "rotated"
> - changed names of param in vmscan_stat.
> ---
> Documentation/cgroups/memory.txt | 85 +++++++++++++++++++
> include/linux/memcontrol.h | 19 ++++
> include/linux/swap.h | 6 -
> mm/memcontrol.c | 172 +++++++++++++++++++++++++++++++++++++--
> mm/vmscan.c | 39 +++++++-
> 5 files changed, 303 insertions(+), 18 deletions(-)
>
> Index: mmotm-0710/Documentation/cgroups/memory.txt
> ===================================================================
> --- mmotm-0710.orig/Documentation/cgroups/memory.txt
> +++ mmotm-0710/Documentation/cgroups/memory.txt
> @@ -380,7 +380,7 @@ will be charged as a new owner of it.
>
> 5.2 stat file
>
> -memory.stat file includes following statistics
> +5.2.1 memory.stat file includes following statistics
>
> # per-memory cgroup local status
> cache - # of bytes of page cache memory.
> @@ -438,6 +438,89 @@ Note:
> file_mapped is accounted only when the memory cgroup is owner of
> page
> cache.)
>
> +5.2.2 memory.vmscan_stat
> +
> +memory.vmscan_stat includes statistics for memory scanning, freeing,
> +and reclaiming. The statistics show memory scanning information since
> +memory cgroup creation and can be reset to 0 by writing 0 as
> +
> + #echo 0 > ../memory.vmscan_stat
> +
> +This file contains the following statistics.
> +
> +[param]_[file_or_anon]_pages_by_[reason]_[under_hierarchy]
> +[param]_elapsed_ns_by_[reason]_[under_hierarchy]
> +
> +For example,
> +
> + scanned_file_pages_by_limit indicates the number of scanned
> + file pages at vmscan.
> +
> +Now, 3 parameters are supported
> +
> + scanned - the number of pages scanned by vmscan
> + rotated - the number of pages activated at vmscan
> + freed - the number of pages freed by vmscan
> +
> +If "rotated" is high relative to scanned/freed, the memcg seems busy.
> +
> +Now, 2 reasons are supported
> +
> + limit - the memory cgroup's limit
> + system - global memory pressure + softlimit
> + (global memory pressure not under softlimit is not handled now)
> +
> +When under_hierarchy is appended, the number indicates the total
> +for the memcg itself and all of its children.
> +
> +elapsed_ns is the elapsed time in nanoseconds. This may include sleep
> +time and does not indicate CPU usage. So, please take this as just
> +showing latency.
> +
> +Here is an example.
> +
> +# cat /cgroup/memory/A/memory.vmscan_stat
> +scanned_pages_by_limit 9471864
> +scanned_anon_pages_by_limit 6640629
> +scanned_file_pages_by_limit 2831235
> +rotated_pages_by_limit 4243974
> +rotated_anon_pages_by_limit 3971968
> +rotated_file_pages_by_limit 272006
> +freed_pages_by_limit 2318492
> +freed_anon_pages_by_limit 962052
> +freed_file_pages_by_limit 1356440
> +elapsed_ns_by_limit 351386416101
> +scanned_pages_by_system 0
> +scanned_anon_pages_by_system 0
> +scanned_file_pages_by_system 0
> +rotated_pages_by_system 0
> +rotated_anon_pages_by_system 0
> +rotated_file_pages_by_system 0
> +freed_pages_by_system 0
> +freed_anon_pages_by_system 0
> +freed_file_pages_by_system 0
> +elapsed_ns_by_system 0
> +scanned_pages_by_limit_under_hierarchy 9471864
> +scanned_anon_pages_by_limit_under_hierarchy 6640629
> +scanned_file_pages_by_limit_under_hierarchy 2831235
> +rotated_pages_by_limit_under_hierarchy 4243974
> +rotated_anon_pages_by_limit_under_hierarchy 3971968
> +rotated_file_pages_by_limit_under_hierarchy 272006
> +freed_pages_by_limit_under_hierarchy 2318492
> +freed_anon_pages_by_limit_under_hierarchy 962052
> +freed_file_pages_by_limit_under_hierarchy 1356440
> +elapsed_ns_by_limit_under_hierarchy 351386416101
> +scanned_pages_by_system_under_hierarchy 0
> +scanned_anon_pages_by_system_under_hierarchy 0
> +scanned_file_pages_by_system_under_hierarchy 0
> +rotated_pages_by_system_under_hierarchy 0
> +rotated_anon_pages_by_system_under_hierarchy 0
> +rotated_file_pages_by_system_under_hierarchy 0
> +freed_pages_by_system_under_hierarchy 0
> +freed_anon_pages_by_system_under_hierarchy 0
> +freed_file_pages_by_system_under_hierarchy 0
> +elapsed_ns_by_system_under_hierarchy 0
> +
> 5.3 swappiness
>
> Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups
> only.
> Index: mmotm-0710/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-0710.orig/include/linux/memcontrol.h
> +++ mmotm-0710/include/linux/memcontrol.h
> @@ -39,6 +39,16 @@ extern unsigned long mem_cgroup_isolate_
> struct mem_cgroup *mem_cont,
> int active, int file);
>
> +struct memcg_scanrecord {
> +	struct mem_cgroup *mem; /* scanned memory cgroup */
> + struct mem_cgroup *root; /* scan target hierarchy root */
> + int context; /* scanning context (see memcontrol.c) */
> + unsigned long nr_scanned[2]; /* the number of scanned pages */
> + unsigned long nr_rotated[2]; /* the number of rotated pages */
> + unsigned long nr_freed[2]; /* the number of freed pages */
> + unsigned long elapsed; /* nsec of time elapsed while scanning */
> +};
> +
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> /*
> * All "charge" functions with gfp_mask should use GFP_KERNEL or
> @@ -117,6 +127,15 @@ mem_cgroup_get_reclaim_stat_from_page(st
> extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> struct task_struct *p);
>
> +extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> +					gfp_t gfp_mask, bool noswap,
> +					struct memcg_scanrecord *rec);
> +extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> +					gfp_t gfp_mask, bool noswap,
> +					struct zone *zone,
> +					struct memcg_scanrecord *rec,
> +					unsigned long *nr_scanned);
> +
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> extern int do_swap_account;
> #endif
> Index: mmotm-0710/include/linux/swap.h
> ===================================================================
> --- mmotm-0710.orig/include/linux/swap.h
> +++ mmotm-0710/include/linux/swap.h
> @@ -253,12 +253,6 @@ static inline void lru_cache_add_file(st
> /* linux/mm/vmscan.c */
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  					gfp_t gfp_mask, nodemask_t *mask);
> -extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> -					gfp_t gfp_mask, bool noswap);
> -extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> -					gfp_t gfp_mask, bool noswap,
> -					struct zone *zone,
> -					unsigned long *nr_scanned);
> extern int __isolate_lru_page(struct page *page, int mode, int file);
> extern unsigned long shrink_all_memory(unsigned long nr_pages);
> extern int vm_swappiness;
> Index: mmotm-0710/mm/memcontrol.c
> ===================================================================
> --- mmotm-0710.orig/mm/memcontrol.c
> +++ mmotm-0710/mm/memcontrol.c
> @@ -204,6 +204,50 @@ struct mem_cgroup_eventfd_list {
> static void mem_cgroup_threshold(struct mem_cgroup *mem);
> static void mem_cgroup_oom_notify(struct mem_cgroup *mem);
>
> +enum {
> + SCAN_BY_LIMIT,
> + SCAN_BY_SYSTEM,
> + NR_SCAN_CONTEXT,
> + SCAN_BY_SHRINK, /* not recorded now */
> +};
> +
> +enum {
> + SCAN,
> + SCAN_ANON,
> + SCAN_FILE,
> + ROTATE,
> + ROTATE_ANON,
> + ROTATE_FILE,
> + FREED,
> + FREED_ANON,
> + FREED_FILE,
> + ELAPSED,
> + NR_SCANSTATS,
> +};
> +
> +struct scanstat {
> + spinlock_t lock;
> + unsigned long stats[NR_SCAN_CONTEXT][NR_SCANSTATS];
> + unsigned long rootstats[NR_SCAN_CONTEXT][NR_SCANSTATS];
> +};
> +
> +const char *scanstat_string[NR_SCANSTATS] = {
> + "scanned_pages",
> + "scanned_anon_pages",
> + "scanned_file_pages",
> + "rotated_pages",
> + "rotated_anon_pages",
> + "rotated_file_pages",
> + "freed_pages",
> + "freed_anon_pages",
> + "freed_file_pages",
> + "elapsed_ns",
> +};
> +#define SCANSTAT_WORD_LIMIT "_by_limit"
> +#define SCANSTAT_WORD_SYSTEM "_by_system"
> +#define SCANSTAT_WORD_HIERARCHY "_under_hierarchy"
> +
> +
> /*
> * The memory controller data structure. The memory controller controls
> both
> * page cache and RSS per cgroup. We would eventually like to provide
> @@ -266,7 +310,8 @@ struct mem_cgroup {
>
> /* For oom notifier event fd */
> struct list_head oom_notify;
> -
> + /* For recording LRU-scan statistics */
> + struct scanstat scanstat;
> /*
> * Should we move charges of a task when a task is moved into this
> * mem_cgroup ? And what type of charges should we move ?
> @@ -1619,6 +1664,44 @@ bool mem_cgroup_reclaimable(struct mem_c
> }
> #endif
>
> +static void __mem_cgroup_record_scanstat(unsigned long *stats,
> + struct memcg_scanrecord *rec)
> +{
> +
> + stats[SCAN] += rec->nr_scanned[0] + rec->nr_scanned[1];
> + stats[SCAN_ANON] += rec->nr_scanned[0];
> + stats[SCAN_FILE] += rec->nr_scanned[1];
> +
> + stats[ROTATE] += rec->nr_rotated[0] + rec->nr_rotated[1];
> + stats[ROTATE_ANON] += rec->nr_rotated[0];
> + stats[ROTATE_FILE] += rec->nr_rotated[1];
> +
> + stats[FREED] += rec->nr_freed[0] + rec->nr_freed[1];
> + stats[FREED_ANON] += rec->nr_freed[0];
> + stats[FREED_FILE] += rec->nr_freed[1];
> +
> + stats[ELAPSED] += rec->elapsed;
> +}
> +
> +static void mem_cgroup_record_scanstat(struct memcg_scanrecord *rec)
> +{
> + struct mem_cgroup *mem;
> + int context = rec->context;
> +
> + if (context >= NR_SCAN_CONTEXT)
> + return;
> +
> + mem = rec->mem;
> + spin_lock(&mem->scanstat.lock);
> + __mem_cgroup_record_scanstat(mem->scanstat.stats[context], rec);
> + spin_unlock(&mem->scanstat.lock);
> +
> + mem = rec->root;
> + spin_lock(&mem->scanstat.lock);
> +	__mem_cgroup_record_scanstat(mem->scanstat.rootstats[context], rec);
> + spin_unlock(&mem->scanstat.lock);
> +}
> +
> /*
> * Scan the hierarchy if needed to reclaim memory. We remember the last
> child
> * we reclaimed from, so that we don't end up penalizing one child
> extensively
> @@ -1643,8 +1726,9 @@ static int mem_cgroup_hierarchical_recla
> bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
> bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
> bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
> + struct memcg_scanrecord rec;
> unsigned long excess;
> - unsigned long nr_scanned;
> + unsigned long scanned;
>
> excess = res_counter_soft_limit_excess(&root_mem->res) >>
> PAGE_SHIFT;
>
> @@ -1652,6 +1736,15 @@ static int mem_cgroup_hierarchical_recla
> if (!check_soft && root_mem->memsw_is_minimum)
> noswap = true;
>
> + if (shrink)
> + rec.context = SCAN_BY_SHRINK;
> + else if (check_soft)
> + rec.context = SCAN_BY_SYSTEM;
> + else
> + rec.context = SCAN_BY_LIMIT;
> +
> + rec.root = root_mem;
> +
> while (1) {
> victim = mem_cgroup_select_victim(root_mem);
> if (victim == root_mem) {
> @@ -1692,14 +1785,23 @@ static int mem_cgroup_hierarchical_recla
> css_put(&victim->css);
> continue;
> }
> + rec.mem = victim;
> + rec.nr_scanned[0] = 0;
> + rec.nr_scanned[1] = 0;
> + rec.nr_rotated[0] = 0;
> + rec.nr_rotated[1] = 0;
> + rec.nr_freed[0] = 0;
> + rec.nr_freed[1] = 0;
> + rec.elapsed = 0;
> /* we use swappiness of local cgroup */
> if (check_soft) {
> ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
> - noswap, zone, &nr_scanned);
> - *total_scanned += nr_scanned;
> + noswap, zone, &rec, &scanned);
> + *total_scanned += scanned;
> } else
> ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> - noswap);
> + noswap, &rec);
> + mem_cgroup_record_scanstat(&rec);
> css_put(&victim->css);
> /*
> * At shrinking usage, we can't check we should stop here or
> @@ -3688,14 +3790,18 @@ try_to_free:
> /* try to free all pages in this cgroup */
> shrink = 1;
> while (nr_retries && mem->res.usage > 0) {
> + struct memcg_scanrecord rec;
> int progress;
>
> if (signal_pending(current)) {
> ret = -EINTR;
> goto out;
> }
> + rec.context = SCAN_BY_SHRINK;
> + rec.mem = mem;
> + rec.root = mem;
> progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
> - false);
> + false, &rec);
> if (!progress) {
> nr_retries--;
> /* maybe some writeback is necessary */
> @@ -4539,6 +4645,54 @@ static int mem_control_numa_stat_open(st
> }
> #endif /* CONFIG_NUMA */
>
> +static int mem_cgroup_vmscan_stat_read(struct cgroup *cgrp,
> + struct cftype *cft,
> + struct cgroup_map_cb *cb)
> +{
> + struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> + char string[64];
> + int i;
> +
> + for (i = 0; i < NR_SCANSTATS; i++) {
> + strcpy(string, scanstat_string[i]);
> + strcat(string, SCANSTAT_WORD_LIMIT);
> +		cb->fill(cb, string, mem->scanstat.stats[SCAN_BY_LIMIT][i]);
> + }
> +
> + for (i = 0; i < NR_SCANSTATS; i++) {
> + strcpy(string, scanstat_string[i]);
> + strcat(string, SCANSTAT_WORD_SYSTEM);
> +		cb->fill(cb, string, mem->scanstat.stats[SCAN_BY_SYSTEM][i]);
> + }
> +
> + for (i = 0; i < NR_SCANSTATS; i++) {
> + strcpy(string, scanstat_string[i]);
> + strcat(string, SCANSTAT_WORD_LIMIT);
> + strcat(string, SCANSTAT_WORD_HIERARCHY);
> +		cb->fill(cb, string, mem->scanstat.rootstats[SCAN_BY_LIMIT][i]);
> + }
> + for (i = 0; i < NR_SCANSTATS; i++) {
> + strcpy(string, scanstat_string[i]);
> + strcat(string, SCANSTAT_WORD_SYSTEM);
> + strcat(string, SCANSTAT_WORD_HIERARCHY);
> +		cb->fill(cb, string, mem->scanstat.rootstats[SCAN_BY_SYSTEM][i]);
> + }
> + return 0;
> +}
> +
> +static int mem_cgroup_reset_vmscan_stat(struct cgroup *cgrp,
> + unsigned int event)
> +{
> + struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> +
> + spin_lock(&mem->scanstat.lock);
> + memset(&mem->scanstat.stats, 0, sizeof(mem->scanstat.stats));
> +	memset(&mem->scanstat.rootstats, 0, sizeof(mem->scanstat.rootstats));
> + spin_unlock(&mem->scanstat.lock);
> + return 0;
> +}
> +
> +
> static struct cftype mem_cgroup_files[] = {
> {
> .name = "usage_in_bytes",
> @@ -4609,6 +4763,11 @@ static struct cftype mem_cgroup_files[]
> .mode = S_IRUGO,
> },
> #endif
> + {
> + .name = "vmscan_stat",
> + .read_map = mem_cgroup_vmscan_stat_read,
> + .trigger = mem_cgroup_reset_vmscan_stat,
> + },
> };
>
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> @@ -4872,6 +5031,7 @@ mem_cgroup_create(struct cgroup_subsys *
> atomic_set(&mem->refcnt, 1);
> mem->move_charge_at_immigrate = 0;
> mutex_init(&mem->thresholds_lock);
> + spin_lock_init(&mem->scanstat.lock);
> return &mem->css;
> free_out:
> __mem_cgroup_free(mem);
> Index: mmotm-0710/mm/vmscan.c
> ===================================================================
> --- mmotm-0710.orig/mm/vmscan.c
> +++ mmotm-0710/mm/vmscan.c
> @@ -105,6 +105,7 @@ struct scan_control {
>
> /* Which cgroup do we reclaim from */
> struct mem_cgroup *mem_cgroup;
> + struct memcg_scanrecord *memcg_record;
>
> /*
> * Nodemask of nodes allowed by the caller. If NULL, all nodes
> @@ -1307,6 +1308,8 @@ putback_lru_pages(struct zone *zone, str
> int file = is_file_lru(lru);
> int numpages = hpage_nr_pages(page);
> reclaim_stat->recent_rotated[file] += numpages;
> + if (!scanning_global_lru(sc))
> +				sc->memcg_record->nr_rotated[file] += numpages;
> }
> if (!pagevec_add(&pvec, page)) {
> spin_unlock_irq(&zone->lru_lock);
> @@ -1350,6 +1353,10 @@ static noinline_for_stack void update_is
>
> reclaim_stat->recent_scanned[0] += *nr_anon;
> reclaim_stat->recent_scanned[1] += *nr_file;
> + if (!scanning_global_lru(sc)) {
> + sc->memcg_record->nr_scanned[0] += *nr_anon;
> + sc->memcg_record->nr_scanned[1] += *nr_file;
> + }
> }
>
> /*
> @@ -1457,6 +1464,9 @@ shrink_inactive_list(unsigned long nr_to
>
> nr_reclaimed = shrink_page_list(&page_list, zone, sc);
>
> + if (!scanning_global_lru(sc))
> + sc->memcg_record->nr_freed[file] += nr_reclaimed;
> +
>
Can't we stall for writeback? If so, we may call shrink_page_list() again
below. The accounting should probably go after that instead.
> /* Check if we should syncronously wait for writeback */
> if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
> set_reclaim_mode(priority, sc, true);
> @@ -1562,6 +1572,8 @@ static void shrink_active_list(unsigned
> }
>
> reclaim_stat->recent_scanned[file] += nr_taken;
> + if (!scanning_global_lru(sc))
> + sc->memcg_record->nr_scanned[file] += nr_taken;
>
> __count_zone_vm_events(PGREFILL, zone, pgscanned);
> if (file)
> @@ -1613,6 +1625,8 @@ static void shrink_active_list(unsigned
> * get_scan_ratio.
> */
> reclaim_stat->recent_rotated[file] += nr_rotated;
> + if (!scanning_global_lru(sc))
> + sc->memcg_record->nr_rotated[file] += nr_rotated;
>
> move_active_pages_to_lru(zone, &l_active,
> LRU_ACTIVE + file *
> LRU_FILE);
> @@ -2207,9 +2221,10 @@ unsigned long try_to_free_pages(struct z
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>
> unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> -					gfp_t gfp_mask, bool noswap,
> - struct zone *zone,
> - unsigned long *nr_scanned)
> + gfp_t gfp_mask, bool noswap,
> + struct zone *zone,
> + struct memcg_scanrecord *rec,
> + unsigned long *scanned)
> {
> struct scan_control sc = {
> .nr_scanned = 0,
> @@ -2219,7 +2234,9 @@ unsigned long mem_cgroup_shrink_node_zon
> .may_swap = !noswap,
> .order = 0,
> .mem_cgroup = mem,
> + .memcg_record = rec,
> };
> + unsigned long start, end;
>
> sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
> @@ -2228,6 +2245,7 @@ unsigned long mem_cgroup_shrink_node_zon
> sc.may_writepage,
> sc.gfp_mask);
>
> + start = sched_clock();
> /*
> * NOTE: Although we can get the priority field, using it
> * here is not a good idea, since it limits the pages we can scan.
> @@ -2236,19 +2254,25 @@ unsigned long mem_cgroup_shrink_node_zon
> * the priority and make it zero.
> */
> shrink_zone(0, zone, &sc);
> + end = sched_clock();
> +
> + if (rec)
> + rec->elapsed += end - start;
> + *scanned = sc.nr_scanned;
>
> trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
>
> - *nr_scanned = sc.nr_scanned;
> return sc.nr_reclaimed;
> }
>
> unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> gfp_t gfp_mask,
> - bool noswap)
> + bool noswap,
> + struct memcg_scanrecord *rec)
> {
> struct zonelist *zonelist;
> unsigned long nr_reclaimed;
> + unsigned long start, end;
> int nid;
> struct scan_control sc = {
> .may_writepage = !laptop_mode,
> @@ -2257,6 +2281,7 @@ unsigned long try_to_free_mem_cgroup_pag
> .nr_to_reclaim = SWAP_CLUSTER_MAX,
> .order = 0,
> .mem_cgroup = mem_cont,
> + .memcg_record = rec,
> .nodemask = NULL, /* we don't care the placement */
> .gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
> @@ -2265,6 +2290,7 @@ unsigned long try_to_free_mem_cgroup_pag
> .gfp_mask = sc.gfp_mask,
> };
>
> + start = sched_clock();
> /*
> * Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't
> * take care of from where we get pages. So the node where we start
> the
> @@ -2279,6 +2305,9 @@ unsigned long try_to_free_mem_cgroup_pag
> sc.gfp_mask);
>
> nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink);
> + end = sched_clock();
> + if (rec)
> + rec->elapsed += end - start;
>
> trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign
> http://stopthemeter.ca/
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v2] memcg: add vmscan_stat
2011-07-15 18:34 ` Andrew Bresticker
2011-07-15 20:28 ` Andrew Bresticker
@ 2011-07-20 5:58 ` KAMEZAWA Hiroyuki
1 sibling, 0 replies; 9+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-20 5:58 UTC (permalink / raw)
To: Andrew Bresticker
Cc: linux-mm, akpm, nishimura, bsingharora, Michal Hocko, Ying Han
On Fri, 15 Jul 2011 11:34:08 -0700
Andrew Bresticker <abrestic@google.com> wrote:
> I've extended your patch to track write-back during page reclaim:
Thanks, but please wait until:
1. There is work to remove ->writeback in the page reclaim path.
2. Dirty ratio support in memcg.
Because of (1), it will be better to count "dirty pages we met during vmscan".
Anyway, my patch is not in mmotm yet ;(
Thanks,
-kame
> ---
>
> From: Andrew Bresticker <abrestic@google.com>
> Date: Thu, 14 Jul 2011 17:56:48 -0700
> Subject: [PATCH] vmscan: Track number of pages written back during page reclaim.
>
> This tracks pages written out during page reclaim in memory.vmscan_stat
> and breaks it down by file vs. anon and context (like "scanned_pages",
> "rotated_pages", etc.).
>
> Example output:
> $ mkdir /dev/cgroup/memory/1
> $ echo 8m > /dev/cgroup/memory/1/memory.limit_in_bytes
> $ echo $$ > /dev/cgroup/memory/1/tasks
> $ dd if=/dev/urandom of=file_20g bs=4096 count=524288
> $ cat /dev/cgroup/memory/1/memory.vmscan_stat
> ...
> written_pages_by_limit 36
> written_anon_pages_by_limit 0
> written_file_pages_by_limit 36
> ...
> written_pages_by_limit_under_hierarchy 28
> written_anon_pages_by_limit_under_hierarchy 0
> written_file_pages_by_limit_under_hierarchy 28
>
> Signed-off-by: Andrew Bresticker <abrestic@google.com>
> ---
> include/linux/memcontrol.h | 1 +
> mm/memcontrol.c | 12 ++++++++++++
> mm/vmscan.c | 10 +++++++---
> 3 files changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4b49edf..4be907e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -46,6 +46,7 @@ struct memcg_scanrecord {
> unsigned long nr_scanned[2]; /* the number of scanned pages */
> unsigned long nr_rotated[2]; /* the number of rotated pages */
> unsigned long nr_freed[2]; /* the number of freed pages */
> + unsigned long nr_written[2]; /* the number of pages written back */
> unsigned long elapsed; /* nsec of time elapsed while scanning */
> };
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9bb6e93..5ec2aa3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -221,6 +221,9 @@ enum {
> FREED,
> FREED_ANON,
> FREED_FILE,
> + WRITTEN,
> + WRITTEN_ANON,
> + WRITTEN_FILE,
> ELAPSED,
> NR_SCANSTATS,
> };
> @@ -241,6 +244,9 @@ const char *scanstat_string[NR_SCANSTATS] = {
> "freed_pages",
> "freed_anon_pages",
> "freed_file_pages",
> + "written_pages",
> + "written_anon_pages",
> + "written_file_pages",
> "elapsed_ns",
> };
> #define SCANSTAT_WORD_LIMIT "_by_limit"
> @@ -1682,6 +1688,10 @@ static void __mem_cgroup_record_scanstat(unsigned long *stats,
> stats[FREED_ANON] += rec->nr_freed[0];
> stats[FREED_FILE] += rec->nr_freed[1];
>
> + stats[WRITTEN] += rec->nr_written[0] + rec->nr_written[1];
> + stats[WRITTEN_ANON] += rec->nr_written[0];
> + stats[WRITTEN_FILE] += rec->nr_written[1];
> +
> stats[ELAPSED] += rec->elapsed;
> }
>
> @@ -1794,6 +1804,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> rec.nr_rotated[1] = 0;
> rec.nr_freed[0] = 0;
> rec.nr_freed[1] = 0;
> + rec.nr_written[0] = 0;
> + rec.nr_written[1] = 0;
> rec.elapsed = 0;
> /* we use swappiness of local cgroup */
> if (check_soft) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8fb1abd..f73b96e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -719,7 +719,7 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> */
> static unsigned long shrink_page_list(struct list_head *page_list,
> struct zone *zone,
> - struct scan_control *sc)
> + struct scan_control *sc, int file)
> {
> LIST_HEAD(ret_pages);
> LIST_HEAD(free_pages);
> @@ -727,6 +727,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> unsigned long nr_dirty = 0;
> unsigned long nr_congested = 0;
> unsigned long nr_reclaimed = 0;
> + unsigned long nr_written = 0;
>
> cond_resched();
>
> @@ -840,6 +841,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> case PAGE_ACTIVATE:
> goto activate_locked;
> case PAGE_SUCCESS:
> + nr_written++;
> if (PageWriteback(page))
> goto keep_lumpy;
> if (PageDirty(page))
> @@ -958,6 +960,8 @@ keep_lumpy:
> free_page_list(&free_pages);
>
> list_splice(&ret_pages, page_list);
> + if (!scanning_global_lru(sc))
> + sc->memcg_record->nr_written[file] += nr_written;
> count_vm_events(PGACTIVATE, pgactivate);
> return nr_reclaimed;
> }
> @@ -1463,7 +1467,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>
> spin_unlock_irq(&zone->lru_lock);
>
> - nr_reclaimed = shrink_page_list(&page_list, zone, sc);
> + nr_reclaimed = shrink_page_list(&page_list, zone, sc, file);
>
> if (!scanning_global_lru(sc))
> sc->memcg_record->nr_freed[file] += nr_reclaimed;
> @@ -1471,7 +1475,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> /* Check if we should syncronously wait for writeback */
> if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
> set_reclaim_mode(priority, sc, true);
> - nr_reclaimed += shrink_page_list(&page_list, zone, sc);
> + nr_reclaimed += shrink_page_list(&page_list, zone, sc, file);
> }
>
> local_irq_disable();
> --
> 1.7.3.1
>
> Thanks,
> Andrew
>
> On Wed, Jul 13, 2011 at 5:02 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > On Tue, 12 Jul 2011 16:02:02 -0700
> > Andrew Bresticker <abrestic@google.com> wrote:
> >
> > > On Mon, Jul 11, 2011 at 3:30 AM, KAMEZAWA Hiroyuki <
> > > kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > >
> > > >
> > > > This patch is onto mmotm-0710... got bigger than expected ;(
> > > > ==
> > > > [PATCH] add memory.vmscan_stat
> > > >
> > > > The commit log of commit 0ae5e89 "memcg: count the soft_limit reclaim in..."
> > > > says it adds scanning stats to the memory.stat file. But it doesn't, because
> > > > we considered we needed to reach a consensus on such new APIs.
> > > >
> > > > This patch is a trial to add memory.scan_stat. This shows
> > > > - the number of scanned pages (total, anon, file)
> > > > - the number of rotated pages (total, anon, file)
> > > > - the number of freed pages (total, anon, file)
> > > > - the elapsed time (including sleep/pause time)
> > > >
> > > > for both of direct/soft reclaim.
> > > >
> > > > The biggest difference from Ying's original version is that this file
> > > > can be reset by writing to it, as
> > > >
> > > > # echo 0 ...../memory.scan_stat
> > > >
> > > > Example of output is here. This is a result after make -j 6 kernel
> > > > under 300M limit.
> > > >
> > > > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
> > > > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
> > > > scanned_pages_by_limit 9471864
> > > > scanned_anon_pages_by_limit 6640629
> > > > scanned_file_pages_by_limit 2831235
> > > > rotated_pages_by_limit 4243974
> > > > rotated_anon_pages_by_limit 3971968
> > > > rotated_file_pages_by_limit 272006
> > > > freed_pages_by_limit 2318492
> > > > freed_anon_pages_by_limit 962052
> > > > freed_file_pages_by_limit 1356440
> > > > elapsed_ns_by_limit 351386416101
> > > > scanned_pages_by_system 0
> > > > scanned_anon_pages_by_system 0
> > > > scanned_file_pages_by_system 0
> > > > rotated_pages_by_system 0
> > > > rotated_anon_pages_by_system 0
> > > > rotated_file_pages_by_system 0
> > > > freed_pages_by_system 0
> > > > freed_anon_pages_by_system 0
> > > > freed_file_pages_by_system 0
> > > > elapsed_ns_by_system 0
> > > > scanned_pages_by_limit_under_hierarchy 9471864
> > > > scanned_anon_pages_by_limit_under_hierarchy 6640629
> > > > scanned_file_pages_by_limit_under_hierarchy 2831235
> > > > rotated_pages_by_limit_under_hierarchy 4243974
> > > > rotated_anon_pages_by_limit_under_hierarchy 3971968
> > > > rotated_file_pages_by_limit_under_hierarchy 272006
> > > > freed_pages_by_limit_under_hierarchy 2318492
> > > > freed_anon_pages_by_limit_under_hierarchy 962052
> > > > freed_file_pages_by_limit_under_hierarchy 1356440
> > > > elapsed_ns_by_limit_under_hierarchy 351386416101
> > > > scanned_pages_by_system_under_hierarchy 0
> > > > scanned_anon_pages_by_system_under_hierarchy 0
> > > > scanned_file_pages_by_system_under_hierarchy 0
> > > > rotated_pages_by_system_under_hierarchy 0
> > > > rotated_anon_pages_by_system_under_hierarchy 0
> > > > rotated_file_pages_by_system_under_hierarchy 0
> > > > freed_pages_by_system_under_hierarchy 0
> > > > freed_anon_pages_by_system_under_hierarchy 0
> > > > freed_file_pages_by_system_under_hierarchy 0
> > > > elapsed_ns_by_system_under_hierarchy 0
> > > >
> > > >
> > > > total_xxxx is for hierarchy management.
> > > >
> > > > This will be useful for further memcg development and needs to be
> > > > developed before we do any complicated rework of LRU/softlimit
> > > > management.
> > > >
> > > > This patch adds a new struct memcg_scanrecord to the scan_control struct.
> > > > sc->nr_scanned et al. are not designed for exporting information. For
> > > > example, nr_scanned is reset frequently and incremented by 2 when
> > > > scanning mapped pages.
> > > >
> > > > To avoid complexity, I added a new param to scan_control which is for
> > > > exporting scanning scores.
> > > >
> > > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > >
> > > > Changelog:
> > > > - renamed as vmscan_stat
> > > > - handle file/anon
> > > > - added "rotated"
> > > > - changed names of param in vmscan_stat.
> > > > ---
> > > > Documentation/cgroups/memory.txt | 85 +++++++++++++++++++
> > > > include/linux/memcontrol.h | 19 ++++
> > > > include/linux/swap.h | 6 -
> > > > mm/memcontrol.c | 172 +++++++++++++++++++++++++++++++++++++--
> > > > mm/vmscan.c | 39 +++++++-
> > > > 5 files changed, 303 insertions(+), 18 deletions(-)
> > > >
> > > > Index: mmotm-0710/Documentation/cgroups/memory.txt
> > > > ===================================================================
> > > > --- mmotm-0710.orig/Documentation/cgroups/memory.txt
> > > > +++ mmotm-0710/Documentation/cgroups/memory.txt
> > > > @@ -380,7 +380,7 @@ will be charged as a new owner of it.
> > > >
> > > > 5.2 stat file
> > > >
> > > > -memory.stat file includes following statistics
> > > > +5.2.1 memory.stat file includes following statistics
> > > >
> > > > # per-memory cgroup local status
> > > > cache - # of bytes of page cache memory.
> > > > @@ -438,6 +438,89 @@ Note:
> > > > file_mapped is accounted only when the memory cgroup is owner of
> > > > page cache.)
> > > >
> > > > +5.2.2 memory.vmscan_stat
> > > > +
> > > > +memory.vmscan_stat includes statistics for memory scanning, freeing,
> > > > +and reclaiming. The statistics show memory scanning information since
> > > > +memory cgroup creation and can be reset to 0 by writing 0 as
> > > > +
> > > > + #echo 0 > ../memory.vmscan_stat
> > > > +
> > > > +This file contains following statistics.
> > > > +
> > > > +[param]_[file_or_anon]_pages_by_[reason]_[under_hierarchy]
> > > > +[param]_elapsed_ns_by_[reason]_[under_hierarchy]
> > > > +
> > > > +For example,
> > > > +
> > > > + scanned_file_pages_by_limit indicates the number of scanned
> > > > + file pages at vmscan.
> > > > +
> > > > +Now, 3 parameters are supported
> > > > +
> > > > + scanned - the number of pages scanned by vmscan
> > > > + rotated - the number of pages activated at vmscan
> > > > + freed - the number of pages freed by vmscan
> > > > +
> > > > +If "rotated" is high relative to scanned/freed, the memcg seems busy.
> > > > +
> > > > +Now, 2 reasons are supported
> > > > +
> > > > + limit - the memory cgroup's limit
> > > > + system - global memory pressure + softlimit
> > > > + (global memory pressure not under softlimit is not handled now)
> > > > +
> > > > +When under_hierarchy is added at the tail, the number indicates the
> > > > +total memcg scan of itself and its children.
> > > > +
> > > > +elapsed_ns is the elapsed time in nanoseconds. This may include sleep
> > > > +time and does not indicate CPU usage. So, please take this as just
> > > > +showing latency.
> > > > +
> > > > +Here is an example.
> > > > +
> > > > +# cat /cgroup/memory/A/memory.vmscan_stat
> > > > +scanned_pages_by_limit 9471864
> > > > +scanned_anon_pages_by_limit 6640629
> > > > +scanned_file_pages_by_limit 2831235
> > > > +rotated_pages_by_limit 4243974
> > > > +rotated_anon_pages_by_limit 3971968
> > > > +rotated_file_pages_by_limit 272006
> > > > +freed_pages_by_limit 2318492
> > > > +freed_anon_pages_by_limit 962052
> > > > +freed_file_pages_by_limit 1356440
> > > > +elapsed_ns_by_limit 351386416101
> > > > +scanned_pages_by_system 0
> > > > +scanned_anon_pages_by_system 0
> > > > +scanned_file_pages_by_system 0
> > > > +rotated_pages_by_system 0
> > > > +rotated_anon_pages_by_system 0
> > > > +rotated_file_pages_by_system 0
> > > > +freed_pages_by_system 0
> > > > +freed_anon_pages_by_system 0
> > > > +freed_file_pages_by_system 0
> > > > +elapsed_ns_by_system 0
> > > > +scanned_pages_by_limit_under_hierarchy 9471864
> > > > +scanned_anon_pages_by_limit_under_hierarchy 6640629
> > > > +scanned_file_pages_by_limit_under_hierarchy 2831235
> > > > +rotated_pages_by_limit_under_hierarchy 4243974
> > > > +rotated_anon_pages_by_limit_under_hierarchy 3971968
> > > > +rotated_file_pages_by_limit_under_hierarchy 272006
> > > > +freed_pages_by_limit_under_hierarchy 2318492
> > > > +freed_anon_pages_by_limit_under_hierarchy 962052
> > > > +freed_file_pages_by_limit_under_hierarchy 1356440
> > > > +elapsed_ns_by_limit_under_hierarchy 351386416101
> > > > +scanned_pages_by_system_under_hierarchy 0
> > > > +scanned_anon_pages_by_system_under_hierarchy 0
> > > > +scanned_file_pages_by_system_under_hierarchy 0
> > > > +rotated_pages_by_system_under_hierarchy 0
> > > > +rotated_anon_pages_by_system_under_hierarchy 0
> > > > +rotated_file_pages_by_system_under_hierarchy 0
> > > > +freed_pages_by_system_under_hierarchy 0
> > > > +freed_anon_pages_by_system_under_hierarchy 0
> > > > +freed_file_pages_by_system_under_hierarchy 0
> > > > +elapsed_ns_by_system_under_hierarchy 0
> > > > +
> > > > 5.3 swappiness
> > > >
> > > > Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of
> > groups
> > > > only.
> > > > Index: mmotm-0710/include/linux/memcontrol.h
> > > > ===================================================================
> > > > --- mmotm-0710.orig/include/linux/memcontrol.h
> > > > +++ mmotm-0710/include/linux/memcontrol.h
> > > > @@ -39,6 +39,16 @@ extern unsigned long mem_cgroup_isolate_
> > > > struct mem_cgroup *mem_cont,
> > > > int active, int file);
> > > >
> > > > +struct memcg_scanrecord {
> > > > + struct mem_cgroup *mem; /* scanned memory cgroup */
> > > > + struct mem_cgroup *root; /* scan target hierarchy root */
> > > > + int context; /* scanning context (see memcontrol.c) */
> > > > + unsigned long nr_scanned[2]; /* the number of scanned pages */
> > > > + unsigned long nr_rotated[2]; /* the number of rotated pages */
> > > > + unsigned long nr_freed[2]; /* the number of freed pages */
> > > > + unsigned long elapsed; /* nsec of time elapsed while scanning */
> > > > +};
> > > > +
> > > > #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > > > /*
> > > > * All "charge" functions with gfp_mask should use GFP_KERNEL or
> > > > @@ -117,6 +127,15 @@ mem_cgroup_get_reclaim_stat_from_page(st
> > > > extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> > > > struct task_struct *p);
> > > >
> > > > +extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> > > > + gfp_t gfp_mask, bool noswap,
> > > > + struct memcg_scanrecord *rec);
> > > > +extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> > > > + gfp_t gfp_mask, bool noswap,
> > > > + struct zone *zone,
> > > > + struct memcg_scanrecord *rec,
> > > > + unsigned long *nr_scanned);
> > > > +
> > > > #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> > > > extern int do_swap_account;
> > > > #endif
> > > > Index: mmotm-0710/include/linux/swap.h
> > > > ===================================================================
> > > > --- mmotm-0710.orig/include/linux/swap.h
> > > > +++ mmotm-0710/include/linux/swap.h
> > > > @@ -253,12 +253,6 @@ static inline void lru_cache_add_file(st
> > > > /* linux/mm/vmscan.c */
> > > > extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> > > > gfp_t gfp_mask, nodemask_t *mask);
> > > > -extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> > > > - gfp_t gfp_mask, bool noswap);
> > > > -extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> > > > - gfp_t gfp_mask, bool noswap,
> > > > - struct zone *zone,
> > > > - unsigned long *nr_scanned);
> > > > extern int __isolate_lru_page(struct page *page, int mode, int file);
> > > > extern unsigned long shrink_all_memory(unsigned long nr_pages);
> > > > extern int vm_swappiness;
> > > > Index: mmotm-0710/mm/memcontrol.c
> > > > ===================================================================
> > > > --- mmotm-0710.orig/mm/memcontrol.c
> > > > +++ mmotm-0710/mm/memcontrol.c
> > > > @@ -204,6 +204,50 @@ struct mem_cgroup_eventfd_list {
> > > > static void mem_cgroup_threshold(struct mem_cgroup *mem);
> > > > static void mem_cgroup_oom_notify(struct mem_cgroup *mem);
> > > >
> > > > +enum {
> > > > + SCAN_BY_LIMIT,
> > > > + SCAN_BY_SYSTEM,
> > > > + NR_SCAN_CONTEXT,
> > > > + SCAN_BY_SHRINK, /* not recorded now */
> > > > +};
> > > > +
> > > > +enum {
> > > > + SCAN,
> > > > + SCAN_ANON,
> > > > + SCAN_FILE,
> > > > + ROTATE,
> > > > + ROTATE_ANON,
> > > > + ROTATE_FILE,
> > > > + FREED,
> > > > + FREED_ANON,
> > > > + FREED_FILE,
> > > > + ELAPSED,
> > > > + NR_SCANSTATS,
> > > > +};
> > > > +
> > > > +struct scanstat {
> > > > + spinlock_t lock;
> > > > + unsigned long stats[NR_SCAN_CONTEXT][NR_SCANSTATS];
> > > > + unsigned long rootstats[NR_SCAN_CONTEXT][NR_SCANSTATS];
> > > > +};
> > > >
> > >
> > > I'm working on a similar effort with Ying here at Google and so far we've
> > > been using per-cpu counters for these statistics instead of spin-lock
> > > protected counters. Clearly the spin-lock protected counters have less
> > > memory overhead and make reading the stat file faster, but our concern is
> > > that this method is inconsistent with the other memory stat files such as
> > > /proc/vmstat and /dev/cgroup/memory/.../memory.stat. Is there any
> > > particular reason you chose to use spin-lock protected counters instead
> > > of per-cpu counters?
> > >
> >
> > In my experience, if we "batch" enough, it always works better than a
> > percpu counter. A percpu counter is effective when batching is difficult.
> > This patch's implementation does enough batching and is much more
> > coarse-grained than a percpu counter, so this approach is better than percpu.
> >
> >
> > > I've also modified your patch to use per-cpu counters instead of
> > > spin-lock protected counters. I tested it by doing streaming I/O from
> > > a ramdisk:
> > >
> > > $ mke2fs /dev/ram1
> > > $ mkdir /tmp/swapram
> > > $ mkdir /tmp/swapram/ram1
> > > $ mount -t ext2 /dev/ram1 /tmp/swapram/ram1
> > > $ dd if=/dev/urandom of=/tmp/swapram/ram1/file_16m bs=4096 count=4096
> > > $ mkdir /dev/cgroup/memory/1
> > > $ echo 8m > /dev/cgroup/memory/1/memory.limit_in_bytes
> > > $ ./ramdisk_load.sh 7
> > > $ echo $$ > /dev/cgroup/memory/1/tasks
> > > $ time for ((z=0; z<=2000; z++)); do cat /tmp/swapram/ram1/file_16m > /dev/zero; done
> > >
> > > Where ramdisk_load.sh is:
> > > for ((i=0; i<=$1; i++))
> > > do
> > > echo $$ >/dev/cgroup/memory/1/tasks
> > > for ((z=0; z<=2000; z++)); do cat /tmp/swapram/ram1/file_16m > /dev/zero; done &
> > > done
> > >
> > > Surprisingly, the per-cpu counters perform worse than the spin-lock
> > > protected counters. Over 10 runs of the test above, the per-cpu counters
> > > were 1.60% slower in both real time and sys time. I'm wondering if you
> > > have any insight as to why this is. I can provide my diff against your
> > > patch if necessary.
> > >
> >
> > The percpu counter works effectively only when we apply +1/-1 at each
> > change of the counters. It uses "batch" to merge the per-cpu value into
> > the counter. I think you use the default "batch" value, but the
> > scan/rotate/free/elapsed values are always larger than "batch", so you
> > just added memory overhead and an "if" to pure spinlock counters.
> >
> > Determining this "batch" threshold for a percpu counter is difficult.
> >
> > Thanks,
> > -Kame
> >
> >
> >
* Re: [PATCH v2] memcg: add vmscan_stat
2011-07-15 20:28 ` Andrew Bresticker
@ 2011-07-20 6:00 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 9+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-20 6:00 UTC (permalink / raw)
To: Andrew Bresticker
Cc: linux-mm, akpm, nishimura, bsingharora, Michal Hocko, Ying Han
On Fri, 15 Jul 2011 13:28:08 -0700
Andrew Bresticker <abrestic@google.com> wrote:
> And this one tracks the number of pages unmapped:
Hmm, it seems nice to add. I'll include this one when I post the next version.
Thanks,
-Kame
> --
>
> From: Andrew Bresticker <abrestic@google.com>
> Date: Fri, 15 Jul 2011 11:46:40 -0700
> Subject: [PATCH] vmscan: Track pages unmapped during page reclaim.
>
> Record the number of pages unmapped during page reclaim in
> memory.vmscan_stat. Counters are broken down by type and
> context like the other stats in memory.vmscan_stat.
>
> Sample output:
> $ mkdir /dev/cgroup/memory/1
> $ echo 512m > /dev/cgroup/memory/1/memory.limit_in_bytes
> $ echo $$ > /dev/cgroup/memory/1/tasks
> $ pft -m 512m
> $ cat /dev/cgroup/memory/1/memory.vmscan_stat
> ...
> unmapped_pages_by_limit 67
> unmapped_anon_pages_by_limit 0
> unmapped_file_pages_by_limit 67
> ...
> unmapped_pages_by_limit_under_hierarchy 67
> unmapped_anon_pages_by_limit_under_hierarchy 0
> unmapped_file_pages_by_limit_under_hierarchy 67
>
> Signed-off-by: Andrew Bresticker <abrestic@google.com>
> ---
> include/linux/memcontrol.h | 1 +
> mm/memcontrol.c | 12 ++++++++++++
> mm/vmscan.c | 8 ++++++--
> 3 files changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4be907e..8d65b55 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -47,6 +47,7 @@ struct memcg_scanrecord {
> unsigned long nr_rotated[2]; /* the number of rotated pages */
> unsigned long nr_freed[2]; /* the number of freed pages */
> unsigned long nr_written[2]; /* the number of pages written back */
> + unsigned long nr_unmapped[2]; /* the number of pages unmapped */
> unsigned long elapsed; /* nsec of time elapsed while scanning */
> };
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5ec2aa3..6b4fbbd 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -224,6 +224,9 @@ enum {
> WRITTEN,
> WRITTEN_ANON,
> WRITTEN_FILE,
> + UNMAPPED,
> + UNMAPPED_ANON,
> + UNMAPPED_FILE,
> ELAPSED,
> NR_SCANSTATS,
> };
> @@ -247,6 +250,9 @@ const char *scanstat_string[NR_SCANSTATS] = {
> "written_pages",
> "written_anon_pages",
> "written_file_pages",
> + "unmapped_pages",
> + "unmapped_anon_pages",
> + "unmapped_file_pages",
> "elapsed_ns",
> };
> #define SCANSTAT_WORD_LIMIT "_by_limit"
> @@ -1692,6 +1698,10 @@ static void __mem_cgroup_record_scanstat(unsigned long *stats,
> stats[WRITTEN_ANON] += rec->nr_written[0];
> stats[WRITTEN_FILE] += rec->nr_written[1];
>
> + stats[UNMAPPED] += rec->nr_unmapped[0] + rec->nr_unmapped[1];
> + stats[UNMAPPED_ANON] += rec->nr_unmapped[0];
> + stats[UNMAPPED_FILE] += rec->nr_unmapped[1];
> +
> stats[ELAPSED] += rec->elapsed;
> }
>
> @@ -1806,6 +1816,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> rec.nr_freed[1] = 0;
> rec.nr_written[0] = 0;
> rec.nr_written[1] = 0;
> + rec.nr_unmapped[0] = 0;
> + rec.nr_unmapped[1] = 0;
> rec.elapsed = 0;
> /* we use swappiness of local cgroup */
> if (check_soft) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f73b96e..2d2bc99 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -728,6 +728,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> unsigned long nr_congested = 0;
> unsigned long nr_reclaimed = 0;
> unsigned long nr_written = 0;
> + unsigned long nr_unmapped = 0;
>
> cond_resched();
>
> @@ -819,7 +820,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> case SWAP_MLOCK:
> goto cull_mlocked;
> case SWAP_SUCCESS:
> - ; /* try to free the page below */
> + /* try to free the page below */
> + nr_unmapped++;
> }
> }
>
> @@ -960,8 +962,10 @@ keep_lumpy:
> free_page_list(&free_pages);
>
> list_splice(&ret_pages, page_list);
> - if (!scanning_global_lru(sc))
> + if (!scanning_global_lru(sc)) {
> sc->memcg_record->nr_written[file] += nr_written;
> + sc->memcg_record->nr_unmapped[file] += nr_unmapped;
> + }
> count_vm_events(PGACTIVATE, pgactivate);
> return nr_reclaimed;
> }
> --
> 1.7.3.1
>
> Thanks,
> Andrew
>
> On Fri, Jul 15, 2011 at 11:34 AM, Andrew Bresticker <abrestic@google.com>wrote:
>
> > I've extended your patch to track write-back during page reclaim:
> > ---
> >
> > From: Andrew Bresticker <abrestic@google.com>
> > Date: Thu, 14 Jul 2011 17:56:48 -0700
> > Subject: [PATCH] vmscan: Track number of pages written back during page
> > reclaim.
> >
> > This tracks pages written out during page reclaim in memory.vmscan_stat
> > and breaks it down by file vs. anon and context (like "scanned_pages",
> > "rotated_pages", etc.).
> >
> > Example output:
> > $ mkdir /dev/cgroup/memory/1
> > $ echo 8m > /dev/cgroup/memory/1/memory.limit_in_bytes
> > $ echo $$ > /dev/cgroup/memory/1/tasks
> > $ dd if=/dev/urandom of=file_20g bs=4096 count=524288
> > $ cat /dev/cgroup/memory/1/memory.vmscan_stat
> > ...
> > written_pages_by_limit 36
> > written_anon_pages_by_limit 0
> > written_file_pages_by_limit 36
> > ...
> > written_pages_by_limit_under_hierarchy 28
> > written_anon_pages_by_limit_under_hierarchy 0
> > written_file_pages_by_limit_under_hierarchy 28
> >
> > Signed-off-by: Andrew Bresticker <abrestic@google.com>
> > ---
> > include/linux/memcontrol.h | 1 +
> > mm/memcontrol.c | 12 ++++++++++++
> > mm/vmscan.c | 10 +++++++---
> > 3 files changed, 20 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 4b49edf..4be907e 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -46,6 +46,7 @@ struct memcg_scanrecord {
> > unsigned long nr_scanned[2]; /* the number of scanned pages */
> > unsigned long nr_rotated[2]; /* the number of rotated pages */
> > unsigned long nr_freed[2]; /* the number of freed pages */
> > + unsigned long nr_written[2]; /* the number of pages written back */
> > unsigned long elapsed; /* nsec of time elapsed while scanning */
> > };
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 9bb6e93..5ec2aa3 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -221,6 +221,9 @@ enum {
> > FREED,
> > FREED_ANON,
> > FREED_FILE,
> > + WRITTEN,
> > + WRITTEN_ANON,
> > + WRITTEN_FILE,
> > ELAPSED,
> > NR_SCANSTATS,
> > };
> > @@ -241,6 +244,9 @@ const char *scanstat_string[NR_SCANSTATS] = {
> > "freed_pages",
> > "freed_anon_pages",
> > "freed_file_pages",
> > + "written_pages",
> > + "written_anon_pages",
> > + "written_file_pages",
> > "elapsed_ns",
> > };
> > #define SCANSTAT_WORD_LIMIT "_by_limit"
> > @@ -1682,6 +1688,10 @@ static void __mem_cgroup_record_scanstat(unsigned
> > long *stats,
> > stats[FREED_ANON] += rec->nr_freed[0];
> > stats[FREED_FILE] += rec->nr_freed[1];
> >
> > + stats[WRITTEN] += rec->nr_written[0] + rec->nr_written[1];
> > + stats[WRITTEN_ANON] += rec->nr_written[0];
> > + stats[WRITTEN_FILE] += rec->nr_written[1];
> > +
> > stats[ELAPSED] += rec->elapsed;
> > }
> >
> > @@ -1794,6 +1804,8 @@ static int mem_cgroup_hierarchical_reclaim(struct
> > mem_cgroup *root_mem,
> > rec.nr_rotated[1] = 0;
> > rec.nr_freed[0] = 0;
> > rec.nr_freed[1] = 0;
> > + rec.nr_written[0] = 0;
> > + rec.nr_written[1] = 0;
> > rec.elapsed = 0;
> > /* we use swappiness of local cgroup */
> > if (check_soft) {
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 8fb1abd..f73b96e 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -719,7 +719,7 @@ static noinline_for_stack void free_page_list(struct
> > list_head *free_pages)
> > */
> > static unsigned long shrink_page_list(struct list_head *page_list,
> > struct zone *zone,
> > - struct scan_control *sc)
> > + struct scan_control *sc, int file)
> > {
> > LIST_HEAD(ret_pages);
> > LIST_HEAD(free_pages);
> > @@ -727,6 +727,7 @@ static unsigned long shrink_page_list(struct list_head
> > *page_list,
> > unsigned long nr_dirty = 0;
> > unsigned long nr_congested = 0;
> > unsigned long nr_reclaimed = 0;
> > + unsigned long nr_written = 0;
> >
> > cond_resched();
> >
> > @@ -840,6 +841,7 @@ static unsigned long shrink_page_list(struct list_head
> > *page_list,
> > case PAGE_ACTIVATE:
> > goto activate_locked;
> > case PAGE_SUCCESS:
> > + nr_written++;
> > if (PageWriteback(page))
> > goto keep_lumpy;
> > if (PageDirty(page))
> > @@ -958,6 +960,8 @@ keep_lumpy:
> > free_page_list(&free_pages);
> >
> > list_splice(&ret_pages, page_list);
> > + if (!scanning_global_lru(sc))
> > + sc->memcg_record->nr_written[file] += nr_written;
> > count_vm_events(PGACTIVATE, pgactivate);
> > return nr_reclaimed;
> > }
> > @@ -1463,7 +1467,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct
> > zone *zone,
> >
> > spin_unlock_irq(&zone->lru_lock);
> >
> > - nr_reclaimed = shrink_page_list(&page_list, zone, sc);
> > + nr_reclaimed = shrink_page_list(&page_list, zone, sc, file);
> >
> > if (!scanning_global_lru(sc))
> > sc->memcg_record->nr_freed[file] += nr_reclaimed;
> > @@ -1471,7 +1475,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct
> > zone *zone,
> > /* Check if we should syncronously wait for writeback */
> > if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
> > set_reclaim_mode(priority, sc, true);
> > - nr_reclaimed += shrink_page_list(&page_list, zone, sc);
> > + nr_reclaimed += shrink_page_list(&page_list, zone, sc, file);
> > }
> >
> > local_irq_disable();
> > --
> > 1.7.3.1
> >
> > Thanks,
> > Andrew
> >
> > On Wed, Jul 13, 2011 at 5:02 PM, KAMEZAWA Hiroyuki <
> > kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> >> On Tue, 12 Jul 2011 16:02:02 -0700
> >> Andrew Bresticker <abrestic@google.com> wrote:
> >>
> >> > On Mon, Jul 11, 2011 at 3:30 AM, KAMEZAWA Hiroyuki <
> >> > kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> >
> >> > >
> >> > > This patch is onto mmotm-0710... got bigger than expected ;(
> >> > > ==
> >> > > [PATCH] add memory.vmscan_stat
> >> > >
> >> > > commit log of commit 0ae5e89 " memcg: count the soft_limit reclaim
> >> in..."
> >> > > says it adds scanning stats to memory.stat file. But it doesn't
> >> because
> >> > > we considered we needed to make a concensus for such new APIs.
> >> > >
> >> > > This patch is a trial to add memory.scan_stat. This shows
> >> > > - the number of scanned pages(total, anon, file)
> >> > > - the number of rotated pages(total, anon, file)
> >> > > - the number of freed pages(total, anon, file)
> >> > > - the number of elaplsed time (including sleep/pause time)
> >> > >
> >> > > for both of direct/soft reclaim.
> >> > >
> >> > > The biggest difference with oringinal Ying's one is that this file
> >> > > can be reset by some write, as
> >> > >
> >> > > # echo 0 ...../memory.scan_stat
> >> > >
> >> > > Example of output is here. This is a result after make -j 6 kernel
> >> > > under 300M limit.
> >> > >
> >> > > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
> >> > > [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
> >> > > scanned_pages_by_limit 9471864
> >> > > scanned_anon_pages_by_limit 6640629
> >> > > scanned_file_pages_by_limit 2831235
> >> > > rotated_pages_by_limit 4243974
> >> > > rotated_anon_pages_by_limit 3971968
> >> > > rotated_file_pages_by_limit 272006
> >> > > freed_pages_by_limit 2318492
> >> > > freed_anon_pages_by_limit 962052
> >> > > freed_file_pages_by_limit 1356440
> >> > > elapsed_ns_by_limit 351386416101
> >> > > scanned_pages_by_system 0
> >> > > scanned_anon_pages_by_system 0
> >> > > scanned_file_pages_by_system 0
> >> > > rotated_pages_by_system 0
> >> > > rotated_anon_pages_by_system 0
> >> > > rotated_file_pages_by_system 0
> >> > > freed_pages_by_system 0
> >> > > freed_anon_pages_by_system 0
> >> > > freed_file_pages_by_system 0
> >> > > elapsed_ns_by_system 0
> >> > > scanned_pages_by_limit_under_hierarchy 9471864
> >> > > scanned_anon_pages_by_limit_under_hierarchy 6640629
> >> > > scanned_file_pages_by_limit_under_hierarchy 2831235
> >> > > rotated_pages_by_limit_under_hierarchy 4243974
> >> > > rotated_anon_pages_by_limit_under_hierarchy 3971968
> >> > > rotated_file_pages_by_limit_under_hierarchy 272006
> >> > > freed_pages_by_limit_under_hierarchy 2318492
> >> > > freed_anon_pages_by_limit_under_hierarchy 962052
> >> > > freed_file_pages_by_limit_under_hierarchy 1356440
> >> > > elapsed_ns_by_limit_under_hierarchy 351386416101
> >> > > scanned_pages_by_system_under_hierarchy 0
> >> > > scanned_anon_pages_by_system_under_hierarchy 0
> >> > > scanned_file_pages_by_system_under_hierarchy 0
> >> > > rotated_pages_by_system_under_hierarchy 0
> >> > > rotated_anon_pages_by_system_under_hierarchy 0
> >> > > rotated_file_pages_by_system_under_hierarchy 0
> >> > > freed_pages_by_system_under_hierarchy 0
> >> > > freed_anon_pages_by_system_under_hierarchy 0
> >> > > freed_file_pages_by_system_under_hierarchy 0
> >> > > elapsed_ns_by_system_under_hierarchy 0
> >> > >
> >> > >
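[Editor's note: the counters above are plain "name value" lines, so a trivial reader suffices. Here is a minimal, hypothetical userspace sketch (not part of the patch; the function name is invented) that looks up one counter in such output:]

```c
#include <stdio.h>
#include <string.h>

/* Scan "name value" lines (the memory.vmscan_stat format) for one key.
 * Returns 0 and stores the value on success, -1 if the key is absent. */
static int vmscan_stat_lookup(FILE *fp, const char *key, unsigned long *val)
{
    char name[64];
    unsigned long v;

    rewind(fp);
    while (fscanf(fp, "%63s %lu", name, &v) == 2) {
        if (strcmp(name, key) == 0) {
            *val = v;
            return 0;
        }
    }
    return -1;
}
```

In practice the caller would fopen() the cgroup file; a FILE * is used here so the helper is testable against any stream.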
> >> > > The xxxx_under_hierarchy entries are for hierarchy management.
> >> > >
> >> > > This will be useful for further memcg development and needs to be
> >> > > developed before we do some complicated rework of LRU/softlimit
> >> > > management.
> >> > >
> >> > > This patch adds a new struct memcg_scanrecord into the scan_control
> >> > > struct. sc->nr_scanned et al. are not designed for exporting
> >> > > information; for example, nr_scanned is reset frequently and is
> >> > > incremented by 2 when scanning mapped pages.
> >> > >
> >> > > To avoid that complexity, I added a new member to scan_control which
> >> > > is used only for exporting scan statistics.
> >> > >
> >> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> > >
> >> > > Changelog:
> >> > > - renamed as vmscan_stat
> >> > > - handle file/anon
> >> > > - added "rotated"
> >> > > - changed names of param in vmscan_stat.
> >> > > ---
> >> > > Documentation/cgroups/memory.txt | 85 +++++++++++++++++++
> >> > > include/linux/memcontrol.h | 19 ++++
> >> > > include/linux/swap.h | 6 -
> >> > > mm/memcontrol.c | 172 +++++++++++++++++++++++++++++++++++++--
> >> > > mm/vmscan.c | 39 +++++++-
> >> > > 5 files changed, 303 insertions(+), 18 deletions(-)
> >> > >
> >> > > Index: mmotm-0710/Documentation/cgroups/memory.txt
> >> > > ===================================================================
> >> > > --- mmotm-0710.orig/Documentation/cgroups/memory.txt
> >> > > +++ mmotm-0710/Documentation/cgroups/memory.txt
> >> > > @@ -380,7 +380,7 @@ will be charged as a new owner of it.
> >> > >
> >> > > 5.2 stat file
> >> > >
> >> > > -memory.stat file includes following statistics
> >> > > +5.2.1 memory.stat file includes following statistics
> >> > >
> >> > > # per-memory cgroup local status
> >> > > cache - # of bytes of page cache memory.
> >> > > @@ -438,6 +438,89 @@ Note:
> >> > > file_mapped is accounted only when the memory cgroup is owner
> >> of
> >> > > page
> >> > > cache.)
> >> > >
> >> > > +5.2.2 memory.vmscan_stat
> >> > > +
> >> > > +memory.vmscan_stat contains statistics about memory scanning,
> >> > > +freeing, and reclaiming. The statistics cover memory scanning
> >> > > +activity since the memory cgroup was created, and can be reset
> >> > > +to 0 by writing 0:
> >> > > +
> >> > > + #echo 0 > ../memory.vmscan_stat
> >> > > +
> >> > > +This file contains the following statistics.
> >> > > +
> >> > > +[param]_[file_or_anon]_pages_by_[reason]_[under_hierarchy]
> >> > > +[param]_elapsed_ns_by_[reason]_[under_hierarchy]
> >> > > +
> >> > > +For example,
> >> > > +
> >> > > + scanned_file_pages_by_limit indicates the number of file pages
> >> > > + scanned by vmscan.
> >> > > +
> >> > > +Now, 3 parameters are supported:
> >> > > +
> >> > > + scanned - the number of pages scanned by vmscan
> >> > > + rotated - the number of pages activated by vmscan
> >> > > + freed - the number of pages freed by vmscan
> >> > > +
> >> > > +If "rotated" is high relative to scanned/freed, the memcg is busy.
> >> > > +
> >> > > +Now, 2 reasons are supported:
> >> > > +
> >> > > + limit - the memory cgroup's limit
> >> > > + system - global memory pressure + softlimit
> >> > > + (global memory pressure not under softlimit is not handled now)
> >> > > +
> >> > > +When under_hierarchy is appended at the tail, the number indicates
> >> > > +the total scans of the memcg itself and its children.
> >> > > +
> >> > > +elapsed_ns is elapsed time in nanoseconds. This may include sleep
> >> > > +time and does not indicate CPU usage, so please take it as just
> >> > > +showing latency.
> >> > > +
> >> > > +Here is an example.
> >> > > +
> >> > > +# cat /cgroup/memory/A/memory.vmscan_stat
> >> > > +scanned_pages_by_limit 9471864
> >> > > +scanned_anon_pages_by_limit 6640629
> >> > > +scanned_file_pages_by_limit 2831235
> >> > > +rotated_pages_by_limit 4243974
> >> > > +rotated_anon_pages_by_limit 3971968
> >> > > +rotated_file_pages_by_limit 272006
> >> > > +freed_pages_by_limit 2318492
> >> > > +freed_anon_pages_by_limit 962052
> >> > > +freed_file_pages_by_limit 1356440
> >> > > +elapsed_ns_by_limit 351386416101
> >> > > +scanned_pages_by_system 0
> >> > > +scanned_anon_pages_by_system 0
> >> > > +scanned_file_pages_by_system 0
> >> > > +rotated_pages_by_system 0
> >> > > +rotated_anon_pages_by_system 0
> >> > > +rotated_file_pages_by_system 0
> >> > > +freed_pages_by_system 0
> >> > > +freed_anon_pages_by_system 0
> >> > > +freed_file_pages_by_system 0
> >> > > +elapsed_ns_by_system 0
> >> > > +scanned_pages_by_limit_under_hierarchy 9471864
> >> > > +scanned_anon_pages_by_limit_under_hierarchy 6640629
> >> > > +scanned_file_pages_by_limit_under_hierarchy 2831235
> >> > > +rotated_pages_by_limit_under_hierarchy 4243974
> >> > > +rotated_anon_pages_by_limit_under_hierarchy 3971968
> >> > > +rotated_file_pages_by_limit_under_hierarchy 272006
> >> > > +freed_pages_by_limit_under_hierarchy 2318492
> >> > > +freed_anon_pages_by_limit_under_hierarchy 962052
> >> > > +freed_file_pages_by_limit_under_hierarchy 1356440
> >> > > +elapsed_ns_by_limit_under_hierarchy 351386416101
> >> > > +scanned_pages_by_system_under_hierarchy 0
> >> > > +scanned_anon_pages_by_system_under_hierarchy 0
> >> > > +scanned_file_pages_by_system_under_hierarchy 0
> >> > > +rotated_pages_by_system_under_hierarchy 0
> >> > > +rotated_anon_pages_by_system_under_hierarchy 0
> >> > > +rotated_file_pages_by_system_under_hierarchy 0
> >> > > +freed_pages_by_system_under_hierarchy 0
> >> > > +freed_anon_pages_by_system_under_hierarchy 0
> >> > > +freed_file_pages_by_system_under_hierarchy 0
> >> > > +elapsed_ns_by_system_under_hierarchy 0
> >> > > +
> >> > > 5.3 swappiness
> >> > >
> >> > > Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of
> >> groups
> >> > > only.
> >> > > Index: mmotm-0710/include/linux/memcontrol.h
> >> > > ===================================================================
> >> > > --- mmotm-0710.orig/include/linux/memcontrol.h
> >> > > +++ mmotm-0710/include/linux/memcontrol.h
> >> > > @@ -39,6 +39,16 @@ extern unsigned long mem_cgroup_isolate_
> >> > > struct mem_cgroup *mem_cont,
> >> > > int active, int file);
> >> > >
> >> > > +struct memcg_scanrecord {
> >> > > +	struct mem_cgroup *mem; /* scanned memory cgroup */
> >> > > +	struct mem_cgroup *root; /* scan target hierarchy root */
> >> > > +	int context; /* scanning context (see memcontrol.c) */
> >> > > +	unsigned long nr_scanned[2]; /* the number of scanned pages */
> >> > > +	unsigned long nr_rotated[2]; /* the number of rotated pages */
> >> > > +	unsigned long nr_freed[2]; /* the number of freed pages */
> >> > > +	unsigned long elapsed; /* nsec of time elapsed while scanning */
> >> > > +};
> >> > > +
> >> > > #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> >> > > /*
> >> > > * All "charge" functions with gfp_mask should use GFP_KERNEL or
> >> > > @@ -117,6 +127,15 @@ mem_cgroup_get_reclaim_stat_from_page(st
> >> > > extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> >> > > struct task_struct *p);
> >> > >
> >> > > +extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> >> > > +					gfp_t gfp_mask, bool noswap,
> >> > > +					struct memcg_scanrecord *rec);
> >> > > +extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> >> > > +					gfp_t gfp_mask, bool noswap,
> >> > > +					struct zone *zone,
> >> > > +					struct memcg_scanrecord *rec,
> >> > > +					unsigned long *nr_scanned);
> >> > > +
> >> > > #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> >> > > extern int do_swap_account;
> >> > > #endif
> >> > > Index: mmotm-0710/include/linux/swap.h
> >> > > ===================================================================
> >> > > --- mmotm-0710.orig/include/linux/swap.h
> >> > > +++ mmotm-0710/include/linux/swap.h
> >> > > @@ -253,12 +253,6 @@ static inline void lru_cache_add_file(st
> >> > > /* linux/mm/vmscan.c */
> >> > > extern unsigned long try_to_free_pages(struct zonelist *zonelist,
> >> > > 					int order, gfp_t gfp_mask,
> >> > > 					nodemask_t *mask);
> >> > > -extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> >> > > -					gfp_t gfp_mask, bool noswap);
> >> > > -extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> >> > > -					gfp_t gfp_mask, bool noswap,
> >> > > -					struct zone *zone,
> >> > > -					unsigned long *nr_scanned);
> >> > > extern int __isolate_lru_page(struct page *page, int mode, int file);
> >> > > extern unsigned long shrink_all_memory(unsigned long nr_pages);
> >> > > extern int vm_swappiness;
> >> > > Index: mmotm-0710/mm/memcontrol.c
> >> > > ===================================================================
> >> > > --- mmotm-0710.orig/mm/memcontrol.c
> >> > > +++ mmotm-0710/mm/memcontrol.c
> >> > > @@ -204,6 +204,50 @@ struct mem_cgroup_eventfd_list {
> >> > > static void mem_cgroup_threshold(struct mem_cgroup *mem);
> >> > > static void mem_cgroup_oom_notify(struct mem_cgroup *mem);
> >> > >
> >> > > +enum {
> >> > > + SCAN_BY_LIMIT,
> >> > > + SCAN_BY_SYSTEM,
> >> > > + NR_SCAN_CONTEXT,
> >> > > + SCAN_BY_SHRINK, /* not recorded now */
> >> > > +};
> >> > > +
> >> > > +enum {
> >> > > + SCAN,
> >> > > + SCAN_ANON,
> >> > > + SCAN_FILE,
> >> > > + ROTATE,
> >> > > + ROTATE_ANON,
> >> > > + ROTATE_FILE,
> >> > > + FREED,
> >> > > + FREED_ANON,
> >> > > + FREED_FILE,
> >> > > + ELAPSED,
> >> > > + NR_SCANSTATS,
> >> > > +};
> >> > > +
> >> > > +struct scanstat {
> >> > > + spinlock_t lock;
> >> > > + unsigned long stats[NR_SCAN_CONTEXT][NR_SCANSTATS];
> >> > > + unsigned long rootstats[NR_SCAN_CONTEXT][NR_SCANSTATS];
> >> > > +};
> >> > >
> >> >
> >> > I'm working on a similar effort with Ying here at Google, and so far
> >> > we've been using per-cpu counters for these statistics instead of
> >> > spin-lock-protected counters. Clearly the spin-lock-protected counters
> >> > have less memory overhead and make reading the stat file faster, but
> >> > our concern is that this method is inconsistent with the other memory
> >> > stat files such as /proc/vmstat and /dev/cgroup/memory/.../memory.stat.
> >> > Is there any particular reason you chose spin-lock-protected counters
> >> > instead of per-cpu counters?
> >> >
> >>
> >> In my experience, if we "batch" enough, it always works better than a
> >> percpu counter. A percpu counter is effective when batching is difficult.
> >> This patch's implementation does enough batching and is much more
> >> coarse-grained than a percpu counter, so it performs better than percpu.
> >>
> >>
> >> > I've also modified your patch to use per-cpu counters instead of
> >> > spin-lock-protected counters. I tested it by doing streaming I/O
> >> > from a ramdisk:
> >> >
> >> > $ mke2fs /dev/ram1
> >> > $ mkdir /tmp/swapram
> >> > $ mkdir /tmp/swapram/ram1
> >> > $ mount -t ext2 /dev/ram1 /tmp/swapram/ram1
> >> > $ dd if=/dev/urandom of=/tmp/swapram/ram1/file_16m bs=4096 count=4096
> >> > $ mkdir /dev/cgroup/memory/1
> >> > $ echo 8m > /dev/cgroup/memory/1/memory.limit_in_bytes
> >> > $ ./ramdisk_load.sh 7
> >> > $ echo $$ > /dev/cgroup/memory/1/tasks
> >> > $ time for ((z=0; z<=2000; z++)); do cat /tmp/swapram/ram1/file_16m >
> >> > /dev/zero; done
> >> >
> >> > Where ramdisk_load.sh is:
> >> > for ((i=0; i<=$1; i++))
> >> > do
> >> > echo $$ >/dev/cgroup/memory/1/tasks
> >> > for ((z=0; z<=2000; z++)); do cat /tmp/swapram/ram1/file_16m >
> >> /dev/zero;
> >> > done &
> >> > done
> >> >
> >> > Surprisingly, the per-cpu counters perform worse than the
> >> > spin-lock-protected counters. Over 10 runs of the test above, the
> >> > per-cpu counters were 1.60% slower in both real time and sys time.
> >> > I'm wondering if you have any insight as to why this is. I can
> >> > provide my diff against your patch if necessary.
> >> >
> >>
> >> A percpu counter works effectively only when we apply +1/-1 at each
> >> change of the counter. It uses "batch" to decide when to merge the
> >> per-cpu value into the shared counter. I think you used the default
> >> "batch" value, but the scan/rotate/free/elapsed deltas are always
> >> larger than "batch", so you just added memory overhead and an "if"
> >> on top of a pure spinlock-protected counter.
> >>
> >> Determining this "batch" threshold for a percpu counter is difficult.
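[Editor's note: a simplified model of a percpu counter's "batch" (this mimics the idea behind percpu_counter_add(), not the kernel's actual code). A delta smaller than batch stays in a lock-free per-CPU slot; once a slot reaches batch it is flushed to the shared count under the lock, so vmscan-sized deltas hit the lock on every call anyway:]

```c
/* Toy percpu counter: fast path updates a per-CPU slot, slow path
 * flushes to the shared count.  lock_trips counts how often the
 * slow (locked) path runs; the real code takes a spinlock there. */

#define NR_CPUS 4

struct pcpu_counter {
    long count;                 /* shared total; lock-protected in real code */
    long pcpu[NR_CPUS];         /* per-CPU deltas */
    long batch;                 /* flush threshold */
    unsigned long lock_trips;   /* instrumentation: locked-path entries */
};

static void pcpu_add(struct pcpu_counter *c, int cpu, long amount)
{
    long v = c->pcpu[cpu] + amount;

    if (v >= c->batch || v <= -c->batch) {
        /* slow path: take the shared lock and flush the slot */
        c->lock_trips++;
        c->count += v;
        c->pcpu[cpu] = 0;
    } else {
        c->pcpu[cpu] = v;       /* fast path: no lock taken */
    }
}
```

With single-page (+1) updates the fast path dominates; with per-pass deltas of hundreds of pages every call exceeds batch and locks, which matches the slowdown Andrew measured.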
> >>
> >> Thanks,
> >> -Kame
> >>
> >>
> >>
> >
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: email@kvack.org
* Re: [PATCH v2] memcg: add vmscan_stat
2011-07-18 21:00 ` Andrew Bresticker
@ 2011-07-20 6:03 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 9+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-20 6:03 UTC (permalink / raw)
To: Andrew Bresticker
Cc: linux-mm, akpm, nishimura, bsingharora, Michal Hocko, Ying Han
On Mon, 18 Jul 2011 14:00:32 -0700
Andrew Bresticker <abrestic@google.com> wrote:
> On Mon, Jul 11, 2011 at 3:30 AM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
spin_unlock_irq(&zone->lru_lock);
> > @@ -1350,6 +1353,10 @@ static noinline_for_stack void update_is
> >
> > reclaim_stat->recent_scanned[0] += *nr_anon;
> > reclaim_stat->recent_scanned[1] += *nr_file;
> > + if (!scanning_global_lru(sc)) {
> > + sc->memcg_record->nr_scanned[0] += *nr_anon;
> > + sc->memcg_record->nr_scanned[1] += *nr_file;
> > + }
> > }
> >
> > /*
> > @@ -1457,6 +1464,9 @@ shrink_inactive_list(unsigned long nr_to
> >
> > nr_reclaimed = shrink_page_list(&page_list, zone, sc);
> >
> > + if (!scanning_global_lru(sc))
> > + sc->memcg_record->nr_freed[file] += nr_reclaimed;
> > +
> >
>
> Can't we stall for writeback? If so, we may call shrink_page_list() again
> below. The accounting should probably go after that instead.
>
you're right. I'll fix this.
Thank you.
-Kame
Thread overview: 9+ messages
2011-07-11 10:30 [PATCH v2] memcg: add vmscan_stat KAMEZAWA Hiroyuki
2011-07-12 23:02 ` Andrew Bresticker
2011-07-14 0:02 ` KAMEZAWA Hiroyuki
2011-07-15 18:34 ` Andrew Bresticker
2011-07-15 20:28 ` Andrew Bresticker
2011-07-20 6:00 ` KAMEZAWA Hiroyuki
2011-07-20 5:58 ` KAMEZAWA Hiroyuki
2011-07-18 21:00 ` Andrew Bresticker
2011-07-20 6:03 ` KAMEZAWA Hiroyuki