From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 2C5B2900137 for ; Thu, 11 Aug 2011 16:39:53 -0400 (EDT)
Received: from wpaz5.hot.corp.google.com (wpaz5.hot.corp.google.com [172.24.198.69]) by smtp-out.google.com with ESMTP id p7BKdmSU020236 for ; Thu, 11 Aug 2011 13:39:48 -0700
Received: from qwc23 (qwc23.prod.google.com [10.241.193.151]) by wpaz5.hot.corp.google.com with ESMTP id p7BKaNSp028780 (version=TLSv1/SSLv3 cipher=RC4-SHA bits=128 verify=NOT) for ; Thu, 11 Aug 2011 13:39:47 -0700
Received: by qwc23 with SMTP id 23so1634886qwc.31 for ; Thu, 11 Aug 2011 13:39:47 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <1306909519-7286-3-git-send-email-hannes@cmpxchg.org>
References: <1306909519-7286-1-git-send-email-hannes@cmpxchg.org> <1306909519-7286-3-git-send-email-hannes@cmpxchg.org>
Date: Thu, 11 Aug 2011 13:39:45 -0700
Message-ID: 
Subject: Re: [patch 2/8] mm: memcg-aware global reclaim
From: Ying Han 
Content-Type: multipart/alternative; boundary=0016368321161db04904aa40cc96
Sender: owner-linux-mm@kvack.org
List-ID: 
To: Johannes Weiner 
Cc: KAMEZAWA Hiroyuki , Daisuke Nishimura , Balbir Singh , Michal Hocko , Andrew Morton , Rik van Riel , Minchan Kim , KOSAKI Motohiro , Mel Gorman , Greg Thelen , Michel Lespinasse , linux-mm@kvack.org, linux-kernel@vger.kernel.org

--0016368321161db04904aa40cc96
Content-Type: text/plain; charset=ISO-8859-1

On Tue, May 31, 2011 at 11:25 PM, Johannes Weiner wrote:
> When a memcg hits its hard limit, hierarchical target reclaim is
> invoked, which goes through all contributing memcgs in the hierarchy
> below the offending memcg and reclaims from the respective per-memcg
> lru lists. This distributes pressure fairly among all involved
> memcgs, and pages are aged with respect to their list buddies.
>
> When global memory pressure arises, however, all this is dropped
> overboard.
> Pages are reclaimed based on global lru lists that have
> nothing to do with container-internal age, and some memcgs may be
> reclaimed from much more than others.
>
> This patch makes traditional global reclaim consider container
> boundaries and no longer scan the global lru lists. For each zone
> scanned, the memcg hierarchy is walked and pages are reclaimed from
> the per-memcg lru lists of the respective zone. For now, the
> hierarchy walk is bounded to one full round-trip through the
> hierarchy, or if the number of reclaimed pages reach the overall
> reclaim target, whichever comes first.
>
> Conceptually, global memory pressure is then treated as if the root
> memcg had hit its limit. Since all existing memcgs contribute to the
> usage of the root memcg, global reclaim is nothing more than target
> reclaim starting from the root memcg. The code is mostly the same for
> both cases, except for a few heuristics and statistics that do not
> always apply. They are distinguished by a newly introduced
> global_reclaim() primitive.
>
> One implication of this change is that pages have to be linked to the
> lru lists of the root memcg again, which could be optimized away with
> the old scheme. The costs are not measurable, though, even with
> worst-case microbenchmarks.
>
> As global reclaim no longer relies on global lru lists, this change is
> also in preparation to remove those completely.
>
> Signed-off-by: Johannes Weiner
> ---
> include/linux/memcontrol.h | 15 ++++
> mm/memcontrol.c | 176
> ++++++++++++++++++++++++++++----------------
> mm/vmscan.c | 121 ++++++++++++++++++++++--------
> 3 files changed, 218 insertions(+), 94 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 5e9840f5..332b0a6 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -101,6 +101,10 @@ mem_cgroup_prepare_migration(struct page *page,
> extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
> struct page *oldpage, struct page *newpage, bool migration_ok);
>
> +struct mem_cgroup *mem_cgroup_hierarchy_walk(struct mem_cgroup *,
> + struct mem_cgroup *);
> +void mem_cgroup_stop_hierarchy_walk(struct mem_cgroup *, struct mem_cgroup
> *);
> +
> /*
> * For memory reclaim.
> */
> @@ -321,6 +325,17 @@ mem_cgroup_get_reclaim_stat_from_page(struct page
> *page)
> return NULL;
> }
>
> +static inline struct mem_cgroup *mem_cgroup_hierarchy_walk(struct
> mem_cgroup *r,
> + struct
> mem_cgroup *m)
> +{
> + return NULL;
> +}
> +
> +static inline void mem_cgroup_stop_hierarchy_walk(struct mem_cgroup *r,
> + struct mem_cgroup *m)
> +{
> +}
> +
> static inline void
> mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
> {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index bf5ab87..850176e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -313,8 +313,8 @@ static bool move_file(void)
> }
>
> /*
> - * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
> - * limit reclaim to prevent infinite loops, if they ever occur.
> */
> #define MEM_CGROUP_MAX_RECLAIM_LOOPS (100)
> #define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2)
> @@ -340,7 +340,7 @@ enum charge_type {
> #define OOM_CONTROL (0)
>
> /*
> - * Reclaim flags for mem_cgroup_hierarchical_reclaim
> + * Reclaim flags
> */
> #define MEM_CGROUP_RECLAIM_NOSWAP_BIT 0x0
> #define MEM_CGROUP_RECLAIM_NOSWAP (1 <<
> MEM_CGROUP_RECLAIM_NOSWAP_BIT)
> @@ -846,8 +846,6 @@ void mem_cgroup_del_lru_list(struct page *page, enum
> lru_list lru)
> mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
> /* huge page split is done under lru_lock. so, we have no races. */
> MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
> - if (mem_cgroup_is_root(pc->mem_cgroup))
> - return;
> VM_BUG_ON(list_empty(&pc->lru));
> list_del_init(&pc->lru);
> }
> @@ -872,13 +870,11 @@ void mem_cgroup_rotate_reclaimable_page(struct page
> *page)
> return;
>
> pc = lookup_page_cgroup(page);
> - /* unused or root page is not rotated. */
> + /* unused page is not rotated. */
> if (!PageCgroupUsed(pc))
> return;
> /* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
> smp_rmb();
> - if (mem_cgroup_is_root(pc->mem_cgroup))
> - return;
> mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
> list_move_tail(&pc->lru, &mz->lists[lru]);
> }
> @@ -892,13 +888,11 @@ void mem_cgroup_rotate_lru_list(struct page *page,
> enum lru_list lru)
> return;
>
> pc = lookup_page_cgroup(page);
> - /* unused or root page is not rotated. */
> + /* unused page is not rotated. */
> if (!PageCgroupUsed(pc))
> return;
> /* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
> smp_rmb();
> - if (mem_cgroup_is_root(pc->mem_cgroup))
> - return;
> mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
> list_move(&pc->lru, &mz->lists[lru]);
> }
> @@ -920,8 +914,6 @@ void mem_cgroup_add_lru_list(struct page *page, enum
> lru_list lru)
> /* huge page split is done under lru_lock. so, we have no races. */
> MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
> SetPageCgroupAcctLRU(pc);
> - if (mem_cgroup_is_root(pc->mem_cgroup))
> - return;
> list_add(&pc->lru, &mz->lists[lru]);
> }
>
> @@ -1381,6 +1373,97 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
> return min(limit, memsw);
> }
>
> +/**
> + * mem_cgroup_hierarchy_walk - iterate over a memcg hierarchy
> + * @root: starting point of the hierarchy
> + * @prev: previous position or NULL
> + *
> + * Caller must hold a reference to @root. While this function will
> + * return @root as part of the walk, it will never increase its
> + * reference count.
> + *
> + * Caller must clean up with mem_cgroup_stop_hierarchy_walk() when it
> + * stops the walk potentially before the full round trip.
> + */
> +struct mem_cgroup *mem_cgroup_hierarchy_walk(struct mem_cgroup *root,
> + struct mem_cgroup *prev)
> +{
> + struct mem_cgroup *mem;
> +
> + if (mem_cgroup_disabled())
> + return NULL;
> +
> + if (!root)
> + root = root_mem_cgroup;
> + /*
> + * Even without hierarchy explicitely enabled in the root
> + * memcg, it is the ultimate parent of all memcgs.
> + */
> + if (!(root == root_mem_cgroup || root->use_hierarchy))
> + return root;
> + if (prev && prev != root)
> + css_put(&prev->css);
> + do {
> + int id = root->last_scanned_child;
> + struct cgroup_subsys_state *css;
> +
> + rcu_read_lock();
> + css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css,
> &id);
> + if (css && (css == &root->css || css_tryget(css)))
> + mem = container_of(css, struct mem_cgroup, css);
> + rcu_read_unlock();
> + if (!css)
> + id = 0;
> + root->last_scanned_child = id;
> + } while (!mem);
> + return mem;
> +}
> +
> +/**
> + * mem_cgroup_stop_hierarchy_walk - clean up after partial hierarchy walk
> + * @root: starting point in the hierarchy
> + * @mem: last position during the walk
> + */
> +void mem_cgroup_stop_hierarchy_walk(struct mem_cgroup *root,
> + struct mem_cgroup *mem)
> +{
> + if (mem && mem != root)
> + css_put(&mem->css);
> +}
> +
> +static unsigned long mem_cgroup_reclaim(struct mem_cgroup *mem,
> + gfp_t gfp_mask,
> + unsigned long flags)
> +{
> + unsigned long total = 0;
> + bool noswap = false;
> + int loop;
> +
> + if ((flags & MEM_CGROUP_RECLAIM_NOSWAP) || mem->memsw_is_minimum)
> + noswap = true;
> + for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
> + drain_all_stock_async();
> + total += try_to_free_mem_cgroup_pages(mem, gfp_mask,
> noswap,
> + get_swappiness(mem));
> + /*
> + * Avoid freeing too much when shrinking to resize the
> + * limit. XXX: Shouldn't the margin check be enough?
> + */
> + if (total && (flags & MEM_CGROUP_RECLAIM_SHRINK))
> + break;
> + if (mem_cgroup_margin(mem))
> + break;
> + /*
> + * If we have not been able to reclaim anything after
> + * two reclaim attempts, there may be no reclaimable
> + * pages in this hierarchy.
> + */
> + if (loop && !total)
> + break;
> + }
> + return total;
> +}
> +
> /*
> * Visit the first child (need not be the first child as per the ordering
> * of the cgroup list, since we track last_scanned_child) of @mem and use
> @@ -1418,29 +1501,14 @@ mem_cgroup_select_victim(struct mem_cgroup
> *root_mem)
> return ret;
> }
>
> -/*
> - * Scan the hierarchy if needed to reclaim memory. We remember the last
> child
> - * we reclaimed from, so that we don't end up penalizing one child
> extensively
> - * based on its position in the children list.
> - *
> - * root_mem is the original ancestor that we've been reclaim from.
> - *
> - * We give up and return to the caller when we visit root_mem twice.
> - * (other groups can be removed while we're walking....)
> - *
> - * If shrink==true, for avoiding to free too much, this returns
> immedieately.
> - */
> -static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> - struct zone *zone,
> - gfp_t gfp_mask,
> - unsigned long
> reclaim_options)
> +static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_mem,
> + struct zone *zone,
> + gfp_t gfp_mask)
> {
> struct mem_cgroup *victim;
> int ret, total = 0;
> int loop = 0;
> - bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
> - bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
> - bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
> + bool noswap = false;
> unsigned long excess;
>
> excess = res_counter_soft_limit_excess(&root_mem->res) >>
> PAGE_SHIFT;
> @@ -1461,7 +1529,7 @@ static int mem_cgroup_hierarchical_reclaim(struct
> mem_cgroup *root_mem,
> * anything, it might because there are
> * no reclaimable pages under this hierarchy
> */
> - if (!check_soft || !total) {
> + if (!total) {
> css_put(&victim->css);
> break;
> }
> @@ -1483,26 +1551,11 @@ static int mem_cgroup_hierarchical_reclaim(struct
> mem_cgroup *root_mem,
> css_put(&victim->css);
> continue;
> }
> - /* we use swappiness of local cgroup */
> - if (check_soft)
> - ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
> - noswap, get_swappiness(victim), zone);
> - else
> - ret = try_to_free_mem_cgroup_pages(victim,
> gfp_mask,
> - noswap,
> get_swappiness(victim));
> + ret = mem_cgroup_shrink_node_zone(victim, gfp_mask, noswap,
> + get_swappiness(victim),
> zone);
> css_put(&victim->css);
> - /*
> - * At shrinking usage, we can't check we should stop here
> or
> - * reclaim more. It's depends on callers.
> last_scanned_child
> - * will work enough for keeping fairness under tree.
> - */
> - if (shrink)
> - return ret;
> total += ret;
> - if (check_soft) {
> - if (!res_counter_soft_limit_excess(&root_mem->res))
> - return total;
> - } else if (mem_cgroup_margin(root_mem))
> + if (!res_counter_soft_limit_excess(&root_mem->res))
> return total;
> }
> return total;
> @@ -1927,8 +1980,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup
> *mem, gfp_t gfp_mask,
> if (!(gfp_mask & __GFP_WAIT))
> return CHARGE_WOULDBLOCK;
>
> - ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> - gfp_mask, flags);
> + ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
> if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> return CHARGE_RETRY;
> /*
> @@ -3085,7 +3137,7 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
>
> /*
> * A call to try to shrink memory usage on charge failure at shmem's
> swapin.
> - * Calling hierarchical_reclaim is not enough because we should update
> + * Calling reclaim is not enough because we should update
> * last_oom_jiffies to prevent pagefault_out_of_memory from invoking global
> OOM.
> * Moreover considering hierarchy, we should reclaim from the
> mem_over_limit,
> * not from the memcg which this page would be charged to.
> @@ -3167,7 +3219,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup
> *memcg,
> int enlarge;
>
> /*
> - * For keeping hierarchical_reclaim simple, how long we should
> retry
> + * For keeping reclaim simple, how long we should retry
> * is depends on callers. We set our retry-count to be function
> * of # of children which we should visit in this loop.
> */
> @@ -3210,8 +3262,8 @@ static int mem_cgroup_resize_limit(struct mem_cgroup
> *memcg,
> if (!ret)
> break;
>
> - mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
> - MEM_CGROUP_RECLAIM_SHRINK);
> + mem_cgroup_reclaim(memcg, GFP_KERNEL,
> + MEM_CGROUP_RECLAIM_SHRINK);
> curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
> /* Usage is reduced ? */
> if (curusage >= oldusage)
> @@ -3269,9 +3321,9 @@ static int mem_cgroup_resize_memsw_limit(struct
> mem_cgroup *memcg,
> if (!ret)
> break;
>
> - mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
> - MEM_CGROUP_RECLAIM_NOSWAP |
> - MEM_CGROUP_RECLAIM_SHRINK);
> + mem_cgroup_reclaim(memcg, GFP_KERNEL,
> + MEM_CGROUP_RECLAIM_NOSWAP |
> + MEM_CGROUP_RECLAIM_SHRINK);
> curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
> /* Usage is reduced ? */
> if (curusage >= oldusage)
> @@ -3311,9 +3363,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct
> zone *zone, int order,
> if (!mz)
> break;
>
> - reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
> - gfp_mask,
> - MEM_CGROUP_RECLAIM_SOFT);
> + reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone,
> gfp_mask);
> nr_reclaimed += reclaimed;
> spin_lock(&mctz->lock);
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8bfd450..7e9bfca 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -104,7 +104,16 @@ struct scan_control {
> */
> reclaim_mode_t reclaim_mode;
>
> - /* Which cgroup do we reclaim from */
> + /*
> + * The memory cgroup that hit its hard limit and is the
> + * primary target of this reclaim invocation.
> + */
> + struct mem_cgroup *target_mem_cgroup;
> +
> + /*
> + * The memory cgroup that is currently being scanned as a
> + * child and contributor to the usage of target_mem_cgroup.
> + */
> struct mem_cgroup *mem_cgroup;
>
> /*
> @@ -154,9 +163,36 @@ static LIST_HEAD(shrinker_list);
> static DECLARE_RWSEM(shrinker_rwsem);
>
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> -#define scanning_global_lru(sc) (!(sc)->mem_cgroup)
> +/**
> + * global_reclaim - whether reclaim is global or due to memcg hard limit
> + * @sc: scan control of this reclaim invocation
> + */
> +static bool global_reclaim(struct scan_control *sc)
> +{
> + return !sc->target_mem_cgroup;
> +}
> +/**
> + * scanning_global_lru - whether scanning global lrus or per-memcg lrus
> + * @sc: scan control of this reclaim invocation
> + */
> +static bool scanning_global_lru(struct scan_control *sc)
> +{
> + /*
> + * Unless memory cgroups are disabled on boot, the traditional
> + * global lru lists are never scanned and reclaim will always
> + * operate on the per-memcg lru lists.
> + */
> + return mem_cgroup_disabled();
> +}
> #else
> -#define scanning_global_lru(sc) (1)
> +static bool global_reclaim(struct scan_control *sc)
> +{
> + return true;
> +}
> +static bool scanning_global_lru(struct scan_control *sc)
> +{
> + return true;
> +}
> #endif
>
> static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> @@ -1228,7 +1264,7 @@ static int too_many_isolated(struct zone *zone, int
> file,
> if (current_is_kswapd())
> return 0;
>
> - if (!scanning_global_lru(sc))
> + if (!global_reclaim(sc))
> return 0;
>
> if (file) {
> @@ -1397,13 +1433,6 @@ shrink_inactive_list(unsigned long nr_to_scan,
> struct zone *zone,
> sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
> ISOLATE_BOTH : ISOLATE_INACTIVE,
> zone, 0, file);
> - zone->pages_scanned += nr_scanned;
> - if (current_is_kswapd())
> - __count_zone_vm_events(PGSCAN_KSWAPD, zone,
> - nr_scanned);
> - else
> - __count_zone_vm_events(PGSCAN_DIRECT, zone,
> - nr_scanned);
> } else {
> nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
> &page_list, &nr_scanned, sc->order,
> @@ -1411,10 +1440,16 @@ shrink_inactive_list(unsigned long nr_to_scan,
> struct zone *zone,
> ISOLATE_BOTH : ISOLATE_INACTIVE,
> zone, sc->mem_cgroup,
> 0, file);
> - /*
> - * mem_cgroup_isolate_pages() keeps track of
> - * scanned pages on its own.
> - */
> + }
> +
> + if (global_reclaim(sc)) {
> + zone->pages_scanned += nr_scanned;
> + if (current_is_kswapd())
> + __count_zone_vm_events(PGSCAN_KSWAPD, zone,
> + nr_scanned);
> + else
> + __count_zone_vm_events(PGSCAN_DIRECT, zone,
> + nr_scanned);
> }
>
> if (nr_taken == 0) {
> @@ -1520,18 +1555,16 @@ static void shrink_active_list(unsigned long
> nr_pages, struct zone *zone,
> &pgscanned, sc->order,
> ISOLATE_ACTIVE, zone,
> 1, file);
> - zone->pages_scanned += pgscanned;
> } else {
> nr_taken = mem_cgroup_isolate_pages(nr_pages, &l_hold,
> &pgscanned, sc->order,
> ISOLATE_ACTIVE, zone,
> sc->mem_cgroup, 1, file);
> - /*
> - * mem_cgroup_isolate_pages() keeps track of
> - * scanned pages on its own.
> - */
> }
>
> + if (global_reclaim(sc))
> + zone->pages_scanned += pgscanned;
> +
> reclaim_stat->recent_scanned[file] += nr_taken;
>
> __count_zone_vm_events(PGREFILL, zone, pgscanned);
> @@ -1752,7 +1785,7 @@ static void get_scan_count(struct zone *zone, struct
> scan_control *sc,
> file = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
> zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);
>
> - if (scanning_global_lru(sc)) {
> + if (global_reclaim(sc)) {
> free = zone_page_state(zone, NR_FREE_PAGES);
> /* If we have very few page cache pages,
> force-scan anon pages.
*/ > @@ -1889,8 +1922,8 @@ static inline bool should_continue_reclaim(struct > zone *zone, > /* > * This is a basic per-zone page freer. Used by both kswapd and direct > reclaim. > */ > -static void shrink_zone(int priority, struct zone *zone, > - struct scan_control *sc) > +static void do_shrink_zone(int priority, struct zone *zone, > + struct scan_control *sc) > { > unsigned long nr[NR_LRU_LISTS]; > unsigned long nr_to_scan; > @@ -1943,6 +1976,31 @@ restart: > throttle_vm_writeout(sc->gfp_mask); > } > > +static void shrink_zone(int priority, struct zone *zone, > + struct scan_control *sc) > +{ > + unsigned long nr_reclaimed_before = sc->nr_reclaimed; > + struct mem_cgroup *root = sc->target_mem_cgroup; > + struct mem_cgroup *first, *mem = NULL; > + > + first = mem = mem_cgroup_hierarchy_walk(root, mem); > + for (;;) { > + unsigned long nr_reclaimed; > + > + sc->mem_cgroup = mem; > + do_shrink_zone(priority, zone, sc); > + > + nr_reclaimed = sc->nr_reclaimed - nr_reclaimed_before; > + if (nr_reclaimed >= sc->nr_to_reclaim) > + break; > + > + mem = mem_cgroup_hierarchy_walk(root, mem); > + if (mem == first) > + break; > + } > + mem_cgroup_stop_hierarchy_walk(root, mem); > +} > + > /* > * This is the direct reclaim path, for page-allocating processes. We only > * try to reclaim pages from zones which will satisfy the caller's > allocation > @@ -1973,7 +2031,7 @@ static void shrink_zones(int priority, struct > zonelist *zonelist, > * Take care memory controller reclaiming has small > influence > * to global LRU. 
> */
> - if (scanning_global_lru(sc)) {
> + if (global_reclaim(sc)) {
> if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> continue;
> if (zone->all_unreclaimable && priority !=
> DEF_PRIORITY)
> @@ -2038,7 +2096,7 @@ static unsigned long do_try_to_free_pages(struct
> zonelist *zonelist,
> get_mems_allowed();
> delayacct_freepages_start();
>
> - if (scanning_global_lru(sc))
> + if (global_reclaim(sc))
> count_vm_event(ALLOCSTALL);
>
> for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> @@ -2050,7 +2108,7 @@ static unsigned long do_try_to_free_pages(struct
> zonelist *zonelist,
> * Don't shrink slabs when reclaiming memory from
> * over limit cgroups
> */
> - if (scanning_global_lru(sc)) {
> + if (global_reclaim(sc)) {
> unsigned long lru_pages = 0;
> for_each_zone_zonelist(zone, z, zonelist,
> gfp_zone(sc->gfp_mask)) {
> @@ -2111,7 +2169,7 @@ out:
> return 0;
>
> /* top priority shrink_zones still had more to do? don't OOM, then
> */
> - if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
> + if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc))
> return 1;
>
> return 0;
> @@ -2129,7 +2187,7 @@ unsigned long try_to_free_pages(struct zonelist
> *zonelist, int order,
> .may_swap = 1,
> .swappiness = vm_swappiness,
> .order = order,
> - .mem_cgroup = NULL,
> + .target_mem_cgroup = NULL,
> .nodemask = nodemask,
> };
>
> @@ -2158,6 +2216,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct
> mem_cgroup *mem,
> .may_swap = !noswap,
> .swappiness = swappiness,
> .order = 0,
> + .target_mem_cgroup = mem,
> .mem_cgroup = mem,
> };
> sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> @@ -2174,7 +2233,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct
> mem_cgroup *mem,
> * will pick up pages from other mem cgroup's as well. We hack
> * the priority and make it zero.
> */
> - shrink_zone(0, zone, &sc);
> + do_shrink_zone(0, zone, &sc);
>
> trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
>
> @@ -2195,7 +2254,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct
> mem_cgroup *mem_cont,
> .nr_to_reclaim = SWAP_CLUSTER_MAX,
> .swappiness = swappiness,
> .order = 0,
> - .mem_cgroup = mem_cont,
> + .target_mem_cgroup = mem_cont,
> .nodemask = NULL, /* we don't care the placement */
> };
>
> @@ -2333,7 +2392,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat,
> int order,
> .nr_to_reclaim = ULONG_MAX,
> .swappiness = vm_swappiness,
> .order = order,
> - .mem_cgroup = NULL,
> + .target_mem_cgroup = NULL,
> };
> loop_again:
> total_scanned = 0;
>

Please consider including the following patch for the next post. It fixes a crash seen in some of our tests where sc->mem_cgroup is NULL (global kswapd).

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b72a844..12ab25d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2768,7 +2768,8 @@ loop_again:
 			 * Do some background aging of the anon list, to give
 			 * pages a chance to be referenced before reclaiming.
 			 */
-			if (inactive_anon_is_low(zone, &sc))
+			if (scanning_global_lru(&sc) &&
+			    inactive_anon_is_low(zone, &sc))
 				shrink_active_list(SWAP_CLUSTER_MAX, zone,
 							&sc, priority, 0);

--Ying

> --
> 1.7.5.2
>
>

On Tue, May 31, 2011 at 11:25 PM, Johann= es Weiner <hanne= s@cmpxchg.org> wrote:
When a memcg hits its hard limit, hierarchical target reclaim is
invoked, which goes through all contributing memcgs in the hierarchy
below the offending memcg and reclaims from the respective per-memcg
lru lists. =A0This distributes pressure fairly among all involved
memcgs, and pages are aged with respect to their list buddies.

When global memory pressure arises, however, all this is dropped
overboard. =A0Pages are reclaimed based on global lru lists that have
nothing to do with container-internal age, and some memcgs may be
reclaimed from much more than others.

This patch makes traditional global reclaim consider container
boundaries and no longer scan the global lru lists. =A0For each zone
scanned, the memcg hierarchy is walked and pages are reclaimed from
the per-memcg lru lists of the respective zone. =A0For now, the
hierarchy walk is bounded to one full round-trip through the
hierarchy, or if the number of reclaimed pages reach the overall
reclaim target, whichever comes first.

Conceptually, global memory pressure is then treated as if the root
memcg had hit its limit. =A0Since all existing memcgs contribute to the
usage of the root memcg, global reclaim is nothing more than target
reclaim starting from the root memcg. =A0The code is mostly the same for both cases, except for a few heuristics and statistics that do not
always apply. =A0They are distinguished by a newly introduced
global_reclaim() primitive.

One implication of this change is that pages have to be linked to the
lru lists of the root memcg again, which could be optimized away with
the old scheme. =A0The costs are not measurable, though, even with
worst-case microbenchmarks.

As global reclaim no longer relies on global lru lists, this change is
also in preparation to remove those completely.

Signed-off-by: Johannes Weiner <ha= nnes@cmpxchg.org>
---
=A0include/linux/memcontrol.h | =A0 15 ++++
=A0mm/memcontrol.c =A0 =A0 =A0 =A0 =A0 =A0| =A0176 ++++++++++++++++++++++++= ++++----------------
=A0mm/vmscan.c =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0| =A0121 ++++++++++++++++++++= ++--------
=A03 files changed, 218 insertions(+), 94 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5e9840f5..332b0a6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -101,6 +101,10 @@ mem_cgroup_prepare_migration(struct page *page,
=A0extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
=A0 =A0 =A0 =A0struct page *oldpage, struct page *newpage, bool migration_= ok);

+struct mem_cgroup *mem_cgroup_hierarchy_walk(struct mem_cgroup *,
+ =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0 =A0 =A0struct mem_cgroup *);
+void mem_cgroup_stop_hierarchy_walk(struct mem_cgroup *, struct mem_cgroup= *);
+
=A0/*
=A0* For memory reclaim.
=A0*/
@@ -321,6 +325,17 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *pag= e)
=A0 =A0 =A0 =A0return NULL;
=A0}

+static inline struct mem_cgroup *mem_cgroup_hierarchy_walk(struct mem_cgro= up *r,
+ =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0struct mem_cgroup *m)
+{
+ =A0 =A0 =A0 return NULL;
+}
+
+static inline void mem_cgroup_stop_hierarchy_walk(struct mem_cgroup *r, + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 = =A0 =A0 =A0 =A0 =A0 =A0 struct mem_cgroup *m)
+{
+}
+
=A0static inline void
=A0mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *= p)
=A0{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bf5ab87..850176e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -313,8 +313,8 @@ static bool move_file(void)
=A0}

=A0/*
- * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
- * limit reclaim to prevent infinite loops, if they ever occur.
+ * Maximum loops in reclaim, used for soft limit reclaim to prevent
+ * infinite loops, if they ever occur.
=A0*/
=A0#define =A0 =A0 =A0 =A0MEM_CGROUP_MAX_RECLAIM_LOOPS =A0 =A0 =A0 =A0 =A0 = =A0(100)
=A0#define =A0 =A0 =A0 =A0MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2)
@@ -340,7 +340,7 @@ enum charge_type {
=A0#define OOM_CONTROL =A0 =A0 =A0 =A0 =A0 =A0(0)

=A0/*
- * Reclaim flags for mem_cgroup_hierarchical_reclaim
+ * Reclaim flags
=A0*/
=A0#define MEM_CGROUP_RECLAIM_NOSWAP_BIT =A00x0
=A0#define MEM_CGROUP_RECLAIM_NOSWAP =A0 =A0 =A0(1 << MEM_CGROUP_RECL= AIM_NOSWAP_BIT)
@@ -846,8 +846,6 @@ void mem_cgroup_del_lru_list(struct page *page, enum lr= u_list lru)
=A0 =A0 =A0 =A0mz =3D page_cgroup_zoneinfo(pc->mem_cgroup, page);
=A0 =A0 =A0 =A0/* huge page split is done under lru_lock. so, we have no r= aces. */
=A0 =A0 =A0 =A0MEM_CGROUP_ZSTAT(mz, lru) -=3D 1 << compound_order(pa= ge);
- =A0 =A0 =A0 if (mem_cgroup_is_root(pc->mem_cgroup))
- =A0 =A0 =A0 =A0 =A0 =A0 =A0 return;
=A0 =A0 =A0 =A0VM_BUG_ON(list_empty(&pc->lru));
=A0 =A0 =A0 =A0list_del_init(&pc->lru);
=A0}
@@ -872,13 +870,11 @@ void mem_cgroup_rotate_reclaimable_page(struct page *= page)
=A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0return;

=A0 =A0 =A0 =A0pc =3D lookup_page_cgroup(page);
- =A0 =A0 =A0 /* unused or root page is not rotated. */
+ =A0 =A0 =A0 /* unused page is not rotated. */
=A0 =A0 =A0 =A0if (!PageCgroupUsed(pc))
=A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0return;
=A0 =A0 =A0 =A0/* Ensure pc->mem_cgroup is visible after reading PCG_US= ED. */
=A0 =A0 =A0 =A0smp_rmb();
- =A0 =A0 =A0 if (mem_cgroup_is_root(pc->mem_cgroup))
- =A0 =A0 =A0 =A0 =A0 =A0 =A0 return;
=A0 =A0 =A0 =A0mz =3D page_cgroup_zoneinfo(pc->mem_cgroup, page);
=A0 =A0 =A0 =A0list_move_tail(&pc->lru, &mz->lists[lru]); =A0}
@@ -892,13 +888,11 @@ void mem_cgroup_rotate_lru_list(struct page *page, en= um lru_list lru)
=A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0return;

=A0 =A0 =A0 =A0pc =3D lookup_page_cgroup(page);
- =A0 =A0 =A0 /* unused or root page is not rotated. */
+ =A0 =A0 =A0 /* unused page is not rotated. */
=A0 =A0 =A0 =A0if (!PageCgroupUsed(pc))
=A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0return;
=A0 =A0 =A0 =A0/* Ensure pc->mem_cgroup is visible after reading PCG_US= ED. */
=A0 =A0 =A0 =A0smp_rmb();
- =A0 =A0 =A0 if (mem_cgroup_is_root(pc->mem_cgroup))
- =A0 =A0 =A0 =A0 =A0 =A0 =A0 return;
=A0 =A0 =A0 =A0mz =3D page_cgroup_zoneinfo(pc->mem_cgroup, page);
=A0 =A0 =A0 =A0list_move(&pc->lru, &mz->lists[lru]);
=A0}
@@ -920,8 +914,6 @@ void mem_cgroup_add_lru_list(struct page *page, enum lr= u_list lru)
=A0 =A0 =A0 =A0/* huge page split is done under lru_lock. so, we have no r= aces. */
=A0 =A0 =A0 =A0MEM_CGROUP_ZSTAT(mz, lru) +=3D 1 << compound_order(pa= ge);
=A0 =A0 =A0 =A0SetPageCgroupAcctLRU(pc);
- =A0 =A0 =A0 if (mem_cgroup_is_root(pc->mem_cgroup))
- =A0 =A0 =A0 =A0 =A0 =A0 =A0 return;
=A0 =A0 =A0 =A0list_add(&pc->lru, &mz->lists[lru]);
=A0}

@@ -1381,6 +1373,97 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *memcg) =A0 =A0 =A0 =A0return min(limit, memsw);
=A0}

+/**
+ * mem_cgroup_hierarchy_walk - iterate over a memcg hierarchy
+ * @root: starting point of the hierarchy
+ * @prev: previous position or NULL
+ *
+ * Caller must hold a reference to @root.  While this function will
+ * return @root as part of the walk, it will never increase its
+ * reference count.
+ *
+ * Caller must clean up with mem_cgroup_stop_hierarchy_walk() when it
+ * stops the walk potentially before the full round trip.
+ */
+struct mem_cgroup *mem_cgroup_hierarchy_walk(struct mem_cgroup *root,
+					     struct mem_cgroup *prev)
+{
+	struct mem_cgroup *mem;
+
+	if (mem_cgroup_disabled())
+		return NULL;
+
+	if (!root)
+		root = root_mem_cgroup;
+	/*
+	 * Even without hierarchy explicitly enabled in the root
+	 * memcg, it is the ultimate parent of all memcgs.
+	 */
+	if (!(root == root_mem_cgroup || root->use_hierarchy))
+		return root;
+	if (prev && prev != root)
+		css_put(&prev->css);
+	do {
+		int id = root->last_scanned_child;
+		struct cgroup_subsys_state *css;
+
+		rcu_read_lock();
+		css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
+		if (css && (css == &root->css || css_tryget(css)))
+			mem = container_of(css, struct mem_cgroup, css);
+		rcu_read_unlock();
+		if (!css)
+			id = 0;
+		root->last_scanned_child = id;
+	} while (!mem);
+	return mem;
+}
+
+/**
+ * mem_cgroup_stop_hierarchy_walk - clean up after partial hierarchy walk
+ * @root: starting point in the hierarchy
+ * @mem: last position during the walk
+ */
+void mem_cgroup_stop_hierarchy_walk(struct mem_cgroup *root,
+				    struct mem_cgroup *mem)
+{
+	if (mem && mem != root)
+		css_put(&mem->css);
+}
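Not strictly part of the review, but to double-check my reading of the cursor logic above: last_scanned_child makes consecutive walks resume after the previously returned child instead of restarting at the first one. A minimal userspace sketch of just that round-robin property (struct and function names are mine, not from the patch):

```c
#include <assert.h>

/* Toy model of the walk's fairness property: the parent caches the
 * index of the last child handed out, so a new walk resumes after it
 * instead of always penalizing the first child in the list. */
struct memcg_model {
	int nr_children;
	int last_scanned_child;	/* persistent cursor, like the patch's */
};

/* Return the next child index, wrapping around; -1 if no children. */
static int walk_next(struct memcg_model *root)
{
	if (root->nr_children == 0)
		return -1;
	root->last_scanned_child =
		(root->last_scanned_child + 1) % root->nr_children;
	return root->last_scanned_child;
}
```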
+
+static unsigned long mem_cgroup_reclaim(struct mem_cgroup *mem,
+					gfp_t gfp_mask,
+					unsigned long flags)
+{
+	unsigned long total = 0;
+	bool noswap = false;
+	int loop;
+
+	if ((flags & MEM_CGROUP_RECLAIM_NOSWAP) || mem->memsw_is_minimum)
+		noswap = true;
+	for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
+		drain_all_stock_async();
+		total += try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap,
+						      get_swappiness(mem));
+		/*
+		 * Avoid freeing too much when shrinking to resize the
+		 * limit.  XXX: Shouldn't the margin check be enough?
+		 */
+		if (total && (flags & MEM_CGROUP_RECLAIM_SHRINK))
+			break;
+		if (mem_cgroup_margin(mem))
+			break;
+		/*
+		 * If we have not been able to reclaim anything after
+		 * two reclaim attempts, there may be no reclaimable
+		 * pages in this hierarchy.
+		 */
+		if (loop && !total)
+			break;
+	}
+	return total;
+}
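The retry policy here reads correctly to me: bounded loop, early exit on margin or on SHRINK progress, and bail after two fruitless attempts. A userspace model of just the two-fruitless-attempts rule (the per-attempt results array and the function name are assumptions of the sketch):

```c
#include <assert.h>

/* Toy model of the bail-out in mem_cgroup_reclaim(): keep retrying,
 * but if nothing at all has been freed once we're past the first
 * attempt, assume the hierarchy has no reclaimable pages and stop.
 * Per-attempt results are fed in as a fake array. */
static unsigned long model_reclaim(const unsigned long *freed, int max_loops)
{
	unsigned long total = 0;
	int loop;

	for (loop = 0; loop < max_loops; loop++) {
		total += freed[loop];
		/* mirrors the patch's "if (loop && !total) break;" */
		if (loop && !total)
			break;
	}
	return total;
}
```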
+
 /*
  * Visit the first child (need not be the first child as per the ordering
  * of the cgroup list, since we track last_scanned_child) of @mem and use
@@ -1418,29 +1501,14 @@ mem_cgroup_select_victim(struct mem_cgroup *root_mem)
 	return ret;
 }

-/*
- * Scan the hierarchy if needed to reclaim memory. We remember the last child
- * we reclaimed from, so that we don't end up penalizing one child extensively
- * based on its position in the children list.
- *
- * root_mem is the original ancestor that we've been reclaim from.
- *
- * We give up and return to the caller when we visit root_mem twice.
- * (other groups can be removed while we're walking....)
- *
- * If shrink==true, for avoiding to free too much, this returns immedieately.
- */
-static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
-						struct zone *zone,
-						gfp_t gfp_mask,
-						unsigned long reclaim_options)
+static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_mem,
+				   struct zone *zone,
+				   gfp_t gfp_mask)
 {
 	struct mem_cgroup *victim;
 	int ret, total = 0;
 	int loop = 0;
-	bool noswap = reclaim_options & MEM_CGROUP_RECLAIM_NOSWAP;
-	bool shrink = reclaim_options & MEM_CGROUP_RECLAIM_SHRINK;
-	bool check_soft = reclaim_options & MEM_CGROUP_RECLAIM_SOFT;
+	bool noswap = false;
 	unsigned long excess;

 	excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
@@ -1461,7 +1529,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 				 * anything, it might because there are
 				 * no reclaimable pages under this hierarchy
 				 */
-				if (!check_soft || !total) {
+				if (!total) {
 					css_put(&victim->css);
 					break;
 				}
@@ -1483,26 +1551,11 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 			css_put(&victim->css);
 			continue;
 		}
-		/* we use swappiness of local cgroup */
-		if (check_soft)
-			ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
-				noswap, get_swappiness(victim), zone);
-		else
-			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
-						noswap, get_swappiness(victim));
+		ret = mem_cgroup_shrink_node_zone(victim, gfp_mask, noswap,
+						  get_swappiness(victim), zone);
 		css_put(&victim->css);
-		/*
-		 * At shrinking usage, we can't check we should stop here or
-		 * reclaim more. It's depends on callers. last_scanned_child
-		 * will work enough for keeping fairness under tree.
-		 */
-		if (shrink)
-			return ret;
 		total += ret;
-		if (check_soft) {
-			if (!res_counter_soft_limit_excess(&root_mem->res))
-				return total;
-		} else if (mem_cgroup_margin(root_mem))
+		if (!res_counter_soft_limit_excess(&root_mem->res))
 			return total;
 	}
 	return total;
@@ -1927,8 +1980,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
 	if (!(gfp_mask & __GFP_WAIT))
 		return CHARGE_WOULDBLOCK;

-	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
-					      gfp_mask, flags);
+	ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
 		return CHARGE_RETRY;
 	/*
@@ -3085,7 +3137,7 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,

 /*
  * A call to try to shrink memory usage on charge failure at shmem's swapin.
- * Calling hierarchical_reclaim is not enough because we should update
+ * Calling reclaim is not enough because we should update
  * last_oom_jiffies to prevent pagefault_out_of_memory from invoking global OOM.
  * Moreover considering hierarchy, we should reclaim from the mem_over_limit,
  * not from the memcg which this page would be charged to.
@@ -3167,7 +3219,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 	int enlarge;

 	/*
-	 * For keeping hierarchical_reclaim simple, how long we should retry
+	 * For keeping reclaim simple, how long we should retry
 	 * is depends on callers. We set our retry-count to be function
 	 * of # of children which we should visit in this loop.
 	 */
@@ -3210,8 +3262,8 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 		if (!ret)
 			break;

-		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
-						MEM_CGROUP_RECLAIM_SHRINK);
+		mem_cgroup_reclaim(memcg, GFP_KERNEL,
+				   MEM_CGROUP_RECLAIM_SHRINK);
 		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)
@@ -3269,9 +3321,9 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 		if (!ret)
 			break;

-		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
-						MEM_CGROUP_RECLAIM_NOSWAP |
-						MEM_CGROUP_RECLAIM_SHRINK);
+		mem_cgroup_reclaim(memcg, GFP_KERNEL,
+				   MEM_CGROUP_RECLAIM_NOSWAP |
+				   MEM_CGROUP_RECLAIM_SHRINK);
 		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)
@@ -3311,9 +3363,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 		if (!mz)
 			break;

-		reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
-							    gfp_mask,
-							    MEM_CGROUP_RECLAIM_SOFT);
+		reclaimed = mem_cgroup_soft_reclaim(mz->mem, zone, gfp_mask);
 		nr_reclaimed += reclaimed;
 		spin_lock(&mctz->lock);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8bfd450..7e9bfca 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -104,7 +104,16 @@ struct scan_control {
 	 */
 	reclaim_mode_t reclaim_mode;

-	/* Which cgroup do we reclaim from */
+	/*
+	 * The memory cgroup that hit its hard limit and is the
+	 * primary target of this reclaim invocation.
+	 */
+	struct mem_cgroup *target_mem_cgroup;
+
+	/*
+	 * The memory cgroup that is currently being scanned as a
+	 * child and contributor to the usage of target_mem_cgroup.
+	 */
 	struct mem_cgroup *mem_cgroup;

 	/*
@@ -154,9 +163,36 @@ static LIST_HEAD(shrinker_list);
 static DECLARE_RWSEM(shrinker_rwsem);

 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
-#define scanning_global_lru(sc)	(!(sc)->mem_cgroup)
+/**
+ * global_reclaim - whether reclaim is global or due to memcg hard limit
+ * @sc: scan control of this reclaim invocation
+ */
+static bool global_reclaim(struct scan_control *sc)
+{
+	return !sc->target_mem_cgroup;
+}
+/**
+ * scanning_global_lru - whether scanning global lrus or per-memcg lrus
+ * @sc: scan control of this reclaim invocation
+ */
+static bool scanning_global_lru(struct scan_control *sc)
+{
+	/*
+	 * Unless memory cgroups are disabled on boot, the traditional
+	 * global lru lists are never scanned and reclaim will always
+	 * operate on the per-memcg lru lists.
+	 */
+	return mem_cgroup_disabled();
+}
 #else
-#define scanning_global_lru(sc)	(1)
+static bool global_reclaim(struct scan_control *sc)
+{
+	return true;
+}
+static bool scanning_global_lru(struct scan_control *sc)
+{
+	return true;
+}
 #endif
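The split between the two predicates is the subtle part of this hunk: after the patch, whether reclaim is global and whether the global LRU is being scanned are independent questions. A compile-and-run sketch of that distinction (userspace stand-ins of my own, not the kernel types):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct scan_control; only the field that matters here. */
struct sc_model {
	void *target_mem_cgroup;	/* NULL means global reclaim */
};

/* Boot-time toggle standing in for mem_cgroup_disabled(). */
static bool memcg_disabled;

static bool model_global_reclaim(const struct sc_model *sc)
{
	return sc->target_mem_cgroup == NULL;
}

static bool model_scanning_global_lru(const struct sc_model *sc)
{
	/* With memcg enabled, even global reclaim uses per-memcg lrus. */
	(void)sc;
	return memcg_disabled;
}
```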

 static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
@@ -1228,7 +1264,7 @@ static int too_many_isolated(struct zone *zone, int file,
 	if (current_is_kswapd())
 		return 0;

-	if (!scanning_global_lru(sc))
+	if (!global_reclaim(sc))
 		return 0;

 	if (file) {
@@ -1397,13 +1433,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 			sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
 					ISOLATE_BOTH : ISOLATE_INACTIVE,
 			zone, 0, file);
-		zone->pages_scanned += nr_scanned;
-		if (current_is_kswapd())
-			__count_zone_vm_events(PGSCAN_KSWAPD, zone,
-					       nr_scanned);
-		else
-			__count_zone_vm_events(PGSCAN_DIRECT, zone,
-					       nr_scanned);
 	} else {
 		nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
 			&page_list, &nr_scanned, sc->order,
@@ -1411,10 +1440,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 					ISOLATE_BOTH : ISOLATE_INACTIVE,
 			zone, sc->mem_cgroup,
 			0, file);
-		/*
-		 * mem_cgroup_isolate_pages() keeps track of
-		 * scanned pages on its own.
-		 */
+	}
+
+	if (global_reclaim(sc)) {
+		zone->pages_scanned += nr_scanned;
+		if (current_is_kswapd())
+			__count_zone_vm_events(PGSCAN_KSWAPD, zone,
+					       nr_scanned);
+		else
+			__count_zone_vm_events(PGSCAN_DIRECT, zone,
+					       nr_scanned);
 	}

 	if (nr_taken == 0) {
@@ -1520,18 +1555,16 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 						&pgscanned, sc->order,
 						ISOLATE_ACTIVE, zone,
 						1, file);
-		zone->pages_scanned += pgscanned;
 	} else {
 		nr_taken = mem_cgroup_isolate_pages(nr_pages, &l_hold,
 						&pgscanned, sc->order,
 						ISOLATE_ACTIVE, zone,
 						sc->mem_cgroup, 1, file);
-		/*
-		 * mem_cgroup_isolate_pages() keeps track of
-		 * scanned pages on its own.
-		 */
 	}

+	if (global_reclaim(sc))
+		zone->pages_scanned += pgscanned;
+
 	reclaim_stat->recent_scanned[file] += nr_taken;

 	__count_zone_vm_events(PGREFILL, zone, pgscanned);
@@ -1752,7 +1785,7 @@ static void get_scan_count(struct zone *zone, struct scan_control *sc,
 	file  = zone_nr_lru_pages(zone, sc, LRU_ACTIVE_FILE) +
 		zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);

-	if (scanning_global_lru(sc)) {
+	if (global_reclaim(sc)) {
 		free  = zone_page_state(zone, NR_FREE_PAGES);
 		/* If we have very few page cache pages,
 		   force-scan anon pages. */
@@ -1889,8 +1922,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
 /*
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
-static void shrink_zone(int priority, struct zone *zone,
-			struct scan_control *sc)
+static void do_shrink_zone(int priority, struct zone *zone,
+			   struct scan_control *sc)
 {
 	unsigned long nr[NR_LRU_LISTS];
 	unsigned long nr_to_scan;
@@ -1943,6 +1976,31 @@ restart:
 	throttle_vm_writeout(sc->gfp_mask);
 }
+static void shrink_zone(int priority, struct zone *zone,
+			struct scan_control *sc)
+{
+	unsigned long nr_reclaimed_before = sc->nr_reclaimed;
+	struct mem_cgroup *root = sc->target_mem_cgroup;
+	struct mem_cgroup *first, *mem = NULL;
+
+	first = mem = mem_cgroup_hierarchy_walk(root, mem);
+	for (;;) {
+		unsigned long nr_reclaimed;
+
+		sc->mem_cgroup = mem;
+		do_shrink_zone(priority, zone, sc);
+
+		nr_reclaimed = sc->nr_reclaimed - nr_reclaimed_before;
+		if (nr_reclaimed >= sc->nr_to_reclaim)
+			break;
+
+		mem = mem_cgroup_hierarchy_walk(root, mem);
+		if (mem == first)
+			break;
+	}
+	mem_cgroup_stop_hierarchy_walk(root, mem);
+}
+
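To convince myself the new shrink_zone() loop is really bounded, I modeled its two termination conditions in userspace: stop when the reclaim target is met, or when the walk comes back to the first memcg it visited (the per-memcg yield array and all names are assumptions of this sketch):

```c
#include <assert.h>

/* Toy walk state: a cursor over nr_memcgs children, persisting across
 * invocations the way last_scanned_child does in the patch. */
struct zone_walk {
	int nr_memcgs;
	int cursor;
};

static int zw_next(struct zone_walk *w)
{
	w->cursor = (w->cursor + 1) % w->nr_memcgs;
	return w->cursor;
}

/* Model of the shrink_zone() loop: reclaim from successive memcgs
 * until the target is met or one full round trip completes. */
static unsigned long model_shrink_zone(struct zone_walk *w,
				       const unsigned long *yield,
				       unsigned long nr_to_reclaim)
{
	unsigned long reclaimed = 0;
	int first, mem;

	first = mem = zw_next(w);
	for (;;) {
		reclaimed += yield[mem];
		if (reclaimed >= nr_to_reclaim)
			break;
		mem = zw_next(w);
		if (mem == first)
			break;	/* visited everyone once: give up */
	}
	return reclaimed;
}
```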
 /*
  * This is the direct reclaim path, for page-allocating processes.  We only
  * try to reclaim pages from zones which will satisfy the caller's allocation
@@ -1973,7 +2031,7 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 		 * Take care memory controller reclaiming has small influence
 		 * to global LRU.
 		 */
-		if (scanning_global_lru(sc)) {
+		if (global_reclaim(sc)) {
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
@@ -2038,7 +2096,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	get_mems_allowed();
 	delayacct_freepages_start();

-	if (scanning_global_lru(sc))
+	if (global_reclaim(sc))
 		count_vm_event(ALLOCSTALL);

 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
@@ -2050,7 +2108,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 * Don't shrink slabs when reclaiming memory from
 		 * over limit cgroups
 		 */
-		if (scanning_global_lru(sc)) {
+		if (global_reclaim(sc)) {
 			unsigned long lru_pages = 0;
 			for_each_zone_zonelist(zone, z, zonelist,
 					gfp_zone(sc->gfp_mask)) {
@@ -2111,7 +2169,7 @@ out:
 		return 0;

 	/* top priority shrink_zones still had more to do? don't OOM, then */
-	if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
+	if (global_reclaim(sc) && !all_unreclaimable(zonelist, sc))
 		return 1;

 	return 0;
@@ -2129,7 +2187,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.may_swap = 1,
 		.swappiness = vm_swappiness,
 		.order = order,
-		.mem_cgroup = NULL,
+		.target_mem_cgroup = NULL,
 		.nodemask = nodemask,
 	};

@@ -2158,6 +2216,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 		.may_swap = !noswap,
 		.swappiness = swappiness,
 		.order = 0,
+		.target_mem_cgroup = mem,
 		.mem_cgroup = mem,
 	};
 	sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
@@ -2174,7 +2233,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 	 * will pick up pages from other mem cgroup's as well. We hack
 	 * the priority and make it zero.
 	 */
-	shrink_zone(0, zone, &sc);
+	do_shrink_zone(0, zone, &sc);

 	trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);

@@ -2195,7 +2254,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.swappiness = swappiness,
 		.order = 0,
-		.mem_cgroup = mem_cont,
+		.target_mem_cgroup = mem_cont,
 		.nodemask = NULL, /* we don't care the placement */
 	};

@@ -2333,7 +2392,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 		.nr_to_reclaim = ULONG_MAX,
 		.swappiness = vm_swappiness,
 		.order = order,
-		.mem_cgroup = NULL,
+		.target_mem_cgroup = NULL,
 	};
 loop_again:
 	total_scanned = 0;

Please consider including the following patch in the next post. Without it, the patch crashes on some of our tests where sc->mem_cgroup is NULL (global kswapd):

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b72a844..12ab25d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2768,7 +2768,8 @@ loop_again:
 			 * Do some background aging of the anon list, to give
 			 * pages a chance to be referenced before reclaiming.
 			 */
-			if (inactive_anon_is_low(zone, &sc))
+			if (scanning_global_lru(&sc) &&
+			    inactive_anon_is_low(zone, &sc))
 				shrink_active_list(SWAP_CLUSTER_MAX, zone,
 						   &sc, priority, 0);

--Ying
--
1.7.5.2

