Subject: Re: [PATCH 5/7] memcg bgreclaim core.
Date: Tue, 26 Apr 2011 16:15:04 -0700
From: Ying Han <yinghan@google.com>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: linux-mm@kvack.org, kosaki.motohiro@jp.fujitsu.com,
    balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
    akpm@linux-foundation.org, Johannes Weiner, minchan.kim@gmail.com,
    Michal Hocko
In-Reply-To: <20110426140815.8847062b.kamezawa.hiroyu@jp.fujitsu.com>
References: <20110425182529.c7c37bb4.kamezawa.hiroyu@jp.fujitsu.com>
 <20110425183629.144d3f19.kamezawa.hiroyu@jp.fujitsu.com>
 <20110426140815.8847062b.kamezawa.hiroyu@jp.fujitsu.com>
Sender: owner-linux-mm@kvack.org

On Mon, Apr 25, 2011 at 10:08 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 25 Apr 2011 21:59:06 -0700
> Ying Han <yinghan@google.com> wrote:
>
> > On Mon, Apr 25, 2011 at 2:36 AM, KAMEZAWA Hiroyuki
> > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > The following patch will change the logic. This is the core.
> > > ==
> > > This is the main loop of per-memcg background reclaim, implemented in
> > > the function balance_mem_cgroup_pgdat().
> > >
> > > The function performs a priority loop similar to global reclaim. During
> > > each iteration it frees memory from a selected victim node. After
> > > reclaiming enough pages or scanning enough pages, it returns and finds
> > > the next piece of work in round-robin order.
> > >
> > > changelog v8b..v7
> > > 1. Reworked to use a workqueue rather than threads.
> > > 2. Changed the shrink_mem_cgroup algorithm to fit the workqueue. In
> > >    short, avoid long running, allow a quick round-robin, and avoid
> > >    unnecessary page writeback. When a thread makes pages dirty
> > >    continuously, writing them back via the flusher is far faster than
> > >    writeback by background reclaim. This detail will be fixed when
> > >    dirty_ratio is implemented. The logic around this will be revisited
> > >    in a following patch.
> > >
> > > Signed-off-by: Ying Han <yinghan@google.com>
> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > ---
> > >  include/linux/memcontrol.h |   11 ++++
> > >  mm/memcontrol.c            |   44 ++++++++++++++---
> > >  mm/vmscan.c                |  115 +++++++++++++++++++++++++++++++++++++++++++++
> > >  3 files changed, 162 insertions(+), 8 deletions(-)
> > >
> > > Index: memcg/include/linux/memcontrol.h
> > > ===================================================================
> > > --- memcg.orig/include/linux/memcontrol.h
> > > +++ memcg/include/linux/memcontrol.h
> > > @@ -89,6 +89,8 @@ extern int mem_cgroup_last_scanned_node(
> > >  extern int mem_cgroup_select_victim_node(struct mem_cgroup *mem,
> > >                                         const nodemask_t *nodes);
> > >
> > > +unsigned long shrink_mem_cgroup(struct mem_cgroup *mem);
> > > +
> > >  static inline
> > >  int mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *cgroup)
> > >  {
> > > @@ -112,6 +114,9 @@ extern void mem_cgroup_end_migration(str
> > >   */
> > >  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
> > >  int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
> > > +unsigned int mem_cgroup_swappiness(struct mem_cgroup *memcg);
> > > +unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
> > > +                                               int nid, int zone_idx);
> > >  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> > >                                         struct zone *zone,
> > >                                         enum lru_list lru);
> > > @@ -310,6 +315,12 @@ mem_cgroup_inactive_file_is_low(struct m
> > >  }
> > >
> > >  static inline unsigned long
> > > +mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg, int nid, int zone_idx)
> > > +{
> > > +       return 0;
> > > +}
> > > +
> > > +static inline unsigned long
> > >  mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
> > >                          enum lru_list lru)
> > >  {
> > > Index: memcg/mm/memcontrol.c
> > > ===================================================================
> > > --- memcg.orig/mm/memcontrol.c
> > > +++ memcg/mm/memcontrol.c
> > > @@ -1166,6 +1166,23 @@ int mem_cgroup_inactive_file_is_low(stru
> > >         return (active > inactive);
> > >  }
> > >
> > > +unsigned long mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
> > > +                                               int nid, int zone_idx)
> > > +{
> > > +       int nr;
> > > +       struct mem_cgroup_per_zone *mz =
> > > +               mem_cgroup_zoneinfo(memcg, nid, zone_idx);
> > > +
> > > +       nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
> > > +            MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_FILE);
> > > +
> > > +       if (nr_swap_pages > 0)
> > > +               nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
> > > +                     MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);
> > > +
> > > +       return nr;
> > > +}
> > > +
> > >  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
> > >                                         struct zone *zone,
> > >                                         enum lru_list lru)
> > > @@ -1286,7 +1303,7 @@ static unsigned long mem_cgroup_margin(s
> > >         return margin >> PAGE_SHIFT;
> > >  }
> > >
> > > -static unsigned int get_swappiness(struct mem_cgroup *memcg)
> > > +unsigned int mem_cgroup_swappiness(struct mem_cgroup *memcg)
> > >  {
> > >         struct cgroup *cgrp = memcg->css.cgroup;
> > >
> > > @@ -1595,14 +1612,15 @@ static int mem_cgroup_hierarchical_recla
> > >                 /* we use swappiness of local cgroup */
> > >                 if (check_soft) {
> > >                         ret = mem_cgroup_shrink_node_zone(victim, gfp_mask,
> > > -                               noswap, get_swappiness(victim), zone,
> > > +                               noswap, mem_cgroup_swappiness(victim), zone,
> > >                                 &nr_scanned);
> > >                         *total_scanned += nr_scanned;
> > >                         mem_cgroup_soft_steal(victim, ret);
> > >                         mem_cgroup_soft_scan(victim, nr_scanned);
> > >                 } else
> > >                         ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
> > > -                                               noswap, get_swappiness(victim));
> > > +                                               noswap,
> > > +                                               mem_cgroup_swappiness(victim));
> > >                 css_put(&victim->css);
> > >                 /*
> > >                  * At shrinking usage, we can't check we should stop here or
> > > @@ -1628,15 +1646,25 @@ static int mem_cgroup_hierarchical_recla
> > >  int
> > >  mem_cgroup_select_victim_node(struct mem_cgroup *mem, const nodemask_t *nodes)
> > >  {
> > > -       int next_nid;
> > > +       int next_nid, i;
> > >         int last_scanned;
> > >
> > >         last_scanned = mem->last_scanned_node;
> > > -       next_nid = next_node(last_scanned, *nodes);
> > > +       next_nid = last_scanned;
> > > +rescan:
> > > +       next_nid = next_node(next_nid, *nodes);
> > >
> > >         if (next_nid == MAX_NUMNODES)
> > >                 next_nid = first_node(*nodes);
> > >
> > > +       /* If no page on this node, skip */
> > > +       for (i = 0; i < MAX_NR_ZONES; i++)
> > > +               if (mem_cgroup_zone_reclaimable_pages(mem, next_nid, i))
> > > +                       break;
> > > +
> > > +       if (next_nid != last_scanned && (i == MAX_NR_ZONES))
> > > +               goto rescan;
> > > +
> > >         mem->last_scanned_node = next_nid;
> > >
> > >         return next_nid;
> > > @@ -3649,7 +3677,7 @@ try_to_free:
> > >                         goto out;
> > >                 }
> > >                 progress = try_to_free_mem_cgroup_pages(mem, GFP_KERNEL,
> > > -                                               false, get_swappiness(mem));
> > > +                                               false, mem_cgroup_swappiness(mem));
> > >                 if (!progress) {
> > >                         nr_retries--;
> > >                         /* maybe some writeback is necessary */
> > > @@ -4073,7 +4101,7 @@ static u64 mem_cgroup_swappiness_read(st
> > >  {
> > >         struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> > >
> > > -       return get_swappiness(memcg);
> > > +       return mem_cgroup_swappiness(memcg);
> > >  }
> > >
> > >  static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
> > > @@ -4849,7 +4877,7 @@ mem_cgroup_create(struct cgroup_subsys *
> > >         INIT_LIST_HEAD(&mem->oom_notify);
> > >
> > >         if (parent)
> > > -               mem->swappiness = get_swappiness(parent);
> > > +               mem->swappiness = mem_cgroup_swappiness(parent);
> > >         atomic_set(&mem->refcnt, 1);
> > >         mem->move_charge_at_immigrate = 0;
> > >         mutex_init(&mem->thresholds_lock);
> > > Index: memcg/mm/vmscan.c
> > > ===================================================================
> > > --- memcg.orig/mm/vmscan.c
> > > +++ memcg/mm/vmscan.c
> > > @@ -42,6 +42,7 @@
> > >  #include <linux/delayacct.h>
> > >  #include <linux/sysctl.h>
> > >  #include <linux/oom.h>
> > > +#include <linux/res_counter.h>
> > >
> > >  #include <asm/tlbflush.h>
> > >  #include <asm/div64.h>
> > > @@ -2308,6 +2309,120 @@ static bool sleeping_prematurely(pg_data
> > >                 return !all_zones_ok;
> > >  }
> > >
> > > +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> > > +/*
> > > + * The function is used for per-memcg LRU. It scans all the zones of the
> > > + * node and returns the nr_scanned and nr_reclaimed.
> > > + */
> > > +/*
> > > + * Limit of scanning per iteration. For round-robin.
> > > + */
> > > +#define MEMCG_BGSCAN_LIMIT     (2048)
> > > +
> > > +static void
> > > +shrink_memcg_node(int nid, int priority, struct scan_control *sc)
> > > +{
> > > +       unsigned long total_scanned = 0;
> > > +       struct mem_cgroup *mem_cont = sc->mem_cgroup;
> > > +       int i;
> > > +
> > > +       /*
> > > +        * This dma->highmem order is consistent with global reclaim.
> > > +        * We do this because the page allocator works in the opposite
> > > +        * direction although memcg user pages are mostly allocated at
> > > +        * highmem.
> > > +        */
> > > +       for (i = 0;
> > > +            (i < NODE_DATA(nid)->nr_zones) &&
> > > +            (total_scanned < MEMCG_BGSCAN_LIMIT);
> > > +            i++) {
> > > +               struct zone *zone = NODE_DATA(nid)->node_zones + i;
> > > +               struct zone_reclaim_stat *zrs;
> > > +               unsigned long scan, rotate;
> > > +
> > > +               if (!populated_zone(zone))
> > > +                       continue;
> > > +               scan = mem_cgroup_zone_reclaimable_pages(mem_cont, nid, i);
> > > +               if (!scan)
> > > +                       continue;
> > > +               /* If recent memory reclaim on this zone doesn't get good */
> > > +               zrs = get_reclaim_stat(zone, sc);
> > > +               scan = zrs->recent_scanned[0] + zrs->recent_scanned[1];
> > > +               rotate = zrs->recent_rotated[0] + zrs->recent_rotated[1];
> > > +
> > > +               if (rotate > scan/2)
> > > +                       sc->may_writepage = 1;
> > > +
> > > +               sc->nr_scanned = 0;
> > > +               shrink_zone(priority, zone, sc);
> > > +               total_scanned += sc->nr_scanned;
> > > +               sc->may_writepage = 0;
> > > +       }
> > > +       sc->nr_scanned = total_scanned;
> > > +}
> >
> > I see MEMCG_BGSCAN_LIMIT is a newly defined macro compared to the
> > previous post. So now the number of pages to scan is capped at 2k for
> > each memcg; does that make a difference for big vs. small cgroups?
> >
>
> Now, no difference. One reason is that low_watermark - high_watermark is
> limited to 4MB at most. It should be a static 4MB in many cases, and 2048
> pages corresponds to scanning 8MB, twice the low_wmark - high_wmark gap.
> Another reason is that I didn't have enough time to consider how to tune
> this. With MEMCG_BGSCAN_LIMIT, the round-robin can be simply fair, and I
> think it's a good starting point.

I can see a problem here with being "fair" to each memcg. Each container
has a different size and runs a different workload. Some of them are more
sensitive to latency than others, so they are willing to pay more CPU
cycles for background reclaim.

So, here we fix the amount of work per memcg, and the performance of those
jobs will be hurt. If I understand correctly, we only have one work item on
the workqueue per memcg, which means we can only reclaim that amount of
pages per iteration. And if the queue is long, those jobs (which allocate
memory heavily and are willing to pay CPU for background reclaim) will hit
direct reclaim more often than necessary.

--Ying

> If the memory eater is slow enough (because its threads need to do some
> work on the allocated memory), this shrink_mem_cgroup() works fine and
> helps avoid hitting the limit. Here, the amount of dirty pages is the
> troublesome part.
>
> The penalty for a CPU-eating (hard-to-reclaim) cgroup is given by 'delay'
> (see patch 7). This patch's congestion_wait is too bad and will be
> replaced by 'delay' in patch 7. In short, if memcg scanning does not seem
> successful, it gets an HZ/10 delay until the next work item.
>
> If we have dirty_ratio plus I/O-less dirty throttling, I think we'll see
> much better fairness in this watermark reclaim round-robin.
>
> Thanks,
> -Kame
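[Editorial sketch, not part of the thread: the scan-budget arithmetic Kame
gives above (MEMCG_BGSCAN_LIMIT of 2048 pages versus a low_wmark - high_wmark
gap capped at 4MB) worked through in a standalone C snippet. The 4KB page
size and the variable names are assumptions for illustration only.]

#include <stdio.h>

int main(void)
{
	/* Assumptions for illustration: 4KB pages, values quoted in the thread. */
	const unsigned long page_size    = 4096;      /* bytes per page (assumed) */
	const unsigned long bgscan_limit = 2048;      /* MEMCG_BGSCAN_LIMIT, pages per work item */
	const unsigned long wmark_gap    = 4UL << 20; /* low_wmark - high_wmark, capped at 4MB */

	unsigned long budget_bytes = bgscan_limit * page_size;

	/* 2048 pages * 4KB = 8MB, i.e. twice the maximum watermark gap. */
	printf("scan budget per work item:  %lu MB\n", budget_bytes >> 20);
	printf("ratio to max watermark gap: %lux\n", budget_bytes / wmark_gap);
	return 0;
}

With these numbers, one round of background reclaim may scan up to twice the
distance between the two watermarks, which is why no practical difference
between large and small cgroups is expected at this point in the thread.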
