From: Yang Shi <shy828301@gmail.com>
To: guro@fb.com, ktkhai@virtuozzo.com, shakeelb@google.com, david@fromorbit.com, hannes@cmpxchg.org, mhocko@suse.com, akpm@linux-foundation.org
Cc: shy828301@gmail.com, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [v2 PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred
Date: Mon, 14 Dec 2020 14:37:18 -0800
Message-Id: <20201214223722.232537-6-shy828301@gmail.com>
X-Mailer: git-send-email 2.26.2
In-Reply-To: <20201214223722.232537-1-shy828301@gmail.com>
References: <20201214223722.232537-1-shy828301@gmail.com>

Currently the number of deferred objects is tracked per shrinker, but some slabs, for example the vfs inode/dentry caches, are per memcg. This results in poor isolation among memcgs. Deferred objects are typically generated by __GFP_NOFS allocations; one memcg with excessive __GFP_NOFS allocations may blow up the deferred count, and other innocent memcgs may then suffer from over-shrinking, excessive reclaim latency, etc.

For example, consider two workloads running in memcgA and memcgB respectively, where the workload in B is vfs heavy. If the workload in A generates excessive deferred objects, B's vfs caches might be hit heavily (half of the caches dropped) by B's limit reclaim or global reclaim.

We observed such a hit in our production environment, which was running a vfs heavy workload, as shown in the tracing log below:

<...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 1
objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
cache items 246404277 delta 31345 total_scan 123202138

<...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 1 unused
scan count 3641681686040 new scan count 3641798379189 total_scan 602
last shrinker return val 123186855

The vfs cache to page cache ratio was 10:1 on this machine, and half of the caches were dropped. This also caused a significant amount of page cache to be dropped due to inode eviction.

Making nr_deferred per memcg for memcg aware shrinkers solves the unfairness and brings better isolation.

When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's own nr_deferred is used. Non memcg aware shrinkers use the shrinker's nr_deferred all the time.
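For reviewers, a minimal sketch of how a caller such as do_shrink_slab() could pick the right nr_deferred counter once these per-memcg arrays exist. The switch-over itself comes later in this series; the helper name, the exact branch conditions, and the assumption that the caller already holds shrinker_rwsem (which is what would justify rcu_dereference_protected()) are illustrative only, not part of this patch:

/*
 * Illustrative sketch only (CONFIG_MEMCG build assumed): pick the
 * per-memcg counter for a memcg aware shrinker, otherwise fall back to
 * the per-shrinker, per-node counter.  The helper name is hypothetical;
 * this patch only adds the data structures, not this lookup.
 */
static atomic_long_t *shrinker_nr_deferred(struct shrinker *shrinker,
					   struct shrink_control *sc)
{
	if ((shrinker->flags & SHRINKER_MEMCG_AWARE) && sc->memcg) {
		struct memcg_shrinker_deferred *deferred;

		/* Caller is assumed to hold shrinker_rwsem. */
		deferred = rcu_dereference_protected(
			mem_cgroup_nodeinfo(sc->memcg, sc->nid)->shrinker_deferred,
			true);
		/* The per-memcg array is indexed by shrinker ID. */
		return &deferred->nr_deferred[shrinker->id];
	}

	/* Memcg disabled or a non memcg aware shrinker: per-node fallback. */
	return &shrinker->nr_deferred[sc->nid];
}

The key point is the fallback: with memcg disabled or a non memcg aware shrinker, the per-shrinker, per-node nr_deferred keeps being used exactly as today.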
Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/memcontrol.h |   9 +++
 mm/memcontrol.c            | 110 ++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                |   4 ++
 3 files changed, 120 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 922a7f600465..1b343b268359 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -92,6 +92,13 @@ struct lruvec_stat {
 	long count[NR_VM_NODE_STAT_ITEMS];
 };
 
+
+/* Shrinker::id indexed nr_deferred of memcg-aware shrinkers. */
+struct memcg_shrinker_deferred {
+	struct rcu_head rcu;
+	atomic_long_t nr_deferred[];
+};
+
 /*
  * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
  * which have elements charged to this memcg.
@@ -119,6 +126,7 @@ struct mem_cgroup_per_node {
 	struct mem_cgroup_reclaim_iter	iter;
 
 	struct memcg_shrinker_map __rcu	*shrinker_map;
+	struct memcg_shrinker_deferred __rcu	*shrinker_deferred;
 
 	struct rb_node		tree_node;	/* RB tree node */
 	unsigned long		usage_in_excess;/* Set to the value by which */
@@ -1489,6 +1497,7 @@ static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 }
 
 extern int memcg_expand_shrinker_maps(int new_id);
+extern int memcg_expand_shrinker_deferred(int new_id);
 
 extern void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
 				   int nid, int shrinker_id);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3d4ddbb84a01..321d1818ce3d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -394,14 +394,20 @@ DEFINE_STATIC_KEY_FALSE(memcg_kmem_enabled_key);
 EXPORT_SYMBOL(memcg_kmem_enabled_key);
 #endif
 
-/* It is only can be changed with holding shrinker_rwsem exclusively */
+/* They can only be changed while holding shrinker_rwsem exclusively */
 static int memcg_shrinker_map_size;
+static int memcg_shrinker_deferred_size;
 
 static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
 {
 	kvfree(container_of(head, struct memcg_shrinker_map, rcu));
 }
 
+static void memcg_free_shrinker_deferred_rcu(struct rcu_head *head)
+{
+	kvfree(container_of(head, struct memcg_shrinker_deferred, rcu));
+}
+
 static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
 					 int size, int old_size)
 {
@@ -430,6 +436,34 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
 	return 0;
 }
 
+static int memcg_expand_one_shrinker_deferred(struct mem_cgroup *memcg,
+					      int size, int old_size)
+{
+	struct memcg_shrinker_deferred *new, *old;
+	int nid;
+
+	for_each_node(nid) {
+		old = rcu_dereference_protected(
+			mem_cgroup_nodeinfo(memcg, nid)->shrinker_deferred, true);
+		/* Not yet online memcg */
+		if (!old)
+			return 0;
+
+		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
+		if (!new)
+			return -ENOMEM;
+
+		/* Copy all old values, and clear all new ones */
+		memcpy((void *)new->nr_deferred, (void *)old->nr_deferred, old_size);
+		memset((void *)new->nr_deferred + old_size, 0, size - old_size);
+
+		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_deferred, new);
+		call_rcu(&old->rcu, memcg_free_shrinker_deferred_rcu);
+	}
+
+	return 0;
+}
+
 static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
 {
 	struct mem_cgroup_per_node *pn;
@@ -448,6 +482,21 @@ static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
 	}
 }
 
+static void memcg_free_shrinker_deferred(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup_per_node *pn;
+	struct memcg_shrinker_deferred *deferred;
+	int nid;
+
+	for_each_node(nid) {
+		pn = mem_cgroup_nodeinfo(memcg, nid);
+		deferred = rcu_dereference_protected(pn->shrinker_deferred, true);
+		if (deferred)
+			kvfree(deferred);
+		rcu_assign_pointer(pn->shrinker_deferred, NULL);
+	}
+}
+
 static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
 {
 	struct memcg_shrinker_map *map;
@@ -472,6 +521,27 @@ static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
 	return ret;
 }
 
+static int memcg_alloc_shrinker_deferred(struct mem_cgroup *memcg)
+{
+	struct memcg_shrinker_deferred *deferred;
+	int nid, size, ret = 0;
+
+	down_read(&shrinker_rwsem);
+	size = memcg_shrinker_deferred_size;
+	for_each_node(nid) {
+		deferred = kvzalloc_node(sizeof(*deferred) + size, GFP_KERNEL, nid);
+		if (!deferred) {
+			memcg_free_shrinker_deferred(memcg);
+			ret = -ENOMEM;
+			break;
+		}
+		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_deferred, deferred);
+	}
+	up_read(&shrinker_rwsem);
+
+	return ret;
+}
+
 int memcg_expand_shrinker_maps(int new_id)
 {
 	int size, old_size, ret = 0;
@@ -501,6 +571,33 @@ int memcg_expand_shrinker_maps(int new_id)
 	return ret;
 }
 
+int memcg_expand_shrinker_deferred(int new_id)
+{
+	int size, old_size, ret = 0;
+	struct mem_cgroup *memcg;
+
+	size = (new_id + 1) * sizeof(atomic_long_t);
+	old_size = memcg_shrinker_deferred_size;
+	if (size <= old_size)
+		return 0;
+
+	if (!root_mem_cgroup)
+		goto out;
+
+	for_each_mem_cgroup(memcg) {
+		ret = memcg_expand_one_shrinker_deferred(memcg, size, old_size);
+		if (ret) {
+			mem_cgroup_iter_break(NULL, memcg);
+			goto out;
+		}
+	}
+out:
+	if (!ret)
+		memcg_shrinker_deferred_size = size;
+
+	return ret;
+}
+
 void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
 {
 	if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
@@ -5397,8 +5494,8 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
 	/*
-	 * A memcg must be visible for memcg_expand_shrinker_maps()
-	 * by the time the maps are allocated. So, we allocate maps
+	 * A memcg must be visible for memcg_expand_shrinker_{maps|deferred}()
+	 * by the time the maps are allocated. So, we allocate maps and deferred
 	 * here, when for_each_mem_cgroup() can't skip it.
 	 */
 	if (memcg_alloc_shrinker_maps(memcg)) {
@@ -5406,6 +5503,12 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 		return -ENOMEM;
 	}
 
+	if (memcg_alloc_shrinker_deferred(memcg)) {
+		memcg_free_shrinker_maps(memcg);
+		mem_cgroup_id_remove(memcg);
+		return -ENOMEM;
+	}
+
 	/*
 	 * Barrier for CSS_ONLINE, so that shrink_slab_memcg() sees shirnker_maps
 	 * and shrinker_deferred before CSS_ONLINE. It pairs with the read barrier
@@ -5473,6 +5576,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	cancel_work_sync(&memcg->high_work);
 	mem_cgroup_remove_from_trees(memcg);
 	memcg_free_shrinker_maps(memcg);
+	memcg_free_shrinker_deferred(memcg);
 	memcg_free_kmem(memcg);
 	mem_cgroup_free(memcg);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 16c9d2aeeb26..bf34167dd67e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -219,6 +219,10 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 			goto unlock;
 		}
 
+		if (memcg_expand_shrinker_deferred(id)) {
+			idr_remove(&shrinker_idr, id);
+			goto unlock;
+		}
 		shrinker_nr_max = id + 1;
 	}
 	shrinker->id = id;
-- 
2.26.2