From: Yang Shi <shy828301@gmail.com>
To: guro@fb.com, ktkhai@virtuozzo.com, shakeelb@google.com,
	david@fromorbit.com, hannes@cmpxchg.org, mhocko@suse.com,
	akpm@linux-foundation.org
Cc: shy828301@gmail.com, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH 5/9] mm: memcontrol: add per memcg shrinker nr_deferred
Date: Wed, 2 Dec 2020 10:27:21 -0800
Message-Id: <20201202182725.265020-6-shy828301@gmail.com>
In-Reply-To: <20201202182725.265020-1-shy828301@gmail.com>
References: <20201202182725.265020-1-shy828301@gmail.com>

Currently the number of deferred objects is tracked per shrinker, but some
slabs, for example the vfs inode/dentry caches, are accounted per memcg.
This results in poor isolation among memcgs.

Deferred objects are typically generated by __GFP_NOFS allocations; one
memcg with excessive __GFP_NOFS allocations can blow up its deferred count,
and other, innocent memcgs may then suffer from over-shrinking, excessive
reclaim latency, and so on.

For example, suppose two workloads run in memcgA and memcgB respectively,
and the workload in B is vfs heavy.  If the workload in A generates
excessive deferred objects, B's vfs cache might be hit heavily (half of its
caches dropped) by B's limit reclaim or by global reclaim.

We observed this in our production environment, which was running a vfs
heavy workload, as shown in the tracing log below:

<...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
cache items 246404277 delta 31345 total_scan 123202138

<...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
last shrinker return val 123186855

The vfs cache to page cache ratio was 10:1 on this machine, and half of the
caches were dropped.  This also caused a significant amount of page cache to
be dropped due to inode eviction.

Making nr_deferred per memcg for memcg-aware shrinkers would solve the
unfairness and bring better isolation.

When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's
own nr_deferred is used; non-memcg-aware shrinkers use the shrinker's
nr_deferred all the time.
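
To make the intent concrete, here is a minimal user-space sketch
(illustrative only, not part of the patch; the node count, shrinker-id
space and all "toy_" names are made-up assumptions) of the layout this
series moves to: each memcg keeps, per node, an array of atomic counters
indexed by shrinker id, so one memcg's __GFP_NOFS backlog no longer
inflates the count that another memcg's reclaim sees.

/* Toy model -- build with: gcc -std=c11 toy_deferred.c */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define NR_NODES	2	/* hypothetical NUMA node count */
#define NR_SHRINKERS	8	/* hypothetical shrinker id space */

/* One array of deferred counts per node, one slot per shrinker id. */
struct toy_shrinker_deferred {
	atomic_long nr_deferred[NR_SHRINKERS];
};

/* Each memcg owns its own per-node arrays, so backlogs stay isolated. */
struct toy_memcg {
	struct toy_shrinker_deferred *deferred[NR_NODES];
};

/* Record work a shrinker had to skip (e.g. under __GFP_NOFS). */
static void toy_defer(struct toy_memcg *m, int nid, int id, long n)
{
	atomic_fetch_add(&m->deferred[nid]->nr_deferred[id], n);
}

/* Consume (and reset) the backlog when the shrinker actually runs. */
static long toy_take_deferred(struct toy_memcg *m, int nid, int id)
{
	return atomic_exchange(&m->deferred[nid]->nr_deferred[id], 0);
}

int main(void)
{
	struct toy_memcg a = { { NULL } }, b = { { NULL } };

	for (int nid = 0; nid < NR_NODES; nid++) {
		a.deferred[nid] = calloc(1, sizeof(*a.deferred[nid]));
		b.deferred[nid] = calloc(1, sizeof(*b.deferred[nid]));
	}

	/* memcg A piles up deferred work; memcg B's counters stay at zero. */
	toy_defer(&a, 0, 3, 1000);
	printf("A backlog: %ld, B backlog: %ld\n",
	       toy_take_deferred(&a, 0, 3), toy_take_deferred(&b, 0, 3));
	return 0;
}

The patch below implements the kernel-side counterpart: a
memcg_shrinker_deferred flexible array hung off each mem_cgroup_per_node,
sized by the shrinker id space and grown under memcg_shrinker_mutex.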

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/memcontrol.h |   9 +++
 mm/memcontrol.c            | 112 ++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                |   4 ++
 3 files changed, 123 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 922a7f600465..1b343b268359 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -92,6 +92,13 @@ struct lruvec_stat {
 	long count[NR_VM_NODE_STAT_ITEMS];
 };
 
+
+/* Shrinker::id indexed nr_deferred of memcg-aware shrinkers. */
+struct memcg_shrinker_deferred {
+	struct rcu_head rcu;
+	atomic_long_t nr_deferred[];
+};
+
 /*
  * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
  * which have elements charged to this memcg.
@@ -119,6 +126,7 @@ struct mem_cgroup_per_node {
 	struct mem_cgroup_reclaim_iter	iter;
 
 	struct memcg_shrinker_map __rcu	*shrinker_map;
+	struct memcg_shrinker_deferred __rcu	*shrinker_deferred;
 
 	struct rb_node		tree_node;	/* RB tree node */
 	unsigned long		usage_in_excess;/* Set to the value by which */
@@ -1489,6 +1497,7 @@ static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
 }
 
 extern int memcg_expand_shrinker_maps(int new_id);
+extern int memcg_expand_shrinker_deferred(int new_id);
 
 extern void memcg_set_shrinker_bit(struct mem_cgroup *memcg,
 				   int nid, int shrinker_id);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 19e41684c96b..d3d5c88db179 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -395,6 +395,8 @@ EXPORT_SYMBOL(memcg_kmem_enabled_key);
 #endif
 
 static int memcg_shrinker_map_size;
+static int memcg_shrinker_deferred_size;
+
 static DEFINE_MUTEX(memcg_shrinker_mutex);
 
 static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
@@ -402,6 +404,11 @@ static void memcg_free_shrinker_map_rcu(struct rcu_head *head)
 	kvfree(container_of(head, struct memcg_shrinker_map, rcu));
 }
 
+static void memcg_free_shrinker_deferred_rcu(struct rcu_head *head)
+{
+	kvfree(container_of(head, struct memcg_shrinker_deferred, rcu));
+}
+
 static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
 					 int size, int old_size)
 {
@@ -432,6 +439,36 @@ static int memcg_expand_one_shrinker_map(struct mem_cgroup *memcg,
 	return 0;
 }
 
+static int memcg_expand_one_shrinker_deferred(struct mem_cgroup *memcg,
+					      int size, int old_size)
+{
+	struct memcg_shrinker_deferred *new, *old;
+	int nid;
+
+	lockdep_assert_held(&memcg_shrinker_mutex);
+
+	for_each_node(nid) {
+		old = rcu_dereference_protected(
+			mem_cgroup_nodeinfo(memcg, nid)->shrinker_deferred, true);
+		/* Not yet online memcg */
+		if (!old)
+			return 0;
+
+		new = kvmalloc_node(sizeof(*new) + size, GFP_KERNEL, nid);
+		if (!new)
+			return -ENOMEM;
+
+		/* Copy all old values, and clear all new ones */
+		memcpy((void *)new->nr_deferred, (void *)old->nr_deferred, old_size);
+		memset((void *)new->nr_deferred + old_size, 0, size - old_size);
+
+		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_deferred, new);
+		call_rcu(&old->rcu, memcg_free_shrinker_deferred_rcu);
+	}
+
+	return 0;
+}
+
 static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
 {
 	struct mem_cgroup_per_node *pn;
@@ -450,6 +487,21 @@ static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
 	}
 }
 
+static void memcg_free_shrinker_deferred(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup_per_node *pn;
+	struct memcg_shrinker_deferred *deferred;
+	int nid;
+
+	for_each_node(nid) {
+		pn = mem_cgroup_nodeinfo(memcg, nid);
+		deferred = rcu_dereference_protected(pn->shrinker_deferred, true);
+		if (deferred)
+			kvfree(deferred);
+		rcu_assign_pointer(pn->shrinker_deferred, NULL);
+	}
+}
+
 static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
 {
 	struct memcg_shrinker_map *map;
@@ -474,6 +526,27 @@ static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
 	return ret;
 }
 
+static int memcg_alloc_shrinker_deferred(struct mem_cgroup *memcg)
+{
+	struct memcg_shrinker_deferred *deferred;
+	int nid, size, ret = 0;
+
+	mutex_lock(&memcg_shrinker_mutex);
+	size = memcg_shrinker_deferred_size;
+	for_each_node(nid) {
+		deferred = kvzalloc_node(sizeof(*deferred) + size, GFP_KERNEL, nid);
+		if (!deferred) {
+			memcg_free_shrinker_deferred(memcg);
+			ret = -ENOMEM;
+			break;
+		}
+		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_deferred, deferred);
+	}
+	mutex_unlock(&memcg_shrinker_mutex);
+
+	return ret;
+}
+
 int memcg_expand_shrinker_maps(int new_id)
 {
 	int size, old_size, ret = 0;
@@ -504,6 +577,34 @@ int memcg_expand_shrinker_maps(int new_id)
 	return ret;
 }
 
+int memcg_expand_shrinker_deferred(int new_id)
+{
+	int size, old_size, ret = 0;
+	struct mem_cgroup *memcg;
+
+	size = (new_id + 1) * sizeof(atomic_long_t);
+	old_size = memcg_shrinker_deferred_size;
+	if (size <= old_size)
+		return 0;
+
+	mutex_lock(&memcg_shrinker_mutex);
+	if (!root_mem_cgroup)
+		goto unlock;
+
+	for_each_mem_cgroup(memcg) {
+		ret = memcg_expand_one_shrinker_deferred(memcg, size, old_size);
+		if (ret) {
+			mem_cgroup_iter_break(NULL, memcg);
+			goto unlock;
+		}
+	}
+unlock:
+	if (!ret)
+		memcg_shrinker_deferred_size = size;
+	mutex_unlock(&memcg_shrinker_mutex);
+	return ret;
+}
+
 void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
 {
 	if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
@@ -5400,8 +5501,8 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
 	/*
-	 * A memcg must be visible for memcg_expand_shrinker_maps()
-	 * by the time the maps are allocated. So, we allocate maps
+	 * A memcg must be visible for memcg_expand_shrinker_{maps|deferred}()
+	 * by the time the maps are allocated. So, we allocate maps and deferred
 	 * here, when for_each_mem_cgroup() can't skip it.
 	 */
 	if (memcg_alloc_shrinker_maps(memcg)) {
@@ -5409,6 +5510,12 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 		return -ENOMEM;
 	}
 
+	if (memcg_alloc_shrinker_deferred(memcg)) {
+		memcg_free_shrinker_maps(memcg);
+		mem_cgroup_id_remove(memcg);
+		return -ENOMEM;
+	}
+
 	/* Online state pins memcg ID, memcg ID pins CSS */
 	refcount_set(&memcg->id.ref, 1);
 	css_get(css);
@@ -5469,6 +5576,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	cancel_work_sync(&memcg->high_work);
 	mem_cgroup_remove_from_trees(memcg);
 	memcg_free_shrinker_maps(memcg);
+	memcg_free_shrinker_deferred(memcg);
 	memcg_free_kmem(memcg);
 	mem_cgroup_free(memcg);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0d628299e55c..cba0bc8d4661 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -219,6 +219,10 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 			goto unlock;
 		}
 
+		if (memcg_expand_shrinker_deferred(id)) {
+			idr_remove(&shrinker_idr, id);
+			goto unlock;
+		}
 		shrinker_nr_max = id + 1;
 	}
 	shrinker->id = id;
-- 
2.26.2
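
For context, and not part of this patch: the changelog's last paragraph says
memcg-aware shrinkers will consult the per-memcg counters while everything
else keeps using the shrinker's own nr_deferred, and a later patch in the
series is expected to wire that into the shrinker path.  A rough sketch of
how that selection could look; the helper name nr_deferred_for() and the
exact conditions are assumptions for illustration, not code from this series:

/*
 * Illustrative only.  Memcg-aware shrinkers running on behalf of a
 * non-root memcg would use the per-memcg array added above; everything
 * else keeps using the shrinker's own nr_deferred.
 */
static atomic_long_t *nr_deferred_for(struct shrinker *shrinker,
				      struct shrink_control *sc)
{
#ifdef CONFIG_MEMCG
	if ((shrinker->flags & SHRINKER_MEMCG_AWARE) &&
	    sc->memcg && !mem_cgroup_is_root(sc->memcg)) {
		struct memcg_shrinker_deferred *deferred;

		/* The real shrinker path holds shrinker_rwsem here. */
		deferred = rcu_dereference_protected(
			sc->memcg->nodeinfo[sc->nid]->shrinker_deferred, true);
		return &deferred->nr_deferred[shrinker->id];
	}
#endif
	return &shrinker->nr_deferred[sc->nid];
}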