From: Qi Zheng
To: akpm@linux-foundation.org, tkhai@ya.ru, vbabka@suse.cz,
	christian.koenig@amd.com, hannes@cmpxchg.org, shakeelb@google.com,
	mhocko@kernel.org, roman.gushchin@linux.dev, muchun.song@linux.dev,
	david@redhat.com, shy828301@gmail.com
Cc: sultan@kerneltoast.com, dave@stgolabs.net,
	penguin-kernel@I-love.SAKURA.ne.jp, paulmck@kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Qi Zheng, Vlastimil Babka
Subject: [PATCH v5 3/8] mm: vmscan: make memcg slab shrink lockless
Date: Mon, 13 Mar 2023 19:28:14 +0800
Message-Id: <20230313112819.38938-4-zhengqi.arch@bytedance.com>
In-Reply-To: <20230313112819.38938-1-zhengqi.arch@bytedance.com>
References: <20230313112819.38938-1-zhengqi.arch@bytedance.com>

Like global slab shrink, this commit also uses SRCU to make memcg slab
shrink lockless.

We can reproduce the down_read_trylock() hotspot through the following
script:

```
DIR="/root/shrinker/memcg/mnt"

do_create()
{
    mkdir -p /sys/fs/cgroup/memory/test
    mkdir -p /sys/fs/cgroup/perf_event/test
    echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    for i in `seq 0 $1`;
    do
        mkdir -p /sys/fs/cgroup/memory/test/$i;
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
        mkdir -p $DIR/$i;
    done
}

do_mount()
{
    for i in `seq $1 $2`;
    do
        mount -t tmpfs $i $DIR/$i;
    done
}

do_touch()
{
    for i in `seq $1 $2`;
    do
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        echo $$ > /sys/fs/cgroup/perf_event/test/cgroup.procs;
        dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
    done
}

case "$1" in
  touch)
    do_touch $2 $3
    ;;
  test)
    do_create 4000
    do_mount 0 4000
    do_touch 0 3000
    ;;
  *)
    exit 1
    ;;
esac
```

Save the above script, then run test and touch commands. Then we can use
the following perf command to view hotspots:

perf top -U -F 999

1) Before applying this patchset:

  32.31%  [kernel]           [k] down_read_trylock
  19.40%  [kernel]           [k] pv_native_safe_halt
  16.24%  [kernel]           [k] up_read
  15.70%  [kernel]           [k] shrink_slab
   4.69%  [kernel]           [k] _find_next_bit
   2.62%  [kernel]           [k] shrink_node
   1.78%  [kernel]           [k] shrink_lruvec
   0.76%  [kernel]           [k] do_shrink_slab

2) After applying this patchset:

  27.83%  [kernel]           [k] _find_next_bit
  16.97%  [kernel]           [k] shrink_slab
  15.82%  [kernel]           [k] pv_native_safe_halt
   9.58%  [kernel]           [k] shrink_node
   8.31%  [kernel]           [k] shrink_lruvec
   5.64%  [kernel]           [k] do_shrink_slab
   3.88%  [kernel]           [k] mem_cgroup_iter

At the same time, we use the following perf command to capture IPC
information:

perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10

1) Before applying this patchset:

 Performance counter stats for 'system wide' (5 runs):

      454187219766      cycles                    test                    ( +-  1.84% )
       78896433101      instructions              test #    0.17  insn per cycle           ( +-  0.44% )

        10.0020430 +- 0.0000366 seconds time elapsed  ( +-  0.00% )

2) After applying this patchset:

 Performance counter stats for 'system wide' (5 runs):

      841954709443      cycles                    test                    ( +- 15.80% )  (98.69%)
      527258677936      instructions              test #    0.63  insn per cycle           ( +- 15.11% )  (98.68%)

          10.01064 +- 0.00831 seconds time elapsed  ( +-  0.08% )

We can see that IPC drops significantly when down_read_trylock() is
called at high frequency. After switching to SRCU, the IPC returns to a
normal level.

Signed-off-by: Qi Zheng
Acked-by: Kirill Tkhai
Acked-by: Vlastimil Babka
---
 mm/vmscan.c | 45 ++++++++++++++++++++++++++-------------------
 1 file changed, 26 insertions(+), 19 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index db2ed6e08f67..ce7834030f75 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -222,8 +222,21 @@ static inline int shrinker_defer_size(int nr_items)
 static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
 						     int nid)
 {
-	return rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
-					 lockdep_is_held(&shrinker_rwsem));
+	return srcu_dereference_check(memcg->nodeinfo[nid]->shrinker_info,
+				      &shrinker_srcu,
+				      lockdep_is_held(&shrinker_rwsem));
+}
+
+static struct shrinker_info *shrinker_info_srcu(struct mem_cgroup *memcg,
+						     int nid)
+{
+	return srcu_dereference(memcg->nodeinfo[nid]->shrinker_info,
+				&shrinker_srcu);
+}
+
+static void free_shrinker_info_rcu(struct rcu_head *head)
+{
+	kvfree(container_of(head, struct shrinker_info, rcu));
 }
 
 static int expand_one_shrinker_info(struct mem_cgroup *memcg,
@@ -264,7 +277,7 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
 		       defer_size - old_defer_size);
 
 		rcu_assign_pointer(pn->shrinker_info, new);
-		kvfree_rcu(old, rcu);
+		call_srcu(&shrinker_srcu, &old->rcu, free_shrinker_info_rcu);
 	}
 
 	return 0;
@@ -350,15 +363,16 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
 {
 	if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
 		struct shrinker_info *info;
+		int srcu_idx;
 
-		rcu_read_lock();
-		info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
+		srcu_idx = srcu_read_lock(&shrinker_srcu);
+		info = shrinker_info_srcu(memcg, nid);
 		if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) {
 			/* Pairs with smp mb in shrink_slab() */
 			smp_mb__before_atomic();
 			set_bit(shrinker_id, info->map);
 		}
-		rcu_read_unlock();
+		srcu_read_unlock(&shrinker_srcu, srcu_idx);
 	}
 }
 
@@ -372,7 +386,6 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
 		return -ENOSYS;
 
 	down_write(&shrinker_rwsem);
-	/* This may call shrinker, so it must use down_read_trylock() */
 	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
 	if (id < 0)
 		goto unlock;
@@ -406,7 +419,7 @@ static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker,
 {
 	struct shrinker_info *info;
 
-	info = shrinker_info_protected(memcg, nid);
+	info = shrinker_info_srcu(memcg, nid);
 	return atomic_long_xchg(&info->nr_deferred[shrinker->id], 0);
 }
 
@@ -415,7 +428,7 @@ static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker,
 {
 	struct shrinker_info *info;
 
-	info = shrinker_info_protected(memcg, nid);
+	info = shrinker_info_srcu(memcg, nid);
 	return atomic_long_add_return(nr, &info->nr_deferred[shrinker->id]);
 }
 
@@ -893,15 +906,14 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 {
 	struct shrinker_info *info;
 	unsigned long ret, freed = 0;
+	int srcu_idx;
 	int i;
 
 	if (!mem_cgroup_online(memcg))
 		return 0;
 
-	if (!down_read_trylock(&shrinker_rwsem))
-		return 0;
-
-	info = shrinker_info_protected(memcg, nid);
+	srcu_idx = srcu_read_lock(&shrinker_srcu);
+	info = shrinker_info_srcu(memcg, nid);
 	if (unlikely(!info))
 		goto unlock;
 
@@ -951,14 +963,9 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
 				set_shrinker_bit(memcg, nid, i);
 		}
 		freed += ret;
-
-		if (rwsem_is_contended(&shrinker_rwsem)) {
-			freed = freed ? : 1;
-			break;
-		}
 	}
 unlock:
-	up_read(&shrinker_rwsem);
+	srcu_read_unlock(&shrinker_srcu, srcu_idx);
 	return freed;
 }
 #else /* CONFIG_MEMCG */
-- 
2.20.1