From: Yang Shi <shy828301@gmail.com>
To: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org
Cc: shy828301@gmail.com, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 2/2] mm: vmscan: remove shrinker's nr_deferred
Date: Wed, 16 Sep 2020 11:58:23 -0700
Message-Id: <20200916185823.5347-3-shy828301@gmail.com>
In-Reply-To: <20200916185823.5347-1-shy828301@gmail.com>
References: <20200916185823.5347-1-shy828301@gmail.com>
X-Mailer: git-send-email 2.26.2
MIME-Version: 1.0

Recently a huge one-off slab drop was seen on some vfs-metadata-heavy
workloads; it turned out the shrinker saw a huge number of accumulated
nr_deferred objects.

I managed to reproduce the problem with a kernel build workload plus a
negative dentry generator.

First, run the below kernel build test script:

NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`

cd /root/Buildarea/linux-stable

for i in `seq 1500`; do
        cgcreate -g memory:kern_build
        echo 4G > /sys/fs/cgroup/memory/kern_build/memory.limit_in_bytes

        echo 3 > /proc/sys/vm/drop_caches
        cgexec -g memory:kern_build make clean > /dev/null 2>&1
        cgexec -g memory:kern_build make -j$NR_CPUS > /dev/null 2>&1

        cgdelete -g memory:kern_build
done

That generates a huge number of deferred objects due to __GFP_NOFS
allocations.

Then run the below negative dentry generator script:

NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`

mkdir /sys/fs/cgroup/memory/test
echo $$ > /sys/fs/cgroup/memory/test/tasks

for i in `seq $NR_CPUS`; do
        while true; do
                FILE=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 64`
                cat $FILE 2>/dev/null
        done &
done

kswapd then shrinks half of the dentry cache in just one loop, as the below tracing result shows:

kswapd0-475 [028] .... 305968.252561: mm_shrink_slab_start: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0 objects to shrink 4994376020 gfp_flags GFP_KERNEL cache items 93689873 delta 45746 total_scan 46844936 priority 12
kswapd0-475 [021] .... 306013.099399: mm_shrink_slab_end: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0 unused scan count 4994376020 new scan count 4947576838 total_scan 8 last shrinker return val 46844928

There was a huge number of deferred objects before the shrinker was
called; the behavior does match the code, but it is probably not
desirable from the user's point of view.
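To make the windup arithmetic concrete, here is a rough, self-contained
userspace sketch. It is not the kernel code itself; it simply replays the
pre-patch do_shrink_slab() math with the numbers from the trace above and
assumes seeks == DEFAULT_SEEKS (2) for the superblock shrinker:

#include <stdio.h>

int main(void)
{
	/* numbers taken from the mm_shrink_slab_start trace above */
	long long nr = 4994376020LL;	/* accumulated nr_deferred ("objects to shrink") */
	long long freeable = 93689873;	/* "cache items" */
	int priority = 12;
	int seeks = 2;			/* assumed DEFAULT_SEEKS */

	/* pre-patch: delta from priority, plus the whole deferred backlog */
	long long delta = (freeable >> priority) * 4 / seeks;
	long long total_scan = nr + delta;

	/*
	 * The windup limiter only caps total_scan at freeable / 2 when
	 * delta is small, so a huge deferred count still ends up scanning
	 * half of the cache in a single pass.
	 */
	if (delta < freeable / 4 && total_scan > freeable / 2)
		total_scan = freeable / 2;
	if (total_scan > freeable * 2)
		total_scan = freeable * 2;

	/* prints delta=45746 total_scan=46844936, matching the trace */
	printf("delta=%lld total_scan=%lld\n", delta, total_scan);
	return 0;
}

Without the accumulated nr_deferred, total_scan would simply be delta
(~45746 objects), i.e. the pressure implied by the reclaim priority
alone, which is what the code below ends up computing.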
IIUC the deferred objects were used to balance slab against page cache,
but since commit 9092c71bb724dba2ecba849eae69e5c9d39bd3d2 ("mm: use
sc->priority for slab shrink targets") the two have been decoupled; as
that commit stated, "these two things have nothing to do with each
other". So why do we still have to keep it around?

One concern is that huge amounts of slab could accumulate without the
deferred objects being taken into account, but nowadays most workloads
are constrained by memcg, which limits kmem usage by default, so
maintaining the deferred objects does not seem that useful anymore.
Removing them would simplify the shrinker logic a lot.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/shrinker.h      |  2 -
 include/trace/events/vmscan.h | 11 ++---
 mm/vmscan.c                   | 91 +++--------------------------------
 3 files changed, 12 insertions(+), 92 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 0f80123650e2..db1d3e7d098e 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -73,8 +73,6 @@ struct shrinker {
 	/* ID in shrinker_idr */
 	int id;
 #endif
-	/* objs pending delete, per node */
-	atomic_long_t *nr_deferred;
 };
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 27f268bbeba4..130abf781790 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -184,10 +184,10 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_re
 
 TRACE_EVENT(mm_shrink_slab_start,
 	TP_PROTO(struct shrinker *shr, struct shrink_control *sc,
-		unsigned long cache_items, unsigned long long delta,
-		unsigned long total_scan, int priority),
+		unsigned long cache_items, unsigned long total_scan,
+		int priority),
 
-	TP_ARGS(shr, sc, cache_items, delta, total_scan, priority),
+	TP_ARGS(shr, sc, cache_items, total_scan, priority),
 
 	TP_STRUCT__entry(
 		__field(struct shrinker *, shr)
@@ -195,7 +195,6 @@ TRACE_EVENT(mm_shrink_slab_start,
 		__field(int, nid)
 		__field(gfp_t, gfp_flags)
 		__field(unsigned long, cache_items)
-		__field(unsigned long long, delta)
 		__field(unsigned long, total_scan)
 		__field(int, priority)
 	),
@@ -206,18 +205,16 @@ TRACE_EVENT(mm_shrink_slab_start,
 		__entry->nid = sc->nid;
 		__entry->gfp_flags = sc->gfp_mask;
 		__entry->cache_items = cache_items;
-		__entry->delta = delta;
 		__entry->total_scan = total_scan;
 		__entry->priority = priority;
 	),
 
-	TP_printk("%pS %p: nid: %d gfp_flags %s cache items %ld delta %lld total_scan %ld priority %d",
+	TP_printk("%pS %p: nid: %d gfp_flags %s cache items %ld total_scan %ld priority %d",
		__entry->shrink,
		__entry->shr,
		__entry->nid,
		show_gfp_flags(__entry->gfp_flags),
		__entry->cache_items,
-		__entry->delta,
		__entry->total_scan,
		__entry->priority)
 );
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 48ebea97f12f..5223131a20d0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -336,38 +336,21 @@ unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone
  */
 int prealloc_shrinker(struct shrinker *shrinker)
 {
-	unsigned int size = sizeof(*shrinker->nr_deferred);
-
-	if (shrinker->flags & SHRINKER_NUMA_AWARE)
-		size *= nr_node_ids;
-
-	shrinker->nr_deferred = kzalloc(size, GFP_KERNEL);
-	if (!shrinker->nr_deferred)
-		return -ENOMEM;
-
 	if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
 		if (prealloc_memcg_shrinker(shrinker))
-			goto free_deferred;
+			goto nomem;
 	}
 
 	return 0;
 
-free_deferred:
-	kfree(shrinker->nr_deferred);
-	shrinker->nr_deferred = NULL;
+nomem:
 	return -ENOMEM;
 }
 
 void free_prealloced_shrinker(struct shrinker *shrinker)
 {
-	if (!shrinker->nr_deferred)
-		return;
-
 	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
 		unregister_memcg_shrinker(shrinker);
-
-	kfree(shrinker->nr_deferred);
-	shrinker->nr_deferred = NULL;
 }
 
 void register_shrinker_prepared(struct shrinker *shrinker)
@@ -397,15 +380,11 @@ EXPORT_SYMBOL(register_shrinker);
  */
 void unregister_shrinker(struct shrinker *shrinker)
 {
-	if (!shrinker->nr_deferred)
-		return;
 	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
 		unregister_memcg_shrinker(shrinker);
 	down_write(&shrinker_rwsem);
 	list_del(&shrinker->list);
 	up_write(&shrinker_rwsem);
-	kfree(shrinker->nr_deferred);
-	shrinker->nr_deferred = NULL;
 }
 EXPORT_SYMBOL(unregister_shrinker);
 
@@ -415,15 +394,11 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 				    struct shrinker *shrinker, int priority)
 {
 	unsigned long freed = 0;
-	unsigned long long delta;
 	long total_scan;
 	long freeable;
-	long nr;
-	long new_nr;
 	int nid = shrinkctl->nid;
 	long batch_size = shrinker->batch ? shrinker->batch
 					  : SHRINK_BATCH;
-	long scanned = 0, next_deferred;
 
 	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
 		nid = 0;
@@ -432,61 +407,27 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 	if (freeable == 0 || freeable == SHRINK_EMPTY)
 		return freeable;
 
-	/*
-	 * copy the current shrinker scan count into a local variable
-	 * and zero it so that other concurrent shrinker invocations
-	 * don't also do this scanning work.
-	 */
-	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
-
-	total_scan = nr;
 	if (shrinker->seeks) {
-		delta = freeable >> priority;
-		delta *= 4;
-		do_div(delta, shrinker->seeks);
+		total_scan = freeable >> priority;
+		total_scan *= 4;
+		do_div(total_scan, shrinker->seeks);
 	} else {
 		/*
 		 * These objects don't require any IO to create. Trim
 		 * them aggressively under memory pressure to keep
 		 * them from causing refetches in the IO caches.
 		 */
-		delta = freeable / 2;
+		total_scan = freeable / 2;
 	}
 
-	total_scan += delta;
 	if (total_scan < 0) {
 		pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
 		       shrinker->scan_objects, total_scan);
 		total_scan = freeable;
-		next_deferred = nr;
-	} else
-		next_deferred = total_scan;
-
-	/*
-	 * We need to avoid excessive windup on filesystem shrinkers
-	 * due to large numbers of GFP_NOFS allocations causing the
-	 * shrinkers to return -1 all the time. This results in a large
-	 * nr being built up so when a shrink that can do some work
-	 * comes along it empties the entire cache due to nr >>>
-	 * freeable. This is bad for sustaining a working set in
-	 * memory.
-	 *
-	 * Hence only allow the shrinker to scan the entire cache when
-	 * a large delta change is calculated directly.
-	 */
-	if (delta < freeable / 4)
-		total_scan = min(total_scan, freeable / 2);
-
-	/*
-	 * Avoid risking looping forever due to too large nr value:
-	 * never try to free more than twice the estimate number of
-	 * freeable entries.
-	 */
-	if (total_scan > freeable * 2)
-		total_scan = freeable * 2;
+	}
 
 	trace_mm_shrink_slab_start(shrinker, shrinkctl,
-				   freeable, delta, total_scan, priority);
+				   freeable, total_scan, priority);
 
 	/*
 	 * Normally, we should not scan less than batch_size objects in one
@@ -517,26 +458,10 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 
 		count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned);
 		total_scan -= shrinkctl->nr_scanned;
-		scanned += shrinkctl->nr_scanned;
 
 		cond_resched();
 	}
 
-	if (next_deferred >= scanned)
-		next_deferred -= scanned;
-	else
-		next_deferred = 0;
-	/*
-	 * move the unused scan count back into the shrinker in a
-	 * manner that handles concurrent updates. If we exhausted the
-	 * scan, there is no need to do an update.
-	 */
-	if (next_deferred > 0)
-		new_nr = atomic_long_add_return(next_deferred,
-						&shrinker->nr_deferred[nid]);
-	else
-		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
-
 	trace_mm_shrink_slab_end(shrinker, nid, freed, total_scan);
 	return freed;
 }
-- 
2.26.2