From: Qi Zheng <zhengqi.arch@bytedance.com>
To: Sultan Alsawaf, paulmck@kernel.org
Cc: akpm@linux-foundation.org, tkhai@ya.ru, hannes@cmpxchg.org,
    shakeelb@google.com, mhocko@kernel.org, roman.gushchin@linux.dev,
    muchun.song@linux.dev, david@redhat.com, shy828301@gmail.com,
    dave@stgolabs.net, penguin-kernel@i-love.sakura.ne.jp,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 2/7] mm: vmscan: make global slab shrink lockless
Date: Fri, 24 Feb 2023 12:16:26 +0800
Message-ID: <00a212ee-1433-937f-1f15-f82e3137778c@bytedance.com>
In-Reply-To: <8049b6ed-435f-b518-f947-5516a514aec2@bytedance.com>
References: <20230223132725.11685-1-zhengqi.arch@bytedance.com>
 <20230223132725.11685-3-zhengqi.arch@bytedance.com>
 <8049b6ed-435f-b518-f947-5516a514aec2@bytedance.com>

On 2023/2/24 12:00, Qi Zheng wrote:
>
>
> On 2023/2/24 02:24, Sultan Alsawaf wrote:
>> On Thu, Feb 23, 2023 at 09:27:20PM +0800, Qi Zheng wrote:
>>> The shrinker_rwsem is a global lock in the shrinker subsystem,
>>> and it easily causes blocking in the following cases:
>>>
>>> a. the write lock of shrinker_rwsem is held for too long.
>>>     For example, there are many memcgs in the system, which
>>>     causes some paths to hold the lock while traversing them
>>>     for too long. (e.g. expand_shrinker_info())
>>> b. the read lock of shrinker_rwsem is held for too long,
>>>     and a writer comes along. That writer is then forced to
>>>     wait and blocks all subsequent readers.
>>>     For example:
>>>     - the task is scheduled out while the read lock of
>>>       shrinker_rwsem is held in do_shrink_slab()
>>>     - some shrinkers are blocked for too long. Like the case
>>>       mentioned in the patchset[1].
>>>
>>> Therefore, many times in history ([2],[3],[4],[5]), people
>>> wanted to replace the shrinker_rwsem readers with SRCU,
>>> but they all gave up because SRCU was not unconditionally
>>> enabled.
>>>
>>> But now, since commit 1cd0bd06093c ("rcu: Remove CONFIG_SRCU"),
>>> SRCU is unconditionally enabled. So it's time to use SRCU to
>>> protect the readers that previously held shrinker_rwsem.
>>>
>>> [1].
>>> https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomirov@virtuozzo.com/
>>> [2]. https://lore.kernel.org/all/1437080113.3596.2.camel@stgolabs.net/
>>> [3].
>>> https://lore.kernel.org/lkml/1510609063-3327-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp/
>>> [4].
>>> https://lore.kernel.org/lkml/153365347929.19074.12509495712735843805.stgit@localhost.localdomain/
>>> [5].
>>> https://lore.kernel.org/lkml/20210927074823.5825-1-sultan@kerneltoast.com/
>>>
>>> Signed-off-by: Qi Zheng
>>> ---
>>>   mm/vmscan.c | 27 +++++++++++----------------
>>>   1 file changed, 11 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 9f895ca6216c..02987a6f95d1 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -202,6 +202,7 @@ static void set_task_reclaim_state(struct task_struct *task,
>>>   LIST_HEAD(shrinker_list);
>>>   DECLARE_RWSEM(shrinker_rwsem);
>>> +DEFINE_SRCU(shrinker_srcu);
>>>   #ifdef CONFIG_MEMCG
>>>   static int shrinker_nr_max;
>>> @@ -706,7 +707,7 @@ void free_prealloced_shrinker(struct shrinker *shrinker)
>>>   void register_shrinker_prepared(struct shrinker *shrinker)
>>>   {
>>>       down_write(&shrinker_rwsem);
>>> -    list_add_tail(&shrinker->list, &shrinker_list);
>>> +    list_add_tail_rcu(&shrinker->list, &shrinker_list);
>>>       shrinker->flags |= SHRINKER_REGISTERED;
>>>       shrinker_debugfs_add(shrinker);
>>>       up_write(&shrinker_rwsem);
>>> @@ -760,13 +761,15 @@ void unregister_shrinker(struct shrinker *shrinker)
>>>           return;
>>>       down_write(&shrinker_rwsem);
>>> -    list_del(&shrinker->list);
>>> +    list_del_rcu(&shrinker->list);
>>>       shrinker->flags &= ~SHRINKER_REGISTERED;
>>>       if (shrinker->flags & SHRINKER_MEMCG_AWARE)
>>>           unregister_memcg_shrinker(shrinker);
>>>       debugfs_entry = shrinker_debugfs_remove(shrinker);
>>>       up_write(&shrinker_rwsem);
>>> +    synchronize_srcu(&shrinker_srcu);
>>> +
>>>       debugfs_remove_recursive(debugfs_entry);
>>>       kfree(shrinker->nr_deferred);
>>> @@ -786,6 +789,7 @@ void synchronize_shrinkers(void)
>>>   {
>>>       down_write(&shrinker_rwsem);
>>>       up_write(&shrinker_rwsem);
>>> +    synchronize_srcu(&shrinker_srcu);
>>>   }
>>>   EXPORT_SYMBOL(synchronize_shrinkers);
>>> @@ -996,6 +1000,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>   {
>>>       unsigned long ret, freed = 0;
>>>       struct shrinker *shrinker;
>>> +    int srcu_idx;
>>>       /*
>>>        * The root memcg might be allocated even though memcg is disabled
>>> @@ -1007,10 +1012,10 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>       if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>           return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>> -    if (!down_read_trylock(&shrinker_rwsem))
>>> -        goto out;
>>> +    srcu_idx = srcu_read_lock(&shrinker_srcu);
>>> -    list_for_each_entry(shrinker, &shrinker_list, list) {
>>> +    list_for_each_entry_srcu(shrinker, &shrinker_list, list,
>>> +                 srcu_read_lock_held(&shrinker_srcu)) {
>>>           struct shrink_control sc = {
>>>               .gfp_mask = gfp_mask,
>>>               .nid = nid,
>>> @@ -1021,19 +1026,9 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>           if (ret == SHRINK_EMPTY)
>>>               ret = 0;
>>>           freed += ret;
>>> -        /*
>>> -         * Bail out if someone want to register a new shrinker to
>>> -         * prevent the registration from being stalled for long periods
>>> -         * by parallel ongoing shrinking.
>>> -         */
>>> -        if (rwsem_is_contended(&shrinker_rwsem)) {
>>> -            freed = freed ? : 1;
>>> -            break;
>>> -        }
>>>       }
>>> -    up_read(&shrinker_rwsem);
>>> -out:
>>> +    srcu_read_unlock(&shrinker_srcu, srcu_idx);
>>>       cond_resched();
>>>       return freed;
>>>   }
>>> --
>>> 2.20.1
>>>
>>>
>>
>> Hi Qi,
>>
>> A different problem I realized after my old attempt to use SRCU was
>> that the unregister_shrinker() path became quite slow due to the
>> heavy synchronize_srcu() call. Both register_shrinker() *and*
>> unregister_shrinker() are called frequently these days, and SRCU is
>> too unfair to the unregister path IMO.
>
> Hi Sultan,
>
> IIUC, for unregister_shrinker(), the wait time with SRCU is hardly any
> longer than it was with shrinker_rwsem before.
>
> And I just did a simple test. After using the script in the cover
> letter to make shrink_slab() a hotspot, I ran umount 1k times
> concurrently, and then used bpftrace to measure the latency of
> unregister_shrinker() as follows:
>
> bpftrace -e 'kprobe:unregister_shrinker { @start[tid] = nsecs; }
> kretprobe:unregister_shrinker /@start[tid]/ { @ns[comm] = hist(nsecs -
> @start[tid]); delete(@start[tid]); }'
>
> @ns[umount]:
> [16K, 32K)             3 |                                                    |
> [32K, 64K)            66 |@@@@@@@@@@                                          |
> [64K, 128K)           32 |@@@@@                                               |
> [128K, 256K)          22 |@@@                                                 |
> [256K, 512K)          48 |@@@@@@@                                             |
> [512K, 1M)            19 |@@@                                                 |
> [1M, 2M)             131 |@@@@@@@@@@@@@@@@@@@@@                               |
> [2M, 4M)             313 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [4M, 8M)             302 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |
> [8M, 16M)             55 |@@@@@@@@@                                           |
>
> I see that the worst case for unregister_shrinker() is between 8ms and
> 16ms, which feels tolerable?

And when I use the synchronize_srcu_expedited() that Paul mentioned, the
measured latency decreases much more noticeably:

@ns[umount]:
[16K, 32K)           105 |@@@@@@@@@@                                          |
[32K, 64K)           521 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64K, 128K)          119 |@@@@@@@@@@@                                         |
[128K, 256K)          32 |@@@                                                 |
[256K, 512K)          70 |@@@@@@                                              |
[512K, 1M)            49 |@@@@                                                |
[1M, 2M)              34 |@@@                                                 |
[2M, 4M)              18 |@                                                   |
[4M, 8M)               4 |                                                    |

>
> Thanks,
> Qi
>
>>
>> Although I never got around to submitting it, I made a non-SRCU
>> solution [1] that uses fine-grained locking instead, which is fair to
>> both the register path and the unregister path. (The patch I've
>> linked is a version of this adapted to an older 4.14 kernel FYI, but
>> it can be reworked for the current kernel.)
>>
>> What do you think about the fine-grained locking approach?
>>
>> Thanks,
>> Sultan
>>
>> [1]
>> https://github.com/kerneltoast/android_kernel_google_floral/commit/012378f3173a82d2333d3ae7326691544301e76a
>>
>

--
Thanks,
Qi
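
As a concrete reference for the second histogram above, here is a minimal
sketch of the variant being measured: the unregister_shrinker() from the
quoted v2 diff, with the plain synchronize_srcu() swapped for
synchronize_srcu_expedited(). This is only an illustrative reconstruction,
not code posted anywhere in this thread, and it assumes the shrinker_srcu
and shrinker_rwsem declarations introduced by the quoted patch.

/*
 * Sketch only: the unregister path from the quoted v2 diff, with the
 * plain synchronize_srcu() replaced by the expedited variant that the
 * second histogram measures. Not the code that was actually posted.
 */
void unregister_shrinker(struct shrinker *shrinker)
{
    struct dentry *debugfs_entry;

    if (!(shrinker->flags & SHRINKER_REGISTERED))
        return;

    /* Writers still serialize registration state under the rwsem. */
    down_write(&shrinker_rwsem);
    list_del_rcu(&shrinker->list);
    shrinker->flags &= ~SHRINKER_REGISTERED;
    if (shrinker->flags & SHRINKER_MEMCG_AWARE)
        unregister_memcg_shrinker(shrinker);
    debugfs_entry = shrinker_debugfs_remove(shrinker);
    up_write(&shrinker_rwsem);

    /*
     * Expedited grace period: still waits for every SRCU reader that
     * might be walking the shrinker list, but drives the grace-period
     * machinery harder so the wait completes sooner, at some extra
     * CPU cost.
     */
    synchronize_srcu_expedited(&shrinker_srcu);

    debugfs_remove_recursive(debugfs_entry);

    kfree(shrinker->nr_deferred);
    shrinker->nr_deferred = NULL;
}

That expedited wait is what shifts the bulk of the umount samples from the
[2M, 8M) ns buckets in the first histogram down to the [32K, 64K) ns bucket
in the second.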
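
Sultan's fine-grained locking patch is only linked above, not quoted, so
the following is purely an illustration of the general idea being asked
about and deliberately not a description of that linked 4.14 patch. It
assumes a hypothetical per-shrinker rwsem field in struct shrinker, plus
the hypothetical helper names shrink_one_locked() and
unregister_one_locked():

/* Reclaim serializes only against this one shrinker's unregister. */
static unsigned long shrink_one_locked(struct shrinker *shrinker,
                                       struct shrink_control *sc,
                                       int priority)
{
    unsigned long ret = 0;

    if (!down_read_trylock(&shrinker->rwsem))    /* hypothetical field */
        return 0;
    if (shrinker->flags & SHRINKER_REGISTERED)
        ret = do_shrink_slab(sc, shrinker, priority);
    up_read(&shrinker->rwsem);
    return ret;
}

/* Unregister waits only for users of *this* shrinker. */
static void unregister_one_locked(struct shrinker *shrinker)
{
    down_write(&shrinker->rwsem);
    shrinker->flags &= ~SHRINKER_REGISTERED;
    up_write(&shrinker->rwsem);
    /* list removal and nr_deferred teardown elided */
}

The trade-off against the SRCU approach is that the read side is no longer
lock-free, but an unregister only waits for reclaim that is currently
inside the same shrinker rather than for a global grace period.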