Date: Mon, 29 May 2023 05:51:11 -0700
From: "Paul E. McKenney"
Reply-To: paulmck@kernel.org
To: Qi Zheng
Cc: Kirill Tkhai, RCU, Yujie Liu, oe-lkp@lists.linux.dev, lkp@intel.com,
	linux-kernel@vger.kernel.org, Andrew Morton, Vlastimil Babka,
	Roman Gushchin, Christian König, David Hildenbrand, Davidlohr Bueso,
	Johannes Weiner, Michal Hocko, Muchun Song, Shakeel Butt, Yang Shi,
	linux-mm@kvack.org, ying.huang@intel.com, feng.tang@intel.com,
	fengwei.yin@intel.com
Subject: Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression
Message-ID: <095806f1-f7f0-4914-b04b-c874fb25bb83@paulmck-laptop>
References: <202305230837.db2c233f-yujie.liu@intel.com>
	<896bbb09-d400-ec73-ba3a-b64c6e9bbe46@linux.dev>
	<44407892-b7bc-4d6c-8e4a-6452f0ee88b9@paulmck-laptop>

On Mon, May 29, 2023 at 10:39:21AM +0800, Qi Zheng wrote:
> Hi Paul,
>
> On 2023/5/27 19:14, Paul E. McKenney wrote:
> > On Thu, May 25, 2023 at 12:03:16PM +0800, Qi Zheng wrote:
> > > On 2023/5/24 19:56, Qi Zheng wrote:
> > > > On 2023/5/24 19:08, Qi Zheng wrote:
> > > >
> > > > [...]
> > > >
> > > > >
> > > > > Well, I just ran the following command and reproduced the result:
> > > > >
> > > > > stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
> > > > >
> > > > > 1) with commit 42c9db3970483:
> > > > >
> > > > > stress-ng: info:  [11023] setting to a 60 second run per stressor
> > > > > stress-ng: info:  [11023] dispatching hogs: 9 ramfs
> > > > > stress-ng: info:  [11023] stressor       bogo ops real time  usr
> > > > > time sys time   bogo ops/s     bogo ops/s
> > > > > stress-ng: info:  [11023]                           (secs)    (secs)
> > > > > (secs)   (real time) (usr+sys time)
> > > > > stress-ng: info:  [11023] ramfs            774966     60.00
> > > > > 10.18 169.45     12915.89        4314.26
> > > > > stress-ng: info:  [11023] for a 60.00s run time:
> > > > > stress-ng: info:  [11023]    1920.11s available CPU time
> > > > > stress-ng: info:  [11023]      10.18s user time   (  0.53%)
> > > > > stress-ng: info:  [11023]     169.44s system time (  8.82%)
> > > > > stress-ng: info:  [11023]     179.62s total time  (  9.35%)
> > > > > stress-ng: info:  [11023] load average: 8.99 2.69 0.93
> > > > > stress-ng: info:  [11023] successful run completed in 60.00s (1 min,
> > > > > 0.00 secs)
> > > > >
> > > > > 2) with commit f95bdb700bc6b:
> > > > >
> > > > > stress-ng: info:  [37676] dispatching hogs: 9 ramfs
> > > > > stress-ng: info:  [37676] stressor       bogo ops real time  usr
> > > > > time sys time   bogo ops/s     bogo ops/s
> > > > > stress-ng: info:  [37676]                           (secs)    (secs)
> > > > > (secs)   (real time) (usr+sys time)
> > > > > stress-ng: info:  [37676] ramfs            168673     60.00
> > > > > 1.61   39.66      2811.08        4087.47
> > > > > stress-ng: info:  [37676] for a 60.10s run time:
> > > > > stress-ng: info:  [37676]    1923.36s available CPU time
> > > > > stress-ng: info:  [37676]       1.60s user time   (  0.08%)
> > > > > stress-ng: info:  [37676]      39.66s system time (  2.06%)
> > > > > stress-ng: info:  [37676]      41.26s total time  (  2.15%)
> > > > > stress-ng: info:  [37676] load average: 7.69 3.63 2.36
> > > > > stress-ng: info:  [37676] successful run completed in 60.10s (1 min,
> > > > > 0.10 secs)
> > > > >
> > > > > The bogo ops/s (real time) did drop significantly.
> > > > >
> > > > > And the memory reclaimation was not triggered in the whole process. so
> > > > > theoretically no one is in the read critical section of shrinker_srcu.
> > > > >
> > > > > Then I found that some stress-ng-ramfs processes were in
> > > > > TASK_UNINTERRUPTIBLE state for a long time:
> > > > >
> > > > > root       42313  0.0  0.0  69592  2068 pts/0    S    19:00   0:00
> > > > > stress-ng-ramfs [run]
> > > > > root       42314  0.0  0.0  69592  2068 pts/0    S    19:00   0:00
> > > > > stress-ng-ramfs [run]
> > > > > root       42315  0.0  0.0  69592  2068 pts/0    S    19:00   0:00
> > > > > stress-ng-ramfs [run]
> > > > > root       42316  0.0  0.0  69592  2068 pts/0    S    19:00   0:00
> > > > > stress-ng-ramfs [run]
> > > > > root       42317  7.8  0.0  69592  1812 pts/0    D    19:00   0:02
> > > > > stress-ng-ramfs [run]
> > > > > root       42318  0.0  0.0  69592  2068 pts/0    S    19:00   0:00
> > > > > stress-ng-ramfs [run]
> > > > > root       42319  7.8  0.0  69592  1812 pts/0    D    19:00   0:02
> > > > > stress-ng-ramfs [run]
> > > > > root       42320  0.0  0.0  69592  2068 pts/0    S    19:00   0:00
> > > > > stress-ng-ramfs [run]
> > > > > root       42321  7.8  0.0  69592  1812 pts/0    D    19:00   0:02
> > > > > stress-ng-ramfs [run]
> > > > > root       42322  0.0  0.0  69592  2068 pts/0    S    19:00   0:00
> > > > > stress-ng-ramfs [run]
> > > > > root       42323  7.8  0.0  69592  1812 pts/0    D    19:00   0:02
> > > > > stress-ng-ramfs [run]
> > > > > root       42324  0.0  0.0  69592  2068 pts/0    S    19:00   0:00
> > > > > stress-ng-ramfs [run]
> > > > > root       42325  7.8  0.0  69592  1812 pts/0    D    19:00   0:02
> > > > > stress-ng-ramfs [run]
> > > > > root       42326  0.0  0.0  69592  2068 pts/0    S    19:00   0:00
> > > > > stress-ng-ramfs [run]
> > > > > root       42327  7.9  0.0  69592  1812 pts/0    D    19:00   0:02
> > > > > stress-ng-ramfs [run]
> > > > > root       42328  7.9  0.0  69592  1812 pts/0    D    19:00   0:02
> > > > > stress-ng-ramfs [run]
> > > > > root       42329  7.9  0.0  69592  1812 pts/0    D    19:00   0:02
> > > > > stress-ng-ramfs [run]
> > > > > root       42330  7.9  0.0  69592  1556 pts/0    D    19:00   0:02
> > > > > stress-ng-ramfs [run]
> > > > >
> > > > > Their call stack is as follows:
> > > > >
> > > > > cat /proc/42330/stack
> > > > >
> > > > > [<0>] __synchronize_srcu.part.21+0x83/0xb0
> > > > > [<0>] unregister_shrinker+0x85/0xb0
> > > > > [<0>] deactivate_locked_super+0x27/0x70
> > > > > [<0>] cleanup_mnt+0xb8/0x140
> > > > > [<0>] task_work_run+0x65/0x90
> > > > > [<0>] exit_to_user_mode_prepare+0x1ba/0x1c0
> > > > > [<0>] syscall_exit_to_user_mode+0x1b/0x40
> > > > > [<0>] do_syscall_64+0x44/0x80
> > > > > [<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > > > >
> > > > > + RCU folks, Is this result as expected? I would have thought that
> > > > > synchronize_srcu() should return quickly if no one is in the read
> > > > > critical section. :(
> >
> > In theory, it would indeed be nice if synchronize_srcu() would do that.
> > In practice, the act of checking to see if there is anyone in an SRCU
> > read-side critical section is a heavy-weight operation, involving at
> > least one cache miss per CPU along with a number of full memory barriers.
> >
> > So SRCU has to be careful to not check too frequently.
>
> Got it.
> >
> > However, if SRCU has been idle for some time, normal synchronize_srcu()
> > will do an immediate check. And this will of course mark SRCU as
> > non-idle.
> >
> > > > With the following changes, ops/s can return to previous levels:
> > >
> > > Or just set rcu_expedited to 1:
> > > echo 1 > /sys/kernel/rcu_expedited
> >
> > This does cause SRCU to be much more aggressive. This can be a good
> > choice for small systems, but please keep in mind that this affects normal
> > RCU as well as SRCU. It will cause RCU to also be much more aggressive,
> > sending IPIs to CPUs that are (or might be) in RCU read-side critical
> > sections. Depending on your workload, this might or might not be what
> > you want RCU to be doing. For example, if you are running aggressive
> > real-time workloads, it most definitely is not what you want.
>
> Yeah, that's not what I want, a shrinker might run for a long time.
> > > >
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index db2ed6e08f67..90f541b07cd1 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -763,7 +763,7 @@ void unregister_shrinker(struct shrinker *shrinker)
> > > >         debugfs_entry = shrinker_debugfs_remove(shrinker);
> > > >         up_write(&shrinker_rwsem);
> > > >
> > > > -       synchronize_srcu(&shrinker_srcu);
> > > > +       synchronize_srcu_expedited(&shrinker_srcu);
> >
> > If shrinkers are unregistered only occasionally, this is an entirely
> > reasonable change.
> >
> > > >         debugfs_remove_recursive(debugfs_entry);
> > > >
> > > > stress-ng: info:  [13159] dispatching hogs: 9 ramfs
> > > > stress-ng: info:  [13159] stressor       bogo ops real time  usr time
> > > > sys time   bogo ops/s     bogo ops/s
> > > > stress-ng: info:  [13159]                           (secs)    (secs)
> > > > (secs)   (real time) (usr+sys time)
> > > > stress-ng: info:  [13159] ramfs            710062     60.00      9.63
> > > > 157.26     11834.18        4254.75
> > > > stress-ng: info:  [13159] for a 60.00s run time:
> > > > stress-ng: info:  [13159]    1920.14s available CPU time
> > > > stress-ng: info:  [13159]       9.62s user time   (  0.50%)
> > > > stress-ng: info:  [13159]     157.26s system time (  8.19%)
> > > > stress-ng: info:  [13159]     166.88s total time  (  8.69%)
> > > > stress-ng: info:  [13159] load average: 9.49 4.02 1.65
> > > > stress-ng: info:  [13159] successful run completed in 60.00s (1 min,
> > > > 0.00 secs)
> > > >
> > > > Can we make synchronize_srcu() call synchronize_srcu_expedited() when no
> > > > one is in the read critical section?
> >
> > Yes, in theory we could, but this would be a bad thing in practice.
> > After all, the point of having synchronize_srcu() be separate from
> > synchronize_srcu_expedited() is to allow uses that are OK with longer
> > latency avoid consuming too much CPU. In addition, that longer
> > SRCU grace-period latency allows the next grace period to handle more
> > synchronize_srcu() and call_srcu() requests. This amortizes the
> > overhead of that next grace period over a larger number of updates.
> >
> > However, your use of synchronize_srcu_expedited() does have that effect,
> > but only for this call point. Which has the advantage of avoiding
> > burning excessive quantities of CPU for the other 50+ call points.
>
> Thanks for such a detailed explanation.
>
> Now I think we can continue to try to complete the idea[1] from
> Kirill Tkhai. The patch moves heavy synchronize_srcu() to delayed
> work, so it doesn't affect on user-visible unregistration speed.
>
> [1]. https://lore.kernel.org/lkml/153365636747.19074.12610817307548583381.stgit@localhost.localdomain/

A blast from the past!  ;-)

But yes, moving the long-latency synchronize_srcu() off the user-visible
critical code path can be even better.

							Thanx, Paul
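
For reference, the three "bogo ops/s (real time)" figures quoted in this
thread can be put on a common scale. The sketch below is plain userspace C,
not kernel code; the helper names percent_change() and report() are
hypothetical and exist only for this illustration. The only inputs are the
numbers from the quoted stress-ng runs on Qi Zheng's test machine, which is
presumably why the result differs from the -88.8% in the subject line.

```c
#include <assert.h>
#include <stdio.h>

/* Percent change from a baseline rate to a new rate (negative = slowdown). */
double percent_change(double baseline, double current)
{
	assert(baseline > 0.0);
	return (current - baseline) / baseline * 100.0;
}

/* Print the quoted ramfs results relative to the pre-regression baseline. */
void report(void)
{
	const double base      = 12915.89; /* commit 42c9db3970483 */
	const double regressed =  2811.08; /* commit f95bdb700bc6b */
	const double expedited = 11834.18; /* synchronize_srcu_expedited() diff */

	printf("f95bdb700bc6b: %+.1f%%\n", percent_change(base, regressed));
	printf("expedited:     %+.1f%%\n", percent_change(base, expedited));
}
```

Calling report() from a main() gives roughly -78.2% for f95bdb700bc6b and
roughly -8.4% with the synchronize_srcu_expedited() change, so on that
machine the expedited variant recovers most of the lost throughput but not
quite all of it.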