From: Qi Zheng <qi.zheng@linux.dev>
Date: Mon, 29 May 2023 10:39:21 +0800
Subject: Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression
To: paulmck@kernel.org, Kirill Tkhai
Cc: RCU, Yujie Liu, oe-lkp@lists.linux.dev, lkp@intel.com,
 linux-kernel@vger.kernel.org, Andrew Morton, Vlastimil Babka,
 Roman Gushchin, Christian König, David Hildenbrand, Davidlohr Bueso,
 Johannes Weiner, Michal Hocko, Muchun Song, Shakeel Butt, Yang Shi,
 linux-mm@kvack.org, ying.huang@intel.com, feng.tang@intel.com,
 fengwei.yin@intel.com
In-Reply-To: <44407892-b7bc-4d6c-8e4a-6452f0ee88b9@paulmck-laptop>
References: <202305230837.db2c233f-yujie.liu@intel.com>
 <896bbb09-d400-ec73-ba3a-b64c6e9bbe46@linux.dev>
 <44407892-b7bc-4d6c-8e4a-6452f0ee88b9@paulmck-laptop>

Hi Paul,

On 2023/5/27 19:14, Paul E. McKenney wrote:
> On Thu, May 25, 2023 at 12:03:16PM +0800, Qi Zheng wrote:
>> On 2023/5/24 19:56, Qi Zheng wrote:
>>> On 2023/5/24 19:08, Qi Zheng wrote:
>>>
>>> [...]
>>>
>>>>
>>>> Well, I just ran the following command and reproduced the result:
>>>>
>>>> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
>>>>
>>>> 1) with commit 42c9db3970483:
>>>>
>>>> stress-ng: info:  [11023] setting to a 60 second run per stressor
>>>> stress-ng: info:  [11023] dispatching hogs: 9 ramfs
>>>> stress-ng: info:  [11023] stressor       bogo ops  real time  usr time  sys time   bogo ops/s     bogo ops/s
>>>> stress-ng: info:  [11023]                            (secs)     (secs)    (secs)   (real time) (usr+sys time)
>>>> stress-ng: info:  [11023] ramfs            774966     60.00      10.18    169.45     12915.89        4314.26
>>>> stress-ng: info:  [11023] for a 60.00s run time:
>>>> stress-ng: info:  [11023]    1920.11s available CPU time
>>>> stress-ng: info:  [11023]      10.18s user time   (  0.53%)
>>>> stress-ng: info:  [11023]     169.44s system time (  8.82%)
>>>> stress-ng: info:  [11023]     179.62s total time  (  9.35%)
>>>> stress-ng: info:  [11023] load average: 8.99 2.69 0.93
>>>> stress-ng: info:  [11023] successful run completed in 60.00s (1 min, 0.00 secs)
>>>>
>>>> 2) with commit f95bdb700bc6b:
>>>>
>>>> stress-ng: info:  [37676] dispatching hogs: 9 ramfs
>>>> stress-ng: info:  [37676] stressor       bogo ops  real time  usr time  sys time   bogo ops/s     bogo ops/s
>>>> stress-ng: info:  [37676]                            (secs)     (secs)    (secs)   (real time) (usr+sys time)
>>>> stress-ng: info:  [37676] ramfs            168673     60.00       1.61     39.66      2811.08        4087.47
>>>> stress-ng: info:  [37676] for a 60.10s run time:
>>>> stress-ng: info:  [37676]    1923.36s available CPU time
>>>> stress-ng: info:  [37676]       1.60s user time   (  0.08%)
>>>> stress-ng: info:  [37676]      39.66s system time (  2.06%)
>>>> stress-ng: info:  [37676]      41.26s total time  (  2.15%)
>>>> stress-ng: info:  [37676] load average: 7.69 3.63 2.36
>>>> stress-ng: info:  [37676] successful run completed in 60.10s (1 min, 0.10 secs)
>>>>
>>>> The bogo ops/s (real time) did drop significantly.
>>>>
>>>> And memory reclamation was not triggered during the whole run, so
>>>> theoretically no one is in the read-side critical section of
>>>> shrinker_srcu.
>>>>
>>>> Then I found that some stress-ng-ramfs processes were in
>>>> TASK_UNINTERRUPTIBLE state for a long time:
>>>>
>>>> root       42313  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 stress-ng-ramfs [run]
>>>> root       42314  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 stress-ng-ramfs [run]
>>>> root       42315  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 stress-ng-ramfs [run]
>>>> root       42316  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 stress-ng-ramfs [run]
>>>> root       42317  7.8  0.0  69592  1812 pts/0    D    19:00   0:02 stress-ng-ramfs [run]
>>>> root       42318  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 stress-ng-ramfs [run]
>>>> root       42319  7.8  0.0  69592  1812 pts/0    D    19:00   0:02 stress-ng-ramfs [run]
>>>> root       42320  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 stress-ng-ramfs [run]
>>>> root       42321  7.8  0.0  69592  1812 pts/0    D    19:00   0:02 stress-ng-ramfs [run]
>>>> root       42322  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 stress-ng-ramfs [run]
>>>> root       42323  7.8  0.0  69592  1812 pts/0    D    19:00   0:02 stress-ng-ramfs [run]
>>>> root       42324  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 stress-ng-ramfs [run]
>>>> root       42325  7.8  0.0  69592  1812 pts/0    D    19:00   0:02 stress-ng-ramfs [run]
>>>> root       42326  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 stress-ng-ramfs [run]
>>>> root       42327  7.9  0.0  69592  1812 pts/0    D    19:00   0:02 stress-ng-ramfs [run]
>>>> root       42328  7.9  0.0  69592  1812 pts/0    D    19:00   0:02 stress-ng-ramfs [run]
>>>> root       42329  7.9  0.0  69592  1812 pts/0    D    19:00   0:02 stress-ng-ramfs [run]
>>>> root       42330  7.9  0.0  69592  1556 pts/0    D    19:00   0:02 stress-ng-ramfs [run]
>>>>
>>>> Their call stack is as follows:
>>>>
>>>> cat /proc/42330/stack
>>>>
>>>> [<0>] __synchronize_srcu.part.21+0x83/0xb0
>>>> [<0>] unregister_shrinker+0x85/0xb0
>>>> [<0>] deactivate_locked_super+0x27/0x70
>>>> [<0>] cleanup_mnt+0xb8/0x140
>>>> [<0>] task_work_run+0x65/0x90
>>>> [<0>] exit_to_user_mode_prepare+0x1ba/0x1c0
>>>> [<0>] syscall_exit_to_user_mode+0x1b/0x40
>>>> [<0>] do_syscall_64+0x44/0x80
>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>>>
>>>> + RCU folks, is this result as expected? I would have thought that
>>>> synchronize_srcu() should return quickly if no one is in the
>>>> read-side critical section. :(
>
> In theory, it would indeed be nice if synchronize_srcu() would do that.
> In practice, the act of checking to see if there is anyone in an SRCU
> read-side critical section is a heavy-weight operation, involving at
> least one cache miss per CPU along with a number of full memory barriers.
>
> So SRCU has to be careful to not check too frequently.

Got it.

> However, if SRCU has been idle for some time, normal synchronize_srcu()
> will do an immediate check.  And this will of course mark SRCU as
> non-idle.
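(For anyone skimming the thread: the pattern we are discussing is
roughly the sketch below. shrinker_srcu is the real srcu_struct in
mm/vmscan.c, but reader_side()/writer_side() are simplified stand-ins
for shrink_slab() and unregister_shrinker(), not the actual code.)

#include <linux/srcu.h>

DEFINE_SRCU(shrinker_srcu);

/* Reader side, as in shrink_slab(): entering and leaving the
 * read-side critical section is a cheap per-CPU operation. */
static void reader_side(void)
{
	int idx = srcu_read_lock(&shrinker_srcu);
	/* ... walk shrinker_list and call do_shrink_slab() ... */
	srcu_read_unlock(&shrinker_srcu, idx);
}

/* Writer side, as in unregister_shrinker(): this is the call that
 * shows up in the stack trace above.  It must wait for a full grace
 * period, and as Paul explains, even with no readers a non-expedited
 * grace period is not checked immediately unless SRCU has been idle. */
static void writer_side(struct shrinker *shrinker)
{
	/* ... unlink the shrinker under shrinker_rwsem ... */
	synchronize_srcu(&shrinker_srcu);
	/* no reader can still observe the shrinker; safe to free */
}

The reader side stays cheap precisely because all of the cost is pushed
onto the writer's grace-period wait, which is what we are measuring here.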
>
>>> With the following changes, ops/s can return to previous levels:
>>
>> Or just set rcu_expedited to 1:
>>
>> echo 1 > /sys/kernel/rcu_expedited
>
> This does cause SRCU to be much more aggressive.  This can be a good
> choice for small systems, but please keep in mind that this affects
> normal RCU as well as SRCU.  It will cause RCU to also be much more
> aggressive, sending IPIs to CPUs that are (or might be) in RCU
> read-side critical sections.  Depending on your workload, this might
> or might not be what you want RCU to be doing.  For example, if you
> are running aggressive real-time workloads, it most definitely is not
> what you want.
Yeah, that's not what I want; a shrinker might run for a long time.

>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index db2ed6e08f67..90f541b07cd1 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -763,7 +763,7 @@ void unregister_shrinker(struct shrinker *shrinker)
>>>         debugfs_entry = shrinker_debugfs_remove(shrinker);
>>>         up_write(&shrinker_rwsem);
>>>
>>> -       synchronize_srcu(&shrinker_srcu);
>>> +       synchronize_srcu_expedited(&shrinker_srcu);
>
> If shrinkers are unregistered only occasionally, this is an entirely
> reasonable change.
>
>>>         debugfs_remove_recursive(debugfs_entry);
>>>
>>> stress-ng: info:  [13159] dispatching hogs: 9 ramfs
>>> stress-ng: info:  [13159] stressor       bogo ops  real time  usr time  sys time   bogo ops/s     bogo ops/s
>>> stress-ng: info:  [13159]                            (secs)     (secs)    (secs)   (real time) (usr+sys time)
>>> stress-ng: info:  [13159] ramfs            710062     60.00       9.63    157.26     11834.18        4254.75
>>> stress-ng: info:  [13159] for a 60.00s run time:
>>> stress-ng: info:  [13159]    1920.14s available CPU time
>>> stress-ng: info:  [13159]       9.62s user time   (  0.50%)
>>> stress-ng: info:  [13159]     157.26s system time (  8.19%)
>>> stress-ng: info:  [13159]     166.88s total time  (  8.69%)
>>> stress-ng: info:  [13159] load average: 9.49 4.02 1.65
>>> stress-ng: info:  [13159] successful run completed in 60.00s (1 min, 0.00 secs)
>>>
>>> Can we make synchronize_srcu() call synchronize_srcu_expedited() when
>>> no one is in the read-side critical section?
>
> Yes, in theory we could, but this would be a bad thing in practice.
> After all, the point of having synchronize_srcu() be separate from
> synchronize_srcu_expedited() is to allow uses that are OK with longer
> latency to avoid consuming too much CPU.  In addition, that longer
> SRCU grace-period latency allows the next grace period to handle more
> synchronize_srcu() and call_srcu() requests.  This amortizes the
> overhead of that next grace period over a larger number of updates.
>
> However, your use of synchronize_srcu_expedited() does have that
> effect, but only for this call point, which has the advantage of
> avoiding burning excessive quantities of CPU for the other 50+ call
> points.

Thanks for such a detailed explanation. Now I think we can continue
trying to complete the idea[1] from Kirill Tkhai. That patch moves the
heavy synchronize_srcu() into delayed work, so it doesn't affect the
user-visible unregistration speed.

[1]. https://lore.kernel.org/lkml/153365636747.19074.12610817307548583381.stgit@localhost.localdomain/

Thanks,
Qi

>
> 						Thanx, Paul
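P.S. For concreteness, the delayed-work idea from [1] would look
roughly like the sketch below. This is only an illustration:
free_node, the shrinker_free_* names, and the locking are hypothetical
stand-ins, not code taken from Kirill's actual patch.

#include <linux/shrinker.h>
#include <linux/srcu.h>
#include <linux/workqueue.h>

/* free_node is a hypothetical second list_head in struct shrinker, so
 * the RCU-visible ->list linkage stays intact for concurrent readers
 * until the grace period has elapsed. */
static LIST_HEAD(shrinker_free_list);
static DEFINE_SPINLOCK(shrinker_free_lock);

static void shrinker_free_work_fn(struct work_struct *work)
{
	LIST_HEAD(to_free);

	spin_lock(&shrinker_free_lock);
	list_splice_init(&shrinker_free_list, &to_free);
	spin_unlock(&shrinker_free_lock);

	/* One grace period is amortized over every shrinker queued
	 * since the last flush.  shrinker_srcu is the srcu_struct
	 * defined in mm/vmscan.c by the series under discussion. */
	synchronize_srcu(&shrinker_srcu);

	/* ... now free nr_deferred etc. for each entry on to_free ... */
}
static DECLARE_WORK(shrinker_free_work, shrinker_free_work_fn);

void unregister_shrinker(struct shrinker *shrinker)
{
	/* unlink from shrinker_list under shrinker_rwsem, as today,
	 * then defer the expensive grace-period wait: */
	spin_lock(&shrinker_free_lock);
	list_add(&shrinker->free_node, &shrinker_free_list);
	spin_unlock(&shrinker_free_lock);

	schedule_work(&shrinker_free_work);	/* returns immediately */
}

This keeps unregister_shrinker() fast on the user-visible path while
still guaranteeing that nothing is freed before all SRCU readers are
done.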