From: Mateusz Guzik <mjguzik@gmail.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	 David Laight <david.laight.linux@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 David Hildenbrand <david@kernel.org>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	 Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	 Michal Hocko <mhocko@suse.com>,
	oliver.sang@intel.com, linux-mm@kvack.org,
	 linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: avoid use of BIT() macro for initialising VMA flags
Date: Tue, 9 Dec 2025 10:26:22 +0100
Message-ID: <ek7v5vxht56g77kbojo35eqdcqdmxjvguf3tp5wnfydopbthb2@xie6b7r53ote>
In-Reply-To: <7006fa60-f4d3-4e7d-8c2b-974e9e4a1224@suse.cz>

On Tue, Dec 09, 2025 at 09:28:10AM +0100, Vlastimil Babka wrote:
> As Mateusz pointed out off-list, the profiles look like mutexes are doing
> less optimistic spinning and more sleeping. Which IMHO isn't something that
> this change can directly affect.
> 

Not mutexes but rwsems.

The benchmark at hand has some of the code under spinlocks, while other
parts take rwsems for reading *or* writing.

I had a peek at the rwsem implementation and, to my understanding, it can
degrade to doing no spinning at all in a microbenchmark setting like this
one, provided you are unlucky enough.

In particular, you can get unlucky if the existing timings get perturbed,
which I presume is what is happening after Lorenzo's patch.
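
To make the failure mode I have in mind concrete, here is a minimal
userspace sketch of it -- this is my reading of kernel/locking/rwsem.c
rather than the actual kernel code, and all the toy_* names are made up:

#include <stdbool.h>
#include <stdio.h>

enum toy_owner { OWNER_NONE, OWNER_WRITER, OWNER_READER };

struct toy_rwsem {
	enum toy_owner owner;	/* who currently holds the lock */
	bool waiters_queued;	/* models the RWSEM_FLAG_WAITERS bit */
};

/*
 * A would-be writer can spin usefully only while a writer owner is known
 * to be running; on a reader-owned lock it cannot tell when the readers
 * will finish (the real code only gives it a short, bounded spin), so it
 * soon queues up and goes to sleep.
 */
static bool toy_writer_keeps_spinning(const struct toy_rwsem *sem)
{
	return sem->owner == OWNER_WRITER;
}

/*
 * Once a writer is queued, incoming readers see the waiter bit, skip the
 * reader fast path and sleep behind it to avoid starving the writer.
 * From that point on nobody spins until the wait queue drains.
 */
static bool toy_reader_goes_to_sleep(const struct toy_rwsem *sem)
{
	return sem->waiters_queued;
}

int main(void)
{
	struct toy_rwsem sem = {
		.owner = OWNER_READER,
		.waiters_queued = true,
	};

	printf("writer keeps spinning: %d, reader goes to sleep: %d\n",
	       toy_writer_keeps_spinning(&sem),
	       toy_reader_goes_to_sleep(&sem));
	return 0;
}

In the all-down_write variant the owner is always a writer, so the
spinning path stays usable, which would line up with the numbers below.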

To demonstrate this, I wrote a toy patch which conditionally converts the
affected down_read() calls into down_write() (inlined at the end).

While the original report is based on a 192-thread box, I was only able
to test with 80 threads. Even so, the crux of the issue was nicely
reproduced.

./stress-ng --timeout 10 --times --verify --metrics --no-rand-seed --msg 80

Top says (times vary, idle is growing over time):
%Cpu(s):  3.3 us, 24.4 sy,  0.0 ni, 72.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st 

... but if I flip the switch to down_write:

%Cpu(s):  6.3 us, 80.9 sy,  0.0 ni, 12.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

The switch is a sysctl named fs.magic_tunable (0 == down_read; 1 == down_write).

In terms of performance I see the following:

stress-ng: metrc: [2546] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
# sysctl fs.magic_tunable=0 ## down_read
stress-ng: metrc: [2546] msg            63353488     10.01     28.21    213.26   6331298.95      262362.91        30.16          2016
# sysctl fs.magic_tunable=1 ## down_write
stress-ng: metrc: [2036] msg           455014809     10.00     48.79    676.42  45496870.65      627425.68        90.64          2056
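
The down_write variant thus pushes roughly 7x the bogo ops/s of the
down_read one.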

That is to say, the rwsem code is the real culprit and Lorenzo is a
random (albeit deserving) victim.

I see two action items:
- massage the patch back into a state where it compiles to the same asm
  as before, as it clearly (and avoidably) regressed regardless of the
  aforementioned issue
- figure out what to do about rwsem read vs. write spinning

I'm not picking this up for the time being, but I might look at this at
some point.

diff --git a/fs/file_table.c b/fs/file_table.c
index cd4a3db4659a..de1ef700d144 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -109,6 +109,8 @@ static int proc_nr_files(const struct ctl_table *table, int write, void *buffer,
 	return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
 }
 
+unsigned long magic_tunable;
+
 static const struct ctl_table fs_stat_sysctls[] = {
 	{
 		.procname	= "file-nr",
@@ -126,6 +128,16 @@ static const struct ctl_table fs_stat_sysctls[] = {
 		.extra1		= SYSCTL_LONG_ZERO,
 		.extra2		= SYSCTL_LONG_MAX,
 	},
+	{
+		.procname	= "magic_tunable",
+		.data		= &magic_tunable,
+		.maxlen		= sizeof(magic_tunable),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+		.extra1		= SYSCTL_LONG_ZERO,
+		.extra2		= SYSCTL_LONG_MAX,
+	},
+
 	{
 		.procname	= "nr_open",
 		.data		= &sysctl_nr_open,
diff --git a/ipc/msg.c b/ipc/msg.c
index ee6af4fe52bf..fa835ea53e09 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -474,6 +474,8 @@ static int msgctl_down(struct ipc_namespace *ns, int msqid, int cmd,
 	return err;
 }
 
+extern unsigned long magic_tunable;
+
 static int msgctl_info(struct ipc_namespace *ns, int msqid,
 			 int cmd, struct msginfo *msginfo)
 {
@@ -495,11 +497,19 @@ static int msgctl_info(struct ipc_namespace *ns, int msqid,
 	msginfo->msgmnb = ns->msg_ctlmnb;
 	msginfo->msgssz = MSGSSZ;
 	msginfo->msgseg = MSGSEG;
-	down_read(&msg_ids(ns).rwsem);
-	if (cmd == MSG_INFO)
-		msginfo->msgpool = msg_ids(ns).in_use;
-	max_idx = ipc_get_maxidx(&msg_ids(ns));
-	up_read(&msg_ids(ns).rwsem);
+	if (!READ_ONCE(magic_tunable)) {
+		down_read(&msg_ids(ns).rwsem);
+		if (cmd == MSG_INFO)
+			msginfo->msgpool = msg_ids(ns).in_use;
+		max_idx = ipc_get_maxidx(&msg_ids(ns));
+		up_read(&msg_ids(ns).rwsem);
+	} else {
+		down_write(&msg_ids(ns).rwsem);
+		if (cmd == MSG_INFO)
+			msginfo->msgpool = msg_ids(ns).in_use;
+		max_idx = ipc_get_maxidx(&msg_ids(ns));
+		up_write(&msg_ids(ns).rwsem);
+	}
 	if (cmd == MSG_INFO) {
 		msginfo->msgmap = min_t(int,
 				     percpu_counter_sum(&ns->percpu_msg_hdrs),
diff --git a/ipc/util.c b/ipc/util.c
index cae60f11d9c2..c65c8289a54b 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -771,6 +771,7 @@ struct ipc_proc_iter {
 	struct ipc_namespace *ns;
 	struct pid_namespace *pid_ns;
 	struct ipc_proc_iface *iface;
+	bool writelocked;
 };
 
 struct pid_namespace *ipc_seq_pid_ns(struct seq_file *s)
@@ -828,6 +829,8 @@ static void *sysvipc_proc_next(struct seq_file *s, void *it, loff_t *pos)
 	return sysvipc_find_ipc(&iter->ns->ids[iface->ids], pos);
 }
 
+extern unsigned long magic_tunable;
+
 /*
  * File positions: pos 0 -> header, pos n -> ipc idx = n - 1.
  * SeqFile iterator: iterator value locked ipc pointer or SEQ_TOKEN_START.
@@ -844,7 +847,13 @@ static void *sysvipc_proc_start(struct seq_file *s, loff_t *pos)
 	 * Take the lock - this will be released by the corresponding
 	 * call to stop().
 	 */
-	down_read(&ids->rwsem);
+	if (!READ_ONCE(magic_tunable)) {
+		down_read(&ids->rwsem);
+		iter->writelocked = false;
+	} else {
+		down_write(&ids->rwsem);
+		iter->writelocked = true;
+	}
 
 	/* pos < 0 is invalid */
 	if (*pos < 0)
@@ -871,7 +880,10 @@ static void sysvipc_proc_stop(struct seq_file *s, void *it)
 
 	ids = &iter->ns->ids[iface->ids];
 	/* Release the lock we took in start() */
-	up_read(&ids->rwsem);
+	if (!iter->writelocked)
+		up_read(&ids->rwsem);
+	else
+		up_write(&ids->rwsem);
 }
 
 static int sysvipc_proc_show(struct seq_file *s, void *it)

