From: Huang Shijie <huangsj@hygon.cn>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: <akpm@linux-foundation.org>, <viro@zeniv.linux.org.uk>,
<brauner@kernel.org>, <linux-mm@kvack.org>,
<linux-kernel@vger.kernel.org>,
<linux-arm-kernel@lists.infradead.org>,
<linux-fsdevel@vger.kernel.org>, <muchun.song@linux.dev>,
<osalvador@suse.de>, <linux-trace-kernel@vger.kernel.org>,
<linux-perf-users@vger.kernel.org>,
<linux-parisc@vger.kernel.org>, <nvdimm@lists.linux.dev>,
<zhongyuan@hygon.cn>, <fangbaoshun@hygon.cn>,
<yingzhiwei@hygon.cn>
Subject: Re: [PATCH 0/3] mm: split the file's i_mmap tree for NUMA
Date: Tue, 14 Apr 2026 17:11:26 +0800 [thread overview]
Message-ID: <ad4EvoDcAKE2Sl4+@hsj-2U-Workstation> (raw)
In-Reply-To: <76pfiwabdgsej6q2yxfh3efuqvsyg7mt7rvl5itzzjyhdrto5r@53viaxsackzv>
On Mon, Apr 13, 2026 at 05:33:21PM +0200, Mateusz Guzik wrote:
> On Mon, Apr 13, 2026 at 02:20:39PM +0800, Huang Shijie wrote:
> > In NUMA, there are maybe many NUMA nodes and many CPUs.
> > For example, a Hygon's server has 12 NUMA nodes, and 384 CPUs.
> > In the UnixBench tests, there is a test "execl" which tests
> > the execve system call.
> >
> > When we test our server with "./Run -c 384 execl",
> > the test result is not good enough. The i_mmap locks contended heavily on
> > "libc.so" and "ld.so". For example, the i_mmap tree for "libc.so" can have
> > over 6000 VMAs, all the VMAs can be in different NUMA mode.
> > The insert/remove operations do not run quickly enough.
> >
> > patch 1 & patch 2 are try to hide the direct access of i_mmap.
> > patch 3 splits the i_mmap into sibling trees, and we can get better
> > performance with this patch set:
> > we can get 77% performance improvement(10 times average)
> >
>
> To my reading you kept the lock as-is and only distributed the protected
> state.
>
> While I don't doubt the improvement, I'm confident should you take a
> look at the profile you are going to find this still does not scale with
> rwsem being one of the problems (there are other global locks, some of
> which have experimental patches for).
IMHO, when the number of VMAs in the i_mmap is very large, only optimise the rwsem
lock does not help too much for our NUMA case.
In our NUMA server, the remote access could be the major issue.
>
> Apart from that this does nothing to help high core systems which are
> all one node, which imo puts another question mark on this specific
> proposal.
Yes, this patch set only focus on the NUMA case.
The one-node case should use the original i_mmap.
Maybe I can add a new config, CONFIG_SPILT_I_MMAP. The config is disabled
by default, and enabled when the NUMA node is not one.
>
> Of course one may question whether a RB tree is the right choice here,
> it may be the lock-protected cost can go way down with merely a better
> data structure.
>
> Regardless of that, for actual scalability, there will be no way around
> decentralazing locking around this and partitioning per some core count
> (not just by numa awareness).
>
> Decentralizing locking is definitely possible, but I have not looked
> into specifics of how problematic it is. Best case scenario it will
> merely with separate locks. Worst case scenario something needs a fully
> stabilized state for traversal, in that case another rw lock can be
Yes.
The traversal may need to hold many locks.
> slapped around this, creating locking order read lock -> per-subset
> write lock -- this will suffer scalability due to the read locking, but
> it will still scale drastically better as apart from that there will be
> no serialization. In this setting the problematic consumer will write
> lock the new thing to stabilize the state.
>
> So my non-maintainer opinion is that the patchset is not worth it as it
> fails to address anything for significantly more common and already
> affected setups.
This patch set is to reduce the remote access latency for insert/remove VMA
in NUMA.
>
> Have you looked into splitting the lock?
>
I ever tried.
But there are two disadvantages:
1.) The traversal may need to hold many locks which makes the
code very horrible.
2.) Even we split the locks. Each lock protects a tree, when the tree becomes
big enough, the VMA insert/remove will also become slow in NUMA.
The reason is that the tree has VMAs in different NUMA nodes.
Thanks
Huang Shijie
prev parent reply other threads:[~2026-04-14 9:11 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-13 6:20 Huang Shijie
2026-04-13 6:20 ` [PATCH 1/3] mm: use mapping_mapped to simplify the code Huang Shijie
2026-04-13 6:20 ` [PATCH 2/3] mm: use get_i_mmap_root to access the file's i_mmap Huang Shijie
2026-04-13 6:20 ` [PATCH 3/3] mm: split the file's i_mmap tree for NUMA Huang Shijie
2026-04-13 15:33 ` [PATCH 0/3] " Mateusz Guzik
2026-04-14 9:11 ` Huang Shijie [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ad4EvoDcAKE2Sl4+@hsj-2U-Workstation \
--to=huangsj@hygon.cn \
--cc=akpm@linux-foundation.org \
--cc=brauner@kernel.org \
--cc=fangbaoshun@hygon.cn \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-parisc@vger.kernel.org \
--cc=linux-perf-users@vger.kernel.org \
--cc=linux-trace-kernel@vger.kernel.org \
--cc=mjguzik@gmail.com \
--cc=muchun.song@linux.dev \
--cc=nvdimm@lists.linux.dev \
--cc=osalvador@suse.de \
--cc=viro@zeniv.linux.org.uk \
--cc=yingzhiwei@hygon.cn \
--cc=zhongyuan@hygon.cn \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox