From: Rakie Kim <rakie.kim@sk.com>
To: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: akpm@linux-foundation.org, gourry@gourry.net, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
ziy@nvidia.com, matthew.brost@intel.com, joshua.hahnjy@gmail.com,
byungchul@sk.com, ying.huang@linux.alibaba.com,
apopple@nvidia.com, david@kernel.org, lorenzo.stoakes@oracle.com,
Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
surenb@google.com, mhocko@suse.com, dave@stgolabs.net,
dave.jiang@intel.com, alison.schofield@intel.com,
vishal.l.verma@intel.com, ira.weiny@intel.com,
dan.j.williams@intel.com, kernel_team@skhynix.com,
honggyu.kim@sk.com, yunjeong.mun@sk.com,
Rakie Kim <rakie.kim@sk.com>
Subject: Re: [RFC PATCH 2/4] mm/memory-tiers: introduce socket-aware topology management for NUMA nodes
Date: Mon, 30 Mar 2026 18:47:23 +0900 [thread overview]
Message-ID: <20260330094733.425-1-rakie.kim@sk.com> (raw)
In-Reply-To: <20260318122242.00004e0d@huawei.com>
On Wed, 18 Mar 2026 12:22:42 +0000 Jonathan Cameron <jonathan.cameron@huawei.com> wrote:
> On Mon, 16 Mar 2026 14:12:50 +0900
> Rakie Kim <rakie.kim@sk.com> wrote:
>
> > The existing NUMA distance model provides only relative latency values
> > between nodes and lacks any notion of structural grouping such as socket
> > or package boundaries. As a result, memory policies based solely on
> > distance cannot differentiate between nodes that are physically local
> > to the same socket and those that belong to different sockets. This
> > often leads to inefficient cross-socket demotion and suboptimal memory
> > placement.
> >
> > This patch introduces a socket-aware topology management layer that
> > groups NUMA nodes according to their physical package (socket)
> > association. Each group forms a "memory package" that explicitly links
> > CPU and memory-only nodes (such as CXL or HBM) under the same socket.
> > This structure allows the kernel to interpret NUMA topology in a way
> > that reflects real hardware locality rather than relying solely on
> > flat distance values.
> >
> > By maintaining socket-level grouping, the kernel can:
> > - Enforce demotion and promotion policies that stay within the same
> > socket.
> > - Avoid unintended cross-socket migrations that degrade performance.
> > - Provide a structural abstraction for future policy and tiering logic.
> >
> > Unlike ACPI-provided distance tables, which offer static and symmetric
> > relationships, this socket-aware model captures the true hardware
> > hierarchy and provides a flexible foundation for systems where the
> > distance matrix alone cannot accurately express socket boundaries or
> > asymmetric topologies.
>
> Careful with the generalities in here. There is no way to derive the
> 'true' hierarchy. What this is doing is applying a particular set
> of heuristics to the data that ACPI provided and attempting to use
> that to derive relationships. In simple cases that might work fine.
>
> Doing so is OK in an RFC for discussion but this will need testing
> against a wide range of topologies to at least ensure it fails gracefully.
> Note we've had to paper over quite a few topology assumptions in the
> kernel and this feels like another one that will bite us later.
>
> I'd avoid the socket terminology as multiple NUMA nodes in sockets
> have been a thing for many years. Today there can even be multiple
> IO dies with a complex 'distance' relationship wrt the CPUs
> in that socket. Topologies of memory controllers in those
> packages are another level of complexity.
>
>
> Otherwise a few general things from a quick look.
>
> I'd avoid goto out; where out just returns. That just makes code
> flow more complex and often makes for longer code. When you have
> an error and there is nothing to cleanup just return immediately.
>
> guard() / scoped_guard() will help simplify some of the locking.
>
> Thanks,
>
> Jonathan
>
Hello Jonathan,

First of all, I sincerely apologize for the delayed response. I
completely missed this email while I was focused on debugging the HMAT
and CDAT latency issues in the other threads.

You are absolutely right about the terminology and the architectural
assumptions. Given modern architectures with multiple NUMA nodes and
IO dies within a single physical package, the term "socket" is
outdated and misleading. I completely agree that this is just another
heuristic, not the "true" hierarchy.

For the v2 patch, I will reword the commit messages to remove those
generalities and avoid the strict "socket" terminology. I will also
make sure the logic fails gracefully and falls back to the default
behavior when it encounters complex topologies that it cannot cleanly
resolve.
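
To illustrate what that graceful fallback could look like, here is a
minimal userspace C sketch (the function name, the flat per-package
distance array, and the NO_PACKAGE sentinel are all hypothetical, not
part of the patch): pick the unique nearest CPU package for a
memory-only node, and report "no package" on a tie so callers keep the
existing distance-based behavior instead of guessing.

```c
#include <limits.h>

#define NO_PACKAGE (-1)

/* Hypothetical helper sketching the intended fallback: given this
 * memory-only node's distance to each candidate CPU package, return
 * the unique nearest package, or NO_PACKAGE when two packages tie
 * (an ambiguous topology the heuristic cannot cleanly resolve). */
static int nearest_package(const int *dist, int npkg)
{
	int best = NO_PACKAGE, best_dist = INT_MAX, ties = 0;

	for (int p = 0; p < npkg; p++) {
		if (dist[p] < best_dist) {
			best_dist = dist[p];
			best = p;
			ties = 1;
		} else if (dist[p] == best_dist) {
			ties++;
		}
	}
	/* Only claim a package when the answer is unambiguous. */
	return ties == 1 ? best : NO_PACKAGE;
}
```

For example, distances {10, 20} resolve to package 0, while {20, 20}
yield NO_PACKAGE and the caller would fall back to the default path.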

Thank you for the detailed code review as well. I will remove the
unnecessary `goto out;` statements in favor of direct returns, and
adopt `guard()` and `scoped_guard()` to simplify the locking.
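
As a concrete sketch of both review points, consider the following
self-contained userspace C example; the kernel's guard()/scoped_guard()
helpers from <linux/cleanup.h> are approximated here with the
compiler's cleanup attribute, and the register_node_*() functions are
purely hypothetical illustrations, not code from the patch.

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int registered;

/* Userspace stand-in for the kernel's scoped locking guards: the
 * mutex is released automatically when the guard variable goes out
 * of scope, via GCC/Clang's cleanup attribute. */
static void unlock_cleanup(pthread_mutex_t **m)
{
	pthread_mutex_unlock(*m);
}

#define SCOPED_LOCK(m) \
	__attribute__((cleanup(unlock_cleanup))) pthread_mutex_t *_g = (m); \
	pthread_mutex_lock(_g)

/* Before: the error path jumps to an "out" label that only returns,
 * which adds nothing but indirection. */
static int register_node_old(int nid)
{
	int ret = 0;

	if (nid < 0) {
		ret = -1;
		goto out;	/* nothing to clean up here */
	}
	pthread_mutex_lock(&lock);
	registered++;
	pthread_mutex_unlock(&lock);
out:
	return ret;
}

/* After: return directly on error, and let the scope-based guard
 * drop the lock on every exit path. */
static int register_node_new(int nid)
{
	if (nid < 0)
		return -1;	/* nothing to clean up: just return */

	SCOPED_LOCK(&lock);	/* unlocked automatically at scope exit */
	registered++;
	return 0;
}
```

The second form is shorter and cannot leak the lock on a future early
return, which is the same property the kernel helpers provide.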

Thanks again for your time and the valuable feedback.

Rakie Kim
Thread overview: 29+ messages
2026-03-16 5:12 [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave Rakie Kim
2026-03-16 5:12 ` [RFC PATCH 1/4] mm/numa: introduce nearest_nodes_nodemask() Rakie Kim
2026-03-16 5:12 ` [RFC PATCH 2/4] mm/memory-tiers: introduce socket-aware topology management for NUMA nodes Rakie Kim
2026-03-18 12:22 ` Jonathan Cameron
2026-03-30 9:47 ` Rakie Kim [this message]
2026-03-16 5:12 ` [RFC PATCH 3/4] mm/memory-tiers: register CXL nodes to socket-aware packages via initiator Rakie Kim
2026-03-16 5:12 ` [RFC PATCH 4/4] mm/mempolicy: enhance weighted interleave with socket-aware locality Rakie Kim
2026-03-16 14:01 ` [LSF/MM/BPF TOPIC] [RFC PATCH 0/4] mm/mempolicy: introduce socket-aware weighted interleave Gregory Price
2026-03-17 9:50 ` Rakie Kim
2026-03-16 15:19 ` Joshua Hahn
2026-03-16 19:45 ` Gregory Price
2026-03-17 11:50 ` Rakie Kim
2026-03-17 11:36 ` Rakie Kim
2026-03-18 12:02 ` Jonathan Cameron
2026-03-19 7:55 ` Rakie Kim
2026-03-20 16:56 ` Jonathan Cameron
2026-03-24 5:35 ` Rakie Kim
2026-03-25 12:33 ` Jonathan Cameron
2026-03-26 8:54 ` Rakie Kim
2026-03-26 21:41 ` Dave Jiang
2026-03-26 22:19 ` Dave Jiang
2026-03-30 3:17 ` Rakie Kim
2026-03-30 3:09 ` Rakie Kim
2026-03-26 20:13 ` Dan Williams
2026-03-30 5:32 ` Rakie Kim
2026-03-26 22:24 ` Dave Jiang
2026-03-30 2:59 ` Rakie Kim
2026-03-27 1:54 ` Gregory Price
2026-03-30 9:24 ` Rakie Kim