From: Gregory Price <gourry@gourry.net>
To: Matthew Wilcox <willy@infradead.org>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>,
hyeonggon.yoo@sk.com, ying.huang@linux.alibaba.com,
rafael@kernel.org, lenb@kernel.org, gregkh@linuxfoundation.org,
akpm@linux-foundation.org, honggyu.kim@sk.com, rakie.kim@sk.com,
dan.j.williams@intel.com, Jonathan.Cameron@huawei.com,
dave.jiang@intel.com, horen.chuang@linux.dev, hannes@cmpxchg.org,
linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-mm@kvack.org, kernel-team@meta.com
Subject: Re: [PATCH v3] Weighted interleave auto-tuning
Date: Fri, 24 Jan 2025 10:48:16 -0500
Message-ID: <Z5O2QATuhvRnygcx@gourry-fedora-PF4VCD3F>
In-Reply-To: <Z5Mr8WQGEZZjp9Uu@casper.infradead.org>

On Fri, Jan 24, 2025 at 05:58:09AM +0000, Matthew Wilcox wrote:
> On Wed, Jan 15, 2025 at 10:58:54AM -0800, Joshua Hahn wrote:
> > On machines with multiple memory nodes, interleaving page allocations
> > across nodes allows for better utilization of each node's bandwidth.
> > Previous work by Gregory Price [1] introduced weighted interleave, which
> > allowed for pages to be allocated across NUMA nodes according to
> > user-set ratios.
>
> I still don't get it. You always want memory to be on the local node or
> the fabric gets horribly congested and slows you right down. But you're
> not really talking about NUMA, are you? You're talking about CXL.
>
> And CXL is terrible for bandwidth. I just ran the numbers.
>
> On a current Intel top-end CPU, we're looking at 8x DDR5-4800 DIMMs,
> each with a bandwidth of 38.4GB/s for a total of 300GB/s.
>
> For each CXL lane, you take a lane of PCIe gen5 away. So that's
> notionally 32Gbit/s, or 4GB/s per lane. But CXL is crap, and you'll be
> lucky to get 3 cachelines per 256 byte packet, dropping you down to 3GB/s.
> You're not going to use all 80 lanes for CXL (presumably these CPUs are
> going to want to do I/O somehow), so maybe allocate 20 of them to CXL.
> That's 60GB/s, or a 20% improvement in bandwidth. On top of that,
> it's slow, with a minimum of 10ns latency penalty just from the CXL
> encode/decode penalty.
>
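For reference, that back-of-envelope math pencils out roughly like
this (a throwaway C sketch; the 20-lane split and the ~3 GB/s
effective per CXL lane are the assumptions from your mail, not
measurements):

/* Back-of-envelope bandwidth math from the quote above; all figures
 * are the quoted assumptions, not measurements.
 */
#include <stdio.h>

int main(void)
{
	double dimm_bw   = 38.4;         /* GB/s per DDR5-4800 DIMM        */
	double dram_bw   = 8 * dimm_bw;  /* 8 DIMMs -> ~307 GB/s (~"300")  */
	double lane_raw  = 32.0 / 8.0;   /* PCIe gen5: 32 Gbit/s -> 4 GB/s */
	double lane_eff  = 3.0;          /* assumed effective GB/s per CXL lane */
	int    cxl_lanes = 20;           /* assumed lanes given over to CXL */
	double cxl_bw    = cxl_lanes * lane_eff;

	printf("DRAM:   %.1f GB/s\n", dram_bw);
	printf("CXL:    %.1f GB/s (%d lanes at %.1f of %.1f GB/s)\n",
	       cxl_bw, cxl_lanes, lane_eff, lane_raw);
	printf("Uplift: %.0f%%\n", 100.0 * cxl_bw / dram_bw);
	return 0;
}

Which is the ~20% you cite - the question is what that extra headroom
buys you when DRAM is the bottleneck.
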
From the original posting: the performance tests show considerable
opportunity in scenarios where DRAM bandwidth is under pressure,
because you can either:

1) Lower DRAM bandwidth pressure by offloading some cachelines to
   CXL, reducing latency on DRAM and reducing average latency
   overall.  The latency cost of the CXL lines gets amortized across
   all the DRAM fetches that no longer stall.

2) In full-pressure scenarios (both DRAM and CXL saturated), use the
   additional lanes / buffers for more concurrent fetches - i.e.
   you're simply doing more work (and avoiding going to storage).
   This is the weaker of the two scenarios (see the toy numbers
   after this list).
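
To put toy numbers on scenario 2: if a purely bandwidth-bound stream
is split across the two tiers in proportion to their bandwidths, both
tiers drain at the same time and the aggregate is just the sum.  A
sketch using the illustrative ~300 / ~60 GB/s figures from above
(again assumptions, not measurements):

/* Illustrative only: split a bandwidth-bound workload across DRAM and
 * CXL in proportion to their bandwidths, compare against DRAM-only.
 * Bandwidth figures carried over from the quote above.
 */
#include <stdio.h>

int main(void)
{
	double dram_bw = 300.0;  /* GB/s, assumed */
	double cxl_bw  = 60.0;   /* GB/s, assumed */
	double bytes   = 3600.0; /* GB of bandwidth-bound traffic */

	/* Interleave in proportion to bandwidth: 300:60 -> 5:1 weights */
	double frac_cxl = cxl_bw / (dram_bw + cxl_bw);
	double t_split  = bytes * (1.0 - frac_cxl) / dram_bw; /* == bytes * frac_cxl / cxl_bw */
	double t_dram   = bytes / dram_bw;

	printf("weights (dram:cxl) ~ %.0f:1\n", dram_bw / cxl_bw);
	printf("DRAM-only: %.1fs  split: %.1fs  (+%.0f%% throughput)\n",
	       t_dram, t_split, 100.0 * (t_dram / t_split - 1.0));
	return 0;
}

That idealized +20% is the ceiling your lane math implies; the MLC
numbers below land well above it, presumably because that setup simply
has more CXL bandwidth attached (multiple expanders).
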
No one is proposing we switch the default policy to weighted interleave.

= Performance summary =
(tests may have different configurations, see extended info below)
1) MLC (W2) : +38% over DRAM. +264% over default interleave.
MLC (W5) : +40% over DRAM. +226% over default interleave.
2) Stream : -6% to +4% over DRAM, +430% over default interleave.
3) XSBench : +19% over DRAM. +47% over default interleave.
=====================================================================
Performance tests - MLC
From - Ravi Jonnalagadda <ravis.opensrc@micron.com>
Hardware: Single-socket, multiple CXL memory expanders.
Workload: W2
Data Signature: 2:1 read:write
DRAM only bandwidth (GBps): 298.8
DRAM + CXL (default interleave) (GBps): 113.04
DRAM + CXL (weighted interleave)(GBps): 412.5
Gain over DRAM only: 1.38x
Gain over default interleave: 2.64x
Workload: W5
Data Signature: 1:1 read:write
DRAM only bandwidth (GBps): 273.2
DRAM + CXL (default interleave) (GBps): 117.23
DRAM + CXL (weighted interleave)(GBps): 382.7
Gain over DRAM only: 1.4x
Gain over default interleave: 2.26x
=====================================================================
Performance test - Stream
From - Gregory Price <gregory.price@memverge.com>
Hardware: Single socket, single CXL expander
Summary: 64 threads, ~18GB workload, 3GB per array, executed 100 times
Default interleave : -78% (slower than DRAM)
Global weighting : -6% to +4% (workload dependent)
mbind weights : +2.5% to +4% (consistently better than DRAM)
=====================================================================
Performance tests - XSBench
From - Hyeongtak Ji <hyeongtak.ji@sk.com>
Hardware: Single socket, single CXL memory expander
NUMA node 0: 56 logical cores, 128 GB memory
NUMA node 2: 96 GB CXL memory
Threads: 56
Lookups: 170,000,000
Summary: +19% over DRAM. +47% over default interleave.
> Putting page cache in the CXL seems like nonsense to me. I can see it
> making sense to swap to CXL, or allocating anonymous memory for tasks
> with low priority on it. But I just can't see the point of putting
> pagecache on CXL.
No one said anything about page cache - but it depends.

If you can keep your entire working set in memory, including on CXL,
as opposed to swapping to disk - you win.  "Swapping to CXL" incurs a
bunch of page faults, which sounds like a loss.
However - the stream test from the original proposal agrees with you
that just making everything interleaved (code, pagecache, etc.) is at
best a wash:

  Global weighting : -6% to +4% (workload dependent)

But targeting specific regions can provide a modest bump:

  mbind weights : +2.5% to +4% (consistently better than DRAM)
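
For the "mbind weights" case, the targeting is just an mbind() of the
hot arrays rather than a task-wide policy.  A rough sketch, assuming a
6.9+ kernel with MPOL_WEIGHTED_INTERLEAVE and the per-node weights
already written under /sys/kernel/mm/mempolicy/weighted_interleave/ -
the node numbers and size here are illustrative, loosely following the
tests above:

/* Bind one buffer (not the whole task) to weighted interleave across
 * nodes 0 (DRAM) and 2 (CXL), loosely matching the XSBench layout
 * above.  Assumes MPOL_WEIGHTED_INTERLEAVE is available (6.9+) and
 * weights are already set via sysfs.  Build: gcc example.c -lnuma
 */
#include <numaif.h>
#include <stdio.h>
#include <sys/mman.h>

#ifndef MPOL_WEIGHTED_INTERLEAVE
#define MPOL_WEIGHTED_INTERLEAVE 6	/* uapi value on recent kernels */
#endif

int main(void)
{
	size_t len = 3UL << 30;		/* ~3GB array, as in the stream test */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	unsigned long nodemask = (1UL << 0) | (1UL << 2);	/* nodes 0 and 2 */

	if (mbind(buf, len, MPOL_WEIGHTED_INTERLEAVE,
		  &nodemask, sizeof(nodemask) * 8, 0)) {
		perror("mbind");
		return 1;
	}

	/* Pages land according to the sysfs weights on first touch. */
	return 0;
}

The "Global weighting" row is what you get when the same policy is
applied task-wide instead (e.g. via set_mempolicy()), which drags
everything the task allocates along with it.
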
~Gregory