From: Gregory Price <gourry@gourry.net>
To: Matthew Wilcox <willy@infradead.org>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>,
hyeonggon.yoo@sk.com, ying.huang@linux.alibaba.com,
rafael@kernel.org, lenb@kernel.org, gregkh@linuxfoundation.org,
akpm@linux-foundation.org, honggyu.kim@sk.com, rakie.kim@sk.com,
dan.j.williams@intel.com, Jonathan.Cameron@huawei.com,
dave.jiang@intel.com, horen.chuang@linux.dev, hannes@cmpxchg.org,
linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-mm@kvack.org, kernel-team@meta.com
Subject: Re: [PATCH v3] Weighted interleave auto-tuning
Date: Fri, 24 Jan 2025 10:48:16 -0500
Message-ID: <Z5O2QATuhvRnygcx@gourry-fedora-PF4VCD3F>
In-Reply-To: <Z5Mr8WQGEZZjp9Uu@casper.infradead.org>

On Fri, Jan 24, 2025 at 05:58:09AM +0000, Matthew Wilcox wrote:
> On Wed, Jan 15, 2025 at 10:58:54AM -0800, Joshua Hahn wrote:
> > On machines with multiple memory nodes, interleaving page allocations
> > across nodes allows for better utilization of each node's bandwidth.
> > Previous work by Gregory Price [1] introduced weighted interleave, which
> > allowed for pages to be allocated across NUMA nodes according to
> > user-set ratios.
>
> I still don't get it. You always want memory to be on the local node or
> the fabric gets horribly congested and slows you right down. But you're
> not really talking about NUMA, are you? You're talking about CXL.
>
> And CXL is terrible for bandwidth. I just ran the numbers.
>
> On a current Intel top-end CPU, we're looking at 8x DDR5-4800 DIMMs,
> each with a bandwidth of 38.4GB/s for a total of 300GB/s.
>
> For each CXL lane, you take a lane of PCIe gen5 away. So that's
> notionally 32Gbit/s, or 4GB/s per lane. But CXL is crap, and you'll be
> lucky to get 3 cachelines per 256 byte packet, dropping you down to 3GB/s.
> You're not going to use all 80 lanes for CXL (presumably these CPUs are
> going to want to do I/O somehow), so maybe allocate 20 of them to CXL.
> That's 60GB/s, or a 20% improvement in bandwidth. On top of that,
> it's slow, with a minimum of 10ns latency penalty just from the CXL
> encode/decode penalty.
>
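For reference, that back-of-envelope math pencils out roughly like
this (a throwaway C sketch; the 20-lane split and the ~3 GB/s
effective per CXL lane are the assumptions from your mail, not
measurements):

/* Back-of-envelope bandwidth math from the quote above; all figures
 * are the quoted assumptions, not measurements.
 */
#include <stdio.h>

int main(void)
{
	double dimm_bw   = 38.4;         /* GB/s per DDR5-4800 DIMM        */
	double dram_bw   = 8 * dimm_bw;  /* 8 DIMMs -> ~307 GB/s (~"300")  */
	double lane_raw  = 32.0 / 8.0;   /* PCIe gen5: 32 Gbit/s -> 4 GB/s */
	double lane_eff  = 3.0;          /* assumed effective GB/s per CXL lane */
	int    cxl_lanes = 20;           /* assumed lanes given over to CXL */
	double cxl_bw    = cxl_lanes * lane_eff;

	printf("DRAM:   %.1f GB/s\n", dram_bw);
	printf("CXL:    %.1f GB/s (%d lanes at %.1f of %.1f GB/s)\n",
	       cxl_bw, cxl_lanes, lane_eff, lane_raw);
	printf("Uplift: %.0f%%\n", 100.0 * cxl_bw / dram_bw);
	return 0;
}

Which is the ~20% you cite - the question is what that extra headroom
buys you when DRAM is the bottleneck.
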
From the original posting: the performance tests show considerable
opportunity in scenarios where DRAM bandwidth is under pressure,
because you can either:

1) Lower DRAM bandwidth pressure by offloading some cachelines to
   CXL, reducing latency on DRAM and reducing average latency
   overall.  The latency cost of the CXL lines gets amortized across
   all the DRAM fetches that no longer stall.

2) In full-pressure scenarios (both DRAM and CXL saturated), use the
   additional lanes / buffers for more concurrent fetches - i.e.
   you're simply doing more work (and avoiding going to storage).
   This is the weaker of the two scenarios (see the toy numbers
   after this list).
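
To put toy numbers on scenario 2: if a purely bandwidth-bound stream
is split across the two tiers in proportion to their bandwidths, both
tiers drain at the same time and the aggregate is just the sum.  A
sketch using the illustrative ~300 / ~60 GB/s figures from above
(again assumptions, not measurements):

/* Illustrative only: split a bandwidth-bound workload across DRAM and
 * CXL in proportion to their bandwidths, compare against DRAM-only.
 * Bandwidth figures carried over from the quote above.
 */
#include <stdio.h>

int main(void)
{
	double dram_bw = 300.0;  /* GB/s, assumed */
	double cxl_bw  = 60.0;   /* GB/s, assumed */
	double bytes   = 3600.0; /* GB of bandwidth-bound traffic */

	/* Interleave in proportion to bandwidth: 300:60 -> 5:1 weights */
	double frac_cxl = cxl_bw / (dram_bw + cxl_bw);
	double t_split  = bytes * (1.0 - frac_cxl) / dram_bw; /* == bytes * frac_cxl / cxl_bw */
	double t_dram   = bytes / dram_bw;

	printf("weights (dram:cxl) ~ %.0f:1\n", dram_bw / cxl_bw);
	printf("DRAM-only: %.1fs  split: %.1fs  (+%.0f%% throughput)\n",
	       t_dram, t_split, 100.0 * (t_dram / t_split - 1.0));
	return 0;
}

That idealized +20% is the ceiling your lane math implies; the MLC
numbers below land well above it, presumably because that setup simply
has more CXL bandwidth attached (multiple expanders).
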
No one is proposing we switch the default policy to weighted interleave.

= Performance summary =
(tests may have different configurations, see extended info below)
1) MLC (W2) : +38% over DRAM. +264% over default interleave.
MLC (W5) : +40% over DRAM. +226% over default interleave.
2) Stream : -6% to +4% over DRAM, +430% over default interleave.
3) XSBench : +19% over DRAM. +47% over default interleave.
=====================================================================
Performance tests - MLC
From - Ravi Jonnalagadda <ravis.opensrc@micron.com>
Hardware: Single-socket, multiple CXL memory expanders.
Workload: W2
Data Signature: 2:1 read:write
DRAM only bandwidth (GBps): 298.8
DRAM + CXL (default interleave) (GBps): 113.04
DRAM + CXL (weighted interleave)(GBps): 412.5
Gain over DRAM only: 1.38x
Gain over default interleave: 2.64x
Workload: W5
Data Signature: 1:1 read:write
DRAM only bandwidth (GBps): 273.2
DRAM + CXL (default interleave) (GBps): 117.23
DRAM + CXL (weighted interleave)(GBps): 382.7
Gain over DRAM only: 1.4x
Gain over default interleave: 2.26x
=====================================================================
Performance test - Stream
From - Gregory Price <gregory.price@memverge.com>
Hardware: Single socket, single CXL expander
Summary: 64 threads, ~18GB workload, 3GB per array, executed 100 times
Default interleave : -78% (slower than DRAM)
Global weighting : -6% to +4% (workload dependent)
mbind weights : +2.5% to +4% (consistently better than DRAM)
=====================================================================
Performance tests - XSBench
From - Hyeongtak Ji <hyeongtak.ji@sk.com>
Hardware: Single socket, single CXL memory expander
NUMA node 0: 56 logical cores, 128 GB memory
NUMA node 2: 96 GB CXL memory
Threads: 56
Lookups: 170,000,000
Summary: +19% over DRAM. +47% over default interleave.
> Putting page cache in the CXL seems like nonsense to me. I can see it
> making sense to swap to CXL, or allocating anonymous memory for tasks
> with low priority on it. But I just can't see the point of putting
> pagecache on CXL.
No one said anything about page cache - but it depends.

If you can keep your entire working set in memory, including on CXL,
as opposed to swapping to disk - you win.  "Swapping to CXL" incurs a
bunch of page faults, which sounds like a loss.
However - the stream test from the original proposal agrees with you
that just making everything interleaved (code, pagecache, etc.) is at
best a wash:

  Global weighting : -6% to +4% (workload dependent)

But targeting specific regions can provide a modest bump:

  mbind weights : +2.5% to +4% (consistently better than DRAM)
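
For the "mbind weights" case, the targeting is just an mbind() of the
hot arrays rather than a task-wide policy.  A rough sketch, assuming a
6.9+ kernel with MPOL_WEIGHTED_INTERLEAVE and the per-node weights
already written under /sys/kernel/mm/mempolicy/weighted_interleave/ -
the node numbers and size here are illustrative, loosely following the
tests above:

/* Bind one buffer (not the whole task) to weighted interleave across
 * nodes 0 (DRAM) and 2 (CXL), loosely matching the XSBench layout
 * above.  Assumes MPOL_WEIGHTED_INTERLEAVE is available (6.9+) and
 * weights are already set via sysfs.  Build: gcc example.c -lnuma
 */
#include <numaif.h>
#include <stdio.h>
#include <sys/mman.h>

#ifndef MPOL_WEIGHTED_INTERLEAVE
#define MPOL_WEIGHTED_INTERLEAVE 6	/* uapi value on recent kernels */
#endif

int main(void)
{
	size_t len = 3UL << 30;		/* ~3GB array, as in the stream test */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	unsigned long nodemask = (1UL << 0) | (1UL << 2);	/* nodes 0 and 2 */

	if (mbind(buf, len, MPOL_WEIGHTED_INTERLEAVE,
		  &nodemask, sizeof(nodemask) * 8, 0)) {
		perror("mbind");
		return 1;
	}

	/* Pages land according to the sysfs weights on first touch. */
	return 0;
}

The "Global weighting" row is what you get when the same policy is
applied task-wide instead (e.g. via set_mempolicy()), which drags
everything the task allocates along with it.
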
~Gregory