Date: Fri, 24 Jan 2025 10:48:16 -0500
From: Gregory Price
To: Matthew Wilcox
Cc: Joshua Hahn, hyeonggon.yoo@sk.com, ying.huang@linux.alibaba.com,
	rafael@kernel.org, lenb@kernel.org, gregkh@linuxfoundation.org,
	akpm@linux-foundation.org, honggyu.kim@sk.com, rakie.kim@sk.com,
	dan.j.williams@intel.com, Jonathan.Cameron@huawei.com,
	dave.jiang@intel.com, horen.chuang@linux.dev, hannes@cmpxchg.org,
	linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-mm@kvack.org, kernel-team@meta.com
Subject: Re: [PATCH v3] Weighted interleave auto-tuning
References: <20250115185854.1991771-1-joshua.hahnjy@gmail.com>
On Fri, Jan 24, 2025 at 05:58:09AM +0000, Matthew Wilcox wrote:
> On Wed, Jan 15, 2025 at 10:58:54AM -0800, Joshua Hahn wrote:
> > On machines with multiple memory nodes, interleaving page allocations
> > across nodes allows for better utilization of each node's bandwidth.
> > Previous work by Gregory Price [1] introduced weighted interleave, which
> > allowed for pages to be allocated across NUMA nodes according to
> > user-set ratios.
> 
> I still don't get it.  You always want memory to be on the local node or
> the fabric gets horribly congested and slows you right down.  But you're
> not really talking about NUMA, are you?  You're talking about CXL.
> 
> And CXL is terrible for bandwidth.  I just ran the numbers.
> 
> On a current Intel top-end CPU, we're looking at 8x DDR5-4800 DIMMs,
> each with a bandwidth of 38.4GB/s for a total of 300GB/s.
> 
> For each CXL lane, you take a lane of PCIe gen5 away.  So that's
> notionally 32Gbit/s, or 4GB/s per lane.  But CXL is crap, and you'll be
> lucky to get 3 cachelines per 256 byte packet, dropping you down to 3GB/s.
> You're not going to use all 80 lanes for CXL (presumably these CPUs are
> going to want to do I/O somehow), so maybe allocate 20 of them to CXL.
> That's 60GB/s, or a 20% improvement in bandwidth.  On top of that,
> it's slow, with a minimum of 10ns latency penalty just from the CXL
> encode/decode penalty.
> 

From the original - the performance tests show considerable opportunity
in the scenarios where DRAM bandwidth is pressured - as you can either:

1) Lower the DRAM bandwidth pressure by offloading some cachelines to
   CXL - reducing latency on DRAM and reducing average latency overall.
   The latency cost on CXL lines gets amortized over all DRAM fetches
   no longer hitting stalls.

2) Under full-pressure scenarios (DRAM and CXL are saturated), the
   additional lanes / buffers provide more concurrent fetches - i.e.
   you're just doing more work (and avoiding going to storage).  This
   is the weaker of the two scenarios.

No one is proposing we switch the default policy to weighted interleave.
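For reference, here is a minimal sketch (not taken from the patch) of the
per-range mechanism being measured: mbind(2) with MPOL_WEIGHTED_INTERLEAVE.
It assumes a v6.9+ kernel, that MPOL_WEIGHTED_INTERLEAVE is uapi value 6
(guarded below in case numaif.h does not define it yet), and that per-node
weights have already been written to
/sys/kernel/mm/mempolicy/weighted_interleave/nodeN.  The node numbers are
illustrative (they mirror the XSBench topology quoted further down), and
this is roughly the shape of the "mbind weights" configuration referenced
in the Stream results below.

/* Build: cc wi_sketch.c -lnuma */
#define _GNU_SOURCE
#include <numaif.h>		/* mbind() wrapper from libnuma */
#include <sys/mman.h>
#include <stdio.h>

#ifndef MPOL_WEIGHTED_INTERLEAVE
#define MPOL_WEIGHTED_INTERLEAVE 6	/* uapi value on current kernels */
#endif

int main(void)
{
	unsigned long len = 1UL << 30;	/* 1 GiB bandwidth-bound buffer */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Spread just this range across node 0 (DRAM) and node 2 (CXL)
	 * in proportion to their sysfs weights, rather than changing
	 * the task-wide policy.
	 */
	unsigned long nodemask = (1UL << 0) | (1UL << 2);
	if (mbind(buf, len, MPOL_WEIGHTED_INTERLEAVE, &nodemask,
		  sizeof(nodemask) * 8, 0))
		perror("mbind");

	/* ... stream through buf here ... */
	munmap(buf, len);
	return 0;
}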
= Performance summary =
(tests may have different configurations, see extended info below)
1) MLC (W2) : +38% over DRAM. +264% over default interleave.
   MLC (W5) : +40% over DRAM. +226% over default interleave.
2) Stream   : -6% to +4% over DRAM, +430% over default interleave.
3) XSBench  : +19% over DRAM. +47% over default interleave.

=====================================================================
Performance tests - MLC
From - Ravi Jonnalagadda
Hardware: Single socket, multiple CXL memory expanders.

Workload: W2
Data Signature: 2:1 read:write
DRAM only bandwidth (GBps): 298.8
DRAM + CXL (default interleave) (GBps): 113.04
DRAM + CXL (weighted interleave) (GBps): 412.5
Gain over DRAM only: 1.38x
Gain over default interleave: 2.64x

Workload: W5
Data Signature: 1:1 read:write
DRAM only bandwidth (GBps): 273.2
DRAM + CXL (default interleave) (GBps): 117.23
DRAM + CXL (weighted interleave) (GBps): 382.7
Gain over DRAM only: 1.4x
Gain over default interleave: 2.26x

=====================================================================
Performance test - Stream
From - Gregory Price
Hardware: Single socket, single CXL expander

Summary: 64 threads, ~18GB workload, 3GB per array, executed 100 times

Default interleave : -78% (slower than DRAM)
Global weighting   : -6% to +4% (workload dependent)
mbind weights      : +2.5% to +4% (consistently better than DRAM)

=====================================================================
Performance tests - XSBench
From - Hyeongtak Ji
Hardware: Single socket, single CXL memory expander

NUMA node 0: 56 logical cores, 128 GB memory
NUMA node 2: 96 GB CXL memory
Threads: 56
Lookups: 170,000,000

Summary: +19% over DRAM. +47% over default interleave.

> Putting page cache in the CXL seems like nonsense to me.  I can see it
> making sense to swap to CXL, or allocating anonymous memory for tasks
> with low priority on it.  But I just can't see the point of putting
> pagecache on CXL.

No one said anything about page cache - but it depends.

If you can keep your entire working set in-memory and on-CXL, as opposed
to swapping to disk - you win.  "Swapping to CXL" incurs a bunch of page
faults, which sounds like a lose.

However - the Stream test from the original proposal agrees with you
that just making everything interleaved (code, pagecache, etc.) is at
best a wash:

Global weighting   : -6% to +4% (workload dependent)

But targeting specific regions can provide a modest bump:

mbind weights      : +2.5% to +4% (consistently better than DRAM)

~Gregory