Date: Thu, 25 Jan 2024 13:37:02 -0800 (PST)
From: David Rientjes <rientjes@google.com>
To: Matthew Wilcox
Cc: John Hubbard, Zi Yan, Bharata B Rao, Dave Jiang, "Aneesh Kumar K.V",
    "Huang, Ying", Alistair Popple, Christoph Lameter, Andrew Morton,
    Linus Torvalds, Dave Hansen, Mel Gorman, Jon Grimm, Gregory Price,
    Brian Morris, Wei Xu, Johannes Weiner, SeongJae Park,
    linux-mm@kvack.org
Subject: Re: [RFC] Memory tiering kernel alignment
References: <75f21150-1e12-4f4b-e578-e170e4fea18b@google.com>
            <2b29dd3d-bb2c-6a8c-94d2-d5c2e035516a@google.com>

On Thu, 25 Jan 2024, Matthew Wilcox wrote:
> On Thu, Jan 25, 2024 at 12:04:37PM -0800, David Rientjes wrote:
> > On Thu, 25 Jan 2024, Matthew Wilcox wrote:
> > > On Thu, Jan 25, 2024 at 10:26:19AM -0800, David Rientjes wrote:
> > > > There is a lot of excitement around upcoming CXL type 3 memory
> > > > expansion devices and their cost savings potential.  As the
> > > > industry starts to adopt this technology, one of the key components
> > > > in strategic planning is how the upstream Linux kernel will support
> > > > various tiered configurations to meet various user needs.  I think
> > > > it goes without saying that this is quite interesting to cloud
> > > > providers as well as other hyperscalers :)
> > >
> > > I'm not excited.  I'm disappointed that people are falling for this
> > > scam.  CXL is the ATM of this decade.  The protocol is not fit for
> > > the purpose of accessing remote memory, adding 10ns just for an
> > > encode/decode cycle.  Hands up everybody who's excited about memory
> > > latency increasing by 17%.
> >
> > Right, I don't think that anybody is claiming that we can leverage
> > locally attached CXL memory as though it was DRAM on the same or a
> > remote socket, or that there won't be a noticeable impact to
> > application performance while the memory is still across the device.
> >
> > It does offer several cost-savings benefits for offloading of cold
> > memory, though, if locally attached, and I think support for that use
> > case is inevitable -- in fact, Linux already has some sophisticated
> > support for the locally attached use case.
> >
> > > Then there are the lies from the vendors who want you to buy
> > > switches.  Not one of them is willing to guarantee you the worst
> > > case latency through their switches.
> >
> > I should have prefaced this thread by saying "locally attached CXL
> > memory expansion", because that's the primary focus of many of the
> > folks on this email thread :)
>
> That's a huge relief.  I was not looking forward to the patches to add
> support for pooling (etc).
>
> Using CXL as cold-data storage makes a certain amount of sense, although
> I'm not really sure why it offers an advantage over NAND.  It's faster
> than NAND, but you still want to bring it back locally before operating
> on it.  NAND is denser, and consumes less power while idle.  NAND comes
> with a DMA controller to move the data instead of relying on the CPU to
> move the data around.  And of course moving the data first to CXL and
> then to swap means that it's got to go over the memory bus multiple
> times, unless you're building a swap device which attaches to the
> other end of the CXL bus ...
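(For concreteness on the cold memory offload point above: the locally
attached case already works with stock interfaces today, since the
expander just shows up as a CPU-less NUMA node.  Below is a minimal
userspace sketch that demotes a buffer to such a node with move_pages(2).
The node id 1 is only an assumption about the topology, and a real policy
would pick pages based on coldness rather than demoting a whole buffer.
Build with -lnuma.)

	#include <numaif.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	#define NR_PAGES 64
	#define CXL_NODE 1	/* assumed node id of the CPU-less expander */

	int main(void)
	{
		long page_size = sysconf(_SC_PAGESIZE);
		char *buf = aligned_alloc(page_size, NR_PAGES * page_size);
		void *pages[NR_PAGES];
		int nodes[NR_PAGES], status[NR_PAGES];

		if (!buf)
			return 1;

		/* Touch each page so it is actually allocated somewhere first. */
		for (int i = 0; i < NR_PAGES; i++) {
			buf[i * page_size] = 1;
			pages[i] = buf + i * page_size;
			nodes[i] = CXL_NODE;
		}

		/* pid 0 == this process; MPOL_MF_MOVE moves only our own pages. */
		if (move_pages(0, NR_PAGES, pages, nodes, status, MPOL_MF_MOVE) < 0)
			perror("move_pages");
		else
			printf("first page now on node %d\n", status[0]);

		free(buf);
		return 0;
	}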

This is **exactly** the type of discussion we're looking to have :)

There are some things that I've chatted informally with folks about that
I'd like to bring to the forum:

 - Decoupling CPU migration from memory migration for NUMA Balancing (or
   perhaps deprecating CPU migration entirely)

 - Allowing NUMA Balancing to do migration as part of a kthread
   asynchronous to the NUMA hint fault, in kernel context

 - An abstraction for future hardware devices that can provide an
   expanded view into page hotness, which can be leveraged in different
   areas of the kernel, including as a backend for NUMA Balancing to
   replace NUMA hint faults

 - Per-container support for configuring balancing and memory migration

 - Opting certain types of memory into NUMA Balancing (like tmpfs) while
   leaving other types alone

 - Utilizing hardware accelerated memory migration as a replacement for
   the traditional migrate_pages() path when available (a rough sketch of
   the dispatch point appears at the end of this mail)

I could go code all of this up and spend an enormous amount of time doing
so, only to get NAKed by somebody because I'm ripping out their critical
use case that I just didn't know about :)

There's also the question of whether DAMON should be the source of truth
for this or whether it should be decoupled.

My dream world would be one where we could discuss the various use cases
for locally attached CXL memory and determine, as a group, what the
shared, comprehensive "Linux vision" for it is, and do so before
LSF/MM/BPF.

In a perfect world, we could block out an expanded MM session in Salt
Lake City to bring all these concepts together, decide which approaches
sound reasonable vs unreasonable, and leave that conference with a clear
understanding of what needs to happen.
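To make that last bullet a little more concrete, here is a purely
hypothetical sketch of the kind of dispatch point I have in mind; none of
these names exist upstream, and this is illustration rather than a
proposal.  The idea is simply that the data copy done during migration
could be handed to a hardware engine (e.g. a DMA offload) when a driver
has registered one, with the existing CPU copy as the fallback:

	#include <stddef.h>
	#include <string.h>

	/* One folio's worth of data to copy from source to destination. */
	struct folio_copy_req {
		void *dst;
		const void *src;
		size_t len;
	};

	/* Optional hardware backend; a driver would register this at probe time. */
	struct migration_offload_ops {
		int (*copy_batch)(struct folio_copy_req *reqs, size_t nr);
	};

	static struct migration_offload_ops *offload_ops;	/* NULL: no engine */

	static int copy_batch_cpu(struct folio_copy_req *reqs, size_t nr)
	{
		for (size_t i = 0; i < nr; i++)
			memcpy(reqs[i].dst, reqs[i].src, reqs[i].len);
		return 0;
	}

	/* What the migration copy step would call instead of copying inline. */
	static int migrate_copy_batch(struct folio_copy_req *reqs, size_t nr)
	{
		if (offload_ops && offload_ops->copy_batch)
			return offload_ops->copy_batch(reqs, nr);
		return copy_batch_cpu(reqs, nr);
	}

How batching is sized and how errors fall back to the CPU path are
exactly the kinds of details that would need to be hashed out.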