Date: Thu, 25 Jan 2024 10:26:19 -0800 (PST)
From: David Rientjes
To: John Hubbard, Zi Yan, Bharata B Rao, Dave Jiang, "Aneesh Kumar K.V", "Huang, Ying", Alistair Popple, Christoph Lameter, Andrew Morton, Linus Torvalds, Dave Hansen, Mel Gorman, Jon Grimm, Gregory Price, Brian Morris, Wei Xu, Johannes Weiner
cc: linux-mm@kvack.org
Subject: [RFC] Memory tiering kernel alignment
Message-ID: <75f21150-1e12-4f4b-e578-e170e4fea18b@google.com>

Hi everybody,

There is a lot of excitement around upcoming CXL type 3 memory expansion devices and their cost savings potential. As the industry starts to adopt this technology, one of the key components in strategic planning is how the upstream Linux kernel will support various tiered configurations to meet various user needs.
I think it goes without saying that this is quite interesting to cloud providers as well as other hyperscalers :)

I think this discussion would benefit from a collaborative approach between various stakeholders and interested parties. The reason is that there are several different use cases that need different support models, but also that there is great incentive toward moving "with" upstream Linux for this support rather than having multiple parties bringing up their own stacks only to find that they are diverging from upstream rather than converging with it.

I'm interested to learn if there is interest in forming a "Linux Memory Tiering Work Group" to share ideas, discuss multi-faceted approaches, and keep track of work items.

Some recent discussions have proven that there is widespread interest in some very foundational topics for this technology such as:

- Decoupling CPU balancing from memory balancing (or obsoleting CPU balancing entirely)

   + John Hubbard notes this would be useful for GPUs:

      a) GPUs have their own processors that are invisible to the kernel's NUMA "which tasks are active on which NUMA nodes" calculations, and

      b) Similar to where CXL is generally going, we have already built fully memory-coherent hardware, which includes memory-only NUMA nodes.

- In-kernel hot memory abstraction, informed by hardware hinting drivers (incl. some architectures like Power10), usable as a NUMA Balancing backend for promotion and other areas of the kernel like transparent hugepage utilization

- NUMA and memory tiering enlightenment for accelerators, such as for optimal use of GPU memory, extremely important for a cloud provider (hint hint :)

- Asynchronous memory promotion independent of task_numa_fault() while considering the cost of page migration (due to identifying cold memory)

It looks like there is already some interest in such a working group that would have a biweekly discussion of shared interests with the goal of accelerating design, development, testing, and division of work:

   Alistair Popple
   Aneesh Kumar K V
   Brian Morris
   Christoph Lameter
   Dan Williams
   Gregory Price
   Grimm, Jon
   Huang, Ying
   Johannes Weiner
   John Hubbard
   Zi Yan

Specifically for the in-kernel hot memory abstraction topic, Google and Meta recently published an OCP base specification "Hyperscale CXL Tiered Memory Expander Specification" available at https://drive.google.com/file/d/1fFfU7dFmCyl6V9-9qiakdWaDr9d38ewZ/view?usp=drive_link that would be great to discuss. There is also ongoing work in the CXL Consortium to standardize some of the abstractions for CXL 3.1.

If folks are interested in this topic and your name doesn't appear above (if it does, I already got you :), please:

- reply-all to this email to express interest and expand upon the list of topics above to represent additional areas of interest that should be included, *or*

- email me privately to express interest to make sure you are included

Perhaps I'm overly optimistic, but one thing that would be absolutely *amazing* would be if we all have a very clear and understandable vision for how Linux will support the wide variety of use cases, even before that work is fully implemented (or even designed), by LSF/MM/BPF 2024 time in May.

Thanks!