From: Hao Xiang <hao.xiang@bytedance.com>
Date: Fri, 12 Jan 2024 00:14:04 -0800
Subject: Re: [External] Re: [EXT] Re: [RFC PATCH v2 0/2] Node migration between memory tiers
To: "Huang, Ying"
Cc: "aneesh.kumar@linux.ibm.com", Jonathan Cameron, Gregory Price,
 Srinivasulu Thanneeru, Srinivasulu Opensrc, "linux-cxl@vger.kernel.org",
 "linux-mm@kvack.org", "dan.j.williams@intel.com", "mhocko@suse.com",
 "tj@kernel.org", "john@jagalactic.com", Eishan Mirakhur,
 Vinicius Tavares Petrucci, Ravis OpenSrc, "linux-kernel@vger.kernel.org",
 Johannes Weiner, Wei Xu, "Ho-Ren (Jack) Chuang"
In-Reply-To: <87il3z2g03.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <87fs00njft.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87edezc5l1.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87a5pmddl5.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87wmspbpma.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87o7dv897s.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <20240109155049.00003f13@Huawei.com>
 <20240110141821.0000370d@Huawei.com>
 <87il3z2g03.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Thu, Jan 11, 2024 at 11:02 PM Huang, Ying wrote:
>
> Hao Xiang writes:
>
> > On Wed, Jan 10, 2024 at 6:18 AM Jonathan Cameron
> > wrote:
> >>
> >> On Tue, 9 Jan 2024 16:28:15 -0800
> >> Hao Xiang wrote:
> >>
> >> > On Tue, Jan 9, 2024 at 9:59 AM Gregory Price wrote:
> >> > >
> >> > > On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote:
> >> > > > On Tue, 09 Jan 2024 11:41:11 +0800
> >> > > > "Huang, Ying" wrote:
> >> > > > > Gregory Price writes:
> >> > > > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
> >> > > > > It's possible for the performance of a NUMA node to change, if we
> >> > > > > hot-remove a memory device, then hot-add another different memory
> >> > > > > device. It's hoped that the CDAT changes too.
> >> > > >
> >> > > > Not supported, but ACPI has _HMA methods to in theory allow changing
> >> > > > HMAT values based on firmware notifications... So we 'could' make
> >> > > > it work for HMAT based description.
> >> > > >
> >> > > > Ultimately my current thinking is we'll end up emulating CXL type3
> >> > > > devices (hiding topology complexity) and you can update CDAT but
> >> > > > IIRC that is only meant to be for degraded situations - so if you
> >> > > > want multiple performance regions, CDAT should describe them from the start.
> >> > > >
> >> > >
> >> > > That was my thought. I don't think it's particularly *realistic* for
> >> > > HMAT/CDAT values to change at runtime, but I can imagine a case where
> >> > > it could be valuable.
> >> > >
> >> > > > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
> >> > > > > >
> >> > > > > > This group wants to enable passing CXL memory through to KVM/QEMU
> >> > > > > > (i.e. host CXL expander memory passed through to the guest), and
> >> > > > > > allow the guest to apply memory tiering.
> >> > > > > >
> >> > > > > > There are multiple issues with this, presently:
> >> > > > > >
> >> > > > > > 1. The QEMU CXL virtual device is not and probably never will be
> >> > > > > > performant enough to be a commodity class virtualization.
> >> > > >
> >> > > > I'd flex that a bit - we will end up with a solution for virtualization but
> >> > > > it isn't the emulation that is there today because it's not possible to
> >> > > > emulate some of the topology in a performant manner (interleaving with sub
> >> > > > page granularity / interleaving at all (to a lesser degree)). There are
> >> > > > ways to do better than we are today, but they start to look like
> >> > > > software disaggregated memory setups (think lots of page faults in the host).
> >> > > >
> >> > >
> >> > > Agreed, the emulated device as-is can't be the virtualization device,
> >> > > but it doesn't mean it can't be the basis for it.
> >> > >
> >> > > My thought is, if you want to pass host CXL *memory* through to the
> >> > > guest, you don't actually care to pass CXL *control* through to the
> >> > > guest. That control lies pretty squarely with the host/hypervisor.
> >> > >
> >> > > So, at least in theory, you can just cut the type3 device out of the
> >> > > QEMU configuration entirely and just pass it through as a distinct numa
> >> > > node with specific hmat qualities.
> >> > >
> >> > > Barring that, if we must go through the type3 device, the question is
> >> > > how difficult would it be to just make a stripped down type3 device
> >> > > to provide the informational components, but hack off anything
> >> > > topology/interleave related? Then you just do direct passthrough as you
> >> > > described below.
> >> > >
> >> > > qemu/kvm would report errors if you tried to touch the naughty bits.
> >> > >
> >> > > The second question is... is that device "compliant" or does it need
> >> > > super special handling from the kernel driver :D? If what i described
> >> > > is not "compliant", then it's probably a bad idea, and KVM/QEMU should
> >> > > just hide the CXL device entirely from the guest (for this use case)
> >> > > and just pass the memory through as a numa node.
> >> > >
> >> > > Which gets us back to: The memory-tiering component needs a way to
> >> > > place nodes in different tiers based on HMAT/CDAT/User Whim. All three
> >> > > of those seem like totally valid ways to go about it.
> >> > >
> >> > > > > >
> >> > > > > > 2. When passing memory through as an explicit NUMA node, but not as
> >> > > > > > part of a CXL memory device, the nodes are lumped together in the
> >> > > > > > DRAM tier.
> >> > > > > >
> >> > > > > > None of this has to do with firmware.
> >> > > > > >
> >> > > > > > Memory-type is an awful way of denoting membership of a tier, but we
> >> > > > > > have HMAT information that can be passed through via QEMU:
> >> > > > > >
> >> > > > > > -object memory-backend-ram,size=4G,id=ram-node0 \
> >> > > > > > -object memory-backend-ram,size=4G,id=ram-node1 \
> >> > > > > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
> >> > > > > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
> >> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> >> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> >> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> >> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
> >> > > > > >
> >> > > > > > Not only would it be nice if we could change tier membership based on
> >> > > > > > this data, it's realistically the only way to allow guests to accomplish
> >> > > > > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
> >> > > >
> >> > > > This I fully agree with. There will be systems with a bunch of normal DDR with different
> >> > > > access characteristics irrespective of CXL. + likely HMAT solutions will be used
> >> > > > before we get anything more complex in place for CXL.
> >> > > >
> >> > >
> >> > > Had not even considered this, but that's completely accurate as well.
> >> > >
> >> > > And more discretely: What of devices that don't provide HMAT/CDAT? That
> >> > > isn't necessarily a violation of any standard. There probably could be
> >> > > a release valve for us to still make those devices useful.
> >> > >
> >> > > The concern I have with not implementing a movement mechanism *at all*
> >> > > is that a one-size-fits-all initial-placement heuristic feels gross
> >> > > when we're, at least ideologically, moving toward "software defined memory".
> >> > >
> >> > > Personally I think the movement mechanism is a good idea that gets folks
> >> > > where they're going sooner, and it doesn't hurt anything by existing. We
> >> > > can change the initial placement mechanism too.
> >> >
> >> > I think providing users a way to "FIX" the memory tiering is a backup
> >> > option. Given that DDRs with different access characteristics provide
> >> > the relevant CDAT/HMAT information, the kernel should be able to
> >> > correctly establish memory tiering on boot.
> >>
> >> Include hotplug and I'll be happier! I know that's messy though.
> >>
> >> > Current memory tiering code has
> >> > 1) memory_tier_init() to iterate through all boot onlined memory
> >> > nodes. All nodes are assumed to be fast tier (adistance
> >> > MEMTIER_ADISTANCE_DRAM is used).
> >> > 2) dev_dax_kmem_probe to iterate through all devdax controlled memory
> >> > nodes. This is where the kernel reads the memory attributes from
> >> > HMAT and places the memory nodes into the correct tier (devdax
> >> > controlled CXL, pmem, etc).
> >> > If we want DDRs with different memory characteristics to be put into
> >> > the correct tier (as in the guest VM memory tiering case), we probably
> >> > need a third path to iterate the boot onlined memory nodes and also be
> >> > able to read their memory attributes. I don't think we can do that in
> >> > 1) because the ACPI subsystem is not yet initialized.
> >>
> >> Can we move it later in general? Or drag HMAT parsing earlier?
> >> ACPI table availability is pretty early, it's just that we don't bother
> >> with HMAT because nothing early uses it.
> >> IIRC SRAT parsing occurs way before memory_tier_init() will be called.
> >
> > I tested the call sequence under a debugger earlier. hmat_init() is
> > called after memory_tier_init(). Let me poke around and see what our
> > options are.
>
> This sounds reasonable.
>
> Please keep in mind that we need a way to identify the baseline memory
> type (default_dram_type). A simple method is to use NUMA nodes with CPU
> attached. But I remember that Aneesh said that some NUMA nodes without
> CPU will need to be put in default_dram_type too on their systems. We
> need a way to identify that.

Yes, I am doing some prototyping the way you described. In
memory_tier_init(), we will just set the memory tier for the NUMA nodes
with CPU. In hmat_init(), I am trying to call back to mm to finish the
memory tier initialization for the CPUless NUMA nodes. If a CPUless
NUMA node can't get an effective adistance from mt_calc_adistance(), we
will fall back to adding that node to default_dram_type. A rough
standalone sketch of this flow is at the end of this mail.

The other thing I want to experiment with is to call mt_calc_adistance()
on a memory node with CPU and see what kind of adistance will be
returned.

> --
> Best Regards,
> Huang, Ying
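
For reference, here is a minimal, standalone C sketch of the two-pass flow
described above, just to make the intended control flow concrete. It is not
kernel code and makes no claim about the real implementation: only
memory_tier_init(), hmat_init(), mt_calc_adistance(), default_dram_type and
MEMTIER_ADISTANCE_DRAM are names taken from this thread, while the struct,
the model_* helpers and the baseline adistance value are made up purely for
the illustration.

/*
 * Standalone model of the proposed two-pass tier assignment.
 * Pass 1 (memory_tier_init()): only nodes with CPUs get the baseline
 * DRAM adistance. Pass 2 (hmat_init() calling back into mm): CPUless
 * nodes get an HMAT-derived adistance, falling back to the baseline
 * (default_dram_type) when no usable attributes exist.
 */
#include <stdbool.h>
#include <stdio.h>

/* Arbitrary baseline for the model; not the real MEMTIER_ADISTANCE_DRAM. */
#define MODEL_ADISTANCE_DRAM 100

struct model_node {
	bool has_cpu;     /* node has CPUs attached */
	bool hmat_valid;  /* HMAT/CDAT attributes are available for the node */
	int hmat_adist;   /* abstract distance derived from HMAT, if valid */
	int adist;        /* assigned abstract distance (0 = not assigned yet) */
};

/* Stand-in for mt_calc_adistance(): fills *adist and returns 0 on success. */
static int model_calc_adistance(const struct model_node *n, int *adist)
{
	if (!n->hmat_valid)
		return -1;   /* no effective adistance for this node */
	*adist = n->hmat_adist;
	return 0;
}

/* Pass 1: boot-time init, only CPU nodes are placed (in the DRAM tier). */
static void model_memory_tier_init(struct model_node *nodes, int nr)
{
	for (int i = 0; i < nr; i++)
		if (nodes[i].has_cpu)
			nodes[i].adist = MODEL_ADISTANCE_DRAM;
}

/* Pass 2: HMAT-time callback, finish the CPUless nodes with a fallback. */
static void model_hmat_finish(struct model_node *nodes, int nr)
{
	for (int i = 0; i < nr; i++) {
		int adist;

		if (nodes[i].has_cpu || nodes[i].adist)
			continue;   /* already placed in pass 1 */
		if (model_calc_adistance(&nodes[i], &adist))
			adist = MODEL_ADISTANCE_DRAM;   /* default_dram_type fallback */
		nodes[i].adist = adist;
	}
}

int main(void)
{
	struct model_node nodes[3] = {
		{ .has_cpu = true },                 /* regular DRAM node */
		{ .hmat_valid = true,
		  .hmat_adist = 2 * MODEL_ADISTANCE_DRAM }, /* slower CPUless node */
		{ 0 },                               /* CPUless node, no HMAT data */
	};

	model_memory_tier_init(nodes, 3);
	model_hmat_finish(nodes, 3);

	for (int i = 0; i < 3; i++)
		printf("node %d: adistance %d\n", i, nodes[i].adist);
	return 0;
}

With this layout, node 0 and node 2 end up at the baseline adistance (same
tier as the CPU nodes) while node 1 ends up with a larger adistance, i.e. a
slower tier, which is the behaviour we want for CPUless NUMA nodes backed by
slower memory in a guest.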