Date: Fri, 27 May 2022 15:05:24
 +0100
From: Hesham Almatary
To: Ying Huang
CC: Wei Xu, Andrew Morton, Greg Thelen, Yang Shi, "Aneesh Kumar K.V",
 Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
 Linux Kernel Mailing List, Dave Hansen, "Jonathan Cameron",
 Alistair Popple, Dan Williams, Feng Tang, Linux MM, Jagdish Gediya,
 Baolin Wang, David Rientjes
Subject: Re: RFC: Memory Tiering Kernel Interfaces (v3)
Message-ID: <20220527150524.00000871@huawei.com>
Organization: Huawei UK R&D

Hello Wei and Ying,

Please find my comments below based on a discussion with Jonathan.

On Fri, 27 May 2022 10:58:39 +0800
Ying Huang wrote:

> On Thu, 2022-05-26 at 14:22 -0700, Wei Xu wrote:
> > Changes since v2
> > ================
> > * Updated the design and examples to use "rank" instead of device ID
> >   to determine the order between memory tiers for better
> >   flexibility.
> >
> > Overview
> > ========
> >
> > The current kernel has the basic memory tiering support: Inactive
> > pages on a higher tier NUMA node can be migrated (demoted) to a
> > lower tier NUMA node to make room for new allocations on the higher
> > tier NUMA node.  Frequently accessed pages on a lower tier NUMA
> > node can be migrated (promoted) to a higher tier NUMA node to
> > improve the performance.
> >
> > In the current kernel, memory tiers are defined implicitly via a
> > demotion path relationship between NUMA nodes, which is created
> > during the kernel initialization and updated when a NUMA node is
> > hot-added or hot-removed.  The current implementation puts all
> > nodes with CPU into the top tier, and builds the tier hierarchy
> > tier-by-tier by establishing the per-node demotion targets based on
> > the distances between nodes.
> >
> > This current memory tier kernel interface needs to be improved for
> > several important use cases:
> >
> > * The current tier initialization code always initializes
> >   each memory-only NUMA node into a lower tier.  But a memory-only
> >   NUMA node may have a high performance memory device (e.g. a DRAM
> >   device attached via CXL.mem or a DRAM-backed memory-only node on
> >   a virtual machine) and should be put into a higher tier.
> >
> > * The current tier hierarchy always puts CPU nodes into the top
> >   tier.  But on a system with HBM (e.g. GPU memory) devices, these
> >   memory-only HBM NUMA nodes should be in the top tier, and DRAM
> >   nodes with CPUs are better to be placed into the next lower tier.
> >
> > * Also because the current tier hierarchy always puts CPU nodes
> >   into the top tier, when a CPU is hot-added (or hot-removed) and
> >   triggers a memory node from CPU-less into a CPU node (or vice
> >   versa), the memory tier hierarchy gets changed, even though no
> >   memory node is added or removed.
> >   This can make the tier
> >   hierarchy unstable and make it difficult to support tier-based
> >   memory accounting.
> >
> > * A higher tier node can only be demoted to selected nodes on the
> >   next lower tier as defined by the demotion path, not any other
> >   node from any lower tier.  This strict, hard-coded demotion order
> >   does not work in all use cases (e.g. some use cases may want to
> >   allow cross-socket demotion to another node in the same demotion
> >   tier as a fallback when the preferred demotion node is out of
> >   space), and has resulted in the feature request for an interface
> >   to override the system-wide, per-node demotion order from the
> >   userspace.  This demotion order is also inconsistent with the page
> >   allocation fallback order when all the nodes in a higher tier are
> >   out of space: The page allocation can fall back to any node from
> >   any lower tier, whereas the demotion order doesn't allow that.
> >
> > * There are no interfaces for the userspace to learn about the
> >   memory tier hierarchy in order to optimize its memory allocations.
> >
> > I'd like to propose revised memory tier kernel interfaces based on
> > the discussions in the threads:
> >
> > - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> > - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> > - https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/
> > - https://lore.kernel.org/linux-mm/d6314cfe1c7898a6680bed1e7cc93b0ab93e3155.camel@intel.com/T/
> >
> >
> > High-level Design Ideas
> > =======================
> >
> > * Define memory tiers explicitly, not implicitly.
> >
> > * Memory tiers are defined based on hardware capabilities of memory
> >   nodes, not their relative node distances between each other.
> >
> > * The tier assignment of each node is independent from each other.
> >   Moving a node from one tier to another tier doesn't affect the
> >   tier assignment of any other node.
> >
> > * The node-tier association is stable. A node can be reassigned to a
> >   different tier only under the specific conditions that don't block
> >   future tier-based memory cgroup accounting.
> >
> > * A node can demote its pages to any nodes of any lower tiers. The
> >   demotion target node selection follows the allocation fallback
> >   order of the source node, which is built based on node distances.
> >   The demotion targets are also restricted to only the nodes from the
> >   tiers lower than the source node.  We no longer need to maintain a
> >   separate per-node demotion order (node_demotion[]).
> >
> >
> > Sysfs Interfaces
> > ================
> >
> > * /sys/devices/system/memtier/
> >
> >   This is the directory containing the information about memory
> >   tiers.
> >
> >   Each memory tier has its own subdirectory.
> >
> >   The order of memory tiers is determined by their rank values, not
> >   by their memtier device names.
> >
> >   - /sys/devices/system/memtier/possible
> >
> >     Format: ordered list of "memtier(rank)"
> >     Example: 0(64), 1(128), 2(192)
> >
> >     Read-only.  When read, list all available memory tiers and their
> >     associated ranks, ordered by the rank values (from the highest
> >     tier to the lowest tier).
>
> I like the idea of the "possible" file.  And I think we can show the
> default tier too.  That is, if "1(128)" is the default tier (tier with
> DRAM), then the list can be,
>
> "
> 0/64 [1/128] 2/192
> "
>
> To make it easier to parse from a shell, I would prefer something
> like,
>
> "
> 0 64
> 1 128 default
> 2 192
> "
>
> But one line format is OK for me too.
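
For illustration only, the one-tier-per-line layout suggested above could
be produced by sorting the tiers by rank rather than by device ID. The
sketch below is a userspace stand-in, not code from the RFC: struct
tier_info, format_possible() and the four sample tiers used in the note
underneath are all hypothetical names and values.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical userspace view of one memory tier. */
struct tier_info {
	int id;		/* memtier device ID, carries no ordering */
	int rank;	/* lower rank = higher tier */
};

/* qsort() comparator: ascending rank, i.e. highest tier first. */
static int cmp_rank(const void *a, const void *b)
{
	const struct tier_info *x = a, *y = b;

	return (x->rank > y->rank) - (x->rank < y->rank);
}

/*
 * Write one "<id> <rank>[ default]" line per tier into buf, ordered by
 * rank.  Returns the number of bytes written (excluding the NUL), or -1
 * if buf is too small.
 */
int format_possible(struct tier_info *tiers, int nr, int default_id,
		    char *buf, size_t len)
{
	size_t off = 0;

	qsort(tiers, nr, sizeof(*tiers), cmp_rank);
	for (int i = 0; i < nr; i++) {
		int n = snprintf(buf + off, len - off, "%d %d%s\n",
				 tiers[i].id, tiers[i].rank,
				 tiers[i].id == default_id ? " default" : "");

		if (n < 0 || (size_t)n >= len - off)
			return -1;	/* output would be truncated */
		off += n;
	}
	return (int)off;
}
```

With tiers {0, 64}, {1, 128}, {2, 192}, {3, 160} and default_id = 1, the
buffer lists memtier3 before memtier2, which shows that the order
follows the rank and not the device ID.
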
>

I wonder if there's a good argument to have this "possible" file at all?
My thinking is that, 1) all the details can be scripted at user-level by
reading memtierN/nodeN, offloading some work from the kernel side, and
2) the format/numbers are confusing anyway; it could get tricky when/if
tier device IDs are similar to ranks.

The other thing is whether we should have a file called "default"
containing the default tier value for the user to read?

> >
> > * /sys/devices/system/memtier/memtierN/
> >
> >   This is the directory containing the information about a
> >   particular memory tier, memtierN, where N is the memtier device ID
> >   (e.g. 0, 1).
> >
> >   The memtier device ID number itself is just an identifier and has
> >   no special meaning, i.e. memtier device ID numbers do not determine
> >   the order of memory tiers.
> >
> >   - /sys/devices/system/memtier/memtierN/rank
> >
> >     Format: int
> >     Example: 100
> >
> >     Read-only.  When read, list the "rank" value associated with
> >     memtierN.
> >
> >     "Rank" is an opaque value.  Its absolute value doesn't have any
> >     special meaning.  But the rank values of different memtiers can
> >     be compared with each other to determine the memory tier order.
> >     For example, if we have 3 memtiers: memtier0, memtier1,
> >     memtier2, and their rank values are 10, 20, 15, then the memory
> >     tier order is: memtier0 -> memtier2 -> memtier1, where memtier0 is
> >     the highest tier and memtier1 is the lowest tier.
> >
> >     The rank value of each memtier should be unique.
> >
> >   - /sys/devices/system/memtier/memtierN/nodelist
> >
> >     Format: node_list
> >     Example: 1-2
> >
> >     Read-only.  When read, list the memory nodes in the specified
> >     tier.
> >
> >     If a memory tier has no memory nodes, the kernel can hide the
> >     sysfs directory of this memory tier, though the tier itself can
> >     still be visible from /sys/devices/system/memtier/possible.
> >

Is there a good reason why the kernel needs to hide this directory?

> > * /sys/devices/system/node/nodeN/memtier
> >
> >   where N = 0, 1, ...
> >
> >   Format: int or empty
> >   Example: 1
> >
> >   When read, list the device ID of the memory tier that the node
> >   belongs to.  Its value is empty for a CPU-only NUMA node.
> >
> >   When written, the kernel moves the node into the specified memory
> >   tier if the move is allowed.  The tier assignment of all other
> >   nodes is not affected.
> >

Who decides whether the move is allowed or not? That might need to be
mentioned explicitly.

> >   Initially, we can make this interface read-only.
> >
> >
> > Kernel Representation
> > =====================
> >
> > * All memory tiering code is guarded by CONFIG_TIERED_MEMORY.
> >
> > * #define MAX_MEMORY_TIERS 3
> >
> >   Support 3 memory tiers for now.  This can be a kconfig option.
> >
> > * #define MEMORY_DEFAULT_TIER_DEVICE 1
> >
> >   The default tier device that a memory node is assigned to.
> >
> > * struct memtier_dev {
> >       nodemask_t nodelist;
> >       int rank;
> >       int tier;
> >   } memtier_devices[MAX_MEMORY_TIERS]
> >
> >   Store memory tiers by device IDs.
> >
> > * struct memtier_dev *memory_tier(int tier)
> >
> >   Returns the memtier device for a given memory tier.
> >

We might need to define the case where there's no memory tier device for
a specific tier number. For example, we can return NULL or an error code
when an invalid tier number is passed (e.g., -1 for CPU-only nodes).
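
To make that suggestion concrete, here is a minimal sketch of such a
lookup. It is an illustration under stated assumptions, not code from
the RFC: nodemask_t is replaced by a plain bitmask so that it builds in
userspace, the table contents mirror Example 1 further below, and
returning NULL (rather than an error code) is just one of the two
options mentioned above.

```c
#include <stddef.h>

#define MAX_MEMORY_TIERS 3

/* Simplified stand-in for the proposed struct (assumption: a bitmask
 * replaces nodemask_t; bit n set means node n belongs to the tier). */
struct memtier_dev {
	unsigned long nodelist;
	int rank;
	int tier;
};

/* Indexed by memtier device ID, as in the proposal. */
static struct memtier_dev memtier_devices[MAX_MEMORY_TIERS] = {
	{ .nodelist = 0x0, .rank = 64,  .tier = 0 },	/* no nodes */
	{ .nodelist = 0x3, .rank = 128, .tier = 1 },	/* nodes 0-1 */
	{ .nodelist = 0xc, .rank = 192, .tier = 2 },	/* nodes 2-3 */
};

/*
 * Return the memtier device for a given tier number, or NULL when the
 * tier number is invalid (e.g. -1 for a CPU-only node) or no device
 * carries that tier number.
 */
struct memtier_dev *memory_tier(int tier)
{
	if (tier < 0 || tier >= MAX_MEMORY_TIERS)
		return NULL;

	for (int i = 0; i < MAX_MEMORY_TIERS; i++)
		if (memtier_devices[i].tier == tier)
			return &memtier_devices[i];

	return NULL;
}
```

Callers such as the demotion pseudo code would then have to check the
returned pointer before dereferencing ->nodelist.
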

> > * int node_tier_dev_map[MAX_NUMNODES]
> >
> >   Map a node to its tier device ID.
> >
> >   For each CPU-only node c, node_tier_dev_map[c] = -1.
> >
> >
> > Memory Tier Initialization
> > ==========================
> >
> > By default, all memory nodes are assigned to the default tier
> > (MEMORY_DEFAULT_TIER_DEVICE).  The default tier device has a rank
> > value in the middle of the possible rank value range (e.g. 127 if
> > the range is [0..255]).
> >
> > A device driver can move up or down its memory nodes from the
> > default tier.  For example, PMEM can move down its memory nodes
> > below the default tier, whereas GPU can move up its memory nodes
> > above the default tier.
> >

Is "up/down" here still relative after the rank addition?

> > The kernel initialization code makes the decision on which exact
> > tier a memory node should be assigned to based on the requests from
> > the device drivers as well as the memory device hardware information
> > provided by the firmware.
> >
> >
> > Memory Tier Reassignment
> > ========================
> >
> > After a memory node is hot-removed, it can be hot-added back to a
> > different memory tier.  This is useful for supporting dynamically
> > provisioned CXL.mem NUMA nodes, which may connect to different
> > memory devices across hot-plug events.  Such tier changes should
> > be compatible with tier-based memory accounting.
> >
> > The userspace may also reassign an existing online memory node to a
> > different tier.  However, this should only be allowed when no pages
> > are allocated from the memory node or when there are no non-root
> > memory cgroups (e.g. during the system boot).  This restriction is
> > important for keeping memory tier hierarchy stable enough for
> > tier-based memory cgroup accounting.
>
> One way to do this is to hot-remove all memory of a node, change its
> memtier, then hot-add its memory.
>
> Best Regards,
> Huang, Ying
>
> > Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> >
> >
> > Memory Allocation for Demotion
> > ==============================
> >
> > To allocate a new page as the demotion target for a page, the kernel
> > calls the allocation function (__alloc_pages_nodemask) with the
> > source page node as the preferred node and the union of all lower
> > tier nodes as the allowed nodemask.  The actual target node
> > selection then follows the allocation fallback order that the
> > kernel has already defined.
> >
> > The pseudo code looks like:
> >
> >     targets = NODE_MASK_NONE;
> >     src_nid = page_to_nid(page);
> >     src_tier = memtier_devices[node_tier_dev_map[src_nid]].tier;
> >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> >             nodes_or(targets, targets, memory_tier(i)->nodelist);
> >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> >
> > The mempolicy of cpuset, vma and owner task of the source page can
> > be set to refine the demotion target nodemask, e.g. to prevent
> > demotion or select a particular allowed node as the demotion target.
> >
> >
> > Memory Allocation for Promotion
> > ===============================
> >
> > The page allocation for promotion is similar to demotion, except
> > that (1) the target nodemask uses the promotion tiers, (2) the
> > preferred node can be the accessing CPU node, not the source page
> > node.
> >
> >
> > Examples
> > ========
> >
> > * Example 1:
> >
> > Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
> >
> >                   20
> >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> >        |        \   /       |
> >        | 30    40 X 40      | 30
> >        |        /   \       |
> >   Node 2 (PMEM)  ----  Node 3 (PMEM)
> >                   40
> >
> > node distances:
> > node   0    1    2    3
> >    0  10   20   30   40
> >    1  20   10   40   30
> >    2  30   40   10   40
> >    3  40   30   40   10
> >
> > $ cat /sys/devices/system/memtier/possible
> > 0(64), 1(128), 2(192)
> >
> > $ grep '' /sys/devices/system/memtier/memtier*/rank
> > /sys/devices/system/memtier/memtier1/rank:128
> > /sys/devices/system/memtier/memtier2/rank:192
> >
> > $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> > /sys/devices/system/memtier/memtier1/nodelist:0-1
> > /sys/devices/system/memtier/memtier2/nodelist:2-3
> >
> > $ grep '' /sys/devices/system/node/node*/memtier
> > /sys/devices/system/node/node0/memtier:1
> > /sys/devices/system/node/node1/memtier:1
> > /sys/devices/system/node/node2/memtier:2
> > /sys/devices/system/node/node3/memtier:2
> >
> > Demotion fallback order:
> > node 0: 2, 3
> > node 1: 3, 2
> > node 2: empty
> > node 3: empty
> >
> > To prevent cross-socket demotion and memory access, the user can set
> > mempolicy, e.g. cpuset.mems=0,2.
> >
> >
> > * Example 2:
> >
> > Node 0 & 1 are DRAM nodes.
> > Node 2 is a PMEM node and closer to node 0.
> >
> >                   20
> >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> >        |             /
> >        | 30         / 40
> >        |           /
> >   Node 2 (PMEM)
> >
> > node distances:
> > node   0    1    2
> >    0  10   20   30
> >    1  20   10   40
> >    2  30   40   10
> >
> > $ cat /sys/devices/system/memtier/possible
> > 0(64), 1(128), 2(192)
> >
> > $ grep '' /sys/devices/system/memtier/memtier*/rank
> > /sys/devices/system/memtier/memtier1/rank:128
> > /sys/devices/system/memtier/memtier2/rank:192
> >
> > $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> > /sys/devices/system/memtier/memtier1/nodelist:0-1
> > /sys/devices/system/memtier/memtier2/nodelist:2
> >
> > $ grep '' /sys/devices/system/node/node*/memtier
> > /sys/devices/system/node/node0/memtier:1
> > /sys/devices/system/node/node1/memtier:1
> > /sys/devices/system/node/node2/memtier:2
> >
> > Demotion fallback order:
> > node 0: 2
> > node 1: 2
> > node 2: empty
> >
> >
> > * Example 3:
> >
> > Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> >

np: PMEM instead of memory-only DRAM?

> > All nodes are in the same tier.
> >
> >                   20
> >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> >          \                 /
> >           \ 30            / 30
> >            \             /
> >              Node 2 (PMEM)
> >
> > node distances:
> > node   0    1    2
> >    0  10   20   30
> >    1  20   10   30
> >    2  30   30   10
> >
> > $ cat /sys/devices/system/memtier/possible
> > 0(64), 1(128), 2(192)
> >
> > $ grep '' /sys/devices/system/memtier/memtier*/rank
> > /sys/devices/system/memtier/memtier1/rank:128
> >
> > $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> > /sys/devices/system/memtier/memtier1/nodelist:0-2
> >
> > $ grep '' /sys/devices/system/node/node*/memtier
> > /sys/devices/system/node/node0/memtier:1
> > /sys/devices/system/node/node1/memtier:1
> > /sys/devices/system/node/node2/memtier:1
> >
> > Demotion fallback order:
> > node 0: empty
> > node 1: empty
> > node 2: empty
> >
> >
> > * Example 4:
> >
> > Node 0 is a DRAM node with CPU.
> > Node 1 is a PMEM node.
> > Node 2 is a GPU node.
> >
> >                   50
> >   Node 0 (DRAM)  ----  Node 2 (GPU)
> >          \                 /
> >           \ 30            / 60
> >            \             /
> >              Node 1 (PMEM)
> >
> > node distances:
> > node   0    1    2
> >    0  10   30   50
> >    1  30   10   60
> >    2  50   60   10
> >
> > $ cat /sys/devices/system/memtier/possible
> > 0(64), 1(128), 2(192)
> >
> > $ grep '' /sys/devices/system/memtier/memtier*/rank
> > /sys/devices/system/memtier/memtier0/rank:64
> > /sys/devices/system/memtier/memtier1/rank:128
> > /sys/devices/system/memtier/memtier2/rank:192
> >
> > $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> > /sys/devices/system/memtier/memtier0/nodelist:2
> > /sys/devices/system/memtier/memtier1/nodelist:0
> > /sys/devices/system/memtier/memtier2/nodelist:1
> >
> > $ grep '' /sys/devices/system/node/node*/memtier
> > /sys/devices/system/node/node0/memtier:1
> > /sys/devices/system/node/node1/memtier:2
> > /sys/devices/system/node/node2/memtier:0
> >
> > Demotion fallback order:
> > node 0: 1
> > node 1: empty
> > node 2: 0, 1
> >
> >
> > * Example 5:
> >
> > Node 0 is a DRAM node with CPU.
> > Node 1 is a GPU node.
> > Node 2 is a PMEM node.
> > Node 3 is a large, slow DRAM node without CPU.
> >
> >                     100
> >      Node 0 (DRAM)  ----  Node 1 (GPU)
> >     /     |               /    |
> >    /40    |30        120 /     | 110
> >   |       |             /      |
> >   |  Node 2 (PMEM) ----       /
> >   |        \                 /
> >    \     80 \               /
> >     ------- Node 3 (Slow DRAM)
> >
> > node distances:
> > node    0    1    2    3
> >    0   10  100   30   40
> >    1  100   10  120  110
> >    2   30  120   10   80
> >    3   40  110   80   10
> >
> > MAX_MEMORY_TIERS=4 (memtier3 is a memory tier added later).
> >
> > $ cat /sys/devices/system/memtier/possible
> > 0(64), 1(128), 3(160), 2(192)
> >
> > $ grep '' /sys/devices/system/memtier/memtier*/rank
> > /sys/devices/system/memtier/memtier0/rank:64
> > /sys/devices/system/memtier/memtier1/rank:128
> > /sys/devices/system/memtier/memtier2/rank:192
> > /sys/devices/system/memtier/memtier3/rank:160
> >
> > $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> > /sys/devices/system/memtier/memtier0/nodelist:1
> > /sys/devices/system/memtier/memtier1/nodelist:0
> > /sys/devices/system/memtier/memtier2/nodelist:2
> > /sys/devices/system/memtier/memtier3/nodelist:3
> >
> > $ grep '' /sys/devices/system/node/node*/memtier
> > /sys/devices/system/node/node0/memtier:1
> > /sys/devices/system/node/node1/memtier:0
> > /sys/devices/system/node/node2/memtier:2
> > /sys/devices/system/node/node3/memtier:3
> >
> > Demotion fallback order:
> > node 0: 2, 3
> > node 1: 0, 3, 2
> > node 2: empty
> > node 3: 2
>
>
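
As a sanity check of the Example 5 numbers, the demotion fallback order
can be recomputed from the ranks and the distance table alone. The
sketch below is a userspace approximation, not the kernel's zonelist
code: it assumes the allocation fallback order is a plain sort by node
distance, and the helper names demotion_order() and
print_demotion_order() are made up for this illustration.

```c
#include <stdio.h>

#define NR_NODES 4

/* Rank of the tier each node belongs to in Example 5
 * (lower rank = higher tier). */
static const int node_rank[NR_NODES] = { 128, 64, 192, 160 };

/* Node distance table from Example 5. */
static const int distance[NR_NODES][NR_NODES] = {
	{  10, 100,  30,  40 },
	{ 100,  10, 120, 110 },
	{  30, 120,  10,  80 },
	{  40, 110,  80,  10 },
};

/*
 * Fill out[] with the demotion targets of node src: every node in a
 * lower tier (larger rank), ordered by increasing distance from src.
 * Returns the number of targets written.
 */
int demotion_order(int src, int out[NR_NODES])
{
	int n = 0;

	for (int nid = 0; nid < NR_NODES; nid++)
		if (node_rank[nid] > node_rank[src])
			out[n++] = nid;

	/* Insertion sort by distance; ties keep the lower node ID first. */
	for (int i = 1; i < n; i++) {
		int nid = out[i], j = i;

		while (j > 0 && distance[src][out[j - 1]] > distance[src][nid]) {
			out[j] = out[j - 1];
			j--;
		}
		out[j] = nid;
	}
	return n;
}

/* Print one "node N: a, b" line in the format used by the examples. */
void print_demotion_order(int src)
{
	int targets[NR_NODES];
	int n = demotion_order(src, targets);

	printf("node %d:", src);
	if (n == 0)
		printf(" empty");
	for (int i = 0; i < n; i++)
		printf("%s %d", i ? "," : "", targets[i]);
	printf("\n");
}
```

Calling print_demotion_order() for nodes 0 to 3 gives the same four
lines as the "Demotion fallback order" listing of Example 5, so the
listed orders are consistent with "all lower-rank tiers, nearest node
first".
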