From: Baolin Wang
To: Wei Xu, Andrew Morton, Dave Hansen, Huang Ying, Dan Williams, Yang Shi, Linux MM, Greg Thelen, "Aneesh Kumar K.V", Jagdish Gediya, Linux Kernel Mailing List, Alistair Popple, Michal Hocko, Brice Goglin, Feng Tang, Jonathan.Cameron@huawei.com
Subject: Re: RFC: Memory Tiering Kernel Interfaces
Date: Tue, 3 May 2022 10:07:08 +0800
Message-ID: <87f8d4d0-6d06-7254-b2a6-3ccf6a555733@linux.alibaba.com>
In-Reply-To: <20220501175813.tvytoosygtqlh3nn@offworld>

On 5/2/2022 1:58 AM, Davidlohr Bueso wrote:
> Nice summary, thanks. I don't know who of the interested parties will be
> at lsfmm, but fyi we have a couple of sessions on memory tiering Tuesday
> at 14:00 and 15:00.
>
> On Fri, 29 Apr 2022, Wei Xu wrote:
>
>> The current kernel has the basic memory tiering support: Inactive
>> pages on a higher tier NUMA node can be migrated (demoted) to a lower
>> tier NUMA node to make room for new allocations on the higher tier
>> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
>> migrated (promoted) to a higher tier NUMA node to improve the
>> performance.
>
> Regardless of the promotion algorithm, at some point I see the NUMA hinting
> fault mechanism being in the way of performance. It would be nice if hardware
> began giving us page "heatmaps" instead of having to rely on faulting or
> sampling based ways to identify hot memory.
>
>> A tiering relationship between NUMA nodes in the form of demotion path
>> is created during the kernel initialization and updated when a NUMA
>> node is hot-added or hot-removed.  The current implementation puts all
>> nodes with CPU into the top tier, and then builds the tiering hierarchy
>> tier-by-tier by establishing the per-node demotion targets based on
>> the distances between nodes.
>>
>> The current memory tiering interface needs to be improved to address
>> several important use cases:
>>
>> * The current tiering initialization code always initializes
>>   each memory-only NUMA node into a lower tier.  But a memory-only
>>   NUMA node may have a high performance memory device (e.g. a DRAM
>>   device attached via CXL.mem or a DRAM-backed memory-only node on
>>   a virtual machine) and should be put into the top tier.
>
> At least the CXL memory (volatile or not) will still be slower than
> regular DRAM, so I think that we'd not want this to be top-tier. But
> in general, yes I agree that defining top tier as whether or not the
> node has a CPU is a bit limiting, as you've detailed here.
>
>> Tiering Hierarchy Initialization
>> ================================
>>
>> By default, all memory nodes are in the top tier (N_TOPTIER_MEMORY).
>>
>> A device driver can remove its memory nodes from the top tier, e.g.
>> a dax driver can remove PMEM nodes from the top tier.
>>
>> The kernel builds the memory tiering hierarchy and per-node demotion
>> order tier-by-tier starting from N_TOPTIER_MEMORY.  For a node N, the
>> best distance nodes in the next lower tier are assigned to
>> node_demotion[N].preferred and all the nodes in the next lower tier
>> are assigned to node_demotion[N].allowed.
>>
>> node_demotion[N].preferred can be empty if no preferred demotion node
>> is available for node N.
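
To make the quoted rule concrete, here is a minimal userspace sketch of how
"preferred" and "allowed" could be derived for each node. It is only a model
of the description above, not kernel code: the function name
build_demotion_targets, the tier layout and the distance values are all
made up for illustration.

```python
# Userspace model of the rule quoted above; NOT kernel code.
# Node ids, tier membership and distances are hypothetical example values.

def build_demotion_targets(tiers, distance):
    """Compute per-node demotion targets.

    tiers:    list of lists of node ids, top tier first.
    distance: dict mapping (src, dst) -> NUMA distance.
    Returns {node: {"preferred": [...], "allowed": [...]}}.
    """
    targets = {}
    for level, nodes in enumerate(tiers):
        # Nodes in the next lower tier, or nothing for the lowest tier.
        lower = tiers[level + 1] if level + 1 < len(tiers) else []
        for n in nodes:
            if not lower:
                # No lower tier: both sets stay empty.
                targets[n] = {"preferred": [], "allowed": []}
                continue
            best = min(distance[(n, m)] for m in lower)
            targets[n] = {
                # All lower-tier nodes that share the best distance.
                "preferred": [m for m in lower if distance[(n, m)] == best],
                # Every node in the next lower tier.
                "allowed": list(lower),
            }
    return targets


# Two top-tier nodes (0, 1) and two lower-tier nodes (2, 3).
tiers = [[0, 1], [2, 3]]
distance = {(0, 2): 17, (0, 3): 28, (1, 2): 28, (1, 3): 17}
targets = build_demotion_targets(tiers, distance)
```

With these example distances, node 0 prefers node 2 and node 1 prefers
node 3, while both are allowed to demote to either lower-tier node. If two
lower-tier nodes had the same distance, both would land in "preferred".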
>
> Upon cases where there is more than one possible demotion node (with equal
> cost), I'm wondering if we want to do something better than choosing
> randomly, like we do now - perhaps round robin? Of course anything
> like this will require actual performance data, something I have seen
> very little of.

I've tried to use round robin[1] to select the target demotion node when
there are multiple demotion nodes, but I did not see any obvious
performance gain in MySQL testing. Maybe other test suites would show a
difference?

[1] https://lore.kernel.org/all/c02bcbc04faa7a2c852534e9cd58a91c44494657.1636016609.git.baolin.wang@linux.alibaba.com/
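
For readers comparing the two selection policies discussed above, here is a
small userspace sketch. It is a simplified model, not the kernel patch
referenced above: the class names RandomPicker and RoundRobinPicker and the
node ids are hypothetical, and it only shows how the targets get spread, not
any performance effect.

```python
# Userspace model of two ways to pick among equal-cost demotion nodes.
# NOT kernel code; class names and node ids are hypothetical.
import itertools
import random
from collections import Counter


class RandomPicker:
    """Pick one of the preferred nodes uniformly at random."""

    def __init__(self, preferred, seed=None):
        self.preferred = list(preferred)
        self.rng = random.Random(seed)

    def next_target(self):
        return self.rng.choice(self.preferred)


class RoundRobinPicker:
    """Cycle through the preferred nodes in a fixed order."""

    def __init__(self, preferred):
        self._cycle = itertools.cycle(list(preferred))

    def next_target(self):
        return next(self._cycle)


preferred = [2, 3]  # two equal-cost demotion nodes

rr = RoundRobinPicker(preferred)
rr_picks = [rr.next_target() for _ in range(8)]
rr_counts = Counter(rr_picks)  # exactly even split across nodes

rnd = RandomPicker(preferred, seed=0)
rnd_picks = [rnd.next_target() for _ in range(8)]
rnd_counts = Counter(rnd_picks)  # even only on average
```

Round robin guarantees an even split over any full cycle, while random
selection is only even on average. Whether that difference matters for a
given workload is exactly the open question, and it needs real measurements.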