From: "Huang, Ying"
To: Dan Williams
Cc: Aneesh Kumar K.V, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen, Michal Hocko, "Linux Kernel Mailing List", Hesham Almatary, Dave Hansen, "Jonathan Cameron", Alistair Popple, Johannes Weiner, Jagdish Gediya
Subject: Re: [PATCH v11 1/8] mm/demotion: Add support for explicit memory tiers
References: <20220728190436.858458-1-aneesh.kumar@linux.ibm.com> <20220728190436.858458-2-aneesh.kumar@linux.ibm.com> <62e890da7f784_577a029473@dwillia2-xfh.jf.intel.com.notmuch>
Date: Tue, 02 Aug 2022 11:16:54 +0800
In-Reply-To: <62e890da7f784_577a029473@dwillia2-xfh.jf.intel.com.notmuch> (Dan Williams's message of "Mon, 1 Aug 2022 19:50:02 -0700")
Message-ID: <874jyvjpw9.fsf@yhuang6-desk2.ccr.corp.intel.com>

Dan Williams writes:

> Aneesh Kumar K.V wrote:
>> In the current kernel, memory tiers are defined implicitly via a demotion path
>> relationship between NUMA nodes, which is created during the kernel
>> initialization and updated when a NUMA node is hot-added or hot-removed. The
>> current implementation puts all nodes with CPU into the highest tier, and builds
>> the tier hierarchy tier-by-tier by establishing the per-node demotion targets
>> based on the distances between nodes.
>>
>> This current memory tier kernel implementation needs to be improved for several
>> important use cases:
>>
>> The current tier initialization code always initializes each memory-only NUMA
>> node into a lower tier. But a memory-only NUMA node may have a high performance
>> memory device (e.g. a DRAM-backed memory-only node on a virtual machine) that
>> should be put into a higher tier.
>>
>> The current tier hierarchy always puts CPU nodes into the top tier. But on a
>> system with HBM or GPU devices, the memory-only NUMA nodes mapping these devices
>> should be in the top tier, and DRAM nodes with CPUs are better to be placed into
>> the next lower tier.
>>
>> With the current kernel, a higher tier node can only be demoted to the nodes
>> with the shortest distance on the next lower tier as defined by the demotion
>> path, not any other node from any lower tier. This strict demotion order does
>> not work in all use cases (e.g. some use cases may want to allow cross-socket
>> demotion to another node in the same demotion tier as a fallback when the
>> preferred demotion node is out of space). This demotion order is also
>> inconsistent with the page allocation fallback order when all the nodes in a
>> higher tier are out of space: the page allocation can fall back to any node
>> from any lower tier, whereas the demotion order doesn't allow that.
>>
>> This patch series addresses the above by defining memory tiers explicitly.
>>
>> The Linux kernel presents memory devices as NUMA nodes and each memory device
>> is of a specific type. The memory type of a device is represented by its
>> abstract distance. A memory tier corresponds to a range of abstract distances.
>> This allows for classifying memory devices with a specific performance range
>> into a memory tier.
>>
>> This patch configures the range/chunk size to be 128. The default DRAM
>> abstract distance is 512. We can have 4 memory tiers below the default DRAM
>> abstract distance which cover the range 0 - 127, 128 - 255, 256 - 383, 384 - 511.
>> Slower memory devices like persistent memory will have abstract distance below
>> the default DRAM level and hence will be placed in these 4 lower tiers.
>>
>> A kernel parameter is provided to override the default memory tier.
>>
>> Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
>> Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>>
>> Signed-off-by: Jagdish Gediya
>> Signed-off-by: Aneesh Kumar K.V
>> ---
>>  include/linux/memory-tiers.h |  17 ++++++
>>  mm/Makefile                  |   1 +
>>  mm/memory-tiers.c            | 102 +++++++++++++++++++++++++++++++++++
>>  3 files changed, 120 insertions(+)
>>  create mode 100644 include/linux/memory-tiers.h
>>  create mode 100644 mm/memory-tiers.c
>>
>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> new file mode 100644
>> index 000000000000..8d7884b7a3f0
>> --- /dev/null
>> +++ b/include/linux/memory-tiers.h
>> @@ -0,0 +1,17 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _LINUX_MEMORY_TIERS_H
>> +#define _LINUX_MEMORY_TIERS_H
>> +
>> +/*
>> + * Each tier covers an abstract distance chunk size of 128
>> + */
>> +#define MEMTIER_CHUNK_BITS	7
>> +#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
>> +/*
>> + * For now let's have 4 memory tiers below default DRAM tier.
>> + */
>> +#define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))
>> +/* leave one tier below this slow pmem */
>> +#define MEMTIER_ADISTANCE_PMEM	(1 << MEMTIER_CHUNK_BITS)
>
> Why is memory type encoded in these values? There is no reason to
> believe that PMEM is of a lower performance tier than DRAM. Consider
> high performance energy backed DRAM that makes it "PMEM"; consider CXL
> attached DRAM over a switch topology and constrained links that makes it
> a lower performance tier than locally attached DRAM. The names should be
> associated with tiers that indicate their usage. Something like HOT,
> GENERAL, and COLD, where, for example, HOT is low capacity high
> performance compared to the general purpose pool, and COLD is high
> capacity low performance intended to offload the general purpose tier.
>
> It does not need to be exactly that ontology, but please try to not
> encode policy meaning behind memory types. There has been explicit
> effort to avoid that to date because types are fraught for declaring
> relative performance characteristics, and the relative performance
> changes based on what memory types are assembled in a given system.

Yes. MEMTIER_ADISTANCE_PMEM is oversimplified. It is only used in this
very first version to keep things as simple as possible. I think we can
come up with something better in a later version, for example by
identifying the abstract distance of a PMEM device based on HMAT, etc.
And even in this first version, we should put MEMTIER_ADISTANCE_PMEM in
dax/kmem.c, because it is just for the specific type of memory used now,
not for all PMEM.

In the current design, memory type is used to report the performance of
the hardware in terms of abstract distance, per Johannes' suggestion.
Abstract distance is an abstraction of memory latency and bandwidth.
Policy is described via memory tiers. Several memory types may be put
in one memory tier. The abstract distance chunk size of the memory tier
may be adjusted according to policy.

Best Regards,
Huang, Ying
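
To illustrate the point that several memory types may share one memory
tier, here is a minimal sketch of the chunking arithmetic. The macro
values are copied from the quoted patch. The helper names
adistance_chunk_start() and adistance_same_tier() are hypothetical and
are not part of the patch. The sketch assumes that a tier is simply the
chunk of MEMTIER_CHUNK_SIZE abstract distances that a value falls into.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the quoted patch. */
#define MEMTIER_CHUNK_BITS	7
#define MEMTIER_CHUNK_SIZE	(1 << MEMTIER_CHUNK_BITS)
#define MEMTIER_ADISTANCE_DRAM	(1 << (MEMTIER_CHUNK_BITS + 2))

/*
 * Hypothetical helper (not from the patch): return the first abstract
 * distance of the chunk that contains @adistance, by clearing the low
 * MEMTIER_CHUNK_BITS bits.
 */
static int adistance_chunk_start(int adistance)
{
	assert(adistance >= 0);
	return adistance & ~(MEMTIER_CHUNK_SIZE - 1);
}

/*
 * Hypothetical helper (not from the patch): two memory types land in
 * the same tier when their abstract distances fall into the same chunk.
 */
static bool adistance_same_tier(int a, int b)
{
	return adistance_chunk_start(a) == adistance_chunk_start(b);
}
```

Under these assumptions, memory types at abstract distances 512 and 600
fall into the same tier, while 511 and 512 fall into different tiers.
Changing MEMTIER_CHUNK_BITS changes how memory types are grouped without
touching the abstract distance reported for each type.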