From: "Huang, Ying"
To: "Ho-Ren (Jack) Chuang"
Cc: "Gregory Price", aneesh.kumar@linux.ibm.com, mhocko@suse.com,
	tj@kernel.org, john@jagalactic.com, "Eishan Mirakhur",
	"Vinicius Tavares Petrucci", "Ravis OpenSrc", "Alistair Popple",
	"Srinivasulu Thanneeru", Dan Williams, Vishal Verma, Dave Jiang,
	Andrew Morton, nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	"Ho-Ren (Jack) Chuang", "Ho-Ren (Jack) Chuang",
	qemu-devel@nongnu.org, Hao Xiang
Subject: Re: [PATCH v8 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
In-Reply-To: <20240329004815.195476-3-horenchuang@bytedance.com> (Ho-Ren Chuang's message of "Fri, 29 Mar 2024 00:48:14 +0000")
References: <20240329004815.195476-1-horenchuang@bytedance.com>
	<20240329004815.195476-3-horenchuang@bytedance.com>
Date: Fri, 29 Mar 2024 08:57:14 +0800
Message-ID: <87a5mhlus5.fsf@yhuang6-desk2.ccr.corp.intel.com>

"Ho-Ren (Jack) Chuang" writes:

> The current implementation treats emulated memory devices, such as
> CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
> (E820_TYPE_RAM). However, these emulated devices have different
> characteristics than traditional DRAM, making it important to
> distinguish them. Thus, we modify the tiered memory initialization process
> to introduce a delay specifically for CPUless NUMA nodes. This delay
> ensures that the memory tier initialization for these nodes is deferred
> until HMAT information is obtained during the boot process. Finally,
> demotion tables are recalculated at the end.
>
> * late_initcall(memory_tier_late_init);
> Some device drivers may have initialized memory tiers between
> `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
> online memory nodes and configuring memory tiers. They should be excluded
> in the late init.
>
> * Handle cases where there is no HMAT when creating memory tiers
> There is a scenario where a CPUless node does not provide HMAT information.
> If no HMAT is specified, it falls back to using the default DRAM tier.
>
> * Introduce another new lock `default_dram_perf_lock` for adist calculation
> In the current implementation, iterating through CPUlist nodes requires
> holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
> trying to acquire the same lock, leading to a potential deadlock.
> Therefore, we propose introducing a standalone `default_dram_perf_lock` to
> protect `default_dram_perf_*`. This approach not only avoids deadlock
> but also prevents holding a large lock simultaneously.
>
> * Upgrade `set_node_memory_tier` to support additional cases, including
>   default DRAM, late CPUless, and hot-plugged initializations.
> To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
> handle cases where memtype is not initialized and where HMAT information is
> available.
>
> * Introduce `default_memory_types` for those memory types that are not
>   initialized by device drivers.
> Because late initialized memory and default DRAM memory need to be managed,
> a default memory type is created for storing all memory types that are
> not initialized by device drivers and as a fallback.
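
For readers following the fallback rules described in the bullets above, here
is a minimal userspace sketch of how the abstract distance for a node might be
chosen. It is not the kernel code: `pick_adistance()`, `calc_adistance()`,
`struct memtype` and the value 576 for `ADISTANCE_DRAM` are hypothetical
stand-ins introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed placeholder for MEMTIER_ADISTANCE_DRAM; the value is illustrative. */
#define ADISTANCE_DRAM 576

/* Hypothetical stand-in for struct memory_dev_type. */
struct memtype {
	int adistance;
};

/*
 * Hypothetical stand-in for mt_calc_adistance(): *adist is overwritten
 * only when performance data (for example from HMAT) exists for the node.
 */
static void calc_adistance(const int *hmat_adist, int *adist)
{
	assert(adist != NULL);
	if (hmat_adist)
		*adist = *hmat_adist;
}

/*
 * Selection order sketched from the commit message:
 *   1. a memory type already set by a device driver is kept;
 *   2. otherwise the HMAT-derived distance is used when available;
 *   3. otherwise the node falls back to the default DRAM distance.
 */
static int pick_adistance(const struct memtype *driver_type,
			  const int *hmat_adist)
{
	int adist = ADISTANCE_DRAM;

	calc_adistance(hmat_adist, &adist);
	if (driver_type)
		return driver_type->adistance;
	return adist;
}
```

Passing NULL for both arguments models a CPUless node without HMAT data, which
ends up with the default DRAM distance.
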
>
> Signed-off-by: Ho-Ren (Jack) Chuang
> Signed-off-by: Hao Xiang
> Reviewed-by: "Huang, Ying"
> ---
>  mm/memory-tiers.c | 94 +++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 78 insertions(+), 16 deletions(-)
>
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 974af10cfdd8..e24fc3bebae4 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -36,6 +36,11 @@ struct node_memory_type_map {
>
>  static DEFINE_MUTEX(memory_tier_lock);
>  static LIST_HEAD(memory_tiers);
> +/*
> + * The list is used to store all memory types that are not created
> + * by a device driver.
> + */
> +static LIST_HEAD(default_memory_types);
>  static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
>  struct memory_dev_type *default_dram_type;
>
> @@ -108,6 +113,8 @@ static struct demotion_nodes *node_demotion __read_mostly;
>
>  static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
>
> +/* The lock is used to protect `default_dram_perf*` info and nid. */
> +static DEFINE_MUTEX(default_dram_perf_lock);
>  static bool default_dram_perf_error;
>  static struct access_coordinate default_dram_perf;
>  static int default_dram_perf_ref_nid = NUMA_NO_NODE;
> @@ -505,7 +512,8 @@ static inline void __init_node_memory_type(int node, struct memory_dev_type *mem
>  static struct memory_tier *set_node_memory_tier(int node)
>  {
>  	struct memory_tier *memtier;
> -	struct memory_dev_type *memtype;
> +	struct memory_dev_type *mtype = default_dram_type;
> +	int adist = MEMTIER_ADISTANCE_DRAM;
>  	pg_data_t *pgdat = NODE_DATA(node);
>
>
> @@ -514,11 +522,20 @@ static struct memory_tier *set_node_memory_tier(int node)
>  	if (!node_state(node, N_MEMORY))
>  		return ERR_PTR(-EINVAL);
>
> -	__init_node_memory_type(node, default_dram_type);
> +	mt_calc_adistance(node, &adist);
> +	if (node_memory_types[node].memtype == NULL) {
> +		mtype = mt_find_alloc_memory_type(adist, &default_memory_types);
> +		if (IS_ERR(mtype)) {
> +			mtype = default_dram_type;
> +			pr_info("Failed to allocate a memory type. Fall back.\n");
> +		}
> +	}
> +
> +	__init_node_memory_type(node, mtype);
>
> -	memtype = node_memory_types[node].memtype;
> -	node_set(node, memtype->nodes);
> -	memtier = find_create_memory_tier(memtype);
> +	mtype = node_memory_types[node].memtype;
> +	node_set(node, mtype->nodes);
> +	memtier = find_create_memory_tier(mtype);
>  	if (!IS_ERR(memtier))
>  		rcu_assign_pointer(pgdat->memtier, memtier);
>  	return memtier;
> @@ -655,6 +672,34 @@ void mt_put_memory_types(struct list_head *memory_types)
>  }
>  EXPORT_SYMBOL_GPL(mt_put_memory_types);
>
> +/*
> + * This is invoked via `late_initcall()` to initialize memory tiers for
> + * CPU-less memory nodes after driver initialization, which is
> + * expected to provide `adistance` algorithms.
> + */
> +static int __init memory_tier_late_init(void)
> +{
> +	int nid;
> +
> +	mutex_lock(&memory_tier_lock);
> +	for_each_node_state(nid, N_MEMORY)
> +		if (!node_state(nid, N_CPU) &&

It appears that you didn't notice my comments about this...

https://lore.kernel.org/linux-mm/87v857kujp.fsf@yhuang6-desk2.ccr.corp.intel.com/

> +			node_memory_types[nid].memtype == NULL)
> +			/*
> +			 * Some device drivers may have initialized memory tiers
> +			 * between `memory_tier_init()` and `memory_tier_late_init()`,
> +			 * potentially bringing online memory nodes and
> +			 * configuring memory tiers. Exclude them here.
> +			 */
> +			set_node_memory_tier(nid);
> +
> +	establish_demotion_targets();
> +	mutex_unlock(&memory_tier_lock);
> +
> +	return 0;
> +}
> +late_initcall(memory_tier_late_init);
> +
>  static void dump_hmem_attrs(struct access_coordinate *coord, const char *prefix)
>  {
>  	pr_info(
> @@ -668,7 +713,7 @@ int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
>  {
>  	int rc = 0;
>
> -	mutex_lock(&memory_tier_lock);
> +	mutex_lock(&default_dram_perf_lock);
>  	if (default_dram_perf_error) {
>  		rc = -EIO;
>  		goto out;
> @@ -716,23 +761,30 @@ int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
>  	}
>
>  out:
> -	mutex_unlock(&memory_tier_lock);
> +	mutex_unlock(&default_dram_perf_lock);
>  	return rc;
>  }
>
>  int mt_perf_to_adistance(struct access_coordinate *perf, int *adist)
>  {
> -	if (default_dram_perf_error)
> -		return -EIO;
> +	int rc = 0;
>
> -	if (default_dram_perf_ref_nid == NUMA_NO_NODE)
> -		return -ENOENT;
> +	mutex_lock(&default_dram_perf_lock);
> +	if (default_dram_perf_error) {
> +		rc = -EIO;
> +		goto out;
> +	}
>
>  	if (perf->read_latency + perf->write_latency == 0 ||
> -	    perf->read_bandwidth + perf->write_bandwidth == 0)
> -		return -EINVAL;
> +	    perf->read_bandwidth + perf->write_bandwidth == 0) {
> +		rc = -EINVAL;
> +		goto out;
> +	}
>
> -	mutex_lock(&memory_tier_lock);
> +	if (default_dram_perf_ref_nid == NUMA_NO_NODE) {
> +		rc = -ENOENT;
> +		goto out;
> +	}
>  	/*
>  	 * The abstract distance of a memory node is in direct proportion to
>  	 * its memory latency (read + write) and inversely proportional to its
> @@ -745,8 +797,9 @@ int mt_perf_to_adistance(struct access_coordinate *perf, int *adist)
>  		(default_dram_perf.read_latency + default_dram_perf.write_latency) *
>  		(default_dram_perf.read_bandwidth + default_dram_perf.write_bandwidth) /
>  		(perf->read_bandwidth + perf->write_bandwidth);
> -	mutex_unlock(&memory_tier_lock);
>
> +out:
> +	mutex_unlock(&default_dram_perf_lock);
>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(mt_perf_to_adistance);
> @@ -858,7 +911,8 @@ static int __init memory_tier_init(void)
>  	 * For now we can have 4 faster memory tiers with smaller adistance
>  	 * than default DRAM tier.
>  	 */
> -	default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
> +	default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
> +						      &default_memory_types);
>  	if (IS_ERR(default_dram_type))
>  		panic("%s() failed to allocate default DRAM tier\n", __func__);
>
> @@ -868,6 +922,14 @@ static int __init memory_tier_init(void)
>  	 * types assigned.
>  	 */
>  	for_each_node_state(node, N_MEMORY) {
> +		if (!node_state(node, N_CPU))
> +			/*
> +			 * Defer memory tier initialization on CPUless numa nodes.
> +			 * These will be initialized after firmware and devices are
> +			 * initialized.
> +			 */
> +			continue;
> +
>  		memtier = set_node_memory_tier(node);
>  		if (IS_ERR(memtier))
>  			/*

--
Best Regards,
Huang, Ying
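
The abstract distance calculation quoted in the `mt_perf_to_adistance()` hunk
can be tried out in userspace with the rough sketch below. It assumes the
elided leading terms of the expression are the DRAM distance multiplied by the
node's read + write latency, as the in-code comment suggests. `struct perf`,
`struct dram_ref`, `perf_to_adistance()` and the value 576 are hypothetical
names and numbers for illustration, and the mutex is only indicated by
comments.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Assumed placeholder for MEMTIER_ADISTANCE_DRAM; the value is illustrative. */
#define ADISTANCE_DRAM 576

/* Hypothetical stand-in for struct access_coordinate. */
struct perf {
	unsigned int read_latency, write_latency;
	unsigned int read_bandwidth, write_bandwidth;
};

/* Hypothetical bundle of the default_dram_perf* state. */
struct dram_ref {
	bool error;		/* models default_dram_perf_error */
	bool valid;		/* models default_dram_perf_ref_nid != NUMA_NO_NODE */
	struct perf perf;	/* models default_dram_perf, assumed non-zero if valid */
};

static int perf_to_adistance(const struct dram_ref *ref,
			     const struct perf *p, int *adist)
{
	int rc = 0;

	assert(ref && p && adist);

	/* The kernel version would take default_dram_perf_lock here. */
	if (ref->error) {
		rc = -EIO;
		goto out;
	}
	if (p->read_latency + p->write_latency == 0 ||
	    p->read_bandwidth + p->write_bandwidth == 0) {
		rc = -EINVAL;
		goto out;
	}
	if (!ref->valid) {
		rc = -ENOENT;
		goto out;
	}

	/* Proportional to latency, inversely proportional to bandwidth. */
	*adist = ADISTANCE_DRAM *
		(p->read_latency + p->write_latency) /
		(ref->perf.read_latency + ref->perf.write_latency) *
		(ref->perf.read_bandwidth + ref->perf.write_bandwidth) /
		(p->read_bandwidth + p->write_bandwidth);

out:
	/* The kernel version would drop default_dram_perf_lock here. */
	return rc;
}
```

With these definitions, a node with twice the DRAM latency and half the DRAM
bandwidth gets four times the DRAM distance. The sketch returns `rc` after the
`out:` label so that the error codes reach the caller. The quoted hunk shows
`return 0;` as unchanged context at that point, which may be worth
double-checking.
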