From: "Huang, Ying"
To: "Ho-Ren (Jack) Chuang"
Cc: "Gregory Price", aneesh.kumar@linux.ibm.com, mhocko@suse.com, tj@kernel.org,
	john@jagalactic.com, "Eishan Mirakhur", "Vinicius Tavares Petrucci",
	"Ravis OpenSrc", "Alistair Popple", "Srinivasulu Thanneeru", Dan Williams,
	Vishal Verma, Dave Jiang, Andrew Morton, nvdimm@lists.linux.dev,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	"Ho-Ren (Jack) Chuang", "Ho-Ren (Jack) Chuang", qemu-devel@nongnu.org, Hao Xiang
Subject: Re: [PATCH v4 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
In-Reply-To: <20240322070356.315922-3-horenchuang@bytedance.com> (Ho-Ren Chuang's message of "Fri, 22 Mar 2024 07:03:55 +0000")
References: <20240322070356.315922-1-horenchuang@bytedance.com> <20240322070356.315922-3-horenchuang@bytedance.com>
Date: Fri, 22 Mar 2024 16:39:44 +0800
Message-ID: <87cyrmr773.fsf@yhuang6-desk2.ccr.corp.intel.com>

"Ho-Ren (Jack) Chuang" writes:

> The current implementation treats emulated memory devices, such as
> CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
> (E820_TYPE_RAM). However, these emulated devices have different
> characteristics than traditional DRAM, making it important to
> distinguish them. Thus, we modify the tiered memory initialization process
> to introduce a delay specifically for CPUless NUMA nodes.
> This delay ensures that the memory tier initialization for these nodes is
> deferred until HMAT information is obtained during the boot process. Finally,
> demotion tables are recalculated at the end.
>
> * late_initcall(memory_tier_late_init);
> Some device drivers may have initialized memory tiers between
> `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
> online memory nodes and configuring memory tiers. They should be excluded
> in the late init.
>
> * Handle cases where there is no HMAT when creating memory tiers
> There is a scenario where a CPUless node does not provide HMAT information.
> If no HMAT is specified, it falls back to using the default DRAM tier.
>
> * Introduce another new lock `default_dram_perf_lock` for adist calculation
> In the current implementation, iterating through CPUless nodes requires
> holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
> trying to acquire the same lock, leading to a potential deadlock.
> Therefore, we propose introducing a standalone `default_dram_perf_lock` to
> protect `default_dram_perf_*`. This approach not only avoids deadlock
> but also prevents holding a large lock simultaneously.
>
> * Upgrade `set_node_memory_tier` to support additional cases, including
> default DRAM, late CPUless, and hot-plugged initializations.
> To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
> handle cases where memtype is not initialized and where HMAT information is
> available.
>
> * Introduce `default_memory_types` for those memory types that are not
> initialized by device drivers.
> Because late initialized memory and default DRAM memory need to be managed,
> a default memory type is created for storing all memory types that are
> not initialized by device drivers and as a fallback.
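
For illustration only, the lock split described in the third item can be
modeled in userspace. This is a single-threaded toy model of a non-recursive
lock, not kernel code: `toy_mutex`, `toy_lock()`, `toy_unlock()`,
`calc_adistance_shared()`, `calc_adistance_split()` and `late_init()` are
hypothetical names, and the model reports -EDEADLK where a real mutex would
block forever.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy, single-threaded model of a non-recursive mutex (hypothetical). */
struct toy_mutex {
	bool held;
};

/* Returns 0 on success, -EDEADLK where a real mutex would self-deadlock. */
static int toy_lock(struct toy_mutex *m)
{
	if (m->held)
		return -EDEADLK;
	m->held = true;
	return 0;
}

static void toy_unlock(struct toy_mutex *m)
{
	assert(m->held);
	m->held = false;
}

static struct toy_mutex tier_lock;	/* models memory_tier_lock */
static struct toy_mutex perf_lock;	/* models default_dram_perf_lock */

/* Old scheme: the adistance helper takes the same lock as its caller. */
static int calc_adistance_shared(void)
{
	int rc = toy_lock(&tier_lock);

	if (rc)
		return rc;
	/* ... read the default DRAM performance data here ... */
	toy_unlock(&tier_lock);
	return 0;
}

/* New scheme: the helper takes a separate, smaller lock. */
static int calc_adistance_split(void)
{
	int rc = toy_lock(&perf_lock);

	if (rc)
		return rc;
	/* ... read the default DRAM performance data here ... */
	toy_unlock(&perf_lock);
	return 0;
}

/* Models the late init path, which walks the nodes under tier_lock. */
static int late_init(int (*calc)(void))
{
	int rc = toy_lock(&tier_lock);

	if (rc)
		return rc;
	rc = calc();
	toy_unlock(&tier_lock);
	return rc;
}
```

In this model, `late_init(calc_adistance_shared)` fails with -EDEADLK because
the helper asks for a lock its caller already holds, while
`late_init(calc_adistance_split)` succeeds.
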
>
> Signed-off-by: Ho-Ren (Jack) Chuang
> Signed-off-by: Hao Xiang
> ---
>  mm/memory-tiers.c | 73 ++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 63 insertions(+), 10 deletions(-)
>
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 974af10cfdd8..9396330fa162 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -36,6 +36,11 @@ struct node_memory_type_map {
>
>  static DEFINE_MUTEX(memory_tier_lock);
>  static LIST_HEAD(memory_tiers);
> +/*
> + * The list is used to store all memory types that are not created
> + * by a device driver.
> + */
> +static LIST_HEAD(default_memory_types);
>  static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
>  struct memory_dev_type *default_dram_type;
>
> @@ -108,6 +113,7 @@ static struct demotion_nodes *node_demotion __read_mostly;
>
>  static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
>
> +static DEFINE_MUTEX(default_dram_perf_lock);

It would be better to add a comment describing what this lock protects.

>  static bool default_dram_perf_error;
>  static struct access_coordinate default_dram_perf;
>  static int default_dram_perf_ref_nid = NUMA_NO_NODE;
> @@ -505,7 +511,8 @@ static inline void __init_node_memory_type(int node, struct memory_dev_type *mem
>  static struct memory_tier *set_node_memory_tier(int node)
>  {
>  	struct memory_tier *memtier;
> -	struct memory_dev_type *memtype;
> +	struct memory_dev_type *mtype;

mtype may now be used below without being initialized: if
node_memory_types[node].memtype is already set, nothing assigns mtype
before it is passed to __init_node_memory_type().

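
One possible way to avoid this, sketched in userspace under simplifying
assumptions: start the pointer at NULL and only look a type up when the node
has none yet. The names below (`struct dev_type`, `node_types[]`,
`default_type`, `find_alloc_type()`, `pick_node_type()`) are hypothetical
stand-ins and not the kernel API; in particular the stand-in allocator returns
NULL on failure, whereas the helper in the patch is checked with IS_ERR().

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 8
#define POOL_SIZE 2

/* Hypothetical stand-in for struct memory_dev_type. */
struct dev_type {
	int adistance;
};

/* Hypothetical stand-in for node_memory_types[].memtype. */
static struct dev_type *node_types[MAX_NODES];

/* Hypothetical fallback type; the adistance value is arbitrary. */
static struct dev_type default_type = { .adistance = 100 };

/* Small fixed pool modelling the list of default memory types. */
static struct dev_type pool[POOL_SIZE];
static size_t pool_used;

/* Find a type with this adistance, or allocate one; NULL when full. */
static struct dev_type *find_alloc_type(int adist)
{
	for (size_t i = 0; i < pool_used; i++)
		if (pool[i].adistance == adist)
			return &pool[i];
	if (pool_used == POOL_SIZE)
		return NULL;
	pool[pool_used].adistance = adist;
	return &pool[pool_used++];
}

/*
 * Model of the type-selection step. mtype starts as NULL, so the path
 * where a driver already set the node's type never reads an
 * indeterminate pointer.
 */
static struct dev_type *pick_node_type(int node, int adist)
{
	struct dev_type *mtype = NULL;

	assert(node >= 0 && node < MAX_NODES);

	if (node_types[node] == NULL) {
		mtype = find_alloc_type(adist);
		if (mtype == NULL)
			mtype = &default_type;	/* fall back */
		node_types[node] = mtype;
	}

	/* Always return what the map holds, whoever set it. */
	return node_types[node];
}
```
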
> +	int adist = MEMTIER_ADISTANCE_DRAM;
>  	pg_data_t *pgdat = NODE_DATA(node);
>
>
> @@ -514,11 +521,20 @@ static struct memory_tier *set_node_memory_tier(int node)
>  	if (!node_state(node, N_MEMORY))
>  		return ERR_PTR(-EINVAL);
>
> -	__init_node_memory_type(node, default_dram_type);
> +	mt_calc_adistance(node, &adist);
> +	if (node_memory_types[node].memtype == NULL) {
> +		mtype = mt_find_alloc_memory_type(adist, &default_memory_types);
> +		if (IS_ERR(mtype)) {
> +			mtype = default_dram_type;
> +			pr_info("Failed to allocate a memory type. Fall back.\n");
> +		}
> +	}
>
> -	memtype = node_memory_types[node].memtype;
> -	node_set(node, memtype->nodes);
> -	memtier = find_create_memory_tier(memtype);
> +	__init_node_memory_type(node, mtype);
> +
> +	mtype = node_memory_types[node].memtype;
> +	node_set(node, mtype->nodes);
> +	memtier = find_create_memory_tier(mtype);
>  	if (!IS_ERR(memtier))
>  		rcu_assign_pointer(pgdat->memtier, memtier);
>  	return memtier;
> @@ -655,6 +671,34 @@ void mt_put_memory_types(struct list_head *memory_types)
>  }
>  EXPORT_SYMBOL_GPL(mt_put_memory_types);
>
> +/*
> + * This is invoked via `late_initcall()` to initialize memory tiers for
> + * CPU-less memory nodes after driver initialization, which is
> + * expected to provide `adistance` algorithms.
> + */
> +static int __init memory_tier_late_init(void)
> +{
> +	int nid;
> +
> +	mutex_lock(&memory_tier_lock);
> +	for_each_node_state(nid, N_MEMORY)
> +		if (!node_state(nid, N_CPU) &&
> +		    node_memory_types[nid].memtype == NULL)
> +			/*
> +			 * Some device drivers may have initialized memory tiers
> +			 * between `memory_tier_init()` and `memory_tier_late_init()`,
> +			 * potentially bringing online memory nodes and
> +			 * configuring memory tiers. Exclude them here.
> +			 */
> +			set_node_memory_tier(nid);
> +
> +	establish_demotion_targets();
> +	mutex_unlock(&memory_tier_lock);
> +
> +	return 0;
> +}
> +late_initcall(memory_tier_late_init);
> +
>  static void dump_hmem_attrs(struct access_coordinate *coord, const char *prefix)
>  {
>  	pr_info(
> @@ -668,7 +712,7 @@ int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
>  {
>  	int rc = 0;
>
> -	mutex_lock(&memory_tier_lock);
> +	mutex_lock(&default_dram_perf_lock);
>  	if (default_dram_perf_error) {
>  		rc = -EIO;
>  		goto out;
> @@ -716,7 +760,7 @@ int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
>  	}
>
>  out:
> -	mutex_unlock(&memory_tier_lock);
> +	mutex_unlock(&default_dram_perf_lock);
>  	return rc;
>  }
>
> @@ -732,7 +776,7 @@ int mt_perf_to_adistance(struct access_coordinate *perf, int *adist)
>  	    perf->read_bandwidth + perf->write_bandwidth == 0)
>  		return -EINVAL;
>
> -	mutex_lock(&memory_tier_lock);
> +	mutex_lock(&default_dram_perf_lock);
>  	/*
>  	 * The abstract distance of a memory node is in direct proportion to
>  	 * its memory latency (read + write) and inversely proportional to its
> @@ -745,7 +789,7 @@ int mt_perf_to_adistance(struct access_coordinate *perf, int *adist)
>  		(default_dram_perf.read_latency + default_dram_perf.write_latency) *
>  		(default_dram_perf.read_bandwidth + default_dram_perf.write_bandwidth) /
>  		(perf->read_bandwidth + perf->write_bandwidth);
> -	mutex_unlock(&memory_tier_lock);
> +	mutex_unlock(&default_dram_perf_lock);
>
>  	return 0;
>  }
> @@ -858,7 +902,8 @@ static int __init memory_tier_init(void)
>  	 * For now we can have 4 faster memory tiers with smaller adistance
>  	 * than default DRAM tier.
>  	 */
> -	default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
> +	default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
> +						      &default_memory_types);
>  	if (IS_ERR(default_dram_type))
>  		panic("%s() failed to allocate default DRAM tier\n", __func__);
>
> @@ -868,6 +913,14 @@ static int __init memory_tier_init(void)
>  	 * types assigned.
>  	 */
>  	for_each_node_state(node, N_MEMORY) {
> +		if (!node_state(node, N_CPU))
> +			/*
> +			 * Defer memory tier initialization on CPUless numa nodes.
> +			 * These will be initialized after firmware and devices are
> +			 * initialized.
> +			 */
> +			continue;
> +
>  		memtier = set_node_memory_tier(node);
>  		if (IS_ERR(memtier))
>  			/*

--
Best Regards,
Huang, Ying