From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id C504BC54E64 for ; Tue, 26 Mar 2024 03:08:17 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 342056B0088; Mon, 25 Mar 2024 23:08:17 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 2A1586B008A; Mon, 25 Mar 2024 23:08:17 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 141EC6B0092; Mon, 25 Mar 2024 23:08:17 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id F359A6B0088 for ; Mon, 25 Mar 2024 23:08:16 -0400 (EDT) Received: from smtpin21.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id C12051208B9 for ; Tue, 26 Mar 2024 03:08:16 +0000 (UTC) X-FDA: 81937706592.21.F15A5D3 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.11]) by imf17.hostedemail.com (Postfix) with ESMTP id BF90F4000E for ; Tue, 26 Mar 2024 03:08:13 +0000 (UTC) Authentication-Results: imf17.hostedemail.com; dkim=pass header.d=intel.com header.s=Intel header.b=NsIhD5sT; spf=pass (imf17.hostedemail.com: domain of ying.huang@intel.com designates 192.198.163.11 as permitted sender) smtp.mailfrom=ying.huang@intel.com; dmarc=pass (policy=none) header.from=intel.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1711422495; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=UnBcv0lS/ef5p2N65UCetVCHRWZbVtNOZyWWs/UhV0M=; b=cIZurXz5gjtUXbLwhJ46ukjSIhaTgMsw8g4D7jnVCqaT+0tG+tCvf1QsEwPEWD3a9wj+jZ I7NNKun8H6RO4SqHyArcvNe/JQRS2PscvcEn8r7oF7ZnUJ3yFw7w7kT6ok/BV/L03q7Rhm lxA2qOeheg2sqNHu3bxgAz5LIVke/YA= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1711422495; a=rsa-sha256; cv=none; b=VV1fVUNKaFMUyWpBF4qPmpd6svCfbTQwk5YToIKko24SrVoireUt7+6dxEPlTRq4uuePwT KHL3tHf1tpkyf2xgJP3TE4k0FHp/6uovw9b+ZJB8Xs3eNT+qGqIY/Q0A2XYPOSPOAyUOMt 9g8/cU3VsHadgGY9GZVy4J74cg8kqbQ= ARC-Authentication-Results: i=1; imf17.hostedemail.com; dkim=pass header.d=intel.com header.s=Intel header.b=NsIhD5sT; spf=pass (imf17.hostedemail.com: domain of ying.huang@intel.com designates 192.198.163.11 as permitted sender) smtp.mailfrom=ying.huang@intel.com; dmarc=pass (policy=none) header.from=intel.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1711422494; x=1742958494; h=from:to:cc:subject:in-reply-to:references:date: message-id:mime-version:content-transfer-encoding; bh=tshkDg+57nZGbKKTGrPpvXEl1QS5OIzpwMeGxbRpWlQ=; b=NsIhD5sTL2kDJBnHsQedFWw839jh8nstzJ75dTBSrsBNgvs8ST0WzULe av6IfyV1BLoYpIXGPe/yr9rgwZ+mLLHJmpoJ4R69a3Yg6KAnlpQpWGHZK w4fY3CKFaUsd/O8ubjTutdf8kwkDnLY9slO+vVJtc02XtC5C+Ls2pHDA8 /ydYZeSS3IdL0QxKTVnMorvuQSDNP4bkjBEPAZvMa8rWXYhqxOm2jisok VMdmFTsTa36ITY2xwVNB54Ool+unyt0J+6CTVn5Cs6wwmrag9ulF9QRBm 3aNX7ZEWY1qF/S8GKPe5OkqStFaPZYNurj49OKDn6U6SXyiaOBk72Zh3Z Q==; X-IronPort-AV: E=McAfee;i="6600,9927,11024"; a="17089226" X-IronPort-AV: E=Sophos;i="6.07,155,1708416000"; d="scan'208";a="17089226" Received: from orviesa010.jf.intel.com ([10.64.159.150]) by fmvoesa105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 25 Mar 2024 20:08:12 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.07,155,1708416000"; d="scan'208";a="15695641" Received: from yhuang6-desk2.sh.intel.com (HELO yhuang6-desk2.ccr.corp.intel.com) ([10.238.208.55]) by orviesa010-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 25 Mar 2024 20:08:07 -0700 From: "Huang, Ying" To: "Ho-Ren (Jack) Chuang" Cc: Gregory Price , aneesh.kumar@linux.ibm.com, mhocko@suse.com, tj@kernel.org, john@jagalactic.com, Eishan Mirakhur , Vinicius Tavares Petrucci , Ravis OpenSrc , Alistair Popple , Srinivasulu Thanneeru , Dan Williams , Vishal Verma , Dave Jiang , Andrew Morton , nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, "Ho-Ren (Jack) Chuang" , "Ho-Ren (Jack) Chuang" , qemu-devel@nongnu.org, Hao Xiang Subject: Re: [External] Re: [PATCH v4 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info In-Reply-To: (Ho-Ren Chuang's message of "Fri, 22 Mar 2024 11:26:33 -0700") References: <20240322070356.315922-1-horenchuang@bytedance.com> <20240322070356.315922-3-horenchuang@bytedance.com> <87cyrmr773.fsf@yhuang6-desk2.ccr.corp.intel.com> Date: Tue, 26 Mar 2024 11:06:13 +0800 Message-ID: <87ttktofoa.fsf@yhuang6-desk2.ccr.corp.intel.com> User-Agent: Gnus/5.13 (Gnus v5.13) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: BF90F4000E X-Rspam-User: X-Rspamd-Server: rspam11 X-Stat-Signature: e4w58qhdxzcpbeiaofso61r49xcazyyn X-HE-Tag: 1711422493-914152 X-HE-Meta: U2FsdGVkX1+g5GpGr4TQ5FVRGsJAbUsY6Rv/cdbM5oV0fc5g3z9xPq1npyNirRAvL1W4AbqbQ2rXXGI64UpM3/0dcvnNqanFqLFEyWGmlGjwR+XvSm2xVHnoAFdYDGvVvmn7G4aDPYLKuSSyylrsaj8TY+aceHm6d7t6jSp6KbQT7Cq7sxMm8APtRb7leG6sE+og2EhjJMj4X5OqcN15hdIQwGXq5r29lNng2X1MYGZBPhkZXQ1mY3lldmei2qy7zJ04Ay0B30RClOj4mCUZIihPPH0yAIZamEN+K2jzUs/DjXIvoxv1vJ+z7303Fyql88bybPbZgQZJq+TfIZwJfrzR37y1+dJGTbSg80abAYrSKVLt3ex+ZtDLqoPYs47L3shppN6Ri4ZCNA6Ld2IfecowsG+w2ZuPCa/vofvEBJNx3f0qtY9O6qazhz8rmMRwNYTLB79zA/pPmJvjcP58BesYVN2TSXO8JuHu4AaozXplkdPzK88/u6FugOavsnVSAN0fzTTOjM9qMsrjZev2MJe1Jnk5uuiGTNv+FHOKfaUfZi3R2+uZnTZb54UdASlgEIwyNoyLzVWhVgbcfxeAP/roAlmTAecgWyMcq4VD0srTYdVRLEjTPKWQMVkWNJmalHpMPMIypdmtFo5K2/vUa5y/MVyoO1iAcifmkS5LS0wZKZBsPvuuHeownTd/WeLy2I6oA8Dy2rQ6D3jBMam6PLYEEfWtA37KBYsRKiMWckzVezJzWM2t4Ot+Dfaj2LFGipeAWTS70/e9gSp3c/Rnzvp6goFSFx5z9PB3MMGPC+jtiNWS1smR0QffLfGDdA+ZcOJM2vCAcVzNR8DinGayqQRWANGZdTWiTz+a8jn/2JDaVhxBpOtezr8kopxNqyo/kRcATlo1pqJexbABmKXxdV/dufjNW5P6p3/R/T+z5WOKGLitNIDnoGtWEFJYmO9CBCBXr6hkL//8KYenCBu 617nIovo yIZdjMf2PPyGh/iSsOWOxPlpBy7j2xKs3/NU2NuT7t9MDsKjUkqUqAaG2HWVfz2b+2qRCtB05qrKYmsmKSsu/+ZL9y8Ak0xQrg3/H8fFppL4GO5MPHoA3tYzVe3pao98KN2CzXpFizuAfZb1BEHIxeknSGZJkj9KQbSpf47sH4pjon1eLs4gYYTF05b05Ton8/TAhnfjQhJEjljq5QiPJMK2+CVpZaGSpybmGnNhbKGg+3B8QExWB2PzKfdTXDL/1R5gLjjfINdWGhqqFHDYfBC2wnmZFMybMfOnj94cF+e4A2lU= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: "Ho-Ren (Jack) Chuang" writes: > On Fri, Mar 22, 2024 at 1:41=E2=80=AFAM Huang, Ying wrote: >> >> "Ho-Ren (Jack) Chuang" writes: >> >> > The current implementation treats emulated memory devices, such as >> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal m= emory >> > (E820_TYPE_RAM). However, these emulated devices have different >> > characteristics than traditional DRAM, making it important to >> > distinguish them. Thus, we modify the tiered memory initialization pro= cess >> > to introduce a delay specifically for CPUless NUMA nodes. This delay >> > ensures that the memory tier initialization for these nodes is deferred >> > until HMAT information is obtained during the boot process. Finally, >> > demotion tables are recalculated at the end. >> > >> > * late_initcall(memory_tier_late_init); >> > Some device drivers may have initialized memory tiers between >> > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringi= ng >> > online memory nodes and configuring memory tiers. They should be exclu= ded >> > in the late init. >> > >> > * Handle cases where there is no HMAT when creating memory tiers >> > There is a scenario where a CPUless node does not provide HMAT informa= tion. >> > If no HMAT is specified, it falls back to using the default DRAM tier. >> > >> > * Introduce another new lock `default_dram_perf_lock` for adist calcul= ation >> > In the current implementation, iterating through CPUlist nodes requires >> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will en= d up >> > trying to acquire the same lock, leading to a potential deadlock. >> > Therefore, we propose introducing a standalone `default_dram_perf_lock= ` to >> > protect `default_dram_perf_*`. This approach not only avoids deadlock >> > but also prevents holding a large lock simultaneously. >> > >> > * Upgrade `set_node_memory_tier` to support additional cases, including >> > default DRAM, late CPUless, and hot-plugged initializations. >> > To cover hot-plugged memory nodes, `mt_calc_adistance()` and >> > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` = to >> > handle cases where memtype is not initialized and where HMAT informati= on is >> > available. >> > >> > * Introduce `default_memory_types` for those memory types that are not >> > initialized by device drivers. >> > Because late initialized memory and default DRAM memory need to be man= aged, >> > a default memory type is created for storing all memory types that are >> > not initialized by device drivers and as a fallback. >> > >> > Signed-off-by: Ho-Ren (Jack) Chuang >> > Signed-off-by: Hao Xiang >> > --- >> > mm/memory-tiers.c | 73 ++++++++++++++++++++++++++++++++++++++++------- >> > 1 file changed, 63 insertions(+), 10 deletions(-) >> > >> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c >> > index 974af10cfdd8..9396330fa162 100644 >> > --- a/mm/memory-tiers.c >> > +++ b/mm/memory-tiers.c >> > @@ -36,6 +36,11 @@ struct node_memory_type_map { >> > >> > static DEFINE_MUTEX(memory_tier_lock); >> > static LIST_HEAD(memory_tiers); >> > +/* >> > + * The list is used to store all memory types that are not created >> > + * by a device driver. >> > + */ >> > +static LIST_HEAD(default_memory_types); >> > static struct node_memory_type_map node_memory_types[MAX_NUMNODES]; >> > struct memory_dev_type *default_dram_type; >> > >> > @@ -108,6 +113,7 @@ static struct demotion_nodes *node_demotion __read= _mostly; >> > >> > static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms); >> > >> > +static DEFINE_MUTEX(default_dram_perf_lock); >> >> Better to add comments about what is protected by this lock. >> > > Thank you. I will add a comment like this: > + /* The lock is used to protect `default_dram_perf*` info and nid. */ > +static DEFINE_MUTEX(default_dram_perf_lock); > > I also found an error path was not handled and > found the lock could be put closer to what it protects. > I will have them fixed in V5. > >> > static bool default_dram_perf_error; >> > static struct access_coordinate default_dram_perf; >> > static int default_dram_perf_ref_nid =3D NUMA_NO_NODE; >> > @@ -505,7 +511,8 @@ static inline void __init_node_memory_type(int nod= e, struct memory_dev_type *mem >> > static struct memory_tier *set_node_memory_tier(int node) >> > { >> > struct memory_tier *memtier; >> > - struct memory_dev_type *memtype; >> > + struct memory_dev_type *mtype; >> >> mtype may be referenced without initialization now below. >> > > Good catch! Thank you. > > Please check below. > I may found a potential NULL pointer dereference. > >> > + int adist =3D MEMTIER_ADISTANCE_DRAM; >> > pg_data_t *pgdat =3D NODE_DATA(node); >> > >> > >> > @@ -514,11 +521,20 @@ static struct memory_tier *set_node_memory_tier(= int node) >> > if (!node_state(node, N_MEMORY)) >> > return ERR_PTR(-EINVAL); >> > >> > - __init_node_memory_type(node, default_dram_type); >> > + mt_calc_adistance(node, &adist); >> > + if (node_memory_types[node].memtype =3D=3D NULL) { >> > + mtype =3D mt_find_alloc_memory_type(adist, &default_memo= ry_types); >> > + if (IS_ERR(mtype)) { >> > + mtype =3D default_dram_type; >> > + pr_info("Failed to allocate a memory type. Fall = back.\n"); >> > + } >> > + } >> > >> > - memtype =3D node_memory_types[node].memtype; >> > - node_set(node, memtype->nodes); >> > - memtier =3D find_create_memory_tier(memtype); >> > + __init_node_memory_type(node, mtype); >> > + >> > + mtype =3D node_memory_types[node].memtype; >> > + node_set(node, mtype->nodes); > > If the `mtype` could be NULL, would there be a potential > NULL pointer dereference. Do you have a preferred idea > to fix this if needed? Initialize mtype with default_dram_type? >> > + memtier =3D find_create_memory_tier(mtype); >> > if (!IS_ERR(memtier)) >> > rcu_assign_pointer(pgdat->memtier, memtier); >> > return memtier; >> > @@ -655,6 +671,34 @@ void mt_put_memory_types(struct list_head *memory= _types) >> > } >> > EXPORT_SYMBOL_GPL(mt_put_memory_types); >> > [snip] -- Best Regards, Huang, Ying