From: "Huang, Ying"
To: "Ho-Ren (Jack) Chuang"
Cc: Jonathan Cameron, Gregory Price, aneesh.kumar@linux.ibm.com,
 mhocko@suse.com, tj@kernel.org, john@jagalactic.com, Eishan Mirakhur,
 Vinicius Tavares Petrucci, Ravis OpenSrc, Alistair Popple,
 Srinivasulu Thanneeru, SeongJae Park, Dan Williams, Vishal Verma,
 Dave Jiang, Andrew Morton, nvdimm@lists.linux.dev,
 linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
 Linux Memory Management List, "Ho-Ren (Jack) Chuang",
 qemu-devel@nongnu.org, Hao Xiang
Subject: Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
In-Reply-To: (Ho-Ren Chuang's message of "Fri, 5 Apr 2024 15:43:47 -0700")
References: <20240405000707.2670063-1-horenchuang@bytedance.com>
 <20240405000707.2670063-3-horenchuang@bytedance.com>
 <20240405150244.00004b49@Huawei.com>
Date: Wed, 10 Apr 2024 10:30:56 +0800
Message-ID: <87ttka54pr.fsf@yhuang6-desk2.ccr.corp.intel.com>

"Ho-Ren (Jack) Chuang" writes:

> On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron wrote:
>>
>> On Fri, 5 Apr 2024 00:07:06 +0000
>> "Ho-Ren (Jack) Chuang" wrote:
>>
>> > The current implementation treats emulated memory devices, such as
>> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
>> > (E820_TYPE_RAM).
>> > However, these emulated devices have different
>> > characteristics than traditional DRAM, making it important to
>> > distinguish them. Thus, we modify the tiered memory initialization process
>> > to introduce a delay specifically for CPUless NUMA nodes. This delay
>> > ensures that the memory tier initialization for these nodes is deferred
>> > until HMAT information is obtained during the boot process. Finally,
>> > demotion tables are recalculated at the end.
>> >
>> > * late_initcall(memory_tier_late_init);
>> > Some device drivers may have initialized memory tiers between
>> > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
>> > online memory nodes and configuring memory tiers. They should be excluded
>> > in the late init.
>> >
>> > * Handle cases where there is no HMAT when creating memory tiers
>> > There is a scenario where a CPUless node does not provide HMAT information.
>> > If no HMAT is specified, it falls back to using the default DRAM tier.
>> >
>> > * Introduce another new lock `default_dram_perf_lock` for adist calculation
>> > In the current implementation, iterating through CPUlist nodes requires
>> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
>> > trying to acquire the same lock, leading to a potential deadlock.
>> > Therefore, we propose introducing a standalone `default_dram_perf_lock` to
>> > protect `default_dram_perf_*`. This approach not only avoids deadlock
>> > but also prevents holding a large lock simultaneously.
>> >
>> > * Upgrade `set_node_memory_tier` to support additional cases, including
>> > default DRAM, late CPUless, and hot-plugged initializations.
>> > To cover hot-plugged memory nodes, `mt_calc_adistance()` and
>> > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
>> > handle cases where memtype is not initialized and where HMAT information is
>> > available.
>> >
>> > * Introduce `default_memory_types` for those memory types that are not
>> > initialized by device drivers.
>> > Because late initialized memory and default DRAM memory need to be managed,
>> > a default memory type is created for storing all memory types that are
>> > not initialized by device drivers and as a fallback.
>> >
>> > Signed-off-by: Ho-Ren (Jack) Chuang
>> > Signed-off-by: Hao Xiang
>> > Reviewed-by: "Huang, Ying"
>>
>> Hi - one remaining question. Why can't we delay init for all nodes
>> to either drivers or your fallback late_initcall code.
>> It would be nice to reduce possible code paths.
>
> I try not to change too much of the existing code structure in
> this patchset.
>
> To me, postponing/moving all memory tier registrations to
> late_initcall() is another possible action item for the next patchset.
>
> After tier_mem(), hmat_init() is called, which requires registering
> `default_dram_type` info. This is when `default_dram_type` is needed.
> However, it is indeed possible to postpone the latter part,
> set_node_memory_tier(), to `late_init(). So, memory_tier_init() can
> indeed be split into two parts, and the latter part can be moved to
> late_initcall() to be processed together.

I don't think it's good to move all memory_tier initialization in
drivers to late_initcall(). It's natural to keep it at the
device_initcall() level.

If so, we can allocate default_dram_type in memory_tier_init(), and call
set_node_memory_tier() only in memory_tier_lateinit(). We can call
memory_tier_lateinit() at the device_initcall() level too.

--
Best Regards,
Huang, Ying

> Doing this all memory-type drivers have to call late_initcall() to
> register a memory tier. I’m not sure how many they are?
>
> What do you guys think?
>
>>
>> Jonathan
>>
>>
>> > ---
>> > mm/memory-tiers.c | 94 +++++++++++++++++++++++++++++++++++------------
>> > 1 file changed, 70 insertions(+), 24 deletions(-)
>> >
>> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> > index 516b144fd45a..6632102bd5c9 100644
>> > --- a/mm/memory-tiers.c
>> > +++ b/mm/memory-tiers.c
>>
>>
>>
>> > @@ -855,7 +892,8 @@ static int __init memory_tier_init(void)
>> >        * For now we can have 4 faster memory tiers with smaller adistance
>> >        * than default DRAM tier.
>> >        */
>> > -     default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
>> > +     default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
>> > +                                                   &default_memory_types);
>> >       if (IS_ERR(default_dram_type))
>> >               panic("%s() failed to allocate default DRAM tier\n", __func__);
>> >
>> > @@ -865,6 +903,14 @@ static int __init memory_tier_init(void)
>> >        * types assigned.
>> >        */
>> >       for_each_node_state(node, N_MEMORY) {
>> > +             if (!node_state(node, N_CPU))
>> > +                     /*
>> > +                      * Defer memory tier initialization on
>> > +                      * CPUless numa nodes. These will be initialized
>> > +                      * after firmware and devices are initialized.
>>
>> Could the comment also say why we can't defer them all?
>>
>> (In an odd coincidence we have a similar issue for some CPU hotplug
>> related bring up where review feedback was move all cases later).
>>
>> > +                      */
>> > +                     continue;
>> > +
>> >               memtier = set_node_memory_tier(node);
>> >               if (IS_ERR(memtier))
>> >                       /*
>>
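
For illustration only, and not part of the patch: the two-phase ordering
described in the quoted commit message can be modelled in plain userspace C
roughly as below. The early pass gives nodes that have CPUs the default DRAM
tier and skips CPUless nodes; the late pass leaves alone any node that already
has a tier (for example, one set up by a driver between the two passes), uses
an HMAT-derived abstract distance when one exists, and otherwise falls back to
the default DRAM tier. All names here (struct node_model, tier_init_early(),
tier_init_late(), ADIST_DRAM, ADIST_UNSET) are hypothetical stand-ins rather
than kernel symbols, and the numeric values are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ADIST_UNSET (-1)  /* hypothetical marker: no value / no tier assigned yet */
#define ADIST_DRAM  576   /* illustrative stand-in for the default DRAM adistance */

/* Hypothetical model of one NUMA node; not a kernel structure. */
struct node_model {
	bool has_memory;  /* models the N_MEMORY node state */
	bool has_cpu;     /* models the N_CPU node state */
	int hmat_adist;   /* adistance derived from HMAT, or ADIST_UNSET */
	int tier_adist;   /* adistance of the assigned tier, or ADIST_UNSET */
};

/* Early pass: assign the default DRAM tier only to nodes that have CPUs. */
int tier_init_early(struct node_model *nodes, size_t n)
{
	int assigned = 0;

	assert(nodes != NULL);
	for (size_t i = 0; i < n; i++) {
		if (!nodes[i].has_memory)
			continue;
		if (!nodes[i].has_cpu)
			continue;  /* CPUless node: deferred to the late pass */
		nodes[i].tier_adist = ADIST_DRAM;
		assigned++;
	}
	return assigned;
}

/* Late pass: handle the deferred nodes, skipping any that already have a tier. */
int tier_init_late(struct node_model *nodes, size_t n)
{
	int assigned = 0;

	assert(nodes != NULL);
	for (size_t i = 0; i < n; i++) {
		if (!nodes[i].has_memory)
			continue;
		if (nodes[i].tier_adist != ADIST_UNSET)
			continue;  /* already set up earlier, e.g. by a driver */
		/* Use the HMAT-derived value if present, else fall back to DRAM. */
		nodes[i].tier_adist = (nodes[i].hmat_adist != ADIST_UNSET)
				      ? nodes[i].hmat_adist
				      : ADIST_DRAM;
		assigned++;
	}
	return assigned;
}
```

Calling tier_init_early() and then tier_init_late() on the same array mimics
the boot ordering; the model says nothing about which initcall level the
second pass should run at, which is the open question in this thread.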