From: "Ho-Ren (Jack) Chuang"
Date: Tue, 9 Apr 2024 12:02:31 -0700
Subject: Re: [External] Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
To: Jonathan Cameron
Cc: "Huang, Ying", Gregory Price, aneesh.kumar@linux.ibm.com, mhocko@suse.com, tj@kernel.org, john@jagalactic.com, Eishan Mirakhur, Vinicius Tavares Petrucci, Ravis OpenSrc, Alistair Popple, Srinivasulu Thanneeru, SeongJae Park, Dan Williams, Vishal Verma, Dave Jiang, Andrew Morton, nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org, Linux Memory Management List, "Ho-Ren (Jack) Chuang", qemu-devel@nongnu.org, Hao Xiang
References: <20240405000707.2670063-1-horenchuang@bytedance.com> <20240405000707.2670063-3-horenchuang@bytedance.com> <20240405150244.00004b49@Huawei.com> <20240409171204.00001710@Huawei.com>
In-Reply-To: <20240409171204.00001710@Huawei.com>

Hi Jonathan,

On Tue, Apr 9, 2024 at 9:12 AM Jonathan Cameron wrote:
>
> On Fri, 5 Apr 2024 15:43:47 -0700
> "Ho-Ren (Jack) Chuang" wrote:
>
> > On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron
> > wrote:
> > >
> > > On Fri, 5 Apr 2024 00:07:06 +0000
> > > "Ho-Ren (Jack) Chuang" wrote:
> > >
> > > > The current implementation treats emulated memory devices, such as
> > > > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
> > > > (E820_TYPE_RAM). However, these emulated devices have different
> > > > characteristics than traditional DRAM, making it important to
> > > > distinguish them. Thus, we modify the tiered memory initialization process
> > > > to introduce a delay specifically for CPUless NUMA nodes. This delay
> > > > ensures that the memory tier initialization for these nodes is deferred
> > > > until HMAT information is obtained during the boot process. Finally,
> > > > demotion tables are recalculated at the end.
> > > >
> > > > * late_initcall(memory_tier_late_init);
> > > > Some device drivers may have initialized memory tiers between
> > > > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
> > > > online memory nodes and configuring memory tiers. They should be excluded
> > > > in the late init.
> > > >
> > > > * Handle cases where there is no HMAT when creating memory tiers
> > > > There is a scenario where a CPUless node does not provide HMAT information.
> > > > If no HMAT is specified, it falls back to using the default DRAM tier.
> > > >
> > > > * Introduce another new lock `default_dram_perf_lock` for adist calculation
> > > > In the current implementation, iterating through CPUlist nodes requires
> > > > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
> > > > trying to acquire the same lock, leading to a potential deadlock.
> > > > Therefore, we propose introducing a standalone `default_dram_perf_lock` to
> > > > protect `default_dram_perf_*`. This approach not only avoids deadlock
> > > > but also prevents holding a large lock simultaneously.
> > > >
> > > > * Upgrade `set_node_memory_tier` to support additional cases, including
> > > > default DRAM, late CPUless, and hot-plugged initializations.
> > > > To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> > > > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
> > > > handle cases where memtype is not initialized and where HMAT information is
> > > > available.
> > > >
> > > > * Introduce `default_memory_types` for those memory types that are not
> > > > initialized by device drivers.
> > > > Because late initialized memory and default DRAM memory need to be managed,
> > > > a default memory type is created for storing all memory types that are
> > > > not initialized by device drivers and as a fallback.
> > > >
> > > > Signed-off-by: Ho-Ren (Jack) Chuang
> > > > Signed-off-by: Hao Xiang
> > > > Reviewed-by: "Huang, Ying"
> > >
> > > Hi - one remaining question. Why can't we delay init for all nodes
> > > to either drivers or your fallback late_initcall code.
> > > It would be nice to reduce possible code paths.
> >
> > I try not to change too much of the existing code structure in
> > this patchset.
> >
> > To me, postponing/moving all memory tier registrations to
> > late_initcall() is another possible action item for the next patchset.
> >
> > After tier_mem(), hmat_init() is called, which requires registering
> > `default_dram_type` info. This is when `default_dram_type` is needed.
> > However, it is indeed possible to postpone the latter part,
> > set_node_memory_tier(), to `late_init(). So, memory_tier_init() can
> > indeed be split into two parts, and the latter part can be moved to
> > late_initcall() to be processed together.
> >
> > Doing this all memory-type drivers have to call late_initcall() to
> > register a memory tier. I'm not sure how many they are?
> >
> > What do you guys think?
>
> Gut feeling - if you are going to move it for some cases, move it for
> all of them.
> Then we only have to test once ;)
>
> J

Thank you for the reminder! I agree. Because of the amount of change
involved, I'm planning to move all of them in the next patchset; this
patchset already contains too many things.

> >
> > >
> > > Jonathan
> > >
> > >
> > > > ---
> > > > mm/memory-tiers.c | 94 +++++++++++++++++++++++++++++++++++------------
> > > > 1 file changed, 70 insertions(+), 24 deletions(-)
> > > >
> > > > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> > > > index 516b144fd45a..6632102bd5c9 100644
> > > > --- a/mm/memory-tiers.c
> > > > +++ b/mm/memory-tiers.c
> > >
> > >
> > >
> > > > @@ -855,7 +892,8 @@ static int __init memory_tier_init(void)
> > > >        * For now we can have 4 faster memory tiers with smaller adistance
> > > >        * than default DRAM tier.
> > > >        */
> > > > -     default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
> > > > +     default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
> > > > +                                                   &default_memory_types);
> > > >       if (IS_ERR(default_dram_type))
> > > >               panic("%s() failed to allocate default DRAM tier\n", __func__);
> > > >
> > > > @@ -865,6 +903,14 @@ static int __init memory_tier_init(void)
> > > >        * types assigned.
> > > >        */
> > > >       for_each_node_state(node, N_MEMORY) {
> > > > +             if (!node_state(node, N_CPU))
> > > > +                     /*
> > > > +                      * Defer memory tier initialization on
> > > > +                      * CPUless numa nodes. These will be initialized
> > > > +                      * after firmware and devices are initialized.
> > >
> > > Could the comment also say why we can't defer them all?
> > >
> > > (In an odd coincidence we have a similar issue for some CPU hotplug
> > > related bring up where review feedback was move all cases later).
> > >
> > > > +                      */
> > > > +                     continue;
> > > > +
> > > >               memtier = set_node_memory_tier(node);
> > > >               if (IS_ERR(memtier))
> > > >                       /*
> > > >
> > >
>

-- 
Best regards,
Ho-Ren (Jack) Chuang
莊賀任
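
Illustrative sketch (not part of the patch): the two-phase initialization
discussed in this thread can be modelled in plain userspace C. The early
pass skips CPUless nodes, and the late pass picks up whatever is still
unassigned, falling back to the default DRAM tier when no HMAT data is
available. Everything here is an assumption made for illustration:
struct node_info, tier_init_early(), tier_init_late(), set_node_tier()
and the TIER_* values are hypothetical stand-ins, not the code in
mm/memory-tiers.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tier ids, for illustration only. */
#define TIER_UNSET        (-1)
#define TIER_DEFAULT_DRAM 0
#define TIER_FROM_HMAT    1

/* Hypothetical stand-in for per-node state. */
struct node_info {
	bool has_memory; /* models node_state(node, N_MEMORY) */
	bool has_cpu;    /* models node_state(node, N_CPU) */
	bool has_hmat;   /* firmware supplied performance data */
	int tier;        /* TIER_UNSET until some pass assigns it */
};

/* Pick a tier for one node; without HMAT data, fall back to default DRAM. */
static void set_node_tier(struct node_info *n)
{
	assert(n != NULL);
	n->tier = n->has_hmat ? TIER_FROM_HMAT : TIER_DEFAULT_DRAM;
}

/* Early pass: only nodes that have CPUs; CPUless nodes are deferred. */
static void tier_init_early(struct node_info *nodes, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		if (!nodes[i].has_memory)
			continue;
		if (!nodes[i].has_cpu)
			continue; /* deferred until firmware info is known */
		nodes[i].tier = TIER_DEFAULT_DRAM;
	}
}

/* Late pass: handle nodes not yet assigned by the early pass or a driver. */
static void tier_init_late(struct node_info *nodes, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		if (!nodes[i].has_memory)
			continue;
		if (nodes[i].tier != TIER_UNSET)
			continue; /* already initialized elsewhere; exclude */
		set_node_tier(&nodes[i]);
	}
}
```

Running tier_init_early() and then tier_init_late() over a table with one
CPU node and two CPUless nodes leaves the CPU node on the default DRAM
tier, puts the CPUless node that has HMAT data on the HMAT-derived tier,
and falls back to default DRAM for the CPUless node without HMAT data.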