References: <20240405000707.2670063-1-horenchuang@bytedance.com> <20240405000707.2670063-3-horenchuang@bytedance.com> <20240405150244.00004b49@Huawei.com>
In-Reply-To: <20240405150244.00004b49@Huawei.com>
From: "Ho-Ren (Jack) Chuang"
Date: Fri, 5 Apr 2024 15:43:47 -0700
Subject: Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
To: Jonathan Cameron
Cc: "Huang, Ying", Gregory Price, aneesh.kumar@linux.ibm.com, mhocko@suse.com, tj@kernel.org, john@jagalactic.com, Eishan Mirakhur, Vinicius Tavares Petrucci, Ravis OpenSrc, Alistair Popple, Srinivasulu Thanneeru, SeongJae Park, Dan Williams, Vishal Verma, Dave Jiang, Andrew Morton, nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org, Linux Memory Management List, "Ho-Ren (Jack) Chuang", qemu-devel@nongnu.org, Hao Xiang

On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron wrote:
>
> On Fri, 5 Apr 2024 00:07:06 +0000
> "Ho-Ren (Jack) Chuang" wrote:
>
> > The current implementation treats emulated memory devices, such as
> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
> > (E820_TYPE_RAM). However, these emulated devices have different
> > characteristics than traditional DRAM, making it important to
> > distinguish them. Thus, we modify the tiered memory initialization process
> > to introduce a delay specifically for CPUless NUMA nodes. This delay
> > ensures that the memory tier initialization for these nodes is deferred
> > until HMAT information is obtained during the boot process. Finally,
> > demotion tables are recalculated at the end.
> >
> > * late_initcall(memory_tier_late_init);
> > Some device drivers may have initialized memory tiers between
> > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
> > online memory nodes and configuring memory tiers. They should be excluded
> > in the late init.
> >
> > * Handle cases where there is no HMAT when creating memory tiers
> > There is a scenario where a CPUless node does not provide HMAT information.
> > If no HMAT is specified, it falls back to using the default DRAM tier.
> >
> > * Introduce another new lock `default_dram_perf_lock` for adist calculation
> > In the current implementation, iterating through CPUlist nodes requires
> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
> > trying to acquire the same lock, leading to a potential deadlock.
> > Therefore, we propose introducing a standalone `default_dram_perf_lock` to
> > protect `default_dram_perf_*`. This approach not only avoids deadlock
> > but also prevents holding a large lock simultaneously.
> >
> > * Upgrade `set_node_memory_tier` to support additional cases, including
> >   default DRAM, late CPUless, and hot-plugged initializations.
> > To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
> > handle cases where memtype is not initialized and where HMAT information is
> > available.
> >
> > * Introduce `default_memory_types` for those memory types that are not
> >   initialized by device drivers.
> > Because late initialized memory and default DRAM memory need to be managed,
> > a default memory type is created for storing all memory types that are
> > not initialized by device drivers and as a fallback.
> >
> > Signed-off-by: Ho-Ren (Jack) Chuang
> > Signed-off-by: Hao Xiang
> > Reviewed-by: "Huang, Ying"
>
> Hi - one remaining question. Why can't we delay init for all nodes
> to either drivers or your fallback late_initcall code.
> It would be nice to reduce possible code paths.

I tried not to change too much of the existing code structure in
this patchset.

To me, postponing/moving all memory tier registrations to
late_initcall() is another possible action item for the next patchset.

After tier_mem(), hmat_init() is called, which requires registering
`default_dram_type` info. This is when `default_dram_type` is needed.
However, it is indeed possible to postpone the latter part,
set_node_memory_tier(), to `late_initcall()`. So, memory_tier_init() can
indeed be split into two parts, and the latter part can be moved to
late_initcall() to be processed together.

With that change, all memory-type drivers would have to call
late_initcall() to register a memory tier. I'm not sure how many there are.

What do you guys think?

>
> Jonathan
>
> > ---
> >  mm/memory-tiers.c | 94 +++++++++++++++++++++++++++++++++++------------
> >  1 file changed, 70 insertions(+), 24 deletions(-)
> >
> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> > index 516b144fd45a..6632102bd5c9 100644
> > --- a/mm/memory-tiers.c
> > +++ b/mm/memory-tiers.c
> >
>
> > @@ -855,7 +892,8 @@ static int __init memory_tier_init(void)
> >          * For now we can have 4 faster memory tiers with smaller adistance
> >          * than default DRAM tier.
> >          */
> > -       default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
> > +       default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
> > +                                                     &default_memory_types);
> >         if (IS_ERR(default_dram_type))
> >                 panic("%s() failed to allocate default DRAM tier\n", __func__);
> >
> > @@ -865,6 +903,14 @@ static int __init memory_tier_init(void)
> >          * types assigned.
> >          */
> >         for_each_node_state(node, N_MEMORY) {
> > +               if (!node_state(node, N_CPU))
> > +                       /*
> > +                        * Defer memory tier initialization on
> > +                        * CPUless numa nodes. These will be initialized
> > +                        * after firmware and devices are initialized.
>
> Could the comment also say why we can't defer them all?
>
> (In an odd coincidence we have a similar issue for some CPU hotplug
> related bring up where review feedback was move all cases later).
>
> > +                        */
> > +                       continue;
> > +
> >                 memtier = set_node_memory_tier(node);
> >                 if (IS_ERR(memtier))
> >                         /*
>

-- 
Best regards,
Ho-Ren (Jack) Chuang
莊賀任
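
The two-phase flow discussed in this thread (an early pass for nodes with CPUs, then a late pass for CPUless nodes that skips anything a driver already set up) can be modeled in plain userspace C. This is a minimal sketch under stated assumptions, not the kernel code: `struct node_model`, `early_init()`, `driver_init()`, and `late_init()` are hypothetical names, and the boolean fields merely stand in for the N_MEMORY and N_CPU node states.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 4

/* Hypothetical per-node model; not the kernel's data structures. */
struct node_model {
	bool has_memory;	/* stands in for N_MEMORY */
	bool has_cpu;		/* stands in for N_CPU */
	bool tier_set;		/* true once a memory tier has been assigned */
};

static struct node_model nodes[MAX_NODES] = {
	{ true,  true,  false },	/* node 0: DRAM with CPUs */
	{ true,  false, false },	/* node 1: CPUless, set up by a driver */
	{ true,  false, false },	/* node 2: CPUless, no driver */
	{ false, false, false },	/* node 3: no memory at all */
};

/* Early pass: assign tiers only to memory nodes that have CPUs. */
static int early_init(void)
{
	int count = 0;

	for (int i = 0; i < MAX_NODES; i++) {
		if (!nodes[i].has_memory || !nodes[i].has_cpu)
			continue;	/* CPUless nodes are deferred */
		nodes[i].tier_set = true;
		count++;
	}
	return count;
}

/* Models a driver assigning a tier between the two passes. */
static void driver_init(int node)
{
	assert(node >= 0 && node < MAX_NODES);
	nodes[node].tier_set = true;
}

/* Late pass: pick up memory nodes that nobody has handled yet. */
static int late_init(void)
{
	int count = 0;

	for (int i = 0; i < MAX_NODES; i++) {
		if (!nodes[i].has_memory || nodes[i].tier_set)
			continue;	/* no memory, or already set up */
		nodes[i].tier_set = true;
		count++;
	}
	return count;
}
```

Calling `early_init()`, then `driver_init(1)`, then `late_init()` leaves only node 2 for the late pass and never touches node 3, which has no memory.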