From: "Huang, Ying"
To: Gregory Price
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, dave.jiang@intel.com,
    Jonathan.Cameron@huawei.com, horenchuang@bytedance.com,
    linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
    dan.j.williams@intel.com, lenb@kernel.org, Aneesh Kumar K.V
Subject: Re: [PATCH] acpi/hmat,mm/memtier: always register hmat adist calculation callback
In-Reply-To: (Gregory Price's message of "Mon, 29 Jul 2024 10:22:52 -0400")
References: <20240726215548.10653-1-gourry@gourry.net> <87ttg91046.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Tue, 30 Jul 2024 09:12:55 +0800
Message-ID: <877cd3u1go.fsf@yhuang6-desk2.ccr.corp.intel.com>

Gregory Price writes:

> On Mon, Jul 29, 2024 at 09:02:33AM +0800, Huang, Ying wrote:
>> Gregory Price writes:
>>
>> > In the event that hmat data is not available for the DRAM tier,
>> > or if it is invalid (bandwidth or latency is 0), we can still register
>> > a callback to calculate the abstract distance for non-cpu nodes
>> > and simply assign it a different tier manually.
>> >
>> > In the case where DRAM HMAT values are missing or not sane we
>> > manually assign adist=(MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE).
>> >
>> > If the HMAT data for the non-cpu tier is invalid (e.g. bw = 0), we
>> > cannot reasonable determine where to place the tier, so it will default
>> > to MEMTIER_ADISTANCE_DRAM (which is the existing behavior).
>>
>> Why do we need this?  Do you have machines with broken HMAT table?  Can
>> you ask the vendor to fix the HMAT table?
>>
>
> It's a little unclear from the ACPI specification whether HMAT is
> technically optional or not (given that the kernel handles missing HMAT
> gracefully, it certainly seems optional). In one scenario I have seen
> incorrect data, and in another scenario I have seen the HMAT omitted
> entirely. In another scenario I have seen the HMAT-SLLBI omitted while
> the CDAT is present.

IIUC, HMAT is optional.  Is it possible for you to ask the system vendor
to fix the broken HMAT table?

> In all scenarios the result is the same: all nodes in the same tier.

I don't think so.  In drivers/dax/kmem.c, we put memory devices onlined
by kmem.c in another tier by default.

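For reference, the fallback rule in the quoted patch description can be
sketched as a small userspace C model.  This is only an illustration, not
the patch itself: struct perf_sketch, perf_is_sane() and fallback_adist()
are hypothetical names, the constant values are assumptions copied here
for the sake of a self-contained example (the real definitions live in
include/linux/memory-tiers.h and may differ between kernel versions), and
the order in which the two checks are applied is also an assumption, since
the description does not say which one wins when both data sets are bad.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values, for illustration only. */
#define MEMTIER_CHUNK_BITS      10
#define MEMTIER_CHUNK_SIZE      (1 << MEMTIER_CHUNK_BITS)
#define MEMTIER_ADISTANCE_DRAM  ((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))

/* Hypothetical stand-in for the HMAT performance data of one node. */
struct perf_sketch {
	unsigned int latency;    /* 0 means missing or invalid */
	unsigned int bandwidth;  /* 0 means missing or invalid */
};

/* Data is treated as sane only if both latency and bandwidth are non-zero. */
static bool perf_is_sane(const struct perf_sketch *p)
{
	return p->latency != 0 && p->bandwidth != 0;
}

/*
 * Hypothetical helper modelling the described fallback for a non-CPU node.
 * Returns the fallback abstract distance, or -1 when both data sets are
 * sane and the regular HMAT-based calculation would be used instead.
 */
static int fallback_adist(const struct perf_sketch *dram,
			  const struct perf_sketch *node)
{
	assert(dram && node);

	/* Node data unusable: stay at the DRAM distance (existing behavior). */
	if (!perf_is_sane(node))
		return MEMTIER_ADISTANCE_DRAM;

	/* DRAM data unusable: push the node one chunk past DRAM. */
	if (!perf_is_sane(dram))
		return MEMTIER_ADISTANCE_DRAM + MEMTIER_CHUNK_SIZE;

	return -1;
}
```

With these assumed constants, a node whose own data is sane but whose
DRAM reference is missing ends up exactly one chunk above
MEMTIER_ADISTANCE_DRAM, while a node with invalid data stays at
MEMTIER_ADISTANCE_DRAM.
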
> The HMAT is explicitly described as "A hint" in the ACPI spec.
>
> ACPI 5.2.28.1 HMAT Overview
>
> "The software is expected to use this information as a hint for
> optimization, or when the system has heterogeneous memory"
>
> If something is "a hint", then it should not be used prescriptively.
>
> Right now HMAT appears to be used prescriptively, this despite the fact
> that there was a clear intent to separate CPU-nodes and non-CPU-nodes in
> the memory-tier code. So this patch simply realizes this intent when the
> hints are not very reasonable.

If HMAT isn't available, it's hard to place memory devices in the
appropriate memory tiers without other information.  In commit
992bf77591cb ("mm/demotion: add support for explicit memory tiers"),
Aneesh pointed out that putting non-CPU-nodes in a lower tier doesn't
work for his system.

Even if we want to use other information to place memory devices in
memory tiers, we can register another adist calculation callback instead
of reusing the HMAT callback.

--
Best Regards,
Huang, Ying