Date: Mon, 25 Jul 2022 12:18:39 +0530
Subject: Re: [PATCH v10 4/8] mm/demotion/dax/kmem: Set node's performance level to MEMTIER_PERF_LEVEL_PMEM
To: "Huang, Ying"
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Wei Xu, Yang Shi, Davidlohr Bueso, Tim C Chen, Michal Hocko, Linux Kernel Mailing List, Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams, Johannes Weiner, jvgediya.oss@gmail.com
References: <20220720025920.1373558-1-aneesh.kumar@linux.ibm.com> <20220720025920.1373558-5-aneesh.kumar@linux.ibm.com> <874jz5zoi9.fsf@yhuang6-desk2.ccr.corp.intel.com>
From: Aneesh Kumar K V
In-Reply-To: <874jz5zoi9.fsf@yhuang6-desk2.ccr.corp.intel.com>

On 7/25/22 12:07 PM, Huang, Ying wrote:
> "Aneesh Kumar K.V" writes:
>
>> By default, all nodes are assigned to the default memory tier which
>> is the memory tier designated for nodes with DRAM
>>
>> Set dax kmem device node's tier to slower memory tier by assigning
>> performance level to MEMTIER_PERF_LEVEL_PMEM. PMEM tier
>> appears below the default memory tier in demotion order.
>>
>> Signed-off-by: Aneesh Kumar K.V
>> ---
>>  arch/powerpc/platforms/pseries/papr_scm.c | 41 ++++++++++++++++++++---
>>  drivers/acpi/nfit/core.c                  | 41 ++++++++++++++++++++++-
>>  2 files changed, 76 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
>> index 82cae08976bc..3b6164418d6f 100644
>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>> @@ -14,6 +14,8 @@
>>  #include
>>  #include
>>  #include
>> +#include
>> +#include
>>
>>  #include
>>  #include
>> @@ -98,6 +100,7 @@ struct papr_scm_priv {
>>  	bool hcall_flush_required;
>>
>>  	uint64_t bound_addr;
>> +	int target_node;
>>
>>  	struct nvdimm_bus_descriptor bus_desc;
>>  	struct nvdimm_bus *bus;
>> @@ -1278,6 +1281,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>>  	p->bus_desc.module = THIS_MODULE;
>>  	p->bus_desc.of_node = p->pdev->dev.of_node;
>>  	p->bus_desc.provider_name = kstrdup(p->pdev->name, GFP_KERNEL);
>> +	p->target_node = dev_to_node(&p->pdev->dev);
>>
>>  	/* Set the dimm command family mask to accept PDSMs */
>>  	set_bit(NVDIMM_FAMILY_PAPR, &p->bus_desc.dimm_family_mask);
>> @@ -1322,7 +1326,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>>  	mapping.size = p->blocks * p->block_size; // XXX: potential overflow?
>>
>>  	memset(&ndr_desc, 0, sizeof(ndr_desc));
>> -	target_nid = dev_to_node(&p->pdev->dev);
>> +	target_nid = p->target_node;
>>  	online_nid = numa_map_to_online_node(target_nid);
>>  	ndr_desc.numa_node = online_nid;
>>  	ndr_desc.target_node = target_nid;
>> @@ -1582,15 +1586,42 @@ static struct platform_driver papr_scm_driver = {
>>  	},
>>  };
>>
>> +static int papr_scm_callback(struct notifier_block *self,
>> +			     unsigned long action, void *arg)
>> +{
>> +	struct memory_notify *mnb = arg;
>> +	int nid = mnb->status_change_nid;
>> +	struct papr_scm_priv *p;
>> +
>> +	if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
>> +		return NOTIFY_OK;
>> +
>> +	mutex_lock(&papr_ndr_lock);
>> +	list_for_each_entry(p, &papr_nd_regions, region_list) {
>> +		if (p->target_node == nid) {
>> +			node_devices[nid]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
>> +			break;
>> +		}
>> +	}
>> +
>> +	mutex_unlock(&papr_ndr_lock);
>> +	return NOTIFY_OK;
>> +}
>> +
>>  static int __init papr_scm_init(void)
>>  {
>>  	int ret;
>>
>>  	ret = platform_driver_register(&papr_scm_driver);
>> -	if (!ret)
>> -		mce_register_notifier(&mce_ue_nb);
>> -
>> -	return ret;
>> +	if (ret)
>> +		return ret;
>> +	mce_register_notifier(&mce_ue_nb);
>> +	/*
>> +	 * register a memory hotplug notifier at prio 2 so that we
>> +	 * can update the perf level for the node.
>> +	 */
>> +	hotplug_memory_notifier(papr_scm_callback, MEMTIER_HOTPLUG_PRIO + 1);
>> +	return 0;
>>  }
>>  module_init(papr_scm_init);
>>
>> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
>> index ae5f4acf2675..7ea1017ef790 100644
>> --- a/drivers/acpi/nfit/core.c
>> +++ b/drivers/acpi/nfit/core.c
>> @@ -15,6 +15,8 @@
>>  #include
>>  #include
>>  #include
>> +#include
>> +#include
>>  #include
>>  #include
>>  #include "intel.h"
>> @@ -3470,6 +3472,39 @@ static struct acpi_driver acpi_nfit_driver = {
>>  	},
>>  };
>>
>> +static int nfit_callback(struct notifier_block *self,
>> +			 unsigned long action, void *arg)
>> +{
>> +	bool found = false;
>> +	struct memory_notify *mnb = arg;
>> +	int nid = mnb->status_change_nid;
>> +	struct nfit_spa *nfit_spa;
>> +	struct acpi_nfit_desc *acpi_desc;
>> +
>> +	if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
>> +		return NOTIFY_OK;
>> +
>> +	mutex_lock(&acpi_desc_lock);
>> +	list_for_each_entry(acpi_desc, &acpi_descs, list) {
>> +		mutex_lock(&acpi_desc->init_mutex);
>> +		list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
>> +			struct acpi_nfit_system_address *spa = nfit_spa->spa;
>> +			int target_node = pxm_to_node(spa->proximity_domain);
>> +
>> +			if (target_node == nid) {
>> +				node_devices[nid]->perf_level = MEMTIER_PERF_LEVEL_PMEM;
>> +				found = true;
>> +				break;
>> +			}
>> +		}
>> +		mutex_unlock(&acpi_desc->init_mutex);
>> +		if (found)
>> +			break;
>> +	}
>> +	mutex_unlock(&acpi_desc_lock);
>> +	return NOTIFY_OK;
>> +}
>> +
>>  static __init int nfit_init(void)
>>  {
>>  	int ret;
>> @@ -3509,7 +3544,11 @@ static __init int nfit_init(void)
>>  		nfit_mce_unregister();
>>  		destroy_workqueue(nfit_wq);
>>  	}
>> -
>> +	/*
>> +	 * register a memory hotplug notifier at prio 2 so that we
>> +	 * can update the perf level for the node.
>> +	 */
>> +	hotplug_memory_notifier(nfit_callback, MEMTIER_HOTPLUG_PRIO + 1);
>>  	return ret;
>>
>>  }
>
> I don't think that it's a good idea to set perf_level of a memory device
> (node) via NFIT only.
>
> For example, we may prefer HMAT over NFIT when it's available. So the
> perf_level should be set in dax/kmem.c based on information provided by
> ACPI or other information sources. ACPI can provide some functions/data
> structures to let drivers (like dax/kmem.c) to query the properties of
> the memory device (node).
>

I was trying to make it architecture-specific so that we have a placeholder
to fine-tune this better. For example, ppc64 will look at device tree
details to find the performance level and x86 will look at ACPI data
structures. Wouldn't adding that hotplug callback in dax/kmem prevent that
architecture-specific customization?

I would expect that callback to move to the generic ACPI layer so that even
firmware-managed CXL devices can be added to a lower tier. I don't
understand ACPI well enough to find the right abstraction for that hotplug
callback.

> As the simplest first version, this can be just hard coded.
>

If you are suggesting not to use a hotplug callback, one of the challenges
was that node_devices[nid] gets allocated quite late, when we try to online
the node.

> Best Regards,
> Huang, Ying