From: "Huang, Ying"
To: Aneesh Kumar K V
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Wei Xu, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Johannes Weiner,
	jvgediya.oss@gmail.com
Subject: Re: [PATCH v11 4/8] mm/demotion/dax/kmem: Set node's abstract distance to MEMTIER_ADISTANCE_PMEM
References: <20220728190436.858458-1-aneesh.kumar@linux.ibm.com>
	<20220728190436.858458-5-aneesh.kumar@linux.ibm.com>
	<875yjgmocg.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87bkt8s7w9.fsf@linux.ibm.com>
	<87k07slnt7.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87tu6wk0q5.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<826fbdbc-219f-8f4a-7373-41c718287533@linux.ibm.com>
Date: Mon, 01 Aug 2022 14:37:17 +0800
In-Reply-To: <826fbdbc-219f-8f4a-7373-41c718287533@linux.ibm.com> (Aneesh Kumar K. V.'s message of "Mon, 1 Aug 2022 11:08:20 +0530")
Message-ID: <87les8jwpu.fsf@yhuang6-desk2.ccr.corp.intel.com>

Aneesh Kumar K V writes:

> On 8/1/22 10:40 AM, Huang, Ying wrote:
>> Aneesh Kumar K V writes:
>>
>>> On 8/1/22 7:36 AM, Huang, Ying wrote:
>>>> "Aneesh Kumar K.V" writes:
>>>>
>>>>> "Huang, Ying" writes:
>>>>>
>>>>>> "Aneesh Kumar K.V" writes:
>>>>>>
>>>>>>> By default, all nodes are assigned to the default memory tier which
>>>>>>> is the memory tier designated for nodes with DRAM
>>>>>>>
>>>>>>> Set dax kmem device node's tier to slower memory tier by assigning
>>>>>>> abstract distance to MEMTIER_ADISTANCE_PMEM. PMEM tier
>>>>>>> appears below the default memory tier in demotion order.
>>>>>>>
>>>>>>> Signed-off-by: Aneesh Kumar K.V
>>>>>>> ---
>>>>>>>  drivers/dax/kmem.c           |  9 +++++++++
>>>>>>>  include/linux/memory-tiers.h | 19 ++++++++++++++++++-
>>>>>>>  mm/memory-tiers.c            | 28 ++++++++++++++++------------
>>>>>>>  3 files changed, 43 insertions(+), 13 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>>>>>>> index a37622060fff..6b0d5de9a3e9 100644
>>>>>>> --- a/drivers/dax/kmem.c
>>>>>>> +++ b/drivers/dax/kmem.c
>>>>>>> @@ -11,6 +11,7 @@
>>>>>>>  #include
>>>>>>>  #include
>>>>>>>  #include
>>>>>>> +#include
>>>>>>>  #include "dax-private.h"
>>>>>>>  #include "bus.h"
>>>>>>>
>>>>>>> @@ -41,6 +42,12 @@ struct dax_kmem_data {
>>>>>>>  	struct resource *res[];
>>>>>>>  };
>>>>>>>
>>>>>>> +static struct memory_dev_type default_pmem_type = {
>>>>>>
>>>>>> Why is this named as default_pmem_type?  We will not change the memory
>>>>>> type of a node usually.
>>>>>>
>>>>>
>>>>> Any other suggestion? pmem_dev_type?
>>>>
>>>> Or dax_pmem_type?
>>>>
>>>> DAX is used to enumerate the memory device.
>>>>
>>>>>
>>>>>>> +	.adistance = MEMTIER_ADISTANCE_PMEM,
>>>>>>> +	.tier_sibiling = LIST_HEAD_INIT(default_pmem_type.tier_sibiling),
>>>>>>> +	.nodes  = NODE_MASK_NONE,
>>>>>>> +};
>>>>>>> +
>>>>>>>  static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>>>>  {
>>>>>>>  	struct device *dev = &dev_dax->dev;
>>>>>>> @@ -62,6 +69,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>>>>  		return -EINVAL;
>>>>>>>  	}
>>>>>>>
>>>>>>> +	init_node_memory_type(numa_node, &default_pmem_type);
>>>>>>> +
>>>>>>
>>>>>> The memory hot-add below may fail.  So the error handling needs to be
>>>>>> added.
>>>>>>
>>>>>> And, it appears that the memory type and memory tier of a node may be
>>>>>> fully initialized here before NUMA hot-adding started.  So I suggest to
>>>>>> set node_memory_types[] here only.  And set memory_dev_type->nodes in
>>>>>> node hot-add callback.  I think there is the proper place to complete
>>>>>> the initialization.
>>>>>>
>>>>>> And, in theory dax/kmem.c can be unloaded.  So we need to clear
>>>>>> node_memory_types[] for nodes somewhere.
>>>>>>
>>>>>
>>>>> I guess by module exit we can be sure that all the memory managed
>>>>> by dax/kmem is hotplugged out. How about something like below?
>>>>
>>>> Because we set node_memory_types[] in dev_dax_kmem_probe(), it's
>>>> natural to clear it in dev_dax_kmem_remove().
>>>>
>>>
>>> Most of required reset/clear is done as part of memory hotunplug. So
>>> if we did manage to successfully unplug the memory, everything except
>>> node_memory_types[node] should be reset. That makes the clear_node_memory_type
>>> the below.
>>>
>>> void clear_node_memory_type(int node, struct memory_dev_type *memtype)
>>> {
>>>
>>> 	mutex_lock(&memory_tier_lock);
>>> 	/*
>>> 	 * memory unplug did clear the node from the memtype and
>>> 	 * dax/kmem did initialize this node's memory type.
>>> 	 */
>>> 	if (!node_isset(node, memtype->nodes) && node_memory_types[node] == memtype){
>>> 		node_memory_types[node] = NULL;
>>> 	}
>>> 	mutex_unlock(&memory_tier_lock);
>>> }
>>>
>>> With the module unload, it is kind of force removing the usage of the specific memtype.
>>> Considering module unload will remove the usage of specific memtype from other parts
>>> of the kernel and we already do all the required reset in memory hot unplug, do we
>>> need to do the clear_node_memory_type above?
>>
>> Per my understanding, we need to call clear_node_memory_type() in
>> dev_dax_kmem_remove().  After that, we have nothing to do in
>> dax_kmem_exit().
>>
>
> Ok, I guess you are suggesting to do the clear_node_memory_type even if we fail the memory remove.

Can we use node_memory_types[] to indicate whether a node is managed by
a driver?

Regardless of whether the memory removal succeeds or fails,
dev_dax_kmem_remove() will set node_memory_types[] = NULL.  But until
the node is offlined, we will still keep the node in the memory_dev_type
(dax_pmem_type).

And we will prevent dax/kmem from unloading via try_module_get(), which
means adding a "struct module *" to struct memory_dev_type.

Best Regards,
Huang, Ying

> Should we also rebuild demotion order? On a successful memory remove we do rebuild demotion order.
> This is what I ended up with.
>
> modified   drivers/dax/kmem.c
> @@ -171,6 +171,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
>  {
>  	int i, success = 0;
> +	int node = dev_dax->target_node;
>  	struct device *dev = &dev_dax->dev;
>  	struct dax_kmem_data *data = dev_get_drvdata(dev);
>
> @@ -208,6 +209,12 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
>  		kfree(data);
>  		dev_set_drvdata(dev, NULL);
>  	}
> +	/*
> +	 * Clear the memtype association, even if the memory
> +	 * remove failed.
> +	 */
> +	clear_node_memory_type(node, dax_pmem_type);
> +
>  }
>  #else
>  static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
> modified   include/linux/memory-tiers.h
> @@ -31,6 +31,7 @@ struct memory_dev_type {
>  #ifdef CONFIG_NUMA
>  extern bool numa_demotion_enabled;
>  void init_node_memory_type(int node, struct memory_dev_type *default_type);
> +void clear_node_memory_type(int node, struct memory_dev_type *memtype);
>  #ifdef CONFIG_MIGRATION
>  int next_demotion_node(int node);
>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
> @@ -57,6 +58,10 @@ static inline bool node_is_toptier(int node)
>  #define numa_demotion_enabled	false
>  static inline void init_node_memory_type(int node, struct memory_dev_type *default_type)
>  {
> +}
> +
> +static inline void clear_node_memory_type(int node, struct memory_dev_type *memtype)
> +{
>
>  }
>
> modified   mm/memory-tiers.c
> @@ -501,6 +501,36 @@ void init_node_memory_type(int node, struct memory_dev_type *default_type)
>  }
>  EXPORT_SYMBOL_GPL(init_node_memory_type);
>
> +void clear_node_memory_type(int node, struct memory_dev_type *memtype)
> +{
> +	struct memory_tier *memtier;
> +
> +	mutex_lock(&memory_tier_lock);
> +	/*
> +	 * Even if we fail to unplug memory, clear the association of
> +	 * this node to this specific memory type.
> +	 */
> +	if (node_memory_types[node] == memtype) {
> +
> +		memtier = __node_get_memory_tier(node);
> +		if (memtier) {
> +			rcu_assign_pointer(pgdat->memtier, NULL);
> +			synchronize_rcu();
> +		}
> +		node_clear(node, memtype->nodes);
> +		if (nodes_empty(memtype->nodes)) {
> +			list_del(&memtype->tier_sibiling);
> +			memtype->memtier = NULL;
> +			if (current_memtier && list_empty(&current_memtier->memory_types))
> +				destroy_memory_tier(current_memtier);
> +
> +		}
> +		node_memory_types[node] = NULL;
> +	}
> +	mutex_unlock(&memory_tier_lock);
> +}
> +EXPORT_SYMBOL_GPL(clear_node_memory_type);
> +
>  void update_node_adistance(int node, struct memory_dev_type *memtype)
>  {
>  	pg_data_t *pgdat;
>
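
The lifetime rule proposed in the reply above (node_memory_types[] records
only whether a driver manages the node, the node stays in the
memory_dev_type until it is offlined, and the module stays pinned in the
meantime) can be illustrated with a small user-space model.  This is only
a sketch, not kernel code: every model_* name is hypothetical, a plain
bitmask stands in for nodemask_t, a counter stands in for
try_module_get()/module_put(), and taking the pin at node-online time is
an assumption, since the thread does not say where the reference would be
taken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_NODES 8

/* Hypothetical stand-in for struct module: only a pin count. */
struct model_module {
	int pins;
};

/* Simplified memory_dev_type; "owner" models the suggested "struct module *". */
struct model_memtype {
	unsigned long nodes;		/* bitmask standing in for nodemask_t */
	struct model_module *owner;
};

/* A non-NULL entry means the node is currently managed by a driver. */
static struct model_memtype *model_node_types[MODEL_MAX_NODES];

/* Driver probe: claim the node for this memory type. */
static void model_probe(int node, struct model_memtype *type)
{
	model_node_types[node] = type;
}

/* Node online: record the node in the type and pin the owning module. */
static void model_node_online(int node)
{
	struct model_memtype *type = model_node_types[node];

	if (type && !(type->nodes & (1UL << node))) {
		type->nodes |= 1UL << node;
		type->owner->pins++;	/* models try_module_get() */
	}
}

/* Driver remove: always drop the driver's claim, even if offlining failed. */
static void model_remove(int node, struct model_memtype *type)
{
	if (model_node_types[node] == type)
		model_node_types[node] = NULL;
}

/* Node offline: only now does the node leave the type and the pin drop. */
static void model_node_offline(int node, struct model_memtype *type)
{
	if (type->nodes & (1UL << node)) {
		type->nodes &= ~(1UL << node);
		type->owner->pins--;	/* models module_put() */
	}
}

/* The module may be unloaded only when no online node still pins it. */
static bool model_can_unload(const struct model_module *mod)
{
	return mod->pins == 0;
}
```

In this model, calling model_remove() on a node that is still online
clears model_node_types[] for it but leaves the node's bit set in the
type, so model_can_unload() keeps returning false until
model_node_offline() runs.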