Date: Fri, 27 May 2022 15:15:31 +0100
From: Jonathan Cameron
To: Aneesh Kumar K.V
CC: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko, Linux Kernel Mailing List, Hesham Almatary, Dave Hansen, Alistair Popple, Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes
Subject: Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs
Message-ID: <20220527151531.00002a0c@Huawei.com>
In-Reply-To: <20220527122528.129445-3-aneesh.kumar@linux.ibm.com>
References: <20220527122528.129445-1-aneesh.kumar@linux.ibm.com> <20220527122528.129445-3-aneesh.kumar@linux.ibm.com>
Organization: Huawei Technologies Research and Development (UK) Ltd.

On Fri, 27 May 2022 17:55:23 +0530
"Aneesh Kumar K.V" wrote:

> From: Jagdish Gediya
>
> Add support to read/write the memory tierindex for a NUMA node.
>
> /sys/devices/system/node/nodeN/memtier
>
> where N = node id
>
> When read, It list the memory tier that the node belongs to.
>
> When written, the kernel moves the node into the specified
> memory tier, the tier assignment of all other nodes are not
> affected.
>
> If the memory tier does not exist, writing to the above file
> create the tier and assign the NUMA node to that tier.

creates

There was some discussion in v2 of Wei Xu's RFC that what matters for
creation is the rank, not the tier number.

My suggestion is to move to an explicit creation file such as
memtier/create_tier_from_rank, to which writing the rank results in a
new tier with the next device ID and the requested rank (rough sketch
a little further down).

>
> mutex memory_tier_lock is introduced to protect memory tier
> related chanegs as it can happen from sysfs as well on hot
> plug events.
>
> Signed-off-by: Jagdish Gediya
> Signed-off-by: Aneesh Kumar K.V
> ---
>  drivers/base/node.c     |  35 ++++++++++++++
>  include/linux/migrate.h |   4 +-
>  mm/migrate.c            | 103 ++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 141 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index ec8bb24a5a22..cf4a58446d8c 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -20,6 +20,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  static struct bus_type node_subsys = {
>  	.name = "node",
> @@ -560,11 +561,45 @@ static ssize_t node_read_distance(struct device *dev,
>  }
>  static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
>
> +#ifdef CONFIG_TIERED_MEMORY
> +static ssize_t memtier_show(struct device *dev,
> +			    struct device_attribute *attr,
> +			    char *buf)
> +{
> +	int node = dev->id;
> +
> +	return sysfs_emit(buf, "%d\n", node_get_memory_tier(node));
> +}
> +
> +static ssize_t memtier_store(struct device *dev,
> +			     struct device_attribute *attr,
> +			     const char *buf, size_t count)
> +{
> +	unsigned long tier;
> +	int node = dev->id;
> +
> +	int ret = kstrtoul(buf, 10, &tier);
> +	if (ret)
> +		return ret;
> +
> +	ret = node_reset_memory_tier(node, tier);

I don't follow why reset here rather than set.
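
Coming back to the create_tier_from_rank suggestion above: a completely
untested sketch of how such a write-only attribute on the memtier subsys
could look, on top of the memtier bits already in this patch.
next_free_tier_id() and register_memory_tier_with_rank() are made-up
names standing in for whatever helpers the series ends up providing.

static ssize_t create_tier_from_rank_store(struct bus_type *bus,
					   const char *buf, size_t count)
{
	unsigned int rank;
	int tier, ret;

	ret = kstrtouint(buf, 10, &rank);
	if (ret)
		return ret;

	mutex_lock(&memory_tier_lock);
	/* Kernel picks the next free device ID, userspace only supplies rank. */
	tier = next_free_tier_id();
	if (tier < 0) {
		ret = tier;
		goto out;
	}
	/* Hypothetical helper: create the tier with this ID and rank. */
	ret = register_memory_tier_with_rank(tier, rank);
out:
	mutex_unlock(&memory_tier_lock);

	return ret ? ret : count;
}
static BUS_ATTR_WO(create_tier_from_rank);

Wiring the attribute into the memtier subsys registration is left out
here, but you get the idea: the rank is the ABI, the tier ID is an
implementation detail.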
> +	if (ret)
> +		return ret;
> +
> +	return count;
> +}
> +
> +static DEVICE_ATTR_RW(memtier);
> +#endif
> +
>  static struct attribute *node_dev_attrs[] = {
>  	&dev_attr_meminfo.attr,
>  	&dev_attr_numastat.attr,
>  	&dev_attr_distance.attr,
>  	&dev_attr_vmstat.attr,
> +#ifdef CONFIG_TIERED_MEMORY
> +	&dev_attr_memtier.attr,
> +#endif
>  	NULL
>  };
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 0ec653623565..d37d1d5dee82 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -177,13 +177,15 @@ enum memory_tier_type {
>  };
>
>  int next_demotion_node(int node);
> -

Tidy that up to reduce noise in the next version.

>  extern void migrate_on_reclaim_init(void);
>  #ifdef CONFIG_HOTPLUG_CPU
>  extern void set_migration_target_nodes(void);
>  #else
>  static inline void set_migration_target_nodes(void) {}
>  #endif
> +int node_get_memory_tier(int node);
> +int node_set_memory_tier(int node, int tier);
> +int node_reset_memory_tier(int node, int tier);
>  #else
>  #define numa_demotion_enabled	false
>  static inline int next_demotion_node(int node)
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f28ee93fb017..304559ba3372 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2132,6 +2132,7 @@ static struct bus_type memory_tier_subsys = {
>  	.dev_name = "memtier",
>  };
>
> +DEFINE_MUTEX(memory_tier_lock);
>  static struct memory_tier *memory_tiers[MAX_MEMORY_TIERS];
>
>  static ssize_t nodelist_show(struct device *dev,
> @@ -2225,6 +2226,108 @@ static const struct attribute_group *memory_tier_attr_groups[] = {
>  	NULL,
>  };
>
> +static int __node_get_memory_tier(int node)
> +{
> +	int tier;
> +
> +	for (tier = 0; tier < MAX_MEMORY_TIERS; tier++) {
> +		if (memory_tiers[tier] && node_isset(node, memory_tiers[tier]->nodelist))
> +			return tier;
> +	}
> +
> +	return -1;
> +}
> +
> +int node_get_memory_tier(int node)
> +{
> +	int tier;
> +
> +	/*
> +	 * Make sure memory tier is not unregistered
> +	 * while it is being read.
> +	 */
> +	mutex_lock(&memory_tier_lock);
> +
> +	tier = __node_get_memory_tier(node);
> +
> +	mutex_unlock(&memory_tier_lock);
> +
> +	return tier;
> +}
> +
> +int __node_set_memory_tier(int node, int tier)
> +{
> +	int ret = 0;
> +	/*
> +	 * As register_memory_tier() for new tier can fail,
> +	 * try it before modifying existing tier. register
> +	 * tier makes tier visible in sysfs.
> +	 */
> +	if (!memory_tiers[tier]) {
> +		ret = register_memory_tier(tier);
> +		if (ret) {
> +			goto out;

No brackets needed around the goto out; and it's pointless anyway, just
return ret directly here.

> +		}
> +	}
> +
> +	node_set(node, memory_tiers[tier]->nodelist);
> +
> +out:
> +	return ret;
> +}
> +
> +int node_reset_memory_tier(int node, int tier)
> +{
> +	int current_tier, ret = 0;
> +
> +	mutex_lock(&memory_tier_lock);
> +
> +	current_tier = __node_get_memory_tier(node);
> +	if (current_tier == tier)
> +		goto out;
> +
> +	if (current_tier != -1 )
> +		node_clear(node, memory_tiers[current_tier]->nodelist);
> +
> +	ret = __node_set_memory_tier(node, tier);
> +
> +	if (!ret) {
> +		if (nodes_empty(memory_tiers[current_tier]->nodelist))
> +			unregister_memory_tier(current_tier);

Flip the logic so the error path is out of line:

	if (ret) {
		/* reset ...
		ret = ...
		goto out;
	}

and have the 'good path' here with less indent:

	if (nodes_empty(memory...

That will result in more 'idiomatic' code that is easier for reviewers
to read. It's Friday afternoon. Don't make me think :)
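
Something like this (completely untested, same behaviour as what you
have posted, just restructured):

int node_reset_memory_tier(int node, int tier)
{
	int current_tier, ret = 0;

	mutex_lock(&memory_tier_lock);

	current_tier = __node_get_memory_tier(node);
	if (current_tier == tier)
		goto out;

	if (current_tier != -1)
		node_clear(node, memory_tiers[current_tier]->nodelist);

	ret = __node_set_memory_tier(node, tier);
	if (ret) {
		/* reset it back to the older tier */
		ret = __node_set_memory_tier(node, current_tier);
		goto out;
	}

	if (nodes_empty(memory_tiers[current_tier]->nodelist))
		unregister_memory_tier(current_tier);
out:
	mutex_unlock(&memory_tier_lock);

	return ret;
}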
> +	} else {
> +		/* reset it back to older tier */
> +		ret = __node_set_memory_tier(node, current_tier);
> +	}
> +out:
> +	mutex_unlock(&memory_tier_lock);
> +
> +	return ret;
> +}
> +
> +int node_set_memory_tier(int node, int tier)
> +{

Currently seems to be unused code. Fine if it is used in a later patch,
but call it out in the patch description.

> +	int current_tier, ret = 0;
> +
> +	if (tier >= MAX_MEMORY_TIERS)
> +		return -EINVAL;
> +
> +	mutex_lock(&memory_tier_lock);
> +	current_tier = __node_get_memory_tier(node);
> +	/*
> +	 * if node is already part of the tier proceed with the
> +	 * current tier value, because we might want to establish
> +	 * new migration paths now. The node might be added to a tier
> +	 * before it was made part of N_MEMORY, hence estabilish_migration_targets
> +	 * will have skipped this node.
> +	 */
> +	if (current_tier != -1)
> +		tier = current_tier;
> +	ret = __node_set_memory_tier(node, tier);
> +	mutex_unlock(&memory_tier_lock);
> +
> +	return ret;
> +}
> +
>  /*
>   * node_demotion[] example:
>   *