Date: Fri, 27 May 2022 15:31:35 +0100
From: Jonathan Cameron
To: Aneesh Kumar K.V
Cc: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
 Brice Goglin, Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
 Dave Hansen, Alistair Popple, Dan Williams, Feng Tang, Jagdish Gediya,
 Baolin Wang, David Rientjes
Subject: Re: [RFC PATCH v4 3/7] mm/demotion: Build demotion targets based on explicit memory tiers
Message-ID: <20220527153135.00004339@Huawei.com>
In-Reply-To: <20220527122528.129445-4-aneesh.kumar@linux.ibm.com>
References: <20220527122528.129445-1-aneesh.kumar@linux.ibm.com>
 <20220527122528.129445-4-aneesh.kumar@linux.ibm.com>
Organization: Huawei Technologies Research and Development (UK) Ltd.

On Fri, 27 May 2022 17:55:24 +0530
"Aneesh Kumar K.V" wrote:

> From: Jagdish Gediya
>
> This patch switches the demotion target building logic to use memory
> tiers instead of NUMA distance. All N_MEMORY NUMA nodes will be placed
> in the default tier 1, and additional memory tiers will be added by
> drivers like dax kmem.
>
> This patch builds the demotion target for a NUMA node by looking at all
> memory tiers below the tier to which the NUMA node belongs. The closest
> node in the immediately following memory tier is used as a demotion
> target.
>
> Since we are now only building demotion targets for N_MEMORY NUMA nodes,
> the CPU hotplug calls are removed in this patch.
>
> Signed-off-by: Jagdish Gediya
> Signed-off-by: Aneesh Kumar K.V

Hi,

Diff made a mess of this one! Anyhow, a few comments inline.

Thanks,

Jonathan

> --- a/mm/migrate.c
> +++ b/mm/migrate.c

> +/*
> + * node_demotion[] examples:

Perhaps call out that these are examples of possible default
situations. None are enforced by this code.

> + *
> + * Example 1:
> + *
> + * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
> + *
> + * node distances:
> + * node   0    1    2    3
> + *    0  10   20   30   40
> + *    1  20   10   40   30
> + *    2  30   40   10   40
> + *    3  40   30   40   10
> + *
> + * memory_tiers[0] =
> + * memory_tiers[1] = 0-1
> + * memory_tiers[2] = 2-3
> + *
> + * node_demotion[0].preferred = 2
> + * node_demotion[1].preferred = 3
> + * node_demotion[2].preferred =
> + * node_demotion[3].preferred =
> + *
> + * Example 2:
> + *
> + * Node 0 & 1 are CPU + DRAM nodes, node 2 is a memory-only DRAM node.
> + *
> + * node distances:
> + * node   0    1    2
> + *    0  10   20   30
> + *    1  20   10   30
> + *    2  30   30   10
> + *
> + * memory_tiers[0] =
> + * memory_tiers[1] = 0-2
> + * memory_tiers[2] =
> + *
> + * node_demotion[0].preferred =
> + * node_demotion[1].preferred =
> + * node_demotion[2].preferred =
> + *
> + * Example 3:
> + *
> + * Node 0 is a CPU + DRAM node, node 1 is an HBM node, node 2 is a PMEM node.
> + *
> + * node distances:
> + * node   0    1    2
> + *    0  10   20   30
> + *    1  20   10   40
> + *    2  30   40   10
> + *
> + * memory_tiers[0] = 1
> + * memory_tiers[1] = 0
> + * memory_tiers[2] = 2
> + *
> + * node_demotion[0].preferred = 2
> + * node_demotion[1].preferred = 0
> + * node_demotion[2].preferred =
> + *
> + */

> /* Disable reclaim-based migration.
>  */
> static void __disable_all_migrate_targets(void)
> {
> +	int node;
>
> +	for_each_node_mask(node, node_states[N_MEMORY])
> +		node_demotion[node].preferred = NODE_MASK_NONE;
> }
>
> /*
> + * Find an automatic demotion target for all memory
> + * nodes. Failing here is OK. It might just indicate
> + * being at the end of a chain.
> + */
> +static void establish_migration_targets(void)
> {

Diff did a horrible job on this, so I've reformatted heavily so I
could see what was happening!

>	struct demotion_nodes *nd;
> +	int tier, target = NUMA_NO_NODE, node;
> +	int distance, best_distance;
> +	nodemask_t used;
>
>	if (!node_demotion)
> +		return;
>
> +	disable_all_migrate_targets();
> +	for_each_node_mask(node, node_states[N_MEMORY]) {
> +		best_distance = -1;
> +		nd = &node_demotion[node];
>
> +		tier = __node_get_memory_tier(node);
> +		/*
> +		 * Find next tier to demote.

In the discussion of Wei Xu's RFC we concluded that we need to allow
demotion to the nearest node in 'any' higher tier (now bigger rank).
That functionality matters for even moderately complex systems.

> +		 */
> +		while (++tier < MAX_MEMORY_TIERS) {
> +			if (memory_tiers[tier])
> +				break;
> +		}
> +		if (tier >= MAX_MEMORY_TIERS)
> +			continue;
>
> +		nodes_andnot(used, node_states[N_MEMORY], memory_tiers[tier]->nodelist);

I'm a bit lost on this one. Perhaps a comment to say what 'used'
represents? I was expecting all memory nodes in tiers with rank >
current tier. I'm not sure that's what we have here.

>
>		/*
> +		 * Find all the nodes in the memory tier node list of same best distance.
> +		 * add add them to the preferred mask. We randomly select between nodes

Repeated 'add'.

> +		 * in the preferred mask when allocating pages during demotion.
>		 */
>		do {
> +			target = find_next_best_node(node, &used);
> +			if (target == NUMA_NO_NODE)
>				break;
>
> +			distance = node_distance(node, target);
> +			if (distance == best_distance || best_distance == -1) {
> +				best_distance = distance;
> +				node_set(target, nd->preferred);
> +			} else {
> +				break;
> +			}
>		} while (1);
>	}
> }
>