From: Yang Shi
Date: Mon, 15 Nov 2021 11:06:26 -0800
Subject: Re: [PATCH v3] mm: migrate: Support multiple target nodes demotion
To: Baolin Wang
Cc: Andrew Morton, Huang Ying, Dave Hansen, Zi Yan, Oscar Salvador,
    zhongjiang-ali@linux.alibaba.com, Xunlei Pang, Linux MM,
    Linux Kernel Mailing List

On Sun, Nov 14, 2021 at 6:40 AM Baolin Wang wrote:
>
> On 2021/11/13 3:05, Yang Shi wrote:
> > On Thu, Nov 11, 2021 at 6:28 PM Baolin Wang wrote:
> >>
> >> We have some machines with multiple memory types like below, which
> >> have one fast (DRAM) memory node and two slow (persistent memory) memory
> >> nodes. According to current node demotion policy, if node 0 fills up,
> >> its memory should be migrated to node 1, when node 1 fills up, its
> >> memory will be migrated to node 2: node 0 -> node 1 -> node 2 ->stop.
> >>
> >> But this is not efficient and suitable memory migration route
> >> for our machine with multiple slow memory nodes. Since the distance
> >> between node 0 to node 1 and node 0 to node 2 is equal, and memory
> >> migration between slow memory nodes will increase persistent memory
> >> bandwidth greatly, which will hurt the whole system's performance.
> >>
> >> Thus for this case, we can treat the slow memory node 1 and node 2
> >> as a whole slow memory region, and we should migrate memory from
> >> node 0 to node 1 and node 2 if node 0 fills up.
> >>
> >> This patch changes the node_demotion data structure to support multiple
> >> target nodes, and establishes the migration path to support multiple
> >> target nodes with validating if the node distance is the best or not.
> >>
> >> available: 3 nodes (0-2)
> >> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> >> node 0 size: 62153 MB
> >> node 0 free: 55135 MB
> >> node 1 cpus:
> >> node 1 size: 127007 MB
> >> node 1 free: 126930 MB
> >> node 2 cpus:
> >> node 2 size: 126968 MB
> >> node 2 free: 126878 MB
> >> node distances:
> >> node   0   1   2
> >>   0:  10  20  20
> >>   1:  20  10  20
> >>   2:  20  20  10
> >>
> >> Signed-off-by: Baolin Wang
> >> ---
> >> Changes from v2:
> >>  - Redefine the DEMOTION_TARGET_NODES macro according to the
> >>    MAX_NUMNODES.
> >>  - Change node_demotion to a pointer and allocate it dynamically.
> >>
> >> Changes from v1:
> >>  - Add a new patch to allocate the node_demotion dynamically.
> >>  - Update some comments.
> >>  - Simplify some variables' name.
> >>
> >> Changes from RFC v2:
> >>  - Change to 'short' type for target nodes array.
> >>  - Remove nodemask instead selecting target node directly.
> >>  - Add WARN_ONCE() if the target nodes exceed the maximum value.
> >>
> >> Changes from RFC v1:
> >>  - Re-define the node_demotion structure.
> >>  - Set up multiple target nodes by validating the node distance.
> >>  - Add more comments.
> >> ---
> >>  mm/migrate.c | 167 ++++++++++++++++++++++++++++++++++++++++++++++-------------
> >>  1 file changed, 132 insertions(+), 35 deletions(-)
> >>
> >> diff --git a/mm/migrate.c b/mm/migrate.c
> >> index cf25b00..9b8a813 100644
> >> --- a/mm/migrate.c
> >> +++ b/mm/migrate.c
> >> @@ -50,6 +50,7 @@
> >>  #include
> >>  #include
> >>  #include
> >> +#include
> >>
> >>  #include
> >>
> >> @@ -1119,12 +1120,25 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> >>   *
> >>   * This is represented in the node_demotion[] like this:
> >>   *
> >> - *     {  1, // Node 0 migrates to 1
> >> - *        2, // Node 1 migrates to 2
> >> - *       -1, // Node 2 does not migrate
> >> - *        4, // Node 3 migrates to 4
> >> - *        5, // Node 4 migrates to 5
> >> - *       -1} // Node 5 does not migrate
> >> + *     { nr=1, nodes[0]=1 }, // Node 0 migrates to 1
> >> + *     { nr=1, nodes[0]=2 }, // Node 1 migrates to 2
> >> + *     { nr=0, nodes[0]=-1 }, // Node 2 does not migrate
> >> + *     { nr=1, nodes[0]=4 }, // Node 3 migrates to 4
> >> + *     { nr=1, nodes[0]=5 }, // Node 4 migrates to 5
> >> + *     { nr=0, nodes[0]=-1 }, // Node 5 does not migrate
> >> + *
> >> + * Moreover some systems may have multiple slow memory nodes.
> >> + * Suppose a system has one socket with 3 memory nodes, node 0
> >> + * is fast memory type, and node 1/2 both are slow memory
> >> + * type, and the distance between fast memory node and slow
> >> + * memory node is same. So the migration path should be:
> >> + *
> >> + *     0 -> 1/2 -> stop
> >> + *
> >> + * This is represented in the node_demotion[] like this:
> >> + *     { nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2
> >> + *     { nr=0, nodes[0]=-1, }, // Node 1 dose not migrate
> >> + *     { nr=0, nodes[0]=-1, }, // Node 2 does not migrate
> >>   */
> >>
> >>  /*
> >> @@ -1135,8 +1149,20 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> >>   * must be held over all reads to ensure that no cycles are
> >>   * observed.
> >>   */
> >> -static int node_demotion[MAX_NUMNODES] __read_mostly =
> >> -        {[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE};
> >> +#define DEFAULT_DEMOTION_TARGET_NODES 15
> >> +
> >> +#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES
> >> +#define DEMOTION_TARGET_NODES   (MAX_NUMNODES - 1)
> >> +#else
> >> +#define DEMOTION_TARGET_NODES   DEFAULT_DEMOTION_TARGET_NODES
> >> +#endif
> >> +
> >> +struct demotion_nodes {
> >> +        unsigned short nr;
> >> +        short nodes[DEMOTION_TARGET_NODES];
> >> +};
> >> +
> >> +static struct demotion_nodes *node_demotion __read_mostly;
> >>
> >>  /**
> >>   * next_demotion_node() - Get the next node in the demotion path
> >> @@ -1149,8 +1175,15 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> >>   */
> >>  int next_demotion_node(int node)
> >>  {
> >> +        struct demotion_nodes *nd;
> >> +        unsigned short target_nr, index;
> >>          int target;
> >>
> >> +        if (!node_demotion)
> >> +                return NUMA_NO_NODE;
> >> +
> >> +        nd = &node_demotion[node];
> >> +
> >>          /*
> >>           * node_demotion[] is updated without excluding this
> >>           * function from running. RCU doesn't provide any
> >> @@ -1161,9 +1194,28 @@ int next_demotion_node(int node)
> >>           * node_demotion[] reads need to be consistent.
> >>           */
> >>          rcu_read_lock();
> >> -        target = READ_ONCE(node_demotion[node]);
> >> -        rcu_read_unlock();
> >> +        target_nr = READ_ONCE(nd->nr);
> >> +
> >> +        switch (target_nr) {
> >> +        case 0:
> >> +                target = NUMA_NO_NODE;
> >> +                goto out;
> >> +        case 1:
> >> +                index = 0;
> >> +                break;
> >> +        default:
> >> +                /*
> >> +                 * If there are multiple target nodes, just select one
> >> +                 * target node randomly.
> >> +                 */
> >> +                index = get_random_int() % target_nr;
> >
> > Sorry for chiming in late. I don't get why not select demotion target
> > node interleave? TBH, it makes more sense to me. Random is ok, but at
> > least I'd expect to see some explanation about why random is used.
>
> My first version patch[1] already did round-robin to select target node.
> For interleave (or round-robin), we should introduce another member to
> record last selected target node, as Dave and Ying said, that will cause
> cache ping-pong to hurt performance, or introduce per-cpu data to avoid
> this, which seems more complicated now.

Thanks. It would be better to add a few words to the commit log or a code
comment explaining this. Someone else may have the same question in the
future.

>
> [1]
> https://lore.kernel.org/all/c02bcbc04faa7a2c852534e9cd58a91c44494657.1636016609.git.baolin.wang@linux.alibaba.com/
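
For later readers, the trade-off discussed in this thread can be sketched
in plain userspace C. This is only an illustration, not code from the
patch: pick_random() and pick_round_robin() are hypothetical helper names,
the rnd argument stands in for the value get_random_int() would return,
and the last cursor stands in for the extra member that round-robin would
need. The struct layout mirrors the one quoted above, with
DEMOTION_TARGET_NODES fixed at 15 as an assumption.

```c
#include <assert.h>

#define NUMA_NO_NODE            (-1)
#define DEMOTION_TARGET_NODES   15      /* assumed; the patch derives it from MAX_NUMNODES */

struct demotion_nodes {
        unsigned short nr;                      /* number of valid targets */
        short nodes[DEMOTION_TARGET_NODES];     /* target node ids */
};

/*
 * Random selection: only reads the shared table, so concurrent callers
 * never write to a shared cache line. 'rnd' is a caller-supplied random
 * value (hypothetical stand-in for get_random_int()).
 */
static int pick_random(const struct demotion_nodes *nd, unsigned int rnd)
{
        assert(nd->nr <= DEMOTION_TARGET_NODES);
        if (nd->nr == 0)
                return NUMA_NO_NODE;
        return nd->nodes[rnd % nd->nr];
}

/*
 * Round-robin selection: must store the last used index somewhere.
 * If '*last' is shared by all CPUs, every call writes it, which is the
 * cache ping-pong concern; making it per-CPU avoids that at the cost
 * of extra complexity.
 */
static int pick_round_robin(const struct demotion_nodes *nd,
                            unsigned short *last)
{
        assert(nd->nr <= DEMOTION_TARGET_NODES);
        if (nd->nr == 0)
                return NUMA_NO_NODE;
        *last = (unsigned short)((*last + 1) % nd->nr);
        return nd->nodes[*last];
}
```

With a table entry of { nr=2, nodes = {1, 2} }, pick_random() returns node
1 or node 2 depending only on rnd, while pick_round_robin() alternates
between them but updates the cursor on every call.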