From mboxrd@z Thu Jan 1 00:00:00 1970
From: Gregory Price
To: linux-kernel@vger.kernel.org
Cc: linux-cxl@vger.kernel.org, linux-mm@kvack.org, ying.huang@intel.com,
	akpm@linux-foundation.org, aneesh.kumar@linux.ibm.com, weixugc@google.com,
	apopple@nvidia.com, hannes@cmpxchg.org, tim.c.chen@intel.com,
	dave.hansen@intel.com, mhocko@kernel.org, shy828301@gmail.com,
	gregkh@linuxfoundation.org, rafael@kernel.org, Gregory Price
Subject: [RFC PATCH v3 4/4] mm/mempolicy: modify interleave mempolicy to use node weights
Date: Mon, 30 Oct 2023 20:38:10 -0400
Message-Id: <20231031003810.4532-5-gregory.price@memverge.com>
X-Mailer: git-send-email 2.39.1
In-Reply-To: <20231031003810.4532-1-gregory.price@memverge.com>
References: <20231031003810.4532-1-gregory.price@memverge.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

The node subsystem implements interleave weighting for the purpose of
bandwidth optimization. Each node may have different weights in relation
to each compute node ("access node"). The mempolicy MPOL_INTERLEAVE
utilizes the node weights to implement weighted interleave.
By default, since all nodes default to a weight of 1, the original
interleave behavior is retained.

Examples:

Weight settings:
  echo 4 > node0/access0/il_weight
  echo 1 > node0/access1/il_weight
  echo 3 > node1/access0/il_weight
  echo 2 > node1/access1/il_weight

Results:
  Task A:
    cpunode:  0
    nodemask: [0,1]
    weights:  [4,3]
    allocation result: [0,0,0,0,1,1,1 repeat]
  Task B:
    cpunode:  1
    nodemask: [0,1]
    weights:  [1,2]
    allocation result: [0,1,1 repeat]

Weights are relative to access node

Signed-off-by: Gregory Price
---
 include/linux/mempolicy.h |   4 ++
 mm/mempolicy.c            | 138 +++++++++++++++++++++++++++++---------
 2 files changed, 112 insertions(+), 30 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index d232de7cdc56..240468b669fd 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -48,6 +48,10 @@ struct mempolicy {
 	nodemask_t nodes;	/* interleave/bind/perfer */
 	int home_node;	/* Home node to use for MPOL_BIND and MPOL_PREFERRED_MANY */
 
+	/* weighted interleave settings */
+	unsigned char cur_weight;
+	unsigned char il_weights[MAX_NUMNODES];
+
 	union {
 		nodemask_t cpuset_mems_allowed;	/* relative to these nodes */
 		nodemask_t user_nodemask;	/* nodemask passed by user */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 29ebf1e7898c..d62e942a13bd 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -102,6 +102,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -300,6 +301,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 	policy->mode = mode;
 	policy->flags = flags;
 	policy->home_node = NUMA_NO_NODE;
+	policy->cur_weight = 0;
 
 	return policy;
 }
@@ -334,6 +336,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
 		tmp = *nodes;
 
 	pol->nodes = tmp;
+	pol->cur_weight = 0;
 }
 
 static void mpol_rebind_preferred(struct mempolicy *pol,
@@ -881,8 +884,11 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 	old = current->mempolicy;
 	current->mempolicy = new;
-	if (new && new->mode == MPOL_INTERLEAVE)
+	if (new && new->mode == MPOL_INTERLEAVE) {
 		current->il_prev = MAX_NUMNODES-1;
+		new->cur_weight = 0;
+	}
+
 	task_unlock(current);
 	mpol_put(old);
 	ret = 0;
@@ -1903,12 +1909,21 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
 /* Do dynamic interleaving for a process */
 static unsigned interleave_nodes(struct mempolicy *policy)
 {
-	unsigned next;
+	unsigned int next;
+	unsigned char next_weight;
 	struct task_struct *me = current;
 
 	next = next_node_in(me->il_prev, policy->nodes);
-	if (next < MAX_NUMNODES)
+	if (!policy->cur_weight) {
+		/* If the node is set, at least 1 allocation is required */
+		next_weight = node_get_il_weight(next, numa_node_id());
+		policy->cur_weight = next_weight ? next_weight : 1;
+	}
+
+	policy->cur_weight--;
+	if (next < MAX_NUMNODES && !policy->cur_weight)
 		me->il_prev = next;
+
 	return next;
 }
@@ -1967,25 +1982,37 @@ unsigned int mempolicy_slab_node(void)
 static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
 {
 	nodemask_t nodemask = pol->nodes;
-	unsigned int target, nnodes;
-	int i;
+	unsigned int target, nnodes, il_weight;
+	unsigned char weight;
 	int nid;
+	int cur_node = numa_node_id();
+
 	/*
 	 * The barrier will stabilize the nodemask in a register or on
 	 * the stack so that it will stop changing under the code.
 	 *
 	 * Between first_node() and next_node(), pol->nodes could be changed
	 * by other threads. So we put pol->nodes in a local stack.
+	 *
+	 * Additionally, place the cur_node on the stack in case of a migration
 	 */
 	barrier();
 
 	nnodes = nodes_weight(nodemask);
 	if (!nnodes)
-		return numa_node_id();
-	target = (unsigned int)n % nnodes;
+		return cur_node;
+
+	il_weight = nodes_get_il_weights(cur_node, &nodemask, pol->il_weights);
+	target = (unsigned int)n % il_weight;
 	nid = first_node(nodemask);
-	for (i = 0; i < target; i++)
-		nid = next_node(nid, nodemask);
+	while (target) {
+		weight = pol->il_weights[nid];
+		if (target < weight)
+			break;
+		target -= weight;
+		nid = next_node_in(nid, nodemask);
+	}
+
 	return nid;
 }
@@ -2319,32 +2346,83 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
 		struct mempolicy *pol, unsigned long nr_pages,
 		struct page **page_array)
 {
-	int nodes;
-	unsigned long nr_pages_per_node;
-	int delta;
-	int i;
-	unsigned long nr_allocated;
+	struct task_struct *me = current;
 	unsigned long total_allocated = 0;
+	unsigned long nr_allocated;
+	unsigned long rounds;
+	unsigned long node_pages, delta;
+	unsigned char weight;
+	unsigned long il_weight;
+	unsigned long req_pages = nr_pages;
+	int nnodes, node, prev_node;
+	int cur_node = numa_node_id();
+	int i;
 
-	nodes = nodes_weight(pol->nodes);
-	nr_pages_per_node = nr_pages / nodes;
-	delta = nr_pages - nodes * nr_pages_per_node;
-
-	for (i = 0; i < nodes; i++) {
-		if (delta) {
-			nr_allocated = __alloc_pages_bulk(gfp,
-					interleave_nodes(pol), NULL,
-					nr_pages_per_node + 1, NULL,
-					page_array);
-			delta--;
-		} else {
-			nr_allocated = __alloc_pages_bulk(gfp,
-					interleave_nodes(pol), NULL,
-					nr_pages_per_node, NULL, page_array);
+	prev_node = me->il_prev;
+	nnodes = nodes_weight(pol->nodes);
+	/* Continue allocating from most recent node */
+	if (pol->cur_weight) {
+		node = next_node_in(prev_node, pol->nodes);
+		node_pages = pol->cur_weight;
+		if (node_pages > nr_pages)
+			node_pages = nr_pages;
+		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+						  NULL, page_array);
+		page_array += nr_allocated;
+		total_allocated += nr_allocated;
+		/* if that's all the pages, no need to interleave */
+		if (req_pages <= pol->cur_weight) {
+			pol->cur_weight -= req_pages;
+			return total_allocated;
 		}
-
+		/* Otherwise we adjust req_pages down, and continue from there */
+		req_pages -= pol->cur_weight;
+		pol->cur_weight = 0;
+		prev_node = node;
+	}
+
+	il_weight = nodes_get_il_weights(cur_node, &pol->nodes,
+					 pol->il_weights);
+	rounds = req_pages / il_weight;
+	delta = req_pages % il_weight;
+	for (i = 0; i < nnodes; i++) {
+		node = next_node_in(prev_node, pol->nodes);
+		weight = pol->il_weights[node];
+		node_pages = weight * rounds;
+		if (delta > weight) {
+			node_pages += weight;
+			delta -= weight;
+		} else if (delta) {
+			node_pages += delta;
+			delta = 0;
+		}
+		/* The number of requested pages may not hit every node */
+		if (!node_pages)
+			break;
+		/* If an over-allocation would occur, floor it */
+		if (node_pages + total_allocated > nr_pages) {
+			node_pages = nr_pages - total_allocated;
+			delta = 0;
+		}
+		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+						  NULL, page_array);
 		page_array += nr_allocated;
 		total_allocated += nr_allocated;
+		prev_node = node;
+	}
+
+	/*
+	 * Finally, we need to update me->il_prev and pol->cur_weight
+	 * If the last node allocated on has un-used weight, apply
+	 * the remainder as the cur_weight, otherwise proceed to next node
+	 */
+	if (node_pages) {
+		me->il_prev = prev_node;
+		node_pages %= weight;
+		pol->cur_weight = weight - node_pages;
+	} else {
+		me->il_prev = node;
+		pol->cur_weight = 0;
+	}
 
 	return total_allocated;
-- 
2.39.1