From: Ben Widawsky
To: linux-mm
Cc: Ben Widawsky, Andi Kleen, Andrew Morton, Dave Hansen,
    Kuppuswamy Sathyanarayanan, Mel Gorman, Michal Hocko
Subject: [PATCH 03/18] mm/page_alloc: start plumbing multi preferred node
Date: Fri, 19 Jun 2020 09:24:10 -0700
Message-Id: <20200619162425.1052382-4-ben.widawsky@intel.com>
X-Mailer: git-send-email 2.27.0
In-Reply-To: <20200619162425.1052382-1-ben.widawsky@intel.com>
References: <20200619162425.1052382-1-ben.widawsky@intel.com>

In preparation for supporting multiple preferred nodes, we need the
internals to switch from taking a nid to a nodemask.

As mentioned in a code comment, __alloc_pages_nodemask() is the heart of
the page allocator. It takes a single node as the preferred node from
which to obtain a zonelist to try first.

This patch leaves that internal interface in place, but changes the guts
of the function to consider a list of preferred nodes. The local node is
always most preferred. If the local node is ruled out by either the
preference or the binding mask, then the closest node that meets both the
binding and preference criteria is used. If the intersection of binding
and preference is the empty set, fall back to the first node that meets
the binding criteria.

As of this patch, multiple preferred nodes aren't actually supported in
the way it might initially seem. As an example, suppose the preferred
nodes are 0 and 1. Node 0's fallback zonelist may have zones from nodes
ordered 0->2->1. If this code were to pick node 0's zonelist and all
zones from node 0 were full, you'd get a zone from node 2 instead of
node 1. Since multiple nodes aren't yet supported anyway, this is
acceptable for a preparatory patch.

v2: Fixed memory hotplug handling (Ben)

Cc: Andi Kleen
Cc: Andrew Morton
Cc: Dave Hansen
Cc: Kuppuswamy Sathyanarayanan
Cc: Mel Gorman
Cc: Michal Hocko
Signed-off-by: Ben Widawsky
---
 mm/page_alloc.c | 125 +++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 119 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48eb0f1410d4..280ca85dc4d8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -129,6 +129,10 @@ nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
 };
 EXPORT_SYMBOL(node_states);
 
+#ifdef CONFIG_NUMA
+static int find_next_best_node(int node, nodemask_t *used_node_mask);
+#endif
+
 atomic_long_t _totalram_pages __read_mostly;
 EXPORT_SYMBOL(_totalram_pages);
 unsigned long totalreserve_pages __read_mostly;
@@ -4759,13 +4763,118 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
-static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
-		int preferred_nid, nodemask_t *nodemask,
-		struct alloc_context *ac, gfp_t *alloc_mask,
-		unsigned int *alloc_flags)
+#ifndef CONFIG_NUMA
+#define set_pref_bind_mask(out, pref, bind)				\
+	{								\
+		(out)->bits[0] = 1UL;					\
+	}
+#else
+static void set_pref_bind_mask(nodemask_t *out, const nodemask_t *prefmask,
+			       const nodemask_t *bindmask)
+{
+	bool has_pref, has_bind;
+
+	has_pref = prefmask && !nodes_empty(*prefmask);
+	has_bind = bindmask && !nodes_empty(*bindmask);
+
+	if (has_pref && has_bind)
+		nodes_and(*out, *prefmask, *bindmask);
+	else if (has_pref && !has_bind)
+		*out = *prefmask;
+	else if (!has_pref && has_bind)
+		*out = *bindmask;
+	else if (!has_pref && !has_bind)
+		unreachable(); /* Handled above */
+	else
+		unreachable();
+}
+#endif
+
+/*
+ * Find a zonelist from a preferred node. Here is a truth table example using
+ * two different masks. The choices are: NULL mask, empty mask, two masks with
+ * an intersection, and two masks with no intersection. If the local node is in
+ * the intersection, it is used; otherwise the first set node is used.
+ *
+ * +----------+----------+------------+
+ * | bindmask | prefmask | zonelist   |
+ * +----------+----------+------------+
+ * | NULL/0   | NULL/0   | local node |
+ * | NULL/0   | 0x2      | 0x2        |
+ * | NULL/0   | 0x4      | 0x4        |
+ * | 0x2      | NULL/0   | 0x2        |
+ * | 0x2      | 0x2      | 0x2        |
+ * | 0x2      | 0x4      | local*     |
+ * | 0x4      | NULL/0   | 0x4        |
+ * | 0x4      | 0x2      | local*     |
+ * | 0x4      | 0x4      | 0x4        |
+ * +----------+----------+------------+
+ *
+ * NB: That zonelist will have *all* zones in the fallback case, and not all of
+ * those zones will belong to preferred nodes.
+ */
+static struct zonelist *preferred_zonelist(gfp_t gfp_mask,
+					   const nodemask_t *prefmask,
+					   const nodemask_t *bindmask)
+{
+	nodemask_t pref;
+	int nid, local_node = numa_mem_id();
+
+	/* Multiple preferred nodes are not supported yet */
+	VM_BUG_ON(prefmask && nodes_weight(*prefmask) != 1);
+
+#define _isset(mask, node)						\
+	(!(mask) || nodes_empty(*(mask)) ? 1 : node_isset(node, *(mask)))
+	/*
+	 * This handles NULL masks, empty masks, and the case where the local
+	 * node satisfies all constraints. It does most of the magic here.
+	 */
+	if (_isset(prefmask, local_node) && _isset(bindmask, local_node))
+		return node_zonelist(local_node, gfp_mask);
+#undef _isset
+
+	VM_BUG_ON(!prefmask && !bindmask);
+
+	set_pref_bind_mask(&pref, prefmask, bindmask);
+
+	/*
+	 * It is possible that the caller may ask for a preferred set that isn't
+	 * available. One such case is memory hotplug. Memory hotplug code tries
+	 * to do some allocations from the target node (what will be local to
+	 * the new node) before the pages are onlined (N_MEMORY).
+	 */
+	for_each_node_mask(nid, pref) {
+		if (!node_state(nid, N_MEMORY))
+			node_clear(nid, pref);
+	}
+
+	/*
+	 * If we couldn't manage to get anything reasonable, let later code
+	 * clean up our mess. The local node will be the best approximation for
+	 * what is desired, so just use it.
+	 */
+	if (unlikely(nodes_empty(pref)))
+		return node_zonelist(local_node, gfp_mask);
+
+	/* Try to find the "closest" node in the list. */
+	nodes_complement(pref, pref);
+	while ((nid = find_next_best_node(local_node, &pref)) != NUMA_NO_NODE)
+		return node_zonelist(nid, gfp_mask);
+
+	/*
+	 * find_next_best_node() must have found something if the node list
+	 * isn't empty, so it isn't possible to get here unless
+	 * find_next_best_node() is modified and this function isn't updated.
+	 */
+	BUG();
+}
+
+static inline bool
+prepare_alloc_pages(gfp_t gfp_mask, unsigned int order, nodemask_t *prefmask,
+		    nodemask_t *nodemask, struct alloc_context *ac,
+		    gfp_t *alloc_mask, unsigned int *alloc_flags)
 {
 	ac->highest_zoneidx = gfp_zone(gfp_mask);
-	ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
 	ac->nodemask = nodemask;
 	ac->migratetype = gfp_migratetype(gfp_mask);
 
@@ -4777,6 +4886,8 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 		*alloc_flags |= ALLOC_CPUSET;
 	}
 
+	ac->zonelist = preferred_zonelist(gfp_mask, prefmask, ac->nodemask);
+
 	fs_reclaim_acquire(gfp_mask);
 	fs_reclaim_release(gfp_mask);
 
@@ -4817,6 +4928,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 	unsigned int alloc_flags = ALLOC_WMARK_LOW;
 	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
 	struct alloc_context ac = { };
+	nodemask_t prefmask = nodemask_of_node(preferred_nid);
 
 	/*
 	 * There are several places where we assume that the order value is sane
@@ -4829,7 +4941,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 
 	gfp_mask &= gfp_allowed_mask;
 	alloc_mask = gfp_mask;
-	if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
+	if (!prepare_alloc_pages(gfp_mask, order, &prefmask, nodemask, &ac,
+				 &alloc_mask, &alloc_flags))
 		return NULL;
 
 	finalise_ac(gfp_mask, &ac);
-- 
2.27.0
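
preferred_zonelist() packs the whole decision into a few branches, so below is
a minimal userspace sketch of the same decision order: local node first, then
the pref/bind intersection, then the first remaining node. It is an
illustration only, not kernel code: mask_t, mask_allows() and choose_node()
are invented stand-ins for nodemask_t, the _isset() helper, and the
preferred_zonelist()/set_pref_bind_mask() pair; the N_MEMORY filtering is
omitted; and the lowest set bit stands in for find_next_best_node()'s
distance-based search.

#include <stdbool.h>
#include <stdio.h>

typedef unsigned long mask_t;	/* bit N set => node N is allowed */

/* A NULL or empty mask places no constraint, mirroring _isset() above. */
static bool mask_allows(const mask_t *m, int node)
{
	return !m || *m == 0 || (*m & (1UL << node));
}

/* Return the node whose zonelist would be picked, given the local node. */
static int choose_node(int local, const mask_t *pref, const mask_t *bind)
{
	bool has_pref = pref && *pref;
	bool has_bind = bind && *bind;
	mask_t out;

	/* The local node wins whenever it satisfies both constraints. */
	if (mask_allows(pref, local) && mask_allows(bind, local))
		return local;

	/* Otherwise take the intersection, or whichever mask is non-empty. */
	if (has_pref && has_bind)
		out = *pref & *bind;
	else if (has_pref)
		out = *pref;
	else
		out = *bind;

	/* Empty intersection: fall back to the local node. */
	if (!out)
		return local;

	/* Lowest set bit stands in for find_next_best_node()'s "closest". */
	return __builtin_ctzl(out);
}

int main(void)
{
	mask_t pref = 0x2, bind = 0x4;	/* disjoint masks from the table */

	printf("pref=0x2 bind=0x4 local=0 -> node %d\n",
	       choose_node(0, &pref, &bind));		/* local*, node 0 */
	printf("pref=0x4 bind=NULL local=0 -> node %d\n",
	       choose_node(0, &(mask_t){ 0x4 }, NULL));	/* node 2 */
	return 0;
}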
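
The fallback-ordering caveat in the commit message can also be made concrete.
The toy model below uses entirely invented data (node0_zonelist and node_full
are not kernel structures): the preferred set is {0, 1}, but because node 0's
zonelist is ordered by distance (0 -> 2 -> 1 here) rather than by preference,
an exhausted node 0 spills to node 2 before node 1.

#include <stdbool.h>
#include <stdio.h>

#define NR_NODES 3

/* Node 0's zonelist, ordered by distance: 0, then 2, then 1 (invented). */
static const int node0_zonelist[NR_NODES] = { 0, 2, 1 };

/* Pretend node 0 is out of memory while nodes 1 and 2 still have pages. */
static const bool node_full[NR_NODES] = { true, false, false };

/* Walk the zonelist and take the first node that still has free memory. */
static int alloc_node(const int *zonelist)
{
	for (int i = 0; i < NR_NODES; i++)
		if (!node_full[zonelist[i]])
			return zonelist[i];
	return -1;
}

int main(void)
{
	/* Preferred set is {0, 1}, yet the allocation lands on node 2. */
	printf("allocated from node %d\n", alloc_node(node0_zonelist));
	return 0;
}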