From: Aneesh Kumar K.V
To: "Huang, Ying"
Cc: Donet Tom, Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 Dave Hansen, Mel Gorman, Ben Widawsky, Feng Tang, Michal Hocko,
 Andrea Arcangeli, Peter Zijlstra, Ingo Molnar, Rik van Riel,
 Johannes Weiner, Matthew Wilcox, Mike Kravetz, Vlastimil Babka,
 Dan Williams, Hugh Dickins, Kefeng Wang, Suren Baghdasaryan
Subject: Re: [PATCH 3/3] mm/numa_balancing:Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
In-Reply-To: <87ttm3o9db.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <9c3f7b743477560d1c5b12b8c111a584a2cc92ee.1708097962.git.donettom@linux.ibm.com>
 <8d7737208bd24e754dc7a538a3f7f02de84f1f72.1708097962.git.donettom@linux.ibm.com>
 <877cizppsa.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87sf1nzi3s.fsf@kernel.org>
 <87ttm3o9db.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Sun, 03 Mar 2024 11:46:09 +0530
Message-ID: <878r2zlu1i.fsf@kernel.org>

"Huang, Ying" writes:

> Aneesh Kumar K.V writes:
>
>> "Huang, Ying" writes:
>>
>>> Donet Tom writes:
>>>
>>>> commit bda420b98505 ("numa balancing: migrate on fault among multiple bound
>>>> nodes") added support for migrate on protnone reference with MPOL_BIND
>>>> memory policy. This allowed numa fault migration when the executing node
>>>> is part of the policy mask for MPOL_BIND. This patch extends migration
>>>> support to MPOL_PREFERRED_MANY policy.
>>>>
>>>> Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag
>>>> MPOL_F_NUMA_BALANCING. This causes issues when we want to use
>>>> NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier,
>>>> the kernel should not allocate pages from the slower memory tier via
>>>> allocation control zonelist fallback. Instead, we should move cold pages
>>>> from the faster memory node via memory demotion. For a page allocation,
>>>> kswapd is only woken up after we try to allocate pages from all nodes in
>>>> the allocation zone list. This implies that, without using memory
>>>> policies, we will end up allocating hot pages in the slower memory tier.
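(Not part of the quoted patch description, just an illustration of the call
this series is about. Assuming userspace headers new enough to define
MPOL_PREFERRED_MANY and MPOL_F_NUMA_BALANCING, a process would opt in roughly
as in the sketch below; the node numbers are made up, and on current kernels
this combination is rejected with EINVAL, which is exactly what the patch
under review changes.)

#include <numaif.h>	/* set_mempolicy() and the MPOL_* constants */
#include <stdio.h>

int main(void)
{
	/* Prefer the fast DRAM nodes 0 and 1 (illustrative node numbers). */
	unsigned long nodemask = (1UL << 0) | (1UL << 1);

	/* Ask for NUMA-balancing driven migration on top of the policy. */
	if (set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_NUMA_BALANCING,
			  &nodemask, 8 * sizeof(nodemask)))
		perror("set_mempolicy");	/* EINVAL without this series */

	return 0;
}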
>>>> MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add
>>>> MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better
>>>> allocation control when we have memory tiers in the system. With
>>>> MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only
>>>> of faster memory nodes. When we fail to allocate pages from the faster
>>>> memory node, kswapd would be woken up, allowing demotion of cold pages
>>>> to slower memory nodes.
>>>>
>>>> With the current kernel, such usage of memory policies implies we can't
>>>> do page promotion from a slower memory tier to a faster memory tier
>>>> using numa fault. This patch fixes this issue.
>>>>
>>>> For MPOL_PREFERRED_MANY, if the executing node is in the policy node
>>>> mask, we allow numa migration to the executing nodes. If the executing
>>>> node is not in the policy node mask but the folio is already allocated
>>>> based on policy preference (the folio node is in the policy node mask),
>>>> we don't allow numa migration. If both the executing node and folio node
>>>> are outside the policy node mask, we allow numa migration to the
>>>> executing nodes.
>>>>
>>>> Signed-off-by: Aneesh Kumar K.V (IBM)
>>>> Signed-off-by: Donet Tom
>>>> ---
>>>>  mm/mempolicy.c | 28 ++++++++++++++++++++++++++--
>>>>  1 file changed, 26 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>>>> index 73d698e21dae..8c4c92b10371 100644
>>>> --- a/mm/mempolicy.c
>>>> +++ b/mm/mempolicy.c
>>>> @@ -1458,9 +1458,10 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
>>>>  	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
>>>>  		return -EINVAL;
>>>>  	if (*flags & MPOL_F_NUMA_BALANCING) {
>>>> -		if (*mode != MPOL_BIND)
>>>> +		if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY)
>>>> +			*flags |= (MPOL_F_MOF | MPOL_F_MORON);
>>>> +		else
>>>>  			return -EINVAL;
>>>> -		*flags |= (MPOL_F_MOF | MPOL_F_MORON);
>>>>  	}
>>>>  	return 0;
>>>>  }
>>>> @@ -2463,6 +2464,23 @@ static void sp_free(struct sp_node *n)
>>>>  	kmem_cache_free(sn_cache, n);
>>>>  }
>>>>
>>>> +static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node,
>>>> +						       struct mempolicy *pol)
>>>> +{
>>>> +	/* if the executing node is in the policy node mask, migrate */
>>>> +	if (node_isset(exec_node, pol->nodes))
>>>> +		return true;
>>>> +
>>>> +	/* If the folio node is in policy node mask, don't migrate */
>>>> +	if (node_isset(folio_node, pol->nodes))
>>>> +		return false;
>>>> +	/*
>>>> +	 * both the folio node and executing node are outside the policy nodemask,
>>>> +	 * migrate as normal numa fault migration.
>>>> +	 */
>>>> +	return true;
>>>
>>> Why? This may cause some unexpected result. For example, pages may be
>>> distributed among multiple sockets unexpectedly. So, I prefer the more
>>> conservative policy, that is, only migrate if this node is in
>>> pol->nodes.
>>>
>>
>> This will only have an impact if the user specifies
>> MPOL_F_NUMA_BALANCING. This means that the user is explicitly requesting
>> that frequently accessed memory pages be migrated. Memory policy
>> MPOL_PREFERRED_MANY is able to allocate pages from nodes outside of
>> policy->nodes. For the specific use case that I am interested in, it
>> should be okay to restrict it to policy->nodes. However, I am wondering
>> if this is too restrictive given the definition of MPOL_PREFERRED_MANY.
>
> IMHO, we can start with some conservative way and expand it if it's
> proved necessary.
>

Is this good?

 1 file changed, 14 insertions(+), 34 deletions(-)
 mm/mempolicy.c | 48 ++++++++++++++----------------------------------

modified   mm/mempolicy.c
@@ -2464,23 +2464,6 @@ static void sp_free(struct sp_node *n)
 	kmem_cache_free(sn_cache, n);
 }
 
-static inline bool mpol_preferred_should_numa_migrate(int exec_node, int folio_node,
-						      struct mempolicy *pol)
-{
-	/* if the executing node is in the policy node mask, migrate */
-	if (node_isset(exec_node, pol->nodes))
-		return true;
-
-	/* If the folio node is in policy node mask, don't migrate */
-	if (node_isset(folio_node, pol->nodes))
-		return false;
-	/*
-	 * both the folio node and executing node are outside the policy nodemask,
-	 * migrate as normal numa fault migration.
-	 */
-	return true;
-}
-
 /**
  * mpol_misplaced - check whether current folio node is valid in policy
  *
@@ -2533,29 +2516,26 @@ int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
 		break;
 
 	case MPOL_BIND:
-		/* Optimize placement among multiple nodes via NUMA balancing */
+	case MPOL_PREFERRED_MANY:
+		/*
+		 * Even though MPOL_PREFERRED_MANY can allocate pages outside
+		 * policy nodemask we don't allow numa migration to nodes
+		 * outside policy nodemask for now. This is done so that if we
+		 * want demotion to slow memory to happen, before allocating
+		 * from some DRAM node say 'x', we will end up using a
+		 * MPOL_PREFERRED_MANY mask excluding node 'x'. In such scenario
+		 * we should not promote to node 'x' from slow memory node.
+		 */
 		if (pol->flags & MPOL_F_MORON) {
+			/*
+			 * Optimize placement among multiple nodes
+			 * via NUMA balancing
+			 */
 			if (node_isset(thisnid, pol->nodes))
 				break;
 			goto out;
 		}
-		if (node_isset(curnid, pol->nodes))
-			goto out;
-		z = first_zones_zonelist(
-			node_zonelist(thisnid, GFP_HIGHUSER),
-			gfp_zone(GFP_HIGHUSER),
-			&pol->nodes);
-		polnid = zone_to_nid(z->zone);
-		break;
-
-	case MPOL_PREFERRED_MANY:
-		if (pol->flags & MPOL_F_MORON) {
-			if (!mpol_preferred_should_numa_migrate(thisnid, curnid, pol))
-				goto out;
-			break;
-		}
-
 		/*
 		 * use current page if in policy nodemask,
 		 * else select nearest allowed node, if any.
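To spell out the behaviour the merged MPOL_BIND/MPOL_PREFERRED_MANY arm gives
us for MPOL_F_MORON, here is a small standalone sketch (plain userspace C,
not kernel code; the node layout is invented): promotion on a NUMA hint fault
happens only when the faulting CPU's node is inside the policy nodemask, so a
folio sitting on a slow-tier node is never promoted to a DRAM node that was
deliberately left out of the mask.

#include <stdbool.h>
#include <stdio.h>

/*
 * Restatement of the decision above: migrate the folio to the executing
 * node only if that node is in the policy nodemask; the folio's current
 * node no longer matters for the MPOL_F_MORON case.
 */
static bool numa_migrate_allowed(unsigned long pol_nodes, int exec_node)
{
	return pol_nodes & (1UL << exec_node);
}

int main(void)
{
	/* Invented layout: fast DRAM node 0 in the mask, DRAM node 1 and
	 * slow-tier node 2 outside it. */
	unsigned long pol_nodes = 1UL << 0;

	printf("fault from node 0: migrate=%d\n", numa_migrate_allowed(pol_nodes, 0)); /* 1 */
	printf("fault from node 1: migrate=%d\n", numa_migrate_allowed(pol_nodes, 1)); /* 0 */
	return 0;
}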