From: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
To: "Huang, Ying"
Cc: Donet Tom, Michal Hocko, Andrew Morton, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Dave Hansen, Mel Gorman, Ben Widawsky,
	Feng Tang, Andrea Arcangeli, Peter Zijlstra, Ingo Molnar,
	Rik van Riel, Johannes Weiner, Matthew Wilcox, Mike Kravetz,
	Vlastimil Babka, Dan Williams, Hugh Dickins, Kefeng Wang,
	Suren Baghdasaryan
Subject: Re: [PATCH 3/3] mm/numa_balancing: Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
In-Reply-To: <87y1bfoayd.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <9c3f7b743477560d1c5b12b8c111a584a2cc92ee.1708097962.git.donettom@linux.ibm.com>
	<8d7737208bd24e754dc7a538a3f7f02de84f1f72.1708097962.git.donettom@linux.ibm.com>
	<87bk8bprpr.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87y1bfoayd.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Tue, 20 Feb 2024 13:16:51 +0530
Message-ID: <87v86jzifo.fsf@kernel.org>

"Huang, Ying" writes:

> "Aneesh Kumar K.V" writes:
>
>> On 2/20/24 12:06 PM, Huang, Ying wrote:
>>> Donet Tom writes:
>>>
>>>> On 2/19/24 17:37, Michal Hocko wrote:
>>>>> On Sat 17-02-24 01:31:35, Donet Tom wrote:
>>>>>> commit bda420b98505 ("numa balancing: migrate on fault among multiple bound
>>>>>> nodes") added support for migrate on protnone reference with the MPOL_BIND
>>>>>> memory policy. This allowed numa fault migration when the executing node
>>>>>> is part of the policy mask for MPOL_BIND. This patch extends migration
>>>>>> support to the MPOL_PREFERRED_MANY policy.
>>>>>>
>>>>>> Currently, we cannot specify MPOL_PREFERRED_MANY with the mempolicy flag
>>>>>> MPOL_F_NUMA_BALANCING. This causes issues when we want to use
>>>>>> NUMA_BALANCING_MEMORY_TIERING. To effectively use the slow memory tier,
>>>>>> the kernel should not allocate pages from the slower memory tier via
>>>>>> allocation control zonelist fallback. Instead, we should move cold pages
>>>>>> from the faster memory node via memory demotion.
>>>>>> For a page allocation, kswapd is only woken up after we try to
>>>>>> allocate pages from all nodes in the allocation zone list. This
>>>>>> implies that, without using memory policies, we will end up
>>>>>> allocating hot pages in the slower memory tier.
>>>>>>
>>>>>> MPOL_PREFERRED_MANY was added by commit b27abaccf8e8 ("mm/mempolicy: add
>>>>>> MPOL_PREFERRED_MANY for multiple preferred nodes") to allow better
>>>>>> allocation control when we have memory tiers in the system. With
>>>>>> MPOL_PREFERRED_MANY, the user can use a policy node mask consisting only
>>>>>> of faster memory nodes. When we fail to allocate pages from the faster
>>>>>> memory node, kswapd would be woken up, allowing demotion of cold pages
>>>>>> to slower memory nodes.
>>>>>>
>>>>>> With the current kernel, such usage of memory policies implies we can't
>>>>>> do page promotion from a slower memory tier to a faster memory tier
>>>>>> using numa fault. This patch fixes this issue.
>>>>>>
>>>>>> For MPOL_PREFERRED_MANY, if the executing node is in the policy node
>>>>>> mask, we allow numa migration to the executing node. If the executing
>>>>>> node is not in the policy node mask but the folio is already allocated
>>>>>> based on policy preference (the folio node is in the policy node mask),
>>>>>> we don't allow numa migration. If both the executing node and the folio
>>>>>> node are outside the policy node mask, we allow numa migration to the
>>>>>> executing node.
>>>>> The feature makes sense to me. How has this been tested? Do you have any
>>>>> numbers to present?
>>>>
>>>> Hi Michal
>>>>
>>>> I have a test program which allocates memory on a specified node and
>>>> triggers promotion or migration by keeping the pages under access.
>>>>
>>>> Without this patch, promotion or migration was not happening with the
>>>> MPOL_PREFERRED_MANY policy; with this patch, I could see pages getting
>>>> migrated or promoted.
>>>>
>>>> My system has 2 CPU+DRAM nodes (Tier 1) and 1 PMEM node (Tier 2). In
>>>> the tables below, N0 and N1 are the Tier 1 nodes and N6 is the Tier 2
>>>> node. Exec_Node is the execution node, Policy is the set of nodes in
>>>> the policy nodemask, and "Curr Location Pages" is the node where the
>>>> pages reside before migration or promotion starts.
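
[Such a test can be sketched roughly as follows. This is an illustration
assuming libnuma, not the actual test program; the node numbers and the
buffer size are made up to match the tables below:]

#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MPOL_PREFERRED_MANY
#define MPOL_PREFERRED_MANY 5           /* older headers may lack these */
#endif
#ifndef MPOL_F_NUMA_BALANCING
#define MPOL_F_NUMA_BALANCING (1 << 13)
#endif

int main(void)
{
	size_t sz = 256UL << 20;
	unsigned long src = 1UL << 6;                 /* N6: slow tier  */
	unsigned long pref = (1UL << 0) | (1UL << 1); /* N0 N1: fast tier */

	numa_run_on_node(0);            /* execute on N0 */

	/* First-touch the buffer under MPOL_BIND so pages start on N6. */
	set_mempolicy(MPOL_BIND, &src, 8 * sizeof(src));
	char *buf = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 1, sz);

	/* Switch to MPOL_PREFERRED_MANY with numa balancing enabled;
	 * without the patch this set_mempolicy() call fails with EINVAL. */
	if (set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_NUMA_BALANCING,
			  &pref, 8 * sizeof(pref)))
		perror("set_mempolicy");

	/* Keep accessing the pages so NUMA hint faults can promote them. */
	for (;;)
		for (size_t i = 0; i < sz; i += 4096)
			buf[i]++;
}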
>>>>
>>>> Test Results
>>>> ------------
>>>> Scenario 1: the executing node is in the policy node mask
>>>> ==========================================================================
>>>> Exec_Node   Policy     Curr Location Pages   Observations
>>>> ==========================================================================
>>>> N0          N0 N1 N6   N1                    Pages Migrated from N1 to N0
>>>> N0          N0 N1 N6   N6                    Pages Promoted from N6 to N0
>>>> N0          N0 N1      N1                    Pages Migrated from N1 to N0
>>>> N0          N0 N1      N6                    Pages Promoted from N6 to N0
>>>>
>>>> Scenario 2: the folio node is in the policy node mask and the executing node is not
>>>> ==========================================================================
>>>> Exec_Node   Policy     Curr Location Pages   Observations
>>>> ==========================================================================
>>>> N0          N1 N6      N1                    Pages are not migrated to N0
>>>> N0          N1 N6      N6                    Pages are not migrated to N0
>>>> N0          N1         N1                    Pages are not migrated to N0
>>>>
>>>> Scenario 3: both the folio node and the executing node are outside the policy nodemask
>>>> ==========================================================================
>>>> Exec_Node   Policy     Curr Location Pages   Observations
>>>> ==========================================================================
>>>> N0          N1         N6                    Pages Promoted from N6 to N0
>>>> N0          N6         N1                    Pages Migrated from N1 to N0
>>>>
>>>
>>> Please use some benchmarks (e.g., redis + memtier) and show the
>>> proc-vmstat stats and benchmark score.
>>
>> Without this change, numa fault migration is not supported with the
>> MPOL_PREFERRED_MANY policy, so there is no performance comparison with
>> and without the patch. W.r.t. the effectiveness of numa fault
>> migration, that is a different topic from this patch.
>
> IIUC, the goal of the patch is to optimize performance, right?  If so,
> the benchmark score will help justify the change.
>

The objective is to enable the use of the MPOL_PREFERRED_MANY policy,
which is essential for the correct functioning of memory demotion in
conjunction with memory promotion. Once we can use memory promotion, we
should be able to observe the same benefits as those provided by numa
fault memory promotion. The actual benefit of numa fault migration
depends on various factors such as the speed of the slower memory
device, the access pattern of the application, etc. We are discussing
its effectiveness and how to reduce numa fault overhead in other
forums. However, we believe that this discussion should not hinder the
merging of this patch. This change is similar to commit bda420b98505
("numa balancing: migrate on fault among multiple bound nodes").

-aneesh
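
P.S. The migrate-on-fault rule that the tables above exercise can be
modeled in a few lines. This is an illustrative userspace model, not the
kernel change itself (the real check lives in mm/mempolicy.c and
operates on nodemask_t):

#include <stdbool.h>
#include <stdio.h>

/* Model of the MPOL_PREFERRED_MANY decision described in the commit
 * message; a plain bitmask stands in for the policy nodemask. */
static bool migrate_on_fault(unsigned long policy_mask,
			     int exec_node, int folio_node)
{
	/* Scenario 1: executing node is in the policy mask -> migrate. */
	if (policy_mask & (1UL << exec_node))
		return true;
	/* Scenario 2: folio already on a preferred node -> keep it. */
	if (policy_mask & (1UL << folio_node))
		return false;
	/* Scenario 3: both outside the policy mask -> migrate. */
	return true;
}

int main(void)
{
	unsigned long n0n1 = (1UL << 0) | (1UL << 1);
	unsigned long n1n6 = (1UL << 1) | (1UL << 6);

	/* Matches the tables above (exec node N0 in every case): */
	printf("%d\n", migrate_on_fault(n0n1, 0, 6));    /* 1: promote N6->N0 */
	printf("%d\n", migrate_on_fault(n1n6, 0, 1));    /* 0: stays on N1    */
	printf("%d\n", migrate_on_fault(1UL << 1, 0, 6)); /* 1: promote N6->N0 */
	return 0;
}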