From mboxrd@z Thu Jan 1 00:00:00 1970
From: Li Zhijian
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	y-goto@fujitsu.com, Li Zhijian, Huang Ying, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	lkp@intel.com
Subject: [PATCH RFC v2] mm: memory-tiering: Fix PGPROMOTE_CANDIDATE accounting
Date: Wed, 25 Jun 2025 10:13:52 +0800
Message-ID: <20250625021352.2291544-1-lizhijian@fujitsu.com>
X-Mailer: git-send-email 2.41.0
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Goto-san reported confusing pgpromote statistics where the pgpromote_success
count significantly exceeded pgpromote_candidate.

On a system with three nodes (nodes 0-1: DRAM 4GB, node 2: NVDIMM 4GB):
 # Enable demotion only
 echo 1 > /sys/kernel/mm/numa/demotion_enabled
 numactl -m 0-1 memhog -r200 3500M >/dev/null &
 pid=$!
 sleep 2
 numactl memhog -r100 2500M >/dev/null &
 sleep 10
 kill -9 $pid # terminate the 1st memhog
 # Enable promotion
 echo 2 > /proc/sys/kernel/numa_balancing

After a few seconds, we observed `pgpromote_candidate < pgpromote_success`:
 $ grep -e pgpromote /proc/vmstat
 pgpromote_success 2579
 pgpromote_candidate 0

In this scenario, after terminating the first memhog, the conditions for
pgdat_free_space_enough() are quickly met, triggering promotion. However,
these migrated pages are only accounted for in PGPROMOTE_SUCCESS, not in
PGPROMOTE_CANDIDATE.

This update increments PGPROMOTE_CANDIDATE in the free space branch when a
promotion decision is made. Because the rate limit is computed from
PGPROMOTE_CANDIDATE, this may also change rate limit behavior: the limit
becomes easier to reach than before.

For example:
Rate Limit = 100 pages/sec
Scenario:
  T0: 90 free-space migrations
  T0+100ms: 20-page migration request

Before:
  Rate limit is *not* reached: 0 + 20 = 20 < 100
  PGPROMOTE_CANDIDATE: 20
After:
  Rate limit is reached: 90 + 20 = 110 > 100
  PGPROMOTE_CANDIDATE: 110

Because the rate limit window is recalculated every second, in theory only
the one-second window in which the top tier goes from
pgdat_free_space_enough() to !pgdat_free_space_enough() is affected.

Moreover, promotions taken through pgdat_free_space_enough() within that
one-second window were previously not restricted by the rate limit at all,
which in theory makes application latency more likely. With this change,
the rate limit is better enforced when the node transitions from
pgdat_free_space_enough() to !pgdat_free_space_enough() within one second.

Cc: Huang Ying
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Mel Gorman
Cc: Valentin Schneider
Reported-by: Yasunori Gotou (Fujitsu)
Signed-off-by: Li Zhijian
---
V2:
Fix compile error # Reported by LKP

As Ying suggested, we need to assess whether this change causes a
regression. However, given the narrow conditions under which this patch
takes effect, evaluating it properly may be challenging, and the outcome
depends on your perspective. Much like in a zero-sum game, if someone
benefits, another might lose.

If there are subsequent results, I will update them here.

Cc: lkp@intel.com
Here, I hope to leverage the existing LKP benchmark to evaluate the
potential impacts. The ideal evaluation conditions are:
1. Installed with DRAM + NVDIMM (which can be simulated).
2. NVDIMM is used as system RAM (configurable via daxctl).
3. Promotion is enabled (`echo 2 > /proc/sys/kernel/numa_balancing`).

Alternative:
We can also eliminate the potential impact in the
pgdat_free_space_enough() branch, so that the rate limit behavior remains
as before.

For instance, consider the following change:
                if (pgdat_free_space_enough(pgdat)) {
                        /* workload changed, reset hot threshold */
                        pgdat->nbp_threshold = 0;
+                       pgdat->nbp_rl_nr_cand += nr;
                        mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
                        return true;
                }

RFC:
I am uncertain whether this discrepancy was originally intended or simply
overlooked.

However, the current situation where pgpromote_candidate < pgpromote_success
is indeed confusing when interpreted literally.
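
For reviewers who want to replay the rate limit example from the changelog
and compare it with the alternative, here is a minimal user-space sketch.
It is not kernel code: struct node_model, enum accounting_mode,
free_space_promote(), rate_limited() and replay_example() are hypothetical
names made up for illustration, and the check is a simplified model that
assumes the limit compares (PGPROMOTE_CANDIDATE - nbp_rl_nr_cand) against
rate_limit within a single one-second window, without the window reset.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the per-node counters. */
struct node_model {
	unsigned long promote_candidate; /* models PGPROMOTE_CANDIDATE */
	unsigned long rl_nr_cand;        /* models pgdat->nbp_rl_nr_cand */
};

enum accounting_mode {
	MODE_BEFORE,      /* free-space promotions are not counted */
	MODE_PATCH,       /* counted in PGPROMOTE_CANDIDATE only */
	MODE_ALTERNATIVE, /* counted, and the window baseline advances too */
};

/* A promotion taken through the pgdat_free_space_enough() branch. */
static void free_space_promote(struct node_model *n, unsigned long nr,
			       enum accounting_mode mode)
{
	if (mode == MODE_BEFORE)
		return;
	if (mode == MODE_ALTERNATIVE)
		n->rl_nr_cand += nr;
	n->promote_candidate += nr;
}

/*
 * Simplified rate limit check inside one window (assumption: the window
 * baseline rl_nr_cand was sampled at the start of the window and is not
 * refreshed here). Returns true when the request is rate limited.
 */
static bool rate_limited(struct node_model *n, unsigned long rate_limit,
			 unsigned long nr)
{
	n->promote_candidate += nr;
	return n->promote_candidate - n->rl_nr_cand >= rate_limit;
}

struct outcome {
	bool limited;
	unsigned long promote_candidate;
};

/* Replays T0: 90 free-space migrations, T0+100ms: 20-page request. */
static struct outcome replay_example(enum accounting_mode mode)
{
	struct node_model n = { 0, 0 };
	struct outcome out;

	assert(mode == MODE_BEFORE || mode == MODE_PATCH ||
	       mode == MODE_ALTERNATIVE);

	free_space_promote(&n, 90, mode);
	out.limited = rate_limited(&n, 100, 20);
	out.promote_candidate = n.promote_candidate;
	return out;
}
```

Under these assumptions, replay_example(MODE_BEFORE) is not limited and
leaves the counter at 20, replay_example(MODE_PATCH) is limited with the
counter at 110, and replay_example(MODE_ALTERNATIVE) reaches 110 without
tripping the limit, because the baseline moves together with the counter.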
---
 kernel/sched/fair.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a14da5396fb..505b40f8897a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1940,11 +1940,13 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 		struct pglist_data *pgdat;
 		unsigned long rate_limit;
 		unsigned int latency, th, def_th;
+		long nr = folio_nr_pages(folio);
 
 		pgdat = NODE_DATA(dst_nid);
 		if (pgdat_free_space_enough(pgdat)) {
 			/* workload changed, reset hot threshold */
 			pgdat->nbp_threshold = 0;
+			mod_node_page_state(pgdat, PGPROMOTE_CANDIDATE, nr);
 			return true;
 		}
 
@@ -1958,8 +1960,7 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 		if (latency >= th)
 			return false;
 
-		return !numa_promotion_rate_limit(pgdat, rate_limit,
-						  folio_nr_pages(folio));
+		return !numa_promotion_rate_limit(pgdat, rate_limit, nr);
 	}
 
 	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
-- 
2.41.0