Date: Fri, 27 Dec 2024 16:31:48 +0530
Subject: Re: [RFC v2 PATCH 5/5] migrate,sysfs: add pagecache promotion
From: Donet Tom
To: Gregory Price, linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, nehagholkar@meta.com, abhishekd@meta.com,
    kernel-team@meta.com, david@redhat.com, nphamcs@gmail.com,
    akpm@linux-foundation.org, hannes@cmpxchg.org, kbusch@meta.com,
    ying.huang@linux.alibaba.com
In-Reply-To: <20241210213744.2968-6-gourry@gourry.net>
References: <20241210213744.2968-1-gourry@gourry.net> <20241210213744.2968-6-gourry@gourry.net>

On 12/11/24 03:07, Gregory Price wrote:
> adds /sys/kernel/mm/numa/pagecache_promotion_enabled
>
> When page cache lands on lower tiers, there is no way for promotion
> to occur unless it becomes memory-mapped and exposed to NUMA hint
> faults. Just adding a mechanism to promote pages unconditionally,
> however, opens up a significant possibility of performance regressions.
>
> Similar to the `demotion_enabled` sysfs entry, provide a sysfs toggle
> to enable and disable page cache promotion. This option will enable
> opportunistic promotion of unmapped page cache during syscall access.
>
> This option is intended for operational conditions where demoted page
> cache will eventually contain memory which becomes hot - and where
> said memory is likely to cause performance issues due to being trapped
> on the lower tier of memory.
>
> A page cache folio is considered a promotion candidate when:
> 0) tiering and pagecache promotion are enabled
> 1) the folio resides on a node not in the top tier
> 2) the folio is already marked referenced and active
> 3) multiple accesses in (referenced & active) state occur quickly
>
> Since promotion is not safe to execute unconditionally from within
> folio_mark_accessed, we defer promotion to a new task_work captured
> in the task_struct.
> This ensures that the task doing the access has some hand in promoting
> pages - even among deduplicated read only files.
>
> We use numa_hint_fault_latency to help identify when a folio is accessed
> multiple times in a short period. Along with folio flag checks, this
> helps us minimize promoting pages on the first few accesses.
>
> The promotion node is always the local node of the promoting cpu.
>
> Suggested-by: Johannes Weiner
> Signed-off-by: Gregory Price
> ---
>  .../ABI/testing/sysfs-kernel-mm-numa | 20 +++++++
>  include/linux/memory-tiers.h         |  2 +
>  include/linux/migrate.h              |  2 +
>  include/linux/sched.h                |  3 +
>  include/linux/sched/numa_balancing.h |  5 ++
>  init/init_task.c                     |  1 +
>  kernel/sched/fair.c                  | 26 +++++++-
>  mm/memory-tiers.c                    | 27 +++++++++
>  mm/migrate.c                         | 59 +++++++++++++++++++
>  mm/swap.c                            |  3 +
>  10 files changed, 147 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-numa b/Documentation/ABI/testing/sysfs-kernel-mm-numa
> index 77e559d4ed80..b846e7d80cba 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-numa
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-numa
> @@ -22,3 +22,23 @@ Description: Enable/disable demoting pages during reclaim
>  		the guarantees of cpusets. This should not be enabled
>  		on systems which need strict cpuset location
>  		guarantees.
> +
> +What:		/sys/kernel/mm/numa/pagecache_promotion_enabled
> +Date:		November 2024
> +Contact:	Linux memory management mailing list
> +Description:	Enable/disable promoting pages during file access
> +
> +		Page migration during file access is intended for systems
> +		with tiered memory configurations that have significant
> +		unmapped file cache usage. By default, file cache memory
> +		on slower tiers will not be opportunistically promoted by
> +		normal NUMA hint faults, because the system has no way to
> +		track them. This option enables opportunistic promotion
> +		of pages that are accessed via syscall (e.g.
> +		read/write) if multiple accesses occur in quick succession.
> +
> +		It may move data to a NUMA node that does not fall into
> +		the cpuset of the allocating process which might be
> +		construed to violate the guarantees of cpusets. This
> +		should not be enabled on systems which need strict cpuset
> +		location guarantees.
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index 0dc0cf2863e2..fa96a67b8996 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -37,6 +37,7 @@ struct access_coordinate;
>
>  #ifdef CONFIG_NUMA
>  extern bool numa_demotion_enabled;
> +extern bool numa_pagecache_promotion_enabled;
>  extern struct memory_dev_type *default_dram_type;
>  extern nodemask_t default_dram_nodes;
>  struct memory_dev_type *alloc_memory_type(int adistance);
> @@ -76,6 +77,7 @@ static inline bool node_is_toptier(int node)
>  #else
>
>  #define numa_demotion_enabled	false
> +#define numa_pagecache_promotion_enabled	false
>  #define default_dram_type	NULL
>  #define default_dram_nodes	NODE_MASK_NONE
>  /*
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 29919faea2f1..cf58a97d4216 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -145,6 +145,7 @@ const struct movable_operations *page_movable_ops(struct page *page)
>  int migrate_misplaced_folio_prepare(struct folio *folio,
>  		struct vm_area_struct *vma, int node);
>  int migrate_misplaced_folio(struct folio *folio, int node);
> +void promotion_candidate(struct folio *folio);
>  #else
>  static inline int migrate_misplaced_folio_prepare(struct folio *folio,
>  		struct vm_area_struct *vma, int node)
> @@ -155,6 +156,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
>  {
>  	return -EAGAIN; /* can't migrate now */
>  }
> +static inline void promotion_candidate(struct folio *folio) { }
>  #endif /* CONFIG_NUMA_BALANCING */
>
>  #ifdef CONFIG_MIGRATION
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d380bffee2ef..faa84fb7a756 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1356,6 +1356,9 @@ struct task_struct {
>  	unsigned long numa_faults_locality[3];
>
>  	unsigned long numa_pages_migrated;
> +
> +	struct callback_head numa_promo_work;
> +	struct list_head promo_list;
>  #endif /* CONFIG_NUMA_BALANCING */
>
>  #ifdef CONFIG_RSEQ
> diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
> index 52b22c5c396d..cc7750d754ff 100644
> --- a/include/linux/sched/numa_balancing.h
> +++ b/include/linux/sched/numa_balancing.h
> @@ -32,6 +32,7 @@ extern void set_numabalancing_state(bool enabled);
>  extern void task_numa_free(struct task_struct *p, bool final);
>  bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>  		int src_nid, int dst_cpu);
> +int numa_hint_fault_latency(struct folio *folio);
>  #else
>  static inline void task_numa_fault(int last_node, int node, int pages,
>  		int flags)
> @@ -52,6 +53,10 @@ static inline bool should_numa_migrate_memory(struct task_struct *p,
>  {
>  	return true;
>  }
> +static inline int numa_hint_fault_latency(struct folio *folio)
> +{
> +	return 0;
> +}
>  #endif
>
>  #endif /* _LINUX_SCHED_NUMA_BALANCING_H */
> diff --git a/init/init_task.c b/init/init_task.c
> index e557f622bd90..f831980748c4 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -187,6 +187,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
>  	.numa_preferred_nid = NUMA_NO_NODE,
>  	.numa_group	= NULL,
>  	.numa_faults	= NULL,
> +	.promo_list	= LIST_HEAD_INIT(init_task.promo_list),
>  #endif
>  #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
>  	.kasan_depth = 1,
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a59ae2e23daf..047f02091773 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -42,6 +42,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
>  #include
> @@ -1842,7 +1843,7 @@ static bool pgdat_free_space_enough(struct pglist_data *pgdat)
>   * The smaller the hint page fault latency, the higher the possibility
>   * for the page to be hot.
>   */
> -static int numa_hint_fault_latency(struct folio *folio)
> +int numa_hint_fault_latency(struct folio *folio)
>  {
>  	int last_time, time;
>
> @@ -3534,6 +3535,27 @@ static void task_numa_work(struct callback_head *work)
>  	}
>  }
>
> +static void task_numa_promotion_work(struct callback_head *work)
> +{
> +	struct task_struct *p = current;
> +	struct list_head *promo_list = &p->promo_list;
> +	struct folio *folio, *tmp;
> +	int nid = numa_node_id();
> +
> +	SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_promo_work));
> +
> +	work->next = work;
> +
> +	if (list_empty(promo_list))
> +		return;
> +
> +	list_for_each_entry_safe(folio, tmp, promo_list, lru) {
> +		list_del_init(&folio->lru);
> +		migrate_misplaced_folio(folio, nid);
> +	}
> +}
> +
> +
>  void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
>  {
>  	int mm_users = 0;
> @@ -3558,8 +3580,10 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
>  	RCU_INIT_POINTER(p->numa_group, NULL);
>  	p->last_task_numa_placement = 0;
>  	p->last_sum_exec_runtime = 0;
> +	INIT_LIST_HEAD(&p->promo_list);
>
>  	init_task_work(&p->numa_work, task_numa_work);
> +	init_task_work(&p->numa_promo_work, task_numa_promotion_work);
>
>  	/* New address space, reset the preferred nid */
>  	if (!(clone_flags & CLONE_VM)) {
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index fc14fe53e9b7..4c44598e485e 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -935,6 +935,7 @@ static int __init memory_tier_init(void)
>  subsys_initcall(memory_tier_init);
>
>  bool numa_demotion_enabled = false;
> +bool numa_pagecache_promotion_enabled;
>
>  #ifdef CONFIG_MIGRATION
>  #ifdef CONFIG_SYSFS
> @@ -957,11 +958,37 @@ static ssize_t demotion_enabled_store(struct kobject *kobj,
>  	return count;
>  }
>
> +static ssize_t pagecache_promotion_enabled_show(struct kobject *kobj,
> +					struct kobj_attribute *attr,
> +					char *buf)
> +{
> +	return sysfs_emit(buf, "%s\n",
> +			numa_pagecache_promotion_enabled ? "true" : "false");
> +}
> +
> +static ssize_t pagecache_promotion_enabled_store(struct kobject *kobj,
> +					struct kobj_attribute *attr,
> +					const char *buf, size_t count)
> +{
> +	ssize_t ret;
> +
> +	ret = kstrtobool(buf, &numa_pagecache_promotion_enabled);
> +	if (ret)
> +		return ret;
> +
> +	return count;
> +}
> +
> +
>  static struct kobj_attribute numa_demotion_enabled_attr =
>  	__ATTR_RW(demotion_enabled);
>
> +static struct kobj_attribute numa_pagecache_promotion_enabled_attr =
> +	__ATTR_RW(pagecache_promotion_enabled);
> +
>  static struct attribute *numa_attrs[] = {
>  	&numa_demotion_enabled_attr.attr,
> +	&numa_pagecache_promotion_enabled_attr.attr,
>  	NULL,
>  };
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index af07b399060b..320258a1aaba 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -44,6 +44,8 @@
>  #include
>  #include
>  #include
> +#include
> +#include
>
>  #include
>
> @@ -2710,5 +2712,62 @@ int migrate_misplaced_folio(struct folio *folio, int node)
>  	BUG_ON(!list_empty(&migratepages));
>  	return nr_remaining ? -EAGAIN : 0;
>  }
> +
> +/**
> + * promotion_candidate() - report a promotion candidate folio
> + *
> + * @folio: The folio reported as a candidate
> + *
> + * Records folio access time and places the folio on the task promotion list
> + * if access time is less than the threshold. The folio will be isolated from
> + * LRU if selected, and task_work will putback the folio on promotion failure.
> + *
> + * If selected, takes a folio reference to be released in task work.
> + */
> +void promotion_candidate(struct folio *folio)
> +{
> +	struct task_struct *task = current;
> +	struct list_head *promo_list = &task->promo_list;
> +	struct callback_head *work = &task->numa_promo_work;
> +	struct address_space *mapping = folio_mapping(folio);
> +	bool write = mapping ? mapping->gfp_mask & __GFP_WRITE : false;
> +	int nid = folio_nid(folio);
> +	int flags, last_cpupid;
> +
> +	/*
> +	 * Only do this work if:
> +	 * 1) tiering and pagecache promotion are enabled
> +	 * 2) the page can actually be promoted
> +	 * 3) The hint-fault latency is relatively hot
> +	 * 4) the folio is not already isolated
> +	 * 5) This is not a kernel thread context
> +	 */
> +	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
> +	    !numa_pagecache_promotion_enabled ||
> +	    node_is_toptier(nid) ||
> +	    numa_hint_fault_latency(folio) >= PAGE_ACCESS_TIME_MASK ||
> +	    folio_test_isolated(folio) ||
> +	    (current->flags & PF_KTHREAD)) {
> +		return;
> +	}
> +
> +	nid = numa_migrate_check(folio, NULL, 0, &flags, write, &last_cpupid);
> +	if (nid == NUMA_NO_NODE)
> +		return;
> +
> +	if (migrate_misplaced_folio_prepare(folio, NULL, nid))
> +		return;
> +
> +	/*
> +	 * Ensure task can schedule work, otherwise we'll leak folios.
> +	 * If the list is not empty, task work has already been scheduled.
> +	 */
> +	if (list_empty(promo_list) && task_work_add(task, work, TWA_RESUME)) {
> +		folio_putback_lru(folio);
> +		return;
> +	}
> +	list_add(&folio->lru, promo_list);
> +}
> +EXPORT_SYMBOL(promotion_candidate);
>  #endif /* CONFIG_NUMA_BALANCING */
>  #endif /* CONFIG_NUMA */
> diff --git a/mm/swap.c b/mm/swap.c
> index 320b959b74c6..57909c349388 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -37,6 +37,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include "internal.h"
>
> @@ -469,6 +470,8 @@ void folio_mark_accessed(struct folio *folio)
>  		__lru_cache_activate_folio(folio);
>  		folio_clear_referenced(folio);
>  		workingset_activation(folio);
> +	} else {
> +		promotion_candidate(folio);
>  	}
>  	if (folio_test_idle(folio))
>  		folio_clear_idle(folio);

In the current implementation, promotion will not work if we enable
MGLRU, right? Is there any specific reason we are not enabling promotion
with MGLRU?