* [RFC PATCH v4 0/6] Promotion of Unmapped Page Cache Folios.
@ 2025-04-11 22:11 Gregory Price
2025-04-11 22:11 ` [RFC PATCH v4 1/6] migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA Gregory Price
` (7 more replies)
0 siblings, 8 replies; 19+ messages in thread
From: Gregory Price @ 2025-04-11 22:11 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, linux-kernel, kernel-team, akpm, mingo, peterz,
juri.lelli, vincent.guittot, hannes, mhocko, roman.gushchin,
shakeel.butt, donettom, Huang Ying, Keith Busch, Feng Tang,
Neha Gholkar
Unmapped page cache pages can be demoted to low-tier memory, but
they can presently only be promoted in two conditions:
1) The page is fully swapped out and re-faulted
2) The page becomes mapped (and exposed to NUMA hint faults)
This RFC proposes promoting unmapped page cache pages by using
folio_mark_accessed as a hotness hint for unmapped pages.
We show in a microbenchmark that this mechanism can increase
performance up to 23.5% compared to leaving page cache on the
low tier - when that page cache becomes excessively hot.
When disabled (NUMA tiering off), overhead in folio_mark_accessed
was limited to <1% in a worst case scenario (all work is file_read()).
Patches 1-2
allow NULL as valid input to the migration prep interfaces
for vmf/vma - which are not present for unmapped folios.
Patch 3
adds NUMA_HINT_PAGE_CACHE to vmstat
Patch 4
implements migrate_misplaced_folio_batch
Patch 5
adds the promotion mechanism, along with a sysfs
extension which defaults the behavior to off.
/sys/kernel/mm/numa/pagecache_promotion_enabled
Patch 6
adds the MGLRU implementation by Donet Tom
v4 Notes
===
- Added MGLRU implementation
- Dropped ifdef change patch after build testing
- Added worst-case scenario analysis (thrashing)
- Added FIO test analysis
Test Environment
================
1.5-3.7GHz CPU, ~4000 BogoMIPS,
1TB Machine with 768GB DRAM and 256GB CXL
FIO Test:
=========
We evaluated this with FIO with the page cache pre-loaded.
Step 1: Load 128GB file into page cache with a mempolicy
(dram, cxl, and cxl to promote)
Step 2: Run FIO with 4 reading jobs
Config does not invalidate pagecache between runs
Step 3: Repeat with a 900GB file that spills into CXL and
creates thrashing to show its impact.
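As a rough illustration of Step 1, the page cache can be pre-loaded onto a
specific node by reading the file under a bind policy (a minimal sketch
assuming numactl is available; node 0 as DRAM and node 1 as CXL are example
IDs, not part of the series):
  # Populate the page cache for test.data on the CXL node (node 1)
  numactl --membind=1 dd if=test.data of=/dev/null bs=1M
  # Or on the DRAM node (node 0) for the DRAM baseline
  numactl --membind=0 dd if=test.data of=/dev/null bs=1M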
Configuration:
[global]
bs=1M
size=128G # 900G
time_based=1
numjobs=4
rw=randread,randwrite
filename=test.data
direct=0
invalidate=0 # Do not drop the cache between runs
[test1]
ioengine=psync
iodepth=64
runtime=240s # Also did 480s, didn't show much difference
On promotion runs, vmstat reported that the entire file was promoted:
numa_pages_migrated 33554772 (128.01GB)
DRAM (128GB):
lat (usec) : 50=98.34%, 100=1.61%, 250=0.05%, 500=0.01%
cpu : usr=1.42%, sys=98.35%, ctx=213, majf=0, minf=264
bw=83.5GiB/s (89.7GB/s)
Remote (128GB)
lat (usec) : 100=91.78%, 250=8.22%, 500=0.01%
cpu : usr=0.66%, sys=99.13%, ctx=449, majf=0, minf=263
bw=41.4GiB/s (44.4GB/s)
Promo (128GB)
lat (usec) : 50=88.02%, 100=10.65%, 250=1.05%, 500=0.20%, 750=0.06%
lat (usec) : 1000=0.01%
lat (msec) : 2=0.01%
cpu : usr=1.44%, sys=97.72%, ctx=1679, majf=0, minf=265
bw=79.2GiB/s (85.0GB/s)
900GB Hot - No Promotion (~150GB spills to remote node via demotion)
lat (usec) : 50=69.26%, 100=13.79%, 250=16.95%, 500=0.01%
bw=64.5GiB/s (69.3GB/s)
900GB Hot - Promotion (Causes thrashing between demotion/promotion)
lat (usec) : 50=47.79%, 100=29.59%, 250=21.16%, 500=1.24%, 750=0.04%
lat (usec) : 1000=0.03%
lat (msec) : 2=0.15%, 4=0.01%
bw=47.6GiB/s (51.1GB/s)
900GB Hot - No remote memory (Fault-in/page-out of read-only data)
lat (usec) : 50=39.73%, 100=31.34%, 250=4.71%, 500=0.01%, 750=0.01%
lat (usec) : 1000=1.78%
lat (msec) : 2=21.63%, 4=0.81%, 10=0.01%, 20=0.01%
bw=10.2GiB/s (10.9GB/s)
Obviously some portion of the overhead comes from migration, but the
results here are pretty dramatic. We can regain nearly all of the
performance in a degenerate scenario (demoted page cache becomes hot)
by turning on promotion - even temporarily.
In the scenario where the entire workload is hot, turning on promotion
causes thrashing, and we hurt performance.
So this feature is useful in one of two scenarios:
1) Headroom on DRAM is available and we opportunistically move page
cache to the higher tier as it's accessed, or
2) A lower-performance node becomes unevenly pressured.
It doesn't help us if the higher node is pressured.
For example, it would be nice for a userland daemon to detect that
DRAM-tier memory has become available, and to flip the bit so that any
hotter page cache migrates up a level until DRAM becomes pressured
again. Cold page cache stays put and new allocations still occur on the
fast tier.
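A minimal sketch of such a daemon (illustrative only - the DRAM node ID,
polling interval, and free-memory threshold below are assumptions, not part
of this series):
  #!/bin/bash
  # Toggle pagecache promotion based on free memory on the DRAM node (node0).
  KNOB=/sys/kernel/mm/numa/pagecache_promotion_enabled
  THRESHOLD_KB=$((64 * 1024 * 1024))   # require >64GB free on DRAM (example)
  while sleep 10; do
      free_kb=$(awk '/MemFree/ {print $4}' /sys/devices/system/node/node0/meminfo)
      if [ "$free_kb" -gt "$THRESHOLD_KB" ]; then
          echo 1 > "$KNOB"
      else
          echo 0 > "$KNOB"
      fi
  done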
Worst Case Scenario Test (Micro-benchmark)
==========================================
Goal:
Generate promotions and demonstrate the upper bound on performance
overhead and gain/loss.
System Settings:
CXL Memory in ZONE_MOVABLE (no fallback alloc, demotion-only use)
echo 2 > /proc/sys/kernel/numa_balancing
echo 1 > /sys/kernel/mm/numa/pagecache_promotion_enabled
echo 1 > /sys/kernel/mm/numa/demotion_enabled
Test process:
In each test, we do a linear read of a 128GB file into a buffer
in a loop. To place the page cache in CXL, we mbind to the CXL node
prior to the CXL test runs and read the file once. The overhead of
allocating the buffer and initializing the memory into CXL is omitted
from the test runs.
1) file allocated in DRAM with mechanisms off
2) file allocated in DRAM with balancing on but promotion off
3) file allocated in DRAM with balancing and promotion on
(promotion check is negative because all pages are top tier)
4) file allocated in CXL with mechanisms off
5) file allocated in CXL with mechanisms on
Each test was run with 50 read cycles and averaged (where relevant)
to account for system noise. This number of cycles gives the promotion
mechanism time to promote the vast majority of memory (usually <1MB
remaining in worst case).
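A rough sketch of one CXL test run (illustrative only - numactl is used here
in place of the mbind-based preload described above, and node 1 as the CXL
node is an example):
  # Pre-load the file's page cache on the CXL node, enable the mechanisms,
  # then time 50 sequential read cycles.
  numactl --membind=1 dd if=testfile of=/dev/null bs=1M status=none
  echo 2 > /proc/sys/kernel/numa_balancing
  echo 1 > /sys/kernel/mm/numa/pagecache_promotion_enabled
  echo 1 > /sys/kernel/mm/numa/demotion_enabled
  for i in $(seq 50); do
      /usr/bin/time -f "%e" dd if=testfile of=/dev/null bs=1M status=none
  done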
Tests 2 and 3 measure the upper bound on the overhead of the new checks when
there are no pages to migrate but work is dominated by file_read().
| 1 | 2 | 3 | 4 | 5 |
| DRAM Base | Promo On | TopTier Chk | CXL Base | Post-Promotion |
| 7.5804 | 7.7586 | 7.9726 | 9.75 | 7.8941 |
Baseline DRAM vs Baseline CXL shows a ~28% overhead just allowing the
file to remain on CXL, while after promotion, we see the performance
trend back towards the overhead of the TopTier check time - a total
overhead reduction of ~84% (or ~5% overhead down from ~23.5%).
During promotion, we do see overhead which eventually tapers off over
time. Here is a sample of the first 10 cycles during which promotion
is the most aggressive, which shows overhead drops off dramatically
as the majority of memory is migrated to the top tier.
12.79, 12.52, 12.33, 12.03, 11.81, 11.58, 11.36, 11.1, 8, 7.96
After promotion, turning the mechanism off via sysfs increased the
overall performance back to the DRAM baseline. The slight (~1%)
increase between post-migration performance and the baseline mechanism
overhead check appears to be general variance as similar times were
observed during the baseline checks on subsequent runs.
The mechanism itself represents a ~2.5% overhead in a worst-case
scenario (all work is file_read(), all pages are in DRAM, all pages are
hot - which is highly unrealistic). This is inclusive of any overhead
added by the new checks.
Development History and Notes
=======================================
During development, we explored the following proposals:
1) directly promoting within folio_mark_accessed (FMA)
Originally suggested by Johannes Weiner
https://lore.kernel.org/all/20240803094715.23900-1-gourry@gourry.net/
This caused deadlocks due to the fact that the PTL was held
in a variety of cases - but in particular during task exit.
It is also incredibly inflexible and causes promotion-on-fault.
It was discussed that a deferral mechanism was preferred.
2) promoting in filemap.c locations (callers of FMA)
Originally proposed by Feng Tang and Ying Huang
https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/patch/?id=5f2e64ce75c0322602c2ec8c70b64bb69b1f1329
First, we saw this as less problematic than directly hooking FMA,
but we realized this has the potential to miss data in a variety of
locations: swap.c, memory.c, gup.c, ksm.c, paddr.c - etc.
Second, we discovered that the lock state of pages is very subtle,
and that these locations in filemap.c can be called in an atomic
context. Prototypes led to a variety of stalls and lockups.
3) a new LRU - originally proposed by Keith Busch
https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/patch/?id=6616afe9a722f6ebedbb27ade3848cf07b9a3af7
There are two issues with this approach: PG_promotable and reclaim.
First - PG_promotable, a new page flag, has generally been discouraged.
Second - attaching this mechanism to an LRU is both backwards and
counter-intuitive. A promotable list is better served by a MOST
recently used list, and since LRUs are generally only shrunk when
exposed to pressure, it would require implementing a new promotion
list shrinker that runs separately from the existing reclaim logic.
4) Adding a separate kthread - suggested by many
This is - to an extent - a more general version of the LRU proposal.
We still have to track the folios - which likely requires the
addition of a page flag. Additionally, this method would actually
contend pretty heavily with LRU behavior - i.e. we'd want to
throttle addition to the promotion candidate list in some scenarios.
5) Doing it in task work
This seemed to be the most realistic after considering the above.
We observe the following:
- FMA is an ideal hook for this and isolation is safe here
- the new promotion_candidate function is an ideal hook for new
filter logic (throttling, fairness, etc).
- isolated folios are either promoted or put back on task resume;
there are no additional concurrency mechanics to worry about
- The mechanic can be made optional via a sysfs hook to avoid
overhead in degenerate scenarios (thrashing).
Suggested-by: Huang Ying <ying.huang@linux.alibaba.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Keith Busch <kbusch@meta.com>
Suggested-by: Feng Tang <feng.tang@intel.com>
Tested-by: Neha Gholkar <nehagholkar@meta.com>
Signed-off-by: Gregory Price <gourry@gourry.net>
Co-developed-by: Donet Tom <donettom@linux.ibm.com>
Donet Tom (1):
mm/swap.c: Enable promotion of unmapped MGLRU page cache pages
Gregory Price (5):
migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA.
memory: allow non-fault migration in numa_migrate_check path
vmstat: add page-cache numa hints
migrate: implement migrate_misplaced_folio_batch
migrate,sysfs: add pagecache promotion
.../ABI/testing/sysfs-kernel-mm-numa | 20 +++++
include/linux/memory-tiers.h | 2 +
include/linux/migrate.h | 11 +++
include/linux/sched.h | 4 +
include/linux/sched/sysctl.h | 1 +
include/linux/vm_event_item.h | 8 ++
init/init_task.c | 2 +
kernel/sched/fair.c | 24 ++++-
mm/memcontrol.c | 1 +
mm/memory-tiers.c | 27 ++++++
mm/memory.c | 30 ++++---
mm/mempolicy.c | 25 ++++--
mm/migrate.c | 88 ++++++++++++++++++-
mm/swap.c | 15 +++-
mm/vmstat.c | 2 +
15 files changed, 236 insertions(+), 24 deletions(-)
--
2.49.0
* [RFC PATCH v4 1/6] migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA.
2025-04-11 22:11 [RFC PATCH v4 0/6] Promotion of Unmapped Page Cache Folios Gregory Price
@ 2025-04-11 22:11 ` Gregory Price
2025-04-15 0:12 ` SeongJae Park
2025-04-11 22:11 ` [RFC PATCH v4 2/6] memory: allow non-fault migration in numa_migrate_check path Gregory Price
` (6 subsequent siblings)
7 siblings, 1 reply; 19+ messages in thread
From: Gregory Price @ 2025-04-11 22:11 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, linux-kernel, kernel-team, akpm, mingo, peterz,
juri.lelli, vincent.guittot, hannes, mhocko, roman.gushchin,
shakeel.butt, donettom
migrate_misplaced_folio_prepare() may be called on a folio without
a VMA, and so it must be made to accept a NULL VMA.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
mm/migrate.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index f3ee6d8d5e2e..047131f6c839 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2654,7 +2654,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
* See folio_maybe_mapped_shared() on possible imprecision
* when we cannot easily detect if a folio is shared.
*/
- if ((vma->vm_flags & VM_EXEC) && folio_maybe_mapped_shared(folio))
+ if (vma && (vma->vm_flags & VM_EXEC) && folio_maybe_mapped_shared(folio))
return -EACCES;
/*
--
2.49.0
* [RFC PATCH v4 2/6] memory: allow non-fault migration in numa_migrate_check path
2025-04-11 22:11 [RFC PATCH v4 0/6] Promotion of Unmapped Page Cache Folios Gregory Price
2025-04-11 22:11 ` [RFC PATCH v4 1/6] migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA Gregory Price
@ 2025-04-11 22:11 ` Gregory Price
2025-04-11 22:11 ` [RFC PATCH v4 3/6] vmstat: add page-cache numa hints Gregory Price
` (5 subsequent siblings)
7 siblings, 0 replies; 19+ messages in thread
From: Gregory Price @ 2025-04-11 22:11 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, linux-kernel, kernel-team, akpm, mingo, peterz,
juri.lelli, vincent.guittot, hannes, mhocko, roman.gushchin,
shakeel.butt, donettom
numa_migrate_check and mpol_misplaced presume callers are in the
fault path with access to a VMA. To enable migrations from the page
cache, re-using the same logic to handle migration prep is preferable.
Mildly refactor numa_migrate_check and mpol_misplaced so that they may
be called with (vmf = NULL) from non-faulting paths.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
mm/memory.c | 24 ++++++++++++++----------
mm/mempolicy.c | 25 +++++++++++++++++--------
2 files changed, 31 insertions(+), 18 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 3900225d99c5..e72b0d8df647 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5665,7 +5665,20 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
unsigned long addr, int *flags,
bool writable, int *last_cpupid)
{
- struct vm_area_struct *vma = vmf->vma;
+ if (vmf) {
+ struct vm_area_struct *vma = vmf->vma;
+ const vm_flags_t vmflags = vma->vm_flags;
+
+ /*
+ * Flag if the folio is shared between multiple address spaces. This
+ * is later used when determining whether to group tasks together
+ */
+ if (folio_maybe_mapped_shared(folio) && (vmflags & VM_SHARED))
+ *flags |= TNF_SHARED;
+
+ /* Record the current PID acceesing VMA */
+ vma_set_access_pid_bit(vma);
+ }
/*
* Avoid grouping on RO pages in general. RO pages shouldn't hurt as
@@ -5678,12 +5691,6 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
if (!writable)
*flags |= TNF_NO_GROUP;
- /*
- * Flag if the folio is shared between multiple address spaces. This
- * is later used when determining whether to group tasks together
- */
- if (folio_maybe_mapped_shared(folio) && (vma->vm_flags & VM_SHARED))
- *flags |= TNF_SHARED;
/*
* For memory tiering mode, cpupid of slow memory page is used
* to record page access time. So use default value.
@@ -5693,9 +5700,6 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
else
*last_cpupid = folio_last_cpupid(folio);
- /* Record the current PID acceesing VMA */
- vma_set_access_pid_bit(vma);
-
count_vm_numa_event(NUMA_HINT_FAULTS);
#ifdef CONFIG_NUMA_BALANCING
count_memcg_folio_events(folio, NUMA_HINT_FAULTS, 1);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 530e71fe9147..f86a4a9087f4 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2747,12 +2747,16 @@ static void sp_free(struct sp_node *n)
* mpol_misplaced - check whether current folio node is valid in policy
*
* @folio: folio to be checked
- * @vmf: structure describing the fault
+ * @vmf: structure describing the fault (NULL if called outside fault path)
* @addr: virtual address in @vma for shared policy lookup and interleave policy
+ * Ignored if vmf is NULL.
*
* Lookup current policy node id for vma,addr and "compare to" folio's
- * node id. Policy determination "mimics" alloc_page_vma().
- * Called from fault path where we know the vma and faulting address.
+ * node id - or task's policy node id if vmf is NULL. Policy determination
+ * "mimics" alloc_page_vma().
+ *
+ * vmf must be non-NULL if called from fault path where we know the vma and
+ * faulting address. The PTL must be held by caller if vmf is not NULL.
*
* Return: NUMA_NO_NODE if the page is in a node that is valid for this
* policy, or a suitable node ID to allocate a replacement folio from.
@@ -2764,7 +2768,6 @@ int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
pgoff_t ilx;
struct zoneref *z;
int curnid = folio_nid(folio);
- struct vm_area_struct *vma = vmf->vma;
int thiscpu = raw_smp_processor_id();
int thisnid = numa_node_id();
int polnid = NUMA_NO_NODE;
@@ -2774,18 +2777,24 @@ int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
* Make sure ptl is held so that we don't preempt and we
* have a stable smp processor id
*/
- lockdep_assert_held(vmf->ptl);
- pol = get_vma_policy(vma, addr, folio_order(folio), &ilx);
+ if (vmf) {
+ lockdep_assert_held(vmf->ptl);
+ pol = get_vma_policy(vmf->vma, addr, folio_order(folio), &ilx);
+ } else {
+ pol = get_task_policy(current);
+ }
if (!(pol->flags & MPOL_F_MOF))
goto out;
switch (pol->mode) {
case MPOL_INTERLEAVE:
- polnid = interleave_nid(pol, ilx);
+ polnid = vmf ? interleave_nid(pol, ilx) :
+ interleave_nodes(pol);
break;
case MPOL_WEIGHTED_INTERLEAVE:
- polnid = weighted_interleave_nid(pol, ilx);
+ polnid = vmf ? weighted_interleave_nid(pol, ilx) :
+ weighted_interleave_nodes(pol);
break;
case MPOL_PREFERRED:
--
2.49.0
* [RFC PATCH v4 3/6] vmstat: add page-cache numa hints
2025-04-11 22:11 [RFC PATCH v4 0/6] Promotion of Unmapped Page Cache Folios Gregory Price
2025-04-11 22:11 ` [RFC PATCH v4 1/6] migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA Gregory Price
2025-04-11 22:11 ` [RFC PATCH v4 2/6] memory: allow non-fault migration in numa_migrate_check path Gregory Price
@ 2025-04-11 22:11 ` Gregory Price
2025-04-11 22:11 ` [RFC PATCH v4 4/6] migrate: implement migrate_misplaced_folio_batch Gregory Price
` (4 subsequent siblings)
7 siblings, 0 replies; 19+ messages in thread
From: Gregory Price @ 2025-04-11 22:11 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, linux-kernel, kernel-team, akpm, mingo, peterz,
juri.lelli, vincent.guittot, hannes, mhocko, roman.gushchin,
shakeel.butt, donettom
Count non-page-fault events as page-cache numa hints instead of
fault hints in vmstat. Add a define to select the hint type to
keep the code clean.
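Once applied, the new counters appear alongside the existing hint-fault
counters and can be read from /proc/vmstat, for example:
  grep -E 'numa_hint_(faults|page_cache)' /proc/vmstat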
Signed-off-by: Gregory Price <gourry@gourry.net>
---
include/linux/vm_event_item.h | 8 ++++++++
mm/memcontrol.c | 1 +
mm/memory.c | 6 +++---
mm/vmstat.c | 2 ++
4 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index f11b6fa9c5b3..fa66d784c9ec 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -65,6 +65,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
NUMA_HUGE_PTE_UPDATES,
NUMA_HINT_FAULTS,
NUMA_HINT_FAULTS_LOCAL,
+ NUMA_HINT_PAGE_CACHE,
+ NUMA_HINT_PAGE_CACHE_LOCAL,
NUMA_PAGE_MIGRATE,
#endif
#ifdef CONFIG_MIGRATION
@@ -187,6 +189,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
NR_VM_EVENT_ITEMS
};
+#ifdef CONFIG_NUMA_BALANCING
+#define NUMA_HINT_TYPE(vmf) (vmf ? NUMA_HINT_FAULTS : NUMA_HINT_PAGE_CACHE)
+#define NUMA_HINT_TYPE_LOCAL(vmf) (vmf ? NUMA_HINT_FAULTS_LOCAL : \
+ NUMA_HINT_PAGE_CACHE_LOCAL)
+#endif
+
#ifndef CONFIG_TRANSPARENT_HUGEPAGE
#define THP_FILE_ALLOC ({ BUILD_BUG(); 0; })
#define THP_FILE_FALLBACK ({ BUILD_BUG(); 0; })
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 40c07b8699ae..d50f7522863c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -463,6 +463,7 @@ static const unsigned int memcg_vm_event_stat[] = {
NUMA_PAGE_MIGRATE,
NUMA_PTE_UPDATES,
NUMA_HINT_FAULTS,
+ NUMA_HINT_PAGE_CACHE,
#endif
};
diff --git a/mm/memory.c b/mm/memory.c
index e72b0d8df647..8d3257ee9ab1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5700,12 +5700,12 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
else
*last_cpupid = folio_last_cpupid(folio);
- count_vm_numa_event(NUMA_HINT_FAULTS);
+ count_vm_numa_event(NUMA_HINT_TYPE(vmf));
#ifdef CONFIG_NUMA_BALANCING
- count_memcg_folio_events(folio, NUMA_HINT_FAULTS, 1);
+ count_memcg_folio_events(folio, NUMA_HINT_TYPE(vmf), 1);
#endif
if (folio_nid(folio) == numa_node_id()) {
- count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
+ count_vm_numa_event(NUMA_HINT_TYPE_LOCAL(vmf));
*flags |= TNF_FAULT_LOCAL;
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ab5c840941f3..0f1cc0f2c68f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1343,6 +1343,8 @@ const char * const vmstat_text[] = {
"numa_huge_pte_updates",
"numa_hint_faults",
"numa_hint_faults_local",
+ "numa_hint_page_cache",
+ "numa_hint_page_cache_local",
"numa_pages_migrated",
#endif
#ifdef CONFIG_MIGRATION
--
2.49.0
* [RFC PATCH v4 4/6] migrate: implement migrate_misplaced_folio_batch
2025-04-11 22:11 [RFC PATCH v4 0/6] Promotion of Unmapped Page Cache Folios Gregory Price
` (2 preceding siblings ...)
2025-04-11 22:11 ` [RFC PATCH v4 3/6] vmstat: add page-cache numa hints Gregory Price
@ 2025-04-11 22:11 ` Gregory Price
2025-04-15 0:19 ` SeongJae Park
2025-04-11 22:11 ` [RFC PATCH v4 5/6] migrate,sysfs: add pagecache promotion Gregory Price
` (3 subsequent siblings)
7 siblings, 1 reply; 19+ messages in thread
From: Gregory Price @ 2025-04-11 22:11 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, linux-kernel, kernel-team, akpm, mingo, peterz,
juri.lelli, vincent.guittot, hannes, mhocko, roman.gushchin,
shakeel.butt, donettom
A common operation in tiering is to migrate multiple pages at once.
The migrate_misplaced_folio function requires one call for each
individual folio. Expose a batch-variant of the same call for use
when doing batch migrations.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
include/linux/migrate.h | 6 ++++++
mm/migrate.c | 31 +++++++++++++++++++++++++++++++
2 files changed, 37 insertions(+)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 61899ec7a9a3..2df756128316 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -145,6 +145,7 @@ const struct movable_operations *page_movable_ops(struct page *page)
int migrate_misplaced_folio_prepare(struct folio *folio,
struct vm_area_struct *vma, int node);
int migrate_misplaced_folio(struct folio *folio, int node);
+int migrate_misplaced_folio_batch(struct list_head *foliolist, int node);
#else
static inline int migrate_misplaced_folio_prepare(struct folio *folio,
struct vm_area_struct *vma, int node)
@@ -155,6 +156,11 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
{
return -EAGAIN; /* can't migrate now */
}
+static inline int migrate_misplaced_folio_batch(struct list_head *foliolist,
+ int node)
+{
+ return -EAGAIN; /* can't migrate now */
+}
#endif /* CONFIG_NUMA_BALANCING */
#ifdef CONFIG_MIGRATION
diff --git a/mm/migrate.c b/mm/migrate.c
index 047131f6c839..7e1ba6001596 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2731,5 +2731,36 @@ int migrate_misplaced_folio(struct folio *folio, int node)
BUG_ON(!list_empty(&migratepages));
return nr_remaining ? -EAGAIN : 0;
}
+
+/*
+ * Batch variant of migrate_misplaced_folio. Attempts to migrate
+ * a folio list to the specified destination.
+ *
+ * Caller is expected to have isolated the folios by calling
+ * migrate_misplaced_folio_prepare(), which will result in an
+ * elevated reference count on the folio.
+ *
+ * This function will un-isolate the folios, dereference them, and
+ * remove them from the list before returning.
+ */
+int migrate_misplaced_folio_batch(struct list_head *folio_list, int node)
+{
+ pg_data_t *pgdat = NODE_DATA(node);
+ unsigned int nr_succeeded;
+ int nr_remaining;
+
+ nr_remaining = migrate_pages(folio_list, alloc_misplaced_dst_folio,
+ NULL, node, MIGRATE_ASYNC,
+ MR_NUMA_MISPLACED, &nr_succeeded);
+ if (nr_remaining)
+ putback_movable_pages(folio_list);
+
+ if (nr_succeeded) {
+ count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
+ mod_node_page_state(pgdat, PGPROMOTE_SUCCESS, nr_succeeded);
+ }
+ BUG_ON(!list_empty(folio_list));
+ return nr_remaining ? -EAGAIN : 0;
+}
#endif /* CONFIG_NUMA_BALANCING */
#endif /* CONFIG_NUMA */
--
2.49.0
* [RFC PATCH v4 5/6] migrate,sysfs: add pagecache promotion
2025-04-11 22:11 [RFC PATCH v4 0/6] Promotion of Unmapped Page Cache Folios Gregory Price
` (3 preceding siblings ...)
2025-04-11 22:11 ` [RFC PATCH v4 4/6] migrate: implement migrate_misplaced_folio_batch Gregory Price
@ 2025-04-11 22:11 ` Gregory Price
2025-04-15 0:41 ` SeongJae Park
2025-04-11 22:11 ` [RFC PATCH v4 6/6] mm/swap.c: Enable promotion of unmapped MGLRU page cache pages Gregory Price
` (2 subsequent siblings)
7 siblings, 1 reply; 19+ messages in thread
From: Gregory Price @ 2025-04-11 22:11 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, linux-kernel, kernel-team, akpm, mingo, peterz,
juri.lelli, vincent.guittot, hannes, mhocko, roman.gushchin,
shakeel.butt, donettom
adds /sys/kernel/mm/numa/pagecache_promotion_enabled
When page cache lands on lower tiers, there is no way for promotion
to occur unless it becomes memory-mapped and exposed to NUMA hint
faults. Just adding a mechanism to promote pages unconditionally,
however, opens up significant possibility of performance regressions.
Similar to the `demotion_enabled` sysfs entry, provide a sysfs toggle
to enable and disable page cache promotion. This option will enable
opportunistic promotion of unmapped page cache during syscall access.
This option is intended for operational conditions where demoted page
cache will eventually contain memory which becomes hot - and where
said memory is likely to cause performance issues due to being trapped
on the lower tier of memory.
A page cache folio is considered a promotion candidate when:
0) tiering and pagecache-promotion are enabled
1) the folio resides on a node not in the top tier
2) the folio is already marked referenced and active.
3) Multiple accesses in (referenced & active) state occur quickly.
Since promotion is not safe to execute unconditionally from within
folio_mark_accessed, we defer promotion to a new task_work captured
in the task_struct. This ensures that the task doing the access has
some hand in promoting pages - even among deduplicated read-only files.
We limit the total number of folios on the promotion list to the
promotion rate limit to bound the amount of inline work done during
large reads - avoiding significant overhead. We do not use the existing
rate-limit check function since the rate limit is checked during
migration anyway.
The promotion node is always the local node of the promoting cpu.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
.../ABI/testing/sysfs-kernel-mm-numa | 20 +++++++
include/linux/memory-tiers.h | 2 +
include/linux/migrate.h | 5 ++
include/linux/sched.h | 4 ++
include/linux/sched/sysctl.h | 1 +
init/init_task.c | 2 +
kernel/sched/fair.c | 24 +++++++-
mm/memory-tiers.c | 27 +++++++++
mm/migrate.c | 55 +++++++++++++++++++
mm/swap.c | 8 +++
10 files changed, 147 insertions(+), 1 deletion(-)
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-numa b/Documentation/ABI/testing/sysfs-kernel-mm-numa
index 77e559d4ed80..ebb041891db2 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-numa
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-numa
@@ -22,3 +22,23 @@ Description: Enable/disable demoting pages during reclaim
the guarantees of cpusets. This should not be enabled
on systems which need strict cpuset location
guarantees.
+
+What: /sys/kernel/mm/numa/pagecache_promotion_enabled
+Date: January 2025
+Contact: Linux memory management mailing list <linux-mm@kvack.org>
+Description: Enable/disable promoting pages during file access
+
+ Page migration during file access is intended for systems
+ with tiered memory configurations that have significant
+ unmapped file cache usage. By default, file cache memory
+ on slower tiers will not be opportunistically promoted by
+ normal NUMA hint faults, because the system has no way to
+ track them. This option enables opportunistic promotion
+ of pages that are accessed via syscall (e.g. read/write)
+ if multiple accesses occur in quick succession.
+
+ It may move data to a NUMA node that does not fall into
+ the cpuset of the allocating process which might be
+ construed to violate the guarantees of cpusets. This
+ should not be enabled on systems which need strict cpuset
+ location guarantees.
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 0dc0cf2863e2..fa96a67b8996 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -37,6 +37,7 @@ struct access_coordinate;
#ifdef CONFIG_NUMA
extern bool numa_demotion_enabled;
+extern bool numa_pagecache_promotion_enabled;
extern struct memory_dev_type *default_dram_type;
extern nodemask_t default_dram_nodes;
struct memory_dev_type *alloc_memory_type(int adistance);
@@ -76,6 +77,7 @@ static inline bool node_is_toptier(int node)
#else
#define numa_demotion_enabled false
+#define numa_pagecache_promotion_enabled false
#define default_dram_type NULL
#define default_dram_nodes NODE_MASK_NONE
/*
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 2df756128316..3f8f30ae3a67 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -146,6 +146,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
struct vm_area_struct *vma, int node);
int migrate_misplaced_folio(struct folio *folio, int node);
int migrate_misplaced_folio_batch(struct list_head *foliolist, int node);
+void promotion_candidate(struct folio *folio);
#else
static inline int migrate_misplaced_folio_prepare(struct folio *folio,
struct vm_area_struct *vma, int node)
@@ -161,6 +162,10 @@ static inline int migrate_misplaced_folio_batch(struct list_head *foliolist,
{
return -EAGAIN; /* can't migrate now */
}
+static inline void promotion_candidate(struct folio *folio)
+{
+ return;
+}
#endif /* CONFIG_NUMA_BALANCING */
#ifdef CONFIG_MIGRATION
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9c15365a30c0..392aec1f947c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1370,6 +1370,10 @@ struct task_struct {
unsigned long numa_faults_locality[3];
unsigned long numa_pages_migrated;
+
+ struct callback_head numa_promo_work;
+ struct list_head promo_list;
+ unsigned long promo_count;
#endif /* CONFIG_NUMA_BALANCING */
#ifdef CONFIG_RSEQ
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 5a64582b086b..50b1d1dc27e2 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -25,6 +25,7 @@ enum sched_tunable_scaling {
#ifdef CONFIG_NUMA_BALANCING
extern int sysctl_numa_balancing_mode;
+extern unsigned int sysctl_numa_balancing_promote_rate_limit;
#else
#define sysctl_numa_balancing_mode 0
#endif
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd90..47162ed14106 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -187,6 +187,8 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
.numa_preferred_nid = NUMA_NO_NODE,
.numa_group = NULL,
.numa_faults = NULL,
+ .promo_list = LIST_HEAD_INIT(init_task.promo_list),
+ .promo_count = 0,
#endif
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
.kasan_depth = 1,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c798d2795243..68efbd4a9452 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -43,6 +43,7 @@
#include <linux/interrupt.h>
#include <linux/memory-tiers.h>
#include <linux/mempolicy.h>
+#include <linux/migrate.h>
#include <linux/mutex_api.h>
#include <linux/profile.h>
#include <linux/psi.h>
@@ -129,7 +130,7 @@ static unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
#ifdef CONFIG_NUMA_BALANCING
/* Restrict the NUMA promotion throughput (MB/s) for each target node. */
-static unsigned int sysctl_numa_balancing_promote_rate_limit = 65536;
+unsigned int sysctl_numa_balancing_promote_rate_limit = 65536;
#endif
#ifdef CONFIG_SYSCTL
@@ -3535,6 +3536,25 @@ static void task_numa_work(struct callback_head *work)
}
}
+static void task_numa_promotion_work(struct callback_head *work)
+{
+ struct task_struct *p = current;
+ struct list_head *promo_list = &p->promo_list;
+ int nid = numa_node_id();
+
+ SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_promo_work));
+
+ work->next = work;
+
+ if (list_empty(promo_list))
+ return;
+
+ migrate_misplaced_folio_batch(promo_list, nid);
+ current->promo_count = 0;
+ return;
+}
+
+
void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
{
int mm_users = 0;
@@ -3559,8 +3579,10 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
RCU_INIT_POINTER(p->numa_group, NULL);
p->last_task_numa_placement = 0;
p->last_sum_exec_runtime = 0;
+ INIT_LIST_HEAD(&p->promo_list);
init_task_work(&p->numa_work, task_numa_work);
+ init_task_work(&p->numa_promo_work, task_numa_promotion_work);
/* New address space, reset the preferred nid */
if (!(clone_flags & CLONE_VM)) {
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index fc14fe53e9b7..e8acb54aa8df 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -935,6 +935,7 @@ static int __init memory_tier_init(void)
subsys_initcall(memory_tier_init);
bool numa_demotion_enabled = false;
+bool numa_pagecache_promotion_enabled;
#ifdef CONFIG_MIGRATION
#ifdef CONFIG_SYSFS
@@ -957,11 +958,37 @@ static ssize_t demotion_enabled_store(struct kobject *kobj,
return count;
}
+static ssize_t pagecache_promotion_enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sysfs_emit(buf, "%s\n",
+ numa_pagecache_promotion_enabled ? "true" : "false");
+}
+
+static ssize_t pagecache_promotion_enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ ssize_t ret;
+
+ ret = kstrtobool(buf, &numa_pagecache_promotion_enabled);
+ if (ret)
+ return ret;
+
+ return count;
+}
+
+
static struct kobj_attribute numa_demotion_enabled_attr =
__ATTR_RW(demotion_enabled);
+static struct kobj_attribute numa_pagecache_promotion_enabled_attr =
+ __ATTR_RW(pagecache_promotion_enabled);
+
static struct attribute *numa_attrs[] = {
&numa_demotion_enabled_attr.attr,
+ &numa_pagecache_promotion_enabled_attr.attr,
NULL,
};
diff --git a/mm/migrate.c b/mm/migrate.c
index 7e1ba6001596..e6b4bf364837 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -44,6 +44,8 @@
#include <linux/sched/sysctl.h>
#include <linux/memory-tiers.h>
#include <linux/pagewalk.h>
+#include <linux/sched/numa_balancing.h>
+#include <linux/task_work.h>
#include <asm/tlbflush.h>
@@ -2762,5 +2764,58 @@ int migrate_misplaced_folio_batch(struct list_head *folio_list, int node)
BUG_ON(!list_empty(folio_list));
return nr_remaining ? -EAGAIN : 0;
}
+
+/**
+ * promotion_candidate: report a promotion candidate folio
+ *
+ * The folio will be isolated from LRU if selected, and task_work will
+ * putback the folio on promotion failure.
+ *
+ * Candidates may not be promoted and may be returned to the LRU.
+ *
+ * Takes a folio reference that will be released in task work.
+ */
+void promotion_candidate(struct folio *folio)
+{
+ struct task_struct *task = current;
+ struct list_head *promo_list = &task->promo_list;
+ struct callback_head *work = &task->numa_promo_work;
+ int nid = folio_nid(folio);
+ int flags, last_cpupid;
+
+ /* do not migrate toptier folios or in kernel context */
+ if (node_is_toptier(nid) || task->flags & PF_KTHREAD)
+ return;
+
+ /*
+ * Limit per-syscall migration rate to balancing rate limit. This avoids
+ * excessive work during large reads knowing that task work is likely to
+ * hit the rate limit and put excess folios back on the LRU anyway.
+ */
+ if (task->promo_count >= sysctl_numa_balancing_promote_rate_limit)
+ return;
+
+ /* Isolate the folio to prepare for migration */
+ nid = numa_migrate_check(folio, NULL, 0, &flags, folio_test_dirty(folio),
+ &last_cpupid);
+ if (nid == NUMA_NO_NODE)
+ return;
+
+ if (migrate_misplaced_folio_prepare(folio, NULL, nid))
+ return;
+
+ /*
+ * If work is pending, add this folio to the list. Otherwise, ensure
+ * the task will execute the work, otherwise we can leak folios.
+ */
+ if (list_empty(promo_list) && task_work_add(task, work, TWA_RESUME)) {
+ folio_putback_lru(folio);
+ return;
+ }
+ list_add_tail(&folio->lru, promo_list);
+ task->promo_count += folio_nr_pages(folio);
+ return;
+}
+EXPORT_SYMBOL(promotion_candidate);
#endif /* CONFIG_NUMA_BALANCING */
#endif /* CONFIG_NUMA */
diff --git a/mm/swap.c b/mm/swap.c
index 7523b65d8caa..382828fde505 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -37,6 +37,10 @@
#include <linux/page_idle.h>
#include <linux/local_lock.h>
#include <linux/buffer_head.h>
+#include <linux/migrate.h>
+#include <linux/memory-tiers.h>
+#include <linux/sched/sysctl.h>
+#include <linux/sched/numa_balancing.h>
#include "internal.h"
@@ -476,6 +480,10 @@ void folio_mark_accessed(struct folio *folio)
__lru_cache_activate_folio(folio);
folio_clear_referenced(folio);
workingset_activation(folio);
+ } else if (!folio_test_isolated(folio) &&
+ (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
+ numa_pagecache_promotion_enabled) {
+ promotion_candidate(folio);
}
if (folio_test_idle(folio))
folio_clear_idle(folio);
--
2.49.0
* [RFC PATCH v4 6/6] mm/swap.c: Enable promotion of unmapped MGLRU page cache pages
2025-04-11 22:11 [RFC PATCH v4 0/6] Promotion of Unmapped Page Cache Folios Gregory Price
` (4 preceding siblings ...)
2025-04-11 22:11 ` [RFC PATCH v4 5/6] migrate,sysfs: add pagecache promotion Gregory Price
@ 2025-04-11 22:11 ` Gregory Price
2025-04-11 23:49 ` [RFC PATCH v4 0/6] Promotion of Unmapped Page Cache Folios Matthew Wilcox
2025-04-15 0:45 ` SeongJae Park
7 siblings, 0 replies; 19+ messages in thread
From: Gregory Price @ 2025-04-11 22:11 UTC (permalink / raw)
To: linux-mm
Cc: cgroups, linux-kernel, kernel-team, akpm, mingo, peterz,
juri.lelli, vincent.guittot, hannes, mhocko, roman.gushchin,
shakeel.butt, donettom
From: Donet Tom <donettom@linux.ibm.com>
Extend MGLRU to support promotion of page cache pages.
An MGLRU page cache page is eligible for promotion when:
1. Memory Tiering and pagecache_promotion_enabled are enabled
2. It resides in a lower memory tier.
3. It is referenced.
4. It is part of the working set.
5. The folio reference count is at its maximum (LRU_REFS_MASK).
When a page is accessed through a file descriptor, lru_gen_inc_refs()
is invoked. The first access will set the folio’s referenced flag,
and subsequent accesses will increment the reference count in the
folio flag (reference counter size in folio flags is 2 bits). Once
the referenced flag is set, and the folio’s reference count reaches
the maximum value (LRU_REFS_MASK), the working set flag will be set
as well.
If a folio has both the referenced and working set flags set, and its
reference count equals LRU_REFS_MASK, it becomes a good candidate for
promotion. These pages are added to the promotion list. The per-task
work item task_numa_promotion_work() takes the pages from the
promotion list and promotes them to a higher memory tier.
In the MGLRU, for folios accessed through a file descriptor, if the
folio’s referenced and working set flags are set, and the folio's
reference count is equal to LRU_REFS_MASK, the folio is lazily
promoted to the second oldest generation in the eviction path. When
folio_inc_gen() does this, it clears the LRU_REFS_FLAGS so that
lru_gen_inc_refs() can start over.
Test process:
We measured the read time in below scenarios for both LRU and MGLRU.
Scenario 1: Pages are on Lower tier + promotion off
Scenario 2: Pages are on Lower tier + promotion on
Scenario 3: Pages are on higher tier
Test Results MGLRU
----------------------------------------------------------------
Pages on higher | Pages Lower tier | Pages on Lower Tier |
Tier | promotion off | Promotion On |
----------------------------------------------------------------
0.48s | 1.6s |During Promotion - 3.3s |
| |After Promotion - 0.48s |
| | |
----------------------------------------------------------------
Test Results LRU
----------------------------------------------------------------
Pages on higher | Pages Lower tier | Pages on Lower Tier |
Tier | promotion off | Promotion On |
----------------------------------------------------------------
0.48s | 1.6s |During Promotion - 3.3s |
| |After Promotion - 0.48s |
| | |
----------------------------------------------------------------
MGLRU and LRU show a similar performance benefit.
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
mm/swap.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/mm/swap.c b/mm/swap.c
index 382828fde505..3af2377515ad 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -399,8 +399,13 @@ static void lru_gen_inc_refs(struct folio *folio)
do {
if ((old_flags & LRU_REFS_MASK) == LRU_REFS_MASK) {
- if (!folio_test_workingset(folio))
+ if (!folio_test_workingset(folio)) {
folio_set_workingset(folio);
+ } else if (!folio_test_isolated(folio) &&
+ (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
+ numa_pagecache_promotion_enabled) {
+ promotion_candidate(folio);
+ }
return;
}
--
2.49.0
* Re: [RFC PATCH v4 0/6] Promotion of Unmapped Page Cache Folios.
2025-04-11 22:11 [RFC PATCH v4 0/6] Promotion of Unmapped Page Cache Folios Gregory Price
` (5 preceding siblings ...)
2025-04-11 22:11 ` [RFC PATCH v4 6/6] mm/swap.c: Enable promotion of unmapped MGLRU page cache pages Gregory Price
@ 2025-04-11 23:49 ` Matthew Wilcox
2025-04-12 0:09 ` Gregory Price
2025-04-13 5:23 ` Donet Tom
2025-04-15 0:45 ` SeongJae Park
7 siblings, 2 replies; 19+ messages in thread
From: Matthew Wilcox @ 2025-04-11 23:49 UTC (permalink / raw)
To: Gregory Price
Cc: linux-mm, cgroups, linux-kernel, kernel-team, akpm, mingo,
peterz, juri.lelli, vincent.guittot, hannes, mhocko,
roman.gushchin, shakeel.butt, donettom, Huang Ying, Keith Busch,
Feng Tang, Neha Gholkar
On Fri, Apr 11, 2025 at 06:11:05PM -0400, Gregory Price wrote:
> Unmapped page cache pages can be demoted to low-tier memory, but
No. Page cache should never be demoted to low-tier memory.
NACK this patchset.
* Re: [RFC PATCH v4 0/6] Promotion of Unmapped Page Cache Folios.
2025-04-11 23:49 ` [RFC PATCH v4 0/6] Promotion of Unmapped Page Cache Folios Matthew Wilcox
@ 2025-04-12 0:09 ` Gregory Price
2025-04-12 0:35 ` Matthew Wilcox
2025-04-13 5:23 ` Donet Tom
1 sibling, 1 reply; 19+ messages in thread
From: Gregory Price @ 2025-04-12 0:09 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, cgroups, linux-kernel, kernel-team, akpm, mingo,
peterz, juri.lelli, vincent.guittot, hannes, mhocko,
roman.gushchin, shakeel.butt, donettom, Huang Ying, Keith Busch,
Feng Tang, Neha Gholkar
On Sat, Apr 12, 2025 at 12:49:18AM +0100, Matthew Wilcox wrote:
> On Fri, Apr 11, 2025 at 06:11:05PM -0400, Gregory Price wrote:
> > Unmapped page cache pages can be demoted to low-tier memory, but
>
> No. Page cache should never be demoted to low-tier memory.
> NACK this patchset.
This wasn't a statement of approval of page cache being on lower tiers,
it's a statement of fact. Enabling demotion causes this issue.
~Gregory
* Re: [RFC PATCH v4 0/6] Promotion of Unmapped Page Cache Folios.
2025-04-12 0:09 ` Gregory Price
@ 2025-04-12 0:35 ` Matthew Wilcox
2025-04-12 0:44 ` Gregory Price
0 siblings, 1 reply; 19+ messages in thread
From: Matthew Wilcox @ 2025-04-12 0:35 UTC (permalink / raw)
To: Gregory Price
Cc: linux-mm, cgroups, linux-kernel, kernel-team, akpm, mingo,
peterz, juri.lelli, vincent.guittot, hannes, mhocko,
roman.gushchin, shakeel.butt, donettom, Huang Ying, Keith Busch,
Feng Tang, Neha Gholkar
On Fri, Apr 11, 2025 at 08:09:55PM -0400, Gregory Price wrote:
> On Sat, Apr 12, 2025 at 12:49:18AM +0100, Matthew Wilcox wrote:
> > On Fri, Apr 11, 2025 at 06:11:05PM -0400, Gregory Price wrote:
> > > Unmapped page cache pages can be demoted to low-tier memory, but
> >
> > No. Page cache should never be demoted to low-tier memory.
> > NACK this patchset.
>
> This wasn't a statement of approval page cache being on lower tiers,
> it's a statement of fact. Enabling demotion causes this issue.
Then that's the bug that needs to be fixed. Not adding 200+ lines
of code to recover from a situation that should never happen.
* Re: [RFC PATCH v4 0/6] Promotion of Unmapped Page Cache Folios.
2025-04-12 0:35 ` Matthew Wilcox
@ 2025-04-12 0:44 ` Gregory Price
2025-04-12 11:52 ` Ritesh Harjani
0 siblings, 1 reply; 19+ messages in thread
From: Gregory Price @ 2025-04-12 0:44 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, cgroups, linux-kernel, kernel-team, akpm, mingo,
peterz, juri.lelli, vincent.guittot, hannes, mhocko,
roman.gushchin, shakeel.butt, donettom, Huang Ying, Keith Busch,
Feng Tang, Neha Gholkar
On Sat, Apr 12, 2025 at 01:35:56AM +0100, Matthew Wilcox wrote:
> On Fri, Apr 11, 2025 at 08:09:55PM -0400, Gregory Price wrote:
> > On Sat, Apr 12, 2025 at 12:49:18AM +0100, Matthew Wilcox wrote:
> > > On Fri, Apr 11, 2025 at 06:11:05PM -0400, Gregory Price wrote:
> > > > Unmapped page cache pages can be demoted to low-tier memory, but
> > >
> > > No. Page cache should never be demoted to low-tier memory.
> > > NACK this patchset.
> >
> > This wasn't a statement of approval page cache being on lower tiers,
> > it's a statement of fact. Enabling demotion causes this issue.
>
> Then that's the bug that needs to be fixed. Not adding 200+ lines
> of code to recover from a situation that should never happen.
Well, I have a use case that makes valuable use of putting the page cache
on a farther node rather than pushing it out to disk. But this
discussion aside, I think we could simply make this a separate mode of
demotion_enabled
/* Only demote anonymous memory */
echo 2 > /sys/kernel/mm/numa/demotion_enabled
Assuming we can recognize anon from just struct folio
~Gregory
* Re: [RFC PATCH v4 0/6] Promotion of Unmapped Page Cache Folios.
2025-04-12 0:44 ` Gregory Price
@ 2025-04-12 11:52 ` Ritesh Harjani
2025-04-12 14:35 ` Gregory Price
0 siblings, 1 reply; 19+ messages in thread
From: Ritesh Harjani @ 2025-04-12 11:52 UTC (permalink / raw)
To: Gregory Price, Matthew Wilcox
Cc: linux-mm, cgroups, linux-kernel, kernel-team, akpm, mingo,
peterz, juri.lelli, vincent.guittot, hannes, mhocko,
roman.gushchin, shakeel.butt, donettom, Huang Ying, Keith Busch,
Feng Tang, Neha Gholkar
Gregory Price <gourry@gourry.net> writes:
> On Sat, Apr 12, 2025 at 01:35:56AM +0100, Matthew Wilcox wrote:
>> On Fri, Apr 11, 2025 at 08:09:55PM -0400, Gregory Price wrote:
>> > On Sat, Apr 12, 2025 at 12:49:18AM +0100, Matthew Wilcox wrote:
>> > > On Fri, Apr 11, 2025 at 06:11:05PM -0400, Gregory Price wrote:
>> > > > Unmapped page cache pages can be demoted to low-tier memory, but
>> > >
>> > > No. Page cache should never be demoted to low-tier memory.
>> > > NACK this patchset.
Hi Matthew,
Could you please give some context around why page cache shouldn't be
considered as a demotion target if demotion is enabled? Wouldn't
demoting page cache pages to a lower tier (when we have enough space in
the lower tier) be a better alternative than discarding these pages and
later doing I/O to read them back again?
>> >
>> > This wasn't a statement of approval page cache being on lower tiers,
>> > it's a statement of fact. Enabling demotion causes this issue.
>>
>> Then that's the bug that needs to be fixed. Not adding 200+ lines
>> of code to recover from a situation that should never happen.
/me goes and checks when the demotion feature was added...
Ok, so I believe this was added here [1]
"[PATCH -V10 4/9] mm/migrate: demote pages during reclaim".
[1]: https://lore.kernel.org/all/20210715055145.195411-5-ying.huang@intel.com/T/#u
I think systems with persistent memory acting as DRAM nodes, could choose
to demote page cache pages too, to lower tier instead of simply
discarding them and later doing I/O to read them back from disk.
e.g. when one has a smaller size DRAM as faster tier and larger size
PMEM as slower tier. During memory pressure on faster tier, demoting
page cache pages to slower tier can be helpful to avoid doing I/O later
to read them back in, isn't it?
>
> Well, I have a use case that make valuable use of putting the page cache
> on a farther node rather than pushing it out to disk. But this
> discussion aside, I think we could simply make this a separate mode of
> demotion_enabled
>
> /* Only demote anonymous memory */
> echo 2 > /sys/kernel/mm/numa/demotion_enabled
>
If we are going down this road... then should we consider what other
choices users may need for their usecases? e.g.
0: Demotion disabled
1: Demotion enabled for both anon and file pages
Till here the support is already present.
2: Demotion enabled only for anon pages
3: Demotion enabled only for file pages
Should this be further classified for dirty v/s clean page cache
pages too?
> Assuming we can recognize anon from just struct folio
I am not 100% sure of this, so others should correct. Should this
simply be, folio_is_file_lru() to differentiate page cache pages?
Although this still might give us anon pages which have the
PG_swapbacked dropped as a result of MADV_FREE. Not sure if that needs
any special care though?
-ritesh
* Re: [RFC PATCH v4 0/6] Promotion of Unmapped Page Cache Folios.
2025-04-12 11:52 ` Ritesh Harjani
@ 2025-04-12 14:35 ` Gregory Price
0 siblings, 0 replies; 19+ messages in thread
From: Gregory Price @ 2025-04-12 14:35 UTC (permalink / raw)
To: Ritesh Harjani
Cc: Matthew Wilcox, linux-mm, cgroups, linux-kernel, kernel-team,
akpm, mingo, peterz, juri.lelli, vincent.guittot, hannes, mhocko,
roman.gushchin, shakeel.butt, donettom, Huang Ying, Keith Busch,
Feng Tang, Neha Gholkar
On Sat, Apr 12, 2025 at 05:22:24PM +0530, Ritesh Harjani wrote:
> Gregory Price <gourry@gourry.net> writes:
> 0: Demotion disabled
> 1: Demotion enabled for both anon and file pages
> Till here the support is already present.
>
> 2: Demotion enabled only for anon pages
> 3: Demotion enabled only for file pages
>
> Should this be further classified for dirty v/s clean page cache
> pages too?
>
There are some limitations around migrating dirty pages IIRC, but right
now the vmscan code indiscriminately adds any and all folios to the
demotion list if it gets to that chunk of the code.
> > Assuming we can recognize anon from just struct folio
>
> I am not 100% sure of this, so others should correct. Should this
> simply be, folio_is_file_lru() to differentiate page cache pages?
>
> Although this still might give us anon pages which have the
> PG_swapbacked dropped as a result of MADV_FREE. Not sure if that needs
> any special care though?
>
I made the comment without looking but yeah, PageAnon/folio_test_anon
exist, so this exists in some form somewhere. Basically there's some
space to do something a little less indiscriminate here.
~Gregory
* Re: [RFC PATCH v4 0/6] Promotion of Unmapped Page Cache Folios.
2025-04-11 23:49 ` [RFC PATCH v4 0/6] Promotion of Unmapped Page Cache Folios Matthew Wilcox
2025-04-12 0:09 ` Gregory Price
@ 2025-04-13 5:23 ` Donet Tom
2025-04-13 12:48 ` Matthew Wilcox
1 sibling, 1 reply; 19+ messages in thread
From: Donet Tom @ 2025-04-13 5:23 UTC (permalink / raw)
To: Matthew Wilcox, Gregory Price
Cc: linux-mm, cgroups, linux-kernel, kernel-team, akpm, mingo,
peterz, juri.lelli, vincent.guittot, hannes, mhocko,
roman.gushchin, shakeel.butt, Huang Ying, Keith Busch, Feng Tang,
Neha Gholkar
On 4/12/25 5:19 AM, Matthew Wilcox wrote:
> On Fri, Apr 11, 2025 at 06:11:05PM -0400, Gregory Price wrote:
>> Unmapped page cache pages can be demoted to low-tier memory, but
> No. Page cache should never be demoted to low-tier memory.
> NACK this patchset.
Hi Matthew,
I have one doubt. Under memory pressure, page cache allocations can
fall back to lower-tier memory, right? So later, if those page cache pages
become hot, shouldn't we promote them?
Thanks
Donet
* Re: [RFC PATCH v4 0/6] Promotion of Unmapped Page Cache Folios.
2025-04-13 5:23 ` Donet Tom
@ 2025-04-13 12:48 ` Matthew Wilcox
0 siblings, 0 replies; 19+ messages in thread
From: Matthew Wilcox @ 2025-04-13 12:48 UTC (permalink / raw)
To: Donet Tom
Cc: Gregory Price, linux-mm, cgroups, linux-kernel, kernel-team,
akpm, mingo, peterz, juri.lelli, vincent.guittot, hannes, mhocko,
roman.gushchin, shakeel.butt, Huang Ying, Keith Busch, Feng Tang,
Neha Gholkar
On Sun, Apr 13, 2025 at 10:53:48AM +0530, Donet Tom wrote:
>
> On 4/12/25 5:19 AM, Matthew Wilcox wrote:
> > On Fri, Apr 11, 2025 at 06:11:05PM -0400, Gregory Price wrote:
> > > Unmapped page cache pages can be demoted to low-tier memory, but
> > No. Page cache should never be demoted to low-tier memory.
> > NACK this patchset.
>
> Hi Mathew,
>
> I have one doubt. Under memory pressure, page cache allocations can
> fall back to lower-tier memory, right? So later, if those page cache pages
> become hot, shouldn't we promote them?
That shouldn't happen either. CXL should never be added to the page
allocator. You guys are creating a lot of problems for yourselves,
and I've been clear that I want no part of this.
>
* Re: [RFC PATCH v4 1/6] migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA.
2025-04-11 22:11 ` [RFC PATCH v4 1/6] migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA Gregory Price
@ 2025-04-15 0:12 ` SeongJae Park
0 siblings, 0 replies; 19+ messages in thread
From: SeongJae Park @ 2025-04-15 0:12 UTC (permalink / raw)
To: Gregory Price
Cc: SeongJae Park, linux-mm, cgroups, linux-kernel, kernel-team,
akpm, mingo, peterz, juri.lelli, vincent.guittot, hannes, mhocko,
roman.gushchin, shakeel.butt, donettom
On Fri, 11 Apr 2025 18:11:06 -0400 Gregory Price <gourry@gourry.net> wrote:
> migrate_misplaced_folio_prepare() may be called on a folio without
> a VMA, and so it must be made to accept a NULL VMA.
The comment of the function says "Must be called with the PTL still held". I
understand it is not needed for the NULL VMA case because it is for unmapped
folios? If I'm understanding correctly, could you please also clarify such
details - including when the NULL VMA case happens and whether the locking
requirement changes - in the comment?
Thanks,
SJ
[...]
* Re: [RFC PATCH v4 4/6] migrate: implement migrate_misplaced_folio_batch
2025-04-11 22:11 ` [RFC PATCH v4 4/6] migrate: implement migrate_misplaced_folio_batch Gregory Price
@ 2025-04-15 0:19 ` SeongJae Park
0 siblings, 0 replies; 19+ messages in thread
From: SeongJae Park @ 2025-04-15 0:19 UTC (permalink / raw)
To: Gregory Price
Cc: SeongJae Park, linux-mm, cgroups, linux-kernel, kernel-team,
akpm, mingo, peterz, juri.lelli, vincent.guittot, hannes, mhocko,
roman.gushchin, shakeel.butt, donettom
On Fri, 11 Apr 2025 18:11:09 -0400 Gregory Price <gourry@gourry.net> wrote:
> A common operation in tiering is to migrate multiple pages at once.
> The migrate_misplaced_folio function requires one call for each
> individual folio. Expose a batch-variant of the same call for use
> when doing batch migrations.
>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
> include/linux/migrate.h | 6 ++++++
> mm/migrate.c | 31 +++++++++++++++++++++++++++++++
> 2 files changed, 37 insertions(+)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 61899ec7a9a3..2df756128316 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -145,6 +145,7 @@ const struct movable_operations *page_movable_ops(struct page *page)
> int migrate_misplaced_folio_prepare(struct folio *folio,
> struct vm_area_struct *vma, int node);
> int migrate_misplaced_folio(struct folio *folio, int node);
> +int migrate_misplaced_folio_batch(struct list_head *foliolist, int node);
Nit. s/foliolist/folio_list/ ?
The non-inline definition of the function below calls the parameter
folio_list, and I see more treewide usage of folio_list than foliolist.
linux$ git grep foliolist | wc -l
4
linux$ git grep folio_list | wc -l
142
I wouldn't argue folio_list is the only right name, but at least using the
same name in the declaration and the definition[s] would be nice in terms of
consistency.
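That is, just renaming the parameter in the declaration so it matches the
definition:

int migrate_misplaced_folio_batch(struct list_head *folio_list, int node);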
> #else
> static inline int migrate_misplaced_folio_prepare(struct folio *folio,
> struct vm_area_struct *vma, int node)
> @@ -155,6 +156,11 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
> {
> return -EAGAIN; /* can't migrate now */
> }
> +static inline int migrate_misplaced_folio_batch(struct list_head *foliolist,
Ditto.
> + int node)
> +{
> + return -EAGAIN; /* can't migrate now */
> +}
> #endif /* CONFIG_NUMA_BALANCING */
>
> #ifdef CONFIG_MIGRATION
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 047131f6c839..7e1ba6001596 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2731,5 +2731,36 @@ int migrate_misplaced_folio(struct folio *folio, int node)
> BUG_ON(!list_empty(&migratepages));
> return nr_remaining ? -EAGAIN : 0;
> }
> +
> +/*
> + * Batch variant of migrate_misplaced_folio. Attempts to migrate
> + * a folio list to the specified destination.
> + *
> + * Caller is expected to have isolated the folios by calling
> + * migrate_misplaced_folio_prepare(), which will result in an
> + * elevated reference count on the folio.
> + *
> + * This function will un-isolate the folios, dereference them, and
> + * remove them from the list before returning.
> + */
> +int migrate_misplaced_folio_batch(struct list_head *folio_list, int node)
> +{
> + pg_data_t *pgdat = NODE_DATA(node);
> + unsigned int nr_succeeded;
> + int nr_remaining;
> +
> + nr_remaining = migrate_pages(folio_list, alloc_misplaced_dst_folio,
> + NULL, node, MIGRATE_ASYNC,
> + MR_NUMA_MISPLACED, &nr_succeeded);
> + if (nr_remaining)
> + putback_movable_pages(folio_list);
> +
> + if (nr_succeeded) {
> + count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
migrate_misplaced_folio() also counts memcg events and calls mod_lruvec_state(),
but this variant doesn't. Is this an intended difference? If so, could you
please clarify the reason?
> + mod_node_page_state(pgdat, PGPROMOTE_SUCCESS, nr_succeeded);
> + }
> + BUG_ON(!list_empty(folio_list));
> + return nr_remaining ? -EAGAIN : 0;
> +}
I feel some code here is duplicated from a part of migrate_misplaced_folio().
Can we deduplicate those? Maybe migrate_misplaced_folio() could be a wrapper
of migrate_misplaced_folio_batch()?
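Just to illustrate the idea, a rough and untested sketch (it ignores the
per-folio memcg event counting that the current migrate_misplaced_folio()
also does, so the real thing would need a bit more care):

int migrate_misplaced_folio(struct folio *folio, int node)
{
	LIST_HEAD(migratepages);

	/* Hand the single folio to the batch helper as a one-entry list. */
	list_add(&folio->lru, &migratepages);
	return migrate_misplaced_folio_batch(&migratepages, node);
}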
Thanks,
SJ
[...]
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH v4 5/6] migrate,sysfs: add pagecache promotion
2025-04-11 22:11 ` [RFC PATCH v4 5/6] migrate,sysfs: add pagecache promotion Gregory Price
@ 2025-04-15 0:41 ` SeongJae Park
0 siblings, 0 replies; 19+ messages in thread
From: SeongJae Park @ 2025-04-15 0:41 UTC (permalink / raw)
To: Gregory Price
Cc: SeongJae Park, linux-mm, cgroups, linux-kernel, kernel-team,
akpm, mingo, peterz, juri.lelli, vincent.guittot, hannes, mhocko,
roman.gushchin, shakeel.butt, donettom
On Fri, 11 Apr 2025 18:11:10 -0400 Gregory Price <gourry@gourry.net> wrote:
> adds /sys/kernel/mm/numa/pagecache_promotion_enabled
>
> When page cache lands on lower tiers, there is no way for promotion
> to occur unless it becomes memory-mapped and exposed to NUMA hint
> faults. Just adding a mechanism to promote pages unconditionally,
> however, opens up significant possibility of performance regressions.
>
> Similar to the `demotion_enabled` sysfs entry, provide a sysfs toggle
> to enable and disable page cache promotion. This option will enable
> opportunistic promotion of unmapped page cache during syscall access.
>
> This option is intended for operational conditions where demoted page
> cache will eventually contain memory which becomes hot - and where
> said memory is likely to cause performance issues due to being trapped on
> the lower tier of memory.
>
> A Page Cache folio is considered a promotion candidate when:
> 0) tiering and pagecache-promotion are enabled
"Tiering" here means NUMA_BALANCING_MEMORY_TIERING, right? Why do you make
this feature depend on it?
If there is a good reason for the dependency, what do you think about
1. making pagecache_promotion_enabled automatically enable
NUMA_BALANCING_MEMORY_TIERING, or
2. adding another flag for NUMA balancing
(e.g., echo 4 > /proc/sys/kernel/numa_balancing) that enables this feature
and mapped pages promotion together?
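For 2., a rough sketch of what I mean; the new flag name and its value below
are placeholders of mine, not a real proposal:

/* include/linux/sched/sysctl.h */
#define NUMA_BALANCING_DISABLED		0x0
#define NUMA_BALANCING_NORMAL		0x1
#define NUMA_BALANCING_MEMORY_TIERING	0x2
#define NUMA_BALANCING_PAGECACHE	0x4	/* hypothetical: unmapped page cache promotion */

The promotion path could then test
(sysctl_numa_balancing_mode & NUMA_BALANCING_PAGECACHE) instead of requiring
both NUMA_BALANCING_MEMORY_TIERING and the new sysfs knob.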
> 1) the folio resides on a node not in the top tier
> 2) the folio is already marked referenced and active.
> 3) Multiple accesses in (referenced & active) state occur quickly.
I don't clearly understand what 3) means, particularly the criterion for
"quick" and how the speed is measured. Could you please clarify?
>
> Since promotion is not safe to execute unconditionally from within
> folio_mark_accessed, we defer promotion to a new task_work captured
> in the task_struct. This ensures that the task doing the access has
> some hand in promoting pages - even among deduplicated read only files.
>
> We limit the total number of folios on the promotion list to the
> promotion rate limit to limit the amount of inline work done during
> large reads - avoiding significant overhead. We do not use the existing
> rate-limit check function since this is checked during the migration anyway.
>
> The promotion node is always the local node of the promoting cpu.
>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
> .../ABI/testing/sysfs-kernel-mm-numa | 20 +++++++
> include/linux/memory-tiers.h | 2 +
> include/linux/migrate.h | 5 ++
> include/linux/sched.h | 4 ++
> include/linux/sched/sysctl.h | 1 +
> init/init_task.c | 2 +
> kernel/sched/fair.c | 24 +++++++-
> mm/memory-tiers.c | 27 +++++++++
> mm/migrate.c | 55 +++++++++++++++++++
> mm/swap.c | 8 +++
> 10 files changed, 147 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-numa b/Documentation/ABI/testing/sysfs-kernel-mm-numa
> index 77e559d4ed80..ebb041891db2 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-numa
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-numa
> @@ -22,3 +22,23 @@ Description: Enable/disable demoting pages during reclaim
> the guarantees of cpusets. This should not be enabled
> on systems which need strict cpuset location
> guarantees.
> +
> +What: /sys/kernel/mm/numa/pagecache_promotion_enabled
This is not for any page cache page but only unmapped page cache pages, right?
I think making the name more explicit about that could avoid confusion.
> +Date: January 2025
Captain, it's April ;)
> +Contact: Linux memory management mailing list <linux-mm@kvack.org>
> +Description: Enable/disable promoting pages during file access
> +
> + Page migration during file access is intended for systems
> + with tiered memory configurations that have significant
> + unmapped file cache usage. By default, file cache memory
> + on slower tiers will not be opportunistically promoted by
> + normal NUMA hint faults, because the system has no way to
> + track them. This option enables opportunistic promotion
> + of pages that are accessed via syscall (e.g. read/write)
> + if multiple accesses occur in quick succession.
I again think it would be nice to clarify how quick it should be.
> +
> + It may move data to a NUMA node that does not fall into
> + the cpuset of the allocating process which might be
> + construed to violate the guarantees of cpusets. This
> + should not be enabled on systems which need strict cpuset
> + location guarantees.
[...]
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
[...]
> @@ -957,11 +958,37 @@ static ssize_t demotion_enabled_store(struct kobject *kobj,
> return count;
> }
>
> +static ssize_t pagecache_promotion_enabled_show(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + char *buf)
> +{
> + return sysfs_emit(buf, "%s\n",
> + numa_pagecache_promotion_enabled ? "true" : "false");
> +}
How about using str_true_false(), like demotion_enabled_show() does?
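i.e., something like this (untested):

static ssize_t pagecache_promotion_enabled_show(struct kobject *kobj,
						struct kobj_attribute *attr,
						char *buf)
{
	/* str_true_false() from <linux/string_choices.h> emits "true"/"false" */
	return sysfs_emit(buf, "%s\n",
			  str_true_false(numa_pagecache_promotion_enabled));
}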
[...]
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -44,6 +44,8 @@
> #include <linux/sched/sysctl.h>
> #include <linux/memory-tiers.h>
> #include <linux/pagewalk.h>
> +#include <linux/sched/numa_balancing.h>
> +#include <linux/task_work.h>
>
> #include <asm/tlbflush.h>
>
> @@ -2762,5 +2764,58 @@ int migrate_misplaced_folio_batch(struct list_head *folio_list, int node)
> BUG_ON(!list_empty(folio_list));
> return nr_remaining ? -EAGAIN : 0;
> }
> +
> +/**
> + * promotion_candidate: report a promotion candidate folio
> + *
> + * The folio will be isolated from LRU if selected, and task_work will
> + * putback the folio on promotion failure.
> + *
> + * Candidates may not be promoted and may be returned to the LRU.
Is this about situations different from those the above sentence describes?
If so, could you clarify that?
> + *
> + * Takes a folio reference that will be released in task work.
> + */
> +void promotion_candidate(struct folio *folio)
> +{
> + struct task_struct *task = current;
> + struct list_head *promo_list = &task->promo_list;
> + struct callback_head *work = &task->numa_promo_work;
> + int nid = folio_nid(folio);
> + int flags, last_cpupid;
> +
> + /* do not migrate toptier folios or in kernel context */
> + if (node_is_toptier(nid) || task->flags & PF_KTHREAD)
> + return;
> +
> + /*
> + * Limit per-syscall migration rate to balancing rate limit. This avoids
Isn't this per-task work rather than per-syscall?
> + * excessive work during large reads knowing that task work is likely to
> + * hit the rate limit and put excess folios back on the LRU anyway.
> + */
> + if (task->promo_count >= sysctl_numa_balancing_promote_rate_limit)
> + return;
> +
> + /* Isolate the folio to prepare for migration */
> + nid = numa_migrate_check(folio, NULL, 0, &flags, folio_test_dirty(folio),
> + &last_cpupid);
> + if (nid == NUMA_NO_NODE)
> + return;
> +
> + if (migrate_misplaced_folio_prepare(folio, NULL, nid))
> + return;
> +
> + /*
> + * If work is pending, add this folio to the list. Otherwise, ensure
> + * the task will execute the work, otherwise we can leak folios.
> + */
> + if (list_empty(promo_list) && task_work_add(task, work, TWA_RESUME)) {
> + folio_putback_lru(folio);
> + return;
> + }
> + list_add_tail(&folio->lru, promo_list);
> + task->promo_count += folio_nr_pages(folio);
> + return;
> +}
> +EXPORT_SYMBOL(promotion_candidate);
Why export this symbol?
Thanks,
SJ
[...]
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH v4 0/6] Promotion of Unmapped Page Cache Folios.
2025-04-11 22:11 [RFC PATCH v4 0/6] Promotion of Unmapped Page Cache Folios Gregory Price
` (6 preceding siblings ...)
2025-04-11 23:49 ` [RFC PATCH v4 0/6] Promotion of Unmapped Page Cache Folios Matthew Wilcox
@ 2025-04-15 0:45 ` SeongJae Park
7 siblings, 0 replies; 19+ messages in thread
From: SeongJae Park @ 2025-04-15 0:45 UTC (permalink / raw)
To: Gregory Price
Cc: SeongJae Park, linux-mm, cgroups, linux-kernel, kernel-team,
akpm, mingo, peterz, juri.lelli, vincent.guittot, hannes, mhocko,
roman.gushchin, shakeel.butt, donettom, Huang Ying, Keith Busch,
Feng Tang, Neha Gholkar
Hi Gregory,
Thank you for continuing and sharing this nice work. Adding some comments
based on my humble and DAMON-biased perspective below.
On Fri, 11 Apr 2025 18:11:05 -0400 Gregory Price <gourry@gourry.net> wrote:
> Unmapped page cache pages can be demoted to low-tier memory, but
> they can presently only be promoted in two conditions:
> 1) The page is fully swapped out and re-faulted
> 2) The page becomes mapped (and exposed to NUMA hint faults)
Yet another way is running DAMOS with the DAMOS_MIGRATE_HOT action and an
unmapped-pages DAMOS filter, or without the filter if you want to promote
both mapped and unmapped pages.
That is easy to say, but I anticipate it will show many limitations for your
use case. I only very recently shared[1] my version of an experimental
prototype and its evaluation results. I also understand this patch series is
a simple and well-working solution for your use case, and I have no special
blocker for this work. Nevertheless, I was just thinking it would be nice if
the limitations and opportunities you anticipate for other approaches,
including the DAMON-based one, could be put here together.
[...]
>
> Development History and Notes
> =======================================
> During development, we explored the following proposals:
This is very informative and helpful for getting the context. Thank you for
sharing.
[...]
> 4) Adding a separate kthread - suggested by many
A DAMON-based approach might also be categorized here, since DAMON does
access monitoring and monitoring-results-based system operations (migration
in this context) in a separate thread.
>
> This is - to an extent - a more general version of the LRU proposal.
> We still have to track the folios - which likely requires the
> addition of a page flag.
In the case of a DAMON-based approach, this is not technically true, since it
uses its own abstraction called the DAMON region. Of course, the DAMON region
abstraction is not a panacea; there have been concerns about its accuracy. We
found that suboptimal tuning of the DAMON intervals could be one source of
the poor accuracy, and recently made an automation[2] of the tuning.
I remember you previously mentioned it might make sense to utilize DAMON as a
way to save such additional information. That has been one of the motivations
for my recent suggestion of a new DAMON API[3], namely damon_report_access().
It will allow any kernel code to report its access findings to DAMON with
controlled overhead. The intended usage is to have code paths such as the
page fault handler, folio_mark_accessed(), and AMD IBS sample handlers pass
the information to DAMON via the API function.
> Additionally, this method would actually
> contend pretty heavily with LRU behavior - i.e. we'd want to
> throttle addition to the promotion candidate list in some scenarios.
DAMON-based approach could use DAMOS quota feature for this kind of purpose.
>
>
> 5) Doing it in task work
>
> This seemed to be the most realistic after considering the above.
>
> We observe the following:
> - FMA is an ideal hook for this and isolation is safe here
My one concern is that users can ask DAMON to call folio_mark_accessed() for
non unmapped page cache folios, via DAMOS_LRU_PRIO. Promoting the folio could
be understood as a sort of LRU prioritization, so I'm not really concerned
about the DAMON behavioral change that this patch series could introduce.
Rather, I'm concerned that the vmstat change of this patch series could
confuse such DAMON users.
> - the new promotion_candidate function is an ideal hook for new
> filter logic (throttling, fairness, etc).
Agreed. DAMON's target filtering and aim-based aggressiveness self-tuning
features could serve as such logic. I suggested[3] damos_add_folio() as a
potential future DAMON API for such use cases.
With this patch series, nevertheless, only folios that folio_mark_accessed()
is called on get such an opportunity. Do you have future plans to integrate
fault-based promotion logic with this function, and to extend it to other
access information sources?
[1] https://lore.kernel.org/20250320053937.57734-1-sj@kernel.org
[2] https://lkml.kernel.org/r/20250303221726.484227-1-sj@kernel.org
[3] https://lwn.net/Articles/1016525/
Thanks,
SJ
[...]
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2025-04-15 0:45 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-11 22:11 [RFC PATCH v4 0/6] Promotion of Unmapped Page Cache Folios Gregory Price
2025-04-11 22:11 ` [RFC PATCH v4 1/6] migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA Gregory Price
2025-04-15 0:12 ` SeongJae Park
2025-04-11 22:11 ` [RFC PATCH v4 2/6] memory: allow non-fault migration in numa_migrate_check path Gregory Price
2025-04-11 22:11 ` [RFC PATCH v4 3/6] vmstat: add page-cache numa hints Gregory Price
2025-04-11 22:11 ` [RFC PATCH v4 4/6] migrate: implement migrate_misplaced_folio_batch Gregory Price
2025-04-15 0:19 ` SeongJae Park
2025-04-11 22:11 ` [RFC PATCH v4 5/6] migrate,sysfs: add pagecache promotion Gregory Price
2025-04-15 0:41 ` SeongJae Park
2025-04-11 22:11 ` [RFC PATCH v4 6/6] mm/swap.c: Enable promotion of unmapped MGLRU page cache pages Gregory Price
2025-04-11 23:49 ` [RFC PATCH v4 0/6] Promotion of Unmapped Page Cache Folios Matthew Wilcox
2025-04-12 0:09 ` Gregory Price
2025-04-12 0:35 ` Matthew Wilcox
2025-04-12 0:44 ` Gregory Price
2025-04-12 11:52 ` Ritesh Harjani
2025-04-12 14:35 ` Gregory Price
2025-04-13 5:23 ` Donet Tom
2025-04-13 12:48 ` Matthew Wilcox
2025-04-15 0:45 ` SeongJae Park