* [RFC v3 PATCH 0/5] Promotion of Unmapped Page Cache Folios.
@ 2025-01-07  0:03 Gregory Price
  2025-01-07  0:03 ` [PATCH v3 1/6] migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA Gregory Price
                   ` (6 more replies)
  0 siblings, 7 replies; 14+ messages in thread
From: Gregory Price @ 2025-01-07  0:03 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-kernel, kernel-team, nehagholkar, abhishekd,
	david, nphamcs, gourry, akpm, hannes, kbusch, ying.huang,
	feng.tang, donettom

Unmapped page cache pages can be demoted to low-tier memory, but
they can presently only be promoted in two conditions:
    1) The page is fully swapped out and re-faulted
    2) The page becomes mapped (and exposed to NUMA hint faults)

This RFC proposes promoting unmapped page cache pages by using
folio_mark_accessed as a hotness hint for unmapped pages.

We show in a microbenchmark that this mechanism can increase
performance up to 23.5% compared to leaving page cache on the
low tier - when that page cache becomes excessively hot.

When disabled (NUMA tiering off), overhead in folio_mark_accessed
was limited to <1% in a worst case scenario (all work is file_read()).

There is an open question as to how to integrate this into MGLRU,
as the current design only applies to the traditional LRU.

Patches 1-3
	allow NULL as valid input to migration prep interfaces
	for vmf/vma - which is not present in unmapped folios.
Patch 4
	adds NUMA_HINT_PAGE_CACHE to vmstat
Patch 5
	Implement migrate_misplaced_folio_batch
Patch 6
	add the promotion mechanism, along with a sysfs
	extension which defaults the behavior to off.
	/sys/kernel/mm/numa/pagecache_promotion_enabled

v3 Notes
===
- added batch migration interface (migrate_misplaced_folio_batch)

- dropped timestamp check in promotion_candidate (tests showed
  it did not make a difference and the work is duplicated during
  the migration process).

- Bug fix from Donet Tom regarding vmstat

- pulled folio_isolated and sysfs switch checks out into
  folio_mark_accessed because microbenchmark tests showed the
  function call overhead of promotion_candidate warranted a bit
  of manual optimization for the scenario where the majority of
  work is file_read().  This brought the standing overhead from
  ~7% down to <1% when everything is disabled.

- Limited promotion work list to a number of folios that match
  the existing promotion rate limit, as microbenchmark demonstrated
  excessive overhead on a single system-call when significant amounts
  of memory are read.
  Before: 128GB read went from 7 seconds to 40 seconds over ~2 rounds.
  Now:    128GB read went from 7 seconds to ~11 seconds over ~10 rounds.

- switched from list_add to list_add_tail in promotion_candidate, as
  it was discovered that promoting in non-linear order caused fairly
  significant overheads (as high as the overhead of running entirely
  out of CXL) - likely due to poor TLB and prefetch behavior.  Simply
  switching to list_add_tail all but confirmed this, as the additional
  ~20% overhead vanished.

  This is likely to only occur on systems with a large amount of
  contiguous physical memory available on the hot tier, since the
  allocators are more likely to provide better spatial locality.
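
  As a purely illustrative sketch (generic list_head usage with
  hypothetical folio_a/b/c pointers - not the actual promotion code),
  the ordering difference looks like this:

      LIST_HEAD(promo_list);

      /* list_add() pushes to the head: folios queued A, B, C are
       * later walked (and migrated) as C, B, A - the reverse of the
       * linear read order.
       */
      list_add(&folio_a->lru, &promo_list);      /* walk: A       */
      list_add(&folio_b->lru, &promo_list);      /* walk: B, A    */
      list_add(&folio_c->lru, &promo_list);      /* walk: C, B, A */

      /* list_add_tail() appends: the walk order stays A, B, C,
       * matching the order in which the folios were accessed.
       */
      list_add_tail(&folio_a->lru, &promo_list); /* walk: A       */
      list_add_tail(&folio_b->lru, &promo_list); /* walk: A, B    */
      list_add_tail(&folio_c->lru, &promo_list); /* walk: A, B, C */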


Test:
======

Environment:
    1.5-3.7GHz CPU, ~4000 BogoMIPS, 
    1TB Machine with 768GB DRAM and 256GB CXL
    A 128GB file being linearly read by a single process

Goal:
   Generate promotions and demonstrate upper-bound on performance
   overhead and gain/loss. 

System Settings:
   echo 1 > /sys/kernel/mm/numa/pagecache_promotion_enabled
   echo 2 > /proc/sys/kernel/numa_balancing
   
Test process:
   In each test, we do a linear read of a 128GB file into a buffer
   in a loop.  To allocate the pagecache into CXL, we use mbind prior
   to the CXL test runs and read the file.  We omit the overhead of
   allocating the buffer and initializing the memory into CXL from the
   test runs.

   1) file allocated in DRAM with mechanisms off
   2) file allocated in DRAM with balancing on but promotion off
   3) file allocated in DRAM with balancing and promotion on
      (promotion check is negative because all pages are top tier)
   4) file allocated in CXL with mechanisms off
   5) file allocated in CXL with mechanisms on
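
   For reference, a minimal sketch of the read loop used in these tests
   (file path, chunk size, and timing details are illustrative
   assumptions, not the exact harness):

       #include <fcntl.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <time.h>
       #include <unistd.h>

       #define CHUNK (1UL << 20)   /* read in 1MB chunks */

       int main(void)
       {
               char *buf = malloc(CHUNK);
               int fd = open("/mnt/test/file.128G", O_RDONLY); /* hypothetical path */
               struct timespec t0, t1;

               if (fd < 0 || !buf)
                       return 1;

               for (int cycle = 0; cycle < 50; cycle++) {
                       clock_gettime(CLOCK_MONOTONIC, &t0);
                       lseek(fd, 0, SEEK_SET);
                       while (read(fd, buf, CHUNK) > 0)
                               ;  /* linear read; folio_mark_accessed fires on cache hits */
                       clock_gettime(CLOCK_MONOTONIC, &t1);
                       printf("cycle %d: %.2fs\n", cycle,
                              (t1.tv_sec - t0.tv_sec) +
                              (t1.tv_nsec - t0.tv_nsec) / 1e9);
               }
               return 0;
       }

   For the CXL runs, the file's page cache is first placed on the CXL
   node (via mbind and an initial read, as noted above) before the
   timed cycles begin.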

Each test was run with 50 read cycles and averaged (where relevant)
to account for system noise.  This number of cycles gives the promotion
mechanism time to promote the vast majority of memory (usually <1MB
remaining in worst case).

Tests 2 and 3 test the upper bound on overhead of the new checks when
there are no pages to migrate but work is dominated by file_read().

|     1     |    2     |     3       |    4     |      5         |
| DRAM Base | Promo On | TopTier Chk | CXL Base | Post-Promotion |
|  7.5804   |  7.7586  |   7.9726    |   9.75   |    7.8941      |

Baseline DRAM vs Baseline CXL shows a ~28% overhead just allowing the
file to remain on CXL, while after promotion, we see the performance
trend back towards the overhead of the TopTier check time - a total
overhead reduction of ~84% (or ~5% overhead down from ~23.5%).

During promotion, we do see overhead which eventually tapers off over
time.  Here is a sample of the first 10 cycles during which promotion
is the most aggressive, which shows overhead drops off dramatically
as the majority of memory is migrated to the top tier.

12.79, 12.52, 12.33, 12.03, 11.81, 11.58, 11.36, 11.1, 8, 7.96

This could be further limited by lowering the promotion rate via the
existing knob, or by implementing a new knob detached from the existing
promotion rate limit.  There are merits to both approaches.
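
For example, assuming the existing in-tree promotion rate limit sysctl
(numa_balancing_promote_rate_limit_MBps, which backs the
sysctl_numa_balancing_promote_rate_limit variable touched in patch 6),
the rate could be lowered with something like:

   echo 128 > /proc/sys/kernel/numa_balancing_promote_rate_limit_MBps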

After promotion, turning the mechanism off via sysfs increased the
overall performance back to the DRAM baseline. The slight (~1%)
increase between post-migration performance and the baseline mechanism
overhead check appears to be general variance as similar times were
observed during the baseline checks on subsequent runs.

The mechanism itself represents a ~2-5% overhead in a worst case
scenario (all work is file_read() and pages are in DRAM).


Development History and Notes
=======================================
During development, we explored the following proposals:

1) directly promoting within folio_mark_accessed (FMA)
   Originally suggested by Johannes Weiner
   https://lore.kernel.org/all/20240803094715.23900-1-gourry@gourry.net/

   This caused deadlocks because the PTL was held in a variety of
   cases - in particular during task exit.  It is also incredibly
   inflexible and causes promotion-on-fault.  It was discussed that
   a deferral mechanism was preferred.


2) promoting in filemap.c locations (callers of FMA)
   Originally proposed by Feng Tang and Ying Huang
   https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/patch/?id=5f2e64ce75c0322602c2ec8c70b64bb69b1f1329

   First, we saw this as less problematic than directly hooking FMA,
   but we realized this has the potential to miss data in a variety of
   locations: swap.c, memory.c, gup.c, ksm.c, paddr.c - etc.

   Second, we discovered that the lock state of pages is very subtle,
   and that these locations in filemap.c can be called in an atomic
   context.  Prototypes led to a variety of stalls and lockups.


3) a new LRU - originally proposed by Keith Busch
   https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/patch/?id=6616afe9a722f6ebedbb27ade3848cf07b9a3af7

   There are two issues with this approach: PG_promotable and reclaim.

   First - PG_promotable has generally been discouraged.

   Second - Attaching this mechanism to an LRU is both backwards and
   counter-intuitive.  A promotable list is better served by a MOST
   recently used list, and since LRUs are generally only shrunk when
   exposed to pressure, it would require implementing a new promotion
   list shrinker that runs separately from the existing reclaim logic.


4) Adding a separate kthread - suggested by many

   This is - to an extent - a more general version of the LRU proposal.
   We still have to track the folios - which likely requires the
   addition of a page flag.  Additionally, this method would actually
   contend pretty heavily with LRU behavior - i.e. we'd want to
   throttle addition to the promotion candidate list in some scenarios.


5) Doing it in task work

   This seemed to be the most realistic after considering the above.

   We observe the following:
    - FMA is an ideal hook for this and isolation is safe here
    - the new promotion_candidate function is an ideal hook for new
      filter logic (throttling, fairness, etc).
    - isolated folios are either promoted or put back on task resume;
      there are no additional concurrency mechanics to worry about
    - the mechanism can be made optional via a sysfs hook to avoid
      overhead in degenerate scenarios (thrashing).


Suggested-by: Huang Ying <ying.huang@linux.alibaba.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Keith Busch <kbusch@meta.com>
Suggested-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Gregory Price <gourry@gourry.net>

Gregory Price (6):
  migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA.
  memory: move conditionally defined enums use inside ifdef tags
  memory: allow non-fault migration in numa_migrate_check path
  vmstat: add page-cache numa hints
  migrate: implement migrate_misplaced_folio_batch
  migrate,sysfs: add pagecache promotion

 .../ABI/testing/sysfs-kernel-mm-numa          | 20 +++++
 include/linux/memory-tiers.h                  |  2 +
 include/linux/migrate.h                       | 10 +++
 include/linux/sched.h                         |  4 +
 include/linux/sched/sysctl.h                  |  1 +
 include/linux/vm_event_item.h                 |  8 ++
 init/init_task.c                              |  2 +
 kernel/sched/fair.c                           | 24 ++++-
 mm/memcontrol.c                               |  1 +
 mm/memory-tiers.c                             | 27 ++++++
 mm/memory.c                                   | 32 ++++---
 mm/mempolicy.c                                | 25 ++++--
 mm/migrate.c                                  | 88 ++++++++++++++++++-
 mm/swap.c                                     |  8 ++
 mm/vmstat.c                                   |  2 +
 15 files changed, 230 insertions(+), 24 deletions(-)

-- 
2.47.1




* [PATCH v3 1/6] migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA.
  2025-01-07  0:03 [RFC v3 PATCH 0/5] Promotion of Unmapped Page Cache Folios Gregory Price
@ 2025-01-07  0:03 ` Gregory Price
  2025-01-07  0:03 ` [PATCH v3 2/6] memory: move conditionally defined enums use inside ifdef tags Gregory Price
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Gregory Price @ 2025-01-07  0:03 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-kernel, kernel-team, nehagholkar, abhishekd,
	david, nphamcs, gourry, akpm, hannes, kbusch, ying.huang,
	feng.tang, donettom

migrate_misplaced_folio_prepare() may be called on a folio without
a VMA, and so it must be made to accept a NULL VMA.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
 mm/migrate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index caadbe393aa2..ea20d9bc4f40 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2627,7 +2627,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
 		 * See folio_likely_mapped_shared() on possible imprecision
 		 * when we cannot easily detect if a folio is shared.
 		 */
-		if ((vma->vm_flags & VM_EXEC) &&
+		if (vma && (vma->vm_flags & VM_EXEC) &&
 		    folio_likely_mapped_shared(folio))
 			return -EACCES;
 
-- 
2.47.1




* [PATCH v3 2/6] memory: move conditionally defined enums use inside ifdef tags
  2025-01-07  0:03 [RFC v3 PATCH 0/5] Promotion of Unmapped Page Cache Folios Gregory Price
  2025-01-07  0:03 ` [PATCH v3 1/6] migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA Gregory Price
@ 2025-01-07  0:03 ` Gregory Price
  2025-01-21  4:33   ` Bharata B Rao
  2025-01-07  0:03 ` [PATCH v3 3/6] memory: allow non-fault migration in numa_migrate_check path Gregory Price
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: Gregory Price @ 2025-01-07  0:03 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-kernel, kernel-team, nehagholkar, abhishekd,
	david, nphamcs, gourry, akpm, hannes, kbusch, ying.huang,
	feng.tang, donettom

NUMA_HINT_FAULTS and NUMA_HINT_FAULTS_LOCAL are only defined if
CONFIG_NUMA_BALANCING is defined, but are used outside the tags in
numa_migrate_check().  Fix this.

TNF_SHARED is only used if CONFIG_NUMA_BALANCING is enabled, so
moving this line inside the ifdef is also safe - despite use of TNF_*
elsewhere in the function.  TNF_* are not conditionally defined.

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 mm/memory.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 9cc93c2f79f3..8d254e97840d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5571,14 +5571,14 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
 	/* Record the current PID acceesing VMA */
 	vma_set_access_pid_bit(vma);
 
-	count_vm_numa_event(NUMA_HINT_FAULTS);
 #ifdef CONFIG_NUMA_BALANCING
+	count_vm_numa_event(NUMA_HINT_FAULTS);
 	count_memcg_folio_events(folio, NUMA_HINT_FAULTS, 1);
-#endif
 	if (folio_nid(folio) == numa_node_id()) {
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 		*flags |= TNF_FAULT_LOCAL;
 	}
+#endif
 
 	return mpol_misplaced(folio, vmf, addr);
 }
-- 
2.47.1




* [PATCH v3 3/6] memory: allow non-fault migration in numa_migrate_check path
  2025-01-07  0:03 [RFC v3 PATCH 0/5] Promotion of Unmapped Page Cache Folios Gregory Price
  2025-01-07  0:03 ` [PATCH v3 1/6] migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA Gregory Price
  2025-01-07  0:03 ` [PATCH v3 2/6] memory: move conditionally defined enums use inside ifdef tags Gregory Price
@ 2025-01-07  0:03 ` Gregory Price
  2025-01-07  0:03 ` [PATCH v3 4/6] vmstat: add page-cache numa hints Gregory Price
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Gregory Price @ 2025-01-07  0:03 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-kernel, kernel-team, nehagholkar, abhishekd,
	david, nphamcs, gourry, akpm, hannes, kbusch, ying.huang,
	feng.tang, donettom

numa_migrate_check and mpol_misplaced presume callers are in the
fault path with access to a VMA.  To enable migrations from the page
cache, it is preferable to reuse the same migration-prep logic.

Mildly refactor numa_migrate_check and mpol_misplaced so that they may
be called with (vmf = NULL) from non-faulting paths.
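
As a sketch of the intended non-fault usage (the surrounding caller is
hypothetical; this mirrors what the pagecache promotion patch later in
this series does):

	int flags = 0, last_cpupid;
	int nid;

	/*
	 * vmf == NULL: no VMA or PTL context is available, so
	 * mpol_misplaced() falls back to the calling task's policy
	 * rather than the VMA policy.
	 */
	nid = numa_migrate_check(folio, NULL, 0, &flags,
				 folio_test_dirty(folio), &last_cpupid);
	if (nid == NUMA_NO_NODE)
		return;
	/* folio may now be isolated and queued for migration to nid */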

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 mm/memory.c    | 24 ++++++++++++++----------
 mm/mempolicy.c | 25 +++++++++++++++++--------
 2 files changed, 31 insertions(+), 18 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8d254e97840d..24acac94399c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5540,7 +5540,20 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
 		      unsigned long addr, int *flags,
 		      bool writable, int *last_cpupid)
 {
-	struct vm_area_struct *vma = vmf->vma;
+	if (vmf) {
+		struct vm_area_struct *vma = vmf->vma;
+		const vm_flags_t vmflags = vma->vm_flags;
+
+		/*
+		 * Flag if the folio is shared between multiple address spaces.
+		 * This used later when determining whether to group tasks.
+		 */
+		if (folio_likely_mapped_shared(folio))
+			*flags |= vmflags & VM_SHARED ? TNF_SHARED : 0;
+
+		/* Record the current PID acceesing VMA */
+		vma_set_access_pid_bit(vma);
+	}
 
 	/*
 	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
@@ -5553,12 +5566,6 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
 	if (!writable)
 		*flags |= TNF_NO_GROUP;
 
-	/*
-	 * Flag if the folio is shared between multiple address spaces. This
-	 * is later used when determining whether to group tasks together
-	 */
-	if (folio_likely_mapped_shared(folio) && (vma->vm_flags & VM_SHARED))
-		*flags |= TNF_SHARED;
 	/*
 	 * For memory tiering mode, cpupid of slow memory page is used
 	 * to record page access time.  So use default value.
@@ -5568,9 +5575,6 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
 	else
 		*last_cpupid = folio_last_cpupid(folio);
 
-	/* Record the current PID acceesing VMA */
-	vma_set_access_pid_bit(vma);
-
 #ifdef CONFIG_NUMA_BALANCING
 	count_vm_numa_event(NUMA_HINT_FAULTS);
 	count_memcg_folio_events(folio, NUMA_HINT_FAULTS, 1);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 305aa3012173..9a7804f65782 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2747,12 +2747,16 @@ static void sp_free(struct sp_node *n)
  * mpol_misplaced - check whether current folio node is valid in policy
  *
  * @folio: folio to be checked
- * @vmf: structure describing the fault
+ * @vmf: structure describing the fault (NULL if called outside fault path)
  * @addr: virtual address in @vma for shared policy lookup and interleave policy
+ *	  Ignored if vmf is NULL.
  *
  * Lookup current policy node id for vma,addr and "compare to" folio's
- * node id.  Policy determination "mimics" alloc_page_vma().
- * Called from fault path where we know the vma and faulting address.
+ * node id - or task's policy node id if vmf is NULL.  Policy determination
+ * "mimics" alloc_page_vma().
+ *
+ * vmf must be non-NULL if called from fault path where we know the vma and
+ * faulting address. The PTL must be held by caller if vmf is not NULL.
  *
  * Return: NUMA_NO_NODE if the page is in a node that is valid for this
  * policy, or a suitable node ID to allocate a replacement folio from.
@@ -2764,7 +2768,6 @@ int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
 	pgoff_t ilx;
 	struct zoneref *z;
 	int curnid = folio_nid(folio);
-	struct vm_area_struct *vma = vmf->vma;
 	int thiscpu = raw_smp_processor_id();
 	int thisnid = numa_node_id();
 	int polnid = NUMA_NO_NODE;
@@ -2774,18 +2777,24 @@ int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
 	 * Make sure ptl is held so that we don't preempt and we
 	 * have a stable smp processor id
 	 */
-	lockdep_assert_held(vmf->ptl);
-	pol = get_vma_policy(vma, addr, folio_order(folio), &ilx);
+	if (vmf) {
+		lockdep_assert_held(vmf->ptl);
+		pol = get_vma_policy(vmf->vma, addr, folio_order(folio), &ilx);
+	} else {
+		pol = get_task_policy(current);
+	}
 	if (!(pol->flags & MPOL_F_MOF))
 		goto out;
 
 	switch (pol->mode) {
 	case MPOL_INTERLEAVE:
-		polnid = interleave_nid(pol, ilx);
+		polnid = vmf ? interleave_nid(pol, ilx) :
+			       interleave_nodes(pol);
 		break;
 
 	case MPOL_WEIGHTED_INTERLEAVE:
-		polnid = weighted_interleave_nid(pol, ilx);
+		polnid = vmf ? weighted_interleave_nid(pol, ilx) :
+			       weighted_interleave_nodes(pol);
 		break;
 
 	case MPOL_PREFERRED:
-- 
2.47.1




* [PATCH v3 4/6] vmstat: add page-cache numa hints
  2025-01-07  0:03 [RFC v3 PATCH 0/5] Promotion of Unmapped Page Cache Folios Gregory Price
                   ` (2 preceding siblings ...)
  2025-01-07  0:03 ` [PATCH v3 3/6] memory: allow non-fault migration in numa_migrate_check path Gregory Price
@ 2025-01-07  0:03 ` Gregory Price
  2025-01-07  0:03 ` [PATCH v3 5/6] migrate: implement migrate_misplaced_folio_batch Gregory Price
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Gregory Price @ 2025-01-07  0:03 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-kernel, kernel-team, nehagholkar, abhishekd,
	david, nphamcs, gourry, akpm, hannes, kbusch, ying.huang,
	feng.tang, donettom

Count non-page-fault events as page-cache numa hints instead of
fault hints in vmstat. Add a define to select the hint type to
keep the code clean.

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 include/linux/vm_event_item.h | 8 ++++++++
 mm/memcontrol.c               | 1 +
 mm/memory.c                   | 6 +++---
 mm/vmstat.c                   | 2 ++
 4 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index f70d0958095c..c5abb0f7cca7 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -63,6 +63,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		NUMA_HUGE_PTE_UPDATES,
 		NUMA_HINT_FAULTS,
 		NUMA_HINT_FAULTS_LOCAL,
+		NUMA_HINT_PAGE_CACHE,
+		NUMA_HINT_PAGE_CACHE_LOCAL,
 		NUMA_PAGE_MIGRATE,
 #endif
 #ifdef CONFIG_MIGRATION
@@ -185,6 +187,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		NR_VM_EVENT_ITEMS
 };
 
+#ifdef CONFIG_NUMA_BALANCING
+#define NUMA_HINT_TYPE(vmf) (vmf ? NUMA_HINT_FAULTS : NUMA_HINT_PAGE_CACHE)
+#define NUMA_HINT_TYPE_LOCAL(vmf) (vmf ? NUMA_HINT_FAULTS_LOCAL : \
+					 NUMA_HINT_PAGE_CACHE_LOCAL)
+#endif
+
 #ifndef CONFIG_TRANSPARENT_HUGEPAGE
 #define THP_FILE_ALLOC ({ BUILD_BUG(); 0; })
 #define THP_FILE_FALLBACK ({ BUILD_BUG(); 0; })
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 46f8b372d212..865c9c64068e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -460,6 +460,7 @@ static const unsigned int memcg_vm_event_stat[] = {
 	NUMA_PAGE_MIGRATE,
 	NUMA_PTE_UPDATES,
 	NUMA_HINT_FAULTS,
+	NUMA_HINT_PAGE_CACHE,
 #endif
 };
 
diff --git a/mm/memory.c b/mm/memory.c
index 24acac94399c..3f63cfd24296 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5576,10 +5576,10 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
 		*last_cpupid = folio_last_cpupid(folio);
 
 #ifdef CONFIG_NUMA_BALANCING
-	count_vm_numa_event(NUMA_HINT_FAULTS);
-	count_memcg_folio_events(folio, NUMA_HINT_FAULTS, 1);
+	count_vm_numa_event(NUMA_HINT_TYPE(vmf));
+	count_memcg_folio_events(folio, NUMA_HINT_TYPE(vmf), 1);
 	if (folio_nid(folio) == numa_node_id()) {
-		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
+		count_vm_numa_event(NUMA_HINT_TYPE_LOCAL(vmf));
 		*flags |= TNF_FAULT_LOCAL;
 	}
 #endif
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 0889b75cef14..1b74d1faf089 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1338,6 +1338,8 @@ const char * const vmstat_text[] = {
 	"numa_huge_pte_updates",
 	"numa_hint_faults",
 	"numa_hint_faults_local",
+	"numa_hint_page_cache",
+	"numa_hint_page_cache_local",
 	"numa_pages_migrated",
 #endif
 #ifdef CONFIG_MIGRATION
-- 
2.47.1




* [PATCH v3 5/6] migrate: implement migrate_misplaced_folio_batch
  2025-01-07  0:03 [RFC v3 PATCH 0/5] Promotion of Unmapped Page Cache Folios Gregory Price
                   ` (3 preceding siblings ...)
  2025-01-07  0:03 ` [PATCH v3 4/6] vmstat: add page-cache numa hints Gregory Price
@ 2025-01-07  0:03 ` Gregory Price
  2025-01-07  0:03 ` [PATCH v3 6/6] migrate,sysfs: add pagecache promotion Gregory Price
  2025-01-22 11:16 ` [RFC v3 PATCH 0/5] Promotion of Unmapped Page Cache Folios Huang, Ying
  6 siblings, 0 replies; 14+ messages in thread
From: Gregory Price @ 2025-01-07  0:03 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-kernel, kernel-team, nehagholkar, abhishekd,
	david, nphamcs, gourry, akpm, hannes, kbusch, ying.huang,
	feng.tang, donettom

A common operation in tiering is to migrate multiple pages at once.
The migrate_misplaced_folio function requires one call for each
individual folio.  Expose a batch-variant of the same call for use
when doing batch migrations.
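
A sketch of the intended usage (folio_list and nid are hypothetical
locals; this is essentially how the pagecache promotion patch later in
this series uses the interface - isolate on access, migrate in batch
from task_work):

	LIST_HEAD(folio_list);
	int nid = numa_node_id();

	/*
	 * For each candidate: isolate it from the LRU (this takes a
	 * folio reference) and queue it in access order.
	 */
	if (!migrate_misplaced_folio_prepare(folio, NULL, nid))
		list_add_tail(&folio->lru, &folio_list);

	/*
	 * Later, from a safe (non-atomic) context: migrate the whole
	 * batch.  Folios that fail to migrate are put back on the LRU,
	 * and the list is empty on return.
	 */
	migrate_misplaced_folio_batch(&folio_list, nid);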

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 include/linux/migrate.h |  5 +++++
 mm/migrate.c            | 31 +++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 29919faea2f1..3dfbe7c1cc83 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -145,6 +145,7 @@ const struct movable_operations *page_movable_ops(struct page *page)
 int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node);
 int migrate_misplaced_folio(struct folio *folio, int node);
+int migrate_misplaced_folio_batch(struct list_head *foliolist, int node);
 #else
 static inline int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node)
@@ -155,6 +156,10 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
+static inline int migrate_misplaced_folio_batch(struct list_head *foliolist, int node)
+{
+	return -EAGAIN; /* can't migrate now */
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_MIGRATION
diff --git a/mm/migrate.c b/mm/migrate.c
index ea20d9bc4f40..a751a995f2d9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2705,5 +2705,36 @@ int migrate_misplaced_folio(struct folio *folio, int node)
 	BUG_ON(!list_empty(&migratepages));
 	return nr_remaining ? -EAGAIN : 0;
 }
+
+/*
+ * Batch variant of migrate_misplaced_folio. Attempts to migrate
+ * a folio list to the specified destination.
+ *
+ * Caller is expected to have isolated the folios by calling
+ * migrate_misplaced_folio_prepare(), which will result in an
+ * elevated reference count on the folio.
+ *
+ * This function will un-isolate the folios, dereference them, and
+ * remove them from the list before returning.
+ */
+int migrate_misplaced_folio_batch(struct list_head *folio_list, int node)
+{
+	pg_data_t *pgdat = NODE_DATA(node);
+	unsigned int nr_succeeded;
+	int nr_remaining;
+
+	nr_remaining = migrate_pages(folio_list, alloc_misplaced_dst_folio,
+				     NULL, node, MIGRATE_ASYNC,
+				     MR_NUMA_MISPLACED, &nr_succeeded);
+	if (nr_remaining)
+		putback_movable_pages(folio_list);
+
+	if (nr_succeeded) {
+		count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
+		mod_node_page_state(pgdat, PGPROMOTE_SUCCESS, nr_succeeded);
+	}
+	BUG_ON(!list_empty(folio_list));
+	return nr_remaining ? -EAGAIN : 0;
+}
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_NUMA */
-- 
2.47.1




* [PATCH v3 6/6] migrate,sysfs: add pagecache promotion
  2025-01-07  0:03 [RFC v3 PATCH 0/5] Promotion of Unmapped Page Cache Folios Gregory Price
                   ` (4 preceding siblings ...)
  2025-01-07  0:03 ` [PATCH v3 5/6] migrate: implement migrate_misplaced_folio_batch Gregory Price
@ 2025-01-07  0:03 ` Gregory Price
  2025-01-22 11:16 ` [RFC v3 PATCH 0/5] Promotion of Unmapped Page Cache Folios Huang, Ying
  6 siblings, 0 replies; 14+ messages in thread
From: Gregory Price @ 2025-01-07  0:03 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-kernel, kernel-team, nehagholkar, abhishekd,
	david, nphamcs, gourry, akpm, hannes, kbusch, ying.huang,
	feng.tang, donettom

adds /sys/kernel/mm/numa/pagecache_promotion_enabled

When page cache lands on lower tiers, there is no way for promotion
to occur unless it becomes memory-mapped and exposed to NUMA hint
faults.  Just adding a mechanism to promote pages unconditionally,
however, opens up a significant possibility of performance regressions.

Similar to the `demotion_enabled` sysfs entry, provide a sysfs toggle
to enable and disable page cache promotion.  This option will enable
opportunistic promotion of unmapped page cache during syscall access.

This option is intended for operational conditions where demoted page
cache will eventually contain memory which becomes hot - and where
said memory is likely to cause performance issues due to being trapped
on the lower memory tier.

A page cache folio is considered a promotion candidate when:
  0) tiering and pagecache-promotion are enabled
  1) the folio resides on a node not in the top tier
  2) the folio is already marked referenced and active.
  3) Multiple accesses in (referenced & active) state occur quickly.

Since promotion is not safe to execute unconditionally from within
folio_mark_accessed, we defer promotion to a new task_work captured
in the task_struct.  This ensures that the task doing the access has
some hand in promoting pages - even among deduplicated read only files.

We cap the total number of folios on the promotion list at the
promotion rate limit to bound the amount of inline work done during
large reads - avoiding significant overhead.  We do not use the
existing rate-limit check function, since the rate limit is checked
again during migration anyway.

The promotion node is always the local node of the promoting cpu.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
 .../ABI/testing/sysfs-kernel-mm-numa          | 20 +++++++
 include/linux/memory-tiers.h                  |  2 +
 include/linux/migrate.h                       |  5 ++
 include/linux/sched.h                         |  4 ++
 include/linux/sched/sysctl.h                  |  1 +
 init/init_task.c                              |  2 +
 kernel/sched/fair.c                           | 24 +++++++-
 mm/memory-tiers.c                             | 27 +++++++++
 mm/migrate.c                                  | 55 +++++++++++++++++++
 mm/swap.c                                     |  8 +++
 10 files changed, 147 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-numa b/Documentation/ABI/testing/sysfs-kernel-mm-numa
index 77e559d4ed80..ebb041891db2 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-numa
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-numa
@@ -22,3 +22,23 @@ Description:	Enable/disable demoting pages during reclaim
 		the guarantees of cpusets.  This should not be enabled
 		on systems which need strict cpuset location
 		guarantees.
+
+What:		/sys/kernel/mm/numa/pagecache_promotion_enabled
+Date:		January 2025
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Enable/disable promoting pages during file access
+
+		Page migration during file access is intended for systems
+		with tiered memory configurations that have significant
+		unmapped file cache usage. By default, file cache memory
+		on slower tiers will not be opportunistically promoted by
+		normal NUMA hint faults, because the system has no way to
+		track them.  This option enables opportunistic promotion
+		of pages that are accessed via syscall (e.g. read/write)
+		if multiple accesses occur in quick succession.
+
+		It may move data to a NUMA node that does not fall into
+		the cpuset of the allocating process which might be
+		construed to violate the guarantees of cpusets.  This
+		should not be enabled on systems which need strict cpuset
+		location guarantees.
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 0dc0cf2863e2..fa96a67b8996 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -37,6 +37,7 @@ struct access_coordinate;
 
 #ifdef CONFIG_NUMA
 extern bool numa_demotion_enabled;
+extern bool numa_pagecache_promotion_enabled;
 extern struct memory_dev_type *default_dram_type;
 extern nodemask_t default_dram_nodes;
 struct memory_dev_type *alloc_memory_type(int adistance);
@@ -76,6 +77,7 @@ static inline bool node_is_toptier(int node)
 #else
 
 #define numa_demotion_enabled	false
+#define numa_pagecache_promotion_enabled	false
 #define default_dram_type	NULL
 #define default_dram_nodes	NODE_MASK_NONE
 /*
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 3dfbe7c1cc83..80438ddd76c7 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -146,6 +146,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node);
 int migrate_misplaced_folio(struct folio *folio, int node);
 int migrate_misplaced_folio_batch(struct list_head *foliolist, int node);
+void promotion_candidate(struct folio *folio);
 #else
 static inline int migrate_misplaced_folio_prepare(struct folio *folio,
 		struct vm_area_struct *vma, int node)
@@ -160,6 +161,10 @@ int migrate_misplaced_folio_batch(struct list_head *foliolist, int node)
 {
 	return -EAGAIN; /* can't migrate now */
 }
+static inline void promotion_candidate(struct folio *folio)
+{
+	return;
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_MIGRATION
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 66b311fbd5d6..84b9bcfaa01d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1363,6 +1363,10 @@ struct task_struct {
 	unsigned long			numa_faults_locality[3];
 
 	unsigned long			numa_pages_migrated;
+
+	struct callback_head		numa_promo_work;
+	struct list_head		promo_list;
+	unsigned long			promo_count;
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_RSEQ
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 5a64582b086b..50b1d1dc27e2 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -25,6 +25,7 @@ enum sched_tunable_scaling {
 
 #ifdef CONFIG_NUMA_BALANCING
 extern int sysctl_numa_balancing_mode;
+extern unsigned int sysctl_numa_balancing_promote_rate_limit;
 #else
 #define sysctl_numa_balancing_mode	0
 #endif
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd90..47162ed14106 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -187,6 +187,8 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.numa_preferred_nid = NUMA_NO_NODE,
 	.numa_group	= NULL,
 	.numa_faults	= NULL,
+	.promo_list	= LIST_HEAD_INIT(init_task.promo_list),
+	.promo_count	= 0,
 #endif
 #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
 	.kasan_depth	= 1,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e9ca38512de..7612d5c2d75d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -42,6 +42,7 @@
 #include <linux/interrupt.h>
 #include <linux/memory-tiers.h>
 #include <linux/mempolicy.h>
+#include <linux/migrate.h>
 #include <linux/mutex_api.h>
 #include <linux/profile.h>
 #include <linux/psi.h>
@@ -126,7 +127,7 @@ static unsigned int sysctl_sched_cfs_bandwidth_slice		= 5000UL;
 
 #ifdef CONFIG_NUMA_BALANCING
 /* Restrict the NUMA promotion throughput (MB/s) for each target node. */
-static unsigned int sysctl_numa_balancing_promote_rate_limit = 65536;
+unsigned int sysctl_numa_balancing_promote_rate_limit = 65536;
 #endif
 
 #ifdef CONFIG_SYSCTL
@@ -3537,6 +3538,25 @@ static void task_numa_work(struct callback_head *work)
 	}
 }
 
+static void task_numa_promotion_work(struct callback_head *work)
+{
+	struct task_struct *p = current;
+	struct list_head *promo_list= &p->promo_list;
+	int nid = numa_node_id();
+
+	SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_promo_work));
+
+	work->next = work;
+
+	if (list_empty(promo_list))
+		return;
+
+	migrate_misplaced_folio_batch(promo_list, nid);
+	current->promo_count = 0;
+	return;
+}
+
+
 void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 {
 	int mm_users = 0;
@@ -3561,8 +3581,10 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 	RCU_INIT_POINTER(p->numa_group, NULL);
 	p->last_task_numa_placement	= 0;
 	p->last_sum_exec_runtime	= 0;
+	INIT_LIST_HEAD(&p->promo_list);
 
 	init_task_work(&p->numa_work, task_numa_work);
+	init_task_work(&p->numa_promo_work, task_numa_promotion_work);
 
 	/* New address space, reset the preferred nid */
 	if (!(clone_flags & CLONE_VM)) {
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index fc14fe53e9b7..e8acb54aa8df 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -935,6 +935,7 @@ static int __init memory_tier_init(void)
 subsys_initcall(memory_tier_init);
 
 bool numa_demotion_enabled = false;
+bool numa_pagecache_promotion_enabled = false;
 
 #ifdef CONFIG_MIGRATION
 #ifdef CONFIG_SYSFS
@@ -957,11 +958,37 @@ static ssize_t demotion_enabled_store(struct kobject *kobj,
 	return count;
 }
 
+static ssize_t pagecache_promotion_enabled_show(struct kobject *kobj,
+						struct kobj_attribute *attr,
+						char *buf)
+{
+	return sysfs_emit(buf, "%s\n",
+			  numa_pagecache_promotion_enabled ? "true" : "false");
+}
+
+static ssize_t pagecache_promotion_enabled_store(struct kobject *kobj,
+						 struct kobj_attribute *attr,
+						 const char *buf, size_t count)
+{
+	ssize_t ret;
+
+	ret = kstrtobool(buf, &numa_pagecache_promotion_enabled);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+
 static struct kobj_attribute numa_demotion_enabled_attr =
 	__ATTR_RW(demotion_enabled);
 
+static struct kobj_attribute numa_pagecache_promotion_enabled_attr =
+	__ATTR_RW(pagecache_promotion_enabled);
+
 static struct attribute *numa_attrs[] = {
 	&numa_demotion_enabled_attr.attr,
+	&numa_pagecache_promotion_enabled_attr.attr,
 	NULL,
 };
 
diff --git a/mm/migrate.c b/mm/migrate.c
index a751a995f2d9..97f86ee6fd0d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -44,6 +44,8 @@
 #include <linux/sched/sysctl.h>
 #include <linux/memory-tiers.h>
 #include <linux/pagewalk.h>
+#include <linux/sched/numa_balancing.h>
+#include <linux/task_work.h>
 
 #include <asm/tlbflush.h>
 
@@ -2736,5 +2738,58 @@ int migrate_misplaced_folio_batch(struct list_head *folio_list, int node)
 	BUG_ON(!list_empty(folio_list));
 	return nr_remaining ? -EAGAIN : 0;
 }
+
+/**
+ * promotion_candidate: report a promotion candidate folio
+ *
+ * The folio will be isolated from LRU if selected, and task_work will
+ * putback the folio on promotion failure.
+ *
+ * Candidates may not be promoted and may be returned to the LRU.
+ *
+ * Takes a folio reference that will be released in task work.
+ */
+void promotion_candidate(struct folio *folio)
+{
+	struct task_struct *task = current;
+	struct list_head *promo_list= &task->promo_list;
+	struct callback_head *work = &task->numa_promo_work;
+	int nid = folio_nid(folio);
+	int flags, last_cpupid;
+
+	/* do not migrate toptier folios or in kernel context */
+	if (node_is_toptier(nid) || task->flags & PF_KTHREAD)
+		return;
+
+	/*
+	 * Limit per-syscall migration rate to balancing rate limit. This avoids
+	 * excessive work during large reads knowing that task work is likely to
+	 * hit the rate limit and put excess folios back on the LRU anyway.
+	 */
+	if (task->promo_count >= sysctl_numa_balancing_promote_rate_limit)
+		return;
+
+	/* Isolate the folio to prepare for migration */
+	nid = numa_migrate_check(folio, NULL, 0, &flags, folio_test_dirty(folio),
+				 &last_cpupid);
+	if (nid == NUMA_NO_NODE)
+		return;
+
+	if (migrate_misplaced_folio_prepare(folio, NULL, nid))
+		return;
+
+	/*
+	 * If work is pending, add this folio to the list. Otherwise, ensure
+	 * the task will execute the work, otherwise we can leak folios.
+	 */
+	if (list_empty(promo_list) && task_work_add(task, work, TWA_RESUME)) {
+		folio_putback_lru(folio);
+		return;
+	}
+	list_add_tail(&folio->lru, promo_list);
+	task->promo_count += folio_nr_pages(folio);
+	return;
+}
+EXPORT_SYMBOL(promotion_candidate);
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_NUMA */
diff --git a/mm/swap.c b/mm/swap.c
index 746a5ceba42c..c006be6048f4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -37,6 +37,10 @@
 #include <linux/page_idle.h>
 #include <linux/local_lock.h>
 #include <linux/buffer_head.h>
+#include <linux/migrate.h>
+#include <linux/memory-tiers.h>
+#include <linux/sched/sysctl.h>
+#include <linux/sched/numa_balancing.h>
 
 #include "internal.h"
 
@@ -474,6 +478,10 @@ void folio_mark_accessed(struct folio *folio)
 			__lru_cache_activate_folio(folio);
 		folio_clear_referenced(folio);
 		workingset_activation(folio);
+	} else if (!folio_test_isolated(folio) &&
+		   (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
+		   numa_pagecache_promotion_enabled) {
+		promotion_candidate(folio);
 	}
 	if (folio_test_idle(folio))
 		folio_clear_idle(folio);
-- 
2.47.1




* Re: [PATCH v3 2/6] memory: move conditionally defined enums use inside ifdef tags
  2025-01-07  0:03 ` [PATCH v3 2/6] memory: move conditionally defined enums use inside ifdef tags Gregory Price
@ 2025-01-21  4:33   ` Bharata B Rao
  2025-01-22 18:01     ` Gregory Price
  0 siblings, 1 reply; 14+ messages in thread
From: Bharata B Rao @ 2025-01-21  4:33 UTC (permalink / raw)
  To: gourry
  Cc: abhishekd, akpm, david, donettom, feng.tang, hannes, kbusch,
	kernel-team, linux-doc, linux-kernel, linux-mm, nehagholkar,
	nphamcs, ying.huang

> NUMA_HINT_FAULTS and NUMA_HINT_FAULTS_LOCAL are only defined if
> CONFIG_NUMA_BALANCING is defined, but are used outside the tags in
> numa_migrate_check().  Fix this.
> 
> TNF_SHARED is only used if CONFIG_NUMA_BALANCING is enabled, so
> moving this line inside the ifdef is also safe - despite use of TNF_*
> elsewhere in the function.  TNF_* are not conditionally defined.
> 
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
>  mm/memory.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 9cc93c2f79f3..8d254e97840d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5571,14 +5571,14 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
>  	/* Record the current PID acceesing VMA */
>  	vma_set_access_pid_bit(vma);
>  
> -	count_vm_numa_event(NUMA_HINT_FAULTS);
>  #ifdef CONFIG_NUMA_BALANCING
> +	count_vm_numa_event(NUMA_HINT_FAULTS);
>  	count_memcg_folio_events(folio, NUMA_HINT_FAULTS, 1);
> -#endif
>  	if (folio_nid(folio) == numa_node_id()) {
>  		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
>  		*flags |= TNF_FAULT_LOCAL;
>  	}
> +#endif

I don't think moving count_vm_numa_event() to within
CONFIG_NUMA_BALANCING is necessary as it is defined separately as NOP
for !CONFIG_NUMA_BALANCING.

In fact numa_migrate_check() should be within CONFIG_NUMA_BALANCING as
it should ideally be  called only if NUMA balancing is enabled. The same
could be said for the callers of numa_migrate_check() which are
do_numa_page() and do_huge_pmd_numa_page().

Regards,
Bharata.



* Re: [RFC v3 PATCH 0/5] Promotion of Unmapped Page Cache Folios.
  2025-01-07  0:03 [RFC v3 PATCH 0/5] Promotion of Unmapped Page Cache Folios Gregory Price
                   ` (5 preceding siblings ...)
  2025-01-07  0:03 ` [PATCH v3 6/6] migrate,sysfs: add pagecache promotion Gregory Price
@ 2025-01-22 11:16 ` Huang, Ying
  2025-01-22 16:48   ` Gregory Price
  6 siblings, 1 reply; 14+ messages in thread
From: Huang, Ying @ 2025-01-22 11:16 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-mm, linux-doc, linux-kernel, kernel-team, nehagholkar,
	abhishekd, david, nphamcs, akpm, hannes, kbusch, feng.tang,
	donettom

Hi, Gregory,

Thanks for the patchset and sorry about the late reply.

Gregory Price <gourry@gourry.net> writes:

> Unmapped page cache pages can be demoted to low-tier memory, but
> they can presently only be promoted in two conditions:
>     1) The page is fully swapped out and re-faulted
>     2) The page becomes mapped (and exposed to NUMA hint faults)
>
> This RFC proposes promoting unmapped page cache pages by using
> folio_mark_accessed as a hotness hint for unmapped pages.
>
> We show in a microbenchmark that this mechanism can increase
> performance up to 23.5% compared to leaving page cache on the
> low tier - when that page cache becomes excessively hot.
>
> When disabled (NUMA tiering off), overhead in folio_mark_accessed
> was limited to <1% in a worst case scenario (all work is file_read()).
>
> There is an open question as to how to integrate this into MGLRU,
> as the current design only applies to the traditional LRU.
>
> Patches 1-3
> 	allow NULL as valid input to migration prep interfaces
> 	for vmf/vma - which is not present in unmapped folios.
> Patch 4
> 	adds NUMA_HINT_PAGE_CACHE to vmstat
> Patch 5
> 	Implement migrate_misplaced_folio_batch
> Patch 6
> 	add the promotion mechanism, along with a sysfs
> 	extension which defaults the behavior to off.
> 	/sys/kernel/mm/numa/pagecache_promotion_enabled
>
> v3 Notes
> ===
> - added batch migration interface (migrate_misplaced_folio_batch)
>
> - dropped timestamp check in promotion_candidate (tests showed
>   it did not make a difference and the work is duplicated during
>   the migration process).
>
> - Bug fix from Donet Tom regarding vmstat
>
> - pulled folio_isolated and sysfs switch checks out into
>   folio_mark_accessed because microbenchmark tests showed the
>   function call overhead of promotion_candidate warranted a bit
>   of manual optimization for the scenario where the majority of
>   work is file_read().  This brought the standing overhead from
>   ~7% down to <1% when everything is disabled.
>
> - Limited promotion work list to a number of folios that match
>   the existing promotion rate limit, as microbenchmark demonstrated
>   excessive overhead on a single system-call when significant amounts
>   of memory are read.
>   Before: 128GB read went from 7 seconds to 40 seconds over ~2 rounds.
>   Now:    128GB read went from 7 seconds to ~11 seconds over ~10 rounds.
>
> - switched from list_add to list_add_tail in promotion_candidate, as
>   it was discovered that promoting in non-linear order caused fairly
>   significant overheads (as high as the overhead of running entirely
>   out of CXL) - likely due to poor TLB and prefetch behavior.  Simply
>   switching to list_add_tail all but confirmed this, as the additional
>   ~20% overhead vanished.
>
>   This is likely to only occur on systems with a large amount of
>   contiguous physical memory available on the hot tier, since the
>   allocators are more likely to provide better spatial locality.
>
>
> Test:
> ======
>
> Environment:
>     1.5-3.7GHz CPU, ~4000 BogoMIPS, 
>     1TB Machine with 768GB DRAM and 256GB CXL
>     A 128GB file being linearly read by a single process
>
> Goal:
>    Generate promotions and demonstrate upper-bound on performance
>    overhead and gain/loss. 
>
> System Settings:
>    echo 1 > /sys/kernel/mm/numa/pagecache_promotion_enabled
>    echo 2 > /proc/sys/kernel/numa_balancing
>    
> Test process:
>    In each test, we do a linear read of a 128GB file into a buffer
>    in a loop.

IMHO, the linear reading isn't a very good test case for promotion.  You
cannot test the hot-page selection algorithm.  I think that it's better
to use something like normal accessing pattern.  IIRC, it is available
in fio test suite.

> To allocate the pagecache into CXL, we use mbind prior
>    to the CXL test runs and read the file.  We omit the overhead of
>    allocating the buffer and initializing the memory into CXL from the
>    test runs.
>
>    1) file allocated in DRAM with mechanisms off
>    2) file allocated in DRAM with balancing on but promotion off
>    3) file allocated in DRAM with balancing and promotion on
>       (promotion check is negative because all pages are top tier)
>    4) file allocated in CXL with mechanisms off
>    5) file allocated in CXL with mechanisms on
>
> Each test was run with 50 read cycles and averaged (where relevant)
> to account for system noise.  This number of cycles gives the promotion
> mechanism time to promote the vast majority of memory (usually <1MB
> remaining in worst case).
>
> Tests 2 and 3 test the upper bound on overhead of the new checks when
> there are no pages to migrate but work is dominated by file_read().
>
> |     1     |    2     |     3       |    4     |      5         |
> | DRAM Base | Promo On | TopTier Chk | CXL Base | Post-Promotion |
> |  7.5804   |  7.7586  |   7.9726    |   9.75   |    7.8941      |

For 3, we can check whether the folio is in top-tier as the first step.
Will that introduce measurable overhead?

> Baseline DRAM vs Baseline CXL shows a ~28% overhead just allowing the
> file to remain on CXL, while after promotion, we see the performance
> trend back towards the overhead of the TopTier check time - a total
> overhead reduction of ~84% (or ~5% overhead down from ~23.5%).
>
> During promotion, we do see overhead which eventually tapers off over
> time.  Here is a sample of the first 10 cycles during which promotion
> is the most aggressive, which shows overhead drops off dramatically
> as the majority of memory is migrated to the top tier.
>
> 12.79, 12.52, 12.33, 12.03, 11.81, 11.58, 11.36, 11.1, 8, 7.96
>
> This could be further limited by lowering the promotion rate via the
> existing knob, or by implementing a new knob detached from the existing
> promotion rate limit.  There are merits to both approaches.

Have you tested with the existing knob?  Whether does it help?

> After promotion, turning the mechanism off via sysfs increased the
> overall performance back to the DRAM baseline. The slight (~1%)
> increase between post-migration performance and the baseline mechanism
> overhead check appears to be general variance as similar times were
> observed during the baseline checks on subsequent runs.
>
> The mechanism itself represents a ~2-5% overhead in a worst case
> scenario (all work is file_read() and pages are in DRAM).
>
[snip]

---
Best Regards,
Huang, Ying



* Re: [RFC v3 PATCH 0/5] Promotion of Unmapped Page Cache Folios.
  2025-01-22 11:16 ` [RFC v3 PATCH 0/5] Promotion of Unmapped Page Cache Folios Huang, Ying
@ 2025-01-22 16:48   ` Gregory Price
  2025-01-23  3:46     ` Huang, Ying
  0 siblings, 1 reply; 14+ messages in thread
From: Gregory Price @ 2025-01-22 16:48 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, linux-doc, linux-kernel, kernel-team, nehagholkar,
	abhishekd, david, nphamcs, akpm, hannes, kbusch, feng.tang,
	donettom

On Wed, Jan 22, 2025 at 07:16:03PM +0800, Huang, Ying wrote:
> Hi, Gregory,
> > Test process:
> >    In each test, we do a linear read of a 128GB file into a buffer
> >    in a loop.
> 
> IMHO, the linear reading isn't a very good test case for promotion.  You
> cannot test the hot-page selection algorithm.  I think that it's better
> to use something like normal accessing pattern.  IIRC, it is available
> in fio test suite.
>

Oh yes, I don't plan to drop RFC until I can get a real workload and
probably fio running under this.  This patch set is of varying priority
for me at the moment, so the versions will take some time.  My goal is
to have something a bit more solid by LSF/MM, but not before.

> >    1) file allocated in DRAM with mechanisms off
> >    2) file allocated in DRAM with balancing on but promotion off
> >    3) file allocated in DRAM with balancing and promotion on
> >       (promotion check is negative because all pages are top tier)
> >    4) file allocated in CXL with mechanisms off
> >    5) file allocated in CXL with mechanisms on
> >
> > |     1     |    2     |     3       |    4     |      5         |
> > | DRAM Base | Promo On | TopTier Chk | CXL Base | Post-Promotion |
> > |  7.5804   |  7.7586  |   7.9726    |   9.75   |    7.8941      |
> 
> For 3, we can check whether the folio is in top-tier as the first step.
> Will that introduce measurable overhead?
>

That is basically what 2 vs 3 is doing.

Test 2 shows overhead of TPP on + pagecache promo off
Test 3 shows overhead of TPP+Promo on, but all the memory is on top tier

This shows the check as to whether the folio is in the top tier is
actually somewhat expensive (~5% compared to baseline, ~2.7% compared to
TPP-on Promo-off).

The goal of this linear, simple test is to isolate test behavior from
the overhead - that makes it easy to test each individual variable (TPP,
promo, top tier, etc) and see relative overheads.

This basically gives us a reasonable floor/ceiling of expected overhead.
If we see something wildly different than this during something like FIO
or a real workload, then we'll know we missed something.

> >
> > This could be further limited by limiting the promotion rate via the
> > existing knob, or by implementing a new knob detached from the existing
> > promotion rate.  There are merits to both approach.
> 
> Have you tested with the existing knob?  Whether does it help?
>

Not yet, this fell off my priority list before I could do additional
testing.  I will add that to my backlog.

~Gregory



* Re: [PATCH v3 2/6] memory: move conditionally defined enums use inside ifdef tags
  2025-01-21  4:33   ` Bharata B Rao
@ 2025-01-22 18:01     ` Gregory Price
  2025-01-23  3:07       ` Bharata B Rao
  0 siblings, 1 reply; 14+ messages in thread
From: Gregory Price @ 2025-01-22 18:01 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: abhishekd, akpm, david, donettom, feng.tang, hannes, kbusch,
	kernel-team, linux-doc, linux-kernel, linux-mm, nehagholkar,
	nphamcs, ying.huang

On Tue, Jan 21, 2025 at 10:03:55AM +0530, Bharata B Rao wrote:
> I don't think moving count_vm_numa_event() to within
> CONFIG_NUMA_BALANCING is necessary as it is defined separately as NOP
> for !CONFIG_NUMA_BALANCING.
> 

NUMA_HINT_FAULTS and NUMA_HINT_FAULTS_LOCAL are only defined if
CONFIG_NUMA_BALANCING

include/linux/vm_event_item.h

#ifdef CONFIG_NUMA_BALANCING
                NUMA_PTE_UPDATES,
                NUMA_HUGE_PTE_UPDATES,
                NUMA_HINT_FAULTS,
                NUMA_HINT_FAULTS_LOCAL,
                NUMA_PAGE_MIGRATE,
#endif

> In fact numa_migrate_check() should be within CONFIG_NUMA_BALANCING as
> it should ideally be  called only if NUMA balancing is enabled. The same
> could be said for the callers of numa_migrate_check() which are
> do_numa_page() and do_huge_pmd_numa_page().
> 

Really what I'm reading is that these functions are in the wrong file,
since ifdef spaghetti in *.c files is not encouraged.  These functions
should be moved somewhere else and given stubs if the build option is
off.

> Regards,
> Bharata.



* Re: [PATCH v3 2/6] memory: move conditionally defined enums use inside ifdef tags
  2025-01-22 18:01     ` Gregory Price
@ 2025-01-23  3:07       ` Bharata B Rao
  0 siblings, 0 replies; 14+ messages in thread
From: Bharata B Rao @ 2025-01-23  3:07 UTC (permalink / raw)
  To: Gregory Price
  Cc: abhishekd, akpm, david, donettom, feng.tang, hannes, kbusch,
	kernel-team, linux-doc, linux-kernel, linux-mm, nehagholkar,
	nphamcs, ying.huang

On 22-Jan-25 11:31 PM, Gregory Price wrote:
> On Tue, Jan 21, 2025 at 10:03:55AM +0530, Bharata B Rao wrote:
>> I don't think moving count_vm_numa_event() to within
>> CONFIG_NUMA_BALANCING is necessary as it is defined separately as NOP
>> for !CONFIG_NUMA_BALANCING.
>>
> 
> NUMA_HINT_FAULTS and NUMA_HINT_FAULTS_LOCAL are only defined if
> CONFIG_NUMA_BALANCING
> 
> include/linux/vm_event_item.h
> 
> #ifdef CONFIG_NUMA_BALANCING
>                  NUMA_PTE_UPDATES,
>                  NUMA_HUGE_PTE_UPDATES,
>                  NUMA_HINT_FAULTS,
>                  NUMA_HINT_FAULTS_LOCAL,
>                  NUMA_PAGE_MIGRATE,
> #endif

What I meant is

include/linux/vmstat.h has a definition for count_vm_numa_event() for
!CONFIG_NUMA_BALANCING case like below:

#ifdef CONFIG_NUMA_BALANCING
#define count_vm_numa_event(x)     count_vm_event(x)
#define count_vm_numa_events(x, y) count_vm_events(x, y)
#else
#define count_vm_numa_event(x) do {} while (0)
#define count_vm_numa_events(x, y) do { (void)(y); } while (0)
#endif /* CONFIG_NUMA_BALANCING */

and hence moving count_vm_numa_events(NUMA_HINT_FAULTS) to within 
CONFIG_NUMA_BALANCING section in numa_migrate_check() isn't necessary. 
The current code already compiles fine when CONFIG_NUMA_BALANCING is 
turned off.

> 
>> In fact numa_migrate_check() should be within CONFIG_NUMA_BALANCING as
>> it should ideally be  called only if NUMA balancing is enabled. The same
>> could be said for the callers of numa_migrate_check() which are
>> do_numa_page() and do_huge_pmd_numa_page().
>>
> 
> Really what I'm reading is that these functions are in the wrong file,
> since ifdef spaghetti in *.c files is not encouraged.  These functions
> should be moved somewhere else and given stubs if the build option is
> off.

Yes !CONFIG_NUMA_BALANCING stubs for numa_migrate_check(), 
do_numa_page() and do_huge_pmd_numa_page() would be good.

Regards,
Bharata.



* Re: [RFC v3 PATCH 0/5] Promotion of Unmapped Page Cache Folios.
  2025-01-22 16:48   ` Gregory Price
@ 2025-01-23  3:46     ` Huang, Ying
  2025-01-23 14:55       ` Gregory Price
  0 siblings, 1 reply; 14+ messages in thread
From: Huang, Ying @ 2025-01-23  3:46 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-mm, linux-doc, linux-kernel, kernel-team, nehagholkar,
	abhishekd, david, nphamcs, akpm, hannes, kbusch, feng.tang,
	donettom

Gregory Price <gourry@gourry.net> writes:

> On Wed, Jan 22, 2025 at 07:16:03PM +0800, Huang, Ying wrote:
>> Hi, Gregory,
>> > Test process:
>> >    In each test, we do a linear read of a 128GB file into a buffer
>> >    in a loop.
>> 
>> IMHO, the linear reading isn't a very good test case for promotion.  You
>> cannot test the hot-page selection algorithm.  I think that it's better
>> to use something like normal accessing pattern.  IIRC, it is available
>> in fio test suite.
>>
>
> Oh yes, I don't plan to drop RFC until I can get a real workload and
> probably fio running under this.  This patch set is of varying priority
> for me at the moment, so the versions will take some time.  My goal is
> to have something a bit more solid by LSF/MM, but not before.

No problem.

>> >    1) file allocated in DRAM with mechanisms off
>> >    2) file allocated in DRAM with balancing on but promotion off
>> >    3) file allocated in DRAM with balancing and promotion on
>> >       (promotion check is negative because all pages are top tier)
>> >    4) file allocated in CXL with mechanisms off
>> >    5) file allocated in CXL with mechanisms on
>> >
>> > |     1     |    2     |     3       |    4     |      5         |
>> > | DRAM Base | Promo On | TopTier Chk | CXL Base | Post-Promotion |
>> > |  7.5804   |  7.7586  |   7.9726    |   9.75   |    7.8941      |
>> 
>> For 3, we can check whether the folio is in top-tier as the first step.
>> Will that introduce measurable overhead?
>>
>
> That is basically what 2 vs 3 is doing.
>
> Test 2 shows overhead of TPP on + pagecache promo off
> Test 3 shows overhead of TPP+Promo on, but all the memory is on top tier
>
> This shows the check as to whether the folio is in the top tier is
> actually somewhat expensive (~5% compared to baseline, ~2.7% compared to
> TPP-on Promo-off).

This is unexpected.  Can we try to optimize it?  For example, via using
a nodemask?  node_is_toptier() is used in the mapped pages promotion
too (1 vs. 2 above).  I guess that the optimization can reduce the
overhead there with measurable difference too.

> The goal of this linear, simple test is to isolate test behavior from
> the overhead - that makes it easy to test each individual variable (TPP,
> promo, top tier, etc) and see relative overheads.
>
> This basically gives us a reasonable floor/ceiling of expected overhead.
> If we see something wildly different than this during something like FIO
> or a real workload, then we'll know we missed something.
>
>> >
>> > This could be further limited by limiting the promotion rate via the
>> > existing knob, or by implementing a new knob detached from the existing
>> > promotion rate.  There are merits to both approach.
>> 
>> Have you tested with the existing knob?  Whether does it help?
>>
>
> Not yet, this fell off my priority list before I could do additional
> testing.  I will add that to my backlog.

No problem.

---
Best Regards,
Huang, Ying



* Re: [RFC v3 PATCH 0/5] Promotion of Unmapped Page Cache Folios.
  2025-01-23  3:46     ` Huang, Ying
@ 2025-01-23 14:55       ` Gregory Price
  0 siblings, 0 replies; 14+ messages in thread
From: Gregory Price @ 2025-01-23 14:55 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-mm, linux-doc, linux-kernel, kernel-team, nehagholkar,
	abhishekd, david, nphamcs, akpm, hannes, kbusch, feng.tang,
	donettom

On Thu, Jan 23, 2025 at 11:46:49AM +0800, Huang, Ying wrote:
> Gregory Price <gourry@gourry.net> writes:
> > Test 2 shows overhead of TPP on + pagecache promo off
> > Test 3 shows overhead of TPP+Promo on, but all the memory is on top tier
> >
> > This shows the check as to whether the folio is in the top tier is
> > actually somewhat expensive (~5% compared to baseline, ~2.7% compared to
> > TPP-on Promo-off).
> 
> This is unexpected.  Can we try to optimize it?  For example, via using
> a nodemask?  node_is_toptier() is used in the mapped pages promotion
> too (1 vs. 2 above).  I guess that the optimization can reduce the
> overhead there with measurable difference too.
>

Agreed, it surprised me a bit as well.  But more surprising is the fact
that test 2 was also 2-3% slower, given that it's a simple boolean check
against whether tiering is turned on.

I suppose that since the test is blowing up the cache/tlb by design,
multiple additional cache/tlb misses could cause a non-trivial slowdown,
but it is certainly a small puzzle I haven't dug into yet.

~Gregory


