* [RFC PATCH 0/4] Promotion of Unmapped Page Cache Folios.
@ 2024-11-27 8:21 Gregory Price
2024-11-27 8:21 ` [PATCH 1/4] migrate: Allow migrate_misplaced_folio APIs without a VMA Gregory Price
` (4 more replies)
0 siblings, 5 replies; 9+ messages in thread
From: Gregory Price @ 2024-11-27 8:21 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, nehagholkar, abhishekd, kernel-team, david,
ying.huang, nphamcs, gourry, akpm, hannes, feng.tang, kbusch
Unmapped page cache pages can be demoted to low-tier memory, but
they can presently only be promoted in two conditions:
1) The page is fully swapped out and re-faulted
2) The page becomes mapped (and exposed to NUMA hint faults)
This RFC proposes promoting unmapped page cache pages by using
folio_mark_accessed as a hotness hint for unmapped pages.
Patches 1 & 2
allow NULL as valid input to the migration prep interfaces
for vmf/vma - which are not present for unmapped folios.
Patch 3
adds NUMA_HINT_PAGE_CACHE to vmstat
Patch 4
adds the promotion mechanism, along with a sysfs
extension which defaults the behavior to off.
/sys/kernel/mm/numa/pagecache_promotion_enabled
Functional test showed that we are able to reclaim some performance
in canned scenarios (a file gets demoted and becomes hot with
relatively little contention). See test/overhead section below.
Open Questions:
======
1) Should we also add a limit to how much can be forced onto
a single task's promotion list at any one time? This might
piggy-back on the existing TPP promotion limit (256MB?) and
would simply add something like task->promo_count.
Technically we are limited by the batch read-rate before a
TASK_RESUME occurs.
2) Should we exempt certain forms of folios, or add additional
knobs/levers to deal with things like large folios?
3) We added NUMA_HINT_PAGE_CACHE to differentiate hint faults
so we could validate the behavior works as intended. Should
we just call this a NUMA_HINT_FAULT and not add a new hint?
4) Benchmark suggestions that can pressure 1TB memory. This is
not my typical wheelhouse, so if folks know of a useful
benchmark that can pressure my 1TB (768 DRAM / 256 CXL) setup,
I'd like to add additional measurements here.
Development Notes
=================
During development, we explored the following proposals:
1) directly promoting within folio_mark_accessed (FMA)
Originally suggested by Johannes Weiner
https://lore.kernel.org/all/20240803094715.23900-1-gourry@gourry.net/
This caused deadlocks because the PTL was held in a variety of
cases - in particular during task exit.
It also is incredibly inflexible and causes promotion-on-fault.
It was discussed that a deferral mechanism was preferred.
2) promoting in filemap.c locations (calls of FMA)
Originally proposed by Feng Tang and Ying Huang
https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/patch/?id=5f2e64ce75c0322602c2ec8c70b64bb69b1f1329
First, we saw this as less problematic than directly hooking FMA,
but we realized it has the potential to miss data in a variety of
locations: swap.c, memory.c, gup.c, ksm.c, paddr.c - etc.
Second, we discovered that the lock state of pages is very subtle,
and that these locations in filemap.c can be called in an atomic
context. Prototypes led to a variety of stalls and lockups.
3) a new LRU - originally proposed by Keith Busch
https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/patch/?id=6616afe9a722f6ebedbb27ade3848cf07b9a3af7
There are two issues with this approach: PG_promotable and reclaim.
First - PG_promotable has generally been discouraged.
Second - Attaching this mechanism to an LRU is both backwards and
counter-intuitive. A promotable list is better served by a MOST
recently used list, and since LRUs are generally only shrunk when
exposed to pressure, it would require implementing a new promotion
list shrinker that runs separately from the existing reclaim logic.
4) Adding a separate kthread - suggested by many
This is - to an extent - a more general version of the LRU proposal.
We still have to track the folios - which likely requires the
addition of a page flag. Additionally, this method would actually
contend pretty heavily with LRU behavior - i.e. we'd want to
throttle addition to the promotion candidate list in some scenarios.
5) Doing it in task work
This seemed to be the most realistic after considering the above.
We observe the following:
- FMA is an ideal hook for this and isolation is safe here
- the new promotion_candidate function is an ideal hook for new
filter logic (throttling, fairness, etc).
- isolated folios are either promoted or put back on task resume;
there are no additional concurrency mechanics to worry about
- The mechanic can be made optional via a sysfs hook to avoid
overhead in degenerate scenarios (thrashing).
We also piggy-backed on the numa_hint_fault_latency timestamp to
further throttle promotions, which helps avoid promoting pages that
are only accessed once or twice.
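For reference, the resulting control flow is roughly the sketch below. This
is condensed from patch 4 later in this series (the full set of gating checks,
the write/cpupid handling, and the error paths are abbreviated here), so treat
it as an outline rather than the exact implementation:

	/* Called from folio_mark_accessed() for an already-active folio. */
	void promotion_candidate(struct folio *folio)
	{
		struct task_struct *task = current;
		int flags, last_cpupid, nid;

		/*
		 * Gating (abbreviated): tiering + the sysfs knob enabled,
		 * folio on a lower tier, recently accessed according to
		 * numa_hint_fault_latency(), not already isolated, and not
		 * running in a kthread.
		 */
		if (!numa_pagecache_promotion_enabled ||
		    node_is_toptier(folio_nid(folio)))
			return;

		/* Reuse the NUMA-balancing policy check to pick a target node. */
		nid = numa_migrate_check(folio, NULL, 0, &flags, false, &last_cpupid);
		if (nid == NUMA_NO_NODE)
			return;

		/* Isolate from the LRU now; the migration itself happens later. */
		if (migrate_misplaced_folio_prepare(folio, NULL, nid))
			return;

		/* Queue the task work once, then just chain candidate folios. */
		if (list_empty(&task->promo_list) &&
		    task_work_add(task, &task->numa_promo_work, TWA_RESUME)) {
			folio_putback_lru(folio);
			return;
		}
		list_add(&folio->lru, &task->promo_list);
	}

On return to userspace, task_numa_promotion_work() drains task->promo_list and
calls migrate_misplaced_folio(folio, NULL, numa_node_id()) for each entry, so
promotion always targets the local node of the resuming task.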
Test:
======
Environment:
1.5-3.7GHz CPU, ~4000 BogoMIPS,
1TB Machine with 768GB DRAM and 256GB CXL
A 64GB file being linearly read by 6-7 Python processes
Goal:
Generate promotions. Demonstrate stability and measure overhead.
System Settings:
echo 1 > /sys/kernel/mm/numa/demotion_enabled
echo 1 > /sys/kernel/mm/numa/pagecache_promotion_enabled
echo 2 > /proc/sys/kernel/numa_balancing
Each process took up ~128GB, with anonymous memory growing and
shrinking as python filled and released buffers with the 64GB of data.
This causes DRAM pressure to generate demotions, and file pages to
"become hot" - and therefore be selected for promotion.
First we ran with promotion disabled to show the consistent overhead
that results from forcing a file out to CXL memory. We ran a single
reader to see uncontended performance, launched many readers to force
demotions, then dropped back to a single reader to observe the result.
Single-reader DRAM: ~16.0-16.4s
Single-reader CXL (after demotion): ~16.8-17s
Next we turned promotion on with only a single reader running.
Before promotions:
Node 0 MemFree: 636478112 kB
Node 0 FilePages: 59009156 kB
Node 1 MemFree: 250336004 kB
Node 1 FilePages: 14979628 kB
After promotions:
Node 0 MemFree: 632267268 kB
Node 0 FilePages: 72204968 kB
Node 1 MemFree: 262567056 kB
Node 1 FilePages: 2918768 kB
Single-reader (after_promotion): ~16.5s
Turning the promotion mechanism on when nothing had been demoted
produced no appreciable overhead (memory allocation noise overpowers it).
Read time did not change when promotion was turned off after promotions
had occurred, which implies that the additional overhead is not coming
from the promotion system itself - but likely from other pages still
trapped on the low tier. Either way, this at least demonstrates that the
mechanism is not particularly harmful when there are no pages to promote
- and that it is valuable when a file actually is quite hot.
Notably, it takes some time for the average read loop to come back
down, and there still remain unpromoted file pages trapped in the page
cache. This isn't entirely unexpected; many files may have been
demoted, and they may not be very hot.
Overhead
======
When promotion was turned on, we saw a temporary loop-runtime increase:
before: 16.8s
during:
17.606216192245483
17.375206470489502
17.722095489501953
18.230552434921265
18.20712447166443
18.008254528045654
17.008427381515503
16.851454257965088
16.715774059295654
stable: ~16.5s
We measured overhead with a separate patch that simply measured the
rdtsc value before/after calls in promotion_candidate and task work.
e.g.:
+	start = rdtsc();
	list_for_each_entry_safe(folio, tmp, promo_list, lru) {
		list_del_init(&folio->lru);
		migrate_misplaced_folio(folio, NULL, nid);
+		count++;
	}
+	atomic_long_add(rdtsc() - start, &promo_time);
+	atomic_long_add(count, &promo_count);
numa_migrate_prep: 93 - time(3969867917) count(42576860)
migrate_misplaced_folio_prepare: 491 - time(3433174319) count(6985523)
migrate_misplaced_folio: 1635 - time(11426529980) count(6985523)
Thoughts on a good throttling heuristic would be appreciated here.
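One possible shape - purely illustrative, and tied to open question 1 above -
would be to cap how much isolated-but-unpromoted memory a task can accumulate
before its next TASK_RESUME. The promo_count field and the 256MB cap below are
hypothetical and not part of this series:

	/* Hypothetical per-task cap, checked before isolating a candidate. */
	#define PROMO_PENDING_LIMIT	((256UL << 20) >> PAGE_SHIFT)	/* 256MB of pages */

	static bool promotion_throttled(struct task_struct *task, struct folio *folio)
	{
		/*
		 * task->promo_count would be a new field next to promo_list,
		 * charged here and reset when the task work drains the list.
		 */
		if (task->promo_count + folio_nr_pages(folio) > PROMO_PENDING_LIMIT)
			return true;
		task->promo_count += folio_nr_pages(folio);
		return false;
	}

This would bound the migration work added at task resume, but says nothing
about fairness across tasks or about backing off under memory pressure.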
Suggested-by: Huang Ying <ying.huang@intel.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Keith Busch <kbusch@meta.com>
Suggested-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Gregory Price <gourry@gourry.net>
Gregory Price (4):
migrate: Allow migrate_misplaced_folio APIs without a VMA
memory: allow non-fault migration in numa_migrate_check path
vmstat: add page-cache numa hints
migrate,sysfs: add pagecache promotion
.../ABI/testing/sysfs-kernel-mm-numa | 20 +++++++
include/linux/memory-tiers.h | 2 +
include/linux/migrate.h | 5 ++
include/linux/sched.h | 3 +
include/linux/sched/numa_balancing.h | 5 ++
include/linux/vm_event_item.h | 2 +
init/init_task.c | 1 +
kernel/sched/fair.c | 27 ++++++++-
mm/memory-tiers.c | 27 +++++++++
mm/memory.c | 41 ++++++++-----
mm/mempolicy.c | 25 +++++---
mm/migrate.c | 59 ++++++++++++++++++-
mm/swap.c | 3 +
mm/vmstat.c | 2 +
14 files changed, 196 insertions(+), 26 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 1/4] migrate: Allow migrate_misplaced_folio APIs without a VMA
2024-11-27 8:21 [RFC PATCH 0/4] Promotion of Unmapped Page Cache Folios Gregory Price
@ 2024-11-27 8:21 ` Gregory Price
2024-11-28 11:12 ` Huang, Ying
2024-11-29 6:21 ` Raghavendra K T
2024-11-27 8:21 ` [PATCH 2/4] memory: allow non-fault migration in numa_migrate_check path Gregory Price
` (3 subsequent siblings)
4 siblings, 2 replies; 9+ messages in thread
From: Gregory Price @ 2024-11-27 8:21 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, nehagholkar, abhishekd, kernel-team, david,
ying.huang, nphamcs, gourry, akpm, hannes, feng.tang, kbusch
To migrate unmapped pagecache folios, migrate_misplaced_folio and
migrate_misplaced_folio_prepare must handle folios without VMAs.
migrate_misplaced_folio_prepare checks VMA for exec bits, so allow
a NULL VMA when it does not have a mapping.
migrate_misplaced_folio must call migrate_pages with MIGRATE_SYNC
when in the pagecache path because it is a synchronous context.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
mm/migrate.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index dfb5eba3c522..3b0bd3f21ac3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2632,7 +2632,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
* See folio_likely_mapped_shared() on possible imprecision
* when we cannot easily detect if a folio is shared.
*/
- if ((vma->vm_flags & VM_EXEC) &&
+ if (vma && (vma->vm_flags & VM_EXEC) &&
folio_likely_mapped_shared(folio))
return -EACCES;
--
2.43.0
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 2/4] memory: allow non-fault migration in numa_migrate_check path
2024-11-27 8:21 [RFC PATCH 0/4] Promotion of Unmapped Page Cache Folios Gregory Price
2024-11-27 8:21 ` [PATCH 1/4] migrate: Allow migrate_misplaced_folio APIs without a VMA Gregory Price
@ 2024-11-27 8:21 ` Gregory Price
2024-11-27 8:22 ` [PATCH 3/4] vmstat: add page-cache numa hints Gregory Price
` (2 subsequent siblings)
4 siblings, 0 replies; 9+ messages in thread
From: Gregory Price @ 2024-11-27 8:21 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, nehagholkar, abhishekd, kernel-team, david,
ying.huang, nphamcs, gourry, akpm, hannes, feng.tang, kbusch
numa_migrate_check and mpol_misplaced presume callers are in the
fault path with access to a VMA. To enable migrations from the page
cache, it is preferable to reuse the same migration-prep logic.
Mildly refactor numa_migrate_check and mpol_misplaced so that they may
be called with (vmf = NULL) from non-faulting paths.
Also move the NUMA balancing hint-fault accounting inside the
appropriate ifdef.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
mm/memory.c | 28 ++++++++++++++++------------
mm/mempolicy.c | 25 +++++++++++++++++--------
2 files changed, 33 insertions(+), 20 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 209885a4134f..a373b6ad0b34 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5471,7 +5471,20 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
unsigned long addr, int *flags,
bool writable, int *last_cpupid)
{
- struct vm_area_struct *vma = vmf->vma;
+ if (vmf) {
+ struct vm_area_struct *vma = vmf->vma;
+ const vm_flags_t vmflags = vma->vm_flags;
+
+ /*
+ * Flag if the folio is shared between multiple address spaces.
+ * This is used later when determining whether to group tasks.
+ */
+ if (folio_likely_mapped_shared(folio))
+ *flags |= vmflags & VM_SHARED ? TNF_SHARED : 0;
+
+ /* Record the current PID accessing VMA */
+ vma_set_access_pid_bit(vma);
+ }
/*
* Avoid grouping on RO pages in general. RO pages shouldn't hurt as
@@ -5484,12 +5497,6 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
if (!writable)
*flags |= TNF_NO_GROUP;
- /*
- * Flag if the folio is shared between multiple address spaces. This
- * is later used when determining whether to group tasks together
- */
- if (folio_likely_mapped_shared(folio) && (vma->vm_flags & VM_SHARED))
- *flags |= TNF_SHARED;
/*
* For memory tiering mode, cpupid of slow memory page is used
* to record page access time. So use default value.
@@ -5499,17 +5506,14 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
else
*last_cpupid = folio_last_cpupid(folio);
- /* Record the current PID acceesing VMA */
- vma_set_access_pid_bit(vma);
-
- count_vm_numa_event(NUMA_HINT_FAULTS);
#ifdef CONFIG_NUMA_BALANCING
+ count_vm_numa_event(NUMA_HINT_FAULTS);
count_memcg_folio_events(folio, NUMA_HINT_FAULTS, 1);
-#endif
if (folio_nid(folio) == numa_node_id()) {
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
*flags |= TNF_FAULT_LOCAL;
}
+#endif
return mpol_misplaced(folio, vmf, addr);
}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index bb37cd1a51d8..eb6c97bccea3 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2727,12 +2727,16 @@ static void sp_free(struct sp_node *n)
* mpol_misplaced - check whether current folio node is valid in policy
*
* @folio: folio to be checked
- * @vmf: structure describing the fault
+ * @vmf: structure describing the fault (NULL if called outside fault path)
* @addr: virtual address in @vma for shared policy lookup and interleave policy
+ * Ignored if vmf is NULL.
*
* Lookup current policy node id for vma,addr and "compare to" folio's
- * node id. Policy determination "mimics" alloc_page_vma().
- * Called from fault path where we know the vma and faulting address.
+ * node id - or task's policy node id if vmf is NULL. Policy determination
+ * "mimics" alloc_page_vma().
+ *
+ * vmf must be non-NULL if called from fault path where we know the vma and
+ * faulting address. The PTL must be held by caller if vmf is not NULL.
*
* Return: NUMA_NO_NODE if the page is in a node that is valid for this
* policy, or a suitable node ID to allocate a replacement folio from.
@@ -2744,7 +2748,6 @@ int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
pgoff_t ilx;
struct zoneref *z;
int curnid = folio_nid(folio);
- struct vm_area_struct *vma = vmf->vma;
int thiscpu = raw_smp_processor_id();
int thisnid = numa_node_id();
int polnid = NUMA_NO_NODE;
@@ -2754,18 +2757,24 @@ int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
* Make sure ptl is held so that we don't preempt and we
* have a stable smp processor id
*/
- lockdep_assert_held(vmf->ptl);
- pol = get_vma_policy(vma, addr, folio_order(folio), &ilx);
+ if (vmf) {
+ lockdep_assert_held(vmf->ptl);
+ pol = get_vma_policy(vmf->vma, addr, folio_order(folio), &ilx);
+ } else {
+ pol = get_task_policy(current);
+ }
if (!(pol->flags & MPOL_F_MOF))
goto out;
switch (pol->mode) {
case MPOL_INTERLEAVE:
- polnid = interleave_nid(pol, ilx);
+ polnid = vmf ? interleave_nid(pol, ilx) :
+ interleave_nodes(pol);
break;
case MPOL_WEIGHTED_INTERLEAVE:
- polnid = weighted_interleave_nid(pol, ilx);
+ polnid = vmf ? weighted_interleave_nid(pol, ilx) :
+ weighted_interleave_nodes(pol);
break;
case MPOL_PREFERRED:
--
2.43.0
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 3/4] vmstat: add page-cache numa hints
2024-11-27 8:21 [RFC PATCH 0/4] Promotion of Unmapped Page Cache Folios Gregory Price
2024-11-27 8:21 ` [PATCH 1/4] migrate: Allow migrate_misplaced_folio APIs without a VMA Gregory Price
2024-11-27 8:21 ` [PATCH 2/4] memory: allow non-fault migration in numa_migrate_check path Gregory Price
@ 2024-11-27 8:22 ` Gregory Price
2024-11-27 8:22 ` [PATCH 4/4] migrate,sysfs: add pagecache promotion Gregory Price
2024-11-27 21:18 ` [RFC PATCH 0/4] Promotion of Unmapped Page Cache Folios SeongJae Park
4 siblings, 0 replies; 9+ messages in thread
From: Gregory Price @ 2024-11-27 8:22 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, nehagholkar, abhishekd, kernel-team, david,
ying.huang, nphamcs, gourry, akpm, hannes, feng.tang, kbusch
Count non-page-fault events as page-cache numa hints instead of
fault hints in vmstat.
Signed-off-by: Gregory Price <gourry@gourry.net>
---
include/linux/vm_event_item.h | 2 ++
mm/memory.c | 15 ++++++++++-----
mm/vmstat.c | 2 ++
3 files changed, 14 insertions(+), 5 deletions(-)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index f70d0958095c..9fee15d9ba48 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -63,6 +63,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
NUMA_HUGE_PTE_UPDATES,
NUMA_HINT_FAULTS,
NUMA_HINT_FAULTS_LOCAL,
+ NUMA_HINT_PAGE_CACHE,
+ NUMA_HINT_PAGE_CACHE_LOCAL,
NUMA_PAGE_MIGRATE,
#endif
#ifdef CONFIG_MIGRATION
diff --git a/mm/memory.c b/mm/memory.c
index a373b6ad0b34..35b72a1cfbd5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5507,11 +5507,16 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
*last_cpupid = folio_last_cpupid(folio);
#ifdef CONFIG_NUMA_BALANCING
- count_vm_numa_event(NUMA_HINT_FAULTS);
- count_memcg_folio_events(folio, NUMA_HINT_FAULTS, 1);
- if (folio_nid(folio) == numa_node_id()) {
- count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
- *flags |= TNF_FAULT_LOCAL;
+ if (vmf) {
+ count_vm_numa_event(NUMA_HINT_FAULTS);
+ count_memcg_folio_events(folio, NUMA_HINT_FAULTS, 1);
+ if (folio_nid(folio) == numa_node_id()) {
+ count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
+ *flags |= TNF_FAULT_LOCAL;
+ }
+ } else {
+ count_vm_numa_event(NUMA_HINT_PAGE_CACHE);
+ count_memcg_folio_events(folio, NUMA_HINT_PAGE_CACHE, 1);
}
#endif
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4d016314a56c..bcd9be11e957 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1338,6 +1338,8 @@ const char * const vmstat_text[] = {
"numa_huge_pte_updates",
"numa_hint_faults",
"numa_hint_faults_local",
+ "numa_hint_page_cache",
+ "numa_hint_page_cache_local",
"numa_pages_migrated",
#endif
#ifdef CONFIG_MIGRATION
--
2.43.0
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 4/4] migrate,sysfs: add pagecache promotion
2024-11-27 8:21 [RFC PATCH 0/4] Promotion of Unmapped Page Cache Folios Gregory Price
` (2 preceding siblings ...)
2024-11-27 8:22 ` [PATCH 3/4] vmstat: add page-cache numa hints Gregory Price
@ 2024-11-27 8:22 ` Gregory Price
2024-11-27 21:18 ` [RFC PATCH 0/4] Promotion of Unmapped Page Cache Folios SeongJae Park
4 siblings, 0 replies; 9+ messages in thread
From: Gregory Price @ 2024-11-27 8:22 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, nehagholkar, abhishekd, kernel-team, david,
ying.huang, nphamcs, gourry, akpm, hannes, feng.tang, kbusch
Adds /sys/kernel/mm/numa/pagecache_promotion_enabled.
When page cache lands on lower tiers, there is no way for promotion
to occur unless it becomes memory-mapped and exposed to NUMA hint
faults. Just adding a mechanism to promote pages unconditionally,
however, opens up a significant possibility of performance regressions.
Similar to the `demotion_enabled` sysfs entry, provide a sysfs toggle
to enable and disable page cache promotion. This option will enable
opportunistic promotion of unmapped page cache during syscall access.
This option is intended for operational conditions where demoted page
cache will eventually contain memory which becomes hot - and where
said memory is likely to cause performance issues due to being trapped on
the lower tier of memory.
A page cache folio is considered a promotion candidate when:
0) tiering and pagecache promotion are enabled
1) the folio resides on a node not in the top tier
2) the folio is already marked referenced and active
3) multiple accesses in the (referenced & active) state occur quickly
Since promotion is not safe to execute unconditionally from within
folio_mark_accessed, we defer promotion to a new task_work captured
in the task_struct. This ensures that the task doing the access has
some hand in promoting pages - even among deduplicated read-only files.
We use numa_hint_fault_latency to help identify when a folio is accessed
multiple times in a short period. Along with folio flag checks, this
helps us minimize promoting pages on the first few accesses.
The promotion node is always the local node of the promoting cpu.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
.../ABI/testing/sysfs-kernel-mm-numa | 20 +++++++
include/linux/memory-tiers.h | 2 +
include/linux/migrate.h | 4 ++
include/linux/sched.h | 3 +
include/linux/sched/numa_balancing.h | 5 ++
init/init_task.c | 1 +
kernel/sched/fair.c | 26 ++++++++-
mm/memory-tiers.c | 27 +++++++++
mm/migrate.c | 56 +++++++++++++++++++
mm/swap.c | 3 +
10 files changed, 146 insertions(+), 1 deletion(-)
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-numa b/Documentation/ABI/testing/sysfs-kernel-mm-numa
index 77e559d4ed80..b846e7d80cba 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-numa
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-numa
@@ -22,3 +22,23 @@ Description: Enable/disable demoting pages during reclaim
the guarantees of cpusets. This should not be enabled
on systems which need strict cpuset location
guarantees.
+
+What: /sys/kernel/mm/numa/pagecache_promotion_enabled
+Date: November 2024
+Contact: Linux memory management mailing list <linux-mm@kvack.org>
+Description: Enable/disable promoting pages during file access
+
+ Page migration during file access is intended for systems
+ with tiered memory configurations that have significant
+ unmapped file cache usage. By default, file cache memory
+ on slower tiers will not be opportunistically promoted by
+ normal NUMA hint faults, because the system has no way to
+ track them. This option enables opportunistic promotion
+ of pages that are accessed via syscall (e.g. read/write)
+ if multiple accesses occur in quick succession.
+
+ It may move data to a NUMA node that does not fall into
+ the cpuset of the allocating process which might be
+ construed to violate the guarantees of cpusets. This
+ should not be enabled on systems which need strict cpuset
+ location guarantees.
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 0dc0cf2863e2..fa96a67b8996 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -37,6 +37,7 @@ struct access_coordinate;
#ifdef CONFIG_NUMA
extern bool numa_demotion_enabled;
+extern bool numa_pagecache_promotion_enabled;
extern struct memory_dev_type *default_dram_type;
extern nodemask_t default_dram_nodes;
struct memory_dev_type *alloc_memory_type(int adistance);
@@ -76,6 +77,7 @@ static inline bool node_is_toptier(int node)
#else
#define numa_demotion_enabled false
+#define numa_pagecache_promotion_enabled false
#define default_dram_type NULL
#define default_dram_nodes NODE_MASK_NONE
/*
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 002e49b2ebd9..c288c16b1311 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -146,6 +146,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
struct vm_area_struct *vma, int node);
int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
int node);
+void promotion_candidate(struct folio *folio);
#else
static inline int migrate_misplaced_folio_prepare(struct folio *folio,
struct vm_area_struct *vma, int node)
@@ -157,6 +158,9 @@ static inline int migrate_misplaced_folio(struct folio *folio,
{
return -EAGAIN; /* can't migrate now */
}
+static inline void promotion_candidate(struct folio *folio)
+{
+}
#endif /* CONFIG_NUMA_BALANCING */
#ifdef CONFIG_MIGRATION
diff --git a/include/linux/sched.h b/include/linux/sched.h
index bb343136ddd0..8ddd4986e57f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1353,6 +1353,9 @@ struct task_struct {
unsigned long numa_faults_locality[3];
unsigned long numa_pages_migrated;
+
+ struct callback_head numa_promo_work;
+ struct list_head promo_list;
#endif /* CONFIG_NUMA_BALANCING */
#ifdef CONFIG_RSEQ
diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index 52b22c5c396d..cc7750d754ff 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -32,6 +32,7 @@ extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p, bool final);
bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
int src_nid, int dst_cpu);
+int numa_hint_fault_latency(struct folio *folio);
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
@@ -52,6 +53,10 @@ static inline bool should_numa_migrate_memory(struct task_struct *p,
{
return true;
}
+static inline int numa_hint_fault_latency(struct folio *folio)
+{
+ return 0;
+}
#endif
#endif /* _LINUX_SCHED_NUMA_BALANCING_H */
diff --git a/init/init_task.c b/init/init_task.c
index 136a8231355a..ee33e508067e 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -186,6 +186,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
.numa_preferred_nid = NUMA_NO_NODE,
.numa_group = NULL,
.numa_faults = NULL,
+ .promo_list = LIST_HEAD_INIT(init_task.promo_list),
#endif
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
.kasan_depth = 1,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d16c8545c71..34d66faa50f9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -42,6 +42,7 @@
#include <linux/interrupt.h>
#include <linux/memory-tiers.h>
#include <linux/mempolicy.h>
+#include <linux/migrate.h>
#include <linux/mutex_api.h>
#include <linux/profile.h>
#include <linux/psi.h>
@@ -1842,7 +1843,7 @@ static bool pgdat_free_space_enough(struct pglist_data *pgdat)
* The smaller the hint page fault latency, the higher the possibility
* for the page to be hot.
*/
-static int numa_hint_fault_latency(struct folio *folio)
+int numa_hint_fault_latency(struct folio *folio)
{
int last_time, time;
@@ -3528,6 +3529,27 @@ static void task_numa_work(struct callback_head *work)
}
}
+static void task_numa_promotion_work(struct callback_head *work)
+{
+ struct task_struct *p = current;
+ struct list_head *promo_list = &p->promo_list;
+ struct folio *folio, *tmp;
+ int nid = numa_node_id();
+
+ SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_promo_work));
+
+ work->next = work;
+
+ if (list_empty(promo_list))
+ return;
+
+ list_for_each_entry_safe(folio, tmp, promo_list, lru) {
+ list_del_init(&folio->lru);
+ migrate_misplaced_folio(folio, NULL, nid);
+ }
+}
+
+
void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
{
int mm_users = 0;
@@ -3552,8 +3574,10 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
RCU_INIT_POINTER(p->numa_group, NULL);
p->last_task_numa_placement = 0;
p->last_sum_exec_runtime = 0;
+ INIT_LIST_HEAD(&p->promo_list);
init_task_work(&p->numa_work, task_numa_work);
+ init_task_work(&p->numa_promo_work, task_numa_promotion_work);
/* New address space, reset the preferred nid */
if (!(clone_flags & CLONE_VM)) {
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index fc14fe53e9b7..4c44598e485e 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -935,6 +935,7 @@ static int __init memory_tier_init(void)
subsys_initcall(memory_tier_init);
bool numa_demotion_enabled = false;
+bool numa_pagecache_promotion_enabled;
#ifdef CONFIG_MIGRATION
#ifdef CONFIG_SYSFS
@@ -957,11 +958,37 @@ static ssize_t demotion_enabled_store(struct kobject *kobj,
return count;
}
+static ssize_t pagecache_promotion_enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sysfs_emit(buf, "%s\n",
+ numa_pagecache_promotion_enabled ? "true" : "false");
+}
+
+static ssize_t pagecache_promotion_enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ ssize_t ret;
+
+ ret = kstrtobool(buf, &numa_pagecache_promotion_enabled);
+ if (ret)
+ return ret;
+
+ return count;
+}
+
+
static struct kobj_attribute numa_demotion_enabled_attr =
__ATTR_RW(demotion_enabled);
+static struct kobj_attribute numa_pagecache_promotion_enabled_attr =
+ __ATTR_RW(pagecache_promotion_enabled);
+
static struct attribute *numa_attrs[] = {
&numa_demotion_enabled_attr.attr,
+ &numa_pagecache_promotion_enabled_attr.attr,
NULL,
};
diff --git a/mm/migrate.c b/mm/migrate.c
index 3b0bd3f21ac3..2cd9faed6ab8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -44,6 +44,8 @@
#include <linux/sched/sysctl.h>
#include <linux/memory-tiers.h>
#include <linux/pagewalk.h>
+#include <linux/sched/numa_balancing.h>
+#include <linux/task_work.h>
#include <asm/tlbflush.h>
@@ -2711,5 +2713,59 @@ int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
BUG_ON(!list_empty(&migratepages));
return nr_remaining ? -EAGAIN : 0;
}
+
+/**
+ * promotion_candidate() - report a promotion candidate folio
+ *
+ * @folio: The folio reported as a candidate
+ *
+ * Records folio access time and places the folio on the task promotion list
+ * if access time is less than the threshold. The folio will be isolated from
+ * LRU if selected, and task_work will putback the folio on promotion failure.
+ *
+ * Takes a folio reference that will be released in task work.
+ */
+void promotion_candidate(struct folio *folio)
+{
+ struct task_struct *task = current;
+ struct list_head *promo_list = &task->promo_list;
+ struct callback_head *work = &task->numa_promo_work;
+ struct address_space *mapping = folio_mapping(folio);
+ bool write = mapping ? mapping->gfp_mask & __GFP_WRITE : false;
+ int nid = folio_nid(folio);
+ int flags, last_cpupid;
+
+ /*
+ * Only do this work if:
+ * 1) tiering and pagecache promotion are enabled
+ * 2) the page can actually be promoted
+ * 3) The hint-fault latency is relatively hot
+ * 4) the folio is not already isolated
+ * 5) This is not a kernel thread context
+ */
+ if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
+ !numa_pagecache_promotion_enabled ||
+ node_is_toptier(nid) ||
+ numa_hint_fault_latency(folio) >= PAGE_ACCESS_TIME_MASK ||
+ folio_test_isolated(folio) ||
+ (current->flags & PF_KTHREAD)) {
+ return;
+ }
+
+ nid = numa_migrate_check(folio, NULL, 0, &flags, write, &last_cpupid);
+ if (nid == NUMA_NO_NODE)
+ return;
+
+ if (migrate_misplaced_folio_prepare(folio, NULL, nid))
+ return;
+
+ /* Ensure task can schedule work, otherwise we'll leak folios */
+ if (list_empty(promo_list) && task_work_add(task, work, TWA_RESUME)) {
+ folio_putback_lru(folio);
+ return;
+ }
+ list_add(&folio->lru, promo_list);
+}
+EXPORT_SYMBOL(promotion_candidate);
#endif /* CONFIG_NUMA_BALANCING */
#endif /* CONFIG_NUMA */
diff --git a/mm/swap.c b/mm/swap.c
index 10decd9dffa1..9cf4c1f73fe5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -37,6 +37,7 @@
#include <linux/page_idle.h>
#include <linux/local_lock.h>
#include <linux/buffer_head.h>
+#include <linux/migrate.h>
#include "internal.h"
@@ -453,6 +454,8 @@ void folio_mark_accessed(struct folio *folio)
__lru_cache_activate_folio(folio);
folio_clear_referenced(folio);
workingset_activation(folio);
+ } else {
+ promotion_candidate(folio);
}
if (folio_test_idle(folio))
folio_clear_idle(folio);
--
2.43.0
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH 0/4] Promotion of Unmapped Page Cache Folios.
2024-11-27 8:21 [RFC PATCH 0/4] Promotion of Unmapped Page Cache Folios Gregory Price
` (3 preceding siblings ...)
2024-11-27 8:22 ` [PATCH 4/4] migrate,sysfs: add pagecache promotion Gregory Price
@ 2024-11-27 21:18 ` SeongJae Park
4 siblings, 0 replies; 9+ messages in thread
From: SeongJae Park @ 2024-11-27 21:18 UTC (permalink / raw)
To: Gregory Price
Cc: SeongJae Park, linux-mm, linux-kernel, nehagholkar, abhishekd,
kernel-team, david, ying.huang, nphamcs, akpm, hannes, feng.tang,
kbusch, damon
Hello,
On Wed, 27 Nov 2024 03:21:57 -0500 Gregory Price <gourry@gourry.net> wrote:
> Unmapped page cache pages can be demoted to low-tier memory, but
> they can presently only be promoted in two conditions:
> 1) The page is fully swapped out and re-faulted
> 2) The page becomes mapped (and exposed to NUMA hint faults)
>
> This RFC proposes promoting unmapped page cache pages by using
> folio_mark_accessed as a hotness hint for unmapped pages.
Adding my thoughts, which are humble and biased as I am a DAMON maintainer.
The thoughts are about the general problem, not about this patchset. So
please treat what follows as a sort of thinking out loud, or "btw, I use
DAMON", and ignore it when discussing this specific patch series.
DAMON's access check mechanisms use PG_idle which is updated by
folio_mark_accessed(), and hence DAMON can monitor access to unmapped pages.
DAMON also supports migrating pages of specific access pattern to an arbitrary
NUMA node. So, promoting unmapped page cache folios using DAMON might be
another way.
More specifically, users could use DAMON alone for both promotion and demotion
of every page, like HMSDK[1] does, or only for promotion of unmapped pages. I
think the former idea might work given previous test results, and I previously
proposed an idea[2] to make it more general (>2 tiers) and easy to tune using
the DAMOS quota auto-tuning feature. All features needed for the proposed
idea[2] are available starting with v6.11. For the latter idea, though, I'm
not sure how beneficial it would be, or whether it makes sense at all.
For people who might be interested in it, or just in how DAMON can be used for
such an unusual idea, I posted an RFC patch[3] that makes DAMON usable for this
use case. For easy testing by anyone who is interested, I also pushed the DAMON
user-space tool's support for the new filter to a temporary branch[4]. The
temporary branch[4] might be erased later.
Note that I haven't tested either of the two changes for the unmapped-pages-only
promotion idea, and have no ETA for any test. These are only for concept-level
idea sharing.
[...]
> During development, we explored the following proposals:
[...]
> 4) Adding a separate kthread - suggested by many
>
> This is - to an extent - a more general version of the LRU proposal.
> We still have to track the folios - which likely requires the
> addition of a page flag. Additionally, this method would actually
> contend pretty heavily with LRU behavior - i.e. we'd want to
> throttle addition to the promotion candidate list in some scenarios.
DAMON runs on a separate kthread, so a DAMON-based approach may be categorized
under this one.
[1] https://github.com/skhynix/hmsdk/wiki/Capacity-Expansion
[2] https://lore.kernel.org/damon/20231112195602.61525-1-sj@kernel.org/
[3] https://lore.kernel.org/20241127205624.86986-1-sj@kernel.org
[4] https://github.com/damonitor/damo/commit/32186d710355ef0dec55e3c6bd398fadeb9d136f
Thanks,
SJ
[...]
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/4] migrate: Allow migrate_misplaced_folio APIs without a VMA
2024-11-27 8:21 ` [PATCH 1/4] migrate: Allow migrate_misplaced_folio APIs without a VMA Gregory Price
@ 2024-11-28 11:12 ` Huang, Ying
2024-12-02 15:47 ` Gregory Price
2024-11-29 6:21 ` Raghavendra K T
1 sibling, 1 reply; 9+ messages in thread
From: Huang, Ying @ 2024-11-28 11:12 UTC (permalink / raw)
To: Gregory Price
Cc: linux-mm, linux-kernel, nehagholkar, abhishekd, kernel-team,
david, ying.huang, nphamcs, akpm, hannes, feng.tang, kbusch
Hi, Gregory,
Gregory Price <gourry@gourry.net> writes:
> To migrate unmapped pagecache folios, migrate_misplaced_folio and
> migrate_misplaced_folio_prepare must handle folios without VMAs.
IMHO, it's better to use migrate_misplaced_folio() instead of
migrate_misplaced_folio for readability in patch title and description.
> migrate_misplaced_folio_prepare checks VMA for exec bits, so allow
> a NULL VMA when it does not have a mapping.
>
> migrate_misplaced_folio must call migrate_pages with MIGRATE_SYNC
> when in the pagecache path because it is a synchronous context.
I can't find the corresponding implementation for this. And, I don't
think it's a good idea to change from MIGRATE_ASYNC to MIGRATE_SYNC.
This may cause excessively long page access latency for page placement
optimization. The downside may offset the benefit.
And, it appears that we can delete the "vma" parameter of
migrate_misplaced_folio() because it's not used now. This is a trivial
code cleanup.
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
> mm/migrate.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index dfb5eba3c522..3b0bd3f21ac3 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2632,7 +2632,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
> * See folio_likely_mapped_shared() on possible imprecision
> * when we cannot easily detect if a folio is shared.
> */
> - if ((vma->vm_flags & VM_EXEC) &&
> + if (vma && (vma->vm_flags & VM_EXEC) &&
> folio_likely_mapped_shared(folio))
> return -EACCES;
---
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/4] migrate: Allow migrate_misplaced_folio APIs without a VMA
2024-11-27 8:21 ` [PATCH 1/4] migrate: Allow migrate_misplaced_folio APIs without a VMA Gregory Price
2024-11-28 11:12 ` Huang, Ying
@ 2024-11-29 6:21 ` Raghavendra K T
1 sibling, 0 replies; 9+ messages in thread
From: Raghavendra K T @ 2024-11-29 6:21 UTC (permalink / raw)
To: Gregory Price, linux-mm
Cc: linux-kernel, nehagholkar, abhishekd, kernel-team, david,
ying.huang, nphamcs, akpm, hannes, feng.tang, kbusch
On 11/27/2024 1:51 PM, Gregory Price wrote:
> To migrate unmapped pagecache folios, migrate_misplaced_folio and
> migrate_misplaced_folio_prepare must handle folios without VMAs.
>
> migrate_misplaced_folio_prepare checks VMA for exec bits, so allow
> a NULL VMA when it does not have a mapping.
>
> migrate_misplaced_folio must call migrate_pages with MIGRATE_SYNC
> when in the pagecache path because it is a synchronous context.
>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
> mm/migrate.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index dfb5eba3c522..3b0bd3f21ac3 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2632,7 +2632,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
> * See folio_likely_mapped_shared() on possible imprecision
> * when we cannot easily detect if a folio is shared.
> */
> - if ((vma->vm_flags & VM_EXEC) &&
> + if (vma && (vma->vm_flags & VM_EXEC) &&
> folio_likely_mapped_shared(folio))
> return -EACCES;
>
Thanks for this patch.
This would be helpful in the case of independent page scanning
algorithms where we do not have a VMA associated with the folio.
Hopefully it can be taken into the tree independently.
Feel free to add.
Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/4] migrate: Allow migrate_misplaced_folio APIs without a VMA
2024-11-28 11:12 ` Huang, Ying
@ 2024-12-02 15:47 ` Gregory Price
0 siblings, 0 replies; 9+ messages in thread
From: Gregory Price @ 2024-12-02 15:47 UTC (permalink / raw)
To: Huang, Ying
Cc: linux-mm, linux-kernel, nehagholkar, abhishekd, kernel-team,
david, ying.huang, nphamcs, akpm, hannes, feng.tang, kbusch
On Thu, Nov 28, 2024 at 07:12:11PM +0800, Huang, Ying wrote:
> Hi, Gregory,
>
> Gregory Price <gourry@gourry.net> writes:
>
> > To migrate unmapped pagecache folios, migrate_misplaced_folio and
> > migrate_misplaced_folio_prepare must handle folios without VMAs.
>
> IMHO, it's better to use migrate_misplaced_folio() instead of
> migrate_misplaced_folio for readability in patch title and description.
>
> > migrate_misplaced_folio_prepare checks VMA for exec bits, so allow
> > a NULL VMA when it does not have a mapping.
> >
> > migrate_misplaced_folio must call migrate_pages with MIGRATE_SYNC
> > when in the pagecache path because it is a synchronous context.
>
> I don't find the corresponding implementation for this. And, I don't
> think it's a good idea to change from MIGRATE_ASYNC to MIGRATE_SYNC.
> This may cause too long page access latency for page placement
> optimization. The downside may offset the benefit.
>
> And, it appears that we can delete the "vma" parameter of
> migrate_misplaced_folio() because it's not used now. This is a trivial
> code cleanup.
>
This patch apparently got a bit away from me and was heavily reduced
from its initial form. This commit message is just wrong now. I will
update this and the 2nd commit and probably submit them separately.
~Gregory
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2024-12-02 15:48 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-27 8:21 [RFC PATCH 0/4] Promotion of Unmapped Page Cache Folios Gregory Price
2024-11-27 8:21 ` [PATCH 1/4] migrate: Allow migrate_misplaced_folio APIs without a VMA Gregory Price
2024-11-28 11:12 ` Huang, Ying
2024-12-02 15:47 ` Gregory Price
2024-11-29 6:21 ` Raghavendra K T
2024-11-27 8:21 ` [PATCH 2/4] memory: allow non-fault migration in numa_migrate_check path Gregory Price
2024-11-27 8:22 ` [PATCH 3/4] vmstat: add page-cache numa hints Gregory Price
2024-11-27 8:22 ` [PATCH 4/4] migrate,sysfs: add pagecache promotion Gregory Price
2024-11-27 21:18 ` [RFC PATCH 0/4] Promotion of Unmapped Page Cache Folios SeongJae Park