* [RFC v2 PATCH 1/5] migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA.
2024-12-10 21:37 [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios Gregory Price
@ 2024-12-10 21:37 ` Gregory Price
2024-12-10 21:37 ` [RFC v2 PATCH 2/5] memory: move conditionally defined enums use inside ifdef tags Gregory Price
` (4 subsequent siblings)
5 siblings, 0 replies; 27+ messages in thread
From: Gregory Price @ 2024-12-10 21:37 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, nehagholkar, abhishekd, kernel-team, david,
nphamcs, gourry, akpm, hannes, kbusch, ying.huang
migrate_misplaced_folio_prepare() may be called on a folio without
a VMA, and so it must be made to accept a NULL VMA.
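For context, a minimal sketch of the kind of VMA-less caller this change
enables (modeled on the unmapped page cache promotion path later in this
series; the wrapper function name is illustrative, not part of any patch):

#include <linux/migrate.h>

/*
 * Illustrative helper only: promote an unmapped page cache folio to
 * node 'nid'. There is no VMA in this path, so NULL is passed and the
 * VM_EXEC/shared-mapping check is simply skipped.
 */
static int promote_unmapped_folio(struct folio *folio, int nid)
{
        int err;

        err = migrate_misplaced_folio_prepare(folio, NULL, nid);
        if (err)
                return err;     /* folio not isolated / not a candidate */

        return migrate_misplaced_folio(folio, nid);
}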
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
mm/migrate.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index e9e00d1d1d19..af07b399060b 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2632,7 +2632,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
* See folio_likely_mapped_shared() on possible imprecision
* when we cannot easily detect if a folio is shared.
*/
- if ((vma->vm_flags & VM_EXEC) &&
+ if (vma && (vma->vm_flags & VM_EXEC) &&
folio_likely_mapped_shared(folio))
return -EACCES;
--
2.43.0
* [RFC v2 PATCH 2/5] memory: move conditionally defined enums use inside ifdef tags
2024-12-10 21:37 [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios Gregory Price
2024-12-10 21:37 ` [RFC v2 PATCH 1/5] migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA Gregory Price
@ 2024-12-10 21:37 ` Gregory Price
2024-12-27 10:34 ` Donet Tom
2024-12-10 21:37 ` [RFC v2 PATCH 3/5] memory: allow non-fault migration in numa_migrate_check path Gregory Price
` (3 subsequent siblings)
5 siblings, 1 reply; 27+ messages in thread
From: Gregory Price @ 2024-12-10 21:37 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, nehagholkar, abhishekd, kernel-team, david,
nphamcs, gourry, akpm, hannes, kbusch, ying.huang
NUMA_HINT_FAULTS and NUMA_HINT_FAULTS_LOCAL are only defined if
CONFIG_NUMA_BALANCING is defined, but are used outside the tags in
numa_migrate_check(). Fix this.
TNF_SHARED is only used if CONFIG_NUMA_BALANCING is enabled, so
moving this line inside the ifdef is also safe - despite use of TNF_*
elsewhere in the function. TNF_* are not conditionally defined.
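For reference, the counters in question only exist when CONFIG_NUMA_BALANCING
is set; a stripped-down illustration (abridged from
include/linux/vm_event_item.h, not verbatim - the real enum has many more
entries):

enum vm_event_item {
        PGPGIN,
        PGPGOUT,
#ifdef CONFIG_NUMA_BALANCING
        NUMA_HINT_FAULTS,
        NUMA_HINT_FAULTS_LOCAL,
        NUMA_PAGE_MIGRATE,
#endif
        NR_VM_EVENT_ITEMS
};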
Signed-off-by: Gregory Price <gourry@gourry.net>
---
mm/memory.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 83fd35c034d7..6ad7616918c4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5573,14 +5573,14 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
/* Record the current PID acceesing VMA */
vma_set_access_pid_bit(vma);
- count_vm_numa_event(NUMA_HINT_FAULTS);
#ifdef CONFIG_NUMA_BALANCING
+ count_vm_numa_event(NUMA_HINT_FAULTS);
count_memcg_folio_events(folio, NUMA_HINT_FAULTS, 1);
-#endif
if (folio_nid(folio) == numa_node_id()) {
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
*flags |= TNF_FAULT_LOCAL;
}
+#endif
return mpol_misplaced(folio, vmf, addr);
}
--
2.43.0
* Re: [RFC v2 PATCH 2/5] memory: move conditionally defined enums use inside ifdef tags
2024-12-10 21:37 ` [RFC v2 PATCH 2/5] memory: move conditionally defined enums use inside ifdef tags Gregory Price
@ 2024-12-27 10:34 ` Donet Tom
2024-12-27 15:42 ` Gregory Price
0 siblings, 1 reply; 27+ messages in thread
From: Donet Tom @ 2024-12-27 10:34 UTC (permalink / raw)
To: Gregory Price, linux-mm
Cc: linux-kernel, nehagholkar, abhishekd, kernel-team, david,
nphamcs, akpm, hannes, kbusch, ying.huang
On 12/11/24 03:07, Gregory Price wrote:
> NUMA_HINT_FAULTS and NUMA_HINT_FAULTS_LOCAL are only defined if
> CONFIG_NUMA_BALANCING is defined, but are used outside the tags in
> numa_migrate_check(). Fix this.
>
> TNF_SHARED is only used if CONFIG_NUMA_BALANCING is enabled, so
> moving this line inside the ifdef is also safe - despite use of TNF_*
> elsewhere in the function. TNF_* are not conditionally defined.
>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
> mm/memory.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 83fd35c034d7..6ad7616918c4 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5573,14 +5573,14 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
> /* Record the current PID acceesing VMA */
> vma_set_access_pid_bit(vma);
>
> - count_vm_numa_event(NUMA_HINT_FAULTS);
> #ifdef CONFIG_NUMA_BALANCING
IIUC, do_huge_pmd_numa_page() and do_numa_page() are executed only if
CONFIG_NUMA_BALANCING is enabled (pte_protnone() and pmd_protnone()
return 0 if CONFIG_NUMA_BALANCING is disabled).
Given this, do we still need the #ifdef?
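(The stubs referred to here look roughly like this when CONFIG_NUMA_BALANCING
is not set - paraphrased from include/linux/pgtable.h, not verbatim:)

#ifndef CONFIG_NUMA_BALANCING
static inline int pte_protnone(pte_t pte)
{
        return 0;
}

static inline int pmd_protnone(pmd_t pmd)
{
        return 0;
}
#endif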
> + count_vm_numa_event(NUMA_HINT_FAULTS);
> count_memcg_folio_events(folio, NUMA_HINT_FAULTS, 1);
> -#endif
> if (folio_nid(folio) == numa_node_id()) {
> count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
> *flags |= TNF_FAULT_LOCAL;
> }
> +#endif
>
> return mpol_misplaced(folio, vmf, addr);
> }
* Re: [RFC v2 PATCH 2/5] memory: move conditionally defined enums use inside ifdef tags
2024-12-27 10:34 ` Donet Tom
@ 2024-12-27 15:42 ` Gregory Price
2024-12-29 14:49 ` Donet Tom
0 siblings, 1 reply; 27+ messages in thread
From: Gregory Price @ 2024-12-27 15:42 UTC (permalink / raw)
To: Donet Tom
Cc: Gregory Price, linux-mm, linux-kernel, nehagholkar, abhishekd,
kernel-team, david, nphamcs, akpm, hannes, kbusch, ying.huang
On Fri, Dec 27, 2024 at 04:04:05PM +0530, Donet Tom wrote:
>
> On 12/11/24 03:07, Gregory Price wrote:
> > NUMA_HINT_FAULTS and NUMA_HINT_FAULTS_LOCAL are only defined if
> > CONFIG_NUMA_BALANCING is defined, but are used outside the tags in
> > numa_migrate_check(). Fix this.
> >
> > TNF_SHARED is only used if CONFIG_NUMA_BALANCING is enabled, so
> > moving this line inside the ifdef is also safe - despite use of TNF_*
> > elsewhere in the function. TNF_* are not conditionally defined.
> >
> > Signed-off-by: Gregory Price <gourry@gourry.net>
> > ---
> > mm/memory.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 83fd35c034d7..6ad7616918c4 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5573,14 +5573,14 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
> > /* Record the current PID acceesing VMA */
> > vma_set_access_pid_bit(vma);
> > - count_vm_numa_event(NUMA_HINT_FAULTS);
> > #ifdef CONFIG_NUMA_BALANCING
>
> IIUC, do_huge_pmd_numa_page() and do_numa_page() are executed only if
> CONFIG_NUMA_BALANCING is enabled (pte_protnone() and pmd_protnone()
> return 0 if CONFIG_NUMA_BALANCING is disabled).
>
> Given this, do we still need the #ifdef?
>
the NUMA_HINT_FAULTS stuff is only defined if CONFIG_NUMA_BALANCING is
built.
The ifdefs around some of this code are a bit inconsistent; it's
probably worth a separate patch to try to clean it up.
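(For what it's worth, the stray count_vm_numa_event() call outside the
guard still builds with CONFIG_NUMA_BALANCING=n because the helper
compiles away in that configuration - roughly, paraphrasing
include/linux/vmstat.h, not verbatim:)

#ifdef CONFIG_NUMA_BALANCING
#define count_vm_numa_event(x)          count_vm_event(x)
#define count_vm_numa_events(x, y)      count_vm_events(x, y)
#else
#define count_vm_numa_event(x)          do {} while (0)
#define count_vm_numa_events(x, y)      do { (void)(y); } while (0)
#endif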
~Gregory
* Re: [RFC v2 PATCH 2/5] memory: move conditionally defined enums use inside ifdef tags
2024-12-27 15:42 ` Gregory Price
@ 2024-12-29 14:49 ` Donet Tom
0 siblings, 0 replies; 27+ messages in thread
From: Donet Tom @ 2024-12-29 14:49 UTC (permalink / raw)
To: Gregory Price
Cc: linux-mm, linux-kernel, nehagholkar, abhishekd, kernel-team,
david, nphamcs, akpm, hannes, kbusch, ying.huang
On 12/27/24 21:12, Gregory Price wrote:
> On Fri, Dec 27, 2024 at 04:04:05PM +0530, Donet Tom wrote:
>> On 12/11/24 03:07, Gregory Price wrote:
>>> NUMA_HINT_FAULTS and NUMA_HINT_FAULTS_LOCAL are only defined if
>>> CONFIG_NUMA_BALANCING is defined, but are used outside the tags in
>>> numa_migrate_check(). Fix this.
>>>
>>> TNF_SHARED is only used if CONFIG_NUMA_BALANCING is enabled, so
>>> moving this line inside the ifdef is also safe - despite use of TNF_*
>>> elsewhere in the function. TNF_* are not conditionally defined.
>>>
>>> Signed-off-by: Gregory Price <gourry@gourry.net>
>>> ---
>>> mm/memory.c | 4 ++--
>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 83fd35c034d7..6ad7616918c4 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -5573,14 +5573,14 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
>>> /* Record the current PID acceesing VMA */
>>> vma_set_access_pid_bit(vma);
>>> - count_vm_numa_event(NUMA_HINT_FAULTS);
>>> #ifdef CONFIG_NUMA_BALANCING
>> IIUC, do_huge_pmd_numa_page() and do_numa_page() are executed only if
>> CONFIG_NUMA_BALANCING is enabled (pte_protnone() and pmd_protnone()
>> return 0 if CONFIG_NUMA_BALANCING is disabled).
>>
>> Given this, do we still need the #ifdef?
>>
> the NUMA_HINT_FAULTS stuff is only defined if CONFIG_NUMA_BALANCING is
> built.
>
> The ifdefs around some of this code are a bit inconsistent; it's
> probably worth a separate patch to try to clean it up.
Sure. Thank you.
>
> ~Gregory
* [RFC v2 PATCH 3/5] memory: allow non-fault migration in numa_migrate_check path
2024-12-10 21:37 [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios Gregory Price
2024-12-10 21:37 ` [RFC v2 PATCH 1/5] migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA Gregory Price
2024-12-10 21:37 ` [RFC v2 PATCH 2/5] memory: move conditionally defined enums use inside ifdef tags Gregory Price
@ 2024-12-10 21:37 ` Gregory Price
2024-12-10 21:37 ` [RFC v2 PATCH 4/5] vmstat: add page-cache numa hints Gregory Price
` (2 subsequent siblings)
5 siblings, 0 replies; 27+ messages in thread
From: Gregory Price @ 2024-12-10 21:37 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, nehagholkar, abhishekd, kernel-team, david,
nphamcs, gourry, akpm, hannes, kbusch, ying.huang
numa_migrate_check and mpol_misplaced presume callers are in the
fault path with access to a VMA. To enable migrations from page
cache, re-using the same logic to handle migration prep is preferable.
Mildly refactor numa_migrate_check and mpol_misplaced so that they may
be called with (vmf = NULL) from non-faulting paths.
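A sketch of the intended non-fault usage (mirroring the page cache
promotion caller added later in this series; the wrapper name is
illustrative):

/* Illustrative caller: no vm_fault, no VMA, no faulting address. */
static int pick_promotion_node(struct folio *folio, bool write)
{
        int flags = 0;
        int last_cpupid;

        /*
         * vmf == NULL: VMA-based checks are skipped and the task policy
         * (rather than a VMA policy) is consulted. Returns a target nid,
         * or NUMA_NO_NODE if the folio is already well placed.
         */
        return numa_migrate_check(folio, NULL, 0, &flags, write,
                                  &last_cpupid);
}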
Signed-off-by: Gregory Price <gourry@gourry.net>
---
mm/memory.c | 24 ++++++++++++++----------
mm/mempolicy.c | 25 +++++++++++++++++--------
2 files changed, 31 insertions(+), 18 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 6ad7616918c4..af7ba56a4e1e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5542,7 +5542,20 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
unsigned long addr, int *flags,
bool writable, int *last_cpupid)
{
- struct vm_area_struct *vma = vmf->vma;
+ if (vmf) {
+ struct vm_area_struct *vma = vmf->vma;
+ const vm_flags_t vmflags = vma->vm_flags;
+
+ /*
+ * Flag if the folio is shared between multiple address spaces.
+ * This used later when determining whether to group tasks.
+ */
+ if (folio_likely_mapped_shared(folio))
+ *flags |= vmflags & VM_SHARED ? TNF_SHARED : 0;
+
+ /* Record the current PID acceesing VMA */
+ vma_set_access_pid_bit(vma);
+ }
/*
* Avoid grouping on RO pages in general. RO pages shouldn't hurt as
@@ -5555,12 +5568,6 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
if (!writable)
*flags |= TNF_NO_GROUP;
- /*
- * Flag if the folio is shared between multiple address spaces. This
- * is later used when determining whether to group tasks together
- */
- if (folio_likely_mapped_shared(folio) && (vma->vm_flags & VM_SHARED))
- *flags |= TNF_SHARED;
/*
* For memory tiering mode, cpupid of slow memory page is used
* to record page access time. So use default value.
@@ -5570,9 +5577,6 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
else
*last_cpupid = folio_last_cpupid(folio);
- /* Record the current PID acceesing VMA */
- vma_set_access_pid_bit(vma);
-
#ifdef CONFIG_NUMA_BALANCING
count_vm_numa_event(NUMA_HINT_FAULTS);
count_memcg_folio_events(folio, NUMA_HINT_FAULTS, 1);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 88eef9776bb0..77a123fa71b0 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2746,12 +2746,16 @@ static void sp_free(struct sp_node *n)
* mpol_misplaced - check whether current folio node is valid in policy
*
* @folio: folio to be checked
- * @vmf: structure describing the fault
+ * @vmf: structure describing the fault (NULL if called outside fault path)
* @addr: virtual address in @vma for shared policy lookup and interleave policy
+ * Ignored if vmf is NULL.
*
* Lookup current policy node id for vma,addr and "compare to" folio's
- * node id. Policy determination "mimics" alloc_page_vma().
- * Called from fault path where we know the vma and faulting address.
+ * node id - or task's policy node id if vmf is NULL. Policy determination
+ * "mimics" alloc_page_vma().
+ *
+ * vmf must be non-NULL if called from fault path where we know the vma and
+ * faulting address. The PTL must be held by caller if vmf is not NULL.
*
* Return: NUMA_NO_NODE if the page is in a node that is valid for this
* policy, or a suitable node ID to allocate a replacement folio from.
@@ -2763,7 +2767,6 @@ int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
pgoff_t ilx;
struct zoneref *z;
int curnid = folio_nid(folio);
- struct vm_area_struct *vma = vmf->vma;
int thiscpu = raw_smp_processor_id();
int thisnid = numa_node_id();
int polnid = NUMA_NO_NODE;
@@ -2773,18 +2776,24 @@ int mpol_misplaced(struct folio *folio, struct vm_fault *vmf,
* Make sure ptl is held so that we don't preempt and we
* have a stable smp processor id
*/
- lockdep_assert_held(vmf->ptl);
- pol = get_vma_policy(vma, addr, folio_order(folio), &ilx);
+ if (vmf) {
+ lockdep_assert_held(vmf->ptl);
+ pol = get_vma_policy(vmf->vma, addr, folio_order(folio), &ilx);
+ } else {
+ pol = get_task_policy(current);
+ }
if (!(pol->flags & MPOL_F_MOF))
goto out;
switch (pol->mode) {
case MPOL_INTERLEAVE:
- polnid = interleave_nid(pol, ilx);
+ polnid = vmf ? interleave_nid(pol, ilx) :
+ interleave_nodes(pol);
break;
case MPOL_WEIGHTED_INTERLEAVE:
- polnid = weighted_interleave_nid(pol, ilx);
+ polnid = vmf ? weighted_interleave_nid(pol, ilx) :
+ weighted_interleave_nodes(pol);
break;
case MPOL_PREFERRED:
--
2.43.0
* [RFC v2 PATCH 4/5] vmstat: add page-cache numa hints
2024-12-10 21:37 [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios Gregory Price
` (2 preceding siblings ...)
2024-12-10 21:37 ` [RFC v2 PATCH 3/5] memory: allow non-fault migration in numa_migrate_check path Gregory Price
@ 2024-12-10 21:37 ` Gregory Price
2024-12-27 10:48 ` Donet Tom
2025-01-03 10:18 ` Donet Tom
2024-12-10 21:37 ` [RFC v2 PATCH 5/5] migrate,sysfs: add pagecache promotion Gregory Price
2024-12-21 5:18 ` [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios Huang, Ying
5 siblings, 2 replies; 27+ messages in thread
From: Gregory Price @ 2024-12-10 21:37 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, nehagholkar, abhishekd, kernel-team, david,
nphamcs, gourry, akpm, hannes, kbusch, ying.huang
Count non-page-fault events as page-cache numa hints instead of
fault hints in vmstat. Add a define to select the hint type to
keep the code clean.
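The new counters show up in /proc/vmstat next to the existing hint-fault
counters. A small userspace helper to dump them (illustration only, not
part of the patch):

#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[256];
        FILE *f = fopen("/proc/vmstat", "r");

        if (!f) {
                perror("/proc/vmstat");
                return 1;
        }
        while (fgets(line, sizeof(line), f)) {
                /* e.g. "numa_hint_page_cache 1244" */
                if (!strncmp(line, "numa_hint_", 10))
                        fputs(line, stdout);
        }
        fclose(f);
        return 0;
}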
Signed-off-by: Gregory Price <gourry@gourry.net>
---
include/linux/vm_event_item.h | 8 ++++++++
mm/memory.c | 6 +++---
mm/vmstat.c | 2 ++
3 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index f70d0958095c..c5abb0f7cca7 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -63,6 +63,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
NUMA_HUGE_PTE_UPDATES,
NUMA_HINT_FAULTS,
NUMA_HINT_FAULTS_LOCAL,
+ NUMA_HINT_PAGE_CACHE,
+ NUMA_HINT_PAGE_CACHE_LOCAL,
NUMA_PAGE_MIGRATE,
#endif
#ifdef CONFIG_MIGRATION
@@ -185,6 +187,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
NR_VM_EVENT_ITEMS
};
+#ifdef CONFIG_NUMA_BALANCING
+#define NUMA_HINT_TYPE(vmf) (vmf ? NUMA_HINT_FAULTS : NUMA_HINT_PAGE_CACHE)
+#define NUMA_HINT_TYPE_LOCAL(vmf) (vmf ? NUMA_HINT_FAULTS_LOCAL : \
+ NUMA_HINT_PAGE_CACHE_LOCAL)
+#endif
+
#ifndef CONFIG_TRANSPARENT_HUGEPAGE
#define THP_FILE_ALLOC ({ BUILD_BUG(); 0; })
#define THP_FILE_FALLBACK ({ BUILD_BUG(); 0; })
diff --git a/mm/memory.c b/mm/memory.c
index af7ba56a4e1e..620e2045af7b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5578,10 +5578,10 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
*last_cpupid = folio_last_cpupid(folio);
#ifdef CONFIG_NUMA_BALANCING
- count_vm_numa_event(NUMA_HINT_FAULTS);
- count_memcg_folio_events(folio, NUMA_HINT_FAULTS, 1);
+ count_vm_numa_event(NUMA_HINT_TYPE(vmf));
+ count_memcg_folio_events(folio, NUMA_HINT_TYPE(vmf), 1);
if (folio_nid(folio) == numa_node_id()) {
- count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
+ count_vm_numa_event(NUMA_HINT_TYPE_LOCAL(vmf));
*flags |= TNF_FAULT_LOCAL;
}
#endif
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4d016314a56c..bcd9be11e957 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1338,6 +1338,8 @@ const char * const vmstat_text[] = {
"numa_huge_pte_updates",
"numa_hint_faults",
"numa_hint_faults_local",
+ "numa_hint_page_cache",
+ "numa_hint_page_cache_local",
"numa_pages_migrated",
#endif
#ifdef CONFIG_MIGRATION
--
2.43.0
* Re: [RFC v2 PATCH 4/5] vmstat: add page-cache numa hints
2024-12-10 21:37 ` [RFC v2 PATCH 4/5] vmstat: add page-cache numa hints Gregory Price
@ 2024-12-27 10:48 ` Donet Tom
2024-12-27 15:49 ` Gregory Price
2025-01-03 10:18 ` Donet Tom
1 sibling, 1 reply; 27+ messages in thread
From: Donet Tom @ 2024-12-27 10:48 UTC (permalink / raw)
To: Gregory Price, linux-mm
Cc: linux-kernel, nehagholkar, abhishekd, kernel-team, david,
nphamcs, akpm, hannes, kbusch, ying.huang
On 12/11/24 03:07, Gregory Price wrote:
> Count non-page-fault events as page-cache numa hints instead of
> fault hints in vmstat. Add a define to select the hint type to
> keep the code clean.
>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
> include/linux/vm_event_item.h | 8 ++++++++
> mm/memory.c | 6 +++---
> mm/vmstat.c | 2 ++
> 3 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index f70d0958095c..c5abb0f7cca7 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -63,6 +63,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> NUMA_HUGE_PTE_UPDATES,
> NUMA_HINT_FAULTS,
> NUMA_HINT_FAULTS_LOCAL,
> + NUMA_HINT_PAGE_CACHE,
> + NUMA_HINT_PAGE_CACHE_LOCAL,
> NUMA_PAGE_MIGRATE,
> #endif
> #ifdef CONFIG_MIGRATION
> @@ -185,6 +187,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> NR_VM_EVENT_ITEMS
> };
>
> +#ifdef CONFIG_NUMA_BALANCING
> +#define NUMA_HINT_TYPE(vmf) (vmf ? NUMA_HINT_FAULTS : NUMA_HINT_PAGE_CACHE)
> +#define NUMA_HINT_TYPE_LOCAL(vmf) (vmf ? NUMA_HINT_FAULTS_LOCAL : \
> + NUMA_HINT_PAGE_CACHE_LOCAL)
> +#endif
> +
> #ifndef CONFIG_TRANSPARENT_HUGEPAGE
> #define THP_FILE_ALLOC ({ BUILD_BUG(); 0; })
> #define THP_FILE_FALLBACK ({ BUILD_BUG(); 0; })
> diff --git a/mm/memory.c b/mm/memory.c
> index af7ba56a4e1e..620e2045af7b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5578,10 +5578,10 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
> *last_cpupid = folio_last_cpupid(folio);
>
> #ifdef CONFIG_NUMA_BALANCING
> - count_vm_numa_event(NUMA_HINT_FAULTS);
> - count_memcg_folio_events(folio, NUMA_HINT_FAULTS, 1);
> + count_vm_numa_event(NUMA_HINT_TYPE(vmf));
> + count_memcg_folio_events(folio, NUMA_HINT_TYPE(vmf), 1);
> if (folio_nid(folio) == numa_node_id()) {
> - count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
> + count_vm_numa_event(NUMA_HINT_TYPE_LOCAL(vmf));
I have tested this patch series on my system with my test program. I am
able to see that unmapped page cache pages are getting promoted.

numa_hint_faults 2269
numa_hint_faults_local 2245
numa_hint_page_cache 1244
numa_hint_page_cache_local 0
numa_pages_migrated 4501

In my test results numa_hint_page_cache_local is 0. I see that
numa_hint_page_cache_local will only be incremented if the folio's
node and the node the process is running on are the same. This condition
does not occur in the current implementation, correct?
Thanks
Donet
> *flags |= TNF_FAULT_LOCAL;
> }
> #endif
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4d016314a56c..bcd9be11e957 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1338,6 +1338,8 @@ const char * const vmstat_text[] = {
> "numa_huge_pte_updates",
> "numa_hint_faults",
> "numa_hint_faults_local",
> + "numa_hint_page_cache",
> + "numa_hint_page_cache_local",
> "numa_pages_migrated",
> #endif
> #ifdef CONFIG_MIGRATION
* Re: [RFC v2 PATCH 4/5] vmstat: add page-cache numa hints
2024-12-27 10:48 ` Donet Tom
@ 2024-12-27 15:49 ` Gregory Price
2024-12-29 14:57 ` Donet Tom
0 siblings, 1 reply; 27+ messages in thread
From: Gregory Price @ 2024-12-27 15:49 UTC (permalink / raw)
To: Donet Tom
Cc: Gregory Price, linux-mm, linux-kernel, nehagholkar, abhishekd,
kernel-team, david, nphamcs, akpm, hannes, kbusch, ying.huang
On Fri, Dec 27, 2024 at 04:18:24PM +0530, Donet Tom wrote:
>
> On 12/11/24 03:07, Gregory Price wrote:
... snip ...
> > + NUMA_HINT_PAGE_CACHE,
> > + NUMA_HINT_PAGE_CACHE_LOCAL,
> > NUMA_PAGE_MIGRATE,
... snip ...
> > if (folio_nid(folio) == numa_node_id()) {
> > - count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
> > + count_vm_numa_event(NUMA_HINT_TYPE_LOCAL(vmf));
>
> I have tested this patch series on my system with my test program. I am
> able to see that unmapped page cache pages are getting promoted.
>
> numa_hint_faults 2269
> numa_hint_faults_local 2245
> numa_hint_page_cache 1244
> numa_hint_page_cache_local 0
> numa_pages_migrated 4501
>
> In my test results numa_hint_page_cache_local is 0. I see that
> numa_hint_page_cache_local will only be incremented if the folio's
> node and the node the process is running on are the same. This condition
> does not occur in the current implementation, correct?
>
I did not want to assume we'd never use this interface where such a
scenario could occur - so I wanted to:
a) make such a scenario visible
b) make the code consistent with existing fault counts
I'm fine removing it. It's hard to know if this interface ever gets
called with that scenario occurring without capturing the data.
~Gregory
* Re: [RFC v2 PATCH 4/5] vmstat: add page-cache numa hints
2024-12-27 15:49 ` Gregory Price
@ 2024-12-29 14:57 ` Donet Tom
0 siblings, 0 replies; 27+ messages in thread
From: Donet Tom @ 2024-12-29 14:57 UTC (permalink / raw)
To: Gregory Price
Cc: linux-mm, linux-kernel, nehagholkar, abhishekd, kernel-team,
david, nphamcs, akpm, hannes, kbusch, ying.huang
On 12/27/24 21:19, Gregory Price wrote:
> On Fri, Dec 27, 2024 at 04:18:24PM +0530, Donet Tom wrote:
>> On 12/11/24 03:07, Gregory Price wrote:
> ... snip ...
>>> + NUMA_HINT_PAGE_CACHE,
>>> + NUMA_HINT_PAGE_CACHE_LOCAL,
>>> NUMA_PAGE_MIGRATE,
> ... snip ...
>>> if (folio_nid(folio) == numa_node_id()) {
>>> - count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
>>> + count_vm_numa_event(NUMA_HINT_TYPE_LOCAL(vmf));
>> I have tested this patch series on my system with my test program. I am
>> able to see that unmapped page cache pages are getting promoted.
>>
>> numa_hint_faults 2269
>> numa_hint_faults_local 2245
>> numa_hint_page_cache 1244
>> numa_hint_page_cache_local 0
>> numa_pages_migrated 4501
>>
>> In my test results numa_hint_page_cache_local is 0. I see that
>> numa_hint_page_cache_local will only be incremented if the folio's
>> node and the node the process is running on are the same. This condition
>> does not occur in the current implementation, correct?
>>
> I did not want to assume we'd never use this interface where such a
> scenario could occur - so I wanted to:
>
> a) make such a scenario visible
> b) make the code consistent with existing fault counts
Understood, thank you for clarifying!
>
> I'm fine removing it. It's hard to know if this interface ever gets
> called with that scenario occurring without capturing the data.
>
> ~Gregory
>
* Re: [RFC v2 PATCH 4/5] vmstat: add page-cache numa hints
2024-12-10 21:37 ` [RFC v2 PATCH 4/5] vmstat: add page-cache numa hints Gregory Price
2024-12-27 10:48 ` Donet Tom
@ 2025-01-03 10:18 ` Donet Tom
2025-01-03 19:19 ` Gregory Price
1 sibling, 1 reply; 27+ messages in thread
From: Donet Tom @ 2025-01-03 10:18 UTC (permalink / raw)
To: Gregory Price, linux-mm
Cc: linux-kernel, nehagholkar, abhishekd, kernel-team, david,
nphamcs, akpm, hannes, kbusch, ying.huang
On 12/11/24 03:07, Gregory Price wrote:
> Count non-page-fault events as page-cache numa hints instead of
> fault hints in vmstat. Add a define to select the hint type to
> keep the code clean.
>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
> include/linux/vm_event_item.h | 8 ++++++++
> mm/memory.c | 6 +++---
> mm/vmstat.c | 2 ++
> 3 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index f70d0958095c..c5abb0f7cca7 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -63,6 +63,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> NUMA_HUGE_PTE_UPDATES,
> NUMA_HINT_FAULTS,
> NUMA_HINT_FAULTS_LOCAL,
> + NUMA_HINT_PAGE_CACHE,
Hi Gregory,
While running tests on the patch, I encountered the following warning
message on the console.
[ 187.943052] ------------[ cut here ]------------
[ 187.943234] __count_memcg_events: missing stat item 49
[ 187.943287] WARNING: CPU: 0 PID: 3121 at mm/memcontrol.c:852
__count_memcg_events+0x3fc/0x42c
The warning occurred because NUMA_HINT_PAGE_CACHE was not added
in memcg_vm_event_stat.
I made the change below; now the warnings are not coming.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..fbb360cfea30 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -460,6 +460,7 @@ static const unsigned int memcg_vm_event_stat[] = {
NUMA_PAGE_MIGRATE,
NUMA_PTE_UPDATES,
NUMA_HINT_FAULTS,
+ NUMA_HINT_PAGE_CACHE,
#endif
};
Without the change stat output
========================
# cat /proc/vmstat |grep -i numa_hint_page_cache
numa_hint_page_cache 274
numa_hint_page_cache_local 0
# cat /sys/fs/cgroup/memory.stat |grep -i numa_hint_page_cache
#
With the change stat output
========================
# cat /proc/vmstat |grep -i numa_hint_page_cache
numa_hint_page_cache 274
numa_hint_page_cache_local 0
#
# cat /sys/fs/cgroup/memory.stat |grep -i numa_hint_page_cache
numa_hint_page_cache 274
#
-Donet
> + NUMA_HINT_PAGE_CACHE_LOCAL,
> NUMA_PAGE_MIGRATE,
> #endif
> #ifdef CONFIG_MIGRATION
> @@ -185,6 +187,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> NR_VM_EVENT_ITEMS
> };
>
> +#ifdef CONFIG_NUMA_BALANCING
> +#define NUMA_HINT_TYPE(vmf) (vmf ? NUMA_HINT_FAULTS : NUMA_HINT_PAGE_CACHE)
> +#define NUMA_HINT_TYPE_LOCAL(vmf) (vmf ? NUMA_HINT_FAULTS_LOCAL : \
> + NUMA_HINT_PAGE_CACHE_LOCAL)
> +#endif
> +
> #ifndef CONFIG_TRANSPARENT_HUGEPAGE
> #define THP_FILE_ALLOC ({ BUILD_BUG(); 0; })
> #define THP_FILE_FALLBACK ({ BUILD_BUG(); 0; })
> diff --git a/mm/memory.c b/mm/memory.c
> index af7ba56a4e1e..620e2045af7b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5578,10 +5578,10 @@ int numa_migrate_check(struct folio *folio, struct vm_fault *vmf,
> *last_cpupid = folio_last_cpupid(folio);
>
> #ifdef CONFIG_NUMA_BALANCING
> - count_vm_numa_event(NUMA_HINT_FAULTS);
> - count_memcg_folio_events(folio, NUMA_HINT_FAULTS, 1);
> + count_vm_numa_event(NUMA_HINT_TYPE(vmf));
> + count_memcg_folio_events(folio, NUMA_HINT_TYPE(vmf), 1);
> if (folio_nid(folio) == numa_node_id()) {
> - count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
> + count_vm_numa_event(NUMA_HINT_TYPE_LOCAL(vmf));
> *flags |= TNF_FAULT_LOCAL;
> }
> #endif
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4d016314a56c..bcd9be11e957 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1338,6 +1338,8 @@ const char * const vmstat_text[] = {
> "numa_huge_pte_updates",
> "numa_hint_faults",
> "numa_hint_faults_local",
> + "numa_hint_page_cache",
> + "numa_hint_page_cache_local",
> "numa_pages_migrated",
> #endif
> #ifdef CONFIG_MIGRATION
* Re: [RFC v2 PATCH 4/5] vmstat: add page-cache numa hints
2025-01-03 10:18 ` Donet Tom
@ 2025-01-03 19:19 ` Gregory Price
0 siblings, 0 replies; 27+ messages in thread
From: Gregory Price @ 2025-01-03 19:19 UTC (permalink / raw)
To: Donet Tom
Cc: linux-mm, linux-kernel, nehagholkar, abhishekd, kernel-team,
david, nphamcs, akpm, hannes, kbusch, ying.huang
On Fri, Jan 03, 2025 at 03:48:24PM +0530, Donet Tom wrote:
>
> On 12/11/24 03:07, Gregory Price wrote:
> > Count non-page-fault events as page-cache numa hints instead of
> > fault hints in vmstat. Add a define to select the hint type to
> > keep the code clean.
> >
> > Signed-off-by: Gregory Price <gourry@gourry.net>
> > ---
> > include/linux/vm_event_item.h | 8 ++++++++
> > mm/memory.c | 6 +++---
> > mm/vmstat.c | 2 ++
> > 3 files changed, 13 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> > index f70d0958095c..c5abb0f7cca7 100644
> > --- a/include/linux/vm_event_item.h
> > +++ b/include/linux/vm_event_item.h
> > @@ -63,6 +63,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> > NUMA_HUGE_PTE_UPDATES,
> > NUMA_HINT_FAULTS,
> > NUMA_HINT_FAULTS_LOCAL,
> > + NUMA_HINT_PAGE_CACHE,
> Hi Gregory,
>
> While running tests on the patch, I encountered the following warning
> message on the console.
>
Thank you for catching this, will add to v3.
v3 is coming soon with more microbenchmark info, then hopefully moving
on to some workload testing.
~Gregory
* [RFC v2 PATCH 5/5] migrate,sysfs: add pagecache promotion
2024-12-10 21:37 [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios Gregory Price
` (3 preceding siblings ...)
2024-12-10 21:37 ` [RFC v2 PATCH 4/5] vmstat: add page-cache numa hints Gregory Price
@ 2024-12-10 21:37 ` Gregory Price
2024-12-27 11:01 ` Donet Tom
2024-12-21 5:18 ` [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios Huang, Ying
5 siblings, 1 reply; 27+ messages in thread
From: Gregory Price @ 2024-12-10 21:37 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, nehagholkar, abhishekd, kernel-team, david,
nphamcs, gourry, akpm, hannes, kbusch, ying.huang
adds /sys/kernel/mm/numa/pagecache_promotion_enabled
When page cache lands on lower tiers, there is no way for promotion
to occur unless it becomes memory-mapped and exposed to NUMA hint
faults. Just adding a mechanism to promote pages unconditionally,
however, opens up significant possibility of performance regressions.
Similar to the `demotion_enabled` sysfs entry, provide a sysfs toggle
to enable and disable page cache promotion. This option will enable
opportunistic promotion of unmapped page cache during syscall access.
This option is intended for operational conditions where demoted page
cache will eventually contain memory which becomes hot - and where
said memory is likely to cause performance issues due to being trapped on
the lower tier of memory.
A page cache folio is considered a promotion candidate when:
0) tiering and pagecache-promotion are enabled
1) the folio resides on a node not in the top tier
2) the folio is already marked referenced and active.
3) Multiple accesses in (referenced & active) state occur quickly.
Since promotion is not safe to execute unconditionally from within
folio_mark_accessed, we defer promotion to a new task_work captured
in the task_struct. This ensures that the task doing the access has
some hand in promoting pages - even among deduplicated read only files.
We use numa_hint_fault_latency to help identify when a folio is accessed
multiple times in a short period. Along with folio flag checks, this
helps us minimize promoting pages on the first few accesses.
The promotion node is always the local node of the promoting cpu.
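Condensed, the hot path added below looks roughly like this (a sketch of
the patch's own logic, not a separate implementation; the function name
is illustrative, and the hotness, isolation and kthread checks are
elided):

static void maybe_queue_promotion(struct folio *folio, bool write)
{
        struct task_struct *task = current;
        int flags, last_cpupid, nid;

        /* Feature gate and "already on the top tier" check. */
        if (!numa_pagecache_promotion_enabled ||
            node_is_toptier(folio_nid(folio)))
                return;

        /* Pick a target node; vmf == NULL because there is no fault. */
        nid = numa_migrate_check(folio, NULL, 0, &flags, write, &last_cpupid);
        if (nid == NUMA_NO_NODE)
                return;

        /* Isolate the folio from the LRU in preparation for migration. */
        if (migrate_misplaced_folio_prepare(folio, NULL, nid))
                return;

        /*
         * Defer the actual migration to task_work, so it runs in task
         * context on return to userspace rather than under
         * folio_mark_accessed().
         */
        if (list_empty(&task->promo_list) &&
            task_work_add(task, &task->numa_promo_work, TWA_RESUME)) {
                folio_putback_lru(folio);
                return;
        }
        list_add(&folio->lru, &task->promo_list);
}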
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
.../ABI/testing/sysfs-kernel-mm-numa | 20 +++++++
include/linux/memory-tiers.h | 2 +
include/linux/migrate.h | 2 +
include/linux/sched.h | 3 +
include/linux/sched/numa_balancing.h | 5 ++
init/init_task.c | 1 +
kernel/sched/fair.c | 26 +++++++-
mm/memory-tiers.c | 27 +++++++++
mm/migrate.c | 59 +++++++++++++++++++
mm/swap.c | 3 +
10 files changed, 147 insertions(+), 1 deletion(-)
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-numa b/Documentation/ABI/testing/sysfs-kernel-mm-numa
index 77e559d4ed80..b846e7d80cba 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-numa
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-numa
@@ -22,3 +22,23 @@ Description: Enable/disable demoting pages during reclaim
the guarantees of cpusets. This should not be enabled
on systems which need strict cpuset location
guarantees.
+
+What: /sys/kernel/mm/numa/pagecache_promotion_enabled
+Date: November 2024
+Contact: Linux memory management mailing list <linux-mm@kvack.org>
+Description: Enable/disable promoting pages during file access
+
+ Page migration during file access is intended for systems
+ with tiered memory configurations that have significant
+ unmapped file cache usage. By default, file cache memory
+ on slower tiers will not be opportunistically promoted by
+ normal NUMA hint faults, because the system has no way to
+ track them. This option enables opportunistic promotion
+ of pages that are accessed via syscall (e.g. read/write)
+ if multiple accesses occur in quick succession.
+
+ It may move data to a NUMA node that does not fall into
+ the cpuset of the allocating process which might be
+ construed to violate the guarantees of cpusets. This
+ should not be enabled on systems which need strict cpuset
+ location guarantees.
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 0dc0cf2863e2..fa96a67b8996 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -37,6 +37,7 @@ struct access_coordinate;
#ifdef CONFIG_NUMA
extern bool numa_demotion_enabled;
+extern bool numa_pagecache_promotion_enabled;
extern struct memory_dev_type *default_dram_type;
extern nodemask_t default_dram_nodes;
struct memory_dev_type *alloc_memory_type(int adistance);
@@ -76,6 +77,7 @@ static inline bool node_is_toptier(int node)
#else
#define numa_demotion_enabled false
+#define numa_pagecache_promotion_enabled false
#define default_dram_type NULL
#define default_dram_nodes NODE_MASK_NONE
/*
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 29919faea2f1..cf58a97d4216 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -145,6 +145,7 @@ const struct movable_operations *page_movable_ops(struct page *page)
int migrate_misplaced_folio_prepare(struct folio *folio,
struct vm_area_struct *vma, int node);
int migrate_misplaced_folio(struct folio *folio, int node);
+void promotion_candidate(struct folio *folio);
#else
static inline int migrate_misplaced_folio_prepare(struct folio *folio,
struct vm_area_struct *vma, int node)
@@ -155,6 +156,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
{
return -EAGAIN; /* can't migrate now */
}
+static inline void promotion_candidate(struct folio *folio) { }
#endif /* CONFIG_NUMA_BALANCING */
#ifdef CONFIG_MIGRATION
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d380bffee2ef..faa84fb7a756 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1356,6 +1356,9 @@ struct task_struct {
unsigned long numa_faults_locality[3];
unsigned long numa_pages_migrated;
+
+ struct callback_head numa_promo_work;
+ struct list_head promo_list;
#endif /* CONFIG_NUMA_BALANCING */
#ifdef CONFIG_RSEQ
diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index 52b22c5c396d..cc7750d754ff 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -32,6 +32,7 @@ extern void set_numabalancing_state(bool enabled);
extern void task_numa_free(struct task_struct *p, bool final);
bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
int src_nid, int dst_cpu);
+int numa_hint_fault_latency(struct folio *folio);
#else
static inline void task_numa_fault(int last_node, int node, int pages,
int flags)
@@ -52,6 +53,10 @@ static inline bool should_numa_migrate_memory(struct task_struct *p,
{
return true;
}
+static inline int numa_hint_fault_latency(struct folio *folio)
+{
+ return 0;
+}
#endif
#endif /* _LINUX_SCHED_NUMA_BALANCING_H */
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd90..f831980748c4 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -187,6 +187,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
.numa_preferred_nid = NUMA_NO_NODE,
.numa_group = NULL,
.numa_faults = NULL,
+ .promo_list = LIST_HEAD_INIT(init_task.promo_list),
#endif
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
.kasan_depth = 1,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a59ae2e23daf..047f02091773 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -42,6 +42,7 @@
#include <linux/interrupt.h>
#include <linux/memory-tiers.h>
#include <linux/mempolicy.h>
+#include <linux/migrate.h>
#include <linux/mutex_api.h>
#include <linux/profile.h>
#include <linux/psi.h>
@@ -1842,7 +1843,7 @@ static bool pgdat_free_space_enough(struct pglist_data *pgdat)
* The smaller the hint page fault latency, the higher the possibility
* for the page to be hot.
*/
-static int numa_hint_fault_latency(struct folio *folio)
+int numa_hint_fault_latency(struct folio *folio)
{
int last_time, time;
@@ -3534,6 +3535,27 @@ static void task_numa_work(struct callback_head *work)
}
}
+static void task_numa_promotion_work(struct callback_head *work)
+{
+ struct task_struct *p = current;
+ struct list_head *promo_list = &p->promo_list;
+ struct folio *folio, *tmp;
+ int nid = numa_node_id();
+
+ SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_promo_work));
+
+ work->next = work;
+
+ if (list_empty(promo_list))
+ return;
+
+ list_for_each_entry_safe(folio, tmp, promo_list, lru) {
+ list_del_init(&folio->lru);
+ migrate_misplaced_folio(folio, nid);
+ }
+}
+
+
void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
{
int mm_users = 0;
@@ -3558,8 +3580,10 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
RCU_INIT_POINTER(p->numa_group, NULL);
p->last_task_numa_placement = 0;
p->last_sum_exec_runtime = 0;
+ INIT_LIST_HEAD(&p->promo_list);
init_task_work(&p->numa_work, task_numa_work);
+ init_task_work(&p->numa_promo_work, task_numa_promotion_work);
/* New address space, reset the preferred nid */
if (!(clone_flags & CLONE_VM)) {
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index fc14fe53e9b7..4c44598e485e 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -935,6 +935,7 @@ static int __init memory_tier_init(void)
subsys_initcall(memory_tier_init);
bool numa_demotion_enabled = false;
+bool numa_pagecache_promotion_enabled;
#ifdef CONFIG_MIGRATION
#ifdef CONFIG_SYSFS
@@ -957,11 +958,37 @@ static ssize_t demotion_enabled_store(struct kobject *kobj,
return count;
}
+static ssize_t pagecache_promotion_enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+ return sysfs_emit(buf, "%s\n",
+ numa_pagecache_promotion_enabled ? "true" : "false");
+}
+
+static ssize_t pagecache_promotion_enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ ssize_t ret;
+
+ ret = kstrtobool(buf, &numa_pagecache_promotion_enabled);
+ if (ret)
+ return ret;
+
+ return count;
+}
+
+
static struct kobj_attribute numa_demotion_enabled_attr =
__ATTR_RW(demotion_enabled);
+static struct kobj_attribute numa_pagecache_promotion_enabled_attr =
+ __ATTR_RW(pagecache_promotion_enabled);
+
static struct attribute *numa_attrs[] = {
&numa_demotion_enabled_attr.attr,
+ &numa_pagecache_promotion_enabled_attr.attr,
NULL,
};
diff --git a/mm/migrate.c b/mm/migrate.c
index af07b399060b..320258a1aaba 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -44,6 +44,8 @@
#include <linux/sched/sysctl.h>
#include <linux/memory-tiers.h>
#include <linux/pagewalk.h>
+#include <linux/sched/numa_balancing.h>
+#include <linux/task_work.h>
#include <asm/tlbflush.h>
@@ -2710,5 +2712,62 @@ int migrate_misplaced_folio(struct folio *folio, int node)
BUG_ON(!list_empty(&migratepages));
return nr_remaining ? -EAGAIN : 0;
}
+
+/**
+ * promotion_candidate() - report a promotion candidate folio
+ *
+ * @folio: The folio reported as a candidate
+ *
+ * Records folio access time and places the folio on the task promotion list
+ * if access time is less than the threshold. The folio will be isolated from
+ * LRU if selected, and task_work will putback the folio on promotion failure.
+ *
+ * If selected, takes a folio reference to be released in task work.
+ */
+void promotion_candidate(struct folio *folio)
+{
+ struct task_struct *task = current;
+ struct list_head *promo_list = &task->promo_list;
+ struct callback_head *work = &task->numa_promo_work;
+ struct address_space *mapping = folio_mapping(folio);
+ bool write = mapping ? mapping->gfp_mask & __GFP_WRITE : false;
+ int nid = folio_nid(folio);
+ int flags, last_cpupid;
+
+ /*
+ * Only do this work if:
+ * 1) tiering and pagecache promotion are enabled
+ * 2) the page can actually be promoted
+ * 3) The hint-fault latency is relatively hot
+ * 4) the folio is not already isolated
+ * 5) This is not a kernel thread context
+ */
+ if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
+ !numa_pagecache_promotion_enabled ||
+ node_is_toptier(nid) ||
+ numa_hint_fault_latency(folio) >= PAGE_ACCESS_TIME_MASK ||
+ folio_test_isolated(folio) ||
+ (current->flags & PF_KTHREAD)) {
+ return;
+ }
+
+ nid = numa_migrate_check(folio, NULL, 0, &flags, write, &last_cpupid);
+ if (nid == NUMA_NO_NODE)
+ return;
+
+ if (migrate_misplaced_folio_prepare(folio, NULL, nid))
+ return;
+
+ /*
+ * Ensure task can schedule work, otherwise we'll leak folios.
+ * If the list is not empty, task work has already been scheduled.
+ */
+ if (list_empty(promo_list) && task_work_add(task, work, TWA_RESUME)) {
+ folio_putback_lru(folio);
+ return;
+ }
+ list_add(&folio->lru, promo_list);
+}
+EXPORT_SYMBOL(promotion_candidate);
#endif /* CONFIG_NUMA_BALANCING */
#endif /* CONFIG_NUMA */
diff --git a/mm/swap.c b/mm/swap.c
index 320b959b74c6..57909c349388 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -37,6 +37,7 @@
#include <linux/page_idle.h>
#include <linux/local_lock.h>
#include <linux/buffer_head.h>
+#include <linux/migrate.h>
#include "internal.h"
@@ -469,6 +470,8 @@ void folio_mark_accessed(struct folio *folio)
__lru_cache_activate_folio(folio);
folio_clear_referenced(folio);
workingset_activation(folio);
+ } else {
+ promotion_candidate(folio);
}
if (folio_test_idle(folio))
folio_clear_idle(folio);
--
2.43.0
* Re: [RFC v2 PATCH 5/5] migrate,sysfs: add pagecache promotion
2024-12-10 21:37 ` [RFC v2 PATCH 5/5] migrate,sysfs: add pagecache promotion Gregory Price
@ 2024-12-27 11:01 ` Donet Tom
2024-12-27 15:56 ` Gregory Price
0 siblings, 1 reply; 27+ messages in thread
From: Donet Tom @ 2024-12-27 11:01 UTC (permalink / raw)
To: Gregory Price, linux-mm
Cc: linux-kernel, nehagholkar, abhishekd, kernel-team, david,
nphamcs, akpm, hannes, kbusch, ying.huang
On 12/11/24 03:07, Gregory Price wrote:
> adds /sys/kernel/mm/numa/pagecache_promotion_enabled
>
> When page cache lands on lower tiers, there is no way for promotion
> to occur unless it becomes memory-mapped and exposed to NUMA hint
> faults. Just adding a mechanism to promote pages unconditionally,
> however, opens up significant possibility of performance regressions.
>
> Similar to the `demotion_enabled` sysfs entry, provide a sysfs toggle
> to enable and disable page cache promotion. This option will enable
> opportunistic promotion of unmapped page cache during syscall access.
>
> This option is intended for operational conditions where demoted page
> cache will eventually contain memory which becomes hot - and where
> said memory is likely to cause performance issues due to being trapped on
> the lower tier of memory.
>
> A page cache folio is considered a promotion candidate when:
> 0) tiering and pagecache-promotion are enabled
> 1) the folio resides on a node not in the top tier
> 2) the folio is already marked referenced and active.
> 3) Multiple accesses in (referenced & active) state occur quickly.
>
> Since promotion is not safe to execute unconditionally from within
> folio_mark_accessed, we defer promotion to a new task_work captured
> in the task_struct. This ensures that the task doing the access has
> some hand in promoting pages - even among deduplicated read only files.
>
> We use numa_hint_fault_latency to help identify when a folio is accessed
> multiple times in a short period. Along with folio flag checks, this
> helps us minimize promoting pages on the first few accesses.
>
> The promotion node is always the local node of the promoting cpu.
>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
> .../ABI/testing/sysfs-kernel-mm-numa | 20 +++++++
> include/linux/memory-tiers.h | 2 +
> include/linux/migrate.h | 2 +
> include/linux/sched.h | 3 +
> include/linux/sched/numa_balancing.h | 5 ++
> init/init_task.c | 1 +
> kernel/sched/fair.c | 26 +++++++-
> mm/memory-tiers.c | 27 +++++++++
> mm/migrate.c | 59 +++++++++++++++++++
> mm/swap.c | 3 +
> 10 files changed, 147 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-numa b/Documentation/ABI/testing/sysfs-kernel-mm-numa
> index 77e559d4ed80..b846e7d80cba 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-numa
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-numa
> @@ -22,3 +22,23 @@ Description: Enable/disable demoting pages during reclaim
> the guarantees of cpusets. This should not be enabled
> on systems which need strict cpuset location
> guarantees.
> +
> +What: /sys/kernel/mm/numa/pagecache_promotion_enabled
> +Date: November 2024
> +Contact: Linux memory management mailing list <linux-mm@kvack.org>
> +Description: Enable/disable promoting pages during file access
> +
> + Page migration during file access is intended for systems
> + with tiered memory configurations that have significant
> + unmapped file cache usage. By default, file cache memory
> + on slower tiers will not be opportunistically promoted by
> + normal NUMA hint faults, because the system has no way to
> + track them. This option enables opportunistic promotion
> + of pages that are accessed via syscall (e.g. read/write)
> + if multiple accesses occur in quick succession.
> +
> + It may move data to a NUMA node that does not fall into
> + the cpuset of the allocating process which might be
> + construed to violate the guarantees of cpusets. This
> + should not be enabled on systems which need strict cpuset
> + location guarantees.
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index 0dc0cf2863e2..fa96a67b8996 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -37,6 +37,7 @@ struct access_coordinate;
>
> #ifdef CONFIG_NUMA
> extern bool numa_demotion_enabled;
> +extern bool numa_pagecache_promotion_enabled;
> extern struct memory_dev_type *default_dram_type;
> extern nodemask_t default_dram_nodes;
> struct memory_dev_type *alloc_memory_type(int adistance);
> @@ -76,6 +77,7 @@ static inline bool node_is_toptier(int node)
> #else
>
> #define numa_demotion_enabled false
> +#define numa_pagecache_promotion_enabled false
> #define default_dram_type NULL
> #define default_dram_nodes NODE_MASK_NONE
> /*
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 29919faea2f1..cf58a97d4216 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -145,6 +145,7 @@ const struct movable_operations *page_movable_ops(struct page *page)
> int migrate_misplaced_folio_prepare(struct folio *folio,
> struct vm_area_struct *vma, int node);
> int migrate_misplaced_folio(struct folio *folio, int node);
> +void promotion_candidate(struct folio *folio);
> #else
> static inline int migrate_misplaced_folio_prepare(struct folio *folio,
> struct vm_area_struct *vma, int node)
> @@ -155,6 +156,7 @@ static inline int migrate_misplaced_folio(struct folio *folio, int node)
> {
> return -EAGAIN; /* can't migrate now */
> }
> +static inline void promotion_candidate(struct folio *folio) { }
> #endif /* CONFIG_NUMA_BALANCING */
>
> #ifdef CONFIG_MIGRATION
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d380bffee2ef..faa84fb7a756 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1356,6 +1356,9 @@ struct task_struct {
> unsigned long numa_faults_locality[3];
>
> unsigned long numa_pages_migrated;
> +
> + struct callback_head numa_promo_work;
> + struct list_head promo_list;
> #endif /* CONFIG_NUMA_BALANCING */
>
> #ifdef CONFIG_RSEQ
> diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
> index 52b22c5c396d..cc7750d754ff 100644
> --- a/include/linux/sched/numa_balancing.h
> +++ b/include/linux/sched/numa_balancing.h
> @@ -32,6 +32,7 @@ extern void set_numabalancing_state(bool enabled);
> extern void task_numa_free(struct task_struct *p, bool final);
> bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
> int src_nid, int dst_cpu);
> +int numa_hint_fault_latency(struct folio *folio);
> #else
> static inline void task_numa_fault(int last_node, int node, int pages,
> int flags)
> @@ -52,6 +53,10 @@ static inline bool should_numa_migrate_memory(struct task_struct *p,
> {
> return true;
> }
> +static inline int numa_hint_fault_latency(struct folio *folio)
> +{
> + return 0;
> +}
> #endif
>
> #endif /* _LINUX_SCHED_NUMA_BALANCING_H */
> diff --git a/init/init_task.c b/init/init_task.c
> index e557f622bd90..f831980748c4 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -187,6 +187,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
> .numa_preferred_nid = NUMA_NO_NODE,
> .numa_group = NULL,
> .numa_faults = NULL,
> + .promo_list = LIST_HEAD_INIT(init_task.promo_list),
> #endif
> #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
> .kasan_depth = 1,
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a59ae2e23daf..047f02091773 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -42,6 +42,7 @@
> #include <linux/interrupt.h>
> #include <linux/memory-tiers.h>
> #include <linux/mempolicy.h>
> +#include <linux/migrate.h>
> #include <linux/mutex_api.h>
> #include <linux/profile.h>
> #include <linux/psi.h>
> @@ -1842,7 +1843,7 @@ static bool pgdat_free_space_enough(struct pglist_data *pgdat)
> * The smaller the hint page fault latency, the higher the possibility
> * for the page to be hot.
> */
> -static int numa_hint_fault_latency(struct folio *folio)
> +int numa_hint_fault_latency(struct folio *folio)
> {
> int last_time, time;
>
> @@ -3534,6 +3535,27 @@ static void task_numa_work(struct callback_head *work)
> }
> }
>
> +static void task_numa_promotion_work(struct callback_head *work)
> +{
> + struct task_struct *p = current;
> + struct list_head *promo_list = &p->promo_list;
> + struct folio *folio, *tmp;
> + int nid = numa_node_id();
> +
> + SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_promo_work));
> +
> + work->next = work;
> +
> + if (list_empty(promo_list))
> + return;
> +
> + list_for_each_entry_safe(folio, tmp, promo_list, lru) {
> + list_del_init(&folio->lru);
> + migrate_misplaced_folio(folio, nid);
> + }
> +}
> +
> +
> void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
> {
> int mm_users = 0;
> @@ -3558,8 +3580,10 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
> RCU_INIT_POINTER(p->numa_group, NULL);
> p->last_task_numa_placement = 0;
> p->last_sum_exec_runtime = 0;
> + INIT_LIST_HEAD(&p->promo_list);
>
> init_task_work(&p->numa_work, task_numa_work);
> + init_task_work(&p->numa_promo_work, task_numa_promotion_work);
>
> /* New address space, reset the preferred nid */
> if (!(clone_flags & CLONE_VM)) {
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index fc14fe53e9b7..4c44598e485e 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -935,6 +935,7 @@ static int __init memory_tier_init(void)
> subsys_initcall(memory_tier_init);
>
> bool numa_demotion_enabled = false;
> +bool numa_pagecache_promotion_enabled;
>
> #ifdef CONFIG_MIGRATION
> #ifdef CONFIG_SYSFS
> @@ -957,11 +958,37 @@ static ssize_t demotion_enabled_store(struct kobject *kobj,
> return count;
> }
>
> +static ssize_t pagecache_promotion_enabled_show(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + char *buf)
> +{
> + return sysfs_emit(buf, "%s\n",
> + numa_pagecache_promotion_enabled ? "true" : "false");
> +}
> +
> +static ssize_t pagecache_promotion_enabled_store(struct kobject *kobj,
> + struct kobj_attribute *attr,
> + const char *buf, size_t count)
> +{
> + ssize_t ret;
> +
> + ret = kstrtobool(buf, &numa_pagecache_promotion_enabled);
> + if (ret)
> + return ret;
> +
> + return count;
> +}
> +
> +
> static struct kobj_attribute numa_demotion_enabled_attr =
> __ATTR_RW(demotion_enabled);
>
> +static struct kobj_attribute numa_pagecache_promotion_enabled_attr =
> + __ATTR_RW(pagecache_promotion_enabled);
> +
> static struct attribute *numa_attrs[] = {
> &numa_demotion_enabled_attr.attr,
> + &numa_pagecache_promotion_enabled_attr.attr,
> NULL,
> };
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index af07b399060b..320258a1aaba 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -44,6 +44,8 @@
> #include <linux/sched/sysctl.h>
> #include <linux/memory-tiers.h>
> #include <linux/pagewalk.h>
> +#include <linux/sched/numa_balancing.h>
> +#include <linux/task_work.h>
>
> #include <asm/tlbflush.h>
>
> @@ -2710,5 +2712,62 @@ int migrate_misplaced_folio(struct folio *folio, int node)
> BUG_ON(!list_empty(&migratepages));
> return nr_remaining ? -EAGAIN : 0;
> }
> +
> +/**
> + * promotion_candidate() - report a promotion candidate folio
> + *
> + * @folio: The folio reported as a candidate
> + *
> + * Records folio access time and places the folio on the task promotion list
> + * if access time is less than the threshold. The folio will be isolated from
> + * LRU if selected, and task_work will putback the folio on promotion failure.
> + *
> + * If selected, takes a folio reference to be released in task work.
> + */
> +void promotion_candidate(struct folio *folio)
> +{
> + struct task_struct *task = current;
> + struct list_head *promo_list = &task->promo_list;
> + struct callback_head *work = &task->numa_promo_work;
> + struct address_space *mapping = folio_mapping(folio);
> + bool write = mapping ? mapping->gfp_mask & __GFP_WRITE : false;
> + int nid = folio_nid(folio);
> + int flags, last_cpupid;
> +
> + /*
> + * Only do this work if:
> + * 1) tiering and pagecache promotion are enabled
> + * 2) the page can actually be promoted
> + * 3) The hint-fault latency is relatively hot
> + * 4) the folio is not already isolated
> + * 5) This is not a kernel thread context
> + */
> + if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) ||
> + !numa_pagecache_promotion_enabled ||
> + node_is_toptier(nid) ||
> + numa_hint_fault_latency(folio) >= PAGE_ACCESS_TIME_MASK ||
> + folio_test_isolated(folio) ||
> + (current->flags & PF_KTHREAD)) {
> + return;
> + }
> +
> + nid = numa_migrate_check(folio, NULL, 0, &flags, write, &last_cpupid);
> + if (nid == NUMA_NO_NODE)
> + return;
> +
> + if (migrate_misplaced_folio_prepare(folio, NULL, nid))
> + return;
> +
> + /*
> + * Ensure task can schedule work, otherwise we'll leak folios.
> + * If the list is not empty, task work has already been scheduled.
> + */
> + if (list_empty(promo_list) && task_work_add(task, work, TWA_RESUME)) {
> + folio_putback_lru(folio);
> + return;
> + }
> + list_add(&folio->lru, promo_list);
> +}
> +EXPORT_SYMBOL(promotion_candidate);
> #endif /* CONFIG_NUMA_BALANCING */
> #endif /* CONFIG_NUMA */
> diff --git a/mm/swap.c b/mm/swap.c
> index 320b959b74c6..57909c349388 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -37,6 +37,7 @@
> #include <linux/page_idle.h>
> #include <linux/local_lock.h>
> #include <linux/buffer_head.h>
> +#include <linux/migrate.h>
>
> #include "internal.h"
>
> @@ -469,6 +470,8 @@ void folio_mark_accessed(struct folio *folio)
> __lru_cache_activate_folio(folio);
> folio_clear_referenced(folio);
> workingset_activation(folio);
> + } else {
> +
In the current implementation, promotion will not work if we enable
MGLRU, right?
Is there any specific reason we are not enabling promotion with MGLRU?
> promotion_candidate(folio);
> }
> if (folio_test_idle(folio))
> folio_clear_idle(folio);
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC v2 PATCH 5/5] migrate,sysfs: add pagecache promotion
2024-12-27 11:01 ` Donet Tom
@ 2024-12-27 15:56 ` Gregory Price
2024-12-29 15:00 ` Donet Tom
0 siblings, 1 reply; 27+ messages in thread
From: Gregory Price @ 2024-12-27 15:56 UTC (permalink / raw)
To: Donet Tom
Cc: Gregory Price, linux-mm, linux-kernel, nehagholkar, abhishekd,
kernel-team, david, nphamcs, akpm, hannes, kbusch, ying.huang
On Fri, Dec 27, 2024 at 04:31:48PM +0530, Donet Tom wrote:
>
> > folio_clear_referenced(folio);
> > workingset_activation(folio);
> > + } else {
> > +
>
> In the current implementation, promotion will not work if we enable MGLRU,
> right?
> Is there any specific reason we are not enabling promotion with MGLRU?
>
folio_mark_accessed may not be the best place to interpose for MGLRU,
since MGLRU's semantics are so different. I didn't want to dive that
deep until I could show traditional LRU benefit.
Certainly open to ideas if you're familiar with MGLRU semantics.
~Gregory
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC v2 PATCH 5/5] migrate,sysfs: add pagecache promotion
2024-12-27 15:56 ` Gregory Price
@ 2024-12-29 15:00 ` Donet Tom
0 siblings, 0 replies; 27+ messages in thread
From: Donet Tom @ 2024-12-29 15:00 UTC (permalink / raw)
To: Gregory Price
Cc: linux-mm, linux-kernel, nehagholkar, abhishekd, kernel-team,
david, nphamcs, akpm, hannes, kbusch, ying.huang
On 12/27/24 21:26, Gregory Price wrote:
> On Fri, Dec 27, 2024 at 04:31:48PM +0530, Donet Tom wrote:
>>> folio_clear_referenced(folio);
>>> workingset_activation(folio);
>>> + } else {
>>> +
>> In the current implementation, promotion will not work if we enable MGLRU,
>> right?
>> Is there any specific reason we are not enabling promotion with MGLRU?
>>
> folio_mark_accessed may not be the best place to interpose for MGLRU,
> since MGLRU's semantics are so different. I didn't want to dive that
> deep until I could show traditional LRU benefit.
>
> Certainly open to ideas if you're familiar with MGLRU semantics.
Thanks. I will take a look at it.
>
> ~Gregory
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios.
2024-12-10 21:37 [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios Gregory Price
` (4 preceding siblings ...)
2024-12-10 21:37 ` [RFC v2 PATCH 5/5] migrate,sysfs: add pagecache promotion Gregory Price
@ 2024-12-21 5:18 ` Huang, Ying
2024-12-21 14:48 ` Gregory Price
5 siblings, 1 reply; 27+ messages in thread
From: Huang, Ying @ 2024-12-21 5:18 UTC (permalink / raw)
To: Gregory Price
Cc: linux-mm, linux-kernel, nehagholkar, abhishekd, kernel-team,
david, nphamcs, akpm, hannes, kbusch, Feng Tang
Hi, Gregory,
Thanks for working on this!
Gregory Price <gourry@gourry.net> writes:
> Unmapped page cache pages can be demoted to low-tier memory, but
> they can presently only be promoted in two conditions:
> 1) The page is fully swapped out and re-faulted
> 2) The page becomes mapped (and exposed to NUMA hint faults)
>
> This RFC proposes promoting unmapped page cache pages by using
> folio_mark_accessed as a hotness hint for unmapped pages.
>
> Patches 1-3
> allow NULL as valid input to migration prep interfaces
> for vmf/vma - which is not present in unmapped folios.
> Patch 4
> adds NUMA_HINT_PAGE_CACHE to vmstat
> Patch 5
> adds the promotion mechanism, along with a sysfs
> extension which defaults the behavior to off.
> /sys/kernel/mm/numa/pagecache_promotion_enabled
>
> Functional test showed that we are able to reclaim some performance
> in canned scenarios (a file gets demoted and becomes hot with
> relatively little contention). See test/overhead section below.
>
> v2
> - cleanup first commit to be accurate and take Ying's feedback
> - cleanup NUMA_HINT_ define usage
> - add NUMA_HINT_ type selection macro to keep code clean
> - mild comment updates
>
> Open Questions:
> ======
> 1) Should we also add a limit to how much can be forced onto
> a single task's promotion list at any one time? This might
> piggy-back on the existing TPP promotion limit (256MB?) and
> would simply add something like task->promo_count.
>
> Technically we are limited by the batch read-rate before a
> TASK_RESUME occurs.
>
> 2) Should we exempt certain forms of folios, or add additional
> knobs/levers in to deal with things like large folios?
>
> 3) We added NUMA_HINT_PAGE_CACHE to differentiate hint faults
> so we could validate the behavior works as intended. Should
> we just call this a NUMA_HINT_FAULT and not add a new hint?
>
> 4) Benchmark suggestions that can pressure 1TB memory. This is
> not my typical wheelhouse, so if folks know of a useful
> benchmark that can pressure my 1TB (768 DRAM / 256 CXL) setup,
> I'd like to add additional measurements here.
>
> Development Notes
> =================
>
> During development, we explored the following proposals:
>
> 1) directly promoting within folio_mark_accessed (FMA)
> Originally suggested by Johannes Weiner
> https://lore.kernel.org/all/20240803094715.23900-1-gourry@gourry.net/
>
> This caused deadlocks due to the fact that the PTL was held
> in a variety of cases - but in particular during task exit.
> It also is incredibly inflexible and causes promotion-on-fault.
> It was discussed that a deferral mechanism was preferred.
>
>
> 2) promoting in filemap.c locations (calls of FMA)
> Originally proposed by Feng Tang and Ying Huang
> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/patch/?id=5f2e64ce75c0322602c2ec8c70b64bb69b1f1329
>
> First, we saw this as less problematic than directly hooking FMA,
> but we realized this has the potential to miss data in a variety of
> locations: swap.c, memory.c, gup.c, ksm.c, paddr.c - etc.
>
> Second, we discovered that the lock state of pages is very subtle,
> and that these locations in filemap.c can be called in an atomic
> context. Prototypes lead to a variety of stalls and lockups.
>
>
> 3) a new LRU - originally proposed by Keith Busch
> https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/patch/?id=6616afe9a722f6ebedbb27ade3848cf07b9a3af7
>
> There are two issues with this approach: PG_promotable and reclaim.
>
> First - PG_promotable has generally been discouraged.
>
> Second - Attaching this mechanism to an LRU is both backwards and
> counter-intuitive. A promotable list is better served by a MOST
> recently used list, and since LRUs are generally only shrunk when
> exposed to pressure it would require implementing a new promotion
> list shrinker that runs separately from the existing reclaim logic.
>
>
> 4) Adding a separate kthread - suggested by many
>
> This is - to an extent - a more general version of the LRU proposal.
> We still have to track the folios - which likely requires the
> addition of a page flag. Additionally, this method would actually
> contend pretty heavily with LRU behavior - i.e. we'd want to
> throttle addition to the promotion candidate list in some scenarios.
>
>
> 5) Doing it in task work
>
> This seemed to be the most realistic after considering the above.
>
> We observe the following:
> - FMA is an ideal hook for this and isolation is safe here
> - the new promotion_candidate function is an ideal hook for new
> filter logic (throttling, fairness, etc).
> - isolated folios are either promoted or putback on task resume,
> there are no additional concurrency mechanics to worry about
> - The mechanic can be made optional via a sysfs hook to avoid
> overhead in degenerate scenarios (thrashing).
>
> We also piggy-backed on the numa_hint_fault_latency timestamp to
> further throttle promotions to help avoid promotions on one or
> two time accesses to a particular page.
>
>
> Test:
> ======
>
> Environment:
> 1.5-3.7GHz CPU, ~4000 BogoMIPS,
> 1TB Machine with 768GB DRAM and 256GB CXL
> A 64GB file being linearly read by 6-7 Python processes
>
> Goal:
> Generate promotions. Demonstrate stability and measure overhead.
>
> System Settings:
> echo 1 > /sys/kernel/mm/numa/demotion_enabled
> echo 1 > /sys/kernel/mm/numa/pagecache_promotion_enabled
> echo 2 > /proc/sys/kernel/numa_balancing
>
> Each process took up ~128GB, with anonymous memory growing and
> shrinking as python filled and released buffers with the 64GB data.
> This causes DRAM pressure to generate demotions, and file pages to
> "become hot" - and therefore be selected for promotion.
>
> First we ran with promotion disabled to show consistent overhead as
> a result of forcing a file out to CXL memory. We first ran a single
> reader to see uncontended performance, launched many readers to force
> demotions, then dropped back to a single reader to observe.
>
> Single-reader DRAM: ~16.0-16.4s
> Single-reader CXL (after demotion): ~16.8-17s
The difference is trivial. This makes me wonder: why do we need this
patchset?
> Next we turned promotion on with only a single reader running.
>
> Before promotions:
> Node 0 MemFree: 636478112 kB
> Node 0 FilePages: 59009156 kB
> Node 1 MemFree: 250336004 kB
> Node 1 FilePages: 14979628 kB
Why are there so many file pages on node 1 even though there are a lot
of free pages on node 0? Did you move some file pages from node 0 to node 1?
> After promotions:
> Node 0 MemFree: 632267268 kB
> Node 0 FilePages: 72204968 kB
> Node 1 MemFree: 262567056 kB
> Node 1 FilePages: 2918768 kB
>
> Single-reader (after_promotion): ~16.5s
>
> Turning the promotion mechanism on when nothing had been demoted
> produced no appreciable overhead (memory allocation noise overpowers it)
>
> Read time did not change when promotion was turned off after promotion
> had occurred, which implies that the additional overhead is not coming from
> the promotion system itself - but likely other pages still trapped on
> the low tier. Either way, this at least demonstrates the mechanism is
> not particularly harmful when there are no pages to promote - and the
> mechanism is valuable when a file actually is quite hot.
>
> Notably, it takes some time for the average read loop to come back
> down, and there are still unpromoted file pages trapped in the pagecache.
> This isn't entirely unexpected; there are many files which may have been
> demoted, and they may not be very hot.
>
>
> Overhead
> ======
> When promotion was turned on, we saw a temporary loop-runtime increase:
>
> before: 16.8s
> during:
> 17.606216192245483
> 17.375206470489502
> 17.722095489501953
> 18.230552434921265
> 18.20712447166443
> 18.008254528045654
> 17.008427381515503
> 16.851454257965088
> 16.715774059295654
> stable: ~16.5s
>
> We measured overhead with a separate patch that simply measured the
> rdtsc value before/after calls in promotion_candidate and task work.
>
> e.g.:
> + start = rdtsc();
> list_for_each_entry_safe(folio, tmp, promo_list, lru) {
> list_del_init(&folio->lru);
> migrate_misplaced_folio(folio, NULL, nid);
> + count++;
> }
> + atomic_long_add(rdtsc()-start, &promo_time);
> + atomic_long_add(count, &promo_count);
>
> numa_migrate_prep: 93 - time(3969867917) count(42576860)
> migrate_misplaced_folio_prepare: 491 - time(3433174319) count(6985523)
> migrate_misplaced_folio: 1635 - time(11426529980) count(6985523)
>
> Thoughts on a good throttling heuristic would be appreciated here.
We do have a throttle mechanism already, for example, you can used
$ echo 100 > /proc/sys/kernel/numa_balancing_promote_rate_limit_MBps
to rate limit the promotion throughput under 100 MB/s for each DRAM
node.
> Suggested-by: Huang Ying <ying.huang@linux.alibaba.com>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Suggested-by: Keith Busch <kbusch@meta.com>
> Suggested-by: Feng Tang <feng.tang@intel.com>
> Signed-off-by: Gregory Price <gourry@gourry.net>
>
> Gregory Price (5):
> migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA.
> memory: move conditionally defined enums use inside ifdef tags
> memory: allow non-fault migration in numa_migrate_check path
> vmstat: add page-cache numa hints
> migrate,sysfs: add pagecache promotion
>
> .../ABI/testing/sysfs-kernel-mm-numa | 20 ++++++
> include/linux/memory-tiers.h | 2 +
> include/linux/migrate.h | 2 +
> include/linux/sched.h | 3 +
> include/linux/sched/numa_balancing.h | 5 ++
> include/linux/vm_event_item.h | 8 +++
> init/init_task.c | 1 +
> kernel/sched/fair.c | 26 +++++++-
> mm/memory-tiers.c | 27 ++++++++
> mm/memory.c | 32 +++++-----
> mm/mempolicy.c | 25 +++++---
> mm/migrate.c | 61 ++++++++++++++++++-
> mm/swap.c | 3 +
> mm/vmstat.c | 2 +
> 14 files changed, 193 insertions(+), 24 deletions(-)
---
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios.
2024-12-21 5:18 ` [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios Huang, Ying
@ 2024-12-21 14:48 ` Gregory Price
2024-12-22 7:09 ` Huang, Ying
0 siblings, 1 reply; 27+ messages in thread
From: Gregory Price @ 2024-12-21 14:48 UTC (permalink / raw)
To: Huang, Ying
Cc: Gregory Price, linux-mm, linux-kernel, nehagholkar, abhishekd,
kernel-team, david, nphamcs, akpm, hannes, kbusch
On Sat, Dec 21, 2024 at 01:18:04PM +0800, Huang, Ying wrote:
> Gregory Price <gourry@gourry.net> writes:
>
> >
> > Single-reader DRAM: ~16.0-16.4s
> > Single-reader CXL (after demotion): ~16.8-17s
>
> The difference is trivial. This makes me wonder: why do we need this
> patchset?
>
That's 3-6% performance in this contrived case.
We're working on testing a real workload we know suffers from this
problem, as it is long-running. Should be early in the new year, hopefully.
> > Next we turned promotion on with only a single reader running.
> >
> > Before promotions:
> > Node 0 MemFree: 636478112 kB
> > Node 0 FilePages: 59009156 kB
> > Node 1 MemFree: 250336004 kB
> > Node 1 FilePages: 14979628 kB
>
> Why are there so many file pages on node 1 even though there are a lot
> of free pages on node 0? Did you move some file pages from node 0 to node 1?
>
This was explicit and explained in the test notes:
First we ran with promotion disabled to show consistent overhead as
a result of forcing a file out to CXL memory. We first ran a single
reader to see uncontended performance, launched many readers to force
demotions, then dropped back to a single reader to observe.
The goal here was to simply demonstrate functionality and stability.
> > After promotions:
> > Node 0 MemFree: 632267268 kB
> > Node 0 FilePages: 72204968 kB
> > Node 1 MemFree: 262567056 kB
> > Node 1 FilePages: 2918768 kB
> >
> > Single-reader (after_promotion): ~16.5s
This represents a 2.5-6% speedup depending on the spread.
> >
> > numa_migrate_prep: 93 - time(3969867917) count(42576860)
> > migrate_misplaced_folio_prepare: 491 - time(3433174319) count(6985523)
> > migrate_misplaced_folio: 1635 - time(11426529980) count(6985523)
> >
> > Thoughts on a good throttling heuristic would be appreciated here.
>
> We do have a throttle mechanism already, for example, you can used
>
> $ echo 100 > /proc/sys/kernel/numa_balancing_promote_rate_limit_MBps
>
> to rate limit the promotion throughput under 100 MB/s for each DRAM
> node.
>
Can easily piggyback on that, just wasn't sure if overloading it was
an acceptable idea. Although since that promotion rate limit is also
per-task (as far as I know, will need to read into it a bit more) this
is probably fine.
~Gregory
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios.
2024-12-21 14:48 ` Gregory Price
@ 2024-12-22 7:09 ` Huang, Ying
2024-12-22 16:22 ` Gregory Price
0 siblings, 1 reply; 27+ messages in thread
From: Huang, Ying @ 2024-12-22 7:09 UTC (permalink / raw)
To: Gregory Price
Cc: linux-mm, linux-kernel, nehagholkar, abhishekd, kernel-team,
david, nphamcs, akpm, hannes, kbusch
Gregory Price <gourry@gourry.net> writes:
> On Sat, Dec 21, 2024 at 01:18:04PM +0800, Huang, Ying wrote:
>> Gregory Price <gourry@gourry.net> writes:
>>
>> >
>> > Single-reader DRAM: ~16.0-16.4s
>> > Single-reader CXL (after demotion): ~16.8-17s
>>
>> The difference is trivial. This makes me wonder: why do we need this
>> patchset?
>>
>
> That's 3-6% performance in this contrived case.
This is small too.
> We're working on testing a real workload we know suffers from this
> problem, as it is long-running. Should be early in the new year, hopefully.
Good!
To demonstrate the max possible performance gain, we can use a pure
file read/write benchmark such as fio, run on pure DRAM and pure CXL.
The difference is then the max possible performance gain we can get.
>> > Next we turned promotion on with only a single reader running.
>> >
>> > Before promotions:
>> > Node 0 MemFree: 636478112 kB
>> > Node 0 FilePages: 59009156 kB
>> > Node 1 MemFree: 250336004 kB
>> > Node 1 FilePages: 14979628 kB
>>
>> Why are there so many file pages on node 1 even though there are a lot
>> of free pages on node 0? Did you move some file pages from node 0 to node 1?
>>
>
> This was explicit and explained in the test notes:
>
> First we ran with promotion disabled to show consistent overhead as
> a result of forcing a file out to CXL memory. We first ran a single
> reader to see uncontended performance, launched many readers to force
> demotions, then dropped back to a single reader to observe.
>
> The goal here was to simply demonstrate functionality and stability.
Got it.
>> > After promotions:
>> > Node 0 MemFree: 632267268 kB
>> > Node 0 FilePages: 72204968 kB
>> > Node 1 MemFree: 262567056 kB
>> > Node 1 FilePages: 2918768 kB
>> >
>> > Single-reader (after_promotion): ~16.5s
>
> This represents a 2.5-6% speedup depending on the spread.
>
>> >
>> > numa_migrate_prep: 93 - time(3969867917) count(42576860)
>> > migrate_misplaced_folio_prepare: 491 - time(3433174319) count(6985523)
>> > migrate_misplaced_folio: 1635 - time(11426529980) count(6985523)
>> >
>> > Thoughts on a good throttling heuristic would be appreciated here.
>>
>> We do have a throttle mechanism already, for example, you can used
>>
>> $ echo 100 > /proc/sys/kernel/numa_balancing_promote_rate_limit_MBps
>>
>> to rate limit the promotion throughput under 100 MB/s for each DRAM
>> node.
>>
>
> Can easily piggyback on that, just wasn't sure if overloading it was
> an acceptable idea.
It's the recommended setup in the original PMEM promotion
implementation. Please check commit c959924b0dc5 ("memory tiering:
adjust hot threshold automatically").
> Although since that promotion rate limit is also
> per-task (as far as I know, will need to read into it a bit more) this
> is probably fine.
It's not per-task. Please read the code, especially
should_numa_migrate_memory().
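Roughly, the idea (a simplified userspace-style sketch of the mechanism
only - this is not the kernel code, and the names below are made up for
illustration) is a per-node candidate counter checked against the
configured budget over a one-second window:

#include <stdbool.h>
#include <stdint.h>
#include <time.h>

struct node_promo_state {
	uint64_t window_start_ms;	/* start of the current 1s window */
	uint64_t window_base_cand;	/* candidates counted at window start */
	uint64_t nr_candidates;		/* total candidates seen on this node */
};

static uint64_t now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* rate_limit_pages: pages allowed per second (MB/s limit / page size) */
static bool promotion_rate_limited(struct node_promo_state *ns,
				   uint64_t rate_limit_pages,
				   unsigned int nr_pages)
{
	uint64_t now = now_ms();

	ns->nr_candidates += nr_pages;
	if (now - ns->window_start_ms > 1000) {
		ns->window_start_ms = now;
		ns->window_base_cand = ns->nr_candidates;
	}
	return ns->nr_candidates - ns->window_base_cand >= rate_limit_pages;
}

The state is kept per DRAM node rather than per task, so every task
promoting into the same node shares the same budget.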
---
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios.
2024-12-22 7:09 ` Huang, Ying
@ 2024-12-22 16:22 ` Gregory Price
2024-12-27 2:16 ` Huang, Ying
0 siblings, 1 reply; 27+ messages in thread
From: Gregory Price @ 2024-12-22 16:22 UTC (permalink / raw)
To: Huang, Ying
Cc: Gregory Price, linux-mm, linux-kernel, nehagholkar, abhishekd,
kernel-team, david, nphamcs, akpm, hannes, kbusch
On Sun, Dec 22, 2024 at 03:09:44PM +0800, Huang, Ying wrote:
> Gregory Price <gourry@gourry.net> writes:
> > That's 3-6% performance in this contrived case.
>
> This is small too.
>
Small is relative. 3-6% performance increase across millions of servers
across a year is a non trivial speedup for such a common operation.
> > Can easily piggyback on that, just wasn't sure if overloading it was
> > an acceptable idea.
>
> It's the recommended setup in the original PMEM promotion
> implementation. Please check commit c959924b0dc5 ("memory tiering:
> adjust hot threshold automatically").
>
> > Although since that promotion rate limit is also
> > per-task (as far as I know, will need to read into it a bit more) this
> > is probably fine.
>
> It's not per-task. Please read the code, especially
> should_numa_migrate_memory().
Oh, then this is already throttled. We call mpol_misplaced which calls
should_numa_migrate_memory.
There's some duplication of candidate selection logic between
promotion_candidate and should_numa_migrate_memory, but it may be
beneficial to keep it that way. I'll have to look.
~Gregory
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios.
2024-12-22 16:22 ` Gregory Price
@ 2024-12-27 2:16 ` Huang, Ying
2024-12-27 15:40 ` Gregory Price
0 siblings, 1 reply; 27+ messages in thread
From: Huang, Ying @ 2024-12-27 2:16 UTC (permalink / raw)
To: Gregory Price
Cc: linux-mm, linux-kernel, nehagholkar, abhishekd, kernel-team,
david, nphamcs, akpm, hannes, kbusch
Gregory Price <gourry@gourry.net> writes:
> On Sun, Dec 22, 2024 at 03:09:44PM +0800, Huang, Ying wrote:
>> Gregory Price <gourry@gourry.net> writes:
>> > That's 3-6% performance in this contrived case.
>>
>> This is small too.
>>
>
> Small is relative. 3-6% performance increase across millions of servers
> across a year is a non trivial speedup for such a common operation.
If we can only get a 3-6% performance increase in a micro-benchmark,
how much can we get from real-life workloads?
Anyway, we need to prove the usefulness of the change via data. 3-6%
isn't very strong data.
Can we measure the largest improvement? For example, run the benchmark
with all file pages in DRAM and CXL.mem via numa binding, and compare.
>> > Can easily piggyback on that, just wasn't sure if overloading it was
>> > an acceptable idea.
>>
>> It's the recommended setup in the original PMEM promotion
>> implementation. Please check commit c959924b0dc5 ("memory tiering:
>> adjust hot threshold automatically").
>>
>> > Although since that promotion rate limit is also
>> > per-task (as far as I know, will need to read into it a bit more) this
>> > is probably fine.
>>
>> It's not per-task. Please read the code, especially
>> should_numa_migrate_memory().
>
> Oh, then this is already throttled. We call mpol_misplaced which calls
> should_numa_migrate_memory.
>
> There's some duplication of candidate selection logic between
> promotion_candidate and should_numa_migrate_memory, but it may be
> beneficial to keep it that way. I'll have to look.
---
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios.
2024-12-27 2:16 ` Huang, Ying
@ 2024-12-27 15:40 ` Gregory Price
2024-12-27 19:09 ` Gregory Price
0 siblings, 1 reply; 27+ messages in thread
From: Gregory Price @ 2024-12-27 15:40 UTC (permalink / raw)
To: Huang, Ying
Cc: Gregory Price, linux-mm, linux-kernel, nehagholkar, abhishekd,
kernel-team, david, nphamcs, akpm, hannes, kbusch
On Fri, Dec 27, 2024 at 10:16:42AM +0800, Huang, Ying wrote:
> Gregory Price <gourry@gourry.net> writes:
>
> > On Sun, Dec 22, 2024 at 03:09:44PM +0800, Huang, Ying wrote:
> >> Gregory Price <gourry@gourry.net> writes:
> >> > That's 3-6% performance in this contrived case.
> >>
> >> This is small too.
> >>
> >
> > Small is relative. 3-6% performance increase across millions of servers
> > across a year is a non trivial speedup for such a common operation.
>
> If we can only get a 3-6% performance increase in a micro-benchmark,
> how much can we get from real-life workloads?
>
> Anyway, we need to prove the usefulness of the change via data. 3-6%
> isn't very strong data.
>
> Can we measure the largest improvement? For example, run the benchmark
> with all file pages in DRAM and CXL.mem via numa binding, and compare.
>
I can probably come up with something, will rework some stuff.
~Gregory
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios.
2024-12-27 15:40 ` Gregory Price
@ 2024-12-27 19:09 ` Gregory Price
2024-12-28 3:38 ` Gregory Price
0 siblings, 1 reply; 27+ messages in thread
From: Gregory Price @ 2024-12-27 19:09 UTC (permalink / raw)
To: Huang, Ying
Cc: linux-mm, linux-kernel, nehagholkar, abhishekd, kernel-team,
david, nphamcs, akpm, hannes, kbusch
On Fri, Dec 27, 2024 at 10:40:36AM -0500, Gregory Price wrote:
> > Can we measure the largest improvement? For example, run the benchmark
> > with all file pages in DRAM and CXL.mem via numa binding, and compare.
>
> I can probably come up with something, will rework some stuff.
>
So I did as you suggested: I made a program that allocates a 16GB
buffer, initializes it, then membinds itself to node1 before accessing
the file to force it into pagecache, then I ran a bunch of tests.
Completely unexpected result: ~25% overhead from an inexplicable source.
baseline - no membind()
./test
Read loop took 0.93 seconds
drop caches
./test - w/ membind(1) just before file open
Read loop took 1.16 seconds
node 1 size: 262144 MB
node 1 free: 245756 MB <- file confirmed in cache
kill and relaunch without membind to avoid any funny business
./test
Read loop took 1.16 seconds
enable promotion
Read loop took 3.37 seconds <- migration overhead
... snip ...
Read loop took 1.17 seconds <- stabilizes here
node 1 size: 262144 MB
node 1 free: 262144 MB <- pagecache promoted
Absolutely bizarre result: there is 0% CXL usage occurring, but the
overhead we originally measured is still present.
This overhead persists even if i do the following
- disable pagecache promotion
- disable numa_balancing
- offline CXL memory entirely
This is actually pretty wild. I presume this must imply the folio flags
are mucked up after migration and we're incurring a bunch of overhead
on access for no reason. At the very least it doesn't appear to be
an isolated folio issue:
nr_isolated_anon 0
nr_isolated_file 0
I'll have to dig into this further, I wonder if this happens with mapped
memory as well.
~Gregory
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios.
2024-12-27 19:09 ` Gregory Price
@ 2024-12-28 3:38 ` Gregory Price
2024-12-31 7:32 ` Gregory Price
0 siblings, 1 reply; 27+ messages in thread
From: Gregory Price @ 2024-12-28 3:38 UTC (permalink / raw)
To: Huang, Ying
Cc: linux-mm, linux-kernel, nehagholkar, abhishekd, kernel-team,
david, nphamcs, akpm, hannes, kbusch
On Fri, Dec 27, 2024 at 02:09:50PM -0500, Gregory Price wrote:
> On Fri, Dec 27, 2024 at 10:40:36AM -0500, Gregory Price wrote:
just adding some follow-up data
test is essentially
membind(1) - node1 is cxl
read() - filecache is initialized on cxl
set_mempolicy(MPOL_DEFAULT) - allow migrations
while true:
    start = time()
    read()
    print(time() - start)
// external events cause migration/drop cache while running
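For reference, a minimal standalone version of that loop looks roughly
like the below (illustrative sketch only - the file name, buffer size,
and node mask are placeholders, and it needs -lnuma for set_mempolicy()):

#include <fcntl.h>
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BUF_SZ (1UL << 20)

static double read_file(int fd, char *buf)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	lseek(fd, 0, SEEK_SET);
	while (read(fd, buf, BUF_SZ) > 0)
		;
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
	unsigned long node1 = 1UL << 1;		/* node 1 is the CXL node */
	char *buf = malloc(BUF_SZ);
	int fd = open("testfile", O_RDONLY);	/* stand-in for the test file */

	if (fd < 0 || !buf)
		return 1;

	/* bind allocations (including the page cache fill) to the CXL node */
	if (set_mempolicy(MPOL_BIND, &node1, sizeof(node1) * 8))
		perror("set_mempolicy");
	read_file(fd, buf);			/* populate the cache on node 1 */

	/* drop the binding so promotion is free to move the folios */
	set_mempolicy(MPOL_DEFAULT, NULL, 0);

	for (;;)
		printf("Read loop took %.2f seconds\n", read_file(fd, buf));
}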
baseline: .93-1s/read()
from cxl: ~1.15-1.2s/read()
So we are seeing anywhere from 20-25% overhead from the filecache living
on CXL right out of the box. At least we have good clear signal, right?
tests:
echo 3 > drop_cache - filecache refills into node 1
result => ~.95-1s/read()
we return back to the baseline, which is expected
enable promotion - numactl shows promotion occurs
result => ~1.15-1.2s/read()
No effect?! Even offlining the dax devices does nothing.
enable promotion, wait for it to complete, drop cache
after promotion => 1.15-1.2s/read
after drop cache => .95-1s/read()
Back to baseline!
This seems to imply that the overhead we're seeing from read() even
when filecache is on the remote node isn't actually related to the
memory speed, but instead likely related to some kind of stale
metadata in the filesystem or filecache layers.
This is going to take me a bit to figure out. I need to isolate the
filesystem influence (we are using btrfs, i want to make sure this
behavior is consistent on other file systems).
~Gregory
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios.
2024-12-28 3:38 ` Gregory Price
@ 2024-12-31 7:32 ` Gregory Price
2025-01-02 2:58 ` Huang, Ying
0 siblings, 1 reply; 27+ messages in thread
From: Gregory Price @ 2024-12-31 7:32 UTC (permalink / raw)
To: Huang, Ying
Cc: linux-mm, linux-kernel, nehagholkar, abhishekd, kernel-team,
david, nphamcs, akpm, hannes, kbusch
On Fri, Dec 27, 2024 at 10:38:45PM -0500, Gregory Price wrote:
> On Fri, Dec 27, 2024 at 02:09:50PM -0500, Gregory Price wrote:
>
> This seems to imply that the overhead we're seeing from read() even
> when filecache is on the remote node isn't actually related to the
> memory speed, but instead likely related to some kind of stale
> metadata in the filesystem or filecache layers.
>
> ~Gregory
Mystery solved
> +void promotion_candidate(struct folio *folio)
> +{
... snip ...
> + list_add(&folio->lru, promo_list);
> +}
read(file, length) will do a linear read, and promotion_candidate will
add those pages to the promotion list head, resulting in a reversed
promotion order.
So if you read folios [1,2,3,4], you'll promote them in [4,3,2,1] order.
The result of this, on an unloaded system, is essentially that pages end
up in the worst possible configuration for the prefetcher, and therefore
for TLB hits. I figured this out because I was seeing the additional ~30%
overhead show up purely in `copy_page_to_iter()` (i.e. copy_to_user).
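If it helps to see the ordering effect in isolation, here is a tiny
userspace mock of the two list helpers (same semantics as the kernel's
<linux/list.h>, but not kernel code):

#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

#define LIST_HEAD_INIT(name) { &(name), &(name) }

static void __list_insert(struct list_head *new, struct list_head *prev,
			  struct list_head *next)
{
	next->prev = new;
	new->next = next;
	new->prev = prev;
	prev->next = new;
}

/* head insertion - LIFO, what list_add() does */
static void list_add(struct list_head *new, struct list_head *head)
{
	__list_insert(new, head, head->next);
}

/* tail insertion - FIFO, what list_add_tail() does */
static void list_add_tail(struct list_head *new, struct list_head *head)
{
	__list_insert(new, head->prev, head);
}

struct fake_folio {
	struct list_head lru;	/* first member, so a list_head * casts back */
	int idx;
};

int main(void)
{
	struct list_head promo_list = LIST_HEAD_INIT(promo_list);
	struct fake_folio f[4];
	struct list_head *pos;

	for (int i = 0; i < 4; i++) {
		f[i].idx = i + 1;
		list_add(&f[i].lru, &promo_list);   /* swap for list_add_tail() */
	}

	/* walk in promotion order: prints "4 3 2 1" with list_add() */
	for (pos = promo_list.next; pos != &promo_list; pos = pos->next)
		printf("%d ", ((struct fake_folio *)pos)->idx);
	printf("\n");
	return 0;
}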
Swapping this for list_add_tail results in the following test result:
initializing
Read loop took 9.41 seconds <- reading from CXL
Read loop took 31.74 seconds <- migration enabled
Read loop took 10.31 seconds
Read loop took 7.71 seconds <- migration finished
Read loop took 7.71 seconds
Read loop took 7.70 seconds
Read loop took 7.75 seconds
Read loop took 19.34 seconds <- dropped caches
Read loop took 13.68 seconds <- cache refilling to DRAM
Read loop took 7.37 seconds
Read loop took 7.68 seconds
Read loop took 7.65 seconds <- back to DRAM baseline
On our CXL devices, we're seeing a 22-27% performance penalty for a file
being hosted entirely out of CXL. When we promote this file out of CXL,
we see a 22-27% performance boost.
Probably list_add_tail is right here, but since files *tend to* be read
linearly with `read()` this should *tend toward* optimal. That said, we
can probably make this more reliable by adding batch migration function
`mpol_migrate_misplaced_batch()` which also tries to do bulk allocation
of destination folios. This will also probably save us a bunch of
invalidation overhead.
I'm also noticing that the migration limit (256mbps) is not being
respected, probably because we're doing 1 folio at a time instead of a
batch. Will probably look at changing promotion_candidate to limit the
number of selected pages to promote per read-call.
---
diff --git a/mm/migrate.c b/mm/migrate.c
index f965814b7d40..99b584f22bcb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2675,7 +2675,7 @@ void promotion_candidate(struct folio *folio)
folio_putback_lru(folio);
return;
}
- list_add(&folio->lru, promo_list);
+ list_add_tail(&folio->lru, promo_list);
return;
}
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios.
2024-12-31 7:32 ` Gregory Price
@ 2025-01-02 2:58 ` Huang, Ying
0 siblings, 0 replies; 27+ messages in thread
From: Huang, Ying @ 2025-01-02 2:58 UTC (permalink / raw)
To: Gregory Price
Cc: linux-mm, linux-kernel, nehagholkar, abhishekd, kernel-team,
david, nphamcs, akpm, hannes, kbusch
Gregory Price <gourry@gourry.net> writes:
> On Fri, Dec 27, 2024 at 10:38:45PM -0500, Gregory Price wrote:
>> On Fri, Dec 27, 2024 at 02:09:50PM -0500, Gregory Price wrote:
>>
>> This seems to imply that the overhead we're seeing from read() even
>> when filecache is on the remote node isn't actually related to the
>> memory speed, but instead likely related to some kind of stale
>> metadata in the filesystem or filecache layers.
>>
>> ~Gregory
>
> Mystery solved
>
>> +void promotion_candidate(struct folio *folio)
>> +{
> ... snip ...
>> + list_add(&folio->lru, promo_list);
>> +}
>
> read(file, length) will do a linear read, and promotion_candidate will
> add those pages to the promotion list head, resulting in a reversed
> promotion order.
>
> So if you read folios [1,2,3,4], you'll promote them in [4,3,2,1] order.
>
> The result of this, on an unloaded system, is essentially that pages end
> up in the worst possible configuration for the prefetcher, and therefore
> for TLB hits. I figured this out because I was seeing the additional ~30%
> overhead show up purely in `copy_page_to_iter()` (i.e. copy_to_user).
>
> Swapping this for list_add_tail results in the following test result:
>
> initializing
> Read loop took 9.41 seconds <- reading from CXL
> Read loop took 31.74 seconds <- migration enabled
> Read loop took 10.31 seconds
This shows that migration causes a large disturbance to the workload. This
may not be acceptable in real life. Can you check whether the promotion
rate limit can improve the situation?
> Read loop took 7.71 seconds <- migration finished
> Read loop took 7.71 seconds
> Read loop took 7.70 seconds
> Read loop took 7.75 seconds
> Read loop took 19.34 seconds <- dropped caches
> Read loop took 13.68 seconds <- cache refilling to DRAM
> Read loop took 7.37 seconds
> Read loop took 7.68 seconds
> Read loop took 7.65 seconds <- back to DRAM baseline
>
> On our CXL devices, we're seeing a 22-27% performance penalty for a file
> being hosted entirely out of CXL. When we promote this file out of CXL,
> we see a 22-27% performance boost.
This is a good number! Thanks!
> Probably list_add_tail is right here, but since files *tend to* be read
> linearly with `read()` this should *tend toward* optimal. That said, we
> can probably make this more reliable by adding batch migration function
> `mpol_migrate_misplaced_batch()` which also tries to do bulk allocation
> of destination folios. This will also probably save us a bunch of
> invalidation overhead.
>
> I'm also noticing that the migration limit (256mbps) is not being
> respected, probably because we're doing 1 folio at a time instead of a
> batch. Will probably look at changing promotion_candidate to limit the
> number of selected pages to promote per read-call.
The migration limit is checked in should_numa_migrate_memory(). You may
take a look at that function.
> ---
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f965814b7d40..99b584f22bcb 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2675,7 +2675,7 @@ void promotion_candidate(struct folio *folio)
> folio_putback_lru(folio);
> return;
> }
> - list_add(&folio->lru, promo_list);
> + list_add_tail(&folio->lru, promo_list);
>
> return;
> }
[snip]
---
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 27+ messages in thread