* [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
From: Aboorva Devarajan @ 2025-12-01 6:00 UTC
To: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy
Cc: linux-mm, linux-kernel, aboorvad
When page isolation loops indefinitely during memory offline, reading
/proc/sys/vm/percpu_pagelist_high_fraction blocks on pcp_batch_high_lock,
causing hung task warnings.
Make procfs reads lock-free since percpu_pagelist_high_fraction is a simple
integer with naturally atomic reads; writers still serialize via the mutex.
This prevents hung task warnings when reading the procfs file during
long-running memory offline operations.
Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
---
mm/page_alloc.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ed82ee55e66a..7c8d773ed4af 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6611,11 +6611,14 @@ static int percpu_pagelist_high_fraction_sysctl_handler(const struct ctl_table *
int old_percpu_pagelist_high_fraction;
int ret;
+ if (!write)
+ return proc_dointvec_minmax(table, write, buffer, length, ppos);
+
mutex_lock(&pcp_batch_high_lock);
old_percpu_pagelist_high_fraction = percpu_pagelist_high_fraction;
ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
- if (!write || ret < 0)
+ if (ret < 0)
goto out;
/* Sanity checking to avoid pcp imbalance */
--
2.50.1
* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
From: Andrew Morton @ 2025-12-01 17:41 UTC
To: Aboorva Devarajan
Cc: vbabka, surenb, mhocko, jackmanb, hannes, ziy, linux-mm, linux-kernel
On Mon, 1 Dec 2025 11:30:09 +0530 Aboorva Devarajan <aboorvad@linux.ibm.com> wrote:
> When page isolation loops indefinitely during memory offline, reading
> /proc/sys/vm/percpu_pagelist_high_fraction blocks on pcp_batch_high_lock,
> causing hung task warnings.
That's pretty bad behavior.
I wonder if there are other problems which can be caused by this
lengthy hold time.
It would be better to address the lengthy hold time rather than having
to work around it in one impacted site.
> Make procfs reads lock-free since percpu_pagelist_high_fraction is a simple
> integer with naturally atomic reads; writers still serialize via the mutex.
>
> This prevents hung task warnings when reading the procfs file during
> long-running memory offline operations.
>
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6611,11 +6611,14 @@ static int percpu_pagelist_high_fraction_sysctl_handler(const struct ctl_table *
> int old_percpu_pagelist_high_fraction;
> int ret;
>
> + if (!write)
> + return proc_dointvec_minmax(table, write, buffer, length, ppos);
> +
> mutex_lock(&pcp_batch_high_lock);
> old_percpu_pagelist_high_fraction = percpu_pagelist_high_fraction;
>
> ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
> - if (!write || ret < 0)
> + if (ret < 0)
> goto out;
>
> /* Sanity checking to avoid pcp imbalance */
That being said, I'll grab the patch and shall put a cc:stable on it,
see what people think about this hold-time issue.
* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
From: Michal Hocko @ 2025-12-03 8:21 UTC
To: Aboorva Devarajan
Cc: akpm, vbabka, surenb, jackmanb, hannes, ziy, linux-mm, linux-kernel
On Mon 01-12-25 11:30:09, Aboorva Devarajan wrote:
> When page isolation loops indefinitely during memory offline, reading
> /proc/sys/vm/percpu_pagelist_high_fraction blocks on pcp_batch_high_lock,
> causing hung task warnings.
>
> Make procfs reads lock-free since percpu_pagelist_high_fraction is a simple
> integer with naturally atomic reads; writers still serialize via the mutex.
>
> This prevents hung task warnings when reading the procfs file during
> long-running memory offline operations.
>
> Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Looks OK. I would just add a short comment explaining that in the code.
See below.
Acked-by: Michal Hocko <mhocko@suse.com>
> ---
> mm/page_alloc.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ed82ee55e66a..7c8d773ed4af 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6611,11 +6611,14 @@ static int percpu_pagelist_high_fraction_sysctl_handler(const struct ctl_table *
> int old_percpu_pagelist_high_fraction;
> int ret;
>
/*
* Avoid using pcp_batch_high_lock for reads as the value is
* read atomically and a race with offlining is harmless.
*/
> + if (!write)
> + return proc_dointvec_minmax(table, write, buffer, length, ppos);
> +
> mutex_lock(&pcp_batch_high_lock);
> old_percpu_pagelist_high_fraction = percpu_pagelist_high_fraction;
>
> ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
> - if (!write || ret < 0)
> + if (ret < 0)
> goto out;
>
> /* Sanity checking to avoid pcp imbalance */
> --
> 2.50.1
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
From: Michal Hocko @ 2025-12-03 8:27 UTC
To: Andrew Morton
Cc: Aboorva Devarajan, vbabka, surenb, jackmanb, hannes, ziy,
linux-mm, linux-kernel, Oscar Salvador, David Hildenbrand
Let me add Oscar and David.
On Mon 01-12-25 09:41:12, Andrew Morton wrote:
> On Mon, 1 Dec 2025 11:30:09 +0530 Aboorva Devarajan <aboorvad@linux.ibm.com> wrote:
>
> > When page isolation loops indefinitely during memory offline, reading
> > /proc/sys/vm/percpu_pagelist_high_fraction blocks on pcp_batch_high_lock,
> > causing hung task warnings.
>
> That's pretty bad behavior.
>
> I wonder if there are other problems which can be caused by this
> lengthy hold time.
pcp_batch_high_lock is not taken in any performance critical path. It is
true that memory offlining can take long when memory is not free but I
am not sure we can do much better. I guess we could check contention on
the lock and drop it to make cpu hotplug events and
sysctl_min_unmapped_ratio_sysctl_handler smoother. The question is
whether this is a practical problem hit in real life.
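Something like the following, perhaps (a rough, untested sketch; the helper
name is made up and nothing like it exists today):

	/*
	 * Hypothetical helper: briefly drop pcp_batch_high_lock between
	 * offlining retry passes so that cpu hotplug events and the sysctl
	 * handlers can make progress in the meantime.
	 */
	static void pcp_batch_high_lock_relax(void)
	{
		mutex_unlock(&pcp_batch_high_lock);
		cond_resched();
		mutex_lock(&pcp_batch_high_lock);
	}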
> It would be better to address the lengthy hold time rather than having
> to work around it in one impacted site.
>
> > Make procfs reads lock-free since percpu_pagelist_high_fraction is a simple
> > integer with naturally atomic reads; writers still serialize via the mutex.
> >
> > This prevents hung task warnings when reading the procfs file during
> > long-running memory offline operations.
> >
> > ...
> >
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6611,11 +6611,14 @@ static int percpu_pagelist_high_fraction_sysctl_handler(const struct ctl_table *
> > int old_percpu_pagelist_high_fraction;
> > int ret;
> >
> > + if (!write)
> > + return proc_dointvec_minmax(table, write, buffer, length, ppos);
> > +
> > mutex_lock(&pcp_batch_high_lock);
> > old_percpu_pagelist_high_fraction = percpu_pagelist_high_fraction;
> >
> > ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
> > - if (!write || ret < 0)
> > + if (ret < 0)
> > goto out;
> >
> > /* Sanity checking to avoid pcp imbalance */
>
> That being said, I'll grab the patch and shall put a cc:stable on it,
> see what people think about this hold-time issue.
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
From: Gregory Price @ 2025-12-03 8:35 UTC
To: Michal Hocko
Cc: Andrew Morton, Aboorva Devarajan, vbabka, surenb, jackmanb,
hannes, ziy, linux-mm, linux-kernel, Oscar Salvador,
David Hildenbrand
On Wed, Dec 03, 2025 at 09:27:26AM +0100, Michal Hocko wrote:
> Let me add Oscar and David.
>
> On Mon 01-12-25 09:41:12, Andrew Morton wrote:
> > On Mon, 1 Dec 2025 11:30:09 +0530 Aboorva Devarajan <aboorvad@linux.ibm.com> wrote:
> >
> > > When page isolation loops indefinitely during memory offline, reading
> > > /proc/sys/vm/percpu_pagelist_high_fraction blocks on pcp_batch_high_lock,
> > > causing hung task warnings.
> >
> > That's pretty bad behavior.
> >
> > I wonder if there are other problems which can be caused by this
> > lengthy hold time.
>
> pcp_batch_high_lock is not taken in any performance critical path. It is
> true that memory offlining can take long when memory is not free but I
> am not sure we can do much better. I guess we could check contention on
> the lock and drop it to make cpu hotplug events and
> sysctl_min_unmapped_ratio_sysctl_handler smoother. The question is
> whether this is a practical problem hit in real life.
>
I just today hit a scenario where offlining was blocked on migration
failures that took an exceedingly long time to offline (many minutes)
even on a relatively small block (256MB).
Now that I'm looking at the double-do-while loop in memory_hotplug.c
zone_pcp_disable(zone); /* (pcp_batch_high_lock) */
...
do {
	do {
		...
		cond_resched();
		ret = scan_movable_pages(pfn, end_pfn, &pfn);
		if (!ret) {
			/*
			 * TODO: fatal migration failures should bail
			 * out
			 */
			do_migrate_range(pfn, end_pfn);
		}
	} while (!ret);
} while (ret);
...
zone_pcp_enable(zone); /* (pcp_batch_high_lock) */
Maybe it's time to implement the bail out?
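Something like this, maybe (untested sketch against the loop above;
do_migrate_range() returns void today, so it would first have to learn to
report failures, and which errno values count as fatal is my guess):

	do {
		do {
			cond_resched();
			ret = scan_movable_pages(pfn, end_pfn, &pfn);
			if (!ret) {
				rc = do_migrate_range(pfn, end_pfn);
				/* Treat anything but transient errors as fatal. */
				if (rc && rc != -EAGAIN && rc != -ENOMEM)
					goto failed_removal_isolated;
			}
		} while (!ret);
	} while (ret);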
~Gregory
* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
From: Michal Hocko @ 2025-12-03 8:42 UTC
To: Gregory Price
Cc: Andrew Morton, Aboorva Devarajan, vbabka, surenb, jackmanb,
hannes, ziy, linux-mm, linux-kernel, Oscar Salvador,
David Hildenbrand
On Wed 03-12-25 03:35:51, Gregory Price wrote:
> On Wed, Dec 03, 2025 at 09:27:26AM +0100, Michal Hocko wrote:
> > Let me add Oscar and David.
> >
> > On Mon 01-12-25 09:41:12, Andrew Morton wrote:
> > > On Mon, 1 Dec 2025 11:30:09 +0530 Aboorva Devarajan <aboorvad@linux.ibm.com> wrote:
> > >
> > > > When page isolation loops indefinitely during memory offline, reading
> > > > /proc/sys/vm/percpu_pagelist_high_fraction blocks on pcp_batch_high_lock,
> > > > causing hung task warnings.
> > >
> > > That's pretty bad behavior.
> > >
> > > I wonder if there are other problems which can be caused by this
> > > lengthy hold time.
> >
> > pcp_batch_high_lock is not taken in any performance critical path. It is
> > true that memory offlining can take long when memory is not free but I
> > am not sure we can do much better. I guess we could check contention on
> > the lock and drop it to make cpu hotplug events and
> > sysctl_min_unmapped_ratio_sysctl_handler smoother. The question is
> > whether this is a practical problem hit in real life.
> >
>
> I just today hit a scenario where offlining was blocked on migration
> failures that took an exceedingly long time to offline (many minutes)
> even on a relatively small block (256MB).
>
> Now that I'm looking at the double-do-while loop in memory_hotplug.c
>
> zone_pcp_disable(zone); /* (pcp_batch_high_lock) */
> ...
> do {
> do {
> ...
> cond_resched();
> ret = scan_movable_pages(pfn, end_pfn, &pfn);
> if (!ret) {
> /*
> * TODO: fatal migration failures should bail
> * out
> */
> do_migrate_range(pfn, end_pfn);
> }
> } while (!ret);
> } while (ret);
> ...
> zone_pcp_enable(zone); /* (pcp_batch_high_lock) */
>
>
> Maybe it's time to implement the bail out?
That would be great but can we tell transient from permanent migration
failures? Maybe long term pins could be treated as permanent failure.
>
> ~Gregory
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
From: David Hildenbrand (Red Hat) @ 2025-12-03 8:51 UTC
To: Michal Hocko, Gregory Price
Cc: Andrew Morton, Aboorva Devarajan, vbabka, surenb, jackmanb,
hannes, ziy, linux-mm, linux-kernel, Oscar Salvador
On 12/3/25 09:42, Michal Hocko wrote:
> On Wed 03-12-25 03:35:51, Gregory Price wrote:
>> On Wed, Dec 03, 2025 at 09:27:26AM +0100, Michal Hocko wrote:
>>> Let me add Oscar and David.
>>>
>>> On Mon 01-12-25 09:41:12, Andrew Morton wrote:
>>>> On Mon, 1 Dec 2025 11:30:09 +0530 Aboorva Devarajan <aboorvad@linux.ibm.com> wrote:
>>>>
>>>>> When page isolation loops indefinitely during memory offline, reading
>>>>> /proc/sys/vm/percpu_pagelist_high_fraction blocks on pcp_batch_high_lock,
>>>>> causing hung task warnings.
>>>>
>>>> That's pretty bad behavior.
>>>>
>>>> I wonder if there are other problems which can be caused by this
>>>> lengthy hold time.
>>>
>>> pcp_batch_high_lock is not taken in any performance critical path. It is
>>> true that memory offlining can take long when memory is not free but I
>>> am not sure we can do much better. I guess we could check contention on
>>> the lock and drop it to make cpu hotplug events and
>>> sysctl_min_unmapped_ratio_sysctl_handler smoother. The question is
>>> whether this is a practical problem hit in real life.
>>>
>>
>> I just today hit a scenario where offlining was blocked on migration
>> failures that took an exceedingly long time to offline (many minutes)
>> even on a relatively small block (256MB).
>>
>> Now that I'm looking at the double-do-while loop in memory_hotplug.c
>>
>> zone_pcp_disable(zone); /* (pcp_batch_high_lock) */
>> ...
>> do {
>> do {
>> ...
>> cond_resched();
>> ret = scan_movable_pages(pfn, end_pfn, &pfn);
>> if (!ret) {
>> /*
>> * TODO: fatal migration failures should bail
>> * out
>> */
>> do_migrate_range(pfn, end_pfn);
>> }
>> } while (!ret);
>> } while (ret);
>> ...
>> zone_pcp_enable(zone); /* (pcp_batch_high_lock) */
>>
>>
>> Maybe it's time to implement the bail out?
>
> That would be great but can we tell transient from permanent migration
> failures? Maybe long term pins could be treated as permanent failure.
Did we try to offline a ZONE_MOVABLE block or a ZONE_NORMAL block? In the
case of ZONE_MOVABLE, bailing out is not really the right thing to do.
--
Cheers
David
* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
From: Gregory Price @ 2025-12-03 8:59 UTC
To: Michal Hocko
Cc: Andrew Morton, Aboorva Devarajan, vbabka, surenb, jackmanb,
hannes, ziy, linux-mm, linux-kernel, Oscar Salvador,
David Hildenbrand
On Wed, Dec 03, 2025 at 09:42:59AM +0100, Michal Hocko wrote:
> On Wed 03-12-25 03:35:51, Gregory Price wrote:
> > if (!ret) {
> > /*
> > * TODO: fatal migration failures should bail
> > * out
> > */
> > do_migrate_range(pfn, end_pfn);
> > }
> >
> > Maybe it's time to implement the bail out?
>
> That would be great but can we tell transient from permanent migration
> failures? Maybe long term pins could be treated as permanent failure.
>
I see deep in migration code `migrate_pages_batch()` we would return
"Some other failure" as fatal:
switch (rc) {
case -ENOMEM:
	...
	/* Note: some long-term pin handling is done here */
	break;
case -EAGAIN:
	...
	break;
case 0:
	...
	list_move_tail(&folio->lru, &unmap_folios);
	list_add_tail(&dst->lru, &dst_folios);
	break;
default:
	/*
	 * Permanent failure (-EBUSY, etc.):
	 * unlike -EAGAIN case, the failed folio is
	 * removed from migration folio list and not
	 * retried in the next outer loop.
	 */
	nr_failed++;
	stats->nr_thp_failed += is_thp;
	stats->nr_failed_pages += nr_pages;
	break;
}
So at a minimum we could at least check for !(ENOMEM,EAGAIN) I suppose?
It's unclear to me based on this code here how long-term pinning would
return. Maybe David knows.
~Gregory
* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
From: Gregory Price @ 2025-12-03 9:02 UTC
To: David Hildenbrand (Red Hat)
Cc: Michal Hocko, Andrew Morton, Aboorva Devarajan, vbabka, surenb,
jackmanb, hannes, ziy, linux-mm, linux-kernel, Oscar Salvador
On Wed, Dec 03, 2025 at 09:51:52AM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/3/25 09:42, Michal Hocko wrote:
> > > if (!ret) {
> > > /*
> > > * TODO: fatal migration failures should bail
> > > * out
> > > */
> > > do_migrate_range(pfn, end_pfn);
> > > }
> > > ...
> > >
> > > Maybe it's time to implement the bail out?
> >
> > That would be great but can we tell transient from permanent migration
> > failures? Maybe long term pins could be treated as permanent failure.
>
> Did we try to offline a ZONE_MOVABLE block or a ZONE_NORMAL block? In the
> case of ZONE_MOVABLE, bailing out is not really the right thing to do.
>
My transient failure (although I'm not sure it was actually transient; I
killed it and retried after a few minutes and it succeeded immediately)
was on a ZONE_MOVABLE block.
Kind of suggested to me there was some bad condition that resolved once I
took a second to release the lock and try again.
Can't speak for Aboorva's situation.
~Gregory
* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
From: David Hildenbrand (Red Hat) @ 2025-12-03 9:08 UTC
To: Gregory Price
Cc: Michal Hocko, Andrew Morton, Aboorva Devarajan, vbabka, surenb,
jackmanb, hannes, ziy, linux-mm, linux-kernel, Oscar Salvador,
Juan Yescas
On 12/3/25 10:02, Gregory Price wrote:
> On Wed, Dec 03, 2025 at 09:51:52AM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/3/25 09:42, Michal Hocko wrote:
>>>> if (!ret) {
>>>> /*
>>>> * TODO: fatal migration failures should bail
>>>> * out
>>>> */
>>>> do_migrate_range(pfn, end_pfn);
>>>> }
>>>> ...
>>>>
>>>> Maybe it's time to implement the bail out?
>>>
>>> That would be great but can we tell transient from permanent migration
>>> failures? Maybe long term pins could be treated as permanent failure.
>>
>> Did we try to offline a ZONE_MOVABLE block or a ZONE_NORMAL block? In the
>> case of ZONE_MOVABLE, bailing out is not really the right thing to do.
>>
>
> My transient failure (although I'm not sure it was actually transient; I
> killed it and retried after a few minutes and it succeeded immediately)
> was on a ZONE_MOVABLE block.
Okay, so that one should not bail out. Longterm pinnings must never end
up on such memory, and if it happens, we have to identify why and fix it.
We have this known problem of "stream of short-term pinnings" that can
temporarily turn memory effectively unmovable. Juan will talk about that
at LPC [1].
We have another set of problematic cases (vmsplice(), fuse) but I would
assume that these are not the cases you are hitting.
So not sure what exact problem you were hitting.
[1] https://lpc.events/event/19/contributions/2144/
>
> Kind of suggested to me there was some bad condition that resolved once I
> took a second to release the lock and try again.
Hard to tell, I'm afraid. Do you still have the dump_folio() calls we
print when migration fails?
--
Cheers
David
* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
From: David Hildenbrand (Red Hat) @ 2025-12-03 9:15 UTC
To: Gregory Price, Michal Hocko
Cc: Andrew Morton, Aboorva Devarajan, vbabka, surenb, jackmanb,
hannes, ziy, linux-mm, linux-kernel, Oscar Salvador
On 12/3/25 09:59, Gregory Price wrote:
> On Wed, Dec 03, 2025 at 09:42:59AM +0100, Michal Hocko wrote:
>> On Wed 03-12-25 03:35:51, Gregory Price wrote:
>>> if (!ret) {
>>> /*
>>> * TODO: fatal migration failures should bail
>>> * out
>>> */
>>> do_migrate_range(pfn, end_pfn);
>>> }
>>>
>>> Maybe it's time to implement the bail out?
>>
>> That would be great but can we tell transient from permanent migration
>> failures? Maybe long term pins could be treated as permanent failure.
>>
>
> I see deep in migration code `migrate_pages_batch()` we would return
> "Some other failure" as fatal:
>
> switch(rc) {
> case -ENOMEM:
> ...
> /* Note: some long-term pin handling is done here */
> break;
> case -EAGAIN:
> ...
> break;
> case 0:
> ...
> list_move_tail(&folio->lru, &unmap_folios);
> list_add_tail(&dst->lru, &dst_folios);
> break;
> default:
> /*
> * Permanent failure (-EBUSY, etc.):
> * unlike -EAGAIN case, the failed folio is
> * removed from migration folio list and not
> * retried in the next outer loop.
> */
> nr_failed++;
> stats->nr_thp_failed += is_thp;
> stats->nr_failed_pages += nr_pages;
> break;
> }
>
> So at a minimum we could at least check for !(ENOMEM,EAGAIN) I suppose?
>
> It's unclear to me based on this code here how long-term pinning would
> return. Maybe David knows.
I would assume that additional references will always result in -EAGAIN.
Remember that we cannot distinguish short-term pins from long-term pins.
We should never have longterm-pins on ZONE_MOVABLE, unless something
broke that contract and needs to be fixed.
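FWIW, the contract is enforced on the pinning side: before taking a
long-term pin, GUP migrates anything that is not long-term pinnable out of
ZONE_MOVABLE/CMA. Heavily simplified (the real check is
folio_is_longterm_pinnable() in include/linux/mm.h, which also covers CMA,
the zero folio and device-coherent memory):

	/* Simplified sketch, not the real helper: */
	static inline bool longterm_pinnable(struct folio *folio)
	{
		return folio_zonenum(folio) != ZONE_MOVABLE;
	}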
--
Cheers
David
* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
From: Gregory Price @ 2025-12-03 9:23 UTC
To: David Hildenbrand (Red Hat)
Cc: Michal Hocko, Andrew Morton, Aboorva Devarajan, vbabka, surenb,
jackmanb, hannes, ziy, linux-mm, linux-kernel, Oscar Salvador,
Juan Yescas
On Wed, Dec 03, 2025 at 10:08:55AM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/3/25 10:02, Gregory Price wrote:
> >
> > My transient failure (although I'm not sure it was actually transient; I
> > killed it and retried after a few minutes and it succeeded immediately)
> > was on a ZONE_MOVABLE block.
>
> Okay, so that one should not bail out. Longterm pinnings must never end up on
> such memory, and if it happens, we have to identify why and fix it.
>
> We have this known problem of "stream of short-term pinnings" that can
> temporarily turn memory effectively unmovable. Juan will talk about that at
> LPC [1].
Nice, fun, good topic. Looking forward to Japan n_n
>
> We have another set of problematic cases (vmsplice(), fuse) but I would
> assume that these are not the cases you are hitting.
>
We do use fuse, but this system was relatively quiet when I tried this.
We do have some proactive reclaim / demotion going on, but I don't think
it was that (see below).
> >
> > Kind of suggested to me there was some bad condition the resolved once I
> > took a second to release the lock and try again.
>
> Hard to tell I'm afraid. Do you still have the dump_folio() calls we print
> when migration fails?
>
What luck, I do! :D
And I just noticed it's the same page over and over
[ 3404.119270] migrating pfn c06f176 failed ret:1
[ 3404.129152] page: refcount:4 mapcount:0 mapping:0000000061ca20ba index:0xad28e5b pfn:0xc06f176
[ 3404.148284] memcg:ffff88842e855000
[ 3404.155834] aops:btree_aops ino:1
[ 3404.163193] flags: 0x17ffff066c00420c(referenced|uptodate|workingset|private|node=1|zone=3|lastcpupid=0x1ffff)
[ 3404.185408] raw: 17ffff066c00420c ffffc90066a13ca0 ffffc90066a13ca0 ffff88812b8502f8
[ 3404.202603] raw: 000000000ad28e5b ffff888859fd42d0 00000004ffffffff ffff88842e855000
[ 3404.219779] page dumped because: migration failure
[ 3404.230610] migrating pfn c06f176 failed ret:1
[ 3404.240483] page: refcount:4 mapcount:0 mapping:0000000061ca20ba index:0xad28e5b pfn:0xc06f176
[ 3404.259603] memcg:ffff88842e855000
[ 3404.267152] aops:btree_aops ino:1
[ 3404.274511] flags: 0x17ffff066c00420c(referenced|uptodate|workingset|private|node=1|zone=3|lastcpupid=0x1ffff)
[ 3404.296716] raw: 17ffff066c00420c ffffc90066a13ca0 ffffc90066a13ca0 ffff88812b8502f8
[ 3404.313909] raw: 000000000ad28e5b ffff888859fd42d0 00000004ffffffff ffff88842e855000
[ 3404.331102] page dumped because: migration failure
[ 3404.341778] migrating pfn c06f176 failed ret:1
[ 3404.351658] page: refcount:4 mapcount:0 mapping:0000000061ca20ba index:0xad28e5b pfn:0xc06f176
[ 3404.370781] memcg:ffff88842e855000
[ 3404.378331] aops:btree_aops ino:1
[ 3404.385687] flags: 0x17ffff066c00420c(referenced|uptodate|workingset|private|node=1|zone=3|lastcpupid=0x1ffff)
[ 3404.407895] raw: 17ffff066c00420c ffffc90066a13ca0 ffffc90066a13ca0 ffff88812b8502f8
[ 3404.425073] raw: 000000000ad28e5b ffff888859fd42d0 00000004ffffffff ffff88842e855000
[ 3404.442264] page dumped because: migration failure
[ 3404.452928] migrating pfn c06f176 failed ret:1
[ 3404.462809] page: refcount:4 mapcount:0 mapping:0000000061ca20ba index:0xad28e5b pfn:0xc06f176
[ 3404.481948] memcg:ffff88842e855000
[ 3404.489511] aops:btree_aops ino:1
[ 3404.496899] flags: 0x17ffff066c00420c(referenced|uptodate|workingset|private|node=1|zone=3|lastcpupid=0x1ffff)
[ 3404.519128] raw: 17ffff066c00420c ffffc90066a13ca0 ffffc90066a13ca0 ffff88812b8502f8
[ 3404.536332] raw: 000000000ad28e5b ffff888859fd42d0 00000004ffffffff ffff88842e855000
[ 3404.553534] page dumped because: migration failure
[ 3404.564200] migrating pfn c06f176 failed ret:1
[ 3404.574077] page: refcount:4 mapcount:0 mapping:0000000061ca20ba index:0xad28e5b pfn:0xc06f176
[ 3404.593208] memcg:ffff88842e855000
[ 3404.600769] aops:btree_aops ino:1
[ 3404.608138] flags: 0x17ffff066c00420c(referenced|uptodate|workingset|private|node=1|zone=3|lastcpupid=0x1ffff)
[ 3404.630355] raw: 17ffff066c00420c ffffc90066a13ca0 ffffc90066a13ca0 ffff88812b8502f8
[ 3404.647558] raw: 000000000ad28e5b ffff888859fd42d0 00000004ffffffff ffff88842e855000
[ 3404.664761] page dumped because: migration failure
* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
From: Gregory Price @ 2025-12-03 9:26 UTC
To: David Hildenbrand (Red Hat)
Cc: Michal Hocko, Andrew Morton, Aboorva Devarajan, vbabka, surenb,
jackmanb, hannes, ziy, linux-mm, linux-kernel, Oscar Salvador,
Juan Yescas
On Wed, Dec 03, 2025 at 04:23:20AM -0500, Gregory Price wrote:
> What luck, I do! :D
> And i just noticed it's the same page over and over
>
Should have noted: 6.13.2
but it's kind of a Frankenstein's 6.13.2 with stable backports, so I
don't know what's been fixed between 6.13 and latest. If there are
relevant patches I can search for whether we have them.
> [ 3404.119270] migrating pfn c06f176 failed ret:1
> [ 3404.129152] page: refcount:4 mapcount:0 mapping:0000000061ca20ba index:0xad28e5b pfn:0xc06f176
> [ 3404.148284] memcg:ffff88842e855000
> [ 3404.155834] aops:btree_aops ino:1
> [ 3404.163193] flags: 0x17ffff066c00420c(referenced|uptodate|workingset|private|node=1|zone=3|lastcpupid=0x1ffff)
> [ 3404.185408] raw: 17ffff066c00420c ffffc90066a13ca0 ffffc90066a13ca0 ffff88812b8502f8
> [ 3404.202603] raw: 000000000ad28e5b ffff888859fd42d0 00000004ffffffff ffff88842e855000
> [ 3404.219779] page dumped because: migration failure
>
~Gregory
* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
From: Michal Hocko @ 2025-12-03 9:42 UTC
To: David Hildenbrand (Red Hat)
Cc: Gregory Price, Andrew Morton, Aboorva Devarajan, vbabka, surenb,
jackmanb, hannes, ziy, linux-mm, linux-kernel, Oscar Salvador
On Wed 03-12-25 10:15:04, David Hildenbrand (Red Hat) wrote:
> On 12/3/25 09:59, Gregory Price wrote:
> > On Wed, Dec 03, 2025 at 09:42:59AM +0100, Michal Hocko wrote:
> > > On Wed 03-12-25 03:35:51, Gregory Price wrote:
> > > > if (!ret) {
> > > > /*
> > > > * TODO: fatal migration failures should bail
> > > > * out
> > > > */
> > > > do_migrate_range(pfn, end_pfn);
> > > > }
> > > >
> > > > Maybe it's time to implement the bail out?
> > >
> > > That would be great but can we tell transient from permanent migration
> > > failures? Maybe long term pins could be treated as permanent failure.
> > >
> >
> > I see deep in migration code `migrate_pages_batch()` we would return
> > "Some other failure" as fatal:
> >
> > switch(rc) {
> > case -ENOMEM:
> > ...
> > /* Note: some long-term pin handling is done here */
> > break;
> > case -EAGAIN:
> > ...
> > break;
> > case 0:
> > ...
> > list_move_tail(&folio->lru, &unmap_folios);
> > list_add_tail(&dst->lru, &dst_folios);
> > break;
> > default:
> > /*
> > * Permanent failure (-EBUSY, etc.):
> > * unlike -EAGAIN case, the failed folio is
> > * removed from migration folio list and not
> > * retried in the next outer loop.
> > */
> > nr_failed++;
> > stats->nr_thp_failed += is_thp;
> > stats->nr_failed_pages += nr_pages;
> > break;
> > }
> >
> > So at a minimum we could at least check for !(ENOMEM,EAGAIN) I suppose?
> >
> > It's unclear to me based on this code here how long-term pinning would
> > return. Maybe David knows.
>
> I would assume that additional references will always result in -EAGAIN.
> Remember that we cannot distinguish short-term pins from long-term pins.
>
> We should never have longterm-pins on ZONE_MOVABLE, unless something broke
> that contract and needs to be fixed.
Right. But what should the hotplug code do under that condition? Loop
forever, or fail and report the broken contract? I would lean towards the
latter. We have never promised that offlining will never fail for
movable zones. We just guarantee that the operation is resistant against
recoverable failures.
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
From: David Hildenbrand (Red Hat) @ 2025-12-03 11:22 UTC
To: Michal Hocko
Cc: Gregory Price, Andrew Morton, Aboorva Devarajan, vbabka, surenb,
jackmanb, hannes, ziy, linux-mm, linux-kernel, Oscar Salvador
On 12/3/25 10:42, Michal Hocko wrote:
> On Wed 03-12-25 10:15:04, David Hildenbrand (Red Hat) wrote:
>> On 12/3/25 09:59, Gregory Price wrote:
>>> On Wed, Dec 03, 2025 at 09:42:59AM +0100, Michal Hocko wrote:
>>>> On Wed 03-12-25 03:35:51, Gregory Price wrote:
>>>>> if (!ret) {
>>>>> /*
>>>>> * TODO: fatal migration failures should bail
>>>>> * out
>>>>> */
>>>>> do_migrate_range(pfn, end_pfn);
>>>>> }
>>>>>
>>>>> Maybe it's time to implement the bail out?
>>>>
>>>> That would be great but can we tell transient from permanent migration
>>>> failures? Maybe long term pins could be treated as permanent failure.
>>>>
>>>
>>> I see deep in migration code `migrate_pages_batch()` we would return
>>> "Some other failure" as fatal:
>>>
>>> switch(rc) {
>>> case -ENOMEM:
>>> ...
>>> /* Note: some long-term pin handling is done here */
>>> break;
>>> case -EAGAIN:
>>> ...
>>> break;
>>> case 0:
>>> ...
>>> list_move_tail(&folio->lru, &unmap_folios);
>>> list_add_tail(&dst->lru, &dst_folios);
>>> break;
>>> default:
>>> /*
>>> * Permanent failure (-EBUSY, etc.):
>>> * unlike -EAGAIN case, the failed folio is
>>> * removed from migration folio list and not
>>> * retried in the next outer loop.
>>> */
>>> nr_failed++;
>>> stats->nr_thp_failed += is_thp;
>>> stats->nr_failed_pages += nr_pages;
>>> break;
>>> }
>>>
>>> So at a minimum we could at least check for !(ENOMEM,EAGAIN) I suppose?
>>>
>>> It's unclear to me based on this code here how long-term pinning would
>>> return. Maybe David knows.
>>
>> I would assume that additional references will always result in -EAGAIN.
>> Remember that we cannot distinguish short-term pins from long-term pins.
>>
>> We should never have longterm-pins on ZONE_MOVABLE, unless something broke
>> that contract and needs to be fixed.
>
> Right. But what should the hotplug code do under that condition? Loop
> forever, or fail and report the broken contract? I would lean towards the
> latter.
If you can detect it reliably.
> We have never promised that offlining will never fail for
> movable zones. We just guarantee that the operation is resistant against
> recoverable failures.
Right, but we don't want it to fail for reasons where retrying a bit longer
would just have worked.
What we document is:
Memory Offlining and ZONE_MOVABLE
---------------------------------
Even with ZONE_MOVABLE, there are some corner cases where offlining a memory
block might fail:
... list of corner cases
Further, when running into out of memory situations while migrating pages, or
when still encountering permanently unmovable pages within ZONE_MOVABLE
(-> BUG), memory offlining will keep retrying until it eventually succeeds.
When offlining is triggered from user space, the offlining context can be
terminated by sending a signal. A timeout based offlining can easily be
implemented via::
% timeout $TIMEOUT offline_block | failure_handling
--
Cheers
David
* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
From: David Hildenbrand (Red Hat) @ 2025-12-03 11:28 UTC
To: Gregory Price
Cc: Michal Hocko, Andrew Morton, Aboorva Devarajan, vbabka, surenb,
jackmanb, hannes, ziy, linux-mm, linux-kernel, Oscar Salvador,
Juan Yescas
On 12/3/25 10:23, Gregory Price wrote:
> On Wed, Dec 03, 2025 at 10:08:55AM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/3/25 10:02, Gregory Price wrote:
>>>
>>> My transient failure (although I'm not sure it was actually transient; I
>>> killed it and retried after a few minutes and it succeeded immediately)
>>> was on a ZONE_MOVABLE block.
>>
>> Okay, so that one should not bail out. Longterm pinnings must never end up on
>> such memory, and if it happens, we have to identify why and fix it.
>>
>> We have this known problem of "stream of short-term pinnings" that can
>> temporarily turn memory effectively unmovable. Juan will talk about that at
>> LPC [1].
>
> Nice, fun, good topic. Looking forward to Japan n_n
>
>>
>> We have another set of problematic cases (vmsplice(), fuse) but I would
>> assume that these are not the cases you are hitting.
>>
>
> We do use fuse, but this system was relatively quiet when I tried this.
>
> We do have some proactive reclaim / demotion going on, but I don't think
> it was that (see below).
>
>>>
>>> Kind of suggested to me there was some bad condition that resolved once I
>>> took a second to release the lock and try again.
>>
>> Hard to tell, I'm afraid. Do you still have the dump_folio() calls we print
>> when migration fails?
>>
>
> What luck, I do! :D
:)
> And I just noticed it's the same page over and over
>
> [ 3404.119270] migrating pfn c06f176 failed ret:1
> [ 3404.129152] page: refcount:4 mapcount:0 mapping:0000000061ca20ba index:0xad28e5b pfn:0xc06f176
> [ 3404.148284] memcg:ffff88842e855000
> [ 3404.155834] aops:btree_aops ino:1
Small folio. Not GUP-pinned (FOLL_PIN), otherwise our refcount would be
>= 1024.
It could be ordinary GUP (FOLL_GET) e.g., from vmsplice or some older
O_DIRECT user that was not converted to FOLL_PIN yet. But maybe it's
just btrfs / something else that temporarily holds a folio reference.
Given that this is from 6.13 ... hard to tell :)
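(For reference, FOLL_PIN pins on small folios are accounted by adding
GUP_PIN_COUNTING_BIAS to the refcount, which is why a pinned page would
show refcount >= 1024; from include/linux/mm.h:)

	#define GUP_PIN_COUNTING_BIAS (1U << 10)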
> [ 3404.163193] flags: 0x17ffff066c00420c(referenced|uptodate|workingset|private|node=1|zone=3|lastcpupid=0x1ffff)
Neither dirty nor under writeback.
--
Cheers
David