[PATCH] mm/page_alloc: make percpu_pagelist_high

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
@ 2025-12-01  6:00 Aboorva Devarajan
  2025-12-01 17:41 ` Andrew Morton
  2025-12-03  8:21 ` Michal Hocko
  0 siblings, 2 replies; 19+ messages in thread
From: Aboorva Devarajan @ 2025-12-01  6:00 UTC (permalink / raw)
  To: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy
  Cc: linux-mm, linux-kernel, aboorvad

When page isolation loops indefinitely during memory offline, reading
/proc/sys/vm/percpu_pagelist_high_fraction blocks on pcp_batch_high_lock,
causing hung task warnings.

Make procfs reads lock-free since percpu_pagelist_high_fraction is a simple
integer with naturally atomic reads, writers still serialize via the mutex.

This prevents hung task warnings when reading the procfs file during
long-running memory offline operations.

Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
---
 mm/page_alloc.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ed82ee55e66a..7c8d773ed4af 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6611,11 +6611,14 @@ static int percpu_pagelist_high_fraction_sysctl_handler(const struct ctl_table *
 	int old_percpu_pagelist_high_fraction;
 	int ret;
 
+	if (!write)
+		return proc_dointvec_minmax(table, write, buffer, length, ppos);
+
 	mutex_lock(&pcp_batch_high_lock);
 	old_percpu_pagelist_high_fraction = percpu_pagelist_high_fraction;
 
 	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
-	if (!write || ret < 0)
+	if (ret < 0)
 		goto out;
 
 	/* Sanity checking to avoid pcp imbalance */
-- 
2.50.1



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
  2025-12-01  6:00 [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free Aboorva Devarajan
@ 2025-12-01 17:41 ` Andrew Morton
  2025-12-03  8:27   ` Michal Hocko
  2025-12-08 17:30   ` Aboorva Devarajan
  2025-12-03  8:21 ` Michal Hocko
  1 sibling, 2 replies; 19+ messages in thread
From: Andrew Morton @ 2025-12-01 17:41 UTC (permalink / raw)
  To: Aboorva Devarajan
  Cc: vbabka, surenb, mhocko, jackmanb, hannes, ziy, linux-mm, linux-kernel

On Mon,  1 Dec 2025 11:30:09 +0530 Aboorva Devarajan <aboorvad@linux.ibm.com> wrote:

> When page isolation loops indefinitely during memory offline, reading
> /proc/sys/vm/percpu_pagelist_high_fraction blocks on pcp_batch_high_lock,
> causing hung task warnings.

That's pretty bad behavior.

I wonder if there are other problems which can be caused by this
lengthy hold time.

It would be better to address the lengthy hold time rather that having
to work around it in one impacted site.

> Make procfs reads lock-free since percpu_pagelist_high_fraction is a simple
> integer with naturally atomic reads, writers still serialize via the mutex.
> 
> This prevents hung task warnings when reading the procfs file during
> long-running memory offline operations.
> 
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6611,11 +6611,14 @@ static int percpu_pagelist_high_fraction_sysctl_handler(const struct ctl_table *
>  	int old_percpu_pagelist_high_fraction;
>  	int ret;
>  
> +	if (!write)
> +		return proc_dointvec_minmax(table, write, buffer, length, ppos);
> +
>  	mutex_lock(&pcp_batch_high_lock);
>  	old_percpu_pagelist_high_fraction = percpu_pagelist_high_fraction;
>  
>  	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
> -	if (!write || ret < 0)
> +	if (ret < 0)
>  		goto out;
>  
>  	/* Sanity checking to avoid pcp imbalance */

That being said, I'll grab the patch and shall put a cc:stable on it,
see what people think about this hold-time issue.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
  2025-12-01  6:00 [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free Aboorva Devarajan
  2025-12-01 17:41 ` Andrew Morton
@ 2025-12-03  8:21 ` Michal Hocko
  1 sibling, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2025-12-03  8:21 UTC (permalink / raw)
  To: Aboorva Devarajan
  Cc: akpm, vbabka, surenb, jackmanb, hannes, ziy, linux-mm, linux-kernel

On Mon 01-12-25 11:30:09, Aboorva Devarajan wrote:
> When page isolation loops indefinitely during memory offline, reading
> /proc/sys/vm/percpu_pagelist_high_fraction blocks on pcp_batch_high_lock,
> causing hung task warnings.
> 
> Make procfs reads lock-free since percpu_pagelist_high_fraction is a simple
> integer with naturally atomic reads, writers still serialize via the mutex.
> 
> This prevents hung task warnings when reading the procfs file during
> long-running memory offline operations.
> 
> Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com>

Looks OK. I would just add a short comment explaining that in the code.
See below.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ed82ee55e66a..7c8d773ed4af 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6611,11 +6611,14 @@ static int percpu_pagelist_high_fraction_sysctl_handler(const struct ctl_table *
>  	int old_percpu_pagelist_high_fraction;
>  	int ret;
>  
	/*
	 * Avoid using pcp_batch_high_lock for reads as the value is
	 * read atomicaly and race with offlining is harmless.
	 */
> +	if (!write)
> +		return proc_dointvec_minmax(table, write, buffer, length, ppos);
> +
>  	mutex_lock(&pcp_batch_high_lock);
>  	old_percpu_pagelist_high_fraction = percpu_pagelist_high_fraction;
>  
>  	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
> -	if (!write || ret < 0)
> +	if (ret < 0)
>  		goto out;
>  
>  	/* Sanity checking to avoid pcp imbalance */
> -- 
> 2.50.1

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
  2025-12-01 17:41 ` Andrew Morton
@ 2025-12-03  8:27   ` Michal Hocko
  2025-12-03  8:35     ` Gregory Price
  2025-12-08 17:30   ` Aboorva Devarajan
  1 sibling, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2025-12-03  8:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aboorva Devarajan, vbabka, surenb, jackmanb, hannes, ziy,
	linux-mm, linux-kernel, Oscar Salvador, David Hildenbrand

Let me add Oscar and David.

On Mon 01-12-25 09:41:12, Andrew Morton wrote:
> On Mon,  1 Dec 2025 11:30:09 +0530 Aboorva Devarajan <aboorvad@linux.ibm.com> wrote:
> 
> > When page isolation loops indefinitely during memory offline, reading
> > /proc/sys/vm/percpu_pagelist_high_fraction blocks on pcp_batch_high_lock,
> > causing hung task warnings.
> 
> That's pretty bad behavior.
> 
> I wonder if there are other problems which can be caused by this
> lengthy hold time.

pcp_batch_high_lock is not taken in any performance critical path. It is
true that memory offlining can take long when memory is not free but I
am not sure we can do much better. I guess we could check contention on
the lock and drop it to make cpu hotplug events and
sysctl_min_unmapped_ratio_sysctl_handler smoother. The question is
whether this is a practical problem hit in real life.

> It would be better to address the lengthy hold time rather that having
> to work around it in one impacted site.
> 
> > Make procfs reads lock-free since percpu_pagelist_high_fraction is a simple
> > integer with naturally atomic reads, writers still serialize via the mutex.
> > 
> > This prevents hung task warnings when reading the procfs file during
> > long-running memory offline operations.
> > 
> > ...
> >
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6611,11 +6611,14 @@ static int percpu_pagelist_high_fraction_sysctl_handler(const struct ctl_table *
> >  	int old_percpu_pagelist_high_fraction;
> >  	int ret;
> >  
> > +	if (!write)
> > +		return proc_dointvec_minmax(table, write, buffer, length, ppos);
> > +
> >  	mutex_lock(&pcp_batch_high_lock);
> >  	old_percpu_pagelist_high_fraction = percpu_pagelist_high_fraction;
> >  
> >  	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
> > -	if (!write || ret < 0)
> > +	if (ret < 0)
> >  		goto out;
> >  
> >  	/* Sanity checking to avoid pcp imbalance */
> 
> That being said, I'll grab the patch and shall put a cc:stable on it,
> see what people think about this hold-time issue.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
  2025-12-03  8:27   ` Michal Hocko
@ 2025-12-03  8:35     ` Gregory Price
  2025-12-03  8:42       ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Gregory Price @ 2025-12-03  8:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Aboorva Devarajan, vbabka, surenb, jackmanb,
	hannes, ziy, linux-mm, linux-kernel, Oscar Salvador,
	David Hildenbrand

On Wed, Dec 03, 2025 at 09:27:26AM +0100, Michal Hocko wrote:
> Let me add Oscar and David.
> 
> On Mon 01-12-25 09:41:12, Andrew Morton wrote:
> > On Mon,  1 Dec 2025 11:30:09 +0530 Aboorva Devarajan <aboorvad@linux.ibm.com> wrote:
> > 
> > > When page isolation loops indefinitely during memory offline, reading
> > > /proc/sys/vm/percpu_pagelist_high_fraction blocks on pcp_batch_high_lock,
> > > causing hung task warnings.
> > 
> > That's pretty bad behavior.
> > 
> > I wonder if there are other problems which can be caused by this
> > lengthy hold time.
> 
> pcp_batch_high_lock is not taken in any performance critical path. It is
> true that memory offlining can take long when memory is not free but I
> am not sure we can do much better. I guess we could check contention on
> the lock and drop it to make cpu hotplug events and
> sysctl_min_unmapped_ratio_sysctl_handler smoother. The question is
> whether this is a practical problem hit in real life.
> 

I just today hit a scenario where offlining was blocked on migration
failures that took an exceedingly long time to offline (many minutes)
even on a relatively small block (256MB).

Now that I'm looking at the double-do-while loop in memory_hotplug.c

zone_pcp_disable(zone);  /* (pcp_batch_high_lock) */
...
do {
	do {
...
		cond_resched();
		ret = scan_movable_pages(pfn, end_pfn, &pfn);
		if (!ret) {
			/*
			 * TODO: fatal migration failures should bail
			 * out
			 */
			do_migrate_range(pfn, end_pfn);
		}
	} while (!ret);
} while (ret);
...
zone_pcp_enable(zone);  /* (pcp_batch_high_lock) */


Maybe it's time to implement the bail out?

~Gregory


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
  2025-12-03  8:35     ` Gregory Price
@ 2025-12-03  8:42       ` Michal Hocko
  2025-12-03  8:51         ` David Hildenbrand (Red Hat)
  2025-12-03  8:59         ` Gregory Price
  0 siblings, 2 replies; 19+ messages in thread
From: Michal Hocko @ 2025-12-03  8:42 UTC (permalink / raw)
  To: Gregory Price
  Cc: Andrew Morton, Aboorva Devarajan, vbabka, surenb, jackmanb,
	hannes, ziy, linux-mm, linux-kernel, Oscar Salvador,
	David Hildenbrand

On Wed 03-12-25 03:35:51, Gregory Price wrote:
> On Wed, Dec 03, 2025 at 09:27:26AM +0100, Michal Hocko wrote:
> > Let me add Oscar and David.
> > 
> > On Mon 01-12-25 09:41:12, Andrew Morton wrote:
> > > On Mon,  1 Dec 2025 11:30:09 +0530 Aboorva Devarajan <aboorvad@linux.ibm.com> wrote:
> > > 
> > > > When page isolation loops indefinitely during memory offline, reading
> > > > /proc/sys/vm/percpu_pagelist_high_fraction blocks on pcp_batch_high_lock,
> > > > causing hung task warnings.
> > > 
> > > That's pretty bad behavior.
> > > 
> > > I wonder if there are other problems which can be caused by this
> > > lengthy hold time.
> > 
> > pcp_batch_high_lock is not taken in any performance critical path. It is
> > true that memory offlining can take long when memory is not free but I
> > am not sure we can do much better. I guess we could check contention on
> > the lock and drop it to make cpu hotplug events and
> > sysctl_min_unmapped_ratio_sysctl_handler smoother. The question is
> > whether this is a practical problem hit in real life.
> > 
> 
> I just today hit a scenario where offlining was blocked on migration
> failures that took an exceedingly long time to offline (many minutes)
> even on a relatively small block (256MB).
> 
> Now that I'm looking at the double-do-while loop in memory_hotplug.c
> 
> zone_pcp_disable(zone);  /* (pcp_batch_high_lock) */
> ...
> do {
> 	do {
> ...
> 		cond_resched();
> 		ret = scan_movable_pages(pfn, end_pfn, &pfn);
> 		if (!ret) {
> 			/*
> 			 * TODO: fatal migration failures should bail
> 			 * out
> 			 */
> 			do_migrate_range(pfn, end_pfn);
> 		}
> 	} while (!ret);
> } while (ret);
> ...
> zone_pcp_enable(zone);  /* (pcp_batch_high_lock) */
> 
> 
> Maybe it's time to implement the bail out?

That would be great but can we tell transient from permanent migration
failures? Maybe long term pins could be treated as permanent failure.

> 
> ~Gregory

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
  2025-12-03  8:42       ` Michal Hocko
@ 2025-12-03  8:51         ` David Hildenbrand (Red Hat)
  2025-12-03  9:02           ` Gregory Price
  2025-12-03  8:59         ` Gregory Price
  1 sibling, 1 reply; 19+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-03  8:51 UTC (permalink / raw)
  To: Michal Hocko, Gregory Price
  Cc: Andrew Morton, Aboorva Devarajan, vbabka, surenb, jackmanb,
	hannes, ziy, linux-mm, linux-kernel, Oscar Salvador

On 12/3/25 09:42, Michal Hocko wrote:
> On Wed 03-12-25 03:35:51, Gregory Price wrote:
>> On Wed, Dec 03, 2025 at 09:27:26AM +0100, Michal Hocko wrote:
>>> Let me add Oscar and David.
>>>
>>> On Mon 01-12-25 09:41:12, Andrew Morton wrote:
>>>> On Mon,  1 Dec 2025 11:30:09 +0530 Aboorva Devarajan <aboorvad@linux.ibm.com> wrote:
>>>>
>>>>> When page isolation loops indefinitely during memory offline, reading
>>>>> /proc/sys/vm/percpu_pagelist_high_fraction blocks on pcp_batch_high_lock,
>>>>> causing hung task warnings.
>>>>
>>>> That's pretty bad behavior.
>>>>
>>>> I wonder if there are other problems which can be caused by this
>>>> lengthy hold time.
>>>
>>> pcp_batch_high_lock is not taken in any performance critical path. It is
>>> true that memory offlining can take long when memory is not free but I
>>> am not sure we can do much better. I guess we could check contention on
>>> the lock and drop it to make cpu hotplug events and
>>> sysctl_min_unmapped_ratio_sysctl_handler smoother. The question is
>>> whether this is a practical problem hit in real life.
>>>
>>
>> I just today hit a scenario where offlining was blocked on migration
>> failures that took an exceedingly long time to offline (many minutes)
>> even on a relatively small block (256MB).
>>
>> Now that I'm looking at the double-do-while loop in memory_hotplug.c
>>
>> zone_pcp_disable(zone);  /* (pcp_batch_high_lock) */
>> ...
>> do {
>> 	do {
>> ...
>> 		cond_resched();
>> 		ret = scan_movable_pages(pfn, end_pfn, &pfn);
>> 		if (!ret) {
>> 			/*
>> 			 * TODO: fatal migration failures should bail
>> 			 * out
>> 			 */
>> 			do_migrate_range(pfn, end_pfn);
>> 		}
>> 	} while (!ret);
>> } while (ret);
>> ...
>> zone_pcp_enable(zone);  /* (pcp_batch_high_lock) */
>>
>>
>> Maybe it's time to implement the bail out?
> 
> That would be great but can we tell transient from permanent migration
> failures? Maybe long term pins could be treated as permanent failure.

Did we try offline a ZONE_MOVABLE block or a ZONE_NORMAL block? In case 
of ZONE_MOABLE, bailing out is not really the right thing to do.

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
  2025-12-03  8:42       ` Michal Hocko
  2025-12-03  8:51         ` David Hildenbrand (Red Hat)
@ 2025-12-03  8:59         ` Gregory Price
  2025-12-03  9:15           ` David Hildenbrand (Red Hat)
  1 sibling, 1 reply; 19+ messages in thread
From: Gregory Price @ 2025-12-03  8:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Aboorva Devarajan, vbabka, surenb, jackmanb,
	hannes, ziy, linux-mm, linux-kernel, Oscar Salvador,
	David Hildenbrand

On Wed, Dec 03, 2025 at 09:42:59AM +0100, Michal Hocko wrote:
> On Wed 03-12-25 03:35:51, Gregory Price wrote:
> > 		if (!ret) {
> > 			/*
> > 			 * TODO: fatal migration failures should bail
> > 			 * out
> > 			 */
> > 			do_migrate_range(pfn, end_pfn);
> > 		}
> > 
> > Maybe it's time to implement the bail out?
> 
> That would be great but can we tell transient from permanent migration
> failures? Maybe long term pins could be treated as permanent failure.
> 

I see deep in migration code `migrate_pages_batch()` we would return
"Some other failure" as fatal:

	switch(rc) {
	case -ENOMEM:
		...
		/* Note: some long-term pin handing is done here */
		break;
	case -EAGAIN:
		...
		break;
	case 0:
		...
		list_move_tail(&folio->lru, &unmap_folios);
		list_add_tail(&dst->lru, &dst_folios);
		break;
	default:
		/*
		 * Permanent failure (-EBUSY, etc.):
		 * unlike -EAGAIN case, the failed folio is
		 * removed from migration folio list and not
		 * retried in the next outer loop.
		 */
		nr_failed++;
		stats->nr_thp_failed += is_thp;
		stats->nr_failed_pages += nr_pages;
		break;
	}

So at a minimum we could at least check for !(ENOMEM,EAGAIN) I suppose?

It's unclear to me based on this code here how longerm pinning would
return.  Maybe David knows.

~Gregory


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
  2025-12-03  8:51         ` David Hildenbrand (Red Hat)
@ 2025-12-03  9:02           ` Gregory Price
  2025-12-03  9:08             ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 19+ messages in thread
From: Gregory Price @ 2025-12-03  9:02 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Michal Hocko, Andrew Morton, Aboorva Devarajan, vbabka, surenb,
	jackmanb, hannes, ziy, linux-mm, linux-kernel, Oscar Salvador

On Wed, Dec 03, 2025 at 09:51:52AM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/3/25 09:42, Michal Hocko wrote:
> > > 		if (!ret) {
> > > 			/*
> > > 			 * TODO: fatal migration failures should bail
> > > 			 * out
> > > 			 */
> > > 			do_migrate_range(pfn, end_pfn);
> > > 		}
> > > ...
> > > 
> > > Maybe it's time to implement the bail out?
> > 
> > That would be great but can we tell transient from permanent migration
> > failures? Maybe long term pins could be treated as permanent failure.
> 
> Did we try offline a ZONE_MOVABLE block or a ZONE_NORMAL block? In case of
> ZONE_MOABLE, bailing out is not really the right thing to do.
> 

My transient failure (although i'm not sure it was actually transient, i
killed it and retried after a few minutes and it succeeded immediately)
was on a ZONE_MOVABLE block.

Kind of suggested to me there was some bad condition the resolved once I
took a second to release the lock and try again.

Can't speak for Aboorva's situation.

~Gregory


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
  2025-12-03  9:02           ` Gregory Price
@ 2025-12-03  9:08             ` David Hildenbrand (Red Hat)
  2025-12-03  9:23               ` Gregory Price
  0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-03  9:08 UTC (permalink / raw)
  To: Gregory Price
  Cc: Michal Hocko, Andrew Morton, Aboorva Devarajan, vbabka, surenb,
	jackmanb, hannes, ziy, linux-mm, linux-kernel, Oscar Salvador,
	Juan Yescas

On 12/3/25 10:02, Gregory Price wrote:
> On Wed, Dec 03, 2025 at 09:51:52AM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/3/25 09:42, Michal Hocko wrote:
>>>> 		if (!ret) {
>>>> 			/*
>>>> 			 * TODO: fatal migration failures should bail
>>>> 			 * out
>>>> 			 */
>>>> 			do_migrate_range(pfn, end_pfn);
>>>> 		}
>>>> ...
>>>>
>>>> Maybe it's time to implement the bail out?
>>>
>>> That would be great but can we tell transient from permanent migration
>>> failures? Maybe long term pins could be treated as permanent failure.
>>
>> Did we try offline a ZONE_MOVABLE block or a ZONE_NORMAL block? In case of
>> ZONE_MOABLE, bailing out is not really the right thing to do.
>>
> 
> My transient failure (although i'm not sure it was actually transient, i
> killed it and retried after a few minutes and it succeeded immediately)
> was on a ZONE_MOVABLE block.

Okay, so that one should not bail out. Longterm pinnins must never end 
up on such memory, and if it happens, we have to identify why and fix it.

We have this known problem of "stream of short-term pinnings" that can 
temporarily turn memory effectively unmovable. Juan will talk about that 
at LPC [1].

We have another set of problematic cases (vmsplice(), fuse) but I would 
assume that these are not the cases you are hitting.

So not sure what exact problem you were hitting.

[1] https://lpc.events/event/19/contributions/2144/

> 
> Kind of suggested to me there was some bad condition the resolved once I
> took a second to release the lock and try again.

Hard to tell I'm afraid. Do you still have the dump_folio() calls we 
print when migration fails?

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
  2025-12-03  8:59         ` Gregory Price
@ 2025-12-03  9:15           ` David Hildenbrand (Red Hat)
  2025-12-03  9:42             ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-03  9:15 UTC (permalink / raw)
  To: Gregory Price, Michal Hocko
  Cc: Andrew Morton, Aboorva Devarajan, vbabka, surenb, jackmanb,
	hannes, ziy, linux-mm, linux-kernel, Oscar Salvador

On 12/3/25 09:59, Gregory Price wrote:
> On Wed, Dec 03, 2025 at 09:42:59AM +0100, Michal Hocko wrote:
>> On Wed 03-12-25 03:35:51, Gregory Price wrote:
>>> 		if (!ret) {
>>> 			/*
>>> 			 * TODO: fatal migration failures should bail
>>> 			 * out
>>> 			 */
>>> 			do_migrate_range(pfn, end_pfn);
>>> 		}
>>>
>>> Maybe it's time to implement the bail out?
>>
>> That would be great but can we tell transient from permanent migration
>> failures? Maybe long term pins could be treated as permanent failure.
>>
> 
> I see deep in migration code `migrate_pages_batch()` we would return
> "Some other failure" as fatal:
> 
> 	switch(rc) {
> 	case -ENOMEM:
> 		...
> 		/* Note: some long-term pin handing is done here */
> 		break;
> 	case -EAGAIN:
> 		...
> 		break;
> 	case 0:
> 		...
> 		list_move_tail(&folio->lru, &unmap_folios);
> 		list_add_tail(&dst->lru, &dst_folios);
> 		break;
> 	default:
> 		/*
> 		 * Permanent failure (-EBUSY, etc.):
> 		 * unlike -EAGAIN case, the failed folio is
> 		 * removed from migration folio list and not
> 		 * retried in the next outer loop.
> 		 */
> 		nr_failed++;
> 		stats->nr_thp_failed += is_thp;
> 		stats->nr_failed_pages += nr_pages;
> 		break;
> 	}
> 
> So at a minimum we could at least check for !(ENOMEM,EAGAIN) I suppose?
> 
> It's unclear to me based on this code here how longerm pinning would
> return.  Maybe David knows.

I would assume that additional references will always result in -EAGAIN. 
Remember that we cannot distinguish short-term pins from long-term pins.

We should never have longterm-pins on ZONE_MOVABLE, unless something 
broke that contract and needs to be fixed.

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
  2025-12-03  9:08             ` David Hildenbrand (Red Hat)
@ 2025-12-03  9:23               ` Gregory Price
  2025-12-03  9:26                 ` Gregory Price
  2025-12-03 11:28                 ` David Hildenbrand (Red Hat)
  0 siblings, 2 replies; 19+ messages in thread
From: Gregory Price @ 2025-12-03  9:23 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Michal Hocko, Andrew Morton, Aboorva Devarajan, vbabka, surenb,
	jackmanb, hannes, ziy, linux-mm, linux-kernel, Oscar Salvador,
	Juan Yescas

On Wed, Dec 03, 2025 at 10:08:55AM +0100, David Hildenbrand (Red Hat) wrote:
> On 12/3/25 10:02, Gregory Price wrote:
> > 
> > My transient failure (although i'm not sure it was actually transient, i
> > killed it and retried after a few minutes and it succeeded immediately)
> > was on a ZONE_MOVABLE block.
> 
> Okay, so that one should not bail out. Longterm pinnins must never end up on
> such memory, and if it happens, we have to identify why and fix it.
> 
> We have this known problem of "stream of short-term pinnings" that can
> temporarily turn memory effectively unmovable. Juan will talk about that at
> LPC [1].

Nice, fun, good topic. Looking forward to Japan n_n

> 
> We have another set of problematic cases (vmsplice(), fuse) but I would
> assume that these are not the cases you are hitting.
> 

We do use fuse, but this system was relatively quiet when i tried this.

We do have some proactive reclaim / demotion going on, but i don't think
it was that (see below).

> > 
> > Kind of suggested to me there was some bad condition the resolved once I
> > took a second to release the lock and try again.
> 
> Hard to tell I'm afraid. Do you still have the dump_folio() calls we print
> when migration fails?
> 

What luck, I do! :D
And i just noticed it's the same page over and over

[ 3404.119270] migrating pfn c06f176 failed ret:1
[ 3404.129152] page: refcount:4 mapcount:0 mapping:0000000061ca20ba index:0xad28e5b pfn:0xc06f176
[ 3404.148284] memcg:ffff88842e855000
[ 3404.155834] aops:btree_aops ino:1
[ 3404.163193] flags: 0x17ffff066c00420c(referenced|uptodate|workingset|private|node=1|zone=3|lastcpupid=0x1ffff)
[ 3404.185408] raw: 17ffff066c00420c ffffc90066a13ca0 ffffc90066a13ca0 ffff88812b8502f8
[ 3404.202603] raw: 000000000ad28e5b ffff888859fd42d0 00000004ffffffff ffff88842e855000
[ 3404.219779] page dumped because: migration failure

[ 3404.230610] migrating pfn c06f176 failed ret:1
[ 3404.240483] page: refcount:4 mapcount:0 mapping:0000000061ca20ba index:0xad28e5b pfn:0xc06f176
[ 3404.259603] memcg:ffff88842e855000
[ 3404.267152] aops:btree_aops ino:1
[ 3404.274511] flags: 0x17ffff066c00420c(referenced|uptodate|workingset|private|node=1|zone=3|lastcpupid=0x1ffff)
[ 3404.296716] raw: 17ffff066c00420c ffffc90066a13ca0 ffffc90066a13ca0 ffff88812b8502f8
[ 3404.313909] raw: 000000000ad28e5b ffff888859fd42d0 00000004ffffffff ffff88842e855000
[ 3404.331102] page dumped because: migration failure

[ 3404.341778] migrating pfn c06f176 failed ret:1
[ 3404.351658] page: refcount:4 mapcount:0 mapping:0000000061ca20ba index:0xad28e5b pfn:0xc06f176
[ 3404.370781] memcg:ffff88842e855000
[ 3404.378331] aops:btree_aops ino:1
[ 3404.385687] flags: 0x17ffff066c00420c(referenced|uptodate|workingset|private|node=1|zone=3|lastcpupid=0x1ffff)
[ 3404.407895] raw: 17ffff066c00420c ffffc90066a13ca0 ffffc90066a13ca0 ffff88812b8502f8
[ 3404.425073] raw: 000000000ad28e5b ffff888859fd42d0 00000004ffffffff ffff88842e855000
[ 3404.442264] page dumped because: migration failure

[ 3404.452928] migrating pfn c06f176 failed ret:1
[ 3404.462809] page: refcount:4 mapcount:0 mapping:0000000061ca20ba index:0xad28e5b pfn:0xc06f176
[ 3404.481948] memcg:ffff88842e855000
[ 3404.489511] aops:btree_aops ino:1
[ 3404.496899] flags: 0x17ffff066c00420c(referenced|uptodate|workingset|private|node=1|zone=3|lastcpupid=0x1ffff)
[ 3404.519128] raw: 17ffff066c00420c ffffc90066a13ca0 ffffc90066a13ca0 ffff88812b8502f8
[ 3404.536332] raw: 000000000ad28e5b ffff888859fd42d0 00000004ffffffff ffff88842e855000
[ 3404.553534] page dumped because: migration failure

[ 3404.564200] migrating pfn c06f176 failed ret:1
[ 3404.574077] page: refcount:4 mapcount:0 mapping:0000000061ca20ba index:0xad28e5b pfn:0xc06f176
[ 3404.593208] memcg:ffff88842e855000
[ 3404.600769] aops:btree_aops ino:1
[ 3404.608138] flags: 0x17ffff066c00420c(referenced|uptodate|workingset|private|node=1|zone=3|lastcpupid=0x1ffff)
[ 3404.630355] raw: 17ffff066c00420c ffffc90066a13ca0 ffffc90066a13ca0 ffff88812b8502f8
[ 3404.647558] raw: 000000000ad28e5b ffff888859fd42d0 00000004ffffffff ffff88842e855000
[ 3404.664761] page dumped because: migration failure


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
  2025-12-03  9:23               ` Gregory Price
@ 2025-12-03  9:26                 ` Gregory Price
  2025-12-03 11:28                 ` David Hildenbrand (Red Hat)
  1 sibling, 0 replies; 19+ messages in thread
From: Gregory Price @ 2025-12-03  9:26 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Michal Hocko, Andrew Morton, Aboorva Devarajan, vbabka, surenb,
	jackmanb, hannes, ziy, linux-mm, linux-kernel, Oscar Salvador,
	Juan Yescas

On Wed, Dec 03, 2025 at 04:23:20AM -0500, Gregory Price wrote:
> What luck, I do! :D
> And i just noticed it's the same page over and over
> 

Should have noted: 6.13.2

but it's kind of a frankenstein's 6.13.2 with stable backports, so I
don't know what's been fixed between 6.13 and latest.  If there are
relevant patches I can search for whether we have them.

> [ 3404.119270] migrating pfn c06f176 failed ret:1
> [ 3404.129152] page: refcount:4 mapcount:0 mapping:0000000061ca20ba index:0xad28e5b pfn:0xc06f176
> [ 3404.148284] memcg:ffff88842e855000
> [ 3404.155834] aops:btree_aops ino:1
> [ 3404.163193] flags: 0x17ffff066c00420c(referenced|uptodate|workingset|private|node=1|zone=3|lastcpupid=0x1ffff)
> [ 3404.185408] raw: 17ffff066c00420c ffffc90066a13ca0 ffffc90066a13ca0 ffff88812b8502f8
> [ 3404.202603] raw: 000000000ad28e5b ffff888859fd42d0 00000004ffffffff ffff88842e855000
> [ 3404.219779] page dumped because: migration failure
> 

~Gregory


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
  2025-12-03  9:15           ` David Hildenbrand (Red Hat)
@ 2025-12-03  9:42             ` Michal Hocko
  2025-12-03 11:22               ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2025-12-03  9:42 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Gregory Price, Andrew Morton, Aboorva Devarajan, vbabka, surenb,
	jackmanb, hannes, ziy, linux-mm, linux-kernel, Oscar Salvador

On Wed 03-12-25 10:15:04, David Hildenbrand (Red Hat) wrote:
> On 12/3/25 09:59, Gregory Price wrote:
> > On Wed, Dec 03, 2025 at 09:42:59AM +0100, Michal Hocko wrote:
> > > On Wed 03-12-25 03:35:51, Gregory Price wrote:
> > > > 		if (!ret) {
> > > > 			/*
> > > > 			 * TODO: fatal migration failures should bail
> > > > 			 * out
> > > > 			 */
> > > > 			do_migrate_range(pfn, end_pfn);
> > > > 		}
> > > > 
> > > > Maybe it's time to implement the bail out?
> > > 
> > > That would be great but can we tell transient from permanent migration
> > > failures? Maybe long term pins could be treated as permanent failure.
> > > 
> > 
> > I see deep in migration code `migrate_pages_batch()` we would return
> > "Some other failure" as fatal:
> > 
> > 	switch(rc) {
> > 	case -ENOMEM:
> > 		...
> > 		/* Note: some long-term pin handing is done here */
> > 		break;
> > 	case -EAGAIN:
> > 		...
> > 		break;
> > 	case 0:
> > 		...
> > 		list_move_tail(&folio->lru, &unmap_folios);
> > 		list_add_tail(&dst->lru, &dst_folios);
> > 		break;
> > 	default:
> > 		/*
> > 		 * Permanent failure (-EBUSY, etc.):
> > 		 * unlike -EAGAIN case, the failed folio is
> > 		 * removed from migration folio list and not
> > 		 * retried in the next outer loop.
> > 		 */
> > 		nr_failed++;
> > 		stats->nr_thp_failed += is_thp;
> > 		stats->nr_failed_pages += nr_pages;
> > 		break;
> > 	}
> > 
> > So at a minimum we could at least check for !(ENOMEM,EAGAIN) I suppose?
> > 
> > It's unclear to me based on this code here how longerm pinning would
> > return.  Maybe David knows.
> 
> I would assume that additional references will always result in -EAGAIN.
> Remember that we cannot distinguish short-term pins from long-term pins.
> 
> We should never have longterm-pins on ZONE_MOVABLE, unless something broke
> that contract and needs to be fixed.

Right. But what should the hotplug code do under that condition. Loop
for ever or fail reporting the broken contract? I would lean towards the
latter. We have never promised that offlining will not fail ever for
movable zones. We just guarantee that the operation is resistant against
recovarable failures.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
  2025-12-03  9:42             ` Michal Hocko
@ 2025-12-03 11:22               ` David Hildenbrand (Red Hat)
  0 siblings, 0 replies; 19+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-03 11:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Gregory Price, Andrew Morton, Aboorva Devarajan, vbabka, surenb,
	jackmanb, hannes, ziy, linux-mm, linux-kernel, Oscar Salvador

On 12/3/25 10:42, Michal Hocko wrote:
> On Wed 03-12-25 10:15:04, David Hildenbrand (Red Hat) wrote:
>> On 12/3/25 09:59, Gregory Price wrote:
>>> On Wed, Dec 03, 2025 at 09:42:59AM +0100, Michal Hocko wrote:
>>>> On Wed 03-12-25 03:35:51, Gregory Price wrote:
>>>>> 		if (!ret) {
>>>>> 			/*
>>>>> 			 * TODO: fatal migration failures should bail
>>>>> 			 * out
>>>>> 			 */
>>>>> 			do_migrate_range(pfn, end_pfn);
>>>>> 		}
>>>>>
>>>>> Maybe it's time to implement the bail out?
>>>>
>>>> That would be great but can we tell transient from permanent migration
>>>> failures? Maybe long term pins could be treated as permanent failure.
>>>>
>>>
>>> I see deep in migration code `migrate_pages_batch()` we would return
>>> "Some other failure" as fatal:
>>>
>>> 	switch(rc) {
>>> 	case -ENOMEM:
>>> 		...
>>> 		/* Note: some long-term pin handing is done here */
>>> 		break;
>>> 	case -EAGAIN:
>>> 		...
>>> 		break;
>>> 	case 0:
>>> 		...
>>> 		list_move_tail(&folio->lru, &unmap_folios);
>>> 		list_add_tail(&dst->lru, &dst_folios);
>>> 		break;
>>> 	default:
>>> 		/*
>>> 		 * Permanent failure (-EBUSY, etc.):
>>> 		 * unlike -EAGAIN case, the failed folio is
>>> 		 * removed from migration folio list and not
>>> 		 * retried in the next outer loop.
>>> 		 */
>>> 		nr_failed++;
>>> 		stats->nr_thp_failed += is_thp;
>>> 		stats->nr_failed_pages += nr_pages;
>>> 		break;
>>> 	}
>>>
>>> So at a minimum we could at least check for !(ENOMEM,EAGAIN) I suppose?
>>>
>>> It's unclear to me based on this code here how longerm pinning would
>>> return.  Maybe David knows.
>>
>> I would assume that additional references will always result in -EAGAIN.
>> Remember that we cannot distinguish short-term pins from long-term pins.
>>
>> We should never have longterm-pins on ZONE_MOVABLE, unless something broke
>> that contract and needs to be fixed.
> 
> Right. But what should the hotplug code do under that condition. Loop
> for ever or fail reporting the broken contract? I would lean towards the
> latter. 

If you can detect it reliably.

> We have never promised that offlining will not fail ever for
> movable zones. We just guarantee that the operation is resistant against
> recovarable failures.

Right, but we don't want it to fail for reasons where retrying a bit longer
would just have worked.


What we document is:

Memory Offlining and ZONE_MOVABLE
---------------------------------

Even with ZONE_MOVABLE, there are some corner cases where offlining a memory
block might fail:

... list of corner cases

Further, when running into out of memory situations while migrating pages, or
when still encountering permanently unmovable pages within ZONE_MOVABLE
(-> BUG), memory offlining will keep retrying until it eventually succeeds.

When offlining is triggered from user space, the offlining context can be
terminated by sending a signal. A timeout based offlining can easily be
implemented via::

	% timeout $TIMEOUT offline_block | failure_handling

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
  2025-12-03  9:23               ` Gregory Price
  2025-12-03  9:26                 ` Gregory Price
@ 2025-12-03 11:28                 ` David Hildenbrand (Red Hat)
  1 sibling, 0 replies; 19+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-03 11:28 UTC (permalink / raw)
  To: Gregory Price
  Cc: Michal Hocko, Andrew Morton, Aboorva Devarajan, vbabka, surenb,
	jackmanb, hannes, ziy, linux-mm, linux-kernel, Oscar Salvador,
	Juan Yescas

On 12/3/25 10:23, Gregory Price wrote:
> On Wed, Dec 03, 2025 at 10:08:55AM +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/3/25 10:02, Gregory Price wrote:
>>>
>>> My transient failure (although i'm not sure it was actually transient, i
>>> killed it and retried after a few minutes and it succeeded immediately)
>>> was on a ZONE_MOVABLE block.
>>
>> Okay, so that one should not bail out. Longterm pinnins must never end up on
>> such memory, and if it happens, we have to identify why and fix it.
>>
>> We have this known problem of "stream of short-term pinnings" that can
>> temporarily turn memory effectively unmovable. Juan will talk about that at
>> LPC [1].
> 
> Nice, fun, good topic. Looking forward to Japan n_n
> 
>>
>> We have another set of problematic cases (vmsplice(), fuse) but I would
>> assume that these are not the cases you are hitting.
>>
> 
> We do use fuse, but this system was relatively quiet when i tried this.
> 
> We do have some proactive reclaim / demotion going on, but i don't think
> it was that (see below).
> 
>>>
>>> Kind of suggested to me there was some bad condition the resolved once I
>>> took a second to release the lock and try again.
>>
>> Hard to tell I'm afraid. Do you still have the dump_folio() calls we print
>> when migration fails?
>>
> 
> What luck, I do! :D

:)

> And i just noticed it's the same page over and over
> 
> [ 3404.119270] migrating pfn c06f176 failed ret:1
> [ 3404.129152] page: refcount:4 mapcount:0 mapping:0000000061ca20ba index:0xad28e5b pfn:0xc06f176
> [ 3404.148284] memcg:ffff88842e855000
> [ 3404.155834] aops:btree_aops ino:1

Small folio. Not GUP-pinned (FOLL_PIN, otherwise our refcount would be 
 >= 1024.

It could be ordinary GUP (FOLL_GET) e.g., from vmsplice or some older 
O_DIRECT user that was not converted to FOLL_PIN yet. But maybe it's 
just btrfs / something else that temporarily holds a folio reference.


Given that this is from 6.13 ... hard to tell :)

> [ 3404.163193] flags: 0x17ffff066c00420c(referenced|uptodate|workingset|private|node=1|zone=3|lastcpupid=0x1ffff)

Neither dirty nor under writeback.

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
  2025-12-01 17:41 ` Andrew Morton
  2025-12-03  8:27   ` Michal Hocko
@ 2025-12-08 17:30   ` Aboorva Devarajan
  2025-12-08 18:15     ` Michal Hocko
  2025-12-08 19:29     ` David Hildenbrand (Red Hat)
  1 sibling, 2 replies; 19+ messages in thread
From: Aboorva Devarajan @ 2025-12-08 17:30 UTC (permalink / raw)
  To: Andrew Morton, gourry, mhocko, david
  Cc: vbabka, surenb, jackmanb, hannes, ziy, linux-mm, linux-kernel

On Mon, 2025-12-01 at 09:41 -0800, Andrew Morton wrote:
> On Mon,  1 Dec 2025 11:30:09 +0530 Aboorva Devarajan <aboorvad@linux.ibm.com> wrote:
> 
> > When page isolation loops indefinitely during memory offline, reading
> > /proc/sys/vm/percpu_pagelist_high_fraction blocks on pcp_batch_high_lock,
> > causing hung task warnings.
> 
> That's pretty bad behavior.
> 
> I wonder if there are other problems which can be caused by this
> lengthy hold time.
> 
> It would be better to address the lengthy hold time rather that having
> to work around it in one impacted site.


Sorry for the delayed response, I spent some time recreating this issue.


I've encountered this lengthy hold time several times during memory hot-unplug, with
operations hanging indefinitely (20+ hours). It occurs intermittently, and it has 
different failure signatures, here's one example where isolation fails on a single
slab page continuously:

..
[83310.373699] page dumped because: isolation failed
[83310.373704] failed to isolate pfn 4dc68
[83310.373708] page: refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4dc68
[83310.373714] flags: 0x23ffffe00000000(node=2|zone=0|lastcpupid=0x1fffff)
[83310.373722] page_type: f5(slab)
[83310.373727] raw: 023ffffe00000000 c0000028e001fa00 5deadbeef0000100 5deadbeef0000122
[83310.373735] raw: 0000000000000000 0000000001e101e1 00000002f5000000 0000000000000000
[83310.373741] page dumped because: isolation failed
[83310.373749] failed to isolate pfn 4dc68
[83310.373753] page: refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4dc68
[83310.373760] flags: 0x23ffffe00000000(node=2|zone=0|lastcpupid=0x1fffff)
[83310.373767] page_type: f5(slab)
[83310.373770] raw: 023ffffe00000000 c0000028e001fa00 5deadbeef0000100 5deadbeef0000122
[83310.373774] raw: 0000000000000000 0000000001e101e1 00000002f5000000 0000000000000000
[83310.373778] page dumped because: isolation failed
[83310.373788] failed to isolate pfn 4dc68
[83310.373791] page: refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4dc68
[83310.373794] flags: 0x23ffffe00000000(node=2|zone=0|lastcpupid=0x1fffff)
[83310.373797] page_type: f5(slab)
[83310.373799] raw: 023ffffe00000000 c0000028e001fa00 5deadbeef0000100 5deadbeef0000122
[83310.373803] raw: 0000000000000000 0000000001e101e1 00000002f5000000 0000000000000000
[83310.373809] page dumped because: isolation failed
[83315.383370] do_migrate_range: 1098409 callbacks suppressed
[83315.383377] failed to isolate pfn 4dc68
[83315.383406] page: refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4dc68
[83315.383411] flags: 0x23ffffe00000000(node=2|zone=0|lastcpupid=0x1fffff)
[83315.383416] page_type: f5(slab)
[83315.383420] raw: 023ffffe00000000 c0000028e001fa00 5deadbeef0000100 5deadbeef0000122
[83315.383423] raw: 0000000000000000 0000000001e101e1 00000002f5000000 0000000000000000
[83315.383426] page dumped because: isolation failed
[83315.383431] failed to isolate pfn 4dc68
[83315.383433] page: refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4dc68
[83315.383442] flags: 0x23ffffe00000000(node=2|zone=0|lastcpupid=0x1fffff)
[83315.383448] page_type: f5(slab)
[83315.383454] raw: 023ffffe00000000 c0000028e001fa00 5deadbeef0000100 5deadbeef0000122
[83315.383462] raw: 0000000000000000 0000000001e101e1 00000002f5000000 0000000000000000
[83315.383470] page dumped because: isolation failed
...
...
...


Given the following statement in the documentation, should this behavior be considered
expected?

From Documentation/admin-guide/mm/memory-hotplug.rst:
"Further, memory offlining might retry for a long time (or even forever), until
aborted by the user."


There's also a TODO in the code that confirms this issue:

mm/memory_hotplug.c
/*
 * TODO: fatal migration failures should bail
 * out
 */
do_migrate_range(pfn, end_pfn);


A possible improvement would be to add a retry limit or timeout for pages that repeatedly
fail isolation, returning -EBUSY after N attempts instead of looping indefinitely for
umovable pages. This would make the behavior more predictable.


-----


In addition to the above, I've also seen test_pages_isolated() return -EBUSY at the final
isolation check for the same page-block continuously

int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
			enum pb_isolate_mode mode)
{
	...

	/* Check all pages are free or marked as ISOLATED */
	zone = page_zone(page);
	spin_lock_irqsave(&zone->lock, flags);
	pfn = __test_page_isolated_in_pageblock(start_pfn, end_pfn, mode); 
	spin_unlock_irqrestore(&zone->lock, flags);

	ret = pfn < end_pfn ? -EBUSY : 0; 
        ...

out:
	...
	return ret;
}

When __test_page_isolated_in_pageblock() encounters a page that isn't PageBuddy, PageHWPoison,
or PageOffline with count 0, it returns that pfn, causing -EBUSY. 


I'll work on capturing more traces for this failure scenario and follow up.

> 
> > Make procfs reads lock-free since percpu_pagelist_high_fraction is a simple
> > integer with naturally atomic reads, writers still serialize via the mutex.
> > 
> > This prevents hung task warnings when reading the procfs file during
> > long-running memory offline operations.
> > 
> > ...
> > 
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6611,11 +6611,14 @@ static int percpu_pagelist_high_fraction_sysctl_handler(const struct ctl_table *
> >  	int old_percpu_pagelist_high_fraction;
> >  	int ret;
> >  
> > +	if (!write)
> > +		return proc_dointvec_minmax(table, write, buffer, length, ppos);
> > +
> >  	mutex_lock(&pcp_batch_high_lock);
> >  	old_percpu_pagelist_high_fraction = percpu_pagelist_high_fraction;
> >  
> >  	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
> > -	if (!write || ret < 0)
> > +	if (ret < 0)
> >  		goto out;
> >  
> >  	/* Sanity checking to avoid pcp imbalance */
> 
> That being said, I'll grab the patch and shall put a cc:stable on it,
> see what people think about this hold-time issue.

Thanks.


Regards,
Aboorva


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
  2025-12-08 17:30   ` Aboorva Devarajan
@ 2025-12-08 18:15     ` Michal Hocko
  2025-12-08 19:29     ` David Hildenbrand (Red Hat)
  1 sibling, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2025-12-08 18:15 UTC (permalink / raw)
  To: Aboorva Devarajan
  Cc: Andrew Morton, gourry, david, vbabka, surenb, jackmanb, hannes,
	ziy, linux-mm, linux-kernel

On Mon 08-12-25 23:00:46, Aboorva Devarajan wrote:
> On Mon, 2025-12-01 at 09:41 -0800, Andrew Morton wrote:
> > On Mon,  1 Dec 2025 11:30:09 +0530 Aboorva Devarajan <aboorvad@linux.ibm.com> wrote:
[...]
> [83315.383433] page: refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4dc68
> [83315.383442] flags: 0x23ffffe00000000(node=2|zone=0|lastcpupid=0x1fffff)
> [83315.383448] page_type: f5(slab)
> [83315.383454] raw: 023ffffe00000000 c0000028e001fa00 5deadbeef0000100 5deadbeef0000122
> [83315.383462] raw: 0000000000000000 0000000001e101e1 00000002f5000000 0000000000000000
> [83315.383470] page dumped because: isolation failed
> ...
> ...
> ...
> 
> 
> Given the following statement in the documentation, should this behavior be considered
> expected?
> 
> >From Documentation/admin-guide/mm/memory-hotplug.rst:
> "Further, memory offlining might retry for a long time (or even forever), until
> aborted by the user."

This is in line with trying to offline memory blocks containing the
kernel memory as seen above. Retrying for ever on movable zones is a
different issue as discussed in other reply.

> There's also a TODO in the code that confirms this issue:
> 
> mm/memory_hotplug.c
> /*
>  * TODO: fatal migration failures should bail
>  * out
>  */
> do_migrate_range(pfn, end_pfn);
> 
> 
> A possible improvement would be to add a retry limit or timeout for pages that repeatedly
> fail isolation, returning -EBUSY after N attempts instead of looping indefinitely for
> umovable pages. This would make the behavior more predictable.

I disagree. It is trivial to implement timeout retry in the userspace.
Any retry attempts limit behavior will be much less predictable. It
could have been a matter of timing that an operation succeeds. We've had
exactly that kind of behavior before.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free
  2025-12-08 17:30   ` Aboorva Devarajan
  2025-12-08 18:15     ` Michal Hocko
@ 2025-12-08 19:29     ` David Hildenbrand (Red Hat)
  1 sibling, 0 replies; 19+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-08 19:29 UTC (permalink / raw)
  To: Aboorva Devarajan, Andrew Morton, gourry, mhocko
  Cc: vbabka, surenb, jackmanb, hannes, ziy, linux-mm, linux-kernel

On 12/8/25 18:30, Aboorva Devarajan wrote:
> On Mon, 2025-12-01 at 09:41 -0800, Andrew Morton wrote:
>> On Mon,  1 Dec 2025 11:30:09 +0530 Aboorva Devarajan <aboorvad@linux.ibm.com> wrote:
>>
>>> When page isolation loops indefinitely during memory offline, reading
>>> /proc/sys/vm/percpu_pagelist_high_fraction blocks on pcp_batch_high_lock,
>>> causing hung task warnings.
>>
>> That's pretty bad behavior.
>>
>> I wonder if there are other problems which can be caused by this
>> lengthy hold time.
>>
>> It would be better to address the lengthy hold time rather that having
>> to work around it in one impacted site.
> 
> 
> Sorry for the delayed response, I spent some time recreating this issue.
> 
> 
> I've encountered this lengthy hold time several times during memory hot-unplug, with
> operations hanging indefinitely (20+ hours). It occurs intermittently, and it has
> different failure signatures, here's one example where isolation fails on a single
> slab page continuously:
> 
> ..
> [83310.373699] page dumped because: isolation failed
> [83310.373704] failed to isolate pfn 4dc68
> [83310.373708] page: refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4dc68
> [83310.373714] flags: 0x23ffffe00000000(node=2|zone=0|lastcpupid=0x1fffff)
> [83310.373722] page_type: f5(slab)
> [83310.373727] raw: 023ffffe00000000 c0000028e001fa00 5deadbeef0000100 5deadbeef0000122
> [83310.373735] raw: 0000000000000000 0000000001e101e1 00000002f5000000 0000000000000000
> [83310.373741] page dumped because: isolation failed
> [83310.373749] failed to isolate pfn 4dc68
> [83310.373753] page: refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4dc68
> [83310.373760] flags: 0x23ffffe00000000(node=2|zone=0|lastcpupid=0x1fffff)
> [83310.373767] page_type: f5(slab)
> [83310.373770] raw: 023ffffe00000000 c0000028e001fa00 5deadbeef0000100 5deadbeef0000122
> [83310.373774] raw: 0000000000000000 0000000001e101e1 00000002f5000000 0000000000000000
> [83310.373778] page dumped because: isolation failed
> [83310.373788] failed to isolate pfn 4dc68
> [83310.373791] page: refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4dc68
> [83310.373794] flags: 0x23ffffe00000000(node=2|zone=0|lastcpupid=0x1fffff)
> [83310.373797] page_type: f5(slab)
> [83310.373799] raw: 023ffffe00000000 c0000028e001fa00 5deadbeef0000100 5deadbeef0000122
> [83310.373803] raw: 0000000000000000 0000000001e101e1 00000002f5000000 0000000000000000
> [83310.373809] page dumped because: isolation failed
> [83315.383370] do_migrate_range: 1098409 callbacks suppressed
> [83315.383377] failed to isolate pfn 4dc68
> [83315.383406] page: refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4dc68
> [83315.383411] flags: 0x23ffffe00000000(node=2|zone=0|lastcpupid=0x1fffff)
> [83315.383416] page_type: f5(slab)
> [83315.383420] raw: 023ffffe00000000 c0000028e001fa00 5deadbeef0000100 5deadbeef0000122
> [83315.383423] raw: 0000000000000000 0000000001e101e1 00000002f5000000 0000000000000000
> [83315.383426] page dumped because: isolation failed
> [83315.383431] failed to isolate pfn 4dc68
> [83315.383433] page: refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4dc68
> [83315.383442] flags: 0x23ffffe00000000(node=2|zone=0|lastcpupid=0x1fffff)
> [83315.383448] page_type: f5(slab)
> [83315.383454] raw: 023ffffe00000000 c0000028e001fa00 5deadbeef0000100 5deadbeef0000122
> [83315.383462] raw: 0000000000000000 0000000001e101e1 00000002f5000000 0000000000000000
> [83315.383470] page dumped because: isolation failed

When starting memory offlining we do a racy check whether memory 
offlining will succeed in has_unmovable_pages().

It's racy because briefly after the check, a kernel page could get 
allocated, before we isolate all the free pages.

I assume that's what happened here.

But even performing another has_unmovable_pages() check after isolating 
the free pages will not catch all cases: in particular movable pages 
that are longterm-pinned.

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2025-12-08 19:29 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-12-01  6:00 [PATCH] mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free Aboorva Devarajan
2025-12-01 17:41 ` Andrew Morton
2025-12-03  8:27   ` Michal Hocko
2025-12-03  8:35     ` Gregory Price
2025-12-03  8:42       ` Michal Hocko
2025-12-03  8:51         ` David Hildenbrand (Red Hat)
2025-12-03  9:02           ` Gregory Price
2025-12-03  9:08             ` David Hildenbrand (Red Hat)
2025-12-03  9:23               ` Gregory Price
2025-12-03  9:26                 ` Gregory Price
2025-12-03 11:28                 ` David Hildenbrand (Red Hat)
2025-12-03  8:59         ` Gregory Price
2025-12-03  9:15           ` David Hildenbrand (Red Hat)
2025-12-03  9:42             ` Michal Hocko
2025-12-03 11:22               ` David Hildenbrand (Red Hat)
2025-12-08 17:30   ` Aboorva Devarajan
2025-12-08 18:15     ` Michal Hocko
2025-12-08 19:29     ` David Hildenbrand (Red Hat)
2025-12-03  8:21 ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox