linux-mm.kvack.org archive mirror
* [PATCH v2 0/4] mm/page_alloc: Batch callers of free_pcppages_bulk
@ 2025-09-24 20:44 Joshua Hahn
  2025-09-24 20:44 ` [PATCH v2 1/4] mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection Joshua Hahn
                   ` (3 more replies)
  0 siblings, 4 replies; 33+ messages in thread
From: Joshua Hahn @ 2025-09-24 20:44 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner
  Cc: Chris Mason, Kiryl Shutsemau, Liam R. Howlett, Brendan Jackman,
	David Hildenbrand, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Zi Yan, linux-kernel,
	linux-mm

Motivation & Approach
=====================

While testing workloads with high sustained memory pressure on large machines
in the Meta fleet (1TB memory, 316 CPUs), we saw an unexpectedly high number
of softlockups. Further investigation showed that the lock in
free_pcppages_bulk was being held for a long time, and that the function was
called to free 2k+ pages over 100 times just during boot.

This starves other processes of both the pcp and zone locks, which can lead
to the system stalling as multiple threads cannot make progress without them.
These issues manifest as warnings such as:

[ 4512.591979] rcu: INFO: rcu_sched self-detected stall on CPU
[ 4512.604370] rcu: 	20-....: (9312 ticks this GP) idle=a654/1/0x4000000000000000 softirq=309340/309344 fqs=5426
[ 4512.626401] rcu: 	         hardirqs   softirqs   csw/system
[ 4512.638793] rcu: 	 number:        0        145            0
[ 4512.651177] rcu: 	cputime:       30      10410          174   ==> 10558(ms)
[ 4512.666657] rcu: 	(t=21077 jiffies g=783665 q=1242213 ncpus=316)

While these warnings are benign, they do point to the underlying issue of
lock contention. To prevent starvation on both locks, batch the freeing of
pages using pcp->batch.

Because free_pcppages_bulk is called with both the pcp and zone lock,
relinquishing and reacquiring the locks is only effective when both of them
are broken together (unless the system was built with queued spinlocks).
Thus, instead of modifying free_pcppages_bulk to break both locks, batch the
freeing from its callers instead.
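
Concretely, the callers now free in chunks of pcp->batch pages, along these
lines (a simplified sketch based on patch 3/4; the pcp->high decay and other
bookkeeping are omitted):

	to_drain = pcp->count - pcp->high;
	while (to_drain > 0) {
		int to_drain_batched = min(to_drain, batch);

		spin_lock(&pcp->lock);
		/* free_pcppages_bulk() takes zone->lock internally */
		free_pcppages_bulk(zone, to_drain_batched, pcp, 0);
		spin_unlock(&pcp->lock);

		to_drain -= to_drain_batched;
	}

Dropping the pcp lock between batches bounds how long both locks are held in
one stretch, so other waiters can make progress.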

A similar fix has been implemented in the Meta fleet, and we have seen
significantly fewer softlockups.

Testing
=======
The following are a few synthetic benchmarks, run on a machine with
250G RAM, 179G swap, and 176 CPUs.

stress-ng --vm 50 --vm-bytes 5G -M -t 100
+----------------------+---------------+----------+
|        Metric        | Variation (%) | Delta(%) |
+----------------------+---------------+----------+
| bogo ops             |        0.0216 |  -0.0172 |
| bogo ops/s (real)    |        0.0223 |  -0.0163 |
| bogo ops/s (usr+sys) |        1.3433 |  +1.0769 |
+----------------------+---------------+----------+

stress-ng --vm 10 --vm-bytes 30G -M -t 100
+----------------------+---------------+----------+
|        Metric        | Variation (%) | Delta(%) |
+----------------------+---------------+----------+
| bogo ops             |        2.1736 |  +4.8535 |
| bogo ops/s (real)    |        2.2689 |  +5.1719 |
| bogo ops/s (usr+sys) |        2.1283 |  +0.6587 |
+----------------------+---------------+----------+

Depending on the workload, this patch may lead to an increase in performance
or stay neutral. I believe this comes down to how much lock contention there
is, and to how many free_pcppages_bulk calls were previously being made with
high counts.

The difference between bogo ops/s (real) and (usr+sys) seems to indicate that
there is a meaningful difference in the amount of time threads spend blocked
on acquiring either the pcp or zone lock.

Changelog
=========
v1 --> v2:
- Reworded cover letter to be more explicit about what kinds of issues
  running processes might face as a result of the existing lock starvation
- Reworded cover letter to be in sections to make it easier to read
- Fixed patch 4/4 to properly store & restore UP flags.
- Re-ran tests, updated the testing results and interpretation

Joshua Hahn (4):
  mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection
  mm/page_alloc: Perform appropriate batching in drain_pages_zone
  mm/page_alloc: Batch page freeing in decay_pcp_high
  mm/page_alloc: Batch page freeing in free_frozen_page_commit

 include/linux/gfp.h |  2 +-
 mm/page_alloc.c     | 67 ++++++++++++++++++++++++++++++++-------------
 mm/vmstat.c         | 26 +++++++++---------
 3 files changed, 62 insertions(+), 33 deletions(-)


base-commit: 097a6c336d0080725c626fda118ecfec448acd0f
-- 
2.47.3


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v2 1/4] mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection
  2025-09-24 20:44 [PATCH v2 0/4] mm/page_alloc: Batch callers of free_pcppages_bulk Joshua Hahn
@ 2025-09-24 20:44 ` Joshua Hahn
  2025-09-24 22:51   ` Christoph Lameter (Ampere)
                     ` (2 more replies)
  2025-09-24 20:44 ` [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone Joshua Hahn
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 33+ messages in thread
From: Joshua Hahn @ 2025-09-24 20:44 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner
  Cc: Chris Mason, Kiryl Shutsemau, Liam R. Howlett, Brendan Jackman,
	David Hildenbrand, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Zi Yan, linux-kernel,
	linux-mm, kernel-team

Currently, refresh_cpu_vm_stats returns an int, indicating how many
changes were made during its updates. Using this information, callers
like vmstat_update can heuristically determine if more work will be done
in the future.

However, all of refresh_cpu_vm_stats's callers either (a) ignore the
result, only caring about performing the updates, or (b) only care about
whether changes were made, but not *how many* changes were made.

Simplify the code by returning a bool instead to indicate if updates
were made.

In addition, simplify fold_diff and decay_pcp_high to return a bool
for the same reason.

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
 include/linux/gfp.h |  2 +-
 mm/page_alloc.c     |  8 ++++----
 mm/vmstat.c         | 26 +++++++++++++-------------
 3 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 5ebf26fcdcfa..63c72cb1d117 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -386,7 +386,7 @@ extern void free_pages(unsigned long addr, unsigned int order);
 #define free_page(addr) free_pages((addr), 0)
 
 void page_alloc_init_cpuhp(void);
-int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
+bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
 void drain_all_pages(struct zone *zone);
 void drain_local_pages(struct zone *zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d1d037f97c5f..77e7d9a5f149 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2561,10 +2561,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
  * Called from the vmstat counter updater to decay the PCP high.
  * Return whether there are addition works to do.
  */
-int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
+bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
 {
 	int high_min, to_drain, batch;
-	int todo = 0;
+	bool todo;
 
 	high_min = READ_ONCE(pcp->high_min);
 	batch = READ_ONCE(pcp->batch);
@@ -2577,7 +2577,7 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
 		pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
 				 pcp->high - (pcp->high >> 3), high_min);
 		if (pcp->high > high_min)
-			todo++;
+			todo = true;
 	}
 
 	to_drain = pcp->count - pcp->high;
@@ -2585,7 +2585,7 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
 		spin_lock(&pcp->lock);
 		free_pcppages_bulk(zone, to_drain, pcp, 0);
 		spin_unlock(&pcp->lock);
-		todo++;
+		todo = true;
 	}
 
 	return todo;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 71cd1ceba191..1f74a3517ab2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -771,25 +771,25 @@ EXPORT_SYMBOL(dec_node_page_state);
 
 /*
  * Fold a differential into the global counters.
- * Returns the number of counters updated.
+ * Returns whether counters were updated.
  */
 static int fold_diff(int *zone_diff, int *node_diff)
 {
 	int i;
-	int changes = 0;
+	bool changed = false;
 
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
 		if (zone_diff[i]) {
 			atomic_long_add(zone_diff[i], &vm_zone_stat[i]);
-			changes++;
+			changed = true;
 	}
 
 	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
 		if (node_diff[i]) {
 			atomic_long_add(node_diff[i], &vm_node_stat[i]);
-			changes++;
+			changed = true;
 	}
-	return changes;
+	return changed;
 }
 
 /*
@@ -806,16 +806,16 @@ static int fold_diff(int *zone_diff, int *node_diff)
  * with the global counters. These could cause remote node cache line
  * bouncing and will have to be only done when necessary.
  *
- * The function returns the number of global counters updated.
+ * The function returns whether global counters were updated.
  */
-static int refresh_cpu_vm_stats(bool do_pagesets)
+static bool refresh_cpu_vm_stats(bool do_pagesets)
 {
 	struct pglist_data *pgdat;
 	struct zone *zone;
 	int i;
 	int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
 	int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
-	int changes = 0;
+	bool changed = false;
 
 	for_each_populated_zone(zone) {
 		struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
@@ -839,7 +839,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 		if (do_pagesets) {
 			cond_resched();
 
-			changes += decay_pcp_high(zone, this_cpu_ptr(pcp));
+			changed |= decay_pcp_high(zone, this_cpu_ptr(pcp));
 #ifdef CONFIG_NUMA
 			/*
 			 * Deal with draining the remote pageset of this
@@ -861,13 +861,13 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 			}
 
 			if (__this_cpu_dec_return(pcp->expire)) {
-				changes++;
+				changed = true;
 				continue;
 			}
 
 			if (__this_cpu_read(pcp->count)) {
 				drain_zone_pages(zone, this_cpu_ptr(pcp));
-				changes++;
+				changed = true;
 			}
 #endif
 		}
@@ -887,8 +887,8 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 		}
 	}
 
-	changes += fold_diff(global_zone_diff, global_node_diff);
-	return changes;
+	changed |= fold_diff(global_zone_diff, global_node_diff);
+	return changed;
 }
 
 /*
-- 
2.47.3


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone
  2025-09-24 20:44 [PATCH v2 0/4] mm/page_alloc: Batch callers of free_pcppages_bulk Joshua Hahn
  2025-09-24 20:44 ` [PATCH v2 1/4] mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection Joshua Hahn
@ 2025-09-24 20:44 ` Joshua Hahn
  2025-09-24 23:09   ` Christoph Lameter (Ampere)
                     ` (2 more replies)
  2025-09-24 20:44 ` [PATCH v2 3/4] mm/page_alloc: Batch page freeing in decay_pcp_high Joshua Hahn
  2025-09-24 20:44 ` [PATCH v2 4/4] mm/page_alloc: Batch page freeing in free_frozen_page_commit Joshua Hahn
  3 siblings, 3 replies; 33+ messages in thread
From: Joshua Hahn @ 2025-09-24 20:44 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner
  Cc: Chris Mason, Kiryl Shutsemau, Brendan Jackman, Michal Hocko,
	Suren Baghdasaryan, Vlastimil Babka, Zi Yan, linux-kernel,
	linux-mm, kernel-team

drain_pages_zone completely drains a zone of its pcp free pages by
repeatedly calling free_pcppages_bulk until pcp->count reaches 0.
In this loop, it already performs batched calls to ensure that
free_pcppages_bulk isn't called to free too many pages at once, and
relinquishes & reacquires the lock between each call to prevent
lock starvation from other processes.

However, the current batching does not prevent lock starvation. The
current implementation creates batches of
pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX, which has been seen in
Meta workloads to be up to 64 << 5 == 2048 pages.

While it is true that CONFIG_PCP_BATCH_SCALE_MAX is a config and
indeed can be adjusted by the system admin to be any number from
0 to 6, its default value of 5 is still too high to be reasonable for
any system.

Instead, let's create batches of pcp->batch pages, which gives a more
reasonable 64 pages per call to free_pcppages_bulk. This gives other
processes a chance to grab the lock and prevents starvation. Each
individual call to drain_pages_zone may take longer, but we avoid the
worst case scenario of completely starving out other system-critical
threads from acquiring the pcp lock while 2048 pages are freed
one-by-one.

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
 mm/page_alloc.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 77e7d9a5f149..b861b647f184 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2623,8 +2623,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 		spin_lock(&pcp->lock);
 		count = pcp->count;
 		if (count) {
-			int to_drain = min(count,
-				pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
+			int to_drain = min(count, pcp->batch);
 
 			free_pcppages_bulk(zone, to_drain, pcp, 0);
 			count -= to_drain;
-- 
2.47.3


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v2 3/4] mm/page_alloc: Batch page freeing in decay_pcp_high
  2025-09-24 20:44 [PATCH v2 0/4] mm/page_alloc: Batch callers of free_pcppages_bulk Joshua Hahn
  2025-09-24 20:44 ` [PATCH v2 1/4] mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection Joshua Hahn
  2025-09-24 20:44 ` [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone Joshua Hahn
@ 2025-09-24 20:44 ` Joshua Hahn
  2025-09-24 20:44 ` [PATCH v2 4/4] mm/page_alloc: Batch page freeing in free_frozen_page_commit Joshua Hahn
  3 siblings, 0 replies; 33+ messages in thread
From: Joshua Hahn @ 2025-09-24 20:44 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner
  Cc: Chris Mason, Kiryl Shutsemau, Brendan Jackman, Michal Hocko,
	Suren Baghdasaryan, Vlastimil Babka, Zi Yan, linux-kernel,
	linux-mm, kernel-team

It is possible for pcp->count - pcp->high to exceed pcp->batch by a lot.
When this happens, we should perform batching to ensure that
free_pcppages_bulk isn't called with too many pages to free at once,
starving out other threads that need the pcp lock.

Since we are still only freeing the difference between the initial
pcp->count and pcp->high values, there should be no change to how many
pages are freed.

Suggested-by: Chris Mason <clm@fb.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Co-developed-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
 mm/page_alloc.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b861b647f184..467e524a99df 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2563,7 +2563,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
  */
 bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
 {
-	int high_min, to_drain, batch;
+	int high_min, to_drain, to_drain_batched, batch;
 	bool todo;
 
 	high_min = READ_ONCE(pcp->high_min);
@@ -2581,11 +2581,14 @@ bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
 	}
 
 	to_drain = pcp->count - pcp->high;
-	if (to_drain > 0) {
+	while (to_drain > 0) {
+		to_drain_batched = min(to_drain, batch);
 		spin_lock(&pcp->lock);
-		free_pcppages_bulk(zone, to_drain, pcp, 0);
+		free_pcppages_bulk(zone, to_drain_batched, pcp, 0);
 		spin_unlock(&pcp->lock);
 		todo = true;
+
+		to_drain -= to_drain_batched;
 	}
 
 	return todo;
-- 
2.47.3


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v2 4/4] mm/page_alloc: Batch page freeing in free_frozen_page_commit
  2025-09-24 20:44 [PATCH v2 0/4] mm/page_alloc: Batch callers of free_pcppages_bulk Joshua Hahn
                   ` (2 preceding siblings ...)
  2025-09-24 20:44 ` [PATCH v2 3/4] mm/page_alloc: Batch page freeing in decay_pcp_high Joshua Hahn
@ 2025-09-24 20:44 ` Joshua Hahn
  2025-09-28  5:17   ` kernel test robot
  3 siblings, 1 reply; 33+ messages in thread
From: Joshua Hahn @ 2025-09-24 20:44 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner
  Cc: Chris Mason, Kiryl Shutsemau, Brendan Jackman, Michal Hocko,
	Suren Baghdasaryan, Vlastimil Babka, Zi Yan, linux-kernel,
	linux-mm, kernel-team

Before returning, free_frozen_page_commit calls free_pcppages_bulk using
nr_pcp_free to determine how many pages can appropriately be freed,
based on the tunable parameters stored in pcp. While this number is an
accurate representation of how many pages should be freed in total, it
is not an appropriate number of pages to free at once using
free_pcppages_bulk, since we have seen the value consistently go above
2000 in the Meta fleet on larger machines.

As such, perform batched calls to free_pcppages_bulk using the pcp->batch
member. To ensure that other processes are not starved of the pcp (and zone)
lock, drop and reacquire the pcp lock between batches.

Note that because free_frozen_page_commit now re-takes the pcp spinlock
inside the function (and this trylock can fail), it may now return with the
pcp lock released. To handle this, return true if the pcp is locked on exit
and false otherwise.

In addition, since free_frozen_page_commit must now know which UP flags were
saved when the pcp lock was taken, and must be able to report new UP flags
back to its callers, add a new unsigned long *UP_flags parameter to keep
track of this.

Suggested-by: Chris Mason <clm@fb.com>
Co-developed-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
 mm/page_alloc.c | 47 +++++++++++++++++++++++++++++++++++++----------
 1 file changed, 37 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 467e524a99df..50589ff0a9c6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2821,11 +2821,19 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 	return high;
 }
 
-static void free_frozen_page_commit(struct zone *zone,
+/*
+ * Tune pcp alloc factor and adjust count & free_count. Free pages to bring the
+ * pcp's watermarks below high.
+ *
+ * May return a freed pcp, if during page freeing the pcp spinlock cannot be
+ * reacquired. Return true if pcp is locked, false otherwise.
+ */
+static bool free_frozen_page_commit(struct zone *zone,
 		struct per_cpu_pages *pcp, struct page *page, int migratetype,
-		unsigned int order, fpi_t fpi_flags)
+		unsigned int order, fpi_t fpi_flags, unsigned long *UP_flags)
 {
 	int high, batch;
+	int to_free, to_free_batched;
 	int pindex;
 	bool free_high = false;
 
@@ -2864,17 +2872,34 @@ static void free_frozen_page_commit(struct zone *zone,
 		 * Do not attempt to take a zone lock. Let pcp->count get
 		 * over high mark temporarily.
 		 */
-		return;
+		return true;
 	}
 	high = nr_pcp_high(pcp, zone, batch, free_high);
-	if (pcp->count >= high) {
-		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
-				   pcp, pindex);
+	to_free = nr_pcp_free(pcp, batch, high, free_high);
+	while (to_free > 0 && pcp->count >= high) {
+		to_free_batched = min(to_free, batch);
+		free_pcppages_bulk(zone, to_free_batched, pcp, pindex);
 		if (test_bit(ZONE_BELOW_HIGH, &zone->flags) &&
 		    zone_watermark_ok(zone, 0, high_wmark_pages(zone),
 				      ZONE_MOVABLE, 0))
 			clear_bit(ZONE_BELOW_HIGH, &zone->flags);
+
+		high = nr_pcp_high(pcp, zone, batch, free_high);
+		to_free -= to_free_batched;
+		if (pcp->count >= high) {
+			pcp_spin_unlock(pcp);
+			pcp_trylock_finish(*UP_flags);
+
+			pcp_trylock_prepare(*UP_flags);
+			pcp = pcp_spin_trylock(zone->per_cpu_pageset);
+			if (!pcp) {
+				pcp_trylock_finish(*UP_flags);
+				return false;
+			}
+		}
 	}
+
+	return true;
 }
 
 /*
@@ -2922,8 +2947,9 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
 	pcp_trylock_prepare(UP_flags);
 	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 	if (pcp) {
-		free_frozen_page_commit(zone, pcp, page, migratetype, order, fpi_flags);
-		pcp_spin_unlock(pcp);
+		if (free_frozen_page_commit(zone, pcp, page, migratetype, order,
+					    fpi_flags, &UP_flags))
+			pcp_spin_unlock(pcp);
 	} else {
 		free_one_page(zone, page, pfn, order, fpi_flags);
 	}
@@ -3022,8 +3048,9 @@ void free_unref_folios(struct folio_batch *folios)
 			migratetype = MIGRATE_MOVABLE;
 
 		trace_mm_page_free_batched(&folio->page);
-		free_frozen_page_commit(zone, pcp, &folio->page, migratetype,
-					order, FPI_NONE);
+		if (!free_frozen_page_commit(zone, pcp, &folio->page,
+				migratetype, order, FPI_NONE, &UP_flags))
+			pcp = NULL;
 	}
 
 	if (pcp) {
-- 
2.47.3


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 1/4] mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection
  2025-09-24 20:44 ` [PATCH v2 1/4] mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection Joshua Hahn
@ 2025-09-24 22:51   ` Christoph Lameter (Ampere)
  2025-09-25 18:26     ` Joshua Hahn
  2025-09-26 15:34   ` Dan Carpenter
  2025-09-26 17:50   ` SeongJae Park
  2 siblings, 1 reply; 33+ messages in thread
From: Christoph Lameter (Ampere) @ 2025-09-24 22:51 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Andrew Morton, Johannes Weiner, Chris Mason, Kiryl Shutsemau,
	Liam R. Howlett, Brendan Jackman, David Hildenbrand,
	Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Suren Baghdasaryan,
	Vlastimil Babka, Zi Yan, linux-kernel, linux-mm, kernel-team

On Wed, 24 Sep 2025, Joshua Hahn wrote:

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d1d037f97c5f..77e7d9a5f149 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2561,10 +2561,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>   * Called from the vmstat counter updater to decay the PCP high.
>   * Return whether there are addition works to do.
>   */
> -int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
> +bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
>  {
>  	int high_min, to_drain, batch;
> -	int todo = 0;
> +	bool todo;

	todo = false

is needed here.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone
  2025-09-24 20:44 ` [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone Joshua Hahn
@ 2025-09-24 23:09   ` Christoph Lameter (Ampere)
  2025-09-25 18:44     ` Joshua Hahn
  2025-09-26 14:01   ` Brendan Jackman
  2025-09-27  0:46   ` Hillf Danton
  2 siblings, 1 reply; 33+ messages in thread
From: Christoph Lameter (Ampere) @ 2025-09-24 23:09 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Andrew Morton, Johannes Weiner, Chris Mason, Kiryl Shutsemau,
	Brendan Jackman, Michal Hocko, Suren Baghdasaryan,
	Vlastimil Babka, Zi Yan, linux-kernel, linux-mm, kernel-team

On Wed, 24 Sep 2025, Joshua Hahn wrote:

> drain_pages_zone completely drains a zone of its pcp free pages by
> repeatedly calling free_pcppages_bulk until pcp->count reaches 0.
> In this loop, it already performs batched calls to ensure that
> free_pcppages_bulk isn't called to free too many pages at once, and
> relinquishes & reacquires the lock between each call to prevent
> lock starvation from other processes.


drain_pages_zone() operates on a lock in a percpu area. The lock is
specific to a cpu and should not be contended in normal operations unless
there is significant remote access to the per cpu queues.

This seems to be then from __drain_all_pages() running on multiple cpus
frequently. There is no point in concurrently draining the per cpu pages
of all processors from multiple remote processors and we have a
pcpu_drain_mutex to prevent that from happening.

So we need an explanation as to why there is such high contention on the
lock first before changing the logic here.

The current logic seems to be designed to prevent the lock contention you
are seeing.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 1/4] mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection
  2025-09-24 22:51   ` Christoph Lameter (Ampere)
@ 2025-09-25 18:26     ` Joshua Hahn
  0 siblings, 0 replies; 33+ messages in thread
From: Joshua Hahn @ 2025-09-25 18:26 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Andrew Morton, Johannes Weiner, Chris Mason, Kiryl Shutsemau,
	Liam R. Howlett, Brendan Jackman, David Hildenbrand,
	Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Suren Baghdasaryan,
	Vlastimil Babka, Zi Yan, linux-kernel, linux-mm, kernel-team

On Wed, 24 Sep 2025 15:51:58 -0700 (PDT) "Christoph Lameter (Ampere)" <cl@gentwo.org> wrote:

> On Wed, 24 Sep 2025, Joshua Hahn wrote:
> 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index d1d037f97c5f..77e7d9a5f149 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2561,10 +2561,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> >   * Called from the vmstat counter updater to decay the PCP high.
> >   * Return whether there are addition works to do.
> >   */
> > -int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
> > +bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
> >  {
> >  	int high_min, to_drain, batch;
> > -	int todo = 0;
> > +	bool todo;
> 
> 	todo = false
> 
> is needed here.

Hello Christoph,

You're absolutely right, thank you for the catch! I'll fix this and send it
out in v3. Have a great day!
Joshua

Sent using hkml (https://github.com/sjp38/hackermail)


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone
  2025-09-24 23:09   ` Christoph Lameter (Ampere)
@ 2025-09-25 18:44     ` Joshua Hahn
  2025-09-26 16:21       ` Christoph Lameter (Ampere)
  0 siblings, 1 reply; 33+ messages in thread
From: Joshua Hahn @ 2025-09-25 18:44 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Andrew Morton, Johannes Weiner, Chris Mason, Kiryl Shutsemau,
	Brendan Jackman, Michal Hocko, Suren Baghdasaryan,
	Vlastimil Babka, Zi Yan, linux-kernel, linux-mm, kernel-team

On Wed, 24 Sep 2025 16:09:47 -0700 (PDT) "Christoph Lameter (Ampere)" <cl@gentwo.org> wrote:

> On Wed, 24 Sep 2025, Joshua Hahn wrote:
> 
> > drain_pages_zone completely drains a zone of its pcp free pages by
> > repeatedly calling free_pcppages_bulk until pcp->count reaches 0.
> > In this loop, it already performs batched calls to ensure that
> > free_pcppages_bulk isn't called to free too many pages at once, and
> > relinquishes & reacquires the lock between each call to prevent
> > lock starvation from other processes. 

Hello Christoph,

Thank you for your thoughtful review. 

> drain_pages_zone() operates on a lock in a percpu area. The lock is
> specific to a cpu and should not be contended in normal operations unless
> there is significant remote access to the per cpu queues.

When you say "the lock", I imagine you are referring to the pcp lock.
However, there is one more lock that we have to think about, which is the
zone lock that drain_pages_zone-->free_pcppages_bulk takes and releases.
Perhaps I can be more explicit about this goal in the commit message.

Actually, in the original version of this patch [1] the goal was to
relieve contention on the zone lock, but since it is nested under
the pcp lock, both locks had to be released in order for the task to be
rescheduled and for other processes to come in and grab the zone lock.

So from what I can understand, it seems like there truly is lock contention,
just not on the pcp lock. Please let me know if I have an incorrect
understanding of the code here.

> This seems to be then from __drain_all_pages() running on multiple cpus
> frequently. There is no point in concurrently draining the per cpu pages
> of all processors from multiple remote processors and we have a
> pcpu_drain_mutex to prevent that from happening.

Agreed.

> So we need an explanation as to why there is such high contention on the
> lock first before changing the logic here.
> 
> The current logic seems to be designed to prevent the lock contention you
> are seeing.

This is true, but my concern was mostly with the value that is being used
for the batching (2048 seems too high). But as I explain below, it seems
like the min(2048, count) operation is a no-op anyways, since it is never
called with count > 1000 (at least from the benchmarks that I was running,
on my machine).

To demonstrate where the contention comes from / which callers try to free
too many pages at once, I compiled the upstream kernel and made a histogram
of calls to free_pcppages_bulk where count > 1000.
On the left is the value of count passed to the function, and on the right is
the frequency. This is on a 250G memory, 179G swap, 176 CPU machine,
running a memory-intensive task for ~12 minutes.

free_frozen_page_commit
-----------------------
1000-1250| *          (3732)
1250-1500| *          (3329)
1500-1750| *          (3102)
1750-2000| *          (2878)
2000-2250| ********** (28456)

decay_pcp_high
--------------
1000-1250| ********** (837)
1250-1500| ******     (540)
1500-1750| ****       (337)
1750-2000| **         (249)
2000-2250| ******     (547)
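
For the curious, instrumentation of this kind could look roughly like the
following (a hypothetical sketch for illustration only, not the exact code
that was used to collect the numbers above):

	/*
	 * Hypothetical illustration: bucket free_pcppages_bulk requests
	 * above 1000 pages in 250-page steps, with one counter array per
	 * call site of interest.
	 */
	static atomic_t bulk_hist[5];

	static void record_bulk_count(int count)
	{
		if (count > 1000)
			atomic_inc(&bulk_hist[min((count - 1000) / 250, 4)]);
	}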

So it seems like drain_pages_zone does not even call free_pcppages_bulk with
a high count. In fact, it seems like free_frozen_page_commit is truly the
biggest offender here.

When I was writing this patch, I was doing it for completeness and ensuring
that all callers of free_pcppages_bulk have a limit to how many pages it
can try to free at once.

I am happy to drop this patch from v3, if you feel that it is an inappropriate
change. From what I can see from the above numbers, it doesn't seem like
there will be much impact on performance anyway (at least on my machine,
with the same benchmark; perhaps bigger machines or more memory-intensive
benchmarks will trigger larger drains?)

Thank you again for your review, and please feel free to correct me on anything
that I may be understanding incorrectly. Have a great day!
Joshua

[1] https://lore.kernel.org/all/20250818185804.21044-1-joshua.hahnjy@gmail.com/


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone
  2025-09-24 20:44 ` [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone Joshua Hahn
  2025-09-24 23:09   ` Christoph Lameter (Ampere)
@ 2025-09-26 14:01   ` Brendan Jackman
  2025-09-26 15:48     ` Joshua Hahn
  2025-09-27  0:46   ` Hillf Danton
  2 siblings, 1 reply; 33+ messages in thread
From: Brendan Jackman @ 2025-09-26 14:01 UTC (permalink / raw)
  To: Joshua Hahn, Andrew Morton, Johannes Weiner
  Cc: Chris Mason, Kiryl Shutsemau, Michal Hocko, Suren Baghdasaryan,
	Vlastimil Babka, Zi Yan, linux-kernel, linux-mm, kernel-team

On Wed Sep 24, 2025 at 8:44 PM UTC, Joshua Hahn wrote:
> drain_pages_zone completely drains a zone of its pcp free pages by
> repeatedly calling free_pcppages_bulk until pcp->count reaches 0.
> In this loop, it already performs batched calls to ensure that
> free_pcppages_bulk isn't called to free too many pages at once, and
> relinquishes & reacquires the lock between each call to prevent
> lock starvation from other processes.
>
> However, the current batching does not prevent lock starvation. The
> current implementation creates batches of
> pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX, which has been seen in
> Meta workloads to be up to 64 << 5 == 2048 pages.
>
> While it is true that CONFIG_PCP_BATCH_SCALE_MAX is a config and
> indeed can be adjusted by the system admin to be any number from
> 0 to 6, its default value of 5 is still too high to be reasonable for
> any system.
>
> Instead, let's create batches of pcp->batch pages, which gives a more
> reasonable 64 pages per call to free_pcppages_bulk. This gives other
> processes a chance to grab the lock and prevents starvation. Each
> individual call to drain_pages_zone may take longer, but we avoid the
> worst case scenario of completely starving out other system-critical
> threads from acquiring the pcp lock while 2048 pages are freed
> one-by-one.

Hey Joshua, do you know why pcp->batch is a factor here at all? Until
now I never really noticed it. I thought that this field was a kinda
dynamic auto-tuning where we try to make the pcplists a more aggressive
cache when they're being used a lot and then shrink them down when the
allocator is under less load. But I don't have a good intuition for why
that's relevant to drain_pages_zone(). Something to do with the amount
of lock contention we expect?

Unless I'm just being stupid here, maybe a chance to add commentary.

>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> ---
>  mm/page_alloc.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 77e7d9a5f149..b861b647f184 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2623,8 +2623,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
>  		spin_lock(&pcp->lock);
>  		count = pcp->count;
>  		if (count) {
> -			int to_drain = min(count,
> -				pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
> +			int to_drain = min(count, pcp->batch);

We actually don't need the min() here as free_pcppages_bulk() does that
anyway. Not really related to the commit but maybe worth tidying that
up.

Also, it seems if we drop the BATCH_SCALE_MAX logic the inside of the
loop is now very similar to drain_zone_pages(), maybe time to have them
share some code and avoid the confusing name overlap? drain_zone_pages()
reads pcp->count without the lock or READ_ONCE() though, I assume that's
coming from an assumption that pcp is owned by the current CPU and
that's the only one that modifies it? Even if that's accurate it seems
like an unnecessary optimisation to me.

Cheers,
Brendan


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 1/4] mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection
  2025-09-24 20:44 ` [PATCH v2 1/4] mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection Joshua Hahn
  2025-09-24 22:51   ` Christoph Lameter (Ampere)
@ 2025-09-26 15:34   ` Dan Carpenter
  2025-09-26 16:40     ` Joshua Hahn
  2025-09-26 17:50   ` SeongJae Park
  2 siblings, 1 reply; 33+ messages in thread
From: Dan Carpenter @ 2025-09-26 15:34 UTC (permalink / raw)
  To: oe-kbuild, Joshua Hahn, Andrew Morton, Johannes Weiner
  Cc: lkp, oe-kbuild-all, Linux Memory Management List, Chris Mason,
	Kiryl Shutsemau, Liam R. Howlett, Brendan Jackman,
	David Hildenbrand, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Zi Yan, linux-kernel,
	kernel-team

Hi Joshua,

kernel test robot noticed the following build warnings:

url:    https://github.com/intel-lab-lkp/linux/commits/Joshua-Hahn/mm-page_alloc-vmstat-Simplify-refresh_cpu_vm_stats-change-detection/20250925-044532
base:   097a6c336d0080725c626fda118ecfec448acd0f
patch link:    https://lore.kernel.org/r/20250924204409.1706524-2-joshua.hahnjy%40gmail.com
patch subject: [PATCH v2 1/4] mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection
config: mips-randconfig-r073-20250925 (https://download.01.org/0day-ci/archive/20250926/202509260132.awvdNKqF-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project cafc064fc7a96b3979a023ddae1da2b499d6c954)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
| Closes: https://lore.kernel.org/r/202509260132.awvdNKqF-lkp@intel.com/

smatch warnings:
mm/page_alloc.c:2591 decay_pcp_high() error: uninitialized symbol 'todo'.

vim +/todo +2591 mm/page_alloc.c

06fb80866d37b0 Joshua Hahn    2025-09-24  2564  bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
51a755c56dc05a Ying Huang     2023-10-16  2565  {
51a755c56dc05a Ying Huang     2023-10-16  2566  	int high_min, to_drain, batch;
06fb80866d37b0 Joshua Hahn    2025-09-24  2567  	bool todo;

needs to be initialized to false.

51a755c56dc05a Ying Huang     2023-10-16  2568  
51a755c56dc05a Ying Huang     2023-10-16  2569  	high_min = READ_ONCE(pcp->high_min);
51a755c56dc05a Ying Huang     2023-10-16  2570  	batch = READ_ONCE(pcp->batch);
51a755c56dc05a Ying Huang     2023-10-16  2571  	/*
51a755c56dc05a Ying Huang     2023-10-16  2572  	 * Decrease pcp->high periodically to try to free possible
51a755c56dc05a Ying Huang     2023-10-16  2573  	 * idle PCP pages.  And, avoid to free too many pages to
51a755c56dc05a Ying Huang     2023-10-16  2574  	 * control latency.  This caps pcp->high decrement too.
51a755c56dc05a Ying Huang     2023-10-16  2575  	 */
51a755c56dc05a Ying Huang     2023-10-16  2576  	if (pcp->high > high_min) {
51a755c56dc05a Ying Huang     2023-10-16  2577  		pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
51a755c56dc05a Ying Huang     2023-10-16  2578  				 pcp->high - (pcp->high >> 3), high_min);
51a755c56dc05a Ying Huang     2023-10-16  2579  		if (pcp->high > high_min)
06fb80866d37b0 Joshua Hahn    2025-09-24  2580  			todo = true;
51a755c56dc05a Ying Huang     2023-10-16  2581  	}
51a755c56dc05a Ying Huang     2023-10-16  2582  
51a755c56dc05a Ying Huang     2023-10-16  2583  	to_drain = pcp->count - pcp->high;
51a755c56dc05a Ying Huang     2023-10-16  2584  	if (to_drain > 0) {
51a755c56dc05a Ying Huang     2023-10-16  2585  		spin_lock(&pcp->lock);
51a755c56dc05a Ying Huang     2023-10-16  2586  		free_pcppages_bulk(zone, to_drain, pcp, 0);
51a755c56dc05a Ying Huang     2023-10-16  2587  		spin_unlock(&pcp->lock);
06fb80866d37b0 Joshua Hahn    2025-09-24  2588  		todo = true;
51a755c56dc05a Ying Huang     2023-10-16  2589  	}
51a755c56dc05a Ying Huang     2023-10-16  2590  
51a755c56dc05a Ying Huang     2023-10-16 @2591  	return todo;
51a755c56dc05a Ying Huang     2023-10-16  2592  }

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone
  2025-09-26 14:01   ` Brendan Jackman
@ 2025-09-26 15:48     ` Joshua Hahn
  2025-09-26 16:57       ` Brendan Jackman
  0 siblings, 1 reply; 33+ messages in thread
From: Joshua Hahn @ 2025-09-26 15:48 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: Andrew Morton, Johannes Weiner, Chris Mason, Kiryl Shutsemau,
	Michal Hocko, Suren Baghdasaryan, Vlastimil Babka, Zi Yan,
	linux-kernel, linux-mm, kernel-team

On Fri, 26 Sep 2025 14:01:43 +0000 Brendan Jackman <jackmanb@google.com> wrote:

> On Wed Sep 24, 2025 at 8:44 PM UTC, Joshua Hahn wrote:
> > drain_pages_zone completely drains a zone of its pcp free pages by
> > repeatedly calling free_pcppages_bulk until pcp->count reaches 0.
> > In this loop, it already performs batched calls to ensure that
> > free_pcppages_bulk isn't called to free too many pages at once, and
> > relinquishes & reacquires the lock between each call to prevent
> > lock starvation from other processes.
> >
> > However, the current batching does not prevent lock starvation. The
> > current implementation creates batches of
> > pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX, which has been seen in
> > Meta workloads to be up to 64 << 5 == 2048 pages.
> >
> > While it is true that CONFIG_PCP_BATCH_SCALE_MAX is a config and
> > indeed can be adjusted by the system admin to be any number from
> > 0 to 6, its default value of 5 is still too high to be reasonable for
> > any system.
> >
> > Instead, let's create batches of pcp->batch pages, which gives a more
> > reasonable 64 pages per call to free_pcppages_bulk. This gives other
> > processes a chance to grab the lock and prevents starvation. Each
> > individual call to drain_pages_zone may take longer, but we avoid the
> > worst case scenario of completely starving out other system-critical
> > threads from acquiring the pcp lock while 2048 pages are freed
> > one-by-one.

Hello Brendan, thank you for your review!

> Hey Joshua, do you know why pcp->batch is a factor here at all? Until
> now I never really noticed it. I thought that this field was a kinda
> dynamic auto-tuning where we try to make the pcplists a more aggressive
> cache when they're being used a lot and then shrink them down when the
> allocator is under less load. But I don't have a good intuition for why
> that's relevant to drain_pages_zone(). Something to do with the amount
> of lock contention we expect?

From my understanding, pcp->batch is a value that can be used to batch
both allocation and freeing operations. For instance, drain_zone_pages
uses pcp->batch to ensure that we don't free too many pages at once,
which would lead to things like lock contention (I will address the
similarity between drain_zone_pages and drain_pages_zone at the end).

As for the purpose of batch and how its value is determined, I got my
understanding from this comment in zone_batchsize:

	 * ... The batch
	 * size is striking a balance between allocation latency
	 * and zone lock contention.

And based on this comment, I think a symmetric argument can be made for
freeing by just s/allocation latency/freeing latency above. My understanding
was that if we are allocating at a higher factor, we should also be freeing
at a higher factor to clean up those allocations faster as well, and it seems
like this is reflected in decay_pcp_high, where a higher batch means we
lower pcp->high to try and free up more pages.

Please let me know if my understanding of this area is incorrect here!
 
> Unless I'm just being stupid here, maybe a chance to add commentary.

I can definitely add some more context in the next version for this patch.
Actually, you are right -- reading back in my patch description, I've
motivated why we want batching, but not why pcp->batch is a good candidate
for this value. I'll definitely go back and clean it up!

> >
> > Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> > ---
> >  mm/page_alloc.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 77e7d9a5f149..b861b647f184 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2623,8 +2623,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
> >  		spin_lock(&pcp->lock);
> >  		count = pcp->count;
> >  		if (count) {
> > -			int to_drain = min(count,
> > -				pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
> > +			int to_drain = min(count, pcp->batch);
> 
> We actually don't need the min() here as free_pcppages_bulk() does that
> anyway. Not really related to the commit but maybe worth tidying that
> up.

Please correct me if I am missing something, but I think we still need the
min() here, since it takes the min of count and pcp->batch, while the
min in free_pcppages_bulk takes the min of the above result and pcp->count.
From what I can understand, the goal of the min() in free_pcppages_bulk
is to ensure that we don't try to free more pages than exist in the pcp
(hence the min with count), while the goal of my min() is to not free
too many pages at once.

> Also, it seems if we drop the BATCH_SCALE_MAX logic the inside of the
> loop is now very similar to drain_zone_pages(), maybe time to have them
> share some code and avoid the confusing name overlap? drain_zone_pages()
> reads pcp->count without the lock or READ_ONCE() though, I assume that's
> coming from an assumption that pcp is owned by the current CPU and
> that's the only one that modifies it? Even if that's accurate it seems
> like an unnecessary optimisation to me.

This makes a lot of sense to me. To be honest, I had a lot of confusion
as to why these functions were different as well, so combining these
two functions into one definitely sounds like a great change.

I'll make these revisions in the next version. Thank you for your valuable
feedback, this was very helpful! I hope you have a great day : -)
Joshua


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone
  2025-09-25 18:44     ` Joshua Hahn
@ 2025-09-26 16:21       ` Christoph Lameter (Ampere)
  2025-09-26 17:25         ` Joshua Hahn
  2025-10-01 11:23         ` Vlastimil Babka
  0 siblings, 2 replies; 33+ messages in thread
From: Christoph Lameter (Ampere) @ 2025-09-26 16:21 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Andrew Morton, Johannes Weiner, Chris Mason, Kiryl Shutsemau,
	Brendan Jackman, Michal Hocko, Suren Baghdasaryan,
	Vlastimil Babka, Zi Yan, linux-kernel, linux-mm, kernel-team

On Thu, 25 Sep 2025, Joshua Hahn wrote:

> > So we need an explanation as to why there is such high contention on the
> > lock first before changing the logic here.
> >
> > The current logic seems to be designed to prevent the lock contention you
> > are seeing.
>
> This is true, but my concern was mostly with the value that is being used
> for the batching (2048 seems too high). But as I explain below, it seems
> like the min(2048, count) operation is a no-op anyways, since it is never
> called with count > 1000 (at least from the benchmarks that I was running,
> on my machine).


The problem is that you likely increase zone lock contention with a
reduced batch size.

Actually that there is a lock in the pcp structure is weird and causes
cacheline bouncing on such hot paths. Access should be only from the cpu
that owns this structure. Remote cleaning (if needed) can be triggered via
IPIs.

This is the way it used to be and the way it was tested for high core
counts years ago.

You seem to run 176 cores here so it's similar to what we tested way back
when. If all cores are accessing the pcp structure then you have
significant cacheline bouncing. Removing the lock and going back to the
IPI solution would likely remove the problem.

The cachelines of the allocator per cpu structures are usually very hot
and should only be touched in rare circumstances from other cpus.

Having a loop over all processors accessing all the hot per-cpu structures
is likely causing significant performance issues and therefore the issues
that you are seeing here.





^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 1/4] mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection
  2025-09-26 15:34   ` Dan Carpenter
@ 2025-09-26 16:40     ` Joshua Hahn
  0 siblings, 0 replies; 33+ messages in thread
From: Joshua Hahn @ 2025-09-26 16:40 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: oe-kbuild, Andrew Morton, Johannes Weiner, lkp, oe-kbuild-all,
	Linux Memory Management List, Chris Mason, Kiryl Shutsemau,
	Liam R. Howlett, Brendan Jackman, David Hildenbrand,
	Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Suren Baghdasaryan,
	Vlastimil Babka, Zi Yan, linux-kernel, kernel-team

Hi Dan,

On Fri, Sep 26, 2025 at 11:34 AM Dan Carpenter <dan.carpenter@linaro.org> wrote:
>
> Hi Joshua,
>
> kernel test robot noticed the following build warnings:
>
> url:    https://github.com/intel-lab-lkp/linux/commits/Joshua-Hahn/mm-page_alloc-vmstat-Simplify-refresh_cpu_vm_stats-change-detection/20250925-044532
> base:   097a6c336d0080725c626fda118ecfec448acd0f
> patch link:    https://lore.kernel.org/r/20250924204409.1706524-2-joshua.hahnjy%40gmail.com
> patch subject: [PATCH v2 1/4] mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection
> config: mips-randconfig-r073-20250925 (https://download.01.org/0day-ci/archive/20250926/202509260132.awvdNKqF-lkp@intel.com/config)
> compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project cafc064fc7a96b3979a023ddae1da2b499d6c954)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
> | Closes: https://lore.kernel.org/r/202509260132.awvdNKqF-lkp@intel.com/
>
> smatch warnings:
> mm/page_alloc.c:2591 decay_pcp_high() error: uninitialized symbol 'todo'.
>
> vim +/todo +2591 mm/page_alloc.c
>
> 06fb80866d37b0 Joshua Hahn    2025-09-24  2564  bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
> 51a755c56dc05a Ying Huang     2023-10-16  2565  {
> 51a755c56dc05a Ying Huang     2023-10-16  2566          int high_min, to_drain, batch;
> 06fb80866d37b0 Joshua Hahn    2025-09-24  2567          bool todo;
>
> needs to be initialized to false.

Thank you for the catch, I think Christoph also pointed out the same
thing earlier.
I'll be sure to fix this up in the next version! Have a great day!
Joshua


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone
  2025-09-26 15:48     ` Joshua Hahn
@ 2025-09-26 16:57       ` Brendan Jackman
  2025-09-26 17:33         ` Joshua Hahn
  0 siblings, 1 reply; 33+ messages in thread
From: Brendan Jackman @ 2025-09-26 16:57 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Andrew Morton, Johannes Weiner, Chris Mason, Kiryl Shutsemau,
	Michal Hocko, Suren Baghdasaryan, Vlastimil Babka, Zi Yan,
	linux-kernel, linux-mm, kernel-team

On Fri Sep 26, 2025 at 3:48 PM UTC, Joshua Hahn wrote:
> On Fri, 26 Sep 2025 14:01:43 +0000 Brendan Jackman <jackmanb@google.com> wrote:
>> Hey Joshua, do you know why pcp->batch is a factor here at all? Until
>> now I never really noticed it. I thought that this field was a kinda
>> dynamic auto-tuning where we try to make the pcplists a more aggressive
>> cache when they're being used a lot and then shrink them down when the
>> allocator is under less load. But I don't have a good intuition for why
>> that's relevant to drain_pages_zone(). Something to do with the amount
>> of lock contention we expect?
>
> From my understanding, pcp->batch is a value that can be used to batch
> both allocation and freeing operations. For instance, drain_zone_pages
> uses pcp->batch to ensure that we don't free too many pages at once,
> which would lead to things like lock contention (I will address the
> similarity between drain_zone_pages and drain_pages_zone at the end).
>
> As for the purpose of batch and how its value is determined, I got my
> understanding from this comment in zone_batchsize:
>
> 	 * ... The batch
> 	 * size is striking a balance between allocation latency
> 	 * and zone lock contention.
>
> And based on this comment, I think a symmetric argument can be made for
> freeing by just s/allocation latency/freeing latency above. My understanding
> was that if we are allocating at a higher factor, we should also be freeing
> at a higher factor to clean up those allocations faster as well, and it seems
> like this is reflected in decay_pcp_high, where a higher batch means we
> lower pcp->high to try and free up more pages.

Hmm thanks, now I'm reading it again I think I was not clear in my head
on how ->batch is used. It's more like a kinda static "batchiness"
parameter that informs the dynamic scaling stuff rather than being an
output of it, in that context it's less surprising that the drain code
cares about it.

> Please let me know if my understanding of this area is incorrect here!
>  
>> Unless I'm just being stupid here, maybe a chance to add commentary.
>
> I can definitely add some more context in the next version for this patch.
> Actually, you are right -- reading back in my patch description, I've
> motivated why we want batching, but not why pcp->batch is a good candidate
> for this value. I'll definitely go back and clean it up!
>
>> >
>> > Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
>> > ---
>> >  mm/page_alloc.c | 3 +--
>> >  1 file changed, 1 insertion(+), 2 deletions(-)
>> >
>> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> > index 77e7d9a5f149..b861b647f184 100644
>> > --- a/mm/page_alloc.c
>> > +++ b/mm/page_alloc.c
>> > @@ -2623,8 +2623,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
>> >  		spin_lock(&pcp->lock);
>> >  		count = pcp->count;
>> >  		if (count) {
>> > -			int to_drain = min(count,
>> > -				pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
>> > +			int to_drain = min(count, pcp->batch);
>> 
>> We actually don't need the min() here as free_pcppages_bulk() does that
>> anyway. Not really related to the commit but maybe worth tidying that
>> up.
>
> Please correct me if I am missing something, but I think we still need the
> min() here, since it takes the min of count and pcp->batch, while the
> min in free_pcppages_bulk takes the min of the above result and pcp->count.

Hold on, what's the difference between count and pcp->count here?

> From what I can understand, the goal of the min() in free_pcppages_bulk
> is to ensure that we don't try to free more pages than exist in the pcp
> (hence the min with count), while the goal of my min() is to not free
> too many pages at once.

Yeah, I think we're in agreement about the intent, it's just that one of
us is misreading the code (and I think it might be me, I will probably
be more certain on Monday!).


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone
  2025-09-26 16:21       ` Christoph Lameter (Ampere)
@ 2025-09-26 17:25         ` Joshua Hahn
  2025-10-01 11:23         ` Vlastimil Babka
  1 sibling, 0 replies; 33+ messages in thread
From: Joshua Hahn @ 2025-09-26 17:25 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Andrew Morton, Johannes Weiner, Chris Mason, Kiryl Shutsemau,
	Brendan Jackman, Michal Hocko, Suren Baghdasaryan,
	Vlastimil Babka, Zi Yan, linux-kernel, linux-mm, kernel-team

On Fri, 26 Sep 2025 09:21:29 -0700 (PDT) "Christoph Lameter (Ampere)" <cl@gentwo.org> wrote:

> On Thu, 25 Sep 2025, Joshua Hahn wrote:
> 
> > > So we need an explanation as to why there is such high contention on the
> > > lock first before changing the logic here.
> > >
> > > The current logic seems to be designed to prevent the lock contention you
> > > are seeing.
> >
> > This is true, but my concern was mostly with the value that is being used
> > for the batching (2048 seems too high). But as I explain below, it seems
> > like the min(2048, count) operation is a no-op anyways, since it is never
> > called with count > 1000 (at least from the benchmarks that I was running,
> > on my machine).
> 

Hello Christoph, thank you for your feedback!

> The problem is that you likely increase zone lock contention with a
> reduced batch size.

I see, so is the suggestion here that by reducing the batch size, we
increase zone lock contention because we are bouncing more often and
having to spend more time contending on the zone lock?

> Actually that there is a lock in the pcp structure is weird and causes
> cacheline bouncing on such hot paths. Access should be only from the cpu
> that owns this structure. Remote cleaning (if needed) can be triggered via
> IPIs.
> 
> This is the way it used to be and the way it was tested for high core
> counts years ago.
> 
> You seem to run 176 cores here so it's similar to what we tested way back
> when. If all cores are accessing the pcp structure then you have
> significant cacheline bouncing. Removing the lock and going back to the
> IPI solution would likely remove the problem.
> 
> The cachelines of the allocator per cpu structures are usually very hot
> and should only be touched in rare circumstances from other cpus.
> 
> Having a loop over all processors accessing all the hot per-cpu structures
> is likely causing significant performance issues and therefore the issues
> that you are seeing here.

So just to be explicit about this -- in my last response, I noted that
drain_pages_zone really isn't a hot path. In fact, it's never called
with count > 1000. The real offenders are in the last 2 patches of this
series (decay_pcp_high and free_frozen_page_commit), and I believe these
callers don't have the same problem of iterating over the CPUs.
free_frozen_page_commit only iterates through the zones
(from free_unref_folios), so I don't think it has the same cacheline
bouncing issue as drain_pages_zone.

So perhaps drain_pages_zone is a good candidate to leave the batching
size as is, i.e. pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX. What do you
think? Again, happy to remove this patch from the series as all of the
proposed performance gains and lock contention reduction in the cover
letter come from the last 2 patches of this series.

Thank you again for your thoughtful response, have a great day!
Joshua


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone
  2025-09-26 16:57       ` Brendan Jackman
@ 2025-09-26 17:33         ` Joshua Hahn
  0 siblings, 0 replies; 33+ messages in thread
From: Joshua Hahn @ 2025-09-26 17:33 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: Andrew Morton, Johannes Weiner, Chris Mason, Kiryl Shutsemau,
	Michal Hocko, Suren Baghdasaryan, Vlastimil Babka, Zi Yan,
	linux-kernel, linux-mm, kernel-team

On Fri, 26 Sep 2025 16:57:16 +0000 Brendan Jackman <jackmanb@google.com> wrote:

> On Fri Sep 26, 2025 at 3:48 PM UTC, Joshua Hahn wrote:
> > On Fri, 26 Sep 2025 14:01:43 +0000 Brendan Jackman <jackmanb@google.com> wrote:
> >> Hey Joshua, do you know why pcp->batch is a factor here at all? Until
> >> now I never really noticed it. I thought that this field was a kinda
> >> dynamic auto-tuning where we try to make the pcplists a more aggressive
> >> cache when they're being used a lot and then shrink them down when the
> >> allocator is under less load. But I don't have a good intuition for why
> >> that's relevant to drain_pages_zone(). Something to do with the amount
> >> of lock contention we expect?
> >
> > From my understanding, pcp->batch is a value that can be used to batch
> > both allocation and freeing operations. For instance, drain_zone_pages
> > uses pcp->batch to ensure that we don't free too many pages at once,
> > which would lead to things like lock contention (I will address the
> > similarity between drain_zone_pages and drain_pages_zone at the end).
> >
> > As for the purpose of batch and how its value is determined, I got my
> > understanding from this comment in zone_batchsize:
> >
> > 	 * ... The batch
> > 	 * size is striking a balance between allocation latency
> > 	 * and zone lock contention.

Hi Brendan, thanks for your feedback!

> > And based on this comment, I think a symmetric argument can be made for
> > freeing by just s/allocation latency/freeing latency above. My understanding
> > was that if we are allocating at a higher factor, we should also be freeing
> > at a higher factor to clean up those allocations faster as well, and it seems
> > like this is reflected in decay_pcp_high, where a higher batch means we
> > lower pcp->high to try and free up more pages.
> 
> Hmm thanks, now I'm reading it again I think I was not clear in my head
> on how ->batch is used. It's more like a kinda static "batchiness"
> parameter that informs the dynamic scaling stuff rather than being an
> output of it, in that context it's less surprising that the drain code
> cares about it.

It also took me a while to understand how all of the pcp high, count,
batch, etc. tuning worked. But yes, from my understanding, batch is used
as the parameter that helps inform things like what value to decay
pcp->high to, as well as how many pages to free, how many pages to leave
remaining in the pcp, among other operations. 
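
For reference, the rough shape of the logic I am describing in
decay_pcp_high() is something like this (paraphrased sketch from memory,
not the exact upstream code):

	/*
	 * Sketch: batch bounds how aggressively pcp->high decays.
	 * pcp->high is cut by ~1/8 per pass, but clamped so that at most
	 * one max-scaled batch (batch << CONFIG_PCP_BATCH_SCALE_MAX) is
	 * left to be freed afterwards, and it never drops below high_min.
	 */
	if (pcp->high > high_min)
		pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
				 pcp->high - (pcp->high >> 3),
				 high_min);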

> > Please let me know if my understanding of this area is incorrect here!
> >  
> >> Unless I'm just being stupid here, maybe a chance to add commentary.
> >
> > I can definitely add some more context in the next version for this patch.
> > Actually, you are right -- reading back in my patch description, I've
> > motivated why we want batching, but not why pcp->batch is a good candidate
> > for this value. I'll definitely go back and clean it up!
> >
> >> >
> >> > Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> >> > ---
> >> >  mm/page_alloc.c | 3 +--
> >> >  1 file changed, 1 insertion(+), 2 deletions(-)
> >> >
> >> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> > index 77e7d9a5f149..b861b647f184 100644
> >> > --- a/mm/page_alloc.c
> >> > +++ b/mm/page_alloc.c
> >> > @@ -2623,8 +2623,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
> >> >  		spin_lock(&pcp->lock);
> >> >  		count = pcp->count;
> >> >  		if (count) {
> >> > -			int to_drain = min(count,
> >> > -				pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
> >> > +			int to_drain = min(count, pcp->batch);
> >> 
> >> We actually don't need the min() here as free_pcppages_bulk() does that
> >> anyway. Not really related to the commit but maybe worth tidying that
> >> up.
> >
> > Please correct me if I am missing something, but I think we still need the
> > min() here, since it takes the min of count and pcp->batch, while the
> > min in free_pcppages_bulk takes the min of the above result and pcp->count.
> 
> Hold on, what's the difference between count and pcp->count here?

Hm, sorry. I think I also confused myself here. count is indeed pcp->count.
So what is basically happening here is
	min3(pcp->count, pcp->batch, pcp->count)

... and you are absolutely right that one instance of pcp->count is
unnecessary here. So it seems like the solution is just to pass
pcp->batch to free_pcppages_bulk, and that way we can have only one
instance of pcp->count!
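
Something like this rough sketch is what I mean, based on the hunk
quoted above (illustrative only, not a tested patch):

	spin_lock(&pcp->lock);
	count = pcp->count;
	if (count) {
		/* free_pcppages_bulk() already clamps this to pcp->count */
		int to_drain = pcp->batch;

		free_pcppages_bulk(zone, to_drain, pcp, 0);
		count -= min(to_drain, count);
	}
	spin_unlock(&pcp->lock);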

I think I confused myself because I thought you meant we don't need the
pcp->batch portion of this min(), not the pcp->count!

With all of this said, as noted in my conversation with Christoph I think
this patch may be unnecessary since this is not a source of contention
or lock monopoly anyways (in fact, it may even be harmful to batch it
to a lower value here).

> > From what I can understand, the goal of the min() in free_pcppages_bulk
> > is to ensure that we don't try to free more pages than exist in the pcp
> > (hence the min with count), while the goal of my min() is to not free
> > too many pages at once.
> 
> Yeah, I think we're in agreement about the intent, it's just that one of
> us is misreading the code (and I think it might be me, I will probably
> be more certain on Monday!).

It was me : -)

Thank you for taking the time to review this. I hope you have a great day!
Joshua


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 1/4] mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection
  2025-09-24 20:44 ` [PATCH v2 1/4] mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection Joshua Hahn
  2025-09-24 22:51   ` Christoph Lameter (Ampere)
  2025-09-26 15:34   ` Dan Carpenter
@ 2025-09-26 17:50   ` SeongJae Park
  2025-09-26 18:24     ` Joshua Hahn
  2 siblings, 1 reply; 33+ messages in thread
From: SeongJae Park @ 2025-09-26 17:50 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: SeongJae Park, Andrew Morton, Johannes Weiner, Chris Mason,
	Kiryl Shutsemau, Liam R. Howlett, Brendan Jackman,
	David Hildenbrand, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Zi Yan, linux-kernel,
	linux-mm, kernel-team

On Wed, 24 Sep 2025 13:44:05 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> Currently, refresh_cpu_vm_stats returns an int, indicating how many
> changes were made during its updates. Using this information, callers
> like vmstat_update can heuristically determine if more work will be done
> in the future.
> 
> However, all of refresh_cpu_vm_stats's callers either (a) ignore the
> result, only caring about performing the updates, or (b) only care about
> whether changes were made, but not *how many* changes were made.
> 
> Simplify the code by returning a bool instead to indicate if updates
> were made.
> 
> In addition, simplify fold_diff and decay_pcp_high to return a bool
> for the same reason.
> 
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> ---
>  include/linux/gfp.h |  2 +-
>  mm/page_alloc.c     |  8 ++++----
>  mm/vmstat.c         | 26 +++++++++++++-------------
>  3 files changed, 18 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 5ebf26fcdcfa..63c72cb1d117 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -386,7 +386,7 @@ extern void free_pages(unsigned long addr, unsigned int order);
>  #define free_page(addr) free_pages((addr), 0)
>  
>  void page_alloc_init_cpuhp(void);
> -int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
> +bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
>  void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
>  void drain_all_pages(struct zone *zone);
>  void drain_local_pages(struct zone *zone);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d1d037f97c5f..77e7d9a5f149 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2561,10 +2561,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>   * Called from the vmstat counter updater to decay the PCP high.
>   * Return whether there are addition works to do.
>   */
> -int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
> +bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
>  {
>  	int high_min, to_drain, batch;
> -	int todo = 0;
> +	bool todo;

I know you and others already found 'todo' should be initialized. :)

[...]
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 71cd1ceba191..1f74a3517ab2 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
[...]
> @@ -839,7 +839,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
>  		if (do_pagesets) {
>  			cond_resched();
>  
> -			changes += decay_pcp_high(zone, this_cpu_ptr(pcp));
> +			changed |= decay_pcp_high(zone, this_cpu_ptr(pcp));

I'm not a fan of bit operations unless they provide clear benefits.
What about below?

    if (decay_pcp_high(zone, this_cpu_ptr(pcp)) && !changed)
    	changed = true;

Just a personal and trivial taste.  No strong opinion.  If you don't strongly
feel my suggestion is better, please keep the original code.

>  #ifdef CONFIG_NUMA
>  			/*
>  			 * Deal with draining the remote pageset of this
> @@ -861,13 +861,13 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
>  			}
>  
>  			if (__this_cpu_dec_return(pcp->expire)) {
> -				changes++;
> +				changed = true;
>  				continue;
>  			}
>  
>  			if (__this_cpu_read(pcp->count)) {
>  				drain_zone_pages(zone, this_cpu_ptr(pcp));
> -				changes++;
> +				changed = true;
>  			}
>  #endif
>  		}
> @@ -887,8 +887,8 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
>  		}
>  	}
>  
> -	changes += fold_diff(global_zone_diff, global_node_diff);
> -	return changes;
> +	changed |= fold_diff(global_zone_diff, global_node_diff);
> +	return changed;

Ditto.

>  }
>  
>  /*
> -- 
> 2.47.3

Other than the above trivial things, all looks good to me :)


Thanks,
SJ


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 1/4] mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection
  2025-09-26 17:50   ` SeongJae Park
@ 2025-09-26 18:24     ` Joshua Hahn
  2025-09-26 18:33       ` SeongJae Park
  0 siblings, 1 reply; 33+ messages in thread
From: Joshua Hahn @ 2025-09-26 18:24 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, Johannes Weiner, Chris Mason, Kiryl Shutsemau,
	Liam R. Howlett, Brendan Jackman, David Hildenbrand,
	Lorenzo Stoakes, Michal Hocko, Mike Rapoport, Suren Baghdasaryan,
	Vlastimil Babka, Zi Yan, linux-kernel, linux-mm, kernel-team

Hi SJ, thank you for reviewing my patch!

> >  void page_alloc_init_cpuhp(void);
> > -int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
> > +bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
> >  void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
> >  void drain_all_pages(struct zone *zone);
> >  void drain_local_pages(struct zone *zone);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index d1d037f97c5f..77e7d9a5f149 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2561,10 +2561,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> >   * Called from the vmstat counter updater to decay the PCP high.
> >   * Return whether there are addition works to do.
> >   */
> > -int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
> > +bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
> >  {
> >  	int high_min, to_drain, batch;
> > -	int todo = 0;
> > +	bool todo;
> 
> I know you and others already found 'todo' should be initialized. :)

Yes, I'll be sure to fix this in the next version! : -)

> [...]
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 71cd1ceba191..1f74a3517ab2 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> [...]
> > @@ -839,7 +839,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
> >  		if (do_pagesets) {
> >  			cond_resched();
> >  
> > -			changes += decay_pcp_high(zone, this_cpu_ptr(pcp));
> > +			changed |= decay_pcp_high(zone, this_cpu_ptr(pcp));
> 
> I'm not a fan of bit operations unless they provide clear benefits.
> What about below?
> 
>     if (decay_pcp_high(zone, this_cpu_ptr(pcp)) && !changed)
>     	changed = true;

Here, what if I change it to just:

	if (decay_pcp_high(zone, this_cpu_ptr(pcp)))
		changed = true;

Since even if changed == true already, this will be a no-op.

> Just a personal and trivial taste.  No strong opinion.  If you don't strongly
> feel my suggestion is better, please keep the original code.

I feel like if someone (you) feels that the bitwise operations here make it
less clear what the code is doing, others may feel the same way as well!
Happy to make the change to hopefully make it more easily understandable
what is happening. 

> >  #ifdef CONFIG_NUMA
> >  			/*
> >  			 * Deal with draining the remote pageset of this
> > @@ -861,13 +861,13 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
> >  			}
> >  
> >  			if (__this_cpu_dec_return(pcp->expire)) {
> > -				changes++;
> > +				changed = true;
> >  				continue;
> >  			}
> >  
> >  			if (__this_cpu_read(pcp->count)) {
> >  				drain_zone_pages(zone, this_cpu_ptr(pcp));
> > -				changes++;
> > +				changed = true;
> >  			}
> >  #endif
> >  		}
> > @@ -887,8 +887,8 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
> >  		}
> >  	}
> >  
> > -	changes += fold_diff(global_zone_diff, global_node_diff);
> > -	return changes;
> > +	changed |= fold_diff(global_zone_diff, global_node_diff);
> > +	return changed;
> 
> Ditto.

Will make the same change here!

> >  }
> >  
> >  /*
> > -- 
> > 2.47.3
> 
> Other than the above trivial things, all looks good to me :)

Thanks SJ, I hope you have a great day!
Joshua

> Thanks,
> SJ


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 1/4] mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection
  2025-09-26 18:24     ` Joshua Hahn
@ 2025-09-26 18:33       ` SeongJae Park
  0 siblings, 0 replies; 33+ messages in thread
From: SeongJae Park @ 2025-09-26 18:33 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: SeongJae Park, Andrew Morton, Johannes Weiner, Chris Mason,
	Kiryl Shutsemau, Liam R. Howlett, Brendan Jackman,
	David Hildenbrand, Lorenzo Stoakes, Michal Hocko, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Zi Yan, linux-kernel,
	linux-mm, kernel-team

On Fri, 26 Sep 2025 11:24:37 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
[...]
> > > @@ -839,7 +839,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
> > >  		if (do_pagesets) {
> > >  			cond_resched();
> > >  
> > > -			changes += decay_pcp_high(zone, this_cpu_ptr(pcp));
> > > +			changed |= decay_pcp_high(zone, this_cpu_ptr(pcp));
> > 
> > I'm not a fan of bit operations unless they provide clear benefits.
> > What about below?
> > 
> >     if (decay_pcp_high(zone, this_cpu_ptr(pcp)) && !changed)
> >     	changed = true;
> 
> Here, what if I change it to just:
> 
> 	if (decay_pcp_high(zone, this_cpu_ptr(pcp)))
> 		changed = true;

Looks nice to me! :)

> 
> Since even if changed == true already, this will be a no-op.

I was thinking the compiler might or might not emit unnecessary writes, but I
don't really care about that.  Your suggested version looks good to me :)

> 
> > Just a personal and trivial taste.  No strong opinion.  If you don't strongly
> > feel my suggestion is better, please keep the original code.
> 
> I feel like if someone (you) feels that the bitwise operations here make it
> less clear what the code is doing, others may feel the same way as well!
> Happy to make the change to hopefully make it more easily understandable
> what is happening. 

Thank you Joshua!

[...]
> Thanks SJ, I hope you have a great day!

You too!  Plus, great weekend!


Thanks,
SJ

[...]


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone
  2025-09-24 20:44 ` [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone Joshua Hahn
  2025-09-24 23:09   ` Christoph Lameter (Ampere)
  2025-09-26 14:01   ` Brendan Jackman
@ 2025-09-27  0:46   ` Hillf Danton
  2025-09-30 14:42     ` Joshua Hahn
  2 siblings, 1 reply; 33+ messages in thread
From: Hillf Danton @ 2025-09-27  0:46 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Andrew Morton, Johannes Weiner, linux-kernel, linux-mm, kernel-team

On Wed, 24 Sep 2025 13:44:06 -0700 Joshua Hahn wrote:
> drain_pages_zone completely drains a zone of its pcp free pages by
> repeatedly calling free_pcppages_bulk until pcp->count reaches 0.
> In this loop, it already performs batched calls to ensure that
> free_pcppages_bulk isn't called to free too many pages at once, and
> relinquishes & reacquires the lock between each call to prevent
> lock starvation from other processes.
> 
> However, the current batching does not prevent lock starvation. The
> current implementation creates batches of
> pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX, which has been seen in
> Meta workloads to be up to 64 << 5 == 2048 pages.
> 
> While it is true that CONFIG_PCP_BATCH_SCALE_MAX is a config and
> indeed can be adjusted by the system admin to be any number from
> 0 to 6, it's default value of 5 is still too high to be reasonable for
> any system.
> 
> Instead, let's create batches of pcp->batch pages, which gives a more
> reasonable 64 pages per call to free_pcppages_bulk. This gives other
> processes a chance to grab the lock and prevents starvation. Each

Feel free to make it clear, which lock is contended, pcp->lock or
zone->lock, or both, to help understand the starvation.

If the zone lock is hot, why did NUMA nodes fail to mitigate the contention,
given workloads tested with high sustained memory pressure on large machines
in the Meta fleet (1Tb memory, 316 CPUs)?

Can the contention be observed under tight memory pressure, but not under
extremely tight pressure? If not, it is due to a misconfiguration in user
space, no?

> individual call to drain_pages_zone may take longer, but we avoid the
> worst case scenario of completely starving out other system-critical
> threads from acquiring the pcp lock while 2048 pages are freed
> one-by-one.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 4/4] mm/page_alloc: Batch page freeing in free_frozen_page_commit
  2025-09-24 20:44 ` [PATCH v2 4/4] mm/page_alloc: Batch page freeing in free_frozen_page_commit Joshua Hahn
@ 2025-09-28  5:17   ` kernel test robot
  2025-09-29 15:17     ` Joshua Hahn
  0 siblings, 1 reply; 33+ messages in thread
From: kernel test robot @ 2025-09-28  5:17 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: oe-lkp, lkp, Chris Mason, Johannes Weiner, linux-mm,
	Andrew Morton, Kiryl Shutsemau, Brendan Jackman, Michal Hocko,
	Suren Baghdasaryan, Vlastimil Babka, Zi Yan, linux-kernel,
	kernel-team, oliver.sang



Hello,

kernel test robot noticed "WARNING:bad_unlock_balance_detected" on:

commit: 7e86100bfb0d65a17f3228a9af4c2a49ac38f057 ("[PATCH v2 4/4] mm/page_alloc: Batch page freeing in free_frozen_page_commit")
url: https://github.com/intel-lab-lkp/linux/commits/Joshua-Hahn/mm-page_alloc-vmstat-Simplify-refresh_cpu_vm_stats-change-detection/20250925-044532
patch link: https://lore.kernel.org/all/20250924204409.1706524-5-joshua.hahnjy@gmail.com/
patch subject: [PATCH v2 4/4] mm/page_alloc: Batch page freeing in free_frozen_page_commit

in testcase: trinity
version: 
with following parameters:

	runtime: 300s
	group: group-03
	nr_groups: 5



config: x86_64-randconfig-161-20250927
compiler: gcc-14
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202509281204.3086f707-lkp@intel.com


[  414.880298][ T7549] WARNING: bad unlock balance detected!
[  414.881071][ T7549] 6.17.0-rc6-00147-g7e86100bfb0d #1 Not tainted
[  414.881924][ T7549] -------------------------------------
[  414.882695][ T7549] date/7549 is trying to release lock (&pcp->lock) at:
[ 414.883649][ T7549] free_frozen_page_commit+0x425/0x9d0 
[  414.884764][ T7549] but there are no more locks to release!
[  414.885539][ T7549]
[  414.885539][ T7549] other info that might help us debug this:
[  414.886704][ T7549] 2 locks held by date/7549:
[ 414.887353][ T7549] #0: ffff888104f29940 (&mm->mmap_lock){++++}-{4:4}, at: exit_mmap (include/linux/seqlock.h:431 include/linux/mmap_lock.h:88 include/linux/mmap_lock.h:398 mm/mmap.c:1288) 
[ 414.888591][ T7549] #1: ffff8883ae40e858 (&pcp->lock){+.+.}-{3:3}, at: free_frozen_page_commit+0x46a/0x9d0 
[  414.890003][ T7549]
[  414.890003][ T7549] stack backtrace:
[  414.890867][ T7549] CPU: 0 UID: 0 PID: 7549 Comm: date Not tainted 6.17.0-rc6-00147-g7e86100bfb0d #1 PREEMPT
[  414.890878][ T7549] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[  414.890882][ T7549] Call Trace:
[  414.890887][ T7549]  <TASK>
[ 414.890892][ T7549] dump_stack_lvl (lib/dump_stack.c:122) 
[ 414.890904][ T7549] ? free_frozen_page_commit+0x425/0x9d0 
[ 414.890914][ T7549] print_unlock_imbalance_bug.cold (kernel/locking/lockdep.c:5301) 
[ 414.890923][ T7549] ? free_frozen_page_commit+0x425/0x9d0 
[ 414.890929][ T7549] lock_release (kernel/locking/lockdep.c:470 (discriminator 4) kernel/locking/lockdep.c:5891 (discriminator 4)) 
[ 414.890936][ T7549] _raw_spin_unlock (include/linux/spinlock_api_smp.h:142 kernel/locking/spinlock.c:186) 
[ 414.890945][ T7549] free_frozen_page_commit+0x425/0x9d0 
[ 414.890955][ T7549] free_unref_folios (mm/page_alloc.c:3051 (discriminator 1)) 
[ 414.890963][ T7549] ? mark_held_locks (kernel/locking/lockdep.c:4325 (discriminator 1)) 
[ 414.890975][ T7549] folios_put_refs (mm/swap.c:999) 
[ 414.890985][ T7549] ? __folio_put (mm/swap.c:949) 
[ 414.890990][ T7549] ? check_bytes_and_report (mm/slub.c:1253) 
[ 414.890999][ T7549] ? check_bytes_and_report (mm/slub.c:1253) 
[ 414.891005][ T7549] ? look_up_lock_class (kernel/locking/lockdep.c:933 (discriminator 28)) 
[ 414.891015][ T7549] release_pages (mm/swap.c:1035) 
[ 414.891022][ T7549] ? folio_add_lru (mm/swap.c:1016) 
[ 414.891034][ T7549] ? __lock_acquire (kernel/locking/lockdep.c:5237 (discriminator 1)) 
[ 414.891054][ T7549] tlb_flush_mmu (mm/mmu_gather.c:137 mm/mmu_gather.c:149 mm/mmu_gather.c:397 mm/mmu_gather.c:404) 
[ 414.891063][ T7549] ? down_write (kernel/locking/rwsem.c:1588) 
[ 414.891073][ T7549] tlb_finish_mmu (mm/mmu_gather.c:157 mm/mmu_gather.c:500) 
[ 414.891080][ T7549] exit_mmap (mm/mmap.c:1300) 
[ 414.891088][ T7549] ? do_munmap (mm/mmap.c:1255) 
[ 414.891093][ T7549] ? kvm_sched_clock_read (arch/x86/kernel/kvmclock.c:91 (discriminator 2)) 
[ 414.891098][ T7549] ? local_clock_noinstr (kernel/sched/clock.c:304 (discriminator 1)) 
[ 414.891106][ T7549] ? __x64_sys_io_setup (fs/aio.c:893) 
[ 414.891115][ T7549] ? __mutex_unlock_slowpath (arch/x86/include/asm/atomic64_64.h:101 include/linux/atomic/atomic-arch-fallback.h:4329 include/linux/atomic/atomic-long.h:1506 include/linux/atomic/atomic-instrumented.h:4481 kernel/locking/mutex.c:939) 
[ 414.891128][ T7549] mmput (kernel/fork.c:1197 (discriminator 2) kernel/fork.c:1131 (discriminator 2) kernel/fork.c:1152 (discriminator 2)) 
[ 414.891136][ T7549] exit_mm (kernel/exit.c:583 (discriminator 1)) 
[ 414.891144][ T7549] do_exit (kernel/exit.c:956) 
[ 414.891149][ T7549] ? stack_not_used (kernel/exit.c:893) 
[ 414.891154][ T7549] ? do_group_exit (include/linux/spinlock.h:402 kernel/exit.c:1099) 
[ 414.891161][ T7549] do_group_exit (kernel/exit.c:1083) 
[ 414.891167][ T7549] __x64_sys_exit_group (kernel/exit.c:1111) 
[ 414.891173][ T7549] x64_sys_call (kbuild/obj/consumer/x86_64-randconfig-161-20250927/./arch/x86/include/generated/asm/syscalls_64.h:61) 
[ 414.891180][ T7549] do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1)) 
[ 414.891192][ T7549] ? lock_acquire (kernel/locking/lockdep.c:470 (discriminator 4) kernel/locking/lockdep.c:5870 (discriminator 4)) 
[ 414.891200][ T7549] ? kvm_sched_clock_read (arch/x86/kernel/kvmclock.c:91 (discriminator 2)) 
[ 414.891204][ T7549] ? local_clock_noinstr (kernel/sched/clock.c:304 (discriminator 1)) 
[ 414.891208][ T7549] ? local_clock (arch/x86/include/asm/preempt.h:95 kernel/sched/clock.c:319) 
[ 414.891216][ T7549] ? fput_close_sync (arch/x86/include/asm/atomic64_64.h:79 (discriminator 5) include/linux/atomic/atomic-arch-fallback.h:2913 (discriminator 5) include/linux/atomic/atomic-arch-fallback.h:3364 (discriminator 5) include/linux/atomic/atomic-long.h:698 (discriminator 5) include/linux/atomic/atomic-instrumented.h:3767 (discriminator 5) include/linux/file_ref.h:157 (discriminator 5) include/linux/file_ref.h:187 (discriminator 5) fs/file_table.c:572 (discriminator 5)) 
[ 414.891224][ T7549] ? alloc_file_clone (fs/file_table.c:571) 
[ 414.891231][ T7549] ? trace_irq_enable+0xba/0x100 
[ 414.891239][ T7549] ? do_syscall_64 (arch/x86/entry/syscall_64.c:113) 
[ 414.891247][ T7549] ? __vm_munmap (mm/vma.c:3155) 
[ 414.891254][ T7549] ? expand_downwards (mm/vma.c:3146) 
[ 414.891265][ T7549] ? handle_mm_fault (include/linux/perf_event.h:1596 mm/memory.c:6263 mm/memory.c:6390) 
[ 414.891272][ T7549] ? trace_irq_enable+0xba/0x100 
[ 414.891278][ T7549] ? do_syscall_64 (arch/x86/entry/syscall_64.c:113) 
[ 414.891284][ T7549] ? exc_page_fault (arch/x86/include/asm/irqflags.h:26 arch/x86/include/asm/irqflags.h:109 arch/x86/include/asm/irqflags.h:151 arch/x86/mm/fault.c:1484 arch/x86/mm/fault.c:1532) 
[ 414.891290][ T7549] ? do_user_addr_fault (arch/x86/include/asm/atomic.h:93 (discriminator 4) include/linux/atomic/atomic-arch-fallback.h:949 (discriminator 4) include/linux/atomic/atomic-instrumented.h:401 (discriminator 4) include/linux/refcount.h:389 (discriminator 4) include/linux/refcount.h:432 (discriminator 4) include/linux/mmap_lock.h:143 (discriminator 4) include/linux/mmap_lock.h:267 (discriminator 4) arch/x86/mm/fault.c:1338 (discriminator 4)) 
[ 414.891302][ T7549] ? trace_irq_enable+0xba/0x100 
[ 414.891307][ T7549] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4351 kernel/locking/lockdep.c:4410) 
[ 414.891315][ T7549] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) 
[  414.891322][ T7549] RIP: 0033:0x7f1f0ec31408
[ 414.891329][ T7549] Code: Unable to access opcode bytes at 0x7f1f0ec313de.

Code starting with the faulting instruction
===========================================
[  414.891332][ T7549] RSP: 002b:00007ffc16600ce8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[  414.891340][ T7549] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1f0ec31408
[  414.891344][ T7549] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
[  414.891348][ T7549] RBP: 00007f1f0ef25820 R08: 00000000000000e7 R09: ffffffffffffffa0
[  414.891352][ T7549] R10: 00007f1f0ef2cfa8 R11: 0000000000000246 R12: 00007f1f0ef25820
[  414.891356][ T7549] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[  414.891365][ T7549]  </TASK>
[  424.617670][ T5443] trinity-c1 invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=500
[  424.619318][ T5443] CPU: 1 UID: 65534 PID: 5443 Comm: trinity-c1 Not tainted 6.17.0-rc6-00147-g7e86100bfb0d #1 PREEMPT
[  424.619331][ T5443] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[  424.619335][ T5443] Call Trace:
[  424.619340][ T5443]  <TASK>
[ 424.619344][ T5443] dump_stack_lvl (lib/dump_stack.c:122) 
[ 424.619359][ T5443] dump_header (mm/oom_kill.c:74 mm/oom_kill.c:468) 
[ 424.619370][ T5443] oom_kill_process.cold (mm/oom_kill.c:1041) 
[ 424.619377][ T5443] out_of_memory (mm/oom_kill.c:1180) 
[ 424.619388][ T5443] ? oom_killer_disable (mm/oom_kill.c:1113) 
[ 424.619397][ T5443] __alloc_pages_may_oom (mm/page_alloc.c:4055) 
[ 424.619405][ T5443] ? __alloc_pages_direct_compact (mm/page_alloc.c:3987) 
[ 424.619415][ T5443] __alloc_pages_slowpath+0x735/0x1460 
[ 424.619425][ T5443] ? warn_alloc (mm/page_alloc.c:4625) 
[ 424.619434][ T5443] __alloc_frozen_pages_noprof (mm/page_alloc.c:5190) 
[ 424.619441][ T5443] ? __alloc_pages_slowpath+0x1460/0x1460 
[ 424.619446][ T5443] ? __dquot_alloc_space (fs/quota/dquot.c:1683) 
[ 424.619476][ T5443] ? trace_lock_release (include/trace/events/lock.h:69 (discriminator 2)) 
[ 424.619489][ T5443] ? trace_lock_acquire (include/trace/events/lock.h:24 (discriminator 2)) 
[ 424.619494][ T5443] ? lock_acquire (kernel/locking/lockdep.c:5833) 
[ 424.619501][ T5443] alloc_pages_mpol (mm/mempolicy.c:2418) 
[ 424.619507][ T5443] ? xas_start (include/linux/xarray.h:1210 (discriminator 1) lib/xarray.c:191 (discriminator 1)) 
[ 424.619515][ T5443] ? policy_nodemask (mm/mempolicy.c:2373) 
[ 424.619520][ T5443] ? xas_load (include/linux/xarray.h:1226 (discriminator 2) lib/xarray.c:208 (discriminator 2) lib/xarray.c:246 (discriminator 2)) 
[ 424.619526][ T5443] folio_alloc_mpol_noprof (mm/mempolicy.c:2435) 
[ 424.619533][ T5443] shmem_alloc_and_add_folio+0x3dd/0x4f0 
[ 424.619544][ T5443] ? shmem_add_to_page_cache+0x900/0x900 
[ 424.619551][ T5443] ? trace_irq_enable+0xba/0x100 
[ 424.619560][ T5443] shmem_get_folio_gfp+0x42c/0xfa0 
[ 424.619569][ T5443] ? shmem_write_end (mm/shmem.c:2497) 
[ 424.619575][ T5443] ? folio_mark_dirty (mm/page-writeback.c:2849) 
[ 424.619596][ T5443] shmem_fallocate (mm/shmem.c:3813) 
[ 424.619607][ T5443] ? shmem_get_link (mm/shmem.c:3713) 
[ 424.619612][ T5443] ? debug_object_activate (lib/debugobjects.c:837) 
[ 424.619622][ T5443] ? do_raw_spin_unlock (arch/x86/include/asm/atomic.h:23 include/linux/atomic/atomic-arch-fallback.h:457 include/linux/atomic/atomic-instrumented.h:33 include/asm-generic/qspinlock.h:57 kernel/locking/spinlock_debug.c:101 kernel/locking/spinlock_debug.c:141) 
[ 424.619630][ T5443] ? validate_chain (include/linux/hash.h:78 (discriminator 3) kernel/locking/lockdep.c:3798 (discriminator 3) kernel/locking/lockdep.c:3821 (discriminator 3) kernel/locking/lockdep.c:3876 (discriminator 3)) 
[ 424.619636][ T5443] ? mark_lock (kernel/locking/lockdep.c:4731 (discriminator 1)) 
[ 424.619645][ T5443] ? __lock_acquire (kernel/locking/lockdep.c:5237 (discriminator 1)) 
[ 424.619653][ T5443] ? lock_acquire (kernel/locking/lockdep.c:470 (discriminator 4) kernel/locking/lockdep.c:5870 (discriminator 4)) 
[ 424.619659][ T5443] ? ksys_fallocate (include/linux/file.h:62 (discriminator 1) include/linux/file.h:83 (discriminator 1) fs/open.c:361 (discriminator 1)) 
[ 424.619670][ T5443] vfs_fallocate (fs/open.c:342 (discriminator 1)) 
[ 424.619679][ T5443] ksys_fallocate (include/linux/file.h:62 (discriminator 1) include/linux/file.h:83 (discriminator 1) fs/open.c:361 (discriminator 1)) 
[ 424.619685][ T5443] ? local_clock_noinstr (kernel/sched/clock.c:304 (discriminator 1)) 
[ 424.619694][ T5443] __ia32_sys_ia32_fallocate (arch/x86/kernel/sys_ia32.c:119) 
[ 424.619703][ T5443] ? trace_irq_enable+0xba/0x100 
[ 424.619709][ T5443] do_int80_emulation (arch/x86/entry/syscall_32.c:83 (discriminator 1) arch/x86/entry/syscall_32.c:172 (discriminator 1)) 
[ 424.619718][ T5443] ? trace_irq_enable+0xba/0x100 
[ 424.619724][ T5443] ? do_syscall_64 (arch/x86/entry/syscall_64.c:113) 
[ 424.619730][ T5443] ? kvm_sched_clock_read (arch/x86/kernel/kvmclock.c:91 (discriminator 2)) 
[ 424.619735][ T5443] ? local_clock_noinstr (kernel/sched/clock.c:304 (discriminator 1)) 
[ 424.619740][ T5443] ? local_clock (arch/x86/include/asm/preempt.h:95 kernel/sched/clock.c:319) 
[ 424.619747][ T5443] ? __lock_release+0x120/0x230 
[ 424.619752][ T5443] ? __task_pid_nr_ns (include/linux/rcupdate.h:341 include/linux/rcupdate.h:871 kernel/pid.c:518) 
[ 424.619760][ T5443] ? trace_irq_enable+0xba/0x100 
[ 424.619765][ T5443] ? do_syscall_64 (arch/x86/entry/syscall_64.c:113) 
[ 424.619770][ T5443] ? put_timespec64 (kernel/time/time.c:910 (discriminator 1)) 
[ 424.619776][ T5443] ? __ia32_compat_sys_gettimeofday (kernel/time/time.c:904) 
[ 424.619783][ T5443] ? __x64_sys_clock_gettime (kernel/time/posix-timers.c:1147 (discriminator 1) kernel/time/posix-timers.c:1135 (discriminator 1) kernel/time/posix-timers.c:1135 (discriminator 1)) 
[ 424.619792][ T5443] ? posix_timer_unhash_and_free (kernel/time/posix-timers.c:1135) 
[ 424.619797][ T5443] ? kvm_sched_clock_read (arch/x86/kernel/kvmclock.c:91 (discriminator 2)) 
[ 424.619803][ T5443] ? trace_irq_enable+0xba/0x100 
[ 424.619809][ T5443] ? do_syscall_64 (arch/x86/entry/syscall_64.c:113) 
[ 424.619816][ T5443] ? trace_irq_enable+0xba/0x100 
[ 424.619821][ T5443] ? do_syscall_64 (arch/x86/entry/syscall_64.c:113) 
[ 424.619828][ T5443] ? trace_irq_disable+0xba/0x100 
[ 424.619833][ T5443] ? do_int80_emulation (include/linux/irq-entry-common.h:98 arch/x86/entry/syscall_32.c:145) 
[ 424.619840][ T5443] asm_int80_emulation (arch/x86/include/asm/idtentry.h:574) 
[  424.619846][ T5443] RIP: 0033:0x407d4c
[ 424.619854][ T5443] Code: 83 c0 01 41 89 80 40 30 00 00 8b 44 24 04 4c 89 d1 48 8b 54 24 08 4c 89 de 4c 89 e7 55 41 50 41 51 41 52 41 53 4c 89 cd cd 80 <41> 5b 41 5a 41 59 41 58 5d 48 3d 7a ff ff ff 49 89 c4 0f 87 5c 01


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250928/202509281204.3086f707-lkp@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 4/4] mm/page_alloc: Batch page freeing in free_frozen_page_commit
  2025-09-28  5:17   ` kernel test robot
@ 2025-09-29 15:17     ` Joshua Hahn
  2025-10-01 10:04       ` Vlastimil Babka
  0 siblings, 1 reply; 33+ messages in thread
From: Joshua Hahn @ 2025-09-29 15:17 UTC (permalink / raw)
  To: kernel test robot
  Cc: oe-lkp, lkp, Chris Mason, Johannes Weiner, linux-mm,
	Andrew Morton, Kiryl Shutsemau, Brendan Jackman, Michal Hocko,
	Suren Baghdasaryan, Vlastimil Babka, Zi Yan, linux-kernel,
	kernel-team

On Sun, 28 Sep 2025 13:17:37 +0800 kernel test robot <oliver.sang@intel.com> wrote:

Hello Kernel Test Robot,

> Hello,
> 
> kernel test robot noticed "WARNING:bad_unlock_balance_detected" on:
> 
> commit: 7e86100bfb0d65a17f3228a9af4c2a49ac38f057 ("[PATCH v2 4/4] mm/page_alloc: Batch page freeing in free_frozen_page_commit")
> url: https://github.com/intel-lab-lkp/linux/commits/Joshua-Hahn/mm-page_alloc-vmstat-Simplify-refresh_cpu_vm_stats-change-detection/20250925-044532
> patch link: https://lore.kernel.org/all/20250924204409.1706524-5-joshua.hahnjy@gmail.com/
> patch subject: [PATCH v2 4/4] mm/page_alloc: Batch page freeing in free_frozen_page_commit
> 
> in testcase: trinity
> version: 
> with following parameters:
> 
> 	runtime: 300s
> 	group: group-03
> 	nr_groups: 5
> 
> config: x86_64-randconfig-161-20250927
> compiler: gcc-14
> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> 
> (please refer to attached dmesg/kmsg for entire log/backtrace)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202509281204.3086f707-lkp@intel.com
> 
> 
> [  414.880298][ T7549] WARNING: bad unlock balance detected!
> [  414.881071][ T7549] 6.17.0-rc6-00147-g7e86100bfb0d #1 Not tainted
> [  414.881924][ T7549] -------------------------------------
> [  414.882695][ T7549] date/7549 is trying to release lock (&pcp->lock) at:
> [ 414.883649][ T7549] free_frozen_page_commit+0x425/0x9d0 
> [  414.884764][ T7549] but there are no more locks to release!
> [  414.885539][ T7549]
> [  414.885539][ T7549] other info that might help us debug this:
> [  414.886704][ T7549] 2 locks held by date/7549:
> [ 414.887353][ T7549] #0: ffff888104f29940 (&mm->mmap_lock){++++}-{4:4}, at: exit_mmap (include/linux/seqlock.h:431 include/linux/mmap_lock.h:88 include/linux/mmap_lock.h:398 mm/mmap.c:1288) 
> [ 414.888591][ T7549] #1: ffff8883ae40e858 (&pcp->lock){+.+.}-{3:3}, at: free_frozen_page_commit+0x46a/0x9d0 

So based on this, it seems like I must have overlooked a pretty important
consideration here. When I unlock the pcp, it allows both the zone and pcp
lock to be picked up by another task (pcp lock less likely), but it also
means that this process can be migrated to a different CPU, where it will
be trying to unlock & acquire a completely different pcp.

For me the simplest solution looks to be migrate_disable() and
migrate_enable() in the function to ensure that this task stays bound to the
CPU it originally started running on.

I'm not sure how this will affect performance, but I think in terms of
desired behavior it does seem like this is the correct way to do it.
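
Roughly what I have in mind, around the point where the pcp lock is
dropped and re-taken (sketch only, not tested):

	/*
	 * Keep the task on this CPU across the window where pcp->lock is
	 * dropped, so the later lock/unlock still targets the same pcp.
	 */
	migrate_disable();
	spin_unlock(&pcp->lock);
	/* other tasks can grab pcp->lock / zone->lock here */
	spin_lock(&pcp->lock);
	migrate_enable();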

Joshua


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone
  2025-09-27  0:46   ` Hillf Danton
@ 2025-09-30 14:42     ` Joshua Hahn
  2025-09-30 22:14       ` Hillf Danton
  0 siblings, 1 reply; 33+ messages in thread
From: Joshua Hahn @ 2025-09-30 14:42 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Andrew Morton, Johannes Weiner, linux-kernel, linux-mm, kernel-team

On Sat, 27 Sep 2025 08:46:15 +0800 Hillf Danton <hdanton@sina.com> wrote:

> On Wed, 24 Sep 2025 13:44:06 -0700 Joshua Hahn wrote:
> > drain_pages_zone completely drains a zone of its pcp free pages by
> > repeatedly calling free_pcppages_bulk until pcp->count reaches 0.
> > In this loop, it already performs batched calls to ensure that
> > free_pcppages_bulk isn't called to free too many pages at once, and
> > relinquishes & reacquires the lock between each call to prevent
> > lock starvation from other processes.
> > 
> > However, the current batching does not prevent lock starvation. The
> > current implementation creates batches of
> > pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX, which has been seen in
> > Meta workloads to be up to 64 << 5 == 2048 pages.
> > 
> > While it is true that CONFIG_PCP_BATCH_SCALE_MAX is a config and
> > indeed can be adjusted by the system admin to be any number from
> > 0 to 6, it's default value of 5 is still too high to be reasonable for
> > any system.
> > 
> > Instead, let's create batches of pcp->batch pages, which gives a more
> > reasonable 64 pages per call to free_pcppages_bulk. This gives other
> > processes a chance to grab the lock and prevents starvation. Each

Hello Hillf,

Thank you for your feedback!

> Feel free to make it clear, which lock is contended, pcp->lock or
> zone->lock, or both, to help understand the starvation.

Sorry for the late reply. I took some time to run some more tests and
gather numbers so that I could give an accurate representation of what
I was seeing in these systems.

So running perf lock con -abl on my system and compiling the kernel,
I see that the biggest lock contentions come from free_pcppages_bulk
and __rmqueue_pcplist on the upstream kernel (ignoring lock contentions
on lruvec, which is actually the biggest offender on these systems.
This will hopefully be addressed some time in the future as well).

Looking deeper into where they are waiting on the lock, I found that they
are both waiting for the zone->lock (not the pcp lock, even for
__rmqueue_pcplist). I'll add this detail into v3, so that it is more
clear for the user. I'll also emphasize why we still need to break the
pcp lock, since this was something that wasn't immediately obvious to me.

> If the zone lock is hot, why did NUMA nodes fail to mitigate the contention,
> given workloads tested with high sustained memory pressure on large machines
> in the Meta fleet (1Tb memory, 316 CPUs)?

This is a good question. On this system, I've configured the machine to only
use 1 node/zone, so there is no ability to migrate the contention. Perhaps
another approach to this problem would be to encourage the user to
configure the system such that each NUMA node does not exceed N GB of memory?

But if so -- how many GB/node is too much? It seems like there would be
some sweet spot where the overhead required to maintain many nodes
cancels out with the benefits one would get from splitting the system into
multiple nodes. What do you think? Personally, I think that this patchset
(not this patch, since it will be dropped in v3) still provides value in
the form of preventing lock monopolies in the zone lock even in a system
where memory is spread out across more nodes.

> Can the contention be observed under tight memory pressure, but not under
> extremely tight pressure? If not, it is due to a misconfiguration in user
> space, no?

I'm not sure I entirely follow what you mean here, but are you asking
whether this is a userspace issue for running a workload that isn't
properly sized for the system? Perhaps that could be the case, but I think
that it would not be bad for the system to protect against these workloads
which can cause the system to stall, especially since the numbers show
that there are neutral to positive gains from this patch. But of course
I am very biased in this opinion : -) so happy to take other opinions on
this matter.

Thanks for your questions. I hope you have a great day!
Joshua


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone
  2025-09-30 14:42     ` Joshua Hahn
@ 2025-09-30 22:14       ` Hillf Danton
  2025-10-01 15:37         ` Joshua Hahn
  0 siblings, 1 reply; 33+ messages in thread
From: Hillf Danton @ 2025-09-30 22:14 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Andrew Morton, Johannes Weiner, linux-kernel, linux-mm, kernel-team

On Tue, 30 Sep 2025 07:42:40 -0700 Joshua Hahn wrote:
> On Sat, 27 Sep 2025 08:46:15 +0800 Hillf Danton <hdanton@sina.com> wrote:
> > On Wed, 24 Sep 2025 13:44:06 -0700 Joshua Hahn wrote:
> > > drain_pages_zone completely drains a zone of its pcp free pages by
> > > repeatedly calling free_pcppages_bulk until pcp->count reaches 0.
> > > In this loop, it already performs batched calls to ensure that
> > > free_pcppages_bulk isn't called to free too many pages at once, and
> > > relinquishes & reacquires the lock between each call to prevent
> > > lock starvation from other processes.
> > > 
> > > However, the current batching does not prevent lock starvation. The
> > > current implementation creates batches of
> > > pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX, which has been seen in
> > > Meta workloads to be up to 64 << 5 == 2048 pages.
> > > 
> > > While it is true that CONFIG_PCP_BATCH_SCALE_MAX is a config and
> > > indeed can be adjusted by the system admin to be any number from
> > > 0 to 6, it's default value of 5 is still too high to be reasonable for
> > > any system.
> > > 
> > > Instead, let's create batches of pcp->batch pages, which gives a more
> > > reasonable 64 pages per call to free_pcppages_bulk. This gives other
> > > processes a chance to grab the lock and prevents starvation. Each
> 
> Hello Hillf,
> 
> Thank you for your feedback!
> 
> > Feel free to make it clear, which lock is contended, pcp->lock or
> > zone->lock, or both, to help understand the starvation.
> 
> Sorry for the late reply. I took some time to run some more tests and
> gather numbers so that I could give an accurate representation of what
> I was seeing in these systems.
> 
> So running perf lock con -abl on my system and compiling the kernel,
> I see that the biggest lock contentions come from free_pcppages_bulk
> and __rmqueue_pcplist on the upstream kernel (ignoring lock contentions
> on lruvec, which is actually the biggest offender on these systems.
> This will hopefully be addressed some time in the future as well).
> 
> Looking deeper into where they are waiting on the lock, I found that they
> are both waiting for the zone->lock (not the pcp lock, even for

One of the hottest locks again plays its role.

> __rmqueue_pcplist). I'll add this detail into v3, so that it is more
> clear for the user. I'll also emphasize why we still need to break the
> pcp lock, since this was something that wasn't immediately obvious to me.
> 
> > If the zone lock is hot, why did NUMA nodes fail to mitigate the contention,
> > given workloads tested with high sustained memory pressure on large machines
> > in the Meta fleet (1Tb memory, 316 CPUs)?
> 
> This is a good question. On this system, I've configured the machine to only
> use 1 node/zone, so there is no ability to migrate the contention. Perhaps

Thanks, now we know why the zone lock is hot: 300+ CPUs potentially contending on a lock.
The single node/zone config may explain why no similar reports of large
systems (1Tb memory, 316 CPUs) emerged a couple of years back, given
NUMA machines are not anything new on the market.

> another approach to this problem would be to encourage the user to
> configure the system such that each NUMA node does not exceed N GB of memory?
> 
> But if so -- how many GB/node is too much? It seems like there would be
> some sweet spot where the overhead required to maintain many nodes
> cancels out with the benefits one would get from splitting the system into
> multiple nodes. What do you think? Personally, I think that this patchset
> (not this patch, since it will be dropped in v3) still provides value in
> the form of preventing lock monopolies in the zone lock even in a system
> where memory is spread out across more nodes.
> 
> > Can the contention be observed under tight memory pressure, but not under
> > extremely tight pressure? If not, it is due to a misconfiguration in user
> > space, no?
> 
> I'm not sure I entirely follow what you mean here, but are you asking
> whether this is a userspace issue for running a workload that isn't
> properly sized for the system? Perhaps that could be the case, but I think

This is not complicated. Take another look at the system from another
POV - what would your comment be if the same workload ran on the same
system but with RAM cut down to 1Gb? If a dentist is roughly fully loaded
serving two patients well a day, getting the professional overloaded
makes no sense, I think.

In short, given that the zone lock is hot in nature, a soft lockup with a
reproducer hints at misconfiguration.

> that it would not be bad for the system to protect against these workloads
> which can cause the system to stall, especially since the numbers show
> that there are neutral to positive gains from this patch. But of course
> I am very biased in this opinion : -) so happy to take other opinions on
> this matter.
> 
> Thanks for your questions. I hope you have a great day!
> Joshua


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 4/4] mm/page_alloc: Batch page freeing in free_frozen_page_commit
  2025-09-29 15:17     ` Joshua Hahn
@ 2025-10-01 10:04       ` Vlastimil Babka
  2025-10-01 15:55         ` Joshua Hahn
  0 siblings, 1 reply; 33+ messages in thread
From: Vlastimil Babka @ 2025-10-01 10:04 UTC (permalink / raw)
  To: Joshua Hahn, kernel test robot
  Cc: oe-lkp, lkp, Chris Mason, Johannes Weiner, linux-mm,
	Andrew Morton, Kiryl Shutsemau, Brendan Jackman, Michal Hocko,
	Suren Baghdasaryan, Zi Yan, linux-kernel, kernel-team

On 9/29/25 5:17 PM, Joshua Hahn wrote:
> On Sun, 28 Sep 2025 13:17:37 +0800 kernel test robot <oliver.sang@intel.com> wrote:
> 
> Hello Kernel Test Robot,
> 
>> Hello,
>>
>> kernel test robot noticed "WARNING:bad_unlock_balance_detected" on:
>>
>> commit: 7e86100bfb0d65a17f3228a9af4c2a49ac38f057 ("[PATCH v2 4/4] mm/page_alloc: Batch page freeing in free_frozen_page_commit")
>> url: https://github.com/intel-lab-lkp/linux/commits/Joshua-Hahn/mm-page_alloc-vmstat-Simplify-refresh_cpu_vm_stats-change-detection/20250925-044532
>> patch link: https://lore.kernel.org/all/20250924204409.1706524-5-joshua.hahnjy@gmail.com/
>> patch subject: [PATCH v2 4/4] mm/page_alloc: Batch page freeing in free_frozen_page_commit
>>
>> in testcase: trinity
>> version: 
>> with following parameters:
>>
>> 	runtime: 300s
>> 	group: group-03
>> 	nr_groups: 5
>>
>> config: x86_64-randconfig-161-20250927
>> compiler: gcc-14
>> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>>
>> (please refer to attached dmesg/kmsg for entire log/backtrace)
>>
>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>> the same patch/commit), kindly add following tags
>> | Reported-by: kernel test robot <oliver.sang@intel.com>
>> | Closes: https://lore.kernel.org/oe-lkp/202509281204.3086f707-lkp@intel.com
>>
>>
>> [  414.880298][ T7549] WARNING: bad unlock balance detected!
>> [  414.881071][ T7549] 6.17.0-rc6-00147-g7e86100bfb0d #1 Not tainted
>> [  414.881924][ T7549] -------------------------------------
>> [  414.882695][ T7549] date/7549 is trying to release lock (&pcp->lock) at:
>> [ 414.883649][ T7549] free_frozen_page_commit+0x425/0x9d0 
>> [  414.884764][ T7549] but there are no more locks to release!
>> [  414.885539][ T7549]
>> [  414.885539][ T7549] other info that might help us debug this:
>> [  414.886704][ T7549] 2 locks held by date/7549:
>> [ 414.887353][ T7549] #0: ffff888104f29940 (&mm->mmap_lock){++++}-{4:4}, at: exit_mmap (include/linux/seqlock.h:431 include/linux/mmap_lock.h:88 include/linux/mmap_lock.h:398 mm/mmap.c:1288) 
>> [ 414.888591][ T7549] #1: ffff8883ae40e858 (&pcp->lock){+.+.}-{3:3}, at: free_frozen_page_commit+0x46a/0x9d0 
> 
> So based on this, it seems like I must have overlooked a pretty important
> consideration here. When I unlock the pcp, it allows both the zone and pcp
> lock to be picked up by another task (pcp lock less likely), but it also
> means that this process can be migrated to a different CPU, where it will
> be trying to unlock & acquire a completely different pcp.

Yes.

> For me the simplest solution looks to be migrate_disable() and
> migrate_enable() in the function to ensure that this task stays bound to the
> CPU it originally started running on.
> 
> I'm not sure how this will affect performance, but I think in terms of

It is somewhat expensive, so I'd rather avoid it if possible.

> desired behavior it does seem like this is the correct way to do it.

I'd rather detect that this happened (the new pcp doesn't match the old
pcp after a relock) and either give up (hopefully rare enough that it
won't cause much imbalance) or recalculate how much to free on the other
cpu and continue there (probably subtracting how much we already did, so
we don't get unlucky and end up flushing all kinds of cpus "forever").
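
Something along these lines, as a rough sketch (helper and variable
names assumed from the existing code, not compile-tested):

	struct per_cpu_pages *old_pcp = pcp;

	pcp_spin_unlock(pcp);
	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
	if (!pcp)
		return;	/* somebody else owns the local pcp now, give up */
	if (pcp != old_pcp) {
		/*
		 * We migrated. Either bail out here, or subtract what was
		 * already freed from the remaining budget before going on,
		 * so we don't keep flushing one cpu after another.
		 */
		pcp_spin_unlock(pcp);
		return;
	}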

> Joshua



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone
  2025-09-26 16:21       ` Christoph Lameter (Ampere)
  2025-09-26 17:25         ` Joshua Hahn
@ 2025-10-01 11:23         ` Vlastimil Babka
  1 sibling, 0 replies; 33+ messages in thread
From: Vlastimil Babka @ 2025-10-01 11:23 UTC (permalink / raw)
  To: Christoph Lameter (Ampere), Joshua Hahn
  Cc: Andrew Morton, Johannes Weiner, Chris Mason, Kiryl Shutsemau,
	Brendan Jackman, Michal Hocko, Suren Baghdasaryan, Zi Yan,
	linux-kernel, linux-mm, kernel-team, Mel Gorman

On 9/26/25 6:21 PM, Christoph Lameter (Ampere) wrote:
> On Thu, 25 Sep 2025, Joshua Hahn wrote:
> 
>>> So we need an explanation as to why there is such high contention on the
>>> lock first before changing the logic here.
>>>
>>> The current logic seems to be designed to prevent the lock contention you
>>> are seeing.
>>
>> This is true, but my concern was mostly with the value that is being used
>> for the batching (2048 seems too high). But as I explain below, it seems
>> like the min(2048, count) operation is a no-op anyways, since it is never
>> called with count > 1000 (at least from the benchmarks that I was running,
>> on my machine).
> 
> 
> The problem is that you likely increase zone lock contention with a
> reduced batch size.
> 
> Actually that there is a lock in the pcp structure is weird and causes
> cacheline bouncing on such hot paths. Access should be only from the cpu

The hot paths only access the lock local to them so should not cause
bouncing.

> that owns this structure. Remote cleaning (if needed) can be triggered via
> IPIs.

It used to be that way but Mel changed it to the current implementation
a few years ago. IIRC one motivation was to avoid disabling irqs (which
is what provided exclusion against IPI handlers), hence the
spin_trylock() approach locally and spin_lock() for remote flushing.

Today we could theoretically use local_trylock() instead of
spin_trylock(). The benefit is that it's inline, unlike spin_trylock()
(on x86). But an IPI handler (which must succeed and can't give up if
the lock is already taken by the operation it interrupted) wouldn't work
with that - it can't give up, nor can it "spin". So the remote flushes
would need to use queue/flush work instead, and then preempt disable +
local_trylock() would be enough (a work handler can't interrupt a
preempt-disabled section). I don't know whether that would make the
remote flushes too expensive, or whether they only happen in slow paths
rarely enough to be acceptable.
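
For illustration, the queue/flush work variant would have roughly this
shape (names invented here just to sketch the idea, init code omitted):

	static DEFINE_PER_CPU(struct work_struct, pcp_drain_work);

	/* runs on the target cpu, so the local (trylock) path suffices */
	static void pcp_drain_workfn(struct work_struct *work)
	{
		drain_local_pages(NULL);
	}

	static void drain_remote_cpu(int cpu)
	{
		/* assumes INIT_WORK() was done for each cpu at init time */
		schedule_work_on(cpu, &per_cpu(pcp_drain_work, cpu));
	}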

> This is the way it used to be and the way it was tested for high core
> counts years ago.
> 
> You seem to run 176 cores here so it's similar to what we tested way back
> when. If all cores are accessing the pcp structure then you have
> significant cacheline bouncing. Removing the lock and going back to the
> IPI solution would likely remove the problem.

I doubt the problem here is about cacheline bouncing of pcp. AFAIK it's
that free_frozen_page_commit() will be called under preempt_disable()
(pcpu_spin_trylock does that) and do a potentially long
free_pcppages_bulk() operation under spin_lock_irqsave(&zone->lock). So
multiple cpus with similarly long free_pcppages_bulk() calls will spin
on the zone lock with irqs disabled.
Breaking down the time the zone lock is held into smaller batches will
help that and reduce the irqs-disabled time. But there might still be
long preemption-disabled times for the pcp, and IIRC that's enough to
cause rcu_sched stalls? So patch 4/4 also relinquishes the pcp lock
itself (i.e. enables preemption), which we already saw from the lkp
report isn't trivial to do. But none of this is about pcp cacheline
bouncing, AFAICS.

> The cachelines of the allocator per cpu structures are usually very hot
> and should only be touched in rare circumstances from other cpus.

It should be rare enough to not be an issue.

> Having a loop over all processors accessing all the hot percpu structures
> is likely causing significant performance issues and therefore the issues
> that you are seeing here.
> 
> 
> 



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone
  2025-09-30 22:14       ` Hillf Danton
@ 2025-10-01 15:37         ` Joshua Hahn
  2025-10-01 23:48           ` Hillf Danton
  0 siblings, 1 reply; 33+ messages in thread
From: Joshua Hahn @ 2025-10-01 15:37 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Andrew Morton, Johannes Weiner, linux-kernel, linux-mm, kernel-team

Hello Hillf, thanks for your continued interest in this series!

> > Hello Hillf,
> > 
> > Thank you for your feedback!
> > 
> > > Feel free to make it clear, which lock is contended, pcp->lock or
> > > zone->lock, or both, to help understand the starvation.
> > 
> > Sorry for the late reply. I took some time to run some more tests and
> > gather numbers so that I could give an accurate representation of what
> > I was seeing in these systems.
> > 
> > So running perf lock con -abl on my system and compiling the kernel,
> > I see that the biggest lock contentions come from free_pcppages_bulk
> > and __rmqueue_pcplist on the upstream kernel (ignoring lock contentions
> > on lruvec, which is actually the biggest offender on these systems.
> > This will hopefully be addressed some time in the future as well).
> > 
> > Looking deeper into where they are waiting on the lock, I found that they
> > are both waiting for the zone->lock (not the pcp lock, even for
> 
> One of the hottest locks again plays its role.

Indeed...

> > __rmqueue_pcplist). I'll add this detail into v3, so that it is more
> > clear for the user. I'll also emphasize why we still need to break the
> > pcp lock, since this was something that wasn't immediately obvious to me.
> > 
> > > If the zone lock is hot, why did NUMA nodes fail to mitigate the contention,
> > > given workloads tested with high sustained memory pressure on large machines
> > > in the Meta fleet (1Tb memory, 316 CPUs)?
> > 
> > This is a good question. On this system, I've configured the machine to only
> > use 1 node/zone, so there is no ability to migrate the contention. Perhaps
> 
> Thanks, we know why the zone lock is hot - 300+ CPUs potentially contending
> one lock. The single node/zone config may explain why no similar reports of
> large systems (1Tb memory, 316 CPUs) emerged a couple of years back, given
> NUMA machines are nothing new on the market.
> 
> > another approach to this problem would be to encourage the user to
> > configure the system such that each NUMA node does not exceed N GB of memory?
> > 
> > But if so -- how many GB/node is too much? It seems like there would be
> > some sweet spot where the overhead required to maintain many nodes
> > balances against the benefits one would get from splitting the system
> > into multiple nodes. What do you think? Personally, I think that this
> > patchset (not this patch, since it will be dropped in v3) still provides
> > value by preventing monopolization of the zone lock, even in a system
> > where memory is spread out across more nodes.
> > 
> > > Can the contention be observed with tight memory pressure, but not extremely
> > > tight? If not, it is due to a misconfiguration in user space, no?
> > 
> > I'm not sure I entirely follow what you mean here, but are you asking
> > whether this is a userspace issue for running a workload that isn't
> > properly sized for the system? Perhaps that could be the case, but I think
> 
> This is not complicated. Take another look at the system from another
> POV - what would your comment be if the same workload ran on the same
> system but with RAM cut down to 1Gb? If a dentist is roughly fully loaded
> serving two patients well a day, overloading the professional makes no
> sense, I think.
> 
> In short, given the zone lock is hot by nature, a soft lockup with a
> reproducer hints at misconfiguration.

While I definitely agree that spreading out 1TB across multiple NUMA nodes
is an option that should be considered, I am unsure if it makes sense to
dismiss this issue as simply a misconfiguration problem.

The reality is that these machines do exist, and we see zone lock contention
on these machines. You can also see that I ran performance evaluation tests
on relatively smaller machines (250G) and saw some performance gains.

The other point that I wanted to mention is that simply adding more NUMA
nodes is not always strictly beneficial; it changes how the scheduler
has to work, workloads would require more numa-aware tuning, etc.

Thanks for your feedback, Hillf. I hope you have a great day!
Joshua


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 4/4] mm/page_alloc: Batch page freeing in free_frozen_page_commit
  2025-10-01 10:04       ` Vlastimil Babka
@ 2025-10-01 15:55         ` Joshua Hahn
  0 siblings, 0 replies; 33+ messages in thread
From: Joshua Hahn @ 2025-10-01 15:55 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: kernel test robot, oe-lkp, lkp, Chris Mason, Johannes Weiner,
	linux-mm, Andrew Morton, Kiryl Shutsemau, Brendan Jackman,
	Michal Hocko, Suren Baghdasaryan, Zi Yan, linux-kernel,
	kernel-team

Hi Vlastimil,

Thank you for your feedback!

On Wed, 1 Oct 2025 12:04:50 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:

> On 9/29/25 5:17 PM, Joshua Hahn wrote:
> > On Sun, 28 Sep 2025 13:17:37 +0800 kernel test robot <oliver.sang@intel.com> wrote:
> > 
> > Hello Kernel Test Robot,
> > 
> >> Hello,
> >>
> >> kernel test robot noticed "WARNING:bad_unlock_balance_detected" on:

[...snip...]

> >> [  414.880298][ T7549] WARNING: bad unlock balance detected!
> >> [  414.881071][ T7549] 6.17.0-rc6-00147-g7e86100bfb0d #1 Not tainted
> >> [  414.881924][ T7549] -------------------------------------
> >> [  414.882695][ T7549] date/7549 is trying to release lock (&pcp->lock) at:
> >> [ 414.883649][ T7549] free_frozen_page_commit+0x425/0x9d0 
> >> [  414.884764][ T7549] but there are no more locks to release!
> >> [  414.885539][ T7549]
> >> [  414.885539][ T7549] other info that might help us debug this:
> >> [  414.886704][ T7549] 2 locks held by date/7549:
> >> [ 414.887353][ T7549] #0: ffff888104f29940 (&mm->mmap_lock){++++}-{4:4}, at: exit_mmap (include/linux/seqlock.h:431 include/linux/mmap_lock.h:88 include/linux/mmap_lock.h:398 mm/mmap.c:1288) 
> >> [ 414.888591][ T7549] #1: ffff8883ae40e858 (&pcp->lock){+.+.}-{3:3}, at: free_frozen_page_commit+0x46a/0x9d0 
> > 
> > So based on this, it seems like I must have overlooked a pretty important
> > consideration here. When I unlock the pcp, it allows both the zone and pcp
> > lock to be picked up by another task (pcp lock less likely), but it also
> > means that this process can be migrated to a different CPU, where it will
> > be trying to unlock & acquire a completely different pcp.
> 
> Yes.
> 
> > For me the simplest solution looks to be migrate_disable() and
> > migrate_enable() in the function, to ensure that this task stays bound to
> > the CPU it originally started running on.
> > 
> > I'm not sure how this will affect performance, but I think in terms of
> 
> It is somewhat expensive, I'd rather avoid if possible.
> 
> > desired behavior it does seem like this is the correct way to do it.
> 
> I'd rather detect that this happened (the new pcp doesn't match the old
> pcp after a relock) and either give up (hopefully rare enough that it
> won't cause much imbalance) or recalculate how much to free on the other
> cpu and continue there (probably subtracting how much we already did, so
> we don't end up unluckily flushing all kinds of cpus "forever").

I think this idea makes sense. Since I am dropping patch 2/4, the remaining
call sites here will be in decay_pcp_high and free_frozen_page_commit.
Here, I think it makes sense to just give up when we notice we are on a
different CPU. If the new CPU's pcp should have pages flushed, that will
happen on the next call to free_frozen_page_commit (like in
free_unref_folios); and in the case of __free_frozen_pages, it doesn't
really make sense to flush a pcp that isn't related to the current caller
anyway.
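
Just to make sure we're thinking of the same thing, the v3 loop would look
roughly like the sketch below. It is simplified: batched_free() is a
made-up name, and the pcp_trylock_prepare()/pcp_trylock_finish() and irq
plumbing around the relock are omitted.

/*
 * Returns the pcp we end up holding (possibly not the original one), or
 * NULL if the trylock failed; the caller unlocks accordingly.
 */
static struct per_cpu_pages *batched_free(struct zone *zone,
					  struct per_cpu_pages *pcp,
					  int count, int pindex)
{
	struct per_cpu_pages *orig_pcp = pcp;

	while (count > 0) {
		int todo = min(count, READ_ONCE(pcp->batch));

		free_pcppages_bulk(zone, todo, pcp, pindex);
		count -= todo;
		if (!count)
			break;

		/* Let other lockers (and preemption) in between batches. */
		pcp_spin_unlock(pcp);
		pcp = pcp_spin_trylock(zone->per_cpu_pageset);
		if (!pcp || pcp != orig_pcp)
			break;	/* lost the lock or got migrated: give up */
	}
	return pcp;
}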

One concern that I do have is that it is now possible to flush fewer pages
than the current behavior does, since before it was guaranteed that the
specified number of pages would be flushed. But like you suggested,
hopefully this is rare enough that we don't see it happen. FWIW, I had not
seen this happen during my testing; the first time I saw it was in the
kernel test robot report, so hopefully it's not very likely.

Thank you for the idea Vlastimil. I'll make these changes in v3!
I hope you have a great day!
Joshua


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone
  2025-10-01 15:37         ` Joshua Hahn
@ 2025-10-01 23:48           ` Hillf Danton
  2025-10-03  8:35             ` Vlastimil Babka
  0 siblings, 1 reply; 33+ messages in thread
From: Hillf Danton @ 2025-10-01 23:48 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: Andrew Morton, Johannes Weiner, linux-kernel, linux-mm, kernel-team

On Wed,  1 Oct 2025 08:37:16 -0700 Joshua Hahn wrote:
> 
> While I definitely agree that spreading out 1TB across multiple NUMA nodes
> is an option that should be considered, I am unsure if it makes sense to
> dismiss this issue as simply a misconfiguration problem.
> 
> The reality is that these machines do exist, and we see zone lock contention
> on these machines. You can also see that I ran performance evaluation tests
> on relatively smaller machines (250G) and saw some performance gains.
> 
If NUMA nodes are not an option, there is still much room in the zone types
for adding new zones on top of the current pcp and zone mechanism to mitigate
zone lock contention; see the diff below. Then the issue falls into the
config category.

> The other point that I wanted to mention is that simply adding more NUMA
> nodes is not always strictly beneficial; it changes how the scheduler
> has to work, workloads would require more numa-aware tuning, etc.

Feel safe to sit back with Netflix on as PeterZ is taking care of NUMA nodes
and eevdf, haha.

--- x/include/linux/mmzone.h
+++ y/include/linux/mmzone.h
@@ -779,6 +779,9 @@ enum zone_type {
 #ifdef CONFIG_ZONE_DMA32
 	ZONE_DMA32,
 #endif
+#ifdef CONFIG_ZONE_EXP
+	ZONE_EXP0, ZONE_EXP1, ZONE_EXP2, /* experiment */
+#endif
 	/*
 	 * Normal addressable memory is in ZONE_NORMAL. DMA operations can be
 	 * performed on pages in ZONE_NORMAL if the DMA devices support


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone
  2025-10-01 23:48           ` Hillf Danton
@ 2025-10-03  8:35             ` Vlastimil Babka
  2025-10-03 10:02               ` Hillf Danton
  0 siblings, 1 reply; 33+ messages in thread
From: Vlastimil Babka @ 2025-10-03  8:35 UTC (permalink / raw)
  To: Hillf Danton, Joshua Hahn
  Cc: Andrew Morton, Johannes Weiner, linux-kernel, linux-mm,
	kernel-team, Chris Mason, Kiryl Shutsemau, Liam R. Howlett,
	Brendan Jackman, David Hildenbrand, Lorenzo Stoakes,
	Michal Hocko, Mike Rapoport, Suren Baghdasaryan, Vlastimil Babka,
	Zi Yan, Hillf Danton, linux-kernel, linux-mm, kernel-team

On 10/2/25 01:48, Hillf Danton wrote:
> On Wed,  1 Oct 2025 08:37:16 -0700 Joshua Hahn wrote:
>> 
>> While I definitely agree that spreading out 1TB across multiple NUMA nodes
>> is an option that should be considered, I am unsure if it makes sense to
>> dismiss this issue as simply a misconfiguration problem.
>> 
>> The reality is that these machines do exist, and we see zone lock contention
>> on these machines. You can also see that I ran performance evaluation tests
>> on relatively smaller machines (250G) and saw some performance gains.
>> 
> If NUMA nodes are not an option, there is still much room in the zone types
> for adding new zones on top of the current pcp and zone mechanism to mitigate
> zone lock contention; see the diff below. Then the issue falls into the
> config category.
> 
>> The other point that I wanted to mention is that simply adding more NUMA
>> nodes is not always strictly beneficial; it changes how the scheduler
>> has to work, workloads would require more numa-aware tuning, etc.
> 
> Feel safe to sit back with Netflix on as PeterZ is taking care of NUMA nodes
> and eevdf, haha.

Feel free to stop making such weird "jokes"?

Also you should really stop dropping CC's on your replies, especially for
maintainers of given code. I've only learned in the v3 changelog from "as
suggested by Hillf Danton" that there was this subthread. This is not
acceptable. When the feedback is wrong and uncorrected by others, it can
mislead the patch author to do wrong changes in the next version.

If this Cc reduction is due to a problem with your e-mail provider, get a
different one?

> --- x/include/linux/mmzone.h
> +++ y/include/linux/mmzone.h
> @@ -779,6 +779,9 @@ enum zone_type {
>  #ifdef CONFIG_ZONE_DMA32
>  	ZONE_DMA32,
>  #endif
> +#ifdef CONFIG_ZONE_EXP
> +	ZONE_EXP0, ZONE_EXP1, ZONE_EXP2, /* experiment */
> +#endif
>  	/*
>  	 * Normal addressable memory is in ZONE_NORMAL. DMA operations can be
>  	 * performed on pages in ZONE_NORMAL if the DMA devices support
> 



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone
  2025-10-03  8:35             ` Vlastimil Babka
@ 2025-10-03 10:02               ` Hillf Danton
  2025-10-04  9:03                 ` Mike Rapoport
  0 siblings, 1 reply; 33+ messages in thread
From: Hillf Danton @ 2025-10-03 10:02 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Joshua Hahn, Johannes Weiner, linux-kernel, linux-mm

On Fri, 3 Oct 2025 10:35:46 +0200 Vlastimil Babka wrote:
> 
> Also you should really stop dropping CC's on your replies, especially for
> maintainers of given code. I've only learned in the v3 changelog from "as
> suggested by Hillf Danton" that there was this subthread. This is not
> acceptable. When the feedback is wrong and uncorrected by others, it can
> mislead the patch author to do wrong changes in the next version.

The Cc list was cut mainly to avoid getting guys interrupted by accident,
with Johannes Weiner still Cced. And nothing more.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone
  2025-10-03 10:02               ` Hillf Danton
@ 2025-10-04  9:03                 ` Mike Rapoport
  0 siblings, 0 replies; 33+ messages in thread
From: Mike Rapoport @ 2025-10-04  9:03 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Vlastimil Babka, Andrew Morton, Joshua Hahn, Johannes Weiner,
	linux-kernel, linux-mm

On Fri, Oct 03, 2025 at 06:02:39PM +0800, Hillf Danton wrote:
> On Fri, 3 Oct 2025 10:35:46 +0200 Vlastimil Babka wrote:
> > 
> > Also you should really stop dropping CC's on your replies, especially for
> > maintainers of given code. I've only learned in the v3 changelog from "as
> > suggested by Hillf Danton" that there was this subthread. This is not
> > acceptable. When the feedback is wrong and uncorrected by others, it can
> > mislead the patch author to do wrong changes in the next version.
> 
> The Cc list was cut mainly to avoid getting guys interrupted by accident,
> with Johannes Weiner still Cced. And nothing more.

Please don't make decisions for the maintainers. Cutting people from
CC actually makes it worse because they then need to spend more time
looking up the emails they would otherwise have had in their inbox.

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2025-10-04  9:03 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-24 20:44 [PATCH v2 0/4] mm/page_alloc: Batch callers of free_pcppages_bulk Joshua Hahn
2025-09-24 20:44 ` [PATCH v2 1/4] mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection Joshua Hahn
2025-09-24 22:51   ` Christoph Lameter (Ampere)
2025-09-25 18:26     ` Joshua Hahn
2025-09-26 15:34   ` Dan Carpenter
2025-09-26 16:40     ` Joshua Hahn
2025-09-26 17:50   ` SeongJae Park
2025-09-26 18:24     ` Joshua Hahn
2025-09-26 18:33       ` SeongJae Park
2025-09-24 20:44 ` [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone Joshua Hahn
2025-09-24 23:09   ` Christoph Lameter (Ampere)
2025-09-25 18:44     ` Joshua Hahn
2025-09-26 16:21       ` Christoph Lameter (Ampere)
2025-09-26 17:25         ` Joshua Hahn
2025-10-01 11:23         ` Vlastimil Babka
2025-09-26 14:01   ` Brendan Jackman
2025-09-26 15:48     ` Joshua Hahn
2025-09-26 16:57       ` Brendan Jackman
2025-09-26 17:33         ` Joshua Hahn
2025-09-27  0:46   ` Hillf Danton
2025-09-30 14:42     ` Joshua Hahn
2025-09-30 22:14       ` Hillf Danton
2025-10-01 15:37         ` Joshua Hahn
2025-10-01 23:48           ` Hillf Danton
2025-10-03  8:35             ` Vlastimil Babka
2025-10-03 10:02               ` Hillf Danton
2025-10-04  9:03                 ` Mike Rapoport
2025-09-24 20:44 ` [PATCH v2 3/4] mm/page_alloc: Batch page freeing in decay_pcp_high Joshua Hahn
2025-09-24 20:44 ` [PATCH v2 4/4] mm/page_alloc: Batch page freeing in free_frozen_page_commit Joshua Hahn
2025-09-28  5:17   ` kernel test robot
2025-09-29 15:17     ` Joshua Hahn
2025-10-01 10:04       ` Vlastimil Babka
2025-10-01 15:55         ` Joshua Hahn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox