* [PATCH v4 1/3] mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection
2025-10-13 19:08 [PATCH v4 0/3] mm/page_alloc: Batch callers of free_pcppages_bulk Joshua Hahn
@ 2025-10-13 19:08 ` Joshua Hahn
2025-10-13 19:08 ` [PATCH v4 2/3] mm/page_alloc: Batch page freeing in decay_pcp_high Joshua Hahn
` (2 subsequent siblings)
3 siblings, 0 replies; 9+ messages in thread
From: Joshua Hahn @ 2025-10-13 19:08 UTC
To: Andrew Morton
Cc: Chris Mason, Kiryl Shutsemau, Liam R. Howlett, Brendan Jackman,
David Hildenbrand, Johannes Weiner, Lorenzo Stoakes,
Michal Hocko, Mike Rapoport, Suren Baghdasaryan, Vlastimil Babka,
Zi Yan, linux-kernel, linux-mm
Currently, refresh_cpu_vm_stats returns an int, indicating how many
changes were made during its updates. Using this information, callers
like vmstat_update can heuristically determine if more work will be done
in the future.
However, all of refresh_cpu_vm_stats's callers either (a) ignore the
result, only caring about performing the updates, or (b) only care about
whether changes were made, but not *how many* changes were made.
Simplify the code by returning a bool instead to indicate if updates
were made.
In addition, simplify fold_diff and decay_pcp_high to return a bool
for the same reason.
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
include/linux/gfp.h | 2 +-
mm/page_alloc.c | 8 ++++----
mm/vmstat.c | 28 +++++++++++++++-------------
3 files changed, 20 insertions(+), 18 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0ceb4e09306c..f46b066c7661 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -386,7 +386,7 @@ extern void free_pages(unsigned long addr, unsigned int order);
#define free_page(addr) free_pages((addr), 0)
void page_alloc_init_cpuhp(void);
-int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
+bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
void drain_all_pages(struct zone *zone);
void drain_local_pages(struct zone *zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 600d9e981c23..bbc3282fdffc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2557,10 +2557,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
* Called from the vmstat counter updater to decay the PCP high.
* Return whether there are addition works to do.
*/
-int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
+bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
{
int high_min, to_drain, batch;
- int todo = 0;
+ bool todo = false;
high_min = READ_ONCE(pcp->high_min);
batch = READ_ONCE(pcp->batch);
@@ -2573,7 +2573,7 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
pcp->high - (pcp->high >> 3), high_min);
if (pcp->high > high_min)
- todo++;
+ todo = true;
}
to_drain = pcp->count - pcp->high;
@@ -2581,7 +2581,7 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
spin_lock(&pcp->lock);
free_pcppages_bulk(zone, to_drain, pcp, 0);
spin_unlock(&pcp->lock);
- todo++;
+ todo = true;
}
return todo;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index bb09c032eecf..98855f31294d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -771,25 +771,25 @@ EXPORT_SYMBOL(dec_node_page_state);
/*
* Fold a differential into the global counters.
- * Returns the number of counters updated.
+ * Returns whether counters were updated.
*/
static int fold_diff(int *zone_diff, int *node_diff)
{
int i;
- int changes = 0;
+ bool changed = false;
for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
if (zone_diff[i]) {
atomic_long_add(zone_diff[i], &vm_zone_stat[i]);
- changes++;
+ changed = true;
}
for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
if (node_diff[i]) {
atomic_long_add(node_diff[i], &vm_node_stat[i]);
- changes++;
+ changed = true;
}
- return changes;
+ return changed;
}
/*
@@ -806,16 +806,16 @@ static int fold_diff(int *zone_diff, int *node_diff)
* with the global counters. These could cause remote node cache line
* bouncing and will have to be only done when necessary.
*
- * The function returns the number of global counters updated.
+ * The function returns whether global counters were updated.
*/
-static int refresh_cpu_vm_stats(bool do_pagesets)
+static bool refresh_cpu_vm_stats(bool do_pagesets)
{
struct pglist_data *pgdat;
struct zone *zone;
int i;
int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
- int changes = 0;
+ bool changed = false;
for_each_populated_zone(zone) {
struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
@@ -839,7 +839,8 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
if (do_pagesets) {
cond_resched();
- changes += decay_pcp_high(zone, this_cpu_ptr(pcp));
+ if (decay_pcp_high(zone, this_cpu_ptr(pcp)))
+ changed = true;
#ifdef CONFIG_NUMA
/*
* Deal with draining the remote pageset of this
@@ -861,13 +862,13 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
}
if (__this_cpu_dec_return(pcp->expire)) {
- changes++;
+ changed = true;
continue;
}
if (__this_cpu_read(pcp->count)) {
drain_zone_pages(zone, this_cpu_ptr(pcp));
- changes++;
+ changed = true;
}
#endif
}
@@ -887,8 +888,9 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
}
}
- changes += fold_diff(global_zone_diff, global_node_diff);
- return changes;
+ if (fold_diff(global_zone_diff, global_node_diff))
+ changed = true;
+ return changed;
}
/*
--
2.47.3
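As a rough illustration of the change described in this patch, the sketch below models fold_diff in userspace C. It is not kernel code: sim_fold_diff, the plain long array standing in for the atomic global counters, and the array length parameter are assumptions made for the example. It shows that a caller asking only "did anything change?" loses nothing when the running count becomes a bool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical userspace model of fold_diff(): fold per-CPU differentials
 * into global counters and report whether any counter was touched.
 * The kernel uses atomic_long_add(); a plain addition stands in for it here.
 */
static bool sim_fold_diff(const int *diff, long *global, size_t n)
{
	bool changed = false;

	for (size_t i = 0; i < n; i++) {
		if (diff[i]) {
			global[i] += diff[i];
			changed = true;	/* previously: changes++ */
		}
	}
	return changed;
}
```

A caller in the style of vmstat_update can then branch on the result directly, for example `if (sim_fold_diff(diff, global, n))` followed by whatever rescheduling it needs.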
* [PATCH v4 2/3] mm/page_alloc: Batch page freeing in decay_pcp_high
2025-10-13 19:08 [PATCH v4 0/3] mm/page_alloc: Batch callers of free_pcppages_bulk Joshua Hahn
2025-10-13 19:08 ` [PATCH v4 1/3] mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection Joshua Hahn
@ 2025-10-13 19:08 ` Joshua Hahn
2025-10-13 19:08 ` [PATCH v4 3/3] mm/page_alloc: Batch page freeing in free_frozen_page_commit Joshua Hahn
2025-10-14 11:29 ` [PATCH v4 0/3] mm/page_alloc: Batch callers of free_pcppages_bulk Hillf Danton
3 siblings, 0 replies; 9+ messages in thread
From: Joshua Hahn @ 2025-10-13 19:08 UTC
To: Andrew Morton
Cc: Chris Mason, Kiryl Shutsemau, Brendan Jackman, Johannes Weiner,
Michal Hocko, Suren Baghdasaryan, Vlastimil Babka, Zi Yan,
linux-kernel, linux-mm, kernel-team
It is possible for pcp->count - pcp->high to exceed pcp->batch by a lot.
When this happens, we should perform batching to ensure that
free_pcppages_bulk isn't called with too many pages to free at once and
starve out other threads that need the pcp or zone lock.
Since we are still only freeing the difference between the initial
pcp->count and pcp->high values, there should be no change to how many
pages are freed.
Suggested-by: Chris Mason <clm@fb.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Co-developed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
mm/page_alloc.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bbc3282fdffc..8ecd48be8bdd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2559,7 +2559,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
*/
bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
{
- int high_min, to_drain, batch;
+ int high_min, to_drain, to_drain_batched, batch;
bool todo = false;
high_min = READ_ONCE(pcp->high_min);
@@ -2577,11 +2577,14 @@ bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
}
to_drain = pcp->count - pcp->high;
- if (to_drain > 0) {
+ while (to_drain > 0) {
+ to_drain_batched = min(to_drain, batch);
spin_lock(&pcp->lock);
- free_pcppages_bulk(zone, to_drain, pcp, 0);
+ free_pcppages_bulk(zone, to_drain_batched, pcp, 0);
spin_unlock(&pcp->lock);
todo = true;
+
+ to_drain -= to_drain_batched;
}
return todo;
--
2.47.3
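The batching in decay_pcp_high can be modeled in userspace to check the claim that the total number of pages freed is unchanged. The following is a minimal sketch, not kernel code: struct sim_pcp, sim_free_bulk and sim_drain_batched are hypothetical stand-ins for struct per_cpu_pages, free_pcppages_bulk and the new while loop, and the pcp lock is represented only by a counter of lock holds.

```c
#include <assert.h>

/* Hypothetical stand-in for struct per_cpu_pages; not the kernel type. */
struct sim_pcp {
	int count;		/* pages currently on the per-CPU list */
	int high;		/* level the list should be drained down to */
	int batch;		/* maximum pages freed per lock hold */
	int lock_holds;		/* number of lock/unlock pairs taken */
	int max_per_hold;	/* most pages freed under a single hold */
};

/* Stand-in for free_pcppages_bulk(): frees at most pcp->count pages. */
static int sim_free_bulk(struct sim_pcp *pcp, int nr)
{
	if (nr > pcp->count)
		nr = pcp->count;
	pcp->count -= nr;
	return nr;
}

/* Mirrors the while loop added by the patch. Returns pages freed. */
static int sim_drain_batched(struct sim_pcp *pcp)
{
	int to_drain = pcp->count - pcp->high;
	int freed = 0;

	assert(pcp->batch > 0);	/* otherwise the loop would never end */
	while (to_drain > 0) {
		int n = to_drain < pcp->batch ? to_drain : pcp->batch;
		int done;

		/* spin_lock(&pcp->lock) would be taken here */
		done = sim_free_bulk(pcp, n);
		/* spin_unlock(&pcp->lock) would be dropped here */
		pcp->lock_holds++;
		if (done > pcp->max_per_hold)
			pcp->max_per_hold = done;

		freed += done;
		to_drain -= n;
	}
	return freed;
}
```

With count = 2100, high = 100 and batch = 63 (illustrative values only), the loop frees the same 2000 pages a single call would, but spreads them over 32 lock holds of at most 63 pages each.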
* [PATCH v4 3/3] mm/page_alloc: Batch page freeing in free_frozen_page_commit
2025-10-13 19:08 [PATCH v4 0/3] mm/page_alloc: Batch callers of free_pcppages_bulk Joshua Hahn
2025-10-13 19:08 ` [PATCH v4 1/3] mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection Joshua Hahn
2025-10-13 19:08 ` [PATCH v4 2/3] mm/page_alloc: Batch page freeing in decay_pcp_high Joshua Hahn
@ 2025-10-13 19:08 ` Joshua Hahn
2025-10-14 9:38 ` Vlastimil Babka
2025-10-14 11:29 ` [PATCH v4 0/3] mm/page_alloc: Batch callers of free_pcppages_bulk Hillf Danton
3 siblings, 1 reply; 9+ messages in thread
From: Joshua Hahn @ 2025-10-13 19:08 UTC
To: Andrew Morton
Cc: Chris Mason, Kiryl Shutsemau, Brendan Jackman, Johannes Weiner,
Michal Hocko, Suren Baghdasaryan, Vlastimil Babka, Zi Yan,
linux-kernel, linux-mm, kernel-team
Before returning, free_frozen_page_commit calls free_pcppages_bulk using
nr_pcp_free to determine how many pages can appropriately be freed,
based on the tunable parameters stored in pcp. While this number is an
accurate representation of how many pages should be freed in total, it
is not an appropriate number of pages to free at once using
free_pcppages_bulk, since we have seen the value consistently go above
2000 in the Meta fleet on larger machines.
As such, batch the page freeing in free_frozen_page_commit by calling
free_pcppages_bulk with at most pcp->batch pages at a time. In order to
ensure that other processes are not starved of the zone lock, release
both the zone lock and the pcp lock between batches to yield to other
threads.
Note that because free_frozen_page_commit now retakes the pcp lock with a
spin trylock inside the function (which can fail), the function may now
return with the pcp unlocked. To handle this, return true if the pcp is
locked on exit and false otherwise.
In addition, since free_frozen_page_commit must now be aware of what UP
flags were stored at the time of the spin lock, and because we must be
able to report new UP flags to the callers, add a new unsigned long *
parameter UP_flags to keep track of this.
The following are a few synthetic benchmarks, made on three machines. The
first is a large machine with 754GiB memory and 316 processors.
The second is a relatively smaller machine with 251GiB memory and 176
processors. The third and final is the smallest of the three, which has 62GiB
memory and 36 processors.
On all machines, I kick off a kernel build with -j$(nproc).
Negative delta is better (faster compilation)
Large machine (754GiB memory, 316 processors)
make -j$(nproc)
+------------+---------------+-----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+-----------+
| real | 0.8070 | - 1.4865 |
| user | 0.2823 | + 0.4081 |
| sys | 5.0267 | -11.8737 |
+------------+---------------+-----------+
Medium machine (251GiB memory, 176 processors)
make -j$(nproc)
+------------+---------------+----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+----------+
| real | 0.2806 | +0.0351 |
| user | 0.0994 | +0.3170 |
| sys | 0.6229 | -0.6277 |
+------------+---------------+----------+
Small machine (62GiB memory, 36 processors)
make -j$(nproc)
+------------+---------------+----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+----------+
| real | 0.1503 | -2.6585 |
| user | 0.0431 | -2.2984 |
| sys | 0.1870 | -3.2013 |
+------------+---------------+----------+
Here, variation is the coefficient of variation, i.e. standard deviation / mean.
Suggested-by: Chris Mason <clm@fb.com>
Co-developed-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
mm/page_alloc.c | 66 ++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 57 insertions(+), 9 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8ecd48be8bdd..e85770dd54bd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2818,12 +2818,22 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
return high;
}
-static void free_frozen_page_commit(struct zone *zone,
+/*
+ * Tune pcp alloc factor and adjust count & free_count. Free pages to bring the
+ * pcp's watermarks below high.
+ *
+ * May return a freed pcp, if during page freeing the pcp spinlock cannot be
+ * reacquired. Return true if pcp is locked, false otherwise.
+ */
+static bool free_frozen_page_commit(struct zone *zone,
struct per_cpu_pages *pcp, struct page *page, int migratetype,
- unsigned int order, fpi_t fpi_flags)
+ unsigned int order, fpi_t fpi_flags, unsigned long *UP_flags)
{
int high, batch;
+ int to_free, to_free_batched;
int pindex;
+ int cpu = smp_processor_id();
+ int ret = true;
bool free_high = false;
/*
@@ -2861,15 +2871,47 @@ static void free_frozen_page_commit(struct zone *zone,
* Do not attempt to take a zone lock. Let pcp->count get
* over high mark temporarily.
*/
- return;
+ return true;
}
high = nr_pcp_high(pcp, zone, batch, free_high);
if (pcp->count < high)
- return;
+ return true;
+
+ to_free = nr_pcp_free(pcp, batch, high, free_high);
+ if (to_free == 0)
+ return true;
+
+ while (to_free > 0 && pcp->count >= high) {
+ to_free_batched = min(to_free, batch);
+ free_pcppages_bulk(zone, to_free_batched, pcp, pindex);
+ to_free -= to_free_batched;
+ if (pcp->count >= high) {
+ pcp_spin_unlock(pcp);
+ pcp_trylock_finish(*UP_flags);
+
+ pcp_trylock_prepare(*UP_flags);
+ pcp = pcp_spin_trylock(zone->per_cpu_pageset);
+ if (!pcp) {
+ pcp_trylock_finish(*UP_flags);
+ ret = false;
+ break;
+ }
+
+ /*
+ * Check if this thread has been migrated to a different
+ * CPU. If that is the case, give up and indicate that
+ * the pcp is returned in an unlocked state.
+ */
+ if (smp_processor_id() != cpu) {
+ pcp_spin_unlock(pcp);
+ pcp_trylock_finish(*UP_flags);
+ ret = false;
+ break;
+ }
+ }
+ }
- free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
- pcp, pindex);
if (test_bit(ZONE_BELOW_HIGH, &zone->flags) &&
zone_watermark_ok(zone, 0, high_wmark_pages(zone),
ZONE_MOVABLE, 0)) {
@@ -2887,6 +2929,7 @@ static void free_frozen_page_commit(struct zone *zone,
next_memory_node(pgdat->node_id) < MAX_NUMNODES)
atomic_set(&pgdat->kswapd_failures, 0);
}
+ return ret;
}
/*
@@ -2934,7 +2977,9 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
pcp_trylock_prepare(UP_flags);
pcp = pcp_spin_trylock(zone->per_cpu_pageset);
if (pcp) {
- free_frozen_page_commit(zone, pcp, page, migratetype, order, fpi_flags);
+ if (!free_frozen_page_commit(zone, pcp, page, migratetype,
+ order, fpi_flags, &UP_flags))
+ return;
pcp_spin_unlock(pcp);
} else {
free_one_page(zone, page, pfn, order, fpi_flags);
@@ -3034,8 +3079,11 @@ void free_unref_folios(struct folio_batch *folios)
migratetype = MIGRATE_MOVABLE;
trace_mm_page_free_batched(&folio->page);
- free_frozen_page_commit(zone, pcp, &folio->page, migratetype,
- order, FPI_NONE);
+ if (!free_frozen_page_commit(zone, pcp, &folio->page,
+ migratetype, order, FPI_NONE, &UP_flags)) {
+ pcp = NULL;
+ locked_zone = NULL;
+ }
}
if (pcp) {
--
2.47.3
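The locking contract introduced by this patch (the callee may drop the pcp lock, fail to retake it, and report that to the caller) can be sketched in userspace. This is a simplified model under stated assumptions, not kernel code: struct sim_lock, sim_trylock, sim_unlock, sim_commit and sim_caller are hypothetical names, contention is simulated by a budget of successful trylocks, and the CPU migration check and UP_flags handling are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock model: trylock succeeds only while trylocks_left > 0. */
struct sim_lock {
	bool held;
	int trylocks_left;
};

static bool sim_trylock(struct sim_lock *l)
{
	if (l->held || l->trylocks_left <= 0)
		return false;
	l->trylocks_left--;
	l->held = true;
	return true;
}

static void sim_unlock(struct sim_lock *l)
{
	assert(l->held);	/* a double unlock would trip here */
	l->held = false;
}

/*
 * Model of the batched loop as posted in v4. Returns true if the lock is
 * still held on exit, false if it was dropped and could not be retaken.
 */
static bool sim_commit(struct sim_lock *l, int *count, int high,
		       int to_free, int batch)
{
	assert(l->held && batch > 0);
	while (to_free > 0 && *count >= high) {
		int n = to_free < batch ? to_free : batch;
		int done = n < *count ? n : *count;

		*count -= done;
		to_free -= n;
		if (*count >= high) {
			sim_unlock(l);		/* let other lockers in */
			if (!sim_trylock(l))
				return false;	/* caller must not unlock */
		}
	}
	return true;
}

/* Caller in the style of __free_frozen_pages(): unlock only if still held. */
static void sim_caller(struct sim_lock *l, int *count, int high,
		       int to_free, int batch)
{
	if (!sim_trylock(l))
		return;		/* fallback path elided */
	if (!sim_commit(l, count, high, to_free, batch))
		return;		/* lock already dropped by the callee */
	sim_unlock(l);
}
```

The point of the bool return is visible in sim_caller: when sim_commit reports false, unlocking again would be a double unlock, so the caller simply returns.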
* Re: [PATCH v4 3/3] mm/page_alloc: Batch page freeing in free_frozen_page_commit
2025-10-13 19:08 ` [PATCH v4 3/3] mm/page_alloc: Batch page freeing in free_frozen_page_commit Joshua Hahn
@ 2025-10-14 9:38 ` Vlastimil Babka
2025-10-14 13:15 ` Joshua Hahn
0 siblings, 1 reply; 9+ messages in thread
From: Vlastimil Babka @ 2025-10-14 9:38 UTC
To: Joshua Hahn, Andrew Morton
Cc: Chris Mason, Kiryl Shutsemau, Brendan Jackman, Johannes Weiner,
Michal Hocko, Suren Baghdasaryan, Zi Yan, linux-kernel, linux-mm,
kernel-team
On 10/13/25 21:08, Joshua Hahn wrote:
> Before returning, free_frozen_page_commit calls free_pcppages_bulk using
> nr_pcp_free to determine how many pages can appropriately be freed,
> based on the tunable parameters stored in pcp. While this number is an
> accurate representation of how many pages should be freed in total, it
> is not an appropriate number of pages to free at once using
> free_pcppages_bulk, since we have seen the value consistently go above
> 2000 in the Meta fleet on larger machines.
>
> As such, perform batched page freeing in free_pcppages_bulk by using
> pcp->batch member. In order to ensure that other processes are not
> starved of the zone lock, free both the zone lock and pcp lock to yield to
> other threads.
>
> Note that because free_frozen_page_commit now performs a spinlock inside the
> function (and can fail), the function may now return with a freed pcp.
> To handle this, return true if the pcp is locked on exit and false otherwise.
>
> In addition, since free_frozen_page_commit must now be aware of what UP
> flags were stored at the time of the spin lock, and because we must be
> able to report new UP flags to the callers, add a new unsigned long*
> parameter UP_flags to keep track of this.
>
> The following are a few synthetic benchmarks, made on three machines. The
> first is a large machine with 754GiB memory and 316 processors.
> The second is a relatively smaller machine with 251GiB memory and 176
> processors. The third and final is the smallest of the three, which has 62GiB
> memory and 36 processors.
>
> On all machines, I kick off a kernel build with -j$(nproc).
> Negative delta is better (faster compilation)
>
> Large machine (754GiB memory, 316 processors)
> make -j$(nproc)
> +------------+---------------+-----------+
> | Metric (s) | Variation (%) | Delta(%) |
> +------------+---------------+-----------+
> | real | 0.8070 | - 1.4865 |
> | user | 0.2823 | + 0.4081 |
> | sys | 5.0267 | -11.8737 |
> +------------+---------------+-----------+
>
> Medium machine (251GiB memory, 176 processors)
> make -j$(nproc)
> +------------+---------------+----------+
> | Metric (s) | Variation (%) | Delta(%) |
> +------------+---------------+----------+
> | real | 0.2806 | +0.0351 |
> | user | 0.0994 | +0.3170 |
> | sys | 0.6229 | -0.6277 |
> +------------+---------------+----------+
>
> Small machine (62GiB memory, 36 processors)
> make -j$(nproc)
> +------------+---------------+----------+
> | Metric (s) | Variation (%) | Delta(%) |
> +------------+---------------+----------+
> | real | 0.1503 | -2.6585 |
> | user | 0.0431 | -2.2984 |
> | sys | 0.1870 | -3.2013 |
> +------------+---------------+----------+
>
> Here, variation is the coefficient of variation, i.e. standard deviation / mean.
>
> Suggested-by: Chris Mason <clm@fb.com>
> Co-developed-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> ---
> mm/page_alloc.c | 66 ++++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 57 insertions(+), 9 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8ecd48be8bdd..e85770dd54bd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2818,12 +2818,22 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
> return high;
> }
>
> -static void free_frozen_page_commit(struct zone *zone,
> +/*
> + * Tune pcp alloc factor and adjust count & free_count. Free pages to bring the
> + * pcp's watermarks below high.
> + *
> + * May return a freed pcp, if during page freeing the pcp spinlock cannot be
> + * reacquired. Return true if pcp is locked, false otherwise.
> + */
> +static bool free_frozen_page_commit(struct zone *zone,
> struct per_cpu_pages *pcp, struct page *page, int migratetype,
> - unsigned int order, fpi_t fpi_flags)
> + unsigned int order, fpi_t fpi_flags, unsigned long *UP_flags)
> {
> int high, batch;
> + int to_free, to_free_batched;
> int pindex;
> + int cpu = smp_processor_id();
> + int ret = true;
> bool free_high = false;
>
> /*
> @@ -2861,15 +2871,47 @@ static void free_frozen_page_commit(struct zone *zone,
> * Do not attempt to take a zone lock. Let pcp->count get
> * over high mark temporarily.
> */
> - return;
> + return true;
> }
>
> high = nr_pcp_high(pcp, zone, batch, free_high);
> if (pcp->count < high)
> - return;
> + return true;
> +
> + to_free = nr_pcp_free(pcp, batch, high, free_high);
> + if (to_free == 0)
> + return true;
I think this is an unnecessary shortcut. The while() condition covers this
and it's likely rare enough that we don't gain anything (if the goal was to
skip the ZONE_BELOW_HIGH check below).
> +
> + while (to_free > 0 && pcp->count >= high) {
The "&& pcp->count >= high" is AFAICS still changing how much we free
compared to before the patch. I.e. we might terminate as soon as freeing
"to_free_batched" in some iteration gets us below "high", while previously
we would free the whole "to_free" and get way further below the "high".
It should be changed to "&& pcp->count > 0" intended only to prevent useless
iterations that decrement to_free by to_free_batched while
free_pcppages_bulk() does nothing.
> + to_free_batched = min(to_free, batch);
> + free_pcppages_bulk(zone, to_free_batched, pcp, pindex);
> + to_free -= to_free_batched;
> + if (pcp->count >= high) {
> + pcp_spin_unlock(pcp);
> + pcp_trylock_finish(*UP_flags);
> +
> + pcp_trylock_prepare(*UP_flags);
> + pcp = pcp_spin_trylock(zone->per_cpu_pageset);
> + if (!pcp) {
> + pcp_trylock_finish(*UP_flags);
> + ret = false;
> + break;
> + }
> +
> + /*
> + * Check if this thread has been migrated to a different
> + * CPU. If that is the case, give up and indicate that
> + * the pcp is returned in an unlocked state.
> + */
> + if (smp_processor_id() != cpu) {
> + pcp_spin_unlock(pcp);
> + pcp_trylock_finish(*UP_flags);
> + ret = false;
> + break;
> + }
> + }
> + }
>
> - free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
> - pcp, pindex);
> if (test_bit(ZONE_BELOW_HIGH, &zone->flags) &&
> zone_watermark_ok(zone, 0, high_wmark_pages(zone),
> ZONE_MOVABLE, 0)) {
> @@ -2887,6 +2929,7 @@ static void free_frozen_page_commit(struct zone *zone,
> next_memory_node(pgdat->node_id) < MAX_NUMNODES)
> atomic_set(&pgdat->kswapd_failures, 0);
> }
> + return ret;
> }
>
> /*
> @@ -2934,7 +2977,9 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
> pcp_trylock_prepare(UP_flags);
> pcp = pcp_spin_trylock(zone->per_cpu_pageset);
> if (pcp) {
> - free_frozen_page_commit(zone, pcp, page, migratetype, order, fpi_flags);
> + if (!free_frozen_page_commit(zone, pcp, page, migratetype,
> + order, fpi_flags, &UP_flags))
> + return;
> pcp_spin_unlock(pcp);
> } else {
> free_one_page(zone, page, pfn, order, fpi_flags);
> @@ -3034,8 +3079,11 @@ void free_unref_folios(struct folio_batch *folios)
> migratetype = MIGRATE_MOVABLE;
>
> trace_mm_page_free_batched(&folio->page);
> - free_frozen_page_commit(zone, pcp, &folio->page, migratetype,
> - order, FPI_NONE);
> + if (!free_frozen_page_commit(zone, pcp, &folio->page,
> + migratetype, order, FPI_NONE, &UP_flags)) {
> + pcp = NULL;
> + locked_zone = NULL;
> + }
> }
>
> if (pcp) {
* Re: [PATCH v4 3/3] mm/page_alloc: Batch page freeing in free_frozen_page_commit
2025-10-14 9:38 ` Vlastimil Babka
@ 2025-10-14 13:15 ` Joshua Hahn
2025-10-14 17:42 ` Vlastimil Babka
0 siblings, 1 reply; 9+ messages in thread
From: Joshua Hahn @ 2025-10-14 13:15 UTC
To: Vlastimil Babka
Cc: Andrew Morton, Chris Mason, Kiryl Shutsemau, Brendan Jackman,
Johannes Weiner, Michal Hocko, Suren Baghdasaryan, Zi Yan,
linux-kernel, linux-mm, kernel-team
On Tue, 14 Oct 2025 11:38:00 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
> On 10/13/25 21:08, Joshua Hahn wrote:
> > Before returning, free_frozen_page_commit calls free_pcppages_bulk using
> > nr_pcp_free to determine how many pages can appropriately be freed,
> > based on the tunable parameters stored in pcp. While this number is an
> > accurate representation of how many pages should be freed in total, it
> > is not an appropriate number of pages to free at once using
> > free_pcppages_bulk, since we have seen the value consistently go above
> > 2000 in the Meta fleet on larger machines.
> >
> > As such, perform batched page freeing in free_pcppages_bulk by using
> > pcp->batch member. In order to ensure that other processes are not
> > starved of the zone lock, free both the zone lock and pcp lock to yield to
> > other threads.
> >
> > Note that because free_frozen_page_commit now performs a spinlock inside the
> > function (and can fail), the function may now return with a freed pcp.
> > To handle this, return true if the pcp is locked on exit and false otherwise.
> >
> > In addition, since free_frozen_page_commit must now be aware of what UP
> > flags were stored at the time of the spin lock, and because we must be
> > able to report new UP flags to the callers, add a new unsigned long*
> > parameter UP_flags to keep track of this.
[...snip...]
> > @@ -2861,15 +2871,47 @@ static void free_frozen_page_commit(struct zone *zone,
> > * Do not attempt to take a zone lock. Let pcp->count get
> > * over high mark temporarily.
> > */
> > - return;
> > + return true;
> > }
> >
> > high = nr_pcp_high(pcp, zone, batch, free_high);
> > if (pcp->count < high)
> > - return;
> > + return true;
> > +
> > + to_free = nr_pcp_free(pcp, batch, high, free_high);
> > + if (to_free == 0)
> > + return true;
Hello Vlastimil, thank you for your patience and review on this iteration!
> I think this is an unnecessary shortcut. The while() condition covers this
> and it's likely rare enough that we don't gain anything (if the goal was to
> skip the ZONE_BELOW_HIGH check below).
Agreed.
> > +
> > + while (to_free > 0 && pcp->count >= high) {
>
> The "&& pcp->count >= high" is AFAICS still changing how much we free
> compared to before the patch. I.e. we might terminate as soon as freeing
> "to_free_batched" in some iteration gets us below "high", while previously
> we would free the whole "to_free" and get way further below the "high".
This is true, and I also see now what you had meant in your feedback on the
previous iteration.
> It should be changed to "&& pcp->count > 0" intended only to prevent useless
> iterations that decrement to_free by to_free_batched while
> free_pcppages_bulk() does nothing.
This makes sense. Sorry, I think I missed your point in the previous version,
but I think now I see what you were trying to say about the count. Previously
when we were re-calculating high every iteration, I thought it made some sense
to make the check again, since we might want to terminate early. But I do
agree that it doesn't really make sense to do this; we want to preserve the
behavior of the original code. I do have one comment below as well:
> > + to_free_batched = min(to_free, batch);
> > + free_pcppages_bulk(zone, to_free_batched, pcp, pindex);
> > + to_free -= to_free_batched;
> > + if (pcp->count >= high) {
Here, I think I should change this in the next version to also just check
for the same condition as the while loop (i.e. to_free > 0 && pcp->count > 0).
The idea is that if we have another iteration, we will re-lock. Otherwise, we
can just ignore the case inside the if statement. I think if it is left as
a check for pcp->count >= high, then there will be a weird case for when
0 < pcp->count < high, where we continue to call free_pcppages_bulk but
do not re-lock.
So unfortunately, I will have to check for the same condition of the
while loop in the if statement : -( I'll send a new version with the changes;
I don't expect there to be a drastic performance change, since I think the
early termination case would have only applied if there was a race condition
that freed the pcp remotely.
Thank you as always, Vlastimil. I hope you have a great day!
Joshua
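The effect of the two conditions discussed above can be checked with a small userspace model. This is only a sketch under stated assumptions, not kernel code: sim_run, struct sim_result and enum sim_cond are hypothetical names, a "yield" stands for dropping and retaking the pcp lock, a yield is only counted when another batch is still pending, and the numbers used are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

enum sim_cond { SIM_COUNT_GE_HIGH, SIM_COUNT_GT_ZERO };

struct sim_result {
	int freed;	/* total pages freed */
	int batches;	/* calls to the bulk free stand-in */
	int yields;	/* times the lock would be dropped and retaken */
};

static bool sim_cond_holds(enum sim_cond c, int count, int high)
{
	return c == SIM_COUNT_GE_HIGH ? count >= high : count > 0;
}

/*
 * Run the batched loop with one condition guarding the loop and another
 * guarding the unlock/relock step inside it.
 */
static struct sim_result sim_run(int count, int high, int to_free, int batch,
				 enum sim_cond loop_cond,
				 enum sim_cond yield_cond)
{
	struct sim_result r = { 0, 0, 0 };

	assert(batch > 0);
	while (to_free > 0 && sim_cond_holds(loop_cond, count, high)) {
		int n = to_free < batch ? to_free : batch;
		int done = n < count ? n : count;

		count -= done;
		to_free -= n;
		r.freed += done;
		r.batches++;
		/* yield only if another batch is still to come */
		if (to_free > 0 && sim_cond_holds(yield_cond, count, high))
			r.yields++;
	}
	return r;
}
```

With count = 1000, high = 900, to_free = 500 and batch = 63, the v4 conditions stop after 126 pages, whereas the unbatched code would have freed all 500. Changing only the loop condition to count > 0 frees 500 pages in 8 batches but yields just once, so most batches run back to back under the lock. Using count > 0 (together with to_free > 0) in both places frees 500 pages and yields between every pair of batches.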
* Re: [PATCH v4 3/3] mm/page_alloc: Batch page freeing in free_frozen_page_commit
2025-10-14 13:15 ` Joshua Hahn
@ 2025-10-14 17:42 ` Vlastimil Babka
0 siblings, 0 replies; 9+ messages in thread
From: Vlastimil Babka @ 2025-10-14 17:42 UTC
To: Joshua Hahn
Cc: Andrew Morton, Chris Mason, Kiryl Shutsemau, Brendan Jackman,
Johannes Weiner, Michal Hocko, Suren Baghdasaryan, Zi Yan,
linux-kernel, linux-mm, kernel-team
On 10/14/25 15:15, Joshua Hahn wrote:
> On Tue, 14 Oct 2025 11:38:00 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
>
>> On 10/13/25 21:08, Joshua Hahn wrote:
>> > Before returning, free_frozen_page_commit calls free_pcppages_bulk using
>> > nr_pcp_free to determine how many pages can appropriately be freed,
>> > based on the tunable parameters stored in pcp. While this number is an
>> > accurate representation of how many pages should be freed in total, it
>> > is not an appropriate number of pages to free at once using
>> > free_pcppages_bulk, since we have seen the value consistently go above
>> > 2000 in the Meta fleet on larger machines.
>> >
>> > As such, perform batched page freeing in free_pcppages_bulk by using
>> > pcp->batch member. In order to ensure that other processes are not
>> > starved of the zone lock, free both the zone lock and pcp lock to yield to
>> > other threads.
>> >
>> > Note that because free_frozen_page_commit now performs a spinlock inside the
>> > function (and can fail), the function may now return with a freed pcp.
>> > To handle this, return true if the pcp is locked on exit and false otherwise.
>> >
>> > In addition, since free_frozen_page_commit must now be aware of what UP
>> > flags were stored at the time of the spin lock, and because we must be
>> > able to report new UP flags to the callers, add a new unsigned long*
>> > parameter UP_flags to keep track of this.
>
> [...snip...]
>
>> > @@ -2861,15 +2871,47 @@ static void free_frozen_page_commit(struct zone *zone,
>> > * Do not attempt to take a zone lock. Let pcp->count get
>> > * over high mark temporarily.
>> > */
>> > - return;
>> > + return true;
>> > }
>> >
>> > high = nr_pcp_high(pcp, zone, batch, free_high);
>> > if (pcp->count < high)
>> > - return;
>> > + return true;
>> > +
>> > + to_free = nr_pcp_free(pcp, batch, high, free_high);
>> > + if (to_free == 0)
>> > + return true;
>
> Hello Vlastimil, thank you for your patience and review on this iteration!
No problem!
>
>> > + to_free_batched = min(to_free, batch);
>> > + free_pcppages_bulk(zone, to_free_batched, pcp, pindex);
>> > + to_free -= to_free_batched;
>> > + if (pcp->count >= high) {
>
> Here, I think I should change this in the next version to also just check
> for the same condition in the while loop (i.e. to_free > 0 && pcp->count > 0)
You're right, I missed that one.
> The idea is that if we have another iteration, we will re-lock. Otherwise, we
> can just ignore the case inside the if statement. I think if it is left as
> a check for pcp->count >= high, then there will be a weird case for when
> 0 < pcp->count <= high, where we continue to call free_pcppages_bulk but
> do not re-lock.
>
> So unfortunately, I will have to check for the same condition of the
> while loop in the if statement : -( I'll send a new version with the changes;
> I don't expect there to be a drastic performance change, since I think the
> early termination case would have only applied if there was a race condition
> that freed the pcp remotely.
Great!
* Re: [PATCH v4 0/3] mm/page_alloc: Batch callers of free_pcppages_bulk
2025-10-13 19:08 [PATCH v4 0/3] mm/page_alloc: Batch callers of free_pcppages_bulk Joshua Hahn
` (2 preceding siblings ...)
2025-10-13 19:08 ` [PATCH v4 3/3] mm/page_alloc: Batch page freeing in free_frozen_page_commit Joshua Hahn
@ 2025-10-14 11:29 ` Hillf Danton
2025-10-14 13:42 ` Joshua Hahn
3 siblings, 1 reply; 9+ messages in thread
From: Hillf Danton @ 2025-10-14 11:29 UTC
To: Joshua Hahn
Cc: Andrew Morton, Johannes Weiner, Vlastimil Babka, linux-kernel,
linux-mm, kernel-team
On Mon, 13 Oct 2025 12:08:08 -0700 Joshua Hahn wrote:
> Motivation & Approach
> =====================
>
> While testing workloads with high sustained memory pressure on large machines
> in the Meta fleet (1Tb memory, 316 CPUs), we saw an unexpectedly high number
> of softlockups. Further investigation showed that the zone lock in
> free_pcppages_bulk was being held for a long time, and was called to free
> 2k+ pages over 100 times just during boot.
>
> This causes starvation in other processes for the zone lock, which can lead
> to the system stalling as multiple threads cannot make progress without the
> locks. We can see these issues manifesting as warnings:
>
> [ 4512.591979] rcu: INFO: rcu_sched self-detected stall on CPU
> [ 4512.604370] rcu: 20-....: (9312 ticks this GP) idle=a654/1/0x4000000000000000 softirq=309340/309344 fqs=5426
> [ 4512.626401] rcu: hardirqs softirqs csw/system
> [ 4512.638793] rcu: number: 0 145 0
> [ 4512.651177] rcu: cputime: 30 10410 174 ==> 10558(ms)
> [ 4512.666657] rcu: (t=21077 jiffies g=783665 q=1242213 ncpus=316)
>
> While these warnings are benign, they do point to the underlying issue of
No fix is needed if it is benign.
> lock contention. To prevent starvation in both locks, batch the freeing of
> pages using pcp->batch.
>
> Because free_pcppages_bulk is called with the pcp lock and acquires the zone
> lock, relinquishing and reacquiring the locks are only effective when both of
> them are broken together (unless the system was built with queued spinlocks).
> Thus, instead of modifying free_pcppages_bulk to break both locks, batch the
> freeing from its callers instead.
>
> A similar fix has been implemented in the Meta fleet, and we have seen
> significantly less softlockups.
>
Fine, softlockup is not cured.
> Testing
> =======
> The following are a few synthetic benchmarks, made on three machines. The
> first is a large machine with 754GiB memory and 316 processors.
> The second is a relatively smaller machine with 251GiB memory and 176
> processors. The third and final is the smallest of the three, which has 62GiB
> memory and 36 processors.
>
> On all machines, I kick off a kernel build with -j$(nproc).
> Negative delta is better (faster compilation).
>
> Large machine (754GiB memory, 316 processors)
> make -j$(nproc)
> +------------+---------------+-----------+
> | Metric (s) | Variation (%) | Delta(%) |
> +------------+---------------+-----------+
> | real | 0.8070 | - 1.4865 |
> | user | 0.2823 | + 0.4081 |
> | sys | 5.0267 | -11.8737 |
> +------------+---------------+-----------+
>
> Medium machine (251GiB memory, 176 processors)
> make -j$(nproc)
> +------------+---------------+----------+
> | Metric (s) | Variation (%) | Delta(%) |
> +------------+---------------+----------+
> | real | 0.2806 | +0.0351 |
> | user | 0.0994 | +0.3170 |
> | sys | 0.6229 | -0.6277 |
> +------------+---------------+----------+
>
> Small machine (62GiB memory, 36 processors)
> make -j$(nproc)
> +------------+---------------+----------+
> | Metric (s) | Variation (%) | Delta(%) |
> +------------+---------------+----------+
> | real | 0.1503 | -2.6585 |
> | user | 0.0431 | -2.2984 |
> | sys | 0.1870 | -3.2013 |
> +------------+---------------+----------+
>
> Here, variation is the coefficient of variation, i.e. standard deviation / mean.
>
> Based on these results, it seems like there are varying degrees to how much
> lock contention this reduces. For the largest and smallest machines that I ran
> the tests on, it seems like there is quite some significant reduction. There
> is also some performance increases visible from userspace.
>
> Interestingly, the performance gains don't scale with the size of the machine,
> but rather there seems to be a dip in the gain there is for the medium-sized
> machine.
>
Explaining the dip helps land this work in the next tree.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v4 0/3] mm/page_alloc: Batch callers of free_pcppages_bulk
2025-10-14 11:29 ` [PATCH v4 0/3] mm/page_alloc: Batch callers of free_pcppages_bulk Hillf Danton
@ 2025-10-14 13:42 ` Joshua Hahn
0 siblings, 0 replies; 9+ messages in thread
From: Joshua Hahn @ 2025-10-14 13:42 UTC (permalink / raw)
To: Hillf Danton
Cc: Andrew Morton, Johannes Weiner, Vlastimil Babka, linux-kernel,
linux-mm, kernel-team
On Tue, 14 Oct 2025 19:29:45 +0800 Hillf Danton <hdanton@sina.com> wrote:
> On Mon, 13 Oct 2025 12:08:08 -0700 Joshua Hahn wrote:
> > Motivation & Approach
> > =====================
> >
> > While testing workloads with high sustained memory pressure on large machines
> > in the Meta fleet (1Tb memory, 316 CPUs), we saw an unexpectedly high number
> > of softlockups. Further investigation showed that the zone lock in
> > free_pcppages_bulk was being held for a long time, and was called to free
> > 2k+ pages over 100 times just during boot.
> >
> > This causes starvation in other processes for the zone lock, which can lead
> > to the system stalling as multiple threads cannot make progress without the
> > locks. We can see these issues manifesting as warnings:
> >
> > [ 4512.591979] rcu: INFO: rcu_sched self-detected stall on CPU
> > [ 4512.604370] rcu: 20-....: (9312 ticks this GP) idle=a654/1/0x4000000000000000 softirq=309340/309344 fqs=5426
> > [ 4512.626401] rcu: hardirqs softirqs csw/system
> > [ 4512.638793] rcu: number: 0 145 0
> > [ 4512.651177] rcu: cputime: 30 10410 174 ==> 10558(ms)
> > [ 4512.666657] rcu: (t=21077 jiffies g=783665 q=1242213 ncpus=316)
Hello Hillf, thank you for your review.
> > While these warnings are benign, they do point to the underlying issue of
>
> No fix is needed if it is benign.
Maybe this is poor wording on my part. What I mean to say is that these
warning messages can help us understand that the system is trending negatively,
even though the warning messages themselves may not indicate that something
has crashed or broken completely.
> > lock contention. To prevent starvation in both locks, batch the freeing of
> > pages using pcp->batch.
> >
> > Because free_pcppages_bulk is called with the pcp lock and acquires the zone
> > lock, relinquishing and reacquiring the locks are only effective when both of
> > them are broken together (unless the system was built with queued spinlocks).
> > Thus, instead of modifying free_pcppages_bulk to break both locks, batch the
> > freeing from its callers instead.
> >
> > A similar fix has been implemented in the Meta fleet, and we have seen
> > significantly less softlockups.
> >
> Fine, softlockup is not cured.
>
> > Testing
> > =======
> > The following are a few synthetic benchmarks, made on three machines. The
> > first is a large machine with 754GiB memory and 316 processors.
> > The second is a relatively smaller machine with 251GiB memory and 176
> > processors. The third and final is the smallest of the three, which has 62GiB
> > memory and 36 processors.
> >
> > On all machines, I kick off a kernel build with -j$(nproc).
> > Negative delta is better (faster compilation).
> >
> > Large machine (754GiB memory, 316 processors)
> > make -j$(nproc)
> > +------------+---------------+-----------+
> > | Metric (s) | Variation (%) | Delta(%) |
> > +------------+---------------+-----------+
> > | real | 0.8070 | - 1.4865 |
> > | user | 0.2823 | + 0.4081 |
> > | sys | 5.0267 | -11.8737 |
> > +------------+---------------+-----------+
> >
> > Medium machine (251GiB memory, 176 processors)
> > make -j$(nproc)
> > +------------+---------------+----------+
> > | Metric (s) | Variation (%) | Delta(%) |
> > +------------+---------------+----------+
> > | real | 0.2806 | +0.0351 |
> > | user | 0.0994 | +0.3170 |
> > | sys | 0.6229 | -0.6277 |
> > +------------+---------------+----------+
> >
> > Small machine (62GiB memory, 36 processors)
> > make -j$(nproc)
> > +------------+---------------+----------+
> > | Metric (s) | Variation (%) | Delta(%) |
> > +------------+---------------+----------+
> > | real | 0.1503 | -2.6585 |
> > | user | 0.0431 | -2.2984 |
> > | sys | 0.1870 | -3.2013 |
> > +------------+---------------+----------+
> >
> > Here, variation is the coefficient of variation, i.e. standard deviation / mean.
> >
> > Based on these results, it seems like there are varying degrees to how much
> > lock contention this reduces. For the largest and smallest machines that I ran
> > the tests on, it seems like there is quite some significant reduction. There
> > is also some performance increases visible from userspace.
> >
> > Interestingly, the performance gains don't scale with the size of the machine,
> > but rather there seems to be a dip in the gain there is for the medium-sized
> > machine.
> >
> Explaining the dip helps land this work in the next tree.
I do agree that I left this on a bit of a cliffhanger here. I'm a bit confused
as to why there is this kind of behavior as well, although I have a theory
as to why this behavior is seen. Going back to why we see zone lock contention
in the first place, I think it might have to do with the memory vs. processors
ratio that leads to such contention issues.
The lower the memory:processor ratio (GiB of memory per processor) is, the
less zone lock contention there seems to be to begin with. If we rank these
machines by their mem:proc ratio:
Large machine : 2.38
Small machine : 1.72
Medium machine: 1.42
It seems like this is the order in which we see the gains as well. I think this
explanation also kind of makes sense -- the more memory we have, the more
memory each pcp will have, and the longer free_pcppages_bulk would have taken
before (and vice versa). This is the case, at least for my setup, where each
machine is onlined in one node (zone) and so the pcp watermarks really
do scale with the size of the system.
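
For reference, the ranking above can be reproduced with a small user-space
sketch like the one below. It just divides the memory sizes from the cover
letter (in GiB) by the processor counts and sorts from the highest ratio to
the lowest. The struct and function names are hypothetical and exist only for
illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical description of a test machine, using the cover letter's sizes. */
struct machine {
	const char *name;
	double mem_gib;  /* memory in GiB */
	int cpus;        /* number of processors */
};

/* GiB of memory per processor. */
static double mem_per_cpu(const struct machine *m)
{
	assert(m->cpus > 0);
	return m->mem_gib / m->cpus;
}

/* qsort() comparator: highest mem:proc ratio first. */
static int by_ratio_desc(const void *a, const void *b)
{
	double ra = mem_per_cpu((const struct machine *)a);
	double rb = mem_per_cpu((const struct machine *)b);

	return (ra < rb) - (ra > rb);
}

/* Sort the array in place from the highest ratio to the lowest. */
static void rank_by_mem_per_cpu(struct machine *m, size_t n)
{
	qsort(m, n, sizeof(*m), by_ratio_desc);
}
```

Passing the three machines (754GiB/316, 251GiB/176, 62GiB/36) to
rank_by_mem_per_cpu() orders them large, small, medium, which is the same
order as the sys time improvements in the tables.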
I didn't want to include this in the cover letter, because this was purely an
untested conjecture.
I hope this helps!
Joshua
^ permalink raw reply [flat|nested] 9+ messages in thread