From: Joshua Hahn <joshua.hahnjy@gmail.com>
To: Andrew Morton <akpm@linux-foundation.org>,
Johannes Weiner <hannes@cmpxchg.org>
Cc: Chris Mason <clm@fb.com>, Kiryl Shutsemau <kirill@shutemov.name>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Brendan Jackman <jackmanb@google.com>,
David Hildenbrand <david@redhat.com>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
Michal Hocko <mhocko@suse.com>, Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Vlastimil Babka <vbabka@suse.cz>, Zi Yan <ziy@nvidia.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH v2 0/4] mm/page_alloc: Batch callers of free_pcppages_bulk
Date: Wed, 24 Sep 2025 13:44:04 -0700
Message-ID: <20250924204409.1706524-1-joshua.hahnjy@gmail.com>
Motivation & Approach
=====================
While testing workloads with high sustained memory pressure on large machines
in the Meta fleet (1TB memory, 316 CPUs), we saw an unexpectedly high number
of softlockups. Further investigation showed that the lock in
free_pcppages_bulk was being held for a long time, and that the function was
called to free 2k+ pages over 100 times just during boot.
This starves other processes of both the pcp and zone locks, which can lead
to the system stalling as multiple threads cannot make progress without them.
These issues manifest as warnings like:
[ 4512.591979] rcu: INFO: rcu_sched self-detected stall on CPU
[ 4512.604370] rcu: 20-....: (9312 ticks this GP) idle=a654/1/0x4000000000000000 softirq=309340/309344 fqs=5426
[ 4512.626401] rcu:               hardirqs    softirqs   csw/system
[ 4512.638793] rcu:  number:             0         145            0
[ 4512.651177] rcu: cputime:            30       10410          174   ==> 10558(ms)
[ 4512.666657] rcu: (t=21077 jiffies g=783665 q=1242213 ncpus=316)
While these warnings are benign, they do point to the underlying issue of
lock contention. To prevent starvation on both locks, batch the freeing of
pages using pcp->batch.
Because free_pcppages_bulk is called with both the pcp and zone locks,
relinquishing and reacquiring them is only effective when both locks are
broken together (unless the system was built with queued spinlocks).
Thus, rather than modifying free_pcppages_bulk to break both locks, batch
the freeing from its callers.
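As a rough sketch of the caller-side pattern (simplified for illustration
only; the actual patches modify drain_pages_zone, decay_pcp_high, and
free_frozen_page_commit, and the function name below is hypothetical), each
caller frees at most pcp->batch pages per lock hold and drops the pcp lock
between batches:

    /*
     * Illustrative sketch only -- not the exact upstream code.
     * Free a zone's pcp pages in pcp->batch sized chunks, dropping
     * the pcp lock between chunks so that other CPUs waiting on it
     * (and on the zone lock taken inside free_pcppages_bulk()) can
     * make progress.
     */
    static void drain_pages_zone_batched(struct zone *zone,
                                         struct per_cpu_pages *pcp)
    {
            int count;

            do {
                    spin_lock(&pcp->lock);
                    count = pcp->count;
                    if (count) {
                            int to_free = min(count, pcp->batch);

                            free_pcppages_bulk(zone, to_free, pcp, 0);
                            count -= to_free;
                    }
                    spin_unlock(&pcp->lock);
            } while (count);
    }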
A similar fix has been implemented in the Meta fleet, and we have seen
significantly fewer softlockups.
Testing
=======
The following are a few synthetic benchmarks, run on a machine with
250GB RAM, 179GB swap, and 176 CPUs.
stress-ng --vm 50 --vm-bytes 5G -M -t 100
+----------------------+---------------+----------+
| Metric | Variation (%) | Delta(%) |
+----------------------+---------------+----------+
| bogo ops | 0.0216 | -0.0172 |
| bogo ops/s (real) | 0.0223 | -0.0163 |
| bogo ops/s (usr+sys) | 1.3433 | +1.0769 |
+----------------------+---------------+----------+
stress-ng --vm 10 --vm-bytes 30G -M -t 100
+----------------------+---------------+----------+
| Metric | Variation (%) | Delta(%) |
+----------------------+---------------+----------+
| bogo ops | 2.1736 | +4.8535 |
| bogo ops/s (real) | 2.2689 | +5.1719 |
| bogo ops/s (usr+sys) | 2.1283 | +0.6587 |
+----------------------+---------------+----------+
It seems that, depending on the workload, this patch either improves
performance or is performance-neutral. I believe this depends on how much
lock contention there is, and how many free_pcppages_bulk calls were
previously being made with high counts.
The difference between bogo ops/s (real) and (usr+sys) seems to indicate that
there is a meaningful difference in the amount of time threads spend blocked
on acquiring either the pcp or zone lock.
Changelog
=========
v1 --> v2:
- Reworded cover letter to be more explicit about what kinds of issues
  running processes might face as a result of the existing lock starvation
- Reorganized cover letter into sections to make it easier to read
- Fixed patch 4/4 to properly store & restore UP flags
- Re-ran tests, updated the testing results and interpretation
Joshua Hahn (4):
mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection
mm/page_alloc: Perform appropriate batching in drain_pages_zone
mm/page_alloc: Batch page freeing in decay_pcp_high
mm/page_alloc: Batch page freeing in free_frozen_page_commit
include/linux/gfp.h | 2 +-
mm/page_alloc.c | 67 ++++++++++++++++++++++++++++++++-------------
mm/vmstat.c | 26 +++++++++---------
3 files changed, 62 insertions(+), 33 deletions(-)
base-commit: 097a6c336d0080725c626fda118ecfec448acd0f
--
2.47.3
Thread overview: 33+ messages
2025-09-24 20:44 Joshua Hahn [this message]
2025-09-24 20:44 ` [PATCH v2 1/4] mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection Joshua Hahn
2025-09-24 22:51 ` Christoph Lameter (Ampere)
2025-09-25 18:26 ` Joshua Hahn
2025-09-26 15:34 ` Dan Carpenter
2025-09-26 16:40 ` Joshua Hahn
2025-09-26 17:50 ` SeongJae Park
2025-09-26 18:24 ` Joshua Hahn
2025-09-26 18:33 ` SeongJae Park
2025-09-24 20:44 ` [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone Joshua Hahn
2025-09-24 23:09 ` Christoph Lameter (Ampere)
2025-09-25 18:44 ` Joshua Hahn
2025-09-26 16:21 ` Christoph Lameter (Ampere)
2025-09-26 17:25 ` Joshua Hahn
2025-10-01 11:23 ` Vlastimil Babka
2025-09-26 14:01 ` Brendan Jackman
2025-09-26 15:48 ` Joshua Hahn
2025-09-26 16:57 ` Brendan Jackman
2025-09-26 17:33 ` Joshua Hahn
2025-09-27 0:46 ` Hillf Danton
2025-09-30 14:42 ` Joshua Hahn
2025-09-30 22:14 ` Hillf Danton
2025-10-01 15:37 ` Joshua Hahn
2025-10-01 23:48 ` Hillf Danton
2025-10-03 8:35 ` Vlastimil Babka
2025-10-03 10:02 ` Hillf Danton
2025-10-04 9:03 ` Mike Rapoport
2025-09-24 20:44 ` [PATCH v2 3/4] mm/page_alloc: Batch page freeing in decay_pcp_high Joshua Hahn
2025-09-24 20:44 ` [PATCH v2 4/4] mm/page_alloc: Batch page freeing in free_frozen_page_commit Joshua Hahn
2025-09-28 5:17 ` kernel test robot
2025-09-29 15:17 ` Joshua Hahn
2025-10-01 10:04 ` Vlastimil Babka
2025-10-01 15:55 ` Joshua Hahn