From: Joshua Hahn <joshua.hahnjy@gmail.com>
To: Hillf Danton <hdanton@sina.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Johannes Weiner <hannes@cmpxchg.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
kernel-team@meta.com
Subject: Re: [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone
Date: Tue, 30 Sep 2025 07:42:40 -0700
Message-ID: <20250930144240.2326093-1-joshua.hahnjy@gmail.com>
In-Reply-To: <20250927004617.7667-1-hdanton@sina.com>
On Sat, 27 Sep 2025 08:46:15 +0800 Hillf Danton <hdanton@sina.com> wrote:
> On Wed, 24 Sep 2025 13:44:06 -0700 Joshua Hahn wrote:
> > drain_pages_zone completely drains a zone of its pcp free pages by
> > repeatedly calling free_pcppages_bulk until pcp->count reaches 0.
> > In this loop, it already performs batched calls to ensure that
> > free_pcppages_bulk isn't called to free too many pages at once, and
> > relinquishes & reacquires the lock between each call to prevent
> > lock starvation from other processes.
> >
> > However, the current batching does not prevent lock starvation. The
> > current implementation creates batches of
> > pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX, which has been seen in
> > Meta workloads to be up to 64 << 5 == 2048 pages.
> >
> > While it is true that CONFIG_PCP_BATCH_SCALE_MAX is a config and
> > indeed can be adjusted by the system admin to be any number from
> > 0 to 6, its default value of 5 is still too high to be reasonable for
> > any system.
> >
> > Instead, let's create batches of pcp->batch pages, which gives a more
> > reasonable 64 pages per call to free_pcppages_bulk. This gives other
> > processes a chance to grab the lock and prevents starvation. Each
Hello Hillf,
Thank you for your feedback!
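For anyone skimming the thread, the loop in question looks roughly like
the following. This is only a simplified sketch of drain_pages_zone, not
the exact upstream code:

        /*
         * Simplified sketch of the drain loop under discussion.
         * free_pcppages_bulk() takes zone->lock internally; the pcp lock
         * is dropped and reacquired between iterations so that other
         * CPUs get a chance to take it.
         */
        do {
                spin_lock(&pcp->lock);
                count = pcp->count;
                if (count) {
                        /*
                         * Current batch: pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX,
                         * i.e. up to 64 << 5 == 2048 pages per call. This patch
                         * proposes min(count, pcp->batch) instead (~64 pages).
                         */
                        int to_drain = min(count,
                                        pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);

                        free_pcppages_bulk(zone, to_drain, pcp, 0);
                        count -= to_drain;
                }
                spin_unlock(&pcp->lock);
        } while (count);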
> Feel free to make it clear, which lock is contended, pcp->lock or
> zone->lock, or both, to help understand the starvation.
Sorry for the late reply. I took some time to run some more tests and
gather numbers so that I could give an accurate representation of what
I was seeing in these systems.
Running perf lock con -abl on my system while compiling the kernel,
I see that the biggest lock contention on the upstream kernel comes from
free_pcppages_bulk and __rmqueue_pcplist (ignoring contention on the
lruvec lock, which is actually the biggest offender on these systems and
will hopefully be addressed some time in the future as well).
Looking deeper into where they are waiting on the lock, I found that they
are both waiting for the zone->lock (not the pcp lock, even for
__rmqueue_pcplist). I'll add this detail in v3 so that it is clearer
for the reader. I'll also emphasize why we still need to periodically
drop the pcp lock, since this wasn't immediately obvious to me.
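For reference, the numbers above came from something along the lines of
the following invocation, run while a kernel build was going on (the
sleep duration here is just an illustrative measurement window):

    # -a: system-wide, -b: use BPF, -l: aggregate by lock address
    perf lock contention -a -b -l -- sleep 10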
> If the zone lock is hot, why did numa node fail to mitigate the contention,
> given workloads tested with high sustained memory pressure on large machines
> in the Meta fleet (1TB memory, 316 CPUs)?
This is a good question. On this system, I've configured the machine to use
only one node/zone, so there is no way to mitigate the contention by
spreading it across nodes. Perhaps
another approach to this problem would be to encourage the user to
configure the system such that each NUMA node does not exceed N GB of memory?
But if so -- how many GB/node is too much? It seems like there would be
some sweet spot where the overhead required to maintain many nodes
cancels out the benefits one would get from splitting the system into
multiple nodes. What do you think? Personally, I think that this patchset
(not this patch, since it will be dropped in v3) still provides value in
the form of preventing monopolization of the zone lock, even in a system
where memory is spread out across more nodes.
> Can the contention be observed with tight memory pressure, but not highly tight?
> If not, it is due to a misconfiguration in user space, no?
I'm not sure I entirely follow what you mean here, but are you asking
whether this is a userspace issue, i.e. running a workload that isn't
properly sized for the system? Perhaps that could be the case, but I think
it would not be bad for the kernel to protect against workloads that
can cause the system to stall, especially since the numbers show
neutral-to-positive gains from this patch. But of course I am very biased
in this opinion :-) so I am happy to hear other opinions on this matter.
Thanks for your questions. I hope you have a great day!
Joshua
Thread overview: 33+ messages
2025-09-24 20:44 [PATCH v2 0/4] mm/page_alloc: Batch callers of free_pcppages_bulk Joshua Hahn
2025-09-24 20:44 ` [PATCH v2 1/4] mm/page_alloc/vmstat: Simplify refresh_cpu_vm_stats change detection Joshua Hahn
2025-09-24 22:51 ` Christoph Lameter (Ampere)
2025-09-25 18:26 ` Joshua Hahn
2025-09-26 15:34 ` Dan Carpenter
2025-09-26 16:40 ` Joshua Hahn
2025-09-26 17:50 ` SeongJae Park
2025-09-26 18:24 ` Joshua Hahn
2025-09-26 18:33 ` SeongJae Park
2025-09-24 20:44 ` [PATCH v2 2/4] mm/page_alloc: Perform appropriate batching in drain_pages_zone Joshua Hahn
2025-09-24 23:09 ` Christoph Lameter (Ampere)
2025-09-25 18:44 ` Joshua Hahn
2025-09-26 16:21 ` Christoph Lameter (Ampere)
2025-09-26 17:25 ` Joshua Hahn
2025-10-01 11:23 ` Vlastimil Babka
2025-09-26 14:01 ` Brendan Jackman
2025-09-26 15:48 ` Joshua Hahn
2025-09-26 16:57 ` Brendan Jackman
2025-09-26 17:33 ` Joshua Hahn
2025-09-27 0:46 ` Hillf Danton
2025-09-30 14:42 ` Joshua Hahn [this message]
2025-09-30 22:14 ` Hillf Danton
2025-10-01 15:37 ` Joshua Hahn
2025-10-01 23:48 ` Hillf Danton
2025-10-03 8:35 ` Vlastimil Babka
2025-10-03 10:02 ` Hillf Danton
2025-10-04 9:03 ` Mike Rapoport
2025-09-24 20:44 ` [PATCH v2 3/4] mm/page_alloc: Batch page freeing in decay_pcp_high Joshua Hahn
2025-09-24 20:44 ` [PATCH v2 4/4] mm/page_alloc: Batch page freeing in free_frozen_page_commit Joshua Hahn
2025-09-28 5:17 ` kernel test robot
2025-09-29 15:17 ` Joshua Hahn
2025-10-01 10:04 ` Vlastimil Babka
2025-10-01 15:55 ` Joshua Hahn