From: fujunjie <fujunjie1@qq.com>
To: akpm@linux-foundation.org
Cc: vbabka@suse.cz, surenb@google.com, mhocko@suse.com,
jackmanb@google.com, hannes@cmpxchg.org, ziy@nvidia.com,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
fujunjie <fujunjie1@qq.com>
Subject: [PATCH v2] mm/page_alloc: optimize lowmem_reserve max lookup using its semantic monotonicity
Date: Sat, 15 Nov 2025 03:02:55 +0000
Message-ID: <tencent_EB0FED91B01B1F8B6DAEE96719C5F5797F07@qq.com>

calculate_totalreserve_pages() currently finds the maximum
lowmem_reserve[j] for a zone by scanning the full forward range
[zone_idx, MAX_NR_ZONES). However, for a given zone i, lowmem_reserve[j]
(for j > i) is expected to be monotonically non-decreasing in j. This is
not an implementation detail; it follows from the semantics of
lowmem_reserve[].

For zone "i", lowmem_reserve[j] expresses how many pages in zone i must
effectively be kept in reserve when deciding whether an allocation class
that may allocate from zones up to j is allowed to fall back into i. It
protects less flexible allocation classes (which cannot use higher
zones) from being starved by more flexible ones.
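
For illustration only, here is a userspace sketch of how such a reserve
entry enters a watermark-style check. It is a simplified model, not the
kernel code: struct toy_zone, toy_watermark_ok(), the four-zone layout
and all page counts are assumptions made up for this example.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NR_ZONES 4	/* assumption: a four-zone layout for the example */

/* Hypothetical stand-in for struct zone; only the fields used here. */
struct toy_zone {
	long free_pages;
	long min_wmark;
	long lowmem_reserve[MAX_NR_ZONES];
};

/*
 * An allocation class whose highest usable zone is highest_zoneidx may
 * take pages from z only while z keeps more free pages than its
 * watermark plus the reserve it holds against that class.
 */
static bool toy_watermark_ok(const struct toy_zone *z, int highest_zoneidx)
{
	return z->free_pages > z->min_wmark + z->lowmem_reserve[highest_zoneidx];
}

/* Small self-check with made-up numbers for a zone at index 1. */
static void toy_watermark_demo(void)
{
	struct toy_zone z = {
		.free_pages = 5000,
		.min_wmark = 1000,
		.lowmem_reserve = { 0, 0, 3000, 6000 },
	};

	assert(toy_watermark_ok(&z, 1));	/* 5000 > 1000 + 0 */
	assert(toy_watermark_ok(&z, 2));	/* 5000 > 1000 + 3000 */
	assert(!toy_watermark_ok(&z, 3));	/* 5000 > 1000 + 6000 fails */
}
```

In this model the class that can reach the highest zone is turned away
from the low zone first, which is the protection described above.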

These semantics imply an ordering in j: as j increases, the allocation
class gains access to a strictly larger set of fallback zones.
lowmem_reserve[j] is therefore expected to be monotonically
non-decreasing in j: a more flexible allocation class must not be
allowed to deplete low zones more aggressively than a less flexible
one.

In other words, if lowmem_reserve[j] were ever observed to *decrease*
as j grows, that would contradict the reserve semantics and would
likely indicate a semantic change or a misconfiguration.
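
The expectation can be written down as a small check. This is a
hypothetical userspace helper, not part of the patch:
toy_reserve_is_monotonic() and the sample arrays are invented for
illustration, and MAX_NR_ZONES is fixed at 4 only for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NR_ZONES 4	/* assumption: a four-zone layout for the example */

/*
 * Return true if reserve[j] never decreases as j grows over the range
 * j > i, i.e. over the entries that are meaningful for zone i.
 */
static bool toy_reserve_is_monotonic(const long reserve[MAX_NR_ZONES], int i)
{
	for (int j = i + 1; j + 1 < MAX_NR_ZONES; j++)
		if (reserve[j] > reserve[j + 1])
			return false;
	return true;
}

/* Small self-check with made-up reserve arrays for a zone at index 1. */
static void toy_monotonic_demo(void)
{
	const long good[MAX_NR_ZONES] = { 0, 0, 3000, 6000 };
	const long bad[MAX_NR_ZONES] = { 0, 0, 6000, 3000 };

	assert(toy_reserve_is_monotonic(good, 1));
	assert(!toy_reserve_is_monotonic(bad, 1));
}
```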

The current implementation in setup_per_zone_lowmem_reserve() reflects
this policy: it accumulates managed pages from higher zones and applies
the configured ratio, which yields a non-decreasing sequence. This
patch makes calculate_totalreserve_pages() rely on that monotonicity
explicitly and find the maximum reserve value by scanning backward and
stopping at the first non-zero entry. This avoids unnecessary iteration
and reflects the conceptual model more directly. No functional change
intended.
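
A userspace sketch of why the two scans agree under this model follows.
It only mirrors the description above (accumulate managed pages of
higher zones, apply the ratio); the toy_* names, the zone sizes and the
ratio of 256 are assumptions for the example, not values taken from the
kernel.

```c
#include <assert.h>

#define MAX_NR_ZONES 4	/* assumption: a four-zone layout for the example */

/* Model of the accumulation: reserve[j] = sum(managed[i+1..j]) / ratio. */
static void toy_setup_reserve(const unsigned long managed[MAX_NR_ZONES],
			      int i, int ratio, long reserve[MAX_NR_ZONES])
{
	unsigned long sum = 0;

	for (int j = 0; j <= i; j++)
		reserve[j] = 0;
	for (int j = i + 1; j < MAX_NR_ZONES; j++) {
		sum += managed[j];
		reserve[j] = ratio ? (long)(sum / ratio) : 0;
	}
}

/* Forward scan over the whole range, as in the pre-patch loop. */
static long toy_max_forward(const long reserve[MAX_NR_ZONES], int i)
{
	long max = 0;

	for (int j = i; j < MAX_NR_ZONES; j++)
		if (reserve[j] > max)
			max = reserve[j];
	return max;
}

/* Backward scan that stops at the first non-zero entry, as in the patch. */
static long toy_max_backward(const long reserve[MAX_NR_ZONES], int i)
{
	for (int j = MAX_NR_ZONES - 1; j > i; j--)
		if (reserve[j])
			return reserve[j];
	return 0;
}

/* Small self-check: both scans pick the same maximum for zone 0. */
static void toy_scan_demo(void)
{
	const unsigned long managed[MAX_NR_ZONES] = { 4000, 700000, 0, 1000000 };
	long reserve[MAX_NR_ZONES];

	toy_setup_reserve(managed, 0, 256, reserve);
	assert(toy_max_forward(reserve, 0) == toy_max_backward(reserve, 0));
}
```

The backward scan returns after one comparison whenever the last entry
is non-zero, whereas the forward scan always visits every index from i
to MAX_NR_ZONES - 1.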

To keep this assumption visible, a comment is added next to
setup_per_zone_lowmem_reserve() documenting the monotonicity expectation
and noting that calculate_totalreserve_pages() relies on it.

Changes in v2:
- Reword the explanation of lowmem_reserve[] monotonicity to clarify
  that it follows from the semantics.
- Keep only a brief reference to the invariant in
  calculate_totalreserve_pages(); the full documentation is placed next
  to setup_per_zone_lowmem_reserve().

Signed-off-by: fujunjie <fujunjie1@qq.com>
---
mm/page_alloc.c | 33 +++++++++++++++++++++++++++++----
1 file changed, 29 insertions(+), 4 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 600d9e981c23d..d13a81de2203b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6285,10 +6285,21 @@ static void calculate_totalreserve_pages(void)
 			long max = 0;
 			unsigned long managed_pages = zone_managed_pages(zone);
-			/* Find valid and maximum lowmem_reserve in the zone */
-			for (j = i; j < MAX_NR_ZONES; j++)
-				max = max(max, zone->lowmem_reserve[j]);
+			/*
+			 * lowmem_reserve[j] is monotonically non-decreasing
+			 * in j for a given zone (see
+			 * setup_per_zone_lowmem_reserve()). The maximum
+			 * valid reserve lives at the highest index with a
+			 * non-zero value, so scan backwards and stop at the
+			 * first hit.
+			 */
+			for (j = MAX_NR_ZONES - 1; j > i; j--) {
+				if (!zone->lowmem_reserve[j])
+					continue;
+				max = zone->lowmem_reserve[j];
+				break;
+			}
 			/* we treat the high watermark as reserved pages. */
 			max += high_wmark_pages(zone);
@@ -6313,7 +6324,21 @@ static void setup_per_zone_lowmem_reserve(void)
 {
 	struct pglist_data *pgdat;
 	enum zone_type i, j;
-
+	/*
+	 * For a given zone node_zones[i], lowmem_reserve[j] (j > i)
+	 * represents how many pages in zone i must effectively be kept
+	 * in reserve when deciding whether an allocation class that is
+	 * allowed to allocate from zones up to j may fall back into
+	 * zone i.
+	 *
+	 * As j increases, the allocation class can use a strictly larger
+	 * set of fallback zones and therefore must not be allowed to
+	 * deplete low zones more aggressively than a less flexible one.
+	 * As a result, lowmem_reserve[j] is required to be monotonically
+	 * non-decreasing in j for each zone i. Callers such as
+	 * calculate_totalreserve_pages() rely on this monotonicity when
+	 * selecting the maximum reserve entry.
+	 */
 	for_each_online_pgdat(pgdat) {
 		for (i = 0; i < MAX_NR_ZONES - 1; i++) {
 			struct zone *zone = &pgdat->node_zones[i];
--
2.34.1