From: Benjamin Lee McQueen <mcq@disroot.org>
To: Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
Michal Hocko <mhocko@suse.com>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R . Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Benjamin Lee McQueen <mcq@disroot.org>
Subject: [RFC PATCH v3] mm/vmpressure: scale window size based on machine memory
Date: Wed, 4 Mar 2026 22:30:38 -0600
Message-ID: <20260305043038.2176-1-mcq@disroot.org>
In-Reply-To: <beab70d3-470d-46ad-a890-76fabe49272e@lucifer.local>
The vmpressure window size has been fixed at 512 pages
(SWAP_CLUSTER_MAX * 16) since the file's inception. A TODO in the
file notes that the window size should be scaled with machine size,
as vmstat does for its thresholds.
The problem with a fixed window size on large-memory systems: the
window fills after 512 scanned pages (SWAP_CLUSTER_MAX * 16). On a
256GB system that is 0.00076% of total memory. The reclaimer works
in chunks of 32 pages (SWAP_CLUSTER_MAX), so the window fills after
only 16 reclaim cycles. At that granularity, even a single bad
reclaim cycle that reports skewed numbers has a considerable effect
on the scanned/reclaimed ratio, producing an incorrect reading.
A larger window, however, potentially adds notification latency,
since more pages must be scanned before the ratio is calculated.
This is what we consider a false positive: a notification that does
not correctly represent the current sustained memory pressure. False
positives are a problem because applications, or even the system
itself, listening for these notifications are woken up unnecessarily
and may take actions that are not warranted, instead of acting in
line with the actual sustained memory pressure.
I did some testing as well. All of it was performed on a single 9GB
VM, nothing larger. Window sizes corresponding to larger machine
memory were set manually via a debugfs knob, so the results may be
skewed: a 9GB VM cannot faithfully reproduce reclaim on larger
systems, but it is the best I am able to do.
vmpressure_calc_level() was instrumented with a tracepoint emitting
the raw pressure value. A controlled workload (a 2200MB allocation
into a 2000MB cgroup with a 500MB swap cap) was run at each window
size. 1000 pressure samples were collected per run, with a 50-sample
warmup discarded, repeated 5 times per window size.
The key metrics are stddev and cv% (coefficient of variation: stddev
divided by mean pressure, expressed as a percentage). cv% is
load-independent, so it is a better measure than stddev alone. A
high cv% means the pressure signal is noisy relative to its own
average, i.e. the readings are unpredictable and unreliable; a low
cv% means the signal is stable and trustworthy.
Do take the data with a grain of salt, as the methodology is
imperfect. This patch still needs to be tested on larger-memory
systems with real workloads. If you can think of a better way for me
or others to test this, please reach out!
Window  RAM equiv  avg stddev  avg cv%
  512   stock        45.86     91.24%
 1024   4GB          34.62     69.28%
 1792   8GB           4.03      7.97%
 2304   32GB          9.90     18.53%
 2560   64GB          9.95     18.59%
 3072   256GB        11.49     20.99%
The results show signal quality improving as the window size
increases. Stock at 512 pages shows a cv% of 91.24%, meaning the
noise in the pressure signal is nearly as large as the signal itself;
the readings are essentially unpredictable. At the 8GB-equivalent
window (1792 pages) cv% drops to 7.97%, an 11x improvement in signal
stability.
The data is consistent across 25 independent runs per window size
(5 sweeps of 5 runs each). stddev and cv% barely move between
sweeps, which gives me confidence the measurement is real and not an
artifact of system state.
stddev increases slightly beyond the 8GB-equivalent window, from
4.03 at win=1792 up to 11.49 at win=3072. This is expected, and may
also be an artifact of testing on a 9GB machine rather than real
large-memory hardware. Even at the 256GB-equivalent window cv% is
20.99%, still more than a 4x improvement over stock's 91.24%.
Since I only have a 9GB VM, I am setting the window size manually to
simulate larger machines, but the actual reclaim behavior of a 9GB
system does not match what a real 256GB machine would do. On real
large-memory hardware the reclaimer has proportionally more work to
do, and the window would fill with more representative data. This is
another reason testing on real large-memory hardware is needed.
The formula itself is not like vmstat's threshold calculation: it
uses only total machine memory (RAM), because reclaim cost grows with
RAM, not with CPU count or other properties of the system. The
formula's floor clause preserves the existing 512-page window on
small systems (below 512MB of RAM), so only larger systems are
affected.
If there are any other questions, I can try to answer them. If you
can test this, or can come up with better methods, please reach out!
Signed-off-by: Benjamin Lee McQueen <mcq@disroot.org>
---
mm/vmpressure.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 3fbb86996c4d..0154df4d754e 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -10,6 +10,7 @@
*/
#include <linux/cgroup.h>
+#include <linux/debugfs.h>
#include <linux/fs.h>
#include <linux/log2.h>
#include <linux/sched.h>
@@ -29,14 +30,19 @@
* sizes can cause lot of false positives, but too big window size will
* delay the notifications.
*
- * As the vmscan reclaimer logic works with chunks which are multiple of
- * SWAP_CLUSTER_MAX, it makes sense to use it for the window size as well.
- *
- * TODO: Make the window size depend on machine size, as we do for vmstat
- * thresholds. Currently we set it to 512 pages (2MB for 4KB pages).
+ * As of now, we use a logarithmic scale to scale the window based on
+ * machine RAM size.
*/
-static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
+static unsigned long vmpressure_win;
+
+static int __init vmpressure_win_init(void)
+{
+ unsigned long mem = totalram_pages() >> (27 - PAGE_SHIFT);
+ vmpressure_win = SWAP_CLUSTER_MAX * max(16UL, (unsigned long)fls64(mem) * 8UL);
+ return 0;
+}
+core_initcall(vmpressure_win_init);
/*
* These thresholds are used when we account memory pressure through
* scanned/reclaimed ratio. The current values were chosen empirically. In
--
2.47.3
Thread overview: 4+ messages
2026-02-27 22:15 [PATCH] mm/vmpressure: scale window size based on machine memory and CPU count Benjamin Lee McQueen
2026-03-02 8:56 ` Michal Hocko
2026-03-02 12:15 ` Lorenzo Stoakes
2026-03-05 4:30 ` Benjamin Lee McQueen [this message]