* [PATCH] mm/vmpressure: scale window size based on machine memory and CPU count
@ 2026-02-27 22:15 Benjamin Lee McQueen
2026-03-02 8:56 ` Michal Hocko
0 siblings, 1 reply; 4+ messages in thread
From: Benjamin Lee McQueen @ 2026-02-27 22:15 UTC (permalink / raw)
To: akpm, david
Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
linux-mm, linux-kernel, Benjamin Lee McQueen
On systems of different sizes, the fixed 512-page window may not
be suitable and can cause excessive false-positive memory pressure
notifications.
Or should the window size be capped to avoid excessive notification
delays on very large systems?
v2: improved the commit message and fixed up whitespace.
Signed-off-by: Benjamin Lee McQueen <mcq@disroot.org>
---
mm/vmpressure.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 3fbb86996c4d..925659f28dcb 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -32,10 +32,20 @@
* As the vmscan reclaimer logic works with chunks which are multiple of
* SWAP_CLUSTER_MAX, it makes sense to use it for the window size as well.
*
- * TODO: Make the window size depend on machine size, as we do for vmstat
- * thresholds. Currently we set it to 512 pages (2MB for 4KB pages).
+ * Window size is now scaled based on RAM and CPU size, similarly to how
+ * vmstat checks them.
*/
-static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
+static unsigned long vmpressure_win;
+
+static int __init vmpressure_win_init(void)
+{
+ unsigned long mem = totalram_pages() >> (27 - PAGE_SHIFT);
+
+ vmpressure_win = SWAP_CLUSTER_MAX * max(16UL,
+ 2UL * fls(num_online_cpus()) * (1 + fls(mem)));
+ return 0;
+}
+core_initcall(vmpressure_win_init);
/*
* These thresholds are used when we account memory pressure through
--
2.51.0
* Re: [PATCH] mm/vmpressure: scale window size based on machine memory and CPU count
2026-02-27 22:15 [PATCH] mm/vmpressure: scale window size based on machine memory and CPU count Benjamin Lee McQueen
@ 2026-03-02 8:56 ` Michal Hocko
2026-03-02 12:15 ` Lorenzo Stoakes
0 siblings, 1 reply; 4+ messages in thread
From: Michal Hocko @ 2026-03-02 8:56 UTC (permalink / raw)
To: Benjamin Lee McQueen
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
linux-mm, linux-kernel
On Fri 27-02-26 16:15:55, Benjamin Lee McQueen wrote:
> On systems of different sizes, the fixed 512-page window may not
> be suitable and can cause excessive false-positive memory pressure
> notifications.
Please be more specific about the issue you are trying to fix. The
above is way too generic: how much memory does the system have, what
do you consider a false positive and why, and what is the workload, etc.
> Or should the window size be capped to avoid excessive notification
> delays on very large systems?
>
> v2: improved the commit message and fixed up whitespace.
Also, please refrain from sending new versions in quick succession;
wait for more feedback to come in.
Last but not least, if this is more of an idea than something aimed
to be merged, make that explicit with an RFC prefix on PATCH.
There is much more you can read about the process in Documentation/process/
Thanks
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm/vmpressure: scale window size based on machine memory and CPU count
2026-03-02 8:56 ` Michal Hocko
@ 2026-03-02 12:15 ` Lorenzo Stoakes
2026-03-05 4:30 ` [RFC PATCH v3] mm/vmpressure: scale window size based on machine memory Benjamin Lee McQueen
0 siblings, 1 reply; 4+ messages in thread
From: Lorenzo Stoakes @ 2026-03-02 12:15 UTC (permalink / raw)
To: Michal Hocko
Cc: Benjamin Lee McQueen, akpm, david, Liam.Howlett, vbabka, rppt,
surenb, linux-mm, linux-kernel
On Mon, Mar 02, 2026 at 09:56:38AM +0100, Michal Hocko wrote:
> On Fri 27-02-26 16:15:55, Benjamin Lee McQueen wrote:
> > On systems of different sizes, the fixed 512-page window may not
> > be suitable and can cause excessive false-positive memory pressure
> > notifications.
>
> Please be more specific about the issue you are trying to fix. The
> above is way too generic: how much memory does the system have, what
> do you consider a false positive and why, and what is the workload, etc.
>
> > Or should the window size be capped to avoid excessive notification
> > delays on very large systems?
> >
> > v2: improved the commit message and fixed up whitespace.
>
> Also, please refrain from sending new versions in quick succession;
> wait for more feedback to come in.
>
> Last but not least, if this is more of an idea than something aimed
> to be merged, make that explicit with an RFC prefix on PATCH.
>
> There is much more you can read about the process in Documentation/process/
>
Agree on all points here.
I'm also concerned that simply merging this conjectured approach
without a _lot_ of careful testing and examination of real-world cases
risks causing issues or breaking real-world users' assumptions. This
is pretty sensitive code.
This kind of potentially invasive change should definitely be
submitted as an RFC to begin with.
> Thanks
> --
> Michal Hocko
> SUSE Labs
Cheers, Lorenzo
* [RFC PATCH v3] mm/vmpressure: scale window size based on machine memory
2026-03-02 12:15 ` Lorenzo Stoakes
@ 2026-03-05 4:30 ` Benjamin Lee McQueen
0 siblings, 0 replies; 4+ messages in thread
From: Benjamin Lee McQueen @ 2026-03-05 4:30 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Michal Hocko, Lorenzo Stoakes
Cc: Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, linux-mm, linux-kernel, Benjamin Lee McQueen
The vmpressure window size has been fixed at 512 pages
(SWAP_CLUSTER_MAX * 16) ever since the file's inception. A TODO in
the file notes that the window size should be scaled with machine
size, similarly to vmstat's thresholds.
The problem with a fixed window size on large-memory systems: the
window fills after 512 scanned pages (SWAP_CLUSTER_MAX * 16). On a
256GB system that is 0.00076% of total memory. The reclaimer works in
chunks of 32 pages (SWAP_CLUSTER_MAX), so the window fills after 16
reclaim cycles. At that scale, even a single bad reclaim cycle
reporting skewed numbers has a considerable effect on the
scanned/reclaimed ratio, producing an incorrect reading.
A larger window, however, is potentially prone to additional
notification latency, as more pages must be scanned before the ratio
is calculated.
This is what we consider a false positive: a notification that
doesn't correctly represent the current sustained memory pressure.
False positives are harmful because applications, or even system
components, listening for these notifications are woken up
unnecessarily and may take actions that aren't warranted by the
actual sustained memory pressure.
I did some testing as well. The testing was performed only on a 9GB
VM. Window sizes corresponding to larger machine memory were set
manually via a debugfs knob, so the results may be skewed: a 9GB VM
doesn't correctly reproduce the reclaim behavior of larger systems,
but it is the best I am able to do.
vmpressure_calc_level() was instrumented with a tracepoint emitting
the raw pressure value. A controlled workload (a 2200MB allocation
into a 2000MB cgroup with a 500MB swap cap) was run at each window
size. 1000 pressure samples were collected per run, with a 50-sample
warmup discarded, repeated 5 times per window size.
The key metrics are stddev and cv% (coefficient of variation: stddev
divided by mean pressure, expressed as a percentage). cv% is
load-independent, so it is a better measurement than stddev alone. A
high cv% means the pressure signal is noisy relative to its own
average, i.e. the readings are unpredictable and unreliable; a low
cv% means the signal is stable and trustworthy.
Do take the data with a grain of salt, as the methodology is
imperfect. This patch still needs to be tested on larger-memory
systems with real workloads. If you think there is a better way for
me or others to test this, please reach out!
Window   RAM equiv   avg stddev   avg cv%
 512     stock        45.86       91.24%
1024     4GB          34.62       69.28%
1792     8GB           4.03        7.97%
2304     32GB          9.90       18.53%
2560     64GB          9.95       18.59%
3072     256GB        11.49       20.99%
The results show an improvement in signal quality as window size
increases. Stock at 512 pages shows a cv% of 91.24%, meaning the
noise in the pressure signal is nearly as large as the signal itself;
the readings are essentially unpredictable. At the 8GB-equivalent
window (1792 pages) cv% drops to 7.97%, an 11x improvement in signal
stability.
The data is consistent across 25 independent runs per window size
(5 sweeps of 5 runs each). stddev and cv% barely move between sweeps,
which gives me confidence the measurement is real and not an artifact
of system state.
stddev increases slightly beyond the 8GB-equivalent window, from
4.03 at win=1792 up to 11.49 at win=3072. This is expected, and may
also be an artifact of testing on a 9GB machine rather than real
large-memory hardware. Even at the 256GB-equivalent window, cv% is
20.99%: still a 4x improvement over stock's 91.24%.
Since I only have a 9GB VM, I set the window size manually to
simulate larger machines, but the actual reclaim behavior of a 9GB
system doesn't match what a real 256GB machine would do. On a real
large-memory machine the reclaimer has proportionally more work to
do, and the window would fill with more representative data. This is
another reason testing on real large-memory hardware is needed.
Unlike vmstat's threshold calculation, the formula uses only total
machine memory (RAM), because reclaim costs grow with RAM rather than
with CPU count or other properties of the system. The formula's floor
clause preserves the existing 512-page window on smaller systems
(around 512MB and below) and only affects larger ones.
If there are any other questions, I'll try to answer them. If you can
test this or come up with better methods, please reach out!
Signed-off-by: Benjamin Lee McQueen <mcq@disroot.org>
---
mm/vmpressure.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 3fbb86996c4d..0154df4d754e 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -10,6 +10,7 @@
*/
#include <linux/cgroup.h>
+#include <linux/debugfs.h>
#include <linux/fs.h>
#include <linux/log2.h>
#include <linux/sched.h>
@@ -29,14 +30,19 @@
* sizes can cause lot of false positives, but too big window size will
* delay the notifications.
*
- * As the vmscan reclaimer logic works with chunks which are multiple of
- * SWAP_CLUSTER_MAX, it makes sense to use it for the window size as well.
- *
- * TODO: Make the window size depend on machine size, as we do for vmstat
- * thresholds. Currently we set it to 512 pages (2MB for 4KB pages).
+ * As of now, we use a logarithmic scale to scale the window based on
+ * machine RAM size.
*/
-static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
+static unsigned long vmpressure_win;
+
+static int __init vmpressure_win_init(void)
+{
+ unsigned long mem = totalram_pages() >> (27 - PAGE_SHIFT);
+ vmpressure_win = SWAP_CLUSTER_MAX * max(16UL, (unsigned long)fls64(mem) * 8UL);
+ return 0;
+}
+core_initcall(vmpressure_win_init);
/*
* These thresholds are used when we account memory pressure through
* scanned/reclaimed ratio. The current values were chosen empirically. In
--
2.47.3