* [PATCH] mm/vmpressure: scale window size based on machine memory and CPU count
@ 2026-02-27 22:15 Benjamin Lee McQueen
  2026-03-02  8:56 ` Michal Hocko
  0 siblings, 1 reply; 6+ messages in thread

From: Benjamin Lee McQueen @ 2026-02-27 22:15 UTC (permalink / raw)
  To: akpm, david
  Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
      linux-mm, linux-kernel, Benjamin Lee McQueen

on systems of different sizes, the fixed 512 page window may not be
suitable and can cause excessive false positive memory pressure
notifications.

or should the window size instead be capped, to avoid excessive
notification delays on very large systems?

v2: better commit message; also tried to fix the whitespace.

Signed-off-by: Benjamin Lee McQueen <mcq@disroot.org>
---
 mm/vmpressure.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 3fbb86996c4d..925659f28dcb 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -32,10 +32,20 @@
  * As the vmscan reclaimer logic works with chunks which are multiple of
  * SWAP_CLUSTER_MAX, it makes sense to use it for the window size as well.
  *
- * TODO: Make the window size depend on machine size, as we do for vmstat
- *       thresholds. Currently we set it to 512 pages (2MB for 4KB pages).
+ * Window size is now scaled based on RAM and CPU size, similarly to how
+ * vmstat checks them.
  */
-static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
+static unsigned long vmpressure_win;
+
+static int __init vmpressure_win_init(void)
+{
+	unsigned long mem = totalram_pages() >> (27 - PAGE_SHIFT);
+
+	vmpressure_win = SWAP_CLUSTER_MAX * max(16UL,
+			2UL * fls(num_online_cpus()) * (1 + fls(mem)));
+	return 0;
+}
+core_initcall(vmpressure_win_init);
 
 /*
  * These thresholds are used when we account memory pressure through
-- 
2.51.0

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH] mm/vmpressure: scale window size based on machine memory and CPU count
  2026-02-27 22:15 [PATCH] mm/vmpressure: scale window size based on machine memory and CPU count Benjamin Lee McQueen
@ 2026-03-02  8:56 ` Michal Hocko
  2026-03-02 12:15   ` Lorenzo Stoakes
  0 siblings, 1 reply; 6+ messages in thread

From: Michal Hocko @ 2026-03-02 8:56 UTC (permalink / raw)
  To: Benjamin Lee McQueen
  Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
      linux-mm, linux-kernel

On Fri 27-02-26 16:15:55, Benjamin Lee McQueen wrote:
> on systems of different sizes, the fixed 512 page window may not
> be suitable and cause excessive false positive memory pressure
> notifications.

Please be more specific about the issue you are trying to have fixed.
The above is way too generic. How much memory does the system have,
what do you consider a false positive and why, what is the workload,
etc.

> or should window size be capped to avoid excessive notification
> delays on very large systems?
>
> v2: better commit msg, also tried to fix the whitespace.

Also please refrain from sending new versions in quick succession;
wait for more feedback to come first.

Last but not least, if this is more of an idea than something aimed to
be merged, make that fact explicit with an RFC prefix to PATCH.

There is much more you can read about the process in
Documentation/process/

Thanks
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH] mm/vmpressure: scale window size based on machine memory and CPU count
  2026-03-02  8:56 ` Michal Hocko
@ 2026-03-02 12:15 ` Lorenzo Stoakes
  2026-03-05  4:30   ` [RFC PATCH v3] mm/vmpressure: scale window size based on machine memory Benjamin Lee McQueen
  0 siblings, 1 reply; 6+ messages in thread

From: Lorenzo Stoakes @ 2026-03-02 12:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Benjamin Lee McQueen, akpm, david, Liam.Howlett, vbabka, rppt,
      surenb, linux-mm, linux-kernel

On Mon, Mar 02, 2026 at 09:56:38AM +0100, Michal Hocko wrote:
> On Fri 27-02-26 16:15:55, Benjamin Lee McQueen wrote:
> > on systems of different sizes, the fixed 512 page window may not
> > be suitable and cause excessive false positive memory pressure
> > notifications.
>
> Please be more specific about the issue you are trying to have fixed.
> The above is way too generic. How much memory the system has, what do
> you consider false positive and why. What is the workload. Etc...
>
> > or should window size be capped to avoid excessive notification
> > delays on very large systems?
> >
> > v2: better commit msg, also tried to fix the whitespace.
>
> Also please refrain from sending new versions in a quick succession
> and wait for more feedback to come.
>
> Last but not least if this is a more of an idea rather than something
> aimed to be merged make the fact explicit by RFC prefix to PATCH.
>
> There is much more you can read about the process in
> Documentation/process/

Agree on all points here.

I'm also concerned that by simply adding this conjectured approach
without a _lot_ of careful testing and examination of real-world cases
you risk causing issues/breaking real-world users' assumptions. This
is pretty sensitive code.

This kind of potentially invasive change should definitely be
submitted as an RFC to begin with.

> Thanks
> --
> Michal Hocko
> SUSE Labs

Cheers, Lorenzo
* [RFC PATCH v3] mm/vmpressure: scale window size based on machine memory
  2026-03-02 12:15 ` Lorenzo Stoakes
@ 2026-03-05  4:30 ` Benjamin Lee McQueen
  2026-03-06 15:52   ` Michal Hocko
  0 siblings, 1 reply; 6+ messages in thread

From: Benjamin Lee McQueen @ 2026-03-05 4:30 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Michal Hocko, Lorenzo Stoakes
  Cc: Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
      Suren Baghdasaryan, linux-mm, linux-kernel, Benjamin Lee McQueen

the vmpressure window size has been fixed at 512 pages
(SWAP_CLUSTER_MAX * 16) ever since the file's inception. a TODO in the
file notes that the window size should be scaled with machine size,
similarly to vmstat's thresholds.

the problem with a fixed window size on large memory systems: the
window fills after 512 scanned pages (SWAP_CLUSTER_MAX * 16). on a
256GB system that is 0.00076% of total memory. the reclaimer works in
chunks of 32 pages (SWAP_CLUSTER_MAX), so the window fills after only
16 reclaim cycles. here, a single bad reclaim cycle (or a few) that
reports skewed numbers has a considerable effect on the
scanned/reclaimed ratio, producing an incorrect reading. a larger
window, however, is *potentially* prone to some additional
notification latency, as more pages must be scanned before the ratio
is calculated.

this is what i consider a false positive: a notification that doesn't
correctly represent the current sustained memory pressure. false
positives are bad because applications (or possibly the system itself)
listening for these notifications are woken up unnecessarily and may
take actions that aren't warranted, instead of acting in coherence
with the actual sustained memory pressure.

i did some testing as well. the testing was performed ONLY on a 9GB VM
and nothing else.
window sizes corresponding to larger machine memory were set manually
via a debugfs knob, so my testing may be skewed: with only 9GB on the
VM, this doesn't correctly reproduce behavior on larger systems, but
it is the best i am able to do.

vmpressure_calc_level() was instrumented with a tracepoint emitting
the raw pressure value. a controlled workload (2200MB allocation into
a 2000MB cgroup with a 500MB swap cap) was run at each window size.
1000 pressure samples were collected per run, with a 50 sample warmup
discarded, repeated 5 times per window size.

the key metrics are stddev and cv% (coefficient of variation: stddev
divided by mean pressure, expressed as a percentage). cv% is
load-independent, so it is a better measurement than stddev alone. a
high cv% means the pressure signal is noisy relative to its own
average; essentially, the readings are unpredictable and unreliable. a
low cv% means the signal is stable and trustworthy.

do take the data with a grain of salt, as i probably didn't test
efficiently. this patch still needs to be tested on larger memory
systems with real workloads. but if you think there is a better way
for me or others to test this, PLEASE REACH OUT!

  Window   RAM equiv   avg stddev   avg cv%
  512      stock       45.86        91.24%
  1024     4GB         34.62        69.28%
  1792     8GB          4.03         7.97%
  2304     32GB         9.90        18.53%
  2560     64GB         9.95        18.59%
  3072     256GB       11.49        20.99%

the results show an improvement in signal quality as window size
increases. stock at 512 pages shows a cv% of 91.24%, meaning the noise
in the pressure signal is nearly as large as the signal itself; the
readings are essentially unpredictable. at the 8GB equivalent window
(1792 pages), cv% drops to 7.97%, an 11x improvement in signal
stability.

the data is consistent across 25 independent runs per window size (5
sweeps of 5 runs each). stddev and cv% barely move between sweeps,
which gives me confidence the measurement is real and not an artifact
of system state.
stddev increases slightly beyond the 8GB equivalent window, from 3.82
at win=1792 up to 11.49 at win=3072. this is expected and may also be
an artifact of testing on a 9GB machine rather than real large-memory
hardware. even at the 256GB equivalent window, cv% is 20.99%; still a
4x improvement over stock's 91.24%.

since i only have a 9GB VM, i'm setting the window size manually to
simulate larger machines, but the actual reclaim behavior of a 9GB
system doesn't match what a real 256GB machine would do. on a real
large-memory machine the reclaimer has proportionally more work to do,
and the window would fill with more representative data. this is
another reason why testing on real large-memory hardware is needed.

the formula itself isn't quite like vmstat's threshold calculation: it
uses total machine memory (RAM) only, because reclaim cost grows with
RAM, not with CPU count or other properties of the system. the
formula's floor clause also preserves the existing 512 page window on
smaller systems (512MB), so only larger systems are affected.

if there are any other questions, i can try to answer them. IF YOU CAN
TEST OR COME UP WITH BETTER METHODS PLEASE REACH OUT!

Signed-off-by: Benjamin Lee McQueen <mcq@disroot.org>
---
 mm/vmpressure.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 3fbb86996c4d..0154df4d754e 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -10,6 +10,7 @@
  */
 
 #include <linux/cgroup.h>
+#include <linux/debugfs.h>
 #include <linux/fs.h>
 #include <linux/log2.h>
 #include <linux/sched.h>
@@ -29,14 +30,19 @@
  * sizes can cause lot of false positives, but too big window size will
  * delay the notifications.
  *
- * As the vmscan reclaimer logic works with chunks which are multiple of
- * SWAP_CLUSTER_MAX, it makes sense to use it for the window size as well.
- *
- * TODO: Make the window size depend on machine size, as we do for vmstat
- *       thresholds. Currently we set it to 512 pages (2MB for 4KB pages).
+ * As of now, we use a logarithmic scale to scale the window based on
+ * machine RAM size.
  */
-static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
+static unsigned long vmpressure_win;
+
+static int __init vmpressure_win_init(void)
+{
+	unsigned long mem = totalram_pages() >> (27 - PAGE_SHIFT);
+	vmpressure_win = SWAP_CLUSTER_MAX * max(16UL, (unsigned long)fls64(mem) * 8UL);
+	return 0;
+}
+core_initcall(vmpressure_win_init);
 
 /*
  * These thresholds are used when we account memory pressure through
  * scanned/reclaimed ratio. The current values were chosen empirically. In
-- 
2.47.3
* Re: [RFC PATCH v3] mm/vmpressure: scale window size based on machine memory
  2026-03-05  4:30 ` Benjamin Lee McQueen
@ 2026-03-06 15:52 ` Michal Hocko
  2026-03-06 16:44   ` Benjamin Lee McQueen
  0 siblings, 1 reply; 6+ messages in thread

From: Michal Hocko @ 2026-03-06 15:52 UTC (permalink / raw)
  To: Benjamin Lee McQueen
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
      Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
      Suren Baghdasaryan, linux-mm, linux-kernel

On Wed 04-03-26 22:30:38, Benjamin Lee McQueen wrote:
> the vmpressure window size has been fixed at 512 pages
> (SWAP_CLUSTER_MAX * 16), ever since the file's inception. a TODO in
> the file notes that the vmpressure window size should be scaled
> similarly to vmstat's scaling, via machine size.
>
> the problem with fixed window size on large memory systems:

Thank you for this much more detailed insight into your thinking. This
is a good start. I am still missing the overall motivation though: are
you trying to address a theoretical concern (the said TODO), or do you
have a practical workload that generates bogus vmpressure events?

Also please note that vmpressure is a legacy notification interface
only available in cgroup v1. PSI (via memory.pressure or the global
/proc/pressure/) is the newer/preferred method to measure memory
pressure. See Documentation/accounting/psi.rst for more.
-- 
Michal Hocko
SUSE Labs
* Re: [RFC PATCH v3] mm/vmpressure: scale window size based on machine memory
  2026-03-06 15:52 ` Michal Hocko
@ 2026-03-06 16:44 ` Benjamin Lee McQueen
  0 siblings, 0 replies; 6+ messages in thread

From: Benjamin Lee McQueen @ 2026-03-06 16:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
      Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
      Suren Baghdasaryan, linux-mm, linux-kernel

On 2026-03-06 16:52, Michal Hocko wrote:
> On Wed 04-03-26 22:30:38, Benjamin Lee McQueen wrote:
>> the vmpressure window size has been fixed at 512 pages
>> (SWAP_CLUSTER_MAX * 16), ever since the file's inception. a TODO in
>> the file notes that the vmpressure window size should be scaled
>> similarly to vmstat's scaling, via machine size.
>>
>> the problem with fixed window size on large memory systems:
>
> Thank you for this much more detail insight into your thinking. This is
> a good start. I am still missing an overall motivation though. Are you
> trying to address a theoretical concern (the said TODO) or do you have
> any practical workload that generates bogus vmpressure events.

mostly the theoretical concern, yes: i don't personally have a
workload that requires vmpressure to not generate bogus events, but it
is still technically a "bug" that can be improved. i believe many
legacy systems that still rely on vmpressure notifications would
benefit from this.

i agree PSI is the preferred interface, but vmpressure is still used
in other production environments, like Android's LMKD, right?

-ben
end of thread, other threads:[~2026-03-06 16:44 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-27 22:15 [PATCH] mm/vmpressure: scale window size based on machine memory and CPU count Benjamin Lee McQueen
2026-03-02  8:56 ` Michal Hocko
2026-03-02 12:15   ` Lorenzo Stoakes
2026-03-05  4:30     ` [RFC PATCH v3] mm/vmpressure: scale window size based on machine memory Benjamin Lee McQueen
2026-03-06 15:52       ` Michal Hocko
2026-03-06 16:44         ` Benjamin Lee McQueen