* [PATCH] mm/vmpressure: scale window size based on machine memory and CPU count
@ 2026-02-27 22:15 Benjamin Lee McQueen
  2026-03-02  8:56 ` Michal Hocko
  0 siblings, 1 reply; 6+ messages in thread

From: Benjamin Lee McQueen @ 2026-02-27 22:15 UTC (permalink / raw)
  To: akpm, david
  Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
      linux-mm, linux-kernel, Benjamin Lee McQueen

on systems of different sizes, the fixed 512 page window may not be
suitable and can cause excessive false positive memory pressure
notifications.

or should the window size instead be capped, to avoid excessive
notification delays on very large systems?

v2: better commit message; also tried to fix the whitespace.

Signed-off-by: Benjamin Lee McQueen <mcq@disroot.org>
---
 mm/vmpressure.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 3fbb86996c4d..925659f28dcb 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -32,10 +32,20 @@
  * As the vmscan reclaimer logic works with chunks which are multiple of
  * SWAP_CLUSTER_MAX, it makes sense to use it for the window size as well.
  *
- * TODO: Make the window size depend on machine size, as we do for vmstat
- *       thresholds. Currently we set it to 512 pages (2MB for 4KB pages).
+ * Window size is now scaled based on RAM and CPU size, similarly to how
+ * vmstat checks them.
  */
-static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
+static unsigned long vmpressure_win;
+
+static int __init vmpressure_win_init(void)
+{
+	unsigned long mem = totalram_pages() >> (27 - PAGE_SHIFT);
+
+	vmpressure_win = SWAP_CLUSTER_MAX * max(16UL,
+			2UL * fls(num_online_cpus()) * (1 + fls(mem)));
+	return 0;
+}
+core_initcall(vmpressure_win_init);
 
 /*
  * These thresholds are used when we account memory pressure through
-- 
2.51.0

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: [PATCH] mm/vmpressure: scale window size based on machine memory and CPU count
  2026-02-27 22:15 [PATCH] mm/vmpressure: scale window size based on machine memory and CPU count Benjamin Lee McQueen
@ 2026-03-02  8:56 ` Michal Hocko
  2026-03-02 12:15   ` Lorenzo Stoakes
  0 siblings, 1 reply; 6+ messages in thread

From: Michal Hocko @ 2026-03-02 8:56 UTC (permalink / raw)
  To: Benjamin Lee McQueen
  Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
      linux-mm, linux-kernel

On Fri 27-02-26 16:15:55, Benjamin Lee McQueen wrote:
> on systems of different sizes, the fixed 512 page window may not
> be suitable and cause excessive false positive memory pressure
> notifications.

Please be more specific about the issue you are trying to have fixed.
The above is way too generic. How much memory does the system have,
what do you consider a false positive and why, what is the workload,
etc.

> or should window size be capped to avoid excessive notification
> delays on very large systems?
>
> v2: better commit msg, also tried to fix the whitespace.

Also please refrain from sending new versions in quick succession;
wait for more feedback to come first.

Last but not least, if this is more of an idea than something aimed to
be merged, make that fact explicit with an RFC prefix to PATCH.

There is much more you can read about the process in
Documentation/process/

Thanks
-- 
Michal Hocko
SUSE Labs
* Re: [PATCH] mm/vmpressure: scale window size based on machine memory and CPU count
  2026-03-02  8:56 ` Michal Hocko
@ 2026-03-02 12:15 ` Lorenzo Stoakes
  2026-03-05  4:30   ` [RFC PATCH v3] mm/vmpressure: scale window size based on machine memory Benjamin Lee McQueen
  0 siblings, 1 reply; 6+ messages in thread

From: Lorenzo Stoakes @ 2026-03-02 12:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Benjamin Lee McQueen, akpm, david, Liam.Howlett, vbabka, rppt,
      surenb, linux-mm, linux-kernel

On Mon, Mar 02, 2026 at 09:56:38AM +0100, Michal Hocko wrote:
> On Fri 27-02-26 16:15:55, Benjamin Lee McQueen wrote:
> > on systems of different sizes, the fixed 512 page window may not
> > be suitable and cause excessive false positive memory pressure
> > notifications.
>
> Please be more specific about the issue you are trying to have fixed.
> The above is way too generic. How much memory the system has, what do
> you consider false positive and why. What is the workload. Etc...
>
> > or should window size be capped to avoid excessive notification
> > delays on very large systems?
> >
> > v2: better commit msg, also tried to fix the whitespace.
>
> Also please refrain from sending new versions in a quick succession
> and wait for more feedback to come.
>
> Last but not least if this is a more of an idea rather than something
> aimed to be merged make the fact explicit by RFC prefix to PATCH.
>
> There is much more you can read about the process in
> Documentation/process/

Agree on all points here.

I'm also concerned that by simply adding this conjectured approach
without a _lot_ of careful testing and examination of real-world cases
you risk causing issues/breaking real-world users' assumptions. This
is pretty sensitive code.

This kind of potentially invasive change should definitely be
submitted as an RFC to begin with.

> Thanks
> --
> Michal Hocko
> SUSE Labs

Cheers, Lorenzo
* [RFC PATCH v3] mm/vmpressure: scale window size based on machine memory
  2026-03-02 12:15 ` Lorenzo Stoakes
@ 2026-03-05  4:30 ` Benjamin Lee McQueen
  2026-03-06 15:52   ` Michal Hocko
  0 siblings, 1 reply; 6+ messages in thread

From: Benjamin Lee McQueen @ 2026-03-05 4:30 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Michal Hocko, Lorenzo Stoakes
  Cc: Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
      Suren Baghdasaryan, linux-mm, linux-kernel, Benjamin Lee McQueen

the vmpressure window size has been fixed at 512 pages
(SWAP_CLUSTER_MAX * 16) ever since the file's inception. a TODO in the
file notes that the window size should be scaled with machine size,
similarly to vmstat's thresholds.

the problem with a fixed window size on large memory systems: the
window fills after 512 scanned pages (SWAP_CLUSTER_MAX * 16). on a
256GB system that is 0.00076% of total memory. the reclaimer works in
chunks of 32 pages (SWAP_CLUSTER_MAX), so the window fills after only
16 reclaim cycles. here, a single bad reclaim cycle (or a few) that
reports skewed numbers has a considerable effect on the
scanned/reclaimed ratio, producing an incorrect reading. a larger
window, however, is *potentially* prone to some additional
notification latency, as more pages must be scanned before the ratio
is calculated.

this is what i consider a false positive: a notification that doesn't
correctly represent the current sustained memory pressure. false
positives are bad because applications (or possibly the system itself)
listening for these notifications are woken up unnecessarily and may
take actions that aren't warranted, instead of acting in coherence
with the actual sustained memory pressure.

i did some testing as well. the testing was performed ONLY on a 9GB VM
and nothing else.
window sizes corresponding to larger machine memory were set manually
via a debugfs knob, so my testing may be skewed: with only 9GB on the
VM, this doesn't correctly reproduce behavior on larger systems, but
it is the best i am able to do.

vmpressure_calc_level() was instrumented with a tracepoint emitting
the raw pressure value. a controlled workload (2200MB allocation into
a 2000MB cgroup with a 500MB swap cap) was run at each window size.
1000 pressure samples were collected per run, with a 50 sample warmup
discarded, repeated 5 times per window size.

the key metrics are stddev and cv% (coefficient of variation: stddev
divided by mean pressure, expressed as a percentage). cv% is
load-independent, so it is a better measurement than stddev alone. a
high cv% means the pressure signal is noisy relative to its own
average; essentially, the readings are unpredictable and unreliable. a
low cv% means the signal is stable and trustworthy.

do take the data with a grain of salt, as i probably didn't test
efficiently. this patch still needs to be tested on larger memory
systems with real workloads. but if you think there is a better way
for me or others to test this, PLEASE REACH OUT!

  Window   RAM equiv   avg stddev   avg cv%
  512      stock       45.86        91.24%
  1024     4GB         34.62        69.28%
  1792     8GB          4.03         7.97%
  2304     32GB         9.90        18.53%
  2560     64GB         9.95        18.59%
  3072     256GB       11.49        20.99%

the results show an improvement in signal quality as window size
increases. stock at 512 pages shows a cv% of 91.24%, meaning the noise
in the pressure signal is nearly as large as the signal itself; the
readings are essentially unpredictable. at the 8GB equivalent window
(1792 pages), cv% drops to 7.97%, an 11x improvement in signal
stability.

the data is consistent across 25 independent runs per window size (5
sweeps of 5 runs each). stddev and cv% barely move between sweeps,
which gives me confidence the measurement is real and not an artifact
of system state.
stddev increases slightly beyond the 8GB equivalent window, from 3.82
at win=1792 up to 11.49 at win=3072. this is expected and may also be
an artifact of testing on a 9GB machine rather than real large-memory
hardware. even at the 256GB equivalent window, cv% is 20.99%; still a
4x improvement over stock's 91.24%.

since i only have a 9GB VM, i'm setting the window size manually to
simulate larger machines, but the actual reclaim behavior of a 9GB
system doesn't match what a real 256GB machine would do. on a real
large-memory machine the reclaimer has proportionally more work to do,
and the window would fill with more representative data. this is
another reason why testing on real large-memory hardware is needed.

the formula itself isn't quite like vmstat's threshold calculation: it
uses total machine memory (RAM) only, because reclaim cost grows with
RAM, not with CPU count or other properties of the system. the
formula's floor clause also preserves the existing 512 page window on
smaller systems (512MB), so only larger systems are affected.

if there are any other questions, i can try to answer them. IF YOU CAN
TEST OR COME UP WITH BETTER METHODS PLEASE REACH OUT!

Signed-off-by: Benjamin Lee McQueen <mcq@disroot.org>
---
 mm/vmpressure.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 3fbb86996c4d..0154df4d754e 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -10,6 +10,7 @@
  */
 
 #include <linux/cgroup.h>
+#include <linux/debugfs.h>
 #include <linux/fs.h>
 #include <linux/log2.h>
 #include <linux/sched.h>
@@ -29,14 +30,19 @@
  * sizes can cause lot of false positives, but too big window size will
  * delay the notifications.
  *
- * As the vmscan reclaimer logic works with chunks which are multiple of
- * SWAP_CLUSTER_MAX, it makes sense to use it for the window size as well.
- *
- * TODO: Make the window size depend on machine size, as we do for vmstat
- *       thresholds. Currently we set it to 512 pages (2MB for 4KB pages).
+ * As of now, we use a logarithmic scale to scale the window based on
+ * machine RAM size.
  */
-static const unsigned long vmpressure_win = SWAP_CLUSTER_MAX * 16;
+static unsigned long vmpressure_win;
+
+static int __init vmpressure_win_init(void)
+{
+	unsigned long mem = totalram_pages() >> (27 - PAGE_SHIFT);
+	vmpressure_win = SWAP_CLUSTER_MAX * max(16UL, (unsigned long)fls64(mem) * 8UL);
+	return 0;
+}
+core_initcall(vmpressure_win_init);
 
 /*
  * These thresholds are used when we account memory pressure through
  * scanned/reclaimed ratio. The current values were chosen empirically. In
-- 
2.47.3
* Re: [RFC PATCH v3] mm/vmpressure: scale window size based on machine memory
  2026-03-05  4:30 ` Benjamin Lee McQueen
@ 2026-03-06 15:52 ` Michal Hocko
  2026-03-06 16:44   ` Benjamin Lee McQueen
  0 siblings, 1 reply; 6+ messages in thread

From: Michal Hocko @ 2026-03-06 15:52 UTC (permalink / raw)
  To: Benjamin Lee McQueen
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
      Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
      Suren Baghdasaryan, linux-mm, linux-kernel

On Wed 04-03-26 22:30:38, Benjamin Lee McQueen wrote:
> the vmpressure window size has been fixed at 512 pages
> (SWAP_CLUSTER_MAX * 16), ever since the file's inception. a TODO in
> the file notes that the vmpressure window size should be scaled
> similarly to vmstat's scaling, via machine size.
>
> the problem with fixed window size on large memory systems:

Thank you for this much more detailed insight into your thinking. This
is a good start. I am still missing the overall motivation though: are
you trying to address a theoretical concern (the said TODO), or do you
have a practical workload that generates bogus vmpressure events?

Also please note that vmpressure is a legacy notification interface
only available in cgroup v1. PSI (via memory.pressure or the global
/proc/pressure/) is the newer/preferred method to measure memory
pressure. See Documentation/accounting/psi.rst for more.
-- 
Michal Hocko
SUSE Labs
* Re: [RFC PATCH v3] mm/vmpressure: scale window size based on machine memory
  2026-03-06 15:52 ` Michal Hocko
@ 2026-03-06 16:44 ` Benjamin Lee McQueen
  0 siblings, 0 replies; 6+ messages in thread

From: Benjamin Lee McQueen @ 2026-03-06 16:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
      Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
      Suren Baghdasaryan, linux-mm, linux-kernel

On 2026-03-06 16:52, Michal Hocko wrote:
> On Wed 04-03-26 22:30:38, Benjamin Lee McQueen wrote:
>> the vmpressure window size has been fixed at 512 pages
>> (SWAP_CLUSTER_MAX * 16), ever since the file's inception. a TODO in
>> the file notes that the vmpressure window size should be scaled
>> similarly to vmstat's scaling, via machine size.
>>
>> the problem with fixed window size on large memory systems:
>
> Thank you for this much more detail insight into your thinking. This is
> a good start. I am still missing an overall motivation though. Are you
> trying to address a theoretical concern (the said TODO) or do you have
> any practical workload that generates bogus vmpressure events.

mostly the theoretical concern, yes: i don't personally have a
workload that requires vmpressure to not generate bogus events, but it
is still technically a "bug" that can be improved. i believe many
legacy systems that still rely on vmpressure notifications would
benefit from this.

i agree PSI is the preferred interface, but vmpressure is still used
in other production environments, like Android's LMKD, right?

-ben
end of thread, other threads:[~2026-03-06 16:44 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-27 22:15 [PATCH] mm/vmpressure: scale window size based on machine memory and CPU count Benjamin Lee McQueen
2026-03-02  8:56 ` Michal Hocko
2026-03-02 12:15   ` Lorenzo Stoakes
2026-03-05  4:30     ` [RFC PATCH v3] mm/vmpressure: scale window size based on machine memory Benjamin Lee McQueen
2026-03-06 15:52       ` Michal Hocko
2026-03-06 16:44         ` Benjamin Lee McQueen