* [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
@ 2026-04-07 20:09 Joseph Salisbury
2026-04-07 21:47 ` Pedro Falcato
2026-04-07 22:44 ` John Hubbard
0 siblings, 2 replies; 6+ messages in thread
From: Joseph Salisbury @ 2026-04-07 20:09 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Chris Li, Kairui Song
Cc: Jason Gunthorpe, John Hubbard, Peter Xu, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, linux-mm, LKML
Hello,
I would like to ask for feedback on an MM performance issue triggered by
stress-ng's mremap stressor:
stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --metrics-brief
This was first investigated as a possible regression from 0ca0c24e3211
("mm: store zero pages to be swapped out in a bitmap"), but the current
evidence suggests that commit is mostly exposing an older problem for
this workload rather than directly causing it.
Observed behavior:
The metrics below are in this format:
stressor   bogo ops   real time   usr time   sys time   bogo ops/s      bogo ops/s
                        (secs)     (secs)     (secs)    (real time)  (usr+sys time)
On a 5.15-based kernel, the workload behaves much worse when swapping is
disabled:
swap enabled:
mremap 1660980 31.08 64.78 84.63 53437.09 11116.73
swap disabled:
mremap 40786258 27.94 15.41 15354.79 1459749.43 2653.59
On a 6.12-based kernel with swap enabled, the same high-system-time
behavior is also observed:
mremap 77087729 21.50 29.95 30558.08 3584738.22 2520.19
A recent 7.0-rc5-based mainline build still behaves similarly:
mremap 39208813 28.12 12.34 15318.39 1394408.50 2557.53
So this does not appear to be already fixed upstream.
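As a sanity check on the numbers, the two throughput columns can be recomputed
from the raw fields; a quick sketch using the 5.15 swap-disabled row above
(small differences come from rounding in the printed times):

```python
# Recompute stress-ng's two bogo-ops/s columns from the raw fields
# of the 5.15-based swap-disabled row above.
bogo_ops = 40786258
real_time = 27.94     # secs
usr_time = 15.41      # secs
sys_time = 15354.79   # secs

rate_real = bogo_ops / real_time             # bogo ops/s (real time)
rate_cpu = bogo_ops / (usr_time + sys_time)  # bogo ops/s (usr+sys time)

print(f"{rate_real:.2f} {rate_cpu:.2f}")
```

The usr+sys rate (~2653) is the one that collapses in the bad cases: nearly all
of the CPU time is system time.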
The current theory is that 0ca0c24e3211 exposes this specific
zero-page-heavy workload. Before that change, swap-enabled runs
actually swapped pages. After that change, zero pages are stored in the
swap bitmap instead, so the workload behaves much more like the
swap-disabled case.
Perf data supports the idea that the expensive behavior is global LRU
lock contention caused by short-lived populate/unmap churn.
The dominant stacks on the bad cases include:
vm_mmap_pgoff
__mm_populate
populate_vma_page_range
lru_add_drain
folio_batch_move_lru
folio_lruvec_lock_irqsave
native_queued_spin_lock_slowpath
and:
__x64_sys_munmap
__vm_munmap
...
release_pages
folios_put_refs
__page_cache_release
folio_lruvec_relock_irqsave
native_queued_spin_lock_slowpath
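A back-of-the-envelope model (not kernel code, just the batch arithmetic) of
why these stacks are so expensive: lru_add_drain() flushes the per-CPU folio
batch under lruvec->lru_lock, so a workload that populates one page and
immediately drains pays one lock round-trip per page, while an undisturbed
batch amortizes that cost over its whole capacity:

```python
def lru_lock_acquisitions(pages: int, batch_size: int,
                          drain_each_populate: bool) -> int:
    """Model lruvec->lru_lock round-trips for `pages` single-page
    populate cycles with a per-CPU folio batch of `batch_size`."""
    if drain_each_populate:
        # Draining after every single-page populate flushes a
        # one-entry batch: one lock acquisition per page.
        return pages
    # Otherwise the batch flushes only when full.
    return -(-pages // batch_size)  # ceil division

N = 1_000_000
# A folio batch holds a small fixed number of folios (15 on older
# kernels, 31 on newer ones; the exact size is version-dependent).
print(lru_lock_acquisitions(N, 15, True))   # one lock round-trip per page
print(lru_lock_acquisitions(N, 15, False))  # ~15x fewer
```

With 8192 stressors all funneling into the same lock, the per-page case is what
shows up as native_queued_spin_lock_slowpath time in the profiles.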
It was also found that adding '--mremap-numa' changes the behavior
substantially:
stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --mremap-numa
--metrics-brief
mremap 2570798 29.39 8.06 106.23 87466.50 22494.74
So it's possible that either actual swapping, or the mbind(...,
MPOL_MF_MOVE) path used by '--mremap-numa', removes most of the
excessive system time.
Does this look like a known MM scalability issue around short-lived
MAP_POPULATE / munmap churn?
REPRODUCER:
The issue is reproducible with stress-ng's mremap stressor:
stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --metrics-brief
On older kernels, the bad behavior is easiest to expose by disabling
swap first:
swapoff -a
stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --metrics-brief
On kernels with 0ca0c24e3211 ("mm: store zero pages to be swapped out in
a bitmap") or newer, the same bad behavior can be seen even with swap
enabled, because this zero-page-heavy workload no longer actually swaps
pages and behaves much like the swap-disabled case.
Typical bad-case behavior:
- Very large aggregate sys time during a 30s run (for example, ~15000s
or higher)
- Poor bogo ops/s measured against usr+sys time (~2500 range in our tests)
- Perf shows time dominated by:
vm_mmap_pgoff -> __mm_populate -> populate_vma_page_range ->
lru_add_drain
and
munmap -> release_pages -> __page_cache_release
with heavy time in
folio_lruvec_lock_irqsave/native_queued_spin_lock_slowpath
Diagnostic variant:
stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --mremap-numa
--metrics-brief
That variant greatly reduces the excessive system time, which is one of
the clues that the excessive system-time overhead depends on which MM
path the workload takes.
Thanks in advance!
Joe
* Re: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
2026-04-07 20:09 [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths Joseph Salisbury
@ 2026-04-07 21:47 ` Pedro Falcato
2026-04-08 8:09 ` David Hildenbrand (Arm)
2026-04-07 22:44 ` John Hubbard
1 sibling, 1 reply; 6+ messages in thread
From: Pedro Falcato @ 2026-04-07 21:47 UTC (permalink / raw)
To: Joseph Salisbury
Cc: Andrew Morton, David Hildenbrand, Chris Li, Kairui Song,
Jason Gunthorpe, John Hubbard, Peter Xu, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, linux-mm, LKML
Hi,
On Tue, Apr 07, 2026 at 04:09:20PM -0400, Joseph Salisbury wrote:
> Hello,
>
> I would like to ask for feedback on an MM performance issue triggered by
> stress-ng's mremap stressor:
>
> stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --metrics-brief
>
> This was first investigated as a possible regression from 0ca0c24e3211 ("mm:
> store zero pages to be swapped out in a bitmap"), but the current evidence
> suggests that commit is mostly exposing an older problem for this workload
> rather than directly causing it.
>
>
> Observed behavior:
>
> The metrics below are in this format:
> stressor   bogo ops   real time   usr time   sys time   bogo ops/s      bogo ops/s
>                         (secs)     (secs)     (secs)    (real time)  (usr+sys time)
>
> On a 5.15-based kernel, the workload behaves much worse when swapping is
> disabled:
>
> swap enabled:
> mremap 1660980 31.08 64.78 84.63 53437.09 11116.73
>
> swap disabled:
> mremap 40786258 27.94 15.41 15354.79 1459749.43 2653.59
>
> On a 6.12-based kernel with swap enabled, the same high-system-time behavior
> is also observed:
>
> mremap 77087729 21.50 29.95 30558.08 3584738.22 2520.19
>
> A recent 7.0-rc5-based mainline build still behaves similarly:
>
> mremap 39208813 28.12 12.34 15318.39 1394408.50 2557.53
>
> So this does not appear to be already fixed upstream.
>
>
>
> The current theory is that 0ca0c24e3211 exposes this specific
> zero-page-heavy workload. Before that change, swap-enabled runs actually
> swapped pages. After that change, zero pages are stored in the swap bitmap
> instead, so the workload behaves much more like the swap-disabled case.
>
> Perf data supports the idea that the expensive behavior is global LRU lock
> contention caused by short-lived populate/unmap churn.
>
> The dominant stacks on the bad cases include:
>
> vm_mmap_pgoff
> __mm_populate
> populate_vma_page_range
> lru_add_drain
> folio_batch_move_lru
> folio_lruvec_lock_irqsave
> native_queued_spin_lock_slowpath
>
> and:
>
> __x64_sys_munmap
> __vm_munmap
> ...
> release_pages
> folios_put_refs
> __page_cache_release
> folio_lruvec_relock_irqsave
> native_queued_spin_lock_slowpath
>
Yes, this is a known problem. The lruvec locks are gigantic and, despite the
LRU cache in front of them, they are still problematic. It might be argued that
the current cache is downright useless for populate, as it's too small to hold
a significant number of folios. Perhaps worth thinking about, but not trivial
to change given the way things are structured and the way folio batches work.
You should be able to see this on any workload that does lots of page faulting
or population (it's not dependent on mremap at all).
>
>
> It was also found that adding '--mremap-numa' changes the behavior
> substantially:
"assign memory mapped pages to randomly selected NUMA nodes. This is
disabled for systems that do not support NUMA."
so this is just sharding your lock contention across your NUMA nodes (you
have an lruvec per node).
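To make the sharding effect concrete — a rough model only, assuming pages land
on uniformly random nodes and one lruvec lock per node (hypothetical numbers,
not measurements):

```python
import random

def max_threads_per_lock(threads: int, locks: int, seed: int = 0) -> int:
    """Rough model: each stressor contends on the lruvec lock of the
    node its pages land on. One node means one global choke point;
    random NUMA placement spreads the same threads across the locks."""
    rng = random.Random(seed)
    counts = [0] * locks
    for _ in range(threads):
        counts[rng.randrange(locks)] += 1
    return max(counts)

print(max_threads_per_lock(8192, 1))  # all 8192 stressors on one lock
print(max_threads_per_lock(8192, 8))  # roughly 1/8th of that per lock
```

Contended-spinlock cost grows quickly with the number of waiters per lock, so
even this naive split is consistent with the system-time drop seen with
'--mremap-numa'.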
>
> stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --mremap-numa
> --metrics-brief
>
> mremap 2570798 29.39 8.06 106.23 87466.50 22494.74
>
> So it's possible that either actual swapping, or the mbind(...,
> MPOL_MF_MOVE) path used by '--mremap-numa', removes most of the excessive
> system time.
>
> Does this look like a known MM scalability issue around short-lived
> MAP_POPULATE / munmap churn?
Yes. Is this an actual issue on some workload?
--
Pedro
* Re: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
2026-04-07 20:09 [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths Joseph Salisbury
2026-04-07 21:47 ` Pedro Falcato
@ 2026-04-07 22:44 ` John Hubbard
2026-04-08 0:35 ` Hugh Dickins
1 sibling, 1 reply; 6+ messages in thread
From: John Hubbard @ 2026-04-07 22:44 UTC (permalink / raw)
To: Joseph Salisbury, Andrew Morton, David Hildenbrand, Chris Li,
Kairui Song, Hugh Dickins
Cc: Jason Gunthorpe, Peter Xu, Kemeng Shi, Nhat Pham, Baoquan He,
Barry Song, linux-mm, LKML
On 4/7/26 1:09 PM, Joseph Salisbury wrote:
> Hello,
>
> I would like to ask for feedback on an MM performance issue triggered by
> stress-ng's mremap stressor:
>
> stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --metrics-brief
>
> This was first investigated as a possible regression from 0ca0c24e3211
> ("mm: store zero pages to be swapped out in a bitmap"), but the current
> evidence suggests that commit is mostly exposing an older problem for
> this workload rather than directly causing it.
>
Can you try this out? (Adding Hugh to Cc.)
From: John Hubbard <jhubbard@nvidia.com>
Date: Tue, 7 Apr 2026 15:33:47 -0700
Subject: [PATCH] mm/gup: skip lru_add_drain() for non-locked populate
X-NVConfidentiality: public
Cc: John Hubbard <jhubbard@nvidia.com>
populate_vma_page_range() calls lru_add_drain() unconditionally after
__get_user_pages(). With high-frequency single-page MAP_POPULATE/munmap
cycles at high thread counts, this forces a lruvec->lru_lock acquire
per page, defeating per-CPU folio_batch batching.
The drain was added by commit ece369c7e104 ("mm/munlock: add
lru_add_drain() to fix memcg_stat_test") for VM_LOCKED populate, where
unevictable page stats must be accurate after faulting. Non-locked VMAs
have no such requirement. Skip the drain for them.
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
mm/gup.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/mm/gup.c b/mm/gup.c
index 8e7dc2c6ee73..2dd5de1cb5b9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1816,6 +1816,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
unsigned long nr_pages = (end - start) / PAGE_SIZE;
int local_locked = 1;
+ bool need_drain;
int gup_flags;
long ret;
@@ -1857,9 +1858,19 @@ long populate_vma_page_range(struct vm_area_struct *vma,
* We made sure addr is within a VMA, so the following will
* not result in a stack expansion that recurses back here.
*/
+ /*
+ * Read VM_LOCKED before __get_user_pages(), which may drop
+ * mmap_lock when FOLL_UNLOCKABLE is set, after which the vma
+ * must not be accessed. The read is stable: mmap_lock is held
+ * for read here, so mlock() (which needs the write lock)
+ * cannot change VM_LOCKED concurrently.
+ */
+ need_drain = vma->vm_flags & VM_LOCKED;
+
ret = __get_user_pages(mm, start, nr_pages, gup_flags,
NULL, locked ? locked : &local_locked);
- lru_add_drain();
+ if (need_drain)
+ lru_add_drain();
return ret;
}
base-commit: 3036cd0d3328220a1858b1ab390be8b562774e8a
--
2.53.0
thanks,
--
John Hubbard
* Re: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
2026-04-07 22:44 ` John Hubbard
@ 2026-04-08 0:35 ` Hugh Dickins
0 siblings, 0 replies; 6+ messages in thread
From: Hugh Dickins @ 2026-04-08 0:35 UTC (permalink / raw)
To: John Hubbard
Cc: Joseph Salisbury, Andrew Morton, David Hildenbrand, Chris Li,
Kairui Song, Hugh Dickins, Jason Gunthorpe, Peter Xu, Kemeng Shi,
Nhat Pham, Baoquan He, Barry Song, linux-mm, LKML
On Tue, 7 Apr 2026, John Hubbard wrote:
> On 4/7/26 1:09 PM, Joseph Salisbury wrote:
> > Hello,
> >
> > I would like to ask for feedback on an MM performance issue triggered by
> > stress-ng's mremap stressor:
> >
> > stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --metrics-brief
> >
> > This was first investigated as a possible regression from 0ca0c24e3211
> > ("mm: store zero pages to be swapped out in a bitmap"), but the current
> > evidence suggests that commit is mostly exposing an older problem for
> > this workload rather than directly causing it.
> >
>
> Can you try this out? (Adding Hugh to Cc.)
>
> From: John Hubbard <jhubbard@nvidia.com>
> Date: Tue, 7 Apr 2026 15:33:47 -0700
> Subject: [PATCH] mm/gup: skip lru_add_drain() for non-locked populate
> X-NVConfidentiality: public
> Cc: John Hubbard <jhubbard@nvidia.com>
>
> populate_vma_page_range() calls lru_add_drain() unconditionally after
> __get_user_pages(). With high-frequency single-page MAP_POPULATE/munmap
> cycles at high thread counts, this forces a lruvec->lru_lock acquire
> per page, defeating per-CPU folio_batch batching.
>
> The drain was added by commit ece369c7e104 ("mm/munlock: add
> lru_add_drain() to fix memcg_stat_test") for VM_LOCKED populate, where
> unevictable page stats must be accurate after faulting. Non-locked VMAs
> have no such requirement. Skip the drain for them.
>
> Cc: Hugh Dickins <hughd@google.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Thanks for the Cc. I'm not convinced that we should be making such a
change, just to avoid the stress that an avowed stresstest is showing;
but can let others debate that - and, need it be said, I have no
problem with Joseph trying your patch.
I tend to stand by my comment in that commit, that it's not just for
VM_LOCKED: I believe it's in everyone's interest that a bulk faulting
interface like populate_vma_page_range() or faultin_vma_page_range()
should drain its local pagevecs at the end, to save others sometimes
needing the much more expensive lru_add_drain_all().
But lru_add_drain() and lru_add_drain_all(): there's so much to be
said and agonized over there. They've distressed me for years, and
are a hot topic for us at present. But I won't be able to contribute
more on that subject, not this week.
Hugh
> ---
> mm/gup.c | 13 ++++++++++++-
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 8e7dc2c6ee73..2dd5de1cb5b9 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1816,6 +1816,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
> struct mm_struct *mm = vma->vm_mm;
> unsigned long nr_pages = (end - start) / PAGE_SIZE;
> int local_locked = 1;
> + bool need_drain;
> int gup_flags;
> long ret;
>
> @@ -1857,9 +1858,19 @@ long populate_vma_page_range(struct vm_area_struct *vma,
> * We made sure addr is within a VMA, so the following will
> * not result in a stack expansion that recurses back here.
> */
> + /*
> + * Read VM_LOCKED before __get_user_pages(), which may drop
> + * mmap_lock when FOLL_UNLOCKABLE is set, after which the vma
> + * must not be accessed. The read is stable: mmap_lock is held
> + * for read here, so mlock() (which needs the write lock)
> + * cannot change VM_LOCKED concurrently.
> + */
> + need_drain = vma->vm_flags & VM_LOCKED;
> +
> ret = __get_user_pages(mm, start, nr_pages, gup_flags,
> NULL, locked ? locked : &local_locked);
> - lru_add_drain();
> + if (need_drain)
> + lru_add_drain();
> return ret;
> }
>
>
> base-commit: 3036cd0d3328220a1858b1ab390be8b562774e8a
> --
> 2.53.0
>
>
> thanks,
> --
> John Hubbard
* Re: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
2026-04-07 21:47 ` Pedro Falcato
@ 2026-04-08 8:09 ` David Hildenbrand (Arm)
2026-04-08 14:27 ` [External] : " Joseph Salisbury
0 siblings, 1 reply; 6+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-08 8:09 UTC (permalink / raw)
To: Pedro Falcato, Joseph Salisbury
Cc: Andrew Morton, Chris Li, Kairui Song, Jason Gunthorpe,
John Hubbard, Peter Xu, Kemeng Shi, Nhat Pham, Baoquan He,
Barry Song, linux-mm, LKML
>>
>> It was also found that adding '--mremap-numa' changes the behavior
>> substantially:
>
> "assign memory mapped pages to randomly selected NUMA nodes. This is
> disabled for systems that do not support NUMA."
>
> so this is just sharding your lock contention across your NUMA nodes (you
> have an lruvec per node).
>
>>
>> stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --mremap-numa
>> --metrics-brief
>>
>> mremap 2570798 29.39 8.06 106.23 87466.50 22494.74
>>
>> So it's possible that either actual swapping, or the mbind(...,
>> MPOL_MF_MOVE) path used by '--mremap-numa', removes most of the excessive
>> system time.
>>
>> Does this look like a known MM scalability issue around short-lived
>> MAP_POPULATE / munmap churn?
>
> Yes. Is this an actual issue on some workload?
Same thought; it's unclear to me why we should care here. In particular,
when talking about excessive use of zero-filled pages.
--
Cheers,
David
* Re: [External] : Re: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
2026-04-08 8:09 ` David Hildenbrand (Arm)
@ 2026-04-08 14:27 ` Joseph Salisbury
0 siblings, 0 replies; 6+ messages in thread
From: Joseph Salisbury @ 2026-04-08 14:27 UTC (permalink / raw)
To: David Hildenbrand (Arm), Pedro Falcato
Cc: Andrew Morton, Chris Li, Kairui Song, Jason Gunthorpe,
John Hubbard, Peter Xu, Kemeng Shi, Nhat Pham, Baoquan He,
Barry Song, linux-mm, LKML
On 4/8/26 4:09 AM, David Hildenbrand (Arm) wrote:
>>> It was also found that adding '--mremap-numa' changes the behavior
>>> substantially:
>> "assign memory mapped pages to randomly selected NUMA nodes. This is
>> disabled for systems that do not support NUMA."
>>
>> so this is just sharding your lock contention across your NUMA nodes (you
>> have an lruvec per node).
>>
>>> stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --mremap-numa
>>> --metrics-brief
>>>
>>> mremap 2570798 29.39 8.06 106.23 87466.50 22494.74
>>>
>>> So it's possible that either actual swapping, or the mbind(...,
>>> MPOL_MF_MOVE) path used by '--mremap-numa', removes most of the excessive
>>> system time.
>>>
>>> Does this look like a known MM scalability issue around short-lived
>>> MAP_POPULATE / munmap churn?
>> Yes. Is this an actual issue on some workload?
> Same thought, it's unclear to me why we should care here. In particular,
> when talking about excessive use of zero-filled pages.
>
Currently this is only showing up with that particular stress test. We
will try John's patch and provide feedback.
Thanks for all the feedback, everyone!