* [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support
@ 2026-04-13 22:39 Minchan Kim
2026-04-13 22:39 ` [RFC 1/3] mm: process_mrelease: expedite clean file folio reclaim via mmu_gather Minchan Kim
` (3 more replies)
0 siblings, 4 replies; 7+ messages in thread
From: Minchan Kim @ 2026-04-13 22:39 UTC (permalink / raw)
To: akpm
Cc: david, mhocko, brauner, linux-mm, linux-kernel, surenb,
timmurray, Minchan Kim
This patch series introduces optimizations to expedite memory reclamation
in process_mrelease() and provides a secure, race-free "auto-kill"
mechanism for efficient container shutdown and OOM handling.
Currently, process_mrelease() unmaps pages but leaves clean file folios
on the LRU list, relying on standard memory reclaim to eventually free
them. Furthermore, requiring userspace to send a SIGKILL prior to
invoking process_mrelease() introduces scheduling race conditions where
the victim task may enter the exit path prematurely, bypassing expedited
reclamation hooks.
This series addresses these limitations in three logical steps.
Patch #1: mm: process_mrelease: expedite clean file folio reclaim via mmu_gather
Integrates clean file folio eviction directly into the low-level TLB
batching (mmu_gather) infrastructure. Symmetrically truncates clean file
folios alongside anonymous pages during the unmap loop.
Patch #2: mm: process_mrelease: skip LRU movement for exclusive file folios
Skips costly LRU marking (folio_mark_accessed) for exclusive file-backed
folios undergoing process_mrelease reclaim. Perf profiling reveals that
LRU movement accounts for ~55% of overhead during unmap.
Patch #3: mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
Adds an auto-kill flag supporting atomic teardown. Utilizes a dedicated
signal code (KILL_MRELEASE) to guarantee MMF_UNSTABLE is marked in the
signal delivery path, preventing scheduling races.
Minchan Kim (3):
mm: process_mrelease: expedite clean file folio reclaim via mmu_gather
mm: process_mrelease: skip LRU movement for exclusive file folios
mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
arch/s390/include/asm/tlb.h | 2 +-
include/linux/swap.h | 9 ++++++---
include/uapi/asm-generic/siginfo.h | 6 ++++++
include/uapi/linux/mman.h | 4 ++++
kernel/signal.c | 4 ++++
mm/memory.c | 13 ++++++++++++-
mm/mmu_gather.c | 8 +++++---
mm/oom_kill.c | 20 +++++++++++++++++++-
mm/swap_state.c | 19 +++++++++++++++++--
9 files changed, 74 insertions(+), 11 deletions(-)
--
2.54.0.rc0.605.g598a273b03-goog
* [RFC 1/3] mm: process_mrelease: expedite clean file folio reclaim via mmu_gather
2026-04-13 22:39 [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support Minchan Kim
@ 2026-04-13 22:39 ` Minchan Kim
2026-04-14 7:45 ` David Hildenbrand (Arm)
2026-04-13 22:39 ` [RFC 2/3] mm: process_mrelease: skip LRU movement for exclusive file folios Minchan Kim
` (2 subsequent siblings)
3 siblings, 1 reply; 7+ messages in thread
From: Minchan Kim @ 2026-04-13 22:39 UTC (permalink / raw)
To: akpm
Cc: david, mhocko, brauner, linux-mm, linux-kernel, surenb,
timmurray, Minchan Kim
Currently, process_mrelease() unmaps the pages but leaves clean file
folios on the LRU list, relying on standard memory reclaim to eventually
free them. This delays the immediate recovery of system memory under OOM
or container shutdown scenarios.
This patch implements an expedited eviction mechanism for clean file
folios by integrating directly into the low-level TLB batching
infrastructure (mmu_gather).
Instead of repeatedly locking and evicting folios one by one inside the
unmap loop (zap_present_folio_ptes), we pass the MMF_UNSTABLE flag
status down to free_pages_and_swap_cache(). Within this single unified
loop, anonymous pages are released via free_swap_cache(), and
file-backed folios are symmetrically truncated via mapping_evict_folio().
This avoids introducing unnecessary data structures, preserves TLB flush
safety, and removes duplicate tree traversals, resulting in an extremely
lean and highly responsive process_mrelease() implementation.
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
arch/s390/include/asm/tlb.h | 2 +-
include/linux/swap.h | 9 ++++++---
mm/mmu_gather.c | 8 +++++---
mm/swap_state.c | 19 +++++++++++++++++--
4 files changed, 29 insertions(+), 9 deletions(-)
diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 619fd41e710e..554842345ccd 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -62,7 +62,7 @@ static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
VM_WARN_ON_ONCE(delay_rmap);
VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1));
- free_pages_and_swap_cache(encoded_pages, ARRAY_SIZE(encoded_pages));
+ free_pages_and_caches(encoded_pages, ARRAY_SIZE(encoded_pages), false);
return false;
}
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 62fc7499b408..e7b929b062f8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -433,7 +433,7 @@ static inline unsigned long total_swapcache_pages(void)
void free_swap_cache(struct folio *folio);
void free_folio_and_swap_cache(struct folio *folio);
-void free_pages_and_swap_cache(struct encoded_page **, int);
+void free_pages_and_caches(struct encoded_page **pages, int nr, bool free_unmapped_file);
/* linux/mm/swapfile.c */
extern atomic_long_t nr_swap_pages;
extern long total_swap_pages;
@@ -510,8 +510,11 @@ static inline void put_swap_device(struct swap_info_struct *si)
do { (val)->freeswap = (val)->totalswap = 0; } while (0)
#define free_folio_and_swap_cache(folio) \
folio_put(folio)
-#define free_pages_and_swap_cache(pages, nr) \
- release_pages((pages), (nr));
+static inline void free_pages_and_caches(struct encoded_page **pages,
+ int nr, bool free_unmapped_file)
+{
+ release_pages(pages, nr);
+}
static inline void free_swap_cache(struct folio *folio)
{
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index fe5b6a031717..5ce5824db07f 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -100,7 +100,8 @@ void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
*/
#define MAX_NR_FOLIOS_PER_FREE 512
-static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
+static void __tlb_batch_free_encoded_pages(struct mm_struct *mm,
+ struct mmu_gather_batch *batch)
{
struct encoded_page **pages = batch->encoded_pages;
unsigned int nr, nr_pages;
@@ -135,7 +136,8 @@ static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
}
}
- free_pages_and_swap_cache(pages, nr);
+ free_pages_and_caches(pages, nr,
+ mm_flags_test(MMF_UNSTABLE, mm));
pages += nr;
batch->nr -= nr;
@@ -148,7 +150,7 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
struct mmu_gather_batch *batch;
for (batch = &tlb->local; batch && batch->nr; batch = batch->next)
- __tlb_batch_free_encoded_pages(batch);
+ __tlb_batch_free_encoded_pages(tlb->mm, batch);
tlb->active = &tlb->local;
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 6d0eef7470be..e70a52ead6d3 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -400,11 +400,22 @@ void free_folio_and_swap_cache(struct folio *folio)
folio_put(folio);
}
+static inline void free_file_cache(struct folio *folio)
+{
+ if (folio_trylock(folio)) {
+ mapping_evict_folio(folio_mapping(folio), folio);
+ folio_unlock(folio);
+ }
+}
+
/*
* Passed an array of pages, drop them all from swapcache and then release
* them. They are removed from the LRU and freed if this is their last use.
+ *
+ * If @free_unmapped_file is true, this function will proactively evict clean
+ * file-backed folios if they are no longer mapped.
*/
-void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
+void free_pages_and_caches(struct encoded_page **pages, int nr, bool free_unmapped_file)
{
struct folio_batch folios;
unsigned int refs[PAGEVEC_SIZE];
@@ -413,7 +424,11 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
for (int i = 0; i < nr; i++) {
struct folio *folio = page_folio(encoded_page_ptr(pages[i]));
- free_swap_cache(folio);
+ if (folio_test_anon(folio))
+ free_swap_cache(folio);
+ else if (unlikely(free_unmapped_file))
+ free_file_cache(folio);
+
refs[folios.nr] = 1;
if (unlikely(encoded_page_flags(pages[i]) &
ENCODED_PAGE_BIT_NR_PAGES_NEXT))
--
2.54.0.rc0.605.g598a273b03-goog
* [RFC 2/3] mm: process_mrelease: skip LRU movement for exclusive file folios
2026-04-13 22:39 [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support Minchan Kim
2026-04-13 22:39 ` [RFC 1/3] mm: process_mrelease: expedite clean file folio reclaim via mmu_gather Minchan Kim
@ 2026-04-13 22:39 ` Minchan Kim
2026-04-14 7:20 ` David Hildenbrand (Arm)
2026-04-13 22:39 ` [RFC 3/3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag Minchan Kim
2026-04-14 6:57 ` [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support Michal Hocko
3 siblings, 1 reply; 7+ messages in thread
From: Minchan Kim @ 2026-04-13 22:39 UTC (permalink / raw)
To: akpm
Cc: david, mhocko, brauner, linux-mm, linux-kernel, surenb,
timmurray, Minchan Kim
For process_mrelease reclaim, skip LRU handling for exclusive
file-backed folios: they will be freed shortly, so moving them
around the LRU is pointless.
This avoids costly LRU movement which accounts for a significant portion
of the time during unmap_page_range.
- 91.31% 0.00% mmap_exit_test [kernel.kallsyms] [.] exit_mm
exit_mm
__mmput
exit_mmap
unmap_vmas
- unmap_page_range
- 55.75% folio_mark_accessed
+ 48.79% __folio_batch_add_and_move
4.23% workingset_activation
+ 12.94% folio_remove_rmap_ptes
+ 9.86% page_table_check_clear
+ 3.34% tlb_flush_mmu
1.06% __page_table_check_pte_clear
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
mm/memory.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/mm/memory.c b/mm/memory.c
index 2f815a34d924..25e17893c919 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1640,6 +1640,8 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
bool delay_rmap = false;
if (!folio_test_anon(folio)) {
+ bool skip_mark_accessed;
+
ptent = get_and_clear_full_ptes(mm, addr, pte, nr, tlb->fullmm);
if (pte_dirty(ptent)) {
folio_mark_dirty(folio);
@@ -1648,7 +1650,16 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
*force_flush = true;
}
}
- if (pte_young(ptent) && likely(vma_has_recency(vma)))
+
+ /*
+ * For process_mrelease reclaim, skip LRU handling for exclusive
+ * file-backed folios: they will be freed shortly, so moving them
+ * around the LRU is pointless.
+ */
+ skip_mark_accessed = mm_flags_test(MMF_UNSTABLE, mm) &&
+ folio_mapcount(folio) < 2;
+ if (likely(!skip_mark_accessed) && pte_young(ptent) &&
+ likely(vma_has_recency(vma)))
folio_mark_accessed(folio);
rss[mm_counter(folio)] -= nr;
} else {
--
2.54.0.rc0.605.g598a273b03-goog
* [RFC 3/3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
2026-04-13 22:39 [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support Minchan Kim
2026-04-13 22:39 ` [RFC 1/3] mm: process_mrelease: expedite clean file folio reclaim via mmu_gather Minchan Kim
2026-04-13 22:39 ` [RFC 2/3] mm: process_mrelease: skip LRU movement for exclusive file folios Minchan Kim
@ 2026-04-13 22:39 ` Minchan Kim
2026-04-14 6:57 ` [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support Michal Hocko
3 siblings, 0 replies; 7+ messages in thread
From: Minchan Kim @ 2026-04-13 22:39 UTC (permalink / raw)
To: akpm
Cc: david, mhocko, brauner, linux-mm, linux-kernel, surenb,
timmurray, Minchan Kim
Currently, process_mrelease() requires userspace to send a SIGKILL signal
prior to invocation. This separation introduces a race window where the
victim task may receive the signal and enter the exit path before the
reaper can invoke process_mrelease().
In this case, the victim task frees its memory via the standard, unoptimized
exit path, bypassing the expedited clean file folio reclamation optimization
introduced in the previous patch (which relies on the MMF_UNSTABLE flag).
This patch introduces the PROCESS_MRELEASE_REAP_KILL UAPI flag to support
an integrated auto-kill mode. When specified, process_mrelease() directly
injects a SIGKILL into the target task.
Crucially, this patch utilizes a dedicated signal code (KILL_MRELEASE)
during signal injection, belonging to a new SIGKILL si_codes section.
This special code ensures that the kernel's signal delivery path reliably
intercepts the request and marks the target address space as unstable
(MMF_UNSTABLE). This mechanism guarantees that the MMF_UNSTABLE flag is set
before either the victim task or the reaper proceeds, ensuring that the
expedited reclamation optimization is utilized regardless of scheduling
order.
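With the flag, the userspace flow collapses to a single call. A hedged
sketch (the helper name is invented; the flag value mirrors this patch's
uapi addition, and kernels without this series reject any non-zero flags
with EINVAL):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_process_mrelease
#define __NR_process_mrelease	448
#endif
/* Mirrors the include/uapi/linux/mman.h addition in this patch. */
#ifndef PROCESS_MRELEASE_REAP_KILL
#define PROCESS_MRELEASE_REAP_KILL	(1 << 0)
#endif

/*
 * One-step teardown: the kernel injects SIGKILL with the dedicated
 * KILL_MRELEASE si_code (marking MMF_UNSTABLE in the signal delivery
 * path) and reaps the address space in the same call, closing the
 * window between kill and reap. Requires CAP_KILL.
 *
 * Returns 0 on success, -errno on failure (-EINVAL on kernels
 * without this series).
 */
static long mrelease_reap_kill(int pidfd)
{
	long ret = syscall(__NR_process_mrelease, pidfd,
			   PROCESS_MRELEASE_REAP_KILL);

	return ret ? -errno : 0;
}
```

A container manager would pass a pidfd for the workload's leader task,
obtained via pidfd_open() or CLONE_PIDFD.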
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
include/uapi/asm-generic/siginfo.h | 6 ++++++
include/uapi/linux/mman.h | 4 ++++
kernel/signal.c | 4 ++++
mm/oom_kill.c | 20 +++++++++++++++++++-
4 files changed, 33 insertions(+), 1 deletion(-)
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index 5a1ca43b5fc6..0f59b791dab4 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -252,6 +252,12 @@ typedef struct siginfo {
#define BUS_MCEERR_AO 5
#define NSIGBUS 5
+/*
+ * SIGKILL si_codes
+ */
+#define KILL_MRELEASE 1 /* sent by process_mrelease */
+#define NSIGKILL 1
+
/*
* SIGTRAP si_codes
*/
diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index e89d00528f2f..4266976b45ad 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -56,4 +56,8 @@ struct cachestat {
__u64 nr_recently_evicted;
};
+/* Flags for process_mrelease */
+#define PROCESS_MRELEASE_REAP_KILL (1 << 0)
+#define PROCESS_MRELEASE_VALID_FLAGS (PROCESS_MRELEASE_REAP_KILL)
+
#endif /* _UAPI_LINUX_MMAN_H */
diff --git a/kernel/signal.c b/kernel/signal.c
index d65d0fe24bfb..c21b2176dc5e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1134,6 +1134,10 @@ static int __send_signal_locked(int sig, struct kernel_siginfo *info,
out_set:
signalfd_notify(t, sig);
+
+ if (sig == SIGKILL && !is_si_special(info) &&
+ info->si_code == KILL_MRELEASE && t->mm)
+ mm_flags_set(MMF_UNSTABLE, t->mm);
sigaddset(&pending->signal, sig);
/* Let multiprocess signals appear after on-going forks */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5c6c95c169ee..0b5da5208707 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -20,6 +20,8 @@
#include <linux/oom.h>
#include <linux/mm.h>
+#include <uapi/linux/mman.h>
+#include <linux/capability.h>
#include <linux/err.h>
#include <linux/gfp.h>
#include <linux/sched.h>
@@ -1218,13 +1220,29 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
bool reap = false;
long ret = 0;
- if (flags)
+ if (flags & ~PROCESS_MRELEASE_VALID_FLAGS)
return -EINVAL;
task = pidfd_get_task(pidfd, &f_flags);
if (IS_ERR(task))
return PTR_ERR(task);
+ if (flags & PROCESS_MRELEASE_REAP_KILL) {
+ struct kernel_siginfo info;
+
+ if (!capable(CAP_KILL)) {
+ ret = -EPERM;
+ goto put_task;
+ }
+ clear_siginfo(&info);
+ info.si_signo = SIGKILL;
+ info.si_code = KILL_MRELEASE;
+ info.si_pid = task_tgid_vnr(current);
+ info.si_uid = from_kuid_munged(current_user_ns(), current_uid());
+
+ do_send_sig_info(SIGKILL, &info, task, PIDTYPE_TGID);
+ }
+
/*
* Make sure to choose a thread which still has a reference to mm
* during the group exit
--
2.54.0.rc0.605.g598a273b03-goog
* Re: [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support
2026-04-13 22:39 [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support Minchan Kim
` (2 preceding siblings ...)
2026-04-13 22:39 ` [RFC 3/3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag Minchan Kim
@ 2026-04-14 6:57 ` Michal Hocko
3 siblings, 0 replies; 7+ messages in thread
From: Michal Hocko @ 2026-04-14 6:57 UTC (permalink / raw)
To: Minchan Kim
Cc: akpm, david, brauner, linux-mm, linux-kernel, surenb, timmurray
On Mon 13-04-26 15:39:45, Minchan Kim wrote:
> This patch series introduces optimizations to expedite memory reclamation
> in process_mrelease() and provides a secure, race-free "auto-kill"
> mechanism for efficient container shutdown and OOM handling.
>
> Currently, process_mrelease() unmaps pages but leaves clean file folios
> on the LRU list, relying on standard memory reclaim to eventually free
> them. Furthermore, requiring userspace to send a SIGKILL prior to
> invoking process_mrelease() introduces scheduling race conditions where
> the victim task may enter the exit path prematurely, bypassing expedited
> reclamation hooks.
>
> This series addresses these limitations in three logical steps.
>
> Patch #1: mm: process_mrelease: expedite clean file folio reclaim via mmu_gather
> Integrates clean file folio eviction directly into the low-level TLB
> batching (mmu_gather) infrastructure. Symmetrically truncates clean file
> folios alongside anonymous pages during the unmap loop.
Why do we need to care about clean page cache? Is this a form of
drop_caches?
> Patch #2: mm: process_mrelease: skip LRU movement for exclusive file folios
> Skips costly LRU marking (folio_mark_accessed) for exclusive file-backed
> folios undergoing process_mrelease reclaim. Perf profiling reveals that
> LRU movement accounts for ~55% of overhead during unmap.
OK, but why is this not desirable behavior for mrelease?
> Patch #3: mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
> Adds an auto-kill flag supporting atomic teardown. Utilizes a dedicated
> signal code (KILL_MRELEASE) to guarantee MMF_UNSTABLE is marked in the
> signal delivery path, preventing scheduling races.
Could you explain why those races are a real problem?
--
Michal Hocko
SUSE Labs
* Re: [RFC 2/3] mm: process_mrelease: skip LRU movement for exclusive file folios
2026-04-13 22:39 ` [RFC 2/3] mm: process_mrelease: skip LRU movement for exclusive file folios Minchan Kim
@ 2026-04-14 7:20 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 7+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-14 7:20 UTC (permalink / raw)
To: Minchan Kim, akpm
Cc: mhocko, brauner, linux-mm, linux-kernel, surenb, timmurray
On 4/14/26 00:39, Minchan Kim wrote:
> For the process_mrelease reclaim, skip LRU handling for exclusive
> file-backed folios since they will be freed soon so pointless
> to move around in the LRU.
>
> This avoids costly LRU movement which accounts for a significant portion
> of the time during unmap_page_range.
>
> - 91.31% 0.00% mmap_exit_test [kernel.kallsyms] [.] exit_mm
> exit_mm
> __mmput
> exit_mmap
> unmap_vmas
> - unmap_page_range
> - 55.75% folio_mark_accessed
> + 48.79% __folio_batch_add_and_move
> 4.23% workingset_activation
> + 12.94% folio_remove_rmap_ptes
> + 9.86% page_table_check_clear
> + 3.34% tlb_flush_mmu
> 1.06% __page_table_check_pte_clear
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
> mm/memory.c | 13 ++++++++++++-
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 2f815a34d924..25e17893c919 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1640,6 +1640,8 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
> bool delay_rmap = false;
>
> if (!folio_test_anon(folio)) {
> + bool skip_mark_accessed;
> +
> ptent = get_and_clear_full_ptes(mm, addr, pte, nr, tlb->fullmm);
> if (pte_dirty(ptent)) {
> folio_mark_dirty(folio);
> @@ -1648,7 +1650,16 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
> *force_flush = true;
> }
> }
> - if (pte_young(ptent) && likely(vma_has_recency(vma)))
> +
> + /*
> + * For the process_mrelease reclaim, skip LRU handling for exclusive
> + * file-backed folios since they will be freed soon so pointless
> + * to move around in the LRU.
> + */
> + skip_mark_accessed = mm_flags_test(MMF_UNSTABLE, mm) &&
> + folio_mapcount(folio) < 2;
folio_mapcount() is most certainly the wrong thing to use if you want to
handle large folios properly.
Maybe !folio_likely_mapped_shared() is what you are looking for. Maybe.
--
Cheers,
David
* Re: [RFC 1/3] mm: process_mrelease: expedite clean file folio reclaim via mmu_gather
2026-04-13 22:39 ` [RFC 1/3] mm: process_mrelease: expedite clean file folio reclaim via mmu_gather Minchan Kim
@ 2026-04-14 7:45 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 7+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-14 7:45 UTC (permalink / raw)
To: Minchan Kim, akpm
Cc: mhocko, brauner, linux-mm, linux-kernel, surenb, timmurray
On 4/14/26 00:39, Minchan Kim wrote:
> Currently, process_mrelease() unmaps the pages but leaves clean file
> folios on the LRU list, relying on standard memory reclaim to eventually
> free them. This delays the immediate recovery of system memory under OOM
> or container shutdown scenarios.
process_mrelease() calls __oom_reap_task_mm().
There, we skip any MAP_SHARED file mappings.
So I assume what you describe only applies to MAP_PRIVATE file mappings?
What about MAP_SHARED?
Also "leaves ... on the LRU list" is rather confusing. They are not
evicted and stay in the pagecache?
>
> This patch implements an expedited eviction mechanism for clean file
> folios by integrating directly into the low-level TLB batching
> infrastructure (mmu_gather).
Is this a complicated way of saying "Handle clean pagecache folios
similar to swapcache folios in mmu_gather code, dropping them from the
pagecache (i.e., evicting them) if they are completely unmapped during
reaping"?
>
> Instead of repeatedly locking and evicting folios one by one inside the
> unmap loop (zap_present_folio_ptes), we pass the MMF_UNSTABLE flag
> status down to free_pages_and_swap_cache(). Within this single unified
> loop, anonymous pages are released via free_swap_cache(), and
> file-backed folios are symmetrically truncated via mapping_evict_folio().
... where you still evict them one-by-one. Rather confusing.
>
> This avoids introducing unnecessary data structures, preserves TLB flush
> safety, and removes duplicate tree traversals, resulting in an extremely
> lean and highly responsive process_mrelease() implementation.
I don't think this paragraph adds a lot of value, really.
Which "duplicate tree traversal"? Which unnecessary data structures?
Is that AI generated text? A lot of the stuff here reads AI generated. I
have yet to meet a developer (not a sales person) that would just say
"extremely lean and highly responsive process_mrelease() implementation"
If it is AI generated, throw it away and write it yourself from scratch.
Use AI only to polish your English.
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
> arch/s390/include/asm/tlb.h | 2 +-
> include/linux/swap.h | 9 ++++++---
> mm/mmu_gather.c | 8 +++++---
> mm/swap_state.c | 19 +++++++++++++++++--
> 4 files changed, 29 insertions(+), 9 deletions(-)
>
> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
> index 619fd41e710e..554842345ccd 100644
> --- a/arch/s390/include/asm/tlb.h
> +++ b/arch/s390/include/asm/tlb.h
> @@ -62,7 +62,7 @@ static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
> VM_WARN_ON_ONCE(delay_rmap);
> VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1));
>
> - free_pages_and_swap_cache(encoded_pages, ARRAY_SIZE(encoded_pages));
> + free_pages_and_caches(encoded_pages, ARRAY_SIZE(encoded_pages), false);
As we dislike boolean parameters, we either try to avoid them (e.g., use
flags) or document the parameters using something like
"/* parameter_name= */false"
> return false;
> }
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 62fc7499b408..e7b929b062f8 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -433,7 +433,7 @@ static inline unsigned long total_swapcache_pages(void)
>
> void free_swap_cache(struct folio *folio);
> void free_folio_and_swap_cache(struct folio *folio);
> -void free_pages_and_swap_cache(struct encoded_page **, int);
> +void free_pages_and_caches(struct encoded_page **pages, int nr, bool free_unmapped_file);
> /* linux/mm/swapfile.c */
> extern atomic_long_t nr_swap_pages;
> extern long total_swap_pages;
> @@ -510,8 +510,11 @@ static inline void put_swap_device(struct swap_info_struct *si)
> do { (val)->freeswap = (val)->totalswap = 0; } while (0)
> #define free_folio_and_swap_cache(folio) \
> folio_put(folio)
> -#define free_pages_and_swap_cache(pages, nr) \
> - release_pages((pages), (nr));
> +static inline void free_pages_and_caches(struct encoded_page **pages,
> + int nr, bool free_unmapped_file)
> +{
> + release_pages(pages, nr);
> +}
Why should !CONFIG_SWAP not take care of free_unmapped_file?
>
> static inline void free_swap_cache(struct folio *folio)
> {
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index fe5b6a031717..5ce5824db07f 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -100,7 +100,8 @@ void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma)
> */
> #define MAX_NR_FOLIOS_PER_FREE 512
>
> -static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
> +static void __tlb_batch_free_encoded_pages(struct mm_struct *mm,
> + struct mmu_gather_batch *batch)
> {
> struct encoded_page **pages = batch->encoded_pages;
> unsigned int nr, nr_pages;
> @@ -135,7 +136,8 @@ static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
> }
> }
>
> - free_pages_and_swap_cache(pages, nr);
> + free_pages_and_caches(pages, nr,
> + mm_flags_test(MMF_UNSTABLE, mm));
> pages += nr;
> batch->nr -= nr;
>
> @@ -148,7 +150,7 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
> struct mmu_gather_batch *batch;
>
> for (batch = &tlb->local; batch && batch->nr; batch = batch->next)
> - __tlb_batch_free_encoded_pages(batch);
> + __tlb_batch_free_encoded_pages(tlb->mm, batch);
> tlb->active = &tlb->local;
> }
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 6d0eef7470be..e70a52ead6d3 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -400,11 +400,22 @@ void free_folio_and_swap_cache(struct folio *folio)
> folio_put(folio);
> }
>
> +static inline void free_file_cache(struct folio *folio)
> +{
> + if (folio_trylock(folio)) {
> + mapping_evict_folio(folio_mapping(folio), folio);
> + folio_unlock(folio);
> + }
> +}
> +
> /*
> * Passed an array of pages, drop them all from swapcache and then release
> * them. They are removed from the LRU and freed if this is their last use.
> + *
> + * If @free_unmapped_file is true, this function will proactively evict clean
> + * file-backed folios if they are no longer mapped.
The parameter name is not really expressive.
You are not freeing unmapped files.
"try_evict_file_folios" maybe?
mapping_evict_folio() has exactly these semantics (unmapped, clean)
> */
> -void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
> +void free_pages_and_caches(struct encoded_page **pages, int nr, bool free_unmapped_file)
> {
> struct folio_batch folios;
> unsigned int refs[PAGEVEC_SIZE];
> @@ -413,7 +424,11 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
> for (int i = 0; i < nr; i++) {
> struct folio *folio = page_folio(encoded_page_ptr(pages[i]));
>
> - free_swap_cache(folio);
> + if (folio_test_anon(folio))
> + free_swap_cache(folio);
> + else if (unlikely(free_unmapped_file))
> + free_file_cache(folio);
> +
> refs[folios.nr] = 1;
> if (unlikely(encoded_page_flags(pages[i]) &
> ENCODED_PAGE_BIT_NR_PAGES_NEXT))
--
Cheers,
David