* [PATCH RFC v2] mm: add support for dropping LRU recency on process exit
@ 2025-05-14  7:08 Barry Song
  2025-05-20 16:19 ` Liam R. Howlett
  0 siblings, 1 reply; 4+ messages in thread
From: Barry Song @ 2025-05-14  7:08 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: linux-kernel, zhengtangquan, Barry Song, Baolin Wang,
	David Hildenbrand, Johannes Weiner, Matthew Wilcox,
	Oscar Salvador, Ryan Roberts, Zi Yan, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko

From: Barry Song <v-songbaohua@oppo.com>

Currently, both zap_pmd and zap_pte always promote young file folios,
regardless of whether the processes are dying.
However, in systems where the process recency fades upon dying, we may
want to reverse this behavior. The goal is to reclaim the folios from
the dying process as quickly as possible, allowing new processes to
acquire memory ASAP.
For example, while Firefox is killed and LibreOffice is launched,
activating Firefox's young file-backed folios makes it harder to
reclaim memory that LibreOffice doesn't use at all.

On systems like Android, processes are either explicitly stopped by
user action or reaped due to OOM after being inactive for a long time.
These processes are unlikely to restart in the near future. Rather than
promoting their folios, we skip promoting and demote their exclusive
folios so that memory can be reclaimed and made available for new
user-facing processes.

Users possibly do not care about the recency of a dying process.
However, we still need an explicit user indication to take this action.
Thus, we introduced a prctl to provide that necessary user-level hint
as suggested by Johannes and David.

We observed noticeable improvements in refaults, swap-ins, and swap-outs
on a hooked Android kernel. More data for this specific version will
follow.

Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
-v2:
 * add prctl as suggested by Johannes and David
 * demote exclusive file folios if drop_recency can apply
-v1:
 https://lore.kernel.org/linux-mm/20250412085852.48524-1-21cnbao@gmail.com/

 include/linux/mm_types.h   |  1 +
 include/uapi/linux/prctl.h |  3 +++
 kernel/sys.c               | 16 ++++++++++++++++
 mm/huge_memory.c           | 12 ++++++++++--
 mm/internal.h              | 14 ++++++++++++++
 mm/memory.c                | 12 +++++++++++-
 6 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 15808cad2bc1..84ab113c54a2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1733,6 +1733,7 @@ enum {
  * on NFS restore
  */
 //#define MMF_EXE_FILE_CHANGED	18 /* see prctl_set_mm_exe_file() */
+#define MMF_FADE_ON_DEATH	18 /* Recency is discarded on process exit */
 
 #define MMF_HAS_UPROBES		19 /* has uprobes */
 #define MMF_RECALC_UPROBES	20 /* MMF_HAS_UPROBES can be wrong */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 15c18ef4eb11..22d861157552 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -364,4 +364,7 @@ struct prctl_mm_map {
 # define PR_TIMER_CREATE_RESTORE_IDS_ON		1
 # define PR_TIMER_CREATE_RESTORE_IDS_GET	2
 
+#define PR_SET_FADE_ON_DEATH	78
+#define PR_GET_FADE_ON_DEATH	79
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index c434968e9f5d..cabe1bbb35a4 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2658,6 +2658,22 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
 		mmap_write_unlock(me->mm);
 		break;
+	case PR_GET_FADE_ON_DEATH:
+		if (arg2 || arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = !!test_bit(MMF_FADE_ON_DEATH, &me->mm->flags);
+		break;
+	case PR_SET_FADE_ON_DEATH:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		if (mmap_write_lock_killable(me->mm))
+			return -EINTR;
+		if (arg2)
+			set_bit(MMF_FADE_ON_DEATH, &me->mm->flags);
+		else
+			clear_bit(MMF_FADE_ON_DEATH, &me->mm->flags);
+		mmap_write_unlock(me->mm);
+		break;
 	case PR_MPX_ENABLE_MANAGEMENT:
 	case PR_MPX_DISABLE_MANAGEMENT:
 		/* No longer implemented: */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2780a12b25f0..c99894611d4a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2204,6 +2204,7 @@ static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pmd_t *pmd, unsigned long addr)
 {
+	bool drop_recency = false;
 	pmd_t orig_pmd;
 	spinlock_t *ptl;
 
@@ -2260,13 +2261,20 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 				add_mm_counter(tlb->mm, mm_counter_file(folio),
 					       -HPAGE_PMD_NR);
 
+			drop_recency = zap_need_to_drop_recency(tlb->mm);
 			/*
 			 * Use flush_needed to indicate whether the PMD entry
 			 * is present, instead of checking pmd_present() again.
 			 */
-			if (flush_needed && pmd_young(orig_pmd) &&
-			    likely(vma_has_recency(vma)))
+			if (flush_needed && pmd_young(orig_pmd) && !drop_recency &&
+					likely(vma_has_recency(vma)))
 				folio_mark_accessed(folio);
+			/*
+			 * Userspace explicitly marks recency to fade when the process
+			 * dies; demote exclusive file folios to aid reclamation.
+			 */
+			if (drop_recency && !folio_maybe_mapped_shared(folio))
+				deactivate_file_folio(folio);
 		}
 
 		spin_unlock(ptl);
diff --git a/mm/internal.h b/mm/internal.h
index 6b8ed2017743..af9649b3e84a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -11,6 +11,7 @@
 #include <linux/khugepaged.h>
 #include <linux/mm.h>
 #include <linux/mm_inline.h>
+#include <linux/oom.h>
 #include <linux/pagemap.h>
 #include <linux/pagewalk.h>
 #include <linux/rmap.h>
@@ -130,6 +131,19 @@ static inline int folio_nr_pages_mapped(const struct folio *folio)
 	return atomic_read(&folio->_nr_pages_mapped) & FOLIO_PAGES_MAPPED;
 }
 
+/*
+ * Returns true if the process attached to the mm is dying or undergoing
+ * OOM reaping, and its recency, explicitly marked by userspace, will also
+ * fade; otherwise, returns false.
+ */
+static inline bool zap_need_to_drop_recency(struct mm_struct *mm)
+{
+	if (!atomic_read(&mm->mm_users) || check_stable_address_space(mm))
+		return !!test_bit(MMF_FADE_ON_DEATH, &mm->flags);
+
+	return false;
+}
+
 /*
  * Retrieve the first entry of a folio based on a provided entry within the
  * folio. We cannot rely on folio->swap as there is no guarantee that it has
diff --git a/mm/memory.c b/mm/memory.c
index 5a7e4c0e89c7..6dd01a7736a8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1505,6 +1505,7 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
 		bool *force_flush, bool *force_break, bool *any_skipped)
 {
 	struct mm_struct *mm = tlb->mm;
+	bool drop_recency = false;
 	bool delay_rmap = false;
 
 	if (!folio_test_anon(folio)) {
@@ -1516,9 +1517,18 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
 				*force_flush = true;
 			}
 		}
-		if (pte_young(ptent) && likely(vma_has_recency(vma)))
+
+		drop_recency = zap_need_to_drop_recency(mm);
+		if (pte_young(ptent) && !drop_recency &&
+				likely(vma_has_recency(vma)))
 			folio_mark_accessed(folio);
 		rss[mm_counter(folio)] -= nr;
+		/*
+		 * Userspace explicitly marks recency to fade when the process dies;
+		 * demote exclusive file folios to aid reclamation.
+		 */
+		if (drop_recency && !folio_maybe_mapped_shared(folio))
+			deactivate_file_folio(folio);
 	} else {
 		/* We don't need up-to-date accessed/dirty bits. */
 		clear_full_ptes(mm, addr, pte, nr, tlb->fullmm);
-- 
2.39.3 (Apple Git-146)

^ permalink raw reply	[flat|nested] 4+ messages in thread
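Editorial sketch: how a process might opt in from userspace. This is only a minimal illustration and assumes a kernel built with this RFC applied. The option numbers 78 and 79 are taken from the patch above; they are not part of any released uapi header and may be reassigned before (or instead of) merging, so on a kernel without the patch the same call may fail or reach an unrelated prctl. The wrapper names fade_on_death_set() and fade_on_death_get() are hypothetical.

```c
#include <sys/prctl.h>

/* Values proposed by this RFC only; not in released kernel headers. */
#ifndef PR_SET_FADE_ON_DEATH
#define PR_SET_FADE_ON_DEATH 78
#endif
#ifndef PR_GET_FADE_ON_DEATH
#define PR_GET_FADE_ON_DEATH 79
#endif

/*
 * Hypothetical wrapper: ask the kernel to drop LRU recency for the
 * calling process's mm when it exits. The handler in the patch rejects
 * non-zero arg3..arg5 with -EINVAL, so zeros are passed explicitly.
 * Returns 0 on success, -1 with errno set on failure.
 */
static inline int fade_on_death_set(int enable)
{
	return prctl(PR_SET_FADE_ON_DEATH, enable ? 1UL : 0UL, 0UL, 0UL, 0UL);
}

/*
 * Hypothetical wrapper: returns 1 if the flag is set, 0 if it is clear,
 * -1 with errno set on failure. All four extra arguments must be zero.
 */
static inline int fade_on_death_get(void)
{
	return prctl(PR_GET_FADE_ON_DEATH, 0UL, 0UL, 0UL, 0UL);
}
```

With the patch as posted, the call affects only the mm of the calling process, so a process that wants the behaviour would call fade_on_death_set(1) itself at some point before it exits.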
* Re: [PATCH RFC v2] mm: add support for dropping LRU recency on process exit
  2025-05-14  7:08 [PATCH RFC v2] mm: add support for dropping LRU recency on process exit Barry Song
@ 2025-05-20 16:19 ` Liam R. Howlett
  2025-05-22  2:05   ` Barry Song
  0 siblings, 1 reply; 4+ messages in thread
From: Liam R. Howlett @ 2025-05-20 16:19 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, linux-kernel, zhengtangquan, Barry Song,
	Baolin Wang, David Hildenbrand, Johannes Weiner, Matthew Wilcox,
	Oscar Salvador, Ryan Roberts, Zi Yan, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko

* Barry Song <21cnbao@gmail.com> [250514 03:08]:
> From: Barry Song <v-songbaohua@oppo.com>
>
> Currently, both zap_pmd and zap_pte always promote young file folios,
> regardless of whether the processes are dying.
> However, in systems where the process recency fades upon dying, we may
> want to reverse this behavior. The goal is to reclaim the folios from
> the dying process as quickly as possible, allowing new processes to
> acquire memory ASAP.
> For example, while Firefox is killed and LibreOffice is launched,
> activating Firefox's young file-backed folios makes it harder to
> reclaim memory that LibreOffice doesn't use at all.
>
> On systems like Android, processes are either explicitly stopped by
> user action or reaped due to OOM after being inactive for a long time.
> These processes are unlikely to restart in the near future. Rather than
> promoting their folios, we skip promoting and demote their exclusive
> folios so that memory can be reclaimed and made available for new
> user-facing processes.
>
> Users possibly do not care about the recency of a dying process.
> However, we still need an explicit user indication to take this action.

Can you add why?  It'd be nice to capture the reasons pointed out in v1
discussion as they seem important to why this isn't set as a default for
all tasks.

> Thus, we introduced a prctl to provide that necessary user-level hint
> as suggested by Johannes and David.

I'm not sure it really makes much of a difference if we update the lru
or not in this case.  Johannes' point about this small change having
unknown results for the larger community is certainly the best argument
as to why we need this to be opt-in.

We should probably document it so that people can opt-in though :)

>
> We observed noticeable improvements in refaults, swap-ins, and swap-outs
> on a hooked Android kernel. More data for this specific version will
> follow.

Looking forward to the results.  What happens when I kill my app and
reopen it? (close all apps, open the one that was being annoying?)

>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
> -v2:
>  * add prctl as suggested by Johannes and David
>  * demote exclusive file folios if drop_recency can apply
> -v1:
>  https://lore.kernel.org/linux-mm/20250412085852.48524-1-21cnbao@gmail.com/
>
>  include/linux/mm_types.h   |  1 +
>  include/uapi/linux/prctl.h |  3 +++
>  kernel/sys.c               | 16 ++++++++++++++++
>  mm/huge_memory.c           | 12 ++++++++++--
>  mm/internal.h              | 14 ++++++++++++++
>  mm/memory.c                | 12 +++++++++++-
>  6 files changed, 55 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 15808cad2bc1..84ab113c54a2 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1733,6 +1733,7 @@ enum {
>   * on NFS restore
>   */
>  //#define MMF_EXE_FILE_CHANGED	18 /* see prctl_set_mm_exe_file() */
> +#define MMF_FADE_ON_DEATH	18 /* Recency is discarded on process exit */

Why is recency not in the MMF name?  Why not MMF_NO_RECENCY or
something?

I guess we are back to no space in this flag.

>
>  #define MMF_HAS_UPROBES		19 /* has uprobes */
>  #define MMF_RECALC_UPROBES	20 /* MMF_HAS_UPROBES can be wrong */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 15c18ef4eb11..22d861157552 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -364,4 +364,7 @@ struct prctl_mm_map {
>  # define PR_TIMER_CREATE_RESTORE_IDS_ON		1
>  # define PR_TIMER_CREATE_RESTORE_IDS_GET	2
>
> +#define PR_SET_FADE_ON_DEATH	78
> +#define PR_GET_FADE_ON_DEATH	79
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index c434968e9f5d..cabe1bbb35a4 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2658,6 +2658,22 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
>  		mmap_write_unlock(me->mm);
>  		break;
> +	case PR_GET_FADE_ON_DEATH:
> +		if (arg2 || arg3 || arg4 || arg5)
> +			return -EINVAL;
> +		error = !!test_bit(MMF_FADE_ON_DEATH, &me->mm->flags);
> +		break;

Is there a usecase for get?

> +	case PR_SET_FADE_ON_DEATH:

Could you just check the value prior to setting and just return if it's
what you want?  In which case, the setting is just change_bit(), and
there probably isn't a need for a get?

> +		if (arg3 || arg4 || arg5)
> +			return -EINVAL;
> +		if (mmap_write_lock_killable(me->mm))
> +			return -EINTR;
> +		if (arg2)
> +			set_bit(MMF_FADE_ON_DEATH, &me->mm->flags);
> +		else
> +			clear_bit(MMF_FADE_ON_DEATH, &me->mm->flags);
> +		mmap_write_unlock(me->mm);
> +		break;
>  	case PR_MPX_ENABLE_MANAGEMENT:
>  	case PR_MPX_DISABLE_MANAGEMENT:
>  		/* No longer implemented: */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2780a12b25f0..c99894611d4a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2204,6 +2204,7 @@ static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
>  int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		 pmd_t *pmd, unsigned long addr)
>  {
> +	bool drop_recency = false;
>  	pmd_t orig_pmd;
>  	spinlock_t *ptl;
>
> @@ -2260,13 +2261,20 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  				add_mm_counter(tlb->mm, mm_counter_file(folio),
>  					       -HPAGE_PMD_NR);
>
> +			drop_recency = zap_need_to_drop_recency(tlb->mm);
>  			/*
>  			 * Use flush_needed to indicate whether the PMD entry
>  			 * is present, instead of checking pmd_present() again.
>  			 */
> -			if (flush_needed && pmd_young(orig_pmd) &&
> -			    likely(vma_has_recency(vma)))
> +			if (flush_needed && pmd_young(orig_pmd) && !drop_recency &&
> +					likely(vma_has_recency(vma)))
>  				folio_mark_accessed(folio);
> +			/*
> +			 * Userspace explicitly marks recency to fade when the process
> +			 * dies; demote exclusive file folios to aid reclamation.
> +			 */
> +			if (drop_recency && !folio_maybe_mapped_shared(folio))
> +				deactivate_file_folio(folio);
>  		}
>
>  		spin_unlock(ptl);
> diff --git a/mm/internal.h b/mm/internal.h
> index 6b8ed2017743..af9649b3e84a 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -11,6 +11,7 @@
>  #include <linux/khugepaged.h>
>  #include <linux/mm.h>
>  #include <linux/mm_inline.h>
> +#include <linux/oom.h>
>  #include <linux/pagemap.h>
>  #include <linux/pagewalk.h>
>  #include <linux/rmap.h>
> @@ -130,6 +131,19 @@ static inline int folio_nr_pages_mapped(const struct folio *folio)
>  	return atomic_read(&folio->_nr_pages_mapped) & FOLIO_PAGES_MAPPED;
>  }
>
> +/*
> + * Returns true if the process attached to the mm is dying or undergoing
> + * OOM reaping, and its recency, explicitly marked by userspace, will also
> + * fade; otherwise, returns false.
> + */
> +static inline bool zap_need_to_drop_recency(struct mm_struct *mm)

This name is confusing.  We are zapping the need to drop the recency?  If
this returns false, then the need to drop recency is false.  It is not
very easy to read and harder to understand how it translates to the
values it returns.

How about mm_has_exit_recency(), like vma_has_recency()?
Or mmf_update_recency()?

> +{
> +	if (!atomic_read(&mm->mm_users) || check_stable_address_space(mm))

FYI, failed forks may also set the address space as unstable.

> +		return !!test_bit(MMF_FADE_ON_DEATH, &mm->flags);
> +
> +	return false;
> +}
> +
>  /*
>   * Retrieve the first entry of a folio based on a provided entry within the
>   * folio. We cannot rely on folio->swap as there is no guarantee that it has
> diff --git a/mm/memory.c b/mm/memory.c
> index 5a7e4c0e89c7..6dd01a7736a8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1505,6 +1505,7 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
>  		bool *force_flush, bool *force_break, bool *any_skipped)
>  {
>  	struct mm_struct *mm = tlb->mm;
> +	bool drop_recency = false;
>  	bool delay_rmap = false;
>
>  	if (!folio_test_anon(folio)) {
> @@ -1516,9 +1517,18 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
>  				*force_flush = true;
>  			}
>  		}
> -		if (pte_young(ptent) && likely(vma_has_recency(vma)))
> +
> +		drop_recency = zap_need_to_drop_recency(mm);
> +		if (pte_young(ptent) && !drop_recency &&
> +				likely(vma_has_recency(vma)))

I really don't like that you are calling an atomic_read() and two flag
checks every time this block of code is executed.  This must impact your
performance?

How about this:
1. Check in unmap_vmas() that the range is 0 - ULONG_MAX, and if the OOM
   flag is set.
2. set a new zap_flags_t flag (mmf_update_recency, maybe?) if
   test_bit(MMF_FADE_ON_DEATH)
3. check zap_details->zap_flags if that bit is set in this function.
4. (hopefully) profit with better performance :)

Since this really is a zap flag, it fits to make it one.  It also means
that you will not need to check an atomic and will only check the one
flag as opposed to two.

I think we can live with some user (probably syzbot) unmapping 0 -
ULONG_MAX and incorrectly checking a flag and, in the very rare case of
actually using this flag, does not do the correct LRU aging.  If you
unmap everything, we can be pretty confident that you will be on the
exit path rather quickly.

>  			folio_mark_accessed(folio);
>  		rss[mm_counter(folio)] -= nr;
> +		/*
> +		 * Userspace explicitly marks recency to fade when the process dies;
> +		 * demote exclusive file folios to aid reclamation.
> +		 */
> +		if (drop_recency && !folio_maybe_mapped_shared(folio))
> +			deactivate_file_folio(folio);

Thanks,
Liam

^ permalink raw reply	[flat|nested] 4+ messages in thread
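Editorial sketch: the four-step zap-flag idea from the review above, modelled in plain C so it can be compiled outside the kernel. This is not kernel code and not what any later version of the patch does. Every type, constant and function name below (model_mm, model_zap_details, MODEL_ZAP_DROP_RECENCY and so on) is a hypothetical stand-in. It also assumes one reading of step 1, namely that the flag is computed when the range covers the whole address space or the mm is marked unstable; the review text leaves the exact combination open.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for mm->flags bits. */
#define MODEL_MMF_UNSTABLE       (1UL << 0) /* models MMF_UNSTABLE ("OOM flag") */
#define MODEL_MMF_FADE_ON_DEATH  (1UL << 1) /* models the proposed opt-in bit */

/* Hypothetical new zap flag carried in the zap details. */
#define MODEL_ZAP_DROP_RECENCY   (1U << 0)

struct model_mm {
	unsigned long flags;
};

struct model_zap_details {
	unsigned int zap_flags;
};

/*
 * Steps 1 and 2: evaluated once per unmap call rather than once per PTE.
 * Returns the zap flags to store in the details for this unmap.
 */
static inline unsigned int model_zap_flags_for_range(const struct model_mm *mm,
						     unsigned long start,
						     unsigned long end)
{
	bool whole_space = (start == 0 && end == ULONG_MAX);
	bool unstable = (mm->flags & MODEL_MMF_UNSTABLE) != 0;
	bool opted_in = (mm->flags & MODEL_MMF_FADE_ON_DEATH) != 0;

	if ((whole_space || unstable) && opted_in)
		return MODEL_ZAP_DROP_RECENCY;
	return 0;
}

/*
 * Step 3: the per-PTE path only tests one already-computed bit; there is
 * no atomic read and no second mm flag test on the hot path.
 */
static inline bool model_should_drop_recency(const struct model_zap_details *details)
{
	return details != NULL && (details->zap_flags & MODEL_ZAP_DROP_RECENCY) != 0;
}
```

The point of the design is the split: the expensive or racy questions (is this the exit path, did userspace opt in) are answered once where the unmap starts, and the per-folio code sees only a plain flag. The accepted imprecision is the one named in the review: a caller that unmaps the full range without exiting is treated as if it were exiting.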
* Re: [PATCH RFC v2] mm: add support for dropping LRU recency on process exit
  2025-05-20 16:19 ` Liam R. Howlett
@ 2025-05-22  2:05   ` Barry Song
  2025-05-22 19:07     ` Liam R. Howlett
  0 siblings, 1 reply; 4+ messages in thread
From: Barry Song @ 2025-05-22  2:05 UTC (permalink / raw)
  To: Liam R. Howlett, Barry Song, akpm, linux-mm, linux-kernel,
	zhengtangquan, Barry Song, Baolin Wang, David Hildenbrand,
	Johannes Weiner, Matthew Wilcox, Oscar Salvador, Ryan Roberts,
	Zi Yan, Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko

Hi Liam,
I really appreciate your review, thank you!

On Wed, May 21, 2025 at 4:20 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Barry Song <21cnbao@gmail.com> [250514 03:08]:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > Currently, both zap_pmd and zap_pte always promote young file folios,
> > regardless of whether the processes are dying.
> > However, in systems where the process recency fades upon dying, we may
> > want to reverse this behavior. The goal is to reclaim the folios from
> > the dying process as quickly as possible, allowing new processes to
> > acquire memory ASAP.
> > For example, while Firefox is killed and LibreOffice is launched,
> > activating Firefox's young file-backed folios makes it harder to
> > reclaim memory that LibreOffice doesn't use at all.
> >
> > On systems like Android, processes are either explicitly stopped by
> > user action or reaped due to OOM after being inactive for a long time.
> > These processes are unlikely to restart in the near future. Rather than
> > promoting their folios, we skip promoting and demote their exclusive
> > folios so that memory can be reclaimed and made available for new
> > user-facing processes.
> >
> > Users possibly do not care about the recency of a dying process.
> > However, we still need an explicit user indication to take this action.
>
> Can you add why?  It'd be nice to capture the reasons pointed out in v1
> discussion as they seem important to why this isn't set as a default for
> all tasks.

Essentially, I took Johannes’ point (and to some extent David’s as well)
to be that it behaves somewhat unpredictably in broader application
scenarios, for example, when repeatedly executing a file in a script or
restarting an application shortly after it exits.

Also, when a shared library is mapped by multiple processes, we might
still want to retain recency information from a process that is exiting.
So we might want to do that only for exclusive folios.

This actually leads to two questions:

1. Are we confident that the recency of a dead process is no longer
   useful within a period of time?
2. Should we limit the optimization only to exclusive folios, for
   example, shared objects (.so files) that are specific to the exiting
   process?

For both questions, the answer seems to be yes. Though in the first
case, when we repeatedly restart the same application, the folios are
likely still in the LRU and may still be hit even if we unconditionally
demote them. But that's not guaranteed. So we likely need a userspace
hint to eliminate the uncertainty.

>
> > Thus, we introduced a prctl to provide that necessary user-level hint
> > as suggested by Johannes and David.
>
> I'm not sure it really makes much of a difference if we update the lru
> or not in this case.  Johannes' point about this small change having
> unknown results for the larger community is certainly the best argument
> as to why we need this to be opt-in.
>
> We should probably document it so that people can opt-in though :)
>
> >
> > We observed noticeable improvements in refaults, swap-ins, and swap-outs
> > on a hooked Android kernel. More data for this specific version will
> > follow.
>
> Looking forward to the results.  What happens when I kill my app and
> reopen it? (close all apps, open the one that was being annoying?)

I'm not sure I fully understand your question. In Android, we're
primarily concerned with smooth app switching. For example, in a
sequence like A → B → C → D → E, if we can quickly reclaim folios from
dead processes, it helps us launch new (different) apps faster.

However, if we do A → kill A → start A → kill A → start A repeatedly,
it’s likely not a problem because our memory can hold the same
application. The issue arises when memory isn’t enough to hold
A + B + C + D + E simultaneously.

I’m not overly concerned about repeatedly restarting the same
application in Android. However, for wider scenarios across various
industries, I’m uncertain.

>
> >
> > Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> > Cc: Oscar Salvador <osalvador@suse.de>
> > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > Cc: Zi Yan <ziy@nvidia.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Mike Rapoport <rppt@kernel.org>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> > -v2:
> >  * add prctl as suggested by Johannes and David
> >  * demote exclusive file folios if drop_recency can apply
> > -v1:
> >  https://lore.kernel.org/linux-mm/20250412085852.48524-1-21cnbao@gmail.com/
> >
> >  include/linux/mm_types.h   |  1 +
> >  include/uapi/linux/prctl.h |  3 +++
> >  kernel/sys.c               | 16 ++++++++++++++++
> >  mm/huge_memory.c           | 12 ++++++++++--
> >  mm/internal.h              | 14 ++++++++++++++
> >  mm/memory.c                | 12 +++++++++++-
> >  6 files changed, 55 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 15808cad2bc1..84ab113c54a2 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -1733,6 +1733,7 @@ enum {
> >   * on NFS restore
> >   */
> >  //#define MMF_EXE_FILE_CHANGED	18 /* see prctl_set_mm_exe_file() */
> > +#define MMF_FADE_ON_DEATH	18 /* Recency is discarded on process exit */
>
> Why is recency not in the MMF name?  Why not MMF_NO_RECENCY or
> something?

I included RECENCY in the name but found it too long. On the other hand,
MMF_NO_RECENCY seems insufficient to convey the true meaning, since we
do have recency; it’s just lost on death. So perhaps the original,
longer names I considered are better: MMF_RECENCY_FADE_ON_DEATH or
MMF_NO_RECENCY_ON_DEATH?

>
> I guess we are back to no space in this flag.

Yes, it is 32 bits.

>
> >
> >  #define MMF_HAS_UPROBES		19 /* has uprobes */
> >  #define MMF_RECALC_UPROBES	20 /* MMF_HAS_UPROBES can be wrong */
> > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> > index 15c18ef4eb11..22d861157552 100644
> > --- a/include/uapi/linux/prctl.h
> > +++ b/include/uapi/linux/prctl.h
> > @@ -364,4 +364,7 @@ struct prctl_mm_map {
> >  # define PR_TIMER_CREATE_RESTORE_IDS_ON		1
> >  # define PR_TIMER_CREATE_RESTORE_IDS_GET	2
> >
> > +#define PR_SET_FADE_ON_DEATH	78
> > +#define PR_GET_FADE_ON_DEATH	79
> > +
> >  #endif /* _LINUX_PRCTL_H */
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index c434968e9f5d..cabe1bbb35a4 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -2658,6 +2658,22 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> >  			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
> >  		mmap_write_unlock(me->mm);
> >  		break;
> > +	case PR_GET_FADE_ON_DEATH:
> > +		if (arg2 || arg3 || arg4 || arg5)
> > +			return -EINVAL;
> > +		error = !!test_bit(MMF_FADE_ON_DEATH, &me->mm->flags);
> > +		break;
>
> Is there a usecase for get?

Probably not. I was just trying to implement set/get as a pair. I’m
happy to remove it if you feel it’s redundant.

>
> > +	case PR_SET_FADE_ON_DEATH:
>
> Could you just check the value prior to setting and just return if it's
> what you want?  In which case, the setting is just change_bit(), and
> there probably isn't a need for a get?

Ok.

>
> > +		if (arg3 || arg4 || arg5)
> > +			return -EINVAL;
> > +		if (mmap_write_lock_killable(me->mm))
> > +			return -EINTR;
> > +		if (arg2)
> > +			set_bit(MMF_FADE_ON_DEATH, &me->mm->flags);
> > +		else
> > +			clear_bit(MMF_FADE_ON_DEATH, &me->mm->flags);
> > +		mmap_write_unlock(me->mm);
> > +		break;
> >  	case PR_MPX_ENABLE_MANAGEMENT:
> >  	case PR_MPX_DISABLE_MANAGEMENT:
> >  		/* No longer implemented: */
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 2780a12b25f0..c99894611d4a 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2204,6 +2204,7 @@ static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
> >  int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >  		 pmd_t *pmd, unsigned long addr)
> >  {
> > +	bool drop_recency = false;
> >  	pmd_t orig_pmd;
> >  	spinlock_t *ptl;
> >
> > @@ -2260,13 +2261,20 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >  				add_mm_counter(tlb->mm, mm_counter_file(folio),
> >  					       -HPAGE_PMD_NR);
> >
> > +			drop_recency = zap_need_to_drop_recency(tlb->mm);
> >  			/*
> >  			 * Use flush_needed to indicate whether the PMD entry
> >  			 * is present, instead of checking pmd_present() again.
> >  			 */
> > -			if (flush_needed && pmd_young(orig_pmd) &&
> > -			    likely(vma_has_recency(vma)))
> > +			if (flush_needed && pmd_young(orig_pmd) && !drop_recency &&
> > +					likely(vma_has_recency(vma)))
> >  				folio_mark_accessed(folio);
> > +			/*
> > +			 * Userspace explicitly marks recency to fade when the process
> > +			 * dies; demote exclusive file folios to aid reclamation.
> > +			 */
> > +			if (drop_recency && !folio_maybe_mapped_shared(folio))
> > +				deactivate_file_folio(folio);
> >  		}
> >
> >  		spin_unlock(ptl);
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 6b8ed2017743..af9649b3e84a 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -11,6 +11,7 @@
> >  #include <linux/khugepaged.h>
> >  #include <linux/mm.h>
> >  #include <linux/mm_inline.h>
> > +#include <linux/oom.h>
> >  #include <linux/pagemap.h>
> >  #include <linux/pagewalk.h>
> >  #include <linux/rmap.h>
> > @@ -130,6 +131,19 @@ static inline int folio_nr_pages_mapped(const struct folio *folio)
> >  	return atomic_read(&folio->_nr_pages_mapped) & FOLIO_PAGES_MAPPED;
> >  }
> >
> > +/*
> > + * Returns true if the process attached to the mm is dying or undergoing
> > + * OOM reaping, and its recency, explicitly marked by userspace, will also
> > + * fade; otherwise, returns false.
> > + */
> > +static inline bool zap_need_to_drop_recency(struct mm_struct *mm)
>
> This name is confusing.  We are zapping the need to drop the recency?  If
> this returns false, then the need to drop recency is false.  It is not
> very easy to read and harder to understand how it translates to the
> values it returns.
>
> How about mm_has_exit_recency(), like vma_has_recency()?
> Or mmf_update_recency()?

It seems mm_has_exit_recency() is good.

>
> > +{
> > +	if (!atomic_read(&mm->mm_users) || check_stable_address_space(mm))
>
> FYI, failed forks may also set the address space as unstable.
>
> > +		return !!test_bit(MMF_FADE_ON_DEATH, &mm->flags);
> > +
> > +	return false;
> > +}
> > +
> >  /*
> >   * Retrieve the first entry of a folio based on a provided entry within the
> >   * folio. We cannot rely on folio->swap as there is no guarantee that it has
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 5a7e4c0e89c7..6dd01a7736a8 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1505,6 +1505,7 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
> >  		bool *force_flush, bool *force_break, bool *any_skipped)
> >  {
> >  	struct mm_struct *mm = tlb->mm;
> > +	bool drop_recency = false;
> >  	bool delay_rmap = false;
> >
> >  	if (!folio_test_anon(folio)) {
> > @@ -1516,9 +1517,18 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
> >  				*force_flush = true;
> >  			}
> >  		}
> > -		if (pte_young(ptent) && likely(vma_has_recency(vma)))
> > +
> > +		drop_recency = zap_need_to_drop_recency(mm);
> > +		if (pte_young(ptent) && !drop_recency &&
> > +				likely(vma_has_recency(vma)))
>
>
> I really don't like that you are calling an atomic_read() and two flag
> checks every time this block of code is executed.  This must impact your
> performance?

Fair enough. That seems like a valid point to consider regarding atomic
operations.

>
> How about this:
> 1. Check in unmap_vmas() that the range is 0 - ULONG_MAX, and if the OOM
>    flag is set.
> 2. set a new zap_flags_t flag (mmf_update_recency, maybe?) if
>    test_bit(MMF_FADE_ON_DEATH)
> 3. check zap_details->zap_flags if that bit is set in this function.
> 4. (hopefully) profit with better performance :)
>
> Since this really is a zap flag, it fits to make it one.  It also means
> that you will not need to check an atomic and will only check the one
> flag as opposed to two.
>
> I think we can live with some user (probably syzbot) unmapping 0 -
> ULONG_MAX and incorrectly checking a flag and, in the very rare case of
> actually using this flag, does not do the correct LRU aging.  If you
> unmap everything, we can be pretty confident that you will be on the
> exit path rather quickly.

Good idea, let me give this a try.

>
> >  			folio_mark_accessed(folio);
> >  		rss[mm_counter(folio)] -= nr;
> > +		/*
> > +		 * Userspace explicitly marks recency to fade when the process dies;
> > +		 * demote exclusive file folios to aid reclamation.
> > +		 */
> > +		if (drop_recency && !folio_maybe_mapped_shared(folio))
> > +			deactivate_file_folio(folio);
>
> Thanks,
> Liam
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 4+ messages in thread
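Editorial sketch: the per-folio decision that the v2 hunks in mm/memory.c and mm/huge_memory.c encode, written as a pure function so the shared-versus-exclusive cases discussed above can be read off directly. This is a userspace model only; the enum and function names are hypothetical, and the four booleans stand in for zap_need_to_drop_recency(), pte_young()/pmd_young() (with flush_needed folded in for the PMD case), vma_has_recency() and folio_maybe_mapped_shared(). It covers the file-backed branch only, since the hunks sit inside the non-anonymous path.

```c
#include <stdbool.h>

/* Hypothetical names for the three possible outcomes for a file folio. */
enum model_lru_action {
	MODEL_LRU_NONE,          /* leave the folio's LRU state untouched */
	MODEL_LRU_MARK_ACCESSED, /* models folio_mark_accessed() */
	MODEL_LRU_DEACTIVATE     /* models deactivate_file_folio() */
};

/*
 * Mirrors the order of the checks in the patch:
 *  - when recency is being dropped, promotion is skipped entirely, and
 *    only folios not shared with another process are demoted;
 *  - otherwise the pre-patch behaviour applies: promote young folios in
 *    VMAs that track recency.
 */
static inline enum model_lru_action
model_zap_lru_action(bool drop_recency, bool young, bool vma_has_recency,
		     bool maybe_mapped_shared)
{
	if (drop_recency)
		return maybe_mapped_shared ? MODEL_LRU_NONE : MODEL_LRU_DEACTIVATE;

	if (young && vma_has_recency)
		return MODEL_LRU_MARK_ACCESSED;

	return MODEL_LRU_NONE;
}
```

Read this way, a shared library folio mapped by other processes is neither promoted nor demoted when an opted-in process dies, while a folio mapped only by the dying process is demoted regardless of whether its PTE was young.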
* Re: [PATCH RFC v2] mm: add support for dropping LRU recency on process exit
  2025-05-22  2:05   ` Barry Song
@ 2025-05-22 19:07     ` Liam R. Howlett
  0 siblings, 0 replies; 4+ messages in thread
From: Liam R. Howlett @ 2025-05-22 19:07 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, linux-kernel, zhengtangquan, Barry Song,
	Baolin Wang, David Hildenbrand, Johannes Weiner, Matthew Wilcox,
	Oscar Salvador, Ryan Roberts, Zi Yan, Lorenzo Stoakes,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko

* Barry Song <21cnbao@gmail.com> [250521 22:05]:
> Hi Liam,
> I really appreciate your review, thank you!

Thanks.

This came up when discussing another policy-type control applied to an
entire mm [1].  It looks like the consensus is that a new system call
for memory-specific controls might be the best way forward, and your
requirement fits well into this idea.

I'd be interested in hearing what you think of this plan.

[1]. https://lore.kernel.org/all/a8aedeb6-2179-4e53-8310-5b81438c2b80@redhat.com/

Regards,
Liam

^ permalink raw reply	[flat|nested] 4+ messages in thread