* [RFC PATCH 0/2] memcg: add nomlock to avoid folios being mlocked in a memcg
@ 2024-12-15 7:34 Yafang Shao

From: Yafang Shao @ 2024-12-15 7:34 UTC
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm
Cc: linux-mm, Yafang Shao
The Use Case
============
We have a scenario where multiple services (cgroups) may share the same
file cache, as illustrated below:
  download-proxy     application
              \       /
       /shared_path/shared_files
When the application needs specific types of files, it sends an RPC request
to the download-proxy. The download-proxy then downloads the files to
shared paths, after which the application reads these shared files. All
disk I/O operations are performed using buffered I/O.
The reason for using buffered I/O, rather than direct I/O, is that the
download-proxy itself may also read these shared files. This is because it
serves as a peer-to-peer (P2P) service:
  download-proxy of server1  <- P2P ->  download-proxy of server2
   /shared_path/shared_files             /shared_path/shared_files
The Problem
===========
Applications reading these shared files may use mlock to pin the files in
memory for performance reasons. However, the shared file cache is charged
to the memory cgroup of the download-proxy during the download or P2P
process. Consequently, the page cache pages of the shared files might be
mlocked within the download-proxy's memcg, as shown:
  download-proxy     application
        |               /
    (charged)      (mlocked)
        |             /
      pagecache pages
              \
               \
        /shared_path/shared_files
This setup leads to a frequent scenario where the memory usage of the
download-proxy's memcg reaches its limit, potentially resulting in OOM
events. This behavior is undesirable.
The Solution
============
To address this, we propose introducing a new cgroup file, memory.nomlock,
which prevents page cache pages from being mlocked in a specific memcg when
set to 1.
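Assuming the interface proposed by this series (memory.nomlock does not exist in mainline kernels, and the cgroup path is illustrative), usage would look roughly like this configuration sketch:

```shell
# Keep the download-proxy's memcg from accumulating mlocked folios.
# memory.nomlock is the knob proposed by this RFC; the path is an example.
echo 1 > /sys/fs/cgroup/download-proxy/memory.nomlock

# Read the setting back; "1" means nomlock is enabled for this memcg.
cat /sys/fs/cgroup/download-proxy/memory.nomlock
```

Note the timing caveat discussed below: the knob only helps if it is set before the file caches reach the unevictable list.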
Implementation Options
----------------------
- Solution A: Allow file caches on the unevictable list to become
reclaimable.
This approach would require significant refactoring of the page reclaim
logic.
- Solution B: Prevent file caches from being moved to the unevictable list
during mlock and ignore the VM_LOCKED flag during page reclaim.
This is a more straightforward solution and is the one we have chosen.
If the file caches are reclaimed from the download-proxy's memcg and
subsequently accessed by tasks in the application’s memcg, a filemap
fault will occur. A new file cache will be faulted in, charged to the
application’s memcg, and locked there.
Current Limitations
===================
This solution is in its early stages and has the following limitations:
- Timing Dependency:
memory.nomlock must be set before file caches are moved to the
unevictable list. Otherwise, the file caches cannot be reclaimed.
- Metrics Inaccuracy:
The "unevictable" metric in memory.stat and the "Mlocked" metric in
/proc/meminfo may not be reliable. However, these metrics are already
affected by the use of large folios.
If this solution is deemed acceptable, I will proceed with refining the
implementation and addressing these limitations.
Yafang Shao (2):
  mm/memcontrol: add a new cgroup file memory.nomlock
  mm: Add support for nomlock to avoid folios being mlocked in a memcg
 include/linux/memcontrol.h |  3 +++
 mm/memcontrol.c            | 35 +++++++++++++++++++++++++++++++++++
 mm/mlock.c                 |  9 +++++++++
 mm/rmap.c                  |  8 +++++++-
 mm/vmscan.c                |  5 +++++
 5 files changed, 59 insertions(+), 1 deletion(-)
--
2.43.5
* [RFC PATCH 1/2] mm/memcontrol: add a new cgroup file memory.nomlock
@ 2024-12-15 7:34 Yafang Shao

From: Yafang Shao @ 2024-12-15 7:34 UTC
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm
Cc: linux-mm, Yafang Shao

Add a new cgroup file, memory.nomlock, to control the behavior of mlock.
This is a preparation for the follow-up patch.

  0: Disable nomlock
  1: Enable nomlock

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/memcontrol.h |  3 +++
 mm/memcontrol.c            | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b28180269e75..b3a4e28ae0f9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -220,6 +220,9 @@ struct mem_cgroup {
 	 */
 	bool oom_group;
 
+	/* Folios won't be mlocked in this memcg. */
+	bool nomlock;
+
 	int swappiness;
 
 	/* memory.events and memory.events.local */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..08bb65c1612f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4374,6 +4374,35 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 	return nbytes;
 }
 
+static int memory_nomlock_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	seq_printf(m, "%d\n", READ_ONCE(memcg->nomlock));
+	return 0;
+}
+
+static ssize_t memory_nomlock_write(struct kernfs_open_file *of,
+				    char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	int ret, nomlock;
+
+	buf = strstrip(buf);
+	if (!buf)
+		return -EINVAL;
+
+	ret = kstrtoint(buf, 0, &nomlock);
+	if (ret)
+		return ret;
+
+	if (nomlock != 0 && nomlock != 1)
+		return -EINVAL;
+
+	WRITE_ONCE(memcg->nomlock, nomlock);
+	return nbytes;
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -4445,6 +4474,12 @@ static struct cftype memory_files[] = {
 		.flags = CFTYPE_NS_DELEGATABLE,
 		.write = memory_reclaim,
 	},
+	{
+		.name = "nomlock",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_nomlock_show,
+		.write = memory_nomlock_write,
+	},
 	{ }	/* terminate */
 };
--
2.43.5
* [RFC PATCH 2/2] mm: Add support for nomlock to avoid folios being mlocked in a memcg
@ 2024-12-15 7:34 Yafang Shao

From: Yafang Shao @ 2024-12-15 7:34 UTC
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm
Cc: linux-mm, Yafang Shao

The Use Case
============

We have a scenario where multiple services (cgroups) may share the same
file cache, as illustrated below:

  download-proxy     application
              \       /
       /shared_path/shared_files

When the application needs specific types of files, it sends an RPC request
to the download-proxy. The download-proxy then downloads the files to
shared paths, after which the application reads these shared files. All
disk I/O operations are performed using buffered I/O.

The reason for using buffered I/O, rather than direct I/O, is that the
download-proxy itself may also read these shared files, because it serves
as a peer-to-peer (P2P) service:

  download-proxy of server1  <- P2P ->  download-proxy of server2
   /shared_path/shared_files             /shared_path/shared_files

The Problem
===========

Applications reading these shared files may use mlock to pin the files in
memory for performance reasons. However, the shared file cache is charged
to the memory cgroup of the download-proxy during the download or P2P
process. Consequently, the page cache pages of the shared files might be
mlocked within the download-proxy's memcg, as shown:

  download-proxy     application
        |               /
    (charged)      (mlocked)
        |             /
      pagecache pages
              \
               \
        /shared_path/shared_files

This setup leads to a frequent scenario where the memory usage of the
download-proxy's memcg reaches its limit, potentially resulting in OOM
events. This behavior is undesirable.

The Solution
============

To address this, we propose introducing a new cgroup file, memory.nomlock,
which prevents page cache pages from being mlocked in a specific memcg when
set to 1.

Implementation Options
----------------------

- Solution A: Allow file caches on the unevictable list to become
  reclaimable.
  This approach would require significant refactoring of the page reclaim
  logic.

- Solution B: Prevent file caches from being moved to the unevictable list
  during mlock and ignore the VM_LOCKED flag during page reclaim.
  This is a more straightforward solution and is the one we have chosen.
  If the file caches are reclaimed from the download-proxy's memcg and
  subsequently accessed by tasks in the application's memcg, a filemap
  fault will occur. A new file cache will be faulted in, charged to the
  application's memcg, and locked there.

Current Limitations
===================

This solution is in its early stages and has the following limitations:

- Timing Dependency:
  memory.nomlock must be set before file caches are moved to the
  unevictable list. Otherwise, the file caches cannot be reclaimed.

- Metrics Inaccuracy:
  The "unevictable" metric in memory.stat and the "Mlocked" metric in
  /proc/meminfo may not be reliable. However, these metrics are already
  affected by the use of large folios.

If this solution is deemed acceptable, I will proceed with refining the
implementation.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 mm/mlock.c  | 9 +++++++++
 mm/rmap.c   | 8 +++++++-
 mm/vmscan.c | 5 +++++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index cde076fa7d5e..9cebcf13929f 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -186,6 +186,7 @@ static inline struct folio *mlock_new(struct folio *folio)
 static void mlock_folio_batch(struct folio_batch *fbatch)
 {
 	struct lruvec *lruvec = NULL;
+	struct mem_cgroup *memcg;
 	unsigned long mlock;
 	struct folio *folio;
 	int i;
@@ -196,6 +197,10 @@ static void mlock_folio_batch(struct folio_batch *fbatch)
 		folio = (struct folio *)((unsigned long)folio - mlock);
 		fbatch->folios[i] = folio;
 
+		memcg = folio_memcg(folio);
+		if (memcg && memcg->nomlock && mlock)
+			continue;
+
 		if (mlock & LRU_FOLIO)
 			lruvec = __mlock_folio(folio, lruvec);
 		else if (mlock & NEW_FOLIO)
@@ -241,8 +246,12 @@ bool need_mlock_drain(int cpu)
  */
 void mlock_folio(struct folio *folio)
 {
+	struct mem_cgroup *memcg = folio_memcg(folio);
 	struct folio_batch *fbatch;
 
+	if (memcg && memcg->nomlock)
+		return;
+
 	local_lock(&mlock_fbatch.lock);
 	fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index c6c4d4ea29a7..6f16f86f9274 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -853,11 +853,17 @@ static bool folio_referenced_one(struct folio *folio,
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
 	int referenced = 0;
 	unsigned long start = address, ptes = 0;
+	bool ignore_mlock = false;
+	struct mem_cgroup *memcg;
+
+	memcg = folio_memcg(folio);
+	if (memcg && memcg->nomlock)
+		ignore_mlock = true;
 
 	while (page_vma_mapped_walk(&pvmw)) {
 		address = pvmw.address;
 
-		if (vma->vm_flags & VM_LOCKED) {
+		if (!ignore_mlock && vma->vm_flags & VM_LOCKED) {
 			if (!folio_test_large(folio) || !pvmw.pte) {
 				/* Restore the mlock which got missed */
 				mlock_vma_folio(folio, vma);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fd55c3ec0054..defd36be28e9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1283,6 +1283,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		if (folio_mapped(folio)) {
 			enum ttu_flags flags = TTU_BATCH_FLUSH;
 			bool was_swapbacked = folio_test_swapbacked(folio);
+			struct mem_cgroup *memcg;
 
 			if (folio_test_pmd_mappable(folio))
 				flags |= TTU_SPLIT_HUGE_PMD;
@@ -1301,6 +1302,10 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			if (folio_test_large(folio))
 				flags |= TTU_SYNC;
 
+			memcg = folio_memcg(folio);
+			if (memcg && memcg->nomlock)
+				flags |= TTU_IGNORE_MLOCK;
+
 			try_to_unmap(folio, flags);
 			if (folio_mapped(folio)) {
 				stat->nr_unmap_fail += nr_pages;
--
2.43.5
* Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios being mlocked in a memcg
@ 2024-12-20 10:23 Michal Hocko

From: Michal Hocko @ 2024-12-20 10:23 UTC
To: Yafang Shao
Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, linux-mm

On Sun 15-12-24 15:34:13, Yafang Shao wrote:
> Implementation Options
> ----------------------
>
> - Solution A: Allow file caches on the unevictable list to become
>   reclaimable.
>   This approach would require significant refactoring of the page reclaim
>   logic.
>
> - Solution B: Prevent file caches from being moved to the unevictable list
>   during mlock and ignore the VM_LOCKED flag during page reclaim.
>   This is a more straightforward solution and is the one we have chosen.
>   If the file caches are reclaimed from the download-proxy's memcg and
>   subsequently accessed by tasks in the application's memcg, a filemap
>   fault will occur. A new file cache will be faulted in, charged to the
>   application's memcg, and locked there.

Both options silently break userspace, because a non-failing mlock no
longer gives the guarantees it is supposed to, AFAICS. So unless I am
missing something really important, I do not think this is an acceptable
memcg extension.

--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios being mlocked in a memcg
@ 2024-12-20 11:52 Yafang Shao

From: Yafang Shao @ 2024-12-20 11:52 UTC
To: Michal Hocko
Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, linux-mm

On Fri, Dec 20, 2024 at 6:23 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Sun 15-12-24 15:34:13, Yafang Shao wrote:
> > Implementation Options
> > ----------------------
> >
> > - Solution A: Allow file caches on the unevictable list to become
> >   reclaimable.
> >   This approach would require significant refactoring of the page reclaim
> >   logic.
> >
> > - Solution B: Prevent file caches from being moved to the unevictable list
> >   during mlock and ignore the VM_LOCKED flag during page reclaim.
> >   This is a more straightforward solution and is the one we have chosen.
> >   If the file caches are reclaimed from the download-proxy's memcg and
> >   subsequently accessed by tasks in the application's memcg, a filemap
> >   fault will occur. A new file cache will be faulted in, charged to the
> >   application's memcg, and locked there.
>
> Both options are silently breaking userspace because a non-failing mlock
> doesn't give the guarantees it is supposed to AFAICS.

It does not bypass the mlock mechanism; rather, it defers the actual
locking operation to the page fault path. Could you clarify what you
mean by "a non-failing mlock"? From what I can see, mlock can indeed
fail if there isn't sufficient memory available. With this change, we
are simply shifting the potential failure point to the page fault path
instead.

--
Regards
Yafang
* Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios being mlocked in a memcg
@ 2024-12-21 7:21 Michal Hocko

From: Michal Hocko @ 2024-12-21 7:21 UTC
To: Yafang Shao
Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, linux-mm

On Fri 20-12-24 19:52:16, Yafang Shao wrote:
> On Fri, Dec 20, 2024 at 6:23 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Sun 15-12-24 15:34:13, Yafang Shao wrote:
> > > Implementation Options
> > > ----------------------
> > >
> > > - Solution A: Allow file caches on the unevictable list to become
> > >   reclaimable.
> > >   This approach would require significant refactoring of the page
> > >   reclaim logic.
> > >
> > > - Solution B: Prevent file caches from being moved to the unevictable
> > >   list during mlock and ignore the VM_LOCKED flag during page reclaim.
> > >   This is a more straightforward solution and is the one we have chosen.
> > >   If the file caches are reclaimed from the download-proxy's memcg and
> > >   subsequently accessed by tasks in the application's memcg, a filemap
> > >   fault will occur. A new file cache will be faulted in, charged to the
> > >   application's memcg, and locked there.
> >
> > Both options are silently breaking userspace because a non-failing mlock
> > doesn't give the guarantees it is supposed to AFAICS.
>
> It does not bypass the mlock mechanism; rather, it defers the actual
> locking operation to the page fault path. Could you clarify what you
> mean by "a non-failing mlock"? From what I can see, mlock can indeed
> fail if there isn't sufficient memory available. With this change, we
> are simply shifting the potential failure point to the page fault path
> instead.

Your change will cause mlocked pages (for which the mlock syscall returned
success) to become reclaimable later on. That breaks the basic mlock
contract.

--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios being mlocked in a memcg
@ 2024-12-22 2:34 Yafang Shao

From: Yafang Shao @ 2024-12-22 2:34 UTC
To: Michal Hocko
Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, linux-mm

On Sat, Dec 21, 2024 at 3:21 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-12-24 19:52:16, Yafang Shao wrote:
> > On Fri, Dec 20, 2024 at 6:23 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Sun 15-12-24 15:34:13, Yafang Shao wrote:
> > > > Implementation Options
> > > > ----------------------
> > > >
> > > > - Solution A: Allow file caches on the unevictable list to become
> > > >   reclaimable.
> > > >   This approach would require significant refactoring of the page
> > > >   reclaim logic.
> > > >
> > > > - Solution B: Prevent file caches from being moved to the unevictable
> > > >   list during mlock and ignore the VM_LOCKED flag during page reclaim.
> > > >   This is a more straightforward solution and is the one we have
> > > >   chosen. If the file caches are reclaimed from the download-proxy's
> > > >   memcg and subsequently accessed by tasks in the application's memcg,
> > > >   a filemap fault will occur. A new file cache will be faulted in,
> > > >   charged to the application's memcg, and locked there.
> > >
> > > Both options are silently breaking userspace because a non-failing
> > > mlock doesn't give the guarantees it is supposed to AFAICS.
> >
> > It does not bypass the mlock mechanism; rather, it defers the actual
> > locking operation to the page fault path. Could you clarify what you
> > mean by "a non-failing mlock"? From what I can see, mlock can indeed
> > fail if there isn't sufficient memory available. With this change, we
> > are simply shifting the potential failure point to the page fault path
> > instead.
>
> Your change will cause mlocked pages (as mlock syscall returns success)
> to be reclaimable later on. That breaks the basic mlock contract.

AFAICS, the mlock() behavior was originally designed with only a
single root memory cgroup in mind. In other words, when mlock() was
introduced, all locked pages were confined to the same memcg.

However, this changed with the introduction of memcg support. Now,
mlock() can lock pages that belong to a different memcg than the
current task. This behavior is not explicitly defined in the mlock()
documentation, which could lead to confusion.

To clarify, I propose updating the mlock() documentation as follows:

  When memcg is enabled, the page being locked might reside in a
  different memcg than the current task. In such cases, the page might
  be reclaimed if mlock() is not permitted in its original memcg. If the
  locked page is reclaimed, it could be faulted back into the current
  task's memcg and then locked again.

Additionally, encountering a single page fault during this process
should be acceptable to most users. If your application cannot
tolerate even a single page fault, you likely wouldn't enable memcg in
the first place.

--
Regards
Yafang
* Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios being mlocked in a memcg
@ 2024-12-25 2:23 Yafang Shao

From: Yafang Shao @ 2024-12-25 2:23 UTC
To: Michal Hocko
Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, linux-mm

On Sun, Dec 22, 2024 at 10:34 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Sat, Dec 21, 2024 at 3:21 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 20-12-24 19:52:16, Yafang Shao wrote:
> > > On Fri, Dec 20, 2024 at 6:23 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Sun 15-12-24 15:34:13, Yafang Shao wrote:
> > > > > Implementation Options
> > > > > ----------------------
> > > > >
> > > > > - Solution A: Allow file caches on the unevictable list to become
> > > > >   reclaimable.
> > > > >   This approach would require significant refactoring of the page
> > > > >   reclaim logic.
> > > > >
> > > > > - Solution B: Prevent file caches from being moved to the
> > > > >   unevictable list during mlock and ignore the VM_LOCKED flag
> > > > >   during page reclaim.
> > > > >   This is a more straightforward solution and is the one we have
> > > > >   chosen. If the file caches are reclaimed from the download-proxy's
> > > > >   memcg and subsequently accessed by tasks in the application's
> > > > >   memcg, a filemap fault will occur. A new file cache will be
> > > > >   faulted in, charged to the application's memcg, and locked there.
> > > >
> > > > Both options are silently breaking userspace because a non-failing
> > > > mlock doesn't give the guarantees it is supposed to AFAICS.
> > >
> > > It does not bypass the mlock mechanism; rather, it defers the actual
> > > locking operation to the page fault path. Could you clarify what you
> > > mean by "a non-failing mlock"? From what I can see, mlock can indeed
> > > fail if there isn't sufficient memory available. With this change, we
> > > are simply shifting the potential failure point to the page fault path
> > > instead.
> >
> > Your change will cause mlocked pages (as mlock syscall returns success)
> > to be reclaimable later on. That breaks the basic mlock contract.
>
> AFAICS, the mlock() behavior was originally designed with only a
> single root memory cgroup in mind. In other words, when mlock() was
> introduced, all locked pages were confined to the same memcg.
>
> However, this changed with the introduction of memcg support. Now,
> mlock() can lock pages that belong to a different memcg than the
> current task. This behavior is not explicitly defined in the mlock()
> documentation, which could lead to confusion.
>
> To clarify, I propose updating the mlock() documentation as follows:
>
>   When memcg is enabled, the page being locked might reside in a
>   different memcg than the current task. In such cases, the page might
>   be reclaimed if mlock() is not permitted in its original memcg. If the
>   locked page is reclaimed, it could be faulted back into the current
>   task's memcg and then locked again.
>
> Additionally, encountering a single page fault during this process
> should be acceptable to most users. If your application cannot
> tolerate even a single page fault, you likely wouldn't enable memcg in
> the first place.
>

If you insist on not allowing a single page fault, there is an
alternative approach, though it may require significantly more complex
handling.

- Option C: Reparent the mlocked page to a common ancestor

  Consider the following hierarchy:

          A
         / \
        B   C

  If B is mlocking a page in C, we can reparent that mlocked page to A,
  essentially making A the new parent for the mlocked page.

          A
         / \
        B   C
       / \   \
      D   E   F

  In this example, if D is mlocking a page in F, we will reparent the
  mlocked page to A.

- Benefits:
  No user-visible cgroup file setting: this approach avoids introducing
  or modifying cgroup settings that could be visible or configurable by
  users.

- Downsides:
  Increased complexity: this option requires significantly more work in
  terms of managing the reparenting process.

--
Regards
Yafang
* Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios being mlocked in a memcg
@ 2025-01-06 12:30 Michal Hocko

From: Michal Hocko @ 2025-01-06 12:30 UTC
To: Yafang Shao
Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, linux-mm

On Wed 25-12-24 10:23:53, Yafang Shao wrote:
[...]
> - Option C: Reparent the mlocked page to a common ancestor
>
>   Consider the following hierarchy:
>
>           A
>          / \
>         B   C
>
>   If B is mlocking a page in C, we can reparent that mlocked page to A,
>   essentially making A the new parent for the mlocked page.

How does this solve the underlying problem?

--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios being mlocked in a memcg
@ 2025-01-06 14:04 Yafang Shao

From: Yafang Shao @ 2025-01-06 14:04 UTC
To: Michal Hocko
Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, linux-mm

On Mon, Jan 6, 2025 at 8:30 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 25-12-24 10:23:53, Yafang Shao wrote:
> [...]
> > - Option C: Reparent the mlocked page to a common ancestor
> >
> >   Consider the following hierarchy:
> >
> >           A
> >          / \
> >         B   C
> >
> >   If B is mlocking a page in C, we can reparent that mlocked page to A,
> >   essentially making A the new parent for the mlocked page.
>
> How does this solve the underlying problem?

No OOM will occur in C until the limit of A is reached, and an OOM at
that point is the expected behavior.

In simpler terms: move the shared resource to a shared layer. I believe
this will resolve most of the issues caused by shared resources.

--
Regards
Yafang
* Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios being mlocked in a memcg
@ 2025-01-07 8:39 Michal Hocko

From: Michal Hocko @ 2025-01-07 8:39 UTC
To: Yafang Shao
Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, linux-mm

On Mon 06-01-25 22:04:31, Yafang Shao wrote:
> On Mon, Jan 6, 2025 at 8:30 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 25-12-24 10:23:53, Yafang Shao wrote:
> > [...]
> > > - Option C: Reparent the mlocked page to a common ancestor
> > >
> > >   Consider the following hierarchy:
> > >
> > >           A
> > >          / \
> > >         B   C
> > >
> > >   If B is mlocking a page in C, we can reparent that mlocked page to A,
> > >   essentially making A the new parent for the mlocked page.
> >
> > How does this solve the underlying problem?
>
> No OOM will occur in C until the limit of A is reached, and an OOM at
> that point is the expected behavior.

Right, but if A happens to be the root cgroup, then you effectively
allow mlock to escape any local limit.

--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios being mlocked in a memcg
@ 2025-01-07 9:43 Yafang Shao

From: Yafang Shao @ 2025-01-07 9:43 UTC
To: Michal Hocko
Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, linux-mm

On Tue, Jan 7, 2025 at 4:39 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 06-01-25 22:04:31, Yafang Shao wrote:
> > On Mon, Jan 6, 2025 at 8:30 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Wed 25-12-24 10:23:53, Yafang Shao wrote:
> > > [...]
> > > > - Option C: Reparent the mlocked page to a common ancestor
> > > >
> > > >   Consider the following hierarchy:
> > > >
> > > >           A
> > > >          / \
> > > >         B   C
> > > >
> > > >   If B is mlocking a page in C, we can reparent that mlocked page
> > > >   to A, essentially making A the new parent for the mlocked page.
> > >
> > > How does this solve the underlying problem?
> >
> > No OOM will occur in C until the limit of A is reached, and an OOM at
> > that point is the expected behavior.
>
> Right but if A happens to be the root cgroup then you effectively
> allow mlock to escape any local limit.

Typically, in a real production environment, people don't use mlock on
large amounts of file cache. It's unusual for one instance to lock a
significant amount of file cache and share it with another instance. If
such use cases do exist, it would be more appropriate to use memory.min
rather than relying on mlock().

For most users, allowing mlocked file pages to be unrestricted is
acceptable. I don't think we should worry too much about edge cases in
this regard. However, if it is deemed a concern, we could introduce a
"cgroup.memory=mlock_shared" boot parameter to enable this behavior
explicitly.

--
Regards
Yafang
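The memory.min alternative mentioned in the message above would be configured roughly as follows. This is a sketch of the existing cgroup v2 protection interface (not part of this series); the cgroup path and the 2G value are illustrative.

```shell
# Protect the shared page cache via reclaim protection instead of mlock:
# while the cgroup's usage stays below memory.min, reclaim will not take
# memory from it, so the shared file cache effectively stays resident.
echo $((2 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/download-proxy/memory.min
```

Unlike mlock, memory.min is a per-cgroup property, so it naturally applies to whichever memcg the shared cache was charged to.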
* Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios beling mlocked in a memcg
  2024-12-22  2:34         ` Yafang Shao
  2024-12-25  2:23         ` Yafang Shao
@ 2025-01-06 12:28         ` Michal Hocko
  2025-01-06 13:59           ` Yafang Shao
  1 sibling, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2025-01-06 12:28 UTC (permalink / raw)
  To: Yafang Shao
  Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, linux-mm

On Sun 22-12-24 10:34:12, Yafang Shao wrote:
> On Sat, Dec 21, 2024 at 3:21 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 20-12-24 19:52:16, Yafang Shao wrote:
> > > On Fri, Dec 20, 2024 at 6:23 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Sun 15-12-24 15:34:13, Yafang Shao wrote:
> > > > > Implementation Options
> > > > > ----------------------
> > > > >
> > > > > - Solution A: Allow file caches on the unevictable list to become
> > > > >   reclaimable.
> > > > >   This approach would require significant refactoring of the page
> > > > >   reclaim logic.
> > > > >
> > > > > - Solution B: Prevent file caches from being moved to the unevictable
> > > > >   list during mlock and ignore the VM_LOCKED flag during page reclaim.
> > > > >   This is a more straightforward solution and is the one we have
> > > > >   chosen. If the file caches are reclaimed from the download-proxy's
> > > > >   memcg and subsequently accessed by tasks in the application's memcg,
> > > > >   a filemap fault will occur. A new file cache will be faulted in,
> > > > >   charged to the application's memcg, and locked there.
> > > >
> > > > Both options silently break userspace because a non-failing mlock
> > > > doesn't give the guarantees it is supposed to, AFAICS.
> > >
> > > It does not bypass the mlock mechanism; rather, it defers the actual
> > > locking operation to the page fault path. Could you clarify what you
> > > mean by "a non-failing mlock"? From what I can see, mlock can indeed
> > > fail if there isn't sufficient memory available. With this change, we
> > > are simply shifting the potential failure point to the page fault
> > > path instead.
> >
> > Your change will cause mlocked pages (as the mlock syscall returns
> > success) to be reclaimable later on. That breaks the basic mlock
> > contract.
>
> AFAICS, the mlock() behavior was originally designed with only a
> single root memory cgroup in mind. In other words, when mlock() was
> introduced, all locked pages were confined to the same memcg.

Yes, and this is the case for any other syscall that might have an
impact on memory consumption. This is by design. The memory cgroup
controller aims to provide completely transparent resource control
without any modifications to applications. This is the case for all
other cgroup controllers. If memcg (or another controller) affects a
specific syscall's behavior, then this has to be communicated explicitly
to the caller.

The purpose of the mlock syscall is to _guarantee_ that memory stays
resident (never swapped out). There might be additional constraints that
prevent mlock from succeeding - e.g. an rlimit, or a memcg that aims to
control the amount of mlocked memory - but those failures need to be
explicitly communicated via syscall failure.

> However, this changed with the introduction of memcg support. Now,
> mlock() can lock pages that belong to a different memcg than the
> current task. This behavior is not explicitly defined in the mlock()
> documentation, which could lead to confusion.

This is more of a problem of cgroup configurations where different
resource domains are sharing resources. It is not much different from
other resources (e.g. shmem) being shared across unrelated cgroups.

> To clarify, I propose updating the mlock() documentation as follows:

This is not really possible because you would effectively be breaking
existing userspace.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios beling mlocked in a memcg
  2025-01-06 12:28         ` Michal Hocko
@ 2025-01-06 13:59           ` Yafang Shao
  2025-01-07 10:04             ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Yafang Shao @ 2025-01-06 13:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, linux-mm

On Mon, Jan 6, 2025 at 8:28 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Sun 22-12-24 10:34:12, Yafang Shao wrote:
> > On Sat, Dec 21, 2024 at 3:21 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 20-12-24 19:52:16, Yafang Shao wrote:
> > > > On Fri, Dec 20, 2024 at 6:23 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Sun 15-12-24 15:34:13, Yafang Shao wrote:
> > > > > > Implementation Options
> > > > > > ----------------------
> > > > > >
> > > > > > - Solution A: Allow file caches on the unevictable list to become
> > > > > >   reclaimable.
> > > > > >   This approach would require significant refactoring of the page
> > > > > >   reclaim logic.
> > > > > >
> > > > > > - Solution B: Prevent file caches from being moved to the
> > > > > >   unevictable list during mlock and ignore the VM_LOCKED flag
> > > > > >   during page reclaim.
> > > > > >   This is a more straightforward solution and is the one we have
> > > > > >   chosen. If the file caches are reclaimed from the download-proxy's
> > > > > >   memcg and subsequently accessed by tasks in the application's
> > > > > >   memcg, a filemap fault will occur. A new file cache will be
> > > > > >   faulted in, charged to the application's memcg, and locked there.
> > > > >
> > > > > Both options silently break userspace because a non-failing mlock
> > > > > doesn't give the guarantees it is supposed to, AFAICS.
> > > >
> > > > It does not bypass the mlock mechanism; rather, it defers the actual
> > > > locking operation to the page fault path. Could you clarify what you
> > > > mean by "a non-failing mlock"? From what I can see, mlock can indeed
> > > > fail if there isn't sufficient memory available. With this change, we
> > > > are simply shifting the potential failure point to the page fault
> > > > path instead.
> > >
> > > Your change will cause mlocked pages (as the mlock syscall returns
> > > success) to be reclaimable later on. That breaks the basic mlock
> > > contract.
> >
> > AFAICS, the mlock() behavior was originally designed with only a
> > single root memory cgroup in mind. In other words, when mlock() was
> > introduced, all locked pages were confined to the same memcg.
>
> Yes, and this is the case for any other syscall that might have an
> impact on memory consumption. This is by design. The memory cgroup
> controller aims to provide completely transparent resource control
> without any modifications to applications. This is the case for all
> other cgroup controllers. If memcg (or another controller) affects a
> specific syscall's behavior, then this has to be communicated
> explicitly to the caller.
>
> The purpose of the mlock syscall is to _guarantee_ that memory stays
> resident (never swapped out). There might be additional constraints
> that prevent mlock from succeeding - e.g. an rlimit, or a memcg that
> aims to control the amount of mlocked memory - but those failures need
> to be explicitly communicated via syscall failure.

Returning an error code like EBUSY to userspace is straightforward
when attempting to mlock a page that is charged to a different memcg.

> > However, this changed with the introduction of memcg support. Now,
> > mlock() can lock pages that belong to a different memcg than the
> > current task. This behavior is not explicitly defined in the mlock()
> > documentation, which could lead to confusion.
>
> This is more of a problem of cgroup configurations where different
> resource domains are sharing resources. It is not much different from
> other resources (e.g. shmem) being shared across unrelated cgroups.

However, we have yet to address even a single one of these issues or
reach a consensus on a solution, correct?

> > To clarify, I propose updating the mlock() documentation as follows:
>
> This is not really possible because you would effectively be breaking
> existing userspace.

This behavior is neither mandatory nor the default. You are not
obligated to use it if you prefer not to.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios beling mlocked in a memcg
  2025-01-06 13:59           ` Yafang Shao
@ 2025-01-07 10:04             ` Michal Hocko
  0 siblings, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2025-01-07 10:04 UTC (permalink / raw)
  To: Yafang Shao
  Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, linux-mm

On Mon 06-01-25 21:59:07, Yafang Shao wrote:
> On Mon, Jan 6, 2025 at 8:28 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > The purpose of the mlock syscall is to _guarantee_ that memory stays
> > resident (never swapped out). There might be additional constraints
> > that prevent mlock from succeeding - e.g. an rlimit, or a memcg that
> > aims to control the amount of mlocked memory - but those failures
> > need to be explicitly communicated via syscall failure.
>
> Returning an error code like EBUSY to userspace is straightforward
> when attempting to mlock a page that is charged to a different memcg.

EAGAIN is already a documented error for the case where some pages
cannot be mlocked, but yes, this would be an acceptable solution. The
question is how to handle mlockall(MCL_FUTURE) resp. mlock2(MLOCK_ONFAULT).
I didn't give those much thought, but I can see some complications there.
Either we fail those on shared resources, which could lead to premature
failures, or we need to somehow record lock ownership and enforce it
during the fault (charge).

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread
end of thread, other threads:[~2025-01-07 10:04 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-15  7:34 [RFC PATCH 0/2] memcg: add nomlock to avoid folios beling mlocked in a memcg Yafang Shao
2024-12-15  7:34 ` [RFC PATCH 1/2] mm/memcontrol: add a new cgroup file memory.nomlock Yafang Shao
2024-12-15  7:34 ` [RFC PATCH 2/2] mm: Add support for nomlock to avoid folios beling mlocked in a memcg Yafang Shao
2024-12-20 10:23 ` [RFC PATCH 0/2] memcg: add " Michal Hocko
2024-12-20 11:52   ` Yafang Shao
2024-12-21  7:21     ` Michal Hocko
2024-12-22  2:34       ` Yafang Shao
2024-12-25  2:23         ` Yafang Shao
2025-01-06 12:30           ` Michal Hocko
2025-01-06 14:04             ` Yafang Shao
2025-01-07  8:39               ` Michal Hocko
2025-01-07  9:43                 ` Yafang Shao
2025-01-06 12:28         ` Michal Hocko
2025-01-06 13:59           ` Yafang Shao
2025-01-07 10:04             ` Michal Hocko