* [RFC PATCH 0/2] memcg: add nomlock to avoid folios being mlocked in a memcg
@ 2024-12-15 7:34 Yafang Shao

From: Yafang Shao @ 2024-12-15 7:34 UTC
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm
Cc: linux-mm, Yafang Shao
The Use Case
============
We have a scenario where multiple services (cgroups) may share the same
file cache, as illustrated below:
  download-proxy     application
              \       /
       /shared_path/shared_files
When the application needs specific types of files, it sends an RPC request
to the download-proxy. The download-proxy then downloads the files to
shared paths, after which the application reads these shared files. All
disk I/O operations are performed using buffered I/O.
The reason for using buffered I/O, rather than direct I/O, is that the
download-proxy itself may also read these shared files. This is because it
serves as a peer-to-peer (P2P) service:
  download-proxy of server1  <- P2P ->  download-proxy of server2
   /shared_path/shared_files             /shared_path/shared_files
The Problem
===========
Applications reading these shared files may use mlock to pin the files in
memory for performance reasons. However, the shared file cache is charged
to the memory cgroup of the download-proxy during the download or P2P
process. Consequently, the page cache pages of the shared files might be
mlocked within the download-proxy's memcg, as shown:
  download-proxy     application
        |               /
    (charged)      (mlocked)
        |             /
      pagecache pages
              \
               \
        /shared_path/shared_files
This setup leads to a frequent scenario where the memory usage of the
download-proxy's memcg reaches its limit, potentially resulting in OOM
events. This behavior is undesirable.
The Solution
============
To address this, we propose introducing a new cgroup file, memory.nomlock,
which prevents page cache pages from being mlocked in a specific memcg when
set to 1.
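Assuming the interface proposed by this series (memory.nomlock does not exist in mainline kernels, and the cgroup path is illustrative), usage would look roughly like this configuration sketch:

```shell
# Keep the download-proxy's memcg from accumulating mlocked folios.
# memory.nomlock is the knob proposed by this RFC; the path is an example.
echo 1 > /sys/fs/cgroup/download-proxy/memory.nomlock

# Read the setting back; "1" means nomlock is enabled for this memcg.
cat /sys/fs/cgroup/download-proxy/memory.nomlock
```

Note the timing caveat discussed below: the knob only helps if it is set before the file caches reach the unevictable list.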
Implementation Options
----------------------
- Solution A: Allow file caches on the unevictable list to become
reclaimable.
This approach would require significant refactoring of the page reclaim
logic.
- Solution B: Prevent file caches from being moved to the unevictable list
during mlock and ignore the VM_LOCKED flag during page reclaim.
This is a more straightforward solution and is the one we have chosen.
If the file caches are reclaimed from the download-proxy's memcg and
subsequently accessed by tasks in the application’s memcg, a filemap
fault will occur. A new file cache will be faulted in, charged to the
application’s memcg, and locked there.
Current Limitations
===================
This solution is in its early stages and has the following limitations:
- Timing Dependency:
memory.nomlock must be set before file caches are moved to the
unevictable list. Otherwise, the file caches cannot be reclaimed.
- Metrics Inaccuracy:
The "unevictable" metric in memory.stat and the "Mlocked" metric in
/proc/meminfo may not be reliable. However, these metrics are already
affected by the use of large folios.
If this solution is deemed acceptable, I will proceed with refining the
implementation and addressing these limitations.
Yafang Shao (2):
  mm/memcontrol: add a new cgroup file memory.nomlock
  mm: Add support for nomlock to avoid folios being mlocked in a memcg
 include/linux/memcontrol.h |  3 +++
 mm/memcontrol.c            | 35 +++++++++++++++++++++++++++++++++++
 mm/mlock.c                 |  9 +++++++++
 mm/rmap.c                  |  8 +++++++-
 mm/vmscan.c                |  5 +++++
 5 files changed, 59 insertions(+), 1 deletion(-)
--
2.43.5
* [RFC PATCH 1/2] mm/memcontrol: add a new cgroup file memory.nomlock
@ 2024-12-15 7:34 Yafang Shao

From: Yafang Shao @ 2024-12-15 7:34 UTC
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm
Cc: linux-mm, Yafang Shao

Add a new cgroup file, memory.nomlock, to control the behavior of mlock.
This is a preparation for the follow-up patch.

  0: Disable nomlock
  1: Enable nomlock

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/memcontrol.h |  3 +++
 mm/memcontrol.c            | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b28180269e75..b3a4e28ae0f9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -220,6 +220,9 @@ struct mem_cgroup {
 	 */
 	bool oom_group;
 
+	/* Folios won't be mlocked in this memcg. */
+	bool nomlock;
+
 	int swappiness;
 
 	/* memory.events and memory.events.local */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..08bb65c1612f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4374,6 +4374,35 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 	return nbytes;
 }
 
+static int memory_nomlock_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	seq_printf(m, "%d\n", READ_ONCE(memcg->nomlock));
+	return 0;
+}
+
+static ssize_t memory_nomlock_write(struct kernfs_open_file *of,
+				    char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	int ret, nomlock;
+
+	buf = strstrip(buf);
+	if (!buf)
+		return -EINVAL;
+
+	ret = kstrtoint(buf, 0, &nomlock);
+	if (ret)
+		return ret;
+
+	if (nomlock != 0 && nomlock != 1)
+		return -EINVAL;
+
+	WRITE_ONCE(memcg->nomlock, nomlock);
+	return nbytes;
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -4445,6 +4474,12 @@ static struct cftype memory_files[] = {
 		.flags = CFTYPE_NS_DELEGATABLE,
 		.write = memory_reclaim,
 	},
+	{
+		.name = "nomlock",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_nomlock_show,
+		.write = memory_nomlock_write,
+	},
 	{ }	/* terminate */
 };
--
2.43.5
* [RFC PATCH 2/2] mm: Add support for nomlock to avoid folios being mlocked in a memcg
@ 2024-12-15 7:34 Yafang Shao

From: Yafang Shao @ 2024-12-15 7:34 UTC
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm
Cc: linux-mm, Yafang Shao

The Use Case
============

We have a scenario where multiple services (cgroups) may share the same
file cache, as illustrated below:

  download-proxy     application
              \       /
       /shared_path/shared_files

When the application needs specific types of files, it sends an RPC request
to the download-proxy. The download-proxy then downloads the files to
shared paths, after which the application reads these shared files. All
disk I/O operations are performed using buffered I/O.

The reason for using buffered I/O, rather than direct I/O, is that the
download-proxy itself may also read these shared files, because it serves
as a peer-to-peer (P2P) service:

  download-proxy of server1  <- P2P ->  download-proxy of server2
   /shared_path/shared_files             /shared_path/shared_files

The Problem
===========

Applications reading these shared files may use mlock to pin the files in
memory for performance reasons. However, the shared file cache is charged
to the memory cgroup of the download-proxy during the download or P2P
process. Consequently, the page cache pages of the shared files might be
mlocked within the download-proxy's memcg, as shown:

  download-proxy     application
        |               /
    (charged)      (mlocked)
        |             /
      pagecache pages
              \
               \
        /shared_path/shared_files

This setup leads to a frequent scenario where the memory usage of the
download-proxy's memcg reaches its limit, potentially resulting in OOM
events. This behavior is undesirable.

The Solution
============

To address this, we propose introducing a new cgroup file, memory.nomlock,
which prevents page cache pages from being mlocked in a specific memcg when
set to 1.

Implementation Options
----------------------

- Solution A: Allow file caches on the unevictable list to become
  reclaimable.
  This approach would require significant refactoring of the page reclaim
  logic.

- Solution B: Prevent file caches from being moved to the unevictable list
  during mlock and ignore the VM_LOCKED flag during page reclaim.
  This is a more straightforward solution and is the one we have chosen.
  If the file caches are reclaimed from the download-proxy's memcg and
  subsequently accessed by tasks in the application's memcg, a filemap
  fault will occur. A new file cache will be faulted in, charged to the
  application's memcg, and locked there.

Current Limitations
===================

This solution is in its early stages and has the following limitations:

- Timing Dependency:
  memory.nomlock must be set before file caches are moved to the
  unevictable list. Otherwise, the file caches cannot be reclaimed.

- Metrics Inaccuracy:
  The "unevictable" metric in memory.stat and the "Mlocked" metric in
  /proc/meminfo may not be reliable. However, these metrics are already
  affected by the use of large folios.

If this solution is deemed acceptable, I will proceed with refining the
implementation.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 mm/mlock.c  | 9 +++++++++
 mm/rmap.c   | 8 +++++++-
 mm/vmscan.c | 5 +++++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index cde076fa7d5e..9cebcf13929f 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -186,6 +186,7 @@ static inline struct folio *mlock_new(struct folio *folio)
 static void mlock_folio_batch(struct folio_batch *fbatch)
 {
 	struct lruvec *lruvec = NULL;
+	struct mem_cgroup *memcg;
 	unsigned long mlock;
 	struct folio *folio;
 	int i;
@@ -196,6 +197,10 @@ static void mlock_folio_batch(struct folio_batch *fbatch)
 		folio = (struct folio *)((unsigned long)folio - mlock);
 		fbatch->folios[i] = folio;
 
+		memcg = folio_memcg(folio);
+		if (memcg && memcg->nomlock && mlock)
+			continue;
+
 		if (mlock & LRU_FOLIO)
 			lruvec = __mlock_folio(folio, lruvec);
 		else if (mlock & NEW_FOLIO)
@@ -241,8 +246,12 @@ bool need_mlock_drain(int cpu)
  */
 void mlock_folio(struct folio *folio)
 {
+	struct mem_cgroup *memcg = folio_memcg(folio);
 	struct folio_batch *fbatch;
 
+	if (memcg && memcg->nomlock)
+		return;
+
 	local_lock(&mlock_fbatch.lock);
 	fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index c6c4d4ea29a7..6f16f86f9274 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -853,11 +853,17 @@ static bool folio_referenced_one(struct folio *folio,
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
 	int referenced = 0;
 	unsigned long start = address, ptes = 0;
+	bool ignore_mlock = false;
+	struct mem_cgroup *memcg;
+
+	memcg = folio_memcg(folio);
+	if (memcg && memcg->nomlock)
+		ignore_mlock = true;
 
 	while (page_vma_mapped_walk(&pvmw)) {
 		address = pvmw.address;
 
-		if (vma->vm_flags & VM_LOCKED) {
+		if (!ignore_mlock && vma->vm_flags & VM_LOCKED) {
 			if (!folio_test_large(folio) || !pvmw.pte) {
 				/* Restore the mlock which got missed */
 				mlock_vma_folio(folio, vma);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fd55c3ec0054..defd36be28e9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1283,6 +1283,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		if (folio_mapped(folio)) {
 			enum ttu_flags flags = TTU_BATCH_FLUSH;
 			bool was_swapbacked = folio_test_swapbacked(folio);
+			struct mem_cgroup *memcg;
 
 			if (folio_test_pmd_mappable(folio))
 				flags |= TTU_SPLIT_HUGE_PMD;
@@ -1301,6 +1302,10 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			if (folio_test_large(folio))
 				flags |= TTU_SYNC;
 
+			memcg = folio_memcg(folio);
+			if (memcg && memcg->nomlock)
+				flags |= TTU_IGNORE_MLOCK;
+
 			try_to_unmap(folio, flags);
 			if (folio_mapped(folio)) {
 				stat->nr_unmap_fail += nr_pages;
--
2.43.5
* Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios being mlocked in a memcg
@ 2024-12-20 10:23 Michal Hocko

From: Michal Hocko @ 2024-12-20 10:23 UTC
To: Yafang Shao
Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, linux-mm

On Sun 15-12-24 15:34:13, Yafang Shao wrote:
> Implementation Options
> ----------------------
>
> - Solution A: Allow file caches on the unevictable list to become
>   reclaimable.
>   This approach would require significant refactoring of the page reclaim
>   logic.
>
> - Solution B: Prevent file caches from being moved to the unevictable list
>   during mlock and ignore the VM_LOCKED flag during page reclaim.
>   This is a more straightforward solution and is the one we have chosen.
>   If the file caches are reclaimed from the download-proxy's memcg and
>   subsequently accessed by tasks in the application's memcg, a filemap
>   fault will occur. A new file cache will be faulted in, charged to the
>   application's memcg, and locked there.

Both options silently break userspace, because a non-failing mlock no
longer gives the guarantees it is supposed to, AFAICS. So unless I am
missing something really important, I do not think this is an acceptable
memcg extension.

--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios being mlocked in a memcg
@ 2024-12-20 11:52 Yafang Shao

From: Yafang Shao @ 2024-12-20 11:52 UTC
To: Michal Hocko
Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, linux-mm

On Fri, Dec 20, 2024 at 6:23 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Sun 15-12-24 15:34:13, Yafang Shao wrote:
> > Implementation Options
> > ----------------------
> >
> > - Solution A: Allow file caches on the unevictable list to become
> >   reclaimable.
> >   This approach would require significant refactoring of the page reclaim
> >   logic.
> >
> > - Solution B: Prevent file caches from being moved to the unevictable list
> >   during mlock and ignore the VM_LOCKED flag during page reclaim.
> >   This is a more straightforward solution and is the one we have chosen.
> >   If the file caches are reclaimed from the download-proxy's memcg and
> >   subsequently accessed by tasks in the application's memcg, a filemap
> >   fault will occur. A new file cache will be faulted in, charged to the
> >   application's memcg, and locked there.
>
> Both options are silently breaking userspace because a non-failing mlock
> doesn't give the guarantees it is supposed to AFAICS.

It does not bypass the mlock mechanism; rather, it defers the actual
locking operation to the page fault path. Could you clarify what you
mean by "a non-failing mlock"? From what I can see, mlock can indeed
fail if there isn't sufficient memory available. With this change, we
are simply shifting the potential failure point to the page fault path
instead.

--
Regards
Yafang
* Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios being mlocked in a memcg
@ 2024-12-21 7:21 Michal Hocko

From: Michal Hocko @ 2024-12-21 7:21 UTC
To: Yafang Shao
Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, linux-mm

On Fri 20-12-24 19:52:16, Yafang Shao wrote:
> On Fri, Dec 20, 2024 at 6:23 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Sun 15-12-24 15:34:13, Yafang Shao wrote:
> > > Implementation Options
> > > ----------------------
> > >
> > > - Solution A: Allow file caches on the unevictable list to become
> > >   reclaimable.
> > >   This approach would require significant refactoring of the page
> > >   reclaim logic.
> > >
> > > - Solution B: Prevent file caches from being moved to the unevictable
> > >   list during mlock and ignore the VM_LOCKED flag during page reclaim.
> > >   This is a more straightforward solution and is the one we have chosen.
> > >   If the file caches are reclaimed from the download-proxy's memcg and
> > >   subsequently accessed by tasks in the application's memcg, a filemap
> > >   fault will occur. A new file cache will be faulted in, charged to the
> > >   application's memcg, and locked there.
> >
> > Both options are silently breaking userspace because a non-failing mlock
> > doesn't give the guarantees it is supposed to AFAICS.
>
> It does not bypass the mlock mechanism; rather, it defers the actual
> locking operation to the page fault path. Could you clarify what you
> mean by "a non-failing mlock"? From what I can see, mlock can indeed
> fail if there isn't sufficient memory available. With this change, we
> are simply shifting the potential failure point to the page fault path
> instead.

Your change will cause mlocked pages (for which the mlock syscall returned
success) to become reclaimable later on. That breaks the basic mlock
contract.

--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios being mlocked in a memcg
@ 2024-12-22 2:34 Yafang Shao

From: Yafang Shao @ 2024-12-22 2:34 UTC
To: Michal Hocko
Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, linux-mm

On Sat, Dec 21, 2024 at 3:21 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-12-24 19:52:16, Yafang Shao wrote:
> > On Fri, Dec 20, 2024 at 6:23 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Sun 15-12-24 15:34:13, Yafang Shao wrote:
> > > > Implementation Options
> > > > ----------------------
> > > >
> > > > - Solution A: Allow file caches on the unevictable list to become
> > > >   reclaimable.
> > > >   This approach would require significant refactoring of the page
> > > >   reclaim logic.
> > > >
> > > > - Solution B: Prevent file caches from being moved to the unevictable
> > > >   list during mlock and ignore the VM_LOCKED flag during page reclaim.
> > > >   This is a more straightforward solution and is the one we have
> > > >   chosen. If the file caches are reclaimed from the download-proxy's
> > > >   memcg and subsequently accessed by tasks in the application's memcg,
> > > >   a filemap fault will occur. A new file cache will be faulted in,
> > > >   charged to the application's memcg, and locked there.
> > >
> > > Both options are silently breaking userspace because a non-failing
> > > mlock doesn't give the guarantees it is supposed to AFAICS.
> >
> > It does not bypass the mlock mechanism; rather, it defers the actual
> > locking operation to the page fault path. Could you clarify what you
> > mean by "a non-failing mlock"? From what I can see, mlock can indeed
> > fail if there isn't sufficient memory available. With this change, we
> > are simply shifting the potential failure point to the page fault path
> > instead.
>
> Your change will cause mlocked pages (as mlock syscall returns success)
> to be reclaimable later on. That breaks the basic mlock contract.

AFAICS, the mlock() behavior was originally designed with only a
single root memory cgroup in mind. In other words, when mlock() was
introduced, all locked pages were confined to the same memcg.

However, this changed with the introduction of memcg support. Now,
mlock() can lock pages that belong to a different memcg than the
current task. This behavior is not explicitly defined in the mlock()
documentation, which could lead to confusion.

To clarify, I propose updating the mlock() documentation as follows:

  When memcg is enabled, the page being locked might reside in a
  different memcg than the current task. In such cases, the page might
  be reclaimed if mlock() is not permitted in its original memcg. If the
  locked page is reclaimed, it could be faulted back into the current
  task's memcg and then locked again.

Additionally, encountering a single page fault during this process
should be acceptable to most users. If your application cannot
tolerate even a single page fault, you likely wouldn't enable memcg in
the first place.

--
Regards
Yafang
* Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios being mlocked in a memcg
@ 2024-12-25 2:23 Yafang Shao

From: Yafang Shao @ 2024-12-25 2:23 UTC
To: Michal Hocko
Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, linux-mm

On Sun, Dec 22, 2024 at 10:34 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Sat, Dec 21, 2024 at 3:21 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 20-12-24 19:52:16, Yafang Shao wrote:
> > > On Fri, Dec 20, 2024 at 6:23 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Sun 15-12-24 15:34:13, Yafang Shao wrote:
> > > > > Implementation Options
> > > > > ----------------------
> > > > >
> > > > > - Solution A: Allow file caches on the unevictable list to become
> > > > >   reclaimable.
> > > > >   This approach would require significant refactoring of the page
> > > > >   reclaim logic.
> > > > >
> > > > > - Solution B: Prevent file caches from being moved to the
> > > > >   unevictable list during mlock and ignore the VM_LOCKED flag
> > > > >   during page reclaim.
> > > > >   This is a more straightforward solution and is the one we have
> > > > >   chosen. If the file caches are reclaimed from the download-proxy's
> > > > >   memcg and subsequently accessed by tasks in the application's
> > > > >   memcg, a filemap fault will occur. A new file cache will be
> > > > >   faulted in, charged to the application's memcg, and locked there.
> > > >
> > > > Both options are silently breaking userspace because a non-failing
> > > > mlock doesn't give the guarantees it is supposed to AFAICS.
> > >
> > > It does not bypass the mlock mechanism; rather, it defers the actual
> > > locking operation to the page fault path. Could you clarify what you
> > > mean by "a non-failing mlock"? From what I can see, mlock can indeed
> > > fail if there isn't sufficient memory available. With this change, we
> > > are simply shifting the potential failure point to the page fault path
> > > instead.
> >
> > Your change will cause mlocked pages (as mlock syscall returns success)
> > to be reclaimable later on. That breaks the basic mlock contract.
>
> AFAICS, the mlock() behavior was originally designed with only a
> single root memory cgroup in mind. In other words, when mlock() was
> introduced, all locked pages were confined to the same memcg.
>
> However, this changed with the introduction of memcg support. Now,
> mlock() can lock pages that belong to a different memcg than the
> current task. This behavior is not explicitly defined in the mlock()
> documentation, which could lead to confusion.
>
> To clarify, I propose updating the mlock() documentation as follows:
>
>   When memcg is enabled, the page being locked might reside in a
>   different memcg than the current task. In such cases, the page might
>   be reclaimed if mlock() is not permitted in its original memcg. If the
>   locked page is reclaimed, it could be faulted back into the current
>   task's memcg and then locked again.
>
> Additionally, encountering a single page fault during this process
> should be acceptable to most users. If your application cannot
> tolerate even a single page fault, you likely wouldn't enable memcg in
> the first place.
>

If you insist on not allowing a single page fault, there is an
alternative approach, though it may require significantly more complex
handling.

- Option C: Reparent the mlocked page to a common ancestor

  Consider the following hierarchy:

          A
         / \
        B   C

  If B is mlocking a page in C, we can reparent that mlocked page to A,
  essentially making A the new parent for the mlocked page.

          A
         / \
        B   C
       / \   \
      D   E   F

  In this example, if D is mlocking a page in F, we will reparent the
  mlocked page to A.

- Benefits:
  No user-visible cgroup file setting: this approach avoids introducing
  or modifying cgroup settings that could be visible or configurable by
  users.

- Downsides:
  Increased complexity: this option requires significantly more work in
  terms of managing the reparenting process.

--
Regards
Yafang
* Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios being mlocked in a memcg
@ 2025-01-06 12:30 Michal Hocko

From: Michal Hocko @ 2025-01-06 12:30 UTC
To: Yafang Shao
Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, linux-mm

On Wed 25-12-24 10:23:53, Yafang Shao wrote:
[...]
> - Option C: Reparent the mlocked page to a common ancestor
>
>   Consider the following hierarchy:
>
>           A
>          / \
>         B   C
>
>   If B is mlocking a page in C, we can reparent that mlocked page to A,
>   essentially making A the new parent for the mlocked page.

How does this solve the underlying problem?

--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios being mlocked in a memcg
@ 2025-01-06 14:04 Yafang Shao

From: Yafang Shao @ 2025-01-06 14:04 UTC
To: Michal Hocko
Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, linux-mm

On Mon, Jan 6, 2025 at 8:30 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 25-12-24 10:23:53, Yafang Shao wrote:
> [...]
> > - Option C: Reparent the mlocked page to a common ancestor
> >
> >   Consider the following hierarchy:
> >
> >           A
> >          / \
> >         B   C
> >
> >   If B is mlocking a page in C, we can reparent that mlocked page to A,
> >   essentially making A the new parent for the mlocked page.
>
> How does this solve the underlying problem?

No OOM will occur in C until the limit of A is reached, and an OOM at
that point is the expected behavior.

In simpler terms: move the shared resource to a shared layer. I believe
this will resolve most of the issues caused by shared resources.

--
Regards
Yafang
* Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios being mlocked in a memcg
@ 2025-01-07 8:39 Michal Hocko

From: Michal Hocko @ 2025-01-07 8:39 UTC
To: Yafang Shao
Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, linux-mm

On Mon 06-01-25 22:04:31, Yafang Shao wrote:
> On Mon, Jan 6, 2025 at 8:30 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 25-12-24 10:23:53, Yafang Shao wrote:
> > [...]
> > > - Option C: Reparent the mlocked page to a common ancestor
> > >
> > >   Consider the following hierarchy:
> > >
> > >           A
> > >          / \
> > >         B   C
> > >
> > >   If B is mlocking a page in C, we can reparent that mlocked page to A,
> > >   essentially making A the new parent for the mlocked page.
> >
> > How does this solve the underlying problem?
>
> No OOM will occur in C until the limit of A is reached, and an OOM at
> that point is the expected behavior.

Right, but if A happens to be the root cgroup, then you effectively
allow mlock to escape any local limit.

--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios being mlocked in a memcg
@ 2025-01-07 9:43 Yafang Shao

From: Yafang Shao @ 2025-01-07 9:43 UTC
To: Michal Hocko
Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, linux-mm

On Tue, Jan 7, 2025 at 4:39 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 06-01-25 22:04:31, Yafang Shao wrote:
> > On Mon, Jan 6, 2025 at 8:30 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Wed 25-12-24 10:23:53, Yafang Shao wrote:
> > > [...]
> > > > - Option C: Reparent the mlocked page to a common ancestor
> > > >
> > > >   Consider the following hierarchy:
> > > >
> > > >           A
> > > >          / \
> > > >         B   C
> > > >
> > > >   If B is mlocking a page in C, we can reparent that mlocked page
> > > >   to A, essentially making A the new parent for the mlocked page.
> > >
> > > How does this solve the underlying problem?
> >
> > No OOM will occur in C until the limit of A is reached, and an OOM at
> > that point is the expected behavior.
>
> Right but if A happens to be the root cgroup then you effectively
> allow mlock to escape any local limit.

Typically, in a real production environment, people don't use mlock on
large amounts of file cache. It's unusual for one instance to lock a
significant amount of file cache and share it with another instance. If
such use cases do exist, it would be more appropriate to use memory.min
rather than relying on mlock().

For most users, allowing mlocked file pages to be unrestricted is
acceptable. I don't think we should worry too much about edge cases in
this regard. However, if it is deemed a concern, we could introduce a
"cgroup.memory=mlock_shared" boot parameter to enable this behavior
explicitly.

--
Regards
Yafang
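The memory.min alternative mentioned in the message above would be configured roughly as follows. This is a sketch of the existing cgroup v2 protection interface (not part of this series); the cgroup path and the 2G value are illustrative.

```shell
# Protect the shared page cache via reclaim protection instead of mlock:
# while the cgroup's usage stays below memory.min, reclaim will not take
# memory from it, so the shared file cache effectively stays resident.
echo $((2 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/download-proxy/memory.min
```

Unlike mlock, memory.min is a per-cgroup property, so it naturally applies to whichever memcg the shared cache was charged to.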
* Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios beling mlocked in a memcg
  2024-12-22  2:34         ` Yafang Shao
  2024-12-25  2:23         ` Yafang Shao
@ 2025-01-06 12:28         ` Michal Hocko
  2025-01-06 13:59           ` Yafang Shao
  1 sibling, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2025-01-06 12:28 UTC (permalink / raw)
  To: Yafang Shao
  Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, linux-mm

On Sun 22-12-24 10:34:12, Yafang Shao wrote:
> On Sat, Dec 21, 2024 at 3:21 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 20-12-24 19:52:16, Yafang Shao wrote:
> > > On Fri, Dec 20, 2024 at 6:23 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Sun 15-12-24 15:34:13, Yafang Shao wrote:
> > > > > Implementation Options
> > > > > ----------------------
> > > > >
> > > > > - Solution A: Allow file caches on the unevictable list to become
> > > > >   reclaimable.
> > > > >   This approach would require significant refactoring of the page
> > > > >   reclaim logic.
> > > > >
> > > > > - Solution B: Prevent file caches from being moved to the unevictable
> > > > >   list during mlock and ignore the VM_LOCKED flag during page reclaim.
> > > > >   This is a more straightforward solution and is the one we have
> > > > >   chosen. If the file caches are reclaimed from the download-proxy's
> > > > >   memcg and subsequently accessed by tasks in the application's memcg,
> > > > >   a filemap fault will occur. A new file cache will be faulted in,
> > > > >   charged to the application's memcg, and locked there.
> > > >
> > > > Both options silently break userspace because a non-failing mlock
> > > > doesn't give the guarantees it is supposed to, AFAICS.
> > >
> > > It does not bypass the mlock mechanism; rather, it defers the actual
> > > locking operation to the page fault path. Could you clarify what you
> > > mean by "a non-failing mlock"? From what I can see, mlock can indeed
> > > fail if there isn't sufficient memory available. With this change, we
> > > are simply shifting the potential failure point to the page fault
> > > path instead.
> >
> > Your change will cause mlocked pages (as the mlock syscall returns
> > success) to be reclaimable later on. That breaks the basic mlock
> > contract.
>
> AFAICS, the mlock() behavior was originally designed with only a
> single root memory cgroup in mind. In other words, when mlock() was
> introduced, all locked pages were confined to the same memcg.

Yes, and this is the case for any other syscall that might have an
impact on memory consumption. This is by design. The memory cgroup
controller aims to provide completely transparent resource control
without any modifications to applications. This is the case for all
other cgroup controllers. If memcg (or another controller) affects a
specific syscall's behavior, then this has to be communicated explicitly
to the caller.

The purpose of the mlock syscall is to _guarantee_ that memory stays
resident (never swapped out). There might be additional constraints that
prevent mlock from succeeding - e.g. an rlimit, or a memcg that aims to
control the amount of mlocked memory - but those failures need to be
explicitly communicated via syscall failure.

> However, this changed with the introduction of memcg support. Now,
> mlock() can lock pages that belong to a different memcg than the
> current task. This behavior is not explicitly defined in the mlock()
> documentation, which could lead to confusion.

This is more of a problem of cgroup configurations where different
resource domains are sharing resources. It is not much different from
other resources (e.g. shmem) being shared across unrelated cgroups.

> To clarify, I propose updating the mlock() documentation as follows:

This is not really possible because you would effectively be breaking
existing userspace.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios beling mlocked in a memcg
  2025-01-06 12:28         ` Michal Hocko
@ 2025-01-06 13:59           ` Yafang Shao
  2025-01-07 10:04             ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Yafang Shao @ 2025-01-06 13:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, linux-mm

On Mon, Jan 6, 2025 at 8:28 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Sun 22-12-24 10:34:12, Yafang Shao wrote:
> > On Sat, Dec 21, 2024 at 3:21 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 20-12-24 19:52:16, Yafang Shao wrote:
> > > > On Fri, Dec 20, 2024 at 6:23 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Sun 15-12-24 15:34:13, Yafang Shao wrote:
> > > > > > Implementation Options
> > > > > > ----------------------
> > > > > >
> > > > > > - Solution A: Allow file caches on the unevictable list to become
> > > > > >   reclaimable.
> > > > > >   This approach would require significant refactoring of the page
> > > > > >   reclaim logic.
> > > > > >
> > > > > > - Solution B: Prevent file caches from being moved to the
> > > > > >   unevictable list during mlock and ignore the VM_LOCKED flag
> > > > > >   during page reclaim.
> > > > > >   This is a more straightforward solution and is the one we have
> > > > > >   chosen. If the file caches are reclaimed from the download-proxy's
> > > > > >   memcg and subsequently accessed by tasks in the application's
> > > > > >   memcg, a filemap fault will occur. A new file cache will be
> > > > > >   faulted in, charged to the application's memcg, and locked there.
> > > > >
> > > > > Both options silently break userspace because a non-failing mlock
> > > > > doesn't give the guarantees it is supposed to, AFAICS.
> > > >
> > > > It does not bypass the mlock mechanism; rather, it defers the actual
> > > > locking operation to the page fault path. Could you clarify what you
> > > > mean by "a non-failing mlock"? From what I can see, mlock can indeed
> > > > fail if there isn't sufficient memory available. With this change, we
> > > > are simply shifting the potential failure point to the page fault
> > > > path instead.
> > >
> > > Your change will cause mlocked pages (as the mlock syscall returns
> > > success) to be reclaimable later on. That breaks the basic mlock
> > > contract.
> >
> > AFAICS, the mlock() behavior was originally designed with only a
> > single root memory cgroup in mind. In other words, when mlock() was
> > introduced, all locked pages were confined to the same memcg.
>
> Yes, and this is the case for any other syscall that might have an
> impact on memory consumption. This is by design. The memory cgroup
> controller aims to provide completely transparent resource control
> without any modifications to applications. This is the case for all
> other cgroup controllers. If memcg (or another controller) affects a
> specific syscall's behavior, then this has to be communicated
> explicitly to the caller.
>
> The purpose of the mlock syscall is to _guarantee_ that memory stays
> resident (never swapped out). There might be additional constraints
> that prevent mlock from succeeding - e.g. an rlimit, or a memcg that
> aims to control the amount of mlocked memory - but those failures need
> to be explicitly communicated via syscall failure.

Returning an error code like EBUSY to userspace is straightforward
when attempting to mlock a page that is charged to a different memcg.

> > However, this changed with the introduction of memcg support. Now,
> > mlock() can lock pages that belong to a different memcg than the
> > current task. This behavior is not explicitly defined in the mlock()
> > documentation, which could lead to confusion.
>
> This is more of a problem of cgroup configurations where different
> resource domains are sharing resources. It is not much different from
> other resources (e.g. shmem) being shared across unrelated cgroups.

However, we have yet to address even a single one of these issues or
reach a consensus on a solution, correct?

> > To clarify, I propose updating the mlock() documentation as follows:
>
> This is not really possible because you would effectively be breaking
> existing userspace.

This behavior is neither mandatory nor the default. You are not
obligated to use it if you prefer not to.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios beling mlocked in a memcg
  2025-01-06 13:59           ` Yafang Shao
@ 2025-01-07 10:04             ` Michal Hocko
  0 siblings, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2025-01-07 10:04 UTC (permalink / raw)
  To: Yafang Shao
  Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, linux-mm

On Mon 06-01-25 21:59:07, Yafang Shao wrote:
> On Mon, Jan 6, 2025 at 8:28 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > The purpose of the mlock syscall is to _guarantee_ that memory stays
> > resident (never swapped out). There might be additional constraints
> > that prevent mlock from succeeding - e.g. an rlimit, or a memcg that
> > aims to control the amount of mlocked memory - but those failures
> > need to be explicitly communicated via syscall failure.
>
> Returning an error code like EBUSY to userspace is straightforward
> when attempting to mlock a page that is charged to a different memcg.

EAGAIN is already a documented error for the case where some pages
cannot be mlocked, but yes, this would be an acceptable solution. The
question is how to handle mlockall(MCL_FUTURE) resp. mlock2(MLOCK_ONFAULT).
I didn't give those much thought, but I can see some complications there.
Either we fail those on shared resources, which could lead to premature
failures, or we need to somehow record lock ownership and enforce it
during the fault (charge).

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread
end of thread, other threads:[~2025-01-07 10:04 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-15  7:34 [RFC PATCH 0/2] memcg: add nomlock to avoid folios beling mlocked in a memcg Yafang Shao
2024-12-15  7:34 ` [RFC PATCH 1/2] mm/memcontrol: add a new cgroup file memory.nomlock Yafang Shao
2024-12-15  7:34 ` [RFC PATCH 2/2] mm: Add support for nomlock to avoid folios beling mlocked in a memcg Yafang Shao
2024-12-20 10:23 ` [RFC PATCH 0/2] memcg: add " Michal Hocko
2024-12-20 11:52   ` Yafang Shao
2024-12-21  7:21     ` Michal Hocko
2024-12-22  2:34       ` Yafang Shao
2024-12-25  2:23         ` Yafang Shao
2025-01-06 12:30           ` Michal Hocko
2025-01-06 14:04             ` Yafang Shao
2025-01-07  8:39               ` Michal Hocko
2025-01-07  9:43                 ` Yafang Shao
2025-01-06 12:28         ` Michal Hocko
2025-01-06 13:59           ` Yafang Shao
2025-01-07 10:04             ` Michal Hocko