linux-mm.kvack.org archive mirror
* [PATCH v7 0/3] btrfs: try to allocate larger folios for metadata
@ 2024-07-19 10:28 Qu Wenruo
  2024-07-19 10:28 ` [PATCH v7 1/3] memcontrol: define root_mem_cgroup for CONFIG_MEMCG=n cases Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Qu Wenruo @ 2024-07-19 10:28 UTC (permalink / raw)
  To: linux-btrfs
  Cc: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	cgroups, linux-mm

[CHANGELOG]
v7:
- Fix an accidentally removed line caused by a previous modification
  attempt
  Previously I moved that line to the common branch to
  unconditionally define the root_mem_cgroup pointer.
  That was later discarded in favor of a macro definition, but I
  forgot to add back the original line.

v6:
- Add a new root_mem_cgroup definition for CONFIG_MEMCG=n cases
  So that users of root_mem_cgroup no longer need to check
  CONFIG_MEMCG.
  This fixes the compile error for CONFIG_MEMCG=n cases.

- Slight rewording of the 2nd patch

v5:
- Use the root memory cgroup to attach folios to the btree inode filemap
- Only try the higher order folio once, without NOFAIL or extra retries

v4:
- Hide the feature behind CONFIG_BTRFS_DEBUG
  So that end users won't be affected (i.e., still per-page based
  allocation) while we do more testing on this new behavior.

v3:
- Rebased to the latest for-next branch
- Use PAGE_ALLOC_COSTLY_ORDER to determine whether to use __GFP_NOFAIL
- Add a dependency MM patch "mm/page_alloc: unify the warning on NOFAIL
  and high order allocation"
  This allows us to use NOFAIL for nodesizes up to 32K, and makes sure
  that for the default 16K nodesize, all metadata goes into 16K folios

v2:
- Rebased to handle the change in "btrfs: cache folio size and shift in extent_buffer"

This is the latest update on the attempt to utilize larger folios for
btrfs metadata.

The previous version exposed a reproducible hang at btrfs/187, where we
hang in filemap_add_folio() around its memcg charge code.

Even without that problem, I still believe that for the btree inode we
do not really need the memcg charge at all, nor __GFP_NOFAIL to work
around possible memcg limits.

So in this update, as suggested by the memcg people from SUSE, there is
a new patch to make btree inode filemap folio attaching use the root
memcg, so that we won't be limited by any memcg.

Then for the patch enabling larger folios, I reverted to the old
behavior of trying the larger folio only once without extra retries,
just to be extra safe.


Qu Wenruo (3):
  memcontrol: define root_mem_cgroup for CONFIG_MEMCG=n cases
  btrfs: always uses root memcgroup for filemap_add_folio()
  btrfs: prefer to allocate larger folio for metadata

 fs/btrfs/extent_io.c       | 112 ++++++++++++++++++++++++++-----------
 include/linux/memcontrol.h |   6 ++
 2 files changed, 84 insertions(+), 34 deletions(-)

-- 
2.45.2



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v7 1/3] memcontrol: define root_mem_cgroup for CONFIG_MEMCG=n cases
  2024-07-19 10:28 [PATCH v7 0/3] btrfs: try to allocate larger folios for metadata Qu Wenruo
@ 2024-07-19 10:28 ` Qu Wenruo
  2024-07-19 11:13   ` Michal Hocko
  2024-07-19 10:28 ` [PATCH v7 2/3] btrfs: always uses root memcgroup for filemap_add_folio() Qu Wenruo
  2024-07-19 10:28 ` [PATCH v7 3/3] btrfs: prefer to allocate larger folio for metadata Qu Wenruo
  2 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2024-07-19 10:28 UTC (permalink / raw)
  To: linux-btrfs
  Cc: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	cgroups, linux-mm

There is an incoming btrfs patchset, which will use @root_mem_cgroup as
the active cgroup to attach metadata folios to its internal btree
inode, so that btrfs can skip the possibly costly charge for the
internal inode which is only accessible by btrfs itself.

However @root_mem_cgroup is not always defined (not defined for
CONFIG_MEMCG=n case), thus all such callers need to do the extra
handling for different CONFIG_MEMCG settings.

So here we add a special macro definition of root_mem_cgroup, making it
always NULL.

The advantage of this, other than pulling the pointer definition out of
the #ifdef, is that we avoid wasting global data section space for such
a pointer.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 include/linux/memcontrol.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 030d34e9d117..ae5c78719454 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -346,6 +346,12 @@ enum page_memcg_data_flags {
 
 #define __FIRST_OBJEXT_FLAG	(1UL << 0)
 
+/*
+ * For the CONFIG_MEMCG=n case, still define root_mem_cgroup, but it will
+ * always be NULL and will not take any global data section space.
+ */
+#define root_mem_cgroup		(NULL)
+
 #endif /* CONFIG_MEMCG */
 
 enum objext_flags {
-- 
2.45.2




* [PATCH v7 2/3] btrfs: always uses root memcgroup for filemap_add_folio()
  2024-07-19 10:28 [PATCH v7 0/3] btrfs: try to allocate larger folios for metadata Qu Wenruo
  2024-07-19 10:28 ` [PATCH v7 1/3] memcontrol: define root_mem_cgroup for CONFIG_MEMCG=n cases Qu Wenruo
@ 2024-07-19 10:28 ` Qu Wenruo
  2024-07-19 17:02   ` Johannes Weiner
  2024-07-19 10:28 ` [PATCH v7 3/3] btrfs: prefer to allocate larger folio for metadata Qu Wenruo
  2 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2024-07-19 10:28 UTC (permalink / raw)
  To: linux-btrfs
  Cc: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	cgroups, linux-mm, Michal Hocko, Vlastimil Babka

[BACKGROUND]
The function filemap_add_folio() charges the memory cgroup,
as we assume all page caches are accessible by user space processes
and thus need the cgroup accounting.

However btrfs is a special case: its metadata can be very large thanks
to its support of data checksums (by default 4 bytes per 4K of data,
and as large as 32 bytes per 4K of data).
This means btrfs has to use the page cache for its metadata pages, to
take advantage of both the caching and reclaim abilities of the filemap.

This has a small problem: all btrfs metadata pages have to go through
the memcg charge, even though those metadata pages are not accessible
by user space, and the charging can introduce some latency if a memory
limit is set.

Btrfs currently uses the __GFP_NOFAIL flag as a workaround for this
cgroup charge situation, so that metadata pages won't really be limited
by a memcg.

[ENHANCEMENT]
Instead of relying on __GFP_NOFAIL to avoid charge failure, use the
root memory cgroup when attaching metadata pages.

With root memory cgroup, we directly skip the charging part, and only
rely on __GFP_NOFAIL for the real memory allocation part.

Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index aa7f8148cd0d..cfeed7673009 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2971,6 +2971,7 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
 
 	struct btrfs_fs_info *fs_info = eb->fs_info;
 	struct address_space *mapping = fs_info->btree_inode->i_mapping;
+	struct mem_cgroup *old_memcg;
 	const unsigned long index = eb->start >> PAGE_SHIFT;
 	struct folio *existing_folio = NULL;
 	int ret;
@@ -2981,8 +2982,17 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
 	ASSERT(eb->folios[i]);
 
 retry:
+	/*
+	 * Btree inode is a btrfs internal inode, and not exposed to any
+	 * user.
+	 * Furthermore we do not want any cgroup limits on this inode.
+	 * So we always use root_mem_cgroup as our active memcg when attaching
+	 * the folios.
+	 */
+	old_memcg = set_active_memcg(root_mem_cgroup);
 	ret = filemap_add_folio(mapping, eb->folios[i], index + i,
 				GFP_NOFS | __GFP_NOFAIL);
+	set_active_memcg(old_memcg);
 	if (!ret)
 		goto finish;
 
-- 
2.45.2




* [PATCH v7 3/3] btrfs: prefer to allocate larger folio for metadata
  2024-07-19 10:28 [PATCH v7 0/3] btrfs: try to allocate larger folios for metadata Qu Wenruo
  2024-07-19 10:28 ` [PATCH v7 1/3] memcontrol: define root_mem_cgroup for CONFIG_MEMCG=n cases Qu Wenruo
  2024-07-19 10:28 ` [PATCH v7 2/3] btrfs: always uses root memcgroup for filemap_add_folio() Qu Wenruo
@ 2024-07-19 10:28 ` Qu Wenruo
  2 siblings, 0 replies; 14+ messages in thread
From: Qu Wenruo @ 2024-07-19 10:28 UTC (permalink / raw)
  To: linux-btrfs
  Cc: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	cgroups, linux-mm

For btrfs metadata, high order folios are only utilized when all of the
following conditions are met:

- The extent buffer start is aligned to nodesize
  This should be the common case for any btrfs filesystem created in
  the last 5 years.

- The nodesize is larger than the page size
  Otherwise there is no need to use larger folios at all.

- MM layer can fulfill our folio allocation request

- The larger folio must exactly cover the extent buffer
  No larger, no smaller, it must be an exact fit.

  This makes the extent buffer accessors much easier.
  They only need to check the first slot in eb->folios[] to determine
  their access unit (per-page handling, or a large folio covering the
  whole eb).

There is another small blockage: the filemap APIs can not guarantee the
folio size.
For example, by default we use a 16K nodesize on x86_64, meaning the
larger folio we expect would be order 2 (size 16K).
We don't accept two order 1 (size 8K) folios; in that case we fall back
to four order 0 (page sized) folios.

So here we use a different workaround: allocate an order 2 folio first,
then attach it to the filemap of the metadata inode.

Thus the attempt to attach eb folios can have several results:

1) We can attach the pre-allocated eb folio to the filemap
   This is the simplest and hottest path; we just continue our work
   setting up the extent buffer.

2) There is an existing folio in the filemap

   2.0) Subpage case
        We reuse the folio no matter what; subpage handles
        folio->private differently (a bitmap rather than a pointer to
        an existing eb).

   2.1) There is already a live extent buffer attached to the filemap
        folio
        This should be a more or less hot path; we grab the existing eb
        and free the current one.

   2.2) No live eb.
   2.2.1) The filemap folio is larger than the eb folio
          This is an easier case; we can reuse the filemap folio, but
          we need to clean up all the pre-allocated folios of the
          new eb before reusing it.
          Later code should take the folio size change into
          consideration.

   2.2.2) The filemap folio is the same size as the eb folio
          We just free the current folio and reuse the filemap one.
          No other special handling is needed.

   2.2.3) The filemap folio is smaller than the eb folio
          This is the trickiest corner case; we can not easily replace
          the folio in the filemap with our eb folio.

          Thus here we return -EAGAIN, to inform the caller to retry
          with order 0 (with our larger folio freed, of course).

Otherwise all the needed infrastructure is already here; we only need to
try allocating a larger folio as our first attempt in
alloc_eb_folio_array().

For now, the higher order allocation is only a preferred attempt for
debug builds, until we have enough test coverage to push it to end
users.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 102 ++++++++++++++++++++++++++++---------------
 1 file changed, 68 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cfeed7673009..d7824644d593 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -719,12 +719,28 @@ int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array,
  *
  * For now, the folios populated are always in order 0 (aka, single page).
  */
-static int alloc_eb_folio_array(struct extent_buffer *eb, bool nofail)
+static int alloc_eb_folio_array(struct extent_buffer *eb, int order,
+				bool nofail)
 {
 	struct page *page_array[INLINE_EXTENT_BUFFER_PAGES] = { 0 };
 	int num_pages = num_extent_pages(eb);
 	int ret;
 
+	if (order) {
+		gfp_t gfp;
+
+		if (order > 0)
+			gfp = GFP_NOFS | __GFP_NORETRY | __GFP_NOWARN;
+		else
+			gfp = nofail ? (GFP_NOFS | __GFP_NOFAIL) : GFP_NOFS;
+		eb->folios[0] = folio_alloc(gfp, order);
+		if (likely(eb->folios[0])) {
+			eb->folio_size = folio_size(eb->folios[0]);
+			eb->folio_shift = folio_shift(eb->folios[0]);
+			return 0;
+		}
+		/* Fallback to 0 order (single page) allocation. */
+	}
 	ret = btrfs_alloc_page_array(num_pages, page_array, nofail);
 	if (ret < 0)
 		return ret;
@@ -2707,7 +2723,7 @@ struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
 	 */
 	set_bit(EXTENT_BUFFER_UNMAPPED, &new->bflags);
 
-	ret = alloc_eb_folio_array(new, false);
+	ret = alloc_eb_folio_array(new, 0, false);
 	if (ret) {
 		btrfs_release_extent_buffer(new);
 		return NULL;
@@ -2740,7 +2756,7 @@ struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
 	if (!eb)
 		return NULL;
 
-	ret = alloc_eb_folio_array(eb, false);
+	ret = alloc_eb_folio_array(eb, 0, false);
 	if (ret)
 		goto err;
 
@@ -2955,6 +2971,14 @@ static int check_eb_alignment(struct btrfs_fs_info *fs_info, u64 start)
 	return 0;
 }
 
+static void free_all_eb_folios(struct extent_buffer *eb)
+{
+	for (int i = 0; i < INLINE_EXTENT_BUFFER_PAGES; i++) {
+		if (eb->folios[i])
+			folio_put(eb->folios[i]);
+		eb->folios[i] = NULL;
+	}
+}
 
 /*
  * Return 0 if eb->folios[i] is attached to btree inode successfully.
@@ -2974,6 +2998,7 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
 	struct mem_cgroup *old_memcg;
 	const unsigned long index = eb->start >> PAGE_SHIFT;
 	struct folio *existing_folio = NULL;
+	const int eb_order = folio_order(eb->folios[0]);
 	int ret;
 
 	ASSERT(found_eb_ret);
@@ -3003,15 +3028,6 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
 		goto retry;
 	}
 
-	/* For now, we should only have single-page folios for btree inode. */
-	ASSERT(folio_nr_pages(existing_folio) == 1);
-
-	if (folio_size(existing_folio) != eb->folio_size) {
-		folio_unlock(existing_folio);
-		folio_put(existing_folio);
-		return -EAGAIN;
-	}
-
 finish:
 	spin_lock(&mapping->i_private_lock);
 	if (existing_folio && fs_info->nodesize < PAGE_SIZE) {
@@ -3020,6 +3036,7 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
 		eb->folios[i] = existing_folio;
 	} else if (existing_folio) {
 		struct extent_buffer *existing_eb;
+		int existing_order = folio_order(existing_folio);
 
 		existing_eb = grab_extent_buffer(fs_info,
 						 folio_page(existing_folio, 0));
@@ -3031,9 +3048,34 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
 			folio_put(existing_folio);
 			return 1;
 		}
-		/* The extent buffer no longer exists, we can reuse the folio. */
-		__free_page(folio_page(eb->folios[i], 0));
-		eb->folios[i] = existing_folio;
+		if (existing_order > eb_order) {
+			/*
+			 * The existing one has higher order, we need to drop
+			 * all eb folios before reusing it.
+			 * And this should only happen for the first folio.
+			 */
+			ASSERT(i == 0);
+			free_all_eb_folios(eb);
+			eb->folios[i] = existing_folio;
+		} else if (existing_order == eb_order) {
+			/*
+			 * Can safely reuse the filemap folio, just
+			 * release the eb one.
+			 */
+			folio_put(eb->folios[i]);
+			eb->folios[i] = existing_folio;
+		} else {
+			/*
+			 * The existing one has lower order.
+			 *
+			 * Just retry and fallback to order 0.
+			 */
+			ASSERT(i == 0);
+			folio_unlock(existing_folio);
+			folio_put(existing_folio);
+			spin_unlock(&mapping->i_private_lock);
+			return -EAGAIN;
+		}
 	}
 	eb->folio_size = folio_size(eb->folios[i]);
 	eb->folio_shift = folio_shift(eb->folios[i]);
@@ -3066,6 +3108,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 	u64 lockdep_owner = owner_root;
 	bool page_contig = true;
 	int uptodate = 1;
+	int order = 0;
 	int ret;
 
 	if (check_eb_alignment(fs_info, start))
@@ -3082,6 +3125,10 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		btrfs_warn_32bit_limit(fs_info);
 #endif
 
+	if (IS_ENABLED(CONFIG_BTRFS_DEBUG) && fs_info->nodesize > PAGE_SIZE &&
+	    IS_ALIGNED(start, fs_info->nodesize))
+		order = ilog2(fs_info->nodesize >> PAGE_SHIFT);
+
 	eb = find_extent_buffer(fs_info, start);
 	if (eb)
 		return eb;
@@ -3116,7 +3163,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 
 reallocate:
 	/* Allocate all pages first. */
-	ret = alloc_eb_folio_array(eb, true);
+	ret = alloc_eb_folio_array(eb, order, true);
 	if (ret < 0) {
 		btrfs_free_subpage(prealloc);
 		goto out;
@@ -3134,26 +3181,12 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		}
 
 		/*
-		 * TODO: Special handling for a corner case where the order of
-		 * folios mismatch between the new eb and filemap.
-		 *
-		 * This happens when:
-		 *
-		 * - the new eb is using higher order folio
-		 *
-		 * - the filemap is still using 0-order folios for the range
-		 *   This can happen at the previous eb allocation, and we don't
-		 *   have higher order folio for the call.
-		 *
-		 * - the existing eb has already been freed
-		 *
-		 * In this case, we have to free the existing folios first, and
-		 * re-allocate using the same order.
-		 * Thankfully this is not going to happen yet, as we're still
-		 * using 0-order folios.
+		 * Got a corner case where the existing folio is lower order,
+		 * fallback to 0 order and retry.
 		 */
 		if (unlikely(ret == -EAGAIN)) {
-			ASSERT(0);
+			order = 0;
+			free_all_eb_folios(eb);
 			goto reallocate;
 		}
 		attached++;
@@ -3164,6 +3197,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		 * and free the allocated page.
 		 */
 		folio = eb->folios[i];
+		num_folios = num_extent_folios(eb);
 		WARN_ON(btrfs_folio_test_dirty(fs_info, folio, eb->start, eb->len));
 
 		/*
-- 
2.45.2




* Re: [PATCH v7 1/3] memcontrol: define root_mem_cgroup for CONFIG_MEMCG=n cases
  2024-07-19 10:28 ` [PATCH v7 1/3] memcontrol: define root_mem_cgroup for CONFIG_MEMCG=n cases Qu Wenruo
@ 2024-07-19 11:13   ` Michal Hocko
  2024-07-19 21:58     ` Qu Wenruo
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2024-07-19 11:13 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: linux-btrfs, hannes, roman.gushchin, shakeel.butt, muchun.song,
	cgroups, linux-mm

On Fri 19-07-24 19:58:39, Qu Wenruo wrote:
> There is an incoming btrfs patchset, which will use @root_mem_cgroup as
> the active cgroup to attach metadata folios to its internal btree
> inode, so that btrfs can skip the possibly costly charge for the
> internal inode which is only accessible by btrfs itself.
> 
> However @root_mem_cgroup is not always defined (not defined for
> CONFIG_MEMCG=n case), thus all such callers need to do the extra
> handling for different CONFIG_MEMCG settings.
> 
> So here we add a special macro definition of root_mem_cgroup, making it
> to always be NULL.

Isn't just a declaration sufficient? Nothing should really dereference
the pointer anyway.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v7 2/3] btrfs: always uses root memcgroup for filemap_add_folio()
  2024-07-19 10:28 ` [PATCH v7 2/3] btrfs: always uses root memcgroup for filemap_add_folio() Qu Wenruo
@ 2024-07-19 17:02   ` Johannes Weiner
  2024-07-19 17:25     ` Roman Gushchin
  2024-07-19 18:13     ` Michal Hocko
  0 siblings, 2 replies; 14+ messages in thread
From: Johannes Weiner @ 2024-07-19 17:02 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: linux-btrfs, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	cgroups, linux-mm, Michal Hocko, Vlastimil Babka

On Fri, Jul 19, 2024 at 07:58:40PM +0930, Qu Wenruo wrote:
> [BACKGROUND]
> The function filemap_add_folio() charges the memory cgroup,
> as we assume all page caches are accessible by user space progresses
> thus needs the cgroup accounting.
> 
> However btrfs is a special case, it has a very large metadata thanks to
> its support of data csum (by default it's 4 bytes per 4K data, and can
> be as large as 32 bytes per 4K data).
> This means btrfs has to go page cache for its metadata pages, to take
> advantage of both cache and reclaim ability of filemap.
> 
> This has a tiny problem, that all btrfs metadata pages have to go through
> the memcgroup charge, even all those metadata pages are not
> accessible by the user space, and doing the charging can introduce some
> latency if there is a memory limits set.
> 
> Btrfs currently uses __GFP_NOFAIL flag as a workaround for this cgroup
> charge situation so that metadata pages won't really be limited by
> memcgroup.
> 
> [ENHANCEMENT]
> Instead of relying on __GFP_NOFAIL to avoid charge failure, use root
> memory cgroup to attach metadata pages.
> 
> With root memory cgroup, we directly skip the charging part, and only
> rely on __GFP_NOFAIL for the real memory allocation part.
> 
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Suggested-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent_io.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index aa7f8148cd0d..cfeed7673009 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2971,6 +2971,7 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
>  
>  	struct btrfs_fs_info *fs_info = eb->fs_info;
>  	struct address_space *mapping = fs_info->btree_inode->i_mapping;
> +	struct mem_cgroup *old_memcg;
>  	const unsigned long index = eb->start >> PAGE_SHIFT;
>  	struct folio *existing_folio = NULL;
>  	int ret;
> @@ -2981,8 +2982,17 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
>  	ASSERT(eb->folios[i]);
>  
>  retry:
> +	/*
> +	 * Btree inode is a btrfs internal inode, and not exposed to any
> +	 * user.
> +	 * Furthermore we do not want any cgroup limits on this inode.
> +	 * So we always use root_mem_cgroup as our active memcg when attaching
> +	 * the folios.
> +	 */
> +	old_memcg = set_active_memcg(root_mem_cgroup);
>  	ret = filemap_add_folio(mapping, eb->folios[i], index + i,
>  				GFP_NOFS | __GFP_NOFAIL);
> +	set_active_memcg(old_memcg);

It looks correct. But it's going through the whole dance to set up
current->active_memcg, then have the charge path look that up,
css_get(), call try_charge() only to bail immediately, css_put(), then
update current->active_memcg again. All those branches are necessary
when we want to charge to a "real" other cgroup. But in this case, we
always know we're not charging, so it seems uncalled for.

Wouldn't it be a lot simpler (and cheaper) to have a
filemap_add_folio_nocharge()?



* Re: [PATCH v7 2/3] btrfs: always uses root memcgroup for filemap_add_folio()
  2024-07-19 17:02   ` Johannes Weiner
@ 2024-07-19 17:25     ` Roman Gushchin
  2024-07-19 18:13     ` Michal Hocko
  1 sibling, 0 replies; 14+ messages in thread
From: Roman Gushchin @ 2024-07-19 17:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Qu Wenruo, linux-btrfs, mhocko, shakeel.butt, muchun.song,
	cgroups, linux-mm, Michal Hocko, Vlastimil Babka

On Fri, Jul 19, 2024 at 01:02:06PM -0400, Johannes Weiner wrote:
> On Fri, Jul 19, 2024 at 07:58:40PM +0930, Qu Wenruo wrote:
> > [BACKGROUND]
> > The function filemap_add_folio() charges the memory cgroup,
> > as we assume all page caches are accessible by user space progresses
> > thus needs the cgroup accounting.
> > 
> > However btrfs is a special case, it has a very large metadata thanks to
> > its support of data csum (by default it's 4 bytes per 4K data, and can
> > be as large as 32 bytes per 4K data).
> > This means btrfs has to go page cache for its metadata pages, to take
> > advantage of both cache and reclaim ability of filemap.
> > 
> > This has a tiny problem, that all btrfs metadata pages have to go through
> > the memcgroup charge, even all those metadata pages are not
> > accessible by the user space, and doing the charging can introduce some
> > latency if there is a memory limits set.
> > 
> > Btrfs currently uses __GFP_NOFAIL flag as a workaround for this cgroup
> > charge situation so that metadata pages won't really be limited by
> > memcgroup.
> > 
> > [ENHANCEMENT]
> > Instead of relying on __GFP_NOFAIL to avoid charge failure, use root
> > memory cgroup to attach metadata pages.
> > 
> > With root memory cgroup, we directly skip the charging part, and only
> > rely on __GFP_NOFAIL for the real memory allocation part.
> > 
> > Suggested-by: Michal Hocko <mhocko@suse.com>
> > Suggested-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> > Signed-off-by: Qu Wenruo <wqu@suse.com>
> > ---
> >  fs/btrfs/extent_io.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index aa7f8148cd0d..cfeed7673009 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -2971,6 +2971,7 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
> >  
> >  	struct btrfs_fs_info *fs_info = eb->fs_info;
> >  	struct address_space *mapping = fs_info->btree_inode->i_mapping;
> > +	struct mem_cgroup *old_memcg;
> >  	const unsigned long index = eb->start >> PAGE_SHIFT;
> >  	struct folio *existing_folio = NULL;
> >  	int ret;
> > @@ -2981,8 +2982,17 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
> >  	ASSERT(eb->folios[i]);
> >  
> >  retry:
> > +	/*
> > +	 * Btree inode is a btrfs internal inode, and not exposed to any
> > +	 * user.
> > +	 * Furthermore we do not want any cgroup limits on this inode.
> > +	 * So we always use root_mem_cgroup as our active memcg when attaching
> > +	 * the folios.
> > +	 */
> > +	old_memcg = set_active_memcg(root_mem_cgroup);
> >  	ret = filemap_add_folio(mapping, eb->folios[i], index + i,
> >  				GFP_NOFS | __GFP_NOFAIL);
> > +	set_active_memcg(old_memcg);
> 
> It looks correct. But it's going through all dance to set up
> current->active_memcg, then have the charge path look that up,
> css_get(), call try_charge() only to bail immediately, css_put(), then
> update current->active_memcg again. All those branches are necessary
> when we want to charge to a "real" other cgroup. But in this case, we
> always know we're not charging, so it seems uncalled for.
> 
> Wouldn't it be a lot simpler (and cheaper) to have a
> filemap_add_folio_nocharge()?

Time to restore GFP_NOACCOUNT? I think it might be useful for allocating
objects which are shared across the entire system and/or unlikely to go
away under memory pressure.



* Re: [PATCH v7 2/3] btrfs: always uses root memcgroup for filemap_add_folio()
  2024-07-19 17:02   ` Johannes Weiner
  2024-07-19 17:25     ` Roman Gushchin
@ 2024-07-19 18:13     ` Michal Hocko
  2024-07-19 22:11       ` Qu Wenruo
  2024-07-24  3:46       ` Qu Wenruo
  1 sibling, 2 replies; 14+ messages in thread
From: Michal Hocko @ 2024-07-19 18:13 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Qu Wenruo, linux-btrfs, roman.gushchin, shakeel.butt,
	muchun.song, cgroups, linux-mm, Vlastimil Babka

On Fri 19-07-24 13:02:06, Johannes Weiner wrote:
> On Fri, Jul 19, 2024 at 07:58:40PM +0930, Qu Wenruo wrote:
> > [BACKGROUND]
> > The function filemap_add_folio() charges the memory cgroup,
> > as we assume all page caches are accessible by user space progresses
> > thus needs the cgroup accounting.
> > 
> > However btrfs is a special case, it has a very large metadata thanks to
> > its support of data csum (by default it's 4 bytes per 4K data, and can
> > be as large as 32 bytes per 4K data).
> > This means btrfs has to go page cache for its metadata pages, to take
> > advantage of both cache and reclaim ability of filemap.
> > 
> > This has a tiny problem, that all btrfs metadata pages have to go through
> > the memcgroup charge, even all those metadata pages are not
> > accessible by the user space, and doing the charging can introduce some
> > latency if there is a memory limits set.
> > 
> > Btrfs currently uses __GFP_NOFAIL flag as a workaround for this cgroup
> > charge situation so that metadata pages won't really be limited by
> > memcgroup.
> > 
> > [ENHANCEMENT]
> > Instead of relying on __GFP_NOFAIL to avoid charge failure, use root
> > memory cgroup to attach metadata pages.
> > 
> > With root memory cgroup, we directly skip the charging part, and only
> > rely on __GFP_NOFAIL for the real memory allocation part.
> > 
> > Suggested-by: Michal Hocko <mhocko@suse.com>
> > Suggested-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> > Signed-off-by: Qu Wenruo <wqu@suse.com>
> > ---
> >  fs/btrfs/extent_io.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index aa7f8148cd0d..cfeed7673009 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -2971,6 +2971,7 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
> >  
> >  	struct btrfs_fs_info *fs_info = eb->fs_info;
> >  	struct address_space *mapping = fs_info->btree_inode->i_mapping;
> > +	struct mem_cgroup *old_memcg;
> >  	const unsigned long index = eb->start >> PAGE_SHIFT;
> >  	struct folio *existing_folio = NULL;
> >  	int ret;
> > @@ -2981,8 +2982,17 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
> >  	ASSERT(eb->folios[i]);
> >  
> >  retry:
> > +	/*
> > +	 * Btree inode is a btrfs internal inode, and not exposed to any
> > +	 * user.
> > +	 * Furthermore we do not want any cgroup limits on this inode.
> > +	 * So we always use root_mem_cgroup as our active memcg when attaching
> > +	 * the folios.
> > +	 */
> > +	old_memcg = set_active_memcg(root_mem_cgroup);
> >  	ret = filemap_add_folio(mapping, eb->folios[i], index + i,
> >  				GFP_NOFS | __GFP_NOFAIL);

I thought you've said that NOFAIL was added to work around memcg
charges. Can you remove it when memcg is out of the picture?

It would be great to add some background about how much memory we are
talking about, because this might require memcg configuration in some
setups.

> > +	set_active_memcg(old_memcg);
> 
> It looks correct. But it's going through all dance to set up
> current->active_memcg, then have the charge path look that up,
> css_get(), call try_charge() only to bail immediately, css_put(), then
> update current->active_memcg again. All those branches are necessary
> when we want to charge to a "real" other cgroup. But in this case, we
> always know we're not charging, so it seems uncalled for.
> 
> Wouldn't it be a lot simpler (and cheaper) to have a
> filemap_add_folio_nocharge()?

Yes, that would certainly simplify things. From the previous discussion
I understood that there would be broader scopes which would opt out of
charging. If this is really about a single filemap_add_folio() call,
then having a variant that doesn't call mem_cgroup_charge() sounds like
a much more viable option, and it also doesn't require any memcg
specific changes.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v7 1/3] memcontrol: define root_mem_cgroup for CONFIG_MEMCG=n cases
  2024-07-19 11:13   ` Michal Hocko
@ 2024-07-19 21:58     ` Qu Wenruo
  2024-07-22  7:32       ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2024-07-19 21:58 UTC (permalink / raw)
  To: Michal Hocko, Qu Wenruo
  Cc: linux-btrfs, hannes, roman.gushchin, shakeel.butt, muchun.song,
	cgroups, linux-mm



On 2024/7/19 20:43, Michal Hocko wrote:
> On Fri 19-07-24 19:58:39, Qu Wenruo wrote:
>> There is an incoming btrfs patchset, which will use @root_mem_cgroup as
>> the active cgroup to attach metadata folios to its internal btree
>> inode, so that btrfs can skip the possibly costly charge for the
>> internal inode which is only accessible by btrfs itself.
>>
>> However @root_mem_cgroup is not always defined (it is not defined for
>> the CONFIG_MEMCG=n case), thus all such callers would need extra
>> handling for different CONFIG_MEMCG settings.
>>
>> So here we add a special macro definition of root_mem_cgroup, making
>> it always be NULL.
>
> Isn't just a declaration sufficient? Nothing should really dereference
> the pointer anyway.
>
That can pass compilation, but wastes extra bytes for the pointer in
the data section, even if no one is using that pointer.
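For reference, the trade-off being discussed can be modeled in plain C. This is a simplified userspace sketch, not the kernel header: the struct is left opaque and `pick_active_memcg()` is a hypothetical caller invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct mem_cgroup;  /* opaque; callers never dereference it */

/* Model of the CONFIG_MEMCG=n case from the patch: root_mem_cgroup is
 * a macro that always evaluates to NULL, so no pointer object is ever
 * emitted into the data section, and no real symbol needs to exist at
 * link time. */
#define root_mem_cgroup ((struct mem_cgroup *)NULL)

/* Callers can now use root_mem_cgroup unconditionally, without any
 * #ifdef CONFIG_MEMCG guards around the call sites. */
static struct mem_cgroup *pick_active_memcg(void)
{
	return root_mem_cgroup;
}
```

A plain definition (`struct mem_cgroup *root_mem_cgroup;`) would instead reserve pointer-sized storage that nothing reads, which is the waste Qu is pointing at.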

Thanks,
Qu



* Re: [PATCH v7 2/3] btrfs: always uses root memcgroup for filemap_add_folio()
  2024-07-19 18:13     ` Michal Hocko
@ 2024-07-19 22:11       ` Qu Wenruo
  2024-07-22  7:34         ` Michal Hocko
  2024-07-24  3:46       ` Qu Wenruo
  1 sibling, 1 reply; 14+ messages in thread
From: Qu Wenruo @ 2024-07-19 22:11 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner
  Cc: Qu Wenruo, linux-btrfs, roman.gushchin, shakeel.butt,
	muchun.song, cgroups, linux-mm, Vlastimil Babka



On 2024/7/20 03:43, Michal Hocko wrote:
> On Fri 19-07-24 13:02:06, Johannes Weiner wrote:
>> On Fri, Jul 19, 2024 at 07:58:40PM +0930, Qu Wenruo wrote:
>>> [BACKGROUND]
>>> The function filemap_add_folio() charges the memory cgroup,
>>> as we assume all page caches are accessible by user space processes
>>> and thus need the cgroup accounting.
>>>
>>> However btrfs is a special case: it has very large metadata, thanks to
>>> its support of data csums (by default 4 bytes per 4K of data, and up
>>> to 32 bytes per 4K of data).
>>> This means btrfs has to use the page cache for its metadata pages, to
>>> take advantage of both the caching and reclaim abilities of the filemap.
>>>
>>> This has a tiny problem: all btrfs metadata pages have to go through
>>> the memcg charge, even though those metadata pages are not accessible
>>> by user space, and doing the charging can introduce some latency if a
>>> memory limit is set.
>>>
>>> Btrfs currently uses the __GFP_NOFAIL flag as a workaround for this
>>> cgroup charge situation so that metadata pages won't really be limited
>>> by memcg.
>>>
>>> [ENHANCEMENT]
>>> Instead of relying on __GFP_NOFAIL to avoid charge failure, use root
>>> memory cgroup to attach metadata pages.
>>>
>>> With root memory cgroup, we directly skip the charging part, and only
>>> rely on __GFP_NOFAIL for the real memory allocation part.
>>>
>>> Suggested-by: Michal Hocko <mhocko@suse.com>
>>> Suggested-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>>   fs/btrfs/extent_io.c | 10 ++++++++++
>>>   1 file changed, 10 insertions(+)
>>>
>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>> index aa7f8148cd0d..cfeed7673009 100644
>>> --- a/fs/btrfs/extent_io.c
>>> +++ b/fs/btrfs/extent_io.c
>>> @@ -2971,6 +2971,7 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
>>>
>>>   	struct btrfs_fs_info *fs_info = eb->fs_info;
>>>   	struct address_space *mapping = fs_info->btree_inode->i_mapping;
>>> +	struct mem_cgroup *old_memcg;
>>>   	const unsigned long index = eb->start >> PAGE_SHIFT;
>>>   	struct folio *existing_folio = NULL;
>>>   	int ret;
>>> @@ -2981,8 +2982,17 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
>>>   	ASSERT(eb->folios[i]);
>>>
>>>   retry:
>>> +	/*
>>> +	 * Btree inode is a btrfs internal inode, and not exposed to any
>>> +	 * user.
>>> +	 * Furthermore we do not want any cgroup limits on this inode.
>>> +	 * So we always use root_mem_cgroup as our active memcg when attaching
>>> +	 * the folios.
>>> +	 */
>>> +	old_memcg = set_active_memcg(root_mem_cgroup);
>>>   	ret = filemap_add_folio(mapping, eb->folios[i], index + i,
>>>   				GFP_NOFS | __GFP_NOFAIL);
>
> I thought you said that NOFAIL was added to work around memcg
> charges. Can you remove it when memcg is out of the picture?

Sure, but that would be a dedicated patch, as we need to add the -ENOMEM
handling.
I already have such a patch from before:
https://lore.kernel.org/linux-btrfs/d6a9c038e12f1f2dae353f1ba657ba0666f0aaaa.1720159494.git.wqu@suse.com/

But that was before the memcg change.

I'd prefer to have all the larger folio work fully tested and merged,
then clean up the NOFAIL flags.

>
> It would be great to add some background about how much memory we are
> talking about, because this might require memcg configuration in some
> setups.
>
>>> +	set_active_memcg(old_memcg);
>>
>> It looks correct. But it's going through all the dance to set up
>> current->active_memcg, then have the charge path look it up,
>> css_get(), call try_charge() only to bail immediately, css_put(), then
>> update current->active_memcg again. All those branches are necessary
>> when we want to charge to a "real" other cgroup. But in this case, we
>> always know we're not charging, so it seems uncalled for.
>>
>> Wouldn't it be a lot simpler (and cheaper) to have a
>> filemap_add_folio_nocharge()?
>
> Yes, that would certainly simplify things. From the previous discussion
> I understood that there would be broader scopes which would opt out of
> charging. If this is really about a single filemap_add_folio() call,
> then having a variant that doesn't call mem_cgroup_charge() sounds like
> a much more viable option, and it also doesn't require any memcg
> specific changes.
>

I'm not 100% sure if the VFS folks would be happy with that.

The current filemap folio interfaces are already quite consolidated,
compared to all the various page-based interfaces for different
situations.

E.g. we have the following wrappers related to filemap page cache
search/creation:

- find_get_page() and find_get_page_flags()
- find_lock_page()
- find_or_create_page()
- grab_cache_page_nowait()
- grab_cache_page()

Meanwhile there are just two folio interfaces:

- filemap_get_folio()
- __filemap_get_folio()

So going by that trend, I'm pretty sure the VFS people would reject
such a new interface just to skip accounting.

Thus the GFP_NO_ACCOUNT solution looks more feasible.

Thanks,
Qu



* Re: [PATCH v7 1/3] memcontrol: define root_mem_cgroup for CONFIG_MEMCG=n cases
  2024-07-19 21:58     ` Qu Wenruo
@ 2024-07-22  7:32       ` Michal Hocko
  0 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2024-07-22  7:32 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Qu Wenruo, linux-btrfs, hannes, roman.gushchin, shakeel.butt,
	muchun.song, cgroups, linux-mm

On Sat 20-07-24 07:28:45, Qu Wenruo wrote:
> 
> 
On 2024/7/19 20:43, Michal Hocko wrote:
> > On Fri 19-07-24 19:58:39, Qu Wenruo wrote:
> > > There is an incoming btrfs patchset, which will use @root_mem_cgroup as
> > > the active cgroup to attach metadata folios to its internal btree
> > > inode, so that btrfs can skip the possibly costly charge for the
> > > internal inode which is only accessible by btrfs itself.
> > > 
> > > However @root_mem_cgroup is not always defined (it is not defined for
> > > the CONFIG_MEMCG=n case), thus all such callers would need extra
> > > handling for different CONFIG_MEMCG settings.
> > > 
> > > So here we add a special macro definition of root_mem_cgroup, making
> > > it always be NULL.
> > 
> > Isn't just a declaration sufficient? Nothing should really dereference
> > the pointer anyway.
> > 
> That can pass compilation, but wastes extra bytes for the pointer in
> the data section, even if no one is using that pointer.

Are you sure that a mere declaration would end up defined in the data section?
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v7 2/3] btrfs: always uses root memcgroup for filemap_add_folio()
  2024-07-19 22:11       ` Qu Wenruo
@ 2024-07-22  7:34         ` Michal Hocko
  2024-07-22  8:08           ` Qu Wenruo
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2024-07-22  7:34 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Johannes Weiner, Qu Wenruo, linux-btrfs, roman.gushchin,
	shakeel.butt, muchun.song, cgroups, linux-mm, Vlastimil Babka

On Sat 20-07-24 07:41:19, Qu Wenruo wrote:
[...]
> So according to the trend, I'm pretty sure VFS people will reject such
> new interface just to skip accounting.

I would just give it a try with your usecase described. If this is a
nogo then the root cgroup workaround is still available.

> Thus the GFP_NO_ACCOUNT solution looks more feasible.

So we have GFP_ACCOUNT to opt in for accounting and now we should be
adding GFP_NO_ACCOUNT to override it? This doesn't sound like a good use
of gfp flags (which we do not have infinitely) and it is also quite
confusing TBH.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v7 2/3] btrfs: always uses root memcgroup for filemap_add_folio()
  2024-07-22  7:34         ` Michal Hocko
@ 2024-07-22  8:08           ` Qu Wenruo
  0 siblings, 0 replies; 14+ messages in thread
From: Qu Wenruo @ 2024-07-22  8:08 UTC (permalink / raw)
  To: Michal Hocko, Qu Wenruo
  Cc: Johannes Weiner, linux-btrfs, roman.gushchin, shakeel.butt,
	muchun.song, cgroups, linux-mm, Vlastimil Babka



On 2024/7/22 17:04, Michal Hocko wrote:
> On Sat 20-07-24 07:41:19, Qu Wenruo wrote:
> [...]
>> So according to the trend, I'm pretty sure VFS people will reject such
>> new interface just to skip accounting.
> 
> I would just give it a try with your usecase described. If this is a
> nogo then the root cgroup workaround is still available.

I have submitted a patchset doing exactly that, and thankfully that's
the series where I got all the helpful feedback:

https://lore.kernel.org/linux-btrfs/92dea37a395781ee4d5cf8b16307801ccd8a5700.1720572937.git.wqu@suse.com/

Unfortunately I haven't gotten any feedback from the VFS folks.

> 
>> Thus the GFP_NO_ACCOUNT solution looks more feasible.
> 
> So we have GFP_ACCOUNT to opt in for accounting and now we should be
> adding GFP_NO_ACCOUNT to override it? This doesn't sound like a good use
> of gfp flags (which we do not have infinitely) and it is also quite
> confusing TBH.

The problem is, for filemap_add_folio() we didn't specify GFP_ACCOUNT
(nor did any other caller), but it is still doing the charge, due to
the mostly-correct assumption that all filemap page caches are
accessible to user space programs.

So one can argue that the cgroup is still charged even if no
GFP_ACCOUNT is specified.

But I get your point; indeed it's not that good an idea to introduce
GFP_NO_ACCOUNT.

Thanks,
Qu



* Re: [PATCH v7 2/3] btrfs: always uses root memcgroup for filemap_add_folio()
  2024-07-19 18:13     ` Michal Hocko
  2024-07-19 22:11       ` Qu Wenruo
@ 2024-07-24  3:46       ` Qu Wenruo
  1 sibling, 0 replies; 14+ messages in thread
From: Qu Wenruo @ 2024-07-24  3:46 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner
  Cc: Qu Wenruo, linux-btrfs, roman.gushchin, shakeel.butt,
	muchun.song, cgroups, linux-mm, Vlastimil Babka



On 2024/7/20 03:43, Michal Hocko wrote:
[...]
>
>>> +	set_active_memcg(old_memcg);
>>
>> It looks correct. But it's going through all the dance to set up
>> current->active_memcg, then have the charge path look it up,
>> css_get(), call try_charge() only to bail immediately, css_put(), then
>> update current->active_memcg again. All those branches are necessary
>> when we want to charge to a "real" other cgroup. But in this case, we
>> always know we're not charging, so it seems uncalled for.
>>
>> Wouldn't it be a lot simpler (and cheaper) to have a
>> filemap_add_folio_nocharge()?
>
> Yes, that would certainly simplify things. From the previous discussion
> I understood that there would be broader scopes which would opt out of
> charging. If this is really about a single filemap_add_folio() call,
> then having a variant that doesn't call mem_cgroup_charge() sounds like
> a much more viable option, and it also doesn't require any memcg
> specific changes.
>

Talking about skipping mem cgroup charging, I still have a question.

[MEMCG AT FOLIO EVICTION TIME]
Even if we completely skip the mem cgroup charging, we cannot really
escape the eviction time handling.

In fact, if we just skip mem_cgroup_charge(), the kernel would crash
when evicting the folio: in lru_gen_eviction(), folio_memcg() would
just return NULL, and mem_cgroup_id(memcg) would trigger a NULL
pointer dereference.

That's why I sent out a patch fixing that first:
https://lore.kernel.org/linux-mm/e1036b9cc8928be9a7dec150ab3a0317bd7180cf.1720572937.git.wqu@suse.com/

I'm not sure if it's going to cause any extra problems even with the
above fix.

And just for the sake of consistency, it looks saner to have
root_mem_cgroup for the filemap_add_folio() operation rather than
leaving it empty, especially since most filemaps still need proper
memcg handling.
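The crash mode can be sketched in a tiny userspace model. All names and the root id here are simplified stand-ins for the kernel helpers, not the real implementation:

```c
#include <assert.h>
#include <stddef.h>

struct mem_cgroup { unsigned short id; };
struct folio { struct mem_cgroup *memcg; };

/* Model of mem_cgroup_id(): it unconditionally dereferences memcg, so
 * a folio that was never charged (folio->memcg == NULL) would crash
 * here -- the lru_gen_eviction() problem described above. */
static unsigned short mem_cgroup_id(const struct mem_cgroup *memcg)
{
	return memcg->id;
}

/* Sketch of the kind of guard the referenced fix adds: treat an
 * uncharged folio as belonging to the root memcg (id 1 is an
 * assumption of this model, not the kernel's actual value). */
static unsigned short folio_memcg_id(const struct folio *folio)
{
	return folio->memcg ? mem_cgroup_id(folio->memcg) : 1;
}
```

Attaching folios to root_mem_cgroup up front avoids ever creating the NULL case that this guard papers over, which is the consistency argument above.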


[REALLY EXPENSIVE?]
Another question is: are set_active_memcg() and the later handling
really that expensive?

set_active_memcg() is small enough to be an inline function, as are
active_memcg(), css_get() and the root memcg path of try_charge().

The later commit part is not that expensive either, mostly simple
member or per-cpu assignments.

From my limited knowledge of mem cgroups, most of the heavy lifting
happens in the slow path of try_charge_memcg().

Even with all the set_active_memcg() calls, the whole extra overhead
still looks very tiny.
And it should already be a big win for btrfs to opt out of the regular
charging routine.

Thanks,
Qu



end of thread, other threads:[~2024-07-24  3:46 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-19 10:28 [PATCH v7 0/3] btrfs: try to allocate larger folios for metadata Qu Wenruo
2024-07-19 10:28 ` [PATCH v7 1/3] memcontrol: define root_mem_cgroup for CONFIG_MEMCG=n cases Qu Wenruo
2024-07-19 11:13   ` Michal Hocko
2024-07-19 21:58     ` Qu Wenruo
2024-07-22  7:32       ` Michal Hocko
2024-07-19 10:28 ` [PATCH v7 2/3] btrfs: always uses root memcgroup for filemap_add_folio() Qu Wenruo
2024-07-19 17:02   ` Johannes Weiner
2024-07-19 17:25     ` Roman Gushchin
2024-07-19 18:13     ` Michal Hocko
2024-07-19 22:11       ` Qu Wenruo
2024-07-22  7:34         ` Michal Hocko
2024-07-22  8:08           ` Qu Wenruo
2024-07-24  3:46       ` Qu Wenruo
2024-07-19 10:28 ` [PATCH v7 3/3] btrfs: prefer to allocate larger folio for metadata Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox