* [PATCH 0/1] memcg/hugetlb: Adding hugeTLB counters to memory controller
@ 2024-10-17 16:04 Joshua Hahn
2024-10-17 16:04 ` [PATCH 1/1] " Joshua Hahn
2024-10-18 10:12 ` [PATCH 0/1] " Michal Hocko
0 siblings, 2 replies; 14+ messages in thread
From: Joshua Hahn @ 2024-10-17 16:04 UTC (permalink / raw)
To: hannes
Cc: nphamcs, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
cgroups, linux-mm, linux-kernel, lnyng
HugeTLB usage is a metric that can provide utility for monitors hoping
to get more insight into the memory usage patterns in cgroups. It also
helps identify if large folios are being distributed efficiently across
workloads, so that tasks that can take most advantage of reduced TLB
misses are prioritized.
While cgroupv2's hugeTLB controller does report this value, some users
who wish to track hugeTLB usage might not want to take on the additional
overhead or the features of the controller just to use the metric.
This patch introduces hugeTLB usage in the memcg stats, mirroring the
value in the hugeTLB controller and offering a more fine-grained
cgroup-level breakdown of the value in /proc/meminfo.
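To illustrate, the counter would show up as one more line in each cgroup's memory.stat; the surrounding fields and all values below are only a hypothetical example:

```
$ cat /sys/fs/cgroup/<group>/memory.stat
anon 10412032
file 9224192
kernel 1363968
shmem 0
hugeTLB 1073741824
zswap 0
...
```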
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Joshua Hahn (1):
Adding hugeTLB counters to memory controller
include/linux/memcontrol.h | 3 +++
mm/hugetlb.c | 5 +++++
mm/memcontrol.c | 6 ++++++
3 files changed, 14 insertions(+)
--
2.43.5
* [PATCH 1/1] memcg/hugetlb: Adding hugeTLB counters to memory controller
2024-10-17 16:04 [PATCH 0/1] memcg/hugetlb: Adding hugeTLB counters to memory controller Joshua Hahn
@ 2024-10-17 16:04 ` Joshua Hahn
2024-10-17 17:22 ` Johannes Weiner
2024-10-18 21:34 ` Shakeel Butt
2024-10-18 10:12 ` [PATCH 0/1] " Michal Hocko
1 sibling, 2 replies; 14+ messages in thread
From: Joshua Hahn @ 2024-10-17 16:04 UTC (permalink / raw)
To: hannes
Cc: nphamcs, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
cgroups, linux-mm, linux-kernel, lnyng
HugeTLB is added as a metric in memcg_stat_item, and is updated in the
alloc and free methods for hugeTLB, after (un)charging has already been
committed. Changes are batched and updated / flushed like the rest of
the memcg stats, which keeps the additional overhead from the
infrequent hugetlb allocs / frees minimal.
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
include/linux/memcontrol.h | 3 +++
mm/hugetlb.c | 5 +++++
mm/memcontrol.c | 6 ++++++
3 files changed, 14 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 34d2da05f2f1..66e925ae499a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -39,6 +39,9 @@ enum memcg_stat_item {
MEMCG_KMEM,
MEMCG_ZSWAP_B,
MEMCG_ZSWAPPED,
+#ifdef CONFIG_HUGETLB_PAGE
+ MEMCG_HUGETLB,
+#endif
MEMCG_NR_STAT,
};
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 190fa05635f4..ca7151096712 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1887,6 +1887,7 @@ void free_huge_folio(struct folio *folio)
struct hstate *h = folio_hstate(folio);
int nid = folio_nid(folio);
struct hugepage_subpool *spool = hugetlb_folio_subpool(folio);
+ struct mem_cgroup *memcg = get_mem_cgroup_from_current();
bool restore_reserve;
unsigned long flags;
@@ -1926,6 +1927,8 @@ void free_huge_folio(struct folio *folio)
hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
pages_per_huge_page(h), folio);
mem_cgroup_uncharge(folio);
+ mod_memcg_state(memcg, MEMCG_HUGETLB, -pages_per_huge_page(h));
+ mem_cgroup_put(memcg);
if (restore_reserve)
h->resv_huge_pages++;
@@ -3093,6 +3096,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
if (!memcg_charge_ret)
mem_cgroup_commit_charge(folio, memcg);
+
+ mod_memcg_state(memcg, MEMCG_HUGETLB, nr_pages);
mem_cgroup_put(memcg);
return folio;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7845c64a2c57..4180ee876adb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -320,6 +320,9 @@ static const unsigned int memcg_stat_items[] = {
MEMCG_KMEM,
MEMCG_ZSWAP_B,
MEMCG_ZSWAPPED,
+#ifdef CONFIG_HUGETLB_PAGE
+ MEMCG_HUGETLB,
+#endif
};
#define NR_MEMCG_NODE_STAT_ITEMS ARRAY_SIZE(memcg_node_stat_items)
@@ -1324,6 +1327,9 @@ static const struct memory_stat memory_stats[] = {
{ "sock", MEMCG_SOCK },
{ "vmalloc", MEMCG_VMALLOC },
{ "shmem", NR_SHMEM },
+#ifdef CONFIG_HUGETLB_PAGE
+ { "hugeTLB", MEMCG_HUGETLB },
+#endif
#ifdef CONFIG_ZSWAP
{ "zswap", MEMCG_ZSWAP_B },
{ "zswapped", MEMCG_ZSWAPPED },
--
2.43.5
* Re: [PATCH 1/1] memcg/hugetlb: Adding hugeTLB counters to memory controller
2024-10-17 16:04 ` [PATCH 1/1] " Joshua Hahn
@ 2024-10-17 17:22 ` Johannes Weiner
2024-10-18 21:34 ` Shakeel Butt
1 sibling, 0 replies; 14+ messages in thread
From: Johannes Weiner @ 2024-10-17 17:22 UTC (permalink / raw)
To: Joshua Hahn
Cc: nphamcs, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
cgroups, linux-mm, linux-kernel, lnyng
On Thu, Oct 17, 2024 at 09:04:38AM -0700, Joshua Hahn wrote:
> HugeTLB is added as a metric in memcg_stat_item, and is updated in the
> alloc and free methods for hugeTLB, after (un)charging has already been
> committed. Changes are batched and updated / flushed like the rest of
> the memcg stats, which makes additional overhead by the infrequent
> hugetlb allocs / frees minimal.
>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> ---
> include/linux/memcontrol.h | 3 +++
> mm/hugetlb.c | 5 +++++
> mm/memcontrol.c | 6 ++++++
> 3 files changed, 14 insertions(+)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 34d2da05f2f1..66e925ae499a 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -39,6 +39,9 @@ enum memcg_stat_item {
> MEMCG_KMEM,
> MEMCG_ZSWAP_B,
> MEMCG_ZSWAPPED,
> +#ifdef CONFIG_HUGETLB_PAGE
> + MEMCG_HUGETLB,
> +#endif
> MEMCG_NR_STAT,
> };
It would be better to add a native vmstat counter for this, as there
is no strong reason to make this memcg specific. This would also make
it NUMA-node-aware.
IOW, add a new item to enum node_stat_item (plus the string in
vmstat.c and memcontrol.c).
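As a rough sketch of that suggestion (the neighbouring entries and exact
spellings below are placeholders, not the real kernel definitions), the
counter would be a new node_stat_item with a matching name string, so it
shows up per NUMA node and in /proc/vmstat:

```c
#include <stddef.h>
#include <string.h>

/* Toy mirror of the enum/string pairing in the kernel's vmstat code;
 * the surrounding items are placeholders, only NR_HUGETLB stands for
 * the proposed addition. */
enum node_stat_item {
	NR_SHMEM,		/* placeholder neighbour */
	NR_FILE_PAGES,		/* placeholder neighbour */
	NR_HUGETLB,		/* proposed per-node hugetlb counter */
	NR_VM_NODE_STAT_ITEMS,
};

static const char *const vmstat_text[NR_VM_NODE_STAT_ITEMS] = {
	[NR_SHMEM]	= "nr_shmem",
	[NR_FILE_PAGES]	= "nr_file_pages",
	[NR_HUGETLB]	= "nr_hugetlb",	/* printed in /proc/vmstat */
};

/* Every enum item must have a name string, or the vmstat output would
 * misalign; this is the invariant the suggestion relies on. */
int vmstat_names_complete(void)
{
	for (int i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
		if (!vmstat_text[i] || !vmstat_text[i][0])
			return 0;
	return 1;
}
```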
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 190fa05635f4..ca7151096712 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1887,6 +1887,7 @@ void free_huge_folio(struct folio *folio)
> struct hstate *h = folio_hstate(folio);
> int nid = folio_nid(folio);
> struct hugepage_subpool *spool = hugetlb_folio_subpool(folio);
> + struct mem_cgroup *memcg = get_mem_cgroup_from_current();
> bool restore_reserve;
> unsigned long flags;
>
> @@ -1926,6 +1927,8 @@ void free_huge_folio(struct folio *folio)
> hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> pages_per_huge_page(h), folio);
> mem_cgroup_uncharge(folio);
> + mod_memcg_state(memcg, MEMCG_HUGETLB, -pages_per_huge_page(h));
> + mem_cgroup_put(memcg);
> if (restore_reserve)
> h->resv_huge_pages++;
This goes wrong if the folio is freed by somebody other than the
owning cgroup. For example if the task moved between cgroups after the
memory was charged.
It's better to use the folio->memcg linkage that was established by
the allocation path.
Use lruvec_stat_mod_folio(), it will handle all of this.
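A toy model of the difference (all structs and helpers here are
simplified stand-ins, not the real kernel API): if the decrement is
keyed off the folio's own memcg linkage, the way lruvec_stat_mod_folio()
resolves it, the stat stays balanced even when the freeing task has
since moved to another cgroup:

```c
#include <assert.h>

struct mem_cgroup { long hugetlb_stat; };	/* stand-in for the real struct */
struct folio { struct mem_cgroup *memcg; long nr_pages; };

static struct mem_cgroup job_cg, system_cg;
static struct mem_cgroup *current_memcg = &job_cg;	/* cgroup of "current" */

/* stand-in for lruvec_stat_mod_folio(): uses folio->memcg, not current */
static void stat_mod_folio(struct folio *folio, long nr)
{
	folio->memcg->hugetlb_stat += nr;
}

static void alloc_hugetlb(struct folio *folio, long nr_pages)
{
	folio->memcg = current_memcg;	/* commit_charge establishes the linkage */
	folio->nr_pages = nr_pages;
	stat_mod_folio(folio, nr_pages);
}

static void free_hugetlb(struct folio *folio)
{
	/* decrement the owning cgroup via folio->memcg, even if the task
	 * freeing the folio now runs in a different cgroup */
	stat_mod_folio(folio, -folio->nr_pages);
}
```

Had the free path used the current task's cgroup instead, the decrement
would land in the wrong counter after a migration, leaving the owning
cgroup's stat permanently inflated.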
> @@ -3093,6 +3096,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>
> if (!memcg_charge_ret)
> mem_cgroup_commit_charge(folio, memcg);
> +
> + mod_memcg_state(memcg, MEMCG_HUGETLB, nr_pages);
And here as well.
* Re: [PATCH 0/1] memcg/hugetlb: Adding hugeTLB counters to memory controller
2024-10-17 16:04 [PATCH 0/1] memcg/hugetlb: Adding hugeTLB counters to memory controller Joshua Hahn
2024-10-17 16:04 ` [PATCH 1/1] " Joshua Hahn
@ 2024-10-18 10:12 ` Michal Hocko
2024-10-18 12:31 ` Johannes Weiner
1 sibling, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2024-10-18 10:12 UTC (permalink / raw)
To: Joshua Hahn
Cc: hannes, nphamcs, roman.gushchin, shakeel.butt, muchun.song, akpm,
cgroups, linux-mm, linux-kernel, lnyng
On Thu 17-10-24 09:04:37, Joshua Hahn wrote:
> HugeTLB usage is a metric that can provide utility for monitors hoping
> to get more insight into the memory usage patterns in cgroups. It also
> helps identify if large folios are being distributed efficiently across
> workloads, so that tasks that can take most advantage of reduced TLB
> misses are prioritized.
>
> While cgroupv2's hugeTLB controller does report this value, some users
> who wish to track hugeTLB usage might not want to take on the additional
> overhead or the features of the controller just to use the metric.
> This patch introduces hugeTLB usage in the memcg stats, mirroring the
> value in the hugeTLB controller and offering a more fine-grained
> cgroup-level breakdown of the value in /proc/meminfo.
This seems really confusing because the memcg controller is not
responsible for the hugetlb memory. Could you be more specific about
why enabling the hugetlb controller is not desirable when actual
per-group tracking is needed?
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
>
> Joshua Hahn (1):
> Adding hugeTLB counters to memory controller
>
> include/linux/memcontrol.h | 3 +++
> mm/hugetlb.c | 5 +++++
> mm/memcontrol.c | 6 ++++++
> 3 files changed, 14 insertions(+)
>
> --
> 2.43.5
--
Michal Hocko
SUSE Labs
* Re: [PATCH 0/1] memcg/hugetlb: Adding hugeTLB counters to memory controller
2024-10-18 10:12 ` [PATCH 0/1] " Michal Hocko
@ 2024-10-18 12:31 ` Johannes Weiner
2024-10-18 13:42 ` Michal Hocko
0 siblings, 1 reply; 14+ messages in thread
From: Johannes Weiner @ 2024-10-18 12:31 UTC (permalink / raw)
To: Michal Hocko
Cc: Joshua Hahn, nphamcs, roman.gushchin, shakeel.butt, muchun.song,
akpm, cgroups, linux-mm, linux-kernel, lnyng
On Fri, Oct 18, 2024 at 12:12:00PM +0200, Michal Hocko wrote:
> On Thu 17-10-24 09:04:37, Joshua Hahn wrote:
> > HugeTLB usage is a metric that can provide utility for monitors hoping
> > to get more insight into the memory usage patterns in cgroups. It also
> > helps identify if large folios are being distributed efficiently across
> > workloads, so that tasks that can take most advantage of reduced TLB
> > misses are prioritized.
> >
> > While cgroupv2's hugeTLB controller does report this value, some users
> > who wish to track hugeTLB usage might not want to take on the additional
> > overhead or the features of the controller just to use the metric.
> > This patch introduces hugeTLB usage in the memcg stats, mirroring the
> > value in the hugeTLB controller and offering a more fine-grained
> > cgroup-level breakdown of the value in /proc/meminfo.
>
> This seems really confusing because memcg controller is not responsible
> for the hugetlb memory. Could you be more specific why enabling hugetlb
> controller is not really desirable when the actual per-group tracking is
> needed?
We have competition over memory, but not specifically over hugetlb.
The maximum hugetlb footprint of jobs is known in advance, and we
configure hugetlb_cma= accordingly. There are no static boot time
hugetlb reservations, and there is no opportunistic use of hugetlb
from jobs or other parts of the system. So we don't need control over
the hugetlb pool, and no limit enforcement on hugetlb specifically.
However, memory overall is overcommitted between job and system
management. If the main job is using hugetlb, we need that to show up
in memory.current and be taken into account for memory.high and
memory.low enforcement. It's the old memory fungibility argument: if
you use hugetlb, it should reduce the budget for cache/anon.
Nhat's recent patch to charge hugetlb to memcg accomplishes that.
However, we now have potentially a sizable portion of memory in
memory.current that doesn't show up in memory.stat. Joshua's patch
addresses that, so userspace can understand its memory footprint.
I hope that makes sense.
* Re: [PATCH 0/1] memcg/hugetlb: Adding hugeTLB counters to memory controller
2024-10-18 12:31 ` Johannes Weiner
@ 2024-10-18 13:42 ` Michal Hocko
2024-10-18 18:11 ` Shakeel Butt
2024-10-18 18:57 ` Johannes Weiner
0 siblings, 2 replies; 14+ messages in thread
From: Michal Hocko @ 2024-10-18 13:42 UTC (permalink / raw)
To: Johannes Weiner
Cc: Joshua Hahn, nphamcs, roman.gushchin, shakeel.butt, muchun.song,
akpm, cgroups, linux-mm, linux-kernel, lnyng
On Fri 18-10-24 08:31:22, Johannes Weiner wrote:
> On Fri, Oct 18, 2024 at 12:12:00PM +0200, Michal Hocko wrote:
> > On Thu 17-10-24 09:04:37, Joshua Hahn wrote:
> > > HugeTLB usage is a metric that can provide utility for monitors hoping
> > > to get more insight into the memory usage patterns in cgroups. It also
> > > helps identify if large folios are being distributed efficiently across
> > > workloads, so that tasks that can take most advantage of reduced TLB
> > > misses are prioritized.
> > >
> > > While cgroupv2's hugeTLB controller does report this value, some users
> > > who wish to track hugeTLB usage might not want to take on the additional
> > > overhead or the features of the controller just to use the metric.
> > > This patch introduces hugeTLB usage in the memcg stats, mirroring the
> > > value in the hugeTLB controller and offering a more fine-grained
> > > cgroup-level breakdown of the value in /proc/meminfo.
> >
> > This seems really confusing because memcg controller is not responsible
> > for the hugetlb memory. Could you be more specific why enabling hugetlb
> > controller is not really desirable when the actual per-group tracking is
> > needed?
>
> We have competition over memory, but not specifically over hugetlb.
>
> The maximum hugetlb footprint of jobs is known in advance, and we
> configure hugetlb_cma= accordingly. There are no static boot time
> hugetlb reservations, and there is no opportunistic use of hugetlb
> from jobs or other parts of the system. So we don't need control over
> the hugetlb pool, and no limit enforcement on hugetlb specifically.
>
> However, memory overall is overcommitted between job and system
> management. If the main job is using hugetlb, we need that to show up
> in memory.current and be taken into account for memory.high and
> memory.low enforcement. It's the old memory fungibility argument: if
> you use hugetlb, it should reduce the budget for cache/anon.
>
> Nhat's recent patch to charge hugetlb to memcg accomplishes that.
>
> However, we now have potentially a sizable portion of memory in
> memory.current that doesn't show up in memory.stat. Joshua's patch
> addresses that, so userspace can understand its memory footprint.
>
> I hope that makes sense.
Looking at 8cba9576df60 ("hugetlb: memcg: account hugetlb-backed memory
in memory controller"), it describes this limitation:
* Hugetlb pages utilized while this option is not selected will not
be tracked by the memory controller (even if cgroup v2 is remounted
later on).
and it would be great to have an explanation of why the lack of tracking
has proven problematic. Also, the above doesn't really explain why those
who care cannot simply enable the hugetlb controller to gain the
consumption information.
Also, what happens if CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING is disabled?
Should we report potentially misleading data?
--
Michal Hocko
SUSE Labs
* Re: [PATCH 0/1] memcg/hugetlb: Adding hugeTLB counters to memory controller
2024-10-18 13:42 ` Michal Hocko
@ 2024-10-18 18:11 ` Shakeel Butt
2024-10-18 18:38 ` Joshua Hahn
2024-10-18 18:57 ` Johannes Weiner
1 sibling, 1 reply; 14+ messages in thread
From: Shakeel Butt @ 2024-10-18 18:11 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Joshua Hahn, nphamcs, roman.gushchin,
muchun.song, akpm, cgroups, linux-mm, linux-kernel, lnyng
On Fri, Oct 18, 2024 at 03:42:13PM GMT, Michal Hocko wrote:
> On Fri 18-10-24 08:31:22, Johannes Weiner wrote:
> > On Fri, Oct 18, 2024 at 12:12:00PM +0200, Michal Hocko wrote:
> > > On Thu 17-10-24 09:04:37, Joshua Hahn wrote:
> > > > HugeTLB usage is a metric that can provide utility for monitors hoping
> > > > to get more insight into the memory usage patterns in cgroups. It also
> > > > helps identify if large folios are being distributed efficiently across
> > > > workloads, so that tasks that can take most advantage of reduced TLB
> > > > misses are prioritized.
> > > >
> > > > While cgroupv2's hugeTLB controller does report this value, some users
> > > > who wish to track hugeTLB usage might not want to take on the additional
> > > > overhead or the features of the controller just to use the metric.
> > > > This patch introduces hugeTLB usage in the memcg stats, mirroring the
> > > > value in the hugeTLB controller and offering a more fine-grained
> > > > cgroup-level breakdown of the value in /proc/meminfo.
> > >
> > > This seems really confusing because memcg controller is not responsible
> > > for the hugetlb memory. Could you be more specific why enabling hugetlb
> > > controller is not really desirable when the actual per-group tracking is
> > > needed?
> >
> > We have competition over memory, but not specifically over hugetlb.
> >
> > The maximum hugetlb footprint of jobs is known in advance, and we
> > configure hugetlb_cma= accordingly. There are no static boot time
> > hugetlb reservations, and there is no opportunistic use of hugetlb
> > from jobs or other parts of the system. So we don't need control over
> > the hugetlb pool, and no limit enforcement on hugetlb specifically.
> >
> > However, memory overall is overcommitted between job and system
> > management. If the main job is using hugetlb, we need that to show up
> > in memory.current and be taken into account for memory.high and
> > memory.low enforcement. It's the old memory fungibility argument: if
> > you use hugetlb, it should reduce the budget for cache/anon.
> >
> > Nhat's recent patch to charge hugetlb to memcg accomplishes that.
> >
> > However, we now have potentially a sizable portion of memory in
> > memory.current that doesn't show up in memory.stat. Joshua's patch
> > addresses that, so userspace can understand its memory footprint.
> >
> > I hope that makes sense.
>
> Looking at 8cba9576df60 ("hugetlb: memcg: account hugetlb-backed memory
> in memory controller") describes this limitation
>
> * Hugetlb pages utilized while this option is not selected will not
> be tracked by the memory controller (even if cgroup v2 is remounted
> later on).
>
> and it would be great to have an explanation why the lack of tracking
> has proven problematic. Also the above doesn't really explain why those
> who care cannot really enabled hugetlb controller to gain the
> consumption information.
Let me give my take on this. The reason is the ease and convenience of
seeing what is happening when I see an unexpectedly large memory.current
value. Logically I would look at memory.stat to make sense of it.
Without this I have to remember that the user might have the hugetlb
memcg accounting option enabled, and then use the hugetlb cgroup to find
the answer. If they have not enabled the hugetlb cgroup then I am in the
dark.
>
> Also what happens if CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING is disabled.
> Should we report potentially misleading data?
I think with what Johannes has suggested (to use lruvec_stat_mod_folio),
the metric will only get updated when hugetlb memcg accounting is
enabled, and will stay zero otherwise.
* Re: [PATCH 0/1] memcg/hugetlb: Adding hugeTLB counters to memory controller
2024-10-18 18:11 ` Shakeel Butt
@ 2024-10-18 18:38 ` Joshua Hahn
2024-10-21 7:15 ` Michal Hocko
0 siblings, 1 reply; 14+ messages in thread
From: Joshua Hahn @ 2024-10-18 18:38 UTC (permalink / raw)
To: Shakeel Butt
Cc: Michal Hocko, Johannes Weiner, nphamcs, roman.gushchin,
muchun.song, akpm, cgroups, linux-mm, linux-kernel, lnyng
On Fri, Oct 18, 2024 at 2:11 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> On Fri, Oct 18, 2024 at 03:42:13PM GMT, Michal Hocko wrote:
> > On Fri 18-10-24 08:31:22, Johannes Weiner wrote:
> > > On Fri, Oct 18, 2024 at 12:12:00PM +0200, Michal Hocko wrote:
> > > > On Thu 17-10-24 09:04:37, Joshua Hahn wrote:
> > > > > HugeTLB usage is a metric that can provide utility for monitors hoping
> > > > > to get more insight into the memory usage patterns in cgroups. It also
> > > > > helps identify if large folios are being distributed efficiently across
> > > > > workloads, so that tasks that can take most advantage of reduced TLB
> > > > > misses are prioritized.
> > > >
> > > > This seems really confusing because memcg controller is not responsible
> > > > for the hugetlb memory. Could you be more specific why enabling hugetlb
> > > > controller is not really desirable when the actual per-group tracking is
> > > > needed?
> > >
> > > However, we now have potentially a sizable portion of memory in
> > > memory.current that doesn't show up in memory.stat. Joshua's patch
> > > addresses that, so userspace can understand its memory footprint.
> > >
> > > I hope that makes sense.
> >
> > and it would be great to have an explanation why the lack of tracking
> > has proven problematic. Also the above doesn't really explain why those
> > who care cannot really enabled hugetlb controller to gain the
> > consumption information.
>
> Let me give my take on this. The reason is the ease and convenience to
> see what is happening when I see unexpectedly large memory.current
> value. Logically I would look at memory.stat to make sense of it.
> Without this I have to remember that the user might have hugetlb memcg
> accounting option enabled and they are use hugetlb cgroup to find the
> answer. If they have not enabled hugetlb cgroup then I am in dark.
>
> >
> > Also what happens if CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING is disabled.
> > Should we report potentially misleading data?
>
> I think with what Johannes has suggested (to use lruvec_stat_mod_folio),
> the metric will only get updated when hugetlb memcg accounting is
> enabled and zero otherwise.
Hi Michal, Johannes, and Shakeel,
Thank you all for taking the time to review my patch.
I was writing my response when Shakeel responded, and I think it includes
an important point. I am sending out this message in the hopes that I can
gather insight on what direction would make most sense for everyone.
Michal -- You brought up several points in your response, so I'll do my
best to answer them below.
1. Why is the lack of tracking hugeTLB problematic?
The biggest problem that I see is that if CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING
is enabled, then there is a discrepancy between what is reported in
memory.stat and memory.current, as Johannes explained in his response above.
As Shakeel also pointed out, it is just convenient to have the value
explicitly there, so users don't have to go through and remember where
hugeTLB pages might be used and where they might not be used.
Aside from consistency between the two files, we can see benefits in
observability. There are many reasons userspace might be interested in
understanding the hugeTLB footprint of cgroups. To name one, system
administrators might want to verify that hugeTLB usage is distributed as
expected across tasks: i.e. that memory-intensive tasks, or tasks that
are seen to fault frequently, are using more hugeTLB pages than tasks
that don't consume a lot of memory. Note that this is separate from
wanting to inspect the distribution for limiting purposes (in that case,
it makes sense to use the controller).
2. Why can't you enable the hugeTLB controller, if tracking is so important?
By turning on the hugeTLB controller, we gain all of the observability
that I mentioned above; users can just check the respective hugetlb files.
However, the discrepancy between memory.stat and memory.current is still
there. When I check memory.current, I expect to be able to explain the usage
by looking at memory.stat and trying to understand the breakdown, not by going
into the various hugetlb controller files to check how/if the memory is
accounted for.
But even if we are okay with this, I think it might be overkill to
enable the hugeTLB controller for the convenience of being able to inspect
the hugeTLB usage for cgroups. This is especially true in workloads where
we can predict what usage patterns will be like, and we do not need to enforce
specific limits on hugeTLB usage.
3. What if CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING is disabled?
This is a great point. The way the patch is currently implemented, it
would still do the accounting in memory.stat, even if
CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING is disabled. This would give us the
reverse problem, where hugeTLB usage that is reported in the statistics
is no longer accounted for in memory.current.
I think it makes sense to show hugeTLB statistics in memory.stat only if
hugeTLB is accounted for in memory.current as well (i.e. check if
CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING is enabled before doing the accounting,
or move the accounting from hugeTLB alloc/free --> hugeTLB charge/uncharge,
which should only happen if hugeTLBs are accounted for in memory.current).
What do you think?
Joshua
* Re: [PATCH 0/1] memcg/hugetlb: Adding hugeTLB counters to memory controller
2024-10-18 13:42 ` Michal Hocko
2024-10-18 18:11 ` Shakeel Butt
@ 2024-10-18 18:57 ` Johannes Weiner
1 sibling, 0 replies; 14+ messages in thread
From: Johannes Weiner @ 2024-10-18 18:57 UTC (permalink / raw)
To: Michal Hocko
Cc: Joshua Hahn, nphamcs, roman.gushchin, shakeel.butt, muchun.song,
akpm, cgroups, linux-mm, linux-kernel, lnyng
On Fri, Oct 18, 2024 at 03:42:13PM +0200, Michal Hocko wrote:
> On Fri 18-10-24 08:31:22, Johannes Weiner wrote:
> > On Fri, Oct 18, 2024 at 12:12:00PM +0200, Michal Hocko wrote:
> > > On Thu 17-10-24 09:04:37, Joshua Hahn wrote:
> > > > HugeTLB usage is a metric that can provide utility for monitors hoping
> > > > to get more insight into the memory usage patterns in cgroups. It also
> > > > helps identify if large folios are being distributed efficiently across
> > > > workloads, so that tasks that can take most advantage of reduced TLB
> > > > misses are prioritized.
> > > >
> > > > While cgroupv2's hugeTLB controller does report this value, some users
> > > > who wish to track hugeTLB usage might not want to take on the additional
> > > > overhead or the features of the controller just to use the metric.
> > > > This patch introduces hugeTLB usage in the memcg stats, mirroring the
> > > > value in the hugeTLB controller and offering a more fine-grained
> > > > cgroup-level breakdown of the value in /proc/meminfo.
> > >
> > > This seems really confusing because memcg controller is not responsible
> > > for the hugetlb memory. Could you be more specific why enabling hugetlb
> > > controller is not really desirable when the actual per-group tracking is
> > > needed?
> >
> > We have competition over memory, but not specifically over hugetlb.
> >
> > The maximum hugetlb footprint of jobs is known in advance, and we
> > configure hugetlb_cma= accordingly. There are no static boot time
> > hugetlb reservations, and there is no opportunistic use of hugetlb
> > from jobs or other parts of the system. So we don't need control over
> > the hugetlb pool, and no limit enforcement on hugetlb specifically.
> >
> > However, memory overall is overcommitted between job and system
> > management. If the main job is using hugetlb, we need that to show up
> > in memory.current and be taken into account for memory.high and
> > memory.low enforcement. It's the old memory fungibility argument: if
> > you use hugetlb, it should reduce the budget for cache/anon.
> >
> > Nhat's recent patch to charge hugetlb to memcg accomplishes that.
> >
> > However, we now have potentially a sizable portion of memory in
> > memory.current that doesn't show up in memory.stat. Joshua's patch
> > addresses that, so userspace can understand its memory footprint.
> >
> > I hope that makes sense.
>
> Looking at 8cba9576df60 ("hugetlb: memcg: account hugetlb-backed memory
> in memory controller") describes this limitation
>
> * Hugetlb pages utilized while this option is not selected will not
> be tracked by the memory controller (even if cgroup v2 is remounted
> later on).
>
> and it would be great to have an explanation why the lack of tracking
> has proven problematic.
Yes, I agree it would be good to outline this in the changelog.
The argument being that memory.stat breaks down the consumers that are
charged to memory.current. hugetlb is (can be) charged, but is not
broken out. This is a significant gap in the memcg stats picture.
> Also the above doesn't really explain why those who care cannot
> really enabled hugetlb controller to gain the consumption
> information.
Well, I have explained why we don't need it at least. Enabling almost
a thousand lines of basically abandoned code, compared to the few
lines in this patch, doesn't strike me as reasonable.
That said, I don't think the hugetlb controller is relevant. With
hugetlb being part of memory.current (for arguments that are already
settled), it needs to be itemized in memory.stat. It's a gap in the
memory controller in any case.
> Also what happens if CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING is disabled.
> Should we report potentially misleading data?
Good point. The stat item tracking should follow the same rules as
charging, such that memory.current and memory.stat are always in sync.
A stat helper that mirrors the mem_cgroup_hugetlb_try_charge() checks
would make sense to me. E.g. lruvec_stat_mod_hugetlb_folio().
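A minimal sketch of what such a helper could look like (the function
and flag names below follow the thread's suggestion but are
hypothetical; the real gating lives in mem_cgroup_hugetlb_try_charge()
and the mount-option handling):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the real kernel state: charging only happens when the
 * memory_hugetlb_accounting mount option is set. */
static bool hugetlb_accounting_enabled;
struct folio { long hugetlb_stat; };	/* toy per-memcg stat bucket */

/* Hypothetical helper name from the thread: update the memcg hugetlb
 * stat only under the same conditions as charging, so memory.stat
 * never reports usage that memory.current doesn't contain. */
static void lruvec_stat_mod_hugetlb_folio(struct folio *folio, long nr)
{
	if (!hugetlb_accounting_enabled)
		return;			/* mirrors the charge-path bail-out */
	folio->hugetlb_stat += nr;	/* stand-in for the real stat update */
}
```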
* Re: [PATCH 1/1] memcg/hugetlb: Adding hugeTLB counters to memory controller
2024-10-17 16:04 ` [PATCH 1/1] " Joshua Hahn
2024-10-17 17:22 ` Johannes Weiner
@ 2024-10-18 21:34 ` Shakeel Butt
2024-10-19 22:45 ` Joshua Hahn
1 sibling, 1 reply; 14+ messages in thread
From: Shakeel Butt @ 2024-10-18 21:34 UTC (permalink / raw)
To: Joshua Hahn
Cc: hannes, nphamcs, mhocko, roman.gushchin, muchun.song, akpm,
cgroups, linux-mm, linux-kernel, lnyng
On Thu, Oct 17, 2024 at 09:04:38AM GMT, Joshua Hahn wrote:
> HugeTLB is added as a metric in memcg_stat_item, and is updated in the
> alloc and free methods for hugeTLB, after (un)charging has already been
> committed. Changes are batched and updated / flushed like the rest of
> the memcg stats, which makes additional overhead by the infrequent
> hugetlb allocs / frees minimal.
>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
I have an orthogonal cleanup request (i.e. for after you are done with
this work). Hugetlb is the last user of the try-charge + commit protocol
for memcg charging. I think we should just remove that and use a simple
charge interface. You will need to reorder a couple of things, like
allocating the folio first and then charging, and you will need to do
the right cleanup when the charge fails, but I think it will clean up
the error path of alloc_hugetlb_folio() a lot.
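The reordered flow might look roughly like this (every function here is
a stub standing in for the real mm/hugetlb.c and memcg interfaces,
purely to show the simplified error path):

```c
#include <assert.h>
#include <stdlib.h>

struct folio { int charged; };

static int charge_should_fail;	/* test knob standing in for memcg pressure */

static struct folio *hugetlb_alloc_stub(void)
{
	return calloc(1, sizeof(struct folio));
}

static int mem_cgroup_charge_stub(struct folio *folio)
{
	if (charge_should_fail)
		return -1;	/* -ENOMEM in the real interface */
	folio->charged = 1;
	return 0;
}

/* Allocate first, then charge: a failed charge has exactly one thing
 * to undo, instead of the try-charge/commit/cancel dance spread across
 * several error labels. */
static struct folio *alloc_hugetlb_folio_sketch(void)
{
	struct folio *folio = hugetlb_alloc_stub();
	if (!folio)
		return NULL;
	if (mem_cgroup_charge_stub(folio)) {
		free(folio);	/* single, obvious error path */
		return NULL;
	}
	return folio;
}
```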
* Re: [PATCH 1/1] memcg/hugetlb: Adding hugeTLB counters to memory controller
2024-10-18 21:34 ` Shakeel Butt
@ 2024-10-19 22:45 ` Joshua Hahn
0 siblings, 0 replies; 14+ messages in thread
From: Joshua Hahn @ 2024-10-19 22:45 UTC (permalink / raw)
To: Shakeel Butt
Cc: hannes, nphamcs, mhocko, roman.gushchin, muchun.song, akpm,
cgroups, linux-mm, linux-kernel, lnyng
On Fri, Oct 18, 2024 at 5:34 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Oct 17, 2024 at 09:04:38AM GMT, Joshua Hahn wrote:
> > HugeTLB is added as a metric in memcg_stat_item, and is updated in the
> > alloc and free methods for hugeTLB, after (un)charging has already been
> > committed. Changes are batched and updated / flushed like the rest of
> > the memcg stats, which makes additional overhead by the infrequent
> > hugetlb allocs / frees minimal.
> >
> > Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
>
> I have an orthogonal cleanup request (i.e. for after you are done with this
> work). Hugetlb is the last user of the try-charge + commit protocol for
> memcg charging. I think we should just remove that and use a simple
> charge interface. You will need to reorder a couple of things, like
> allocating the folio first and then charging it, and you will need to do
> the right cleanup when the charge fails, but I think it will clean up the
> error path of alloc_hugetlb_folio() a lot.
That sounds good to me. I was originally planning to include the
hugeTLB accounting in the try-charge mechanism (so that it is only
included in memory.stat if it is also accounted for in memory.current).
I will think of another way to do this accounting so that the cleanup
becomes easier once this patch is finished. One way I can think of is
simply to check for the hugeTLB accounting config before adding the
stats and accounting for them.
Thank you for your feedback!
Joshua
* Re: [PATCH 0/1] memcg/hugetlb: Adding hugeTLB counters to memory controller
2024-10-18 18:38 ` Joshua Hahn
@ 2024-10-21 7:15 ` Michal Hocko
2024-10-21 14:51 ` Joshua Hahn
0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2024-10-21 7:15 UTC (permalink / raw)
To: Joshua Hahn
Cc: Shakeel Butt, Johannes Weiner, nphamcs, roman.gushchin,
muchun.song, akpm, cgroups, linux-mm, linux-kernel, lnyng
On Fri 18-10-24 14:38:48, Joshua Hahn wrote:
> On Fri, Oct 18, 2024 at 2:11 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > On Fri, Oct 18, 2024 at 03:42:13PM GMT, Michal Hocko wrote:
[...]
> > > and it would be great to have an explanation of why the lack of tracking
> > > has proven problematic. Also, the above doesn't really explain why those
> > > who care cannot simply enable the hugetlb controller to gain the
> > > consumption information.
> >
> > Let me give my take on this. The reason is the ease and convenience of
> > seeing what is happening when I encounter an unexpectedly large
> > memory.current value. Logically, I would look at memory.stat to make
> > sense of it. Without this, I have to remember that the user might have
> > the hugetlb memcg accounting option enabled and should use the hugetlb
> > cgroup to find the answer. If they have not enabled the hugetlb cgroup,
> > then I am in the dark.
Yes, I thought that this was an acceptable limitation of the accounting.
If that is not the case, then it is really preferable to mention the
reasons in the changelog. The reasoning was that the hugetlb controller,
which would be a natural source of that information, is not really great
because of an overhead which hasn't really been evaluated; hence my
questioning.
[...]
> Aside from consistency between the two files, we can see benefits in
> observability. There are many reasons userspace might be interested in
> understanding the hugeTLB footprint of cgroups. To name a few, system
> administrators might want to verify that hugeTLB usage is distributed as
> expected across tasks: i.e. memory-intensive tasks are using more hugeTLB
> pages than tasks that don't consume a lot of memory or are seen to fault
> frequently. Note that this is separate from wanting to inspect the
> distribution for limiting purposes (in that case, it makes sense to use
> the controller).
Please add this information into the changelog.
> 2. Why can't you enable the hugeTLB controller, if tracking is so important?
>
> By turning on the hugeTLB controller, we gain all of the observability
> that I mentioned above; users can just check the respective hugetlb files.
> However, the discrepancy between memory.stat and memory.current is still
> there. When I check memory.current, I expect to be able to explain the usage
> by looking at memory.stat and trying to understand the breakdown, not by going
> into the various hugetlb controller files to check how/if the memory is
> accounted for.
Well, as mentioned in the previous response, this has been an acceptable
limitation of the hugetlb accounting. It is fair to reconsider this
based on existing experience, but that should be a part of the changelog.
> But even if we are okay with this, I think it might be overkill to
> enable the hugeTLB controller for the convenience of being able to inspect
> the hugeTLB usage for cgroups. This is especially true in workloads where
> we can predict what usage patterns will be like, and we do not need to enforce
> specific limits on hugeTLB usage.
I am sorry but I do not understand the overkill part of the argument.
Is there any runtime or configuration cost that is visible?
> 3. What if CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING is disabled?
>
> This is a great point. The way the patch is currently implemented, it
> should still do the accounting to memory.stat, even if
> CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING is disabled. This would give us the reverse
> problem, where hugeTLB usage that is reported in the statistics is no longer
> accounted for in current...
Exactly! And this is a problem.
> I think it makes sense to show hugeTLB statistics in memory.stat only if
> hugeTLB is accounted for in memory.current as well (i.e. check if
> CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING is enabled before doing the accounting,
> or move the accounting from hugeTLB alloc/free --> hugeTLB charge/uncharge,
> which should only happen if hugeTLBs are accounted for in memory.current).
>
> What do you think?
Yes, not only shown but also accounted only if the feature is enabled,
because we do not want to introduce any (even tiny) overhead for a
feature that is not enabled.
TL;DR
1) you need to make the stats accounting aligned with the existing
charge accounting.
2) make the stat visible only when the feature is enabled
3) work more on the justification - i.e. the changelog part - and give us
a better insight into why a) the hugetlb cgroup is seen as a bad choice
and b) the original limitation hasn't proven good since the feature was
introduced.
Makes sense?
--
Michal Hocko
SUSE Labs
* Re: [PATCH 0/1] memcg/hugetlb: Adding hugeTLB counters to memory controller
2024-10-21 7:15 ` Michal Hocko
@ 2024-10-21 14:51 ` Joshua Hahn
2024-10-21 15:44 ` Michal Hocko
0 siblings, 1 reply; 14+ messages in thread
From: Joshua Hahn @ 2024-10-21 14:51 UTC (permalink / raw)
To: Michal Hocko
Cc: Shakeel Butt, Johannes Weiner, nphamcs, roman.gushchin,
muchun.song, akpm, cgroups, linux-mm, linux-kernel, lnyng
On Mon, Oct 21, 2024 at 3:15 AM Michal Hocko <mhocko@suse.com> wrote:
> On Fri 18-10-24 14:38:48, Joshua Hahn wrote:
> > But even if we are okay with this, I think it might be overkill to
> > enable the hugeTLB controller for the convenience of being able to inspect
> > the hugeTLB usage for cgroups. This is especially true in workloads where
> > we can predict what usage patterns will be like, and we do not need to enforce
> > specific limits on hugeTLB usage.
>
> I am sorry but I do not understand the overkill part of the argument.
> Is there any runtime or configuration cost that is visible?
I think an argument could be made that any amount of incremental overhead
is unnecessary. With that said, however, I think a bigger reason why this
is overkill is that a user who wishes to use the hugeTLB counter (which
this patch achieves in ~10 lines) should not have to enable a ~1000-line
feature, as Johannes suggested.
A diligent user will have to spend time learning how the hugeTLB controller
works and figuring out the settings that will make the controller do no
enforcing (essentially the same as if the feature were not enabled).
A not-so-diligent user will not spend the time to make sure that the
configs make sense, and may run into unexpected & unwanted hugeTLB
behavior [1].
> TL;DR
> 1) you need to make the stats accounting aligned with the existing
> charge accounting.
> 2) make the stat visible only when the feature is enabled
> 3) work more on the justification - i.e. the changelog part - and give us
> a better insight into why a) the hugetlb cgroup is seen as a bad choice
> and b) the original limitation hasn't proven good since the feature was
> introduced.
>
> Makes sense?
> --
> Michal Hocko
> SUSE Labs
Hi Michal,
Thank you for your input. Yes -- this makes sense to me. I apologize, as
it seems that I left a lot to be assumed & inferred in my original patch
changelog.
In my next version of this patch, I am planning to add the changes that
Johannes suggested, write the code with the try_charge cleanup that
Shakeel suggested in mind, and change the behavior so that the accounting
happens only when hugeTLB accounting is enabled, as you suggested as well.
I'll also update the changelog & commit message and add any information
that will hopefully make future reviewers aware of the motivation for
this patch.
Please let me know if you have any remaining concerns with the implementation
or motivation, and I will be happy to incorporate your ideas into the next
version as well.
Joshua
[1] Of course, this argument isn't really generalizable to *all* features;
we can't make a separate config for every small feature that a user might
want to enable at the smallest granularity. However, given that the
existing solution of enabling the hugeTLB controller is an imperfect one
that still leaves a discrepancy between memory.stat and memory.current
when hugeTLB accounting is enabled, I think it is reasonable to isolate
this feature.
* Re: [PATCH 0/1] memcg/hugetlb: Adding hugeTLB counters to memory controller
2024-10-21 14:51 ` Joshua Hahn
@ 2024-10-21 15:44 ` Michal Hocko
0 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2024-10-21 15:44 UTC (permalink / raw)
To: Joshua Hahn
Cc: Shakeel Butt, Johannes Weiner, nphamcs, roman.gushchin,
muchun.song, akpm, cgroups, linux-mm, linux-kernel, lnyng
On Mon 21-10-24 10:51:43, Joshua Hahn wrote:
> On Mon, Oct 21, 2024 at 3:15 AM Michal Hocko <mhocko@suse.com> wrote:
> > On Fri 18-10-24 14:38:48, Joshua Hahn wrote:
> > > But even if we are okay with this, I think it might be overkill to
> > > enable the hugeTLB controller for the convenience of being able to inspect
> > > the hugeTLB usage for cgroups. This is especially true in workloads where
> > > we can predict what usage patterns will be like, and we do not need to enforce
> > > specific limits on hugeTLB usage.
> >
> > I am sorry but I do not understand the overkill part of the argument.
> > Is there any runtime or configuration cost that is visible?
>
> I think an argument could be made that any amount of incremental overhead
> is unnecessary. With that said, however, I think a bigger reason why this
> is overkill is that a user who wishes to use the hugeTLB counter (which
> this patch achieves in ~10 lines) should not have to enable a ~1000-line
> feature, as Johannes suggested.
>
> A diligent user will have to spend time learning how the hugeTLB controller
> works and figuring out the settings that will make the controller do no
> enforcing (essentially the same as if the feature were not enabled).
> A not-so-diligent user will not spend the time to make sure that the
> configs make sense, and may run into unexpected & unwanted hugeTLB
> behavior [1].
Heh, a lazy user would just enable the controller and hope for the best.
And it would actually work out of the box, because there is no limit
imposed by default, so the only downside is a theoretical overhead due
to charging.
Anyway, I get the point, and I guess it is fair to say that the half-baked
memcg accounting is not optimal because it only provides half-baked
insight, and you aim to fix that. This is a fair intention and
justification.
I have to say I really disliked this extension to memcg when it was
proposed, but it was claimed this was good enough for people who know
what they are doing.
> > TL;DR
> > 1) you need to make the stats accounting aligned with the existing
> > charge accounting.
> > 2) make the stat visible only when the feature is enabled
> > 3) work more on the justification - i.e. the changelog part - and give
> > us a better insight into why a) the hugetlb cgroup is seen as a bad
> > choice and b) the original limitation hasn't proven good since the
> > feature was introduced.
> >
> > Makes sense?
> > --
> > Michal Hocko
> > SUSE Labs
>
> Hi Michal,
>
> Thank you for your input. Yes -- this makes sense to me. I apologize, as
> it seems that I left a lot to be assumed & inferred in my original patch
> changelog.
>
> In my next version of this patch, I am planning to add the changes that
> Johannes suggested, write the code with the try_charge cleanup that
> Shakeel suggested in mind, and change the behavior so that the accounting
> happens only when hugeTLB accounting is enabled, as you suggested as well.
> I'll also update the changelog & commit message and add any information
> that will hopefully make future reviewers aware of the motivation for
> this patch.
Thanks a lot!
> Please let me know if you have any remaining concerns with the implementation
> or motivation, and I will be happy to incorporate your ideas into the next
> version as well.
I think the clarification and fixing the reporting are good enough. This
still won't make hugetlb sneaking into memcg more likeable to me, but it
is nothing that would keep me awake at night ;)
Thanks!
--
Michal Hocko
SUSE Labs
end of thread, other threads:[~2024-10-21 15:44 UTC | newest]
Thread overview: 14+ messages
2024-10-17 16:04 [PATCH 0/1] memcg/hugetlb: Adding hugeTLB counters to memory controller Joshua Hahn
2024-10-17 16:04 ` [PATCH 1/1] " Joshua Hahn
2024-10-17 17:22 ` Johannes Weiner
2024-10-18 21:34 ` Shakeel Butt
2024-10-19 22:45 ` Joshua Hahn
2024-10-18 10:12 ` [PATCH 0/1] " Michal Hocko
2024-10-18 12:31 ` Johannes Weiner
2024-10-18 13:42 ` Michal Hocko
2024-10-18 18:11 ` Shakeel Butt
2024-10-18 18:38 ` Joshua Hahn
2024-10-21 7:15 ` Michal Hocko
2024-10-21 14:51 ` Joshua Hahn
2024-10-21 15:44 ` Michal Hocko
2024-10-18 18:57 ` Johannes Weiner