* [RFC PATCH 0/3] Cgroup-based THP control
@ 2024-10-30 8:33 gutierrez.asier
2024-10-30 8:33 ` [RFC PATCH 1/3] mm: Add thp_flags control for cgroup gutierrez.asier
` (6 more replies)
0 siblings, 7 replies; 27+ messages in thread
From: gutierrez.asier @ 2024-10-30 8:33 UTC (permalink / raw)
To: akpm, david, ryan.roberts, baohua, willy, peterx, hannes, hocko,
roman.gushchin, shakeel.butt, muchun.song
Cc: cgroups, linux-mm, linux-kernel, stepanov.anatoly,
alexander.kozhevnikov, guohanjun, weiyongjun1, wangkefeng.wang,
judy.chenhui, yusongping, artem.kuzin, kang.sun
From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
Currently, THP modes are set globally. This can be overkill when only a
specific app or set of apps benefits from THP usage; moreover, different
apps may need different THP settings. Here we propose a cgroup-based THP
control mechanism.
A THP interface is added to the memory cgroup subsystem. The existing global
THP control semantics are preserved for backward compatibility: when THP modes
are set globally, the changes are propagated to all memory cgroups. However,
when a particular cgroup changes its THP policy, the global THP policy in
sysfs remains the same.
Two new memcg files are exposed: memory.thp_enabled and memory.thp_defrag,
which use exactly the same format as the global THP enabled/defrag controls.
Child cgroups inherit THP settings from their parent cgroup upon creation.
Mode changes in a particular cgroup are not propagated to its child cgroups.
During the memory cgroup attachment stage, the corresponding slots are added
to or removed from khugepaged according to the cgroup's THP policy.
Usage examples:
Set globally "madvise" mode:
# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
# cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
All the settings are propagated:
# cat /sys/fs/cgroup/memory.thp_enabled
always [madvise] never
# cat /sys/fs/cgroup/test/memory.thp_enabled
always [madvise] never
Set "always" for some specific cgroup:
# echo always > /sys/fs/cgroup/test/memory.thp_enabled
# cat /sys/fs/cgroup/test/memory.thp_enabled
[always] madvise never
The root cgroup remains in "madvise" mode:
# cat /sys/fs/cgroup/memory.thp_enabled
always [madvise] never
Reading the global setting now yields a "mixed state" warning, since the
THP mode is no longer the same for every cgroup:
# cat /sys/kernel/mm/transparent_hugepage/enabled
Mixed state: see particular memcg flags!
Again, set the THP mode globally and make sure everything works as expected:
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
# cat /sys/fs/cgroup/memory.thp_enabled
always madvise [never]
# cat /sys/fs/cgroup/test/memory.thp_enabled
always madvise [never]
Here is a simple demo with a test that performs an anonymous mmap() followed
by a series of random reads. The system is rebooted between the cases.
Case 1: Global THP - always. No cgroup.
// Global THP stats:
AnonHugePages: 391168 kB
FileHugePages: 120832 kB
FilePmdMapped: 67584 kB
// THP stats from *smaps* of the testing process
AnonHugePages: 12288 kB
Case 2: Global THP - never. Cgroup - always.
// Global THP stats:
AnonHugePages: 12288 kB
FileHugePages: 2048 kB
FilePmdMapped: 2048 kB
// THP stats from *smaps* of the testing process
AnonHugePages: 12288 kB
// The cgroup THP stats
anon_thp 12582912
file_thp 2097152
Obviously there is a huge difference between the two in terms of global THP
usage, showing that the cgroup approach is beneficial when a specific app or
set of apps needs THP but the application code cannot be changed.
TODO list:
1. Anonymous mTHP
2. Fine-grained mode selection for different VMA types ("anon|exec|ro|file"),
to support combinations such as "always + exec", "always + anon", etc.
3. Per-cgroup limit for the THP usage
Signed-off-by: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
Signed-off-by: Anatoly Stepanov <stepanov.anatoly@huawei.com>
Reviewed-by: Alexander Kozhevnikov <alexander.kozhevnikov@huawei-partners.com>
Asier Gutierrez, Anatoly Stepanov (3):
mm: Add thp_flags control for cgroup
mm: Support for huge pages in cgroups
mm: Add thp_defrag control for cgroup
include/linux/huge_mm.h | 23 +++-
include/linux/khugepaged.h | 2 +-
include/linux/memcontrol.h | 28 ++++
mm/huge_memory.c | 207 ++++++++++++++++++-----------
mm/khugepaged.c | 8 +-
mm/memcontrol.c | 262 +++++++++++++++++++++++++++++++++++++
6 files changed, 449 insertions(+), 81 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 27+ messages in thread
* [RFC PATCH 1/3] mm: Add thp_flags control for cgroup
2024-10-30 8:33 [RFC PATCH 0/3] Cgroup-based THP control gutierrez.asier
@ 2024-10-30 8:33 ` gutierrez.asier
2024-10-30 8:33 ` [RFC PATCH 2/3] mm: Support for huge pages in cgroups gutierrez.asier
` (5 subsequent siblings)
6 siblings, 0 replies; 27+ messages in thread
From: gutierrez.asier @ 2024-10-30 8:33 UTC (permalink / raw)
To: akpm, david, ryan.roberts, baohua, willy, peterx, hannes, hocko,
roman.gushchin, shakeel.butt, muchun.song
Cc: cgroups, linux-mm, linux-kernel, stepanov.anatoly,
alexander.kozhevnikov, guohanjun, weiyongjun1, wangkefeng.wang,
judy.chenhui, yusongping, artem.kuzin, kang.sun
From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
Expose a new memory cgroup file called memory.thp_enabled. This file works in
the same way and has the same format as the THP setting in
/sys/kernel/mm/transparent_hugepage/enabled. The patch allows reading from and
writing to that file, effectively changing the memory cgroup's THP policy. New
cgroups inherit the THP policies of their parents.
Signed-off-by: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
Signed-off-by: Anatoly Stepanov <stepanov.anatoly@huawei.com>
Reviewed-by: Alexander Kozhevnikov <alexander.kozhevnikov@huawei-partners.com>
---
include/linux/huge_mm.h | 5 +++
include/linux/memcontrol.h | 15 +++++++
mm/huge_memory.c | 71 ++++++++++++++++++++-----------
mm/memcontrol.c | 86 ++++++++++++++++++++++++++++++++++++++
4 files changed, 153 insertions(+), 24 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e25d9ebfdf89..86c0fb4c0b28 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -53,6 +53,9 @@ enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
};
+#define HUGEPAGE_FLAGS_ENABLED_MASK ((1UL << TRANSPARENT_HUGEPAGE_FLAG) |\
+ (1UL << TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG))
+
struct kobject;
struct kobj_attribute;
@@ -430,6 +433,8 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmdp, struct folio *folio);
+int thp_enabled_parse(const char *buf, unsigned long *flags);
+const char *thp_enabled_string(unsigned long flags);
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
static inline bool folio_test_pmd_mappable(struct folio *folio)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0e5bf25d324f..87b5fe93e19d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -315,6 +315,12 @@ struct mem_cgroup {
spinlock_t event_list_lock;
#endif /* CONFIG_MEMCG_V1 */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ unsigned long thp_flags;
+ unsigned long thp_anon_orders_always;
+ unsigned long thp_anon_orders_madvise;
+ unsigned long thp_anon_orders_inherit;
+#endif
struct mem_cgroup_per_node *nodeinfo[];
};
@@ -1615,6 +1621,15 @@ struct sock;
bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
gfp_t gfp_mask);
void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+int memory_thp_enabled_show(struct seq_file *m, void *v);
+ssize_t memory_thp_enabled_write(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off);
+
+int mem_cgroup_thp_flags_update_all(unsigned long flags, unsigned long mask);
+unsigned long memcg_get_thp_flags_all(unsigned long mask);
+unsigned long memcg_get_thp_flags(struct vm_area_struct *vma);
+#endif
#ifdef CONFIG_MEMCG
extern struct static_key_false memcg_sockets_enabled_key;
#define mem_cgroup_sockets_enabled static_branch_unlikely(&memcg_sockets_enabled_key)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 67c86a5d64a6..0fbdd8213443 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -46,6 +46,8 @@
#include "internal.h"
#include "swap.h"
+#include <linux/memcontrol.h>
+
#define CREATE_TRACE_POINTS
#include <trace/events/thp.h>
@@ -287,21 +289,43 @@ static unsigned long shrink_huge_zero_page_scan(struct shrinker *shrink,
static struct shrinker *huge_zero_page_shrinker;
-#ifdef CONFIG_SYSFS
-static ssize_t enabled_show(struct kobject *kobj,
- struct kobj_attribute *attr, char *buf)
+const char *thp_enabled_string(unsigned long flags)
{
const char *output;
- if (test_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags))
+ if (test_bit(TRANSPARENT_HUGEPAGE_FLAG, &flags))
output = "[always] madvise never";
- else if (test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
- &transparent_hugepage_flags))
+ else if (test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &flags))
output = "always [madvise] never";
else
output = "always madvise [never]";
- return sysfs_emit(buf, "%s\n", output);
+ return output;
+}
+
+int thp_enabled_parse(const char *buf, unsigned long *flags)
+{
+ if (sysfs_streq(buf, "always")) {
+ clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, flags);
+ set_bit(TRANSPARENT_HUGEPAGE_FLAG, flags);
+ } else if (sysfs_streq(buf, "madvise")) {
+ clear_bit(TRANSPARENT_HUGEPAGE_FLAG, flags);
+ set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, flags);
+ } else if (sysfs_streq(buf, "never")) {
+ clear_bit(TRANSPARENT_HUGEPAGE_FLAG, flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, flags);
+ } else
+ return -EINVAL;
+
+ return 0;
+}
+
+#ifdef CONFIG_SYSFS
+static ssize_t enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ unsigned long flags = transparent_hugepage_flags;
+ return sysfs_emit(buf, "%s\n", thp_enabled_string(flags));
}
static ssize_t enabled_store(struct kobject *kobj,
@@ -309,24 +333,21 @@ static ssize_t enabled_store(struct kobject *kobj,
const char *buf, size_t count)
{
ssize_t ret = count;
+ int err;
- if (sysfs_streq(buf, "always")) {
- clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
- set_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
- } else if (sysfs_streq(buf, "madvise")) {
- clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
- set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
- } else if (sysfs_streq(buf, "never")) {
- clear_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags);
- clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
- } else
- ret = -EINVAL;
+ ret = thp_enabled_parse(buf, &transparent_hugepage_flags) ? : count;
+ if (ret <= 0)
+ goto out;
- if (ret > 0) {
- int err = start_stop_khugepaged();
- if (err)
- ret = err;
- }
+ if (IS_ENABLED(CONFIG_MEMCG) && !mem_cgroup_disabled())
+ err = mem_cgroup_thp_flags_update_all(transparent_hugepage_flags,
+ HUGEPAGE_FLAGS_ENABLED_MASK);
+ else
+ err = start_stop_khugepaged();
+
+ if (err)
+ ret = err;
+out:
return ret;
}
@@ -1036,7 +1057,9 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
gfp_t vma_thp_gfp_mask(struct vm_area_struct *vma)
{
const bool vma_madvised = vma && (vma->vm_flags & VM_HUGEPAGE);
-
+#ifdef CONFIG_MEMCG
+ unsigned long transparent_hugepage_flags = memcg_get_thp_flags(vma);
+#endif
/* Always do synchronous compaction */
if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d563fb515766..2b25c45c85c3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -970,6 +970,33 @@ struct mem_cgroup *get_mem_cgroup_from_current(void)
return memcg;
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline bool memcg_thp_always_enabled(struct mem_cgroup *memcg)
+{
+ return test_bit(TRANSPARENT_HUGEPAGE_FLAG, &memcg->thp_flags);
+}
+
+static inline bool memcg_thp_madvise_enabled(struct mem_cgroup *memcg)
+{
+ return test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &memcg->thp_flags);
+}
+
+unsigned long memcg_get_thp_flags(struct vm_area_struct *vma)
+{
+ unsigned long flags = 0UL;
+ struct mem_cgroup *memcg = get_mem_cgroup_from_mm(vma->vm_mm);
+
+ if (!memcg)
+ goto out;
+
+ flags = READ_ONCE(memcg->thp_flags);
+out:
+ if (memcg)
+ css_put(&memcg->css);
+ return flags;
+}
+#endif
+
/**
* mem_cgroup_iter - iterate over memory cgroup hierarchy
* @root: hierarchy root
@@ -3625,6 +3652,11 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
WRITE_ONCE(memcg->oom_kill_disable, READ_ONCE(parent->oom_kill_disable));
page_counter_init(&memcg->kmem, &parent->kmem);
page_counter_init(&memcg->tcpmem, &parent->tcpmem);
+#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ WRITE_ONCE(memcg->thp_flags, READ_ONCE(parent->thp_flags));
+ WRITE_ONCE(memcg->thp_anon_orders_inherit,
+ READ_ONCE(parent->thp_anon_orders_inherit));
#endif
} else {
init_memcg_stats();
@@ -3634,6 +3666,17 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
#ifdef CONFIG_MEMCG_V1
page_counter_init(&memcg->kmem, NULL);
page_counter_init(&memcg->tcpmem, NULL);
+#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ WRITE_ONCE(memcg->thp_flags,
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS
+ (1<<TRANSPARENT_HUGEPAGE_FLAG)|
+#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_MADVISE
+ (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)|
+#endif
+ (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG));
+ WRITE_ONCE(memcg->thp_anon_orders_inherit, BIT(PMD_ORDER));
#endif
root_mem_cgroup = memcg;
return &memcg->css;
@@ -4315,6 +4358,19 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
return nbytes;
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+DEFINE_MUTEX(memcg_thp_flags_mutex);
+
+int memory_thp_enabled_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+ unsigned long flags = READ_ONCE(memcg->thp_flags);
+
+ seq_printf(m, "%s\n", thp_enabled_string(flags));
+ return 0;
+}
+#endif
+
static struct cftype memory_files[] = {
{
.name = "current",
@@ -4383,6 +4439,12 @@ static struct cftype memory_files[] = {
.flags = CFTYPE_NS_DELEGATABLE,
.write = memory_reclaim,
},
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ {
+ .name = "thp_enabled",
+ .seq_show = memory_thp_enabled_show,
+ },
+#endif
{ } /* terminate */
};
@@ -4844,6 +4906,30 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
refill_stock(memcg, nr_pages);
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+int mem_cgroup_thp_flags_update_all(unsigned long new_flags, unsigned long mask)
+{
+ int ret = 0;
+ struct mem_cgroup *iter, *memcg = root_mem_cgroup;
+ unsigned long enabled_mask =
+ (1UL << TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) |
+ (1UL << TRANSPARENT_HUGEPAGE_FLAG);
+
+ mutex_lock(&memcg_thp_flags_mutex);
+ enabled_mask &= new_flags;
+
+ for_each_mem_cgroup_tree(iter, memcg) {
+ unsigned long old_flags = iter->thp_flags;
+
+ iter->thp_flags = (old_flags & ~mask) | new_flags;
+ }
+
+ mutex_unlock(&memcg_thp_flags_mutex);
+ return ret;
+}
+
+#endif
+
static int __init cgroup_memory(char *s)
{
char *token;
--
2.34.1
* [RFC PATCH 2/3] mm: Support for huge pages in cgroups
2024-10-30 8:33 [RFC PATCH 0/3] Cgroup-based THP control gutierrez.asier
2024-10-30 8:33 ` [RFC PATCH 1/3] mm: Add thp_flags control for cgroup gutierrez.asier
@ 2024-10-30 8:33 ` gutierrez.asier
2024-10-30 8:33 ` [RFC PATCH 3/3] mm: Add thp_defrag control for cgroup gutierrez.asier
` (4 subsequent siblings)
6 siblings, 0 replies; 27+ messages in thread
From: gutierrez.asier @ 2024-10-30 8:33 UTC (permalink / raw)
To: akpm, david, ryan.roberts, baohua, willy, peterx, hannes, hocko,
roman.gushchin, shakeel.butt, muchun.song
Cc: cgroups, linux-mm, linux-kernel, stepanov.anatoly,
alexander.kozhevnikov, guohanjun, weiyongjun1, wangkefeng.wang,
judy.chenhui, yusongping, artem.kuzin, kang.sun
From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
This patch adds support for computing the correct order mask according to the
memory cgroup THP policy. Lazy collapsing by khugepaged and overriding of the
kernel boot-parameter THP setting were also added.
Signed-off-by: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
Signed-off-by: Anatoly Stepanov <stepanov.anatoly@huawei.com>
Reviewed-by: Alexander Kozhevnikov <alexander.kozhevnikov@huawei-partners.com>
---
include/linux/huge_mm.h | 10 ++-
include/linux/khugepaged.h | 2 +-
include/linux/memcontrol.h | 11 +++
mm/huge_memory.c | 22 ++++--
mm/khugepaged.c | 8 +-
mm/memcontrol.c | 147 ++++++++++++++++++++++++++++++++++++-
6 files changed, 187 insertions(+), 13 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 86c0fb4c0b28..f99ac9b7e5bc 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -207,6 +207,12 @@ static inline unsigned long thp_vma_suitable_orders(struct vm_area_struct *vma,
return orders;
}
+#if defined(CONFIG_MEMCG) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
+bool memcg_thp_vma_allowable_orders(struct vm_area_struct *vma,
+ unsigned long vm_flags,
+ unsigned long *res_mask);
+#endif
+
static inline bool file_thp_enabled(struct vm_area_struct *vma)
{
struct inode *inode;
@@ -255,7 +261,9 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
if (hugepage_global_always() ||
((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
mask |= READ_ONCE(huge_anon_orders_inherit);
-
+#if defined(CONFIG_MEMCG) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
+ memcg_thp_vma_allowable_orders(vma, vm_flags, &mask);
+#endif
orders &= mask;
if (!orders)
return 0;
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index f68865e19b0b..50cabca48b93 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -9,7 +9,7 @@ extern struct attribute_group khugepaged_attr_group;
extern int khugepaged_init(void);
extern void khugepaged_destroy(void);
-extern int start_stop_khugepaged(void);
+extern int start_stop_khugepaged(bool force_stop);
extern void __khugepaged_enter(struct mm_struct *mm);
extern void __khugepaged_exit(struct mm_struct *mm);
extern void khugepaged_enter_vma(struct vm_area_struct *vma,
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 87b5fe93e19d..d78318782af8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -23,6 +23,7 @@
#include <linux/writeback.h>
#include <linux/page-flags.h>
#include <linux/shrinker.h>
+#include <linux/khugepaged.h>
struct mem_cgroup;
struct obj_cgroup;
@@ -1069,6 +1070,9 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
void split_page_memcg(struct page *head, int old_order, int new_order);
+bool memcg_thp_vma_allowable_orders(struct vm_area_struct *vma,
+ unsigned long vm_flags,
+ unsigned long *res_mask);
#else /* CONFIG_MEMCG */
#define MEM_CGROUP_ID_SHIFT 0
@@ -1476,6 +1480,13 @@ void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
static inline void split_page_memcg(struct page *head, int old_order, int new_order)
{
}
+
+static inline bool memcg_thp_vma_allowable_orders(struct vm_area_struct *vma,
+ unsigned long vm_flags,
+ unsigned long *res_mask)
+{
+ return false;
+}
#endif /* CONFIG_MEMCG */
/*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0fbdd8213443..fdffdfc8605c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -172,15 +172,23 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
}
if (!vma_is_anonymous(vma)) {
+ bool memcg_enabled = false;
/*
* Enforce sysfs THP requirements as necessary. Anonymous vmas
* were already handled in thp_vma_allowable_orders().
*/
- if (enforce_sysfs &&
- (!hugepage_global_enabled() || (!(vm_flags & VM_HUGEPAGE) &&
- !hugepage_global_always())))
- return 0;
+ if (enforce_sysfs) {
+ unsigned long mask = 0UL;
+
+ memcg_enabled = memcg_thp_vma_allowable_orders(vma, vm_flags, &mask);
+ if (memcg_enabled && !mask)
+ return 0;
+ if (!memcg_enabled && (!hugepage_global_enabled() ||
+ (!(vm_flags & VM_HUGEPAGE) &&
+ !hugepage_global_always())))
+ return 0;
+ }
/*
* Trust that ->huge_fault() handlers know what they are doing
* in fault path.
@@ -343,7 +351,7 @@ static ssize_t enabled_store(struct kobject *kobj,
err = mem_cgroup_thp_flags_update_all(transparent_hugepage_flags,
HUGEPAGE_FLAGS_ENABLED_MASK);
else
- err = start_stop_khugepaged();
+ err = start_stop_khugepaged(false);
if (err)
ret = err;
@@ -539,7 +547,7 @@ static ssize_t thpsize_enabled_store(struct kobject *kobj,
if (ret > 0) {
int err;
- err = start_stop_khugepaged();
+ err = start_stop_khugepaged(false);
if (err)
ret = err;
}
@@ -812,7 +820,7 @@ static int __init hugepage_init(void)
return 0;
}
- err = start_stop_khugepaged();
+ err = start_stop_khugepaged(false);
if (err)
goto err_khugepaged;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index cdd1d8655a76..ebed9bf8cfb5 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -415,6 +415,8 @@ static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
static bool hugepage_pmd_enabled(void)
{
+ if (IS_ENABLED(CONFIG_MEMCG) && !mem_cgroup_disabled())
+ return true;
/*
* We cover both the anon and the file-backed case here; file-backed
* hugepages, when configured in, are determined by the global control.
@@ -2586,7 +2588,7 @@ static void set_recommended_min_free_kbytes(void)
int nr_zones = 0;
unsigned long recommended_min;
- if (!hugepage_pmd_enabled()) {
+ if (!hugepage_pmd_enabled() || !khugepaged_thread) {
calculate_min_free_kbytes();
goto update_wmarks;
}
@@ -2631,12 +2633,12 @@ static void set_recommended_min_free_kbytes(void)
setup_per_zone_wmarks();
}
-int start_stop_khugepaged(void)
+int start_stop_khugepaged(bool force_stop)
{
int err = 0;
mutex_lock(&khugepaged_mutex);
- if (hugepage_pmd_enabled()) {
+ if (hugepage_pmd_enabled() && !force_stop) {
if (!khugepaged_thread)
khugepaged_thread = kthread_run(khugepaged, NULL,
"khugepaged");
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2b25c45c85c3..938e6894c0b3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -981,6 +981,37 @@ static inline bool memcg_thp_madvise_enabled(struct mem_cgroup *memcg)
return test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG, &memcg->thp_flags);
}
+bool memcg_thp_vma_allowable_orders(struct vm_area_struct *vma,
+ unsigned long vm_flags,
+ unsigned long *res_mask)
+{
+ unsigned long mask = 0UL;
+
+ struct mem_cgroup *memcg = get_mem_cgroup_from_mm(vma->vm_mm);
+
+ if (!memcg)
+ return false;
+
+ if (memcg_thp_always_enabled(memcg) ||
+ ((vm_flags & VM_HUGEPAGE) &&
+ memcg_thp_madvise_enabled(memcg))) {
+ if (!vma_is_anonymous(vma))
+ mask = THP_ORDERS_ALL_FILE_DEFAULT;
+ else {
+ mask = READ_ONCE(memcg->thp_anon_orders_always);
+
+ if (vm_flags & VM_HUGEPAGE)
+ mask |= READ_ONCE(memcg->thp_anon_orders_madvise);
+
+ mask = mask | READ_ONCE(memcg->thp_anon_orders_inherit);
+ }
+ }
+
+ css_put(&memcg->css);
+ *res_mask = mask;
+ return true;
+}
+
unsigned long memcg_get_thp_flags(struct vm_area_struct *vma)
{
unsigned long flags = 0UL;
@@ -3986,10 +4017,52 @@ static void mem_cgroup_kmem_attach(struct cgroup_taskset *tset)
}
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static int mem_cgroup_notify_khugepaged_cb(struct task_struct *p, void *arg)
+{
+ struct vm_area_struct *vma = NULL;
+ struct mem_cgroup *memcg = arg;
+ bool is_madvise = memcg_thp_madvise_enabled(memcg);
+ bool is_always = memcg_thp_always_enabled(memcg);
+
+ VMA_ITERATOR(vmi, p->mm, 0);
+
+ if (!is_always && !is_madvise) {
+ khugepaged_exit(p->mm);
+ return 0;
+ }
+
+ for_each_vma(vmi, vma) {
+ if (is_madvise && !(vma->vm_flags & VM_HUGEPAGE))
+ continue;
+
+ khugepaged_enter_vma(vma, vma->vm_flags);
+
+ if (test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags))
+ break;
+ }
+
+ return 0;
+}
+
+static void mem_cgroup_thp_attach(struct cgroup_taskset *tset)
+{
+ struct task_struct *task;
+ struct cgroup_subsys_state *css;
+
+ cgroup_taskset_for_each(task, css, tset) {
+ mem_cgroup_notify_khugepaged_cb(task, mem_cgroup_from_css(css));
+ }
+}
+#else
+static void mem_cgroup_thp_attach(struct cgroup_taskset *tset) {}
+#endif
+
static void mem_cgroup_attach(struct cgroup_taskset *tset)
{
mem_cgroup_lru_gen_attach(tset);
mem_cgroup_kmem_attach(tset);
+ mem_cgroup_thp_attach(tset);
}
static int seq_puts_memcg_tunable(struct seq_file *m, unsigned long value)
@@ -4369,6 +4442,54 @@ int memory_thp_enabled_show(struct seq_file *m, void *v)
seq_printf(m, "%s\n", thp_enabled_string(flags));
return 0;
}
+
+static int mem_cgroup_notify_khugepaged(struct mem_cgroup *memcg)
+{
+ struct mem_cgroup *iter;
+ int ret = 0;
+
+ for_each_mem_cgroup_tree(iter, memcg) {
+ struct css_task_iter it;
+ struct task_struct *task;
+
+ css_task_iter_start(&iter->css, CSS_TASK_ITER_PROCS, &it);
+ while (!ret && (task = css_task_iter_next(&it))) {
+ if (!task->mm || (task->flags & PF_KTHREAD))
+ continue;
+
+ ret = mem_cgroup_notify_khugepaged_cb(task, memcg);
+ }
+ css_task_iter_end(&it);
+ if (ret) {
+ mem_cgroup_iter_break(memcg, iter);
+ break;
+ }
+ }
+
+ return ret;
+}
+
+ssize_t memory_thp_enabled_write(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off)
+{
+ int err;
+ int ret = nbytes;
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+
+ buf = strstrip(buf);
+
+ mutex_lock(&memcg_thp_flags_mutex);
+ ret = thp_enabled_parse(buf, &memcg->thp_flags) ? : nbytes;
+ if (ret > 0) {
+ err = mem_cgroup_notify_khugepaged(memcg);
+ if (!err)
+ err = start_stop_khugepaged(false);
+ if (err)
+ ret = err;
+ }
+ mutex_unlock(&memcg_thp_flags_mutex);
+ return ret;
+}
#endif
static struct cftype memory_files[] = {
@@ -4443,6 +4564,7 @@ static struct cftype memory_files[] = {
{
.name = "thp_enabled",
.seq_show = memory_thp_enabled_show,
+ .write = memory_thp_enabled_write,
},
#endif
{ } /* terminate */
@@ -4909,7 +5031,9 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
int mem_cgroup_thp_flags_update_all(unsigned long new_flags, unsigned long mask)
{
- int ret = 0;
+ int ret;
+ struct css_task_iter task_iter;
+ struct task_struct *task;
struct mem_cgroup *iter, *memcg = root_mem_cgroup;
unsigned long enabled_mask =
(1UL << TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) |
@@ -4922,8 +5046,18 @@ int mem_cgroup_thp_flags_update_all(unsigned long new_flags, unsigned long mask)
unsigned long old_flags = iter->thp_flags;
iter->thp_flags = (old_flags & ~mask) | new_flags;
+
+ css_task_iter_start(&iter->css, CSS_TASK_ITER_PROCS, &task_iter);
+ while ((task = css_task_iter_next(&task_iter))) {
+ if (!task->mm || (task->flags & PF_KTHREAD))
+ continue;
+
+ mem_cgroup_notify_khugepaged_cb(task, iter);
+ }
+ css_task_iter_end(&task_iter);
}
+ ret = start_stop_khugepaged(!enabled_mask);
mutex_unlock(&memcg_thp_flags_mutex);
return ret;
}
@@ -5509,4 +5643,15 @@ static int __init mem_cgroup_swap_init(void)
}
subsys_initcall(mem_cgroup_swap_init);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static int __init mem_cgroup_thp_init(void)
+{
+ if (IS_ENABLED(CONFIG_MEMCG))
+ root_mem_cgroup->thp_flags = transparent_hugepage_flags;
+
+ return 0;
+}
+subsys_initcall(mem_cgroup_thp_init);
+#endif
#endif /* CONFIG_SWAP */
--
2.34.1
* [RFC PATCH 3/3] mm: Add thp_defrag control for cgroup
2024-10-30 8:33 [RFC PATCH 0/3] Cgroup-based THP control gutierrez.asier
2024-10-30 8:33 ` [RFC PATCH 1/3] mm: Add thp_flags control for cgroup gutierrez.asier
2024-10-30 8:33 ` [RFC PATCH 2/3] mm: Support for huge pages in cgroups gutierrez.asier
@ 2024-10-30 8:33 ` gutierrez.asier
2024-10-30 8:38 ` [RFC PATCH 0/3] Cgroup-based THP control Michal Hocko
` (3 subsequent siblings)
6 siblings, 0 replies; 27+ messages in thread
From: gutierrez.asier @ 2024-10-30 8:33 UTC (permalink / raw)
To: akpm, david, ryan.roberts, baohua, willy, peterx, hannes, hocko,
roman.gushchin, shakeel.butt, muchun.song
Cc: cgroups, linux-mm, linux-kernel, stepanov.anatoly,
alexander.kozhevnikov, guohanjun, weiyongjun1, wangkefeng.wang,
judy.chenhui, yusongping, artem.kuzin, kang.sun
From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
This patch exposes a new file in memory cgroups, memory.thp_defrag, which
follows the /sys/kernel/mm/transparent_hugepage/defrag format. Support for
per-memory-cgroup THP defrag policies was also added.
Signed-off-by: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
Signed-off-by: Anatoly Stepanov <stepanov.anatoly@huawei.com>
Reviewed-by: Alexander Kozhevnikov <alexander.kozhevnikov@huawei-partners.com>
---
include/linux/huge_mm.h | 8 +++
include/linux/memcontrol.h | 4 +-
mm/huge_memory.c | 116 ++++++++++++++++++++++---------------
mm/memcontrol.c | 31 ++++++++++
4 files changed, 112 insertions(+), 47 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f99ac9b7e5bc..177c7d3578ed 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -56,6 +56,12 @@ enum transparent_hugepage_flag {
#define HUGEPAGE_FLAGS_ENABLED_MASK ((1UL << TRANSPARENT_HUGEPAGE_FLAG) |\
(1UL << TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG))
+#define HUGEPAGE_FLAGS_DEFRAG_MASK ((1UL << TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG) |\
+ (1UL << TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG) |\
+ (1UL << TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG) |\
+ (1UL << TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG) |\
+ (1UL << TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG))
+
struct kobject;
struct kobj_attribute;
@@ -442,7 +448,9 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmdp, struct folio *folio);
int thp_enabled_parse(const char *buf, unsigned long *flags);
+int thp_defrag_parse(const char *buf, unsigned long *flags);
const char *thp_enabled_string(unsigned long flags);
+const char *thp_defrag_string(unsigned long flags);
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
static inline bool folio_test_pmd_mappable(struct folio *folio)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d78318782af8..a0edf15b3a07 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1634,9 +1634,11 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages);
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
int memory_thp_enabled_show(struct seq_file *m, void *v);
+int memory_thp_defrag_show(struct seq_file *m, void *v);
ssize_t memory_thp_enabled_write(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off);
-
+ssize_t memory_thp_defrag_write(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off);
int mem_cgroup_thp_flags_update_all(unsigned long flags, unsigned long mask);
unsigned long memcg_get_thp_flags_all(unsigned long mask);
unsigned long memcg_get_thp_flags(struct vm_area_struct *vma);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fdffdfc8605c..6e1886b220d9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -311,6 +311,28 @@ const char *thp_enabled_string(unsigned long flags)
return output;
}
+const char *thp_defrag_string(unsigned long flags)
+{
+ const char *output;
+
+ if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
+ &flags))
+ output = "[always] defer defer+madvise madvise never";
+ else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
+ &flags))
+ output = "always [defer] defer+madvise madvise never";
+ else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG,
+ &flags))
+ output = "always defer [defer+madvise] madvise never";
+ else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
+ &flags))
+ output = "always defer defer+madvise [madvise] never";
+ else
+ output = "always defer defer+madvise madvise [never]";
+
+ return output;
+}
+
int thp_enabled_parse(const char *buf, unsigned long *flags)
{
if (sysfs_streq(buf, "always")) {
@@ -328,6 +350,39 @@ int thp_enabled_parse(const char *buf, unsigned long *flags)
return 0;
}
+int thp_defrag_parse(const char *buf, unsigned long *flags)
+{
+ if (sysfs_streq(buf, "always")) {
+ clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, flags);
+ set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, flags);
+ } else if (sysfs_streq(buf, "defer+madvise")) {
+ clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, flags);
+ set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, flags);
+ } else if (sysfs_streq(buf, "defer")) {
+ clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, flags);
+ set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, flags);
+ } else if (sysfs_streq(buf, "madvise")) {
+ clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, flags);
+ set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, flags);
+ } else if (sysfs_streq(buf, "never")) {
+ clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, flags);
+ clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, flags);
+ } else
+ return -EINVAL;
+
+ return 0;
+}
+
#ifdef CONFIG_SYSFS
static ssize_t enabled_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
@@ -394,60 +449,29 @@ ssize_t single_hugepage_flag_store(struct kobject *kobj,
static ssize_t defrag_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
- const char *output;
-
- if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
- &transparent_hugepage_flags))
- output = "[always] defer defer+madvise madvise never";
- else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
- &transparent_hugepage_flags))
- output = "always [defer] defer+madvise madvise never";
- else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG,
- &transparent_hugepage_flags))
- output = "always defer [defer+madvise] madvise never";
- else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
- &transparent_hugepage_flags))
- output = "always defer defer+madvise [madvise] never";
- else
- output = "always defer defer+madvise madvise [never]";
-
- return sysfs_emit(buf, "%s\n", output);
+ unsigned long flags = transparent_hugepage_flags;
+ return sysfs_emit(buf, "%s\n", thp_defrag_string(flags));
}
static ssize_t defrag_store(struct kobject *kobj,
struct kobj_attribute *attr,
const char *buf, size_t count)
{
- if (sysfs_streq(buf, "always")) {
- clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
- clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
- clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
- set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
- } else if (sysfs_streq(buf, "defer+madvise")) {
- clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
- clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
- clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
- set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
- } else if (sysfs_streq(buf, "defer")) {
- clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
- clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
- clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
- set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
- } else if (sysfs_streq(buf, "madvise")) {
- clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
- clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
- clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
- set_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
- } else if (sysfs_streq(buf, "never")) {
- clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags);
- clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags);
- clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags);
- clear_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags);
- } else
- return -EINVAL;
+ ssize_t ret = count;
+ int err;
- return count;
+ ret = thp_defrag_parse(buf, &transparent_hugepage_flags) ? : count;
+ if (ret > 0 && IS_ENABLED(CONFIG_MEMCG) &&
+ !mem_cgroup_disabled()) {
+ err = mem_cgroup_thp_flags_update_all(transparent_hugepage_flags,
+ HUGEPAGE_FLAGS_DEFRAG_MASK);
+ if (err)
+ ret = err;
+ }
+
+ return ret;
}
+
static struct kobj_attribute defrag_attr = __ATTR_RW(defrag);
static ssize_t use_zero_page_show(struct kobject *kobj,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 938e6894c0b3..53384f0a69af 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3706,6 +3706,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE_MADVISE
(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)|
#endif
+ (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG)|
+ (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG));
WRITE_ONCE(memcg->thp_anon_orders_inherit, BIT(PMD_ORDER));
#endif
@@ -4490,6 +4492,30 @@ ssize_t memory_thp_enabled_write(struct kernfs_open_file *of, char *buf,
mutex_unlock(&memcg_thp_flags_mutex);
return ret;
}
+
+int memory_thp_defrag_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+ unsigned long flags = READ_ONCE(memcg->thp_flags);
+
+ seq_printf(m, "%s\n", thp_defrag_string(flags));
+ return 0;
+}
+
+ssize_t memory_thp_defrag_write(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off)
+{
+ ssize_t ret = nbytes;
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+
+ buf = strstrip(buf);
+
+ mutex_lock(&memcg_thp_flags_mutex);
+ ret = thp_defrag_parse(buf, &memcg->thp_flags) ? : nbytes;
+ mutex_unlock(&memcg_thp_flags_mutex);
+
+ return ret;
+}
#endif
static struct cftype memory_files[] = {
@@ -4566,6 +4592,11 @@ static struct cftype memory_files[] = {
.seq_show = memory_thp_enabled_show,
.write = memory_thp_enabled_write,
},
+ {
+ .name = "thp_defrag",
+ .seq_show = memory_thp_defrag_show,
+ .write = memory_thp_defrag_write,
+ },
#endif
{ } /* terminate */
};
--
2.34.1
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 0/3] Cgroup-based THP control
2024-10-30 8:33 [RFC PATCH 0/3] Cgroup-based THP control gutierrez.asier
` (2 preceding siblings ...)
2024-10-30 8:33 ` [RFC PATCH 3/3] mm: Add thp_defrag control for cgroup gutierrez.asier
@ 2024-10-30 8:38 ` Michal Hocko
2024-10-30 12:51 ` Gutierrez Asier
2024-10-30 13:14 ` Matthew Wilcox
` (2 subsequent siblings)
6 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2024-10-30 8:38 UTC (permalink / raw)
To: gutierrez.asier
Cc: akpm, david, ryan.roberts, baohua, willy, peterx, hannes, hocko,
roman.gushchin, shakeel.butt, muchun.song, cgroups, linux-mm,
linux-kernel, stepanov.anatoly, alexander.kozhevnikov, guohanjun,
weiyongjun1, wangkefeng.wang, judy.chenhui, yusongping,
artem.kuzin, kang.sun
On Wed 30-10-24 16:33:08, gutierrez.asier@huawei-partners.com wrote:
> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>
> Currently THP modes are set globally. It can be an overkill if only some
> specific app/set of apps need to get benefits from THP usage. Moreover, various
> apps might need different THP settings. Here we propose a cgroup-based THP
> control mechanism.
>
> THP interface is added to memory cgroup subsystem. Existing global THP control
> semantics is supported for backward compatibility. When THP modes are set
> globally all the changes are propagated to memory cgroups. However, when a
> particular cgroup changes its THP policy, the global THP policy in sysfs remains
> the same.
Do you have any specific examples where this would be beneficial?
> New memcg files are exposed: memory.thp_enabled and memory.thp_defrag, which
> have completely the same format as global THP enabled/defrag.
>
> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
> cgroup mode changes aren't propagated to child cgroups.
So this breaks hierarchical property, doesn't it? In other words if a
parent cgroup would like to enforce a certain policy to all descendants
then this is not really possible.
--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 0/3] Cgroup-based THP control
2024-10-30 8:38 ` [RFC PATCH 0/3] Cgroup-based THP control Michal Hocko
@ 2024-10-30 12:51 ` Gutierrez Asier
2024-10-30 13:27 ` Michal Hocko
0 siblings, 1 reply; 27+ messages in thread
From: Gutierrez Asier @ 2024-10-30 12:51 UTC (permalink / raw)
To: Michal Hocko
Cc: akpm, david, ryan.roberts, baohua, willy, peterx, hannes, hocko,
roman.gushchin, shakeel.butt, muchun.song, cgroups, linux-mm,
linux-kernel, stepanov.anatoly, alexander.kozhevnikov, guohanjun,
weiyongjun1, wangkefeng.wang, judy.chenhui, yusongping,
artem.kuzin, kang.sun
On 10/30/2024 11:38 AM, Michal Hocko wrote:
> On Wed 30-10-24 16:33:08, gutierrez.asier@huawei-partners.com wrote:
>> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>>
>> Currently THP modes are set globally. It can be an overkill if only some
>> specific app/set of apps need to get benefits from THP usage. Moreover, various
>> apps might need different THP settings. Here we propose a cgroup-based THP
>> control mechanism.
>>
>> THP interface is added to memory cgroup subsystem. Existing global THP control
>> semantics is supported for backward compatibility. When THP modes are set
>> globally all the changes are propagated to memory cgroups. However, when a
>> particular cgroup changes its THP policy, the global THP policy in sysfs remains
>> the same.
>
> Do you have any specific examples where this would be beneficial?
Right now we're mostly focused on database scenarios (MySQL, Redis).
The main idea is to avoid a global THP setting that can potentially waste
resources overall, and to get per-cgroup granularity instead.
Besides THP being beneficial for DB performance, we observe high THP
"over-usage" by some unrelated apps/services when the "always" mode is enabled
globally.
With cgroup-based THP, we're able to specify the exact "THP users", and we plan
to introduce the ability to limit the amount of THPs per cgroup.
We suppose it should also be beneficial for some container-based workloads,
where certain containers can have different THP policies, but we haven't looked
into that case yet.
>> New memcg files are exposed: memory.thp_enabled and memory.thp_defrag, which
>> have completely the same format as global THP enabled/defrag.
>>
>> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
>> cgroup mode changes aren't propagated to child cgroups.
>
> So this breaks hierarchical property, doesn't it? In other words if a
> parent cgroup would like to enforce a certain policy to all descendants
> then this is not really possible.
The first idea was to have some flexibility when changing THP policies.
I will submit a new patch set that enforces the cgroup hierarchy and changes all
the children recursively.
--
Asier Gutierrez
Huawei
* Re: [RFC PATCH 0/3] Cgroup-based THP control
2024-10-30 8:33 [RFC PATCH 0/3] Cgroup-based THP control gutierrez.asier
` (3 preceding siblings ...)
2024-10-30 8:38 ` [RFC PATCH 0/3] Cgroup-based THP control Michal Hocko
@ 2024-10-30 13:14 ` Matthew Wilcox
2024-10-30 13:16 ` David Hildenbrand
2024-10-30 14:45 ` Chris Down
2024-10-30 15:08 ` Johannes Weiner
6 siblings, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2024-10-30 13:14 UTC (permalink / raw)
To: gutierrez.asier
Cc: akpm, david, ryan.roberts, baohua, peterx, hannes, hocko,
roman.gushchin, shakeel.butt, muchun.song, cgroups, linux-mm,
linux-kernel, stepanov.anatoly, alexander.kozhevnikov, guohanjun,
weiyongjun1, wangkefeng.wang, judy.chenhui, yusongping,
artem.kuzin, kang.sun
On Wed, Oct 30, 2024 at 04:33:08PM +0800, gutierrez.asier@huawei-partners.com wrote:
> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>
> Currently THP modes are set globally. It can be an overkill if only some
> specific app/set of apps need to get benefits from THP usage. Moreover, various
> apps might need different THP settings. Here we propose a cgroup-based THP
> control mechanism.
Or maybe we should stop making the sysadmin's life so damned hard and
figure out how to do without all of these settings?
* Re: [RFC PATCH 0/3] Cgroup-based THP control
2024-10-30 13:14 ` Matthew Wilcox
@ 2024-10-30 13:16 ` David Hildenbrand
0 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2024-10-30 13:16 UTC (permalink / raw)
To: Matthew Wilcox, gutierrez.asier
Cc: akpm, ryan.roberts, baohua, peterx, hannes, hocko,
roman.gushchin, shakeel.butt, muchun.song, cgroups, linux-mm,
linux-kernel, stepanov.anatoly, alexander.kozhevnikov, guohanjun,
weiyongjun1, wangkefeng.wang, judy.chenhui, yusongping,
artem.kuzin, kang.sun
On 30.10.24 14:14, Matthew Wilcox wrote:
> On Wed, Oct 30, 2024 at 04:33:08PM +0800, gutierrez.asier@huawei-partners.com wrote:
>> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>>
>> Currently THP modes are set globally. It can be an overkill if only some
>> specific app/set of apps need to get benefits from THP usage. Moreover, various
>> apps might need different THP settings. Here we propose a cgroup-based THP
>> control mechanism.
>
> Or maybe we should stop making the sysadmin's life so damned hard and
> figure out how to do without all of these settings?
In particular if there is no proper problem description / use case.
--
Cheers,
David / dhildenb
* Re: [RFC PATCH 0/3] Cgroup-based THP control
2024-10-30 12:51 ` Gutierrez Asier
@ 2024-10-30 13:27 ` Michal Hocko
2024-10-30 14:58 ` Gutierrez Asier
0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2024-10-30 13:27 UTC (permalink / raw)
To: Gutierrez Asier
Cc: akpm, david, ryan.roberts, baohua, willy, peterx, hannes, hocko,
roman.gushchin, shakeel.butt, muchun.song, cgroups, linux-mm,
linux-kernel, stepanov.anatoly, alexander.kozhevnikov, guohanjun,
weiyongjun1, wangkefeng.wang, judy.chenhui, yusongping,
artem.kuzin, kang.sun
On Wed 30-10-24 15:51:00, Gutierrez Asier wrote:
>
>
> On 10/30/2024 11:38 AM, Michal Hocko wrote:
> > On Wed 30-10-24 16:33:08, gutierrez.asier@huawei-partners.com wrote:
> >> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
> >>
> >> Currently THP modes are set globally. It can be an overkill if only some
> >> specific app/set of apps need to get benefits from THP usage. Moreover, various
> >> apps might need different THP settings. Here we propose a cgroup-based THP
> >> control mechanism.
> >>
> >> THP interface is added to memory cgroup subsystem. Existing global THP control
> >> semantics is supported for backward compatibility. When THP modes are set
> >> globally all the changes are propagated to memory cgroups. However, when a
> >> particular cgroup changes its THP policy, the global THP policy in sysfs remains
> >> the same.
> >
> > Do you have any specific examples where this would be beneficial?
>
> Now we're mostly focused on database scenarios (MySQL, Redis).
That seems to be more process- than workload-oriented. Why doesn't the existing
per-process tuning work?
[...]
> >> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
> >> cgroup mode changes aren't propagated to child cgroups.
> >
> > So this breaks hierarchical property, doesn't it? In other words if a
> > parent cgroup would like to enforce a certain policy to all descendants
> > then this is not really possible.
>
> The first idea was to have some flexibility when changing THP policies.
>
> I will submit a new patch set which will enforce the cgroup hierarchy and change all
> the children recursively.
What is the expected semantics then?
--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 0/3] Cgroup-based THP control
2024-10-30 8:33 [RFC PATCH 0/3] Cgroup-based THP control gutierrez.asier
` (4 preceding siblings ...)
2024-10-30 13:14 ` Matthew Wilcox
@ 2024-10-30 14:45 ` Chris Down
2024-10-30 15:04 ` Michal Hocko
2024-10-30 15:08 ` Johannes Weiner
6 siblings, 1 reply; 27+ messages in thread
From: Chris Down @ 2024-10-30 14:45 UTC (permalink / raw)
To: gutierrez.asier
Cc: akpm, david, ryan.roberts, baohua, willy, peterx, hannes, hocko,
roman.gushchin, shakeel.butt, muchun.song, cgroups, linux-mm,
linux-kernel, stepanov.anatoly, alexander.kozhevnikov, guohanjun,
weiyongjun1, wangkefeng.wang, judy.chenhui, yusongping,
artem.kuzin, kang.sun
gutierrez.asier@huawei-partners.com writes:
>New memcg files are exposed: memory.thp_enabled and memory.thp_defrag, which
>have completely the same format as global THP enabled/defrag.
cgroup controls exist because there are things we want to do for an entire
class of processes (group OOM, resource control, etc). Enabling or disabling
some specific setting is generally not one of them, hence why we got rid of
things like per-cgroup vm.swappiness. We know that these controls do not
compose well and have caused a lot of pain in the past. So my immediate
reaction is a nack on the general concept, unless there's some absolutely
compelling case here.
I talked a little at Kernel Recipes last year about moving away from sysctl and
other global interfaces and making things more granular. Don't get me wrong, I
think that is a good thing (although, of course, a very large undertaking) --
but it is a mistake to overload the amount of controls we expose as part of the
cgroup interface.
I am up for thinking about how we can improve the state of global tunables
to make them more granular overall, but this can't set a precedent as the way
to do it.
* Re: [RFC PATCH 0/3] Cgroup-based THP control
2024-10-30 13:27 ` Michal Hocko
@ 2024-10-30 14:58 ` Gutierrez Asier
2024-10-30 15:15 ` Michal Hocko
0 siblings, 1 reply; 27+ messages in thread
From: Gutierrez Asier @ 2024-10-30 14:58 UTC (permalink / raw)
To: Michal Hocko
Cc: akpm, david, ryan.roberts, baohua, willy, peterx, hannes, hocko,
roman.gushchin, shakeel.butt, muchun.song, cgroups, linux-mm,
linux-kernel, stepanov.anatoly, alexander.kozhevnikov, guohanjun,
weiyongjun1, wangkefeng.wang, judy.chenhui, yusongping,
artem.kuzin, kang.sun
On 10/30/2024 4:27 PM, Michal Hocko wrote:
> On Wed 30-10-24 15:51:00, Gutierrez Asier wrote:
>>
>>
>> On 10/30/2024 11:38 AM, Michal Hocko wrote:
>>> On Wed 30-10-24 16:33:08, gutierrez.asier@huawei-partners.com wrote:
>>>> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>>>>
>>>> Currently THP modes are set globally. It can be an overkill if only some
>>>> specific app/set of apps need to get benefits from THP usage. Moreover, various
>>>> apps might need different THP settings. Here we propose a cgroup-based THP
>>>> control mechanism.
>>>>
>>>> THP interface is added to memory cgroup subsystem. Existing global THP control
>>>> semantics is supported for backward compatibility. When THP modes are set
>>>> globally all the changes are propagated to memory cgroups. However, when a
>>>> particular cgroup changes its THP policy, the global THP policy in sysfs remains
>>>> the same.
>>>
>>> Do you have any specific examples where this would be beneficial?
>>
>> Now we're mostly focused on database scenarios (MySQL, Redis).
>
> That seems to be more process than workload oriented. Why the existing
> per-process tuning doesn't work?
>
> [...]
1st Point
We're trying to provide a transparent mechanism, but all the existing per-process
methods require modifying the app itself (MADV_HUGEPAGE, MADV_COLLAPSE, hugetlbfs).
Moreover, we're using file-backed THPs too (mostly for .text), which makes things
even more complicated for user-space developers.
>>>> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
>>>> cgroup mode changes aren't propagated to child cgroups.
>>>
>>> So this breaks hierarchical property, doesn't it? In other words if a
>>> parent cgroup would like to enforce a certain policy to all descendants
>>> then this is not really possible.
>>
>> The first idea was to have some flexibility when changing THP policies.
>>
>> I will submit a new patch set which will enforce the cgroup hierarchy and change all
>> the children recursively.
>
> What is the expected semantics then?
2nd point (on semantics)
1. Children inherit the THP policy upon creation
2. Parent's policy changes are propagated to all the children
3. Children can set the policy independently
--
Asier Gutierrez
Huawei
* Re: [RFC PATCH 0/3] Cgroup-based THP control
2024-10-30 14:45 ` Chris Down
@ 2024-10-30 15:04 ` Michal Hocko
0 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2024-10-30 15:04 UTC (permalink / raw)
To: Chris Down
Cc: gutierrez.asier, akpm, david, ryan.roberts, baohua, willy,
peterx, hannes, hocko, roman.gushchin, shakeel.butt, muchun.song,
cgroups, linux-mm, linux-kernel, stepanov.anatoly,
alexander.kozhevnikov, guohanjun, weiyongjun1, wangkefeng.wang,
judy.chenhui, yusongping, artem.kuzin, kang.sun
On Wed 30-10-24 14:45:24, Chris Down wrote:
> gutierrez.asier@huawei-partners.com writes:
> > New memcg files are exposed: memory.thp_enabled and memory.thp_defrag, which
> > have completely the same format as global THP enabled/defrag.
>
> cgroup controls exist because there are things we want to do for an entire
> class of processes (group OOM, resource control, etc). Enabling or disabling
> some specific setting is generally not one of them, hence why we got rid of
> things like per-cgroup vm.swappiness. We know that these controls do not
> compose well and have caused a lot of pain in the past. So my immediate
> reaction is a nack on the general concept, unless there's some absolutely
> compelling case here.
>
> I talked a little at Kernel Recipes last year about moving away from sysctl
> and other global interfaces and making things more granular. Don't get me
> wrong, I think that is a good thing (although, of course, a very large
> undertaking) -- but it is a mistake to overload the amount of controls we
> expose as part of the cgroup interface.
Completely agreed!
--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 0/3] Cgroup-based THP control
2024-10-30 8:33 [RFC PATCH 0/3] Cgroup-based THP control gutierrez.asier
` (5 preceding siblings ...)
2024-10-30 14:45 ` Chris Down
@ 2024-10-30 15:08 ` Johannes Weiner
2024-11-01 12:44 ` Stepanov Anatoly
6 siblings, 1 reply; 27+ messages in thread
From: Johannes Weiner @ 2024-10-30 15:08 UTC (permalink / raw)
To: gutierrez.asier
Cc: akpm, david, ryan.roberts, baohua, willy, peterx, hocko,
roman.gushchin, shakeel.butt, muchun.song, cgroups, linux-mm,
linux-kernel, stepanov.anatoly, alexander.kozhevnikov, guohanjun,
weiyongjun1, wangkefeng.wang, judy.chenhui, yusongping,
artem.kuzin, kang.sun
On Wed, Oct 30, 2024 at 04:33:08PM +0800, gutierrez.asier@huawei-partners.com wrote:
> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>
> Currently THP modes are set globally. It can be an overkill if only some
> specific app/set of apps need to get benefits from THP usage. Moreover, various
> apps might need different THP settings. Here we propose a cgroup-based THP
> control mechanism.
>
> THP interface is added to memory cgroup subsystem. Existing global THP control
> semantics is supported for backward compatibility. When THP modes are set
> globally all the changes are propagated to memory cgroups. However, when a
> particular cgroup changes its THP policy, the global THP policy in sysfs remains
> the same.
>
> New memcg files are exposed: memory.thp_enabled and memory.thp_defrag, which
> have completely the same format as global THP enabled/defrag.
>
> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
> cgroup mode changes aren't propagated to child cgroups.
Cgroups are for hierarchical resource distribution. It's tempting to
add parameters you would want for flat collections of processes, but
it gets weird when it comes to inheritance and hierarchical semantics
inside the cgroup tree - like it does here. So this is not a good fit.
On this particular issue, I agree with Willy and David: let's not
proliferate THP knobs; let's focus on making them truly transparent.
* Re: [RFC PATCH 0/3] Cgroup-based THP control
2024-10-30 14:58 ` Gutierrez Asier
@ 2024-10-30 15:15 ` Michal Hocko
2024-10-31 6:06 ` Stepanov Anatoly
0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2024-10-30 15:15 UTC (permalink / raw)
To: Gutierrez Asier
Cc: akpm, david, ryan.roberts, baohua, willy, peterx, hannes, hocko,
roman.gushchin, shakeel.butt, muchun.song, cgroups, linux-mm,
linux-kernel, stepanov.anatoly, alexander.kozhevnikov, guohanjun,
weiyongjun1, wangkefeng.wang, judy.chenhui, yusongping,
artem.kuzin, kang.sun
On Wed 30-10-24 17:58:04, Gutierrez Asier wrote:
>
>
> On 10/30/2024 4:27 PM, Michal Hocko wrote:
> > On Wed 30-10-24 15:51:00, Gutierrez Asier wrote:
> >>
> >>
> >> On 10/30/2024 11:38 AM, Michal Hocko wrote:
> >>> On Wed 30-10-24 16:33:08, gutierrez.asier@huawei-partners.com wrote:
> >>>> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
> >>>>
> >>>> Currently THP modes are set globally. It can be an overkill if only some
> >>>> specific app/set of apps need to get benefits from THP usage. Moreover, various
> >>>> apps might need different THP settings. Here we propose a cgroup-based THP
> >>>> control mechanism.
> >>>>
> >>>> THP interface is added to memory cgroup subsystem. Existing global THP control
> >>>> semantics is supported for backward compatibility. When THP modes are set
> >>>> globally all the changes are propagated to memory cgroups. However, when a
> >>>> particular cgroup changes its THP policy, the global THP policy in sysfs remains
> >>>> the same.
> >>>
> >>> Do you have any specific examples where this would be beneficial?
> >>
> >> Now we're mostly focused on database scenarios (MySQL, Redis).
> >
> > That seems to be more process than workload oriented. Why the existing
> > per-process tuning doesn't work?
> >
> > [...]
>
> 1st Point
>
> We're trying to provide a transparent mechanism, but all the existing per-process
> methods require to modify an app itself (MADV_HUGE, MADV_COLLAPSE, hugetlbfs)
There is also prctl to define per-process policy. We currently have
means to disable THP for the process to override the default behavior.
That would be mostly transparent for the application.
You have not really answered a more fundamental question though. Why should the
THP behavior be defined at the cgroup scope? From a practical POV that
would represent containers, which are a mixed bag of applications that
support the workload. Why does the same THP policy apply to all of them?
Doesn't this make the sub-optimal global behavior the same at the cgroup
level, where some parts will benefit while others will not?
> Moreover we're using file-backed THPs too (for .text mostly), which make it for
> user-space developers even more complicated.
>
> >>>> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
> >>>> cgroup mode changes aren't propagated to child cgroups.
> >>>
> >>> So this breaks hierarchical property, doesn't it? In other words if a
> >>> parent cgroup would like to enforce a certain policy to all descendants
> >>> then this is not really possible.
> >>
> >> The first idea was to have some flexibility when changing THP policies.
> >>
> >> I will submit a new patch set which will enforce the cgroup hierarchy and change all
> >> the children recursively.
> >
> > What is the expected semantics then?
>
> 2nd point (on semantics)
> 1. Children inherit the THP policy upon creation
> 2. Parent's policy changes are propagated to all the children
> 3. Children can set the policy independently
So if the parent decides that none of the children should be using THP,
they can override that, and the tuning at the parent has no imperative
control. This breaks the hierarchical property that is expected of
cgroup control files.
--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 0/3] Cgroup-based THP control
2024-10-30 15:15 ` Michal Hocko
@ 2024-10-31 6:06 ` Stepanov Anatoly
2024-10-31 8:33 ` Michal Hocko
0 siblings, 1 reply; 27+ messages in thread
From: Stepanov Anatoly @ 2024-10-31 6:06 UTC (permalink / raw)
To: Michal Hocko, Gutierrez Asier
Cc: akpm, david, ryan.roberts, baohua, willy, peterx, hannes, hocko,
roman.gushchin, shakeel.butt, muchun.song, cgroups, linux-mm,
linux-kernel, alexander.kozhevnikov, guohanjun, weiyongjun1,
wangkefeng.wang, judy.chenhui, yusongping, artem.kuzin, kang.sun
On 10/30/2024 6:15 PM, Michal Hocko wrote:
> On Wed 30-10-24 17:58:04, Gutierrez Asier wrote:
>>
>>
>> On 10/30/2024 4:27 PM, Michal Hocko wrote:
>>> On Wed 30-10-24 15:51:00, Gutierrez Asier wrote:
>>>>
>>>>
>>>> On 10/30/2024 11:38 AM, Michal Hocko wrote:
>>>>> On Wed 30-10-24 16:33:08, gutierrez.asier@huawei-partners.com wrote:
>>>>>> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>>>>>>
>>>>>> Currently THP modes are set globally. It can be an overkill if only some
>>>>>> specific app/set of apps need to get benefits from THP usage. Moreover, various
>>>>>> apps might need different THP settings. Here we propose a cgroup-based THP
>>>>>> control mechanism.
>>>>>>
>>>>>> THP interface is added to memory cgroup subsystem. Existing global THP control
>>>>>> semantics is supported for backward compatibility. When THP modes are set
>>>>>> globally all the changes are propagated to memory cgroups. However, when a
>>>>>> particular cgroup changes its THP policy, the global THP policy in sysfs remains
>>>>>> the same.
>>>>>
>>>>> Do you have any specific examples where this would be beneficial?
>>>>
>>>> Now we're mostly focused on database scenarios (MySQL, Redis).
>>>
>>> That seems to be more process than workload oriented. Why the existing
>>> per-process tuning doesn't work?
>>>
>>> [...]
>>
>> 1st Point
>>
>> We're trying to provide a transparent mechanism, but all the existing per-process
>> methods require to modify an app itself (MADV_HUGE, MADV_COLLAPSE, hugetlbfs)
>
> There is also prctl to define per-process policy. We currently have
> means to disable THP for the process to override the defeault behavior.
> That would be mostly transparent for the application.
(Answering as a co-author of the feature)
Since prctl(PR_SET_THP_DISABLE) can only be used from the calling thread,
it needs app developer participation anyway.
In theory, a kind of launcher process could be used to utilize the inheritance
of the corresponding prctl THP setting, but this does not seem transparent
to user space.
And what if we'd like to enable THP for a specific set of unrelated (in terms of
parent-child) tasks?
IMHO, an alternative approach would be changing the per-process THP mode by PID,
thus also avoiding any user app changes.
But that kind of thing doesn't exist yet.
Anyway, it would require maintaining a set of PIDs for a specific group of
processes, which is also extra work for a sysadmin.
>
> You have not really answered a more fundamental question though. Why the
> THP behavior should be at the cgroup scope? From a practical POV that
> would represent containers which are a mixed bag of applications to
> support the workload. Why does the same THP policy apply to all of them?
For THP there are three possible levels of control:
- global THP
- THP per group of processes
- THP per process
I agree that, in a container, different apps might have different
THP requirements.
But it also depends on many factors, such as the container "size"
(tiny/huge container) and the diversity of apps/functions inside the
container.
I mean, in some cases we might not need to go below the "per-group"
level of THP control.
>
> Doesn't this make the sub-optimal global behavior the same on the cgroup
> level when some parts will benefit while others will not?
>
I think the key idea behind the sub-optimal behavior is
"predictability", so we know for sure which apps/services will consume
THPs.
We observed significant THP usage on an almost idle Ubuntu server with
a simple test running (some random system services consumed a few
hundred MB of THPs).
Of course, on other distros we might see a different situation.
But with fine-grained per-group control it's a lot more predictable.
Did I get your question right?
>> Moreover, we're using file-backed THPs too (mostly for .text), which
>> makes things even more complicated for user-space developers.
>>
>>>>>> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
>>>>>> cgroup mode changes aren't propagated to child cgroups.
>>>>>
>>>>> So this breaks hierarchical property, doesn't it? In other words if a
>>>>> parent cgroup would like to enforce a certain policy to all descendants
>>>>> then this is not really possible.
>>>>
>>>> The first idea was to have some flexibility when changing THP policies.
>>>>
>>>> I will submit a new patch set which will enforce the cgroup hierarchy and change all
>>>> the children recursively.
>>>
>>> What is the expected semantics then?
>>
>> 2nd point (on semantics)
>> 1. Children inherit the THP policy upon creation
>> 2. Parent's policy changes are propagated to all the children
>> 3. Children can set the policy independently
>
> So if the parent decides that none of the children should be using THP
> they can override that so the tuning at parent has no imperative
> control. This is breaking hierarchical property that is expected from
> cgroup control files.
Actually, I think we can solve this.
As we mostly need just a single child level, the "flat" case
(root->child) is enough: interpret the root-memcg THP mode as the
"global THP setting", and forbid sub-children from overriding an
inherited THP mode.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 0/3] Cgroup-based THP control
2024-10-31 6:06 ` Stepanov Anatoly
@ 2024-10-31 8:33 ` Michal Hocko
2024-10-31 14:37 ` Stepanov Anatoly
0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2024-10-31 8:33 UTC (permalink / raw)
To: Stepanov Anatoly
Cc: Gutierrez Asier, akpm, david, ryan.roberts, baohua, willy,
peterx, hannes, hocko, roman.gushchin, shakeel.butt, muchun.song,
cgroups, linux-mm, linux-kernel, alexander.kozhevnikov,
guohanjun, weiyongjun1, wangkefeng.wang, judy.chenhui,
yusongping, artem.kuzin, kang.sun
On Thu 31-10-24 09:06:47, Stepanov Anatoly wrote:
[...]
> As prctl(PR_SET_THP_DISABLE) can only be applied by the calling thread,
> it needs application developer participation anyway.
> In theory, a kind of launcher process could be used to exploit the
> inheritance of the corresponding prctl THP setting, but this does not
> seem transparent for user space.
No, this is not just theory. This is a very common usage pattern that
allows changing the behavior of the target application transparently.
> And what if we'd like to enable THP for a specific set of unrelated (in terms of parent-child)
> tasks?
This is what I had in mind. Currently we only have the THP disable
option. If we really need an override to enforce THP on an application,
then this could be a more viable path.
> IMHO, an alternative approach would be changing per-process THP-mode by PID,
> thus also avoiding any user app. changes.
We already have process_madvise. MADV_HUGEPAGE and MADV_COLLAPSE are
not supported there, but we can discuss that option of course. This
interface requires much more orchestration because it is VMA-range
based.
> > You have not really answered a more fundamental question though. Why the
> > THP behavior should be at the cgroup scope? From a practical POV that
> > would represent containers which are a mixed bag of applications to
> > support the workload. Why does the same THP policy apply to all of them?
>
> For THP there're 3 possible levels of fine-control:
> - global THP
> - THP per-group of processes
> - THP per-process
>
> I agree, that in a container, different apps might have different
> THP requirements.
> But it also depends on many factors, such as:
> container "size"(tiny/huge container), diversity of apps/functions inside a container.
> I mean, for some cases, we might not need to go below "per-group" level in terms of THP control.
I am sorry, but I do not really see any argument why this should be
per-memcg. Quite the contrary: having that per memcg seems muddier.
> > Doesn't this make the sub-optimal global behavior the same on the cgroup
> > level when some parts will benefit while others will not?
> >
>
> I think the key idea behind the sub-optimal behavior is
> "predictability", so we know for sure which apps/services will consume
> THPs.
OK, that seems fair.
> We observed significant THP usage on an almost idle Ubuntu server with
> a simple test running (some random system services consumed a few
> hundred MB of THPs).
I assume that you are using "always" as the global default
configuration, right? If that is the case, then high (in fact as high
as feasible) THP utilization is the actual goal. If you want more
targeted THP use, then madvise is what you are looking for. That will
not help applications which are not THP-aware of course, but then we
are back to the discussion of whether the interface should be a) per
process, b) per cgroup, or c) process_madvise.
> Of course, on other distros we might see a different situation.
> But with fine-grained per-group control it's a lot more predictable.
>
> Did I get your question right?
Not really, but at least I do (hopefully) understand that you are
trying to work around THP overuse by making the global default more
restrictive while letting some workloads be less restrictive. The
question of why pushing that down to memcg scope makes the situation
better is not answered AFAICT.
[...]
> > So if the parent decides that none of the children should be using THP
> > they can override that so the tuning at parent has no imperative
> > control. This is breaking hierarchical property that is expected from
> > cgroup control files.
>
> Actually, I think we can solve this.
> As we mostly need just a single child level, the "flat" case
> (root->child) is enough: interpret the root-memcg THP mode as the
> "global THP setting", and forbid sub-children from overriding an
> inherited THP mode.
This reduced case is not really sufficient to justify the
non-hierarchical semantics, I am afraid. There must be a _really_
strong case to break this property, and even then I am rather
skeptical, to be honest. We have been burnt by introducing knobs like
memory.swappiness that seemed like a good idea initially but backfired
with unexpected behavior for many users.
--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 0/3] Cgroup-based THP control
2024-10-31 8:33 ` Michal Hocko
@ 2024-10-31 14:37 ` Stepanov Anatoly
2024-11-01 7:35 ` Michal Hocko
2024-11-01 16:01 ` Matthew Wilcox
0 siblings, 2 replies; 27+ messages in thread
From: Stepanov Anatoly @ 2024-10-31 14:37 UTC (permalink / raw)
To: Michal Hocko
Cc: Gutierrez Asier, akpm, david, ryan.roberts, baohua, willy,
peterx, hannes, hocko, roman.gushchin, shakeel.butt, muchun.song,
cgroups, linux-mm, linux-kernel, alexander.kozhevnikov,
guohanjun, weiyongjun1, wangkefeng.wang, judy.chenhui,
yusongping, artem.kuzin, kang.sun, nikita.panov
On 10/31/2024 11:33 AM, Michal Hocko wrote:
> On Thu 31-10-24 09:06:47, Stepanov Anatoly wrote:
> [...]
>> As prctl(PR_SET_THP_DISABLE) can only be applied by the calling thread,
>> it needs application developer participation anyway.
>> In theory, a kind of launcher process could be used to exploit the
>> inheritance of the corresponding prctl THP setting, but this does not
>> seem transparent for user space.
>
> No, this is not just theory. This is a very common usage pattern that
> allows changing the behavior of the target application transparently.
>
>> And what if we'd like to enable THP for a specific set of unrelated (in terms of parent-child)
>> tasks?
>
> This is what I've had in mind. Currently we only have THP disable
> option. If we really need an override to enforce THP on an application
> then this could be a more viable path.
>
>> IMHO, an alternative approach would be changing per-process THP-mode by PID,
>> thus also avoiding any user app. changes.
>
> We already have process_madvise. MADV_HUGEPAGE and MADV_COLLAPSE are
> not supported there, but we can discuss that option of course. This
> interface requires much more orchestration because it is VMA-range
> based.
>
If we consider the inheritance approach (prctl + launcher), it's fine
until we need to change the THP mode for several tasks at once; in that
case some batch-change approach is needed.
If, for example, process_madvise() supported recursive task logic,
coupled with something like MADV_HUGE + *ITERATE_ALL_VMA*, it would be
helpful.
In that case, the orchestration would be much easier.
>>> You have not really answered a more fundamental question though. Why the
>>> THP behavior should be at the cgroup scope? From a practical POV that
>>> would represent containers which are a mixed bag of applications to
>>> support the workload. Why does the same THP policy apply to all of them?
>>
>> For THP there're 3 possible levels of fine-control:
>> - global THP
>> - THP per-group of processes
>> - THP per-process
>>
>> I agree, that in a container, different apps might have different
>> THP requirements.
>> But it also depends on many factors, such as:
>> container "size"(tiny/huge container), diversity of apps/functions inside a container.
>> I mean, for some cases, we might not need to go below "per-group" level in terms of THP control.
>
> I am sorry, but I do not really see any argument why this should be
> per-memcg. Quite the contrary: having that per memcg seems muddier.
>
>>> Doesn't this make the sub-optimal global behavior the same on the cgroup
>>> level when some parts will benefit while others will not?
>>>
>>
>> I think the key idea for the sub-optimal behavior is "predictability",
>> so we know for sure which apps/services would consume THPs.
>
> OK, that seems fair.
>
>> We observed significant THP usage on an almost idle Ubuntu server with
>> a simple test running (some random system services consumed a few
>> hundred MB of THPs).
>
> I assume that you are using Always as global default configuration,
> right? If that is the case then the high (in fact as high as feasible)
> THP utilization is a real goal. If you want more targeted THP use then
> madvise is what you are looking for. This will not help applications
> which are not THP aware of course but then we are back to the discussion
> whether the interface should be per a) per process b) per cgroup c)
> process_madvise.
>
>> Of course, on other distros we might see a different situation.
>> But with fine-grained per-group control it's a lot more predictable.
>>
>> Did I get your question right?
>
> Not really but at least I do understand (hopefully) that you are trying
> to workaround THP overuse by changing the global default to be more
> restrictive while some workloads to be less restrictive. The question
> why pushing that down to memcg scope makes the situation better is not
> answered AFAICT.
>
Don't get us wrong, we're not trying to push this into memcg
specifically. We're just trying to find a proper/friendly way to
control the THP mode for a group of processes (which can be tasks
without a common parent).
Maybe if the process grouping logic were decoupled from the
hierarchical resource control logic, it would be possible to gather
multiple processes and batch-control some task properties.
But that would require building a kind of task-properties system, where
a given set of properties can be flexibly assigned to one or more
tasks.
Anyway, I think we are going to try the alternative approaches (prctl,
process_madvise) first.
> [...]
>>> So if the parent decides that none of the children should be using THP
>>> they can override that so the tuning at parent has no imperative
>>> control. This is breaking hierarchical property that is expected from
>>> cgroup control files.
>>
>> Actually, I think we can solve this.
>> As we mostly need just a single child level, the "flat" case
>> (root->child) is enough: interpret the root-memcg THP mode as the
>> "global THP setting", and forbid sub-children from overriding an
>> inherited THP mode.
>
> This reduced case is not really sufficient to justify the
> non-hierarchical semantics, I am afraid. There must be a _really_
> strong case to break this property, and even then I am rather
> skeptical, to be honest. We have been burnt by introducing knobs like
> memory.swappiness that seemed like a good idea initially but backfired
> with unexpected behavior for many users.
>
--
Anatoly Stepanov, Huawei
* Re: [RFC PATCH 0/3] Cgroup-based THP control
2024-10-31 14:37 ` Stepanov Anatoly
@ 2024-11-01 7:35 ` Michal Hocko
2024-11-01 11:54 ` Stepanov Anatoly
2024-11-01 16:01 ` Matthew Wilcox
1 sibling, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2024-11-01 7:35 UTC (permalink / raw)
To: Stepanov Anatoly
Cc: Gutierrez Asier, akpm, david, ryan.roberts, baohua, willy,
peterx, hannes, hocko, roman.gushchin, shakeel.butt, muchun.song,
cgroups, linux-mm, linux-kernel, alexander.kozhevnikov,
guohanjun, weiyongjun1, wangkefeng.wang, judy.chenhui,
yusongping, artem.kuzin, kang.sun, nikita.panov
On Thu 31-10-24 17:37:12, Stepanov Anatoly wrote:
> If we consider the inheritance approach (prctl + launcher), it's fine
> until we need to change the THP mode for several tasks at once; in that
> case some batch-change approach is needed.
I do not follow. How is this any different from a single process? Or do
you mean to change the mode for an already running process?
> If, for example, process_madvise() supported recursive task logic,
> coupled with something like MADV_HUGE + *ITERATE_ALL_VMA*, it would be
> helpful.
> In that case, the orchestration would be much easier.
Nope, process_madvise is a pidfd-based interface, and making it
recursive seems simply impossible for most operations, as the address
space is very likely different in each child process.
--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 0/3] Cgroup-based THP control
2024-11-01 7:35 ` Michal Hocko
@ 2024-11-01 11:54 ` Stepanov Anatoly
2024-11-01 13:15 ` Michal Hocko
0 siblings, 1 reply; 27+ messages in thread
From: Stepanov Anatoly @ 2024-11-01 11:54 UTC (permalink / raw)
To: Michal Hocko
Cc: Gutierrez Asier, akpm, david, ryan.roberts, baohua, willy,
peterx, hannes, hocko, roman.gushchin, shakeel.butt, muchun.song,
cgroups, linux-mm, linux-kernel, alexander.kozhevnikov,
guohanjun, weiyongjun1, wangkefeng.wang, judy.chenhui,
yusongping, artem.kuzin, kang.sun, nikita.panov
On 11/1/2024 10:35 AM, Michal Hocko wrote:
> On Thu 31-10-24 17:37:12, Stepanov Anatoly wrote:
>> If we consider the inheritance approach (prctl + launcher), it's fine until we need to change
>> THP mode property for several tasks at once, in this case some batch-change approach needed.
>
> I do not follow. How is this any different from a single process? Or do
> you mean to change the mode for an already running process?
>
Yes, for an already-running set of processes.
--
Anatoly Stepanov, Huawei
* Re: [RFC PATCH 0/3] Cgroup-based THP control
2024-10-30 15:08 ` Johannes Weiner
@ 2024-11-01 12:44 ` Stepanov Anatoly
0 siblings, 0 replies; 27+ messages in thread
From: Stepanov Anatoly @ 2024-11-01 12:44 UTC (permalink / raw)
To: Johannes Weiner, gutierrez.asier
Cc: akpm, david, ryan.roberts, baohua, willy, peterx, hocko,
roman.gushchin, shakeel.butt, muchun.song, cgroups, linux-mm,
linux-kernel, alexander.kozhevnikov, guohanjun, weiyongjun1,
wangkefeng.wang, judy.chenhui, yusongping, artem.kuzin, kang.sun
On 10/30/2024 6:08 PM, Johannes Weiner wrote:
> On Wed, Oct 30, 2024 at 04:33:08PM +0800, gutierrez.asier@huawei-partners.com wrote:
>> From: Asier Gutierrez <gutierrez.asier@huawei-partners.com>
>>
>> Currently THP modes are set globally. It can be an overkill if only some
>> specific app/set of apps need to get benefits from THP usage. Moreover, various
>> apps might need different THP settings. Here we propose a cgroup-based THP
>> control mechanism.
>>
>> THP interface is added to memory cgroup subsystem. Existing global THP control
>> semantics is supported for backward compatibility. When THP modes are set
>> globally all the changes are propagated to memory cgroups. However, when a
>> particular cgroup changes its THP policy, the global THP policy in sysfs remains
>> the same.
>>
>> New memcg files are exposed: memory.thp_enabled and memory.thp_defrag, which
>> have completely the same format as global THP enabled/defrag.
>>
>> Child cgroups inherit THP settings from parent cgroup upon creation. Particular
>> cgroup mode changes aren't propagated to child cgroups.
>
> Cgroups are for hierarchical resource distribution. It's tempting to
> add parameters you would want for flat collections of processes, but
> it gets weird when it comes to inheritance and hiearchical semantics
> inside the cgroup tree - like it does here. So this is not a good fit.
>
> On this particular issue, I agree with what Willy and David: let's not
> proliferate THP knobs; let's focus on making them truly transparent.
We're also thinking about a THP-limit direction (as mentioned in the
cover letter): just have a per-cgroup THP limit and only global THP
knobs, with a couple of additional global modes
(always-cgroup/madvise-cgroup).
"always-cgroup", for instance, would enable THP for those tasks which
are attached to a non-root memcg.
A per-cgroup THP limit could be used in combination with the global THP
knobs, and this way we can maintain hierarchical semantics.
For now it's just an idea; maybe it's better to have another RFC patch
set for this, to allow a more productive conversation.
--
Anatoly Stepanov, Huawei
* Re: [RFC PATCH 0/3] Cgroup-based THP control
2024-11-01 11:54 ` Stepanov Anatoly
@ 2024-11-01 13:15 ` Michal Hocko
2024-11-01 13:24 ` Stepanov Anatoly
0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2024-11-01 13:15 UTC (permalink / raw)
To: Stepanov Anatoly
Cc: Gutierrez Asier, akpm, david, ryan.roberts, baohua, willy,
peterx, hannes, hocko, roman.gushchin, shakeel.butt, muchun.song,
cgroups, linux-mm, linux-kernel, alexander.kozhevnikov,
guohanjun, weiyongjun1, wangkefeng.wang, judy.chenhui,
yusongping, artem.kuzin, kang.sun, nikita.panov
On Fri 01-11-24 14:54:27, Stepanov Anatoly wrote:
> On 11/1/2024 10:35 AM, Michal Hocko wrote:
> > On Thu 31-10-24 17:37:12, Stepanov Anatoly wrote:
> >> If we consider the inheritance approach (prctl + launcher), it's fine until we need to change
> >> THP mode property for several tasks at once, in this case some batch-change approach needed.
> >
> > I do not follow. How is this any different from a single process? Or do
> > you mean to change the mode for an already running process?
> >
> yes, for already running set of processes
Why is that preferred over setting the policy upfront?
--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 0/3] Cgroup-based THP control
2024-11-01 13:15 ` Michal Hocko
@ 2024-11-01 13:24 ` Stepanov Anatoly
2024-11-01 13:28 ` Michal Hocko
0 siblings, 1 reply; 27+ messages in thread
From: Stepanov Anatoly @ 2024-11-01 13:24 UTC (permalink / raw)
To: Michal Hocko
Cc: Gutierrez Asier, akpm, david, ryan.roberts, baohua, willy,
peterx, hannes, hocko, roman.gushchin, shakeel.butt, muchun.song,
cgroups, linux-mm, linux-kernel, alexander.kozhevnikov,
guohanjun, weiyongjun1, wangkefeng.wang, judy.chenhui,
yusongping, artem.kuzin, kang.sun, nikita.panov
On 11/1/2024 4:15 PM, Michal Hocko wrote:
> On Fri 01-11-24 14:54:27, Stepanov Anatoly wrote:
>> On 11/1/2024 10:35 AM, Michal Hocko wrote:
>>> On Thu 31-10-24 17:37:12, Stepanov Anatoly wrote:
>>>> If we consider the inheritance approach (prctl + launcher), it's fine until we need to change
>>>> THP mode property for several tasks at once, in this case some batch-change approach needed.
>>>
>>> I do not follow. How is this any different from a single process? Or do
>>> you mean to change the mode for an already running process?
>>>
>> yes, for already running set of processes
>
> Why is that preferred over setting the policy upfront?
Setting the policy in advance is fine as a first step.
But we might not know in advance which exact policy is most beneficial
for one set of apps or another.
So I think it's better to have the ability to change the policy on the
fly, without an app/service restart.
--
Anatoly Stepanov, Huawei
* Re: [RFC PATCH 0/3] Cgroup-based THP control
2024-11-01 13:24 ` Stepanov Anatoly
@ 2024-11-01 13:28 ` Michal Hocko
2024-11-01 13:39 ` Stepanov Anatoly
0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2024-11-01 13:28 UTC (permalink / raw)
To: Stepanov Anatoly
Cc: Gutierrez Asier, akpm, david, ryan.roberts, baohua, willy,
peterx, hannes, hocko, roman.gushchin, shakeel.butt, muchun.song,
cgroups, linux-mm, linux-kernel, alexander.kozhevnikov,
guohanjun, weiyongjun1, wangkefeng.wang, judy.chenhui,
yusongping, artem.kuzin, kang.sun, nikita.panov
On Fri 01-11-24 16:24:55, Stepanov Anatoly wrote:
> On 11/1/2024 4:15 PM, Michal Hocko wrote:
> > On Fri 01-11-24 14:54:27, Stepanov Anatoly wrote:
> >> On 11/1/2024 10:35 AM, Michal Hocko wrote:
> >>> On Thu 31-10-24 17:37:12, Stepanov Anatoly wrote:
> >>>> If we consider the inheritance approach (prctl + launcher), it's fine until we need to change
> >>>> THP mode property for several tasks at once, in this case some batch-change approach needed.
> >>>
> >>> I do not follow. How is this any different from a single process? Or do
> >>> you mean to change the mode for an already running process?
> >>>
> >> yes, for already running set of processes
> >
>
> > Why is that preferred over setting the policy upfront?
> Setting the policy in advance is fine as a first step.
> But we might not know in advance which exact policy is most beneficial
> for one set of apps or another.
How do you plan to find that out when the application is running
already?
--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 0/3] Cgroup-based THP control
2024-11-01 13:28 ` Michal Hocko
@ 2024-11-01 13:39 ` Stepanov Anatoly
2024-11-01 13:50 ` Michal Hocko
0 siblings, 1 reply; 27+ messages in thread
From: Stepanov Anatoly @ 2024-11-01 13:39 UTC (permalink / raw)
To: Michal Hocko
Cc: Gutierrez Asier, akpm, david, ryan.roberts, baohua, willy,
peterx, hannes, hocko, roman.gushchin, shakeel.butt, muchun.song,
cgroups, linux-mm, linux-kernel, alexander.kozhevnikov,
guohanjun, weiyongjun1, wangkefeng.wang, judy.chenhui,
yusongping, artem.kuzin, kang.sun, nikita.panov
On 11/1/2024 4:28 PM, Michal Hocko wrote:
> On Fri 01-11-24 16:24:55, Stepanov Anatoly wrote:
>> On 11/1/2024 4:15 PM, Michal Hocko wrote:
>>> On Fri 01-11-24 14:54:27, Stepanov Anatoly wrote:
>>>> On 11/1/2024 10:35 AM, Michal Hocko wrote:
>>>>> On Thu 31-10-24 17:37:12, Stepanov Anatoly wrote:
>>>>>> If we consider the inheritance approach (prctl + launcher), it's fine until we need to change
>>>>>> THP mode property for several tasks at once, in this case some batch-change approach needed.
>>>>>
>>>>> I do not follow. How is this any different from a single process? Or do
>>>>> you mean to change the mode for an already running process?
>>>>>
>>>> yes, for already running set of processes
>>>
>>
>>> Why is that preferred over setting the policy upfront?
>> Setting the policy in advance is fine, as the first step to do.
>> But we might not know in advance
>> which exact policy is the most beneficial for one set of apps or another.
>
> How do you plan to find that out when the application is running
> already?
For example, if someone wants to compare DB server performance with THP
off vs. THP on, and a DB server restart isn't an option.
Of course, if a restart is OK, then we don't need such a feature; the
"launcher" approach would be enough.
If I got your question right.
--
Anatoly Stepanov, Huawei
* Re: [RFC PATCH 0/3] Cgroup-based THP control
2024-11-01 13:39 ` Stepanov Anatoly
@ 2024-11-01 13:50 ` Michal Hocko
2024-11-01 14:03 ` Stepanov Anatoly
0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2024-11-01 13:50 UTC (permalink / raw)
To: Stepanov Anatoly
Cc: Gutierrez Asier, akpm, david, ryan.roberts, baohua, willy,
peterx, hannes, hocko, roman.gushchin, shakeel.butt, muchun.song,
cgroups, linux-mm, linux-kernel, alexander.kozhevnikov,
guohanjun, weiyongjun1, wangkefeng.wang, judy.chenhui,
yusongping, artem.kuzin, kang.sun, nikita.panov
On Fri 01-11-24 16:39:07, Stepanov Anatoly wrote:
> On 11/1/2024 4:28 PM, Michal Hocko wrote:
> > On Fri 01-11-24 16:24:55, Stepanov Anatoly wrote:
> >> On 11/1/2024 4:15 PM, Michal Hocko wrote:
> >>> On Fri 01-11-24 14:54:27, Stepanov Anatoly wrote:
> >>>> On 11/1/2024 10:35 AM, Michal Hocko wrote:
> >>>>> On Thu 31-10-24 17:37:12, Stepanov Anatoly wrote:
> >>>>>> If we consider the inheritance approach (prctl + launcher), it's fine until we need to change
> >>>>>> THP mode property for several tasks at once, in this case some batch-change approach needed.
> >>>>>
> >>>>> I do not follow. How is this any different from a single process? Or do
> >>>>> you mean to change the mode for an already running process?
> >>>>>
> >>>> yes, for already running set of processes
> >>>
> >>
> >>> Why is that preferred over setting the policy upfront?
> >> Setting the policy in advance is fine, as the first step to do.
> >> But we might not know in advance
> >> which exact policy is the most beneficial for one set of apps or another.
>
> >
> > How do you plan to find that out when the application is running
> > already?
> For example, if someone wants to compare DB server performance with THP
> off vs. THP on, and a DB server restart isn't an option.
So you essentially expect the user to tell you that they want THP, and
you want to make that happen on the fly, correct? It is not like there
is actual monitoring and dynamic policing.
If that is the case, then I am not really convinced this is worthwhile
to support, TBH. I can see that a workload may know in advance that it
benefits from THP, but I am much more dubious that "learning during
runtime" is a real-life thing. I might be wrong of course, but if
somebody has performance monitoring that is able to identify
performance bottlenecks for a specific workload, then applying THP to
the whole group of processes seems like a very crude way to deal with
that. I could see a case for process_madvise(MADV_COLLAPSE) to deal
with specific memory hotspots, though.
--
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 0/3] Cgroup-based THP control
2024-11-01 13:50 ` Michal Hocko
@ 2024-11-01 14:03 ` Stepanov Anatoly
0 siblings, 0 replies; 27+ messages in thread
From: Stepanov Anatoly @ 2024-11-01 14:03 UTC (permalink / raw)
To: Michal Hocko
Cc: Gutierrez Asier, akpm, david, ryan.roberts, baohua, willy,
peterx, hannes, hocko, roman.gushchin, shakeel.butt, muchun.song,
cgroups, linux-mm, linux-kernel, alexander.kozhevnikov,
guohanjun, weiyongjun1, wangkefeng.wang, judy.chenhui,
yusongping, artem.kuzin, kang.sun, nikita.panov
On 11/1/2024 4:50 PM, Michal Hocko wrote:
> On Fri 01-11-24 16:39:07, Stepanov Anatoly wrote:
>> On 11/1/2024 4:28 PM, Michal Hocko wrote:
>>> On Fri 01-11-24 16:24:55, Stepanov Anatoly wrote:
>>>> On 11/1/2024 4:15 PM, Michal Hocko wrote:
>>>>> On Fri 01-11-24 14:54:27, Stepanov Anatoly wrote:
>>>>>> On 11/1/2024 10:35 AM, Michal Hocko wrote:
>>>>>>> On Thu 31-10-24 17:37:12, Stepanov Anatoly wrote:
>>>>>>>> If we consider the inheritance approach (prctl + launcher), it's fine until we need to change
>>>>>>>> THP mode property for several tasks at once, in this case some batch-change approach needed.
>>>>>>>
>>>>>>> I do not follow. How is this any different from a single process? Or do
>>>>>>> you mean to change the mode for an already running process?
>>>>>>>
>>>>>> yes, for already running set of processes
>>>>>
>>>>
>>>>> Why is that preferred over setting the policy upfront?
>>>> Setting the policy in advance is fine, as the first step to do.
>>>> But we might not know in advance
>>>> which exact policy is the most beneficial for one set of apps or another.
>>
>>>
>>> How do you plan to find that out when the application is running
>>> already?
>> For example, if someone willing to compare some DB server performance with THP-off vs THP-on,
>> and DB server restart isn't an option.
>
> So you essentially expect user tell you that they want THP and you want
> to make that happen on fly, correct? It is not like there is an actual
> monitoring and dynamic policing.
For a user/sysadmin this scenario is almost the same as experimenting
with the global THP settings, but with explicit THP usage and less THP
overuse by random apps, so it is more predictable.
>
> If that is the case then I am not really convinced this is a worthwhile
> to support TBH. I can see that a workload knows in advance that they
> benefit from THP but I am much more dubious about "learning during the
> runtime" is a real life thing. I might be wrong of course but if
> somebody has performance monitoring that is able to identify performance
> bottlenecks based on specific workload then applying THP on the whole
> group of processes seems like a very crude way to deal with that. I
> could see a case for process_madvise(MADV_COLLAPSE) to deal with
> specific memory hotspots though.
Yes, we have something like this in mind.
--
Anatoly Stepanov, Huawei
* Re: [RFC PATCH 0/3] Cgroup-based THP control
2024-10-31 14:37 ` Stepanov Anatoly
2024-11-01 7:35 ` Michal Hocko
@ 2024-11-01 16:01 ` Matthew Wilcox
1 sibling, 0 replies; 27+ messages in thread
From: Matthew Wilcox @ 2024-11-01 16:01 UTC (permalink / raw)
To: Stepanov Anatoly
Cc: Michal Hocko, Gutierrez Asier, akpm, david, ryan.roberts, baohua,
peterx, hannes, hocko, roman.gushchin, shakeel.butt, muchun.song,
cgroups, linux-mm, linux-kernel, alexander.kozhevnikov,
guohanjun, weiyongjun1, wangkefeng.wang, judy.chenhui,
yusongping, artem.kuzin, kang.sun, nikita.panov
On Thu, Oct 31, 2024 at 05:37:12PM +0300, Stepanov Anatoly wrote:
> Don't get us wrong, we're not trying to push this into memcg specifically.
> We're just trying to find a proper/friendly way to control the THP mode
> for a group of processes (which can be tasks without a common parent).
>
> Maybe if the process-grouping logic were decoupled from the hierarchical
> resource-control logic, it would be possible to gather multiple processes
> and batch-control some task properties. But that would require building a
> kind of task-properties system, where a given set of properties can be
> flexibly assigned to one or more tasks.
>
> Anyway, I think we are going to try the alternative approaches
> (prctl, process_madvise) first.
I oppose all of these approaches. They are fundamentally misguided.
You're trying to blame sysadmins for our inadequacies as programmers.
All of this should be automatic. Certainly the kernel will make mistakes
and not use the perfectly optimal size at all times, but it should be able
to get close to optimal. Please, focus your efforts on allocating memory
of the right size, not on this fake problem of "we only have 235 THPs
available and we must make sure that the right process gets 183 of them".
Thread overview: 27+ messages
2024-10-30 8:33 [RFC PATCH 0/3] Cgroup-based THP control gutierrez.asier
2024-10-30 8:33 ` [RFC PATCH 1/3] mm: Add thp_flags control for cgroup gutierrez.asier
2024-10-30 8:33 ` [RFC PATCH 2/3] mm: Support for huge pages in cgroups gutierrez.asier
2024-10-30 8:33 ` [RFC PATCH 3/3] mm: Add thp_defrag control for cgroup gutierrez.asier
2024-10-30 8:38 ` [RFC PATCH 0/3] Cgroup-based THP control Michal Hocko
2024-10-30 12:51 ` Gutierrez Asier
2024-10-30 13:27 ` Michal Hocko
2024-10-30 14:58 ` Gutierrez Asier
2024-10-30 15:15 ` Michal Hocko
2024-10-31 6:06 ` Stepanov Anatoly
2024-10-31 8:33 ` Michal Hocko
2024-10-31 14:37 ` Stepanov Anatoly
2024-11-01 7:35 ` Michal Hocko
2024-11-01 11:54 ` Stepanov Anatoly
2024-11-01 13:15 ` Michal Hocko
2024-11-01 13:24 ` Stepanov Anatoly
2024-11-01 13:28 ` Michal Hocko
2024-11-01 13:39 ` Stepanov Anatoly
2024-11-01 13:50 ` Michal Hocko
2024-11-01 14:03 ` Stepanov Anatoly
2024-11-01 16:01 ` Matthew Wilcox
2024-10-30 13:14 ` Matthew Wilcox
2024-10-30 13:16 ` David Hildenbrand
2024-10-30 14:45 ` Chris Down
2024-10-30 15:04 ` Michal Hocko
2024-10-30 15:08 ` Johannes Weiner
2024-11-01 12:44 ` Stepanov Anatoly