* [RFC PATCH 0/6] Selective KSM: Synchronous and Partitioned Merging
@ 2025-03-21 17:37 Sourav Panda
  2025-03-21 17:37 ` [RFC PATCH 1/6] mm: introduce SELECTIVE_KSM KConfig Sourav Panda
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: Sourav Panda @ 2025-03-21 17:37 UTC (permalink / raw)
  To: mathieu.desnoyers, willy, david, pasha.tatashin, rientjes, akpm,
	linux-mm, linux-kernel, weixugc, gthelen, souravpanda, surenb

The purpose of this RFC is to supplement our discussion in LSF/MM-25.

This is sent as a proof of concept. It applies on top of v6.14-rc7.

With the goal of increasing security and improving CPU efficiency,
we would like to propose making KSM synchronous and partitioned.

The synchronous aspect eliminates the need for ksmd running in the
background. Instead, userspace can trigger merging on a specified
memory region synchronously. In contrast to SKSM [1], which uses
MADV_MERGE, we also propose sysfs- and syscall-based alternatives.

The partitioned aspect divides the merge space into security domains.
Merging of pages only takes place within a partition, improving security.
Furthermore, the trees in each partition become smaller, improving CPU
efficiency.

Proposal 1: SYSFS Interface

  KSM_SYSFS=/sys/kernel/mm/ksm

  echo "part_1" >  ${KSM_SYSFS}/ksm/control/add_partition

  ls ${KSM_SYSFS}/part_1/
	  pages_scanned       pages_to_scan   sleep_millisecs  ...

  echo "pid start_addr end_addr" > ${KSM_SYSFS}/part_1/trigger_merge

Proposal 2: SYSCALL Interface

  A partition can be created or opened using:

    int ksm_fd = ksm_open(ksm_name, flags);
      ksm_name specifies the ksm partition to be created or opened.
      flags:
        O_CREAT
          Create the ksm partition object if it does not exist.
        O_EXCL
          If O_CREAT was also specified, and a ksm partition object
          with the given name already exists, return an error.

  Trigger the merge using:
    ksm_merge(ksm_fd, pid, start_addr, size);
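
  As a rough illustration, a userspace sketch of the proposed syscall
  interface is shown below. This is only a sketch: there are no libc
  wrappers, so raw syscall(2) is used; the numbers 467/468 come from the
  x86-64 table added in patch 6; the partition name "part_1" and the
  1 MiB test region are placeholders. The range must already be marked
  mergeable, since only VM_MERGEABLE areas are scanned.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define __NR_ksm_open  467  /* from patch 6, x86-64 only */
    #define __NR_ksm_merge 468

    int main(void)
    {
            size_t size = 1UL << 20;  /* 1 MiB example region */
            char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (buf == MAP_FAILED)
                    return 1;
            /* Register the region with KSM before triggering a merge. */
            madvise(buf, size, MADV_MERGEABLE);

            int ksm_fd = syscall(__NR_ksm_open, "part_1", O_CREAT);
            if (ksm_fd < 0) {
                    perror("ksm_open");
                    return 1;
            }

            if (syscall(__NR_ksm_merge, ksm_fd, getpid(),
                        (unsigned long)buf, size) < 0) {
                    perror("ksm_merge");
                    return 1;
            }

            close(ksm_fd);
            return 0;
    }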

[1] https://lore.kernel.org/linux-mm/20250228023043.83726-1-mathieu.desnoyers@efficios.com/

Sourav Panda (6):
  mm: introduce SELECTIVE_KSM KConfig
  mm: make Selective KSM synchronous
  mm: make Selective KSM partitioned
  mm: create dedicated trees for SELECTIVE KSM partitions
  mm: trigger unmerge and remove SELECTIVE KSM partition
  mm: syscall alternative for SELECTIVE_KSM

 arch/x86/entry/syscalls/syscall_64.tbl |   3 +-
 include/linux/ksm.h                    |   4 +
 mm/Kconfig                             |  11 +
 mm/ksm.c                               | 823 ++++++++++++++++++++++---
 4 files changed, 751 insertions(+), 90 deletions(-)

-- 
2.49.0.395.g12beb8f557-goog




* [RFC PATCH 1/6] mm: introduce SELECTIVE_KSM KConfig
  2025-03-21 17:37 [RFC PATCH 0/6] Selective KSM: Synchronous and Partitioned Merging Sourav Panda
@ 2025-03-21 17:37 ` Sourav Panda
  2025-03-21 17:37 ` [RFC PATCH 2/6] mm: make Selective KSM synchronous Sourav Panda
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Sourav Panda @ 2025-03-21 17:37 UTC (permalink / raw)
  To: mathieu.desnoyers, willy, david, pasha.tatashin, rientjes, akpm,
	linux-mm, linux-kernel, weixugc, gthelen, souravpanda, surenb

Gate the partitioned and synchronous features of SELECTIVE_KSM behind
a Kconfig option. This prevents vanilla KSM's background thread from
interfering with SELECTIVE_KSM.
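
A minimal config fragment for building a kernel with this series
applied (CONFIG_KSM is the existing dependency the new option sits
behind):

  CONFIG_KSM=y
  CONFIG_SELECTIVE_KSM=y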

Signed-off-by: Sourav Panda <souravpanda@google.com>
---
 mm/Kconfig | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 1b501db06417..f9873002414c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -783,6 +783,17 @@ config KSM
 	  until a program has madvised that an area is MADV_MERGEABLE, and
 	  root has set /sys/kernel/mm/ksm/run to 1 (if CONFIG_SYSFS is set).
 
+config SELECTIVE_KSM
+	bool "Enable Selective KSM for page merging"
+	depends on KSM
+	help
+	  Enable Synchronous and Partitioned KSM for page merging. There is
+	  no background scanning. Instead, userspace specifies the pid
+	  and address range to have merged. The partitioning aspect divides
+	  the merge space into security domains. Merging of pages only takes
+	  place within a partition, improving security. Furthermore, the
+	  trees in each partition become smaller, improving CPU efficiency.
+
 config DEFAULT_MMAP_MIN_ADDR
 	int "Low address space to protect from user allocation"
 	depends on MMU
-- 
2.49.0.395.g12beb8f557-goog




* [RFC PATCH 2/6] mm: make Selective KSM synchronous
  2025-03-21 17:37 [RFC PATCH 0/6] Selective KSM: Synchronous and Partitioned Merging Sourav Panda
  2025-03-21 17:37 ` [RFC PATCH 1/6] mm: introduce SELECTIVE_KSM KConfig Sourav Panda
@ 2025-03-21 17:37 ` Sourav Panda
  2025-03-21 17:37 ` [RFC PATCH 3/6] mm: make Selective KSM partitioned Sourav Panda
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Sourav Panda @ 2025-03-21 17:37 UTC (permalink / raw)
  To: mathieu.desnoyers, willy, david, pasha.tatashin, rientjes, akpm,
	linux-mm, linux-kernel, weixugc, gthelen, souravpanda, surenb

Make KSM synchronous by introducing the following sysfs file, which
carries out merging on the specified memory region synchronously and
eliminates the need for ksmd running in the background.

echo "pid start_addr end_addr" > /sys/kernel/mm/ksm/trigger_merge

Signed-off-by: Sourav Panda <souravpanda@google.com>
---
 mm/ksm.c | 317 +++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 271 insertions(+), 46 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 8be2b144fefd..b2f184557ed9 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -290,16 +290,18 @@ static unsigned int zero_checksum __read_mostly;
 /* Whether to merge empty (zeroed) pages with actual zero pages */
 static bool ksm_use_zero_pages __read_mostly;
 
-/* Skip pages that couldn't be de-duplicated previously */
-/* Default to true at least temporarily, for testing */
-static bool ksm_smart_scan = true;
-
 /* The number of zero pages which is placed by KSM */
 atomic_long_t ksm_zero_pages = ATOMIC_LONG_INIT(0);
 
 /* The number of pages that have been skipped due to "smart scanning" */
 static unsigned long ksm_pages_skipped;
 
+#ifndef CONFIG_SELECTIVE_KSM /* advisor immaterial if there is no scanning */
+
+/* Skip pages that couldn't be de-duplicated previously */
+/* Default to true at least temporarily, for testing */
+static bool ksm_smart_scan = true;
+
 /* Don't scan more than max pages per batch. */
 static unsigned long ksm_advisor_max_pages_to_scan = 30000;
 
@@ -465,6 +467,7 @@ static void advisor_stop_scan(void)
 	if (ksm_advisor == KSM_ADVISOR_SCAN_TIME)
 		scan_time_advisor();
 }
+#endif /* CONFIG_SELECTIVE_KSM */
 
 #ifdef CONFIG_NUMA
 /* Zeroed when merging across nodes is not allowed */
@@ -957,6 +960,25 @@ static struct folio *ksm_get_folio(struct ksm_stable_node *stable_node,
 	return NULL;
 }
 
+static unsigned char get_rmap_item_age(struct ksm_rmap_item *rmap_item)
+{
+#ifdef CONFIG_SELECTIVE_KSM /* age is immaterial in selective ksm */
+	return 0;
+#else
+	unsigned char age;
+	/*
+	 * Usually ksmd can and must skip the rb_erase, because
+	 * root_unstable_tree was already reset to RB_ROOT.
+	 * But be careful when an mm is exiting: do the rb_erase
+	 * if this rmap_item was inserted by this scan, rather
+	 * than left over from before.
+	 */
+	age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
+	WARN_ON_ONCE(age > 1);
+	return age;
+#endif /* CONFIG_SELECTIVE_KSM */
+}
+
 /*
  * Removing rmap_item from stable or unstable tree.
  * This function will clean the information from the stable/unstable tree.
@@ -991,16 +1013,7 @@ static void remove_rmap_item_from_tree(struct ksm_rmap_item *rmap_item)
 		rmap_item->address &= PAGE_MASK;
 
 	} else if (rmap_item->address & UNSTABLE_FLAG) {
-		unsigned char age;
-		/*
-		 * Usually ksmd can and must skip the rb_erase, because
-		 * root_unstable_tree was already reset to RB_ROOT.
-		 * But be careful when an mm is exiting: do the rb_erase
-		 * if this rmap_item was inserted by this scan, rather
-		 * than left over from before.
-		 */
-		age = (unsigned char)(ksm_scan.seqnr - rmap_item->address);
-		BUG_ON(age > 1);
+		unsigned char age = get_rmap_item_age(rmap_item);
 		if (!age)
 			rb_erase(&rmap_item->node,
 				 root_unstable_tree + NUMA(rmap_item->nid));
@@ -2203,6 +2216,37 @@ static void stable_tree_append(struct ksm_rmap_item *rmap_item,
 	rmap_item->mm->ksm_merging_pages++;
 }
 
+#ifdef CONFIG_SELECTIVE_KSM
+static int update_checksum(struct page *page, struct ksm_rmap_item *rmap_item)
+{
+	/*
+	 * Typically KSM would wait for a second round to even consider
+	 * the page for unstable tree insertion to ascertain its stability.
+	 * Avoid this when using selective ksm.
+	 */
+	rmap_item->oldchecksum = calc_checksum(page);
+	return 0;
+}
+#else
+static int update_checksum(struct page *page, struct ksm_rmap_item *rmap_item)
+{
+	remove_rmap_item_from_tree(rmap_item);
+
+	/*
+	 * If the hash value of the page has changed from the last time
+	 * we calculated it, this page is changing frequently: therefore we
+	 * don't want to insert it in the unstable tree, and we don't want
+	 * to waste our time searching for something identical to it there.
+	 */
+	unsigned int checksum = calc_checksum(page);
+	if (rmap_item->oldchecksum != checksum) {
+		rmap_item->oldchecksum = checksum;
+		return -EINVAL;
+	}
+	return 0;
+}
+#endif
+
 /*
  * cmp_and_merge_page - first see if page can be merged into the stable tree;
  * if not, compare checksum to previous and if it's the same, see if page can
@@ -2218,7 +2262,6 @@ static void cmp_and_merge_page(struct page *page, struct ksm_rmap_item *rmap_ite
 	struct page *tree_page = NULL;
 	struct ksm_stable_node *stable_node;
 	struct folio *kfolio;
-	unsigned int checksum;
 	int err;
 	bool max_page_sharing_bypass = false;
 
@@ -2241,20 +2284,8 @@ static void cmp_and_merge_page(struct page *page, struct ksm_rmap_item *rmap_ite
 		if (!is_page_sharing_candidate(stable_node))
 			max_page_sharing_bypass = true;
 	} else {
-		remove_rmap_item_from_tree(rmap_item);
-
-		/*
-		 * If the hash value of the page has changed from the last time
-		 * we calculated it, this page is changing frequently: therefore we
-		 * don't want to insert it in the unstable tree, and we don't want
-		 * to waste our time searching for something identical to it there.
-		 */
-		checksum = calc_checksum(page);
-		if (rmap_item->oldchecksum != checksum) {
-			rmap_item->oldchecksum = checksum;
+		if (update_checksum(page, rmap_item))
 			return;
-		}
-
 		if (!try_to_merge_with_zero_page(rmap_item, page))
 			return;
 	}
@@ -2379,6 +2410,111 @@ static struct ksm_rmap_item *get_next_rmap_item(struct ksm_mm_slot *mm_slot,
 	return rmap_item;
 }
 
+#ifdef CONFIG_SELECTIVE_KSM
+static struct ksm_rmap_item *retrieve_rmap_item(struct page **page,
+						struct mm_struct *mm,
+						unsigned long start,
+						unsigned long end)
+{
+	struct ksm_mm_slot *mm_slot;
+	struct mm_slot *slot;
+	struct vm_area_struct *vma;
+	struct ksm_rmap_item *rmap_item;
+	struct vma_iterator vmi;
+
+	lru_add_drain_all();
+
+	if (!ksm_merge_across_nodes) {
+		struct ksm_stable_node *stable_node, *next;
+		struct folio *folio;
+
+		list_for_each_entry_safe(stable_node, next,
+					 &migrate_nodes, list) {
+			folio = ksm_get_folio(stable_node, KSM_GET_FOLIO_NOLOCK);
+			if (folio)
+				folio_put(folio);
+		}
+	}
+
+	spin_lock(&ksm_mmlist_lock);
+	slot = mm_slot_lookup(mm_slots_hash, mm);
+	spin_unlock(&ksm_mmlist_lock);
+
+	if (!slot)
+		return NULL;
+	mm_slot = mm_slot_entry(slot, struct ksm_mm_slot, slot);
+
+	ksm_scan.address = 0;
+	ksm_scan.mm_slot = mm_slot;
+	ksm_scan.rmap_list = &mm_slot->rmap_list;
+
+	vma_iter_init(&vmi, mm, ksm_scan.address);
+
+	mmap_read_lock(mm);
+	for_each_vma(vmi, vma) {
+		if (!(vma->vm_flags & VM_MERGEABLE))
+			continue;
+		if (ksm_scan.address < vma->vm_start)
+			ksm_scan.address = vma->vm_start;
+		if (!vma->anon_vma)
+			ksm_scan.address = vma->vm_end;
+
+		while (ksm_scan.address < vma->vm_end) {
+			struct page *tmp_page = NULL;
+			struct folio_walk fw;
+			struct folio *folio;
+
+			if (ksm_scan.address < start || ksm_scan.address > end)
+				break;
+
+			folio = folio_walk_start(&fw, vma, ksm_scan.address, 0);
+			if (folio) {
+				if (!folio_is_zone_device(folio) &&
+				    folio_test_anon(folio)) {
+					folio_get(folio);
+					tmp_page = fw.page;
+				}
+				folio_walk_end(&fw, vma);
+			}
+
+			if (tmp_page) {
+				flush_anon_page(vma, tmp_page, ksm_scan.address);
+				flush_dcache_page(tmp_page);
+				rmap_item = get_next_rmap_item(mm_slot,
+							       ksm_scan.rmap_list,
+							       ksm_scan.address);
+				if (rmap_item) {
+					ksm_scan.rmap_list =
+							&rmap_item->rmap_list;
+					ksm_scan.address += PAGE_SIZE;
+					*page = tmp_page;
+				} else {
+					folio_put(folio);
+				}
+				mmap_read_unlock(mm);
+				return rmap_item;
+			}
+			ksm_scan.address += PAGE_SIZE;
+		}
+	}
+	mmap_read_unlock(mm);
+	return NULL;
+}
+
+static void ksm_sync_merge(struct mm_struct *mm,
+			   unsigned long start, unsigned long end)
+{
+	struct ksm_rmap_item *rmap_item;
+	struct page *page;
+
+	rmap_item = retrieve_rmap_item(&page, mm, start, end);
+	if (!rmap_item)
+		return;
+	cmp_and_merge_page(page, rmap_item);
+	put_page(page);
+}
+
+#else /* CONFIG_SELECTIVE_KSM */
 /*
  * Calculate skip age for the ksm page age. The age determines how often
  * de-duplicating has already been tried unsuccessfully. If the age is
@@ -2688,6 +2824,7 @@ static int ksm_scan_thread(void *nothing)
 	}
 	return 0;
 }
+#endif /* CONFIG_SELECTIVE_KSM */
 
 static void __ksm_add_vma(struct vm_area_struct *vma)
 {
@@ -3335,9 +3472,10 @@ static ssize_t pages_to_scan_store(struct kobject *kobj,
 	unsigned int nr_pages;
 	int err;
 
+#ifndef CONFIG_SELECTIVE_KSM
 	if (ksm_advisor != KSM_ADVISOR_NONE)
 		return -EINVAL;
-
+#endif
 	err = kstrtouint(buf, 10, &nr_pages);
 	if (err)
 		return -EINVAL;
@@ -3396,6 +3534,65 @@ static ssize_t run_store(struct kobject *kobj, struct kobj_attribute *attr,
 }
 KSM_ATTR(run);
 
+static ssize_t trigger_merge_show(struct kobject *kobj,
+				  struct kobj_attribute *attr,
+				  char *buf)
+{
+	return -EINVAL;	/* Not yet implemented */
+}
+
+static ssize_t trigger_merge_store(struct kobject *kobj,
+				   struct kobj_attribute *attr,
+				   const char *buf, size_t count)
+{
+	unsigned long start, end;
+	pid_t pid;
+	char *input, *ptr;
+	int ret;
+	struct task_struct *task;
+	struct mm_struct *mm;
+
+	input = kstrdup(buf, GFP_KERNEL);
+	if (!input)
+		return -ENOMEM;
+
+	ptr = strim(input);
+	ret = sscanf(ptr, "%d %lx %lx", &pid, &start, &end);
+	kfree(input);
+
+	if (ret != 3)
+		return -EINVAL;
+
+	if (start >= end)
+		return -EINVAL;
+
+	/* Find the mm_struct */
+	rcu_read_lock();
+	task = find_task_by_vpid(pid);
+	if (!task) {
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+
+	get_task_struct(task);
+
+	rcu_read_unlock();
+	mm = get_task_mm(task);
+	put_task_struct(task);
+
+	if (!mm)
+		return -EINVAL;
+
+	mutex_lock(&ksm_thread_mutex);
+	wait_while_offlining();
+	ksm_sync_merge(mm, start, end);
+	mutex_unlock(&ksm_thread_mutex);
+
+	mmput(mm);
+	return count;
+}
+KSM_ATTR(trigger_merge);
+
 #ifdef CONFIG_NUMA
 static ssize_t merge_across_nodes_show(struct kobject *kobj,
 				       struct kobj_attribute *attr, char *buf)
@@ -3635,6 +3832,7 @@ static ssize_t full_scans_show(struct kobject *kobj,
 }
 KSM_ATTR_RO(full_scans);
 
+#ifndef CONFIG_SELECTIVE_KSM
 static ssize_t smart_scan_show(struct kobject *kobj,
 			       struct kobj_attribute *attr, char *buf)
 {
@@ -3780,11 +3978,13 @@ static ssize_t advisor_target_scan_time_store(struct kobject *kobj,
 	return count;
 }
 KSM_ATTR(advisor_target_scan_time);
+#endif /* CONFIG_SELECTIVE_KSM */
 
 static struct attribute *ksm_attrs[] = {
 	&sleep_millisecs_attr.attr,
 	&pages_to_scan_attr.attr,
 	&run_attr.attr,
+	&trigger_merge_attr.attr,
 	&pages_scanned_attr.attr,
 	&pages_shared_attr.attr,
 	&pages_sharing_attr.attr,
@@ -3802,12 +4002,14 @@ static struct attribute *ksm_attrs[] = {
 	&stable_node_chains_prune_millisecs_attr.attr,
 	&use_zero_pages_attr.attr,
 	&general_profit_attr.attr,
+#ifndef CONFIG_SELECTIVE_KSM
 	&smart_scan_attr.attr,
 	&advisor_mode_attr.attr,
 	&advisor_max_cpu_attr.attr,
 	&advisor_min_pages_to_scan_attr.attr,
 	&advisor_max_pages_to_scan_attr.attr,
 	&advisor_target_scan_time_attr.attr,
+#endif
 	NULL,
 };
 
@@ -3815,40 +4017,63 @@ static const struct attribute_group ksm_attr_group = {
 	.attrs = ksm_attrs,
 	.name = "ksm",
 };
+
+static int __init ksm_sysfs_init(void)
+{
+	return sysfs_create_group(mm_kobj, &ksm_attr_group);
+}
+#else /* CONFIG_SYSFS */
+static int __init ksm_sysfs_init(void)
+{
+	ksm_run = KSM_RUN_MERGE;	/* no way for user to start it */
+	return 0;
+}
 #endif /* CONFIG_SYSFS */
 
-static int __init ksm_init(void)
+#ifdef CONFIG_SELECTIVE_KSM
+static int __init ksm_thread_sysfs_init(void)
+{
+	return ksm_sysfs_init();
+}
+#else /* CONFIG_SELECTIVE_KSM */
+static int __init ksm_thread_sysfs_init(void)
 {
 	struct task_struct *ksm_thread;
 	int err;
 
-	/* The correct value depends on page size and endianness */
-	zero_checksum = calc_checksum(ZERO_PAGE(0));
-	/* Default to false for backwards compatibility */
-	ksm_use_zero_pages = false;
-
-	err = ksm_slab_init();
-	if (err)
-		goto out;
-
 	ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd");
 	if (IS_ERR(ksm_thread)) {
 		pr_err("ksm: creating kthread failed\n");
 		err = PTR_ERR(ksm_thread);
-		goto out_free;
+		return err;
 	}
 
-#ifdef CONFIG_SYSFS
-	err = sysfs_create_group(mm_kobj, &ksm_attr_group);
+	err = ksm_sysfs_init();
 	if (err) {
 		pr_err("ksm: register sysfs failed\n");
 		kthread_stop(ksm_thread);
-		goto out_free;
 	}
-#else
-	ksm_run = KSM_RUN_MERGE;	/* no way for user to start it */
 
-#endif /* CONFIG_SYSFS */
+	return err;
+}
+#endif /* CONFIG_SELECTIVE_KSM */
+
+static int __init ksm_init(void)
+{
+	int err;
+
+	/* The correct value depends on page size and endianness */
+	zero_checksum = calc_checksum(ZERO_PAGE(0));
+	/* Default to false for backwards compatibility */
+	ksm_use_zero_pages = false;
+
+	err = ksm_slab_init();
+	if (err)
+		goto out;
+
+	err = ksm_thread_sysfs_init();
+	if (err)
+		goto out_free;
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
 	/* There is no significance to this priority 100 */
-- 
2.49.0.395.g12beb8f557-goog




* [RFC PATCH 3/6] mm: make Selective KSM partitioned
  2025-03-21 17:37 [RFC PATCH 0/6] Selective KSM: Synchronous and Partitioned Merging Sourav Panda
  2025-03-21 17:37 ` [RFC PATCH 1/6] mm: introduce SELECTIVE_KSM KConfig Sourav Panda
  2025-03-21 17:37 ` [RFC PATCH 2/6] mm: make Selective KSM synchronous Sourav Panda
@ 2025-03-21 17:37 ` Sourav Panda
  2025-03-21 17:37 ` [RFC PATCH 4/6] mm: create dedicated trees for SELECTIVE KSM partitions Sourav Panda
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Sourav Panda @ 2025-03-21 17:37 UTC (permalink / raw)
  To: mathieu.desnoyers, willy, david, pasha.tatashin, rientjes, akpm,
	linux-mm, linux-kernel, weixugc, gthelen, souravpanda, surenb

Create a sysfs interface to partition the KSM merge space. We add a
new sysfs file, add_partition, which is used to specify the name of
the new partition. Once a partition is created, the traditional KSM
files become available under each partition directory.

These sysfs interface changes are in preparation for the following
patch, which actually partitions the merge space (e.g., prevents
page comparison and merging across partitions).

KSM_SYSFS=/sys/kernel/mm/ksm

echo "part_1" >  ${KSM_SYSFS}/ksm/control/add_partition

ls ${KSM_SYSFS}/part_1/
	pages_scanned       pages_to_scan   sleep_millisecs  ...

echo "pid start_addr end_addr" > ${KSM_SYSFS}/part_1/trigger_merge

Signed-off-by: Sourav Panda <souravpanda@google.com>
---
 mm/ksm.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 95 insertions(+), 6 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index b2f184557ed9..927e257c48b5 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -3832,7 +3832,17 @@ static ssize_t full_scans_show(struct kobject *kobj,
 }
 KSM_ATTR_RO(full_scans);
 
-#ifndef CONFIG_SELECTIVE_KSM
+#ifdef CONFIG_SELECTIVE_KSM
+static struct kobject *ksm_base_kobj;
+
+struct partition_kobj {
+	struct kobject *kobj;
+	struct list_head list;
+};
+
+static LIST_HEAD(partition_list);
+
+#else /* CONFIG_SELECTIVE_KSM */
 static ssize_t smart_scan_show(struct kobject *kobj,
 			       struct kobj_attribute *attr, char *buf)
 {
@@ -4015,15 +4025,22 @@ static struct attribute *ksm_attrs[] = {
 
 static const struct attribute_group ksm_attr_group = {
 	.attrs = ksm_attrs,
+#ifndef CONFIG_SELECTIVE_KSM
 	.name = "ksm",
+#endif
 };
 
-static int __init ksm_sysfs_init(void)
+static int __init ksm_sysfs_init(struct kobject *kobj,
+				 const struct attribute_group *grp)
 {
-	return sysfs_create_group(mm_kobj, &ksm_attr_group);
+	int err;
+
+	err = sysfs_create_group(kobj, grp);
+	return err;
 }
 #else /* CONFIG_SYSFS */
-static int __init ksm_sysfs_init(void)
+static int __init ksm_sysfs_init(struct kobject *kobj,
+				 const struct attribute_group *grp)
 {
 	ksm_run = KSM_RUN_MERGE;	/* no way for user to start it */
 	return 0;
@@ -4031,9 +4048,81 @@ static int __init ksm_sysfs_init(void)
 #endif /* CONFIG_SYSFS */
 
 #ifdef CONFIG_SELECTIVE_KSM
+static ssize_t add_partition_store(struct kobject *kobj,
+				   struct kobj_attribute *attr,
+				   const char *buf, size_t count)
+{
+	struct partition_kobj *new_partition_kobj;
+	char partition_name[50];
+	int err;
+
+	mutex_lock(&ksm_thread_mutex);
+
+	if (count >= sizeof(partition_name)) {
+		err = -EINVAL;  /* Prevent buffer overflow */
+		goto unlock;
+	}
+
+	snprintf(partition_name, sizeof(partition_name),
+		 "%.*s", (int)(count - 1), buf); /* Remove newline */
+
+	/* Allocate memory for new dynamic kobject entry */
+	new_partition_kobj = kmalloc(sizeof(*new_partition_kobj), GFP_KERNEL);
+	if (!new_partition_kobj) {
+		err = -ENOMEM;
+		goto unlock;
+	}
+
+	new_partition_kobj->kobj = kobject_create_and_add(partition_name,
+							  ksm_base_kobj);
+	if (!new_partition_kobj->kobj) {
+		kfree(new_partition_kobj);
+		err = -ENOMEM;
+		goto unlock;
+	}
+
+	err = sysfs_create_group(new_partition_kobj->kobj, &ksm_attr_group);
+	if (err) {
+		pr_err("ksm: register sysfs failed\n");
+		kfree(new_partition_kobj);
+		err = -ENOMEM;
+		goto unlock;
+	}
+
+	list_add(&new_partition_kobj->list, &partition_list);
+
+unlock:
+	mutex_unlock(&ksm_thread_mutex);
+	return err ? err : count;
+}
+
+static struct kobj_attribute add_kobj_attr = __ATTR(add_partition, 0220, NULL,
+						    add_partition_store);
+
+/* Array of attributes for base kobject */
+static struct attribute *ksm_base_attrs[] = {
+	&add_kobj_attr.attr,
+	NULL,  /* NULL-terminated */
+};
+
+/* Attribute group for base kobject */
+static struct attribute_group ksm_base_attr_group = {
+	.name = "control",
+	.attrs = ksm_base_attrs,
+};
+
 static int __init ksm_thread_sysfs_init(void)
 {
-	return ksm_sysfs_init();
+	int err;
+
+	ksm_base_kobj = kobject_create_and_add("ksm", mm_kobj);
+	if (!ksm_base_kobj) {
+		err = -ENOMEM;
+		return err;
+	}
+
+	err = ksm_sysfs_init(ksm_base_kobj, &ksm_base_attr_group);
+	return err;
 }
 #else /* CONFIG_SELECTIVE_KSM */
 static int __init ksm_thread_sysfs_init(void)
@@ -4048,7 +4137,7 @@ static int __init ksm_thread_sysfs_init(void)
 		return err;
 	}
 
-	err = ksm_sysfs_init();
+	err = ksm_sysfs_init(mm_kobj, &ksm_attr_group);
 	if (err) {
 		pr_err("ksm: register sysfs failed\n");
 		kthread_stop(ksm_thread);
-- 
2.49.0.395.g12beb8f557-goog




* [RFC PATCH 4/6] mm: create dedicated trees for SELECTIVE KSM partitions
  2025-03-21 17:37 [RFC PATCH 0/6] Selective KSM: Synchronous and Partitioned Merging Sourav Panda
                   ` (2 preceding siblings ...)
  2025-03-21 17:37 ` [RFC PATCH 3/6] mm: make Selective KSM partitioned Sourav Panda
@ 2025-03-21 17:37 ` Sourav Panda
  2025-03-21 17:37 ` [RFC PATCH 5/6] mm: trigger unmerge and remove SELECTIVE KSM partition Sourav Panda
  2025-03-21 17:37 ` [RFC PATCH 6/6] mm: syscall alternative for SELECTIVE_KSM Sourav Panda
  5 siblings, 0 replies; 7+ messages in thread
From: Sourav Panda @ 2025-03-21 17:37 UTC (permalink / raw)
  To: mathieu.desnoyers, willy, david, pasha.tatashin, rientjes, akpm,
	linux-mm, linux-kernel, weixugc, gthelen, souravpanda, surenb

Extend ksm to create dedicated unstable and stable trees for
each partition.

Signed-off-by: Sourav Panda <souravpanda@google.com>
---
 mm/ksm.c | 165 +++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 111 insertions(+), 54 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 927e257c48b5..b575250aaf45 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -144,6 +144,28 @@ struct ksm_scan {
 	unsigned long seqnr;
 };
 
+static struct kobject *ksm_base_kobj;
+
+struct partition_kobj {
+	struct kobject *kobj;
+	struct list_head list;
+	struct rb_root *root_stable_tree;
+	struct rb_root *root_unstable_tree;
+};
+
+static LIST_HEAD(partition_list);
+
+static struct partition_kobj *find_partition_by_kobj(struct kobject *kobj)
+{
+	struct partition_kobj *partition;
+
+	list_for_each_entry(partition, &partition_list, list) {
+		if (partition->kobj == kobj)
+			return partition;
+	}
+	return NULL;
+}
+
 /**
  * struct ksm_stable_node - node of the stable rbtree
  * @node: rb node of this ksm page in the stable tree
@@ -182,6 +204,7 @@ struct ksm_stable_node {
 #ifdef CONFIG_NUMA
 	int nid;
 #endif
+	struct partition_kobj *partition;
 };
 
 /**
@@ -218,6 +241,7 @@ struct ksm_rmap_item {
 			struct hlist_node hlist;
 		};
 	};
+	struct partition_kobj *partition;
 };
 
 #define SEQNR_MASK	0x0ff	/* low bits of unstable tree seqnr */
@@ -227,8 +251,6 @@ struct ksm_rmap_item {
 /* The stable and unstable tree heads */
 static struct rb_root one_stable_tree[1] = { RB_ROOT };
 static struct rb_root one_unstable_tree[1] = { RB_ROOT };
-static struct rb_root *root_stable_tree = one_stable_tree;
-static struct rb_root *root_unstable_tree = one_unstable_tree;
 
 /* Recently migrated nodes of stable tree, pending proper placement */
 static LIST_HEAD(migrate_nodes);
@@ -555,7 +577,7 @@ static inline void stable_node_dup_del(struct ksm_stable_node *dup)
 	if (is_stable_node_dup(dup))
 		__stable_node_dup_del(dup);
 	else
-		rb_erase(&dup->node, root_stable_tree + NUMA(dup->nid));
+		rb_erase(&dup->node, dup->partition->root_stable_tree + NUMA(dup->nid));
 #ifdef CONFIG_DEBUG_VM
 	dup->head = NULL;
 #endif
@@ -580,14 +602,20 @@ static inline void free_rmap_item(struct ksm_rmap_item *rmap_item)
 	kmem_cache_free(rmap_item_cache, rmap_item);
 }
 
-static inline struct ksm_stable_node *alloc_stable_node(void)
+static inline struct ksm_stable_node *alloc_stable_node(struct partition_kobj *partition)
 {
 	/*
 	 * The allocation can take too long with GFP_KERNEL when memory is under
 	 * pressure, which may lead to hung task warnings.  Adding __GFP_HIGH
 	 * grants access to memory reserves, helping to avoid this problem.
 	 */
-	return kmem_cache_alloc(stable_node_cache, GFP_KERNEL | __GFP_HIGH);
+	struct ksm_stable_node *node =  kmem_cache_alloc(stable_node_cache,
+							 GFP_KERNEL | __GFP_HIGH);
+
+	if (node)
+		node->partition = partition;
+
+	return node;
 }
 
 static inline void free_stable_node(struct ksm_stable_node *stable_node)
@@ -777,9 +805,10 @@ static inline int get_kpfn_nid(unsigned long kpfn)
 }
 
 static struct ksm_stable_node *alloc_stable_node_chain(struct ksm_stable_node *dup,
-						   struct rb_root *root)
+						   struct rb_root *root,
+						   struct partition_kobj *partition)
 {
-	struct ksm_stable_node *chain = alloc_stable_node();
+	struct ksm_stable_node *chain = alloc_stable_node(partition);
 	VM_BUG_ON(is_stable_node_chain(dup));
 	if (likely(chain)) {
 		INIT_HLIST_HEAD(&chain->hlist);
@@ -1016,7 +1045,8 @@ static void remove_rmap_item_from_tree(struct ksm_rmap_item *rmap_item)
 		unsigned char age = get_rmap_item_age(rmap_item);
 		if (!age)
 			rb_erase(&rmap_item->node,
-				 root_unstable_tree + NUMA(rmap_item->nid));
+				 rmap_item->partition->root_unstable_tree +
+				 NUMA(rmap_item->nid));
 		ksm_pages_unshared--;
 		rmap_item->address &= PAGE_MASK;
 	}
@@ -1154,17 +1184,23 @@ static int remove_all_stable_nodes(void)
 	struct ksm_stable_node *stable_node, *next;
 	int nid;
 	int err = 0;
-
-	for (nid = 0; nid < ksm_nr_node_ids; nid++) {
-		while (root_stable_tree[nid].rb_node) {
-			stable_node = rb_entry(root_stable_tree[nid].rb_node,
-						struct ksm_stable_node, node);
-			if (remove_stable_node_chain(stable_node,
-						     root_stable_tree + nid)) {
-				err = -EBUSY;
-				break;	/* proceed to next nid */
+	struct partition_kobj *partition;
+	struct rb_root *root_stable_tree;
+
+	list_for_each_entry(partition, &partition_list, list) {
+		root_stable_tree = partition->root_stable_tree;
+
+		for (nid = 0; nid < ksm_nr_node_ids; nid++) {
+			while (root_stable_tree[nid].rb_node) {
+				stable_node = rb_entry(root_stable_tree[nid].rb_node,
+						       struct ksm_stable_node, node);
+				if (remove_stable_node_chain(stable_node,
+							     root_stable_tree + nid)) {
+					err = -EBUSY;
+					break;	/* proceed to next nid */
+				}
+				cond_resched();
 			}
-			cond_resched();
 		}
 	}
 	list_for_each_entry_safe(stable_node, next, &migrate_nodes, list) {
@@ -1802,7 +1838,8 @@ static __always_inline struct folio *chain(struct ksm_stable_node **s_n_d,
  * This function returns the stable tree node of identical content if found,
  * -EBUSY if the stable node's page is being migrated, NULL otherwise.
  */
-static struct folio *stable_tree_search(struct page *page)
+static struct folio *stable_tree_search(struct page *page,
+					struct partition_kobj *partition)
 {
 	int nid;
 	struct rb_root *root;
@@ -1821,7 +1858,7 @@ static struct folio *stable_tree_search(struct page *page)
 	}
 
 	nid = get_kpfn_nid(folio_pfn(folio));
-	root = root_stable_tree + nid;
+	root = partition->root_stable_tree + nid;
 again:
 	new = &root->rb_node;
 	parent = NULL;
@@ -1991,7 +2028,7 @@ static struct folio *stable_tree_search(struct page *page)
 		VM_BUG_ON(is_stable_node_dup(stable_node_dup));
 		/* chain is missing so create it */
 		stable_node = alloc_stable_node_chain(stable_node_dup,
-						      root);
+						      root, partition);
 		if (!stable_node)
 			return NULL;
 	}
@@ -2016,7 +2053,8 @@ static struct folio *stable_tree_search(struct page *page)
  * This function returns the stable tree node just allocated on success,
  * NULL otherwise.
  */
-static struct ksm_stable_node *stable_tree_insert(struct folio *kfolio)
+static struct ksm_stable_node *stable_tree_insert(struct folio *kfolio,
+						  struct partition_kobj *partition)
 {
 	int nid;
 	unsigned long kpfn;
@@ -2028,7 +2066,7 @@ static struct ksm_stable_node *stable_tree_insert(struct folio *kfolio)
 
 	kpfn = folio_pfn(kfolio);
 	nid = get_kpfn_nid(kpfn);
-	root = root_stable_tree + nid;
+	root = partition->root_stable_tree + nid;
 again:
 	parent = NULL;
 	new = &root->rb_node;
@@ -2067,7 +2105,7 @@ static struct ksm_stable_node *stable_tree_insert(struct folio *kfolio)
 		}
 	}
 
-	stable_node_dup = alloc_stable_node();
+	stable_node_dup = alloc_stable_node(partition);
 	if (!stable_node_dup)
 		return NULL;
 
@@ -2082,7 +2120,8 @@ static struct ksm_stable_node *stable_tree_insert(struct folio *kfolio)
 		if (!is_stable_node_chain(stable_node)) {
 			struct ksm_stable_node *orig = stable_node;
 			/* chain is missing so create it */
-			stable_node = alloc_stable_node_chain(orig, root);
+			stable_node = alloc_stable_node_chain(orig, root,
+							      partition);
 			if (!stable_node) {
 				free_stable_node(stable_node_dup);
 				return NULL;
@@ -2121,7 +2160,7 @@ struct ksm_rmap_item *unstable_tree_search_insert(struct ksm_rmap_item *rmap_ite
 	int nid;
 
 	nid = get_kpfn_nid(page_to_pfn(page));
-	root = root_unstable_tree + nid;
+	root = rmap_item->partition->root_unstable_tree + nid;
 	new = &root->rb_node;
 
 	while (*new) {
@@ -2291,7 +2330,7 @@ static void cmp_and_merge_page(struct page *page, struct ksm_rmap_item *rmap_ite
 	}
 
 	/* Start by searching for the folio in the stable tree */
-	kfolio = stable_tree_search(page);
+	kfolio = stable_tree_search(page, rmap_item->partition);
 	if (&kfolio->page == page && rmap_item->head == stable_node) {
 		folio_put(kfolio);
 		return;
@@ -2344,7 +2383,8 @@ static void cmp_and_merge_page(struct page *page, struct ksm_rmap_item *rmap_ite
 			 * node in the stable tree and add both rmap_items.
 			 */
 			folio_lock(kfolio);
-			stable_node = stable_tree_insert(kfolio);
+			stable_node = stable_tree_insert(kfolio,
+							 rmap_item->partition);
 			if (stable_node) {
 				stable_tree_append(tree_rmap_item, stable_node,
 						   false);
@@ -2502,7 +2542,8 @@ static struct ksm_rmap_item *retrieve_rmap_item(struct page **page,
 }
 
 static void ksm_sync_merge(struct mm_struct *mm,
-			   unsigned long start, unsigned long end)
+			   unsigned long start, unsigned long end,
+			   struct partition_kobj *partition)
 {
 	struct ksm_rmap_item *rmap_item;
 	struct page *page;
@@ -2510,6 +2551,7 @@ static void ksm_sync_merge(struct mm_struct *mm,
 	rmap_item = retrieve_rmap_item(&page, mm, start, end);
 	if (!rmap_item)
 		return;
+	rmap_item->partition = partition;
 	cmp_and_merge_page(page, rmap_item);
 	put_page(page);
 }
@@ -3328,19 +3370,23 @@ static void ksm_check_stable_tree(unsigned long start_pfn,
 	struct ksm_stable_node *stable_node, *next;
 	struct rb_node *node;
 	int nid;
-
-	for (nid = 0; nid < ksm_nr_node_ids; nid++) {
-		node = rb_first(root_stable_tree + nid);
-		while (node) {
-			stable_node = rb_entry(node, struct ksm_stable_node, node);
-			if (stable_node_chain_remove_range(stable_node,
-							   start_pfn, end_pfn,
-							   root_stable_tree +
-							   nid))
-				node = rb_first(root_stable_tree + nid);
-			else
-				node = rb_next(node);
-			cond_resched();
+	struct rb_root *root_stable_tree;
+	struct partition_kobj *partition;
+	list_for_each_entry(partition, &partition_list, list) {
+		root_stable_tree = partition->root_stable_tree;
+
+		for (nid = 0; nid < ksm_nr_node_ids; nid++) {
+			node = rb_first(root_stable_tree + nid);
+			while (node) {
+				stable_node = rb_entry(node, struct ksm_stable_node, node);
+				if (stable_node_chain_remove_range(stable_node,
+								   start_pfn, end_pfn,
+								   root_stable_tree + nid))
+					node = rb_first(root_stable_tree + nid);
+				else
+					node = rb_next(node);
+				cond_resched();
+			}
 		}
 	}
 	list_for_each_entry_safe(stable_node, next, &migrate_nodes, list) {
@@ -3551,6 +3597,7 @@ static ssize_t trigger_merge_store(struct kobject *kobj,
 	int ret;
 	struct task_struct *task;
 	struct mm_struct *mm;
+	struct partition_kobj *partition;
 
 	input = kstrdup(buf, GFP_KERNEL);
 	if (!input)
@@ -3583,9 +3630,13 @@ static ssize_t trigger_merge_store(struct kobject *kobj,
 	if (!mm)
 		return -EINVAL;
 
+	partition = find_partition_by_kobj(kobj);
+	if (!partition)
+		return -EINVAL;
+
 	mutex_lock(&ksm_thread_mutex);
 	wait_while_offlining();
-	ksm_sync_merge(mm, start, end);
+	ksm_sync_merge(mm, start, end, partition);
 	mutex_unlock(&ksm_thread_mutex);
 
 	mmput(mm);
@@ -3606,6 +3657,8 @@ static ssize_t merge_across_nodes_store(struct kobject *kobj,
 {
 	int err;
 	unsigned long knob;
+	struct rb_root *root_stable_tree;
+	struct partition_kobj *partition;
 
 	err = kstrtoul(buf, 10, &knob);
 	if (err)
@@ -3615,6 +3668,10 @@ static ssize_t merge_across_nodes_store(struct kobject *kobj,
 
 	mutex_lock(&ksm_thread_mutex);
 	wait_while_offlining();
+
+	partition = find_partition_by_kobj(kobj);
+	root_stable_tree = partition->root_stable_tree;
+
 	if (ksm_merge_across_nodes != knob) {
 		if (ksm_pages_shared || remove_all_stable_nodes())
 			err = -EBUSY;
@@ -3633,10 +3690,10 @@ static ssize_t merge_across_nodes_store(struct kobject *kobj,
 			if (!buf)
 				err = -ENOMEM;
 			else {
-				root_stable_tree = buf;
-				root_unstable_tree = buf + nr_node_ids;
+				partition->root_stable_tree = buf;
+				partition->root_unstable_tree = buf + nr_node_ids;
 				/* Stable tree is empty but not the unstable */
-				root_unstable_tree[0] = one_unstable_tree[0];
+				partition->root_unstable_tree[0] = one_unstable_tree[0];
 			}
 		}
 		if (!err) {
@@ -3834,14 +3891,6 @@ KSM_ATTR_RO(full_scans);
 
 #ifdef CONFIG_SELECTIVE_KSM
 static struct kobject *ksm_base_kobj;
-
-struct partition_kobj {
-	struct kobject *kobj;
-	struct list_head list;
-};
-
-static LIST_HEAD(partition_list);
-
 #else /* CONFIG_SELECTIVE_KSM */
 static ssize_t smart_scan_show(struct kobject *kobj,
 			       struct kobj_attribute *attr, char *buf)
@@ -4055,6 +4104,7 @@ static ssize_t add_partition_store(struct kobject *kobj,
 	struct partition_kobj *new_partition_kobj;
 	char partition_name[50];
 	int err;
+	struct rb_root *tree_root;
 
 	mutex_lock(&ksm_thread_mutex);
 
@@ -4081,6 +4131,13 @@ static ssize_t add_partition_store(struct kobject *kobj,
 		goto unlock;
 	}
 
+	tree_root = kcalloc(nr_node_ids + nr_node_ids, sizeof(*tree_root), GFP_KERNEL);
+	if (!tree_root) {
+		err = -ENOMEM;
+		goto unlock;
+	}
+	new_partition_kobj->root_stable_tree = tree_root;
+	new_partition_kobj->root_unstable_tree = tree_root + nr_node_ids;
 	err = sysfs_create_group(new_partition_kobj->kobj, &ksm_attr_group);
 	if (err) {
 		pr_err("ksm: register sysfs failed\n");
-- 
2.49.0.395.g12beb8f557-goog




* [RFC PATCH 5/6] mm: trigger unmerge and remove SELECTIVE KSM partition
  2025-03-21 17:37 [RFC PATCH 0/6] Selective KSM: Synchronous and Partitioned Merging Sourav Panda
                   ` (3 preceding siblings ...)
  2025-03-21 17:37 ` [RFC PATCH 4/6] mm: create dedicated trees for SELECTIVE KSM partitions Sourav Panda
@ 2025-03-21 17:37 ` Sourav Panda
  2025-03-21 17:37 ` [RFC PATCH 6/6] mm: syscall alternative for SELECTIVE_KSM Sourav Panda
  5 siblings, 0 replies; 7+ messages in thread
From: Sourav Panda @ 2025-03-21 17:37 UTC (permalink / raw)
  To: mathieu.desnoyers, willy, david, pasha.tatashin, rientjes, akpm,
	linux-mm, linux-kernel, weixugc, gthelen, souravpanda, surenb

Trigger unmerge or remove a partition using the
following sysfs interface:

Triggering an unmerge for a specific partition:
  echo "pid" > /sys/kernel/mm/ksm/partition_name/trigger_unmerge

Removing a partition:
  echo "partition_to_remove" >
	/sys/kernel/mm/ksm/control/remove_partition
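
Putting it together, a sketch of the full partition lifecycle with
this patch applied (pid, addresses and the partition name are
placeholders):

  KSM_SYSFS=/sys/kernel/mm/ksm

  echo "part_1" > ${KSM_SYSFS}/control/add_partition
  echo "1234 7f0000000000 7f0000100000" > ${KSM_SYSFS}/part_1/trigger_merge
  echo "1234" > ${KSM_SYSFS}/part_1/trigger_unmerge
  echo "part_1" > ${KSM_SYSFS}/control/remove_partition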

Limitation of the current implementation: on carrying out
trigger_unmerge, we unmerge all rmap items, which is wrong. We should
only unmerge the rmap items that belong to the partition on which
unmerge was called.

Another limitation is that no address range is specified when echoing
into trigger_unmerge. This is intentionally left out until we
determine the implementation feasibility.

Signed-off-by: Sourav Panda <souravpanda@google.com>
---
 mm/ksm.c | 120 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 120 insertions(+)

diff --git a/mm/ksm.c b/mm/ksm.c
index b575250aaf45..fd7626d5d8c9 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2556,6 +2556,31 @@ static void ksm_sync_merge(struct mm_struct *mm,
 	put_page(page);
 }
 
+static void ksm_sync_unmerge(struct mm_struct *mm)
+{
+	struct mm_slot *slot;
+	struct ksm_mm_slot *mm_slot;
+
+	struct vm_area_struct *vma;
+	struct vma_iterator vmi;
+
+	slot = mm_slot_lookup(mm_slots_hash, mm);
+	mm_slot = container_of(slot, struct ksm_mm_slot, slot);
+
+	ksm_scan.address = 0;
+	vma_iter_init(&vmi, mm, ksm_scan.address);
+
+	mmap_read_lock(mm);
+	for_each_vma(vmi, vma) {
+		if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
+			continue;
+		unmerge_ksm_pages(vma, vma->vm_start, vma->vm_end, false);
+	}
+	remove_trailing_rmap_items(&mm_slot->rmap_list);
+
+	mmap_read_unlock(mm);
+}
+
 #else /* CONFIG_SELECTIVE_KSM */
 /*
  * Calculate skip age for the ksm page age. The age determines how often
@@ -3644,6 +3669,58 @@ static ssize_t trigger_merge_store(struct kobject *kobj,
 }
 KSM_ATTR(trigger_merge);
 
+static ssize_t trigger_unmerge_show(struct kobject *kobj,
+				    struct kobj_attribute *attr,
+				    char *buf)
+{
+	return -EINVAL;	/* Not yet implemented */
+}
+
+static ssize_t trigger_unmerge_store(struct kobject *kobj,
+				     struct kobj_attribute *attr,
+				     const char *buf, size_t count)
+{
+	pid_t pid;
+	char *input, *ptr;
+	int ret;
+	struct task_struct *task;
+	struct mm_struct *mm;
+
+	input = kstrdup(buf, GFP_KERNEL);
+	if (!input)
+		return -ENOMEM;
+
+	ptr = strim(input);
+	ret = kstrtoint(ptr, 10, &pid);
+	kfree(input);
+
+	/* Find the mm_struct */
+	rcu_read_lock();
+	task = find_task_by_vpid(pid);
+	if (!task) {
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+
+	get_task_struct(task);
+
+	rcu_read_unlock();
+	mm = get_task_mm(task);
+	put_task_struct(task);
+
+	if (!mm)
+		return -EINVAL;
+
+	mutex_lock(&ksm_thread_mutex);
+	wait_while_offlining();
+	ksm_sync_unmerge(mm);
+	mutex_unlock(&ksm_thread_mutex);
+
+	mmput(mm);
+	return count;
+}
+KSM_ATTR(trigger_unmerge);
+
 #ifdef CONFIG_NUMA
 static ssize_t merge_across_nodes_show(struct kobject *kobj,
 				       struct kobj_attribute *attr, char *buf)
@@ -4044,6 +4121,7 @@ static struct attribute *ksm_attrs[] = {
 	&pages_to_scan_attr.attr,
 	&run_attr.attr,
 	&trigger_merge_attr.attr,
+	&trigger_unmerge_attr.attr,
 	&pages_scanned_attr.attr,
 	&pages_shared_attr.attr,
 	&pages_sharing_attr.attr,
@@ -4156,9 +4234,51 @@ static ssize_t add_partition_store(struct kobject *kobj,
 static struct kobj_attribute add_kobj_attr = __ATTR(add_partition, 0220, NULL,
 						    add_partition_store);
 
+static ssize_t remove_partition_store(struct kobject *kobj,
+				      struct kobj_attribute *attr,
+				      const char *buf, size_t count)
+{
+	struct partition_kobj *partition;
+	struct partition_kobj *partition_found = NULL;
+	char partition_name[50];
+	int err = 0;
+
+	if (sscanf(buf, "%31s", partition_name) != 1)
+		return -EINVAL;
+
+	mutex_lock(&ksm_thread_mutex);
+
+	list_for_each_entry(partition, &partition_list, list) {
+		if (strcmp(kobject_name(partition->kobj), partition_name) == 0) {
+			partition_found = partition;
+			break;
+		}
+	}
+
+	if (!partition_found) {
+		err = -ENOENT;
+		goto unlock;
+	}
+
+	unmerge_and_remove_all_rmap_items();
+
+	kobject_put(partition_found->kobj);
+	list_del(&partition_found->list);
+	kfree(partition_found->root_stable_tree);
+	kfree(partition_found);
+
+unlock:
+	mutex_unlock(&ksm_thread_mutex);
+	return err ? err : count;
+}
+
+static struct kobj_attribute rm_kobj_attr = __ATTR(remove_partition, 0220, NULL,
+						   remove_partition_store);
+
 /* Array of attributes for base kobject */
 static struct attribute *ksm_base_attrs[] = {
 	&add_kobj_attr.attr,
+	&rm_kobj_attr.attr,
 	NULL,  /* NULL-terminated */
 };
 
-- 
2.49.0.395.g12beb8f557-goog




* [RFC PATCH 6/6] mm: syscall alternative for SELECTIVE_KSM
  2025-03-21 17:37 [RFC PATCH 0/6] Selective KSM: Synchronous and Partitioned Merging Sourav Panda
                   ` (4 preceding siblings ...)
  2025-03-21 17:37 ` [RFC PATCH 5/6] mm: trigger unmerge and remove SELECTIVE KSM partition Sourav Panda
@ 2025-03-21 17:37 ` Sourav Panda
  5 siblings, 0 replies; 7+ messages in thread
From: Sourav Panda @ 2025-03-21 17:37 UTC (permalink / raw)
  To: mathieu.desnoyers, willy, david, pasha.tatashin, rientjes, akpm,
	linux-mm, linux-kernel, weixugc, gthelen, souravpanda, surenb

A partition can be created or opened using:

  int ksm_fd = ksm_open(ksm_name, flags);
    ksm_name specifies the ksm partition to be created or opened.
    flags:
      O_CREAT
        Create the ksm partition object if it does not exist.
      O_EXCL
        If O_CREAT was also specified, and a ksm partition object
        with the given name already exists, return an error.

Trigger the merge using:
  ksm_merge(ksm_fd, pid, start_addr, size);
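
A short sketch of the intended create-or-open pattern from userspace
(raw syscall(2), since no libc wrapper exists; 467 is the number added
to syscall_64.tbl below, and the partition name is a placeholder):

  #include <errno.h>
  #include <fcntl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #define __NR_ksm_open 467

  static int open_partition(const char *name)
  {
          /* Try to create the partition exclusively first. */
          int fd = syscall(__NR_ksm_open, name, O_CREAT | O_EXCL);

          if (fd < 0 && errno == EEXIST)
                  /* It already exists; open the existing partition. */
                  fd = syscall(__NR_ksm_open, name, 0);
          return fd;
  }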

Limitation: only the x86-64 syscall table (syscall_64.tbl) is wired up.

Signed-off-by: Sourav Panda <souravpanda@google.com>
---
 arch/x86/entry/syscalls/syscall_64.tbl |   3 +-
 include/linux/ksm.h                    |   4 +
 mm/ksm.c                               | 156 ++++++++++++++++++++++++-
 3 files changed, 161 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 5eb708bff1c7..352d747dbe33 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -390,7 +390,8 @@
 464	common	getxattrat		sys_getxattrat
 465	common	listxattrat		sys_listxattrat
 466	common	removexattrat		sys_removexattrat
-
+467	common	ksm_open		sys_ksm_open
+468	common	ksm_merge		sys_ksm_merge
 #
 # Due to a historical design error, certain syscalls are numbered differently
 # in x32 as compared to native x86_64.  These syscalls have numbers 512-547.
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index d73095b5cd96..a94c89403c29 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -14,6 +14,10 @@
 #include <linux/rmap.h>
 #include <linux/sched.h>
 
+#include <linux/anon_inodes.h>
+#include <linux/syscalls.h>
+#define MAX_KSM_NAME_LEN 128
+
 #ifdef CONFIG_KSM
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 		unsigned long end, int advice, unsigned long *vm_flags);
diff --git a/mm/ksm.c b/mm/ksm.c
index fd7626d5d8c9..71558120b034 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -147,7 +147,8 @@ struct ksm_scan {
 static struct kobject *ksm_base_kobj;
 
 struct partition_kobj {
-	struct kobject *kobj;
+	struct kobject *kobj;	/* Not required for the syscall interface */
+	char name[MAX_KSM_NAME_LEN];
 	struct list_head list;
 	struct rb_root *root_stable_tree;
 	struct rb_root *root_unstable_tree;
@@ -166,6 +167,106 @@ static struct partition_kobj *find_partition_by_kobj(struct kobject *kobj)
 	return NULL;
 }
 
+static struct partition_kobj *find_ksm_partition(char *partition_name)
+{
+	struct partition_kobj *partition;
+
+	list_for_each_entry(partition, &partition_list, list) {
+		if (strcmp(partition->name, partition_name) == 0)
+			return partition;
+	}
+	return NULL;
+}
+
+static DEFINE_MUTEX(ksm_partition_lock);
+
+static int ksm_release(struct inode *inode, struct file *file)
+{
+	struct partition_kobj *ksm = file->private_data;
+
+	mutex_lock(&ksm_partition_lock);
+	list_del(&ksm->list);
+	mutex_unlock(&ksm_partition_lock);
+
+	kfree(ksm);
+	return 0;
+}
+
+static const struct file_operations ksm_fops = {
+	.release = ksm_release,
+};
+
+static struct partition_kobj *ksm_create_partition(char *ksm_name)
+{
+	struct partition_kobj *partition;
+	struct rb_root *tree_root;
+
+	partition = kzalloc(sizeof(*partition), GFP_KERNEL);
+	if (!partition)
+		return NULL;
+
+	tree_root = kcalloc(nr_node_ids + nr_node_ids, sizeof(*tree_root),
+			    GFP_KERNEL);
+	if (!tree_root)
+		return NULL;
+
+	partition->root_stable_tree = tree_root;
+	partition->root_unstable_tree = tree_root + nr_node_ids;
+	strncpy(partition->name, ksm_name, sizeof(partition->name));
+
+	list_add(&partition->list, &partition_list);
+
+	return partition;
+}
+
+static int ksm_partition_fd(struct partition_kobj *partition)
+{
+	int fd;
+	struct file *file;
+	int ret;
+
+	file = anon_inode_getfile("ksm_partition", &ksm_fops, partition, O_RDWR);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		return ret;
+	}
+
+	fd = get_unused_fd_flags(O_RDWR);
+	if (fd < 0) {
+		fput(file);
+		return fd;
+	}
+	fd_install(fd, file);
+	return fd;
+}
+
+SYSCALL_DEFINE2(ksm_open, const char __user *, ksm_name, int, flags) {
+	char name[MAX_KSM_NAME_LEN];
+	struct partition_kobj *partition;
+	int ret;
+
+	ret = strncpy_from_user(name, ksm_name, sizeof(name));
+	if (ret < 0)
+		return -EFAULT;
+
+	partition = find_ksm_partition(name);
+
+	if (flags & O_EXCL && partition) /* Partition already exists, return error */
+		return -EEXIST;
+
+	if (flags & O_CREAT && !partition) {
+		/* Partition does not exist, but we are allowed to create one */
+		mutex_lock(&ksm_partition_lock);
+		partition = ksm_create_partition(name);
+		mutex_unlock(&ksm_partition_lock);
+	}
+
+	if (!partition)
+		return flags & O_CREAT ? -ENOMEM : -ENOENT;
+
+	return ksm_partition_fd(partition);
+}
+
 /**
  * struct ksm_stable_node - node of the stable rbtree
  * @node: rb node of this ksm page in the stable tree
@@ -4324,6 +4425,59 @@ static int __init ksm_thread_sysfs_init(void)
 }
 #endif /* CONFIG_SELECTIVE_KSM */
 
+SYSCALL_DEFINE4(ksm_merge, int, ksm_fd, pid_t, pid, unsigned long, start, size_t, size) {
+	unsigned long end = start + size;
+	struct task_struct *task;
+	struct mm_struct *mm;
+	struct partition_kobj *partition;
+	struct file *file;
+
+	file = fget(ksm_fd);
+	if (!file)
+		return -EBADF;
+
+	partition = file->private_data;
+	if (!partition) {
+		fput(file);
+		return -EINVAL;
+	}
+
+	if (start >= end) {
+		fput(file);
+		return -EINVAL;
+	}
+
+	/* Find the mm_struct */
+	rcu_read_lock();
+	task = find_task_by_vpid(pid);
+	if (!task) {
+		fput(file);
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+
+	get_task_struct(task);
+
+	rcu_read_unlock();
+	mm = get_task_mm(task);
+	put_task_struct(task);
+
+	if (!mm) {
+		fput(file);
+		return -EINVAL;
+	}
+
+	mutex_lock(&ksm_thread_mutex);
+	wait_while_offlining();
+	ksm_sync_merge(mm, start, end, partition);
+	mutex_unlock(&ksm_thread_mutex);
+
+	mmput(mm);
+
+	fput(file);
+	return 0;
+}
+
 static int __init ksm_init(void)
 {
 	int err;
-- 
2.49.0.395.g12beb8f557-goog



