* [PATCH v4 0/6] add mTHP support for anonymous shmem
@ 2024-06-04 10:17 Baolin Wang
  2024-06-04 10:17 ` [PATCH v4 1/6] mm: memory: extend finish_fault() to support large folio Baolin Wang
                   ` (7 more replies)
  0 siblings, 8 replies; 16+ messages in thread
From: Baolin Wang @ 2024-06-04 10:17 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, ryan.roberts,
	shy828301, ziy, ioworker0, da.gomez, p.raghav, baolin.wang,
	linux-mm, linux-kernel

Anonymous pages already support multi-size THP (mTHP) allocation since
commit 19eaf44954df, which allows THP to be configured through the sysfs
interface located at '/sys/kernel/mm/transparent_hugepage/hugepages-XXkB/enabled'.

However, anonymous shmem ignores the anonymous mTHP rules configured through
the sysfs interface and can only use PMD-mapped THP, which is not reasonable.
Many applications implement anonymous page sharing through mmap(MAP_SHARED |
MAP_ANONYMOUS), especially in database usage scenarios, so users expect a
unified mTHP strategy for anonymous pages, including anonymous shared pages,
in order to enjoy the benefits of mTHP: lower latency than PMD-mapped THP,
smaller memory bloat than PMD-mapped THP, contiguous PTEs on ARM architectures
to reduce TLB misses, and so on.

As discussed in the bi-weekly MM meeting[1], the mTHP controls should control
all of shmem, not only anonymous shmem, but support will be added iteratively.
Therefore, this patch set starts with support for anonymous shmem.

The primary strategy is similar to supporting anonymous mTHP. Introduce
a new interface '/sys/kernel/mm/transparent_hugepage/hugepages-XXkB/shmem_enabled',
which can take almost the same values as the top-level
'/sys/kernel/mm/transparent_hugepage/shmem_enabled', adding a new
"inherit" option and dropping the testing options 'force' and 'deny'.
By default all sizes are set to "never" except the PMD size, which is set
to "inherit". This ensures backward compatibility with the top-level
anonymous shmem setting, while also allowing independent control of
anonymous shmem enablement for each mTHP size.
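
For example, to additionally enable 64K mTHP for anonymous shmem, the new
per-size knob can be written from userspace. A minimal sketch (assuming this
series is applied and a 4K base page size, so that a hugepages-64kB directory
exists):

/* Sketch: write "always" to the per-size shmem_enabled knob. */
#include <stdio.h>

int main(void)
{
        const char *knob =
                "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/shmem_enabled";
        FILE *f = fopen(knob, "w");

        if (!f) {
                perror(knob);
                return 1;
        }
        /* other accepted values: "inherit", "within_size", "advise", "never" */
        fputs("always", f);
        return fclose(f) ? 1 : 0;
}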

I used the page fault latency tool to measure the performance of faulting in
1G of anonymous shmem with 32 threads on my test machine (ARM64 architecture,
32 cores, 125G memory):
base: mm-unstable
user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
0.04s        3.10s         83516.416                  2669684.890

mm-unstable + patchset, anon shmem mTHP disabled
user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
0.02s        3.14s         82936.359                  2630746.027

mm-unstable + patchset, anon shmem 64K mTHP enabled
user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
0.08s        0.31s         678630.231                 17082522.495

From the data above, the patchset has minimal impact when mTHP is not
enabled (some fluctuation was observed during testing). When 64K mTHP is
enabled, there is a significant improvement in page fault latency.
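
For reference, a rough sketch of this kind of measurement (not the exact tool
used for the numbers above; the mapping size and thread count are merely
illustrative): mmap a shared anonymous region, fault it in from multiple
threads, and derive a faults-per-second figure from the elapsed time.

/* fault_bench.c: rough sketch of a page-fault throughput measurement. */
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define NR_THREADS 32
#define MAP_LEN    (1UL << 30)

static char *buf;
static size_t chunk;

static void *toucher(void *arg)
{
        size_t start = (size_t)(long)arg * chunk;
        long page = sysconf(_SC_PAGESIZE);

        /* one write per base page; with mTHP enabled many of these hit an
         * already-mapped large folio instead of taking a fault */
        for (size_t off = 0; off < chunk; off += page)
                buf[start + off] = 1;
        return NULL;
}

int main(void)
{
        pthread_t tids[NR_THREADS];
        struct timespec t0, t1;

        buf = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
                return 1;
        chunk = MAP_LEN / NR_THREADS;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < NR_THREADS; i++)
                pthread_create(&tids[i], NULL, toucher, (void *)i);
        for (long i = 0; i < NR_THREADS; i++)
                pthread_join(tids[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.0f base pages touched per second\n",
               (MAP_LEN / (double)sysconf(_SC_PAGESIZE)) / secs);
        return 0;
}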

[1] https://lore.kernel.org/all/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com/

Changes from v3:
 - Drop 'force' and 'deny' testing options for each mTHP.
 - Use new helper update_mmu_tlb_range(), per Lance.
 - Update documentation to drop "anonymous thp" terminology, per David.
 - Initialize the 'suitable_orders' in shmem_alloc_and_add_folio(),
   reported by kernel test robot.
 - Fix the highest mTHP order in shmem_get_unmapped_area().
 - Update some commit messages.

Changes from v2:
 - Rebased to mm/mm-unstable.
 - Remove 'huge' parameter for shmem_alloc_and_add_folio(), per Lance.

Changes from v1:
 - Drop the patch that re-arranges the position of highest_order() and
   next_order(), per Ryan.
 - Modify the finish_fault() to fix VA alignment issue, per Ryan and
   David.
 - Fix some building issues, reported by Lance and kernel test robot.
 - Update some commit messages.

Changes from RFC:
 - Rebase the patch set against the new mm-unstable branch, per Lance.
 - Add a new patch to export highest_order() and next_order().
 - Add a new patch to align mTHP size in shmem_get_unmapped_area().
 - Handle the uffd case and the VMA limits case when building mapping for
   large folio in the finish_fault() function, per Ryan.
 - Remove unnecessary 'order' variable in patch 3, per Kefeng.
 - Keep the anon shmem counters' name consistency.
 - Modify the strategy to support mTHP for anonymous shmem, discussed with
   Ryan and David.
 - Add reviewed tag from Barry.
 - Update the commit message.

Baolin Wang (6):
  mm: memory: extend finish_fault() to support large folio
  mm: shmem: add THP validation for PMD-mapped THP related statistics
  mm: shmem: add multi-size THP sysfs interface for anonymous shmem
  mm: shmem: add mTHP support for anonymous shmem
  mm: shmem: add mTHP size alignment in shmem_get_unmapped_area
  mm: shmem: add mTHP counters for anonymous shmem

 Documentation/admin-guide/mm/transhuge.rst |  23 ++
 include/linux/huge_mm.h                    |  23 ++
 mm/huge_memory.c                           |  17 +-
 mm/memory.c                                |  57 +++-
 mm/shmem.c                                 | 344 ++++++++++++++++++---
 5 files changed, 403 insertions(+), 61 deletions(-)

-- 
2.39.3




* [PATCH v4 1/6] mm: memory: extend finish_fault() to support large folio
  2024-06-04 10:17 [PATCH v4 0/6] add mTHP support for anonymous shmem Baolin Wang
@ 2024-06-04 10:17 ` Baolin Wang
  2024-06-04 14:58   ` kernel test robot
  2024-06-04 10:17 ` [PATCH v4 2/6] mm: shmem: add THP validation for PMD-mapped THP related statistics Baolin Wang
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Baolin Wang @ 2024-06-04 10:17 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, ryan.roberts,
	shy828301, ziy, ioworker0, da.gomez, p.raghav, baolin.wang,
	linux-mm, linux-kernel

Add support for establishing large folio mappings in finish_fault(), as
preparation for supporting multi-size THP allocation of anonymous shmem
pages in the following patches.

Keep the current behavior (per-page fault) for non-anon shmem to avoid
unintentionally inflating the RSS; what size of mapping to build when
extending mTHP to cover non-anon shmem can be discussed in the future.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 mm/memory.c | 57 +++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 47 insertions(+), 10 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index eef4e482c0c2..1f7be4c6aac4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4831,9 +4831,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct page *page;
+	struct folio *folio;
 	vm_fault_t ret;
 	bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
 		      !(vma->vm_flags & VM_SHARED);
+	int type, nr_pages, i;
+	unsigned long addr = vmf->address;
 
 	/* Did we COW the page? */
 	if (is_cow)
@@ -4864,24 +4867,58 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 			return VM_FAULT_OOM;
 	}
 
+	folio = page_folio(page);
+	nr_pages = folio_nr_pages(folio);
+
+	/*
+	 * Using per-page fault to maintain the uffd semantics, and same
+	 * approach also applies to non-anonymous-shmem faults to avoid
+	 * inflating the RSS of the process.
+	 */
+	if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma))) {
+		nr_pages = 1;
+	} else if (nr_pages > 1) {
+		pgoff_t idx = folio_page_idx(folio, page);
+		/* The page offset of vmf->address within the VMA. */
+		pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff;
+
+		/*
+		 * Fallback to per-page fault in case the folio size in page
+		 * cache beyond the VMA limits.
+		 */
+		if (unlikely(vma_off < idx ||
+			     vma_off + (nr_pages - idx) > vma_pages(vma))) {
+			nr_pages = 1;
+		} else {
+			/* Now we can set mappings for the whole large folio. */
+			addr = vmf->address - idx * PAGE_SIZE;
+			page = &folio->page;
+		}
+	}
+
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-				      vmf->address, &vmf->ptl);
+				       addr, &vmf->ptl);
 	if (!vmf->pte)
 		return VM_FAULT_NOPAGE;
 
 	/* Re-check under ptl */
-	if (likely(!vmf_pte_changed(vmf))) {
-		struct folio *folio = page_folio(page);
-		int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
-
-		set_pte_range(vmf, folio, page, 1, vmf->address);
-		add_mm_counter(vma->vm_mm, type, 1);
-		ret = 0;
-	} else {
-		update_mmu_tlb(vma, vmf->address, vmf->pte);
+	if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
+		update_mmu_tlb(vma, addr, vmf->pte);
+		ret = VM_FAULT_NOPAGE;
+		goto unlock;
+	} else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
+		update_mmu_tlb_range(vma, addr, vmf->pte, nr_pages);
 		ret = VM_FAULT_NOPAGE;
+		goto unlock;
 	}
 
+	folio_ref_add(folio, nr_pages - 1);
+	set_pte_range(vmf, folio, page, nr_pages, addr);
+	type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
+	add_mm_counter(vma->vm_mm, type, nr_pages);
+	ret = 0;
+
+unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	return ret;
 }
-- 
2.39.3




* [PATCH v4 2/6] mm: shmem: add THP validation for PMD-mapped THP related statistics
  2024-06-04 10:17 [PATCH v4 0/6] add mTHP support for anonymous shmem Baolin Wang
  2024-06-04 10:17 ` [PATCH v4 1/6] mm: memory: extend finish_fault() to support large folio Baolin Wang
@ 2024-06-04 10:17 ` Baolin Wang
  2024-06-04 10:17 ` [PATCH v4 3/6] mm: shmem: add multi-size THP sysfs interface for anonymous shmem Baolin Wang
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Baolin Wang @ 2024-06-04 10:17 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, ryan.roberts,
	shy828301, ziy, ioworker0, da.gomez, p.raghav, baolin.wang,
	linux-mm, linux-kernel

In preparation for extending mTHP support, validate that the folio is
actually PMD-sized before updating the PMD-mapped THP related statistics,
to avoid statistical confusion.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/shmem.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 6868c0af3a69..ae358efc397a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1647,7 +1647,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
 			return ERR_PTR(-E2BIG);
 
 		folio = shmem_alloc_folio(gfp, HPAGE_PMD_ORDER, info, index);
-		if (!folio)
+		if (!folio && pages == HPAGE_PMD_NR)
 			count_vm_event(THP_FILE_FALLBACK);
 	} else {
 		pages = 1;
@@ -1665,7 +1665,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
 		if (xa_find(&mapping->i_pages, &index,
 				index + pages - 1, XA_PRESENT)) {
 			error = -EEXIST;
-		} else if (huge) {
+		} else if (pages == HPAGE_PMD_NR) {
 			count_vm_event(THP_FILE_FALLBACK);
 			count_vm_event(THP_FILE_FALLBACK_CHARGE);
 		}
@@ -2031,7 +2031,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 		folio = shmem_alloc_and_add_folio(huge_gfp,
 				inode, index, fault_mm, true);
 		if (!IS_ERR(folio)) {
-			count_vm_event(THP_FILE_ALLOC);
+			if (folio_test_pmd_mappable(folio))
+				count_vm_event(THP_FILE_ALLOC);
 			goto alloced;
 		}
 		if (PTR_ERR(folio) == -EEXIST)
-- 
2.39.3




* [PATCH v4 3/6] mm: shmem: add multi-size THP sysfs interface for anonymous shmem
  2024-06-04 10:17 [PATCH v4 0/6] add mTHP support for anonymous shmem Baolin Wang
  2024-06-04 10:17 ` [PATCH v4 1/6] mm: memory: extend finish_fault() to support large folio Baolin Wang
  2024-06-04 10:17 ` [PATCH v4 2/6] mm: shmem: add THP validation for PMD-mapped THP related statistics Baolin Wang
@ 2024-06-04 10:17 ` Baolin Wang
       [not found]   ` <CGME20240610122305eucas1p21bfd8a8c999b3fc8bfce04e5feea7bf7@eucas1p2.samsung.com>
  2024-06-04 10:17 ` [PATCH v4 4/6] mm: shmem: add mTHP support " Baolin Wang
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Baolin Wang @ 2024-06-04 10:17 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, ryan.roberts,
	shy828301, ziy, ioworker0, da.gomez, p.raghav, baolin.wang,
	linux-mm, linux-kernel

To support the use of mTHP with anonymous shmem, add a new sysfs interface
'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/'
directory for each mTHP size to control whether shmem is enabled for that
mTHP, with a value similar to the top-level 'shmem_enabled', which can be
set to: "always", "inherit" (to inherit the top-level setting), "within_size",
"advise" or "never". An 'inherit' option is added to ensure compatibility
with these global settings, and the options 'force' and 'deny' are dropped,
which are rather testing artifacts from the old ages.

By default, PMD-sized hugepages have enabled="inherit" and all other hugepage
sizes have enabled="never" in
'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled'.

In addition, if the top-level value is 'force', then only PMD-sized hugepages
can have enabled="inherit"; any other configuration will fail, and vice versa.
That means non-PMD sized THP can no longer override the global huge
allocation.
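
As a rough illustration of this restriction (a userspace sketch assuming the
series is applied, not part of the patch), writing 'inherit' to a non-PMD
size such as 64K while the top-level 'shmem_enabled' is 'force' is expected
to be rejected with EINVAL:

/* Expected to fail when the top-level shmem_enabled is "force" and
 * hugepages-64kB is not the PMD size. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char *knob =
                "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/shmem_enabled";
        int fd = open(knob, O_WRONLY);

        if (fd < 0) {
                perror(knob);
                return 1;
        }
        if (write(fd, "inherit", strlen("inherit")) < 0)
                printf("write failed as expected: %s\n", strerror(errno));
        close(fd);
        return 0;
}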

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 23 ++++++
 include/linux/huge_mm.h                    | 10 +++
 mm/huge_memory.c                           | 11 +--
 mm/shmem.c                                 | 96 ++++++++++++++++++++++
 4 files changed, 132 insertions(+), 8 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index d414d3f5592a..b76d15e408b3 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -332,6 +332,29 @@ deny
 force
     Force the huge option on for all - very useful for testing;
 
+Shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob to control
+mTHP allocation: '/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled',
+and its value for each mTHP is essentially consistent with the global setting.
+An 'inherit' option is added to ensure compatibility with these global settings.
+Conversely, the options 'force' and 'deny' are dropped, which are rather testing
+artifacts from the old ages.
+always
+    Attempt to allocate <size> huge pages every time we need a new page;
+
+inherit
+    Inherit the top-level "shmem_enabled" value. By default, PMD-sized hugepages
+    have enabled="inherit" and all other hugepage sizes have enabled="never";
+
+never
+    Do not allocate <size> huge pages;
+
+within_size
+    Only allocate <size> huge page if it will be fully within i_size.
+    Also respect fadvise()/madvise() hints;
+
+advise
+    Only allocate <size> huge pages if requested with fadvise()/madvise();
+
 Need of application restart
 ===========================
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 020e2344eb86..fac21548c5de 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -6,6 +6,7 @@
 #include <linux/mm_types.h>
 
 #include <linux/fs.h> /* only for vma_is_dax() */
+#include <linux/kobject.h>
 
 vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -63,6 +64,7 @@ ssize_t single_hugepage_flag_show(struct kobject *kobj,
 				  struct kobj_attribute *attr, char *buf,
 				  enum transparent_hugepage_flag flag);
 extern struct kobj_attribute shmem_enabled_attr;
+extern struct kobj_attribute thpsize_shmem_enabled_attr;
 
 /*
  * Mask of all large folio orders supported for anonymous THP; all orders up to
@@ -265,6 +267,14 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 	return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
 }
 
+struct thpsize {
+	struct kobject kobj;
+	struct list_head node;
+	int order;
+};
+
+#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
+
 enum mthp_stat_item {
 	MTHP_STAT_ANON_FAULT_ALLOC,
 	MTHP_STAT_ANON_FAULT_FALLBACK,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8e49f402d7c7..1360a1903b66 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -449,14 +449,6 @@ static void thpsize_release(struct kobject *kobj);
 static DEFINE_SPINLOCK(huge_anon_orders_lock);
 static LIST_HEAD(thpsize_list);
 
-struct thpsize {
-	struct kobject kobj;
-	struct list_head node;
-	int order;
-};
-
-#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
-
 static ssize_t thpsize_enabled_show(struct kobject *kobj,
 				    struct kobj_attribute *attr, char *buf)
 {
@@ -517,6 +509,9 @@ static struct kobj_attribute thpsize_enabled_attr =
 
 static struct attribute *thpsize_attrs[] = {
 	&thpsize_enabled_attr.attr,
+#ifdef CONFIG_SHMEM
+	&thpsize_shmem_enabled_attr.attr,
+#endif
 	NULL,
 };
 
diff --git a/mm/shmem.c b/mm/shmem.c
index ae358efc397a..643ff7516b4d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -131,6 +131,14 @@ struct shmem_options {
 #define SHMEM_SEEN_QUOTA 32
 };
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static unsigned long huge_anon_shmem_orders_always __read_mostly;
+static unsigned long huge_anon_shmem_orders_madvise __read_mostly;
+static unsigned long huge_anon_shmem_orders_inherit __read_mostly;
+static unsigned long huge_anon_shmem_orders_within_size __read_mostly;
+static DEFINE_SPINLOCK(huge_anon_shmem_orders_lock);
+#endif
+
 #ifdef CONFIG_TMPFS
 static unsigned long shmem_default_max_blocks(void)
 {
@@ -4672,6 +4680,12 @@ void __init shmem_init(void)
 		SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
 	else
 		shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
+
+	/*
+	 * Default to setting PMD-sized THP to inherit the global setting and
+	 * disable all other multi-size THPs, when anonymous shmem uses mTHP.
+	 */
+	huge_anon_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
 #endif
 	return;
 
@@ -4731,6 +4745,11 @@ static ssize_t shmem_enabled_store(struct kobject *kobj,
 			huge != SHMEM_HUGE_NEVER && huge != SHMEM_HUGE_DENY)
 		return -EINVAL;
 
+	/* Do not override huge allocation policy with non-PMD sized mTHP */
+	if (huge == SHMEM_HUGE_FORCE &&
+	    huge_anon_shmem_orders_inherit != BIT(HPAGE_PMD_ORDER))
+		return -EINVAL;
+
 	shmem_huge = huge;
 	if (shmem_huge > SHMEM_HUGE_DENY)
 		SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
@@ -4738,6 +4757,83 @@ static ssize_t shmem_enabled_store(struct kobject *kobj,
 }
 
 struct kobj_attribute shmem_enabled_attr = __ATTR_RW(shmem_enabled);
+
+static ssize_t thpsize_shmem_enabled_show(struct kobject *kobj,
+					  struct kobj_attribute *attr, char *buf)
+{
+	int order = to_thpsize(kobj)->order;
+	const char *output;
+
+	if (test_bit(order, &huge_anon_shmem_orders_always))
+		output = "[always] inherit within_size advise never";
+	else if (test_bit(order, &huge_anon_shmem_orders_inherit))
+		output = "always [inherit] within_size advise never";
+	else if (test_bit(order, &huge_anon_shmem_orders_within_size))
+		output = "always inherit [within_size] advise never";
+	else if (test_bit(order, &huge_anon_shmem_orders_madvise))
+		output = "always inherit within_size [advise] never";
+	else
+		output = "always inherit within_size advise [never]";
+
+	return sysfs_emit(buf, "%s\n", output);
+}
+
+static ssize_t thpsize_shmem_enabled_store(struct kobject *kobj,
+					   struct kobj_attribute *attr,
+					   const char *buf, size_t count)
+{
+	int order = to_thpsize(kobj)->order;
+	ssize_t ret = count;
+
+	if (sysfs_streq(buf, "always")) {
+		spin_lock(&huge_anon_shmem_orders_lock);
+		clear_bit(order, &huge_anon_shmem_orders_inherit);
+		clear_bit(order, &huge_anon_shmem_orders_madvise);
+		clear_bit(order, &huge_anon_shmem_orders_within_size);
+		set_bit(order, &huge_anon_shmem_orders_always);
+		spin_unlock(&huge_anon_shmem_orders_lock);
+	} else if (sysfs_streq(buf, "inherit")) {
+		/* Do not override huge allocation policy with non-PMD sized mTHP */
+		if (shmem_huge == SHMEM_HUGE_FORCE &&
+		    order != HPAGE_PMD_ORDER)
+			return -EINVAL;
+
+		spin_lock(&huge_anon_shmem_orders_lock);
+		clear_bit(order, &huge_anon_shmem_orders_always);
+		clear_bit(order, &huge_anon_shmem_orders_madvise);
+		clear_bit(order, &huge_anon_shmem_orders_within_size);
+		set_bit(order, &huge_anon_shmem_orders_inherit);
+		spin_unlock(&huge_anon_shmem_orders_lock);
+	} else if (sysfs_streq(buf, "within_size")) {
+		spin_lock(&huge_anon_shmem_orders_lock);
+		clear_bit(order, &huge_anon_shmem_orders_always);
+		clear_bit(order, &huge_anon_shmem_orders_inherit);
+		clear_bit(order, &huge_anon_shmem_orders_madvise);
+		set_bit(order, &huge_anon_shmem_orders_within_size);
+		spin_unlock(&huge_anon_shmem_orders_lock);
+	} else if (sysfs_streq(buf, "madvise")) {
+		spin_lock(&huge_anon_shmem_orders_lock);
+		clear_bit(order, &huge_anon_shmem_orders_always);
+		clear_bit(order, &huge_anon_shmem_orders_inherit);
+		clear_bit(order, &huge_anon_shmem_orders_within_size);
+		set_bit(order, &huge_anon_shmem_orders_madvise);
+		spin_unlock(&huge_anon_shmem_orders_lock);
+	} else if (sysfs_streq(buf, "never")) {
+		spin_lock(&huge_anon_shmem_orders_lock);
+		clear_bit(order, &huge_anon_shmem_orders_always);
+		clear_bit(order, &huge_anon_shmem_orders_inherit);
+		clear_bit(order, &huge_anon_shmem_orders_within_size);
+		clear_bit(order, &huge_anon_shmem_orders_madvise);
+		spin_unlock(&huge_anon_shmem_orders_lock);
+	} else {
+		ret = -EINVAL;
+	}
+
+	return ret;
+}
+
+struct kobj_attribute thpsize_shmem_enabled_attr =
+	__ATTR(shmem_enabled, 0644, thpsize_shmem_enabled_show, thpsize_shmem_enabled_store);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */
 
 #else /* !CONFIG_SHMEM */
-- 
2.39.3




* [PATCH v4 4/6] mm: shmem: add mTHP support for anonymous shmem
  2024-06-04 10:17 [PATCH v4 0/6] add mTHP support for anonymous shmem Baolin Wang
                   ` (2 preceding siblings ...)
  2024-06-04 10:17 ` [PATCH v4 3/6] mm: shmem: add multi-size THP sysfs interface for anonymous shmem Baolin Wang
@ 2024-06-04 10:17 ` Baolin Wang
       [not found]   ` <CGME20240610132756eucas1p1d892ccbabdb5f8fc4cff55c662f24d75@eucas1p1.samsung.com>
  2024-06-04 10:17 ` [PATCH v4 5/6] mm: shmem: add mTHP size alignment in shmem_get_unmapped_area Baolin Wang
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Baolin Wang @ 2024-06-04 10:17 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, ryan.roberts,
	shy828301, ziy, ioworker0, da.gomez, p.raghav, baolin.wang,
	linux-mm, linux-kernel

Commit 19eaf44954df added multi-size THP (mTHP) for anonymous pages,
allowing THP to be configured through the sysfs interface located at
'/sys/kernel/mm/transparent_hugepage/hugepages-XXkB/enabled'.

However, anonymous shmem ignores the anonymous mTHP rules configured
through the sysfs interface and can only use PMD-mapped THP, which is not
reasonable. Users expect the mTHP rules to apply to all anonymous pages,
including anonymous shmem, in order to enjoy the benefits of mTHP: lower
latency than PMD-mapped THP, smaller memory bloat than PMD-mapped THP,
contiguous PTEs on ARM architectures to reduce TLB misses, and so on. In
addition, the mTHP interfaces can be extended to support all shmem/tmpfs
scenarios in the future, especially the shmem mmap() case.

The primary strategy is similar to supporting anonymous mTHP. Introduce
a new interface '/sys/kernel/mm/transparent_hugepage/hugepages-XXkB/shmem_enabled',
which can take almost the same values as the top-level
'/sys/kernel/mm/transparent_hugepage/shmem_enabled', adding a new
"inherit" option and dropping the testing options 'force' and 'deny'.
By default all sizes are set to "never" except the PMD size, which is set
to "inherit". This ensures backward compatibility with the top-level
anonymous shmem setting, while also allowing independent control of
anonymous shmem enablement for each mTHP size.
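
As background for the order-walking loops added below, here is a simplified
userspace model of how the enabled folio orders are kept in a bitmask and
tried from the highest order down; highest_order()/next_order() here are
stand-ins modeled on the kernel helpers of the same name, not the kernel
implementation itself:

/* Simplified model: each set bit in 'orders' is an enabled folio order.
 * Walk enabled orders from highest to lowest, as the shmem allocation
 * fallback loop does. */
#include <stdio.h>

static int highest_order(unsigned long orders)
{
        return 8 * (int)sizeof(orders) - 1 - __builtin_clzl(orders);
}

static int next_order(unsigned long *orders, int prev)
{
        *orders &= ~(1UL << prev);
        return *orders ? highest_order(*orders) : 0;
}

int main(void)
{
        /* e.g. order-9 (PMD with 4K pages), order-4 (64K) and order-2 enabled */
        unsigned long orders = (1UL << 9) | (1UL << 4) | (1UL << 2);
        int order = highest_order(orders);

        while (orders) {
                printf("try order %d (%lu KB folio)\n", order, 4UL << order);
                order = next_order(&orders, order);
        }
        return 0;
}

With 4K base pages this prints orders 9, 4 and 2, i.e. 2048 KB, 64 KB and
16 KB folios, in that order.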

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 include/linux/huge_mm.h |  10 +++
 mm/shmem.c              | 187 +++++++++++++++++++++++++++++++++-------
 2 files changed, 167 insertions(+), 30 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index fac21548c5de..909cfc67521d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -575,6 +575,16 @@ static inline bool thp_migration_supported(void)
 {
 	return false;
 }
+
+static inline int highest_order(unsigned long orders)
+{
+	return 0;
+}
+
+static inline int next_order(unsigned long *orders, int prev)
+{
+	return 0;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 static inline int split_folio_to_list_to_order(struct folio *folio,
diff --git a/mm/shmem.c b/mm/shmem.c
index 643ff7516b4d..9a8533482208 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1611,6 +1611,107 @@ static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
 	return result;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static unsigned long anon_shmem_allowable_huge_orders(struct inode *inode,
+				struct vm_area_struct *vma, pgoff_t index,
+				bool global_huge)
+{
+	unsigned long mask = READ_ONCE(huge_anon_shmem_orders_always);
+	unsigned long within_size_orders = READ_ONCE(huge_anon_shmem_orders_within_size);
+	unsigned long vm_flags = vma->vm_flags;
+	/*
+	 * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that
+	 * are enabled for this vma.
+	 */
+	unsigned long orders = BIT(PMD_ORDER + 1) - 1;
+	loff_t i_size;
+	int order;
+
+	if ((vm_flags & VM_NOHUGEPAGE) ||
+	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
+		return 0;
+
+	/* If the hardware/firmware marked hugepage support disabled. */
+	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
+		return 0;
+
+	/*
+	 * Following the 'deny' semantics of the top level, force the huge
+	 * option off from all mounts.
+	 */
+	if (shmem_huge == SHMEM_HUGE_DENY)
+		return 0;
+
+	/*
+	 * Only allow inherit orders if the top-level value is 'force', which
+	 * means non-PMD sized THP can not override 'huge' mount option now.
+	 */
+	if (shmem_huge == SHMEM_HUGE_FORCE)
+		return READ_ONCE(huge_anon_shmem_orders_inherit);
+
+	/* Allow mTHP that will be fully within i_size. */
+	order = highest_order(within_size_orders);
+	while (within_size_orders) {
+		index = round_up(index + 1, order);
+		i_size = round_up(i_size_read(inode), PAGE_SIZE);
+		if (i_size >> PAGE_SHIFT >= index) {
+			mask |= within_size_orders;
+			break;
+		}
+
+		order = next_order(&within_size_orders, order);
+	}
+
+	if (vm_flags & VM_HUGEPAGE)
+		mask |= READ_ONCE(huge_anon_shmem_orders_madvise);
+
+	if (global_huge)
+		mask |= READ_ONCE(huge_anon_shmem_orders_inherit);
+
+	return orders & mask;
+}
+
+static unsigned long anon_shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
+					struct address_space *mapping, pgoff_t index,
+					unsigned long orders)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long pages;
+	int order;
+
+	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+	if (!orders)
+		return 0;
+
+	/* Find the highest order that can add into the page cache */
+	order = highest_order(orders);
+	while (orders) {
+		pages = 1UL << order;
+		index = round_down(index, pages);
+		if (!xa_find(&mapping->i_pages, &index,
+			     index + pages - 1, XA_PRESENT))
+			break;
+		order = next_order(&orders, order);
+	}
+
+	return orders;
+}
+#else
+static unsigned long anon_shmem_allowable_huge_orders(struct inode *inode,
+				struct vm_area_struct *vma, pgoff_t index,
+				bool global_huge)
+{
+	return 0;
+}
+
+static unsigned long anon_shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
+					struct address_space *mapping, pgoff_t index,
+					unsigned long orders)
+{
+	return 0;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 static struct folio *shmem_alloc_folio(gfp_t gfp, int order,
 		struct shmem_inode_info *info, pgoff_t index)
 {
@@ -1625,38 +1726,55 @@ static struct folio *shmem_alloc_folio(gfp_t gfp, int order,
 	return folio;
 }
 
-static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
-		struct inode *inode, pgoff_t index,
-		struct mm_struct *fault_mm, bool huge)
+static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
+		gfp_t gfp, struct inode *inode, pgoff_t index,
+		struct mm_struct *fault_mm, unsigned long orders)
 {
 	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info = SHMEM_I(inode);
-	struct folio *folio;
+	struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
+	unsigned long suitable_orders = 0;
+	struct folio *folio = NULL;
 	long pages;
-	int error;
+	int error, order;
 
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
-		huge = false;
+		orders = 0;
 
-	if (huge) {
-		pages = HPAGE_PMD_NR;
-		index = round_down(index, HPAGE_PMD_NR);
+	if (orders > 0) {
+		if (vma && vma_is_anon_shmem(vma)) {
+			suitable_orders = anon_shmem_suitable_orders(inode, vmf,
+							mapping, index, orders);
+		} else if (orders & BIT(HPAGE_PMD_ORDER)) {
+			pages = HPAGE_PMD_NR;
+			suitable_orders = BIT(HPAGE_PMD_ORDER);
+			index = round_down(index, HPAGE_PMD_NR);
 
-		/*
-		 * Check for conflict before waiting on a huge allocation.
-		 * Conflict might be that a huge page has just been allocated
-		 * and added to page cache by a racing thread, or that there
-		 * is already at least one small page in the huge extent.
-		 * Be careful to retry when appropriate, but not forever!
-		 * Elsewhere -EEXIST would be the right code, but not here.
-		 */
-		if (xa_find(&mapping->i_pages, &index,
-				index + HPAGE_PMD_NR - 1, XA_PRESENT))
-			return ERR_PTR(-E2BIG);
+			/*
+			 * Check for conflict before waiting on a huge allocation.
+			 * Conflict might be that a huge page has just been allocated
+			 * and added to page cache by a racing thread, or that there
+			 * is already at least one small page in the huge extent.
+			 * Be careful to retry when appropriate, but not forever!
+			 * Elsewhere -EEXIST would be the right code, but not here.
+			 */
+			if (xa_find(&mapping->i_pages, &index,
+				    index + HPAGE_PMD_NR - 1, XA_PRESENT))
+				return ERR_PTR(-E2BIG);
+		}
 
-		folio = shmem_alloc_folio(gfp, HPAGE_PMD_ORDER, info, index);
-		if (!folio && pages == HPAGE_PMD_NR)
-			count_vm_event(THP_FILE_FALLBACK);
+		order = highest_order(suitable_orders);
+		while (suitable_orders) {
+			pages = 1UL << order;
+			index = round_down(index, pages);
+			folio = shmem_alloc_folio(gfp, order, info, index);
+			if (folio)
+				goto allocated;
+
+			if (pages == HPAGE_PMD_NR)
+				count_vm_event(THP_FILE_FALLBACK);
+			order = next_order(&suitable_orders, order);
+		}
 	} else {
 		pages = 1;
 		folio = shmem_alloc_folio(gfp, 0, info, index);
@@ -1664,6 +1782,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
 	if (!folio)
 		return ERR_PTR(-ENOMEM);
 
+allocated:
 	__folio_set_locked(folio);
 	__folio_set_swapbacked(folio);
 
@@ -1958,7 +2077,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 	struct mm_struct *fault_mm;
 	struct folio *folio;
 	int error;
-	bool alloced;
+	bool alloced, huge;
+	unsigned long orders = 0;
 
 	if (WARN_ON_ONCE(!shmem_mapping(inode->i_mapping)))
 		return -EINVAL;
@@ -2030,14 +2150,21 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 		return 0;
 	}
 
-	if (shmem_is_huge(inode, index, false, fault_mm,
-			  vma ? vma->vm_flags : 0)) {
+	huge = shmem_is_huge(inode, index, false, fault_mm,
+			     vma ? vma->vm_flags : 0);
+	/* Find hugepage orders that are allowed for anonymous shmem. */
+	if (vma && vma_is_anon_shmem(vma))
+		orders = anon_shmem_allowable_huge_orders(inode, vma, index, huge);
+	else if (huge)
+		orders = BIT(HPAGE_PMD_ORDER);
+
+	if (orders > 0) {
 		gfp_t huge_gfp;
 
 		huge_gfp = vma_thp_gfp_mask(vma);
 		huge_gfp = limit_gfp_mask(huge_gfp, gfp);
-		folio = shmem_alloc_and_add_folio(huge_gfp,
-				inode, index, fault_mm, true);
+		folio = shmem_alloc_and_add_folio(vmf, huge_gfp,
+				inode, index, fault_mm, orders);
 		if (!IS_ERR(folio)) {
 			if (folio_test_pmd_mappable(folio))
 				count_vm_event(THP_FILE_ALLOC);
@@ -2047,7 +2174,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 			goto repeat;
 	}
 
-	folio = shmem_alloc_and_add_folio(gfp, inode, index, fault_mm, false);
+	folio = shmem_alloc_and_add_folio(vmf, gfp, inode, index, fault_mm, 0);
 	if (IS_ERR(folio)) {
 		error = PTR_ERR(folio);
 		if (error == -EEXIST)
@@ -2058,7 +2185,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 
 alloced:
 	alloced = true;
-	if (folio_test_pmd_mappable(folio) &&
+	if (folio_test_large(folio) &&
 	    DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
 					folio_next_index(folio) - 1) {
 		struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
-- 
2.39.3




* [PATCH v4 5/6] mm: shmem: add mTHP size alignment in shmem_get_unmapped_area
  2024-06-04 10:17 [PATCH v4 0/6] add mTHP support for anonymous shmem Baolin Wang
                   ` (3 preceding siblings ...)
  2024-06-04 10:17 ` [PATCH v4 4/6] mm: shmem: add mTHP support " Baolin Wang
@ 2024-06-04 10:17 ` Baolin Wang
  2024-06-04 10:17 ` [PATCH v4 6/6] mm: shmem: add mTHP counters for anonymous shmem Baolin Wang
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Baolin Wang @ 2024-06-04 10:17 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, ryan.roberts,
	shy828301, ziy, ioworker0, da.gomez, p.raghav, baolin.wang,
	linux-mm, linux-kernel

Although the top-level hugepage allocation can be turned off, anonymous shmem
can still use mTHP by configuring the sysfs interface located at
'/sys/kernel/mm/transparent_hugepage/hugepages-XXkB/shmem_enabled'. Therefore,
add mTHP size alignment so that shmem_get_unmapped_area() can return a
suitably aligned address.
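
For clarity, a small userspace sketch of the alignment arithmetic used below
(illustrative only): the returned address is adjusted so that it shares the
file offset's position within a hpage_size-sized block, which is what allows
the fault path to map a whole large folio at that address:

/* Sketch of the alignment step: make the candidate address share the
 * file offset's position within a hpage_size-aligned block, moving
 * forward within the inflated area when needed. */
#include <stdio.h>

static unsigned long align_for_hpage(unsigned long inflated_addr,
                                     unsigned long offset_in_file,
                                     unsigned long hpage_size)
{
        unsigned long offset = offset_in_file & (hpage_size - 1);
        unsigned long inflated_offset = inflated_addr & (hpage_size - 1);

        inflated_addr += offset - inflated_offset;
        if (inflated_offset > offset)
                inflated_addr += hpage_size;
        return inflated_addr;
}

int main(void)
{
        /* e.g. a 64K mTHP and a mapping that starts at file offset 0 */
        printf("%#lx\n", align_for_hpage(0x7f1234567000UL, 0, 0x10000UL));
        return 0;
}

With the example values this prints 0x7f1234570000, the next address above
the hint that keeps the 64K-offset congruence.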

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Lance Yang <ioworker0@gmail.com>
---
 mm/shmem.c | 40 +++++++++++++++++++++++++++++++---------
 1 file changed, 31 insertions(+), 9 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 9a8533482208..2ecc41521dbb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2394,6 +2394,7 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 	unsigned long inflated_len;
 	unsigned long inflated_addr;
 	unsigned long inflated_offset;
+	unsigned long hpage_size;
 
 	if (len > TASK_SIZE)
 		return -ENOMEM;
@@ -2412,8 +2413,6 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 
 	if (shmem_huge == SHMEM_HUGE_DENY)
 		return addr;
-	if (len < HPAGE_PMD_SIZE)
-		return addr;
 	if (flags & MAP_FIXED)
 		return addr;
 	/*
@@ -2425,8 +2424,11 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 	if (uaddr == addr)
 		return addr;
 
+	hpage_size = HPAGE_PMD_SIZE;
 	if (shmem_huge != SHMEM_HUGE_FORCE) {
 		struct super_block *sb;
+		unsigned long __maybe_unused hpage_orders;
+		int order = 0;
 
 		if (file) {
 			VM_BUG_ON(file->f_op != &shmem_file_operations);
@@ -2439,18 +2441,38 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 			if (IS_ERR(shm_mnt))
 				return addr;
 			sb = shm_mnt->mnt_sb;
+
+			/*
+			 * Find the highest mTHP order used for anonymous shmem to
+			 * provide a suitable alignment address.
+			 */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+			hpage_orders = READ_ONCE(huge_anon_shmem_orders_always);
+			hpage_orders |= READ_ONCE(huge_anon_shmem_orders_within_size);
+			hpage_orders |= READ_ONCE(huge_anon_shmem_orders_madvise);
+			if (SHMEM_SB(sb)->huge != SHMEM_HUGE_NEVER)
+				hpage_orders |= READ_ONCE(huge_anon_shmem_orders_inherit);
+
+			if (hpage_orders > 0) {
+				order = highest_order(hpage_orders);
+				hpage_size = PAGE_SIZE << order;
+			}
+#endif
 		}
-		if (SHMEM_SB(sb)->huge == SHMEM_HUGE_NEVER)
+		if (SHMEM_SB(sb)->huge == SHMEM_HUGE_NEVER && !order)
 			return addr;
 	}
 
-	offset = (pgoff << PAGE_SHIFT) & (HPAGE_PMD_SIZE-1);
-	if (offset && offset + len < 2 * HPAGE_PMD_SIZE)
+	if (len < hpage_size)
+		return addr;
+
+	offset = (pgoff << PAGE_SHIFT) & (hpage_size - 1);
+	if (offset && offset + len < 2 * hpage_size)
 		return addr;
-	if ((addr & (HPAGE_PMD_SIZE-1)) == offset)
+	if ((addr & (hpage_size - 1)) == offset)
 		return addr;
 
-	inflated_len = len + HPAGE_PMD_SIZE - PAGE_SIZE;
+	inflated_len = len + hpage_size - PAGE_SIZE;
 	if (inflated_len > TASK_SIZE)
 		return addr;
 	if (inflated_len < len)
@@ -2463,10 +2485,10 @@ unsigned long shmem_get_unmapped_area(struct file *file,
 	if (inflated_addr & ~PAGE_MASK)
 		return addr;
 
-	inflated_offset = inflated_addr & (HPAGE_PMD_SIZE-1);
+	inflated_offset = inflated_addr & (hpage_size - 1);
 	inflated_addr += offset - inflated_offset;
 	if (inflated_offset > offset)
-		inflated_addr += HPAGE_PMD_SIZE;
+		inflated_addr += hpage_size;
 
 	if (inflated_addr > TASK_SIZE - len)
 		return addr;
-- 
2.39.3




* [PATCH v4 6/6] mm: shmem: add mTHP counters for anonymous shmem
  2024-06-04 10:17 [PATCH v4 0/6] add mTHP support for anonymous shmem Baolin Wang
                   ` (4 preceding siblings ...)
  2024-06-04 10:17 ` [PATCH v4 5/6] mm: shmem: add mTHP size alignment in shmem_get_unmapped_area Baolin Wang
@ 2024-06-04 10:17 ` Baolin Wang
  2024-06-04 23:50 ` [PATCH v4 0/6] add mTHP support " Andrew Morton
       [not found] ` <CGME20240610121040eucas1p2ad07a27dec959bf1658ea9e5f0dd4697@eucas1p2.samsung.com>
  7 siblings, 0 replies; 16+ messages in thread
From: Baolin Wang @ 2024-06-04 10:17 UTC (permalink / raw)
  To: akpm, hughd
  Cc: willy, david, wangkefeng.wang, ying.huang, 21cnbao, ryan.roberts,
	shy828301, ziy, ioworker0, da.gomez, p.raghav, baolin.wang,
	linux-mm, linux-kernel

Add mTHP counters for anonymous shmem.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 include/linux/huge_mm.h |  3 +++
 mm/huge_memory.c        |  6 ++++++
 mm/shmem.c              | 18 +++++++++++++++---
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 909cfc67521d..212cca384d7e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -281,6 +281,9 @@ enum mthp_stat_item {
 	MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
 	MTHP_STAT_SWPOUT,
 	MTHP_STAT_SWPOUT_FALLBACK,
+	MTHP_STAT_FILE_ALLOC,
+	MTHP_STAT_FILE_FALLBACK,
+	MTHP_STAT_FILE_FALLBACK_CHARGE,
 	__MTHP_STAT_COUNT
 };
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1360a1903b66..3fbcd77f5957 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -555,6 +555,9 @@ DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
 DEFINE_MTHP_STAT_ATTR(swpout, MTHP_STAT_SWPOUT);
 DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK);
+DEFINE_MTHP_STAT_ATTR(file_alloc, MTHP_STAT_FILE_ALLOC);
+DEFINE_MTHP_STAT_ATTR(file_fallback, MTHP_STAT_FILE_FALLBACK);
+DEFINE_MTHP_STAT_ATTR(file_fallback_charge, MTHP_STAT_FILE_FALLBACK_CHARGE);
 
 static struct attribute *stats_attrs[] = {
 	&anon_fault_alloc_attr.attr,
@@ -562,6 +565,9 @@ static struct attribute *stats_attrs[] = {
 	&anon_fault_fallback_charge_attr.attr,
 	&swpout_attr.attr,
 	&swpout_fallback_attr.attr,
+	&file_alloc_attr.attr,
+	&file_fallback_attr.attr,
+	&file_fallback_charge_attr.attr,
 	NULL,
 };
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 2ecc41521dbb..d9a11950c586 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1773,6 +1773,9 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
 
 			if (pages == HPAGE_PMD_NR)
 				count_vm_event(THP_FILE_FALLBACK);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+			count_mthp_stat(order, MTHP_STAT_FILE_FALLBACK);
+#endif
 			order = next_order(&suitable_orders, order);
 		}
 	} else {
@@ -1792,9 +1795,15 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
 		if (xa_find(&mapping->i_pages, &index,
 				index + pages - 1, XA_PRESENT)) {
 			error = -EEXIST;
-		} else if (pages == HPAGE_PMD_NR) {
-			count_vm_event(THP_FILE_FALLBACK);
-			count_vm_event(THP_FILE_FALLBACK_CHARGE);
+		} else if (pages > 1) {
+			if (pages == HPAGE_PMD_NR) {
+				count_vm_event(THP_FILE_FALLBACK);
+				count_vm_event(THP_FILE_FALLBACK_CHARGE);
+			}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+			count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_FALLBACK);
+			count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_FALLBACK_CHARGE);
+#endif
 		}
 		goto unlock;
 	}
@@ -2168,6 +2177,9 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 		if (!IS_ERR(folio)) {
 			if (folio_test_pmd_mappable(folio))
 				count_vm_event(THP_FILE_ALLOC);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+			count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_ALLOC);
+#endif
 			goto alloced;
 		}
 		if (PTR_ERR(folio) == -EEXIST)
-- 
2.39.3




* Re: [PATCH v4 1/6] mm: memory: extend finish_fault() to support large folio
  2024-06-04 10:17 ` [PATCH v4 1/6] mm: memory: extend finish_fault() to support large folio Baolin Wang
@ 2024-06-04 14:58   ` kernel test robot
  2024-06-05  1:16     ` Baolin Wang
  0 siblings, 1 reply; 16+ messages in thread
From: kernel test robot @ 2024-06-04 14:58 UTC (permalink / raw)
  To: Baolin Wang, akpm, hughd
  Cc: oe-kbuild-all, willy, david, wangkefeng.wang, ying.huang,
	21cnbao, ryan.roberts, shy828301, ziy, ioworker0, da.gomez,
	p.raghav, baolin.wang, linux-mm, linux-kernel

Hi Baolin,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master v6.10-rc2 next-20240604]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Baolin-Wang/mm-memory-extend-finish_fault-to-support-large-folio/20240604-182028
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/bee11bfd9157e60aaea6db033a4af7c13c982c82.1717495894.git.baolin.wang%40linux.alibaba.com
patch subject: [PATCH v4 1/6] mm: memory: extend finish_fault() to support large folio
config: openrisc-allnoconfig (https://download.01.org/0day-ci/archive/20240604/202406042210.20LfqwNd-lkp@intel.com/config)
compiler: or1k-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240604/202406042210.20LfqwNd-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202406042210.20LfqwNd-lkp@intel.com/

All warnings (new ones prefixed by >>):

   mm/memory.c: In function 'finish_fault':
>> mm/memory.c:4838:29: warning: unused variable 'i' [-Wunused-variable]
    4838 |         int type, nr_pages, i;
         |                             ^


vim +/i +4838 mm/memory.c

  4814	
  4815	/**
  4816	 * finish_fault - finish page fault once we have prepared the page to fault
  4817	 *
  4818	 * @vmf: structure describing the fault
  4819	 *
  4820	 * This function handles all that is needed to finish a page fault once the
  4821	 * page to fault in is prepared. It handles locking of PTEs, inserts PTE for
  4822	 * given page, adds reverse page mapping, handles memcg charges and LRU
  4823	 * addition.
  4824	 *
  4825	 * The function expects the page to be locked and on success it consumes a
  4826	 * reference of a page being mapped (for the PTE which maps it).
  4827	 *
  4828	 * Return: %0 on success, %VM_FAULT_ code in case of error.
  4829	 */
  4830	vm_fault_t finish_fault(struct vm_fault *vmf)
  4831	{
  4832		struct vm_area_struct *vma = vmf->vma;
  4833		struct page *page;
  4834		struct folio *folio;
  4835		vm_fault_t ret;
  4836		bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
  4837			      !(vma->vm_flags & VM_SHARED);
> 4838		int type, nr_pages, i;
  4839		unsigned long addr = vmf->address;
  4840	
  4841		/* Did we COW the page? */
  4842		if (is_cow)
  4843			page = vmf->cow_page;
  4844		else
  4845			page = vmf->page;
  4846	
  4847		/*
  4848		 * check even for read faults because we might have lost our CoWed
  4849		 * page
  4850		 */
  4851		if (!(vma->vm_flags & VM_SHARED)) {
  4852			ret = check_stable_address_space(vma->vm_mm);
  4853			if (ret)
  4854				return ret;
  4855		}
  4856	
  4857		if (pmd_none(*vmf->pmd)) {
  4858			if (PageTransCompound(page)) {
  4859				ret = do_set_pmd(vmf, page);
  4860				if (ret != VM_FAULT_FALLBACK)
  4861					return ret;
  4862			}
  4863	
  4864			if (vmf->prealloc_pte)
  4865				pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);
  4866			else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
  4867				return VM_FAULT_OOM;
  4868		}
  4869	
  4870		folio = page_folio(page);
  4871		nr_pages = folio_nr_pages(folio);
  4872	
  4873		/*
  4874		 * Using per-page fault to maintain the uffd semantics, and same
  4875		 * approach also applies to non-anonymous-shmem faults to avoid
  4876		 * inflating the RSS of the process.
  4877		 */
  4878		if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma))) {
  4879			nr_pages = 1;
  4880		} else if (nr_pages > 1) {
  4881			pgoff_t idx = folio_page_idx(folio, page);
  4882			/* The page offset of vmf->address within the VMA. */
  4883			pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff;
  4884	
  4885			/*
  4886			 * Fallback to per-page fault in case the folio size in page
  4887			 * cache beyond the VMA limits.
  4888			 */
  4889			if (unlikely(vma_off < idx ||
  4890				     vma_off + (nr_pages - idx) > vma_pages(vma))) {
  4891				nr_pages = 1;
  4892			} else {
  4893				/* Now we can set mappings for the whole large folio. */
  4894				addr = vmf->address - idx * PAGE_SIZE;
  4895				page = &folio->page;
  4896			}
  4897		}
  4898	
  4899		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
  4900					       addr, &vmf->ptl);
  4901		if (!vmf->pte)
  4902			return VM_FAULT_NOPAGE;
  4903	
  4904		/* Re-check under ptl */
  4905		if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
  4906			update_mmu_tlb(vma, addr, vmf->pte);
  4907			ret = VM_FAULT_NOPAGE;
  4908			goto unlock;
  4909		} else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
  4910			update_mmu_tlb_range(vma, addr, vmf->pte, nr_pages);
  4911			ret = VM_FAULT_NOPAGE;
  4912			goto unlock;
  4913		}
  4914	
  4915		folio_ref_add(folio, nr_pages - 1);
  4916		set_pte_range(vmf, folio, page, nr_pages, addr);
  4917		type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
  4918		add_mm_counter(vma->vm_mm, type, nr_pages);
  4919		ret = 0;
  4920	
  4921	unlock:
  4922		pte_unmap_unlock(vmf->pte, vmf->ptl);
  4923		return ret;
  4924	}
  4925	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



* Re: [PATCH v4 0/6] add mTHP support for anonymous shmem
  2024-06-04 10:17 [PATCH v4 0/6] add mTHP support for anonymous shmem Baolin Wang
                   ` (5 preceding siblings ...)
  2024-06-04 10:17 ` [PATCH v4 6/6] mm: shmem: add mTHP counters for anonymous shmem Baolin Wang
@ 2024-06-04 23:50 ` Andrew Morton
       [not found] ` <CGME20240610121040eucas1p2ad07a27dec959bf1658ea9e5f0dd4697@eucas1p2.samsung.com>
  7 siblings, 0 replies; 16+ messages in thread
From: Andrew Morton @ 2024-06-04 23:50 UTC (permalink / raw)
  To: Baolin Wang
  Cc: hughd, willy, david, wangkefeng.wang, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, da.gomez, p.raghav,
	linux-mm, linux-kernel

On Tue,  4 Jun 2024 18:17:44 +0800 Baolin Wang <baolin.wang@linux.alibaba.com> wrote:

> base: mm-unstable
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.04s        3.10s         83516.416                  2669684.890
>
> ...
>
> mm-unstable + patchset, anon shmem 64K mTHP enabled
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.08s        0.31s         678630.231                 17082522.495
> 

Geeze, is that the best you can do ;)

It's early and there's review work to be done.  But I'll queue this up
for testing now, as it's clearly something we should finish off and
get merged.




* Re: [PATCH v4 1/6] mm: memory: extend finish_fault() to support large folio
  2024-06-04 14:58   ` kernel test robot
@ 2024-06-05  1:16     ` Baolin Wang
  0 siblings, 0 replies; 16+ messages in thread
From: Baolin Wang @ 2024-06-05  1:16 UTC (permalink / raw)
  To: kernel test robot, akpm, hughd
  Cc: oe-kbuild-all, willy, david, wangkefeng.wang, ying.huang,
	21cnbao, ryan.roberts, shy828301, ziy, ioworker0, da.gomez,
	p.raghav, linux-mm, linux-kernel



On 2024/6/4 22:58, kernel test robot wrote:
> Hi Baolin,
> 
> kernel test robot noticed the following build warnings:
> 
> [auto build test WARNING on akpm-mm/mm-everything]
> [also build test WARNING on linus/master v6.10-rc2 next-20240604]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Baolin-Wang/mm-memory-extend-finish_fault-to-support-large-folio/20240604-182028
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link:    https://lore.kernel.org/r/bee11bfd9157e60aaea6db033a4af7c13c982c82.1717495894.git.baolin.wang%40linux.alibaba.com
> patch subject: [PATCH v4 1/6] mm: memory: extend finish_fault() to support large folio
> config: openrisc-allnoconfig (https://download.01.org/0day-ci/archive/20240604/202406042210.20LfqwNd-lkp@intel.com/config)
> compiler: or1k-linux-gcc (GCC) 13.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240604/202406042210.20LfqwNd-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202406042210.20LfqwNd-lkp@intel.com/
> 
> All warnings (new ones prefixed by >>):
> 
>     mm/memory.c: In function 'finish_fault':
>>> mm/memory.c:4838:29: warning: unused variable 'i' [-Wunused-variable]
>      4838 |         int type, nr_pages, i;
>           |                             ^

Oops, thanks for reporting. I forgot to remove the variable 'i' when
changing to use update_mmu_tlb_range().

I see Andrew has already helped to remove this unused variable 'i'. 
Thanks Andrew.



* Re: [PATCH v4 0/6] add mTHP support for anonymous shmem
       [not found] ` <CGME20240610121040eucas1p2ad07a27dec959bf1658ea9e5f0dd4697@eucas1p2.samsung.com>
@ 2024-06-10 12:10   ` Daniel Gomez
  2024-06-11  2:53     ` Baolin Wang
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Gomez @ 2024-06-10 12:10 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, hughd, willy, david, wangkefeng.wang, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, Pankaj Raghav, linux-mm,
	linux-kernel

Hi Baolin,

On Tue, Jun 04, 2024 at 06:17:44PM +0800, Baolin Wang wrote:
> Anonymous pages already support multi-size THP (mTHP) allocation since
> commit 19eaf44954df, which allows THP to be configured through the sysfs
> interface located at '/sys/kernel/mm/transparent_hugepage/hugepages-XXkB/enabled'.
> 
> However, anonymous shmem ignores the anonymous mTHP rules configured through
> the sysfs interface and can only use PMD-mapped THP, which is not reasonable.
> Many applications implement anonymous page sharing through mmap(MAP_SHARED |
> MAP_ANONYMOUS), especially in database usage scenarios, so users expect a
> unified mTHP strategy for anonymous pages, including anonymous shared pages,
> in order to enjoy the benefits of mTHP: lower latency than PMD-mapped THP,
> smaller memory bloat than PMD-mapped THP, contiguous PTEs on ARM architectures
> to reduce TLB misses, and so on.
> 
> As discussed in the bi-weekly MM meeting[1], the mTHP controls should control
> all of shmem, not only anonymous shmem, but support will be added iteratively.
> Therefore, this patch set starts with support for anonymous shmem.
> 
> The primary strategy is similar to supporting anonymous mTHP. Introduce
> a new interface '/sys/kernel/mm/transparent_hugepage/hugepages-XXkB/shmem_enabled',
> which can take almost the same values as the top-level
> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', adding a new
> "inherit" option and dropping the testing options 'force' and 'deny'.
> By default all sizes are set to "never" except the PMD size, which is set
> to "inherit". This ensures backward compatibility with the top-level
> anonymous shmem setting, while also allowing independent control of
> anonymous shmem enablement for each mTHP size.
> 
> I used the page fault latency tool to measure the performance of faulting in

I'm not familiar with this tool. Could you share which repo/tool you are
referring to?

Also, are you running or are you aware of any other tools/tests available for
shmem that we can use to make sure we do not introduce any regressions?

Thanks!
Daniel

> 1G of anonymous shmem with 32 threads on my test machine (ARM64 architecture,
> 32 cores, 125G memory):
> base: mm-unstable
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.04s        3.10s         83516.416                  2669684.890
> 
> mm-unstable + patchset, anon shmem mTHP disabled
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.02s        3.14s         82936.359                  2630746.027
> 
> mm-unstable + patchset, anon shmem 64K mTHP enabled
> user-time    sys_time    faults_per_sec_per_cpu     faults_per_sec
> 0.08s        0.31s         678630.231                 17082522.495
> 
> From the data above, the patchset has minimal impact when mTHP is not
> enabled (some fluctuation was observed during testing). When 64K mTHP is
> enabled, there is a significant improvement in page fault latency.
> 
> [1] https://lore.kernel.org/all/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com/
> 
> Changes from v3:
>  - Drop 'force' and 'deny' testing options for each mTHP.
>  - Use new helper update_mmu_tlb_range(), per Lance.
>  - Update documentation to drop "anonymous thp" terminology, per David.
>  - Initialize the 'suitable_orders' in shmem_alloc_and_add_folio(),
>    reported by kernel test robot.
>  - Fix the highest mTHP order in shmem_get_unmapped_area().
>  - Update some commit messages.
> 
> Changes from v2:
>  - Rebased to mm/mm-unstable.
>  - Remove 'huge' parameter for shmem_alloc_and_add_folio(), per Lance.
> 
> Changes from v1:
>  - Drop the patch that re-arranges the position of highest_order() and
>    next_order(), per Ryan.
>  - Modify the finish_fault() to fix VA alignment issue, per Ryan and
>    David.
>  - Fix some building issues, reported by Lance and kernel test robot.
>  - Update some commit messages.
> 
> Changes from RFC:
>  - Rebase the patch set against the new mm-unstable branch, per Lance.
>  - Add a new patch to export highest_order() and next_order().
>  - Add a new patch to align mTHP size in shmem_get_unmapped_area().
>  - Handle the uffd case and the VMA limits case when building mapping for
>    large folio in the finish_fault() function, per Ryan.
>  - Remove unnecessary 'order' variable in patch 3, per Kefeng.
>  - Keep the anon shmem counters' name consistency.
>  - Modify the strategy to support mTHP for anonymous shmem, discussed with
>    Ryan and David.
>  - Add reviewed tag from Barry.
>  - Update the commit message.
> 
> Baolin Wang (6):
>   mm: memory: extend finish_fault() to support large folio
>   mm: shmem: add THP validation for PMD-mapped THP related statistics
>   mm: shmem: add multi-size THP sysfs interface for anonymous shmem
>   mm: shmem: add mTHP support for anonymous shmem
>   mm: shmem: add mTHP size alignment in shmem_get_unmapped_area
>   mm: shmem: add mTHP counters for anonymous shmem
> 
>  Documentation/admin-guide/mm/transhuge.rst |  23 ++
>  include/linux/huge_mm.h                    |  23 ++
>  mm/huge_memory.c                           |  17 +-
>  mm/memory.c                                |  57 +++-
>  mm/shmem.c                                 | 344 ++++++++++++++++++---
>  5 files changed, 403 insertions(+), 61 deletions(-)
> 
> -- 
> 2.39.3
> 


* Re: [PATCH v4 3/6] mm: shmem: add multi-size THP sysfs interface for anonymous shmem
       [not found]   ` <CGME20240610122305eucas1p21bfd8a8c999b3fc8bfce04e5feea7bf7@eucas1p2.samsung.com>
@ 2024-06-10 12:23     ` Daniel Gomez
  2024-06-11  2:04       ` Baolin Wang
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Gomez @ 2024-06-10 12:23 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, hughd, willy, david, wangkefeng.wang, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, Pankaj Raghav, linux-mm,
	linux-kernel

Hi Baolin,
On Tue, Jun 04, 2024 at 06:17:47PM +0800, Baolin Wang wrote:
> To support the use of mTHP with anonymous shmem, add a new sysfs interface
> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
> directory for each mTHP to control whether shmem is enabled for that mTHP,
> with a value similar to the top level 'shmem_enabled', which can be set to:
> "always", "inherit (to inherit the top level setting)", "within_size", "advise",
> "never". An 'inherit' option is added to ensure compatibility with these
> global settings, and the options 'force' and 'deny' are dropped, which are
> rather testing artifacts from the old ages.
> 
> By default, PMD-sized hugepages have enabled="inherit" and all other hugepage
> sizes have enabled="never" for '/sys/kernel/mm/transparent_hugepage/hugepages-xxkB/shmem_enabled'.
> 
> In addition, if top level value is 'force', then only PMD-sized hugepages
> have enabled="inherit", otherwise configuration will be failed and vice versa.
> That means now we will avoid using non-PMD sized THP to override the global
> huge allocation.
> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  Documentation/admin-guide/mm/transhuge.rst | 23 ++++++
>  include/linux/huge_mm.h                    | 10 +++
>  mm/huge_memory.c                           | 11 +--
>  mm/shmem.c                                 | 96 ++++++++++++++++++++++
>  4 files changed, 132 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index d414d3f5592a..b76d15e408b3 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -332,6 +332,29 @@ deny
>  force
>      Force the huge option on for all - very useful for testing;
>  
> +Shmem can also use "multi-size THP" (mTHP) by adding a new sysfs knob to control
> +mTHP allocation: '/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled',
> +and its value for each mTHP is essentially consistent with the global setting.
> +An 'inherit' option is added to ensure compatibility with these global settings.
> +Conversely, the options 'force' and 'deny' are dropped, which are rather testing
> +artifacts from the old ages.
> +always
> +    Attempt to allocate <size> huge pages every time we need a new page;
> +
> +inherit
> +    Inherit the top-level "shmem_enabled" value. By default, PMD-sized hugepages
> +    have enabled="inherit" and all other hugepage sizes have enabled="never";
> +
> +never
> +    Do not allocate <size> huge pages;
> +
> +within_size
> +    Only allocate <size> huge page if it will be fully within i_size.
> +    Also respect fadvise()/madvise() hints;
> +
> +advise
> +    Only allocate <size> huge pages if requested with fadvise()/madvise();
> +
>  Need of application restart
>  ===========================
>  
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 020e2344eb86..fac21548c5de 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -6,6 +6,7 @@
>  #include <linux/mm_types.h>
>  
>  #include <linux/fs.h> /* only for vma_is_dax() */
> +#include <linux/kobject.h>
>  
>  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
>  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> @@ -63,6 +64,7 @@ ssize_t single_hugepage_flag_show(struct kobject *kobj,
>  				  struct kobj_attribute *attr, char *buf,
>  				  enum transparent_hugepage_flag flag);
>  extern struct kobj_attribute shmem_enabled_attr;
> +extern struct kobj_attribute thpsize_shmem_enabled_attr;
>  
>  /*
>   * Mask of all large folio orders supported for anonymous THP; all orders up to
> @@ -265,6 +267,14 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>  	return __thp_vma_allowable_orders(vma, vm_flags, tva_flags, orders);
>  }
>  
> +struct thpsize {
> +	struct kobject kobj;
> +	struct list_head node;
> +	int order;
> +};
> +
> +#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
> +
>  enum mthp_stat_item {
>  	MTHP_STAT_ANON_FAULT_ALLOC,
>  	MTHP_STAT_ANON_FAULT_FALLBACK,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 8e49f402d7c7..1360a1903b66 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -449,14 +449,6 @@ static void thpsize_release(struct kobject *kobj);
>  static DEFINE_SPINLOCK(huge_anon_orders_lock);
>  static LIST_HEAD(thpsize_list);
>  
> -struct thpsize {
> -	struct kobject kobj;
> -	struct list_head node;
> -	int order;
> -};
> -
> -#define to_thpsize(kobj) container_of(kobj, struct thpsize, kobj)
> -
>  static ssize_t thpsize_enabled_show(struct kobject *kobj,
>  				    struct kobj_attribute *attr, char *buf)
>  {
> @@ -517,6 +509,9 @@ static struct kobj_attribute thpsize_enabled_attr =
>  
>  static struct attribute *thpsize_attrs[] = {
>  	&thpsize_enabled_attr.attr,
> +#ifdef CONFIG_SHMEM
> +	&thpsize_shmem_enabled_attr.attr,
> +#endif
>  	NULL,
>  };
>  
> diff --git a/mm/shmem.c b/mm/shmem.c
> index ae358efc397a..643ff7516b4d 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -131,6 +131,14 @@ struct shmem_options {
>  #define SHMEM_SEEN_QUOTA 32
>  };
>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static unsigned long huge_anon_shmem_orders_always __read_mostly;
> +static unsigned long huge_anon_shmem_orders_madvise __read_mostly;
> +static unsigned long huge_anon_shmem_orders_inherit __read_mostly;
> +static unsigned long huge_anon_shmem_orders_within_size __read_mostly;
> +static DEFINE_SPINLOCK(huge_anon_shmem_orders_lock);
> +#endif

Since we are also applying the new sysfs knob controls to tmpfs and anon mm,
should we rename this to get rid of the anon prefix?

> +
>  #ifdef CONFIG_TMPFS
>  static unsigned long shmem_default_max_blocks(void)
>  {
> @@ -4672,6 +4680,12 @@ void __init shmem_init(void)
>  		SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
>  	else
>  		shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */
> +
> +	/*
> +	 * Default to setting PMD-sized THP to inherit the global setting and
> +	 * disable all other multi-size THPs, when anonymous shmem uses mTHP.
> +	 */
> +	huge_anon_shmem_orders_inherit = BIT(HPAGE_PMD_ORDER);
>  #endif
>  	return;
>  
> @@ -4731,6 +4745,11 @@ static ssize_t shmem_enabled_store(struct kobject *kobj,
>  			huge != SHMEM_HUGE_NEVER && huge != SHMEM_HUGE_DENY)
>  		return -EINVAL;
>  
> +	/* Do not override huge allocation policy with non-PMD sized mTHP */
> +	if (huge == SHMEM_HUGE_FORCE &&
> +	    huge_anon_shmem_orders_inherit != BIT(HPAGE_PMD_ORDER))
> +		return -EINVAL;
> +
>  	shmem_huge = huge;
>  	if (shmem_huge > SHMEM_HUGE_DENY)
>  		SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge;
> @@ -4738,6 +4757,83 @@ static ssize_t shmem_enabled_store(struct kobject *kobj,
>  }
>  
>  struct kobj_attribute shmem_enabled_attr = __ATTR_RW(shmem_enabled);
> +
> +static ssize_t thpsize_shmem_enabled_show(struct kobject *kobj,
> +					  struct kobj_attribute *attr, char *buf)
> +{
> +	int order = to_thpsize(kobj)->order;
> +	const char *output;
> +
> +	if (test_bit(order, &huge_anon_shmem_orders_always))
> +		output = "[always] inherit within_size advise never";
> +	else if (test_bit(order, &huge_anon_shmem_orders_inherit))
> +		output = "always [inherit] within_size advise never";
> +	else if (test_bit(order, &huge_anon_shmem_orders_within_size))
> +		output = "always inherit [within_size] advise never";
> +	else if (test_bit(order, &huge_anon_shmem_orders_madvise))
> +		output = "always inherit within_size [advise] never";
> +	else
> +		output = "always inherit within_size advise [never]";
> +
> +	return sysfs_emit(buf, "%s\n", output);
> +}
> +
> +static ssize_t thpsize_shmem_enabled_store(struct kobject *kobj,
> +					   struct kobj_attribute *attr,
> +					   const char *buf, size_t count)
> +{
> +	int order = to_thpsize(kobj)->order;
> +	ssize_t ret = count;
> +
> +	if (sysfs_streq(buf, "always")) {
> +		spin_lock(&huge_anon_shmem_orders_lock);
> +		clear_bit(order, &huge_anon_shmem_orders_inherit);
> +		clear_bit(order, &huge_anon_shmem_orders_madvise);
> +		clear_bit(order, &huge_anon_shmem_orders_within_size);
> +		set_bit(order, &huge_anon_shmem_orders_always);
> +		spin_unlock(&huge_anon_shmem_orders_lock);
> +	} else if (sysfs_streq(buf, "inherit")) {
> +		/* Do not override huge allocation policy with non-PMD sized mTHP */
> +		if (shmem_huge == SHMEM_HUGE_FORCE &&
> +		    order != HPAGE_PMD_ORDER)
> +			return -EINVAL;
> +
> +		spin_lock(&huge_anon_shmem_orders_lock);
> +		clear_bit(order, &huge_anon_shmem_orders_always);
> +		clear_bit(order, &huge_anon_shmem_orders_madvise);
> +		clear_bit(order, &huge_anon_shmem_orders_within_size);
> +		set_bit(order, &huge_anon_shmem_orders_inherit);
> +		spin_unlock(&huge_anon_shmem_orders_lock);
> +	} else if (sysfs_streq(buf, "within_size")) {
> +		spin_lock(&huge_anon_shmem_orders_lock);
> +		clear_bit(order, &huge_anon_shmem_orders_always);
> +		clear_bit(order, &huge_anon_shmem_orders_inherit);
> +		clear_bit(order, &huge_anon_shmem_orders_madvise);
> +		set_bit(order, &huge_anon_shmem_orders_within_size);
> +		spin_unlock(&huge_anon_shmem_orders_lock);
> +	} else if (sysfs_streq(buf, "madvise")) {
> +		spin_lock(&huge_anon_shmem_orders_lock);
> +		clear_bit(order, &huge_anon_shmem_orders_always);
> +		clear_bit(order, &huge_anon_shmem_orders_inherit);
> +		clear_bit(order, &huge_anon_shmem_orders_within_size);
> +		set_bit(order, &huge_anon_shmem_orders_madvise);
> +		spin_unlock(&huge_anon_shmem_orders_lock);
> +	} else if (sysfs_streq(buf, "never")) {
> +		spin_lock(&huge_anon_shmem_orders_lock);
> +		clear_bit(order, &huge_anon_shmem_orders_always);
> +		clear_bit(order, &huge_anon_shmem_orders_inherit);
> +		clear_bit(order, &huge_anon_shmem_orders_within_size);
> +		clear_bit(order, &huge_anon_shmem_orders_madvise);
> +		spin_unlock(&huge_anon_shmem_orders_lock);
> +	} else {
> +		ret = -EINVAL;
> +	}
> +
> +	return ret;
> +}
> +
> +struct kobj_attribute thpsize_shmem_enabled_attr =
> +	__ATTR(shmem_enabled, 0644, thpsize_shmem_enabled_show, thpsize_shmem_enabled_store);
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */
>  
>  #else /* !CONFIG_SHMEM */
> -- 
> 2.39.3
> 


* Re: [PATCH v4 4/6] mm: shmem: add mTHP support for anonymous shmem
       [not found]   ` <CGME20240610132756eucas1p1d892ccbabdb5f8fc4cff55c662f24d75@eucas1p1.samsung.com>
@ 2024-06-10 13:27     ` Daniel Gomez
  2024-06-11  2:42       ` Baolin Wang
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel Gomez @ 2024-06-10 13:27 UTC (permalink / raw)
  To: Baolin Wang
  Cc: akpm, hughd, willy, david, wangkefeng.wang, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, Pankaj Raghav, linux-mm,
	linux-kernel

On Tue, Jun 04, 2024 at 06:17:48PM +0800, Baolin Wang wrote:
> Commit 19eaf44954df adds multi-size THP (mTHP) for anonymous pages, that
> can allow THP to be configured through the sysfs interface located at
> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
> 
> However, the anonymous shmem will ignore the anonymous mTHP rule
> configured through the sysfs interface, and can only use the PMD-mapped
> THP, that is not reasonable. Users expect to apply the mTHP rule for
> all anonymous pages, including the anonymous shmem, in order to enjoy
> the benefits of mTHP. For example, lower latency than PMD-mapped THP,
> smaller memory bloat than PMD-mapped THP, contiguous PTEs on ARM architecture
> to reduce TLB miss etc. In addition, the mTHP interfaces can be extended
> to support all shmem/tmpfs scenarios in the future, especially for the
> shmem mmap() case.
> 
> The primary strategy is similar to supporting anonymous mTHP. Introduce
> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
> which can have almost the same values as the top-level
> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
> additional "inherit" option and dropping the testing options 'force' and
> 'deny'. By default all sizes will be set to "never" except PMD size,
> which is set to "inherit". This ensures backward compatibility with the
> anonymous shmem enabled of the top level, meanwhile also allows independent
> control of anonymous shmem enabled for each mTHP.
> 
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
>  include/linux/huge_mm.h |  10 +++
>  mm/shmem.c              | 187 +++++++++++++++++++++++++++++++++-------
>  2 files changed, 167 insertions(+), 30 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index fac21548c5de..909cfc67521d 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -575,6 +575,16 @@ static inline bool thp_migration_supported(void)
>  {
>  	return false;
>  }
> +
> +static inline int highest_order(unsigned long orders)
> +{
> +	return 0;
> +}
> +
> +static inline int next_order(unsigned long *orders, int prev)
> +{
> +	return 0;
> +}
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
>  static inline int split_folio_to_list_to_order(struct folio *folio,
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 643ff7516b4d..9a8533482208 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1611,6 +1611,107 @@ static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
>  	return result;
>  }
>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static unsigned long anon_shmem_allowable_huge_orders(struct inode *inode,

We want to get mTHP orders for tmpfs as well, so could we make this work for
both paths? If so, I'd remove the anon prefix.

> +				struct vm_area_struct *vma, pgoff_t index,
> +				bool global_huge)

Why did you rename the 'huge' variable to 'global_huge'? We were using 'huge' in
shmem_alloc_and_add_folio() before this commit. The rename just seems odd to me,
since I don't see any name conflict inside it.

> +{
> +	unsigned long mask = READ_ONCE(huge_anon_shmem_orders_always);
> +	unsigned long within_size_orders = READ_ONCE(huge_anon_shmem_orders_within_size);
> +	unsigned long vm_flags = vma->vm_flags;
> +	/*
> +	 * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that
> +	 * are enabled for this vma.
> +	 */
> +	unsigned long orders = BIT(PMD_ORDER + 1) - 1;
> +	loff_t i_size;
> +	int order;
> +

We can start with the anon mm path here, but we should exclude the checks that
do not apply to tmpfs.

> +	if ((vm_flags & VM_NOHUGEPAGE) ||
> +	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
> +		return 0;
> +
> +	/* If the hardware/firmware marked hugepage support disabled. */
> +	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
> +		return 0;
> +
> +	/*
> +	 * Following the 'deny' semantics of the top level, force the huge
> +	 * option off from all mounts.
> +	 */
> +	if (shmem_huge == SHMEM_HUGE_DENY)
> +		return 0;
> +
> +	/*
> +	 * Only allow inherit orders if the top-level value is 'force', which
> +	 * means non-PMD sized THP can not override 'huge' mount option now.
> +	 */
> +	if (shmem_huge == SHMEM_HUGE_FORCE)
> +		return READ_ONCE(huge_anon_shmem_orders_inherit);
> +
> +	/* Allow mTHP that will be fully within i_size. */
> +	order = highest_order(within_size_orders);
> +	while (within_size_orders) {
> +		index = round_up(index + 1, order);
> +		i_size = round_up(i_size_read(inode), PAGE_SIZE);
> +		if (i_size >> PAGE_SHIFT >= index) {
> +			mask |= within_size_orders;
> +			break;
> +		}
> +
> +		order = next_order(&within_size_orders, order);
> +	}
> +
> +	if (vm_flags & VM_HUGEPAGE)
> +		mask |= READ_ONCE(huge_anon_shmem_orders_madvise);
> +
> +	if (global_huge)
> +		mask |= READ_ONCE(huge_anon_shmem_orders_inherit);
> +
> +	return orders & mask;
> +}
> +
> +static unsigned long anon_shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
> +					struct address_space *mapping, pgoff_t index,
> +					unsigned long orders)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	unsigned long pages;
> +	int order;
> +
> +	orders = thp_vma_suitable_orders(vma, vmf->address, orders);

This won't apply to tmpfs. I'm wondering if we can apply
shmem_mapping_size_order() [1] here for the tmpfs path, so we have the same
suitable orders for both paths.

[1] https://lore.kernel.org/all/v5acpezkt4ml3j3ufmbgnq5b335envea7xfobvowtaetvbt3an@v3pfkwly5jh2/#t

> +	if (!orders)
> +		return 0;
> +
> +	/* Find the highest order that can add into the page cache */
> +	order = highest_order(orders);
> +	while (orders) {
> +		pages = 1UL << order;
> +		index = round_down(index, pages);
> +		if (!xa_find(&mapping->i_pages, &index,
> +			     index + pages - 1, XA_PRESENT))
> +			break;
> +		order = next_order(&orders, order);
> +	}
> +
> +	return orders;
> +}
> +#else
> +static unsigned long anon_shmem_allowable_huge_orders(struct inode *inode,
> +				struct vm_area_struct *vma, pgoff_t index,
> +				bool global_huge)
> +{
> +	return 0;
> +}
> +
> +static unsigned long anon_shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
> +					struct address_space *mapping, pgoff_t index,
> +					unsigned long orders)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
>  static struct folio *shmem_alloc_folio(gfp_t gfp, int order,
>  		struct shmem_inode_info *info, pgoff_t index)
>  {
> @@ -1625,38 +1726,55 @@ static struct folio *shmem_alloc_folio(gfp_t gfp, int order,
>  	return folio;
>  }
>  
> -static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
> -		struct inode *inode, pgoff_t index,
> -		struct mm_struct *fault_mm, bool huge)
> +static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
> +		gfp_t gfp, struct inode *inode, pgoff_t index,
> +		struct mm_struct *fault_mm, unsigned long orders)
>  {
>  	struct address_space *mapping = inode->i_mapping;
>  	struct shmem_inode_info *info = SHMEM_I(inode);
> -	struct folio *folio;
> +	struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
> +	unsigned long suitable_orders = 0;
> +	struct folio *folio = NULL;
>  	long pages;
> -	int error;
> +	int error, order;
>  
>  	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> -		huge = false;
> +		orders = 0;
>  
> -	if (huge) {
> -		pages = HPAGE_PMD_NR;
> -		index = round_down(index, HPAGE_PMD_NR);
> +	if (orders > 0) {

Can we get rid of this condition if we handle all allowable orders in 'orders',
including order-0 and PMD-order? I agree we do not need the huge flag anymore,
since you have handled all cases in shmem_allowable_huge_orders().

> +		if (vma && vma_is_anon_shmem(vma)) {
> +			suitable_orders = anon_shmem_suitable_orders(inode, vmf,
> +							mapping, index, orders);
> +		} else if (orders & BIT(HPAGE_PMD_ORDER)) {
> +			pages = HPAGE_PMD_NR;
> +			suitable_orders = BIT(HPAGE_PMD_ORDER);
> +			index = round_down(index, HPAGE_PMD_NR);
>  
> -		/*
> -		 * Check for conflict before waiting on a huge allocation.
> -		 * Conflict might be that a huge page has just been allocated
> -		 * and added to page cache by a racing thread, or that there
> -		 * is already at least one small page in the huge extent.
> -		 * Be careful to retry when appropriate, but not forever!
> -		 * Elsewhere -EEXIST would be the right code, but not here.
> -		 */
> -		if (xa_find(&mapping->i_pages, &index,
> -				index + HPAGE_PMD_NR - 1, XA_PRESENT))
> -			return ERR_PTR(-E2BIG);
> +			/*
> +			 * Check for conflict before waiting on a huge allocation.
> +			 * Conflict might be that a huge page has just been allocated
> +			 * and added to page cache by a racing thread, or that there
> +			 * is already at least one small page in the huge extent.
> +			 * Be careful to retry when appropriate, but not forever!
> +			 * Elsewhere -EEXIST would be the right code, but not here.
> +			 */
> +			if (xa_find(&mapping->i_pages, &index,
> +				    index + HPAGE_PMD_NR - 1, XA_PRESENT))
> +				return ERR_PTR(-E2BIG);
> +		}
>  
> -		folio = shmem_alloc_folio(gfp, HPAGE_PMD_ORDER, info, index);
> -		if (!folio && pages == HPAGE_PMD_NR)
> -			count_vm_event(THP_FILE_FALLBACK);
> +		order = highest_order(suitable_orders);
> +		while (suitable_orders) {
> +			pages = 1UL << order;
> +			index = round_down(index, pages);
> +			folio = shmem_alloc_folio(gfp, order, info, index);
> +			if (folio)
> +				goto allocated;
> +
> +			if (pages == HPAGE_PMD_NR)
> +				count_vm_event(THP_FILE_FALLBACK);
> +			order = next_order(&suitable_orders, order);
> +		}
>  	} else {
>  		pages = 1;
>  		folio = shmem_alloc_folio(gfp, 0, info, index);
> @@ -1664,6 +1782,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
>  	if (!folio)
>  		return ERR_PTR(-ENOMEM);
>  
> +allocated:
>  	__folio_set_locked(folio);
>  	__folio_set_swapbacked(folio);
>  
> @@ -1958,7 +2077,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>  	struct mm_struct *fault_mm;
>  	struct folio *folio;
>  	int error;
> -	bool alloced;
> +	bool alloced, huge;
> +	unsigned long orders = 0;
>  
>  	if (WARN_ON_ONCE(!shmem_mapping(inode->i_mapping)))
>  		return -EINVAL;
> @@ -2030,14 +2150,21 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>  		return 0;
>  	}
>  
> -	if (shmem_is_huge(inode, index, false, fault_mm,
> -			  vma ? vma->vm_flags : 0)) {
> +	huge = shmem_is_huge(inode, index, false, fault_mm,
> +			     vma ? vma->vm_flags : 0);
> +	/* Find hugepage orders that are allowed for anonymous shmem. */
> +	if (vma && vma_is_anon_shmem(vma))

I guess we do not want to check the anon path here either (in case you agree to
merge this with tmpfs path).

> +		orders = anon_shmem_allowable_huge_orders(inode, vma, index, huge);
> +	else if (huge)
> +		orders = BIT(HPAGE_PMD_ORDER);

Why not handle this case inside allowable_huge_orders()?

> +
> +	if (orders > 0) {

Does it make sense to handle this case separately anymore? Before, we had the
huge path and order-0. If we handle all cases in allowable_orders(), perhaps we
can simplify this.

>  		gfp_t huge_gfp;
>  
>  		huge_gfp = vma_thp_gfp_mask(vma);

We are also setting this flag regardless of the final order. Meaning that
suitable_orders() might return order-0 and yet we keep the huge gfp flag. Is
that right?

>  		huge_gfp = limit_gfp_mask(huge_gfp, gfp);
> -		folio = shmem_alloc_and_add_folio(huge_gfp,
> -				inode, index, fault_mm, true);
> +		folio = shmem_alloc_and_add_folio(vmf, huge_gfp,
> +				inode, index, fault_mm, orders);
>  		if (!IS_ERR(folio)) {
>  			if (folio_test_pmd_mappable(folio))
>  				count_vm_event(THP_FILE_ALLOC);
> @@ -2047,7 +2174,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>  			goto repeat;
>  	}
>  
> -	folio = shmem_alloc_and_add_folio(gfp, inode, index, fault_mm, false);
> +	folio = shmem_alloc_and_add_folio(vmf, gfp, inode, index, fault_mm, 0);
>  	if (IS_ERR(folio)) {
>  		error = PTR_ERR(folio);
>  		if (error == -EEXIST)
> @@ -2058,7 +2185,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>  
>  alloced:
>  	alloced = true;
> -	if (folio_test_pmd_mappable(folio) &&
> +	if (folio_test_large(folio) &&
>  	    DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
>  					folio_next_index(folio) - 1) {
>  		struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
> -- 
> 2.39.3
> 


* Re: [PATCH v4 3/6] mm: shmem: add multi-size THP sysfs interface for anonymous shmem
  2024-06-10 12:23     ` Daniel Gomez
@ 2024-06-11  2:04       ` Baolin Wang
  0 siblings, 0 replies; 16+ messages in thread
From: Baolin Wang @ 2024-06-11  2:04 UTC (permalink / raw)
  To: Daniel Gomez
  Cc: akpm, hughd, willy, david, wangkefeng.wang, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, Pankaj Raghav, linux-mm,
	linux-kernel



On 2024/6/10 20:23, Daniel Gomez wrote:
> Hi Baolin,
> On Tue, Jun 04, 2024 at 06:17:47PM +0800, Baolin Wang wrote:
>> To support the use of mTHP with anonymous shmem, add a new sysfs interface
>> 'shmem_enabled' in the '/sys/kernel/mm/transparent_hugepage/hugepages-kB/'
>> directory for each mTHP to control whether shmem is enabled for that mTHP,
>> with a value similar to the top level 'shmem_enabled', which can be set to:
>> "always", "inherit (to inherit the top level setting)", "within_size", "advise",
>> "never". An 'inherit' option is added to ensure compatibility with these
>> global settings, and the options 'force' and 'deny' are dropped, which are
>> rather testing artifacts from the old ages.
>>
>> By default, PMD-sized hugepages have enabled="inherit" and all other hugepage
>> sizes have enabled="never" for '/sys/kernel/mm/transparent_hugepage/hugepages-xxkB/shmem_enabled'.
>>
>> In addition, if top level value is 'force', then only PMD-sized hugepages
>> have enabled="inherit", otherwise configuration will be failed and vice versa.
>> That means now we will avoid using non-PMD sized THP to override the global
>> huge allocation.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> [...]
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index ae358efc397a..643ff7516b4d 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -131,6 +131,14 @@ struct shmem_options {
>>   #define SHMEM_SEEN_QUOTA 32
>>   };
>>   
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +static unsigned long huge_anon_shmem_orders_always __read_mostly;
>> +static unsigned long huge_anon_shmem_orders_madvise __read_mostly;
>> +static unsigned long huge_anon_shmem_orders_inherit __read_mostly;
>> +static unsigned long huge_anon_shmem_orders_within_size __read_mostly;
>> +static DEFINE_SPINLOCK(huge_anon_shmem_orders_lock);
>> +#endif
> 
> Since we are also applying the new sysfs knob controls to tmpfs and anon mm,
> should we rename this to get rid of the anon prefix?

Sure. I originally wanted to do this in the patch set that adds mTHP support 
for tmpfs, but yes, I can just drop the 'anon' prefix here as a preparation.
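
Roughly, the preparation would just drop the prefix from the order masks, 
something like (illustrative only, not the final patch):

static unsigned long huge_shmem_orders_always __read_mostly;
static unsigned long huge_shmem_orders_madvise __read_mostly;
static unsigned long huge_shmem_orders_inherit __read_mostly;
static unsigned long huge_shmem_orders_within_size __read_mostly;
static DEFINE_SPINLOCK(huge_shmem_orders_lock);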



* Re: [PATCH v4 4/6] mm: shmem: add mTHP support for anonymous shmem
  2024-06-10 13:27     ` Daniel Gomez
@ 2024-06-11  2:42       ` Baolin Wang
  0 siblings, 0 replies; 16+ messages in thread
From: Baolin Wang @ 2024-06-11  2:42 UTC (permalink / raw)
  To: Daniel Gomez
  Cc: akpm, hughd, willy, david, wangkefeng.wang, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, Pankaj Raghav, linux-mm,
	linux-kernel



On 2024/6/10 21:27, Daniel Gomez wrote:
> On Tue, Jun 04, 2024 at 06:17:48PM +0800, Baolin Wang wrote:
>> Commit 19eaf44954df adds multi-size THP (mTHP) for anonymous pages, that
>> can allow THP to be configured through the sysfs interface located at
>> '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>
>> However, the anonymous shmem will ignore the anonymous mTHP rule
>> configured through the sysfs interface, and can only use the PMD-mapped
>> THP, that is not reasonable. Users expect to apply the mTHP rule for
>> all anonymous pages, including the anonymous shmem, in order to enjoy
>> the benefits of mTHP. For example, lower latency than PMD-mapped THP,
>> smaller memory bloat than PMD-mapped THP, contiguous PTEs on ARM architecture
>> to reduce TLB miss etc. In addition, the mTHP interfaces can be extended
>> to support all shmem/tmpfs scenarios in the future, especially for the
>> shmem mmap() case.
>>
>> The primary strategy is similar to supporting anonymous mTHP. Introduce
>> a new interface '/mm/transparent_hugepage/hugepage-XXkb/shmem_enabled',
>> which can have almost the same values as the top-level
>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled', with adding a new
>> additional "inherit" option and dropping the testing options 'force' and
>> 'deny'. By default all sizes will be set to "never" except PMD size,
>> which is set to "inherit". This ensures backward compatibility with the
>> anonymous shmem enabled of the top level, meanwhile also allows independent
>> control of anonymous shmem enabled for each mTHP.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>>   include/linux/huge_mm.h |  10 +++
>>   mm/shmem.c              | 187 +++++++++++++++++++++++++++++++++-------
>>   2 files changed, 167 insertions(+), 30 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index fac21548c5de..909cfc67521d 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -575,6 +575,16 @@ static inline bool thp_migration_supported(void)
>>   {
>>   	return false;
>>   }
>> +
>> +static inline int highest_order(unsigned long orders)
>> +{
>> +	return 0;
>> +}
>> +
>> +static inline int next_order(unsigned long *orders, int prev)
>> +{
>> +	return 0;
>> +}
>>   #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>   
>>   static inline int split_folio_to_list_to_order(struct folio *folio,
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 643ff7516b4d..9a8533482208 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1611,6 +1611,107 @@ static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
>>   	return result;
>>   }
>>   
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +static unsigned long anon_shmem_allowable_huge_orders(struct inode *inode,
> 
> We want to get mTHP orders for tmpfs as well, so could we make this work for
> both paths? If so, I'd remove the anon prefix.

Yes, I can drop the 'anon' prefix for these functions. But as I said in 
the cover letter, this patch set only supports mTHP for anon shmem as a 
start; patches supporting mTHP for tmpfs will be added iteratively.

>> +				struct vm_area_struct *vma, pgoff_t index,
>> +				bool global_huge)
> 
> Why did you rename the 'huge' variable to 'global_huge'? We were using 'huge' in
> shmem_alloc_and_add_folio() before this commit. The rename just seems odd to me,
> since I don't see any name conflict inside it.

This is so the mTHP 'inherit' option stays compatible with the top-level 
'shmem_enabled' configuration (located at 
'/sys/kernel/mm/transparent_hugepage/shmem_enabled'). The original 'huge' 
name cannot reflect the setting of the top-level huge configuration. 
Moreover, the 'global' terminology follows the naming already used by THP, 
for example hugepage_global_enabled().
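
Stripped down to the relevant checks, the relationship looks roughly like 
this (condensed from the hunk quoted below; not a complete function, and the 
within_size/deny handling is elided):

	unsigned long mask = READ_ONCE(huge_anon_shmem_orders_always);

	/* Top-level 'force': only the 'inherit' (PMD-sized) orders apply. */
	if (shmem_huge == SHMEM_HUGE_FORCE)
		return READ_ONCE(huge_anon_shmem_orders_inherit);

	if (vm_flags & VM_HUGEPAGE)	/* madvise(MADV_HUGEPAGE) on the vma */
		mask |= READ_ONCE(huge_anon_shmem_orders_madvise);

	/* 'global_huge' is the top-level shmem_enabled decision. */
	if (global_huge)
		mask |= READ_ONCE(huge_anon_shmem_orders_inherit);

	return orders & mask;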

>> +{
>> +	unsigned long mask = READ_ONCE(huge_anon_shmem_orders_always);
>> +	unsigned long within_size_orders = READ_ONCE(huge_anon_shmem_orders_within_size);
>> +	unsigned long vm_flags = vma->vm_flags;
>> +	/*
>> +	 * Check all the (large) orders below HPAGE_PMD_ORDER + 1 that
>> +	 * are enabled for this vma.
>> +	 */
>> +	unsigned long orders = BIT(PMD_ORDER + 1) - 1;
>> +	loff_t i_size;
>> +	int order;
>> +
> 
> We can start with the anon mm path here, but we should exclude the checks that
> do not apply to tmpfs.

As I said above, this patch set focuses on anon shmem, so we should discuss 
this in the patch set that adds mTHP support for tmpfs.

> 
>> +	if ((vm_flags & VM_NOHUGEPAGE) ||
>> +	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
>> +		return 0;
>> +
>> +	/* If the hardware/firmware marked hugepage support disabled. */
>> +	if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED))
>> +		return 0;
>> +
>> +	/*
>> +	 * Following the 'deny' semantics of the top level, force the huge
>> +	 * option off from all mounts.
>> +	 */
>> +	if (shmem_huge == SHMEM_HUGE_DENY)
>> +		return 0;
>> +
>> +	/*
>> +	 * Only allow inherit orders if the top-level value is 'force', which
>> +	 * means non-PMD sized THP can not override 'huge' mount option now.
>> +	 */
>> +	if (shmem_huge == SHMEM_HUGE_FORCE)
>> +		return READ_ONCE(huge_anon_shmem_orders_inherit);
>> +
>> +	/* Allow mTHP that will be fully within i_size. */
>> +	order = highest_order(within_size_orders);
>> +	while (within_size_orders) {
>> +		index = round_up(index + 1, order);
>> +		i_size = round_up(i_size_read(inode), PAGE_SIZE);
>> +		if (i_size >> PAGE_SHIFT >= index) {
>> +			mask |= within_size_orders;
>> +			break;
>> +		}
>> +
>> +		order = next_order(&within_size_orders, order);
>> +	}
>> +
>> +	if (vm_flags & VM_HUGEPAGE)
>> +		mask |= READ_ONCE(huge_anon_shmem_orders_madvise);
>> +
>> +	if (global_huge)
>> +		mask |= READ_ONCE(huge_anon_shmem_orders_inherit);
>> +
>> +	return orders & mask;
>> +}
>> +
>> +static unsigned long anon_shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
>> +					struct address_space *mapping, pgoff_t index,
>> +					unsigned long orders)
>> +{
>> +	struct vm_area_struct *vma = vmf->vma;
>> +	unsigned long pages;
>> +	int order;
>> +
>> +	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> 
> This won't apply to tmpfs. I'm wondering if we can apply
> shmem_mapping_size_order() [1] here for the tmpfs path, so we have the same
> suitable orders for both paths.
> 
> [1] https://lore.kernel.org/all/v5acpezkt4ml3j3ufmbgnq5b335envea7xfobvowtaetvbt3an@v3pfkwly5jh2/#t

Ditto.

> 
>> +	if (!orders)
>> +		return 0;
>> +
>> +	/* Find the highest order that can add into the page cache */
>> +	order = highest_order(orders);
>> +	while (orders) {
>> +		pages = 1UL << order;
>> +		index = round_down(index, pages);
>> +		if (!xa_find(&mapping->i_pages, &index,
>> +			     index + pages - 1, XA_PRESENT))
>> +			break;
>> +		order = next_order(&orders, order);
>> +	}
>> +
>> +	return orders;
>> +}
>> +#else
>> +static unsigned long anon_shmem_allowable_huge_orders(struct inode *inode,
>> +				struct vm_area_struct *vma, pgoff_t index,
>> +				bool global_huge)
>> +{
>> +	return 0;
>> +}
>> +
>> +static unsigned long anon_shmem_suitable_orders(struct inode *inode, struct vm_fault *vmf,
>> +					struct address_space *mapping, pgoff_t index,
>> +					unsigned long orders)
>> +{
>> +	return 0;
>> +}
>> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>> +
>>   static struct folio *shmem_alloc_folio(gfp_t gfp, int order,
>>   		struct shmem_inode_info *info, pgoff_t index)
>>   {
>> @@ -1625,38 +1726,55 @@ static struct folio *shmem_alloc_folio(gfp_t gfp, int order,
>>   	return folio;
>>   }
>>   
>> -static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
>> -		struct inode *inode, pgoff_t index,
>> -		struct mm_struct *fault_mm, bool huge)
>> +static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
>> +		gfp_t gfp, struct inode *inode, pgoff_t index,
>> +		struct mm_struct *fault_mm, unsigned long orders)
>>   {
>>   	struct address_space *mapping = inode->i_mapping;
>>   	struct shmem_inode_info *info = SHMEM_I(inode);
>> -	struct folio *folio;
>> +	struct vm_area_struct *vma = vmf ? vmf->vma : NULL;
>> +	unsigned long suitable_orders = 0;
>> +	struct folio *folio = NULL;
>>   	long pages;
>> -	int error;
>> +	int error, order;
>>   
>>   	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
>> -		huge = false;
>> +		orders = 0;
>>   
>> -	if (huge) {
>> -		pages = HPAGE_PMD_NR;
>> -		index = round_down(index, HPAGE_PMD_NR);
>> +	if (orders > 0) {
> 
> Can we get rid of this condition if we handle all allowable orders in 'orders',
> including order-0 and PMD-order? I agree we do not need the huge flag anymore,
> since you have handled all cases in shmem_allowable_huge_orders().

IMO, for order-0 we do not need the suitability validation, so just falling 
back to an order-0 allocation is clearer to me.
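
That is, the order-0 branch in shmem_alloc_and_add_folio() stays as simple as 
in the hunk quoted below (condensed here):

	if (orders > 0) {
		/* pick a suitable large order, highest first ... */
	} else {
		/* order-0: no suitability checks needed */
		pages = 1;
		folio = shmem_alloc_folio(gfp, 0, info, index);
	}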

> 
>> +		if (vma && vma_is_anon_shmem(vma)) {
>> +			suitable_orders = anon_shmem_suitable_orders(inode, vmf,
>> +							mapping, index, orders);
>> +		} else if (orders & BIT(HPAGE_PMD_ORDER)) {
>> +			pages = HPAGE_PMD_NR;
>> +			suitable_orders = BIT(HPAGE_PMD_ORDER);
>> +			index = round_down(index, HPAGE_PMD_NR);
>>   
>> -		/*
>> -		 * Check for conflict before waiting on a huge allocation.
>> -		 * Conflict might be that a huge page has just been allocated
>> -		 * and added to page cache by a racing thread, or that there
>> -		 * is already at least one small page in the huge extent.
>> -		 * Be careful to retry when appropriate, but not forever!
>> -		 * Elsewhere -EEXIST would be the right code, but not here.
>> -		 */
>> -		if (xa_find(&mapping->i_pages, &index,
>> -				index + HPAGE_PMD_NR - 1, XA_PRESENT))
>> -			return ERR_PTR(-E2BIG);
>> +			/*
>> +			 * Check for conflict before waiting on a huge allocation.
>> +			 * Conflict might be that a huge page has just been allocated
>> +			 * and added to page cache by a racing thread, or that there
>> +			 * is already at least one small page in the huge extent.
>> +			 * Be careful to retry when appropriate, but not forever!
>> +			 * Elsewhere -EEXIST would be the right code, but not here.
>> +			 */
>> +			if (xa_find(&mapping->i_pages, &index,
>> +				    index + HPAGE_PMD_NR - 1, XA_PRESENT))
>> +				return ERR_PTR(-E2BIG);
>> +		}
>>   
>> -		folio = shmem_alloc_folio(gfp, HPAGE_PMD_ORDER, info, index);
>> -		if (!folio && pages == HPAGE_PMD_NR)
>> -			count_vm_event(THP_FILE_FALLBACK);
>> +		order = highest_order(suitable_orders);
>> +		while (suitable_orders) {
>> +			pages = 1UL << order;
>> +			index = round_down(index, pages);
>> +			folio = shmem_alloc_folio(gfp, order, info, index);
>> +			if (folio)
>> +				goto allocated;
>> +
>> +			if (pages == HPAGE_PMD_NR)
>> +				count_vm_event(THP_FILE_FALLBACK);
>> +			order = next_order(&suitable_orders, order);
>> +		}
>>   	} else {
>>   		pages = 1;
>>   		folio = shmem_alloc_folio(gfp, 0, info, index);
>> @@ -1664,6 +1782,7 @@ static struct folio *shmem_alloc_and_add_folio(gfp_t gfp,
>>   	if (!folio)
>>   		return ERR_PTR(-ENOMEM);
>>   
>> +allocated:
>>   	__folio_set_locked(folio);
>>   	__folio_set_swapbacked(folio);
>>   
>> @@ -1958,7 +2077,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>>   	struct mm_struct *fault_mm;
>>   	struct folio *folio;
>>   	int error;
>> -	bool alloced;
>> +	bool alloced, huge;
>> +	unsigned long orders = 0;
>>   
>>   	if (WARN_ON_ONCE(!shmem_mapping(inode->i_mapping)))
>>   		return -EINVAL;
>> @@ -2030,14 +2150,21 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>>   		return 0;
>>   	}
>>   
>> -	if (shmem_is_huge(inode, index, false, fault_mm,
>> -			  vma ? vma->vm_flags : 0)) {
>> +	huge = shmem_is_huge(inode, index, false, fault_mm,
>> +			     vma ? vma->vm_flags : 0);
>> +	/* Find hugepage orders that are allowed for anonymous shmem. */
>> +	if (vma && vma_is_anon_shmem(vma))
> 
> I guess we do not want to check the anon path here either (in case you agree to
> merge this with tmpfs path).

Ditto. Should do this in another patch.

> 
>> +		orders = anon_shmem_allowable_huge_orders(inode, vma, index, huge);
>> +	else if (huge)
>> +		orders = BIT(HPAGE_PMD_ORDER);
> 
> Why not handle this case inside allowable_huge_orders()?

Ditto.

> 
>> +
>> +	if (orders > 0) {
> 
> Does it make sense to handle this case separately anymore? Before, we had the
> huge path and order-0. If we handle all cases in allowable_orders(), perhaps we
> can simplify this.
> 
>>   		gfp_t huge_gfp;
>>   
>>   		huge_gfp = vma_thp_gfp_mask(vma);
> 
> We are also setting this flag regardless of the final order. Meaning that
> suitable_orders() might return order-0 and yet we keep the huge gfp flag. Is
> that right?

If anon_shmem_suitable_orders() returns order-0, then 
shmem_alloc_and_add_folio() will return -ENOMEM, which leads to falling 
back to the order-0 allocation with the 'gfp' flags in this function.
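
Condensed from the shmem_get_folio_gfp() hunk in this patch, the flow is 
roughly:

	if (orders > 0) {
		huge_gfp = limit_gfp_mask(vma_thp_gfp_mask(vma), gfp);
		folio = shmem_alloc_and_add_folio(vmf, huge_gfp, inode, index,
						  fault_mm, orders);
		if (!IS_ERR(folio))
			goto alloced;	/* a large folio was allocated */
		/* otherwise fall through (or repeat on -EEXIST) ... */
	}

	/* ... to the order-0 allocation with the original gfp flags */
	folio = shmem_alloc_and_add_folio(vmf, gfp, inode, index, fault_mm, 0);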

> 
>>   		huge_gfp = limit_gfp_mask(huge_gfp, gfp);
>> -		folio = shmem_alloc_and_add_folio(huge_gfp,
>> -				inode, index, fault_mm, true);
>> +		folio = shmem_alloc_and_add_folio(vmf, huge_gfp,
>> +				inode, index, fault_mm, orders);
>>   		if (!IS_ERR(folio)) {
>>   			if (folio_test_pmd_mappable(folio))
>>   				count_vm_event(THP_FILE_ALLOC);
>> @@ -2047,7 +2174,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>>   			goto repeat;
>>   	}
>>   
>> -	folio = shmem_alloc_and_add_folio(gfp, inode, index, fault_mm, false);
>> +	folio = shmem_alloc_and_add_folio(vmf, gfp, inode, index, fault_mm, 0);
>>   	if (IS_ERR(folio)) {
>>   		error = PTR_ERR(folio);
>>   		if (error == -EEXIST)
>> @@ -2058,7 +2185,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>>   
>>   alloced:
>>   	alloced = true;
>> -	if (folio_test_pmd_mappable(folio) &&
>> +	if (folio_test_large(folio) &&
>>   	    DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
>>   					folio_next_index(folio) - 1) {
>>   		struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
>> -- 
>> 2.39.3



* Re: [PATCH v4 0/6] add mTHP support for anonymous shmem
  2024-06-10 12:10   ` Daniel Gomez
@ 2024-06-11  2:53     ` Baolin Wang
  0 siblings, 0 replies; 16+ messages in thread
From: Baolin Wang @ 2024-06-11  2:53 UTC (permalink / raw)
  To: Daniel Gomez
  Cc: akpm, hughd, willy, david, wangkefeng.wang, ying.huang, 21cnbao,
	ryan.roberts, shy828301, ziy, ioworker0, Pankaj Raghav, linux-mm,
	linux-kernel



On 2024/6/10 20:10, Daniel Gomez wrote:
> Hi Baolin,
> 
> On Tue, Jun 04, 2024 at 06:17:44PM +0800, Baolin Wang wrote:
>> Anonymous pages have already been supported for multi-size (mTHP) allocation
>> through commit 19eaf44954df, that can allow THP to be configured through the
>> sysfs interface located at '/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled'.
>>
>> [...]
>>
>> Use the page fault latency tool to measure the performance of 1G anonymous shmem
> 
> I'm not familiar with this tool. Could you share which repo/tool you are
> referring to?

Sure. The git repo is: https://github.com/gormanm/pft.git

And I made a small change to test anon shmem:
diff --git a/pft.c b/pft.c
index 3ab1457..bbcd7e6 100644
--- a/pft.c
+++ b/pft.c
@@ -739,7 +739,7 @@ alloc_test_memory(void)
         int j;

         if (do_shm) {
-               if (p = alloc_shm(bytes)) {
+               if (p = valloc_shared(bytes)) {
                         do_mbind(p, bytes);
                         do_noclear(p, bytes);
                 }
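
For reference, the mapping being exercised is just anonymous shared memory 
from mmap(); a minimal standalone example of that kind of allocation (not 
part of this series; the 1G size and 4K touch stride are only for 
illustration):

#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 30;	/* 1G, matching the test above */
	size_t off;
	char *p;

	/* Anonymous shared memory: the case the per-size shmem_enabled
	 * knobs in this series apply to. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Touch every base page to drive the shmem fault path. */
	for (off = 0; off < len; off += 4096)
		p[off] = 1;

	munmap(p, len);
	return 0;
}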

> Also, are you running or are you aware of any other tools/tests available for
> shmem that we can use to make sure we do not introduce any regressions?

I ran the mm selftest cases, as well as some test cases I wrote for anon 
shmem.



end of thread

Thread overview: 16+ messages
2024-06-04 10:17 [PATCH v4 0/6] add mTHP support for anonymous shmem Baolin Wang
2024-06-04 10:17 ` [PATCH v4 1/6] mm: memory: extend finish_fault() to support large folio Baolin Wang
2024-06-04 14:58   ` kernel test robot
2024-06-05  1:16     ` Baolin Wang
2024-06-04 10:17 ` [PATCH v4 2/6] mm: shmem: add THP validation for PMD-mapped THP related statistics Baolin Wang
2024-06-04 10:17 ` [PATCH v4 3/6] mm: shmem: add multi-size THP sysfs interface for anonymous shmem Baolin Wang
     [not found]   ` <CGME20240610122305eucas1p21bfd8a8c999b3fc8bfce04e5feea7bf7@eucas1p2.samsung.com>
2024-06-10 12:23     ` Daniel Gomez
2024-06-11  2:04       ` Baolin Wang
2024-06-04 10:17 ` [PATCH v4 4/6] mm: shmem: add mTHP support " Baolin Wang
     [not found]   ` <CGME20240610132756eucas1p1d892ccbabdb5f8fc4cff55c662f24d75@eucas1p1.samsung.com>
2024-06-10 13:27     ` Daniel Gomez
2024-06-11  2:42       ` Baolin Wang
2024-06-04 10:17 ` [PATCH v4 5/6] mm: shmem: add mTHP size alignment in shmem_get_unmapped_area Baolin Wang
2024-06-04 10:17 ` [PATCH v4 6/6] mm: shmem: add mTHP counters for anonymous shmem Baolin Wang
2024-06-04 23:50 ` [PATCH v4 0/6] add mTHP support " Andrew Morton
     [not found] ` <CGME20240610121040eucas1p2ad07a27dec959bf1658ea9e5f0dd4697@eucas1p2.samsung.com>
2024-06-10 12:10   ` Daniel Gomez
2024-06-11  2:53     ` Baolin Wang
