* [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists
@ 2024-04-15 8:12 Kefeng Wang
2024-04-15 8:12 ` [PATCH rfc 1/3] mm: prepare more high-order pages to be stored on the per-cpu lists Kefeng Wang
` (3 more replies)
0 siblings, 4 replies; 17+ messages in thread
From: Kefeng Wang @ 2024-04-15 8:12 UTC (permalink / raw)
To: Andrew Morton
Cc: Huang Ying, Mel Gorman, Ryan Roberts, David Hildenbrand,
Barry Song, Vlastimil Babka, Zi Yan, Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm, Kefeng Wang
Both file pages and anonymous pages support large folios, so high-order
pages other than PMD_ORDER will also be allocated frequently, which could
increase zone lock contention. Allowing high-order pages on the PCP lists
could reduce the big zone lock contention, but as commit 44042b449872
("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
pointed out, it may not be a win in all scenarios. Add a new sysfs control
to enable or disable storing the specified high-order pages on the PCP
lists; orders in (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on
the PCP lists by default.
With the perf lock tool, the lock contention from will-it-scale page_fault1
(90 tasks run for 10s, hugepage-2048kB set to never, hugepage-64kB set to
always) is shown below (only the zone spinlock and pcp spinlock are of
interest):
Without patches,
contended total wait max wait avg wait type caller
713 4.64 ms 74.37 us 6.51 us spinlock __alloc_pages+0x23c
With patches,
contended total wait max wait avg wait type caller
2 25.66 us 16.31 us 12.83 us spinlock rmqueue_pcplist+0x2b0
Similar results on shell8 from unixbench,
Without patches,
contended total wait max wait avg wait type caller
4942 901.09 ms 1.31 ms 182.33 us spinlock __alloc_pages+0x23c
1556 298.76 ms 1.23 ms 192.01 us spinlock rmqueue_pcplist+0x2b0
991 182.73 ms 879.80 us 184.39 us spinlock rmqueue_pcplist+0x2b0
With patches,
contended total wait max wait avg wait type caller
988 187.63 ms 855.18 us 189.91 us spinlock rmqueue_pcplist+0x2b0
505 88.99 ms 793.27 us 176.21 us spinlock rmqueue_pcplist+0x2b0
The Benchmarks Score shows a small improvement (0.28%) on shell8, but the
zone lock contention from __alloc_pages() disappeared.
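For reference, the numbers above can be gathered roughly as below (a
sketch; the exact perf options and the will-it-scale invocation are
assumptions and may differ between versions):

  perf lock record -- ./page_fault1_processes -t 90 -s 10
  perf lock contention -i perf.data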
Kefeng Wang (3):
mm: prepare more high-order pages to be stored on the per-cpu lists
mm: add control to allow specified high-order pages stored on PCP list
mm: pcp: show per-order pages count
Documentation/admin-guide/mm/transhuge.rst | 11 ++++
include/linux/gfp.h | 1 +
include/linux/huge_mm.h | 1 +
include/linux/mmzone.h | 10 ++-
include/linux/vmstat.h | 19 ++++++
mm/Kconfig.debug | 8 +++
mm/huge_memory.c | 74 ++++++++++++++++++++++
mm/page_alloc.c | 30 +++++++--
mm/vmstat.c | 16 +++++
9 files changed, 164 insertions(+), 6 deletions(-)
--
2.27.0
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH rfc 1/3] mm: prepare more high-order pages to be stored on the per-cpu lists
2024-04-15 8:12 [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists Kefeng Wang
@ 2024-04-15 8:12 ` Kefeng Wang
2024-04-15 11:41 ` Baolin Wang
2024-04-15 8:12 ` [PATCH rfc 2/3] mm: add control to allow specified high-order pages stored on PCP list Kefeng Wang
` (2 subsequent siblings)
3 siblings, 1 reply; 17+ messages in thread
From: Kefeng Wang @ 2024-04-15 8:12 UTC (permalink / raw)
To: Andrew Morton
Cc: Huang Ying, Mel Gorman, Ryan Roberts, David Hildenbrand,
Barry Song, Vlastimil Babka, Zi Yan, Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm, Kefeng Wang
Both file pages and anonymous pages support large folios, so high-order
pages other than HPAGE_PMD_ORDER (PMD_SHIFT - PAGE_SHIFT) will be allocated
frequently, which will increase zone lock contention. Allowing high-order
pages on the PCP lists could alleviate the big zone lock contention. In
order to allow the orders in (PAGE_ALLOC_COSTLY_ORDER, HPAGE_PMD_ORDER) to
be stored on the per-cpu lists, similar to PMD_ORDER pages, more lists are
added to struct per_cpu_pages (one list per high order), and a new
PCP_MAX_ORDER is added in mmzone.h to replace HPAGE_PMD_ORDER.

But as commit 44042b449872 ("mm/page_alloc: allow high-order pages to be
stored on the per-cpu lists") pointed out, it may not be a win in all
scenarios, so this patch does not yet allow the higher orders to be added
to the PCP lists; the next patch adds a control to enable or disable it.

The struct per_cpu_pages increases in size from 256 bytes (4 cache lines)
to 320 bytes (5 cache lines) on arm64 with defconfig.
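For illustration, a minimal userspace sketch of the new index math (not
kernel code; the constant values are assumptions matching an arm64 4K-page
defconfig, where PMD order is 9):

  #include <stdio.h>

  #define PAGE_ALLOC_COSTLY_ORDER 3
  #define MIGRATE_PCPTYPES 3 /* unmovable, movable, reclaimable */
  #define PCP_MAX_ORDER 9 /* PMD_SHIFT - PAGE_SHIFT */
  /* 3 migratetypes * 4 low orders = 12 low-order lists */
  #define NR_LOWORDER_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
  /* 12 - 4 = 8, so each high order gets exactly one list */
  #define HIGHORDER_PCP_LIST_INDEX (NR_LOWORDER_PCP_LISTS - (PAGE_ALLOC_COSTLY_ORDER + 1))

  int main(void)
  {
  	int order;

  	/* orders 4..9 map to pindexes 12..17 */
  	for (order = PAGE_ALLOC_COSTLY_ORDER + 1; order <= PCP_MAX_ORDER; order++)
  		printf("order %d -> pindex %d\n",
  		       order, order + HIGHORDER_PCP_LIST_INDEX);
  	return 0;
  }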
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
include/linux/mmzone.h | 4 +++-
mm/page_alloc.c | 10 +++++-----
2 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c11b7cde81ef..c745e2f1a0f2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -657,11 +657,13 @@ enum zone_watermarks {
* failures.
*/
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define NR_PCP_THP 1
+#define PCP_MAX_ORDER (PMD_SHIFT - PAGE_SHIFT)
+#define NR_PCP_THP (PCP_MAX_ORDER - PAGE_ALLOC_COSTLY_ORDER)
#else
#define NR_PCP_THP 0
#endif
#define NR_LOWORDER_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
+#define HIGHORDER_PCP_LIST_INDEX (NR_LOWORDER_PCP_LISTS - (PAGE_ALLOC_COSTLY_ORDER + 1))
#define NR_PCP_LISTS (NR_LOWORDER_PCP_LISTS + NR_PCP_THP)
#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b51becf03d1e..2248afc7b73a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -506,8 +506,8 @@ static inline unsigned int order_to_pindex(int migratetype, int order)
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
if (order > PAGE_ALLOC_COSTLY_ORDER) {
- VM_BUG_ON(order != HPAGE_PMD_ORDER);
- return NR_LOWORDER_PCP_LISTS;
+ VM_BUG_ON(order > PCP_MAX_ORDER);
+ return order + HIGHORDER_PCP_LIST_INDEX;
}
#else
VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
@@ -521,8 +521,8 @@ static inline int pindex_to_order(unsigned int pindex)
int order = pindex / MIGRATE_PCPTYPES;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- if (pindex == NR_LOWORDER_PCP_LISTS)
- order = HPAGE_PMD_ORDER;
+ if (pindex >= NR_LOWORDER_PCP_LISTS)
+ order = pindex - HIGHORDER_PCP_LIST_INDEX;
#else
VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
#endif
@@ -535,7 +535,7 @@ static inline bool pcp_allowed_order(unsigned int order)
if (order <= PAGE_ALLOC_COSTLY_ORDER)
return true;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- if (order == HPAGE_PMD_ORDER)
+ if (order == PCP_MAX_ORDER)
return true;
#endif
return false;
--
2.27.0
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH rfc 2/3] mm: add control to allow specified high-order pages stored on PCP list
2024-04-15 8:12 [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists Kefeng Wang
2024-04-15 8:12 ` [PATCH rfc 1/3] mm: prepare more high-order pages to be stored on the per-cpu lists Kefeng Wang
@ 2024-04-15 8:12 ` Kefeng Wang
2024-04-15 8:12 ` [PATCH rfc 3/3] mm: pcp: show per-order pages count Kefeng Wang
2024-04-15 8:18 ` [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists Barry Song
3 siblings, 0 replies; 17+ messages in thread
From: Kefeng Wang @ 2024-04-15 8:12 UTC (permalink / raw)
To: Andrew Morton
Cc: Huang Ying, Mel Gorman, Ryan Roberts, David Hildenbrand,
Barry Song, Vlastimil Babka, Zi Yan, Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm, Kefeng Wang
Storing high-order pages on the PCP lists may not always be a win, and may
even hurt some workloads, so it is disabled by default for high orders
except PMD_ORDER. Since there are already per-supported-THP-size interfaces
to configure mTHP behaviour, add a new control, pcp_enabled, under those
interfaces to allow the user to enable or disable storing the specified
high-order pages on the PCP lists. It can't change the existing behaviour
for order == PMD_ORDER and order <= PAGE_ALLOC_COSTLY_ORDER: those are
always enabled and can't be disabled. Meanwhile, when pcp_enabled is
cleared for one of the other high orders, the pcplists are drained.
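For example (a hypothetical session; with a 4K base page size, the
hugepages-64kB directory corresponds to order 4):

  # allow 64K pages to be cached on the PCP lists
  echo 1 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
  # disable again; already-cached pages are drained back to the buddy
  echo 0 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled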
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
Documentation/admin-guide/mm/transhuge.rst | 11 +++++
include/linux/gfp.h | 1 +
include/linux/huge_mm.h | 1 +
mm/huge_memory.c | 47 ++++++++++++++++++++++
mm/page_alloc.c | 16 ++++++++
5 files changed, 76 insertions(+)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 04eb45a2f940..3cb91336f81a 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -189,6 +189,17 @@ madvise
never
should be self-explanatory.
+
+There's also a sysfs knob to control whether hugepages are stored on the
+PCP lists for high orders (greater than PAGE_ALLOC_COSTLY_ORDER), which
+could reduce zone lock contention when high-order pages are allocated
+frequently. Please note that the PCP behaviour of low-order and PMD-order
+pages cannot be changed; it is possible to enable storing other high-order
+pages on the PCP lists by writing 1, or disable it back by writing 0::
+
+ echo 0 >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/pcp_enabled
+ echo 1 >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/pcp_enabled
+
By default kernel tries to use huge, PMD-mappable zero page on read
page fault to anonymous mapping. It's possible to disable huge zero
page by writing 0 or enable it back by writing 1::
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 450c2cbcf04b..2ae1157abd6e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -365,6 +365,7 @@ extern void page_frag_free(void *addr);
void page_alloc_init_cpuhp(void);
int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
+void drain_all_zone_pages(void);
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
void drain_all_pages(struct zone *zone);
void drain_local_pages(struct zone *zone);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b67294d5814f..86306becfd52 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -108,6 +108,7 @@ extern unsigned long transparent_hugepage_flags;
extern unsigned long huge_anon_orders_always;
extern unsigned long huge_anon_orders_madvise;
extern unsigned long huge_anon_orders_inherit;
+extern unsigned long huge_pcp_allow_orders;
static inline bool hugepage_global_enabled(void)
{
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9a1b57ef9c60..9b8a8aa36526 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -512,8 +512,49 @@ static ssize_t thpsize_enabled_store(struct kobject *kobj,
static struct kobj_attribute thpsize_enabled_attr =
__ATTR(enabled, 0644, thpsize_enabled_show, thpsize_enabled_store);
+unsigned long huge_pcp_allow_orders __read_mostly;
+static ssize_t thpsize_pcp_enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ int order = to_thpsize(kobj)->order;
+
+ return sysfs_emit(buf, "%d\n",
+ !!test_bit(order, &huge_pcp_allow_orders));
+}
+
+static ssize_t thpsize_pcp_enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int order = to_thpsize(kobj)->order;
+ unsigned long value;
+ int ret;
+
+ if (order <= PAGE_ALLOC_COSTLY_ORDER || order == PMD_ORDER)
+ return -EINVAL;
+
+ ret = kstrtoul(buf, 10, &value);
+ if (ret < 0)
+ return ret;
+ if (value > 1)
+ return -EINVAL;
+
+ if (value) {
+ set_bit(order, &huge_pcp_allow_orders);
+ } else {
+ if (test_and_clear_bit(order, &huge_pcp_allow_orders))
+ drain_all_zone_pages();
+ }
+
+ return count;
+}
+
+static struct kobj_attribute thpsize_pcp_enabled_attr = __ATTR(pcp_enabled,
+ 0644, thpsize_pcp_enabled_show, thpsize_pcp_enabled_store);
+
static struct attribute *thpsize_attrs[] = {
&thpsize_enabled_attr.attr,
+ &thpsize_pcp_enabled_attr.attr,
NULL,
};
@@ -624,6 +665,8 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
*/
huge_anon_orders_inherit = BIT(PMD_ORDER);
+ huge_pcp_allow_orders = BIT(PMD_ORDER);
+
*hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj);
if (unlikely(!*hugepage_kobj)) {
pr_err("failed to create transparent hugepage kobject\n");
@@ -658,6 +701,10 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
err = PTR_ERR(thpsize);
goto remove_all;
}
+
+ if (order <= PAGE_ALLOC_COSTLY_ORDER)
+ huge_pcp_allow_orders |= BIT(order);
+
list_add(&thpsize->node, &thpsize_list);
order = next_order(&orders, order);
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2248afc7b73a..25fd3fe30cb0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -537,6 +537,8 @@ static inline bool pcp_allowed_order(unsigned int order)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
if (order == PCP_MAX_ORDER)
return true;
+ if (BIT(order) & huge_pcp_allow_orders)
+ return true;
#endif
return false;
}
@@ -6705,6 +6707,20 @@ void zone_pcp_reset(struct zone *zone)
}
}
+void drain_all_zone_pages(void)
+{
+ struct zone *zone;
+
+ mutex_lock(&pcp_batch_high_lock);
+ for_each_populated_zone(zone)
+ __zone_set_pageset_high_and_batch(zone, 0, 0, 1);
+ __drain_all_pages(NULL, true);
+ for_each_populated_zone(zone)
+ __zone_set_pageset_high_and_batch(zone, zone->pageset_high_min,
+ zone->pageset_high_max, zone->pageset_batch);
+ mutex_unlock(&pcp_batch_high_lock);
+}
+
#ifdef CONFIG_MEMORY_HOTREMOVE
/*
* All pages in the range must be in a single zone, must not contain holes,
--
2.27.0
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH rfc 3/3] mm: pcp: show per-order pages count
2024-04-15 8:12 [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists Kefeng Wang
2024-04-15 8:12 ` [PATCH rfc 1/3] mm: prepare more high-order pages to be stored on the per-cpu lists Kefeng Wang
2024-04-15 8:12 ` [PATCH rfc 2/3] mm: add control to allow specified high-order pages stored on PCP list Kefeng Wang
@ 2024-04-15 8:12 ` Kefeng Wang
2024-04-15 8:18 ` [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists Barry Song
3 siblings, 0 replies; 17+ messages in thread
From: Kefeng Wang @ 2024-04-15 8:12 UTC (permalink / raw)
To: Andrew Morton
Cc: Huang Ying, Mel Gorman, Ryan Roberts, David Hildenbrand,
Barry Song, Vlastimil Babka, Zi Yan, Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm, Kefeng Wang
THIS IS ONLY FOR DEBUG.
Show more detail about the per-order page counts on each cpu in zoneinfo,
and add a new pcp_order_stat in sysfs showing the total count for each
hugepage size.
#cat /proc/zoneinfo
....
cpu: 15
count: 275
high: 529
batch: 63
order0: 59
order1: 28
order2: 28
order3: 6
order4: 0
order5: 0
order6: 0
order7: 0
order8: 0
order9: 0
#cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/pcp_order_stat
10
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
include/linux/mmzone.h | 6 ++++++
include/linux/vmstat.h | 19 +++++++++++++++++++
mm/Kconfig.debug | 8 ++++++++
mm/huge_memory.c | 27 +++++++++++++++++++++++++++
mm/page_alloc.c | 4 ++++
mm/vmstat.c | 16 ++++++++++++++++
6 files changed, 80 insertions(+)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c745e2f1a0f2..c32c01468a77 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -665,6 +665,9 @@ enum zone_watermarks {
#define NR_LOWORDER_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
#define HIGHORDER_PCP_LIST_INDEX (NR_LOWORDER_PCP_LISTS - (PAGE_ALLOC_COSTLY_ORDER + 1))
#define NR_PCP_LISTS (NR_LOWORDER_PCP_LISTS + NR_PCP_THP)
+#ifdef CONFIG_PCP_ORDER_STATS
+#define NR_PCP_ORDER (PAGE_ALLOC_COSTLY_ORDER + NR_PCP_THP + 1)
+#endif
#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost)
#define low_wmark_pages(z) (z->_watermark[WMARK_LOW] + z->watermark_boost)
@@ -702,6 +705,9 @@ struct per_cpu_pages {
/* Lists of pages, one per migrate type stored on the pcp-lists */
struct list_head lists[NR_PCP_LISTS];
+#ifdef CONFIG_PCP_ORDER_STATS
+ int per_order_count[NR_PCP_ORDER]; /* per-order page counts */
+#endif
} ____cacheline_aligned_in_smp;
struct per_cpu_zonestat {
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 735eae6e272c..91843f2d327f 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -624,4 +624,23 @@ static inline void lruvec_stat_sub_folio(struct folio *folio,
{
lruvec_stat_mod_folio(folio, idx, -folio_nr_pages(folio));
}
+
+static inline void pcp_order_stat_mod(struct per_cpu_pages *pcp, int order,
+ int val)
+{
+#ifdef CONFIG_PCP_ORDER_STATS
+ pcp->per_order_count[order] += val;
+#endif
+}
+
+static inline void pcp_order_stat_inc(struct per_cpu_pages *pcp, int order)
+{
+ pcp_order_stat_mod(pcp, order, 1);
+}
+
+static inline void pcp_order_stat_dec(struct per_cpu_pages *pcp, int order)
+{
+ pcp_order_stat_mod(pcp, order, -1);
+}
+
#endif /* _LINUX_VMSTAT_H */
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index afc72fde0f03..57eef0ce809b 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -276,3 +276,11 @@ config PER_VMA_LOCK_STATS
overhead in the page fault path.
If in doubt, say N.
+
+config PCP_ORDER_STATS
+	bool "Per-order statistics for PCP (Per-CPU pageset) lists"
+	help
+	  Say Y to show per-order statistics of the Per-CPU pagesets in
+	  zoneinfo and via pcp_order_stat in sysfs.
+
+	  If in doubt, say N.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9b8a8aa36526..0c6262bb8fe4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -599,12 +599,39 @@ DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
+#ifdef CONFIG_PCP_ORDER_STATS
+static ssize_t pcp_order_stat_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ int order = to_thpsize(kobj)->order;
+ unsigned int counts = 0;
+ struct zone *zone;
+
+ for_each_populated_zone(zone) {
+ struct per_cpu_pages *pcp;
+ int i;
+
+ for_each_online_cpu(i) {
+ pcp = per_cpu_ptr(zone->per_cpu_pageset, i);
+ counts += pcp->per_order_count[order];
+ }
+ }
+
+ return sysfs_emit(buf, "%u\n", counts);
+}
+
+static struct kobj_attribute pcp_order_stat_attr = __ATTR_RO(pcp_order_stat);
+#endif
+
static struct attribute *stats_attrs[] = {
&anon_alloc_attr.attr,
&anon_alloc_fallback_attr.attr,
&anon_swpout_attr.attr,
&anon_swpout_fallback_attr.attr,
&anon_swpin_refault_attr.attr,
+#ifdef CONFIG_PCP_ORDER_STATS
+ &pcp_order_stat_attr.attr,
+#endif
NULL,
};
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 25fd3fe30cb0..f44cdf8dec50 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1185,6 +1185,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
list_del(&page->pcp_list);
count -= nr_pages;
pcp->count -= nr_pages;
+ pcp_order_stat_dec(pcp, order);
__free_one_page(page, pfn, zone, order, mt, FPI_NONE);
trace_mm_page_pcpu_drain(page, order, mt);
@@ -2560,6 +2561,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
pindex = order_to_pindex(migratetype, order);
list_add(&page->pcp_list, &pcp->lists[pindex]);
pcp->count += 1 << order;
+ pcp_order_stat_inc(pcp, order);
batch = READ_ONCE(pcp->batch);
/*
@@ -2957,6 +2959,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
migratetype, alloc_flags);
pcp->count += alloced << order;
+ pcp_order_stat_mod(pcp, order, alloced);
if (unlikely(list_empty(list)))
return NULL;
}
@@ -2964,6 +2967,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
page = list_first_entry(list, struct page, pcp_list);
list_del(&page->pcp_list);
pcp->count -= 1 << order;
+ pcp_order_stat_dec(pcp, order);
} while (check_new_pages(page, order));
return page;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index db79935e4a54..632bb1ed6a53 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1674,6 +1674,19 @@ static bool is_zone_first_populated(pg_data_t *pgdat, struct zone *zone)
return false;
}
+static void zoneinfo_show_pcp_order_stat(struct seq_file *m,
+ struct per_cpu_pages *pcp)
+{
+#ifdef CONFIG_PCP_ORDER_STATS
+ int j;
+
+ for (j = 0; j < NR_PCP_ORDER; j++)
+ seq_printf(m,
+ "\n order%d: %i",
+ j, pcp->per_order_count[j]);
+#endif
+}
+
static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
struct zone *zone)
{
@@ -1748,6 +1761,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
pcp->count,
pcp->high,
pcp->batch);
+
+ zoneinfo_show_pcp_order_stat(m, pcp);
+
#ifdef CONFIG_SMP
pzstats = per_cpu_ptr(zone->per_cpu_zonestats, i);
seq_printf(m, "\n vm stats threshold: %d",
--
2.27.0
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists
2024-04-15 8:12 [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists Kefeng Wang
` (2 preceding siblings ...)
2024-04-15 8:12 ` [PATCH rfc 3/3] mm: pcp: show per-order pages count Kefeng Wang
@ 2024-04-15 8:18 ` Barry Song
2024-04-15 8:59 ` Kefeng Wang
3 siblings, 1 reply; 17+ messages in thread
From: Barry Song @ 2024-04-15 8:18 UTC (permalink / raw)
To: Kefeng Wang
Cc: Andrew Morton, Huang Ying, Mel Gorman, Ryan Roberts,
David Hildenbrand, Barry Song, Vlastimil Babka, Zi Yan,
Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
> Both file pages and anonymous pages support large folios, so high-order
> pages other than PMD_ORDER will also be allocated frequently, which could
> increase zone lock contention. Allowing high-order pages on the PCP lists
> could reduce the big zone lock contention, but as commit 44042b449872
> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
> pointed out, it may not be a win in all scenarios. Add a new sysfs control
> to enable or disable storing the specified high-order pages on the PCP
> lists; orders in (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on
> the PCP lists by default.
This is precisely something Baolin and I have discussed and intended
to implement[1],
but unfortunately, we haven't had the time to do so.
[1] https://lore.kernel.org/linux-mm/13c59ca8-baac-405e-8640-e693c78ef79a@suse.cz/T/#mecb0514ced830ac4df320113bedd7073bea9ab7a
>
> With the perf lock tool, the lock contention from will-it-scale page_fault1
> (90 tasks run for 10s, hugepage-2048kB set to never, hugepage-64kB set to
> always) is shown below (only the zone spinlock and pcp spinlock are of
> interest):
>
> Without patches,
> contended total wait max wait avg wait type caller
> 713 4.64 ms 74.37 us 6.51 us spinlock __alloc_pages+0x23c
>
> With patches,
> contended total wait max wait avg wait type caller
> 2 25.66 us 16.31 us 12.83 us spinlock rmqueue_pcplist+0x2b0
>
> Similar results on shell8 from unixbench,
>
> Without patches,
> contended total wait max wait avg wait type caller
> 4942 901.09 ms 1.31 ms 182.33 us spinlock __alloc_pages+0x23c
> 1556 298.76 ms 1.23 ms 192.01 us spinlock rmqueue_pcplist+0x2b0
> 991 182.73 ms 879.80 us 184.39 us spinlock rmqueue_pcplist+0x2b0
>
> With patches,
> contended total wait max wait avg wait type caller
> 988 187.63 ms 855.18 us 189.91 us spinlock rmqueue_pcplist+0x2b0
> 505 88.99 ms 793.27 us 176.21 us spinlock rmqueue_pcplist+0x2b0
>
> The Benchmarks Score shows a small improvement (0.28%) on shell8, but the
> zone lock contention from __alloc_pages() disappeared.
>
> Kefeng Wang (3):
> mm: prepare more high-order pages to be stored on the per-cpu lists
> mm: add control to allow specified high-order pages stored on PCP list
> mm: pcp: show per-order pages count
>
> Documentation/admin-guide/mm/transhuge.rst | 11 ++++
> include/linux/gfp.h | 1 +
> include/linux/huge_mm.h | 1 +
> include/linux/mmzone.h | 10 ++-
> include/linux/vmstat.h | 19 ++++++
> mm/Kconfig.debug | 8 +++
> mm/huge_memory.c | 74 ++++++++++++++++++++++
> mm/page_alloc.c | 30 +++++++--
> mm/vmstat.c | 16 +++++
> 9 files changed, 164 insertions(+), 6 deletions(-)
>
> --
> 2.27.0
>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists
2024-04-15 8:18 ` [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists Barry Song
@ 2024-04-15 8:59 ` Kefeng Wang
2024-04-15 10:52 ` David Hildenbrand
0 siblings, 1 reply; 17+ messages in thread
From: Kefeng Wang @ 2024-04-15 8:59 UTC (permalink / raw)
To: Barry Song
Cc: Andrew Morton, Huang Ying, Mel Gorman, Ryan Roberts,
David Hildenbrand, Barry Song, Vlastimil Babka, Zi Yan,
Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On 2024/4/15 16:18, Barry Song wrote:
> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>
>> Both file pages and anonymous pages support large folios, so high-order
>> pages other than PMD_ORDER will also be allocated frequently, which could
>> increase zone lock contention. Allowing high-order pages on the PCP lists
>> could reduce the big zone lock contention, but as commit 44042b449872
>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
>> pointed out, it may not be a win in all scenarios. Add a new sysfs control
>> to enable or disable storing the specified high-order pages on the PCP
>> lists; orders in (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on
>> the PCP lists by default.
>
> This is precisely something Baolin and I have discussed and intended
> to implement[1],
> but unfortunately, we haven't had the time to do so.
Indeed, same thing. Recently, we have been working on unixbench/lmbench
optimization. I tested multi-size THP for anonymous memory by hard-coding
PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]; it shows some improvement, but not
for all cases and not very stably, so I re-implemented it according to the
user requirement, enabling it dynamically.
[1]
https://lore.kernel.org/linux-mm/b8f5a47a-af1e-44ed-a89b-460d0be56d2c@huawei.com/
>
> [1] https://lore.kernel.org/linux-mm/13c59ca8-baac-405e-8640-e693c78ef79a@suse.cz/T/#mecb0514ced830ac4df320113bedd7073bea9ab7a
>
>>
>> With the perf lock tool, the lock contention from will-it-scale page_fault1
>> (90 tasks run for 10s, hugepage-2048kB set to never, hugepage-64kB set to
>> always) is shown below (only the zone spinlock and pcp spinlock are of
>> interest):
>>
>> Without patches,
>> contended total wait max wait avg wait type caller
>> 713 4.64 ms 74.37 us 6.51 us spinlock __alloc_pages+0x23c
>>
>> With patches,
>> contended total wait max wait avg wait type caller
>> 2 25.66 us 16.31 us 12.83 us spinlock rmqueue_pcplist+0x2b0
>>
>> Similar results on shell8 from unixbench,
>>
>> Without patches,
>> contended total wait max wait avg wait type caller
>> 4942 901.09 ms 1.31 ms 182.33 us spinlock __alloc_pages+0x23c
>> 1556 298.76 ms 1.23 ms 192.01 us spinlock rmqueue_pcplist+0x2b0
>> 991 182.73 ms 879.80 us 184.39 us spinlock rmqueue_pcplist+0x2b0
>>
>> With patches,
>> contended total wait max wait avg wait type caller
>> 988 187.63 ms 855.18 us 189.91 us spinlock rmqueue_pcplist+0x2b0
>> 505 88.99 ms 793.27 us 176.21 us spinlock rmqueue_pcplist+0x2b0
>>
>> The Benchmarks Score shows a small improvement (0.28%) on shell8, but the
>> zone lock contention from __alloc_pages() disappeared.
>>
>> Kefeng Wang (3):
>> mm: prepare more high-order pages to be stored on the per-cpu lists
>> mm: add control to allow specified high-order pages stored on PCP list
>> mm: pcp: show per-order pages count
>>
>> Documentation/admin-guide/mm/transhuge.rst | 11 ++++
>> include/linux/gfp.h | 1 +
>> include/linux/huge_mm.h | 1 +
>> include/linux/mmzone.h | 10 ++-
>> include/linux/vmstat.h | 19 ++++++
>> mm/Kconfig.debug | 8 +++
>> mm/huge_memory.c | 74 ++++++++++++++++++++++
>> mm/page_alloc.c | 30 +++++++--
>> mm/vmstat.c | 16 +++++
>> 9 files changed, 164 insertions(+), 6 deletions(-)
>>
>> --
>> 2.27.0
>>
>>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists
2024-04-15 8:59 ` Kefeng Wang
@ 2024-04-15 10:52 ` David Hildenbrand
2024-04-15 11:14 ` Barry Song
2024-04-15 12:17 ` Kefeng Wang
0 siblings, 2 replies; 17+ messages in thread
From: David Hildenbrand @ 2024-04-15 10:52 UTC (permalink / raw)
To: Kefeng Wang, Barry Song
Cc: Andrew Morton, Huang Ying, Mel Gorman, Ryan Roberts, Barry Song,
Vlastimil Babka, Zi Yan, Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On 15.04.24 10:59, Kefeng Wang wrote:
>
>
> On 2024/4/15 16:18, Barry Song wrote:
>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>
>>> Both file pages and anonymous pages support large folios, so high-order
>>> pages other than PMD_ORDER will also be allocated frequently, which could
>>> increase zone lock contention. Allowing high-order pages on the PCP lists
>>> could reduce the big zone lock contention, but as commit 44042b449872
>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
>>> pointed out, it may not be a win in all scenarios. Add a new sysfs control
>>> to enable or disable storing the specified high-order pages on the PCP
>>> lists; orders in (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on
>>> the PCP lists by default.
>>
>> This is precisely something Baolin and I have discussed and intended
>> to implement[1],
>> but unfortunately, we haven't had the time to do so.
>
> Indeed, same thing. Recently, we have been working on unixbench/lmbench
> optimization. I tested multi-size THP for anonymous memory by hard-coding
> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]; it shows some improvement, but not
> for all cases and not very stably, so I re-implemented it according to the
> user requirement, enabling it dynamically.
I'm wondering, though, if this is really a suitable candidate for a
sysctl toggle. Can anybody really come up with an educated guess for
these values?
Especially reading "Benchmarks Score shows a small improvement (0.28%)"
and "it may not be a win in all scenarios", to me it mostly sounds like
"minimal impact" -- so who cares?
How much is the cost vs. benefit of just having one sane system
configuration?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists
2024-04-15 10:52 ` David Hildenbrand
@ 2024-04-15 11:14 ` Barry Song
2024-04-15 12:17 ` Kefeng Wang
1 sibling, 0 replies; 17+ messages in thread
From: Barry Song @ 2024-04-15 11:14 UTC (permalink / raw)
To: David Hildenbrand
Cc: Kefeng Wang, Andrew Morton, Huang Ying, Mel Gorman, Ryan Roberts,
Barry Song, Vlastimil Babka, Zi Yan, Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On Mon, Apr 15, 2024 at 6:52 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 15.04.24 10:59, Kefeng Wang wrote:
> >
> >
> > On 2024/4/15 16:18, Barry Song wrote:
> >> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>>
> >>> Both file pages and anonymous pages support large folios, so high-order
> >>> pages other than PMD_ORDER will also be allocated frequently, which could
> >>> increase zone lock contention. Allowing high-order pages on the PCP lists
> >>> could reduce the big zone lock contention, but as commit 44042b449872
> >>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
> >>> pointed out, it may not be a win in all scenarios. Add a new sysfs control
> >>> to enable or disable storing the specified high-order pages on the PCP
> >>> lists; orders in (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on
> >>> the PCP lists by default.
> >>
> >> This is precisely something Baolin and I have discussed and intended
> >> to implement[1],
> >> but unfortunately, we haven't had the time to do so.
> >
> > Indeed, same thing. Recently, we have been working on unixbench/lmbench
> > optimization. I tested multi-size THP for anonymous memory by hard-coding
> > PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]; it shows some improvement, but not
> > for all cases and not very stably, so I re-implemented it according to the
> > user requirement, enabling it dynamically.
>
> I'm wondering, though, if this is really a suitable candidate for a
> sysctl toggle. Can anybody really come up with an educated guess for
> these values?
>
> Especially reading "Benchmarks Score shows a small improvement (0.28%)"
> and "it may not be a win in all scenarios", to me it mostly sounds like
> "minimal impact" -- so who cares?
Considering the original goal of employing PCP to alleviate page allocation
lock contention, and now that we have configured mTHP, for instance, to
64KiB, it's possible that 64KiB could become the most common page allocation
size just like order0. We should expect to see similar improvements as a result.
I'm questioning whether shell8 is the suitable benchmark for this
situation. A mere 0.28% performance enhancement might not be substantial
enough to pique interest.
Shouldn't we have numerous threads allocating and freeing in parallel to truly
gauge the benefits of PCP?
>
> How much is the cost vs. benefit of just having one sane system
> configuration?
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 1/3] mm: prepare more high-order pages to be stored on the per-cpu lists
2024-04-15 8:12 ` [PATCH rfc 1/3] mm: prepare more high-order pages to be stored on the per-cpu lists Kefeng Wang
@ 2024-04-15 11:41 ` Baolin Wang
2024-04-15 12:25 ` Kefeng Wang
0 siblings, 1 reply; 17+ messages in thread
From: Baolin Wang @ 2024-04-15 11:41 UTC (permalink / raw)
To: Kefeng Wang, Andrew Morton
Cc: Huang Ying, Mel Gorman, Ryan Roberts, David Hildenbrand,
Barry Song, Vlastimil Babka, Zi Yan, Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On 2024/4/15 16:12, Kefeng Wang wrote:
> Both file pages and anonymous pages support large folios, so high-order
> pages other than HPAGE_PMD_ORDER (PMD_SHIFT - PAGE_SHIFT) will be allocated
> frequently, which will increase zone lock contention. Allowing high-order
> pages on the PCP lists could alleviate the big zone lock contention. In
> order to allow the orders in (PAGE_ALLOC_COSTLY_ORDER, HPAGE_PMD_ORDER) to
> be stored on the per-cpu lists, similar to PMD_ORDER pages, more lists are
> added to struct per_cpu_pages (one list per high order), and a new
> PCP_MAX_ORDER is added in mmzone.h to replace HPAGE_PMD_ORDER.
>
> But as commit 44042b449872 ("mm/page_alloc: allow high-order pages to be
> stored on the per-cpu lists") pointed out, it may not be a win in all
> scenarios, so this patch does not yet allow the higher orders to be added
> to the PCP lists; the next patch adds a control to enable or disable it.
>
> The struct per_cpu_pages increases in size from 256 bytes (4 cache lines)
> to 320 bytes (5 cache lines) on arm64 with defconfig.
>
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
> include/linux/mmzone.h | 4 +++-
> mm/page_alloc.c | 10 +++++-----
> 2 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index c11b7cde81ef..c745e2f1a0f2 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -657,11 +657,13 @@ enum zone_watermarks {
> * failures.
> */
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -#define NR_PCP_THP 1
> +#define PCP_MAX_ORDER (PMD_SHIFT - PAGE_SHIFT)
> +#define NR_PCP_THP (PCP_MAX_ORDER - PAGE_ALLOC_COSTLY_ORDER)
> #else
> #define NR_PCP_THP 0
> #endif
> #define NR_LOWORDER_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
> +#define HIGHORDER_PCP_LIST_INDEX (NR_LOWORDER_PCP_LISTS - (PAGE_ALLOC_COSTLY_ORDER + 1))
Thanks for starting the discussion.
I am concerned that mixing mTHPs of different migratetypes in a single
pcp list might lead to fragmentation issues, potentially causing
unmovable mTHPs to occupy movable pageblocks, which would reduce
compaction efficiency.
But I am also not sure if it is suitable to add more pcp lists; maybe we
can just add the most commonly used mTHP size as a start, for example 64K?
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists
2024-04-15 10:52 ` David Hildenbrand
2024-04-15 11:14 ` Barry Song
@ 2024-04-15 12:17 ` Kefeng Wang
2024-04-16 0:21 ` Barry Song
1 sibling, 1 reply; 17+ messages in thread
From: Kefeng Wang @ 2024-04-15 12:17 UTC (permalink / raw)
To: David Hildenbrand, Barry Song
Cc: Andrew Morton, Huang Ying, Mel Gorman, Ryan Roberts, Barry Song,
Vlastimil Babka, Zi Yan, Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On 2024/4/15 18:52, David Hildenbrand wrote:
> On 15.04.24 10:59, Kefeng Wang wrote:
>>
>>
>> On 2024/4/15 16:18, Barry Song wrote:
>>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang
>>> <wangkefeng.wang@huawei.com> wrote:
>>>>
>>>> Both file pages and anonymous pages support large folios, so high-order
>>>> pages other than PMD_ORDER will also be allocated frequently, which could
>>>> increase zone lock contention. Allowing high-order pages on the PCP lists
>>>> could reduce the big zone lock contention, but as commit 44042b449872
>>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
>>>> pointed out, it may not be a win in all scenarios. Add a new sysfs control
>>>> to enable or disable storing the specified high-order pages on the PCP
>>>> lists; orders in (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on
>>>> the PCP lists by default.
>>>
>>> This is precisely something Baolin and I have discussed and intended
>>> to implement[1],
>>> but unfortunately, we haven't had the time to do so.
>>
>> Indeed, same thing. Recently, we have been working on unixbench/lmbench
>> optimization. I tested multi-size THP for anonymous memory by hard-coding
>> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]; it shows some improvement, but not
>> for all cases and not very stably, so I re-implemented it according to the
>> user requirement, enabling it dynamically.
>
> I'm wondering, though, if this is really a suitable candidate for a
> sysctl toggle. Can anybody really come up with an educated guess for
> these values?
Not sure this is suitable as a sysctl, but mTHP anon is already enabled via
sysfs; we could trace __alloc_pages() and gather per-order statistics to
decide which high orders to enable on the PCP lists.
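Something like the below could be a starting point (a bpftrace sketch,
assuming __alloc_pages() is kprobe-able on the running kernel and that
arg1 is its order argument):

  # histogram of allocation orders over a 10s window
  bpftrace -e 'kprobe:__alloc_pages { @order = lhist(arg1, 0, 10, 1); }
               interval:s:10 { exit(); }'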
>
> Especially reading "Benchmarks Score shows a small improvement (0.28%)"
> and "it may not be a win in all scenarios", to me it mostly sounds like
> "minimal impact" -- so who cares?
Even though the lock conflicts are eliminated, the performance improvement
is very limited (maybe even fluctuation); it is not a good testcase to show
the improvement, it just shows the zone-lock issue. We need to find a
better testcase, maybe some test on Android (heavy use of 64K, no PMD THP),
or maybe LKP could give some help?

I will try to find another testcase to show the benefit.
>
> How much is the cost vs. benefit of just having one sane system
> configuration?
>
For arm64 with 4K pages, that is five more high orders (4~8) and five more
pcplists; and for the high orders we assume most pages are movable, but
maybe they are not, so enabling it by default may cause more fragmentation,
see 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized
allocations").
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 1/3] mm: prepare more high-order pages to be stored on the per-cpu lists
2024-04-15 11:41 ` Baolin Wang
@ 2024-04-15 12:25 ` Kefeng Wang
0 siblings, 0 replies; 17+ messages in thread
From: Kefeng Wang @ 2024-04-15 12:25 UTC (permalink / raw)
To: Baolin Wang, Andrew Morton
Cc: Huang Ying, Mel Gorman, Ryan Roberts, David Hildenbrand,
Barry Song, Vlastimil Babka, Zi Yan, Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On 2024/4/15 19:41, Baolin Wang wrote:
>
>
> On 2024/4/15 16:12, Kefeng Wang wrote:
>> Both the file pages and anonymous pages support large folio, high-order
>> pages except HPAGE_PMD_ORDER(PMD_SHIFT - PAGE_SHIFT) will be allocated
>> frequently which will increase the zone lock contention, allow high-order
>> pages on pcp lists could alleviate the big zone lock contention, in order
>> to allows high-orders(PAGE_ALLOC_COSTLY_ORDER, HPAGE_PMD_ORDER) to be
>> stored on the per-cpu lists, similar with PMD_ORDER pages, more lists is
>> added in struct per_cpu_pages (one list each high-order pages), also a
>> new PCP_MAX_ORDER instead of HPAGE_PMD_ORDER is added in mmzone.h.
>>
>> But as commit 44042b449872 ("mm/page_alloc: allow high-order pages to be
>> stored on the per-cpu lists") pointed, it may not win in all the scenes,
>> so this don't allow higher-order pages to be added to PCP list, the next
>> will add a control to enable or disable it.
>>
>> The struct per_cpu_pages increases in size from 256(4 cache lines) to
>> 320 bytes (5 cache lines) on arm64 with defconfig.
>>
>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>> ---
>> include/linux/mmzone.h | 4 +++-
>> mm/page_alloc.c | 10 +++++-----
>> 2 files changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index c11b7cde81ef..c745e2f1a0f2 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -657,11 +657,13 @@ enum zone_watermarks {
>> * failures.
>> */
>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> -#define NR_PCP_THP 1
>> +#define PCP_MAX_ORDER (PMD_SHIFT - PAGE_SHIFT)
>> +#define NR_PCP_THP (PCP_MAX_ORDER - PAGE_ALLOC_COSTLY_ORDER)
>> #else
>> #define NR_PCP_THP 0
>> #endif
>> #define NR_LOWORDER_PCP_LISTS (MIGRATE_PCPTYPES *
>> (PAGE_ALLOC_COSTLY_ORDER + 1))
>> +#define HIGHORDER_PCP_LIST_INDEX (NR_LOWORDER_PCP_LISTS -
>> (PAGE_ALLOC_COSTLY_ORDER + 1))
>
> Thanks for starting the discussion.
>
> I am concerned that mixing mTHPs of different migratetypes in a single
> pcp list might lead to fragmentation issues, potentially causing
> unmovable mTHPs to occupy movable pageblocks, which would reduce
> compaction efficiency.
>
Yes, that is why it is not enabled by default.
> But I am also not sure if it is suitable to add more pcp lists; maybe we
> can just add the most commonly used mTHP size as a start, for example 64K?
Do you mean to add only one list, for 64K? I considered that before, but
it is not true for all cases; a different order may be the most used in
different tests, so only the specified high orders are enabled via the
pcp_enabled sysfs. But it is certain that we need to find a case that
shows an improvement when using a high order (e.g. order 4 = 64K) on the
PCP lists.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists
2024-04-15 12:17 ` Kefeng Wang
@ 2024-04-16 0:21 ` Barry Song
2024-04-16 4:50 ` Kefeng Wang
0 siblings, 1 reply; 17+ messages in thread
From: Barry Song @ 2024-04-16 0:21 UTC (permalink / raw)
To: Kefeng Wang
Cc: David Hildenbrand, Andrew Morton, Huang Ying, Mel Gorman,
Ryan Roberts, Barry Song, Vlastimil Babka, Zi Yan,
Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>
>
> On 2024/4/15 18:52, David Hildenbrand wrote:
> > On 15.04.24 10:59, Kefeng Wang wrote:
> >>
> >>
> >> On 2024/4/15 16:18, Barry Song wrote:
> >>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang
> >>> <wangkefeng.wang@huawei.com> wrote:
> >>>>
> >>>> Both file pages and anonymous pages support large folios, so high-order
> >>>> pages other than PMD_ORDER will also be allocated frequently, which could
> >>>> increase zone lock contention. Allowing high-order pages on the PCP lists
> >>>> could reduce the big zone lock contention, but as commit 44042b449872
> >>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
> >>>> pointed out, it may not be a win in all scenarios. Add a new sysfs control
> >>>> to enable or disable storing the specified high-order pages on the PCP
> >>>> lists; orders in (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on
> >>>> the PCP lists by default.
> >>>
> >>> This is precisely something Baolin and I have discussed and intended
> >>> to implement[1],
> >>> but unfortunately, we haven't had the time to do so.
> >>
> >> Indeed, same thing. Recently, we have been working on unixbench/lmbench
> >> optimization. I tested multi-size THP for anonymous memory by hard-coding
> >> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]; it shows some improvement, but not
> >> for all cases and not very stably, so I re-implemented it according to the
> >> user requirement, enabling it dynamically.
> >
> > I'm wondering, though, if this is really a suitable candidate for a
> > sysctl toggle. Can anybody really come up with an educated guess for
> > these values?
>
> Not sure this is suitable as a sysctl, but mTHP anon is already enabled via
> sysfs; we could trace __alloc_pages() and gather per-order statistics to
> decide which high orders to enable on the PCP lists.
>
> >
> > Especially reading "Benchmarks Score shows a small improvement (0.28%)"
> > and "it may not be a win in all scenarios", to me it mostly sounds like
> > "minimal impact" -- so who cares?
>
> Even though the lock conflicts are eliminated, the performance improvement
> is very limited (maybe even fluctuation); it is not a good testcase to show
> the improvement, it just shows the zone-lock issue. We need to find a
> better testcase, maybe some test on Android (heavy use of 64K, no PMD THP),
> or maybe LKP could give some help?
>
> I will try to find another testcase to show the benefit.
Hi Kefeng,
I wonder if you will see some major improvements on 64KiB mTHP using the
microbench below, which I wrote just now, for example in the perf numbers
and in the time to finish the program:
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define DATA_SIZE (2UL * 1024 * 1024)

int main(int argc, char **argv)
{
	/* make 32 concurrent alloc and free of mTHP */
	fork(); fork(); fork(); fork(); fork();

	for (int i = 0; i < 100000; i++) {
		void *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
				  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
		if (addr == MAP_FAILED) {
			perror("fail to mmap");
			return -1;
		}
		memset(addr, 0x11, DATA_SIZE);
		munmap(addr, DATA_SIZE);
	}

	return 0;
}
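A rough way to build and time it (the file name and flags are just an
example):

  gcc -O2 microbench.c -o microbench
  time ./microbench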
>
> >
> > How much is the cost vs. benefit of just having one sane system
> > configuration?
> >
>
> For arm64 with 4K pages, that is five more high orders (4~8) and five more
> pcplists; and for the high orders we assume most pages are movable, but
> maybe they are not, so enabling it by default may cause more fragmentation,
> see 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized
> allocations").
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists
2024-04-16 0:21 ` Barry Song
@ 2024-04-16 4:50 ` Kefeng Wang
2024-04-16 4:58 ` Kefeng Wang
0 siblings, 1 reply; 17+ messages in thread
From: Kefeng Wang @ 2024-04-16 4:50 UTC (permalink / raw)
To: Barry Song
Cc: David Hildenbrand, Andrew Morton, Huang Ying, Mel Gorman,
Ryan Roberts, Barry Song, Vlastimil Babka, Zi Yan,
Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On 2024/4/16 8:21, Barry Song wrote:
> On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>
>>
>>
>> On 2024/4/15 18:52, David Hildenbrand wrote:
>>> On 15.04.24 10:59, Kefeng Wang wrote:
>>>>
>>>>
>>>> On 2024/4/15 16:18, Barry Song wrote:
>>>>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang
>>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>>
>>>>>> Both file pages and anonymous pages support large folios, so high-order
>>>>>> pages other than PMD_ORDER will also be allocated frequently, which could
>>>>>> increase zone lock contention. Allowing high-order pages on the PCP lists
>>>>>> could reduce the big zone lock contention, but as commit 44042b449872
>>>>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
>>>>>> pointed out, it may not be a win in all scenarios. Add a new sysfs control
>>>>>> to enable or disable storing the specified high-order pages on the PCP
>>>>>> lists; orders in (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on
>>>>>> the PCP lists by default.
>>>>>
>>>>> This is precisely something Baolin and I have discussed and intended
>>>>> to implement[1],
>>>>> but unfortunately, we haven't had the time to do so.
>>>>
>>>> Indeed, same thing. Recently, we have been working on unixbench/lmbench
>>>> optimization. I tested multi-size THP for anonymous memory by hard-coding
>>>> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]; it shows some improvement, but not
>>>> for all cases and not very stably, so I re-implemented it according to the
>>>> user requirement, enabling it dynamically.
>>>
>>> I'm wondering, though, if this is really a suitable candidate for a
>>> sysctl toggle. Can anybody really come up with an educated guess for
>>> these values?
>>
>> Not sure this is suitable as a sysctl, but mTHP anon is already enabled via
>> sysfs; we could trace __alloc_pages() and gather per-order statistics to
>> decide which high orders to enable on the PCP lists.
>>
>>>
>>> Especially reading "Benchmarks Score shows a small improvement (0.28%)"
>>> and "it may not be a win in all scenarios", to me it mostly sounds like
>>> "minimal impact" -- so who cares?
>>
>> Even though the lock conflicts are eliminated, the performance improvement
>> is very limited (maybe even fluctuation); it is not a good testcase to show
>> the improvement, it just shows the zone-lock issue. We need to find a
>> better testcase, maybe some test on Android (heavy use of 64K, no PMD THP),
>> or maybe LKP could give some help?
>>
>> I will try to find another testcase to show the benefit.
>
> Hi Kefeng,
>
> I wonder if you will see some major improvements on 64KiB mTHP using the
> microbench below, which I wrote just now, for example in the perf numbers
> and in the time to finish the program:
>
> #include <stdio.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/mman.h>
>
> #define DATA_SIZE (2UL * 1024 * 1024)
>
> int main(int argc, char **argv)
> {
> 	/* make 32 concurrent alloc and free of mTHP */
> 	fork(); fork(); fork(); fork(); fork();
>
> 	for (int i = 0; i < 100000; i++) {
> 		void *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
> 				  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> 		if (addr == MAP_FAILED) {
> 			perror("fail to mmap");
> 			return -1;
> 		}
> 		memset(addr, 0x11, DATA_SIZE);
> 		munmap(addr, DATA_SIZE);
> 	}
>
> 	return 0;
> }
>
1) PCP disabled
1 2 3 4 5 average
real 200.41 202.18 203.16 201.54 200.91 201.64
user 6.49 6.21 6.25 6.31 6.35 6.322
sys 193.3 195.39 196.3 194.65 194.01 194.73
2) PCP enabled
real 198.25 199.26 195.51 199.28 189.12 196.284 -2.66%
user 6.21 6.02 6.02 6.28 6.21 6.148 -2.75%
sys 191.46 192.64 188.96 192.47 182.39 189.584 -2.64%
For the above test, time is reduced by 2.x%.

And re-testing page_fault1 (anon) from will-it-scale:
1) PCP enabled
tasks processes processes_idle threads threads_idle linear
0 0 100 0 100 0
1 1416915 98.95 1418128 98.95 1418128
20 5327312 79.22 3821312 94.36 28362560
40 9437184 58.58 4463657 94.55 56725120
60 8120003 38.16 4736716 94.61 85087680
80 7356508 18.29 4847824 94.46 113450240
100 7256185 1.48 4870096 94.61 141812800
2) PCP disabled
tasks processes processes_idle threads threads_idle linear
0 0 100 0 100 0
1 1365398 98.95 1354502 98.95 1365398
20 5174918 79.22 3722368 94.65 27307960
40 9094265 58.58 4427267 94.82 54615920
60 8021606 38.18 4572896 94.93 81923880
80 7497318 18.2 4637062 94.76 109231840
100 6819897 1.47 4654521 94.63 136539800
------------------------------------
1) vs 2): PCP enabled improves by 3.86%
3) PCP re-enabled
tasks processes processes_idle threads threads_idle linear
0 0 100 0 100 0
1 1419036 98.96 1428403 98.95 1428403
20 5356092 79.23 3851849 94.41 28568060
40 9437184 58.58 4512918 94.63 57136120
60 8252342 38.16 4659552 94.68 85704180
80 7414899 18.26 4790576 94.77 114272240
100 7062902 1.46 4759030 94.64 142840300
4) PCP re-disabled
tasks processes processes_idle threads threads_idle linear
0 0 100 0 100 0
1 1352649 98.95 1354806 98.95 1354806
20 5172924 79.22 3719292 94.64 27096120
40 9174505 58.59 4310649 94.93 54192240
60 8021606 38.17 4552960 94.81 81288360
80 7497318 18.18 4671638 94.81 108384480
100 6823926 1.47 4725955 94.64 135480600
------------------------------------
3) vs 4): PCP enabled improves by 5.43%
Average: 4.645%
>>
>>>
>>> How much is the cost vs. benefit of just having one sane system
>>> configuration?
>>>
>>
>> For arm64 with 4K pages, that is five more high orders (4~8) and five more
>> pcplists; and for the high orders we assume most pages are movable, but
>> maybe they are not, so enabling it by default may cause more fragmentation,
>> see 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized
>> allocations").
>>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists
2024-04-16 4:50 ` Kefeng Wang
@ 2024-04-16 4:58 ` Kefeng Wang
2024-04-16 5:26 ` Barry Song
0 siblings, 1 reply; 17+ messages in thread
From: Kefeng Wang @ 2024-04-16 4:58 UTC (permalink / raw)
To: Barry Song
Cc: David Hildenbrand, Andrew Morton, Huang Ying, Mel Gorman,
Ryan Roberts, Barry Song, Vlastimil Babka, Zi Yan,
Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On 2024/4/16 12:50, Kefeng Wang wrote:
>
>
> On 2024/4/16 8:21, Barry Song wrote:
>> On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang
>> <wangkefeng.wang@huawei.com> wrote:
>>>
>>>
>>>
>>> On 2024/4/15 18:52, David Hildenbrand wrote:
>>>> On 15.04.24 10:59, Kefeng Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/4/15 16:18, Barry Song wrote:
>>>>>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang
>>>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>>>
>>>>>>> Both file pages and anonymous pages support large folios, so high-order
>>>>>>> pages other than PMD_ORDER will also be allocated frequently, which could
>>>>>>> increase zone lock contention. Allowing high-order pages on the PCP lists
>>>>>>> could reduce the big zone lock contention, but as commit 44042b449872
>>>>>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
>>>>>>> pointed out, it may not be a win in all scenarios. Add a new sysfs control
>>>>>>> to enable or disable storing the specified high-order pages on the PCP
>>>>>>> lists; orders in (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on
>>>>>>> the PCP lists by default.
>>>>>>
>>>>>> This is precisely something Baolin and I have discussed and intended
>>>>>> to implement[1],
>>>>>> but unfortunately, we haven't had the time to do so.
>>>>>
>>>>> Indeed, same thing. Recently, we have been working on unixbench/lmbench
>>>>> optimization. I tested multi-size THP for anonymous memory by hard-coding
>>>>> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]; it shows some improvement, but not
>>>>> for all cases and not very stably, so I re-implemented it according to the
>>>>> user requirement, enabling it dynamically.
>>>>
>>>> I'm wondering, though, if this is really a suitable candidate for a
>>>> sysctl toggle. Can anybody really come up with an educated guess for
>>>> these values?
>>>
>>> Not sure this is suitable as a sysctl, but mTHP anon is already enabled via
>>> sysfs; we could trace __alloc_pages() and gather per-order statistics to
>>> decide which high orders to enable on the PCP lists.
>>>
>>>>
>>>> Especially reading "Benchmarks Score shows a small improvement (0.28%)"
>>>> and "it may not be a win in all scenarios", to me it mostly sounds like
>>>> "minimal impact" -- so who cares?
>>>
>>> Even though the lock conflicts are eliminated, the performance improvement
>>> is very limited (maybe even fluctuation); it is not a good testcase to show
>>> the improvement, it just shows the zone-lock issue. We need to find a
>>> better testcase, maybe some test on Android (heavy use of 64K, no PMD THP),
>>> or maybe LKP could give some help?
>>>
>>> I will try to find another testcase to show the benefit.
>>
>> Hi Kefeng,
>>
>> I wonder if you will see some major improvements on 64KiB mTHP using the
>> microbench below, which I wrote just now, for example in the perf numbers
>> and in the time to finish the program:
>>
>> #include <stdio.h>
>> #include <string.h>
>> #include <unistd.h>
>> #include <sys/mman.h>
>>
>> #define DATA_SIZE (2UL * 1024 * 1024)
>>
>> int main(int argc, char **argv)
>> {
>> 	/* make 32 concurrent alloc and free of mTHP */
>> 	fork(); fork(); fork(); fork(); fork();
>>
>> 	for (int i = 0; i < 100000; i++) {
>> 		void *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
>> 				  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>> 		if (addr == MAP_FAILED) {
>> 			perror("fail to mmap");
>> 			return -1;
>> 		}
>> 		memset(addr, 0x11, DATA_SIZE);
>> 		munmap(addr, DATA_SIZE);
>> 	}
>>
>> 	return 0;
>> }
>>
Rebased on next-20240415,
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
Compare with
echo 0 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
echo 1 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
>
> 1) PCP disabled
> 1 2 3 4 5 average
> real 200.41 202.18 203.16 201.54 200.91 201.64
> user 6.49 6.21 6.25 6.31 6.35 6.322
> sys 193.3 195.39 196.3 194.65 194.01 194.73
>
> 2) PCP enabled
> real 198.25 199.26 195.51 199.28 189.12 196.284 -2.66%
> user 6.21 6.02 6.02 6.28 6.21 6.148 -2.75%
> sys 191.46 192.64 188.96 192.47 182.39 189.584 -2.64%
>
> For the above test, time is reduced by 2.x%.
>
>
> And re-testing page_fault1 (anon) from will-it-scale:
>
> 1) PCP enabled
> tasks processes processes_idle threads threads_idle linear
> 0 0 100 0 100 0
> 1 1416915 98.95 1418128 98.95 1418128
> 20 5327312 79.22 3821312 94.36 28362560
> 40 9437184 58.58 4463657 94.55 56725120
> 60 8120003 38.16 4736716 94.61 85087680
> 80 7356508 18.29 4847824 94.46 113450240
> 100 7256185 1.48 4870096 94.61 141812800
>
> 2) PCP disabled
> tasks processes processes_idle threads threads_idle linear
> 0 0 100 0 100 0
> 1 1365398 98.95 1354502 98.95 1365398
> 20 5174918 79.22 3722368 94.65 27307960
> 40 9094265 58.58 4427267 94.82 54615920
> 60 8021606 38.18 4572896 94.93 81923880
> 80 7497318 18.2 4637062 94.76 109231840
> 100 6819897 1.47 4654521 94.63 136539800
>
> ------------------------------------
> 1) vs 2): PCP enabled improves by 3.86%
>
> 3) PCP re-enabled
> tasks processes processes_idle threads threads_idle linear
> 0 0 100 0 100 0
> 1 1419036 98.96 1428403 98.95 1428403
> 20 5356092 79.23 3851849 94.41 28568060
> 40 9437184 58.58 4512918 94.63 57136120
> 60 8252342 38.16 4659552 94.68 85704180
> 80 7414899 18.26 4790576 94.77 114272240
> 100 7062902 1.46 4759030 94.64 142840300
>
> 4) PCP re-disabled
> tasks processes processes_idle threads threads_idle linear
> 0 0 100 0 100 0
> 1 1352649 98.95 1354806 98.95 1354806
> 20 5172924 79.22 3719292 94.64 27096120
> 40 9174505 58.59 4310649 94.93 54192240
> 60 8021606 38.17 4552960 94.81 81288360
> 80 7497318 18.18 4671638 94.81 108384480
> 100 6823926 1.47 4725955 94.64 135480600
>
> ------------------------------------
> 3) vs 4): PCP enabled improves by 5.43%
>
> Average: 4.645%
>
>
>
>
>
>>>
>>>>
>>>> How much is the cost vs. benefit of just having one sane system
>>>> configuration?
>>>>
>>>
>>> For arm64 with 4K pages, that is five more high orders (4~8) and five
>>> more pcplists; and for high orders we assume most of them are movable,
>>> but that may not hold, so enabling it by default may cause more
>>> fragmentation, see 5d0a661d808f ("mm/page_alloc: use only one PCP list
>>> for THP-sized allocations").
>>>
>
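For context on the "five more pcplists" cost mentioned above, the PCP list
indexing in mm/page_alloc.c looks roughly like the sketch below; it is
simplified (config guards and assertions dropped) and meant to illustrate
the mechanism, not to be the exact kernel code:

  /* Simplified from mm/page_alloc.c. */
  static inline unsigned int order_to_pindex(int migratetype, int order)
  {
          /*
           * THP-sized (PMD_ORDER) allocations share a single PCP list,
           * see 5d0a661d808f.
           */
          if (order > PAGE_ALLOC_COSTLY_ORDER)
                  return NR_LOWORDER_PCP_LISTS;

          /* One list per (order, migratetype) pair below that. */
          return (MIGRATE_PCPTYPES * order) + migratetype;
  }

Each additional order allowed on the PCP thus costs MIGRATE_PCPTYPES more
per-cpu lists, which is what has to be weighed against the saved zone-lock
acquisitions.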
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists
2024-04-16 4:58 ` Kefeng Wang
@ 2024-04-16 5:26 ` Barry Song
2024-04-16 7:03 ` David Hildenbrand
0 siblings, 1 reply; 17+ messages in thread
From: Barry Song @ 2024-04-16 5:26 UTC (permalink / raw)
To: Kefeng Wang
Cc: David Hildenbrand, Andrew Morton, Huang Ying, Mel Gorman,
Ryan Roberts, Barry Song, Vlastimil Babka, Zi Yan,
Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On Tue, Apr 16, 2024 at 4:58 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>
>
> On 2024/4/16 12:50, Kefeng Wang wrote:
> >
> >
> > On 2024/4/16 8:21, Barry Song wrote:
> >> On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang
> >> <wangkefeng.wang@huawei.com> wrote:
> >>>
> >>>
> >>>
> >>> On 2024/4/15 18:52, David Hildenbrand wrote:
> >>>> On 15.04.24 10:59, Kefeng Wang wrote:
> >>>>>
> >>>>>
> >>>>> On 2024/4/15 16:18, Barry Song wrote:
> >>>>>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang
> >>>>>> <wangkefeng.wang@huawei.com> wrote:
> >>>>>>>
> >>>>>>> Both file pages and anonymous pages support large folios, so
> >>>>>>> high-order
> >>>>>>> pages other than PMD_ORDER will also be allocated frequently, which
> >>>>>>> could increase zone lock contention. Allowing high-order pages on
> >>>>>>> PCP lists
> >>>>>>> could reduce that contention, but as commit
> >>>>>>> 44042b449872
> >>>>>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu
> >>>>>>> lists")
> >>>>>>> pointed out, it may not win in all scenarios. Add a new sysfs
> >>>>>>> control to
> >>>>>>> enable or disable storing specified high-order pages on PCP lists;
> >>>>>>> orders in
> >>>>>>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on PCP lists by
> >>>>>>> default.
> >>>>>>
> >>>>>> This is precisely something Baolin and I have discussed and intended
> >>>>>> to implement[1],
> >>>>>> but unfortunately, we haven't had the time to do so.
> >>>>>
> >>>>> Indeed, same thing. Recently we have been working on unixbench/lmbench
> >>>>> optimization. I tested multi-size THP for anonymous memory by
> >>>>> hard-coding
> >>>>> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]; it shows some improvement, but
> >>>>> not for all cases and not very stable, so I re-implemented it
> >>>>> according
> >>>>> to the user requirement so it can be enabled dynamically.
> >>>>
> >>>> I'm wondering, though, if this is really a suitable candidate for a
> >>>> sysctl toggle. Can anybody really come up with an educated guess for
> >>>> these values?
> >>>
> >>> Not sure this is suitable as a sysctl, but mTHP anon is already enabled
> >>> via sysfs; we could trace __alloc_pages() and gather order statistics
> >>> to decide which high orders to enable on the PCP lists.
> >>>
> >>>>
> >>>> Especially reading "Benchmarks Score shows a little improvement (0.28%)"
> >>>> and "it may not win in all scenarios", to me it mostly sounds like
> >>>> "minimal impact" -- so who cares?
> >>>
> >>> Even though the lock conflicts are eliminated, the performance
> >>> improvement is very limited (maybe even fluctuation); it is not a good
> >>> testcase to show the improvement, it just shows the zone-lock issue. We
> >>> need to find a better testcase, maybe some test on Android (which
> >>> heavily uses 64K, no PMD THP), or maybe LKP could give some help?
> >>>
> >>> I will try to find another testcase to show the benefit.
> >>
> >> Hi Kefeng,
> >>
> >> I wonder if you will see some major improvements on mTHP 64KiB using
> >> the microbench below, which I wrote just now -- for example, in perf
> >> data and the time to finish the program.
> >>
> >> #include <stdio.h>
> >> #include <string.h>
> >> #include <sys/mman.h>
> >> #include <unistd.h>
> >>
> >> #define DATA_SIZE (2UL * 1024 * 1024)
> >>
> >> int main(int argc, char **argv)
> >> {
> >>         /* make 32 concurrent alloc and free of mTHP */
> >>         fork(); fork(); fork(); fork(); fork();
> >>
> >>         for (int i = 0; i < 100000; i++) {
> >>                 void *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
> >>                                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> >>                 if (addr == MAP_FAILED) {
> >>                         perror("fail to mmap");
> >>                         return -1;
> >>                 }
> >>                 memset(addr, 0x11, DATA_SIZE);
> >>                 munmap(addr, DATA_SIZE);
> >>         }
> >>
> >>         return 0;
> >> }
> >>
>
> Rebased on next-20240415:
>
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
>
> Comparing
> echo 0 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
> against
> echo 1 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
>
> >
> > 1) PCP disabled
> >         1       2       3       4       5       average
> > real    200.41  202.18  203.16  201.54  200.91  201.64
> > user    6.49    6.21    6.25    6.31    6.35    6.322
> > sys     193.3   195.39  196.3   194.65  194.01  194.73
> >
> > 2) PCP enabled
> >         1       2       3       4       5       average
> > real    198.25  199.26  195.51  199.28  189.12  196.284 (-2.66%)
> > user    6.21    6.02    6.02    6.28    6.21    6.148   (-2.75%)
> > sys     191.46  192.64  188.96  192.47  182.39  189.584 (-2.64%)
> >
> > for the above test, time is reduced by about 2.6%
This is an improvement over the 0.28%, but it's still below my expectations.
I suspect it's because mTHP reduces the frequency of allocations and frees
(a 2MiB region takes 512 order-0 faults but only 32 order-4 ones).
Running the same test with order-0 might yield much better results.
I suppose that as the order increases, PCP shows less improvement,
since both allocation and release activity decrease.
Conversely, we also employ PCP for THP (2MB). Do we have any data
demonstrating that such large-size allocations can benefit from PCP?
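A quick way to check that with the same program (binary name again
illustrative): one run with every THP size disabled, so all faults are
order-0, against one run with 64KiB mTHP:

  # order-0 baseline: no THP of any size
  echo never > /sys/kernel/mm/transparent_hugepage/enabled
  echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
  time ./mthp_bench

  # 64KiB mTHP
  echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
  time ./mthp_bench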
> >
> >
> > And re-tested page_fault1 (anon) from will-it-scale:
> >
> > 1) PCP enabled
> > tasks processes processes_idle threads threads_idle linear
> > 0 0 100 0 100 0
> > 1 1416915 98.95 1418128 98.95 1418128
> > 20 5327312 79.22 3821312 94.36 28362560
> > 40 9437184 58.58 4463657 94.55 56725120
> > 60 8120003 38.16 4736716 94.61 85087680
> > 80 7356508 18.29 4847824 94.46 113450240
> > 100 7256185 1.48 4870096 94.61 141812800
> >
> > 2) PCP disabled
> > tasks processes processes_idle threads threads_idle linear
> > 0 0 100 0 100 0
> > 1 1365398 98.95 1354502 98.95 1365398
> > 20 5174918 79.22 3722368 94.65 27307960
> > 40 9094265 58.58 4427267 94.82 54615920
> > 60 8021606 38.18 4572896 94.93 81923880
> > 80 7497318 18.2 4637062 94.76 109231840
> > 100 6819897 1.47 4654521 94.63 136539800
> >
> > ------------------------------------
> > 1) vs 2): PCP enabled improves by 3.86%
> >
> > 3) PCP re-enabled
> > tasks processes processes_idle threads threads_idle linear
> > 0 0 100 0 100 0
> > 1 1419036 98.96 1428403 98.95 1428403
> > 20 5356092 79.23 3851849 94.41 28568060
> > 40 9437184 58.58 4512918 94.63 57136120
> > 60 8252342 38.16 4659552 94.68 85704180
> > 80 7414899 18.26 4790576 94.77 114272240
> > 100 7062902 1.46 4759030 94.64 142840300
> >
> > 4) PCP re-disabled
> > tasks processes processes_idle threads threads_idle linear
> > 0 0 100 0 100 0
> > 1 1352649 98.95 1354806 98.95 1354806
> > 20 5172924 79.22 3719292 94.64 27096120
> > 40 9174505 58.59 4310649 94.93 54192240
> > 60 8021606 38.17 4552960 94.81 81288360
> > 80 7497318 18.18 4671638 94.81 108384480
> > 100 6823926 1.47 4725955 94.64 135480600
> >
> > ------------------------------------
> > 3) vs 4): PCP enabled improves by 5.43%
> >
> > Average: 4.645%
> >
> >
> >
> >
> >
> >>>
> >>>>
> >>>> How much is the cost vs. benefit of just having one sane system
> >>>> configuration?
> >>>>
> >>>
> >>> For arm64 with 4K pages, that is five more high orders (4~8) and five
> >>> more pcplists; and for high orders we assume most of them are movable,
> >>> but that may not hold, so enabling it by default may cause more
> >>> fragmentation, see 5d0a661d808f ("mm/page_alloc: use only one PCP list
> >>> for THP-sized allocations").
> >>>
Thanks
Barry
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists
2024-04-16 5:26 ` Barry Song
@ 2024-04-16 7:03 ` David Hildenbrand
2024-04-16 8:06 ` Kefeng Wang
0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2024-04-16 7:03 UTC (permalink / raw)
To: Barry Song, Kefeng Wang
Cc: Andrew Morton, Huang Ying, Mel Gorman, Ryan Roberts, Barry Song,
Vlastimil Babka, Zi Yan, Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On 16.04.24 07:26, Barry Song wrote:
> On Tue, Apr 16, 2024 at 4:58 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>
>>
>>
>> On 2024/4/16 12:50, Kefeng Wang wrote:
>>>
>>>
>>> On 2024/4/16 8:21, Barry Song wrote:
>>>> On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang
>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2024/4/15 18:52, David Hildenbrand wrote:
>>>>>> On 15.04.24 10:59, Kefeng Wang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/4/15 16:18, Barry Song wrote:
>>>>>>>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang
>>>>>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>>>>>
> >>>>>>>>> Both file pages and anonymous pages support large folios, so
> >>>>>>>>> high-order
> >>>>>>>>> pages other than PMD_ORDER will also be allocated frequently, which
> >>>>>>>>> could increase zone lock contention. Allowing high-order pages on
> >>>>>>>>> PCP lists
> >>>>>>>>> could reduce that contention, but as commit
> >>>>>>>>> 44042b449872
> >>>>>>>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu
> >>>>>>>>> lists")
> >>>>>>>>> pointed out, it may not win in all scenarios. Add a new sysfs
> >>>>>>>>> control to
> >>>>>>>>> enable or disable storing specified high-order pages on PCP lists;
> >>>>>>>>> orders in
> >>>>>>>>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on PCP lists by
> >>>>>>>>> default.
>>>>>>>>
>>>>>>>> This is precisely something Baolin and I have discussed and intended
>>>>>>>> to implement[1],
>>>>>>>> but unfortunately, we haven't had the time to do so.
>>>>>>>
> >>>>>>> Indeed, same thing. Recently we have been working on unixbench/lmbench
> >>>>>>> optimization. I tested multi-size THP for anonymous memory by
> >>>>>>> hard-coding
> >>>>>>> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]; it shows some improvement, but
> >>>>>>> not for all cases and not very stable, so I re-implemented it
> >>>>>>> according
> >>>>>>> to the user requirement so it can be enabled dynamically.
>>>>>>
>>>>>> I'm wondering, though, if this is really a suitable candidate for a
>>>>>> sysctl toggle. Can anybody really come up with an educated guess for
>>>>>> these values?
>>>>>
> >>>>> Not sure this is suitable as a sysctl, but mTHP anon is already enabled
> >>>>> via sysfs; we could trace __alloc_pages() and gather order statistics
> >>>>> to decide which high orders to enable on the PCP lists.
>>>>>
>>>>>>
> >>>>>> Especially reading "Benchmarks Score shows a little improvement (0.28%)"
> >>>>>> and "it may not win in all scenarios", to me it mostly sounds like
> >>>>>> "minimal impact" -- so who cares?
>>>>>
> >>>>> Even though the lock conflicts are eliminated, the performance
> >>>>> improvement is very limited (maybe even fluctuation); it is not a good
> >>>>> testcase to show the improvement, it just shows the zone-lock issue. We
> >>>>> need to find a better testcase, maybe some test on Android (which
> >>>>> heavily uses 64K, no PMD THP), or maybe LKP could give some help?
> >>>>>
> >>>>> I will try to find another testcase to show the benefit.
>>>>
>>>> Hi Kefeng,
>>>>
> >>>> I wonder if you will see some major improvements on mTHP 64KiB using
> >>>> the microbench below, which I wrote just now -- for example, in perf
> >>>> data and the time to finish the program.
>>>>
> >>>> #include <stdio.h>
> >>>> #include <string.h>
> >>>> #include <sys/mman.h>
> >>>> #include <unistd.h>
> >>>>
> >>>> #define DATA_SIZE (2UL * 1024 * 1024)
> >>>>
> >>>> int main(int argc, char **argv)
> >>>> {
> >>>>         /* make 32 concurrent alloc and free of mTHP */
> >>>>         fork(); fork(); fork(); fork(); fork();
> >>>>
> >>>>         for (int i = 0; i < 100000; i++) {
> >>>>                 void *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
> >>>>                                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> >>>>                 if (addr == MAP_FAILED) {
> >>>>                         perror("fail to mmap");
> >>>>                         return -1;
> >>>>                 }
> >>>>                 memset(addr, 0x11, DATA_SIZE);
> >>>>                 munmap(addr, DATA_SIZE);
> >>>>         }
> >>>>
> >>>>         return 0;
> >>>> }
>>>>
>>
>> Rebased on next-20240415:
>>
>> echo never > /sys/kernel/mm/transparent_hugepage/enabled
>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
>>
>> Comparing
>> echo 0 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
>> against
>> echo 1 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
>>
>>>
>>> 1) PCP disabled
>>>         1       2       3       4       5       average
>>> real    200.41  202.18  203.16  201.54  200.91  201.64
>>> user    6.49    6.21    6.25    6.31    6.35    6.322
>>> sys     193.3   195.39  196.3   194.65  194.01  194.73
>>>
>>> 2) PCP enabled
>>>         1       2       3       4       5       average
>>> real    198.25  199.26  195.51  199.28  189.12  196.284 (-2.66%)
>>> user    6.21    6.02    6.02    6.28    6.21    6.148   (-2.75%)
>>> sys     191.46  192.64  188.96  192.47  182.39  189.584 (-2.64%)
>>>
>>> for the above test, time is reduced by about 2.6%
>
> This is an improvement over the 0.28%, but it's still below my expectations.
Yes, it's noise. Maybe we need a system with more Cores/Sockets? But it
does feel a bit like we're trying to come up with the problem after we
have a solution; I'd have thought some existing benchmark could
highlight if that is worth it.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists
2024-04-16 7:03 ` David Hildenbrand
@ 2024-04-16 8:06 ` Kefeng Wang
0 siblings, 0 replies; 17+ messages in thread
From: Kefeng Wang @ 2024-04-16 8:06 UTC (permalink / raw)
To: David Hildenbrand, Barry Song
Cc: Andrew Morton, Huang Ying, Mel Gorman, Ryan Roberts, Barry Song,
Vlastimil Babka, Zi Yan, Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On 2024/4/16 15:03, David Hildenbrand wrote:
> On 16.04.24 07:26, Barry Song wrote:
>> On Tue, Apr 16, 2024 at 4:58 PM Kefeng Wang
>> <wangkefeng.wang@huawei.com> wrote:
>>>
>>>
>>>
>>> On 2024/4/16 12:50, Kefeng Wang wrote:
>>>>
>>>>
>>>> On 2024/4/16 8:21, Barry Song wrote:
>>>>> On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang
>>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2024/4/15 18:52, David Hildenbrand wrote:
>>>>>>> On 15.04.24 10:59, Kefeng Wang wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2024/4/15 16:18, Barry Song wrote:
>>>>>>>>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang
>>>>>>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Both file pages and anonymous pages support large folios, so
>>>>>>>>>> high-order
>>>>>>>>>> pages other than PMD_ORDER will also be allocated frequently,
>>>>>>>>>> which could
>>>>>>>>>> increase zone lock contention. Allowing high-order pages on PCP
>>>>>>>>>> lists
>>>>>>>>>> could reduce that contention, but as commit
>>>>>>>>>> 44042b449872
>>>>>>>>>> ("mm/page_alloc: allow high-order pages to be stored on the
>>>>>>>>>> per-cpu
>>>>>>>>>> lists")
>>>>>>>>>> pointed out, it may not win in all scenarios. Add a new sysfs
>>>>>>>>>> control to
>>>>>>>>>> enable or disable storing specified high-order pages on PCP lists;
>>>>>>>>>> orders in
>>>>>>>>>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on PCP
>>>>>>>>>> lists by
>>>>>>>>>> default.
>>>>>>>>>
>>>>>>>>> This is precisely something Baolin and I have discussed and
>>>>>>>>> intended
>>>>>>>>> to implement[1],
>>>>>>>>> but unfortunately, we haven't had the time to do so.
>>>>>>>>
>>>>>>>> Indeed, same thing. Recently we have been working on
>>>>>>>> unixbench/lmbench
>>>>>>>> optimization. I tested multi-size THP for anonymous memory by
>>>>>>>> hard-coding
>>>>>>>> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]; it shows some
>>>>>>>> improvement, but
>>>>>>>> not for all cases and not very stable, so I re-implemented it
>>>>>>>> according
>>>>>>>> to the user requirement so it can be enabled dynamically.
>>>>>>>
>>>>>>> I'm wondering, though, if this is really a suitable candidate for a
>>>>>>> sysctl toggle. Can anybody really come up with an educated guess for
>>>>>>> these values?
>>>>>>
>>>>>> Not sure this is suitable as a sysctl, but mTHP anon is already
>>>>>> enabled via sysfs;
>>>>>> we could trace __alloc_pages() and gather order statistics to
>>>>>> decide which high orders to enable on the PCP lists.
>>>>>>
>>>>>>>
>>>>>>> Especially reading "Benchmarks Score shows a little
>>>>>>> improvement (0.28%)"
>>>>>>> and "it may not win in all scenarios", to me it mostly sounds like
>>>>>>> "minimal impact" -- so who cares?
>>>>>>
>>>>>> Even though the lock conflicts are eliminated, the performance
>>>>>> improvement is very limited (maybe even fluctuation); it is not a good
>>>>>> testcase to show the improvement, it just shows the zone-lock issue.
>>>>>> We need to
>>>>>> find a better testcase, maybe some test on Android (which heavily
>>>>>> uses 64K, no
>>>>>> PMD THP), or maybe LKP could give some help?
>>>>>>
>>>>>> I will try to find another testcase to show the benefit.
>>>>>
>>>>> Hi Kefeng,
>>>>>
>>>>> I wonder if you will see some major improvements on mTHP 64KiB using
>>>>> the microbench below, which I wrote just now -- for example, in perf
>>>>> data and the time to finish the program.
>>>>>
>>>>> #include <stdio.h>
>>>>> #include <string.h>
>>>>> #include <sys/mman.h>
>>>>> #include <unistd.h>
>>>>>
>>>>> #define DATA_SIZE (2UL * 1024 * 1024)
>>>>>
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>         /* make 32 concurrent alloc and free of mTHP */
>>>>>         fork(); fork(); fork(); fork(); fork();
>>>>>
>>>>>         for (int i = 0; i < 100000; i++) {
>>>>>                 void *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
>>>>>                                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>>>>>                 if (addr == MAP_FAILED) {
>>>>>                         perror("fail to mmap");
>>>>>                         return -1;
>>>>>                 }
>>>>>                 memset(addr, 0x11, DATA_SIZE);
>>>>>                 munmap(addr, DATA_SIZE);
>>>>>         }
>>>>>
>>>>>         return 0;
>>>>> }
>>>>>
>>>
>>> Rebased on next-20240415:
>>>
>>> echo never > /sys/kernel/mm/transparent_hugepage/enabled
>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
>>>
>>> Comparing
>>> echo 0 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
>>> against
>>> echo 1 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
>>>
>>>>
>>>> 1) PCP disabled
>>>>         1       2       3       4       5       average
>>>> real    200.41  202.18  203.16  201.54  200.91  201.64
>>>> user    6.49    6.21    6.25    6.31    6.35    6.322
>>>> sys     193.3   195.39  196.3   194.65  194.01  194.73
>>>>
>>>> 2) PCP enabled
>>>>         1       2       3       4       5       average
>>>> real    198.25  199.26  195.51  199.28  189.12  196.284 (-2.66%)
>>>> user    6.21    6.02    6.02    6.28    6.21    6.148   (-2.75%)
>>>> sys     191.46  192.64  188.96  192.47  182.39  189.584 (-2.64%)
>>>>
>>>> for the above test, time is reduced by about 2.6%
>>
>> This is an improvement over the 0.28%, but it's still below my expectations.
>
> Yes, it's noise. Maybe we need a system with more Cores/Sockets? But it
> does feel a bit like we're trying to come up with the problem after we
> have a solution; I'd have thought some existing benchmark could
> highlight if that is worth it.
On 96 cores, with 129 threads, a quick test using pcp_enabled to control
hugepages-2048kB shows no big improvement for 2M:

PCP enabled
        1        2        3        average
real    221.8    225.6    221.5    222.9666667
user    14.91    14.91    17.05    15.62333333
sys     141.91   159.25   156.23   152.4633333

PCP disabled
real    230.76   231.39   228.39   230.18
user    15.47    15.88    17.5     16.28333333
sys     159.07   162.32   159.09   160.16
From 44042b449872 ("mm/page_alloc: allow high-order pages to be stored
on the per-cpu lists"), the improvement also seems limited:
netperf-udp
5.13.0-rc2 5.13.0-rc2
mm-pcpburst-v3r4 mm-pcphighorder-v1r7
Hmean send-64 261.46 ( 0.00%) 266.30 * 1.85%*
Hmean send-128 516.35 ( 0.00%) 536.78 * 3.96%*
Hmean send-256 1014.13 ( 0.00%) 1034.63 * 2.02%*
Hmean send-1024 3907.65 ( 0.00%) 4046.11 * 3.54%*
Hmean send-2048 7492.93 ( 0.00%) 7754.85 * 3.50%*
Hmean send-3312 11410.04 ( 0.00%) 11772.32 * 3.18%*
Hmean send-4096 13521.95 ( 0.00%) 13912.34 * 2.89%*
Hmean send-8192 21660.50 ( 0.00%) 22730.72 * 4.94%*
Hmean send-16384 31902.32 ( 0.00%) 32637.50 * 2.30%*
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread
Thread overview: 17+ messages
-- links below jump to the message on this page --
2024-04-15 8:12 [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists Kefeng Wang
2024-04-15 8:12 ` [PATCH rfc 1/3] mm: prepare more high-order pages to be stored on the per-cpu lists Kefeng Wang
2024-04-15 11:41 ` Baolin Wang
2024-04-15 12:25 ` Kefeng Wang
2024-04-15 8:12 ` [PATCH rfc 2/3] mm: add control to allow specified high-order pages stored on PCP list Kefeng Wang
2024-04-15 8:12 ` [PATCH rfc 3/3] mm: pcp: show per-order pages count Kefeng Wang
2024-04-15 8:18 ` [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists Barry Song
2024-04-15 8:59 ` Kefeng Wang
2024-04-15 10:52 ` David Hildenbrand
2024-04-15 11:14 ` Barry Song
2024-04-15 12:17 ` Kefeng Wang
2024-04-16 0:21 ` Barry Song
2024-04-16 4:50 ` Kefeng Wang
2024-04-16 4:58 ` Kefeng Wang
2024-04-16 5:26 ` Barry Song
2024-04-16 7:03 ` David Hildenbrand
2024-04-16 8:06 ` Kefeng Wang