* [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists
@ 2024-04-15 8:12 Kefeng Wang
2024-04-15 8:12 ` [PATCH rfc 1/3] mm: prepare more high-order pages to be stored on the per-cpu lists Kefeng Wang
` (3 more replies)
0 siblings, 4 replies; 17+ messages in thread
From: Kefeng Wang @ 2024-04-15 8:12 UTC (permalink / raw)
To: Andrew Morton
Cc: Huang Ying, Mel Gorman, Ryan Roberts, David Hildenbrand,
Barry Song, Vlastimil Babka, Zi Yan, Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm, Kefeng Wang
Both file pages and anonymous pages support large folios, so high-order
pages other than PMD_ORDER will also be allocated frequently, which could
increase zone lock contention. Allowing high-order pages on the PCP lists
could reduce the big zone lock contention, but as commit 44042b449872
("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
pointed out, it may not be a win in all scenarios. Add a new sysfs control
to enable or disable storing the specified high-order pages on the PCP
lists; orders in (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on
the PCP lists by default.
With the perf lock tool, the lock contention from will-it-scale page_fault1
(90 tasks run for 10s, hugepage-2048kB set to never, hugepage-64kB set to
always) is shown below (only the zone spinlock and pcp spinlock are of
interest):
Without patches,
contended total wait max wait avg wait type caller
713 4.64 ms 74.37 us 6.51 us spinlock __alloc_pages+0x23c
With patches,
contended total wait max wait avg wait type caller
2 25.66 us 16.31 us 12.83 us spinlock rmqueue_pcplist+0x2b0
Similar results on shell8 from unixbench,
Without patches,
contended total wait max wait avg wait type caller
4942 901.09 ms 1.31 ms 182.33 us spinlock __alloc_pages+0x23c
1556 298.76 ms 1.23 ms 192.01 us spinlock rmqueue_pcplist+0x2b0
991 182.73 ms 879.80 us 184.39 us spinlock rmqueue_pcplist+0x2b0
With patches,
contended total wait max wait avg wait type caller
988 187.63 ms 855.18 us 189.91 us spinlock rmqueue_pcplist+0x2b0
505 88.99 ms 793.27 us 176.21 us spinlock rmqueue_pcplist+0x2b0
The Benchmarks Score shows a small improvement (0.28%) on shell8, but the
zone lock contention from __alloc_pages() disappeared.
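For reference, the numbers above can be gathered roughly as below (a
sketch; the exact perf options and the will-it-scale invocation are
assumptions and may differ between versions):

  perf lock record -- ./page_fault1_processes -t 90 -s 10
  perf lock contention -i perf.data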
Kefeng Wang (3):
mm: prepare more high-order pages to be stored on the per-cpu lists
mm: add control to allow specified high-order pages stored on PCP list
mm: pcp: show per-order pages count
Documentation/admin-guide/mm/transhuge.rst | 11 ++++
include/linux/gfp.h | 1 +
include/linux/huge_mm.h | 1 +
include/linux/mmzone.h | 10 ++-
include/linux/vmstat.h | 19 ++++++
mm/Kconfig.debug | 8 +++
mm/huge_memory.c | 74 ++++++++++++++++++++++
mm/page_alloc.c | 30 +++++++--
mm/vmstat.c | 16 +++++
9 files changed, 164 insertions(+), 6 deletions(-)
--
2.27.0
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH rfc 1/3] mm: prepare more high-order pages to be stored on the per-cpu lists
2024-04-15 8:12 [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists Kefeng Wang
@ 2024-04-15 8:12 ` Kefeng Wang
2024-04-15 11:41 ` Baolin Wang
2024-04-15 8:12 ` [PATCH rfc 2/3] mm: add control to allow specified high-order pages stored on PCP list Kefeng Wang
` (2 subsequent siblings)
3 siblings, 1 reply; 17+ messages in thread
From: Kefeng Wang @ 2024-04-15 8:12 UTC (permalink / raw)
To: Andrew Morton
Cc: Huang Ying, Mel Gorman, Ryan Roberts, David Hildenbrand,
Barry Song, Vlastimil Babka, Zi Yan, Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm, Kefeng Wang
Both file pages and anonymous pages support large folios, so high-order
pages other than HPAGE_PMD_ORDER (PMD_SHIFT - PAGE_SHIFT) will be allocated
frequently, which will increase zone lock contention. Allowing high-order
pages on the PCP lists could alleviate the big zone lock contention. In
order to allow the orders in (PAGE_ALLOC_COSTLY_ORDER, HPAGE_PMD_ORDER) to
be stored on the per-cpu lists, similar to PMD_ORDER pages, more lists are
added to struct per_cpu_pages (one list per high order), and a new
PCP_MAX_ORDER is added in mmzone.h to replace HPAGE_PMD_ORDER.

But as commit 44042b449872 ("mm/page_alloc: allow high-order pages to be
stored on the per-cpu lists") pointed out, it may not be a win in all
scenarios, so this patch does not yet allow the higher orders to be added
to the PCP lists; the next patch adds a control to enable or disable it.

The struct per_cpu_pages increases in size from 256 bytes (4 cache lines)
to 320 bytes (5 cache lines) on arm64 with defconfig.
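For illustration, a minimal userspace sketch of the new index math (not
kernel code; the constant values are assumptions matching an arm64 4K-page
defconfig, where PMD order is 9):

  #include <stdio.h>

  #define PAGE_ALLOC_COSTLY_ORDER 3
  #define MIGRATE_PCPTYPES 3 /* unmovable, movable, reclaimable */
  #define PCP_MAX_ORDER 9 /* PMD_SHIFT - PAGE_SHIFT */
  /* 3 migratetypes * 4 low orders = 12 low-order lists */
  #define NR_LOWORDER_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
  /* 12 - 4 = 8, so each high order gets exactly one list */
  #define HIGHORDER_PCP_LIST_INDEX (NR_LOWORDER_PCP_LISTS - (PAGE_ALLOC_COSTLY_ORDER + 1))

  int main(void)
  {
  	int order;

  	/* orders 4..9 map to pindexes 12..17 */
  	for (order = PAGE_ALLOC_COSTLY_ORDER + 1; order <= PCP_MAX_ORDER; order++)
  		printf("order %d -> pindex %d\n",
  		       order, order + HIGHORDER_PCP_LIST_INDEX);
  	return 0;
  }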
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
include/linux/mmzone.h | 4 +++-
mm/page_alloc.c | 10 +++++-----
2 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c11b7cde81ef..c745e2f1a0f2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -657,11 +657,13 @@ enum zone_watermarks {
* failures.
*/
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define NR_PCP_THP 1
+#define PCP_MAX_ORDER (PMD_SHIFT - PAGE_SHIFT)
+#define NR_PCP_THP (PCP_MAX_ORDER - PAGE_ALLOC_COSTLY_ORDER)
#else
#define NR_PCP_THP 0
#endif
#define NR_LOWORDER_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
+#define HIGHORDER_PCP_LIST_INDEX (NR_LOWORDER_PCP_LISTS - (PAGE_ALLOC_COSTLY_ORDER + 1))
#define NR_PCP_LISTS (NR_LOWORDER_PCP_LISTS + NR_PCP_THP)
#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b51becf03d1e..2248afc7b73a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -506,8 +506,8 @@ static inline unsigned int order_to_pindex(int migratetype, int order)
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
if (order > PAGE_ALLOC_COSTLY_ORDER) {
- VM_BUG_ON(order != HPAGE_PMD_ORDER);
- return NR_LOWORDER_PCP_LISTS;
+ VM_BUG_ON(order > PCP_MAX_ORDER);
+ return order + HIGHORDER_PCP_LIST_INDEX;
}
#else
VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
@@ -521,8 +521,8 @@ static inline int pindex_to_order(unsigned int pindex)
int order = pindex / MIGRATE_PCPTYPES;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- if (pindex == NR_LOWORDER_PCP_LISTS)
- order = HPAGE_PMD_ORDER;
+ if (pindex >= NR_LOWORDER_PCP_LISTS)
+ order = pindex - HIGHORDER_PCP_LIST_INDEX;
#else
VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
#endif
@@ -535,7 +535,7 @@ static inline bool pcp_allowed_order(unsigned int order)
if (order <= PAGE_ALLOC_COSTLY_ORDER)
return true;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- if (order == HPAGE_PMD_ORDER)
+ if (order == PCP_MAX_ORDER)
return true;
#endif
return false;
--
2.27.0
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH rfc 2/3] mm: add control to allow specified high-order pages stored on PCP list
2024-04-15 8:12 [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists Kefeng Wang
2024-04-15 8:12 ` [PATCH rfc 1/3] mm: prepare more high-order pages to be stored on the per-cpu lists Kefeng Wang
@ 2024-04-15 8:12 ` Kefeng Wang
2024-04-15 8:12 ` [PATCH rfc 3/3] mm: pcp: show per-order pages count Kefeng Wang
2024-04-15 8:18 ` [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists Barry Song
3 siblings, 0 replies; 17+ messages in thread
From: Kefeng Wang @ 2024-04-15 8:12 UTC (permalink / raw)
To: Andrew Morton
Cc: Huang Ying, Mel Gorman, Ryan Roberts, David Hildenbrand,
Barry Song, Vlastimil Babka, Zi Yan, Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm, Kefeng Wang
Storing high-order pages on the PCP lists may not always be a win, and may
even hurt some workloads, so it is disabled by default for high orders
except PMD_ORDER. Since there are already per-supported-THP-size interfaces
to configure mTHP behaviour, add a new control, pcp_enabled, under those
interfaces to allow the user to enable or disable storing the specified
high-order pages on the PCP lists. It can't change the existing behaviour
for order == PMD_ORDER and order <= PAGE_ALLOC_COSTLY_ORDER: those are
always enabled and can't be disabled. Meanwhile, when pcp_enabled is
cleared for one of the other high orders, the pcplists are drained.
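For example (a hypothetical session; with a 4K base page size, the
hugepages-64kB directory corresponds to order 4):

  # allow 64K pages to be cached on the PCP lists
  echo 1 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
  # disable again; already-cached pages are drained back to the buddy
  echo 0 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled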
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
Documentation/admin-guide/mm/transhuge.rst | 11 +++++
include/linux/gfp.h | 1 +
include/linux/huge_mm.h | 1 +
mm/huge_memory.c | 47 ++++++++++++++++++++++
mm/page_alloc.c | 16 ++++++++
5 files changed, 76 insertions(+)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 04eb45a2f940..3cb91336f81a 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -189,6 +189,17 @@ madvise
never
should be self-explanatory.
+
+There's also a sysfs knob to control whether hugepages are stored on the
+PCP lists for high orders (greater than PAGE_ALLOC_COSTLY_ORDER), which
+could reduce zone lock contention when high-order pages are allocated
+frequently. Please note that the PCP behaviour of low-order and PMD-order
+pages cannot be changed; it is possible to enable storing other high-order
+pages on the PCP lists by writing 1, or disable it back by writing 0::
+
+ echo 0 >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/pcp_enabled
+ echo 1 >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/pcp_enabled
+
By default kernel tries to use huge, PMD-mappable zero page on read
page fault to anonymous mapping. It's possible to disable huge zero
page by writing 0 or enable it back by writing 1::
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 450c2cbcf04b..2ae1157abd6e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -365,6 +365,7 @@ extern void page_frag_free(void *addr);
void page_alloc_init_cpuhp(void);
int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
+void drain_all_zone_pages(void);
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
void drain_all_pages(struct zone *zone);
void drain_local_pages(struct zone *zone);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b67294d5814f..86306becfd52 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -108,6 +108,7 @@ extern unsigned long transparent_hugepage_flags;
extern unsigned long huge_anon_orders_always;
extern unsigned long huge_anon_orders_madvise;
extern unsigned long huge_anon_orders_inherit;
+extern unsigned long huge_pcp_allow_orders;
static inline bool hugepage_global_enabled(void)
{
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9a1b57ef9c60..9b8a8aa36526 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -512,8 +512,49 @@ static ssize_t thpsize_enabled_store(struct kobject *kobj,
static struct kobj_attribute thpsize_enabled_attr =
__ATTR(enabled, 0644, thpsize_enabled_show, thpsize_enabled_store);
+unsigned long huge_pcp_allow_orders __read_mostly;
+static ssize_t thpsize_pcp_enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ int order = to_thpsize(kobj)->order;
+
+ return sysfs_emit(buf, "%d\n",
+ !!test_bit(order, &huge_pcp_allow_orders));
+}
+
+static ssize_t thpsize_pcp_enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int order = to_thpsize(kobj)->order;
+ unsigned long value;
+ int ret;
+
+ if (order <= PAGE_ALLOC_COSTLY_ORDER || order == PMD_ORDER)
+ return -EINVAL;
+
+ ret = kstrtoul(buf, 10, &value);
+ if (ret < 0)
+ return ret;
+ if (value > 1)
+ return -EINVAL;
+
+ if (value) {
+ set_bit(order, &huge_pcp_allow_orders);
+ } else {
+ if (test_and_clear_bit(order, &huge_pcp_allow_orders))
+ drain_all_zone_pages();
+ }
+
+ return count;
+}
+
+static struct kobj_attribute thpsize_pcp_enabled_attr = __ATTR(pcp_enabled,
+ 0644, thpsize_pcp_enabled_show, thpsize_pcp_enabled_store);
+
static struct attribute *thpsize_attrs[] = {
&thpsize_enabled_attr.attr,
+ &thpsize_pcp_enabled_attr.attr,
NULL,
};
@@ -624,6 +665,8 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
*/
huge_anon_orders_inherit = BIT(PMD_ORDER);
+ huge_pcp_allow_orders = BIT(PMD_ORDER);
+
*hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj);
if (unlikely(!*hugepage_kobj)) {
pr_err("failed to create transparent hugepage kobject\n");
@@ -658,6 +701,10 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
err = PTR_ERR(thpsize);
goto remove_all;
}
+
+ if (order <= PAGE_ALLOC_COSTLY_ORDER)
+ huge_pcp_allow_orders |= BIT(order);
+
list_add(&thpsize->node, &thpsize_list);
order = next_order(&orders, order);
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2248afc7b73a..25fd3fe30cb0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -537,6 +537,8 @@ static inline bool pcp_allowed_order(unsigned int order)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
if (order == PCP_MAX_ORDER)
return true;
+ if (BIT(order) & huge_pcp_allow_orders)
+ return true;
#endif
return false;
}
@@ -6705,6 +6707,20 @@ void zone_pcp_reset(struct zone *zone)
}
}
+void drain_all_zone_pages(void)
+{
+ struct zone *zone;
+
+ mutex_lock(&pcp_batch_high_lock);
+ for_each_populated_zone(zone)
+ __zone_set_pageset_high_and_batch(zone, 0, 0, 1);
+ __drain_all_pages(NULL, true);
+ for_each_populated_zone(zone)
+ __zone_set_pageset_high_and_batch(zone, zone->pageset_high_min,
+ zone->pageset_high_max, zone->pageset_batch);
+ mutex_unlock(&pcp_batch_high_lock);
+}
+
#ifdef CONFIG_MEMORY_HOTREMOVE
/*
* All pages in the range must be in a single zone, must not contain holes,
--
2.27.0
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH rfc 3/3] mm: pcp: show per-order pages count
2024-04-15 8:12 [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists Kefeng Wang
2024-04-15 8:12 ` [PATCH rfc 1/3] mm: prepare more high-order pages to be stored on the per-cpu lists Kefeng Wang
2024-04-15 8:12 ` [PATCH rfc 2/3] mm: add control to allow specified high-order pages stored on PCP list Kefeng Wang
@ 2024-04-15 8:12 ` Kefeng Wang
2024-04-15 8:18 ` [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists Barry Song
3 siblings, 0 replies; 17+ messages in thread
From: Kefeng Wang @ 2024-04-15 8:12 UTC (permalink / raw)
To: Andrew Morton
Cc: Huang Ying, Mel Gorman, Ryan Roberts, David Hildenbrand,
Barry Song, Vlastimil Babka, Zi Yan, Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm, Kefeng Wang
THIS IS ONLY FOR DEBUG.
Show more detail about the per-order page counts on each cpu in zoneinfo,
and add a new pcp_order_stat in sysfs showing the total count for each
hugepage size.
#cat /proc/zoneinfo
....
cpu: 15
count: 275
high: 529
batch: 63
order0: 59
order1: 28
order2: 28
order3: 6
order4: 0
order5: 0
order6: 0
order7: 0
order8: 0
order9: 0
#cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/pcp_order_stat
10
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
include/linux/mmzone.h | 6 ++++++
include/linux/vmstat.h | 19 +++++++++++++++++++
mm/Kconfig.debug | 8 ++++++++
mm/huge_memory.c | 27 +++++++++++++++++++++++++++
mm/page_alloc.c | 4 ++++
mm/vmstat.c | 16 ++++++++++++++++
6 files changed, 80 insertions(+)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c745e2f1a0f2..c32c01468a77 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -665,6 +665,9 @@ enum zone_watermarks {
#define NR_LOWORDER_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
#define HIGHORDER_PCP_LIST_INDEX (NR_LOWORDER_PCP_LISTS - (PAGE_ALLOC_COSTLY_ORDER + 1))
#define NR_PCP_LISTS (NR_LOWORDER_PCP_LISTS + NR_PCP_THP)
+#ifdef CONFIG_PCP_ORDER_STATS
+#define NR_PCP_ORDER (PAGE_ALLOC_COSTLY_ORDER + NR_PCP_THP + 1)
+#endif
#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost)
#define low_wmark_pages(z) (z->_watermark[WMARK_LOW] + z->watermark_boost)
@@ -702,6 +705,9 @@ struct per_cpu_pages {
/* Lists of pages, one per migrate type stored on the pcp-lists */
struct list_head lists[NR_PCP_LISTS];
+#ifdef CONFIG_PCP_ORDER_STATS
+ int per_order_count[NR_PCP_ORDER]; /* per-order page counts */
+#endif
} ____cacheline_aligned_in_smp;
struct per_cpu_zonestat {
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 735eae6e272c..91843f2d327f 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -624,4 +624,23 @@ static inline void lruvec_stat_sub_folio(struct folio *folio,
{
lruvec_stat_mod_folio(folio, idx, -folio_nr_pages(folio));
}
+
+static inline void pcp_order_stat_mod(struct per_cpu_pages *pcp, int order,
+ int val)
+{
+#ifdef CONFIG_PCP_ORDER_STATS
+ pcp->per_order_count[order] += val;
+#endif
+}
+
+static inline void pcp_order_stat_inc(struct per_cpu_pages *pcp, int order)
+{
+ pcp_order_stat_mod(pcp, order, 1);
+}
+
+static inline void pcp_order_stat_dec(struct per_cpu_pages *pcp, int order)
+{
+ pcp_order_stat_mod(pcp, order, -1);
+}
+
#endif /* _LINUX_VMSTAT_H */
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index afc72fde0f03..57eef0ce809b 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -276,3 +276,11 @@ config PER_VMA_LOCK_STATS
overhead in the page fault path.
If in doubt, say N.
+
+config PCP_ORDER_STATS
+	bool "Per-order statistics for PCP (Per-CPU pageset) lists"
+	help
+	  Say Y to show per-order statistics of the Per-CPU pagesets in
+	  zoneinfo and via pcp_order_stat in sysfs.
+
+	  If in doubt, say N.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9b8a8aa36526..0c6262bb8fe4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -599,12 +599,39 @@ DEFINE_MTHP_STAT_ATTR(anon_swpout, MTHP_STAT_ANON_SWPOUT);
DEFINE_MTHP_STAT_ATTR(anon_swpout_fallback, MTHP_STAT_ANON_SWPOUT_FALLBACK);
DEFINE_MTHP_STAT_ATTR(anon_swpin_refault, MTHP_STAT_ANON_SWPIN_REFAULT);
+#ifdef CONFIG_PCP_ORDER_STATS
+static ssize_t pcp_order_stat_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ int order = to_thpsize(kobj)->order;
+ unsigned int counts = 0;
+ struct zone *zone;
+
+ for_each_populated_zone(zone) {
+ struct per_cpu_pages *pcp;
+ int i;
+
+ for_each_online_cpu(i) {
+ pcp = per_cpu_ptr(zone->per_cpu_pageset, i);
+ counts += pcp->per_order_count[order];
+ }
+ }
+
+ return sysfs_emit(buf, "%u\n", counts);
+}
+
+static struct kobj_attribute pcp_order_stat_attr = __ATTR_RO(pcp_order_stat);
+#endif
+
static struct attribute *stats_attrs[] = {
&anon_alloc_attr.attr,
&anon_alloc_fallback_attr.attr,
&anon_swpout_attr.attr,
&anon_swpout_fallback_attr.attr,
&anon_swpin_refault_attr.attr,
+#ifdef CONFIG_PCP_ORDER_STATS
+ &pcp_order_stat_attr.attr,
+#endif
NULL,
};
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 25fd3fe30cb0..f44cdf8dec50 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1185,6 +1185,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
list_del(&page->pcp_list);
count -= nr_pages;
pcp->count -= nr_pages;
+ pcp_order_stat_dec(pcp, order);
__free_one_page(page, pfn, zone, order, mt, FPI_NONE);
trace_mm_page_pcpu_drain(page, order, mt);
@@ -2560,6 +2561,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
pindex = order_to_pindex(migratetype, order);
list_add(&page->pcp_list, &pcp->lists[pindex]);
pcp->count += 1 << order;
+ pcp_order_stat_inc(pcp, order);
batch = READ_ONCE(pcp->batch);
/*
@@ -2957,6 +2959,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
migratetype, alloc_flags);
pcp->count += alloced << order;
+ pcp_order_stat_mod(pcp, order, alloced);
if (unlikely(list_empty(list)))
return NULL;
}
@@ -2964,6 +2967,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
page = list_first_entry(list, struct page, pcp_list);
list_del(&page->pcp_list);
pcp->count -= 1 << order;
+ pcp_order_stat_dec(pcp, order);
} while (check_new_pages(page, order));
return page;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index db79935e4a54..632bb1ed6a53 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1674,6 +1674,19 @@ static bool is_zone_first_populated(pg_data_t *pgdat, struct zone *zone)
return false;
}
+static void zoneinfo_show_pcp_order_stat(struct seq_file *m,
+ struct per_cpu_pages *pcp)
+{
+#ifdef CONFIG_PCP_ORDER_STATS
+ int j;
+
+ for (j = 0; j < NR_PCP_ORDER; j++)
+ seq_printf(m,
+ "\n order%d: %i",
+ j, pcp->per_order_count[j]);
+#endif
+}
+
static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
struct zone *zone)
{
@@ -1748,6 +1761,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
pcp->count,
pcp->high,
pcp->batch);
+
+ zoneinfo_show_pcp_order_stat(m, pcp);
+
#ifdef CONFIG_SMP
pzstats = per_cpu_ptr(zone->per_cpu_zonestats, i);
seq_printf(m, "\n vm stats threshold: %d",
--
2.27.0
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists
2024-04-15 8:12 [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists Kefeng Wang
` (2 preceding siblings ...)
2024-04-15 8:12 ` [PATCH rfc 3/3] mm: pcp: show per-order pages count Kefeng Wang
@ 2024-04-15 8:18 ` Barry Song
2024-04-15 8:59 ` Kefeng Wang
3 siblings, 1 reply; 17+ messages in thread
From: Barry Song @ 2024-04-15 8:18 UTC (permalink / raw)
To: Kefeng Wang
Cc: Andrew Morton, Huang Ying, Mel Gorman, Ryan Roberts,
David Hildenbrand, Barry Song, Vlastimil Babka, Zi Yan,
Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
> Both file pages and anonymous pages support large folios, so high-order
> pages other than PMD_ORDER will also be allocated frequently, which could
> increase zone lock contention. Allowing high-order pages on the PCP lists
> could reduce the big zone lock contention, but as commit 44042b449872
> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
> pointed out, it may not be a win in all scenarios. Add a new sysfs control
> to enable or disable storing the specified high-order pages on the PCP
> lists; orders in (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on
> the PCP lists by default.
This is precisely something Baolin and I have discussed and intended
to implement[1],
but unfortunately, we haven't had the time to do so.
[1] https://lore.kernel.org/linux-mm/13c59ca8-baac-405e-8640-e693c78ef79a@suse.cz/T/#mecb0514ced830ac4df320113bedd7073bea9ab7a
>
> With the perf lock tool, the lock contention from will-it-scale page_fault1
> (90 tasks run for 10s, hugepage-2048kB set to never, hugepage-64kB set to
> always) is shown below (only the zone spinlock and pcp spinlock are of
> interest):
>
> Without patches,
> contended total wait max wait avg wait type caller
> 713 4.64 ms 74.37 us 6.51 us spinlock __alloc_pages+0x23c
>
> With patches,
> contended total wait max wait avg wait type caller
> 2 25.66 us 16.31 us 12.83 us spinlock rmqueue_pcplist+0x2b0
>
> Similar results on shell8 from unixbench,
>
> Without patches,
> contended total wait max wait avg wait type caller
> 4942 901.09 ms 1.31 ms 182.33 us spinlock __alloc_pages+0x23c
> 1556 298.76 ms 1.23 ms 192.01 us spinlock rmqueue_pcplist+0x2b0
> 991 182.73 ms 879.80 us 184.39 us spinlock rmqueue_pcplist+0x2b0
>
> With patches,
> contended total wait max wait avg wait type caller
> 988 187.63 ms 855.18 us 189.91 us spinlock rmqueue_pcplist+0x2b0
> 505 88.99 ms 793.27 us 176.21 us spinlock rmqueue_pcplist+0x2b0
>
> The Benchmarks Score shows a small improvement (0.28%) on shell8, but the
> zone lock contention from __alloc_pages() disappeared.
>
> Kefeng Wang (3):
> mm: prepare more high-order pages to be stored on the per-cpu lists
> mm: add control to allow specified high-order pages stored on PCP list
> mm: pcp: show per-order pages count
>
> Documentation/admin-guide/mm/transhuge.rst | 11 ++++
> include/linux/gfp.h | 1 +
> include/linux/huge_mm.h | 1 +
> include/linux/mmzone.h | 10 ++-
> include/linux/vmstat.h | 19 ++++++
> mm/Kconfig.debug | 8 +++
> mm/huge_memory.c | 74 ++++++++++++++++++++++
> mm/page_alloc.c | 30 +++++++--
> mm/vmstat.c | 16 +++++
> 9 files changed, 164 insertions(+), 6 deletions(-)
>
> --
> 2.27.0
>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists
2024-04-15 8:18 ` [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists Barry Song
@ 2024-04-15 8:59 ` Kefeng Wang
2024-04-15 10:52 ` David Hildenbrand
0 siblings, 1 reply; 17+ messages in thread
From: Kefeng Wang @ 2024-04-15 8:59 UTC (permalink / raw)
To: Barry Song
Cc: Andrew Morton, Huang Ying, Mel Gorman, Ryan Roberts,
David Hildenbrand, Barry Song, Vlastimil Babka, Zi Yan,
Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On 2024/4/15 16:18, Barry Song wrote:
> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>
>> Both file pages and anonymous pages support large folios, so high-order
>> pages other than PMD_ORDER will also be allocated frequently, which could
>> increase zone lock contention. Allowing high-order pages on the PCP lists
>> could reduce the big zone lock contention, but as commit 44042b449872
>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
>> pointed out, it may not be a win in all scenarios. Add a new sysfs control
>> to enable or disable storing the specified high-order pages on the PCP
>> lists; orders in (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on
>> the PCP lists by default.
>
> This is precisely something Baolin and I have discussed and intended
> to implement[1],
> but unfortunately, we haven't had the time to do so.
Indeed, same thing. Recently, we have been working on unixbench/lmbench
optimization. I tested multi-size THP for anonymous memory by hard-coding
PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]; it shows some improvement, but not
for all cases and not very stably, so I re-implemented it according to the
user requirement, enabling it dynamically.
[1]
https://lore.kernel.org/linux-mm/b8f5a47a-af1e-44ed-a89b-460d0be56d2c@huawei.com/
>
> [1] https://lore.kernel.org/linux-mm/13c59ca8-baac-405e-8640-e693c78ef79a@suse.cz/T/#mecb0514ced830ac4df320113bedd7073bea9ab7a
>
>>
>> With the perf lock tool, the lock contention from will-it-scale page_fault1
>> (90 tasks run for 10s, hugepage-2048kB set to never, hugepage-64kB set to
>> always) is shown below (only the zone spinlock and pcp spinlock are of
>> interest):
>>
>> Without patches,
>> contended total wait max wait avg wait type caller
>> 713 4.64 ms 74.37 us 6.51 us spinlock __alloc_pages+0x23c
>>
>> With patches,
>> contended total wait max wait avg wait type caller
>> 2 25.66 us 16.31 us 12.83 us spinlock rmqueue_pcplist+0x2b0
>>
>> Similar results on shell8 from unixbench,
>>
>> Without patches,
>> contended total wait max wait avg wait type caller
>> 4942 901.09 ms 1.31 ms 182.33 us spinlock __alloc_pages+0x23c
>> 1556 298.76 ms 1.23 ms 192.01 us spinlock rmqueue_pcplist+0x2b0
>> 991 182.73 ms 879.80 us 184.39 us spinlock rmqueue_pcplist+0x2b0
>>
>> With patches,
>> contended total wait max wait avg wait type caller
>> 988 187.63 ms 855.18 us 189.91 us spinlock rmqueue_pcplist+0x2b0
>> 505 88.99 ms 793.27 us 176.21 us spinlock rmqueue_pcplist+0x2b0
>>
>> The Benchmarks Score shows a small improvement (0.28%) on shell8, but the
>> zone lock contention from __alloc_pages() disappeared.
>>
>> Kefeng Wang (3):
>> mm: prepare more high-order pages to be stored on the per-cpu lists
>> mm: add control to allow specified high-order pages stored on PCP list
>> mm: pcp: show per-order pages count
>>
>> Documentation/admin-guide/mm/transhuge.rst | 11 ++++
>> include/linux/gfp.h | 1 +
>> include/linux/huge_mm.h | 1 +
>> include/linux/mmzone.h | 10 ++-
>> include/linux/vmstat.h | 19 ++++++
>> mm/Kconfig.debug | 8 +++
>> mm/huge_memory.c | 74 ++++++++++++++++++++++
>> mm/page_alloc.c | 30 +++++++--
>> mm/vmstat.c | 16 +++++
>> 9 files changed, 164 insertions(+), 6 deletions(-)
>>
>> --
>> 2.27.0
>>
>>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists
2024-04-15 8:59 ` Kefeng Wang
@ 2024-04-15 10:52 ` David Hildenbrand
2024-04-15 11:14 ` Barry Song
2024-04-15 12:17 ` Kefeng Wang
0 siblings, 2 replies; 17+ messages in thread
From: David Hildenbrand @ 2024-04-15 10:52 UTC (permalink / raw)
To: Kefeng Wang, Barry Song
Cc: Andrew Morton, Huang Ying, Mel Gorman, Ryan Roberts, Barry Song,
Vlastimil Babka, Zi Yan, Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On 15.04.24 10:59, Kefeng Wang wrote:
>
>
> On 2024/4/15 16:18, Barry Song wrote:
>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>
>>> Both file pages and anonymous pages support large folios, so high-order
>>> pages other than PMD_ORDER will also be allocated frequently, which could
>>> increase zone lock contention. Allowing high-order pages on the PCP lists
>>> could reduce the big zone lock contention, but as commit 44042b449872
>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
>>> pointed out, it may not be a win in all scenarios. Add a new sysfs control
>>> to enable or disable storing the specified high-order pages on the PCP
>>> lists; orders in (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on
>>> the PCP lists by default.
>>
>> This is precisely something Baolin and I have discussed and intended
>> to implement[1],
>> but unfortunately, we haven't had the time to do so.
>
> Indeed, same thing. Recently, we have been working on unixbench/lmbench
> optimization. I tested multi-size THP for anonymous memory by hard-coding
> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]; it shows some improvement, but not
> for all cases and not very stably, so I re-implemented it according to the
> user requirement, enabling it dynamically.
I'm wondering, though, if this is really a suitable candidate for a
sysctl toggle. Can anybody really come up with an educated guess for
these values?
Especially reading "Benchmarks Score shows a small improvement (0.28%)"
and "it may not be a win in all scenarios", to me it mostly sounds like
"minimal impact" -- so who cares?
How much is the cost vs. benefit of just having one sane system
configuration?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists
2024-04-15 10:52 ` David Hildenbrand
@ 2024-04-15 11:14 ` Barry Song
2024-04-15 12:17 ` Kefeng Wang
1 sibling, 0 replies; 17+ messages in thread
From: Barry Song @ 2024-04-15 11:14 UTC (permalink / raw)
To: David Hildenbrand
Cc: Kefeng Wang, Andrew Morton, Huang Ying, Mel Gorman, Ryan Roberts,
Barry Song, Vlastimil Babka, Zi Yan, Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On Mon, Apr 15, 2024 at 6:52 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 15.04.24 10:59, Kefeng Wang wrote:
> >
> >
> > On 2024/4/15 16:18, Barry Song wrote:
> >> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>>
> >>> Both file pages and anonymous pages support large folios, so high-order
> >>> pages other than PMD_ORDER will also be allocated frequently, which could
> >>> increase zone lock contention. Allowing high-order pages on the PCP lists
> >>> could reduce the big zone lock contention, but as commit 44042b449872
> >>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
> >>> pointed out, it may not be a win in all scenarios. Add a new sysfs control
> >>> to enable or disable storing the specified high-order pages on the PCP
> >>> lists; orders in (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on
> >>> the PCP lists by default.
> >>
> >> This is precisely something Baolin and I have discussed and intended
> >> to implement[1],
> >> but unfortunately, we haven't had the time to do so.
> >
> > Indeed, same thing. Recently, we have been working on unixbench/lmbench
> > optimization. I tested multi-size THP for anonymous memory by hard-coding
> > PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]; it shows some improvement, but not
> > for all cases and not very stably, so I re-implemented it according to the
> > user requirement, enabling it dynamically.
>
> I'm wondering, though, if this is really a suitable candidate for a
> sysctl toggle. Can anybody really come up with an educated guess for
> these values?
>
> Especially reading "Benchmarks Score shows a small improvement (0.28%)"
> and "it may not be a win in all scenarios", to me it mostly sounds like
> "minimal impact" -- so who cares?
Considering the original goal of employing PCP to alleviate page allocation
lock contention, and now that we have configured mTHP, for instance, to
64KiB, it's possible that 64KiB could become the most common page allocation
size just like order0. We should expect to see similar improvements as a result.
I'm questioning whether shell8 is the suitable benchmark for this
situation. A mere 0.28% performance enhancement might not be substantial
enough to pique interest.
Shouldn't we have numerous threads allocating and freeing in parallel to truly
gauge the benefits of PCP?
>
> How much is the cost vs. benefit of just having one sane system
> configuration?
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 1/3] mm: prepare more high-order pages to be stored on the per-cpu lists
2024-04-15 8:12 ` [PATCH rfc 1/3] mm: prepare more high-order pages to be stored on the per-cpu lists Kefeng Wang
@ 2024-04-15 11:41 ` Baolin Wang
2024-04-15 12:25 ` Kefeng Wang
0 siblings, 1 reply; 17+ messages in thread
From: Baolin Wang @ 2024-04-15 11:41 UTC (permalink / raw)
To: Kefeng Wang, Andrew Morton
Cc: Huang Ying, Mel Gorman, Ryan Roberts, David Hildenbrand,
Barry Song, Vlastimil Babka, Zi Yan, Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On 2024/4/15 16:12, Kefeng Wang wrote:
> Both file pages and anonymous pages support large folios, so high-order
> pages other than HPAGE_PMD_ORDER (PMD_SHIFT - PAGE_SHIFT) will be allocated
> frequently, which will increase zone lock contention. Allowing high-order
> pages on the PCP lists could alleviate the big zone lock contention. In
> order to allow the orders in (PAGE_ALLOC_COSTLY_ORDER, HPAGE_PMD_ORDER) to
> be stored on the per-cpu lists, similar to PMD_ORDER pages, more lists are
> added to struct per_cpu_pages (one list per high order), and a new
> PCP_MAX_ORDER is added in mmzone.h to replace HPAGE_PMD_ORDER.
>
> But as commit 44042b449872 ("mm/page_alloc: allow high-order pages to be
> stored on the per-cpu lists") pointed out, it may not be a win in all
> scenarios, so this patch does not yet allow the higher orders to be added
> to the PCP lists; the next patch adds a control to enable or disable it.
>
> The struct per_cpu_pages increases in size from 256 bytes (4 cache lines)
> to 320 bytes (5 cache lines) on arm64 with defconfig.
>
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
> include/linux/mmzone.h | 4 +++-
> mm/page_alloc.c | 10 +++++-----
> 2 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index c11b7cde81ef..c745e2f1a0f2 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -657,11 +657,13 @@ enum zone_watermarks {
> * failures.
> */
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -#define NR_PCP_THP 1
> +#define PCP_MAX_ORDER (PMD_SHIFT - PAGE_SHIFT)
> +#define NR_PCP_THP (PCP_MAX_ORDER - PAGE_ALLOC_COSTLY_ORDER)
> #else
> #define NR_PCP_THP 0
> #endif
> #define NR_LOWORDER_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
> +#define HIGHORDER_PCP_LIST_INDEX (NR_LOWORDER_PCP_LISTS - (PAGE_ALLOC_COSTLY_ORDER + 1))
Thanks for starting the discussion.
I am concerned that mixing mTHPs of different migratetypes in a single
pcp list might lead to fragmentation issues, potentially causing
unmovable mTHPs to occupy movable pageblocks, which would reduce
compaction efficiency.
But I am also not sure if it is suitable to add more pcp lists; maybe we
can just add the most commonly used mTHP size as a start, for example 64K?
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists
2024-04-15 10:52 ` David Hildenbrand
2024-04-15 11:14 ` Barry Song
@ 2024-04-15 12:17 ` Kefeng Wang
2024-04-16 0:21 ` Barry Song
1 sibling, 1 reply; 17+ messages in thread
From: Kefeng Wang @ 2024-04-15 12:17 UTC (permalink / raw)
To: David Hildenbrand, Barry Song
Cc: Andrew Morton, Huang Ying, Mel Gorman, Ryan Roberts, Barry Song,
Vlastimil Babka, Zi Yan, Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On 2024/4/15 18:52, David Hildenbrand wrote:
> On 15.04.24 10:59, Kefeng Wang wrote:
>>
>>
>> On 2024/4/15 16:18, Barry Song wrote:
>>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang
>>> <wangkefeng.wang@huawei.com> wrote:
>>>>
>>>> Both file pages and anonymous pages support large folios, so high-order
>>>> pages other than PMD_ORDER will also be allocated frequently, which could
>>>> increase zone lock contention. Allowing high-order pages on the PCP lists
>>>> could reduce the big zone lock contention, but as commit 44042b449872
>>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
>>>> pointed out, it may not be a win in all scenarios. Add a new sysfs control
>>>> to enable or disable storing the specified high-order pages on the PCP
>>>> lists; orders in (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on
>>>> the PCP lists by default.
>>>
>>> This is precisely something Baolin and I have discussed and intended
>>> to implement[1],
>>> but unfortunately, we haven't had the time to do so.
>>
>> Indeed, same thing. Recently, we have been working on unixbench/lmbench
>> optimization. I tested multi-size THP for anonymous memory by hard-coding
>> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]; it shows some improvement, but not
>> for all cases and not very stably, so I re-implemented it according to the
>> user requirement, enabling it dynamically.
>
> I'm wondering, though, if this is really a suitable candidate for a
> sysctl toggle. Can anybody really come up with an educated guess for
> these values?
Not sure this is suitable as a sysctl, but mTHP anon is already enabled via
sysfs; we could trace __alloc_pages() and gather per-order statistics to
decide which high orders to enable on the PCP lists.
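Something like the below could be a starting point (a bpftrace sketch,
assuming __alloc_pages() is kprobe-able on the running kernel and that
arg1 is its order argument):

  # histogram of allocation orders over a 10s window
  bpftrace -e 'kprobe:__alloc_pages { @order = lhist(arg1, 0, 10, 1); }
               interval:s:10 { exit(); }'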
>
> Especially reading "Benchmarks Score shows a small improvement (0.28%)"
> and "it may not be a win in all scenarios", to me it mostly sounds like
> "minimal impact" -- so who cares?
Even though the lock conflicts are eliminated, the performance improvement
is very limited (maybe even fluctuation); it is not a good testcase to show
the improvement, it just shows the zone-lock issue. We need to find a
better testcase, maybe some test on Android (heavy use of 64K, no PMD THP),
or maybe LKP could give some help?

I will try to find another testcase to show the benefit.
>
> How much is the cost vs. benefit of just having one sane system
> configuration?
>
For arm64 with 4K pages, that is five more high orders (4~8) and five more
pcplists; and for the high orders we assume most pages are movable, but
maybe they are not, so enabling it by default may cause more fragmentation,
see 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized
allocations").
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 1/3] mm: prepare more high-order pages to be stored on the per-cpu lists
2024-04-15 11:41 ` Baolin Wang
@ 2024-04-15 12:25 ` Kefeng Wang
0 siblings, 0 replies; 17+ messages in thread
From: Kefeng Wang @ 2024-04-15 12:25 UTC (permalink / raw)
To: Baolin Wang, Andrew Morton
Cc: Huang Ying, Mel Gorman, Ryan Roberts, David Hildenbrand,
Barry Song, Vlastimil Babka, Zi Yan, Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On 2024/4/15 19:41, Baolin Wang wrote:
>
>
> On 2024/4/15 16:12, Kefeng Wang wrote:
>> Both the file pages and anonymous pages support large folio, high-order
>> pages except HPAGE_PMD_ORDER(PMD_SHIFT - PAGE_SHIFT) will be allocated
>> frequently which will increase the zone lock contention, allow high-order
>> pages on pcp lists could alleviate the big zone lock contention, in order
>> to allows high-orders(PAGE_ALLOC_COSTLY_ORDER, HPAGE_PMD_ORDER) to be
>> stored on the per-cpu lists, similar with PMD_ORDER pages, more lists is
>> added in struct per_cpu_pages (one list each high-order pages), also a
>> new PCP_MAX_ORDER instead of HPAGE_PMD_ORDER is added in mmzone.h.
>>
>> But as commit 44042b449872 ("mm/page_alloc: allow high-order pages to be
>> stored on the per-cpu lists") pointed, it may not win in all the scenes,
>> so this don't allow higher-order pages to be added to PCP list, the next
>> will add a control to enable or disable it.
>>
>> The struct per_cpu_pages increases in size from 256(4 cache lines) to
>> 320 bytes (5 cache lines) on arm64 with defconfig.
>>
>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>> ---
>> include/linux/mmzone.h | 4 +++-
>> mm/page_alloc.c | 10 +++++-----
>> 2 files changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index c11b7cde81ef..c745e2f1a0f2 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -657,11 +657,13 @@ enum zone_watermarks {
>> * failures.
>> */
>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> -#define NR_PCP_THP 1
>> +#define PCP_MAX_ORDER (PMD_SHIFT - PAGE_SHIFT)
>> +#define NR_PCP_THP (PCP_MAX_ORDER - PAGE_ALLOC_COSTLY_ORDER)
>> #else
>> #define NR_PCP_THP 0
>> #endif
>> #define NR_LOWORDER_PCP_LISTS (MIGRATE_PCPTYPES *
>> (PAGE_ALLOC_COSTLY_ORDER + 1))
>> +#define HIGHORDER_PCP_LIST_INDEX (NR_LOWORDER_PCP_LISTS -
>> (PAGE_ALLOC_COSTLY_ORDER + 1))
>
> Thanks for starting the discussion.
>
> I am concerned that mixing mTHPs of different migratetypes in a single
> pcp list might lead to fragmentation issues, potentially causing
> unmovable mTHPs to occupy movable pageblocks, which would reduce
> compaction efficiency.
>
Yes, that is why it is not enabled by default.
> But I am also not sure if it is suitable to add more pcp lists; maybe we
> can just add the most commonly used mTHP size as a start, for example 64K?
Do you mean to add only one list, for 64K? I considered that before, but
it is not true for all cases; a different order may be the most used in
different tests, so only the specified high orders are enabled via the
pcp_enabled sysfs. But it is certain that we need to find a case that
shows an improvement when using a high order (e.g. order 4 = 64K) on the
PCP lists.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists
2024-04-15 12:17 ` Kefeng Wang
@ 2024-04-16 0:21 ` Barry Song
2024-04-16 4:50 ` Kefeng Wang
0 siblings, 1 reply; 17+ messages in thread
From: Barry Song @ 2024-04-16 0:21 UTC (permalink / raw)
To: Kefeng Wang
Cc: David Hildenbrand, Andrew Morton, Huang Ying, Mel Gorman,
Ryan Roberts, Barry Song, Vlastimil Babka, Zi Yan,
Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>
>
> On 2024/4/15 18:52, David Hildenbrand wrote:
> > On 15.04.24 10:59, Kefeng Wang wrote:
> >>
> >>
> >> On 2024/4/15 16:18, Barry Song wrote:
> >>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang
> >>> <wangkefeng.wang@huawei.com> wrote:
> >>>>
> >>>> Both file pages and anonymous pages support large folios, so high-order
> >>>> pages other than PMD_ORDER will also be allocated frequently, which could
> >>>> increase zone lock contention. Allowing high-order pages on the PCP lists
> >>>> could reduce the big zone lock contention, but as commit 44042b449872
> >>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
> >>>> pointed out, it may not be a win in all scenarios. Add a new sysfs control
> >>>> to enable or disable storing the specified high-order pages on the PCP
> >>>> lists; orders in (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on
> >>>> the PCP lists by default.
> >>>
> >>> This is precisely something Baolin and I have discussed and intended
> >>> to implement[1],
> >>> but unfortunately, we haven't had the time to do so.
> >>
> >> Indeed, same thing. Recently, we have been working on unixbench/lmbench
> >> optimization. I tested multi-size THP for anonymous memory by hard-coding
> >> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]; it shows some improvement, but not
> >> for all cases and not very stably, so I re-implemented it according to the
> >> user requirement, enabling it dynamically.
> >
> > I'm wondering, though, if this is really a suitable candidate for a
> > sysctl toggle. Can anybody really come up with an educated guess for
> > these values?
>
> Not sure this is suitable as a sysctl, but mTHP anon is already enabled via
> sysfs; we could trace __alloc_pages() and gather per-order statistics to
> decide which high orders to enable on the PCP lists.
>
> >
> > Especially reading "Benchmarks Score shows a small improvement (0.28%)"
> > and "it may not be a win in all scenarios", to me it mostly sounds like
> > "minimal impact" -- so who cares?
>
> Even though the lock conflicts are eliminated, the performance improvement
> is very limited (maybe even fluctuation); it is not a good testcase to show
> the improvement, it just shows the zone-lock issue. We need to find a
> better testcase, maybe some test on Android (heavy use of 64K, no PMD THP),
> or maybe LKP could give some help?
>
> I will try to find another testcase to show the benefit.
Hi Kefeng,
I wonder if you will see some major improvements on 64KiB mTHP using the
microbench below, which I wrote just now, for example in the perf numbers
and in the time to finish the program:
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define DATA_SIZE (2UL * 1024 * 1024)

int main(int argc, char **argv)
{
	/* make 32 concurrent alloc and free of mTHP */
	fork(); fork(); fork(); fork(); fork();

	for (int i = 0; i < 100000; i++) {
		void *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
				  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
		if (addr == MAP_FAILED) {
			perror("fail to mmap");
			return -1;
		}
		memset(addr, 0x11, DATA_SIZE);
		munmap(addr, DATA_SIZE);
	}

	return 0;
}
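A rough way to build and time it (the file name and flags are just an
example):

  gcc -O2 microbench.c -o microbench
  time ./microbench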
>
> >
> > How much is the cost vs. benefit of just having one sane system
> > configuration?
> >
>
> For arm64 with 4K pages, that is five more high orders (4~8) and five more
> pcplists; and for the high orders we assume most pages are movable, but
> maybe they are not, so enabling it by default may cause more fragmentation,
> see 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized
> allocations").
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists
2024-04-16 0:21 ` Barry Song
@ 2024-04-16 4:50 ` Kefeng Wang
2024-04-16 4:58 ` Kefeng Wang
0 siblings, 1 reply; 17+ messages in thread
From: Kefeng Wang @ 2024-04-16 4:50 UTC (permalink / raw)
To: Barry Song
Cc: David Hildenbrand, Andrew Morton, Huang Ying, Mel Gorman,
Ryan Roberts, Barry Song, Vlastimil Babka, Zi Yan,
Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On 2024/4/16 8:21, Barry Song wrote:
> On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>
>>
>>
>> On 2024/4/15 18:52, David Hildenbrand wrote:
>>> On 15.04.24 10:59, Kefeng Wang wrote:
>>>>
>>>>
>>>> On 2024/4/15 16:18, Barry Song wrote:
>>>>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang
>>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>>
>>>>>> Both file pages and anonymous pages support large folios, so high-order
>>>>>> pages other than PMD_ORDER will also be allocated frequently, which could
>>>>>> increase zone lock contention. Allowing high-order pages on the PCP lists
>>>>>> could reduce the big zone lock contention, but as commit 44042b449872
>>>>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
>>>>>> pointed out, it may not be a win in all scenarios. Add a new sysfs control
>>>>>> to enable or disable storing the specified high-order pages on the PCP
>>>>>> lists; orders in (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on
>>>>>> the PCP lists by default.
>>>>>
>>>>> This is precisely something Baolin and I have discussed and intended
>>>>> to implement[1],
>>>>> but unfortunately, we haven't had the time to do so.
>>>>
>>>> Indeed, same thing. Recently, we have been working on unixbench/lmbench
>>>> optimization. I tested multi-size THP for anonymous memory by hard-coding
>>>> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]; it shows some improvement, but not
>>>> for all cases and not very stably, so I re-implemented it according to the
>>>> user requirement, enabling it dynamically.
>>>
>>> I'm wondering, though, if this is really a suitable candidate for a
>>> sysctl toggle. Can anybody really come up with an educated guess for
>>> these values?
>>
>> Not sure this is suitable as a sysctl, but mTHP anon is already enabled via
>> sysfs; we could trace __alloc_pages() and gather per-order statistics to
>> decide which high orders to enable on the PCP lists.
>>
>>>
>>> Especially reading "Benchmarks Score shows a small improvement (0.28%)"
>>> and "it may not be a win in all scenarios", to me it mostly sounds like
>>> "minimal impact" -- so who cares?
>>
>> Even though the lock conflicts are eliminated, the performance improvement
>> is very limited (maybe even fluctuation); it is not a good testcase to show
>> the improvement, it just shows the zone-lock issue. We need to find a
>> better testcase, maybe some test on Android (heavy use of 64K, no PMD THP),
>> or maybe LKP could give some help?
>>
>> I will try to find another testcase to show the benefit.
>
> Hi Kefeng,
>
> I wonder if you will see some major improvements on 64KiB mTHP using the
> microbench below, which I wrote just now, for example in the perf numbers
> and in the time to finish the program:
>
> #include <stdio.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/mman.h>
>
> #define DATA_SIZE (2UL * 1024 * 1024)
>
> int main(int argc, char **argv)
> {
> 	/* make 32 concurrent alloc and free of mTHP */
> 	fork(); fork(); fork(); fork(); fork();
>
> 	for (int i = 0; i < 100000; i++) {
> 		void *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
> 				  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> 		if (addr == MAP_FAILED) {
> 			perror("fail to mmap");
> 			return -1;
> 		}
> 		memset(addr, 0x11, DATA_SIZE);
> 		munmap(addr, DATA_SIZE);
> 	}
>
> 	return 0;
> }
>
1) PCP disabled
1 2 3 4 5 average
real 200.41 202.18 203.16 201.54 200.91 201.64
user 6.49 6.21 6.25 6.31 6.35 6.322
sys 193.3 195.39 196.3 194.65 194.01 194.73
2) PCP enabled
real 198.25 199.26 195.51 199.28 189.12 196.284 -2.66%
user 6.21 6.02 6.02 6.28 6.21 6.148 -2.75%
sys 191.46 192.64 188.96 192.47 182.39 189.584 -2.64%
For the above test, time is reduced by 2.x%.

And re-testing page_fault1 (anon) from will-it-scale:
1) PCP enabled
tasks processes processes_idle threads threads_idle linear
0 0 100 0 100 0
1 1416915 98.95 1418128 98.95 1418128
20 5327312 79.22 3821312 94.36 28362560
40 9437184 58.58 4463657 94.55 56725120
60 8120003 38.16 4736716 94.61 85087680
80 7356508 18.29 4847824 94.46 113450240
100 7256185 1.48 4870096 94.61 141812800
2) PCP disabled
tasks processes processes_idle threads threads_idle linear
0 0 100 0 100 0
1 1365398 98.95 1354502 98.95 1365398
20 5174918 79.22 3722368 94.65 27307960
40 9094265 58.58 4427267 94.82 54615920
60 8021606 38.18 4572896 94.93 81923880
80 7497318 18.2 4637062 94.76 109231840
100 6819897 1.47 4654521 94.63 136539800
------------------------------------
1) vs 2): PCP enabled improves by 3.86%
3) PCP re-enabled
tasks processes processes_idle threads threads_idle linear
0 0 100 0 100 0
1 1419036 98.96 1428403 98.95 1428403
20 5356092 79.23 3851849 94.41 28568060
40 9437184 58.58 4512918 94.63 57136120
60 8252342 38.16 4659552 94.68 85704180
80 7414899 18.26 4790576 94.77 114272240
100 7062902 1.46 4759030 94.64 142840300
4) PCP re-disabled
tasks processes processes_idle threads threads_idle linear
0 0 100 0 100 0
1 1352649 98.95 1354806 98.95 1354806
20 5172924 79.22 3719292 94.64 27096120
40 9174505 58.59 4310649 94.93 54192240
60 8021606 38.17 4552960 94.81 81288360
80 7497318 18.18 4671638 94.81 108384480
100 6823926 1.47 4725955 94.64 135480600
------------------------------------
3) vs 4): PCP enabled improves by 5.43%
Average: 4.645%
>>
>>>
>>> How much is the cost vs. benefit of just having one sane system
>>> configuration?
>>>
>>
>> For arm64 with 4K pages, that is five more high orders (4~8) and five more
>> pcplists; and for the high orders we assume most pages are movable, but
>> maybe they are not, so enabling it by default may cause more fragmentation,
>> see 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized
>> allocations").
>>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists
2024-04-16 4:50 ` Kefeng Wang
@ 2024-04-16 4:58 ` Kefeng Wang
2024-04-16 5:26 ` Barry Song
0 siblings, 1 reply; 17+ messages in thread
From: Kefeng Wang @ 2024-04-16 4:58 UTC (permalink / raw)
To: Barry Song
Cc: David Hildenbrand, Andrew Morton, Huang Ying, Mel Gorman,
Ryan Roberts, Barry Song, Vlastimil Babka, Zi Yan,
Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On 2024/4/16 12:50, Kefeng Wang wrote:
>
>
> On 2024/4/16 8:21, Barry Song wrote:
>> On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang
>> <wangkefeng.wang@huawei.com> wrote:
>>>
>>>
>>>
>>> On 2024/4/15 18:52, David Hildenbrand wrote:
>>>> On 15.04.24 10:59, Kefeng Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/4/15 16:18, Barry Song wrote:
>>>>>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang
>>>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>>>
>>>>>>> Both file pages and anonymous pages support large folios, so high-order
>>>>>>> pages other than PMD_ORDER will also be allocated frequently, which could
>>>>>>> increase zone lock contention. Allowing high-order pages on the PCP lists
>>>>>>> could reduce the big zone lock contention, but as commit 44042b449872
>>>>>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
>>>>>>> pointed out, it may not be a win in all scenarios. Add a new sysfs control
>>>>>>> to enable or disable storing the specified high-order pages on the PCP
>>>>>>> lists; orders in (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on
>>>>>>> the PCP lists by default.
>>>>>>
>>>>>> This is precisely something Baolin and I have discussed and intended
>>>>>> to implement[1],
>>>>>> but unfortunately, we haven't had the time to do so.
>>>>>
>>>>> Indeed, same thing. Recently, we have been working on unixbench/lmbench
>>>>> optimization. I tested multi-size THP for anonymous memory by hard-coding
>>>>> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]; it shows some improvement, but not
>>>>> for all cases and not very stably, so I re-implemented it according to the
>>>>> user requirement, enabling it dynamically.
>>>>
>>>> I'm wondering, though, if this is really a suitable candidate for a
>>>> sysctl toggle. Can anybody really come up with an educated guess for
>>>> these values?
>>>
>>> Not sure this is suitable as a sysctl, but mTHP anon is already enabled via
>>> sysfs; we could trace __alloc_pages() and gather per-order statistics to
>>> decide which high orders to enable on the PCP lists.
>>>
>>>>
>>>> Especially reading "Benchmarks Score shows a small improvement (0.28%)"
>>>> and "it may not be a win in all scenarios", to me it mostly sounds like
>>>> "minimal impact" -- so who cares?
>>>
>>> Even though the lock conflicts are eliminated, the performance improvement
>>> is very limited (maybe even fluctuation); it is not a good testcase to show
>>> the improvement, it just shows the zone-lock issue. We need to find a
>>> better testcase, maybe some test on Android (heavy use of 64K, no PMD THP),
>>> or maybe LKP could give some help?
>>>
>>> I will try to find another testcase to show the benefit.
>>
>> Hi Kefeng,
>>
>> I wonder if you will see some major improvements on 64KiB mTHP using the
>> microbench below, which I wrote just now, for example in the perf numbers
>> and in the time to finish the program:
>>
>> #include <stdio.h>
>> #include <string.h>
>> #include <unistd.h>
>> #include <sys/mman.h>
>>
>> #define DATA_SIZE (2UL * 1024 * 1024)
>>
>> int main(int argc, char **argv)
>> {
>> 	/* make 32 concurrent alloc and free of mTHP */
>> 	fork(); fork(); fork(); fork(); fork();
>>
>> 	for (int i = 0; i < 100000; i++) {
>> 		void *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
>> 				  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>> 		if (addr == MAP_FAILED) {
>> 			perror("fail to mmap");
>> 			return -1;
>> 		}
>> 		memset(addr, 0x11, DATA_SIZE);
>> 		munmap(addr, DATA_SIZE);
>> 	}
>>
>> 	return 0;
>> }
>>
Rebased on next-20240415,
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
Compare with
echo 0 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
echo 1 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
>
> 1) PCP disabled
> 1 2 3 4 5 average
> real 200.41 202.18 203.16 201.54 200.91 201.64
> user 6.49 6.21 6.25 6.31 6.35 6.322
> sys 193.3 195.39 196.3 194.65 194.01 194.73
>
> 2) PCP enabled
> real 198.25 199.26 195.51 199.28 189.12 196.284 -2.66%
> user 6.21 6.02 6.02 6.28 6.21 6.148 -2.75%
> sys 191.46 192.64 188.96 192.47 182.39 189.584 -2.64%
>
> For the above test, time is reduced by 2.x%.
>
>
> And re-testing page_fault1 (anon) from will-it-scale:
>
> 1) PCP enabled
> tasks processes processes_idle threads threads_idle linear
> 0 0 100 0 100 0
> 1 1416915 98.95 1418128 98.95 1418128
> 20 5327312 79.22 3821312 94.36 28362560
> 40 9437184 58.58 4463657 94.55 56725120
> 60 8120003 38.16 4736716 94.61 85087680
> 80 7356508 18.29 4847824 94.46 113450240
> 100 7256185 1.48 4870096 94.61 141812800
>
> 2) PCP disabled
> tasks processes processes_idle threads threads_idle linear
> 0 0 100 0 100 0
> 1 1365398 98.95 1354502 98.95 1365398
> 20 5174918 79.22 3722368 94.65 27307960
> 40 9094265 58.58 4427267 94.82 54615920
> 60 8021606 38.18 4572896 94.93 81923880
> 80 7497318 18.2 4637062 94.76 109231840
> 100 6819897 1.47 4654521 94.63 136539800
>
> ------------------------------------
> 1) vs 2): PCP enabled improves by 3.86%
>
> 3) PCP re-enabled
> tasks processes processes_idle threads threads_idle linear
> 0 0 100 0 100 0
> 1 1419036 98.96 1428403 98.95 1428403
> 20 5356092 79.23 3851849 94.41 28568060
> 40 9437184 58.58 4512918 94.63 57136120
> 60 8252342 38.16 4659552 94.68 85704180
> 80 7414899 18.26 4790576 94.77 114272240
> 100 7062902 1.46 4759030 94.64 142840300
>
> 4) PCP re-disabled
> tasks processes processes_idle threads threads_idle linear
> 0 0 100 0 100 0
> 1 1352649 98.95 1354806 98.95 1354806
> 20 5172924 79.22 3719292 94.64 27096120
> 40 9174505 58.59 4310649 94.93 54192240
> 60 8021606 38.17 4552960 94.81 81288360
> 80 7497318 18.18 4671638 94.81 108384480
> 100 6823926 1.47 4725955 94.64 135480600
>
> ------------------------------------
> 3) vs 4): PCP enabled improves by 5.43%
>
> Average: 4.645%
>
>
>
>
>
>>>
>>>>
>>>> How much is the cost vs. benefit of just having one sane system
>>>> configuration?
>>>>
>>>
>>> For arm64 with 4K pages, that is five more high orders (4~8) and five
>>> more pcplists; and for high orders we assume most of them are movable,
>>> but that may not hold, so enabling it by default may cause more
>>> fragmentation, see 5d0a661d808f ("mm/page_alloc: use only one PCP list
>>> for THP-sized allocations").
>>>
>
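For context on the "five more pcplists" cost mentioned above, the PCP list
indexing in mm/page_alloc.c looks roughly like the sketch below; it is
simplified (config guards and assertions dropped) and meant to illustrate
the mechanism, not to be the exact kernel code:

  /* Simplified from mm/page_alloc.c. */
  static inline unsigned int order_to_pindex(int migratetype, int order)
  {
          /*
           * THP-sized (PMD_ORDER) allocations share a single PCP list,
           * see 5d0a661d808f.
           */
          if (order > PAGE_ALLOC_COSTLY_ORDER)
                  return NR_LOWORDER_PCP_LISTS;

          /* One list per (order, migratetype) pair below that. */
          return (MIGRATE_PCPTYPES * order) + migratetype;
  }

Each additional order allowed on the PCP thus costs MIGRATE_PCPTYPES more
per-cpu lists, which is what has to be weighed against the saved zone-lock
acquisitions.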
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists
2024-04-16 4:58 ` Kefeng Wang
@ 2024-04-16 5:26 ` Barry Song
2024-04-16 7:03 ` David Hildenbrand
0 siblings, 1 reply; 17+ messages in thread
From: Barry Song @ 2024-04-16 5:26 UTC (permalink / raw)
To: Kefeng Wang
Cc: David Hildenbrand, Andrew Morton, Huang Ying, Mel Gorman,
Ryan Roberts, Barry Song, Vlastimil Babka, Zi Yan,
Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On Tue, Apr 16, 2024 at 4:58 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>
>
> On 2024/4/16 12:50, Kefeng Wang wrote:
> >
> >
> > On 2024/4/16 8:21, Barry Song wrote:
> >> On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang
> >> <wangkefeng.wang@huawei.com> wrote:
> >>>
> >>>
> >>>
> >>> On 2024/4/15 18:52, David Hildenbrand wrote:
> >>>> On 15.04.24 10:59, Kefeng Wang wrote:
> >>>>>
> >>>>>
> >>>>> On 2024/4/15 16:18, Barry Song wrote:
> >>>>>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang
> >>>>>> <wangkefeng.wang@huawei.com> wrote:
> >>>>>>>
> >>>>>>> Both file pages and anonymous pages support large folios, so
> >>>>>>> high-order
> >>>>>>> pages other than PMD_ORDER will also be allocated frequently, which
> >>>>>>> could increase zone lock contention. Allowing high-order pages on
> >>>>>>> PCP lists
> >>>>>>> could reduce that contention, but as commit
> >>>>>>> 44042b449872
> >>>>>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu
> >>>>>>> lists")
> >>>>>>> pointed out, it may not win in all scenarios. Add a new sysfs
> >>>>>>> control to
> >>>>>>> enable or disable storing specified high-order pages on PCP lists;
> >>>>>>> orders in
> >>>>>>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on PCP lists by
> >>>>>>> default.
> >>>>>>
> >>>>>> This is precisely something Baolin and I have discussed and intended
> >>>>>> to implement[1],
> >>>>>> but unfortunately, we haven't had the time to do so.
> >>>>>
> >>>>> Indeed, same thing. Recently we have been working on unixbench/lmbench
> >>>>> optimization. I tested multi-size THP for anonymous memory by
> >>>>> hard-coding
> >>>>> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]; it shows some improvement, but
> >>>>> not for all cases and not very stable, so I re-implemented it
> >>>>> according
> >>>>> to the user requirement so it can be enabled dynamically.
> >>>>
> >>>> I'm wondering, though, if this is really a suitable candidate for a
> >>>> sysctl toggle. Can anybody really come up with an educated guess for
> >>>> these values?
> >>>
> >>> Not sure this is suitable as a sysctl, but mTHP anon is already enabled
> >>> via sysfs; we could trace __alloc_pages() and gather order statistics
> >>> to decide which high orders to enable on the PCP lists.
> >>>
> >>>>
> >>>> Especially reading "Benchmarks Score shows a little improvement (0.28%)"
> >>>> and "it may not win in all scenarios", to me it mostly sounds like
> >>>> "minimal impact" -- so who cares?
> >>>
> >>> Even though the lock conflicts are eliminated, the performance
> >>> improvement is very limited (maybe even fluctuation); it is not a good
> >>> testcase to show the improvement, it just shows the zone-lock issue. We
> >>> need to find a better testcase, maybe some test on Android (which
> >>> heavily uses 64K, no PMD THP), or maybe LKP could give some help?
> >>>
> >>> I will try to find another testcase to show the benefit.
> >>
> >> Hi Kefeng,
> >>
> >> I wonder if you will see some major improvements on mTHP 64KiB using
> >> the microbench below, which I wrote just now -- for example, in perf
> >> data and the time to finish the program.
> >>
> >> #include <stdio.h>
> >> #include <string.h>
> >> #include <sys/mman.h>
> >> #include <unistd.h>
> >>
> >> #define DATA_SIZE (2UL * 1024 * 1024)
> >>
> >> int main(int argc, char **argv)
> >> {
> >>         /* make 32 concurrent alloc and free of mTHP */
> >>         fork(); fork(); fork(); fork(); fork();
> >>
> >>         for (int i = 0; i < 100000; i++) {
> >>                 void *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
> >>                                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> >>                 if (addr == MAP_FAILED) {
> >>                         perror("fail to mmap");
> >>                         return -1;
> >>                 }
> >>                 memset(addr, 0x11, DATA_SIZE);
> >>                 munmap(addr, DATA_SIZE);
> >>         }
> >>
> >>         return 0;
> >> }
> >>
>
> Rebased on next-20240415:
>
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
>
> Comparing
> echo 0 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
> against
> echo 1 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
>
> >
> > 1) PCP disabled
> >         1       2       3       4       5       average
> > real    200.41  202.18  203.16  201.54  200.91  201.64
> > user    6.49    6.21    6.25    6.31    6.35    6.322
> > sys     193.3   195.39  196.3   194.65  194.01  194.73
> >
> > 2) PCP enabled
> >         1       2       3       4       5       average
> > real    198.25  199.26  195.51  199.28  189.12  196.284 (-2.66%)
> > user    6.21    6.02    6.02    6.28    6.21    6.148   (-2.75%)
> > sys     191.46  192.64  188.96  192.47  182.39  189.584 (-2.64%)
> >
> > for the above test, time is reduced by about 2.6%
This is an improvement over the 0.28%, but it's still below my expectations.
I suspect it's because mTHP reduces the frequency of allocations and frees
(a 2MiB region takes 512 order-0 faults but only 32 order-4 ones).
Running the same test with order-0 might yield much better results.
I suppose that as the order increases, PCP shows less improvement,
since both allocation and release activity decrease.
Conversely, we also employ PCP for THP (2MB). Do we have any data
demonstrating that such large-size allocations can benefit from PCP?
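A quick way to check that with the same program (binary name again
illustrative): one run with every THP size disabled, so all faults are
order-0, against one run with 64KiB mTHP:

  # order-0 baseline: no THP of any size
  echo never > /sys/kernel/mm/transparent_hugepage/enabled
  echo never > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
  time ./mthp_bench

  # 64KiB mTHP
  echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
  time ./mthp_bench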
> >
> >
> > And re-tested page_fault1 (anon) from will-it-scale:
> >
> > 1) PCP enabled
> > tasks processes processes_idle threads threads_idle linear
> > 0 0 100 0 100 0
> > 1 1416915 98.95 1418128 98.95 1418128
> > 20 5327312 79.22 3821312 94.36 28362560
> > 40 9437184 58.58 4463657 94.55 56725120
> > 60 8120003 38.16 4736716 94.61 85087680
> > 80 7356508 18.29 4847824 94.46 113450240
> > 100 7256185 1.48 4870096 94.61 141812800
> >
> > 2) PCP disabled
> > tasks processes processes_idle threads threads_idle linear
> > 0 0 100 0 100 0
> > 1 1365398 98.95 1354502 98.95 1365398
> > 20 5174918 79.22 3722368 94.65 27307960
> > 40 9094265 58.58 4427267 94.82 54615920
> > 60 8021606 38.18 4572896 94.93 81923880
> > 80 7497318 18.2 4637062 94.76 109231840
> > 100 6819897 1.47 4654521 94.63 136539800
> >
> > ------------------------------------
> > 1) vs 2): PCP enabled improves by 3.86%
> >
> > 3) PCP re-enabled
> > tasks processes processes_idle threads threads_idle linear
> > 0 0 100 0 100 0
> > 1 1419036 98.96 1428403 98.95 1428403
> > 20 5356092 79.23 3851849 94.41 28568060
> > 40 9437184 58.58 4512918 94.63 57136120
> > 60 8252342 38.16 4659552 94.68 85704180
> > 80 7414899 18.26 4790576 94.77 114272240
> > 100 7062902 1.46 4759030 94.64 142840300
> >
> > 4) PCP re-disabled
> > tasks processes processes_idle threads threads_idle linear
> > 0 0 100 0 100 0
> > 1 1352649 98.95 1354806 98.95 1354806
> > 20 5172924 79.22 3719292 94.64 27096120
> > 40 9174505 58.59 4310649 94.93 54192240
> > 60 8021606 38.17 4552960 94.81 81288360
> > 80 7497318 18.18 4671638 94.81 108384480
> > 100 6823926 1.47 4725955 94.64 135480600
> >
> > ------------------------------------
> > 3) vs 4): PCP enabled improves by 5.43%
> >
> > Average: 4.645%
> >
> >
> >
> >
> >
> >>>
> >>>>
> >>>> How much is the cost vs. benefit of just having one sane system
> >>>> configuration?
> >>>>
> >>>
> >>> For arm64 with 4K pages, that is five more high orders (4~8) and five
> >>> more pcplists; and for high orders we assume most of them are movable,
> >>> but that may not hold, so enabling it by default may cause more
> >>> fragmentation, see 5d0a661d808f ("mm/page_alloc: use only one PCP list
> >>> for THP-sized allocations").
> >>>
Thanks
Barry
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists
2024-04-16 5:26 ` Barry Song
@ 2024-04-16 7:03 ` David Hildenbrand
2024-04-16 8:06 ` Kefeng Wang
0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2024-04-16 7:03 UTC (permalink / raw)
To: Barry Song, Kefeng Wang
Cc: Andrew Morton, Huang Ying, Mel Gorman, Ryan Roberts, Barry Song,
Vlastimil Babka, Zi Yan, Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On 16.04.24 07:26, Barry Song wrote:
> On Tue, Apr 16, 2024 at 4:58 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>
>>
>>
>> On 2024/4/16 12:50, Kefeng Wang wrote:
>>>
>>>
>>> On 2024/4/16 8:21, Barry Song wrote:
>>>> On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang
>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2024/4/15 18:52, David Hildenbrand wrote:
>>>>>> On 15.04.24 10:59, Kefeng Wang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/4/15 16:18, Barry Song wrote:
>>>>>>>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang
>>>>>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>>>>>
> >>>>>>>>> Both file pages and anonymous pages support large folios, so
> >>>>>>>>> high-order
> >>>>>>>>> pages other than PMD_ORDER will also be allocated frequently, which
> >>>>>>>>> could increase zone lock contention. Allowing high-order pages on
> >>>>>>>>> PCP lists
> >>>>>>>>> could reduce that contention, but as commit
> >>>>>>>>> 44042b449872
> >>>>>>>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu
> >>>>>>>>> lists")
> >>>>>>>>> pointed out, it may not win in all scenarios. Add a new sysfs
> >>>>>>>>> control to
> >>>>>>>>> enable or disable storing specified high-order pages on PCP lists;
> >>>>>>>>> orders in
> >>>>>>>>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on PCP lists by
> >>>>>>>>> default.
>>>>>>>>
>>>>>>>> This is precisely something Baolin and I have discussed and intended
>>>>>>>> to implement[1],
>>>>>>>> but unfortunately, we haven't had the time to do so.
>>>>>>>
> >>>>>>> Indeed, same thing. Recently we have been working on unixbench/lmbench
> >>>>>>> optimization. I tested multi-size THP for anonymous memory by
> >>>>>>> hard-coding
> >>>>>>> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]; it shows some improvement, but
> >>>>>>> not for all cases and not very stable, so I re-implemented it
> >>>>>>> according
> >>>>>>> to the user requirement so it can be enabled dynamically.
>>>>>>
>>>>>> I'm wondering, though, if this is really a suitable candidate for a
>>>>>> sysctl toggle. Can anybody really come up with an educated guess for
>>>>>> these values?
>>>>>
> >>>>> Not sure this is suitable as a sysctl, but mTHP anon is already enabled
> >>>>> via sysfs; we could trace __alloc_pages() and gather order statistics
> >>>>> to decide which high orders to enable on the PCP lists.
>>>>>
>>>>>>
> >>>>>> Especially reading "Benchmarks Score shows a little improvement (0.28%)"
> >>>>>> and "it may not win in all scenarios", to me it mostly sounds like
> >>>>>> "minimal impact" -- so who cares?
>>>>>
> >>>>> Even though the lock conflicts are eliminated, the performance
> >>>>> improvement is very limited (maybe even fluctuation); it is not a good
> >>>>> testcase to show the improvement, it just shows the zone-lock issue. We
> >>>>> need to find a better testcase, maybe some test on Android (which
> >>>>> heavily uses 64K, no PMD THP), or maybe LKP could give some help?
> >>>>>
> >>>>> I will try to find another testcase to show the benefit.
>>>>
>>>> Hi Kefeng,
>>>>
> >>>> I wonder if you will see some major improvements on mTHP 64KiB using
> >>>> the microbench below, which I wrote just now -- for example, in perf
> >>>> data and the time to finish the program.
>>>>
> >>>> #include <stdio.h>
> >>>> #include <string.h>
> >>>> #include <sys/mman.h>
> >>>> #include <unistd.h>
> >>>>
> >>>> #define DATA_SIZE (2UL * 1024 * 1024)
> >>>>
> >>>> int main(int argc, char **argv)
> >>>> {
> >>>>         /* make 32 concurrent alloc and free of mTHP */
> >>>>         fork(); fork(); fork(); fork(); fork();
> >>>>
> >>>>         for (int i = 0; i < 100000; i++) {
> >>>>                 void *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
> >>>>                                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> >>>>                 if (addr == MAP_FAILED) {
> >>>>                         perror("fail to mmap");
> >>>>                         return -1;
> >>>>                 }
> >>>>                 memset(addr, 0x11, DATA_SIZE);
> >>>>                 munmap(addr, DATA_SIZE);
> >>>>         }
> >>>>
> >>>>         return 0;
> >>>> }
>>>>
>>
>> Rebased on next-20240415:
>>
>> echo never > /sys/kernel/mm/transparent_hugepage/enabled
>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
>>
>> Comparing
>> echo 0 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
>> against
>> echo 1 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
>>
>>>
>>> 1) PCP disabled
>>>         1       2       3       4       5       average
>>> real    200.41  202.18  203.16  201.54  200.91  201.64
>>> user    6.49    6.21    6.25    6.31    6.35    6.322
>>> sys     193.3   195.39  196.3   194.65  194.01  194.73
>>>
>>> 2) PCP enabled
>>>         1       2       3       4       5       average
>>> real    198.25  199.26  195.51  199.28  189.12  196.284 (-2.66%)
>>> user    6.21    6.02    6.02    6.28    6.21    6.148   (-2.75%)
>>> sys     191.46  192.64  188.96  192.47  182.39  189.584 (-2.64%)
>>>
>>> for the above test, time is reduced by about 2.6%
>
> This is an improvement over the 0.28%, but it's still below my expectations.
Yes, it's noise. Maybe we need a system with more Cores/Sockets? But it
does feel a bit like we're trying to come up with the problem after we
have a solution; I'd have thought some existing benchmark could
highlight if that is worth it.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists
2024-04-16 7:03 ` David Hildenbrand
@ 2024-04-16 8:06 ` Kefeng Wang
0 siblings, 0 replies; 17+ messages in thread
From: Kefeng Wang @ 2024-04-16 8:06 UTC (permalink / raw)
To: David Hildenbrand, Barry Song
Cc: Andrew Morton, Huang Ying, Mel Gorman, Ryan Roberts, Barry Song,
Vlastimil Babka, Zi Yan, Matthew Wilcox (Oracle),
Jonathan Corbet, Yang Shi, Yu Zhao, linux-mm
On 2024/4/16 15:03, David Hildenbrand wrote:
> On 16.04.24 07:26, Barry Song wrote:
>> On Tue, Apr 16, 2024 at 4:58 PM Kefeng Wang
>> <wangkefeng.wang@huawei.com> wrote:
>>>
>>>
>>>
>>> On 2024/4/16 12:50, Kefeng Wang wrote:
>>>>
>>>>
>>>> On 2024/4/16 8:21, Barry Song wrote:
>>>>> On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang
>>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2024/4/15 18:52, David Hildenbrand wrote:
>>>>>>> On 15.04.24 10:59, Kefeng Wang wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2024/4/15 16:18, Barry Song wrote:
>>>>>>>>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang
>>>>>>>>> <wangkefeng.wang@huawei.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Both file pages and anonymous pages support large folios, so
>>>>>>>>>> high-order
>>>>>>>>>> pages other than PMD_ORDER will also be allocated frequently,
>>>>>>>>>> which could
>>>>>>>>>> increase zone lock contention. Allowing high-order pages on PCP
>>>>>>>>>> lists
>>>>>>>>>> could reduce that contention, but as commit
>>>>>>>>>> 44042b449872
>>>>>>>>>> ("mm/page_alloc: allow high-order pages to be stored on the
>>>>>>>>>> per-cpu
>>>>>>>>>> lists")
>>>>>>>>>> pointed out, it may not win in all scenarios. Add a new sysfs
>>>>>>>>>> control to
>>>>>>>>>> enable or disable storing specified high-order pages on PCP lists;
>>>>>>>>>> orders in
>>>>>>>>>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on PCP
>>>>>>>>>> lists by
>>>>>>>>>> default.
>>>>>>>>>
>>>>>>>>> This is precisely something Baolin and I have discussed and
>>>>>>>>> intended
>>>>>>>>> to implement[1],
>>>>>>>>> but unfortunately, we haven't had the time to do so.
>>>>>>>>
>>>>>>>> Indeed, same thing. Recently we have been working on
>>>>>>>> unixbench/lmbench
>>>>>>>> optimization. I tested multi-size THP for anonymous memory by
>>>>>>>> hard-coding
>>>>>>>> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]; it shows some
>>>>>>>> improvement, but
>>>>>>>> not for all cases and not very stable, so I re-implemented it
>>>>>>>> according
>>>>>>>> to the user requirement so it can be enabled dynamically.
>>>>>>>
>>>>>>> I'm wondering, though, if this is really a suitable candidate for a
>>>>>>> sysctl toggle. Can anybody really come up with an educated guess for
>>>>>>> these values?
>>>>>>
>>>>>> Not sure this is suitable as a sysctl, but mTHP anon is already
>>>>>> enabled via sysfs;
>>>>>> we could trace __alloc_pages() and gather order statistics to
>>>>>> decide which high orders to enable on the PCP lists.
>>>>>>
>>>>>>>
>>>>>>> Especially reading "Benchmarks Score shows a little
>>>>>>> improvement (0.28%)"
>>>>>>> and "it may not win in all scenarios", to me it mostly sounds like
>>>>>>> "minimal impact" -- so who cares?
>>>>>>
>>>>>> Even though the lock conflicts are eliminated, the performance
>>>>>> improvement is very limited (maybe even fluctuation); it is not a good
>>>>>> testcase to show the improvement, it just shows the zone-lock issue.
>>>>>> We need to
>>>>>> find a better testcase, maybe some test on Android (which heavily
>>>>>> uses 64K, no
>>>>>> PMD THP), or maybe LKP could give some help?
>>>>>>
>>>>>> I will try to find another testcase to show the benefit.
>>>>>
>>>>> Hi Kefeng,
>>>>>
>>>>> I wonder if you will see some major improvements on mTHP 64KiB using
>>>>> the microbench below, which I wrote just now -- for example, in perf
>>>>> data and the time to finish the program.
>>>>>
>>>>> #include <stdio.h>
>>>>> #include <string.h>
>>>>> #include <sys/mman.h>
>>>>> #include <unistd.h>
>>>>>
>>>>> #define DATA_SIZE (2UL * 1024 * 1024)
>>>>>
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>         /* make 32 concurrent alloc and free of mTHP */
>>>>>         fork(); fork(); fork(); fork(); fork();
>>>>>
>>>>>         for (int i = 0; i < 100000; i++) {
>>>>>                 void *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
>>>>>                                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>>>>>                 if (addr == MAP_FAILED) {
>>>>>                         perror("fail to mmap");
>>>>>                         return -1;
>>>>>                 }
>>>>>                 memset(addr, 0x11, DATA_SIZE);
>>>>>                 munmap(addr, DATA_SIZE);
>>>>>         }
>>>>>
>>>>>         return 0;
>>>>> }
>>>>>
>>>
>>> Rebased on next-20240415:
>>>
>>> echo never > /sys/kernel/mm/transparent_hugepage/enabled
>>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
>>>
>>> Comparing
>>> echo 0 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
>>> against
>>> echo 1 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
>>>
>>>>
>>>> 1) PCP disabled
>>>>         1       2       3       4       5       average
>>>> real    200.41  202.18  203.16  201.54  200.91  201.64
>>>> user    6.49    6.21    6.25    6.31    6.35    6.322
>>>> sys     193.3   195.39  196.3   194.65  194.01  194.73
>>>>
>>>> 2) PCP enabled
>>>>         1       2       3       4       5       average
>>>> real    198.25  199.26  195.51  199.28  189.12  196.284 (-2.66%)
>>>> user    6.21    6.02    6.02    6.28    6.21    6.148   (-2.75%)
>>>> sys     191.46  192.64  188.96  192.47  182.39  189.584 (-2.64%)
>>>>
>>>> for the above test, time is reduced by about 2.6%
>>
>> This is an improvement over the 0.28%, but it's still below my expectations.
>
> Yes, it's noise. Maybe we need a system with more Cores/Sockets? But it
> does feel a bit like we're trying to come up with the problem after we
> have a solution; I'd have thought some existing benchmark could
> highlight if that is worth it.
On 96 cores, with 129 threads, a quick test using pcp_enabled to control
hugepages-2048kB shows no big improvement for 2M:

PCP enabled
        1        2        3        average
real    221.8    225.6    221.5    222.9666667
user    14.91    14.91    17.05    15.62333333
sys     141.91   159.25   156.23   152.4633333

PCP disabled
real    230.76   231.39   228.39   230.18
user    15.47    15.88    17.5     16.28333333
sys     159.07   162.32   159.09   160.16
From 44042b449872 ("mm/page_alloc: allow high-order pages to be stored
on the per-cpu lists"), the improvement also seems limited:
netperf-udp
5.13.0-rc2 5.13.0-rc2
mm-pcpburst-v3r4 mm-pcphighorder-v1r7
Hmean send-64 261.46 ( 0.00%) 266.30 * 1.85%*
Hmean send-128 516.35 ( 0.00%) 536.78 * 3.96%*
Hmean send-256 1014.13 ( 0.00%) 1034.63 * 2.02%*
Hmean send-1024 3907.65 ( 0.00%) 4046.11 * 3.54%*
Hmean send-2048 7492.93 ( 0.00%) 7754.85 * 3.50%*
Hmean send-3312 11410.04 ( 0.00%) 11772.32 * 3.18%*
Hmean send-4096 13521.95 ( 0.00%) 13912.34 * 2.89%*
Hmean send-8192 21660.50 ( 0.00%) 22730.72 * 4.94%*
Hmean send-16384 31902.32 ( 0.00%) 32637.50 * 2.30%*
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread
Thread overview: 17+ messages
-- links below jump to the message on this page --
2024-04-15 8:12 [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists Kefeng Wang
2024-04-15 8:12 ` [PATCH rfc 1/3] mm: prepare more high-order pages to be stored on the per-cpu lists Kefeng Wang
2024-04-15 11:41 ` Baolin Wang
2024-04-15 12:25 ` Kefeng Wang
2024-04-15 8:12 ` [PATCH rfc 2/3] mm: add control to allow specified high-order pages stored on PCP list Kefeng Wang
2024-04-15 8:12 ` [PATCH rfc 3/3] mm: pcp: show per-order pages count Kefeng Wang
2024-04-15 8:18 ` [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists Barry Song
2024-04-15 8:59 ` Kefeng Wang
2024-04-15 10:52 ` David Hildenbrand
2024-04-15 11:14 ` Barry Song
2024-04-15 12:17 ` Kefeng Wang
2024-04-16 0:21 ` Barry Song
2024-04-16 4:50 ` Kefeng Wang
2024-04-16 4:58 ` Kefeng Wang
2024-04-16 5:26 ` Barry Song
2024-04-16 7:03 ` David Hildenbrand
2024-04-16 8:06 ` Kefeng Wang