From: Yafang Shao <laoar.shao@gmail.com>
To: akpm@linux-foundation.org
Cc: ying.huang@intel.com, mgorman@techsingularity.net,
linux-mm@kvack.org, Yafang Shao <laoar.shao@gmail.com>,
Matthew Wilcox <willy@infradead.org>,
David Rientjes <rientjes@google.com>
Subject: [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
Date: Mon, 29 Jul 2024 10:35:32 +0800
Message-ID: <20240729023532.1555-4-laoar.shao@gmail.com>
In-Reply-To: <20240729023532.1555-1-laoar.shao@gmail.com>
During my recent work to resolve latency spikes caused by zone->lock
contention[0], I found that CONFIG_PCP_BATCH_SCALE_MAX is difficult to use
in practice.
To demonstrate this, I wrote a Python script:
import mmap

size = 6 * 1024**3

while True:
    mm = mmap.mmap(-1, size)
    mm[:] = b'\xff' * size
    mm.close()
Run this script 10 times in parallel and measure the allocation latency by
measuring the duration of rmqueue_bulk() with the BCC tools
funclatency[1]:
funclatency -T -i 600 rmqueue_bulk
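One way to start the parallel copies is sketched below. This is only an illustration, not the launcher used for the measurements: the helper name run_parallel, the embedded worker, and the bounded iteration count are all assumptions (the original script loops forever and was presumably started by hand).

```python
import subprocess
import sys

# Same body as the script above, except that the mapping size and the number
# of iterations come from the command line (hypothetical change so that the
# sketch terminates).
WORKER = r"""
import mmap, sys
size, iterations = int(sys.argv[1]), int(sys.argv[2])
for _ in range(iterations):
    mm = mmap.mmap(-1, size)
    mm[:] = b'\xff' * size
    mm.close()
"""


def run_parallel(workers=10, size=6 * 1024**3, iterations=1):
    """Start `workers` copies of the worker at once and return their exit codes."""
    procs = [
        subprocess.Popen([sys.executable, "-c", WORKER, str(size), str(iterations)])
        for _ in range(workers)
    ]
    return [proc.wait() for proc in procs]


if __name__ == "__main__":
    # Deliberately tiny so the sketch is safe to try; the test in this message
    # used 10 workers mapping 6 GiB each.
    print(run_parallel(workers=2, size=1 << 20, iterations=1))
```

funclatency would then be run in another terminal while the workers are active.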
Here are the results for both AMD and Intel CPUs.
AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
=====================================================================
- Default value of 5
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 12 | |
1024 -> 2047 : 9116 | |
2048 -> 4095 : 2004 | |
4096 -> 8191 : 2497 | |
8192 -> 16383 : 2127 | |
16384 -> 32767 : 2483 | |
32768 -> 65535 : 10102 | |
65536 -> 131071 : 212730 |******************* |
131072 -> 262143 : 314692 |***************************** |
262144 -> 524287 : 430058 |****************************************|
524288 -> 1048575 : 224032 |******************** |
1048576 -> 2097151 : 73567 |****** |
2097152 -> 4194303 : 17079 |* |
4194304 -> 8388607 : 3900 | |
8388608 -> 16777215 : 750 | |
16777216 -> 33554431 : 88 | |
33554432 -> 67108863 : 2 | |
avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242
The average allocation latency is about 449us, and the maximum latency can
exceed 30ms.
- Value set to 0
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 92 | |
1024 -> 2047 : 8594 | |
2048 -> 4095 : 2042818 |****** |
4096 -> 8191 : 8737624 |************************** |
8192 -> 16383 : 13147872 |****************************************|
16384 -> 32767 : 8799951 |************************** |
32768 -> 65535 : 2879715 |******** |
65536 -> 131071 : 659600 |** |
131072 -> 262143 : 204004 | |
262144 -> 524287 : 78246 | |
524288 -> 1048575 : 30800 | |
1048576 -> 2097151 : 12251 | |
2097152 -> 4194303 : 2950 | |
4194304 -> 8388607 : 78 | |
avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636
The average drops significantly to about 19us, and the maximum latency
falls below 8.4ms.
- Conclusion
On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps reduce
latency. Latency-sensitive applications will benefit from this tuning.
However, I don't have access to other types of AMD CPUs, so I was unable to
test it on different AMD models.
Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
============================================================
- Default value of 5
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 2419 | |
1024 -> 2047 : 34499 |* |
2048 -> 4095 : 4272 | |
4096 -> 8191 : 9035 | |
8192 -> 16383 : 4374 | |
16384 -> 32767 : 2963 | |
32768 -> 65535 : 6407 | |
65536 -> 131071 : 884806 |****************************************|
131072 -> 262143 : 145931 |****** |
262144 -> 524287 : 13406 | |
524288 -> 1048575 : 1874 | |
1048576 -> 2097151 : 249 | |
2097152 -> 4194303 : 28 | |
avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263
- Conclusion
This Intel CPU works fine with the default setting.
Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
==============================================================
Using the cpuset cgroup, we can restrict the test script to run on NUMA
node 0 only.
- Default value of 5
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 46 | |
512 -> 1023 : 695 | |
1024 -> 2047 : 19950 |* |
2048 -> 4095 : 1788 | |
4096 -> 8191 : 3392 | |
8192 -> 16383 : 2569 | |
16384 -> 32767 : 2619 | |
32768 -> 65535 : 3809 | |
65536 -> 131071 : 616182 |****************************************|
131072 -> 262143 : 295587 |******************* |
262144 -> 524287 : 75357 |**** |
524288 -> 1048575 : 15471 |* |
1048576 -> 2097151 : 2939 | |
2097152 -> 4194303 : 243 | |
4194304 -> 8388607 : 3 | |
avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651
The zone->lock contention becomes severe when there is only a single NUMA
node. The average latency is approximately 144us, with the maximum
latency exceeding 4ms.
- Value set to 0
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 24 | |
512 -> 1023 : 2686 | |
1024 -> 2047 : 10246 | |
2048 -> 4095 : 4061529 |********* |
4096 -> 8191 : 16894971 |****************************************|
8192 -> 16383 : 6279310 |************** |
16384 -> 32767 : 1658240 |*** |
32768 -> 65535 : 445760 |* |
65536 -> 131071 : 110817 | |
131072 -> 262143 : 20279 | |
262144 -> 524287 : 4176 | |
524288 -> 1048575 : 436 | |
1048576 -> 2097151 : 8 | |
2097152 -> 4194303 : 2 | |
avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508
After setting it to 0, the average latency drops to about 8us, and the
maximum latency stays below 4.2ms.
- Conclusion
On this Intel CPU, this tuning doesn't help much. Latency-sensitive
applications work well with the default setting.
It is worth noting that all of the above data was collected on the upstream
kernel.
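The summary lines above can be cross-checked with a few lines of Python. The sketch assumes the reported average is total divided by count, rounded down; that assumption is consistent with all five summaries but is not taken from the funclatency source. The names RUNS, average_ns and call_ratio are made up for this example.

```python
# (total_ns, count) pairs copied from the funclatency summary lines above.
RUNS = {
    "AMD, scale 5": (587066511229, 1305242),
    "AMD, scale 0": (708638369918, 36604636),
    "Intel 2 nodes, scale 5": (106778157925, 1110263),
    "Intel 1 node, scale 5": (150281196195, 1040651),
    "Intel 1 node, scale 0": (247739809022, 29488508),
}


def average_ns(total_ns, count):
    """Average latency in nanoseconds, rounded down (assumed rounding)."""
    return total_ns // count


def call_ratio(before, after):
    """How many times more rmqueue_bulk() calls run `after` saw than `before`."""
    return RUNS[after][1] / RUNS[before][1]


if __name__ == "__main__":
    for name, (total_ns, count) in RUNS.items():
        print(f"{name}: avg = {average_ns(total_ns, count)} nsecs, count = {count}")
    print(f"AMD call ratio: {call_ratio('AMD, scale 5', 'AMD, scale 0'):.1f}x")
    print(
        "Intel call ratio: "
        f"{call_ratio('Intel 1 node, scale 5', 'Intel 1 node, scale 0'):.1f}x"
    )
```

With the value set to 0, rmqueue_bulk() was called roughly 28 times as often on both machines, and the total time reported for the function is higher than with the default, even though each individual call is much shorter.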
Why introduce a sysctl knob?
============================
From the above data, it's clear that different CPU types have varying
allocation latencies concerning zone->lock contention. Typically, people
don't release individual kernel packages for each type of x86_64 CPU.
Furthermore, for latency-insensitive applications, we can keep the default
setting for better throughput. In our production environment, we set this
value to 0 for applications running on Kubernetes servers while keeping it
at the default value of 5 for other applications like big data. It's not
common to release individual kernel packages for each application.
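As a rough illustration of what the knob controls, the sketch below computes the (batch << N) upper bound for each allowed value and shows how the value could be written at runtime. It assumes a kernel with this patch applied, so that /proc/sys/vm/pcp_batch_scale_max exists; the helper names and the batch size of 63 are hypothetical, since the real batch size is computed by the kernel per zone.

```python
from pathlib import Path

# File this patch would expose; it does not exist on kernels without the patch.
KNOB = Path("/proc/sys/vm/pcp_batch_scale_max")
SCALE_MIN, SCALE_MAX = 0, 6  # range enforced by the patch via proc_dointvec_minmax


def check_scale(scale):
    """Return `scale` unchanged, or raise if it is outside the accepted range."""
    if not SCALE_MIN <= scale <= SCALE_MAX:
        raise ValueError(f"scale must be in [{SCALE_MIN}, {SCALE_MAX}], got {scale}")
    return scale


def max_pages_per_batch(batch, scale):
    """Upper bound on pages moved per refill or drain: batch << scale."""
    return batch << check_scale(scale)


def set_scale(scale, knob=KNOB):
    """Write the new value to the sysctl file (needs root on a real system)."""
    knob.write_text(f"{check_scale(scale)}\n")


if __name__ == "__main__":
    example_batch = 63  # hypothetical per-zone batch size, for illustration only
    for scale in range(SCALE_MIN, SCALE_MAX + 1):
        pages = max_pages_per_batch(example_batch, scale)
        print(f"N={scale}: at most {pages} pages per refill/drain")
```

On a patched kernel, set_scale(0) would have the same effect as running "sysctl -w vm.pcp_batch_scale_max=0" as root.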
Future work
===========
To ultimately mitigate the zone->lock contention issue, several approaches
have been proposed. One is to divide large zones into multiple smaller
zones, as suggested by Matthew[2]. Another is to split the zone->lock using
a mechanism similar to memory arenas, and to stop relying solely on zone_id
to identify the range of free lists a particular page belongs to, as
suggested by Mel[3]. However, implementing these solutions is likely to
require a longer development effort.
Link: https://lwn.net/Articles/981069/ [0]
Link: https://github.com/iovisor/bcc/blob/master/tools/funclatency.py [1]
Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@casper.infradead.org/ [2]
Link: https://lore.kernel.org/linux-mm/20240705130943.htsyhhhzbcptnkcu@techsingularity.net/ [3]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Rientjes <rientjes@google.com>
---
Documentation/admin-guide/sysctl/vm.rst | 17 +++++++++++++++++
mm/Kconfig | 11 -----------
mm/page_alloc.c | 23 +++++++++++++++++------
3 files changed, 34 insertions(+), 17 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index e86c968a7a0e..aa29f2fdad7c 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -65,6 +65,7 @@ Currently, these files are in /proc/sys/vm:
- page-cluster
- page_lock_unfairness
- panic_on_oom
+- pcp_batch_scale_max
- percpu_pagelist_high_fraction
- stat_interval
- stat_refresh
@@ -845,6 +846,22 @@ panic_on_oom=2+kdump gives you very strong tool to investigate
why oom happens. You can get snapshot.
+pcp_batch_scale_max
+===================
+
+In page allocator, PCP (Per-CPU pageset) is refilled and drained in
+batches. The batch number is scaled automatically to improve page
+allocation/free throughput. But too large scale factor may hurt
+latency. This option sets the upper limit of scale factor to limit
+the maximum latency.
+
+The range for this parameter spans from 0 to 6, with a default value of 5.
+The value assigned to 'N' signifies that during each refilling or draining
+process, a maximum of (batch << N) pages will be involved, where "batch"
+represents the default batch size automatically computed by the kernel for
+each zone.
+
+
percpu_pagelist_high_fraction
=============================
diff --git a/mm/Kconfig b/mm/Kconfig
index b4cb45255a54..41fe4c13b7ac 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -663,17 +663,6 @@ config HUGETLB_PAGE_SIZE_VARIABLE
config CONTIG_ALLOC
def_bool (MEMORY_ISOLATION && COMPACTION) || CMA
-config PCP_BATCH_SCALE_MAX
- int "Maximum scale factor of PCP (Per-CPU pageset) batch allocate/free"
- default 5
- range 0 6
- help
- In page allocator, PCP (Per-CPU pageset) is refilled and drained in
- batches. The batch number is scaled automatically to improve page
- allocation/free throughput. But too large scale factor may hurt
- latency. This option sets the upper limit of scale factor to limit
- the maximum latency.
-
config PHYS_ADDR_T_64BIT
def_bool 64BIT
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bfd44b65777c..8d6f9dc99387 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -273,6 +273,8 @@ int min_free_kbytes = 1024;
int user_min_free_kbytes = -1;
static int watermark_boost_factor __read_mostly = 15000;
static int watermark_scale_factor = 10;
+static int pcp_batch_scale_max = 5;
+static int sysctl_6 = 6;
/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
int movable_zone;
@@ -2334,7 +2336,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
int count = READ_ONCE(pcp->count);
while (count) {
- int to_drain = min(count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
+ int to_drain = min(count, pcp->batch << pcp_batch_scale_max);
count -= to_drain;
spin_lock(&pcp->lock);
@@ -2462,7 +2464,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free
/* Free as much as possible if batch freeing high-order pages. */
if (unlikely(free_high))
- return min(pcp->count, batch << CONFIG_PCP_BATCH_SCALE_MAX);
+ return min(pcp->count, batch << pcp_batch_scale_max);
/* Check for PCP disabled or boot pageset */
if (unlikely(high < batch))
@@ -2494,7 +2496,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
return 0;
if (unlikely(free_high)) {
- pcp->high = max(high - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
+ pcp->high = max(high - (batch << pcp_batch_scale_max),
high_min);
return 0;
}
@@ -2564,9 +2566,9 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
}
- if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
+ if (pcp->free_count < (batch << pcp_batch_scale_max))
pcp->free_count = min(pcp->free_count + (1 << order),
- batch << CONFIG_PCP_BATCH_SCALE_MAX);
+ batch << pcp_batch_scale_max);
high = nr_pcp_high(pcp, zone, batch, free_high);
if (pcp->count >= high) {
free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
@@ -2908,7 +2910,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
* subsequent allocation of order-0 pages without any freeing.
*/
if (batch <= max_nr_alloc &&
- pcp->alloc_factor < CONFIG_PCP_BATCH_SCALE_MAX)
+ pcp->alloc_factor < pcp_batch_scale_max)
pcp->alloc_factor++;
batch = min(batch, max_nr_alloc);
}
@@ -6275,6 +6277,15 @@ static struct ctl_table page_alloc_sysctl_table[] = {
.proc_handler = percpu_pagelist_high_fraction_sysctl_handler,
.extra1 = SYSCTL_ZERO,
},
+ {
+ .procname = "pcp_batch_scale_max",
+ .data = &pcp_batch_scale_max,
+ .maxlen = sizeof(pcp_batch_scale_max),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = &sysctl_6,
+ },
{
.procname = "lowmem_reserve_ratio",
.data = &sysctl_lowmem_reserve_ratio,
--
2.43.5